### Predicting stock price movements: Part 2 of 2


This program predicts future price movements of stocks based on past returns, average trading volume, their lagged values, and calendar (day of week and month) data. A preceding program creates the dataset to be used in this step.

The general approach used in prediction is based on a paper by Jim Kyung-Soo Liew and Boris Mayster: "Forecasting ETFs with Machine Learning Algorithms" (2017).

For each ticker and each of three models (neural networks (NN), support vector machines (SVM), and random forests (RF)), the program first identifies the best model hyperparameters on a training set using grid search with cross-validation, then uses these parameters to make predictions on a test set.

July 16, 2017

Murat Aydogdu

In [1]:
# __future__ imports must precede every other statement in the cell
from __future__ import print_function
from IPython.display import display
import pandas as pd
import numpy as np
import datetime as dt
import time
pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)
pd.options.display.float_format = '{:20,.2f}'.format

In [11]:
# Import models to be used here as well as other scikit components
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler

In [3]:
# The models used and parameter spaces are defined in the same way as
# in Liew and Mayster (2017).
models = { 
    'RandomForestClassifier': RandomForestClassifier(),
    'SVC': SVC(),
    'MLPClassifier': MLPClassifier(max_iter=10000)
}

parameters = {     
    'RandomForestClassifier': {"n_estimators": [100, 200, 300], "criterion": ["gini", "entropy"]},
    'SVC': [{'kernel': ['rbf', 'linear'], 
             'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]}],
    'MLPClassifier': [{'activation' : ['logistic', 'tanh', 'relu'],
                     'solver' : ['sgd'],
                     'alpha' : [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000],
                     'hidden_layer_sizes' : [(100,100), (100, 100, 100)]}]
}

In [4]:
# The input file is already created
# and is saved in a CSV file.
# It is in "wide" format and has
# for each date
# Y (price movement over the next 20 days) 
# R (return over the past 20 days as well as its 10 lags)
# AV (average volume over the past 20 days as well as its 10 lags)

# Read the previously-saved CSV if needed
# N-day return and j lags of this variable
# N-day average volume and j lags of this variable
# Y = 1 if N-day return (from today to N-days ahead) >= 0 
df = pd.read_csv('Predict_01.CSV')
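
The dataset itself is built in Part 1, which is not reproduced here. As a rough sketch only, the variables described in the comments above could be constructed for a single ticker along the following lines. The function name `build_features` and its arguments are hypothetical, and the actual Part 1 code may differ (for example, in how returns are computed).

```python
import numpy as np
import pandas as pd

def build_features(prices, volumes, n=20, lags=10):
    """Illustrative construction of Y, R, AV and their lags for one ticker."""
    out = pd.DataFrame(index=prices.index)
    # R: return over the past n days
    out["R"] = prices / prices.shift(n) - 1
    # AV: average volume over the past n days
    out["AV"] = volumes.rolling(n).mean()
    # Lags 1..lags of R and AV (R01, ..., AV01, ...), each shifted by one more day
    for j in range(1, lags + 1):
        out["R%02d" % j] = out["R"].shift(j)
        out["AV%02d" % j] = out["AV"].shift(j)
    # Y: 1 if the return from today to n days ahead is >= 0
    fwd = prices.shift(-n) / prices - 1
    out["Y"] = (fwd >= 0).astype(float)
    # Keep only rows where the forward return and all lags are defined
    return out[fwd.notna()].dropna()

# Synthetic, steadily rising prices (not real market data)
prices = pd.Series(np.arange(1, 61, dtype=float))
volumes = pd.Series(np.ones(60))
feats = build_features(prices, volumes, n=5, lags=2)
```

Each lag column is simply the previous day's value of the unlagged variable, which matches the pattern visible in the data frame displayed further below.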

In [5]:
list(df)

['Unnamed: 0',
 'Date',
 'Y AAPL',
 'Y AMZN',
 'Y GOOGL',
 'Y IBM',
 'Y MSFT',
 'R AAPL',
 'R AMZN',
 'R GOOGL',
 'R IBM',
 'R MSFT',
 'R01 AAPL',
 'R01 AMZN',
 'R01 GOOGL',
 'R01 IBM',
 'R01 MSFT',
 'R02 AAPL',
 'R02 AMZN',
 'R02 GOOGL',
 'R02 IBM',
 'R02 MSFT',
 'R03 AAPL',
 'R03 AMZN',
 'R03 GOOGL',
 'R03 IBM',
 'R03 MSFT',
 'R04 AAPL',
 'R04 AMZN',
 'R04 GOOGL',
 'R04 IBM',
 'R04 MSFT',
 'R05 AAPL',
 'R05 AMZN',
 'R05 GOOGL',
 'R05 IBM',
 'R05 MSFT',
 'R06 AAPL',
 'R06 AMZN',
 'R06 GOOGL',
 'R06 IBM',
 'R06 MSFT',
 'R07 AAPL',
 'R07 AMZN',
 'R07 GOOGL',
 'R07 IBM',
 'R07 MSFT',
 'R08 AAPL',
 'R08 AMZN',
 'R08 GOOGL',
 'R08 IBM',
 'R08 MSFT',
 'R09 AAPL',
 'R09 AMZN',
 'R09 GOOGL',
 'R09 IBM',
 'R09 MSFT',
 'R10 AAPL',
 'R10 AMZN',
 'R10 GOOGL',
 'R10 IBM',
 'R10 MSFT',
 'AV AAPL',
 'AV AMZN',
 'AV GOOGL',
 'AV IBM',
 'AV MSFT',
 'AV01 AAPL',
 'AV01 AMZN',
 'AV01 GOOGL',
 'AV01 IBM',
 'AV01 MSFT',
 'AV02 AAPL',
 'AV02 AMZN',
 'AV02 GOOGL',
 'AV02 IBM',
 'AV02 MSFT',
 'AV03 AAPL',
 '

In [6]:
# Generate month and weekday dummies, then append those columns 
# to the main data frame

Wkday = pd.get_dummies(pd.to_datetime(df['Date']).dt.dayofweek,prefix='D')
Mon = pd.get_dummies(pd.to_datetime(df['Date']).dt.month,prefix='M')
Mon1 = pd.concat([Wkday, Mon], axis=1)

df = pd.concat([df, Mon1], axis=1)

# Rename columns so that Y-column names only have the tickers ('Y AAPL' -> 'AAPL')
df.columns=df.columns.str.replace('Y ','')

In [7]:
df

Unnamed: 0.1,Unnamed: 0,Date,AAPL,AMZN,GOOGL,IBM,MSFT,R AAPL,R AMZN,R GOOGL,R IBM,R MSFT,R01 AAPL,R01 AMZN,R01 GOOGL,R01 IBM,R01 MSFT,R02 AAPL,R02 AMZN,R02 GOOGL,R02 IBM,R02 MSFT,R03 AAPL,R03 AMZN,R03 GOOGL,R03 IBM,R03 MSFT,R04 AAPL,R04 AMZN,R04 GOOGL,R04 IBM,R04 MSFT,R05 AAPL,R05 AMZN,R05 GOOGL,R05 IBM,R05 MSFT,R06 AAPL,R06 AMZN,R06 GOOGL,R06 IBM,R06 MSFT,R07 AAPL,R07 AMZN,R07 GOOGL,R07 IBM,R07 MSFT,R08 AAPL,R08 AMZN,R08 GOOGL,R08 IBM,R08 MSFT,R09 AAPL,R09 AMZN,R09 GOOGL,R09 IBM,R09 MSFT,R10 AAPL,R10 AMZN,R10 GOOGL,R10 IBM,R10 MSFT,AV AAPL,AV AMZN,AV GOOGL,AV IBM,AV MSFT,AV01 AAPL,AV01 AMZN,AV01 GOOGL,AV01 IBM,AV01 MSFT,AV02 AAPL,AV02 AMZN,AV02 GOOGL,AV02 IBM,AV02 MSFT,AV03 AAPL,AV03 AMZN,AV03 GOOGL,AV03 IBM,AV03 MSFT,AV04 AAPL,AV04 AMZN,AV04 GOOGL,AV04 IBM,AV04 MSFT,AV05 AAPL,AV05 AMZN,AV05 GOOGL,AV05 IBM,AV05 MSFT,AV06 AAPL,AV06 AMZN,AV06 GOOGL,AV06 IBM,AV06 MSFT,AV07 AAPL,AV07 AMZN,AV07 GOOGL,AV07 IBM,AV07 MSFT,AV08 AAPL,AV08 AMZN,AV08 GOOGL,AV08 IBM,AV08 MSFT,AV09 AAPL,AV09 AMZN,AV09 GOOGL,AV09 IBM,AV09 MSFT,AV10 AAPL,AV10 AMZN,AV10 GOOGL,AV10 IBM,AV10 MSFT,D_0,D_1,D_2,D_3,D_4,M_1,M_2,M_3,M_4,M_5,M_6,M_7,M_8,M_9,M_10,M_11,M_12
0,0,2004-09-17,1.00,0.00,1.00,0.00,1.00,0.21,0.11,0.17,0.01,0.01,0.15,0.08,0.07,0.01,-0.01,0.14,0.10,0.06,0.03,0.01,0.15,0.12,0.05,0.03,0.01,0.15,0.11,0.05,0.03,0.01,0.18,0.06,0.08,0.06,0.02,0.15,0.04,0.08,0.03,-0.00,0.15,0.02,0.08,0.01,-0.02,0.18,0.09,0.07,0.02,0.01,0.18,0.09,0.07,0.01,-0.00,0.14,0.10,0.07,-0.01,0.00,219559758.92,295295629.57,508588884.88,263741569.62,873070421.48,214147068.13,302882886.14,3250494371.22,258640513.47,854596244.17,208767832.07,296993529.92,3174982857.13,261047814.69,876020800.35,210204598.47,294829499.08,3221311366.90,257153246.13,879542110.67,213725751.70,276458134.97,3208938125.76,258064698.59,877284133.29,213791432.11,269322060.58,3305496494.59,260827920.41,872290144.11,209843079.33,276929220.24,3476070785.80,270901977.21,868888248.06,204623590.50,280958269.06,3480190122.43,272912432.90,865451049.19,203476436.62,287074085.06,3478905651.54,268223778.32,870221627.50,201875425.28,292031919.35,3514978060.75,266772942.72,876251112.86,205360390.52,302333536.62,3528430379.53,272257300.66,903231652.31,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0
1,1,2004-09-20,1.00,0.00,1.00,1.00,1.00,0.22,0.10,0.10,0.01,0.01,0.21,0.11,0.17,0.01,0.01,0.15,0.08,0.07,0.01,-0.01,0.14,0.10,0.06,0.03,0.01,0.15,0.12,0.05,0.03,0.01,0.15,0.11,0.05,0.03,0.01,0.18,0.06,0.08,0.06,0.02,0.15,0.04,0.08,0.03,-0.00,0.15,0.02,0.08,0.01,-0.02,0.18,0.09,0.07,0.02,0.01,0.18,0.09,0.07,0.01,-0.00,219142114.70,301833067.25,478445205.49,263419386.51,878373383.16,219559758.92,295295629.57,508588884.88,263741569.62,873070421.48,214147068.13,302882886.14,3250494371.22,258640513.47,854596244.17,208767832.07,296993529.92,3174982857.13,261047814.69,876020800.35,210204598.47,294829499.08,3221311366.90,257153246.13,879542110.67,213725751.70,276458134.97,3208938125.76,258064698.59,877284133.29,213791432.11,269322060.58,3305496494.59,260827920.41,872290144.11,209843079.33,276929220.24,3476070785.80,270901977.21,868888248.06,204623590.50,280958269.06,3480190122.43,272912432.90,865451049.19,203476436.62,287074085.06,3478905651.54,268223778.32,870221627.50,201875425.28,292031919.35,3514978060.75,266772942.72,876251112.86,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
2,2,2004-09-21,1.00,0.00,1.00,1.00,1.00,0.22,0.10,0.08,0.01,0.00,0.22,0.10,0.10,0.01,0.01,0.21,0.11,0.17,0.01,0.01,0.15,0.08,0.07,0.01,-0.01,0.14,0.10,0.06,0.03,0.01,0.15,0.12,0.05,0.03,0.01,0.15,0.11,0.05,0.03,0.01,0.18,0.06,0.08,0.06,0.02,0.15,0.04,0.08,0.03,-0.00,0.15,0.02,0.08,0.01,-0.02,0.18,0.09,0.07,0.02,0.01,224611086.39,305412423.55,449781859.73,262897371.99,910145968.56,219142114.70,301833067.25,478445205.49,263419386.51,878373383.16,219559758.92,295295629.57,508588884.88,263741569.62,873070421.48,214147068.13,302882886.14,3250494371.22,258640513.47,854596244.17,208767832.07,296993529.92,3174982857.13,261047814.69,876020800.35,210204598.47,294829499.08,3221311366.90,257153246.13,879542110.67,213725751.70,276458134.97,3208938125.76,258064698.59,877284133.29,213791432.11,269322060.58,3305496494.59,260827920.41,872290144.11,209843079.33,276929220.24,3476070785.80,270901977.21,868888248.06,204623590.50,280958269.06,3480190122.43,272912432.90,865451049.19,203476436.62,287074085.06,3478905651.54,268223778.32,870221627.50,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
3,3,2004-09-22,1.00,0.00,1.00,1.00,1.00,0.16,0.06,0.13,-0.00,-0.00,0.22,0.10,0.08,0.01,0.00,0.22,0.10,0.10,0.01,0.01,0.21,0.11,0.17,0.01,0.01,0.15,0.08,0.07,0.01,-0.01,0.14,0.10,0.06,0.03,0.01,0.15,0.12,0.05,0.03,0.01,0.15,0.11,0.05,0.03,0.01,0.18,0.06,0.08,0.06,0.02,0.15,0.04,0.08,0.03,-0.00,0.15,0.02,0.08,0.01,-0.02,226930891.07,315771102.75,432226256.27,270429744.44,935367207.57,224611086.39,305412423.55,449781859.73,262897371.99,910145968.56,219142114.70,301833067.25,478445205.49,263419386.51,878373383.16,219559758.92,295295629.57,508588884.88,263741569.62,873070421.48,214147068.13,302882886.14,3250494371.22,258640513.47,854596244.17,208767832.07,296993529.92,3174982857.13,261047814.69,876020800.35,210204598.47,294829499.08,3221311366.90,257153246.13,879542110.67,213725751.70,276458134.97,3208938125.76,258064698.59,877284133.29,213791432.11,269322060.58,3305496494.59,260827920.41,872290144.11,209843079.33,276929220.24,3476070785.80,270901977.21,868888248.06,204623590.50,280958269.06,3480190122.43,272912432.90,865451049.19,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0
4,4,2004-09-23,1.00,0.00,1.00,1.00,1.00,0.13,0.04,0.14,-0.01,-0.01,0.16,0.06,0.13,-0.00,-0.00,0.22,0.10,0.08,0.01,0.00,0.22,0.10,0.10,0.01,0.01,0.21,0.11,0.17,0.01,0.01,0.15,0.08,0.07,0.01,-0.01,0.14,0.10,0.06,0.03,0.01,0.15,0.12,0.05,0.03,0.01,0.15,0.11,0.05,0.03,0.01,0.18,0.06,0.08,0.06,0.02,0.15,0.04,0.08,0.03,-0.00,225399146.59,316652788.35,433659678.40,271508563.29,933743930.15,226930891.07,315771102.75,432226256.27,270429744.44,935367207.57,224611086.39,305412423.55,449781859.73,262897371.99,910145968.56,219142114.70,301833067.25,478445205.49,263419386.51,878373383.16,219559758.92,295295629.57,508588884.88,263741569.62,873070421.48,214147068.13,302882886.14,3250494371.22,258640513.47,854596244.17,208767832.07,296993529.92,3174982857.13,261047814.69,876020800.35,210204598.47,294829499.08,3221311366.90,257153246.13,879542110.67,213725751.70,276458134.97,3208938125.76,258064698.59,877284133.29,213791432.11,269322060.58,3305496494.59,260827920.41,872290144.11,209843079.33,276929220.24,3476070785.80,270901977.21,868888248.06,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0
5,5,2004-09-24,1.00,0.00,1.00,1.00,1.00,0.08,0.02,0.11,-0.00,-0.01,0.13,0.04,0.14,-0.01,-0.01,0.16,0.06,0.13,-0.00,-0.00,0.22,0.10,0.08,0.01,0.00,0.22,0.10,0.10,0.01,0.01,0.21,0.11,0.17,0.01,0.01,0.15,0.08,0.07,0.01,-0.01,0.14,0.10,0.06,0.03,0.01,0.15,0.12,0.05,0.03,0.01,0.15,0.11,0.05,0.03,0.01,0.18,0.06,0.08,0.06,0.02,209793465.19,315346225.14,441859306.99,277231309.77,946899492.84,225399146.59,316652788.35,433659678.40,271508563.29,933743930.15,226930891.07,315771102.75,432226256.27,270429744.44,935367207.57,224611086.39,305412423.55,449781859.73,262897371.99,910145968.56,219142114.70,301833067.25,478445205.49,263419386.51,878373383.16,219559758.92,295295629.57,508588884.88,263741569.62,873070421.48,214147068.13,302882886.14,3250494371.22,258640513.47,854596244.17,208767832.07,296993529.92,3174982857.13,261047814.69,876020800.35,210204598.47,294829499.08,3221311366.90,257153246.13,879542110.67,213725751.70,276458134.97,3208938125.76,258064698.59,877284133.29,213791432.11,269322060.58,3305496494.59,260827920.41,872290144.11,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0
6,6,2004-09-27,1.00,0.00,1.00,1.00,1.00,0.09,0.00,0.11,-0.01,-0.01,0.08,0.02,0.11,-0.00,-0.01,0.13,0.04,0.14,-0.01,-0.01,0.16,0.06,0.13,-0.00,-0.00,0.22,0.10,0.08,0.01,0.00,0.22,0.10,0.10,0.01,0.01,0.21,0.11,0.17,0.01,0.01,0.15,0.08,0.07,0.01,-0.01,0.14,0.10,0.06,0.03,0.01,0.15,0.12,0.05,0.03,0.01,0.15,0.11,0.05,0.03,0.01,211053920.92,323652035.81,446270344.13,284324595.16,960039513.83,209793465.19,315346225.14,441859306.99,277231309.77,946899492.84,225399146.59,316652788.35,433659678.40,271508563.29,933743930.15,226930891.07,315771102.75,432226256.27,270429744.44,935367207.57,224611086.39,305412423.55,449781859.73,262897371.99,910145968.56,219142114.70,301833067.25,478445205.49,263419386.51,878373383.16,219559758.92,295295629.57,508588884.88,263741569.62,873070421.48,214147068.13,302882886.14,3250494371.22,258640513.47,854596244.17,208767832.07,296993529.92,3174982857.13,261047814.69,876020800.35,210204598.47,294829499.08,3221311366.90,257153246.13,879542110.67,213725751.70,276458134.97,3208938125.76,258064698.59,877284133.29,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
7,7,2004-09-28,1.00,0.00,1.00,1.00,1.00,0.11,0.03,0.24,0.00,-0.00,0.09,0.00,0.11,-0.01,-0.01,0.08,0.02,0.11,-0.00,-0.01,0.13,0.04,0.14,-0.01,-0.01,0.16,0.06,0.13,-0.00,-0.00,0.22,0.10,0.08,0.01,0.00,0.22,0.10,0.10,0.01,0.01,0.21,0.11,0.17,0.01,0.01,0.15,0.08,0.07,0.01,-0.01,0.14,0.10,0.06,0.03,0.01,0.15,0.12,0.05,0.03,0.01,215886136.18,332347224.53,486748259.86,289538638.45,983495071.98,211053920.92,323652035.81,446270344.13,284324595.16,960039513.83,209793465.19,315346225.14,441859306.99,277231309.77,946899492.84,225399146.59,316652788.35,433659678.40,271508563.29,933743930.15,226930891.07,315771102.75,432226256.27,270429744.44,935367207.57,224611086.39,305412423.55,449781859.73,262897371.99,910145968.56,219142114.70,301833067.25,478445205.49,263419386.51,878373383.16,219559758.92,295295629.57,508588884.88,263741569.62,873070421.48,214147068.13,302882886.14,3250494371.22,258640513.47,854596244.17,208767832.07,296993529.92,3174982857.13,261047814.69,876020800.35,210204598.47,294829499.08,3221311366.90,257153246.13,879542110.67,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
8,8,2004-09-29,1.00,0.00,1.00,1.00,1.00,0.12,0.07,0.28,0.00,0.01,0.11,0.03,0.24,0.00,-0.00,0.09,0.00,0.11,-0.01,-0.01,0.08,0.02,0.11,-0.00,-0.01,0.13,0.04,0.14,-0.01,-0.01,0.16,0.06,0.13,-0.00,-0.00,0.22,0.10,0.08,0.01,0.00,0.22,0.10,0.10,0.01,0.01,0.21,0.11,0.17,0.01,0.01,0.15,0.08,0.07,0.01,-0.01,0.14,0.10,0.06,0.03,0.01,213944096.10,340886332.04,574252127.29,292218090.39,992813117.89,215886136.18,332347224.53,486748259.86,289538638.45,983495071.98,211053920.92,323652035.81,446270344.13,284324595.16,960039513.83,209793465.19,315346225.14,441859306.99,277231309.77,946899492.84,225399146.59,316652788.35,433659678.40,271508563.29,933743930.15,226930891.07,315771102.75,432226256.27,270429744.44,935367207.57,224611086.39,305412423.55,449781859.73,262897371.99,910145968.56,219142114.70,301833067.25,478445205.49,263419386.51,878373383.16,219559758.92,295295629.57,508588884.88,263741569.62,873070421.48,214147068.13,302882886.14,3250494371.22,258640513.47,854596244.17,208767832.07,296993529.92,3174982857.13,261047814.69,876020800.35,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0
9,9,2004-09-30,1.00,0.00,1.00,1.00,1.00,0.08,0.07,0.29,0.02,0.01,0.12,0.07,0.28,0.00,0.01,0.11,0.03,0.24,0.00,-0.00,0.09,0.00,0.11,-0.01,-0.01,0.08,0.02,0.11,-0.00,-0.01,0.13,0.04,0.14,-0.01,-0.01,0.16,0.06,0.13,-0.00,-0.00,0.22,0.10,0.08,0.01,0.00,0.22,0.10,0.10,0.01,0.01,0.21,0.11,0.17,0.01,0.01,0.15,0.08,0.07,0.01,-0.01,212311312.92,350074923.88,595947130.78,293984873.82,1014559939.00,213944096.10,340886332.04,574252127.29,292218090.39,992813117.89,215886136.18,332347224.53,486748259.86,289538638.45,983495071.98,211053920.92,323652035.81,446270344.13,284324595.16,960039513.83,209793465.19,315346225.14,441859306.99,277231309.77,946899492.84,225399146.59,316652788.35,433659678.40,271508563.29,933743930.15,226930891.07,315771102.75,432226256.27,270429744.44,935367207.57,224611086.39,305412423.55,449781859.73,262897371.99,910145968.56,219142114.70,301833067.25,478445205.49,263419386.51,878373383.16,219559758.92,295295629.57,508588884.88,263741569.62,873070421.48,214147068.13,302882886.14,3250494371.22,258640513.47,854596244.17,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0


In [8]:
# Returns and average volumes need to be scaled to zero mean and unit standard deviation.
# Returns are the first 55 columns (11 features x 5 tickers); volumes are the next 55 columns.
# The remaining columns are the weekday and month dummies, which are left unscaled.
# This function accepts X and y for a ticker,
#  splits the data into training and test sets,
#  transforms returns and volumes for both training and testing data sets
#  based on the training set fit,
#  and returns the resulting training / testing X and y datasets for that ticker

def preprocess(X, y):
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    
    # Normalize return and volume variables
    std_scaler = StandardScaler()
    X1 = X_train[X_train.columns[0:110]] 
    X2 = X_train[X_train.columns[110:128]] 
    std_scaler.fit(X1)
    X1 = std_scaler.transform(X1)
    X_train = np.concatenate((X1, X2), axis=1)
    
    X1 = X_test[X_test.columns[0:110]] 
    X2 = X_test[X_test.columns[110:128]] 
    X1 = std_scaler.transform(X1) # Use the same transformation on the test set
    X_test = np.concatenate((X1, X2), axis=1)
    
    return (X_train, y_train, X_test, y_test)
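
To illustrate the idea behind `preprocess`, here is a minimal sketch on synthetic data (the array names and sizes are made up for the example). The scaler is fit on the training rows only and then reused on the test rows, while the dummy columns pass through untouched.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_cont = rng.normal(loc=5.0, scale=3.0, size=(200, 4))  # stand-in for returns / volumes
X_dummy = rng.randint(0, 2, size=(200, 2))              # stand-in for calendar dummies
y = rng.randint(0, 2, size=200)

# Split all arrays with the same random partition
Xc_train, Xc_test, Xd_train, Xd_test, y_train, y_test = train_test_split(
    X_cont, X_dummy, y, test_size=0.3, random_state=0)

# Fit on the training set only, then apply the same transformation to both sets
scaler = StandardScaler().fit(Xc_train)
X_train = np.hstack([scaler.transform(Xc_train), Xd_train])
X_test = np.hstack([scaler.transform(Xc_test), Xd_test])

train_means = X_train[:, :4].mean(axis=0)
train_stds = X_train[:, :4].std(axis=0)
```

The scaled training columns have mean 0 and standard deviation 1 by construction; the test columns are only approximately standardized because they reuse the training-set statistics.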

In [19]:
def runGS(model, params, X_train, X_test, y_train, y_test):
    print (dt.datetime.now())
    print("Running GridSearchCV for", model)
    
    gs = GridSearchCV(model, params, cv=3, n_jobs=1)
    gs.fit(X_train,y_train)

    print("Best parameters set found using training set:")
    print()
    print(gs.best_params_)
    print()
    #print("Grid scores on training set:")
    #print()
    #means = gs.cv_results_['mean_test_score']
    #stds = gs.cv_results_['std_test_score']
    #for mean, std, params in zip(means, stds, gs.cv_results_['params']):
    #    print("%0.3f (+/-%0.03f) for %r"
    #          % (mean, std * 2, params))
    #print()
    
    y_true, y_pred = y_test, gs.predict(X_test)
    print("Accuracy score:", accuracy_score(y_true, y_pred))
    print()
    print("Classification report:")
    print(classification_report(y_true, y_pred))
    print()
    print("--- %s seconds ---" % (time.time() - start_time))

### Data subsets

Predictive models are run four times for each ticker:

**Dataset A:** has a stock's return as well as 10 lags of that return plus all other stocks' returns (11 features per stock)

**Dataset B:** has a stock's 20-day average volume as well as 10 lags of that volume plus all other stocks' average volumes (11 features per stock)

**Dataset C:** has trading day and month features: one dummy for each weekday and one dummy for each month. This dataset doesn't have any stock-specific information.

**Dataset D:** combines all of the datasets above.
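
The code below selects these subsets by column position. As an alternative sketch, the same grouping can be derived from the column-name prefixes (`R`, `AV`, `D_`, `M_`) seen in the data frame above; the helper name `split_feature_groups` is hypothetical and is not used elsewhere in this notebook.

```python
def split_feature_groups(columns):
    """Group feature column names into datasets A (returns), B (volumes), C (calendar)."""
    groups = {"A": [], "B": [], "C": []}
    for col in columns:
        if col.startswith("AV"):
            groups["B"].append(col)
        elif col.startswith("R"):
            groups["A"].append(col)
        elif col.startswith(("D_", "M_")):
            groups["C"].append(col)
    # Dataset D is the union of the three groups
    groups["D"] = groups["A"] + groups["B"] + groups["C"]
    return groups

# Rebuild the feature names in the same layout as the data frame
tickers = ["AAPL", "AMZN", "GOOGL", "IBM", "MSFT"]
stems = [""] + ["%02d" % j for j in range(1, 11)]
columns = (["R%s %s" % (s, t) for s in stems for t in tickers]
           + ["AV%s %s" % (s, t) for s in stems for t in tickers]
           + ["D_%d" % d for d in range(5)]
           + ["M_%d" % m for m in range(1, 13)])

groups = split_feature_groups(columns)
sizes = {name: len(cols) for name, cols in groups.items()}
```

With this layout the groups contain 55, 55, 17, and 127 columns respectively.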

In [20]:
tickers = ["AAPL", "GOOGL","AMZN","MSFT","IBM"]
for ticker in tickers:
    df['Y'] = df[ticker]
    d1 = df.drop(['Unnamed: 0','Date','AAPL','AMZN',\
                  'GOOGL','IBM','MSFT'], axis=1)
    #d1 = d1.sample(frac=0.50) # Possibly work with a smaller data set
 
    X = d1.drop('Y', axis=1) # All columns but Y are predictors
    y = d1['Y']

    X_train, y_train, X_test, y_test = preprocess(X, y)
    
    for key, value in models.items():  # .items() works in both Python 2 and 3
        model = models[key]
        params = parameters[key]

        # For each ticker, run all the models for four datasets
        # by selecting some (returns, volumes, date-related)
        # or all features
        # Note : Y is not impacted

        # Dataset A: returns only
        Xtrain = X_train[:,0:55] 
        Xtest = X_test[:,0:55]
        start_time = time.time()
        print (ticker, 'Dataset A')

        runGS(model, params, Xtrain, Xtest, y_train, y_test)

        # Dataset B: volumes only
        Xtrain = X_train[:,55:110] 
        Xtest = X_test[:,55:110]
        start_time = time.time()
        print (ticker, 'Dataset B')

        runGS(model, params, Xtrain, Xtest, y_train, y_test)

        # Dataset C: date-related only
        Xtrain = X_train[:,110:127] 
        Xtest = X_test[:,110:127]
        start_time = time.time()
        print (ticker, 'Dataset C')

        runGS(model, params, Xtrain, Xtest, y_train, y_test)

        # Dataset D: All = A + B + C
        Xtrain = X_train[:,0:128] 
        Xtest = X_test[:,0:128]
        start_time = time.time()
        print (ticker, 'Dataset ABC')

        runGS(model, params, Xtrain, Xtest, y_train, y_test)


AAPL Dataset A
2017-07-15 21:06:45.650967
Running GridSearchCV for MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=1000, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)
Best parameters set found using training set:

{'alpha': 0.01, 'activation': 'tanh', 'solver': 'sgd', 'hidden_layer_sizes': (100, 100, 100)}

Accuracy score: 0.807291666667

Classification report:
             precision    recall  f1-score   support

        0.0       0.78      0.68      0.72       357
        1.0       0.82      0.88      0.85       603

avg / total       0.81      0.81      0.80       960


--- 1291.09256816 seconds ---
AAPL Dataset B
2017-07-15 21:28:16.743773
Running GridS

Best parameters set found using training set:

{'n_estimators': 300, 'criterion': 'entropy'}

Accuracy score: 0.6125

Classification report:
             precision    recall  f1-score   support

        0.0       0.36      0.06      0.10       357
        1.0       0.63      0.94      0.75       603

avg / total       0.53      0.61      0.51       960


--- 16.8210771084 seconds ---
AAPL Dataset ABC
2017-07-16 02:20:54.198615
Running GridSearchCV for RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
Best parameters set found using training set:

{'n_estimators': 100, 'criterion': 'entropy'}

Accuracy score: 0.921875

Classification report:
             preci

Accuracy score: 0.836458333333

Classification report:
             precision    recall  f1-score   support

        0.0       0.86      0.70      0.77       379
        1.0       0.83      0.92      0.87       581

avg / total       0.84      0.84      0.83       960


--- 48.591588974 seconds ---
GOOGL Dataset B
2017-07-16 06:32:07.595184
Running GridSearchCV for RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
Best parameters set found using training set:

{'n_estimators': 200, 'criterion': 'gini'}

Accuracy score: 0.907291666667

Classification report:
             precision    recall  f1-score   support

        0.0       0.91      0.85      0.88       

Best parameters set found using training set:

{'kernel': 'linear', 'C': 0.1}

Accuracy score: 0.613541666667

Classification report:
             precision    recall  f1-score   support

        0.0       0.50      0.30      0.38       371
        1.0       0.65      0.81      0.72       589

avg / total       0.59      0.61      0.59       960


--- 6.17796397209 seconds ---
AMZN Dataset ABC
2017-07-16 09:23:20.225689
Running GridSearchCV for SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
Best parameters set found using training set:

{'kernel': 'rbf', 'C': 1000}

Accuracy score: 0.89375

Classification report:
             precision    recall  f1-score   support

        0.0       0.87      0.86      0.86       371
        1.0       0.91      0.92      0.91       589

avg / total       0.89      0.89      0.8

Best parameters set found using training set:

{'kernel': 'rbf', 'C': 10}

Accuracy score: 0.765625

Classification report:
             precision    recall  f1-score   support

        0.0       0.76      0.68      0.72       419
        1.0       0.77      0.84      0.80       541

avg / total       0.77      0.77      0.76       960


--- 2863.2706039 seconds ---
MSFT Dataset B
2017-07-16 12:45:27.750179
Running GridSearchCV for SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
Best parameters set found using training set:

{'kernel': 'rbf', 'C': 1000}

Accuracy score: 0.880208333333

Classification report:
             precision    recall  f1-score   support

        0.0       0.87      0.85      0.86       419
        1.0       0.89      0.90      0.89       541

avg / total       0.88      0.88      0.88     

Best parameters set found using training set:

{'alpha': 0.01, 'activation': 'tanh', 'solver': 'sgd', 'hidden_layer_sizes': (100, 100)}

Accuracy score: 0.888541666667

Classification report:
             precision    recall  f1-score   support

        0.0       0.87      0.88      0.88       428
        1.0       0.90      0.89      0.90       532

avg / total       0.89      0.89      0.89       960


--- 1506.4010098 seconds ---
IBM Dataset A
2017-07-16 15:23:54.670310
Running GridSearchCV for SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
Best parameters set found using training set:

{'kernel': 'rbf', 'C': 10}

Accuracy score: 0.785416666667

Classification report:
             precision    recall  f1-score   support

        0.0       0.77      0.75      0.76       428
        1.0       0.80      0.82    

In [25]:
%%html
<!-- Left-align the table below -->
<style>
  table {margin-left: 0 !important;}
</style>

### Performance metrics

Model performance is measured using two metrics. In a binary classification such as this (i.e., will the stock price move up or not over the next 20 days?), if we define an upward move (positive return) as positive, there are four possible outcomes:


|    |      Predicted|  | |
|:----------|:-------------|:-----|:-----|
| **Actual**   |  Positive | Negative | Total|
| Positive |    TP (True Positive)  |  FN (False Negative) | P|
| Negative | FP (False Positive) |  TN (True Negative) | N|
| Total | P\* |N\*|| 


**Accuracy score:** measures the percent of predictions that are correct: (TP + TN) / (P\* + N\*)

**Classification report:** gives, for each class (Up or Not), 
* Precision: percent of predictions of a class that are actually of that class: TP / P\* and TN / N\*
* Recall: percent of actual positives / negatives that are predicted as such: TP / P and TN / N
* F1-score: 2 \* (Precision \* Recall) / (Precision + Recall)
* Support: number of observations in that class
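
As a small worked example of these definitions, the sketch below computes the metrics directly from the four cells of the table. The counts are hypothetical and are not taken from the runs above; `binary_metrics` is an illustrative helper, not a scikit-learn function.

```python
def _ratio(numerator, denominator):
    # Return 0.0 instead of failing when a class is never predicted or never occurs
    return numerator / float(denominator) if denominator else 0.0

def binary_metrics(tp, fn, fp, tn):
    """Accuracy plus per-class precision, recall, F1 and support from the four counts."""
    p, n = tp + fn, fp + tn              # actual positives (P) and negatives (N)
    p_star, n_star = tp + fp, fn + tn    # predicted positives (P*) and negatives (N*)
    accuracy = _ratio(tp + tn, p_star + n_star)
    report = {}
    for label, correct, predicted, actual in (("Up", tp, p_star, p),
                                              ("Not", tn, n_star, n)):
        precision = _ratio(correct, predicted)
        recall = _ratio(correct, actual)
        f1 = _ratio(2 * precision * recall, precision + recall)
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": actual}
    return accuracy, report

# Hypothetical counts
accuracy, report = binary_metrics(tp=80, fn=20, fp=10, tn=90)
```

With these counts, accuracy is 170 / 200 = 0.85, and the Up class has precision 80 / 90, recall 80 / 100, and support 100.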


### Results: background

**A caveat:** The dataset is limited in that it covers only 5 of the most active technology stocks. In this sense, the dataset is small and homogeneous; generalizing these results is probably not warranted.

* Going forward, using a larger dataset covering hundreds or thousands of stocks with varying liquidity levels could inform whether liquidity is an important aspect of predictability. 


* Such a project would be fairly parallelizable: once the main dataset is constructed, predictive models can be run simultaneously on subsets of data (where the feature set would be fixed and target variables would be based on the subset of stocks). Once all the models are run on all subsets, results could be consolidated for interpretation.

**Operationalizing the predictive models:** building a trading model where stock price movement predictions are used to buy or short sell securities is feasible. 

* The model would be built at a point in time, either by picking one of the best-performing models here (other, new models could also be used), or by using all models in an ensemble setting where the up/not predictions would be aggregated across models to obtain a single signal per stock. 

* A long/short portfolio would buy all the Up prediction stocks and short-sell the others on the first day of the next month. These positions would be held for a month (20 trading days). At the end of the month, predictive models would be run again to generate new signals. 

* In order to reduce transaction costs, the portfolio can be rebalanced so that only stocks whose signals change are bought or sold. Stocks whose signals do not change will remain long (or short). Portfolio weights would have to be carefully adjusted in this case.
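
The signal aggregation and rebalancing steps described above can be sketched as follows. This is only an illustration under simple assumptions: a plain majority vote across models (a tie counts as Not), equal treatment of all stocks, and no position sizing. The function names and the votes are hypothetical.

```python
def majority_signal(votes):
    """Combine 0/1 predictions from several models; 1 (Up) needs a strict majority."""
    return 1 if 2 * sum(votes) > len(votes) else 0

def rebalance_trades(old_signals, new_signals):
    """Return trades only for tickers whose signal changed since the last rebalance."""
    trades = {}
    for ticker, new in new_signals.items():
        if old_signals.get(ticker) == new:
            continue  # unchanged signal: keep the existing long or short position
        trades[ticker] = "long" if new == 1 else "short"
    return trades

# Hypothetical predictions from three models (e.g. RF, SVM, NN) per ticker
model_votes = {"AAPL": [1, 1, 0], "IBM": [0, 0, 1], "MSFT": [1, 1, 1]}
new_signals = {t: majority_signal(v) for t, v in model_votes.items()}

# Hypothetical signals from the previous month
old_signals = {"AAPL": 1, "IBM": 1, "MSFT": 0}
trades = rebalance_trades(old_signals, new_signals)
```

Here AAPL keeps its long position, so only IBM (now short) and MSFT (now long) generate trades.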

**Benchmark model:** To provide a benchmark for the performance of predictive models, we could construct a simple "null model" that always predicts the more common outcome. In this dataset, all stocks tend to move up over the next month (20 trading days). So the simple benchmark model would predict an upmove in prices at all times. 

* In the test datasets, the percentage of upmoves ranges from 55% (IBM) to 63% (AAPL). The training-set percentages should be similar since the training and test datasets are split randomly. These figures set the benchmark accuracy rates for the three models used.
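
One way to compute such a benchmark, sketched here rather than taken from the runs above, is scikit-learn's `DummyClassifier` with `strategy="most_frequent"`. The test labels reproduce the AAPL class counts from the classification report (603 up, 357 not); the training labels are made up and only need to have "up" as the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical training labels with "up" (1) as the more common outcome
y_train = np.array([1] * 60 + [0] * 40)
# Test labels matching the AAPL support figures: 603 up, 357 not
y_test = np.array([1] * 603 + [0] * 357)

# The null model ignores the features, so placeholder columns are enough
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

null_model = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
benchmark_accuracy = accuracy_score(y_test, null_model.predict(X_test))
```

For AAPL this gives 603 / 960, or about 63%, which is the accuracy a predictive model has to beat.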

### Results: a first look

The predictive models work well in general: performance metrics are generally much better than those of the simple benchmark model. For instance, accuracy scores are often in the 80% to 90% range. 


#### Model comparison

* Overall, all three models perform quite well. Their performances are about the same, with f1 and accuracy scores in the 80% to low-90% range. 


* Where performances differ, **random forests** have a small edge over the others. If only one model were needed, RF would be the choice. Another justification for this choice is the time needed to train the models: RF training was fairly fast on the computer used to run this program. NN training times are much longer in general, often 20-30 times those of RF. SVM training times vary from subset to subset, from very long (several hundred times RF training times) to relatively short (on par with RF training times). This drastic variation implies, at least on this machine, that SVM training times need to be seriously considered.


* An earlier version of this program was run on a subset of the dataset (about 30% of the available data). In that run, NN performance lagged that of the other two models significantly and consistently. This observation suggests that NN benefits more from additional data. 


* Better performance of RF and SVM relative to NN is also reported by Liew and Mayster (2017). In the full-dataset version reported here, the differences in model performance are not as pronounced.

#### Data subset comparison

* Among subsets of data, trading volume (Dataset B) results are the best, generally markedly better than return (Dataset A) results. f1-scores and accuracy scores are often a few to several percentage points higher with volume subsets. This observation (better performance for volume than returns) is also in line with the Liew and Mayster (2017) paper. 

* The calendar dummies subset (Dataset C) often performs poorly (with accuracy scores in the high-50% to low-60% range and f1-scores in the high-40% to high-50% range), which is perhaps not surprising. After all, this dataset has no security-specific features. 

* An earlier version of this program didn't have R (most recent return) and AV (most recent average volume): it only had the 10 lags of R and AV. f1-scores and accuracy scores rose substantially with the addition of R and AV. 

* The combined dataset (Dataset ABC) does not necessarily improve model performance. For RF and SVM, performance for the combined dataset is typically on par with or only slightly better (by about 1%) than that of the volume subset. This is surprising as it implies that volume captures most of the predictability in stock returns and that return and calendar effects do not contribute much.

* For NN, combined dataset performance is often several percent better than that of the volume subset, implying that neural nets benefit more from additional data.

### Conclusions and extensions

This analysis shows that it is possible to predict the future direction of stock prices using past prices, trading volume, and calendar-effect variables with 80% to 90% accuracy. The analysis is performed using a small subset of actively traded technology stocks. Past trading volume data prove to be especially useful in predicting future stock price movements. 

This analysis is, in some ways, a replication of the Liew and Mayster (2017) paper, which looks at the predictability of ETF returns. While the results here are somewhat weaker, at least part of this discrepancy could be that the price movements of ETFs, which are essentially portfolios, are easier to predict than those of individual stocks.

Another revision of this analysis could look at which features specifically contribute to predictability. This is relatively easy in the context of random forests, which turn out to be one of the best-performing models in this analysis.
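
As a hedged sketch of what such a feature-level analysis might look like, a fitted `RandomForestClassifier` exposes impurity-based importances through its `feature_importances_` attribute. The data below are synthetic (only the first feature drives the label) and the feature names are made up; this is not a result from the stock dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] > 0).astype(int)  # the label depends on the first feature only
names = ["informative", "noise_1", "noise_2", "noise_3"]

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rank features from most to least important; importances sum to 1
ranked = sorted(zip(names, rf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print("%-12s %.3f" % (name, importance))
```

Applied to the models above, the same ranking could indicate which return or volume lags carry most of the signal, though impurity-based importances should be read with some care when features are strongly correlated, as lagged variables are.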

Also, there are many more machine learning models that could be considered. While the three models used here are some of the stronger models, other models could still perform better. If this analysis formed a basis for a portfolio construction and implementation effort, a broader array of models should be considered as the first step. Of course, the dataset could also be expanded in that context. 