## PART 2: Modeling
In this notebook, I will walk through the modeling process using the data prepared in part 1. I hope we will see some useful results by the end of this notebook.

In [1]:
import pandas as pd
import numpy as np

## Many scikit-learn packages to import
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [18]:
## First, read the dataframes produced in the last part (part 1)
## Remember the rule of ML: analyze only the training set (do not touch the validation or test sets unless stated otherwise)
Train_df = pd.read_csv('SET_Train.csv')
Validate_df = pd.read_csv('SET_Validate.csv')
Test_a_df = pd.read_csv('SET_Test_a.csv')
Test_b_df = pd.read_csv('SET_Test_b.csv')
Train_df.head()

Unnamed: 0,Date,Close,Volume,MA_diff_3,MA_diff_5,MA_diff_10,MA_diff_14,MA_diff_20,EMA_diff,MACD,...,STD_14,STD_20,Volume_Agent,Y_1,Y_3,Y_5,Y_10,Y_14,Y_20,Y_N_1
0,2008-01-02,842.97,1686634.0,0.59,5.874,0.657,0.112857,1.1225,0.846747,-0.843428,...,19.004899,16.768455,1,-1,-1,-1,-1,-1,-1,1
1,2008-01-03,832.63,2203218.0,-6.476667,-2.13,1.501,-0.558571,-0.6085,-0.271927,-1.046243,...,18.795705,16.548025,1,-1,-1,-1,-1,-1,-1,1
2,2008-01-04,821.71,2244205.0,-12.13,-3.898,0.781,-0.884286,-1.2365,-1.336735,-2.064332,...,18.863764,16.38875,1,-1,0,-1,-1,-1,-1,1
3,2008-01-07,808.31,1749336.0,-11.553333,-8.75,0.333,-1.77,-1.1405,-2.543061,-3.9074,...,19.579607,17.186561,1,0,-1,-1,-1,-1,0,1
4,2008-01-08,811.69,1746730.0,-6.98,-9.282,1.998,-1.765,-1.0825,-1.950755,-5.037241,...,19.783006,17.660359,0,1,-1,-1,-1,-1,0,0


In [3]:
Train_df['Y_1'].value_counts()

 0    1117
 1     720
-1     604
Name: Y_1, dtype: int64

In [4]:
Train_df['Y_3'].value_counts()

 1    1047
-1     765
 0     629
Name: Y_3, dtype: int64

In [5]:
Train_df['Y_5'].value_counts()

 1    1154
-1     830
 0     457
Name: Y_5, dtype: int64

In [6]:
Train_df['Y_10'].value_counts()

 1    1293
-1     846
 0     302
Name: Y_10, dtype: int64

In [7]:
Train_df['Y_14'].value_counts()

 1    1337
-1     851
 0     253
Name: Y_14, dtype: int64

In [8]:
Train_df['Y_20'].value_counts()

 1    1402
-1     831
 0     208
Name: Y_20, dtype: int64

In [9]:
pd.crosstab(Train_df['Y_1'], Train_df['Y_10']) # 40% in the diagonal

Y_10,-1,0,1
Y_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
-1,319,68,217
0,368,168,581
1,159,66,495


In [10]:
pd.crosstab(Train_df['Y_3'], Train_df['Y_14']) # about 54% in the diagonal

Y_14,-1,0,1
Y_3,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
-1,455,80,230
0,203,87,339
1,193,86,768


In [11]:
pd.crosstab(Train_df['Y_5'], Train_df['Y_20']) # 59% in the diagonal

Y_20,-1,0,1
Y_5,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
-1,483,59,288
0,152,74,231
1,196,75,883


### Observations
- For the one-day-ahead forecast, the most frequent class is "sideway" (1,117 of 2,441 rows), while "up" (720) and "down" (604) are roughly similar in size.  
- For the three-day-ahead forecast and beyond (5, 10, 14 and 20 days), "sideway" becomes the least frequent class, while "up" is the most frequent of the three.
- Looking at the cross-tabs, forecasts at different horizons share the same label about 40% to 59% of the time (the diagonal of each table).
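
The diagonal percentages quoted next to the cross-tabs can be recomputed instead of being read off by hand. The sketch below is only an illustration: `diagonal_share` is a hypothetical helper name, and the small table simply re-enters the Y_1 vs Y_10 counts shown above.

```python
import numpy as np
import pandas as pd


def diagonal_share(table: pd.DataFrame) -> float:
    """Fraction of observations on the main diagonal of a square crosstab."""
    values = table.to_numpy()
    return float(np.trace(values) / values.sum())


# Counts copied from the Y_1 vs Y_10 crosstab above (rows: Y_1, columns: Y_10).
y1_vs_y10 = pd.DataFrame(
    [[319, 68, 217], [368, 168, 581], [159, 66, 495]],
    index=[-1, 0, 1],
    columns=[-1, 0, 1],
)
share = diagonal_share(y1_vs_y10)
print(round(share, 3))  # prints 0.402
```

On the real data the same helper could be applied directly to `pd.crosstab(Train_df['Y_1'], Train_df['Y_10'])`, as long as it runs before the Y columns are dropped from `Train_df`.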

In [23]:
Train_df.describe()

Unnamed: 0,Close,Volume,MA_diff_3,MA_diff_5,MA_diff_10,MA_diff_14,MA_diff_20,EMA_diff,MACD,MACD_diff,...,CCI_10,CCI_14,CCI_20,STD_3,STD_5,STD_10,STD_14,STD_20,Volume_Agent,Y_N_1
count,2441.0,2441.0,2441.0,2441.0,2441.0,2441.0,2441.0,2441.0,2441.0,2441.0,...,2441.0,2441.0,2441.0,2441.0,2441.0,2441.0,2441.0,2441.0,2441.0,2441.0
mean,1178.293695,7406376.0,0.368501,0.372335,0.374966,0.370038,0.36508,0.367274,2.516408,0.006069,...,14.350821,16.589049,18.845889,7.97656,10.362944,14.606987,17.237003,20.663256,0.463335,-0.045063
std,362.904923,5205370.0,7.243376,5.611197,3.8899,3.30559,2.799292,2.867838,14.407065,1.585529,...,79.799658,83.666191,86.737375,6.050765,6.864183,8.559796,9.550124,10.795102,0.498756,0.735249
min,384.15,905657.4,-41.843333,-27.618,-18.487,-14.585,-11.747,-13.808023,-53.049231,-7.719244,...,-178.925342,-200.91877,-238.933328,0.09,0.826819,2.191589,2.476709,3.225144,0.0,-1.0
25%,845.83,3665487.0,-3.313333,-2.506,-1.796,-1.393571,-1.1785,-1.170292,-5.620914,-0.810094,...,-50.163099,-50.84872,-52.379445,3.937795,5.911622,8.768859,10.601046,13.430422,0.0,-1.0
50%,1289.07,6316200.0,1.01,0.884,0.813,0.810714,0.7215,0.792966,4.267453,0.124999,...,26.61549,31.863177,34.632393,6.648301,8.690091,12.464816,15.033639,18.629973,0.0,0.0
75%,1497.98,9743239.0,4.763333,3.946,2.868,2.652143,2.3525,2.315792,12.987342,0.877887,...,83.083804,87.948493,90.336189,10.092078,12.503994,18.074873,21.068076,25.566598,1.0,0.0
max,1753.71,52941460.0,27.453333,21.594,17.557,13.911429,8.057,10.147395,30.595982,6.960438,...,174.382674,207.239309,241.354532,50.284455,53.863507,58.286524,67.345369,73.995076,1.0,1.0


### Observations
(For MA and EMA, most of the values are very close to each other, as expected.)  
- For "RSI", the average value is 55, with min = 1.2 and max = 99 (as expected, because over time the degrees of overbought and oversold should cancel out).   
- For "MACD", the average value is 2.5 and the median is 4.3, indicating a left-skewed distribution (a tail chance that the short-term trend goes far below the long-term trend).  
- For MOM1 - MOM14, the mean is very close to zero, indicating that if you blindly trade the stock every day, the average return you should expect is about zero.  
- For CCI_20, the average value is 18.8 while the median is 34.6, also indicating a left-skewed distribution (a tail chance that the index goes more than 2 SD below the trend).  
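
The "mean below median" reading used for MACD and CCI_20 is a common heuristic rather than a strict rule, so it can be worth checking the sample skewness as well. The sketch below uses a small made-up series (not the SET data) with one large negative outlier; on the notebook's data the same calls could be applied to `Train_df['MACD']` or `Train_df['CCI_20']`.

```python
import pandas as pd

# Made-up values with a long left tail, purely for illustration.
demo = pd.Series([-50.0, 0.0, 4.0, 5.0, 6.0, 7.0, 8.0])

mean_value = demo.mean()
median_value = demo.median()
skewness = demo.skew()  # negative values point to a longer left tail

print(mean_value < median_value, skewness < 0)
```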


In [19]:
Train_y_1 = Train_df['Y_1']
Train_y_3 = Train_df['Y_3']
Train_y_5 = Train_df['Y_5']
Train_y_10 = Train_df['Y_10']
Train_y_14 = Train_df['Y_14']
Train_y_20 = Train_df['Y_20']

# Note: despite the Test_ prefix, these targets come from the validation set (Validate_df)
Test_y_1 = Validate_df['Y_1']
Test_y_3 = Validate_df['Y_3']
Test_y_5 = Validate_df['Y_5']
Test_y_10 = Validate_df['Y_10']
Test_y_14 = Validate_df['Y_14']
Test_y_20 = Validate_df['Y_20']

In [20]:
Train_df.drop(['Y_1', 'Y_3', 'Y_5', 'Y_10', 'Y_14', 'Y_20'], axis = 1, inplace = True)
Validate_df.drop(['Y_1', 'Y_3', 'Y_5', 'Y_10', 'Y_14', 'Y_20'], axis = 1, inplace = True)

#### The targets (y) for each horizon are now defined; the features (x) will be defined as we go. (As a reminder to myself, the y columns have been removed from the dataframes first.)

In [21]:
## First, build the most basic benchmark model: dummy classifiers
## I planned two versions: a most-frequent version and a stratified version
## The most-frequent version measures accuracy on the validation set
## The stratified version would be used to report an accuracy table -> will do later if needed

clf_dummy_mf_1 = DummyClassifier(strategy="most_frequent")
clf_dummy_mf_1.fit(Train_df, Train_y_1)
print('Dummy classifier (most frequent) training accuracy on 1 day ahead:', clf_dummy_mf_1.score(Train_df, Train_y_1))
print('Dummy classifier (most frequent) prediction accuracy on 1 day ahead:', clf_dummy_mf_1.score(Validate_df, Test_y_1))

clf_dummy_mf_3 = DummyClassifier(strategy="most_frequent")
clf_dummy_mf_3.fit(Train_df, Train_y_3)
print('Dummy classifier (most frequent) training accuracy on 3 days ahead:', clf_dummy_mf_3.score(Train_df, Train_y_3))
print('Dummy classifier (most frequent) prediction accuracy on 3 days ahead:', clf_dummy_mf_3.score(Validate_df, Test_y_3))

clf_dummy_mf_5 = DummyClassifier(strategy="most_frequent")
clf_dummy_mf_5.fit(Train_df, Train_y_5)
print('Dummy classifier (most frequent) training accuracy on 5 days ahead:', clf_dummy_mf_5.score(Train_df, Train_y_5))
print('Dummy classifier (most frequent) prediction accuracy on 5 days ahead:', clf_dummy_mf_5.score(Validate_df, Test_y_5))

clf_dummy_mf_10 = DummyClassifier(strategy="most_frequent")
clf_dummy_mf_10.fit(Train_df, Train_y_10)
print('Dummy classifier (most frequent) training accuracy on 10 days ahead:', clf_dummy_mf_10.score(Train_df, Train_y_10))
print('Dummy classifier (most frequent) prediction accuracy on 10 days ahead:', clf_dummy_mf_10.score(Validate_df, Test_y_10))

clf_dummy_mf_14 = DummyClassifier(strategy="most_frequent")
clf_dummy_mf_14.fit(Train_df, Train_y_14)
print('Dummy classifier (most frequent) training accuracy on 14 days ahead:', clf_dummy_mf_14.score(Train_df, Train_y_14))
print('Dummy classifier (most frequent) prediction accuracy on 14 days ahead:', clf_dummy_mf_14.score(Validate_df, Test_y_14))

clf_dummy_mf_20 = DummyClassifier(strategy="most_frequent")
clf_dummy_mf_20.fit(Train_df, Train_y_20)
print('Dummy classifier (most frequent) training accuracy on 20 days ahead:', clf_dummy_mf_20.score(Train_df, Train_y_20))
print('Dummy classifier (most frequent) prediction accuracy on 20 days ahead:', clf_dummy_mf_20.score(Validate_df, Test_y_20))


Dummy classifier (most frequent) training accuracy on 1 day ahead: 0.45759934453092993
Dummy classifier (most frequent) prediction accuracy on 1 day ahead: 0.5591836734693878
Dummy classifier (most frequent) training accuracy on 3 days ahead: 0.42892257271609996
Dummy classifier (most frequent) prediction accuracy on 3 days ahead: 0.3224489795918367
Dummy classifier (most frequent) training accuracy on 5 days ahead: 0.4727570667759115
Dummy classifier (most frequent) prediction accuracy on 5 days ahead: 0.34285714285714286
Dummy classifier (most frequent) training accuracy on 10 days ahead: 0.5297009422367882
Dummy classifier (most frequent) prediction accuracy on 10 days ahead: 0.3346938775510204
Dummy classifier (most frequent) training accuracy on 14 days ahead: 0.5477263416632527
Dummy classifier (most frequent) prediction accuracy on 14 days ahead: 0.2938775510204082
Dummy classifier (most frequent) training accuracy on 20 days ahead: 0.5743547726341663
Dummy classifier (most freq
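
The same fit-and-score pattern is repeated for each of the six horizons. A loop that collects the scores in a table is one way to keep it compact. This is only a sketch: `score_by_horizon` is a hypothetical helper, and the data below is synthetic so that the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier


def score_by_horizon(make_model, X_train, y_train_by_h, X_val, y_val_by_h):
    """Fit one fresh model per horizon and tabulate train/validation accuracy."""
    rows = []
    for horizon, y_train in y_train_by_h.items():
        model = make_model().fit(X_train, y_train)
        rows.append({
            "horizon": horizon,
            "train_acc": model.score(X_train, y_train),
            "val_acc": model.score(X_val, y_val_by_h[horizon]),
        })
    return pd.DataFrame(rows).set_index("horizon")


# Synthetic stand-in data (NOT the SET data) so the sketch is self-contained.
rng = np.random.default_rng(0)
X_tr = pd.DataFrame({"f": rng.normal(size=60)})
X_va = pd.DataFrame({"f": rng.normal(size=20)})
y_tr = {1: pd.Series([0] * 40 + [1] * 20), 3: pd.Series([1] * 35 + [-1] * 25)}
y_va = {1: pd.Series([0] * 5 + [1] * 15), 3: pd.Series([1] * 12 + [-1] * 8)}

table = score_by_horizon(
    lambda: DummyClassifier(strategy="most_frequent"), X_tr, y_tr, X_va, y_va
)
print(table)
```

With the notebook's variables, the dictionaries would be built as `{1: Train_y_1, 3: Train_y_3, ...}` and `{1: Test_y_1, 3: Test_y_3, ...}`, with `Train_df` and `Validate_df` as the feature frames.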

In [22]:
## Second, build a slightly smarter version: logistic regression on a lagged dependent variable
## For visualizing feature importances, use standardized parameters instead
## Source: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
X_train = Train_df[['Y_N_1']]
X_test = Validate_df[['Y_N_1']]
clf_LR_1 = LogisticRegression(random_state=0).fit(X_train, Train_y_1)
print('Logistic regression on lagged training accuracy on 1 day ahead:', clf_LR_1.score(X_train, Train_y_1))
print('Logistic regression on lagged prediction accuracy on 1 day ahead:', clf_LR_1.score(X_test, Test_y_1))

clf_LR_3 = LogisticRegression(random_state=0).fit(X_train, Train_y_3)
print('Logistic regression on lagged training accuracy on 3 days ahead:', clf_LR_3.score(X_train, Train_y_3))
print('Logistic regression on lagged prediction accuracy on 3 days ahead:', clf_LR_3.score(X_test, Test_y_3))

clf_LR_5 = LogisticRegression(random_state=0).fit(X_train, Train_y_5)
print('Logistic regression on lagged training accuracy on 5 days ahead:', clf_LR_5.score(X_train, Train_y_5))
print('Logistic regression on lagged prediction accuracy on 5 days ahead:', clf_LR_5.score(X_test, Test_y_5))

clf_LR_10 = LogisticRegression(random_state=0).fit(X_train, Train_y_10)
print('Logistic regression on lagged training accuracy on 10 days ahead:', clf_LR_10.score(X_train, Train_y_10))
print('Logistic regression on lagged prediction accuracy on 10 days ahead:', clf_LR_10.score(X_test, Test_y_10))

clf_LR_14 = LogisticRegression(random_state=0).fit(X_train, Train_y_14)
print('Logistic regression on lagged training accuracy on 14 days ahead:', clf_LR_14.score(X_train, Train_y_14))
print('Logistic regression on lagged prediction accuracy on 14 days ahead:', clf_LR_14.score(X_test, Test_y_14))

clf_LR_20 = LogisticRegression(random_state=0).fit(X_train, Train_y_20)
print('Logistic regression on lagged training accuracy on 20 days ahead:', clf_LR_20.score(X_train, Train_y_20))
print('Logistic regression on lagged prediction accuracy on 20 days ahead:', clf_LR_20.score(X_test, Test_y_20))


Logistic regression on lagged training accuracy on 1 day ahead: 0.45759934453092993
Logistic regression on lagged prediction accuracy on 1 day ahead: 0.5591836734693878
Logistic regression on lagged training accuracy on 3 days ahead: 0.42892257271609996
Logistic regression on lagged prediction accuracy on 3 days ahead: 0.3224489795918367
Logistic regression on lagged training accuracy on 5 days ahead: 0.4727570667759115
Logistic regression on lagged prediction accuracy on 5 days ahead: 0.34285714285714286
Logistic regression on lagged training accuracy on 10 days ahead: 0.5297009422367882
Logistic regression on lagged prediction accuracy on 10 days ahead: 0.3346938775510204
Logistic regression on lagged training accuracy on 14 days ahead: 0.5477263416632527
Logistic regression on lagged prediction accuracy on 14 days ahead: 0.2938775510204082
Logistic regression on lagged training accuracy on 20 days ahead: 0.5743547726341663
Logistic regression on lagged prediction accuracy on 20 days ahe
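
The accuracies printed above match the most-frequent dummy baseline digit for digit, which may mean that the fitted model predicts a single class for every row. Counting the predicted labels is a quick way to check. The sketch below uses a toy example rather than the notebook's models, and `predicted_label_counts` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier


def predicted_label_counts(model, X) -> pd.Series:
    """Count how often each label is predicted by a fitted classifier."""
    return pd.Series(model.predict(X)).value_counts().sort_index()


# Toy example (not the SET data): a most-frequent baseline predicts one label only.
X_demo = np.zeros((10, 1))
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, -1, -1])
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)

counts = predicted_label_counts(baseline, X_demo)
print(counts)
```

Applied to the notebook, `predicted_label_counts(clf_LR_1, X_test)` would show whether the lagged-variable model ever predicts anything other than the majority class.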

In [25]:
X_train_1 = Train_df[['Y_N_1', 'EMA_diff', 'MACD', 'MOM1', 'RSI', 'MACD_diff', 'Volume_Agent']]
X_train_3 = Train_df[['MA_diff_3', 'EMA_diff', 'MACD', 'MOM3', 'RSI', 'MACD_diff', 'CCI_3', 'STD_3', 'Volume_Agent']]
X_train_5 = Train_df[['MA_diff_5', 'EMA_diff', 'MACD', 'MOM5', 'RSI', 'MACD_diff', 'CCI_5', 'STD_5', 'Volume_Agent']]
X_train_10 = Train_df[['MA_diff_10', 'EMA_diff', 'MACD', 'MOM10', 'RSI', 'MACD_diff', 'CCI_10', 'STD_10', 'Volume_Agent']]
X_train_14 = Train_df[['MA_diff_14', 'EMA_diff', 'MACD', 'MOM14', 'RSI', 'MACD_diff', 'CCI_14', 'STD_14', 'Volume_Agent']]
X_train_20 = Train_df[['MA_diff_20', 'EMA_diff', 'MACD', 'MOM20', 'RSI', 'MACD_diff', 'CCI_20', 'STD_20', 'Volume_Agent']]

In [26]:
X_test_1 = Validate_df[['Y_N_1', 'EMA_diff', 'MACD', 'MOM1', 'RSI', 'MACD_diff', 'Volume_Agent']]
X_test_3 = Validate_df[['MA_diff_3', 'EMA_diff', 'MACD', 'MOM3', 'RSI', 'MACD_diff', 'CCI_3', 'STD_3', 'Volume_Agent']]
X_test_5 = Validate_df[['MA_diff_5', 'EMA_diff', 'MACD', 'MOM5', 'RSI', 'MACD_diff', 'CCI_5', 'STD_5', 'Volume_Agent']]
X_test_10 = Validate_df[['MA_diff_10', 'EMA_diff', 'MACD', 'MOM10', 'RSI', 'MACD_diff', 'CCI_10', 'STD_10', 'Volume_Agent']]
X_test_14 = Validate_df[['MA_diff_14', 'EMA_diff', 'MACD', 'MOM14', 'RSI', 'MACD_diff', 'CCI_14', 'STD_14', 'Volume_Agent']]
X_test_20 = Validate_df[['MA_diff_20', 'EMA_diff', 'MACD', 'MOM20', 'RSI', 'MACD_diff', 'CCI_20', 'STD_20', 'Volume_Agent']]

In [43]:
# Let's run logistic regression with at most 9 features

clf_LR_all_1 = LogisticRegression(random_state=0, multi_class = "multinomial", C = 10**(-6)).fit(X_train_1, Train_y_1)
print('Logistic regression (all features) training accuracy on 1 day ahead:', clf_LR_all_1.score(X_train_1, Train_y_1))
print('Logistic regression (all features) prediction accuracy on 1 day ahead:', clf_LR_all_1.score(X_test_1, Test_y_1))

clf_LR_all_3 = LogisticRegression(random_state=0, multi_class = "multinomial", C = 10**(-6)).fit(X_train_3, Train_y_3)
print('Logistic regression (all features)  training accuracy on 3 days ahead:', clf_LR_all_3.score(X_train_3, Train_y_3))
print('Logistic regression (all features) prediction accuracy on 3 days ahead:', clf_LR_all_3.score(X_test_3, Test_y_3))

clf_LR_all_5 = LogisticRegression(random_state=0, multi_class = "multinomial", C = 10**(-6)).fit(X_train_5, Train_y_5)
print('Logistic regression (all features) training accuracy on 5 days ahead:', clf_LR_all_5.score(X_train_5, Train_y_5))
print('Logistic regression (all features) prediction accuracy on 5 days ahead:', clf_LR_all_5.score(X_test_5, Test_y_5))

Logistic regression (all features) training accuracy on 1 day ahead: 0.45759934453092993
Logistic regression (all features) prediction accuracy on 1 day ahead: 0.5591836734693878
Logistic regression (all features)  training accuracy on 3 days ahead: 0.42892257271609996
Logistic regression (all features) prediction accuracy on 3 days ahead: 0.3224489795918367
Logistic regression (all features) training accuracy on 5 days ahead: 0.4727570667759115
Logistic regression (all features) prediction accuracy on 5 days ahead: 0.34285714285714286


In [42]:
# Let's run logistic regression with at most 9 features

clf_LR_all_10 = LogisticRegression(random_state=0, multi_class = "multinomial", C = 10**(-6)).fit(X_train_10, Train_y_10)
print('Logistic regression (all features) training accuracy on 10 days ahead:', clf_LR_all_10.score(X_train_10, Train_y_10))
print('Logistic regression (all features) prediction accuracy on 10 days ahead:', clf_LR_all_10.score(X_test_10, Test_y_10))

clf_LR_all_14 = LogisticRegression(random_state=0, multi_class = "multinomial", C = 10**(-6)).fit(X_train_14, Train_y_14)
print('Logistic regression (all features) training accuracy on 14 days ahead:', clf_LR_all_14.score(X_train_14, Train_y_14))
print('Logistic regression (all features) prediction accuracy on 14 days ahead:', clf_LR_all_14.score(X_test_14, Test_y_14))

clf_LR_all_20 = LogisticRegression(random_state=0, multi_class = "multinomial", C = 10**(-6)).fit(X_train_20, Train_y_20)
print('Logistic regression (all features) training accuracy on 20 days ahead:', clf_LR_all_20.score(X_train_20, Train_y_20))
print('Logistic regression (all features) prediction accuracy on 20 days ahead:', clf_LR_all_20.score(X_test_20, Test_y_20))

Logistic regression (all features) training accuracy on 10 days ahead: 0.5276526013928717
Logistic regression (all features) prediction accuracy on 10 days ahead: 0.3306122448979592
Logistic regression (all features) training accuracy on 14 days ahead: 0.5477263416632527
Logistic regression (all features) prediction accuracy on 14 days ahead: 0.2979591836734694
Logistic regression (all features) training accuracy on 20 days ahead: 0.5743547726341663
Logistic regression (all features) prediction accuracy on 20 days ahead: 0.2816326530612245
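
An earlier comment suggests using standardized parameters when comparing coefficients, and the `describe()` output shows the features sit on quite different scales (for example, CCI_20 has a standard deviation near 87 while MACD_diff is near 1.6). One possible way to do that, which is not what the cells above ran, is to put a `StandardScaler` in front of the logistic regression. The sketch below uses synthetic data and default regularization, both of which are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the indicator features (not the SET data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0,
)
X_demo[:, 0] *= 1000.0  # put one feature on a much larger scale than the others

# Scale the features, then fit the classifier on the scaled values.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, random_state=0))
pipe.fit(X_demo, y_demo)

# Coefficients are now expressed per standard deviation of each feature.
coefs = pipe.named_steps["logisticregression"].coef_  # shape: (n_classes, n_features)
train_acc = pipe.score(X_demo, y_demo)
print(coefs.shape, round(train_acc, 3))
```

Because the scaler is fitted inside the pipeline, calling `pipe.score` on the validation features reuses the training means and variances rather than recomputing them.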


In [62]:
from sklearn.tree import DecisionTreeClassifier
clf_DT_all_1 = DecisionTreeClassifier(random_state=0, max_features = "sqrt", max_depth = 6, min_weight_fraction_leaf = 0.05).fit(X_train_1, Train_y_1)
print('DT (all features) training accuracy on 1 day ahead:', clf_DT_all_1.score(X_train_1, Train_y_1))
print('DT (all features) prediction accuracy on 1 day ahead:', clf_DT_all_1.score(X_test_1, Test_y_1))

clf_DT_all_3 = DecisionTreeClassifier(random_state=0, max_features = "sqrt", max_depth = 6, min_weight_fraction_leaf = 0.05).fit(X_train_3, Train_y_3)
print('DT (all features) training accuracy on 3 days ahead:', clf_DT_all_3.score(X_train_3, Train_y_3))
print('DT (all features) prediction accuracy on 3 days ahead:', clf_DT_all_3.score(X_test_3, Test_y_3))

clf_DT_all_5 = DecisionTreeClassifier(random_state=0, max_features = "sqrt", max_depth = 6, min_weight_fraction_leaf = 0.05).fit(X_train_5, Train_y_5)
print('DT (all features) training accuracy on 5 days ahead:', clf_DT_all_5.score(X_train_5, Train_y_5))
print('DT (all features) prediction accuracy on 5 days ahead:', clf_DT_all_5.score(X_test_5, Test_y_5))

clf_DT_all_10 = DecisionTreeClassifier(random_state=0, max_features = "sqrt", max_depth = 6, min_weight_fraction_leaf = 0.05).fit(X_train_10, Train_y_10)
print('DT (all features) training accuracy on 10 days ahead:', clf_DT_all_10.score(X_train_10, Train_y_10))
print('DT (all features) prediction accuracy on 10 days ahead:', clf_DT_all_10.score(X_test_10, Test_y_10))

clf_DT_all_14 = DecisionTreeClassifier(random_state=0, max_features = "sqrt", max_depth = 6, min_weight_fraction_leaf = 0.05).fit(X_train_14, Train_y_14)
print('DT (all features) training accuracy on 14 days ahead:', clf_DT_all_14.score(X_train_14, Train_y_14))
print('DT (all features) prediction accuracy on 14 days ahead:', clf_DT_all_14.score(X_test_14, Test_y_14))

clf_DT_all_20 = DecisionTreeClassifier(random_state=0, max_features = "sqrt", max_depth = 6, min_weight_fraction_leaf = 0.05).fit(X_train_20, Train_y_20)
print('DT (all features) training accuracy on 20 days ahead:', clf_DT_all_20.score(X_train_20, Train_y_20))
print('DT (all features) prediction accuracy on 20 days ahead:', clf_DT_all_20.score(X_test_20, Test_y_20))

DT (all features) training accuracy on 1 day ahead: 0.47931175747644406
DT (all features) prediction accuracy on 1 day ahead: 0.5224489795918368
DT (all features) training accuracy on 3 days ahead: 0.4559606718557968
DT (all features) prediction accuracy on 3 days ahead: 0.37142857142857144
DT (all features) training accuracy on 5 days ahead: 0.5018435067595248
DT (all features) prediction accuracy on 5 days ahead: 0.3795918367346939
DT (all features) training accuracy on 10 days ahead: 0.5571487095452683
DT (all features) prediction accuracy on 10 days ahead: 0.3183673469387755
DT (all features) training accuracy on 14 days ahead: 0.5661614092585007
DT (all features) prediction accuracy on 14 days ahead: 0.3877551020408163
DT (all features) training accuracy on 20 days ahead: 0.5989348627611635
DT (all features) prediction accuracy on 20 days ahead: 0.4489795918367347


In [86]:
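from sklearn.ensemble import RandomForestClassifier
# Note: the clf_LR_all_* names are reused here for the random forest models (not logistic regression);
# the prediction and feature-importance cells below refer to these names.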
clf_LR_all_1 = RandomForestClassifier(random_state=0, max_depth = 10, min_weight_fraction_leaf = 0.005).fit(X_train_1, Train_y_1)
print('RF (all features) training accuracy on 1 day ahead:', clf_LR_all_1.score(X_train_1, Train_y_1))
print('RF (all features) prediction accuracy on 1 day ahead:', clf_LR_all_1.score(X_test_1, Test_y_1))

clf_LR_all_3 = RandomForestClassifier(random_state=0, max_depth = 10, min_weight_fraction_leaf = 0.005).fit(X_train_3, Train_y_3)
print('RF (all features)  training accuracy on 3 days ahead:', clf_LR_all_3.score(X_train_3, Train_y_3))
print('RF (all features) prediction accuracy on 3 days ahead:', clf_LR_all_3.score(X_test_3, Test_y_3))

clf_LR_all_5 = RandomForestClassifier(random_state=0, max_depth = 10, min_weight_fraction_leaf = 0.005).fit(X_train_5, Train_y_5)
print('RF (all features) training accuracy on 5 days ahead:', clf_LR_all_5.score(X_train_5, Train_y_5))
print('RF (all features) prediction accuracy on 5 days ahead:', clf_LR_all_5.score(X_test_5, Test_y_5))

clf_LR_all_10 = RandomForestClassifier(random_state=0, max_depth = 10, min_weight_fraction_leaf = 0.005).fit(X_train_10, Train_y_10)
print('RF (all features) training accuracy on 10 days ahead:', clf_LR_all_10.score(X_train_10, Train_y_10))
print('RF (all features) prediction accuracy on 10 days ahead:', clf_LR_all_10.score(X_test_10, Test_y_10))

clf_LR_all_14 = RandomForestClassifier(random_state=0, max_depth = 10, min_weight_fraction_leaf = 0.005).fit(X_train_14, Train_y_14)
print('RF (all features) training accuracy on 14 days ahead:', clf_LR_all_14.score(X_train_14, Train_y_14))
print('RF (all features) prediction accuracy on 14 days ahead:', clf_LR_all_14.score(X_test_14, Test_y_14))

clf_LR_all_20 = RandomForestClassifier(random_state=0, max_depth = 10, min_weight_fraction_leaf = 0.005).fit(X_train_20, Train_y_20)
print('RF (all features) training accuracy on 20 days ahead:', clf_LR_all_20.score(X_train_20, Train_y_20))
print('RF (all features) prediction accuracy on 20 days ahead:', clf_LR_all_20.score(X_test_20, Test_y_20))

RF (all features) training accuracy on 1 day ahead: 0.6419500204834084
RF (all features) prediction accuracy on 1 day ahead: 0.46530612244897956
RF (all features)  training accuracy on 3 days ahead: 0.6894715280622695
RF (all features) prediction accuracy on 3 days ahead: 0.3020408163265306
RF (all features) training accuracy on 5 days ahead: 0.6669397787791889
RF (all features) prediction accuracy on 5 days ahead: 0.363265306122449
RF (all features) training accuracy on 10 days ahead: 0.7042195821384678
RF (all features) prediction accuracy on 10 days ahead: 0.3306122448979592
RF (all features) training accuracy on 14 days ahead: 0.7505120852109791
RF (all features) prediction accuracy on 14 days ahead: 0.4204081632653061
RF (all features) training accuracy on 20 days ahead: 0.7795985251945924
RF (all features) prediction accuracy on 20 days ahead: 0.46122448979591835


In [87]:
Pred_y_1 = clf_LR_all_1.predict(X_test_1)
Pred_y_3 = clf_LR_all_3.predict(X_test_3)
Pred_y_5 = clf_LR_all_5.predict(X_test_5)
Pred_y_10 = clf_LR_all_10.predict(X_test_10)
Pred_y_14 = clf_LR_all_14.predict(X_test_14)
Pred_y_20 = clf_LR_all_20.predict(X_test_20)

In [88]:
from sklearn.metrics import classification_report
print(classification_report(Test_y_1, Pred_y_1))

              precision    recall  f1-score   support

          -1       0.15      0.08      0.11        60
           0       0.58      0.69      0.63       137
           1       0.29      0.31      0.30        48

    accuracy                           0.47       245
   macro avg       0.34      0.36      0.35       245
weighted avg       0.42      0.47      0.44       245



In [89]:
print(classification_report(Test_y_3, Pred_y_3))

              precision    recall  f1-score   support

          -1       0.36      0.34      0.35        95
           0       0.30      0.13      0.18        71
           1       0.26      0.42      0.32        79

    accuracy                           0.30       245
   macro avg       0.31      0.29      0.28       245
weighted avg       0.31      0.30      0.29       245



In [90]:
print(classification_report(Test_y_5, Pred_y_5))

              precision    recall  f1-score   support

          -1       0.41      0.29      0.34       103
           0       0.00      0.00      0.00        58
           1       0.35      0.70      0.46        84

    accuracy                           0.36       245
   macro avg       0.25      0.33      0.27       245
weighted avg       0.29      0.36      0.30       245





In [91]:
print(classification_report(Test_y_10, Pred_y_10))

              precision    recall  f1-score   support

          -1       0.41      0.22      0.29       127
           0       0.00      0.00      0.00        36
           1       0.30      0.65      0.41        82

    accuracy                           0.33       245
   macro avg       0.24      0.29      0.23       245
weighted avg       0.31      0.33      0.29       245



In [92]:
print(classification_report(Test_y_14, Pred_y_14))

              precision    recall  f1-score   support

          -1       0.58      0.37      0.45       128
           0       0.00      0.00      0.00        45
           1       0.34      0.78      0.47        72

    accuracy                           0.42       245
   macro avg       0.31      0.38      0.31       245
weighted avg       0.40      0.42      0.37       245



In [93]:
print(classification_report(Test_y_20, Pred_y_20))

              precision    recall  f1-score   support

          -1       0.71      0.37      0.49       145
           0       0.00      0.00      0.00        31
           1       0.35      0.86      0.50        69

    accuracy                           0.46       245
   macro avg       0.35      0.41      0.33       245
weighted avg       0.52      0.46      0.43       245
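
Several of the reports above show precision and recall of 0.00 for class 0, which is what happens when a class is never predicted. A confusion matrix makes that visible as an empty column. The sketch below uses a tiny hand-made example, not the notebook's predictions.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = [-1, 0, 1]
# Hand-made labels for illustration only: class 0 is never predicted.
y_true = [-1, -1, 0, 0, 1, 1, 1]
y_pred = [-1, 1, 1, -1, 1, 1, -1]

# Rows are actual classes, columns are predicted classes, both in the order of `labels`.
cm = pd.DataFrame(
    confusion_matrix(y_true, y_pred, labels=labels),
    index=pd.Index(labels, name="actual"),
    columns=pd.Index(labels, name="predicted"),
)
print(cm)
```

For the 20-day model this would be `confusion_matrix(Test_y_20, Pred_y_20, labels=[-1, 0, 1])`. Separately, `classification_report` accepts a `zero_division` argument that controls the value reported for such classes and whether the accompanying warning is raised.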



In [95]:
clf_LR_all_1.feature_importances_

array([0.0289022 , 0.20393818, 0.19167446, 0.19947846, 0.18255213,
       0.16817808, 0.0252765 ])

In [96]:
clf_LR_all_3.feature_importances_

array([0.11867165, 0.13744927, 0.14360675, 0.12186161, 0.13869216,
       0.11188888, 0.10162636, 0.1101321 , 0.01607122])

In [97]:
clf_LR_all_5.feature_importances_

array([0.10763555, 0.12147414, 0.13819545, 0.12357743, 0.1486735 ,
       0.11341031, 0.10482803, 0.13060396, 0.01160163])

In [99]:
clf_LR_all_10.feature_importances_

array([0.12285178, 0.13636612, 0.12918539, 0.12414469, 0.1324066 ,
       0.09216624, 0.09316876, 0.15127436, 0.01843605])

In [94]:
clf_LR_all_14.feature_importances_

array([0.12606207, 0.11283539, 0.1381796 , 0.14343982, 0.12970229,
       0.07100453, 0.08417255, 0.17552063, 0.01908312])

In [98]:
clf_LR_all_20.feature_importances_

array([0.10104805, 0.0927546 , 0.13779034, 0.14643275, 0.1339007 ,
       0.08880395, 0.09542577, 0.19224921, 0.01159462])
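
The arrays above are easier to read when each importance is paired with its column name. Their order follows the column order of the matching `X_train_*` frame, so for the 1-day model the first and last entries correspond to `Y_N_1` and `Volume_Agent`. The sketch below shows the pairing on synthetic data with a freshly fitted forest; the column names mirror the 1-day feature list, but the values are random.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in features (not the SET data).
rng = np.random.default_rng(0)
cols = ['Y_N_1', 'EMA_diff', 'MACD', 'MOM1', 'RSI', 'MACD_diff', 'Volume_Agent']
X_demo = pd.DataFrame(rng.normal(size=(200, len(cols))), columns=cols)
y_demo = rng.integers(-1, 2, size=200)  # random labels in {-1, 0, 1}

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Pair each importance with its column name and sort from most to least important.
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

With the notebook's objects the equivalent call would be `pd.Series(clf_LR_all_1.feature_importances_, index=X_train_1.columns)`, and likewise for the other horizons.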

In [134]:
from sklearn.ensemble import GradientBoostingClassifier
clf_GT_all_1 = GradientBoostingClassifier(random_state=0, max_depth = 5, min_weight_fraction_leaf = 0.2, learning_rate = 0.5).fit(X_train_1, Train_y_1)
print('GBDT (all features) training accuracy on 1 day ahead:', clf_GT_all_1.score(X_train_1, Train_y_1))
print('GBDT (all features) prediction accuracy on 1 day ahead:',clf_GT_all_1.score(X_test_1, Test_y_1))

GBDT (all features) training accuracy on 1 day ahead: 0.546497337156903
GBDT (all features) prediction accuracy on 1 day ahead: 0.46938775510204084


In [133]:
clf_GT_all_3 = GradientBoostingClassifier(random_state=0, max_depth = 5, min_weight_fraction_leaf = 0.2, learning_rate = 0.5).fit(X_train_3, Train_y_3)
print('GBDT (all features) training accuracy on 3 days ahead:', clf_GT_all_3.score(X_train_3, Train_y_3))
print('GBDT (all features) prediction accuracy on 3 days ahead:', clf_GT_all_3.score(X_test_3, Test_y_3))

GBDT (all features) training accuracy on 3 days ahead: 0.5776321179844326
GBDT (all features) prediction accuracy on 3 days ahead: 0.33877551020408164


In [132]:
clf_GT_all_5 = GradientBoostingClassifier(random_state=0, max_depth = 5, min_weight_fraction_leaf = 0.2, learning_rate = 0.5).fit(X_train_5, Train_y_5)
print('GBDT (all features) training accuracy on 5 days ahead:', clf_GT_all_5.score(X_train_5, Train_y_5))
print('GBDT (all features) prediction accuracy on 5 days ahead:', clf_GT_all_5.score(X_test_5, Test_y_5))

GBDT (all features) training accuracy on 5 days ahead: 0.605489553461696
GBDT (all features) prediction accuracy on 5 days ahead: 0.33877551020408164


In [131]:
clf_GT_all_10 = GradientBoostingClassifier(random_state=0, max_depth = 5, min_weight_fraction_leaf = 0.2, learning_rate = 0.5).fit(X_train_10, Train_y_10)
print('GBDT (all features) training accuracy on 10 days ahead:', clf_GT_all_10.score(X_train_10, Train_y_10))
print('GBDT (all features) prediction accuracy on 10 days ahead:', clf_GT_all_10.score(X_test_10, Test_y_10))

GBDT (all features) training accuracy on 10 days ahead: 0.6423596886521917
GBDT (all features) prediction accuracy on 10 days ahead: 0.3469387755102041


In [135]:
clf_GT_all_14 = GradientBoostingClassifier(random_state=0, max_depth = 5, min_weight_fraction_leaf = 0.2, learning_rate = 0.5).fit(X_train_14, Train_y_14)
print('GBDT (all features) training accuracy on 14 days ahead:', clf_GT_all_14.score(X_train_14, Train_y_14))
print('GBDT (all features) prediction accuracy on 14 days ahead:', clf_GT_all_14.score(X_test_14, Test_y_14))

GBDT (all features) training accuracy on 14 days ahead: 0.6583367472347399
GBDT (all features) prediction accuracy on 14 days ahead: 0.4204081632653061


In [136]:
clf_GT_all_20 = GradientBoostingClassifier(random_state=0, max_depth = 5, min_weight_fraction_leaf = 0.2, learning_rate = 0.5).fit(X_train_20, Train_y_20)
print('GBDT (all features) training accuracy on 20 days ahead:', clf_GT_all_20.score(X_train_20, Train_y_20))
print('GBDT (all features) prediction accuracy on 20 days ahead:', clf_GT_all_20.score(X_test_20, Test_y_20))

GBDT (all features) training accuracy on 20 days ahead: 0.6796394920114707
GBDT (all features) prediction accuracy on 20 days ahead: 0.4857142857142857


In [137]:
Pred_y_1 = clf_GT_all_1.predict(X_test_1)
Pred_y_3 = clf_GT_all_3.predict(X_test_3)
Pred_y_5 = clf_GT_all_5.predict(X_test_5)
Pred_y_10 = clf_GT_all_10.predict(X_test_10)
Pred_y_14 = clf_GT_all_14.predict(X_test_14)
Pred_y_20 = clf_GT_all_20.predict(X_test_20)

In [138]:
print(classification_report(Test_y_1, Pred_y_1))

              precision    recall  f1-score   support

          -1       0.29      0.23      0.26        60
           0       0.60      0.66      0.63       137
           1       0.22      0.21      0.22        48

    accuracy                           0.47       245
   macro avg       0.37      0.37      0.37       245
weighted avg       0.45      0.47      0.46       245



In [139]:
print(classification_report(Test_y_3, Pred_y_3))

              precision    recall  f1-score   support

          -1       0.42      0.34      0.37        95
           0       0.33      0.18      0.23        71
           1       0.30      0.48      0.37        79

    accuracy                           0.34       245
   macro avg       0.35      0.33      0.32       245
weighted avg       0.35      0.34      0.33       245



In [140]:
print(classification_report(Test_y_5, Pred_y_5))

              precision    recall  f1-score   support

          -1       0.39      0.29      0.34       103
           0       0.24      0.07      0.11        58
           1       0.32      0.58      0.42        84

    accuracy                           0.34       245
   macro avg       0.32      0.31      0.29       245
weighted avg       0.33      0.34      0.31       245



In [141]:
print(classification_report(Test_y_10, Pred_y_10))

              precision    recall  f1-score   support

          -1       0.41      0.26      0.32       127
           0       0.40      0.06      0.10        36
           1       0.31      0.61      0.41        82

    accuracy                           0.35       245
   macro avg       0.37      0.31      0.28       245
weighted avg       0.38      0.35      0.32       245



In [142]:
print(classification_report(Test_y_14, Pred_y_14))

              precision    recall  f1-score   support

          -1       0.68      0.33      0.44       128
           0       0.12      0.02      0.04        45
           1       0.34      0.83      0.49        72

    accuracy                           0.42       245
   macro avg       0.38      0.39      0.32       245
weighted avg       0.48      0.42      0.38       245



In [143]:
print(classification_report(Test_y_20, Pred_y_20))

              precision    recall  f1-score   support

          -1       0.76      0.43      0.55       145
           0       0.00      0.00      0.00        31
           1       0.35      0.81      0.49        69

    accuracy                           0.49       245
   macro avg       0.37      0.42      0.35       245
weighted avg       0.55      0.49      0.46       245



In [144]:
clf_GT_all_1.feature_importances_

array([0.        , 0.29827194, 0.17183633, 0.19990629, 0.15997652,
       0.16100166, 0.00900725])

In [145]:
clf_GT_all_3.feature_importances_

array([0.10435695, 0.11534909, 0.16439193, 0.14161082, 0.15073945,
       0.08508032, 0.09469754, 0.12505739, 0.01871651])

In [146]:
clf_GT_all_5.feature_importances_

array([0.1090932 , 0.12614487, 0.15802491, 0.15340729, 0.1269216 ,
       0.07227506, 0.09980479, 0.14222659, 0.01210169])

In [147]:
clf_GT_all_10.feature_importances_

array([0.08089871, 0.10088227, 0.14429805, 0.11331059, 0.15046714,
       0.09380513, 0.09237344, 0.18655525, 0.0374094 ])

In [148]:
clf_GT_all_14.feature_importances_

array([0.07471256, 0.07096399, 0.14524494, 0.18286065, 0.10304173,
       0.05823706, 0.08481881, 0.2546925 , 0.02542777])

In [149]:
clf_GT_all_20.feature_importances_

array([0.07911951, 0.08905021, 0.12365312, 0.12967339, 0.13499752,
       0.10396411, 0.08518436, 0.24679956, 0.00755822])
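
The raw arrays above are hard to read without the column names. As a small sketch, the 20-day importances can be paired with their feature names, assuming the model was fitted on columns in the same order as the `X_test_20` column list used in this notebook. The values are copied from the output above; `feature_names_20`, `importances_20` and `labelled_20` are names introduced here for illustration.

```python
import pandas as pd

# Assumption: training columns are in the same order as the X_test_20 column list.
feature_names_20 = ['MA_diff_20', 'EMA_diff', 'MACD', 'MOM20', 'RSI',
                    'MACD_diff', 'CCI_20', 'STD_20', 'Volume_Agent']

# Values copied from clf_GT_all_20.feature_importances_ above.
importances_20 = [0.07911951, 0.08905021, 0.12365312, 0.12967339, 0.13499752,
                  0.10396411, 0.08518436, 0.24679956, 0.00755822]

# Label each importance and sort from most to least important.
labelled_20 = pd.Series(importances_20, index=feature_names_20).sort_values(ascending=False)
print(labelled_20)
```

Inside the notebook the same thing could be written as `pd.Series(clf_GT_all_20.feature_importances_, index=X_test_20.columns)`, again on the assumption that the column order matches the training frame.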

In [151]:
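# Evaluate on the 2019 test set (Test_a_df); these are the results reported in the paper.
# Note: this overwrites the Test_y_* and X_test_* variables used in the cells above.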
Test_y_1 = Test_a_df['Y_1']
Test_y_3 = Test_a_df['Y_3']
Test_y_5 = Test_a_df['Y_5']
Test_y_10 = Test_a_df['Y_10']
Test_y_14 = Test_a_df['Y_14']
Test_y_20 = Test_a_df['Y_20']

X_test_1 = Test_a_df[['Y_N_1', 'EMA_diff', 'MACD', 'MOM1', 'RSI', 'MACD_diff', 'Volume_Agent']]
X_test_3 = Test_a_df[['MA_diff_3', 'EMA_diff', 'MACD', 'MOM3', 'RSI', 'MACD_diff', 'CCI_3', 'STD_3', 'Volume_Agent']]
X_test_5 = Test_a_df[['MA_diff_5', 'EMA_diff', 'MACD', 'MOM5', 'RSI', 'MACD_diff', 'CCI_5', 'STD_5', 'Volume_Agent']]
X_test_10 = Test_a_df[['MA_diff_10', 'EMA_diff', 'MACD', 'MOM10', 'RSI', 'MACD_diff', 'CCI_10', 'STD_10', 'Volume_Agent']]
X_test_14 = Test_a_df[['MA_diff_14', 'EMA_diff', 'MACD', 'MOM14', 'RSI', 'MACD_diff', 'CCI_14', 'STD_14', 'Volume_Agent']]
X_test_20 = Test_a_df[['MA_diff_20', 'EMA_diff', 'MACD', 'MOM20', 'RSI', 'MACD_diff', 'CCI_20', 'STD_20', 'Volume_Agent']]

In [152]:
print(clf_LR_all_1.score(X_test_1, Test_y_1))
print(clf_LR_all_3.score(X_test_3, Test_y_3))
print(clf_LR_all_5.score(X_test_5, Test_y_5))
print(clf_LR_all_10.score(X_test_10, Test_y_10))
print(clf_LR_all_14.score(X_test_14, Test_y_14))
print(clf_LR_all_20.score(X_test_20, Test_y_20))

0.5390946502057613
0.34156378600823045
0.3333333333333333
0.4074074074074074
0.45267489711934156
0.4567901234567901


In [153]:
print(clf_GT_all_1.score(X_test_1, Test_y_1))
print(clf_GT_all_3.score(X_test_3, Test_y_3))
print(clf_GT_all_5.score(X_test_5, Test_y_5))
print(clf_GT_all_10.score(X_test_10, Test_y_10))
print(clf_GT_all_14.score(X_test_14, Test_y_14))
print(clf_GT_all_20.score(X_test_20, Test_y_20))

0.48559670781893005
0.3292181069958848
0.32510288065843623
0.4074074074074074
0.46502057613168724
0.48148148148148145
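
The two lists of test accuracies are easier to compare side by side. The sketch below simply copies the printed numbers from the two cells above into one table; `summary`, `lr_acc` and `gbdt_acc` are names introduced here for illustration.

```python
import pandas as pd

horizons = [1, 3, 5, 10, 14, 20]

# Test accuracies copied from the logistic regression output above.
lr_acc = [0.5390946502057613, 0.34156378600823045, 0.3333333333333333,
          0.4074074074074074, 0.45267489711934156, 0.4567901234567901]

# Test accuracies copied from the GBDT output above.
gbdt_acc = [0.48559670781893005, 0.3292181069958848, 0.32510288065843623,
            0.4074074074074074, 0.46502057613168724, 0.48148148148148145]

summary = pd.DataFrame({'LR': lr_acc, 'GBDT': gbdt_acc},
                       index=pd.Index(horizons, name='horizon_days'))
# Positive values mean GBDT scored higher than logistic regression.
summary['GBDT_minus_LR'] = summary['GBDT'] - summary['LR']
print(summary.round(3))
```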


### End of Part 2

Whew, this is the end of part 2. The coding in this part was relatively easy compared with part 1. The tough part was tuning hyperparameters to get a higher validation score, and I have to admit that it is difficult to raise the validation score significantly beyond this point. Here are some quick notes on what I observed. I think they could help refine where to go next, using this notebook as the starting point.

- With the dummy classifier, we can see that the distribution of (up, sideways, down) keeps shifting, and the baseline prediction gets worse as the window length increases.  
- Logistic regression using only the lagged dependent variable is no better than the dummy classifier. Unfortunately, the same holds for the full logistic regression model.  
- With the tree-based classifiers (decision tree, RF, and GBDT), I do not see any improvement for window lengths of 1 to 10 days (especially 1 to 5 days). However, there is a clear improvement (from about 30% to about 45% accuracy) for window lengths of 14 and 20 days. The longer the window, the better the prediction, which seems counterintuitive. The test set shows the same pattern.  
- In the classification reports for RF, the model increasingly ignores the sideways class as the time span grows and that class becomes less common. The model is therefore somewhat risk-seeking: it skips sideways and bets on up or down. GBDT is less aggressive than RF in this respect.  
- In the feature importances, the "trend" and "momentum" features (especially CCI and MACD) become less important as the window length increases, while "uncertainty" (STD) becomes more important. (Importance here means the `feature_importances_` values of the RF and GBDT models.)
- The classification reports share a pattern, especially for window lengths of 14 and 20 days. For the down class, precision is high and recall is low: when the model says the index will go down, it is usually right, but it misses many of the actual down periods. For the up class, precision is low and recall is high: the model captures most of the periods when the index actually rises, but in doing so it generates many false alarms.
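
To make the last point concrete, the precision, recall and support values in the 20-day GBDT classification report can be turned back into approximate counts. This is only a rough sketch: the report rounds to two decimals, so the counts are approximate, and `implied_counts` is a helper name introduced here for illustration.

```python
def implied_counts(precision, recall, support):
    """Approximate counts implied by one row of a classification report."""
    true_positives = recall * support            # correctly predicted members of the class
    predicted = true_positives / precision if precision > 0 else float('nan')
    return {
        'true_positives': true_positives,
        'predicted': predicted,                   # how often the model predicted this class
        'false_alarms': predicted - true_positives,
        'missed': support - true_positives,       # class members the model failed to flag
    }

# Rows copied from the 20-day GBDT classification report above.
down = implied_counts(precision=0.76, recall=0.43, support=145)
up = implied_counts(precision=0.35, recall=0.81, support=69)

for name, counts in [('down', down), ('up', up)]:
    print(name, {key: round(value) for key, value in counts.items()})
```

Read this way, the up class produces far more false alarms than the down class, while the down class has far more missed cases.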