## PART 2: Modeling
In this notebook, I walk through the modeling process using the data prepared in Part 1. The goal is to see whether any of the models give useful forecasts by the end of the notebook.

In [1]:
import pandas as pd
import numpy as np

## Many scikit-learn packages to import
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [2]:
## First, we read the dataframes that we built in the last part (Part 1)
## Rule of thumb in ML: analyze only the training set (do not touch the validation or test sets unless explicitly stated)
Train_df = pd.read_csv('SET_Train.csv')
Validate_df = pd.read_csv('SET_Validate.csv')
Test_a_df = pd.read_csv('SET_Test_a.csv')
Test_b_df = pd.read_csv('SET_Test_b.csv')
Train_df.head()

Unnamed: 0,Date,Close,Volume,MA_diff_3,MA_diff_5,MA_diff_10,MA_diff_14,MA_diff_20,EMA_diff,MACD,...,STD_14,STD_20,Volume_Agent,Y_1,Y_3,Y_5,Y_10,Y_14,Y_20,Y_N_1
0,2008-01-02,842.97,1686634.0,0.59,5.874,0.657,0.112857,1.1225,0.846747,-0.843428,...,19.004899,16.768455,1,-1,-1,-1,-1,-1,-1,1
1,2008-01-03,832.63,2203218.0,-6.476667,-2.13,1.501,-0.558571,-0.6085,-0.271927,-1.046243,...,18.795705,16.548025,1,-1,-1,-1,-1,-1,-1,1
2,2008-01-04,821.71,2244205.0,-12.13,-3.898,0.781,-0.884286,-1.2365,-1.336735,-2.064332,...,18.863764,16.38875,1,-1,0,-1,-1,-1,-1,1
3,2008-01-07,808.31,1749336.0,-11.553333,-8.75,0.333,-1.77,-1.1405,-2.543061,-3.9074,...,19.579607,17.186561,1,0,-1,-1,-1,-1,0,1
4,2008-01-08,811.69,1746730.0,-6.98,-9.282,1.998,-1.765,-1.0825,-1.950755,-5.037241,...,19.783006,17.660359,0,1,-1,-1,-1,-1,0,0


In [3]:
Train_df['Y_1'].value_counts()

 0    1117
 1     720
-1     604
Name: Y_1, dtype: int64

In [4]:
Train_df['Y_3'].value_counts()

 1    1047
-1     765
 0     629
Name: Y_3, dtype: int64

In [5]:
Train_df['Y_5'].value_counts()

 1    1154
-1     830
 0     457
Name: Y_5, dtype: int64

In [6]:
Train_df['Y_10'].value_counts()

 1    1293
-1     846
 0     302
Name: Y_10, dtype: int64

In [7]:
Train_df['Y_14'].value_counts()

 1    1337
-1     851
 0     253
Name: Y_14, dtype: int64

In [8]:
Train_df['Y_20'].value_counts()

 1    1402
-1     831
 0     208
Name: Y_20, dtype: int64

In [9]:
pd.crosstab(Train_df['Y_1'], Train_df['Y_10']) # 40% in the diagonal

Y_1 (rows) / Y_10 (columns),-1,0,1
-1,319,68,217
0,368,168,581
1,159,66,495


In [10]:
pd.crosstab(Train_df['Y_3'], Train_df['Y_14']) # 53% in the diagonal

Y_3 (rows) / Y_14 (columns),-1,0,1
-1,455,80,230
0,203,87,339
1,193,86,768


In [11]:
pd.crosstab(Train_df['Y_5'], Train_df['Y_20']) # 59% in the diagonal

Y_5 (rows) / Y_20 (columns),-1,0,1
-1,483,59,288
0,152,74,231
1,196,75,883


### Observations
- For the one-day-ahead forecast, the most frequent class is "sideway", while the "up" and "down" classes are of roughly similar size.  
- For three-day-ahead forecasts and beyond (5, 10, 14 and 20 days), "sideway" becomes the least frequent class while "up" is the most frequent of the three.
- In the cross-tabs, forecasts at different horizons carry the same label roughly 40% to 59% of the time.
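
The diagonal shares quoted in the cross-tab comments can be recomputed directly from the tables. A minimal sketch, with the counts copied from the three cross-tabs above (the helper name `diagonal_share` is mine and not part of the notebook):

```python
import numpy as np

# Counts copied from the cross-tabs above; rows and columns are ordered -1, 0, 1.
crosstabs = {
    "Y_1 vs Y_10": np.array([[319, 68, 217], [368, 168, 581], [159, 66, 495]]),
    "Y_3 vs Y_14": np.array([[455, 80, 230], [203, 87, 339], [193, 86, 768]]),
    "Y_5 vs Y_20": np.array([[483, 59, 288], [152, 74, 231], [196, 75, 883]]),
}

def diagonal_share(table):
    """Fraction of rows where both horizons carry the same label."""
    return np.trace(table) / table.sum()

shares = {name: diagonal_share(table) for name, table in crosstabs.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The agreement rises as the two horizons get closer to each other in relative terms, which fits the idea that longer-horizon labels are dominated by the same underlying trend.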

In [23]:
Train_df.describe()

Unnamed: 0,Close,Volume,MA_diff_3,MA_diff_5,MA_diff_10,MA_diff_14,MA_diff_20,EMA_diff,MACD,MACD_diff,...,CCI_10,CCI_14,CCI_20,STD_3,STD_5,STD_10,STD_14,STD_20,Volume_Agent,Y_N_1
count,2441.0,2441.0,2441.0,2441.0,2441.0,2441.0,2441.0,2441.0,2441.0,2441.0,...,2441.0,2441.0,2441.0,2441.0,2441.0,2441.0,2441.0,2441.0,2441.0,2441.0
mean,1178.293695,7406376.0,0.368501,0.372335,0.374966,0.370038,0.36508,0.367274,2.516408,0.006069,...,14.350821,16.589049,18.845889,7.97656,10.362944,14.606987,17.237003,20.663256,0.463335,-0.045063
std,362.904923,5205370.0,7.243376,5.611197,3.8899,3.30559,2.799292,2.867838,14.407065,1.585529,...,79.799658,83.666191,86.737375,6.050765,6.864183,8.559796,9.550124,10.795102,0.498756,0.735249
min,384.15,905657.4,-41.843333,-27.618,-18.487,-14.585,-11.747,-13.808023,-53.049231,-7.719244,...,-178.925342,-200.91877,-238.933328,0.09,0.826819,2.191589,2.476709,3.225144,0.0,-1.0
25%,845.83,3665487.0,-3.313333,-2.506,-1.796,-1.393571,-1.1785,-1.170292,-5.620914,-0.810094,...,-50.163099,-50.84872,-52.379445,3.937795,5.911622,8.768859,10.601046,13.430422,0.0,-1.0
50%,1289.07,6316200.0,1.01,0.884,0.813,0.810714,0.7215,0.792966,4.267453,0.124999,...,26.61549,31.863177,34.632393,6.648301,8.690091,12.464816,15.033639,18.629973,0.0,0.0
75%,1497.98,9743239.0,4.763333,3.946,2.868,2.652143,2.3525,2.315792,12.987342,0.877887,...,83.083804,87.948493,90.336189,10.092078,12.503994,18.074873,21.068076,25.566598,1.0,0.0
max,1753.71,52941460.0,27.453333,21.594,17.557,13.911429,8.057,10.147395,30.595982,6.960438,...,174.382674,207.239309,241.354532,50.284455,53.863507,58.286524,67.345369,73.995076,1.0,1.0


### Observations
(For MA and EMA, most of the values are very close to each other, as expected.)  
- For "RSI", the average value is 55, with min = 1.2 and max = 99 (as expected, because overbought and oversold periods should roughly cancel out over time).   
- For "MACD", the average value is 2.5 and the median is 4.3, indicating a left-skewed distribution (a tail chance that the short-term trend falls far below the long-term trend).  
- For MOM1 - MOM14, the mean is very close to zero, indicating that if you blindly traded the index every day, the average return you should expect is close to zero.  
- For CCI_20, the average value is 18.8 while the median is 34.6, also indicating a left-skewed distribution (a tail chance that the index falls more than 2 SD below its trend).  
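
The skew reading above relies on the rule of thumb that a mean below the median suggests a longer left tail. A small sketch on synthetic data (not the SET data) illustrates the pattern; the sample and its seed are assumptions chosen only for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic left-skewed sample: the negative of an exponential draw.
rng = np.random.default_rng(0)
sample = pd.Series(-rng.exponential(scale=1.0, size=10_000))

mean, median, skew = sample.mean(), sample.median(), sample.skew()
print(f"mean={mean:.3f}, median={median:.3f}, skew={skew:.3f}")
```

On the notebook's data, the same check would be a call such as `Train_df['MACD'].skew()`; a negative value would support the left-skew reading.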


In [3]:
Train_y_1 = Train_df['Y_1']
Train_y_3 = Train_df['Y_3']
Train_y_5 = Train_df['Y_5']
Train_y_10 = Train_df['Y_10']
Train_y_14 = Train_df['Y_14']
Train_y_20 = Train_df['Y_20']

Test_y_1 = Validate_df['Y_1']
Test_y_3 = Validate_df['Y_3']
Test_y_5 = Validate_df['Y_5']
Test_y_10 = Validate_df['Y_10']
Test_y_14 = Validate_df['Y_14']
Test_y_20 = Validate_df['Y_20']

In [4]:
Train_df.drop(['Y_1', 'Y_3', 'Y_5', 'Y_10', 'Y_14', 'Y_20'], axis = 1, inplace = True)
Validate_df.drop(['Y_1', 'Y_3', 'Y_5', 'Y_10', 'Y_14', 'Y_20'], axis = 1, inplace = True)

#### The targets (y) are now defined for every horizon; the features (x) are defined as we go. (As a reminder to myself, the y columns are removed from the dataframes first.)

In [51]:
## First, build the most fundamental benchmark model: dummy classifiers
## I will build two versions: a most-frequent version and a stratified version
## The most-frequent version is used to measure accuracy on the validation set
## The stratified version is for reporting an accuracy table -> to be done later if needed

clf_dummy_mf_1 = DummyClassifier(strategy="most_frequent").fit(Train_df, Train_y_1)
print('Dummy classifier (most frequent) training accuracy on 1 day ahead:', clf_dummy_mf_1.score(Train_df, Train_y_1))
print('Dummy classifier (most frequent) prediction accuracy on 1 day ahead:', clf_dummy_mf_1.score(Validate_df, Test_y_1))

clf_dummy_mf_3 = DummyClassifier(strategy="most_frequent").fit(Train_df, Train_y_3)
print('Dummy classifier (most frequent) training accuracy on 3 days ahead:', clf_dummy_mf_3.score(Train_df, Train_y_3))
print('Dummy classifier (most frequent) prediction accuracy on 3 days ahead:', clf_dummy_mf_3.score(Validate_df, Test_y_3))

clf_dummy_mf_5 = DummyClassifier(strategy="most_frequent").fit(Train_df, Train_y_5)
print('Dummy classifier (most frequent) training accuracy on 5 days ahead:', clf_dummy_mf_5.score(Train_df, Train_y_5))
print('Dummy classifier (most frequent) prediction accuracy on 5 days ahead:', clf_dummy_mf_5.score(Validate_df, Test_y_5))

clf_dummy_mf_10 = DummyClassifier(strategy="most_frequent").fit(Train_df, Train_y_10)
print('Dummy classifier (most frequent) training accuracy on 10 days ahead:', clf_dummy_mf_10.score(Train_df, Train_y_10))
print('Dummy classifier (most frequent) prediction accuracy on 10 days ahead:', clf_dummy_mf_10.score(Validate_df, Test_y_10))

clf_dummy_mf_14 = DummyClassifier(strategy="most_frequent").fit(Train_df, Train_y_14)
print('Dummy classifier (most frequent) training accuracy on 14 days ahead:', clf_dummy_mf_14.score(Train_df, Train_y_14))
print('Dummy classifier (most frequent) prediction accuracy on 14 days ahead:', clf_dummy_mf_14.score(Validate_df, Test_y_14))

clf_dummy_mf_20 = DummyClassifier(strategy="most_frequent").fit(Train_df, Train_y_20)
print('Dummy classifier (most frequent) training accuracy on 20 days ahead:', clf_dummy_mf_20.score(Train_df, Train_y_20))
print('Dummy classifier (most frequent) prediction accuracy on 20 days ahead:', clf_dummy_mf_20.score(Validate_df, Test_y_20))


Dummy classifier (most frequent) training accuracy on 1 day ahead: 0.45759934453092993
Dummy classifier (most frequent) prediction accuracy on 1 day ahead: 0.5591836734693878
Dummy classifier (most frequent) training accuracy on 3 days ahead: 0.42892257271609996
Dummy classifier (most frequent) prediction accuracy on 3 days ahead: 0.3224489795918367
Dummy classifier (most frequent) training accuracy on 5 days ahead: 0.4727570667759115
Dummy classifier (most frequent) prediction accuracy on 5 days ahead: 0.34285714285714286
Dummy classifier (most frequent) training accuracy on 10 days ahead: 0.5297009422367882
Dummy classifier (most frequent) prediction accuracy on 10 days ahead: 0.3346938775510204
Dummy classifier (most frequent) training accuracy on 14 days ahead: 0.5477263416632527
Dummy classifier (most frequent) prediction accuracy on 14 days ahead: 0.2938775510204082
Dummy classifier (most frequent) training accuracy on 20 days ahead: 0.5743547726341663
Dummy classifier (most freq
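
The training accuracy of the most-frequent dummy classifier is simply the share of the majority class. A minimal sketch, rebuilding the `Y_1` labels from the value counts reported earlier (1117 / 720 / 604) and using a placeholder feature matrix, since `DummyClassifier` ignores the features:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Rebuild the Y_1 training labels from the value counts shown earlier.
y = np.repeat([0, 1, -1], [1117, 720, 604])
X = np.zeros((len(y), 1))  # placeholder; the dummy classifier never looks at X

clf = DummyClassifier(strategy="most_frequent").fit(X, y)
train_accuracy = clf.score(X, y)
majority_share = 1117 / len(y)
print(train_accuracy, majority_share)
```

The validation accuracy differs from this number because the class mix in the validation set is not the same as in the training set.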

In [6]:
## Second, build a slightly smarter version: logistic regression on a lagged dependent variable
## For visualizing feature importances, use standardized parameters instead
## Source: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
X_train = Train_df[['Y_N_1']]
X_test = Validate_df[['Y_N_1']]
clf_LR_1 = LogisticRegression(random_state=0).fit(X_train, Train_y_1)
print('Logistic regression on lagged training accuracy on 1 day ahead:', clf_LR_1.score(X_train, Train_y_1))
print('Logistic regression on lagged prediction accuracy on 1 day ahead:', clf_LR_1.score(X_test, Test_y_1))

clf_LR_3 = LogisticRegression(random_state=0).fit(X_train, Train_y_3)
print('Logistic regression on lagged training accuracy on 3 days ahead:', clf_LR_3.score(X_train, Train_y_3))
print('Logistic regression on lagged prediction accuracy on 3 days ahead:', clf_LR_3.score(X_test, Test_y_3))

clf_LR_5 = LogisticRegression(random_state=0).fit(X_train, Train_y_5)
print('Logistic regression on lagged training accuracy on 5 days ahead:', clf_LR_5.score(X_train, Train_y_5))
print('Logistic regression on lagged prediction accuracy on 5 days ahead:', clf_LR_5.score(X_test, Test_y_5))

clf_LR_10 = LogisticRegression(random_state=0).fit(X_train, Train_y_10)
print('Logistic regression on lagged training accuracy on 10 days ahead:', clf_LR_10.score(X_train, Train_y_10))
print('Logistic regression on lagged prediction accuracy on 10 days ahead:', clf_LR_10.score(X_test, Test_y_10))

clf_LR_14 = LogisticRegression(random_state=0).fit(X_train, Train_y_14)
print('Logistic regression on lagged training accuracy on 14 days ahead:', clf_LR_14.score(X_train, Train_y_14))
print('Logistic regression on lagged prediction accuracy on 14 days ahead:', clf_LR_14.score(X_test, Test_y_14))

clf_LR_20 = LogisticRegression(random_state=0).fit(X_train, Train_y_20)
print('Logistic regression on lagged training accuracy on 20 days ahead:', clf_LR_20.score(X_train, Train_y_20))
print('Logistic regression on lagged prediction accuracy on 20 days ahead:', clf_LR_20.score(X_test, Test_y_20))


Logistic regression on lagged training accuracy on 1 day ahead: 0.45759934453092993
Logistic regression on lagged prediction accuracy on 1 day ahead: 0.5591836734693878
Logistic regression on lagged training accuracy on 3 days ahead: 0.42892257271609996
Logistic regression on lagged prediction accuracy on 3 days ahead: 0.3224489795918367
Logistic regression on lagged training accuracy on 5 days ahead: 0.4727570667759115
Logistic regression on lagged prediction accuracy on 5 days ahead: 0.34285714285714286
Logistic regression on lagged training accuracy on 10 days ahead: 0.5297009422367882
Logistic regression on lagged prediction accuracy on 10 days ahead: 0.3346938775510204
Logistic regression on lagged training accuracy on 14 days ahead: 0.5477263416632527
Logistic regression on lagged prediction accuracy on 14 days ahead: 0.2938775510204082
Logistic regression on lagged training accuracy on 20 days ahead: 0.5743547726341663
Logistic regression on lagged prediction accuracy on 20 days ahe
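
The accuracies above are identical to the dummy classifier's, which is what one would see if the fitted model predicts the majority class for every value of `Y_N_1`. I have not verified this on the actual data; the sketch below only shows the mechanism on a constructed toy set in which the single feature carries no information about the label:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (an assumption, not the SET data): for each feature value the label
# mix is the same (5 zeros, 3 ones, 2 minus-ones), so the feature is uninformative.
labels_per_value = np.repeat([0, 1, -1], [5, 3, 2])
X = np.repeat([-1, 0, 1], len(labels_per_value)).reshape(-1, 1)
y = np.tile(labels_per_value, 3)

clf = LogisticRegression(random_state=0).fit(X, y)
predicted_classes = set(clf.predict(X).tolist())
accuracy = clf.score(X, y)
print(predicted_classes, accuracy)
```

On the notebook's models, `np.unique(clf_LR_1.predict(X_test))` would confirm or refute this reading.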

In [5]:
X_train_1 = Train_df[['MACD', 'MOM1', 'Volume_Agent']]
X_train_3 = Train_df[['MACD', 'MOM3', 'STD_3', 'Volume_Agent']]
X_train_5 = Train_df[[ 'MACD', 'MOM5', 'STD_5', 'Volume_Agent']]
X_train_10 = Train_df[[ 'MACD', 'MOM10', 'STD_10', 'Volume_Agent']]
X_train_14 = Train_df[['MACD', 'MOM14', 'STD_14', 'Volume_Agent']]
X_train_20 = Train_df[['MACD', 'MOM20', 'STD_20', 'Volume_Agent']]

In [6]:
X_test_1 = Validate_df[['MACD', 'MOM1', 'Volume_Agent']]
X_test_3 = Validate_df[['MACD', 'MOM3', 'STD_3', 'Volume_Agent']]
X_test_5 = Validate_df[['MACD', 'MOM5', 'STD_5', 'Volume_Agent']]
X_test_10 = Validate_df[[ 'MACD', 'MOM10', 'STD_10', 'Volume_Agent']]
X_test_14 = Validate_df[['MACD', 'MOM14', 'STD_14', 'Volume_Agent']]
X_test_20 = Validate_df[['MACD', 'MOM20', 'STD_20', 'Volume_Agent']]
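
The twelve assignments above follow one pattern: `MACD` and `Volume_Agent` are always included, `MOM<h>` matches the horizon, and `STD_<h>` is added for every horizon except 1. A sketch of a helper that generates the same column lists (the name `feature_columns` is hypothetical):

```python
def feature_columns(horizon):
    """Column names used as features for a given forecast horizon (in days)."""
    columns = ["MACD", f"MOM{horizon}"]
    if horizon != 1:  # the 1-day feature set in the notebook has no STD column
        columns.append(f"STD_{horizon}")
    columns.append("Volume_Agent")
    return columns

horizons = [1, 3, 5, 10, 14, 20]
feature_sets = {h: feature_columns(h) for h in horizons}
print(feature_sets[1])
print(feature_sets[3])
```

With this, `Train_df[feature_sets[h]]` would select the same columns as the corresponding `X_train_<h>` above.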

In [27]:
# Run logistic regression on the per-horizon feature sets defined above (3 to 4 features each)

clf_LR_all_1 = LogisticRegression(random_state=0, max_iter = 1000).fit(X_train_1, Train_y_1)
print('Logistic regression (all features) training accuracy on 1 day ahead:', clf_LR_all_1.score(X_train_1, Train_y_1))
print('Logistic regression (all features) prediction accuracy on 1 day ahead:', clf_LR_all_1.score(X_test_1, Test_y_1))

clf_LR_all_3 = LogisticRegression(random_state=0, max_iter = 1000).fit(X_train_3, Train_y_3)
print('Logistic regression (all features)  training accuracy on 3 days ahead:', clf_LR_all_3.score(X_train_3, Train_y_3))
print('Logistic regression (all features) prediction accuracy on 3 days ahead:', clf_LR_all_3.score(X_test_3, Test_y_3))

clf_LR_all_5 = LogisticRegression(random_state=0, max_iter = 1000).fit(X_train_5, Train_y_5)
print('Logistic regression (all features) training accuracy on 5 days ahead:', clf_LR_all_5.score(X_train_5, Train_y_5))
print('Logistic regression (all features) prediction accuracy on 5 days ahead:', clf_LR_all_5.score(X_test_5, Test_y_5))

Logistic regression (all features) training accuracy on 1 day ahead: 0.4756247439573945
Logistic regression (all features) prediction accuracy on 1 day ahead: 0.5387755102040817
Logistic regression (all features)  training accuracy on 3 days ahead: 0.4244162228594838
Logistic regression (all features) prediction accuracy on 3 days ahead: 0.3142857142857143
Logistic regression (all features) training accuracy on 5 days ahead: 0.46907005325686196
Logistic regression (all features) prediction accuracy on 5 days ahead: 0.3224489795918367


In [28]:
# Run logistic regression on the per-horizon feature sets defined above (4 features each)

clf_LR_all_10 = LogisticRegression(random_state=0, max_iter = 1000).fit(X_train_10, Train_y_10)
print('Logistic regression (all features) training accuracy on 10 day ahead:', clf_LR_all_10.score(X_train_10, Train_y_10))
print('Logistic regression (all features) prediction accuracy on 10 day ahead:', clf_LR_all_10.score(X_test_10, Test_y_10))

clf_LR_all_14 = LogisticRegression(random_state=0, max_iter = 1000).fit(X_train_14, Train_y_14)
print('Logistic regression (all features)  training accuracy on 14 days ahead:', clf_LR_all_14.score(X_train_14, Train_y_14))
print('Logistic regression (all features) prediction accuracy on 14 days ahead:', clf_LR_all_14.score(X_test_14, Test_y_14))

clf_LR_all_20 = LogisticRegression(random_state=0, max_iter = 1000).fit(X_train_20, Train_y_20)
print('Logistic regression (all features) training accuracy on 20 days ahead:', clf_LR_all_20.score(X_train_20, Train_y_20))
print('Logistic regression (all features) prediction accuracy on 20 days ahead:', clf_LR_all_20.score(X_test_20, Test_y_20))

Logistic regression (all features) training accuracy on 10 day ahead: 0.5198689061859894
Logistic regression (all features) prediction accuracy on 10 day ahead: 0.3183673469387755
Logistic regression (all features)  training accuracy on 14 days ahead: 0.5497746825071692
Logistic regression (all features) prediction accuracy on 14 days ahead: 0.2938775510204082
Logistic regression (all features) training accuracy on 20 days ahead: 0.5764031134780827
Logistic regression (all features) prediction accuracy on 20 days ahead: 0.2816326530612245
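
To make the comparison with the baseline explicit, the validation accuracies printed in this notebook can be put side by side. The numbers below are copied from the outputs above and rounded to four decimals; the 20-day horizon is left out because the dummy classifier's validation output is cut off:

```python
import pandas as pd

# Validation accuracies copied from the printed outputs (rounded to 4 decimals).
validation = pd.DataFrame(
    {
        "dummy_most_frequent": [0.5592, 0.3224, 0.3429, 0.3347, 0.2939],
        "logistic_regression": [0.5388, 0.3143, 0.3224, 0.3184, 0.2939],
    },
    index=pd.Index([1, 3, 5, 10, 14], name="horizon_days"),
)
validation["difference"] = (
    validation["logistic_regression"] - validation["dummy_most_frequent"]
)
print(validation)
```

On these figures the logistic regression does not beat the most-frequent baseline on the validation set at any of the five horizons.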


In [9]:
import statsmodels.api as sm
# Multinomial logit on the 3-day-ahead target; add_constant adds the intercept column
logit_model3 = sm.MNLogit(Train_y_3, sm.add_constant(X_train_3))
result3 = logit_model3.fit()
stats31 = result3.summary()
stats32 = result3.summary2()
print(stats31)
print(stats32)

Optimization terminated successfully.
         Current function value: 1.067194
         Iterations 5
                          MNLogit Regression Results                          
Dep. Variable:                    Y_3   No. Observations:                 2441
Model:                        MNLogit   Df Residuals:                     2431
Method:                           MLE   Df Model:                            8
Date:                Sat, 13 Feb 2021   Pseudo R-squ.:                0.008302
Time:                        15:23:45   Log-Likelihood:                -2605.0
converged:                       True   LL-Null:                       -2626.8
Covariance Type:            nonrobust   LLR p-value:                 6.729e-07
       Y_3=0       coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
const           -0.1569      0.107     -1.471      0.141      -0.366       0.052
MACD             0.0158

In [10]:
import statsmodels.api as sm
# Multinomial logit on the 5-day-ahead target; add_constant adds the intercept column
logit_model5 = sm.MNLogit(Train_y_5, sm.add_constant(X_train_5))
result5 = logit_model5.fit()
stats51 = result5.summary()
stats52 = result5.summary2()
print(stats51)
print(stats52)

Optimization terminated successfully.
         Current function value: 1.027787
         Iterations 6
                          MNLogit Regression Results                          
Dep. Variable:                    Y_5   No. Observations:                 2441
Model:                        MNLogit   Df Residuals:                     2431
Method:                           MLE   Df Model:                            8
Date:                Sat, 13 Feb 2021   Pseudo R-squ.:                0.006637
Time:                        15:24:00   Log-Likelihood:                -2508.8
converged:                       True   LL-Null:                       -2525.6
Covariance Type:            nonrobust   LLR p-value:                 4.948e-05
       Y_5=0       coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
const           -0.6111      0.125     -4.871      0.000      -0.857      -0.365
MACD             0.0192

In [11]:
import statsmodels.api as sm
# Multinomial logit on the 10-day-ahead target; add_constant adds the intercept column
logit_model10 = sm.MNLogit(Train_y_10, sm.add_constant(X_train_10))
result10 = logit_model10.fit()
stats1 = result10.summary()
stats2 = result10.summary2()
print(stats1)
print(stats2)

Optimization terminated successfully.
         Current function value: 0.954845
         Iterations 6
                          MNLogit Regression Results                          
Dep. Variable:                   Y_10   No. Observations:                 2441
Model:                        MNLogit   Df Residuals:                     2431
Method:                           MLE   Df Model:                            8
Date:                Sat, 13 Feb 2021   Pseudo R-squ.:                0.007837
Time:                        15:24:14   Log-Likelihood:                -2330.8
converged:                       True   LL-Null:                       -2349.2
Covariance Type:            nonrobust   LLR p-value:                 1.241e-05
      Y_10=0       coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
const           -1.0239      0.155     -6.605      0.000      -1.328      -0.720
MACD             0.0104

In [12]:
import statsmodels.api as sm
# Multinomial logit on the 14-day-ahead target; add_constant adds the intercept column
logit_model14 = sm.MNLogit(Train_y_14, sm.add_constant(X_train_14))
result14 = logit_model14.fit()
stats1 = result14.summary()
stats2 = result14.summary2()
print(stats1)
print(stats2)

Optimization terminated successfully.
         Current function value: 0.925881
         Iterations 6
                          MNLogit Regression Results                          
Dep. Variable:                   Y_14   No. Observations:                 2441
Model:                        MNLogit   Df Residuals:                     2431
Method:                           MLE   Df Model:                            8
Date:                Sat, 13 Feb 2021   Pseudo R-squ.:                0.006597
Time:                        15:24:31   Log-Likelihood:                -2260.1
converged:                       True   LL-Null:                       -2275.1
Covariance Type:            nonrobust   LLR p-value:                 0.0002100
      Y_14=0       coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
const           -0.7437      0.177     -4.203      0.000      -1.091      -0.397
MACD             0.0046

In [13]:
import statsmodels.api as sm
# Multinomial logit on the 20-day-ahead target; add_constant adds the intercept column
logit_model20 = sm.MNLogit(Train_y_20, sm.add_constant(X_train_20))
result20 = logit_model20.fit()
stats1 = result20.summary()
stats2 = result20.summary2()
print(stats1)
print(stats2)

Optimization terminated successfully.
         Current function value: 0.888919
         Iterations 7
                          MNLogit Regression Results                          
Dep. Variable:                   Y_20   No. Observations:                 2441
Model:                        MNLogit   Df Residuals:                     2431
Method:                           MLE   Df Model:                            8
Date:                Sat, 13 Feb 2021   Pseudo R-squ.:                0.006968
Time:                        15:24:40   Log-Likelihood:                -2169.9
converged:                       True   LL-Null:                       -2185.1
Covariance Type:            nonrobust   LLR p-value:                 0.0001758
      Y_20=0       coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
const           -0.9573      0.203     -4.719      0.000      -1.355      -0.560
MACD             0.0182
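
The `Pseudo R-squ.` reported in the summaries is McFadden's pseudo R-squared, 1 - LL / LL-Null. A quick check using the rounded log-likelihoods printed above (so agreement is only to about four decimals):

```python
# (Log-Likelihood, LL-Null, reported Pseudo R-squ.) copied from the summaries above.
summaries = {
    "Y_3": (-2605.0, -2626.8, 0.008302),
    "Y_5": (-2508.8, -2525.6, 0.006637),
    "Y_10": (-2330.8, -2349.2, 0.007837),
    "Y_14": (-2260.1, -2275.1, 0.006597),
    "Y_20": (-2169.9, -2185.1, 0.006968),
}

def mcfadden_r2(llf, llnull):
    """McFadden's pseudo R-squared from the fitted and null log-likelihoods."""
    return 1.0 - llf / llnull

recomputed = {
    name: mcfadden_r2(llf, llnull) for name, (llf, llnull, _) in summaries.items()
}
for name, value in recomputed.items():
    print(f"{name}: recomputed={value:.6f}, reported={summaries[name][2]:.6f}")
```

All five values are below 0.01, so the fitted models improve only slightly on an intercept-only model, even though the likelihood-ratio tests reject the null.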

In [26]:
#https://www.statsmodels.org/devel/generated/statsmodels.discrete.discrete_model.LogitResults.get_margeff.html
result3.get_margeff(at = 'overall').summary()
#result3.get_margeff(at = 'mean').summary()
result3.get_margeff(at = 'median').summary()

0,1
Dep. Variable:,Y_3
Method:,dydx
At:,median

Y_3=-1,dy/dx,std err,z,P>|z|,[0.025,0.975]
MACD,-0.0025,0.001,-3.589,0.000,-0.004,-0.001
MOM3,-0.7531,0.430,-1.753,0.080,-1.595,0.089
STD_3,5.169e-05,0.002,0.031,0.975,-0.003,0.003
Volume_Agent,0.0026,0.019,0.138,0.890,-0.035,0.040
Y_3=0,dy/dx,std err,z,P>|z|,[0.025,0.975]
MACD,0.0020,0.001,2.853,0.004,0.001,0.003
MOM3,0.9119,0.454,2.007,0.045,0.022,1.802
STD_3,-0.0045,0.002,-2.600,0.009,-0.008,-0.001
Volume_Agent,0.0254,0.017,1.462,0.144,-0.009,0.060
Y_3=1,dy/dx,std err,z,P>|z|,[0.025,0.975]


In [24]:
#https://www.statsmodels.org/devel/generated/statsmodels.discrete.discrete_model.LogitResults.get_margeff.html
result5.get_margeff(at = 'overall').summary()
result5.get_margeff(at = 'mean').summary()
result5.get_margeff(at = 'median').summary()

0,1
Dep. Variable:,Y_5
Method:,dydx
At:,median

Y_5=-1,dy/dx,std err,z,P>|z|,[0.025,0.975]
MACD,-0.0035,0.001,-4.742,0.000,-0.005,-0.002
MOM5,0.3108,0.356,0.874,0.382,-0.386,1.008
STD_5,-0.0006,0.002,-0.372,0.710,-0.004,0.002
Volume_Agent,-0.0051,0.020,-0.255,0.799,-0.044,0.034
Y_5=0,dy/dx,std err,z,P>|z|,[0.025,0.975]
MACD,0.0016,0.001,2.443,0.015,0.000,0.003
MOM5,0.2328,0.319,0.729,0.466,-0.393,0.859
STD_5,-0.0017,0.001,-1.221,0.222,-0.004,0.001
Volume_Agent,0.0112,0.016,0.700,0.484,-0.020,0.043
Y_5=1,dy/dx,std err,z,P>|z|,[0.025,0.975]


In [22]:
#https://www.statsmodels.org/devel/generated/statsmodels.discrete.discrete_model.LogitResults.get_margeff.html
result10.get_margeff(at = 'overall').summary()
#result10.get_margeff(at = 'mean').summary()
#result10.get_margeff(at = 'median').summary()

0,1
Dep. Variable:,Y_10
Method:,dydx
At:,overall

Y_10=-1,dy/dx,std err,z,P>|z|,[0.025,0.975]
MACD,-0.0007,0.001,-0.831,0.406,-0.003,0.001
MOM10,-1.1170,0.309,-3.618,0.000,-1.722,-0.512
STD_10,0.0001,0.001,0.098,0.922,-0.002,0.003
Volume_Agent,0.0117,0.020,0.597,0.551,-0.027,0.050
Y_10=0,dy/dx,std err,z,P>|z|,[0.025,0.975]
MACD,0.0010,0.001,1.607,0.108,-0.000,0.002
MOM10,-0.1182,0.217,-0.546,0.585,-0.543,0.307
STD_10,0.0007,0.001,0.790,0.430,-0.001,0.002
Volume_Agent,-0.0268,0.014,-1.941,0.052,-0.054,0.000
Y_10=1,dy/dx,std err,z,P>|z|,[0.025,0.975]


In [28]:
#https://www.statsmodels.org/devel/generated/statsmodels.discrete.discrete_model.LogitResults.get_margeff.html
result14.get_margeff(at = 'overall').summary()
result14.get_margeff(at = 'mean').summary()
result14.get_margeff(at = 'median').summary()

0,1
Dep. Variable:,Y_14
Method:,dydx
At:,median

Y_14=-1,dy/dx,std err,z,P>|z|,[0.025,0.975]
MACD,0.0003,0.001,0.318,0.751,-0.002,0.002
MOM14,-0.7428,0.301,-2.469,0.014,-1.332,-0.153
STD_14,0.0012,0.001,1.110,0.267,-0.001,0.003
Volume_Agent,0.0003,0.020,0.015,0.988,-0.038,0.039
Y_14=0,dy/dx,std err,z,P>|z|,[0.025,0.975]
MACD,0.0006,0.001,0.835,0.404,-0.001,0.002
MOM14,-0.2640,0.228,-1.156,0.248,-0.712,0.184
STD_14,-0.0023,0.001,-2.587,0.010,-0.004,-0.001
Volume_Agent,-0.0236,0.015,-1.554,0.120,-0.053,0.006
Y_14=1,dy/dx,std err,z,P>|z|,[0.025,0.975]


In [30]:
#https://www.statsmodels.org/devel/generated/statsmodels.discrete.discrete_model.LogitResults.get_margeff.html
result20.get_margeff(at = 'overall').summary()
result20.get_margeff(at = 'mean').summary()
result20.get_margeff(at = 'median').summary()

0,1
Dep. Variable:,Y_20
Method:,dydx
At:,median

Y_20=-1,dy/dx,std err,z,P>|z|,[0.025,0.975]
MACD,-0.0006,0.001,-0.491,0.623,-0.003,0.002
MOM20,-0.4436,0.302,-1.470,0.141,-1.035,0.148
STD_20,0.0011,0.001,1.130,0.259,-0.001,0.003
Volume_Agent,-0.0142,0.020,-0.722,0.470,-0.053,0.024
Y_20=0,dy/dx,std err,z,P>|z|,[0.025,0.975]
MACD,0.0015,0.001,1.777,0.076,-0.000,0.003
MOM20,-0.1411,0.210,-0.672,0.502,-0.553,0.271
STD_20,-0.0018,0.001,-2.379,0.017,-0.003,-0.000
Volume_Agent,-0.0160,0.013,-1.184,0.237,-0.042,0.010
Y_20=1,dy/dx,std err,z,P>|z|,[0.025,0.975]
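
In each table above, the block for the last class is cut off. One property that holds for any multinomial logit is that the marginal effects of a feature on the class probabilities sum to zero across classes, because the probabilities always sum to one. The sketch below checks this numerically with made-up coefficients (not the fitted ones):

```python
import numpy as np

def class_probabilities(x, coefs, intercepts):
    """Softmax probabilities of a multinomial logit at feature vector x."""
    scores = intercepts + coefs @ x
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()

# Hypothetical parameters: 3 classes, 2 features; the first class is the reference.
coefs = np.array([[0.0, 0.0], [0.4, -1.2], [-0.3, 0.8]])
intercepts = np.array([0.0, -0.2, 0.5])
x = np.array([0.5, -1.0])

# Central finite differences approximate dP(class)/dx for each feature.
eps = 1e-6
marginal_effects = np.column_stack([
    (class_probabilities(x + eps * e, coefs, intercepts)
     - class_probabilities(x - eps * e, coefs, intercepts)) / (2 * eps)
    for e in np.eye(len(x))
])
print(marginal_effects)              # rows: classes, columns: features
print(marginal_effects.sum(axis=0))  # approximately zero for every feature
```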


In [41]:
clf_LR_all_1.coef_

array([[-0.00943621, -0.82353974, -0.11395149],
       [ 0.01331613,  1.34686485,  0.11040158],
       [-0.00387992, -0.5233251 ,  0.00354991]])

In [42]:
clf_LR_all_3.coef_

array([[-0.00874855, -0.6401581 ,  0.00370314, -0.01456536],
       [ 0.00807165,  0.66880405, -0.0162707 ,  0.09555339],
       [ 0.0006769 , -0.02864595,  0.01256756, -0.08098803]])

In [43]:
clf_LR_all_5.coef_

array([[-1.11344693e-02,  3.45653536e-01, -1.18957408e-05,
        -2.33130146e-02],
       [ 8.28640734e-03,  2.02705731e-01, -7.33126645e-03,
         5.49183085e-02],
       [ 2.84806199e-03, -5.48359267e-01,  7.34316220e-03,
        -3.16052939e-02]])

In [44]:
clf_LR_all_10.coef_

array([[-6.13223219e-03, -1.33476835e+00, -7.05186098e-04,
         7.41928203e-02],
       [ 6.06637562e-03, -1.27543462e-01,  3.86049359e-03,
        -1.65605056e-01],
       [ 6.58565660e-05,  1.46231181e+00, -3.15530749e-03,
         9.14122353e-02]])

In [45]:
clf_LR_all_14.coef_

array([[-1.54993192e-03, -9.46011261e-01,  8.48112694e-03,
         5.24079221e-02],
       [ 1.30535769e-03, -2.57262557e-01, -1.55407677e-02,
        -1.55903002e-01],
       [ 2.44574225e-04,  1.20327382e+00,  7.05964081e-03,
         1.03495079e-01]])

In [46]:
clf_LR_all_20.coef_

array([[-0.00672221, -0.55903002,  0.00800641,  0.00988758],
       [ 0.00955596, -0.13962922, -0.01415543, -0.11717878],
       [-0.00283375,  0.69865924,  0.00614901,  0.10729121]])
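
The raw `coef_` arrays are easier to read with labels. In scikit-learn, rows follow `classes_` (here sorted as -1, 0, 1) and columns follow the order of the feature columns passed to `fit`. A sketch using the 1-day coefficients printed above:

```python
import numpy as np
import pandas as pd

# Values copied from clf_LR_all_1.coef_ above.
coef_1 = np.array([
    [-0.00943621, -0.82353974, -0.11395149],
    [0.01331613, 1.34686485, 0.11040158],
    [-0.00387992, -0.5233251, 0.00354991],
])

coef_table = pd.DataFrame(
    coef_1,
    index=pd.Index([-1, 0, 1], name="class"),
    columns=["MACD", "MOM1", "Volume_Agent"],
)
print(coef_table)
print(coef_table.sum(axis=0))  # each column sums to roughly zero
```

Each column summing to about zero is consistent with a symmetric softmax parameterization, so it is the differences between rows, rather than the individual values, that carry the interpretation. The features are not standardized, so magnitudes across columns are not directly comparable.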

In [47]:
# Use test data from 2019 to report in paper
Test_y_1 = Test_a_df['Y_1']
Test_y_3 = Test_a_df['Y_3']
Test_y_5 = Test_a_df['Y_5']
Test_y_10 = Test_a_df['Y_10']
Test_y_14 = Test_a_df['Y_14']
Test_y_20 = Test_a_df['Y_20']

X_test_1 = Test_a_df[['MACD', 'MOM1', 'Volume_Agent']]
X_test_3 =Test_a_df[['MACD', 'MOM3', 'STD_3', 'Volume_Agent']]
X_test_5 = Test_a_df[['MACD', 'MOM5', 'STD_5', 'Volume_Agent']]
X_test_10 = Test_a_df[['MACD', 'MOM10', 'STD_10', 'Volume_Agent']]
X_test_14 = Test_a_df[['MACD', 'MOM14', 'STD_14', 'Volume_Agent']]
X_test_20 = Test_a_df[['MACD', 'MOM20', 'STD_20', 'Volume_Agent']]

print(clf_LR_all_1.score(X_test_1, Test_y_1))
print(clf_LR_all_3.score(X_test_3, Test_y_3))
print(clf_LR_all_5.score(X_test_5, Test_y_5))
print(clf_LR_all_10.score(X_test_10, Test_y_10))
print(clf_LR_all_14.score(X_test_14, Test_y_14))
print(clf_LR_all_20.score(X_test_20, Test_y_20))

0.6090534979423868
0.36213991769547327
0.37448559670781895
0.41975308641975306
0.4279835390946502
0.41975308641975306


In [30]:
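# Note: the decision trees below reuse the clf_LR_all_* names, so they overwrite the logistic regression models fitted earlier
from sklearn.tree import DecisionTreeClassifier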
clf_LR_all_1 = DecisionTreeClassifier(random_state=0,  max_depth = 6, min_weight_fraction_leaf = 0.05).fit(X_train_1, Train_y_1)
print('DT (all features) training accuracy on 1 day ahead:', clf_LR_all_1.score(X_train_1, Train_y_1))
print('DT (all features) prediction accuracy on 1 day ahead:', clf_LR_all_1.score(X_test_1, Test_y_1))

clf_LR_all_3 = DecisionTreeClassifier(random_state=0, max_depth = 6, min_weight_fraction_leaf = 0.05).fit(X_train_3, Train_y_3)
print('DT (all features) training accuracy on 3 days ahead:', clf_LR_all_3.score(X_train_3, Train_y_3))
print('DT (all features) prediction accuracy on 3 days ahead:', clf_LR_all_3.score(X_test_3, Test_y_3))

clf_LR_all_5 = DecisionTreeClassifier(random_state=0, max_depth = 6, min_weight_fraction_leaf = 0.05).fit(X_train_5, Train_y_5)
print('DT (all features) training accuracy on 5 days ahead:', clf_LR_all_5.score(X_train_5, Train_y_5))
print('DT (all features) prediction accuracy on 5 days ahead:', clf_LR_all_5.score(X_test_5, Test_y_5))

clf_LR_all_10 = DecisionTreeClassifier(random_state=0, max_depth = 6, min_weight_fraction_leaf = 0.05).fit(X_train_10, Train_y_10)
print('DT (all features) training accuracy on 10 days ahead:', clf_LR_all_10.score(X_train_10, Train_y_10))
print('DT (all features) prediction accuracy on 10 days ahead:', clf_LR_all_10.score(X_test_10, Test_y_10))

clf_LR_all_14 = DecisionTreeClassifier(random_state=0, max_depth = 6, min_weight_fraction_leaf = 0.05).fit(X_train_14, Train_y_14)
print('DT (all features) training accuracy on 14 days ahead:', clf_LR_all_14.score(X_train_14, Train_y_14))
print('DT (all features) prediction accuracy on 14 days ahead:', clf_LR_all_14.score(X_test_14, Test_y_14))

clf_LR_all_20 = DecisionTreeClassifier(random_state=0, max_depth = 6, min_weight_fraction_leaf = 0.05).fit(X_train_20, Train_y_20)
print('DT (all features) training accuracy on 20 days ahead:', clf_LR_all_20.score(X_train_20, Train_y_20))
print('DT (all features) prediction accuracy on 20 days ahead:', clf_LR_all_20.score(X_test_20, Test_y_20))

DT (all features) training accuracy on 1 day ahead: 0.48586644817697666
DT (all features) prediction accuracy on 1 day ahead: 0.5142857142857142
DT (all features) training accuracy on 3 days ahead: 0.4678410487505121
DT (all features) prediction accuracy on 3 days ahead: 0.3183673469387755
DT (all features) training accuracy on 5 days ahead: 0.5043015157722245
DT (all features) prediction accuracy on 5 days ahead: 0.35918367346938773
DT (all features) training accuracy on 10 days ahead: 0.5571487095452683
DT (all features) prediction accuracy on 10 days ahead: 0.3795918367346939
DT (all features) training accuracy on 14 days ahead: 0.5899221630479312
DT (all features) prediction accuracy on 14 days ahead: 0.45714285714285713
DT (all features) training accuracy on 20 days ahead: 0.6058992216304793
DT (all features) prediction accuracy on 20 days ahead: 0.47346938775510206
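
The decision-tree accuracies above are easier to compare as a gap between training accuracy and accuracy on the held-out set scored in the same cell. The snippet below is only a small sketch of that arithmetic: the numbers are copied from the output above and rounded to four decimals, and the dictionary names are illustrative rather than part of the notebook.

```python
# Accuracies copied from the decision-tree output above (rounded to 4 decimals).
train_acc = {1: 0.4859, 3: 0.4678, 5: 0.5043, 10: 0.5571, 14: 0.5899, 20: 0.6059}
heldout_acc = {1: 0.5143, 3: 0.3184, 5: 0.3592, 10: 0.3796, 14: 0.4571, 20: 0.4735}

# Positive gap = the tree fits the training data better than the held-out data.
gaps = {h: round(train_acc[h] - heldout_acc[h], 4) for h in train_acc}

for horizon, gap in gaps.items():
    print(f"{horizon:>2} days ahead: train - held-out = {gap:+.4f}")
```

Only the 1-day horizon has a negative gap (held-out accuracy is slightly higher than training accuracy). For every other horizon the tree loses roughly 13 to 18 percentage points when it leaves the training data.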


In [49]:
# Use test data from 2019 to report in paper
Test_y_1 = Test_a_df['Y_1']
Test_y_3 = Test_a_df['Y_3']
Test_y_5 = Test_a_df['Y_5']
Test_y_10 = Test_a_df['Y_10']
Test_y_14 = Test_a_df['Y_14']
Test_y_20 = Test_a_df['Y_20']

X_test_1 = Test_a_df[['MACD', 'MOM1', 'Volume_Agent']]
X_test_3 = Test_a_df[['MACD', 'MOM3', 'STD_3', 'Volume_Agent']]
X_test_5 = Test_a_df[['MACD', 'MOM5', 'STD_5', 'Volume_Agent']]
X_test_10 = Test_a_df[['MACD', 'MOM10', 'STD_10', 'Volume_Agent']]
X_test_14 = Test_a_df[['MACD', 'MOM14', 'STD_14', 'Volume_Agent']]
X_test_20 = Test_a_df[['MACD', 'MOM20', 'STD_20', 'Volume_Agent']]

In [50]:
print(clf_LR_all_1.score(X_test_1, Test_y_1))
print(clf_LR_all_3.score(X_test_3, Test_y_3))
print(clf_LR_all_5.score(X_test_5, Test_y_5))
print(clf_LR_all_10.score(X_test_10, Test_y_10))
print(clf_LR_all_14.score(X_test_14, Test_y_14))
print(clf_LR_all_20.score(X_test_20, Test_y_20))

0.5802469135802469
0.3374485596707819
0.32098765432098764
0.4156378600823045
0.522633744855967
0.5390946502057613


In [51]:
Pred_y_1 = clf_LR_all_1.predict(X_test_1)
Pred_y_3 = clf_LR_all_3.predict(X_test_3)
Pred_y_5 = clf_LR_all_5.predict(X_test_5)
Pred_y_10 = clf_LR_all_10.predict(X_test_10)
Pred_y_14 = clf_LR_all_14.predict(X_test_14)
Pred_y_20 = clf_LR_all_20.predict(X_test_20)

In [52]:
from sklearn.metrics import classification_report
print(classification_report(Test_y_1, Pred_y_1))

              precision    recall  f1-score   support

          -1       0.00      0.00      0.00        46
           0       0.61      0.91      0.73       148
           1       0.27      0.12      0.17        49

    accuracy                           0.58       243
   macro avg       0.29      0.34      0.30       243
weighted avg       0.43      0.58      0.48       243
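
The 0.58 accuracy in this report hides how unevenly the classes contribute. As a rough check, recall times support approximates the number of correct predictions per class, so the accuracy can be rebuilt from the table itself. The sketch below uses only the rounded figures printed above, so the result is approximate, and the variable names are illustrative.

```python
# (recall, support) per class, copied from the 1-day-ahead report above.
report = {-1: (0.00, 46), 0: (0.91, 148), 1: (0.12, 49)}

# recall * support approximates the number of correctly classified samples.
correct = {label: recall * support for label, (recall, support) in report.items()}
total = sum(support for _, support in report.values())

accuracy = sum(correct.values()) / total
share_from_class_0 = correct[0] / sum(correct.values())

print(f"reconstructed accuracy: {accuracy:.2f}")
print(f"share of correct predictions that are class 0: {share_from_class_0:.2f}")
```

With these rounded inputs, the reconstructed accuracy agrees with the reported 0.58, and about 96% of the correct predictions come from class 0 alone.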





In [53]:
print(classification_report(Test_y_3, Pred_y_3))

              precision    recall  f1-score   support

          -1       0.35      0.44      0.39        73
           0       0.19      0.04      0.06        82
           1       0.35      0.53      0.42        88

    accuracy                           0.34       243
   macro avg       0.29      0.34      0.29       243
weighted avg       0.29      0.34      0.29       243



In [54]:
print(classification_report(Test_y_5, Pred_y_5))

              precision    recall  f1-score   support

          -1       0.28      0.28      0.28        82
           0       0.00      0.00      0.00        69
           1       0.34      0.60      0.43        92

    accuracy                           0.32       243
   macro avg       0.21      0.29      0.24       243
weighted avg       0.22      0.32      0.26       243



In [55]:
print(classification_report(Test_y_10, Pred_y_10))

              precision    recall  f1-score   support

          -1       0.43      0.16      0.23        96
           0       0.00      0.00      0.00        45
           1       0.41      0.84      0.55       102

    accuracy                           0.42       243
   macro avg       0.28      0.33      0.26       243
weighted avg       0.34      0.42      0.32       243



In [56]:
print(classification_report(Test_y_14, Pred_y_14))

              precision    recall  f1-score   support

          -1       0.49      0.64      0.56        97
           0       0.00      0.00      0.00        42
           1       0.56      0.62      0.59       104

    accuracy                           0.52       243
   macro avg       0.35      0.42      0.38       243
weighted avg       0.43      0.52      0.47       243



In [57]:
print(classification_report(Test_y_20, Pred_y_20))

              precision    recall  f1-score   support

          -1       0.65      0.45      0.53       113
           0       0.00      0.00      0.00        28
           1       0.48      0.78      0.60       102

    accuracy                           0.54       243
   macro avg       0.38      0.41      0.38       243
weighted avg       0.51      0.54      0.50       243



In [58]:
clf_LR_all_1.feature_importances_

array([0.44745931, 0.55254069, 0.        ])

In [59]:
clf_LR_all_3.feature_importances_

array([0.56976271, 0.21561306, 0.21462423, 0.        ])

In [60]:
clf_LR_all_5.feature_importances_

array([0.41295118, 0.08018301, 0.50686581, 0.        ])

In [61]:
clf_LR_all_10.feature_importances_

array([0.08621677, 0.52131511, 0.36069666, 0.03177146])

In [62]:
clf_LR_all_14.feature_importances_

array([0.33943182, 0.20550703, 0.45506116, 0.        ])

In [63]:
clf_LR_all_20.feature_importances_

array([0.20996348, 0.35508355, 0.43495297, 0.        ])
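
The importance arrays are easier to read once each value is paired with its column name. The sketch below does this for the 20-day model. It assumes the training columns are in the same order as `X_test_20` (`MACD`, `MOM20`, `STD_20`, `Volume_Agent`), and the values are copied from the output above rather than read from the fitted model.

```python
# Column order taken from X_test_20; assumed to match the columns used for fitting.
features_20 = ['MACD', 'MOM20', 'STD_20', 'Volume_Agent']
# Values copied from clf_LR_all_20.feature_importances_ above.
importances_20 = [0.20996348, 0.35508355, 0.43495297, 0.0]

# Sort from most to least important.
ranked = sorted(zip(features_20, importances_20), key=lambda kv: kv[1], reverse=True)
for name, value in ranked:
    print(f"{name:<13}{value:.3f}")
```

Under that ordering assumption, `STD_20` ranks first and `Volume_Agent` contributes nothing to the 20-day tree.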

### This is for 2020!

In [22]:
# Use test data from 2020 to report in paper
Test_b_df = pd.read_csv('SET_Test_b.csv')
# Drop the last 20 rows (presumably because the 20-day-ahead labels are not available for them)
Test_b_df = Test_b_df.iloc[:-20, :]

Test_y_1b = Test_b_df['Y_1']
Test_y_3b = Test_b_df['Y_3']
Test_y_5b = Test_b_df['Y_5']
Test_y_10b = Test_b_df['Y_10']
Test_y_14b = Test_b_df['Y_14']
Test_y_20b = Test_b_df['Y_20']

X_test_1b = Test_b_df[['MACD', 'MOM1', 'Volume_Agent']]
X_test_3b = Test_b_df[['MACD', 'MOM3', 'STD_3', 'Volume_Agent']]
X_test_5b = Test_b_df[['MACD', 'MOM5', 'STD_5', 'Volume_Agent']]
X_test_10b = Test_b_df[['MACD', 'MOM10', 'STD_10', 'Volume_Agent']]
X_test_14b = Test_b_df[['MACD', 'MOM14', 'STD_14', 'Volume_Agent']]
X_test_20b = Test_b_df[['MACD', 'MOM20', 'STD_20', 'Volume_Agent']]


X_test_lag_one = Test_b_df[['Y_N_1']]

In [42]:
Test_y_1b.value_counts()

 1    78
 0    73
-1    72
Name: Y_1, dtype: int64

In [43]:
Test_y_3b.value_counts()

-1    94
 1    86
 0    43
Name: Y_3, dtype: int64

In [44]:
Test_y_5b.value_counts()

 1    97
-1    96
 0    30
Name: Y_5, dtype: int64

In [45]:
Test_y_10b.value_counts()

-1    102
 1     98
 0     23
Name: Y_10, dtype: int64

In [46]:
Test_y_14b.value_counts()

-1    123
 1     87
 0     13
Name: Y_14, dtype: int64

In [47]:
Test_y_20b.value_counts()

-1    127
 1     90
 0      6
Name: Y_20, dtype: int64

In [48]:
pd.crosstab(Test_y_1b, Test_y_10b) # 43% in the diagonal

Y_10,-1,0,1
Y_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
-1,41,8,23
0,38,7,28
1,23,8,47


In [49]:
pd.crosstab(Test_y_3b, Test_y_14b) # 61.4% in the diagonal

Y_14,-1,0,1
Y_3,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
-1,76,2,16
0,23,5,15
1,24,6,56


In [50]:
pd.crosstab(Test_y_5b, Test_y_20b) # 63.2% in the diagonal

Y_20,-1,0,1
Y_5,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
-1,78,2,16
0,17,1,12
1,32,3,62
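
The "in the diagonal" percentages in the comments are the share of days on which the two labels agree. A minimal sketch of that calculation follows, using the counts from the `Y_3` vs `Y_14` table above. The helper name `diagonal_share` is illustrative and not part of the notebook.

```python
def diagonal_share(table):
    """Fraction of observations on the main diagonal of a square count table."""
    total = sum(sum(row) for row in table)
    diagonal = sum(table[i][i] for i in range(len(table)))
    return diagonal / total

# Counts copied from pd.crosstab(Test_y_3b, Test_y_14b); rows and columns ordered -1, 0, 1.
y3_vs_y14 = [
    [76, 2, 16],
    [23, 5, 15],
    [24, 6, 56],
]
share = diagonal_share(y3_vs_y14)
print(f"{share:.1%}")
```

This gives 137 of 223 days, or about 61.4%. On a crosstab DataFrame `ct` with the same label order on both axes, the equivalent would be `np.trace(ct.values) / ct.values.sum()`.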


In [54]:
print('Dummy classifier (most frequent) prediction accuracy on 1 day ahead:', clf_dummy_mf_1.score(X_test_lag_one, Test_y_1b))
print('Dummy classifier (most frequent) prediction accuracy on 3 day ahead:', clf_dummy_mf_3.score(X_test_lag_one, Test_y_3b))
print('Dummy classifier (most frequent) prediction accuracy on 5 day ahead:', clf_dummy_mf_5.score(X_test_lag_one, Test_y_5b))
print('Dummy classifier (most frequent) prediction accuracy on 10 day ahead:', clf_dummy_mf_10.score(X_test_lag_one, Test_y_10b))
print('Dummy classifier (most frequent) prediction accuracy on 14 day ahead:', clf_dummy_mf_14.score(X_test_lag_one, Test_y_14b))
print('Dummy classifier (most frequent) prediction accuracy on 20 day ahead:', clf_dummy_mf_20.score(X_test_lag_one, Test_y_20b))

Dummy classifier (most frequent) prediction accuracy on 1 day ahead: 0.3273542600896861
Dummy classifier (most frequent) prediction accuracy on 3 day ahead: 0.38565022421524664
Dummy classifier (most frequent) prediction accuracy on 5 day ahead: 0.4349775784753363
Dummy classifier (most frequent) prediction accuracy on 10 day ahead: 0.43946188340807174
Dummy classifier (most frequent) prediction accuracy on 14 day ahead: 0.3901345291479821
Dummy classifier (most frequent) prediction accuracy on 20 day ahead: 0.40358744394618834
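
A most-frequent dummy classifier always predicts the majority class of the training labels, so its test accuracy is simply the share of test labels that equal that class. The sketch below reproduces the first three numbers above from the `value_counts()` outputs alone. The training majority classes are read from the `Train_df` counts at the top of the notebook (0 for `Y_1`, 1 for `Y_3` and `Y_5`), and the dictionary names are illustrative.

```python
# Majority class of each training label, read from the Train_df value_counts() output.
train_majority = {'Y_1': 0, 'Y_3': 1, 'Y_5': 1}

# 2020 test label counts copied from the value_counts() cells above.
test_counts = {
    'Y_1': {1: 78, 0: 73, -1: 72},
    'Y_3': {-1: 94, 1: 86, 0: 43},
    'Y_5': {1: 97, -1: 96, 0: 30},
}

baseline = {}
for target, counts in test_counts.items():
    n = sum(counts.values())
    # Accuracy of "always predict the training majority class" on the test labels.
    baseline[target] = counts[train_majority[target]] / n
    print(target, round(baseline[target], 4))
```

The results (73/223, 86/223 and 97/223) match the dummy accuracies printed above. This shows that the baseline depends only on how the class distribution shifts between the training period and 2020.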


In [25]:
# This is for benchmark
print('Logistic regression on lagged prediction accuracy on 1 day ahead:', clf_LR_1.score(X_test_lag_one, Test_y_1b))
print('Logistic regression on lagged prediction accuracy on 3 day ahead:', clf_LR_3.score(X_test_lag_one, Test_y_3b))
print('Logistic regression on lagged prediction accuracy on 5 day ahead:', clf_LR_5.score(X_test_lag_one, Test_y_5b))
print('Logistic regression on lagged prediction accuracy on 10 day ahead:', clf_LR_10.score(X_test_lag_one, Test_y_10b))
print('Logistic regression on lagged prediction accuracy on 14 day ahead:', clf_LR_14.score(X_test_lag_one, Test_y_14b))
print('Logistic regression on lagged prediction accuracy on 20 day ahead:', clf_LR_20.score(X_test_lag_one, Test_y_20b))

Logistic regression on lagged prediction accuracy on 1 day ahead: 0.3273542600896861
Logistic regression on lagged prediction accuracy on 3 day ahead: 0.38565022421524664
Logistic regression on lagged prediction accuracy on 5 day ahead: 0.4349775784753363
Logistic regression on lagged prediction accuracy on 10 day ahead: 0.43946188340807174
Logistic regression on lagged prediction accuracy on 14 day ahead: 0.3901345291479821
Logistic regression on lagged prediction accuracy on 20 day ahead: 0.40358744394618834


In [29]:
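# This is for logistic regression. Note: the clf_LR_all_* names are re-used for the decision trees,
# so these numbers come from a run in which they still held the logistic regression models.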
print('Logistic regression (all features) prediction accuracy on 1 day ahead:', clf_LR_all_1.score(X_test_1b, Test_y_1b))
print('Logistic regression (all features) prediction accuracy on 3 day ahead:', clf_LR_all_3.score(X_test_3b, Test_y_3b))
print('Logistic regression (all features) prediction accuracy on 5 day ahead:', clf_LR_all_5.score(X_test_5b, Test_y_5b))
print('Logistic regression (all features) prediction accuracy on 10 day ahead:', clf_LR_all_10.score(X_test_10b, Test_y_10b))
print('Logistic regression (all features) prediction accuracy on 14 day ahead:', clf_LR_all_14.score(X_test_14b, Test_y_14b))
print('Logistic regression (all features) prediction accuracy on 20 day ahead:', clf_LR_all_20.score(X_test_20b, Test_y_20b))

Logistic regression (all features) prediction accuracy on 1 day ahead: 0.3721973094170404
Logistic regression (all features) prediction accuracy on 3 day ahead: 0.35874439461883406
Logistic regression (all features) prediction accuracy on 5 day ahead: 0.4125560538116592
Logistic regression (all features) prediction accuracy on 10 day ahead: 0.4439461883408072
Logistic regression (all features) prediction accuracy on 14 day ahead: 0.3542600896860987
Logistic regression (all features) prediction accuracy on 20 day ahead: 0.3452914798206278


In [31]:
# This is for decision tree (the clf_LR_all_* variables hold DecisionTreeClassifier models at this point)
print('DT (all features) prediction accuracy on 1 day ahead:', clf_LR_all_1.score(X_test_1b, Test_y_1b))
print('DT (all features) prediction accuracy on 3 day ahead:', clf_LR_all_3.score(X_test_3b, Test_y_3b))
print('DT (all features) prediction accuracy on 5 day ahead:', clf_LR_all_5.score(X_test_5b, Test_y_5b))
print('DT (all features) prediction accuracy on 10 day ahead:', clf_LR_all_10.score(X_test_10b, Test_y_10b))
print('DT (all features) prediction accuracy on 14 day ahead:', clf_LR_all_14.score(X_test_14b, Test_y_14b))
print('DT (all features) prediction accuracy on 20 day ahead:', clf_LR_all_20.score(X_test_20b, Test_y_20b))

DT (all features) prediction accuracy on 1 day ahead: 0.3991031390134529
DT (all features) prediction accuracy on 3 day ahead: 0.3811659192825112
DT (all features) prediction accuracy on 5 day ahead: 0.47085201793721976
DT (all features) prediction accuracy on 10 day ahead: 0.4798206278026906
DT (all features) prediction accuracy on 14 day ahead: 0.5022421524663677
DT (all features) prediction accuracy on 20 day ahead: 0.42152466367713004
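
To see where the decision tree actually adds something on the 2020 data, its accuracy can be set against the most-frequent dummy baseline for each horizon. The sketch below only does the subtraction. All numbers are copied from the outputs above and rounded to four decimals, and the dictionary names are illustrative.

```python
# Accuracies on the 2020 test set, copied from the outputs above (4 decimals).
dummy_acc = {1: 0.3274, 3: 0.3857, 5: 0.4350, 10: 0.4395, 14: 0.3901, 20: 0.4036}
tree_acc = {1: 0.3991, 3: 0.3812, 5: 0.4709, 10: 0.4798, 14: 0.5022, 20: 0.4215}

# Positive lift = the tree beats the "always predict the majority class" baseline.
lift = {h: round(tree_acc[h] - dummy_acc[h], 4) for h in dummy_acc}
for horizon, delta in lift.items():
    print(f"{horizon:>2} days ahead: tree - dummy = {delta:+.4f}")
```

The largest lift is at 14 days (about 11 percentage points). At 3 days the tree is marginally below the baseline.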


In [32]:
Pred_y_1b = clf_LR_all_1.predict(X_test_1b)
Pred_y_3b = clf_LR_all_3.predict(X_test_3b)
Pred_y_5b = clf_LR_all_5.predict(X_test_5b)
Pred_y_10b = clf_LR_all_10.predict(X_test_10b)
Pred_y_14b = clf_LR_all_14.predict(X_test_14b)
Pred_y_20b = clf_LR_all_20.predict(X_test_20b)

In [34]:
from sklearn.metrics import classification_report
print(classification_report(Test_y_1b, Pred_y_1b))

              precision    recall  f1-score   support

          -1       0.00      0.00      0.00        72
           0       0.38      0.79      0.51        73
           1       0.45      0.40      0.42        78

    accuracy                           0.40       223
   macro avg       0.28      0.40      0.31       223
weighted avg       0.28      0.40      0.31       223





In [35]:
print(classification_report(Test_y_3b, Pred_y_3b))

              precision    recall  f1-score   support

          -1       0.41      0.36      0.38        94
           0       0.00      0.00      0.00        43
           1       0.38      0.59      0.46        86

    accuracy                           0.38       223
   macro avg       0.26      0.32      0.28       223
weighted avg       0.32      0.38      0.34       223



In [36]:
print(classification_report(Test_y_5b, Pred_y_5b))

              precision    recall  f1-score   support

          -1       0.51      0.39      0.44        96
           0       0.00      0.00      0.00        30
           1       0.45      0.70      0.55        97

    accuracy                           0.47       223
   macro avg       0.32      0.36      0.33       223
weighted avg       0.42      0.47      0.43       223



In [37]:
print(classification_report(Test_y_10b, Pred_y_10b))

              precision    recall  f1-score   support

          -1       0.56      0.24      0.33       102
           0       0.00      0.00      0.00        23
           1       0.46      0.85      0.60        98

    accuracy                           0.48       223
   macro avg       0.34      0.36      0.31       223
weighted avg       0.46      0.48      0.41       223



In [39]:
print(classification_report(Test_y_14b, Pred_y_14b))

              precision    recall  f1-score   support

          -1       0.58      0.48      0.52       123
           0       0.00      0.00      0.00        13
           1       0.44      0.61      0.51        87

    accuracy                           0.50       223
   macro avg       0.34      0.36      0.34       223
weighted avg       0.49      0.50      0.49       223



In [40]:
print(classification_report(Test_y_20b, Pred_y_20b))

              precision    recall  f1-score   support

          -1       0.51      0.36      0.42       127
           0       0.00      0.00      0.00         6
           1       0.36      0.53      0.43        90

    accuracy                           0.42       223
   macro avg       0.29      0.30      0.28       223
weighted avg       0.43      0.42      0.41       223



## Below will not be used anymore!

In [130]:
from sklearn.ensemble import RandomForestClassifier
clf_LR_all_1 = RandomForestClassifier(random_state=0, max_depth = 8, min_weight_fraction_leaf = 0.03).fit(X_train_1, Train_y_1)
print('RF (all features) training accuracy on 1 day ahead:', clf_LR_all_1.score(X_train_1, Train_y_1))
print('RF (all features) prediction accuracy on 1 day ahead:', clf_LR_all_1.score(X_test_1, Test_y_1))

clf_LR_all_3 = RandomForestClassifier(random_state=0, max_depth = 8, min_weight_fraction_leaf = 0.03).fit(X_train_3, Train_y_3)
print('RF (all features)  training accuracy on 3 days ahead:', clf_LR_all_3.score(X_train_3, Train_y_3))
print('RF (all features) prediction accuracy on 3 days ahead:', clf_LR_all_3.score(X_test_3, Test_y_3))

clf_LR_all_5 = RandomForestClassifier(random_state=0, max_depth = 8, min_weight_fraction_leaf = 0.03).fit(X_train_5, Train_y_5)
print('RF (all features) training accuracy on 5 days ahead:', clf_LR_all_5.score(X_train_5, Train_y_5))
print('RF (all features) prediction accuracy on 5 days ahead:', clf_LR_all_5.score(X_test_5, Test_y_5))

clf_LR_all_10 = RandomForestClassifier(random_state=0, max_depth = 8, min_weight_fraction_leaf = 0.03).fit(X_train_10, Train_y_10)
print('RF (all features) training accuracy on 10 days ahead:', clf_LR_all_10.score(X_train_10, Train_y_10))
print('RF (all features) prediction accuracy on 10 days ahead:', clf_LR_all_10.score(X_test_10, Test_y_10))

clf_LR_all_14 = RandomForestClassifier(random_state=0, max_depth = 8, min_weight_fraction_leaf = 0.03).fit(X_train_14, Train_y_14)
print('RF (all features) training accuracy on 14 days ahead:', clf_LR_all_14.score(X_train_14, Train_y_14))
print('RF (all features) prediction accuracy on 14 days ahead:', clf_LR_all_14.score(X_test_14, Test_y_14))

clf_LR_all_20 = RandomForestClassifier(random_state=0, max_depth = 8, min_weight_fraction_leaf = 0.03).fit(X_train_20, Train_y_20)
print('RF (all features) training accuracy on 20 days ahead:', clf_LR_all_20.score(X_train_20, Train_y_20))
print('RF (all features) prediction accuracy on 20 days ahead:', clf_LR_all_20.score(X_test_20, Test_y_20))

RF (all features) training accuracy on 1 day ahead: 0.5092175337976239
RF (all features) prediction accuracy on 1 day ahead: 0.4775510204081633
RF (all features)  training accuracy on 3 days ahead: 0.49733715690290864
RF (all features) prediction accuracy on 3 days ahead: 0.3020408163265306
RF (all features) training accuracy on 5 days ahead: 0.5292912740680049
RF (all features) prediction accuracy on 5 days ahead: 0.3510204081632653
RF (all features) training accuracy on 10 days ahead: 0.5788611224907825
RF (all features) prediction accuracy on 10 days ahead: 0.32653061224489793
RF (all features) training accuracy on 14 days ahead: 0.6222859483818107
RF (all features) prediction accuracy on 14 days ahead: 0.37142857142857144
RF (all features) training accuracy on 20 days ahead: 0.6337566571077428
RF (all features) prediction accuracy on 20 days ahead: 0.4


In [131]:
Pred_y_1 = clf_LR_all_1.predict(X_test_1)
Pred_y_3 = clf_LR_all_3.predict(X_test_3)
Pred_y_5 = clf_LR_all_5.predict(X_test_5)
Pred_y_10 = clf_LR_all_10.predict(X_test_10)
Pred_y_14 = clf_LR_all_14.predict(X_test_14)
Pred_y_20 = clf_LR_all_20.predict(X_test_20)

In [132]:
from sklearn.metrics import classification_report
print(classification_report(Test_y_1, Pred_y_1))

              precision    recall  f1-score   support

          -1       0.14      0.07      0.09        60
           0       0.57      0.77      0.65       137
           1       0.23      0.15      0.18        48

    accuracy                           0.48       245
   macro avg       0.31      0.33      0.31       245
weighted avg       0.40      0.48      0.42       245



In [133]:
print(classification_report(Test_y_3, Pred_y_3))

              precision    recall  f1-score   support

          -1       0.36      0.28      0.32        95
           0       0.36      0.06      0.10        71
           1       0.27      0.54      0.36        79

    accuracy                           0.30       245
   macro avg       0.33      0.29      0.26       245
weighted avg       0.33      0.30      0.27       245



In [134]:
print(classification_report(Test_y_5, Pred_y_5))

              precision    recall  f1-score   support

          -1       0.41      0.25      0.31       103
           0       0.00      0.00      0.00        58
           1       0.33      0.71      0.45        84

    accuracy                           0.35       245
   macro avg       0.25      0.32      0.25       245
weighted avg       0.28      0.35      0.29       245





In [135]:
print(classification_report(Test_y_10, Pred_y_10))

              precision    recall  f1-score   support

          -1       0.39      0.16      0.22       127
           0       0.00      0.00      0.00        36
           1       0.31      0.73      0.43        82

    accuracy                           0.33       245
   macro avg       0.23      0.30      0.22       245
weighted avg       0.31      0.33      0.26       245



In [136]:
print(classification_report(Test_y_14, Pred_y_14))

              precision    recall  f1-score   support

          -1       0.56      0.27      0.37       128
           0       0.00      0.00      0.00        45
           1       0.31      0.78      0.44        72

    accuracy                           0.37       245
   macro avg       0.29      0.35      0.27       245
weighted avg       0.38      0.37      0.32       245



In [137]:
print(classification_report(Test_y_20, Pred_y_20))

              precision    recall  f1-score   support

          -1       0.77      0.26      0.38       145
           0       0.00      0.00      0.00        31
           1       0.31      0.88      0.46        69

    accuracy                           0.40       245
   macro avg       0.36      0.38      0.28       245
weighted avg       0.54      0.40      0.36       245



In [138]:
clf_LR_all_1.feature_importances_

array([0.47759579, 0.487616  , 0.03478821])

In [139]:
clf_LR_all_3.feature_importances_

array([0.39684013, 0.31326035, 0.25787383, 0.03202569])

In [140]:
clf_LR_all_5.feature_importances_

array([0.39971865, 0.25544499, 0.32013944, 0.02469691])

In [141]:
clf_LR_all_10.feature_importances_

array([0.28319147, 0.34646972, 0.33255063, 0.03778818])

In [142]:
clf_LR_all_14.feature_importances_

array([0.26886272, 0.34316155, 0.35129727, 0.03667845])

In [143]:
clf_LR_all_20.feature_importances_

array([0.26023451, 0.31570017, 0.40519232, 0.018873  ])

In [160]:
from sklearn.ensemble import GradientBoostingClassifier
clf_GT_all_1 = GradientBoostingClassifier(random_state=0, max_depth = 6, min_weight_fraction_leaf = 0.2, learning_rate = 0.5).fit(X_train_1, Train_y_1)
print('GBDT (all features) training accuracy on 1 day ahead:', clf_GT_all_1.score(X_train_1, Train_y_1))
print('GBDT (all features) prediction accuracy on 1 day ahead:',clf_GT_all_1.score(X_test_1, Test_y_1))

GBDT (all features) training accuracy on 1 day ahead: 0.51372388365424
GBDT (all features) prediction accuracy on 1 day ahead: 0.4897959183673469


In [161]:
clf_GT_all_3 = GradientBoostingClassifier(random_state=0, max_depth = 6, min_weight_fraction_leaf = 0.2, learning_rate = 0.5).fit(X_train_3, Train_y_3)
print('GBDT (all features) training accuracy on 3 day ahead:', clf_GT_all_3.score(X_train_3, Train_y_3))
print('GBDT (all features) prediction accuracy on 3 day ahead:',clf_GT_all_3.score(X_test_3, Test_y_3))

GBDT (all features) training accuracy on 3 day ahead: 0.5297009422367882
GBDT (all features) prediction accuracy on 3 day ahead: 0.3183673469387755


In [162]:
clf_GT_all_5 = GradientBoostingClassifier(random_state=0, max_depth = 6, min_weight_fraction_leaf = 0.03).fit(X_train_5, Train_y_5)
print('GBDT (all features) training accuracy on 5 day ahead:', clf_GT_all_5.score(X_train_5, Train_y_5))
print('GBDT (all features) prediction accuracy on 5 day ahead:',clf_GT_all_5.score(X_test_5, Test_y_5))

GBDT (all features) training accuracy on 5 day ahead: 0.5530520278574355
GBDT (all features) prediction accuracy on 5 day ahead: 0.3877551020408163


In [174]:
clf_GT_all_10 = GradientBoostingClassifier(random_state=0, max_depth = 6, min_weight_fraction_leaf = 0.03).fit(X_train_10, Train_y_10)
print('GBDT (all features) training accuracy on 10 day ahead:', clf_GT_all_10.score(X_train_10, Train_y_10))
print('GBDT (all features) prediction accuracy on 10 day ahead:',clf_GT_all_10.score(X_test_10, Test_y_10))

GBDT (all features) training accuracy on 10 day ahead: 0.6091765669807456
GBDT (all features) prediction accuracy on 10 day ahead: 0.3306122448979592


In [178]:
clf_GT_all_14 = GradientBoostingClassifier(random_state=0, max_depth = 8, min_weight_fraction_leaf = 0.03).fit(X_train_14, Train_y_14)
print('GBDT (all features) training accuracy on 14 day ahead:', clf_GT_all_14.score(X_train_14, Train_y_14))
print('GBDT (all features) prediction accuracy on 14 day ahead:',clf_GT_all_14.score(X_test_14, Test_y_14))

GBDT (all features) training accuracy on 14 day ahead: 0.7251126587464154
GBDT (all features) prediction accuracy on 14 day ahead: 0.3469387755102041


In [177]:
clf_GT_all_20 = GradientBoostingClassifier(random_state=0, max_depth = 8, min_weight_fraction_leaf = 0.03).fit(X_train_20, Train_y_20)
print('GBDT (all features) training accuracy on 20 day ahead:', clf_GT_all_20.score(X_train_20, Train_y_20))
print('GBDT (all features) prediction accuracy on 20 day ahead:',clf_GT_all_20.score(X_test_20, Test_y_20))

GBDT (all features) training accuracy on 20 day ahead: 0.7611634575993446
GBDT (all features) prediction accuracy on 20 day ahead: 0.42857142857142855


In [137]:
Pred_y_1 = clf_GT_all_1.predict(X_test_1)
Pred_y_3 = clf_GT_all_3.predict(X_test_3)
Pred_y_5 = clf_GT_all_5.predict(X_test_5)
Pred_y_10 = clf_GT_all_10.predict(X_test_10)
Pred_y_14 = clf_GT_all_14.predict(X_test_14)
Pred_y_20 = clf_GT_all_20.predict(X_test_20)

In [138]:
print(classification_report(Test_y_1, Pred_y_1))

              precision    recall  f1-score   support

          -1       0.29      0.23      0.26        60
           0       0.60      0.66      0.63       137
           1       0.22      0.21      0.22        48

    accuracy                           0.47       245
   macro avg       0.37      0.37      0.37       245
weighted avg       0.45      0.47      0.46       245



In [139]:
print(classification_report(Test_y_3, Pred_y_3))

              precision    recall  f1-score   support

          -1       0.42      0.34      0.37        95
           0       0.33      0.18      0.23        71
           1       0.30      0.48      0.37        79

    accuracy                           0.34       245
   macro avg       0.35      0.33      0.32       245
weighted avg       0.35      0.34      0.33       245



### Whew, this is the end of part 2.

For this part, the coding is relatively easy compared to part 1. The tough part is tweaking hyperparameters to get a higher validation score, and I have to admit that it is difficult to raise the validation score significantly beyond this point. Here are some quick notes on what I observed. I think they could help refine where we go next, using this as the starting point.

- Using the dummy classifier, we see shifting sand in the distribution of (up, sideways, down), and the prediction gets worse as we increase the window length.
- Using the lagged dependent variable in a logistic regression classifier is no better than the dummy classifier. Unfortunately, the same holds true for the full logistic regression model.
- Using tree-based classifiers (decision tree, RF and GBDT), I do not see any improvement for window lengths of 1 to 10 days (especially 1 to 5 days). However, there is a significant improvement (from about 30% to 45% accuracy) for window lengths of 14 and 20 days. The longer the window, the better the prediction, which seems counterintuitive. This is confirmed by running the model on the test set.
- Inspecting the classification reports for RF: as the time span increases, the model tends to ignore sideways as that class becomes less common. The model is therefore somewhat risk-seeking, because it ignores sideways and bets on up or down. GBDT is less aggressive than RF in this respect.
- Inspecting the feature importances: as the time span increases, "trend" and "momentum" features (especially CCI, MACD) become less important, while "uncertainty" (STD) becomes more important. (Importance here is the feature importance reported by RF and GBDT.)
- Looking at similarities among the classification reports (especially window_length = 14, 20): for (down), precision is high and recall is low, meaning that when the model says the index is going down it is usually right, but it misses many of the actual declines. For (up), precision is low and recall is high, meaning that the model captures most of the actual rises, at the cost of more false alarms.
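
The precision and recall pattern in the last note can be made concrete with a small confusion matrix. The sketch below is illustrative only: the counts are hypothetical and chosen to mimic the pattern (they are not notebook output), and `per_class_metrics` is a helper written for this example, not a scikit-learn function.

```python
def per_class_metrics(confusion, labels):
    """Precision and recall per class from a confusion matrix.

    confusion[i][j] = number of samples with true class labels[i]
    that were predicted as labels[j].
    """
    metrics = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        predicted = sum(row[k] for row in confusion)  # column total
        actual = sum(confusion[k])                    # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        metrics[label] = (precision, recall)
    return metrics

# Hypothetical counts, chosen only to illustrate the pattern (not notebook output).
labels = ['down', 'sideways', 'up']
confusion = [
    [30, 0, 70],  # true down: most are wrongly called up
    [5, 0, 25],   # true sideways: never predicted as sideways
    [10, 0, 90],  # true up: almost all are caught
]
for label, (precision, recall) in per_class_metrics(confusion, labels).items():
    print(f"{label:<9} precision={precision:.2f} recall={recall:.2f}")
```

In this toy case a "down" call is right two times out of three but catches only 30% of actual declines. An "up" call catches 90% of actual rises but is right less than half the time. The never-predicted "sideways" class scores zero on both, as in the reports above.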