This project forecasts the US stock market using the performance of stock markets in other countries.
# The goals of this project
1. Predict the US stock market based on information from international stock markets.
2. Apply various statistical and machine learning methods for prediction, and compare their performance.
3. Provide insightful investment advice.

# Findings
1. Stock markets in other countries have predictive power for the US stock market.
2. OLS and Ridge regression give the best predictions.
3. The data set (215 data points in total) is not large enough for the machine learning methods to outperform the regressions.
4. The performance comparison is summarized at the end of the project.
5. The stock markets in France, Portugal, Italy and Greece are negatively related to US stock movements. Thus, investing in these countries is a good hedging strategy.


# 1. The Data

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn import linear_model, tree, ensemble, neighbors, neural_network
from sklearn.metrics import mean_squared_error
from math import sqrt
import warnings
warnings.simplefilter("ignore")

# The data are monthly stock returns for 10 developed countries. An augmented Dickey-Fuller test shows no unit root.
# The data start in Aug. 1997 and end in Sep. 2015.
df = pd.read_csv('stock_data.csv')
df.date = pd.to_datetime(df.date)
# The first column is the date, and the second column is the US stock return, which is the target variable.
df.head()


index,date,US,JP,GB,DE,FR,IT,CA,ES,PT,GR
0,1997-10-31,3.664936,1.072069,-0.228268,4.269542,3.901979,3.315279,-4.936632,7.586259,4.156334,-0.979569
1,1997-11-30,2.393549,-8.643722,5.208389,5.635678,4.800509,9.279251,2.825756,4.428291,5.778483,0.379869
2,1997-12-31,-0.109771,8.596826,5.08145,4.247254,5.276067,11.507212,0.011348,9.513052,13.316926,-5.861094
3,1998-01-31,6.378283,1.214592,5.622851,5.894949,7.591167,5.480019,5.68991,10.765061,10.668401,1.692631
4,1998-02-28,5.107415,-1.82565,3.596323,7.289553,11.518873,20.824999,6.363625,13.718557,13.697838,34.594548


In [2]:
# Past returns of the US and the other countries are used to forecast the US stock market.
# Create lag variables: e.g. US_L1, US_L2 and US_L3 hold the US return 1, 2 and 3 months earlier.
countries = list(df.columns[-10:])  # the 10 country return columns
for i in [1, 2, 3]:
    # shift(i) moves every series down by i rows; the first i rows become NaN
    df_lag = df[countries].shift(i)
    df_lag.columns = ['{}_L{}'.format(c, i) for c in countries]
    df = pd.concat([df, df_lag], axis=1)

In [3]:
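# Create the dependent and independent variables.
# The first three rows are dropped because they lack a full three-month lag history.
# Y is the US return; X holds the other nine countries' same-month returns plus all lag columns.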
Y = df.iloc[3:,1]
X = df.iloc[3:,2:]

# Take the first 80% of the sample as the training set and the last 20% as the test set.
# Because this is time-series data, random sampling is not appropriate for splitting training/test sets.
# The training set must come from the beginning of the data, and the test set from the end.
Y_train, Y_test = Y.iloc[:172], Y.iloc[172:]
X_train, X_test = X.iloc[:172,:], X.iloc[172:,:]
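
The cut-off of 172 rows used above corresponds to 80% of the 215 usable observations. A minimal sketch of a helper that derives the cut-off from the sample size instead of hard-coding it is shown below. The name `chronological_split` and the demo data are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def chronological_split(X, y, train_frac=0.8):
    """Split ordered data so that every training row precedes every test row."""
    n_train = int(len(X) * train_frac)
    return X.iloc[:n_train], X.iloc[n_train:], y.iloc[:n_train], y.iloc[n_train:]

# Synthetic stand-in with the same number of rows as the project's sample.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(215, 3)), columns=["a", "b", "c"])
y_demo = pd.Series(rng.normal(size=215))

X_tr, X_te, y_tr, y_te = chronological_split(X_demo, y_demo)
print(len(X_tr), len(X_te))
```

Because the rows are never shuffled, no observation from the test period can influence the fitted model.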

# 2. The Models
In this section, several regression and machine learning models are fitted. The coefficient of determination (R^2) and the root-mean-square error (RMSE) are used to compare their prediction performance.
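
To make the two metrics concrete, the sketch below computes them by hand with NumPy on a tiny made-up example. The function names `rmse` and `r_squared` are illustrative; the notebook itself relies on `mean_squared_error` and on each estimator's `score` method, which returns R^2 for scikit-learn regressors.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error: the typical size of a prediction error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, relative to always predicting the mean of y_true."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
good = np.array([1.1, 1.9, 3.2, 3.8])   # close to the truth
bad = np.array([4.0, 3.0, 2.0, 1.0])    # reversed order

print(r_squared(y_true, good), rmse(y_true, good))
print(r_squared(y_true, bad), rmse(y_true, bad))
```

R^2 is at most 1 and becomes negative when a model predicts worse than the constant mean of the evaluated sample, so a negative test-set R^2 is possible.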
## 2.1 Regression
### 2.1.1 Ordinary Least Squares
Linear OLS is the benchmark model.

In [4]:
OLS = linear_model.LinearRegression()
OLS_Result = OLS.fit(X_train, Y_train)
OLS_score_train = OLS_Result.score(X_train, Y_train)
print('The coefficient of determination for the training set: ', OLS_score_train)
OLS_RMSE_train = sqrt(mean_squared_error(Y_train, OLS_Result.predict(X_train)))
print('The RMSE for the training set: ', OLS_RMSE_train)
OLS_RMSE_test = sqrt(mean_squared_error(Y_test, OLS_Result.predict(X_test)))
print('The RMSE for the test set: ', OLS_RMSE_test)
OLS_score = OLS_Result.score(X_test, Y_test)
print('The coefficient of determination for the test set: ', OLS_score)

The coefficient of determination for the training set:  0.8752219241464316
The RMSE for the training set:  1.720586221735386
The RMSE for the test set:  1.4468699686979707
The coefficient of determination for the test set:  0.724885922738558


### 2.1.2 Lasso

In [5]:
Lasso = linear_model.Lasso()
Lasso_Result = Lasso.fit(X_train, Y_train)
Lasso_RMSE_train = sqrt(mean_squared_error(Y_train, Lasso_Result.predict(X_train)))
print('The RMSE for the training set: ', Lasso_RMSE_train)
Lasso_score_train = Lasso_Result.score(X_train, Y_train)
print('The coefficient of determination for the training set: ', Lasso_score_train)
Lasso_RMSE_test = sqrt(mean_squared_error(Y_test, Lasso_Result.predict(X_test)))
print('The RMSE for the test set: ', Lasso_RMSE_test)
Lasso_score = Lasso_Result.score(X_test, Y_test)
print('The coefficient of determination for the test set: ', Lasso_score)

The RMSE for the training set:  2.093006913223005
The coefficient of determination for the training set:  0.8153596016947813
The RMSE for the test set:  1.5516014640309304
The coefficient of determination for the test set:  0.6836162496984587


### 2.1.3 Ridge

In [6]:
Ridge = linear_model.Ridge()
Ridge_Result = Ridge.fit(X_train, Y_train)
Ridge_RMSE_train = sqrt(mean_squared_error(Y_train, Ridge_Result.predict(X_train)))
print('The RMSE for the training set: ', Ridge_RMSE_train)
Ridge_score_train = Ridge_Result.score(X_train, Y_train)
print('The coefficient of determination for the training set: ', Ridge_score_train)
Ridge_score = Ridge_Result.score(X_test, Y_test)
print('The coefficient of determination for the test set: ', Ridge_score)
Ridge_RMSE_test = sqrt(mean_squared_error(Y_test, Ridge_Result.predict(X_test)))
print('The RMSE for the test set: ', Ridge_RMSE_test)

The RMSE for the training set:  1.7205897860334354
The coefficient of determination for the training set:  0.8752214071752884
The coefficient of determination for the test set:  0.7250953432032068
The RMSE for the test set:  1.446319175607337
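
Both `Lasso()` and `Ridge()` above use scikit-learn's default penalty strength `alpha=1.0`. The sketch below shows one way the penalty might instead be chosen by cross-validation that respects time order, using `LassoCV`, `RidgeCV` and `TimeSeriesSplit`. The data, the `alphas` grid and the number of splits are illustrative assumptions; no claim is made about how this would change the results reported here.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in for the training set: 172 ordered rows, 8 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(172, 8))
true_coef = np.array([1.5, -1.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=1.0, size=172)

# Each fold trains on an initial segment and validates on the segment after it.
cv = TimeSeriesSplit(n_splits=5)
alphas = np.logspace(-3, 1, 20)  # candidate penalty strengths (assumed grid)

lasso_cv = LassoCV(alphas=alphas, cv=cv).fit(X_demo, y_demo)
ridge_cv = RidgeCV(alphas=alphas, cv=cv).fit(X_demo, y_demo)
print("Selected Lasso alpha:", lasso_cv.alpha_)
print("Selected Ridge alpha:", ridge_cv.alpha_)
```

The selected values are stored in the fitted estimators' `alpha_` attributes, and the refitted models can be scored on the held-out test set in the same way as the cells above.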


### 2.1.4 Least Angle Regression

In [7]:
Lars = linear_model.Lars()
Lars_Result = Lars.fit(X_train, Y_train)
Lars_RMSE_train = sqrt(mean_squared_error(Y_train, Lars_Result.predict(X_train)))
print('The RMSE for the training set: ', Lars_RMSE_train)
Lars_score_train = Lars_Result.score(X_train, Y_train)
print('The coefficient of determination for the training set: ', Lars_score_train)
Lars_RMSE_test = sqrt(mean_squared_error(Y_test, Lars_Result.predict(X_test)))
print('The RMSE for the test set: ', Lars_RMSE_test)
Lars_score = Lars_Result.score(X_test, Y_test)
print('The coefficient of determination for the test set: ', Lars_score)

The RMSE for the training set:  2.32158831868825
The coefficient of determination for the training set:  0.7728274659507648
The RMSE for the test set:  2.501962267134681
The coefficient of determination for the test set:  0.1773495877990554


## 2.2 Tree-based Models
### 2.2.1 Decision Tree

In [8]:
# Several values of max_depth were tested; a depth of 3 gave the best prediction result.
# The deeper the tree, the more likely it is to overfit, which worsens out-of-sample prediction.
# So the depth must be chosen carefully for the best prediction performance.
Tree = tree.DecisionTreeRegressor(max_depth = 3) 

Tree_Result = Tree.fit(X_train, Y_train)
Tree_RMSE_train = sqrt(mean_squared_error(Y_train, Tree_Result.predict(X_train)))
print('The RMSE for the training set: ', Tree_RMSE_train)
Tree_score_train = Tree_Result.score(X_train, Y_train)
print('The coefficient of determination for the training set: ', Tree_score_train)
Tree_RMSE_test = sqrt(mean_squared_error(Y_test, Tree_Result.predict(X_test)))
print('The RMSE for the test set: ', Tree_RMSE_test)
Tree_score = Tree_Result.score(X_test, Y_test)
print('The coefficient of determination for the test set: ', Tree_score)

The RMSE for the training set:  1.9610306824501544
The coefficient of determination for the training set:  0.8379107621658095
The RMSE for the test set:  2.1974089423824448
The coefficient of determination for the test set:  0.3654357760521424
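
The comments in the cell above say that several depths were tried. One way to run such a search without touching the test set is a grid search over `max_depth` with time-ordered folds inside the training data. The sketch below uses `GridSearchCV` with `TimeSeriesSplit` on synthetic data; the depth grid, the number of splits and the data are assumptions, and this is not necessarily how the depth of 3 was originally selected.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training set: the target depends on one threshold.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(172, 5))
y_demo = np.where(X_demo[:, 0] > 0, 2.0, -2.0) + rng.normal(scale=1.0, size=172)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 4, 5, 6]},
    cv=TimeSeriesSplit(n_splits=5),        # validation folds always follow training folds
    scoring="neg_mean_squared_error",      # higher (closer to zero) is better
)
search.fit(X_demo, y_demo)
print("Depth selected by cross-validation:", search.best_params_["max_depth"])
```

After fitting, `search.best_estimator_` holds a tree refitted on the full training data with the selected depth, which can then be evaluated once on the test set.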


## 2.3 Ensemble Models
### 2.3.1 Random Forest

In [9]:
# max_depth is chosen as 2 for the best prediction result.
RF = ensemble.RandomForestRegressor(max_depth = 2) 

RF_Result = RF.fit(X_train, Y_train)
RF_RMSE_train = sqrt(mean_squared_error(Y_train, RF_Result.predict(X_train)))
print('The RMSE for the training set: ', RF_RMSE_train)
RF_score_train = RF_Result.score(X_train, Y_train)
print('The coefficient of determination for the training set: ', RF_score_train)
RF_RMSE_test = sqrt(mean_squared_error(Y_test, RF_Result.predict(X_test)))
print('The RMSE for the test set: ', RF_RMSE_test)
RF_score = RF_Result.score(X_test, Y_test)
print('The coefficient of determination for the test set: ', RF_score)

The RMSE for the training set:  2.2929929096604464
The coefficient of determination for the training set:  0.7783892485732464
The RMSE for the test set:  1.9851480063261138
The coefficient of determination for the test set:  0.482107535261296


### 2.3.2 Extremely Randomized Trees

In [10]:
ET = ensemble.ExtraTreesRegressor(max_depth = 2) 

ET_Result = ET.fit(X_train, Y_train)
ET_RMSE_train = sqrt(mean_squared_error(Y_train, ET_Result.predict(X_train)))
print('The RMSE for the training set: ', ET_RMSE_train)
ET_score_train = ET_Result.score(X_train, Y_train)
print('The coefficient of determination for the training set: ', ET_score_train)
ET_RMSE_test = sqrt(mean_squared_error(Y_test, ET_Result.predict(X_test)))
print('The RMSE for the test set: ', ET_RMSE_test)
ET_score = ET_Result.score(X_test, Y_test)
print('The coefficient of determination for the test set: ', ET_score)

The RMSE for the training set:  2.5315237594999327
The coefficient of determination for the training set:  0.7298845545352218
The RMSE for the test set:  1.7851723472252061
The coefficient of determination for the test set:  0.5811928282074339


### 2.3.3 AdaBoost

In [11]:
AdaBoost = ensemble.AdaBoostRegressor() 

AdaBoost_Result = AdaBoost.fit(X_train, Y_train)
AdaBoost_RMSE_train = sqrt(mean_squared_error(Y_train, AdaBoost_Result.predict(X_train)))
print('The RMSE for the training set: ', AdaBoost_RMSE_train)
AdaBoost_score_train = AdaBoost_Result.score(X_train, Y_train)
print('The coefficient of determination for the training set: ', AdaBoost_score_train)
AdaBoost_RMSE_test = sqrt(mean_squared_error(Y_test, AdaBoost_Result.predict(X_test)))
print('The RMSE for the test set: ', AdaBoost_RMSE_test)
AdaBoost_score = AdaBoost_Result.score(X_test, Y_test)
print('The coefficient of determination for the test set: ', AdaBoost_score)

The RMSE for the training set:  1.4361593884495967
The coefficient of determination for the training set:  0.9130657951912406
The RMSE for the test set:  1.903491394815737
The coefficient of determination for the test set:  0.5238370019823799


### 2.3.4 Bagging

In [12]:
Bagging = ensemble.BaggingRegressor() 

Bagging_Result = Bagging.fit(X_train, Y_train)
Bagging_RMSE_train = sqrt(mean_squared_error(Y_train, Bagging_Result.predict(X_train)))
print('The RMSE for the training set: ', Bagging_RMSE_train)
Bagging_score_train = Bagging_Result.score(X_train, Y_train)
print('The coefficient of determination for the training set: ', Bagging_score_train)
Bagging_RMSE_test = sqrt(mean_squared_error(Y_test, Bagging_Result.predict(X_test)))
print('The RMSE for the test set: ', Bagging_RMSE_test)
Bagging_score = Bagging_Result.score(X_test, Y_test)
print('The coefficient of determination for the test set: ', Bagging_score)

The RMSE for the training set:  1.217923879160087
The coefficient of determination for the training set:  0.9374790319637931
The RMSE for the test set:  2.074355844566227
The coefficient of determination for the test set:  0.4345159927307778


### 2.3.5 Gradient Boosting

In [13]:
GradientB = ensemble.GradientBoostingRegressor() 

GradientB_Result = GradientB.fit(X_train, Y_train)
GradientB_RMSE_train = sqrt(mean_squared_error(Y_train, GradientB_Result.predict(X_train)))
print('The RMSE for the training set: ', GradientB_RMSE_train)
GradientB_score_train = GradientB_Result.score(X_train, Y_train)
print('The coefficient of determination for the training set: ', GradientB_score_train)
GradientB_RMSE_test = sqrt(mean_squared_error(Y_test, GradientB_Result.predict(X_test)))
print('The RMSE for the test set: ', GradientB_RMSE_test)
GradientB_score = GradientB_Result.score(X_test, Y_test)
print('The coefficient of determination for the test set: ', GradientB_score)

The RMSE for the training set:  0.3499006538889303
The coefficient of determination for the training set:  0.9948397004717816
The RMSE for the test set:  1.9246034363967868
The coefficient of determination for the test set:  0.5132159697056533


## 2.4 K-Nearest Neighbors

In [14]:
# leaf_size only affects the speed of the neighbor search, not the predictions, so it is left at its default.
KNN = neighbors.KNeighborsRegressor(n_neighbors = 10) 

KNN_Result = KNN.fit(X_train, Y_train)
KNN_RMSE_train = sqrt(mean_squared_error(Y_train, KNN_Result.predict(X_train)))
print('The RMSE for the training set: ', KNN_RMSE_train)
KNN_score_train = KNN_Result.score(X_train, Y_train)
print('The coefficient of determination for the training set: ', KNN_score_train)
KNN_RMSE_test = sqrt(mean_squared_error(Y_test, KNN_Result.predict(X_test)))
print('The RMSE for the test set: ', KNN_RMSE_test)
KNN_score = KNN_Result.score(X_test, Y_test)
print('The coefficient of determination for the test set: ', KNN_score)

The RMSE for the training set:  2.930335157889267
The coefficient of determination for the training set:  0.6380738240285253
The RMSE for the test set:  1.8382644606132141
The coefficient of determination for the test set:  0.55591123068175


## 2.5 Neural Network

In [15]:
NNW = neural_network.MLPRegressor() 

NNW_Result = NNW.fit(X_train, Y_train)
NNW_RMSE_train = sqrt(mean_squared_error(Y_train, NNW_Result.predict(X_train)))
print('The RMSE for the training set: ', NNW_RMSE_train)
NNW_score_train = NNW_Result.score(X_train, Y_train)
print('The coefficient of determination for the training set: ', NNW_score_train)
NNW_RMSE_test = sqrt(mean_squared_error(Y_test, NNW_Result.predict(X_test)))
print('The RMSE for the test set: ', NNW_RMSE_test)
NNW_score = NNW_Result.score(X_test, Y_test)
print('The coefficient of determination for the test set: ', NNW_score)

The RMSE for the training set:  2.945498029020555
The coefficient of determination for the training set:  0.6343185960808185
The RMSE for the test set:  3.1692069472721074
The coefficient of determination for the test set:  -0.3199424823507644
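
`MLPRegressor()` is used above with its default settings and unscaled inputs, and its weights are initialised randomly, so results vary between runs unless `random_state` is set. Multi-layer perceptrons are also sensitive to feature scaling. The sketch below shows one common arrangement: standardising the features inside a pipeline and fixing the seed. The data, layer size and iteration budget are illustrative assumptions, and the example makes no claim about how the notebook's neural-network result would change.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 215 ordered rows, split 172/43 as in the notebook.
rng = np.random.default_rng(0)
X_demo = rng.normal(scale=5.0, size=(215, 6))
y_demo = 0.3 * X_demo[:, 0] - 0.2 * X_demo[:, 1] + rng.normal(size=215)
X_tr, X_te = X_demo[:172], X_demo[172:]
y_tr, y_te = y_demo[:172], y_demo[172:]

# The scaler is fitted on the training rows only, then reused for the test rows.
scaled_mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)
scaled_mlp.fit(X_tr, y_tr)
test_r2 = scaled_mlp.score(X_te, y_te)
print("Test R^2 on the synthetic data:", test_r2)
```

Keeping the scaler inside the pipeline avoids leaking test-period statistics into the training step.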


# 3. Summary
Prediction Performance Comparison
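
Each model in Section 2 is fitted and scored with the same four steps, so the comparison table could also be produced by looping over a dictionary of estimators. The sketch below shows the idea on synthetic data with three of the models. The helper name `evaluate`, the demo data and the sorting by RMSE are assumptions added for illustration.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model, tree
from sklearn.metrics import mean_squared_error

def evaluate(model, X_train, Y_train, X_test, Y_test):
    """Fit a model and return its test-set R^2 and RMSE."""
    fitted = model.fit(X_train, Y_train)
    rmse = float(np.sqrt(mean_squared_error(Y_test, fitted.predict(X_test))))
    return {"R^2": fitted.score(X_test, Y_test), "RMSE": rmse}

# Synthetic stand-in with a chronological 172/43 split.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(215, 4))
y_demo = X_demo @ np.array([1.0, -0.5, 0.0, 0.25]) + rng.normal(size=215)
X_tr, X_te, y_tr, y_te = X_demo[:172], X_demo[172:], y_demo[:172], y_demo[172:]

models = {
    "OLS": linear_model.LinearRegression(),
    "Ridge": linear_model.Ridge(),
    "Decision Tree": tree.DecisionTreeRegressor(max_depth=3, random_state=0),
}

# One row per model, sorted so the lowest test RMSE comes first.
df_demo = pd.DataFrame(
    {name: evaluate(m, X_tr, y_tr, X_te, y_te) for name, m in models.items()}
).T.sort_values("RMSE")
print(df_demo)
```

Adding another model then only requires one more dictionary entry.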

In [16]:
Results = np.array([[OLS_score, OLS_RMSE_test], [Lasso_score, Lasso_RMSE_test], 
                   [Ridge_score, Ridge_RMSE_test], [Lars_score, Lars_RMSE_test], 
                   [Tree_score, Tree_RMSE_test], [RF_score, RF_RMSE_test], 
                   [ET_score, ET_RMSE_test], [AdaBoost_score, AdaBoost_RMSE_test], 
                   [Bagging_score, Bagging_RMSE_test], [GradientB_score, GradientB_RMSE_test], 
                   [KNN_score, KNN_RMSE_test], [NNW_score, NNW_RMSE_test]])
df_Result = pd.DataFrame(Results, ['OLS', 'Lasso', 'Ridge', 'Lars', 'Decision Tree', 'Random Forest',
                                   'ExtraTree', 'AdaBoost', 'Bagging', 'Gradient Boosting', 'KNN', 'Neural Network'],
                        ['R^2', 'RMSE'])
df_Result

Model,R^2,RMSE
OLS,0.724886,1.44687
Lasso,0.683616,1.551601
Ridge,0.725095,1.446319
Lars,0.17735,2.501962
Decision Tree,0.365436,2.197409
Random Forest,0.482108,1.985148
ExtraTree,0.581193,1.785172
AdaBoost,0.523837,1.903491
Bagging,0.434516,2.074356
Gradient Boosting,0.513216,1.924603
