# Predicting The Stock Market

In this project, we work with daily data for the S&P500 Index and try to predict its closing price. The dataset contains a daily record of the index from 1950 to 2015. We train the model on data up to the end of 2012 and make predictions for 2013-2015.

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

sp = pd.read_csv('sphist.csv')
sp['Date'] = pd.to_datetime(sp['Date'])
sp = sp.sort_values('Date').set_index('Date')
sp.head()

Date,Open,High,Low,Close,Volume,Adj Close
1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66
1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85
1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93
1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98
1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08


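We create new columns holding the average and standard deviation of the closing price over the past n days. We also add the ratio between the 5-day and 365-day averages, and between the 5-day and 365-day standard deviations. Windows such as `'5d'` are calendar-day offsets, so a window can contain fewer than five trading days when it spans a weekend. Each series is shifted by one row first so that the current day's price is excluded; including it would leak the target into the features.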

In [2]:
# avg5 - average price from the past 5 days
sp['avg5'] = sp['Close'].shift(1).rolling('5d').mean()
      
# avg30 - average price for the past 30 days
sp['avg30'] = sp['Close'].shift(1).rolling('30d').mean()

# avg365 - average price for the past 365 days
sp['avg365'] = sp['Close'].shift(1).rolling('365d').mean()
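
To illustrate the windowing used above, here is a minimal sketch on made-up prices (the dates and values are assumptions, not S&P500 data). It contrasts an offset window such as `'5d'`, which counts calendar days, with an integer window, which counts rows, and shows that `shift(1)` keeps the current day's close out of both.

```python
import pandas as pd

# Toy closing prices on six consecutive trading days (values are made up).
dates = pd.to_datetime(
    ["2020-01-06", "2020-01-07", "2020-01-08",
     "2020-01-09", "2020-01-10", "2020-01-13"]
)
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0], index=dates)

# shift(1) moves every close down one row, so each row only sees earlier closes.
prev_close = close.shift(1)

# Offset window: all rows that fall within the last 5 calendar days.
avg_5_calendar_days = prev_close.rolling("5d").mean()

# Integer window: exactly the last 5 rows, i.e. 5 trading days.
avg_5_rows = prev_close.rolling(5).mean()

monday = pd.Timestamp("2020-01-13")
print(avg_5_calendar_days[monday])  # averages the closes of Jan 8, 9 and 10
print(avg_5_rows[monday])           # averages the closes of Jan 6 through Jan 10
```

On the Monday row the weekend falls inside the offset window, so it averages three closes rather than five. Neither average includes Monday's own close of 15.0.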

In [3]:
# sd5 - standard deviation of the price over the past 5 days
sp['sd5'] = sp['Close'].shift(1).rolling('5d').std()

# sd365 - standard deviation of the price over the past 365 days
sp['sd365'] = sp['Close'].shift(1).rolling('365d').std()

In [4]:
# ratio between avg5 and avg365
sp['ratio_avg5_365'] = sp['avg5']/sp['avg365']

# ratio between sd5 and sd365
sp['ratio_sd5_365'] = sp['sd5']/sp['sd365']

Date information such as the year, month, day, and weekday may also help with prediction.

In [15]:
sp['year'] = sp.index.year
sp['month'] = sp.index.month
sp['day'] = sp.index.day
sp['weekday'] = sp.index.weekday
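
As a quick check of what these attributes return, the sketch below builds the same four columns for two dates taken from the tail of the dataset. The variable names `idx` and `features` are illustrative only. Note that pandas numbers weekdays from 0 (Monday) to 6 (Sunday).

```python
import pandas as pd

# Two dates that appear at the end of the dataset.
idx = pd.DatetimeIndex(["2015-12-04", "2015-12-07"])

# Build the calendar features the same way as in the cell above.
features = pd.DataFrame(index=idx)
features["year"] = idx.year
features["month"] = idx.month
features["day"] = idx.day
features["weekday"] = idx.weekday  # 0 = Monday, ..., 6 = Sunday

print(features)
```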

In addition, we add the previous trading day's closing price as a feature.

In [37]:
sp['last_price'] = sp['Close'].shift(1)

In [38]:
sp.head(3)

Date,Open,High,Low,Close,Volume,Adj Close,avg5,avg30,avg365,sd5,sd365,ratio_avg5_365,ratio_sd5_365,year,month,day,weekday,last_price
1951-01-03,20.690001,20.690001,20.690001,20.690001,3370000.0,20.690001,20.6,19.801,18.40676,0.240416,1.068383,1.119154,0.225028,1951,1,3,2,
1951-01-04,20.870001,20.870001,20.870001,20.870001,3390000.0,20.870001,20.63,19.8855,18.42288,0.177764,1.072317,1.119803,0.165776,1951,1,4,3,20.690001
1951-01-05,20.870001,20.870001,20.870001,20.870001,3390000.0,20.870001,20.69,19.9635,18.43896,0.188326,1.078758,1.122081,0.174577,1951,1,5,4,20.870001


In [39]:
sp.tail(3)

Date,Open,High,Low,Close,Volume,Adj Close,avg5,avg30,avg365,sd5,sd365,ratio_avg5_365,ratio_sd5_365,year,month,day,weekday,last_price
2015-12-03,2080.709961,2085.0,2042.349976,2049.620117,4306490000.0,2049.620117,2088.164978,2080.496187,2061.330676,10.771359,55.327616,1.013018,0.194683,2015,12,3,3,2079.51001
2015-12-04,2051.23999,2093.840088,2051.23999,2091.689941,4214910000.0,2091.689941,2080.456006,2077.630952,2061.23262,19.599946,55.326382,1.009326,0.35426,2015,12,4,4,2049.620117
2015-12-07,2090.419922,2090.419922,2066.780029,2077.070068,4043820000.0,2077.070068,2073.606689,2075.984997,2061.311073,21.647273,55.355606,1.005965,0.391058,2015,12,7,0,2091.689941


Some of our features need up to 365 days of history, and the data starts on 1950-01-03, so rows before 1951-01-03 do not have enough history behind them to compute those features reliably. We therefore drop every row that contains NA and every row dated before 1951-01-03.

In [40]:
# drop NA rows
sp = sp.dropna(axis=0)

# keep all rows after 1951-01-02
sp = sp[sp.index>datetime(year = 1951, month=1, day=2)]
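
The same two-step cleanup can be sketched on a small made-up frame (the values and the names `toy` and `cleaned` are assumptions for illustration). A `pd.Timestamp` cutoff with `>=` states the first date to keep directly, which is equivalent to the `>` comparison against 1951-01-02 for daily data.

```python
import numpy as np
import pandas as pd

# Made-up rows around the cutoff; the first one lacks a 365-day average.
idx = pd.to_datetime(["1950-12-29", "1951-01-02", "1951-01-03", "1951-01-04"])
toy = pd.DataFrame(
    {"Close": [20.4, 20.8, 20.7, 20.9],
     "avg365": [np.nan, 18.3, 18.4, 18.4]},
    index=idx,
)

# Step 1: drop rows with any missing feature.
cleaned = toy.dropna(axis=0)

# Step 2: keep rows on or after the first date with a full year of history.
cutoff = pd.Timestamp("1951-01-03")
cleaned = cleaned[cleaned.index >= cutoff]

print(cleaned)
```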

To test the accuracy of our model, we split the data chronologically into a training set (all rows before 2013-01-01) and a test set (2013-2015).

In [41]:
train = sp[sp.index<datetime(year = 2013, month=1, day=1)]
test = sp[sp.index>=datetime(year = 2013, month=1, day=1)]
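
For time series, the split must be chronological so that the model never trains on data from after the test period. The sketch below, on a made-up frame (`toy`, `train_toy` and `test_toy` are illustrative names), shows two sanity checks that could be run after such a split.

```python
import pandas as pd

# Made-up frame spanning the cutoff; freq="B" generates weekdays only.
idx = pd.date_range("2012-12-27", "2013-01-04", freq="B")
toy = pd.DataFrame({"Close": range(len(idx))}, index=idx)

cutoff = pd.Timestamp("2013-01-01")
train_toy = toy[toy.index < cutoff]
test_toy = toy[toy.index >= cutoff]

# Check 1: every training row precedes every test row.
no_overlap = train_toy.index.max() < test_toy.index.min()

# Check 2: no row was lost or duplicated by the split.
complete = len(train_toy) + len(test_toy) == len(toy)

print(no_overlap, complete)
```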

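We fit several linear regression models, each on a different subset of features, to predict the closing price of the S&P500, and compare them by their root mean squared error (RMSE) on the test set.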

In [42]:
train.columns

Index(['Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close', 'avg5', 'avg30',
       'avg365', 'sd5', 'sd365', 'ratio_avg5_365', 'ratio_sd5_365', 'year',
       'month', 'day', 'weekday', 'last_price'],
      dtype='object')

In [43]:
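# Fit a linear regression on the given features and return the test-set RMSE
def train_test(features):
    """Train on `train`, predict on `test` (the DataFrames defined above)."""
    lr = LinearRegression()
    target = 'Close'

    lr.fit(train[features], train[target])
    prediction = lr.predict(test[features])

    # mean_squared_error expects (y_true, y_pred); the square root gives RMSE
    rmse = mean_squared_error(test[target], prediction) ** 0.5
    return rmse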
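
As a small worked example of the metric, the sketch below computes RMSE on four made-up values both through `mean_squared_error` and directly from the definition, the square root of the mean squared difference between actual and predicted values. The arrays `y_true` and `y_pred` are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted prices.
y_true = np.array([100.0, 102.0, 101.0, 105.0])
y_pred = np.array([101.0, 101.0, 103.0, 104.0])

# RMSE via scikit-learn: square root of the mean squared error.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

# RMSE from the definition: errors are -1, 1, -2, 1, so the mean square is 7 / 4.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(rmse_sklearn, rmse_manual)
```

RMSE is expressed in the same units as the target, so here it can be read as a typical error in index points.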

In [44]:
# RMSE of the model using all newly created features
train_test(train.columns[6:])

15.155646793076661

In [45]:
# RMSE of date model
train_test(['year','month','day'])

724.7651207013593

In [46]:
# RMSE of average model
train_test(['avg5', 'avg30', 'avg365'])

19.67430068286049

In [47]:
# RMSE of sd model
train_test(['sd5', 'sd365'])

896.8407534210022

In [48]:
# RMSE of ratio model
train_test(['ratio_avg5_365', 'ratio_sd5_365'])

1451.58053399896

In [49]:
# RMSE of the model using the last 5 days' average and sd
train_test(['avg5', 'sd5'])

19.668688528427683

In [51]:
# RMSE of last price model
train_test(['last_price'])

15.145703700718299

## Summary

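Using the most recent price as the only feature gives the lowest RMSE (15.15), marginally ahead of the model that uses all of the newly created features (15.16). The models based on averages reach about 19.67, while those built only on dates, standard deviations, or ratios perform far worse (RMSE above 700). This suggests that, for a linear model, older historical data adds little to what the previous day's price already tells us about the next closing price.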
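
One way to put the last-price result in context is to compare it with a naive persistence forecast, which simply predicts that today's close equals yesterday's close without fitting anything. The sketch below computes that baseline RMSE on made-up prices. It is an illustration under assumed data, not a result from the S&P500 dataset.

```python
import numpy as np
import pandas as pd

# Made-up closing prices.
close = pd.Series([100.0, 102.0, 101.0, 105.0, 104.0])

# Persistence forecast: yesterday's close is the prediction for today.
last_price = close.shift(1)

# The first row has no previous close, so leave it out of the evaluation.
valid = last_price.notna()
naive_rmse = np.sqrt(np.mean((close[valid] - last_price[valid]) ** 2))

print(naive_rmse)
```

If a fitted model does not beat this baseline on the test set, its features are adding little beyond the previous day's price.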