# Predicting the stock market

In this project, we will try to predict the daily closing price of the S&P 500 index using linear regression.
The data set is available on Kaggle [here](https://www.kaggle.com/samaxtech/sp500-index-data).

In [36]:
import pandas as pd

stock = pd.read_csv('sphist.csv')
stock.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,2015-12-07,2090.419922,2090.419922,2066.780029,2077.070068,4043820000.0,2077.070068
1,2015-12-04,2051.23999,2093.840088,2051.23999,2091.689941,4214910000.0,2091.689941
2,2015-12-03,2080.709961,2085.0,2042.349976,2049.620117,4306490000.0,2049.620117
3,2015-12-02,2101.709961,2104.27002,2077.110107,2079.51001,3950640000.0,2079.51001
4,2015-12-01,2082.929932,2103.370117,2082.929932,2102.629883,3712120000.0,2102.629883


In [37]:
stock.dtypes

Date          object
Open         float64
High         float64
Low          float64
Close        float64
Volume       float64
Adj Close    float64
dtype: object

In [38]:
# convert the object type to datetime type in the Date column
stock.Date = pd.to_datetime(stock.Date)
stock.dtypes

Date         datetime64[ns]
Open                float64
High                float64
Low                 float64
Close               float64
Volume              float64
Adj Close           float64
dtype: object

In [39]:
sorted_stock = stock.sort_values('Date')
sorted_stock.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
16589,1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66
16588,1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85
16587,1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93
16586,1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98
16585,1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08


# Generating Indicators

We will generate some indicators from the closing price to help us predict the market. Each "day" below is one row of the data set, that is, one trading day, so a 365-day window spans roughly a year and a half of calendar time. Every indicator for a given day uses only the days before it, never that day's own price:

- The average price over the past 5 days.
- The average price over the past 30 days.
- The average price over the past 365 days.
- The ratio between the average price over the past 5 days and the average price over the past 365 days.
- The standard deviation of the price over the past 5 days.
- The standard deviation of the price over the past 365 days.
- The ratio between the standard deviation over the past 5 days and the standard deviation over the past 365 days.
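
To see why the current day has to be excluded from each window, here is a minimal sketch on a toy series. The prices and the 3-row window are made up for illustration and are not taken from `sphist.csv`. A plain rolling mean includes the current row, so shifting the result down by one row leaves only past values in each indicator.

```python
import pandas as pd

# Toy closing prices (made-up values, not from sphist.csv).
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0], name="Close")

# Rolling mean over a 3-row window: the value at row i includes row i itself.
ma_3_with_today = close.rolling(window=3).mean()

# Shift down by one row: the value at row i now covers rows i-3 .. i-1 only.
ma_3_past_only = ma_3_with_today.shift(1)

demo = pd.DataFrame({
    "Close": close,
    "MA_3_with_today": ma_3_with_today,
    "MA_3_past_only": ma_3_past_only,
})
print(demo)
```

In the printed frame, the row with `Close` 13.0 has `MA_3_past_only` equal to 11.0, the mean of 10, 11 and 12, while the unshifted column already contains 13.0 in its average. The first three rows of the shifted column are `NaN` because no complete past window exists yet.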

In [40]:
sorted_stock['MA_5'] = sorted_stock.Close.rolling(window=5).mean().shift(1)
sorted_stock['MA_30'] = sorted_stock.Close.rolling(window=30).mean().shift(1)
sorted_stock['MA_365'] = sorted_stock.Close.rolling(window=365).mean().shift(1)

sorted_stock['STD_5'] = sorted_stock.Close.rolling(window=5).std().shift(1)
sorted_stock['STD_365'] = sorted_stock.Close.rolling(window=365).std().shift(1)

sorted_stock['MA_5_to_365_ratio'] = sorted_stock.MA_5 / sorted_stock.MA_365
sorted_stock['STD_5_to_365_ratio'] = sorted_stock.STD_5 / sorted_stock.STD_365

sorted_stock.head(367)

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,MA_5,MA_30,MA_365,STD_5,STD_365,MA_5_to_365_ratio,STD_5_to_365_ratio
16589,1950-01-03,16.660000,16.660000,16.660000,16.660000,1260000.0,16.660000,,,,,,,
16588,1950-01-04,16.850000,16.850000,16.850000,16.850000,1890000.0,16.850000,,,,,,,
16587,1950-01-05,16.930000,16.930000,16.930000,16.930000,2550000.0,16.930000,,,,,,,
16586,1950-01-06,16.980000,16.980000,16.980000,16.980000,2010000.0,16.980000,,,,,,,
16585,1950-01-09,17.080000,17.080000,17.080000,17.080000,2520000.0,17.080000,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16227,1951-06-14,21.840000,21.840000,21.840000,21.840000,1300000.0,21.840000,21.546,21.779000,,0.045056,,,
16226,1951-06-15,22.040001,22.040001,22.040001,22.040001,1370000.0,22.040001,21.602,21.753000,,0.140250,,,
16225,1951-06-18,22.049999,22.049999,22.049999,22.049999,1050000.0,22.049999,21.712,21.727333,,0.222194,,,
16224,1951-06-19,22.020000,22.020000,22.020000,22.020000,1100000.0,22.020000,21.800,21.703333,19.447726,0.256223,1.790253,1.120954,0.143121


In [41]:
clean_stock = sorted_stock.dropna(axis=0)
clean_stock.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,MA_5,MA_30,MA_365,STD_5,STD_365,MA_5_to_365_ratio,STD_5_to_365_ratio
16224,1951-06-19,22.02,22.02,22.02,22.02,1100000.0,22.02,21.8,21.703333,19.447726,0.256223,1.790253,1.120954,0.143121
16223,1951-06-20,21.91,21.91,21.91,21.91,1120000.0,21.91,21.9,21.683,19.462411,0.213659,1.789307,1.125246,0.119409
16222,1951-06-21,21.780001,21.780001,21.780001,21.780001,1100000.0,21.780001,21.972,21.659667,19.476274,0.092574,1.788613,1.128142,0.051758
16221,1951-06-22,21.549999,21.549999,21.549999,21.549999,1340000.0,21.549999,21.96,21.631,19.489562,0.115108,1.787659,1.126757,0.06439
16220,1951-06-25,21.290001,21.290001,21.290001,21.290001,2440000.0,21.290001,21.862,21.599,19.502082,0.204132,1.786038,1.121008,0.114293


In [42]:
# split chronologically: train on rows before 2013-01-01, test on rows from 2013-01-01 onward
from datetime import datetime
train = clean_stock[clean_stock.Date < datetime(year = 2013, month = 1, day = 1)]
test = clean_stock[clean_stock.Date >= datetime(year = 2013, month = 1, day = 1)]

print(train.shape)
print(test.shape)

(15486, 14)
(739, 14)
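
The same chronological split can be wrapped in a small helper, sketched below. The function name `split_by_date` and the toy frame are hypothetical and only illustrate the idea. Splitting on a date rather than at random keeps every test row later than every training row, so the model is never fitted on data from the period it is evaluated on.

```python
import pandas as pd

def split_by_date(df, cutoff, date_col="Date"):
    """Return (train, test): rows strictly before `cutoff`, and rows on or after it."""
    cutoff = pd.Timestamp(cutoff)
    train_part = df[df[date_col] < cutoff]
    test_part = df[df[date_col] >= cutoff]
    return train_part, test_part

# Toy frame with made-up values, only to show the behaviour around the cutoff.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2012-12-28", "2012-12-31", "2013-01-02", "2013-01-03"]),
    "Close": [1.0, 2.0, 3.0, 4.0],
})

toy_train, toy_test = split_by_date(toy, "2013-01-01")
print(len(toy_train), len(toy_test))
```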


# Make predictions


In [43]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

features = ['MA_5', 'MA_30', 'MA_365', 'STD_5', 'STD_365', 'MA_5_to_365_ratio', 'STD_5_to_365_ratio']
target = 'Close'

lr = LinearRegression()
lr.fit(train[features], train[target])
predictions = lr.predict(test[features])

mae = mean_absolute_error(test[target], predictions)

print('Mean absolute error (test): {}'.format(mae))
# note: lr.score is evaluated on the training data here, so this is the in-sample R^2
print('Coefficient of determination (train): {}'.format(lr.score(train[features], train[target])))

Mean absolute error (test): 16.145140609743724
Coefficient of determination (train): 0.9995223668123336
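
An error figure is easier to judge next to a baseline. The sketch below is one possible check, not part of the original analysis: it uses a synthetic random walk in place of `sphist.csv`, a single `MA_5` feature, and a hypothetical `prev_close` column as the naive "tomorrow equals today" prediction. It also computes R^2 on the held-out rows with `r2_score` rather than on the training rows.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic random-walk "prices" (an assumption standing in for the real data).
rng = np.random.default_rng(0)
df = pd.DataFrame({"Close": 100 + rng.normal(0, 1, 600).cumsum()})

# Past-only indicator and the naive baseline (previous row's close).
df["MA_5"] = df.Close.rolling(window=5).mean().shift(1)
df["prev_close"] = df.Close.shift(1)
df = df.dropna()

# Chronological split: first 500 usable rows for training, the rest for testing.
train_s, test_s = df.iloc[:500], df.iloc[500:]

model = LinearRegression().fit(train_s[["MA_5"]], train_s["Close"])
pred = model.predict(test_s[["MA_5"]])

model_mae = mean_absolute_error(test_s["Close"], pred)
test_r2 = r2_score(test_s["Close"], pred)
naive_mae = mean_absolute_error(test_s["Close"], test_s["prev_close"])

print("model MAE (test):", model_mae)
print("model R^2 (test):", test_r2)
print("naive MAE (test):", naive_mae)
```

Applied to the real `train` and `test` frames, the same comparison would show whether the regression improves on simply carrying the previous close forward.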
