## Introduction
In this mission, we'll be working with a CSV file containing index prices. Each row in the file contains a daily record of the price of the S&P 500 Index from 1950 to 2015. The dataset is stored in sphist.csv.

The columns of the dataset are:

* **Date** -- The date of the record.
* **Open** -- The opening price of the day (when trading starts).
* **High** -- The highest trade price during the day.
* **Low** -- The lowest trade price during the day.
* **Close** -- The closing price for the day (when trading is finished).
* **Volume** -- The number of shares traded.
* **Adj Close** -- The daily closing price, adjusted retroactively to include any corporate actions.

We'll be using this dataset to develop a predictive model. We'll train the model with data from 1950-2012 and try to make predictions for 2013-2015.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from datetime import datetime

In [2]:
# Read the Dataset.
df = pd.read_csv('sphist.csv')
df.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,2015-12-07,2090.419922,2090.419922,2066.780029,2077.070068,4043820000.0,2077.070068
1,2015-12-04,2051.23999,2093.840088,2051.23999,2091.689941,4214910000.0,2091.689941
2,2015-12-03,2080.709961,2085.0,2042.349976,2049.620117,4306490000.0,2049.620117
3,2015-12-02,2101.709961,2104.27002,2077.110107,2079.51001,3950640000.0,2079.51001
4,2015-12-01,2082.929932,2103.370117,2082.929932,2102.629883,3712120000.0,2102.629883


In [3]:
# Print the datatypes of each column.
df.dtypes

Date          object
Open         float64
High         float64
Low          float64
Close        float64
Volume       float64
Adj Close    float64
dtype: object

In [4]:
# Convert the Date column to datetime.
df['Date'] = pd.to_datetime(df['Date'])

# Again check the datatypes of each column.
df.dtypes

Date         datetime64[ns]
Open                float64
High                float64
Low                 float64
Close               float64
Volume              float64
Adj Close           float64
dtype: object

In [5]:
# Sort df in ascending order of date.
df = df.sort_values('Date', ascending=True)
df.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
16589,1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66
16588,1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85
16587,1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93
16586,1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98
16585,1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08


## Calculate moving averages

Datasets taken from the stock market need to be handled differently than datasets from other sectors when it comes time to make predictions. In a normal machine learning exercise, we treat each row as independent. Stock market data is sequential, and each observation comes a day after the previous observation. Thus, the observations are not all independent, and you can't treat them as such.

This means you have to be extra careful not to inject "future" knowledge into past rows when you do training and prediction. Injecting future knowledge will make the model look good when you're training and testing it, but will make it fail in the real world. This is how many algorithmic traders lose money.

The time series nature of the data means that you can generate indicators to make the model more accurate. For instance, you can create a new column that contains the average price of the last 10 trades for each row. This incorporates information from multiple prior rows into one and can make predictions considerably more accurate.

When you do this, you have to be careful not to use the current row in the values you average. You want to teach the model how to predict the current price from historical prices. If you include the current price in the prices you average, it will be equivalent to handing the answers to the model upfront, and will make it impossible to use in the "real world", where you don't know the price upfront.
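
To see why excluding the current row matters, here is a minimal sketch on a made-up five-value series. The values and the names `prices`, `leaky`, and `safe` are illustrative assumptions, not taken from sphist.csv.

```python
import pandas as pd

# Hypothetical toy prices, one per trading day (not taken from sphist.csv).
prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])

# rolling(3).mean() averages the current row and the two rows before it,
# so the value at position i already "knows" the price at position i.
leaky = prices.rolling(3).mean()

# Shifting by one row moves each average down a row, so the value at
# position i only uses the prices at positions i-3, i-2 and i-1.
safe = prices.rolling(3).mean().shift(1)

print(pd.DataFrame({"price": prices, "leaky": leaky, "safe": safe}))
```

In the printed frame, the `safe` column at each row equals the `leaky` value from the row above, which is the average a trader could actually have known before that day's close.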

Here are some indicators that are interesting to generate for each row:

* The average price for the past 5 days.
* The average price for the past 30 days.
* The average price for the past 365 days.
* The ratio between the average price for the past 5 days, and the average price for the past 365 days.
* The standard deviation of the price over the past 5 days.
* The standard deviation of the price over the past 365 days.
* The ratio between the standard deviation for the past 5 days, and the standard deviation for the past 365 days.


In [6]:
# Compute the rolling means and standard deviations.
# rolling() includes the current day's price in each window, so shift()
# moves every result down one row to use only prior days.
df['day_5_mean'] = df['Close'].rolling(5).mean().shift()
df['day_30_mean'] = df['Close'].rolling(30).mean().shift()
df['day_365_mean'] = df['Close'].rolling(365).mean().shift()
df['mean_ratio_5_365'] = df['day_5_mean'] / df['day_365_mean']

df['day_5_std'] = df['Close'].rolling(5).std().shift()
df['day_365_std'] = df['Close'].rolling(365).std().shift()
df['std_ratio_5_365'] = df['day_5_std'] / df['day_365_std']
df.head(10)

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,day_5_mean,day_30_mean,day_365_mean,mean_ratio_5_365,day_5_std,day_365_std,std_ratio_5_365
16589,1950-01-03,16.66,16.66,16.66,16.66,1260000.0,16.66,,,,,,,
16588,1950-01-04,16.85,16.85,16.85,16.85,1890000.0,16.85,,,,,,,
16587,1950-01-05,16.93,16.93,16.93,16.93,2550000.0,16.93,,,,,,,
16586,1950-01-06,16.98,16.98,16.98,16.98,2010000.0,16.98,,,,,,,
16585,1950-01-09,17.08,17.08,17.08,17.08,2520000.0,17.08,,,,,,,
16584,1950-01-10,17.030001,17.030001,17.030001,17.030001,2160000.0,17.030001,16.9,,,,0.157956,,
16583,1950-01-11,17.09,17.09,17.09,17.09,2630000.0,17.09,16.974,,,,0.089051,,
16582,1950-01-12,16.76,16.76,16.76,16.76,2970000.0,16.76,17.022,,,,0.067602,,
16581,1950-01-13,16.67,16.67,16.67,16.67,3330000.0,16.67,16.988,,,,0.134796,,
16580,1950-01-16,16.719999,16.719999,16.719999,16.719999,1460000.0,16.719999,16.926,,,,0.196545,,


## Splitting up the data
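Since you're computing indicators that use historical data, there are some rows where there isn't enough historical data to generate them. Some of the indicators use 365 days of historical data, and the dataset starts on 1950-01-03. Thus, any rows that fall before 1951-01-03 don't have enough historical data to compute all the indicators. Note also that rolling(365) counts rows, and the file has one row per trading day, so the 365-row indicators stay null for the first 365 rows regardless of their date. You'll need to remove these rows, along with any remaining rows that contain null values, before you split the data.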
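
An integer window such as `rolling(365)` counts rows rather than calendar days. The sketch below illustrates this on a synthetic weekday calendar built with `pd.bdate_range`. This is an assumption for illustration: it ignores market holidays, so the dates only approximate the real trading calendar, and the `toy` frame is not the notebook's data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: one row per weekday from 1950-01-03.
# Assumption: market holidays are ignored, so dates only approximate sphist.csv.
dates = pd.bdate_range("1950-01-03", periods=600)
toy = pd.DataFrame({"Date": dates, "Close": np.arange(600, dtype=float)})

# Same pattern as the notebook: 365-row window, shifted to exclude the current row.
toy["day_365_mean"] = toy["Close"].rolling(365).mean().shift()

# Count the rows without a value and find the first date that has one.
n_missing = toy["day_365_mean"].isna().sum()
first_valid_date = toy.loc[toy["day_365_mean"].notna(), "Date"].iloc[0]

print(n_missing)
print(first_valid_date.date())
```

On this synthetic calendar the first usable row falls later than 1951-01-03, which suggests the date filter alone would not be enough and the `dropna` call is doing real work.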

In [7]:
# Only keep rows after 1951-01-02
df = df[df['Date'] > datetime(year=1951, month=1, day=2)]

# drop all rows with null values
df = df.dropna(axis=0)

In [8]:
# Create train and test data
train = df[df['Date'] < datetime(year=2013, month=1, day=1)]
test = df[df['Date'] >= datetime(year=2013, month=1, day=1)]
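
A quick sanity check on a chronological split is that every training date precedes every test date and that no row is lost or duplicated. Here is a minimal sketch on a synthetic frame. The `toy` frame, its values, and the names `toy_train` and `toy_test` are made up; only the split pattern mirrors the cell above.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Hypothetical daily frame spanning the split boundary (not real prices).
toy = pd.DataFrame({
    "Date": pd.date_range("2012-12-25", periods=14, freq="D"),
    "Close": np.linspace(100.0, 113.0, 14),
})

cutoff = datetime(year=2013, month=1, day=1)
toy_train = toy[toy["Date"] < cutoff]
toy_test = toy[toy["Date"] >= cutoff]

# No overlap in time, and the two pieces together cover every row once.
assert toy_train["Date"].max() < toy_test["Date"].min()
assert len(toy_train) + len(toy_test) == len(toy)
print(len(toy_train), len(toy_test))
```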

In [9]:
# features to be used
features = ['day_5_mean', 'day_30_mean', 'day_365_mean', 'mean_ratio_5_365','day_5_std', 
            'day_365_std', 'std_ratio_5_365']

X_train, y_train = train[features], train['Close']
X_test, y_test = test[features], test['Close']

## Linear Regression Modelling

In [10]:
lr = LinearRegression()
lr.fit(X_train, y_train)
predictions = lr.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
print('Mean Absolute Error : {}'.format(mae))
# Note: this R^2 is computed on the training data, not the held-out test set.
print('R^2 Score : {}'.format(lr.score(X_train, y_train)))

Mean Absolute Error : 16.145140609743667
R^2 Score : 0.9995223668123336
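
The call `lr.score(X_train, y_train)` reports R^2 on the training rows. Below is a sketch of scoring held-out rows instead, using `sklearn.metrics.r2_score`. The random-walk series, the seed, and the single `prev_mean` feature are all assumptions for illustration and do not reproduce the notebook's data or results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic random-walk "prices"; nothing here comes from sphist.csv.
rng = np.random.default_rng(0)
close = pd.Series(100.0 + rng.normal(0.0, 1.0, 500).cumsum(), name="Close")

# One lagged feature: mean of the previous 5 rows, excluding the current row.
toy = pd.DataFrame(
    {"Close": close, "prev_mean": close.rolling(5).mean().shift()}
).dropna()

# Chronological split: first 400 usable rows for training, the rest for testing.
toy_train, toy_test = toy.iloc[:400], toy.iloc[400:]

model = LinearRegression().fit(toy_train[["prev_mean"]], toy_train["Close"])
preds = model.predict(toy_test[["prev_mean"]])

# Both metrics are computed on rows the model never saw during fitting.
test_mae = mean_absolute_error(toy_test["Close"], preds)
test_r2 = r2_score(toy_test["Close"], preds)
print("Test MAE: {:.3f}".format(test_mae))
print("Test R^2: {:.3f}".format(test_r2))
```

Reporting R^2 on the test rows alongside the MAE gives a more like-for-like picture, since both numbers then describe the same unseen period.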
