<h1>
<center>
Dataquest Guided Project 16:
Predicting The Stock Market
</center>
</h1>

## Introduction

This is part of the Dataquest program.

- part of paths **Data Scientist in Python**
    - Step 6: **Machine Learning**
        - Course 5: **Machine Learning in Python: Intermediate**
            - Logistic Regression
            - Evaluating Binary Classifiers
            - Multiclass Classification
            - Clustering basics
            - K-Means clustering
            - Gradient Descent
            - Introduction to Neural networks
            
As this is a guided project, we follow and extend the steps suggested by Dataquest. In this project, we practice using a dataset to develop a predictive model.

## Use case : Predicting The Stock Market

We'll be working with data from the [S&P500 Index](https://en.wikipedia.org/wiki/S%26P_500_Index), a stock market index. We'll use historical data on the price of the S&P500 Index to make predictions about future prices. Predicting whether an index will go up or down will help us forecast how the stock market as a whole will perform.

We'll be working with a CSV file containing index prices. Each row in the file contains a daily record of the price of the S&P500 Index from 1950 to 2015. The dataset is stored in sphist.csv.

The columns of the dataset are : 

| Header | Definition   |
|------|------|
|   **Date**  | the date of the record|
|   **Open**  | the opening price of the day (when trading starts)|
|   **High**  | the highest trade price during the day|
|   **Low**  | the lowest trade price during the day|
|   **Close**  | the closing price for the day (when trading is finished)|
|   **Volume**  | the number of shares traded|
|   **Adj Close**  | the daily closing price, adjusted retroactively to include any corporate actions|

## Reading in the data

In [2]:
import pandas as pd
from datetime import datetime
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

In [3]:
df = pd.read_csv("sphist.csv")
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values("Date", ascending=True)

## Generating Indicators

Datasets taken from the stock market need to be handled differently from datasets from other sectors when it comes time to make predictions. In other machine learning exercises, we can treat each row as independent. Stock market data is sequential, and each observation comes a day after the previous observation. Thus, the observations are not all independent, and we can't treat them as such.
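
To make the dependence between consecutive rows concrete, here is a minimal sketch on synthetic data (not the S&P500 file). It assumes a random walk as a rough stand-in for a price series. The names `steps` and `prices` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data only: independent daily changes and the random walk they produce.
rng = np.random.default_rng(0)
steps = pd.Series(rng.normal(size=2000))  # independent "daily changes"
prices = 100 + steps.cumsum()             # random-walk "price" series

# Lag-1 autocorrelation: correlation between each value and the previous one.
print("changes:", steps.autocorr(lag=1))
print("prices: ", prices.autocorr(lag=1))
```

The independent changes show a lag-1 autocorrelation close to 0, while the price-like series shows one close to 1: each value carries most of the information of the previous one, which is why rows of such a series can't be treated as independent samples.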

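We have to be careful not to inject "future" knowledge into past rows when we train and predict: an indicator for a given day should only use data that was available before that day. Dataquest suggests indicators such as the following, and this notebook starts with 5-day averages of the Open, High, Low and Volume columns:
- The average price for the past 5 days.
- The average price for the past 30 days.
- The average price for the past 365 days.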
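
One way to keep the current day out of its own indicator is to shift the rolling result down by one row. The sketch below uses a small synthetic frame rather than sphist.csv. The helper `past_mean` and the column `avg_5` are hypothetical names introduced for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for sphist.csv: 10 business days with Close = 1..10.
dates = pd.bdate_range("2015-01-05", periods=10)
demo = pd.DataFrame({"Date": dates, "Close": np.arange(1.0, 11.0)})
demo = demo.sort_values("Date", ascending=True).reset_index(drop=True)


def past_mean(series, window):
    """Mean of the `window` rows strictly before each row.

    rolling() includes the current row, so shift(1) moves every result
    down one row and keeps the current day's value out of its own feature.
    """
    return series.rolling(window=window).mean().shift(1)


demo["avg_5"] = past_mean(demo["Close"], 5)

# The row with Close = 6 only sees the five earlier closes 1..5 (mean 3.0).
print(demo[["Date", "Close", "avg_5"]])
```

The same helper could be called with `window=30` or `window=365` on the real data. Note that the window counts rows, so 365 rows means 365 trading days, which is longer than a calendar year (a year has roughly 252 trading days).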

In [4]:
# 5-day rolling means. Note that rolling() includes the current row,
# so each value averages the current day and the four days before it.
df['5 Days Open'] = df['Open'].rolling(center=False, window=5).mean()
df['5 Days High'] = df['High'].rolling(center=False, window=5).mean()
df['5 Days Low'] = df['Low'].rolling(center=False, window=5).mean()
df['5 Days Volume'] = df['Volume'].rolling(center=False, window=5).mean()

# Calendar year of each record, used as a numeric feature
df['Year'] = df['Date'].dt.year

# Add a day-of-week column (Monday=0) and one-hot encode it.
# The dummy columns are named with the integers 0 to 4.
df['DOW'] = df['Date'].dt.weekday
dow_df = pd.get_dummies(df['DOW'])
df = pd.concat([df, dow_df], axis=1)
df = df.drop(['DOW'], axis=1)
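
The dummy columns above are named with bare integers, next to string-named columns. As far as we know, recent scikit-learn releases may reject DataFrames whose column names mix strings and integers. A minimal sketch of one way around this, assuming we are free to rename the columns, is to pass a prefix to `get_dummies`. The frame `demo` and the prefix `DOW` are choices made for this example.

```python
import pandas as pd

# Three consecutive days, Monday to Wednesday, as a tiny stand-in for the real data.
demo = pd.DataFrame(
    {"Date": pd.to_datetime(["2015-01-05", "2015-01-06", "2015-01-07"])}
)

# .dt.weekday gives Monday=0 ... Sunday=6.
demo["DOW"] = demo["Date"].dt.weekday

# prefix="DOW" yields string column names DOW_0, DOW_1, ... instead of integers.
dow_dummies = pd.get_dummies(demo["DOW"], prefix="DOW")
demo = pd.concat([demo.drop(columns="DOW"), dow_dummies], axis=1)

print(list(demo.columns))
```

With this naming, a feature list would refer to `'DOW_0'` through `'DOW_4'` rather than `0` through `4`.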

## Splitting up the data

Since we're computing indicators that use historical data, there are some rows where there isn't enough historical data to generate them. 
Some of the indicators use 365 days of historical data, and the dataset starts on 1950-01-03. Thus, any rows that fall before 1951-01-03 don't have enough historical data to compute all the indicators. We remove these rows before splitting the data into training and test sets.
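
The effect of missing history can be seen on a toy example. This is a sketch on synthetic data with a hypothetical 3-row window (`avg_3`), not the project's actual indicators.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"Close": np.arange(1.0, 9.0)})  # 8 synthetic rows
demo["avg_3"] = demo["Close"].rolling(window=3).mean()

# The first window - 1 rows have no full window, so the indicator is NaN there.
print(demo["avg_3"].isna().sum())

# dropna() returns a new DataFrame; the original only changes if we reassign it.
clean = demo.dropna(axis=0)
print(len(demo), len(clean))
```

Here two rows lack a full window, so `clean` has 6 rows while `demo` still has 8.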

In [5]:
df = df[df['Date'] >= datetime(year=1951, month=1, day=3)]
# dropna() returns a new DataFrame, so the result has to be assigned back.
df = df.dropna(axis=0)

train = df[df['Date'] < datetime(year=2013, month=1, day=1)]
test = df[df['Date'] >= datetime(year=2013, month=1, day=1)]

## Making Predictions

In [6]:
features = ['5 Days Open', '5 Days Volume', '5 Days High', '5 Days Low', 'Year', 0, 1, 2, 3, 4]

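# Fit on the training period only, then predict the held-out test period.
lr = LinearRegression()
lr.fit(train[features], train['Close'])
predictions = lr.predict(test[features])

# Mean absolute error: average absolute gap between actual and predicted Close.
mae = mean_absolute_error(test['Close'], predictions)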

In [9]:
print(mae)

9.11778468411
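
To make the metric concrete, here is a small sketch that computes the mean absolute error by hand and compares it with scikit-learn. The numbers are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up closing prices and predictions, for illustration only.
actual = np.array([2000.0, 2010.0, 1990.0, 2005.0])
predicted = np.array([1995.0, 2020.0, 1990.0, 2000.0])

# Absolute errors are 5, 10, 0 and 5, so their mean is 5.0.
manual_mae = np.mean(np.abs(actual - predicted))
sklearn_mae = mean_absolute_error(actual, predicted)

print(manual_mae, sklearn_mae)
```

The MAE is expressed in the same unit as the target, so here it reads as an average miss of 5 index points per day.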


The model predicts the S&P500 closing price with a mean absolute error of about 9.12 index points on the test set (2013 onwards). We can still generate new indicators to improve the model. Here are some ideas:

- The ratio between the average volume for the past five days, and the average volume for the past year.
- The standard deviation of the average volume over the past five days.
- The standard deviation of the average volume over the past year.
- The ratio between the standard deviation of the average volume for the past five days, and the standard deviation of the average volume for the past year.
- The ratio between the lowest price in the past year and the current price.
- The ratio between the highest price in the past year and the current price.
- The number of holidays in the prior month.
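
A few of these ideas can be sketched with rolling windows. The example below runs on synthetic data, uses short windows (3 and 6 rows) as stand-ins for "five days" and "one year", and reads "standard deviation of the average volume" as a rolling standard deviation of volume, which is our interpretation. All column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: steadily rising price and volume.
n = 12
demo = pd.DataFrame({
    "Close": 10.0 + np.arange(n),
    "Volume": 100.0 + 10.0 * np.arange(n),
})

short_win, long_win = 3, 6  # stand-ins for 5 days and one year

# shift(1) keeps the current row out of its own indicator.
vol_short = demo["Volume"].rolling(short_win).mean().shift(1)
vol_long = demo["Volume"].rolling(long_win).mean().shift(1)

# Ratio between short-term and long-term average volume.
demo["vol_ratio"] = vol_short / vol_long

# Rolling standard deviation of volume over the short window.
demo["vol_std_short"] = demo["Volume"].rolling(short_win).std().shift(1)

# Ratio between the lowest price in the long window and the current price.
demo["low_ratio"] = demo["Close"].rolling(long_win).min().shift(1) / demo["Close"]

print(demo.tail())
```

The first `long_win` rows contain NaN for the long-window indicators and would need to be dropped before training, as with the other indicators.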