# Predicting The Stock Market

## Objective

Use machine learning techniques to predict the price of the S&P500 Index.

## Introduction

Some companies are publicly traded, which means that anyone can buy and sell their shares on the open market. A share entitles the owner to some control over the direction of the company, and to some percentage (or share) of the earnings of the company. When you buy or sell shares, it's common to say that you're trading a stock.

The price of a share is based mainly on supply and demand for a given stock. For example, Apple stock has a price of 120 dollars per share as of December 2015 -- http://www.nasdaq.com/symbol/aapl. A stock that is in less demand, like Ford Motor Company, has a lower price -- http://finance.yahoo.com/q?s=F. Stock price is also influenced by other factors, including the number of shares a company has issued.

Stocks are traded daily, and the price can rise or fall from the beginning of a trading day to the end based on demand. Stocks that are more in demand, such as Apple, are traded more often than stocks of smaller companies.

Indexes aggregate the prices of multiple stocks and let you see how the market as a whole is performing. For example, the Dow Jones Industrial Average aggregates the stock prices of 30 large American companies. The S&P500 Index aggregates the stock prices of 500 large companies. When an index goes up or down, you can say that the underlying market or sector it represents is also going up or down. For example, if the Dow Jones Industrial Average goes down one day, you can say that American stocks overall went down (i.e., most American stocks went down in price).

## Data Set

The data set comes from the [S&P500 Index](https://en.wikipedia.org/wiki/S%26P_500_Index), a stock market index. The file, named sphist.csv, contains a daily record of the price of the S&P500 Index from 1950 to 2015. The data set was taken from [here](https://www.quandl.com/data/YAHOO/INDEX_GSPC-S-P-500-Index).

The columns of the dataset are:

| Columns   | Description                                                                       |
|-----------|-----------------------------------------------------------------------------------|
| Date      | The date of the record.                                                           |
| Open      | The opening price of the day (when trading starts).                               |
| High      | The highest trade price during the day.                                           |
| Low       | The lowest trade price during the day.                                            |
| Close     | The closing price for the day (when trading is finished).                         |
| Volume    | The number of shares traded.                                                      |
| Adj Close | The daily closing price, adjusted retroactively to include any corporate actions. |

## Reading In the Data

In [1]:
import pandas as pd
import datetime as dt
from sklearn.linear_model import LinearRegression

In [3]:
stock = pd.read_csv('C:/Users/i7/csv/stock/sphist.csv')
stock.head(5)

|   | Date       | Open        | High        | Low         | Close       | Volume       | Adj Close   |
|---|------------|-------------|-------------|-------------|-------------|--------------|-------------|
| 0 | 2015-12-07 | 2090.419922 | 2090.419922 | 2066.780029 | 2077.070068 | 4043820000.0 | 2077.070068 |
| 1 | 2015-12-04 | 2051.23999  | 2093.840088 | 2051.23999  | 2091.689941 | 4214910000.0 | 2091.689941 |
| 2 | 2015-12-03 | 2080.709961 | 2085.0      | 2042.349976 | 2049.620117 | 4306490000.0 | 2049.620117 |
| 3 | 2015-12-02 | 2101.709961 | 2104.27002  | 2077.110107 | 2079.51001  | 3950640000.0 | 2079.51001  |
| 4 | 2015-12-01 | 2082.929932 | 2103.370117 | 2082.929932 | 2102.629883 | 3712120000.0 | 2102.629883 |


In [4]:
stock['Date'] = pd.to_datetime(stock['Date'])
# DataFrame.sort was removed from pandas; sort_values is its replacement.
stock.sort_values('Date', inplace = True)
stock.head()



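|       | Date       | Open  | High  | Low   | Close | Volume    | Adj Close |
|-------|------------|-------|-------|-------|-------|-----------|-----------|
| 16589 | 1950-01-03 | 16.66 | 16.66 | 16.66 | 16.66 | 1260000.0 | 16.66     |
| 16588 | 1950-01-04 | 16.85 | 16.85 | 16.85 | 16.85 | 1890000.0 | 16.85     |
| 16587 | 1950-01-05 | 16.93 | 16.93 | 16.93 | 16.93 | 2550000.0 | 16.93     |
| 16586 | 1950-01-06 | 16.98 | 16.98 | 16.98 | 16.98 | 2010000.0 | 16.98     |
| 16585 | 1950-01-09 | 17.08 | 17.08 | 17.08 | 17.08 | 2520000.0 | 17.08     |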
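
As a minimal sketch of the read-then-sort step, the date conversion can also happen while the file is read. The example below is an assumption-laden illustration, not the notebook's own code: it reads two rows copied from the first output above out of an in-memory string (the `csv_text` and `sample` names are hypothetical) instead of the real sphist.csv.

```python
import io
import pandas as pd

# Two rows copied from the first output above, in newest-first order.
csv_text = """Date,Open,High,Low,Close,Volume,Adj Close
2015-12-07,2090.419922,2090.419922,2066.780029,2077.070068,4043820000.0,2077.070068
2015-12-04,2051.23999,2093.840088,2051.23999,2091.689941,4214910000.0,2091.689941
"""

# parse_dates converts the Date column while reading; sort_values then
# returns a new frame ordered oldest first.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])
sample = sample.sort_values("Date").reset_index(drop=True)

print(sample[["Date", "Close"]])
```

Sorting oldest first matters for the next section, because a rolling window looks at the rows that precede the current one.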


## Generating Indicators

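Here are some indicators that are interesting to generate for each row:

- The average price for the past 5 days.
- The average price for the past 30 days.
- The average price for the past 365 days.
- The ratio between the average price for the past 5 days and the average price for the past 365 days.
- The standard deviation of the price over the past 5 days.
- The standard deviation of the price over the past 365 days.
- The ratio between the standard deviation for the past 5 days and the standard deviation for the past 365 days.

The cell below generates a subset of these: the 5-day and 365-day average prices and their ratio, plus the average trading volume over the same two windows. Windows are counted in rows, and the data has one row per trading day, so "days" here means trading days. Each rolling value is shifted down one row so that the indicators for a given day use only data from earlier days.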

In [5]:
# pd.rolling_mean was removed from pandas; Series.rolling(...).mean() replaces it.
# shift(1) keeps the current day's values out of that day's own indicators.
stock['Av_5day'] = stock['Close'].rolling(window = 5).mean().shift(1)
stock['Av_1yr'] = stock['Close'].rolling(window = 365).mean().shift(1)
stock['ratio_dy'] = stock['Av_5day']/stock['Av_1yr']
stock['AvVol_5day'] = stock['Volume'].rolling(window = 5).mean().shift(1)
stock['AvVol_1yr'] = stock['Volume'].rolling(window = 365).mean().shift(1)
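
The cell above computes rolling means only; the standard-deviation indicators from the list are not generated. A minimal sketch of how they could be built with the same rolling-then-shift pattern follows. It runs on a small synthetic series, and the `prices` frame, the window sizes 2 and 4 (standing in for 5 and 365), and the column names are illustrative assumptions rather than part of the original notebook.

```python
import pandas as pd

# Synthetic closing prices, oldest first (illustrative data only).
prices = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})

# Small windows stand in for the 5-row and 365-row windows.
short_win, long_win = 2, 4

# shift(1) moves each rolling value down one row, so the indicator for a
# given day is built only from earlier days.
prices["avg_short"] = prices["Close"].rolling(window=short_win).mean().shift(1)
prices["std_short"] = prices["Close"].rolling(window=short_win).std().shift(1)
prices["std_long"] = prices["Close"].rolling(window=long_win).std().shift(1)
prices["std_ratio"] = prices["std_short"] / prices["std_long"]

print(prices)
```

The first `long_win` rows have no `std_ratio` because the long window is not yet full, which is the same reason the early rows of the real data contain NaN values.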



## Splitting Up The Data

#### Remove any rows from the DataFrame that fall before 1951-01-03 and remove any rows with NaN values

In [6]:
stock = stock[stock['Date'] > dt.datetime(year = 1951, month = 1, day = 2)]
stock = stock.dropna(axis = 0)

#### Train contains any rows in the data with a date less than 2013-01-01. Test contains any rows with a date greater than or equal to 2013-01-01.

In [7]:
train = stock[stock['Date'] < dt.datetime(year = 2013, month = 1, day = 1)]
test = stock[stock['Date'] >= dt.datetime(year = 2013, month = 1, day = 1)]
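
A quick way to sanity-check a date-based split like this one is to confirm that the two parts cover every row once and do not overlap in time. The sketch below uses a small synthetic frame around the cutoff; the names `frame`, `cutoff`, `train_part`, and `test_part` are hypothetical and the data is made up.

```python
import pandas as pd

# Synthetic daily rows spanning the cutoff (illustrative data only).
frame = pd.DataFrame({
    "Date": pd.date_range("2012-12-28", periods=8, freq="D"),
    "Close": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})

cutoff = pd.Timestamp("2013-01-01")
train_part = frame[frame["Date"] < cutoff]
test_part = frame[frame["Date"] >= cutoff]

# Every row lands in exactly one part, and all training dates come
# strictly before all test dates.
covers_all = len(train_part) + len(test_part) == len(frame)
no_overlap = train_part["Date"].max() < test_part["Date"].min()
print(covers_all, no_overlap)
```

Splitting by date rather than at random keeps future prices out of the training set, which matters for time series data.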

## Making Predictions

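Use Mean Absolute Error (MAE) as the error metric, because it shows how close the predictions are to the actual price in intuitive terms: it is the average absolute difference between prediction and actual value, in the same units as the price.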
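
To make the metric concrete, here is a small sketch of the calculation on made-up numbers. The `actual` and `predicted` arrays are illustrative assumptions and are unrelated to the S&P500 data.

```python
import numpy as np

# Made-up actual and predicted prices (illustrative only).
actual = np.array([100.0, 102.0, 101.0, 105.0])
predicted = np.array([98.0, 103.0, 104.0, 105.0])

# Absolute errors are 2, 1, 3 and 0; MAE is their mean.
abs_errors = np.abs(predicted - actual)
mae_example = abs_errors.mean()

print(mae_example)
```

An MAE of 1.5 here means the predictions miss the actual price by 1.5 on average. scikit-learn's `sklearn.metrics.mean_absolute_error` computes the same quantity.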

In [8]:
columns_use = ['Av_5day', 'Av_1yr', 'ratio_dy', 'AvVol_5day', 'AvVol_1yr']
y_train = train['Close']
y_test = test['Close']
lr = LinearRegression()

lr.fit(train[columns_use], y_train)
fitted = lr.predict(train[columns_use])

predictions = lr.predict(test[columns_use])
# Mean absolute error: the average of |prediction - actual| over the test rows.
mae = sum(abs(predictions - y_test)) / len(predictions)
print(mae)

16.1149844051
