# Predicting the stock market

In this project, you'll work with data from the [S&P500 Index](https://en.wikipedia.org/wiki/S%26P_500). The S&P500 is a stock market index. Before we get into what an index is, we'll need to start with the basics of the stock market.

Some companies are publicly traded, which means that anyone can buy and sell their shares on the open market. A share entitles the owner to some control over the direction of the company and to a percentage (or share) of the earnings of the company. Buying or selling shares is commonly known as trading a stock.

The price of a share is based on supply and demand for a given stock. For example, Apple stock has a price of 120 dollars per share as of December 2015 -- http://www.nasdaq.com/symbol/aapl. A stock that is in less demand, like Ford Motor Company, has a lower price -- http://finance.yahoo.com/q?s=F. Stock price is also influenced by other factors, including the number of shares a company has issued.

Stocks are traded daily and the price can rise or fall from the beginning of a trading day to the end based on demand. Stocks that are more in demand, such as Apple, are traded more often than stocks of smaller companies.

Indexes aggregate the prices of multiple stocks together, and allow you to see how the market as a whole performs. For example, the [Dow Jones Industrial Average](https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average) aggregates the stock prices of 30 large American companies together. The S&P500 Index aggregates the stock prices of 500 large companies. When an index fund goes up or down, you can say that the primary market or sector it represents is doing the same. For example, if the Dow Jones Industrial Average price goes down one day, you can say that American stocks overall went down (i.e., most American stocks went down in price).

## The Dataset

You'll be using historical data on the price of the S&P500 Index to make predictions about future prices. Predicting whether an index goes up or down helps forecast how the stock market as a whole performs. Since stocks tend to correlate with how well the economy as a whole is performing, it can also help with economic forecasts.

There are thousands of traders who make money by buying and selling Exchange Traded Funds. ETFs allow you to buy and sell indexes like stocks. This means that you could "buy" the S&P500 Index ETF when the price is low and sell when it's high to make a profit. Creating a predictive model could allow traders to make money on the stock market.


In this lesson, you'll be working with a CSV file containing index prices. Each row in the file contains a daily record of the price of the S&P500 Index from `1950` to `2015`. The dataset is stored in `sphist.csv`.

The columns of the dataset are:

- `Date` -- The date of the record.
- `Open` -- The opening price of the day (when trading starts).
- `High` -- The highest trade price during the day.
- `Low` -- The lowest trade price during the day.
- `Close` -- The closing price for the day (when trading is finished).
- `Volume` -- The number of shares traded.
- `Adj Close` -- The daily closing price, adjusted retroactively to include any corporate actions. Read more [here](https://www.investopedia.com/terms/a/adjusted_closing_price.asp).

You'll be using this dataset to develop a predictive model. You'll train the model with data from 1950-2012 and try to make predictions from 2013-2015.

## Reading in the Data

In [41]:
# from IPython import get_ipython
# get_ipython().magic('reset -sf')

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt



from datetime import datetime

df = pd.read_csv('sphist.csv')
df['Date'] = pd.to_datetime(df['Date']) 



# Sort oldest-first and reset the index so the OLDEST date gets index 0, which makes positional lookups easier.
df = df.sort_values(by=['Date'], ascending=True).reset_index(drop=True)
df

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,1950-01-03,16.660000,16.660000,16.660000,16.660000,1.260000e+06,16.660000
1,1950-01-04,16.850000,16.850000,16.850000,16.850000,1.890000e+06,16.850000
2,1950-01-05,16.930000,16.930000,16.930000,16.930000,2.550000e+06,16.930000
3,1950-01-06,16.980000,16.980000,16.980000,16.980000,2.010000e+06,16.980000
4,1950-01-09,17.080000,17.080000,17.080000,17.080000,2.520000e+06,17.080000
...,...,...,...,...,...,...,...
16585,2015-12-01,2082.929932,2103.370117,2082.929932,2102.629883,3.712120e+09,2102.629883
16586,2015-12-02,2101.709961,2104.270020,2077.110107,2079.510010,3.950640e+09,2079.510010
16587,2015-12-03,2080.709961,2085.000000,2042.349976,2049.620117,4.306490e+09,2049.620117
16588,2015-12-04,2051.239990,2093.840088,2051.239990,2091.689941,4.214910e+09,2091.689941


My plan is to write a function that takes the df and builds a smaller df with a single column containing the required indicator (a running average over some number of days). Then we concatenate that with the original df, giving us columns with indicators in them. This is not the best code, but it works with `iterrows()` and it's my code.

## Generating Indicators

Datasets taken from the stock market need to be handled differently than datasets from other sectors when it's time to make predictions. In a normal machine learning exercise, we treat each row as independent. Stock market data is sequential and each observation comes a day after the previous observation. Thus, the observations are not all independent and you can't treat them as such.

This means you have to be extra careful not to inject "future" knowledge into past rows when you train and predict. Injecting future knowledge makes our model look good when we train and test it, but it fails in the real world. This is how many algorithmic traders lose money.

The time series nature of the data means that we can generate indicators to make our model more accurate. For instance, you can create a new column that contains the average price of the last 10 trades for each row. This incorporates information from multiple prior rows into one and makes predictions much more accurate.

When you do this, you have to be careful not to use the current row in the values you average. You want to teach the model how to predict the current price from historical prices. If you include the current price in the prices you average, it will be equivalent to handing the answers to the model upfront, and will make it impossible to use in the "real world", where you don't know the price upfront.
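The difference between a leaky and a safe rolling average can be sketched on a toy series. This is a minimal illustration with made-up prices, not values from `sphist.csv`; the names `leaky_avg` and `safe_avg` are hypothetical. It uses pandas `shift(1)` so that each window ends at the previous row.

```python
import pandas as pd

# Toy closing prices (made-up values, not taken from sphist.csv).
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0], name="Close")

# Leaky: the window ending at row i includes row i's own price.
leaky_avg = close.rolling(window=3).mean()

# Safe: shift by one row first, so the window for row i
# covers rows i-3 .. i-1 only.
safe_avg = close.shift(1).rolling(window=3).mean()

print(pd.DataFrame({"Close": close, "leaky_avg": leaky_avg, "safe_avg": safe_avg}))
```

At row 3 the leaky average already contains that row's price of 13, while the safe average only sees 10, 11, and 12. Rows without enough history come out as `NaN` in the safe version.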

Here are some indicators that are interesting to generate for each row:

- The average price from the past 5 days.
- The average price for the past 30 days.
- The average price for the past 365 days.
- The ratio between the average price for the past 5 days, and the average price for the past 365 days.
- The standard deviation of the price over the past 5 days.
- The standard deviation of the price over the past 365 days.
- The ratio between the standard deviation for the past 5 days, and the standard deviation for the past 365 days.

"Days" means "trading days" -- so if you're computing the average of the past 5 days, it should be the 5 most recent dates before the current one. Assume that "price" means the Close column. Always be careful not to include the current price in these indicators! You're predicting the next day's price, so the indicators are designed to predict the current price from the previous prices.

Some of these indicators require a year of historical data to compute. Our first day of data falls on 1950-01-03, so the first day you can start computing indicators on is 1951-01-03.

To compute indicators, you'll need to loop through each day from 1951-01-03 to 2015-12-07 (the last day you have prices for). For instance, if we were computing the average price from the past 5 days, we'd start at 1951-01-03, get the prices for each day from 1950-12-26 to 1951-01-02, and find the average. We start on the 26th and span more than 5 calendar days because the stock market is shut down on weekends and certain holidays. Since we're looking at the past 5 trading days, we need to look at more than 5 calendar days to find them.
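Because each row is one trading day, selecting the previous 5 rows by position picks up the previous 5 trading days automatically, however many calendar days they span. The sketch below uses the dates from the example above with made-up prices; `toy`, `pos`, and `previous_five` are hypothetical names.

```python
import pandas as pd

# Trading days around the 1950/1951 New Year; the Close values are made up.
dates = pd.to_datetime([
    "1950-12-26", "1950-12-27", "1950-12-28", "1950-12-29",
    "1951-01-02", "1951-01-03",
])
toy = pd.DataFrame({"Date": dates, "Close": [20.0, 20.1, 20.2, 20.3, 20.4, 20.5]})

# Position of the day we want an indicator for.
pos = toy.index[toy["Date"] == pd.Timestamp("1951-01-03")][0]

# The 5 rows before it are the 5 previous trading days (current row excluded).
previous_five = toy.iloc[pos - 5:pos]

# Calendar days between the current day and the start of the window.
calendar_span = (toy.loc[pos, "Date"] - previous_five["Date"].iloc[0]).days

print(previous_five["Date"].dt.strftime("%Y-%m-%d").tolist())
print("calendar days spanned:", calendar_span)
print("5-day average:", previous_five["Close"].mean())
```

The window holds 5 rows but reaches back 8 calendar days, since the weekend and the New Year holiday have no rows.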

### In Version 2:
I will be adding more indicators using the same loop approach. I could write a single function that takes the window length, but I do not know how to translate that into the date cutoff, so I will write several functions and modify each one. Then I will invoke them and concatenate all of the results onto the original df so that it has indicator columns. I will make running averages for 30 and 365 days, then standard deviations for the same two windows, and then use their ratios.
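One possible way to get a single function that takes the window length, without any date arithmetic, is to count rows instead of dates. The sketch below is an assumption-laden alternative, not the approach used in this notebook: `previous_window_stat` is a hypothetical helper, and it leaves `NaN` (rather than zero) in rows that lack enough history.

```python
import numpy as np
import pandas as pd

def previous_window_stat(series, window, stat="mean"):
    """Statistic over the `window` rows before each row (current row excluded)."""
    rolled = series.shift(1).rolling(window=window)
    if stat == "mean":
        return rolled.mean()
    if stat == "std":
        # ddof=0 matches the default of np.std (population standard deviation).
        return rolled.std(ddof=0)
    raise ValueError(f"unsupported stat: {stat!r}")

# Toy data (made-up values) to show the call pattern.
toy = pd.DataFrame({"Close": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]})
toy["day_3"] = previous_window_stat(toy["Close"], 3)
toy["day_3_std"] = previous_window_stat(toy["Close"], 3, stat="std")
print(toy)
```

With this shape, the 5-day and 365-day indicators would differ only in the `window` argument, and rows without enough history could be removed later with `dropna`.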

## Splitting up the Data

In [42]:
# Running average of Close over the previous 5 trading days

def five_day_average(df):
    avg_dict = dict()  # maps row index -> indicator value for the result column

    for index, row in df.iterrows():  # loop over each row together with its index
        # The indicator needs previous data, so the oldest rows, which lack
        # enough history, are padded with zero.
        if row['Date'] < datetime(year=1950, month=1, day=10):
            avg_dict[index] = 0  # pad with zero

        elif row['Date'] >= datetime(year=1950, month=1, day=10):
            avg_5_day = np.mean(df.Close[index-5:index])  # mean of the previous 5 rows (current row excluded)
            avg_dict[index] = avg_5_day  # store in the dictionary

    # Turn the dictionary items into a two-column DataFrame. The first column
    # holds the inherited index; it is named here only so it can be dropped below.
    running_avg_df = pd.DataFrame(list(avg_dict.items()), columns=['col_1', 'day_5'])
    # Drop the index column, leaving one column with the running average and zero padding.
    running_avg_df = running_avg_df.drop(['col_1'], axis=1)

    return running_avg_df

In [43]:
# Running average of Volume over the previous 5 trading days

def five_day_average_vol(df):
    avg_dict = dict()  # maps row index -> indicator value for the result column

    for index, row in df.iterrows():  # loop over each row together with its index
        # The indicator needs previous data, so the oldest rows, which lack
        # enough history, are padded with zero.
        if row['Date'] < datetime(year=1950, month=1, day=10):
            avg_dict[index] = 0  # pad with zero

        elif row['Date'] >= datetime(year=1950, month=1, day=10):
            avg_5_day = np.mean(df.Volume[index-5:index])  # mean of the previous 5 rows (current row excluded)
            avg_dict[index] = avg_5_day  # store in the dictionary

    # Turn the dictionary items into a two-column DataFrame. The first column
    # holds the inherited index; it is named here only so it can be dropped below.
    running_avg_df = pd.DataFrame(list(avg_dict.items()), columns=['col_1', 'day_5_vol'])
    # Drop the index column, leaving one column with the running average and zero padding.
    running_avg_df_vol = running_avg_df.drop(['col_1'], axis=1)

    return running_avg_df_vol

In [44]:
# Running average of Close over the previous 365 trading days

def year_average(df):
    avg_dict = dict()  # maps row index -> indicator value for the result column

    for index, row in df.iterrows():  # loop over each row together with its index
        # The indicator needs previous data, so the oldest rows, which lack
        # enough history, are padded with zero.
        # 365 trading days is longer than a calendar year (about 249 trading days),
        # so the cutoff is 1951-06-19, the row at index 365.
        if row['Date'] < datetime(year=1951, month=6, day=19):
            avg_dict[index] = 0  # pad with zero

        elif row['Date'] >= datetime(year=1951, month=6, day=19):
            avg_year = np.mean(df.Close[index-365:index])  # mean of the previous 365 rows (current row excluded)
            avg_dict[index] = avg_year  # store in the dictionary
            if np.isnan(avg_year):  # for debugging purposes
                print(index)

    # Turn the dictionary items into a two-column DataFrame. The first column
    # holds the inherited index; it is named here only so it can be dropped below.
    running_avg_df = pd.DataFrame(list(avg_dict.items()), columns=['col_1', 'day_365'])
    # Drop the index column, leaving one column with the running average and zero padding.
    running_avg_df = running_avg_df.drop(['col_1'], axis=1)

    return running_avg_df

In [45]:
# Standard deviation of Close over the previous 5 trading days

def five_day_std(df):
    std_dict = dict()  # maps row index -> indicator value for the result column

    for index, row in df.iterrows():  # loop over each row together with its index
        # The indicator needs previous data, so the oldest rows, which lack
        # enough history, are padded with zero.
        if row['Date'] < datetime(year=1950, month=1, day=10):
            std_dict[index] = 0  # pad with zero

        elif row['Date'] >= datetime(year=1950, month=1, day=10):
            std_5_day = np.std(df.Close[index-5:index])  # std of the previous 5 rows (current row excluded)
            std_dict[index] = std_5_day  # store in the dictionary

    # Turn the dictionary items into a two-column DataFrame. The first column
    # holds the inherited index; it is named here only so it can be dropped below.
    running_std_df = pd.DataFrame(list(std_dict.items()), columns=['col_1', 'day_5_std'])
    # Drop the index column, leaving one column with the running std and zero padding.
    running_std_df = running_std_df.drop(['col_1'], axis=1)

    return running_std_df

In [46]:
# Standard deviation of Close over the previous 365 trading days

def year_std(df):
    std_dict = dict()  # maps row index -> indicator value for the result column

    for index, row in df.iterrows():  # loop over each row together with its index
        # The indicator needs previous data, so the oldest rows, which lack
        # enough history, are padded with zero.
        if row['Date'] < datetime(year=1951, month=6, day=19):
            std_dict[index] = 0  # pad with zero

        elif row['Date'] >= datetime(year=1951, month=6, day=19):  # 365 trading days, as opposed to a calendar year
            std_365_day = np.std(df.Close[index-365:index])  # std of the previous 365 rows (current row excluded)
            std_dict[index] = std_365_day  # store in the dictionary

    # Turn the dictionary items into a two-column DataFrame. The first column
    # holds the inherited index; it is named here only so it can be dropped below.
    running_std_df = pd.DataFrame(list(std_dict.items()), columns=['col_1', 'day_365_std'])
    # Drop the index column, leaving one column with the running std and zero padding.
    running_std_df = running_std_df.drop(['col_1'], axis=1)

    return running_std_df

In [34]:
running_avg_5day = five_day_average (df)
running_avg_365day = year_average (df)

running_std_5day = five_day_std (df)
running_std_365day = year_std (df)

five_day_average_5vol = five_day_average_vol(df)

# Concatenate the indicator columns onto the existing df along axis=1 (column-wise). We check the result below and it looks correct.
result_df = pd.concat([df, running_avg_5day, running_avg_365day, running_std_5day, running_std_365day, five_day_average_5vol], axis=1)



In [35]:
result_df

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,day_5,day_365,day_5_std,day_365_std,day_5_vol
0,1950-01-03,16.660000,16.660000,16.660000,16.660000,1.260000e+06,16.660000,0.000000,0.000000,0.000000,0.000000,0.000000e+00
1,1950-01-04,16.850000,16.850000,16.850000,16.850000,1.890000e+06,16.850000,0.000000,0.000000,0.000000,0.000000,0.000000e+00
2,1950-01-05,16.930000,16.930000,16.930000,16.930000,2.550000e+06,16.930000,0.000000,0.000000,0.000000,0.000000,0.000000e+00
3,1950-01-06,16.980000,16.980000,16.980000,16.980000,2.010000e+06,16.980000,0.000000,0.000000,0.000000,0.000000,0.000000e+00
4,1950-01-09,17.080000,17.080000,17.080000,17.080000,2.520000e+06,17.080000,0.000000,0.000000,0.000000,0.000000,0.000000e+00
...,...,...,...,...,...,...,...,...,...,...,...,...
16585,2015-12-01,2082.929932,2103.370117,2082.929932,2102.629883,3.712120e+09,2102.629883,2087.024023,2035.531178,3.502675,64.282022,3.207544e+09
16586,2015-12-02,2101.709961,2104.270020,2077.110107,2079.510010,3.950640e+09,2079.510010,2090.231982,2035.914082,7.116786,64.264312,3.232372e+09
16587,2015-12-03,2080.709961,2085.000000,2042.349976,2049.620117,4.306490e+09,2049.620117,2088.306006,2036.234356,8.348225,64.189442,3.245514e+09
16588,2015-12-04,2051.239990,2093.840088,2051.239990,2091.689941,4.214910e+09,2091.689941,2080.456006,2036.507343,17.530725,64.033724,3.536224e+09


In [36]:
# Dividing columns to get composite indicators
result_df['std_ratio'] =  result_df['day_5_std']/result_df['day_365_std']
result_df['avg_ratio'] =  result_df['day_5']/result_df['day_365']
result_df

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,day_5,day_365,day_5_std,day_365_std,day_5_vol,std_ratio,avg_ratio
0,1950-01-03,16.660000,16.660000,16.660000,16.660000,1.260000e+06,16.660000,0.000000,0.000000,0.000000,0.000000,0.000000e+00,,
1,1950-01-04,16.850000,16.850000,16.850000,16.850000,1.890000e+06,16.850000,0.000000,0.000000,0.000000,0.000000,0.000000e+00,,
2,1950-01-05,16.930000,16.930000,16.930000,16.930000,2.550000e+06,16.930000,0.000000,0.000000,0.000000,0.000000,0.000000e+00,,
3,1950-01-06,16.980000,16.980000,16.980000,16.980000,2.010000e+06,16.980000,0.000000,0.000000,0.000000,0.000000,0.000000e+00,,
4,1950-01-09,17.080000,17.080000,17.080000,17.080000,2.520000e+06,17.080000,0.000000,0.000000,0.000000,0.000000,0.000000e+00,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16585,2015-12-01,2082.929932,2103.370117,2082.929932,2102.629883,3.712120e+09,2102.629883,2087.024023,2035.531178,3.502675,64.282022,3.207544e+09,0.054489,1.025297
16586,2015-12-02,2101.709961,2104.270020,2077.110107,2079.510010,3.950640e+09,2079.510010,2090.231982,2035.914082,7.116786,64.264312,3.232372e+09,0.110742,1.026680
16587,2015-12-03,2080.709961,2085.000000,2042.349976,2049.620117,4.306490e+09,2049.620117,2088.306006,2036.234356,8.348225,64.189442,3.245514e+09,0.130056,1.025573
16588,2015-12-04,2051.239990,2093.840088,2051.239990,2091.689941,4.214910e+09,2091.689941,2080.456006,2036.507343,17.530725,64.033724,3.536224e+09,0.273773,1.021580
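The empty ratio cells in the first rows come from the zero padding: dividing a padded 0 by a padded 0 gives `NaN` in pandas. A small sketch with made-up numbers (the frame `padded` is hypothetical) shows the effect and why a later `dropna` removes exactly those rows.

```python
import pandas as pd

# Two zero-padded rows followed by one row with real indicator values (made up).
padded = pd.DataFrame({"day_5": [0.0, 0.0, 21.8], "day_365": [0.0, 0.0, 19.4]})

# 0.0 / 0.0 evaluates to NaN rather than raising an error.
padded["avg_ratio"] = padded["day_5"] / padded["day_365"]

print(padded)
print("rows left after dropna:", len(padded.dropna()))
```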


In [37]:
# Keep only the rows from 1951-06-19 onward, since the indicators need historical data.
# I used 365 trading days, which is around 18 calendar months, so my date cutoff
# is a tiny bit different from the one in the solutions.


clean_df = result_df[result_df["Date"] > datetime(year=1951, month=6, day=18)] 
clean_df = clean_df.dropna(axis = 0)

clean_df.head(20) # now we have all the numbers!!

Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close,day_5,day_365,day_5_std,day_365_std,day_5_vol,std_ratio,avg_ratio
365,1951-06-19,22.02,22.02,22.02,22.02,1100000.0,22.02,21.8,19.447726,0.229173,1.787799,1196000.0,0.128187,1.120954
366,1951-06-20,21.91,21.91,21.91,21.91,1120000.0,21.91,21.9,19.462411,0.191102,1.786854,1176000.0,0.106949,1.125246
367,1951-06-21,21.780001,21.780001,21.780001,21.780001,1100000.0,21.780001,21.972,19.476274,0.082801,1.786161,1188000.0,0.046357,1.128142
368,1951-06-22,21.549999,21.549999,21.549999,21.549999,1340000.0,21.549999,21.96,19.489562,0.102956,1.785209,1148000.0,0.057672,1.126757
369,1951-06-25,21.290001,21.290001,21.290001,21.290001,2440000.0,21.290001,21.862,19.502082,0.182582,1.783589,1142000.0,0.102367,1.121008
370,1951-06-26,21.299999,21.299999,21.299999,21.299999,1260000.0,21.299999,21.71,19.513617,0.261916,1.7815,1420000.0,0.14702,1.112556
371,1951-06-27,21.370001,21.370001,21.370001,21.370001,1360000.0,21.370001,21.566,19.525315,0.249528,1.779171,1452000.0,0.140249,1.104515
372,1951-06-28,21.1,21.1,21.1,21.1,1940000.0,21.1,21.458,19.537041,0.186054,1.777185,1500000.0,0.10469,1.098324
373,1951-06-29,20.959999,20.959999,20.959999,20.959999,1730000.0,20.959999,21.322,19.548932,0.144969,1.773079,1668000.0,0.081761,1.090699
374,1951-07-02,21.1,21.1,21.1,21.1,1350000.0,21.1,21.204,19.560685,0.151341,1.768168,1746000.0,0.085592,1.084011


In [38]:
df = clean_df #putting everything back so now we have a usable df

In [39]:
#splitting the data into train and test

train = df[df["Date"] < datetime(year=2013, month=1, day=1)] 
test = df[df["Date"] >= datetime(year=2013, month=1, day=1)] 

print(train.shape)
print(test.shape)

(15486, 14)
(739, 14)


## Making Predictions

Now applying linear regression.

In [49]:
# Pick an error metric
from sklearn.metrics import mean_absolute_error

# Import and initialise a linear regression instance
from sklearn.linear_model import LinearRegression

lr = LinearRegression()


y_train = train['Close']  # target for train
y_test = test['Close']  # target for test

# features = ['day_5','day_365']  # earlier, smaller feature set
features = ['day_5','day_365','avg_ratio','std_ratio','day_5_vol']

X_train = train[features]
X_test = test[features]

trained_model = lr.fit(X_train, y_train)  # train the linear regression model
y_pred = trained_model.predict(X_test)  # use the trained model to predict

MAE = mean_absolute_error(y_test,y_pred)

MAE

16.106975571674425
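For reference, the mean absolute error is the average of the absolute differences between the actual and predicted values, so it is expressed in the same units as `Close` (index points). The sketch below computes it by hand on made-up numbers; `y_true` and `y_hat` are illustrative names, not variables from this notebook.

```python
import numpy as np

# Made-up actual and predicted closing prices.
y_true = np.array([2050.0, 2060.0, 2040.0, 2070.0])
y_hat = np.array([2045.0, 2070.0, 2040.0, 2050.0])

# Absolute errors are 5, 10, 0 and 20; their mean is the MAE.
mae = np.mean(np.abs(y_true - y_hat))
print(mae)
```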

In [53]:
# Trying with a random forest regressor

from sklearn.ensemble import RandomForestRegressor

# Instantiate model with 200 decision trees
rf = RandomForestRegressor(n_estimators=200, random_state=42, min_samples_leaf=100)
# Train the model on training data
rf.fit(X_train, y_train)

# Use the forest's predict method on the test data
y_pred = rf.predict(X_test)

MAE = mean_absolute_error(y_test, y_pred)

# Print out the mean absolute error (MAE)
print('Mean Absolute Error:', round(MAE, 2))

Mean Absolute Error: 369.19


The Random Forest Regressor performs far worse than linear regression: an MAE of 369.19 versus 16.11.
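One plausible explanation, offered here as an assumption rather than something verified on `sphist.csv`, is that tree ensembles cannot predict values outside the range of targets seen during training, while a linear model can extrapolate a trend. If the 2013-2015 closes sit above most of the 1951-2012 range, as the sample rows near 2,000 suggest, the forest's predictions would be capped well below the actual prices. The sketch below shows the effect on synthetic data only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic upward trend: the target simply equals the feature.
X_train = np.arange(0, 100, dtype=float).reshape(-1, 1)
y_train = X_train.ravel()

# Test inputs lie beyond anything seen in training.
X_test = np.arange(150, 160, dtype=float).reshape(-1, 1)

rf = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_train, y_train)
lr = LinearRegression().fit(X_train, y_train)

rf_pred = rf.predict(X_test)
lr_pred = lr.predict(X_test)

# The forest never predicts above the largest training target...
print("forest max prediction:", rf_pred.max(), "training max:", y_train.max())
# ...while the linear model follows the trend out of range.
print("linear predictions follow the trend:", np.allclose(lr_pred, X_test.ravel()))
```

If this is what is happening, predicting a quantity that stays within a stable range (for example the day-over-day change instead of the raw price) might suit a tree-based model better, though that is untested here.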

 