<h2>SPYPredictor</h2>
Final Project Prototype Shuzo Katayama, 15 November 2020

Description

In this project, I aim to predict the price of the SPDR S&P 500 ETF (SPY), a popular exchange-traded fund that tracks the performance of the S&P 500 market index. Specifically, I want to use a regression algorithm to train a model that predicts the closing price of SPY on a particular day given (but not limited to) these factors: GDP, the unemployment rate, the federal funds rate, and the opening price of SPY.

As a prototype, I plan to use these few indicators to predict the closing price of SPY, but I do not expect it to work well because the economic indicators are not updated daily. Throughout the project, I plan to add more indicators and train more models to fine-tune my predictions. I will therefore run this final project as one long experiment, using trial and error to find the best model for this task. Ultimately, I hope to have a long Jupyter Notebook that tracks the progress of the different models I trained, and a shorter Jupyter Notebook containing only the final model.

By predicting the price of SPY, I hope to see which factors best predict the movement of the S&P 500 index on a given day. This information can be used to better understand what affects the movement of the US market in general. Since SPY holds many of the largest corporations in the US, the data that most accurately predicts the movement of SPY should indicate what kind of data moves the US stock market as a whole.

Progress

For this prototype, I created a very basic version of the final product. 

The project now needs more fine-tuning to be complete. For the rest of the project, I will add more data to fine-tune the regression model while testing different regression models (symbolic regression, random forest regression, and linear regression) to improve the predictions. Essentially, I plan to run the weeks leading up to the due date of the final project as a long trial-and-error experiment to see how well I can train the SPYPredictor.

Data:

SPY Prices: https://finance.yahoo.com/quote/SPY/history?period1=728265600&period2=1604966400&interval=1mo&filter=history&frequency=1mo&includeAdjustedClose=true \
GDP: https://fred.stlouisfed.org/series/GDP \
Unemployment: https://fred.stlouisfed.org/series/UNRATE \
Federal Funds Rate: https://fred.stlouisfed.org/series/FEDFUNDS 

In [1]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd

Importing the data:\
Each dataset is first loaded into a DataFrame named after the data, followed by 'df' (the federal funds data is stored in FedFundspd)

In [2]:
# Daily prices of SPY
SPYdf = pd.read_csv('SPYDaily.csv')
SPYdf

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,1993-01-29,43.968750,43.968750,43.750000,43.937500,26.079659,1003200
1,1993-02-01,43.968750,44.250000,43.968750,44.250000,26.265144,480500
2,1993-02-02,44.218750,44.375000,44.125000,44.343750,26.320782,201300
3,1993-02-03,44.406250,44.843750,44.375000,44.812500,26.599014,529400
4,1993-02-04,44.968750,45.093750,44.468750,45.000000,26.710312,531500
...,...,...,...,...,...,...,...
6992,2020-11-03,333.690002,338.250000,330.290009,336.029999,336.029999,93294200
6993,2020-11-04,340.859985,347.940002,339.589996,343.540009,343.540009,126959700
6994,2020-11-05,349.239990,352.190002,348.859985,350.239990,350.239990,82039700
6995,2020-11-06,349.929993,351.510010,347.649994,350.160004,350.160004,74973000


In [3]:
GDPdf = pd.read_csv('GDPQuarterly.csv')
GDPdf

Unnamed: 0,DATE,GDP
0,1993-01-01,6729.459
1,1993-04-01,6808.939
2,1993-07-01,6882.098
3,1993-10-01,7013.738
4,1994-01-01,7115.652
...,...,...
106,2019-07-01,21540.325
107,2019-10-01,21747.394
108,2020-01-01,21561.139
109,2020-04-01,19520.114


In [4]:
Unemploymentdf = pd.read_csv('UnemploymentMonthly.csv')
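# astype returns a new Series, so assign it back to keep the float conversion
Unemploymentdf['UNRATE'] = Unemploymentdf['UNRATE'].astype('float')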
Unemploymentdf

Unnamed: 0,DATE,UNRATE
0,1993-01-01,7.3
1,1993-02-01,7.1
2,1993-03-01,7.0
3,1993-04-01,7.1
4,1993-05-01,7.1
...,...,...
329,2020-06-01,11.1
330,2020-07-01,10.2
331,2020-08-01,8.4
332,2020-09-01,7.9


In [5]:
FedFundspd = pd.read_csv('Fedfunds.csv')
FedFundspd

Unnamed: 0,DATE,FEDFUNDS
0,1993-01-01,3.02
1,1993-02-01,3.03
2,1993-03-01,3.07
3,1993-04-01,2.96
4,1993-05-01,3.00
...,...,...
329,2020-06-01,0.08
330,2020-07-01,0.09
331,2020-08-01,0.10
332,2020-09-01,0.09


Splitting Data into X and y

In [6]:
# Training targets are the closing prices of SPY
y = SPYdf['Close'].to_numpy()

In [7]:
# Initialise X with 4 columns and as many rows as SPYdf
c = 4
r = len(SPYdf)

a = SPYdf.to_numpy()
X = np.empty((r,c))

# Keep a separate array of the dates
Xdate = SPYdf.to_numpy()
Xdate = Xdate[:,0]

For the training data, the columns of the array X hold the following data:\
0: Opening price\
1: GDP\
2: Unemployment\
3: Federal Funds Rate

In [8]:
# Copy the opening price from SPYdf into column 0 of X
X[:, 0] = SPYdf['Open'].to_numpy()

In [9]:
# Copy the GDP into column 1
# GDP is reported quarterly and SPY prices are daily, so the GDP value is repeated for every trading day in the quarter
g = GDPdf.to_numpy()
counter = r-1
gdpcounter = (len(g)-1)
for item in X:
    if g[gdpcounter,0] == Xdate[counter]:
        gdpcounter = gdpcounter - 1
    
    item[1] = g[gdpcounter, 1]
    counter = counter - 1
    
X

array([[   43.96875 , 21157.635   ,     0.      ,     0.      ],
       [   43.96875 , 21157.635   ,     0.      ,     0.      ],
       [   44.21875 , 21157.635   ,     0.      ,     0.      ],
       ...,
       [  349.23999 , 21561.139   ,     0.      ,     0.      ],
       [  349.929993, 21561.139   ,     0.      ,     0.      ],
       [  363.970001, 21561.139   ,     0.      ,     0.      ]])

In [10]:
# Copy into column 2, the Unemployment
# Unemployment is calculated monthly; writing Unemployment every day in the month in X
u = Unemploymentdf.to_numpy()
counter = r-1
ucounter = (len(u)-1)
for item in X:
    if u[ucounter, 0] == Xdate[counter]:
        ucounter = ucounter - 1
        
    item[2] = u[ucounter, 1]
    counter = counter - 1

np.set_printoptions(suppress=True)
X

array([[   43.96875 , 21157.635   ,     6.9     ,     0.      ],
       [   43.96875 , 21157.635   ,     6.9     ,     0.      ],
       [   44.21875 , 21157.635   ,     6.9     ,     0.      ],
       ...,
       [  349.23999 , 21561.139   ,     8.4     ,     0.      ],
       [  349.929993, 21561.139   ,     8.4     ,     0.      ],
       [  363.970001, 21561.139   ,     8.4     ,     0.      ]])

In [11]:
# Copy the Federal Funds Rate into column 3
# The Federal Funds Rate is reported monthly, so its value is repeated for every trading day in the month
f = FedFundspd.to_numpy()
counter = r-1
fcounter = (len(f)-1)
for item in X:
    if f[fcounter, 0] == Xdate[counter]:
        fcounter = fcounter - 1
        
    item[3] = f[fcounter, 1]
    counter = counter - 1
    
X

array([[   43.96875 , 21157.635   ,     6.9     ,     0.09    ],
       [   43.96875 , 21157.635   ,     6.9     ,     0.09    ],
       [   44.21875 , 21157.635   ,     6.9     ,     0.09    ],
       ...,
       [  349.23999 , 21561.139   ,     8.4     ,     0.1     ],
       [  349.929993, 21561.139   ,     8.4     ,     0.1     ],
       [  363.970001, 21561.139   ,     8.4     ,     0.1     ]])
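
The three loops above iterate over X from its first row while the date counter starts at the last row, so the dates and the rows being filled appear to run in opposite directions. One alternative that avoids manual counters is an "as-of" join, which attaches to each trading day the most recent indicator value dated on or before it. The following is a minimal sketch under that assumption, not the method used in this notebook; the helper name `align_indicator` and the tiny frames are made up for illustration, while the `Date` and `DATE` column names follow the CSV files shown earlier.

```python
import pandas as pd

def align_indicator(daily, indicator, value_col):
    """Attach the latest indicator value dated on or before each trading day."""
    daily = daily.copy()
    indicator = indicator.copy()
    # Parse both date columns so they share one dtype, as merge_asof requires
    daily["Date"] = pd.to_datetime(daily["Date"])
    indicator["DATE"] = pd.to_datetime(indicator["DATE"])
    merged = pd.merge_asof(
        daily.sort_values("Date"),
        indicator.sort_values("DATE")[["DATE", value_col]],
        left_on="Date",
        right_on="DATE",
        direction="backward",  # look back to the most recent release
    )
    return merged.drop(columns="DATE")

# Made-up stand-ins for SPYdf and GDPdf
daily = pd.DataFrame({
    "Date": ["2020-01-02", "2020-03-31", "2020-04-01", "2020-04-02"],
    "Open": [1.0, 2.0, 3.0, 4.0],
})
gdp = pd.DataFrame({"DATE": ["2020-01-01", "2020-04-01"], "GDP": [100.0, 200.0]})

aligned = align_indicator(daily, gdp, "GDP")
print(aligned)
```

With the real data, the same helper could be called once per indicator (GDP, UNRATE, FEDFUNDS), and the resulting columns converted with `to_numpy()` to build X.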

Splitting data into training and testing sets using train_test_split

In [12]:
# Splitting into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
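
Because the rows of X are trading days in time order, a shuffled split places days from the same period in both the training and testing sets, which may make the test scores look better than they would on genuinely unseen future days. A chronological split is one possible alternative. The sketch below assumes that choice and uses small stand-in arrays (`X_demo`, `y_demo`) rather than the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays ordered from oldest to newest (hypothetical data)
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10, dtype=float)

# shuffle=False keeps the original order: the first 60% of rows are used
# for training and the last 40% for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, shuffle=False
)

print(len(y_tr), len(y_te))
```

For the notebook's data this would correspond to passing `shuffle=False` instead of `random_state=1`, provided X and y are sorted by date.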

In [13]:
# Standardisation is disabled for now; uncomment to scale the features and targets
# sc = StandardScaler()
# sc.fit(X_train)
# X_train_std = sc.transform(X_train)
# X_test_std = sc.transform(X_test)
#
# sc_y = StandardScaler()
# sc_y.fit(y_train[:, np.newaxis])
# y_train_std = sc_y.transform(y_train[:, np.newaxis]).flatten()
# y_test_std = sc_y.transform(y_test[:, np.newaxis]).flatten()

To keep track of my progress, I will name each model with a two-letter abbreviation of the model type followed by a number (for example, RF0).
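
One way to keep the scores of the named models side by side is to collect them in a single table. This is only a sketch of that idea: the `record` helper and the `results` list are hypothetical names, and the demo data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

results = []  # one dictionary of scores per trained model

def record(name, model, X_train, y_train, X_test, y_test):
    """Fit the model and store its train/test scores under the given name."""
    model.fit(X_train, y_train)
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)
    results.append({
        "model": name,
        "mse_train": mean_squared_error(y_train, train_pred),
        "mse_test": mean_squared_error(y_test, test_pred),
        "r2_train": r2_score(y_train, train_pred),
        "r2_test": r2_score(y_test, test_pred),
    })
    return model

# Synthetic, noise-free linear data used only to exercise the helper
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo @ np.array([1.0, 2.0, 0.0, -1.0])

record("LR0", LinearRegression(), X_demo[:60], y_demo[:60], X_demo[60:], y_demo[60:])
summary = pd.DataFrame(results).set_index("model")
print(summary)
```

Each later model (RF1, SR1, and so on) would add one more row, which makes it easier to compare experiments across the notebook.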

Model RF0: Training the Random Forest Regressor, from scikit-learn

In [14]:
from sklearn.ensemble import RandomForestRegressor

In [15]:
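# The default split criterion is mean squared error, so it is not passed explicitly
# (the 'mse' spelling is no longer accepted by recent scikit-learn releases)
est = RandomForestRegressor(n_estimators=1000, random_state=1, n_jobs=-1)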
est.fit(X_train, y_train)
# est.fit(X_train_std, y_train_std)

RandomForestRegressor(n_estimators=1000, n_jobs=-1, random_state=1)

In [16]:
y_train_pred = est.predict(X_train)
y_test_pred = est.predict(X_test)

print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train, y_train_pred),
        mean_squared_error(y_test, y_test_pred)))
        # mean_squared_error(y_train_std, y_train_pred),
        # mean_squared_error(y_test_std, y_test_pred)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(y_train, y_train_pred),
        r2_score(y_test, y_test_pred)))
        # r2_score(y_train_std, y_train_pred),
        # r2_score(y_test_std, y_test_pred)))

MSE train: 0.521, test: 2.904
R^2 train: 1.000, test: 0.999
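
Since one goal of the project is to see which factors best predict SPY, the fitted forest's impurity-based feature importances offer a first rough look. The sketch below uses synthetic data in which only the first column drives the target, so it illustrates the API rather than reporting results for this notebook; the feature names are assumed to follow the column order of X described earlier.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends almost entirely on column 0
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = 5.0 * X_demo[:, 0] + 0.1 * rng.normal(size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=1)
forest.fit(X_demo, y_demo)

# Assumed column order of X: opening price, GDP, unemployment, federal funds rate
feature_names = ["Open", "GDP", "Unemployment", "FedFunds"]
importances = dict(zip(feature_names, forest.feature_importances_))

# Print the features from most to least important
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

For the RF0 model trained above, the same lookup would be `est.feature_importances_`.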


Model SR0: Training the Symbolic Regressor model, from gplearn

In [17]:
from gplearn.genetic import SymbolicRegressor

In [18]:
est = SymbolicRegressor(population_size=1000,
                        init_depth=(4,6),
                        generations=100, stopping_criteria=0.01,
                        p_crossover=0.3, p_subtree_mutation=0.35,
                        p_hoist_mutation=0.0, p_point_mutation=0.35,
                        max_samples=1.0, verbose=1,
                        #const_range=None,
                        const_range=(-1.0,1.0),
                        tournament_size=5,
                        function_set=('add', 'sub', 'mul', 'div', 'sqrt', 'log', 
                                      'abs', 'neg', 'inv', 'max','min', 'sin', 'cos', 'tan'),
                        parsimony_coefficient=0.0001, random_state=0)
est.fit(X_train, y_train)

    |   Population Average    |             Best Individual              |
---- ------------------------- ------------------------------------------ ----------
 Gen   Length          Fitness   Length          Fitness      OOB Fitness  Time Left
   0    14.96      1.18179e+10        3         0.894873              N/A      1.21m
   1    14.79           103822       35         0.894873              N/A      1.12m
   2    17.58           473541        2         0.894873              N/A      1.20m
   3    20.95           7439.1       11         0.894385              N/A      1.16m
   4    22.06          8278.99        5         0.894208              N/A      1.25m
   5    19.78          569.126        5         0.894208              N/A      1.21m
   6    11.39          38356.1       20         0.894068              N/A     57.30s
   7     5.57          50228.4       14         0.894385              N/A     53.34s
   8     3.36          39085.1        5         0.894385              N/A  

SymbolicRegressor(function_set=('add', 'sub', 'mul', 'div', 'sqrt', 'log',
                                'abs', 'neg', 'inv', 'max', 'min', 'sin', 'cos',
                                'tan'),
                  generations=100, init_depth=(4, 6), p_crossover=0.3,
                  p_hoist_mutation=0.0, p_point_mutation=0.35,
                  p_subtree_mutation=0.35, parsimony_coefficient=0.0001,
                  random_state=0, stopping_criteria=0.01, tournament_size=5,
                  verbose=1)

Model LR0: Training the Linear Regression model with default settings

In [19]:
from sklearn.linear_model import LinearRegression

In [20]:
est = LinearRegression()
est.fit(X_train, y_train)

LinearRegression()

In [21]:
y_train_pred = est.predict(X_train)
y_test_pred = est.predict(X_test)

print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train, y_train_pred),
        mean_squared_error(y_test, y_test_pred)))
        # mean_squared_error(y_train_std, y_train_pred),
        # mean_squared_error(y_test_std, y_test_pred)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(y_train, y_train_pred),
        r2_score(y_test, y_test_pred)))
        # r2_score(y_train_std, y_train_pred),
        # r2_score(y_test_std, y_test_pred)))

MSE train: 1.955, test: 1.933
R^2 train: 1.000, test: 1.000
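
An R^2 this close to 1 may mostly reflect that the same day's opening price is one of the features, since a day's open and close are usually close together. A simple check is to score a naive baseline that predicts the close to be equal to the open. The sketch below uses synthetic prices (`open_demo`, `close_demo` are made up) only to show how such a baseline could be computed.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic prices: the close is the open plus a small random daily move
rng = np.random.default_rng(0)
open_demo = np.linspace(50.0, 350.0, 500)
close_demo = open_demo + rng.normal(scale=1.5, size=500)

# Naive baseline: predict that the price closes where it opened
baseline_pred = open_demo
baseline_mse = mean_squared_error(close_demo, baseline_pred)
baseline_r2 = r2_score(close_demo, baseline_pred)

print('Baseline MSE: %.3f, R^2: %.3f' % (baseline_mse, baseline_r2))
```

With the notebook's arrays, the equivalent baseline would be `mean_squared_error(y_test, X_test[:, 0])`; a model is only adding information if it beats that number.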


So far, this code shows the first step of the experimentation process that I will continue. The rest of the project will mainly consist of adding data to X or adjusting the parameters of the models, then training all three algorithms on the new data to see how prediction improves. The code runnnin