### Introduction

In this assignment, we build an end-to-end machine learning pipeline that predicts the daily mean temperature (the `TG` column) from other daily weather observations. The data consists of daily records from 2000 through 2019. We load and clean the raw file, derive calendar features from the date, train a linear regression model, evaluate it on a held-out test set, and save the trained model.

In [44]:
from matplotlib import pyplot as plt # for visualization
from datetime import datetime, timedelta # handle datetime
import pandas as pd
import numpy as np
import re
from sklearn import metrics
import math
import json
from pprint import pprint
# machine learning with sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
import _pickle as cPickle # save ML model
from google.cloud import storage # save the model to GCS

### Data Explanation

In [45]:
# data path
raw_data_path = "gs://mlops-weather-prediction/raw/"
feature_data_path = "gs://mlops-weather-prediction/feature_store/"
model_path = "model_repository/"
tmp_dir = '/tmp/'

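# load the raw data from a local CSV file, skipping malformed lines
# (on_bad_lines needs pandas >= 1.3; older versions used error_bad_lines=False)
df = pd.read_csv("raw_data2.csv", on_bad_lines='skip')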

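def dfreplace(df, *args, **kwargs):
    """Apply Series.str.replace to every cell of a DataFrame of strings."""
    s = pd.Series(df.values.flatten())
    s = s.str.replace(*args, **kwargs)
    return pd.DataFrame(s.values.reshape(df.shape), df.index, df.columns)

# strip the stray commas from every cell
weather_df = dfreplace(df, ',', '')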

for col in weather_df.columns:
    weather_df[col] = weather_df[col].astype(str)
    # mark empty or whitespace-only cells as missing
    blank = weather_df[col].str.match(r'^\s*$')
    weather_df.loc[blank, col] = np.nan
    
# persist data
weather_df.to_parquet(raw_data_path + 'weather.parquet', compression='GZIP')

ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
pyarrow or fastparquet is required for parquet support
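
The `ImportError` above means that neither `pyarrow` nor `fastparquet` could be imported, so `to_parquet` has no engine to write with. The check below is a minimal sketch of how to find out which engine is available before writing; the helper name `available_parquet_engines` is hypothetical and not part of pandas. Writing to a `gs://` path also needs a filesystem package such as `gcsfs`, which this sketch does not check.

```python
import importlib.util

def available_parquet_engines(candidates=("pyarrow", "fastparquet")):
    """Return the names in `candidates` that can be imported in this environment."""
    return [name for name in candidates if importlib.util.find_spec(name) is not None]

engines = available_parquet_engines()
if engines:
    print("parquet engine(s) found:", ", ".join(engines))
else:
    print("no parquet engine found; install one, e.g. `pip install pyarrow`")
```

If the list is empty, installing one of the two packages and restarting the kernel should let the `to_parquet` call run.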

In [46]:
# let's just explore the data a little bit
weather_df.head()

Unnamed: 0,"STN,","YYYYMMDD,","DDVEC,","FHVEC,","FG,","FHX,","FHXH,","FHN,","FHNH,","FXX,",...,"VVNH,","VVX,","VVXH,","NG,","UG,","UX,","UXH,","UN,","UNH,",EV24
0,370,20000101,239,21,26,40,3,10,17,70,...,19,38,23,8,99,100,10,98,1,
1,370,20000102,194,35,36,50,11,10,1,90,...,4,59,22,8,98,100,2,95,12,
2,370,20000103,204,63,63,80,19,40,1,120,...,24,68,10,8,92,97,24,89,18,
3,370,20000104,222,53,59,90,17,30,7,170,...,8,75,19,7,93,99,5,77,18,
4,370,20000105,193,37,39,50,13,30,1,80,...,8,80,14,3,92,97,8,82,14,


In [47]:
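# sum the rows where TG equals 0; every column holds strings at this point, so no
# row matches and all sums are 0 (use weather_df.isna().sum() to count missing values)
weather_df[weather_df['TG,']==0].sum()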

STN,         0.0
YYYYMMDD,    0.0
DDVEC,       0.0
FHVEC,       0.0
FG,          0.0
FHX,         0.0
FHXH,        0.0
FHN,         0.0
FHNH,        0.0
FXX,         0.0
FXXH,        0.0
TG,          0.0
TN,          0.0
TNH,         0.0
TX,          0.0
TXH,         0.0
T10N,        0.0
T10NH,       0.0
SQ,          0.0
SP,          0.0
Q,           0.0
DR,          0.0
RH,          0.0
RHX,         0.0
RHXH,        0.0
PG,          0.0
PX,          0.0
PXH,         0.0
PN,          0.0
PNH,         0.0
VVN,         0.0
VVNH,        0.0
VVX,         0.0
VVXH,        0.0
NG,          0.0
UG,          0.0
UX,          0.0
UXH,         0.0
UN,          0.0
UNH,         0.0
EV24         0.0
dtype: float64
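
The all-zero result above comes from filtering on `TG == 0`, which does not count missing values. A minimal sketch of a per-column missing-value count is shown below; the frame `toy` and its values are made up for illustration and only stand in for `weather_df`.

```python
import numpy as np
import pandas as pd

# toy frame standing in for weather_df (values are made up for illustration)
toy = pd.DataFrame({
    "TG,": ["57", "70", "  ", "76"],
    "EV24": ["", " ", "", "12"],
})

# treat empty or whitespace-only strings as missing, then count NaN per column
cleaned = toy.replace(r"^\s*$", np.nan, regex=True)
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```

On the real data, `weather_df.isna().sum()` gives the same kind of count, because the blank cells were already replaced with NaN in the cleaning loop.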

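As the preview above shows, the data holds one row per day. The column names match the daily-observation format published by KNMI, the Dutch meteorological institute; this is an assumption, since the origin of `raw_data2.csv` is not documented here. Under that assumption, the columns are:
* **STN**       : Weather station number
* **YYYYMMDD**  : Observation date
* **DDVEC, FHVEC, FG, FHX, FHN, FXX** : Wind direction, mean wind speeds, and maximum wind gust
* **TG, TN, TX, T10N** : Daily mean, minimum, and maximum temperature, and minimum temperature at 10 cm above the surface
* **SQ, SP, Q** : Sunshine duration, percentage of maximum possible sunshine, and global radiation
* **DR, RH, RHX** : Precipitation duration, daily precipitation amount, and maximum hourly precipitation
* **PG, PX, PN** : Mean, maximum, and minimum sea-level pressure
* **VVN, VVX**  : Minimum and maximum visibility
* **NG**        : Mean cloud cover
* **UG, UX, UN** : Mean, maximum, and minimum relative humidity
* **EV24**      : Potential evapotranspiration

Columns with an extra trailing **H** (for example `FHXH` or `TNH`) record the hour in which the corresponding extreme was measured. In the loaded frame most column names carry a trailing comma (for example `TG,`), which is why the code below refers to them that way.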


### Feature Engineering

Our goal is to predict the daily mean temperature (`TG`) for a given day. The raw data stores the date as a single `YYYYMMDD` string and every value as text, so it cannot be passed to a model as is. In this section we split the date into year, month, and day features, convert every column to a numeric type, and drop the columns and rows we do not train on.

In [48]:
# copy the cleaned data so the feature columns are added to a separate frame
weather_feautres_df = weather_df.copy()

In [49]:
weather_feautres_df['YYYY'] = weather_feautres_df['YYYYMMDD,'].str.slice(0, 4)  # year
weather_feautres_df['MM'] = weather_feautres_df['YYYYMMDD,'].str.slice(4, 6)    # month
weather_feautres_df['DD'] = weather_feautres_df['YYYYMMDD,'].str.slice(6, 8)    # day
# convert every column to float; columns that cannot be converted are left unchanged
for i in weather_feautres_df.columns:
    weather_feautres_df[i] = weather_feautres_df[i].astype(float, errors='ignore')
# the original date string is no longer needed
weather_feautres_df = weather_feautres_df.drop('YYYYMMDD,', axis=1)
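
The string slicing above can also be expressed by parsing the date once and reading its parts. The sketch below assumes dates in the same `YYYYMMDD` layout; the sample values and the extra `day_of_year` column are illustrative and not part of the notebook.

```python
import pandas as pd

# toy dates in the YYYYMMDD layout used by the notebook (values are illustrative)
dates = pd.Series(["20000101", "20000102", "20191230"])

# parse once, then read the calendar parts from the datetime accessor
parsed = pd.to_datetime(dates, format="%Y%m%d")
features = pd.DataFrame({
    "YYYY": parsed.dt.year,
    "MM": parsed.dt.month,
    "DD": parsed.dt.day,
    "day_of_year": parsed.dt.dayofyear,  # extra seasonal feature, not in the notebook
})
print(features)
```

Parsing has the side benefit of failing loudly on malformed dates, whereas slicing silently produces wrong substrings.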

In [50]:
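# drop the station id, EV24, NG, and the other temperature columns with their hour columns
weather_feautres_df = weather_feautres_df.drop(columns = ['STN,','EV24', 'NG,', 'TN,', 'TNH,', 'TX,', 'TXH,', 'T10N,', 'T10NH,'])
# drop the rows that still contain missing values
weather_feautres_df = weather_feautres_df.dropna()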

In [51]:
weather_feautres_df

Unnamed: 0,"DDVEC,","FHVEC,","FG,","FHX,","FHXH,","FHN,","FHNH,","FXX,","FXXH,","TG,",...,"VVX,","VVXH,","UG,","UX,","UXH,","UN,","UNH,",YYYY,MM,DD
0,239.0,21.0,26.0,40.0,3.0,10.0,17.0,70.0,14.0,57.0,...,38.0,23.0,99.0,100.0,10.0,98.0,1.0,2000.0,1.0,1.0
1,194.0,35.0,36.0,50.0,11.0,10.0,1.0,90.0,22.0,70.0,...,59.0,22.0,98.0,100.0,2.0,95.0,12.0,2000.0,1.0,2.0
2,204.0,63.0,63.0,80.0,19.0,40.0,1.0,120.0,12.0,77.0,...,68.0,10.0,92.0,97.0,24.0,89.0,18.0,2000.0,1.0,3.0
3,222.0,53.0,59.0,90.0,17.0,30.0,7.0,170.0,17.0,76.0,...,75.0,19.0,93.0,99.0,5.0,77.0,18.0,2000.0,1.0,4.0
4,193.0,37.0,39.0,50.0,13.0,30.0,1.0,80.0,18.0,44.0,...,80.0,14.0,92.0,97.0,8.0,82.0,14.0,2000.0,1.0,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7300,74.0,26.0,28.0,50.0,11.0,10.0,23.0,70.0,7.0,54.0,...,74.0,20.0,87.0,97.0,1.0,77.0,20.0,2019.0,12.0,27.0
7301,128.0,14.0,17.0,30.0,15.0,10.0,3.0,40.0,2.0,11.0,...,70.0,14.0,91.0,98.0,4.0,81.0,14.0,2019.0,12.0,28.0
7302,153.0,19.0,19.0,30.0,10.0,10.0,4.0,50.0,18.0,7.0,...,75.0,13.0,87.0,98.0,5.0,73.0,15.0,2019.0,12.0,29.0
7303,209.0,33.0,35.0,50.0,19.0,20.0,1.0,100.0,19.0,41.0,...,83.0,5.0,56.0,78.0,2.0,37.0,14.0,2019.0,12.0,30.0


### Train Linear Regression

In [52]:
# write out to parquet
weather_feautres_df.to_parquet(feature_data_path + 'weather_feautres_df.parquet', compression='GZIP')

ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
pyarrow or fastparquet is required for parquet support

In [53]:
# separate the features (X) from the target column TG (y)
X, y = weather_feautres_df.drop('TG,', axis=1), weather_feautres_df['TG,']
# hold out 20% of the rows as a test set
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train

Unnamed: 0,"DDVEC,","FHVEC,","FG,","FHX,","FHXH,","FHN,","FHNH,","FXX,","FXXH,","SQ,",...,"VVX,","VVXH,","UG,","UX,","UXH,","UN,","UNH,",YYYY,MM,DD
1921,256.0,24.0,29.0,50.0,12.0,10.0,21.0,90.0,13.0,74.0,...,82.0,11.0,73.0,97.0,4.0,41.0,14.0,2005.0,4.0,5.0
6404,275.0,26.0,30.0,60.0,11.0,0.0,1.0,100.0,11.0,48.0,...,83.0,17.0,77.0,96.0,3.0,52.0,8.0,2017.0,7.0,14.0
820,193.0,23.0,30.0,50.0,11.0,20.0,1.0,150.0,16.0,56.0,...,81.0,14.0,75.0,92.0,24.0,41.0,13.0,2002.0,3.0,31.0
6464,226.0,54.0,59.0,80.0,11.0,20.0,18.0,130.0,10.0,96.0,...,82.0,16.0,79.0,96.0,3.0,56.0,16.0,2017.0,9.0,12.0
3407,36.0,13.0,20.0,50.0,19.0,10.0,1.0,80.0,18.0,126.0,...,81.0,18.0,65.0,99.0,3.0,36.0,13.0,2009.0,4.0,30.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
905,300.0,29.0,30.0,50.0,10.0,10.0,1.0,100.0,14.0,141.0,...,80.0,1.0,67.0,98.0,4.0,39.0,14.0,2002.0,6.0,24.0
5200,84.0,34.0,37.0,60.0,19.0,10.0,3.0,110.0,13.0,108.0,...,81.0,17.0,57.0,88.0,5.0,33.0,17.0,2014.0,3.0,28.0
3988,71.0,31.0,32.0,60.0,1.0,10.0,23.0,90.0,1.0,0.0,...,64.0,1.0,85.0,89.0,9.0,71.0,2.0,2010.0,12.0,2.0
235,51.0,32.0,32.0,50.0,11.0,10.0,3.0,90.0,14.0,121.0,...,80.0,16.0,73.0,97.0,5.0,50.0,14.0,2000.0,8.0,23.0


In [54]:
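# fit an ordinary least squares model on the training split
model = LinearRegression()
model.fit(X_train, Y_train)

# one coefficient per feature column, in the order of X_train.columns
print(model.coef_)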

[ 0.05398324 -0.35832923  0.02515511  0.1393122   0.0208729   0.40397451
  0.02570389 -0.01369415  0.17627609  0.12898045 -0.95196522  0.0728551
 -0.02717224  0.02556007  0.32033743 -0.14544554 -0.02741121 -0.14468729
 -0.17560968  0.12268275  0.21564529  0.04238334  0.04898858  0.69222656
  0.10162301  0.07643803  0.03130588  0.18703017 -0.11025849  0.11310672
  0.02817462  5.05442702  0.06157034]
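
The raw coefficient array is hard to read without the feature names next to it. The sketch below shows one way to label coefficients, using a small synthetic data set with a known relationship; the names `toy_model`, `a`, and `b` are hypothetical and unrelated to the weather data.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# toy data with a known linear relationship: y = 2*a - 3*b + 1
X = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0, 4.0], "b": [1.0, 0.0, 2.0, 1.0, 3.0]})
y = 2 * X["a"] - 3 * X["b"] + 1

toy_model = LinearRegression().fit(X, y)

# label each coefficient with the column it belongs to
coefficients = pd.Series(toy_model.coef_, index=X.columns)
print(coefficients)
print("intercept:", toy_model.intercept_)
```

Applied to the notebook's model, the equivalent call would be `pd.Series(model.coef_, index=X_train.columns)`.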


In [55]:
predictions = model.predict(X_test)
predictions

array([ 44.5726549 , 122.13738341, 134.88610896, ..., 123.25658893,
        80.22016913,  72.79614657])

In [56]:
df_test_results = pd.DataFrame({'Prediction':predictions, 'Real': Y_test})
df_test_results

Unnamed: 0,Prediction,Real
2560,44.572655,93.0
2298,122.137383,105.0
5248,134.886109,86.0
5976,179.884326,203.0
6010,209.272844,148.0
...,...,...
5171,51.168119,48.0
3137,155.432685,187.0
4486,123.256589,60.0
808,80.220169,96.0


### Evaluating The Model

In [57]:
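# mean squared error and its square root
MSE = metrics.mean_squared_error(df_test_results['Real'],df_test_results['Prediction'])
RMSE = math.sqrt(MSE)
# MAXE is the largest squared error; its square root is the largest absolute error
MAXE = max((df_test_results['Real']- df_test_results['Prediction'])**2)
# mean absolute error
MAE = metrics.mean_absolute_error(df_test_results['Real'],df_test_results['Prediction'])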

pd.Series([MSE,RMSE,MAXE,MAE], index = ['MSE','RMSE','MAXE','MAE'])

MSE      1442.071924
RMSE       37.974622
MAXE    13634.968120
MAE        30.436191
dtype: float64
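
To make the definitions behind these numbers explicit, the sketch below recomputes the same metrics by hand on three made-up value pairs. The function name `regression_errors` is hypothetical. It reports both the largest squared error (what `MAXE` holds above) and the largest absolute error, since the two are easy to confuse.

```python
import math

def regression_errors(real, predicted):
    """Compute common regression error metrics from two equal-length sequences."""
    residuals = [r - p for r, p in zip(real, predicted)]
    n = len(residuals)
    mse = sum(e * e for e in residuals) / n
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": sum(abs(e) for e in residuals) / n,
        "max_abs_error": max(abs(e) for e in residuals),
        "max_squared_error": max(e * e for e in residuals),
    }

# made-up values for illustration; residuals are -2, 2 and -3
errors = regression_errors([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
for name, value in errors.items():
    print(f"{name}: {value:.4f}")
```

RMSE and MAE are in the same units as the target, which makes them easier to interpret than MSE or the squared maximum.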

### Save The Model

In [58]:
regression_best = model

In [59]:
# save the model into temp
with open('/tmp/model.pickle', 'wb') as f:
    cPickle.dump(regression_best, f, -1)

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/model.pickle'
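
The `FileNotFoundError` above suggests that the hard-coded `/tmp/` directory does not exist on the machine running the notebook (for example on Windows); that cause is an assumption. The sketch below resolves the temporary directory through the standard library and checks that the saved object can be loaded again. The dictionary `toy_model` is a stand-in for the trained model.

```python
import os
import pickle
import tempfile

# stand-in for the trained model; any picklable object works for this demo
toy_model = {"coef": [0.5, -1.2], "intercept": 3.0}

# tempfile.gettempdir() returns a temp directory that exists on this OS
model_file = os.path.join(tempfile.gettempdir(), "model.pickle")

# write with the highest available pickle protocol
with open(model_file, "wb") as f:
    pickle.dump(toy_model, f, protocol=pickle.HIGHEST_PROTOCOL)

# read it back to confirm the file is usable
with open(model_file, "rb") as f:
    restored = pickle.load(f)

print(restored == toy_model)
```

In the notebook, the same pattern would replace the literal `'/tmp/model.pickle'` path and dump `regression_best` instead of the toy dictionary.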