# Naive Forecasting using SciKit-Learn

This notebook measures forecast accuracy with the following performance metrics, most of which SciKit-Learn provides:

- Sum of Squared Errors
- Mean Absolute Error
- Mean Absolute Percentage Error
- Symmetric Mean Absolute Percentage Error
- Mean Squared Error
- Root Mean Squared Error
- R-Squared

Using the dataset of S&P 500 stock prices (`data/SPY.csv`), build a naive forecasting model: the forecast for each day's closing price is simply the previous day's closing price.
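As a minimal sketch of the idea, the toy series below (hypothetical values, not taken from the dataset) shows how shifting a pandas Series by one row produces the naive forecast. The names `prices` and `naive` are illustrative only.

```python
import pandas as pd

# Hypothetical prices, for illustration only
prices = pd.Series([10.0, 11.0, 12.5, 12.0], name="Close")

# Naive forecast: each row's prediction is the previous row's observed value
naive = prices.shift(1)

# The first row has no previous value, so its forecast is NaN
print(pd.DataFrame({"Close": prices, "close_prediction": naive}))
```

The first forecast is missing by construction, which is why that row has to be dropped before any metric is computed.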

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.metrics import mean_absolute_percentage_error, mean_absolute_error, r2_score, mean_squared_error

%matplotlib inline

In [2]:
df = pd.read_csv('data/SPY.csv', index_col='Date', parse_dates=True)

In [3]:
df.head()

| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 2010-01-04 | 112.370003 | 113.389999 | 111.510002 | 113.330002 | 92.246048 | 118944600 |
| 2010-01-05 | 113.260002 | 113.68 | 112.849998 | 113.629997 | 92.490204 | 111579900 |
| 2010-01-06 | 113.519997 | 113.989998 | 113.43 | 113.709999 | 92.555328 | 116074400 |
| 2010-01-07 | 113.5 | 114.330002 | 113.18 | 114.190002 | 92.94606 | 131091100 |
| 2010-01-08 | 113.889999 | 114.620003 | 113.660004 | 114.57 | 93.255348 | 126402800 |


In [4]:
# Create new column of Close prices shifted forward by 1 row, i.e. predicted Close prices

df['close_prediction'] = df['Close'].shift(1)

In [5]:
# NaN value in 1st row since 1st value doesn't have previous value
df.head()

| Date | Open | High | Low | Close | Adj Close | Volume | close_prediction |
|---|---|---|---|---|---|---|---|
| 2010-01-04 | 112.370003 | 113.389999 | 111.510002 | 113.330002 | 92.246048 | 118944600 | NaN |
| 2010-01-05 | 113.260002 | 113.68 | 112.849998 | 113.629997 | 92.490204 | 111579900 | 113.330002 |
| 2010-01-06 | 113.519997 | 113.989998 | 113.43 | 113.709999 | 92.555328 | 116074400 | 113.629997 |
| 2010-01-07 | 113.5 | 114.330002 | 113.18 | 114.190002 | 92.94606 | 131091100 | 113.709999 |
| 2010-01-08 | 113.889999 | 114.620003 | 113.660004 | 114.57 | 93.255348 | 126402800 | 114.190002 |


In [7]:
# Variables for the SciKit-Learn metrics (drop the 1st row, whose prediction is NaN)

y_true = df.iloc[1:]['Close']
y_pred = df.iloc[1:]['close_prediction']

## Metrics

The goal is to get a feel for how the metric values relate to one another: what counts as good, and what counts as bad?

For example, if R-Squared is good, does that mean the RMSE value is also good?
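One way to start answering that question is to check which metrics depend on the units of the data. The sketch below uses made-up prices and two hypothetical helpers, `rmse` and `mape` (plain NumPy, not the SciKit-Learn functions), to suggest that RMSE changes when dollars are converted to cents while MAPE does not.

```python
import numpy as np

# Made-up prices in dollars; the naive forecast is the previous value
prices = np.array([100.0, 101.0, 103.0, 102.0, 105.0])
actual, forecast = prices[1:], prices[:-1]

def rmse(y, yhat):
    """Root mean squared error, in the same units as y."""
    return np.sqrt(np.mean((y - yhat) ** 2))

def mape(y, yhat):
    """Mean absolute percentage error, returned as a fraction."""
    return np.mean(np.abs(y - yhat) / np.abs(y))

# Re-express the same data in cents
rmse_dollars = rmse(actual, forecast)
rmse_cents = rmse(actual * 100, forecast * 100)
mape_dollars = mape(actual, forecast)
mape_cents = mape(actual * 100, forecast * 100)

print(rmse_dollars, rmse_cents)  # RMSE scales with the units
print(mape_dollars, mape_cents)  # MAPE is unit-free
```

An RMSE is therefore only "good" or "bad" relative to the scale of the prices, whereas percentage-based metrics and R-Squared can be compared across scales.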

NOTE: There is no function in SciKit-Learn for Sum of Squared Errors (SSE), so you need to calculate this manually.

In [8]:
# Compute SSE - hard to interpret on its own, since it grows with the number of samples

(y_true - y_pred).dot(y_true - y_pred)

6330.3742894926045

In [9]:
# Compute MSE - more reasonable range

mean_squared_error(y_true, y_pred)

2.798573956451196

In [10]:
# Compute MSE manually

(y_true - y_pred).dot(y_true - y_pred) / len(y_true)

2.7985739564511958

In [11]:
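# Compute RMSE
# (squared=False is deprecated in newer scikit-learn releases, which provide root_mean_squared_error instead)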

mean_squared_error(y_true, y_pred, squared=False)

1.672893886787562

In [12]:
# Or compute RMSE with NumPy (same answer)

np.sqrt((y_true - y_pred).dot(y_true - y_pred) / len(y_true))

1.6728938867875618

In [13]:
# Compute MAE - off by approx $1.15 on average

mean_absolute_error(y_true, y_pred)

1.1457559803120336

In [14]:
# Compute R-Squared - suspiciously good, almost a perfect 1.0

r2_score(y_true, y_pred)

0.9989603259063914

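Stock prices do not vary wildly from day to day, so it is not that surprising that the results are near perfect; the previous price in the Series is a good feature. R-Squared also compares the forecast against a baseline that always predicts the mean price, and that baseline is easy to beat when prices drift far from their overall mean over the period. However, there is no logic behind the prediction method, hence 'naive' forecasting. It scores well only because consecutive prices are so close to each other.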
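To see why a trend alone can produce a high R-Squared, the following sketch computes it by hand as 1 - SSE/SST on a made-up, steadily rising series. The helper `r2_manual` is hypothetical and only mirrors the usual definition; it is not the SciKit-Learn implementation.

```python
import numpy as np

def r2_manual(y, yhat):
    """R-Squared as 1 - SSE/SST, where SST uses the mean of y as the baseline."""
    sse = np.sum((y - yhat) ** 2)        # squared error of the forecast
    sst = np.sum((y - np.mean(y)) ** 2)  # squared error of always predicting the mean
    return 1 - sse / sst

# Made-up series that rises by exactly 1 at each step
prices = 100.0 + np.arange(50)
actual, forecast = prices[1:], prices[:-1]

# The naive forecast is off by 1 at every step, yet R-Squared is close to 1
score = r2_manual(actual, forecast)
print(score)
```

The forecast never gets a single value exactly right here, but the mean baseline is so poor on a trending series that the ratio SSE/SST stays tiny.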

In [15]:
# Compute MAPE - returned as a fraction, so about 0.65%; consistent with an MAE of about $1.15

mean_absolute_percentage_error(y_true, y_pred)

0.006494073151422373

### sMAPE - Symmetric Mean Absolute Percentage Error


$$ E = \frac{1}{N} \sum_{i=1}^{N} \frac{|y_i - \hat{y}_i|}{\left( |y_i| + |\hat{y}_i| \right) / 2} $$

The sMAPE metric is not implemented in SciKit-Learn, so you must calculate it manually:

In [16]:
def smape(y_true, y_pred):
    """Symmetric MAPE, returned as a fraction (not multiplied by 100)."""
    numerator = np.abs(y_true - y_pred)
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 2
    ratio = numerator / denominator
    return ratio.mean()


In [17]:
smape(y_true, y_pred)

0.006491365814068417

This is only slightly less than the non-symmetric MAPE, by about 0.000003, which makes sense: when the predictions are this close to the true values, the two metrics divide by nearly the same quantity.
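The sketch below illustrates this on made-up numbers: for a small error the MAPE and sMAPE denominators are almost identical, while for a large error they diverge. The `mape` helper is a hypothetical plain-NumPy stand-in for `mean_absolute_percentage_error`, and `smape` is restated so the example is self-contained.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction."""
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

def smape(y_true, y_pred):
    """Symmetric MAPE, as a fraction."""
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 2
    return np.mean(np.abs(y_true - y_pred) / denominator)

actual = np.array([100.0])

# Small error: prediction within 1% of the true value
small = np.array([101.0])
print(mape(actual, small), smape(actual, small))

# Large error: prediction 50% above the true value
large = np.array([150.0])
print(mape(actual, large), smape(actual, large))
```

With the 1% error the two values differ only in the fifth decimal place; with the 50% error they are 0.5 and 0.4.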