# 07.00 - Modeling - Prophet Model & Select Cross Validation Rolling Window Size

 + We have data for each summer from 1994 to 2018
 + We initially decided that the minimum size of the hold out test data is 5 years from 2014 to 2018
 + We want to select a rolling window that extracts as much value as possible from the data, but that leaves as much data as possible as hold-out data
 + Prophet seems to have good out of the box performance, and runs faster than statsmodels ARIMA
 + We believe that underlying structural changes between 1994 and 2018 have altered the cause-and-effect relationships between the features and power demand
 + The feature data is limited to weather. We do not have data for items such as air conditioner penetration, conservation growth (e.g. LEDs), population growth, or housing stock types.
 + Therefore, I am going to make the assertion that next year's power demand pattern resembles this year's pattern more closely than last year's
 + We could introduce some sort of decay scheme where more recent data is weighted more heavily than older data. But this does not help us maximize the size of the held-out test data
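
The decay scheme mentioned in the last bullet could look something like the sketch below. It is only an illustration and is not used later in this notebook: `year_decay_weights` and `half_life_years` are hypothetical names, and exponential decay is just one possible weighting.

```python
import numpy as np
import pandas as pd


def year_decay_weights(index, half_life_years=3.0):
    """Return one weight per row that halves every `half_life_years` years.

    Hypothetical helper: rows in the most recent year get weight 1.0.
    """
    years = np.asarray(index.year, dtype=float)
    age = years.max() - years  # 0 for the most recent year
    return pd.Series(0.5 ** (age / half_life_years), index=index)


# Toy daily index spanning three summers (illustrative dates only)
idx = pd.to_datetime(["2007-07-02", "2008-07-01", "2009-07-01", "2009-07-02"])
weights = year_decay_weights(idx, half_life_years=1.0)
print(weights)
```

Weights like these could be passed to an estimator that accepts per-sample weights, but every year would still sit in the training set, so the hold-out period would not grow.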
 
#### One approach could be:
 + We will use only the power data, and run a series of incrementally increasing cross validation windows across the data between 1994 and 2013
 + Based on the results we will select a window for the rolling time series cross validation to use in the rest of the modeling process. We will select the window by running Prophet on an incrementally increasing sequence of rolling windows, and look for either a best size, or a size beyond which we get diminishing returns.
 + I realize that this is breaking some rules. If the window proves to be 3 years, then to get 10 cross folds my hold out data will be from 2008 to 2018. But I will have already "touched" some of this data when I determined the size of the rolling window.
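
The search described above can be sketched as a loop over candidate window sizes. This is a minimal sketch on a synthetic yearly series, with a naive "mean of the training window" forecaster standing in for Prophet. The helper names (`rolling_folds`, `score_window`) and the toy data are assumptions made for illustration; they do not reproduce the project's `RollingAnnualTimeSeriesSplit`.

```python
import numpy as np


def rolling_folds(years, window):
    """Yield (train_years, validation_year) with a fixed-length training window."""
    for i in range(window, len(years)):
        yield years[i - window:i], years[i]


def score_window(series, years, window):
    """Mean absolute error of a naive forecast (mean of the training window)."""
    errors = []
    for train_years, val_year in rolling_folds(years, window):
        forecast = np.mean([series[y] for y in train_years])
        errors.append(abs(series[val_year] - forecast))
    return float(np.mean(errors))


# Synthetic yearly peaks (NOT project data): a linear trend plus noise
rng = np.random.default_rng(0)
years = list(range(1994, 2014))
series = {y: 20000 + 150 * (y - 1994) + rng.normal(0, 300) for y in years}

# Score each candidate window size, then look for a minimum or a plateau
scores = {w: score_window(series, years, w) for w in range(1, 11)}
for w, mae in scores.items():
    print(f"window={w:2d} years  mean MAE={mae:8.1f}")
```

In this sketch the smaller windows are scored on more validation years than the larger ones. A fairer comparison would restrict every window size to the same validation years.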

#### Another approach could be:
 + Make a judgement as to a reasonable time period
 
#### Making a judgement:
 + If I had to draw a chart of next year's demand by reviewing a chart of the last 100 years of data, I would draw a chart that looked exactly the same as last year + or - any obvious trend.
 + We are making a prediction for a single year ahead, using our cross validation scheme, i.e. the validation set comprises one year. If we only choose a single year of training data, then our model will miss out on trends, and will be working on a 50/50 train/validation split. Therefore, our training period should be greater than 1 year.
 + Two years of training data is not enough because a degree of randomness is introduced by the weather, i.e. if we have a hot summer followed by a cold summer, this could be seen as a trend, but it is really randomness. Therefore, our training period should be greater than 2 years.
 + Twenty years seems too long because diverse underlying structural changes in the demand patterns mean that year 1 is not really the "same" as year 20
 + At this point, I have delayed making this decision long enough, and I am going to (semi-)arbitrarily select a training period of 5 years. With one validation year per fold, this gives a train/validation split of 83/17% (5 of every 6 years), which seems reasonable. My opinion is that this period is long enough to capture trends, and short enough to give a reasonably close representation of the validation data
 + I want to keep 10 cross folds in order to capture the uncertainty in the model
 + Therefore my data split will look like this:
     + Training Data - 1994 to 2009 with a 10 fold rolling time series cross validation
     + Test Data - 2010 to 2018 - 9 years
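
The resulting layout can be written out explicitly. The sketch below assumes that each fold trains on the 5 years immediately before its validation year and that the validation years are the last 10 years of the training data; the project's `RollingAnnualTimeSeriesSplit` may differ in detail, and `fold_layout` is a hypothetical helper.

```python
def fold_layout(last_val_year=2009, n_splits=10, train_years=5):
    """Return a list of (first_train_year, last_train_year, validation_year)."""
    first_val_year = last_val_year - n_splits + 1
    return [(val - train_years, val - 1, val)
            for val in range(first_val_year, last_val_year + 1)]


folds = fold_layout()
for first, last, val in folds:
    print(f"train {first}-{last} -> validate {val}")

# Share of each fold used for training vs validation (5 of 6 years)
train_share = 5 / (5 + 1)
print(f"train/validation split: {train_share:.0%} / {1 - train_share:.0%}")

# Years left untouched as hold-out test data
test_years = list(range(2010, 2019))
print(f"test years: {test_years[0]}-{test_years[-1]} ({len(test_years)} years)")
```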

## Imports & setup

In [1]:
import pathlib
import warnings
from datetime import datetime
import sys
import pickle
import joblib
import gc

import pandas as pd
import numpy as np

# Plotting
import matplotlib as mpl
import matplotlib.pyplot as plt
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
from matplotlib.dates import DateFormatter
import matplotlib.dates as mdates


# Project imports
sys.path.append("..")
from src.utils.utils import (AnnualTimeSeriesSplit,
                             RollingAnnualTimeSeriesSplit,
                             bound_precision,
                             run_cross_val,
                             run_data_split_cross_val,
                             save_run_results)
from src.features.features import CyclicalToCycle
from src.models.models import SK_SARIMAX, SK_Prophet, SetTempAsPower, SK_Prophet_1
from src.visualization.visualize import (plot_prediction,
                                         plot_joint_plot,
                                         residual_plots,
                                         print_residual_stats,
                                         resids_vs_preds_plot)
# Packages
from sklearn.pipeline import Pipeline
from skoot.feature_selection import FeatureFilter
from skoot.preprocessing import SelectiveRobustScaler
from sklearn.metrics import mean_absolute_error
from scipy.stats import norm
from statsmodels.graphics.gofplots import qqplot
from pandas.plotting import autocorrelation_plot
from statsmodels.graphics.tsaplots import plot_acf
import statsmodels.api as sm
from fbprophet import Prophet

# Display
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
figsize=(15,7)
warnings.filterwarnings(action='ignore')
%matplotlib inline

# Data
PROJECT_DIR = pathlib.Path.cwd().parent.resolve()
CLEAN_DATA_DIR = PROJECT_DIR / 'data' / '05-clean'
MODELS_DIR = PROJECT_DIR / 'data' / 'models'
RESULTS_PATH = PROJECT_DIR / 'data' /'results' / 'results.csv'

## Load Daily Data & Inspect

In [2]:
df = pd.read_csv(CLEAN_DATA_DIR / 'clean-features.csv', parse_dates=True, index_col=0)

In [3]:
X = df.copy(deep=True)
X = X.loc['1994': '2009']
y = X.pop('daily_peak')
X.head()

Unnamed: 0,hmdxx_min,hmdxx_max,hmdxx_median-1,hmdxx_max_hour,temp_min,temp_max,dew_point_temp_max,sun_rise,sun_set,visibility_mean,day_of_week,week_of_year,day_type
1994-05-24,8.998045,19.818202,19.655075,15.0,9.0,19.6,13.4,6.0,21.0,24.975,1.0,21.0,2
1994-05-25,11.406291,20.665711,17.205396,18.0,10.4,18.2,14.0,6.0,21.0,9.358333,2.0,21.0,0
1994-05-26,2.563201,15.259916,17.722172,2.0,3.9,13.0,12.3,6.0,21.0,9.65,3.0,21.0,0
1994-05-27,-0.012865,12.970553,6.567827,17.0,2.0,14.8,2.3,6.0,21.0,34.5,4.0,21.0,0
1994-05-30,13.632519,30.133976,18.724332,14.0,13.1,27.2,13.6,6.0,21.0,22.270833,0.0,22.0,0


In [4]:
y.tail()

2009-09-28    17197.0
2009-09-29    16969.0
2009-09-30    17026.0
2009-10-01    17462.0
2009-10-02    17147.0
Name: daily_peak, dtype: float64

## Prophet Model 

Run using only the y data (the daily peak demand), without the weather features
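
Prophet expects a DataFrame with a `ds` (date) column and a `y` (value) column, so fitting on the target alone roughly amounts to the following. This is a sketch of the general Prophet workflow, not the implementation of the project's `SK_Prophet` wrapper: `to_prophet_frame` and `fit_and_forecast` are hypothetical helpers, and treating `pred_periods` as the number of future daily rows is an assumption.

```python
import pandas as pd


def to_prophet_frame(y):
    """Convert a date-indexed Series into Prophet's two-column ds/y layout."""
    return pd.DataFrame({"ds": pd.to_datetime(y.index), "y": y.to_numpy()})


def fit_and_forecast(y, periods=96):
    """Fit Prophet on the target only; predict history plus `periods` future days."""
    try:
        from fbprophet import Prophet  # package name used in this notebook
    except ImportError:
        from prophet import Prophet  # package name used by newer releases

    model = Prophet()
    model.fit(to_prophet_frame(y))
    future = model.make_future_dataframe(periods=periods)  # daily frequency by default
    return model.predict(future)[["ds", "yhat", "yhat_lower", "yhat_upper"]]


# The last five values of y shown above
y_demo = pd.Series(
    [17197.0, 16969.0, 17026.0, 17462.0, 17147.0],
    index=pd.to_datetime(["2009-09-28", "2009-09-29", "2009-09-30",
                          "2009-10-01", "2009-10-02"]),
    name="daily_peak",
)
print(to_prophet_frame(y_demo))
```

Calling `fit_and_forecast(y)` on the full series would require Prophet to be installed; only the frame conversion is exercised here.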

In [5]:
n_splits=10

prophet_model = SK_Prophet(pred_periods=96)
                           
ratscv = RollingAnnualTimeSeriesSplit(n_splits=n_splits, goback_years=5)

steps = [('prophet', prophet_model)]
pipeline = Pipeline(steps)
d = run_cross_val(X, y, ratscv, pipeline, scoring=['mae', 'bound_precision'])
d

INFO:numexpr.utils:NumExpr defaulting to 4 threads.


2000
2001
2002
2003
2004
2005
2006
2007
2008
2009


{'train': {'mae': [861.9526886500258,
   848.7503649387098,
   951.9640884002106,
   1097.9631256223759,
   1046.3089180328611,
   1039.6653570615367,
   1120.9551950298394,
   1150.2820915475443,
   1153.026969775874,
   1158.48089195951],
  'bound_precision': [0.0, 0.0, 0.0, 0.2, 0.2, 0.4, 0.2, 0.0, 0.0, 0.0]},
 'test': {'mae': [1123.5511024344235,
   1314.1803916051847,
   1524.0323306039452,
   2071.5448825390963,
   1494.9661977544395,
   2266.413731752333,
   3216.962074235825,
   2046.537695763631,
   1745.254663302055,
   1608.148548303073],
  'bound_precision': [0.0, 0.0, 0.2, 0.0, 0.2, 0.2, 0.4, 0.0, 0.0, 0.0]}}

In [6]:
# Take a look at the results on the validation data
print(np.mean(d['test']['mae']))
print(np.mean(d['test']['bound_precision']))

1841.1591618294005
0.1
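
To see which validation years drive the average, the per-fold scores can be put in a table. This is a minimal sketch with the MAE values copied from the output above and rounded to one decimal. Labelling the folds 2000 to 2009 assumes that the years printed during the run are the validation years in fold order.

```python
import pandas as pd

# MAE per fold, copied from the run_cross_val output and rounded
train_mae = [862.0, 848.8, 952.0, 1098.0, 1046.3, 1039.7, 1121.0, 1150.3, 1153.0, 1158.5]
test_mae = [1123.6, 1314.2, 1524.0, 2071.5, 1495.0, 2266.4, 3217.0, 2046.5, 1745.3, 1608.1]

summary = pd.DataFrame(
    {"train_mae": train_mae, "validation_mae": test_mae},
    index=pd.Index(range(2000, 2010), name="validation_year"),
)
# Gap between validation and training error for each fold
summary["gap"] = summary["validation_mae"] - summary["train_mae"]

print(summary)
print("mean validation MAE:", round(summary["validation_mae"].mean(), 1))
print("worst validation year:", summary["validation_mae"].idxmax())
```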
