# Demand forecasting based on weather and weekday

In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor as RFR
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import GridSearchCV
from joblib import dump, load
from Preprocessing import *

Datapath = "../Data/"

# Extracting
I use the pickle files that were already preprocessed in "Assignment 4 ETL.ipynb" and the fitted model from "Assignment 4 Weather prediction.ipynb".

In [2]:
df_train = pd.read_pickle(Datapath+"df_train.p")
df_test = pd.read_pickle(Datapath+"df_test.p")

# Preprocessing
The goal is to predict the combined demand for today and tomorrow from the weather predictions for those two days. That means our predictive model requires 4 input features per sample:

 1. Predicted Temp for today
 2. Predicted Rainfall for today
 3. Predicted Temp for tomorrow
 4. Predicted Rainfall for tomorrow
 
And one output value:

 The predicted demand, i.e. the sum of today's and tomorrow's demand.

I prepare an input matrix X and an output vector y that reflect these requirements.

In [3]:
vals_train = df_train[['Temp', 'Rainfall']].values
X_train = np.array([list(vals_train[i]) + list(vals_train[i+1]) for i in range(len(vals_train)-1)])

y_vals_train = df_train['Demand'].values
y_train = np.array([y_vals_train[i] + y_vals_train[i+1] for i in range(len(y_vals_train)-1)])

vals_test = df_test[['Temp', 'Rainfall']].values
X_test = np.array([list(vals_test[i]) + list(vals_test[i+1]) for i in range(len(vals_test)-1)])

y_vals_test = df_test['Demand'].values
y_test = np.array([y_vals_test[i] + y_vals_test[i+1] for i in range(len(y_vals_test)-1)])
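The windowing used in the cell above can be sketched as a small helper. This is only an illustration on made-up numbers: the function name `make_two_day_windows` and the toy values are assumptions, not part of the notebook. It builds the same two-day feature rows and summed targets with array slicing instead of list comprehensions.

```python
import numpy as np

def make_two_day_windows(features, demand):
    """Pair each day with the following day.

    features: array-like of shape (n_days, n_features)
    demand:   array-like of shape (n_days,)
    Returns X of shape (n_days - 1, 2 * n_features) and y of shape (n_days - 1,).
    """
    features = np.asarray(features, dtype=float)
    demand = np.asarray(demand, dtype=float)
    # Row i holds day i's features followed by day i+1's features.
    X = np.hstack([features[:-1], features[1:]])
    # Target i is the demand of day i plus the demand of day i+1.
    y = demand[:-1] + demand[1:]
    return X, y

# Made-up toy data: three days of (Temp, Rainfall) and demand.
toy_weather = [[20.0, 0.0], [22.0, 1.5], [18.0, 3.0]]
toy_demand = [100, 120, 90]
X_toy, y_toy = make_two_day_windows(toy_weather, toy_demand)
print(X_toy)
print(y_toy)
```

Three days give two samples, because the last day has no "tomorrow" to pair with.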

# Predicting without weekday
## Comparing Models
First we look at a couple of models without parameter tuning, because parameter tuning can be quite expensive in time. Initially, only weather data is taken into account.

We do some quick testing on Random forests...

In [4]:
m = RFR(n_estimators=80, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
m.score(X_test, y_test)

0.7634238786108772

Linear regression...

In [5]:
reg = LinearRegression().fit(X_train, y_train)
reg.score(X_test, y_test)

0.4768462319918412

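Logistic regression... (note that this is a classifier, so its score is the accuracy of predicting the exact demand value, not an R2, and it is not directly comparable to the two scores above)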

In [6]:
log = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(X_train, y_train)
log.score(X_test, y_test)

0.25824175824175827
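The scores above are not all on the same scale: in scikit-learn, `score` returns R2 for regressors and mean accuracy for classifiers. The following sketch computes both metrics by hand on made-up numbers (the helper names `r2` and `accuracy` are illustrative, not from the notebook) to show how differently they treat the same predictions.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the target exactly."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Made-up demand values: three predictions are off by one unit, one is exact.
y_true = np.array([200, 210, 190, 205])
y_pred = np.array([201, 209, 191, 205])

r2_value = r2(y_true, y_pred)
acc_value = accuracy(y_true, y_pred)
print(r2_value, acc_value)
```

Predictions that are off by one unit still give an R2 close to 1, while accuracy only counts the single exact match. A low logistic regression score therefore does not by itself say the predictions are far off.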

## Zooming in on Random Forest Regression

A random forest seems to perform best, which is quite common, so I'll do a grid search to tune the hyperparameters. We check a variety of parameters and use 5-fold cross-validation on the combined data set from 2014 to 2017.

In [7]:
X_cv = np.concatenate((X_train, X_test))
y_cv = np.concatenate((y_train, y_test))

Warning: the next cell (grid search) takes almost 40 minutes to execute.
It searches a grid of options for the best parameters; the fitted search object has an attribute called "best_estimator_", which is the model refit with the optimal parameters.

In [8]:
parameters = {'n_estimators' : list(range(5, 101, 5)), 'min_samples_split' : list(range(2, 10)), 
              'min_samples_leaf' : list(range(2, 10))}
m = RFR(n_jobs=-1)
clf = GridSearchCV(m, parameters, cv=5)
clf.fit(X_cv, y_cv)
m = clf.best_estimator_
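The cost of the search above follows from the size of the grid. As a rough sketch, the number of model fits can be counted without running anything. The count assumes `GridSearchCV`'s default `refit=True`, which adds one final fit on the full data; the variable names other than `parameters` are illustrative.

```python
from itertools import product

# Same grid as in the grid search cell.
parameters = {
    'n_estimators': list(range(5, 101, 5)),
    'min_samples_split': list(range(2, 10)),
    'min_samples_leaf': list(range(2, 10)),
}
cv_folds = 5

# Every combination of the three parameter lists is tried once per fold.
n_combinations = len(list(product(*parameters.values())))
# +1 for the final refit of the best combination (assumes refit=True).
n_fits = n_combinations * cv_folds + 1
print(n_combinations, n_fits)
```

With 20 x 8 x 8 = 1280 combinations and 5 folds, the search trains 6401 forests, which is consistent with the long running time.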

Below is a short description of the model with the optimal parameters.

In [9]:
m

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=4, min_samples_split=7,
           min_weight_fraction_leaf=0.0, n_estimators=95, n_jobs=-1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [10]:
clf.best_score_

0.7298151434768766

The model is dumped so it can be used later without repeating the expensive grid search.

In [11]:
dump(m, Datapath+"optimal_m_no_weekday.joblib")

['../Data/optimal_m_no_weekday.joblib']

So the model performs with a cross-validated R2 of about 0.73, which is not great but definitely not bad either.

Let's save it for future use.

In [12]:
dump(m, Datapath+"fitted_m_no_weekday.joblib")

['../Data/fitted_m_no_weekday.joblib']
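One way to make use of the saved file is a "load if present, otherwise fit and save" pattern, sketched below. The helper `load_or_fit`, the placeholder `expensive_fit` and the temporary path are all hypothetical and only stand in for the grid search and the real model file; `dump` and `load` are the joblib functions already imported in this notebook.

```python
import os
import tempfile
from joblib import dump, load

def load_or_fit(path, fit_fn):
    """Return the object stored at `path`, or build it with `fit_fn` and store it."""
    if os.path.exists(path):
        return load(path)
    model = fit_fn()
    dump(model, path)
    return model

calls = []

def expensive_fit():
    # Placeholder for the grid search; returns a plain dict instead of a fitted model.
    calls.append(1)
    return {"n_estimators": 95, "min_samples_leaf": 4}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    first = load_or_fit(path, expensive_fit)   # runs expensive_fit and writes the file
    second = load_or_fit(path, expensive_fit)  # reads the file, no refit

print(len(calls), first == second)
```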

# Predicting with weekday
## Comparing Models
First we look at a couple of models without parameter tuning (because parameter tuning can be quite expensive in time). This time, we also take weekday data into account.

In [13]:
X_train = np.c_[X_train, df_train['Weekday'].values[:-1]]
X_test = np.c_[X_test, df_test['Weekday'].values[:-1]]

Linear regression...

In [14]:
reg = LinearRegression().fit(X_train, y_train)
reg.score(X_test, y_test)

0.48896434598538235

Logistic regression... (again a classifier, so the score is accuracy rather than R2)

In [15]:
log = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(X_train, y_train)
log.score(X_test, y_test)

0.22802197802197802

Random forest regression copes well with a categorical feature such as the weekday, even when it is coded as a single numeric column, because each tree can split the weekday range at several thresholds.

In [16]:
m = RFR(n_estimators=80, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
m.score(X_test, y_test)

0.8114914059487586
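The linear models above receive the weekday as one numeric column and so treat it as an ordered quantity. A common alternative, which this notebook does not use, is one-hot encoding. The sketch below assumes the weekday is coded as integers 0 to 6, which the notebook does not state; the helper name and the toy values are made up.

```python
import numpy as np

def one_hot_weekday(weekday, n_days=7):
    """Turn integer weekday codes (assumed 0..6) into one indicator column per day."""
    weekday = np.asarray(weekday, dtype=int)
    # Row k of the identity matrix is the indicator vector for weekday k.
    return np.eye(n_days)[weekday]

# Made-up toy rows: (Temp, Rainfall) for today and tomorrow, plus a weekday code.
toy_X = np.array([[20.0, 0.0, 22.0, 1.5],
                  [22.0, 1.5, 18.0, 3.0],
                  [18.0, 3.0, 19.0, 0.0]])
toy_weekday = [0, 3, 6]

toy_X_encoded = np.c_[toy_X, one_hot_weekday(toy_weekday)]
print(toy_X_encoded.shape)
```

Each row gains seven indicator columns, exactly one of which is 1. Whether this would improve the linear regression score here is not something the notebook tests.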

## Zooming in on Random Forest Regression

A random forest again seems to perform best, so I'll do a grid search to tune the hyperparameters. We check a variety of parameters and use 5-fold cross-validation on the combined data set from 2014 to 2017.

In [17]:
X_cv = np.concatenate((X_train, X_test))
y_cv = np.concatenate((y_train, y_test))

Warning: the next cell (grid search) takes almost 40 minutes to execute.
It searches a grid of options for the best parameters; the fitted search object has an attribute called "best_estimator_", which is the model refit with the optimal parameters.

In [18]:
parameters = {'n_estimators' : list(range(5, 101, 5)), 'min_samples_split' : list(range(2, 10)), 
              'min_samples_leaf' : list(range(2, 10))}
m = RFR(n_jobs=-1)
clf = GridSearchCV(m, parameters, cv=5)
clf.fit(X_cv, y_cv)
m = clf.best_estimator_

Below is a short description of the model with the optimal parameters.

In [19]:
m

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=3, min_samples_split=5,
           min_weight_fraction_leaf=0.0, n_estimators=90, n_jobs=-1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [20]:
clf.best_score_

0.7821114012859117

So the model performs with a cross-validated R2 of about 0.78, which is pretty good.

Let's save it for future use.

In [21]:
dump(m, Datapath+"fitted_m_weekday.joblib")

['../Data/fitted_m_weekday.joblib']

In [22]:
m = load(Datapath+"fitted_m_weekday.joblib")

In [23]:
m.score(X_test, y_test)

0.92558041297097
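The 0.93 above is noticeably higher than the cross-validated score of about 0.78. A likely reason is that `best_estimator_` is refit on everything passed to `clf.fit` (the default `refit=True`), and `X_cv` contains `X_test`, so the saved model has already seen the test rows. The sketch below shows the effect on synthetic data; the data, the split and the variable names are made up and say nothing about the size of the effect on the real demand data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: one informative feature plus noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X[:, 0] + rng.normal(scale=1.0, size=300)

X_tr, X_te = X[:200], X[200:]
y_tr, y_te = y[:200], y[200:]

# Model fitted on ALL rows, then scored on rows it has already seen.
seen = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
seen_score = seen.score(X_te, y_te)

# Model fitted on the training rows only, scored on truly unseen rows.
held = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)
holdout_score = held.score(X_te, y_te)

print(seen_score, holdout_score)
```

For an honest test score, one option is to refit a model with the tuned parameters on `X_train` only before scoring it on `X_test`, and to treat `clf.best_score_` as the more realistic estimate in the meantime.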