# Frame the problem

**Challenge**: Predict the electricity consumption in Paris at 15-minute intervals for the next day (D+1).

**Goal**: Implement a model that estimates electricity consumption from historical data. This is a typical supervised learning task, since we work with labelled training examples (each instance comes with the expected output, i.e. the electricity consumption in Paris for a given date). It is also a typical regression task, since we predict a numeric value. More specifically, it is a multiple regression problem, since the system uses several features to make a prediction.

# Select a performance measure

Model accuracy will be measured with the **mean absolute percentage error (MAPE)**, which expresses the average prediction error as a percentage of the actual value.

$$ M = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{C_i - C_i^*}{C_i} \right| $$

where $C_i$ is the actual consumption, $C_i^*$ the estimated consumption, and $n$ the number of predictions (96 for one day).
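
As a quick sanity check of the formula, the sketch below computes the MAPE by hand for three made-up consumption values. The numbers are purely illustrative and are not taken from the dataset.

```python
import numpy as np

# Illustrative values only (not from the Paris dataset)
actual = np.array([100.0, 200.0, 400.0])     # C_i
predicted = np.array([110.0, 180.0, 400.0])  # C_i^*

# Relative errors are 10%, 10% and 0%, so the mean is about 6.67%
mape = np.mean(np.abs((actual - predicted) / actual)) * 100
print(round(mape, 2))
```

Note that the measure is undefined whenever an actual value $C_i$ is zero, since it appears in the denominator.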


# Get the data

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline

In [None]:
from modules.utils import get_data_with_features, is_day_off

df = get_data_with_features()

In [None]:
df.head(10)

# Take a quick look

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
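# distplot is deprecated in recent seaborn versions; histplot with kde=True is the replacement
sns.histplot(df['Conso'], kde=True)

In [None]:
sns.histplot(df['Temp'], kde=True)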

# Create a test set

A test set must be set aside as early as possible, so that choices made while exploring the data and tuning models cannot leak into the final evaluation.  
Creating a test set is theoretically quite simple: pick some instances at random,
typically 20% of the dataset, and set them aside.

In [None]:
import numpy as np
import numpy.random as rnd

def split_train_test(df, test_ratio=0.2):
    """Randomly split df into a (train, test) pair of DataFrames."""
    rnd.seed(42)  # fixed seed so the split is reproducible
    shuffled = rnd.permutation(len(df))
    test_set_size = int(len(df) * test_ratio)
    test_indices = shuffled[:test_set_size]
    train_indices = shuffled[test_set_size:]
    return df.iloc[train_indices], df.iloc[test_indices]

In [None]:
train_set, test_set = split_train_test(df, 0.2)
print(len(train_set), "train +", len(test_set), "test")

Scikit-learn provides a function that does the same thing:

In [None]:
from sklearn.model_selection import train_test_split

# test_size defaults to 0.25, so set it explicitly to keep the 20% split
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

In [None]:
train_set.shape

In [None]:
test_set.shape

Note: random sampling is fine when the dataset is large relative to the number of attributes. When it is not, you risk introducing a sampling bias and should prefer stratified sampling.
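
A minimal sketch of stratified sampling, run on synthetic data rather than the real dataset. The `temp_bin` column and the bin edges are assumptions made up for illustration; the idea is to bin a continuous feature into strata and pass them to the `stratify` argument of `train_test_split`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1,000 temperature readings
rng = np.random.default_rng(42)
toy = pd.DataFrame({"Temp": rng.normal(loc=12, scale=7, size=1000)})

# Hypothetical strata: cold, mild, warm, hot
toy["temp_bin"] = pd.cut(toy["Temp"], bins=[-np.inf, 5, 15, 25, np.inf], labels=False)

strat_train, strat_test = train_test_split(
    toy, test_size=0.2, stratify=toy["temp_bin"], random_state=42
)

# Each stratum has (almost) the same share in the test set as in the full data
overall = toy["temp_bin"].value_counts(normalize=True).sort_index()
in_test = strat_test["temp_bin"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"overall": overall, "test": in_test}))
```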

# Prepare the data

In [None]:
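# Separate the features from the target; "Date" is not used as a feature
X_train = train_set.drop(["Conso", "Date"], axis=1).copy()
y_train = train_set["Conso"].copy()

# The test features and labels must come from test_set, not train_set
X_test = test_set.drop(["Conso", "Date"], axis=1).copy()
y_test = test_set["Conso"].copy()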

In [None]:
X_train.head(1)

## Feature scaling

Many machine learning algorithms rely on distances (for example the Euclidean distance).  
When features have very different scales, the distances are dominated by the features with the largest values.  
There are two common types of feature scaling: **standardisation** and **normalisation** (min-max scaling).

Feature scaling can also help some algorithms converge faster.
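
To make the difference between the two concrete, here is a small sketch on a single toy feature containing one outlier. The values are made up; `MinMaxScaler` and `StandardScaler` are the scikit-learn classes for normalisation and standardisation respectively.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One toy feature with an outlier (100)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

x_norm = MinMaxScaler().fit_transform(x)   # (x - min) / (max - min), bounded to [0, 1]
x_std = StandardScaler().fit_transform(x)  # (x - mean) / std, unbounded

print(x_norm.ravel())
print(x_std.ravel())
```

In the normalised output, the four ordinary values all end up below 0.04 because the outlier defines the top of the range.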


### Standardisation
Standardisation does not bound values to a fixed range, but it is less affected by outliers than normalisation, where a single extreme value squeezes all the other values into a narrow interval.

$$X_{stand} = \frac{x - \mathrm{mean}(x)}{\mathrm{std}(x)}$$


In [None]:
from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
X_train_scaled = std_scaler.fit_transform(X_train[['Temp', 'temp_24_lag']])

In [None]:
X_train_scaled

In [None]:
X_train.head()

## Handling text attributes
Most machine learning algorithms prefer to work with numerical values, so text labels need to be converted to numbers.


### OneHotEncoding
If you have ordinal categories (categories with a natural order, such as t-shirt sizes), mapping them to 1, 2, 3, ... works well, as long as the numbers follow the same order.  
If you have nominal attributes, however, this artificial order will bias your model. One-hot encoding avoids the problem by creating one binary column per category.
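
A short sketch of one-hot encoding with scikit-learn's `OneHotEncoder`, using a made-up day-of-week column. The integer coding (0 = Monday) is an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy nominal attribute: day of week encoded as integers (0 = Monday)
days = np.array([[0], [2], [6], [2]])

encoder = OneHotEncoder()
# The encoder returns a sparse matrix by default; densify it for display
one_hot = encoder.fit_transform(days).toarray()

print(encoder.categories_)
print(one_hot)
```

Each row contains a single 1 in the column of its category, so no order is implied between the days.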


# Building pipelines 

Scikit-learn pipelines let us chain many transformations and apply them in a single call.  
The output of each transformer is passed as the input to the next one.

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns and return them as a NumPy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X):
        return X[self.attribute_names].values

In [None]:
X_train.columns

In [None]:
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_attribs = ['Temp','temp_24_lag','conso_24_lag','conso_7_days_lag','heating_degrees','cooling_degrees']
cat_attribs = ['day_of_week','month']

# Numerical features: select the columns, then standardise them
num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('std_scaler', StandardScaler()),
])
# Categorical features: select the columns, then one-hot encode them
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('one_hot_encoder', OneHotEncoder()),
])
# Concatenate the outputs of both pipelines column-wise
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])
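
As an alternative to a custom column selector combined with `FeatureUnion`, scikit-learn also offers `ColumnTransformer`, which applies different transformers to different DataFrame columns directly. The sketch below is not the approach used in this notebook; it reuses two of the column names on a toy frame with made-up values.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame reusing two column names from this notebook; values are made up
toy = pd.DataFrame({
    "Temp": [3.0, 12.5, 21.0, 8.0],
    "day_of_week": [0, 3, 6, 3],
})

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["Temp"]),
    ("cat", OneHotEncoder(), ["day_of_week"]),
])

toy_prepared = preprocessor.fit_transform(toy)
print(toy_prepared.shape)  # 1 scaled column + 3 one-hot columns
```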

In [None]:
X_train_prepared = full_pipeline.fit_transform(X_train)

# Training a LinearRegression model
Let's first train a simple linear regression model.

In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_train_prepared, y_train)

In [None]:
some_data = X_train.iloc[:1]
some_data

In [None]:
# let's try the full pipeline on a few training instances
some_data = X_train.iloc[:5]
some_labels = y_train.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print("Predictions:", lin_reg.predict(some_data_prepared))
print("Labels:", list(some_labels))

In [None]:
def mean_absolute_percentage_error(y_true, y_pred): 
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

In [None]:
consumption_predictions = lin_reg.predict(X_train_prepared)
mean_absolute_percentage_error(y_train,consumption_predictions)

In [None]:
lin_reg.coef_

# Training a decision tree model

In [None]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(X_train_prepared, y_train)

In [None]:
consumption_predictions = tree_reg.predict(X_train_prepared)
mean_absolute_percentage_error(y_train,consumption_predictions)


That's suspicious: the model is probably overfitting the training data.

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data.
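
The effect can be reproduced on synthetic data. In the sketch below (a noisy sine wave, not the consumption data), an unconstrained decision tree memorises its training points almost perfectly but does noticeably worse on points it has not seen. MAE is used instead of MAPE because this toy target crosses zero.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy sine wave
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(500, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.3, size=500)

X_fit, X_val, y_fit, y_val = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# No depth limit: the tree is free to grow until it fits the training points
tree = DecisionTreeRegressor(random_state=0).fit(X_fit, y_fit)

train_mae = mean_absolute_error(y_fit, tree.predict(X_fit))
val_mae = mean_absolute_error(y_val, tree.predict(X_val))
print("train MAE:", train_mae)
print("validation MAE:", val_mae)
```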

## Better evaluation using cross validation

The following code performs K-fold cross-validation:  
it splits the training set into 5 distinct subsets called folds, then trains and evaluates the decision tree model 5 times,
picking a different fold for evaluation each time and training on the other 4 folds.
The result is an array containing the 5 evaluation scores. Because scikit-learn scorers follow a "greater is better" convention, the scores are negative mean absolute errors:


In [None]:
from sklearn.model_selection import cross_val_score

cross_val_score(tree_reg, X_train_prepared, y_train,scoring="neg_mean_absolute_error", cv=5)
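
The cross-validation above scores with the mean absolute error, while the performance measure chosen earlier is the MAPE. One possible way to cross-validate with the MAPE directly is to wrap the function in `make_scorer`, as sketched below on synthetic data (the toy features and target are assumptions for illustration).

```python
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# greater_is_better=False makes scikit-learn negate the metric internally
mape_scorer = make_scorer(mean_absolute_percentage_error, greater_is_better=False)

# Synthetic, strictly positive target (MAPE is undefined when the true value is 0)
rng = np.random.default_rng(1)
X_toy = rng.uniform(0, 30, size=(300, 1))
y_toy = 5000 + 100 * X_toy[:, 0] + rng.normal(scale=200, size=300)

scores = cross_val_score(DecisionTreeRegressor(random_state=0), X_toy, y_toy,
                         scoring=mape_scorer, cv=5)
mape_per_fold = -scores  # flip the sign back to get the MAPE in percent
print(mape_per_fold)
```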

# Choosing an algorithm 

In [None]:
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

regressions = {}

X_train_prepared = full_pipeline.fit_transform(X_train)

# Compare candidate models with their default hyperparameters, using 5-fold
# cross-validation on the first 10,000 training rows to limit the run time
for regressor in [ElasticNet, DecisionTreeRegressor, GradientBoostingRegressor,
                  RandomForestRegressor, Ridge, Lasso, LinearRegression]:
    reg = regressor()
    regressions[reg.__class__.__name__] = reg
    print(reg.__class__.__name__)
    scores = cross_val_score(reg, X_train_prepared[:10000], y_train[:10000],
                             scoring="neg_mean_absolute_error", cv=5)
    print(scores.mean())


In [None]:
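from sklearn.model_selection import GridSearchCV

# Candidate hyperparameter values: 3 x 3 = 9 combinations
parameters = {'n_estimators': (10, 100, 1000), 'max_depth': (1, 10, 100)}

random_forest = RandomForestRegressor()

# With no scoring argument, the estimator's default score (R² for regressors) is used.
# The search runs on the first 2,000 rows only, to limit the run time.
grid_search = GridSearchCV(random_forest, parameters)
grid_search.fit(X_train_prepared[:2000], y_train[:2000])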

In [None]:
grid_search.best_estimator_
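
Besides `best_estimator_`, a fitted `GridSearchCV` exposes `best_params_` and `best_score_`. The sketch below uses synthetic data and a deliberately small grid (both are assumptions chosen so that it runs quickly), with an explicit `scoring` argument so that the best score is a mean absolute error rather than R².

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data with two features
rng = np.random.default_rng(2)
X_toy = rng.uniform(0, 30, size=(200, 2))
y_toy = 5000 + 100 * X_toy[:, 0] - 50 * X_toy[:, 1] + rng.normal(scale=100, size=200)

param_grid = {"n_estimators": [10, 50], "max_depth": [2, 10]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",
    cv=3,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(-search.best_score_)  # mean cross-validated MAE of the best combination
```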

# Making a prediction for tomorrow

In [None]:
X = df.drop(["Conso","Date","temp_rolling_7_days"], axis=1)
y = df["Conso"].copy()

In [None]:
X_prepared = full_pipeline.fit_transform(X)
rf = RandomForestRegressor(max_depth=100)
rf.fit(X_prepared,y)

In [None]:
X.columns

In [None]:
hour9 = {
    "Temp":23,
    "is_day_off":0,
    "conso_24_lag":7809,
    "temp_24_lag":24.6,
    "conso_7_days_lag":7930,
    "heating_degrees":0,
    "cooling_degrees":0,
    "is_weekend":0,
    "day_of_week":4,
    "month":6
}
# Build a one-row DataFrame from the dictionary defined above
X = pd.DataFrame([hour9], index=[0])

X_t = full_pipeline.transform(X)
rf.predict(X_t)
