# Thomas Households consumption

## Goal
The goal is to predict the consumption of each household at the next timestamp (half an hour) using the history of all consumptions. 

## Data
The data are as follows:
- Z is a pandas.DataFrame where each row measures the consumption of the 172 households at a given time (one row every half hour). The first column is the timestamp (from the 1st of November 2013 to the 30th of November 2014).

## The prediction task
We split the data into N_CV=3 cross-validation folds, each containing BATCH_SIZE=600 instances.
For each instance, you are given all records observed so far. Your output is one row: the prediction of the next record. (Contrary to El Nino, there is no look_ahead.)

The metric is the RMSE computed across time and households.
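
As a rough illustration of the metric, the sketch below pools the squared errors over every timestamp and every household before taking the root. The function name `rmse` and the toy arrays are assumptions made for this example; the exact scoring code of the challenge is not shown in this notebook.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error pooled over all timestamps and households."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy example: 2 timestamps (rows) x 3 households (columns).
y_true = np.array([[1.0, 2.0, 3.0], [0.0, 1.0, 2.0]])
y_pred = np.array([[1.0, 2.0, 1.0], [0.0, 1.0, 4.0]])
score = rmse(y_true, y_pred)
print(score)
```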

# Exploratory data analysis

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
%matplotlib inline

In [2]:
# Reading the data

Z = pd.read_excel("linky.xlsx")
print(Z.shape)

(17847, 173)


The first column of the DataFrame, `Id_client`, holds the timestamp.

In [3]:
Z.head()

Unnamed: 0,Id_client,1172,1272,925,2185,1280,396,404,433,638,...,2167,2232,2233,2238,2248,2257,2274,2308,2413,2482
0,2013-11-01 00:00:00,0.0,1.081256,1.230105,0.0,2.635901,2.267873,0.0,3.233118,0.0,...,0.0,0.205134,1.367643,1.432094,2.405136,3.234911,0.0,0.0,0.320199,0.0
1,2013-11-01 00:30:00,0.0,0.964631,2.186107,0.0,2.642752,1.999912,0.0,2.910187,0.0,...,0.0,0.307133,0.784949,0.785386,2.039209,0.728714,0.0,0.0,0.300206,0.0
2,2013-11-01 01:00:00,0.0,0.750522,1.20472,0.0,2.613062,2.021422,0.0,0.945323,0.0,...,0.0,0.283008,0.490199,0.602877,3.305715,0.475029,0.0,0.0,0.318054,0.0
3,2013-11-01 01:30:00,0.0,0.781637,1.271173,0.0,0.489971,2.05682,0.0,0.27248,0.0,...,0.0,1.452652,0.35407,0.33389,1.832266,2.021647,0.0,0.0,0.281514,0.0
4,2013-11-01 02:00:00,0.0,0.853887,1.935372,0.0,0.178153,2.76237,0.0,0.197846,0.0,...,0.0,0.54111,0.26227,0.266219,1.609474,0.693087,0.0,0.0,0.218319,0.0


In [4]:
print('First timestamp:')
print(Z['Id_client'].iloc[0])
print('Last timestamp:')
print(Z['Id_client'].iloc[-1])

First timestamp:
2013-11-01 00:00:00
Last timestamp:
2014-11-30 23:30:00


In [5]:
# A few columns contain NaN values; list their positions
np.where(Z.isna().sum())

(array([ 16,  40,  74,  88, 110, 146, 161]),)
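
The positions above are easier to interpret as column names with their NaN counts. A minimal sketch on a toy stand-in for Z follows; the frame `Z_toy` and its values are made up for illustration, and the same two lines can be applied to the real `Z`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for Z: a timestamp column plus three households.
Z_toy = pd.DataFrame({
    "Id_client": pd.date_range("2013-11-01", periods=4, freq="30min"),
    "1172": [0.0, 0.1, np.nan, 0.3],
    "1272": [1.0, 1.1, 1.2, 1.3],
    "925": [np.nan, np.nan, 2.2, 2.3],
})

nan_counts = Z_toy.isna().sum()          # NaN count per column
nan_counts = nan_counts[nan_counts > 0]  # keep only the affected columns
print(nan_counts)
```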

## The cross-validation object

For each CV fold, we choose BATCH_SIZE random times t.
For each time t, Z is the raw data from time 0 to time t.
We measure the RMSE of the prediction for the next timestamp.
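
The sampling described above can be sketched as follows. This is not the actual cross-validation object of the challenge: the helper `sample_instances`, its `min_history` parameter, drawing without replacement, and the choice to include row t in the history are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def sample_instances(Z, batch_size, min_history, rng):
    """Yield (history, target) pairs for batch_size randomly drawn times t.

    Assumption: history is Z from row 0 up to and including row t,
    and target is row t + 1 (the row to predict).
    """
    # t needs min_history rows before it and one row after it.
    candidates = np.arange(min_history, len(Z) - 1)
    times = rng.choice(candidates, size=batch_size, replace=False)
    for t in np.sort(times):
        yield Z.iloc[: t + 1], Z.iloc[t + 1]


# Toy stand-in for Z: 200 half-hourly rows, 2 households.
n = 200
Z_toy = pd.DataFrame({
    "Id_client": pd.date_range("2013-11-01", periods=n, freq="30min"),
    "1172": np.linspace(0.0, 1.0, n),
    "1272": np.linspace(1.0, 2.0, n),
})
rng = np.random.default_rng(0)
pairs = list(sample_instances(Z_toy, batch_size=5, min_history=48, rng=rng))
```

Each pair holds a history of varying length and the single row that a submission would have to predict from it.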

## The pipeline

As in El Nino, the predictor is a composition $f(Z_t) = h(g(Z_t))$, where $g$ is a feature extractor and $h$ is a regressor.
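
To make the composition concrete, here is a small sketch in which $g$ and $h$ are toy classes. `MeanExtractor`, `IdentityRegressor` and `run_pipeline` are hypothetical names introduced for this example only; how the real workflow wires the two pieces together is assumed, not taken from its source.

```python
import numpy as np
import pandas as pd


class MeanExtractor:
    """Hypothetical g: mean consumption per household over the history."""

    def transform(self, Z_t):
        return Z_t.drop(columns="Id_client").mean(axis=0).to_numpy()


class IdentityRegressor:
    """Hypothetical h: returns the features unchanged."""

    def fit(self, X, y=None):
        return self

    def predict(self, X):
        return X


def run_pipeline(extractor, regressor, histories):
    # One fixed-size feature vector per history, stacked into a 2-D array.
    X = np.vstack([extractor.transform(Z_t) for Z_t in histories])
    return regressor.predict(X)


Z_toy = pd.DataFrame({
    "Id_client": pd.date_range("2013-11-01", periods=6, freq="30min"),
    "1172": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "1272": [10.0] * 6,
})
histories = [Z_toy.iloc[:3], Z_toy.iloc[:5]]
y_pred = run_pipeline(MeanExtractor(), IdentityRegressor(), histories)
print(y_pred)
```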

### The feature extractor

The feature extractor implements a single `transform` function.
It receives a DataFrame covering time 0 up to an arbitrary time t, and should return a feature vector of fixed size (which will be used by the predictor).

In [6]:
%%file submissions/starting_kit/ts_feature_extractor.py
import numpy as np


class FeatureExtractor(object):

    def __init__(self):
        pass

    def transform(self, Z):
        """Compute the average, over the last 10 days, of the measurements
        taken at the same time of day.

        Z is the raw pd.DataFrame.
        Return x_vector of size 172 (which will be our final prediction here).
        """
        # Replace missing values by 0 and drop the timestamp column
        Z = Z.fillna(0).drop(columns='Id_client')

        # Number of full days in the history (48 half-hours per day)
        nb_days = Z.shape[0] // 48
        # Row offsets from the end of Z, one day (48 rows) apart, at most 10
        previous_measure = [-47 - i * 48 for i in range(max(1, min(nb_days, 10)))]

        X_array = Z.iloc[previous_measure].values
        x_vector = np.mean(X_array, axis=0)

        return x_vector

Overwriting submissions/starting_kit/ts_feature_extractor.py


### The regressor

The regressor should implement a scikit-learn-like regressor with `fit` and `predict` functions.
The starting kit uses the identity function.

In [7]:
%%file submissions/starting_kit/regressor.py
from sklearn.base import BaseEstimator
from sklearn import linear_model

class Regressor(BaseEstimator):
    def __init__(self):
        pass  # self.reg = linear_model.BayesianRidge()

    def fit(self, X, y=None):
        pass  # self.reg.fit(X, y)

    def predict(self, X):
        # Identity: the extracted features are already the prediction
        y_pred = X
        return y_pred  # self.reg.predict(X)



Overwriting submissions/starting_kit/regressor.py


# Local testing

In [8]:
!ramp_test_submission

Testing Thomas Households Consumptions
Reading train and test files from ./data ...
Reading cv ...
Training ./submissions/starting_kit ...
CV fold 0
	score   rmse
	train  0.885
	valid  0.876
	test   1.522
CV fold 1
	score   rmse
	train  0.866
	valid  0.881
	test   1.522
CV fold 2
	score   rmse
	train  0.884
	valid  0.872
	test   1.522
----------------------------
Mean CV scores
----------------------------
	score

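I do not understand why the test error is almost twice the train or validation error: since nothing is fitted, there should be no overfitting. Note that the test score is identical in every fold (1.522), which is expected when nothing is fitted and the test set does not change.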

In [9]:
# Another model: use only the last observed consumption to predict the next one.
!ramp_test_submission --submission last_point_prediction

Testing Thomas Households Consumptions
Reading train and test files from ./data ...
Reading cv ...
Training ./submissions/last_point_prediction ...
CV fold 0
	score   rmse
	train  0.654
	valid  0.651
	test   1.471
CV fold 1
	score   rmse
	train  0.652
	valid  0.656
	test   1.471
CV fold 2
	score   rmse
	train  0.664
	valid  0.652
	test   1.471
----------------------------
Mean CV scores
----------------------------

Note that predicting the next timestamp from the previous one is more accurate than using a running average (validation RMSE of about 0.65 versus about 0.88).
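
The code of the `last_point_prediction` submission is not shown in this notebook. A minimal sketch of what such a feature extractor could look like, assuming it follows the same interface and the same preprocessing as the starting kit, is:

```python
import numpy as np
import pandas as pd


class FeatureExtractor(object):
    """Sketch of a last-point extractor (the actual submission is not shown)."""

    def transform(self, Z):
        # Same preprocessing as the starting kit: NaN -> 0, drop the timestamp.
        Z = Z.fillna(0).drop(columns="Id_client")
        # The most recent row is used directly as the prediction.
        return Z.iloc[-1].values


# Toy stand-in for Z (made-up values).
Z_toy = pd.DataFrame({
    "Id_client": pd.date_range("2013-11-01", periods=3, freq="30min"),
    "1172": [0.5, 0.7, np.nan],
    "1272": [1.1, 1.2, 1.3],
})
x_vector = FeatureExtractor().transform(Z_toy)
print(x_vector)
```

Combined with the identity regressor, the returned vector is the prediction itself.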

# Conclusion

In this starting kit, each column (household) is used only to predict itself. We could use the entire dataset to improve the prediction, e.g. by exploiting delayed correlations among columns or by extracting more sophisticated features from the timestamp.
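
One possible direction, sketched here under several assumptions, is to regress every household at time t + 1 on all households at time t plus a cyclic encoding of the time of day. The helper `make_supervised`, the choice of `Ridge`, and the random toy data are illustrative only; this is a standalone sketch, has not been run on the challenge data, and would still need to be split into a feature extractor and a regressor to fit the pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge


def make_supervised(Z):
    """Build (X, y): all households at t plus time of day -> all households at t + 1."""
    ts = pd.to_datetime(Z["Id_client"])
    values = Z.drop(columns="Id_client").fillna(0).to_numpy(dtype=float)
    # Encode the half-hour slot on the unit circle so 23:30 and 00:00 are close.
    slot = (ts.dt.hour * 2 + ts.dt.minute // 30).to_numpy()
    angle = 2 * np.pi * slot / 48
    time_feats = np.column_stack([np.sin(angle), np.cos(angle)])
    # The time features describe the target timestamp (row t + 1).
    X = np.hstack([values[:-1], time_feats[1:]])
    y = values[1:]
    return X, y


# Toy stand-in for Z: 200 half-hourly rows, 3 households with random values.
rng = np.random.default_rng(0)
n = 200
Z_toy = pd.DataFrame(rng.random((n, 3)), columns=["1172", "1272", "925"])
Z_toy.insert(0, "Id_client", pd.date_range("2013-11-01", periods=n, freq="30min"))

X, y = make_supervised(Z_toy)
model = Ridge(alpha=1.0).fit(X, y)  # one linear model, all households as outputs
y_hat = model.predict(X)
```

Because every household appears among the inputs, the fitted coefficients can pick up one-step delayed correlations between columns; longer delays would require adding further lagged rows to `X`.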