# Supervised ML for time series

In this first notebook we build a simple model of the form $Y_t = f(Y_{t-1}, Y_{t-2}, Y_{t-3})$.

Load the dataset 'univariate_time_series.csv', using the index_col and parse_dates arguments.

In [5]:

import pandas as pd

# Parse the timestamp column as datetimes and use it as the index
data = pd.read_csv("data/univariate_time_series.csv", parse_dates=["timestamp"], index_col="timestamp")
data.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 14398 entries, 2018-09-25 14:01:00 to 2018-10-05 13:58:00
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   count   14398 non-null  float64
dtypes: float64(1)
memory usage: 225.0 KB


Display the time series

In [7]:
data.index

DatetimeIndex(['2018-09-25 14:01:00', '2018-09-25 14:02:00',
               '2018-09-25 14:03:00', '2018-09-25 14:04:00',
               '2018-09-25 14:05:00', '2018-09-25 14:06:00',
               '2018-09-25 14:07:00', '2018-09-25 14:08:00',
               '2018-09-25 14:09:00', '2018-09-25 14:10:00',
               ...
               '2018-10-05 13:49:00', '2018-10-05 13:50:00',
               '2018-10-05 13:51:00', '2018-10-05 13:52:00',
               '2018-10-05 13:53:00', '2018-10-05 13:54:00',
               '2018-10-05 13:55:00', '2018-10-05 13:56:00',
               '2018-10-05 13:57:00', '2018-10-05 13:58:00'],
              dtype='datetime64[ns]', name='timestamp', length=14398, freq=None)
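
The cell above shows the index rather than the values themselves. As a minimal sketch of one way to plot the series, assuming matplotlib is installed: the synthetic frame `demo` below is a stand-in invented for illustration, not the notebook's data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (assumption): one value per minute
index = pd.date_range("2018-09-25 14:01:00", periods=120, freq="min")
demo = pd.DataFrame({"count": 150 + 10 * np.sin(np.arange(120) / 10)}, index=index)

# Series.plot draws the values against the DatetimeIndex and returns the Axes
ax = demo["count"].plot(figsize=(10, 3), title="count over time")
ax.set_xlabel("timestamp")
ax.set_ylabel("count")
```

With the real data, the same call would presumably be `data["count"].plot(...)`.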

Using the pandas shift method, create the columns corresponding to $Y_{t-1}, Y_{t-2}, Y_{t-3}$.

In [9]:
data["count_1"] = data["count"].shift(1)  # Y_{t-1}
data["count_2"] = data["count"].shift(2)  # Y_{t-2}
data["count_3"] = data["count"].shift(3)  # Y_{t-3}

In [10]:
data

timestamp,count,count_1,count_2,count_3
2018-09-25 14:01:00,182.478,,,
2018-09-25 14:02:00,176.231,182.478,,
2018-09-25 14:03:00,183.917,176.231,182.478,
2018-09-25 14:04:00,177.798,183.917,176.231,182.478
2018-09-25 14:05:00,165.469,177.798,183.917,176.231
...,...,...,...,...
2018-10-05 13:54:00,151.492,149.801,151.788,153.938
2018-10-05 13:55:00,151.724,151.492,149.801,151.788
2018-10-05 13:56:00,153.776,151.724,151.492,149.801
2018-10-05 13:57:00,150.481,153.776,151.724,151.492
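
To make the effect of shift concrete, here is a small sketch on a toy series. The values and the names `s`, `lags` and `complete` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy series (hypothetical values), one observation per minute
s = pd.Series(
    [10.0, 20.0, 30.0, 40.0, 50.0],
    index=pd.date_range("2018-09-25 14:01:00", periods=5, freq="min"),
    name="count",
)

# shift(k) moves each value k rows down, so row t holds the value from t-k
lags = pd.DataFrame({
    "count": s,
    "count_1": s.shift(1),
    "count_2": s.shift(2),
    "count_3": s.shift(3),
})

# The first three rows lack at least one lag, so only two complete rows remain
complete = lags.dropna()
print(complete)
```

The rows lost to NaN at the start of the series are the reason a dropna is needed before training.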


We normalize the time series data with MinMaxScaler. Import MinMaxScaler from sklearn's preprocessing module and normalize the data in the dataframe.

In [11]:
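from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# fit() only learns the min and max of each column; it returns the scaler itself, not the scaled data
data_scaled = scaler.fit(data)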

In [12]:
data_scaled

MinMaxScaler()

Does the scaler return a dataframe? If not, rebuild one from the normalized data.

In [27]:
# transform() returns a NumPy array, so wrap it in a dataframe and restore the column names
data_scaled = pd.DataFrame(scaler.transform(data)).reset_index()
data_scaled = data_scaled.rename(columns={0: "count", 1: "count_1", 2: "count_2", 3: "count_3"})
data_scaled

,index,count,count_1,count_2,count_3
0,0,0.704691,,,
1,1,0.677370,0.704691,,
2,2,0.710985,0.677370,0.704691,
3,3,0.684223,0.710985,0.677370,0.704691
4,4,0.630302,0.684223,0.710985,0.677370
...,...,...,...,...,...
14393,14393,0.569174,0.561778,0.570468,0.579871
14394,14394,0.570188,0.569174,0.561778,0.570468
14395,14395,0.579163,0.570188,0.569174,0.561778
14396,14396,0.564752,0.579163,0.570188,0.569174
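
As an alternative sketch, not the notebook's approach, the dataframe can be rebuilt in one step by passing the original column names and index to the constructor, which also keeps the timestamps. The frame `demo` is a hypothetical stand-in for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical miniature of the lagged dataframe
demo = pd.DataFrame(
    {"count": [10.0, 20.0, 30.0], "count_1": [np.nan, 10.0, 20.0]},
    index=pd.date_range("2018-09-25 14:01:00", periods=3, freq="min"),
)

scaler = MinMaxScaler()
# fit_transform returns a NumPy array; NaNs are ignored when fitting and kept in the output
scaled_array = scaler.fit_transform(demo)

# Reattach the original column names and the DatetimeIndex
demo_scaled = pd.DataFrame(scaled_array, columns=demo.columns, index=demo.index)
print(demo_scaled)
```

With this form no rename or reset_index is needed afterwards.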


Save the dataframe to a file "time_series_preprocessed.csv"

In [29]:
data_scaled.to_csv("data/time_series_preprocessed.csv")

Split the data into train and test sets

In [37]:
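# Drop the rows with missing lags; drop=True avoids adding another index column on each run
data_scaled = data_scaled.dropna().reset_index(drop=True)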

In [38]:
from sklearn.model_selection import train_test_split
X = data_scaled[["count_1", "count_2", "count_3"]]
y = data_scaled["count"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [39]:
print("X_train = ", X_train.shape, "y_train = ", y_train.shape)
print("X_test = ", X_test.shape, "y_test = ", y_test.shape)

X_train =  (9644, 3) y_train =  (9644,)
X_test =  (4751, 3) y_test =  (4751,)
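
By default train_test_split shuffles the rows, so the test observations above are interleaved in time with the training ones. One common alternative for time series, shown here only as a sketch and not used in the notebook, is a chronological split with shuffle=False. The data and the names `X_demo`, `y_demo` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ordered data: 100 consecutive observations
demo = pd.DataFrame({"count_1": np.arange(100.0), "count": np.arange(1.0, 101.0)})
X_demo, y_demo = demo[["count_1"]], demo["count"]

# shuffle=False keeps the row order: the first rows go to train, the last ones to test
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, shuffle=False)

print("last train row:", X_tr.index.max(), "first test row:", X_te.index.min())
```

With such a split the model is always evaluated on observations that come after everything it was trained on.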


Train a linear regression on the training data

In [40]:
# Train the model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

model = LinearRegression()
model.fit(X_train, y_train)


LinearRegression()

In [51]:
# Evaluation on the training set
from sklearn.metrics import r2_score
import numpy as np
y_train_predict = model.predict(X_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_predict))
r2_train = r2_score(y_train, y_train_predict)


In [52]:
# model evaluation for testing set
y_test_predict = model.predict(X_test)
rmse_test = (np.sqrt(mean_squared_error(y_test, y_test_predict)))
r2_test = r2_score(y_test, y_test_predict)

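Using the model's score method, display the train score and the test score. For a regressor, score returns the R2 coefficient, the same quantity computed with r2_score in the cells above.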

In [56]:
print("---R2 Score---")
print("Train\n R2 = %.3f --- RMSE = %.3f" %(r2_train, rmse_train))
print("Test\n R2 = %.3f --- RMSE = %.3f" %(r2_test, rmse_test))

---R2 Score---
Train
 R2 = 0.933 --- RMSE = 0.030
Test
 R2 = 0.949 --- RMSE = 0.027
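
As a quick check of the relationship between the score method and r2_score, here is a sketch on synthetic data. The arrays and the weights are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic regression problem (assumption): three features and a little noise
rng = np.random.default_rng(0)
X_demo = rng.random((200, 3))
y_demo = X_demo @ np.array([0.6, 0.3, 0.1]) + 0.01 * rng.standard_normal(200)

demo_model = LinearRegression().fit(X_demo, y_demo)

# score(X, y) predicts on X and returns R2 against y
score = demo_model.score(X_demo, y_demo)
r2 = r2_score(y_demo, demo_model.predict(X_demo))
print("score = %.6f, r2_score = %.6f" % (score, r2))
```

In the notebook this would presumably read `model.score(X_train, y_train)` and `model.score(X_test, y_test)`.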


Display the model's coefficients
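
A fitted LinearRegression exposes its weights in coef_ and its constant term in intercept_. The sketch below fits a model on noise-free synthetic data so that the known weights are recovered. The data and the weights 0.5, 0.3, 0.2 are assumptions made for the example, not the notebook's results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free target: y = 0.5*count_1 + 0.3*count_2 + 0.2*count_3 + 1
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.random((50, 3)), columns=["count_1", "count_2", "count_3"])
y_demo = 0.5 * X_demo["count_1"] + 0.3 * X_demo["count_2"] + 0.2 * X_demo["count_3"] + 1.0

demo_model = LinearRegression().fit(X_demo, y_demo)

# Pair each coefficient with the name of the feature it multiplies
coefs = pd.Series(demo_model.coef_, index=X_demo.columns)
print(coefs)
print("intercept:", demo_model.intercept_)
```

For the notebook's model the equivalent would be `model.coef_` and `model.intercept_`, with the columns of X_train as labels.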

We now look at how to make predictions several time steps ahead.

We now want a model that predicts $Y_{t+10} = f(Y_t, Y_{t-1}, Y_{t-2})$

Prepare the dataframe again, using shifts, so that this works

In [25]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

from sklearn.linear_model import SGDRegressor, LinearRegression

In [2]:
data1 = pd.read_csv("data/univariate_time_series.csv", parse_dates=["timestamp"], index_col="timestamp")


The idea is the same as before: we simply shift the series by the number of time steps we want to predict ahead.

In [7]:
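def creer_df_shift_time_step_normalizer(data1: pd.DataFrame, step: int) -> pd.DataFrame:
    # Work on a copy so the caller's dataframe is not modified
    data1 = data1.copy()

    # Lagged features: the values step, step+1 and step+2 time steps before each row
    data1["count_t"+str(step)] = data1["count"].shift(step)
    data1["count_t"+str(step+1)] = data1["count"].shift(step+1)
    data1["count_t"+str(step+2)] = data1["count"].shift(step+2)

    # Drop the first rows, which have no lagged values
    data1 = data1.dropna()

    # fit_transform returns a NumPy array, so rebuild a dataframe and restore the column names
    scaler = MinMaxScaler()
    data1_scaled = pd.DataFrame(scaler.fit_transform(data1)).reset_index()
    data1_scaled = data1_scaled.rename(columns={0:"count", 1: "count_t"+str(step), 2: "count_t"+str(step+1), 3:"count_t"+str(step+2)})

    return data1_scaled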

In [8]:
data_normalize = creer_df_shift_time_step_normalizer(data1, 10)

In [9]:
data_normalize

,index,count,count_t10,count_t11,count_t12
0,0,0.730841,0.710985,0.677370,0.704691
1,1,0.675218,0.684223,0.710985,0.677370
2,2,0.674278,0.630302,0.684223,0.710985
3,3,0.733775,0.702067,0.630302,0.684223
4,4,0.680230,0.713543,0.702067,0.630302
...,...,...,...,...,...
14381,14381,0.569174,0.574982,0.593394,0.604437
14382,14382,0.570188,0.565745,0.574982,0.593394
14383,14383,0.579163,0.571216,0.565745,0.574982
14384,14384,0.564752,0.591050,0.571216,0.565745
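
In the table above, each row pairs count at time $t$ with the values observed 10, 11 and 12 steps earlier, which is the relation $Y_{t+10} = f(Y_t, Y_{t-1}, Y_{t-2})$ seen from the target's timestamp. An equivalent construction, sketched below with a hypothetical helper `make_horizon_frame`, keeps the features at time $t$ and shifts the target backwards with a negative shift.

```python
import pandas as pd


def make_horizon_frame(series: pd.Series, horizon: int) -> pd.DataFrame:
    """Hypothetical helper: features Y_t, Y_{t-1}, Y_{t-2} and target Y_{t+horizon}."""
    frame = pd.DataFrame({
        "target": series.shift(-horizon),  # value observed `horizon` steps later
        "y_t": series,
        "y_t_1": series.shift(1),
        "y_t_2": series.shift(2),
    })
    # Rows without two lags or without a future target are dropped
    return frame.dropna()


# Toy series 0, 1, ..., 19 so that the alignment is easy to read
s = pd.Series(range(20), dtype="float64")
frame = make_horizon_frame(s, horizon=10)
print(frame.head())
```

Both constructions yield the same (features, target) pairs; only the timestamp attached to each row differs.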


Retrain a linear regression

In [149]:
# Select the target column "count" (fourth from the end) from the normalized dataframe
cols_df = data_normalize.columns
data_normalize[cols_df[-4:-3]]

,count
0,0.730841
1,0.675218
2,0.674278
3,0.733775
4,0.680230
...,...
14381,0.569174
14382,0.570188
14383,0.579163
14384,0.564752


In [37]:
#from sklearn.linear_model import SGDRegressor
def model_regression_linear(df: pd.DataFrame):
    # Features are the last three columns (the lagged values); the target is "count"
    cols_df = df.columns
    X = df[cols_df[-3:]]
    y = df["count"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                        random_state=42)

    #model = SGDRegressor(max_iter=1000, alpha=0.008)
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Predict once per split and reuse the predictions for every metric
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    metric_train = {'modele': 'model',
                    'mean_absolute_error': mean_absolute_error(y_train, y_train_pred),
                    'mean_squared_error': mean_squared_error(y_train, y_train_pred),
                    'r2_score': r2_score(y_train, y_train_pred)}
    metric_test = {'modele': 'model',
                   'mean_absolute_error': mean_absolute_error(y_test, y_test_pred),
                   'mean_squared_error': mean_squared_error(y_test, y_test_pred),
                   'r2_score': r2_score(y_test, y_test_pred)}
    return metric_train, metric_test


In [38]:
mtx_train, mtx_test = model_regression_linear(data_normalize)

### Summary:
- LinearRegression gives better results than SGDRegressor: R2 ≈ 0.90 and MSE ≈ 0.001
- SGDRegressor: R2 ≈ 0.83 and MSE ≈ 0.002

In [39]:
print("---TRAIN")
print("R2 = %.3f --- mse = %.3f" %(mtx_train['r2_score'], mtx_train['mean_squared_error']))
print("\n---TEST")
print("R2 = %.3f --- mse = %.3f" %(mtx_test['r2_score'], mtx_test['mean_squared_error']))

---TRAIN
R2 = 0.898 --- mse = 0.001

---TEST
R2 = 0.905 --- mse = 0.001


Instead of a linear regression, we now use a decision tree. Import the DecisionTreeRegressor model.
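
A minimal sketch of that step, assuming the same three-feature setup as before: the synthetic arrays stand in for the notebook's data, and max_depth=5 is an arbitrary choice made for the example rather than a tuned value.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in (assumption): three lag-like features and a linear target
rng = np.random.default_rng(0)
X_demo = rng.random((300, 3))
y_demo = 0.6 * X_demo[:, 0] + 0.3 * X_demo[:, 1] + 0.1 * X_demo[:, 2]

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.4, random_state=42)

# Limiting the depth keeps the tree from memorizing the training set
tree = DecisionTreeRegressor(max_depth=5, random_state=42)
tree.fit(X_tr, y_tr)

print("train R2 = %.3f" % r2_score(y_tr, tree.predict(X_tr)))
print("test R2 = %.3f" % r2_score(y_te, tree.predict(X_te)))
```

Because the estimator follows the same fit and predict interface, it could be swapped in for LinearRegression inside model_regression_linear with a one-line change.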