# Your mission

You have just started working on the Ecowatt project at RTE. To avoid possible shortages, one must plan for peaks in national electricity demand. Your manager, Mark, is going on holiday for a week. While he is away, you will be solely responsible for forecasting the weekly demand.

To prevent electricity shortages, you must accurately forecast the demand 7 days ahead, on an hourly basis.

Your mission is to train a predictive model with the lowest possible root mean squared error (RMSE). Mark is very technical: he likes to understand every detail and would like you to compare the performance of classical models with that of neural-network-based models.
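
Since RMSE is the metric every model will be ranked on, a minimal sketch of how it can be computed is given here. The helper name `rmse` and the toy values are illustrative assumptions, not part of the project code. Unlike MSE, RMSE is expressed in the same unit as the target.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    # Mean of squared errors, then square root to return to the target's unit
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values for illustration only, not taken from the dataset
demo_true = [600.0, 610.0, 620.0]
demo_pred = [603.0, 606.0, 620.0]
print(rmse(demo_true, demo_pred))
```

When comparing models, make sure every score is on the same scale: an MSE and an RMSE cannot be compared directly.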


Your **target variable** is `consommation_totale`.

**Data source** : https://data.enedis.fr/pages/accueil/

# Import

In [1]:
import os
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import drive

In [13]:
drive.mount('/content/gdrive')
if os.getcwd() != "/content/gdrive/MyDrive/Thales/EI_ST4_G1/EI_TS_CS-20230526T084435Z-001/EI_TS_CS":
  os.chdir("/content/gdrive/MyDrive/Thales/EI_ST4_G1/EI_TS_CS-20230526T084435Z-001/EI_TS_CS")

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [14]:
%run ./utils.ipynb

In [15]:
FILE_PATH = "data/aree-meters-history.csv"
TARGET = "Energie_froid_kW"

## Prepare the data

Define here the range of your train/test split
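
For a time series, the split should be chronological, with the most recent observations held out. The sketch below holds out one full forecasting horizon (7 days of hourly points, i.e. 168 rows). The function name `chronological_split`, the constant `HORIZON`, and the synthetic data are assumptions made for illustration; the cell that follows uses a fixed slice of the last 100 rows instead.

```python
import numpy as np
import pandas as pd

HORIZON = 7 * 24  # 7 days of hourly observations = 168 steps

def chronological_split(frame, horizon=HORIZON):
    """Hold out the last `horizon` rows as the test set, keeping time order."""
    if len(frame) <= horizon:
        raise ValueError("not enough rows for the requested horizon")
    frame = frame.sort_index()  # make sure rows are in chronological order
    return frame.iloc[:-horizon], frame.iloc[-horizon:]

# Synthetic hourly series standing in for the real data
idx = pd.date_range("2023-01-01", periods=500, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({"value": np.arange(500.0)}, index=idx)

train, test = chronological_split(demo)
print(len(train), len(test))
```

Never shuffle the rows before splitting: the test set must lie strictly after the training set, otherwise the model is evaluated on a past it has already seen.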

In [16]:
def read_data(data_path : str = "data/aree-meters-history.csv") -> pd.DataFrame:
    df = pd.read_csv(data_path)
    df['Date'] = pd.to_datetime(df['Date']) # Convert 'Date' column to datetime
    df = df.set_index('Date') # Set 'Date' as the index
    # Filter rows where the value of column "Mois" is 6
    # df = df[df['Mois'].isin([5,6])]
    # hourly_avg = df.groupby('Mois')['consommation_totale'].mean().reset_index(name='consommation_totale')

    return df

df = read_data("data/aree-meters-history.csv")

X_train = df[-1000:-100]
X_test = df[-100:]
print(X_train)
print(X_test)

                     Energie_froid_kW
Date                                 
2023-06-15 12:00:00        688.026138
2023-06-15 13:00:00        817.108319
2023-06-15 14:00:00        923.111992
2023-06-15 15:00:00        703.966316
2023-06-15 16:00:00        817.186235
...                               ...
2023-06-24 08:00:00        597.260931
2023-06-24 09:00:00        598.331800
2023-06-24 10:00:00        604.830936
2023-06-24 11:00:00        597.157071
2023-06-24 12:00:00        600.639374

[217 rows x 1 columns]
                     Energie_froid_kW
Date                                 
2023-06-24 13:00:00        602.617335
2023-06-24 14:00:00        598.535528
2023-06-24 15:00:00        595.586120
2023-06-24 16:00:00        598.020825
2023-06-24 17:00:00        593.513699
...                               ...
2023-06-28 12:00:00        757.548382
2023-06-28 13:00:00        762.251379
2023-06-28 14:00:00        800.177905
2023-06-28 15:00:00        529.190002
2023-06-28 16:00:00       

In [17]:
X_test

                     Energie_froid_kW
Date                                 
2023-06-24 13:00:00        602.617335
2023-06-24 14:00:00        598.535528
2023-06-24 15:00:00        595.586120
2023-06-24 16:00:00        598.020825
2023-06-24 17:00:00        593.513699
...                               ...
2023-06-28 12:00:00        757.548382
2023-06-28 13:00:00        762.251379
2023-06-28 14:00:00        800.177905
2023-06-28 15:00:00        529.190002


# Modeling with ARIMA
In this section, you will fit classical models. The suggested method is ARIMA, but you can also try other models such as ARMA, ARIMAX, or SARIMAX.

## Modeling
The following code allows ARIMA modeling with one combination of (p,d,q).

In [18]:
parameters = (2,1,1)
errors, predictions = evaluate_arima_model(
    X_train[TARGET],
    X_test[TARGET],
    parameters
    )
errors

  warn('Non-stationary starting autoregressive parameters'
  warn('Non-invertible starting MA parameters found.'


3971.347828169909

In [19]:
predictions

[599.6480115415794,
 601.4321255278804,
 600.2848081683497,
 597.5164120203002,
 597.3356915954363,
 595.6921945995331,
 594.1373420180558,
 594.1123598663916,
 595.7423987285164,
 594.2610764516379,
 594.770254055859,
 595.5732064847889,
 594.6031939729028,
 608.0771004296976,
 616.1181248507967,
 610.8070635772668,
 612.4201898776794,
 619.8457659534475,
 622.5738720031601,
 619.6821804120157,
 619.1112364056199,
 620.6008042299712,
 614.5259518317704,
 617.6127948478022,
 622.857098832488,
 624.7802369637556,
 622.9104774260958,
 622.4534664228164,
 621.3892647458338,
 616.7504078308363,
 612.4952117426132,
 612.4662001083716,
 613.4008666352231,
 618.9943350919027,
 624.0223128112436,
 625.5001055185433,
 618.595976577476,
 622.9182334502399,
 624.9721002978894,
 617.7351650957713,
 613.3923140010658,
 622.1931853806611,
 623.987819510151,
 624.9362161904236,
 624.7573029662648,
 606.824994385258,
 439.3666209936831,
 345.9177670006244,
 542.9647271386065,
 588.0360995128427,
 586.

## Search for the best ARIMA model
We use a grid search to find the ARIMA parameters that give the lowest error. This is an automated take on the model-selection step of the Box-Jenkins methodology.
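
The helper `arima_grid_search` also comes from `utils.ipynb` and is not shown here. The sketch below illustrates the general idea with a generic scoring callable: every (p, d, q) combination is scored and the lowest score wins. The names `grid_search` and `toy_score` are hypothetical, and the dummy scorer only stands in for a real fit-and-evaluate function.

```python
import itertools
import math

def grid_search(score_fn, p_values, d_values, q_values):
    """Return the (p, d, q) order with the lowest score, skipping failed fits."""
    best_order, best_score = None, math.inf
    for order in itertools.product(p_values, d_values, q_values):
        try:
            score = score_fn(order)
        except Exception:
            continue  # some orders may fail to fit; skip them
        if score < best_score:
            best_order, best_score = order, score
    return best_order, best_score

def toy_score(order):
    """Dummy scorer for demonstration: minimal at (1, 0, 0)."""
    p, d, q = order
    return (p - 1) ** 2 + d ** 2 + q ** 2 + 1.0

# Same grid as the cell below: 2 * 3 * 3 = 18 combinations
result = grid_search(toy_score, range(1, 3), range(0, 3), range(0, 3))
print(result)
```

One caveat: choosing the order on the test set makes the reported error optimistic. Where enough data is available, a separate validation window is a cleaner way to select the parameters.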

In [20]:
best_cfg, best_score = arima_grid_search(X_train[TARGET],
                                            X_test[TARGET],
                                            range(1,3),range(0,3),range(0,3))

ARIMA(1,0,0) RMSE=3557.512
ARIMA(1,0,1) RMSE=3737.140
ARIMA(1,0,2) RMSE=3882.722
ARIMA(1,1,0) RMSE=3976.123
ARIMA(1,1,1) RMSE=3888.989
ARIMA(1,1,2) RMSE=3909.701
ARIMA(1,2,0) RMSE=6670.651
ARIMA(1,2,1) RMSE=3987.130


  warn('Non-invertible starting MA parameters found.'


ARIMA(1,2,2) RMSE=3920.898
ARIMA(2,0,0) RMSE=3989.688
ARIMA(2,0,1) RMSE=3880.541
ARIMA(2,0,2) RMSE=3879.040
ARIMA(2,1,0) RMSE=3930.827


  warn('Non-stationary starting autoregressive parameters'
  warn('Non-invertible starting MA parameters found.'


ARIMA(2,1,1) RMSE=3971.348
ARIMA(2,1,2) RMSE=4069.902
ARIMA(2,2,0) RMSE=5715.685


  warn('Non-invertible starting MA parameters found.'


ARIMA(2,2,1) RMSE=3920.564
ARIMA(2,2,2) RMSE=4012.820
Best ARIMA(1, 0, 0) MSE=3557.512


In [21]:
print(best_cfg, best_score)

(1, 0, 0) 3557.5115145530845


In [22]:
import statsmodels.api as sm

model = sm.tsa.ARIMA(X_train[TARGET], order=(2,1,1))
fitted = model.fit()
fitted.summary()


  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  self._init_dates(dates, freq)
  warn('Non-stationary starting autoregressive parameters'
  warn('Non-invertible starting MA parameters found.'


Dep. Variable:      Energie_froid_kW    No. Observations:   217
Model:              ARIMA(2, 1, 1)      Log Likelihood      -1187.891
Date:               Thu, 29 Jun 2023    AIC                 2383.782
Time:               12:48:27            BIC                 2397.283
Sample:             06-15-2023          HQIC                2389.237
                    - 06-24-2023
Covariance Type:    opg

             coef    std err        z    P>|z|     [0.025     0.975]
ar.L1      0.3709      0.105    3.543    0.000      0.166      0.576
ar.L2      0.2998      0.079    3.798    0.000      0.145      0.455
ma.L1     -0.8956      0.100   -8.959    0.000     -1.092     -0.700
sigma2  3493.3293    164.827   21.194    0.000   3170.275   3816.384

Ljung-Box (L1) (Q):        0.05    Jarque-Bera (JB):   1026.98
Prob(Q):                   0.82    Prob(JB):           0.0
Heteroskedasticity (H):    0.65    Skew:               -1.23
Prob(H) (two-sided):       0.07    Kurtosis:           13.4


In [None]:
df

In [24]:
df_reset = df.reset_index()
Date_column = df_reset["Date"]


In [25]:
Date_column

0     2023-06-15 12:00:00
1     2023-06-15 13:00:00
2     2023-06-15 14:00:00
3     2023-06-15 15:00:00
4     2023-06-15 16:00:00
              ...        
312   2023-06-28 12:00:00
313   2023-06-28 13:00:00
314   2023-06-28 14:00:00
315   2023-06-28 15:00:00
316   2023-06-28 16:00:00
Name: Date, Length: 317, dtype: datetime64[ns]

## Visualization
To get a better view of the difference between the true and predicted values, we plot both signals.

In [26]:
# prepare the dataset for plotting
df_reset = df.reset_index()
predict_date = df_reset["Date"]
df_predict = pd.DataFrame(zip(predict_date[-100:],
                              predictions, X_test[TARGET].values),
                          columns=["date", "predict", "true"])

In [27]:
df_predict

                  date     predict        true
0  2023-06-24 13:00:00  599.648012  602.617335
1  2023-06-24 14:00:00  601.432126  598.535528
2  2023-06-24 15:00:00  600.284808  595.586120
3  2023-06-24 16:00:00  597.516412  598.020825
4  2023-06-24 17:00:00  597.335692  593.513699
..                 ...         ...         ...
95 2023-06-28 12:00:00  751.214162  757.548382
96 2023-06-28 13:00:00  755.329707  762.251379
97 2023-06-28 14:00:00  759.090909  800.177905
98 2023-06-28 15:00:00  781.154657  529.190002


In [28]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from statsmodels.tsa.arima.model import ARIMA

# Fit ARIMA(2, 1, 1) on the training target only
model = ARIMA(X_train[TARGET], order=(2, 1, 1))
model_fit = model.fit()

# Make predictions on the test set: one multi-step forecast, without refitting
predictions = np.asarray(model_fit.forecast(steps=len(X_test)))
y_true = X_test[TARGET].values

# accuracy_score is a classification metric, so it is not used here
mae = mean_absolute_error(y_true, predictions)
rmse = np.sqrt(mean_squared_error(y_true, predictions))
r2 = r2_score(y_true, predictions)

print("MAE:", mae)
print("RMSE:", rmse)
print("R²:", r2)

In [None]:
import plotly.graph_objects as go

fig = go.Figure()

# Overlay the predicted and true signals on the same time axis
fig.add_trace(go.Scatter(x=df_predict["date"], y=df_predict["predict"], name="predict"))
fig.add_trace(go.Scatter(x=df_predict["date"], y=df_predict["true"], name="true"))

fig.update_layout(title="Predictions vs true values")
fig.show()

# Modeling with other models

Try other models: random forest, XGBoost, ...
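
Tree-based models such as random forest or XGBoost do not handle time order by themselves, so the series first has to be turned into a supervised table. One common way is to use lagged values and calendar fields as features, as in the sketch below. The function name `make_lag_features`, the chosen lags, and the synthetic series are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def make_lag_features(series, lags=(1, 2, 24)):
    """Build a supervised table: one column per lag, the hour of day, and the target."""
    table = pd.DataFrame({f"lag_{k}": series.shift(k) for k in lags})
    table["hour"] = series.index.hour  # calendar feature from the DatetimeIndex
    table["target"] = series
    return table.dropna()  # the first max(lags) rows have incomplete lags

# Synthetic hourly series standing in for the real data
idx = pd.date_range("2023-06-01", periods=72, freq=pd.Timedelta(hours=1))
demo = pd.Series(np.arange(72.0), index=idx, name="value")

features = make_lag_features(demo)
print(features.shape)
```

The feature columns can then be passed as `X` and the `target` column as `y` to a regressor. For a genuine 7-day-ahead forecast, lags shorter than the horizon are not known at prediction time, so either use lags of at least 168 hours or forecast recursively.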

In [None]:
model2 = ARIMA(df, order=(2, 1, 1))
model2_fit = model2.fit()

  self._init_dates(dates, freq)
  self._init_dates(dates, freq)


ValueError: ignored