In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import numpy as np
import pandas as pd
from pathlib import Path
import os
from corrai.measure import MeasuredDats
from copy import deepcopy
import plotly.io as pio

pio.renderers.default = "browser"

# Time series forecasting

This tutorial guides you through training a Machine Learning (ML) model to predict, at short term (~6 h), the electricity consumption of the office equipment in a tertiary building.

The tutorial follows these main steps:
- Load the raw data provided by the Building Energy Monitoring System (BEMS)
- Use the Corrai object <code>MeasuredDats</code> to pre-process, clean, and visualize the data
- Train several ML models, select the most appropriate one and tune its hyperparameters
- Assess and discuss the model performance
- Wrap the model and the processing pipes for production

# Visualise the data

The data come from an electrical energy meter that returns an increasing index in kWh. The timestep varies between 11 min and 22 min. The electrical loads are office equipment, and probably some lab equipment.

First, let's load the CSV file, which contains a single column of data, using Pandas.

In [None]:
data_df = pd.read_csv(Path(os.getcwd()) / "resources/compteur_elec.csv", index_col=0, sep=';')
data_df.index = pd.to_datetime(data_df.index, format='mixed', utc=True)
data_df = data_df.loc["2023-01-01 00:00:00":, :]

Let's use the <code>DataFrame</code> <code>info()</code> method to get some insights.

In [None]:
data_df.info()

It looks like data are available from 2022-12-09 to 2023-11-22 (in order to get full days).
It also looks like none of the 22992 entries are NaN. This is good news.

We will use the <code>MeasuredDats</code> object to get a visual representation and try to identify the problematic gaps (longer than 3 h, where linear interpolation may become "dangerous").

_Note: this is a bit of a misuse of MeasuredDats. For a proper introduction to the class, see the corresponding tutorial._

In [None]:
my_data = MeasuredDats(data_df)

In [None]:
my_data.plot_gaps(gaps_timestep="3H")

In [None]:
my_data.get_gaps_description(gaps_timedelta='3H')

- A total of 5 gaps longer than 3 h were found.
- The longest gap is 4 days long

This is not bad for nearly a full year of data.

We can go a bit further and get the dates of the gaps.

In [None]:
from corrai.measure import find_gaps

In [None]:
find_gaps(data_df, timestep='3h')
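
The idea behind gap detection can also be sketched with plain pandas. The snippet below is an illustration on made-up timestamps, not the implementation of <code>find_gaps</code>; the names <code>timestamps</code>, <code>deltas</code> and <code>gaps</code> are hypothetical, and only the 3 h threshold comes from the text above.

```python
import pandas as pd

# Hypothetical meter timestamps: regular readings, then one long outage
timestamps = pd.to_datetime(
    [
        "2023-04-06 20:00",
        "2023-04-06 20:15",
        "2023-04-06 20:30",
        "2023-04-07 12:30",  # 16 h after the previous reading
        "2023-04-07 12:45",
    ],
    utc=True,
)

# Time elapsed since the previous reading (NaT for the first row)
deltas = timestamps.to_series().diff()

# Keep only the intervals longer than the chosen threshold
gaps = deltas[deltas > pd.Timedelta("3h")]
print(gaps)
```

Each remaining row is labelled with the timestamp at which the data resume, and its value is the duration of the gap.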

Given these results, we will probably discard the following periods:
- __2023-04-07__: 16 h is too long for interpolation
- __2023-05-19 to 2023-05-29__: we choose to discard the full period. As we will see later, we are going to build sequences to train the ML models. Gaps will introduce errors, so we want to limit their number. It is better to throw away some good days
- __2023-10-28 to 2023-10-30__: the gap spans these days

Moreover, the previous figure showed an anomaly, with the energy meter returning a negative value. This value shall be removed.

⚠️ __WARNING__ ⚠️
Before going any further, we propose to split the dataset into 3 parts: __training, validation, testing__.
We will not even look at the testing sample during the ML training process.
This avoids introducing bias that could artificially improve our models' performance.
For example, when fitting transformers on the full dataset, we involuntarily introduce knowledge about the data we later try to predict.
The test sample will only be used to evaluate the model performance.
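
The leakage risk can be illustrated with a few lines of NumPy. This is a minimal sketch on made-up numbers, assuming a simple standardisation; the arrays and variable names are hypothetical. The scaling statistics are computed on the training portion only and then reused, unchanged, on the held-out portion.

```python
import numpy as np

# Hypothetical series: the last two values play the role of unseen data
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0, 60.0])
train, held_out = values[:8], values[8:]

# Statistics are learned from the training portion only
train_mean = train.mean()
train_std = train.std()

train_scaled = (train - train_mean) / train_std
held_out_scaled = (held_out - train_mean) / train_std  # reuse, do not refit

# Fitting on the full series would let the held-out values shift the statistics
leaky_mean = values.mean()
print(train_mean, leaky_mean)
```

The two printed means differ because the held-out values pull the second one upwards, which is exactly the kind of information that should not reach the training step.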

We propose the following split: __80% training, 10% validation, 10% testing__

Moreover, we will not shuffle the dataset before splitting it. We are working with time series, so the data are chronologically coherent. The training dataset will correspond to the beginning of the year, while the testing dataset will correspond to the end.
Doing so may leave a seasonal effect unaccounted for, but it is better to have a coherent sequence than random fragments of time series.
This could become a major issue if the data had a seasonal trend, but we will see later that it is not the case.
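
A chronological 80/10/10 split can be sketched as follows. The helper <code>chronological_split</code> is hypothetical and only illustrates the principle on a toy array: rows are never shuffled, so every training row precedes every validation row, which in turn precedes every test row.

```python
import numpy as np


def chronological_split(n_rows, train_frac=0.8, valid_frac=0.1):
    """Return (train, valid, test) slices that preserve time order."""
    train_end = int(n_rows * train_frac)
    valid_end = int(n_rows * (train_frac + valid_frac))
    return slice(0, train_end), slice(train_end, valid_end), slice(valid_end, n_rows)


# Toy stand-in for a time-ordered dataset
samples = np.arange(20)
train_sl, valid_sl, test_sl = chronological_split(len(samples))
train, valid, test = samples[train_sl], samples[valid_sl], samples[test_sl]
print(len(train), len(valid), len(test))
```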
 

In [None]:
data_size = data_df.shape[0]
train_raw_df = data_df.iloc[:int(data_size*0.8), :].copy()
train_raw_df = train_raw_df.loc[:"2023-05-01", :]
valid_raw_df = data_df.iloc[int(data_size*0.8):int(data_size*0.9), :].copy()
tests_raw_df = data_df.iloc[int(data_size*0.9):, :].copy()

# Clean the data

The energy meter returns a monotonically increasing time series.
However, we would like to have the electric power, i.e. the energy rate [W], absorbed by the equipment.
We propose the following transformations to process the time series:
- Remove negative values
- Convert kWh to joules
- Apply a time gradient
- Fill in the small gaps (<=3h) using linear interpolation
- Apply backward and forward filling to make sure no NaN values are present at the beginning and at the end of the time series
- Resample the time series to get a constant 15 min timestep
- The resampling introduces new gaps where the timestep was greater than 15 min. Fill them using linear interpolation again
- Smooth the result with a Gaussian filter (sigma = 2)

To easily generate the pipeline and visualise its effect, we use the <code>common_trans</code> attribute of <code>MeasuredDats</code>. We define a single transformation process called ... <code>Process</code>
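
The unit conversion and the time gradient can be checked by hand on a toy series. The sketch below uses plain pandas and a backward difference; the Corrai transformers may use a different differencing scheme, and the meter values are made up for illustration.

```python
import pandas as pd

# Hypothetical meter index in kWh, read every 15 minutes
index = pd.to_datetime(
    ["2023-01-01 00:00", "2023-01-01 00:15", "2023-01-01 00:30"], utc=True
)
meter_kwh = pd.Series([100.0, 100.5, 101.5], index=index)

# 1 kWh = 1000 W * 3600 s = 3.6e6 J
energy_j = meter_kwh * 1000 * 3600

# Energy difference divided by elapsed seconds gives the mean power in W
elapsed_s = energy_j.index.to_series().diff().dt.total_seconds()
power_w = energy_j.diff() / elapsed_s
print(power_w)
```

The first value is undefined because there is no previous reading, which is one reason the pipeline back-fills after the gradient.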

In [None]:
from corrai.measure import Transformer, AGG_METHOD_MAP, AggMethod

In [None]:
train_md = MeasuredDats(train_raw_df)

In [None]:
train_md.common_trans = {
    "Process": [
        [Transformer.DROP_THRESHOLD, {'lower': 0}],
        [Transformer.APPLY_EXPRESSION, {"expression": "X * 1000 * 3600"}],
        [Transformer.TIME_GRADIENT, {}],
        [Transformer.INTERPOLATE, {"method": "linear"}],
        [Transformer.BFILL, {}],
        [Transformer.FFILL, {}],
        [Transformer.RESAMPLE, {"rule": "15T", "method": AGG_METHOD_MAP[AggMethod.MEAN]}],
        [Transformer.INTERPOLATE, {"method": "linear"}],
        [Transformer.GAUSSIAN_FILTER, {"sigma":2}]        
    ]
}

train_md.transformers_list = ["Process"]

Let's see the effect of the pipe!

In [None]:
dummy = MeasuredDats(train_md.get_corrected_data().copy())

In [None]:
dummy.common_trans = {
    "Process": [
        [Transformer.GAUSSIAN_FILTER, {}]
    ]
}
dummy.transformers_list = ["Process"]

In [None]:
dummy.plot(plot_raw=True)

In [None]:
train_md.plot()

It looks really good! We can make the following observations:
- There is a weekly pattern, with lower consumption during the weekends
- There is a daily pattern: the energy rate starts to increase around 6:00 and drops around 17:00
- For an unknown reason, there seems to be a pattern with a period of ~2 h. It could be a chiller, for a server room for example
- It's not very clear, but the energy rate demand seems to increase slightly during the cold months
- There seems to be a holiday period from 25 December to 2 January

We can also see the gaps in the data. For example, if you zoom in on the second half of May, you should see straight lines corresponding to the linear interpolation. Let's remove them.

In [None]:
def drop_list_of_day(df, list_of_day):
    """
    Drops rows from a pandas DataFrame based on a list of specific days.
    Parameters:
    - df (pd.DataFrame): The input DataFrame from which rows will be dropped.
    - list_of_day (list): A list of dates in string format ('YYYY-MM-DD') representing the days to be dropped.
    """
    days_to_drop_dt = pd.to_datetime(list_of_day, errors='coerce')
    return df[~df.index.strftime('%Y-%m-%d').isin(days_to_drop_dt.strftime('%Y-%m-%d'))]

In [None]:
# List of full days to drop
days_to_drop_train = [
    "2023-04-06",
    "2023-04-07",
    "2023-05-19",
    "2023-05-20",
    "2023-05-21",
    "2023-05-22",
    "2023-05-23",
    "2023-05-24",
    "2023-05-25",
    "2023-05-26",
    "2023-05-27",
    "2023-05-28",
    "2023-05-29",
]
days_to_drop_test = [
    "2023-10-28",
    "2023-10-29",
    "2023-10-30",
]

train_raw_df = drop_list_of_day(train_raw_df, days_to_drop_train).copy()
tests_raw_df = drop_list_of_day(tests_raw_df, days_to_drop_test).copy()

In [None]:
# train_df_list = [train_raw_df.loc["2023-01-01": "2023-05-01"],
#                  train_raw_df.loc["2023-06-04":, :]]

# Feature engineering

We could use the time series alone, split it into sequences of 18 h for example, and try to predict the next 6 hours based on the previous twelve.
This is a classic autoregressive approach, like ARIMA. We tried, and got pretty bad results...

In this chapter, we try to add information by creating new features, based on the observation of the time series.

## People

Looking at the data, it seems obvious that the electric power rate is related to the building occupancy.
In the previous plot you can easily identify the weekends, the working days, and even when occupants arrive or leave every day.
How? Because you can identify at least two behaviours: one with high power demand, and one with low power demand fluctuations.
Let's see if we can create a feature that indicates people's presence in the building.

To do this, we will use the <code>KdeSetPointIdentificator</code> object.
To help it find spots, we will smooth the data using a <code>PdGaussianFilter1D</code>.

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from corrai.transformers import PdSkTransformer, PdGaussianFilter1D
from corrai.learning.cluster import KdeSetPointIdentificator, plot_kde_set_point, plot_time_series_kde

kde_pipe = make_pipeline(
    PdSkTransformer(StandardScaler()),
    PdGaussianFilter1D(sigma=4),
    KdeSetPointIdentificator(lik_filter=0.5, cluster_tol=0.3),
)

Let's fit this pipeline on the training data.

In [None]:
kde_pipe.fit(train_md.get_corrected_data())

In [None]:
plot_kde_set_point(train_md.get_corrected_data(), estimator=kde_pipe)

It is not bad. It seems that the kde_pipe did a good job at identifying nights, weekends and holidays!

To create a feature from this transformer, we wrap it into a function.
It uses the predict method to create a time series equal to 0 when it identifies near-0 values (nights and weekends),
and -1 when it identifies fast changes.

In [None]:
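def is_people(X) -> pd.DataFrame:
    # Work on a copy so the caller's DataFrame is not modified in place
    X = X.copy()
    # Note: fit_predict refits kde_pipe on whatever data is passed in
    X["is_people"] = kde_pipe.fit_predict(X.dropna())
    return X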

Let's add this to the <code>MeasuredDats</code> pipe, to start building the full preprocessing pipe.
We also add a <code>StandardScaler</code>, as it is good practice in machine learning (it reduces the "distance" between the features and helps convergence).

In [None]:
from sklearn.preprocessing import FunctionTransformer

preprocess_pipeline = deepcopy(train_md.get_pipeline())
preprocess_pipeline.steps.extend([
    ("Scaler", PdSkTransformer(StandardScaler())),
    ("People_schedule", FunctionTransformer(func=is_people))
])   

You can visualize the new feature using a dummy <code>MeasuredDats</code> object.

In [None]:
dummy_md = MeasuredDats(preprocess_pipeline.fit_transform(train_raw_df))
dummy_md.plot()

## Fourier pairs

One common solution to help the model learn consists in adding sine waves at the desired frequencies. Here is an example: https://www.tensorflow.org/tutorials/structured_data/time_series
You can see these new time series as additional coordinates for the data points. They bring measurements taken at the same time of day, or on the same day of the week, closer together.
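
A Fourier pair is simply a sine and a cosine of the same period, evaluated at each timestamp. The sketch below is a hand-rolled illustration; the helper <code>add_fourier_pair</code>, its column names and its phase reference (the Unix epoch) are assumptions and may differ from what the Corrai transformer does.

```python
import numpy as np
import pandas as pd


def add_fourier_pair(df, period_s, prefix):
    """Append sine and cosine columns of the given period (in seconds)."""
    out = df.copy()
    epoch = pd.Timestamp("1970-01-01", tz="UTC")
    # Seconds elapsed since the epoch for every row of the index
    seconds = (out.index - epoch).total_seconds().to_numpy()
    angle = 2 * np.pi * seconds / period_s
    out[f"{prefix}_sin"] = np.sin(angle)
    out[f"{prefix}_cos"] = np.cos(angle)
    return out


# Two days of made-up data at a 15 min timestep
index = pd.date_range("2023-01-02", periods=2 * 96, freq="15min", tz="UTC")
demo = pd.DataFrame({"power": np.zeros(len(index))}, index=index)
demo = add_fourier_pair(demo, period_s=24 * 3600, prefix="day")
print(demo.head())
```

Two rows taken at the same time on different days get the same (sine, cosine) coordinates, and using both columns removes the ambiguity a single sine would leave between, for example, morning and evening.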

For practical reasons, we will "automate" the addition of these sine wave time series using transformers and the <code>MeasuredDats</code> pipeline.

We use a periodogram to show the harmonics of the energy signal.

_Note that Electricity_index_kWh is no longer in kWh but in W, since we transformed it. However, the feature name remains. This is not very clean._

In [None]:
from corrai.learning.time_series import plot_periodogram

In [None]:
plot_periodogram(preprocess_pipeline.fit_transform(train_raw_df)["Electricity_index_kWh"])

We identify 4 harmonics:
- the weekly and daily patterns are confirmed by the periodogram
- we will also use the semi-weekly period, 1 / 313 year and 1 / 1095 year
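
Reading harmonics off a periodogram can be reproduced on synthetic data with NumPy's FFT. This is a sketch on a made-up signal, assuming a raw squared-magnitude spectrum; <code>plot_periodogram</code> may rely on a different estimator.

```python
import numpy as np

timestep_s = 15 * 60
n_samples = 14 * 96  # 14 days at a 15 min timestep
t = np.arange(n_samples) * timestep_s

# Hypothetical signal: a daily cycle plus a weaker weekly cycle
daily = np.sin(2 * np.pi * t / (24 * 3600))
weekly = 0.5 * np.sin(2 * np.pi * t / (7 * 24 * 3600))
signal = daily + weekly

# Power spectrum of the centred signal and the matching frequencies in Hz
spectrum = np.abs(np.fft.rfft(signal - signal.mean())) ** 2
freqs = np.fft.rfftfreq(n_samples, d=timestep_s)

# Skip the zero-frequency bin, then sort the bins by decreasing power
order = np.argsort(spectrum[1:])[::-1] + 1
dominant_periods_h = 1 / freqs[order[:2]] / 3600
print(dominant_periods_h)
```

The two strongest bins correspond to the daily and weekly periods that were injected, which is how peaks on the real periodogram translate into candidate frequencies for the Fourier pairs.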

In [None]:
year_second = 365 * 24 * 3600

In [None]:
from corrai.transformers import PdAddFourierPairs

In [None]:
def weekday_encoding(X) -> pd.DataFrame:
    # 1 for Monday to Friday (weekday 0-4), 0 for Saturday and Sunday
    X["is_working_day"] = (X.index.weekday < 5).astype(int)
    return X

In [None]:
preprocess_pipeline.steps.extend([
        ('week_day', PdSkTransformer(FunctionTransformer(func=weekday_encoding))),
        ('week', PdAddFourierPairs(frequency=1 / (7 * 24 * 3600), feature_prefix="week")),
        ('1/2week', PdAddFourierPairs(frequency=1 / (3.5 * 24 * 3600), feature_prefix="1/2week")),
        ('day', PdAddFourierPairs(frequency=1 / (1 * 24 * 3600), feature_prefix="day")),
        ('6h', PdAddFourierPairs(frequency=1 / (6 * 3600), feature_prefix="6h")),
        ('3h', PdAddFourierPairs(frequency=1 / (3 * 3600), feature_prefix="3h"))
])

In a Jupyter notebook, you can easily get a graphical representation of your pipeline.

In [None]:
preprocess_pipeline

Now let's transform our training and validation samples.

In [None]:
train_df = preprocess_pipeline.fit_transform(train_raw_df)
valid_df = preprocess_pipeline.transform(valid_raw_df)

You can use the correlation matrix to see how well the new features explain the electric power rate we are trying to predict.
- The "is_people" feature seems to bring a lot of information
- The 8.67E-11 frequencies are not very useful.
 

In [None]:
abs(train_df.corr())
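
To read such a matrix quickly, it can help to keep only the column of the target and sort it. The sketch below uses a made-up DataFrame with hypothetical column names; on the real data the target column would be the electricity feature.

```python
import pandas as pd

# Hypothetical features: one perfectly linear in the target, one alternating
frame = pd.DataFrame(
    {
        "target": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
        "scaled": [0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0],
        "alternating": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
    }
)

# Absolute Pearson correlation of every feature with the target, strongest first
ranking = frame.corr()["target"].drop("target").abs().sort_values(ascending=False)
print(ranking)
```

Keep in mind that Pearson correlation only captures linear relationships, so a low score does not prove that a feature is useless to a non-linear model.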

# Time Series Sampling

Our objective is to predict the future. We would like to know the energy rate absorbed by the equipment over the next 6 h, based on the previous timesteps.
To know how many timesteps we need to look back, we need to know how the time series is autocorrelated.

We use <code>plot_pacf</code> from <code>statsmodels</code> to visualize it.

In [None]:
from statsmodels.graphics.tsaplots import plot_pacf
plot_pacf(train_df["Electricity_index_kWh"])
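
As a complement, the plain (not partial) autocorrelation at a given lag is easy to compute by hand. The sketch below uses a synthetic daily sine rather than the real data, and the helper <code>autocorr</code> is hypothetical. Note that it is a simpler quantity than the partial autocorrelation plotted by <code>plot_pacf</code>, which removes the contribution of the shorter lags.

```python
import numpy as np


def autocorr(x, lag):
    """Autocorrelation of x at a positive lag, normalised by the lag-0 value."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))


# Synthetic signal: 10 days of a pure daily cycle at a 15 min timestep
steps_per_day = 96
t = np.arange(10 * steps_per_day)
daily_cycle = np.sin(2 * np.pi * t / steps_per_day)

one_day = autocorr(daily_cycle, steps_per_day)
half_day = autocorr(daily_cycle, steps_per_day // 2)
print(one_day, half_day)
```

The signal is strongly correlated with itself one day later and anti-correlated half a day later, which is the kind of structure that guides the choice of the look-back window.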


Based on these results, we will use the last 34 time steps (11h) to predict the next 6h.

To do this, we train machine learning models on short sequences of the time series.
- Features are the last 11h: energy rate, people, sine waves.
- The target is the next 6h of energy rate

For example, the timeseries :

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

with a <code>sequence_length = 5</code> and <code>sampling_rate = 1</code>  will become:

[0, 1, 2, 3, 4]
[1, 2, 3, 4, 5]
[2, 3, 4, 5, 6]
...
[6, 7, 8, 9, 10]

For a 2D time series (columns are features, index is time step), the output will be a 3D numpy array of shape [_batch size, time steps, dimensionality_], where _dimensionality_ is the number of features (Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow).

The sampling of the time series is done using the function <code>time_series_sampling</code>
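
The windowing itself can be sketched with NumPy's <code>sliding_window_view</code>. The helper <code>make_windows</code> below is hypothetical and only mimics the toy example above; it does not reproduce the other options of <code>time_series_sampling</code>, such as shuffling.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(values, sequence_length, stride=1):
    """Cut a (time, features) array into (batch, time, features) windows."""
    values = np.asarray(values)
    if values.ndim == 1:
        values = values[:, None]  # treat a 1D series as a single feature
    # sliding_window_view appends the window axis last: (batch, features, time)
    windows = sliding_window_view(values, sequence_length, axis=0)
    # Reorder to (batch, time, features) and keep one window every `stride`
    return np.moveaxis(windows, -1, 1)[::stride].copy()


toy_windows = make_windows(np.arange(11), sequence_length=5)
print(toy_windows.shape)
print(toy_windows[:, :, 0])
```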

We will use <code>shuffle=True</code> to make sure sequences ... are shuffled, so when we split the dataset between training and validation we get sequences spread across the year, and not a single season

In [None]:
from corrai.learning.model_selection import time_series_sampling

In [None]:
train_np = train_df.to_numpy()
train_np = train_np.astype(np.float32)

n_step_history = 10 * 4  # 10 h of history at a 15 min timestep
n_step_future = 6 * 4  # 6 h to predict

train_sequences = time_series_sampling(
    train_np,
    sequence_length=n_step_history + n_step_future,
    sampling_rate=1,
    sequence_stride=4,
    shuffle=False,
    seed=42
)

In [None]:
dummy = MeasuredDats(train_df)
dummy.plot()

In [None]:
valid_np = valid_df.to_numpy()
valid_np = valid_np.astype(np.float32)

valid_sequences = time_series_sampling(
    valid_np,
    sequence_length=n_step_history + n_step_future,
    sampling_rate=1,
    sequence_stride=1,
    shuffle=False,
    seed=42
)

In [None]:
dummy = MeasuredDats(valid_df)

In [None]:
dummy.plot()

In [None]:
train_sequences.shape

The shape of the training array is [20079, 50, 12], corresponding to the number of sequences, the number of time steps and the number of features.

The number of sequences is high, but we used <code>sampling_rate=1</code>, meaning that there will be a lot of overlap between the sequences. This may lead to model overfitting, or artificially improve the evaluation metrics, as part of the training sample will also be present in the validation sample.

That is why the final performance will be evaluated on the test sample.
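
The amount of overlap can be quantified with two small formulas. The helpers below are hypothetical and assume the usual sliding-window convention (one window every <code>stride</code> rows, no padding); the exact count returned by <code>time_series_sampling</code> may differ.

```python
def count_sequences(n_rows, sequence_length, stride=1):
    """Number of full windows that fit in a series of n_rows time steps."""
    if n_rows < sequence_length:
        return 0
    return (n_rows - sequence_length) // stride + 1


def shared_fraction(sequence_length, stride=1):
    """Fraction of time steps that two consecutive windows have in common."""
    return max(sequence_length - stride, 0) / sequence_length


# Toy example from the text: 11 values, windows of 5
print(count_sequences(11, 5), shared_fraction(5))
# A larger stride gives fewer, less redundant windows
print(count_sequences(11, 5, stride=2), shared_fraction(5, stride=2))
```

With a stride of one, consecutive windows differ by a single time step, so neighbouring sequences carry almost the same information.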

It is now time to isolate the target from the sequences. 
Using the previous example :
[0, 1, 2, 3, 4] 
[1, 2, 3, 4, 5]
[2, 3, 4, 5, 6]
 … 
[6, 7, 8, 9, 10]

With a history of <code>n_step_history=3</code> and <code>n_step_future=2</code> we get
X = [0, 1, 2] 
    [1, 2, 3]
    [2, 3, 4]
     … 
    [6, 7, 8]
    
Y = [3, 4] 
    [4, 5]
    [5, 6]
     … 
    [9, 10]

We also want to split X and Y into two pairs of samples: (X_train, y_train) and (X_valid, y_valid).
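
The toy example can be checked with a few lines of NumPy. This is a sketch on a single-feature series; the variable names are hypothetical, and with several features the same slicing is applied on the time axis while the target keeps only the column to predict.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

series = np.arange(11)
sequences = sliding_window_view(series, 5)  # one row per window of 5 values

n_step_history, n_step_future = 3, 2

# The first steps of each window are the inputs, the last ones the targets
X = sequences[:, :n_step_history]
Y = sequences[:, -n_step_future:]
print(X[0], Y[0])
print(X[-1], Y[-1])
```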

In [None]:
x_train, y_train = (
    train_sequences[:, :n_step_history, :],
    train_sequences[:, -n_step_future:, 0],
)

x_valid, y_valid = (
    valid_sequences[:, :n_step_history, :],
    valid_sequences[:, -n_step_future:, 0],
)

# Model training

In [None]:
from corrai.learning.time_series import TsDeepNN, DeepRNN
from corrai.metrics import last_time_step_rmse

In [None]:
res_metrics = {}

In [None]:
lstm_seq = DeepRNN(
    cells="LSTM",
    n_units=40,
    hidden_layers_size=1,
    reshape_sequence_to_sequence=True,
    metrics=[last_time_step_rmse],
    # optimizer=keras.optimizers.SGD(0.01),
    patience=200,
    max_epoch=20,
    # loss=smape,
)

lstm_seq.fit(x_train, y_train, x_valid, y_valid)
res_metrics["lstm_seq"] = lstm_seq.evaluate(x_valid, y_valid)

In [None]:
from corrai.learning.time_series import plot_sequence_forcast
plot_sequence_forcast(x_valid, y_valid, model=lstm_seq, batch_nb=350)

In [None]:
test = MeasuredDats(valid_df)

In [None]:
test.plot()

In [None]:
ts_linear = TsDeepNN()

In [None]:
ts_linear.fit(x_train, y_train, x_valid, y_valid)

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
print(f"R² = {ts_linear.score(x_valid, y_valid)} \n"
      f"MSE = {mean_squared_error(y_valid, ts_linear.predict(x_valid))}")

## Deep Neural Network

In [None]:
deep_rnn = TsDeepNN(hidden_layers_size=2)

In [None]:
deep_rnn.fit(x_train, y_train, x_valid, y_valid)

In [None]:
print(f"R² = {deep_rnn.score(x_valid, y_valid)} \n"
      f"MSE = {mean_squared_error(y_valid, deep_rnn.predict(x_valid))}")

In [None]:
plot_sequence_forcast(x_valid, y_valid, model=deep_rnn, batch_nb=31)

In [None]:
from corrai.metrics import last_time_step_rmse

In [None]:
rnn = DeepRNN(
    cells='LSTM',
    hidden_layers_size=1,
    reshape_sequence_to_sequence=True,
    n_units=40,
    metrics=[last_time_step_rmse],
)

In [None]:
rnn.fit(x_train, y_train, x_valid, y_valid)

In [None]:
rnn.evaluate(x_valid, y_valid)

In [None]:
plot_sequence_forcast(x_valid, y_valid, model=rnn, batch_nb=35)

In [None]:
from corrai.learning.time_series import SimplifiedWaveNet

In [None]:
wave_net = SimplifiedWaveNet(
    convolutional_layers=3,
    staked_groups=2,
    groups_filters=100,
    metrics=[last_time_step_rmse])
wave_net.fit(x_train, y_train, x_valid, y_valid)

In [None]:
import plotly.graph_objects as go

def plot_sequence_forcast(X, y_ref, model, X_target_index=0, batch_nb=0):
    """
    Plot the historical, reference, and predicted values for a given
    batch number in X using Plotly.

    Parameters:
    :param X: np.ndarray 3D array of shape [batch_size, time_step, dimension]
    Input sequences.
    :param y_ref: np.ndarray of shape [batch_size, time_step].
    :param model: keras.Model The trained model for making predictions.
    :param X_target_index: The index of 3rd dimension in X that contains
        historical values.
    :param batch_nb: int Batch number to visualize. Default is 0.
    """
    predictions = model.predict(X)
    predictions = predictions[batch_nb, :]
    y_ref_to_plot = y_ref[batch_nb, :]
    x_to_plot = X[batch_nb, :, X_target_index]

    # Create traces for historic, new, and predicted values
    historic_trace = go.Scatter(
        x=list(range(len(x_to_plot))),
        y=x_to_plot,
        mode='markers+lines',
        name='Historic Values'
    )

    new_trace = go.Scatter(
        x=list(range(len(x_to_plot), len(x_to_plot) + len(y_ref_to_plot))),
        y=y_ref_to_plot,
        mode='markers+lines',
        name='New Values'
    )

    predicted_trace = go.Scatter(
        x=list(range(len(x_to_plot), len(x_to_plot) + len(y_ref_to_plot))),
        y=predictions,
        mode='markers+lines',
        name='Predicted Values'
    )

    # Create layout
    layout = go.Layout(
        title='Actual and Predicted Values Over Time',
        xaxis=dict(title='Number of Timesteps'),
        yaxis=dict(title='x(t)'),
    )

    # Create figure
    fig = go.Figure(data=[historic_trace, new_trace, predicted_trace], layout=layout)

    # Show the plot
    fig.show()

In [None]:
plot_sequence_forcast(x_valid, y_valid, model=wave_net, batch_nb=20)

# Model test

In [None]:
test_transformed = preprocess_pipeline.transform(tests_raw_df)

In [None]:
test_transformed["Holidays"] = 0

In [None]:
test_transformed.shape

In [None]:
test_sequences = time_series_sampling(
    test_transformed,
    sequence_length=4 * (48 + 6),  # 48 h of history + 6 h in the future at a 15 min timestep
    sampling_rate=1,
    shuffle=False,
    seed=42  # Make sure the behaviour can be repeated in the notebook
)

In [None]:
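# Assumption: sequences_train_test_split is exposed by the same module as time_series_sampling
from corrai.learning.model_selection import sequences_train_test_split

x_test, x_empty, y_test, y_empty = sequences_train_test_split(
    test_sequences,
    targets_index=0,
    n_steps_history=48 * 4,
    n_steps_future=6 * 4,
    train_size=0.1,
    shuffle=False
)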

In [None]:
x_test.shape

In [None]:
wave_net.evaluate(x_test, y_test)

In [None]:
plot_sequence_forcast(x_test, y_test, model=deep_rnn, batch_nb=10)