This article is a step-by-step guide to building dynamic features with `MLForecast`.

During this walkthrough, we will become familiar with the main `MLForecast` class and some of its methods, such as `MLForecast.fit`, `MLForecast.predict`, and `MLForecast.cross_validation`.

Let's start!

<a class="anchor" id="0.1"></a>



1.	[Introduction](#1)
2.	[Dynamic Features](#2)
3.	[Installing Mlforecast](#3)
4.	[Loading libraries and data](#4)
5.	[Explore Data with the plot method](#5)
6.	[Modeling with MLForecast](#6)
7.  [Evaluate the model’s performance](#7)
8.  [Evaluate the model](#8)
9.  [References](#9)

# **3. Installing Mlforecast** <a class="anchor" id="3"></a>

[Table of Contents](#0.1)

* using pip:

    - `pip install mlforecast`

* Specific version

    If you want a specific version you can include a filter, for example:

    - `pip install "mlforecast==0.3.0"` to install the 0.3.0 version
    - `pip install "mlforecast<0.4.0"` to install any version prior to 0.4.0

* using conda:

    - `conda install -c conda-forge mlforecast`

* Specific version

    If you want a specific version you can include a filter, for example: 

    - `conda install -c conda-forge "mlforecast==0.3.0"` to install the 0.3.0 version
    - `conda install -c conda-forge "mlforecast<0.4.0"` to install any version prior to 0.4.0
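After installing, it can help to confirm which version is active in the current environment, since some method arguments shown later in this article were renamed between releases. A minimal sketch using only the standard library (the helper name `installed_version` is hypothetical, not part of `mlforecast`):

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version found in the active environment, or None if not installed
print("mlforecast:", installed_version("mlforecast"))
```

Running this in the same environment as the notebook avoids surprises when several Python installations coexist.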

# **4. Loading libraries and data** <a class="anchor" id="4"></a>

[Table of Contents](#0.1)

In [1]:
# Handling and processing of Data
# ==============================================================================
import numpy as np
import pandas as pd

import scipy.stats as stats

# Handling and processing of Data for Date (time)
# ==============================================================================
import time
from datetime import datetime, timedelta

# Statistical tools
# ==============================================================================
from statsmodels.tsa.stattools import adfuller
import statsmodels.api as sm
import statsmodels.tsa.api as smt
from statsmodels.tsa.seasonal import seasonal_decompose
# Plotting utilities
# ==============================================================================
from utilsforecast.plotting import plot_series

In [2]:
# Mlforecast resources
# ==============================================================================
from numba import njit
from window_ops.expanding import expanding_mean
from window_ops.rolling import rolling_mean
from window_ops.ewm import ewm_mean
from mlforecast.target_transforms import Differences

from mlforecast.utils import PredictionIntervals
from mlforecast.utils import generate_daily_series, generate_prices_for_series

from mlforecast import MLForecast

# Machine learning models
# ==============================================================================
import xgboost as xgb
import lightgbm as lgb

In [3]:
# Plot
# ==============================================================================
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
plt.style.use('grayscale') # fivethirtyeight  grayscale  classic
plt.rcParams['lines.linewidth'] = 1.5
dark_style = {
    'figure.facecolor': '#008080',  # #212946
    'axes.facecolor': '#008080',
    'savefig.facecolor': '#008080',
    'axes.grid': True,
    'axes.grid.which': 'both',
    'axes.spines.left': False,
    'axes.spines.right': False,
    'axes.spines.top': False,
    'axes.spines.bottom': False,
    'grid.color': '#000000',  #2A3459
    'grid.linewidth': '1',
    'text.color': '0.9',
    'axes.labelcolor': '0.9',
    'xtick.color': '0.9',
    'ytick.color': '0.9',
    'font.size': 12 }
plt.rcParams.update(dark_style)
# Define the plot size
# ==============================================================================

plt.rcParams['figure.figsize'] = (18,7)

# Hide warnings
# ==============================================================================
import warnings
warnings.filterwarnings("ignore")

We will use a synthetic dataset to demonstrate how to work with external regressors.

We generate it with `generate_daily_series`. If `n_static_features > 0`, each series gets static features with random values. If `equal_ends == True`, all series end on the same date.

In [4]:
series = generate_daily_series(100, equal_ends=True, n_static_features=2, static_as_categorical=False)
series

Unnamed: 0,unique_id,ds,y,static_0,static_1
0,id_00,2000-10-05,3.981198,79,45
1,id_00,2000-10-06,10.327401,79,45
2,id_00,2000-10-07,17.657474,79,45
3,id_00,2000-10-08,25.898790,79,45
4,id_00,2000-10-09,34.494040,79,45
...,...,...,...,...,...
26998,id_99,2001-05-10,45.340051,69,35
26999,id_99,2001-05-11,3.022948,69,35
27000,id_99,2001-05-12,10.131371,69,35
27001,id_99,2001-05-13,14.572434,69,35


In [5]:
series.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27003 entries, 0 to 27002
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   unique_id  27003 non-null  category      
 1   ds         27003 non-null  datetime64[ns]
 2   y          27003 non-null  float64       
 3   static_0   27003 non-null  int64         
 4   static_1   27003 non-null  int64         
dtypes: category(1), datetime64[ns](1), float64(1), int64(2)
memory usage: 875.2 KB


We have generated a dataset of 100 series, as seen in the previous `DataFrame`. Each series is identified by its `unique_id` and has its own behavior.

The goal is to have a simulated dataset of series with different behaviors, together with exogenous variables that we will use when forecasting.

As the previous `DataFrame` shows, the required columns are the series identifier, the time, and the target. Any extra columns, like `static_0` and `static_1` here, are considered static and are replicated when constructing the features for the next timestamp. You can change this by passing `static_features` to `MLForecast.preprocess` or `MLForecast.fit`, which keeps only the columns you list there as static. Keep in mind that the remaining columns are still used for training, so you will have to provide their future values to `MLForecast.predict` through the `X_df` argument.

By default the predict method repeats the static features and updates the transformations and the date features. If you have dynamic features like prices or a calendar with holidays you can pass them as a dataframe to the `X_df` argument of `MLForecast.predict`, which will call `pd.DataFrame.merge` on it in each `timestep`.
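To make the distinction between static and dynamic features concrete, the following pandas sketch classifies the extra columns of a long-format frame by checking whether they stay constant within each series. The helper `split_static_dynamic` and the toy data are hypothetical and only illustrate the idea. `MLForecast` itself relies on the `static_features` argument rather than on a check like this.

```python
import pandas as pd


def split_static_dynamic(df, id_col="unique_id", time_col="ds", target_col="y"):
    """Classify extra columns as static (constant within each series) or dynamic."""
    extra = [c for c in df.columns if c not in (id_col, time_col, target_col)]
    # Number of distinct values each extra column takes inside every series
    n_unique = df.groupby(id_col, observed=True)[extra].nunique()
    static = [c for c in extra if (n_unique[c] <= 1).all()]
    dynamic = [c for c in extra if c not in static]
    return static, dynamic


toy = pd.DataFrame({
    "unique_id": ["a", "a", "b", "b"],
    "ds": pd.to_datetime(["2000-01-01", "2000-01-02"] * 2),
    "y": [1.0, 2.0, 3.0, 4.0],
    "static_0": [79, 79, 69, 69],   # constant per series
    "price": [0.5, 0.7, 0.6, 0.6],  # changes over time for series "a"
})
static_cols, dynamic_cols = split_static_dynamic(toy)
print(static_cols, dynamic_cols)
```

Columns reported as dynamic are the ones whose future values need to be supplied at prediction time.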

# **5. Explore Data with the plot method** <a class="anchor" id="5"></a>

[Table of Contents](#0.1)

Let's use the `plot_series` function from `utilsforecast` to visualize the series that we built previously. It plots 8 random series from the dataset and is useful for basic EDA.

In [63]:
fig = plot_series(series)
fig.savefig('../figs/Dynamic_features__eda.png')

![](../figs/Dynamic_features__eda.png)

## **5.2 Autocorrelation plots**

In [64]:
fig, axs = plt.subplots(nrows=1, ncols=2)

plot_acf(series["y"],  lags=30, ax=axs[0],color="fuchsia")
axs[0].set_title("Autocorrelation");

# Plot
plot_pacf(series["y"],  lags=30, ax=axs[1],color="lime")
axs[1].set_title('Partial Autocorrelation')
plt.savefig("../figs/Dynamic_features__autocorrelation.png")
plt.close();

![](../figs/Dynamic_features__autocorrelation.png)

In [8]:
dynamic_series = series.rename(columns={'static_1': 'product_id'})
dynamic_series

Unnamed: 0,unique_id,ds,y,static_0,product_id
0,id_00,2000-10-05,3.981198,79,45
1,id_00,2000-10-06,10.327401,79,45
2,id_00,2000-10-07,17.657474,79,45
3,id_00,2000-10-08,25.898790,79,45
4,id_00,2000-10-09,34.494040,79,45
...,...,...,...,...,...
26998,id_99,2001-05-10,45.340051,69,35
26999,id_99,2001-05-11,3.022948,69,35
27000,id_99,2001-05-12,10.131371,69,35
27001,id_99,2001-05-13,14.572434,69,35


The generated `DataFrame` contains two static features, `static_0` and `static_1`. We have renamed `static_1` to `product_id`. Next we use the `generate_prices_for_series()` function to generate a dynamic variable called `price`, with one value per series and date. Note that the resulting catalog also covers dates after the end of the series, which we will need when forecasting.

In [9]:
prices_catalog = generate_prices_for_series(dynamic_series)
prices_catalog

Unnamed: 0,ds,unique_id,price
0,2000-10-05,id_00,0.548814
1,2000-10-06,id_00,0.715189
2,2000-10-07,id_00,0.602763
3,2000-10-08,id_00,0.544883
4,2000-10-09,id_00,0.423655
...,...,...,...
27698,2001-05-17,id_99,0.682296
27699,2001-05-18,id_99,0.123657
27700,2001-05-19,id_99,0.068762
27701,2001-05-20,id_99,0.324157


Now that we have the dynamic variable `price`, let's join the two `DataFrame`s with the `merge()` function. The final dataset contains the target variable, the static variables `static_0` and `product_id`, and the dynamic variable `price`, as shown below.

In [10]:
series_with_prices = dynamic_series.merge(prices_catalog, how='left')
series_with_prices

Unnamed: 0,unique_id,ds,y,static_0,product_id,price
0,id_00,2000-10-05,3.981198,79,45,0.548814
1,id_00,2000-10-06,10.327401,79,45,0.715189
2,id_00,2000-10-07,17.657474,79,45,0.602763
3,id_00,2000-10-08,25.898790,79,45,0.544883
4,id_00,2000-10-09,34.494040,79,45,0.423655
...,...,...,...,...,...,...
26998,id_99,2001-05-10,45.340051,69,35,0.112841
26999,id_99,2001-05-11,3.022948,69,35,0.883449
27000,id_99,2001-05-12,10.131371,69,35,0.762250
27001,id_99,2001-05-13,14.572434,69,35,0.025932
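A left merge silently produces missing prices if the catalog lacks some (series, date) pair, and duplicates rows if the catalog repeats one. The sketch below, on hypothetical toy frames, shows how the `validate` and `indicator` options of `pandas.DataFrame.merge` can guard against both cases. It is an optional safety check, not something the original workflow requires.

```python
import pandas as pd

left = pd.DataFrame({
    "unique_id": ["a", "a", "b"],
    "ds": pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-01"]),
    "y": [1.0, 2.0, 3.0],
})
catalog = pd.DataFrame({
    "unique_id": ["a", "a", "b", "b"],
    "ds": pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-01", "2000-01-02"]),
    "price": [0.5, 0.7, 0.6, 0.9],
})

merged = left.merge(
    catalog,
    on=["unique_id", "ds"],  # explicit keys instead of relying on shared column names
    how="left",
    validate="many_to_one",  # raises if the catalog has duplicate (id, date) pairs
    indicator=True,          # adds a `_merge` column telling where each row came from
)
# Rows of `left` that found no price in the catalog
unmatched = int((merged["_merge"] == "left_only").sum())
merged = merged.drop(columns="_merge")
print(len(merged), unmatched)
```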


# **6. Modeling with MLForecast** <a class="anchor" id="6"></a>

[Table of Contents](#0.1)

## **6.1 Building Model**

We define the model that we want to use. For our example we are going to use the `LGBMRegressor` model.

In [11]:
model = [lgb.LGBMRegressor(n_jobs=1, random_state=0, verbosity=-1, num_leaves= 512)]

We fit the models by instantiating a new `MLForecast` object with the following parameters:

* `models:` a list of models. Import the models you want to use and pass their instances here.

* `freq:` a string indicating the frequency of the data. (See [pandas' available frequencies](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases).)

* `lags:` Lags of the target to use as features.

* `lag_transforms:` Mapping of target lags to their transformations.

* `date_features:` Features computed from the dates. Can be `pandas` date attributes or functions that will take the dates as input.

* `differences:` Differences to take of the target before computing the features. These are restored at the forecasting step.

* `num_threads:` Number of threads to use when computing the features.

* `target_transforms:` Transformations that will be applied to the target before computing the features and restored after the forecasting step.

All settings are passed into the constructor. Then you call its `fit` method and pass in the historical data frame.

In [12]:
mlf = MLForecast(models=model,
                 freq='D', 
                 )

## **6.2 Adding features**

### Lags
We can now try adding some lag features. Here we use the value of the target 7 days earlier.

In [13]:
mlf = MLForecast(models=model,
                 freq='D',  
                 lags=[7],
                 ) 

We can use the `MLForecast.preprocess` method to explore different transformations.

In [14]:
prep = mlf.preprocess(series_with_prices)
prep

Unnamed: 0,unique_id,ds,y,static_0,product_id,price,lag7
7,id_00,2000-10-12,1.268807,79,45,0.891773,3.981198
8,id_00,2000-10-13,11.113382,79,45,0.963663,10.327401
9,id_00,2000-10-14,19.798284,79,45,0.383442,17.657474
10,id_00,2000-10-15,26.650107,79,45,0.791725,25.898790
11,id_00,2000-10-16,32.054287,79,45,0.528895,34.494040
...,...,...,...,...,...,...,...
26998,id_99,2001-05-10,45.340051,69,35,0.112841,42.025252
26999,id_99,2001-05-11,3.022948,69,35,0.883449,3.340083
27000,id_99,2001-05-12,10.131371,69,35,0.762250,8.323020
27001,id_99,2001-05-13,14.572434,69,35,0.025932,17.469622


In [15]:
prep.drop(columns=['unique_id', 'ds']).corr()['y']

y             1.000000
static_0      0.598299
product_id    0.106384
price        -0.002330
lag7          0.996702
Name: y, dtype: float64
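Conceptually, a lag feature is the target shifted within each series. The following pandas sketch on hypothetical toy data reproduces the idea behind `lag7`. It is an illustration of the concept, not the internal implementation of `MLForecast`. It also shows why the preprocessed output above starts at row 7 for `id_00`, with a `lag7` value equal to the first observed `y`.

```python
import pandas as pd

toy = pd.DataFrame({
    "unique_id": ["a"] * 10 + ["b"] * 10,
    "ds": list(pd.date_range("2000-01-01", periods=10, freq="D")) * 2,
    "y": [float(i) for i in range(10)] + [float(100 + i) for i in range(10)],
})

# Shift the target 7 steps within each series, never across series
toy["lag7"] = toy.groupby("unique_id")["y"].shift(7)

# Rows without 7 previous observations have no lag value and are dropped
prep_like = toy.dropna(subset=["lag7"])
print(prep_like)
```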

## **6.3 Lag transforms**
Lag transforms are defined as a dictionary where the keys are the lags and the values are lists of functions that transform an array. These must be [numba](http://numba.pydata.org) jitted functions (so that computing the features doesn’t become a bottleneck). There are some implemented in the [window-ops package](https://github.com/jmoralez/window_ops) but you can also implement your own.

If the function takes two or more arguments you can either:

* supply a tuple (tfm_func, arg1, arg2, …)
* define a new function fixing the arguments

In [16]:
from numba import njit
from window_ops.expanding import expanding_mean
from window_ops.rolling import rolling_mean

In [17]:
mlf = MLForecast(models=model,
                 freq='D',  
                 lags=[7],
                 lag_transforms={1: [expanding_mean],
                                 7: [(rolling_mean, 14)] },
                 ) 

In [18]:
prep = mlf.preprocess(series_with_prices)
prep

Unnamed: 0,unique_id,ds,y,static_0,product_id,price,lag7,expanding_mean_lag1,rolling_mean_lag7_window_size14
20,id_00,2000-10-25,49.766844,79,45,0.978618,50.694639,25.001367,26.320060
21,id_00,2000-10-26,3.918347,79,45,0.799159,3.887780,26.180675,26.313387
22,id_00,2000-10-27,9.437778,79,45,0.461479,11.512774,25.168751,26.398056
23,id_00,2000-10-28,17.923574,79,45,0.780529,18.038498,24.484796,26.425272
24,id_00,2000-10-29,26.754645,79,45,0.118274,24.222859,24.211411,26.305563
...,...,...,...,...,...,...,...,...,...
26998,id_99,2001-05-10,45.340051,69,35,0.112841,42.025252,22.753681,23.181411
26999,id_99,2001-05-11,3.022948,69,35,0.883449,3.340083,22.881287,23.217545
27000,id_99,2001-05-12,10.131371,69,35,0.762250,8.323020,22.769724,23.163272
27001,id_99,2001-05-13,14.572434,69,35,0.025932,17.469622,22.699118,23.397359
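The names of the new columns describe what is computed: an expanding mean over the lag-1 series and a 14-step rolling mean over the lag-7 series. The sketch below computes comparable features for a single toy series in plain pandas. It assumes the `window_ops` functions require a full window before producing a value, which is consistent with the output above starting at row 20 (7 lagged steps plus 13 more to fill a window of 14).

```python
import numpy as np
import pandas as pd

y = pd.Series(np.arange(30, dtype=float))  # one toy series: 0, 1, ..., 29

features = pd.DataFrame({
    "lag7": y.shift(7),
    # Mean of every value observed up to the previous step
    "expanding_mean_lag1": y.shift(1).expanding().mean(),
    # Mean of the 14 most recent values of the lag-7 series
    "rolling_mean_lag7_window_size14": y.shift(7).rolling(14).mean(),
})

# First row where all three features are available
first_complete = features.dropna().index[0]
print(first_complete)  # 20
```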


## **6.4 Date features**
If your time column is made of timestamps then it might make sense to extract features like week, dayofweek, quarter, etc. You can do that by passing a list of strings with pandas time/date components. You can also pass functions that will take the time column as input, as we’ll show here.

In [19]:
def even_day(dates):
    return dates.day % 2 == 0

In [20]:
mlf = MLForecast(
    models=model,
    freq='D',
    lags=[7],
    lag_transforms={
        1: [expanding_mean],
        7: [(rolling_mean, 14)]
    },
    date_features=['dayofweek', 'month', even_day],
    num_threads=2,
)


In [21]:
prep = mlf.preprocess(series_with_prices)
prep

Unnamed: 0,unique_id,ds,y,static_0,product_id,price,lag7,expanding_mean_lag1,rolling_mean_lag7_window_size14,dayofweek,month,even_day
20,id_00,2000-10-25,49.766844,79,45,0.978618,50.694639,25.001367,26.320060,2,10,False
21,id_00,2000-10-26,3.918347,79,45,0.799159,3.887780,26.180675,26.313387,3,10,True
22,id_00,2000-10-27,9.437778,79,45,0.461479,11.512774,25.168751,26.398056,4,10,False
23,id_00,2000-10-28,17.923574,79,45,0.780529,18.038498,24.484796,26.425272,5,10,True
24,id_00,2000-10-29,26.754645,79,45,0.118274,24.222859,24.211411,26.305563,6,10,False
...,...,...,...,...,...,...,...,...,...,...,...,...
26998,id_99,2001-05-10,45.340051,69,35,0.112841,42.025252,22.753681,23.181411,3,5,True
26999,id_99,2001-05-11,3.022948,69,35,0.883449,3.340083,22.881287,23.217545,4,5,False
27000,id_99,2001-05-12,10.131371,69,35,0.762250,8.323020,22.769724,23.163272,5,5,True
27001,id_99,2001-05-13,14.572434,69,35,0.025932,17.469622,22.699118,23.397359,6,5,False
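To see where these columns come from, the following sketch computes the same three date features directly on a pandas `DatetimeIndex` for the first dates of the output above. String features map to pandas date attributes, while `even_day` is the custom function defined earlier.

```python
import pandas as pd


def even_day(dates):
    """True when the day of the month is even."""
    return dates.day % 2 == 0


dates = pd.DatetimeIndex(["2000-10-25", "2000-10-26", "2000-10-27"])

date_features = pd.DataFrame({
    "ds": dates,
    "dayofweek": dates.dayofweek,  # Monday=0, ..., Sunday=6
    "month": dates.month,
    "even_day": even_day(dates),
})
print(date_features)
```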


## **6.5 Fit the Model**

The `fit` method uses the following parameters:

|Parameters |Type |Default |Details|
|-----------|-------|-----------|-------|
|df|DataFrame| |Series data in long format.|
|id_col|str|unique_id|Column that identifies each series.|
|time_col|str|ds|Column that identifies each timestep, its values can be timestamps or integers.|
|target_col|str|y|Column that contains the target.|
|static_features|typing.Optional[typing.List[str]]|None|Names of the features that are static and will be repeated when forecasting. If None, will consider all columns (except id_col and time_col) as static.|
|dropna|bool|True|Drop rows with missing values produced by the transformations.|
|keep_last_n|typing.Optional[int]|None|Keep only these many records from each series for the forecasting step. Can save time and memory if your features allow it.|
|max_horizon|typing.Optional[int]|None| |
|prediction_intervals|typing.Optional[mlforecast.utils.PredictionIntervals]|None|Configuration to calibrate prediction intervals (Conformal Prediction).|
|fitted|bool|False|Save in-sample predictions.|
|data|typing.Optional[pandas.core.frame.DataFrame]|None|Series data in long format. This argument has been replaced by df and will be removed in a later release.|

In [22]:
mlf.fit(series_with_prices, static_features=['static_0', 'product_id'], fitted=True)

MLForecast(models=[LGBMRegressor], freq=<Day>, lag_features=['lag7', 'expanding_mean_lag1', 'rolling_mean_lag7_window_size14'], date_features=['dayofweek', 'month', <function even_day at 0x16732b1c0>], num_threads=2)

The features used for training are stored in `MLForecast.ts.features_order_`. As you can see, `price` was used for training.

In [23]:
mlf.ts.features_order_

['static_0',
 'product_id',
 'price',
 'lag7',
 'expanding_mean_lag1',
 'rolling_mean_lag7_window_size14',
 'dayofweek',
 'month',
 'even_day']

Let us now look at the fitted values of our `LGBMRegressor` model, which are available because we passed `fitted=True` to `fit`.

In [24]:
result=mlf.forecast_fitted_values()
result=result.set_index("unique_id")
result

Unnamed: 0_level_0,ds,y,LGBMRegressor
unique_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
id_00,2000-10-25,49.766844,49.248731
id_00,2000-10-26,3.918347,3.289390
id_00,2000-10-27,9.437778,10.049331
id_00,2000-10-28,17.923574,17.851305
id_00,2000-10-29,26.754645,26.529612
...,...,...,...
id_99,2001-05-10,45.340051,44.765031
id_99,2001-05-11,3.022948,2.297311
id_99,2001-05-12,10.131371,9.405745
id_99,2001-05-13,14.572434,14.803452


In [25]:
from statsmodels.stats.diagnostic import normal_ad
from scipy import stats

# Residuals: actual values minus the fitted (in-sample) values
residuals = result["y"] - result["LGBMRegressor"]

sw_result = stats.shapiro(residuals)
ad_result = normal_ad(np.array(residuals), axis=0)
dag_result = stats.normaltest(residuals, axis=0, nan_policy='propagate')

It's important to note that methods which assume normally distributed errors are only valid if the residuals of our model are approximately normally distributed. To see if this is the case, we will use a Q-Q plot and test normality with the Shapiro-Wilk, Anderson-Darling, and D’Agostino K^2 tests.

The Q-Q plot (quantile-quantile) plots the quantiles of the data sample against those of the normal distribution, so that if the data are normally distributed the points form a straight line.

The three normality tests determine how likely a data sample is from a normally distributed population using p-values. The null hypothesis for each test is that "the sample came from a normally distributed population". This means that if the resulting p-values are below a chosen alpha value, then the null hypothesis is rejected. Thus there is evidence to suggest that the data comes from a non-normal distribution. For this article, we will use an alpha value of 0.01.

In [65]:
result=mlf.forecast_fitted_values()
residuals = result["y"] - result["LGBMRegressor"]
fig, axs = plt.subplots(nrows=2, ncols=2)

# Residuals in row order
residuals.plot(ax=axs[0,0])
axs[0,0].set_title("Residuals model");

# Histogram of the residuals
axs[0,1].hist(residuals, density=True,bins=50, alpha=0.5 )
axs[0,1].set_title("Density plot - Residual");

# Q-Q plot annotated with the p-values of the three normality tests
stats.probplot(residuals, dist="norm", plot=axs[1,0])
axs[1,0].set_title('Plot Q-Q')
axs[1,0].annotate("SW p-val: {:.4f}".format(sw_result[1]), xy=(0.05,0.9), xycoords='axes fraction', fontsize=15,
            bbox=dict(boxstyle="round", fc="none", ec="gray", pad=0.6))

axs[1,0].annotate("AD p-val: {:.4f}".format(ad_result[1]), xy=(0.05,0.8), xycoords='axes fraction', fontsize=15,
            bbox=dict(boxstyle="round", fc="none", ec="gray", pad=0.6))

axs[1,0].annotate("DAG p-val: {:.4f}".format(dag_result[1]), xy=(0.05,0.7), xycoords='axes fraction', fontsize=15,
            bbox=dict(boxstyle="round", fc="none", ec="gray", pad=0.6))
# Autocorrelation of the residuals
plot_acf(residuals,  lags=35, ax=axs[1,1],color="fuchsia")
axs[1,1].set_title("Autocorrelation");

plt.savefig("../figs/Dynamic_features__plot_residual_model.png")
plt.close();

![](../figs/Dynamic_features__plot_residual_model.png)
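The decision rule for the normality tests (reject the null hypothesis when the p-value falls below alpha) can be sketched on synthetic data. The sample below is deliberately skewed and is not taken from the model in this article. The helper name `rejects_normality` is hypothetical, and only two of the tests are used to keep the example short.

```python
import numpy as np
from scipy import stats

ALPHA = 0.01  # significance level used in this article


def rejects_normality(sample, alpha=ALPHA):
    """Run two normality tests and report whether each rejects the null hypothesis."""
    sw_p = stats.shapiro(sample)[1]      # Shapiro-Wilk p-value
    dag_p = stats.normaltest(sample)[1]  # D'Agostino K^2 p-value
    return {"shapiro": bool(sw_p < alpha), "dagostino": bool(dag_p < alpha)}


rng = np.random.default_rng(0)
skewed = rng.exponential(size=2000)  # clearly non-normal synthetic sample
print(rejects_normality(skewed))
```

A `True` entry means the corresponding test found evidence against normality at the chosen alpha.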

## **6.6 Predict method with prediction intervals**

To generate forecasts, use the `predict` method, which takes the following arguments:

|Parameters |Type |Default |Details|
|-----------|-------|-----------|-------|
|h|int| |Number of periods to predict.|
|dynamic_dfs|typing.Optional[typing.List[pandas.core.frame.DataFrame]]|None|Future values of the dynamic features, e.g. prices.|
|before_predict_callback|typing.Optional[typing.Callable]|None|Function to call on the features before computing the predictions. This function will take the input dataframe that will be passed to the model for predicting and should return a dataframe with the same structure. The series identifier is on the index.|
|after_predict_callback|typing.Optional[typing.Callable]|None|Function to call on the predictions before updating the targets. This function will take a pandas Series with the predictions and should return another one with the same structure. The series identifier is on the index.|
|new_df|typing.Optional[pandas.core.frame.DataFrame]|None|Series data of new observations for which forecasts are to be generated. This dataframe should have the same structure as the one used to fit the model, including any features and time series data. If new_df is not None, the method will generate forecasts for the new observations.|
|level|typing.Optional[typing.List[typing.Union[int, float]]]|None|Confidence levels between 0 and 100 for prediction intervals.|
|X_df|typing.Optional[pandas.core.frame.DataFrame]|None|Dataframe with the future exogenous features. Should have the id column and the time column.|
|ids|typing.Optional[typing.List[str]]|None|List with subset of ids seen during training for which the forecasts should be computed.|
|horizon|typing.Optional[int]|None|Number of periods to predict. This argument has been replaced by h and will be removed in a later release.|
|new_data|typing.Optional[pandas.core.frame.DataFrame]|None|Series data of new observations for which forecasts are to be generated. This dataframe should have the same structure as the one used to fit the model, including any features and time series data. If new_data is not None, the method will generate forecasts for the new observations.|



The forecast object here is a new data frame that includes a column with the name of the model and its predicted values (y hat). Columns for the uncertainty intervals are only added when prediction intervals are requested through `level`, which we do not do here.

This step should take less than 1 second.

In [27]:
preds = mlf.predict(7, X_df=prices_catalog)
preds

Unnamed: 0,unique_id,ds,LGBMRegressor
0,id_00,2001-05-15,42.054346
1,id_00,2001-05-16,50.192624
2,id_00,2001-05-17,1.591998
3,id_00,2001-05-18,10.127136
4,id_00,2001-05-19,17.250970
...,...,...,...
695,id_99,2001-05-17,43.797864
696,id_99,2001-05-18,2.505411
697,id_99,2001-05-19,8.385118
698,id_99,2001-05-20,15.525098
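Because `X_df` must contain the future values of the dynamic features for every series and every forecast date, it can be worth checking its coverage before calling `predict`. The sketch below is a hypothetical helper in plain pandas, not part of `mlforecast`. It assumes regularly spaced timestamps and lists the (id, date) pairs of the next `h` periods that are missing from the future frame.

```python
import pandas as pd


def missing_future_rows(train, X_df, h, freq="D", id_col="unique_id", time_col="ds"):
    """Return the (id, date) pairs in the next `h` periods that X_df does not cover."""
    last_dates = train.groupby(id_col, observed=True)[time_col].max()
    rows = []
    for uid, last in last_dates.items():
        # h consecutive timestamps that follow the last observed one
        for ts in pd.date_range(start=last, periods=h + 1, freq=freq)[1:]:
            rows.append((uid, ts))
    needed = pd.DataFrame(rows, columns=[id_col, time_col])
    check = needed.merge(
        X_df[[id_col, time_col]], on=[id_col, time_col], how="left", indicator=True
    )
    return check.loc[check["_merge"] == "left_only", [id_col, time_col]]


train = pd.DataFrame({
    "unique_id": ["a", "a", "b", "b"],
    "ds": pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-01", "2000-01-02"]),
    "y": [1.0, 2.0, 3.0, 4.0],
})
future = pd.DataFrame({
    "unique_id": ["a", "a", "b"],
    "ds": pd.to_datetime(["2000-01-03", "2000-01-04", "2000-01-03"]),
    "price": [0.1, 0.2, 0.3],
})
missing = missing_future_rows(train, future, h=2)
print(missing)  # series "b" has no price for 2000-01-04
```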


## **6.7 Plot prediction**



In [68]:
fig=plot_series(series_with_prices, preds,  max_insample_length=200,engine="matplotlib")
for ax in fig.get_axes():
    ax.set_title("Forecasting Dynamic Features")
fig.savefig('../figs/Dynamic_features__plot_forecasting.png')

![](../figs/Dynamic_features__plot_forecasting.png)

# **7. Evaluate the model’s performance** <a class="anchor" id="7"></a>

[Table of Contents](#0.1)

In previous steps, we used our historical data to predict the future. However, to assess its accuracy we would also like to know how the model would have performed in the past. To assess the accuracy and robustness of your models on your data, perform cross-validation.

With time series data, cross-validation is done by defining a sliding window across the historical data and predicting the period following it. This form of cross-validation allows us to arrive at a better estimate of our model’s predictive abilities across a wider range of temporal instances while also keeping the data in the training set contiguous, as is required by our models.

The following graph depicts such a Cross Validation Strategy:

![](https://raw.githubusercontent.com/Nixtla/statsforecast/main/nbs/imgs/ChainedWindows.gif)

## **7.1 Perform time series cross-validation**

In order to get an estimate of how well our model will perform when predicting future data we can perform cross-validation, which consists of training a few models independently on different subsets of the data, using them to predict a validation set and measuring their performance.

Since our data depends on time, we make our splits by removing the last portions of the series and using them as validation sets. This process is implemented in `MLForecast.cross_validation`.

Cross-validation of time series models is considered a best practice, but most implementations are very slow. `MLForecast` implements cross-validation as a distributed operation, making the process less time-consuming. If you have big datasets you can also perform cross-validation in a distributed cluster using Ray, Dask, or Spark.

Depending on your computer, this step should take around 1 min.

The `cross_validation` method from the `MLForecast` class takes the following arguments.

* `df:` training data frame

* `h (int):` the number of steps into the future that are forecast in each window. In the call below the validation length is 5 days, set through `window_size`.

* `step_size (int):` step size between each window. In other words: how often do you want to run the forecasting processes.

* `n_windows(int):` number of windows used for cross validation. In other words: what number of forecasting processes in the past do you want to evaluate.
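How these arguments determine the validation windows can be sketched with a small helper that computes the cutoff (last training date) of each window. The function `cv_cutoffs` is hypothetical and only handles daily data. It assumes that the step size equals the horizon when it is not given, and that the series end on 2001-05-14, the day before the first forecast date in section 6.6. Under those assumptions it reproduces the `cutoff` values that appear in the cross-validation output of this section.

```python
import pandas as pd


def cv_cutoffs(last_date, n_windows, h, step_size=None):
    """Last training timestamp of each validation window, oldest first (daily data)."""
    # Assumption: windows do not overlap unless a step size is given
    step_size = h if step_size is None else step_size
    one_day = pd.Timedelta(1, unit="D")
    last_date = pd.Timestamp(last_date)
    return [
        last_date - (h + (n_windows - 1 - i) * step_size) * one_day
        for i in range(n_windows)
    ]


cutoffs = cv_cutoffs("2001-05-14", n_windows=5, h=5)
print([c.strftime("%Y-%m-%d") for c in cutoffs])
```

Each window then trains on the data up to its cutoff and is evaluated on the `h` days that follow it.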

In [47]:
cv_result = mlf.cross_validation(
    series_with_prices,
    n_windows=5,  # number of models to train/splits to perform
    window_size=5,  # length of the validation set in each window
)

The `cv_result` object is a new data frame that includes the following columns:

* `unique_id:` series identifier.
* `ds:` datestamp or temporal index.
* `cutoff:` the last datestamp or temporal index of the training data for each of the `n_windows`.
* `y:` true value.
* `model:` columns with the model’s name and forecast value.

In [48]:
cv_result

Unnamed: 0,unique_id,ds,cutoff,y,LGBMRegressor
0,id_00,2001-04-20,2001-04-19,10.987977,9.410215
1,id_00,2001-04-21,2001-04-19,16.370385,18.171132
2,id_00,2001-04-22,2001-04-19,24.869802,26.836918
3,id_00,2001-04-23,2001-04-19,34.997018,34.112048
4,id_00,2001-04-24,2001-04-19,42.926775,42.889656
...,...,...,...,...,...
495,id_99,2001-05-10,2001-05-09,45.340051,43.264028
496,id_99,2001-05-11,2001-05-09,3.022948,1.144563
497,id_99,2001-05-12,2001-05-09,10.131371,8.758152
498,id_99,2001-05-13,2001-05-09,14.572434,16.944101


We’ll now plot the forecast for each cutoff period alongside the actual values.

In [71]:
def plot_cv(df, df_cv, uid, fname, last_n=24 * 14):
    cutoffs = df_cv.query('unique_id == @uid')['cutoff'].unique()
    fig, ax = plt.subplots(nrows=len(cutoffs), ncols=1, figsize=(14, 6), gridspec_kw=dict(hspace=0.8))
    for cutoff, axi in zip(cutoffs, ax.flat):
        df.query('unique_id == @uid').tail(last_n).set_index('ds').plot(ax=axi, title=uid, y='y')
        df_cv.query('unique_id == @uid & cutoff == @cutoff').set_index('ds').plot(ax=axi, title=uid, y='LGBMRegressor')
    fig.savefig(fname, bbox_inches='tight')
    plt.close()

In [72]:
plot_cv(series_with_prices, cv_result, "id_99", '../figs/Dynamic_features__plot_cross_validation.png')

![](../figs/Dynamic_features__plot_cross_validation.png)

# **8. Evaluate the model** <a class="anchor" id="8"></a>

[Table of Contents](#0.1)

We can now compute the accuracy of the forecast using an appropriate accuracy metric. Here we’ll use the Root Mean Squared Error (RMSE). To do this, we first need to install `datasetsforecast`, a Python library developed by Nixtla that includes a function to compute the RMSE.

`pip install datasetsforecast`

In [56]:
from datasetsforecast.losses import rmse

The function to compute the RMSE takes two arguments:

1. The actual values.
2. The forecasts, in this case those of the `LGBMRegressor` model.
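For reference, the RMSE is the square root of the mean squared difference between the actual values and the forecasts. The NumPy sketch below shows the computation. The name `rmse_sketch` is hypothetical and is used to avoid confusion with the function imported from `datasetsforecast`.

```python
import numpy as np


def rmse_sketch(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Two exact forecasts and one that is off by 2: sqrt((0 + 0 + 4) / 3)
print(rmse_sketch([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```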

In [57]:
from datasetsforecast.losses import mse, mae, rmse

def evaluate_cross_validation(df, metric):
    models = df.drop(columns=['ds', 'cutoff', 'y']).columns.tolist()
    evals = []
    for model in models:
        eval_ = df.groupby(['unique_id', 'cutoff']).apply(lambda x: metric(x['y'].values, x[model].values)).to_frame() # Calculate loss for every unique_id, model and cutoff.
        eval_.columns = [model]
        evals.append(eval_)
    evals = pd.concat(evals, axis=1)
    evals = evals.groupby(['unique_id']).mean(numeric_only=True) # Averages the error metrics for all cutoffs for every combination of model and unique_id
    evals['best_model'] = evals.idxmin(axis=1)
    return evals

In [59]:
evaluation_df = evaluate_cross_validation(cv_result.set_index("unique_id"), rmse)

evaluation_df

Unnamed: 0_level_0,LGBMRegressor,best_model
unique_id,Unnamed: 1_level_1,Unnamed: 2_level_1
id_00,1.124847,LGBMRegressor
id_01,0.586429,LGBMRegressor
id_02,0.598559,LGBMRegressor
id_03,0.724768,LGBMRegressor
id_04,1.271202,LGBMRegressor
...,...,...
id_95,0.817232,LGBMRegressor
id_96,1.243738,LGBMRegressor
id_97,1.067106,LGBMRegressor
id_98,0.268511,LGBMRegressor


In [60]:
def evaluate_cv(df):
    return df['y'].sub(df['LGBMRegressor']).pow(2).groupby(df['cutoff']).mean().pow(0.5)

split_rmse = evaluate_cv(cv_result)
split_rmse

cutoff
2001-04-19    0.960625
2001-04-24    0.933856
2001-04-29    0.908568
2001-05-04    0.894095
2001-05-09    0.928130
dtype: float64

In [61]:
cv_rmse = cv_result.groupby(['unique_id', 'cutoff']).apply(lambda df: rmse(df['y'], df['LGBMRegressor'])).mean()
print("RMSE using cross-validation: ", cv_rmse)

RMSE using cross-validation:  0.8031828633431441
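The RMSE per cutoff (about 0.9) and the overall cross-validation RMSE (about 0.8) differ because they aggregate the errors differently. The first pools all rows of a cutoff before taking the square root, while the second computes one RMSE per series and cutoff and then averages those values. For groups of equal size, the average of the group RMSEs can never exceed the pooled RMSE. The hypothetical toy data below illustrates the difference.

```python
import numpy as np
import pandas as pd

toy_cv = pd.DataFrame({
    "unique_id": ["a", "a", "b", "b"],
    "cutoff": pd.to_datetime(["2001-05-09"] * 4),
    "y": [1.0, 2.0, 3.0, 4.0],
    "LGBMRegressor": [1.0, 2.0, 1.0, 2.0],  # series "a" is exact, series "b" is off by 2
})
sq_err = (toy_cv["y"] - toy_cv["LGBMRegressor"]) ** 2

# Pooled: one RMSE over all rows of the cutoff (as in `evaluate_cv`)
pooled_rmse = float(np.sqrt(sq_err.mean()))

# Averaged: one RMSE per series, then the mean of those values
per_series_rmse = np.sqrt(sq_err.groupby(toy_cv["unique_id"]).mean())
mean_of_rmse = float(per_series_rmse.mean())

print(pooled_rmse, mean_of_rmse)
```

Which aggregation to report depends on whether large errors in a few series should dominate the score (pooled) or every series should count equally (averaged).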


# **9. References** <a class="anchor" id="9"></a>

[Table of Contents](#0.1)

1. Changquan Huang and Alla Petukhina (2022). Applied Time Series Analysis and Forecasting with Python. Springer.
2. Ivan Svetunkov. [Forecasting and Analytics with the Augmented Dynamic Adaptive Model (ADAM)](https://openforecast.org/adam/)
3. [James D. Hamilton. Time Series Analysis Princeton University Press, Princeton, New Jersey, 1st Edition, 1994.](https://press.princeton.edu/books/hardcover/9780691042893/time-series-analysis)
4. [Nixtla Parameters for Mlforecast](https://nixtla.github.io/mlforecast/forecast.html).
5. [Pandas available frequencies](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases).
6. [Rob J. Hyndman and George Athanasopoulos (2018). “Forecasting principles and practice, Time series cross-validation”.](https://otexts.com/fpp3/tscv.html)
7. [Seasonal periods- Rob J Hyndman](https://robjhyndman.com/hyndsight/seasonal-periods/).