This article is a step-by-step guide to building and using `lags` and `lag_transforms` with `MLForecast`.

During this walkthrough, we will become familiar with the main `MLForecast` class and some relevant methods such as `MLForecast.fit`, `MLForecast.predict` and `MLForecast.cross_validation`, among others.

Let's start!

<a class="anchor" id="0.1"></a>



1.	[Introduction](#1)
2.	[Lags and lag transforms](#2)
3.	[Installing Mlforecast](#3)
4.	[Loading libraries and data](#4)
5.	[Explore Data with the plot method](#5)
6.	[Split the data into training and testing](#6)
7.	[Implementation of model with MLForecast](#7)
8.  [References](#8)

# **1. Introduction** <a class="anchor" id="1"></a>

[Table of Contents](#0.1)

Time series lags are powerful tools that allow you to analyze patterns, identify dependencies and build predictive models. They are especially useful for capturing the temporal structure of data and improving prediction ability in time series analysis.

Lags in time series represent the time delays used to analyze the relationship between past and future values of the series. They are essential for calculating autocorrelation and capturing temporal patterns in time series analysis.

# **2. Lags and lag transforms** <a class="anchor" id="2"></a>
[Table of Contents](#0.1)

## **2.1 Lags**


**Definition:** Given a time series $X_t$, where $t$ is the time index, the lag of order $k$ is the series shifted back by $k$ time periods, so that the value paired with time $t$ is $X_{t-k}$.

Mathematically, using the lag (backshift) operator $L$, the k-order lag is written as follows:

$$L^k X_t = X_{t-k}$$

In this definition, $X_{t-k}$ represents the value of the time series $X_t$ lagged by $k$ time periods. That is, the value observed at time $t-k$ is taken and assigned to time $t$ as the lag-$k$ value.


In the context of time series, "lags" refer to the time delays used to analyze the relationship between past and future values in the series. The order of a lag indicates how many periods back we look when relating a past value to the current one.

When studying autocorrelation in a time series, it is common to compute autocorrelation coefficients at several lags. The coefficient at lag $k$ measures the relationship between values of the series that are $k$ periods apart.

By considering different lags, it is possible to evaluate how earlier values influence later values and whether there are recurring patterns or trends in the series.

For example, if we have a monthly time series that records the monthly mean total sunspot number, and we want to relate the current month's value to the previous month's value, we would use a lag of 1 month. This means shifting the series by one month so that each value lines up with the value that precedes it.
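The shift described above can be sketched with `pandas`. The values below are hypothetical and serve only to show how a lag of 1 month lines up each observation with the previous one.

```python
import pandas as pd

# Hypothetical monthly values, used only to illustrate the shift.
s = pd.Series(
    [10.0, 12.0, 15.0, 11.0],
    index=pd.date_range("2020-01-01", periods=4, freq="MS"),
    name="y",
)

# Lag of order 1: each month is paired with the previous month's value.
frame = pd.DataFrame({"y": s, "lag1": s.shift(1)})
print(frame)
```

The first row has no previous month, so its `lag1` value is missing.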

Lags can be used to calculate autocorrelation, which is a measure of the similarity between values in a time series at different periods. Autocorrelation coefficients indicate the strength and direction of the relationship between lags and current values.


## **2.2 Lag Properties**

Lags have several important properties in time series that can be useful for data analysis and modeling. Below are some of the most relevant properties:

1. Autocorrelation: Lags allow you to calculate the autocorrelation of a time series, that is, the correlation between the past and present values of the series. Autocorrelation can help identify repetitive patterns and stationarity in data.

2. Prediction: Lags are commonly used to build time series prediction models. By including lags as predictor variables in a model, the dependence on past values in predicting future values can be captured.

3. Seasonality: Lags can also reveal seasonal patterns in a time series. By looking at lags for previous seasonal periods (for example, monthly lags or quarterly lags), you can identify whether repetitive patterns exist at certain times of the year.

4. Trend analysis: Lags can be useful to analyze the trend of a time series. By calculating lags of different orders, you can evaluate how past values affect the overall trend of the data.

5. Dependency modeling: Lags allow you to capture dependencies between the past and present values of a time series. By including multiple lags in a model, complex relationships, such as lags in the response of one variable to another, can be detected and modeled.

6. Elimination of autocorrelation: Lags can also be used to eliminate autocorrelation in a time series. By calculating the residuals after fitting a model with lags, one can evaluate whether any unexplained autocorrelation structure remains.

In summary, time series lags are powerful tools that allow you to analyze patterns, identify dependencies and build predictive models. They are especially useful for capturing the temporal structure of data and improving prediction ability in time series analysis.

## **2.3 Lag transforms**

Lag transforms are defined as a dictionary where the keys are the lags and the values are lists of functions that transform an array. These must be [numba](http://numba.pydata.org) jitted functions (so that computing the features doesn’t become a bottleneck). There are some implemented in the [window-ops package](https://github.com/jmoralez/window_ops) but you can also implement your own.

If the function takes two or more arguments you can either:

- supply a tuple (tfm_func, arg1, arg2, …)
- define a new function fixing the arguments
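The two options can be sketched as follows. This is an illustration only: `rolling_mean_np` is a plain NumPy function written for this example (not numba-jitted and not part of `window_ops`), and `apply_spec` is a hypothetical helper that shows how a tuple can carry the extra arguments. It is not how `MLForecast` resolves transforms internally.

```python
import numpy as np

def rolling_mean_np(x, window_size):
    """Rolling mean of a 1-D array; positions without a full window are NaN."""
    out = np.full(len(x), np.nan)
    for i in range(window_size - 1, len(x)):
        out[i] = x[i - window_size + 1 : i + 1].mean()
    return out

# Option 1: a tuple (function, arg1, ...) keeps the extra arguments next to the function.
spec_tuple = (rolling_mean_np, 3)

# Option 2: a new one-argument function with the window size fixed.
def rolling_mean_3(x):
    return rolling_mean_np(x, 3)

def apply_spec(spec, x):
    """Hypothetical helper: call either a bare function or a (function, *args) tuple."""
    if isinstance(spec, tuple):
        func, *args = spec
        return func(x, *args)
    return spec(x)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
from_tuple = apply_spec(spec_tuple, x)
from_wrapper = apply_spec(rolling_mean_3, x)
print(from_tuple, from_wrapper)
```

Both forms produce the same array, so the choice is a matter of convenience.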

### **Window ops**

This library is intended to be used as an alternative to `pd.Series.rolling` and `pd.Series.expanding` to gain a speedup by using `numba` optimized functions operating on numpy arrays. There are also online classes for more efficient updates of window statistics.

If you have an array for which you want to compute a window statistic and then keep updating it as more samples come in, you can use the classes in the `window_ops.online` module. They all have a `fit_transform` method, which takes the array and returns the transformations defined above, and an `update` method, which takes a single value and returns the new statistic.
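To make the `fit_transform` / `update` pattern concrete, here is a minimal pure-Python sketch. `OnlineRollingMean` is a hypothetical class written for this example; it mimics the interface described above and is not the `window_ops.online` implementation.

```python
from collections import deque

class OnlineRollingMean:
    """Hypothetical stand-in for an online window statistic (not window_ops code)."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque(maxlen=window_size)

    def update(self, value):
        """Add one new sample and return the updated rolling mean."""
        self.window.append(value)
        if len(self.window) < self.window_size:
            return float("nan")
        return sum(self.window) / self.window_size

    def fit_transform(self, values):
        """Compute the statistic for the whole array and remember the last window."""
        self.window.clear()
        return [self.update(v) for v in values]

roller = OnlineRollingMean(window_size=3)
history = roller.fit_transform([1.0, 2.0, 3.0, 4.0])
latest = roller.update(5.0)  # only the stored window is needed, not the full history
print(history, latest)
```

The point of the online form is that a new sample updates the statistic without recomputing it over the whole array.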

To use `window_ops` we must previously install it using pip:

`pip install window-ops`

`window_ops` provides many ready-to-use functions, and you are free to build your own if you need to. The table below lists some of them, with one column for `window_ops` and one for the `pandas` equivalent:

| |window_ops|	pandas|
|-|----------|--------|
|rolling_mean|	0.03	|0.43|
|rolling_max|	0.14	|0.57|
|rolling_min|	0.14|	0.58|
|rolling_std|	0.06|	0.54|
|expanding_mean|	0.03|	0.31|
|expanding_max|	0.05|	0.76|
|expanding_min|	0.05|	0.47|
|expanding_std|	0.09|	0.41|
|seasonal_rolling_mean|	0.05|	3.89|
|seasonal_rolling_max|	0.18|	4.27|
|seasonal_rolling_min|	0.18|	3.75|
|seasonal_rolling_std|	0.08|	4.38|
|seasonal_expanding_mean|	0.04|	3.18|
|seasonal_expanding_max|	0.06|	3.29|
|seasonal_expanding_min|	0.06|	3.28|
|seasonal_expanding_std|	0.12|	3.89|


# **3. Installing Mlforecast** <a class="anchor" id="3"></a>

[Table of Contents](#0.1)

* using pip:

    - `pip install mlforecast`

* Specific version

    If you want a specific version you can include a filter, for example:

    - `pip install "mlforecast==0.3.0"` to install the 0.3.0 version
    - `pip install "mlforecast<0.4.0"` to install any version prior to 0.4.0

* using conda:

    - `conda install -c conda-forge mlforecast`

* Specific version

    If you want a specific version you can include a filter, for example: 

    - `conda install -c conda-forge "mlforecast==0.3.0"` to install the 0.3.0 version
    - `conda install -c conda-forge "mlforecast<0.4.0"` to install any version prior to 0.4.0

# **4. Loading libraries and data** <a class="anchor" id="4"></a>

[Table of Contents](#0.1)

In [1]:
# Handling and processing of Data
# ==============================================================================
import numpy as np
import pandas as pd

import scipy.stats as stats

# Handling and processing of Data for Date (time)
# ==============================================================================
import time
from datetime import datetime, timedelta

# Statistical tests and decomposition
# ==============================================================================
from statsmodels.tsa.stattools import adfuller
import statsmodels.api as sm
import statsmodels.tsa.api as smt
from statsmodels.tsa.seasonal import seasonal_decompose
# Plotting helper
# ==============================================================================
from utilsforecast.plotting import plot_series

In [2]:
from mlforecast import MLForecast
from sklearn.ensemble import RandomForestRegressor
# Lag transforms and target transforms
# ==============================================================================
from numba import njit
from window_ops.expanding import expanding_mean
from window_ops.rolling import rolling_mean
from window_ops.ewm import ewm_mean
from mlforecast.target_transforms import Differences

from mlforecast.utils import PredictionIntervals
from mlforecast.utils import generate_daily_series, generate_prices_for_series

In [3]:
# Plot
# ==============================================================================
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
plt.style.use('grayscale') # fivethirtyeight  grayscale  classic
plt.rcParams['lines.linewidth'] = 1.5
dark_style = {
    'figure.facecolor': '#008080',  # #212946
    'axes.facecolor': '#008080',
    'savefig.facecolor': '#008080',
    'axes.grid': True,
    'axes.grid.which': 'both',
    'axes.spines.left': False,
    'axes.spines.right': False,
    'axes.spines.top': False,
    'axes.spines.bottom': False,
    'grid.color': '#000000',  #2A3459
    'grid.linewidth': '1',
    'text.color': '0.9',
    'axes.labelcolor': '0.9',
    'xtick.color': '0.9',
    'ytick.color': '0.9',
    'font.size': 12 }
plt.rcParams.update(dark_style)
# Define the plot size
# ==============================================================================

plt.rcParams['figure.figsize'] = (18,7)

# Hide warnings
# ==============================================================================
import warnings
warnings.filterwarnings("ignore")

## **4.1 Read Data**

In [4]:
df=pd.read_csv("https://raw.githubusercontent.com/Naren8520/Serie-de-tiempo-con-Machine-Learning/main/Data/candy_production.csv",parse_dates=["observation_date"])
df.head()

Unnamed: 0,observation_date,IPG3113N
0,1972-01-01,85.6945
1,1972-02-01,71.82
2,1972-03-01,66.0229
3,1972-04-01,64.5645
4,1972-05-01,65.01


The input to MLForecast is always a data frame in long format with three columns: `unique_id`, `ds` and `y`:

* The `unique_id` (string, int or category) represents an identifier for the series.

* The `ds` (datestamp) column should be of a format expected by Pandas, ideally YYYY-MM-DD for a date or YYYY-MM-DD HH:MM:SS for a timestamp.

* The `y` (numeric) represents the measurement we wish to forecast.

In [5]:
df["unique_id"]="1"
df.columns=["ds", "y", "unique_id"]
df.head()

Unnamed: 0,ds,y,unique_id
0,1972-01-01,85.6945,1
1,1972-02-01,71.82,1
2,1972-03-01,66.0229,1
3,1972-04-01,64.5645,1
4,1972-05-01,65.01,1


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 548 entries, 0 to 547
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   ds         548 non-null    datetime64[ns]
 1   y          548 non-null    float64       
 2   unique_id  548 non-null    object        
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 13.0+ KB


# **5. Explore Data with the plot method** <a class="anchor" id="5"></a>

[Table of Contents](#0.1)

Plot the series using the `plot_series` function from `utilsforecast.plotting`. By default this function plots up to 8 randomly selected series from the dataset and is useful for basic EDA.

In [7]:
fig = plot_series(df)
fig.savefig('../figs/lag_transforms__eda.png')

![](../figs/lag_transforms__eda.png)

## **5.1 The Augmented Dickey-Fuller Test**
An Augmented Dickey-Fuller (ADF) test is a statistical test that determines whether a unit root is present in time series data. Unit roots can cause unpredictable results in time series analysis. The null hypothesis of the test is that the series has a unit root. If we fail to reject the null hypothesis, we have no evidence against the series being non-stationary. If we reject the null hypothesis in favor of the alternative, we have evidence that the series is generated by a stationary (or trend-stationary) process. The values of the ADF test statistic are typically negative, and more negative values indicate a stronger rejection of the null hypothesis.

The ADF test is commonly used to check whether a given time series is stationary. The hypotheses are:

- Null Hypothesis: the time series has a unit root and is non-stationary.
- Alternative Hypothesis: the time series is stationary; its statistical properties do not depend on time.

The decision rule is:

- ADF statistic < critical value: reject the null hypothesis; the time series is stationary.
- ADF statistic > critical value: fail to reject the null hypothesis; the time series is non-stationary.
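The decision rule can be written as a small function. The sketch below is illustrative: `adf_decision` is a hypothetical helper, and the numbers are the test statistics and the 5% critical value reported in the ADF outputs of this guide (original series and differenced series).

```python
def adf_decision(statistic, critical_value):
    """Reject the unit-root null when the statistic is below the critical value."""
    if statistic < critical_value:
        return "reject H0: stationary"
    return "fail to reject H0: non-stationary"

critical_5pct = -2.866978

# Test statistics taken from the ADF outputs shown in this guide.
original = adf_decision(-1.887050, critical_5pct)
differenced = adf_decision(-6.119512, critical_5pct)
print(original)
print(differenced)
```

The helper used in the notebook decides with the p-value instead (p-value <= 0.05), which leads to the same conclusions for these two series.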

In [8]:
def augmented_dickey_fuller_test(series , column_name):
    print (f'Dickey-Fuller test results for columns: {column_name}')
    dftest = adfuller(series, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','No Lags Used','Number of observations used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    print (dfoutput)
    if dftest[1] <= 0.05:
        print("Conclusion:====>")
        print("Reject the null hypothesis")
        print("The data is stationary")
    else:
        print("Conclusion:====>")
        print("The null hypothesis cannot be rejected")
        print("The data is not stationary")

In [9]:
augmented_dickey_fuller_test(df["y"],"Candy Production")

Dickey-Fuller test results for columns: Candy Production
Test Statistic                  -1.887050
p-value                          0.338178
No Lags Used                    14.000000
Number of observations used    533.000000
Critical Value (1%)             -3.442678
Critical Value (5%)             -2.866978
Critical Value (10%)            -2.569666
dtype: float64
Conclusion:====>
The null hypothesis cannot be rejected
The data is not stationary


## **5.2 Autocorrelation plots**

### **Autocorrelation Function**

**Definition 1.** Let $\{x_t; 1 \le t \le n\}$ be a time series sample of size $n$ from $\{X_t\}$.
1. $\bar x = \sum_{t=1}^n \frac{x_t}{n}$ is called the sample mean of $\{X_t\}$.
2. $c_k =\sum_{t=1}^{n-k} (x_{t+k}- \bar x)(x_t-\bar x)/n$ is known as the sample autocovariance function of $\{X_t\}$.
3. $r_k = c_k /c_0$ is said to be the sample autocorrelation function of $\{X_t\}$.

Note the following remarks about this definition:
 
* Like most literature, this guide uses ACF to denote the sample autocorrelation function as well as the autocorrelation function. What is denoted by ACF can easily be identified in context.

* Clearly $c_0$ is the sample variance of $\{X_t\}$. Besides, $r_0 = c_0/c_0 = 1$ and for any integer $k$, $|r_k| \le 1$.

* When we compute the ACF of any sample series with a fixed length $n$, we cannot put too much confidence in the values of $r_k$ for large $k$, since fewer pairs $(x_{t+k}, x_t)$ are available for calculating $r_k$ when $k$ is large. One rule of thumb is not to estimate $r_k$ for $k > n/3$, and another is $n \ge 50$, $k \le n/4$. In any case, it is always a good idea to be careful.

* We also compute the ACF of a nonstationary time series sample by Definition 1. In this case, however, the ACF $r_k$ tapers off very slowly, or hardly at all, as $k$ increases.

* Plotting the ACF $(r_k)$ against lag $k$ is easy but very helpful in analyzing a time series sample. Such an ACF plot is known as a correlogram.

* If $\{X_t\}$ is stationary with $E(X_t)=0$ and $\rho_k =0$ for all $k \neq 0$, that is, it is a white noise series, then the sampling distribution of $r_k$ is asymptotically normal with mean 0 and variance $1/n$. Hence, there is about a 95% chance that $r_k$ falls in the interval $[-1.96/\sqrt{n}, 1.96/\sqrt{n}]$.

In summary: (1) if the plot of a time series clearly shows a trend and/or seasonality, the series is nonstationary; (2) if the ACF $r_k$ tapers off very slowly, or hardly at all, as the lag $k$ increases, the series should also be regarded as nonstationary.
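Definition 1 translates directly into NumPy. The sketch below is a minimal implementation for illustration; `sample_acf` is a hypothetical helper name, and the two series are simulated (white noise and its cumulative sum, a random walk) rather than taken from the candy production data.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelations r_0..r_max_lag following Definition 1."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    centered = x - x.mean()
    # c_k = sum_{t=1}^{n-k} (x_{t+k} - mean) * (x_t - mean) / n
    c = np.array(
        [np.sum(centered[k:] * centered[: n - k]) / n for k in range(max_lag + 1)]
    )
    return c / c[0]

rng = np.random.default_rng(0)
noise = rng.normal(size=500)  # simulated white noise
walk = np.cumsum(noise)       # simulated random walk (nonstationary)

band = 1.96 / np.sqrt(len(noise))  # approximate 95% band for white noise
r_noise = sample_acf(noise, 10)
r_walk = sample_acf(walk, 10)

print("band:", round(band, 4))
print("white noise r_1..r_3:", np.round(r_noise[1:4], 3))
print("random walk r_1..r_3:", np.round(r_walk[1:4], 3))
```

For white noise, most coefficients should fall inside the band, while the random walk coefficients stay close to 1 and decay slowly, which is the behavior described in the remarks above.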

In [10]:
fig, axs = plt.subplots(nrows=1, ncols=2)

plot_acf(df["y"],  lags=30, ax=axs[0],color="fuchsia")
axs[0].set_title("Autocorrelation");

# Partial autocorrelation plot
plot_pacf(df["y"],  lags=30, ax=axs[1],color="lime")
axs[1].set_title('Partial Autocorrelation')
plt.savefig("../figs/lag_transforms__autocorrelation.png")
plt.close();

![](../figs/lag_transforms__autocorrelation.png)

## **5.3 Decomposition of the time series**

How to decompose a time series and why?

In time series analysis, forecasting new values requires understanding past data. More formally, it is very important to know the patterns that values follow over time. Many factors can push our forecasts in the wrong direction. Basically, a time series consists of four components, and variation in those components changes the pattern of the time series. These components are:

* **Level:** the baseline value around which the series averages over time.
* **Trend:** the component that causes long-term increasing or decreasing patterns in a time series.
* **Seasonality:** a pattern that repeats over a fixed period and causes short-term increasing or decreasing patterns in a time series.
* **Residual/Noise:** the random variations in the time series.

Combining these components over time forms a time series. Most time series contain level and residual/noise, while trend and seasonality are optional.

If seasonality and trend are part of the time series, they will affect the forecast values, since the pattern of the forecasted series may differ from the pattern of the past series.

The combination of the components in time series can be of two types:
* Additive
* Multiplicative
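The difference between the two combinations can be sketched with simulated components. All values below are hypothetical (a linear trend and a 12-month sine pattern), and noise is omitted to keep the example short.

```python
import numpy as np

t = np.arange(48)                                  # 4 years of monthly steps
trend = 100 + 0.5 * t                              # hypothetical linear trend
season_add = 10 * np.sin(2 * np.pi * t / 12)       # seasonal effect in the units of y
season_mul = 1 + 0.1 * np.sin(2 * np.pi * t / 12)  # seasonal effect as a factor

y_additive = trend + season_add        # y_t = T_t + S_t
y_multiplicative = trend * season_mul  # y_t = T_t * S_t

# Taking logs turns the multiplicative form into an additive one.
log_check = np.log(trend) + np.log(season_mul)

swing_first_year = np.abs(y_multiplicative - trend)[:12].max()
swing_last_year = np.abs(y_multiplicative - trend)[36:].max()
print(swing_first_year, swing_last_year)
```

In the additive form the seasonal swing has a constant size, while in the multiplicative form it grows with the level of the series.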

### Additive

In [11]:
a = seasonal_decompose(df["y"], model = "additive", period=24).plot()
a.savefig('../figs/lag_transforms__seasonal_decompose_aditive.png')
plt.close()

![](../figs/lag_transforms__seasonal_decompose_aditive.png)

### Multiplicative

In [12]:
# Use the lowercase name: statsmodels checks whether the string starts with "m"
m = seasonal_decompose(df["y"], model = "multiplicative", period=24).plot()
m.savefig('../figs/lag_transforms__seasonal_decompose_multiplicative.png')
plt.close();

![](../figs/lag_transforms__seasonal_decompose_multiplicative.png)

# **6. Modeling with MLForecast** <a class="anchor" id="7"></a>

[Table of Contents](#0.1)

## **6.1 Building Model**

We define the model that we want to use. For our example, we are going to use the `RandomForestRegressor` model from scikit-learn.

In [14]:
model = [RandomForestRegressor(n_estimators=1000,
                               max_depth=2, 
                               random_state=1234,
                               n_jobs=-1) ]

We fit the models by instantiating a new `MLForecast` object with the following parameters:

* `models:` a list of models. Import the models you want and pass them in this list.

* `freq:` a string indicating the frequency of the data. (See [pandas' available frequencies](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases).)

* `lags:` Lags of the target to use as features.

* `lag_transforms:` Mapping of target lags to their transformations.

* `date_features:` Features computed from the dates. Can be `pandas` date attributes or functions that will take the dates as input.

* `differences:` Differences to take of the target before computing the features. These are restored at the forecasting step.

* `num_threads:` Number of threads to use when computing the features.

* `target_transforms:` Transformations that will be applied to the target before computing the features and restored after the forecasting step.

All settings are passed into the constructor. Then you call its `fit` method and pass in the historical data frame.

Since the series we are using is non-stationary, we are going to apply a first-order difference to remove part of the trend. For this we use the `target_transforms` parameter.
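What a first-order difference does to the target, and how it can be undone, can be sketched with NumPy. This is only an illustration of the arithmetic on the first five observations shown earlier; it is not the `Differences` implementation.

```python
import numpy as np

# First five observations from the data shown earlier.
y = np.array([85.6945, 71.82, 66.0229, 64.5645, 65.01])

# First-order difference: d_t = y_t - y_{t-1}; the first observation has no predecessor.
d = np.diff(y)

# Restoring: add the cumulative sum of the differences back to the first level.
restored = np.concatenate([[y[0]], y[0] + np.cumsum(d)])

print(np.round(d, 4))
print(np.round(restored, 4))
```

The differenced values agree with the `y` column returned by `preprocess`, and the restoration step is the reason forecasts can be reported on the original scale.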

In [15]:
mlf = MLForecast(models=model,
                 freq='MS', 
                 target_transforms=[Differences([1])]
                 )

In [16]:
prep = mlf.preprocess(df)
prep

Unnamed: 0,ds,y,unique_id
1,1972-02-01,-13.8745,1
2,1972-03-01,-5.7971,1
3,1972-04-01,-1.4584,1
4,1972-05-01,0.4455,1
5,1972-06-01,2.6367,1
...,...,...,...
543,2017-04-01,2.2043,1
544,2017-05-01,-5.5079,1
545,2017-06-01,2.2813,1
546,2017-07-01,-1.6161,1


In [17]:
fig=plot_series(prep)
fig.savefig('../figs/lag_transforms__plot_differences.png')

![](../figs/lag_transforms__plot_differences.png)

In [18]:
augmented_dickey_fuller_test(prep["y"],"Candy Production")

Dickey-Fuller test results for columns: Candy Production
Test Statistic                -6.119512e+00
p-value                        8.925584e-08
No Lags Used                   1.300000e+01
Number of observations used    5.330000e+02
Critical Value (1%)           -3.442678e+00
Critical Value (5%)           -2.866978e+00
Critical Value (10%)          -2.569666e+00
dtype: float64
Conclusion:====>
Reject the null hypothesis
The data is stationary


## **6.2 Lags**

In [19]:
mlf = MLForecast(models=model,
                 freq='MS', 
                 lags=[1],
                 target_transforms=[Differences([1])]
                 )

We can use the `MLForecast.preprocess` method to explore different transformations.

In [20]:
prep = mlf.preprocess(df)
prep.head()

Unnamed: 0,ds,y,unique_id,lag1
2,1972-03-01,-5.7971,1,-13.8745
3,1972-04-01,-1.4584,1,-5.7971
4,1972-05-01,0.4455,1,-1.4584
5,1972-06-01,2.6367,1,0.4455
6,1972-07-01,1.3962,1,2.6367


We can see that setting `lags=[1]` added the `lag1` column to the features. The `lags` parameter accepts a list of lag values, so we can include several at once, as shown below:

In [21]:
mlf = MLForecast(models=model,
                 freq='MS', 
                 lags=[1,2,4,6,8,10,12],
                 target_transforms=[Differences([1])]
                 )

In [22]:
prep = mlf.preprocess(df)
prep

Unnamed: 0,ds,y,unique_id,lag1,lag2,lag4,lag6,lag8,lag10,lag12
13,1973-02-01,-14.0297,1,-14.6676,0.3711,31.8827,1.7941,2.6367,-1.4584,-13.8745
14,1973-03-01,-7.6590,1,-14.0297,-14.6676,-1.3327,4.2092,1.3962,0.4455,-5.7971
15,1973-04-01,0.6876,1,-7.6590,-14.0297,0.3711,31.8827,1.7941,2.6367,-1.4584
16,1973-05-01,1.3836,1,0.6876,-7.6590,-14.6676,-1.3327,4.2092,1.3962,0.4455
17,1973-06-01,3.1813,1,1.3836,0.6876,-14.0297,0.3711,31.8827,1.7941,2.6367
...,...,...,...,...,...,...,...,...,...,...
543,2017-04-01,2.2043,1,-8.2416,3.9995,-0.3896,9.7311,1.7465,0.3228,-4.3238
544,2017-05-01,-5.5079,1,2.2043,-8.2416,-6.9869,-2.2071,4.6214,0.5468,-1.5363
545,2017-06-01,2.2813,1,-5.5079,2.2043,3.9995,-0.3896,9.7311,1.7465,0.3228
546,2017-07-01,-1.6161,1,2.2813,-5.5079,-8.2416,-6.9869,-2.2071,4.6214,0.5468

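The output starts at index 13 because rows without a complete set of lags are dropped: one row is lost to the first difference and twelve more to the largest lag. A small `pandas` reproduction with two lags makes this visible. It uses the first six differenced values shown above and is a sketch of the idea, not of how `MLForecast` builds the features internally.

```python
import pandas as pd

# Differenced target for 1972-02 to 1972-07, copied from the outputs above.
d = pd.Series(
    [-13.8745, -5.7971, -1.4584, 0.4455, 2.6367, 1.3962],
    index=pd.date_range("1972-02-01", periods=6, freq="MS"),
)

lags = [1, 2]
features = pd.DataFrame({"y": d})
for k in lags:
    features[f"lag{k}"] = d.shift(k)

# Rows without a complete set of lags are dropped, so max(lags) rows are lost.
features = features.dropna()
print(features)
```

With `lags = [1, 2]` the first usable row is 1972-04-01, two steps after the first differenced value.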

We can also use a list comprehension to generate the `lags` we need, as shown below:

In [23]:
mlf = MLForecast(models=model,
                 freq='MS', 
                 lags=[1 * (i+1) for i in range(12)],
                 target_transforms=[Differences([1])]
                 )

In [24]:
prep = mlf.preprocess(df)
prep

Unnamed: 0,ds,y,unique_id,lag1,lag2,lag3,lag4,lag5,lag6,lag7,lag8,lag9,lag10,lag11,lag12
13,1973-02-01,-14.0297,1,-14.6676,0.3711,-1.3327,31.8827,4.2092,1.7941,1.3962,2.6367,0.4455,-1.4584,-5.7971,-13.8745
14,1973-03-01,-7.6590,1,-14.0297,-14.6676,0.3711,-1.3327,31.8827,4.2092,1.7941,1.3962,2.6367,0.4455,-1.4584,-5.7971
15,1973-04-01,0.6876,1,-7.6590,-14.0297,-14.6676,0.3711,-1.3327,31.8827,4.2092,1.7941,1.3962,2.6367,0.4455,-1.4584
16,1973-05-01,1.3836,1,0.6876,-7.6590,-14.0297,-14.6676,0.3711,-1.3327,31.8827,4.2092,1.7941,1.3962,2.6367,0.4455
17,1973-06-01,3.1813,1,1.3836,0.6876,-7.6590,-14.0297,-14.6676,0.3711,-1.3327,31.8827,4.2092,1.7941,1.3962,2.6367
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
543,2017-04-01,2.2043,1,-8.2416,3.9995,-6.9869,-0.3896,-2.2071,9.7311,4.6214,1.7465,0.5468,0.3228,-1.5363,-4.3238
544,2017-05-01,-5.5079,1,2.2043,-8.2416,3.9995,-6.9869,-0.3896,-2.2071,9.7311,4.6214,1.7465,0.5468,0.3228,-1.5363
545,2017-06-01,2.2813,1,-5.5079,2.2043,-8.2416,3.9995,-6.9869,-0.3896,-2.2071,9.7311,4.6214,1.7465,0.5468,0.3228
546,2017-07-01,-1.6161,1,2.2813,-5.5079,2.2043,-8.2416,3.9995,-6.9869,-0.3896,-2.2071,9.7311,4.6214,1.7465,0.5468


## **6.3 Lag transforms**

Let's now look at the `lag_transforms` parameter. As mentioned before, there are a number of lag transforms available, and you can also build your own depending on what your data requires. Let's look at a couple of usage examples.

First we must import the functions we want to use.

In [25]:
from window_ops.expanding import expanding_mean
from window_ops.rolling import rolling_mean

`expanding_mean` is a `window_ops` function that calculates the expanding mean of a time series. The expanding mean is a measure of central tendency: at each step it averages all the values observed so far, so the window grows as new data is added.

The `expanding_mean` can be used to identify trends in time series. For example, if the expanding mean is increasing, the time series is in an uptrend. If it is decreasing, the time series is in a downtrend.

The `expanding_mean` also smooths out random fluctuations in a time series. This can be useful for identifying patterns that would otherwise be difficult to see.
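A `pandas` equivalent helps to see what the expanding mean of lag 1 contains. The sketch below uses the differenced values from 1972-02 to 1973-03 that appear in the lag columns shown earlier; it is a reproduction with `pandas`, not the `window_ops` code path.

```python
import pandas as pd

# Differenced target from 1972-02 to 1973-03, read off the lag columns shown earlier.
d = pd.Series(
    [-13.8745, -5.7971, -1.4584, 0.4455, 2.6367, 1.3962, 1.7941,
     4.2092, 31.8827, -1.3327, 0.3711, -14.6676, -14.0297, -7.6590],
    index=pd.date_range("1972-02-01", periods=14, freq="MS"),
)

# Shift by one step first, then average everything seen so far.
expanding_mean_lag1 = d.shift(1).expanding().mean()

feb_1973 = expanding_mean_lag1.loc[pd.Timestamp("1973-02-01")]
mar_1973 = expanding_mean_lag1.loc[pd.Timestamp("1973-03-01")]
print(round(feb_1973, 6), round(mar_1973, 6))
```

These two values agree with the `expanding_mean_lag1` column that `preprocess` returns for the same dates.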

In [26]:
mlf = MLForecast(models=model,
                 freq='MS', 
                 lags=[1 * (i+1) for i in range(12)],
                 lag_transforms={1: [expanding_mean] },
                 target_transforms=[Differences([1])]
                 )

In [27]:
prep = mlf.preprocess(df)
prep

Unnamed: 0,ds,y,unique_id,lag1,lag2,lag3,lag4,lag5,lag6,lag7,lag8,lag9,lag10,lag11,lag12,expanding_mean_lag1
13,1973-02-01,-14.0297,1,-14.6676,0.3711,-1.3327,31.8827,4.2092,1.7941,1.3962,2.6367,0.4455,-1.4584,-5.7971,-13.8745,0.467100
14,1973-03-01,-7.6590,1,-14.0297,-14.6676,0.3711,-1.3327,31.8827,4.2092,1.7941,1.3962,2.6367,0.4455,-1.4584,-5.7971,-0.648038
15,1973-04-01,0.6876,1,-7.6590,-14.0297,-14.6676,0.3711,-1.3327,31.8827,4.2092,1.7941,1.3962,2.6367,0.4455,-1.4584,-1.148821
16,1973-05-01,1.3836,1,0.6876,-7.6590,-14.0297,-14.6676,0.3711,-1.3327,31.8827,4.2092,1.7941,1.3962,2.6367,0.4455,-1.026393
17,1973-06-01,3.1813,1,1.3836,0.6876,-7.6590,-14.0297,-14.6676,0.3711,-1.3327,31.8827,4.2092,1.7941,1.3962,2.6367,-0.875769
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
543,2017-04-01,2.2043,1,-8.2416,3.9995,-6.9869,-0.3896,-2.2071,9.7311,4.6214,1.7465,0.5468,0.3228,-1.5363,-4.3238,0.036033
544,2017-05-01,-5.5079,1,2.2043,-8.2416,3.9995,-6.9869,-0.3896,-2.2071,9.7311,4.6214,1.7465,0.5468,0.3228,-1.5363,0.040026
545,2017-06-01,2.2813,1,-5.5079,2.2043,-8.2416,3.9995,-6.9869,-0.3896,-2.2071,9.7311,4.6214,1.7465,0.5468,0.3228,0.029828
546,2017-07-01,-1.6161,1,2.2813,-5.5079,2.2043,-8.2416,3.9995,-6.9869,-0.3896,-2.2071,9.7311,4.6214,1.7465,0.5468,0.033959


Let's look at another example, using the `rolling_mean` function. Note that this example also changes the target transformation to a seasonal difference, `Differences([12])`.

`rolling_mean` smooths time series data by averaging the values over a fixed number of periods. That number of periods is called the window; here it is passed as the second element of the tuple `(rolling_mean, 7)`.

The `rolling_mean` can be used to identify trends in a time series. For example, if the moving average is rising, the time series is in an uptrend. If the moving average is decreasing, the time series is in a downtrend.

The `rolling_mean` also smooths out random fluctuations in a time series. This can be useful for identifying patterns that would otherwise be difficult to see.
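The combination of a lag and a window can be sketched with `pandas` on hypothetical data. The series below simply counts from 1 to 24, which makes the averages easy to check by hand; the variable name mirrors the feature name that `MLForecast` generates, but the computation here is plain `pandas`.

```python
import pandas as pd

s = pd.Series(range(1, 25), dtype=float)  # hypothetical values 1..24

# Lag 12 first, then a 7-step rolling mean over the lagged values.
rolling_mean_lag12_window_size7 = s.shift(12).rolling(window=7).mean()

first_valid = rolling_mean_lag12_window_size7.first_valid_index()
print(first_valid, rolling_mean_lag12_window_size7.loc[first_valid])
```

The first complete value appears at position 18 (12 steps for the lag plus 6 more to fill a window of 7). In the tutorial, the seasonal difference removes 12 more rows, which is consistent with the preprocessed output starting at index 30.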

In [34]:
mlf = MLForecast(models=model,
                 freq='MS', 
                 lags=[1 * (i+1) for i in range(12)],
                 lag_transforms={1: [expanding_mean], 12: [(rolling_mean, 7)] },
                 target_transforms=[Differences([12])]
                 )

In [35]:
prep = mlf.preprocess(df)
prep

Unnamed: 0,ds,y,unique_id,lag1,lag2,lag3,lag4,lag5,lag6,lag7,lag8,lag9,lag10,lag11,lag12,expanding_mean_lag1,rolling_mean_lag12_window_size7
30,1974-07-01,-5.9896,1,4.7223,2.9374,-2.9777,7.6190,6.3398,-2.6012,-0.7463,3.6562,-4.0089,5.5453,2.3378,3.0035,3.338533,5.324271
31,1974-08-01,-1.9884,1,-5.9896,4.7223,2.9374,-2.9777,7.6190,6.3398,-2.6012,-0.7463,3.6562,-4.0089,5.5453,2.3378,2.847579,4.857500
32,1974-09-01,-10.4165,1,-1.9884,-5.9896,4.7223,2.9374,-2.9777,7.6190,6.3398,-2.6012,-0.7463,3.6562,-4.0089,5.5453,2.605780,4.871114
33,1974-10-01,-3.6988,1,-10.4165,-1.9884,-5.9896,4.7223,2.9374,-2.9777,7.6190,6.3398,-2.6012,-0.7463,3.6562,-4.0089,1.985671,3.785829
34,1974-11-01,-8.1323,1,-3.6988,-10.4165,-1.9884,-5.9896,4.7223,2.9374,-2.9777,7.6190,6.3398,-2.6012,-0.7463,3.6562,1.727286,3.488986
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
543,2017-04-01,3.8109,1,-2.7172,5.3353,0.9625,-3.7839,-7.6745,-7.6938,-6.5773,-11.2053,0.1222,1.5705,5.3014,2.5548,0.661768,-0.590957
544,2017-05-01,-0.1607,1,3.8109,-2.7172,5.3353,0.9625,-3.7839,-7.6745,-7.6938,-6.5773,-11.2053,0.1222,1.5705,5.3014,0.667687,-0.830400
545,2017-06-01,1.7978,1,-0.1607,3.8109,-2.7172,5.3353,0.9625,-3.7839,-7.6745,-7.6938,-6.5773,-11.2053,0.1222,1.5705,0.666133,0.043143
546,2017-07-01,-0.3651,1,1.7978,-0.1607,3.8109,-2.7172,5.3353,0.9625,-3.7839,-7.6745,-7.6938,-6.5773,-11.2053,0.1222,0.668252,1.248514


In this way we can add as many `lags` as we need, along with the `lag_transforms` functions that suit our data set. The main objective of this tutorial is to show how to use these tools, not to perform a particular data analysis.

Once we have added the `lags` and `lag_transforms` to our model, the next step is to train the model and make predictions.

## **6.4 Fit method**

In [36]:
# fit the models
mlf.fit(df, fitted=True)

MLForecast(models=[RandomForestRegressor], freq=<MonthBegin>, lag_features=['lag1', 'lag2', 'lag3', 'lag4', 'lag5', 'lag6', 'lag7', 'lag8', 'lag9', 'lag10', 'lag11', 'lag12', 'expanding_mean_lag1', 'rolling_mean_lag12_window_size7'], date_features=[], num_threads=1)

Let's look at the results of our model, in this case the `RandomForestRegressor`. We can retrieve the fitted values with the `forecast_fitted_values` method:

In [37]:
result=mlf.forecast_fitted_values()
result=result.set_index("unique_id")
result

Unnamed: 0_level_0,ds,y,RandomForestRegressor
unique_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,1974-07-01,66.0568,3.636434
1,1974-08-01,71.1864,-4.448231
1,1974-09-01,70.1750,-1.200554
1,1974-10-01,99.2212,-6.396696
1,1974-11-01,101.1201,-3.074718
...,...,...,...
1,2017-04-01,107.4288,-1.629253
1,2017-05-01,101.9209,2.364812
1,2017-06-01,104.2022,0.823116
1,2017-07-01,102.5861,1.891464
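Residual diagnostics are usually run on the difference between the observed `y` and the fitted value, rather than on the fitted values themselves. Assuming the same column layout as the table above, a minimal sketch with made-up numbers could look like this (the `fitted` frame and the `residual` column name are assumptions of this example):

```python
import pandas as pd

# Made-up numbers in the same layout as the fitted-values table
# (columns: unique_id, ds, y, and one column named after the model).
fitted = pd.DataFrame({
    "unique_id": [1, 1, 1],
    "ds": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]),
    "y": [10.0, 12.0, 11.0],
    "RandomForestRegressor": [9.5, 12.5, 11.0],
})

# Residual = observed value minus in-sample (fitted) prediction.
fitted["residual"] = fitted["y"] - fitted["RandomForestRegressor"]
print(fitted)
```

A column built this way can then be passed to the normality tests and plots in place of the model column.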


In [38]:
from statsmodels.stats.diagnostic import normal_ad
from scipy import stats

sw_result = stats.shapiro(result["RandomForestRegressor"])
ad_result = normal_ad(np.array(result["RandomForestRegressor"]), axis=0)
dag_result = stats.normaltest(result["RandomForestRegressor"], axis=0, nan_policy='propagate')

It's important to note that diagnostics and methods that assume normally distributed residuals are only valid if that assumption holds. To check whether this is the case, we will use a Q-Q plot and test normality with the Shapiro-Wilk, Anderson-Darling, and D’Agostino K^2 tests.

The Q-Q (quantile-quantile) plot compares the quantiles of the data sample with those of a normal distribution: if the sample is normally distributed, the data points fall on a straight line.

The three normality tests use p-values to assess how likely it is that a data sample comes from a normally distributed population. The null hypothesis for each test is that "the sample came from a normally distributed population". If the resulting p-value is below a chosen alpha value, the null hypothesis is rejected, which is evidence that the data comes from a non-normal distribution. For this article, we will use an alpha value of 0.01.
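The decision rule described above can be sketched as follows. The sample is synthetic and deliberately skewed, and the helper `rejects_normality` is a hypothetical name introduced only for this illustration; the Anderson-Darling p-value returned by `normal_ad` would be read in the same way.

```python
import numpy as np
from scipy import stats

ALPHA = 0.01  # significance level used in this article

def rejects_normality(p_value, alpha=ALPHA):
    """Return True when the null hypothesis of normality is rejected."""
    return p_value < alpha

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=1.0, size=500)  # clearly non-normal sample

sw_p = stats.shapiro(skewed)[1]      # Shapiro-Wilk p-value
dag_p = stats.normaltest(skewed)[1]  # D'Agostino K^2 p-value

for name, p in [("Shapiro-Wilk", sw_p), ("D'Agostino K^2", dag_p)]:
    verdict = "reject" if rejects_normality(p) else "fail to reject"
    print(f"{name}: p={p:.4g} -> {verdict} normality at alpha={ALPHA}")
```

Failing to reject the null hypothesis does not prove normality; it only means the test found no strong evidence against it at the chosen alpha.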

In [39]:
result=mlf.forecast_fitted_values()
fig, axs = plt.subplots(nrows=2, ncols=2)

# Top-left: values of the model column over time
result["RandomForestRegressor"].plot(ax=axs[0,0])
axs[0,0].set_title("Residuals model");

# Top-right: histogram (density) of the same values
axs[0,1].hist(result["RandomForestRegressor"], density=True,bins=50, alpha=0.5 )
axs[0,1].set_title("Density plot - Residual");

# Bottom-left: Q-Q plot against the normal distribution, annotated with the test p-values
stats.probplot(result["RandomForestRegressor"], dist="norm", plot=axs[1,0])
axs[1,0].set_title('Q-Q plot')
axs[1,0].annotate("SW p-val: {:.4f}".format(sw_result[1]), xy=(0.05,0.9), xycoords='axes fraction', fontsize=15,
            bbox=dict(boxstyle="round", fc="none", ec="gray", pad=0.6))

axs[1,0].annotate("AD p-val: {:.4f}".format(ad_result[1]), xy=(0.05,0.8), xycoords='axes fraction', fontsize=15,
            bbox=dict(boxstyle="round", fc="none", ec="gray", pad=0.6))

axs[1,0].annotate("DAG p-val: {:.4f}".format(dag_result[1]), xy=(0.05,0.7), xycoords='axes fraction', fontsize=15,
            bbox=dict(boxstyle="round", fc="none", ec="gray", pad=0.6))

# Bottom-right: autocorrelation function up to lag 35
plot_acf(result["RandomForestRegressor"],  lags=35, ax=axs[1,1],color="fuchsia")
axs[1,1].set_title("Autocorrelation");

plt.savefig("../figs/lag_transforms__plot_residual_model.png")
plt.close();

![](../figs/lag_transforms__plot_residual_model.png)

## **6.5 Predict method**

In [40]:
forecast_df = mlf.predict(h=30) 

forecast_df

Unnamed: 0,unique_id,ds,RandomForestRegressor
0,1,2017-09-01,115.271183
1,1,2017-10-01,124.275966
2,1,2017-11-01,121.05171
3,1,2017-12-01,119.346868
4,1,2018-01-01,111.890525
5,1,2018-02-01,115.720737
6,1,2018-03-01,107.390819
7,1,2018-04-01,109.410263
8,1,2018-05-01,103.819343
9,1,2018-06-01,106.088877


## **6.6 Plot prediction**

In [41]:
fig=plot_series(df, forecast_df, max_insample_length=300,engine="matplotlib")
for ax in fig.get_axes():
    ax.set_title("Forecast")
fig.savefig('../figs/lag_transforms__plot_forecasting_intervals.png')

![](../figs/lag_transforms__plot_forecasting_intervals.png)

Adding `lags` to a time series model is important for several reasons:

1. Capturing temporal dependencies: Including lags as predictor variables lets the model account for the influence of past values when predicting future ones. This matters because in time series the present value is commonly correlated with past values.

2. Modeling trends and seasonality: Lags that correspond to the relevant time periods (for example, monthly or quarterly lags) let the model capture seasonality and other regular fluctuations in the data, which is needed to understand and predict repetitive patterns in the series.

3. Improved predictive capacity: Lags carry information about the past dynamics and behavior of the series, which can help the model make more accurate and realistic projections.

4. Identification of autocorrelation structures: In a linear model, lag coefficients that are significantly different from zero indicate a dependency between past and present values. A tree-based model such as the `RandomForestRegressor` used here has no coefficients, so its feature importances play a similar role. Either way, this can be useful for understanding the nature and properties of the series.

In summary, lags let a model capture time dependence, represent trends and seasonality, and expose autocorrelation structures in the data, which can make time series forecasts more accurate.
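As a small illustration of the points above, the sketch below measures how strongly a series is correlated with its own lags. The series is synthetic (a 12-period cycle plus small noise) and the variable names are assumptions of this example; on real data the same check can suggest which lags are worth adding as features.

```python
import numpy as np
import pandas as pd

# Synthetic monthly-style series with a 12-period cycle plus small noise.
rng = np.random.default_rng(0)
t = np.arange(120)
series = pd.Series(np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.1, size=120))

# Correlation between the series and its own k-period lag.
acf = {k: series.autocorr(lag=k) for k in (1, 6, 12)}
for k, value in acf.items():
    print(f"lag {k:>2}: autocorrelation = {value:.2f}")
```

In this toy series the lag-12 value is strongly and positively correlated with the current value, while lag 6 is strongly negatively correlated, which is the signature of a 12-period seasonal pattern.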

# **7. References** <a class="anchor" id="8"></a>

[Table of Contents](#0.1)

1. Changquan Huang and Alla Petukhina (2022). Applied Time Series Analysis and Forecasting with Python. Springer.
2. Ivan Svetunkov. [Forecasting and Analytics with the Augmented Dynamic Adaptive Model (ADAM)](https://openforecast.org/adam/).
3. James D. Hamilton (1994). [Time Series Analysis](https://press.princeton.edu/books/hardcover/9780691042893/time-series-analysis). Princeton University Press, Princeton, New Jersey, 1st Edition.
4. [Nixtla Parameters for Mlforecast](https://nixtla.github.io/mlforecast/forecast.html).
5. [Pandas available frequencies](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases).
6. Rob J. Hyndman and George Athanasopoulos (2018). [Forecasting: Principles and Practice, "Time series cross-validation"](https://otexts.com/fpp3/tscv.html).
7. Rob J. Hyndman. [Seasonal periods](https://robjhyndman.com/hyndsight/seasonal-periods/).