The main objective of this article is to provide a step-by-step guide on how to build a `Pipeline` with `scikit-learn` for a time series model using `MLForecast`.

During this tutorial, we will get familiar with the main class `MLForecast` and some relevant methods such as `MLForecast.fit`, `MLForecast.predict` and `MLForecast.cross_validation`, among others.

Let us begin!

<a class="anchor" id="0.1"></a>



1. [Introduction](#1)
2. [Loading libraries and Data](#2)
3. [Explore Data with the plot method](#3)
4. [Training A Multivariate Time Series Model With MLForecast](#4)
5. [Feature importances](#5)
6. [Evaluate the model’s performance](#6)
7. [Evaluate the model](#7)
8. [References](#8)

# **Introduction** <a class="anchor" id="1"></a>

[Table of Contents](#0.1)

A pipeline in the context of machine learning refers to a sequence of steps organized in a structured way to process and transform data before training a model and making predictions. The main goal of a pipeline is to automate and standardize the workflow, which facilitates reproducibility and the deployment of models in production.

A typical machine learning pipeline consists of the following stages:

1. **Data preprocessing**: In this stage, various tasks are performed to prepare the data before training the model. This may include data cleaning, handling of missing values, normalization or standardization of variables, encoding of categorical variables, and selection of relevant features, among others.

2. **Data Splitting**: It is common to split data into training, validation, and test sets. The training set is used to train the model, the validation set is used to tune hyperparameters and evaluate performance during model tuning, and the test set is used to evaluate the final performance of the model.

3. **Model training**: In this stage, the machine learning model is selected and trained using the training data. This involves tuning the model parameters based on the training data to minimize the loss function or maximize the target performance metric.

4. **Model validation**: After training the model, its performance is evaluated using the validation data. This allows tuning the model's hyperparameters and making comparisons between different configurations or algorithms.

5. **Model evaluation**: Once the hyperparameters have been tuned and the best model has been selected, its final performance is evaluated using the test set. This provides an unbiased estimate of model performance on unseen data.

6. **Predictions**: Finally, the trained model is used to make predictions on new data or real-time data.

Using a pipeline in machine learning has several benefits, such as automating repetitive tasks, standardizing workflow, ease of reproducing results, and the ability to scale and apply the model in production efficiently.

It is important to note that the specific steps and their order can vary depending on the problem and project requirements. Additional stages, such as model selection or hyperparameter optimization, can also be included.
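As a rough illustration of these stages, the following sketch chains preprocessing and a model with scikit-learn's `make_pipeline` and evaluates it on a chronological hold-out. The synthetic data, the 160/40 split and the variable names are assumptions made for this example only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and target (assumption: stand-ins for real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)
X[rng.random(X.shape) < 0.05] = np.nan  # inject some missing values

# Data splitting: keep the order of the rows, as we would with time series.
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

# Preprocessing and model in one object: imputation, scaling, regressor.
pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    RandomForestRegressor(n_estimators=50, random_state=0),
)

# Training fits every step on the training data only.
pipe.fit(X_train, y_train)

# Prediction reapplies the fitted preprocessing before calling the model.
preds = pipe.predict(X_test)
print(preds[:3])
```

Because the imputer and the scaler live inside the pipeline, their statistics come from the training rows only, which avoids leaking information from the hold-out set.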

# **Loading libraries and Data** <a class="anchor" id="2"></a>

[Table of Contents](#0.1)

In [1]:
# Handling and processing of Data
# ==============================================================================
import numpy as np
import pandas as pd

import scipy.stats as stats

# Handling and processing of Data for Date (time)
# ==============================================================================
import datetime as dt
import time
from datetime import datetime, timedelta

# Statistical tests and time series tools
# ==============================================================================
from statsmodels.tsa.stattools import adfuller
import statsmodels.api as sm
import statsmodels.tsa.api as smt
from statsmodels.tsa.seasonal import seasonal_decompose 

In [2]:
# Models (XGBoost and scikit-learn)
# ==============================================================================
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor

# Mlforecast
# ==============================================================================
from mlforecast import MLForecast
from numba import njit
from window_ops.expanding import expanding_mean
from window_ops.rolling import rolling_mean
from window_ops.ewm import ewm_mean
from mlforecast.target_transforms import Differences

from mlforecast.utils import PredictionIntervals
from mlforecast.utils import generate_daily_series, generate_prices_for_series
from utilsforecast.plotting import plot_series

In [3]:
# Plot
# ==============================================================================
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
plt.style.use('grayscale') # fivethirtyeight  grayscale  classic
plt.rcParams['lines.linewidth'] = 1.5

# Define the plot size
# ==============================================================================

plt.rcParams['figure.figsize'] = (18,7)

# Hide warnings
# ==============================================================================
import warnings
warnings.filterwarnings("ignore")

## **Read data**

We will use data from the [City of Los Angeles website traffic dataset](https://www.kaggle.com/datasets/cityofLA/lacity.org-website-traffic) on Kaggle.

This dataset contains the number of user sessions for the City of Los Angeles website for each day from 2014 to 2019.

Sessions are periods of time when a user is active on a website.

In [4]:
df = pd.read_csv("lacity.org-website-traffic.csv")
df.head()

Unnamed: 0,Date,Device Category,Browser,# of Visitors,Sessions,Bounce Rate
0,2014-01-01T00:00:00.000,desktop,Chrome,900,934,55.5675
1,2014-01-01T00:00:00.000,desktop,Firefox,692,761,40.8673
2,2014-01-01T00:00:00.000,desktop,Internet Explorer,1038,1107,31.2556
3,2014-01-01T00:00:00.000,desktop,Opera,35,35,100.0
4,2014-01-01T00:00:00.000,desktop,Safari,484,554,24.9097


In [5]:
df.columns = [k.lower().replace(' ', '_').replace('#', 'qnty') for k in df.columns]
df['date'] = pd.to_datetime(df.date).dt.normalize()
df

Unnamed: 0,date,device_category,browser,qnty_of_visitors,sessions,bounce_rate
0,2014-01-01,desktop,Chrome,900,934,55.5675
1,2014-01-01,desktop,Firefox,692,761,40.8673
2,2014-01-01,desktop,Internet Explorer,1038,1107,31.2556
3,2014-01-01,desktop,Opera,35,35,100.0000
4,2014-01-01,desktop,Safari,484,554,24.9097
...,...,...,...,...,...,...
8348980,2019-08-27,mobile,Chrome,199,318,50.0000
8348981,2019-08-27,mobile,Firefox,40,40,100.0000
8348982,2019-08-27,mobile,Safari,199,199,79.8995
8348983,2019-08-27,tablet,Amazon Silk,40,40,100.0000


In [6]:
df.drop_duplicates(inplace=True)

In [7]:
# Aggregate the sessions by date and device category.
data = df[['date', 'device_category', 'sessions']].groupby(['date', 'device_category']).sum().reset_index()

# Fill any missing date with 0 so that every series has every day.
df_pivot = data.pivot(index='device_category', values=['sessions'], columns='date')
df_pivot.fillna(0, inplace=True)

df_pivot = df_pivot.stack().reset_index().sort_values(by=['date', 'device_category']).reset_index(drop=True)
df_pivot.columns = ['device', 'date', 'session']
data = df_pivot[['date', 'device', 'session']]

del df_pivot

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6195 entries, 0 to 6194
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         
 0   date     6195 non-null   datetime64[ns]
 1   device   6195 non-null   object        
 2   session  6195 non-null   float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 145.3+ KB


The input to `MLForecast` is always a data frame in long format with three columns: `unique_id`, `ds` and `y`:

* The `unique_id` (string, int or category) represents an identifier for the series.

* The `ds` (datestamp) column should be of a format expected by Pandas, ideally YYYY-MM-DD for a date or YYYY-MM-DD HH:MM:SS for a timestamp.

* The `y` (numeric) represents the measurement we wish to forecast.
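If the data starts out in wide format (one column per series), it can be reshaped into this long format with pandas. The sketch below uses a small made-up frame; the column names `date`, `desktop` and `mobile` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical wide table: one row per date, one column per series.
wide = pd.DataFrame({
    "date": pd.date_range("2014-01-01", periods=3, freq="D"),
    "desktop": [10.0, 12.0, 11.0],
    "mobile": [5.0, 6.0, 7.0],
})

# Melt into long format and use the column names MLForecast expects.
long_df = (
    wide.melt(id_vars="date", var_name="unique_id", value_name="y")
        .rename(columns={"date": "ds"})
        .sort_values(["unique_id", "ds"])
        .reset_index(drop=True)
)[["unique_id", "ds", "y"]]

print(long_df)
```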

In [9]:
data=data.rename(columns={"date": "ds","device":"unique_id", "session": "y"})
data.head()

Unnamed: 0,ds,unique_id,y
0,2014-01-01,desktop,616768.0
1,2014-01-01,mobile,287547.0
2,2014-01-01,tablet,15967.0
3,2014-01-02,desktop,2030513.0
4,2014-01-02,mobile,321557.0


# **Explore Data with the plot method** <a class="anchor" id="3"></a>

[Table of Contents](#0.1)

We are going to use the `plot_series` function to visualize our data.

In [10]:
fig=plot_series(data, palette="plasma", engine="matplotlib")
fig.savefig('../figs/pipeline_with_sklearn_and_xgboost_random_forest__eda.png')

![](../figs/pipeline_with_sklearn_and_xgboost_random_forest__eda.png)

## **The Augmented Dickey-Fuller Test**
An Augmented Dickey-Fuller (ADF) test is a statistical test that checks whether a unit root is present in a time series. A unit root makes a series non-stationary, which can lead to unreliable results in time series analysis. The null hypothesis of the test is that the series has a unit root. If we fail to reject the null hypothesis, we have no evidence against non-stationarity. If we reject it in favor of the alternative hypothesis, we have evidence that the series was generated by a stationary process (or a trend-stationary one, depending on how the test is specified). The ADF test statistic is typically negative, and the more negative it is, the stronger the rejection of the null hypothesis.

The hypotheses and the decision rule can be summarized as follows.

- Null hypothesis: the time series has a unit root, that is, it is non-stationary.
- Alternative hypothesis: the time series is stationary, that is, its statistical properties such as mean and variance do not change over time.

- ADF or t statistic < critical value: reject the null hypothesis; the time series is stationary.
- ADF or t statistic > critical value: fail to reject the null hypothesis; the time series is treated as non-stationary.

In [11]:
def augmented_dickey_fuller_test(series , column_name):
    print (f'Dickey-Fuller test results for columns: {column_name}')
    dftest = adfuller(series, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','No Lags Used','Number of observations used'])
    for key,value in dftest[4].items():
       dfoutput['Critical Value (%s)'%key] = value
    print (dfoutput)
    if dftest[1] <= 0.05:
        print("Conclusion:====>")
        print("Reject the null hypothesis")
        print("The data is stationary")
    else:
        print("Conclusion:====>")
        print("The null hypothesis cannot be rejected")
        print("The data is not stationary")

In [12]:
augmented_dickey_fuller_test(data["y"],"website-traffic")

Dickey-Fuller test results for columns: website-traffic
Test Statistic                -7.269285e+00
p-value                        1.604152e-10
No Lags Used                   3.400000e+01
Number of observations used    6.160000e+03
Critical Value (1%)           -3.431412e+00
Critical Value (5%)           -2.862009e+00
Critical Value (10%)          -2.567020e+00
dtype: float64
Conclusion:====>
Reject the null hypothesis
The data is stationary


# **Training A Multivariate Time Series Model With MLForecast**<a class="anchor" id="4"></a>

[Table of Contents](#0.1)

## **Building Model**

Let’s see how we can engineer features and train an `XGBoost` and `RandomForest` with mlforecast.

In this case we are going to create a `Pipeline` with the `make_pipeline` function. It uses `SimpleImputer` to fill in any null values, `StandardScaler` to standardize the features, and `PowerTransformer` to make their distributions more Gaussian-like before they reach the `RandomForestRegressor`. Remember that these transformers are fitted when the pipeline's `fit` method is called, and the fitted transformations are reapplied at prediction time.

Here the `StandardScaler` is part of the model pipeline, so it scales the features. Scaling could instead be applied to the target through the `target_transforms` parameter of `MLForecast`, but for the purposes of this tutorial we keep the `StandardScaler` inside the `Pipeline`.
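To illustrate the difference, scaling the target means standardizing `y` separately for each series and undoing the scaling on the forecasts. The sketch below shows that idea in plain pandas on made-up data; it is not mlforecast's implementation, and the column names `y_scaled` and `y_restored` are hypothetical.

```python
import numpy as np
import pandas as pd

# Two toy series on very different scales.
df = pd.DataFrame({
    "unique_id": ["a"] * 4 + ["b"] * 4,
    "y": [1.0, 2.0, 3.0, 4.0, 100.0, 200.0, 300.0, 400.0],
})

# Per-series statistics, the analogue of fitting a scaler on each series.
stats = df.groupby("unique_id")["y"].agg(["mean", "std"])
df = df.join(stats, on="unique_id")

# Forward transform: what the model would be trained on.
df["y_scaled"] = (df["y"] - df["mean"]) / df["std"]

# Inverse transform: what would be applied to the forecasts.
df["y_restored"] = df["y_scaled"] * df["std"] + df["mean"]

print(df[["unique_id", "y", "y_scaled", "y_restored"]])
```

After scaling, both series share the same range, so a single global model is not dominated by the series with the largest values.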

In [13]:

from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PowerTransformer

models = [make_pipeline(SimpleImputer(), 
                        StandardScaler(),
                        PowerTransformer(method='yeo-johnson'),
                        RandomForestRegressor(random_state=0, n_estimators=100)),
                        XGBRegressor(random_state=0, n_estimators=100)]

In [14]:
models

[Pipeline(steps=[('simpleimputer', SimpleImputer()),
                 ('standardscaler', StandardScaler()),
                 ('powertransformer', PowerTransformer()),
                 ('randomforestregressor',
                  RandomForestRegressor(random_state=0))]),
 XGBRegressor(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              predictor=None, ra

We fit the models by instantiating a new `MLForecast` object with the following parameters:

* `models:` a list of models. Import the models you want to use and pass them in this list.

* `freq:` a string indicating the frequency of the data. (See [pandas' available frequencies](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases).)

* `lags:` Lags of the target to use as features.

* `lag_transforms:` Mapping of target lags to their transformations.

* `date_features:` Features computed from the dates. Can be `pandas` date attributes or functions that will take the dates as input.

* `differences:` Differences to take of the target before computing the features. These are restored at the forecasting step.

* `num_threads:` Number of threads to use when computing the features.

* `target_transforms:` Transformations that will be applied to the target before computing the features and restored after the forecasting step.

All settings are passed into the constructor. Then you call its `fit` method and pass in the historical data frame.

In [15]:
from window_ops.expanding import expanding_mean
from window_ops.rolling import rolling_mean, rolling_max, rolling_min

The `lags` argument is a list of lags we want to use in our model. Lags are the number of steps in the past we want to use to predict the future.

So in our case, we are using the value of the target variable 1, 7, and 14 days before the timestamp date of the observation.

This is why it’s important to have all dates, even the ones with missing values and zeros in the datasets, so these features can be computed correctly.

We also pass the `lag_transforms` argument, which is a dictionary with the `lag` as the key and a list of functions as the value.

These functions will be applied to the lagged series. For example, the tuple `(rolling_mean, 7)` under the key `1` would produce the mean of the last 7 values of the target after shifting it by 1 day.

In the model below, lag 1 is transformed with `expanding_mean`, which yields the feature `expanding_mean_lag1`.

We are using the `window_ops` library which is optimized and recommended for MLForecast.

Each tuple in the list of transforms is a function and its arguments. For example, the `rolling_mean` function takes a window size as its argument.

The best value for the window size is specific for each time series, so we need to try different values and see which one works best.

The `date_features` argument is a list of date components we want to extract from the date column.
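To make lags, lag transforms and date features concrete, here is a minimal sketch in plain pandas. It only mimics the kind of columns these arguments produce; the toy frame, the window of 3 and the column name `rolling_mean_lag1` are assumptions for illustration.

```python
import pandas as pd

# One toy series with y = 1, 2, ..., 10.
df = pd.DataFrame({
    "unique_id": ["a"] * 10,
    "ds": pd.date_range("2014-01-01", periods=10, freq="D"),
    "y": [float(v) for v in range(1, 11)],
})

# Lags: the target shifted within each series.
grouped = df.groupby("unique_id")["y"]
df["lag1"] = grouped.shift(1)
df["lag7"] = grouped.shift(7)

# Lag transform: a rolling mean (window 3) computed on the lag-1 series.
df["rolling_mean_lag1"] = (
    df.groupby("unique_id")["lag1"].transform(lambda s: s.rolling(3).mean())
)

# Date features: components extracted from the timestamp.
df["month"] = df["ds"].dt.month
df["day"] = df["ds"].dt.day

print(df)
```

The first rows contain missing values because there is no history to shift from, which is why gaps in the calendar would distort these features.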

In [16]:
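mlf = MLForecast(models=models,
                   freq='D',
                   lags=[1,7,14],
                   # All transforms for a lag go in one list under a single key.
                   # Rolling features can be added to the same list, e.g.
                   # 1: [expanding_mean, (rolling_mean, 4), (rolling_min, 4), (rolling_max, 4)]
                   lag_transforms={1: [expanding_mean]},
                   date_features=['week', 'month','day'],
                   num_threads=6)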
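When several transforms target the same lag, they belong in one list under a single key, because a Python dict literal that repeats a key silently keeps only the last value. The short sketch below shows this with placeholder strings standing in for the transform functions (the strings are hypothetical and used only for illustration).

```python
# Repeating the key 1 does not merge the lists: the last value wins.
duplicated = {1: ["rolling_mean", "rolling_min"], 1: ["expanding_mean"]}
print(duplicated)

# Putting every transform in a single list keeps all of them.
merged = {1: ["rolling_mean", "rolling_min", "expanding_mean"]}
print(merged)
```

This is consistent with the preprocessed output below, where `expanding_mean_lag1` is the only transformed feature.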

With the `preprocess` method we can inspect all the features that are computed from the target variable before training the model.

In [17]:
prep = mlf.preprocess(data)
prep.head()

Unnamed: 0,ds,unique_id,y,lag1,lag7,lag14,expanding_mean_lag1,week,month,day
42,2014-01-15,desktop,2489993.0,2595781.0,3213820.0,616768.0,2018989.0,3,1,15
43,2014-01-15,mobile,158945.0,391598.0,1022962.0,287547.0,407467.5,3,1,15
44,2014-01-15,tablet,61490.0,88987.0,214816.0,15967.0,103891.7,3,1,15
45,2014-01-16,desktop,2134761.0,2489993.0,3522842.0,2030513.0,2050389.0,3,1,16
46,2014-01-16,mobile,339028.0,158945.0,691006.0,321557.0,390899.3,3,1,16


In [18]:
fig=plot_series(prep)
fig.savefig('../figs/pipeline_with_sklearn_and_xgboost_random_forest__prep.png')

![](../figs/pipeline_with_sklearn_and_xgboost_random_forest__prep.png)

In [19]:
prep.drop(columns=['unique_id', 'ds']).corr()['y']

y                      1.000000
lag1                   0.802928
lag7                   0.795004
lag14                  0.775561
expanding_mean_lag1    0.691355
week                   0.016375
month                  0.015591
day                   -0.023339
Name: y, dtype: float64

Here we can see the correlation between the target and each generated feature. This information can guide the choice of `lags` and `lag_transforms`. After training the models and making predictions, we will also plot the feature importances, which gives a firmer basis for adding or removing transformations.

## **Fit method**

In [20]:
mlf.fit(data, fitted=True,prediction_intervals=PredictionIntervals(n_windows=5, window_size=30, method="conformal_distribution"))

MLForecast(models=[Pipeline, XGBRegressor], freq=<Day>, lag_features=['lag1', 'lag7', 'lag14', 'expanding_mean_lag1'], date_features=['week', 'month', 'day'], num_threads=6)

## **Predict method with prediction intervals**

In [21]:
forecast_df = mlf.predict(horizon=30,level=[80,95])
forecast_df 

Unnamed: 0,unique_id,ds,Pipeline,XGBRegressor,Pipeline-lo-95,Pipeline-lo-80,Pipeline-hi-80,Pipeline-hi-95,XGBRegressor-lo-95,XGBRegressor-lo-80,XGBRegressor-hi-80,XGBRegressor-hi-95
0,desktop,2019-08-28,16656.18,28669.951172,-154525.58450,-63207.278,96519.638,187837.94450,-161211.866406,-51524.334766,108864.237109,218551.768750
1,desktop,2019-08-29,20496.38,30170.853516,-104696.24700,-73041.798,114034.558,145689.00700,-60974.853906,-58197.007422,118538.714453,121316.560938
2,desktop,2019-08-30,21187.80,29180.378906,14418.49425,14978.427,27397.173,27957.10575,16767.476416,17109.475977,41251.281836,41593.281396
3,desktop,2019-08-31,22426.90,22863.261719,-148873.45925,-73198.757,118052.557,193727.25925,-182880.302344,-86024.932031,131751.455469,228606.825781
4,desktop,2019-09-01,36605.67,18515.167969,-72885.17575,-68990.203,142201.543,146096.51575,-84823.936328,-72305.671094,109336.007031,121854.272266
...,...,...,...,...,...,...,...,...,...,...,...,...
85,tablet,2019-09-22,332.30,1718.857300,-227611.52400,-224512.356,225176.956,228276.12400,-209634.450122,-206278.888013,209716.602612,213072.164722
86,tablet,2019-09-23,356.66,1718.857300,-8433.62200,-8073.658,8786.978,9146.94200,-12324.783765,-11873.774341,15311.488940,15762.498364
87,tablet,2019-09-24,543.04,1718.857300,-230295.58825,-226853.473,227939.553,231381.66825,-265327.840356,-233968.120825,237405.835425,268765.554956
88,tablet,2019-09-25,391.96,2976.029541,-259706.93850,-200280.384,201064.304,260490.85850,-217907.899756,-202583.828271,208535.887354,223859.958838


We can see the predictions for the Pipeline (Random Forest) and the XGBRegressor models.

In [22]:
fig=plot_series(data, forecast_df, level=[80,95], max_insample_length=100,engine="matplotlib", palette="magma")
fig.get_axes()[0].set_title("Prediction of Pipeline with intervals")
fig.savefig('../figs/pipeline_with_sklearn_and_xgboost_random_forest__plot_forecasting_pipeline_intervals.png')

![](../figs/pipeline_with_sklearn_and_xgboost_random_forest__plot_forecasting_pipeline_intervals.png)

# **Feature importances** <a class="anchor" id="5"></a>

[Table of Contents](#0.1)

In [23]:
fig=pd.Series(mlf.models_['XGBRegressor'].feature_importances_, 
          index=mlf.ts.features_order_).sort_values(ascending=False).plot.bar(title='Feature Importance XGBRegressor')
plt.grid(True)
plt.savefig('../figs/pipeline_with_sklearn_and_xgboost_random_forest__plot_feature_importance.png',dpi=300)
plt.close()

![](../figs/pipeline_with_sklearn_and_xgboost_random_forest__plot_feature_importance.png)

# **Evaluate the model’s performance** <a class="anchor" id="6"></a>

[Table of Contents](#0.1)

In previous steps, we’ve taken our historical data to predict the future. However, to assess its accuracy we would also like to know how the model would have performed in the past. To assess the accuracy and robustness of your models on your data, perform cross-validation.

With time series data, Cross Validation is done by defining a sliding window across the historical data and predicting the period following it. This form of cross-validation allows us to arrive at a better estimation of our model’s predictive abilities across a wider range of temporal instances while also keeping the data in the training set contiguous as is required by our models.

The following graph depicts such a Cross Validation Strategy:

![](https://raw.githubusercontent.com/Nixtla/statsforecast/main/nbs/imgs/ChainedWindows.gif)

## **Perform time series cross-validation**

In order to get an estimate of how well our model will perform when predicting future data we can use cross-validation, which consists of training a few models independently on different subsets of the data, using them to predict a validation set and measuring their performance.

Since our data depends on time, we make our splits by removing the last portions of the series and using them as validation sets. This process is implemented in `MLForecast.cross_validation`.
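The splitting logic can be sketched as follows. This is an illustration of the idea, not mlforecast's code; the helper `cv_windows` and its defaults (non-overlapping windows, so the step equals the horizon) are assumptions.

```python
import pandas as pd

def cv_windows(dates, n_windows, h, step_size=None):
    """Return (cutoff, first validation date, last validation date) per window."""
    step_size = h if step_size is None else step_size
    dates = pd.DatetimeIndex(sorted(dates))
    windows = []
    for i in range(n_windows):
        # Number of trailing observations removed from training in window i.
        offset = h + (n_windows - 1 - i) * step_size
        cutoff = dates[-offset - 1]  # last date used for training
        start = len(dates) - offset
        valid = dates[start:start + h]  # the h dates right after the cutoff
        windows.append((cutoff, valid[0], valid[-1]))
    return windows

# Daily dates ending on the last day of the tutorial data.
dates = pd.date_range("2019-01-01", "2019-08-27", freq="D")
windows = cv_windows(dates, n_windows=5, h=30)
for cutoff, start, end in windows:
    print(cutoff.date(), start.date(), end.date())
```

With 5 windows of 30 days, the earliest cutoff is 2019-03-30 and the latest is 2019-07-28, so the last validation window ends on 2019-08-27.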

Cross-validation of time series models is considered a best practice, but most implementations are very slow. The `MLForecast` library implements cross-validation as a distributed operation, making the process less time-consuming to perform. If you have big datasets you can also perform cross-validation in a distributed cluster using Ray, Dask or Spark.

Depending on your computer, this step should take around 1 min.

The `cross_validation` method of the `MLForecast` class takes the following arguments.

* `df:` training data frame

* `window_size (int):` number of steps into the future that are forecasted in each window. In this case, 30 days ahead. Newer mlforecast releases name this argument `h`.

* `step_size (int):` step size between each window. In other words: how often do you want to run the forecasting processes.

* `n_windows (int):` number of windows used for cross-validation. In other words: what number of forecasting processes in the past do you want to evaluate.

In [24]:
cv_result = mlf.cross_validation(
    data,
    n_windows=5,  # number of models to train/splits to perform
    window_size=30,  # length of the validation set in each window
    prediction_intervals=PredictionIntervals(n_windows=5, window_size=30, method="conformal_distribution")
)

The `cv_result` object is a new data frame that includes the following columns:

* `unique_id:` identifier of the series.
* `ds:` datestamp or temporal index
* `cutoff:` the last datestamp or temporal index of the training data in each of the `n_windows`.
* `y:` true value
* `model:` one column per model, named after the model and holding its forecast.

In [25]:
cv_result

Unnamed: 0,unique_id,ds,cutoff,y,Pipeline,XGBRegressor
0,desktop,2019-03-31,2019-03-30,134535.0,192685.39,146677.234375
1,mobile,2019-03-31,2019-03-30,181469.0,220560.10,226645.937500
2,tablet,2019-03-31,2019-03-30,1848.0,7974.42,11840.647461
3,desktop,2019-04-01,2019-03-30,581078.0,556486.20,675617.250000
4,mobile,2019-04-01,2019-03-30,432169.0,402846.15,353045.562500
...,...,...,...,...,...,...
85,mobile,2019-08-26,2019-07-28,4853.0,61083.81,-35539.656250
86,tablet,2019-08-26,2019-07-28,199.0,2100.08,5553.020508
87,desktop,2019-08-27,2019-07-28,438.0,63590.21,21381.087891
88,mobile,2019-08-27,2019-07-28,557.0,69718.89,-34723.070312


# **Evaluate the model** <a class="anchor" id="7"></a>

[Table of Contents](#0.1)

We can now compute the accuracy of the forecast using an appropriate accuracy metric. Here we’ll use the Root Mean Squared Error (RMSE). To do this, we first need to install `datasetsforecast`, a Python library developed by Nixtla that includes a function to compute the RMSE.

`pip install datasetsforecast`

In [26]:
from datasetsforecast.losses import rmse

The function to compute the RMSE takes two arguments:

1. The actual values.
2. The forecasts, in this case from the `Pipeline` and `XGBRegressor` models.
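For reference, the RMSE is the square root of the mean of the squared differences between actual values and forecasts. The sketch below computes it with NumPy; `rmse_sketch` is a hypothetical helper written for illustration and is not the `datasetsforecast` implementation.

```python
import numpy as np

def rmse_sketch(y_true, y_pred):
    """Root Mean Squared Error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors are 1, 0 and -2, so the mean squared error is 5/3.
value = rmse_sketch([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(value)
```

Because the errors are squared before averaging, large misses weigh more heavily than small ones, and the result is expressed in the same units as the target.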

In [27]:
from datasetsforecast.losses import mse, mae, rmse

def evaluate_cross_validation(df, metric):
    models = df.drop(columns=['ds', 'cutoff', 'y']).columns.tolist()
    evals = []
    for model in models:
        eval_ = df.groupby(['unique_id', 'cutoff']).apply(lambda x: metric(x['y'].values, x[model].values)).to_frame() # Calculate loss for every unique_id, model and cutoff.
        eval_.columns = [model]
        evals.append(eval_)
    evals = pd.concat(evals, axis=1)
    evals = evals.groupby(['unique_id']).mean(numeric_only=True) # Averages the error metrics for all cutoffs for every combination of model and unique_id
    evals['best_model'] = evals.idxmin(axis=1)
    return evals

In [28]:
evaluation_df = evaluate_cross_validation(cv_result.set_index("unique_id"), rmse)

evaluation_df

unique_id,Pipeline,XGBRegressor,best_model
desktop,84655.000697,75792.889169,XGBRegressor
mobile,71030.403796,94662.025775,Pipeline
tablet,3852.438835,6845.755177,Pipeline


The result reports, for each `unique_id`, the `RMSE` of each model averaged over the cross-validation windows, together with the model that achieved the lowest error.
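The same aggregation can be traced by hand on a tiny example. The sketch below uses made-up cross-validation results with two hypothetical models, `model_1` and `model_2`, computes the RMSE per series and cutoff, averages over cutoffs, and picks the best model per series.

```python
import numpy as np
import pandas as pd

# Made-up cross-validation output: two series, two cutoffs, two models.
cv = pd.DataFrame({
    "unique_id": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "cutoff": [1, 1, 2, 2, 1, 1, 2, 2],
    "y": [10.0, 12.0, 11.0, 13.0, 100.0, 110.0, 105.0, 115.0],
    "model_1": [11.0, 12.0, 10.0, 13.0, 90.0, 100.0, 95.0, 105.0],
    "model_2": [10.0, 15.0, 11.0, 16.0, 100.0, 111.0, 105.0, 114.0],
})
models = ["model_1", "model_2"]

# Squared error of every forecast, then RMSE per (unique_id, cutoff).
sq_err = cv[models].sub(cv["y"], axis=0) ** 2
per_window = np.sqrt(sq_err.groupby([cv["unique_id"], cv["cutoff"]]).mean())

# Average the window-level RMSE over cutoffs and pick the lowest per series.
per_series = per_window.groupby(level="unique_id").mean()
per_series["best_model"] = per_series[models].idxmin(axis=1)

print(per_series)
```

Here `model_1` wins on series `a` and `model_2` on series `b`, which shows why a per-series comparison can be more informative than a single global score.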

In [29]:
cv_rmse = cv_result.groupby(['unique_id', 'cutoff']).apply(lambda df: rmse(df['y'], df['XGBRegressor'])).mean()
print("RMSE using cross-validation: ", cv_rmse)

RMSE using cross-validation:  59100.22337380755


In [30]:
cv_rmse = cv_result.groupby(['unique_id', 'cutoff']).apply(lambda df: rmse(df['y'], df['Pipeline'])).mean()
print("RMSE using cross-validation: ", cv_rmse)

RMSE using cross-validation:  53179.28110932804


In the previous two results, the `RMSE` is computed for each `unique_id` and `cutoff` of the cross-validation result and then averaged into a single number per model, first for the `XGBRegressor` and then for the `Pipeline`.

# **References** <a class="anchor" id="8"></a>

[Table of Contents](#0.1)


1. Changquan Huang and Alla Petukhina (2022). Applied Time Series Analysis and Forecasting with Python. Springer.
2. Ivan Svetunkov. [Forecasting and Analytics with the Augmented Dynamic Adaptive Model (ADAM)](https://openforecast.org/adam/)
3. [James D. Hamilton. Time Series Analysis Princeton University Press, Princeton, New Jersey, 1st Edition, 1994.](https://press.princeton.edu/books/hardcover/9780691042893/time-series-analysis)
4. [Nixtla Parameters for Mlforecast](https://nixtla.github.io/mlforecast/forecast.html).
5. [Pandas available frequencies](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases).
6. [Rob J. Hyndman and George Athanasopoulos (2018). “Forecasting principles and practice, Time series cross-validation”](https://otexts.com/fpp3/tscv.html).
7. [Seasonal periods- Rob J Hyndman](https://robjhyndman.com/hyndsight/seasonal-periods/).
8. [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html)