This article is a step-by-step guide to finding the `hyperparameters` of a *machine learning time series model* using `MLForecast`.

During this walkthrough, we will become familiar with the main `MLForecast` class and some relevant methods such as `MLForecast.fit`, `MLForecast.predict` and `MLForecast.cross_validation`, among others.

Let's start!!!

<a class="anchor" id="0.1"></a>



1.	[Introduction](#1)
2.	[What is Optuna?](#2)
3.	[Loading libraries and data](#3)
4.	[Explore Data with the plot method](#4)
5.	[Modeling with MLForecast](#5)
6.	[How To Tune XGBoost Hyperparameters With Optuna](#6)
7.	[References](#7)

# **Introduction** <a class="anchor" id="1"></a>

[Table of Contents](#0.1)

When designing machine learning models, optimization is always an important issue. It is generally a tedious process that is usually left to the very end of the production cycle because of the time it requires. In this notebook, however, we focus on finding good hyperparameters for a machine learning model applied to time series.

In machine learning, hyperparameters are settings that are fixed before training begins, unlike the other parameters of a model, which are learned from the data.

Hyperparameters belong to the training algorithm and characterize the architecture of a model. Tuning them well can greatly affect the performance of the model.

To address this need for models that require fine-tuning, we will use Optuna, which offers features that make it easier to search over hyperparameter settings.

Hyperparameter search amounts to choosing a set of candidate values for each hyperparameter, forming combinations of them, and evaluating the model with each combination. Some methods perform an exhaustive search, while others evaluate only selected points of the search space. These methods are effective and generally allow us to obtain optimal or near-optimal hyperparameters.
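
To make the difference between an exhaustive search and a partial search concrete, here is a minimal sketch in plain Python. The search space and the `toy_score` function are hypothetical stand-ins invented for this illustration; in practice the score would come from fitting and validating a model.

```python
import itertools
import random

# Hypothetical search space, chosen only for illustration
search_space = {
    "max_depth": [2, 4, 6],
    "learning_rate": [0.01, 0.1, 0.3],
}

def toy_score(params):
    """Stand-in for a validation error (lower is better); not a real model."""
    return (params["max_depth"] - 4) ** 2 + (params["learning_rate"] - 0.1) ** 2

# Exhaustive (grid) search: evaluate every combination of values
names = list(search_space)
grid = [dict(zip(names, values)) for values in itertools.product(*search_space.values())]
best_grid = min(grid, key=toy_score)

# Random search: evaluate only a few randomly chosen combinations
rng = random.Random(0)
sampled = rng.sample(grid, k=4)
best_random = min(sampled, key=toy_score)

print(len(grid), best_grid)
print(len(sampled), best_random)
```

The exhaustive search always finds the best point of the grid, at the cost of evaluating all 9 combinations. The random search evaluates only 4 and may or may not include the best one.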

# **What is Optuna?** <a class="anchor" id="2"></a>
[Table of Contents](#0.1)

Optuna is an automatic hyperparameter optimization software framework designed specifically for machine learning tasks. It features an imperative, define-by-run style user API.

Thanks to the define-by-run API, code written with Optuna retains high modularity, and users can construct the “search spaces” for hyperparameters dynamically.

Optuna looks for good hyperparameter values through trial and error, evaluating many candidate configurations and keeping the one that performs best.

Optuna emerges as a hyperparameter optimization software under a new design criterion that is based on three fundamental ideas:

- define-by-run API that allows users to dynamically build and manipulate search spaces,
- efficient implementation that focuses on the optimal functionality of sampling strategies as well as pruning algorithms, and
- an easy-to-configure, versatile architecture that supports anything from lightweight experiments to large-scale experiments in distributed and parallel computing environments.

These design criteria make Optuna easy to use, flexible and scalable. Because of its scalability, large-scale experiments can be optimized in a parallel and distributed manner. Optuna is framework agnostic, that is, it can be easily integrated with any machine learning or deep learning framework, such as PyTorch, TensorFlow, Keras, Scikit-Learn, XGBoost, etc.

### **What is a hyperparameter?**

A hyperparameter is a parameter to control how a machine learning algorithm behaves. In deep learning, the learning rate, batch size, and number of training iterations are hyperparameters. Hyperparameters also include the number of layers and channels of the neural network. They are not, however, just numerical values. Things like using Momentum SGD or Adam in training are also considered hyperparameters.

It is almost impossible to get a machine learning algorithm to do the job without tuning the hyperparameters. The number of hyperparameters tends to be high, especially in deep learning, and performance is thought to largely depend on how we tune them. Most researchers and engineers using deep learning technology manually tune these hyperparameters and spend a significant amount of their time doing so.

## **Optimal Strategy for Optimization**

Optuna generally combines the following two strategies to find the best combination of hyperparameters.

### **Sampling strategy**

It uses a sampling algorithm to select promising combinations of parameters from the search space. It focuses on regions where hyperparameters are giving good results and ignores others, which saves time.

Optuna allows you to build and manipulate hyperparameter search spaces dynamically. To sample configurations from the search space, Optuna provides two types of sampling:

- Relational sampling: these methods take into account information about the correlations between the parameters.
- Independent sampling: these methods sample each parameter independently of the others.

The Tree-structured Parzen Estimator (TPE) is the default sampler in Optuna. It uses the history of previously evaluated hyperparameter configurations to sample the following ones.


### **Pruning strategy**

It uses a pruning strategy that regularly checks the performance of the algorithm during training and prunes (terminates) the training for a particular combination of hyperparameters if it is not giving good results. This also saves time.

A pruning mechanism refers to the termination of unpromising trials during hyperparameter optimization. It periodically monitors the learning curve of each trial and then identifies the sets of hyperparameters that will not lead to a good result and should not be considered further.

The pruning mechanism implemented in Optuna is based on an asynchronous variant of the successive halving algorithm (SHA). Let's understand the general idea behind SHA:

- Assign the minimum amount of resources to each available hyperparameter configuration. The resources, for example, are the number of epochs, the number of training examples, the training duration, etc.
- Evaluate performance metrics of all configurations within allocated resources.
- Keep the top 1/η of the configurations (η is a reduction factor) with the best performance and discard the rest.
- Increase the minimum amount of resources per configuration by the factor η and repeat until the amount of resources per configuration reaches the maximum.
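
The steps above can be sketched in plain Python. This is a simplified, synchronous version of successive halving written for illustration only; it is not Optuna's asynchronous implementation, and `toy_loss`, the candidate learning rates and the resource limits are all hypothetical.

```python
import random

def successive_halving(configs, evaluate, min_resource=1, max_resource=27, eta=3):
    """Return the configuration that survives successive halving.

    `evaluate(config, resource)` must return a loss (lower is better).
    """
    survivors = list(configs)
    resource = min_resource
    while True:
        # Evaluate every surviving configuration with the current budget
        ranked = sorted(survivors, key=lambda c: evaluate(c, resource))
        if resource >= max_resource or len(ranked) == 1:
            return ranked[0]
        # Keep the best 1/eta of the configurations (at least one)
        survivors = ranked[: max(1, len(ranked) // eta)]
        # Give the survivors eta times more resources, up to the maximum
        resource = min(resource * eta, max_resource)

# Hypothetical loss: it shrinks as the resource grows, and learning
# rates close to 0.1 are better.
def toy_loss(config, resource):
    return abs(config["learning_rate"] - 0.1) + 1.0 / resource

rng = random.Random(0)
candidates = [{"learning_rate": rng.uniform(0.001, 0.3)} for _ in range(27)]
best = successive_halving(candidates, toy_loss)
print(best)
```

With 27 candidates and η = 3, the rounds evaluate 27, 9, 3 and 1 configurations using 1, 3, 9 and 27 units of resource, so most of the budget is spent on the candidates that looked promising early.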

# **Loading libraries and data** <a class="anchor" id="3"></a>

[Table of Contents](#0.1)

In [1]:
# Handling and processing of Data
# ==============================================================================
import numpy as np
import pandas as pd

import scipy.stats as stats

# Handling and processing of Data for Date (time)
# ==============================================================================
import time
from datetime import datetime, timedelta

# Statistical tests and time series tools
# ==============================================================================
from statsmodels.tsa.stattools import adfuller
import statsmodels.api as sm
import statsmodels.tsa.api as smt
from statsmodels.tsa.seasonal import seasonal_decompose

# Plotting of series
# ==============================================================================
from utilsforecast.plotting import plot_series

In [2]:
from mlforecast import MLForecast
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
# Feature engineering and target transformations
# ==============================================================================
from numba import njit
from window_ops.expanding import expanding_mean
from window_ops.rolling import rolling_mean
from window_ops.ewm import ewm_mean
from mlforecast.target_transforms import Differences

from mlforecast.utils import PredictionIntervals
from mlforecast.utils import generate_daily_series, generate_prices_for_series

In [3]:
# Plot
# ==============================================================================
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
plt.style.use('grayscale') # fivethirtyeight  grayscale  classic
plt.rcParams['lines.linewidth'] = 1.5
dark_style = {
    'figure.facecolor': '#008080',  # #212946
    'axes.facecolor': '#008080',
    'savefig.facecolor': '#008080',
    'axes.grid': True,
    'axes.grid.which': 'both',
    'axes.spines.left': False,
    'axes.spines.right': False,
    'axes.spines.top': False,
    'axes.spines.bottom': False,
    'grid.color': '#000000',  #2A3459
    'grid.linewidth': '1',
    'text.color': '0.9',
    'axes.labelcolor': '0.9',
    'xtick.color': '0.9',
    'ytick.color': '0.9',
    'font.size': 12 }
plt.rcParams.update(dark_style)
# Define the plot size
# ==============================================================================

plt.rcParams['figure.figsize'] = (18,7)

# Hide warnings
# ==============================================================================
import warnings
warnings.filterwarnings("ignore")

## **Read Data**

In [4]:
df=pd.read_csv("https://raw.githubusercontent.com/Naren8520/Serie-de-tiempo-con-Machine-Learning/main/Data/tipos_malarias_choco_colombia.csv",parse_dates=["semanas"],sep=";",usecols=[0,3] )
df.head()

| | semanas | malaria_vivax |
|---|---|---|
| 0 | 2007-12-31 | 61.0 |
| 1 | 2008-01-07 | 87.0 |
| 2 | 2008-01-14 | 64.0 |
| 3 | 2008-01-21 | 69.0 |
| 4 | 2008-01-28 | 54.0 |


The input to `MLForecast` is always a data frame in long format with three columns: `unique_id`, `ds` and `y`:

* The `unique_id` (string, int or category) represents an identifier for the series.

* The `ds` (datestamp) column should be of a format expected by Pandas, ideally YYYY-MM-DD for a date or YYYY-MM-DD HH:MM:SS for a timestamp.

* The `y` (numeric) represents the measurement we wish to forecast.

In [5]:
df.dropna(inplace=True)
df["unique_id"]="1"
df.columns=["ds", "y", "unique_id"]
df.head()

| | ds | y | unique_id |
|---|---|---|---|
| 0 | 2007-12-31 | 61.0 | 1 |
| 1 | 2008-01-07 | 87.0 | 1 |
| 2 | 2008-01-14 | 64.0 | 1 |
| 3 | 2008-01-21 | 69.0 | 1 |
| 4 | 2008-01-28 | 54.0 | 1 |


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 782 entries, 0 to 782
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   ds         782 non-null    datetime64[ns]
 1   y          782 non-null    float64       
 2   unique_id  782 non-null    object        
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 24.4+ KB


# **Explore Data with the plot method** <a class="anchor" id="4"></a>

[Table of Contents](#0.1)

Plot the series using the `plot_series` function from `utilsforecast.plotting`. By default it plots up to 8 randomly selected series from the dataset and is useful for basic EDA. Our dataset contains a single series.

In [7]:
fig = plot_series(df)
fig.savefig('../figs/hiperparameters__eda.png')

![](../figs/hiperparameters__eda.png)

## **The Augmented Dickey-Fuller Test**
An Augmented Dickey-Fuller (ADF) test is a statistical test that determines whether a unit root is present in time series data. Unit roots can cause unpredictable results in time series analysis. The null hypothesis of the test is that the series has a unit root, that is, that it is non-stationary. If we fail to reject the null hypothesis, we have no evidence against non-stationarity. If we reject it in favor of the alternative hypothesis, we have evidence that the data were generated by a stationary (or trend-stationary) process. The ADF test statistic is typically negative, and the more negative it is, the stronger the rejection of the null hypothesis.

The Augmented Dickey-Fuller test is commonly used to check whether a given time series is stationary. It is set up with the following null and alternative hypotheses:

- Null hypothesis: the time series is non-stationary, that is, it has a time-dependent structure.
- Alternative hypothesis: the time series is stationary, that is, its statistical properties do not depend on time.

The decision rule is:

- ADF statistic < critical value: reject the null hypothesis; the time series is stationary.
- ADF statistic > critical value: fail to reject the null hypothesis; the time series is non-stationary.

In [8]:
def augmented_dickey_fuller_test(series , column_name):
    print (f'Dickey-Fuller test results for columns: {column_name}')
    dftest = adfuller(series, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','No Lags Used','Number of observations used'])
    for key,value in dftest[4].items():
       dfoutput['Critical Value (%s)'%key] = value
    print (dfoutput)
    if dftest[1] <= 0.05:
        print("Conclusion:====>")
        print("Reject the null hypothesis")
        print("The data is stationary")
    else:
        print("Conclusion:====>")
        print("The null hypothesis cannot be rejected")
        print("The data is not stationary")

In [9]:
augmented_dickey_fuller_test(df["y"],"Vivax Malaria")

Dickey-Fuller test results for columns: Vivax Malaria
Test Statistic                  -3.961578
p-value                          0.001626
No Lags Used                    13.000000
Number of observations used    768.000000
Critical Value (1%)             -3.438893
Critical Value (5%)             -2.865311
Critical Value (10%)            -2.568778
dtype: float64
Conclusion:====>
Reject the null hypothesis
The data is stationary
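
As a small worked check of the decision rule, the numbers printed above can be compared directly. The values in this sketch are copied from that output; nothing is recomputed.

```python
# Values copied from the test output shown above
test_statistic = -3.961578
p_value = 0.001626
critical_values = {"1%": -3.438893, "5%": -2.865311, "10%": -2.568778}

# Reject the unit-root null when the statistic is below the critical value
decisions = {level: test_statistic < cv for level, cv in critical_values.items()}
print(decisions)  # {'1%': True, '5%': True, '10%': True}
print("stationary at the 5% level:", decisions["5%"] and p_value <= 0.05)
```

The statistic is below all three critical values, which agrees with the p-value being smaller than 0.05.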


## **Autocorrelation plots**

### **Autocorrelation Function**

**Definition 1.** Let $\{x_t; 1 \le t \le n\}$ be a time series sample of size $n$ from $\{X_t\}$.
1. $\bar x = \sum_{t=1}^n \frac{x_t}{n}$ is called the sample mean of $\{X_t\}$.
2. $c_k = \sum_{t=1}^{n-k} (x_{t+k} - \bar x)(x_t - \bar x)/n$ is known as the sample autocovariance function of $\{X_t\}$.
3. $r_k = c_k / c_0$ is said to be the sample autocorrelation function of $\{X_t\}$.

Note the following remarks about this definition:
 
* Like most literature, this guide uses ACF to denote the sample autocorrelation function as well as the autocorrelation function. What is denoted by ACF can easily be identified in context.

* Clearly $c_0$ is the sample variance of $\{X_t\}$. Besides, $r_0 = c_0/c_0 = 1$ and for any integer $k$, $|r_k| \le 1$.

* When we compute the ACF of any sample series with a fixed length $n$, we cannot put too much confidence in the values of $r_k$ for large $k$, since fewer pairs $(x_{t+k}, x_t)$ are available for calculating $r_k$ when $k$ is large. One rule of thumb is not to estimate $r_k$ for $k > n/3$, and another is to require $n \ge 50$ and $k \le n/4$. In any case, it is always a good idea to be careful.

* We also compute the ACF of a nonstationary time series sample by Definition 1. In this case, however, the ACF or $r_k$ very slowly or hardly tapers off as $k$ increases.

* Plotting the ACF $(r_k)$ against lag $k$ is easy but very helpful in analyzing time series sample. Such an ACF plot is known as a correlogram.

* If $\{X_t\}$ is stationary with $E(X_t)=0$ and $\rho_k = 0$ for all $k \neq 0$, that is, it is a white noise series, then the sampling distribution of $r_k$ is asymptotically normal with mean 0 and variance $1/n$. Hence, there is about a 95% chance that $r_k$ falls in the interval $[-1.96/\sqrt{n}, 1.96/\sqrt{n}]$.

Now we can give a summary that (1) if the time series plot of a time series clearly shows a trend or/and seasonality, it is surely nonstationary; (2) if the ACF $r_k$ very slowly or hardly tapers off as lag $k$ increases, the time series should also be nonstationary.
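
Definition 1 and the remarks above can be illustrated with a short sketch in plain Python. The two series are simulated for this example (white noise, and the same noise plus a linear trend); they are not the malaria data.

```python
import math
import random

def sample_acf(x, max_lag):
    """Sample autocorrelations r_0..r_max_lag following Definition 1."""
    n = len(x)
    mean = sum(x) / n

    def c(k):
        # Sample autocovariance at lag k (note the division by n, not n - k)
        return sum((x[t + k] - mean) * (x[t] - mean) for t in range(n - k)) / n

    c0 = c(0)
    return [c(k) / c0 for k in range(max_lag + 1)]

rng = random.Random(42)
noise = [rng.gauss(0, 1) for _ in range(400)]          # simulated white noise
trend = [0.05 * t + e for t, e in enumerate(noise)]    # same noise plus a linear trend

band = 1.96 / math.sqrt(len(noise))                    # approximate 95% band for white noise
acf_noise = sample_acf(noise, 20)
acf_trend = sample_acf(trend, 20)

inside = sum(abs(r) <= band for r in acf_noise[1:])
print(f"white noise: {inside}/20 lags inside +/-{band:.3f}")
print(f"trend series: r_1={acf_trend[1]:.3f}, r_20={acf_trend[20]:.3f}")
```

For the white noise series most autocorrelations should fall inside the band, while for the trending series the ACF stays high and tapers off very slowly, which is the nonstationarity pattern described in the summary above.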

In [10]:
fig, axs = plt.subplots(nrows=1, ncols=2)

plot_acf(df["y"],  lags=30, ax=axs[0],color="fuchsia")
axs[0].set_title("Autocorrelation");

# Partial autocorrelation plot
plot_pacf(df["y"],  lags=30, ax=axs[1],color="lime")
axs[1].set_title('Partial Autocorrelation')
plt.savefig("../figs/hiperparameters__autocorrelation.png")
plt.close();

![](../figs/hiperparameters__autocorrelation.png)

## **Decomposition of the time series**

How to decompose a time series and why?

In time series analysis, forecasting new values requires understanding past data. More formally, it is important to know the patterns that the values follow over time, because a forecast can go wrong for many reasons when those patterns are ignored. A time series can be described by four components, and variation in those components changes the pattern of the series. These components are:

* **Level:** The average value of the series over time.
* **Trend:** The long-term increase or decrease of the values in the series.
* **Seasonality:** A pattern that repeats at regular intervals and causes short-term increases or decreases in the series.
* **Residual/Noise:** The random variations in the time series.

Combining these components over time produces the time series. Most time series consist of a level and noise/residual, while trend and seasonality are optional.

If seasonality and trend are part of the time series, they affect the forecast values, because the pattern of the forecast period may differ from the pattern observed earlier.

The components of a time series can be combined in two ways:
* Additive
* Multiplicative
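
The difference between the two kinds of combination can be seen on a toy series. The sketch below is a simplified illustration with made-up components (no noise term, and a seasonal factor expressed as a percentage in the multiplicative case); it is not how `seasonal_decompose` works internally.

```python
import math

n, period = 48, 12
level = 100.0
trend = [0.5 * t for t in range(n)]                                      # slow upward drift
season = [10.0 * math.sin(2 * math.pi * t / period) for t in range(n)]   # repeating pattern

# Additive: components are summed, so the seasonal swing keeps the same size
additive = [level + trend[t] + season[t] for t in range(n)]

# Multiplicative: components are multiplied, so the seasonal swing grows with the level
multiplicative = [(level + trend[t]) * (1 + season[t] / 100.0) for t in range(n)]

def swing(series, start):
    """Peak-to-trough range over one seasonal period starting at `start`."""
    window = series[start:start + period]
    return max(window) - min(window)

print("additive swing, first vs last period:      ",
      round(swing(additive, 0), 2), round(swing(additive, 36), 2))
print("multiplicative swing, first vs last period:",
      round(swing(multiplicative, 0), 2), round(swing(multiplicative, 36), 2))
```

A series whose seasonal swings stay roughly constant is a candidate for the additive model, while swings that grow with the level of the series suggest the multiplicative model.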

### Additive

In [11]:
a = seasonal_decompose(df["y"], model = "additive", period=24).plot()
a.savefig('../figs/hiperparameters__seasonal_decompose_aditive.png')
plt.close()

![](../figs/hiperparameters__seasonal_decompose_aditive.png)

### Multiplicative

In [12]:
m = seasonal_decompose(df["y"], model = "multiplicative", period=24).plot()
m.savefig('../figs/hiperparameters__seasonal_decompose_multiplicative.png')
plt.close();

![](../figs/hiperparameters__seasonal_decompose_multiplicative.png)

# **Modeling with MLForecast** <a class="anchor" id="5"></a>

[Table of Contents](#0.1)

## **Building Model**

We define the model that we want to use. For our example, we use the `XGBoost` model.

In [13]:
model = [XGBRegressor(n_estimators=100,
                      random_state=1234,
                      n_jobs=-1)]

We fit the models by instantiating a new `MLForecast` object with the following parameters:

* `models:` a list of models. Import the regressors you want to use and pass them in this list.

* `freq:` a string indicating the frequency of the data. (See [pandas' available frequencies](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases).)

* `lags:` Lags of the target to use as features.

* `lag_transforms:` Mapping of target lags to their transformations.

* `date_features:` Features computed from the dates. Can be `pandas` date attributes or functions that will take the dates as input.

* `differences:` Differences to take of the target before computing the features. These are restored at the forecasting step.

* `num_threads:` Number of threads to use when computing the features.

* `target_transforms:` Transformations that will be applied to the target before computing the features and restored after the forecasting step.

All settings are passed to the constructor. Then you call its `fit` method and pass in the historical data frame.

In [14]:
from window_ops.expanding import expanding_mean
from window_ops.rolling import rolling_mean

In [15]:
def even_day(dates):
    return dates.day % 2 == 0

In [16]:
mlf = MLForecast(models=model,
                 freq='W', 
                 lags=[1,7,14],
                 lag_transforms={1: [expanding_mean], 12: [(rolling_mean, 7)] },
                 date_features=['dayofweek', 'month', even_day],
                 num_threads=2,        
                 )

In [17]:
prep = mlf.preprocess(df)
prep

| | ds | y | unique_id | lag1 | lag7 | lag14 | expanding_mean_lag1 | rolling_mean_lag12_window_size7 | dayofweek | month | even_day |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 18 | 2008-05-05 | 95.0 | 1 | 52.0 | 42.0 | 54.0 | 63.888889 | 64.142857 | 0 | 5 | False |
| 19 | 2008-05-12 | 76.0 | 1 | 95.0 | 64.0 | 58.0 | 65.526316 | 67.428571 | 0 | 5 | True |
| 20 | 2008-05-19 | 64.0 | 1 | 76.0 | 76.0 | 56.0 | 66.050000 | 64.571429 | 0 | 5 | False |
| 21 | 2008-05-26 | 114.0 | 1 | 64.0 | 80.0 | 84.0 | 65.952381 | 63.142857 | 0 | 5 | True |
| 22 | 2008-06-02 | 150.0 | 1 | 114.0 | 60.0 | 67.0 | 68.136364 | 59.000000 | 0 | 6 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 778 | 2022-11-28 | 143.0 | 1 | 186.0 | 161.0 | 207.0 | 154.800515 | 222.285714 | 0 | 11 | True |
| 779 | 2022-12-05 | 155.0 | 1 | 143.0 | 195.0 | 236.0 | 154.785347 | 217.000000 | 0 | 12 | False |
| 780 | 2022-12-12 | 134.0 | 1 | 155.0 | 186.0 | 206.0 | 154.785623 | 204.428571 | 0 | 12 | True |
| 781 | 2022-12-19 | 103.0 | 1 | 134.0 | 168.0 | 150.0 | 154.758974 | 196.000000 | 0 | 12 | False |
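
To see what the lag columns contain, here is a minimal sketch in plain Python that rebuilds a lag, an expanding mean of the lag and a rolling mean of the lag on the first five values of the series. The helper functions are written for this illustration and are not the `window_ops` or `MLForecast` implementations; a small lag and window are used so the result fits on a few rows.

```python
def lag(values, k):
    """Shift the series k steps; the first k positions have no history."""
    return [None] * k + values[:len(values) - k]

def expanding_mean_of(values):
    """Mean of all available values up to and including each position."""
    out, total, count = [], 0.0, 0
    for v in values:
        if v is None:
            out.append(None)
            continue
        total += v
        count += 1
        out.append(total / count)
    return out

def rolling_mean_of(values, window):
    """Mean of the last `window` values; None until a full window is available."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        if len(chunk) < window or any(v is None for v in chunk):
            out.append(None)
        else:
            out.append(sum(chunk) / window)
    return out

y = [61.0, 87.0, 64.0, 69.0, 54.0]  # first five values of the series shown earlier

lag1 = lag(y, 1)
expanding_mean_lag1 = expanding_mean_of(lag1)
rolling_mean_lag1_w2 = rolling_mean_of(lag1, 2)  # toy lag and window, smaller than in the notebook

for row in zip(y, lag1, expanding_mean_lag1, rolling_mean_lag1_w2):
    print(row)
```

Rows whose features are not yet available contain `None` here. This is consistent with the preprocessed frame above starting at index 18: a rolling mean of window 7 computed on lag 12 needs 12 + 7 - 1 = 18 earlier observations, and the rows before that appear to have been dropped.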


## **Fit method**

In [18]:
# fit the models
mlf.fit(df,  
 fitted=True)

MLForecast(models=[XGBRegressor], freq=<Week: weekday=6>, lag_features=['lag1', 'lag7', 'lag14', 'expanding_mean_lag1', 'rolling_mean_lag12_window_size7'], date_features=['dayofweek', 'month', <function even_day at 0x16283c310>], num_threads=2)

Let us now look at the fitted values of our model, in this case the `XGBRegressor` model:

In [19]:
result=mlf.forecast_fitted_values()
result=result.set_index("unique_id")
result

| unique_id | ds | y | XGBRegressor |
|---|---|---|---|
| 1 | 2008-05-05 | 95.0 | 94.297630 |
| 1 | 2008-05-12 | 76.0 | 78.370583 |
| 1 | 2008-05-19 | 64.0 | 67.504013 |
| 1 | 2008-05-26 | 114.0 | 112.061485 |
| 1 | 2008-06-02 | 150.0 | 147.969345 |
| ... | ... | ... | ... |
| 1 | 2022-11-28 | 143.0 | 143.187683 |
| 1 | 2022-12-05 | 155.0 | 155.217651 |
| 1 | 2022-12-12 | 134.0 | 133.981125 |
| 1 | 2022-12-19 | 103.0 | 104.260315 |


In [20]:
from statsmodels.stats.diagnostic import normal_ad
from scipy import stats

# Residuals are the difference between the observed and the fitted values
residuals = result["y"] - result["XGBRegressor"]

sw_result = stats.shapiro(residuals)
ad_result = normal_ad(np.array(residuals), axis=0)
dag_result = stats.normaltest(residuals, axis=0, nan_policy='propagate')

Next we check whether the residuals of the fitted model are normally distributed. To do so, we use a Q-Q plot and test normality with the Shapiro-Wilk, Anderson-Darling, and D’Agostino K^2 tests.

The Q-Q (quantile-quantile) plot compares the quantiles of the data sample with the quantiles of a normal distribution, in such a way that if the data are normally distributed, the points form a straight line.

The three normality tests determine how likely a data sample is from a normally distributed population using p-values. The null hypothesis for each test is that "the sample came from a normally distributed population". This means that if the resulting p-values are below a chosen alpha value, then the null hypothesis is rejected. Thus there is evidence to suggest that the data comes from a non-normal distribution. For this article, we will use an alpha value of 0.01.

In [21]:
result=mlf.forecast_fitted_values()
residuals = result["y"] - result["XGBRegressor"]
fig, axs = plt.subplots(nrows=2, ncols=2)

# Residuals over time
residuals.plot(ax=axs[0,0])
axs[0,0].set_title("Residuals model");

# Histogram of the residuals
axs[0,1].hist(residuals, density=True,bins=50, alpha=0.5 )
axs[0,1].set_title("Density plot - Residual");

# Q-Q plot annotated with the p-values of the normality tests
stats.probplot(residuals, dist="norm", plot=axs[1,0])
axs[1,0].set_title('Plot Q-Q')
axs[1,0].annotate("SW p-val: {:.4f}".format(sw_result[1]), xy=(0.05,0.9), xycoords='axes fraction', fontsize=15,
            bbox=dict(boxstyle="round", fc="none", ec="gray", pad=0.6))

axs[1,0].annotate("AD p-val: {:.4f}".format(ad_result[1]), xy=(0.05,0.8), xycoords='axes fraction', fontsize=15,
            bbox=dict(boxstyle="round", fc="none", ec="gray", pad=0.6))

axs[1,0].annotate("DAG p-val: {:.4f}".format(dag_result[1]), xy=(0.05,0.7), xycoords='axes fraction', fontsize=15,
            bbox=dict(boxstyle="round", fc="none", ec="gray", pad=0.6))
# Autocorrelation of the residuals
plot_acf(residuals,  lags=35, ax=axs[1,1],color="fuchsia")
axs[1,1].set_title("Autocorrelation");

plt.savefig("../figs/hiperparameters__plot_residual_model.png")
plt.close();

![](../figs/hiperparameters__plot_residual_model.png)

## **Predict method**

In [22]:
from datetime import timedelta

forecast_df = mlf.predict(h=30)
# freq='W' yields week-ending (Sunday) dates; shift one day to match the Monday dates of the data
forecast_df["ds"]=forecast_df["ds"]+timedelta(days=1)
forecast_df

| | unique_id | ds | XGBRegressor |
|---|---|---|---|
| 0 | 1 | 2023-01-02 | 136.666107 |
| 1 | 1 | 2023-01-09 | 206.950119 |
| 2 | 1 | 2023-01-16 | 201.778046 |
| 3 | 1 | 2023-01-23 | 202.578064 |
| 4 | 1 | 2023-01-30 | 205.729736 |
| 5 | 1 | 2023-02-06 | 183.612137 |
| 6 | 1 | 2023-02-13 | 172.488846 |
| 7 | 1 | 2023-02-20 | 227.618134 |
| 8 | 1 | 2023-02-27 | 225.309097 |
| 9 | 1 | 2023-03-06 | 235.971085 |


## **Plot prediction**

In [23]:
fig=plot_series(df, forecast_df, max_insample_length=500,engine="matplotlib")
for ax in fig.get_axes():
   ax.set_title("Prediction")
fig.savefig('../figs/hiperparameters__plot_forecasting_intervals.png')

![](../figs/hiperparameters__plot_forecasting_intervals.png)

Take a look at the feature importances too.

This model is heavily dependent on the lag 1 feature, as you can see in the feature importance plot for `XGBRegressor`.

In [24]:
fig=pd.Series(mlf.models_['XGBRegressor'].feature_importances_, 
              index=mlf.ts.features_order_).sort_values(ascending=False).plot.bar(title='Feature Importance XGBRegressor')
plt.savefig('../figs/hiperparameters__plot_feature_importance.png',dpi=300)
plt.close()

![](../figs/hiperparameters__plot_feature_importance.png)

# **How To Tune XGBoost Hyperparameters With Optuna** <a class="anchor" id="6"></a>

[Table of Contents](#0.1)

Let’s use Optuna, a straightforward hyperparameter tuning library for Python.

We need to wrap our code in an objective function that takes a `trial` object as an argument.

This object encapsulates one step of the optimization process.

### **Split the data into training and testing**

In [25]:
train = df[df.ds<='2022-05-30'] 
test = df[df.ds>'2022-05-30']

In [26]:
train.shape, test.shape

((752, 3), (30, 3))

In [27]:
from sklearn.metrics import mean_absolute_percentage_error
import optuna
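
The objective defined next scores each trial with the mean absolute percentage error (MAPE). As a reminder of what that number means, here is a minimal plain-Python sketch of the metric. It assumes that no actual value is zero, and the example values are made up for illustration.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned as a fraction (0.25 means 25%)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    # Assumes every actual value is non-zero
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

actual = [100.0, 200.0, 50.0]      # hypothetical observed values
predicted = [110.0, 180.0, 50.0]   # hypothetical forecasts

# Relative errors are 10%, 10% and 0%, so the mean is about 0.0667
print(mape(actual, predicted))
```

Because MAPE is an error, lower values are better, which matters when choosing the optimization direction of the study.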

In [28]:
# optimize XGBoost hyperparameters with optuna
def objective(trial):
    # create the regressor object
    lags = trial.suggest_int('lags', 14, 56, step=7)
    regressor = [XGBRegressor(n_estimators=trial.suggest_int('n_estimators', 70, 1000),
                             max_depth=trial.suggest_int('max_depth', 2, 6),
                             min_child_weight=trial.suggest_float('min_child_weight', 0, 6),
                             gamma=trial.suggest_float('gamma', 0.001, 6, log=True),
                             learning_rate=trial.suggest_float('learning_rate', 0.001, 0.3, log=True),
                             subsample=trial.suggest_float('subsample', 0.50, 1),
                             colsample_bytree=trial.suggest_float('colsample_bytree', 0.5, 1),
                             reg_lambda=trial.suggest_float('reg_lambda', 0.001, 5, log=True))]
    model = MLForecast(models=regressor,
                    freq='W',
                    lags=[1,7, lags],
                    lag_transforms={1: [expanding_mean], 12: [(rolling_mean, 7)] },)
    # fit model
    model.fit(train)
    # predict model
    p = model.predict(h=30)
    p["ds"]=p["ds"]+timedelta(days=1)
    p = p.merge(test, on=['unique_id', 'ds'], how='left')

    error = mean_absolute_percentage_error(p['y'], p['XGBRegressor'])

    return error
    
    

For each hyperparameter we want to tune, we use the `trial.suggest_*` methods.

This will sample a value for the hyperparameter from the specified distribution with different strategies that we can choose.
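
Several of the ranges in the objective use `log=True`, which asks for the value to be sampled in the log domain. The sketch below illustrates why that matters for a range such as 0.001 to 0.3. It is a plain-Python illustration of log-uniform sampling under that assumption, not Optuna's sampler.

```python
import math
import random

def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)
uniform_draws = [rng.uniform(0.001, 0.3) for _ in range(10_000)]
log_draws = [sample_log_uniform(rng, 0.001, 0.3) for _ in range(10_000)]

def share_below(draws, threshold):
    """Fraction of the draws that are smaller than `threshold`."""
    return sum(d < threshold for d in draws) / len(draws)

print("share below 0.01, uniform:    ", share_below(uniform_draws, 0.01))
print("share below 0.01, log-uniform:", share_below(log_draws, 0.01))
```

With uniform sampling only about 3% of the draws fall below 0.01, whereas log-uniform sampling spends roughly 40% of the draws there. For parameters such as the learning rate, whose useful values span several orders of magnitude, this gives small values a fair chance of being tried.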

For now, I will use the default strategy, which is the `Tree-structured Parzen Estimator`.

The ranges we picked for each hyperparameter are based on my own experience with XGBoost.

I added a `lags` hyperparameter to show that you can tune any number that is used in your code.

We are tuning the longest lag we are using, but we could tune everything, even the window sizes of the rolling transforms or even which transforms to use.

`max_depth` and `min_child_weight` control how deep (and complex) each tree can be.

Deeper trees can capture more complex patterns, but they are more likely to overfit.

On the other hand, requiring a minimum weight (roughly, a minimum number of samples) in each leaf node can help prevent overfitting because a node is not split unless its children have enough samples.

`subsample` and `colsample_bytree` define what percentage of the samples and features to randomly sample before building each tree.

This is a very good way to prevent overfitting.

After we define the objective function, we can create the study object and pass the objective function to it.

Let's try it!

In [29]:
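# begin optimization
#optuna.logging.set_verbosity(optuna.logging.WARNING) # won't print progress for every single trial
# NOTE: the objective returns an error (MAPE, lower is better), so maximizing it
# searches for the worst configuration rather than the best one
study = optuna.create_study(directions=['maximize'])
study.optimize(objective, n_trials=20)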

[I 2023-09-19 15:43:13,246] A new study created in memory with name: no-name-352e7b9d-fde2-438c-8cc4-282cea974102
[I 2023-09-19 15:43:13,611] Trial 0 finished with value: 0.6905306474799863 and parameters: {'lags': 14, 'n_estimators': 469, 'max_depth': 5, 'min_child_weight': 4.50163860564606, 'gamma': 0.013960193439367, 'learning_rate': 0.0013305504170062963, 'subsample': 0.8811803896252044, 'colsample_bytree': 0.5303358418038833, 'reg_lambda': 4.346896101328735}. Best is trial 0 with value: 0.6905306474799863.
[I 2023-09-19 15:43:14,028] Trial 1 finished with value: 0.38221777561508335 and parameters: {'lags': 21, 'n_estimators': 940, 'max_depth': 2, 'min_child_weight': 0.6102551304950998, 'gamma': 4.227154581592189, 'learning_rate': 0.07869555180432371, 'subsample': 0.7241820946807263, 'colsample_bytree': 0.7742561356512968, 'reg_lambda': 0.4348471621028174}. Best is trial 0 with value: 0.6905306474799863.
[I 2023-09-19 15:43:14,852] Trial 2 finished with value: 0.4342010164034546 an

Let's observe which are the best parameters that have been obtained.

In [30]:
best_params = study.best_trial.params
best_params

{'lags': 35,
 'n_estimators': 195,
 'max_depth': 5,
 'min_child_weight': 2.0120579027311982,
 'gamma': 0.004133037348170746,
 'learning_rate': 0.0017648808617812964,
 'subsample': 0.8144303458244111,
 'colsample_bytree': 0.6432445647169998,
 'reg_lambda': 0.5530873325444274}

In [31]:
study.best_value

0.8537706136901424

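Since the objective returns an error metric (MAPE), lower values are better, so the study should minimize it. The run above maximized the error and therefore kept the worst configuration it tried. We minimize by passing the parameter `direction='minimize'`: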

In [32]:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=20)

[I 2023-09-19 15:45:07,667] A new study created in memory with name: no-name-a93d3f8e-8d97-4966-9037-0781c6f20bc3
[I 2023-09-19 15:45:08,258] Trial 0 finished with value: 0.3923412045391076 and parameters: {'lags': 49, 'n_estimators': 654, 'max_depth': 5, 'min_child_weight': 4.608809928648014, 'gamma': 0.14677908331244996, 'learning_rate': 0.06883945998410401, 'subsample': 0.5008153054468771, 'colsample_bytree': 0.7292802265885936, 'reg_lambda': 0.03542846872595663}. Best is trial 0 with value: 0.3923412045391076.
[I 2023-09-19 15:45:08,361] Trial 1 finished with value: 0.5272550869750282 and parameters: {'lags': 28, 'n_estimators': 90, 'max_depth': 2, 'min_child_weight': 4.69818002249028, 'gamma': 0.05193291872225416, 'learning_rate': 0.017148826428957712, 'subsample': 0.8951282684657649, 'colsample_bytree': 0.6296805459514287, 'reg_lambda': 0.003014027615359843}. Best is trial 0 with value: 0.3923412045391076.
[I 2023-09-19 15:45:08,952] Trial 2 finished with value: 0.439830917139500

Let's observe which are the best parameters that have been obtained.

In [33]:
study.best_params

{'lags': 56,
 'n_estimators': 270,
 'max_depth': 3,
 'min_child_weight': 3.37279903829989,
 'gamma': 0.4128699413758573,
 'learning_rate': 0.00813196288011915,
 'subsample': 0.70885138767339,
 'colsample_bytree': 0.7953562005371594,
 'reg_lambda': 0.011815059048900754}

Now we train our model with the best parameters obtained from the minimizing study. The `lags` entry is a parameter of `MLForecast` rather than of XGBoost, so we separate it from the regressor parameters.

In [34]:
# Parameters of the minimizing study; `lags` is used by MLForecast, not by XGBRegressor
best_params = dict(study.best_params)
best_lag = best_params.pop("lags")
best_model = [XGBRegressor(**best_params)]

In [35]:
regressor = MLForecast(models=best_model,
                    freq='W',
                    lags=[1, 7, best_lag],
                    lag_transforms={1: [expanding_mean], 12: [(rolling_mean, 7)] },)

In [36]:
regressor.fit(df)

In [37]:
# Forecast with the tuned model (`regressor`), not with the earlier `mlf` object
forecast_df = regressor.predict(h=30)
forecast_df["ds"]=forecast_df["ds"]+timedelta(days=1)
forecast_df


In [38]:
fig=plot_series(df, forecast_df, max_insample_length=5000,engine="matplotlib")
for ax in fig.get_axes():
   ax.set_title("Prediction")
fig.savefig('../figs/hiperparameters__plot_best_model.png')

![](../figs/hiperparameters__plot_best_model.png)

# **References** <a class="anchor" id="7"></a>

[Table of Contents](#0.1)

1. Changquan Huang and Alla Petukhina (2022). Applied Time Series Analysis and Forecasting with Python. Springer.
2. Ivan Svetunkov. [Forecasting and Analytics with the Augmented Dynamic Adaptive Model (ADAM)](https://openforecast.org/adam/).
3. James D. Hamilton (1994). [Time Series Analysis](https://press.princeton.edu/books/hardcover/9780691042893/time-series-analysis). Princeton University Press, Princeton, New Jersey, 1st Edition.
4. [Nixtla Parameters for MLForecast](https://nixtla.github.io/mlforecast/forecast.html).
5. [Pandas available frequencies](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases).
6. Rob J. Hyndman and George Athanasopoulos (2018). [Forecasting: Principles and Practice, Time series cross-validation](https://otexts.com/fpp3/tscv.html).
7. Rob J. Hyndman. [Seasonal periods](https://robjhyndman.com/hyndsight/seasonal-periods/).
8. [Optuna](https://optuna.org).