# Introduction to time series forecasting and XGBoost

<a href="https://colab.research.google.com/drive/1djcRz-WZtsEtRtvBNvdhRQYBQ6xPytWw" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
</a>

Return to the [castle](https://github.com/Nkluge-correa/TeenyTinyCastle).

`Time series forecasting` is a predictive modeling technique used to forecast a variable's future values based on historical data. In a time series, data points are collected at regular intervals over time, and the goal of time series forecasting is to make accurate predictions about future values of the variable based on the patterns and trends observed in the historical data.

<img src="https://quantdare.com/wp-content/uploads/2021/02/3D_animation_clear_background.gif" alt="drawing" width="400"/>

[Source](https://quantdare.com/improving-time-series-animations-in-matplotlib-from-2d-to-3d/3d_animation_clear_background/).

`XGBoost` is a popular machine learning algorithm often used for time series forecasting. `XGBoost` stands for "_eXtreme Gradient Boosting_,"  a gradient boosting algorithm designed to work well with large and complex datasets.

> `Gradient boosting` is a machine learning technique for building predictive models, most commonly out of `decision trees`. It works by iteratively adding new `decision trees` to the model, each one trained to correct the errors of the trees added before it. `Gradient boosting` is a type of `ensemble learning` that combines multiple weaker models to create a more robust overall model.

`XGBoost` works by building a series of decision trees, where each tree is trained to predict the residual error left by the trees built before it. This allows the algorithm to capture complex nonlinear relationships and interactions between variables, making it a powerful tool for `time series forecasting`.
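
The residual-fitting idea can be sketched by hand. The snippet below is a minimal illustration, not XGBoost's actual implementation: it assumes a squared-error loss, uses scikit-learn's `DecisionTreeRegressor` as the weak learner, and runs on a made-up noisy sine curve. The names `learning_rate`, `n_trees`, and `prediction` are chosen only for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data (assumed for illustration): a noisy sine curve.
rng = np.random.default_rng(0)
X = np.linspace(0, 6, 200).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_trees = 50

# Start from a constant prediction (the mean of the target).
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_trees):
    residuals = y - prediction              # errors of the ensemble so far
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                  # the new tree learns those errors
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"MSE before boosting: {initial_mse:.4f}")
print(f"MSE after {n_trees} trees: {final_mse:.4f}")
```

Each shallow tree only nudges the prediction a little, which is why many trees and a small learning rate are usually combined.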

Overall, time series forecasting and `XGBoost` are essential techniques in machine learning, and they are used in a wide range of applications, including finance, economics, and meteorology.

In this notebook, we will work with a fake dataset containing a `sales history` of three years of a particular product (chocolate 🍫). Let us load our dataset.

> **Note**: all datasets and models related to the course and repo are in the Hub 🤗.

In [None]:
# the `-q` (quiet) flag makes the installation less "verbose"
%pip install datasets -q


In [None]:
from datasets import load_dataset

# Load the datasets from the hub
df = load_dataset('AiresPucrs/time-series-data', split='train')

# Turn the datasets into a pandas.DataFrame
df = df.to_pandas()

display(df.head())


Unnamed: 0,dates,product_id,sales
0,2020-01-01,chocolate,137.0
1,2020-01-02,chocolate,87.0
2,2020-01-03,chocolate,188.0
3,2020-01-04,chocolate,286.0
4,2020-01-05,chocolate,156.0


Let us look at our history of chocolate sales as a timeline.

In [None]:
import plotly.graph_objects as go

fig = go.Figure()

fig.add_trace(go.Scatter(x=df.dates,
                        y=df.sales,
                        name='Sales History', mode='lines'))

fig.update_xaxes(showgrid=False, showline=False, mirror=False)
fig.update_yaxes(showgrid=True, ticksuffix=' Kg')
fig.update_layout(
template='plotly_dark',
title=f"Chocolate Sales History",
paper_bgcolor='rgba(0, 0, 0, 0)',
plot_bgcolor='rgba(0, 0, 0, 0)',
)
fig.show()

This dataset looks quite noisy, but it could have some seasonal patterns.

> Note: A seasonal pattern is a recurring pattern that occurs at a fixed time interval within a time series. It is a type of pattern that occurs due to changes in the time of the year, such as weather, holidays, and other seasonal factors. For instance, sales of winter clothes tend to increase during the winter season, whereas sales of summer clothes tend to increase during the summer season.

Below, we print some important statistical information about our product's sales distribution.

Here's a brief explanation of each:

- `Mean`: The mean, or the average, is the sum of all values in a dataset divided by the total number of values.
- `Minimum`: The minimum is the smallest value in a dataset.
- `Maximum`: The maximum is the largest value in a dataset.
- `Variance`: The variance measures how the values in a dataset are spread out. It is calculated by averaging the squared differences between each value and the mean. A high variance indicates that the values are widely spread out, while a low variance suggests that the values are tightly clustered around the mean.
- `Standard deviation`: The standard deviation is the square root of the variance. It is also a measure of how spread out the values in a dataset are, but it is easier to interpret than the variance since it is expressed in the same units as the original data. A high standard deviation indicates that the values are widely spread out, while a low standard deviation indicates that the values are tightly clustered around the mean.
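
As a small worked example of these definitions, the sketch below uses made-up values (not the chocolate dataset). One detail worth knowing: pandas' `var()` and `std()` default to the sample versions (`ddof=1`, dividing by `n - 1`), whereas the "average of squared differences" described above corresponds to `ddof=0`.

```python
import pandas as pd

# Made-up values, chosen so the arithmetic is easy to follow.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()                       # 40 / 8 = 5.0
population_variance = values.var(ddof=0)   # 32 / 8 = 4.0
population_std = values.std(ddof=0)        # sqrt(4.0) = 2.0
sample_variance = values.var()             # pandas default (ddof=1): 32 / 7

print(mean, values.min(), values.max())
print(population_variance, population_std, sample_variance)
```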

We also calculate the `Pearson correlation` between the sales series and a lagged (shifted) copy of itself, i.e., its autocorrelation. The lags with the highest correlation tell us where a seasonal cycle is most likely to repeat.
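
To make the idea concrete, the sketch below checks on a toy series that `Series.autocorr(lag)` matches the Pearson correlation between the series and its shifted copy. The toy data (a sine wave with a period of 7 "days") is an assumption made for illustration only.

```python
import numpy as np
import pandas as pd

# Toy series with a weekly period (assumed data, not the chocolate dataset).
days = np.arange(70)
toy = pd.Series(np.sin(2 * np.pi * days / 7))

lag = 7
via_autocorr = toy.autocorr(lag=lag)
via_shift = toy.corr(toy.shift(lag))   # Pearson correlation with the shifted series

off_cycle = toy.autocorr(lag=3)        # a lag that does not match the period

print(via_autocorr, via_shift, off_cycle)
```

The lag that matches the period gives a correlation close to 1, while an off-cycle lag gives a much lower value.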

In [None]:
import pandas as pd
from IPython.display import Markdown

autocorr = []

for i in range(1,400):
    x = df.sales.autocorr(lag=i)
    autocorr.append((i,x))

print(f'Statistics Report (Chocolate)\n{"-" * 50}')
print("Mean Sales:", df.sales.mean())
print("Minimum Sold:", df.sales.min())
print("Maximum Sold:", df.sales.max())
print("Variance:", df.sales.var())
print("Standard Deviation:", df.sales.std())
print("Maximum Autocorrelation")
display(Markdown(pd.DataFrame(autocorr, columns=['Lag', 'Maximum Autocorrelation'])\
    .sort_values('Maximum Autocorrelation', ascending=False)\
        .set_index('Lag').head().to_markdown()))

Statistics Report (Chocolate)
--------------------------------------------------
Mean Sales: 103.3488160291439
Minimum Sold: 0.0
Maximum Sold: 466.0
Variance: 2976.16354007369
Standard Deviation: 54.5542256848513
Maximum Autocorrelation


|   Lag |   Maximum Autocorrelation |
|------:|--------------------------:|
|   366 |                  0.81522  |
|     7 |                  0.348714 |
|   373 |                  0.314376 |
|   359 |                  0.288542 |
|    28 |                  0.257385 |

Our data seems to follow a pattern that repeats strongly every year (lag 366), and it also has weaker but "_stable_" weekly (lag 7) and four-week (lag 28) cycles. The `variance` is also considerable, which suggests that some of the values in our dataset could be `outliers`.

> Note: In statistics and machine learning, an `outlier` is an observation significantly different from other observations in the dataset. `Outliers` can be caused by measurement or recording errors, natural variations in the data, or rare events. `Outliers` can significantly affect the mean, variance, and other statistics used to summarize the data and can affect the performance of machine learning models trained on the data.

Let us plot the histogram to understand our sales value distribution better.

In [None]:
fig = go.Figure(data=[go.Histogram(x=df.sales)])

fig.update_xaxes(showgrid=False, ticksuffix=' Kg', showline=False, mirror=False)
fig.update_yaxes(showgrid=True)

fig.update_layout(
template='plotly_dark',
title=f"Chocolate Sales Histogram",
paper_bgcolor='rgba(0, 0, 0, 0)',
plot_bgcolor='rgba(0, 0, 0, 0)',
)

fig.show()

fig = go.Figure(data=go.Scatter(x=df.dates,
                                y=df.sales, mode='markers'))

fig.update_layout(
template='plotly_dark',
title=f"Chocolate Sales Outliers",
paper_bgcolor='rgba(0, 0, 0, 0)',
plot_bgcolor='rgba(0, 0, 0, 0)',
)

fig.show()


We could, for example, decide that sales above 400 Kg are a statistical anomaly and "_substitute these values_" with an acceptable maximum (e.g., 350 Kg). Or we could do something fancier: if a value is more than 3 `standard deviations` above the `mean`, we set it to $mean + (std \times 3)$.
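
The first option, a fixed cap, could be sketched as follows. The toy values are made up, and the 400 Kg threshold and 350 Kg replacement are simply the example numbers from the text rather than tuned choices. The notebook itself applies the second option (mean plus three standard deviations) in the next cell.

```python
import pandas as pd

# Made-up sales values in Kg; two of them exceed the 400 Kg threshold.
toy_sales = pd.Series([137.0, 87.0, 466.0, 286.0, 420.0])

THRESHOLD = 400.0     # values above this are treated as anomalies
REPLACEMENT = 350.0   # the "acceptable maximum" they are replaced with

# Keep values at or below the threshold, replace the rest.
capped = toy_sales.where(toy_sales <= THRESHOLD, REPLACEMENT)

print(capped.tolist())  # [137.0, 87.0, 350.0, 286.0, 350.0]
```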

In [None]:
df.loc[df['sales'] > df.sales.mean() + \
       (df.sales.std() * 3), 'sales'] = \
        df.sales.mean() + (df.sales.std() * 3)

print(f'Statistics Report (Chocolate)\n{"-" * 50}')
print("Mean Sales:", df.sales.mean())
print("Minimum Sold:", df.sales.min())
print("Maximum Sold:", df.sales.max())
print("Variance:", df.sales.var())
print("Standard Deviation:", df.sales.std())

Statistics Report (Chocolate)
--------------------------------------------------
Mean Sales: 102.23241838107728
Minimum Sold: 0.0
Maximum Sold: 267.0114930836978
Variance: 2482.95001490196
Standard Deviation: 49.82920845148917


### Feature Engineering

`Feature engineering` is the process of selecting, transforming, and creating input variables (`features`) to improve the performance of machine learning models. It involves using domain knowledge to effectively choose and change the most relevant input variables to train machine learning models.

In other words, `feature engineering` is converting raw data into a set of features that can be used to train a machine learning model. This process involves various techniques, such as data cleaning, feature selection, feature extraction, and feature scaling.

We only have the raw numerical sales so far, so we will now create some new features.

First, how much chocolate I sold today correlates with what I'll sell tomorrow. Thus, let us create features that capture the day-to-day differences in sales, looking up to 7 days back in the past.

We will also create features that tell us the difference in sales compared with four weeks (28 days) and one year (366 days) earlier, plus the moving average of sales over one-week and two-week windows.

In [None]:
def create_sales_features(df):
    """
    Creates new features based on the `sales` column
    of the given DataFrame.

    Args:
        df (pandas.DataFrame): The DataFrame to create
        features for.

    Returns:
        pandas.DataFrame: A new DataFrame with the original
        columns and additional columns for each feature created:
            - difference in sales from 1 to 7 days back.
            - difference in sales from 28 days back.
            - difference in sales from 366 days back.
            - moving average from a 1 week window.
            - moving average from a 2 week window.
    """
    df = df.copy()

    previous = df.sales.shift(1)
    df['difference_1'] = df.sales - previous

    for i in range(1, 7):
        column = 'difference_' + str(i+1)
        df[column] = df['difference_1'].shift(i)

    df['moving_average_week'] = df.sales.rolling(window=7).mean()

    df['moving_average_two_weeks'] = df.sales.rolling(window=14).mean()

    df['difference_month'] = df.sales - df.sales.shift(28)

    df['difference_year'] = df.sales - df.sales.shift(366)

    df = df.dropna()

    return df

df_sales_features = create_sales_features(df)

display(df_sales_features.head())

Unnamed: 0,dates,product_id,sales,difference_1,difference_2,difference_3,difference_4,difference_5,difference_6,difference_7,moving_average_week,moving_average_two_weeks,difference_month,difference_year
366,2021-01-01,chocolate,91.0,-8.0,-9.0,-21.0,79.0,-66.0,-100.0,116.0,115.571429,118.714286,20.0,-46.0
367,2021-01-02,chocolate,69.0,-22.0,-8.0,-9.0,-21.0,79.0,-66.0,-100.0,94.571429,114.714286,-101.0,-18.0
368,2021-01-03,chocolate,159.0,90.0,-22.0,-8.0,-9.0,-21.0,79.0,-66.0,100.714286,118.285714,59.0,-29.0
369,2021-01-04,chocolate,232.0,73.0,90.0,-22.0,-8.0,-9.0,-21.0,79.0,126.714286,125.071429,172.0,-35.011493
370,2021-01-05,chocolate,126.0,-106.0,73.0,90.0,-22.0,-8.0,-9.0,-21.0,126.285714,121.071429,23.0,-30.0
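
To see what these transformations do row by row, here is a sketch on a made-up five-day series. It mirrors the pattern used in `create_sales_features` with a shorter window (3 days instead of 7), so the column name `moving_average_3` is hypothetical. Note that `difference_2` is the previous day's `difference_1`, and that both `difference_1` and the rolling mean at a given row are computed from that same row's `sales` value.

```python
import pandas as pd

# Made-up sales for five consecutive days.
toy = pd.DataFrame({"sales": [10.0, 12.0, 9.0, 15.0, 14.0]})

# Change from the previous day to the current day.
toy["difference_1"] = toy.sales - toy.sales.shift(1)

# The same difference, observed one day earlier.
toy["difference_2"] = toy["difference_1"].shift(1)

# Mean of the current day and the two days before it.
toy["moving_average_3"] = toy.sales.rolling(window=3).mean()

print(toy)

# Rows without enough history contain NaN and are dropped, as in the notebook.
complete_rows = toy.dropna()
print(len(complete_rows))
```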


Second, if our data has some form of seasonality, information like the `day of the week`, `day of the year`, `quarter`, `month`, and `year` could help us in this forecasting problem.

Luckily, pandas gives us all of these for free, as long as our index is in a `datetime` format.
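
As a quick illustration of these attributes, the sketch below builds a `DatetimeIndex` from three hand-picked dates. In pandas, `day_of_week` counts from Monday = 0 to Sunday = 6, so Friday, 2021-01-01 maps to 4.

```python
import pandas as pd

# Three hand-picked dates: a Friday, a Monday, and a Thursday.
dates = pd.to_datetime(["2021-01-01", "2021-01-04", "2021-07-15"])

features = pd.DataFrame({
    "day_of_week": dates.day_of_week,   # Monday = 0, ..., Sunday = 6
    "day_of_year": dates.day_of_year,
    "quarter": dates.quarter,
    "month": dates.month,
    "year": dates.year,
}, index=dates)

print(features)
```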

In [None]:
def create_time_features(df):
    """
    Extracts various time-related features from a DataFrame
    containing time-series data and returns the updated DataFrame.

    Args:
        - df (pandas.DataFrame): The DataFrame containing time-series
        data to process. This DataFrame must have a 'dates' column
        with datetime values.

    Returns:
        - pandas.DataFrame: The updated DataFrame with additional
        time-related features added as columns:
            - day of week.
            - day of year.
            - quarter of the year.
            - month of the year.
            - year.
    """
    df = df.copy()

    df = df.set_index('dates')

    df.index = pd.to_datetime(df.index)

    df['day_of_week'] = df.index.day_of_week
    df['day_of_year'] = df.index.day_of_year
    df['quarter'] = df.index.quarter
    df['month'] = df.index.month
    df['year'] = df.index.year

    df = df.sort_index()

    return df

df_time_features = create_time_features(df_sales_features)

display(df_time_features.head())

dates,product_id,sales,difference_1,difference_2,difference_3,difference_4,difference_5,difference_6,difference_7,moving_average_week,moving_average_two_weeks,difference_month,difference_year,day_of_week,day_of_year,quarter,month,year
2021-01-01,chocolate,91.0,-8.0,-9.0,-21.0,79.0,-66.0,-100.0,116.0,115.571429,118.714286,20.0,-46.0,4,1,1,1,2021
2021-01-02,chocolate,69.0,-22.0,-8.0,-9.0,-21.0,79.0,-66.0,-100.0,94.571429,114.714286,-101.0,-18.0,5,2,1,1,2021
2021-01-03,chocolate,159.0,90.0,-22.0,-8.0,-9.0,-21.0,79.0,-66.0,100.714286,118.285714,59.0,-29.0,6,3,1,1,2021
2021-01-04,chocolate,232.0,73.0,90.0,-22.0,-8.0,-9.0,-21.0,79.0,126.714286,125.071429,172.0,-35.011493,0,4,1,1,2021
2021-01-05,chocolate,126.0,-106.0,73.0,90.0,-22.0,-8.0,-9.0,-21.0,126.285714,121.071429,23.0,-30.0,1,5,1,1,2021


As a last modification, we can use the `StandardScaler` to standardize the scale of the numerical features, which can improve the performance of machine learning models that are sensitive to feature scale. For categorical data (weekdays, months, etc.), we will one-hot encode them with the `pandas.get_dummies` function.

In general, it is a good practice to standardize the features in a regression problem.
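
Both transformations can be illustrated on a tiny made-up table. The sketch below checks that `StandardScaler` computes $z = (x - \mu) / \sigma$ using the population standard deviation (`ddof=0`), and shows that `pd.get_dummies` creates one indicator column per distinct category value. The toy column values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up data: one numerical and one categorical feature.
toy = pd.DataFrame({
    "difference_1": [2.0, -3.0, 6.0, -1.0],
    "day_of_week": [0, 1, 0, 2],
})

# Standardization: subtract the mean, divide by the population std.
scaled = StandardScaler().fit_transform(toy[["difference_1"]]).ravel()
manual = (toy.difference_1 - toy.difference_1.mean()) / toy.difference_1.std(ddof=0)
same_result = np.allclose(scaled, manual)

# One-hot encoding: one indicator column per distinct value.
encoded = pd.get_dummies(toy[["day_of_week"]], columns=["day_of_week"])

print(same_result)
print(encoded.columns.tolist())
```

After standardization, the column has mean 0 and unit variance, which is what "same scale" means here.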

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

def scale_dataset(df):
    """
    Preprocesses a given DataFrame by scaling its numerical features using a StandardScaler,
    one-hot encoding its categorical features, and concatenating them together with the
    product ID and target variables.

    Parameters
    ----------
    df : pandas.DataFrame
        The DataFrame to be preprocessed.

    Returns
    -------
    pandas.DataFrame
        The preprocessed DataFrame with scaled numerical features, one-hot encoded
        categorical features, and concatenated product ID and target variables.
    """

    df = df.copy()

    numerical_features = df[['difference_1', 'difference_2', 'difference_3',
       'difference_4', 'difference_5', 'difference_6', 'difference_7',
       'moving_average_week', 'moving_average_two_weeks', 'difference_month',
       'difference_year']]

    categorical_features = df[['day_of_week', 'day_of_year',
            'quarter', 'month', 'year']]

    categorical_features = pd.get_dummies(categorical_features,
        columns = ['day_of_week', 'day_of_year',
        'quarter', 'month', 'year'])

    product_id = df[['product_id']]
    target = df[['sales']]

    scaler.fit(numerical_features)

    numerical_features = pd.DataFrame(
       scaler.transform(numerical_features),
       columns=numerical_features.columns,
       index=numerical_features.index)

    df = pd.concat([product_id, target, numerical_features, categorical_features], axis=1)

    return df

df_scaled = scale_dataset(df_time_features)

display(df_scaled.head())

dates,product_id,sales,difference_1,difference_2,difference_3,difference_4,difference_5,difference_6,difference_7,moving_average_week,...,month_6,month_7,month_8,month_9,month_10,month_11,month_12,year_2021,year_2022,year_2023
2021-01-01,chocolate,91.0,-0.12336,-0.139426,-0.32432,1.222318,-1.020935,-1.546401,1.793529,1.009674,...,0,0,0,0,0,0,0,1,0,0
2021-01-02,chocolate,69.0,-0.339905,-0.123955,-0.13868,-0.324412,1.2224,-1.02051,-1.545988,0.004941,...,0,0,0,0,0,0,0,1,0,0
2021-01-03,chocolate,159.0,1.392457,-0.340543,-0.12321,-0.138804,-0.324728,1.222262,-1.020323,0.298843,...,0,0,0,0,0,0,0,1,0,0
2021-01-04,chocolate,232.0,1.129509,1.392163,-0.33979,-0.123337,-0.139072,-0.324477,1.221482,1.542798,...,0,0,0,0,0,0,0,1,0,0
2021-01-05,chocolate,126.0,-1.639177,1.129163,1.392847,-0.339879,-0.123601,-0.138868,-0.324591,1.522293,...,0,0,0,0,0,0,0,1,0,0


With these new features, we can now explore how they are related to the sales of chocolate. For example, which days of the week, or months, have higher sales? We can draw `box plots` to find out.

> Note: Box plots, also known as box-and-whisker plots, are a graphical representation of a dataset that displays the distribution of the data based on five summary statistics: minimum value, first quartile (25th percentile), median (50th percentile), third quartile (75th percentile), and maximum value.
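
The five summary statistics can be computed directly with pandas. The sketch below uses the first five sales values shown in the dataset preview above and relies on the default linear interpolation of `Series.quantile`; it is only meant to show what a box plot summarizes, not how Plotly draws its whiskers.

```python
import pandas as pd

# First five sales values from the dataset preview.
sales = pd.Series([137.0, 87.0, 188.0, 286.0, 156.0])

# Minimum, first quartile, median, third quartile, maximum.
summary = sales.quantile([0.0, 0.25, 0.5, 0.75, 1.0])

# Interquartile range: the height of the "box".
iqr = summary.loc[0.75] - summary.loc[0.25]

print(summary)
print("IQR:", iqr)
```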

In [None]:
import plotly.express as px


fig = px.box(df_time_features, x="day_of_week", y="sales", points="all", color="day_of_week")

fig.update_xaxes(showgrid=False, showline=False, mirror=False)
fig.update_yaxes(showgrid=True, ticksuffix=' Kg')

fig.update_layout(
template='plotly_dark',
title=f"Sales by Day of the Week",
paper_bgcolor='rgba(0, 0, 0, 0)',
plot_bgcolor='rgba(0, 0, 0, 0)',
)

fig.show()

fig = px.box(df_time_features, x="month", y="sales", points="all", color="month")

fig.update_xaxes(showgrid=False, showline=False, mirror=False)
fig.update_yaxes(showgrid=True, ticksuffix=' Kg')

fig.update_layout(
template='plotly_dark',
title=f"Sales by Month",
paper_bgcolor='rgba(0, 0, 0, 0)',
plot_bgcolor='rgba(0, 0, 0, 0)',
)

fig.show()

For the train/test split, we could do something very simple and use the last 30 days of our dataset for testing.

In [None]:
train = df_scaled.loc[df_scaled.index < '2022-12-04']
test = df_scaled.loc[df_scaled.index >= '2022-12-04']

print('Number of samples (days) for training: ', len(train))
print('Number of samples (days) for testing: ', len(test))

fig = go.Figure()

fig.add_trace(go.Scatter(x=train.index,
                        y=train.sales,
                        name='Train Data', mode='lines'))

fig.add_trace(go.Scatter(x=test.index,
                        y=test.sales,
                        name='Test Data', mode='lines'))

fig.update_xaxes(showgrid=False, showline=False, mirror=False)
fig.update_yaxes(showgrid=True, ticksuffix=' Kg')
fig.update_layout(
template='plotly_dark',
title=f"Dataset Train/Test Split",
paper_bgcolor='rgba(0, 0, 0, 0)',
plot_bgcolor='rgba(0, 0, 0, 0)',
)
fig.show()

Number of samples (days) for training:  702
Number of samples (days) for testing:  30


However, we could do something more robust: using time series `cross-validation` and a `TimeSeriesSplit`.

> Note: `Cross-validation` is a machine learning technique to evaluate a model's performance and find the optimal model hyperparameters. It involves dividing the available data into subsets or "_folds_." The model is trained on a subset of the data called the training set and then evaluated on the remaining subset called the validation set. This process is repeated several times, with different subsets of data being used for training and validation in each iteration. The results are then averaged to obtain a more accurate estimate of the model's performance.

[`TimeSeriesSplit`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html) is a `cross-validation` technique for time series data provided by the `scikit-learn` library. It is similar to the traditional `cross-validation` technique, but it is designed to preserve the temporal order of the data when splitting it into training and validation sets.

Below, we split our dataset into four folds; each validation set has 30 days, and a 1-day gap separates each training set from its validation set.
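
To see which rows end up in each fold, the sketch below applies the same splitter to a plain array of row positions. It assumes 732 rows, the number of samples in `df_scaled`, and only prints index ranges instead of plotting.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_samples = 732  # assumed: the number of rows in df_scaled
tss = TimeSeriesSplit(n_splits=4, test_size=30, gap=1)

folds = []
for fold, (train_idx, val_idx) in enumerate(tss.split(np.arange(n_samples)), start=1):
    # Store (training size, first validation row, last validation row).
    folds.append((len(train_idx), int(val_idx[0]), int(val_idx[-1])))
    print(f"Fold {fold}: train rows 0-{train_idx[-1]}, "
          f"validation rows {val_idx[0]}-{val_idx[-1]}")
```

The training window grows with every fold, the validation windows never overlap, and the row skipped between them is the 1-day gap.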

In [None]:
from sklearn.model_selection import TimeSeriesSplit

tss = TimeSeriesSplit(n_splits=4, test_size=30, gap=1)

fold = 0
for train_idx, val_idx in tss.split(df_scaled):
    train = df_scaled.iloc[train_idx]
    test = df_scaled.iloc[val_idx]

    fig = go.Figure()

    fig.add_trace(go.Scatter(x=train.index,
                            y=train.sales,
                            name='Train Data', mode='lines'))

    fig.add_trace(go.Scatter(x=test.index,
                            y=test.sales,
                            name='Test Data', mode='lines'))

    fig.update_xaxes(showgrid=False, showline=False, mirror=False)
    fig.update_yaxes(showgrid=True, ticksuffix=' Kg')
    fig.update_layout(
    template='plotly_dark',
    title=f"Dataset Train/Test Split (Fold {fold+1})",
    paper_bgcolor='rgba(0, 0, 0, 0)',
    plot_bgcolor='rgba(0, 0, 0, 0)',
    )
    fig.show()

    fold += 1

Let us now also separate `features` from our `target` variable.

In [None]:
FEATURES = df_scaled.columns[2:]
TARGET = ['sales']

print('Number of features: ', len(FEATURES))
print('Features shape: ', df_scaled[FEATURES].values.shape)
print('Targets shape: ', df_scaled[TARGET].values.shape)


Number of features:  402
Features shape:  (732, 402)
Targets shape:  (732, 1)


### Creating a `Regression` Model

In machine learning, a `regression model` is a predictive model used to estimate a continuous output variable based on one or more input variables. It is a `supervised learning` technique that aims to find the relationship between the `input variables` (also called predictors or independent variables) and the `output variable` (also called the dependent variable).

Regression models can be `linear` or `nonlinear`, depending on the form of the relationship between the input and output variables. Linear regression models assume a linear relationship between the variables, while nonlinear regression models allow for more complex relationships.

To create our regression model, we will use the `XGBRegressor` from [`xgboost`](https://xgboost.readthedocs.io/en/stable/index.html). Some of the arguments you can pass when building this model are:

- `n_estimators`: Number of gradient-boosted trees. Equivalent to the number of boosting rounds.
- `early_stopping_rounds`: Activates early stopping. The validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training.
- `max_depth`: Maximum tree depth for base learners.
- `learning_rate`: Boosting learning rate.
- `booster`: Specify which booster to use ("gbtree", "gblinear" or "dart").

For more information, read the [documentation](https://xgboost.readthedocs.io/en/stable/python/python_api.html#module-xgboost.sklearn).

Below we are giving the model up to 1000 trees of depth 2, an early stopping mark (to prevent `overfitting`) that triggers after 150 rounds without improvement, and a learning rate of 0.1 (low learning rates also help to prevent `overfitting`).

> Note: In machine learning, `overfitting` refers to a phenomenon where a model learns the training data too well, to the point that it starts to memorize noise and irrelevant patterns in the data rather than learning the underlying patterns that generalize well to new data. As a result, an overfit model performs well on the training data but poorly on the validation or test data, as it fails to capture the genuine relationship between the input and output variables. To avoid `overfitting`, it is important to use techniques such as cross-validation, early stopping, and regularization, which help to prevent the model from overfitting the training data and improve its ability to generalize to new data.
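
The early stopping rule can be sketched in plain Python. This is a simplified simulation of the idea, not XGBoost's internal code: the function name `simulate_early_stopping`, the `patience` argument (playing the role of `early_stopping_rounds`), and the validation scores are all hypothetical.

```python
def simulate_early_stopping(val_scores, patience):
    """Return (best_round, last_round_run) for a list of validation errors.

    Training stops once `patience` rounds in a row have passed
    without a new best (lowest) validation error.
    """
    best_score = float("inf")
    best_round = 0
    for round_idx, score in enumerate(val_scores):
        if score < best_score:
            best_score = score
            best_round = round_idx
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    return best_round, len(val_scores) - 1


# Hypothetical validation RMSE per boosting round.
val_rmse = [10.0, 8.0, 7.5, 7.6, 7.7, 7.8, 7.4]

print(simulate_early_stopping(val_rmse, patience=3))  # (2, 5)
print(simulate_early_stopping(val_rmse, patience=5))  # (6, 6)
```

With a short patience, training halts before the late improvement at round 6 is ever seen, which illustrates the trade-off behind choosing this value.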

Since it is more robust, we will train this model using cross-validation.

We will use the same metric the model used to track its performance to evaluate our predictions: `root mean squared error`.

Root Mean Squared Error (`RMSE`) is a common metric used to evaluate the performance of a regression model. It measures the average distance between the predicted and actual values of the target variable in units of the target variable itself:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^n(y_i - \hat{y_i})^2}$$

Where:

- $n$ is the number of samples.
- $y_i$ is the true value of the target variable for sample $i$.
- $\hat{y_i}$ is the predicted value of the target variable for sample $i$.

The square of the difference between the actual and predicted values is first calculated for each sample, then averaged, and finally, the square root is taken to obtain the `RMSE`.
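
These steps can be followed on three made-up samples using only NumPy:

```python
import numpy as np

# Made-up true and predicted values.
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])

squared_errors = (y_true - y_pred) ** 2   # [1.0, 0.0, 4.0]
mse = squared_errors.mean()               # 5 / 3
rmse = np.sqrt(mse)                       # square root of the mean squared error

print(round(float(rmse), 4))  # 1.291
```

Because of the final square root, the result is expressed in the same unit as the target (Kg in this notebook).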

Luckily, we can import `mean_squared_error` from sklearn and take the square root of the result.

In [None]:
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import numpy as np


tss = TimeSeriesSplit(n_splits=4, test_size=30, gap=1)

model = xgb.XGBRegressor(n_estimators=1000, booster='gbtree',
                         early_stopping_rounds=150,
                         max_depth=2,
                         learning_rate=0.1)

fold = 0
preds = []
scores = []
for train_idx, val_idx in tss.split(df_scaled):
    train = df_scaled.iloc[train_idx]
    test = df_scaled.iloc[val_idx]

    x_train = train[FEATURES]
    y_train = train[TARGET]

    x_test = test[FEATURES]
    y_test = test[TARGET]

    model.fit(x_train, y_train,
          eval_set=[(x_train, y_train), (x_test, y_test)],
          verbose=100)

    predictions = model.predict(x_test)
    preds.append(predictions)
    rmse = np.sqrt(mean_squared_error(y_test, predictions))
    scores.append(rmse)

    fig = go.Figure()

    fig.add_trace(go.Scatter(x=test.index,
                            y=test.sales,
                            name='Sales (ground truth)', mode='lines'))

    fig.add_trace(go.Scatter(x=test.index,
                            y=predictions,
                            name='Sales (predictions)', mode='lines'))

    fig.update_xaxes(showgrid=False, showline=False, mirror=False)
    fig.update_yaxes(showgrid=True, ticksuffix=' Kg')

    fig.update_layout(
    template='plotly_dark',
    title=f"Chocolate Sales Ground-Truth/Predictions (Fold {fold+1})",
    paper_bgcolor='rgba(0, 0, 0, 0)',
    plot_bgcolor='rgba(0, 0, 0, 0)',
    )

    fig.show()

    fold += 1

print(f'Fold RMSE:{scores}')
print(f'Average RMSE across folds {np.mean(scores)}')


[0]	validation_0-rmse:99.96712	validation_1-rmse:77.45993
[100]	validation_0-rmse:15.04471	validation_1-rmse:16.23099
[200]	validation_0-rmse:10.64033	validation_1-rmse:14.06057
[300]	validation_0-rmse:8.20488	validation_1-rmse:12.79376
[400]	validation_0-rmse:6.80203	validation_1-rmse:11.94918
[500]	validation_0-rmse:5.74809	validation_1-rmse:11.37236
[600]	validation_0-rmse:5.08288	validation_1-rmse:11.05556
[700]	validation_0-rmse:4.49140	validation_1-rmse:10.83820
[800]	validation_0-rmse:4.06996	validation_1-rmse:10.68851
[900]	validation_0-rmse:3.69480	validation_1-rmse:10.63432
[999]	validation_0-rmse:3.38505	validation_1-rmse:10.52529


[0]	validation_0-rmse:99.17068	validation_1-rmse:78.75027
[100]	validation_0-rmse:15.03308	validation_1-rmse:17.42126
[200]	validation_0-rmse:10.78066	validation_1-rmse:14.46459
[300]	validation_0-rmse:8.39435	validation_1-rmse:12.75224
[400]	validation_0-rmse:6.91263	validation_1-rmse:11.98160
[500]	validation_0-rmse:5.93599	validation_1-rmse:11.51410
[600]	validation_0-rmse:5.23215	validation_1-rmse:11.20708
[700]	validation_0-rmse:4.67299	validation_1-rmse:11.04443
[800]	validation_0-rmse:4.19114	validation_1-rmse:10.89645
[900]	validation_0-rmse:3.83880	validation_1-rmse:10.75716
[999]	validation_0-rmse:3.52406	validation_1-rmse:10.67331


[0]	validation_0-rmse:98.42250	validation_1-rmse:78.14962
[100]	validation_0-rmse:14.70966	validation_1-rmse:13.87850
[200]	validation_0-rmse:10.47658	validation_1-rmse:12.42836
[300]	validation_0-rmse:8.26071	validation_1-rmse:11.36224
[400]	validation_0-rmse:6.85202	validation_1-rmse:10.73667
[500]	validation_0-rmse:5.92617	validation_1-rmse:10.36384
[600]	validation_0-rmse:5.26135	validation_1-rmse:10.27095
[700]	validation_0-rmse:4.72415	validation_1-rmse:10.03625
[800]	validation_0-rmse:4.28698	validation_1-rmse:9.88717
[900]	validation_0-rmse:3.92760	validation_1-rmse:9.72431
[999]	validation_0-rmse:3.62964	validation_1-rmse:9.64656


[0]	validation_0-rmse:97.68258	validation_1-rmse:90.79908
[100]	validation_0-rmse:14.81236	validation_1-rmse:16.92480
[200]	validation_0-rmse:10.57558	validation_1-rmse:14.82506
[300]	validation_0-rmse:8.28456	validation_1-rmse:13.89906
[400]	validation_0-rmse:6.93889	validation_1-rmse:13.26775
[500]	validation_0-rmse:5.95511	validation_1-rmse:13.11800
[600]	validation_0-rmse:5.26917	validation_1-rmse:12.95253
[700]	validation_0-rmse:4.76864	validation_1-rmse:12.92747
[800]	validation_0-rmse:4.30972	validation_1-rmse:12.79893
[900]	validation_0-rmse:3.95557	validation_1-rmse:12.74268
[999]	validation_0-rmse:3.65644	validation_1-rmse:12.66501


Fold RMSE:[10.520052021785917, 10.672554180402868, 9.646157817924287, 12.633030145118537]
Average RMSE across folds 10.867948541307902


It's not that bad, but we could improve this model by creating better features or tuning the hyperparameters more.

### Interpreting our Regressor

`XGBoost` gives you the importance of each feature used to train the model. You can get these values through the `feature_importances_` attribute. It is an excellent way to try to `interpret` your model.

In [None]:

feature_importance = pd.DataFrame({
    'features' : model.feature_names_in_,
    'importance': model.feature_importances_
}).set_index('features').sort_values('importance')

import plotly.graph_objects as go

fig = go.Figure(go.Bar(
    x=feature_importance.importance,
    y=feature_importance.index,
    orientation='h'))

fig.update_xaxes(range=[feature_importance.importance.min()
    + (feature_importance.importance.min() * 0.1),
    feature_importance.importance.max()
    + (feature_importance.importance.max() * 0.1)])

fig.update_layout(
    xaxis=dict(
        tickmode='linear',
        tick0=0,
        dtick=0.5
    ),
    template='plotly_dark',
    title='Feature Importance for Regression Model',
    paper_bgcolor='rgba(0, 0, 0, 0)',
    plot_bgcolor='rgba(0, 0, 0, 0)'
)

fig.show()


`difference_month`, `difference_1`, `moving_average_week`, and `difference_year` seem to be this model's top 4 most important features.
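
A ranking like this can also be read off programmatically instead of from the chart. The sketch below is a minimal example that assumes a `feature_importance` dataframe shaped like the one built above. The importance values, and the `day_of_week` and `month` feature names, are made up for illustration and are not the model's actual output.

```python
import pandas as pd

# Hypothetical importance scores, for illustration only; in the notebook
# these would come from model.feature_importances_.
feature_importance = pd.DataFrame({
    'features': ['difference_month', 'difference_1', 'moving_average_week',
                 'difference_year', 'day_of_week', 'month'],
    'importance': [0.40, 0.25, 0.15, 0.10, 0.06, 0.04],
}).set_index('features')

# The four largest scores, sorted from most to least important.
top_4 = feature_importance['importance'].nlargest(4)

# Fraction of the total importance carried by those four features.
cumulative_share = top_4.sum() / feature_importance['importance'].sum()

print(top_4)
print(f"Share of total importance in the top 4: {cumulative_share:.0%}")
```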

Here are some potential features that could be used to improve this model:

- _External variables_: If you have access to external data, such as `weather data` or `economic indicators`, you could include those as features to capture the impact of those variables on sales.
- _Promotion and marketing variables_: If you have information about `promotions` or `marketing campaigns` that occurred during the period of interest, you could include those as features.
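
As a rough sketch of how such features could be attached, the example below joins a hypothetical table of external variables to a toy sales table on the `dates` column. The `avg_temperature` and `promo_active` columns and all values are assumptions made for illustration; they do not exist in the notebook's dataset.

```python
import pandas as pd

dates = pd.date_range('2023-01-01', periods=5, freq='D')

# Toy sales frame following the notebook's `dates` / `sales` column names.
sales = pd.DataFrame({
    'dates': dates,
    'sales': [100.0, 120.0, 90.0, 110.0, 105.0],
})

# Hypothetical external data; the third day is missing on purpose.
external = pd.DataFrame({
    'dates': dates[[0, 1, 3, 4]],
    'avg_temperature': [24.5, 26.0, 22.0, 23.5],
    'promo_active': [0, 1, 0, 0],
})

# A left join keeps every sales row, even when external data is missing.
enriched = sales.merge(external, on='dates', how='left')

# Fill the gaps: carry the last known temperature forward,
# and treat a missing promotion flag as "no promotion".
enriched['avg_temperature'] = enriched['avg_temperature'].ffill()
enriched['promo_active'] = enriched['promo_active'].fillna(0).astype(int)

print(enriched)
```

Keep in mind that any external feature used for training must also be available (or forecastable) for the future dates you want to predict.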

### Forecasting the Future

If we want to predict the future, we should retrain our model on all of our data so that it also learns from the most recent observations. After that, we can extend the time frame one day at a time and use lag features built from past (and previously predicted) sales to forecast each new day.
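
Before the full implementation, the recursive idea can be sketched on a toy series. In the example below, `toy_model` is a made-up stand-in for the trained regressor (it simply averages two lags), and the history values are invented. The point is only to show how each prediction is appended to the series and then reused as a lag for the next step.

```python
def toy_model(lag_1, lag_7):
    """Hypothetical stand-in for a trained regressor (illustration only)."""
    return 0.5 * lag_1 + 0.5 * lag_7


def recursive_forecast(history, ahead):
    """Forecast `ahead` steps, feeding each prediction back in as a lag."""
    series = list(history)
    for _ in range(ahead):
        lag_1 = series[-1]   # most recent value (observed or predicted)
        lag_7 = series[-7]   # value from seven steps back
        series.append(toy_model(lag_1, lag_7))
    # Return only the newly predicted values.
    return series[len(history):]


history = [100, 110, 95, 105, 120, 130, 90, 102, 108, 97]
forecast = recursive_forecast(history, ahead=3)
print(forecast)
```

Because later steps rely on earlier predictions rather than on observed sales, any error made early on is carried into the following steps.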

In [None]:
from datetime import timedelta

# `df` (the time series data) was already loaded from the hub.

# Cap outliers at three standard deviations above the mean.
upper_cap = df.sales.mean() + (df.sales.std() * 3)
df.loc[df['sales'] > upper_cap, 'sales'] = upper_cap

print(f'Statistics Report (Chocolate)\n{"-" * 50}')
print("Mean Sales:", df.sales.mean())
print("Minimum Sold:", df.sales.min())
print("Maximum Sold:", df.sales.max())
print("Variance:", df.sales.var())
print("Standard Deviation:", df.sales.std(), "\n")

model = xgb.XGBRegressor(n_estimators=1000, booster='gbtree',
                         max_depth=2,
                         learning_rate=0.1)

train_df = create_sales_features(df)
train_df = create_time_features(train_df)
train_df = scale_dataset(train_df)

x_features = train_df[train_df.columns[2:]]
y_target = train_df['sales']

model.fit(x_features, y_target,
          eval_set=[(x_features, y_target)],
          verbose=100)

print('Training over.')

def generate_forecast(df, ahead):
    """
    Generates a forecast for future sales based
    on a time-series dataframe.

    Parameters:
    -----------
    df: pandas.DataFrame
        A time-series dataframe with a `dates` column and a `sales` column.
    ahead: int
        The number of future periods to forecast.

    Returns:
    --------
    pandas.DataFrame
        A dataframe with the forecasted sales for the next `ahead` periods.
    """
    df = df.copy()

    df_time = create_time_features(df)
    monthly_sales = pd.DataFrame(df_time.groupby('month')['sales'].mean())

    df = df.set_index('dates')
    df.index = pd.to_datetime(df.index)

    for i in range(ahead):

        future_date = df.index.max() + timedelta(days=1)
        future_dates = pd.date_range(start = future_date.strftime("%Y-%m-%d"),
                end = future_date.strftime("%Y-%m-%d"))

        future_df = pd.DataFrame({"product_id": "chocolate", "sales": None},index=future_dates)
        future_df['sales'] = monthly_sales.loc[future_df.index.month[0]]['sales']

        df_with_future = pd.concat([df,future_df]).reset_index().rename(columns={"index": "dates"})
        df_with_future = create_sales_features(df_with_future)
        df_with_future = create_time_features(df_with_future)
        df_with_future = scale_dataset(df_with_future)

        pred = model.predict(df_with_future.tail(1)[df_with_future.columns[2:]])

        future_df['sales'] = abs(pred)

        df = pd.concat([df, future_df])

    return df.tail(ahead)

prediction_df = generate_forecast(df, 7)

fig = go.Figure()

fig.add_trace(go.Scatter(x=df.dates,
                        y=df.sales,
                        name='Chocolate Sales history', mode='lines'))

fig.add_trace(go.Scatter(x=prediction_df.index,
                        y=prediction_df.sales,
                        name='Sales Forecast', mode='lines'))

fig.update_xaxes(showgrid=False, showline=False, mirror=False)
fig.update_yaxes(showgrid=True, ticksuffix=' Kg')

fig.update_layout(
    template='plotly_dark',
    title="Sales Forecast for Chocolate",
    paper_bgcolor='rgba(0, 0, 0, 0)',
    plot_bgcolor='rgba(0, 0, 0, 0)',
)

fig.show()

print(f"Sales forecast for next 7 days: {prediction_df.sales.sum():.2f} Kg.")

Statistics Report (Chocolate)
--------------------------------------------------
Mean Sales: 101.95978229706922
Minimum Sold: 0.0
Maximum Sold: 251.7200437355448
Variance: 2397.281893291478
Standard Deviation: 48.96204543614858 

[0]	validation_0-rmse:96.78080
[100]	validation_0-rmse:14.76991
[200]	validation_0-rmse:10.61371
[300]	validation_0-rmse:8.32226
[400]	validation_0-rmse:6.90266
[500]	validation_0-rmse:5.91201
[600]	validation_0-rmse:5.17704
[700]	validation_0-rmse:4.62386
[800]	validation_0-rmse:4.19060
[900]	validation_0-rmse:3.81911
[999]	validation_0-rmse:3.51659
Training over.


Sales forecast for next 7 days: 804.31 Kg.


Let us now see how our seven-day forecast compares to other weeks in our sales history.

In [None]:
fig = go.Figure()

fig.add_trace(go.Bar(x=prediction_df.sales.resample('7D').sum()\
                     .index.strftime("%Y-%m-%d"),
                     y=prediction_df.sales.resample('7D').sum(),
                     name='Sales Forecast'))

fig.add_trace(go.Bar(x=train_df.sales.resample('7D').sum()\
                     .index.strftime("%Y-%m-%d"),
                     y=train_df.sales.resample('7D').sum(),
                     name='Sales History'))

fig.update_xaxes(showgrid=False, showline=False, mirror=False)
fig.update_yaxes(showgrid=True, ticksuffix=' Kg')

fig.update_layout(
    template='plotly_dark',
    title="Sales Forecast for Chocolate (Next 7 Days)",
    paper_bgcolor='rgba(0, 0, 0, 0)',
    plot_bgcolor='rgba(0, 0, 0, 0)',
)

fig.show()
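
For a numerical counterpart to the chart, the sketch below shows what `resample('7D').sum()` produces and how a forecast total could be ranked against historical weekly totals. The two series are synthetic stand-ins (random values with a fixed seed), not the notebook's sales data, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: four weeks of daily history and a seven-day forecast.
history = pd.Series(
    rng.uniform(80, 120, size=28),
    index=pd.date_range('2023-01-01', periods=28, freq='D'),
)
forecast = pd.Series(
    rng.uniform(80, 120, size=7),
    index=pd.date_range('2023-01-29', periods=7, freq='D'),
)

# Group the daily values into consecutive 7-day bins and sum each bin.
weekly_history = history.resample('7D').sum()
forecast_total = forecast.sum()

# Fraction of historical weeks whose total is below the forecast total.
share_below = (weekly_history < forecast_total).mean()

print(weekly_history)
print(f"Forecast total: {forecast_total:.2f}")
print(f"Historical weeks below the forecast: {share_below:.0%}")
```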

The further ahead you try to forecast, the harder it becomes to get good results, because each prediction feeds into the lag features of the next one and errors accumulate. There are many other models you could try to get a better result, such as `neural networks`, or you could invest in `feature engineering`, for example by adding weather features or other variables that may be correlated with "_chocolate sales_."

Congratulations, you can now ("_kind of_") see into the future.

---

Return to the [castle](https://github.com/Nkluge-correa/TeenyTinyCastle).
