# Introduction to Time Series Forecasting Project Using XGBoost

In this notebook we'll use XGBoost for time series forecasting. Working through each phase of the project, from data preparation to model assessment and projection, we aim to build a reliable model that predicts sales accurately and supports informed business decisions.

## XGBoost

XGBoost stands for “Extreme Gradient Boosting” and is a machine learning algorithm that is part of the gradient boosting family. Fundamentally, it operates on the principle of boosting, which involves sequentially combining multiple weak predictive models to form a strong predictor. In the case of XGBoost, these weak models are typically decision trees.

A key aspect of XGBoost is its ability to handle missing data and provide a framework for both linear and tree learners. It also employs techniques like parallel processing and efficient memory usage, which make it computationally efficient, particularly for large datasets. The algorithm's performance and efficiency, coupled with its ability to be finely tuned through hyperparameter optimization, make it a powerful tool for predictive modeling in various machine learning applications.

## Decision Trees

A decision tree is a versatile machine learning model used for both classification and regression tasks. It represents a series of decision rules that, when followed from root to leaf, lead to a prediction based on the input features. At each node in the tree, the data is split according to a specific criterion, dividing the dataset into increasingly homogenous subsets. This structure makes decision trees particularly intuitive and easy to interpret, as they visually mimic human decision-making processes.

To train a decision tree, an algorithm selects, at each node, the feature and threshold that best split the data. This selection is typically based on criteria such as Gini impurity or information gain for classification, or the reduction in variance (squared error) for regression, which measure how well a particular split separates the data into homogeneous groups. The process continues recursively, creating branches for each split, until a stopping criterion is met, such as reaching a maximum depth or a minimum number of samples in a leaf. In this way the tree learns from the training data a hierarchy of decision rules that classify or predict the target variable.
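
For a regression target such as sales, the split search can be sketched as follows. This is a minimal illustration, assuming a single numeric feature and a squared-error criterion; the function name `best_split` and the toy data are hypothetical and not part of any library.

```python
import numpy as np


def best_split(x, y):
    """Return (threshold, sse) for the split of one feature that minimizes
    the summed squared error of the two resulting groups."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold can separate identical values
        left, right = y_sorted[:i], y_sorted[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            # Place the threshold halfway between the two neighbouring values
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best_sse = sse
    return best_threshold, best_sse


# Toy data (made up): the target jumps once the feature exceeds 3
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([10.0, 11.0, 10.5, 20.0, 21.0, 19.5])
threshold, sse = best_split(x, y)
print(threshold, sse)
```

A full tree would repeat this search over every feature at every node and recurse into the two resulting groups until a stopping criterion is met.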

## Random Forests

A Random Forest model works by constructing a multitude of decision trees during training and outputting the class that is the mode of the classes (in classification) or mean prediction (in regression) of the individual trees.

The algorithm begins by creating multiple decision trees from randomly selected subsets of the training dataset. This process, known as bootstrapping, involves sampling data with replacement, so different trees see different parts of the dataset. Additionally, when splitting a node, Random Forest considers only a random subset of the features and picks the best split among them, rather than searching over all features. This randomness produces a diverse set of trees and reduces the correlation between them, which improves overall prediction accuracy and reduces overfitting.
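
The two sources of randomness can be sketched with NumPy alone. This is an illustrative sketch rather than the Random Forest implementation of any library; the sizes and the `max_features` value are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n_samples, n_features = 10_000, 6
max_features = 2  # hypothetical setting: candidate features per split

# Bootstrap sample for one tree: row indices drawn with replacement
rows = rng.integers(0, n_samples, size=n_samples)
unique_fraction = np.unique(rows).size / n_samples

# Candidate features for one split: drawn without replacement
candidates = rng.choice(n_features, size=max_features, replace=False)

print(f"distinct rows in the bootstrap sample: {unique_fraction:.1%}")
print("features considered at this split:", candidates)
```

For large samples, roughly 63% of the rows (1 - 1/e) appear at least once in a bootstrap sample, so each tree leaves out about a third of the data.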

## Gradient Boosting

Boosting is a machine learning ensemble technique that improves the accuracy of models by combining multiple weak learners to form a strong learner. A weak learner is a model that is only slightly correlated with the true classification. In boosting, these learners are trained sequentially, with each one focusing on the errors made by the previous ones, thereby incrementally improving the model's performance.

Gradient boosting, a specific type of boosting, refines this process further. It builds the model in a stage-wise fashion, with each new model being added to correct the errors of the sum of the previously built models. The key idea in gradient boosting is to use the gradient descent algorithm to minimize the loss function, which measures the difference between the actual and predicted values. At each step, a new weak learner is trained with respect to the error gradient of the whole ensemble learned so far.

In both boosting and gradient boosting, the final model is a weighted sum of the weak learners. In classic boosting algorithms such as AdaBoost, learners that perform better receive more weight; in gradient boosting, each learner's contribution is typically scaled by a learning rate. These techniques are known for their high accuracy and are often effective on imbalanced datasets, although they can overfit noisy data if left unregularized. They are widely used in various machine learning tasks, including classification, regression, and ranking.
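
The stage-wise procedure can be sketched for the squared-error loss, where the negative gradient is simply the residual. This is a minimal illustration using one-split trees (stumps) written in NumPy; the helper names `fit_stump` and `gradient_boost`, the toy data, and the hyperparameter values are assumptions made for the example, not XGBoost's implementation.

```python
import numpy as np


def fit_stump(x, residuals):
    """Fit a one-split regression tree (stump) to the residuals.
    Assumes x is sorted and contains no repeated values."""
    best = None
    for threshold in (x[:-1] + x[1:]) / 2:
        left = residuals[x <= threshold].mean()
        right = residuals[x > threshold].mean()
        sse = ((residuals - np.where(x <= threshold, left, right)) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left, right)
    _, threshold, left, right = best
    return lambda q: np.where(q <= threshold, left, right)


def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = y.mean()
    prediction = np.full_like(y, base)
    stumps = []
    for _ in range(n_rounds):
        residuals = y - prediction  # negative gradient of 0.5 * (y - F)^2
        stump = fit_stump(x, residuals)
        prediction = prediction + learning_rate * stump(x)
        stumps.append(stump)
    return base, stumps, prediction


# Toy data (made up): a smooth curve that a single stump cannot fit
x = np.linspace(0.0, 6.0, 30)
y = np.sin(x)
base, stumps, fitted = gradient_boost(x, y)
mse_start = np.mean((y - base) ** 2)
mse_end = np.mean((y - fitted) ** 2)
print(mse_start, mse_end)
```

Each round shrinks the training error a little; the learning rate trades the number of rounds needed against the risk of overfitting.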

## Mathematical Foundations

The objective function in XGBoost combines a loss function and a regularization term, defined as:

$$
\text{Obj}(\Theta) = L(\Theta) + \Omega(\Theta)
$$

Where $ L(\Theta) $ is the loss function (measuring the prediction error) and $ \Omega(\Theta) $ is the regularization term (controlling model complexity).

The regularization term, a distinguishing feature of XGBoost's formulation, is defined for each tree $ f $ as:

$$
\Omega(f) = \gamma T + \frac{1}{2} \lambda \|w\|^2
$$

Here, $ T $ is the number of leaves in the tree, $ w $ is the vector of leaf weights, $ \gamma $ is the complexity cost per leaf, and $ \lambda $ is the L2 regularization coefficient on the weights.
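
As a quick numeric check of the formula, the penalty for a small tree can be computed directly. The function name `tree_penalty` and the leaf weights are hypothetical; in `XGBRegressor`, the two coefficients correspond to the `gamma` and `reg_lambda` parameters.

```python
import numpy as np


def tree_penalty(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lambda * ||w||^2 for a single tree."""
    w = np.asarray(leaf_weights, dtype=float)
    return gamma * w.size + 0.5 * lam * np.sum(w ** 2)


# A made-up tree with three leaves
penalty = tree_penalty([0.5, -1.0, 2.0], gamma=1.0, lam=2.0)
print(penalty)  # 1.0 * 3 + 0.5 * 2.0 * (0.25 + 1.0 + 4.0) = 8.25
```

Adding a leaf always costs at least $ \gamma $, so a larger $ \gamma $ favours shallower trees, while a larger $ \lambda $ shrinks the leaf weights toward zero.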

### Tree Ensemble Model
The model in XGBoost is an ensemble of additive functions (trees), represented as:

$$
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F
$$

Where $ F $ is the space of regression trees, $ x_i $ are the input features, $ \hat{y}_i $ is the predicted output, and $ f_k $ denotes an individual tree.

Each tree is built by iteratively adding branches, optimizing the feature splits based on the gain, which is calculated using [gradient and Hessian statistics](https://en.wikipedia.org/wiki/Hessian_matrix). Post-training, XGBoost assigns an importance score to each feature based on the number of times a feature is used in splits and the associated gain.
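
For reference, the gain mentioned above takes the following standard form in the XGBoost documentation. The symbols here are not defined elsewhere in this notebook: $ g_i $ and $ h_i $ are the first and second derivatives of the loss with respect to the current prediction for sample $ i $, and $ G $ and $ H $ are their sums over the samples in a node, with subscripts $ L $ and $ R $ for the left and right children of a candidate split.

$$
w^* = -\frac{G}{H + \lambda}, \qquad
\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma
$$

Here $ w^* $ is the optimal weight of a leaf. A split is only worthwhile when its gain is positive, so $ \gamma $ acts as the minimum loss reduction required to add a leaf.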


# Imports

In [1]:
import os
import warnings

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error
from xgboost import XGBRegressor

warnings.filterwarnings("ignore")

# Data Preparation

## Data Loading

As in the last notebook, we'll focus on forecasting sales data specifically for Store 2.

In [2]:
root_data_folder = "data"
raw_data_folder = os.path.join(root_data_folder, "raw")
processed_data_folder = os.path.join(root_data_folder, "processed")

In [3]:
stores_sales_df = pd.read_parquet(
    os.path.join(processed_data_folder, "stores-sales.parquet")
)

sample_store_data = (
    stores_sales_df[stores_sales_df["Store"] == 2]
)[["Sales", "StateHoliday", "Promo"]]

sample_store_data

Date,Sales,StateHoliday,Promo
2013-01-01,0,a,0
2013-01-02,4422,0,0
2013-01-03,4159,0,0
2013-01-04,4484,0,0
2013-01-05,2342,0,0
...,...,...,...
2015-07-27,6627,0,1
2015-07-28,5671,0,1
2015-07-29,6402,0,1
2015-07-30,5567,0,1


### Data Resampling

We'll also resample the data into weekly totals, in the same way we did in the last notebooks. The `StateHoliday` and `Promo` features must be resampled too: a week is flagged if any of its days had a holiday or a promotion.

In [4]:
weekly_features = sample_store_data["Sales"].resample("W").sum().reset_index()

weekly_features["StateHoliday"] = (
    sample_store_data["StateHoliday"]
        .map({"0": 0, "a": 1, "b": 1, "c": 1})
        .resample("W")
        .max()
        .reset_index(drop=True)
)

weekly_features["Promo"] = (
    sample_store_data["Promo"]
        .resample("W")
        .max()
        .reset_index(drop=True)
)

# Rename the columns to the ds/y names Prophet requires (carried over from the earlier notebooks)
weekly_features = weekly_features.rename(columns={"Date": "ds", "Sales": "y"})
weekly_features

,ds,y,StateHoliday,Promo
0,2013-01-06,15407,1,0
1,2013-01-13,32914,0,1
2,2013-01-20,21081,0,0
3,2013-01-27,29973,0,1
4,2013-02-03,23297,0,0
...,...,...,...,...
130,2015-07-05,39757,0,1
131,2015-07-12,25264,0,0
132,2015-07-19,32399,0,1
133,2015-07-26,23838,0,0
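
To make the resampling behaviour concrete, here is a small sketch on made-up daily data (the values are invented for illustration). In pandas, the `"W"` alias groups days into weeks ending on Sunday, which is why the first weekly row above is dated 2013-01-06.

```python
import pandas as pd

# Eight made-up days starting on Tuesday 2013-01-01
daily = pd.DataFrame(
    {
        "Sales": [0, 100, 200, 300, 400, 500, 600, 700],
        "Promo": [0, 0, 1, 0, 0, 0, 0, 0],
    },
    index=pd.date_range("2013-01-01", periods=8, freq="D", name="Date"),
)

weekly_sales = daily["Sales"].resample("W").sum()  # total sales per week
weekly_promo = daily["Promo"].resample("W").max()  # 1 if any day had a promo

print(weekly_sales)
print(weekly_promo)
```

The first week covers 2013-01-01 through 2013-01-06 and the second starts on Monday 2013-01-07, so a partial week at either end of the data yields a smaller total than a full one.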


### Data Split

Again, we'll hold out the most recent 20% of the data for validation and use the remaining 80% for training.

In [5]:
data_len = len(weekly_features.index)
test_len = int(data_len * 0.2)

train_df = weekly_features.iloc[:-test_len].reset_index(drop=True)
test_df = weekly_features.iloc[-test_len:].reset_index(drop=True)

display(train_df)
display(test_df)

,ds,y,StateHoliday,Promo
0,2013-01-06,15407,1,0
1,2013-01-13,32914,0,1
2,2013-01-20,21081,0,0
3,2013-01-27,29973,0,1
4,2013-02-03,23297,0,0
...,...,...,...,...
103,2014-12-28,22955,1,0
104,2015-01-04,21050,0,0
105,2015-01-11,32981,0,1
106,2015-01-18,34196,0,1


,ds,y,StateHoliday,Promo
0,2015-02-01,30982,0,1
1,2015-02-08,34267,0,1
2,2015-02-15,23325,0,0
3,2015-02-22,33343,0,1
4,2015-03-01,23143,0,0
5,2015-03-08,33815,0,1
6,2015-03-15,23309,0,0
7,2015-03-22,34118,0,1
8,2015-03-29,24492,0,0
9,2015-04-05,35927,1,1
