# Time Series Forecasting with `XGBoost`

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).

**`Time series forecasting` is a type of predictive modeling technique that is used to forecast future values of a variable based on historical data. In a time series, data points are collected at regular intervals over time, and the goal of time series forecasting is to make accurate predictions about future values of the variable based on the patterns and trends observed in the historical data.**

<img src="https://quantdare.com/wp-content/uploads/2021/02/3D_animation_clear_background.gif" alt="drawing" width="400"/>

**`XGBoost` is a popular machine learning library that is often used for time series forecasting. The name stands for "_eXtreme Gradient Boosting_," and it implements gradient boosting in a way that is designed to work well with large and complex datasets.**

> **`Gradient boosting` is a machine learning technique used for building predictive models, most commonly with `decision trees` as the base learners. It works by iteratively adding new `decision trees` to the model, each one correcting the errors of the ensemble built so far. `Gradient boosting` is a type of `ensemble learning`, which combines multiple weaker models to create a stronger overall model.**

**`XGBoost` works by building a series of decision trees, where each tree is trained to predict the residual error of the previous tree. This allows the algorithm to capture complex nonlinear relationships and interactions between variables, making it a powerful tool for `time series forecasting`.**
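**To make the residual-fitting idea concrete, here is a minimal sketch of plain gradient boosting for squared error, using one-split trees ("stumps") on synthetic data. It is not how `XGBoost` is implemented internally (the library adds regularization and other refinements); the helper names `fit_stump`, `predict_stump`, and `boost`, and the data itself, are assumptions made for illustration.**

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on x that best fits the residuals (least squares)."""
    best = None
    # Exclude the largest value so both sides of the split are never empty.
    for threshold in np.unique(x)[:-1]:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = y.mean()
    prediction = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        residual = y - prediction            # what the ensemble still gets wrong
        stump = fit_stump(x, residual)       # a new weak learner targets that error
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, prediction

# Synthetic nonlinear data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
x = np.linspace(0, 6, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

base, stumps, fitted = boost(x, y)
rmse_start = np.sqrt(np.mean((y - base) ** 2))
rmse_end = np.sqrt(np.mean((y - fitted) ** 2))
print(f"RMSE with mean only: {rmse_start:.3f}, after boosting: {rmse_end:.3f}")
```

**Each round shrinks the training error a little; the `learning_rate` controls how much of each stump's correction is kept.**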

**Overall, time series forecasting and `XGBoost` are important techniques in the field of machine learning, and they are used in a wide range of applications, including finance, economics, and meteorology.**

**In this notebook, we will be working with a fake dataset that contains three years of `sales history` for a certain product (chocolate 🍫). Let us load our dataset.**

In [138]:
import pandas as pd

df = pd.read_csv('data/time_series_data.csv')

display(df.head())

,dates,product_id,sales
0,2020-01-01,chocolate,137.0
1,2020-01-02,chocolate,87.0
2,2020-01-03,chocolate,188.0
3,2020-01-04,chocolate,286.0
4,2020-01-05,chocolate,156.0


**Let us take a look at our history of chocolate sales as a timeline.**

In [139]:
import plotly.graph_objects as go

fig = go.Figure()

fig.add_trace(go.Scatter(x=df.dates, 
                        y=df.sales,
                        name='Sales History', mode='lines'))

fig.update_xaxes(showgrid=False, showline=False, mirror=False)
fig.update_yaxes(showgrid=True, ticksuffix=' Kg')
fig.update_layout(
template='plotly_dark',
title=f"Chocolate Sales History",
paper_bgcolor='rgba(0, 0, 0, 0)',
plot_bgcolor='rgba(0, 0, 0, 0)',
)
fig.show()

**This dataset looks quite noisy, but it could have some seasonal patterns.**

> **Note: A seasonal pattern is a recurring pattern that occurs at a fixed time interval within a time series. It is a type of pattern that occurs due to changes in the time of the year, such as weather, holidays, and other seasonal factors. For instance, sales of winter clothes tend to increase during the winter season, whereas sales of summer clothes tend to increase during the summer season.**
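**One simple way to probe for a seasonal pattern is to look at the autocorrelation of the series at the suspected period. The sketch below uses a synthetic series with a built-in weekly cycle (the data and variable names are assumptions, not our chocolate dataset) and compares the autocorrelation at a lag of 7 days with the one at a lag of 3 days.**

```python
import numpy as np
import pandas as pd

# Synthetic daily series: a weekly cycle plus noise (hypothetical data).
dates = pd.date_range("2020-01-01", periods=364, freq="D")
rng = np.random.default_rng(42)
weekly_cycle = 20 * np.sin(2 * np.pi * np.arange(364) / 7)
sales = pd.Series(100 + weekly_cycle + rng.normal(scale=5, size=364), index=dates)

# Correlation between the series and a copy of itself shifted by `lag` days.
lag_7 = sales.autocorr(lag=7)
lag_3 = sales.autocorr(lag=3)

print(f"Autocorrelation at lag 7: {lag_7:.2f}")
print(f"Autocorrelation at lag 3: {lag_3:.2f}")
```

**A strong positive value at lag 7 and a weak or negative value at other lags would hint at a weekly pattern. On real data the signal is usually much less clean.**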

**Let us also print some important statistical information about the distribution of sales that our product has.**

**Here's a brief explanation of each:**

- **`Mean`: The mean, also known as the average, is the sum of all values in a dataset divided by the total number of values.** 
- **`Minimum`: The minimum is the smallest value in a dataset.** 
- **`Maximum`: The maximum is the largest value in a dataset.**
- **`Variance`: The variance is a measure of how spread out the values in a dataset are. It is calculated by taking the average of the squared differences between each value and the mean. A high variance indicates that the values are widely spread out, while a low variance indicates that the values are tightly clustered around the mean.**
- **`Standard deviation`: The standard deviation is the square root of the variance. It is also a measure of how spread out the values in a dataset are, but is easier to interpret than the variance since it is expressed in the same units as the original data. A high standard deviation indicates that the values are widely spread out, while a low standard deviation indicates that the values are tightly clustered around the mean.**
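**As a small worked example of the definitions above, the sketch below computes the variance by hand on made-up numbers. One detail worth knowing: pandas' `var()` and `std()` divide by $n - 1$ by default (the sample variance), while the "average of squared differences" divides by $n$ (the population variance, `ddof=0`).**

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.sum() / len(values)
squared_differences = (values - mean) ** 2

population_variance = squared_differences.sum() / len(values)        # divide by n
sample_variance = squared_differences.sum() / (len(values) - 1)      # divide by n - 1

print("Mean:", mean)
print("Population variance:", population_variance, "| pandas ddof=0:", values.var(ddof=0))
print("Sample variance:", sample_variance, "| pandas default:", values.var())
print("Standard deviation (pandas default):", values.std())
```

**For a dataset with hundreds of rows, like ours, the two versions differ only slightly.**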

In [140]:
print(f'Statistics Report (Chocolate)\n{"-" * 50}')
print("Mean Sales:", df.sales.mean())
print("Minimum Sold:", df.sales.min())
print("Maximum Sold:", df.sales.max())
print("Variance:", df.sales.var())
print("Standard Deviation:", df.sales.std())

Statistics Report (Chocolate)
--------------------------------------------------
Mean Sales: 103.3488160291439
Minimum Sold: 0.0
Maximum Sold: 466.0
Variance: 2976.16354007369
Standard Deviation: 54.5542256848513


**We have a pretty considerable `variance`. Some of the values in our dataset could be `outliers`.**

> **Note: In statistics and machine learning, an `outlier` is an observation that is significantly different from other observations in the dataset. `Outliers` can be caused by measurement or recording errors, natural variations in the data, or rare events. `Outliers` can have a large effect on the mean, variance, and other statistics that are used to summarize the data, and can affect the performance of machine learning models that are trained on the data.**

**Let us plot the histogram to get a better understanding of the value distribution of our sales.**

In [141]:
fig = go.Figure(data=[go.Histogram(x=df.sales)])

fig.update_xaxes(showgrid=False, ticksuffix=' Kg', showline=False, mirror=False)
fig.update_yaxes(showgrid=True)

fig.update_layout(
template='plotly_dark',
title=f"Chocolate Sales Histogram",
paper_bgcolor='rgba(0, 0, 0, 0)',
plot_bgcolor='rgba(0, 0, 0, 0)',
)

fig.show()

fig = go.Figure(data=go.Scatter(x=df.dates, 
                                y=df.sales, mode='markers'))

fig.update_layout(
template='plotly_dark',
title=f"Chocolate Sales Outliers",
paper_bgcolor='rgba(0, 0, 0, 0)',
plot_bgcolor='rgba(0, 0, 0, 0)',
)

fig.show()


**We could, for example, decide that sales above 400 Kg are a statistical anomaly and replace these values with an acceptable maximum (e.g., 350 Kg). Or we could do something fancier and say that if a value is more than 3 `standard deviations` above the `mean`, we will set it to $mean + (std \times 3)$.**
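**This kind of capping can also be expressed with pandas' `Series.clip`. The sketch below uses synthetic values with two planted outliers (the data and names are assumptions) and caps everything above the mean plus three standard deviations.**

```python
import numpy as np
import pandas as pd

# Synthetic sales with two planted outliers (hypothetical data).
rng = np.random.default_rng(7)
sales = pd.Series(rng.normal(loc=100, scale=20, size=200))
sales.iloc[[10, 150]] = 450.0

# The cap is computed once, from the uncapped data.
upper_limit = sales.mean() + 3 * sales.std()

# Values above the cap are replaced by the cap; everything else is untouched.
capped = sales.clip(upper=upper_limit)

print(f"Upper limit: {upper_limit:.2f}")
print(f"Maximum before: {sales.max():.2f}, after: {capped.max():.2f}")
print("Values changed:", int((capped != sales).sum()))
```

**Note that the outliers themselves inflate the mean and the standard deviation used to build the cap, so this rule is more lenient than it first appears.**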

In [142]:
# Cap every value above mean + 3 standard deviations at that limit
upper_limit = df.sales.mean() + (df.sales.std() * 3)
df.loc[df['sales'] > upper_limit, 'sales'] = upper_limit

print(f'Statistics Report (Chocolate)\n{"-" * 50}')
print("Mean Sales:", df.sales.mean())
print("Minimum Sold:", df.sales.min())
print("Maximum Sold:", df.sales.max())
print("Variance:", df.sales.var())
print("Standard Deviation:", df.sales.std())

Statistics Report (Chocolate)
--------------------------------------------------
Mean Sales: 102.23241838107728
Minimum Sold: 0.0
Maximum Sold: 267.0114930836978
Variance: 2482.95001490196
Standard Deviation: 49.82920845148917


### Feature Engineering 

**`Feature engineering` is the process of selecting, transforming, and creating input variables (`features`) to improve the performance of machine learning models. It involves using domain knowledge to select and transform the most relevant input variables to train machine learning models effectively.**

**In other words, `feature engineering` is the process of converting raw data into a set of features that can be used to train a machine learning model. This process involves a variety of techniques, such as data cleaning, feature selection, feature extraction, and feature scaling.**

**So far, `sales` is our only numerical column, so we are now going to create some new features.**

**First, let us suppose that how much chocolate I sold today is correlated with how much chocolate I'll sell tomorrow. Thus, let us create features that capture the day-over-day difference in sales for each of the last 7 days.**

**We will also create features that tell us the difference in sales looking one year back in the past, plus the moving average values in the window of one week and two weeks.**
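**To see what `shift` and `rolling` produce, here is a tiny sketch on five made-up values (the column names are assumptions). The last column shows a variant that shifts before averaging, so that each row's moving average is built only from earlier days; this can matter when the current day's value is what we are trying to predict.**

```python
import pandas as pd

# Five made-up daily sales values.
sales = pd.Series([10.0, 12.0, 15.0, 11.0, 18.0])

features = pd.DataFrame({
    "sales": sales,
    # Today's sales minus yesterday's sales.
    "difference_1": sales - sales.shift(1),
    # Mean of the current day and the two days before it.
    "moving_average_3": sales.rolling(window=3).mean(),
    # Mean of the three days before the current day (current day excluded).
    "past_moving_average_3": sales.shift(1).rolling(window=3).mean(),
})

print(features)
```

**The first rows contain `NaN` because there is not enough history to fill the window, which is why a `dropna()` usually follows this kind of feature creation.**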

In [144]:
def create_sales_features(df):
    """
    Creates a copy of the input df and
    calculates the day-over-day differences
    in sales for the last 7 days, plus the
    difference relative to one year back. It also
    adds the moving average of sales over
    one-week and two-week windows.
    """
    df = df.copy()

    previous = df.sales.shift(1)
    df['difference_1'] = df.sales - previous

    df['moving_average_week'] = df.sales.rolling(window=7).mean()

    df['moving_average_two_weeks'] = df.sales.rolling(window=14).mean()

    for i in range(1, 7):
        column = 'difference_' + str(i+1)
        df[column] = df['difference_1'].shift(i)

    df['difference_year'] = df.sales - df.sales.shift(366)

    df = df.dropna()

    return df

df_sales_features = create_sales_features(df)

display(df_sales_features.head())

,dates,product_id,sales,difference_1,moving_average_week,moving_average_two_weeks,difference_2,difference_3,difference_4,difference_5,difference_6,difference_7,difference_year
366,2021-01-01,chocolate,91.0,-8.0,115.571429,118.714286,-9.0,-21.0,79.0,-66.0,-100.0,116.0,-46.0
367,2021-01-02,chocolate,69.0,-22.0,94.571429,114.714286,-8.0,-9.0,-21.0,79.0,-66.0,-100.0,-18.0
368,2021-01-03,chocolate,159.0,90.0,100.714286,118.285714,-22.0,-8.0,-9.0,-21.0,79.0,-66.0,-29.0
369,2021-01-04,chocolate,232.0,73.0,126.714286,125.071429,90.0,-22.0,-8.0,-9.0,-21.0,79.0,-35.011493
370,2021-01-05,chocolate,126.0,-106.0,126.285714,121.071429,73.0,90.0,-22.0,-8.0,-9.0,-21.0,-30.0


**Second, if our data has some form of seasonality, information like the `day of the week`, `day of the year`, `quarter`, `month`, and `year`, could help us in this forecasting problem.**

**Luckily, pandas gives us all of these for free, as long as our index is in a `datetime` format.**

In [145]:
def create_time_features(df):
    """
    Creates a copy of the input df, turns
    the index (should come as dates) into a
    `datetime format`, and gives you back
    a DataFrame with all time features.
    """

    df = df.copy()

    df = df.set_index('dates')

    df.index = pd.to_datetime(df.index)

    df['day_of_week'] = df.index.day_of_week
    df['day_of_year'] = df.index.day_of_year
    df['quarter'] = df.index.quarter
    df['month'] = df.index.month
    df['year'] = df.index.year

    df = df.sort_index()

    return df

df_time_features = create_time_features(df_sales_features)

display(df_time_features.head())

dates,product_id,sales,difference_1,moving_average_week,moving_average_two_weeks,difference_2,difference_3,difference_4,difference_5,difference_6,difference_7,difference_year,day_of_week,day_of_year,quarter,month,year
2021-01-01,chocolate,91.0,-8.0,115.571429,118.714286,-9.0,-21.0,79.0,-66.0,-100.0,116.0,-46.0,4,1,1,1,2021
2021-01-02,chocolate,69.0,-22.0,94.571429,114.714286,-8.0,-9.0,-21.0,79.0,-66.0,-100.0,-18.0,5,2,1,1,2021
2021-01-03,chocolate,159.0,90.0,100.714286,118.285714,-22.0,-8.0,-9.0,-21.0,79.0,-66.0,-29.0,6,3,1,1,2021
2021-01-04,chocolate,232.0,73.0,126.714286,125.071429,90.0,-22.0,-8.0,-9.0,-21.0,79.0,-35.011493,0,4,1,1,2021
2021-01-05,chocolate,126.0,-106.0,126.285714,121.071429,73.0,90.0,-22.0,-8.0,-9.0,-21.0,-30.0,1,5,1,1,2021


**As a last modification, we can use the `StandardScaler` to standardize the scale of the features in the data, which can improve the performance of machine learning models that are sensitive to the scale of the features.**

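**In _general_, standardizing the features is good practice for models that are sensitive to scale, such as linear or distance-based regressors. Tree-based models like `XGBoost` split on thresholds and are largely insensitive to feature scale, so for them this step is mostly optional.**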
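**A minimal sketch of what standardization does, on made-up numbers: each column has its mean subtracted and is divided by its standard deviation. The sketch fits the scaler on a training slice only and reuses those statistics on the held-out slice, which is a common way to keep information from the test period out of the preprocessing (the 80/20 slicing here is an assumption for illustration).**

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: 100 rows, 2 columns.
rng = np.random.default_rng(0)
features = rng.normal(loc=50, scale=10, size=(100, 2))
train, test = features[:80], features[80:]

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)   # learn mean/std from the training rows
test_scaled = scaler.transform(test)         # reuse the same mean/std on new rows

# The same transformation written out by hand.
manual = (train - train.mean(axis=0)) / train.std(axis=0)

print("Train means after scaling:", train_scaled.mean(axis=0).round(6))
print("Train stds after scaling:", train_scaled.std(axis=0).round(6))
print("Matches manual formula:", np.allclose(train_scaled, manual))
```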

In [146]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

def scale_dataset(df):
    """
    This function is used to scale all feature values.
    It transforms each feature so that it has a 
    mean of 0 and a standard deviation of 1. 
    """

    df = df.copy()

    features = df.drop(['product_id', 'sales'], axis=1)
    product_id = df[['product_id']]
    target = df[['sales']]

    scaler.fit(features)

    features = pd.DataFrame(
        scaler.transform(features), 
        columns=features.columns, 
        index=features.index)

    df = pd.concat([product_id, target, features], axis=1)

    return df

df_scaled = scale_dataset(df_time_features)

display(df_scaled.head())

dates,product_id,sales,difference_1,moving_average_week,moving_average_two_weeks,difference_2,difference_3,difference_4,difference_5,difference_6,difference_7,difference_year,day_of_week,day_of_year,quarter,month,year
2021-01-01,chocolate,91.0,-0.12336,1.009674,1.433483,-0.139426,-0.32432,1.222318,-1.020935,-1.546401,1.793529,-1.030095,0.497358,-1.718013,-1.346544,-1.594967,-0.997388
2021-01-02,chocolate,69.0,-0.339905,0.004941,1.196106,-0.123955,-0.13868,-0.324412,1.2224,-1.02051,-1.545988,-0.173555,0.996762,-1.708548,-1.346544,-1.594967,-0.997388
2021-01-03,chocolate,159.0,1.392457,0.298843,1.40805,-0.340543,-0.12321,-0.138804,-0.324728,1.222262,-1.020323,-0.510053,1.496167,-1.699083,-1.346544,-1.594967,-0.997388
2021-01-04,chocolate,232.0,1.129509,1.542798,1.810743,1.392163,-0.33979,-0.123337,-0.139072,-0.324477,1.221482,-0.693949,-1.50026,-1.689617,-1.346544,-1.594967,-0.997388
2021-01-05,chocolate,126.0,-1.639177,1.522293,1.573366,1.129163,1.392847,-0.339879,-0.123601,-0.138868,-0.324591,-0.540644,-1.000856,-1.680152,-1.346544,-1.594967,-0.997388


**With these new features, we can now explore how they are related to the sales of chocolate. For example, on which days of the week, or in which months, are sales higher? We can plot `box plots` to find out.**

> **Note: Box plots, also known as box-and-whisker plots, are a graphical representation of a dataset that displays the distribution of the data based on five summary statistics: minimum value, first quartile (25th percentile), median (50th percentile), third quartile (75th percentile), and maximum value.**
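**The five summary statistics behind a box plot can be computed directly with pandas. The sketch below does so on nine made-up values (the numbers and the `summary` dictionary are assumptions for illustration).**

```python
import pandas as pd

# Nine made-up values.
sales = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)

summary = {
    "minimum": sales.min(),
    "first_quartile": sales.quantile(0.25),
    "median": sales.median(),
    "third_quartile": sales.quantile(0.75),
    "maximum": sales.max(),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

**Plotting libraries may draw the whiskers using a rule based on the interquartile range rather than the raw minimum and maximum, so the plotted whiskers do not always coincide with these two values.**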

In [147]:
import plotly.express as px


fig = px.box(df_time_features, x="day_of_week", y="sales", points="all", color="day_of_week")

fig.update_xaxes(showgrid=False, showline=False, mirror=False)
fig.update_yaxes(showgrid=True, ticksuffix=' Kg')

fig.update_layout(
template='plotly_dark',
title=f"Sales by Day of the Week",
paper_bgcolor='rgba(0, 0, 0, 0)',
plot_bgcolor='rgba(0, 0, 0, 0)',
)

fig.show()

fig = px.box(df_time_features, x="month", y="sales", points="all", color="month")

fig.update_xaxes(showgrid=False, showline=False, mirror=False)
fig.update_yaxes(showgrid=True, ticksuffix=' Kg')

fig.update_layout(
template='plotly_dark',
title=f"Sales by Month",
paper_bgcolor='rgba(0, 0, 0, 0)',
plot_bgcolor='rgba(0, 0, 0, 0)',
)

fig.show()

**For the train/test split, we could do something very simple and use the last 30 days of our dataset for testing.**

In [148]:
train = df_scaled.loc[df_scaled.index < '2022-12-04']
test = df_scaled.loc[df_scaled.index >= '2022-12-04']

print('Number of samples (days) for training: ', len(train))
print('Number of samples (days) for testing: ', len(test))

fig = go.Figure()

fig.add_trace(go.Scatter(x=train.index, 
                        y=train.sales,
                        name='Train Data', mode='lines'))

fig.add_trace(go.Scatter(x=test.index, 
                        y=test.sales,
                        name='Test Data', mode='lines'))

fig.update_xaxes(showgrid=False, showline=False, mirror=False)
fig.update_yaxes(showgrid=True, ticksuffix=' Kg')
fig.update_layout(
template='plotly_dark',
title=f"Dataset Train/Test Split",
paper_bgcolor='rgba(0, 0, 0, 0)',
plot_bgcolor='rgba(0, 0, 0, 0)',
)
fig.show()

Number of samples (days) for training:  702
Number of samples (days) for testing:  30


**Or, we could do something more robust and use time series `cross validation` with a `TimeSeriesSplit`.**

> **Note: `Cross-validation` is a technique used in machine learning to evaluate the performance of a model and to find the optimal model hyperparameters. It involves dividing the available data into several subsets or "_folds_." The model is trained on a subset of the data called the training set and then evaluated on the remaining subset called the validation set. This process is repeated several times, with different subsets of data being used for training and validation in each iteration. The results are then averaged to obtain a more accurate estimate of the model's performance.**

**[`TimeSeriesSplit`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html) is a `cross-validation` technique for time series data provided by the `scikit-learn` library. It is similar to the traditional `cross-validation` technique, but it is designed to preserve the temporal order of the data when splitting it into training and validation sets.**

**Below, we split our dataset into 4 folds. Each test set has 30 days, and a 1-day gap separates the end of each training set from the start of its test set.**
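**To see exactly which rows end up where, here is a small sketch that runs `TimeSeriesSplit` on 12 dummy samples with 3 folds, a test size of 2, and a gap of 1 (these small numbers are assumptions chosen so the indices are easy to read).**

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve dummy samples, ordered in time.
samples = np.arange(12)

tss = TimeSeriesSplit(n_splits=3, test_size=2, gap=1)

splits = []
for fold, (train_idx, test_idx) in enumerate(tss.split(samples), start=1):
    splits.append((train_idx.tolist(), test_idx.tolist()))
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

**The training window always ends before the test window starts, it grows from fold to fold, and the sample skipped by the gap is used in neither set for that fold.**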

In [149]:
from sklearn.model_selection import TimeSeriesSplit

tss = TimeSeriesSplit(n_splits=4, test_size=30, gap=1)

fold= 0
for train_idx, val_idx in tss.split(df_scaled):
    train = df_scaled.iloc[train_idx]
    test = df_scaled.iloc[val_idx]
    
    fig = go.Figure()

    fig.add_trace(go.Scatter(x=train.index, 
                            y=train.sales,
                            name='Train Data', mode='lines'))

    fig.add_trace(go.Scatter(x=test.index, 
                            y=test.sales,
                            name='Test Data', mode='lines'))

    fig.update_xaxes(showgrid=False, showline=False, mirror=False)
    fig.update_yaxes(showgrid=True, ticksuffix=' Kg')
    fig.update_layout(
    template='plotly_dark',
    title=f"Dataset Train/Test Split (Fold {fold+1})",
    paper_bgcolor='rgba(0, 0, 0, 0)',
    plot_bgcolor='rgba(0, 0, 0, 0)',
    )
    fig.show()

    fold += 1

**Let us now also separate `features` from our `target` variable.**

In [150]:
FEATURES = ['difference_1', 'difference_2', 'difference_3',
       'difference_4', 'difference_5', 'difference_6', 
       'difference_7', 'moving_average_week', 
       'moving_average_two_weeks', 'difference_year', 
       'day_of_week', 'day_of_year', 'quarter', 'month', 'year']

TARGET = ['sales']


print('Number of features: ', len(FEATURES))
print('Features shape: ', df_time_features[FEATURES].values.shape)
print('Targets shape: ', df_time_features[TARGET].values.shape)


Number of features:  15
Features shape:  (732, 15)
Targets shape:  (732, 1)


### Creating a `Regression` Model

**In machine learning, a `regression model` is a type of predictive model that is used to estimate a continuous output variable based on one or more input variables. It is a `supervised learning` technique that aims to find the relationship between the `input variables` (also called predictors or independent variables) and the `output variable` (also called the dependent variable).**

**Regression models can be `linear` or `nonlinear`, depending on the form of the relationship between the input and output variables. Linear regression models assume a linear relationship between the variables, while nonlinear regression models allow for more complex relationships.**

**To create our regression model we will use the `XGBRegressor` from [`xgboost`](https://xgboost.readthedocs.io/en/stable/index.html). Some of the arguments you can pass when building this model are:**

- **`n_estimators`: Number of gradient boosted trees. Equivalent to number of boosting rounds.**
- **`early_stopping_rounds`: Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training.** 
- **`max_depth`: Maximum tree depth for base learners.**
- **`learning_rate`: Boosting learning rate.**
- **`booster`: Specify which booster to use ("gbtree", "gblinear" or "dart").**

**For more information, read the [documentation](https://xgboost.readthedocs.io/en/stable/python/python_api.html#module-xgboost.sklearn).**

**Below we are giving the model up to 1000 trees, early stopping (to prevent `overfitting`) that triggers after 100 rounds without improvement on the validation set, and a learning rate of 0.1 (low learning rates also help to prevent `overfitting`).**

> **Note: In machine learning, `overfitting` refers to a phenomenon where a model learns the training data too well, to the point that it starts to memorize noise and irrelevant patterns in the data, rather than learning the underlying patterns that generalize well to new data. As a result, an overfit model performs well on the training data but poorly on the validation or test data, as it fails to capture the true relationship between the input and output variables. To avoid `overfitting`, it is important to use techniques such as cross-validation, early stopping, and regularization, which help to prevent the model from overfitting the training data and improve its ability to generalize to new data.**
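**The idea behind early stopping can be sketched in a few lines of plain Python: keep track of the best validation score seen so far, and stop once it has not improved for a given number of rounds. This is a simplified illustration, not `XGBoost`'s own implementation; the function name `early_stopping_round`, the `patience` argument, and the scores are assumptions.**

```python
def early_stopping_round(validation_scores, patience):
    """Return (best_round, stopped_at) for a sequence of validation errors.

    Training stops at the first round where the best score is `patience`
    rounds old. If that never happens, it runs to the last round.
    """
    best_score = float("inf")
    best_round = 0
    for current_round, score in enumerate(validation_scores):
        if score < best_score:
            best_score = score
            best_round = current_round
        elif current_round - best_round >= patience:
            return best_round, current_round
    return best_round, len(validation_scores) - 1

# Made-up validation errors: they improve, then stall.
scores = [10.0, 8.0, 7.0, 7.5, 7.2, 7.1, 7.3, 6.9]

best_round, stopped_at = early_stopping_round(scores, patience=3)
print(f"Best round: {best_round}, stopped at round: {stopped_at}")
```

**With a patience of 3, training halts before the later improvement to 6.9 is ever seen, which illustrates the trade-off: a small patience saves time but can stop too soon.**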

**Since it is more robust, we will train this model using cross-validation.**

**To evaluate our predictions, we will use the same metric the model used to track its performance: `root mean squared error`.**

**Root Mean Squared Error (`RMSE`) is a common metric used to evaluate the performance of a regression model. It measures the average distance between the predicted and actual values of the target variable, in units of the target variable itself:**

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^n(y_i - \hat{y_i})^2}$$

**where:** 

- **$n$ is the number of samples.**
- **$y_i$ is the true value of the target variable for sample $i$.**
- **$\hat{y_i}$ is the predicted value of the target variable for sample $i$.** 

**The square of the difference between the true and predicted values is first calculated for each sample, then these values are averaged, and finally the square root is taken to obtain the `RMSE`.**
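**The three steps above (square, average, square root) can be written out with NumPy on a few made-up values; the arrays below are assumptions for illustration.**

```python
import numpy as np

# Made-up true and predicted values.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

squared_errors = (y_true - y_pred) ** 2        # [1, 0, 4]
mean_squared_error_value = squared_errors.mean()  # 5 / 3
rmse = np.sqrt(mean_squared_error_value)

print(f"MSE: {mean_squared_error_value:.4f}")
print(f"RMSE: {rmse:.4f}")
```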

**Luckily, we can just import `mean_squared_error` from sklearn and take the square root of the result.**

In [151]:
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import numpy as np


tss = TimeSeriesSplit(n_splits=4, test_size=30, gap=1)

model = xgb.XGBRegressor(n_estimators=1000, booster='gbtree',
                         early_stopping_rounds=100,
                         max_depth=2,
                         learning_rate=0.1)

fold= 0
preds = []
scores = []
for train_idx, val_idx in tss.split(df_scaled):
    train = df_scaled.iloc[train_idx]
    test = df_scaled.iloc[val_idx]

    x_train = train[FEATURES]
    y_train = train[TARGET]

    x_test = test[FEATURES]
    y_test = test[TARGET]
    
    model.fit(x_train, y_train,
          eval_set=[(x_train, y_train), (x_test, y_test)],
          verbose=100)
    
    predictions = model.predict(x_test)
    preds.append(predictions)
    rmse = np.sqrt(mean_squared_error(y_test, predictions))
    scores.append(rmse)

    fig = go.Figure()

    fig.add_trace(go.Scatter(x=test.index, 
                            y=test.sales,
                            name='Sales (ground truth)', mode='lines'))

    fig.add_trace(go.Scatter(x=test.index, 
                            y=predictions,
                            name='Sales (predictions)', mode='lines'))

    fig.update_xaxes(showgrid=False, showline=False, mirror=False)
    fig.update_yaxes(showgrid=True, ticksuffix=' Kg')

    fig.update_layout(
    template='plotly_dark',
    title=f"Chocolate Sales Ground-Truth/Predictions (Fold {fold+1})",
    paper_bgcolor='rgba(0, 0, 0, 0)',
    plot_bgcolor='rgba(0, 0, 0, 0)',
    )

    fig.show()

    fold += 1

print(f'Fold RMSE:{scores}')
print(f'Average RMSE across folds {np.mean(scores)}')


[0]	validation_0-rmse:100.04412	validation_1-rmse:76.51586
[100]	validation_0-rmse:15.54672	validation_1-rmse:20.47236
[200]	validation_0-rmse:10.32234	validation_1-rmse:17.28265
[300]	validation_0-rmse:7.68456	validation_1-rmse:15.50291
[400]	validation_0-rmse:6.13884	validation_1-rmse:14.26140
[500]	validation_0-rmse:5.13958	validation_1-rmse:13.53285
[600]	validation_0-rmse:4.43752	validation_1-rmse:13.01537
[700]	validation_0-rmse:3.90332	validation_1-rmse:12.57092
[800]	validation_0-rmse:3.47773	validation_1-rmse:12.30747
[900]	validation_0-rmse:3.14234	validation_1-rmse:12.02826
[999]	validation_0-rmse:2.84639	validation_1-rmse:11.90364


[0]	validation_0-rmse:99.26038	validation_1-rmse:79.55624
[100]	validation_0-rmse:15.56532	validation_1-rmse:20.00962
[200]	validation_0-rmse:10.18596	validation_1-rmse:15.48029
[300]	validation_0-rmse:7.72513	validation_1-rmse:13.93227
[400]	validation_0-rmse:6.23550	validation_1-rmse:12.87332
[500]	validation_0-rmse:5.25367	validation_1-rmse:12.39007
[600]	validation_0-rmse:4.51830	validation_1-rmse:12.00413
[700]	validation_0-rmse:3.96641	validation_1-rmse:11.89390
[800]	validation_0-rmse:3.52749	validation_1-rmse:11.73430
[900]	validation_0-rmse:3.16850	validation_1-rmse:11.58342
[999]	validation_0-rmse:2.88200	validation_1-rmse:11.53910


[0]	validation_0-rmse:98.51281	validation_1-rmse:78.76558
[100]	validation_0-rmse:15.40014	validation_1-rmse:13.03936
[200]	validation_0-rmse:10.16790	validation_1-rmse:11.42121
[300]	validation_0-rmse:7.70874	validation_1-rmse:10.78674
[400]	validation_0-rmse:6.22202	validation_1-rmse:10.39862
[500]	validation_0-rmse:5.18625	validation_1-rmse:10.04334
[600]	validation_0-rmse:4.48326	validation_1-rmse:9.82397
[700]	validation_0-rmse:3.95138	validation_1-rmse:9.65619
[800]	validation_0-rmse:3.51339	validation_1-rmse:9.50312
[900]	validation_0-rmse:3.14962	validation_1-rmse:9.38608
[999]	validation_0-rmse:2.84897	validation_1-rmse:9.33522


[0]	validation_0-rmse:97.77516	validation_1-rmse:91.14159
[100]	validation_0-rmse:15.39503	validation_1-rmse:17.14030
[200]	validation_0-rmse:10.12196	validation_1-rmse:14.41491
[300]	validation_0-rmse:7.64711	validation_1-rmse:12.94388
[400]	validation_0-rmse:6.28320	validation_1-rmse:11.91510
[500]	validation_0-rmse:5.32726	validation_1-rmse:11.28611
[600]	validation_0-rmse:4.63043	validation_1-rmse:10.87772
[700]	validation_0-rmse:4.08866	validation_1-rmse:10.61451
[800]	validation_0-rmse:3.62630	validation_1-rmse:10.44536
[900]	validation_0-rmse:3.23470	validation_1-rmse:10.23082
[999]	validation_0-rmse:2.93887	validation_1-rmse:10.07217


Fold RMSE:[11.89696141443807, 11.51642022835708, 9.315624564157584, 10.068480591517016]
Average RMSE across folds 10.699371699617437


**Not that bad, but we could certainly improve this model by creating better features or tuning the hyperparameters a little more.**

### Interpreting our Regressor 

**`XGBoost` gives you the importance of each feature used to train the model. You can get these values from the `feature_importances_` attribute. This is a good way to try to `interpret` your model.**

In [152]:

feature_importance = pd.DataFrame({
    'features' : model.feature_names_in_,
    'importance': model.feature_importances_
}).set_index('features').sort_values('importance')

import plotly.graph_objects as go

fig = go.Figure(go.Bar(
    x=feature_importance.importance,
    y=feature_importance.index,
    orientation='h'))

fig.update_xaxes(range=[feature_importance.importance.min() 
    + (feature_importance.importance.min() * 0.1), 
    feature_importance.importance.max() 
    + (feature_importance.importance.max() * 0.1)])

fig.update_layout(
    xaxis=dict(
        tickmode='linear',
        tick0=0,
        dtick=0.5
    ),
    template='plotly_dark',
    title='Feature Importance for Regression Model',
    paper_bgcolor='rgba(0, 0, 0, 0)',
    plot_bgcolor='rgba(0, 0, 0, 0)'
)

fig.show()


**`difference_1`, `moving_average_week`, and `difference_2` seem to be the top 3 most important features of this model.**

**Here are some potential features that could be used to improve this model:**

- **_External variables_: If you have access to external data, such as `weather data` or `economic indicators`, you could include those as features to capture the impact of those variables on sales.**
- **_Promotion and marketing variables_: If you have information about `promotions` or `marketing campaigns` that occurred during the period of interest, you could include those as features.**

### Forecasting the Future

**If we want to try to predict the future, we should train our model on all of our data, so that it can use everything we know. After that, we can create a future time frame and use the lag features from past sales to predict the future.**
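**Forecasting several days ahead with a one-step model is usually done recursively: predict the next day, append that prediction to the history, and repeat. The sketch below shows this loop with a deliberately trivial stand-in model (the mean of the last three values). The names `recursive_forecast` and `mean_of_last_three`, and the numbers, are assumptions for illustration.**

```python
def recursive_forecast(history, predict_next, steps):
    """Forecast `steps` values ahead, feeding each prediction back as history."""
    values = list(history)
    forecast = []
    for _ in range(steps):
        next_value = predict_next(values)   # one-step-ahead prediction
        forecast.append(next_value)
        values.append(next_value)           # the prediction becomes part of the "past"
    return forecast

def mean_of_last_three(values):
    """Stand-in model: predicts the average of the three most recent values."""
    return sum(values[-3:]) / 3

history = [10.0, 12.0, 14.0]
forecast = recursive_forecast(history, mean_of_last_three, steps=3)
print([round(value, 3) for value in forecast])
```

**Because later predictions are built on earlier predictions, errors can accumulate as the horizon grows.**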

In [167]:
from datetime import timedelta

df = pd.read_csv('data/time_series_data.csv')

df.loc[df['sales'] > df.sales.mean() + (df.sales.std() * 3), 'sales'] = df.sales.mean() + (df.sales.std() * 3)

print(f'Statistics Report (Chocolate)\n{"-" * 50}')
print("Mean Sales:", df.sales.mean())
print("Minimum Sold:", df.sales.min())
print("Maximum Sold:", df.sales.max())
print("Variance:", df.sales.var())
print("Standard Deviation:", df.sales.std(), "\n")

FEATURES = ['difference_1', 'difference_2', 'difference_3',
       'difference_4', 'difference_5', 'difference_6', 
       'difference_7', 'moving_average_week', 
       'moving_average_two_weeks', 'difference_year', 
       'day_of_week', 'day_of_year', 'quarter', 'month', 'year']

TARGET = ['sales']

model = xgb.XGBRegressor(n_estimators=1000, booster='gbtree',
                         max_depth=2,
                         learning_rate=0.1)

train_df = create_sales_features(df)
train_df = create_time_features(train_df)
train_df = scale_dataset(train_df)

x_features = train_df[FEATURES]
y_target = train_df[TARGET]

model.fit(x_features, y_target,
          eval_set=[(x_features, y_target)],
          verbose=100)

print('Training over.')

def generate_forecast(df, ahead):
    """
    Takes the original dataframe, sets the `dates`
    column as the index, and converts the index to
    `datetime`. It then loops `ahead` times. On each
    iteration, it builds the features dataframe with one
    additional (future) day and uses the lag features to
    predict the value for that day. The predicted day is
    appended to the bottom of the seed `df`, and the
    process repeats. The function returns only the
    future predictions.
    """

    df = df.set_index('dates')
    df.index = pd.to_datetime(df.index)

    for i in range(ahead):
        
        future_date = df.index.max() + timedelta(days=1)

        future_dates = pd.date_range(start = future_date.strftime("%Y-%m-%d"), 
              end = future_date.strftime("%Y-%m-%d"))

        future_df = pd.DataFrame(index=future_dates)
        present_df = df[['sales']].copy()

        df_with_future = pd.concat([present_df, future_df]).reset_index().rename(columns={"index": "dates"}).fillna(0)
        df_with_future = create_sales_features(df_with_future)
        df_with_future = create_time_features(df_with_future)

        # Note: unlike the training data, this row is not passed through the fitted `scaler`.
        pred = model.predict(df_with_future.tail(1)[FEATURES])

        future_df['sales'] = abs(np.round(pred))

        df = pd.concat([present_df, future_df])
        
    return df.tail(ahead)

prediction_df = generate_forecast(df, 7)

fig = go.Figure()

fig.add_trace(go.Scatter(x=df.dates, 
                        y=df.sales,
                        name='Chocolate Sales history', mode='lines'))

fig.add_trace(go.Scatter(x=prediction_df.index, 
                        y=prediction_df.sales,
                        name='Sales Forecast', mode='lines'))

fig.update_xaxes(showgrid=False, showline=False, mirror=False)
fig.update_yaxes(showgrid=True, ticksuffix=' Kg')

fig.update_layout(
template='plotly_dark',
title=f"Sales Forecast for Chocolate",
paper_bgcolor='rgba(0, 0, 0, 0)',
plot_bgcolor='rgba(0, 0, 0, 0)',
)

fig.show()

print(f"Sales forecast for next 7 days: {prediction_df.sales.sum():.2f} Kg.")

Statistics Report (Chocolate)
--------------------------------------------------
Mean Sales: 102.23241838107728
Minimum Sold: 0.0
Maximum Sold: 267.0114930836978
Variance: 2482.95001490196
Standard Deviation: 49.82920845148917 

[0]	validation_0-rmse:97.47387
[100]	validation_0-rmse:15.45691
[200]	validation_0-rmse:10.20780
[300]	validation_0-rmse:7.67205
[400]	validation_0-rmse:6.20145
[500]	validation_0-rmse:5.25535
[600]	validation_0-rmse:4.54690
[700]	validation_0-rmse:4.01042
[800]	validation_0-rmse:3.59117
[900]	validation_0-rmse:3.22862
[999]	validation_0-rmse:2.94339
Training over.


Sales forecast for next 7 days: 411.00 Kg.


**Let us now see how our 7-day forecast compares to other weeks in our sales history.**

In [193]:
fig = go.Figure()

fig.add_trace(go.Bar(x=prediction_df.sales.resample('7D').sum()\
                     .index.strftime("%Y-%m-%d"), 
                     y=prediction_df.sales.resample('7D').sum(),
                     name='Sales Forecast'))

fig.add_trace(go.Bar(x=train_df.sales.resample('7D').sum()\
                     .index.strftime("%Y-%m-%d"), 
                     y=train_df.sales.resample('7D').sum(),
                     name='Sales History'))

fig.update_xaxes(showgrid=False, showline=False, mirror=False)
fig.update_yaxes(showgrid=True, ticksuffix=' Kg')

fig.update_layout(
template='plotly_dark',
title=f"Sales Forecast for Chocolate (Next 7 Days)",
paper_bgcolor='rgba(0, 0, 0, 0)',
plot_bgcolor='rgba(0, 0, 0, 0)',
)

fig.show()

**The more you try to see into the future, the more difficult it will be to get good results. However, there are many more models that you could try to use to get a better result, like `neural networks`, or you could invest in `feature engineering`, like adding weather features or other things that may be correlated with "_chocolate sales_."**

**Congratulations, you can now see into the future ("_kind of_").**

---

Return to the [castle](https://github.com/Nkluge-correa/teeny-tiny_castle).
