# Introduction #

Run this cell to set everything up!

In [None]:
# Setup feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.time_series.ex5 import *

# Setup notebook
import warnings
from pathlib import Path

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import train_test_split
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess
from statsmodels.tsa.tsatools import lagmat
from xgboost import XGBRegressor

warnings.simplefilter("ignore")

plt.style.use("seaborn-whitegrid")
plt.rc(
    "figure",
    autolayout=True,
    figsize=(11, 4),
    titlesize=18,
    titleweight='bold',
)
plt.rc(
    "axes",
    labelweight="bold",
    labelsize="large",
    titleweight="bold",
    titlesize=16,
    titlepad=10,
)
plot_params = dict(
    color="0.75",
    style=".-",
    markeredgecolor="0.25",
    markerfacecolor="0.25",
    legend=False,
)
%config InlineBackend.figure_format = 'retina'

comp_dir = Path('../input/store-sales-time-series-forecasting')
data_dir = Path("../input/ts-course-data")

store_sales = pd.read_csv(
    comp_dir / 'train.csv',
    usecols=['store_nbr', 'family', 'date', 'sales', 'onpromotion'],
    dtype={
        'store_nbr': 'category',
        'family': 'category',
        'sales': 'float32',
    },
    parse_dates=['date'],
    infer_datetime_format=True,
)
store_sales['date'] = store_sales.date.dt.to_period('D')
store_sales = store_sales.set_index(['store_nbr', 'family', 'date']).sort_index()

family_sales = (
    store_sales
    .groupby(['family', 'date'])
    .mean()
    .unstack('family')
    .loc['2017']
)

mean_sales = store_sales.groupby('date').mean().loc[:, 'sales'].to_timestamp()

-------------------------------------------------------------------------------


The `STL` estimator from the `statsmodels` library decomposes a time series into trend, seasonal, and residual components using LOESS, a kind of local ("moving") regression. It can be useful as a data exploration tool. One limitation is that it can't create a separate cyclic component; cycles will often end up in the trend instead.

Run the next cell to see a season-trend decomposition of the average sales from the *Store Sales* dataset.

In [None]:
from statsmodels.tsa.seasonal import STL

res = STL(
    mean_sales['2016':],  # try different years, if you like
    period=28,  # the length of a season
).fit()
fig = res.plot()
fig.set_size_inches(11, 7);

# 1) Examine season-trend decomposition

What patterns do you see in this decomposition? Does it appear that the **Trend** component contains seasons or cycles? Do the residuals seem like random noise, or can you detect any patterns?

After you've thought about your answer, run this cell for some discussion.

In [None]:
# View the solution (Run this cell to receive credit!)
q_1.check()

-------------------------------------------------------------------------------

In the tutorial we saw how to pivot a dataset from long format to wide format. Pivoting can create multiple levels of indexes or columns, a `MultiIndex`. You can select from a `MultiIndex` by specifying the axis that you want in `loc`:

```
# Select BuildingMaterials in 1992 (long format)
X.loc(axis=0)['1992', 'BuildingMaterials']

# Select Sales of BuildingMaterials (wide format)
X.loc(axis=1)['Sales', 'BuildingMaterials']
```
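
The snippet above assumes a dataset `X` that isn't defined in this notebook. As a self-contained sketch, here is the same pattern on a small made-up frame (the names `long` and `wide`, the second industry, and the values are hypothetical):

```python
import pandas as pd

# Long format: one row per (Year, Industry)
index = pd.MultiIndex.from_product(
    [["1992", "1993"], ["BuildingMaterials", "FoodAndBeverage"]],
    names=["Year", "Industry"],
)
long = pd.DataFrame({"Sales": [10, 20, 30, 40]}, index=index)

# Select along the rows: one label per index level
row = long.loc(axis=0)["1992", "BuildingMaterials"]

# Pivot long to wide: Industry moves into the columns
wide = long.unstack("Industry")

# Select along the columns: one label per column level
col = wide.loc(axis=1)["Sales", "BuildingMaterials"]

print(row)
print(col)
```

Passing `axis` tells `loc` that the whole tuple of labels belongs to a single axis, so it isn't read as a (row, column) pair.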

We'll work with the full *Store Sales* dataset for this question. Run the next cell to take a look. You can see that we're starting with the dataset in long format, with time series indexed by `store_nbr` and `family` along rows.

In [None]:
store_sales

# 2) Use MultiIndexes

Perform the operations indicated in the following cell. The `store_nbr` feature is stored as a categorical for efficiency, and its categories are strings, so be sure to use quotes (Yes: '10' / No: 10).

In [None]:
# YOUR CODE HERE: Select family 'AUTOMOTIVE' from store '4' in '2016' (long format)
#_UNCOMMENT_IF(PROD)_
#answer1 = store_sales.loc(axis=0)['4', ____, ____]

store_sales_wide = store_sales.unstack(['store_nbr', 'family'])  # pivot long to wide

# YOUR CODE HERE: Select 'sales' of 'BOOKS' from store '21'  (wide format)
#_UNCOMMENT_IF(PROD)_
#answer2 = store_sales_wide.loc(axis=____)['sales', ____, ____]


# Check your answer
q_2.check()

In [None]:
# Lines below will give you a hint or solution code
#_COMMENT_IF(PROD)_
q_2.hint()
#_COMMENT_IF(PROD)_
q_2.solution()

In [None]:
#%%RM_IF(PROD)%%
answer1 = store_sales.loc(axis=0)['4', 'AUTOMOTIVE', '2016']
store_sales_wide = store_sales.unstack(['store_nbr', 'family'])
answer2 = store_sales_wide.loc(axis=1)['sales', '21', 'BOOKS']

q_2.assert_check_passed()

----------

In the next question, you'll create a boosted hybrid for the *Store Sales* dataset.
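
As a reminder of the general idea, a boosted hybrid fits one model to the target, fits a second model to the residuals of the first, and adds the two sets of predictions. The sketch below illustrates this on synthetic data. It uses scikit-learn's `GradientBoostingRegressor` as a stand-in for `XGBRegressor`, and all of the names and data are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic series: linear trend + weekend bump + noise (illustrative only)
rng = np.random.default_rng(0)
t = np.arange(200)
day = t % 7
y_toy = 0.1 * t + 5.0 * (day >= 5) + rng.normal(0, 0.3, size=t.size)

# Stage 1: a linear model learns the trend from the time index
X_trend = t.reshape(-1, 1)
trend_model = LinearRegression().fit(X_trend, y_toy)
y_trend = trend_model.predict(X_trend)

# Stage 2: a tree ensemble learns what the trend model left over
X_season = day.reshape(-1, 1)
y_resid_toy = y_toy - y_trend
resid_model = GradientBoostingRegressor(random_state=0).fit(X_season, y_resid_toy)

# Hybrid prediction: predicted trend plus predicted residuals
y_hybrid = y_trend + resid_model.predict(X_season)

mse_trend = np.mean((y_toy - y_trend) ** 2)
mse_hybrid = np.mean((y_toy - y_hybrid) ** 2)
print(f"trend only: {mse_trend:.3f}  hybrid: {mse_hybrid:.3f}")
```

The linear model can extrapolate the trend, which tree ensembles cannot do, while the tree ensemble picks up the day-of-week effect.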

Run this cell to create the trend model for the first stage of the hybrid. This code follows the standard procedure you learned in Lesson 2.

In [None]:
# Get target series
y = family_sales.loc[:, 'sales']

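# Create trend features
dp = DeterministicProcess(index=y.index, order=1)
X = dp.in_sample()

# Assumption: no separate validation split is defined in this notebook,
# so the trend model is trained on all of 2017
X_train, y_train = X, y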

# Fit trend model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_fit = pd.DataFrame(
    model.predict(X_train),
    index=y_train.index,
    columns=y_train.columns,
)

And now run this cell to prepare the data for XGBoost.

In [None]:
y_fit = y_fit.stack().squeeze()  # wide to long

X = family_sales.stack()  # wide to long
y = X.pop('sales')  # grab target series

# Label encoding for 'family' feature
X = X.reset_index('family')
for colname in X.select_dtypes(["object", "category"]):
    X[colname], _ = X[colname].factorize()

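# Label encoding for seasonality
X["day"] = X.index.day  # values are day of the month

# Training data in long format, used by the cells below
# (assumption: all of 2017, matching the trend model)
X_train, y_train = X, y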
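
To see what the reshaping steps in this cell do, here is a minimal sketch on a made-up wide frame (the names `wide_toy`, `long_toy`, `X_toy`, `y_toy` and the values are hypothetical). `stack` moves the `family` column level into the row index, and `factorize` replaces each family name with an integer code.

```python
import pandas as pd

# Wide format: one column per (feature, family), one row per date
dates = pd.period_range("2017-01-01", periods=3, freq="D", name="date")
columns = pd.MultiIndex.from_product(
    [["sales", "onpromotion"], ["BOOKS", "DAIRY"]],
    names=[None, "family"],
)
wide_toy = pd.DataFrame(
    [[1.0, 2.0, 0, 1], [3.0, 4.0, 0, 0], [5.0, 6.0, 2, 1]],
    index=dates,
    columns=columns,
)

# Wide to long: rows are now indexed by (date, family)
long_toy = wide_toy.stack("family")
y_toy = long_toy.pop("sales")  # target series

# Move 'family' from the index into a column and label-encode it
X_toy = long_toy.reset_index("family")
X_toy["family"], families = X_toy["family"].factorize()

# Day of the month as a simple seasonal feature
X_toy["day"] = X_toy.index.day

print(X_toy)
print(list(families))  # position in this list is the integer code
```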

# 3) Create a boosted hybrid

Modify the following code to train XGBoost on the residuals of the trend model and create the final predictions.

In [None]:
# YOUR CODE HERE: Create residuals (the collection of detrended
# series) from the training set.
y_resid = ____

# Train XGBoost on the residuals
xgb = XGBRegressor(random_state=0)
#_UNCOMMENT_IF(PROD)_
#xgb.fit(X_train, y_resid)

# YOUR CODE HERE: Add the predicted residuals onto the predicted trends
y_fit_boosted = ____

# Ensure non-negative
#_UNCOMMENT_IF(PROD)_
#y_fit_boosted = y_fit_boosted.clip(0.0)


# Check your answer
q_3.check()

In [None]:
# Lines below will give you a hint or solution code
#_COMMENT_IF(PROD)_
q_3.hint()
#_COMMENT_IF(PROD)_
q_3.solution()

In [None]:
#%%RM_IF(PROD)%%
# Create residuals (the collection of detrended series) from the training set
y_resid = y_train - y_fit

# Train XGBoost on the residuals
xgb = XGBRegressor(n_estimators=1000, max_depth=10, learning_rate=1e-1)
xgb.fit(X_train, y_resid)

# Add the predicted residuals onto the predicted trends
y_fit_boosted = xgb.predict(X_train) + y_fit
# Ensure non-negative
y_fit_boosted = y_fit_boosted.clip(0.0)

q_3.assert_check_passed()

----------

Winners of Kaggle forecasting competitions have often included moving averages and other rolling statistics in their feature sets. Such features seem to be especially useful when used with GBDT algorithms like XGBoost.

In Lesson 2 you learned how to compute moving averages to estimate trends. Computing rolling statistics to be used as features is similar except we need to take care to avoid lookahead leakage. First, the result should be set at the right end of the window instead of the center -- that is, we should use `center=False` (the default) in the `rolling` method. Second, the target should be lagged a step.
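
To make the leakage concern concrete, here is a small sketch on a made-up series (the names `y_toy`, `safe`, and `leaky` are illustrative). The lagged, right-aligned window only ever sees values from earlier days, while a centered window on the unlagged target includes the current and the next day.

```python
import pandas as pd

y_toy = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    index=pd.period_range("2017-01-01", periods=6, freq="D"),
)

# Lag by one step, then use a right-aligned window (center=False is the default)
safe = y_toy.shift(1).rolling(3).mean()

# Centered window on the unlagged target: each value uses the next day's target
leaky = y_toy.rolling(3, center=True).mean()

print(pd.DataFrame({"y": y_toy, "safe": safe, "leaky": leaky}))
```

On the fourth day, `safe` is the mean of days one to three, so it could have been computed before that day's sales were known. On the second day, `leaky` already uses the third day's value.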

# 4) Feature engineering for XGBoost

Edit the code in the next cell to create the following features:
- 14-day rolling median (`median`) of lagged target
- 7-day rolling standard deviation (`std`) of lagged target
- 7-day sum (`sum`) of items "on promotion", with centered window

In [None]:
y_lag = family_sales.loc[:, 'sales'].shift(1)  # lagged target
onpromo = family_sales.loc[:, 'onpromotion']  # items on promotion

# Statistical features
X_stats = pd.concat({
    # 7-day mean of lagged target
    'mean_7': y_lag.rolling(7).mean(),
    # YOUR CODE HERE: Edit to create the rolling statistic
    # 14-day median of lagged target
#_UNCOMMENT_IF(PROD)_
#    'median_14': ____,
    # 7-day rolling standard deviation of lagged target
#_UNCOMMENT_IF(PROD)_
#    'std_7': ____,
    # 7-day sum of promotions with centered window
#_UNCOMMENT_IF(PROD)_
#    'promo_7': ____,
}, axis=1).stack()


X = family_sales.stack()  # wide to long
y = X.pop('sales')  # grab target series
X = pd.concat([X, X_stats], axis=1)  # join features


# Check your answer
q_5.check()

In [None]:
# Lines below will give you a hint or solution code
#_COMMENT_IF(PROD)_
q_5.hint()
#_COMMENT_IF(PROD)_
q_5.solution()

In [None]:
#%%RM_IF(PROD)%%
y_lag = family_sales.loc[:, 'sales'].shift(1)
onpromo = family_sales.loc[:, 'onpromotion']

X_stats = pd.concat({
    'mean_7': y_lag.rolling(7).mean(),
    'median_14': y_lag.rolling(14).median(),
    'std_7': y_lag.rolling(7).std(),
    'promo_7': onpromo.rolling(7, center=True).sum(),
}, axis=1).stack()


X = family_sales.stack()
y = X.pop('sales')
X = pd.concat([X, X_stats], axis=1)

q_5.assert_check_passed()

Check out the Pandas [`Window` documentation](https://pandas.pydata.org/pandas-docs/stable/reference/window.html) for more statistics you can compute. Also try "exponentially weighted" windows by using `ewm` in place of `rolling`; exponential decay is often a more realistic representation of how effects propagate over time.
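
As a minimal sketch of an exponentially weighted feature (the series and the choice `alpha=0.5` are assumptions made for illustration), the lagged target is smoothed so that each earlier day gets half the weight of the day after it:

```python
import pandas as pd

y_toy = pd.Series([1.0, 2.0, 3.0, 4.0])
y_lag_toy = y_toy.shift(1)  # lag first, to avoid lookahead leakage

# adjust=False gives the simple recursion:
#   ewm[t] = (1 - alpha) * ewm[t-1] + alpha * y_lag[t]
ewm_mean = y_lag_toy.ewm(alpha=0.5, adjust=False).mean()
print(ewm_mean)
```

In practice you might set the decay through `span` or `halflife` instead of `alpha`, which can be easier to relate to a window length.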

----------

# (Optional) Create a stacked hybrid



-------------------------------------------------------------------------------

# Keep Going #