# Introduction #

Run this cell to set everything up!

In [None]:
# Setup feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.time_series.ex3 import *

# Setup notebook
from pathlib import Path
from learntools.time_series.style import *  # plot style settings
from learntools.time_series.utils import plot_periodogram, seasonal_plot

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess


comp_dir = Path('../input/store-sales-time-series-forecasting')

holidays_events = pd.read_csv(
    comp_dir / "holidays_events.csv",
    dtype={
        'type': 'category',
        'locale': 'category',
        'locale_name': 'category',
        'description': 'category',
        'transferred': 'bool',
    },
    parse_dates=['date'],
)
holidays_events = holidays_events.set_index('date').to_period('D')

store_sales = pd.read_csv(
    comp_dir / 'train.csv',
    usecols=['store_nbr', 'family', 'date', 'sales'],
    dtype={
        'store_nbr': 'category',
        'family': 'category',
        'sales': 'float32',
    },
    parse_dates=['date'],
)
store_sales['date'] = store_sales.date.dt.to_period('D')
store_sales = store_sales.set_index(['store_nbr', 'family', 'date']).sort_index()
average_sales = (
    store_sales
    .groupby('date').mean()
    .squeeze()
    .loc['2017']
)

-------------------------------------------------------------------------------

Examine the following seasonal plot:

In [None]:
X = average_sales.to_frame()
X["week"] = X.index.week
X["day"] = X.index.dayofweek
seasonal_plot(X, y='sales', period='week', freq='day');

And also the periodogram:

In [None]:
plot_periodogram(average_sales);
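
For intuition about what a periodogram measures, here is a minimal sketch on synthetic data. It assumes a toy daily series with a built-in 7-day cycle and uses NumPy's FFT directly; the `plot_periodogram` helper from `learntools` may compute and scale its estimate differently.

```python
import numpy as np

# Toy daily series (an assumption for illustration): a 7-day cycle plus noise
n_days = 364
t = np.arange(n_days)
rng = np.random.default_rng(0)
toy_series = 10 + 3 * np.sin(2 * np.pi * t / 7) + rng.normal(scale=0.5, size=n_days)

# Remove the mean so the zero-frequency term does not dominate
centered = toy_series - toy_series.mean()

# Power at each frequency, with frequencies measured in cycles per day
power = np.abs(np.fft.rfft(centered)) ** 2 / n_days
freqs = np.fft.rfftfreq(n_days, d=1.0)

# Skip the zero frequency, then convert the strongest frequency to a period
dominant_freq = freqs[np.argmax(power[1:]) + 1]
dominant_period = 1 / dominant_freq
print(f"Strongest cycle repeats every {dominant_period:.1f} days")
```

A tall peak at a period of about 7 days is the kind of evidence that points to weekly seasonality.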

# 1) Determine seasonality

What kind of seasonality do you see evidence of? Once you've thought about it, run the next cell for some discussion.

In [None]:
# View the solution (Run this cell to receive credit!)
q_1.check()

-------------------------------------------------------------------------------

# 2) Create seasonal features

Use `DeterministicProcess` and `CalendarFourier` to create:
- indicators for weekly seasons and
- Fourier features of order 4 for monthly seasons.
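
For intuition, Fourier features of a given order are pairs of sine and cosine curves that complete 1, 2, ..., `order` cycles per season. The sketch below builds them by hand on a toy index. It assumes the phase is the fractional position of each day within its calendar month; `monthly_fourier_features` is a hypothetical helper written for this illustration and is not part of `statsmodels`, whose `CalendarFourier` may differ in column naming and details.

```python
import numpy as np
import pandas as pd


def monthly_fourier_features(index, order):
    """Hand-rolled sin/cos pairs for a monthly season (illustrative only)."""
    # Fractional position within the month: 0 on the first day, approaching 1 at month end
    position = np.asarray((index.day - 1) / index.days_in_month, dtype=float)
    features = {}
    for k in range(1, order + 1):
        features[f"sin_{k}"] = np.sin(2 * np.pi * k * position)
        features[f"cos_{k}"] = np.cos(2 * np.pi * k * position)
    return pd.DataFrame(features, index=index)


# Toy index (an assumption): 90 consecutive days
toy_index = pd.date_range("2017-01-01", periods=90, freq="D")
toy_features = monthly_fourier_features(toy_index, order=2)
print(toy_features.head())
```

Each additional order adds one sine/cosine pair, letting a linear model fit sharper within-month patterns at the cost of more features.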

In [None]:
y = average_sales.copy()

# YOUR CODE HERE
fourier = ____
dp = DeterministicProcess(
    index=y.index,
    constant=True,
    order=1,
    # YOUR CODE HERE
    # ____
    drop=True,
)
X = ____

# Check your answer
q_2.check()

In [None]:
# Lines below will give you a hint or solution code
#_COMMENT_IF(PROD)_
q_2.hint()
#_COMMENT_IF(PROD)_
q_2.solution()

In [None]:
#%%RM_IF(PROD)%%
y = average_sales.copy()
fourier = CalendarFourier(freq='M', order=12)
dp = DeterministicProcess(
    index=y.index,
    constant=True,
    order=1,
    seasonal=True,
    additional_terms=[fourier],
    drop=True,
)
X = dp.in_sample()

q_2.assert_check_failed()

In [None]:
#%%RM_IF(PROD)%%
y = average_sales.copy()
fourier = CalendarFourier(freq='A', order=4)
dp = DeterministicProcess(
    index=y.index,
    constant=True,
    order=1,
    seasonal=True,
    additional_terms=[fourier],
    drop=True,
)
X = dp.in_sample()

q_2.assert_check_failed()

In [None]:
#%%RM_IF(PROD)%%
y = average_sales.copy()
fourier = CalendarFourier(freq='M', order=4)
dp = DeterministicProcess(
    index=y.index[1:],
    constant=True,
    order=1,
    seasonal=True,
    additional_terms=[fourier],
    drop=True,
)
X = dp.in_sample()

q_2.assert_check_failed()

In [None]:
#%%RM_IF(PROD)%%
y = average_sales.copy()
fourier = CalendarFourier(freq='M', order=4)
dp = DeterministicProcess(
    index=y.index[1:],
    constant=True,
    order=1,
    seasonal=False,
    additional_terms=[fourier],
    drop=True,
)
X = dp.in_sample()

q_2.assert_check_failed()

In [None]:
#%%RM_IF(PROD)%%
y = average_sales.copy()
fourier = CalendarFourier(freq='M', order=4)
dp = DeterministicProcess(
    index=y.index,
    constant=True,
    order=1,
    seasonal=True,
#    additional_terms=[fourier],
    drop=True,
)
X = dp.in_sample()

q_2.assert_check_failed()

In [None]:
#%%RM_IF(PROD)%%
y = average_sales.copy()
fourier = CalendarFourier(freq='M', order=4)
dp = DeterministicProcess(
    index=y.index,
    constant=True,
    order=1,
    seasonal=True,
    additional_terms=[fourier],
    drop=True,
)
X = dp.in_sample()

q_2.assert_check_passed()

Now run this cell to fit the seasonal model.

In [None]:
model = LinearRegression().fit(X, y)
y_pred = pd.Series(
    model.predict(X),
    index=X.index,
    name='Fitted',
)

ax = y.plot(**plot_params, alpha=0.5, title="Average Sales", ylabel="items sold")
ax = y_pred.plot(ax=ax, label="Seasonal")
ax.legend();

-------------------------------------------------------------------------------

Removing the trend or the seasonal component from a series is called **detrending** or **deseasonalizing** the series, respectively.
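
As a minimal sketch of deseasonalizing, assume a synthetic daily series with a fixed day-of-week effect (all names and numbers here are made up for illustration). Fitting one coefficient per weekday by least squares and subtracting the fitted values leaves only the noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 140
day_of_week = np.arange(n_days) % 7

# Hypothetical weekly pattern added to a constant level, plus noise
weekly_effect = np.array([0.0, 1.0, 2.0, 3.0, 5.0, 8.0, 6.0])
toy_sales = 100 + weekly_effect[day_of_week] + rng.normal(scale=0.3, size=n_days)

# One indicator column per day of the week
design = np.eye(7)[day_of_week]

# Least-squares fit of the seasonal component
coef, *_ = np.linalg.lstsq(design, toy_sales, rcond=None)
toy_seasonal = design @ coef

# Deseasonalized series: what remains after subtracting the fitted seasons
toy_deseason = toy_sales - toy_seasonal
print(f"std before: {toy_sales.std():.3f}, std after: {toy_deseason.std():.3f}")
```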

Look at the periodogram of the deseasonalized series.

In [None]:
y_deseason = y - y_pred

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, sharey=True, figsize=(10, 7))
ax1 = plot_periodogram(y, ax=ax1)
ax1.set_title("Product Sales Frequency Components")
ax2 = plot_periodogram(y_deseason, ax=ax2);
ax2.set_title("Deseasonalized");

# 3) Check for remaining seasonality

Based on these periodograms, how effectively does it appear your model captured the seasonality in *Average Sales*? Does the periodogram agree with the time plot of the deseasonalized series?

In [None]:
# View the solution (Run this cell to receive credit!)
q_3.check()

-------------------------------------------------------------------------------

The *Store Sales* dataset includes a table of Ecuadorian holidays.

In [None]:
# National and regional holidays in the training set
holidays = (
    holidays_events
    .query("locale in ['National', 'Regional']")
    .loc['2017':'2017-08-15', ['description']]
    .assign(description=lambda x: x.description.cat.remove_unused_categories())
)

display(holidays)

From a plot of the deseasonalized *Average Sales*, it appears these holidays could have some predictive power.

In [None]:
ax = y_deseason.plot(**plot_params)
plt.plot_date(holidays.index, y_deseason[holidays.index], color='C3')
ax.set_title('National and Regional Holidays');

# 4) Create holiday features

What kind of features could you create to help your model make use of this information? Code your answer in the next cell. (Scikit-learn and Pandas both have utilities that should make this easy. See the `hint` if you'd like more details.)


In [None]:
# YOUR CODE HERE
#_UNCOMMENT_IF(PROD)_
#____

# Check your answer
q_4.check()

In [None]:
# Lines below will give you a hint or solution code
#_COMMENT_IF(PROD)_
q_4.hint()
#_COMMENT_IF(PROD)_
q_4.hint(2)
#_COMMENT_IF(PROD)_
q_4.solution()

In [None]:
#%%RM_IF(PROD)%%
# Scikit-learn solution
from sklearn.preprocessing import OneHotEncoder

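# `sparse_output` replaced `sparse` in scikit-learn 1.2; use `sparse=False` on older versions
ohe = OneHotEncoder(sparse_output=False)

X_holidays = pd.DataFrame(
    ohe.fit_transform(holidays),
    index=holidays.index,
    # Use the encoder's own category order so labels match the encoded columns
    columns=ohe.categories_[0],
)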

X2 = X.join(X_holidays, on='date').fillna(0.0)

q_4.assert_check_passed()

In [None]:
#%%RM_IF(PROD)%%
# Pandas solution
X_holidays = pd.get_dummies(holidays)

X2 = X.join(X_holidays, on='date').fillna(0.0)

q_4.assert_check_passed()
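
To see what the join and `fillna` steps produce, here is a small sketch on made-up data. The dates, holiday names, and `toy_` variables are assumptions for illustration, and `dtype=float` is a choice made here so that the indicator columns stay numeric after the join.

```python
import pandas as pd

# Toy feature matrix indexed by five consecutive days
toy_dates = pd.period_range("2017-01-01", periods=5, freq="D", name="date")
toy_X = pd.DataFrame({"const": 1.0}, index=toy_dates)

# Two made-up holidays that fall inside that range
toy_holidays = pd.DataFrame(
    {"description": ["New Year", "Carnival"]},
    index=pd.PeriodIndex(["2017-01-01", "2017-01-04"], freq="D", name="date"),
)

# One indicator column per holiday description
toy_dummies = pd.get_dummies(toy_holidays, dtype=float)

# Days without a holiday get NaN from the join, then 0.0 from fillna
toy_X2 = toy_X.join(toy_dummies).fillna(0.0)
print(toy_X2)
```

Each holiday gets its own column, so a linear model can learn a separate effect for each one.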

Use this cell to fit the seasonal model with holiday features added. Do the fitted values seem to have improved?

In [None]:
model = LinearRegression().fit(X2, y)
y_pred = pd.Series(
    model.predict(X2),
    index=X2.index,
    name='Fitted',
)

ax = y.plot(**plot_params, alpha=0.5, title="Average Sales", ylabel="items sold")
ax = y_pred.plot(ax=ax, label="Seasonal")
ax.legend();

-------------------------------------------------------------------------------

# (Optional) Understand log transforms and the RMSLE metric

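Sometimes a logarithmic transform is effective at stabilizing the variation in a series. When the components of a series combine multiplicatively, taking logarithms turns the product into a sum, which is the form a linear model can fit: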

```
log(trend * seasons * error) = log(trend) + log(seasons) + log(error)
```
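
The identity can be checked numerically. This is a minimal sketch on a synthetic multiplicative series; the growth rate, seasonal amplitude, and noise level are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(1, 366)

# Synthetic multiplicative components (all strictly positive)
trend = 100 * 1.005 ** t
seasons = 1 + 0.3 * np.sin(2 * np.pi * t / 7)
error = np.exp(rng.normal(scale=0.02, size=t.size))

toy_series = trend * seasons * error
log_series = np.log(toy_series)

# The log of the product equals the sum of the logs
additive = np.log(trend) + np.log(seasons) + np.log(error)
print(np.allclose(log_series, additive))

# The swing of the raw series grows with the trend; the swing of the log series does not
raw_first, raw_last = np.ptp(toy_series[:28]), np.ptp(toy_series[-28:])
log_first, log_last = np.ptp(log_series[:28]), np.ptp(log_series[-28:])
print(f"raw swing: {raw_first:.1f} -> {raw_last:.1f}")
print(f"log swing: {log_first:.2f} -> {log_last:.2f}")
```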

The next cell plots average sales before and after a log transform. Notice how the variation in the transformed series, instead of increasing over time, appears to be almost constant from start to finish.

In [None]:
import numpy as np

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(10, 7))
average_sales.plot(ax=ax1)
ax1.set_title("Sales")
ax2 = np.log1p(average_sales).plot(ax=ax2)
ax2.set_title("Log Sales");

Log-transforming the target can lead to better performance, especially when the metric is based on log-error, as RMSLE (root mean squared logarithmic error) is. In fact, if you train your model on a log-transformed target, the errors it produces are log-errors, so computing the RMSE on `log(y + 1)` is equivalent to computing the RMSLE on `y`.
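
The equivalence can be verified directly. This is a minimal sketch with hand-written metric functions and made-up numbers; it assumes the `log(1 + y)` form of the log-error.

```python
import numpy as np


def rmse(a, b):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))


def rmsle(y_true, y_hat):
    """Root mean squared logarithmic error, using log(1 + y)."""
    return float(np.sqrt(np.mean((np.log1p(y_hat) - np.log1p(y_true)) ** 2)))


# Made-up targets and predictions (non-negative, as log1p requires here)
toy_true = np.array([10.0, 200.0, 35.0, 0.0, 480.0])
toy_hat = np.array([12.0, 180.0, 30.0, 1.0, 500.0])

rmsle_on_raw = rmsle(toy_true, toy_hat)
rmse_on_log = rmse(np.log1p(toy_true), np.log1p(toy_hat))
print(np.isclose(rmsle_on_raw, rmse_on_log))
```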

The next cell illustrates how a log-transform can improve RMSLE predictive performance.

In [None]:
from sklearn.metrics import mean_squared_error, mean_squared_log_error

y = average_sales.copy()

model = LinearRegression(fit_intercept=False).fit(X, y)
rmsle = mean_squared_log_error(y, model.predict(X)) ** 0.5

model = LinearRegression(fit_intercept=False).fit(X, np.log1p(y))
rmse = mean_squared_error(np.log1p(y), model.predict(X)) ** 0.5  # Equivalent to RMSLE


print(f'RMSLE when fit to y: {rmsle:.5f}')
print(f'RMSE when fit to log(y+1) (equivalent to RMSLE): {rmse:.5f}')

For this course, we'll stick with the untransformed `y` for simplicity, but you may want to consider log-transforming for your own models.

# Keep Going #