**This notebook is an exercise in the [Time Series](https://www.kaggle.com/learn/time-series) course.  You can reference the tutorial at [this link](https://www.kaggle.com/ryanholbrook/seasonality).**

---


# Introduction #

Run this cell to set everything up!

In [None]:
# Setup feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.time_series.ex3 import *

# Setup notebook
from pathlib import Path
from learntools.time_series.style import *  # plot style settings
from learntools.time_series.utils import plot_periodogram, seasonal_plot

import time
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso, ElasticNet, Ridge

from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess
from sklearn.metrics import mean_squared_log_error

comp_dir = Path('../input/store-sales-time-series-forecasting')

holidays_events = pd.read_csv(
    comp_dir / "holidays_events.csv",
    dtype={
        'type': 'category',
        'locale': 'category',
        'locale_name': 'category',
        'description': 'category',
        'transferred': 'bool',
    },
    parse_dates=['date'],
    infer_datetime_format=True,
)
holidays_events = holidays_events.set_index('date').to_period('D')

store_sales = pd.read_csv(
    comp_dir / 'train.csv',
    usecols=['store_nbr', 'family', 'date', 'sales'],
    dtype={
        'store_nbr': 'category',
        'family': 'category',
        'sales': 'float32',
    },
    parse_dates=['date'],
    infer_datetime_format=True,
)
store_sales['date'] = store_sales.date.dt.to_period('D')
store_sales = store_sales.set_index(['store_nbr', 'family', 'date']).sort_index()

average_sales = (
    store_sales
    .groupby('date').mean()
    .squeeze()
    .loc['2017']
)

In [None]:
average_sales.head()

-------------------------------------------------------------------------------

Examine the following seasonal plot:

In [None]:
X = average_sales.to_frame()
X["week"] = X.index.week
X["day"] = X.index.dayofweek
seasonal_plot(X, y='sales', period='week', freq='day');

And also the periodogram:

In [None]:
plot_periodogram(average_sales);

# 1) Determine seasonality

What kind of seasonality do you see evidence of? Once you've thought about it, run the next cell for some discussion.
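
To make the periodogram idea concrete, here is a minimal sketch on synthetic data. It is not the course's `plot_periodogram` helper; it assumes a plain FFT-based periodogram and a made-up series with a weekly cycle (`series`, `n_days`, and the noise level are illustrative choices).

```python
import numpy as np

# Synthetic daily series: 52 weeks with a weekly cycle plus a little noise.
rng = np.random.default_rng(0)
n_days = 7 * 52
t = np.arange(n_days)
series = 10 + 3 * np.sin(2 * np.pi * t / 7) + rng.normal(scale=0.5, size=n_days)

# Periodogram: squared magnitude of the FFT of the demeaned series.
spectrum = np.abs(np.fft.rfft(series - series.mean())) ** 2
freqs = np.fft.rfftfreq(n_days, d=1.0)  # frequencies in cycles per day

# Skip the zero frequency and convert the strongest frequency to a period.
peak = np.argmax(spectrum[1:]) + 1
dominant_period = 1.0 / freqs[peak]
print(f"Dominant period: {dominant_period:.1f} days")
```

A strong peak at one frequency suggests a season of the matching length; a real series usually shows several peaks of different heights.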

In [None]:
# View the solution (Run this cell to receive credit!)
q_1.check()

-------------------------------------------------------------------------------

# 2) Create seasonal features

Use `DeterministicProcess` and `CalendarFourier` to create:
- indicators for weekly seasons and
- Fourier features of order 4 for monthly seasons.
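
As a rough illustration of what these two kinds of features look like, the sketch below builds them by hand with NumPy and pandas. It is not how `DeterministicProcess` or `CalendarFourier` are implemented: the helper `fourier_pairs` is hypothetical, and it assumes a fixed 30-day period, whereas `CalendarFourier` works from calendar positions.

```python
import numpy as np
import pandas as pd


def fourier_pairs(index, period, order):
    """Return sin/cos pairs up to `order` for a fixed `period` (in steps)."""
    t = np.arange(len(index))
    features = {}
    for k in range(1, order + 1):
        angle = 2 * np.pi * k * t / period
        features[f"sin_{k}"] = np.sin(angle)
        features[f"cos_{k}"] = np.cos(angle)
    return pd.DataFrame(features, index=index)


index = pd.period_range("2017-01-01", periods=60, freq="D")

# Fourier features: 4 sin/cos pairs, so 8 smooth columns for a "monthly" season.
X_fourier = fourier_pairs(index, period=30, order=4)

# Seasonal indicators: one 0/1 column per weekday, dropping the first
# to avoid duplicating a constant column.
weekday = pd.Series(index.dayofweek, index=index)
X_weekday = pd.get_dummies(weekday, prefix="dow", drop_first=True, dtype=float)

print(X_fourier.shape, X_weekday.shape)
```

Indicators suit short seasons with few observations per cycle, while Fourier pairs describe a longer season with far fewer columns than one indicator per day.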

In [None]:
y = average_sales.copy()

# YOUR CODE HERE
fourier = CalendarFourier(freq='M', order=4)
dp = DeterministicProcess(
    index=y.index,
    constant=True,
    order=1,
    # YOUR CODE HERE
    seasonal=True,
    additional_terms=[fourier],
    drop=True,
)
X = dp.in_sample()

# Check your answer
q_2.check()

In [None]:
# Lines below will give you a hint or solution code
#q_2.hint()
#q_2.solution()

Now run this cell to fit the seasonal model.

In [None]:
model = LinearRegression().fit(X, y)

y_pred = pd.Series(
    model.predict(X),
    index=X.index,
    name='Fitted',
)

ax = y.plot(**plot_params, alpha=0.5, title="Average Sales", ylabel="items sold")
ax = y_pred.plot(ax=ax, label="Seasonal")
ax.legend();

-------------------------------------------------------------------------------


Removing the trend or seasonality from a series is called **detrending** or **deseasonalizing** the series.
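
A minimal sketch of deseasonalizing, on synthetic data rather than *Average Sales*: fit weekday indicators by ordinary least squares and subtract the fitted values. All names here (`y_demo`, `weekly_effect`, and so on) are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_weeks = 20
day_of_week = np.tile(np.arange(7), n_weeks)
weekly_effect = np.array([0.0, 1.0, 2.0, 1.5, 3.0, 6.0, 5.0])
y_demo = 100 + weekly_effect[day_of_week] + rng.normal(scale=0.3, size=7 * n_weeks)

# One indicator column per weekday; no separate constant is needed
# because the seven columns already sum to one.
X_demo = np.eye(7)[day_of_week]

# Ordinary least squares fit of the seasonal component.
coef, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
y_seasonal = X_demo @ coef

# Deseasonalized series: what is left after removing the fitted season.
y_deseason_demo = y_demo - y_seasonal

weekday_means = np.array(
    [y_deseason_demo[day_of_week == d].mean() for d in range(7)]
)
print(np.round(weekday_means, 6))
```

After the subtraction, every weekday averages to zero in the residuals, which is why a weekly peak should disappear from the periodogram of a well-deseasonalized series.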

Look at the periodogram of the deseasonalized series.

In [None]:
y_deseason = y - y_pred

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, sharey=True, figsize=(10, 7))
ax1 = plot_periodogram(y, ax=ax1)
ax1.set_title("Product Sales Frequency Components")
ax2 = plot_periodogram(y_deseason, ax=ax2);
ax2.set_title("Deseasonalized");

# 3) Check for remaining seasonality

Based on these periodograms, how effectively does it appear your model captured the seasonality in *Average Sales*? Does the periodogram agree with the time plot of the deseasonalized series?

In [None]:
# View the solution (Run this cell to receive credit!)
q_3.check()

-------------------------------------------------------------------------------

The *Store Sales* dataset includes a table of Ecuadorian holidays.

In [None]:
# National and regional holidays in the training set
holidays = (
    holidays_events
    .query("locale in ['National', 'Regional']")
    .loc['2017':'2017-08-15', ['description']]
    .assign(description=lambda x: x.description.cat.remove_unused_categories())
)

display(holidays)

From a plot of the deseasonalized *Average Sales*, it appears these holidays could have some predictive power.

In [None]:
ax = y_deseason.plot(**plot_params)
plt.plot_date(holidays.index, y_deseason[holidays.index], color='C3')
ax.set_title('National and Regional Holidays');

# 4) Create holiday features

What kind of features could you create to help your model make use of this information? Code your answer in the next cell. (Scikit-learn and Pandas both have utilities that should make this easy. See the `hint` if you'd like more details.)
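
To show the general shape of such features without giving away the exercise, here is a toy sketch with two made-up holidays. The frames `X_toy` and `holidays_toy` are hypothetical stand-ins, not the notebook's data.

```python
import pandas as pd

# Hypothetical toy data: a daily feature frame and two made-up holidays.
dates = pd.period_range("2017-01-01", periods=5, freq="D", name="date")
X_toy = pd.DataFrame({"trend": range(1, 6)}, index=dates)
holidays_toy = pd.DataFrame(
    {"description": ["Holiday A", "Holiday B"]},
    index=pd.PeriodIndex(["2017-01-01", "2017-01-03"], freq="D", name="date"),
)

# One indicator column per holiday, then align to the full date range;
# days without a holiday get 0.0 in every indicator column.
X_hol = pd.get_dummies(holidays_toy, dtype=float)
X_joined = X_toy.join(X_hol).fillna(0.0)
print(X_joined)
```

Each holiday gets its own column, so a linear model can learn a separate effect for each one.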


In [None]:
# YOUR CODE HERE
# Pandas solution
X_holidays = pd.get_dummies(holidays)

# Join to training data
X2 = X.join(X_holidays, on='date').fillna(0.0)


# Check your answer
q_4.check()

In [None]:
# Scikit-learn solution
from sklearn.preprocessing import OneHotEncoder

# Note: `sparse` was renamed to `sparse_output` in scikit-learn 1.2
ohe = OneHotEncoder(sparse=False)

X_holidays = pd.DataFrame(
    ohe.fit_transform(holidays),
    index=holidays.index,
    columns=ohe.categories_[0],  # encoder's own (sorted) category order
)

In [None]:
# Lines below will give you a hint or solution code
#q_4.hint()
#q_4.hint(2)
#q_4.solution()

Use this cell to fit the seasonal model with holiday features added. Do the fitted values seem to have improved?

In [None]:
model = LinearRegression().fit(X2, y)

y_pred = pd.Series(
    model.predict(X2),
    index=X2.index,
    name='Fitted',
)

ax = y.plot(**plot_params, alpha=0.5, title="Average Sales", ylabel="items sold")
ax = y_pred.plot(ax=ax, label="Seasonal")
ax.legend();

-------------------------------------------------------------------------------

# (Optional) Submit to Store Sales competition

This part of the exercise will walk you through your first submission to this course's companion competition: [**Store Sales - Time Series Forecasting**](https://www.kaggle.com/c/29781). Submitting to the competition isn't required to complete the course, but it's a great way to try out your new skills.

The next cell creates a seasonal model of the kind you've learned about in this lesson for the full *Store Sales* dataset with all 1782 time series (54 stores × 33 product families).

### Useful hints and ideas from other Kagglers
https://www.kaggle.com/xholisilemantshongo/modeling-sales-3-types-of-regression/notebook

In [None]:
store_sales.shape

In [None]:
store_sales.head(10)

In [None]:
store_sales.index[:3]

In [None]:
#y = store_sales.unstack(['store_nbr', 'family']).loc["2017"] #V1,2 X: 227 rows × 17 columns ---> y: 227 rows × 1782 columns
#y = store_sales.unstack(['store_nbr', 'family'])                #V7 X: 1684 rows × 17 columns ---> y:1684 rows × 1782 columns
#y = store_sales.unstack(['store_nbr', 'family']).loc["2016":]     #V8 X: 592 rows × 17 columns ---> y:592 rows × 1782 columns

#VX: hold out the last 16 days of the training set for validation and compare models there, since we only get 4 Kaggle submissions; pick the best, retrain on training + validation, then submit to Kaggle

#last 16 days of the training set
train_start_date='2017-04-01'
#train_end_date='2017-07-31'
valid_end_date='2017-08-15'

#y = store_sales.unstack(['store_nbr', 'family']).loc[train_start_date:valid_end_date] #V9
#y = store_sales.unstack(['store_nbr', 'family']).loc["2017"] #V9B
y = store_sales.unstack(['store_nbr', 'family']).loc[train_start_date:valid_end_date] #V9C

#y_valid =  store_sales.unstack(['store_nbr', 'family']).loc[train_end_date:]

# Create training data
#fourier = CalendarFourier(freq='M', order=4) #V1
fourier = CalendarFourier(freq='W', order=4)  #V9

dp = DeterministicProcess(
    index=y.index,
    #constant=True,  #V1
    constant=False,  #V9
    order=1,
    #seasonal=True,   #V1
    seasonal=False,   #V9 use fourier instead of one-hot encoded days of the week
    additional_terms=[fourier],
    drop=True,
)

X = dp.in_sample()
X['NewYear'] = (X.index.dayofyear == 1)
X['NewYear'] = X['NewYear'].astype('category') #V1 .astype('category')
print(X.shape, y.shape)
#print(X.dtypes, y.dtypes)
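
The comments in the cell above mention holding out the last 16 days of the training set for validation. A minimal sketch of such a time-ordered split follows; it assumes a wide frame indexed by date, and `y_all` is a hypothetical stand-in for the unstacked sales.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the unstacked sales frame: 60 days, 3 series.
idx = pd.period_range("2017-06-17", "2017-08-15", freq="D", name="date")
y_all = pd.DataFrame(
    np.arange(len(idx) * 3, dtype=float).reshape(len(idx), 3),
    index=idx,
    columns=["a", "b", "c"],
)

n_valid = 16  # same length as the 16-step forecast used for the submission

# Split by position, never by random sampling, so validation stays in the future.
y_train, y_valid = y_all.iloc[:-n_valid], y_all.iloc[-n_valid:]

print(y_train.index.max(), y_valid.index.min(), len(y_valid))
```

Keeping the validation block strictly after the training block mimics the real forecasting task, so the validation score is a fairer guide to the leaderboard score.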

In [None]:
#average store_sales
store_sales.groupby('date').mean().squeeze().plot();

In [None]:
store_sales.groupby('date').mean().squeeze().loc[train_start_date:valid_end_date].plot();

In [None]:
X

In [None]:
y

In [None]:
%%time

#model = LinearRegression(fit_intercept=False) #V1_LR
#model = Lasso(alpha=0.1)                      #V1_Lasso, Linear with L1 weight regularization
#model = ElasticNet(alpha=0.1, l1_ratio=0.5)  #V1,9_ElasticNet, Linear with L1 + L2 weight regularization
#model = Ridge(alpha=1, solver="cholesky")    #V1_Ridge, Linear with L2 weight regularization
# V9CRidge_<alpha>: tried alpha 0.1, 0.3, 0.5, 0.6, 0.7 and 0.9 (`normalize` was removed in scikit-learn 1.2)
model = Ridge(fit_intercept=True, solver='auto', alpha=0.9, normalize=True)

model.fit(X, y)

In [None]:
y_pred = pd.DataFrame(model.predict(X), index=X.index, columns=y.columns)
y_pred

### We are training 1782 different linear models (one per store/family series) on these time-based features (X) 😮

In [None]:
model.coef_.shape 
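
The claim that one fit amounts to many separate models can be checked on a small example. The sketch below uses plain least squares in NumPy on random data (an assumption made for brevity, not the regularized model fitted above) and compares a joint fit against column-by-column fits.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))  # shared time-based features
Y_demo = rng.normal(size=(50, 3))  # three target series

# One call fits all targets at once...
coef_joint, *_ = np.linalg.lstsq(X_demo, Y_demo, rcond=None)

# ...and matches fitting each target column on its own.
coef_separate = np.column_stack(
    [np.linalg.lstsq(X_demo, Y_demo[:, j], rcond=None)[0] for j in range(3)]
)

print(coef_joint.shape)
print(np.allclose(coef_joint, coef_separate))
```

Here the coefficients come back as (n_features, n_targets); scikit-learn stores `coef_` the other way round, as (n_targets, n_features).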

### Calculate mean_squared_log_error on the training set
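
For reference, a hand-written sketch of the metric on made-up numbers. The helper `msle` is hypothetical and mirrors the usual definition; scikit-learn's `mean_squared_log_error` raises an error on negative values, which is why predictions are clipped at zero first.

```python
import numpy as np


def msle(y_true, y_pred):
    """Mean squared logarithmic error; inputs must be non-negative."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))


y_true = np.array([0.0, 10.0, 100.0])
y_raw = np.array([-2.0, 12.0, 90.0])  # a linear model can predict below zero
y_clipped = y_raw.clip(0.0)           # clip before taking logs

error = msle(y_true, y_clipped)
print(f"MSLE:  {error:.4f}")
print(f"RMSLE: {np.sqrt(error):.4f}")
```

The square root of this value is the RMSLE, the form tracked below as `kaggle_rmsle`.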

In [None]:
y_pred_metrics   = y_pred.stack(['store_nbr', 'family']).reset_index().copy()
y_target = y.stack(['store_nbr', 'family']).reset_index().copy()
y_target['sales_pred'] = y_pred_metrics['sales'].clip(0.) 

In [None]:
y_pred_metrics

In [None]:
y_target

In [None]:
def generate_df_msle_info(y_target, version, X, y, model, kaggle_rmsle):
    
    #Calculate msle on training set
    msle = y_target.groupby('family').apply(lambda r: mean_squared_log_error(r['sales'], r['sales_pred']))
    df_msle = pd.DataFrame(msle)
    df_msle.rename({0:'train_msle'}, axis=1, inplace=True)

    #Detailed information on features X, target y, model used and performance on training set and kaggle test set
    info = {'version': version,
    'X.index.min': X.index.min(),
    'X.index.max': X.index.max(),
    'y.index.min': y.index.min(),
    'y.index.max': y.index.max(),
    'X.shape': str(X.shape),
    'y.shape': str(y.shape),
    'models': model,
    'train_msle_mean': df_msle.mean().values[0],
    'kaggle_rmsle': kaggle_rmsle
    }
    
    #Pivot long df to wide df
    df_info = pd.DataFrame.from_dict(info, orient='index').transpose()
    #Repeat single df row len(df_msle) times
    df_info = df_info.loc[df_info.index.repeat(len(df_msle))].reset_index(drop=True)
    #family set to df_msle.index
    df_info['family'] = df_msle.index
    
    #Join df_msle and df_info on family
    df_msle_info = pd.merge(df_msle, df_info, left_index=True, right_on='family')
    df_msle_info = df_msle_info[['version', 'X.index.min', 'X.index.max', 'y.index.min',
       'y.index.max', 'X.shape', 'y.shape', 'models', 'train_msle_mean',
       'kaggle_rmsle', 'family', 'train_msle']]

    return df_msle_info

In [None]:
df_msle_info.tail()

### Read stored df_msle_info dataframe

In [None]:
my_models_path = '../input/models/'

#Read from my Kaggle models dataset
df_msle_info = pd.read_pickle(my_models_path+'df_msle_info.pkl')
df_msle_info.tail(3)

#Store locally
#df_msle_info.to_pickle('df_msle_info.pkl')

In [None]:
#df_msle_info_new = generate_df_msle_info(y_target, 'V1_LR', X, y, str(model), 0.51090)
#df_msle_info = df_msle_info_new.copy()

#df_msle_info_new = generate_df_msle_info(y_target, 'V1_Lasso', X, y, str(model), 0.50770)
#df_msle_info_new = generate_df_msle_info(y_target, 'V1_ElasticNet', X, y, str(model), 0.50519)
#df_msle_info_new = generate_df_msle_info(y_target, 'V9_ElasticNet', X, y, str(model), 0.45707)
#df_msle_info_new = generate_df_msle_info(y_target, 'V9B_ElasticNet', X, y, str(model), -1)
#df_msle_info_new = generate_df_msle_info(y_target, 'V9CRidge_0.1', X, y, str(model), -1)
#df_msle_info_new = generate_df_msle_info(y_target, 'V9CRidge_0.3', X, y, str(model), -1)
#df_msle_info_new = generate_df_msle_info(y_target, 'V9CRidge_0.5', X, y, str(model), -1)
#df_msle_info_new = generate_df_msle_info(y_target, 'V9CRidge_0.6', X, y, str(model), -1)
#df_msle_info_new = generate_df_msle_info(y_target, 'V9CRidge_0.7', X, y, str(model), -1)
#df_msle_info_new = generate_df_msle_info(y_target, 'V9CRidge_0.9', X, y, str(model), -1)

df_msle_info_new.head(3)

In [None]:
df_msle_info_new.head(3)

In [None]:
df_msle_info.columns

In [None]:
print(df_msle_info.shape)
# DataFrame.append was removed in pandas 2.0; pd.concat works in all versions
df_msle_info = pd.concat([df_msle_info, df_msle_info_new], ignore_index=True)
print(df_msle_info.shape)

In [None]:
df_msle_info.tail(3)

In [None]:
#Helpful plot method for msle on the training set per family item
def plot_df_msle_info(df_msle_info, version_list, figsize=(16, 9)):
    plt.rcParams["figure.figsize"] = figsize
    #df_msle_info.plot(kind='bar',x='family',y='train_msle');
    ax = sns.barplot(x='family', y='train_msle', hue='version', data=df_msle_info[df_msle_info['version'].isin(version_list)])
    for item in ax.get_xticklabels():
        item.set_rotation(90)

In [None]:
version_list = df_msle_info.version.unique()
plot_df_msle_info(df_msle_info, version_list)

In [None]:
version_list = [v for v in df_msle_info.version.unique() if 'V9' in v and v != 'V9B_ElasticNet']
plot_df_msle_info(df_msle_info, version_list)

In [None]:
family_list = ['BEVERAGES', 'BREAD/BAKERY', 'GROCERY I', 'GROCERY II', 'LIQUOR,WINE,BEER', 'SCHOOL AND OFFICE SUPPLIES']
df = df_msle_info[df_msle_info['family'].isin(family_list)].copy()
df.family = df.family.astype(str)
version_list = df_msle_info.version.unique()
plot_df_msle_info(df, version_list)

In [None]:
version_list = [v for v in df_msle_info.version.unique() if 'V9' in v and v != 'V9B_ElasticNet']
plot_df_msle_info(df, version_list)

In [None]:
#df_msle_info.loc[df_msle_info['version'] == 'V1_Lasso', 'kaggle_rmsle'] = 0.50770
#df_msle_info.loc[df_msle_info['version'] == 'V1_ElasticNet', 'kaggle_rmsle'] = 0.50519
#df_msle_info.loc[df_msle_info['version'] == 'V1_Ridge', 'kaggle_rmsle'] = 0.50920
#df_msle_info.loc[df_msle_info['version'] == 'V9_ElasticNet', 'kaggle_rmsle'] = 0.45707
#df_msle_info.loc[df_msle_info['version'] == 'V9CRidge_0.1', 'kaggle_rmsle'] = 0.45306
#df_msle_info.loc[df_msle_info['version'] == 'V9CRidge_0.5', 'kaggle_rmsle'] = 0.44314
#df_msle_info.loc[df_msle_info['version'] == 'V9CRidge_0.6', 'kaggle_rmsle'] = 0.44359
#df_msle_info.loc[df_msle_info['version'] == 'V9CRidge_0.7', 'kaggle_rmsle'] = 0.44435
#df_msle_info.loc[df_msle_info['version'] == 'V9CRidge_0.9', 'kaggle_rmsle'] = 0.44616

#df_msle_info.loc[df_msle_info['version'] == '', 'kaggle_rmsle'] =

In [None]:
groupedvalues = df_msle_info[~df_msle_info.version.isin(['V9B_ElasticNet', 'V9CRidge_0.3'])].groupby('version').min().reset_index().sort_values(by=['kaggle_rmsle'], ascending=True)
groupedvalues.head()

In [None]:
print(plt.style.available)
plt.style.use('default')

In [None]:
#Vertical bar chart
#ax = groupedvalues.plot(kind='bar',x='version',y='kaggle_rmsle', figsize=(16, 9));
#for p in ax.patches:
#    ax.annotate(str(p.get_height()), (p.get_x() * 1.005, p.get_height() * 1.005))

#Horizontal bar chart    
ax = groupedvalues.plot(kind='barh',x='version',y='kaggle_rmsle', figsize=(16, 9));
for p in ax.patches:
    ax.annotate(str(p.get_width()), (p.get_x() + p.get_width(), p.get_y()), xytext=(5, 10), textcoords='offset points')

### Store df_msle_info as a pickle object and re-import later

In [None]:
df_msle_info.to_pickle("./df_msle_info.pkl")
#df_msle_info = pd.read_pickle("./df_msle_info.pkl")

You can use this cell to see some of its predictions.
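
The selection used in the following cells relies on the wide layout produced by `unstack`. A toy sketch of that layout, with made-up stores, families, and sales values:

```python
import pandas as pd

# Hypothetical long-format sales, indexed like `store_sales` in this notebook.
dates = pd.period_range("2017-01-01", periods=3, freq="D", name="date")
long_index = pd.MultiIndex.from_product(
    [["1", "2"], ["AUTOMOTIVE", "BEVERAGES"], dates],
    names=["store_nbr", "family", "date"],
)
sales_long = pd.DataFrame({"sales": range(len(long_index))}, index=long_index)

# Wide format: one row per date, one column per (sales, store_nbr, family).
sales_wide = sales_long.unstack(["store_nbr", "family"])

# Select a single series by its full column key.
one_series = sales_wide.loc(axis=1)["sales", "1", "BEVERAGES"]
print(one_series)
```

Because the predictions are built with the same columns as `y`, the same key picks out the matching fitted series.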


In [None]:
STORE_NBR = '1'  # 1 - 54
FAMILY = 'BEVERAGES'
# List the product families
display(store_sales.index.get_level_values('family').unique())

ax = y.loc(axis=1)['sales', STORE_NBR, FAMILY].plot(**plot_params)
ax = y_pred.loc(axis=1)['sales', STORE_NBR, FAMILY].plot(ax=ax)
ax.set_title(f'{FAMILY} Sales at Store {STORE_NBR}');

In [None]:
STORE_NBR = '2'  # 1 - 54
FAMILY = 'BEVERAGES'
# Uncomment to see a list of product families
#display(store_sales.index.get_level_values('family').unique())

ax = y.loc(axis=1)['sales', STORE_NBR, FAMILY].plot(**plot_params)
ax = y_pred.loc(axis=1)['sales', STORE_NBR, FAMILY].plot(ax=ax)
ax.set_title(f'{FAMILY} Sales at Store {STORE_NBR}');

In [None]:
STORE_NBR = '1'  # 1 - 54
FAMILY = 'AUTOMOTIVE'
# Uncomment to see a list of product families
#display(store_sales.index.get_level_values('family').unique())

ax = y.loc(axis=1)['sales', STORE_NBR, FAMILY].plot(**plot_params)
ax = y_pred.loc(axis=1)['sales', STORE_NBR, FAMILY].plot(ax=ax)
ax.set_title(f'{FAMILY} Sales at Store {STORE_NBR}');

### V1,2:  X.shape: 227x17 ---> y.shape: 227x1782   LinearRegression(fit_intercept=False)
### V3...: X.shape: 404514x19 ---> y.shape: 404514x1 LabelEncoder family, store_nbr + LinearRegression(fit_intercept=False)
### V4...: X.shape: 404514x19 ---> y.shape: 404514x1 LabelEncoder family, store_nbr + XGBRegressor()
### V5...: X.shape: 404514x105 ---> y.shape: 404514x1 OneHotEncoder family, store_nbr + LinearRegression(fit_intercept=False)
### V6...: X.shape: 404514x105 ---> y.shape: 404514x1 OneHotEncoder family, store_nbr + XGBRegressor()
### V7 same as V1,2 but 2013+ data: X: 1684x17 ---> y:1684x1782   LinearRegression(fit_intercept=False)
### V8 same as V1,2 but 2016+ data: X:  592x17 ---> y:592x1782   LinearRegression(fit_intercept=False)

Finally, this cell loads the test data, creates a feature set for the forecast period, and then creates the submission file `submission.csv`.
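
The key point when building features for the forecast period is that they must continue where the training features stopped. A hand-rolled sketch of that idea for a trend column follows; the dates and names are illustrative, and the notebook itself relies on `dp.out_of_sample` for this.

```python
import numpy as np
import pandas as pd

train_index = pd.period_range("2017-08-01", "2017-08-15", freq="D", name="date")
horizon = 16

# The forecast index starts the day after the last training date.
test_index = pd.period_range(
    train_index[-1] + 1, periods=horizon, freq="D", name="date"
)

# The trend counter keeps counting from where the training features stopped.
trend_train = np.arange(1, len(train_index) + 1)
trend_test = np.arange(len(train_index) + 1, len(train_index) + horizon + 1)

X_future = pd.DataFrame({"trend": trend_test}, index=test_index)
print(X_future.head(3))
```

Restarting the counter at 1 for the forecast period would place those days at the start of the fitted trend line instead of at its end.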

In [None]:
df_test = pd.read_csv(
    comp_dir / 'test.csv',
    dtype={
        'store_nbr': 'category',
        'family': 'category',
        'onpromotion': 'uint32',
    },
    parse_dates=['date'],
    infer_datetime_format=True,
)
df_test['date'] = df_test.date.dt.to_period('D')
df_test = df_test.set_index(['store_nbr', 'family', 'date']).sort_index()

In [None]:
df_test

In [None]:
list(df_test.index.get_level_values('date').unique())

In [None]:
# Create features for test set
X_test = dp.out_of_sample(16) # ??? V2
X_test

In [None]:
model

In [None]:
df_msle_info.tail(3)

In [None]:
X_test.index.name = 'date'
X_test['NewYear'] = (X_test.index.dayofyear == 1)

y_pred = model.predict(X_test).clip(0.0)

In [None]:
y_submit = pd.DataFrame(y_pred, index=X_test.index, columns=y.columns)
y_submit = y_submit.stack(['store_nbr', 'family'])
y_submit = y_submit.join(df_test.id).reindex(columns=['id', 'sales'])
y_submit.to_csv('Time Series - Seasonality V9CRidge_0.9.csv', index=False)

In [None]:
y_submit

# V3 keep store_nbr and family as features by label or one-hot encoding them
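
A small sketch of the difference between the two encodings, using pandas only and a made-up `families` series (the cells below use scikit-learn's encoders instead):

```python
import pandas as pd

families = pd.Series(
    ["BEVERAGES", "AUTOMOTIVE", "BEVERAGES", "GROCERY I"], name="family"
)

# Label encoding: one integer column; codes follow sorted category order.
family_codes = families.astype("category").cat.codes

# One-hot encoding: one 0/1 column per category.
family_onehot = pd.get_dummies(families, prefix="family", dtype=int)

print(family_codes.tolist())
print(family_onehot.columns.tolist())
```

Label codes are compact, but a linear model reads them as an ordered numeric scale; one-hot columns avoid that at the cost of a wider feature matrix.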

In [None]:
#y = store_sales.unstack(['store_nbr', 'family']).loc["2017"]
y = store_sales.reset_index(level=[0,1]).loc["2017"]         #??? Keep only 2017 ? 
y 

In [None]:
# Create training data
fourier = CalendarFourier(freq='M', order=4)

dp = DeterministicProcess(
    index=y.index,
    constant=True,
    order=1,
    seasonal=True,
    additional_terms=[fourier],
    drop=True,
)

X = dp.in_sample()
X['NewYear'] = (X.index.dayofyear == 1)
X

In [None]:
from sklearn.preprocessing import LabelEncoder

# Label encoding for 'family'
le = LabelEncoder()  # from sklearn.preprocessing
X['family'] = le.fit_transform(y['family'])

X['store_nbr'] = y['store_nbr']  # V4 le.fit_transform(y['store_nbr'])

X["day"] = X.index.day  # values are day of the month
X

In [None]:
y

In [None]:
y.pop('store_nbr')
y.pop('family')
y

### V1,2: X.shape: 227 rows × 17 columns ---> y.shape: 227 rows × 1782 columns   LinearRegression(fit_intercept=False)
### V3...: X.shape: 404514 rows × 19 columns ---> y.shape: 404514 rows × 1 columns LabelEncoder family, store_nbr + LinearRegression(fit_intercept=False)
### V4...: X.shape: 404514 rows × 19 columns ---> y.shape: 404514 rows × 1 columns LabelEncoder family, store_nbr + XGBRegressor()
### V5...: X.shape: 404514 rows × 105 columns ---> y.shape: 404514 rows × 1 columns OneHotEncoder family, store_nbr + LinearRegression(fit_intercept=False)
### V6...: X.shape: 404514 rows × 105 columns ---> y.shape: 404514 rows × 1 columns OneHotEncoder family, store_nbr + XGBRegressor()
### V7 same as V1,2 but not just 2017 data: X: 1684 rows × 17 columns ---> y:1684 rows × 1782 columns   LinearRegression(fit_intercept=False)

In [None]:
model = LinearRegression(fit_intercept=False)
model.fit(X, y)

In [None]:
y_pred = pd.DataFrame(model.predict(X), index=X.index, columns=['sales'])

In [None]:
y_pred

In [None]:
df_test

In [None]:
df_test_long = df_test.reset_index(level=[0,1])
df_test_long

In [None]:
# Create test data
fourier = CalendarFourier(freq='M', order=4)
dp_test = DeterministicProcess(
    index=df_test_long.index,      # <------
    constant=True,
    order=1,
    seasonal=True,
    additional_terms=[fourier],
    drop=True,
)

X_test = dp_test.in_sample()      # <------
X_test['NewYear'] = (X_test.index.dayofyear == 1)
X_test

In [None]:
# Create features for test set
#X_test = dp.out_of_sample(16) # ??? V2

X_test['NewYear'] = (X_test.index.dayofyear == 1)

# Label encoding for 'family'
X_test['family'] = le.transform(df_test_long['family'])
X_test['store_nbr'] = df_test_long['store_nbr']

X_test["day"] = X_test.index.day  # values are day of the month
X_test

In [None]:
y_pred = model.predict(X_test)
y_pred.shape

In [None]:
df_test_long

In [None]:
y_submit = pd.DataFrame(y_pred, index=X_test.index, columns=y.columns)
y_submit['id'] = df_test_long['id']
#y_submit = y_submit.stack(['store_nbr', 'family'])
#y_submit = y_submit.join(df_test.id).reindex(columns=['id', 'sales'])
#y_submit.to_csv('submission.csv', index=False)
y_submit = y_submit.reset_index().drop('date', axis=1)
y_submit = y_submit[['id', 'sales']]
y_submit

In [None]:
y_submit.to_csv('submission.csv', index=False)

In [None]:
plt.hist(y_submit['sales'], bins='auto')

### V4...: X.shape: 404514 rows × 19 columns ---> y.shape: 404514 rows × 1 columns XGBRegressor()

In [None]:
X

In [None]:
y

In [None]:
from xgboost import XGBRegressor

# Cast store_nbr to int to avoid:
#   ValueError: DataFrame.dtypes for data must be int, float, bool or category.
#   When categorical type is supplied, DMatrix parameter `enable_categorical`
#   must be set to `True`.store_nbr
X.store_nbr = X.store_nbr.astype(int)

model = XGBRegressor()
model.fit(X, y)

y_pred = pd.DataFrame(model.predict(X), index=X.index, columns=['sales'])
print(y_pred)
y_pred

In [None]:
X_test

In [None]:
y_pred = model.predict(X_test)
print(y_pred)
y_pred

In [None]:
y_submit = pd.DataFrame(y_pred, index=X_test.index, columns=y.columns)
y_submit['id'] = df_test_long['id']
y_submit = y_submit.reset_index().drop('date', axis=1)
y_submit = y_submit[['id', 'sales']]
y_submit

In [None]:
y_submit.to_csv('submission.csv', index=False)

In [None]:
plt.hist(y_submit['sales'], bins='auto');

In [None]:
y_submit_clipped = y_submit.clip(0.0)
y_submit_clipped.to_csv('submission.csv', index=False)
plt.hist(y_submit_clipped['sales'], bins='auto');

### V5...: X.shape: 404514 rows × 105 columns ---> y.shape: 404514 rows × 1 columns OneHotEncoder family, store_nbr + LinearRegression(fit_intercept=False)

In [None]:
X_original = X.copy()
y_original = y.copy()

In [None]:
X = X_original.copy()
X

In [None]:
y = y_original.copy()
y

In [None]:
from sklearn.preprocessing import OneHotEncoder

family_ohe = OneHotEncoder(sparse=False)
family_df = pd.DataFrame(family_ohe.fit_transform(X[['family']]))
family_df.columns = ['family_'+str(col_name) for col_name in family_df.columns] 
family_df.index = X.index
family_df

In [None]:
store_nbr_ohe = OneHotEncoder(sparse=False)
store_nbr_df = pd.DataFrame(store_nbr_ohe.fit_transform(X[['store_nbr']]))
store_nbr_df.columns = ['store_nbr_'+str(col_name) for col_name in store_nbr_df.columns] 
store_nbr_df.index = X.index
store_nbr_df

In [None]:
X.pop('family')
X.pop('store_nbr')

In [None]:
print(X.shape, family_df.shape, store_nbr_df.shape)

In [None]:
X

In [None]:
X = pd.concat([X, family_df, store_nbr_df], axis=1)
X

In [None]:
y

In [None]:
model = LinearRegression(fit_intercept=False)
model.fit(X, y)

In [None]:
model.coef_.shape

In [None]:
y_pred = pd.DataFrame(model.predict(X), index=X.index, columns=['sales'])
y_pred

In [None]:
df_test_long

In [None]:
# Create test data
fourier = CalendarFourier(freq='M', order=4)
dp_test = DeterministicProcess(
    index=df_test_long.index,      # <------
    constant=True,
    order=1,
    seasonal=True,
    additional_terms=[fourier],
    drop=True,
)

X_test = dp_test.in_sample()      # <------
X_test['NewYear'] = (X_test.index.dayofyear == 1)
X_test

In [None]:
X_test['NewYear'] = (X_test.index.dayofyear == 1)

# Label encoding for 'family'
X_test['family'] = le.transform(df_test_long['family'])
X_test['store_nbr'] = df_test_long['store_nbr']

X_test["day"] = X_test.index.day  # values are day of the month
X_test

In [None]:
#family_ohe = OneHotEncoder(sparse=False)
family_df = pd.DataFrame(family_ohe.transform(X_test[['family']]))
family_df.columns = ['family_'+str(col_name) for col_name in family_df.columns] 
family_df.index = X_test.index
family_df

In [None]:
#store_nbr_ohe = OneHotEncoder(sparse=False)
store_nbr_df = pd.DataFrame(store_nbr_ohe.transform(X_test[['store_nbr']]))
store_nbr_df.columns = ['store_nbr_'+str(col_name) for col_name in store_nbr_df.columns] 
store_nbr_df.index = X_test.index
store_nbr_df

In [None]:
X_test.pop('family')
X_test.pop('store_nbr')

In [None]:
print(X_test.shape, family_df.shape, store_nbr_df.shape)
X_test = pd.concat([X_test, family_df, store_nbr_df], axis=1)

In [None]:
X_test

In [None]:
y_pred = model.predict(X_test)
print(y_pred.shape)
y_pred

In [None]:
y_submit = pd.DataFrame(y_pred, index=X_test.index, columns=y.columns)
y_submit['id'] = df_test_long['id']
y_submit = y_submit.reset_index().drop('date', axis=1)
y_submit = y_submit[['id', 'sales']]
y_submit

In [None]:
y_submit.to_csv('submission.csv', index=False)
plt.hist(y_submit['sales'], bins='auto');

In [None]:
y_submit_clipped = y_submit.clip(0.0)
y_submit_clipped.to_csv('submission-clipped.csv', index=False)
plt.hist(y_submit_clipped['sales'], bins='auto');

### V6...: X.shape: 404514 rows × 105 columns ---> y.shape: 404514 rows × 1 columns OneHotEncoder family, store_nbr + XGBRegressor()

In [None]:
X

In [None]:
y

In [None]:
model = XGBRegressor()
model.fit(X, y)

y_pred = pd.DataFrame(model.predict(X), index=X.index, columns=['sales'])
print(y_pred)
y_pred

In [None]:
y_pred = model.predict(X_test)
print(y_pred.shape)
y_pred

In [None]:
y_submit = pd.DataFrame(y_pred, index=X_test.index, columns=y.columns)
y_submit['id'] = df_test_long['id']
y_submit = y_submit.reset_index().drop('date', axis=1)
y_submit = y_submit[['id', 'sales']]
y_submit

In [None]:
y_submit.to_csv('submission.csv', index=False)
plt.hist(y_submit['sales'], bins='auto');

In [None]:
y_submit_clipped = y_submit.clip(0.0)
y_submit_clipped.to_csv('submission-clipped.csv', index=False)
plt.hist(y_submit_clipped['sales'], bins='auto');

To test your forecasts, you'll need to join the competition (if you haven't already). So open a new window by clicking on [this link](https://www.kaggle.com/c/29781). Then click on the **Join Competition** button.

Next, follow the instructions below:
1. Begin by clicking on the **Save Version** button in the top right corner of the window.  This will generate a pop-up window.  
2. Ensure that the **Save and Run All** option is selected, and then click on the **Save** button.
3. This generates a window in the bottom left corner of the notebook.  After it has finished running, click on the number to the right of the **Save Version** button.  This pulls up a list of versions on the right of the screen.  Click on the ellipsis **(...)** to the right of the most recent version, and select **Open in Viewer**.  This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
4. Click on the **Output** tab on the right of the screen.  Then, click on the file you would like to submit, and click on the **Submit** button to submit your results to the leaderboard.

You have now successfully submitted to the competition!

If you want to keep working to improve your performance, select the **Edit** button in the top right of the screen. Then you can change your code and repeat the process. There's a lot of room to improve, and you will climb up the leaderboard as you work.


# Keep Going #

[**Use time series as features**](https://www.kaggle.com/ryanholbrook/time-series-as-features) to capture cycles and other kinds of serial dependence.