In [1]:
import numpy as np
import pandas as pd
import scipy.stats as sps

import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm

sns.set(font_scale=1.2)
%matplotlib inline

# Baseline

In this notebook I build a baseline solution following advice 1 and 2 from the course.

Advice 1:
> Competition data is rather challenging, so the sooner you get yourself familiar with it - the better. You can start with submitting sample_submission.csv from "Data" page on Kaggle and try submitting different constants.

Advice 2:

> A good exercise is to reproduce previous_value_benchmark. As the name suggest - in this benchmark for the each shop/item pair our predictions are just monthly sales from the previous month, i.e. October 2015.

> The most important step at reproducing this score is correctly aggregating daily data and constructing monthly sales data frame. You need to get [lagged](https://en.wikipedia.org/wiki/Lag_operator) values, fill NaNs with zeros and clip the values into [0,20] range. If you do it correctly, you'll get precisely 1.16777 on the public leaderboard.

> Generating features like this is a necessary basis for more complex models. Also, if you decide to fit some model, don't forget to clip the target into [0,20] range, it makes a big difference.
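The aggregation steps described in advice 2 can be sketched on toy data. This is a minimal illustration, not the pipeline that produced the processed files used later in this notebook. The column names `date_block_num` and `item_cnt_day` are assumed to follow the competition's daily sales file, and the frames `daily` and `test_grid` are hypothetical stand-ins.

```python
import pandas as pd

# Toy daily sales. Column names are an assumption (modelled on the competition's daily sales file).
daily = pd.DataFrame({
    "date_block_num": [32, 32, 33, 33, 33],
    "shop_id":        [5, 5, 5, 5, 6],
    "item_id":        [10, 10, 10, 11, 10],
    "item_cnt_day":   [1.0, 2.0, 30.0, -1.0, 4.0],
})

# 1. Aggregate daily rows into monthly sales per (month, shop, item).
monthly = (
    daily.groupby(["date_block_num", "shop_id", "item_id"], as_index=False)["item_cnt_day"]
    .sum()
    .rename(columns={"item_cnt_day": "target"})
)

# 2. Build the lag-1 feature: shift the month index forward by one,
#    so that sales of month t-1 line up with rows of month t.
lag = monthly.copy()
lag["date_block_num"] += 1
lag = lag.rename(columns={"target": "target_lag_1"})

# 3. Hypothetical test grid for the month to predict (month index 34 is assumed).
test_grid = pd.DataFrame({
    "date_block_num": 34,
    "shop_id": [5, 5, 6, 6],
    "item_id": [10, 11, 10, 12],
})

# 4. Left-join the lag, fill missing pairs with zero, clip into [0, 20].
features = test_grid.merge(lag, on=["date_block_num", "shop_id", "item_id"], how="left")
prediction = features["target_lag_1"].fillna(0).clip(0, 20)
print(prediction.tolist())  # [20.0, 0.0, 4.0, 0.0]
```

The left join keeps every test pair, including pairs with no sales in the previous month. Those rows become NaN and are then filled with zero, which is the step the advice calls out.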

## Sample submission

Following advice 1, we simply submit the sample submission.

In [2]:
submission = pd.read_csv('../data/raw/sample_submission.csv')

In [3]:
submission.head()

Unnamed: 0,ID,item_cnt_month
0,0,0.5
1,1,0.5
2,2,0.5
3,3,0.5
4,4,0.5


Check values of `item_cnt_month`.

In [4]:
submission.item_cnt_month.unique()

array([0.5])

In [5]:
submission.to_csv('../models/constants/0.5.csv', index=False)

As we can see, the file predicts a single constant value. Submit it using the `kaggle` utility.

We got $1.23646$ on the leaderboard.

## Different constants

In this section we try other constants and find the optimal one.

In [6]:
submission.item_cnt_month = 1.0
submission.to_csv('../models/constants/1.0.csv', index=False)

We got $1.41241$ on the leaderboard.

In [7]:
submission.item_cnt_month = 0.0
submission.to_csv('../models/constants/0.0.csv', index=False)

We got $1.25011$ on the leaderboard.

Now we can easily calculate the best constant for MSE. Assume that we have two constant predictions $y_1, y_2$ with MSE scores $\alpha, \beta$. For a constant prediction $y$, the MSE can be written as

$$
MSE(y) = (y - \overline{y})^2 + m, 
$$
where $\overline{y}$ is the mean of the true targets (and therefore the best constant), and $m$ is the lowest error a constant prediction can achieve.
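A short derivation of this decomposition, as a sketch. The notation $t_1, \dots, t_n$ for the hidden leaderboard targets is introduced here and is not part of the course material; $\overline{y}$ is their mean.

$$
\frac{1}{n}\sum_{i=1}^{n} (y - t_i)^2
= (y - \overline{y})^2
+ \frac{2 (y - \overline{y})}{n} \sum_{i=1}^{n} (\overline{y} - t_i)
+ \frac{1}{n}\sum_{i=1}^{n} (t_i - \overline{y})^2
= (y - \overline{y})^2 + m
$$

The middle term vanishes because $\sum_{i} (\overline{y} - t_i) = 0$, and the last term, the variance of the targets, is $m$.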

Each of the two submissions gives one equation of this form. Subtracting one from the other eliminates $m$ and gives:

$$
\overline{y} = \frac{1}{2} \left( y_1 + y_2 - \frac{\alpha - \beta}{y_1 - y_2} \right)
$$

Since the leaderboard reports RMSE, $\alpha$ and $\beta$ are the squares of the reported scores. With $y_1 = 0.5$ and $y_2 = 1.0$ this gives $\overline{y} = 0.2839$.
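The calculation can be reproduced in a few lines. This is a minimal sketch: the helper name `best_constant` is hypothetical, and the inputs are the leaderboard scores reported above for the constants 0.5 and 1.0. The score for the constant 0.0 is not used in the fit, so it serves as a consistency check.

```python
def best_constant(y1, rmse1, y2, rmse2):
    """Return (best constant, RMSE predicted at that constant) from two constant submissions."""
    # The leaderboard reports RMSE, while the formula is stated for MSE.
    alpha, beta = rmse1 ** 2, rmse2 ** 2
    y_bar = 0.5 * (y1 + y2 - (alpha - beta) / (y1 - y2))
    # m is the constant term of MSE(y) = (y - y_bar)^2 + m.
    m = alpha - (y1 - y_bar) ** 2
    return y_bar, m ** 0.5


def predicted_rmse(y, y_bar, rmse_at_best):
    """RMSE that the fitted parabola predicts for a constant prediction y."""
    return ((y - y_bar) ** 2 + rmse_at_best ** 2) ** 0.5


y_bar, rmse_best = best_constant(0.5, 1.23646, 1.0, 1.41241)
print(round(y_bar, 4))  # 0.2839

# Consistency check against the third submission (constant 0.0, reported score 1.25011).
check = predicted_rmse(0.0, y_bar, rmse_best)
```

The value of `check` agrees with the reported $1.25011$ to about four decimal places, which supports the parabola model. The same fit predicts an RMSE of roughly $1.217$ for the constant $0.2839$; this is a value derived from the formula, not a leaderboard result.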

Let's submit this prediction.

In [8]:
submission.item_cnt_month = 0.2839
submission.to_csv('../models/constants/0.2839.csv', index=False)

We got $$ on the leaderboard.

## Previous value benchmark

Following advice 2, for each (`shop_id`, `item_id`) pair we predict the sales of the previous month, clipped to the range [0, 20].

In [9]:
test = pd.read_hdf('../data/processed/test.h5', 'test')
test.head()

Unnamed: 0,shop_id,item_id,num_days,month,year,num_holidays,num_not_working_days,longest_sequence_without_holidays,fraction_non_even_mean_lag_1,price_mean_lag_1,...,target_lag_12,target_item_lag_12,target_shop_lag_12,item_name,item_full_category_name,item_category_name,item_subcategory_name,shop_name,city,num_residents
6332358,5,5037,30,11,2015,1,10,144,0.332889,1499.0,...,1.0,65.0,1445.0,"NHL 15 [PS3, русские субтитры]",Игры - PS3,Игры,PS3,"Вологда ТРЦ ""Мармелад""",Вологда,310302.0
6332359,5,5320,30,11,2015,1,10,144,0.0,0.0,...,0.0,0.0,0.0,ONE DIRECTION Made In The A.M.,Музыка - CD локального производства,Музыка,CD локального производства,"Вологда ТРЦ ""Мармелад""",Вологда,310302.0
6332360,5,5233,30,11,2015,1,10,144,0.165972,1199.0,...,0.0,0.0,0.0,"Need for Speed Rivals (Essentials) [PS3, русск...",Игры - PS3,Игры,PS3,"Вологда ТРЦ ""Мармелад""",Вологда,310302.0
6332361,5,5232,30,11,2015,1,10,144,0.161925,1190.43335,...,0.0,0.0,0.0,"Need for Speed Rivals (Classics) [Xbox 360, ру...",Игры - XBOX 360,Игры,XBOX 360,"Вологда ТРЦ ""Мармелад""",Вологда,310302.0
6332362,5,5268,30,11,2015,1,10,144,0.0,0.0,...,0.0,0.0,0.0,"Need for Speed [PS4, русская версия]",Игры - PS4,Игры,PS4,"Вологда ТРЦ ""Мармелад""",Вологда,310302.0


In [10]:
submission.head()

Unnamed: 0,ID,item_cnt_month
0,0,0.2839
1,1,0.2839
2,2,0.2839
3,3,0.2839
4,4,0.2839


In [11]:
submission['item_cnt_month'] = np.clip(test['target_lag_1'].values, 0, 20)
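The assignment above is positional: it relies on the rows of `test` being in the same order as the `ID` column of the submission file. If that order is ever in doubt, the predictions can be aligned by key instead. The sketch below uses toy data; `raw_test` and `features` are hypothetical stand-ins, and the assumption is that a raw test file with `ID`, `shop_id` and `item_id` columns is available.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the raw test file: one ID per (shop_id, item_id) pair.
raw_test = pd.DataFrame({"ID": [0, 1, 2], "shop_id": [5, 5, 6], "item_id": [10, 11, 10]})

# Hypothetical stand-in for the processed test frame, deliberately in a different row order.
features = pd.DataFrame({
    "shop_id": [6, 5, 5],
    "item_id": [10, 10, 11],
    "target_lag_1": [4.0, 30.0, np.nan],
})

# Join on the pair keys; validate= raises if a pair is duplicated on either side.
aligned = raw_test.merge(features, on=["shop_id", "item_id"], how="left", validate="one_to_one")

aligned_submission = pd.DataFrame({
    "ID": aligned["ID"],
    # Fill missing lags with zero, then clip into [0, 20] as the advice requires.
    "item_cnt_month": aligned["target_lag_1"].fillna(0).clip(0, 20),
})
print(aligned_submission["item_cnt_month"].tolist())  # [20.0, 0.0, 4.0]
```

A left merge from `raw_test` preserves the `ID` order, so the result can be written out directly.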

In [12]:
submission.to_csv('../models/previous_value/submission.csv', index=False)

This time the prediction differs for each shop/item pair. Submit it using the `kaggle` utility.

We got $1.16777$ on the leaderboard, exactly as the advice predicted.