In [1]:
import numpy as np
import pandas as pd
import scipy.stats as sps

import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm

sns.set(font_scale=1.2)
%matplotlib inline

# Baseline

This notebook builds a baseline solution by following the first two pieces of advice from the course.

Advice 1:
> Competition data is rather challenging, so the sooner you get yourself familiar with it - the better. You can start with submitting sample_submission.csv from "Data" page on Kaggle and try submitting different constants.

Advice 2:

> A good exercise is to reproduce previous_value_benchmark. As the name suggests - in this benchmark, for each shop/item pair our predictions are just monthly sales from the previous month, i.e. October 2015.

> The most important step in reproducing this score is correctly aggregating daily data and constructing the monthly sales data frame. You need to get [lagged](https://en.wikipedia.org/wiki/Lag_operator) values, fill NaNs with zeros and clip the values to the [0,20] range. If you do it correctly, you'll get precisely 1.16777 on the public leaderboard.

> Generating features like this is a necessary basis for more complex models. Also, if you decide to fit some model, don't forget to clip the target into [0,20] range, it makes a big difference.

## Sample submission

Following advice 1, we start by submitting the sample submission as is.

In [2]:
submission = pd.read_csv('../data/raw/sample_submission.csv')

In [3]:
submission.head()

Unnamed: 0,ID,item_cnt_month
0,0,0.5
1,1,0.5
2,2,0.5
3,3,0.5
4,4,0.5


Check values of `item_cnt_month`.

In [4]:
submission.item_cnt_month.unique()

array([0.5])

As we can see, the sample submission predicts a single constant value, 0.5, for every row. Submit it with the `kaggle` command-line utility.

In [6]:
!kaggle competitions submit competitive-data-science-predict-future-sales -f ../data/raw/sample_submission.csv -m "Sample submission"

100%|███████████████████████████████████████| 2.14M/2.14M [00:04<00:00, 457kB/s]
Successfully submitted to Predict Future Sales

In [7]:
!kaggle competitions submissions competitive-data-science-predict-future-sales

fileName               date                 description        status    publicScore  privateScore  
---------------------  -------------------  -----------------  --------  -----------  ------------  
sample_submission.csv  2020-08-27 08:38:17  Sample submission  complete  1.23646      None          


We got $1.23646$ on the public leaderboard.
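
Advice 1 also suggests trying other constants. A minimal sketch of how such files could be prepared is given here; `make_constant_submission`, the toy frame, and the candidate values are hypothetical names and numbers chosen for illustration, not part of the original notebook.

```python
import pandas as pd


def make_constant_submission(sample: pd.DataFrame, value: float) -> pd.DataFrame:
    """Return a copy of `sample` with every prediction set to `value`."""
    result = sample.copy()
    result["item_cnt_month"] = float(value)
    return result


# Toy stand-in for sample_submission.csv (same two columns as shown above).
toy_sample = pd.DataFrame({"ID": range(5), "item_cnt_month": 0.5})

# Hypothetical constants to try. Each frame could be written with
# `to_csv(path, index=False)` and submitted like the sample file.
candidates = {c: make_constant_submission(toy_sample, c) for c in (0.0, 0.25, 1.0)}
```

Comparing the public scores of a few such files gives a rough idea of which constant is closest to the typical target value.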

## Previous value benchmark

Following advice 2, for each (`shop_id`, `item_id`) pair we predict the sales from the previous month, clipped to the range [0, 20].
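
The steps named in advice 2 (aggregate, lag, fill NaNs, clip) can be sketched on toy data as follows. This is only an illustration under assumptions: it uses the column names of the raw competition files (`date_block_num`, `item_cnt_day`, `ID`) and assumes month index 33 corresponds to October 2015; the notebook itself works with the processed `train.csv` loaded in the next cell.

```python
import pandas as pd

# Toy daily sales with the assumed raw column names.
daily = pd.DataFrame({
    "date_block_num": [32, 33, 33, 33, 33],
    "shop_id":        [5, 5, 5, 5, 6],
    "item_id":        [10, 10, 10, 11, 10],
    "item_cnt_day":   [3.0, 15.0, 9.0, -1.0, 2.0],
})
# Toy test set: the last pair never appears in the daily data.
test = pd.DataFrame({
    "ID":      [0, 1, 2, 3],
    "shop_id": [5, 5, 6, 7],
    "item_id": [10, 11, 10, 10],
})

last_month = 33  # assumed index of October 2015

# 1. Aggregate the previous month's daily sales into monthly totals.
monthly = (
    daily[daily["date_block_num"] == last_month]
    .groupby(["shop_id", "item_id"], as_index=False)["item_cnt_day"]
    .sum()
    .rename(columns={"item_cnt_day": "item_cnt_month"})
)

# 2. Attach the lagged value to every test pair (left join keeps all pairs).
prediction = test.merge(monthly, on=["shop_id", "item_id"], how="left")

# 3. Pairs with no sales last month become 0; 4. clip to [0, 20].
prediction["item_cnt_month"] = prediction["item_cnt_month"].fillna(0).clip(0, 20)

submission_sketch = prediction[["ID", "item_cnt_month"]]
```

In this toy example the first pair sells 24 items and is clipped to 20, the second has a negative total (returns) and is clipped to 0, and the unseen pair is filled with 0.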

In [8]:
train = pd.read_csv('../data/processed/train.csv')
train.head()

Unnamed: 0,num_days,month,year,num_holidays,num_not_working_days,longest_sequence_without_holidays,fraction_non_even_mean_lag_1,price_mean_lag_1,price_nunique_lag_1,price_std_lag_1,...,target_lag_12,target_item_lag_12,target_shop_lag_12,item_name,item_full_category_name,item_category_name,item_subcategory_name,shop_name,city,num_residents
0,31,1,2014,8,14,23,0.066409,741.5607,3.0,42.685207,...,0.0,0.0,0.0,ГАДКИЙ Я 1-2 (BD),Кино - Blu-Ray,Кино,Blu-Ray,"Химки ТЦ ""Мега""",Химки,259550.0
1,31,1,2014,8,14,23,0.374609,1599.0,1.0,0.0,...,0.0,0.0,0.0,ГАДКИЙ Я 1-2 (3D BD),Кино - Blu-Ray 3D,Кино,Blu-Ray 3D,"Химки ТЦ ""Мега""",Химки,259550.0
2,31,1,2014,8,14,23,0.247979,392.3475,2.0,29.158052,...,0.0,0.0,0.0,ГАДКИЙ Я 2,Кино - DVD,Кино,DVD,"Химки ТЦ ""Мега""",Химки,259550.0
3,31,1,2014,8,14,23,0.136028,682.0333,4.0,63.386723,...,0.0,0.0,0.0,ГАДКИЙ Я 2 (BD),Кино - Blu-Ray,Кино,Blu-Ray,"Химки ТЦ ""Мега""",Химки,259550.0
4,31,1,2014,8,14,23,0.348627,266.0,2.0,66.0,...,0.0,0.0,0.0,"Высоцкий Владимир Спасибо, что живой (mp3-CD)...",Музыка - MP3,Музыка,MP3,"Химки ТЦ ""Мега""",Химки,259550.0
