<h1 align="center">1. Baseline</h1>
<h3 align="center">Dataset: <a href="https://www.kaggle.com/c/competitive-data-science-predict-future-sales">Predict future sales</a></h3>

### Imports

In [1]:
import pandas as pd

### Constants

In [2]:
ENGLISH = False

DATA_RUS_PATH = "../DATA/1. Original data Russian (96Mb)/"
DATA_ENG_PATH = "../DATA/2. Translated data English (1Mb)/"
DATA_SUB_PATH = "../DATA/5. Submissions/"

### Load data

In [3]:
sales = pd.read_csv(DATA_RUS_PATH + "sales_train.csv")          # Daily sales, Jan 2013 -> Oct 2015
test  = pd.read_csv(DATA_RUS_PATH + "test.csv", index_col="ID") # Pairs to predict for November 2015
sub   = pd.read_csv(DATA_RUS_PATH + "sample_submission.csv", index_col="ID")

# The lookup tables exist in both languages; pick the folder once.
DATA_PATH = DATA_ENG_PATH if ENGLISH else DATA_RUS_PATH

shops = pd.read_csv(DATA_PATH + "shops.csv")           # shops (60)
items = pd.read_csv(DATA_PATH + "items.csv")           # products (22170)
cats  = pd.read_csv(DATA_PATH + "item_categories.csv") # product categories (84)

### `sales`: Daily sales from January 2013 to October 2015

In [4]:
sales.head(2)

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,02.01.2013,0,59,22154,999.0,1.0
1,03.01.2013,0,25,2552,899.0,1.0


In [5]:
sales.tail(2)

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
2935847,22.10.2015,33,25,7440,299.0,1.0
2935848,03.10.2015,33,25,7460,299.0,1.0


### `test`: Sales of November 2015
**For a given item at a given shop: how many units will be sold in November 2015?**

In [6]:
test.head()

ID,shop_id,item_id
0,5,5037
1,5,5320
2,5,5233
3,5,5232
4,5,5268


---

### STEP 1: Keep only the sales from October 2015

In [7]:
sales_oct2015 = sales[sales.date_block_num==33]
sales_oct2015

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
2882335,23.10.2015,33,45,13315,649.0,1.0
2882336,05.10.2015,33,45,13880,229.0,1.0
2882337,02.10.2015,33,45,13881,659.0,1.0
2882338,12.10.2015,33,45,13881,659.0,1.0
2882339,04.10.2015,33,45,13923,169.0,1.0
...,...,...,...,...,...,...
2935844,10.10.2015,33,25,7409,299.0,1.0
2935845,09.10.2015,33,25,7460,299.0,1.0
2935846,14.10.2015,33,25,7459,349.0,1.0
2935847,22.10.2015,33,25,7440,299.0,1.0


### STEP 2: Sum Oct2015 sales per shop_id & item_id

In [8]:
oct15_item_shop = sales_oct2015.groupby(["shop_id", "item_id"])["item_cnt_day"].sum().reset_index()
oct15_item_shop

Unnamed: 0,shop_id,item_id,item_cnt_day
0,2,31,1.0
1,2,486,3.0
2,2,787,1.0
3,2,794,1.0
4,2,968,1.0
...,...,...,...
31526,59,22087,6.0
31527,59,22088,2.0
31528,59,22091,1.0
31529,59,22100,1.0


### STEP 3: Put this information (`oct15_item_shop.item_cnt_day`) in TEST
We need to perform a **LEFT JOIN**, so that every row of `test` is kept even when it has no match in `oct15_item_shop`.

In [9]:
test.head()

ID,shop_id,item_id
0,5,5037
1,5,5320
2,5,5233
3,5,5232
4,5,5268


In [10]:
results = pd.merge(left = test,               # Left table for the join
                   right = oct15_item_shop,   # Right table for the join
                   on=["shop_id", "item_id"], # Common keys
                   how='left')                # Type of join

results.head()

Unnamed: 0,shop_id,item_id,item_cnt_day
0,5,5037,
1,5,5320,
2,5,5233,1.0
3,5,5232,
4,5,5268,


### STEP 4: Fill missing values with zeros
We get NaNs for every `shop_id, item_id` pair in test (November 2015) that has no sales record in October 2015: either the pair is **new**, or it simply was not sold that month. With nothing to carry over from October, we predict 0 for those pairs.

**86.61% of the test rows are missing a value**

In [11]:
print("% of missings:", results.item_cnt_day.isna().sum() / len(results) * 100)

% of missings: 86.61064425770309
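To see where the NaNs come from, here is a minimal sketch of the same left join on toy data. The frames `test_toy` and `oct_toy` and their values are made up for illustration; `indicator=True` is a standard `DataFrame.merge` option that adds a `_merge` column telling us whether each row found a match.

```python
import pandas as pd

# Hypothetical toy data: three test pairs, only one of them sold in "October".
test_toy = pd.DataFrame({"shop_id": [5, 5, 5],
                         "item_id": [5037, 5233, 5320]})
oct_toy = pd.DataFrame({"shop_id": [5, 7],
                        "item_id": [5233, 100],
                        "item_cnt_day": [1.0, 2.0]})

# Left join: every row of test_toy is kept; unmatched rows get NaN.
joined = test_toy.merge(oct_toy, on=["shop_id", "item_id"],
                        how="left", indicator=True)

# Rows that exist only on the left side are exactly the rows with NaN.
unmatched = joined["_merge"] == "left_only"
pct_missing = unmatched.mean() * 100

print(joined)
print("% of missings:", pct_missing)
```

The pair `(7, 100)` appears only in the right table, so a left join drops it, while the two unmatched test pairs stay with a NaN count.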


In [12]:
results.item_cnt_day = results.item_cnt_day.fillna(0)
results.head()

Unnamed: 0,shop_id,item_id,item_cnt_day
0,5,5037,0.0
1,5,5320,0.0
2,5,5233,1.0
3,5,5232,0.0
4,5,5268,0.0


### STEP 5: Clip the values to the range [0, 20]

In [13]:
results.item_cnt_day = results.item_cnt_day.clip(lower=0, upper=20)
results.head()

Unnamed: 0,shop_id,item_id,item_cnt_day
0,5,5037,0.0
1,5,5320,0.0
2,5,5233,1.0
3,5,5232,0.0
4,5,5268,0.0
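On the first rows clipping changes nothing, so here is a small sketch on made-up values that shows its effect. The notebook does not say why the bounds are 0 and 20. As an assumption worth checking on the competition page, the evaluation rules are believed to clip the true targets to this same range, and negative daily counts would correspond to returns.

```python
import pandas as pd

# Hypothetical monthly counts: a net return, a zero, a normal value, an outlier.
counts = pd.Series([-1.0, 0.0, 3.0, 25.0])

# Values below 0 become 0, values above 20 become 20, the rest are untouched.
clipped = counts.clip(lower=0, upper=20)

print(clipped.tolist())
```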


### STEP 6: Generate Submission and Submit to Kaggle

In [14]:
sub["item_cnt_month"] = results["item_cnt_day"]
sub.head()

ID,item_cnt_month
0,0.0
1,0.0
2,1.0
3,0.0
4,0.0


In [15]:
sub.to_csv(DATA_SUB_PATH+"1_Baseline_oct2015.csv")
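One detail worth checking before trusting the file: assigning a Series to a DataFrame column aligns on the index, and `pd.merge` on columns returns a fresh 0..n-1 index. Here that appears to be harmless, since the `ID` values shown above start at 0 and so line up with the merged rows. The sketch below uses made-up frames with IDs starting at 100 to show what would happen otherwise, and a positional assignment with `.to_numpy()` as one possible safeguard.

```python
import pandas as pd

# Hypothetical test set whose IDs do NOT start at 0.
test_toy = pd.DataFrame({"shop_id": [5, 5], "item_id": [10, 11]},
                        index=pd.Index([100, 101], name="ID"))
right = pd.DataFrame({"shop_id": [5], "item_id": [11], "item_cnt_day": [2.0]})

# Merging on columns ignores the left index: the result is indexed 0, 1.
merged = test_toy.merge(right, on=["shop_id", "item_id"], how="left")
preds = merged["item_cnt_day"].fillna(0)

sub_toy = pd.DataFrame({"item_cnt_month": [0.5, 0.5]}, index=test_toy.index)

# Index-aligned assignment: labels 0, 1 do not exist in sub_toy, so we get NaN.
naive = sub_toy.assign(item_cnt_month=preds)

# Positional assignment: safe because a left join on unique right keys
# keeps the row order and row count of the left table.
assert len(preds) == len(sub_toy)
sub_toy["item_cnt_month"] = preds.to_numpy()

print(naive)
print(sub_toy)
```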

By submitting `1_Baseline_oct2015.csv` we get a baseline score of **`1.16777`** RMSE on the public leaderboard.
- **Baseline**: a simple solution to the problem, which later models need to beat.
- **RMSE**: Root Mean Squared Error, the competition metric. The lower, the better.
- **Public Leaderboard**: the visible [Kaggle leaderboard](https://www.kaggle.com/c/competitive-data-science-predict-future-sales/leaderboard) where participants compete.
  - The public leaderboard RMSE is computed on **35% of the test data**.
  - The private leaderboard RMSE is computed on the other **65% of the test data**.
  - Remember that winning the competition means finishing first on the **private leaderboard**.
  - Tip: Rely more on your local validation strategy than on the public score :)
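As a rough illustration of the RMSE metric and of what a local validation could look like, the sketch below applies the same "previous month" baseline one month earlier: predict block 33 from block 32 and score it locally. The toy data and the helper names `rmse` and `monthly` are assumptions made for this example, not part of the notebook.

```python
import numpy as np
import pandas as pd

def rmse(y_true, y_pred):
    """Root Mean Squared Error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical daily sales for two consecutive months (blocks 32 and 33).
toy = pd.DataFrame({
    "date_block_num": [32, 32, 32, 33, 33, 33],
    "shop_id":        [5, 5, 6, 5, 5, 6],
    "item_id":        [1, 2, 1, 1, 3, 1],
    "item_cnt_day":   [2.0, 1.0, 4.0, 3.0, 1.0, 4.0],
})

def monthly(df, block):
    """Sum the daily counts of one month per (shop_id, item_id) pair."""
    month = df[df.date_block_num == block]
    return month.groupby(["shop_id", "item_id"], as_index=False)["item_cnt_day"].sum()

# Hold out block 33 as the target and predict it with block 32.
target = monthly(toy, 33).rename(columns={"item_cnt_day": "y_true"})
val = target.merge(monthly(toy, 32), on=["shop_id", "item_id"], how="left")
val["y_pred"] = val["item_cnt_day"].fillna(0).clip(lower=0, upper=20)

score = rmse(val["y_true"].clip(lower=0, upper=20), val["y_pred"])
print("local RMSE:", score)
```

Note that this toy validation only scores pairs that were actually sold in the held-out month, whereas the real test set also lists pairs without any sale, so the two scores are not directly comparable.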