In [1]:
import datetime
import gc
from itertools import product
import sys

sys.path.append('..')


import numpy as np
import pandas as pd
import scipy.stats as sps

import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm

import holidays

from src.utils.downcasting import downcast_dtypes

sns.set(font_scale=1.2)
%matplotlib inline

# Creating dataset

In this notebook I will create the base dataset for training (without tuning for specific models).

In [2]:
items = pd.read_csv('../data/raw/items.csv')
item_categories = pd.read_csv('../data/raw/item_categories.csv')
shops = pd.read_csv('../data/raw/shops.csv')
sales_train = pd.read_csv('../data/raw/sales_train.csv')
test = pd.read_csv('../data/raw/test.csv')

## Cleaning

In this step I will clean the dataset according to `1.0-db-EDA.ipynb`.

### Item categories

This section is empty for now: the data was found to be clean.

### Shops

In [3]:
shops.head()

Unnamed: 0,shop_name,shop_id
0,"!Якутск Орджоникидзе, 56 фран",0
1,"!Якутск ТЦ ""Центральный"" фран",1
2,"Адыгея ТЦ ""Мега""",2
3,"Балашиха ТРК ""Октябрь-Киномир""",3
4,"Волжский ТЦ ""Волга Молл""",4


We found some typos in `shop_name`: several shops are listed twice under slightly different names. We should replace the IDs of shops with typos by the IDs of their correctly named duplicates in `sales_train` and `test` (where applicable).

In [4]:
map_typos = {0: 57, 1: 58, 10: 11, 39: 40}

Visualize mapping.

In [5]:
shops_list = shops.shop_name.to_list()
for from_id, to_id in map_typos.items():
    print(f"'{shops_list[from_id]}' ---> '{shops_list[to_id]}'")

'!Якутск Орджоникидзе, 56 фран' ---> 'Якутск Орджоникидзе, 56'
'!Якутск ТЦ "Центральный" фран' ---> 'Якутск ТЦ "Центральный"'
'Жуковский ул. Чкалова 39м?' ---> 'Жуковский ул. Чкалова 39м²'
'РостовНаДону ТРК "Мегацентр Горизонт"' ---> 'РостовНаДону ТРК "Мегацентр Горизонт" Островной'


Check how many records each ID has in the datasets.

In [6]:
for from_id, to_id in map_typos.items():
    print(f'Typos records with id={from_id} in train: '
          f'{sales_train.shop_id.isin([from_id]).sum()}')
    print(f'Corrected records with id={to_id} in train: '
          f'{sales_train.shop_id.isin([to_id]).sum()}')

Typos records with id=0 in train: 9857
Corrected records with id=57 in train: 117428
Typos records with id=1 in train: 5678
Corrected records with id=58 in train: 71441
Typos records with id=10 in train: 21397
Corrected records with id=11 in train: 499
Typos records with id=39 in train: 13440
Corrected records with id=40 in train: 4257


In [7]:
for from_id, to_id in map_typos.items():
    print(f'Typos records with id={from_id} in test: '
          f'{test.shop_id.isin([from_id]).sum()}')
    print(f'Corrected records with id={to_id} in test: '
          f'{test.shop_id.isin([to_id]).sum()}')

Typos records with id=0 in test: 0
Corrected records with id=57 in test: 5100
Typos records with id=1 in test: 0
Corrected records with id=58 in test: 5100
Typos records with id=10 in test: 5100
Corrected records with id=11 in test: 0
Typos records with id=39 in test: 5100
Corrected records with id=40 in test: 0


As we can see, for each pair only one of the two IDs is present in test. We have to select one of them, and we select the corrected variant in all cases.

**If this choice turns out to be a mistake, we will check it later by running the pipeline without the mapping.**

In [8]:
# Assign the whole column back instead of using chained indexing,
# which triggers SettingWithCopyWarning
sales_train['shop_id'] = sales_train['shop_id'].replace(map_typos)


In [9]:
test['shop_id'] = test['shop_id'].replace(map_typos)
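
A quick sanity check after such a remapping could look like the sketch below. It uses a small toy frame (`toy_sales` is a hypothetical stand-in for `sales_train`, and its values are illustrative only) and `Series.replace`, which is one way to apply the mapping without chained indexing.

```python
import pandas as pd

# Mapping from the notebook; toy_sales is a hypothetical stand-in for sales_train
map_typos = {0: 57, 1: 58, 10: 11, 39: 40}
toy_sales = pd.DataFrame({'shop_id': [0, 57, 10, 25, 39, 1]})

# Series.replace returns a new Series, so we assign it back to the column
toy_sales['shop_id'] = toy_sales['shop_id'].replace(map_typos)

# After the remapping no duplicated shop id should be left
remaining = toy_sales['shop_id'].isin(list(map_typos.keys())).sum()
print(f'Records that still use a typo id: {remaining}')
```

If `remaining` is not zero, some rows were missed by the mapping.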

Check the online shops. Are they used in the test dataset? If not, we can simply remove them to avoid dealing with NaNs in some of the new features.

In [10]:
suspicious_shops = [9, 12, 55]
shops[shops.shop_id.isin(suspicious_shops)]

Unnamed: 0,shop_name,shop_id
9,Выездная Торговля,9
12,Интернет-магазин ЧС,12
55,Цифровой склад 1С-Онлайн,55


In [11]:
for shop in suspicious_shops:
    shop_name = shops[shops.shop_id == shop].shop_name.item()
    print(f"Records '{shop_name}' in train: "
          f"{(sales_train.shop_id == shop).sum()}")

Records 'Выездная Торговля' in train: 3751
Records 'Интернет-магазин ЧС' in train: 34694
Records 'Цифровой склад 1С-Онлайн' in train: 34769


In [12]:
for shop in suspicious_shops:
    shop_name = shops[shops.shop_id == shop].shop_name.item()
    print(f"Records '{shop_name}' in test: "
          f"{(test.shop_id == shop).sum()}")

Records 'Выездная Торговля' in test: 0
Records 'Интернет-магазин ЧС' in test: 5100
Records 'Цифровой склад 1С-Онлайн' in test: 5100


`Выездная Торговля` has no records in test, so we can remove it from train.

In [13]:
sales_train.shape

(2935849, 6)

In [14]:
sales_train = sales_train[sales_train.shop_id != 9]

### Items

In [15]:
items.head()

Unnamed: 0,item_name,item_id,item_category_id
0,! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D,0,40
1,!ABBYY FineReader 12 Professional Edition Full...,1,76
2,***В ЛУЧАХ СЛАВЫ (UNV) D,2,40
3,***ГОЛУБАЯ ВОЛНА (Univ) D,3,40
4,***КОРОБКА (СТЕКЛО) D,4,40


We found stray characters in `item_name`: leading `!`, `*`, `/` and a trailing `D`.

#### Correct `!`

In [16]:
first_character = items.item_name.apply(lambda x: x[0])
typos_list = items[first_character == '!'].item_name.to_list()
typos_list

['! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.)         D',
 '!ABBYY FineReader 12 Professional Edition Full [PC, Цифровая версия]']

There were no duplicates, so we just delete the redundant `!`.

In [17]:
map_typos = {}
for typo in typos_list:
    map_typos[typo] = typo.strip('!').strip()

# Assign the whole column back instead of using chained indexing
items['item_name'] = items['item_name'].replace(map_typos)


#### Correct `*`

In [18]:
typos_list = items[first_character == '*'].item_name.to_list()
typos_list

['***В ЛУЧАХ СЛАВЫ   (UNV)                    D',
 '***ГОЛУБАЯ ВОЛНА  (Univ)                      D',
 '***КОРОБКА (СТЕКЛО)                       D',
 '***НОВЫЕ АМЕРИКАНСКИЕ ГРАФФИТИ  (UNI)             D',
 '***УДАР ПО ВОРОТАМ (UNI)               D',
 '***УДАР ПО ВОРОТАМ-2 (UNI)               D',
 '***ЧАЙ С МУССОЛИНИ                     D',
 '***ШУГАРЛЭНДСКИЙ ЭКСПРЕСС (UNI)             D',
 '*ЗА ГРАНЬЮ СМЕРТИ                       D',
 '*ЛИНИЯ СМЕРТИ                           D',
 '*МИХЕЙ И ДЖУМАНДЖИ  Сука любовь',
 '*СПАСАЯ ЭМИЛИ                           D',
 '*ЧОКНУТЫЙ ПРОФЕССОР /МАГИЯ/             D']

There is a duplicate problem with `*МИХЕЙ И ДЖУМАНДЖИ  Сука любовь`: the same name already exists without `*`. In all other cases we can just remove `*`.
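
Such collisions can also be detected programmatically. The sketch below is only an illustration on a hypothetical `toy_items` frame: it strips the prefix from every name and reports the items whose cleaned names coincide.

```python
import pandas as pd

# Hypothetical miniature of the items table
toy_items = pd.DataFrame({
    'item_id': [0, 1, 2],
    'item_name': [
        '*ЛИНИЯ СМЕРТИ',
        '*МИХЕЙ И ДЖУМАНДЖИ  Сука любовь',
        'МИХЕЙ И ДЖУМАНДЖИ  Сука любовь',
    ],
})

# Name of each item as it would look after removing the leading '*'
cleaned = toy_items['item_name'].str.lstrip('*').str.strip()

# duplicated(keep=False) marks every member of a group of equal names
collisions = toy_items[cleaned.duplicated(keep=False)]
print(collisions['item_id'].tolist())
```

Items listed in `collisions` need an ID mapping rather than a simple rename.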

In [19]:
map_typos = {}
for typo in typos_list:
    # Skip the duplicate case, it is processed separately
    if typo[:3] != '*МИ':
        map_typos[typo] = typo.strip('*').strip()

items['item_name'] = items['item_name'].replace(map_typos)


The duplicate case has to be processed separately. Check how many records there are for both the typo and the corrected item.

In [20]:
items[items.item_name.str.contains('МИХЕЙ')]

Unnamed: 0,item_name,item_id,item_category_id
12,*МИХЕЙ И ДЖУМАНДЖИ Сука любовь,12,55
14690,МИХЕЙ И ДЖУМАНДЖИ Сука любовь,14690,55
14691,МИХЕЙ И ДЖУМАНДЖИ Сука любовь LP,14691,58


In [21]:
map_typos = {12: 14690}

In [22]:
for from_id, to_id in map_typos.items():
    print(f'Typos records with id={from_id} in train: '
          f'{sales_train.item_id.isin([from_id]).sum()}')
    print(f'Corrected records with id={to_id} in train: '
          f'{sales_train.item_id.isin([to_id]).sum()}')

Typos records with id=12 in train: 1
Corrected records with id=14690 in train: 427


In [23]:
for from_id, to_id in map_typos.items():
    print(f'Typos records with id={from_id} in test: '
          f'{test.item_id.isin([from_id]).sum()}')
    print(f'Corrected records with id={to_id} in test: '
          f'{test.item_id.isin([to_id]).sum()}')

Typos records with id=12 in test: 0
Corrected records with id=14690 in test: 42


We can see that the mapping can be applied without problems.

In [24]:
sales_train['item_id'] = sales_train['item_id'].replace(map_typos)

In [25]:
for from_id, to_id in map_typos.items():
    print(f'Typos records with id={from_id} in train: '
          f'{sales_train.item_id.isin([from_id]).sum()}')
    print(f'Corrected records with id={to_id} in train: '
          f'{sales_train.item_id.isin([to_id]).sum()}')

Typos records with id=12 in train: 0
Corrected records with id=14690 in train: 428


#### Correct `/`

In [26]:
typos_list = items[first_character == '/'].item_name.to_list()
typos_list

['//АДРЕНАЛИН: ОДИН ПРОТИВ ВСЕХ (Регион)',
 '//МОНГОЛ С.Бодров (Регион)',
 '//НЕ ОСТАВЛЯЮЩИЙ СЛЕДА (Регион)',
 '/БОМБА ДЛЯ НЕВЕСТЫ /2DVD/               D',
 '/ЗОЛОТАЯ КОЛЛЕКЦИЯ м/ф-72',
 '/ОДНАЖДЫ В КИТАЕ-2',
 '/ПОСЛЕДНИЙ ШАНС',
 '/ПРОКЛЯТЬЕ ЭЛЬ ЧАРРО',
 '/СЕВЕР И ЮГ /Ч.2/',
 '/СМЕРТЕЛЬНЫЙ РАСКЛАД',
 '/ТЫ  - ТРУП',
 '/УМНОЖАЮЩИЙ ПЕЧАЛЬ т.2 (сер.3-4)']

There were no duplicate problems.

In [27]:
map_typos = {}
for typo in typos_list:
    # lstrip removes only the leading slashes; names such as
    # '/СЕВЕР И ЮГ /Ч.2/' keep their trailing '/'
    map_typos[typo] = typo.lstrip('/')

items['item_name'] = items['item_name'].replace(map_typos)


#### Correct `D`

In [28]:
typos_list = items[items.item_name.str.endswith('    D')].item_name.to_list()
typos_list

['ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.)         D',
 'В ЛУЧАХ СЛАВЫ   (UNV)                    D',
 'ГОЛУБАЯ ВОЛНА  (Univ)                      D',
 'КОРОБКА (СТЕКЛО)                       D',
 'НОВЫЕ АМЕРИКАНСКИЕ ГРАФФИТИ  (UNI)             D',
 'УДАР ПО ВОРОТАМ (UNI)               D',
 'УДАР ПО ВОРОТАМ-2 (UNI)               D',
 'ЧАЙ С МУССОЛИНИ                     D',
 'ШУГАРЛЭНДСКИЙ ЭКСПРЕСС (UNI)             D',
 'ЗА ГРАНЬЮ СМЕРТИ                       D',
 'ЛИНИЯ СМЕРТИ                           D',
 'СПАСАЯ ЭМИЛИ                           D',
 'ЧОКНУТЫЙ ПРОФЕССОР /МАГИЯ/             D',
 'БОМБА ДЛЯ НЕВЕСТЫ /2DVD/               D']

There were no duplicate problems (these are the cases already processed above).

In [29]:
map_typos = {}
for typo in typos_list:
    # Drop the trailing 'D' and the padding before it; str.strip('    D')
    # would also eat any leading 'D' or space characters
    map_typos[typo] = typo[:-1].rstrip()

items['item_name'] = items['item_name'].replace(map_typos)


### Sales

In [30]:
sales_train.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,02.01.2013,0,59,22154,999.0,1.0
1,03.01.2013,0,25,2552,899.0,1.0
2,05.01.2013,0,25,2552,899.0,-1.0
3,06.01.2013,0,25,2554,1709.05,1.0
4,15.01.2013,0,25,2555,1099.0,1.0


In EDA we found an entry with a negative price.

In [31]:
sales_train[sales_train.item_price <= 1e-9]

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
484683,15.05.2013,4,32,2973,-1.0,1.0


Look at the other entries with this `shop_id` and `item_id`.

In [32]:
sales_train[(sales_train.shop_id == 32) & (sales_train.item_id == 2973)]

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
67427,29.01.2013,0,32,2973,2499.0,1.0
67428,25.01.2013,0,32,2973,2499.0,1.0
67429,22.01.2013,0,32,2973,2499.0,1.0
67430,21.01.2013,0,32,2973,2499.0,1.0
67431,18.01.2013,0,32,2973,2499.0,1.0
67432,17.01.2013,0,32,2973,2499.0,1.0
67433,15.01.2013,0,32,2973,2499.0,3.0
187844,05.02.2013,1,32,2973,2499.0,1.0
187845,14.02.2013,1,32,2973,2499.0,1.0
484682,23.05.2013,4,32,2973,1249.0,1.0


It obviously looks like a mistake. Let's remove this row from the dataset.

In [33]:
sales_train = sales_train[sales_train.item_price > 0]

## Creating new features

In this step I will add new features according to `1.0-db-EDA.ipynb`.

### Item categories

In [34]:
item_categories.head()

Unnamed: 0,item_category_name,item_category_id
0,PC - Гарнитуры/Наушники,0
1,Аксессуары - PS2,1
2,Аксессуары - PS3,2
3,Аксессуары - PS4,3
4,Аксессуары - PSP,4


#### Adding category and subcategory

We will split `item_category_name` by the delimiter `' - '` and treat the parts as category and subcategory. If there is no delimiter, the full name is used as both category and subcategory.

In [35]:
item_categories.item_category_name.to_list()[:10]

['PC - Гарнитуры/Наушники',
 'Аксессуары - PS2',
 'Аксессуары - PS3',
 'Аксессуары - PS4',
 'Аксессуары - PSP',
 'Аксессуары - PSVita',
 'Аксессуары - XBOX 360',
 'Аксессуары - XBOX ONE',
 'Билеты (Цифра)',
 'Доставка товара']

In [36]:
item_categories = item_categories.rename(columns={'item_category_name': 'item_full_category_name'})
item_categories['item_category_name'] = item_categories.item_full_category_name.apply(
    lambda x: x.split(' - ')[0]
)
item_categories['item_subcategory_name'] = item_categories.item_full_category_name.apply(
    lambda x: x.split(' - ')[-1]
)
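
The splitting rule can be illustrated on a few names taken from the list above. This is a minimal sketch; `split_category` and `toy_categories` are hypothetical names introduced only for the example.

```python
import pandas as pd


def split_category(full_name, delimiter=' - '):
    """Return (category, subcategory) for one full category name."""
    parts = full_name.split(delimiter)
    # Without a delimiter parts has one element, so both values coincide
    return parts[0], parts[-1]


toy_names = ['PC - Гарнитуры/Наушники', 'Аксессуары - PS2', 'Билеты (Цифра)']
toy_categories = pd.DataFrame(
    [split_category(name) for name in toy_names],
    columns=['item_category_name', 'item_subcategory_name'],
)
print(toy_categories)
```

Note that `'Билеты (Цифра)'` has no delimiter, so it ends up as both category and subcategory.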

#### Saving new dataset

In [37]:
item_categories.to_csv('../data/processed/item_categories.csv', index=False)

### Shops

In [38]:
shops.head()

Unnamed: 0,shop_name,shop_id
0,"!Якутск Орджоникидзе, 56 фран",0
1,"!Якутск ТЦ ""Центральный"" фран",1
2,"Адыгея ТЦ ""Мега""",2
3,"Балашиха ТРК ""Октябрь-Киномир""",3
4,"Волжский ТЦ ""Волга Молл""",4


#### Adding city

In [39]:
shops['city'] = shops.shop_name.apply(lambda x: x.strip('!').split()[0])

Delete city for `Выездная Торговля`.

Set city for `Интернет-магазин ЧС`, `Цифровой склад 1С-Онлайн` to `Online`.

In [40]:
# Use .loc to modify shops in place and avoid SettingWithCopyWarning
shops.loc[shops.shop_name == 'Выездная Торговля', 'city'] = None
shops.loc[shops.shop_name.isin(['Интернет-магазин ЧС', 
                                'Цифровой склад 1С-Онлайн']), 'city'] = 'Online'


Add number of residents in a city.

In [41]:
num_residents = pd.read_csv('../data/external/num_residents.csv', index_col=0)
num_residents.head()

Unnamed: 0_level_0,num_residents
city,Unnamed: 1_level_1
Якутск,322987
Адыгея,463088
Балашиха,507366
Волжский,323906
Вологда,310302


In [42]:
shops['num_residents'] = shops.city.map(num_residents.num_residents)
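
Mapping with a Series looks every city up in the Series index, and cities that are absent from it (for example `Online` or a missing city) get NaN. A small sketch with hypothetical toy objects, using two population values from the table above:

```python
import pandas as pd

# Hypothetical lookup Series: index is the city, value is the population
toy_residents = pd.Series({'Якутск': 322987, 'Адыгея': 463088},
                          name='num_residents')
toy_shops = pd.DataFrame({'city': ['Якутск', 'Адыгея', 'Online', None]})

# Cities that are not in the index of toy_residents are mapped to NaN
toy_shops['num_residents'] = toy_shops['city'].map(toy_residents)
missing = toy_shops['num_residents'].isna().sum()
print(f'Shops without population data: {missing}')
```

It is worth checking `missing` on the real table to see which shops will carry NaNs into the model.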

#### Adding coordinates of a shop

To be added.

#### Adding shop-in-test indicator

In [43]:
shops['shop_in_test'] = shops.shop_id.isin(test.shop_id)

#### Saving new dataset

In [44]:
shops.to_csv('../data/processed/shops.csv', index=False)

### Items

In [45]:
items.head()

Unnamed: 0,item_name,item_id,item_category_id
0,ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.),0,40
1,ABBYY FineReader 12 Professional Edition Full ...,1,76
2,В ЛУЧАХ СЛАВЫ (UNV),2,40
3,ГОЛУБАЯ ВОЛНА (Univ),3,40
4,КОРОБКА (СТЕКЛО),4,40


#### Adding text features

Text features will be added in `4.0-db-text-features.ipynb`.

#### Adding item-in-test indicator

In [46]:
items.head()

Unnamed: 0,item_name,item_id,item_category_id
0,ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.),0,40
1,ABBYY FineReader 12 Professional Edition Full ...,1,76
2,В ЛУЧАХ СЛАВЫ (UNV),2,40
3,ГОЛУБАЯ ВОЛНА (Univ),3,40
4,КОРОБКА (СТЕКЛО),4,40


In [47]:
items['item_in_test'] = items.item_id.isin(test.item_id)

#### Saving new dataset

In [48]:
items.to_csv('../data/processed/items.csv', index=False)

### Sales

In [49]:
sales_train.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,02.01.2013,0,59,22154,999.0,1.0
1,03.01.2013,0,25,2552,899.0,1.0
2,05.01.2013,0,25,2552,899.0,-1.0
3,06.01.2013,0,25,2554,1709.05,1.0
4,15.01.2013,0,25,2555,1099.0,1.0


#### Adding date features

We will also process part of the test dataset here.

In [50]:
sales_train.date = pd.to_datetime(sales_train.date, format='%d.%m.%Y')
test_date = pd.DataFrame({'date': pd.date_range('2015-11-1', '2015-11-30', freq='D')})

In [51]:
for df in [sales_train, test_date]:
    df['day'] = df.date.dt.day
    df['month'] = df.date.dt.month
    df['year'] = df.date.dt.year
    df['weekday'] = df.date.dt.weekday
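
As an illustration of what these columns contain, here is a minimal sketch on two dates from the sales table (`toy` is a hypothetical frame). The explicit `format` matters because the raw dates are day-first.

```python
import pandas as pd

toy = pd.DataFrame({'date': ['02.01.2013', '15.05.2013']})

# Without the explicit day-first format '02.01.2013' could be read as February 1
toy['date'] = pd.to_datetime(toy['date'], format='%d.%m.%Y')

toy['day'] = toy['date'].dt.day
toy['month'] = toy['date'].dt.month
toy['year'] = toy['date'].dt.year
toy['weekday'] = toy['date'].dt.weekday  # Monday is 0, Sunday is 6
print(toy)
```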

Add holiday features.

In [52]:
def days_since_holiday(datetimes):
    """Assign to each date the number of days since the last holiday."""
    ru_holidays = holidays.Russia()
    # Separate name for the result so that it does not shadow the function
    result = np.zeros(datetimes.shape[0])
    for i, current_datetime in tqdm(enumerate(datetimes.values), 
                                    total=datetimes.size):
        # Step back one day at a time until we hit a holiday
        last_holiday = current_datetime
        while last_holiday not in ru_holidays:
            last_holiday = last_holiday - datetime.timedelta(days=1)
        result[i] = (current_datetime - last_holiday).days
    return result
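
The sales table contains millions of rows but only about a thousand distinct dates, so the same search can be evaluated once per distinct date and then reused. The sketch below shows the idea with a plain set of holiday dates; `toy_holidays` is a hypothetical stand-in for `holidays.Russia()`, and `days_since_last_holiday` is a name introduced only for this example.

```python
import datetime


def days_since_last_holiday(day, holiday_dates):
    """Count days since the most recent holiday (0 if `day` is a holiday)."""
    last_holiday = day
    # Assumes that some holiday lies on or before `day`,
    # otherwise the loop would never stop
    while last_holiday not in holiday_dates:
        last_holiday -= datetime.timedelta(days=1)
    return (day - last_holiday).days


# Hypothetical holiday set used only for this illustration
toy_holidays = {datetime.date(2015, 11, 4)}
toy_days = [
    datetime.date(2015, 11, 4),
    datetime.date(2015, 11, 10),
    datetime.date(2015, 11, 10),
]

# Evaluate every distinct date once and reuse the result for repeated rows
lookup = {day: days_since_last_holiday(day, toy_holidays)
          for day in set(toy_days)}
result = [lookup[day] for day in toy_days]
print(result)
```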

In [53]:
russian_holidays = holidays.Russia()

for df in [sales_train, test_date]:
    df['is_holiday'] = df.date.apply(
        lambda x: x in russian_holidays
    )
    df['not_working_day'] = ((df['weekday'].isin([5, 6])) 
                             | (df['is_holiday']))
    df['days_since_holiday'] = days_since_holiday(
        df.date.dt.date
    ).astype(int)


#### Adding price features

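We will divide the price into "even" and "non-even" parts. The even part is the price rounded down to its leading digit, and the non-even part is the residual. For example, 1709.05 splits into 1000 (even) and 709.05 (non-even).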

In [54]:
sales_train['item_non_even_price'] = sales_train.item_price.apply(
    lambda x: float(str(x)[1:])
)
sales_train['item_even_price'] = (sales_train.item_price 
                                  - sales_train['item_non_even_price'])
sales_train['fraction_non_even_price'] = (
    sales_train['item_non_even_price'] / sales_train['item_price']
)
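
The same split can be written as a small function, which makes the rule easier to check on individual prices. This is a sketch under the assumption that the price is at least 1; `split_price` is a hypothetical helper, not part of the pipeline.

```python
def split_price(price):
    """Split a price >= 1 into (even_part, non_even_part)."""
    whole = str(int(price))                 # digits before the decimal point
    leading_digit = int(whole[0])
    even_part = leading_digit * 10 ** (len(whole) - 1)
    return even_part, price - even_part


for price in [999.0, 1099.0, 1709.05]:
    even_part, non_even_part = split_price(price)
    print(price, even_part, round(non_even_part, 2))
```

Prices such as 999 and 1099 both have a non-even part of 99, which is the pattern this feature is meant to capture.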

#### Saving new dataset

In [55]:
sales_train.to_csv('../data/processed/sales_train.csv', index=False)

## Aggregating

### Grouping

Aggregate all features into one table by month. The grid-creation code is copied from `Programming_assignment_week_4`.

In [56]:
# Create "grid" with columns
index_cols = ['shop_id', 'item_id', 'date_block_num']

# For every month we create a grid from all shops/items combinations from that month
grid = [] 
for block_num in sales_train['date_block_num'].unique():
    cur_shops = sales_train.loc[
        sales_train['date_block_num'] == block_num, 'shop_id'
    ].unique()
    cur_items = sales_train.loc[
        sales_train['date_block_num'] == block_num, 'item_id'
    ].unique()
    grid.append(
        np.array(
            list(product(*[cur_shops, cur_items, [block_num]])), 
            dtype='int32'
        )
    )

# Turn the grid into a dataframe
grid = pd.DataFrame(np.vstack(grid), columns=index_cols, dtype=np.int32)
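
To see what the grid contains, consider a tiny example. In the sketch below (all `toy_*` names are hypothetical) month 0 has two shops and two items, so it contributes four rows, including pairs that were never actually sold.

```python
from itertools import product

import pandas as pd

toy_sales = pd.DataFrame({
    'date_block_num': [0, 0, 1],
    'shop_id': [1, 2, 1],
    'item_id': [10, 11, 10],
})

rows = []
for block_num, month_sales in toy_sales.groupby('date_block_num'):
    shops_in_month = month_sales['shop_id'].unique()
    items_in_month = month_sales['item_id'].unique()
    # Every shop seen this month is paired with every item seen this month
    rows.extend(product(shops_in_month, items_in_month, [block_num]))

toy_grid = pd.DataFrame(rows, columns=['shop_id', 'item_id', 'date_block_num'])
print(toy_grid)
```

The pair (shop 2, item 10) appears in month 0 although it has no sales record; after the merge its target is filled with zero.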

In [57]:
# Groupby data to get shop-item-month aggregates
gb = sales_train.groupby(index_cols, as_index=False).agg(
    {'item_cnt_day': 'sum'})
gb = gb.rename(columns={'item_cnt_day': 'target'})
train = pd.merge(grid, gb, how='left', on=index_cols).fillna(0)

# Same as above but with shop-month aggregates
gb = sales_train.groupby(['shop_id', 'date_block_num'], as_index=False).agg(
    {'item_cnt_day': 'sum'}
)
gb = gb.rename(columns={'item_cnt_day': 'target_shop'})
train = pd.merge(train, gb, how='left', on=['shop_id', 'date_block_num']).fillna(0)

# Same as above but with item-month aggregates
gb = sales_train.groupby(['item_id', 'date_block_num'], as_index=False).agg(
    {'item_cnt_day': 'sum'}
)
gb = gb.rename(columns={'item_cnt_day': 'target_item'})
train = pd.merge(train, gb, how='left', on=['item_id', 'date_block_num']).fillna(0)

# Add num of days in a month, month, year
gb = sales_train.groupby('date_block_num', as_index=False).agg(
    {'date': 'nunique', 'month': 'first', 'year': 'first'}
)
gb = gb.rename(columns={'date': 'num_days'})
train = pd.merge(train, gb, how='left', on='date_block_num').fillna(0)

# Add num of holidays, num of not working days, longest sequence without holidays
gb = sales_train[['date', 'date_block_num', 'is_holiday', 'not_working_day', 
                  'days_since_holiday']].drop_duplicates().groupby(
    'date_block_num', as_index=True
).agg(
    {'is_holiday': 'sum', 'not_working_day': 'sum', 
     'days_since_holiday': 'max'}
)
gb = gb.rename(columns={
    'is_holiday': 'num_holidays', 
    'not_working_day': 'num_not_working_days', 
    'days_since_holiday': 'longest_sequence_without_holidays'
})
train = pd.merge(train, gb, how='left', on='date_block_num').fillna(0)

# Add mean/std of price in a month (we will then make it a lag feature)
gb = sales_train.groupby(['item_id', 'date_block_num'], as_index=False).agg(
    {'item_price': ['mean', 'std', 'nunique'], 
     'fraction_non_even_price': ['mean']}
)
gb.columns = [
    'item_id', 'date_block_num', 
    'price_mean', 'price_std', 'price_nunique', 'fraction_non_even_mean',
]
train = pd.merge(train, gb, how='left', on=['item_id', 'date_block_num']).fillna(0)

train = downcast_dtypes(train)
del grid, gb
gc.collect();

### Adding lag features

Before calculating lag features, let's join the train and test datasets. We add redundant `target`, `target_shop`, `target_item`, `price_xxx`, `fraction_xxx` columns to test just so that the united dataset can be processed uniformly.

In [58]:
test = test.drop(columns=['ID'])
test['date_block_num'] = train.date_block_num.max() + 1

In [59]:
test['num_days'] = test_date.shape[0]
test['month'] = test_date.month[0]
test['year'] = test_date.year[0]
test['num_holidays'] = test_date.is_holiday.sum()
test['num_not_working_days'] = test_date.not_working_day.sum()
test['longest_sequence_without_holidays'] = test_date.days_since_holiday.max()

In [60]:
for column in train.columns.difference(test.columns):
    test[column] = 0

In [61]:
all_data = pd.concat([train, test])
del train
gc.collect()

80

After creating a grid, we can calculate lag features. We will use lags from [1, 2, 3, 4, 6, 12] months ago.
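
The lag trick is to shift `date_block_num` forward by the lag and merge the shifted copy back on the index columns, so that each row receives the value from `lag` months earlier. A minimal sketch for a single lag on a hypothetical `toy` frame:

```python
import pandas as pd

index_cols = ['shop_id', 'item_id', 'date_block_num']
toy = pd.DataFrame({
    'shop_id': [1, 1, 1],
    'item_id': [10, 10, 10],
    'date_block_num': [0, 1, 2],
    'target': [5.0, 3.0, 0.0],
})

lag = 1
shifted = toy[index_cols + ['target']].copy()
# A value observed in month m becomes visible to the row of month m + lag
shifted['date_block_num'] += lag
shifted = shifted.rename(columns={'target': f'target_lag_{lag}'})

# Months without history get NaN from the left merge, filled with 0 here
toy = toy.merge(shifted, on=index_cols, how='left').fillna(0)
print(toy)
```

Month 0 has no previous month, so its lag is filled with zero. This is why the earliest months are dropped later.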

In [62]:
# List of columns that we will use to create lags
cols_to_rename = list(all_data.columns.difference(
    index_cols 
    + ['month', 'year', 'num_days', 'num_holidays', 
       'num_not_working_days', 'longest_sequence_without_holidays']
))

shift_range = [1, 2, 3, 4, 6, 12]

for month_shift in tqdm(shift_range):
    train_shift = all_data[index_cols + cols_to_rename].copy()
    
    train_shift['date_block_num'] = train_shift['date_block_num'] + month_shift
    
    foo = lambda x: '{}_lag_{}'.format(x, month_shift) if x in cols_to_rename else x
    train_shift = train_shift.rename(columns=foo)

    all_data = pd.merge(all_data, train_shift, on=index_cols, how='left').fillna(0)
    all_data = downcast_dtypes(all_data)
    
    del train_shift
    gc.collect()


Remove earliest months with no info about lags, remove target columns.

In [63]:
# Don't use old data from year 2013
all_data = all_data[all_data['date_block_num'] >= 12] 
target = all_data.target

# List of all lagged features
fit_cols = [
    col for col in all_data.columns 
    if col.split('_')[-1] in [str(item) for item in shift_range]
] 
# We will drop these at fitting stage
to_drop_cols = list(
        set(list(all_data.columns)) 
        - (set(fit_cols)|set(all_data.columns.difference(cols_to_rename)))
)
all_data = all_data.drop(to_drop_cols, axis=1)

### Merging with previous datasets

Join `items` with `item_categories`.

In [64]:
items_with_categories = pd.merge(
    items, 
    item_categories, 
    how='left', on='item_category_id'
).drop(columns=['item_category_id'])

Join train with items.

In [65]:
all_data = pd.merge(
    all_data,
    items_with_categories,
    how='left', on='item_id'
)
gc.collect();

Join train with shops.

In [66]:
all_data = pd.merge(
    all_data,
    shops,
    how='left', on='shop_id'
)
gc.collect();

#### Saving new datasets

In [67]:
all_data = downcast_dtypes(all_data)
all_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6546558 entries, 0 to 6546557
Data columns (total 60 columns):
 #   Column                             Dtype  
---  ------                             -----  
 0   shop_id                            int32  
 1   item_id                            int32  
 2   date_block_num                     int32  
 3   num_days                           int32  
 4   month                              int32  
 5   year                               int32  
 6   num_holidays                       int32  
 7   num_not_working_days               int32  
 8   longest_sequence_without_holidays  int32  
 9   fraction_non_even_mean_lag_1       float32
 10  price_mean_lag_1                   float32
 11  price_nunique_lag_1                float32
 12  price_std_lag_1                    float32
 13  target_lag_1                       float32
 14  target_item_lag_1                  float32
 15  target_shop_lag_1                  float32
 16  fraction_non_even_

In [68]:
test_size = test.shape[0]
# Copy the slice so that adding a column does not raise SettingWithCopyWarning
train = all_data.iloc[:-test_size].copy()
target = target.iloc[:-test_size]
train['target'] = target.values
test = all_data.iloc[-test_size:]


As we know from `1.0-db-EDA.ipynb`, many (item, shop) pairs in test are not present in train. While creating lags, we added these pairs to train with a lot of zeros after filling NaNs. This can unintentionally mislead our model, so let's delete these pairs from train.

In [70]:
sales_train_pairs = set(
    map(
        lambda x: f'{x[0]}_{x[1]}', 
        zip(sales_train.item_id, sales_train.shop_id)
    )
)

original_test = pd.read_csv('../data/raw/test.csv')
original_test_pairs = set(
    map(
        lambda x: f'{x[0]}_{x[1]}', 
        zip(original_test.item_id, original_test.shop_id)
    )
)

problem_pairs = original_test_pairs.difference(sales_train_pairs)

In [72]:
train_pairs = pd.Series(
    map(
        lambda x: f'{x[0]}_{x[1]}', 
        zip(train.item_id, train.shop_id)
    )
)

In [73]:
train.shape

(6332358, 61)

In [74]:
train = train[~train_pairs.isin(problem_pairs)]
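
The same filtering can be expressed with tuples instead of formatted strings. The sketch below is only an illustration on hypothetical toy frames: a pair counts as a problem pair when it occurs in test but never in the raw sales.

```python
import pandas as pd

toy_sales = pd.DataFrame({'item_id': [10, 11], 'shop_id': [1, 1]})
toy_test = pd.DataFrame({'item_id': [10, 12], 'shop_id': [1, 2]})
toy_train = pd.DataFrame({
    'item_id': [10, 11, 12],
    'shop_id': [1, 1, 2],
    'target': [3.0, 1.0, 0.0],
})

sold_pairs = set(zip(toy_sales.item_id, toy_sales.shop_id))
test_pairs = set(zip(toy_test.item_id, toy_test.shop_id))
# Pairs that occur in test but were never sold
problem_pairs = test_pairs - sold_pairs

# Keep only the train rows whose (item, shop) pair is not a problem pair
keep = [pair not in problem_pairs
        for pair in zip(toy_train.item_id, toy_train.shop_id)]
toy_train = toy_train[keep]
print(toy_train)
```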

To speed up reading the train dataset, let's save it in HDF5 format.

In [76]:
train.to_hdf('../data/processed/train.h5', key='table')
test.to_hdf('../data/processed/test.h5', key='table')