# Feature Engineering
- Build the datasets that the model trains and predicts on
- Produce one dataset for validation and one for the final test

In [29]:
# Module imports
from importlib import reload
import time
import pandas as pd
import numpy as np

# Project-specific helper functions live in the support module
import support

# Settings
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 100)

In [14]:
reload(support)
print('Support Reloaded')

Support Reloaded


In [36]:
# Import data (we will immediately merge some information together)
items              = pd.read_csv('./readonly/items.csv').set_index('item_id')
dataset            = pd.read_csv('./readonly/sales_train.csv')
test_dataset       = pd.read_csv('./readonly/test.csv').drop('ID', axis=1)
item_categories    = pd.read_csv('./readonly/item_categories.csv').set_index('item_category_id')
shops              = pd.read_csv('./readonly/shops.csv').set_index('shop_id')

Our features:
- One-hot encodings of item categories
- Target mean encodings of the categorical variables, computed on the whole training data
- For each month, category frequencies and target mean encodings (also with respect to category values), used as a sales history covering the previous 12 months. In the same way, we use the number of distinct items sold in previous months by the shop in a record, and the target value in previous months for each (shop_id, item_id) pair.
- For historical revenue and target values, the standard deviation and the number of zero/nonzero sales
- For some of the previously calculated features, polynomial transforms (square and square root)
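
The history features listed above can be illustrated with a small sketch. It is not the code in `support` (which is not shown here); the helper `add_lags`, the derived column names, and the toy values are assumptions made for illustration, and only two lags are used instead of the 12-month history.

```python
import numpy as np
import pandas as pd

# Toy monthly aggregate; column names mirror the notebook, values are made up.
monthly = pd.DataFrame({
    'date_block_num': [0, 0, 1, 1, 2, 2],
    'shop_id':        [5, 5, 5, 5, 5, 5],
    'item_id':        [10, 11, 10, 11, 10, 11],
    'target':         [2.0, 0.0, 4.0, 0.0, 0.0, 1.0],
})


def add_lags(df, lags, keys=('shop_id', 'item_id'), col='target'):
    """Hypothetical helper: attach `col` from `lag` months earlier per key pair."""
    out = df.copy()
    for lag in lags:
        past = df[['date_block_num', *keys, col]].copy()
        # A value observed in month m becomes visible to month m + lag.
        past['date_block_num'] += lag
        past = past.rename(columns={col: f'{col}_lag_{lag}'})
        out = out.merge(past, on=['date_block_num', *keys], how='left')
    return out


history = add_lags(monthly, lags=[1, 2])
lag_cols = ['target_lag_1', 'target_lag_2']

# Summary statistics over the available history (NaN means no record that far back).
history['target_hist_std'] = history[lag_cols].std(axis=1, ddof=0)
history['target_hist_zeros'] = (history[lag_cols] == 0).sum(axis=1)
history['target_hist_nonzeros'] = (history[lag_cols] > 0).sum(axis=1)

# Polynomial variants of one lag feature.
history['target_lag_1_sq'] = history['target_lag_1'] ** 2
history['target_lag_1_sqrt'] = np.sqrt(history['target_lag_1'].clip(lower=0))

print(history)
```

Because every lag column is built by shifting the month index forward, a row for month m only ever sees targets from months before m.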

P.S.\
For each month, it is essential to drop every value that leaks information about that month's target.
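
One way to avoid this kind of leakage in a target mean encoding is to compute each month's encoding only from strictly earlier months. The sketch below shows that idea on toy data; the function `past_only_mean_encoding` and the output column name are hypothetical, and the `support` module may handle leakage differently.

```python
import pandas as pd

# Toy monthly aggregate; values are made up for illustration.
monthly = pd.DataFrame({
    'date_block_num':   [0, 0, 1, 1, 2, 2],
    'item_category_id': [1, 2, 1, 2, 1, 2],
    'target':           [2.0, 0.0, 4.0, 1.0, 6.0, 3.0],
})


def past_only_mean_encoding(df, key, target='target', time='date_block_num'):
    """Mean of `target` per `key`, using only months strictly before each row's month."""
    # Per (key, month) sums and counts, ordered by month within each key.
    stats = (df.groupby([key, time])[target]
               .agg(month_sum='sum', month_count='count')
               .reset_index()
               .sort_values([key, time]))
    grouped = stats.groupby(key)
    # Cumulative totals minus the current month leave totals over earlier months only.
    stats['past_sum'] = grouped['month_sum'].cumsum() - stats['month_sum']
    stats['past_count'] = grouped['month_count'].cumsum() - stats['month_count']
    enc_col = f'{key}_target_enc'
    # The first month of each key has no history, so 0 / 0 yields NaN there.
    stats[enc_col] = stats['past_sum'] / stats['past_count']
    return df.merge(stats[[key, time, enc_col]], on=[key, time], how='left')


encoded = past_only_mean_encoding(monthly, 'item_category_id')
print(encoded)
```

The cells below apply the same principle at the dataset level: the validation features are built from `date_block_num < 33` only, so month 33 targets never reach the features used to predict month 33.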

In [4]:
# Clean the raw records
dataset = support.clear_records(dataset, test_dataset, shops, item_categories, items)

# Aggregate records by month (date_block_num)
data_grouped = support.group_records(dataset)

validation_answers = data_grouped[data_grouped['date_block_num'] == 33]['target']
validation_answers.to_csv('./results/validation_answers.csv', index_label='index')


# Hold out the last month (date_block_num == 33) as the validation block
validation_dataset = dataset[dataset['date_block_num'] < 33].copy()
validation_test = data_grouped[data_grouped['date_block_num'] == 33].drop(['revenue', 'date_block_num'], axis=1).copy()


# Build the dataset for the final test
item_mean_prices, shop_mean_prices, cat_mean_prices = support.get_prices_means(dataset)

full, train_mask = support.union(data_grouped, test_dataset)

full = full.join(items[['item_category_id']], on='item_id')
full = support.validation_preparation(full, train_mask, 10, [item_mean_prices, shop_mean_prices, cat_mean_prices], join=False)
full.to_csv('./results/dataset_for_test.csv', index=False)


# Build the dataset for validation (months before block 33 only)
data_grouped = data_grouped[data_grouped['date_block_num'] < 33]

item_mean_prices, shop_mean_prices, cat_mean_prices = support.get_prices_means(validation_dataset)

full, train_mask = support.union(data_grouped, validation_test)

full = full.join(items[['item_category_id']], on='item_id')
full = support.validation_preparation(full, train_mask, 12, [item_mean_prices, shop_mean_prices, cat_mean_prices], join=False)

full.to_csv('./results/dataset_for_validation.csv', index_label='index')

Records are clear...
Generating zero sales...
Zero sales generated...
Records collected by month...


  


Mean prices are found...
Calculating categories' features...
Categories' features added...
Calculating time-based features...


100%|██████████████████████████████████████████| 10/10 [00:06<00:00,  1.62it/s]


Time-based features added...
Calculating number of sales...
Number of sales added...
Calculating numerical features...
Numerical features added...
Mean prices are found...
Calculating categories' features...
Categories' features added...
Calculating time-based features...


100%|██████████████████████████████████████████| 12/12 [00:00<00:00, 15.43it/s]


Time-based features added...
Calculating number of sales...
Number of sales added...
Calculating numerical features...
Numerical features added...
