# Capstone Two: Feature Engineering

In this step of my capstone, I am going to use my insights from the EDA step, as well as tools like featuretools, to engineer new features for the model.

The general plan for this notebook will be: 

1. Import the data.
2. Add features using EDA and other insights. 
3. Save the new DataFrames. 

# 1. Import the data.

In [1]:
import numpy as np
import pandas as pd
import featuretools as ft

import warnings
warnings.filterwarnings('ignore')

import os
from library.sb_utils import save_file

pd.options.display.float_format = '{:.2f}'.format

In [2]:
training = pd.read_csv("./data/training_data_cleaned.csv")
testing = pd.read_csv('./data/test.csv')
categories = pd.read_csv("./data/item_categories.csv")

In [3]:
training['datetime'] = pd.to_datetime(training.date, format='%Y-%m-%d', errors='coerce')

In [4]:
training.dtypes

date                          object
date_block_num                 int64
shop_id                        int64
item_id                        int64
item_price                   float64
item_cnt_day                 float64
item_name                     object
item_category_id               int64
item_category_name            object
shop_name                     object
datetime              datetime64[ns]
dtype: object

In [5]:
training.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_name,item_category_id,item_category_name,shop_name,datetime
0,2013-01-02,0,59,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray,"Ярославль ТЦ ""Альтаир""",2013-01-02
1,2013-01-03,0,25,2552,899.0,1.0,DEEP PURPLE The House Of Blue Light LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",2013-01-03
2,2013-01-05,0,25,2552,899.0,-1.0,DEEP PURPLE The House Of Blue Light LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",2013-01-05
3,2013-01-06,0,25,2554,1709.05,1.0,DEEP PURPLE Who Do You Think We Are LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",2013-01-06
4,2013-01-15,0,25,2555,1099.0,1.0,DEEP PURPLE 30 Very Best Of 2CD (Фирм.),56,Музыка - CD фирменного производства,"Москва ТРК ""Атриум""",2013-01-15


It's fine that date and datetime are non-numeric since we will drop those later. 
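
Because `errors='coerce'` silently turns any date string that does not match the format into `NaT`, it may be worth checking how many rows failed to parse. A minimal sketch on a made-up three-row sample (the `sample` frame and its malformed last value are assumptions for illustration, not part of the dataset):

```python
import pandas as pd

# Hypothetical sample; the last value deliberately uses a different date format.
sample = pd.DataFrame({"date": ["2013-01-02", "2013-01-03", "02.01.2013"]})
sample["datetime"] = pd.to_datetime(sample["date"], format="%Y-%m-%d", errors="coerce")

# Rows whose date string did not match the expected format become NaT.
n_unparsed = int(sample["datetime"].isna().sum())
print(f"unparsed dates: {n_unparsed}")
```

Running the same `isna().sum()` check on `training['datetime']` would show whether any rows were lost to coercion.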

# 2. Add features by hand using EDA insights.

Here are the major insights from the EDA step. 
1. The most popular items are Grand Theft Auto V PC, Grand Theft Auto V PS3, and The Witcher 3: Wild Hunt PC. The most popular category is games.
2. The stores that sell the most are either online or in Moscow.
3. The general trend is falling.
4. There is a distinct seasonality to the data.
5. Weekends see more sales, by an average of about 211,116 per day.
6. Holidays, specifically New Year and Defender of the Fatherland Day, play a large role.


This means we should implement the following features: 
1. Month (categorical)
2. Day of the week (categorical)
3. Is it a holiday? (discrete yes/no)

I am also going to add other helpful features: broader category groups, revenue, each item's first sale day, and the average item price.

I am leaving out the store location because that is beyond the scope of this notebook. I am leaving out release dates for games since they are hard to find and mostly just create outliers, so I do not think they are important enough for this application. The general trend is captured by `date_block_num`.
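
The day-of-the-week feature from the list above could be derived from the `datetime` column along these lines. This is only a sketch on a tiny made-up frame; the column names `day_of_week` and `is_weekend` are my own choices:

```python
import pandas as pd

# Hypothetical frame with three dates from the first week of January 2013.
demo = pd.DataFrame({"datetime": pd.to_datetime(["2013-01-02", "2013-01-05", "2013-01-06"])})

# Day name as a categorical-style feature.
demo["day_of_week"] = demo["datetime"].dt.day_name()

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend.
demo["is_weekend"] = (demo["datetime"].dt.dayofweek >= 5).astype(int)
print(demo)
```

Since the final target is monthly, a daily flag like this would need to be aggregated (for example, the number of weekend days in the month) before it is useful to the model.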

In [6]:
# First, add the simple date info to testing dataset
testing['date_block_num'] = 34

### Add groups (broader than categories)

In [7]:
categories.head()

Unnamed: 0,item_category_name,item_category_id
0,PC - Гарнитуры/Наушники,0
1,Аксессуары - PS2,1
2,Аксессуары - PS3,2
3,Аксессуары - PS4,3
4,Аксессуары - PSP,4


In [8]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

#create broader category groupings, where group is general and item_category is more specific
categories['group_name'] = categories['item_category_name'].str.extract(r'(^[\w\s]*)')
categories['group_name'] = categories['group_name'].str.strip()
#label encode group names
categories['group_id']  = le.fit_transform(categories.group_name.values)
categories.sample(5)

Unnamed: 0,item_category_name,item_category_id,group_name,group_id
55,Музыка - CD локального производства,55,Музыка,12
57,Музыка - MP3,57,Музыка,12
81,Чистые носители (шпиль),81,Чистые носители,16
44,Книги - Аудиокниги (Цифра),44,Книги,11
21,Игры - PSP,21,Игры,5
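
To make the regular expression above easier to follow: `(^[\w\s]*)` captures the leading run of word characters and whitespace, so it stops at the first `-` or `(`. The sketch below applies it to four category names taken from the outputs above and, as an alternative to `LabelEncoder`, uses `pd.factorize(..., sort=True)` to produce integer ids (this swap is my own suggestion, not what the notebook uses):

```python
import pandas as pd

names = pd.Series([
    "PC - Гарнитуры/Наушники",
    "Аксессуары - PS2",
    "Чистые носители (шпиль)",
    "Служебные",
])

# Capture everything before the first character that is not a word character or whitespace.
group_name = names.str.extract(r"(^[\w\s]*)", expand=False).str.strip()

# sort=True assigns ids in sorted order of the unique group names.
group_id, uniques = pd.factorize(group_name, sort=True)
print(group_name.tolist())
print(list(group_id))
```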


In [9]:
categories.set_index('item_category_id')

item_category_id,item_category_name,group_name,group_id
0,PC - Гарнитуры/Наушники,PC,0
1,Аксессуары - PS2,Аксессуары,1
2,Аксессуары - PS3,Аксессуары,1
3,Аксессуары - PS4,Аксессуары,1
4,Аксессуары - PSP,Аксессуары,1
...,...,...,...
79,Служебные,Служебные,15
80,Служебные - Билеты,Служебные,15
81,Чистые носители (шпиль),Чистые носители,16
82,Чистые носители (штучные),Чистые носители,16


In [10]:
training = training.merge(categories.set_index('item_category_id').drop('item_category_name', axis=1), on='item_category_id', how='left')

In [11]:
training.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_name,item_category_id,item_category_name,shop_name,datetime,group_name,group_id
0,2013-01-02,0,59,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray,"Ярославль ТЦ ""Альтаир""",2013-01-02,Кино,10
1,2013-01-03,0,25,2552,899.0,1.0,DEEP PURPLE The House Of Blue Light LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",2013-01-03,Музыка,12
2,2013-01-05,0,25,2552,899.0,-1.0,DEEP PURPLE The House Of Blue Light LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",2013-01-05,Музыка,12
3,2013-01-06,0,25,2554,1709.05,1.0,DEEP PURPLE Who Do You Think We Are LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",2013-01-06,Музыка,12
4,2013-01-15,0,25,2555,1099.0,1.0,DEEP PURPLE 30 Very Best Of 2CD (Фирм.),56,Музыка - CD фирменного производства,"Москва ТРК ""Атриум""",2013-01-15,Музыка,12


In [12]:
# add item categories
testing = pd.merge(testing, training[['item_id', 'item_category_id']], on = 'item_id', how='left') 
testing=testing.drop_duplicates(subset=['ID'])
testing.reset_index(inplace=True, drop=True)
testing.head()

Unnamed: 0,ID,shop_id,item_id,date_block_num,item_category_id
0,0,5,5037,34,19.0
1,1,5,5320,34,
2,2,5,5233,34,19.0
3,3,5,5232,34,23.0
4,4,5,5268,34,


In [13]:
# How many item categories are NaN? These are test items that never appear in the training data.
# About 7 percent are NaN, and they are filled with 0 below. Note that 0 is also a real
# category id (PC - Гарнитуры/Наушники), so these items get labelled with that category later.
print("items in category 0:",testing[testing.item_category_id==0].sum())
print("percent NaN: ", testing.item_category_id.isna().sum() / testing.ID.count())
testing['item_category_id'] = testing['item_category_id'].fillna(0)
print("percent NaN after filling with 0: ", testing.item_category_id.isna().sum() / testing.ID.count())

items in category 0: ID                 0.00
shop_id            0.00
item_id            0.00
date_block_num     0.00
item_category_id   0.00
dtype: float64
percent NaN:  0.0711764705882353
percent NaN after filling with 0:  0.0
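
Since category id 0 is a real category, one alternative is to fill unseen items with a sentinel value instead. The sketch below uses miniature stand-ins for `training` and `testing`; the sentinel `-1` and the name `UNKNOWN_CATEGORY` are assumptions of mine. It also de-duplicates the item-to-category mapping before merging, so the merge cannot multiply test rows:

```python
import pandas as pd

# Hypothetical miniature versions of the notebook's frames.
training = pd.DataFrame({"item_id": [5037, 5037, 5233], "item_category_id": [19, 19, 19]})
testing = pd.DataFrame({"ID": [0, 1, 2], "item_id": [5037, 5320, 5233]})

# One row per item, so the left merge keeps exactly one row per test ID.
item_to_category = training[["item_id", "item_category_id"]].drop_duplicates("item_id")
testing = testing.merge(item_to_category, on="item_id", how="left")

UNKNOWN_CATEGORY = -1  # assumption: -1 never appears as a real category id
testing["item_category_id"] = (
    testing["item_category_id"].fillna(UNKNOWN_CATEGORY).astype("int64")
)
print(testing)
```

With this approach, a later merge against `categories` would leave the group columns empty for the sentinel rows, so those would need their own fill value as well.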


In [14]:
testing['item_category_id']=testing['item_category_id'].astype('int64')

In [15]:
testing=testing.merge(categories.set_index('item_category_id'), on='item_category_id', how='left')

In [16]:
testing.head()

Unnamed: 0,ID,shop_id,item_id,date_block_num,item_category_id,item_category_name,group_name,group_id
0,0,5,5037,34,19,Игры - PS3,Игры,5
1,1,5,5320,34,0,PC - Гарнитуры/Наушники,PC,0
2,2,5,5233,34,19,Игры - PS3,Игры,5
3,3,5,5232,34,23,Игры - XBOX 360,Игры,5
4,4,5,5268,34,0,PC - Гарнитуры/Наушники,PC,0


### Add total item revenue

In [17]:
training['revenue'] = training['item_cnt_day']*training['item_price']

In [18]:
training.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_name,item_category_id,item_category_name,shop_name,datetime,group_name,group_id,revenue
0,2013-01-02,0,59,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray,"Ярославль ТЦ ""Альтаир""",2013-01-02,Кино,10,999.0
1,2013-01-03,0,25,2552,899.0,1.0,DEEP PURPLE The House Of Blue Light LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",2013-01-03,Музыка,12,899.0
2,2013-01-05,0,25,2552,899.0,-1.0,DEEP PURPLE The House Of Blue Light LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",2013-01-05,Музыка,12,-899.0
3,2013-01-06,0,25,2554,1709.05,1.0,DEEP PURPLE Who Do You Think We Are LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",2013-01-06,Музыка,12,1709.05
4,2013-01-15,0,25,2555,1099.0,1.0,DEEP PURPLE 30 Very Best Of 2CD (Фирм.),56,Музыка - CD фирменного производства,"Москва ТРК ""Атриум""",2013-01-15,Музыка,12,1099.0


In [19]:
agg = training.groupby(['date_block_num', 'shop_id', 'item_id']).agg({'revenue':'sum'})

In [20]:
training = training.drop('revenue', axis=1).merge(agg, on=["shop_id", "item_id","date_block_num"], how='left')

In [21]:
training.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_name,item_category_id,item_category_name,shop_name,datetime,group_name,group_id,revenue
0,2013-01-02,0,59,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray,"Ярославль ТЦ ""Альтаир""",2013-01-02,Кино,10,999.0
1,2013-01-03,0,25,2552,899.0,1.0,DEEP PURPLE The House Of Blue Light LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",2013-01-03,Музыка,12,0.0
2,2013-01-05,0,25,2552,899.0,-1.0,DEEP PURPLE The House Of Blue Light LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",2013-01-05,Музыка,12,0.0
3,2013-01-06,0,25,2554,1709.05,1.0,DEEP PURPLE Who Do You Think We Are LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",2013-01-06,Музыка,12,1709.05
4,2013-01-15,0,25,2555,1099.0,1.0,DEEP PURPLE 30 Very Best Of 2CD (Фирм.),56,Музыка - CD фирменного производства,"Москва ТРК ""Атриум""",2013-01-15,Музыка,12,1099.0
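
The aggregate-then-merge pattern above replaces each row's revenue with the monthly total for its shop and item. The same result can probably be reached in one step with `groupby(...).transform('sum')`. A sketch on three rows modelled on the output above (the `sales` frame is a stand-in, not the real data):

```python
import pandas as pd

sales = pd.DataFrame({
    "date_block_num": [0, 0, 0],
    "shop_id": [25, 25, 25],
    "item_id": [2552, 2552, 2554],
    "item_price": [899.0, 899.0, 1709.05],
    "item_cnt_day": [1.0, -1.0, 1.0],
})
keys = ["date_block_num", "shop_id", "item_id"]

# Daily revenue per row, then the monthly total broadcast back to every row of the group.
sales["revenue"] = sales["item_cnt_day"] * sales["item_price"]
sales["revenue"] = sales.groupby(keys)["revenue"].transform("sum")
print(sales)
```

The sale and the return of item 2552 cancel out, which is why its monthly revenue is 0.0 in the table above.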


### TODO   
How many days since last sale of an item

In [22]:
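# Day number of each sale, counted from the start of 2013.
# 2013-2015 contain no leap year, so 365 days per year is exact.
training['first_sale_day'] = training.datetime.dt.dayofyear 
training['first_sale_day'] += 365 * (training.datetime.dt.year-2013)
# Earliest sale day of each item across all shops.
training['first_sale_day'] = training.groupby('item_id')['first_sale_day'].transform('min').astype('int16')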

In [23]:
# Note: this aggregate is not used later in the notebook.
agg = training.groupby(['date_block_num', 'shop_id', 'item_id']).agg({'first_sale_day':'first'}).reset_index()

In [24]:
# first_sale_day is already constant within each item, so this leaves the values unchanged.
training['first_sale_day'] = training.groupby('item_id')['first_sale_day'].transform('max').astype('int16')

In [25]:
training.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_name,item_category_id,item_category_name,shop_name,datetime,group_name,group_id,revenue,first_sale_day
0,2013-01-02,0,59,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray,"Ярославль ТЦ ""Альтаир""",2013-01-02,Кино,10,999.0,2
1,2013-01-03,0,25,2552,899.0,1.0,DEEP PURPLE The House Of Blue Light LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",2013-01-03,Музыка,12,0.0,3
2,2013-01-05,0,25,2552,899.0,-1.0,DEEP PURPLE The House Of Blue Light LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",2013-01-05,Музыка,12,0.0,3
3,2013-01-06,0,25,2554,1709.05,1.0,DEEP PURPLE Who Do You Think We Are LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",2013-01-06,Музыка,12,1709.05,6
4,2013-01-15,0,25,2555,1099.0,1.0,DEEP PURPLE 30 Very Best Of 2CD (Фирм.),56,Музыка - CD фирменного производства,"Москва ТРК ""Атриум""",2013-01-15,Музыка,12,1099.0,6
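
The cells above compute the first sale day per item. The "days since last sale" feature named in the TODO is still open; one possible sketch, on a made-up `sales` frame, sorts by date and takes the per-item difference between consecutive sale dates (the column name `days_since_last_sale` is hypothetical):

```python
import pandas as pd

sales = pd.DataFrame({
    "item_id": [2552, 2552, 2554, 2552],
    "datetime": pd.to_datetime(["2013-01-03", "2013-01-05", "2013-01-06", "2013-01-15"]),
})

# Sort so that each item's sales are in chronological order.
sales = sales.sort_values(["item_id", "datetime"]).reset_index(drop=True)

# Gap in days to the previous sale of the same item; NaN for an item's first sale.
sales["days_since_last_sale"] = sales.groupby("item_id")["datetime"].diff().dt.days
print(sales)
```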


### TODO
Add previous sales data (how many times it has sold)
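
One way this TODO might be approached is a running total of units sold in earlier months. The sketch assumes a monthly frame with an `item_cnt_month` column like the one built further down; the `monthly` data and the name `prev_sales_total` are made up for illustration:

```python
import pandas as pd

monthly = pd.DataFrame({
    "item_id": [1, 1, 1, 2],
    "date_block_num": [15, 18, 19, 18],
    "item_cnt_month": [2.0, 1.0, 1.0, 5.0],
})
monthly = monthly.sort_values(["item_id", "date_block_num"]).reset_index(drop=True)

# Cumulative units per item, minus the current month, so the feature
# only uses information from strictly earlier months.
running_total = monthly.groupby("item_id")["item_cnt_month"].cumsum()
monthly["prev_sales_total"] = running_total - monthly["item_cnt_month"]
print(monthly)
```

Excluding the current month matters here because `item_cnt_month` is the prediction target.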

### TODO
Mean first month sales per item
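
A possible reading of this TODO is: for each item, average the monthly sales across shops in the first month the item was sold. Under that assumption, a sketch on a made-up `monthly` frame (the column name `mean_first_month_sales` is hypothetical):

```python
import pandas as pd

monthly = pd.DataFrame({
    "item_id": [1, 1, 1, 2],
    "shop_id": [55, 56, 55, 55],
    "date_block_num": [15, 15, 18, 18],
    "item_cnt_month": [2.0, 4.0, 1.0, 5.0],
})

# First month in which each item appears, broadcast to every row.
first_month = monthly.groupby("item_id")["date_block_num"].transform("min")
first_rows = monthly[monthly["date_block_num"] == first_month]

# Mean of the first-month sales over all shops that sold the item that month.
mean_first_month = (
    first_rows.groupby("item_id")["item_cnt_month"]
    .mean()
    .rename("mean_first_month_sales")
    .reset_index()
)
monthly = monthly.merge(mean_first_month, on="item_id", how="left")
print(monthly)
```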

### Month

In [31]:
training['month'] = training.datetime.dt.month_name()

Now I want to one-hot encode the months so that they can be used well in our algorithms. This avoids ranking them: if we simply said January = 1, February = 2, etc., we would be imposing an order on the months that a model could read as meaningful (December "greater than" January), which is not what we want for the ML step.
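
As a small illustration of one-hot encoding, the sketch below encodes two dates. It declares all twelve month names as categories up front, which is my own addition: that way every month gets a column even if it does not occur in the data, and the columns come out in calendar order instead of alphabetical order.

```python
import pandas as pd

# All twelve English month names, in calendar order.
month_names = pd.date_range("2013-01-01", periods=12, freq="MS").month_name().tolist()

demo = pd.DataFrame({"datetime": pd.to_datetime(["2013-01-02", "2015-10-31"])})
month = pd.Series(
    pd.Categorical(demo["datetime"].dt.month_name(), categories=month_names)
)

# One column per declared category, including months with no rows.
one_hot = pd.get_dummies(month, dtype=int)
print(one_hot)
```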

In [32]:
months_to_one_hot = training['month']
months = pd.get_dummies(months_to_one_hot)
months

Unnamed: 0,April,August,December,February,January,July,June,March,May,November,October,September
0,0,0,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,1,0,0,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
2935838,0,0,0,0,0,0,0,0,0,0,1,0
2935839,0,0,0,0,0,0,0,0,0,0,1,0
2935840,0,0,0,0,0,0,0,0,0,0,1,0
2935841,0,0,0,0,0,0,0,0,0,0,1,0


Let's combine this into our original dataframe. 

In [33]:
training = pd.concat([training, months], axis=1)
training.drop('month', axis=1, inplace=True)

In [34]:
# Finally, add the year as columns with dummy variables. 
training['year'] = training.datetime.dt.year

In [35]:
years_to_one_hot = training['year']
years = pd.get_dummies(years_to_one_hot)
years

Unnamed: 0,2013,2014,2015
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0
...,...,...,...
2935838,0,0,1
2935839,0,0,1
2935840,0,0,1
2935841,0,0,1


In [36]:
training = pd.concat([training, years], axis=1)
training.drop('year', axis=1, inplace=True)

In [37]:
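# Every test row belongs to November 2015, so the November and 2015 dummies
# (and num_holidays) are set to 1 and all other month and year dummies to 0.
testing[['November', 2015, 'num_holidays', 'April', 'August', 'December', 'February',
       'January', 'July', 'June', 'March', 'May', 'October',
       'September', 2013, 2014]] = [1, 1, 1, 0,0,0,0,0,0,0,0,0,0,0,0,0]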
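
Typing the dummy columns by hand works for a single target month. A more general sketch, using small made-up series rather than the real frames, builds the test dummies normally and then aligns them to the training columns with `reindex`:

```python
import pandas as pd

# Hypothetical month values for a training and a test set.
train_dummies = pd.get_dummies(pd.Series(["January", "October", "November"]), dtype=int)
test_dummies = pd.get_dummies(pd.Series(["November", "November"]), dtype=int)

# Give the test frame exactly the training columns; months absent from test become 0.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned)
```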

In [38]:
training.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_name,item_category_id,item_category_name,shop_name,...,July,June,March,May,November,October,September,2013,2014,2015
0,2013-01-02,0,59,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray,"Ярославль ТЦ ""Альтаир""",...,0,0,0,0,0,0,0,1,0,0
1,2013-01-03,0,25,2552,899.0,1.0,DEEP PURPLE The House Of Blue Light LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",...,0,0,0,0,0,0,0,1,0,0
2,2013-01-05,0,25,2552,899.0,-1.0,DEEP PURPLE The House Of Blue Light LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",...,0,0,0,0,0,0,0,1,0,0
3,2013-01-06,0,25,2554,1709.05,1.0,DEEP PURPLE Who Do You Think We Are LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",...,0,0,0,0,0,0,0,1,0,0
4,2013-01-15,0,25,2555,1099.0,1.0,DEEP PURPLE 30 Very Best Of 2CD (Фирм.),56,Музыка - CD фирменного производства,"Москва ТРК ""Атриум""",...,0,0,0,0,0,0,0,1,0,0


In [39]:
testing.head()

Unnamed: 0,ID,shop_id,item_id,date_block_num,item_category_id,item_category_name,group_name,group_id,November,2015,...,February,January,July,June,March,May,October,September,2013,2014
0,0,5,5037,34,19,Игры - PS3,Игры,5,1,1,...,0,0,0,0,0,0,0,0,0,0
1,1,5,5320,34,0,PC - Гарнитуры/Наушники,PC,0,1,1,...,0,0,0,0,0,0,0,0,0,0
2,2,5,5233,34,19,Игры - PS3,Игры,5,1,1,...,0,0,0,0,0,0,0,0,0,0
3,3,5,5232,34,23,Игры - XBOX 360,Игры,5,1,1,...,0,0,0,0,0,0,0,0,0,0
4,4,5,5268,34,0,PC - Гарнитуры/Наушники,PC,0,1,1,...,0,0,0,0,0,0,0,0,0,0


### Holidays

The training data runs through October 2015 and the month we predict is November 2015, so we need Russian holidays through November 2015. 

This will be helpful because we can use the number of holidays per month as a feature later. 

In [40]:
# List of all Russian public holidays 2013-2015
holidays = ['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08',
               '2013-01-09', '2013-01-10', '2013-02-21', '2013-02-22',
               '2013-02-23', '2013-03-06', '2013-03-07', '2013-03-08',
               '2013-05-01', '2013-05-02', '2013-05-03', '2013-05-08',
               '2013-05-09', '2013-05-10', '2013-06-12', '2013-11-04',
               '2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
               '2014-01-05', '2014-01-06', '2014-01-07', '2014-01-08',
               '2014-01-09', '2014-01-10', '2014-02-21', '2014-02-22',
               '2014-02-23', '2014-03-06', '2014-03-07', '2014-03-08',
               '2014-05-01', '2014-05-02', '2014-05-03', '2014-05-08',
               '2014-05-09', '2014-05-10', '2014-06-12', '2014-11-04',
               '2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
               '2015-01-05', '2015-01-06', '2015-01-07', '2015-01-08',
               '2015-01-09', '2015-02-23', '2015-03-08', '2015-03-09',
               '2015-05-01', '2015-05-04', '2015-05-09', '2015-05-11',
               '2015-06-12', '2015-11-04']

In [41]:
training['holiday']=0
training.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_name,item_category_id,item_category_name,shop_name,...,June,March,May,November,October,September,2013,2014,2015,holiday
0,2013-01-02,0,59,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray,"Ярославль ТЦ ""Альтаир""",...,0,0,0,0,0,0,1,0,0,0
1,2013-01-03,0,25,2552,899.0,1.0,DEEP PURPLE The House Of Blue Light LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",...,0,0,0,0,0,0,1,0,0,0
2,2013-01-05,0,25,2552,899.0,-1.0,DEEP PURPLE The House Of Blue Light LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",...,0,0,0,0,0,0,1,0,0,0
3,2013-01-06,0,25,2554,1709.05,1.0,DEEP PURPLE Who Do You Think We Are LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",...,0,0,0,0,0,0,1,0,0,0
4,2013-01-15,0,25,2555,1099.0,1.0,DEEP PURPLE 30 Very Best Of 2CD (Фирм.),56,Музыка - CD фирменного производства,"Москва ТРК ""Атриум""",...,0,0,0,0,0,0,1,0,0,0


In [42]:
training['holiday']=0
training.loc[training['date'].isin(holidays), 'holiday'] = 1

In [43]:
training.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_name,item_category_id,item_category_name,shop_name,...,June,March,May,November,October,September,2013,2014,2015,holiday
0,2013-01-02,0,59,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray,"Ярославль ТЦ ""Альтаир""",...,0,0,0,0,0,0,1,0,0,1
1,2013-01-03,0,25,2552,899.0,1.0,DEEP PURPLE The House Of Blue Light LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",...,0,0,0,0,0,0,1,0,0,1
2,2013-01-05,0,25,2552,899.0,-1.0,DEEP PURPLE The House Of Blue Light LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",...,0,0,0,0,0,0,1,0,0,1
3,2013-01-06,0,25,2554,1709.05,1.0,DEEP PURPLE Who Do You Think We Are LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",...,0,0,0,0,0,0,1,0,0,1
4,2013-01-15,0,25,2555,1099.0,1.0,DEEP PURPLE 30 Very Best Of 2CD (Фирм.),56,Музыка - CD фирменного производства,"Москва ТРК ""Атриум""",...,0,0,0,0,0,0,1,0,0,0


In [44]:
# I want to group by monthly sales, since our final prediction will be for the entire month (Nov 2015). 
grouped = training.groupby(['item_id','shop_id','item_category_id','date_block_num'])
agg = grouped.agg({'item_cnt_day':'sum', 'item_price':'mean','holiday':'sum'}).reset_index()
agg = agg.rename(columns = {'item_cnt_day' : 'item_cnt_month', 'item_price':'item_month_avg_price','holiday':'num_holidays'})
agg.head()

Unnamed: 0,item_id,shop_id,item_category_id,date_block_num,item_cnt_month,item_month_avg_price,num_holidays
0,0,54,40,20,1.0,58.0,0
1,1,55,76,15,2.0,4490.0,0
2,1,55,76,18,1.0,4490.0,0
3,1,55,76,19,1.0,4490.0,0
4,1,55,76,20,1.0,4490.0,0
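
Note that `num_holidays` as computed above sums the holiday flag over sale rows, so it counts how many of an item's sale records in that shop and month fell on a holiday, not how many holidays the month contains. If a calendar count per month is wanted instead, a sketch could look like the following (it uses a short excerpt of the holiday list, and the `demo` frame is made up):

```python
import pandas as pd

# Short excerpt of the holiday list, for illustration only.
holidays = pd.to_datetime(["2013-01-01", "2013-01-02", "2013-02-23", "2015-11-04"])

# date_block_num counts months from January 2013 (block 0), so November 2015 is block 34.
block = ((holidays.year - 2013) * 12 + (holidays.month - 1)).astype("int64")
holidays_per_block = pd.Series(block).value_counts().sort_index()

# Map the calendar count onto rows by month; months without holidays get 0.
demo = pd.DataFrame({"date_block_num": [0, 1, 2, 34]})
demo["num_holidays"] = demo["date_block_num"].map(holidays_per_block).fillna(0).astype(int)
print(demo)
```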


In [45]:
training = training.merge(agg, on=["shop_id", "item_id","date_block_num",'item_category_id'], how='left')

In [46]:
training.drop(['holiday'], axis=1, inplace=True) # replaced by num_holidays

In [47]:
testing['num_holidays']=1 # only one, on 2015-11-04

In [48]:
print(testing.count())

ID                    214200
shop_id               214200
item_id               214200
date_block_num        214200
item_category_id      214200
item_category_name    214200
group_name            214200
group_id              214200
November              214200
2015                  214200
num_holidays          214200
April                 214200
August                214200
December              214200
February              214200
January               214200
July                  214200
June                  214200
March                 214200
May                   214200
October               214200
September             214200
2013                  214200
2014                  214200
dtype: int64


In [49]:
training.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_name,item_category_id,item_category_name,shop_name,...,May,November,October,September,2013,2014,2015,item_cnt_month,item_month_avg_price,num_holidays
0,2013-01-02,0,59,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray,"Ярославль ТЦ ""Альтаир""",...,0,0,0,0,1,0,0,1.0,999.0,1
1,2013-01-03,0,25,2552,899.0,1.0,DEEP PURPLE The House Of Blue Light LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",...,0,0,0,0,1,0,0,0.0,899.0,2
2,2013-01-05,0,25,2552,899.0,-1.0,DEEP PURPLE The House Of Blue Light LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",...,0,0,0,0,1,0,0,0.0,899.0,2
3,2013-01-06,0,25,2554,1709.05,1.0,DEEP PURPLE Who Do You Think We Are LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",...,0,0,0,0,1,0,0,1.0,1709.05,1
4,2013-01-15,0,25,2555,1099.0,1.0,DEEP PURPLE 30 Very Best Of 2CD (Фирм.),56,Музыка - CD фирменного производства,"Москва ТРК ""Атриум""",...,0,0,0,0,1,0,0,1.0,1099.0,0


In [50]:
testing.head()

Unnamed: 0,ID,shop_id,item_id,date_block_num,item_category_id,item_category_name,group_name,group_id,November,2015,...,February,January,July,June,March,May,October,September,2013,2014
0,0,5,5037,34,19,Игры - PS3,Игры,5,1,1,...,0,0,0,0,0,0,0,0,0,0
1,1,5,5320,34,0,PC - Гарнитуры/Наушники,PC,0,1,1,...,0,0,0,0,0,0,0,0,0,0
2,2,5,5233,34,19,Игры - PS3,Игры,5,1,1,...,0,0,0,0,0,0,0,0,0,0
3,3,5,5232,34,23,Игры - XBOX 360,Игры,5,1,1,...,0,0,0,0,0,0,0,0,0,0
4,4,5,5268,34,0,PC - Гарнитуры/Наушники,PC,0,1,1,...,0,0,0,0,0,0,0,0,0,0


In [51]:
# date and datetime are encoded enough in the date_block_num, so we will drop those.
# item_cnt_day is encoded in item monthly sales, so we will drop that too.
training.drop(['item_cnt_day','date','datetime'], axis=1, inplace=True) 

### Add item price

In [52]:
# fill item price with average price for that item
grouped = training.groupby(by = ['item_id'])
result = grouped.agg({'item_price':'mean'})
result = result.reset_index()
testing = testing.merge(result, on=['item_id'], how='left')
testing=testing.rename(columns={'item_price':'item_month_avg_price'})
testing.head()

Unnamed: 0,ID,shop_id,item_id,date_block_num,item_category_id,item_category_name,group_name,group_id,November,2015,...,January,July,June,March,May,October,September,2013,2014,item_month_avg_price
0,0,5,5037,34,19,Игры - PS3,Игры,5,1,1,...,0,0,0,0,0,0,0,0,0,1926.83
1,1,5,5320,34,0,PC - Гарнитуры/Наушники,PC,0,1,1,...,0,0,0,0,0,0,0,0,0,
2,2,5,5233,34,19,Игры - PS3,Игры,5,1,1,...,0,0,0,0,0,0,0,0,0,800.78
3,3,5,5232,34,23,Игры - XBOX 360,Игры,5,1,1,...,0,0,0,0,0,0,0,0,0,790.51
4,4,5,5268,34,0,PC - Гарнитуры/Наушники,PC,0,1,1,...,0,0,0,0,0,0,0,0,0,


In [53]:
#testing.drop('item_price_y', axis=1, inplace=True)
#testing.rename(columns={'item_price_x':'item_price'}, inplace=True)
#testing.head()

In [54]:
testing=testing.drop_duplicates(subset=['ID'])

In [55]:
print(testing.item_month_avg_price.isna().sum())

15246


In [56]:
print("percent NaN: ", testing.item_month_avg_price.isna().sum() / testing.ID.count())
# Assign the result back instead of using inplace=True on a column accessed as an attribute.
testing['item_month_avg_price'] = testing['item_month_avg_price'].fillna(testing['item_month_avg_price'].mean())
print("percent NaN after filling with mean: ", testing.item_month_avg_price.isna().sum() / testing.ID.count())

percent NaN:  0.0711764705882353
percent NaN after filling with mean:  0.0
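
Filling every missing price with the global mean gives a headset and a console game the same price. A slightly finer fallback, sketched here on made-up rows, tries the mean price of the item's category first and only then falls back to the global mean. This is an alternative I am suggesting, not what the notebook does:

```python
import pandas as pd

testing = pd.DataFrame({
    "item_id": [5037, 5320, 5233, 5232, 5268],
    "item_category_id": [19, 19, 19, 23, 40],
    "item_month_avg_price": [1900.0, None, 800.0, 400.0, None],
})

# Mean price within each category, broadcast to every row (NaN if the category has no prices).
category_mean = testing.groupby("item_category_id")["item_month_avg_price"].transform("mean")
global_mean = testing["item_month_avg_price"].mean()

# Fall back from category mean to global mean.
testing["item_month_avg_price"] = (
    testing["item_month_avg_price"].fillna(category_mean).fillna(global_mean)
)
print(testing)
```

This only helps for items whose category is actually known, so it would not apply to rows whose category was itself filled in.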


# 3. Save the new dataframe

In [57]:
testing.head()

Unnamed: 0,ID,shop_id,item_id,date_block_num,item_category_id,item_category_name,group_name,group_id,November,2015,...,January,July,June,March,May,October,September,2013,2014,item_month_avg_price
0,0,5,5037,34,19,Игры - PS3,Игры,5,1,1,...,0,0,0,0,0,0,0,0,0,1926.83
1,1,5,5320,34,0,PC - Гарнитуры/Наушники,PC,0,1,1,...,0,0,0,0,0,0,0,0,0,1026.5
2,2,5,5233,34,19,Игры - PS3,Игры,5,1,1,...,0,0,0,0,0,0,0,0,0,800.78
3,3,5,5232,34,23,Игры - XBOX 360,Игры,5,1,1,...,0,0,0,0,0,0,0,0,0,790.51
4,4,5,5268,34,0,PC - Гарнитуры/Наушники,PC,0,1,1,...,0,0,0,0,0,0,0,0,0,1026.5


In [60]:
testing.columns

Index([                  'ID',              'shop_id',              'item_id',
             'date_block_num',     'item_category_id',   'item_category_name',
                 'group_name',             'group_id',             'November',
                         2015,         'num_holidays',                'April',
                     'August',             'December',             'February',
                    'January',                 'July',                 'June',
                      'March',                  'May',              'October',
                  'September',                   2013,                   2014,
       'item_month_avg_price'],
      dtype='object')

In [59]:
# Save the data
datapath = './data'
save_file(training, 'training_data_feature_engineered.csv', datapath)
save_file(testing, 'testing_data_feature_engineered.csv', datapath)

A file already exists with this name.

Do you want to overwrite? (Y/N)N

Please re-run this cell with a new filename.
A file already exists with this name.

Do you want to overwrite? (Y/N)N

Please re-run this cell with a new filename.
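
The output above shows that neither file was written, because the overwrite prompt was answered with N. For runs where no interactive prompt is wanted, a plain pandas helper could be used instead. The function below, `save_csv`, is a hypothetical stand-in of my own and not part of `library.sb_utils`:

```python
import os
import tempfile

import pandas as pd


def save_csv(df, filename, datapath, overwrite=False):
    """Hypothetical helper: write df to datapath/filename.

    Returns True if the file was written and False if it already
    existed and overwrite was not requested.
    """
    os.makedirs(datapath, exist_ok=True)
    target = os.path.join(datapath, filename)
    if os.path.exists(target) and not overwrite:
        return False
    df.to_csv(target, index=False)
    return True


# Demonstration in a temporary directory so nothing in ./data is touched.
with tempfile.TemporaryDirectory() as tmp:
    frame = pd.DataFrame({"ID": [0, 1], "shop_id": [5, 5]})
    first = save_csv(frame, "demo.csv", tmp)
    second = save_csv(frame, "demo.csv", tmp)
    third = save_csv(frame, "demo.csv", tmp, overwrite=True)
    print(first, second, third)
```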
