# Capstone Two: Feature Engineering

In this step of my capstone, I use my insights from the EDA step, as well as tools like featuretools, to add features to the training data.

The general plan for this notebook will be: 

1. Import the data.
2. Add features by hand using EDA insights. 
3. Save the new DataFrame. 

# 1. Import the data.

In [2]:
import numpy as np
import pandas as pd
import featuretools as ft

import warnings
warnings.filterwarnings('ignore')

import os
from library.sb_utils import save_file

pd.options.display.float_format = '{:.2f}'.format

In [3]:
df = pd.read_csv("./data/training_data_cleaned.csv")

In [4]:
df['datetime'] = pd.to_datetime(df.date, format='%Y-%m-%d', errors='coerce')
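Because `errors='coerce'` silently turns unparseable values into `NaT`, it can be worth counting them right after the conversion. The following is a minimal sketch on a toy Series (the values are made up for illustration and are not from the training data):

```python
import pandas as pd

# Toy stand-in for the raw `date` column; the last value is deliberately malformed.
raw = pd.Series(["2013-01-02", "2013-01-03", "02.01.2013"])

# Same call as above: values that do not match the format are coerced to NaT.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Count the rows that failed to parse so they can be inspected before moving on.
n_failed = int(parsed.isna().sum())
print(f"{n_failed} of {len(raw)} dates could not be parsed")
```

On the real data the equivalent check would be `df['datetime'].isna().sum()`; a non-zero result would mean some rows have no usable date.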

In [5]:
df.dtypes

date                          object
date_block_num                 int64
shop_id                        int64
item_id                        int64
item_price                   float64
item_cnt_day                 float64
item_name                     object
item_category_id               int64
item_category_name            object
shop_name                     object
datetime              datetime64[ns]
dtype: object

In [6]:
df.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_name,item_category_id,item_category_name,shop_name,datetime
0,2013-01-02,0,59,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray,"Ярославль ТЦ ""Альтаир""",2013-01-02
1,2013-01-03,0,25,2552,899.0,1.0,DEEP PURPLE The House Of Blue Light LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",2013-01-03
2,2013-01-05,0,25,2552,899.0,-1.0,DEEP PURPLE The House Of Blue Light LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",2013-01-05
3,2013-01-06,0,25,2554,1709.05,1.0,DEEP PURPLE Who Do You Think We Are LP,58,Музыка - Винил,"Москва ТРК ""Атриум""",2013-01-06
4,2013-01-15,0,25,2555,1099.0,1.0,DEEP PURPLE 30 Very Best Of 2CD (Фирм.),56,Музыка - CD фирменного производства,"Москва ТРК ""Атриум""",2013-01-15


We do not need item name, category name, or shop name since we have IDs for those. 
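Dropping the name columns is only lossless if each ID maps to exactly one name. A quick way to confirm that before dropping might look like the sketch below; the helper `id_determines_name` and the toy values are hypothetical and only illustrate the idea:

```python
import pandas as pd

# Toy frame with the same column names as the cleaned training data.
toy = pd.DataFrame({
    "item_id": [2552, 2552, 2554],
    "item_name": ["Album A LP", "Album A LP", "Album B LP"],
    "shop_id": [25, 25, 59],
    "shop_name": ["Shop X", "Shop X", "Shop Y"],
})

def id_determines_name(frame, id_col, name_col):
    """Return True if every ID maps to at most one distinct name."""
    return bool((frame.groupby(id_col)[name_col].nunique() <= 1).all())

for id_col, name_col in [("item_id", "item_name"), ("shop_id", "shop_name")]:
    print(id_col, "->", name_col, id_determines_name(toy, id_col, name_col))
```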

In [7]:
df.drop(['item_name', 'item_category_name', 'shop_name'], axis=1, inplace=True)
df.dtypes

date                        object
date_block_num               int64
shop_id                      int64
item_id                      int64
item_price                 float64
item_cnt_day               float64
item_category_id             int64
datetime            datetime64[ns]
dtype: object

# 2. Add features by hand using EDA insights.

Here are the major insights from the EDA step. 
1. The most popular items are Grand Theft Auto V PC, Grand Theft Auto V PS3, The Witcher 3: Wild Hunt PC. The most popular category is games.
2. The stores that sell the most are either online or in Moscow. 
3. The general trend is falling.
4. There is a distinct seasonality to the data. 
5. Weekends see more sales, by an average of about 211,116 per day.
6. Holidays, specifically New Year and Defender of the Fatherland Day, play a large role.


This means we should implement the following features: 
1. Month (categorical)
2. Day of the week (categorical)
3. Is it a holiday? (discrete yes/no)


I am leaving out the store location because that is beyond the scope of this notebook. I am leaving out release dates for games since they are hard to find and simply create outliers, so I do not think they are important enough for this application. The general trend is captured by `date_block_num`.

### Day of week and month

In [8]:
df['month'] = df.datetime.dt.month_name()

In [9]:
df['day_of_week'] = df.datetime.dt.day_name()

In [10]:
# The day of the month may also be informative, since sales may follow a trend within each month; I will add that.
df['day_of_month'] = df.datetime.dt.day
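As a sanity check on the `.dt` accessors used above, the same three features can be computed on a handful of dates whose weekdays are easy to verify. This is a small sketch; the dates are taken from the first rows of the training data, and `features` is just an illustrative name:

```python
import pandas as pd

# Three dates from the first rows of the training data.
dates = pd.to_datetime(pd.Series(["2013-01-02", "2013-01-05", "2013-01-15"]))

features = pd.DataFrame({
    "month": dates.dt.month_name(),      # full month name, e.g. "January"
    "day_of_week": dates.dt.day_name(),  # full weekday name, e.g. "Wednesday"
    "day_of_month": dates.dt.day,        # integer day within the month
})
print(features)
```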

In [11]:
df.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_category_id,datetime,month,day_of_week,day_of_month
0,2013-01-02,0,59,22154,999.0,1.0,37,2013-01-02,January,Wednesday,2
1,2013-01-03,0,25,2552,899.0,1.0,58,2013-01-03,January,Thursday,3
2,2013-01-05,0,25,2552,899.0,-1.0,58,2013-01-05,January,Saturday,5
3,2013-01-06,0,25,2554,1709.05,1.0,58,2013-01-06,January,Sunday,6
4,2013-01-15,0,25,2555,1099.0,1.0,56,2013-01-15,January,Tuesday,15


In [12]:
df.dtypes

date                        object
date_block_num               int64
shop_id                      int64
item_id                      int64
item_price                 float64
item_cnt_day               float64
item_category_id             int64
datetime            datetime64[ns]
month                       object
day_of_week                 object
day_of_month                 int64
dtype: object

Now I want to one-hot encode the days and months so that our algorithms can use them properly. One-hot encoding avoids ranking the categories: if we simply said January = 1, February = 2, etc., we would be imposing an order on the months, which could mislead the ML step.
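The difference between the two encodings can be seen on a toy Series. This sketch is illustrative only; `month_order` is a made-up mapping, and `dtype=int` is passed so the indicators print as 0/1 regardless of pandas version:

```python
import pandas as pd

months = pd.Series(["January", "February", "March"])

# Integer encoding: the numbers carry an order and a distance between months.
month_order = {"January": 1, "February": 2, "March": 3}
as_integers = months.map(month_order)

# One-hot encoding: one indicator column per month, with no implied order.
as_one_hot = pd.get_dummies(months, dtype=int)

print(as_integers.tolist())
print(as_one_hot)
```

Each row of the one-hot frame has exactly one 1, and the columns come out in alphabetical order rather than calendar order.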

In [13]:
days_to_one_hot = df['day_of_week']
days = pd.get_dummies(days_to_one_hot)
days

Unnamed: 0,Friday,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday
0,0,0,0,0,0,0,1
1,0,0,0,0,1,0,0
2,0,0,1,0,0,0,0
3,0,0,0,1,0,0,0
4,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...
2935838,0,0,1,0,0,0,0
2935839,1,0,0,0,0,0,0
2935840,0,0,0,0,0,0,1
2935841,0,0,0,0,1,0,0


In [14]:
months_to_one_hot = df['month']
months = pd.get_dummies(months_to_one_hot)
months

Unnamed: 0,April,August,December,February,January,July,June,March,May,November,October,September
0,0,0,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,1,0,0,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
2935838,0,0,0,0,0,0,0,0,0,0,1,0
2935839,0,0,0,0,0,0,0,0,0,0,1,0
2935840,0,0,0,0,0,0,0,0,0,0,1,0
2935841,0,0,0,0,0,0,0,0,0,0,1,0


Let's combine this into our original dataframe. 

In [15]:
df = pd.concat([df, days], axis=1)
df.drop('day_of_week', axis=1, inplace=True)

In [16]:
df = pd.concat([df, months], axis=1)
df.drop('month', axis=1, inplace=True)
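For comparison, `pd.get_dummies` can also encode, concatenate, and drop in a single call through its `columns` argument. The sketch below runs on a toy frame; the `dow` and `month` prefixes are an illustrative choice, not something this notebook uses:

```python
import pandas as pd

toy = pd.DataFrame({
    "item_cnt_day": [1.0, -1.0, 1.0],
    "day_of_week": ["Wednesday", "Saturday", "Sunday"],
    "month": ["January", "January", "February"],
})

# `columns=` encodes the listed columns, appends the indicators,
# and drops the original categorical columns in one step.
encoded = pd.get_dummies(
    toy,
    columns=["day_of_week", "month"],
    prefix=["dow", "month"],
    dtype=int,
)
print(encoded.columns.tolist())
```

The prefixed names (for example `dow_Saturday`) differ from the bare day and month names produced above, so any downstream code would need to expect whichever naming is chosen.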

In [17]:
df.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_category_id,datetime,day_of_month,Friday,...,December,February,January,July,June,March,May,November,October,September
0,2013-01-02,0,59,22154,999.0,1.0,37,2013-01-02,2,0,...,0,0,1,0,0,0,0,0,0,0
1,2013-01-03,0,25,2552,899.0,1.0,58,2013-01-03,3,0,...,0,0,1,0,0,0,0,0,0,0
2,2013-01-05,0,25,2552,899.0,-1.0,58,2013-01-05,5,0,...,0,0,1,0,0,0,0,0,0,0
3,2013-01-06,0,25,2554,1709.05,1.0,58,2013-01-06,6,0,...,0,0,1,0,0,0,0,0,0,0
4,2013-01-15,0,25,2555,1099.0,1.0,56,2013-01-15,15,0,...,0,0,1,0,0,0,0,0,0,0


### Holidays

The data runs through October 2015, so the list below covers Russian public holidays from 2013 through early November 2015.

In [18]:
# List of all Russian public holidays 2013-2015
holidays = ['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08',
               '2013-01-09', '2013-01-10', '2013-02-21', '2013-02-22',
               '2013-02-23', '2013-03-06', '2013-03-07', '2013-03-08',
               '2013-05-01', '2013-05-02', '2013-05-03', '2013-05-08',
               '2013-05-09', '2013-05-10', '2013-06-12', '2013-11-04',
               '2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
               '2014-01-05', '2014-01-06', '2014-01-07', '2014-01-08',
               '2014-01-09', '2014-01-10', '2014-02-21', '2014-02-22',
               '2014-02-23', '2014-03-06', '2014-03-07', '2014-03-08',
               '2014-05-01', '2014-05-02', '2014-05-03', '2014-05-08',
               '2014-05-09', '2014-05-10', '2014-06-12', '2014-11-04',
               '2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
               '2015-01-05', '2015-01-06', '2015-01-07', '2015-01-08',
               '2015-01-09', '2015-02-23', '2015-03-08', '2015-03-09',
               '2015-05-01', '2015-05-04', '2015-05-09', '2015-05-11',
               '2015-06-12', '2015-11-04']

In [19]:
df['holiday']=0
df.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_category_id,datetime,day_of_month,Friday,...,February,January,July,June,March,May,November,October,September,holiday
0,2013-01-02,0,59,22154,999.0,1.0,37,2013-01-02,2,0,...,0,1,0,0,0,0,0,0,0,0
1,2013-01-03,0,25,2552,899.0,1.0,58,2013-01-03,3,0,...,0,1,0,0,0,0,0,0,0,0
2,2013-01-05,0,25,2552,899.0,-1.0,58,2013-01-05,5,0,...,0,1,0,0,0,0,0,0,0,0
3,2013-01-06,0,25,2554,1709.05,1.0,58,2013-01-06,6,0,...,0,1,0,0,0,0,0,0,0,0
4,2013-01-15,0,25,2555,1099.0,1.0,56,2013-01-15,15,0,...,0,1,0,0,0,0,0,0,0,0


In [20]:
# Default every row to 0, then flag rows whose date string appears in the holiday list.
df['holiday'] = 0
df.loc[df['date'].isin(holidays), 'holiday'] = 1
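The match above compares strings, so it relies on the `date` column being formatted exactly like the entries in `holidays`. A variant that compares timestamps instead could look like this sketch, which uses a toy frame and a short illustrative subset of the holiday list:

```python
import pandas as pd

# Short illustrative subset of the holiday list above.
holiday_strings = ["2013-01-02", "2013-01-03", "2013-02-23"]
holiday_dates = pd.to_datetime(holiday_strings)

toy = pd.DataFrame(
    {"datetime": pd.to_datetime(["2013-01-02", "2013-01-15", "2013-02-23"])}
)

# Compare timestamps rather than strings; normalize() drops any time-of-day part.
toy["holiday"] = toy["datetime"].dt.normalize().isin(holiday_dates).astype(int)
print(toy)
```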

In [21]:
df.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_category_id,datetime,day_of_month,Friday,...,February,January,July,June,March,May,November,October,September,holiday
0,2013-01-02,0,59,22154,999.0,1.0,37,2013-01-02,2,0,...,0,1,0,0,0,0,0,0,0,1
1,2013-01-03,0,25,2552,899.0,1.0,58,2013-01-03,3,0,...,0,1,0,0,0,0,0,0,0,1
2,2013-01-05,0,25,2552,899.0,-1.0,58,2013-01-05,5,0,...,0,1,0,0,0,0,0,0,0,1
3,2013-01-06,0,25,2554,1709.05,1.0,58,2013-01-06,6,0,...,0,1,0,0,0,0,0,0,0,1
4,2013-01-15,0,25,2555,1099.0,1.0,56,2013-01-15,15,0,...,0,1,0,0,0,0,0,0,0,0


# 3. Save the new DataFrame.

In [22]:
# Save the data
datapath = './data'
save_file(df, 'training_data_feature_engineered.csv', datapath)

Writing file.  "./data\training_data_feature_engineered.csv"
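`save_file` comes from the project's own `library.sb_utils` module, whose implementation is not shown here. For readers without that helper, a plain-pandas round trip might look like the following sketch; the temporary directory and the toy frame are assumptions made so the example has no side effects:

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"date": ["2013-01-02"], "item_cnt_day": [1.0], "holiday": [1]})

# A temporary directory stands in for ./data so nothing is written to the project.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "training_data_feature_engineered.csv")
    # index=False avoids writing the row index as an extra unnamed column.
    toy.to_csv(out_path, index=False)
    # CSV does not preserve dtypes, so dates are parsed again on load.
    reloaded = pd.read_csv(out_path, parse_dates=["date"])

print(reloaded.dtypes)
```

Since CSV stores everything as text, the `datetime` column written by this notebook would likewise need `parse_dates` (or another `pd.to_datetime` call) when the file is read back.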
