# Feature Engineering

Feature engineering is an answer to the question, "How can I make the most of the data I have?"


Let's get started, then. How does one do feature engineering?

I'll assume you're familiar with pandas and with the decision tree pipeline we're using for this project. The decision tree is the algorithm we're going to engineer the data for. Not every algorithm wants its data engineered the same way, though a good feature will often help many algorithms.

In [1]:
import pandas as pd

In [3]:
# load the data output by src/merger.py
original_data = pd.read_csv('./merger/bigTable.csv')

In [4]:
print(original_data.columns)
print(original_data.shape)
original_data.head()

Index(['id', 'date', 'store_nbr', 'item_nbr', 'unit_sales', 'onpromotion',
       'city', 'state', 'cluster', 'family', 'class', 'perishable',
       'transactions', 'year', 'month', 'day', 'dayofweek',
       'days_til_end_of_data', 'cpi', 'dayoff', 'percent_in_transactions',
       'item_store_sales_variance'],
      dtype='object')
(5877318, 22)


| | id | date | store_nbr | item_nbr | unit_sales | onpromotion | city | state | cluster | family | ... | transactions | year | month | day | dayofweek | days_til_end_of_data | cpi | dayoff | percent_in_transactions | item_store_sales_variance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 88211471 | 2016-08-16 | 44 | 103520 | 7.0 | True | Quito | Pichincha | 5 | GROCERY I | ... | 3941 | 2016 | 8 | 16 | 1 | 364 | 105.123322 | False | 0.001776 | 39.466659 |
| 1 | 88306356 | 2016-08-17 | 44 | 103520 | 2.0 | False | Quito | Pichincha | 5 | GROCERY I | ... | 4256 | 2016 | 8 | 17 | 2 | 363 | 105.123322 | False | 0.00047 | 39.466659 |
| 2 | 88399003 | 2016-08-18 | 44 | 103520 | 4.0 | False | Quito | Pichincha | 5 | GROCERY I | ... | 3776 | 2016 | 8 | 18 | 3 | 362 | 105.123322 | False | 0.001059 | 39.466659 |
| 3 | 88492368 | 2016-08-19 | 44 | 103520 | 6.0 | False | Quito | Pichincha | 5 | GROCERY I | ... | 4185 | 2016 | 8 | 19 | 4 | 361 | 105.123322 | False | 0.001434 | 39.466659 |
| 4 | 88591626 | 2016-08-20 | 44 | 103520 | 13.0 | True | Quito | Pichincha | 5 | GROCERY I | ... | 4830 | 2016 | 8 | 20 | 5 | 360 | 105.123322 | True | 0.002692 | 39.466659 |


In [14]:
import sys
# Add src/ to the module search path
sys.path.append('src')
from src import splitter

In [15]:
# Run splitter on the original data; decision_tree follows in the next cells
splitter.main()

Loading data from merger output
Splitting data 70:30 train:validation
Writing to ./splitter/train.csv
Writing to ./splitter/validation.csv
Finished splitting


In [16]:
from src import decision_tree

In [17]:
decision_tree.main()

Loading data from splitter/train.csv
Loading data from splitter/validation.csv
Encoding categorical variables
Joining tables for consistent encoding
Creating decision tree model
Making prediction on validation data
Calculating estimated error


  log_square_errors = (np.log(predictions + 1) - np.log(targets + 1)) ** 2
  log_square_errors = (np.log(predictions + 1) - np.log(targets + 1)) ** 2


Writing to ./decision_tree/model.pkl
Writing to ./decision_tree/score_and_metadata.csv
Done deciding with trees
Decision tree analysis done with a validation score (error rate) of 0.00268005495579566.


In [18]:
original_validation_score = 0.00268005495579566

So now we have a baseline: how well our decision tree performs before we add any features. The score is an error rate, so lower is better.

Let's see what happens if we add a `two_weeks_before_christmas` and a `two_weeks_after_christmas` column, as per our Exploratory Analysis discussion.

In [23]:
# Re-read the data and use datetime objects for the date
engineered_data = pd.read_csv('./merger/bigTable.csv')
engineered_data.date = pd.to_datetime(engineered_data.date)


In [24]:
# Flag the 14 days up to and including Christmas: 2016-12-12 through 2016-12-25
start_date = pd.to_datetime('2016-12-11')
end_date = pd.to_datetime('2016-12-25')
before_christmas = (engineered_data['date'] > start_date) & (engineered_data['date'] <= end_date)

In [25]:
# Flag the 14 days after Christmas: 2016-12-26 through 2017-01-08
start_date = pd.to_datetime('2016-12-25')
end_date = pd.to_datetime('2017-01-08')
after_christmas = (engineered_data['date'] > start_date) & (engineered_data['date'] <= end_date)

In [37]:
engineered_data['two_weeks_before_christmas'] = before_christmas
engineered_data['two_weeks_after_christmas'] = after_christmas
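
The cells above hard-code the 2016 dates and the 14-day width. As a rough sketch of how those two choices could be turned into parameters, something like the following would flag the same windows for every year in the data. The function name `add_christmas_window_flags`, its `days` argument, and the generated column names are hypothetical; none of them exist in the project's `src` code.

```python
import pandas as pd


def add_christmas_window_flags(df, days=14, date_col="date"):
    """Return a copy of df with boolean before/after-Christmas columns.

    Hypothetical helper. 'before' covers the `days` days ending on
    December 25 (inclusive); 'after' covers the `days` days that follow.
    """
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    years = dates.dt.year

    # Christmas of the row's own calendar year and of the previous year
    christmas = pd.to_datetime(years.astype(str) + "-12-25")
    prev_christmas = pd.to_datetime((years - 1).astype(str) + "-12-25")

    # Signed distance in days: negative before Christmas, positive after
    offset = (dates - christmas).dt.days
    offset_prev = (dates - prev_christmas).dt.days

    before = (offset > -days) & (offset <= 0)
    # Early-January rows belong to the previous year's Christmas
    after = ((offset > 0) & (offset <= days)) | (
        (offset_prev > 0) & (offset_prev <= days)
    )

    out[f"before_christmas_{days}d"] = before
    out[f"after_christmas_{days}d"] = after
    return out


# Small stand-in frame covering the window boundaries
demo_dates = pd.DataFrame({"date": pd.to_datetime([
    "2016-12-11", "2016-12-12", "2016-12-25",
    "2016-12-26", "2017-01-08", "2017-01-09",
])})
demo_flags = add_christmas_window_flags(demo_dates, days=14)
print(demo_flags)
```

With `days=14` this reproduces the boundaries used above; calling it with, say, `days=7` is one way to try a narrower window without editing date strings by hand.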

#### Just as a spot check, let's look at the dates of the first few records flagged by our new columns

In [38]:
print(engineered_data[engineered_data.two_weeks_before_christmas].date.head())
print(engineered_data[engineered_data.two_weeks_after_christmas].date.head())

117   2016-12-12
118   2016-12-13
119   2016-12-14
120   2016-12-15
121   2016-12-16
Name: date, dtype: datetime64[ns]
130   2016-12-26
131   2016-12-27
132   2016-12-28
133   2016-12-29
134   2016-12-30
Name: date, dtype: datetime64[ns]


Seems okay to me. Let's see how it changes the results now.
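
`head()` only shows the first few flagged rows. A slightly stronger check is to look at the earliest and latest flagged dates and the number of flagged rows. The sketch below runs on a small stand-in frame rather than the real `engineered_data`, and the helper name `summarize_flag` is hypothetical.

```python
import pandas as pd


def summarize_flag(df, flag_col, date_col="date"):
    """Return (first date, last date, row count) for rows where flag_col is True."""
    flagged = df.loc[df[flag_col], date_col]
    return flagged.min(), flagged.max(), int(flagged.shape[0])


# Stand-in data: one row per day around Christmas 2016
demo = pd.DataFrame({"date": pd.date_range("2016-12-01", "2017-01-31", freq="D")})
demo["two_weeks_before_christmas"] = (
    (demo["date"] > pd.Timestamp("2016-12-11"))
    & (demo["date"] <= pd.Timestamp("2016-12-25"))
)
demo["two_weeks_after_christmas"] = (
    (demo["date"] > pd.Timestamp("2016-12-25"))
    & (demo["date"] <= pd.Timestamp("2017-01-08"))
)

for col in ["two_weeks_before_christmas", "two_weeks_after_christmas"]:
    first, last, n_rows = summarize_flag(demo, col)
    print(col, first.date(), last.date(), n_rows)
```

On the real data the row counts will be much larger than the number of days, since there is one row per store, item, and date, but the first and last dates should still match the intended window.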

In [41]:
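# Overwrite the merger output so the pipeline sees the new columns.
# Note that this replaces the original bigTable.csv.
engineered_data.to_csv('./merger/bigTable.csv', index=False)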


In [42]:
splitter.main()


Loading data from merger output
Splitting data 70:30 train:validation
Writing to ./splitter/train.csv
Writing to ./splitter/validation.csv
Finished splitting


In [43]:
decision_tree.main()

Loading data from splitter/train.csv
Loading data from splitter/validation.csv
Encoding categorical variables
Joining tables for consistent encoding
Creating decision tree model
Making prediction on validation data
Calculating estimated error


  log_square_errors = (np.log(predictions + 1) - np.log(targets + 1)) ** 2
  log_square_errors = (np.log(predictions + 1) - np.log(targets + 1)) ** 2


Writing to ./decision_tree/model.pkl
Writing to ./decision_tree/score_and_metadata.csv
Done deciding with trees
Decision tree analysis done with a validation score (error rate) of 0.003692818003915606.


In [44]:
engineered_validation_score = 0.003692818003915606

In [45]:
print(original_validation_score - engineered_validation_score)

-0.0010127630481199463


So as it turns out, adding booleans for before/after Christmas hurt our performance: the validation error rose from about 0.0027 to about 0.0037, and lower is better.
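
To put that difference in proportion, it can help to look at the relative change as well as the absolute one. This is just arithmetic on the two scores recorded above:

```python
original_validation_score = 0.00268005495579566
engineered_validation_score = 0.003692818003915606

# Positive values mean the error went up after adding the features
absolute_change = engineered_validation_score - original_validation_score
relative_change = absolute_change / original_validation_score

print(f"absolute change: {absolute_change:.6f}")
print(f"relative change: {relative_change:.1%}")
```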

- Now we should iterate on the features.
  - For example, maybe two weeks is too wide a window.
- Or maybe it's time to question whether the scoring metric provided to us by the Kaggle competition is the right one.
  - Should we replace NWRMSLE with another error measurement?
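
For reference when weighing that second option, here is a minimal sketch of NWRMSLE as I understand the competition's definition: a weighted root mean squared error on `log(x + 1)` values, with perishable items weighted 1.25 and everything else weighted 1.0. The function name and the clipping of negative values to zero are my assumptions, and `src/decision_tree.py` may well compute its score differently.

```python
import numpy as np


def nwrmsle(predictions, targets, perishable):
    """Normalized weighted root mean squared logarithmic error (sketch)."""
    # Assumption: clip negatives to zero so the logarithm stays defined
    predictions = np.clip(np.asarray(predictions, dtype=float), 0, None)
    targets = np.clip(np.asarray(targets, dtype=float), 0, None)

    # Perishable items count 1.25, all other items count 1.0
    weights = np.where(np.asarray(perishable, dtype=bool), 1.25, 1.0)

    # log1p(x) is log(x + 1)
    log_square_errors = (np.log1p(predictions) - np.log1p(targets)) ** 2
    return float(np.sqrt(np.sum(weights * log_square_errors) / np.sum(weights)))


# Tiny usage example with made-up numbers
score = nwrmsle(
    predictions=[3.0, 0.0, 10.0],
    targets=[4.0, 1.0, 8.0],
    perishable=[True, False, False],
)
print(score)
```

Because the error is taken on logarithms, it penalizes relative misses rather than absolute ones, which is worth keeping in mind before swapping it for something like plain RMSE.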