# Feature Engineering
After the intial exploration and prototype model, we have a better understanding of what new variables could be helpful and what transformations we do on the features. The output of this file should be a pipeline object that contains all the preprocessing steps

### Plan of Attack 
- After running feature importance on the XGBoost model, we found the date block variable ranked 2nd in importance. This tells me that time based insights will be helpful. I'm thinking I can look at monthly average prices for all categories (per shop, per item, per item category, and overall). I'm not sure how each will perform so I'll just throw them all into the model and see how it performs. 
- Avg price per item category, shop, item across all months
- Rank items on scale of cheap-normal-expensive, then for individual stores, show the % of cheap-normal-expensive items the store holds.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from category_encoders import TargetEncoder
from sklearn.compose import ColumnTransformer

import os
os.chdir("/Users/Tosan.Johnson/Personal Projects/Kaggle Projects/Predict Future Sales") # changing working directory
from IPython.core.interactiveshell import InteractiveShell # allows multiple outputs per cell
InteractiveShell.ast_node_interactivity = "all"

In [82]:
from sklearn.compose import ColumnTransformer

In [89]:
# Read in data
df_icats = pd.read_csv('data/item_categories.csv')
df_items = pd.read_csv('data/items.csv')
df_sales = pd.read_csv('data/sales_train.csv')
df_shops = pd.read_csv('data/shops.csv')

In [90]:
# merge over the item category column
""" Note - This step would be part of cleaning in my opinion, getting all the necessary data insto one table"""
df_sales = df_sales.merge(df_items[['item_id','item_category_id']], on='item_id', how='left')
# X = df_sales.drop(columns='item_price')
X = df_sales.copy()
y = df_sales['item_price']

' Note - This step would be part of cleaning in my opinion, getting all the necessary data insto one table'

In [42]:
df_icats.head()
df_items.head()
df_sales.head()
df_shops.head()

Unnamed: 0,item_category_name,item_category_id
0,PC - Гарнитуры/Наушники,0
1,Аксессуары - PS2,1
2,Аксессуары - PS3,2
3,Аксессуары - PS4,3
4,Аксессуары - PSP,4


Unnamed: 0,item_name,item_id,item_category_id
0,! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D,0,40
1,!ABBYY FineReader 12 Professional Edition Full...,1,76
2,***В ЛУЧАХ СЛАВЫ (UNV) D,2,40
3,***ГОЛУБАЯ ВОЛНА (Univ) D,3,40
4,***КОРОБКА (СТЕКЛО) D,4,40


Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_category_id,item_avg_price_per_mth,item_cat_avg_price_per_mth,shop_avg_price_per_mth,item_avg_price,item_cat_avg_price,shop_avg_price,item_price_category
0,02.01.2013,0,59,22154,999.0,1.0,37,999.0,465.036007,820.174553,702.932203,434.289667,884.981227,Normal
1,03.01.2013,0,25,2552,899.0,1.0,58,899.0,1401.858108,672.071345,937.888889,1703.176031,835.863571,Normal
2,05.01.2013,0,25,2552,899.0,-1.0,58,899.0,1401.858108,672.071345,937.888889,1703.176031,835.863571,Normal
3,06.01.2013,0,25,2554,1709.05,1.0,58,1709.05,1401.858108,672.071345,1709.05,1703.176031,835.863571,Expensive
4,15.01.2013,0,25,2555,1099.0,1.0,56,1098.85,867.446992,672.071345,1123.101786,1125.963251,835.863571,Expensive


Unnamed: 0,shop_name,shop_id
0,"!Якутск Орджоникидзе, 56 фран",0
1,"!Якутск ТЦ ""Центральный"" фран",1
2,"Адыгея ТЦ ""Мега""",2
3,"Балашиха ТРК ""Октябрь-Киномир""",3
4,"Волжский ТЦ ""Волга Молл""",4


## Preprocessing Without Pipeline (Don't run)
I'm doing this step so I can play around with how I want to create the variables. I'm sure I'll get to a point where I can think through this within the pipeline, but since I'm learning this for the first time now, it's easier to get the logic down before having to implement it within the structure of the pipeline.

In [None]:
# dropping the date variable (not needed for training)
df_sales.drop(columns='date',inplace=True)

In [None]:
# Use Target encoding on the ID variables (avoiding One hot encoding since there are a ton of groups)

X = df_sales.drop(columns='item_price')
y = df_sales['item_price']
target_encoder = TargetEncoder(cols=['item_category_id', 'item_id','shop_id'])
X = target_encoder.fit_transform(X,y)

In [None]:
# Scale the continuous variables and to normalize data
scaler = StandardScaler()
X = scaler.fit_transform(X)
X = pd.DataFrame(X) # turning it back into a DF and attaching the correspoding column names to it
X.columns = df_sales.drop(columns='item_price').columns

In [12]:
# Average price per month - item, item_category, shop_id
df_sales['item_avg_price_per_mth'] = df_sales.groupby(['date_block_num', 'item_id'])['item_price'].transform('mean')
df_sales['item_cat_avg_price_per_mth'] = df_sales.groupby(['date_block_num', 'item_category_id'])['item_price'].transform('mean')
df_sales['shop_avg_price_per_mth'] = df_sales.groupby(['date_block_num', 'shop_id'])['item_price'].transform('mean')


In [13]:
# Average prices across all months - item, item_category, shop_id
df_sales['item_avg_price'] = df_sales.groupby(['item_id'])['item_price'].transform('mean')
df_sales['item_cat_avg_price'] = df_sales.groupby(['item_category_id'])['item_price'].transform('mean')
df_sales['shop_avg_price'] = df_sales.groupby(['shop_id'])['item_price'].transform('mean')

In [39]:
# Create cheap-normal-expensive categories for each store. Then show ratio of cheap-normal-expensive for each store

# exploration
# sns.histplot(data=df_sales, x='item_price', binrange=(0,5000), bins=25)
# df_sales['item_price'].describe()

df_sales['item_price_category'] = (np.where(df_sales['item_avg_price'] <= 250, 'Cheap', 
                                    np.where(df_sales['item_avg_price'] > 1000, 'Expensive', 'Normal')))

# Preprocessing with the Pipeline

In [53]:
# Separate num and categorical columns 
""" Note - This step will change depending on the data (what types I include/exclude). We can even separate beyond these two groups, depending on the preprocessing plan """
cat_cols = list(df_sales.select_dtypes(include=object).columns)
num_cols = list(df_sales.select_dtypes(exclude=object).columns)

' Note - This step will change depending on the data (what types I include/exclude). We can even separate beyond these two groups, depending on the preprocessing plan '

In [93]:
""" Note - I believe best practice for production models is use indexes instead of column names... but I'm lazy and don't want to over think this for now
         - I haven't optimized anything, don't judge me
"""

# Using this tranformer to create all the new variables
class VariableTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, price_categories=[250,1000]):
        self.price_categories = price_categories

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        data = X.copy() # just being safe
        
        # dropping useless col
        data.drop(columns='date')

        # Average price per month - item, item_category, shop_id
        data['item_avg_price_per_mth'] = data.groupby(['date_block_num', 'item_id'])['item_price'].transform('mean')
        data['item_cat_avg_price_per_mth'] = data.groupby(['date_block_num', 'item_category_id'])['item_price'].transform('mean')
        data['shop_avg_price_per_mth'] = data.groupby(['date_block_num', 'shop_id'])['item_price'].transform('mean')

        # Average prices across all months - item, item_category, shop_id
        data['item_avg_price'] = data.groupby(['item_id'])['item_price'].transform('mean')
        data['item_cat_avg_price'] = data.groupby(['item_category_id'])['item_price'].transform('mean')
        data['shop_avg_price'] = data.groupby(['shop_id'])['item_price'].transform('mean')

        # creating a new col that groups whether an item is cheap/normal/expensive
        data['item_price_category'] = (np.where(data['item_avg_price'] <= self.price_categories[0], 'Cheap', 
                                       np.where(data['item_avg_price'] > self.price_categories[1], 'Expensive', 'Normal')))
        return data

class ColumnSelector(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        return X[self.columns]

class Debug(BaseEstimator, TransformerMixin):

    def transform(self, X):
        print(pd.DataFrame(X).head())
        print(X.shape)
        return X

    def fit(self, X, y=None, **fit_params):
        return self
    

" Note - I believe best practice for production models is use indexes instead of column names... but I'm lazy and don't want to over think this for now\n         - I haven't optimized anything, don't judge me\n"

In [103]:
# Construct Pipeline
"""
Note - I'm having trouble getting the flow of the pipeline perfect. I think I'm misundertanding something within the transformation that causing me to take an extra step
"""

# Var creation Pipeline
new_feature_pipe = Pipeline([
    ('new_var_creation', VariableTransformer())
])

# Categorical Pipelines
cat_cols = ['item_price_category','item_category_id', 'item_id','shop_id','date_block_num'] # need to automate this probably
ohe_pipe_vars = ['item_price_category']
ohe_pipe = Pipeline([
    ('OHE', OneHotEncoder(sparse=False))
])

te_pipe_vars = ['item_category_id', 'item_id','shop_id','date_block_num']
te_pipe = Pipeline([
    ('TE', TargetEncoder())
])


# Numeric Pipelines
num_cols = ['item_avg_price','item_cat_avg_price','shop_avg_price','item_avg_price_per_mth','item_cat_avg_price_per_mth','shop_avg_price_per_mth','item_cnt_day']

std_pipe_vars = [col for col in num_cols if col is not 'item_cnt_day']
std_pipe = Pipeline([
    ('standard_scaler', StandardScaler())
])

min_max_pipe_vars = ['item_cnt_day']
min_max_pipe = Pipeline([
    ('min_max_scaler', MinMaxScaler())
])

# Feature Union Pipeline
feat_pipe = ColumnTransformer([
    ('cat_ohe', ohe_pipe, ohe_pipe_vars),
    ('cat_te', te_pipe, te_pipe_vars),
    ('num_std_scaler', std_pipe, std_pipe_vars),
    ('min_max_scaler',min_max_pipe, min_max_pipe_vars)
])

# Preprocessing pipeline
preprocesor = Pipeline([
    ('var creation', new_feature_pipe),
    ('debugger', Debug()),
    ('feature transformations', feat_pipe)
])

"\nNote - I'm having trouble getting the flow of the pipeline perfect. I think I'm misundertanding something within the transformation that causing me to take an extra step\n"

In [104]:
new_X = preprocesor.fit_transform(X)


         date  date_block_num  shop_id  item_id  item_price  item_cnt_day  \
0  02.01.2013               0       59    22154      999.00           1.0   
1  03.01.2013               0       25     2552      899.00           1.0   
2  05.01.2013               0       25     2552      899.00          -1.0   
3  06.01.2013               0       25     2554     1709.05           1.0   
4  15.01.2013               0       25     2555     1099.00           1.0   

   item_category_id  item_avg_price_per_mth  item_cat_avg_price_per_mth  \
0                37                  999.00                  465.036007   
1                58                  899.00                 1401.858108   
2                58                  899.00                 1401.858108   
3                58                 1709.05                 1401.858108   
4                56                 1098.85                  867.446992   

   shop_avg_price_per_mth  item_avg_price  item_cat_avg_price  shop_avg_price  \
0    

TypeError: fit_transform() missing argument: y

In [79]:
X

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_category_id
0,02.01.2013,0,59,22154,999.00,1.0,37
1,03.01.2013,0,25,2552,899.00,1.0,58
2,05.01.2013,0,25,2552,899.00,-1.0,58
3,06.01.2013,0,25,2554,1709.05,1.0,58
4,15.01.2013,0,25,2555,1099.00,1.0,56
...,...,...,...,...,...,...,...
2935844,10.10.2015,33,25,7409,299.00,1.0,55
2935845,09.10.2015,33,25,7460,299.00,1.0,55
2935846,14.10.2015,33,25,7459,349.00,1.0,55
2935847,22.10.2015,33,25,7440,299.00,1.0,57


In [66]:
# Categoical Pipeline

# Numerical Pipeline

# Combine both Pipelines
[col for col in num_cols if col is not 'item_cnt_day']

['item_avg_price',
 'item_cat_avg_price',
 'shop_avg_price',
 'item_avg_price_per_mth',
 'item_cat_avg_price_per_mth',
 'shop_avg_price_per_mth']