# Feature Engineering with open source

In this notebook we reproduce the feature engineering pipeline from notebook 2 using [scikit-learn](https://scikit-learn.org/) and [Feature-engine](https://feature-engine.readthedocs.io/en/1.2.x/).

In [58]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# saving pipeline
import joblib

# scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, Binarizer

# feature-engine
from feature_engine.imputation import (
    AddMissingIndicator,
    MeanMedianImputer,
    CategoricalImputer
)

from feature_engine.encoding import (
    RareLabelEncoder,
    OrdinalEncoder
)

from feature_engine.transformation import (
    LogTransformer,
    YeoJohnsonTransformer
)


from feature_engine.selection import DropFeatures
from feature_engine.wrappers import SklearnTransformerWrapper


# display all the columns of the dataframe in the notebook
pd.set_option('display.max_columns', None)

In [59]:
data = pd.read_csv('train.csv')

print(data.shape)

data.head()

(1460, 81)


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


# Separate dataset into train and test set

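Some feature engineering techniques learn parameters from the data. These parameters must be learned from the train set only, not from the train and test sets together, and then applied unchanged to the test set to avoid overfitting. Examples of learned parameters: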

- mean
- mode
- exponent for yeo-johnson
- category frequency
- and category to number mappings
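
To make the risk concrete, here is a minimal sketch with made-up numbers (the names `train_values` and `test_values` are illustrative and not part of the notebook). It shows how a learned parameter such as the mean changes when test data leaks into the computation:

```python
from statistics import mean

# Hypothetical toy values, not taken from the house price dataset
train_values = [10.0, 20.0, 30.0]
test_values = [100.0]

# What the pipeline should learn: the train set only
mean_train_only = mean(train_values)

# What it would learn if the test set leaked in
mean_with_leak = mean(train_values + test_values)

print(mean_train_only, mean_with_leak)
```

The second value already carries information about the test set, so any evaluation on that test set would be optimistic.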


**Separating the data into train and test involves randomness, so we need to set the seed to make the split reproducible.**

In [60]:
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),  # predictors
    data['SalePrice'],  # target
    test_size=0.1,  # 10% for testing
    random_state=0  # seed, for reproducibility
)


X_train.shape, X_test.shape

((1314, 79), (146, 79))
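
The effect of the seed can be illustrated without scikit-learn. The sketch below uses Python's `random` module as a stand-in for `train_test_split`; the helper `split_indices` is hypothetical and only mimics the idea of a seeded shuffle followed by a cut.

```python
import random


def split_indices(n_rows, test_size, seed):
    """Shuffle row indices with a fixed seed and cut off a test portion."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_a, test_a = split_indices(1460, 0.1, seed=0)
train_b, test_b = split_indices(1460, 0.1, seed=0)

# Same seed, same split
print(test_a == test_b)

# 1460 rows with a 10% test share give 1314 train rows and 146 test rows
print(len(train_a), len(test_a))
```

Running the split twice with the same seed returns the same rows, which is what makes the later results comparable between runs.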

# Feature Engineering

We will carry out feature engineering to take care of the following:

1. Missing values
2. Temporal variables
3. Non-Gaussian distributed variables
4. Categorical variables: remove rare labels
5. Categorical variables: convert strings to numbers
6. Scale the values of the variables to the same range

In [61]:
y_train = np.log(y_train)
y_test = np.log(y_test)
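
Because the target is now on the log scale, anything predicted from it is on the log scale too and has to be exponentiated to get back to sale prices. A minimal sketch of the round trip, using two sale prices from the data preview as inputs:

```python
import numpy as np

prices = np.array([208500.0, 181500.0])

log_prices = np.log(prices)     # the scale the target is stored in
recovered = np.exp(log_prices)  # back to the original units

print(np.allclose(recovered, prices))
```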

## Missing values
### Categorical Variables

In [62]:
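# categorical variables: those stored as strings (object dtype)
cat_vars = [var for var in data.columns if data[var].dtype == 'O']

# MSSubClass is stored as a number but its values are category codes
cat_vars = cat_vars + ['MSSubClass']


# cast all categorical variables to object
X_train[cat_vars] = X_train[cat_vars].astype('O')
X_test[cat_vars] = X_test[cat_vars].astype('O')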


len(cat_vars)

44
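
The same selection logic can be tried on a toy frame. In this sketch the column names are borrowed from the dataset and the values are illustrative. It uses `pd.api.types.is_numeric_dtype` instead of the `dtype == 'O'` check, on the assumption that, depending on the pandas version, string columns may not be reported as `object`.

```python
import pandas as pd

toy = pd.DataFrame({
    "MSZoning": ["RL", "RM", "RL"],
    "MSSubClass": [60, 20, 60],  # numeric codes that denote categories
    "LotArea": [8450, 9600, 11250],
})

# Columns that are not numeric are treated as categorical
cat_vars = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]

# Add the numerically coded category by hand and cast it to object
cat_vars = cat_vars + ["MSSubClass"]
toy["MSSubClass"] = toy["MSSubClass"].astype(object)

num_vars = [c for c in toy.columns if c not in cat_vars]
print(cat_vars, num_vars)
```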

In [63]:
cat_vars_with_na = [
    var for var in cat_vars
    if X_train[var].isnull().sum() > 0
]


X_train[cat_vars_with_na].isnull().mean().sort_values(
    ascending=False
)

PoolQC          0.995434
MiscFeature     0.961187
Alley           0.938356
Fence           0.814307
FireplaceQu     0.472603
GarageType      0.056317
GarageFinish    0.056317
GarageQual      0.056317
GarageCond      0.056317
BsmtExposure    0.025114
BsmtFinType2    0.025114
BsmtQual        0.024353
BsmtCond        0.024353
BsmtFinType1    0.024353
MasVnrType      0.004566
Electrical      0.000761
dtype: float64

In [64]:
# for columns with more than 10% missing values,
# replace NA with the label 'Missing';
# for the remaining columns, replace NA with the most frequent category (mode)


with_string_missing = [
    var for var in cat_vars_with_na if X_train[var].isnull().mean()
    > 0.1
]


with_mode_missing = [
    var for var in cat_vars_with_na if X_train[var].isnull().mean()
    <= 0.1
]

In [65]:
with_string_missing

['Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']
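
The 10% rule can be checked on a small made-up frame. In the sketch below the column names come from the dataset, while the values and the 20-row size are invented so that the missing fractions are easy to read:

```python
import pandas as pd

toy = pd.DataFrame({
    "Alley": [None] * 15 + ["Grvl"] * 5,    # 75% missing
    "Electrical": [None] + ["SBrkr"] * 19,  # 5% missing
    "Street": ["Pave"] * 20,                # nothing missing
})

na_fraction = toy.isnull().mean()
with_na = na_fraction[na_fraction > 0]

# Above the threshold: use an explicit 'Missing' label
with_string_missing = with_na[with_na > 0.1].index.tolist()

# At or below the threshold: use the most frequent category
with_mode_missing = with_na[with_na <= 0.1].index.tolist()

print(with_string_missing, with_mode_missing)
```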

In [66]:
# replace missing value with label: "missing"

cat_imputer_missing = CategoricalImputer(
    imputation_method='missing', variables=with_string_missing
)

cat_imputer_missing.fit(X_train)

cat_imputer_missing.imputer_dict_

{'Alley': 'Missing',
 'FireplaceQu': 'Missing',
 'PoolQC': 'Missing',
 'Fence': 'Missing',
 'MiscFeature': 'Missing'}

In [67]:
# replace NA with missing

X_train = cat_imputer_missing.transform(X_train)
X_test = cat_imputer_missing.transform(X_test)



In [68]:
cat_imputer_mode = CategoricalImputer(
    imputation_method='frequent', variables=with_mode_missing
)

cat_imputer_mode.fit(X_train)

cat_imputer_mode.imputer_dict_

{'MasVnrType': 'None',
 'BsmtQual': 'TA',
 'BsmtCond': 'TA',
 'BsmtExposure': 'No',
 'BsmtFinType1': 'Unf',
 'BsmtFinType2': 'Unf',
 'Electrical': 'SBrkr',
 'GarageType': 'Attchd',
 'GarageFinish': 'Unf',
 'GarageQual': 'TA',
 'GarageCond': 'TA'}

In [69]:
X_train = cat_imputer_mode.transform(X_train)
X_test = cat_imputer_mode.transform(X_test)
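
For intuition, the two categorical imputation strategies can be sketched in plain pandas. This is not how `CategoricalImputer` is implemented; it only mirrors the observable behaviour on made-up data: a fixed `'Missing'` label for one column, and the most frequent train category for the other.

```python
import pandas as pd

train = pd.DataFrame({
    "Fence": ["MnPrv", None, None, "GdWo"],
    "GarageQual": ["TA", "TA", "Fa", None],
})
test = pd.DataFrame({
    "Fence": [None, "MnPrv"],
    "GarageQual": [None, "Fa"],
})

# Strategy 1: a new explicit category
train["Fence"] = train["Fence"].fillna("Missing")
test["Fence"] = test["Fence"].fillna("Missing")

# Strategy 2: the most frequent category, learned from the train set only
mode_garage = train["GarageQual"].mode()[0]
train["GarageQual"] = train["GarageQual"].fillna(mode_garage)
test["GarageQual"] = test["GarageQual"].fillna(mode_garage)

print(mode_garage)
print(test)
```

Note that the test set is filled with the mode of the train set, even if its own most frequent category is different.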

In [70]:
X_train[cat_vars_with_na].isnull().sum()

Alley           0
MasVnrType      0
BsmtQual        0
BsmtCond        0
BsmtExposure    0
BsmtFinType1    0
BsmtFinType2    0
Electrical      0
FireplaceQu     0
GarageType      0
GarageFinish    0
GarageQual      0
GarageCond      0
PoolQC          0
Fence           0
MiscFeature     0
dtype: int64

In [71]:
[
    var for var in cat_vars_with_na if X_test[var].isnull().sum()
]

[]

### Numerical variables

For numerical variables we will:

- add a binary indicator column
- replace the missing values in the original variable with the mean
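
The order of the two steps matters: the indicator has to be created while the missing values are still visible. A minimal sketch in plain pandas, on made-up `LotFrontage` values (the `_na` suffix follows the column names that appear later in this notebook):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, 70.0]})
test = pd.DataFrame({"LotFrontage": [np.nan, 65.0]})

# Step 1: flag missing values before they are overwritten
train["LotFrontage_na"] = train["LotFrontage"].isnull().astype(int)
test["LotFrontage_na"] = test["LotFrontage"].isnull().astype(int)

# Step 2: fill with the mean learned from the train set
train_mean = train["LotFrontage"].mean()
train["LotFrontage"] = train["LotFrontage"].fillna(train_mean)
test["LotFrontage"] = test["LotFrontage"].fillna(train_mean)

print(train_mean)
print(test)
```

If the steps were swapped, the indicator column would contain only zeros.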

In [72]:
num_vars = [
    var for var in X_train.columns if var not in cat_vars
]

len(num_vars)

35

In [73]:
var_with_na = [
    var for var in num_vars if X_train[var].isnull().sum() > 0
]


X_train[var_with_na].isnull().mean()

LotFrontage    0.177321
MasVnrArea     0.004566
GarageYrBlt    0.056317
dtype: float64

In [74]:
var_with_na

['LotFrontage', 'MasVnrArea', 'GarageYrBlt']

In [75]:
missing_ind = AddMissingIndicator(variables=var_with_na)

missing_ind_var = missing_ind.fit(X_train)



In [76]:
missing_ind_var.variables_

['LotFrontage', 'MasVnrArea', 'GarageYrBlt']

In [77]:
X_train = missing_ind.transform(X_train)
X_test = missing_ind.transform(X_test)

X_train[['LotFrontage_na', 'MasVnrArea_na', 'GarageYrBlt_na']].head(4)

|      | LotFrontage_na | MasVnrArea_na | GarageYrBlt_na |
|------|----------------|---------------|----------------|
| 930  | 0              | 0             | 0              |
| 656  | 0              | 0             | 0              |
| 45   | 0              | 0             | 0              |
| 1348 | 1              | 0             | 0              |


In [78]:
mean_imputer = MeanMedianImputer(
    imputation_method='mean', variables=var_with_na
)

mean_imputer.fit(X_train)

mean_imputer.imputer_dict_

{'LotFrontage': 69.87974098057354,
 'MasVnrArea': 103.7974006116208,
 'GarageYrBlt': 1978.2959677419356}

In [79]:
X_train = mean_imputer.transform(X_train)
X_test = mean_imputer.transform(X_test)


X_train[var_with_na].isnull().sum()

LotFrontage    0
MasVnrArea     0
GarageYrBlt    0
dtype: int64

## Temporal variables

In [80]:
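def elapsed_years(df, var):
    # replace the year stored in `var` with the number of years
    # between that year and the year the house was sold
    df[var] = df['YrSold'] - df[var]
    return df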

In [81]:
for var in ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt']:

    X_train = elapsed_years(X_train, var)
    X_test = elapsed_years(X_test, var)
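
A small worked example of the same subtraction, using the `YrSold` and `YearBuilt` values of the first two rows in the data preview. The helper `elapsed_years_copy` is a hypothetical variant that returns a copy instead of modifying the input frame in place:

```python
import pandas as pd


def elapsed_years_copy(df, var, ref="YrSold"):
    """Return a copy of df where `var` holds the years between `var` and `ref`."""
    out = df.copy()
    out[var] = out[ref] - out[var]
    return out


toy = pd.DataFrame({
    "YrSold": [2008, 2007],
    "YearBuilt": [2003, 1976],
})

result = elapsed_years_copy(toy, "YearBuilt")
print(result["YearBuilt"].tolist())  # [5, 31]
```

Since every temporal variable is measured against `YrSold`, that column can only be dropped after all of them have been converted.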

In [82]:
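# YrSold is no longer needed once the elapsed years are computed
drop_features = DropFeatures(features_to_drop=['YrSold'])


# fit on the train set only, then apply to the test set
X_train = drop_features.fit_transform(X_train)
X_test = drop_features.transform(X_test)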

## Numerical variable transformation

### Logarithmic transformation

In [83]:
log_transformer = LogTransformer(
    variables=["LotFrontage", "1stFlrSF", "GrLivArea"]
)

# fit on the train set only, then apply to the test set
X_train = log_transformer.fit_transform(X_train)
X_test = log_transformer.transform(X_test)
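
The logarithm is only defined for strictly positive values, so it is worth checking the selected variables before transforming them. A minimal sketch in plain pandas and NumPy, with a hypothetical helper `safe_log` and a few values taken from the data preview:

```python
import numpy as np
import pandas as pd


def safe_log(df, variables):
    """Log-transform the given columns, refusing zero or negative values."""
    out = df.copy()
    for var in variables:
        if (out[var] <= 0).any():
            raise ValueError(f"{var} contains zero or negative values")
        out[var] = np.log(out[var])
    return out


toy = pd.DataFrame({"GrLivArea": [1710.0, 1262.0], "2ndFlrSF": [854.0, 0.0]})

logged = safe_log(toy, ["GrLivArea"])
print(logged["GrLivArea"].round(3).tolist())

# A column containing zeros is rejected instead of producing -inf
try:
    safe_log(toy, ["2ndFlrSF"])
except ValueError as err:
    print(err)
```

A variable such as `2ndFlrSF`, which is zero for houses without a second floor, is therefore not a candidate for this transformation.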


