# Introduction to Feature Engineering
<hr style="border:2px solid black">

## Introduction

### Feature engineering: what & why?

- "art" of formulating useful features from existing data 
- transforms data to better relate to the underlying target variable
- improves the performance of an ML model
- follows naturally from domain knowledge
- helps incorporate non-numeric features into an ML model

### Feature engineering techniques

| technique | usefulness |
|:--------------------:|:--------------------------------------------------------------------------------:|
| `Imputation` | fills in missing values in the data |
| `Discretization` | groups the values of a feature into bins in some logical fashion |
| `Categorical Encoding` | encodes categorical features as numerical values |
| `Feature Splitting` | splits a feature into parts |
| `Outlier Handling` | takes care of unusually high/low values in the dataset |
| `Log Transformation` | deals with ill-behaved (skewed or heteroscedastic) data |
| `Feature Scaling` | handles the sensitivity of ML algorithms to the scale of input values |
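
Of the techniques in the table, log transformation is easy to illustrate in isolation. The following is a minimal sketch, assuming a synthetic right-skewed feature drawn from a log-normal distribution; the data and the variable names are invented for illustration and are not part of the penguin dataset.

```python
import numpy as np
import pandas as pd

# synthetic right-skewed feature (randomly generated, for illustration only)
rng = np.random.default_rng(0)
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000), name="skewed")

# log1p computes log(1 + x), which also copes with zeros
log_skewed = np.log1p(skewed)

print(f"skewness before: {skewed.skew():.2f}")
print(f"skewness after:  {log_skewed.skew():.2f}")
```

The skewness after the transformation should be noticeably closer to zero, which is what makes the feature better behaved for many models.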


In [1]:
# data analysis stack
import pandas as pd


# machine-learning feature-engineering stack 

## imputation
from sklearn.impute import (
    SimpleImputer,
    KNNImputer, 
)

## encoding-stack
from sklearn.preprocessing import (
    OneHotEncoder,
    OrdinalEncoder, 
)

## feature-scaling stack
from sklearn.preprocessing import (
    StandardScaler,
    MinMaxScaler,
    RobustScaler,
    PowerTransformer
)

## discretization
from sklearn.preprocessing import KBinsDiscretizer

## polynomial feature
from sklearn.preprocessing import PolynomialFeatures

# machine-learning stack
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# miscellaneous
import warnings
warnings.filterwarnings("ignore")

from sklearn import set_config
set_config(transform_output="pandas") # transformers will return pandas DataFrames instead of NumPy arrays

# numpy
import numpy as np

<hr style="border:2px solid black">

## Welcome Back Mace Factorization

![](../images/penguins.png)

## Business Goal:
>Predict a penguin's sex based on species, culmen length, culmen depth, flipper length, body mass and island

## Get Data

In [2]:
df = pd.read_csv('data/penguins_unclean.csv')
df.head()

Unnamed: 0,species,culmen_length,culmen_depth,flipper_length,body_mass,sex
0,Adelie,39.1,18.7,181.0,3750.0,MALE
1,Adelie,39.5,17.4,186.0,,FEMALE
2,Adelie,40.3,18.0,195.0,,FEMALE
3,Adelie,36.7,19.3,193.0,3450.0,FEMALE
4,Adelie,39.3,20.6,190.0,3650.0,MALE


## Select columns for X and y

In [3]:
# keep a copy 
df_copy = df.copy()

In [4]:
X = df.drop(['sex'], axis = 1)
y = df.sex

## Perform a train test split

In [5]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
# shuffle is True by default
# when would we not want to shuffle? e.g. time series

x_train


Unnamed: 0,species,culmen_length,culmen_depth,flipper_length,body_mass
224,Gentoo,40.9,13.7,214.0,4650.0
78,Adelie,37.3,17.8,191.0,
295,Gentoo,50.0,15.9,224.0,5350.0
17,Adelie,35.9,19.2,189.0,3800.0
24,Adelie,40.5,18.9,180.0,3950.0
...,...,...,...,...,...
188,Chinstrap,50.9,19.1,196.0,3550.0
71,Adelie,37.2,19.4,184.0,3900.0
106,Adelie,39.7,17.7,193.0,3200.0
270,Gentoo,45.5,15.0,220.0,5000.0
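
The shuffle note in the split above can be made concrete with a small sketch. It assumes a hypothetical daily time series (the values are invented); with `shuffle=False`, `train_test_split` keeps the row order, so the test set contains only the most recent observations.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical daily time series (values invented for illustration)
ts = pd.DataFrame(
    {"value": range(10)},
    index=pd.date_range("2024-01-01", periods=10, freq="D"),
)

# shuffle=False keeps the original order: the first 80% of the rows
# become the training set and the last 20% the test set
ts_train, ts_test = train_test_split(ts, test_size=0.2, shuffle=False)

# every training date lies before every test date
print(ts_train.index.max() < ts_test.index.min())
```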


In [6]:
# Check shapes
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(266, 5)
(67, 5)
(266,)
(67,)


## EDA
- which variables have missing values?
- which variables are binary, categorical, or metric?
- do the categorical variables have non-numeric values?
- do the metric features vary on different scales?
- ...


In [7]:
# combine x_train and y_train back into one DataFrame for EDA
df_train = pd.concat([x_train, y_train], axis = 1)
df_train

Unnamed: 0,species,culmen_length,culmen_depth,flipper_length,body_mass,sex
224,Gentoo,40.9,13.7,214.0,4650.0,FEMALE
78,Adelie,37.3,17.8,191.0,,FEMALE
295,Gentoo,50.0,15.9,224.0,5350.0,MALE
17,Adelie,35.9,19.2,189.0,3800.0,FEMALE
24,Adelie,40.5,18.9,180.0,3950.0,MALE
...,...,...,...,...,...,...
188,Chinstrap,50.9,19.1,196.0,3550.0,MALE
71,Adelie,37.2,19.4,184.0,3900.0,MALE
106,Adelie,39.7,17.7,193.0,3200.0,FEMALE
270,Gentoo,45.5,15.0,220.0,5000.0,MALE


In [8]:
# which variable has missing values?
df_train.isnull().sum() # body_mass, species and flipper_length have missing values



species            1
culmen_length      0
culmen_depth       0
flipper_length     2
body_mass         15
sex                0
dtype: int64
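
The remaining EDA questions can be answered with a few more pandas calls. This is a minimal sketch on a small hypothetical sample laid out like the penguin data (the rows are invented for illustration); the same calls should work on `df_train`.

```python
import numpy as np
import pandas as pd

# small hypothetical sample in the same layout as the penguin data
sample = pd.DataFrame({
    "species": ["Adelie", "Gentoo", None, "Chinstrap"],
    "culmen_length": [39.1, 47.2, 40.3, 50.9],
    "body_mass": [3750.0, np.nan, np.nan, 3550.0],
    "sex": ["MALE", "FEMALE", "FEMALE", "MALE"],
})

# non-numeric columns will need encoding
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

# numeric columns may need scaling
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# share of missing values per column
missing_share = sample.isna().mean()

# the number of distinct values helps to spot binary columns
n_unique = sample.nunique()

# min, max and spread reveal differences in scale between the metric features
print(sample[numeric_cols].describe())
```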

## Feature Engineering

### Imputation
Many real-world datasets contain missing values.
Such datasets are incompatible with most scikit-learn estimators, which expect complete numeric input.

**What can we do with missing values?**

__Strategies__

- Drop:
    + rows with missing values
    + columns with a lot of missing values

- Fill with a value:
    + mean/median/mode of a column
    + interpolate / back fill / forward fill
    + mean/median/mode of a group

- With pandas: 
    + df.isna(): checks for NaNs, then do a sum or a heatmap
    + df.dropna(): drop NaNs
    + df.fillna(): fill NaNs

Note that `df.dropna()` and `df.fillna()` return a new DataFrame by default; assign the result back to a variable (or pass `inplace=True`) to keep the change.
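
The drop and fill strategies listed above can be sketched with plain pandas. The toy frame below is invented for illustration; in a train/test workflow the fill values would be computed on the training data only.

```python
import numpy as np
import pandas as pd

# toy data (values invented for illustration)
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "body_mass": [3500.0, np.nan, 3700.0, 5000.0, 5200.0, np.nan],
})

# option 1: drop the rows with a missing body mass
dropped = toy.dropna(subset=["body_mass"])

# option 2: fill with the overall median of the column
overall = toy["body_mass"].fillna(toy["body_mass"].median())

# option 3: fill with the median of each species
group_median = toy.groupby("species")["body_mass"].transform("median")
by_group = toy["body_mass"].fillna(group_median)

print(pd.DataFrame({"overall": overall, "by_group": by_group}))
```

Filling by group keeps the imputed Adelie and Gentoo values close to their own species, whereas the overall median lands between the two groups.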

#### `SimpleImputer`

**body mass imputing**

In [19]:
# instantiate the imputer
bm_imputer = SimpleImputer(strategy = 'median')

In [21]:
# learn the median from the column
bm_imputer.fit(x_train[['body_mass']])

bm_imputer.statistics_

array([4050.])

In [22]:
# transform the body mass
bm_imputed_train = bm_imputer.transform(x_train[['body_mass']])
assert bm_imputed_train.isnull().sum().iloc[0] == 0
x_train  # x_train itself is unchanged; the imputed values live in bm_imputed_train

Unnamed: 0,species,culmen_length,culmen_depth,flipper_length,body_mass
224,Gentoo,40.9,13.7,214.0,4650.0
78,Adelie,37.3,17.8,191.0,
295,Gentoo,50.0,15.9,224.0,5350.0
17,Adelie,35.9,19.2,189.0,3800.0
24,Adelie,40.5,18.9,180.0,3950.0
...,...,...,...,...,...
188,Chinstrap,50.9,19.1,196.0,3550.0
71,Adelie,37.2,19.4,184.0,3900.0
106,Adelie,39.7,17.7,193.0,3200.0
270,Gentoo,45.5,15.0,220.0,5000.0


In [23]:
bm_imputed_train.isnull().sum()

body_mass    0
dtype: int64

In [24]:
# access the imputation value (just for a check)
assert bm_imputer.statistics_ == x_train.body_mass.median()

In [25]:
bm_imputed_test = bm_imputer.transform(x_test[['body_mass']])
bm_imputed_test
# we are now transforming the test sample with the median from the train data!!

Unnamed: 0,body_mass
25,3250.0
309,4875.0
73,4000.0
195,3675.0
57,4050.0
...,...
280,4700.0
3,3450.0
77,4200.0
311,4050.0


**species imputing**

In [26]:
# instantiate the imputer
sp_imputer = SimpleImputer(strategy = 'most_frequent')


In [27]:
# learn the most frequent species
sp_imputer.fit(x_train[['species']])

# check which is the value we are imputing
sp_imputer.statistics_

array(['Adelie'], dtype=object)

In [28]:
# transform the train species 
sp_imputed_train = sp_imputer.transform(x_train[['species']])
assert sp_imputed_train.isnull().sum().iloc[0] == 0

In [29]:
# transform the test species 
sp_imputed_test = sp_imputer.transform(x_test[['species']])
sp_imputed_test

Unnamed: 0,species
25,Adelie
309,Gentoo
73,Adelie
195,Chinstrap
57,Adelie
...,...
280,Gentoo
3,Adelie
77,Adelie
311,Gentoo


#### `KNNImputer`

**flipper length imputing**

In [35]:
# instantiate the knn imputer
fl_imputer = KNNImputer(n_neighbors = 10)

In [32]:
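# store the training rows; the most similar observations are looked up at transform time
# note: with a single column there is no other feature to measure similarity on,
# so each gap is filled with the training mean of that column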
fl_imputer.fit(x_train[['flipper_length']])

In [33]:
# transform the train flipper length 
fl_imputed_train = fl_imputer.transform(x_train[['flipper_length']])
fl_imputed_train

Unnamed: 0,flipper_length
224,214.0
78,191.0
295,224.0
17,189.0
24,180.0
...,...
188,196.0
71,184.0
106,193.0
270,220.0


In [34]:
# transform the test flipper length 
fl_imputed_test = fl_imputer.transform(x_test[['flipper_length']])
fl_imputed_test

Unnamed: 0,flipper_length
25,178.0
309,222.0
73,195.0
195,198.0
57,192.0
...,...
280,220.0
3,193.0
77,193.0
311,225.0
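
`KNNImputer` becomes more informative when it sees more than one feature. The sketch below uses a toy frame with two clearly separated groups (all values are invented) and compares fitting on the single column with fitting on two columns; it assumes the default uniform weights and the default NaN-aware Euclidean distance.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# toy measurements: two clearly separated groups (values invented)
toy = pd.DataFrame({
    "flipper_length": [180.0, 182.0, np.nan, 220.0, 222.0, np.nan],
    "body_mass": [3500.0, 3600.0, 3550.0, 5200.0, 5300.0, 5250.0],
})

# single column: no other feature to measure similarity on,
# so every gap is filled with the column mean
single = np.asarray(KNNImputer(n_neighbors=2).fit_transform(toy[["flipper_length"]]))

# two columns: neighbours are found via body_mass,
# so each gap is filled from the two most similar rows
multi = np.asarray(KNNImputer(n_neighbors=2).fit_transform(toy))

print(single[[2, 5], 0])  # both gaps get the same value
print(multi[[2, 5], 0])   # each gap follows its own group
```

Because the neighbours are found by distance, features on very different scales are often scaled before they are passed to `KNNImputer`.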


### Categorical Encoding
Species column

#### `pd.get_dummies()` with Pandas 🐼

In [37]:
pd.get_dummies(
    data=sp_imputed_train, 
    prefix = 's',
    dtype = 'int'
)

Unnamed: 0,s_Adelie,s_Chinstrap,s_Gentoo
224,0,0,1
78,1,0,0
295,0,0,1
17,1,0,0
24,1,0,0
...,...,...,...
188,0,1,0
71,1,0,0
106,1,0,0
270,0,0,1


#### `OneHotEncoder`

In [43]:
# instantiate the encoder
sp_ohe = OneHotEncoder(sparse_output = False) # `sparse` was renamed to `sparse_output` in scikit-learn 1.2

In [44]:
# learn the unique categories of the column
sp_ohe.fit(sp_imputed_train)

In [45]:
# create dummy features
sp_imputed_dummy_train = sp_ohe.transform(sp_imputed_train)
sp_imputed_dummy_train


Unnamed: 0,species_Adelie,species_Chinstrap,species_Gentoo
224,0.0,0.0,1.0
78,1.0,0.0,0.0
295,0.0,0.0,1.0
17,1.0,0.0,0.0
24,1.0,0.0,0.0
...,...,...,...
188,0.0,1.0,0.0
71,1.0,0.0,0.0
106,1.0,0.0,0.0
270,0.0,0.0,1.0


<hr style="border:2px solid black">

### Discretization
Binning a metric variable


#### `KBinsDiscretizer`

Discretize the flipper length into 3 bins.

In [95]:
# instantiate the discretizer
fl_discritizer = KBinsDiscretizer(n_bins = 3, encode='ordinal', strategy='uniform')

In [96]:
# learn the bin intervals
fl_discritizer.fit(fl_imputed_train)

In [97]:
# transform the imputed training flipper length
fl_imputed_discritized_train = fl_discritizer.transform(fl_imputed_train)
fl_imputed_discritized_train.rename(columns = {'flipper_length':'flipper_cat'}, inplace = True)

fl_imputed = pd.concat([fl_imputed_train, fl_imputed_discritized_train], axis = 1)
fl_imputed

Unnamed: 0,flipper_length,flipper_cat
224,214.0,2.0
78,191.0,0.0
295,224.0,2.0
17,189.0,0.0
24,180.0,0.0
...,...,...
188,196.0,1.0
71,184.0,0.0
106,193.0,1.0
270,220.0,2.0


In [98]:
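# observed min and max flipper length per bin
# (the learned edges themselves are stored in fl_discritizer.bin_edges_)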
bin_edges = fl_imputed.groupby('flipper_cat').agg({
    'flipper_length': ['min', 'max']}).reset_index()
bin_edges

Unnamed: 0_level_0,flipper_cat,flipper_length,flipper_length
Unnamed: 0_level_1,Unnamed: 1_level_1,min,max
0,0.0,172.0,191.0
1,1.0,192.0,211.0
2,2.0,212.0,231.0
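
The groupby above shows the observed minimum and maximum per bin. The edges that the discretizer actually learned can be read from its `bin_edges_` attribute, as in this minimal sketch on invented flipper lengths:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# hypothetical flipper lengths (values invented for illustration)
flipper = np.array([[172.0], [180.0], [191.0], [200.0], [215.0], [232.0]])

discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
bins = np.asarray(discretizer.fit_transform(flipper)).ravel()

# strategy="uniform" splits [min, max] into equal-width intervals;
# the learned edges are stored per feature in bin_edges_
edges = discretizer.bin_edges_[0]

print(edges)  # four edges delimit three bins
print(bins)   # bin index of each observation
```

With `strategy="quantile"` the edges would instead be chosen so that each bin holds roughly the same number of observations.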


### Feature Scaling

Feature scaling is a data preprocessing technique used to transform the values of features or variables in a dataset to a similar scale. The purpose is to ensure that all features contribute equally to the model and to avoid the domination of features with larger values.

Affected models:
+ gradient-descent-based models
+ distance-based models

Not affected models:
+ tree-based models

#### Standardization
- removes the mean and scales the data by the standard deviation
- sensitive to outliers
- the features may scale differently from each other in the presence of outliers
- implemented in scikit-learn by `StandardScaler()`
- transformation formula:
>$$\dfrac{x - \mathrm{mean}(x)}{\mathrm{sd}(x)}$$


In [99]:
X

Unnamed: 0,species,culmen_length,culmen_depth,flipper_length,body_mass
0,Adelie,39.1,18.7,181.0,3750.0
1,Adelie,39.5,17.4,186.0,
2,Adelie,40.3,18.0,195.0,
3,Adelie,36.7,19.3,193.0,3450.0
4,Adelie,39.3,20.6,190.0,3650.0
...,...,...,...,...,...
328,Gentoo,47.2,13.7,214.0,4925.0
329,Gentoo,46.8,14.3,215.0,4850.0
330,Gentoo,50.4,15.7,222.0,5750.0
331,Gentoo,45.2,14.8,212.0,5200.0


In [103]:
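# select the numeric columns to scale
# note: for illustration this uses the full X; in a real workflow, fit the scaler on x_train only
to_scale_train = X.drop('species', axis = 1)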
to_scale_train

Unnamed: 0,culmen_length,culmen_depth,flipper_length,body_mass
0,39.1,18.7,181.0,3750.0
1,39.5,17.4,186.0,
2,40.3,18.0,195.0,
3,36.7,19.3,193.0,3450.0
4,39.3,20.6,190.0,3650.0
...,...,...,...,...
328,47.2,13.7,214.0,4925.0
329,46.8,14.3,215.0,4850.0
330,50.4,15.7,222.0,5750.0
331,45.2,14.8,212.0,5200.0


In [104]:
# instantiate the scaler
st_scaler = StandardScaler()

In [105]:
# learn mean and standard deviation of each feature
st_scaler.fit(to_scale_train)

In [109]:
# apply the transformation to the columns to scale
scaled_features_train = st_scaler.transform(to_scale_train)
scaled_features_train

Unnamed: 0,culmen_length,culmen_depth,flipper_length,body_mass
0,-0.896042,0.780732,-1.428078,-0.559684
1,-0.822788,0.119584,-1.071328,
2,-0.676280,0.424729,-0.429178,
3,-1.335566,1.085877,-0.571878,-0.933866
4,-0.859415,1.747026,-0.785928,-0.684412
...,...,...,...,...
328,0.587352,-1.762145,0.926472,0.905862
329,0.514098,-1.457000,0.997823,0.812317
330,1.173384,-0.744994,1.497273,1.934863
331,0.221082,-1.202712,0.783772,1.248863
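
In a train/test workflow the scaler would be fitted on the training rows only and then reused for the test rows. A minimal sketch, assuming randomly generated body masses as a stand-in for the real column:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# hypothetical body masses in grams (randomly generated for illustration)
rng = np.random.default_rng(42)
body_mass = rng.normal(loc=4200, scale=800, size=(100, 1))

mass_train, mass_test = train_test_split(body_mass, test_size=0.2, random_state=42)

scaler = StandardScaler()
# mean and standard deviation are learned from the training rows only
scaled_train = np.asarray(scaler.fit_transform(mass_train))
# the test rows are transformed with the training statistics
scaled_test = np.asarray(scaler.transform(mass_test))

print(scaled_train.mean().round(3), scaled_train.std().round(3))
print(scaled_test.mean().round(3), scaled_test.std().round(3))
```

The scaled training data has mean 0 and standard deviation 1 by construction; the test data will usually be close to that, but not exactly there.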


#### Normalization
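- scikit-learn equivalent `MinMaxScaler()`
- output range is [0, 1] on the data the scaler was fitted on (with the default `feature_range`); new data can fall outside this range
- doesn't deal well with outliers
- transformation formula:
>$$\dfrac{x - \min(x)}{\max(x) - \min(x)}$$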

#### Robust scaler
- scikit-learn equivalent `RobustScaler()`
- removes the median and scales the data by the interquartile range (IQR)
- robust to outliers
- transformation formula:
>$$\dfrac{x - \mathrm{median}(x)}{\mathrm{IQR}(x)}$$
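
The three scaling formulas above can be checked against scikit-learn on a small example. The feature below is invented and contains one extreme value, which makes the different sensitivity to outliers visible; the sketch assumes the default settings of all three scalers.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# hypothetical feature with one extreme value (invented for illustration)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = np.asarray(StandardScaler().fit_transform(x)).ravel()
minmax = np.asarray(MinMaxScaler().fit_transform(x)).ravel()
robust = np.asarray(RobustScaler().fit_transform(x)).ravel()

# the same results computed directly from the formulas
q1, median, q3 = np.percentile(x, [25, 50, 75])
manual_standard = ((x - x.mean()) / x.std()).ravel()
manual_minmax = ((x - x.min()) / (x.max() - x.min())).ravel()
manual_robust = ((x - median) / (q3 - q1)).ravel()

print(minmax.round(3))  # the four regular values are squeezed close to 0
print(robust.round(3))  # the four regular values keep their spacing
```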

<hr style="border:2px solid black">

## Train Model(s)

### Fit the model on the (transformed) training data

## Evaluate the model on the (transformed) test data

<hr style="border:2px solid black">

## Feature engineering best practices:

#### 1. We should try to split our data set into training and testing sub-samples as early as we can.
- this is _somewhat_ flexible, e.g. you can drop NaNs from the entire dataset before filling.
- still, in the interest of good machine learning habits and avoiding mistakes, it's smart to _always_ split first!

#### 2. We need to feature engineer our testing data in the same way that we feature-engineered our training data.
- otherwise the performance of our model will suffer, if it runs at all.
- writing a function is a nice way to do this.

#### 3. Feature Engineering includes any pre-processing techniques, such as:
- imputation, dropping missing values
- converting strings / non-numeric values into numeric values
- scaling
- binning
- combining features
