# Introduction to Feature Engineering
<hr style="border:2px solid black">

## 1. Introduction

### Feature engineering: what & why?

- "art" of formulating useful features from existing data 
- transforms data to better relate to the underlying target variable
- improves the performance of an ML model
- follows naturally from domain knowledge
- helps incorporate non-numeric features into an ML model

### Feature engineering techniques

| technique | usefulness |
|:--------------------:|:--------------------------------------------------------------------------------:|
| `Imputation` | fills in missing values in the data |
| `Discretization` | groups the values of a feature into logical bins |
| `Categorical Encoding` | encodes categorical features as numerical values |
| `Feature Splitting` | splits a feature into parts |
| `Outlier Handling` | takes care of unusually high/low values in the dataset |
| `Log Transformation` | deals with ill-behaved (skewed or heteroscedastic) data |
| `Feature Scaling` | handles the sensitivity of ML algorithms to the scale of input values |
| `RBF Transformation` | uses a continuous distribution to encode ordinal features |

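Two of the techniques in the table are not used in the penguin example, so here is a minimal sketch of them. The `income` values, the bin edges, and the band labels are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# toy, made-up values (not taken from any dataset in this notebook)
toy = pd.DataFrame({"income": [1_000, 2_500, 4_000, 40_000, 250_000]})

# log transformation: log1p compresses the long right tail of a skewed feature
toy["log_income"] = np.log1p(toy["income"])

# discretization: group the raw values into three labelled bins
# (bin edges are an arbitrary choice for this sketch)
toy["income_band"] = pd.cut(
    toy["income"],
    bins=[0, 3_000, 50_000, np.inf],
    labels=["low", "medium", "high"],
)
```

After the log transformation the largest and smallest values differ by a few units instead of several orders of magnitude, which is usually easier for a linear model to work with.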
<hr style="border:2px solid black">

## 2. Example: Penguin Data

**load packages**

In [27]:
# data analysis stack
import numpy as np
import pandas as pd

# machine-learning stack
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# miscellaneous
import warnings
warnings.filterwarnings("ignore")

**read data**

In [None]:
df = pd.read_csv('../data/penguins_unclean.csv')
df.head()

### 2.1 Train-Test split

In [29]:
train,test = train_test_split(df, test_size=0.2, random_state=42)
train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)

### 2.2 Quick exploration

In [None]:
train.head()

In [None]:
train.info()

### 2.3 Imputation

In [None]:
mean_weight = round(
    train.groupby(['Species','Sex'])['Body Mass (g)'].mean(),1
)
mean_weight

In [33]:
# replace missing body mass with the mean of the matching (Species, Sex) group
train['Body Mass (g)'] = train.apply(
    lambda x: mean_weight[x['Species']][x['Sex']]
    if pd.isna(x['Body Mass (g)'])
    else x['Body Mass (g)'],
    axis=1
)
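
The row-wise `apply` above can also be written in vectorized form. The following is a sketch on a made-up toy frame (`toy_train` and its values are assumptions, not the penguin data), using `groupby(...).transform('mean')` to align each group mean with its rows.

```python
import numpy as np
import pandas as pd

# toy stand-in for the training data (values are made up)
toy_train = pd.DataFrame({
    "Species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "Sex": ["MALE", "MALE", "MALE", "FEMALE", "FEMALE", "FEMALE"],
    "Body Mass (g)": [4000.0, np.nan, 4200.0, 4600.0, 4800.0, np.nan],
})

# mean body mass of each row's (Species, Sex) group, aligned to the rows
group_mean = (
    toy_train.groupby(["Species", "Sex"])["Body Mass (g)"].transform("mean")
)

# fill only the missing entries; observed values are left untouched
toy_train["Body Mass (g)"] = toy_train["Body Mass (g)"].fillna(group_mean)
```

Note that `transform` computes the means from the frame it is called on, so for the test set the lookup table built from the training data is still the appropriate source. Rows whose `Species` or `Sex` is itself missing would need separate handling.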

In [None]:
train.info()

In [None]:
train.head()

### 2.4 Categorical Encoding

In [None]:
pd.get_dummies(
    data=train['Sex'],
    #drop_first=True
)

In [None]:
train = train.join(
    pd.get_dummies(data=train['Sex'], drop_first=True)
)
train.head()
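
One thing to watch with `pd.get_dummies` is that the resulting columns depend on which categories happen to be present. The sketch below, on made-up toy series, shows one possible way to force the test encoding onto the columns seen during training; the variable names are hypothetical.

```python
import pandas as pd

# toy data: the test split happens to contain no 'MALE' rows
toy_train_sex = pd.Series(["MALE", "FEMALE", "MALE"], name="Sex")
toy_test_sex = pd.Series(["FEMALE", "FEMALE"], name="Sex")

# training encoding: 'FEMALE' is dropped, leaving a single 'MALE' column
train_dummies = pd.get_dummies(toy_train_sex, drop_first=True, dtype=int)

# encode the test data without drop_first, then keep exactly the training
# columns; categories absent from the test data are filled with 0
test_dummies = pd.get_dummies(toy_test_sex, dtype=int).reindex(
    columns=train_dummies.columns, fill_value=0
)
```

Calling `get_dummies(..., drop_first=True)` directly on this toy test series would return no columns at all, because its only category would be dropped.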

### 2.5 Scaling

In [38]:
def standardize(series, mean, std):
    """
    returns the standardized counterpart of a series,
    given a mean and standard deviation
    """
    return (series-mean)/std

In [39]:
numerical_features = [
    'Culmen Length (mm)',
    'Culmen Depth (mm)',
    'Flipper Length (mm)',
    'Body Mass (g)'
]

In [40]:
# standard scaling parameter dictionary
parameters = {}

for feature in numerical_features: 
    # populate parameter dictionary
    mean = train[feature].mean()
    std = train[feature].std()
    parameters[feature] = (mean, std)
    
    # create standardized numerical columns
    train[feature] = standardize(train[feature], mean, std)
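
For comparison, here is a sketch of the same idea with scikit-learn's `StandardScaler` on a made-up toy column. The two versions are not numerically identical: pandas `.std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` uses the population standard deviation (`ddof=0`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# toy column with made-up values
toy = pd.DataFrame({"Flipper Length (mm)": [180.0, 190.0, 200.0, 210.0]})

# manual version, as in the loop above: pandas .std() uses ddof=1
manual = (toy - toy.mean()) / toy.std()

# scikit-learn version: StandardScaler divides by the population std (ddof=0)
scaler = StandardScaler().fit(toy)
scaled = scaler.transform(toy)

# the two results differ only by the constant factor sqrt((n - 1) / n)
n = len(toy)
ratio = manual.to_numpy() / scaled
```

The constant factor does not change the ordering of the values, and it shrinks towards 1 as the number of rows grows.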

In [None]:
train.head()

### 2.6 Feature-Target Separation

In [42]:
# features
X_train = train[numerical_features + ['MALE']]

# target
y_train = train['Species']

In [None]:
X_train.head()

In [None]:
y_train

### 2.7 Model Building

**instantiate model**

In [45]:
classifier_model = LogisticRegression()

**train model**

In [None]:
classifier_model.fit(X_train,y_train)

**model validation**

In [None]:
training_accuracy = classifier_model.score(X_train,y_train)
print(f"training accuracy: {round(training_accuracy, 6)}")

### 2.8 Model Evaluation

**test data quick exploration**

In [None]:
test.head()

In [None]:
test.info()

**imputation**

In [50]:
# impute test data with the group means learned from the training data
test['Body Mass (g)'] = test.apply(
    lambda x: mean_weight[x['Species']][x['Sex']]
    if pd.isna(x['Body Mass (g)'])
    else x['Body Mass (g)'],
    axis=1
)

**categorical encoding**

In [None]:
test = test.join(
    pd.get_dummies(data=test['Sex'], drop_first=True)
)
test.head()

**scaling**

In [52]:
for feature in numerical_features:
    # look up the standardization parameters learned from the training data
    mean, std = parameters[feature]
    
    # transform test data
    test[feature] = standardize(test[feature], mean, std)

In [None]:
test.head()

**feature-target separation**

In [54]:
# features
X_test = test[numerical_features + ['MALE']]

# target
y_test = test['Species']

In [None]:
X_test.head()

**model performance**

In [None]:
test_accuracy = classifier_model.score(X_test,y_test)
print(f"test accuracy: {round(test_accuracy, 6)}")
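
The evaluation steps above repeat every training transformation by hand on the test data. As a possible alternative, a scikit-learn `Pipeline` stores the fitted parameters and reapplies them automatically. The sketch below uses synthetic data from `make_classification` as a stand-in, since it covers only the scaling and model steps, not the imputation or encoding.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data (not the penguin data)
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1,
    random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# the scaler is fitted on the training split only; score() reuses the
# stored mean and standard deviation to transform the test split
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_tr, y_tr)
test_accuracy = pipeline.score(X_te, y_te)
```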

<hr style="border:2px solid black">

## 3. Exercise: Titanic Data

**3.1 create a feature named Title**

In [None]:
# hint
# .apply(lambda x: x.split(',')[1].split('.')[0].lower().strip()) 
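
To see what the hinted expression does, here it is applied to two sample names. The sketch assumes names follow the `Surname, Title. Given names` pattern; the series name `Name` is an assumption about the column.

```python
import pandas as pd

# two sample names in the 'Surname, Title. Given names' pattern
names = pd.Series(
    ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"], name="Name"
)

# take the text after the comma, cut it at the first period,
# then lower-case it and strip the surrounding whitespace
titles = names.apply(lambda x: x.split(',')[1].split('.')[0].lower().strip())
```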

**3.2 binning: handling of rare titles**

In [None]:
# hint
# 1. find the list of unique titles

# 2. write a function that does the following transformations:
## ['mrs','mr','miss','master','dr','rev'] remain the same
## ['mlle','ms'] become 'miss'
## 'mme' becomes 'mrs'
## ['col','major','capt'] become 'army'
## ['don','lady','the countess','sir','the count','madam','lord'] become 'nobl'
## other titles become 'unknown'

# 3. use the .apply() method to bin the title column

**3.3 imputation of age**

In [None]:
# hint:
# .groupby(['Pclass','Sex'])['Age'].mean()

**3.4 imputation of embarkation**

In [None]:
# hint: use most frequent class
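
A minimal sketch of most-frequent-class imputation on a made-up toy column (the column name `Embarked` is assumed):

```python
import numpy as np
import pandas as pd

# toy column with one missing entry (values are made up)
toy = pd.DataFrame({"Embarked": ["S", "C", "S", np.nan, "Q", "S"]})

# mode() ignores missing values by default; [0] picks the first mode
most_frequent = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_frequent)
```

As with the penguin example, the most frequent class should be determined on the training split and then reused for the test split.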

**3.5 imputation of cabin**

In [None]:
# hint: incorporate missing cabin as a class
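
One possible reading of this hint, sketched on made-up values: treat a missing cabin as its own category instead of guessing a value. The label `'missing'` and the column name `Cabin` are assumptions.

```python
import numpy as np
import pandas as pd

# toy column where half of the entries are missing (values are made up)
toy = pd.DataFrame({"Cabin": ["C85", np.nan, "E46", np.nan]})

# missing cabins become an explicit class instead of NaN
toy["Cabin"] = toy["Cabin"].fillna("missing")
```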

**3.6 engineer fare price**

In [None]:
# hint
# .apply(lambda x: x['Fare']/(x['SibSp']+x['Parch']+1),axis=1)  # +1 counts the passenger and avoids division by zero
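
A passenger travelling alone has `SibSp + Parch = 0`, so a per-person fare needs the passenger counted as well; dividing by `SibSp + Parch` alone would produce infinite values for those rows. A vectorized sketch on made-up numbers (the column name `FarePerPerson` is hypothetical):

```python
import pandas as pd

# toy rows: a family of three, a solo traveller, a family of four
toy = pd.DataFrame({
    "Fare": [30.0, 8.0, 60.0],
    "SibSp": [1, 0, 1],
    "Parch": [1, 0, 2],
})

# family size includes the passenger, so it is always at least 1
family_size = toy["SibSp"] + toy["Parch"] + 1
toy["FarePerPerson"] = toy["Fare"] / family_size
```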

**3.7 scaling: numerical features**

<hr style="border:2px solid black">

## References

- [8 Feature Engineering Techniques for Machine Learning](https://www.projectpro.io/article/8-feature-engineering-techniques-for-machine-learning/423)

- [Fundamental Techniques of Feature Engineering for Machine Learning](https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114)