# Feature Engineering Intro I
<hr style="border:2px solid black">

## Introduction

### Feature engineering: what & why?

- "art" of formulating useful features from existing data 
- transforms data to better relate to the underlying target variable
- improves the performance of an ML model
- follows naturally from domain knowledge
- helps incorporate non-numeric features into an ML model

### Feature engineering techniques

 |       technique      |                                        usefulness                                |
 |:--------------------:|:--------------------------------------------------------------------------------:|
 |     `Imputation`     |                    fills in missing values in data                     |
 |   `Discretization`   |                groups a feature in some logical fashion into bins                |
 |`Categorical Encoding`|encodes categorical features into numerical values|
 |  `Feature Splitting` |splits a feature into parts|
 |   `Feature Scaling`  |handles the sensitivity of ML algorithms to the scale of input values| 
 |`Feature Expansion`|derives new features from existing ones|
 | `Log Transformation` |deals with ill-behaved (skewed or heteroscedastic) data        |
 |   `Outlier Handling` |takes care of unusually high/low values in the dataset|
 | `RBF Transformation` |uses a continuous distribution to encode ordinal features|

### Feature engineering best practices

#### 1. Split the dataset into train and test sub-samples as early as possible.

This is flexible: for example, you can drop NaNs from the entire dataset before filling.
Still, in the interest of good machine learning habits, it is better to do even this after splitting.

#### 2. Feature engineer test data the same way as train data.

Otherwise the performance of our model will suffer, if it runs at all.
Writing a function that applies the same steps to both sub-samples is a nice way to do this.

#### 3. Feature Engineering includes any pre-processing techniques, such as:

- dropping missing values
- converting strings / non-numeric values into numeric values
- combining features

<hr style="border:2px solid black">

## Example: Penguin Data

#### load packages

In [None]:
# data analysis stack
import numpy as np
import pandas as pd

# data visualization stack
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')

# machine learning stack
from sklearn.model_selection import train_test_split

# miscellaneous
import warnings
warnings.filterwarnings("ignore")

#### load data

In [None]:
df = pd.read_csv('../data/penguins.csv')
df.head()

#### quick exploration

In [None]:
df.info()

In [None]:
df.describe()

#### features and target

In [None]:
numerical_features = [
    'bill_length_mm',
    'bill_depth_mm',
    'flipper_length_mm'
]

categorical_features = [
    'species',
    'island',
    'sex'
]

features = numerical_features + categorical_features

target_variable = 'body_mass_g'

#### feature-target separation

In [None]:
# feature matrix and target column
X,y = df[features],df[target_variable]

#### train-test split

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=42)
X_train.shape, X_test.shape

### Exploratory Data Analysis

**check missing values**

In [None]:
X_train.isna().sum()

In [None]:
# check missing values graphically
plt.figure(figsize=(6,4), dpi=100)
sns.heatmap(X_train.reset_index(drop=True).isna());

#### issues with the data

- missing values 
- categorical features with non-numeric values
- metric features with varying magnitudes

<hr style="border:2px solid black">

## 1. Imputation

#### What can we do with missing values?

---
- Drop:
    + rows with missing values
    + columns with a lot of missing values
---
- Fill with a value:
    + mean/median of a column
    + interpolate / back fill / forward fill
    + mean/median of a group
---

- With pandas: 
    - `df.isna()`: checks for NaNs, then do a sum or a heatmap
    - `df.dropna()`: drop NaNs
    - `df.fillna()`: fill NaNs
`dropna()` and `fillna()` return a new object by default; assign the result back or pass `inplace=True` to keep the change.
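
The pandas options listed above can be tried on a few rows. This is a minimal sketch on made-up values (the `toy` frame is an assumption for illustration); in a real workflow the fill statistics would be computed on the training data only.

```python
import numpy as np
import pandas as pd

# toy data (made up for illustration)
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "flipper_length_mm": [190.0, np.nan, 210.0, 220.0, np.nan],
})

# count missing values per column
n_missing = toy.isna().sum()

# drop rows that contain any NaN (returns a new DataFrame, toy is unchanged)
dropped = toy.dropna()

# fill with the mean of the whole column
filled_mean = toy["flipper_length_mm"].fillna(toy["flipper_length_mm"].mean())

# fill with the mean of each species group
group_mean = toy.groupby("species")["flipper_length_mm"].transform("mean")
filled_group = toy["flipper_length_mm"].fillna(group_mean)
```

Filling by group keeps the imputed values closer to what is typical for each species than a single column-wide mean would.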

---

#### 1.1 [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer)

In [None]:
from sklearn.impute import SimpleImputer

**sex column imputation**

In [None]:
imputer = SimpleImputer(strategy='most_frequent')
X_train['sex'].value_counts()

In [None]:
X_test.head()

In [None]:
# fit the imputer on the 'sex' column of the training data only
imputer.fit(X_train[['sex']])
X_train['sex'] = imputer.transform(X_train[['sex']]).flatten()
X_train.isna().sum()
X_train['sex'].value_counts()

---

**flipper length column imputation**

In [None]:
flipper_length_imputer = SimpleImputer(strategy='mean')
flipper_length_imputer.fit(X_train[['flipper_length_mm']])
X_train['flipper_length_mm'] = flipper_length_imputer.transform(
    X_train[['flipper_length_mm']]
).flatten()
X_train.isna().sum()

#### 1.2 [`KNNImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html)

- **see exercise notebook *3_intro_to_fe_continued***
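
As a rough preview of what the exercise notebook covers, `KNNImputer` fills a missing value with the average of that feature over the nearest rows. The sketch below uses a made-up numeric array and `n_neighbors=2`; both are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# toy numeric data (made up); the second row has a missing second feature
X_toy = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [8.0, 16.0],
])

# each NaN is replaced by the mean of that feature over the 2 nearest rows
knn_imputer = KNNImputer(n_neighbors=2)
X_filled = knn_imputer.fit_transform(X_toy)
```

Here the two rows closest to the incomplete one are the first and the third, so the gap is filled with the average of their second feature.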

<hr style="border:2px solid black">

## 2. Categorical Encoding

#### 2.1 [`get_dummies()`](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html)

- **see exercise notebook *3_intro_to_fe_continued***
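
For orientation, a minimal `get_dummies()` sketch on made-up values follows (the `toy` frame is an assumption for illustration). Unlike `OneHotEncoder`, `get_dummies()` does not remember the categories it saw, so train and test data can end up with different columns if their categories differ.

```python
import pandas as pd

# toy data (made up for illustration)
toy = pd.DataFrame({"island": ["Biscoe", "Dream", "Torgersen", "Dream"]})

# one indicator column per category; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(toy, columns=["island"], dtype=int)
```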

#### 2.2 [`OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder)

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
# initialize a OneHotEncoder, which transforms categorical data into binary (one-hot encoded) columns
ohe = OneHotEncoder(drop=None, sparse_output=False, handle_unknown='ignore')
# drop=None: no category is dropped
# sparse_output=False: the encoder returns a dense array (i.e., a standard NumPy array) instead of a sparse matrix
# handle_unknown='ignore': categories that were not present during fitting but appear during transformation (e.g., in new data) are encoded as all zeros

In [None]:
X_train['species'].nunique()

In [None]:
# preview the first rows of the training features
X_train.head() 

In [None]:
X_train['species'].unique()

In [None]:
# fit the encoder on the 'species' column of the training data
# double brackets select a DataFrame rather than a Series, since OneHotEncoder expects 2D input
ohe.fit(X_train[['species']])
# apply the one-hot encoding to the 'species' column
t = ohe.transform(X_train[['species']])
# t is the one-hot encoded result: one row per sample, one column per category
t.shape

In [None]:
# names of the new columns created by one-hot encoding (original feature name + category)
ohe.get_feature_names_out() 

In [None]:
species = pd.DataFrame(t, columns= ohe.get_feature_names_out())
species.head()


In [None]:
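# append the encoded columns to the training data and drop the original 'species' column
# (the index is reset so that both frames align row by row)
X_train = pd.concat([X_train.reset_index(drop=True), species], axis=1)
X_train.drop(columns='species',inplace=True)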

In [None]:
X_train.head()

<hr style="border:2px solid black">

## 3. Feature Scaling

### 3.1 Standardization

- does not map values to a fixed range
- less sensitive to outliers than min-max normalization (the mean and standard deviation are still affected by them)
- implemented in scikit-learn by `StandardScaler()`
- centers the data and scales by the standard deviation: 
>$$z = \dfrac{x - \bar{x}}{\sigma_x}$$



In [None]:
def standardize(series):
    """
    returns the mean, the standard deviation and the standardized
    counterpart of a series
    """
    mean_ = series.mean()
    std_ = series.std()  # pandas default: sample standard deviation (ddof=1)
    return mean_, std_, (series-mean_)/std_

In [None]:
numerical_features.remove('flipper_length_mm')
numerical_features

In [None]:
df_standard = pd.DataFrame()

for feature in numerical_features: 
    # compute the mean, standard deviation and standardized values
    mean_, std_, t = standardize(X_train[feature])
    
    # store the parameters in variables such as mean_bill_length_mm
    vars()['mean_'+feature] = mean_   # not needed
    vars()['std_'+feature] = std_     # not needed
    # create standardized numerical columns
    df_standard[feature+'_scaled'] = t

In [None]:
df_standard.head()

#### [`StandardScaler()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

In [None]:
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
std_scaler.fit(X_train[numerical_features])
t=std_scaler.transform(X_train[numerical_features])

In [None]:
df_scaled = pd.DataFrame(t, columns=[f+'_scaled' for f in numerical_features])
df_scaled.head()

In [None]:
X_train = pd.concat([X_train, df_scaled], axis=1)
X_train.head()
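
A hand-computed z-score and the output of `StandardScaler` can differ slightly. pandas' `Series.std()` uses the sample standard deviation (`ddof=1`) by default, while `StandardScaler` divides by the population standard deviation (`ddof=0`). The check below uses made-up numbers; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# toy values (made up for illustration)
x = pd.Series([2.0, 4.0, 6.0, 8.0], name="x")

# z-score with the pandas default (sample standard deviation, ddof=1)
z_sample = (x - x.mean()) / x.std()

# z-score with the population standard deviation (ddof=0)
z_population = (x - x.mean()) / x.std(ddof=0)

# StandardScaler expects 2D input, hence to_frame()
z_sklearn = StandardScaler().fit_transform(x.to_frame()).ravel()

matches_population = np.allclose(z_population, z_sklearn)
matches_sample = np.allclose(z_sample, z_sklearn)
```

With these numbers only the `ddof=0` version reproduces the scikit-learn result. The gap shrinks as the number of rows grows.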

### 3.2 Normalization

***see exercise notebook 3_intro_to_fe_continued***
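
As a preview, and assuming "normalization" here means min-max scaling to the range $[0, 1]$, each value is transformed as $x' = \dfrac{x - x_{min}}{x_{max} - x_{min}}$. scikit-learn provides this as `MinMaxScaler`. The sketch below uses made-up values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# toy data (made up for illustration)
toy = pd.DataFrame({"bill_depth_mm": [14.0, 16.0, 18.0, 22.0]})

# fit learns the minimum and maximum, transform rescales to [0, 1]
scaler = MinMaxScaler()
scaled = scaler.fit_transform(toy)

# the same computation by hand
manual = (toy - toy.min()) / (toy.max() - toy.min())
same = np.allclose(scaled, manual.to_numpy())
```

As with the other transformers, the scaler would be fitted on the training data and then only applied to the test data.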

<hr style="border:2px solid black">

## 4. Feature Expansion

### 4.1 Polynomial Terms

- Additional features obtained by raising an existing feature to some power
- Non-linear relationships can be modelled
- For some feature x, consider the model: 

$$
y = a_0 + a_1x + a_2x^2 +\ldots+\epsilon
$$

- Higher-degree terms can improve the fit on the training data, but they increase the risk of overfitting

#### [`PolynomialFeatures`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
pf = PolynomialFeatures(
    degree = 3,
    interaction_only = False,
    include_bias = False
)

In [None]:
pf.fit(X_train[['bill_length_mm_scaled']])

In [None]:
t = pf.transform(X_train[['bill_length_mm_scaled']])

In [None]:
pd.DataFrame(t, columns=[f"bill_length_power_{i+1}" for i in range(3)])

### 4.2 Interaction Terms

- For multiple initial features, there could be *interactions* (cross-polynomial terms)
- For 2 features, $x_0$ and $x_1$ for example, a 2nd-degree polynomial may contain:

$$
1,~x_0,~x_1,~x_0^2,~x_0x_1,~x_1^2
$$

- Each term gets its own coefficient in a regression model
- `PolynomialFeatures` with `interaction_only = True` keeps the original features and their cross terms (here $x_0x_1$) and drops the pure powers ($x_0^2$, $x_1^2$)