# Feature Engineering Intro II
<hr style="border:2px solid black">

- This notebook supplements and continues notebook *1_intro_to_fe*.
- The reader is expected to run the code and try to understand the concepts.
- The reader is expected to read the linked documentation pages wherever they are provided.

#### load packages

In [1]:
# data analysis stack
import numpy as np
import pandas as pd

# data visualization stack
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')

# miscellaneous
import warnings
warnings.filterwarnings("ignore")

from sklearn.impute import KNNImputer
from sklearn.impute import SimpleImputer

#### load data

In [2]:
df = pd.read_csv('../data/penguins.csv')

#### train-test split

In [3]:
from sklearn.model_selection import train_test_split

In [4]:
train, test = train_test_split(df, test_size=0.25, random_state=42)
train.shape, test.shape

((256, 7), (86, 7))

#### features and target

In [5]:
numerical_features = [
    'bill_length_mm',
    'bill_depth_mm',
    'flipper_length_mm'
]

categorical_features = [
    'species',
    'island',
    'sex'
]

features = numerical_features + categorical_features

target_variable = 'body_mass_g'

#### feature-target separation

In [6]:
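# feature matrix and target column
# .copy() makes X_train/X_test independent DataFrames, so the column
# assignments later in this notebook do not write into a slice of train/test
X_train, y_train = train[features].copy(), train[target_variable]
X_test, y_test = test[features].copy(), test[target_variable]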

- **Did you notice the order of feature-target separation and train-test split?**
- **How is it different from what was done in previous notebooks?**
- **Can you think of scenarios where one order would be better than the other?**

<hr style="border:2px solid black">

## 1. Imputation

#### 1.1 [`SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer)

- **discussed in notebook *1_intro_to_fe***

#### 1.2 [`KNNImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html)

- Imputation for completing missing values using k-Nearest Neighbors.
- Each sample's missing values are imputed using the mean value of the `n_neighbors` nearest neighbors found in the training set.
- Two samples are close if the features that neither is missing are close.
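
The neighbor-based behaviour described above can be sketched on a tiny made-up array (the values are illustrative, not taken from the penguins data). The second part shows what happens when only a single column is passed: there is no observed feature left to measure distance on, so the imputer falls back to the training mean of that column.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data (hypothetical values): two features, one missing entry in the last row
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [2.1, np.nan],
])

# Two features: distance is computed on the first column, which is observed
# in every row. The two nearest rows to 2.1 are 2.0 and 3.0, so the missing
# value becomes the mean of 20.0 and 30.0.
filled = KNNImputer(n_neighbors=2).fit_transform(X)
print(filled[3])

# One feature only: the row with the missing value has nothing to compare on,
# so the value is filled with the mean of the observed values (10, 20, 30).
single = KNNImputer(n_neighbors=2).fit_transform(X[:, [1]])
print(single[3])
```

Passing several related numerical columns to `fit()` and `transform()` is therefore what makes the imputation depend on neighbors rather than on the column mean.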

**example: impute `bill_depth_mm`**

In [7]:
X_train.isna().sum()

bill_length_mm       0
bill_depth_mm        2
flipper_length_mm    2
species              0
island               0
sex                  6
dtype: int64

In [8]:
X_train.dtypes

bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
species               object
island                object
sex                   object
dtype: object

In [9]:
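knn_imputer = KNNImputer(n_neighbors=5)
# Note: only one column is passed here, so a row with a missing bill depth has
# no observed feature to measure distance on; KNNImputer then falls back to the
# training-set mean of the column. Pass several numerical columns to get
# neighbor-based imputation.
knn_imputer.fit(X_train[['bill_depth_mm']])

X_train['bill_depth_mm'] = knn_imputer.transform(
    X_train[['bill_depth_mm']]
).flatten()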

In [10]:
X_train.isna().sum()

bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    2
species              0
island               0
sex                  6
dtype: int64

<hr style="border:2px solid black">

## 2. Categorical Encoding

#### 2.1 [`get_dummies()`](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html)

- Converts categorical variable into dummy/indicator variables.
- Each variable is converted into as many 0/1 variables as there are different values.
- Columns in the output are each named after a value; if the input is a DataFrame, the name of the original variable is prepended to the value.
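
The naming rule in the last bullet can be checked on a small made-up frame. This is only a sketch; the `toy` DataFrame is hypothetical and `drop_first=True` is included to show how one level can be dropped.

```python
import pandas as pd

# Toy frame (hypothetical values) with a single categorical column
toy = pd.DataFrame({"species": ["Adelie", "Gentoo", "Adelie", "Chinstrap"]})

# Series input: columns are named after the values only
from_series = pd.get_dummies(toy["species"])
print(list(from_series.columns))

# DataFrame input: the original column name is prepended to each value
from_frame = pd.get_dummies(toy, columns=["species"])
print(list(from_frame.columns))

# drop_first=True removes the first level, leaving k-1 indicator columns
dropped = pd.get_dummies(toy["species"], drop_first=True)
print(list(dropped.columns))
```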

**example: create dummies for `species`**

In [11]:
dummy_species = pd.get_dummies(
    data=X_train['species'],
    drop_first=False,
)
dummy_species.head()

| | Adelie | Chinstrap | Gentoo |
|---|---|---|---|
| 24 | True | False | False |
| 323 | False | False | True |
| 143 | True | False | False |
| 208 | False | True | False |
| 253 | False | False | True |


**Is it a problem to join without an explicit id column?**

In [12]:
# No problem: join() aligns rows on the index, and dummy_species was built
# from X_train, so both DataFrames share the same index labels.
X_train.join(dummy_species).head()

| | bill_length_mm | bill_depth_mm | flipper_length_mm | species | island | sex | Adelie | Chinstrap | Gentoo |
|---|---|---|---|---|---|---|---|---|---|
| 24 | 35.3 | 18.9 | 187.0 | Adelie | Biscoe | Female | True | False | False |
| 323 | 47.3 | 13.8 | NaN | Gentoo | Biscoe | NaN | False | False | True |
| 143 | 37.3 | 16.8 | 192.0 | Adelie | Dream | Female | True | False | False |
| 208 | 49.3 | 19.9 | 203.0 | Chinstrap | Dream | Male | False | True | False |
| 253 | 49.1 | 14.8 | 220.0 | Gentoo | Biscoe | Female | False | False | True |
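
Why the join works without an id column can be illustrated with a small made-up example: `join()` matches rows by index label, not by position. The frames below are hypothetical; the second case shows what goes wrong once the index has been reset.

```python
import pandas as pd

# Toy frame (hypothetical values) with a non-default index, as after a train-test split
left = pd.DataFrame(
    {"species": ["Adelie", "Gentoo", "Chinstrap"]},
    index=[24, 323, 208],
)
dummies = pd.get_dummies(left["species"])

# Shuffling the rows does not matter: join() matches on index labels
shuffled = dummies.sample(frac=1, random_state=0)
joined = left.join(shuffled)
print(joined)

# After reset_index the labels are 0, 1, 2 and no longer match 24, 323, 208,
# so the left join finds no partner rows and fills the new columns with NaN
misaligned = left.join(dummies.reset_index(drop=True))
print(misaligned)
```

The same caution applies whenever a transformer returns a plain NumPy array: wrapping it in a DataFrame without passing the original index produces a fresh 0..n-1 index.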


#### 2.2 [`OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder)

- **discussed in notebook *1_intro_to_fe***

<hr style="border:2px solid black">

## 3. Feature Scaling

### 3.1 Standardization

- **discussed in notebook *1_intro_to_fe***

### 3.2 Normalization

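- scikit-learn equivalent: `MinMaxScaler()`
- the output range on the data used for fitting is [0, 1]; new data outside the fitted minimum/maximum can fall outside this range
- sensitive to outliers: a single extreme value stretches the range and compresses all other values
- transformation formula:
>$$z=\dfrac{x - \min(x)}{\max(x) - \min(x)}$$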
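
The sensitivity to outliers can be seen by applying the formula to a few made-up numbers. The helper `min_max` and the values below are illustrative assumptions, not part of the notebook's data.

```python
import numpy as np

def min_max(x):
    """Apply z = (x - min) / (max - min) to a 1-D array."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Hypothetical flipper lengths, evenly spread
values = np.array([180.0, 190.0, 200.0, 210.0])
scaled = min_max(values)
print(scaled)

# The same values plus one extreme measurement
with_outlier = np.append(values, 1000.0)
scaled_outlier = min_max(with_outlier)
print(scaled_outlier)
```

Without the outlier the four values spread over the whole interval from 0 to 1; with it, they are squeezed into a narrow band near 0 while the outlier alone sits at 1.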


**example: create normalized numerical columns**

In [13]:
def normalize(series):
    """
    Min-max normalize a series.
    Returns a tuple (min, max, normalized series), where the normalized values
    are (value - min) / (max - min) and therefore lie between 0 and 1.
    """
    min_ = series.min()  # minimum of the series (NaN values are skipped)
    max_ = series.max()  # maximum of the series (NaN values are skipped)
    return min_, max_, (series - min_) / (max_ - min_)

# Create an empty DataFrame to hold the normalized features
df_normal = pd.DataFrame()

# Store the (min, max) of each feature so the same scaling can be reused later
feature_ranges = {}

# Iterate over each feature in numerical_features for normalization
for feature in numerical_features:
    # Normalize the feature and get the min, max, and scaled values
    min_, max_, t = normalize(X_train[feature])

    # Keep the min and max in a dictionary keyed by feature name
    feature_ranges[feature] = (min_, max_)

    # Add the normalized feature to the new DataFrame with a '_scaled' suffix
    df_normal[feature + '_scaled'] = t
df_normal.head()

| | bill_length_mm_scaled | bill_depth_mm_scaled | flipper_length_mm_scaled |
|---|---|---|---|
| 24 | 0.116364 | 0.690476 | 0.258621 |
| 323 | 0.552727 | 0.083333 | NaN |
| 143 | 0.189091 | 0.440476 | 0.344828 |
| 208 | 0.625455 | 0.809524 | 0.534483 |
| 253 | 0.618182 | 0.202381 | 0.827586 |


#### [`MinMaxScaler()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)

In [14]:
from sklearn.preprocessing import MinMaxScaler

# Step 1: Initialize the MinMaxScaler.
# By default it scales each feature to the range [0, 1]; pass feature_range=(min, max) for a different range.
mm_scaler = MinMaxScaler()  # 0 to 1

# Step 2: fit() computes the minimum and maximum of each feature in X_train[numerical_features].
# These values are used later to transform the data into the desired range.
mm_scaler.fit(X_train[numerical_features])

# Step 3: transform() scales the data using the fitted min and max values.
# Each feature is transformed using the formula: (x - min) / (max - min)
t = mm_scaler.transform(X_train[numerical_features])

# Step 4: Convert the transformed (scaled) NumPy array back into a DataFrame.
# Each column name gets a '_mm_scaled' suffix to indicate min-max scaling.
# The original index is not carried over, so rows are numbered from 0.
pd.DataFrame(t, columns=[f + '_mm_scaled' for f in numerical_features])

| | bill_length_mm_mm_scaled | bill_depth_mm_mm_scaled | flipper_length_mm_mm_scaled |
|---|---|---|---|
| 0 | 0.116364 | 0.690476 | 0.258621 |
| 1 | 0.552727 | 0.083333 | NaN |
| 2 | 0.189091 | 0.440476 | 0.344828 |
| 3 | 0.625455 | 0.809524 | 0.534483 |
| 4 | 0.618182 | 0.202381 | 0.827586 |
| ... | ... | ... | ... |
| 251 | 0.723636 | 0.904762 | 0.655172 |
| 252 | 0.272727 | 0.488095 | 0.413793 |
| 253 | 0.221818 | 0.821429 | 0.310345 |
| 254 | 0.596364 | 0.119048 | 0.827586 |
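
A fitted scaler is normally reused on unseen data. The sketch below uses two small made-up frames (`train_toy` and `new_toy` are hypothetical names and values) to show that values outside the fitted range land outside [0, 1], and that passing `index=` keeps the original row labels.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy training data (hypothetical values): fitted min is 180, fitted max is 220
train_toy = pd.DataFrame({"flipper_length_mm": [180.0, 200.0, 220.0]},
                         index=[24, 143, 208])

# Toy unseen data: 170 is below the fitted min, 230 is above the fitted max
new_toy = pd.DataFrame({"flipper_length_mm": [170.0, 210.0, 230.0]},
                       index=[7, 8, 9])

# Fit on the training data only, then transform the unseen data
scaler = MinMaxScaler().fit(train_toy)

# Pass the original index so the scaled rows can be joined back later
scaled_new = pd.DataFrame(
    scaler.transform(new_toy),
    columns=new_toy.columns,
    index=new_toy.index,
)
print(scaled_new)
```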


<hr style="border:2px solid black">

## 4. Feature Expansion

- **discussed in notebook *1_intro_to_fe***

<hr style="border:2px solid black">

## 5. Discretization

#### [`KBinsDiscretizer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html#sklearn.preprocessing.KBinsDiscretizer)

- Bins continuous data into intervals.
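
As a minimal sketch of what binning does, the example below discretizes six made-up values into three equal-width bins. It uses `strategy='uniform'` and `encode='ordinal'` so the result is easy to read by eye; these settings differ from the ones used in the notebook's own example.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy column (hypothetical values), shaped (n_samples, 1) as scikit-learn expects
X = np.array([[170.0], [180.0], [190.0], [200.0], [210.0], [230.0]])

# Three equal-width bins between the min (170) and the max (230);
# 'ordinal' returns the bin number of each sample instead of one-hot columns
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
codes = disc.fit_transform(X).ravel()

print(disc.bin_edges_[0])  # learned bin edges
print(codes)               # bin index of each sample
```

Each bin includes its lower edge, so 190 falls into the second bin and 210 into the third; only the last bin also includes its upper edge.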

**example: discretize `flipper_length_mm`**

In [15]:
from sklearn.preprocessing import KBinsDiscretizer

In [16]:
kbins = KBinsDiscretizer(
    n_bins=3,
    encode='onehot-dense',
    strategy='quantile'
)

- The following line of code will result in `ValueError`
- Read the error message and try to fix the bug

In [214]:
#  kbins.fit(X_train[['flipper_length_mm']])

**Hint**
- Try `X_train.isna().sum()` to see if there are missing values in numerical columns
- If yes, impute the missing values, say the way it was done in notebook _1_intro_to_fe_
- Then run the above code again, and proceed to the `.transform()` step

In [17]:
knn_imputer = KNNImputer(n_neighbors=5)
# Single-column fit: missing values are filled with the column's training mean
knn_imputer.fit(X_train[['flipper_length_mm']])

X_train['flipper_length_mm'] = knn_imputer.transform(
    X_train[['flipper_length_mm']]
).flatten()
X_train.isna().sum()

bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
species              0
island               0
sex                  6
dtype: int64

In [18]:
kbins = KBinsDiscretizer(
    n_bins=3,
    encode='onehot-dense',
    strategy='quantile'
)

kbins.fit(X_train[['flipper_length_mm']])
t = kbins.transform(X_train[['flipper_length_mm']])
t.shape

(256, 3)

In [226]:
# BONUS: see bin ranges

# Fit the KBinsDiscretizer on the 'flipper_length_mm' column of the training data
kbins.fit(X_train[['flipper_length_mm']])

# Define human-readable bin names for the different categories of 'flipper_length_mm'
bin_names = ['small_flipper', 'medium_flipper', 'large_flipper']

# Retrieve the bin edges of the first (and only) feature, rounded to one decimal place
edges = kbins.bin_edges_[0].round(1)

# Print each bin name with its range; every bin includes its lower edge,
# and only the last bin also includes its upper edge
for name, low, high in zip(bin_names, edges[:-1], edges[1:]):
    print(f"{name}: {low} - {high}")

small_flipper: 172.0 - 192.0
medium_flipper: 192.0 - 210.0
large_flipper: 210.0 - 230.0


---

**With `KBinsDiscretizer`, what do the following hyperparameter choices mean?**
- `n_bins=3`
- `encode='onehot-dense'`
- `strategy='quantile'`