# Feature Engineering with Pipeline and Column Transformer
> Derive a *better* representation of your input features `X`
<hr style="border:2px solid black">

### new `sklearn` concepts that we explore today

In [21]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.preprocessing import MinMaxScaler

### and the usual suspects ...

In [93]:
# data analysis stack
import pandas as pd


# machine-learning feature-engineering stack 

## imputation
from sklearn.impute import (
    SimpleImputer,
    KNNImputer, 
)

## encoding-stack
from sklearn.preprocessing import (
    OneHotEncoder,
    OrdinalEncoder, 
)

## feature-scaling stack
from sklearn.preprocessing import (
    StandardScaler,
    MinMaxScaler,
    RobustScaler,
    PowerTransformer
)

## continuous feature discretization
from sklearn.preprocessing import KBinsDiscretizer

## polynomial feature
from sklearn.preprocessing import PolynomialFeatures

# combining different transformers
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline

# machine-learning stack
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# miscellaneous
import warnings
warnings.filterwarnings("ignore")

# metrics
from sklearn.metrics import confusion_matrix, classification_report


from sklearn import set_config
set_config(transform_output="pandas")  # transformers return pandas DataFrames instead of NumPy arrays

---
## Business Goal:
> Predict a penguin's sex from species, island, culmen length, culmen depth, flipper length and body mass

## Get Data

In [23]:
df = pd.read_csv('data/penguins_unclean.csv')
df.head()

Unnamed: 0,species,culmen_length,culmen_depth,flipper_length,body_mass,sex
0,Adelie,39.1,18.7,181.0,3750.0,MALE
1,Adelie,39.5,17.4,186.0,,FEMALE
2,Adelie,40.3,18.0,195.0,,FEMALE
3,Adelie,36.7,19.3,193.0,3450.0,FEMALE
4,Adelie,39.3,20.6,190.0,3650.0,MALE


## Select columns for X and y

In [24]:
y = df['sex']
X = df.drop(['sex'], axis=1)

## Perform a train test split

In [34]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=420)

## EDA
- which variables have missing values?
- which variables are binary, categorical, metric?
- do the categorical variables have non-numeric values?
- do the metric features vary on different scales?
- ...
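
The questions above can be answered with a few pandas calls. The following is a minimal sketch on a made-up toy frame (the values and the name `toy` are assumptions for illustration, not the penguin data itself); the same calls should work on `df_train`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics some penguin columns; all values are made up
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", None, "Chinstrap"],
    "flipper_length": [181.0, np.nan, 210.0, 195.0],
    "body_mass": [3750.0, 5000.0, np.nan, np.nan],
    "sex": ["MALE", "FEMALE", "FEMALE", "MALE"],
})

# which variables have missing values?
missing = toy.isna().sum()

# which variables are non-numeric (candidates for encoding) and which are numeric?
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# how many distinct values does each categorical variable take? (2 means binary)
cardinality = toy[categorical_cols].nunique()

# do the metric features live on different scales?
scales = toy[numeric_cols].describe().loc[["min", "max", "mean", "std"]]

print(missing, cardinality, scales, sep="\n\n")
```

Columns with missing values are candidates for imputation, non-numeric columns for encoding, and numeric columns with very different ranges for scaling.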

In [35]:
# Recombine x_train and y_train so that EDA only looks at the training data
df_train = pd.concat([x_train,y_train], axis=1)

In [36]:
df_train.isna().sum()

species            1
culmen_length      0
culmen_depth       0
flipper_length     2
body_mass         15
sex                0
dtype: int64

## Feature Engineering

### The Pipeline

To apply several feature-engineering (FE) techniques to the same columns, we chain them into our own custom preprocessor.

- A `Pipeline` runs several transformers/preprocessors one after another, for example imputing followed by scaling.
- A `Pipeline` is itself a transformer, so it can be used inside a `ColumnTransformer`.
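
As a minimal sketch of the "imputing followed by scaling" case, the snippet below chains a `SimpleImputer` and a `MinMaxScaler` on a single made-up numeric column. The array `X_toy` and the name `impute_then_scale` are assumptions for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# made-up single numeric column with one missing value
X_toy = np.array([[3000.0], [np.nan], [4000.0], [5000.0]])

# step 1 fills the gap with the median (4000), step 2 rescales to [0, 1]
impute_then_scale = make_pipeline(
    SimpleImputer(strategy="median"),
    MinMaxScaler(),
)

X_out = np.asarray(impute_then_scale.fit_transform(X_toy))
print(X_out.ravel())  # values: 0.0, 0.5, 0.5, 1.0
```

Because the chained object exposes `fit` and `transform` like any single transformer, it can be passed wherever one transformer is expected.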

In [88]:
df_train

# define columns to transform
median_impute = ['body_mass']
species = ['species']
knn_impute = ['flipper_length']


species_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')), 
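    # `sparse` was renamed to `sparse_output` in scikit-learn 1.2; pandas output needs a dense result
    ('onehot', OneHotEncoder(sparse_output=False, drop='first'))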
])

# define column transformer
transformers = [
    # note: this step is named 'median' but imputes with the mean (strategy='mean')
    ('median', SimpleImputer(strategy='mean'), median_impute),
    ('knn', KNNImputer(n_neighbors=3), knn_impute),
    ('frequent_dummy', species_transformer, species),
]

column_transformer = ColumnTransformer(transformers)
column_transformer

In [89]:
# build pipeline
pipeline = Pipeline([
    ('preprocessor', column_transformer),
    ('classifier', LogisticRegression())
])

pipeline

In [90]:
pipeline.named_steps['preprocessor'].fit(x_train)
pipeline.named_steps['preprocessor'].transform(x_train)

Unnamed: 0,median__body_mass,knn__flipper_length,frequent_dummy__species_Chinstrap,frequent_dummy__species_Gentoo
42,3450.000000,190.0,0.0,0.0
7,3200.000000,182.0,0.0,0.0
251,4350.000000,208.0,0.0,1.0
139,3650.000000,185.0,0.0,0.0
21,3550.000000,183.0,0.0,0.0
...,...,...,...,...
185,4500.000000,205.0,1.0,0.0
115,3500.000000,198.0,0.0,0.0
287,5800.000000,230.0,0.0,1.0
63,4225.697211,198.0,0.0,0.0


In [94]:
# Fit the pipeline
pipeline.fit(x_train, y_train)

# Make predictions
y_pred = pipeline.predict(x_test)

In [95]:
# performance

confusion_matrix(y_test, y_pred)

array([[29,  4],
       [15, 19]])
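
The confusion matrix can be turned into summary metrics by hand. The sketch below copies the four counts from the output above and assumes the scikit-learn convention that rows are true classes and columns are predicted classes, with labels in sorted order (so presumably `FEMALE` first, then `MALE`).

```python
import numpy as np

# counts copied from the confusion matrix shown above
# rows = true classes, columns = predicted classes (scikit-learn convention)
cm = np.array([[29, 4],
               [15, 19]])

# share of correct predictions: (29 + 19) / 67
accuracy = np.trace(cm) / cm.sum()

# per-class recall: correct predictions divided by the row totals (29/33, 19/34)
recall_per_class = np.diag(cm) / cm.sum(axis=1)

# per-class precision: correct predictions divided by the column totals (29/44, 19/23)
precision_per_class = np.diag(cm) / cm.sum(axis=0)

print(f"accuracy:  {accuracy:.3f}")
print("recall:   ", recall_per_class.round(3))
print("precision:", precision_per_class.round(3))
```

The already imported `classification_report(y_test, y_pred)` reports the same per-class precision and recall without the manual arithmetic.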

### The ColumnTransformer

To apply different FE techniques to different columns of our raw data, we can use a `ColumnTransformer`.

- A `ColumnTransformer` applies different transformers/preprocessors to different columns of your `DataFrame`

In [None]:
# each entry is a (name, transformer, columns) tuple
fe = ...

In [None]:
# fit the column transformer on the training data
...

# transform the training data
Xtrain_tran = ...

In [None]:
# transform the test data
Xtest_tran = ...

## Train Model(s)

### Fit the model on the (transformed) training data

In [None]:
# initialize the model 
...
# fit the model on the transformed training data
...

In [None]:
...

## Evaluate the model on the (transformed) test data

In [None]:
# calculate predictions with the transformed test data

# calculate an accuracy score
...

## 🌶️🌶️🌶️Bonus🌶️🌶️🌶️
**Applying Feature Engineering and Modeling in one go**

In [None]:
fe

In [None]:
# Build the model together with the feature engineering
one_go_mlr = ...
one_go_mlr

In [None]:
# train the model
...