# Feature Engineering Using Scikit-Learn: Column Transformer and Pipeline

### Do You Want to Build a Snowman?

Let's prep some data for a model that predicts the height of a snowman.

<img src = "./images/olaf.jpeg">

#### Load Packages

In [7]:
# data analysis stack
import numpy as np
import pandas as pd

# data visualization stack
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')

# machine-learning stack
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import (
    OneHotEncoder,
    StandardScaler,
    RobustScaler,
    MinMaxScaler,
    KBinsDiscretizer,
    PolynomialFeatures,
    FunctionTransformer
)

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# miscellaneous
import warnings
warnings.filterwarnings("ignore")

#### Load Data

In [20]:
data = {
    'temp' : [-3, 5, 0, 7, 3, -1, 1, None, -6, 3, 0, -1, None, -2],
    'lunch' : ['soup', 'sandwich', 'soup', 'burger', 'sandwich', 'soup', 'cereal', 'salad', 'sandwich', 'burger', 'soup', 'cereal', 'burger', 'soup'],
    'dinner' : ['pizza', 'pizza', 'noodles', None, 'fishsticks', 'pizza', None, 'fishsticks', 'noodles', 'pizza', None, 'pizza', 'fishsticks', 'pizza'],
    'precipitation' : ['yes', 'no', 'yes', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'no'],
    'heigth_snowman_cm' : [100, 0, 75, 0, 20, 25, 0, 35, 170, 0, 85, 85, 45, 0]
}

df_train = pd.DataFrame(data=data)
df_train

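| | temp | lunch | dinner | precipitation | heigth_snowman_cm |
|---|---|---|---|---|---|
| 0 | -3.0 | soup | pizza | yes | 100 |
| 1 | 5.0 | sandwich | pizza | no | 0 |
| 2 | 0.0 | soup | noodles | yes | 75 |
| 3 | 7.0 | burger | | yes | 0 |
| 4 | 3.0 | sandwich | fishsticks | yes | 20 |
| 5 | -1.0 | soup | pizza | yes | 25 |
| 6 | 1.0 | cereal | | no | 0 |
| 7 | | salad | fishsticks | yes | 35 |
| 8 | -6.0 | sandwich | noodles | yes | 170 |
| 9 | 3.0 | burger | pizza | no | 0 |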


#### Exercise
Transform the data above with scikit-learn tools, combined through `ColumnTransformer()` and `Pipeline()`, so that it is suitable for modeling.
+ **Separate the DataFrame `df_train` into `X_train` and `y_train`**
   + Our target variable is `heigth_snowman_cm`
+ **Preprocess `X_train`**:
  + Identify which variables are **binary**, **categorical** and **numeric**
  + Check which variables have **missing values**
    + **Impute missing values** as needed using an appropriate strategy
  + Determine whether categorical variables have **non-numeric values**
    + **Encode categorical variables** using techniques such as one-hot encoding
  + Determine whether numeric variables are on different scales
    + **Scale numeric variables**
+ **Create `X_train_fe`**:
    + Once the preprocessing steps are completed, compile the transformed columns into a new DataFrame called `X_train_fe`.
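
The inspection steps in the list above can be sketched with plain pandas. This is a minimal illustration on a small, made-up frame (`toy` and the helper lists are hypothetical names, not part of the exercise), and treating "exactly two distinct non-missing values" as binary is an assumption of this sketch.

```python
import pandas as pd

# Small hypothetical frame that mimics the structure of df_train
toy = pd.DataFrame({
    'temp': [-3.0, 5.0, None, 7.0],
    'dinner': pd.Series(['pizza', None, 'noodles', 'fishsticks'], dtype=object),
    'precipitation': pd.Series(['yes', 'no', 'yes', 'yes'], dtype=object),
})

# Count missing values per column
missing_counts = toy.isna().sum()

# Split columns by dtype: numeric vs. non-numeric
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
non_numeric_cols = toy.select_dtypes(exclude='number').columns.tolist()

# Assumption: a non-numeric column with exactly two distinct values is binary
binary_cols = [c for c in non_numeric_cols if toy[c].nunique(dropna=True) == 2]
categorical_cols = [c for c in non_numeric_cols if c not in binary_cols]

print(missing_counts)
print(numeric_cols, binary_cols, categorical_cols)
```

The same calls can be pointed at `X_train` once it has been separated from the target.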


1. **Separate the DataFrame `df_train` into `X_train` and `y_train`**

In [21]:
target = 'heigth_snowman_cm'

In [22]:
# Feature matrix: every column except the target
X_train = df_train.drop(target, axis=1)
# Target vector
y_train = df_train[target]

2. **Preprocess `X_train`**

In [28]:
# Create a function to fill in missing values via interpolation
def interpolate(X,column):
    X_copy = X.copy()
    X_copy[f'{column}_interpoleted'] = X_copy[[column]].interpolate()
    return X_copy[[f'{column}_interpoleted']]
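
To see what the helper does in isolation, here is a small sketch that wraps a copy of it in `FunctionTransformer` and runs it on a made-up column (`toy` is a hypothetical frame). It assumes pandas' default linear interpolation, which fills a gap from its neighbours but leaves a leading missing value untouched.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Copy of the notebook's helper so that this sketch is self-contained
def interpolate(X, column):
    X_copy = X.copy()
    X_copy[f'{column}_interpoleted'] = X_copy[[column]].interpolate()
    return X_copy[[f'{column}_interpoleted']]

# Hypothetical data: one leading gap and one gap between 1.0 and -6.0
toy = pd.DataFrame({'temp': [None, 1.0, None, -6.0]})

temp_filler = FunctionTransformer(interpolate, kw_args={'column': 'temp'})
filled = temp_filler.fit_transform(toy)

# The inner gap becomes the midpoint of its neighbours: (1.0 + -6.0) / 2 = -2.5
print(filled)
```

Because interpolation depends on the neighbouring rows, the result changes if the rows are reordered, and nothing is learned during `fit`. Whether that is acceptable depends on the data; it is worth keeping in mind before applying the same transformer to a separate test set.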

We want to apply two transformations to the `dinner` column in sequence:
1. Imputation
2. Encoding

`Pipeline()` is built for this: it chains the steps so that the output of one becomes the input of the next.

In [26]:
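# Define the steps of the pipeline
dinner_steps = [
    # 1. replace missing entries (None) with the most frequent dinner
    ('imputer', SimpleImputer(missing_values=None, strategy='most_frequent')),
    # 2. one-hot encode; drop='first' removes one redundant dummy column
    ('ohe', OneHotEncoder(sparse_output=False, drop='first'))
]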

In [27]:
# Instantiate the dinner pipeline object  
dinner_pipeline = Pipeline(steps=dinner_steps)
dinner_pipeline

In [29]:
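trasformers = [
    # fill gaps in temp by interpolation
    ('temp_interpolation', FunctionTransformer(interpolate, kw_args={'column': 'temp'}), ['temp']),
    # impute, then one-hot encode dinner
    ('dinner_transformation', dinner_pipeline, ['dinner']),
    # one-hot encode lunch; 'dinner' is listed again here, so it is also encoded
    # a second time without imputation (the ohe__dinner_* columns in the output)
    ('ohe', OneHotEncoder(sparse_output=False, drop='first'), ['lunch', 'dinner'])
]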

In [14]:
fe_column_transformer = ColumnTransformer(transformers=trasformers, remainder='drop').set_output(transform='pandas')
fe_column_transformer

In [31]:
fe_column_transformer.fit(X_train)

In [33]:
X_train_fe = fe_column_transformer.transform(X_train)
X_train_fe

| | temp_interpolation__temp_interpoleted | dinner_transformation__dinner_noodles | dinner_transformation__dinner_pizza | ohe__lunch_cereal | ohe__lunch_salad | ohe__lunch_sandwich | ohe__lunch_soup | ohe__dinner_noodles | ohe__dinner_pizza | ohe__dinner_None |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -3.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 1 | 5.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 3 | 7.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | -1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 6 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 7 | -2.5 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | -6.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 9 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
