# Advanced Feature Engineering II
<hr style="border:2px solid black">

This notebook introduces two new tools for the ML workflow. Some of the code repeats material from previous notebooks, which gives readers a chance to become more familiar and comfortable with the syntax. The new tools are:
- `ColumnTransformer`
- `Pipeline`

## Penguin Data

**load packages**

In [1]:
# data analysis stack
import numpy as np
import pandas as pd

# machine-learning stack
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import (
    OneHotEncoder,
    StandardScaler,
    MinMaxScaler,
    PolynomialFeatures,
    FunctionTransformer
)
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# miscellaneous
import warnings
warnings.filterwarnings("ignore")

**read data**

In [2]:
df = pd.read_csv('../data/penguins.csv')
df.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| 4 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |


### 1.1 Train-Test split

In [4]:
train,test = train_test_split(df, test_size=0.2, random_state=101)
train.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)

### 1.2 Quick exploration

In [5]:
train.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Dream | 36.5 | 18.0 | 182.0 | 3150.0 | Female |
| 1 | Gentoo | Biscoe | 47.2 | 15.5 | 215.0 | 4975.0 | Female |
| 2 | Gentoo | Biscoe | 46.3 | 15.8 | 215.0 | 5050.0 | Male |
| 3 | Gentoo | Biscoe | 50.4 | 15.3 | 224.0 | 5550.0 | Male |
| 4 | Chinstrap | Dream | 45.9 | 17.1 | 190.0 | 3575.0 | Female |


In [6]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 273 entries, 0 to 272
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            273 non-null    object 
 1   island             273 non-null    object 
 2   bill_length_mm     273 non-null    float64
 3   bill_depth_mm      271 non-null    float64
 4   flipper_length_mm  272 non-null    float64
 5   body_mass_g        273 non-null    float64
 6   sex                266 non-null    object 
dtypes: float64(4), object(3)
memory usage: 15.1+ KB


### 1.3 Feature-Target Separation

In [7]:
# target column name
target = 'body_mass_g'

In [8]:
# feature column names
features = list(set(train.columns) - set([target]))
features

['species',
 'island',
 'flipper_length_mm',
 'bill_length_mm',
 'sex',
 'bill_depth_mm']

In [9]:
# feature and target columns
X_train, y_train = train[features], train[target]
X_train.shape, y_train.shape

((273, 6), (273,))

In [10]:
cat_features = list(X_train.select_dtypes(include=['object']).columns)
cat_features

['species', 'island', 'sex']

In [11]:
num_features = list(X_train.select_dtypes(exclude=['object']).columns)
num_features

['flipper_length_mm', 'bill_length_mm', 'bill_depth_mm']

<hr style="border:2px solid black">

- In what follows, the reader is expected to reproduce the ML steps from notebook *2_advanced_fe*, using 
    + `Pipeline` instead of `make_pipeline`
    + `ColumnTransformer` instead of `make_column_transformer`

- Just uncomment the pieces of code, and proceed by running them one by one. 

<hr style="border:2px solid black">

## Feature Engineering

### Preprocessing Pipelines

**numerical columns**

In [12]:
num_transformer = Pipeline(
    steps=[
         # each step is a (name, transformer) tuple; the names are used later to access the steps
         ('imputer_num', SimpleImputer(strategy='mean')),  # fill missing values with the column mean
         ('scaling_num', StandardScaler()),                # standardize to zero mean and unit variance
         ('poly_num', PolynomialFeatures(degree=2))        # add polynomial terms up to degree 2
    ])

In [13]:
type(num_transformer)

sklearn.pipeline.Pipeline
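
To see what the three numerical steps do together, here is a minimal sketch on a made-up array (the values and the names `X_toy` and `toy_num_transformer` are illustrative assumptions, not part of the penguin data). It reuses the same step names as above and checks the shape of the result.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# toy data (made up): 4 rows, 2 numerical columns, one missing value
X_toy = np.array([
    [180.0, 39.0],
    [190.0, np.nan],
    [200.0, 45.0],
    [210.0, 48.0],
])

toy_num_transformer = Pipeline(
    steps=[
        ('imputer_num', SimpleImputer(strategy='mean')),
        ('scaling_num', StandardScaler()),
        ('poly_num', PolynomialFeatures(degree=2)),
    ])

X_out = toy_num_transformer.fit_transform(X_toy)

# 2 input columns -> 1 constant + 2 linear + 3 quadratic terms = 6 columns
print(X_out.shape)
```

With the three numerical features used in this notebook, the same degree-2 expansion would produce 1 + 3 + 6 = 10 columns.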

**categorical columns**

In [14]:
# Pipeline takes its steps as a list of (name, transformer) tuples.
# Unlike make_pipeline, it requires an explicit name for every step.
cat_transformer = Pipeline(
    steps=[
        # Step 1: Impute missing values in the categorical data
        # Use SimpleImputer with strategy='most_frequent' to fill missing values
        # with the most frequent (mode) value in the column.
        ('imputer_cat', SimpleImputer(strategy='most_frequent')),
        
        # Step 2: Apply one-hot encoding to the categorical data
        # Use OneHotEncoder to convert categorical values into binary columns.
        # drop='first' drops the first category to avoid multicollinearity (dummy variable trap).
        # handle_unknown='ignore' encodes any unknown category encountered during transformation
        # as all zeros rather than raising an error.
        ('onehot_cat', OneHotEncoder(drop='first', handle_unknown='ignore'))
    ])
 

In [15]:
type(cat_transformer)

sklearn.pipeline.Pipeline

**total preprocessing**

In [16]:
preprocessor = ColumnTransformer(
    transformers=[
        ('transformer_num', num_transformer, num_features), # apply num_transformer to the columns listed in num_features
        ('transformer_cat', cat_transformer, cat_features)  # apply cat_transformer to the columns listed in cat_features
    ])
# Each entry is a (name, transformer, columns) tuple:
# 'transformer_num': a name for the transformer, used to access it later.
# num_transformer: the pipeline of preprocessing steps for the numerical features.
# num_features: the list of column names (or indices) that the transformer is applied to.
   

In [17]:
type(preprocessor)

sklearn.compose._column_transformer.ColumnTransformer
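
As a rough illustration of how a `ColumnTransformer` routes columns, the sketch below uses a made-up frame (`toy` and its `comment` column are assumptions for illustration) and single transformers in place of the pipelines above. Columns that are not listed in any transformer are dropped, because `remainder` defaults to `'drop'`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# made-up toy frame with two numerical, one categorical and one unused column
toy = pd.DataFrame({
    'flipper_length_mm': [181.0, 195.0, 215.0, 224.0],
    'bill_depth_mm': [18.7, 18.0, 15.5, 15.3],
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Chinstrap'],
    'comment': ['a', 'b', 'c', 'd'],  # not listed in any transformer
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ('transformer_num', StandardScaler(), ['flipper_length_mm', 'bill_depth_mm']),
        ('transformer_cat', OneHotEncoder(), ['species']),
    ])

toy_out = toy_preprocessor.fit_transform(toy)

# 2 scaled columns + 3 one-hot columns; 'comment' is dropped
print(toy_out.shape)
```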

**What similarities and differences did you notice between ...**
- `Pipeline` and `make_pipeline`
- `ColumnTransformer` and `make_column_transformer`?
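
One difference worth checking yourself is how the steps get their names. The following sketch (variable names such as `explicit` and `automatic` are only illustrative) builds the same two steps both ways and compares the resulting names.

```python
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# explicit names chosen by the user
explicit = Pipeline(steps=[
    ('imputer_num', SimpleImputer()),
    ('scaling_num', StandardScaler()),
])
# names generated automatically from the lowercased class names
automatic = make_pipeline(SimpleImputer(), StandardScaler())

explicit_names = [name for name, _ in explicit.steps]
automatic_names = [name for name, _ in automatic.steps]

# make_column_transformer takes (transformer, columns) pairs and names them the same way
auto_ct = make_column_transformer(
    (StandardScaler(), ['bill_length_mm']),
    (OneHotEncoder(), ['species']),
)
auto_ct_names = [name for name, _, _ in auto_ct.transformers]

print(explicit_names)
print(automatic_names)
print(auto_ct_names)
```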

### Model Building

**instantiate model**

In [18]:
linear_model = Pipeline(
    steps=[
        ('preprocessor_name', preprocessor),
        ('regressor', LinearRegression())
    ])

**train model**

In [19]:
linear_model.fit(X_train,y_train)

**model validation**

In [20]:
training_score = linear_model.score(X_train,y_train)
print(f"training r2 score: {round(training_score, 6)}")

training r2 score: 0.879775


### Model Evaluation

**feature-target separation**

In [21]:
X_test, y_test = test[features], test[target]

**model performance**

In [22]:
test_score = linear_model.score(X_test,y_test)
print(f"test r2 score: {round(test_score, 6)}")

test r2 score: 0.845228


<hr style="border:2px solid black">

## Exploring the Pipeline Interior

As you may have noticed, each transformer in a `ColumnTransformer` and each step in a `Pipeline` must be given a name. This is in contrast to their simpler cousins, `make_column_transformer` and `make_pipeline`. While the explicit versions may look clumsy, they come with a considerable advantage: the chosen names give the ML practitioner access to the inside of the ML workflow, which makes it possible to tweak any part of it if needed (say, for hyperparameter optimization, which will be covered soon). 
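
As a minimal sketch of this kind of access, nested parameters can be addressed by joining the step names with double underscores. The example below rebuilds a small pipeline with the same step names on synthetic data (`X_toy`, `y_toy`, `num_pipe` and `model` are assumptions for illustration) and changes the polynomial degree without rebuilding anything.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# synthetic data, only for illustration
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(20, 2)), columns=['a', 'b'])
y_toy = 3 * X_toy['a'] - X_toy['b']

num_pipe = Pipeline(steps=[
    ('imputer_num', SimpleImputer(strategy='mean')),
    ('scaling_num', StandardScaler()),
    ('poly_num', PolynomialFeatures(degree=2)),
])
model = Pipeline(steps=[
    ('preprocessor_name', ColumnTransformer(
        transformers=[('transformer_num', num_pipe, ['a', 'b'])])),
    ('regressor', LinearRegression()),
])

# path to the nested parameter: step names joined by double underscores
param = 'preprocessor_name__transformer_num__poly_num__degree'
degree_before = model.get_params()[param]
model.set_params(**{param: 3})
degree_after = model.get_params()[param]

model.fit(X_toy, y_toy)
# degree 3 on 2 features yields 10 polynomial terms, hence 10 coefficients
n_coefficients = model.named_steps['regressor'].coef_.shape[0]
print(degree_before, degree_after, n_coefficients)
```

The same double-underscore names are what hyperparameter search tools in scikit-learn expect as keys of their parameter grids.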

**What are the model hyperparameters?**

In [23]:
hyperparameters = linear_model.get_params()
hyperparameters

{'memory': None,
 'steps': [('preprocessor_name',
   ColumnTransformer(transformers=[('transformer_num',
                                    Pipeline(steps=[('imputer_num',
                                                     SimpleImputer()),
                                                    ('scaling_num',
                                                     StandardScaler()),
                                                    ('poly_num',
                                                     PolynomialFeatures())]),
                                    ['flipper_length_mm', 'bill_length_mm',
                                     'bill_depth_mm']),
                                   ('transformer_cat',
                                    Pipeline(steps=[('imputer_cat',
                                                     SimpleImputer(strategy='most_frequent')),
                                                    ('onehot_cat',
                                                     One

**What are the steps in the ML workflow?**

In [30]:
steps = linear_model.named_steps
steps

{'preprocessor_name': ColumnTransformer(transformers=[('transformer_num',
                                  Pipeline(steps=[('imputer_num',
                                                   SimpleImputer()),
                                                  ('scaling_num',
                                                   StandardScaler()),
                                                  ('poly_num',
                                                   PolynomialFeatures())]),
                                  ['flipper_length_mm', 'bill_length_mm',
                                   'bill_depth_mm']),
                                 ('transformer_cat',
                                  Pipeline(steps=[('imputer_cat',
                                                   SimpleImputer(strategy='most_frequent')),
                                                  ('onehot_cat',
                                                   OneHotEncoder(drop='first',
                                

**What are the hyperparameters in the "regressor" step?**

In [31]:
regressor_hyperparameters = steps.regressor.get_params()
regressor_hyperparameters

{'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'positive': False}

**What are the transformers inside the "preprocessor" step?**

In [32]:
# the step was named 'preprocessor_name' when the Pipeline was built
preprocessor_transformers = steps.preprocessor_name.transformers_
preprocessor_transformers

**What are the hyperparameters in the "preprocessor" step?**

In [29]:
# the step was named 'preprocessor_name' when the Pipeline was built
preprocessor_hyperparameters = steps.preprocessor_name.get_params()
preprocessor_hyperparameters

<hr style="border:2px solid black">

## References

- [How to add feature engineering to a scikit-learn pipeline](https://practicaldatascience.co.uk/machine-learning/how-to-add-feature-engineering-to-a-scikit-learn-pipeline)

- [Coding a custom imputer in scikit-learn](https://towardsdatascience.com/coding-a-custom-imputer-in-scikit-learn-31bd68e541de)