# Scikit-learn Pipeline and ColumnTransformer

## Introduction
- Reference: [How to Improve Machine Learning Code Quality with Scikit-learn Pipeline and ColumnTransformer](https://www.freecodecamp.org/news/machine-learning-pipeline/)
### Scikit-learn `Pipeline`
- Before training a model, split your data into a training set and a test set. Both sets go through the same cleaning and preprocessing steps before they reach the model, but those steps should be fitted on the training set only.
- The Scikit-learn `Pipeline` chains all data-manipulation steps (and optionally the final model) into a single estimator.
    - This also makes it easier to run `GridSearchCV` without data leakage, because the preprocessing steps are refitted on the training folds of each split.
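
As a minimal sketch of the leakage point, the example below uses synthetic data (an assumption, not the dataset used later in this notebook) to show that a scaler inside a `Pipeline` learns its statistics from the training rows only. The names `X_demo`, `y_demo` and `demo_pipe` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 10).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
demo_pipe.fit(X_tr, y_tr)  # the scaler is fitted on X_tr only

# The learned mean matches the training rows, not the full dataset
learned_mean = demo_pipe.named_steps["scale"].mean_
test_accuracy = demo_pipe.score(X_te, y_te)  # X_te is only transformed, never fitted
```

Calling `score()` or `predict()` on the test set reuses the statistics learned during `fit()`, which is what keeps test information out of the preprocessing.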
<p align="center"><img src="../../assets/img/sklearn-pipeline.png" width=500></p>

- The `Pipeline` constructor takes a list of (name, estimator) pairs (2-tuples) defining a sequence of steps.
```Python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("standardize", StandardScaler()),
])
```
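- If you don’t want to name the transformers, you can use the `make_pipeline()` function instead.
```Python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

num_pipeline = make_pipeline(
    OutlierRemover(),  # custom transformer, assumed to be defined elsewhere
    SimpleImputer(strategy="median"),
    MinMaxScaler(),
)
```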
### Scikit-learn `ColumnTransformer`
- `ColumnTransformer` applies a different transformer to each group of DataFrame columns and concatenates the results. This is useful when numerical and categorical features need different preprocessing.

<p align="center"><img src="../../assets/img/sklearn-columntransformer.png" width=500></p>

- For example, the following `ColumnTransformer` will apply `num_pipeline` (the one which is defined above) to the numerical attributes and `cat_pipeline` to the categorical attribute:

```Python
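from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder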

num_attribs = ["longitude", "latitude", "housing_median_age", "total_rooms",
               "total_bedrooms", "population", "households", "median_income"]
cat_attribs = ["ocean_proximity"]

cat_pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"))

preprocessing = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs),
])
```
- If you don’t care about naming the transformers, you can use `make_column_transformer()`.

```Python
import numpy as np
from sklearn.compose import make_column_selector, make_column_transformer

preprocessing = make_column_transformer(
    (num_pipeline, make_column_selector(dtype_include=np.number)),
    # "category" only matches pandas categorical columns;
    # use dtype_include=object to select plain string (object) columns
    (cat_pipeline, make_column_selector(dtype_include="category")),
)
```
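
As a small illustration, `make_column_transformer()` generates transformer names in the same way `make_pipeline()` does. The column names `"age"` and `"city"` below are hypothetical, and no data is needed to inspect the names.

```python
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

auto_ct = make_column_transformer(
    (StandardScaler(), ["age"]),   # hypothetical numerical column
    (OneHotEncoder(), ["city"]),   # hypothetical categorical column
)

# Each entry is a (name, transformer, columns) triple
transformer_names = [name for name, _, _ in auto_ct.transformers]
print(transformer_names)
```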

#### Column Selector
- Listing every column name by hand is tedious, so Scikit-learn provides `make_column_selector()`, which returns a selector function that picks all columns of a given dtype, such as numerical or categorical.
```Python
import numpy as np
from sklearn.compose import make_column_selector
selector = make_column_selector(dtype_include=np.number)
selected_columns = selector(df)
selected_columns #  ['city_development_index', 'training_hours']
```
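
The sketch below shows how the selector reacts to different dtypes, using a tiny hand-made DataFrame (`toy` and its columns are assumptions for illustration). Note that `dtype_include="category"` matches only pandas categorical columns, not plain `object` columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

toy = pd.DataFrame({
    "hours": [1, 2, 3],
    "index_score": [0.1, 0.2, 0.3],
    "city": pd.Series(["a", "b", "a"], dtype=object),
    "level": pd.Series(["lo", "hi", "lo"], dtype="category"),
})

# Each selector is a callable that returns a list of column names
numeric_cols = make_column_selector(dtype_include=np.number)(toy)
object_cols = make_column_selector(dtype_include=object)(toy)
category_cols = make_column_selector(dtype_include="category")(toy)
```

For a DataFrame read from CSV, string columns are usually not categorical, so a `"category"` selector would return an empty list unless the columns are converted first.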

In [297]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, OrdinalEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV


import pandas as pd
import numpy as np

In [279]:
df = pd.read_csv("../.././data/common_datasets/aug_train.csv")
df.head()

| | enrollee_id | city | city_development_index | gender | relevent_experience | enrolled_university | education_level | major_discipline | experience | company_size | company_type | last_new_job | training_hours | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8949 | city_103 | 0.92 | Male | Has relevent experience | no_enrollment | Graduate | STEM | >20 | NaN | NaN | 1 | 36 | 1.0 |
| 1 | 29725 | city_40 | 0.776 | Male | No relevent experience | no_enrollment | Graduate | STEM | 15 | 50-99 | Pvt Ltd | >4 | 47 | 0.0 |
| 2 | 11561 | city_21 | 0.624 | NaN | No relevent experience | Full time course | Graduate | STEM | 5 | NaN | NaN | never | 83 | 0.0 |
| 3 | 33241 | city_115 | 0.789 | NaN | No relevent experience | NaN | Graduate | Business Degree | <1 | NaN | Pvt Ltd | never | 52 | 1.0 |
| 4 | 666 | city_162 | 0.767 | Male | Has relevent experience | no_enrollment | Masters | STEM | >20 | 50-99 | Funded Startup | 4 | 8 | 0.0 |


In [280]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             19158 non-null  int64  
 1   city                    19158 non-null  object 
 2   city_development_index  19158 non-null  float64
 3   gender                  14650 non-null  object 
 4   relevent_experience     19158 non-null  object 
 5   enrolled_university     18772 non-null  object 
 6   education_level         18698 non-null  object 
 7   major_discipline        16345 non-null  object 
 8   experience              19093 non-null  object 
 9   company_size            13220 non-null  object 
 10  company_type            13018 non-null  object 
 11  last_new_job            18735 non-null  object 
 12  training_hours          19158 non-null  int64  
 13  target                  19158 non-null  float64
dtypes: float64(2), int64(2), object(10)

In [281]:
# Define Sets of Columns to be Transformed in Different Ways
num_cols = ['city_development_index', 'training_hours']

oh_cat_cols = ['gender', 'enrolled_university', 'education_level', 'major_discipline', 'company_size', 'company_type']
ord_cat_cols = ['relevent_experience', 'experience', 'last_new_job']

In [440]:

# Create Pipelines for Numerical and Categorical Features
num_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler())
])
oh_cat_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('one-hot',OneHotEncoder(handle_unknown='ignore'))
])

ord_cat_pipeline = Pipeline(steps=[
    ('impute',  SimpleImputer(strategy='most_frequent')),
    # categories are encoded in sorted (lexicographic) order unless `categories` is given explicitly
    ('ordinal', OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1))
])

In [425]:
from sklearn.compose import ColumnTransformer

col_trans = ColumnTransformer(transformers=[
        ('num_pipeline', num_pipeline, num_cols),
        ('oh_cat_pipeline', oh_cat_pipeline, oh_cat_cols),
        ('ord_cat_pipeline', ord_cat_pipeline, ord_cat_cols),

    ],
    remainder='drop',   # 'drop' or 'passthrough': what to do with columns not listed above
    n_jobs=-1)          # n_jobs=-1 runs the transformers in parallel on all available processors

- You can get the output column names with `col_trans.get_feature_names_out()` and wrap the transformed data in a DataFrame.

In [450]:
transformed_df = pd.DataFrame(col_trans.fit_transform(df), columns=col_trans.get_feature_names_out())
transformed_df.head()

| | num_pipeline__city_development_index | num_pipeline__training_hours | oh_cat_pipeline__gender_Female | oh_cat_pipeline__gender_Male | oh_cat_pipeline__gender_Other | oh_cat_pipeline__enrolled_university_Full time course | oh_cat_pipeline__enrolled_university_Part time course | oh_cat_pipeline__enrolled_university_no_enrollment | oh_cat_pipeline__education_level_Graduate | oh_cat_pipeline__education_level_High School | ... | oh_cat_pipeline__company_size_<10 | oh_cat_pipeline__company_type_Early Stage Startup | oh_cat_pipeline__company_type_Funded Startup | oh_cat_pipeline__company_type_NGO | oh_cat_pipeline__company_type_Other | oh_cat_pipeline__company_type_Public Sector | oh_cat_pipeline__company_type_Pvt Ltd | ord_cat_pipeline__relevent_experience | ord_cat_pipeline__experience | ord_cat_pipeline__last_new_job |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.738919 | -0.488985 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 21.0 | 0.0 |
| 1 | -0.42841 | -0.305825 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 6.0 | 4.0 |
| 2 | -1.66059 | 0.293607 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 15.0 | 5.0 |
| 3 | -0.323026 | -0.222571 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 20.0 | 5.0 |
| 4 | -0.501368 | -0.955209 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 21.0 | 3.0 |


- You can also add the model as the last step, so that one pipeline covers everything from data processing to prediction.

In [426]:
# add a Model to the final pipeline
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000, random_state=0)
clf_pipeline = Pipeline(steps=[
    ('col_trans', col_trans),
    ('model', clf)
])
clf_pipeline

In [428]:
from sklearn.model_selection import train_test_split

X = df.drop(columns='target')
y = df['target']
# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

clf_pipeline.fit(X_train, y_train)
# preds = clf_pipeline.predict(X_test)
score = clf_pipeline.score(X_test, y_test)

print(f"Model score: {score}") # model accuracy

Model score: 0.7656576200417536


### Custom Transformers
- Although Scikit-Learn provides many useful transformers, you will need to write your own for tasks such as custom transformations, cleanup operations, or combining specific attributes.

#### Custom Function Transformer
- For transformations that don’t require any training (i.e. that don’t need `.fit()`), you can write a **function** that takes a NumPy array as input and returns the transformed array, then wrap it in a `FunctionTransformer`.
- The `inverse_func` argument is optional. It lets you specify the inverse transform function, which `inverse_transform()` will call.
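
A minimal round-trip sketch of `inverse_func`, using a small hand-written array (`values` is an assumption for illustration): the log transform is undone by the exponential.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

values = np.array([[1.0], [10.0], [100.0]])

log_tf = FunctionTransformer(np.log, inverse_func=np.exp)
logged = log_tf.transform(values)             # applies np.log
restored = log_tf.inverse_transform(logged)   # applies np.exp
```

Because the transformer is stateless, `transform()` can be called without a prior `fit()`.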


In [432]:
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)
log_training_hours = log_transformer.transform(df['training_hours'])
log_training_hours[:5]

0    3.583519
1    3.850148
2    4.418841
3    3.951244
4    2.079442
Name: training_hours, dtype: float64

- `FunctionTransformer` is also useful to combine features. 
    - For example, here’s a `FunctionTransformer` that computes the ratio between the input features 0 and 1:

In [None]:
ratio_transformer = FunctionTransformer(lambda X: X[:, [0]] / X[:, [1]])
ratio_transformer.transform(df[["people", "room"]].values)

- The transformation function can take hyperparameters as additional arguments, supplied through the `kw_args` argument of `FunctionTransformer`.

In [None]:
def diff_mul(X: np.ndarray, multiplier: int) -> np.ndarray:
    return (X[:, 0] - X[:, 1]) * multiplier

# kw_args is given to the constructor and forwarded to diff_mul on every transform() call
diff_mul_transformer = FunctionTransformer(diff_mul, kw_args={"multiplier": 2})
diff_mul_transformer.transform(df[['ndp', 'discount']].values)

#### Custom Class Transformer
- A custom class transformer learns parameters in its `fit()` method and uses them later in its `transform()` method.
- A custom class transformer requires:
    - `BaseEstimator` as a base class (and no `*args` or `**kwargs` in the constructor). This gives you two extra methods, `get_params()` and `set_params()`, which are needed for automatic hyperparameter tuning.
    - `TransformerMixin` as a base class, which provides `.fit_transform()` automatically.
    - A `fit(X, y=None)` method that returns `self`. The `y=None` argument is required even if it is unused.
        - Note: do not assume `X` is a pandas DataFrame. Inside a `Pipeline`, each step receives the output of the previous step, which is typically a NumPy array.
    - A `transform()` method.

In [436]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted
from typing import Union

class StandardScalerClone(BaseEstimator, TransformerMixin): 
    def __init__(self, with_mean=True):                 # [REQUIRED] no *args or **kwargs as using BaseEstimator as a base class
        self.with_mean = with_mean

    def fit(self, X: np.ndarray, y: np.ndarray=None):   # [REQUIRED] y is required even though we don't use it
        X = check_array(X)                              # checks that X is an array with finite float values

        self.mean_ = X.mean(axis=0)                     # [REQUIRED] learned attributes must end with "_"
        self.scale_ = X.std(axis=0)
        self.n_features_in_ = X.shape[1]                # [REQUIRED] every estimator stores this in fit()
        
        return self                                     # [REQUIRED] always return self!

    def transform(self, X: Union[pd.DataFrame, np.ndarray]) -> Union[pd.DataFrame, np.ndarray]:
        check_is_fitted(self)                           # [REQUIRED] looks for learned attributes (with trailing _)
        #self.columns = X.columns
        
        X = check_array(X)
        
        assert self.n_features_in_ == X.shape[1]
        if self.with_mean:
            X = X - self.mean_

        return X / self.scale_
    
    # def get_feature_names_out(self, input_features):
    #     return [col for col in self.columns]

In [441]:
# Create Pipelines for Numerical and Categorical Features
num_pipeline_cloned = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScalerClone())
])

In [444]:
num_pipeline_cloned.fit_transform(df[["training_hours"]])[:5]

array([[-0.4889846 ],
       [-0.30582494],
       [ 0.29360665],
       [-0.22257056],
       [-0.95520917]])

In [445]:
# compare the result with the default standard scaler
num_pipeline.fit_transform(df[["training_hours"]])[:5]

array([[-0.4889846 ],
       [-0.30582494],
       [ 0.29360665],
       [-0.22257056],
       [-0.95520917]])
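
The methods inherited from the base classes can be sketched with a second, smaller transformer. `QuantileClipper` below is a hypothetical example written for this note, not a Scikit-learn class; it clips each feature to quantile bounds learned in `fit()`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone

class QuantileClipper(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: clips each column to learned quantile bounds."""

    def __init__(self, lower=0.05, upper=0.95):  # plain keyword arguments only
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.lower_bounds_ = np.quantile(X, self.lower, axis=0)
        self.upper_bounds_ = np.quantile(X, self.upper, axis=0)
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.clip(X, self.lower_bounds_, self.upper_bounds_)

clipper = QuantileClipper(lower=0.1, upper=0.9)
params = clipper.get_params()        # provided by BaseEstimator
clipper.set_params(upper=0.8)        # what GridSearchCV calls internally

data = np.arange(11, dtype=float).reshape(-1, 1)   # values 0.0 to 10.0
clipped = clipper.fit_transform(data)              # provided by TransformerMixin
fresh = clone(clipper)   # unfitted copy with the same hyperparameters
```

`clone()` works only because the constructor stores its arguments unchanged under the same names, which is why `*args` and `**kwargs` are avoided.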

### Save Pipeline

In [None]:
# Save the pipeline
import joblib

# Save the fitted pipeline to the file "pipe.joblib"
joblib.dump(clf_pipeline, "pipe.joblib")

# Load the pipeline when you want to use it again
same_pipe = joblib.load("pipe.joblib")

## How to Find the Best Hyperparameter and Data Preparation Method

### How to find the best hyperparameter

In [291]:
# # check the list of adjustable parameters 
# clf_pipeline.get_params()


In [None]:
# set a parameter: step name + '__' (double underscore) + parameter name
clf_pipeline.set_params(model__C=10)
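
The double-underscore naming can be checked on a small stand-alone pipeline. This is a sketch with illustrative names (`demo_pipe`, `tunable`), separate from `clf_pipeline` above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

# "<step name>__<parameter name>" addresses a parameter of one step
demo_pipe.set_params(model__C=10, scale__with_mean=False)

# The same keys appear in get_params(), which is how grid-search keys are found
tunable = [key for key in demo_pipe.get_params() if key.startswith("model__")]
```

For nested objects the prefixes are chained, as in `col_trans__num_pipeline__std_scale` later in this notebook.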


In [296]:
grid_params = {'model__penalty' : ['l2'],
               'model__C' : np.logspace(-4, 4, 20)}

gs = GridSearchCV(clf_pipeline, grid_params, cv=5, scoring='accuracy')
gs.fit(X_train, y_train)

print(f"Best Score of train set: {gs.best_score_}")
print(f"Best parameter set: {gs.best_params_}")
print(f"Test Score: {gs.score(X_test, y_test)}")

Best Score of train set: 0.7671934994024873
Best parameter set: {'model__C': 0.03359818286283781, 'model__penalty': 'l2'}
Test Score: 0.7643528183716075


### How to find the Best Data Preparation Method: Skip a Step in a Pipeline
- Because the preprocessing steps are part of the pipeline, a grid search can also compare data preparation choices: it tries skipping different steps and compares the cross-validated score of each case.
- In the grid search parameters, name the step you want to skip and set its value to `'passthrough'`.
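
As a small sketch of what `'passthrough'` does, the example below skips one of two scalers on a hand-written array (`X_demo` and the pipeline names are assumptions for illustration) and compares the result with the remaining scaler used alone.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_demo = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 700.0], [10.0, 800.0]])

def make_two_scalers():
    # Same step names as the numerical pipeline used in the grid search
    return Pipeline([
        ("minmax_scale", MinMaxScaler()),
        ("std_scale", StandardScaler()),
    ])

both_steps = make_two_scalers().fit_transform(X_demo)

# Replace the second step with 'passthrough' so it is skipped
skip_std = make_two_scalers().set_params(std_scale="passthrough")
skipped = skip_std.fit_transform(X_demo)

minmax_only = MinMaxScaler().fit_transform(X_demo)
```

With `std_scale` skipped, the pipeline output is the same as running `MinMaxScaler` by itself.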

In [302]:
num_pipeline2 = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='mean')),
    ('minmax_scale', MinMaxScaler()), 
    ('std_scale', StandardScaler()), # add also 'std_scale'
])

col_trans2 = ColumnTransformer(transformers=[
        ('num_pipeline', num_pipeline2, num_cols), # update with num_pipeline2
        ('oh_cat_pipeline', oh_cat_pipeline, oh_cat_cols),
        ('ord_cat_pipeline', ord_cat_pipeline, ord_cat_cols),

    ],
    remainder='drop',   # 'drop' or 'passthrough': what to do with columns not listed above
    n_jobs=-1)          # n_jobs=-1 runs the transformers in parallel on all available processors

clf_pipeline2 = Pipeline(steps=[
    ('col_trans', col_trans2),
    ('model', clf)
])

In [303]:
grid_step_params = [{'col_trans__num_pipeline__minmax_scale': ['passthrough']},
                    {'col_trans__num_pipeline__std_scale': ['passthrough']}]

In [305]:
gs2 = GridSearchCV(clf_pipeline2, grid_step_params, scoring='accuracy')
gs2.fit(X_train, y_train)
print(f"Best Score of train set: {gs2.best_score_}")
print(f"Best parameter set: {gs2.best_params_}")
print(f"Test Score: {gs2.score(X_test, y_test)}")

Best Score of train set: 0.7660189054504011
Best parameter set: {'col_trans__num_pipeline__minmax_scale': 'passthrough'}
Test Score: 0.7659185803757829


- The best case is `minmax_scale: 'passthrough'`, i.e. skipping `MinMaxScaler`, so `StandardScaler` alone is the better scaling choice for this data in this search.

### How to Find the Best Hyperparameter Sets and the Best Data Preparation Method

In [310]:
grid_params = {'model__penalty' : ['l2'],
               'model__C' : np.logspace(-4, 4, 20)}
               
grid_step_params = [{'col_trans__num_pipeline__minmax_scale': ['passthrough'], **grid_params},
                    {'col_trans__num_pipeline__std_scale': ['passthrough'], **grid_params}
                    ]

In [311]:
gs3 = GridSearchCV(clf_pipeline2, grid_step_params, scoring='accuracy')
gs3.fit(X_train, y_train)

print(f"Best Score of train set: {gs3.best_score_}")
print(f"Best parameter set: {gs3.best_params_}")
print(f"Test Score: {gs3.score(X_test, y_test)}")

Best Score of train set: 0.7671934994024873
Best parameter set: {'col_trans__num_pipeline__std_scale': 'passthrough', 'model__C': 0.03359818286283781, 'model__penalty': 'l2'}
Test Score: 0.7643528183716075
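
To see how many candidates such a combined grid produces, the list of dictionaries can be expanded with `ParameterGrid`. This is a sketch that only counts combinations and fits nothing; the variable names mirror the cell above.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

grid_params = {"model__penalty": ["l2"],
               "model__C": np.logspace(-4, 4, 20)}

grid_step_params = [
    {"col_trans__num_pipeline__minmax_scale": ["passthrough"], **grid_params},
    {"col_trans__num_pipeline__std_scale": ["passthrough"], **grid_params},
]

# Each dictionary is expanded separately; the results are concatenated
candidates = list(ParameterGrid(grid_step_params))
n_candidates = len(candidates)
print(n_candidates)
```

Each dictionary contributes 1 x 1 x 20 combinations, so the search evaluates 40 candidates, each cross-validated with the default 5 folds.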


### How to find the best models
- To compare different model types in the same grid search, one solution is to write a custom wrapper estimator that takes the model as a constructor parameter, so that the grid search can swap it like any other hyperparameter.

In [312]:
from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

class ClfSwitcher(BaseEstimator):
    def __init__(self, estimator=LogisticRegression()):
        self.estimator = estimator

    def fit(self, X, y=None, **kwargs):
        self.estimator.fit(X, y)
        return self

    def predict(self, X, y=None):
        return self.estimator.predict(X)

    def predict_proba(self, X):
        return self.estimator.predict_proba(X)

    def score(self, X, y):
        return self.estimator.score(X, y)

In [315]:
clf_pipeline4 = Pipeline(steps=[
    ('col_trans', col_trans2),
    ('model', ClfSwitcher())
])

In [318]:
grid_params = {'model__estimator' : [LogisticRegression(max_iter=1000), SVC(gamma='auto')]}
               
grid_step_params = [{'col_trans__num_pipeline__minmax_scale': ['passthrough'], **grid_params},
                    {'col_trans__num_pipeline__std_scale': ['passthrough'], **grid_params}
                    ]


gs4 = GridSearchCV(clf_pipeline4, grid_step_params, scoring='accuracy')
gs4.fit(X_train, y_train)
print(f"Best Score of train set: {gs4.best_score_}")
print(f"Best parameter set: {gs4.best_params_}")
print(f"Test Score: {gs4.score(X_test, y_test)}")

Best Score of train set: 0.7661492408981738
Best parameter set: {'col_trans__num_pipeline__minmax_scale': 'passthrough', 'model__estimator': LogisticRegression(max_iter=1000)}
Test Score: 0.7659185803757829


In [319]:
pd.DataFrame(gs4.cv_results_)


| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_col_trans__num_pipeline__minmax_scale | param_model__estimator | param_col_trans__num_pipeline__std_scale | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.168548 | 0.017555 | 0.013922 | 0.000678 | passthrough | LogisticRegression(max_iter=1000) | NaN | {'col_trans__num_pipeline__minmax_scale': 'pas... | 0.762883 | 0.765416 | 0.765742 | 0.782708 | 0.753997 | 0.766149 | 0.009309 | 1 |
| 1 | 2.217746 | 0.040602 | 1.275425 | 0.029387 | passthrough | SVC(gamma='auto') | NaN | {'col_trans__num_pipeline__minmax_scale': 'pas... | 0.767776 | 0.76509 | 0.763132 | 0.77553 | 0.758238 | 0.765953 | 0.005714 | 2 |
| 2 | 0.110988 | 0.026882 | 0.013632 | 0.000757 | NaN | LogisticRegression(max_iter=1000) | passthrough | {'col_trans__num_pipeline__std_scale': 'passth... | 0.762557 | 0.76509 | 0.766069 | 0.781403 | 0.754323 | 0.765888 | 0.008789 | 3 |
| 3 | 2.242711 | 0.028602 | 1.316997 | 0.025985 | NaN | SVC(gamma='auto') | passthrough | {'col_trans__num_pipeline__std_scale': 'passth... | 0.757665 | 0.75106 | 0.752692 | 0.757259 | 0.744209 | 0.752577 | 0.004902 | 4 |
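
As an alternative sketch to the wrapper class, a pipeline step can itself be listed in the parameter grid, so the estimator is swapped directly. The example uses synthetic data from `make_classification` and illustrative names (`swap_pipe`, `swap_grid`); it is not the search that was run above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

swap_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),   # placeholder, replaced by the grid
])

# The key "model" replaces the whole step; "model__C" tunes the chosen estimator
swap_grid = [
    {"model": [LogisticRegression(max_iter=1000)], "model__C": [0.1, 1.0]},
    {"model": [SVC(gamma="auto")], "model__C": [0.1, 1.0]},
]

swap_search = GridSearchCV(swap_pipe, swap_grid, cv=3, scoring="accuracy")
swap_search.fit(X_demo, y_demo)

best_model = swap_search.best_estimator_.named_steps["model"]
print(type(best_model).__name__, swap_search.best_params_)
```

Using one dictionary per model keeps model-specific hyperparameters apart, so parameters that exist for only one of the estimators never get paired with the other.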
