You can read the blog here: https://www.abhishekmamidi.com/2020/09/building-machine-learning-pipelines-using-scikit-learn-and-optimization.html

## Introduction
Data scientists often build machine learning pipelines that involve preprocessing (imputing null values, feature transformation, creating new features), modeling, and hyper-parameter tuning. Many transformations need to be applied in a particular order before modeling. Scikit-learn provides the Pipeline class to perform those transformations in one go.


Pipeline serves multiple purposes here (from [documentation](https://scikit-learn.org/stable/modules/compose.html)):
- <b>Convenience and encapsulation</b>: You only have to call fit and predict once on your data to fit a whole sequence of estimators.
- <b>Joint parameter selection</b>: You can grid search over parameters of all estimators in the pipeline at once (hyper-parameter tuning/optimization).
- <b>Safety</b>: Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.
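
As a minimal sketch of the convenience point, assuming a synthetic dataset from `make_classification` (not the Titanic data used in the rest of this post) and step names chosen only for illustration, a scaler and a classifier can be chained so that one `fit` call and one `predict` call cover both steps:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Step names ('scaler', 'classifier') are arbitrary labels chosen for this sketch
demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])

# One fit call fits the scaler, then fits the classifier on the scaled data
demo_pipeline.fit(X_demo, y_demo)

# One predict call scales the input, then predicts
demo_predictions = demo_pipeline.predict(X_demo[:5])
print(demo_predictions.shape)
```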

In this video, I will show you:
- How to build pipelines using scikit-learn
- How to create custom transformers
- How to tune hyper-parameters

For the entire analysis, I am using the Titanic dataset, because most readers are already familiar with it. Let’s start the analysis by loading all the required libraries:

In [1]:
# Load libraries

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.metrics import f1_score, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler, RobustScaler

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

import xgboost
from xgboost import XGBClassifier

from sklearn.model_selection import GridSearchCV

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Load data

def load_data(PATH):
    data = pd.read_csv(PATH)
    return data

In [3]:
# Read titanic data

titanic_data = load_data('https://raw.githubusercontent.com/mattdelhey/kaggle-titanic/master/Data/train.csv')

In [4]:
titanic_data.head()

| | survived | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [5]:
titanic_data.shape

(891, 11)

In [6]:
# Null value columns

null_value_count = titanic_data.isnull().sum()
features_with_null_values = null_value_count[null_value_count != 0].index
null_value_count, features_with_null_values

(survived      0
 pclass        0
 name          0
 sex           0
 age         177
 sibsp         0
 parch         0
 ticket        0
 fare          0
 cabin       687
 embarked      2
 dtype: int64, Index(['age', 'cabin', 'embarked'], dtype='object'))

In [7]:
# Split data 80:20
train_data, test_data = train_test_split(titanic_data, test_size=0.2, random_state=42)

X_train = train_data.drop(columns=["survived"])
y_train = train_data["survived"]

X_test = test_data.drop(columns=["survived"])
y_test = test_data["survived"]

In [8]:
X_train.head()

| | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 331 | 1 | Partner, Mr. Austen | male | 45.5 | 0 | 0 | 113043 | 28.5 | C124 | S |
| 733 | 2 | Berriman, Mr. William John | male | 23.0 | 0 | 0 | 28425 | 13.0 | NaN | S |
| 382 | 3 | Tikkanen, Mr. Juho | male | 32.0 | 0 | 0 | STON/O 2. 3101293 | 7.925 | NaN | S |
| 704 | 3 | Hansen, Mr. Henrik Juul | male | 26.0 | 1 | 0 | 350025 | 7.8542 | NaN | S |
| 813 | 3 | Andersson, Miss. Ebba Iris Alfrida | female | 6.0 | 4 | 2 | 347082 | 31.275 | NaN | S |


There are 11 columns in total (10 features plus the target). For this analysis, we can drop the high-cardinality features (['name', 'ticket', 'cabin']), which leaves 7 features: 5 numerical ('pclass', 'age', 'sibsp', 'parch', 'fare') and 2 categorical ('sex', 'embarked'). Numerical and categorical features need different preprocessing, so let’s build a separate pipeline for each.

In [9]:
drop_columns = ["name", "ticket", "cabin"]
numerical_columns = X_train.drop(columns=drop_columns).select_dtypes(exclude = "object").columns
categorical_columns = X_train.drop(columns=drop_columns).select_dtypes(include = "object").columns
print(numerical_columns, categorical_columns)

Index(['pclass', 'age', 'sibsp', 'parch', 'fare'], dtype='object') Index(['sex', 'embarked'], dtype='object')


## Preprocessing Numerical features
- Impute null values with median
- Create new features. We can create 'family_count' by adding 'sibsp' (No. of siblings/spouses) and 'parch' (No. of parents/children)
- Feature Scaling

Scikit-learn provides many transformers out of the box. For custom processing, such as creating new features, we can write our own custom transformers. Because scikit-learn relies on duck typing (it checks only for the presence of a given method or attribute), creating a custom transformer is as simple as implementing the fit, transform and fit_transform methods in a class.

For the fit() function, you can just return self. The main processing code will go into the transform() function. The fit_transform() is automatically available for us if we add TransformerMixin as a base class. We can also add BaseEstimator as a base class which automatically provides two functions: get_params() and set_params().
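
To make the role of the two base classes concrete, here is a small sketch. `MultiplyBy` is a hypothetical transformer invented for this example (it is not part of scikit-learn or of this notebook); it only defines `fit` and `transform`, yet `fit_transform`, `get_params` and `set_params` are all available:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Hypothetical transformer for illustration: multiplies every value by a factor
class MultiplyBy(BaseEstimator, TransformerMixin):
    def __init__(self, factor=2):
        self.factor = factor

    def fit(self, X, y=None):
        # Nothing to learn from the data, so just return self
        return self

    def transform(self, X):
        return X * self.factor

toy = np.array([[1.0, 2.0], [3.0, 4.0]])
multiplier = MultiplyBy(factor=3)

# fit_transform is inherited from TransformerMixin
tripled = multiplier.fit_transform(toy)

# get_params and set_params are inherited from BaseEstimator
params_before = multiplier.get_params()
multiplier.set_params(factor=10)
scaled_by_ten = multiplier.transform(toy)
```

Storing each constructor argument under an attribute of the same name (`self.factor = factor`) is what lets `get_params()` and `set_params()` find it, which matters later for hyper-parameter tuning.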

In [10]:
# Create New features

class CreateNewFeatures(BaseEstimator, TransformerMixin):
    def __init__(self, indices):
        self.indices = indices
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        """
        Create a new feature 'family_count' by adding 'sibsp' and 'parch'
        """
        sibsp = self.indices[0]
        parch = self.indices[1]

        family_count = X[:, sibsp] + X[:, parch]
        X = np.c_[X, family_count]
        return X

- Let’s build a pipeline for processing numerical features. It’s very easy.

In [11]:
# get_params
CreateNewFeatures([2, 3]).get_params()

{'indices': [2, 3]}
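
To see what the transformer produces, the sketch below applies it to a two-row array. The values are the numerical columns of passengers 704 and 813 from the X_train preview above; the class is repeated so the snippet runs on its own:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Same transformer as in the cell above, repeated so the sketch is self-contained
class CreateNewFeatures(BaseEstimator, TransformerMixin):
    def __init__(self, indices):
        self.indices = indices

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        sibsp = self.indices[0]
        parch = self.indices[1]
        family_count = X[:, sibsp] + X[:, parch]
        return np.c_[X, family_count]

# Column order: pclass, age, sibsp, parch, fare
sample = np.array([
    [3, 26.0, 1, 0, 7.8542],
    [3, 6.0, 4, 2, 31.275],
])

sample_with_family = CreateNewFeatures([2, 3]).fit_transform(sample)
print(sample_with_family.shape)   # (2, 6)
print(sample_with_family[:, -1])  # [1. 6.]
```

Because `transform` indexes with `X[:, i]`, it expects a NumPy array rather than a DataFrame. That is why it is placed after SimpleImputer in the pipeline below: the imputer returns an array by default.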

In [12]:
X_train[numerical_columns].head()

| | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|
| 331 | 1 | 45.5 | 0 | 0 | 28.5 |
| 733 | 2 | 23.0 | 0 | 0 | 13.0 |
| 382 | 3 | 32.0 | 0 | 0 | 7.925 |
| 704 | 3 | 26.0 | 1 | 0 | 7.8542 |
| 813 | 3 | 6.0 | 4 | 2 | 31.275 |


In [13]:
# Pipeline for preprocessing numerical features
family_count_indices = [2, 3]

# List of tuples (name, transform) in a particular order. We have to provide name and transformer in each tuple
numerical_pipeline = Pipeline([
    ('numerical_imputer', SimpleImputer(strategy='median')),
    ('create_new_features', CreateNewFeatures(family_count_indices)),
    ('feature_scaling', StandardScaler())
])

In [14]:
numerical_pipeline

Pipeline(memory=None,
         steps=[('numerical_imputer',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='median',
                               verbose=0)),
                ('create_new_features', CreateNewFeatures(indices=[2, 3])),
                ('feature_scaling',
                 StandardScaler(copy=True, with_mean=True, with_std=True))],
         verbose=False)
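
Once a pipeline is fitted, each step stays accessible by its name. The sketch below is a reduced two-step version of the numerical pipeline (without the custom transformer) on a made-up age column, and reads back the median that the imputer learned:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up single-column data with one missing value
ages = np.array([[22.0], [np.nan], [26.0], [38.0]])

toy_pipeline = Pipeline([
    ('numerical_imputer', SimpleImputer(strategy='median')),
    ('feature_scaling', StandardScaler())
])

# Impute the missing value, then scale
ages_processed = toy_pipeline.fit_transform(ages)

# Fitted steps can be looked up by the names given in the tuples
learned_median = toy_pipeline.named_steps['numerical_imputer'].statistics_
print(learned_median)  # median of the observed values 22, 26 and 38
```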

## Preprocessing Categorical features
- Impute null values with mode
- One hot encoding

In [15]:
categorical_pipeline = Pipeline([
    ('categorical_imputer', SimpleImputer(strategy='most_frequent')),
    ('categorical_encoder', OneHotEncoder())
])

In [16]:
categorical_pipeline

Pipeline(memory=None,
         steps=[('categorical_imputer',
                 SimpleImputer(add_indicator=False, copy=True, fill_value=None,
                               missing_values=nan, strategy='most_frequent',
                               verbose=0)),
                ('categorical_encoder',
                 OneHotEncoder(categories='auto', drop=None,
                               dtype=<class 'numpy.float64'>,
                               handle_unknown='error', sparse=True))],
         verbose=False)
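
As a quick sanity check of the two categorical steps, here is a sketch on a made-up 'embarked' column with one missing value. The missing entry should be filled with the most frequent category before encoding:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up column; 'S' is the most frequent value
toy_categories = pd.DataFrame({'embarked': ['S', 'C', np.nan, 'S', 'Q']})

toy_categorical_pipeline = Pipeline([
    ('categorical_imputer', SimpleImputer(strategy='most_frequent')),
    ('categorical_encoder', OneHotEncoder())
])

encoded = toy_categorical_pipeline.fit_transform(toy_categories)

# OneHotEncoder returns a SciPy sparse matrix by default; convert it for display
encoded_dense = encoded.toarray()

# One output column per category, in sorted order
categories = toy_categorical_pipeline.named_steps['categorical_encoder'].categories_
print(categories)
print(encoded_dense)
```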

## Column Transformer

We have applied transformations separately for numerical and categorical features. Scikit learn provides ColumnTransformer through which we can apply different transformations on different columns at the same time. Let’s see how we can do this.

In [17]:
# Column Transformer
column_pipeline = ColumnTransformer([
    ("numerical_pipeline", numerical_pipeline, numerical_columns),
    ("categorical_pipeline", categorical_pipeline, categorical_columns)
])

In [18]:
column_pipeline

ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('numerical_pipeline',
                                 Pipeline(memory=None,
                                          steps=[('numerical_imputer',
                                                  SimpleImputer(add_indicator=False,
                                                                copy=True,
                                                                fill_value=None,
                                                                missing_values=nan,
                                                                strategy='median',
                                                                verbose=0)),
                                                 ('create_new_features',
                                                  CreateNewFeatures(indices=[2,
                                                                  

- Each tuple takes 3 inputs: (name, pipeline, columns to which the transformation is applied). Using ColumnTransformer, we can apply the numerical and categorical transformations in parallel on the same dataset. Now, let’s create the full pipeline by dropping unnecessary features before transforming the rest.
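
A small sketch of the same idea on a made-up four-row DataFrame, using plain transformers instead of the pipelines above. The 'name' column holds placeholder values and is not listed in any tuple, so with the default remainder='drop' it does not appear in the output:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data for illustration
toy = pd.DataFrame({
    'age': [22.0, 38.0, 26.0, 35.0],
    'fare': [7.25, 71.2833, 7.925, 53.1],
    'sex': ['male', 'female', 'female', 'female'],
    'name': ['a', 'b', 'c', 'd'],  # placeholder values; not assigned to any transformer
})

toy_transformer = ColumnTransformer([
    ('numerical', StandardScaler(), ['age', 'fare']),
    ('categorical', OneHotEncoder(), ['sex'])
])

toy_output = toy_transformer.fit_transform(toy)

# 2 scaled numerical columns + 2 one-hot columns for 'sex'; 'name' is dropped
print(toy_output.shape)  # (4, 4)
```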

## Full pipeline

- We haven't included dropping features in the pipeline yet. Let's create the full pipeline with dropping the features as its first step.

In [19]:
# Drop features

class DropFeatures(BaseEstimator, TransformerMixin):
    """
    Drop features
    """
    def __init__(self, drop_columns):
        self.drop_columns = drop_columns
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        """
        Drop features
        """
        X = X.drop(columns=self.drop_columns)
        return X
    
drop_columns = ["name", "ticket", "cabin"]

# Full pipeline
full_pipeline = Pipeline([
    ('drop_features', DropFeatures(drop_columns)),
    ('column_transformer', column_pipeline)
])

In [20]:
print(full_pipeline)

Pipeline(memory=None,
         steps=[('drop_features',
                 DropFeatures(drop_columns=['name', 'ticket', 'cabin'])),
                ('column_transformer',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('numerical_pipeline',
                                                  Pipeline(memory=None,
                                                           steps=[('numerical_imputer',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_va...
                                                           steps=[('categorical_imputer',
                                       

Hurray! We created the pipeline for processing the input features.

## Different ways of using pipelines
- Use pipeline for preprocessing only
- Include modeling in the pipeline
- Hyper-parameter tuning

### Use pipeline for preprocessing features only
We use the pipeline to pre-process the features and then do modeling on top of the processed dataset.

In [21]:
# Transform input data
X_train_processed = full_pipeline.fit_transform(X_train)

X_train.shape, X_train_processed.shape

((712, 10), (712, 11))

In [22]:
# Train data using XGBoost
model = xgboost.XGBClassifier(max_depth=4)
model.fit(X_train_processed, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=4,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

We use the same pipeline to transform the test data and predict using the trained model.
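
Note that the test data goes through transform(), not fit_transform(), so the statistics learned from the training data are reused. A minimal sketch with a single scaler and made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for illustration
train_values = np.array([[1.0], [2.0], [3.0]])
test_values = np.array([[10.0]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training data only
scaler.fit_transform(train_values)

# transform reuses those training statistics; nothing is re-learned from the test data
test_scaled = scaler.transform(test_values)

print(scaler.mean_)  # mean of the training values
print(test_scaled)
```

Calling fit_transform() on the test data instead would re-estimate the statistics from the test set, which is exactly the kind of leakage the pipeline is meant to prevent.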

In [23]:
# Transform test data using full_pipeline
X_test_processed = full_pipeline.transform(X_test)
print(X_test_processed.shape)

# Predict on the processed data
y_pred = model.predict(X_test_processed)

# Evaluate on test data
accuracy_score(y_test, y_pred), f1_score(y_test, y_pred) # (0.815, 0.762)

(179, 11)


(0.8156424581005587, 0.762589928057554)

### Include modeling in the pipeline
In this case, we include modeling (e.g. DecisionTreeClassifier()) in the pipeline by chaining a model after full_pipeline in a new pipeline.

In [24]:
# Add DecisionTreeClassifier to the end of processing pipeline
pipeline_modeling = Pipeline([
    ('preprocessing', full_pipeline),
    ('model', DecisionTreeClassifier())
])

Using the pipeline object, we can directly fit and predict on the data.

In [25]:
# Fit data
pipeline_modeling.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('preprocessing',
                 Pipeline(memory=None,
                          steps=[('drop_features',
                                  DropFeatures(drop_columns=['name', 'ticket',
                                                             'cabin'])),
                                 ('column_transformer',
                                  ColumnTransformer(n_jobs=None,
                                                    remainder='drop',
                                                    sparse_threshold=0.3,
                                                    transformer_weights=None,
                                                    transformers=[('numerical_pipeline',
                                                                   Pipeline(memory=None,
                                                                            steps=[('numerical_imputer',
                                                                            

In [26]:
# Predict on new data
y_pred = pipeline_modeling.predict(X_test)

# Score on new data. Returns the accuracy score
pipeline_modeling.score(X_test, y_test)

0.7932960893854749
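
For a pipeline that ends in a classifier, score() is the mean accuracy of the final estimator, so it should match accuracy_score on the pipeline's predictions. The sketch below checks this on synthetic data. The `random_state=0` on the tree is an addition for this example: the notebook's DecisionTreeClassifier() has no fixed seed, so its score can differ slightly between runs.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_pipeline = Pipeline([
    ('feature_scaling', StandardScaler()),
    ('model', DecisionTreeClassifier(random_state=0))  # fixed seed for reproducibility
])
demo_pipeline.fit(X_demo_train, y_demo_train)

# score() preprocesses the data, predicts, and returns the mean accuracy
pipeline_score = demo_pipeline.score(X_demo_test, y_demo_test)

# The same number computed by hand from the predictions
manual_accuracy = accuracy_score(y_demo_test, demo_pipeline.predict(X_demo_test))
print(pipeline_score, manual_accuracy)
```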

### Hyper-parameter tuning

The most interesting part!

Before diving deep into the hyper-parameter tuning, let’s understand the power of using pipelines.
- The first advantage is that we can tune any parameter of any step in the pipeline.
- The second advantage is that we can tune the steps themselves. For example, the categorical pipeline uses OneHotEncoder(), but there are alternatives such as OrdinalEncoder(). We can search over the choice of method along with its parameters.

Let’s define the parameter grid. We can also pass a list of parameter dictionaries, in which case each dictionary is searched as a separate grid. To refer to a parameter in the grid, we use the step names that were given when creating the pipeline. To reach a nested parameter, we join the names along the path from the outermost pipeline down to the parameter with a double underscore (__). The code below makes this clearer.
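
If you are unsure of a parameter name, get_params() lists every valid one. The sketch below uses a small nested pipeline (a simplified stand-in for the one in this post, with illustrative step names) to show the naming scheme and how set_params() can change a nested value or swap a whole step:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Simplified nested pipeline for illustration
inner = Pipeline([('feature_scaling', StandardScaler())])
outer = Pipeline([
    ('preprocessing', inner),
    ('model', DecisionTreeClassifier())
])

# Every key returned here is a valid name for a parameter grid
tunable_names = sorted(outer.get_params().keys())
for name in tunable_names:
    print(name)

# Change a parameter of the final estimator
outer.set_params(model__max_depth=5)

# Replace an entire nested step with a different transformer
outer.set_params(preprocessing__feature_scaling=RobustScaler())
```

GridSearchCV applies each candidate through this same set_params() mechanism, which is why a grid can list whole estimators (such as a different scaler or model) as values.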

In [27]:
parameter_grid = [
    {
        "preprocessing__column_transformer__numerical_pipeline__numerical_imputer__strategy": ['median', 'mean'],
        "preprocessing__column_transformer__numerical_pipeline__feature_scaling": [StandardScaler(), RobustScaler()],
        "model": [DecisionTreeClassifier()],
        "model__criterion": ["gini", "entropy"],
        "model__max_depth": [10, 20]
    },
    {
        "preprocessing__column_transformer__categorical_pipeline__categorical_encoder": [OneHotEncoder(), OrdinalEncoder()],
        "model": [RandomForestClassifier()],
        "model__max_depth": [10, 15, 25],
        "model__n_estimators": [100, 200],
        "model__bootstrap": [True, False]
    },
    {
        "model": [XGBClassifier()],
        "model__n_estimators": [10, 50, 100],
        "model__learning_rate": [0.01, 0.1, 1],
        "model__max_depth": [3, 6, 9],
        "model__min_child_weight": [1, 3]
    }
]

Pass the parameter_grid to GridSearchCV to initialize and call fit() function to find the optimal parameters.

In [28]:
# Initialize grid search
grid_search = GridSearchCV(pipeline_modeling, parameter_grid, cv=5, verbose=0)

# Fit data
grid_search.fit(X_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('preprocessing',
                                        Pipeline(memory=None,
                                                 steps=[('drop_features',
                                                         DropFeatures(drop_columns=['name',
                                                                                    'ticket',
                                                                                    'cabin'])),
                                                        ('column_transformer',
                                                         ColumnTransformer(n_jobs=None,
                                                                           remainder='drop',
                                                                           sparse_threshold=0.3,
                                                                           transformer_

In [29]:
# Get best estimator
grid_search.best_estimator_

Pipeline(memory=None,
         steps=[('preprocessing',
                 Pipeline(memory=None,
                          steps=[('drop_features',
                                  DropFeatures(drop_columns=['name', 'ticket',
                                                             'cabin'])),
                                 ('column_transformer',
                                  ColumnTransformer(n_jobs=None,
                                                    remainder='drop',
                                                    sparse_threshold=0.3,
                                                    transformer_weights=None,
                                                    transformers=[('numerical_pipeline',
                                                                   Pipeline(memory=None,
                                                                            steps=[('numerical_imputer',
                                                                            

In [30]:
grid_search.predict(X_test)

array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1,
       0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       0, 1, 1], dtype=int64)

In [31]:
# Score on new data
grid_search.score(X_test, y_test)

0.8212290502793296

In this way, we can easily try different transformations and select the best pipeline. 
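
To see which combination won and how the others compared, the fitted search object exposes best_params_, best_score_ and cv_results_. A minimal sketch on synthetic data with a deliberately tiny grid (the data, step names and grid values are assumptions for this example, not the ones used above):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_pipeline = Pipeline([
    ('feature_scaling', StandardScaler()),
    ('model', DecisionTreeClassifier(random_state=0))
])

# 2 scalers x 2 depths = 4 candidate pipelines
demo_grid = {
    'feature_scaling': [StandardScaler(), RobustScaler()],
    'model__max_depth': [2, 4]
}

demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=3)
demo_search.fit(X_demo, y_demo)

# Winning combination and its mean cross-validated score
best_params = demo_search.best_params_
best_cv_score = demo_search.best_score_

# One row per candidate, with its mean score and rank
results = pd.DataFrame(demo_search.cv_results_)[
    ['params', 'mean_test_score', 'rank_test_score']
]
print(best_params)
print(results.sort_values('rank_test_score'))
```

Keep in mind that best_score_ is the mean score across the cross-validation folds on the training data, which is a different number from grid_search.score(X_test, y_test) on held-out data.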

<b>TIP</b>: We can build pipelines with different transformations and save them for future use, adding new transformations and functions as we go. This is very helpful during competitions and for personal projects as well.
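
One common way to save a fitted pipeline is joblib, which is installed alongside scikit-learn. The sketch below is one possible approach, not something the notebook itself does; the file name and the temporary directory are arbitrary choices for the example:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_pipeline = Pipeline([
    ('feature_scaling', StandardScaler()),
    ('model', DecisionTreeClassifier(random_state=0))
])
demo_pipeline.fit(X_demo, y_demo)

# Save to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_pipeline.joblib')  # arbitrary file name
    joblib.dump(demo_pipeline, path)
    restored_pipeline = joblib.load(path)

# The restored pipeline should give the same predictions as the original
same_predictions = bool(
    (restored_pipeline.predict(X_demo) == demo_pipeline.predict(X_demo)).all()
)
print(same_predictions)
```

If the pipeline contains custom transformers such as CreateNewFeatures or DropFeatures, their class definitions must be importable in the session that loads the file, and loading is safest with the same scikit-learn version that was used for saving.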

## Full code for reference

In [None]:
# Import libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler, RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import f1_score, accuracy_score

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

import xgboost
from xgboost import XGBClassifier

from sklearn.model_selection import GridSearchCV

# Load data
def load_data(PATH):
    data = pd.read_csv(PATH)
    return data

titanic_data = load_data('https://raw.githubusercontent.com/mattdelhey/kaggle-titanic/master/Data/train.csv')

# Split data 80:20
train_data, test_data = train_test_split(titanic_data, test_size=0.2, random_state=42)

X_train = train_data.drop(columns=["survived"])
y_train = train_data["survived"]

X_test = test_data.drop(columns=["survived"])
y_test = test_data["survived"]

# Create new features
class CreateNewFeatures(BaseEstimator, TransformerMixin):
    def __init__(self, indices):
        self.indices = indices
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        """
        Create a new feature 'family_count' by adding 'sibsp' and 'parch'
        """
        sibsp = self.indices[0]
        parch = self.indices[1]
        family_count = X[:, sibsp] + X[:, parch]
        X = np.c_[X, family_count]
        return X

# Pipeline for processing numerical features
family_count_indices = [2, 3]
numerical_pipeline = Pipeline([
    ('numerical_imputer', SimpleImputer(strategy='median')),
    ('create_new_features', CreateNewFeatures(family_count_indices)),
    ('feature_scaling', StandardScaler())
])

# Pipeline for processing categorical features
categorical_pipeline = Pipeline([
    ('categorical_imputer', SimpleImputer(strategy='most_frequent')),
    ('categorical_encoder', OneHotEncoder())
])

drop_columns = ["name", "ticket", "cabin"]
numerical_columns = X_train.drop(columns=drop_columns).select_dtypes(exclude = "object").columns
categorical_columns = X_train.drop(columns=drop_columns).select_dtypes(include = "object").columns

# Column Transformer
column_pipeline = ColumnTransformer([
    ("numerical_pipeline", numerical_pipeline, numerical_columns),
    ("categorical_pipeline", categorical_pipeline, categorical_columns)
])
    
class DropFeatures(BaseEstimator, TransformerMixin):
    """
    Drop features
    """
    def __init__(self, drop_columns):
        self.drop_columns = drop_columns
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        """
        Drop features
        """
        X = X.drop(columns=self.drop_columns)
        return X
    
drop_columns = ["name", "ticket", "cabin"]

# Full pipeline
full_pipeline = Pipeline([
    ('drop_features', DropFeatures(drop_columns)),
    ('column_transformer', column_pipeline)
])

# 1. Use pipeline for preprocessing features only

# Transform the input data
X_train_processed = full_pipeline.fit_transform(X_train)

# Train data using XGBoost
model = xgboost.XGBClassifier(max_depth=4)
model.fit(X_train_processed, y_train)

# Transform the data using full_pipeline
X_test_processed = full_pipeline.transform(X_test)

# Predict on the processed data
y_pred = model.predict(X_test_processed)

# Evaluate on test data
accuracy_score(y_test, y_pred), f1_score(y_test, y_pred) #(0.815, 0.762)

# 2. Include modeling in the pipeline

pipeline_modeling = Pipeline([
    ('preprocessing', full_pipeline),
    ('model', DecisionTreeClassifier())
])

# Fit data
pipeline_modeling.fit(X_train, y_train)

# Predict on new data
y_pred = pipeline_modeling.predict(X_test)

# Score on the new data. Returns the accuracy score
pipeline_modeling.score(X_test, y_test) # ~0.78-0.79; varies between runs because the tree has no fixed random_state

# 3. Hyper-parameter tuning

# Parameter grid
parameter_grid = [
    {
        "preprocessing__column_transformer__numerical_pipeline__numerical_imputer__strategy": ['median', 'mean'],
        "preprocessing__column_transformer__numerical_pipeline__feature_scaling": [StandardScaler(), RobustScaler()],
        "model": [DecisionTreeClassifier()],
        "model__criterion": ["gini", "entropy"],
        "model__max_depth": [10, 20]
    },
    {
        "preprocessing__column_transformer__categorical_pipeline__categorical_encoder": [OneHotEncoder(), OrdinalEncoder()],
        "model": [RandomForestClassifier()],
        "model__max_depth": [10, 15, 25],
        "model__n_estimators": [100, 200],
        "model__bootstrap": [True, False]
    },
    {
        "model": [XGBClassifier()],
        "model__n_estimators": [10, 50, 100],
        "model__learning_rate": [0.01, 0.1, 1],
        "model__max_depth": [3, 6, 9],
        "model__min_child_weight": [1, 3]
    }
]

# Initialize grid search
grid_search = GridSearchCV(pipeline_modeling, parameter_grid, cv=5, verbose=0)
# Fit data
grid_search.fit(X_train, y_train)

# Get best estimator
grid_search.best_estimator_

# Score on new data
grid_search.score(X_test, y_test) # 0.821

- Check out my website for more articles [https://www.abhishekmamidi.com/](https://www.abhishekmamidi.com/)