## Introduction

The `Pipeline` class in scikit-learn (sklearn) is super helpful. We can use it to
* compose preprocessing steps, 
* combine preprocessing with modeling, and 
* call common sklearn methods like `fit` and `predict`/`predict_proba` directly on the pipeline object.

I love using it! Here is a demo.

In [85]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

In [86]:
df = pd.DataFrame({
    'Category': ['A', 'B', np.nan, 'B', 'C', 'C', 'C', 'B'],
    'Numeric': [10, 20, 15, 25, np.nan, 35, 40, 33],
    'Target': [0, 1, 0, 1, 0, 0, 0, 1]
})

X = df.drop('Target', axis = 1)
y = df['Target']
num_columns = X.select_dtypes(include=np.number).columns.tolist() 
cat_columns = X.select_dtypes(exclude=np.number).columns.tolist()

## Define the pipeline

In [87]:
# data preprocessing: we can use pipeline to form steps
num_transformer = Pipeline(steps = [
    ('imputer', SimpleImputer(strategy = 'median')),
    ('scaler', StandardScaler())
])

column_transformer = ColumnTransformer(transformers = [
    ('num', num_transformer, num_columns),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_columns) 
])

In [88]:
# use pipeline to combine preprocessing and model training
model = RandomForestClassifier()
pipeline = Pipeline(steps = [
    ('preprocessing', column_transformer),
    ('model', model)
])
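
The step names given to `Pipeline` are more than labels: each step can be looked up by name, and its parameters can be changed with the `<step name>__<parameter>` syntax. Here is a small sketch on a simplified, hypothetical `demo_pipeline` (a scaler plus a random forest); the values `50` and `0` are arbitrary choices for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simplified two-step pipeline, used only for illustration
demo_pipeline = Pipeline(steps=[
    ('preprocessing', StandardScaler()),
    ('model', RandomForestClassifier()),
])

# Look up a step by the name it was given
print(type(demo_pipeline.named_steps['model']).__name__)

# Change parameters of a nested step with the step__parameter syntax
demo_pipeline.set_params(model__n_estimators=50, model__random_state=0)
print(demo_pipeline.get_params()['model__n_estimators'])
```

The same double-underscore names are what tools such as `GridSearchCV` expect as keys in a parameter grid.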

## Use the pipeline

In [89]:
pipeline.fit(X,y)

In [90]:
y_prob = pipeline.predict_proba(X)[:,1]
print(y_prob)

[0.05 0.95 0.09 0.97 0.08 0.04 0.02 0.96]


## Pipeline object in the sklearn ecosystem

We can conveniently use a pipeline object for many things we would normally do with a model object. For example, it can be passed to the cross-validation function below in place of a model.

In [91]:
def get_cv_results(model, cv, X, y):
    kfold = KFold(cv)
    cv_scores = []

    for i, (train_index, test_index) in enumerate(kfold.split(X, y)):
        print(f"train index for fold {i+1}:", train_index)

        X_train, y_train = X.iloc[train_index], y.iloc[train_index]
        X_test, y_test   = X.iloc[test_index], y.iloc[test_index]

        model.fit(X_train, y_train)
        y_prob = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, y_prob)
        cv_scores.append(auc)
    
    return cv_scores


In [92]:
cv = 3
cv_scores = get_cv_results(pipeline, cv, X, y)

train index for fold 1: [3 4 5 6 7]
train index for fold 2: [0 1 2 6 7]
train index for fold 3: [0 1 2 3 4 5]
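
Because the pipeline behaves like any other estimator, the hand-written loop above can also be replaced by sklearn's built-in `cross_val_score`. The sketch below rebuilds the same data and pipeline so it runs on its own; the `demo_` names and `random_state=0` are my additions for illustration. One difference from the loop: `cross_val_score` fits a fresh clone of the pipeline on each fold instead of refitting the same object.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Same toy data as in the demo
demo_df = pd.DataFrame({
    'Category': ['A', 'B', np.nan, 'B', 'C', 'C', 'C', 'B'],
    'Numeric': [10, 20, 15, 25, np.nan, 35, 40, 33],
    'Target': [0, 1, 0, 1, 0, 0, 0, 1],
})
X_demo = demo_df.drop('Target', axis=1)
y_demo = demo_df['Target']

# Preprocessing and model combined, as in the pipeline defined earlier
demo_pipeline = Pipeline(steps=[
    ('preprocessing', ColumnTransformer(transformers=[
        ('num', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler()),
        ]), ['Numeric']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Category']),
    ])),
    # random_state is fixed only to make the sketch repeatable
    ('model', RandomForestClassifier(random_state=0)),
])

# Three unshuffled folds, matching KFold(cv) with cv = 3 in the loop above
scores = cross_val_score(
    demo_pipeline, X_demo, y_demo, cv=KFold(3), scoring='roc_auc'
)
print(scores)
```

Since the imputer, scaler, and encoder live inside the pipeline, they are refitted on the training part of every fold, so nothing from the held-out rows leaks into preprocessing.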
