# From classical Sklearn ML workflow to Sklearn pipelines

In [1]:
import pandas as pd
import numpy as np
import json
from scipy import sparse

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib notebook
sns.set_style('whitegrid')

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector as selector

from imblearn.pipeline import make_pipeline as make_pipeline_with_sampler
from imblearn.over_sampling import RandomOverSampler

from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier

from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

## 1. Machine learning pipelines

In [2]:
# Original author: Jesse E. Agbe (JCharis)
# Original source: https://blog.jcharistech.com/2021/02/05/building-machine-learning-pipelines-with-scikit-learn-python/#:~:text=A%20Pipeline%20consists%20of%20a,means%20of%20automating%20a%20workflow.
# Modified by Dr Adnane Ez-zizi 

### 1.1. What is a machine learning pipeline?

A Machine Learning (ML) pipeline is a sequence of processing elements or functions, where the output of one element becomes the input of the next. It is a method for chaining the functions and tasks typically found in workflows, and it is used across fields such as Data Science, Machine Learning, DevOps, Manufacturing, and general Software Development. The concept mirrors the assembly line in the manufacturing industry.

ML pipelines serve to automate the Machine Learning workflow, making it possible to codify and automate the production of usable ML models. A pipeline is an independently executable workflow that completes an ML task by automatically running its steps in sequence, including data transformation, training, and model building. Packaging the workflow in this way makes the process more efficient, better structured and reproducible.

### 1.2. Advantages of using ML pipelines

- Making the building of models more efficient and simplified.
- Helping to cut redundant work.
- Moving the product from just the model to a complete pipeline/workflow, which improves efficiency and scalability.
- Making it easier to monitor and tune each component of the process.
- Reducing the chance of error and saving time by automating repetitive tasks.

### 1.3. Pipeline stages

1) **Transformer Stage:** A transformer takes a dataset as input and produces a transformed/augmented dataset as output. It processes the data and converts it into a feature-ready dataset. An example of this is a tokenizer.

2) **Estimator Stage:** An estimator is fitted on an input dataset and produces a model that can be used to perform predictive tasks. Examples include Naive Bayes and Logistic Regression.
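
The two stages can be sketched on a tiny example. The data below is made up for illustration; `StandardScaler` plays the role of the transformer and `LogisticRegression` the role of the estimator.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy data (made up for illustration): one feature, two well-separated classes
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Transformer stage: fit() learns the mean and standard deviation,
# transform() returns a rescaled copy of the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Estimator stage: fit() learns a model, predict() uses it
clf = LogisticRegression()
clf.fit(X_scaled, y)
y_pred = clf.predict(X_scaled)

print(X_scaled.mean(), X_scaled.std())
print(y_pred)
```

A transformer exposes `fit` and `transform`, whereas an estimator exposes `fit` and `predict`; a pipeline simply calls them in order.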

### 1.4. Build a simple ML pipeline with Scikit-Learn

First, define the pipeline by specifying the sequence of the pipeline steps. Each step is a tuple consisting of a name and an instance of a transformer or an estimator. For a Logistic Regression pipeline, you might include a scaler (e.g. `StandardScaler`) followed by the logistic regression estimator (`LogisticRegression`). The code looks like this:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Create the pipeline
pipe_lr = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression())
])
```

Next, we fit the pipeline to our training data. This step will execute the scaler transformation followed by training the logistic regression model:

```python
# Fit the pipeline to the training data
pipe_lr = pipe_lr.fit(X_train, y_train)
```

After fitting the model, you can proceed with making predictions, evaluating the model, etc.
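
As a minimal end-to-end sketch, the snippet below builds the same pipeline, fits it and evaluates it. The synthetic data from `make_classification` and the 75/25 split are assumptions made only so that the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for a real dataset (an assumption for this sketch)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Same two-step pipeline as above
pipe_lr = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression())
])
pipe_lr.fit(X_train, y_train)

# The scaler fitted on the training data is re-applied automatically here
y_pred = pipe_lr.predict(X_test)
test_accuracy = pipe_lr.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Calling `predict` or `score` on the pipeline applies the already-fitted scaler to the test data before the classifier sees it, so no test information is used during fitting.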

## 2. Example

This example was used in Session 7 to illustrate the problem induced by learning on datasets having imbalanced classes. Here we will use it to show how to use Sklearn pipelines to make the process of building machine learning pipelines easier and more efficient.

In [3]:
# Original author: Guillaume Lemaitre <g.lemaitre58@gmail.com>
# License: MIT
# Original source: https://imbalanced-learn.org/stable/auto_examples/applications/plot_impact_imbalanced_classes.html#sphx-glr-auto-examples-applications-plot-impact-imbalanced-classes-py
# Modified by Dr Adnane Ez-zizi 

### Dataset

We are using a modified version of the "adult" (income) dataset from sklearn.datasets, in which the following features have been dropped:

- "fnlwgt": this feature was created while studying the "adult" dataset and was not acquired during the survey, so we do not use it.
- "education-num": it encodes the same information as "education", so we keep only one of these two features.

In [4]:
df = pd.read_csv("Adult_income.csv")
df.head()

Unnamed: 0,age,workclass,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,25.0,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,0.0,0.0,40.0,United-States,<=50K
1,38.0,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,0.0,0.0,50.0,United-States,<=50K
2,18.0,,Some-college,Never-married,,Own-child,White,Female,0.0,0.0,30.0,United-States,<=50K
3,34.0,Private,10th,Never-married,Other-service,Not-in-family,White,Male,0.0,0.0,30.0,United-States,<=50K
4,29.0,,HS-grad,Never-married,,Unmarried,Black,Male,0.0,0.0,40.0,United-States,<=50K


In [5]:
# Description of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38393 entries, 0 to 38392
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             38393 non-null  float64
 1   workclass       35829 non-null  object 
 2   education       38393 non-null  object 
 3   marital-status  38393 non-null  object 
 4   occupation      35819 non-null  object 
 5   relationship    38393 non-null  object 
 6   race            38393 non-null  object 
 7   sex             38393 non-null  object 
 8   capital-gain    38393 non-null  float64
 9   capital-loss    38393 non-null  float64
 10  hours-per-week  38393 non-null  float64
 11  native-country  37735 non-null  object 
 12  class           38393 non-null  object 
dtypes: float64(4), object(9)
memory usage: 3.8+ MB


In [7]:
# Convert the categorical columns to category type
df = df.astype({"workclass":"category", 
                "education":"category", 
                "marital-status":"category", 
                "occupation":"category", 
                "relationship":"category", 
                "race":"category", 
                "sex":"category", 
                "native-country":"category"})

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38393 entries, 0 to 38392
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   age             38393 non-null  float64 
 1   workclass       35829 non-null  category
 2   education       38393 non-null  category
 3   marital-status  38393 non-null  category
 4   occupation      35819 non-null  category
 5   relationship    38393 non-null  category
 6   race            38393 non-null  category
 7   sex             38393 non-null  category
 8   capital-gain    38393 non-null  float64 
 9   capital-loss    38393 non-null  float64 
 10  hours-per-week  38393 non-null  float64 
 11  native-country  37735 non-null  category
 12  class           38393 non-null  object  
dtypes: category(8), float64(4), object(1)
memory usage: 1.8+ MB
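
The `category` dtype matters because the column selectors used in the pipelines of this notebook pick columns by dtype. The sketch below, on a made-up three-row frame, shows which columns each selector returns; note that a column left as `object` (like `class`) is picked up by neither.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Toy frame (made up) mimicking three columns of the dataset
toy = pd.DataFrame({
    "age": [25.0, 38.0, 18.0],
    "workclass": ["Private", "Private", None],
    "class": ["<=50K", "<=50K", ">50K"],
})
toy = toy.astype({"workclass": "category"})

# A selector is a callable that returns the names of the matching columns
numeric_cols = selector(dtype_include="number")(toy)
categorical_cols = selector(dtype_include="category")(toy)

print(numeric_cols)
print(categorical_cols)
```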


In [14]:
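# numeric_only=True restricts the correlation to the numerical columns
# (recent pandas versions raise an error on non-numeric columns otherwise)
corr_matrix = df.corr(method='spearman', numeric_only=True)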
print(corr_matrix)

                     age  capital-gain  capital-loss  hours-per-week
age             1.000000      0.078726      0.038444        0.148751
capital-gain    0.078726      1.000000     -0.040948        0.032147
capital-loss    0.038444     -0.040948      1.000000        0.036330
hours-per-week  0.148751      0.032147      0.036330        1.000000


Let's separate the predictors and response variable.

In [9]:
X = df.drop(columns = 'class')
Y = df['class']

The dataset has a class ratio of about 30:1 in favour of the class <=50K, so it is highly imbalanced.

In [10]:
classes_count = Y.value_counts()
classes_count

<=50K    37155
>50K      1238
Name: class, dtype: int64

Before we get to the modelling, let's recode the target variable as 0 and 1 (i.e. '>50K': 1, '<=50K': 0).

In [11]:
# Recode the target variable to 0 and 1
Y = Y.map({'>50K': 1, '<=50K': 0})

### Strategies to learn from an imbalanced dataset

We will use a dictionary and a list to store the results of our experiments as we go, and show them as a pandas dataframe.

In [12]:
index = []
scores = {"Accuracy": [], "Balanced accuracy": [], "F1-score": []}
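
Because `index` and `scores` are updated separately, they can drift out of sync if a cell fails halfway or is run twice. One possible safeguard, sketched below, is to append the label and its scores in a single step. The helper `record_scores` is hypothetical (it is not used elsewhere in this notebook) and the numbers are placeholders, not results.

```python
import pandas as pd

index = []
scores = {"Accuracy": [], "Balanced accuracy": [], "F1-score": []}

def record_scores(label, accuracy, balanced_accuracy, f1):
    """Append one experiment's label and its three scores in a single step."""
    index.append(label)
    scores["Accuracy"].append(accuracy)
    scores["Balanced accuracy"].append(balanced_accuracy)
    scores["F1-score"].append(f1)
    return pd.DataFrame(scores, index=index)

# Placeholder numbers, not results obtained on the dataset
df_scores = record_scores("Model A (placeholder)", 0.90, 0.80, 0.70)
df_scores = record_scores("Model B (placeholder)", 0.80, 0.85, 0.75)
print(df_scores)
```

With this design the helper is only called once all the scores of an experiment are available, so the label list and the score lists always have the same length.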

We will perform a cross-validation evaluation to estimate the test scores. We will use the standard accuracy, the balanced accuracy (the average of the recall obtained on each class) and the macro-averaged F1-score.

As a baseline, we could use a classifier which will always predict the majority class independently of the features provided. This is what we call the dummy baseline.
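
To see why accuracy alone is misleading here, the sketch below computes both metrics by hand for a classifier that always predicts the majority class. It uses only the class counts reported by `Y.value_counts()` and evaluates on the whole dataset rather than with cross-validation, so it is an illustration rather than a replacement for the evaluation that follows.

```python
import numpy as np

# Class counts as reported by Y.value_counts()
n_majority, n_minority = 37155, 1238

y_true = np.array([0] * n_majority + [1] * n_minority)
y_pred = np.zeros_like(y_true)  # always predict the majority class (0)

# Standard accuracy: fraction of all samples predicted correctly
accuracy = np.mean(y_pred == y_true)

# Recall of each class: fraction of that class's samples predicted correctly
recall_majority = np.mean(y_pred[y_true == 0] == 0)
recall_minority = np.mean(y_pred[y_true == 1] == 1)
balanced_accuracy = (recall_majority + recall_minority) / 2

print(f"Accuracy: {accuracy:.4f}")
print(f"Balanced accuracy: {balanced_accuracy:.4f}")
```

The accuracy is close to 0.97 even though not a single minority sample is identified, whereas the balanced accuracy drops to 0.5 because the recall on the minority class is 0.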

### Dummy baseline

Before training a real machine learning model, we store the results
obtained with our `sklearn.dummy.DummyClassifier`.

In [13]:
index.append("Dummy classifier")
scoring = ["accuracy", "balanced_accuracy", "f1_macro"]
dummy_clf = DummyClassifier(strategy="most_frequent")

cv_result = cross_validate(dummy_clf, X, Y, scoring = scoring)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())
scores["F1-score"].append(cv_result["test_f1_macro"].mean())  

df_scores = pd.DataFrame(scores, index=index)
df_scores

Unnamed: 0,Accuracy,Balanced accuracy,F1-score
Dummy classifier,0.967755,0.5,0.491807


### Linear classifier baseline

We will create a machine learning pipeline using a `sklearn.linear_model.LogisticRegression` classifier and 5-fold cross-validation. As part of this pipeline, we need to one-hot encode the categorical columns and standardize the numerical columns before feeding the data into the `sklearn.linear_model.LogisticRegression` classifier.
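
With imbalanced classes, the folds should be stratified so that each one keeps the original class proportions. The sketch below checks this on made-up labels with 10% positives; the label vector and the dummy feature matrix are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (made up): 90 negatives and 10 positives, i.e. 10% positives
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # the feature values do not affect the split

skf = StratifiedKFold(n_splits=5)
test_positive_rates = []
for train_index, test_index in skf.split(X, y):
    # Proportion of positives in the test part of this fold
    test_positive_rates.append(y[test_index].mean())

print(test_positive_rates)
```

Each test fold contains the same share of positives as the full label vector, which keeps the per-fold scores comparable.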

First, we implement this workflow by hand, without using Sklearn pipelines.

In [15]:
########### Without using Sklearn pipelines ################

# Identifying continuous and categorical columns
continuous_columns = X.select_dtypes(include="number").columns
categorical_columns = X.select_dtypes(exclude="number").columns

# Define Stratified K-Fold cross-validation 
# (stratified to make sure that we retain the same original proportion of classes in each fold)
skf = StratifiedKFold(n_splits=5)

# Initialise lists to store the results for each fold from the cross-validation
acc_scores = []
bal_acc_scores = []
f1_scores = []

for train_index, test_index in skf.split(X, Y):
    
    ### Splitting the data into training and test sets for this fold
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    Y_train, Y_test = Y.iloc[train_index], Y.iloc[test_index]

    ### Preprocessing continuous columns for this fold
    # Step 1: Impute missing values for continuous columns with their mean
    imputer_cont = SimpleImputer(strategy='mean')
    X_train_cont_imputed = imputer_cont.fit_transform(X_train[continuous_columns])
    # Step 2: Scale the continuous columns
    scaler = StandardScaler()
    X_train_cont_scaled = scaler.fit_transform(X_train_cont_imputed)
    # Step 3: Apply the same transformation to test data
    X_test_cont_imputed = imputer_cont.transform(X_test[continuous_columns])
    X_test_cont_scaled = scaler.transform(X_test_cont_imputed)

    ### Preprocessing categorical columns for this fold
    # Step 1: Impute missing values for categorical columns with the value 'missing'
    imputer_cat = SimpleImputer(strategy='constant', fill_value='missing')
    X_train_cat_imputed = imputer_cat.fit_transform(X_train[categorical_columns])
    # Step 2: Transform categorical columns using OneHotEncoder
    encoder_cat = OneHotEncoder(handle_unknown='ignore')
    X_train_cat_encoded = encoder_cat.fit_transform(X_train_cat_imputed)
    # Step 3: Apply the same transformation to test data
    X_test_cat_imputed = imputer_cat.transform(X_test[categorical_columns])
    X_test_cat_encoded = encoder_cat.transform(X_test_cat_imputed)

    ### Combine continuous and categorical preprocessed columns
    X_train_preprocessed = sparse.hstack((X_train_cont_scaled, X_train_cat_encoded))
    X_test_preprocessed = sparse.hstack((X_test_cont_scaled, X_test_cat_encoded))

    ### Fit Logistic Regression model
    lr_clf = LogisticRegression(max_iter=500)
    #lr_clf = LogisticRegression()
    lr_clf.fit(X_train_preprocessed, Y_train)

    ### Predict and evaluate on the test set
    Y_pred = lr_clf.predict(X_test_preprocessed)
    acc_scores.append(accuracy_score(Y_test, Y_pred))
    bal_acc_scores.append(balanced_accuracy_score(Y_test, Y_pred))
    f1_scores.append(f1_score(Y_test, Y_pred, average='macro'))  

# Calculating mean scores across all folds
mean_accuracy = np.mean(acc_scores)
mean_balanced_accuracy = np.mean(bal_acc_scores)
mean_f1 = np.mean(f1_scores)

# Append the results to scores and df_scores
index += ["Logistic regression"]
scores["Accuracy"].append(mean_accuracy)
scores["Balanced accuracy"].append(mean_balanced_accuracy)
scores["F1-score"].append(mean_f1)
df_scores = pd.DataFrame(scores, index=index)
df_scores

Unnamed: 0,Accuracy,Balanced accuracy,F1-score
Dummy classifier,0.967755,0.5,0.491807
Logistic regression,0.97062,0.574108,0.616515


We can see that our linear model performs slightly better than our dummy baseline. However, it is still affected by the class imbalance.
Now let's re-run the same machine learning workflow using Sklearn pipelines.

First, we define our numerical and categorical pipelines.

In [17]:
########### With Sklearn pipelines ################

num_pipe = make_pipeline(
    SimpleImputer(strategy="mean"), StandardScaler() 
)

cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore")
)

Then, we create a preprocessor that dispatches the categorical
columns to the categorical pipeline and the numerical columns to the
numerical pipeline.

In [18]:
preprocessor_linear = make_column_transformer(
    (num_pipe, selector(dtype_include="number")),
    (cat_pipe, selector(dtype_include="category")),
    n_jobs=-1
)
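
To make the dispatching concrete, here is a minimal sketch of the same kind of preprocessor applied to a made-up three-row frame with one numerical and one categorical column, each containing a missing value. The frame and the variable names `toy`, `preprocessor` and `toy_transformed` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector as selector
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (made up) with one missing value in each column
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "workclass": pd.Categorical(["Private", None, "State-gov"]),
})

num_pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore"),
)
preprocessor = make_column_transformer(
    (num_pipe, selector(dtype_include="number")),
    (cat_pipe, selector(dtype_include="category")),
)

# One scaled column for "age" plus one column per "workclass" category,
# where the imputed value "missing" counts as a category of its own
toy_transformed = preprocessor.fit_transform(toy)
print(toy_transformed.shape)
```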

Finally, we connect our preprocessor with our `sklearn.linear_model.LogisticRegression`. We can then evaluate our model.

In [19]:
lr_clf = make_pipeline(preprocessor_linear, LogisticRegression(max_iter=500))
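
Unlike `Pipeline`, `make_pipeline` does not ask for step names: it derives them from the class names. The short sketch below, on a simplified two-step pipeline, shows how to look the names up and how to address a parameter of a given step. The value `C=0.1` is arbitrary and chosen only for illustration.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))

# make_pipeline names each step after its class, in lowercase
step_names = list(pipe.named_steps)
print(step_names)

# Step parameters are addressed as "<step name>__<parameter name>"
pipe.set_params(logisticregression__C=0.1)
print(pipe.named_steps["logisticregression"].C)
```

The same `<step name>__<parameter name>` convention is what tools such as grid search use to tune the components of a pipeline.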

In [20]:
# Define Stratified K-Fold cross-validation 
# (stratified to make sure that we retain the same original proportion of classes in each fold)
skf = StratifiedKFold(n_splits=5)

index += ["Logistic regression with sklearn pipeline"]
cv_result = cross_validate(lr_clf, X, Y, scoring=scoring, cv=skf)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())
scores["F1-score"].append(cv_result["test_f1_macro"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores

Unnamed: 0,Accuracy,Balanced accuracy,F1-score
Dummy classifier,0.967755,0.5,0.491807
Logistic regression,0.97062,0.574108,0.616515
Logistic regression with sklearn pipeline,0.97062,0.574108,0.616515


Next, we present an approach for improving performance on the minority class by resampling the training set.

### Resample the training set during learning

One way to overcome class imbalance is to resample the training set by under-sampling or over-sampling some of the samples. `imbalanced-learn` provides some samplers to do such processing.

Applying a random over-sampler before training the linear model or random forest stops the classifier from focusing on the majority class, at the cost of making more mistakes on majority-class samples (i.e. decreased accuracy).

We could apply any type of sampler and see which one works best on the current dataset. Here we use random over-sampling; other over-sampling methods are described at https://imbalanced-learn.org/dev/references/over_sampling.html.
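
The idea behind random over-sampling can be sketched in a few lines of NumPy: minority samples are drawn with replacement until both classes have the same size. This is an illustration on made-up data, not the implementation used by `imbalanced-learn`.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy training data (made up): 8 majority samples (0) and 2 minority samples (1)
X_train = np.arange(10).reshape(-1, 1)
y_train = np.array([0] * 8 + [1] * 2)

minority_idx = np.flatnonzero(y_train == 1)
majority_idx = np.flatnonzero(y_train == 0)

# Draw minority samples with replacement until both classes have the same size
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

resampled_idx = np.concatenate([majority_idx, minority_idx, extra_idx])
X_resampled, y_resampled = X_train[resampled_idx], y_train[resampled_idx]

# Number of samples per class after resampling
print(np.bincount(y_resampled))
```

No new information is created: the extra rows are exact copies of existing minority samples. This is why resampling must be applied to the training folds only and never to the test folds.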

In [None]:
########### Without using Sklearn pipelines ################

# Initialise over-sampler
over_sampler = RandomOverSampler(random_state=42)

# Reinitialise lists to store the results for each fold
acc_scores_os = []
bal_acc_scores_os = []
f1_scores_os = []

# Define Stratified K-Fold cross-validation 
# (stratified to make sure that we retain the same original proportion of classes in each fold)
skf = StratifiedKFold(n_splits=5)

for train_index, test_index in skf.split(X, Y):
    
    ### Splitting the data into training and test sets for this fold
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    Y_train, Y_test = Y.iloc[train_index], Y.iloc[test_index]

    ### Apply over-sampling to the training data
    X_train, Y_train = over_sampler.fit_resample(X_train, Y_train)

    ### Preprocessing continuous columns for this fold
    # Step 1: Impute missing values for continuous columns with their mean
    imputer_cont = SimpleImputer(strategy='mean')
    X_train_cont_imputed = imputer_cont.fit_transform(X_train[continuous_columns])
    # Step 2: Scale the continuous columns
    scaler = StandardScaler()
    X_train_cont_scaled = scaler.fit_transform(X_train_cont_imputed)
    # Step 3: Apply the same transformation to test data
    X_test_cont_imputed = imputer_cont.transform(X_test[continuous_columns])
    X_test_cont_scaled = scaler.transform(X_test_cont_imputed)

    ### Preprocessing categorical columns for this fold
    # Step 1: Impute missing values for categorical columns with the value 'missing'
    imputer_cat = SimpleImputer(strategy='constant', fill_value='missing')
    X_train_cat_imputed = imputer_cat.fit_transform(X_train[categorical_columns])
    # Step 2: Transform categorical columns using OneHotEncoder
    encoder_cat = OneHotEncoder(handle_unknown='ignore')
    X_train_cat_encoded = encoder_cat.fit_transform(X_train_cat_imputed)
    # Step 3: Apply the same transformation to test data
    X_test_cat_imputed = imputer_cat.transform(X_test[categorical_columns])
    X_test_cat_encoded = encoder_cat.transform(X_test_cat_imputed)

    ### Combine continuous and categorical preprocessed columns
    X_train_preprocessed = sparse.hstack((X_train_cont_scaled, X_train_cat_encoded))
    X_test_preprocessed = sparse.hstack((X_test_cont_scaled, X_test_cat_encoded))

    ### Fit Logistic Regression model
    lr_clf = LogisticRegression(max_iter=500)
    lr_clf.fit(X_train_preprocessed, Y_train)

    ### Predict and evaluate on the test set
    Y_pred = lr_clf.predict(X_test_preprocessed)
    acc_scores_os.append(accuracy_score(Y_test, Y_pred))
    bal_acc_scores_os.append(balanced_accuracy_score(Y_test, Y_pred))
    f1_scores_os.append(f1_score(Y_test, Y_pred, average='macro'))

# Calculating mean scores across all folds
mean_accuracy_oversampling = np.mean(acc_scores_os)
mean_balanced_accuracy_oversampling = np.mean(bal_acc_scores_os)
mean_f1_oversampling = np.mean(f1_scores_os)

# Append the results to scores and df_scores
index.append("Over-sampling Logistic Regression")
scores["Accuracy"].append(mean_accuracy_oversampling)
scores["Balanced accuracy"].append(mean_balanced_accuracy_oversampling)
scores["F1-score"].append(mean_f1_oversampling)
df_scores = pd.DataFrame(scores, index=index)
df_scores

In [21]:
########### With Sklearn pipelines ################

num_pipe = make_pipeline(
    SimpleImputer(strategy="mean"), StandardScaler() 
)

cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore")
)

preprocessor_linear = make_column_transformer(
    (num_pipe, selector(dtype_include="number")),
    (cat_pipe, selector(dtype_include="category")),
    n_jobs=-1
)

lr_clf_over = make_pipeline_with_sampler(
    RandomOverSampler(random_state=42),
    preprocessor_linear,
    LogisticRegression(max_iter=500)
)

# Define Stratified K-Fold cross-validation 
# (stratified to make sure that we retain the same original proportion of classes in each fold)
skf = StratifiedKFold(n_splits=5)

index.append("Over-sampling Logistic Regression with sklearn pipeline")
cv_result = cross_validate(lr_clf_over, X, Y, scoring=scoring, cv=skf)
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())
scores["F1-score"].append(cv_result["test_f1_macro"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores

Unnamed: 0,Accuracy,Balanced accuracy,F1-score
Dummy classifier,0.967755,0.5,0.491807
Logistic regression,0.97062,0.574108,0.616515
Logistic regression with sklearn pipeline,0.97062,0.574108,0.616515
Over-sampling Logistic Regression with sklearn pipeline,0.811607,0.814417,0.55577


## 3. Exercise

1) Run a pipeline with over-sampling and a random forest classifier.
2) Run a pipeline with under-sampling and a random forest classifier, and compare the results.

In [26]:
# 1) Run a pipeline with over-sampling and a random forest classifier.
# TODO: replace the content of this cell with your solution
########### With Sklearn pipelines ################
from sklearn.ensemble import RandomForestClassifier

num_pipe = make_pipeline(
    SimpleImputer(strategy="mean"), StandardScaler() 
)

cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OneHotEncoder(handle_unknown="ignore")
)

preprocessor_linear = make_column_transformer(
    (num_pipe, selector(dtype_include="number")),
    (cat_pipe, selector(dtype_include="category")),
    n_jobs=-1
)

rfc_clf_over = make_pipeline_with_sampler(
    RandomOverSampler(random_state=42),
    preprocessor_linear,
    RandomForestClassifier(n_estimators=100, random_state=42)  # fixed seed for reproducibility
)

# Define Stratified K-Fold cross-validation 
# (stratified to make sure that we retain the same original proportion of classes in each fold)
skf = StratifiedKFold(n_splits=5)

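# Run the cross-validation first and append the label only once it has succeeded.
# Appending to `index` beforehand leaves `index` longer than the lists in `scores`
# whenever this cell fails or is interrupted, and pd.DataFrame then raises
# "ValueError: Shape of passed values is (7, 3), indices imply (9, 3)".
# If the two are already out of sync, re-run the notebook from the cell
# that initialises `index` and `scores`.
cv_result = cross_validate(rfc_clf_over, X, Y, scoring=scoring, cv=skf)
index.append("Over-sampling Random Forest Classifier with sklearn pipeline")
scores["Accuracy"].append(cv_result["test_accuracy"].mean())
scores["Balanced accuracy"].append(cv_result["test_balanced_accuracy"].mean())
scores["F1-score"].append(cv_result["test_f1_macro"].mean())

df_scores = pd.DataFrame(scores, index=index)
df_scores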

In [None]:
# 2) Run a pipeline with under-sampling and a random forest classifier.
# TODO: replace the content of this cell with your solution
raise NotImplementedError()