### Pipeline

In this notebook, we extend the work from the previous notebook by building and evaluating the **Logistic Regression** model inside a **Pipeline**, which gives the workflow a more structured form.

A **Pipeline** bundles multiple steps, such as imputation, feature scaling, and model training, into a single object. This ensures that all operations are applied consistently and in the correct order. With a pipeline, transforming the data, fitting the model, and making predictions become a single `fit`/`predict` workflow. Because the preprocessing steps are fitted only on the data passed to `fit`, this reduces the risk of data leakage, and it also makes the code easier to read.
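
As a rough illustration of the leakage point, the sketch below fits a two-step pipeline on synthetic data and checks that the scaler's statistics come from the training split alone. The data from `make_classification` and the names `demo_pipeline`, `X_tr`, and `X_te` are assumptions made for this example; they are not part of the diabetes workflow.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; this is not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.8, random_state=0)

demo_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])

# fit() learns the scaling statistics from the training split only
demo_pipeline.fit(X_tr, y_tr)

# predict() reuses those training statistics when transforming the test split
y_hat = demo_pipeline.predict(X_te)

# The fitted scaler's mean matches the training-set column means
scaler_mean = demo_pipeline.named_steps['scaler'].mean_
print(np.allclose(scaler_mean, X_tr.mean(axis=0)))
```

The test split never influences the scaler, which is the property that a manual "scale everything, then split" workflow would lose.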

In this notebook, we'll use the **Pipeline** to handle both the preprocessing of the data (imputing missing values and standardizing the numerical features) and the logistic regression classification step. We will also use **Grid Search** with cross-validation to tune the hyperparameters, selecting the combination with the best cross-validated accuracy.
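
When a grid search is run over a pipeline, each hyperparameter is addressed as `<step name>__<parameter name>`. The minimal sketch below shows that naming convention on a small pipeline; the name `naming_demo` and its step names are assumptions chosen for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative pipeline (the step names are arbitrary choices for this sketch)
naming_demo = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(solver='saga')),
])

# Parameters of a step are exposed as '<step name>__<parameter name>'
tunable = [name for name in naming_demo.get_params() if name.startswith('classifier__')]
print(sorted(tunable))

# set_params uses the same naming; a grid search does this for every candidate
naming_demo.set_params(classifier__C=0.1)
print(naming_demo.named_steps['classifier'].C)
```

Listing `get_params()` in this way is a quick check that the keys of a parameter grid match the step names actually used in the pipeline.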

This approach not only simplifies the process but also allows us to easily apply the same transformations and model evaluation across different datasets or experiments. Let's dive into the process of constructing the pipeline and optimizing the model.


In [1]:
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer  # imputation of missing values
from sklearn.pipeline import Pipeline  # chains preprocessing and model steps
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [2]:
# Load the diabetes dataset
diabetes = pd.read_csv('./datasets/diabetes.csv')

In [3]:
display(diabetes)

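| Unnamed: 0 | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |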


In [4]:
# Shuffling all samples to avoid group bias
diabetes = diabetes.sample(frac=1, random_state=42).reset_index(drop=True)

In [5]:
# Select features and target variable
selected_features = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
                     'BMI', 'DiabetesPedigreeFunction', 'Age']
X = diabetes[selected_features].values
y = diabetes['Outcome'].values

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

In [6]:
# Define the pipeline without the OutlierRemover step (introduced in the next notebook)

# List of indices for numeric features (from 0 to X.shape[1] - 1)
numeric_features = list(range(X.shape[1]))  # (in this case columns 0 to 7 in the dataset)

# The imputer fills any remaining missing values (NaN) with the column mean;
# the scaler then standardizes each feature
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# ColumnTransformer applies preprocessing steps to specific feature subsets:
# - 'num': applies numeric_transformer to the numeric_features columns
# - remainder='passthrough': columns not listed in any transformer are passed through unchanged
#   (here every column of X is listed as numeric, so nothing is left over; Outcome is the target y and is not part of X)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)
    ],
    remainder='passthrough'
)
# In this case preprocessor manages the imputation and standardization of numerical features
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42, C=1.0, penalty='l2', solver='saga'))  # starting values; C and penalty are overridden by the grid search
])
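
To make the effect of the numeric transformer concrete, here is a small sketch on a toy array, under the assumption that missing values are encoded as NaN. The array `toy` and the name `toy_transformer` are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with one missing entry (hypothetical values, not from the dataset)
toy = np.array([[1.0, 10.0],
                [np.nan, 20.0],
                [3.0, 30.0]])

toy_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])
toy_out = toy_transformer.fit_transform(toy)

# The NaN is replaced by the mean of its column (2.0), which standardizes to 0
print(toy_out[1, 0])

# After scaling, each column has mean 0 and standard deviation 1
print(toy_out.mean(axis=0), toy_out.std(axis=0))
```

By default `SimpleImputer` treats only NaN as missing, so zero entries such as those visible in the `Insulin` column are kept as ordinary values unless they were converted to NaN beforehand.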

In [7]:
# Define the hyperparameter grid for grid search
param_grid = {
    'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2']
}

# Create the GridSearchCV object. We pass the entire pipeline, so the preprocessing
# is refit on the training folds of each cross-validation split
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')

# Fit the model with grid search on the training data
grid_search.fit(X_train, y_train)

In [8]:
# Get the best parameters from the grid search
best_params = grid_search.best_params_
print(best_params)

{'classifier__C': 1, 'classifier__penalty': 'l1'}
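
Beyond the single best combination, it can help to see how close the other candidates were. The sketch below does this through `cv_results_`, using synthetic data and a reduced grid so that it runs on its own; the names `cv_pipeline`, `cv_grid`, and `cv_search` and the `max_iter=5000` setting are assumptions for this example, and the same lookup should apply to the `grid_search` object above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; this is not the diabetes dataset)
X_cv, y_cv = make_classification(n_samples=300, n_features=8, random_state=0)

cv_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    # max_iter is raised here so that saga has room to converge (a choice made for this sketch)
    ('classifier', LogisticRegression(solver='saga', max_iter=5000, random_state=42)),
])
cv_grid = {'classifier__C': [0.01, 1, 100], 'classifier__penalty': ['l1', 'l2']}
cv_search = GridSearchCV(cv_pipeline, cv_grid, cv=5, scoring='accuracy')
cv_search.fit(X_cv, y_cv)

# One row per candidate: mean and spread of the accuracy across the 5 folds
results = pd.DataFrame(cv_search.cv_results_)
summary = results[['param_classifier__C', 'param_classifier__penalty',
                   'mean_test_score', 'std_test_score', 'rank_test_score']]
print(summary.sort_values('rank_test_score'))
```

If several candidates score within one standard deviation of each other, the choice between them is less clear-cut than `best_params_` alone suggests.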


In [9]:
# Make predictions on the test data using the best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

In [10]:
# Evaluate the performance of the best model
accuracy = accuracy_score(y_test, y_pred)
classification_report_str = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Print the results
print(f'Best Hyperparameters: {best_params}')
print(f'Accuracy with Best Model: {accuracy:.2f}')
print('Classification Report:\n', classification_report_str)
print('Confusion Matrix:\n', conf_matrix)

Best Hyperparameters: {'classifier__C': 1, 'classifier__penalty': 'l1'}
Accuracy with Best Model: 0.79
Classification Report:
               precision    recall  f1-score   support

           0       0.80      0.89      0.84        96
           1       0.77      0.64      0.70        58

    accuracy                           0.79       154
   macro avg       0.79      0.76      0.77       154
weighted avg       0.79      0.79      0.79       154

Confusion Matrix:
 [[85 11]
 [21 37]]
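
The figures in the classification report can be recomputed from the confusion matrix, which is a useful consistency check. The sketch below does this for class 1, relying on scikit-learn's layout in which rows are true labels and columns are predicted labels; the variable names are chosen for this example.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class
conf = np.array([[85, 11],
                 [21, 37]])
tn, fp, fn, tp = conf.ravel()

acc = (tp + tn) / conf.sum()        # (37 + 85) / 154
precision_1 = tp / (tp + fp)        # 37 / 48
recall_1 = tp / (tp + fn)           # 37 / 58
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f'{acc:.2f} {precision_1:.2f} {recall_1:.2f} {f1_1:.2f}')
# 0.79 0.77 0.64 0.70
```

The recall of 0.64 for class 1 corresponds to the 21 positive cases that the model predicted as negative, which the overall accuracy of 0.79 does not show on its own.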
