# Chapter 19: Pipelines

Scikit-learn supports the notion of a pipeline. Using the `Pipeline`
class, you can chain together transformers and a model, and
treat the whole process like a single scikit-learn model. You can even
insert custom logic as a step.
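
As a minimal sketch of the idea, the block below chains a scaler and a classifier and then calls `fit`, `predict`, and `score` on the chain as if it were one model. The synthetic data, the choice of `LogisticRegression`, and the `intro_*` names are assumptions made for illustration; they are not part of the Titanic example that follows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 100 rows, 3 features
rng = np.random.default_rng(0)
intro_X = rng.normal(size=(100, 3))
intro_y = (intro_X[:, 0] + intro_X[:, 1] > 0).astype(int)

# Each step is a (name, estimator) tuple; every step except the last must be a transformer
intro_pipe = Pipeline([
    ("std", StandardScaler()),
    ("logreg", LogisticRegression()),
])

intro_pipe.fit(intro_X, intro_y)              # fits the scaler, then the model on the scaled data
intro_preds = intro_pipe.predict(intro_X)     # scales with the fitted scaler, then predicts
intro_acc = intro_pipe.score(intro_X, intro_y)  # mean accuracy of the final classifier
```

The step names (`"std"`, `"logreg"`) are arbitrary labels; they matter later because they are used to address individual steps and their parameters.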

In [3]:
# Basic Libraries

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns; sns.set()

import pandas as pd
import numpy as np

In [4]:
# Specific Libraries

from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must be imported before IterativeImputer, which is still experimental)
from sklearn import (
    ensemble,
    impute,
    model_selection,    
    preprocessing,
    tree,
)
from sklearn.base import (
    BaseEstimator,
    TransformerMixin,
)
from sklearn.ensemble import (
    RandomForestClassifier,
)
from sklearn.pipeline import Pipeline

In [5]:
def tweak_titanic(df):      # Define a function called 'tweak_titanic' that takes a DataFrame 'df' as an argument
    df = df.drop(           # Drop specific columns from the DataFrame
        columns=[
            "name",         # Drop the 'name' column as it's unlikely to contribute to prediction
            "ticket",       # Drop the 'ticket' column, which may be considered unnecessary for modeling
            "home.dest",    # Drop 'home.dest' as it may not add significant predictive power
            "boat",         # Drop 'boat' information since it may not be relevant to survival prediction
            "body",         # Drop 'body' as it indicates deceased passengers and could create data leakage
            "cabin",        # Drop 'cabin' due to many missing values and potential overfitting
        ]
    ).pipe(pd.get_dummies, drop_first=True)  # Convert categorical columns to dummy/indicator variables, dropping the first level
    return df                                # Return the modified DataFrame

### Classification Pipeline

In [7]:
class TitanicTransformer(  # Define a custom transformer class for the Titanic dataset
    BaseEstimator, TransformerMixin  # Inherit from scikit-learn's BaseEstimator and TransformerMixin for compatibility with pipelines
):
    def transform(self, X):             # Define the 'transform' method that processes the input data 'X'
                                        # assumes X is the raw Titanic DataFrame (read from the CSV file below)
        X = tweak_titanic(X)            # Apply the 'tweak_titanic' function to clean and preprocess 'X'
        X = X.drop(columns="survived")  # Drop the 'survived' column to only retain feature columns
        return X                        # Return the transformed DataFrame

    def fit(self, X, y=None):  # 'fit' is required by scikit-learn; nothing is learned here, and 'y' is optional by convention
        return self            # Return 'self' so that calls can be chained

# Create a pipeline for preprocessing and model training
pipe = Pipeline([
        ("titan", TitanicTransformer()),         # Step 1: Apply the 'TitanicTransformer' to preprocess the data
        ("impute", impute.IterativeImputer()),   # Step 2: Impute missing values using IterativeImputer
        ("std", preprocessing.StandardScaler()), # Step 3: Standardize the features using StandardScaler
        ("rf", RandomForestClassifier()),        # Step 4: Fit a RandomForestClassifier to the data
    ])
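
For stateless steps such as the column dropping above, a full class is not strictly required. The following is a minimal sketch, assuming scikit-learn's `FunctionTransformer` and a made-up two-row DataFrame; `drop_text_columns` and `ft_*` are hypothetical names used only here.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def drop_text_columns(df):
    """Hypothetical helper: drop a free-text column that a model cannot use directly."""
    return df.drop(columns=["name"])

# Made-up data for illustration only
ft_df = pd.DataFrame({"name": ["a", "b"], "age": [29.0, 2.0]})

# Wrap the plain function so that it exposes fit/transform and can be a pipeline step
ft_dropper = FunctionTransformer(drop_text_columns)
ft_out = ft_dropper.fit_transform(ft_df)
```

A class such as `TitanicTransformer` remains the better fit when the step has to learn something in `fit` or needs its own parameters.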

In [8]:
url = ("https://raw.githubusercontent.com/joanby/python-ml-course/refs/heads/master/datasets/titanic/titanic3.csv")

df = pd.read_csv(url)
df.head()

| Unnamed: 0 | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2.0 |  | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | 11.0 |  | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S |  |  | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S |  | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S |  |  | Montreal, PQ / Chesterville, ON |


In [9]:
from sklearn.model_selection import train_test_split 

# Split the dataset 'df' into training and testing sets with 'survived' as the target variable
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    df,                        # The entire DataFrame 'df'
    df.survived,               # The target variable 'survived'
    test_size=0.3,             # 30% of the data will be used for the test set
    random_state=42,           # Set random state for reproducibility
)

pipe.fit(X_train2, y_train2)  # Fit the pipeline to the training data (preprocessing and model training)
pipe.score(X_test2, y_test2)  # Evaluate the pipeline on the test data and return the accuracy score

0.7888040712468194

Pipelines can be used in a grid search. Each key in `param_grid` must be
the name of the pipeline stage, followed by two underscores and the parameter
name (for example, `rf__n_estimators`). In the example below, we add some
parameters for the random forest stage:

In [11]:
# Define a dictionary for parameter grid search
params = {
    "rf__max_features": [0.4, None],  # Specify valid values for 'max_features': 0.4 and None (using all features)
    "rf__n_estimators": [15, 200],    # Specify different numbers of estimators for the RandomForestClassifier
}

# Create a GridSearchCV object for hyperparameter tuning
grid = model_selection.GridSearchCV(
    pipe,                          # The pipeline object to use for fitting and scoring
    cv=3,                          # Use 3-fold cross-validation
    param_grid=params              # Provide the parameter grid defined above
)

# Fit the GridSearchCV object to the DataFrame 'df' using the target 'survived'
grid.fit(df, df.survived)          # Perform the grid search fitting process on the entire DataFrame
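
The double-underscore names used in the grid above do not have to be guessed: `get_params()` on a pipeline lists every addressable parameter. The sketch below uses a small stand-in pipeline (an assumption, so that the block runs on its own) with the same `"rf"` step name.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline for illustration; only the step names matter here
names_pipe = Pipeline([
    ("std", StandardScaler()),
    ("rf", RandomForestClassifier()),
])

# Every key of the form "<step>__<parameter>" is valid in a param_grid
rf_param_names = sorted(
    name for name in names_pipe.get_params() if name.startswith("rf__")
)

# The same naming scheme works with set_params
names_pipe.set_params(rf__n_estimators=15)
```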

Now we can pull out the best parameters and train the final
model. (In this run, test accuracy rises from about 0.789 to about 0.807
after grid search. The random forest is not seeded, so the exact scores
vary between runs.)

In [13]:
grid.best_params_                    # Retrieve the best parameters found during the grid search
pipe.set_params(**grid.best_params_) # Set the pipeline parameters to the best found in the grid search
pipe.fit(X_train2, y_train2)         # Fit the pipeline to the training data using the best parameters
pipe.score(X_test2, y_test2)         # Score the pipeline on the test data to evaluate its performance

0.806615776081425
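
An alternative to copying the parameters back by hand: with its default `refit=True`, `GridSearchCV` refits the best configuration and exposes it as `best_estimator_`. The sketch below is an illustration on synthetic data, not the notebook's run. It also fits the search on the training split only, which is an assumption about good practice that keeps the test rows out of the tuning step; all `gs_*` names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
rng = np.random.default_rng(0)
gs_X = rng.normal(size=(200, 4))
gs_y = (gs_X[:, 0] - gs_X[:, 2] > 0).astype(int)
gs_X_train, gs_X_test, gs_y_train, gs_y_test = train_test_split(
    gs_X, gs_y, test_size=0.3, random_state=42
)

gs_search = GridSearchCV(
    Pipeline([
        ("std", StandardScaler()),
        ("rf", RandomForestClassifier(random_state=42)),  # seeded for repeatable results
    ]),
    param_grid={"rf__n_estimators": [5, 20]},
    cv=3,
)
gs_search.fit(gs_X_train, gs_y_train)       # tune on the training rows only

gs_best = gs_search.best_estimator_         # pipeline already refit with the best parameters
gs_test_acc = gs_best.score(gs_X_test, gs_y_test)
```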

We can use the pipeline anywhere we would use a scikit-learn model:

In [15]:
from sklearn import metrics

# Calculate the ROC AUC score to evaluate the performance of the pipeline's predictions
metrics.roc_auc_score(
    y_test2,                # True labels for the test data
    pipe.predict(X_test2)   # Predicted labels from the pipeline on the test data
)

0.7969410397295013
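
The score above is computed from hard class labels. `roc_auc_score` also accepts probability scores, which lets it evaluate the ranking over all thresholds rather than a single cut-off. A minimal sketch on synthetic data follows; the data, the `LogisticRegression` stand-in, and the `auc_*` names are assumptions for illustration.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, noisy binary problem for illustration only
rng = np.random.default_rng(0)
auc_X = rng.normal(size=(300, 3))
auc_y = (auc_X[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)

auc_pipe = Pipeline([
    ("std", StandardScaler()),
    ("logreg", LogisticRegression()),
]).fit(auc_X, auc_y)

# AUC from hard 0/1 predictions (a single operating point)
auc_from_labels = metrics.roc_auc_score(auc_y, auc_pipe.predict(auc_X))

# AUC from the predicted probability of the positive class (all thresholds)
auc_from_probs = metrics.roc_auc_score(auc_y, auc_pipe.predict_proba(auc_X)[:, 1])
```

The pipeline forwards `predict_proba` to its final step, so this works for any pipeline whose last estimator implements it, including a random forest.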

### Regression Pipeline
Here is an example of a pipeline that performs linear regression on the Boston dataset:

In [17]:
# Boston Housing Dataframe

from sklearn.datasets import fetch_openml

# Fetch the Boston housing dataset from openml
boston_data = fetch_openml(data_id=531, as_frame=True)
bos_X = boston_data.data
bos_y = boston_data.target

# Split the dataset into training and testing sets with 30% of data as the test set
bos_X_train, bos_X_test, bos_y_train, bos_y_test = model_selection.train_test_split(
    bos_X,          # Feature data (input variables)
    bos_y,          # Target data (output variable)
    test_size=0.3,  # Size of the test set as a fraction of the whole dataset
    random_state=42 # Random seed to ensure reproducibility
)

# Standardize the feature data (subtract the mean and scale to unit variance)
bos_sX = preprocessing.StandardScaler().fit_transform(bos_X)

# Split the standardized data into training and testing sets
bos_sX_train, bos_sX_test, bos_sy_train, bos_sy_test = model_selection.train_test_split(
    bos_sX,         # Standardized feature data
    bos_y,          # Target data (output variable)
    test_size=0.3,  # Size of the test set as a fraction of the whole dataset
    random_state=42 # Random seed to ensure reproducibility
)

In [18]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Create a pipeline for regression with two steps: standardization and linear regression
reg_pipe = Pipeline(
    [("std", preprocessing.StandardScaler()),  # Step 1: Standardize the features using StandardScaler
     ("lr", LinearRegression()),               # Step 2: Apply Linear Regression
    ])

reg_pipe.fit(bos_X_train, bos_y_train)  # Fit the pipeline on the training data
reg_pipe.score(bos_X_test, bos_y_test)  # Calculate and return the R^2 score of the model on the test data

0.7112260057484933

If we want to pull parts out of the pipeline to examine their properties, we can do that with the `.named_steps` attribute:

In [20]:
# Access the intercept of the Linear Regression model in the pipeline
reg_pipe.named_steps["lr"].intercept_

23.01581920903955

In [21]:
# Access the coefficients of the Linear Regression model in the pipeline
reg_pipe.named_steps["lr"].coef_

array([-1.10834602,  0.80843998,  0.34313466,  0.81386426, -1.79804295,
        2.913858  , -0.29893918, -2.94251148,  2.09419303, -1.44706731,
       -2.05232232,  1.02375187, -3.88579002])
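
Two points help when reading these numbers. Because `StandardScaler` centers every feature, the fitted intercept equals the mean of the training target, and the coefficients are expressed per standard deviation of each feature. The sketch below illustrates both on synthetic data and labels the coefficients with their column names; the data and the `coef_*` names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data for illustration only
rng = np.random.default_rng(0)
coef_X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
coef_y = 3.0 * coef_X["a"] - 2.0 * coef_X["b"] + rng.normal(scale=0.1, size=100)

coef_pipe = Pipeline([
    ("std", StandardScaler()),
    ("lr", LinearRegression()),
]).fit(coef_X, coef_y)

lr_step = coef_pipe.named_steps["lr"]

# Attach feature names to the coefficients and rank them by absolute size
coef_series = pd.Series(lr_step.coef_, index=coef_X.columns)
coef_ranked = coef_series.sort_values(key=np.abs, ascending=False)

# With centered features, the intercept is the mean of the training target
intercept_matches_mean = bool(np.isclose(lr_step.intercept_, coef_y.mean()))
```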

We can use the pipeline in metric calculations as well:

In [23]:
from sklearn import metrics 

# Calculate the Mean Squared Error between the actual test values and the predicted values
metrics.mean_squared_error(
    bos_y_test,                   # The actual target values from the test set
    reg_pipe.predict(bos_X_test)  # The predicted target values using the regression pipeline
)

21.517444231177205
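
Mean squared error is in squared target units, so its square root is often easier to interpret. A small sketch using the value reported above follows; the reading in thousands of dollars assumes the target is the usual Boston `MEDV` column.

```python
import math

mse_reported = 21.517444231177205   # MSE copied from the output above

# Root mean squared error: same units as the target
rmse = math.sqrt(mse_reported)
```

Under that assumption, the typical prediction error is roughly 4.6 in the target's units.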

### PCA Pipeline
Scikit-learn pipelines can also be used for PCA.
Here we standardize the Titanic dataset and perform PCA on
it:

In [25]:
from sklearn.decomposition import PCA

# Create a pipeline that includes data transformation, imputation, scaling, and PCA decomposition
pca_pipe = Pipeline([
    ("titan", TitanicTransformer()),         # Apply custom transformer for preprocessing Titanic data
    ("impute", impute.IterativeImputer()),   # Impute missing values using IterativeImputer
    ("std", preprocessing.StandardScaler()), # Standardize features by removing the mean and scaling to unit variance
    ("pca", PCA()),                          # Apply Principal Component Analysis for dimensionality reduction
])

# Fit the pipeline to the DataFrame 'df' and target variable 'survived', then transform the data
X_pca = pca_pipe.fit_transform(df, df.survived)

Using the `.named_steps` attribute, we can pull properties off of the PCA portion of the pipeline:

In [27]:
# Access the explained variance ratio for each principal component from the PCA step in the pipeline
pca_pipe.named_steps["pca"].explained_variance_ratio_

array([0.23843437, 0.21766138, 0.19207432, 0.10460781, 0.08254178,
       0.07218454, 0.05099774, 0.04149805])

Each value in this array corresponds to a principal component (PC) and shows the fraction of the total dataset variance captured by that component. The first value, 0.23843437, means that the first principal component explains approximately 23.84% of the total variance in the data.
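
A common follow-up is the cumulative share of variance, which shows how many components are needed to reach a chosen threshold. The sketch below uses the ratios printed above; the 90% threshold is an arbitrary choice for illustration.

```python
import numpy as np

# Explained variance ratios copied from the output above
evr = np.array([0.23843437, 0.21766138, 0.19207432, 0.10460781,
                0.08254178, 0.07218454, 0.05099774, 0.04149805])

# Running total of explained variance across components
evr_cumulative = np.cumsum(evr)

# Smallest number of components whose cumulative share reaches 90%
n_components_90 = int(np.searchsorted(evr_cumulative, 0.90) + 1)
```

With these values, the first three components account for roughly 65% of the variance, and six components are needed to pass 90%.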

In [29]:
# Access the first principal component vector from the PCA step in the pipeline
pca_pipe.named_steps["pca"].components_[0]

array([ 0.63591201, -0.39601222,  0.00210876, -0.10899407, -0.58278256,
        0.19349714,  0.19275661,  0.11258023])

The `components_` attribute from the PCA step provides the principal component vectors. Each component vector describes the weights (or loadings) assigned to each original feature to form a principal component. The array [ 0.63591201, -0.39601222, ... , 0.11258023] represents the first principal component. 

* **Positive and Negative Values**: The sign of the coefficients indicates the direction of the relationship between the feature and the principal component. A positive coefficient means that an increase in that feature increases the principal component score, while a negative coefficient means the opposite.
* **Magnitude of Values**: The absolute value of each coefficient reflects the strength of the feature's contribution. Larger magnitudes indicate that the feature has a stronger influence on the component.
* **Feature Contribution**: In this example, the first feature (represented by 0.63591201) has the highest positive contribution, while the fifth feature (with -0.58278256) has a significant negative contribution.
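
The reading of magnitudes described in the list above can be checked directly. The sketch below ranks the loadings of the first component by absolute value, using the array printed earlier. Positions are zero-based column indices of the data that reached the PCA step; mapping them to feature names is left out because it depends on the column order produced by the earlier steps.

```python
import numpy as np

# First principal component, copied from the output above
pc1 = np.array([0.63591201, -0.39601222, 0.00210876, -0.10899407,
                -0.58278256, 0.19349714, 0.19275661, 0.11258023])

# Feature positions ordered from strongest to weakest loading
loading_order = np.argsort(-np.abs(pc1))
top3_loadings = [(int(i), float(pc1[i])) for i in loading_order[:3]]

# PCA component vectors have unit length, so the squared loadings sum to about 1
pc1_squared_norm = float(np.sum(pc1 ** 2))
```

Here the three strongest loadings sit at positions 0, 4, and 1, which matches the first and fifth features singled out above.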