# Pipelines

## Introduction
In this lecture, we will learn about **pipelines** in sklearn, which are a powerful tool for building and evaluating machine learning models. We will cover the following topics:

- What is a pipeline and what are its benefits?
- How to create a pipeline using the `make_pipeline` function
- How to access and modify the steps and parameters of a pipeline
- How to use a pipeline for cross-validation and grid search

## What is a pipeline and what are its benefits?
A typical machine learning workflow involves several steps, such as data cleaning, feature extraction, feature selection, scaling, dimensionality reduction, and model fitting. Applying these steps manually can be tedious and error-prone, especially when we need to ensure that the same steps are applied consistently to both the training and the test data.

A **pipeline** is a way of automating this workflow by chaining together a sequence of data transformers with an optional final predictor. A pipeline allows us to:

- **Simplify the code** by avoiding intermediate variables and fit/transform calls
- **Ensure consistent preprocessing** by applying the same steps to both the training and the test data
- **Enable parameter setting** by using the names of the steps and the parameter names separated by a '__'
- **Allow for easy cross-validation and grid search** over the whole process
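
To make the first two benefits concrete, here is a minimal sketch that compares the manual workflow with the pipeline workflow. The synthetic dataset and the variable names (`X_demo`, `demo_pipe`, and so on) are assumptions made for illustration only; they are not part of the lecture's dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Manual workflow: fit the scaler on the training data only,
# then reuse the fitted scaler to transform the test data
scaler = StandardScaler().fit(X_tr)
clf = SVC().fit(scaler.transform(X_tr), y_tr)
manual_pred = clf.predict(scaler.transform(X_te))

# Pipeline workflow: the same two steps wrapped in a single estimator
demo_pipe = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)
pipe_pred = demo_pipe.predict(X_te)

# Both workflows should produce the same predictions
same_predictions = np.array_equal(manual_pred, pipe_pred)
print(same_predictions)
```

The pipeline version needs no intermediate variables, and it is not possible to forget the scaling step when predicting on new data.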

## How to create a pipeline using the `make_pipeline` function

One of the easiest ways to create a pipeline in sklearn is to use the [`make_pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html) function, which takes the estimator objects as positional arguments and returns a pipeline object. The `make_pipeline` function neither requires nor permits naming the estimators. Instead, it automatically assigns names to the steps based on their types, in lowercase. For example, if we pass a `StandardScaler` and an `SVC` object, the names of the steps will be 'standardscaler' and 'svc', respectively.

The syntax of the `make_pipeline` function is:

```python
make_pipeline(*steps, memory=None, verbose=False)
```

where:

- `*steps` are the estimator objects, passed as positional arguments and chained together in the given order
- `memory` is an optional argument that can be used to cache the fitted transformers of the pipeline
- `verbose` is an optional argument that can be used to print the time elapsed while fitting each step
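
As a rough sketch of the two optional arguments, the snippet below passes a temporary directory as `memory` and turns on `verbose`. The synthetic data, the `PCA` step, and the names `cache_dir` and `cached_pipe` are assumptions chosen for illustration.

```python
import tempfile

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=100, n_features=8, random_state=0)

with tempfile.TemporaryDirectory() as cache_dir:
    # memory=cache_dir caches the fitted transformers on disk;
    # verbose=True prints the time spent fitting each step
    cached_pipe = make_pipeline(
        PCA(n_components=3), SVC(), memory=cache_dir, verbose=True
    )
    cached_pipe.fit(X_demo, y_demo)

    # With caching enabled, inspect fitted steps through named_steps
    n_components_kept = cached_pipe.named_steps["pca"].n_components_

print(n_components_kept)
```

Caching mainly pays off when the same expensive transformer is refitted many times with unchanged parameters, for example during a grid search over the final estimator only.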

For example, to create a pipeline that scales the data and applies a support vector classifier, we can write:

In [None]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(StandardScaler(), SVC())
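
The automatic naming described above can be checked by reading the `steps` attribute. The sketch below also shows what happens when the same estimator type appears twice; the names `named_demo` and `dup_demo` are hypothetical.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Step names are the lowercased class names
named_demo = make_pipeline(StandardScaler(), SVC())
step_names = [name for name, _ in named_demo.steps]
print(step_names)

# Repeated estimator types receive a numeric suffix to keep names unique
dup_demo = make_pipeline(StandardScaler(), StandardScaler(), SVC())
dup_names = [name for name, _ in dup_demo.steps]
print(dup_names)
```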

## How to access and modify the steps and parameters of a pipeline
- A pipeline object has several attributes and methods that can be used to access the steps and parameters of the pipeline
- Some of the most common ones are:
    * `named_steps`: a dictionary that maps the names of the steps to the estimators
    * `steps`: a list of tuples that contains the names and the estimators of the steps
    * `get_params()`: a method that returns a dictionary of all the parameters of the pipeline and the estimators
    * `set_params()`: a method that sets the parameters of the pipeline and the estimators
    * `fit()`: a method that fits the pipeline on the data
    * `predict()`: a method that makes predictions using the pipeline
    * `score()`: a method that returns the score of the pipeline on the data
- For example, to access the scaler and the SVC objects in the previous pipeline, we can write:

In [None]:
print(pipe.named_steps['standardscaler'])
print(pipe.named_steps['svc'])

- To get the value of the C parameter of the SVC, we can write:

In [None]:
print(pipe.get_params()['svc__C'])

- To set the value of the C parameter of the SVC to 10, we can write:

In [None]:
pipe.set_params(svc__C=10)

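- To fit the pipeline on the training data and score it on the test data, we can write (here `X_train`, `X_test`, `y_train`, and `y_test` stand for any train/test split with numeric features and class labels):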

```python
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
```
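
The calls above can be combined into one runnable sketch. It assumes a synthetic classification dataset in place of real data; `X_demo`, `demo_pipe`, and the other names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

demo_pipe = make_pipeline(StandardScaler(), SVC())

# Read a nested parameter using the '<step name>__<parameter name>' key
default_C = demo_pipe.get_params()["svc__C"]

# Change the same parameter, then confirm it on the step itself
demo_pipe.set_params(svc__C=10)
new_C = demo_pipe.named_steps["svc"].C

# Fit every step on the training data, then score on the test data
demo_pipe.fit(X_tr, y_tr)
test_accuracy = demo_pipe.score(X_te, y_te)
print(default_C, new_C, test_accuracy)
```

For a classifier such as `SVC`, `score()` returns the mean accuracy; for a regressor it returns the coefficient of determination.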

<font color='Blue'><b>Example:</b></font> This example is based on the [Combine predictors using stacking](https://scikit-learn.org/stable/auto_examples/ensemble/plot_stack_predictors.html#sphx-glr-auto-examples-ensemble-plot-stack-predictors-py) example from scikit-learn.org.

**Dataset:** The Ames Housing dataset is a realistic and complex dataset for house price prediction. It contains 2,919 observations of housing sales in Ames, Iowa between 2006 and 2010, with 80 features describing various aspects of the houses. The target variable is the sale price of the house in dollars.

The dataset can be loaded from OpenML using the `fetch_openml` function. With `as_frame=True`, it returns a `Bunch` object whose `data` and `target` attributes are a pandas DataFrame and Series, alongside the feature names and some metadata. The dataset has some missing values, which the preprocessing steps defined later handle through imputation and encoding. The helper function below keeps 20 selected features (which also leaves out the "Id" column, as it is not relevant for the analysis), shuffles the rows, keeps the first `n` observations, and returns the natural logarithm of the sale price as the target.

In [None]:
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

def load_ames_housing(n, random_state = 0):
    # Fetch the Ames Housing dataset from openml
    df = fetch_openml(name="house_prices", as_frame=True, parser='auto')

    # Extract features (X) and sale price (y)
    X = df.data
    y = df.target

    # Selected features for analysis
    features = [
        "YrSold",
        "HeatingQC",
        "Street",
        "YearRemodAdd",
        "Heating",
        "MasVnrType",
        "BsmtUnfSF",
        "Foundation",
        "MasVnrArea",
        "MSSubClass",
        "ExterQual",
        "Condition2",
        "GarageCars",
        "GarageType",
        "OverallQual",
        "TotalBsmtSF",
        "BsmtFinSF1",
        "HouseStyle",
        "MiscFeature",
        "MoSold",
    ]

    # Select only the relevant features from the dataset
    X = X.loc[:, features]

    # Shuffle the dataset for randomness
    X, y = shuffle(X, y, random_state = random_state)

    # Limit the dataset to the first n instances for efficiency
    X = X.iloc[:n]
    y = y.iloc[:n]

    # Return the selected features (X) and the log-transformed sale price (y)
    return X, np.log(y)

# Load the Ames Housing dataset using the defined function
X, y = load_ames_housing(n = 1000)
print('\nX')
display(X)
print('\ny')
display(y.to_frame('ln(Sale Price)'))

In [None]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X,                              # Feature matrix
    y,                              # Target variable
    test_size=0.25,                 # Proportion of the dataset to include in the test split
    random_state=0                  # Seed for reproducibility
)

**Making a pipeline to preprocess the data**

In [None]:
from sklearn.compose import make_column_selector

# Define a column selector for categorical features using the make_column_selector function
cat_selector = make_column_selector(dtype_include=object)

# Define a column selector for numerical features using the make_column_selector function
num_selector = make_column_selector(dtype_include=np.number)

# Apply the categorical feature selector to the feature matrix X
cat_features = cat_selector(X)
print('Categorical features:', cat_features)

# Apply the numerical feature selector to the feature matrix X
num_features = num_selector(X)
print('Numerical features:', num_features)

In this code, we are using the `make_column_selector` function from sklearn to create two column selectors: one for categorical features and one for numerical features. A column selector is a callable that can select columns from a pandas DataFrame based on their names or data types. We can use column selectors with the `ColumnTransformer` class to apply different transformations to different subsets of features.

The `make_column_selector` function takes three optional arguments: `pattern`, `dtype_include`, and `dtype_exclude`. The `pattern` argument is a regular expression that matches the names of the columns to be selected. The `dtype_include` and `dtype_exclude` arguments each accept a data type or a list of data types, and specify which columns to include or exclude based on their types. For more details, see the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html).

Here we only use the `dtype_include` argument. We pass the `object` data type to `cat_selector` to select the columns that hold string values, and the `np.number` data type to `num_selector` to select the columns that hold numeric values.

After creating the column selectors, we apply them to the feature matrix `X`, which is a pandas DataFrame. We assign the output of the `cat_selector` to the variable `cat_features`, which is a list of the names of the categorical features. We assign the output of the `num_selector` to the variable `num_features`, which is a list of the names of the numerical features. We print the values of these variables to see the results.
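
The three arguments of `make_column_selector` can be tried on a small DataFrame. The sketch below uses a hypothetical four-column frame named `toy`; the string columns are created with an explicit `object` dtype so that the behaviour does not depend on the pandas version.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Tiny illustrative frame (values are made up)
toy = pd.DataFrame({
    "GarageCars": [1, 2, 2],
    "GarageType": pd.Series(["Attchd", "Detchd", "Attchd"], dtype=object),
    "LotArea": [8450.0, 9600.0, 11250.0],
    "Street": pd.Series(["Pave", "Pave", "Grvl"], dtype=object),
})

# Select by data type
numeric_cols = make_column_selector(dtype_include=np.number)(toy)

# Exclude by data type
non_numeric_cols = make_column_selector(dtype_exclude=np.number)(toy)

# Select by a regular expression on the column name
garage_cols = make_column_selector(pattern="^Garage")(toy)

# Combine both criteria: name must match and dtype must be included
garage_cat_cols = make_column_selector(pattern="^Garage", dtype_include=object)(toy)

print(numeric_cols, non_numeric_cols, garage_cols, garage_cat_cols)
```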

**Defining the pipeline required for the tree-based models**

In [None]:
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Define an OrdinalEncoder for categorical features in a decision tree context
cat_tree_processor = OrdinalEncoder(handle_unknown="use_encoded_value",
                                    unknown_value=-1,
                                    encoded_missing_value=-2)

# Define a SimpleImputer for numerical features in a decision tree context
num_tree_processor = SimpleImputer(strategy="mean", add_indicator=True)

# Create a column transformer for preprocessing both numerical and categorical features
tree_preprocessor = make_column_transformer(
    (num_tree_processor, num_selector),   # Apply num_tree_processor to numerical features
    (cat_tree_processor, cat_selector)    # Apply cat_tree_processor to categorical features
    )

In this code, we are creating a column transformer for tree-based models, which can handle missing values and categorical features differently from other models. We combine two new transformers with the two column selectors defined earlier:

- `cat_tree_processor` is an `OrdinalEncoder` that encodes categorical features as ordinal integers. It can handle unknown values and missing values by assigning them special codes: -1 for unknown values and -2 for missing values. This way, the decision tree can split on these values and learn from them.
- `num_tree_processor` is a `SimpleImputer` that imputes missing values in numerical features with the mean value. It also adds an indicator feature that marks the missing values with a 1 and the non-missing values with a 0. This allows the decision tree to split on the indicator feature and learn from the missingness pattern.
- `num_selector` is a column selector that selects numerical features based on their data type. It returns a list of the names of the numerical features in the data.
- `cat_selector` is a column selector that selects categorical features based on their data type. It returns a list of the names of the categorical features in the data.

We pass each transformer together with its selector as a `(transformer, selector)` tuple to the `make_column_transformer` function, which creates a column transformer object called `tree_preprocessor`. We can use this object to fit and transform the data, and pass the output to a tree-based model.

**Defining the preprocessor used when the ending regressor is a linear model**



In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Define a OneHotEncoder for categorical features in a linear regression context
cat_linear_processor = OneHotEncoder(handle_unknown="ignore")

# Create a pipeline for numerical features in a linear regression context
num_linear_processor = make_pipeline(
    StandardScaler(),                     # Standardize numerical features
    SimpleImputer(strategy="mean", add_indicator=True)  # Impute missing values with mean and add indicator
    )

# Create a column transformer for preprocessing both numerical and categorical features
linear_preprocessor = make_column_transformer(
    (num_linear_processor, num_selector),   # Apply num_linear_processor to numerical features
    (cat_linear_processor, cat_selector)    # Apply cat_linear_processor to categorical features
    )

In this code, we are creating a column transformer for linear regression models, which require different preprocessing for numerical and categorical features. We combine two new transformers with the two column selectors defined earlier:

- `cat_linear_processor` is a `OneHotEncoder` that encodes categorical features as one-hot vectors. With `handle_unknown="ignore"`, a category that was not seen during fitting does not raise an error; it is encoded as a row of zeros across the one-hot columns of that feature.
- `num_linear_processor` is a pipeline that applies two transformations to numerical features: a `StandardScaler` that standardizes the features by removing the mean and scaling to unit variance, and a `SimpleImputer` that imputes missing values with the mean value and adds an indicator feature that marks the missing values with a 1 and the non-missing values with a 0. The scaler can come first because `StandardScaler` disregards missing values when fitting and leaves them in place when transforming.
- `num_selector` is a column selector that selects numerical features based on their data type. It returns a list of the names of the numerical features in the data.
- `cat_selector` is a column selector that selects categorical features based on their data type. It returns a list of the names of the categorical features in the data.

We pass each transformer together with its selector as a `(transformer, selector)` tuple to the `make_column_transformer` function, which creates a column transformer object called `linear_preprocessor`. We can use this object to fit and transform the data, and pass the output to a linear regression model.
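
The behaviour of the two linear-model transformers can be checked on a few made-up values. In this sketch, `ohe`, `onehot`, `num_demo`, and `num_out` are hypothetical names; the one-hot output is converted with `.toarray()` because the encoder returns a sparse matrix by default.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# One-hot encoding: "Dirt" was not seen during fit
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(np.array([["Pave"], ["Grvl"]], dtype=object))
onehot = ohe.transform(np.array([["Pave"], ["Dirt"]], dtype=object)).toarray()
# Columns follow the sorted categories ("Grvl", "Pave");
# the unseen category becomes a row of zeros

# Scaling first, then imputing: the scaler keeps NaN in place,
# and the imputer fills it with the mean of the scaled column
num_demo = make_pipeline(
    StandardScaler(),
    SimpleImputer(strategy="mean", add_indicator=True),
)
num_out = num_demo.fit_transform(np.array([[1.0], [np.nan], [3.0]]))

print(onehot)
print(num_out)
```

After scaling, the column mean is zero, so a missing value is imputed with 0 and flagged by the indicator column.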

**Stack of predictors on a single data set**

In [None]:
from sklearn.linear_model import LassoCV

# Create a pipeline for Lasso regression, including preprocessing with linear_preprocessor
lasso_pipeline = make_pipeline(
    linear_preprocessor,   # Apply linear_preprocessor for feature transformation
    LassoCV(random_state = 0)   # Apply LassoCV for linear regression with L1 regularization
)

lasso_pipeline

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Create a pipeline for Random Forest regression, including preprocessing with tree_preprocessor
rf_pipeline = make_pipeline(
    tree_preprocessor,                  # Apply tree_preprocessor for feature transformation
    RandomForestRegressor(random_state=0)  # Apply RandomForestRegressor for ensemble regression
)

rf_pipeline

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

# Create a pipeline for Gradient Boosting regression, including preprocessing with tree_preprocessor
gbr_pipeline = make_pipeline(
    tree_preprocessor,                     # Apply tree_preprocessor for feature transformation
    GradientBoostingRegressor(random_state=0)  # Apply GradientBoostingRegressor for ensemble regression
)

gbr_pipeline

Stacking is an ensemble learning technique that combines the predictions of multiple base models, such as decision trees, linear models, etc., and uses a meta-model to produce the final output. The meta-model can be either a regressor or a classifier, depending on the task.

The main advantage of stacking is that it can leverage the strengths of each base model and learn how to best weigh their predictions with the meta-model. This can improve performance and generalization ability relative to any single base model, especially when the base models have different biases or assumptions.

However, stacking also has some drawbacks, such as increased complexity and computational cost, and the need for careful tuning of the hyperparameters of the base and meta-models.

In Python, stacking can be implemented using the following classes from the scikit-learn library:

- [StackingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingRegressor.html) for regression tasks
- [StackingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html) for classification tasks
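
As a minimal sketch of how `StackingRegressor` is used, the snippet below stacks two small base models on synthetic data. The choice of `Ridge`, `DecisionTreeRegressor`, and `RidgeCV` as meta-model is an assumption made to keep the example fast; it is not the configuration used in this lecture.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only for illustration
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

stack_demo = StackingRegressor(
    estimators=[
        ("ridge", Ridge()),
        ("tree", DecisionTreeRegressor(max_depth=3, random_state=0)),
    ],
    final_estimator=RidgeCV(),  # meta-model trained on base predictions
)
stack_demo.fit(X_tr, y_tr)

# transform() returns the base-model predictions (one column per model),
# which are the inputs seen by the meta-model
meta_features = stack_demo.transform(X_te)
stack_pred = stack_demo.predict(X_te)
print(meta_features.shape, stack_pred.shape)
```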


<center>
<img src="https://raw.githubusercontent.com/HatefDastour/ENSF444/71daeddd7eda2b9c19d1157ee4ec0c27afbc5d30/Images/Stacking_Model.png" alt="picture" width="700">
<br>
<b>Figure</b>: Stacking Regressor/Classifier.
</center>

In [None]:
from sklearn.ensemble import StackingRegressor
from sklearn.svm import SVR

# Define a list of base estimators for stacking
estimators = [
    ("Random Forest", rf_pipeline),
    ("Lasso", lasso_pipeline),
    ("Gradient Boosting", gbr_pipeline)
]

# Create a StackingRegressor with specified base estimators and final estimator as SVR
stacking_regressor = StackingRegressor(
    estimators=estimators,      # List of base estimators
    final_estimator=SVR()       # Final estimator for meta-regression (SVR in this case)
)

stacking_regressor

In [None]:
import time
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import PredictionErrorDisplay
from sklearn.model_selection import cross_val_predict, cross_validate
from scipy.stats import sem

plt.style.use('https://raw.githubusercontent.com/HatefDastour/ENSF444/main/Files/mystyle.mplstyle')

def visualize_prediction_errors(X, y,
                                suptitle = "Single predictors versus stacked predictors",
                                estimators = estimators, stacking_regressor = stacking_regressor):

    # Create subplots for visualizing prediction errors
    fig, axs = plt.subplots(2, 2, figsize=(11, 10))
    axs = axs.ravel()

    # Iterate through base estimators and the stacking regressor
    for ax, (name, est) in zip(axs, estimators + [("Stacking Regressor", stacking_regressor)]):
        scorers = {"$R^2$": "r2", "MSE": "neg_mean_squared_error"}

        # Measure the time taken for cross-validation
        start_time = time.time()
        scores = cross_validate(est, X, y, scoring=list(scorers.values()), n_jobs=-1, verbose=0)
        elapsed_time = time.time() - start_time

        # Obtain predictions using cross-validation
        y_pred = cross_val_predict(est, X, y, n_jobs=-1, verbose=0)

        # Calculate and format evaluation scores
        scores = {
            key: (
                f"{np.abs(np.mean(scores[f'test_{value}'])):.3f} ± "
                f"{(sem(scores[f'test_{value}'], ddof=0)):.3f}"
            )
            for key, value in scorers.items()
        }

        # Display prediction errors using PredictionErrorDisplay
        disp = PredictionErrorDisplay.from_predictions(  # avoid shadowing IPython's display()
            y_true=y,
            y_pred=y_pred,
            kind="actual_vs_predicted",
            ax=ax,
            scatter_kwargs={"alpha": 0.2, "color": "tab:blue"},
            line_kwargs={"color": "tab:red", 'linewidth':2},
        )
        ax.set_title(f"{name}\nEvaluation in {elapsed_time:.2f} seconds")

        # Add legend for evaluation scores (invisible lines carry the labels)
        for metric, score in scores.items():
            ax.plot([], [], " ", label=f"{metric}: {score}")
        ax.legend(loc="upper left")

    # Set overall plot title and adjust layout
    plt.suptitle(suptitle, weight = 'bold', fontsize = 14)
    plt.tight_layout()
    plt.show()

In [None]:
visualize_prediction_errors(X_train, y_train, suptitle = "Single Predictors vs. Stacked Predictors (Train Set)")

In [None]:
visualize_prediction_errors(X_test, y_test, suptitle = "Single Predictors vs. Stacked Predictors (Test Set)")

## How to use a pipeline for cross-validation and grid search

A pipeline is a useful tool that allows us to evaluate and optimize the whole process of data preprocessing and model fitting as a single unit. This means that we can use a pipeline as any other estimator for cross-validation and grid search, using the same functions and classes from sklearn.

- **Cross-validation** splits the data into multiple folds; each fold is used once as the validation set while the remaining folds form the training set, and the average performance is reported. To use a pipeline for cross-validation, we can pass it to the `cross_val_score` function together with the data, the target, and the number of folds (`cv`). The function returns an array with one score per fold, and the mean of these scores gives an estimate of the performance of the pipeline on the data.

- **Grid search** finds the best hyperparameters for a model by evaluating every combination in a grid with cross-validation. To use a pipeline for grid search, we can use the `GridSearchCV` class, which takes a pipeline object, a parameter grid, and the number of folds as arguments, and creates a grid search object that can be fitted on the data. The parameter grid is a dictionary that maps the names of the steps and the parameters to the values that we want to try. We need to use the '__' syntax to specify the step and the parameter name, for example 'svc__C' for the C parameter of the SVC step. After fitting, the grid search object has attributes such as `best_params_` and `best_score_` that give the best combination of parameters and the corresponding cross-validation score.

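For example, suppose we want to perform a 5-fold cross-validation and a grid search over the C and gamma parameters of the SVC in the `pipe` pipeline created earlier (`StandardScaler` followed by `SVC`). Since `SVC` is a classifier and `StandardScaler` expects numeric input, `X_train` and `y_train` below stand for a classification dataset with numeric features, not the housing data from the example above. We can write: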

```python
from sklearn.model_selection import cross_val_score, GridSearchCV

# 5-fold cross-validation on the training data
scores = cross_val_score(pipe, X_train, y_train, cv=5)
print("Mean cross-validation score:", scores.mean())

# grid search
param_grid = {'svc__C': [0.1, 1, 10], 'svc__gamma': [0.01, 0.1, 1]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print("Best score:", grid.best_score_)
```
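
For a version that runs end to end, the following sketch applies the same steps to a synthetic classification dataset. The data and the names `demo_pipe`, `cv_scores`, and `demo_grid` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

demo_pipe = make_pipeline(StandardScaler(), SVC())

# 5-fold cross-validation: the scaler is refitted inside every fold,
# so the validation fold never influences the scaling statistics
cv_scores = cross_val_score(demo_pipe, X_tr, y_tr, cv=5)
print("Mean cross-validation score:", cv_scores.mean())

# Grid search over two SVC parameters, addressed through the step name
demo_grid = GridSearchCV(
    demo_pipe,
    {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]},
    cv=5,
)
demo_grid.fit(X_tr, y_tr)
print("Best parameters:", demo_grid.best_params_)

# By default the best pipeline is refitted on all training data,
# so the grid object can score the held-out test set directly
grid_test_score = demo_grid.score(X_te, y_te)
print("Test score:", grid_test_score)
```

Keeping the test set out of both the cross-validation and the grid search leaves it available for a final, unbiased check of the selected model.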

## Summary
In this lecture, we learned how to use **pipelines** in sklearn to automate and optimize our machine learning workflow. We covered the following key points:

- A pipeline is a sequence of data transformers with an optional final predictor that can be applied to both the training and the test data
- A pipeline simplifies the code by avoiding intermediate variables and fit/transform calls, and enables parameter setting by using the '__' syntax
- A pipeline can be created using the `make_pipeline` function, which automatically names the steps based on their types, or using the `Pipeline` class, which allows us to name the steps explicitly
- A pipeline can be accessed and modified using the `named_steps`, `steps`, `get_params()`, and `set_params()` attributes and methods of the pipeline object
- A pipeline can be used as any other estimator for cross-validation and grid search, using the `cross_val_score` function and the `GridSearchCV` class from sklearn
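
The `Pipeline` class mentioned in the summary can be sketched as follows. The step names `"scaler"` and `"clf"` are arbitrary choices made for this illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Each step is a (name, estimator) tuple, so the names are chosen explicitly
named_pipe = Pipeline(steps=[("scaler", StandardScaler()), ("clf", SVC())])

# The chosen names are then used in the '__' syntax
named_pipe.set_params(clf__C=10)

explicit_names = list(named_pipe.named_steps)
clf_C = named_pipe.get_params()["clf__C"]
print(explicit_names, clf_C)
```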