# **Machine Learning Pipelines**

In this lesson we’re going to learn how to turn a machine learning (ML) workflow into a pipeline using `scikit-learn`. An ML pipeline is a modular sequence of objects that codifies and automates an ML workflow to make it efficient, reproducible and generalizable. There is no single way to build pipelines, but some tools are very widely used. The most accessible of these is `scikit-learn`'s `Pipeline` object, which allows us to chain together the different steps that go into an ML workflow.

Turning a workflow into a pipeline has many other advantages too. Pipelines provide consistency: the same steps will always be applied in the same order under the same conditions. They are also concise and can streamline your code. The `Pipeline` object within `scikit-learn` exposes the same methods as the estimators and transformers we have already covered in our ML curriculum. It is usually the starting point for a Machine Learning Engineer before turning to more sophisticated tools for scaling pipelines (such as PySpark), and we will delve deeper into it in this lesson.

What can go into a pipeline? Every intermediate step must have both the `.fit` and `.transform` methods. This includes preprocessing, imputation, feature selection and dimensionality reduction. The final step only needs the `.fit` method. Examples of tasks we’ve seen already that could benefit from a pipeline include:
- scaling data then applying principal component analysis
- filling in missing values then fitting a regression model
- one-hot-encoding categorical variables and scaling numerical variables.
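
As a rough illustration of the first task in the list, the sketch below chains a scaler and PCA. The synthetic data and the names `X_demo` and `scale_pca` are assumptions made only for this example; they are not part of the lesson's dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data (an assumption for illustration): 100 rows, 5 numeric features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))

# the intermediate step (scaler) has .fit and .transform;
# the final step (PCA) happens to be a transformer as well
scale_pca = Pipeline([("scale", StandardScaler()), ("pca", PCA(n_components=2))])

# fit_transform fits and applies each step in order
X_demo_reduced = scale_pca.fit_transform(X_demo)
print(X_demo_reduced.shape)  # (100, 2)
```

Because both steps are transformers, the whole pipeline can itself be used as a transformer.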

## Data Cleaning - Numeric

To introduce pipelines, let’s look at a common set of data cleaning/EDA tasks — dealing with missing values and scaling numeric variables. We’re going to convert an existing code base that performs these tasks to more concise code that uses `scikit-learn`'s `Pipeline`.

In [28]:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder

In [2]:
columns = ["sex","length","diam","height","whole","shucked","viscera","shell","age"]
df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data", names=columns)

y = df.age
X = df.drop(columns=['age'])
num_cols = X.select_dtypes(include=np.number).columns
cat_cols = X.select_dtypes(include=['object']).columns
#create some missing values
for i in range(1000):
    X.loc[np.random.choice(X.index),np.random.choice(X.columns)] = np.nan
    
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.25) 

In [3]:
## not using a pipeline

X_train_num = X_train[num_cols]
# fill missing values with mean on numeric features only
X_train_fill_missing = X_train_num.fillna(X_train_num.mean())
# fit standard scaler on X_train_fill_missing
scale = StandardScaler().fit(X_train_fill_missing)
# scale data after filling in missing values
X_train_fill_missing_scale = scale.transform(X_train_fill_missing)

# repeat on the test set 
X_test_fill_missing = X_test[num_cols].fillna(X_train_num.mean())
X_test_fill_missing_scale = scale.transform(X_test_fill_missing)

In [4]:
## using a pipeline

# instantiate a pipeline object
pipeline = Pipeline([("imputer", SimpleImputer(strategy="mean")), ("scale", StandardScaler())])

pipeline.fit(X_train[num_cols])
X_transform = pipeline.transform(X_test[num_cols])

# confirm that X_transform and X_test_fill_missing_scale are the same
array_diff = np.sum(np.abs(X_test_fill_missing_scale - X_transform))
array_diff

0.0

In [5]:
# instantiate a pipeline object
pipeline_median = Pipeline([("imputer", SimpleImputer(strategy="median")), ("scale", StandardScaler())])

pipeline_median.fit(X_train[num_cols])
X_transform_median = pipeline_median.transform(X_test[num_cols])

# confirm that X_transform_median and X_test_fill_missing_scale are different
array_diff = np.sum(np.abs(X_test_fill_missing_scale - X_transform_median))
array_diff

44.7499993017456

In [6]:
# confirm that the results of the pipelines are indeed different
array_diff = np.sum(np.abs(X_transform - X_transform_median))
array_diff

44.7499993017456
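
One way to see where such a difference comes from is to inspect the values the fitted imputers learned. The sketch below uses a tiny made-up column (`X_demo` is illustrative, not the abalone data) and reads the `statistics_` attribute of each fitted `SimpleImputer`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# toy column with a skewed distribution and one missing value (illustrative data)
X_demo = np.array([[1.0], [2.0], [3.0], [10.0], [np.nan]])

mean_pipe = Pipeline([("imputer", SimpleImputer(strategy="mean")), ("scale", StandardScaler())])
median_pipe = Pipeline([("imputer", SimpleImputer(strategy="median")), ("scale", StandardScaler())])
mean_pipe.fit(X_demo)
median_pipe.fit(X_demo)

# the fill value learned by each imputer during fit
print(mean_pipe.named_steps["imputer"].statistics_)    # [4.]
print(median_pipe.named_steps["imputer"].statistics_)  # [2.5]
```

The two strategies fill the same missing cell with different values, so everything downstream of the imputer differs as well.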

## Data Cleaning - Categorical

We’re now going to implement a task similar to the previous exercise with `pipeline.Pipeline()`, but with categorical variables now. Specifically we’ll be dealing with missing values in categorical data and one-hot-encoding categorical variables. We will convert an existing codebase to a pipeline like in the previous exercise.

In [7]:
## not using a pipeline

X_train_cat = X_train[cat_cols]
# fill missing values with mode on categorical features only
X_train_fill_missing = X_train_cat.fillna(X_train_cat.mode().values[0][0])
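# one-hot-encode categorical features in X_train_fill_missing
# (note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; use `sparse_output=False` on newer versions)
ohe = OneHotEncoder(sparse=False, drop='first').fit(X_train_fill_missing)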
X_train_fill_missing_ohe = ohe.transform(X_train_fill_missing)

# repeat on the test set
X_test_fill_missing = X_test[cat_cols].fillna(X_train_cat.mode().values[0][0])
X_test_fill_missing_ohe = ohe.transform(X_test_fill_missing)

In [8]:
## using a pipeline

pipeline = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")), ("ohe", OneHotEncoder(drop='first', sparse=False))])

pipeline.fit(X_train[cat_cols])
X_transform = pipeline.transform(X_test[cat_cols])

# confirm that X_transform and X_test_fill_missing_ohe are the same
check_arrays = np.array_equal(X_transform, X_test_fill_missing_ohe)
check_arrays

True

## Column Transformer

Often, you may not want to apply every transformation to all columns. If our columns are of different types, we may only want to apply certain parts of the pipeline to a subset of columns. This is what we saw in the two previous exercises: one set of transformations is applied to the numeric columns and another set to the categorical ones. We can use `scikit-learn`'s `ColumnTransformer` as one way of combining these processes together.

`ColumnTransformer` takes in a list of tuples of the form `(name, transformer, columns)`:

```
example_column_transformer = ColumnTransformer(
    transformers=[ ("name_1", pipeline_1, columns_1),
                   ("name_2", pipeline_2, columns_2)])
```

The transformer can be anything with a `.fit` and `.transform` method like we used previously (like `SimpleImputer` or `StandardScaler`), but can also itself be a pipeline, as we will use in the exercise.
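
For the simpler case where each entry is a single transformer rather than a pipeline, a minimal sketch could look like the following. The toy DataFrame `toy` and the step names `num` and `cat` are hypothetical and chosen only for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy frame (hypothetical): one numeric and one categorical column
toy = pd.DataFrame({"length": [0.2, 0.4, 0.6, 0.8], "sex": ["M", "F", "I", "M"]})

# each tuple is (name, transformer, columns)
toy_transformer = ColumnTransformer(
    transformers=[("num", StandardScaler(), ["length"]),
                  ("cat", OneHotEncoder(), ["sex"])])

toy_out = toy_transformer.fit_transform(toy)
# one scaled numeric column plus one column per category (F, I, M)
print(toy_out.shape)  # (4, 4)
```

The outputs of the individual transformers are concatenated column-wise, in the order the tuples are listed.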

In [9]:
# create separate pipelines for the categorical and numeric features
num_vals = Pipeline([("imputer", SimpleImputer(strategy='mean')), ("scale", StandardScaler())])
cat_vals = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")), ("ohe", OneHotEncoder(sparse=False, drop='first'))])

# create a column transformer
preprocess = ColumnTransformer(transformers=[("num_preprocess", num_vals, num_cols),
                                             ("cat_preprocess", cat_vals, cat_cols)]
                              )


# fit on the training data, then transform the test data
preprocess.fit(X_train)
x_transform = preprocess.transform(X_test)

## Adding a Model

Now that we have all the preprocessing done and coded succinctly using `ColumnTransformer` and `Pipeline`, we can add a model. We will take the result at the end of the previous exercise, and now create a final pipeline with the `ColumnTransformer` as the first step, and a `LinearRegression()` model as the second step.

Because the final step is now a model, it no longer has a `.transform` method. The last step is the only one in a pipeline that can be a non-transformer. In exchange, the final step has a `.predict` method, which can be called on the entire pipeline! Additionally, the `.score()` method, which returns an estimator's default metric (R2 for regressors), can also be used to evaluate the performance of the pipeline.

In [12]:
# create a pipeline including the column transformer and a linear regression
pipeline = Pipeline([("preprocess", preprocess), ("regr", LinearRegression())])

# fit the pipeline on the training data and predict on the test data
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# use the score method
pipeline_score = pipeline.score(X_test, y_test)
pipeline_score

0.49892357249092656

In [14]:
# compare this against R2 (the default performance metric for linear regression)
# store the result under a new name so the imported r2_score function is not overwritten
r2 = r2_score(y_test, y_pred)
print(r2)

0.49892357249092656


## Hyperparameter Tuning

What now? Well, we can tune some of the parameters of the model by applying a grid search over a range of hyperparameter values.

A linear regression model has very few hyperparameters, so here we’ll tune the one that controls whether an intercept is included. As we’ve seen, the pipeline created in the previous exercise is an estimator and we can call the `.fit()` and `.predict()` methods on it. So in fact, the whole pipeline can be passed as an estimator to `GridSearchCV`. The search then refits the pipeline for each combination of parameter values in the grid and each fold in the cross-validation split.

That’s a lot – but the code is again very short. One last thing to keep in mind while referencing hyperparameters in a pipeline is the following: any hyperparameter can be called using `pipeline_step_name + '__' + hyperparameter`. For example, `regr__fit_intercept` corresponds to a pipeline step named “regr” and the hyperparameter “fit_intercept”.
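
To check which names are available under this convention, one option is to list the keys returned by `get_params()`. The sketch below builds a small stand-in pipeline (`demo_pipeline` is an assumed name, not the lesson's pipeline) and also shows that `set_params` accepts the same double-underscore names.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# small stand-in pipeline (an assumption for illustration)
demo_pipeline = Pipeline([("imputer", SimpleImputer()), ("regr", LinearRegression())])

# every tunable name follows the step_name__parameter convention
param_names = sorted(demo_pipeline.get_params().keys())
print([name for name in param_names if name.startswith("regr__")])

# set_params accepts the same names that a parameter grid would use
demo_pipeline.set_params(imputer__strategy="median", regr__fit_intercept=False)
print(demo_pipeline.named_steps["imputer"].strategy)  # median
```

The exact list of `regr__` names depends on the installed `scikit-learn` version, so it is not reproduced here. For nested objects the names chain, with one double underscore per level of nesting.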

In [17]:
# simple parameter grid with and without the intercept
param_grid = {
    "regr__fit_intercept": [True,False]
}

# define a GridSearchCV object
gs = GridSearchCV(pipeline, param_grid, scoring='neg_mean_squared_error', cv=5)

# fit to the training data
gs.fit(X_train, y_train)
best_score = gs.best_score_
best_score

-5.423115734444834

In [18]:
best_params = gs.best_params_
best_params

{'regr__fit_intercept': True}
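
Beyond `best_score_` and `best_params_`, a fitted `GridSearchCV` object also keeps per-candidate results in its `cv_results_` attribute. The sketch below runs the same kind of search on synthetic data; `X_demo`, `y_demo`, `demo_pipeline` and `demo_gs` are assumptions for illustration, and the data is generated with a large intercept so the two candidates clearly differ.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic regression data with a true intercept of 5 (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 5.0 + X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

demo_pipeline = Pipeline([("scale", StandardScaler()), ("regr", LinearRegression())])
demo_gs = GridSearchCV(demo_pipeline, {"regr__fit_intercept": [True, False]},
                       scoring="neg_mean_squared_error", cv=5)
demo_gs.fit(X_demo, y_demo)

# one row per parameter combination, with its mean cross-validated score and rank
results = pd.DataFrame(demo_gs.cv_results_)[
    ["param_regr__fit_intercept", "mean_test_score", "rank_test_score"]]
print(results)
print(demo_gs.best_params_)  # {'regr__fit_intercept': True}
```

Looking at the full table shows how far apart the candidates are, which `best_params_` alone does not reveal.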

## Final Pipeline

Now that we are getting the hang of pipelines, we’re going to take things up a notch. We will now be searching over different types of models, each having its own set of hyperparameters! In the original pipeline, we defined `regr` to be an instance of `LinearRegression()`. Then in defining the parameter grid to search over, we used the dictionary `{"regr__fit_intercept": [True,False]}` to define the values of the `fit_intercept` term. We can equivalently do this by passing both the estimator AND parameters in a single dictionary as
```
{'regr': [LinearRegression()], "regr__fit_intercept": [True,False]}
```
We can add more models to it as follows. Suppose we wanted to add a Ridge regression model and also perform hyperparameter tuning using `GridSearchCV` to find the best regularization parameter `alpha`. We would add it to the previous dictionary to create a list of dictionaries as follows:
```
search_space = [{'regr': [LinearRegression()], 'regr__fit_intercept': [True,False]},
                {'regr': [Ridge()], 'regr__alpha': [0,0.1,1,10,100]}]
```

The goal of this process is to find the best estimator for our dataset and problem as efficiently as possible. The pipeline module allows us to do exactly that! In a couple of lines of code, we’re able to preprocess the data and search an entire model and hyperparameter space. The final step is to access the pipeline elements to draw out the information about which estimator and hyperparameter set gets us the best score. We do this through the `.named_steps` attribute, using the step names we chose when building the pipeline. For instance, the regression model can be accessed using the string 'regr' as follows:

- Get the best estimator using `GridSearchCV`'s `.best_estimator_` attribute
- Use `.named_steps['regr'].get_params()` on the best estimator to get its hyperparameters!

In [21]:
# define the linear regression search space
search_space = [{'regr': [LinearRegression()], 'regr__fit_intercept': [True,False]},
                {'regr': [Ridge()], 'regr__alpha': [0,0.1,1,10,100]},
                {'regr': [Lasso()], 'regr__alpha': [0,0.1,1,10,100]}]

# initialise a grid search on 'search space'
gs = GridSearchCV(pipeline, search_space, scoring='neg_mean_squared_error', cv=5)

# fit to the training data
gs.fit(X_train, y_train)
# find the best pipeline
best_pipeline = gs.best_estimator_
best_pipeline

In [24]:
# locate the best regression model
best_regression_model = best_pipeline.named_steps['regr']
best_regression_model

In [25]:
# determine the best hyperparameters of the best model
best_model_hyperparameters = best_regression_model.get_params()
best_model_hyperparameters

{'alpha': 1,
 'copy_X': True,
 'fit_intercept': True,
 'max_iter': None,
 'normalize': 'deprecated',
 'positive': False,
 'random_state': None,
 'solver': 'auto',
 'tol': 0.001}

In [26]:
# access the hyperparameters of the categorical preprocessing step
cat_preprocess_hyperparameters = best_pipeline.named_steps['preprocess'].named_transformers_['cat_preprocess'].named_steps['imputer'].get_params()
cat_preprocess_hyperparameters

{'add_indicator': False,
 'copy': True,
 'fill_value': None,
 'missing_values': nan,
 'strategy': 'most_frequent',
 'verbose': 'deprecated'}

## Writing Custom Classes

While `scikit-learn` contains many existing transformers and classes that can be used in pipelines, you may need at some point to create your own. This is simpler than you may think, as a step in the pipeline needs to have only a few methods implemented. If it is an intermediate step, it will need `fit` and `transform` methods. We will implement both in the exercise below!

Here are some of the major takeaways on building pipelines in `scikit-learn`:

- Pipelines help make concise, reproducible code by combining steps of transformers and/or a final estimator.
- Intermediate steps of a pipeline must have both the `.fit()` and `.transform()` methods. This includes preprocessing, imputation, feature selection, dimension reduction.
- The final step of a pipeline must have the `.fit()` method – this can include a transformer or an estimator/model.
- If the pipeline is meant to only transform your data by combining preprocessing and data cleaning steps, then each step in the pipeline will be a transformer. If your pipeline will also include a model (a final estimation or prediction step), then the last step must be an estimator.
- Once the steps of a pipeline are defined, it can be used like any other transformer/estimator by calling the `fit`, `transform`, and/or `predict` methods. Similarly, it can be used in place of an estimator in a hyperparameter grid search.

In [29]:
# the class `MyImputer` replicates the `SimpleImputer` using the mean strategy
class MyImputer(BaseEstimator, TransformerMixin): 
    def __init__(self):
        pass  # this imputer has no hyperparameters to store
    
    def fit(self, X, y = None):
        self.means = np.mean(X, axis=0)    # calculate the mean of each column
        return self
    
    def transform(self, X, y = None):
        #transform method fills in missing values with means using pandas
        return X.fillna(self.means)
    
# create a new pipeline using the custom class as the first step and StandardScaler as the second
new_pipeline = Pipeline([("myImputer", MyImputer()), ("scale", StandardScaler())])

# fit the pipeline on the training data and apply to the test data
new_pipeline.fit(X_train[num_cols])
X_transform = new_pipeline.transform(X_test[num_cols])

# verify the results are the same as X_test_fill_missing_scale
check_arrays = np.array_equal(X_transform, X_test_fill_missing_scale)
print(check_arrays)

True
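
The `MyImputer` class above relies on `DataFrame.fillna`, so it expects pandas input. As a hedged variant, the sketch below shows one way a custom transformer could accept NumPy arrays as well. The class name `ArrayMeanImputer` and the toy array `X_demo` are hypothetical, and this is only one possible design.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ArrayMeanImputer(BaseEstimator, TransformerMixin):
    """Hypothetical mean imputer that accepts arrays or DataFrames."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # column means, ignoring missing values
        self.means_ = np.nanmean(X, axis=0)
        return self

    def transform(self, X):
        X = np.array(X, dtype=float)  # copy so the caller's data is left untouched
        rows, cols = np.where(np.isnan(X))
        X[rows, cols] = self.means_[cols]  # fill each gap with its column mean
        return X


# toy array with one missing value per column (illustrative data)
X_demo = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
filled = ArrayMeanImputer().fit_transform(X_demo)
print(filled)
# [[ 1. 10.]
#  [ 2. 20.]
#  [ 3. 15.]]

# the custom class slots into a pipeline like any built-in transformer
demo_pipeline = Pipeline([("impute", ArrayMeanImputer()), ("scale", StandardScaler())])
scaled = demo_pipeline.fit_transform(X_demo)
```

Inheriting from `TransformerMixin` is what supplies `fit_transform` here, so only `fit` and `transform` need to be written by hand.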
