References:

https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html
 https://machinelearningmastery.com/columntransformer-for-numerical-and-categorical-data/ 

Introduction to Machine Learning with Python, Chapter 6

In [None]:
# Run this cell first, then restart the runtime: scikit-learn must be upgraded to display pipeline diagrams

!pip install --upgrade scikit-learn

Requirement already up-to-date: scikit-learn in /usr/local/lib/python3.6/dist-packages (0.23.2)


## Pipeline and ColumnTransformer Example

In [None]:
import numpy as np

from sklearn.svm import SVC
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler

np.random.seed(0)

# Load data from https://www.openml.org/d/40945
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

# Alternatively, load the full dataset object and split its frame attribute manually:
# titanic = fetch_openml("titanic", version=1, as_frame=True)
# X = titanic.frame.drop('survived', axis=1)
# y = titanic.frame['survived']

In [None]:
X.head()

Unnamed: 0,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1.0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [None]:
X.isnull().sum()

pclass          0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64

In [None]:
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))

model score: 0.790


In [None]:
from sklearn import set_config
set_config(display='diagram')
clf

In [None]:
categorical_transformer

In [None]:
preprocessor
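
To make the effect of the ColumnTransformer concrete, here is a minimal sketch that applies the same kind of preprocessor to a tiny hand-made frame. The four rows, the variable names `toy` and `toy_preprocessor`, and the reduced column list are assumptions invented for illustration; they are not part of the Titanic data loaded above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical four-row frame with one missing value per affected column.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "fare": [7.25, 71.3, np.nan, 8.05],
    "sex": pd.Series(["male", "female", "female", "male"], dtype=object),
    "embarked": pd.Series(["S", "C", np.nan, "S"], dtype=object),
})

toy_preprocessor = ColumnTransformer(transformers=[
    # Numeric columns: fill gaps with the median, then standardize.
    ("num", Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("scaler", StandardScaler())]), ["age", "fare"]),
    # Categorical columns: fill gaps with the label "missing", then one-hot encode.
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["sex", "embarked"]),
])

toy_out = toy_preprocessor.fit_transform(toy)
# The result may be a sparse matrix; convert to a dense array for inspection.
if hasattr(toy_out, "toarray"):
    toy_out = toy_out.toarray()

# 2 scaled numeric columns + 2 columns for sex + 3 for embarked (C, S, missing)
print(toy_out.shape)
```

Each transformer only sees the columns assigned to it, and the outputs are concatenated side by side, so the missing `embarked` value becomes its own one-hot column instead of being dropped.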

## Convenient Pipeline Creation with make_pipeline

Creating a pipeline using the syntax described earlier is sometimes a bit cumbersome, and we often don’t need user-specified names for each step. There is a convenience function, make_pipeline, that will create a pipeline for us and automatically name each step based on its class. The syntax for make_pipeline is as follows:

In [None]:
from sklearn.pipeline import make_pipeline
# standard syntax
pipe_long = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC(C=100))])
# abbreviated syntax
pipe_short = make_pipeline(MinMaxScaler(), SVC(C=100))

In [None]:
pipe_long

In [None]:
pipe_short

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), StandardScaler())
print("Pipeline steps:\n{}".format(pipe.steps))

Pipeline steps:
[('standardscaler-1', StandardScaler()), ('pca', PCA(n_components=2)), ('standardscaler-2', StandardScaler())]


In [None]:
pipe

As you can see, the first StandardScaler step was named standardscaler-1 and the second standardscaler-2. **However, in such settings it might be better to use the Pipeline construction with explicit names, to give more semantic names to each step**.
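
As a sketch of that suggestion, the same three steps can be written with the Pipeline constructor and explicit names. The names `scale_inputs` and `scale_components` are hypothetical choices made for this example only.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same steps as the make_pipeline version, but each name states the step's role.
named_pipe = Pipeline([
    ("scale_inputs", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("scale_components", StandardScaler()),
])

step_names = [name for name, _ in named_pipe.steps]
print("Pipeline steps:", step_names)

# Individual steps can be looked up by name through named_steps.
print(named_pipe.named_steps["scale_components"])
```

With explicit names, a parameter key such as `scale_inputs__with_mean` says which scaler it refers to without having to remember the numbering.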

## Using Pipelines in Grid Searches

Using a pipeline in a grid search works the same way as using any other estimator: we define a parameter grid and construct a GridSearchCV from the pipeline and the grid. The one change is that each parameter must be prefixed with the name of the pipeline step it belongs to. Both parameters we want to tune, C and gamma, belong to SVC, the second step, which we named "svm" in pipe_long. The key for each parameter is the step name, followed by __ (a double underscore), followed by the parameter name. To search over the C parameter of SVC we therefore use "svm__C" as the key in the parameter grid dictionary, and similarly "svm__gamma" for gamma:

In [None]:
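param_grid = {'svm__C': [0.001, 0.01, 0.1, 1, 10, 100],
              'svm__gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

# Use pipe_long: its SVC step is named "svm", matching the keys above.
# make_pipeline named that step "svc" in pipe_short, so "svm__C" would raise an error there.
grid = GridSearchCV(pipe_long, param_grid=param_grid, cv=5)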

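Note: this fit requires purely numeric X_train and y_train. The Titanic split created above contains string columns that MinMaxScaler cannot process, and the output shown below appears to come from the referenced book chapter rather than from that split.

**[IN]:**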

grid.fit(X_train, y_train)

print("Best cross-validation accuracy: {:.2f}".format(grid.best_score_))

print("Test set score: {:.2f}".format(grid.score(X_test, y_test)))

print("Best parameters: {}".format(grid.best_params_))

**[OUT]:**

*Best cross-validation accuracy: 0.98*

*Test set score: 0.97*

*Best parameters: {'svm__C': 1, 'svm__gamma': 1}*

In [None]:
# Pipeline inside GridSearchCV can also be visualised using diagrams
grid
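
The following is a self-contained sketch of the same idea that can be run end to end. It assumes the breast cancer dataset from `sklearn.datasets` as a stand-in numeric dataset and uses a smaller grid to keep the search quick; the variable names (`svm_pipe`, `small_grid`, `search`) are chosen for this example only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Assumed stand-in dataset: all features are numeric, so MinMaxScaler applies directly.
cancer = load_breast_cancer()
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# Explicit step names determine the prefixes used in the parameter grid.
svm_pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
small_grid = {"svm__C": [0.1, 1, 10], "svm__gamma": [0.01, 0.1, 1]}

search = GridSearchCV(svm_pipe, param_grid=small_grid, cv=5)
search.fit(Xc_train, yc_train)

print("Best parameters:", search.best_params_)
print("Test set score: {:.2f}".format(search.score(Xc_test, yc_test)))

# With make_pipeline the SVC step is named after its class, so the prefix is "svc".
auto_pipe = make_pipeline(MinMaxScaler(), SVC())
print("svc__C" in auto_pipe.get_params())
```

Calling `get_params()` on a pipeline is a quick way to list every valid key before writing the grid.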

## Illustrating Information Leakage in cross-validation

The impact of leaking information in the cross-validation varies depending on the nature of the preprocessing step.

Estimating the scale of the data using the test fold usually doesn’t have a terrible impact, while using the test fold in feature extraction and feature selection can lead to substantial differences in outcomes.
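
The first half of that claim can be checked with a small sketch. It assumes the breast cancer dataset and an SVC with default parameters, neither of which is prescribed by the text, and compares scaling before cross-validation with scaling inside a pipeline.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Assumed example data; any numeric classification dataset would do.
Xs, ys = load_breast_cancer(return_X_y=True)

# Leaky variant: the scaler sees every row, including rows later used as test folds.
Xs_scaled_all = MinMaxScaler().fit_transform(Xs)
leaky_scores = cross_val_score(SVC(), Xs_scaled_all, ys, cv=5)

# Proper variant: the scaler is refit on the training folds of each split.
clean_scores = cross_val_score(make_pipeline(MinMaxScaler(), SVC()), Xs, ys, cv=5)

print("Scaler fit on all data: {:.3f}".format(leaky_scores.mean()))
print("Scaler inside pipeline: {:.3f}".format(clean_scores.mean()))
```

On data like this the two means should be close to each other, since the minimum and maximum of a feature change little when one fold is held out.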


Let’s consider a synthetic regression task with 100 samples and 10,000 features that are sampled independently from a Gaussian distribution. We also sample the response from a Gaussian distribution.

Given the way we create the dataset, there is no relation between the data, X, and the target, y (they are independent), so it should not be possible to learn anything from this dataset.

In [None]:
rnd = np.random.RandomState(seed=0)
X = rnd.normal(size=(100, 10000))
y = rnd.normal(size=(100,))

# We will now do the following. First, select the most informative 5% of the
# 10,000 features using SelectPercentile feature selection, and then evaluate a
# Ridge regressor using cross-validation:

from sklearn.feature_selection import SelectPercentile, f_regression
select = SelectPercentile(score_func=f_regression, percentile=5).fit(X, y)
X_selected = select.transform(X)
print("X_selected.shape: {}".format(X_selected.shape))

print("----------------------------------------")
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge
print("Cross-validation accuracy (cv only on ridge): {:.2f}".format(
np.mean(cross_val_score(Ridge(), X_selected, y, cv=5))))
print(" The mean R2 computed by cross-validation is 0.91, indicating a very good model. \n\
 This clearly cannot be right, as our data is entirely random. ")
# What happened here is that our feature selection picked out some features among 
# the 10,000 random features that are (by chance) very well correlated with the target.

# Because we fit the feature selection outside of the cross-validation, it could
# find features that are correlated both on the training and the test folds.

# The information we leaked from the test folds was very informative,
# leading to highly unrealistic results.
# Let’s compare this to a proper cross-validation using a pipeline:
print("----------------------------------------")
pipe = Pipeline([("select", SelectPercentile(score_func=f_regression, percentile=5)),
                ("ridge", Ridge())])
print("Cross-validation accuracy (pipeline): {:.2f}".format(
np.mean(cross_val_score(pipe, X, y, cv=5))))
print(" This time, we get a negative R2 score, indicating a very poor model.")
print("----------------------------------------")
# Using the pipeline, the feature selection is now inside the cross-validation loop. 
# This means features can only be selected using the training folds of the data,
# not the test fold.
# The feature selection finds features that are correlated with the target on the training set,
# but because the data is entirely random, these features are not correlated with the target
# on the test set.

X_selected.shape: (100, 500)
----------------------------------------
Cross-validation accuracy (cv only on ridge): 0.91
 The mean R2 computed by cross-validation is 0.91, indicating a very good model. 
 This clearly cannot be right, as our data is entirely random. 
----------------------------------------
Cross-validation accuracy (pipeline): -0.25
 This time, we get a negative R2 score, indicating a very poor model.
----------------------------------------
