# Pipelines and composite estimators

From this guide: https://scikit-learn.org/stable/modules/compose.html

## Overview

Transformers are usually combined with classifiers, regressors or other estimators to build a **composite estimator**. 

The most common tool is a [`Pipeline`](https://scikit-learn.org/stable/modules/compose.html#pipeline). `Pipeline` is often used in combination with `FeatureUnion`, which *concatenates the output of transformers into a composite feature space*. 

`TransformedTargetRegressor` deals with transforming the target (e.g. log-transform `y`). In contrast, `Pipeline`s only transform the observed data (`X`).

## [Pipeline: chaining estimators](https://scikit-learn.org/stable/modules/compose.html#pipeline-chaining-estimators)

`Pipeline` can be used to *chain multiple estimators into one*. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. 

Pipeline serves multiple purposes here:

**Convenience and encapsulation**
* You only have to call `fit` and `predict` *once* on your data to fit a whole sequence of estimators.

**Joint parameter selection**
* You can grid *search over parameters* of **all estimators** in the pipeline **at once**.

**Safety**
* `Pipeline`s help avoid *leaking* statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.

⚠️ All estimators in a pipeline, except the last one, must be transformers (i.e. must have a `transform` method). The *last estimator may be any type* (transformer, classifier, etc.).
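
The safety point can be illustrated with cross-validation. The following is a minimal sketch, assuming the iris dataset and a `StandardScaler` + `SVC` pipeline (neither is used at this point in the guide). Because the scaler is a pipeline step, `cross_val_score` refits it on the training portion of every fold, so no statistics from the held-out fold reach the model.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

# The scaler is a pipeline step, so each CV split fits it on the
# training fold only and merely applies it to the held-out fold.
leak_free = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(leak_free, X_iris, y_iris, cv=5)
print(scores.shape)  # one score per fold
```

Scaling `X_iris` once up front and then cross-validating only the `SVC` would compute the mean and variance on all samples, including the ones later used for scoring.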

### Construction

The `Pipeline` is built using a list of `(key, value)` pairs, where the key is a string containing the name you want to give this step and value is an estimator object:

In [1]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from IPython.display import display

estimators = [
    # string --> object
    ('reduce_dim', PCA()), 
    ('clf', SVC())
]

pipe = Pipeline(estimators)
print(pipe)

Pipeline(steps=[('reduce_dim', PCA()), ('clf', SVC())])


The **utility function `make_pipeline`** is a shorthand for constructing pipelines. 

It takes a variable number of estimators and returns a pipeline, *filling in the names automatically*:

In [2]:
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import Binarizer

pipe2 = make_pipeline(Binarizer(), MultinomialNB())
print(pipe2)

Pipeline(steps=[('binarizer', Binarizer()), ('multinomialnb', MultinomialNB())])


### Accessing steps

The estimators of a pipeline are stored as a list in the **`steps` attribute**, but can also be accessed by index or by name by indexing the `Pipeline` itself (with `[idx]`):

In [6]:
pipe.steps

[('reduce_dim', PCA()), ('clf', SVC())]

In [7]:
pipe.steps[0]  # Returns the (name, obj) tuple.

('reduce_dim', PCA())

In [8]:
pipe[0]  # Returns only obj.

PCA()

In [9]:
pipe["reduce_dim"]

PCA()

Pipeline’s `named_steps` attribute allows accessing steps by name with tab completion in interactive environments:

In [10]:
pipe.named_steps.reduce_dim is pipe['reduce_dim']

True

A *sub-pipeline* can also be extracted using the slicing notation commonly used for Python Sequences such as lists or strings (although only a step of 1 is permitted). This is convenient for performing only some of the transformations (or their inverse):

In [11]:
pipe[:1]

Pipeline(steps=[('reduce_dim', PCA())])

In [13]:
pipe[-1:]

Pipeline(steps=[('clf', SVC())])
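
As a sketch of why slicing is convenient, the example below fits a three-step pipeline and then uses the sub-pipeline `[:-1]` to apply only the transformations, and to invert them. The iris data, the extra `StandardScaler` step and the name `demo_pipe` are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

demo_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('reduce_dim', PCA(n_components=2)),
    ('clf', SVC()),
])
demo_pipe.fit(X_iris, y_iris)

# Everything except the final classifier: a transformer-only sub-pipeline.
# The slice shares the already fitted steps with `demo_pipe`.
preprocessing = demo_pipe[:-1]
X_reduced = preprocessing.transform(X_iris)
print(X_reduced.shape)  # (150, 2)

# Map the reduced data back to the original 4-dimensional feature space.
X_restored = preprocessing.inverse_transform(X_reduced)
print(X_restored.shape)  # (150, 4)
```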

## Nested parameters

Parameters of the estimators in the pipeline can be accessed using the `<estimator>__<parameter>` syntax:

In [16]:
{k: v for k, v in pipe.get_params().items() if "__" in k}

{'reduce_dim__copy': True,
 'reduce_dim__iterated_power': 'auto',
 'reduce_dim__n_components': None,
 'reduce_dim__random_state': None,
 'reduce_dim__svd_solver': 'auto',
 'reduce_dim__tol': 0.0,
 'reduce_dim__whiten': False,
 'clf__C': 1.0,
 'clf__break_ties': False,
 'clf__cache_size': 200,
 'clf__class_weight': None,
 'clf__coef0': 0.0,
 'clf__decision_function_shape': 'ovr',
 'clf__degree': 3,
 'clf__gamma': 'scale',
 'clf__kernel': 'rbf',
 'clf__max_iter': -1,
 'clf__probability': False,
 'clf__random_state': None,
 'clf__shrinking': True,
 'clf__tol': 0.001,
 'clf__verbose': False}

In [19]:
pipe.set_params(clf__C=10)
pipe.get_params()["clf__C"]

10

This is particularly important for doing **grid searches**:

In [21]:
from sklearn.model_selection import GridSearchCV

# ‼️ Define your parameter grid using the <estimator>__<parameter> notation!
param_grid = dict(
    reduce_dim__n_components=[2, 5, 10],
    clf__C=[0.1, 10, 100]
)

In [22]:
grid_search = GridSearchCV(pipe, param_grid=param_grid, verbose=2)
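
The search is only useful once it has been fitted. A minimal sketch, assuming the first 500 samples of the digits dataset and 3-fold cross-validation (both chosen here only to keep the run short):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X_digits, y_digits = load_digits(return_X_y=True)
X_small, y_small = X_digits[:500], y_digits[:500]  # subset, to keep the search quick

search_pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])
search_grid = dict(
    reduce_dim__n_components=[2, 5, 10],
    clf__C=[0.1, 10, 100],
)

search = GridSearchCV(search_pipe, param_grid=search_grid, cv=3)
search.fit(X_small, y_small)

# Keys of best_params_ use the same <estimator>__<parameter> names.
print(search.best_params_)
print(round(search.best_score_, 3))
```

Each of the 9 parameter combinations refits the whole pipeline, PCA included, on every training fold.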

### Notes

Calling `.fit()` on the pipeline is the same as: 
* calling `.fit()` on each estimator in turn, 
* transforming the input and passing it on to the next step. 

The pipeline *has all the methods that the last estimator in the pipeline has*, i.e. if the last estimator is a classifier, the `Pipeline` can be used as a classifier. If the last estimator is a transformer, again, so is the pipeline.
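
Both notes can be checked by hand. The sketch below (assuming the iris data and a `StandardScaler` + `SVC` pipeline, chosen only for illustration) fits the steps manually and compares the predictions with those of the pipeline:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

# Fitting through the pipeline...
auto = make_pipeline(StandardScaler(), SVC()).fit(X_iris, y_iris)

# ...versus fitting each step by hand and passing the transformed data along.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_iris)
svc = SVC().fit(X_scaled, y_iris)

same = np.array_equal(auto.predict(X_iris), svc.predict(scaler.transform(X_iris)))
print(same)

# The pipeline exposes the final estimator's methods, here a classifier's.
print(hasattr(auto, 'predict'), hasattr(auto, 'decision_function'))
```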

### ⚠️ Caching transformers: avoid repeated computation

Fitting transformers may be computationally expensive. With its **`memory` parameter** set, `Pipeline` **will cache each transformer after calling fit**. 

This feature is used to avoid computing the fit transformers within a pipeline **if the parameters and input data are identical**. 

A typical example is the case of a grid search in which the transformers can be fitted only once and reused for each configuration.

The parameter `memory` is needed in order to cache the transformers. `memory` can be either a *string containing the directory where to cache the transformers* or a `joblib.Memory` object:

In [27]:
from tempfile import mkdtemp
from shutil import rmtree
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

estimators = [
    ('reduce_dim', PCA()), 
    ('clf', SVC())
]

cachedir = mkdtemp()
print("cachedir:", cachedir, end="\n\n")

pipe = Pipeline(estimators, memory=cachedir)
print(pipe)

# Clear the cache directory when you don't need it anymore
rmtree(cachedir)

cachedir: /tmp/tmp05otwxvq

Pipeline(memory='/tmp/tmp05otwxvq',
         steps=[('reduce_dim', PCA()), ('clf', SVC())])


**‼️⚠️ Warning: Side effect of caching transformers:**

Using a `Pipeline` *without cache enabled*, it is **possible to inspect the original instance** such as:

In [34]:
from sklearn.datasets import load_digits

X_digits, y_digits = load_digits(return_X_y=True)

# Create our object instances:
pca1 = PCA()
svm1 = SVC()

pipe = Pipeline(
    [
        ('reduce_dim', pca1), 
        ('clf', svm1)
    ]
)

pipe.fit(X_digits, y_digits)

# ⚠️ The pca instance can be inspected directly:
print(pca1.components_[0, :4])

[-1.77484909e-19 -1.73094651e-02 -2.23428835e-01 -1.35913304e-01]


Enabling **caching triggers a *clone* of the transformers before fitting**. 

Therefore, the transformer instance given to the pipeline **cannot be inspected directly**. 

In the following example, accessing the PCA instance `pca2` will raise an `AttributeError` since `pca2` will be an unfitted transformer. Instead, use the attribute `named_steps` to inspect estimators within the pipeline:

In [35]:
cachedir = mkdtemp()

# Again, create our object instances:
pca2 = PCA()
svm2 = SVC()

cached_pipe = Pipeline(
    [
        ('reduce_dim', pca2), 
        ('clf', svm2)
    ],
    memory=cachedir  # <-- Note!
)
cached_pipe.fit(X_digits, y_digits)

Pipeline(memory='/tmp/tmpav6n0ff5',
         steps=[('reduce_dim', PCA()), ('clf', SVC())])

In [38]:
try:
    pca2.components_
except AttributeError as e:
    print(f"AttributeError {e}")

AttributeError 'PCA' object has no attribute 'components_'


In [40]:
print(cached_pipe.named_steps['reduce_dim'].components_[0, :4])

[-1.77484909e-19 -1.73094651e-02 -2.23428835e-01 -1.35913304e-01]


In [41]:
# Remove the cache directory
rmtree(cachedir)

## [Transforming target in regression](https://scikit-learn.org/stable/modules/compose.html#transforming-target-in-regression)

**`TransformedTargetRegressor` transforms the targets `y` before fitting a regression model.**
* The predictions are mapped back to the original space via an inverse transform. 
* It takes as an argument *the regressor that will be used for prediction*, and *the transformer that will be applied to the target variable*:

In [42]:
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.compose import TransformedTargetRegressor  # <-- This.
from sklearn.preprocessing import QuantileTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [43]:
X, y = fetch_california_housing(return_X_y=True)
X, y = X[:2000, :], y[:2000]  # select a subset of data

In [44]:
transformer = QuantileTransformer(output_distribution='normal')
regressor = LinearRegression()

In [45]:
# Define the `TransformedTargetRegressor`:
regr = TransformedTargetRegressor(
    regressor=regressor,
    transformer=transformer
)

In [46]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [47]:
regr.fit(X_train, y_train)

TransformedTargetRegressor(regressor=LinearRegression(),
                           transformer=QuantileTransformer(output_distribution='normal'))

In [48]:
print('R2 score: {0:.2f}'.format(regr.score(X_test, y_test)))

R2 score: 0.61


In [49]:
# For comparison, "untransformed" regressor:
raw_target_regr = LinearRegression().fit(X_train, y_train)
print('R2 score: {0:.2f}'.format(raw_target_regr.score(X_test, y_test)))

R2 score: 0.59


For simple transformations, instead of a `Transformer` object, a pair of functions can be passed, defining the transformation and its inverse mapping:

In [50]:
def func(x):
    return np.log(x)
def inverse_func(x):
    return np.exp(x)

regr = TransformedTargetRegressor(
    regressor=regressor,
    # This time, pass `func` and `inverse_func` arguments:
    func=func,
    inverse_func=inverse_func
)

In [51]:
regr.fit(X_train, y_train)
print('R2 score: {0:.2f}'.format(regr.score(X_test, y_test)))

R2 score: 0.51


By default, **the provided functions are checked at each fit to be the inverse of each other**. 

However, it is possible to bypass this checking by setting `check_inverse` to `False`:

In [52]:
def inverse_func(x):
    return x

regr = TransformedTargetRegressor(
    regressor=regressor,
    func=func,
    inverse_func=inverse_func,
    check_inverse=False  # Here.
)

regr.fit(X_train, y_train)

print('R2 score: {0:.2f}'.format(regr.score(X_test, y_test)))

R2 score: -1.57
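
To see what the default check does, the sketch below fits a similarly mismatched pair with `check_inverse` left at its default and records any warnings. It assumes small synthetic data (not the California housing subset used above) and assumes that a failed check surfaces as a `UserWarning` rather than an exception, which is how recent scikit-learn releases report it.

```python
import warnings

import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(100, 2))
y_demo = np.exp(X_demo @ np.array([1.0, 2.0]))  # strictly positive targets

mismatched = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,
    inverse_func=lambda x: x,  # deliberately NOT the inverse of np.log
)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    mismatched.fit(X_demo, y_demo)

inverse_warnings = [w for w in caught if issubclass(w.category, UserWarning)]
print(len(inverse_warnings) > 0)
```

The fit still completes; the warning is the only signal that predictions will be mapped back incorrectly.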


**📚 An example covering transformation of targets is [here](https://scikit-learn.org/stable/auto_examples/compose/plot_transformed_target.html#sphx-glr-auto-examples-compose-plot-transformed-target-py).**

## [`FeatureUnion`: composite feature spaces](https://scikit-learn.org/stable/modules/compose.html#featureunion-composite-feature-spaces)

`FeatureUnion` combines several transformer objects into a new transformer that combines their output. 
* A `FeatureUnion` takes *a list of transformer objects*. 
* During fitting, each of these is fit to the data independently.
* The transformers are applied in parallel, and the feature matrices they output are concatenated side-by-side into a larger matrix.

When you want to apply *different transformations* to *each field of the data*, see the related class `ColumnTransformer`.

`FeatureUnion` serves the same purposes as Pipeline - *convenience and joint parameter estimation and validation*.

`FeatureUnion` and `Pipeline` **can be combined to create complex models**.

(A `FeatureUnion` has no way of checking whether two transformers might produce identical features. It only produces a union when the feature sets are disjoint, and making sure they are is the caller’s responsibility.)

### Usage

A `FeatureUnion` is built using a list of `(key, value)` pairs, where the key is the name you want to give to a given transformation (an arbitrary string; it only serves as an identifier) and value is an estimator object:

In [53]:
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.decomposition import KernelPCA

estimators = [
    ('linear_pca', PCA()), 
    ('kernel_pca', KernelPCA())
]

combined = FeatureUnion(estimators)
print(combined)

FeatureUnion(transformer_list=[('linear_pca', PCA()),
                               ('kernel_pca', KernelPCA())])


Like pipelines, feature unions have a shorthand constructor called `make_union` that does not require explicit naming of the components.
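
A minimal sketch of `make_union`, reusing the two transformers from the example above (the variable name `union` is only illustrative):

```python
from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import make_union

union = make_union(PCA(), KernelPCA())

# Names are derived from the lowercased class names.
print([name for name, _ in union.transformer_list])
# ['pca', 'kernelpca']
```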

*‼️ The below point hasn't been covered in this guide's `Pipeline` section ‼️*

Like `Pipeline`, individual steps may be replaced using `set_params`, and ignored by setting to `'drop'`:

In [54]:
combined.set_params(kernel_pca='drop')

FeatureUnion(transformer_list=[('linear_pca', PCA()), ('kernel_pca', 'drop')])
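
For `Pipeline`, the counterpart of `'drop'` is the string `'passthrough'`, which leaves the data untouched at that step. A minimal sketch, assuming the iris data and a fresh pipeline under the hypothetical name `skip_pipe`:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

skip_pipe = Pipeline([('reduce_dim', PCA(n_components=2)), ('clf', SVC())])

# Swap the PCA step for 'passthrough': the data now reaches SVC unchanged.
skip_pipe.set_params(reduce_dim='passthrough')
skip_pipe.fit(X_iris, y_iris)

print(skip_pipe.named_steps['reduce_dim'])  # passthrough
print(skip_pipe.named_steps['clf'].n_features_in_)  # 4: all original features
```

This makes it possible to treat "skip this step" as just another candidate value in a parameter grid.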

---
#### [Example: Concatenating multiple feature extraction methods](https://scikit-learn.org/stable/auto_examples/compose/plot_feature_union.html#sphx-glr-auto-examples-compose-plot-feature-union-py)

In many real-world examples, there are many ways to extract features from a dataset. Often it is beneficial to *combine several methods to obtain good performance*. This example shows how to use `FeatureUnion` to *combine features obtained by **PCA** and **univariate selection***.

> Combining features using this transformer has the benefit that it allows **cross validation** and **grid searches** over *the whole process*.

(The combination used in this example is not particularly helpful on this dataset and is only used to illustrate the usage of `FeatureUnion`.)

In [55]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

In [58]:
iris = load_iris()
X, y = iris.data, iris.target
print("X.shape:", X.shape)
print("y.shape:", y.shape)
print()
print("X[:3]:\n", X[:3])
print("y[:3]:\n", y[:3])

X.shape: (150, 4)
y.shape: (150,)

X[:3]:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]]
y[:3]:
 [0 0 0]


In [59]:
# PCA:
pca = PCA(n_components=2)  # Gives 2 features.

In [60]:
# Univariate selection:
selection = SelectKBest(k=1)  # Gives 1 feature.

In [61]:
# Build estimator from PCA and Univariate selection:
combined_features = FeatureUnion(
    [
        ("pca", pca), 
        ("univ_select", selection)
    ]
)
combined_features

FeatureUnion(transformer_list=[('pca', PCA(n_components=2)),
                               ('univ_select', SelectKBest(k=1))])

In [63]:
# Use combined features to transform dataset:
X_features = combined_features.fit(X, y).transform(X)

print("Combined space has", X_features.shape[1], "features")
# Get total of 3 features.

Combined space has 3 features


In [64]:
# SVM Classifier.
svm = SVC(kernel="linear")

In [65]:
# Do grid search over k, n_components and C:

# Define Pipeline:
pipeline = Pipeline(
    [
        ("features", combined_features),  # Note that Pipeline has taken in a FeatureUnion here!
        ("svm", svm)
    ]
)

# Define param_grid:
param_grid = dict(
    features__pca__n_components=[1, 2, 3],
    features__univ_select__k=[1, 2],
    svm__C=[0.1, 1, 10]
)

In [67]:
grid_search = GridSearchCV(
    pipeline, 
    param_grid=param_grid, 
    verbose=10
)

grid_search.fit(X, y)

Fitting 5 folds for each of 18 candidates, totalling 90 fits
[CV 1/5; 1/18] START features__pca__n_components=1, features__univ_select__k=1, svm__C=0.1
[CV 1/5; 1/18] END features__pca__n_components=1, features__univ_select__k=1, svm__C=0.1;, score=0.933 total time=   0.0s
[CV 2/5; 1/18] START features__pca__n_components=1, features__univ_select__k=1, svm__C=0.1
[CV 2/5; 1/18] END features__pca__n_components=1, features__univ_select__k=1, svm__C=0.1;, score=0.933 total time=   0.0s
[CV 3/5; 1/18] START features__pca__n_components=1, features__univ_select__k=1, svm__C=0.1
[CV 3/5; 1/18] END features__pca__n_components=1, features__univ_select__k=1, svm__C=0.1;, score=0.867 total time=   0.0s
[CV 4/5; 1/18] START features__pca__n_components=1, features__univ_select__k=1, svm__C=0.1
[CV 4/5; 1/18] END features__pca__n_components=1, features__univ_select__k=1, svm__C=0.1;, score=0.933 total time=   0.0s
[CV 5/5; 1/18] START features__pca__n_components=1, features__univ_select__k=1, svm__C=

GridSearchCV(estimator=Pipeline(steps=[('features',
                                        FeatureUnion(transformer_list=[('pca',
                                                                        PCA(n_components=2)),
                                                                       ('univ_select',
                                                                        SelectKBest(k=1))])),
                                       ('svm', SVC(kernel='linear'))]),
             param_grid={'features__pca__n_components': [1, 2, 3],
                         'features__univ_select__k': [1, 2],
                         'svm__C': [0.1, 1, 10]},
             verbose=10)

In [68]:
print(grid_search.best_estimator_)

Pipeline(steps=[('features',
                 FeatureUnion(transformer_list=[('pca', PCA(n_components=3)),
                                                ('univ_select',
                                                 SelectKBest(k=1))])),
                ('svm', SVC(C=10, kernel='linear'))])


---

## [`ColumnTransformer` for heterogeneous data](https://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data)

Many datasets contain features of different types, say text, floats, and dates, where each type of feature requires separate preprocessing or feature extraction steps. 

Often it is easiest to preprocess data before applying scikit-learn methods, for example using `pandas`. Processing your data before passing it to scikit-learn might be problematic for one of the following reasons:

* Incorporating statistics from test data into the preprocessors makes cross-validation scores unreliable (known as **data leakage**), for example in the case of scalers or imputing missing values.
* You may want to include the parameters of the preprocessors in a parameter search.

The `ColumnTransformer` helps perform different transformations for different columns of the data, within a `Pipeline` that is *safe from data leakage* and that *can be parametrized*. `ColumnTransformer` works on:
* arrays, 
* sparse matrices, and 
* `pandas DataFrames`.

To each column, a different transformation can be applied, such as preprocessing or a specific feature extraction method:

In [70]:
import pandas as pd
X = pd.DataFrame(
    {
        'city': ['London', 'London', 'Paris', 'Sallisaw'],
        'title': ["His Last Bow", "How Watson Learned the Trick", "A Moveable Feast", "The Grapes of Wrath"],
        'expert_rating': [5, 3, 4, 5],
        'user_rating': [4, 5, 4, 3]
    }
)
display(X)

| | city | title | expert_rating | user_rating |
|---|---|---|---|---|
| 0 | London | His Last Bow | 5 | 4 |
| 1 | London | How Watson Learned the Trick | 3 | 5 |
| 2 | Paris | A Moveable Feast | 4 | 4 |
| 3 | Sallisaw | The Grapes of Wrath | 5 | 3 |


For this data, we might want to encode the `'city'` column as a categorical variable using `OneHotEncoder` but apply a `CountVectorizer` to the `'title'` column. 

As we might use multiple feature extraction methods on the same column, *we give each transformer a unique name*, say `'city_category'` and `'title_bow'`. 

By default, the remaining rating columns are ignored (`remainder='drop'`):

In [71]:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

column_trans = ColumnTransformer(
    [
        # Note the third item in the tuple - column name / list of columns!
        ('city_category', OneHotEncoder(dtype='int'), ['city']),
        ('title_bow', CountVectorizer(), 'title')
    ],
    remainder='drop'  # Note also this default!
)

In [72]:
column_trans.fit(X)

ColumnTransformer(transformers=[('city_category', OneHotEncoder(dtype='int'),
                                 ['city']),
                                ('title_bow', CountVectorizer(), 'title')])

In [73]:
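# Recorded with an older scikit-learn: `get_feature_names()` was deprecated in 1.0
# and removed in 1.2. Newer releases provide `get_feature_names_out()` instead,
# which formats the generated names differently.
column_trans.get_feature_names()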

['city_category__x0_London',
 'city_category__x0_Paris',
 'city_category__x0_Sallisaw',
 'title_bow__bow',
 'title_bow__feast',
 'title_bow__grapes',
 'title_bow__his',
 'title_bow__how',
 'title_bow__last',
 'title_bow__learned',
 'title_bow__moveable',
 'title_bow__of',
 'title_bow__the',
 'title_bow__trick',
 'title_bow__watson',
 'title_bow__wrath']

In [74]:
transformed = column_trans.transform(X)
# Sparse here: both transformers emit sparse output and the stacked result's
# density is below `sparse_threshold` (default 0.3).
print("type(transformed):", type(transformed))

transformed.toarray()

type(transformed): <class 'scipy.sparse.csr.csr_matrix'>


array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0],
       [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1]])

⚠️ In the above example, the `CountVectorizer` **expects a 1D array as input** and therefore the column was **specified as a string (`'title'`)**. However, `OneHotEncoder`, like most other transformers, **expects 2D data**, so in that case you need to specify the column **as a list of strings (`['city']`)**.

Apart from
* a scalar or 
* a single item list, 

the column selection can be specified as 
* a list of multiple items, 
* an integer array, 
* a slice, 
* a boolean mask, 
* or with a `make_column_selector`. 

The `make_column_selector` [(docs)](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_selector.html#sklearn.compose.make_column_selector) is used to select columns based on data type or column name:

In [75]:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import make_column_selector

ct = ColumnTransformer(
    [
        # Note the use of `make_column_selector`:
        ('scale', StandardScaler(), make_column_selector(dtype_include=np.number)),
        ('onehot', OneHotEncoder(), make_column_selector(pattern='city', dtype_include=object))
    ]
)

ct.fit_transform(X)

array([[ 0.90453403,  0.        ,  1.        ,  0.        ,  0.        ],
       [-1.50755672,  1.41421356,  1.        ,  0.        ,  0.        ],
       [-0.30151134,  0.        ,  0.        ,  1.        ,  0.        ],
       [ 0.90453403, -1.41421356,  0.        ,  0.        ,  1.        ]])

Strings can reference columns only if the input is a `DataFrame`; integers are always interpreted as positional columns.
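
Positional selection also works on plain NumPy arrays, where there are no column names. A minimal sketch, assuming a small made-up array `X_arr`:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_arr = np.array([
    [1.0, 10.0, 100.0, 5.0],
    [2.0, 20.0, 200.0, 3.0],
    [3.0, 30.0, 300.0, 4.0],
])

positional = ColumnTransformer([
    ('standard', StandardScaler(), [0, 1]),   # list of integer positions
    ('minmax', MinMaxScaler(), slice(2, 4)),  # slice: columns 2 and 3
])

X_out = positional.fit_transform(X_arr)
print(X_out.shape)  # (3, 4)
```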

We can **keep the remaining columns** by setting `remainder='passthrough'`. The values are appended *to the end* of the transformation:

In [76]:
column_trans = ColumnTransformer(
    [
        ('city_category', OneHotEncoder(dtype='int'),['city']),
        ('title_bow', CountVectorizer(), 'title')
    ],
    remainder='passthrough'  # This.
)

In [77]:
column_trans.fit_transform(X)

array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 5, 4],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 3, 5],
       [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 4, 4],
       [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 5, 3]])

The remainder parameter **can be set to an *estimator*** to *transform the remaining columns*. 

The transformed values are appended to the end of the transformation:

In [78]:
from sklearn.preprocessing import MinMaxScaler

column_trans = ColumnTransformer(
    [
        ('city_category', OneHotEncoder(), ['city']),
        ('title_bow', CountVectorizer(), 'title')
    ],
    remainder=MinMaxScaler()  # Like so.
)

In [79]:
column_trans

ColumnTransformer(remainder=MinMaxScaler(),
                  transformers=[('city_category', OneHotEncoder(), ['city']),
                                ('title_bow', CountVectorizer(), 'title')])

In [83]:
with np.printoptions(linewidth=120):
    print(column_trans.fit_transform(X))

[[1.  0.  0.  1.  0.  0.  1.  0.  1.  0.  0.  0.  0.  0.  0.  0.  1.  0.5]
 [1.  0.  0.  0.  0.  0.  0.  1.  0.  1.  0.  0.  1.  1.  1.  0.  0.  1. ]
 [0.  1.  0.  0.  1.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.5 0.5]
 [0.  0.  1.  0.  0.  1.  0.  0.  0.  0.  0.  1.  1.  0.  0.  1.  1.  0. ]]


## [Visualizing Composite Estimators](https://scikit-learn.org/stable/modules/compose.html#visualizing-composite-estimators)

Estimators can be displayed with an HTML representation when shown in a Jupyter notebook. 

This can be useful to diagnose or visualize a `Pipeline` with many estimators. 

This visualization is activated by setting the `display` option in `set_config`:

In [84]:
from sklearn import set_config
set_config(display='diagram')   
# displays HTML representation in a jupyter context
column_trans

As an alternative, the HTML can be written to a file using `estimator_html_repr`:
```python
from sklearn.utils import estimator_html_repr
with open('my_estimator.html', 'w') as f:  
    f.write(estimator_html_repr(column_trans))
```

**Examples:**
* [Column Transformer with Heterogeneous Data Sources](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer.html#sphx-glr-auto-examples-compose-plot-column-transformer-py)
* [Column Transformer with Mixed Types](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py) - covered below.

---
#### [Example: Column Transformer with Mixed Types](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py)

This example illustrates how to apply different preprocessing and feature extraction pipelines to different subsets of features, using `ColumnTransformer`. This is particularly handy for the case of datasets that contain heterogeneous data types, since we may want to:
* scale the numeric features and 
* one-hot encode the categorical ones.

In this example, the numeric data is standard-scaled after median imputation (`SimpleImputer(strategy='median')`), while the categorical data is one-hot encoded, with categories not seen during fitting ignored at transform time (`handle_unknown='ignore'`).

In addition, we show *two different ways to dispatch the columns* to the particular pre-processor: 
* by column names and 
* by column data types.

Finally, the preprocessing pipeline is integrated in a full prediction pipeline using `Pipeline`, together with a simple classification model.

In [91]:
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn import set_config  # For Pipeline representation control.

In [86]:
np.random.seed(0)

In [88]:
# Load data from https://www.openml.org/d/40945
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
print("X.shape", X.shape)
print("y.shape", y.shape)
display(X)
display(y)

X.shape (1309, 13)
y.shape (1309,)


| | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0.0 | 0.0 | 24160 | 211.3375 | B5 | S | 2 | | St Louis, MO |
| 1 | 1.0 | Allison, Master. Hudson Trevor | male | 0.9167 | 1.0 | 2.0 | 113781 | 151.5500 | C22 C26 | S | 11 | | Montreal, PQ / Chesterville, ON |
| 2 | 1.0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1.0 | 2.0 | 113781 | 151.5500 | C22 C26 | S | | | Montreal, PQ / Chesterville, ON |
| 3 | 1.0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1.0 | 2.0 | 113781 | 151.5500 | C22 C26 | S | | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1.0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1.0 | 2.0 | 113781 | 151.5500 | C22 C26 | S | | | Montreal, PQ / Chesterville, ON |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1304 | 3.0 | Zabour, Miss. Hileni | female | 14.5000 | 1.0 | 0.0 | 2665 | 14.4542 | | C | | 328.0 | |
| 1305 | 3.0 | Zabour, Miss. Thamine | female | | 1.0 | 0.0 | 2665 | 14.4542 | | C | | | |
| 1306 | 3.0 | Zakarian, Mr. Mapriededer | male | 26.5000 | 0.0 | 0.0 | 2656 | 7.2250 | | C | | 304.0 | |
| 1307 | 3.0 | Zakarian, Mr. Ortin | male | 27.0000 | 0.0 | 0.0 | 2670 | 7.2250 | | C | | | |


0       1
1       1
2       0
3       0
4       0
       ..
1304    0
1305    0
1306    0
1307    0
1308    0
Name: survived, Length: 1309, dtype: category
Categories (2, object): ['0', '1']

**Use ColumnTransformer by selecting column by names**

We will train our classifier with the following features:

Numeric Features:
* `age`: float;
* `fare`: float.

Categorical Features:
* `embarked`: categories encoded as strings `{'C', 'S', 'Q'}`;
* `sex`: categories encoded as strings `{'female', 'male'}`;
* `pclass`: ordinal integers `{1, 2, 3}`.

We create the preprocessing pipelines for both numeric and categorical data. Note that `pclass` could either be treated as a categorical or numeric feature.

In [92]:
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]
)

set_config(display='text')
display(numeric_transformer)

Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                ('scaler', StandardScaler())])

In [93]:
categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

categorical_transformer

OneHotEncoder(handle_unknown='ignore')

In [94]:
preprocessor = ColumnTransformer(  # <-- Here we get our ColumnTransformer object!
    transformers=[
        # Note the 3rd element of tuples sets the COLUMNS to use (by their name).
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

preprocessor

ColumnTransformer(transformers=[('num',
                                 Pipeline(steps=[('imputer',
                                                  SimpleImputer(strategy='median')),
                                                 ('scaler', StandardScaler())]),
                                 ['age', 'fare']),
                                ('cat', OneHotEncoder(handle_unknown='ignore'),
                                 ['embarked', 'sex', 'pclass'])])

In [95]:
# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('classifier', LogisticRegression())
    ]
)

clf

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['age', 'fare']),
                                                 ('cat',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['embarked', 'sex',
                                                   'pclass'])])),
                ('classifier', LogisticRegression())])

In [96]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

In [98]:
clf.fit(X_train, y_train)
print(f"model score: {clf.score(X_test, y_test):.3f}")

model score: 0.790


**HTML representation of Pipeline**

When the `Pipeline` is displayed in a Jupyter notebook, an HTML representation of the estimator is rendered as follows:

In [100]:
from sklearn import set_config

set_config(
    display='diagram'  # Options: {'text', 'diagram'}
)

display(clf)

# NOTE: The HTML representation is interactive, click on things! 😎
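
Outside a notebook, the same kind of diagram can be obtained as a string with `sklearn.utils.estimator_html_repr`. The sketch below is only an illustration: `demo_pipe`, `out_path` and the file name are hypothetical, and a small two-step pipeline stands in for `clf`.

```python
import tempfile
from pathlib import Path

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

# Hypothetical two-step pipeline, used only for illustration.
demo_pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression()),
    ]
)

# Build an HTML diagram of the estimator as a plain string.
html = estimator_html_repr(demo_pipe)

# Write it to a temporary file; any writable path would do.
out_path = Path(tempfile.mkdtemp()) / "demo_pipeline.html"
out_path.write_text(html, encoding="utf-8")
print(out_path)
```

Opening the written file in a browser should show a diagram similar to the one displayed in the notebook.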

**Use `ColumnTransformer` by selecting columns *by data type***

When dealing with a cleaned dataset, the preprocessing can be automated *by using the data type of each column* to decide whether to treat it as a numerical or a categorical feature. `sklearn.compose.make_column_selector` makes this possible.

First, let’s select only a subset of columns to simplify our example.

In [101]:
subset_feature = ['embarked', 'sex', 'pclass', 'age', 'fare']
X_train, X_test = X_train[subset_feature], X_test[subset_feature]
display(X_train)

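|      | embarked | sex    | pclass | age     | fare     |
|------|----------|--------|--------|---------|----------|
| 1118 | S        | male   | 3.0    | 25.0000 | 7.9250   |
| 44   | C        | female | 1.0    | 41.0000 | 134.5000 |
| 1072 | Q        | male   | 3.0    | NaN     | 7.7333   |
| 1130 | S        | female | 3.0    | 18.0000 | 7.7750   |
| 574  | S        | male   | 2.0    | 29.0000 | 21.0000  |
| ...  | ...      | ...    | ...    | ...     | ...      |
| 763  | S        | female | 3.0    | 0.1667  | 20.5750  |
| 835  | S        | male   | 3.0    | NaN     | 8.0500   |
| 1216 | Q        | female | 3.0    | NaN     | 7.7333   |
| 559  | S        | female | 2.0    | 20.0000 | 36.7500  |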


Then, we inspect the data type of each column.

In [102]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1047 entries, 1118 to 684
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   embarked  1045 non-null   category
 1   sex       1047 non-null   category
 2   pclass    1047 non-null   float64 
 3   age       841 non-null    float64 
 4   fare      1046 non-null   float64 
dtypes: category(2), float64(3)
memory usage: 35.0 KB


We can observe that the `embarked` and `sex` columns were tagged as *category* columns when loading the data with `fetch_openml`. 

Therefore, we can use this information to dispatch the categorical columns to the `categorical_transformer` and the remaining columns to the `numeric_transformer`.

**Note:**

In practice, you will have to handle the column data types yourself. If you want some columns to be treated as categorical, you will have to convert them to a categorical dtype. If you are using pandas, you can refer to its documentation on [Categorical data](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html). 
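
A minimal sketch of such a conversion, assuming a hypothetical toy DataFrame (`toy`) rather than the Titanic data. It shows that a float column such as `pclass` is only picked up by a `category` selector once it has been converted explicitly:

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy frame; `pclass` is stored as floats, `sex` as a category.
toy = pd.DataFrame(
    {
        "pclass": [1.0, 3.0, 2.0],
        "age": [22.0, 38.0, 26.0],
        "sex": pd.Categorical(["male", "female", "female"]),
    }
)

select_categorical = make_column_selector(dtype_include="category")

# Before the conversion only `sex` has the category dtype.
before = select_categorical(toy)

# Explicitly mark `pclass` as categorical.
toy["pclass"] = toy["pclass"].astype("category")
after = select_categorical(toy)

print(before)  # ['sex']
print(after)   # ['pclass', 'sex']
```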

In [103]:
from sklearn.compose import make_column_selector

preprocessor = ColumnTransformer(
    transformers=[
        # Note the use of `make_column_selector()` here:
        ('num', numeric_transformer, make_column_selector(dtype_exclude="category")),
        ('cat', categorical_transformer, make_column_selector(dtype_include="category"))
    ]
)

clf = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('classifier', LogisticRegression())
    ]
)

display(clf)

In [104]:
clf.fit(X_train, y_train)
print(f"model score: {clf.score(X_test, y_test):.3f}")

model score: 0.794


The resulting score is not exactly the same as the one from the previous pipeline because the dtype-based selector treats the `pclass` column as a numeric feature instead of a categorical feature as previously:

In [105]:
make_column_selector(dtype_exclude="category")(X_train)

['pclass', 'age', 'fare']

In [107]:
make_column_selector(dtype_include="category")(X_train)

['embarked', 'sex']

**Using the prediction pipeline in a grid search**

Grid search can also be performed on the different preprocessing steps defined in the `ColumnTransformer` object, together with the classifier’s hyperparameters as part of the `Pipeline`. We will search for both the imputer strategy of the numeric preprocessing and the regularization parameter of the logistic regression using `GridSearchCV`.

In [108]:
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],  # Note the nested estimator notation.
    'classifier__C': [0.1, 1.0, 10, 100],
}

grid_search = GridSearchCV(clf, param_grid, cv=10)
display(grid_search)
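
The nested names used in `param_grid` follow the `<step>__<parameter>` convention, chained once per level of nesting. One way to discover the valid names is to list the keys returned by `get_params()`. The sketch below rebuilds a small pipeline with the same step names for illustration; `demo_clf` and the other `demo_*` names are hypothetical and nothing is fitted.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the tutorial's pipeline (numeric branch only).
demo_numeric = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)
demo_preprocessor = ColumnTransformer(
    transformers=[("num", demo_numeric, ["age", "fare"])]
)
demo_clf = Pipeline(
    steps=[
        ("preprocessor", demo_preprocessor),
        ("classifier", LogisticRegression()),
    ]
)

# Every key of get_params() is a name that a parameter grid may refer to.
params = demo_clf.get_params()
for name in sorted(params):
    if name.endswith("__strategy") or name == "classifier__C":
        print(name, "=", params[name])

# The same names work with set_params().
demo_clf.set_params(classifier__C=10.0)
```

Misspelled names in a grid raise an error when the search sets them, so checking them against `get_params()` first can save a failed run.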

Calling `fit` triggers the cross-validated search for the best hyperparameter combination:

In [109]:
grid_search.fit(X_train, y_train)

print("Best params:")
print(grid_search.best_params_)

Best params:
{'classifier__C': 0.1, 'preprocessor__num__imputer__strategy': 'mean'}


In [110]:
print(f"Internal CV score: {grid_search.best_score_:.3f}")

Internal CV score: 0.784


In [113]:
import pandas as pd

cv_results = pd.DataFrame(grid_search.cv_results_)
cv_results = cv_results.sort_values("mean_test_score", ascending=False)
cv_results[
    [
        "mean_test_score", 
        "std_test_score",
        "param_preprocessor__num__imputer__strategy",
        "param_classifier__C"
    ]
].head(5)

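|   | mean_test_score | std_test_score | param_preprocessor__num__imputer__strategy | param_classifier__C |
|---|-----------------|----------------|--------------------------------------------|---------------------|
| 0 | 0.784167        | 0.035824       | mean                                       | 0.1                 |
| 2 | 0.780366        | 0.032722       | mean                                       | 1.0                 |
| 1 | 0.780348        | 0.037245       | median                                     | 0.1                 |
| 4 | 0.779414        | 0.033105       | mean                                       | 10.0                |
| 6 | 0.779414        | 0.033105       | mean                                       | 100.0               |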


The best hyperparameters have been used to re-fit a final model on the full training set. 
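
This re-fit is controlled by the `refit` parameter of `GridSearchCV`, which defaults to `True`; the re-fitted model is then available as `best_estimator_`. A minimal sketch on synthetic data, where `X_toy`, `y_toy` and `toy_search` are hypothetical names and the data comes from `make_classification` rather than the Titanic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_search = GridSearchCV(
    Pipeline(
        steps=[
            ("scaler", StandardScaler()),
            ("classifier", LogisticRegression()),
        ]
    ),
    param_grid={"classifier__C": [0.1, 1.0, 10.0]},
    cv=5,
)  # refit=True is the default
toy_search.fit(X_toy, y_toy)

# The re-fitted pipeline, trained on all of X_toy with the best parameters.
best_model = toy_search.best_estimator_

# Calling predict on the search object delegates to that pipeline.
same = bool((toy_search.predict(X_toy) == best_model.predict(X_toy)).all())
print(type(best_model).__name__, same)
```

`best_model` is an ordinary fitted `Pipeline`, so it can be used or inspected independently of the search object.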

We can evaluate that final model on held-out test data that was not used for hyperparameter tuning:

In [114]:
print(f"best logistic regression from grid search: {grid_search.score(X_test, y_test):.3f}")

best logistic regression from grid search: 0.794


---