In [93]:
%matplotlib inline

# Pipelines and composite estimators
Transformers are usually combined with classifiers, regressors or other estimators to build a composite estimator. The most common tool is a [Pipeline](https://scikit-learn.org/stable/modules/compose.html#pipeline). 

Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) serves multiple purposes here:

* **Convenience and encapsulation**
  * You only have to call fit and predict once on your data to fit a whole sequence of estimators.

* **Joint parameter selection**
   * You can grid search over parameters of all estimators in the pipeline at once.
* **Safety**
  * Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.

All estimators in a pipeline, except the last one, must be transformers (i.e. must have a transform method). The last estimator may be any type (transformer, classifier, etc.).
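
The convenience and safety points above can be sketched on a small dataset. This is a minimal illustration, not part of the original notebook: the iris data, the step names, and the `iris_`-prefixed variables are assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative data (assumption): the iris dataset bundled with scikit-learn.
X_iris, y_iris = load_iris(return_X_y=True)

# Every step except the last is a transformer; the last is a classifier.
iris_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('reduce_dim', PCA(n_components=2)),
    ('clf', SVC()),
])

# Convenience: one fit and one predict call drive the whole chain.
iris_pipe.fit(X_iris, y_iris)
iris_predictions = iris_pipe.predict(X_iris[:5])

# Safety: in cross-validation the scaler and PCA are re-fit on each
# training fold only, so no statistics from the held-out fold leak in.
cv_scores = cross_val_score(iris_pipe, X_iris, y_iris, cv=5)
print(cv_scores.mean())
```

Passing the whole pipeline to `cross_val_score`, rather than pre-scaling `X_iris` once, is what keeps the held-out fold out of the scaler's statistics.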

## Building pipelines
The Pipeline is built using a list of (key, value) pairs, where the key is a string containing the name you want to give this step and value is an estimator object:

In [94]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
estimators = [('reduce_dim', PCA()), ('clf', SVC())]
pipe = Pipeline(estimators)
pipe 



Pipeline(memory=None,
     steps=[('reduce_dim', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)), ('clf', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))])

### utility function make_pipeline
The utility function `make_pipeline` is a shorthand for constructing pipelines; it takes a variable number of estimators and returns a pipeline, filling in the names automatically:

```python
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn.preprocessing import Binarizer
>>> make_pipeline(Binarizer(), MultinomialNB()) 
Pipeline(memory=None,
         steps=[('binarizer', Binarizer(copy=True, threshold=0.0)),
                ('multinomialnb', MultinomialNB(alpha=1.0,
                                                class_prior=None,
                                                fit_prior=True))])
```

In [95]:
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import Binarizer
make_pipeline(Binarizer(), MultinomialNB()) 

Pipeline(memory=None,
     steps=[('binarizer', Binarizer(copy=True, threshold=0.0)), ('multinomialnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

### pipe.steps[0]
The estimators of a pipeline are stored as a list in the steps attribute:
    
```python
>>> pipe.steps[0]
('reduce_dim', PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False))
```

### access pipe step via step name
The estimators are also available by name, as a dict-like object, in `named_steps`:

```python
>>> pipe.named_steps['reduce_dim']
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
```

### Access parameters of each step via `<estimator>__<parameter>`
Parameters of the estimators in the pipeline can be accessed using the `<estimator>__<parameter>` syntax:

```python
>>> pipe.set_params(clf__C=10) 
Pipeline(memory=None,
         steps=[('reduce_dim', PCA(copy=True, iterated_power='auto',...)),
                ('clf', SVC(C=10, cache_size=200, class_weight=None,...))])
```

Attributes of named_steps map to keys, enabling tab completion in interactive environments:

```python
>>> pipe.named_steps.reduce_dim is pipe.named_steps['reduce_dim']
True
```

The `<estimator>__<parameter>` syntax is particularly important for doing grid searches:

```python
>>> from sklearn.model_selection import GridSearchCV
>>> param_grid = dict(reduce_dim__n_components=[2, 5, 10],
...                   clf__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(pipe, param_grid=param_grid)
```
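
The grid above is only constructed, not fitted. A runnable sketch of the same idea follows; the iris data, the smaller `n_components` range (iris has only four features), and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

search_pipe = Pipeline([('reduce_dim', PCA()), ('clf', SVC())])

# Keys follow <step name>__<parameter name>.
search_grid = dict(reduce_dim__n_components=[2, 3],
                   clf__C=[0.1, 10, 100])

iris_search = GridSearchCV(search_pipe, param_grid=search_grid, cv=5)
iris_search.fit(X_iris, y_iris)

print(iris_search.best_params_)           # best combination found
print(round(iris_search.best_score_, 3))  # its mean cross-validated accuracy
```

Each of the 2 x 3 = 6 combinations is cross-validated with the full pipeline, so the PCA step is re-fit for every fold and every candidate.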

### Individual steps can be skipped or replaced as parameters
Individual steps may also be replaced as parameters, and non-final steps may be ignored by setting them to `None`:

```python
>>> from sklearn.linear_model import LogisticRegression
>>> param_grid = dict(reduce_dim=[None, PCA(5), PCA(10)],
...                   clf=[SVC(), LogisticRegression()],
...                   clf__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(pipe, param_grid=param_grid)
```

In this grid, `reduce_dim` can be
* `None` (the step is skipped)
* `PCA(5)`
* `PCA(10)`

and `clf` can be
* `SVC()`
* `LogisticRegression()`




## Caching transformers: avoid repeated computation
Fitting transformers may be computationally expensive. With its `memory` parameter set, Pipeline will cache each transformer after calling fit. This avoids re-fitting the transformers within a pipeline when the parameters and input data are identical.

**A typical example is the case of a grid search in which the transformers can be fitted only once and reused for each configuration.**

The `memory` parameter is needed in order to cache the transformers. `memory` can be either a string containing the directory where to cache the transformers or a `joblib.Memory` object:

```python
>>> from tempfile import mkdtemp
>>> from shutil import rmtree
>>> from sklearn.decomposition import PCA
>>> from sklearn.svm import SVC
>>> from sklearn.pipeline import Pipeline
>>> estimators = [('reduce_dim', PCA()), ('clf', SVC())]
>>> cachedir = mkdtemp()
>>> pipe = Pipeline(estimators, memory=cachedir)
>>> pipe 
Pipeline(...,
         steps=[('reduce_dim', PCA(copy=True,...)),
                ('clf', SVC(C=1.0,...))])
>>> # Clear the cache directory when you don't need it anymore
>>> rmtree(cachedir)
```

**Warning: use with care!**
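
One side effect worth checking is that, with `memory` set, the pipeline fits clones of the transformers rather than the instances that were passed in. The sketch below illustrates this under assumed names (`original_pca`, `cached_pipe`) and with the iris data as a stand-in; the fitted transformer is read back through `named_steps`.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

cache_dir = mkdtemp()
try:
    original_pca = PCA(n_components=2)
    cached_pipe = Pipeline([('reduce_dim', original_pca), ('clf', SVC())],
                           memory=cache_dir)
    cached_pipe.fit(X_iris, y_iris)

    # The instance we passed in was cloned before fitting ...
    original_is_fitted = hasattr(original_pca, 'components_')
    # ... so the fitted transformer has to be fetched from the pipeline.
    fitted_pca = cached_pipe.named_steps['reduce_dim']
    step_is_fitted = hasattr(fitted_pca, 'components_')
finally:
    # Clear the cache directory when it is no longer needed.
    rmtree(cache_dir)

print(original_is_fitted, step_is_fitted)
```

Code that inspects `original_pca.components_` after fitting works without caching but breaks once caching is enabled, which is one reason to use the feature with care.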

## References

* SKLearn documentation and examples [compose.html#pipeline](https://scikit-learn.org/stable/modules/compose.html#pipeline)

Pipeline is often used in combination with FeatureUnion, which concatenates the output of transformers into a composite feature space. TransformedTargetRegressor deals with transforming the target (e.g. log-transforming y). In contrast, Pipelines only transform the observed data (X).
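
As a small sketch of the target-transformation case, the example below fits a linear model on `log(y)` and maps predictions back with `exp`. The synthetic data and the coefficients 1.5 and 0.5 are assumptions made up for the illustration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (assumption): y grows exponentially with x.
rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 2, size=(200, 1))
y_toy = np.exp(1.5 * X_toy[:, 0] + 0.5)

# The regressor is trained on func(y); predictions go through inverse_func.
log_reg = TransformedTargetRegressor(regressor=LinearRegression(),
                                     func=np.log, inverse_func=np.exp)
log_reg.fit(X_toy, y_toy)

print(log_reg.regressor_.coef_, log_reg.regressor_.intercept_)
y_pred = log_reg.predict(X_toy)  # already back on the original scale of y
```

The fitted inner model is exposed as `regressor_`, so its coefficients describe the relationship on the log scale.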

# FeatureUnion: composite feature spaces

FeatureUnion combines several transformer objects into a new transformer that combines their output. 

* A FeatureUnion takes a list of transformer objects. 
* During fitting, each of these is fit to the data independently. 
* The transformers are applied in parallel, and the feature matrices they output are concatenated side-by-side into a larger matrix.

When you want to apply different transformations to each field of the data, see the related class `sklearn.compose.ColumnTransformer`.

FeatureUnion serves the same purposes as Pipeline - convenience and joint parameter estimation and validation.

FeatureUnion and Pipeline can be combined to create complex models.

(A FeatureUnion has no way of checking whether two transformers might produce identical features. It only produces a union when the feature sets are disjoint, and making sure they are is the caller’s responsibility.)

A FeatureUnion is built using a list of (key, value) pairs, where the key is the name you want to give to a given transformation (an arbitrary string; it only serves as an identifier) and value is an estimator object:

```python
>>> from sklearn.pipeline import FeatureUnion
>>> from sklearn.decomposition import PCA
>>> from sklearn.decomposition import KernelPCA
>>> estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
>>> combined = FeatureUnion(estimators)
>>> combined 
FeatureUnion(n_jobs=None,
             transformer_list=[('linear_pca', PCA(copy=True,...)),
                               ('kernel_pca', KernelPCA(alpha=1.0,...))],
             transformer_weights=None)
```

Like pipelines, feature unions have a shorthand constructor called make_union that does not require explicit naming of the components.
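
A short sketch of `make_union` and of the side-by-side concatenation follows; the iris data and the chosen numbers of components are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import make_union

X_iris, _ = load_iris(return_X_y=True)

# Names are derived automatically from the lowercased class names.
union = make_union(PCA(n_components=2), KernelPCA(n_components=3))
union_features = union.fit_transform(X_iris)

union_names = [name for name, _ in union.transformer_list]
print(union_names)
print(union_features.shape)  # 2 PCA columns followed by 3 KernelPCA columns
```

Each transformer is fit on the same input independently, and the outputs are stacked column-wise in the order the transformers were given.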

## Skip steps in the pipeline or replace
Like Pipeline, individual steps may be replaced using set_params, and ignored by setting to 'drop':

```python
>>> combined.set_params(kernel_pca='drop')
FeatureUnion(n_jobs=None,
             transformer_list=[('linear_pca', PCA(copy=True,...)),
                               ('kernel_pca', 'drop')],
             transformer_weights=None)
```

Examples:

* [Concatenating multiple feature extraction methods](https://scikit-learn.org/stable/auto_examples/compose/plot_feature_union.html#sphx-glr-auto-examples-compose-plot-feature-union-py)

# ColumnTransformer for heterogeneous data

Many datasets contain features of different types, say text, floats, and dates, where each type of feature requires separate preprocessing or feature extraction steps. Often it is easiest to preprocess data before applying scikit-learn methods, for example using pandas. Processing your data before passing it to scikit-learn might be problematic for one of the following reasons:

* Incorporating statistics from test data into the preprocessors makes cross-validation scores unreliable (known as data leakage), for example in the case of scalers or imputing missing values.
* You may want to include the parameters of the preprocessors in a parameter search.

The ColumnTransformer helps perform different transformations for different columns of the data, within a Pipeline that is safe from data leakage and that can be parametrized. [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer) works on arrays, sparse matrices, and pandas DataFrames.

For more background, see [here](https://scikit-learn.org/stable/modules/compose.html#featureunion-composite-feature-spaces).

To each column, a different transformation can be applied, such as preprocessing or a specific feature extraction method:

In [96]:
import pandas as pd
X = pd.DataFrame(
    {'city': ['London', 'London', 'Paris', 'Sallisaw'],
     'title': ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     'expert_rating': [5, 3, 4, 5],
     'user_rating': [4, 5, 4, 3]})


In [97]:
X

Unnamed: 0,city,title,expert_rating,user_rating
0,London,His Last Bow,5,4
1,London,How Watson Learned the Trick,3,5
2,Paris,A Moveable Feast,4,4
3,Sallisaw,The Grapes of Wrath,5,3


For this data, we might want to encode the 'city' column as a categorical variable using preprocessing.OneHotEncoder but apply a feature_extraction.text.CountVectorizer to the 'title' column. As we might use multiple feature extraction methods on the same column, we give each transformer a unique name, say `city_category` and `title_bow`. By default, the remaining rating columns are ignored (remainder='drop'):

In [98]:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

column_trans = ColumnTransformer(
    [('city_category', OneHotEncoder(dtype='int'),['city']),
     ('title_bow', CountVectorizer(), 'title')],
    remainder='drop')

print(f"column_trans.fit(X) :{column_trans.fit(X)}\n")
print(f"column_trans.get_feature_names() :{list(column_trans.get_feature_names())}\n")
print(f"column_trans.transform(X).toarray() :{column_trans.transform(X).toarray()}\n")

column_trans.fit(X) :ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
         transformer_weights=None,
         transformers=[('city_category', OneHotEncoder(categorical_features=None, categories=None, dtype='int',
       handle_unknown='error', n_values=None, sparse=True), ['city']), ('title_bow', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encod...accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None), 'title')])

column_trans.get_feature_names() :['city_category__x0_London', 'city_category__x0_Paris', 'city_category__x0_Sallisaw', 'title_bow__bow', 'title_bow__feast', 'title_bow__grapes', 'title_bow__his', 'title_bow__how', 'title_bow__last', 'title_bow__learned', 'title_bow__moveable', 'title_bow__of', 'title_bow__the', 'title_bow__trick', 'title_bow__watson', 'title_bow__wrath']

column_trans.transform(X).toarray() :[[1 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0]
 [1 0 0 0 

In the above example, 

* the `CountVectorizer` expects a 1D array as input and therefore the column was specified as a string (`'title'`).
* However, `preprocessing.OneHotEncoder`, like most other transformers, expects 2D data, so in that case you need to specify the column as a list of strings (`['city']`).

## column selection depends on the input type
Apart from a scalar or a single item list, the column selection can be specified as a list of multiple items, an integer array, a slice, or a boolean mask. 

* Strings can reference columns if the input is a DataFrame, 
* integers are always interpreted as the positional columns.
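
The selection styles listed above can be compared directly. The sketch below uses a small assumed DataFrame (`ratings`, reusing the rating columns from the earlier example) and checks that names, positions, a slice, and a boolean mask pick the same two columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

ratings = pd.DataFrame({
    'city': ['London', 'London', 'Paris', 'Sallisaw'],
    'expert_rating': [5, 3, 4, 5],
    'user_rating': [4, 5, 4, 3],
})

selectors = {
    'names': ['expert_rating', 'user_rating'],  # strings: DataFrame input only
    'positions': [1, 2],                        # integers: positional columns
    'slice': slice(1, 3),                       # positional slice
    'mask': np.array([False, True, True]),      # boolean mask over the columns
}

scaled = {
    label: ColumnTransformer([('scale', MinMaxScaler(), cols)]).fit_transform(ratings)
    for label, cols in selectors.items()
}

for label, values in scaled.items():
    print(label, values.shape)
```

Because unselected columns are dropped by default, each variant returns only the two scaled rating columns.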

## The `remainder` parameter
We can keep the remaining rating columns by setting remainder='passthrough'. The values are appended to the end of the transformation:

In [99]:
column_trans = ColumnTransformer(
    [('city_category', OneHotEncoder(dtype='int'),['city']),
     ('title_bow', CountVectorizer(), 'title')],
    remainder='passthrough')

column_trans.fit_transform(X)

array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 5, 4],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 3, 5],
       [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 4, 4],
       [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 5, 3]],
      dtype=int64)

The `remainder` parameter can be set to an estimator to transform the remaining rating columns. The transformed values are appended to the end of the transformation:

In [100]:
from sklearn.preprocessing import MinMaxScaler
column_trans = ColumnTransformer(
    [('city_category', OneHotEncoder(), ['city']),
     ('title_bow', CountVectorizer(), 'title')],
    remainder=MinMaxScaler())

column_trans.fit_transform(X)[:, -2:]





array([[1. , 0.5],
       [0. , 1. ],
       [0.5, 0.5],
       [1. , 0. ]])

## make_column_transformer
The `make_column_transformer` function is available to more easily create a ColumnTransformer object. Specifically, the names will be given automatically. The equivalent for the above example would be:

In [101]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import make_column_transformer
column_trans = make_column_transformer(
    (OneHotEncoder(), ['city']),
    (CountVectorizer(), 'title'),
    remainder=MinMaxScaler())
column_trans

ColumnTransformer(n_jobs=None,
         remainder=MinMaxScaler(copy=True, feature_range=(0, 1)),
         sparse_threshold=0.3, transformer_weights=None,
         transformers=[('onehotencoder', OneHotEncoder(categorical_features=None, categories=None,
       dtype=<class 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=True), ['city']), ('countvectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dty...accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None), 'title')])


# Titanic Case Study: Column Transformer with Mixed Types


This example illustrates how to apply different preprocessing and
feature extraction pipelines to different subsets of features,
using `sklearn.compose.ColumnTransformer`.
This is particularly handy for the case of datasets that contain
heterogeneous data types, since we may want to scale the
numeric features and one-hot encode the categorical ones.

In this example, the numeric data is standard-scaled after
median imputation, while the categorical data is one-hot
encoded after imputing missing values with a new category
(`'missing'`).

Finally, the preprocessing pipeline is integrated in a
full prediction pipeline using `sklearn.pipeline.Pipeline`,
together with a simple classification model.



In [102]:
from __future__ import print_function

import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

np.random.seed(0)

# Read data from Titanic dataset.
titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)


# Split the provided data into training, validation, and test sets.
# (The Kaggle evaluation test set has no labels, so we hold out our own.)
X = data.drop('survived', axis=1)
y = data['survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.15, random_state=42)
# NOTE: this third split overwrites X_test/y_test from the first split, so the
# original 20% hold-out is discarded; the shapes printed below reflect that.
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.15, random_state=42)
#X_kaggle_test= datasets["test"][features]
# y_test = datasets["application_test"]['TARGET']   #why no  TARGET?!! (hint: kaggle competition)
print(f"X train           shape: {X_train.shape}")
print(f"X validation      shape: {X_valid.shape}")
print(f"X test            shape: {X_test.shape}")
#print(f"X X_kaggle_test   shape: {X_kaggle_test.shape}")
X_train.head()

X train           shape: (755, 13)
X validation      shape: (158, 13)
X test            shape: (134, 13)


Unnamed: 0,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
213,1,"Newell, Miss. Madeleine",female,31.0,1,0,35273,113.275,D36,C,6,,"Lexington, MA"
754,3,"Davies, Mr. John Samuel",male,21.0,2,0,A/4 48871,24.15,,S,,,"West Bromwich, England Pontiac, MI"
912,3,"Karaic, Mr. Milan",male,30.0,0,0,349246,7.8958,,S,,,
1025,3,"Moor, Master. Meier",male,6.0,0,1,392096,12.475,E121,S,14,,
170,1,"Ismay, Mr. Joseph Bruce",male,49.0,0,0,112058,0.0,B52 B54 B56,S,C,,Liverpool


## Baseline pipeline

In [103]:
# We will train our classifier with the following features:
# Numeric Features:
# - age: float.
# - fare: float.
# - parch: int (number of parents/children aboard).
# - sibsp: int (number of siblings/spouses aboard).
# Categorical Features:
# - embarked: categories encoded as strings {'C', 'S', 'Q'}.
# - sex: categories encoded as strings {'female', 'male'}.
# - pclass: ordinal integers {1, 2, 3}.

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_features = ['age', 'fare', 'parch', 'sibsp',]
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
model = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])


model.fit(X_train, y_train)
print("model score: %.3f" % model.score(X_test, y_test))

model score: 0.769


In [104]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss

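# Reset the experiment log if it is left over from an earlier run
# (a bare `del expLog` raises NameError on a fresh kernel).
if 'expLog' in globals():
    del expLog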

exp_name = "baseline"
try:
    expLog
except NameError:
   expLog = pd.DataFrame(columns=["exp_name", 
                                   "Train Acc", 
                                   "Valid Acc",
                                   "Test  Acc",
                                   "Train LogLoss", 
                                   "Valid LogLoss",
                                   "Test  LogLoss"
                                  ])

expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
               [accuracy_score(y_train, model.predict(X_train)), 
                accuracy_score(y_valid, model.predict(X_valid)),
                accuracy_score(y_test, model.predict(X_test)),
                log_loss(y_train, model.predict_proba(X_train)),
                log_loss(y_valid, model.predict_proba(X_valid)),
                log_loss(y_test, model.predict_proba(X_test))],
    4)) 
expLog

Unnamed: 0,exp_name,Train Acc,Valid Acc,Test Acc,Train LogLoss,Valid LogLoss,Test LogLoss
0,baseline,0.7947,0.7911,0.7687,0.4647,0.4344,0.4783


### Log loss evaluation metric for a 3-class problem

Here we explain log loss using a three-class example. Submissions are evaluated using multi-class logarithmic loss. Each observation has one true class, and for each observation you submit a predicted probability for every class. The formula is then:


$$\text{log loss} = -\frac{1}{N}\sum_{i=1}^N\sum_{j=1}^M y_{ij}\log(p_{ij}),$$

where $N$ is the number of observations in the test set, $M$ is the number of class labels (3 classes here), $\log$ is the natural logarithm, $y_{ij}$ is 1 if observation $i$ belongs to class $j$ and 0 otherwise, and $p_{ij}$ is the predicted probability that observation $i$ belongs to class $j$. A log loss of zero is best but is rarely achieved.

The submitted probabilities for a given observation are not required to sum to one because they are rescaled prior to being scored (each row is divided by the row sum). In order to avoid the extremes of the log function, predicted probabilities are replaced with $\max(\min(p, 1-10^{-15}), 10^{-15})$.
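
To make the formula concrete, the sketch below evaluates it by hand for four made-up observations and compares the result with `sklearn.metrics.log_loss`. The labels and probabilities are illustrative assumptions, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical true classes and predicted probabilities (rows sum to 1).
y_true = np.array([0, 2, 1, 2])
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
    [1/3, 1/3, 1/3],
])

# y_ij as a one-hot matrix: row i has a 1 in the column of the true class.
one_hot = np.eye(3)[y_true]

# Clip probabilities away from 0 and 1, then apply the double sum.
eps = 1e-15
clipped = np.clip(probs, eps, 1 - eps)
manual_log_loss = -np.mean(np.sum(one_hot * np.log(clipped), axis=1))

library_log_loss = log_loss(y_true, probs, labels=[0, 1, 2])
print(manual_log_loss, library_log_loss)
```

Only the probability assigned to the true class contributes to each row's term, because every other $y_{ij}$ is zero.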

Let's try to interpret the logloss for a random model (i.e., predict a probability of $\frac{1}{3}$ for each of the three classes for all test cases):

```python

print(f"baseline log loss is {-np.log(1/3):0.3f}")   # uniform guess over 3 classes: -ln(1/3) = 1.0986122886681098
print(f"baseline class prob  {np.exp(-1.0986122886681098):0.3f}")   # exp(-log loss) recovers the predicted probability of the true class, 1/3
print(f"baseline log loss is {-np.log(1):0.3f}")   # predicted probability of true class is 1, then zero loss!

baseline log loss is 1.099
baseline class prob  0.333
baseline log loss is -0.000
```

We only look at the predicted probability of the true class. It should be 1 but it is $\frac{1}{3}$.

### logloss discussion

Suppose the log loss of a baseline submission is $0.83$. Because $e^{-0.83} \approx 0.44$, this corresponds to predicting, on average (in the geometric-mean sense), a probability of about 0.44 for the true class; it should be 1 or close to 1. So there is plenty of room for improvement.

```python

np.exp(-0.83) # about 0.44: the typical probability assigned to the true class; it should be 1 or close to 1.

```

In [105]:

print(f"baseline log loss is {-np.log(1/3):0.3f}")   # uniform guess over 3 classes: -ln(1/3) = 1.0986122886681098
print(f"baseline class prob  {np.exp(-1.0986122886681098):0.3f}")   # exp(-log loss) recovers the predicted probability of the true class, 1/3
print(f"baseline log loss is {-np.log(1):0.3f}")   # predicted probability of true class is 1, then zero loss!


baseline log loss is 1.099
baseline class prob  0.333
baseline log loss is -0.000


In [106]:
np.exp(-0.83) # about 0.44: the typical probability assigned to the true class; it should be 1 or close to 1.

0.4360492863215356

##  prediction pipeline in a grid search

Grid search can also be performed on the different preprocessing steps
defined in the `ColumnTransformer` object, together with the classifier's
hyperparameters as part of the `Pipeline`.
We will search for both the imputer strategy of the numeric preprocessing
and the regularization parameter of the logistic regression using
`sklearn.model_selection.GridSearchCV`.



### Grid search with multi-level pipelines
Grid search is all about finding the hyperparameter values that work best for a given model and data set. To see the list of everything you could tune, call `get_params().keys()` on your pipeline.

In [107]:
list(model.get_params().keys())

['memory',
 'steps',
 'preprocessor',
 'classifier',
 'preprocessor__n_jobs',
 'preprocessor__remainder',
 'preprocessor__sparse_threshold',
 'preprocessor__transformer_weights',
 'preprocessor__transformers',
 'preprocessor__num',
 'preprocessor__cat',
 'preprocessor__num__memory',
 'preprocessor__num__steps',
 'preprocessor__num__imputer',
 'preprocessor__num__scaler',
 'preprocessor__num__imputer__copy',
 'preprocessor__num__imputer__fill_value',
 'preprocessor__num__imputer__missing_values',
 'preprocessor__num__imputer__strategy',
 'preprocessor__num__imputer__verbose',
 'preprocessor__num__scaler__copy',
 'preprocessor__num__scaler__with_mean',
 'preprocessor__num__scaler__with_std',
 'preprocessor__cat__memory',
 'preprocessor__cat__steps',
 'preprocessor__cat__imputer',
 'preprocessor__cat__onehot',
 'preprocessor__cat__imputer__copy',
 'preprocessor__cat__imputer__fill_value',
 'preprocessor__cat__imputer__missing_values',
 'preprocessor__cat__imputer__strategy',
 'preprocesso

In [108]:
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10, 100],
}

# The `iid` argument was removed in scikit-learn 0.24; omitting it matches the old iid=False behavior.
grid_search = GridSearchCV(model, param_grid, cv=10)
grid_search.fit(X_train, y_train)

print(("best logistic regression from grid search: %.3f"
       % grid_search.score(X_test, y_test)))
model = grid_search

best logistic regression from grid search: 0.769
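
Beyond the single test score, the fitted search object records the winning combination and the per-candidate results. The self-contained sketch below shows how to read them; the synthetic `toy` data, its column names, and the `toy_`-prefixed variables are assumptions made up for the example, not the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (assumption) with missing numeric values.
rng = np.random.RandomState(0)
n = 120
toy = pd.DataFrame({
    'age': rng.normal(30, 10, n),
    'fare': rng.exponential(30, n),
    'sex': rng.choice(['female', 'male'], n),
})
toy.loc[rng.choice(n, 15, replace=False), 'age'] = np.nan
flip = rng.rand(n) < 0.2  # label noise
toy_y = ((toy['sex'] == 'female').to_numpy() ^ flip).astype(int)

toy_model = Pipeline([
    ('preprocessor', ColumnTransformer([
        ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                          ('scaler', StandardScaler())]), ['age', 'fare']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['sex']),
    ])),
    ('classifier', LogisticRegression()),
])

# Nested names: <pipeline step>__<transformer name>__<sub-step>__<parameter>.
toy_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10],
}
toy_search = GridSearchCV(toy_model, toy_grid, cv=5)
toy_search.fit(toy, toy_y)

print(toy_search.best_params_)
toy_summary = pd.DataFrame(toy_search.cv_results_)[
    ['param_classifier__C', 'param_preprocessor__num__imputer__strategy',
     'mean_test_score', 'rank_test_score']]
print(toy_summary.sort_values('rank_test_score'))
```

Looking at `cv_results_` rather than only `best_params_` shows whether the candidates actually differ in score, or whether the "best" one wins by a negligible margin.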


In [109]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss

exp_name = "baseline_gridsearch"
try:
    expLog
except NameError:
   expLog = pd.DataFrame(columns=["exp_name", 
                                   "Train Acc", 
                                   "Valid Acc",
                                   "Test  Acc",
                                   "Train LogLoss", 
                                   "Valid LogLoss",
                                   "Test  LogLoss"
                                  ])

expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
               [accuracy_score(y_train, model.predict(X_train)), 
                accuracy_score(y_valid, model.predict(X_valid)),
                accuracy_score(y_test, model.predict(X_test)),
                log_loss(y_train, model.predict_proba(X_train)),
                log_loss(y_valid, model.predict_proba(X_valid)),
                log_loss(y_test, model.predict_proba(X_test))],
    4)) 
expLog

Unnamed: 0,exp_name,Train Acc,Valid Acc,Test Acc,Train LogLoss,Valid LogLoss,Test LogLoss
0,baseline,0.7947,0.7911,0.7687,0.4647,0.4344,0.4783
1,baseline_gridsearch,0.796,0.7911,0.7687,0.4709,0.4381,0.4774


## Title-based transformer from Name feature

### Write the title transformer class and debug
Let's write the title transformer class step by step and unit test it.

* build a TitleAdder transform
* Test the TitleAdder transform using a simple one-step pipeline
* Test the TitleAdder transform using a two-step pipeline

In [192]:
X_train.head()  #just remind ourselves of the data

Unnamed: 0,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
213,1,"Newell, Miss. Madeleine",female,31.0,1,0,35273,113.275,D36,C,6,,"Lexington, MA"
754,3,"Davies, Mr. John Samuel",male,21.0,2,0,A/4 48871,24.15,,S,,,"West Bromwich, England Pontiac, MI"
912,3,"Karaic, Mr. Milan",male,30.0,0,0,349246,7.8958,,S,,,
1025,3,"Moor, Master. Meier",male,6.0,0,1,392096,12.475,E121,S,14,,
170,1,"Ismay, Mr. Joseph Bruce",male,49.0,0,0,112058,0.0,B52 B54 B56,S,C,,Liverpool


In [175]:
from sklearn.base import BaseEstimator, TransformerMixin
import re

class TitleAdder(BaseEstimator, TransformerMixin):
    """Derive a passenger's title (Mr, Mrs, Miss, Master, Other) from the 'name' column."""
    def __init__(self, features=None): # no *args or **kwargs
        self.features = features
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        df = pd.DataFrame(X, columns=self.features)
        # The title is the capitalized word followed by a period, e.g. "Miss."
        # (a raw string avoids an invalid-escape warning for "\.")
        df['Title'] = df['name'].apply(lambda x: re.search(r' ([A-Z][a-z]+)\.', x).group(1))
        # Collapse the raw titles into 5 categories (Mr, Mrs, Miss, Master, Other)
        df['Title'] = df['Title'].replace({'Mlle':'Miss', 'Mme':'Mrs', 'Ms':'Miss'})
        df['Title'] = df['Title'].replace(['Don', 'Dona', 'Rev', 'Dr','Major', 'Lady', 'Sir', 
                                           'Col', 'Capt', 'Countess', 'Jonkheer'],'Other')
        # Drop the raw text column so that only the Title column remains
        df.drop('name', axis=1, inplace=True)
        return np.array(df.values)  # return a NumPy array to observe the pipeline protocol
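
The title extraction and grouping can also be tried in isolation. The sketch below is a vectorized variant using `Series.str.extract`, applied to a handful of sample name strings chosen for illustration; unlike the class above, it yields a missing value rather than raising an error when a name contains no title.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" format (illustrative).
sample_names = pd.Series([
    "Newell, Miss. Madeleine",
    "Davies, Mr. John Samuel",
    "Moor, Master. Meier",
    "Aubart, Mme. Leontine Pauline",
    "Brewe, Dr. Arthur Jackson",
])

# Same pattern as in TitleAdder: a capitalized word followed by a period.
sample_titles = sample_names.str.extract(r' ([A-Z][a-z]+)\.', expand=False)

# Map French and abbreviated forms onto the common categories ...
sample_titles = sample_titles.replace({'Mlle': 'Miss', 'Mme': 'Mrs', 'Ms': 'Miss'})
# ... and pool the rare titles into a single 'Other' category.
rare_titles = ['Don', 'Dona', 'Rev', 'Dr', 'Major', 'Lady', 'Sir',
               'Col', 'Capt', 'Countess', 'Jonkheer']
sample_titles = sample_titles.replace(rare_titles, 'Other')

print(sample_titles.tolist())
```

Pooling rare titles keeps the later one-hot encoding to five columns instead of one column per rarely seen title.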

In [181]:
def test_driver_title_simple_one_step_pipeline():
    print(f"X_train.shape: {X_train.shape}\n")
    print(f"X_train['name'][0:5]: \n{X_train['name'][0:5]}")
    test_pipeline = make_pipeline(TitleAdder(['name']))
    return(test_pipeline.fit_transform(X_train))
print(f"Test driver: \n{test_driver_title_simple_one_step_pipeline()[0:5,:]}")


X_train.shape: (755, 13)

X_train['name'][0:5]: 
213     Newell, Miss. Madeleine
754     Davies, Mr. John Samuel
912           Karaic, Mr. Milan
1025        Moor, Master. Meier
170     Ismay, Mr. Joseph Bruce
Name: name, dtype: object
Test driver: 
[['Miss']
 ['Mr']
 ['Mr']
 ['Master']
 ['Mr']]


In [189]:
def test_driver_title_simple_TWO_step_pipeline():
    test_pipeline = make_pipeline(TitleAdder(['name']), OneHotEncoder(handle_unknown='ignore'))
    return(test_pipeline.fit_transform(X_train))

test = test_driver_title_simple_one_step_pipeline()  # titles before one-hot encoding
print(f"Test driver: \n{test_driver_title_simple_TWO_step_pipeline()[0:5,:]}")
print(f"\n OHE for title has 5 unique values:  {np.unique(test)}\n")
print(f"OHE for first 5 examples \nMiss, Mr, Mr, Master, Mr\n{test_driver_title_simple_TWO_step_pipeline()[0:5,:]}")
test.shape
np.unique(test)


X_train.shape: (755, 13)

X_train['name'][0:5]: 
213     Newell, Miss. Madeleine
754     Davies, Mr. John Samuel
912           Karaic, Mr. Milan
1025        Moor, Master. Meier
170     Ismay, Mr. Joseph Bruce
Name: name, dtype: object
Test driver: 
  (0, 1)	1.0
  (1, 2)	1.0
  (2, 2)	1.0
  (3, 0)	1.0
  (4, 2)	1.0

 OHE for title has 5 unique values:  ['Master' 'Miss' 'Mr' 'Mrs' 'Other']

OHE for first 5 examples 
Miss, Mr, Mr, Master, Mr
  (0, 1)	1.0
  (1, 2)	1.0
  (2, 2)	1.0
  (3, 0)	1.0
  (4, 2)	1.0


array(['Master', 'Miss', 'Mr', 'Mrs', 'Other'], dtype=object)

### Hierarchical pipeline: not inline

Here we define the subpipelines first and then incorporate them into the main pipeline. In the next section all subpipelines are **inlined**.

In [112]:
from __future__ import print_function

import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

np.random.seed(0)

# Read data from Titanic dataset.
titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)

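# We will train our classifier with the following features:
# Numeric Features:
# - age: float.
# - fare: float.
# - parch: int (number of parents/children aboard).
# - sibsp: int (number of siblings/spouses aboard).
# Categorical Features:
# - embarked: categories encoded as strings {'C', 'S', 'Q'}.
# - sex: categories encoded as strings {'female', 'male'}.
# - pclass: ordinal integers {1, 2, 3}.
# Text Features:
# - name: string; used only to derive the passenger's title.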

# We create the preprocessing pipelines for both numeric and categorical data.
#
numeric_features = ['age', 'fare', 'parch', 'sibsp',]
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Title transform
title_features = ['name']
title_transformer = Pipeline(steps=[
    ('title_prep', TitleAdder()),
    ('title', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
        ('title', title_transformer, title_features),
    ])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
model = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

model.fit(X_train, y_train)
print("model score: %.3f" % model.score(X_test, y_test))

model score: 0.791
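Because every step in a hierarchical pipeline has a name, nested parameters can be addressed by joining the names with double underscores, which is what makes a joint grid search possible. The following is a minimal sketch on a hypothetical toy frame (the `toy` data, its labels and the parameter grid are assumptions for illustration, not the Titanic setup used above).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the Titanic frame (no download needed).
rng = np.random.RandomState(0)
n = 60
toy = pd.DataFrame({
    "age": rng.uniform(1, 70, n),
    "fare": rng.uniform(5, 100, n),
    "sex": pd.Series(rng.choice(["female", "male"], n), dtype=object),
})
toy.loc[::7, "age"] = np.nan  # a few missing ages for the imputer to fill
toy_y = (toy["sex"] == "female").astype(int).to_numpy()

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, ["age", "fare"]),
    ("cat", categorical_transformer, ["sex"])])
toy_model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(solver="lbfgs"))])

# Parameter names follow the nesting: step__substep__parameter.
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "classifier__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(toy_model, param_grid, cv=3)
search.fit(toy, toy_y)
print(search.best_params_)
```

Calling `toy_model.get_params().keys()` lists every addressable name, which is a convenient way to check the spelling of a nested parameter before searching over it.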


### Hierarchical pipeline inline

Previously we defined the subpipelines first and then incorporated them into the main pipeline. Here all subpipelines are inlined.

In [133]:
from __future__ import print_function

import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

np.random.seed(0)

# Read data from Titanic dataset.
titanic_url = ('https://raw.githubusercontent.com/amueller/'
               'scipy-2017-sklearn/091d371/notebooks/datasets/titanic3.csv')
data = pd.read_csv(titanic_url)

# We will train our classifier with the following features:
# Numeric Features:
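# - age: float.
# - fare: float.
# - parch: integer count of parents/children aboard.
# - sibsp: integer count of siblings/spouses aboard.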
# Categorical Features:
# - embarked: categories encoded as strings {'C', 'S', 'Q'}.
# - sex: categories encoded as strings {'female', 'male'}.
# - pclass: ordinal integers {1, 2, 3}.

# We create the preprocessing pipelines for both numeric and categorical data.
#

# Features for the different transformers
numeric_features = ['age', 'fare', 'parch', 'sibsp',]
categorical_features = ['embarked', 'sex', 'pclass']
title_features = ['name']

preprocessor = ColumnTransformer(
    transformers=[
        ('num',  
         Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())  ]), 
         numeric_features),

        ('cat',  Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
            ('onehot', OneHotEncoder(handle_unknown='ignore'))  ] ),
         categorical_features),

        ('title',  Pipeline(steps=[
            ('title_prep', TitleAdder()),
            ('title', OneHotEncoder(handle_unknown='ignore')) ] ),
        title_features),
    ])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
model = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs'))])

model.fit(X_train, y_train)
print("model score: %.3f" % model.score(X_test, y_test))

model score: 0.791
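Inlining the subpipelines does not hide them: after fitting, each nested piece can still be reached by name through `named_steps` and `named_transformers_`. The sketch below illustrates this on a hypothetical six-row frame (the data and labels are made up for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one missing value in each column.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0, 4.0, 41.0],
    "embarked": pd.Series(["S", "C", np.nan, "Q", "S", "C"], dtype=object),
})
toy_y = np.array([0, 1, 1, 0, 1, 0])

toy_model = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler())]),
         ["age"]),
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
            ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
         ["embarked"]),
    ])),
    ("classifier", LogisticRegression(solver="lbfgs")),
])
toy_model.fit(toy, toy_y)

# Walk down the hierarchy by name to reach the fitted sub-estimators.
fitted_pre = toy_model.named_steps["preprocessor"]
median_age = fitted_pre.named_transformers_["num"].named_steps["imputer"].statistics_
categories = fitted_pre.named_transformers_["cat"].named_steps["onehot"].categories_
print(median_age)
print(categories)
```

Here `median_age` holds the value the numeric imputer learned from the training rows, and `categories` shows that the constant fill value `'missing'` became a category of its own.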


In [134]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss

exp_name = "baseline_title"
try:
    expLog
except NameError:
    expLog = pd.DataFrame(columns=["exp_name",
                                   "Train Acc",
                                   "Valid Acc",
                                   "Test  Acc",
                                   "Train LogLoss",
                                   "Valid LogLoss",
                                   "Test  LogLoss"
                                   ])

expLog.loc[len(expLog)] = [f"{exp_name}"] + list(np.round(
               [accuracy_score(y_train, model.predict(X_train)), 
                accuracy_score(y_valid, model.predict(X_valid)),
                accuracy_score(y_test, model.predict(X_test)),
                log_loss(y_train, model.predict_proba(X_train)),
                log_loss(y_valid, model.predict_proba(X_valid)),
                log_loss(y_test, model.predict_proba(X_test))],
    4)) 
expLog

|   | exp_name | Train Acc | Valid Acc | Test Acc | Train LogLoss | Valid LogLoss | Test LogLoss |
|---|---|---|---|---|---|---|---|
| 0 | baseline | 0.7947 | 0.7911 | 0.7687 | 0.4647 | 0.4344 | 0.4783 |
| 1 | baseline_gridsearch | 0.796 | 0.7911 | 0.7687 | 0.4709 | 0.4381 | 0.4774 |
| 2 | baseline_title | 0.8093 | 0.8228 | 0.791 | 0.4478 | 0.4136 | 0.4549 |
| 3 | baseline_title | 0.8093 | 0.8228 | 0.791 | 0.4478 | 0.4136 | 0.4549 |


In [131]:
# Relative reduction in error: (error_before - error_after) / error_before,
# where error = 1 - accuracy. Row 1 is baseline_gridsearch, row 2 is baseline_title.
reduction_in_valid_error = 100*(expLog["Valid Acc"][2] - expLog["Valid Acc"][1]) /(1 - expLog["Valid Acc"][1])
reduction_in_test_error = 100*(expLog["Test  Acc"][2] - expLog["Test  Acc"][1]) /(1 - expLog["Test  Acc"][1])
print(f"Title feature led to a reduction in validation error of : {np.round(reduction_in_valid_error, 3)}%")
print(f"Title feature led to a reduction in Test       error of : {np.round(reduction_in_test_error, 3)}%")
print("Yay to the Title feature!!!")

Title feature led to a reduction in validation error of : 15.175%
Title feature led to a reduction in Test       error of : 9.641%
Yay to the Title feature!!!
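The size of a "reduction in error" depends on the denominator: it is usually reported against the baseline error rate, and dividing by the new, smaller error rate instead gives a larger figure. As a small worked sketch, the hypothetical helper below (not part of the notebook) uses the baseline convention with the accuracies copied from the experiment log.

```python
def relative_error_reduction(acc_before, acc_after):
    """Percent drop in error rate, measured against the baseline error."""
    err_before = 1.0 - acc_before
    err_after = 1.0 - acc_after
    return 100.0 * (err_before - err_after) / err_before


# Accuracies of baseline_gridsearch (before) and baseline_title (after),
# copied from the experiment log above.
valid_reduction = relative_error_reduction(0.7911, 0.8228)
test_reduction = relative_error_reduction(0.7687, 0.7910)
print(round(valid_reduction, 3), round(test_reduction, 3))
```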



# Case study: 20newsgroups via a Column Transformer pipeline


Datasets can often contain components that require different feature
extraction and processing pipelines.  This scenario might occur when:

1. Your dataset consists of heterogeneous data types (e.g. raster images and
   text captions)
2. Your dataset is stored in a Pandas DataFrame and different columns
   require different processing pipelines.

This example demonstrates how to use
`sklearn.compose.ColumnTransformer` on a dataset containing
different types of features.  We use the 20-newsgroups dataset and compute
standard bag-of-words features for the subject line and body in separate
pipelines as well as ad hoc features on the body. We combine them (with
weights) using a ColumnTransformer and finally train a classifier on the
combined set of features.

The choice of features is not particularly helpful, but serves to illustrate
the technique.

This section is based on the following scikit-learn example:

* [20newsgroups via a Column Transformer pipeline](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer.html#sphx-glr-auto-examples-compose-plot-column-transformer-py)
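To make the weighting concrete before the full example, here is a minimal sketch of what `transformer_weights` does: each transformer's output block is multiplied by its weight before the blocks are concatenated. The toy array, the identity `FunctionTransformer` steps and the weight values are assumptions chosen only to make the scaling visible.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

X = np.array([[1.0, 10.0],
              [2.0, 20.0]])

weighted = ColumnTransformer(
    [
        # FunctionTransformer() with no function passes its column through unchanged.
        ("first", FunctionTransformer(), [0]),
        ("second", FunctionTransformer(), [1]),
    ],
    # Each output block is multiplied by its weight before concatenation.
    transformer_weights={"first": 2.0, "second": 0.5},
)
combined = weighted.fit_transform(X)
print(combined)
```

The first column comes out doubled and the second halved, which is how the subject, body bag-of-words and body statistics blocks are balanced against each other in the case study.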




In [135]:
# Author: Matt Terry <matt.terry@gmail.com>
#
# License: BSD 3 clause
from __future__ import print_function

import numpy as np

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import fetch_20newsgroups
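try:  # newer scikit-learn releases keep these helpers in a private module
    from sklearn.datasets._twenty_newsgroups import strip_newsgroup_footer
    from sklearn.datasets._twenty_newsgroups import strip_newsgroup_quoting
except ImportError:  # older releases, such as the one that produced the output below
    from sklearn.datasets.twenty_newsgroups import strip_newsgroup_footer
    from sklearn.datasets.twenty_newsgroups import strip_newsgroup_quoting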
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.svm import LinearSVC


class TextStats(BaseEstimator, TransformerMixin):
    """Extract features from each document for DictVectorizer"""

    def fit(self, x, y=None):
        return self

    def transform(self, posts):
        return [{'length': len(text),
                 'num_sentences': text.count('.')}
                for text in posts]


class SubjectBodyExtractor(BaseEstimator, TransformerMixin):
    """Extract the subject & body from a usenet post in a single pass.

    Takes a sequence of strings and produces a dict of sequences.  Keys are
    `subject` and `body`.
    """
    def fit(self, x, y=None):
        return self

    def transform(self, posts):
        # construct object dtype array with two columns
        # first column = 'subject' and second column = 'body'
        features = np.empty(shape=(len(posts), 2), dtype=object)
        for i, text in enumerate(posts):
            headers, _, bod = text.partition('\n\n')
            bod = strip_newsgroup_footer(bod)
            bod = strip_newsgroup_quoting(bod)
            features[i, 1] = bod

            prefix = 'Subject:'
            sub = ''
            for line in headers.split('\n'):
                if line.startswith(prefix):
                    sub = line[len(prefix):]
                    break
            features[i, 0] = sub

        return features


pipeline = Pipeline([
    # Extract the subject & body
    ('subjectbody', SubjectBodyExtractor()),

    # Use ColumnTransformer to combine the features from subject and body
    ('union', ColumnTransformer(
        [
            # Pulling features from the post's subject line (first column)
            ('subject', TfidfVectorizer(min_df=50), 0),

            # Pipeline for standard bag-of-words model for body (second column)
            ('body_bow', Pipeline([
                ('tfidf', TfidfVectorizer()),
                ('best', TruncatedSVD(n_components=50)),
            ]), 1),

            # Pipeline for pulling ad hoc features from post's body
            ('body_stats', Pipeline([
                ('stats', TextStats()),  # returns a list of dicts
                ('vect', DictVectorizer()),  # list of dicts -> feature matrix
            ]), 1),
        ],

        # weight components in ColumnTransformer
        transformer_weights={
            'subject': 0.8,
            'body_bow': 0.5,
            'body_stats': 1.0,
        }
    )),

    # Use a linear support vector classifier (LinearSVC) on the combined features
    ('svc', LinearSVC()),
])

# limit the list of categories to make running this example faster.
categories = ['alt.atheism', 'talk.religion.misc']
train = fetch_20newsgroups(random_state=1,
                           subset='train',
                           categories=categories,
                           )
test = fetch_20newsgroups(random_state=1,
                          subset='test',
                          categories=categories,
                          )

pipeline.fit(train.data, train.target)
y = pipeline.predict(test.data)
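# Note: classification_report expects (y_true, y_pred). The predictions are
# passed first here, so the "support" column counts predicted labels and
# precision and recall are swapped relative to the usual reading.
print(classification_report(y, test.target))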

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


              precision    recall  f1-score   support

           0       0.66      0.82      0.73       259
           1       0.81      0.66      0.73       311

   micro avg       0.73      0.73      0.73       570
   macro avg       0.74      0.74      0.73       570
weighted avg       0.75      0.73      0.73       570
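The `body_stats` branch is the least familiar part of the pipeline above, so here is a minimal sketch of it in isolation: the header/body split with `str.partition`, followed by `TextStats` and `DictVectorizer`. The two input strings are hypothetical, and `sparse=False` is used only so that the resulting matrix prints as a plain array.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline


class TextStats(BaseEstimator, TransformerMixin):
    """Same ad hoc features as above: character count and number of periods."""

    def fit(self, x, y=None):
        return self

    def transform(self, posts):
        return [{"length": len(text), "num_sentences": text.count(".")}
                for text in posts]


# Hypothetical post: headers and body are separated by the first blank line.
raw_post = "From: someone\nSubject: Hello\n\nHello world. Bye."
headers, _, body = raw_post.partition("\n\n")

stats = Pipeline([
    ("stats", TextStats()),                  # returns a list of dicts
    ("vect", DictVectorizer(sparse=False)),  # list of dicts -> feature matrix
])
matrix = stats.fit_transform([body, "No period"])
feature_names = stats.named_steps["vect"].feature_names_
print(feature_names)
print(matrix)
```

`DictVectorizer` assigns one column per dictionary key, so each row of `matrix` holds the length and period count of one document.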



