# BLU02 - Learning Notebook - Part 3 of 3 - Advanced pipelines

In this notebook, we return to the pipelines briefly presented in SLU16 and take a deeper dive.

In [1]:
import matplotlib.pyplot as plt

import pandas as pd
import os

from category_encoders.ordinal import OrdinalEncoder

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.cluster import KMeans

## 1. Why pipelines?
Where we are at this point:
* we are able to perform transformations on data, setting up transformation pipelines using nothing but chained pandas operations
* we can combine different dataframes to extract information distributed in different tables.

Remember our workflow diagram? Let's look at it again.

![data_transformation_workflow](./media/data_processing_workflow.png)

After transformations, we want to feed our data to models. A standard workflow starts with splitting the data into a train and test set, so let's do that.

We will use the `works` dataset.

In [2]:
works = pd.read_csv('./data/works.csv')
works.head()

Unnamed: 0,GUID,ProgramID,WorkID,MovementID,ComposerName,WorkTitle,Movement,ConductorName,Interval,isInterval
0,38e072a7-8fc9-4f9a-8eac-3957905c0002,3853,52446,,"Beethoven, Ludwig van","SYMPHONY NO. 5 IN C MINOR, OP.67",,"Hill, Ureli Corelli",,False
1,c7b2b95c-5e0b-431c-a340-5b37fc860b34,5178,52437,,"Beethoven, Ludwig van","SYMPHONY NO. 3 IN E FLAT MAJOR, OP. 55 (EROICA)",,"Hill, Ureli Corelli",,False
2,894e1a52-1ae5-4fa7-aec0-b99997555a37,10785,52364,1.0,"Beethoven, Ludwig van","EGMONT, OP.84",Overture,"Hill, Ureli Corelli",,False
3,34ec2c2b-3297-4716-9831-b538310462b7,5887,52434,,"Beethoven, Ludwig van","SYMPHONY NO. 2 IN D MAJOR, OP.36",,"Boucher, Alfred",,False
4,610a4acc-94e4-4cd6-bdc1-8ad020edc7e9,305,52453,,"Beethoven, Ludwig van","SYMPHONY NO. 7 IN A MAJOR, OP.92",,"Hill, Ureli Corelli",,False


In [3]:
X_train, X_test = train_test_split(works)
print(f'Train dataset: {X_train.shape[0]} rows \nTest dataset: {X_test.shape[0]} rows')

Train dataset: 61932 rows 
Test dataset: 20644 rows


With pandas transformations, we merely map the original dataframe to a new one, without keeping anything learned from the data. Sometimes this isn't enough.

Let's start with an example: encoding categorical variables. Remember: we need to perform the same transformations on train and test data (and whatever data comes next).

Below, we define three transformations as we learned in the first notebook in this BLU. The first removes the rows that are intervals, not real works, and drops the `Interval` and `isInterval` columns. The second drops all ID columns and removes duplicates. The third encodes the categorical variable `ComposerName`. Then we encapsulate all three transformations into the `transform_data` function.

In [4]:
def remove_intervals(df):
    """
        This function removes the intervals from the dataframe
    """
    df_ = df.copy()
    mask = df_['Interval'].isnull()
    df_ = (df_.loc[mask, :]
            .drop(columns=['Interval','isInterval']))
    return df_

def drop_id_fields(df):
    columns = ['GUID', 'ProgramID', 'WorkID', 'MovementID']
    df_ = df.copy()
    df_ = df_.drop(columns=columns).drop_duplicates()
    return df_
        
def label_encoder(df, column):
    """
        This function encodes a given categorical column
    """
    df_ = df.copy()
    df_[column + 'Encoded'] = df[column].astype('category').cat.codes
    return df_

def transform_data(df):
    """
        This function transforms the dataframe by removing the intervals and
        encoding the categorical columns
    """
    df_ = df.copy()
    df_ = (df_.pipe(remove_intervals)
            .pipe(drop_id_fields)
            .pipe(label_encoder, 'ComposerName'))
    return df_

Now we apply the transformations to both the train and test data. We are encoding the categorical variable `ComposerName`.

In [5]:
X_train_ = transform_data(X_train)
X_test_ = transform_data(X_test)

But wait, are we doing this right? We should apply the same transformations to train and test data and we don't seem to be doing that here.

Let's check the `ComposerName` encoding for a random composer:

In [6]:
X_train_.loc[X_train_.ComposerName=='Berg,  Alban',['ComposerName', 'ComposerNameEncoded']].head(1)

Unnamed: 0,ComposerName,ComposerNameEncoded
22071,"Berg, Alban",173


In [7]:
X_test_.loc[X_test_.ComposerName=='Berg,  Alban',['ComposerName', 'ComposerNameEncoded']].head(1)

Unnamed: 0,ComposerName,ComposerNameEncoded
66046,"Berg, Alban",92


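Indeed, `ComposerName` is encoded differently in the train and test datasets. This happens because `cat.codes` numbers the categories found in each dataframe independently, so the same composer can receive a different code in each set.

This problem is significant: a model trained on one encoding and fed another will make wrong predictions.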

There are other cases in which this kind of problem arises, for instance when replacing missing values with the mean. You are supposed to compute the mean on the training set and use it to transform both the train and test sets.
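As a minimal sketch of that rule, assuming a toy numeric column (the `duration` values below are invented for illustration), `SimpleImputer` learns the mean in `fit` and reuses it for both sets:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data, invented for illustration only.
train = pd.DataFrame({'duration': [10.0, 20.0, np.nan, 30.0]})
test = pd.DataFrame({'duration': [np.nan, 100.0]})

# fit() computes the mean on the training set only.
imputer = SimpleImputer(strategy='mean').fit(train)
train_mean = imputer.statistics_[0]

# transform() fills the gaps in both sets with the training mean.
train_filled = imputer.transform(train)
test_filled = imputer.transform(test)
```

The missing value in `test` is filled with the training mean, not with the mean of the test set.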

The way we learned to apply transformations defined as functions and chained methods does not work in the context of our workflow.

How do we solve this? We need **sklearn pipelines** instead of pandas pipelines. We also need to define our transformations as **sklearn transformers**, which expose `.fit()` and `.transform()` methods. That way, we can preserve the state of a transformation learned on the training data and reuse it to transform both the train and test datasets.
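To make the idea of a stored state concrete, here is a small sketch. It uses scikit-learn's own `OrdinalEncoder` (not the `category_encoders` one imported in this notebook) on invented composer names, so treat it as an illustration rather than the notebook's method:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder as SklearnOrdinalEncoder

# Toy data, invented for illustration only.
train = np.array([['Bach'], ['Berg'], ['Liszt']], dtype=object)
test = np.array([['Berg'], ['Liszt'], ['Zappa']], dtype=object)

# Unknown categories seen only at transform time are mapped to -1.
encoder = SklearnOrdinalEncoder(handle_unknown='use_encoded_value',
                                unknown_value=-1)
encoder.fit(train)  # the mapping is learned here, once

train_codes = encoder.transform(train).ravel()
test_codes = encoder.transform(test).ravel()
```

Because the mapping is learned once in `fit`, `'Berg'` gets the same code in both sets.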

## 2. Meet the sklearn-like transformers

There are three fundamental verbs in scikit-learn and sklearn-like libraries:
* `.fit()`
* `.transform()`
* `.predict()`.

You are already familiar with `.fit()` and `.predict()` from the predictors you used in S01 and `.fit()` and `.transform()` from the transformations like scaling.

Here, we will use the `.fit()` and `.transform()` combo to define custom transformations and use them in a pipeline. This is how it works.

![sklearn_like_transformation_pipeline](./media/sklearn_like_transformation_pipeline.png)

In short, we fit all the transformers in the pipeline on the training data and use them to transform both the training and test data.

The `.fit()` step is executed only once and returns the transformer so we can use it later in the `.transform()` step.
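A short sketch of this behaviour, assuming a `StandardScaler` and made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration only.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

scaler = StandardScaler()
fitted = scaler.fit(train)  # fit() returns the scaler itself

# The statistics learned on the training data are reused here.
test_scaled = fitted.transform(test)
```

Since `fit` returns the transformer, calls can be chained, as in `StandardScaler().fit(train).transform(test)`.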

### 2.1 Function transformer
Time to get practical: we transform our transformer functions into sklearn transformers and use an sklearn pipeline instead of a pandas pipeline.

We can use the `OrdinalEncoder` instead of our `label_encoder` function. (One-hot encoding is often a better choice for nominal variables, but we keep it simple here.)

For the `remove_intervals` and `drop_id_fields` functions, we use the super practical `FunctionTransformer` (see documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html#sklearn.preprocessing.FunctionTransformer)).

In [8]:
remove_intervals_transformer = FunctionTransformer(remove_intervals, check_inverse=False)
drop_id_fields_transformer = FunctionTransformer(drop_id_fields, check_inverse=False)
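To see what the wrapper does in isolation, here is a sketch on a toy dataframe. The `drop_notes` helper and the column names are hypothetical and not part of the `works` dataset:

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def drop_notes(df):
    """Hypothetical helper: drop a free-text column we do not need."""
    return df.drop(columns=['notes'])

# Toy data, invented for illustration only.
toy = pd.DataFrame({'title': ['A', 'B'], 'notes': ['x', 'y']})

drop_notes_transformer = FunctionTransformer(drop_notes)
# fit() learns nothing here; transform() simply calls the wrapped function.
result = drop_notes_transformer.fit_transform(toy)
```

The wrapped function is stateless, which is exactly the case `FunctionTransformer` is meant for.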

Now to the pipeline:

In [9]:
columns = ['ComposerName']

transform_data_pipe = Pipeline([('remove_intervals', remove_intervals_transformer),
                                ('drop_id_fields', drop_id_fields_transformer), 
                                ('ordinal_encoder', OrdinalEncoder(cols=columns))])

Now we fit the pipeline on the train data and use it to transform both the train and test data.

In [10]:
transform_data_pipe.fit_transform(X_train).head()

Unnamed: 0,ComposerName,WorkTitle,Movement,ConductorName
42087,1,"PRELUDES, LES (SYMPHONIC POEM NO. 3)",,"Stransky, Josef"
57774,2,TOSCA,"Duet: ""O dolci mani"" Act III","Antonini, Alfredo"
15012,3,"CONCERTO, VIOLIN, G MINOR (1913: UNPUBLISHED)",,"Stransky, Josef"
3366,4,"EURYANTHE, OP. 81, J. 291",Overture,"Hoogstraten, Willem van"
50816,5,PICTURES AT AN EXHIBITION (ARR. Gorchakov),,"Masur, Kurt"


In [11]:
transform_data_pipe.transform(X_test).head()

Unnamed: 0,ComposerName,WorkTitle,Movement,ConductorName
1938,26.0,"SYMPHONY NO. 9, E MINOR, OP.95 (FROM THE NEW W...",,"Stransky, Josef"
11448,57.0,"TAIWAN, REPUBLIC OF CHINA",,"Masur, Kurt"
76298,290.0,FAUST: BALLET MUSIC,,"Rudel, Julius"
50986,47.0,"SYMPHONIE FANTASTIQUE, OP.14",,"Maazel, Lorin"
52724,47.0,"DAMNATION DE FAUST, LA, OP. 24",Danse des Sylphes,"Damrosch, Walter"


We can inspect the pipeline components through the `named_steps` attribute. For instance, we can check the mapping applied by the categorical encoder:

In [12]:
transform_data_pipe.named_steps.ordinal_encoder.mapping

[{'col': 'ComposerName',
  'mapping': Liszt,  Franz                     1
  Puccini,  Giacomo                 2
  Vivaldi,  Antonio                 3
  Weber,  Carl  Maria Von           4
  Musorgsky,  Modest                5
                                 ... 
  Furstenau,  Anton  Bernhard    2407
  Marshall,  Charles             2408
  Zhang,  Yusong                 2409
  Cerha,  Friedrich              2410
  NaN                              -2
  Length: 2411, dtype: int64,
  'data_type': dtype('O')}]
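The same kind of inspection works for any fitted step. A minimal sketch, assuming a toy two-step pipeline with invented step names and data:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration only.
X = np.array([[1.0], [np.nan], [3.0]])

toy_pipe = Pipeline([('imputer', SimpleImputer(strategy='mean')),
                     ('scaler', StandardScaler())])
toy_pipe.fit(X)

# Steps can be reached by key or by attribute.
imputer_mean = toy_pipe.named_steps['imputer'].statistics_[0]
scaler_mean = toy_pipe.named_steps.scaler.mean_[0]
```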

What kind of transformations can we perform this way? Some widespread ones are:
* Encoding (as we've seen)
* Scaling
* Vectorization (you will learn about this in the NLP specialization)
* Missing data imputation.

All steps but the last one in the pipeline have to be transformers. The last step can be another kind of estimator, for instance a predictor.
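As a sketch of a pipeline ending in a predictor, assuming a toy regression problem (the data and step names are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data following y = 2 * x, invented for illustration only.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

toy_model_pipe = Pipeline([('scaler', StandardScaler()),       # transformer
                           ('regressor', LinearRegression())])  # predictor
toy_model_pipe.fit(X, y)

# predict() runs transform() on every step but the last, then predicts.
prediction = toy_model_pipe.predict(np.array([[5.0]]))[0]
```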

### 2.2 Custom transformers

<img src="./media/megazord.png" width="150">

If a simple function wrapped in a `FunctionTransformer` won't do it, we can build our own custom transformers. They can be included in the sklearn pipeline as long as they follow the usual blueprint:
* Implement `Transformer.fit()`
* And `Transformer.transform()`.

All transformers are `estimators` in the sklearn universe (see the term in the [glossary](https://scikit-learn.org/stable/glossary.html#term-estimator)), therefore our custom transformer class will inherit from `sklearn.base.BaseEstimator`.

Transformers additionally need a `transform()` method, which we implement ourselves. By inheriting from `sklearn.base.TransformerMixin`, we also get a `fit_transform()` method for free.

All this ensures that the transformer will be compatible with pipelines and model selection tools such as grid search.

It is particularly important to notice that mixins should be “on the left” while the BaseEstimator should be “on the right” in the inheritance list for proper MRO.
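A small sketch of what the two base classes contribute, using a hypothetical no-op transformer called `Identity`:

```python
from sklearn.base import BaseEstimator, TransformerMixin

class Identity(TransformerMixin, BaseEstimator):
    """Hypothetical no-op transformer, used only to inspect inheritance."""
    def __init__(self, factor=1):
        self.factor = factor

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

# The mixin comes before BaseEstimator in the method resolution order.
mro_names = [cls.__name__ for cls in Identity.__mro__]

# get_params() comes from BaseEstimator, fit_transform() from the mixin.
params = Identity(factor=3).get_params()
result = Identity().fit_transform([[1, 2]])
```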

This is the schema we're going to use. We initialize any parameters in the `__init__` method, fit and store the transformer with the `fit` method and use it to transform data with the `transform` method.

In [13]:
from sklearn.utils.validation import check_is_fitted

class FeatureMultiplier(TransformerMixin, BaseEstimator):
    def __init__(self, some_parameter):
        self.some_parameter = some_parameter

    def fit(self, X, y=None):
        # Fit the transformer and store it.
        self._is_fitted = True
        return self
        
    def transform(self, X):
        # Transform X.
        check_is_fitted(self)
        return X

    def __sklearn_is_fitted__(self):
        """
        Check fitted status and return a Boolean value.
        """
        return hasattr(self, "_is_fitted") and self._is_fitted     

Note the additional `__sklearn_is_fitted__` method and the `_is_fitted` attribute. Sklearn pipelines check if they are fitted when the `transform` method is called and if not, they throw a warning (which will become an error in the future). This is just to avoid that warning. For more information see [this section](https://scikit-learn.org/stable/developers/develop.html#developer-api-for-check-is-fitted) in the documentation.

We want to implement a custom imputer for our categorical data that will impute with the most frequent value. (We do this just for the sake of an example because, well, we could just use the [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html) for this.)
* We want the imputer to take a `strategy` parameter, although we will support only one option, the mode
* Fitting means taking the mode of each column and storing it
* Transforming implies replacing missing values with the given column modes.

We use pandas `mode` to calculate the mode of all columns, then `squeeze` the values into a series.

In [14]:
X_train.mode()

Unnamed: 0,GUID,ProgramID,WorkID,MovementID,ComposerName,WorkTitle,Movement,ConductorName,Interval,isInterval
0,884c64d6-1768-4cf1-85f1-0ac2f79bbe5c,10608,0,1.0,"Wagner, Richard","MEISTERSINGER, DIE, WWV 96",Overture,"Damrosch, Walter",Intermission,False


In [15]:
X_train.mode().squeeze()

GUID             884c64d6-1768-4cf1-85f1-0ac2f79bbe5c
ProgramID                                       10608
WorkID                                              0
MovementID                                        1.0
ComposerName                         Wagner,  Richard
WorkTitle                  MEISTERSINGER, DIE, WWV 96
Movement                                     Overture
ConductorName                        Damrosch, Walter
Interval                                 Intermission
isInterval                                      False
Name: 0, dtype: object
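One caveat worth a quick sketch: `mode` returns one row per tied value, so `squeeze` only yields a series when the result has a single row, as it does here. With the invented toy data below, one column has two modes; taking the first row with `iloc[0]` is one possible way to always get a series:

```python
import pandas as pd

# Toy data, invented for illustration only: 'composer' has a tie.
toy = pd.DataFrame({'composer': ['Bach', 'Bach', 'Berg', 'Berg'],
                    'movement': ['Overture', 'Overture', 'Finale', 'Aria']})

modes = toy.mode()          # two rows, because 'composer' has two modes
squeezed = modes.squeeze()  # still a DataFrame, since there are two rows
first_mode = modes.iloc[0]  # always a Series: the first mode of each column
```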

Here goes our transformer:

In [16]:
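class CategoryImputer(TransformerMixin, BaseEstimator):
    def __init__(self, strategy=None):
        self.strategy = strategy
        
    def fit(self, X, y=None):
        # Resolve the default locally: fit should not overwrite constructor parameters.
        strategy = 'most_frequent' if self.strategy is None else self.strategy
        if strategy != 'most_frequent':
            # Fail loudly instead of returning a string, which would break the pipeline later.
            raise ValueError(f'Strategy not supported: {strategy!r}')
        # Store the mode of each column, learned from the data passed to fit.
        self.fills = pd.DataFrame(X).mode(axis=0).squeeze()
        self._is_fitted = True
        return self

    def transform(self, X):
        check_is_fitted(self)
        # Fill missing values with the modes stored during fit.
        return pd.DataFrame(X).fillna(self.fills)

    def __sklearn_is_fitted__(self):
        """
        Check fitted status and return a Boolean value.
        """
        return hasattr(self, "_is_fitted") and self._is_fitted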

We insert it into our pipeline:

In [17]:
transform_and_impute_data_pipe = Pipeline([('remove_intervals', remove_intervals_transformer),
                                           ('drop_id_fields', drop_id_fields_transformer),
                                           ('ordinal_encoder', OrdinalEncoder(cols=columns)),
                                           ('cat_imputer', CategoryImputer(strategy='most_frequent'))])

In [18]:
transform_and_impute_data_pipe.fit(X_train)

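estimator,parameter,value
Pipeline,steps,"[('remove_intervals', ...), ('drop_id_fields', ...), ...]"
Pipeline,transform_input,
Pipeline,memory,
Pipeline,verbose,False
FunctionTransformer (remove_intervals),func,<function rem...x7fcebf8df920>
FunctionTransformer (remove_intervals),inverse_func,
FunctionTransformer (remove_intervals),validate,False
FunctionTransformer (remove_intervals),accept_sparse,False
FunctionTransformer (remove_intervals),check_inverse,False
FunctionTransformer (remove_intervals),feature_names_out,
FunctionTransformer (remove_intervals),kw_args,
FunctionTransformer (remove_intervals),inv_kw_args,
FunctionTransformer (drop_id_fields),func,<function dro...x7fcebf8dfce0>
FunctionTransformer (drop_id_fields),inverse_func,
FunctionTransformer (drop_id_fields),validate,False
FunctionTransformer (drop_id_fields),accept_sparse,False
FunctionTransformer (drop_id_fields),check_inverse,False
FunctionTransformer (drop_id_fields),feature_names_out,
FunctionTransformer (drop_id_fields),kw_args,
FunctionTransformer (drop_id_fields),inv_kw_args,
OrdinalEncoder,verbose,0
OrdinalEncoder,mapping,"[{'col': 'ComposerName', 'data_type': dtype('O'), 'mapping': Liszt, Franz..., dtype: int64}]"
OrdinalEncoder,cols,['ComposerName']
OrdinalEncoder,drop_invariant,False
OrdinalEncoder,return_df,True
OrdinalEncoder,handle_unknown,'value'
OrdinalEncoder,handle_missing,'value'
CategoryImputer,strategy,'most_frequent'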

In [19]:
transform_and_impute_data_pipe.transform(X_train).head()

Unnamed: 0,ComposerName,WorkTitle,Movement,ConductorName
42087,1,"PRELUDES, LES (SYMPHONIC POEM NO. 3)",Overture,"Stransky, Josef"
57774,2,TOSCA,"Duet: ""O dolci mani"" Act III","Antonini, Alfredo"
15012,3,"CONCERTO, VIOLIN, G MINOR (1913: UNPUBLISHED)",Overture,"Stransky, Josef"
3366,4,"EURYANTHE, OP. 81, J. 291",Overture,"Hoogstraten, Willem van"
50816,5,PICTURES AT AN EXHIBITION (ARR. Gorchakov),Overture,"Masur, Kurt"


There we go! What about the test set?

In [20]:
transform_and_impute_data_pipe.transform(X_test).head()

Unnamed: 0,ComposerName,WorkTitle,Movement,ConductorName
1938,26.0,"SYMPHONY NO. 9, E MINOR, OP.95 (FROM THE NEW W...",Overture,"Stransky, Josef"
11448,57.0,"TAIWAN, REPUBLIC OF CHINA",Overture,"Masur, Kurt"
76298,290.0,FAUST: BALLET MUSIC,Overture,"Rudel, Julius"
50986,47.0,"SYMPHONIE FANTASTIQUE, OP.14",Overture,"Maazel, Lorin"
52724,47.0,"DAMNATION DE FAUST, LA, OP. 24",Danse des Sylphes,"Damrosch, Walter"


Victory awaits!

Now, can we throw a model in there? Perhaps we can. (We shouldn't, really, since we are exemplifying data wrangling workflows.) 

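We will use k-means clustering on the transformed data. Note that this time `OrdinalEncoder()` is created without the `cols` argument, so it encodes all string columns; k-means needs every feature to be numeric.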

In [21]:
megazord = Pipeline([('remove_intervals', remove_intervals_transformer),
                     ('drop_id_fields', drop_id_fields_transformer),
                     ('ordinal_encoder', OrdinalEncoder()),
                     ('cat_imputer', CategoryImputer(strategy='most_frequent')),
                     ('k_means', KMeans(n_init = 10))])

megazord.fit(X_train)
megazord.predict(X_test)

array([3, 3, 0, ..., 3, 1, 3], shape=(12568,), dtype=int32)

<img src="./media/noice.gif" width="500">

## 3. Further reading

The [column transformer](https://scikit-learn.org/stable/modules/compose.html#column-transformer) offers a way to apply different transformations to different columns of the dataframe.
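As a taste of what that looks like, here is a minimal sketch assuming a toy dataframe with one numeric and one categorical column (the column names, step names, and values are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'year': [1900.0, np.nan, 1920.0],
    'composer': pd.Series(['Bach', np.nan, 'Bach'], dtype=object),
})

# Each entry is (name, transformer, columns the transformer applies to).
column_imputer = ColumnTransformer([
    ('numeric', SimpleImputer(strategy='mean'), ['year']),
    ('categorical', SimpleImputer(strategy='most_frequent'), ['composer']),
])

# The result is a NumPy array with the columns in the order listed above.
imputed = column_imputer.fit_transform(toy)
```

The numeric gap is filled with the column mean and the categorical gap with the most frequent value, each learned from its own column.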

If you need to program your own estimator in the future, be aware that sklearn provides a [guide and templates](https://scikit-learn.org/stable/developers/develop.html#rolling-your-own-estimator).