# BLU02 - Learning Notebook - Data wrangling workflows - Part 3 of 3

In [1]:
import matplotlib.pyplot as plt

import pandas as pd
import os

from category_encoders.ordinal import OrdinalEncoder

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# 3 Advanced pipelines in scikit-learn

Remember our workflow diagram? Let's look at it again.

![data_transformation_workflow](./media/data_processing_workflow.png)

*Fig 1. - A standard workflow (again).*

Pandas, as amazing as it is, can only take us so far.

There, beyond the known universe, lies **modeling**.

Where we are at this point:
* We are able to perform transformations on data, setting up robust pipelines using nothing but Pandas
* We can combine different dataframes, to enrich our datasets or generate new ones.

Thus, here we are, with modeling lying ahead of us. What exactly is new about modeling, though?

We will be using the same dataset, but this time we will create a train-test split, as we would do before modeling.

In [2]:
works = pd.read_csv('./data/works.csv')
works

Unnamed: 0,GUID,ProgramID,WorkID,MovementID,ComposerName,WorkTitle,Movement,ConductorName,Interval
0,38e072a7-8fc9-4f9a-8eac-3957905c0002,3853,52446,,"Beethoven, Ludwig van","SYMPHONY NO. 5 IN C MINOR, OP.67",,"Hill, Ureli Corelli",
1,c7b2b95c-5e0b-431c-a340-5b37fc860b34,5178,52437,,"Beethoven, Ludwig van","SYMPHONY NO. 3 IN E FLAT MAJOR, OP. 55 (EROICA)",,"Hill, Ureli Corelli",
2,894e1a52-1ae5-4fa7-aec0-b99997555a37,10785,52364,1.0,"Beethoven, Ludwig van","EGMONT, OP.84",Overture,"Hill, Ureli Corelli",
3,34ec2c2b-3297-4716-9831-b538310462b7,5887,52434,,"Beethoven, Ludwig van","SYMPHONY NO. 2 IN D MAJOR, OP.36",,"Boucher, Alfred",
4,610a4acc-94e4-4cd6-bdc1-8ad020edc7e9,305,52453,,"Beethoven, Ludwig van","SYMPHONY NO. 7 IN A MAJOR, OP.92",,"Hill, Ureli Corelli",
...,...,...,...,...,...,...,...,...,...
82571,734c1116-0caf-4f8b-80d0-5e423cd1bcc6,9678,53976,47.0,"Handel, George Frideric",MESSIAH,Chorus: Worthy is the Lamb that was slain,"McGegan, Nicholas",
82572,884c64d6-1768-4cf1-85f1-0ac2f79bbe5c,10608,53976,47.0,"Handel, George Frideric",MESSIAH,Chorus: Worthy is the Lamb that was slain,"Labadie, Bernard",
82573,f549e93f-b35f-4824-b0d5-d543953535f8,9542,53976,51.0,"Handel, George Frideric",MESSIAH,Chorus: Amen,"Bicket, Harry",
82574,734c1116-0caf-4f8b-80d0-5e423cd1bcc6,9678,53976,51.0,"Handel, George Frideric",MESSIAH,Chorus: Amen,"McGegan, Nicholas",


In [3]:
X_train, X_test = train_test_split(works)

## 3.1 How is modeling different from transformation

In Pandas, we merely transformed the original dataframe into a new one.

But sometimes, this isn't possible. Let's start with an example: encoding categorical variables.

Remember: we need to perform the same transformations on train and test data (and whatever data comes next).

In [4]:
def transform_data(df):
    """
        This function transforms the dataframe, removing the intervals and
        encoding the categorical columns
    """
    df = df.copy()
    df = (df.pipe(remove_intervals)
            .pipe(label_encoder, 'ComposerName'))
    return df


def remove_intervals(df):
    """
        This function remove the intervals from the dataframe
    """
    df = df.copy()
    mask = df['Interval'].isnull()
    df = (df.loc[mask, :]
            .drop(columns='Interval'))
    return df
    

def label_encoder(df, column):
    """
        This function encodes a categorical column
    """
    df = df.copy()
    df[column + 'Encoded'] = df[column].astype('category').cat.codes
    return df


X_train_ = transform_data(X_train)

train_alban_berg = X_train_['ComposerName'] == 'Berg,  Alban'
(X_train_.loc[train_alban_berg, ['ComposerName', 'ComposerNameEncoded']]
         .drop_duplicates())

Unnamed: 0,ComposerName,ComposerNameEncoded
33498,"Berg, Alban",171


All is good. We removed the intermissions (just like we did previously), and we transformed the original dataframe.

For convenience, we are keeping only the `ComposerName` and `ComposerNameEncoded` columns and removing duplicates.

Let's do the same to the test data.

In [5]:
X_test_ = transform_data(X_test)

test_alban_berg = X_test_['ComposerName'] == 'Berg,  Alban'
(X_test_.loc[test_alban_berg, ['ComposerName', 'ComposerNameEncoded']]
        .drop_duplicates())

Unnamed: 0,ComposerName,ComposerNameEncoded
8968,"Berg, Alban",102


Do you see the problem? The same `ComposerName` can (and will, in all probability) get different encodings.

This problem is significant, as it would lead us to make wrong predictions!!
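
To see the mismatch in isolation, here is a minimal sketch on two toy series (the composer names are made up for illustration and are not taken from `works.csv`). Because `cat.codes` numbers the categories found in each series independently, the same name can receive a different code in each split.

```python
import pandas as pd

# Toy stand-ins for the train and test splits (hypothetical values).
train = pd.Series(['Bach', 'Berg', 'Mozart'])
test = pd.Series(['Berg', 'Wagner'])

# Each call builds its own category list, sorted alphabetically,
# so the codes only make sense within a single series.
train_codes = dict(zip(train, train.astype('category').cat.codes))
test_codes = dict(zip(test, test.astype('category').cat.codes))

print(train_codes['Berg'])  # 1 (after 'Bach')
print(test_codes['Berg'])   # 0 (first category in the test series)
```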

There are other cases in which problems arise. For instance, when replacing missing values with the mean:
* You are supposed to compute the mean on the training set and use it to transform both train and test sets
* Otherwise, you end up underestimating your generalization error.
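
The mean-imputation case can be sketched as follows, assuming a hypothetical numeric column named `duration` (our dataset has no such column, it is only here for illustration). The mean is computed once, on the training data, and reused for the test data.

```python
import pandas as pd

# Hypothetical numeric feature with missing values in both splits.
train = pd.DataFrame({'duration': [10.0, None, 30.0]})
test = pd.DataFrame({'duration': [None, 50.0]})

# "Fit": learn the statistic from the training set only.
train_mean = train['duration'].mean()

# "Transform": apply the same learned value to both splits.
train_filled = train['duration'].fillna(train_mean)
test_filled = test['duration'].fillna(train_mean)

print(train_mean)            # 20.0
print(test_filled.tolist())  # [20.0, 50.0]
```

Had we computed the mean on the test set instead, the missing test value would have been filled with 50.0, a number the training data never saw.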

This particular unit is not about modeling at a conceptual level, but you get the point: 
* Somehow, you need to fit the transformer on your training data first (e.g., define the encodings, compute the means)
* Transform both train and test sets (and any data that might come in, really) using the pre-fitted transformers.

These transformations are more like modeling. In fact, all of this *is* modeling and part of your model. 

How do we solve this? **We need transformers that are more like models.**

## 3.2 Meet the sklearn-like transformers

There are three fundamental verbs in scikit-learn and sklearn-like libraries:
* `.fit()`
* `.transform()`
* `.predict()`.

You are already familiar with `.fit()` and `.predict()`, from the Bootcamp and the Hackathon #1.  We use them to train models and make predictions.

Here, we will explore a new combo: `.fit()` and `.transform()`. This is how it works.

![sklearn_like_transformation_pipeline](./media/sklearn_like_transformation_pipeline.png)

*Fig 2. - A data pipeline with consistent transformers, fitted on the training set.*

In short, we fit a transformer on the training data and use it to transform the training data.

We will, however, return the transformer so we can use it to transform new, incoming data as well. Confusing? Perhaps.
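
If it helps, the fit-then-transform contract can be sketched in plain Python. The `TinyOrdinalEncoder` class below is hypothetical and deliberately simplistic; it is not how any real library implements encoding, only an illustration of learning a mapping once and reusing it.

```python
class TinyOrdinalEncoder:
    """Hypothetical, minimal encoder illustrating the fit/transform contract."""

    def fit(self, values):
        # Learn the mapping from the training data only,
        # numbering categories in order of first appearance.
        self.mapping_ = {}
        for value in values:
            if value not in self.mapping_:
                self.mapping_[value] = len(self.mapping_) + 1
        return self

    def transform(self, values):
        # Reuse the learned mapping; unseen categories get -1.
        return [self.mapping_.get(value, -1) for value in values]


train = ['Berg', 'Bach', 'Berg', 'Mozart']
test = ['Mozart', 'Berg', 'Wagner']

encoder = TinyOrdinalEncoder().fit(train)
train_codes = encoder.transform(train)
test_codes = encoder.transform(test)

print(train_codes)  # [1, 2, 1, 3]
print(test_codes)   # [3, 1, -1]
```

Note how 'Berg' maps to 1 in both splits, and how the fitted object is the thing we keep around for new data.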

Time to get practical: meet `category_encoders`, a library of transformers for encoding categorical variables.

In [6]:
encoder = OrdinalEncoder(cols=['ComposerName'])
X_train_ = encoder.fit_transform(X_train)
X_train_.head()


Unnamed: 0,GUID,ProgramID,WorkID,MovementID,ComposerName,WorkTitle,Movement,ConductorName,Interval
12953,ab7f854e-c046-411c-afb6-8724186b7ede,11506,51538,,1,"PASSACAGLIA, OP.1",,"Gilbert, Alan",
42641,41110596-c5bf-4786-85a0-17e82f9a3e70,8692,7056,3.0,2,"TRIO, FLUTE, BASSON, PIANO (HARPSICHORD), G MA...",Thema con Variazioni,,
28512,f41f67f1-0179-4986-8b4e-c8df46b197c4,1146,52364,2.0,2,"EGMONT, OP.84","""Die Trommel geruhret,"" Lied","Stransky, Josef",
7388,5de63d4b-e822-4b00-95f7-2558c9d372b7,11183,8709,1.0,3,BEATRICE ET BENEDICT,Overture,"Smallens, Alexander",
18344,1eee8a13-841c-474e-aee9-ad8a495ed4c1,5630,52434,,2,"SYMPHONY NO. 2 IN D MAJOR, OP.36",,"Lange, Hans",


We can now use it to transform our test set.

In [7]:
X_test_ = encoder.transform(X_test)
X_test_.head()

Unnamed: 0,GUID,ProgramID,WorkID,MovementID,ComposerName,WorkTitle,Movement,ConductorName,Interval
31717,6af1668e-5729-400d-aa94-d44a4f7e540d,4245,0,,11.0,,,,Intermission
61775,06c24fb8-fb41-4383-845e-6af49978279f,7200,9003,11.0,18.0,"TANNHAUSER, WWV 70","Bacchanale (Venusberg music), Act I, scene i","Stransky, Josef",
4298,717e85ee-6957-45c9-8628-45f2ad36e614,2107,8957,1.0,57.0,"MARRIAGE OF FIGARO, THE, K.492",Overture,"Schelling, Ernest",
65694,c732a914-caf6-436b-b53e-cc5f56c19e74,3836,2331,,38.0,"SCHEHERAZADE, OP. 35",,"Mehta, Zubin",
33584,010210d2-6619-41e8-93d7-b2cd9b3b749e,11094,0,,11.0,,,,Intermission


Let's re-do our transformer functions so that they can either fit a transformer or accept a pre-fitted one.

We have to change our `label_encoder()` first to incorporate this logic. Then we need to adapt `transform_data()`.

In [8]:
def transform_data(df, encoder=None):
    df = df.copy()
    df, encoder = (df.pipe(remove_intervals)
                     .pipe(label_encoder, 'ComposerName', encoder))
    
    return df, encoder


def label_encoder(df, columns, encoder=None):
    if encoder is None:
        encoder = OrdinalEncoder(cols=[columns])
        encoder.fit(df)
        
    preview_encodings(encoder)

    df = df.copy()
    df = encoder.transform(df)

    return df, encoder

    
def preview_encodings(encoder):
    encodings = encoder.category_mapping[0]['mapping'][:4]
    print('Encodings: {}'.format(encodings))
    return None

X_train_, encoder = transform_data(X_train)

Encodings: Webern,  Anton  von        1
Beethoven,  Ludwig  van    2
Berlioz,  Hector           3
Prokofiev,  Sergei         4
dtype: int64



In the code above, we changed our functions so that they can receive a pre-fitted encoder. 

If none is given, they fit a new one and return it for re-use.

From a consistency standpoint, things should be looking good. Nonetheless, we are previewing the encoder's mappings as a sanity check.

In [9]:
X_test_, _ = transform_data(X_test, encoder=encoder)

Encodings: Webern,  Anton  von        1
Beethoven,  Ludwig  van    2
Berlioz,  Hector           3
Prokofiev,  Sergei         4
dtype: int64


What kind of transformations do you need to perform this way? Some widespread ones are:
* Encoding (as we've seen)
* Scaling
* Vectorization (you will learn about this in the next specialization!)
* Missing data imputation.

Now, this changes things (right?):
* We lose the ability to do method chaining, as we have to return the fitted encoders as intermediate outputs
* We need to separate pipelines for training (fit and transform) and test (transform only), which adds complexity and is error-prone.

Because we will perform the same transformations on all datasets, storing all the correct steps is critical for reproducibility and consistency.

It turns out, scikit-learn provides us with a distinctive take on pipelines, to wrap all of this in a single META-TRANSFORMER.

![megazord](./media/megazord.png)

*Fig 3. - A meta-transformer in practice.*

Meet the Megazord.

## 3.3 Pipelines

scikit-learn's `Pipeline` provides a higher level of abstraction than the individual building blocks.

Let's tie together all these sequential transformers and run `Megazord.fit()` and `Megazord.transform()` on the whole thing.

That would make managing our code much easier, right? Let's do it:
* We want to replace the missing values with the mode
* We want to one-hot-encode all categorical variables.
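
As a rough preview of the idea, the sketch below chains two of scikit-learn's own transformers on toy data. Several things here are assumptions made for illustration: the column names and values are invented, the encoder is `sklearn.preprocessing.OrdinalEncoder` rather than the `category_encoders` one imported at the top, and it uses ordinal rather than one-hot encoding to keep the output small.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Toy data with missing values (hypothetical, not from works.csv).
train = pd.DataFrame({'composer': ['Bach', 'Berg', np.nan, 'Berg'],
                      'movement': ['Overture', np.nan, 'Overture', 'Finale']})
test = pd.DataFrame({'composer': [np.nan, 'Bach'],
                     'movement': ['Finale', np.nan]})

# Each step is a (name, transformer) pair, applied in order.
pipeline = Pipeline([
    ('fill_na', SimpleImputer(strategy='most_frequent')),
    ('encode', OrdinalEncoder()),
])

# Modes and encodings are learned on train, then reused on test.
train_encoded = pipeline.fit_transform(train)
test_encoded = pipeline.transform(test)

print(test_encoded.tolist())  # [[1.0, 0.0], [0.0, 1.0]]
```

The missing test composer is filled with 'Berg' and the missing test movement with 'Overture', both of which are the training modes.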

First things first, some Pandas magic: let's drop the ID columns and exclude the intervals.

In [10]:
def prepare_data(df):
    df = df.copy()
    df = (df.pipe(drop_id_fields)
            .pipe(remove_intervals)
            .drop_duplicates())
    return df

def drop_id_fields(df):
    columns = ['GUID', 'ProgramID', 'WorkID', 'MovementID']
    df = df.copy()
    df = df.drop(columns=columns)
    return df

X_train_ = prepare_data(X_train)
X_train_.head()

Unnamed: 0,ComposerName,WorkTitle,Movement,ConductorName
12953,"Webern, Anton von","PASSACAGLIA, OP.1",,"Gilbert, Alan"
42641,"Beethoven, Ludwig van","TRIO, FLUTE, BASSON, PIANO (HARPSICHORD), G MA...",Thema con Variazioni,
28512,"Beethoven, Ludwig van","EGMONT, OP.84","""Die Trommel geruhret,"" Lied","Stransky, Josef"
7388,"Berlioz, Hector",BEATRICE ET BENEDICT,Overture,"Smallens, Alexander"
18344,"Beethoven, Ludwig van","SYMPHONY NO. 2 IN D MAJOR, OP.36",,"Lange, Hans"


In [11]:
X_test_ = prepare_data(X_test)
X_test_.head()

Unnamed: 0,ComposerName,WorkTitle,Movement,ConductorName
61775,"Wagner, Richard","TANNHAUSER, WWV 70","Bacchanale (Venusberg music), Act I, scene i","Stransky, Josef"
4298,"Mozart, Wolfgang Amadeus","MARRIAGE OF FIGARO, THE, K.492",Overture,"Schelling, Ernest"
65694,"Rimsky-Korsakov, Nikolai","SCHEHERAZADE, OP. 35",,"Mehta, Zubin"
15460,"Brahms, Johannes","SYMPHONY NO. 2 IN D MAJOR, OP. 73",,"Stransky, Josef"
80055,"Davis, Katherine","Onorati, Henry / Simeone / LITTLE DRUMMER BOY,...",,"Turnbull, Walter J."


Here's how we are going to do it:
1. Create an imputer that replaces missing values with the mode of each column
2. Use said imputer to impute the missing values
3. Ordinal encode all the categorical features.

All in one, single, amazing, Megazord.

## 3.4 Custom transformers

We can build our own custom transformers, as long as they follow the usual blueprint:
* Implement `Transformer.fit()`
* And `Transformer.transform()`.

All scikit-learn estimators have `get_params()` and `set_params()` methods. 

The easiest way to implement these functions sensibly is to inherit from `sklearn.base.BaseEstimator`, as we're doing.

And `Pipeline` compatibility requires a `fit_transform()` method that we are inheriting from `sklearn.base.TransformerMixin`.

In [12]:
class FeatureMultiplier(BaseEstimator, TransformerMixin):
    def __init__(self, some_parameter):
        self.some_parameter = some_parameter

    def fit(self, X, y=None):
        # Fit the transformer and store it.
        return self
        
    def transform(self, X):
        # Transform X.
        return X

Also, we may want our transformer to accept some parameters.

That's what we are doing when we include `some_parameter` in the `__init__`.
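
To make the blueprint a bit more tangible, here is one possible way to fill it in. The `ColumnMultiplier` class and its `factor` parameter are hypothetical names chosen for this sketch; the point is only to show what the two base classes give us for free.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnMultiplier(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that multiplies every value by a factor."""

    def __init__(self, factor=2.0):
        # Store the parameter under the same name as the argument.
        self.factor = factor

    def fit(self, X, y=None):
        # Nothing to learn here, but fit() must still return self.
        return self

    def transform(self, X):
        return X * self.factor


df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

multiplier = ColumnMultiplier(factor=10)
result = multiplier.fit_transform(df)  # inherited from TransformerMixin
params = multiplier.get_params()       # inherited from BaseEstimator

print(result['a'].tolist())  # [10, 20]
print(params)                # {'factor': 10}
```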

Back to our transformers. Our blueprint:
* We want the estimator to be able to take a `strategy` parameter, although we will support only the mode
* Fitting requires taking the mode of each column and storing it
* Transform implies replacing missing values with the given column modes.

How are we going to compute the modes? Pandas, as always, provides a convenient `.mode()` method.

In [13]:
X_train_.mode()

Unnamed: 0,ComposerName,WorkTitle,Movement,ConductorName
0,"Beethoven, Ludwig van",MESSIAH,Overture,"Damrosch, Walter"


To index the modes by column name, we will use `df.squeeze()`, a convenient method that turns our one-row dataframe into a `pd.Series`.

In [14]:
X_train.mode().squeeze()

GUID             884c64d6-1768-4cf1-85f1-0ac2f79bbe5c
ProgramID                                       10608
WorkID                                              0
MovementID                                        1.0
ComposerName                         Wagner,  Richard
WorkTitle                  MEISTERSINGER, DIE, WWV 96
Movement                                     Overture
ConductorName                        Damrosch, Walter
Interval                                 Intermission
Name: 0, dtype: object

We have everything we need.
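
One caveat worth keeping in mind: `.mode()` returns one row per tied value, so the result may have more than one row, in which case `.squeeze()` leaves it as a dataframe. The sketch below, on invented toy data, shows the tie and one possible workaround, which is taking the first row with `.iloc[0]`.

```python
import pandas as pd

# Toy data where 'composer' has two equally frequent values.
df = pd.DataFrame({'composer': ['Bach', 'Berg', 'Berg', 'Bach'],
                   'movement': ['Overture', 'Overture', None, 'Finale']})

modes = df.mode()
print(modes.shape)  # (2, 2): two rows because of the tie

# Taking the first row always yields a Series indexed by column name.
fills = modes.iloc[0]

# fillna() matches the Series index against the column names.
filled = df.fillna(fills)
print(filled['movement'].tolist())
```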

In [15]:
class CategoryImputer(BaseEstimator, TransformerMixin):
    def __init__(self, strategy='most_frequent'):
        # Store parameters as given, so get_params() and set_params() work.
        self.strategy = strategy

    def fit(self, X, y=None):
        if self.strategy != 'most_frequent':
            raise ValueError('Strategy not supported.')
        # Take the first row of modes, in case of ties.
        self.fills = pd.DataFrame(X).mode(axis=0).iloc[0]
        return self

    def transform(self, X):
        return pd.DataFrame(X).fillna(self.fills)


imputer = CategoryImputer()
X_train_ = imputer.fit_transform(X_train_)
X_train_.head()

Unnamed: 0,ComposerName,WorkTitle,Movement,ConductorName
12953,"Webern, Anton von","PASSACAGLIA, OP.1",Overture,"Gilbert, Alan"
42641,"Beethoven, Ludwig van","TRIO, FLUTE, BASSON, PIANO (HARPSICHORD), G MA...",Thema con Variazioni,"Damrosch, Walter"
28512,"Beethoven, Ludwig van","EGMONT, OP.84","""Die Trommel geruhret,"" Lied","Stransky, Josef"
7388,"Berlioz, Hector",BEATRICE ET BENEDICT,Overture,"Smallens, Alexander"
18344,"Beethoven, Ludwig van","SYMPHONY NO. 2 IN D MAJOR, OP.36",Overture,"Lange, Hans"


In [16]:
X_train_.isnull().sum()

ComposerName     0
WorkTitle        0
Movement         0
ConductorName    0
dtype: int64

There we go! What about the test set?

In [17]:
X_test_ = imputer.transform(X_test_)
X_test_.isnull().sum()

ComposerName     0
WorkTitle        0
Movement         0
ConductorName    0
dtype: int64

Victory awaits!

## 3.5 Everything together

Now, we want to fill in missing values and encode the categorical variables (remember?), all at the same time.

We are reaching our destination!

In [18]:
megazord = Pipeline([('fill_na', CategoryImputer(strategy='most_frequent')),
                     ('encode', OrdinalEncoder())])

X_train_ = megazord.fit_transform(X_train)
X_test_ = megazord.transform(X_test)

This way we abstract all the logic of passing transformers around.
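
The fitted transformers do not disappear inside the pipeline: they remain reachable by step name through `named_steps`. The sketch below illustrates this on toy data, assuming scikit-learn's own `SimpleImputer` and `sklearn.preprocessing.OrdinalEncoder` as stand-ins for the transformers used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Toy training data (hypothetical values).
train = pd.DataFrame({'movement': ['Overture', np.nan, 'Overture', 'Finale']})

pipeline = Pipeline([
    ('fill_na', SimpleImputer(strategy='most_frequent')),
    ('encode', OrdinalEncoder()),
])
pipeline.fit(train)

# Inspect what each step learned during fit().
learned_fill = pipeline.named_steps['fill_na'].statistics_
learned_categories = pipeline.named_steps['encode'].categories_

print(learned_fill[0])              # Overture
print(list(learned_categories[0]))  # ['Finale', 'Overture']
```

This can serve as the same kind of sanity check we did earlier when previewing the encodings.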

Now, can we throw a model in there? Perhaps we can.

(But we shouldn't, in a way. Please note that we are exemplifying data wrangling workflows.)

In [19]:
megazord = Pipeline([('fill_na', CategoryImputer(strategy='most_frequent')),
                     ('encode', OrdinalEncoder()),
                     ('k_means', KMeans())])

megazord.fit(X_train_)
megazord.predict(X_test_)

array([5, 2, 4, ..., 2, 5, 5], dtype=int32)

For the sake of simplicity, we are encoding categorical variables as if they were ordinal, instead of using one-hot-encoding, as recommended.
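
For reference, a one-hot version could look something like the sketch below. It assumes scikit-learn's `OneHotEncoder` (not used elsewhere in this notebook) and invented toy data; `handle_unknown='ignore'` is one possible choice for categories that only show up in the test set.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data; 'Wagner' appears only in the test split.
train = pd.DataFrame({'composer': ['Bach', 'Berg', 'Berg']})
test = pd.DataFrame({'composer': ['Berg', 'Wagner']})

# Unknown categories at transform time become all-zero rows.
encoder = OneHotEncoder(handle_unknown='ignore')

# The encoder returns a sparse matrix; densify it for display.
train_encoded = encoder.fit_transform(train).toarray()
test_encoded = encoder.transform(test).toarray()

print(list(encoder.categories_[0]))  # ['Bach', 'Berg']
print(test_encoded.tolist())         # [[0.0, 1.0], [0.0, 0.0]]
```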

Take this for what it is: an example of how to build end-to-end pipelines for modeling in scikit-learn.