# BLU02 - Learning Notebook - Data wrangling workflows - Part 3 of 3

In [1]:
import matplotlib.pyplot as plt

import pandas as pd
import os

from category_encoders.ordinal import OrdinalEncoder

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# 3 Advanced pipelines in scikit-learn

Remember our workflow diagram? Let's look at it again.

![data_transformation_workflow](./media/data_processing_workflow.png)

*Fig 1. - A standard workflow (again).*

Pandas, as amazing as it is, can only take us so far.

There, beyond the known universe, lies **modeling**.

Where we are at this point:
* We are able to perform transformations on data, setting up robust pipelines using nothing but Pandas
* We can combine different dataframes, to enrich our datasets or generate new ones.

Thus, here we are, with modeling lying right ahead of us. What exactly is new about modeling, though?

We will be using the same dataset, but this time we will create a train-test split, as we would do before modeling.

In [2]:
works = pd.read_csv('./data/works.csv')
works

Unnamed: 0,GUID,ProgramID,WorkID,MovementID,ComposerName,WorkTitle,Movement,ConductorName,Interval,isInterval
0,38e072a7-8fc9-4f9a-8eac-3957905c0002,3853,52446,,"Beethoven, Ludwig van","SYMPHONY NO. 5 IN C MINOR, OP.67",,"Hill, Ureli Corelli",,False
1,c7b2b95c-5e0b-431c-a340-5b37fc860b34,5178,52437,,"Beethoven, Ludwig van","SYMPHONY NO. 3 IN E FLAT MAJOR, OP. 55 (EROICA)",,"Hill, Ureli Corelli",,False
2,894e1a52-1ae5-4fa7-aec0-b99997555a37,10785,52364,1.0,"Beethoven, Ludwig van","EGMONT, OP.84",Overture,"Hill, Ureli Corelli",,False
3,34ec2c2b-3297-4716-9831-b538310462b7,5887,52434,,"Beethoven, Ludwig van","SYMPHONY NO. 2 IN D MAJOR, OP.36",,"Boucher, Alfred",,False
4,610a4acc-94e4-4cd6-bdc1-8ad020edc7e9,305,52453,,"Beethoven, Ludwig van","SYMPHONY NO. 7 IN A MAJOR, OP.92",,"Hill, Ureli Corelli",,False
...,...,...,...,...,...,...,...,...,...,...
82571,734c1116-0caf-4f8b-80d0-5e423cd1bcc6,9678,53976,47.0,"Handel, George Frideric",MESSIAH,Chorus: Worthy is the Lamb that was slain,"McGegan, Nicholas",,False
82572,884c64d6-1768-4cf1-85f1-0ac2f79bbe5c,10608,53976,47.0,"Handel, George Frideric",MESSIAH,Chorus: Worthy is the Lamb that was slain,"Labadie, Bernard",,False
82573,f549e93f-b35f-4824-b0d5-d543953535f8,9542,53976,51.0,"Handel, George Frideric",MESSIAH,Chorus: Amen,"Bicket, Harry",,False
82574,734c1116-0caf-4f8b-80d0-5e423cd1bcc6,9678,53976,51.0,"Handel, George Frideric",MESSIAH,Chorus: Amen,"McGegan, Nicholas",,False


In [3]:
X_train, X_test = train_test_split(works)
print(f'Train dataset: {X_train.shape[0]} rows \nTest dataset: {X_test.shape[0]} rows')


Train dataset: 61932 rows 
Test dataset: 20644 rows
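
By default, `train_test_split` shuffles the rows and holds out 25% of them, which is why the counts above split roughly 75/25 and why the exact rows change from run to run. A minimal sketch on a toy dataframe (the `toy` data and the `random_state` value are made up for illustration) shows one way to pin the shuffle so that the split is reproducible:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for `works`.
toy = pd.DataFrame({'ComposerName': list('abcdefgh'), 'WorkID': range(8)})

# random_state fixes the shuffle; test_size=0.25 is the default, spelled out here.
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=42)
toy_train_again, _ = train_test_split(toy, test_size=0.25, random_state=42)

print(toy_train.shape[0], toy_test.shape[0])  # 6 2
print(toy_train.equals(toy_train_again))      # True
```

The notebook itself does not set `random_state`, so the row indices shown in the outputs will differ if you re-run it.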


## 3.1 How is modeling different from transformation

In Pandas, we merely transformed the original dataframe into a new one.

But sometimes, this isn't possible. Let's start with an example: encoding categorical variables.

Remember: we need to perform the same transformations on train and test data (and whatever data comes next).

In [4]:
def transform_data(df):
    """
        This function transforms the dataframe by removing the intervals and
        encoding the categorical columns
    """
    df = df.copy()
    df = (df.pipe(remove_intervals)
            .pipe(label_encoder, 'ComposerName'))
    return df


def remove_intervals(df):
    """
        This function removes the intervals from the dataframe
    """
    df = df.copy()
    mask = df['Interval'].isnull()
    df = (df.loc[mask, :]
            .drop(columns='Interval'))
    return df
    

def label_encoder(df, column):
    """
        This function encodes a given categorical column
    """
    df = df.copy()
    df[column + 'Encoded'] = df[column].astype('category').cat.codes
    return df


X_train_ = transform_data(X_train)

train_alban_berg = X_train_['ComposerName'] == 'Berg,  Alban'
(X_train_.loc[train_alban_berg, ['ComposerName', 'ComposerNameEncoded']]
         .drop_duplicates())

Unnamed: 0,ComposerName,ComposerNameEncoded
48364,"Berg, Alban",177


All is good. We removed the intervals (just like we did previously), and we transformed the original dataframe.

For convenience, we are keeping only the `ComposerName` and `ComposerNameEncoded` columns and removing duplicates.

Let's do the same to the test data.

In [5]:
X_test_ = transform_data(X_test)

test_alban_berg = X_test_['ComposerName'] == 'Berg,  Alban'
(X_test_.loc[test_alban_berg, ['ComposerName', 'ComposerNameEncoded']]
        .drop_duplicates())

Unnamed: 0,ComposerName,ComposerNameEncoded
23114,"Berg, Alban",91


Do you see the problem? The same `ComposerName` can (and will, in all probability) get different encodings.

This problem is significant, as it would lead us to make wrong predictions!!
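
To see why this happens, here is a small sketch with two made-up series (the names `train` and `test` and their contents are purely illustrative). `cat.codes` numbers only the categories present in the series it is called on, so the code assigned to a value depends on which other values happen to be around:

```python
import pandas as pd

# Two hypothetical splits that share composers but not the full set of them.
train = pd.Series(['Bach', 'Berg', 'Wagner'], name='ComposerName')
test = pd.Series(['Berg', 'Wagner'], name='ComposerName')

# Each call builds its own categories from scratch, in sorted order.
train_codes = dict(zip(train, train.astype('category').cat.codes.tolist()))
test_codes = dict(zip(test, test.astype('category').cat.codes.tolist()))

print(train_codes)  # {'Bach': 0, 'Berg': 1, 'Wagner': 2}
print(test_codes)   # {'Berg': 0, 'Wagner': 1}
```

'Berg' is 1 in one split and 0 in the other, simply because 'Bach' is missing from the second one.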

There are other cases in which this kind of problem arises. For instance, when replacing missing values with the mean:
* You are supposed to compute the mean on the training set and use it to transform both the train and test sets
* Otherwise, information from the test set leaks into the transformation, and you end up underestimating your true generalization error.
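
As a rough sketch of the mean example, assuming a made-up numeric column called `duration` (it does not exist in `works`):

```python
import pandas as pd

# Hypothetical data: one missing value in each split.
train = pd.DataFrame({'duration': [10.0, 20.0, None]})
test = pd.DataFrame({'duration': [None, 100.0]})

# "Fit": compute the statistic on the training set only (NaN is skipped).
train_mean = train['duration'].mean()

# "Transform": reuse that same value for every dataset.
train_filled = train.fillna({'duration': train_mean})
test_filled = test.fillna({'duration': train_mean})

print(train_mean)                         # 15.0
print(test_filled['duration'].tolist())   # [15.0, 100.0]
```

Note that the test set is filled with 15.0, the training mean, and not with its own mean of 100.0.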

This particular learning unit is not about modeling at a conceptual level, but you get the point: 
* Somehow, you need to fit the transformer on your training data first (e.g., define the encodings, compute the means)
* Transform both train and test sets (and any data that might come in, really) using the pre-fitted transformers.

These transformations are more like modeling. In fact, all of this *is* modeling and part of your model. 

How do we solve this? **We need transformers that are more like models.**

## 3.2 Meet the sklearn-like transformers

There are three fundamental verbs in scikit-learn and sklearn-like libraries:
* `.fit()`
* `.transform()`
* `.predict()`.

You are already familiar with `.fit()` and `.predict()`, from the Bootcamp and the Hackathon #1.  We use them to train models and make predictions.

Here, we will explore a new combo: `.fit()` and `.transform()`. This is how it works.

![sklearn_like_transformation_pipeline](./media/sklearn_like_transformation_pipeline.png)

*Fig 2. - A data pipeline with consistent transformers, fitted on the training set.*

In short, we fit a transformer on the training data and use it to transform the training data.

We will, however, return the transformer so we can use it to transform new, incoming data as well. Confusing? Perhaps.
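
A stripped-down sketch may help before we bring in a library. The `ToyOrdinalEncoder` class below is hypothetical and only mimics the fit/transform idea in plain Python; numbering from 1 and using -1 for unseen values are choices made for this sketch:

```python
class ToyOrdinalEncoder:
    """Illustrative only: learns a value -> integer mapping on fit."""

    def fit(self, values):
        # Number the distinct values in order of first appearance, starting at 1.
        distinct = dict.fromkeys(values)
        self.mapping_ = {value: code for code, value in enumerate(distinct, start=1)}
        return self

    def transform(self, values):
        # Reuse the stored mapping; values never seen during fit become -1.
        return [self.mapping_.get(value, -1) for value in values]


train = ['Wagner', 'Berg', 'Wagner']
test = ['Berg', 'Ravel']

toy_encoder = ToyOrdinalEncoder().fit(train)   # fit on the training data only
train_codes = toy_encoder.transform(train)
test_codes = toy_encoder.transform(test)

print(train_codes)  # [1, 2, 1]
print(test_codes)   # [2, -1]
```

The fitted object carries the mapping with it, so 'Berg' gets the same code wherever it shows up.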

Time to get practical: meet `category_encoders`, a library of sklearn-like transformers for encoding categorical variables.

In [6]:
encoder = OrdinalEncoder(cols=['ComposerName'])
X_train_ = encoder.fit_transform(X_train)
X_train_.head()

Unnamed: 0,GUID,ProgramID,WorkID,MovementID,ComposerName,WorkTitle,Movement,ConductorName,Interval,isInterval
63424,f433ebb7-ce15-493e-b7e4-fa20c7d9ffcf,4140,9006,10.0,1,GOTTERDAMMERUNG [GÖTTERDÄMMERUNG],"Siegfried's Funeral Music, ACT III, scene ii","Barbirolli, John",,False
41769,ed4130ea-a91d-414d-bd47-8b86bb5b6dd0,1016,9194,1.0,2,"BARTERED BRIDE, THE","Overture (""Overture to a Comedy"")","Stransky, Josef",,False
15638,16814952-bc58-45eb-b0de-211e499022d7,10357,50027,,3,"SERENADE FOR STRINGS, OP.48",,"Damrosch, Walter",,False
39300,19ec976d-3245-4b9e-bf70-246c8000ea7a,10342,8088,,4,"EARLE OF OXFORD'S MARCH, THE (BRASS AND PERCUS...",,"Holtan, Timothy J.",,False
30773,e49104fe-e447-4d6d-b6db-b6b60dbd28dd,12731,51965,,5,"TILL EULENSPIEGELS LUSTIGE STREICHE, OP. 28",,"Coates, Albert",,False


We can now use it to transform our test set.

In [7]:
X_test_ = encoder.transform(X_test)
X_test_.head()

Unnamed: 0,GUID,ProgramID,WorkID,MovementID,ComposerName,WorkTitle,Movement,ConductorName,Interval,isInterval
1699,035494fc-6d33-4761-bb40-c6c39fce5386,5445,53344,,44.0,"SYMPHONY NO. 8, G MAJOR, OP.88 (OLD NO. 4)",,"Stransky, Josef",,False
62165,98445cf7-55c3-4aeb-aeac-dfe87eb6ec63,2526,50064,,3.0,"SYMPHONY NO. 4, F MINOR, OP. 36",,"Stransky, Josef",,False
58368,c8e8ef33-dc28-4f96-b676-83747e6974d7,439,5637,,9.0,"IN THE SILENT NIGHT, OP. 4, NO. 3",,"Kostelanetz, Andre",,False
21605,d00dde6c-e183-469b-adad-a4bad9dd55c4,843,5801,8.0,108.0,"MIDSUMMER NIGHT'S DREAM, OP. 61",7. Nocturne: Andante tranquillo,"Gamson, Arnold",,False
50487,eb369a99-904c-45a1-9a1a-079cc8bffa02,8788,53254,,58.0,"SYMPHONY NO. 2, D MAJOR, OP. 43",,"Volkov, Ilan",,False


Let's re-do our transformer functions so that they can either fit a transformer or accept a pre-fitted one.

We have to change our `label_encoder()` first to incorporate this logic. Then we need to adapt `transform_data()`.

In [8]:
def transform_data(df, encoder=None):
    df = df.copy()
    df, encoder = (df.pipe(remove_intervals)
                     .pipe(label_encoder, 'ComposerName', encoder))
    
    return df, encoder


def label_encoder(df, columns, encoder=None):
    if encoder is None:
        encoder = OrdinalEncoder(cols=[columns])
        encoder.fit(df)
        
    preview_encodings(encoder)

    df = df.copy()
    df = encoder.transform(df)

    return df, encoder

    
def preview_encodings(encoder):
    encodings = encoder.category_mapping[0]['mapping'][:4]
    print('Encodings: {}'.format(encodings))
    return None

X_train_, encoder = transform_data(X_train)

Encodings: Wagner,  Richard               1
Smetana,  Bedrich              2
Tchaikovsky,  Pyotr  Ilyich    3
Byrd,  William                 4
dtype: int64


In the code above, we changed our functions so that they can receive a pre-fitted encoder.

If none is given, they fit a new one and return it for re-use.

From a consistency standpoint, things should be looking good. Nonetheless, we preview the first few encodings as a sanity check.

In [9]:
X_test_, _ = transform_data(X_test, encoder=encoder)

Encodings: Wagner,  Richard               1
Smetana,  Bedrich              2
Tchaikovsky,  Pyotr  Ilyich    3
Byrd,  William                 4
dtype: int64


What kind of transformations do you need to perform this way? Some widespread ones are:
* Encoding (as we've seen)
* Scaling
* Vectorization (you will learn about this in the next specialization!)
* Missing data imputation.
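
Scaling follows the same pattern as encoding. Here is a small sketch using scikit-learn's `StandardScaler` on made-up numbers (the arrays are illustrative and unrelated to `works`):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [10.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses them, no refitting

print(scaler.mean_)   # mean learned on the training set
print(test_scaled)    # test values expressed relative to the training mean and std
```

The test value 2.0 maps to 0 because it equals the training mean, regardless of what else is in the test set.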

Now, this changes things (right?):
* We lose the ability to do method chaining, as we have to return encodings as intermediate outputs
* We need to segregate pipelines for training (fit and transform) and test (transform), which adds complexity and is error-prone.

Since we want to perform the same transformations on all datasets, storing all the correct steps is critical for reproducibility and consistency.

As it turns out, scikit-learn provides us with a distinctive take on pipelines, allowing us to wrap all of this in a single META-TRANSFORMER.

Meet the Megazord.

![megazord](./media/megazord.png)

*Fig 3. - A meta-transformer in practice.*



## 3.3 Pipelines

scikit-learn's `Pipeline` provides a higher level of abstraction than the individual building blocks.

Let's tie together all these sequential transformers and run `Megazord.fit()` and `Megazord.transform()` on the whole thing. This should make managing our code much easier, right?

Let's do it:
* We want to replace the missing values with the mode
* We want to encode all categorical variables (with an ordinal encoding, to keep things simple).

First things first, some Pandas magic: let's drop the ID columns and exclude the intervals.

In [10]:
def prepare_data(df):
    df = df.copy()
    df = (df.pipe(drop_id_fields)
            .pipe(remove_intervals)
            .drop_duplicates())
    
    df["isInterval"] = df["isInterval"].astype("category")
    return df

def drop_id_fields(df):
    columns = ['GUID', 'ProgramID', 'WorkID', 'MovementID']
    df = df.copy()
    df = df.drop(columns=columns)
    return df

X_train_ = prepare_data(X_train)
X_train_.head()

Unnamed: 0,ComposerName,WorkTitle,Movement,ConductorName,isInterval
63424,"Wagner, Richard",GOTTERDAMMERUNG [GÖTTERDÄMMERUNG],"Siegfried's Funeral Music, ACT III, scene ii","Barbirolli, John",False
41769,"Smetana, Bedrich","BARTERED BRIDE, THE","Overture (""Overture to a Comedy"")","Stransky, Josef",False
15638,"Tchaikovsky, Pyotr Ilyich","SERENADE FOR STRINGS, OP.48",,"Damrosch, Walter",False
39300,"Byrd, William","EARLE OF OXFORD'S MARCH, THE (BRASS AND PERCUS...",,"Holtan, Timothy J.",False
30773,"Strauss, Richard","TILL EULENSPIEGELS LUSTIGE STREICHE, OP. 28",,"Coates, Albert",False


In [11]:
X_test_ = prepare_data(X_test)
X_test_.head()

Unnamed: 0,ComposerName,WorkTitle,Movement,ConductorName,isInterval
1699,"Dvorak, Antonín","SYMPHONY NO. 8, G MAJOR, OP.88 (OLD NO. 4)",,"Stransky, Josef",False
62165,"Tchaikovsky, Pyotr Ilyich","SYMPHONY NO. 4, F MINOR, OP. 36",,"Stransky, Josef",False
58368,"Rachmaninoff, Sergei","IN THE SILENT NIGHT, OP. 4, NO. 3",,"Kostelanetz, Andre",False
21605,"Mendelssohn, Felix","MIDSUMMER NIGHT'S DREAM, OP. 61",7. Nocturne: Andante tranquillo,"Gamson, Arnold",False
50487,"Sibelius, Jean","SYMPHONY NO. 2, D MAJOR, OP. 43",,"Volkov, Ilan",False


Here's how we are going to do it:
1. Create an imputer that replaces missing values with the mode of the column
2. Use said imputer to impute the missing values
3. Ordinal encode all the categorical features.

All in one, single, amazing, Megazord.

## 3.4 Custom transformers

We can build our own custom transformers, as long as they follow the usual blueprint:
* Implement `Transformer.fit()`
* And `Transformer.transform()`.

All scikit-learn estimators have `get_params()` and `set_params()` methods.

The easiest way to implement these methods sensibly is to inherit from `sklearn.base.BaseEstimator`, as we're doing below.

We also inherit from `sklearn.base.TransformerMixin`, which gives us a `fit_transform()` method for free; `Pipeline` calls it on its intermediate steps when fitting.

In [12]:
class FeatureMultiplier(BaseEstimator, TransformerMixin):
    def __init__(self, some_parameter):
        self.some_parameter = some_parameter

    def fit(self, X, y=None):
        # Fit the transformer and store it.
        return self
        
    def transform(self, X):
        # Transform X.
        return X

Also, we may want our transformer to accept some parameters. That's what we are doing when we include `some_parameter` in the `__init__`.
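
The skeleton above does nothing yet. As a hedged illustration of the same blueprint, here is a hypothetical `ColumnMultiplier` (the class name and its `factor` parameter are invented for this sketch) that actually uses its parameter:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnMultiplier(BaseEstimator, TransformerMixin):
    """Hypothetical example: multiplies every value by `factor`."""

    def __init__(self, factor=2):
        # Only store the parameter; get_params() and set_params() rely on this.
        self.factor = factor

    def fit(self, X, y=None):
        # Nothing to learn here, but fit() must still return self.
        return self

    def transform(self, X):
        return X * self.factor


toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
multiplier = ColumnMultiplier(factor=10)

params = multiplier.get_params()        # inherited from BaseEstimator
result = multiplier.fit_transform(toy)  # inherited from TransformerMixin

print(params)  # {'factor': 10}
print(result)
```

Neither `get_params()` nor `fit_transform()` is written in the class body; both come from the parent classes.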

Back to our transformers. Our blueprint:
* We want the imputer to be able to take a `strategy` parameter, although we will support only the mode
* Fitting requires taking the mode of each column and storing it
* Transform implies replacing missing values with the given column modes.

How are we going to compute the modes? Pandas, as always, provides a convenient `.mode()` method.

In [13]:
X_train_.mode()

Unnamed: 0,ComposerName,WorkTitle,Movement,ConductorName,isInterval
0,"Beethoven, Ludwig van",MESSIAH,Overture,"Damrosch, Walter",False


To be able to index the modes by column name, we will use `df.squeeze()`, a convenient method that turns our single-row dataframe into a `pd.Series`.

In [14]:
X_train_.mode().squeeze()

ComposerName     Beethoven,  Ludwig  van
WorkTitle                        MESSIAH
Movement                        Overture
ConductorName           Damrosch, Walter
isInterval                         False
Name: 0, dtype: object
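
One caveat worth keeping in mind: `.squeeze()` only collapses the result when `.mode()` returns a single row. The toy frame below (made up for illustration) has a tie in one column, so `.mode()` returns two rows; taking the first row with `.iloc[0]` is one possible way to still end up with one value per column:

```python
import pandas as pd

# Hypothetical toy frame: 'x' has a single mode, 'y' has a two-way tie.
toy = pd.DataFrame({'x': ['a', 'a', 'b'], 'y': ['c', 'd', None]})

modes = toy.mode()
print(modes.shape)  # (2, 2), because 'y' has two modes

# With more than one row, squeeze() leaves the dataframe as it is.
still_a_frame = isinstance(modes.squeeze(), pd.DataFrame)

# Taking the first row always yields one value per column, as a pd.Series.
first_modes = modes.iloc[0]
print(first_modes)
```

In our training data each column has a single mode, so both approaches give the same result there.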

We have everything we need.

In [15]:
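class CategoryImputer(BaseEstimator, TransformerMixin):
    def __init__(self, strategy='most_frequent'):
        # Store the parameter as-is so get_params() and set_params() behave as expected.
        self.strategy = strategy

    def fit(self, X, y=None):
        if self.strategy != 'most_frequent':
            # Fail loudly instead of returning a string in place of the transformer.
            raise ValueError(f'Strategy not supported: {self.strategy!r}')
        # Keep one mode per column as a pd.Series (same as .squeeze() when
        # .mode() returns a single row, but also safe when there are ties).
        self.fills = pd.DataFrame(X).mode(axis=0).iloc[0]
        return self

    def transform(self, X):
        # Replace missing values with the modes learned during fit.
        return pd.DataFrame(X).fillna(self.fills)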


imputer = CategoryImputer()
X_train_ = imputer.fit_transform(X_train)
X_train_.head()

Unnamed: 0,GUID,ProgramID,WorkID,MovementID,ComposerName,WorkTitle,Movement,ConductorName,Interval,isInterval
63424,f433ebb7-ce15-493e-b7e4-fa20c7d9ffcf,4140,9006,10.0,"Wagner, Richard",GOTTERDAMMERUNG [GÖTTERDÄMMERUNG],"Siegfried's Funeral Music, ACT III, scene ii","Barbirolli, John",Intermission,False
41769,ed4130ea-a91d-414d-bd47-8b86bb5b6dd0,1016,9194,1.0,"Smetana, Bedrich","BARTERED BRIDE, THE","Overture (""Overture to a Comedy"")","Stransky, Josef",Intermission,False
15638,16814952-bc58-45eb-b0de-211e499022d7,10357,50027,1.0,"Tchaikovsky, Pyotr Ilyich","SERENADE FOR STRINGS, OP.48",Overture,"Damrosch, Walter",Intermission,False
39300,19ec976d-3245-4b9e-bf70-246c8000ea7a,10342,8088,1.0,"Byrd, William","EARLE OF OXFORD'S MARCH, THE (BRASS AND PERCUS...",Overture,"Holtan, Timothy J.",Intermission,False
30773,e49104fe-e447-4d6d-b6db-b6b60dbd28dd,12731,51965,1.0,"Strauss, Richard","TILL EULENSPIEGELS LUSTIGE STREICHE, OP. 28",Overture,"Coates, Albert",Intermission,False


In [16]:
X_train_.isnull().sum()

GUID             0
ProgramID        0
WorkID           0
MovementID       0
ComposerName     0
WorkTitle        0
Movement         0
ConductorName    0
Interval         0
isInterval       0
dtype: int64

There we go! What about the test set?

In [17]:
X_test_ = imputer.transform(X_test)
X_test_.isnull().sum()

GUID             0
ProgramID        0
WorkID           0
MovementID       0
ComposerName     0
WorkTitle        0
Movement         0
ConductorName    0
Interval         0
isInterval       0
dtype: int64

Victory awaits!

## 3.5 Everything together

Now, we want to fill in missing values and encode the categorical variables (remember?), all at the same time. We are reaching our destination!

In [18]:
megazord = Pipeline([('fill_na', CategoryImputer(strategy='most_frequent')),
                     ('encode', OrdinalEncoder())])

X_train_ = megazord.fit_transform(X_train)
X_test_ = megazord.transform(X_test)

This way we abstract all the logic of passing transformers around.
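
For comparison, a similar pipeline can be sketched with scikit-learn's built-in transformers alone. This is not the notebook's `CategoryImputer` nor the `category_encoders` encoder: it assumes `sklearn.impute.SimpleImputer` and `sklearn.preprocessing.OrdinalEncoder` instead, on a made-up toy dataset, and the choice of -1 for unseen categories is ours:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder  # scikit-learn's own encoder

# Hypothetical toy data with missing values in both splits.
train = pd.DataFrame({'ComposerName': ['Wagner', 'Berg', np.nan, 'Berg'],
                      'Movement': ['Overture', np.nan, 'Overture', 'Finale']},
                     dtype=object)
test = pd.DataFrame({'ComposerName': [np.nan, 'Ravel'],
                     'Movement': ['Finale', np.nan]},
                    dtype=object)

toy_pipeline = Pipeline([
    ('fill_na', SimpleImputer(strategy='most_frequent')),
    # Categories unseen during fit are encoded as -1 instead of raising.
    ('encode', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
])

train_encoded = toy_pipeline.fit_transform(train)  # fit and transform on train
test_encoded = toy_pipeline.transform(test)        # transform only on test
print(test_encoded)
```

The missing composer in the test set is filled with 'Berg', the mode learned on the training set, and 'Ravel', which never appeared in training, is encoded as -1.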

Now, can we throw a model in there? Perhaps we can.

(But we shouldn't, in a way, since we are exemplifying data wrangling workflows.)

In [19]:
megazord = Pipeline([('fill_na', CategoryImputer(strategy='most_frequent')),
                     ('encode', OrdinalEncoder()),
                     ('k_means', KMeans(n_init=10)),
                    ])

megazord.fit(X_train)
megazord.predict(X_test)

array([5, 6, 4, ..., 5, 5, 3], dtype=int32)

For the sake of simplicity, we are encoding categorical variables as if they were ordinal, instead of using one-hot-encoding, as recommended.

Take this for what it is: an example on how to build end-to-end pipelines for modeling in scikit-learn.

## 3.6 Accessing Pipeline steps

`Pipeline` is great! Now you may be wondering how you can access individual steps in a pipeline.

For example, let's say you want to access the KMeans step and verify the number of features seen during fit.

You can access it via `named_steps`, using the name assigned to the step in the pipeline.

In [20]:
megazord.named_steps['k_means'].n_features_in_

10

In [21]:
X_train.shape

(61932, 10)

Noice! This is the same as the number of columns in our training set.

![noice](./media/noice.gif)
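
Besides `named_steps`, a `Pipeline` can also be indexed directly by step name or by position. A small sketch on a made-up two-step pipeline (the `scale` step, the toy points and the KMeans settings are assumptions for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_pipeline = Pipeline([('scale', StandardScaler()),
                         ('k_means', KMeans(n_clusters=2, n_init=10, random_state=0))])
toy_pipeline.fit([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])

# Three equivalent ways to reach the same fitted step.
by_named_steps = toy_pipeline.named_steps['k_means']
by_name = toy_pipeline['k_means']
by_position = toy_pipeline[-1]

print(by_named_steps is by_name is by_position)  # True
print(by_named_steps.n_features_in_)             # 2
```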
