# BLU02 - Learning Notebook - Data wrangling workflows - Part 3 of 3

In [1]:
import matplotlib.pyplot as plt

import pandas as pd
import os

from category_encoders.ordinal import OrdinalEncoder

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# 3 Advanced pipelines in scikit-learn

Remember our workflow diagram? Let's look at it again.

![data_transformation_workflow](./media/data_processing_workflow.png)

*Fig 1. - A standard workflow (again).*

Pandas, as amazing as it is, can only take us so far.

There, beyond the known universe, lies **modeling**.

Where we are at this point:
* We are able to perform transformations on data, setting up robust pipelines using nothing but Pandas
* We can combine different dataframes, to enrich our datasets or generate new ones.

Thus, here we are, with modeling lying ahead of us. What exactly is new about modeling, though?

We will be using the same dataset, but this time we will create a train-test split, as we would do before modeling.

In [2]:
works = pd.read_csv('./data/works.csv')
works

Unnamed: 0,GUID,ProgramID,WorkID,MovementID,ComposerName,WorkTitle,Movement,ConductorName,Interval,isInterval
0,38e072a7-8fc9-4f9a-8eac-3957905c0002,3853,52446,,"Beethoven, Ludwig van","SYMPHONY NO. 5 IN C MINOR, OP.67",,"Hill, Ureli Corelli",,False
1,c7b2b95c-5e0b-431c-a340-5b37fc860b34,5178,52437,,"Beethoven, Ludwig van","SYMPHONY NO. 3 IN E FLAT MAJOR, OP. 55 (EROICA)",,"Hill, Ureli Corelli",,False
2,894e1a52-1ae5-4fa7-aec0-b99997555a37,10785,52364,1.0,"Beethoven, Ludwig van","EGMONT, OP.84",Overture,"Hill, Ureli Corelli",,False
3,34ec2c2b-3297-4716-9831-b538310462b7,5887,52434,,"Beethoven, Ludwig van","SYMPHONY NO. 2 IN D MAJOR, OP.36",,"Boucher, Alfred",,False
4,610a4acc-94e4-4cd6-bdc1-8ad020edc7e9,305,52453,,"Beethoven, Ludwig van","SYMPHONY NO. 7 IN A MAJOR, OP.92",,"Hill, Ureli Corelli",,False
...,...,...,...,...,...,...,...,...,...,...
82571,734c1116-0caf-4f8b-80d0-5e423cd1bcc6,9678,53976,47.0,"Handel, George Frideric",MESSIAH,Chorus: Worthy is the Lamb that was slain,"McGegan, Nicholas",,False
82572,884c64d6-1768-4cf1-85f1-0ac2f79bbe5c,10608,53976,47.0,"Handel, George Frideric",MESSIAH,Chorus: Worthy is the Lamb that was slain,"Labadie, Bernard",,False
82573,f549e93f-b35f-4824-b0d5-d543953535f8,9542,53976,51.0,"Handel, George Frideric",MESSIAH,Chorus: Amen,"Bicket, Harry",,False
82574,734c1116-0caf-4f8b-80d0-5e423cd1bcc6,9678,53976,51.0,"Handel, George Frideric",MESSIAH,Chorus: Amen,"McGegan, Nicholas",,False


In [3]:
X_train, X_test = train_test_split(works)
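
As a side note, `train_test_split` shuffles the rows at random, so the split changes from run to run unless a seed is fixed. Here is a minimal sketch on a made-up dataframe, assuming we want a reproducible split. The `toy` dataframe and the `random_state` value are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data, standing in for the works dataframe.
toy = pd.DataFrame({'ComposerName': list('abcdefgh'), 'ProgramID': range(8)})

# By default, 25% of the rows go to the test set.
# Fixing random_state makes the shuffle reproducible.
toy_train, toy_test = train_test_split(toy, random_state=42)
toy_train_again, toy_test_again = train_test_split(toy, random_state=42)

print(toy_train.shape, toy_test.shape)
print(toy_train.equals(toy_train_again))
```

Without `random_state`, two calls would generally return different splits.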

## 3.1 How is modeling different from transformation

In Pandas, we merely transformed the original dataframe into a new one.

But sometimes, this isn't possible. Let's start with an example: encoding categorical variables.

Remember: we need to perform the same transformations on train and test data (and whatever data comes next).

In [4]:
def transform_data(df):
    """
        This function transforms the dataframe, by removing the intervals and
        encoding the categorical columns
    """
    df = df.copy()
    df = (df.pipe(remove_intervals)
            .pipe(label_encoder, 'ComposerName'))
    return df


def remove_intervals(df):
    """
        This function removes the intervals from the dataframe
    """
    df = df.copy()
    mask = df['Interval'].isnull()
    df = (df.loc[mask, :]
            .drop(columns='Interval'))
    return df
    

def label_encoder(df, column):
    """
        This function encodes a categorical column
    """
    df = df.copy()
    df[column + 'Encoded'] = df[column].astype('category').cat.codes
    return df


X_train_ = transform_data(X_train)

train_alban_berg = X_train_['ComposerName'] == 'Berg,  Alban'
(X_train_.loc[train_alban_berg, ['ComposerName', 'ComposerNameEncoded']]
         .drop_duplicates())

Unnamed: 0,ComposerName,ComposerNameEncoded
22907,"Berg, Alban",175


All is good. We removed the intermissions (just like we did previously), and we transformed the original dataframe.

For convenience, we are keeping only the `ComposerName` and `ComposerNameEncoded` columns and removing duplicates.

Let's do the same to the test data.

In [5]:
X_test_ = transform_data(X_test)

test_alban_berg = X_test_['ComposerName'] == 'Berg,  Alban'
(X_test_.loc[test_alban_berg, ['ComposerName', 'ComposerNameEncoded']]
        .drop_duplicates())

Unnamed: 0,ComposerName,ComposerNameEncoded
20597,"Berg, Alban",98


Do you see the problem? The same `ComposerName` can (and will, in all probability) get different encodings.

This problem is significant, as it would lead us to make wrong predictions!!
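
To see why this happens, here is a minimal sketch on made-up data (the `train` and `test` series are hypothetical). Pandas assigns category codes from the sorted set of values present in each series, so a value missing from one set shifts all the codes that follow it.

```python
import pandas as pd

# Hypothetical toy data: the test set lacks one of the training composers.
train = pd.Series(['Bach', 'Berg', 'Verdi'])
test = pd.Series(['Berg', 'Verdi'])

# Codes are assigned from the sorted categories found in each series.
train_codes = dict(zip(train, train.astype('category').cat.codes.tolist()))
test_codes = dict(zip(test, test.astype('category').cat.codes.tolist()))

print(train_codes)  # {'Bach': 0, 'Berg': 1, 'Verdi': 2}
print(test_codes)   # {'Berg': 0, 'Verdi': 1}
```

The same composer gets code 1 in one set and code 0 in the other, which is exactly the inconsistency observed above.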

There are other cases in which problems arise. For instance, when replacing missing values with the mean:
* You are supposed to compute the mean on the training set and use it to transform both train and test sets
* Otherwise, you end up underestimating your true generalization error.

This particular unit is not about modeling at a conceptual level, but you get the point: 
* Somehow, you need to fit the transformer on your training data first (e.g., define the encodings, compute the means)
* Transform both train and test sets (and any data that might come in, really) using the pre-fitted transformers.
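
The fit-then-transform idea can be sketched with mean imputation, using scikit-learn's `SimpleImputer`. The toy dataframes and the `duration` column are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value in each set.
train = pd.DataFrame({'duration': [10.0, 20.0, np.nan, 30.0]})
test = pd.DataFrame({'duration': [np.nan, 100.0]})

imputer = SimpleImputer(strategy='mean')
imputer.fit(train)                      # learns the training mean (20.0)

train_filled = imputer.transform(train)
test_filled = imputer.transform(test)   # re-uses the training mean

print(imputer.statistics_)
print(test_filled.ravel())
```

Note that the missing test value is filled with the training mean, not with the mean of the test set.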

These transformations are more like modeling. In fact, all of this *is* modeling and part of your model. 

How do we solve this? **We need transformers that are more like models.**

## 3.2 Meet the sklearn-like transformers

There are three fundamental verbs in scikit-learn and sklearn-like libraries:
* `.fit()`
* `.transform()`
* `.predict()`.

You are already familiar with `.fit()` and `.predict()`, from the Bootcamp and the Hackathon #1.  We use them to train models and make predictions.

Here, we will explore a new combo: `.fit()` and `.transform()`. This is how it works.

![sklearn_like_transformation_pipeline](./media/sklearn_like_transformation_pipeline.png)

*Fig 2. - A data pipeline with consistent transformers, fitted on the training set.*

In short, we fit a transformer on the training data and use it to transform the training data.

We will, however, return the transformer so we can use it to transform new, incoming data as well. Confusing? Perhaps.

Time to get practical: meet `category_encoders`, a set of transformers for encoding categorical variables.

In [6]:
encoder = OrdinalEncoder(cols=['ComposerName'])
X_train_ = encoder.fit_transform(X_train)
X_train_.head()

Unnamed: 0,GUID,ProgramID,WorkID,MovementID,ComposerName,WorkTitle,Movement,ConductorName,Interval,isInterval
57700,280cfee0-e485-460f-8fbc-cd743e8ece29,3765,2316,,1,BOLERO [BOLÉRO],,"Bernstein, Leonard",,False
72328,7484a789-46b7-484c-a638-93e9f3311887,5672,11775,3.0,2,THREE-CORNERED HAT (EL SOMBRERO DE TRES PICOS)...,Final Dance (Jota),"Mitropoulos, Dimitri",,False
16503,3907b86e-cd3b-4a29-8678-370231a07b7b,8691,53432,,3,"CONCERTO, PIANO, A MINOR, OP. 17",,"Damrosch, Walter",,False
7270,a1e305a5-ebbf-48e3-ad0c-cd932a7d2138,5174,50376,1.0,4,"FORZA DEL DESTINO, LA",Overture,"Mitropoulos, Dimitri",,False
76103,511b3eb7-096c-4679-a0e2-86f008057c49,11020,9213,1.0,5,VERY WARM FOR MAY,"""All the Things You Are""","Kostelanetz, Andre",,False


We can now use it to transform our test set.

In [7]:
X_test_ = encoder.transform(X_test)
X_test_.head()

Unnamed: 0,GUID,ProgramID,WorkID,MovementID,ComposerName,WorkTitle,Movement,ConductorName,Interval,isInterval
76191,3db17f4a-b644-42ad-ba4d-2ffceb324639,2786,52097,5.0,41.0,FACADE (FAÇADE): SUITE NO. 2,V. Popular Song; Grazioso,"Kostelanetz, Andre",,False
11086,8b7f2509-95d8-4ee8-afd3-6868a6b1cac7,7656,6734,,1287.0,NEW ERA DANCE,,"Slatkin, Leonard",,False
13618,9c5012b6-53b9-4a16-863c-8ef4468f372a,13135,11769,,489.0,LA VARIATIONS,,"Gilbert, Alan",,False
6106,43235cc8-39dc-44e4-9b0f-b6745f02635f,2713,3460,,34.0,"BRANDENBURG CONCERTO NO. 2 IN F MAJOR, BWV 1047",,"Stokowski, Leopold",,False
74272,3c011fac-fccd-46be-81fb-e38a355909ff,10645,1711,,96.0,COUNTRY GARDENS,,"Autori, Franco",,False
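
The same fit-once, transform-many pattern can be sketched on toy data. This sketch swaps in scikit-learn's own `OrdinalEncoder` (imported under an alias so it does not shadow the `category_encoders` one used in this notebook), and it assumes that categories unseen during fit should map to -1. The dataframes are made up.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder as SklearnOrdinalEncoder

# Hypothetical toy data: 'Walton' never appears in the training set.
train = pd.DataFrame({'ComposerName': ['Bach', 'Berg', 'Verdi']})
test = pd.DataFrame({'ComposerName': ['Berg', 'Walton']})

# Assumption: unseen categories map to -1 instead of raising an error.
toy_encoder = SklearnOrdinalEncoder(handle_unknown='use_encoded_value',
                                    unknown_value=-1)
toy_encoder.fit(train)

train_encoded = toy_encoder.transform(train).ravel()
test_encoded = toy_encoder.transform(test).ravel()

print(train_encoded)
print(test_encoded)
```

Because the mapping is stored in the fitted encoder, 'Berg' receives the same code in both sets.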


Let's re-do our transformer functions so that they can either fit a transformer or accept a pre-fitted one.

We have to change our `label_encoder()` first to incorporate this logic. Then we need to adapt `transform_data()`.

In [8]:
def transform_data(df, encoder=None):
    df = df.copy()
    df, encoder = (df.pipe(remove_intervals)
                     .pipe(label_encoder, 'ComposerName', encoder))
    
    return df, encoder


def label_encoder(df, columns, encoder=None):
    if encoder is None:
        encoder = OrdinalEncoder(cols=[columns])
        encoder.fit(df)
        
    preview_encodings(encoder)

    df = df.copy()
    df = encoder.transform(df)

    return df, encoder

    
def preview_encodings(encoder):
    encodings = encoder.category_mapping[0]['mapping'][:4]
    print('Encodings: {}'.format(encodings))
    return None

X_train_, encoder = transform_data(X_train)

Encodings: Ravel,  Maurice             1
Falla,  Manuel  de          2
Paderewski,  Ignacy  Jan    3
Verdi,  Giuseppe            4
dtype: int64


In the code above, we changed our functions so that they can receive a pre-fitted encoder.

If none is given, they fit a new one and return it for re-use.

From a consistency standpoint, things should be looking good. Nonetheless, we preview the encoder as a sanity check.

In [9]:
X_test_, _ = transform_data(X_test, encoder=encoder)

Encodings: Ravel,  Maurice             1
Falla,  Manuel  de          2
Paderewski,  Ignacy  Jan    3
Verdi,  Giuseppe            4
dtype: int64


What kind of transformations do you need to perform this way? Some widespread ones are:
* Encoding (as we've seen)
* Scaling
* Vectorization (you will learn about this in the next specialization!)
* Missing data imputation.
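
Scaling follows the same pattern. Here is a minimal sketch with scikit-learn's `StandardScaler` on made-up arrays: the mean and standard deviation are learned from the training data only and then re-used on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with a single numeric feature.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

scaler = StandardScaler()
scaler.fit(train)   # stores the training mean and standard deviation

test_scaled = scaler.transform(test)

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
```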

Now, this changes things (right?):
* We lose the ability to do method chaining, as we have to return encodings as intermediate outputs
* We need to segregate pipelines for training (fit and transform) and test (transform), which adds complexity and is error-prone.

Because we will perform the same transformations on all datasets, storing all the correct steps is critical for reproducibility and consistency.

It turns out, scikit-learn provides us with a distinctive take on pipelines, to wrap all of this in a single META-TRANSFORMER.

![megazord](./media/megazord.png)

*Fig 3. - A meta-transformer in practice.*

Meet the Megazord.

## 3.3 Pipelines

scikit-learn's `Pipeline` provides a higher level of abstraction than the individual building blocks.

Let's tie together all these sequential transformers and run `Megazord.fit()` and `Megazord.transform()` on the whole thing.

That would make managing our code much easier, right? Let's do it:
* We want to replace the missing values with the mode
* We want to one-hot-encode all categorical variables.

First things first, some Pandas magic: let's drop the ID columns and exclude the intervals.

In [10]:
def prepare_data(df):
    df = df.copy()
    df = (df.pipe(drop_id_fields)
            .pipe(remove_intervals)
            .drop_duplicates())
    
    df["isInterval"] = df["isInterval"].astype("category")
    return df

def drop_id_fields(df):
    columns = ['GUID', 'ProgramID', 'WorkID', 'MovementID']
    df = df.copy()
    df = df.drop(columns=columns)
    return df

X_train_ = prepare_data(X_train)
X_train_.head()

Unnamed: 0,ComposerName,WorkTitle,Movement,ConductorName,isInterval
57700,"Ravel, Maurice",BOLERO [BOLÉRO],,"Bernstein, Leonard",False
72328,"Falla, Manuel de",THREE-CORNERED HAT (EL SOMBRERO DE TRES PICOS)...,Final Dance (Jota),"Mitropoulos, Dimitri",False
16503,"Paderewski, Ignacy Jan","CONCERTO, PIANO, A MINOR, OP. 17",,"Damrosch, Walter",False
7270,"Verdi, Giuseppe","FORZA DEL DESTINO, LA",Overture,"Mitropoulos, Dimitri",False
76103,"Kern, Jerome",VERY WARM FOR MAY,"""All the Things You Are""","Kostelanetz, Andre",False


In [11]:
X_test_ = prepare_data(X_test)
X_test_.head()

Unnamed: 0,ComposerName,WorkTitle,Movement,ConductorName,isInterval
76191,"Walton, William",FACADE (FAÇADE): SUITE NO. 2,V. Popular Song; Grazioso,"Kostelanetz, Andre",False
11086,"Kernis, Aaron Jay",NEW ERA DANCE,,"Slatkin, Leonard",False
13618,"Salonen, Esa-Pekka",LA VARIATIONS,,"Gilbert, Alan",False
6106,"Bach, Johann Sebastian","BRANDENBURG CONCERTO NO. 2 IN F MAJOR, BWV 1047",,"Stokowski, Leopold",False
74272,"Grainger, Percy",COUNTRY GARDENS,,"Autori, Franco",False


Here's how we are going to do it:
1. Create an imputer that replaces missing values with the mode of the column
2. Use said imputer to impute the missing values
3. Ordinal encode all the categorical features.

All in one, single, amazing, Megazord.
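
As a preview of the idea, here is a minimal sketch of such a pipeline on toy data. It uses only transformers that ship with scikit-learn (`SimpleImputer` and scikit-learn's own `OrdinalEncoder`, imported under an alias), which is an assumption made for illustration and not the exact pipeline built in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder as SklearnOrdinalEncoder

# Hypothetical toy data with missing values in both sets.
train = pd.DataFrame({'Movement': ['Overture', np.nan, 'Overture', 'Finale']},
                     dtype=object)
test = pd.DataFrame({'Movement': [np.nan, 'Finale']}, dtype=object)

toy_pipeline = Pipeline([
    ('fill_na', SimpleImputer(strategy='most_frequent')),
    ('encode', SklearnOrdinalEncoder()),
])

# Each step is fitted on the training data, in order.
train_encoded = toy_pipeline.fit_transform(train)
# The test data only goes through the already-fitted steps.
test_encoded = toy_pipeline.transform(test)

print(test_encoded.ravel())
```

The missing test value is first filled with the training mode ('Overture') and then encoded with the mapping learned on the training set.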

## 3.4 Custom transformers

We can build our own custom transformers, as long as they follow the usual blueprint:
* Implement `Transformer.fit()`
* And `Transformer.transform()`.

All scikit-learn estimators have `get_params()` and `set_params()` methods. 

The easiest way to implement these methods sensibly is to inherit from `sklearn.base.BaseEstimator`, as we're doing.

And `Pipeline` compatibility requires a `fit_transform()` method that we are inheriting from `sklearn.base.TransformerMixin`.

In [12]:
class FeatureMultiplier(BaseEstimator, TransformerMixin):
    def __init__(self, some_parameter):
        self.some_parameter = some_parameter

    def fit(self, X, y=None):
        # Fit the transformer and store it.
        return self
        
    def transform(self, X):
        # Transform X.
        return X

Also, we may want our transformer to accept some parameters.

That's what we are doing when we include `some_parameter` in the `__init__`.
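
To make the skeleton concrete, here is a minimal working sketch. `ColumnMultiplier` and its `factor` parameter are hypothetical names chosen for illustration. The transformer has nothing to learn during fit, so it mainly shows what we get for free from the two base classes.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnMultiplier(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that multiplies numeric data by a constant."""

    def __init__(self, factor=1):
        # Store the parameter under the same name, without modifying it.
        self.factor = factor

    def fit(self, X, y=None):
        # Nothing to learn here; return self so calls can be chained.
        return self

    def transform(self, X):
        return pd.DataFrame(X) * self.factor


multiplier = ColumnMultiplier(factor=3)
initial_params = multiplier.get_params()   # inherited from BaseEstimator
print(initial_params)

# set_params() comes from BaseEstimator, fit_transform() from TransformerMixin.
doubled = (multiplier.set_params(factor=2)
                     .fit_transform(pd.DataFrame({'a': [1, 2]})))
print(doubled)
```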

Back to our transformers. Our blueprint:
* We want the estimator to be able to take a `strategy` parameter, although we will support only the mode
* Fitting requires taking the mode of each column and storing it
* Transform implies replacing missing values with the given column modes.

How are we going to compute the modes? Pandas, as always, provides a convenient `.mode()` method.

In [22]:
X_train_.mode()

Unnamed: 0,GUID,ProgramID,WorkID,MovementID,ComposerName,WorkTitle,Movement,ConductorName,Interval,isInterval
0,3566,9678,0,1.0,13,17,1,3,1,False


To be able to use indexing we will use `df.squeeze()`, a convenient method to transform our dataframe into a `pd.Series`.

In [25]:
X_train_.mode().squeeze()

GUID              3566
ProgramID         9678
WorkID               0
MovementID         1.0
ComposerName        13
WorkTitle           17
Movement             1
ConductorName        3
Interval             1
isInterval       False
Name: 0, dtype: object
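
Here is a minimal sketch of how the squeezed modes can be used to fill missing values, on a made-up dataframe. One caveat: when a column has tied modes, `.mode()` returns more than one row and `.squeeze()` leaves a dataframe; in that case `.iloc[0]` would pick the first mode of each column instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({'Movement': ['Overture', np.nan, 'Overture'],
                    'MovementID': [1.0, 1.0, np.nan]})

# One-row dataframe -> series indexed by column name.
modes = toy.mode().squeeze()
print(modes['Movement'])

# fillna() matches the series index against the column names.
filled = toy.fillna(modes)
print(filled)
```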

We have everything we need.

In [15]:
class CategoryImputer(BaseEstimator, TransformerMixin):
    def __init__(self, strategy='most_frequent'):
        # Store the parameter as given; validation happens in fit().
        self.strategy = strategy

    def fit(self, X, y=None):
        if self.strategy != 'most_frequent':
            raise ValueError('Strategy not supported.')
        # Store the mode of each column as a series indexed by column name.
        self.fills = pd.DataFrame(X).mode(axis=0).squeeze()
        return self

    def transform(self, X):
        # Replace missing values with the modes computed during fit.
        return pd.DataFrame(X).fillna(self.fills)


imputer = CategoryImputer()
X_train_ = imputer.fit_transform(X_train)
X_train_.head()

Unnamed: 0,GUID,ProgramID,WorkID,MovementID,ComposerName,WorkTitle,Movement,ConductorName,Interval,isInterval
57700,280cfee0-e485-460f-8fbc-cd743e8ece29,3765,2316,1.0,"Ravel, Maurice",BOLERO [BOLÉRO],Overture,"Bernstein, Leonard",Intermission,False
72328,7484a789-46b7-484c-a638-93e9f3311887,5672,11775,3.0,"Falla, Manuel de",THREE-CORNERED HAT (EL SOMBRERO DE TRES PICOS)...,Final Dance (Jota),"Mitropoulos, Dimitri",Intermission,False
16503,3907b86e-cd3b-4a29-8678-370231a07b7b,8691,53432,1.0,"Paderewski, Ignacy Jan","CONCERTO, PIANO, A MINOR, OP. 17",Overture,"Damrosch, Walter",Intermission,False
7270,a1e305a5-ebbf-48e3-ad0c-cd932a7d2138,5174,50376,1.0,"Verdi, Giuseppe","FORZA DEL DESTINO, LA",Overture,"Mitropoulos, Dimitri",Intermission,False
76103,511b3eb7-096c-4679-a0e2-86f008057c49,11020,9213,1.0,"Kern, Jerome",VERY WARM FOR MAY,"""All the Things You Are""","Kostelanetz, Andre",Intermission,False


In [16]:
X_train_.isnull().sum()

GUID             0
ProgramID        0
WorkID           0
MovementID       0
ComposerName     0
WorkTitle        0
Movement         0
ConductorName    0
Interval         0
isInterval       0
dtype: int64

There we go! What about the test set?

In [17]:
X_test_ = imputer.transform(X_test)
X_test_.isnull().sum()

GUID             0
ProgramID        0
WorkID           0
MovementID       0
ComposerName     0
WorkTitle        0
Movement         0
ConductorName    0
Interval         0
isInterval       0
dtype: int64

Victory awaits!

## 3.5 Everything together

Now, we want to fill in missing values and encode the categorical variables (remember?), all at the same time.

We are reaching our destination!

In [18]:
megazord = Pipeline([('fill_na', CategoryImputer(strategy='most_frequent')),
                     ('encode', OrdinalEncoder())])

X_train_ = megazord.fit_transform(X_train)
X_test_ = megazord.transform(X_test)

This way we abstract all the logic of passing transformers around.

Now, can we throw a model in there? Perhaps we can.

(But we shouldn't, in a way. Please note that we are exemplifying data wrangling workflows.)

In [19]:
megazord = Pipeline([('fill_na', CategoryImputer(strategy='most_frequent')),
                     ('encode', OrdinalEncoder()),
                     ('k_means', KMeans()),
                    ])

megazord.fit(X_train)
megazord.predict(X_test)

array([3, 4, 4, ..., 3, 7, 4], dtype=int32)

For the sake of simplicity, we are encoding categorical variables as if they were ordinal, instead of using one-hot-encoding, as recommended.

Take this for what it is: an example of how to build end-to-end pipelines for modeling in scikit-learn.

## 3.6 Accessing Pipeline steps

`Pipeline` is great! Now you may be wondering how you can access individual steps in a pipeline.

For example, let's say you want to access the `KMeans` step and verify the number of features seen during fit.

You can access it via `named_steps`, using the name assigned to the step in the pipeline.

In [20]:
megazord.named_steps['k_means'].n_features_in_

10

In [21]:
X_train.shape

(61932, 10)

Noice! This is the same as the number of features in our training set.
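
Besides `named_steps`, a step can also be reached by indexing the pipeline with its name, or through the `steps` list of `(name, estimator)` tuples. Here is a minimal sketch on a made-up pipeline; the step names and data are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one missing value.
X = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0]])

toy_pipeline = Pipeline([('fill_na', SimpleImputer(strategy='mean')),
                         ('scale', StandardScaler())])
toy_pipeline.fit(X)

by_name = toy_pipeline.named_steps['fill_na']
by_key = toy_pipeline['fill_na']          # indexing by step name
by_position = toy_pipeline.steps[0][1]    # steps is a list of (name, estimator)

# All three point to the same fitted imputer.
print(by_name is by_key is by_position)
print(by_name.statistics_)
print(toy_pipeline['scale'].n_features_in_)
```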