# Transformation Pipelines

A pipeline is an ordered sequence of transformations applied to a dataset. Transformers can be prebuilt or custom (see [custom transformers]()). Every estimator except the last must be a transformer, that is, it must provide a `fit_transform()` method.

The pipeline's `fit()` method calls `fit_transform()` on each transformer in turn. The output of each transformer is passed as the input to the next one; the first transformer receives the original data. On the last estimator, the pipeline calls only `fit()`.
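As a rough illustration of that call order, the sketch below fits a two-step pipeline and then repeats the same steps by hand. The toy data and the choice of `StandardScaler` and `LinearRegression` are assumptions made for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# toy data, made up for this illustration
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 35.0], [4.0, 38.0]])
y = np.array([1.5, 2.0, 3.5, 4.0])

pipe = Pipeline([
    ("scaler", StandardScaler()),    # transformer: fit_transform() is called
    ("model", LinearRegression()),   # last estimator: only fit() is called
])
pipe.fit(X, y)

# the same steps done by hand
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model = LinearRegression().fit(X_scaled, y)

# both routes should give the same predictions
pipeline_pred = pipe.predict(X)
manual_pred = model.predict(scaler.transform(X))
same_predictions = np.allclose(pipeline_pred, manual_pred)
```

If the two routes agree, `same_predictions` is `True`, which is what we would expect given the call order described above.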

# Custom Transformers

A clean and complete transformation pipeline is specific to your dataset, so custom transformers are often required. If we want our transformers to work seamlessly with sklearn's pipeline utilities, we have to write them to match the sklearn interface, which relies on duck typing. Well-defined transformers let you automate the process of building an optimized model, and one that keeps working as new data comes in.

In duck typing, "if it walks like a duck and it quacks like a duck, then it must be a duck". In other words, if an object has `fit()`, `transform()` and `fit_transform()` methods, then it can be used as a transformer. Note: we can obtain `fit_transform()` by inheriting from `TransformerMixin`, and additional methods such as `get_params()` and `set_params()` by inheriting from `BaseEstimator`.
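A minimal sketch of what the two base classes provide. `ColumnScaler` and its `factor` parameter are hypothetical names invented for this illustration; only `fit()` and `transform()` are written by hand, and the remaining methods come from inheritance.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnScaler(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that multiplies every value by `factor`."""

    def __init__(self, factor=2.0):
        self.factor = factor

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        return np.asarray(X) * self.factor


scaler = ColumnScaler()
X = np.array([[1.0, 2.0], [3.0, 4.0]])

doubled = scaler.fit_transform(X)   # inherited from TransformerMixin
params = scaler.get_params()        # inherited from BaseEstimator
scaler.set_params(factor=3.0)       # inherited from BaseEstimator
tripled = scaler.transform(X)
```

Because `get_params()` reads the arguments of `__init__`, each hyperparameter should be stored on `self` under the same name, as `factor` is here.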


## Example Custom Transformer

Here is an example custom transformer. It has one tunable hyperparameter, `add_bedrooms_per_room`.

In [2]:
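import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


# column indices of the attributes used below (X is expected to be a NumPy array)
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6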
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
        
    def fit(self, X, y=None):
        return self # nothing else to do
    
    def transform(self, X, y=None):
        # feature engineering
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
# housing_extra_attribs = attr_adder.transform(housing.values)

The next custom transformer lets a pipeline accept a dataframe directly. Native transformers work with NumPy ndarrays, so this transformer selects the requested columns and returns their values as an ndarray. Placing it first in a pipeline means we can pass in a dataframe without converting it ourselves.

In [4]:
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

## Joining Pipelines

Suppose you have two different datasets, or you want to transform two or more subsets of the same dataframe differently. We can achieve this by building two or more pipelines and then combining their outputs into the final dataset.

In [22]:
import pandas as pd
import numpy as np
import string
from myutils.data import random

sample_df = random.df(n_num=3, n_cat=2, n_rows=50, p_empty=.33)
sample_df.head(3)

|   | A | B | C | D | E |
|---|---|---|---|---|---|
| 0 | 69.678351 | 0.093403 |  | F | C |
| 1 | 53.510613 | 59.64017 |  | X | B |
| 2 | 68.253943 | 23.555079 | 30.025447 | C | B |


In [23]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler


num_attribs = ['A', 'B', 'C']
cat_attribs = ['D', 'E']

# build two pipelines, one for the numerical and another for the categorical attributes
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('imputer', SimpleImputer(strategy="most_frequent")),
    ('one_hot_encoder', OneHotEncoder()),
])

In [24]:
from sklearn.pipeline import FeatureUnion


# join the pipelines to form the final dataset
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

final_matrix = full_pipeline.fit_transform(sample_df)

# note that the number of columns grew from the original 5 because of the one-hot encoder
final_matrix.shape

(50, 37)
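The final column count is the sum of the columns produced by each branch, since `FeatureUnion` fits every transformer on the same input and stacks the outputs side by side. A minimal sketch of that behavior, using two scalers and made-up data chosen only for illustration:

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# toy data, made up for this illustration
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 35.0], [4.0, 38.0]])

union = FeatureUnion(transformer_list=[
    ("standardized", StandardScaler()),
    ("min_max", MinMaxScaler()),
])
combined = union.fit_transform(X)

# each branch on its own, for comparison
left_block = StandardScaler().fit_transform(X)
right_block = MinMaxScaler().fit_transform(X)

# the union output should equal the two blocks placed side by side
matches_hstack = np.allclose(combined, np.hstack([left_block, right_block]))
n_columns = combined.shape[1]  # 2 columns from each branch
```

In the notebook's pipeline the same logic applies: three scaled numerical columns plus one column per category found by the one-hot encoder.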