In [None]:
%matplotlib inline
import matplotlib
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 2 * matplotlib.rcParams['savefig.dpi']

Goals
===========
 - Understand how to write custom classes in scikit-learn
 - Pipelines
 - Feature Unions

Custom Estimators and Transformers
===========

The scikit-learn library has a wealth of functionality available in its [classes](http://scikit-learn.org/stable/modules/classes.html). Occasionally you might want to customize the behavior of these classes, for example to add in functionality or for engineering reasons.

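All estimators (e.g. linear regression, k-means) support `fit` and `predict` methods. You can build your own by inheriting from classes in `sklearn.base`, using the template below. `RegressorMixin` already provides a default `score` (the R² of the predictions), so override it only when you need a custom metric.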
``` python                                                                                                                                        
class Estimator(base.BaseEstimator, base.RegressorMixin):
  def __init__(self, ...):
    # initialization code
  
  def fit(self, X, y):
    # fit the model ...
    return self
    
  def predict(self, X):
    return # prediction
    
  def score(self, X, y):
    return # custom score implementation
```

Conforming to this convention has the benefit that many tools (e.g. cross-validation, grid search) rely on this interface so you can use your new estimators with the existing `sklearn` infrastructure.                                                                 
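As a minimal sketch of what that compatibility buys you, the hypothetical `MeanRegressor` below (not part of scikit-learn, and run on made-up synthetic data) implements only `fit` and `predict`, yet it can be passed straight to `cross_val_score`. It assumes a scikit-learn version that provides `sklearn.model_selection`.

```python
import numpy as np
from sklearn import base
from sklearn.model_selection import cross_val_score


# Mixin listed first and BaseEstimator last, as recent scikit-learn docs recommend.
class MeanRegressor(base.RegressorMixin, base.BaseEstimator):
    """Hypothetical baseline: always predicts the mean of the training targets."""

    def fit(self, X, y):
        # Learned state gets a trailing underscore by scikit-learn convention.
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)


# Synthetic data, made up for this illustration.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=100)

# cross_val_score clones the estimator and calls fit and score on each fold;
# score (R^2) is inherited from RegressorMixin.
scores = cross_val_score(MeanRegressor(), X_demo, y_demo, cv=5)
print(scores)
```

A constant prediction can never beat the held-out mean, so the R² values should be at or below zero, which makes this kind of estimator a handy baseline.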
                                                                                
For example, `model_selection.GridSearchCV` ([docs](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html); it lived in `sklearn.grid_search` before scikit-learn 0.18) takes an estimator and a grid of hyperparameter values as arguments, and returns another estimator. Upon fitting, it cross-validates every combination in the grid, refits the best-scoring model on the full training data, and uses that model for prediction.
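A short sketch of that behavior, assuming a scikit-learn version where `GridSearchCV` is importable from `sklearn.model_selection`. The `Ridge` model, the `alpha` grid, and the synthetic data are choices made for this illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, made up for this illustration.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([1.5, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=120)

# The search object wraps an estimator and is itself an estimator.
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 1.0, 100.0]}, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)           # the winning grid point
print(search.score(X_demo, y_demo))  # delegates to the refitted best model
```

After `fit`, calls such as `search.predict` and `search.score` are forwarded to `search.best_estimator_`.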
                                                                                
Of course, we sometimes need to process or transform the data before we can do machine-learning on it.  `sklearn` has Transformers to help with this.  They implement this interface:
``` python
class Transformer(base.BaseEstimator, base.TransformerMixin):
  def __init__(self, ...):
    # initialization code
    
  def fit(self, X, y=None):
    # fit the transformation
    return self
  
  def transform(self, X):
    return ... # transformation
```

`base.TransformerMixin` ([docs](http://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html)) supplies a `.fit_transform` method built from `.fit` and `.transform`, so you only need to write those two. For stateless transformers, `.fit` often does nothing except return `self`, and `.transform` does all the work.
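For contrast, here is a sketch of a transformer whose `.fit` does learn something. `CenterTransformer` is a hypothetical class written for this illustration. It never defines `fit_transform`; that method comes from `TransformerMixin`.

```python
import numpy as np
from sklearn import base


class CenterTransformer(base.TransformerMixin, base.BaseEstimator):
    """Hypothetical transformer: subtracts the column means seen during fit."""

    def fit(self, X, y=None):
        # Remember the per-column means of the training data.
        self.means_ = np.asarray(X).mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X) - self.means_


X_demo = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])

# fit_transform is inherited; it is equivalent to fit(X).transform(X).
centered = CenterTransformer().fit_transform(X_demo)
same = CenterTransformer().fit(X_demo).transform(X_demo)
print(centered)
```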

The following is some example code to demonstrate the usage of custom classes.

In [None]:
import numpy as np
import scipy as sp
import sklearn as sk
from sklearn.linear_model import LinearRegression
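from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in 0.20
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; needs an older release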

boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.2, random_state=42)

class TruncateTransformer(sk.base.BaseEstimator, sk.base.TransformerMixin):
    """
    Returns the first k columns of a feature array
    """
    def __init__(self, k):
        self.k = k

    def fit(self, X, y=None):
        # Nothing to learn: the transformer is stateless
        return self

    def transform(self, X):
        return X[:,:self.k]


class ShellEstimator(sk.base.BaseEstimator, sk.base.RegressorMixin):
    """
    A shell estimator that only takes prefit models and premade transformers.
    Its sole function is to combine a transformer and regressor into a single object.
    """
    def __init__(self, transformer, model):
        self.transformer = transformer
        self.model = model

    def fit(self, X, y):
        # Nothing to fit: the transformer and model are assumed to be ready
        return self
    
    def score(self, X, y):
        X_test = self.transformer.transform(X)
        return self.model.score(X_test, y)

    def predict(self, X):
        X_test = self.transformer.transform(X)
        return self.model.predict(X_test)

linreg = LinearRegression(fit_intercept=True)
model = linreg.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(model.score(X_test, y_test))

truncator = TruncateTransformer(2)
X_train_k = truncator.transform(X_train)
X_test_k = truncator.transform(X_test)

# Use a fresh regressor so the full-feature model above is not overwritten
k_model = LinearRegression(fit_intercept=True).fit(X_train_k, y_train)
y_pred_k = k_model.predict(X_test_k)
print(k_model.score(X_test_k, y_test))

# The shell applies the transformer itself, so pass it the untruncated data
k_shell = ShellEstimator(truncator, k_model)
assert np.allclose(k_shell.predict(X_test), y_pred_k)
print(k_shell.score(X_test, y_test))

Pipelines
===========

scikit-learn has a built-in tool for chaining transformers and estimators into one unit, and it scales much more easily than custom wrapper estimators: the pipeline. A pipeline is itself an estimator (it implements the `.fit` and `.predict` methods), so the following code replaces all of the fitting and scoring code above. Note that a pipeline can chain any number of transformers, but only its final step may be an estimator, and that step is optional.

In [None]:
from sklearn import pipeline

k_pipe = pipeline.Pipeline([
  ('truncate', TruncateTransformer(2)),
  ('linreg', LinearRegression(fit_intercept=True))
  ])
k_pipe.fit(X_train, y_train)
print(k_pipe.score(X_test, y_test))
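
Because a pipeline is an estimator, it can also be handed to tools such as grid search. The sketch below is self-contained and uses assumptions throughout: `FirstKColumns` is a hypothetical stand-in that mirrors `TruncateTransformer`, the data is synthetic, and `GridSearchCV` is imported from `sklearn.model_selection`. Step parameters are addressed as `<step name>__<parameter>`, which works because `BaseEstimator` provides `get_params` and `set_params` for every argument of `__init__`.

```python
import numpy as np
from sklearn import base, pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV


class FirstKColumns(base.TransformerMixin, base.BaseEstimator):
    """Hypothetical stand-in for TruncateTransformer: keeps the first k columns."""

    def __init__(self, k=1):
        self.k = k

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[:, :self.k]


# Synthetic data: only the first three of four columns carry signal.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(150, 4))
y_demo = (X_demo[:, 0] + 2.0 * X_demo[:, 1] - 3.0 * X_demo[:, 2]
          + rng.normal(scale=0.05, size=150))

pipe = pipeline.Pipeline([
    ("truncate", FirstKColumns()),
    ("linreg", LinearRegression()),
])

# "truncate__k" means: parameter k of the step named "truncate".
search = GridSearchCV(pipe, param_grid={"truncate__k": [1, 2, 3, 4]}, cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

The search works on clones, so `pipe` itself stays unfitted and keeps its default `k`.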

Feature Unions
===========

[Feature unions](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html) are designed to get around the problem that you might not be able to prepare your desired feature matrix with just one series of transformations. For example, you might have text features, categorical features, and a couple of different kinds of numerical features, and want to feed them all into the same estimator or pipeline. Each feature type would require a different kind of transformer.

A feature union applies multiple transformers to the same input in *parallel* and concatenates their outputs column-wise into one output matrix, which can then be fed into an estimator or pipeline. Between the serial behavior of pipelines and the parallel behavior of feature unions, you can build complex multi-stage workflows.
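
To make the column-wise stacking concrete, here is a small sketch on synthetic data. The choice of `StandardScaler` and `PCA` as the two branches is an assumption made for illustration; any pair of transformers would do.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

# Synthetic input with 5 columns, made up for this illustration.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(50, 5))

union = FeatureUnion([
    ("scaled", StandardScaler()),   # contributes 5 columns
    ("pca", PCA(n_components=2)),   # contributes 2 columns
])
combined = union.fit_transform(X_demo)
print(combined.shape)  # (50, 7): the two outputs side by side
```

Each branch sees the same untouched `X_demo`; only the outputs are joined.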

The example code below applies two different transformations to `X` and feeds the combined features into a linear regressor.

In [None]:
class ReverseTruncateTransformer(sk.base.BaseEstimator, sk.base.TransformerMixin):
    """
    Returns the last k columns of a feature array
    """
    def __init__(self, k):
        self.k = k

    def fit(self, X, y=None):
        # Nothing to learn: the transformer is stateless
        return self

    def transform(self, X):
        return X[:,-self.k:]
     
all_features = pipeline.FeatureUnion([
  ('first two cols', TruncateTransformer(2)),
  ('last two cols', ReverseTruncateTransformer(2))
  ])
k_union = pipeline.Pipeline([("features", all_features), ("linreg", LinearRegression(fit_intercept=True))])
k_union.fit(X_train, y_train)
print(k_union.score(X_test, y_test))

You can also use feature unions to combine the predictions of multiple estimators. If you rewrite your estimators as transformers and feed them into a feature union followed by a final estimator (e.g. a linear regressor), the final estimator learns how to weight and combine the individual predictions into an ensemble model.

To do this, you'll need to write custom transformers where the `.transform` method carries out the `.predict` implementation.
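
One possible shape for such a wrapper is sketched below. `PredictionTransformer` is a hypothetical name, and the two base models and the synthetic data are arbitrary choices for illustration. Note that this simple version blends in-sample predictions, so a more careful setup would likely train the blender on out-of-fold predictions instead.

```python
import numpy as np
from sklearn import base, pipeline
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor


class PredictionTransformer(base.TransformerMixin, base.BaseEstimator):
    """Hypothetical wrapper: exposes an estimator's predictions as one feature column."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        # Fit a clone so the estimator passed to __init__ is left untouched.
        self.estimator_ = base.clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        # FeatureUnion stacks 2-D blocks, so return a single column.
        return self.estimator_.predict(X).reshape(-1, 1)


# Synthetic data, made up for this illustration.
rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(200, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

ensemble = pipeline.Pipeline([
    ("predictions", pipeline.FeatureUnion([
        ("tree", PredictionTransformer(DecisionTreeRegressor(max_depth=3, random_state=0))),
        ("knn", PredictionTransformer(KNeighborsRegressor(n_neighbors=5))),
    ])),
    ("blend", LinearRegression()),  # learns one weight per base model
])
ensemble.fit(X_demo, y_demo)
print(ensemble.named_steps["blend"].coef_)
```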

*Copyright &copy; 2015 The Data Incubator.  All rights reserved.*