# Scikit-Learn: The Best Parts
## Isaac Lemon Laughlin
Lead Instructor, Principal Data Scientist

[Galvanize Inc.](www.galvanize.com)

[@lemonlaug](www.twitter.com/lemonlaug)



# Requirements

In [266]:
import sklearn
sklearn.__version__

'0.17.1'

# Agenda/Objectives
1. What are Pipelines and FeatureUnions?
1. Why should I care?
1. Basic example of how they work.
1. Best practices for writing custom Transformers.
1. Get one hairy example under our belts.
1. Note some of the weaknesses and forthcoming features.

# 1. What are Pipelines and FeatureUnions?

* Method for chaining multiple estimators into a single one.
* Estimators might include models, transformations etc.
* `FeatureUnion` calls several transformers on the same input in parallel and `np.hstack`s the columns they return.
* `Pipeline` applies a sequence of transforms in series.
* They can (and should) be used together.
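
A minimal sketch of using the two together. The toy data and the choice of `StandardScaler`, `PCA`, and `LogisticRegression` are assumptions made for illustration, not choices from the talk:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration: 20 samples, 4 features.
rng = np.random.RandomState(0)
X = rng.normal(size=(20, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# FeatureUnion: both transformers see the same X; their outputs are
# stacked side by side (4 scaled columns + 2 principal components).
features = FeatureUnion([
    ('scaled', StandardScaler()),
    ('pca', PCA(n_components=2)),
])

# Pipeline: the union's output feeds the classifier, in series.
model = Pipeline([
    ('features', features),
    ('clf', LogisticRegression()),
])
model.fit(X, y)

n_columns = model.named_steps['features'].transform(X).shape[1]
predictions = model.predict(X)
print(n_columns)  # 6
```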

# The Best Part of `sklearn`

* Other parts of `sklearn`
    * Supervised learning
    * Unsupervised learning
    * Model selection and evaluation

# 2. Why are Pipelines and FeatureUnions so great?
* Encourage good habits like:
    * separation of concerns
        * cross-validation, development/computation
    * avoiding target leakage by not accepting information about y in `transform`
    * object-oriented design
* Promote modeling choices to parameters
    * Transformation to linearize a feature
    * Handling missing values
    * Constructing features
* Readability
    * Separates implementation details from general approach.
* Efficiency
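
As a rough illustration of promoting a modeling choice to a parameter, here is a hypothetical `Linearize` transformer (the class name and its `method` options are made up for this sketch). Because the choice lives in `__init__`, `get_params` and `set_params` can see it, which is what makes it searchable:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class Linearize(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: which transformation to apply is a parameter."""
    def __init__(self, method='identity'):
        self.method = method

    def fit(self, X, y=None):
        # Stateless: nothing to learn from the data.
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        if self.method == 'log':
            return np.log1p(X)
        if self.method == 'sqrt':
            return np.sqrt(X)
        return X

step = Linearize()
default_params = step.get_params()  # {'method': 'identity'}

# The modeling choice can now be changed like any other hyperparameter.
step.set_params(method='log')
out = step.transform([[0.0], [np.e - 1.0]])  # log1p maps these to 0 and 1
```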

# 2. Why are Pipelines and FeatureUnions so great?
![CRISP-DM Process Diagram](images/440px-CRISP-DM_Process_Diagram.png)

CRISP-DM is a formalization of the way many Data Scientists work. Pipelines expedite two parts of this workflow: the back-and-forth between data preparation and modeling and, in turn, the overall cyclic process.

Removing friction in this process means Data Scientists can explore more ideas, which is ultimately the process that leads to big gains in model performance. Note that data preparation involves feature engineering, which IMO is the single most important way to improve model performance in settings where the possibility of acquiring more data is real.

# 3. How do they work?

* Initialize with a list of (name, estimator) tuples.
* All but the last of these estimators must implement a `transform` method.
* From the docs: <blockquote>Calling fit on the pipeline is the same as calling fit on each estimator in turn, transform the input and pass it on to the next step. The pipeline has all the methods that the last estimator in the pipeline has, i.e. if the last estimator is a classifier, the Pipeline can be used as a classifier. If the last estimator is a transformer, again, so is the pipeline.</blockquote>
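
The quoted behavior can be sketched in plain Python. This is not how `Pipeline` is implemented; `AddOne`, `MeanCenter`, and `fit_chain` are toy names invented here to show the fit-then-transform-then-pass-along idea:

```python
class AddOne:
    """Toy transformer (hypothetical): adds 1 to every value."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [x + 1 for x in X]


class MeanCenter:
    """Toy transformer (hypothetical): subtracts the mean seen during fit."""
    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


def fit_chain(steps, X, y=None):
    """Fit each (name, transformer) in turn, passing transformed data along."""
    Xt = X
    for name, step in steps:
        Xt = step.fit(Xt, y).transform(Xt)
    return Xt


steps = [('add_one', AddOne()), ('center', MeanCenter())]
out = fit_chain(steps, [1.0, 2.0, 3.0])
print(out)  # [-1.0, 0.0, 1.0]
```

Note that `MeanCenter` is fit on the output of `AddOne` (mean 3.0), not on the raw input, which is the point of chaining.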

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
#Pipelines are initialized with a list of (name, estimator) tuples.
estimators = [('reduce_dim', PCA()), ('svm', SVC())]
clf = Pipeline(estimators)
clf 

Pipeline(steps=[('reduce_dim', PCA(copy=True, n_components=None, whiten=False)), ('svm', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

In [237]:
[x for x in dir(clf) if not x.startswith('_')]

['classes_',
 'decision_function',
 'fit',
 'fit_predict',
 'fit_transform',
 'get_params',
 'inverse_transform',
 'named_steps',
 'predict',
 'predict_log_proba',
 'predict_proba',
 'score',
 'set_params',
 'steps',
 'transform']

In [260]:
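try:
    from sklearn.model_selection import GridSearchCV  # scikit-learn >= 0.18
except ImportError:
    from sklearn.grid_search import GridSearchCV  # scikit-learn 0.17, as used here
#Grid searching!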
params = {'reduce_dim__n_components':[1,5,10,12,15,20],
         'svm__kernel':['linear', 'rbf']}
gs = GridSearchCV(clf, param_grid=params)
#Random data
from sklearn.datasets import make_classification
gs.fit(*make_classification())
gs.best_params_

{'reduce_dim__n_components': 12, 'svm__kernel': 'linear'}

So our pipeline object has chained two estimators together and given us a single consistent interface for both. It turns out to be very powerful to think of a complex, multi-step model as a single model, especially when it comes to cross-validation: because we do not assume our parameters are independent, we have to search over all of them together.
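
To make the shared interface concrete, here is a small sketch of how nested parameters are addressed. It rebuilds the same `clf` as above so that it runs on its own; the parameter values are arbitrary:

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

clf = Pipeline([('reduce_dim', PCA()), ('svm', SVC())])

# Nested parameters are addressed as <step name>__<parameter name>,
# the same keys used in the grid search's param_grid.
clf.set_params(reduce_dim__n_components=5, svm__kernel='linear')

params = clf.get_params()
print(params['reduce_dim__n_components'])  # 5
print(clf.named_steps['svm'].kernel)       # linear
```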

# 4. Writing custom transformers
`sklearn` implements lots of good transformers, but there are infinitely many more we may want, so we'll often need to write our own.

```python
from sklearn.base import TransformerMixin, BaseEstimator
class MyTransformer(TransformerMixin, BaseEstimator):
    """Recommended signature for a custom transformer.
    
    Inheriting from TransformerMixin gives you fit_transform
    
    Inheriting from BaseEstimator gives you grid-searchable params.
    """
    def __init__(self):
        """If you need to parameterize your transformer,
        set the args here.
        
        Inheriting from BaseEstimator introduces the constraint
        that the args all be named keyword args, no positional 
        args or **kwargs.
        """
        pass
    ...
```    

```python
...
    def fit(self, X, y=None):
        """Recommended signature for custom transformer's
        fit method.
        
        Set state here with whatever information
        is needed to transform later.
        
        In some cases fit may do nothing. For example, transforming 
        degrees Fahrenheit to Kelvin requires no state.
        
        You can use y here, but won't have access to it in transform.
        Defaulting y to None lets fit_transform(X) work without a target.
        """
        #You have to return self, so we can chain!
        return self
... 
```

```python
...   
    def transform(self, X):
        """Recommended signature for custom transformer's
        transform method.
        
        Transform some X data, optionally using state set in fit. This X
        may be the same X passed to fit, but it may also be new data,
        as in the case of a CV dataset. Both are treated the same.
        """
        #Do transforms.
        #Placeholder: replace with your own transformation, e.g. foo(X).
        transformed = X
        return transformed
```
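
The Fahrenheit-to-Kelvin case mentioned in the `fit` docstring makes a compact worked example of a stateless transformer. This is one possible sketch of the stub filled in; the class name `FahrenheitToKelvin` is made up here:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class FahrenheitToKelvin(TransformerMixin, BaseEstimator):
    """Stateless transformer: convert degrees Fahrenheit to Kelvin."""
    def fit(self, X, y=None):
        # Nothing to learn, but we still return self so calls can chain.
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return (X - 32.0) * 5.0 / 9.0 + 273.15

temps_f = np.array([[32.0], [212.0]])  # freezing and boiling points of water
temps_k = FahrenheitToKelvin().fit_transform(temps_f)
print(temps_k)  # approximately [[273.15], [373.15]]
```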

# Practice:

Re-implement StandardScaler using the above stub.

Standardize features by removing the mean and scaling to unit variance:

$$ \frac{X - E(X)}{\sigma(X)} $$

## Practical hint:

Call your transformer `MyScaler` and save it in `scaler.py`; then you can run the unit tests in `tests/test_scaler.py`.

Write your transformer to `scaler.py` from the notebook with:
```python
%%writefile scaler.py

class MyScaler:
    ...
```

Then to run the tests from the notebook:

```
!python -m unittest tests.test_scaler
```

In [138]:
#Try running some unittests to see if it's working correctly.
!python -m unittest tests.test_scaler

...
----------------------------------------------------------------------
Ran 3 tests in 0.001s

OK


In [3]:
# %load scaler.py
from sklearn.base import TransformerMixin, BaseEstimator
import numpy as np

class MyScaler(TransformerMixin, BaseEstimator):
    """Scale to zero mean and unit variance.
    """
    def fit(self, X, y=None):
        """Recommended signature for custom transformer's
        fit method.
        
        Set state in your transformer with whatever information
        is needed to transform later.
        """
        # Learn the per-column mean and standard deviation.
        self.mean = np.mean(X, axis=0)
        self.scale = np.std(X, axis=0)
        # You have to return self, so we can chain!
        return self
    
    def transform(self, X):
        """Recommended signature for custom transformer's
        transform method.
        
        Use state (if any) to transform some X data. This X
        may be the same X passed to fit, but it may also be new data,
        as in the case of a CV dataset. Both are treated the same.
        """
        # Build a new array rather than modifying X in place; this also
        # avoids in-place casting errors when X has an integer dtype.
        return (X - self.mean) / self.scale
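
A quick sanity check on the arithmetic, independent of the unit tests: the same per-column mean and standard deviation computed by hand should agree with `sklearn.preprocessing.StandardScaler`. The small array is made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

# Same steps as fit (learn mean and std) and transform (apply them).
mean = np.mean(X, axis=0)
scale = np.std(X, axis=0)
by_hand = (X - mean) / scale

reference = StandardScaler().fit_transform(X)
print(np.allclose(by_hand, reference))  # True
```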

# Feature Union

Calls `fit` and `transform` in parallel and `np.hstack`s the output together.

`transformer_weights` can scale the terms in the feature union. Useful for grid searching in regularized settings.
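
A small sketch of `transformer_weights` in action. The step names and the weight of 10 are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

X = np.array([[1.0], [2.0], [3.0]])

fu = FeatureUnion(
    [('identity', FunctionTransformer()),
     ('square', FunctionTransformer(np.square))],
    # Each transformer's output is multiplied by its weight before stacking.
    transformer_weights={'identity': 1.0, 'square': 10.0})

out = fu.fit_transform(X)
print(out)  # columns: x, then 10 * x**2
```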

`n_jobs` arg can be used to get parallel computation.

For some complex transformers, alignment may be tricky! Pandas is good at this, but not helpful here because `np.hstack` is called, which ignores indexes.
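
The alignment pitfall can be seen without `sklearn` at all. In this sketch (toy frames, made up for illustration) the two frames hold the same row labels in different orders; pandas aligns on the index, while `np.hstack` simply pairs rows by position:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'a': [1, 2, 3]}, index=[0, 1, 2])
b = pd.DataFrame({'b': [30, 10, 20]}, index=[2, 0, 1])

# pandas matches rows by label: label 0 pairs a=1 with b=10.
aligned = pd.concat([a, b], axis=1)
print(aligned.loc[0, 'b'])  # 10

# np.hstack ignores the index: the first rows are paired, so 1 sits next to 30.
stacked = np.hstack([a.values, b.values])
print(stacked[0])  # [ 1 30]
```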

Writing a generalizable transformer often means expecting the correct column to have already been selected from your X matrix. Oftentimes this means writing a selector, which is too bad.

In [242]:
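import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion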

class WordCounter(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        split = np.vectorize(lambda x: len(x.split()))
        return split(X)[:,np.newaxis]

In [243]:
corpus = ["What is your name?", 
          "What is your favorite color?",
          "What is the airspeed velocity of an unladen swallow?"]

fu = FeatureUnion([('tfidf', TfidfVectorizer()),
                  ('counter', WordCounter())])

#Pretty display of the output.
pd.DataFrame(fu.fit_transform(corpus).todense())

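|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.391484 | 0.66284 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.391484 | 0.504107 | 4.0 |
| 1 | 0.0 | 0.0 | 0.55249 | 0.55249 | 0.32631 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.32631 | 0.420183 | 5.0 |
| 2 | 0.36043 | 0.36043 | 0.0 | 0.0 | 0.212876 | 0.0 | 0.36043 | 0.36043 | 0.36043 | 0.36043 | 0.36043 | 0.212876 | 0.0 | 9.0 |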


# Practice

Write a feature union that takes a single column vector as input and returns 3 columns corresponding to the square-root transformation, the identity transformation, and the square transformation.

_Hint:_ `sklearn.preprocessing.FunctionTransformer` takes a function as an argument and converts it to a simple Transformer.

In [26]:
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion

X = np.random.random((100,1))

fu = FeatureUnion([('sqrt',FunctionTransformer(np.sqrt)),
                  ('identity',FunctionTransformer()),
                  ('square', FunctionTransformer(lambda x: x**2))])
print(X[:5])
print(fu.fit_transform(X)[:5])

[[ 0.11021486]
 [ 0.64792271]
 [ 0.19943493]
 [ 0.8543808 ]
 [ 0.71626191]]
[[ 0.33198624  0.11021486  0.01214732]
 [ 0.80493646  0.64792271  0.41980384]
 [ 0.44658139  0.19943493  0.03977429]
 [ 0.92432722  0.8543808   0.72996655]
 [ 0.84632258  0.71626191  0.51303113]]


# Notes/Direction

## Efficiency
Some grid search steps may duplicate a lot of work by fitting/transforming the same data repeatedly. Caching may be forthcoming.

## Inverse Transforms
If the steps implement `inverse_transform`, the pipeline's `inverse_transform` applies them in reverse order.
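
A small sketch of a round trip through a pipeline whose steps all support `inverse_transform`. The data is random and the choice of `StandardScaler` followed by a full-rank `PCA` is an assumption made so that the round trip is lossless:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(loc=5.0, scale=2.0, size=(50, 2))

# Keeping both components means PCA discards no information here.
pipe = Pipeline([('scale', StandardScaler()),
                 ('pca', PCA(n_components=2))])

Z = pipe.fit_transform(X)
X_back = pipe.inverse_transform(Z)  # undoes PCA first, then the scaling
print(np.allclose(X, X_back))  # True
```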

## Post-processing/transformations of y.
Not currently available.



In [231]:
!unzip data/Train.zip

Archive:  data/Train.zip
  inflating: Train.csv               


# 5. Practice

Putting it all together: creating a matrix of heterogeneous data types.

In [249]:
#Data from: https://www.kaggle.com/c/bluebook-for-bulldozers
df = pd.read_csv('data/Train.zip')
df.head(2)

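|  | SalesID | SalePrice | MachineID | ModelID | datasource | auctioneerID | YearMade | MachineHoursCurrentMeter | UsageBand | saledate | ... | Undercarriage_Pad_Width | Stick_Length | Thumb | Pattern_Changer | Grouser_Type | Backhoe_Mounting | Blade_Type | Travel_Controls | Differential_Type | Steering_Controls |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1139246 | 66000 | 999089 | 3157 | 121 | 3.0 | 2004 | 68.0 | Low | 11/16/2006 0:00 | ... |  |  |  |  |  |  |  |  | Standard | Conventional |
| 1 | 1139248 | 57000 | 117657 | 77 | 121 | 3.0 | 1996 | 4640.0 | Low | 3/26/2004 0:00 | ... |  |  |  |  |  |  |  |  | Standard | Conventional |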


# Practice

Here are some suggested transformers to use in building your pipeline:

| Column | Transformer | Notes |
| ------ | ----------- | ----- |
| `UsageBand` | `sklearn.preprocessing.OneHotEncoder` | |
| `YearMade` | `sklearn.preprocessing.Imputer` | May want to also add a dummy column noting which rows are affected |
| `fiProductClassDesc` | `sklearn.feature_extraction.text.CountVectorizer` | You may ultimately want to reduce dimensionality of this using NMF or PCA. |
| `State` | `sklearn.preprocessing.OneHotEncoder` | |
| `YearMade`, `saledate` | Create a custom transformer to compute the age at sale. | |
| `SalePrice` | Create a custom transformer that takes the K most recent sales within a ModelID | Beware alignment/target leakage. Use `GroupBy.transform(lambda x: x.ffill())`. This is delicate, so feel free to ask for a hint. |
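
As a starting point for the age-at-sale row, here is one possible sketch. The class name `AgeAtSale` and its constructor arguments are made up for illustration, the toy frame reuses the two rows shown in `df.head(2)`, and age is measured in whole calendar years, which is a simplifying assumption:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class AgeAtSale(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: sale year minus year made, as one column."""
    def __init__(self, date_col='saledate', year_col='YearMade'):
        self.date_col = date_col
        self.year_col = year_col

    def fit(self, X, y=None):
        # Stateless: the age depends only on the row itself.
        return self

    def transform(self, X):
        sale_year = pd.to_datetime(X[self.date_col]).dt.year
        age = sale_year - X[self.year_col]
        # Return a 2D column so the result can be hstacked in a FeatureUnion.
        return age.values.astype(float)[:, np.newaxis]

toy = pd.DataFrame({'saledate': ['11/16/2006 0:00', '3/26/2004 0:00'],
                    'YearMade': [2004, 1996]})
ages = AgeAtSale().fit_transform(toy)
print(ages)  # [[2.], [8.]]
```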



In [29]:
# This item selector (usage demonstrated at
# http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html)
# can be important for practically implementing FeatureUnions.
class ItemSelector(BaseEstimator, TransformerMixin):
    """For data grouped by feature, select subset of data at a provided key.

    The data is expected to be stored in a 2D data structure, where the first
    index is over features and the second is over samples.  i.e.

    >> len(data[key]) == n_samples

    Please note that this is the opposite convention to sklearn feature
    matrixes (where the first index corresponds to sample).

    ItemSelector only requires that the collection implement getitem
    (data[key]).  Examples include: a dict of lists, 2D numpy array, Pandas
    DataFrame, numpy record array, etc.

    >> data = {'a': [1, 5, 2, 5, 2, 8],
               'b': [9, 4, 1, 4, 1, 3]}
    >> ds = ItemSelector(key='a')
    >> data['a'] == ds.transform(data)

    ItemSelector is not designed to handle data grouped by sample.  (e.g. a
    list of dicts).  If your data is structured this way, consider a
    transformer along the lines of `sklearn.feature_extraction.DictVectorizer`.

    Parameters
    ----------
    key : hashable, required
        The key corresponding to the desired value in a mappable.
    """
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]
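
A rough sketch of how the selector fits into a `FeatureUnion` over a DataFrame. It repeats a condensed `ItemSelector` so that it runs on its own; the toy frame, its column names, and the `ToColumn` helper are all made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline

class ItemSelector(TransformerMixin, BaseEstimator):
    """Condensed copy of the selector above: returns data[key]."""
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[self.key]

class ToColumn(TransformerMixin, BaseEstimator):
    """Hypothetical helper: reshape a 1-D selection to (n_samples, 1)."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float).reshape(-1, 1)

toy = pd.DataFrame({'desc': ['small loader', 'large loader', 'small dozer'],
                    'year': [1996, 2004, 2001]})

# Each branch selects its own column, then applies a type-specific step.
union = FeatureUnion([
    ('desc', Pipeline([('select', ItemSelector('desc')),
                       ('counts', CountVectorizer())])),
    ('year', Pipeline([('select', ItemSelector('year')),
                       ('reshape', ToColumn())])),
])

features = union.fit_transform(toy)
dense = features.toarray()  # the text branch makes the stacked result sparse
print(dense.shape)  # (3, 5): 4 vocabulary counts + 1 year column
```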
