# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Pipelines and custom transformers in SKLearn
Week 5 | Lesson 2.2

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*
- create pipelines for cleaning and manipulating data
- use pipelines to preprocess data from the SQL database
- use pipelines in combination with classification
- create a custom transformer using the `TransformerMixin` class

### STUDENT PRE-WORK
*Before this lesson, you should already be able to:*
- extract data from a database
- perform classification

Many organizations rely on data engineering teams to encode common data preparation tasks into pipelines. **Data pipelines** are series of automated data transformations that keep routine data maintenance tasks consistent and repeatable, which helps ensure the validity of your work.

## Data Pipelines 

The term _Pipeline_ is used to indicate a series of concatenated data transformations. Each stage of the pipeline feeds from the previous stage, i.e. the output of a stage is plugged into the input of the next stage and data flows through the pipeline from beginning to end as water flows through... a pipeline.

![Pipeline](./assets/images/pipeline.png)

Each processing stage has an input, where data comes in, and an output, where processed data comes out.

**Check:** What are some examples of data transformations?

Pipelines provide a higher level of abstraction than the individual building blocks.
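
As a rough illustration of the idea, independent of any library, a pipeline can be sketched in plain Python as a list of functions applied one after another. The helper `run_pipeline` and the three text-cleaning stages are hypothetical names chosen for this sketch only.

```python
# A pipeline as plain Python: each stage is a function, and the output of
# one stage becomes the input of the next stage.
def run_pipeline(data, stages):
    for stage in stages:
        data = stage(data)
    return data

# Three illustrative stages: trim whitespace, lowercase, split into words
stages = [str.strip, str.lower, str.split]

words = run_pipeline("  The Cat Is On The Table ", stages)
print(words)
```

The caller only deals with `run_pipeline` and a list of stages, which is the "higher level of abstraction" mentioned above.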

## Pipelines in scikit-learn 
One way to improve coding and model management is to use pipelines in `scikit-learn`. These tie together all the steps that you may need to prepare your dataset and make your predictions. Because you will need to perform all of the same transformations on your evaluation data, putting this all together makes sense.

To show how a pipeline works, we'll use an example involving Natural Language Processing - this is a topic we will get more into next week. 

The data comes from the [Evergreen Stumbleupon Kaggle Competition](https://www.kaggle.com/c/stumbleupon/data), which you have seen before. You will need to get the train data, which you presumably already have somewhere or you can redownload it from the link. Binary evergreen labels (either evergreen (1) or non-evergreen (0)) are provided. We'll focus on the page title text.

In [2]:
from sklearn.pipeline import Pipeline
import pandas as pd
import json

data = pd.read_csv("stumbleupon.tsv", sep='\t')
data['title'] = data['boilerplate'].map(lambda x: json.loads(x).get('title', ''))
data['body'] = data['boilerplate'].map(lambda x: json.loads(x).get('body', ''))

titles = data['title'].fillna('')
titles.head(3)

0    IBM Sees Holographic Calls Air Breathing Batte...
1    The Fully Electronic Futuristic Starting Gun T...
2    Fruits that Fight the Flu fruits that fight th...
Name: title, dtype: object

In [3]:
y = data['label']
y.value_counts(normalize=True)

1    0.51332
0    0.48668
Name: label, dtype: float64

So we want to predict evergreenness from this text data. Each datapoint is a string of free form text. How can we feed this to a model? The simplest way is to build a dictionary of words and use those as features. This is what a `CountVectorizer` does.

Example:


|Sentence|the|cat|is|on|table|blue|
|---|---|---|---|---|---|---|
|The cat is on the table|2|1|1|1|1|0|
|The table is blue|1|0|1|0|1|1|
|...|||||||
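
The table can be reproduced with a few lines of code. This is a minimal sketch using `CountVectorizer` with its default settings on the two example sentences; the variable names are made up for the illustration. Note that the vectorizer orders its columns alphabetically, so the column order differs from the table.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["The cat is on the table", "The table is blue"]

# Default settings: lowercase the text and count every word of 2+ characters
toy_vectorizer = CountVectorizer()
counts = toy_vectorizer.fit_transform(sentences).toarray()

# vocabulary_ maps each word to its column index; sort the words by that index
columns = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(columns)  # column labels, in alphabetical order
print(counts)   # one row per sentence, one column per word
```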

In [4]:
# CountVectorizer counts the words appearing in the input text
# by setting ngram_range to (1,2) we ask it to return both single words and pairs of words
# by setting binary=True we return a 1 if the word occurs and a 0 if it does not, no matter how many times it occurs
# by setting stop_words = 'english' we ignore very common words

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features = 1000, ngram_range=(1, 2), stop_words='english', binary=True)

vectorizer.fit(['IBM Sees Holographic Calls Air Breathing'])
# In scikit-learn 1.0 and later this method is called get_feature_names_out()
vectorizer.get_feature_names()

[u'air',
 u'air breathing',
 u'breathing',
 u'calls',
 u'calls air',
 u'holographic',
 u'holographic calls',
 u'ibm',
 u'ibm sees',
 u'sees',
 u'sees holographic']

In [5]:
# Having fit the vectorizer, transforming a new string reports the presence (1) or absence (0)
# of each word / word pair in the learned vocabulary

vectorizer.transform(['IBM Sees Holographic Air']).todense()

matrix([[1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1]])

**Check:** What is the meaning of the various parameters used at initialisation of the CountVectorizer?


Let's use the vectorizer to fit all the titles and build a feature matrix.

In [6]:
# Use fit to learn the vocabulary of the titles
vectorizer.fit(titles)
vectorizer.get_feature_names()[100:120]

[u'best',
 u'best new',
 u'better',
 u'betty',
 u'betty crocker',
 u'big',
 u'biggest',
 u'birthday',
 u'biscuits',
 u'bites',
 u'black',
 u'black bean',
 u'blog',
 u'blog archive',
 u'blogs',
 u'blood',
 u'blue',
 u'blueberry',
 u'body',
 u'boost']

In [7]:
# Use transform to generate the sample-by-word matrix - one row per title, one column per feature (word or n-gram)
X = vectorizer.transform(titles)

We use X, a matrix recording which of the most common n-grams appear in each title, as the input to our classifier. We want to classify whether a story is evergreen, based on these inputs.


In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

model = LogisticRegression()
# for a classifier such as logistic regression, cross_val_score reports accuracy by default
scores = cross_val_score(model, X, y)

print('CV scores: {}'.format(scores))
print('Average CVScore: {:0.3f} +/- {:0.3f}'.format(scores.mean(), scores.std()))

CV scores: [ 0.74695864  0.75578093  0.75608766]
Average CVScore: 0.753 +/- 0.004


## Combining Steps in Pipelines

Pipelines combine many steps, both pre-processing and model building, into a _single object_. Rather than manually running the transformers and then feeding their output into the models, pipelines can tie both of these steps together.

Similar to models and vectorizers in scikit-learn, pipelines are equipped with `fit` and `predict` or `predict_proba` methods (as any model would be), and they ensure that proper data transformations are performed.
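
Because a pipeline behaves like a model, it can be passed wherever scikit-learn expects an estimator. The sketch below passes one to `cross_val_score`; the titles and labels are invented toy data, used only to keep the example self-contained, and the scores they produce are not meaningful.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Made-up toy titles and labels (1 = evergreen, 0 = not evergreen)
toy_titles = [
    "easy chicken soup recipe", "best chocolate cake recipe",
    "healthy breakfast recipe ideas", "simple pasta recipe",
    "quick salad recipe", "homemade bread recipe",
    "election results tonight", "stock market news today",
    "football scores this weekend", "weather news for tomorrow",
    "breaking news from the city", "tech company earnings today",
]
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

toy_pipeline = Pipeline([('vec', CountVectorizer()), ('model', LogisticRegression())])

# The pipeline goes exactly where a bare model would go. For each fold,
# the vectorizer and the model are both fit on the training part only.
toy_scores = cross_val_score(toy_pipeline, toy_titles, toy_labels, cv=3)
print(toy_scores)
```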

In [9]:
# Split the data into a training set
training_data = data[:6000]
X_train = training_data['title'].fillna('')
y_train = training_data['label']

# These rows are rows obtained in the future, unavailable at training time
X_new = data[6000:]['title'].fillna('')

In [10]:
# We already set these above but just to be explicit we reinitialise them

vectorizer = CountVectorizer(max_features = 1000, ngram_range=(1, 2), stop_words='english', binary=True)
model = LogisticRegression()

In [11]:
from sklearn.pipeline import Pipeline

# We input key, value pairs to the pipeline. The key is a name we choose, but the value must be the
# object referring to the step of the pipeline, which is then fit

pipeline = Pipeline([('vec', vectorizer), ('model', model)])

In [20]:
# Let's say I want to set a hyperparameter of one of the elements of the pipeline
# after I've initialised it. I can do this with a two underscore __ notation
# invoking the key name that was set for the element (model or vec)

pipeline.set_params(model__C=10) 
#pipeline.set_params(vec__max_features=999)

Pipeline(steps=[('vec', CountVectorizer(analyzer=u'word', binary=True, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=999, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
        ...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [35]:
# Fit the full pipeline. This means we perform the steps laid out above
# First we fit the vectorizer, and then feed the output of that into the fit function of the model

pipeline.fit(X_train, y_train)

Pipeline(steps=[('vec', CountVectorizer(analyzer=u'word', binary=True, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=1000, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
       ...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [36]:
# Here we apply each step of the pipeline for predictions
# The text is transformed with the vectorizer to match the features from the pipeline
# however instead of calling model.fit() as the last step the pipeline will call model.predict_proba()

pipeline.predict_proba(X_new)[:,1]

array([ 0.52191125,  0.71543968,  0.98634554, ...,  0.7151964 ,
        0.39140431,  0.32492484])

**Check** Add a `MaxAbsScaler` scaling step to the pipeline, which should occur after the vectorization.

So to clarify, the advantage is that we call `.fit()` and `.predict()` once, on the whole pipeline, rather than once per step. You could also have written your own function to link the steps together, but sklearn handles all of them nicely.
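
To make the comparison concrete, here is a small sketch that runs the same two steps by hand and through a pipeline, then checks that the predicted probabilities agree. The toy titles and labels are invented for the illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up toy data
train_titles = ["easy soup recipe", "best cake recipe",
                "market news today", "election news tonight"]
train_labels = [1, 1, 0, 0]
new_titles = ["quick pasta recipe", "weather news"]

# Manual version: every step is fit and applied by hand
manual_vec = CountVectorizer()
manual_model = LogisticRegression()
manual_model.fit(manual_vec.fit_transform(train_titles), train_labels)
manual_proba = manual_model.predict_proba(manual_vec.transform(new_titles))

# Pipeline version: one fit call and one predict_proba call
toy_pipe = Pipeline([('vec', CountVectorizer()), ('model', LogisticRegression())])
toy_pipe.fit(train_titles, train_labels)
pipe_proba = toy_pipe.predict_proba(new_titles)

# Both approaches should give the same probabilities
print(np.allclose(manual_proba, pipe_proba))
```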

In [29]:
# Perform grid search on both CountVectorizer and LogisticRegression
# that's pretty useful!

from sklearn.model_selection import GridSearchCV
params = dict(vec__binary=[True, False], model__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipeline, param_grid=params, scoring="accuracy")
grid_search.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(steps=[('vec', CountVectorizer(analyzer=u'word', binary=True, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=999, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
        ...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'model__C': [0.1, 10, 100], 'vec__binary': [True, False]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='accuracy', verbose=0)

In [30]:
grid_search.best_params_

{'model__C': 0.1, 'vec__binary': True}

<a name="guided-practice"></a>
## make_pipeline and the preprocessing module

Scikit-learn pipelines can also be built using the `make_pipeline` function, which has a simpler syntax: you don't need to pass a name for each step, because each step is automatically named after its class, in lowercase.

In [25]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe1 = make_pipeline(StandardScaler(), LogisticRegression())    

pipe2 = Pipeline(steps=[('standard_scal',StandardScaler()), ('logistic_regr',LogisticRegression())])

In [26]:
pipe1

Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [27]:
pipe2

Pipeline(steps=[('standard_scal', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logistic_regr', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

The two pipelines created above perform the same steps and differ only in the names of those steps, i.e. this is just an alternative way to do the same thing.
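
The generated names matter because they are what you use later to address a step. A minimal sketch (the variable names are chosen for this example):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

auto_named = make_pipeline(StandardScaler(), LogisticRegression())

# Each step is named after its class, in lowercase
step_names = [name for name, _ in auto_named.steps]
print(step_names)

# The generated names work with the double-underscore notation
auto_named.set_params(logisticregression__C=10)
print(auto_named.named_steps['logisticregression'].C)
```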

## Preprocessing in sklearn (in pairs)

The preprocessing module comes loaded with many very useful pre-processing classes.

**Check** Working in pairs, take one class from the list below, read about it in the documentation and then explain it to the rest of the class.


Data Manipulators
- Binarizer
- KernelCenterer
- MaxAbsScaler
- MinMaxScaler
- Normalizer
- PolynomialFeatures
- RobustScaler
- StandardScaler
- VarianceThreshold (note: this one lives in `sklearn.feature_selection`, not in `sklearn.preprocessing`)

Data Imputation
- Imputer (replaced by `SimpleImputer` in `sklearn.impute` in newer scikit-learn releases)

Function Transformer
- FunctionTransformer

Label Manipulators
- LabelBinarizer
- LabelEncoder
- MultiLabelBinarizer


Ok, so we can implement several of these in a pipeline in order to preprocess our data. In fact this is what we will do in the next lab.

<a name="demo_2"></a>
## Custom Transformers

We can implement custom transformers by extending the base classes `BaseEstimator` and `TransformerMixin` in scikit-learn. This is useful, for example, if we want a transformer to act only on certain columns. Thankfully it is not too complicated, as we just need to define how the transformer responds to `fit` and `transform`. This is something you would have to do for production-level code, and it gives you a chance to implement Python classes as we saw last week!

In [15]:
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

# the inheritance of BaseEstimator and TransformerMixin allows use of methods such as fit_transform()
# which are not explicitly defined here (you can just always import them for this purpose)
class FeatureMultiplier(BaseEstimator, TransformerMixin):
    
    # This initialises the object, any argument we want to define has to appear here
    def __init__(self, factor):
        self.factor = factor
    
    # This is where the actual transformation that we want occurs
    def transform(self, X, *_):
        return X * self.factor
    
    # This needs to be defined but we don't want it to do anything
    def fit(self, *_):
        return self

fm = FeatureMultiplier(2)

test = np.diag((1,2,3,4))
print(test)

fm.transform(test)

[[1 0 0 0]
 [0 2 0 0]
 [0 0 3 0]
 [0 0 0 4]]


array([[2, 0, 0, 0],
       [0, 4, 0, 0],
       [0, 0, 6, 0],
       [0, 0, 0, 8]])

**Check** Compare this with the `FunctionTransformer` from the preprocessing module.
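
For reference while comparing, here is a sketch of the same multiplication expressed with `FunctionTransformer`. The function `multiply` is a hypothetical helper written for this example; the factor is passed through the `kw_args` parameter.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# A plain function that does the work of FeatureMultiplier.transform
def multiply(X, factor):
    return X * factor

# kw_args forwards extra keyword arguments to the function
ft = FunctionTransformer(func=multiply, kw_args={'factor': 2})

doubled = ft.fit_transform(np.diag((1, 2, 3, 4)))
print(doubled)
```

A `FunctionTransformer` is convenient for stateless transformations like this one, while a custom class gives more control when the transformer needs to learn something in `fit`.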

**Check** Implement a custom transformer that selects a specific feature from a Pandas dataframe. It should be initialized with the column name or the column index and it should return the selected column when transforming a dataframe.

In [None]:
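# General form for a custom transformer class

class SampleExtractor(BaseEstimator, TransformerMixin):

    def __init__(self, vars_):
        # store every constructor argument as an attribute of the same name
        self.vars_ = vars_

    def transform(self, X, *_):
        # some manipulation of X goes here; this placeholder returns X unchanged
        return X

    def fit(self, X, *_):
        return self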
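
One possible way to fill in this template for the column-selection exercise is sketched below. The class name `ColumnSelector` and the toy dataframe are made up for the illustration, and treating every integer as a position is an assumption that would need revisiting for dataframes with integer column names.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Select one column of a DataFrame by name (str) or by position (int)."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        # Nothing to learn from the data
        return self

    def transform(self, X):
        # Assumption: an int is always a position, anything else is a label
        if isinstance(self.column, int):
            return X.iloc[:, self.column]
        return X[self.column]

toy_df = pd.DataFrame({'title': ['soup recipe', 'market news'], 'label': [1, 0]},
                      columns=['title', 'label'])

by_name = ColumnSelector('title').fit_transform(toy_df)
by_position = ColumnSelector(0).fit_transform(toy_df)
print(by_name)
```

Returning a one-dimensional Series suits text steps such as `CountVectorizer`; numeric transformers such as `StandardScaler` expect two-dimensional input, so a variant returning `X[[self.column]]` may be more appropriate there.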

## make_union, FeatureUnion

What if you want to have several pipelines, that perform manipulations on data in a step-wise fashion, and then recombine the results of these pipes back together? That's where the union logic will come in. You can have your overall process with several pipeline branches, and then these pipeline branches can act independently (so not in sequence, as the pipeline does) and then recombine at the end.

Just as we can build a pipeline with either `Pipeline` or `make_pipeline`, we can build a union with either `FeatureUnion` or `make_union`.
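
A minimal sketch of a union with two branches, using two toy sentences: one branch counts single words, the other counts word pairs, and the union places the resulting columns side by side. The branch names `words` and `pairs` are arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, make_union

sentences = ["The cat is on the table", "The table is blue"]

# Both branches receive the same input and run independently
union = FeatureUnion([
    ('words', CountVectorizer()),                    # single words
    ('pairs', CountVectorizer(ngram_range=(2, 2))),  # pairs of words
])
combined = union.fit_transform(sentences)

# The columns of both branches are concatenated: 6 words + 7 word pairs
print(combined.shape)

# Same union with automatically generated branch names
auto_union = make_union(CountVectorizer(), CountVectorizer(ngram_range=(2, 2)))
auto_combined = auto_union.fit_transform(sentences)
```

Since a union is itself a transformer, it can be used as a step inside a `Pipeline`, and each branch can in turn be a pipeline.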

While scikit-learn pipelines help with managing the transformation from raw data, there may be many steps before this takes place in your pipeline. These pipelines are often referred to as _ETL pipelines_ for "Extract, Transform, Load." In an _ETL pipeline_, the data is pulled or extracted from some source (like a database), transformed or manipulated, and then "loaded" into whatever system or analysis requires it. We will see this in the lab. Other ways you can manage such ETL pipelines besides the sklearn tools include:
- [Luigi](https://github.com/spotify/luigi), developed by Spotify
- [Airflow](https://github.com/airbnb/airflow), developed by Airbnb.

<a name="ind-practice"></a>
## Putting it all together

**Check** Revisit the dataset of lab 1.4. How could you use `make_pipeline` and `make_union` to build a pipeline that performs the same steps all in one pass?

1. Review lab 1.4 and identify the steps that were performed.
2. For each of these steps, figure out what the input and the output are:
    - is the input the whole dataframe or only a subset of the features?
    - is the output new features or a prediction?
3. For each of these steps, identify what kind of transformer is needed:
    - is it a custom transformer?
    - does scikit-learn provide a transformer like this out of the box?
4. If different features are treated differently, how can they be recombined?




## GridSearch

Finally, one of the biggest advantages of a pipeline is the ability to perform a grid search across the whole pipeline to tune the hyperparameters of different steps. Each key of the hyperparameter dictionary is the name of a step (the name we gave it in `Pipeline`, or the automatically generated name when using `make_pipeline`), followed by two underscores and the parameter name (e.g. `logisticregression__C`). The values are the ones we wish to search over, as before. Hence we run the whole grid search over a single pipeline object. This is one of the biggest reasons to use pipelines and unions, and we will see it in the lab (though right at the end).
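
The sketch below shows this notation with a pipeline built by `make_pipeline`, tuning one parameter of each step. The data is synthetic, generated with `make_classification` only so that the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to make the example self-contained
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

toy_pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Keys are <step name>__<parameter>; make_pipeline named the steps
# 'standardscaler' and 'logisticregression'
toy_params = {
    'standardscaler__with_mean': [True, False],
    'logisticregression__C': [0.1, 1, 10],
}

toy_search = GridSearchCV(toy_pipe, param_grid=toy_params, scoring='accuracy', cv=3)
toy_search.fit(X_toy, y_toy)

# Best combination among the 2 x 3 = 6 candidates
print(toy_search.best_params_)
```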

<a name="conclusion"></a>
## Conclusion
We have learnt how to use the `Pipeline` construct to chain several processing steps into a single object. This lets us treat data processing from a more abstract and more powerful perspective, and it's a precursor to the work we will do with Big Data technologies.

***
### ADDITIONAL RESOURCES

- [Pipelines and Feature Union](http://scikit-learn.org/stable/modules/pipeline.html)
- [Example with complex pipeline](http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#example-hetero-feature-union-py)