## Pipeline Components
In machine learning, it is common to run a sequence of algorithms to process and learn from data. E.g., a simple text document processing workflow might include several stages:

* Split each document’s text into words.
* Convert each document’s words into a numerical feature vector.
* Learn a prediction model using the feature vectors and labels.

Let's start by understanding different components of a pipeline:
### 1. Transformer

A Transformer is an abstraction that includes feature transformers and learned models. Technically, a Transformer implements a method transform(), which converts one DataFrame into another, generally by appending one or more columns. For example:

* A feature transformer might take a DataFrame, read a column (e.g., text), map it into a new column (e.g., feature vectors), and output a new DataFrame with the mapped column appended.
* A learning model might take a DataFrame, read the column containing feature vectors, predict the label for each feature vector, and output a new DataFrame with predicted labels appended as a column.



### 2. Estimator

An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data. Technically, an Estimator implements a method fit(), which accepts a DataFrame and produces a Model, which is a Transformer. 

For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Model and hence a Transformer.
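The Estimator/Model relationship can be sketched in plain Python. This is only an illustration of the contract, not the Spark API: the class names `MeanCenterer` and `MeanCentererModel` are hypothetical, and plain lists stand in for DataFrames.

```python
class MeanCenterer:
    """Hypothetical Estimator: fit() learns something from the data."""

    def fit(self, values):
        mean = sum(values) / len(values)
        # fit() does not transform anything itself; it returns a Model.
        return MeanCentererModel(mean)


class MeanCentererModel:
    """Hypothetical Model: a Transformer that holds the learned state."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, values):
        # Use the state learned during fit() to produce new data.
        return [v - self.mean for v in values]


model = MeanCenterer().fit([1.0, 2.0, 3.0])   # Estimator -> Model
centered = model.transform([1.0, 2.0, 3.0])   # Model acts as a Transformer
print(centered)  # [-1.0, 0.0, 1.0]
```

The point of the split is that the learned state lives in the object returned by `fit()`, so the same fitted model can later be applied to new data.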


### How it works
A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. For Transformer stages, the transform() method is called on the DataFrame. For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline), and that Transformer’s transform() method is called on the DataFrame.

We illustrate this for the simple text document workflow. The figure below is for the training time usage of a Pipeline.

<img src="images/ml-pipeline.png">

Steps involved in the above pipeline:

1. Tokenization -- Converting raw text into tokenized form (Transformer)
2. HashingTF -- Converting words to feature vectors (Transformer)
3. Logistic Regression -- Fitting a logistic regression model to the data (Estimator)
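The training-time flow described above can be sketched as a small loop. This is a simplified pure-Python illustration under stated assumptions, not Spark's implementation: `Lowercase`, `VocabularyCounter`, `VocabularyModel`, and `fit_pipeline` are hypothetical names, and lists of strings stand in for DataFrames.

```python
class Lowercase:
    """Hypothetical Transformer stage: it only has transform()."""

    def transform(self, docs):
        return [d.lower() for d in docs]


class VocabularyCounter:
    """Hypothetical Estimator stage: fit() learns a vocabulary."""

    def fit(self, docs):
        vocab = sorted({word for d in docs for word in d.split()})
        return VocabularyModel(vocab)


class VocabularyModel:
    """Hypothetical fitted Transformer returned by VocabularyCounter.fit()."""

    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, docs):
        # One count per vocabulary word, per document.
        return [[d.split().count(w) for w in self.vocab] for d in docs]


def fit_pipeline(stages, data):
    """Run the stages in order, fitting Estimators along the way."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):
            stage = stage.fit(data)      # Estimator -> Transformer
        fitted.append(stage)             # becomes part of the fitted pipeline
        data = stage.transform(data)     # pass transformed data onward
    return fitted, data


fitted_stages, features = fit_pipeline(
    [Lowercase(), VocabularyCounter()],
    ["Spark ML", "ml pipelines ML"],
)
print(features)  # [[1, 0, 1], [2, 1, 0]]
```

The list of fitted stages plays the role of the PipelineModel: it contains only Transformers and can be applied to new data without refitting.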

## How to make your custom transformers?

What if we want to write our own transformer for the data we are dealing with? Should we create a class from scratch and define `fit`, `transform`, `fit_transform`, and every other method ourselves? To simplify the work, we can follow the pattern of scikit-learn's existing transformers.

Let's dig deeper into the Transformer class by looking at the source code of one of scikit-learn's transformers, the one-hot encoder: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

How to check the source code?

<img src="images/transformers.png">

A transformer class in scikit-learn inherits from two parent classes:
1. BaseEstimator
2. TransformerMixin

To read the source code for the BaseEstimator class, go here: https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/base.py#L176


To read the source code for the TransformerMixin class, go here: https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/base.py#L490

TransformerMixin:
Inheriting from `TransformerMixin` means we only need to write our `fit` and `transform` methods; we get `fit_transform` for free.


BaseEstimator:
Inheriting from `BaseEstimator` gives us `get_params` and `set_params` for free.
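A short sketch of what the two parent classes provide. The `ScaleBy` class and its `factor` parameter are hypothetical and exist only for this illustration; listing the mixin before `BaseEstimator` follows the ordering scikit-learn recommends.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ScaleBy(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that multiplies the input by a constant."""

    def __init__(self, factor=1.0):
        # Store constructor arguments under the same name so that
        # get_params / set_params can find them.
        self.factor = factor

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X * self.factor


scaler = ScaleBy(factor=2.0)
X = np.array([[1.0, 2.0], [3.0, 4.0]])

doubled = scaler.fit_transform(X)   # provided by TransformerMixin
params = scaler.get_params()        # provided by BaseEstimator
scaler.set_params(factor=3.0)       # provided by BaseEstimator
```

Only `__init__`, `fit`, and `transform` are written by hand here; the other three methods come from the parent classes.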

## Making our own Transformers and Estimators

Example: a transformer that does nothing.

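```python
from sklearn.base import BaseEstimator, TransformerMixin

class LazyTransformer(TransformerMixin, BaseEstimator):
    """A transformer that returns its input unchanged."""

    def __init__(self):
        pass

    def fit(self, x, y=None):
        # Nothing to learn, so just return the transformer itself.
        return self

    def transform(self, x):
        # Pass the data through untouched.
        return x
```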

Exercise
--------------

1. Write a transformer that adds some number to the input, the number that is added should be passed in `__init__`
2. Write a transformer that normalizes the input:
   - in the fit method you must save the column means
3. Combine these 2 transformers into a pipeline:
   - hint: write a class that accepts list of transformers as argument
   
HINTS: All transformers are classes! All classes must have an `__init__` function. All transformers must inherit from the TransformerMixin parent class. All estimators must inherit from the BaseEstimator parent class. All transformers must have `fit` and `transform` functions (methods).

In [5]:
import numpy as np
from sklearn.base import TransformerMixin, BaseEstimator

In [6]:
## Fill in the answer in the template below:
import numpy as np

# answer - start

from sklearn.base import TransformerMixin

class AdderTransformer(TransformerMixin):
    pass  # TODO: implement __init__, fit, and transform


class MeanNormalizer(TransformerMixin):
    pass  # TODO: implement __init__, fit, and transform


class TransformerPipeline(TransformerMixin):
    pass  # TODO: implement __init__, fit, and transform


# answer - end

In [7]:
# %load solutions/transformerSol.py
class AdderTransformer(TransformerMixin):
    
    def __init__(self, add=0):
        self.add = add
        
    def fit(self, x, y = None):
        return self
    
    def transform(self, x):
        return x + self.add
    
class MeanNormalizer(TransformerMixin):
    
    def __init__(self):
        pass
        
    def fit(self, x, y = None):
        self.means = x.mean(axis=0)
        return self
    
    def transform(self, x):
        return x - self.means    
    
class TransformerPipeline(TransformerMixin):
    
    def __init__(self, transformers):
        self.transformers = transformers
        
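    def fit(self, x, y = None):
        x_ = x.copy()
        for transformer in self.transformers:
            # Fit each stage on the output of the previous stage, then
            # transform so the next stage sees the transformed data.
            x_ = transformer.fit(x_, y).transform(x_)
        return self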
        
    def transform(self, x):
        x_ = x.copy()
        for transformer in self.transformers:
            x_ = transformer.transform(x_)
        return x_

In [8]:
X = np.ones((10,10))
adder = AdderTransformer(add=1)

In [9]:
X

array([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])

In [10]:
##Add 1
add = AdderTransformer(add = 1)
add.fit_transform(X)

array([[2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2., 2., 2., 2., 2., 2.]])

In [12]:
## Mean normalize
norm = MeanNormalizer()
norm.fit_transform(X)

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [13]:
transformers = [AdderTransformer(add=1), 
                MeanNormalizer()]
pipeline = TransformerPipeline(transformers)

In [None]:
pipeline.fit_transform(X) 

## Tying it all together

## How do we build a pipeline using scikit-learn?
Fortunately, scikit-learn provides a set of helper functions for building pipelines.
Two of them are the most important:

1. `sklearn.pipeline.make_pipeline`

    In our previous example, we could define our combined transformer like this:
    
```python
from sklearn.pipeline import make_pipeline

adder_normalizer = make_pipeline(
    AdderTransformer(add=10),
    MeanNormalizer()
)
```
Calling `fit` on the pipeline is the same as calling `fit` on each estimator or transformer in turn, transforming the input and passing it on to the next step. The pipeline has all the methods that the last estimator in the pipeline has; that is, if the last estimator is a classifier, the pipeline can be used as a classifier. If the last estimator is a transformer, so is the pipeline.
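As a rough illustration of a pipeline behaving like its last step, the sketch below chains a scaler and a classifier that ship with scikit-learn. The data set is made up for this example and chosen to be trivially separable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Tiny, made-up, clearly separable data set (illustrative only).
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# The last step is a classifier, so the pipeline exposes predict().
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X, y)                      # fits the scaler, then the classifier
predictions = clf.predict(X)       # scales X, then predicts

# make_pipeline names each step after its lowercased class name.
step_names = [name for name, _ in clf.steps]
print(step_names)  # ['standardscaler', 'logisticregression']
```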

2. `sklearn.pipeline.make_union`

    Creates a union of transformers: each transformer is applied to the same input, and their outputs are concatenated side by side.
    
    ```
    
             transformer 1
           /               \
          /                 \
    input                     output
          \                 /    
           \               /
             transformer 2
             
    ```
             
    It is useful when the dataset consists of several types of data that one must 
    deal with separately.
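A minimal sketch of `make_union`, assuming two scalers that ship with scikit-learn as the branches and a small made-up array as input. Both transformers receive the same input, and the output columns are stacked next to each other.

```python
import numpy as np
from sklearn.pipeline import make_union
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up input with 3 rows and 2 columns (illustrative only).
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

union = make_union(StandardScaler(), MinMaxScaler())
combined = union.fit_transform(X)

# 2 columns from StandardScaler followed by 2 columns from MinMaxScaler.
print(combined.shape)  # (3, 4)
```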


Alternative way to define pipelines
--------------

```python
from sklearn.pipeline import Pipeline

adder_normalizer = Pipeline([
    ('adder', AdderTransformer(add=10)),
    ('normalizer', MeanNormalizer()),    
])

print(adder_normalizer)

# Example output (memory addresses will differ):
# Pipeline(steps=[('adder', <__main__.AdderTransformer object at 0x7f9387473750>), ('normalizer', <__main__.MeanNormalizer object at 0x7f9387137e50>)])
```
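One practical benefit of naming the steps explicitly is that each step can be looked up and reconfigured by name. The sketch below assumes two built-in scalers and the hypothetical step names `standardize` and `rescale`.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

pipe = Pipeline([
    ("standardize", StandardScaler()),
    ("rescale", MinMaxScaler()),
])

# Look up a step by the name given above.
first_step = pipe.named_steps["standardize"]

# Change a step's parameter using the <step name>__<parameter> syntax.
pipe.set_params(standardize__with_mean=False)
print(pipe.named_steps["standardize"].with_mean)  # False
```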



Reference:

1. https://spark.apache.org/docs/1.6.2/ml-guide.html#transformers
2. https://github.com/rachelkberryman/DSR_Model_Pipelines_Course/blob/master/1.3%20Pipelines.ipynb