
Modeling pipelines
=================

Thinking about modeling as a series of transformations is really helpful.
Pipelines and functional transformations are the cleanest way to preprocess the data.
The idea has its roots in category theory, a branch of mathematics.

Functional transformers are reusable, and like Lego blocks they can be combined into much more complicated structures.

Assumptions
-------------------

1. We will use the scikit-learn interface to pipelines.
2. We will use pandas dataframes as inputs to pipelines, which is convenient for data with named columns of mixed types.

Machine learning pipelines have 2 types of building blocks: transformers and estimators.



Transformers
---------

Transformers are blocks that take an input, produce an output, and can be chained with other transformers.

For example:

```
Data -> [ Select variables ] -> [ Normalize ] -> [ Reduce dimensions ] -> Output
```

`[ Select variables ]` - transformer for selecting variables

`[ Normalize ]` - normalization step

`[ Reduce dimensions ]` - dimension reduction


-------------------

Because every transformer takes and returns the same type of data, a chain of
transformers is itself a transformer.

```
Input -> [ [ Select variables ] -> [ Normalize ] -> [ Reduce dimensions ] ] -> Output

Input -> [               Data preprocessing transformation                ] -> Output
```
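The composition idea can be sketched with scikit-learn's own building blocks. The concrete steps chosen here (`VarianceThreshold`, `StandardScaler`, `PCA`) and the random data are illustrative assumptions that stand in for the three boxes in the diagram.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 5))
X[:, 0] = 1.0  # a constant column that the selection step should drop

# Three transformers chained together behave like one transformer.
preprocessing = make_pipeline(
    VarianceThreshold(),   # [ Select variables ]: drops zero-variance columns
    StandardScaler(),      # [ Normalize ]
    PCA(n_components=2),   # [ Reduce dimensions ]
)

X_out = preprocessing.fit_transform(X)
print(X_out.shape)  # (50, 2)
```

The combined object exposes the same `fit` and `transform` methods as each of its parts, so it can be nested inside a larger pipeline.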

-------------------

An example of a transformer that does nothing:

```python
from sklearn.base import BaseEstimator, TransformerMixin

class LazyTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        pass
        
    def fit(self, x, y = None):
        return self
    
    def transform(self, x):
        return x
```

-------------------

Notice that there are 2 methods:

1. **fit** - learns information from the data and stores it on the transformer, which makes the transformer stateful
2. **transform** - applies the transformation

There are 2 types of transformers:

1. **stateful** - they learn something from the data when `fit` is called
2. **stateless** - they don't learn anything, so `fit` simply returns `self`
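The difference can be illustrated with a small sketch. `StandardScaler` is used here as an example of a stateful transformer, and `AddOne` is a hypothetical stateless class written only for this illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler


class AddOne(BaseEstimator, TransformerMixin):
    """Hypothetical stateless transformer: fit learns nothing."""

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        return x + 1


X_train = np.array([[1.0, 10.0], [3.0, 30.0]])
X_new = np.array([[2.0, 20.0]])

# Stateful: fit stores the column means and scales learned from X_train.
scaler = StandardScaler().fit(X_train)
print(scaler.mean_)             # [ 2. 20.]
print(scaler.transform(X_new))  # [[0. 0.]]

# Stateless: transform works without fit and depends only on its input.
print(AddOne().transform(X_new))  # [[ 3. 21.]]
```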

**Why are stateless transformers useful?**

Transformers that don't need historical data to learn can be used in a type of learning
called `online learning`. Online learning fits pipelines well because the algorithm
learns from a stream of observations.

It doesn't keep the history, so there is no stored dataset on which a stateful transformer could be fitted.
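A minimal sketch of this setting, under several assumptions: the stream is simulated with random batches, `Log1pTransformer` is a hypothetical stateless transformer, and `SGDRegressor.partial_fit` stands in for the online learning algorithm.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import SGDRegressor


class Log1pTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical stateless transformer: nothing is learned in fit."""

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        return np.log1p(x)


rng = np.random.RandomState(0)
transformer = Log1pTransformer()
model = SGDRegressor(random_state=0)

# Simulated stream: each batch is seen once and then thrown away.
for _ in range(20):
    X_batch = rng.uniform(0, 10, size=(32, 3))
    y_batch = np.log1p(X_batch).sum(axis=1)
    # No fit is needed, so the transformer works without any history.
    model.partial_fit(transformer.transform(X_batch), y_batch)

print(model.coef_.shape)  # (3,)
```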


Exercise
--------------

1. Write a transformer that adds a number to the input; the number to add should be passed to `__init__`
2. Write a transformer that normalizes the input:
   - in the fit method you must save the column means
3. Combine these 2 transformers into a pipeline:
   - hint: write a class that accepts a list of transformers as an argument

In [18]:
import numpy as np
from sklearn.base import BaseEstimator,TransformerMixin

# answer - start

# fill the classes with code
class AdderTransformer(BaseEstimator, TransformerMixin):
    def __init__(self,add=0):
        self.add = add

    def fit(self, x, y = None):
        return self

    def transform(self, x):
        return x+self.add

class MeanNormalizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, x, y = None):
        self.mean = np.mean(x, axis=0)  # column means, as the exercise asks
        return self

    def transform(self, x):
        return x - self.mean

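class TransformerPipeline(BaseEstimator, TransformerMixin):
    def __init__(self, list_transformers):
        # Store the argument under its own name so that
        # BaseEstimator.get_params() (used by repr and clone) can find it
        self.list_transformers = list_transformers

    def fit(self, x, y = None):
        x_ = x.copy()
        for trans in self.list_transformers:
            # fit each step on the output of the previous steps
            trans.fit(x_)
            x_ = trans.transform(x_)
        return self

    def transform(self, x):
        x_ = x.copy()
        for trans in self.list_transformers:
            x_ = trans.transform(x_)
        return x_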

# answer - end

# tests
X = np.ones((10,10))
adder = AdderTransformer(add=1)
assert np.all(adder.transform(X) == X + 1), "Adder transformer wrong"

normalizer = MeanNormalizer()
assert np.allclose(normalizer.fit_transform(X),np.zeros((10,10))), "Mean normalizer wrong"

double_adder = TransformerPipeline([AdderTransformer(add=1), 
                                    AdderTransformer(add=2)])

assert np.allclose(double_adder.transform(X), X+3), "TransformerPipeline wrong"

**Double click to see the solution**

<div class='spoiler'>

class AdderTransformer(TransformerMixin):
    
    def __init__(self, add=0):
        self.add = add
        
    def fit(self, x, y = None):
        return self
    
    def transform(self, x):
        return x + self.add
    
class MeanNormalizer(TransformerMixin):
    
    def __init__(self):
        pass
        
    def fit(self, x, y = None):
        self.means = x.mean(axis=0)
        return self
    
    def transform(self, x):
        return x - self.means    
    
class TransformerPipeline(TransformerMixin):
    
    def __init__(self, transformers):
        self.transformers = transformers
        
    def fit(self, x, y = None):
        x_ = x.copy()
        for transformer in self.transformers:
            transformer.fit(x_)
            x_ = transformer.transform(x_)
        return self
        
    def transform(self, x):
        x_ = x.copy()
        for transformer in self.transformers:
            x_ = transformer.transform(x_)
        return x_
</div>

Scikit-learn pipelines to the rescue
-------------

Fortunately scikit-learn provides a set of helpful functions to deal with pipelines.
2 of them are the most important:

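1. `sklearn.pipeline.make_pipeline`

    In our previous example we could define our transformer like this:
    
```python
from sklearn.pipeline import make_pipeline

# AdderTransformer and MeanNormalizer are the classes from the exercise above
adder_normalizer = make_pipeline(
    AdderTransformer(add=10),
    MeanNormalizer()
)
```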

2. `sklearn.pipeline.make_union`

    Creates a union of transformers: each transformer is applied to the same input
    and their outputs are concatenated side by side.
    
    ```
    
             transformer 1
           /               \
          /                 \
    input                     output
          \                 /    
           \               /
             transformer 2
             
    ```
             
    It is useful when the dataset consists of several types of data that one must 
    deal with separately.
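A short sketch of a union follows. The two branches (`StandardScaler` and `PCA`) and the random input are assumptions picked only to make the column-wise concatenation visible.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_union
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(size=(20, 4))

# Both branches receive the same input; their outputs are stacked side by side.
union = make_union(
    StandardScaler(),       # transformer 1: keeps all 4 columns
    PCA(n_components=2),    # transformer 2: produces 2 columns
)

X_out = union.fit_transform(X)
print(X_out.shape)  # (20, 6)
```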


Alternative way to define pipelines
--------------

```python
from sklearn.pipeline import Pipeline

adder_normalizer = Pipeline([
    ('adder', AdderTransformer(add=10)),
    ('normalizer', MeanNormalizer()),    
])

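print(adder_normalizer)

# Example output (the exact repr depends on the scikit-learn version and on how the classes are defined):
# Pipeline(steps=[('adder', <__main__.AdderTransformer object at 0x7f9387473750>), ('normalizer', <__main__.MeanNormalizer object at 0x7f9387137e50>)])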
```

It is useful to name the steps because sometimes we want to control the steps from outside - for example when searching for parameters.
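As a sketch of controlling steps from outside, the example below addresses a parameter as `<step name>__<parameter name>`. The step names, the `Ridge` model at the end of the pipeline, and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge()),
])

# Change a parameter of one step through the pipeline itself.
pipeline.set_params(ridge__alpha=10.0)
print(pipeline.named_steps['ridge'].alpha)  # 10.0

# The same naming scheme is used when searching for parameters.
search = GridSearchCV(pipeline, param_grid={'ridge__alpha': [0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)
print(search.best_params_)
```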

Heterogeneous data
==========================

Real datasets are rarely plain matrices of numbers.
Usually they are a mix of:
- categorical features
- numerical features
- dates
- text data
- columns with and without missing values

Still you must create 1 pipeline to process all these types of information.

Possible transformations:
- **categorical features**:
    - one hot encoding - converting each category to a binary column
    - convert to numerical values - by using a hash of the categorical variable
    - target averaging - replace each category with the average of the target for that category
    
- **numerical features**:
    - fill missing values
    - create bins with ranges 
    - normalize, scale
    
- **text**:
    - use bag of words vectorization
    - word2vec, sentence2vec

- **dates**:
    - extract years, months, days, days of week

With pipelines you can split the data flow into separate branches, one per data type, and then merge the results.
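A per-type split of this kind can be sketched with `sklearn.compose.make_column_transformer`, which is one possible tool and an assumption of this example rather than something prescribed above. The toy dataframe and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: one categorical column, one numerical column with a gap.
df = pd.DataFrame({
    'city': ['Paris', 'Berlin', 'Paris', 'Rome'],
    'age': [31.0, np.nan, 45.0, 28.0],
})

preprocessing = make_column_transformer(
    # categorical branch: one hot encoding
    (OneHotEncoder(handle_unknown='ignore'), ['city']),
    # numerical branch: fill missing values, then scale
    (make_pipeline(SimpleImputer(strategy='mean'), StandardScaler()), ['age']),
)

X_out = preprocessing.fit_transform(df)
print(X_out.shape)  # (4, 4): 3 one hot columns + 1 scaled numerical column
```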

Estimators
----------

A pipeline normally ends with an estimator, that is, a predictive algorithm.

For example:

```
Data -> [ Select variables ] -> [ Normalize ] -> [ Reduce dimensions ] -> [ Linear Regression ] -> Prediction
```

or more generally

```
Data -> [ Data preprocessing ] -> [ Estimator ] -> Prediction
```
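The diagram above can be sketched as follows. The preprocessing steps and the synthetic regression data are assumptions for illustration; only the overall shape (preprocessing followed by an estimator) comes from the diagram.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=100)

model = make_pipeline(
    StandardScaler(),        # [ Normalize ]
    PCA(n_components=3),     # [ Reduce dimensions ]
    LinearRegression(),      # [ Estimator ]
)

# fit runs fit_transform on every preprocessing step, then fits the estimator.
model.fit(X, y)

# predict runs transform on every preprocessing step, then the estimator's predict.
predictions = model.predict(X[:5])
print(predictions.shape)  # (5,)
```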