Modeling pipelines
=================

Thinking about modeling as a series of transformations is really helpful.
Pipelines built from functional transformations are a clean way to preprocess data.
The idea has its roots in category theory, a branch of mathematics that studies composition.

Functional transformers are reusable, and you can build complicated things from them (think of Lego blocks).

Assumptions
-------------------

1. We will use the scikit-learn interface to pipelines.
2. We will use pandas DataFrames as inputs to pipelines, so that transformers can refer to columns by name.

Machine learning pipelines have two types of building blocks: transformers and estimators.



Transformers
---------

Transformers are blocks that take an input, produce an output, and can be chained with other transformers.

For example:

```
Data -> [ Select variables ] -> [ Normalize ] -> [ Reduce dimensions ] -> Output
```

`[ Select variables ]` - transformer for selecting variables

`[ Normalize ]` - normalization step

`[ Reduce dimensions ]` - dimension reduction


-------------------

Because every transformer takes and returns the same type of data, a chain of
transformers is itself a transformer.

```
Input -> [ [ Select variables ] -> [ Normalize ] -> [ Reduce dimensions ] ] -> Output

Input -> [               Data preprocessing transformation                ] -> Output
```
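
A minimal sketch of this composition, assuming scikit-learn's `make_pipeline` helper (introduced later in this text) and synthetic NumPy data. The choice of `StandardScaler` and `PCA`, and the column selection done with a `FunctionTransformer`, are illustrative stand-ins for the three boxes above rather than a prescribed setup.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

rng = np.random.default_rng(0)
data = rng.normal(size=(20, 5))  # synthetic data: 20 rows, 5 variables

# Three transformers chained into one: select -> normalize -> reduce dimensions
preprocessing = make_pipeline(
    FunctionTransformer(lambda x: x[:, :4]),  # keep the first 4 variables
    StandardScaler(),
    PCA(n_components=2),
)
reduced = preprocessing.fit_transform(data)

# The chain is itself a transformer, so it can be a step in another pipeline
outer = make_pipeline(preprocessing, StandardScaler())
rescaled = outer.fit_transform(data)
```

Because `preprocessing` exposes `fit` and `transform` like any single step, it can be nested inside a larger pipeline without special handling.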

-------------------

An example of a transformer that does nothing:

```python
from sklearn.base import BaseEstimator, TransformerMixin

class LazyTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        pass
        
    def fit(self, x, y = None):
        return self
    
    def transform(self, x):
        return x
```

-------------------

Notice that there are two methods:

1. **fit** - learns information from the data and stores it on the transformer
2. **transform** - applies the transformation

There are two types of transformers:

1. **stateful** - they learn something from the data when `fit` is called
2. **stateless** - they don't learn anything; `fit` just returns `self`
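
The difference can be sketched with two transformers that ship with scikit-learn. This is only an illustration on made-up numbers: `FunctionTransformer` wrapping `np.log1p` plays the stateless role, and `StandardScaler` plays the stateful one.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer, StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
new = np.array([[10.0]])

# Stateless: fit learns nothing, transform only applies the function
log_transformer = FunctionTransformer(np.log1p)
log_new = log_transformer.fit(train).transform(new)

# Stateful: fit stores the column mean and scale of the training data
scaler = StandardScaler().fit(train)
learned_mean = scaler.mean_[0]  # 2.0, the mean of the training column

# transform reuses the statistics learned from train, not those of new
scaled_new = scaler.transform(new)
```

The stateful transformer gives different results depending on which data it was fitted on, which is why `fit` should be called on training data only.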

Exercise
--------------

1. Write a transformer that adds some number to the input
2. Write a transformer that normalizes the input by subtracting the column means:
   - in the fit method you must save the column means
3. Combine these 2 transformers into a pipeline:
   - hint: write a class that accepts a list of transformers as an argument

```python
from sklearn.base import BaseEstimator, TransformerMixin


class AdderTransformer(BaseEstimator, TransformerMixin):
    """Stateless transformer that adds a constant to the input."""

    def __init__(self, add=0):
        self.add = add

    def fit(self, x, y=None):
        # Nothing to learn
        return self

    def transform(self, x):
        return x + self.add


class MeanNormalizer(BaseEstimator, TransformerMixin):
    """Stateful transformer that subtracts the column means learned in fit."""

    def fit(self, x, y=None):
        # Trailing underscore marks an attribute learned from data
        self.means_ = x.mean(axis=0)
        return self

    def transform(self, x):
        return x - self.means_


class TransformerPipeline(BaseEstimator, TransformerMixin):
    """Applies a list of transformers one after another."""

    def __init__(self, transformers):
        self.transformers = transformers

    def fit(self, x, y=None):
        x_ = x.copy()
        for transformer in self.transformers:
            # Fit each step on the output of the previous step
            transformer.fit(x_)
            x_ = transformer.transform(x_)
        return self

    def transform(self, x):
        x_ = x.copy()
        for transformer in self.transformers:
            x_ = transformer.transform(x_)
        return x_
```

Scikit-learn pipelines to the rescue
-------------

Fortunately, scikit-learn provides helper functions for building pipelines.
Two of them are the most important:

1. `sklearn.pipeline.make_pipeline`

    In our previous example we could define our transformer like this:
    
```python
from sklearn.pipeline import make_pipeline

adder_normalizer = make_pipeline(
    AdderTransformer(add=10),
    MeanNormalizer()
)
```

2. `sklearn.pipeline.make_union`

    Creates a union of transformers: every transformer receives the same input, and their outputs are concatenated column-wise.
    
    ```
    
             transformer 1
           /               \
          /                 \
    input                     output
          \                 /    
           \               /
             transformer 2
             
    ```
             
    It is useful when the dataset consists of several types of data that one must 
    deal with separately.
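
A small sketch of `make_union` on synthetic numeric data. The two branches (`StandardScaler` and a one-component `PCA`) are arbitrary choices for illustration; the point is only that both see the same input and their outputs end up side by side.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_union
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
data = rng.normal(size=(10, 3))  # synthetic data: 10 rows, 3 columns

# Both transformers are fitted on the same input
union = make_union(StandardScaler(), PCA(n_components=1))
features = union.fit_transform(data)

# 3 scaled columns followed by 1 PCA component
n_columns = features.shape[1]
```

The column order of the result follows the order in which the transformers were passed to `make_union`.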


Heterogeneous data
==========================

Real datasets are rarely plain matrices of numbers.
In practice they are a mix of:

- categorical features
- numerical features
- dates
- text data

Any of these can also contain missing values.

Still, you must create one pipeline that processes all these types of information.

Possible transformations:

- **categorical features**:
    - one-hot encoding - convert each category to a binary column
    - convert to numerical values - by hashing the categorical variable
    - target averaging - replace each category with the average of the target for that category
    
- **numerical features**:
    - fill missing values
    - create bins with ranges 
    - normalize, scale
    
- **text**:
    - use bag of words vectorization
    - word2vec, sentence2vec

- **dates**:
    - extract years, months, days, days of week
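
As one concrete case from the list above, date parts can be extracted with a stateless transformer. This is a sketch assuming pandas input; the column name `created_at` and the function `extract_date_parts` are hypothetical names chosen for the example.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def extract_date_parts(frame, column="created_at"):
    """Return year, month, day and day of week for one date column."""
    dates = pd.to_datetime(frame[column])
    return pd.DataFrame(
        {
            "year": dates.dt.year,
            "month": dates.dt.month,
            "day": dates.dt.day,
            "day_of_week": dates.dt.dayofweek,  # Monday is 0
        },
        index=frame.index,
    )


# Wrapping the function makes it usable as a pipeline step
date_features = FunctionTransformer(extract_date_parts)

frame = pd.DataFrame({"created_at": ["2024-01-15", "2024-02-29"]})
parts = date_features.fit_transform(frame)
```

Nothing is learned in `fit` here, so the same transformer can be applied to any frame that has the expected column.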

With pipelines you can split the data flow into separate branches, one per type of data, and then combine the results.
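
A minimal sketch of one pipeline that treats a numerical and a categorical column differently. The toy DataFrame and the `select` helper are assumptions made for this example; `select` is a hypothetical stand-in for a column selector, built from scikit-learn's `FunctionTransformer`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler


def select(columns):
    """Hypothetical helper: stateless transformer that keeps only `columns`."""
    return FunctionTransformer(lambda frame: frame[columns])


frame = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 40.0],         # numerical, with a missing value
    "city": ["Oslo", "Rome", "Oslo", "Lima"],  # categorical
})

preprocessing = make_union(
    # numerical branch: fill missing values, then scale
    make_pipeline(select(["age"]), SimpleImputer(strategy="mean"), StandardScaler()),
    # categorical branch: one-hot encoding
    make_pipeline(select(["city"]), OneHotEncoder(handle_unknown="ignore")),
)

# 1 numerical column + 3 one-hot columns; the result may be a SciPy sparse
# matrix because OneHotEncoder returns sparse output by default
features = preprocessing.fit_transform(frame)
```

Each branch starts with its own selector, so adding another data type means adding another branch to the union.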

Estimators
----------

Normally, estimators (predictive algorithms) sit at the end of the pipeline.
    
For example

```
Data -> [ Select variables ] -> [ Normalize ] -> [ Reduce dimensions ] -> [ Linear Regression ] -> Prediction
```

or more generally

```
Data -> [ Data preprocessing ] -> [ Estimator ] -> Prediction
```
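
A sketch of such a pipeline on synthetic data, assuming scikit-learn's `StandardScaler`, `PCA` and `LinearRegression` as the concrete steps. The data and the coefficients used to generate the target are made up for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
features = rng.normal(size=(50, 4))
# Synthetic target: a linear combination of the features plus noise
target = features @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=50)

# Transformers first, estimator last
model = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())

# fit runs fit/transform on every transformer, then fits the estimator
model.fit(features, target)

# predict runs transform on every transformer, then the estimator's predict
predictions = model.predict(features[:5])
```

A pipeline that ends with an estimator exposes `predict` instead of `transform`, so the whole chain can be used like a single model.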

Production example of a model
---------------------------

```python

# string.lower exists only in Python 2; str.lower is the Python 3 equivalent
lower = str.lower

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline, make_union

from some_library import ChangeCounter, TextStats, LanguageModelScore
from some_library import (ApplyFunction, MissingValuesFiller,
                          PandasSelector, make_pandas_categorical_vectorizer)
from some_library import Densify

"""
Data preparation and validation:
1. Put the selector as the first transformer to get meaningful errors when
   the pipeline is called
2. Fill missing values
3. Convert src_lang and dst_lang to lowercase

Features:
1. Character ngrams of src_text and dst_text
2. Word ngrams (with high frequency) of src_text and dst_text
3. TextStats - many different measures to compare src_text and dst_text
4. One hot encoding of categorical features: src_lang, dst_lang, category

Model:
1. RandomForestClassifier
"""

classifier = make_pipeline(
    PandasSelector(columns=['category', 'src_lang', 'dst_lang',
                            'src_text', 'dst_text']),
    MissingValuesFiller(),
    ApplyFunction(columns=['src_lang', 'dst_lang'], fun=lower),

    # here we start adding features
    make_union(
        make_pipeline(
            PandasSelector(columns=['src_text']),
            CountVectorizer(analyzer='char',
                            ngram_range=(1, 1),
                            min_df=10)
        ),
        make_pipeline(
            PandasSelector(columns=['dst_text']),
            CountVectorizer(analyzer='char',
                            ngram_range=(1, 1),
                            min_df=10)
        ),
        make_pipeline(
            PandasSelector(columns=['src_text']),
            CountVectorizer(analyzer='word',
                            ngram_range=(1, 1),
                            min_df=25)
        ),
        make_pipeline(
            PandasSelector(columns=['dst_text']),
            CountVectorizer(analyzer='word',
                            ngram_range=(1, 1),
                            min_df=25)
        ),
        make_pandas_categorical_vectorizer(
            columns=['src_lang', 'dst_lang', 'category']
        ),
        TextStats()
    ),

    # densify makes RandomForestClassifier much faster
    Densify(),
    RandomForestClassifier(
        n_estimators=100,
        n_jobs=-1,
        min_samples_split=20, min_samples_leaf=10,
        verbose=True,
        random_state=1)
)
```