## Python Classes 101

Before we introduce sklearn pipelines, a bit of background about classes and inheritance in Python.

Let's imagine we want to build three agents that take actions: one always turns left, one always turns right, and one randomly does either.

We can do this without inheritance:

In [1]:
import numpy as np

In [2]:
from random import choice


class Left:
    def __init__(self):
        self.name = 'left'
        self.age = 0
        
    def act(self):
        self.age += 1
        return 'go left'
    

class Right:
    def __init__(self):
        self.name = 'right'
        self.age = 0
        
    def act(self):
        self.age += 1
        return 'go right'

We can instantiate these classes and access their methods and attributes:

In [3]:
left = Left()

left.name

'left'

In [4]:
right = Right()

acts = [right.act() for _ in range(3)]
acts

['go right', 'go right', 'go right']

In [5]:
right.age

3

You can see there is a lot of repeated code in the examples above. Let's see how inheritance might help:

In [6]:
class Agent:
    def __init__(self, name):
        self.name = name
        self.age = 0
        
    def act(self):
        self.age += 1
        
        
class Left(Agent):
    def __init__(self, name):
        super().__init__(name)
        
    def act(self):
        super().act()
        return 'left'

In [7]:
left = Left('child left')

acts = [left.act() for _ in range(4)]
acts

['left', 'left', 'left', 'left']

In [8]:
left.age

4

In [9]:
class Right(Agent):
    def __init__(self, name):
        super().__init__(name)
        
    def act(self):
        super().act()
        return 'right'

right = Right('child right')
acts = [right.act() for _ in range(5)]
acts

['right', 'right', 'right', 'right', 'right']

In [10]:
right.age

5

In [11]:
right.name

'child right'

Note how
- data can flow from the child class to the parent via super
- we can access the methods of the parent on the child
- we define the common functionality once
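
The third agent from the introduction (the one that acts randomly) is not built above. A minimal sketch of how it might look follows, assuming the same `Agent` parent; the `RandomAgent` name is hypothetical and not part of the notebook. It does not define `__init__` at all, so the parent's is inherited unchanged.

```python
from random import choice


class Agent:
    def __init__(self, name):
        self.name = name
        self.age = 0

    def act(self):
        # shared bookkeeping: every action ages the agent by one step
        self.age += 1


class RandomAgent(Agent):
    # hypothetical third agent, not defined in the notebook above
    def act(self):
        super().act()
        return choice(['left', 'right'])


agent = RandomAgent('child random')
acts = [agent.act() for _ in range(6)]
print(acts)       # a random mix of 'left' and 'right'
print(agent.age)  # 6
```

Only the behaviour that differs (`act`) is written in the child; the name and age handling come from `Agent`.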


## sklearn pipelines

Why pipelines?
- reusable
- readable
- testable

There are two kinds of pipeline objects:
- transformers (e.g. normalizing or standardizing)
- estimators (e.g. fitting a model)

The focus of this class is on transformers, but the real power of a pipeline comes when you have an estimator at the end.

Let's first take a look at a transformer that adds a number to every element of a dataset:

In [12]:
from sklearn.base import TransformerMixin, BaseEstimator


class Adder(TransformerMixin, BaseEstimator):
    def __init__(self, num):
        self.num = num

    def fit(self, x, y = None):
        return self

    def transform(self, x):
        return x + np.full(shape=x.shape, fill_value=self.num)

Notice that there are 2 methods:

1. fit - learns information about the data, which makes the transformer stateful (`Adder` learns nothing, so its `fit` simply returns `self`)
2. transform - applies the transformation

In [13]:
a = Adder(2)

In [14]:
a.fit_transform(np.zeros((2, 2)))

array([[2., 2.],
       [2., 2.]])
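
Note that `Adder` never defines `fit_transform`. This is the inheritance idea from the first section at work: the method comes from `TransformerMixin`, and `BaseEstimator` contributes `get_params`. The sketch below repeats a slightly simplified `Adder` (it relies on broadcasting instead of `np.full`) so that it runs on its own.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class Adder(TransformerMixin, BaseEstimator):
    def __init__(self, num):
        self.num = num

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # broadcasting adds num to every element
        return x + self.num


adder = Adder(2)
x = np.zeros((2, 2))

# fit_transform is inherited from TransformerMixin
via_mixin = adder.fit_transform(x)
by_hand = adder.fit(x).transform(x)

# get_params is inherited from BaseEstimator and reads the __init__ arguments
params = adder.get_params()
print(params)  # {'num': 2}
```

Both routes give the same array, which is why a custom transformer usually only needs `fit` and `transform`.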

## Practical

Build two tested transformers (write the tests first!)
1. select a column
2. standardization

By standardization I mean

$$ y = \frac{x-\mu}{\sigma} $$ 

Then write an integration test that tests a pipeline of the two transformers together
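
As a quick numeric check of the standardization formula above, here is a small worked example on three values. It assumes the population standard deviation (NumPy's default, `ddof=0`); the notebook does not say which definition is intended.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0])

mu = x.mean()    # 4.0
sigma = x.std()  # population standard deviation (ddof=0), about 1.633

y = (x - mu) / sigma
print(y)  # approximately [-1.2247, 0.0, 1.2247]
```

After standardization the values have mean 0 and standard deviation 1, which is what a test of the transformer can assert.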

## Building a pipeline

A pipeline should end with an estimator - essentially a stateful transformer that learns statistics from data.  Often the last step will be a model.
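
As an illustration of a pipeline that ends in a model, here is a minimal sketch using sklearn's built-in `StandardScaler` and `LinearRegression` rather than the custom transformers in this notebook. The toy data and target are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x_train = rng.uniform(size=(20, 2)) * 100
y_train = 3 * x_train[:, 0] - 2 * x_train[:, 1] + 5  # noiseless toy target

# the scaler is fitted and applied first, then the model is fitted on its output
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(x_train, y_train)

# predict runs transform on every earlier step before calling the model
x_new = rng.uniform(size=(4, 2)) * 100
preds = model.predict(x_new)
print(preds)
```

Calling `fit` once trains every step in order, and `predict` reuses the fitted preprocessing automatically.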

Here we will break with the TDD style and build the pipeline first (don't worry - you will get the chance to test it later).

In [107]:
def test_column_selector():
    names = ['a', 'b', 'j']
    
    data = np.hstack([np.arange(3), np.arange(3) * 10, np.arange(3) * 100]).reshape(-1, 3)
    
    selector = ColumnSelector(names, ['a'])
    np.testing.assert_array_equal(
        selector.transform(data), np.zeros(3).reshape(3, 1)
    )
    selector = ColumnSelector(names, ['j'])
    np.testing.assert_array_equal(
        selector.transform(data), np.array([2, 20, 200]).reshape(3, 1)
    )
    selector = ColumnSelector(names, ['a', 'j'])
    np.testing.assert_array_equal(
        selector.transform(data), np.array([[0, 2], [0, 20], [0, 200]])
    )

    
class ColumnSelector(TransformerMixin, BaseEstimator):
    def __init__(self, feature_names, selected_names):
        self.feature_names = feature_names
        self.selected_names = selected_names

    def fit(self, x, y = None):
        return self

    def transform(self, x):
        idxs = [
            self.feature_names.index(n) 
            for n in self.selected_names
        ]
        return x[:, idxs]
    
test_column_selector()

In [108]:
def test_standardizer():
    data = np.random.uniform(size=20).reshape((10, 2)) * 100
    standardizer = Standardizer()
    standardized = standardizer.fit_transform(data)
    # the mean should be zero up to floating point error (in either direction)
    np.testing.assert_allclose(np.mean(standardized, axis=0), 0, atol=1e-12)
    np.testing.assert_allclose(np.var(standardized, axis=0), 1)

    
class Standardizer(TransformerMixin, BaseEstimator):
    def __init__(self):
        pass

    def fit(self, x, y = None):
        self.means = np.mean(x, axis=0, keepdims=True)
        self.stds = np.std(x, axis=0, keepdims=True)
        return self

    def transform(self, x):
        return (x - self.means) / self.stds
    
test_standardizer()

In [109]:
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(ColumnSelector(['a', 'b', 'c'], ['a', 'c']), Standardizer())
pipe

Pipeline(memory=None,
         steps=[('columnselector',
                 ColumnSelector(feature_names=['a', 'b', 'c'],
                                selected_names=['a', 'c'])),
                ('standardizer', Standardizer())],
         verbose=False)

In [110]:
train = np.random.uniform(size=15).reshape((5, 3)) * 100

In [111]:
processed_train = pipe.fit_transform(train)
processed_train

array([[ 0.7906889 , -1.80155413],
       [ 0.09156311,  0.28370057],
       [-0.60781632,  0.05496521],
       [-1.54422337,  1.27956567],
       [ 1.26978768,  0.18332267]])

We can now apply the same transformation to another dataset (e.g. the test set) by calling the `transform` method.

Refitting here is a common mistake in machine learning - you shouldn't refit any preprocessing statistics when testing (in the same way that you don't refit your model parameters).

In [113]:
test = np.random.uniform(size=6).reshape((2, 3)) * 100

processed_test = pipe.transform(test)
processed_test

array([[-1.52377907, -0.51514797],
       [ 2.45170606, -0.53616539]])
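
To see why the statistics should not be refitted, here is a small sketch comparing the two approaches. It uses sklearn's built-in `StandardScaler` and hand-picked data so that the numbers are easy to follow; neither appears in the cells above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[0.0], [10.0], [20.0]])  # mean 10
test = np.array([[20.0], [30.0]])          # shifted relative to train

scaler = StandardScaler().fit(train)

# correct: reuse the statistics learned from the training data
reused = scaler.transform(test)

# incorrect: refitting on the test set hides the shift between the datasets
refit = StandardScaler().fit_transform(test)

print(reused.ravel())  # approximately [1.2247 2.4495]
print(refit.ravel())   # [-1.  1.]
```

With the training statistics the test values are clearly above the training mean; after refitting they look perfectly centered, and that information is lost.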

## Practical

Write an integration test of the entire pipeline

In [None]:
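def test_pipe():
    pipe = make_pipeline(ColumnSelector(['a', 'b', 'c'], ['a', 'c']), Standardizer())

    data = np.random.uniform(size=15).reshape((5, 3)) * 100
    other_data = np.random.uniform(size=6).reshape((2, 3)) * 100

    # statistics of the selected columns ('a' and 'c') of the training data
    means = data[:, [0, 2]].mean(axis=0)
    stds = data[:, [0, 2]].std(axis=0)

    # fitting and transforming the training data standardizes it
    res = pipe.fit_transform(data)
    expected = (data[:, [0, 2]] - means) / stds
    np.testing.assert_allclose(res, expected)

    # transforming other data must reuse the training statistics
    test = pipe.transform(other_data)
    expected_test = (other_data[:, [0, 2]] - means) / stds
    np.testing.assert_allclose(test, expected_test)


test_pipe()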