## Python Classes 101

Before we introduce sklearn pipelines, a bit of background on classes and inheritance in Python.

Let's imagine we want to build three agents that take actions: one always turns left, one always turns right, and one randomly does either.

We can do this without inheritance:

In [1]:
from random import choice


class Left:
    def __init__(self):
        self.name = 'left'
        self.age = 0
        
    def act(self):
        self.age += 1
        return 'go left'
    

class Right:
    def __init__(self):
        self.name = 'right'
        self.age = 0
        
    def act(self):
        self.age += 1
        return 'go right'

We can instantiate these classes and access their methods and attributes:

In [2]:
left = Left()

left.name

'left'

In [3]:
right = Right()

acts = [right.act() for _ in range(3)]
acts

['go right', 'go right', 'go right']

In [4]:
right.age

3

You can see there is a lot of repeated code in the examples above. Let's see how inheritance might help:

In [5]:
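class Agent:
    def __init__(self, name):
        self.name = name
        self.age = 0

    def act(self):
        # shared behaviour: every action ages the agent by one step
        self.age += 1


class Left(Agent):
    def __init__(self, name):
        # pass the name up to Agent.__init__
        super().__init__(name)

    def act(self):
        # reuse the parent's age bookkeeping, then return this agent's action
        super().act()
        return 'left'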

In [6]:
left = Left('child left')

acts = [left.act() for _ in range(4)]
acts

['left', 'left', 'left', 'left']

In [7]:
left.age

4

In [8]:
class Right(Agent):
    def __init__(self, name):
        super().__init__(name)
        
    def act(self):
        super().act()
        return 'right'

right = Right('child right')
acts = [right.act() for _ in range(5)]
acts

['right', 'right', 'right', 'right', 'right']

In [9]:
right.age

5

Note how:
- data can flow from the child class to the parent via super (here the name is passed up to Agent.__init__)
- we can access the methods of the parent on the child
- we only define the functionality to increase the age once
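
The third agent from the introduction, the one that randomly goes left or right, is not built above. A minimal sketch of how it could reuse the same parent is shown here. The class name RandomAgent is an assumption made for illustration and does not appear in the original notebook. Because it needs no extra setup, it leaves out __init__ and simply inherits the parent's.

```python
from random import choice


class Agent:
    def __init__(self, name):
        self.name = name
        self.age = 0

    def act(self):
        # shared bookkeeping: every action ages the agent by one step
        self.age += 1


class RandomAgent(Agent):  # hypothetical name, for illustration only
    # no __init__ here: Agent.__init__ is inherited unchanged

    def act(self):
        super().act()
        return choice(['left', 'right'])


random_agent = RandomAgent('child random')
acts = [random_agent.act() for _ in range(6)]
```

The list acts will differ between runs, but the age is always equal to the number of calls to act.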


## sklearn pipelines

Why pipelines?
- reusable
- readable
- testable

There are two kinds of pipeline objects:
- transformers (e.g. normalizing or standardizing)
- estimators (e.g. fitting a model)

The focus of this class is on transformers, but the real power of a pipeline comes when you have an estimator on the end.
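
As a rough illustration of a pipeline with an estimator on the end, the sketch below chains sklearn's built-in StandardScaler (a transformer) with LinearRegression (an estimator). The toy data is made up for this example and is not from the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# made-up data that follows y = 2x + 1 exactly
x = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * x.ravel() + 1

# the transformer runs first, the estimator is the final step
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(x, y)

# predict applies the fitted scaler, then the fitted regression
predictions = model.predict(np.array([[10.0], [11.0]]))
```

Calling fit once fits every step in order, and predict reuses the statistics learned during fit, so the scaling never has to be repeated by hand.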

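Let's first take a look at a transformer that adds a number to every value in a dataset.

Always inherit from both TransformerMixin (which supplies fit_transform) and BaseEstimator (which supplies get_params and set_params):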

In [10]:
import numpy as np
from sklearn.base import TransformerMixin, BaseEstimator


class Adder(TransformerMixin, BaseEstimator):
    def __init__(self, num):
        self.num = num

    def fit(self, x, y=None):
        # nothing to learn from the data, so just return self
        return self

    def transform(self, x):
        # broadcasting adds num to every element of x
        return x + self.num

Notice that there are 2 methods:

1. fit - learns information from the data and stores it on the object, which is what makes a transformer stateful (Adder has nothing to learn, so its fit just returns self)
2. transform - applies the transformation
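
To make the idea of a stateful transformer more concrete, here is a minimal sketch of one whose fit actually learns something. MeanCenterer is a hypothetical class invented for this example, and the arrays are made-up data.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(TransformerMixin, BaseEstimator):  # hypothetical example class
    def fit(self, x, y=None):
        # learn one mean per column and store it on the object
        self.means_ = np.mean(x, axis=0)
        return self

    def transform(self, x):
        # reuse the means learned in fit, even on data not seen before
        return x - self.means_


train = np.array([[1.0, 10.0], [3.0, 30.0]])
test = np.array([[2.0, 20.0], [4.0, 40.0]])

centerer = MeanCenterer().fit(train)
centered_test = centerer.transform(test)
```

The test data is shifted by the training means (2 and 20), not by its own means. Keeping what was learned in fit separate from what is applied in transform is the point of the two-method split.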

In [11]:
a = Adder(2)

In [12]:
a.fit_transform(np.zeros((2, 2)))

array([[2., 2.],
       [2., 2.]])
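
The Adder class never defines fit_transform, yet the call above works. The sketch below restates Adder so it runs on its own and pokes at what each parent class contributes. It uses only get_params, clone and fit_transform from sklearn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone


class Adder(TransformerMixin, BaseEstimator):
    def __init__(self, num):
        self.num = num

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        return x + self.num


a = Adder(2)

# from BaseEstimator: parameters are read from the __init__ signature
params = a.get_params()

# clone relies on get_params to build a fresh copy with the same parameters
b = clone(a)

# from TransformerMixin: fit_transform is fit followed by transform
out = a.fit_transform(np.zeros((2, 2)))
```

This only works if __init__ stores each argument unchanged under an attribute of the same name, as Adder does with num.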

## Practical

Build two tested transformers (write the tests first!)
1. select a column
2. standardization

By standardization I mean

$$ y = \frac{x-\mu}{\sigma} $$ 

Then write an integration test that tests a pipeline of the two transformers together
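
As a small worked example of the standardization formula above, the sketch below applies it to a made-up array. It assumes the population standard deviation, which is what np.std computes by default. The text does not say whether the sample standard deviation is intended instead.

```python
import numpy as np

# made-up values chosen so the statistics come out as round numbers
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()    # 5.0
sigma = x.std()  # 2.0 (population standard deviation, ddof=0)

y = (x - mu) / sigma
```

After the transformation y has mean 0 and standard deviation 1, which is a natural property to assert in a test.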

## Building a pipeline

A pipeline should end with an estimator, which is essentially a stateful transformer that learns statistics from data. Often the last step will be a model.

In [127]:
def test_column_selector():
    names = ['a', 'b', 'j']
    selector = ColumnSelector(names)
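    # after the reshape the rows are [0, 1, 2], [0, 10, 20], [0, 100, 200],
    # so column 'a' is all zeros and column 'j' is [2, 20, 200]
    data = np.hstack([np.arange(3), np.arange(3) * 10, np.arange(3) * 100]).reshape(-1, 3)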
    
    np.testing.assert_array_equal(
        selector.transform(['a'], data), np.zeros(3).reshape(3, 1)
    )
    np.testing.assert_array_equal(
        selector.transform(['j'], data), np.array([2, 20, 200]).reshape(3, 1)
    )
    np.testing.assert_array_equal(
        selector.transform(['a', 'j'], data), np.array([[0, 2], [0, 20], [0, 200]])
    )

    
class ColumnSelector(TransformerMixin, BaseEstimator):
    def __init__(self, feature_names):
        self.feature_names = feature_names

    def fit(self, x, y = None):
        return self

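    def transform(self, names, data):
        # note: a sklearn pipeline calls transform with a single data argument,
        # so this two-argument signature only works when called directly
        idxs = [self.feature_names.index(n) for n in names]
        return data[:, idxs]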
    
test_column_selector()

In [124]:
def test_standardizer():
    data = np.random.uniform(size=20).reshape((2, 10)) * 100
    standardizer = Standardizer()
    standardized = standardizer.fit_transform(data)
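    # the row means should be zero up to floating point error
    # (assert_array_less would also pass for large negative means)
    np.testing.assert_allclose(np.mean(standardized, axis=1), 0, atol=1e-12)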
    np.testing.assert_allclose(np.var(standardized, axis=1), 1)

    
class Standardizer(TransformerMixin, BaseEstimator):
    def __init__(self):
        pass

    def fit(self, x, y = None):
        # axis=1 gives one mean and one std per row of x
        self.means = np.mean(x, axis=1, keepdims=True)
        self.stds = np.std(x, axis=1, keepdims=True)
        return self

    def transform(self, data):
        return (data - self.means) / self.stds
    
test_standardizer()

In [126]:
data = np.random.uniform(size=20).reshape((2, 10)) * 100
Standardizer().fit_transform(data)

array([[-0.19024985,  1.44271957, -0.86379375, -0.07289577,  0.28804235,
         0.3318064 ,  0.08458321, -0.5509719 ,  1.57051565, -2.03975591],
       [-0.17284634,  1.15951359, -1.7555374 , -1.40611278, -0.23068822,
        -0.46682105,  0.24115641,  0.09574635,  1.1855813 ,  1.35000812]])

In [139]:
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(ColumnSelector, Standardizer)
pipe

Pipeline(memory=None,
         steps=[('type-1', <class '__main__.ColumnSelector'>),
                ('type-2', <class '__main__.Standardizer'>)],
         verbose=False)

In [140]:
train = np.random.uniform(size=15).reshape((3, 5)) * 100

In [141]:
train_p = pipe.fit_transform(train)

AttributeError: 'numpy.ndarray' object has no attribute 'fit'
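
The error above most likely comes from passing the classes themselves to make_pipeline rather than instances: fit_transform is then called on the class, and the array ends up in the place of self. A second obstacle is that a pipeline calls transform with a single data argument, while ColumnSelector.transform above takes two. The sketch below is one possible way to get a working pipeline and an integration test, under two assumptions that are not in the original notebook: the columns to keep are given to the constructor through a hypothetical selected parameter, and the standardizer computes its statistics per column (axis=0) instead of per row.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline


class ColumnSelector(TransformerMixin, BaseEstimator):
    def __init__(self, feature_names, selected):
        # selected is a hypothetical parameter: the names of the columns to keep
        self.feature_names = feature_names
        self.selected = selected

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        # single-argument transform, as a pipeline expects
        idxs = [self.feature_names.index(n) for n in self.selected]
        return x[:, idxs]


class Standardizer(TransformerMixin, BaseEstimator):
    def fit(self, x, y=None):
        # assumption: one mean and one std per column (axis=0)
        self.means_ = np.mean(x, axis=0, keepdims=True)
        self.stds_ = np.std(x, axis=0, keepdims=True)
        return self

    def transform(self, x):
        return (x - self.means_) / self.stds_


def test_pipeline():
    names = ['a', 'b', 'c', 'd', 'e']
    train = np.random.uniform(size=15).reshape((3, 5)) * 100

    # pass instances, not classes
    pipe = make_pipeline(ColumnSelector(names, ['a', 'e']), Standardizer())
    train_p = pipe.fit_transform(train)

    assert train_p.shape == (3, 2)
    np.testing.assert_allclose(np.mean(train_p, axis=0), 0, atol=1e-12)
    np.testing.assert_allclose(np.std(train_p, axis=0), 1)


test_pipeline()
```

Storing the statistics per column means the fitted pipeline can later transform data with a different number of rows, which the per-row version cannot do.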