# When (Not) to Write Classes

- How to define and use classes
- Advantage 1: bundle code and data
- Advantage 2: Inheritance
- Advantage 3: Harmonize interfaces
- Advantage 4: Avoid long lists of arguments
- For each case I discuss disadvantages and alternative implementations 

# Introduction

### Disclaimer

- There are heated debates about what I'm going to present, held by people who are far more qualified than I am. Feel free to disagree with everything I say.

### Terminology

- Object: something that consists of data and methods that work on that data
- Object-oriented programming: building code out of objects that interact
- Class: a blueprint for an object
- Instance: an object of a certain class

### Examples

The following example is based on [A Beginner's Python Tutorial](https://en.wikibooks.org/wiki/A_Beginner%27s_Python_Tutorial/Classes).

### Define a class

In [2]:
import numpy as np


class Rectangle:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        
    def area(self):
        return self.x * self.y
    
    def perimeter(self):
        return 2 * (self.x + self.y)
    
    def scale(self, factor):
        self.x *= factor
        self.y *= factor

### Make an instance of the class

In [3]:
rect = Rectangle(x=10, y=5)

### Use the class

In [4]:
rect.area()

50

In [47]:
rect.perimeter()

30

In [9]:
rect.scale(2)

In [10]:
rect.area()

200

In [11]:
rect.perimeter()

60

## You have used objects before

- Everything in Python is an object
    - integers
    - floats
    - strings
    - DataFrames
    - ...
    
- Statsmodels is an object-oriented library for statistics and econometrics
- Here the model classes bundle information on the model specification, datasets and methods for estimation and inference.

In [12]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

dat = sm.datasets.get_rdataset("Guerry", "HistData").data
model = sm.OLS.from_formula('Lottery ~ Literacy + np.log(Pop1831)', data=dat)
results = model.fit()

results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Lottery | R-squared: | 0.348 |
| Model: | OLS | Adj. R-squared: | 0.333 |
| Method: | Least Squares | F-statistic: | 22.2 |
| Date: | Wed, 20 Mar 2019 | Prob (F-statistic): | 1.9e-08 |
| Time: | 19:39:26 | Log-Likelihood: | -379.82 |
| No. Observations: | 86 | AIC: | 765.6 |
| Df Residuals: | 83 | BIC: | 773.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 246.4341 | 35.233 | 6.995 | 0.000 | 176.358 | 316.510 |
| Literacy | -0.4889 | 0.128 | -3.832 | 0.000 | -0.743 | -0.235 |
| np.log(Pop1831) | -31.3114 | 5.977 | -5.239 | 0.000 | -43.199 | -19.424 |

| | | | |
|---|---|---|---|
| Omnibus: | 3.713 | Durbin-Watson: | 2.019 |
| Prob(Omnibus): | 0.156 | Jarque-Bera (JB): | 3.394 |
| Skew: | -0.487 | Prob(JB): | 0.183 |
| Kurtosis: | 3.003 | Cond. No. | 702.0 |
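The claim that everything in Python is an object can be checked directly. The short sketch below uses only built-in types; the variable names `n` and `s` are just illustrative.

```python
# Built-in values are instances of classes and carry their own methods.
n = 255
s = "rectangle"

print(type(n))                # <class 'int'>
print(isinstance(n, object))  # True
print(n.bit_length())         # 8: a method bundled with the integer
print(s.upper())              # RECTANGLE: a method bundled with the string
```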


# Discussion

The main advantage of a Rectangle class is that we can bundle data and code that works with the data. While this sounds very reasonable, I have some concerns:

- Is it really so intuitive to bundle code and data?
    - For me, a rectangle is just a thing with two sides, not a thing with two sides that can calculate its area and perimeter and rescale itself.
    - Similar problems exist in real examples, e.g. in Statsmodels
    
- In real examples, bundles often get too big (god-classes)
    - E.g. in Statsmodels
- The scale method changes the state of the rectangle in place instead of creating a new instance
    - This invites bugs
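To make the last point concrete, here is a minimal sketch of how in-place mutation can surprise a caller. The helper `area_after_scaling` is hypothetical and only exists for this illustration.

```python
class Rectangle:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def area(self):
        return self.x * self.y

    def scale(self, factor):
        # Mutates the instance in place and returns None.
        self.x *= factor
        self.y *= factor


def area_after_scaling(rectangle, factor):
    """Hypothetical helper that looks like a pure computation."""
    rectangle.scale(factor)
    return rectangle.area()


rect = Rectangle(x=10, y=5)
scaled_area = area_after_scaling(rect, 2)

print(scaled_area)  # 200
print(rect.area())  # 200, not 50: the caller's rectangle was changed too
```

Nothing in the signature of the helper warns the caller that `rect` will be different afterwards.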

Here is an alternative implementation:

In [85]:
from collections import namedtuple

Rectangle = namedtuple('Rectangle', ['x', 'y'])


def area(rectangle):
    return rectangle.x * rectangle.y


def perimeter(rectangle):
    return 2 * (rectangle.x + rectangle.y)


def scale(rectangle, factor):
    new_rect = Rectangle(x=rectangle.x * factor, y=rectangle.y * factor)
    return new_rect

In [86]:
rect = Rectangle(10, 5)

area(rect)

50

In [87]:
perimeter(rect)

30

In [88]:
big_rect = scale(rect, 100)

- The new Rectangle is immutable
- Each function has a clear interface (this would matter more if there were more arguments)
- It would be easy to split the functions into different modules (only relevant in bigger examples)
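A quick way to see the immutability is to try to change a field. The sketch below repeats the namedtuple definition so that it runs on its own; `_replace` is the standard namedtuple method for building a modified copy.

```python
from collections import namedtuple

Rectangle = namedtuple("Rectangle", ["x", "y"])

rect = Rectangle(x=10, y=5)

# Assigning to a field of a namedtuple raises an AttributeError.
try:
    rect.x = 20
except AttributeError as error:
    print(f"AttributeError: {error}")

# A changed rectangle has to be a new object; the old one stays intact.
wider = rect._replace(x=20)
print(rect)   # Rectangle(x=10, y=5)
print(wider)  # Rectangle(x=20, y=5)
```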

# Inheritance

The previous example was just to show you how the mechanics of defining, instantiating and using a class work. I would not yet call that object-oriented programming. Now we extend the example slightly:

Let's first define a very abstract Shape class that doesn't do anything but tells us what a shape should look like:

In [15]:
class Shape:
    def __init__(self):
        pass
    
    def area(self):
        raise NotImplementedError(
            'area has to be defined by the subclass.')
        
    def perimeter(self):
        raise NotImplementedError(
            'perimeter has to be defined by the subclass.')
        
    def scale(self, factor):
        raise NotImplementedError(
            'scale has to be defined by the subclass.')
        

Now we can actually define a rectangle as a subclass of Shape:

In [16]:
class Rectangle(Shape):
    def __init__(self, x, y):
        self.x = x
        self.y = y
        
    def area(self):
        return self.x * self.y
    
    def perimeter(self):
        return 2 * (self.x + self.y)

In [17]:
rect = Rectangle(10, 2)
rect.scale(5)

NotImplementedError: scale has to be defined by the subclass.

In [18]:
class Rectangle(Shape):
    def __init__(self, x, y):
        self.x = x
        self.y = y
        
    def area(self):
        return self.x * self.y
    
    def perimeter(self):
        return 2 * (self.x + self.y)
    
    def scale(self, factor):
        self.x *= factor
        self.y *= factor

The real strength of object-oriented code is that inheritance lets us avoid code duplication:

In [19]:
class Square(Rectangle):
    def __init__(self, x):
        super().__init__(x, x)
        

In [20]:
sq = Square(5)

In [21]:
sq.area()

25

In [22]:
sq.perimeter()

20

# Discussion


- The advantage of subclassing Shape is mainly that we get a good error message when we forget to implement something that should be part of any class that defines a shape.
- But this is a very expensive form of documentation / testing because the documentation or test code is part of the real code and could introduce bugs!
- Also, you can go too far with abstract classes:

![](https://i.redd.it/rj8raf1riyny.png)

- Inheritance can be a powerful tool to reduce code duplication
- The above example is very unrealistic. Why do we even need to implement a special case for a square? In less contrived examples, inheritance also becomes more difficult
- Multiple inheritance is a nightmare to debug!
- Personal experience: When I tried to inherit methods from statsmodels.GenericLikelihoodModel in my skillmodels package, I ended up overriding most of the methods in the subclass because they would not work out of the box.
- Let's look at the [Statsmodels source code](https://github.com/statsmodels/statsmodels/blob/master/statsmodels/base/model.py) just so you see a non-trivial example of object-oriented code
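As a side note on the error-message point: the standard library's `abc` module offers a variant of the same idea in which the error appears when the incomplete class is instantiated, not when the missing method is first called. This is a sketch of that alternative, not the approach used above, and the incomplete `Rectangle` is deliberately missing `scale`.

```python
from abc import ABC, abstractmethod


class Shape(ABC):
    @abstractmethod
    def area(self):
        ...

    @abstractmethod
    def perimeter(self):
        ...

    @abstractmethod
    def scale(self, factor):
        ...


class Rectangle(Shape):
    """Deliberately incomplete: scale is not implemented."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def area(self):
        return self.x * self.y

    def perimeter(self):
        return 2 * (self.x + self.y)


# Instantiation already fails, before any method is called.
try:
    Rectangle(10, 2)
except TypeError as error:
    print(f"TypeError: {error}")
```

The trade-off from the discussion still applies: the check lives in the real code, not in a test or in documentation.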

# Use classes to harmonize interfaces

In [23]:
class Circle(Shape):
    def __init__(self, r):
        self.r = r
        
    def area(self):
        return np.pi * self.r ** 2
    
    
    def perimeter(self):
        return 2 * np.pi * self.r
    
    def scale(self, factor):
        self.r *= factor

In [24]:
ci = Circle(5)

In [25]:
ci.area()

78.53981633974483

In [26]:
ci.perimeter()

31.41592653589793
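Because Circle and Rectangle expose the same method names, calling code does not need to know which shape it holds. The sketch below is a reduced version of the classes above: it drops the Shape base class (Python only needs the method names to match) and uses `math.pi` instead of `np.pi` so that it runs without NumPy.

```python
import math


class Rectangle:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def area(self):
        return self.x * self.y


class Circle:
    def __init__(self, r):
        self.r = r

    def area(self):
        return math.pi * self.r ** 2


shapes = [Rectangle(10, 5), Circle(5)]

# The same loop works for every object that has an area method.
for shape in shapes:
    print(type(shape).__name__, round(shape.area(), 2))

total_area = sum(shape.area() for shape in shapes)
```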

# Discussion

- Circles, Squares and Rectangles have a very similar interface
- Using classes to harmonize interfaces can make code much more readable!
- There are (imperfect) substitutes without classes:
    - multiple dispatch
    - we could write two modules: circle_funcs and rectangle_funcs
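Python has no built-in multiple dispatch, but `functools.singledispatch` from the standard library gives a restricted version that dispatches on the type of the first argument only. The sketch below assumes namedtuple versions of the shapes; the `Circle` namedtuple is introduced here purely for illustration.

```python
import math
from collections import namedtuple
from functools import singledispatch

Rectangle = namedtuple("Rectangle", ["x", "y"])
Circle = namedtuple("Circle", ["r"])


@singledispatch
def area(shape):
    # Fallback for types without a registered implementation.
    raise NotImplementedError(
        f"area is not defined for {type(shape).__name__}")


@area.register(Rectangle)
def _(shape):
    return shape.x * shape.y


@area.register(Circle)
def _(shape):
    return math.pi * shape.r ** 2


print(area(Rectangle(10, 5)))  # 50
print(area(Circle(5)))
```

Callers see one `area` function for all shapes, while the data stays in plain immutable containers.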

# Use classes to avoid long lists of arguments


Remember the Kalman filters from the programming course:

In [93]:
def predict(state, root_cov, params, shock_sds, kappa):
    """Predict *state* in next period and adjust *root_cov*.

    Args:
        state (pd.Series): period t estimate of the unobserved state vector
        root_cov (pd.DataFrame): lower triangular matrix square-root of the
            covariance matrix of the state vector in period t
        params (dict): keys are the names of the states (latent
            factors), values are series with parameters for the transition
            equation of that state.
        shock_sds (pd.Series): standard deviations of the shocks
        kappa (float): scaling parameter for the unscented predict

    Returns:
        predicted_state (pd.Series)
        predicted_root_cov (pd.DataFrame)

    """
    pass


def update(state, root_cov, measurement, loadings, meas_var):
    """Update *state* and *root_cov* with a *measurement*.

    Args:
        state (pd.Series): pre-update estimate of the unobserved state vector
        root_cov (pd.DataFrame): lower triangular matrix square-root of the
            covariance matrix of the state vector before the update
        measurement (float): the measurement to incorporate
        loadings (pd.Series): the factor loadings
        meas_var (float): variance of the measurement error

    Returns:
        updated_state (pd.Series)
        updated_root_cov (pd.DataFrame)

    """
    pass
    

We could define a class that stores most of the arguments as attributes and leaves only a few as method arguments. Example:

In [94]:
class KalmanFilter:
    def __init__(
        self,
        state, 
        root_cov, 
        params, 
        shock_sds, 
        kappa, 
        loadings, 
        meas_var
    ):
        self.state = state
        self.root_cov = root_cov
        self.params = params
        self.shock_sds = shock_sds
        self.kappa = kappa
        self.loadings = loadings
        self.meas_var = meas_var
        
    def update(self, measurement):
        pass
    
    def predict(self):
        pass

# Discussion

- This only moves the problem to the instantiation of the class
- Now there are no helpful function signatures that show how information flows through the code

### Alternative 1: functools partial

In [27]:
from functools import partial

def some_func(x, y, z):
    print(x + y + z)
    
new_func = partial(some_func, x=5, y=6)

new_func(z=3)


14


functools.partial could be used to get:

- an update function that only has the arguments state, root_cov and measurement
- a predict function that only has the arguments state and root_cov

This preserves interfaces that document the flow of information, but gets rid of the constant parameters.
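Here is a rough sketch of what that could look like. The function bodies are toy scalar stand-ins that I made up for illustration (a linear transition with a hypothetical `"persistence"` parameter, and `kappa` unused); they are not the implementations from the course.

```python
from functools import partial


def predict(state, root_cov, params, shock_sds, kappa):
    """Toy scalar stand-in for the predict step (kappa is unused here)."""
    predicted_state = params["persistence"] * state
    predicted_var = (params["persistence"] * root_cov) ** 2 + shock_sds ** 2
    return predicted_state, predicted_var ** 0.5


def update(state, root_cov, measurement, loadings, meas_var):
    """Toy scalar stand-in for the update step."""
    var = root_cov ** 2
    gain = var * loadings / (loadings ** 2 * var + meas_var)
    updated_state = state + gain * (measurement - loadings * state)
    updated_var = (1 - gain * loadings) * var
    return updated_state, updated_var ** 0.5


# Bind the constant parameters once ...
predict_fixed = partial(
    predict, params={"persistence": 0.5}, shock_sds=1.0, kappa=2.0)
update_fixed = partial(update, loadings=1.0, meas_var=1.0)

# ... so that each call only mentions what changes between periods.
state, root_cov = 0.0, 2.0
state, root_cov = update_fixed(state, root_cov, measurement=3.0)
state, root_cov = predict_fixed(state, root_cov)
print(state, root_cov)
```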

### Alternative 2: bundle some of the arguments

For example, we could define a namedtuple with all fixed parameters (params, shock_sds, kappa, loadings and meas_var) and then redefine the interfaces:

In [70]:
def predict(state, root_cov, fixed_params):
    pass


def update(state, root_cov, measurement, fixed_params):
    pass
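A minimal sketch of how such a bundle could be defined and passed around. The name `FixedParams`, the `"persistence"` key and the toy scalar function bodies are assumptions made for this illustration only.

```python
from collections import namedtuple

FixedParams = namedtuple(
    "FixedParams", ["params", "shock_sds", "kappa", "loadings", "meas_var"])


def predict(state, root_cov, fixed_params):
    """Toy scalar stand-in; reads only the fields it needs."""
    persistence = fixed_params.params["persistence"]
    predicted_state = persistence * state
    predicted_var = (persistence * root_cov) ** 2 + fixed_params.shock_sds ** 2
    return predicted_state, predicted_var ** 0.5


def update(state, root_cov, measurement, fixed_params):
    """Toy scalar stand-in; reads only the fields it needs."""
    var = root_cov ** 2
    loadings = fixed_params.loadings
    gain = var * loadings / (loadings ** 2 * var + fixed_params.meas_var)
    updated_state = state + gain * (measurement - loadings * state)
    updated_var = (1 - gain * loadings) * var
    return updated_state, updated_var ** 0.5


# The bundle is created once and is immutable afterwards.
fixed = FixedParams(
    params={"persistence": 0.5},
    shock_sds=1.0,
    kappa=2.0,
    loadings=1.0,
    meas_var=1.0,
)

state, root_cov = update(0.0, 2.0, measurement=3.0, fixed_params=fixed)
state, root_cov = predict(state, root_cov, fixed_params=fixed)
print(state, root_cov)
```

Compared to the class version, state and root_cov still travel visibly through the function signatures.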

# Summary

- There are many advantages of object orientation
    - bundling of data and code
    - inheritance
    - harmonized interfaces
    - shortened argument lists
- My main problems with object-oriented code are:
    - By using mutable objects and methods with side effects, programs have a lot of state
    - Classes often become too large
    - Methods often don't have good signatures that show the flow of information in the code
    - Multiple inheritance is hard to debug
- Often, the alternatives are similarly or more readable and scale better when the programs become larger