# Intro to Object-Oriented Programming For Data Scientists
## Learn a new way of thinking - a new perspective
<img src='images/pexels.jpg'></img>
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://www.pexels.com/@pixabay?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Pixabay</a>
        on 
        <a href='https://www.pexels.com/photo/balance-blur-boulder-close-up-355863/?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Pexels</a>
    </strong>
</figcaption>

### Setup

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### What is Object-Oriented Programming?

There is no doubt that data scientists spend the majority of their time doing procedural programming - writing code that executes as a sequence of steps. Doing data analysis in Jupyter Notebooks and writing Python scripts to clean data are both examples of this. This way of thinking and working comes naturally to people. After all, that is how we go through each day: one step, one process at a time.

However, as data scientists, we should be eternally grateful that there is another system of coding - Object-Oriented Programming (OOP). OOP gives us massively powerful `numpy` arrays, endlessly flexible `pandas` DataFrames and beautifully designed `sklearn` models. OOP enables different chunks of code to interact with each other, allowing programmers to build flexible tools and frameworks that work seamlessly together.

Object-Oriented Programming allows us to think about concepts and tools in terms of patterns and behaviors. In other words, anything that is awkward to build as a sequential workflow is a good candidate for OOP. For example, can you imagine `pandas` DataFrames being built as a single script made up of thousands of loose functions?

The most important trait of OOP is that it lets us combine the state and the behavior of a concept in one place. For example, if you were to represent a programmer in procedural code, you would have to store that programmer's features in separate variables like name, salary, age, etc. Besides, each programmer has similar behaviors like coding, drinking coffee, sleeping and eating. You would have to write each of these behaviors as a standalone function, entirely unaware of the others and of the variables it belongs with.

However, in OOP, these tasks are elegantly achieved using data structures called **objects** and blueprints called **classes**. Classes give us a way to talk about things in a unified way. For example, if we created a class named `Programmer`, it would describe the features and behaviors that every programmer shares. If we wanted to talk about one particular programmer, we could just create an instance of the `Programmer` class. This instance would be called an *object*.

Using a single class, we could create as many objects as we like, each one representing a single, unique item.
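
To make this concrete, here is a minimal sketch of what such a class could look like. The attribute and method names (`name`, `salary`, `coffees`, `code`, `drink_coffee`) are hypothetical, and the syntax itself is explained step by step in the sections below.

```python
class Programmer:
    """A hypothetical blueprint bundling a programmer's state and behavior."""

    def __init__(self, name, salary, age):
        # State: the features of one particular programmer
        self.name = name
        self.salary = salary
        self.age = age
        self.coffees = 0

    def code(self, language):
        # Behavior that reads the object's own state
        return f"{self.name} is writing {language} code."

    def drink_coffee(self):
        # Behavior that changes the object's own state
        self.coffees += 1
        return self.coffees


# Two objects (instances) created from the same class
alice = Programmer("Alice", 90_000, 31)
bob = Programmer("Bob", 85_000, 27)

alice.drink_coffee()
print(alice.code("Python"))        # Alice is writing Python code.
print(alice.coffees, bob.coffees)  # 1 0
```

Each object keeps its own state: Alice's coffee count changes without affecting Bob's.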

### Why Do You Need OOP As a Data Scientist?

Most of the time, [procedural programming](https://en.wikipedia.org/wiki/Procedural_programming) is more than enough for personal work in data science. However, as soon as you start writing code for others - often for software engineers on your team or elsewhere in your company - you will likely need OOP. As you will see later, well-written OOP code is well-organized and reusable. Its modular design makes it easier to debug, maintain and share across a team of coworkers.

Besides, much of the open-source community writes OOP code, including the data science community. After gaining enough experience and knowledge, you may want to contribute to the packages you love or create your own. Contributing to existing packages and frameworks requires solid OOP skills, and this is even more true if you want to develop packages yourself.

OOP might also help some of your personal projects in terms of code design. For example, web scraping projects immediately come to mind. If you can't find the data you are looking for, you may want to obtain it yourself by doing web scraping. In that case, you will often find yourself scraping thousands of examples of similar items and that would be a perfect opportunity to create a class for them.

For example, if you are scraping car prices from some website, you can create a `Car` class which stores all characteristics of a car in a single object. Actually, I did such a project where I scraped car prices from a website called Autotrader. Sadly, I made the mistake of saving everything in dictionaries. The result was a complete disaster. Each car had several attributes and some attributes had attributes of their own. As you might guess, my dictionaries became a nested jungle, making my code messy and unmaintainable even for my future self.
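
As a rough illustration of the difference, compare a nested dictionary with a pair of small classes. The attribute names and values below are made up for this sketch and are not taken from the actual scraping project.

```python
# Dictionary version: the structure lives only in the keys
car_dict = {
    "make": "Toyota",
    "price": 15_000,
    "engine": {"fuel": "petrol", "horsepower": 132},
}
print(car_dict["engine"]["horsepower"])  # 132


class Engine:
    def __init__(self, fuel, horsepower):
        self.fuel = fuel
        self.horsepower = horsepower


class Car:
    def __init__(self, make, price, engine):
        self.make = make
        self.price = price
        self.engine = engine  # an attribute that has attributes of its own

    def price_per_horsepower(self):
        # Behavior lives next to the data it needs
        return self.price / self.engine.horsepower


car = Car("Toyota", 15_000, Engine("petrol", 132))
print(car.engine.horsepower)                 # 132
print(round(car.price_per_horsepower(), 2))  # 113.64
```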

You can see more examples of how data scientists use OOP in their projects in this [article](https://towardsdatascience.com/improve-your-data-wrangling-with-object-oriented-programming-914d3ebc83a9).

### Everything Is an Object in Python

You might have heard that Python is an Object-Oriented Programming language. That's entirely true: everything in Python is an object, and each object has a class associated with it. You can see an object's class by calling the `type()` function on it:

In [2]:
type(4)

int

In [3]:
type(2.0)

float

In [4]:
type("Alpha")

str

In [6]:
type(np.mean)

function

In [7]:
type(pd.DataFrame())

pandas.core.frame.DataFrame

The existence of these unified classes is why we can use all `numpy` arrays or `pandas` DataFrames in the same way.
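
Here is a small sketch of the same idea using only built-ins. `type()` returns the class itself, and `isinstance()` checks whether an object belongs to a given class. Classes are objects too, which is why `type(int)` also returns something.

```python
values = [4, 2.0, "Alpha", [1, 2], len]

for value in values:
    # type(value) is the class; __name__ is its name as a string
    print(repr(value), "->", type(value).__name__)

# isinstance() checks whether an object belongs to a class
print(isinstance(4, int))        # True
print(isinstance("Alpha", int))  # False

# Classes are objects as well: the class of a class is `type`
print(type(int))                 # <class 'type'>
```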

As I said earlier, classes bundle together the state and behavior of an item in the form of attributes and methods. For example, take a look at the attributes of a `numpy.ndarray`:

In [8]:
array = np.array([[4, 6, 9, 6, 4],
                  [5, 6, 7, 9, 9]])
# Shape of an array
array.shape

(2, 5)

In [9]:
# Data type
array.dtype

dtype('int32')

In [10]:
array.T

array([[4, 5],
       [6, 6],
       [9, 7],
       [6, 9],
       [4, 9]])

It also has methods such as:

In [11]:
# Reshaping the array
array = array.reshape(5, 2)
array

array([[4, 6],
       [9, 6],
       [4, 5],
       [6, 7],
       [9, 9]])

In [12]:
# Make the array 1D
array = array.flatten()
array

array([4, 6, 9, 6, 4, 5, 6, 7, 9, 9])

In [13]:
# Compute summary stats like the mean
array.mean()

6.5

Attributes and methods both use dot notation - `object.attribute` and `object.method()`. You can see all available attributes and methods of an object by calling the `dir()` function on it:

In [14]:
dir(array)

['T',
 '__abs__',
 '__add__',
 '__and__',
 '__array__',
 '__array_finalize__',
 '__array_function__',
 '__array_interface__',
 '__array_prepare__',
 '__array_priority__',
 '__array_struct__',
 '__array_ufunc__',
 '__array_wrap__',
 '__bool__',
 '__class__',
 '__complex__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__ilshift__',
 '__imatmul__',
 '__imod__',
 '__imul__',
 '__index__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__irshift__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lshift__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__or__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdivmod__',
 ...]
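
The full listing is long because it includes the special double-underscore names. Below is a minimal sketch for narrowing it down to public names and separating attributes from methods with the built-in `callable()`. The variable names are my own, and the lookup is done on the class rather than the object so that no attribute value is actually computed.

```python
import numpy as np

array = np.array([[4, 6, 9, 6, 4],
                  [5, 6, 7, 9, 9]])

# Keep only names that do not start with an underscore
public_names = [name for name in dir(array) if not name.startswith("_")]

# Look each name up on the class: methods are callable, plain attributes are not
array_class = type(array)
methods = [name for name in public_names if callable(getattr(array_class, name))]
attributes = [name for name in public_names if not callable(getattr(array_class, name))]

print("shape" in attributes, "reshape" in methods)  # True True
```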

### Class Anatomy: Attributes and Methods

In this section, you will get to know how classes work and how to write them by creating a `Point` class (a point in a 2D Cartesian plane). Any class definition starts with the `class` keyword, followed by the class name, followed by a colon:

In [15]:
class Point:
    pass

To create an empty class, or a placeholder for its body, we can use the `pass` keyword. Even though the above class does nothing, you can already create instances of it:

In [16]:
point_1 = Point()
print(point_1)

<__main__.Point object at 0x0000022A4D7234F0>


Creating an instance looks just like calling a function.

Now, let's create a method that prints some information about a point:

In [17]:
class Point:
    """
    A class to represent a point on the 2D Cartesian plane
    """
    
    def give_info(self, x, y):
        """A method that gives info about this point"""
        print(f'This point is situated at ({x}, {y}) on 2D plane.')

A method of a class is created in the same way as an ordinary Python function. The only difference is the `self` parameter, which should always be the first parameter of a method:

In [18]:
obj1 = Point()

obj1.give_info(1, 2)

This point is situated at (1, 2) on 2D plane.


If you pay attention, `give_info()` accepts 3 parameters including `self`, but we only passed 2 arguments, for X and Y, and the method worked fine. So, what is this `self`?

As mentioned earlier, a class is only a blueprint; nothing exists until we create objects from it. While writing the class, we need a way to refer to those future objects. The `self` parameter in `give_info` does exactly that: when we call `obj1.give_info(1, 2)`, Python passes `obj1` as `self` automatically. Think of it as a placeholder, a stand-in for whichever object the method is called on: `self.give_info(1, 2)` -> `obj1.give_info(1, 2)`. Also, there is nothing special about the name `self`; it is not a reserved keyword, and any name can be used as long as it comes first in the method's parameter list. But `self` is the agreed standard, so you should stick to it.
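
One way to see what `self` stands for: calling a method on an object is equivalent to calling the function on the class and passing the object explicitly as the first argument. In this small sketch the method returns a string instead of printing it, so the two calls can be compared:

```python
class Point:
    def give_info(self, x, y):
        return f"This point is situated at ({x}, {y}) on 2D plane."


obj1 = Point()

# Python fills in `self` with obj1 automatically...
via_object = obj1.give_info(1, 2)

# ...which is the same as passing obj1 by hand
via_class = Point.give_info(obj1, 1, 2)

print(via_object == via_class)  # True
```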

Now, let's create a new method that sets the coordinates of the point instead of printing them out:

In [19]:
class Point:
    
    def set_coords(self, x, y):
        """A method to set the coordinates of the point"""
        self.x = x
        self.y = y

In this new method, we store the data (the coordinates) as attributes. Attributes are attached to the object through `self`: write `self`, a dot, and then the name of the attribute, as in `self.x = x`.

In [20]:
point_1 = Point()
point_1.set_coords(3, 5)
print("The x coordinate:", point_1.x)
print("The y coordinate:", point_1.y)

The x coordinate: 3
The y coordinate: 5


After creating the object, we call the new `set_coords()` method, passing the X and Y coordinates. Then, we can access the coordinates as attributes because they were created in the method and stored on `self`: `self.x` -> `point_1.x`.
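
One consequence worth noticing: because `x` and `y` are only created inside `set_coords()`, they do not exist on an object until that method has been called. A short sketch of what happens otherwise:

```python
class Point:
    def set_coords(self, x, y):
        self.x = x
        self.y = y


fresh_point = Point()

try:
    fresh_point.x  # set_coords() has not been called yet
except AttributeError as error:
    print("AttributeError:", error)

fresh_point.set_coords(3, 5)
print(fresh_point.x, fresh_point.y)  # 3 5
```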

Now, let's create another method that computes the distance between the point and the origin:

In [21]:
class Point:
    
    def set_coords(self, x, y):
        """A method to set the coordinates of the point"""
        self.x = x
        self.y = y
    
    def distance_to_origin(self):
        """
        A method to calculate the distance between the point and the origin
        """
        return np.sqrt(self.x ** 2 + self.y ** 2)


The `distance_to_origin` method takes no arguments besides `self` but uses the coordinates set by `set_coords`. Just like before, to use an attribute defined anywhere else in the class, you refer to it through `self`.

In [22]:
point = Point()
point.set_coords(6, 5)
print("The distance to origin:", point.distance_to_origin())

The distance to origin: 7.810249675906654


### The Constructor Method

If you paid attention, each attribute of the class was created inside a method. In real projects, classes might have hundreds of attributes, and creating a separate method to set each one would make the class massive and introduce unnecessary code. That's why Python classes have a special constructor method that allows creating attributes upon object creation:

In [23]:
class Point:
    
    def __init__(self, x, y):
        self.x = x
        self.y = y

This constructor method has an exact name - `__init__()`: double underscores, followed by `init`, followed by another pair of underscores. Like any other method, it takes `self` as the first parameter, followed by any other parameters whose values we want to supply whenever we create a new object. Let's see an example of the above constructor:

In [24]:
point = Point(4, 5)
print("The x coordinate:", point.x)
print("The y coordinate:", point.y)

The x coordinate: 4
The y coordinate: 5


Adding parameters to the constructor makes it compulsory for the user to pass values for them:

In [25]:
point = Point()

TypeError: __init__() missing 2 required positional arguments: 'x' and 'y'

As you can see, trying to instantiate the class without arguments raises a `TypeError`.
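
If some arguments should be optional, one common option is to give them default values in the constructor's signature. This is a sketch of that idea, not something the class above does; defaulting to the origin is my own assumption.

```python
class Point:
    def __init__(self, x=0, y=0):
        # Defaults (an assumption for this sketch) place the point at the origin
        self.x = x
        self.y = y


origin = Point()
shifted = Point(4)        # y falls back to its default
custom = Point(y=5, x=4)  # keyword arguments can come in any order

print(origin.x, origin.y)    # 0 0
print(shifted.x, shifted.y)  # 4 0
print(custom.x, custom.y)    # 4 5
```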

In short, you should create an attribute in the constructor whenever it needs to exist from the moment an object is created.

However, you should be able to draw the line between what belongs in the constructor and what belongs in other methods. For example, calculating the distance to the origin should be done in its own method instead of in the constructor. A dedicated method makes it easy to add extra features to that behavior, such as checking whether the passed arguments are of the correct data types. It also keeps the constructor uncluttered and organized.
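
As a sketch of that division of labor, the constructor below only stores the coordinates, while a separate method does the computation and validates its input. The `distance_to` method and its type check are hypothetical additions for illustration.

```python
import math


class Point:
    def __init__(self, x, y):
        # The constructor only stores state
        self.x = x
        self.y = y

    def distance_to(self, other):
        """Return the Euclidean distance to another Point."""
        if not isinstance(other, Point):
            raise TypeError("other must be a Point instance")
        return math.hypot(self.x - other.x, self.y - other.y)


a = Point(0, 0)
b = Point(3, 4)
print(a.distance_to(b))  # 5.0

try:
    a.distance_to((3, 4))  # a tuple, not a Point
except TypeError as error:
    print("TypeError:", error)
```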

### Final Example: A Class to Implement LinearRegression

As a small capstone example, let's create a class that implements a very simple Linear Regression algorithm:

In [29]:
class LinearRegressor:
    
    def fit(self, feature, target):
        pass
    
    def predict(self, new_data):
        pass
    
    def score(self, y_true, y_pred):
        pass

> Tip: Always try to create such blueprints for your classes. Planning goes a long way in creating high-quality code.

The above will be the blueprint of our class. For the sake of simplicity, this LinReg algorithm will only take a single feature to predict the target. In the `fit` method, we will find the slope and intercept of the line of best fit using `numpy`:

In [58]:
class LinearRegressor:
    """
    A class that implements a simple Linear Regression.
    """
    def fit(self, X, y):
        """
        A method to find the slope and intercept for the line of best fit.
        """
        self.slope, self.intercept = np.polyfit(X, y, deg=1)


[`np.polyfit`](https://numpy.org/doc/stable/reference/generated/numpy.polyfit.html) takes two arrays and computes a least-squares polynomial fit. We specify the degree of the polynomial by setting `deg` to 1. For a degree-1 fit, the function returns the slope and intercept of the line of best fit for the given data. However, we should add a check that the two arrays have the same length, just like Scikit-learn does:

In [61]:
class LinearRegressor:
    """
    A class that implements a simple Linear Regression.
    """
    def fit(self, X, y):
        """
        A method to find the slope and intercept for the line of best fit.
        """
        if len(X) == len(y):
            self.slope, self.intercept = np.polyfit(X, y, deg=1)
        else:
            raise ValueError("The dimensions of the arrays don't match...")

Now, if mismatched arrays are passed to `fit`, the user will get an error:

In [62]:
lin_reg = LinearRegressor()
lin_reg.fit(X=[1, 2, 3, 5], y=[2, 3, 4])

ValueError: The dimensions of the arrays don't match...
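
To see what `np.polyfit` returns in isolation, here is a quick check on made-up points that lie exactly on the line y = 2x + 1. With `deg=1`, the coefficients come back highest power first, so the slope precedes the intercept.

```python
import numpy as np

# Synthetic data (not from the tips dataset): points on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

slope, intercept = np.polyfit(x, y, deg=1)

# Compare with a tolerance, since the fit is computed in floating point
print(np.isclose(slope, 2.0), np.isclose(intercept, 1.0))  # True True
```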

Next, we will build the `predict` method, which should be easy:

In [63]:
class LinearRegressor:
    """
    A class that implements a simple Linear Regression.
    """
    def fit(self, X, y):
        """
        A method to find the slope and intercept for the line of best fit.
        """
        self.slope, self.intercept = np.polyfit(X, y, deg=1)

    def predict(self, new_data):
        """
        A method that makes predictions.
        """
        
        self.preds = self.intercept + self.slope * new_data
        
        return self.preds

We build the `predict` method using the formula of simple linear regression:

$$\hat{y} = \text{intercept} + \text{slope} \cdot x$$

Finally, we will implement `score` which will be plain-old Mean Squared Error (MSE):

In [68]:
class LinearRegressor:
    """
    A class that implements a simple Linear Regression.
    """
    def fit(self, X, y):
        """
        A method to find the slope and intercept for the line of best fit.
        """
        self.slope, self.intercept = np.polyfit(X, y, deg=1)

    def predict(self, new_data):
        """
        A method that makes predictions.
        """
        
        self.preds = self.intercept + self.slope * new_data
        
        return self.preds
    
    def score(self, y_true, y_pred):
        """
        A method that computes Mean Squared Error.
        """
        return np.mean((y_true - y_pred) ** 2)
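
As a quick sanity check of the formula used in `score`, here is the same computation on three made-up values, small enough to verify by hand: the errors are 0, 0 and -2, so the squared errors are 0, 0 and 4, and their mean is 4/3.

```python
import numpy as np

# Made-up values for illustration
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 5.0])

mse = np.mean((y_true - y_pred) ** 2)
print(np.isclose(mse, 4 / 3))  # True
```

Note that the subtraction relies on NumPy arrays or pandas Series; two plain Python lists would raise a `TypeError`.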

That's it. Our class is ready! Let's test it using the `tips` dataset that is built into `seaborn`. We will try to predict the `tip` amount from the `total_bill` in a restaurant:

In [69]:
tips = sns.load_dataset('tips')
tips.head()

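|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |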


Let's build feature and target arrays and generate train and test sets:

In [76]:
from sklearn.model_selection import train_test_split

# Build feature / target arrays
X, y = tips.total_bill, tips.tip

# Create train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1121218)

Finally, we will train our custom Linear Regressor and generate predictions and score them:

In [77]:
# Init the regressor
lin_reg = LinearRegressor()

# Fit
lin_reg.fit(X_train, y_train)

# Predict
preds = lin_reg.predict(X_test)

# Score
lin_reg.score(y_test, preds)

1.0685066464995827

We got an MSE of 1.0685. Let's check its performance by doing the same with Scikit-learn's actual `LinearRegression` class:

In [79]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lg = LinearRegression()

# Fit
lg.fit(X_train.values.reshape(-1, 1), y_train.values.reshape(-1, 1))

# Predict and score
preds = lg.predict(X_test.values.reshape(-1, 1))
mean_squared_error(y_test, preds)

1.0685066464995823

As you can see, the results are nearly identical. You can even check the values of slope and intercept:

In [80]:
lin_reg.slope, lin_reg.intercept

(0.1075724043281556, 0.9113442141681357)

In [81]:
lg.coef_, lg.intercept_

(array([[0.1075724]]), array([0.91134421]))

They are identical too! Congratulations! You just built a simple Linear Regressor model from scratch on your own!

### Conclusion

This post was all about the fundamentals of OOP. As you can see, you can build pretty cool things even with the basics. This is Part 1 of a series I am planning to write on OOP. Stay tuned for the next part!