# Intro to Object-Oriented Programming For Data Scientists
## Learn a new way of thinking - a new perspective
<img src='images/pexels.jpg'></img>
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://www.pexels.com/@pixabay?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Pixabay</a>
        on 
        <a href='https://www.pexels.com/photo/balance-blur-boulder-close-up-355863/?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Pexels</a>
    </strong>
</figcaption>

### Setup

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### What is Object-Oriented Programming?

There is no doubt that data scientists spend the majority of their time doing procedural programming - writing code that executes as a sequence of steps. Doing data analysis in Jupyter Notebooks and writing Python scripts to clean data are both examples of this. This way of thinking and working comes naturally to people. After all, that is how we go through each day: one step, one process at a time.

However, as data scientists, we should be eternally grateful that there is another, much better system of coding - Object-Oriented Programming (OOP). OOP gives us massively powerful `numpy` functions, endlessly flexible `pandas` DataFrames and beautifully designed `sklearn` models. OOP enables different chunks of code to interact with each other allowing programmers to build flexible tools and frameworks that work seamlessly together. 

Object-Oriented Programming allows us to think about concepts and tools in terms of patterns and behaviors. In other words, anything that cannot be built using a sequential workflow should be built in OOP. For example, can you imagine `pandas` DataFrames being built as a single script made up of thousands of loose functions?

The most important trait of OOP is that it lets us combine the state and the behavior of a concept. In procedural code, if you were to represent a programmer, you would have to store the features of that programmer in separate variables like name, salary, age, etc. Besides, every programmer shares behaviors like coding, drinking coffee, sleeping and eating. You would have to write each of these behaviors as a separate function, each entirely unaware of the others' existence.

However, in OOP, these tasks are elegantly achieved using data structures called **objects** and blueprints called **classes**. Classes give us a way to talk about programmers in a unified way. For example, if we created a class named `Programmer`, it would describe programmers in general, with every feature and behavior they can have. If we wanted to talk about a certain programmer, we could just create an instance of the `Programmer` class with the specifics of that programmer. This instance would be called an *object*.

Using a single class, we could create as many objects as we like, each representing a single, unique item.
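
As a minimal sketch of this idea, a hypothetical `Programmer` class might bundle a few attributes and one behavior. The attribute names and the `drink_coffee` method are illustrative assumptions, and the `__init__` constructor it relies on is covered in a later section.

```python
class Programmer:
    """A hypothetical blueprint bundling a programmer's state and behavior."""

    def __init__(self, name, age, salary):
        # State: data that belongs to one particular programmer
        self.name = name
        self.age = age
        self.salary = salary
        self.cups_of_coffee = 0

    def drink_coffee(self):
        # Behavior: a method that reads and updates the object's own state
        self.cups_of_coffee += 1
        return f"{self.name} has had {self.cups_of_coffee} cup(s) of coffee."


# Two objects (instances) created from the same class
alice = Programmer("Alice", 30, 90_000)
bob = Programmer("Bob", 25, 70_000)

print(alice.drink_coffee())
print(alice.drink_coffee())
print(bob.cups_of_coffee)  # Bob's state is independent of Alice's
```

Both objects share the same methods, but each one keeps its own copy of the state.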

### Why Do You Need OOP As a Data Scientist?

Most of the time, [procedural programming](https://en.wikipedia.org/wiki/Procedural_programming) is more than enough for personal work in data science. However, as soon as you start writing code for others - often for software engineers on your team or at your company - you have to start using OOP. As you will see later, OOP code is well-organized and reusable. Its modular design makes it easy to debug, maintain and share across a team of coworkers.

Besides, the entire open-source community writes OOP code, including the data science community. After gaining enough experience and knowledge, you may want to contribute to the packages you love or create your own. Contributing to existing packages and frameworks requires OOP skills of the highest standard, and this is even more true if you want to develop packages yourself.

OOP might also help some of your personal projects in terms of code design. For example, web scraping projects immediately come to mind. If you can't find the data you are looking for, you may want to obtain it yourself by doing web scraping. In that case, you will often find yourself scraping thousands of examples of similar items, and that is a perfect opportunity to create a class for them.

For example, if you are scraping car prices from some website, you can create a `Car` class which stores all characteristics of a car in a single object. Actually, I did such a project where I scraped car prices from a website called Autotrader. Sadly, I made the mistake of saving everything in dictionaries. The result was a complete disaster. See, each car had several attributes and some attributes had attributes of their own. As you might guess, my dictionaries became a nested jungle, making my code messy and unmaintainable even for my future self.
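
To make the contrast concrete, here is a rough sketch of one scraped record stored first as nested dictionaries and then as objects. The field names, the values and the `price_per_horsepower` method are invented for illustration; they are not taken from the original Autotrader project.

```python
# A hypothetical scraped record stored as nested dictionaries
car_dict = {
    "make": "Toyota",
    "price": 15000,
    "engine": {"fuel": "petrol", "horsepower": 150},
}


class Engine:
    """Hypothetical class for attributes that have attributes of their own."""

    def __init__(self, fuel, horsepower):
        self.fuel = fuel
        self.horsepower = horsepower


class Car:
    """Hypothetical class holding all characteristics of one car."""

    def __init__(self, make, price, engine):
        self.make = make
        self.price = price
        self.engine = engine  # an Engine object instead of a nested dict

    def price_per_horsepower(self):
        # Behavior lives next to the data it needs
        return self.price / self.engine.horsepower


# Build the object version from the dictionary version
car = Car(car_dict["make"], car_dict["price"], Engine(**car_dict["engine"]))

print(car_dict["engine"]["fuel"])  # dictionary access: chained string keys
print(car.engine.fuel)             # object access: dot notation
print(car.price_per_horsepower())
```

With objects, a misspelled attribute name fails loudly with an `AttributeError`, and related logic has an obvious home inside the class.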

You can see an example of how other data scientists use OOP in their projects in this [article](https://towardsdatascience.com/improve-your-data-wrangling-with-object-oriented-programming-914d3ebc83a9).

### Everything Is an Object in Python

You might have heard that Python is an object-oriented programming language. That is true: everything in Python is an object, and each object has a class associated with it. You can see an object's class by calling the `type()` function on it:

In [16]:
type(4)

int

In [17]:
type(2.0)

float

In [18]:
type("Alpha")

str

In [7]:
def my_func():
    pass

In [19]:
type(my_func)

function

In [24]:
type(pd.DataFrame())

pandas.core.frame.DataFrame

The existence of these unified classes is why we can use all `numpy` arrays or `pandas` DataFrames in the same way.
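
A small sketch can make the "everything is an object" claim more tangible: even plain numbers carry methods, and functions carry attributes. The function `my_func` below is just a placeholder defined for this example.

```python
# Every value is an instance of its own class and, ultimately, of `object`
print(isinstance(4, int))           # True
print(isinstance(4, object))        # True
print(isinstance("Alpha", object))  # True

# Even a plain integer has methods
print((255).bit_length())  # 8 (255 needs 8 bits in binary)


# Functions are objects too, with attributes of their own
def my_func():
    """A do-nothing function."""
    pass


print(my_func.__name__)        # my_func
print(type(my_func).__name__)  # function
```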

As I said earlier, classes bundle together the state and behavior of an item in the form of attributes and methods. For example, take a look at the attributes of a `numpy.ndarray`:

In [25]:
array = np.array([[4, 6, 9, 6, 4],
                  [5, 6, 7, 9, 9]])
# Shape of an array
array.shape

(2, 5)

In [26]:
# Data type
array.dtype

dtype('int32')

In [27]:
array.T

array([[4, 5],
       [6, 6],
       [9, 7],
       [6, 9],
       [4, 9]])

It also has methods such as:

In [30]:
# Reshaping the array
array = array.reshape(5, 2)
array

array([[4, 6],
       [9, 6],
       [4, 5],
       [6, 7],
       [9, 9]])

In [31]:
# Make the array 1D
array = array.flatten()
array

array([4, 6, 9, 6, 4, 5, 6, 7, 9, 9])

In [32]:
# Compute summary stats like the mean
array.mean()

6.5

Attributes and methods both use dot-notation - `object.attribute` and `object.method()`. You can see all available attributes and methods of an object by calling the `dir()` function on it:

In [33]:
dir(array)

['T',
 '__abs__',
 '__add__',
 '__and__',
 '__array__',
 '__array_finalize__',
 '__array_function__',
 '__array_interface__',
 '__array_prepare__',
 '__array_priority__',
 '__array_struct__',
 '__array_ufunc__',
 '__array_wrap__',
 '__bool__',
 '__class__',
 '__complex__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__ilshift__',
 '__imatmul__',
 '__imod__',
 '__imul__',
 '__index__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__irshift__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lshift__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__or__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdivmod__',
 ...]
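
Since the raw `dir()` output is dominated by double-underscore names, it can help to filter it. The helper below, `public_members`, is a hypothetical function written for this example (it is not part of Python or `numpy`). To stay dependency-free, it is shown on a built-in complex number, but the same call should work on `array`.

```python
def public_members(obj):
    """Split an object's public names into attributes and methods."""
    attributes, methods = [], []
    for name in dir(obj):
        if name.startswith("_"):
            continue  # skip private and double-underscore names
        # Look the name up on the class so that no property code is executed;
        # names that exist only on the instance fall back to None
        member = getattr(type(obj), name, None)
        if callable(member):
            methods.append(name)
        else:
            attributes.append(name)
    return attributes, methods


attributes, methods = public_members(3 + 4j)
print(attributes)  # ['imag', 'real']
print(methods)     # ['conjugate']
```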

### Class Anatomy: Attributes and Methods

In this section, you will get to know how classes work and how to write them by creating a `Point` class (a point in a 2D Cartesian plane). Any class definition starts with the `class` keyword, followed by a class name, followed by a colon:

In [35]:
class Point:
    pass

To create an empty class or a placeholder, we can use the `pass` keyword. Even though the above class does nothing, you can already create instances of it:

In [37]:
point_1 = Point()
print(point_1)

<__main__.Point object at 0x000001EF8467C310>


Creating an instance looks just like calling a function.

Now, let's create a method that prints some information about a point:

In [44]:
class Point:
    """
    A class to represent a point on a 2D Cartesian plane
    """
    
    def give_info(self, x, y):
        """A method that prints where the point is located"""
        print(f'This point is situated at ({x}, {y}) on 2D plane.')

A method of a class is created in the same way as an ordinary Python function. The only difference is the `self` parameter, which should always be the first parameter of a method:

In [46]:
obj1 = Point()

obj1.give_info(1, 2)

This point is situated at (1, 2) on 2D plane.


If you pay attention, `give_info()` is defined with 3 parameters including `self`, but we passed only 2 arguments, for X and Y, and the method worked fine. So, what is `self`?

As I said, a class is only a blueprint; it comes to life through its instances, the objects. While writing the class, we need a way to refer to the future objects that will be created from it. So, `self` in `give_info` refers to whichever object the method is called on, here `obj1`. Think of it as a placeholder: `self.give_info(1, 2)` -> `obj1.give_info(1, 2)`. Also, `self` is not a reserved keyword and there is nothing special about the name: any name can be used as long as it is the first parameter of the method. But `self` is the agreed standard, so you should stick to it.
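
The placeholder idea can be checked directly: calling a method on an instance is equivalent to calling it through the class and passing the instance explicitly. In this sketch, `give_info` returns a string instead of printing it, which is a small variation made only so the two calls can be compared.

```python
class Point:
    def give_info(self, x, y):
        # `self` is the instance the method was called on
        return f"{type(self).__name__} instance at ({x}, {y})"


obj1 = Point()

# The usual call: Python passes obj1 as `self` automatically
via_instance = obj1.give_info(1, 2)

# The equivalent explicit call through the class
via_class = Point.give_info(obj1, 1, 2)

print(via_instance)               # Point instance at (1, 2)
print(via_instance == via_class)  # True
```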

Now, let's create a new method that sets the coordinates of the point instead of printing them out:

In [48]:
class Point:
    
    def set_coords(self, x, y):
        """A function to set the coordinates of the point"""
        self.x = x
        self.y = y

In this new method, we store the data (the coordinates) as attributes. Inside a class, an attribute is created by assigning to `self`, followed by a dot and the name of the attribute, as in `self.x = x`.

In [50]:
point_1 = Point()
point_1.set_coords(3, 5)
print("The x coordinate:", point_1.x)
print("The y coordinate:", point_1.y)

The x coordinate: 3
The y coordinate: 5


After instantiating the object, we call the new `set_coords()` method, passing the X and Y coordinates. Then, we can access the coordinates as attributes because they were stored on `self`: `self.x` -> `point_1.x`.
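
One consequence worth seeing in code is that attributes stored on `self` belong to a single instance. In the sketch below, the names `point_a` and `point_b` are made up for illustration; the second point never calls `set_coords()`, so it has no coordinates at all.

```python
class Point:
    def set_coords(self, x, y):
        """A method to set the coordinates of the point"""
        self.x = x
        self.y = y


point_a = Point()
point_b = Point()
point_a.set_coords(3, 5)

# vars() returns the attributes stored on one particular instance
print(vars(point_a))  # {'x': 3, 'y': 5}
print(vars(point_b))  # {}

# Accessing an attribute that was never set raises AttributeError
try:
    point_b.x
except AttributeError as error:
    print("AttributeError:", error)
```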

Now, let's create another method that computes the distance between the point and the origin:

In [56]:
class Point:
    
    def set_coords(self, x, y):
        """A method to set the coordinates of the point"""
        self.x = x
        self.y = y
    
    def distance_to_origin(self):
        """
        A method to calculate the distance between the point and the origin
        """
        return np.sqrt(self.x ** 2 + self.y ** 2)


The `distance_to_origin` method takes no arguments besides `self` but uses the coordinates set by `set_coords`. Just like before, to refer to any attribute defined anywhere in the class, you access it through `self`. Note that `set_coords` must be called first; otherwise `self.x` and `self.y` do not exist yet.

In [54]:
point = Point()
point.set_coords(6, 5)
print("The distance to origin:", point.distance_to_origin())

The distance to origin: 7.810249675906654
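
Methods can also accept other objects as arguments. As a hedged sketch, the `distance_to` method below is a hypothetical addition (it is not part of the class above), and `math.hypot` from the standard library is swapped in for `np.sqrt` only to keep the example free of third-party imports.

```python
import math


class Point:
    def set_coords(self, x, y):
        """A method to set the coordinates of the point"""
        self.x = x
        self.y = y

    def distance_to_origin(self):
        # math.hypot(x, y) computes sqrt(x**2 + y**2)
        return math.hypot(self.x, self.y)

    def distance_to(self, other):
        # Hypothetical extension: distance to another Point instance
        return math.hypot(self.x - other.x, self.y - other.y)


point_a = Point()
point_a.set_coords(3, 4)
point_b = Point()
point_b.set_coords(6, 8)

print(point_a.distance_to_origin())  # 5.0
print(point_a.distance_to(point_b))  # 5.0
```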


### The Constructor Method