# Object-Oriented Programming

- OOP is a method of structuring your code into reusable units called classes
- a class is like a blueprint and it does two things:
    - it describes the properties (aka attributes) the object can have
    - it also describes the behaviors (aka methods) of the object
- when you fill in the blueprint with concrete values, you create an object
    - an object is an instance of a class

## The advantages of OOP in data science

- modularity:
    - it breaks down large and complex programs into smaller, manageable pieces (objects and their methods)
    - the code is easier to read and maintain as a result


- reusability:
    - objects and classes can be reused across different projects or within different parts of the same project


- scalability:
    - easy to scale your programs as they grow by adding new objects or methods
    - changes won't affect the entire code base, only specific objects


- abstraction:
    - OOP abstracts away the details of data processing and modeling
    - it helps users to focus more on analysis rather than the underlying complex code


- most data science packages are object oriented for these reasons!
    - numpy, pandas, sklearn, matplotlib, tensorflow, keras, pytorch, etc.

## Let's take a look at a simple example!

In [1]:
class DataPoint:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        
    def distance_from_origin(self):
        return (self.x**2 + self.y**2) ** 0.5

point1 = DataPoint(3, 4)

- DataPoint is the class, point1 is an instance of the class (an object)
- `__init__` is the constructor: it runs when an object is created and sets the attributes of that object
    - `self.x` and `self.y` in this case

In [2]:
# print the attributes of the point1 object
print(point1.x) # Note: no parentheses are used after .x
print(point1.y) # Note: no parentheses are used after .y

3
4


- distance_from_origin is a method of the DataPoint class
- it calculates the distance of the point from the origin (0,0)

In [3]:
print(point1.distance_from_origin()) # Note: parentheses are used when a method is called

5.0


- **encapsulation**, **inheritance**, and **polymorphism** are more advanced OOP concepts that we won't use in this class
- they are important when you work with multiple objects
- but we will have one object per homework to keep things simpler
- feel free to look up those terms though if you want to learn more about OOP
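- for the curious, here is a minimal sketch of inheritance and polymorphism built on the `DataPoint` class from above
    - `DataPoint3D` is a hypothetical subclass made up for this illustration, it is not used anywhere else in the course

```python
class DataPoint:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def distance_from_origin(self):
        return (self.x**2 + self.y**2) ** 0.5


class DataPoint3D(DataPoint):  # inheritance: DataPoint3D reuses DataPoint
    def __init__(self, x, y, z):
        super().__init__(x, y)  # let the parent class set x and y
        self.z = z

    def distance_from_origin(self):  # overrides the parent's method
        return (self.x**2 + self.y**2 + self.z**2) ** 0.5


# polymorphism: the same method call works on both kinds of objects
points = [DataPoint(3, 4), DataPoint3D(2, 3, 6)]
distances = [p.distance_from_origin() for p in points]
print(distances)
```

- the loop does not need to know which class each point belongs to, each object brings its own version of the method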

## Typical class structure in this course

In [None]:
class ML_algorithm:
    '''
    The class of a supervised ML algorithm, a mathematical function which converts feature values into predictions.
    It minimizes a loss function using some optimization algorithm in .train().
    It uses the trained model to provide predictions in .predict().
    '''
    def __init__(self, hyperparameter1, hyperparameter2, ...):
        '''
        the attributes of the model
        '''
        # hyperparameters like regularization, kernel width, max depth, etc.
        # hyperparameters are not updated by the methods of the class!
        # when you do cross-validation, you'd create a new object for each hyperparameter combination
        self.hyperparameter1 = hyperparameter1
        self.hyperparameter2 = hyperparameter2
        ...
        # you would initialize any other model parameters here (e.g., weights in linear and logistic regression)
        # these parameters are updated by .train() to minimize the loss
        self.parameters = ...
        
    def train(self, X, Y):
        '''
        Trains the ML model by finding the optimal set of parameters using an optimization algorithm.
        In sklearn, the equivalent method is called .fit()
        @params:
            X: 2D Numpy array where each row contains an example, padded by 1 column for the bias
            Y: 1D Numpy array containing the corresponding values for each example
        @return:
            None - self.parameters will be updated, nothing needs to be returned
        '''
        # [TODO]


    def predict(self, X):
        '''
        Returns predictions of the model on a set of examples X.
        @params:
            X: a 2D Numpy array where each row contains an example, padded by 1 column for the bias
        @return:
            A 1D Numpy array with one element for each row in X containing the predicted value.
        '''
        # [TODO]
        return y_pred


    def loss(self, X, Y):
        '''
        Returns the loss function on some dataset (X, Y).
        @params:
            X: 2D Numpy array where each row contains an example, padded by 1 column for the bias
            Y: 1D Numpy array containing the corresponding values for each example
        @return:
            A float number which is the loss of the model on the dataset
        '''
        # [TODO]
        return loss
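
- to make the template concrete, here is a minimal sketch of how it could be filled in
    - assumptions (not from the course material): the class name `RidgeRegression`, the single hyperparameter `alpha`, and the loss, which is the mean squared error plus `alpha` times the sum of squared weights
    - this loss has a closed-form minimum, so `.train()` solves a linear system instead of running an iterative optimizer
    - for simplicity the bias weight is regularized too

```python
import numpy as np


class RidgeRegression:
    '''Hypothetical example: linear regression with L2 regularization.'''
    def __init__(self, alpha):
        self.alpha = alpha      # hyperparameter, not updated by the methods
        self.parameters = None  # weights, set by .train()

    def train(self, X, Y):
        # minimize mean((X w - Y)^2) + alpha * sum(w^2) in closed form:
        # (X^T X + n * alpha * I) w = X^T Y
        n_examples, n_features = X.shape
        A = X.T @ X + n_examples * self.alpha * np.eye(n_features)
        self.parameters = np.linalg.solve(A, X.T @ Y)

    def predict(self, X):
        return X @ self.parameters

    def loss(self, X, Y):
        residuals = self.predict(X) - Y
        penalty = self.alpha * np.sum(self.parameters**2)
        return float(np.mean(residuals**2) + penalty)


# toy data: first column is the padding of 1s for the bias, y = 1 + 2x
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
Y = np.array([1.0, 3.0, 5.0])

# one new object per hyperparameter value, as you would do in cross-validation
models = {alpha: RidgeRegression(alpha) for alpha in [0.0, 0.1]}
for alpha, model in models.items():
    model.train(X, Y)
    print(alpha, model.parameters, model.loss(X, Y))
```

- with `alpha = 0.0` the model reduces to ordinary least squares and recovers the bias and slope of the toy data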

## When not to use OOP?
- simple problems don't require OOP, defining classes and methods can be overkill
- OOP can introduce computational overhead (both memory and speed), so it is often not ideal for high performance computing
- OOP is often not ideal for parallelism, because objects that carry shared, changing state are harder to split across workers than independent functions
- not ideal for data-heavy applications like transformations and pipelines
    - functional programming is often easier to read and maintain, and often also faster in this case
    - retail example (below)
        - you work with the log files of a retail company
        - each row in the log describes a customer buying a certain product
        - you have the idea to write a customer class to handle the data

In [None]:
import pandas as pd

class Customer:
    """
    a class to collect all data on a customer and to calculate some stats
    """
    def __init__(self, customer_ID, DataFrame):
        self.customer_ID = customer_ID
        # keep only the rows of the log that belong to this customer
        self.data = DataFrame[DataFrame['customer'] == self.customer_ID]

    def nr_products_bought(self):
        return self.data.shape[0] # return number of rows

    def avg_price(self):
        return self.data['price'].mean()

# open the log file
df = pd.read_csv('log_file.csv')
# assumption: the customer IDs are the unique values in the 'customer' column
customer_IDs = df['customer'].unique()
customers = []
for customer_ID in customer_IDs:
    new_customer = Customer(customer_ID, df) # we create a customer object
    customers.append(new_customer) # store it in a list

- the approach above is very slow: every customer object filters the whole DataFrame again, so the log is scanned once per customer
- it is often better to manipulate the data on all customers at once (vectorization)
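- a minimal sketch of the vectorized alternative, using a pandas `groupby`
    - the small DataFrame is made-up data standing in for the log file, only the column names `customer` and `price` come from the example above
    - the output column names `nr_products_bought` and `avg_price` are chosen here to mirror the methods of the class

```python
import pandas as pd

# made-up stand-in for the log file: one row per purchase
df = pd.DataFrame({
    'customer': ['a', 'b', 'a', 'c', 'b', 'a'],
    'price': [10.0, 4.0, 20.0, 7.5, 6.0, 30.0],
})

# one pass over the data computes both stats for every customer
stats = df.groupby('customer')['price'].agg(
    nr_products_bought='size',  # number of rows per customer
    avg_price='mean',           # average price per customer
)
print(stats)
```

- the result is a single DataFrame with one row per customer, no loop and no list of objects needed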


# Time for our first Mud card!