# Object-Oriented Programming for Data Science Applications

OOP is particularly useful in data science for creating custom classes that encapsulate data pipelines, models, or datasets, making complex workflows easier to manage and extend.

## Why OOP in Data Science?
- **Modularity**: Break down complex data workflows into manageable classes and methods.
- **Reusability**: Create classes that can be reused across projects (e.g., a standard data cleaner).
- **Encapsulation**: Hide implementation details and expose only necessary interfaces.
- **Inheritance and Polymorphism**: Extend base classes for specific tasks, like different machine learning models.


## 1. Core OOP Concepts

### Classes and Objects
A **class** is a blueprint for creating objects. An **object** is an instance of a class.

In data science, a class might represent a dataset or a model.

### The `__init__` Method and `self`

#### The `__init__` Method
The `__init__` method is a special method in Python classes, commonly called the constructor (strictly speaking it is an initializer, since the object already exists when `__init__` runs). It is called automatically when a new instance (object) of the class is created. Its primary purpose is to initialize the object's attributes with values provided during object creation or with default values.

- It allows you to set up the initial state of the object.
- The method name is always `__init__` (with double underscores before and after 'init').
- It takes `self` as the first parameter, followed by any other parameters you want to pass when creating the object.
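
As a minimal sketch of the points above, the hypothetical `Experiment` class below (not part of the original notebook) mixes a required argument with default values. The `tags=None` pattern is used because a mutable default such as `tags=[]` would be shared across all instances.

```python
class Experiment:
    """Hypothetical class used only to illustrate __init__ defaults."""

    def __init__(self, name, learning_rate=0.01, tags=None):
        self.name = name                    # required argument
        self.learning_rate = learning_rate  # falls back to 0.01 when omitted
        # Build a fresh list per instance instead of sharing one default list.
        self.tags = list(tags) if tags is not None else []


default_run = Experiment("baseline")
custom_run = Experiment("tuned", learning_rate=0.1, tags=["grid-search"])

print(default_run.learning_rate, default_run.tags)  # 0.01 []
print(custom_run.learning_rate, custom_run.tags)    # 0.1 ['grid-search']
```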

#### The `self` Parameter
The `self` parameter is a reference to the current instance of the class. It is used to access the attributes and methods of that instance.

- It must be the first parameter of every instance method defined in the class (including `__init__`); methods decorated with `@staticmethod` or `@classmethod` are the exceptions.
- When you call a method on an object, Python automatically passes the object itself as the `self` argument.
- `self` allows you to store and retrieve instance-specific data (attributes).
- By convention, it's named `self`, but you could technically name it anything (though it's strongly recommended to use `self` for clarity).

Example: In the class below, `__init__` initializes `name` and `data` attributes using `self`, making them accessible throughout the class.

In [3]:
# Example: A simple Dataset class

class Dataset:
    def __init__(self, name, data):
        self.name = name  # Attribute
        self.data = data  # Attribute
    
    def describe(self):  # Method
        return f"Dataset '{self.name}' has {len(self.data)} entries."

# Creating an object (instance)
sample_data = [1, 2, 3, 4, 5]
my_dataset = Dataset("Sample", sample_data)
print(my_dataset.describe())

Dataset 'Sample' has 5 entries.
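
To make the automatic passing of `self` concrete, here is a small sketch using a hypothetical `Counter` class (introduced only for illustration). It shows that `obj.method()` is equivalent to `Class.method(obj)`, and that each instance keeps its own attributes.

```python
class Counter:
    """Hypothetical class used only to show how self is passed."""

    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count


first = Counter()
second = Counter()

first.increment()         # Python passes `first` as self
Counter.increment(first)  # the same call with self written out explicitly

print(first.count)   # 2
print(second.count)  # 0, because `second` has its own `count` attribute
```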


### Encapsulation
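Encapsulation bundles data with the methods that operate on it and limits direct access to internal details. In Python, a leading underscore (for example `_data`) marks an attribute as non-public by convention; the language does not enforce it.

In data science, this discourages accidental modification of the underlying data and keeps data handling in one place.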

In [4]:
class SecureDataset:
    def __init__(self, data):
        self._data = data  # Private attribute (convention)
    
    def get_summary(self):
        return sum(self._data) / len(self._data) if self._data else 0

# Usage
secure_ds = SecureDataset([10, 20, 30])
print(secure_ds.get_summary())  # Access via method
# print(secure_ds._data)  # Possible but discouraged: Python has no enforced
# private attributes (unlike Java or C++), so the underscore is only a convention.

20.0
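
If stronger guarantees are wanted than the underscore convention provides, one possible approach is a read-only property. The sketch below uses a hypothetical `ReadOnlyDataset` class (a variation on the example above, not part of the original notebook). It also shows double-underscore name mangling, which makes accidental access less likely but still does not make the attribute truly private.

```python
class ReadOnlyDataset:
    """Hypothetical variant of SecureDataset with a read-only view of its data."""

    def __init__(self, data):
        self._data = list(data)   # non-public by convention
        self.__source = "upload"  # name-mangled to _ReadOnlyDataset__source

    @property
    def data(self):
        # Return a tuple copy so callers cannot modify the internal list.
        return tuple(self._data)


ds = ReadOnlyDataset([10, 20, 30])
print(ds.data)  # (10, 20, 30)

try:
    ds.data = [1, 2, 3]  # no setter is defined, so assignment fails
except AttributeError:
    print("data is read-only")

print(hasattr(ds, "__source"))      # False: the plain name does not exist
print(ds._ReadOnlyDataset__source)  # upload: still reachable via the mangled name
```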


### Inheritance
Inheritance allows a class to inherit attributes and methods from another class.

In data science, you might create a base model class and derive specific models such as `LinearRegressionModel` from it.

In [5]:
class BaseModel:
    def __init__(self, model_name):
        self.model_name = model_name
    
    def train(self):
        print(f"Training {self.model_name}...")

class LinearRegressionModel(BaseModel):
    def predict(self):
        print("Making linear predictions.")

# Usage
lr_model = LinearRegressionModel("LR")
lr_model.train()
lr_model.predict()

Training LR...
Making linear predictions.
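
A subclass often needs its own attributes in addition to the inherited ones. The sketch below shows one common way to do this with `super()`. `BaseModelSketch` is a simplified stand-in for `BaseModel` that returns its message instead of printing it, and `RidgeModel` and `alpha` are hypothetical names chosen for illustration.

```python
class BaseModelSketch:
    """Simplified stand-in for BaseModel that returns its message."""

    def __init__(self, model_name):
        self.model_name = model_name

    def train(self):
        return f"Training {self.model_name}..."


class RidgeModel(BaseModelSketch):
    """Hypothetical subclass that adds its own attribute."""

    def __init__(self, model_name, alpha=1.0):
        super().__init__(model_name)  # let the parent set model_name
        self.alpha = alpha            # subclass-specific attribute

    def train(self):
        # Reuse the parent's behavior, then extend it.
        base_message = super().train()
        return f"{base_message} (alpha={self.alpha})"


ridge = RidgeModel("Ridge", alpha=0.5)
print(ridge.train())  # Training Ridge... (alpha=0.5)
```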


#### Multiple Inheritance
Multiple inheritance means a class can inherit from more than one parent class.

##### Method Resolution Order (MRO)

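MRO defines the order in which Python searches classes when looking up a method or attribute. Python computes it with the C3 linearization algorithm; in simple terms:

- Python checks the child class first.
- Then it checks the base classes from left to right, in the order they are listed in the class definition.
- Then it checks the base classes of those base classes, with a shared ancestor searched only after all of the classes that inherit from it.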

In [6]:
class A:
    def show(self):
        print("A")

class B:
    def show(self):
        print("B")

class C(A, B):
    pass

obj = C()
obj.show()  # Output comes from class A because the MRO is C --> A --> B

A


In [7]:
class A:
    def show(self):
        print("A")

class B(A):
    def show(self):
        print("B")

class C(A):
    def show(self):
        print("C")

class D(B, C):
    pass

d = D()
d.show() # Output will be from class B since D --> B --> C --> A

B
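
The MRO does not have to be worked out by hand: every class exposes it through the `__mro__` attribute. The sketch below rebuilds the diamond hierarchy from the previous cell, with `show` changed to return a list (an adaptation for illustration) so that the effect of `super()` following the MRO is visible.

```python
class A:
    def show(self):
        return ["A"]


class B(A):
    def show(self):
        # super() refers to the next class in the instance's MRO, not always A.
        return ["B"] + super().show()


class C(A):
    def show(self):
        return ["C"] + super().show()


class D(B, C):
    def show(self):
        return ["D"] + super().show()


print([cls.__name__ for cls in D.__mro__])  # ['D', 'B', 'C', 'A', 'object']
print(D().show())                           # ['D', 'B', 'C', 'A']
```

Note that inside `B.show`, `super()` resolves to `C` when the instance is a `D`, which is why each class runs exactly once.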


### Polymorphism
Polymorphism allows objects of different classes to be used through a common interface: the same method call works on each of them, but each class supplies its own behavior.

In data science, different models can implement a method such as `train` (or `fit`) differently and still be used interchangeably.

In [8]:
class DecisionTreeModel(BaseModel):
    def train(self):
        print(f"Building tree for {self.model_name}...")

# Polymorphism in action
models = [LinearRegressionModel("LR"), DecisionTreeModel("DT")]
for model in models:
    model.train()  # Same method call, different behavior

Training LR...
Building tree for DT...
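
One way to make the common interface explicit is an abstract base class from the standard `abc` module. The sketch below is an illustration rather than part of the original notebook: `TrainableModel`, `MeanModel`, `MedianModel`, and `train_all` are hypothetical names.

```python
from abc import ABC, abstractmethod


class TrainableModel(ABC):
    """Hypothetical interface: every concrete model must implement train()."""

    def __init__(self, model_name):
        self.model_name = model_name

    @abstractmethod
    def train(self):
        """Return a short description of the training step."""


class MeanModel(TrainableModel):
    def train(self):
        return f"{self.model_name}: computing the mean"


class MedianModel(TrainableModel):
    def train(self):
        return f"{self.model_name}: computing the median"


def train_all(models):
    # Relies only on the shared train() interface, not on concrete types.
    return [model.train() for model in models]


messages = train_all([MeanModel("mean"), MedianModel("median")])
print(messages)

try:
    TrainableModel("abstract")  # abstract classes cannot be instantiated
except TypeError:
    print("TrainableModel cannot be instantiated directly")
```

With this design, a subclass that forgets to define `train` fails when it is instantiated rather than later, in the middle of a training loop.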


## 2. OOP in Data Science Applications

Let's build a practical example: a `DataPipeline` class that handles loading, cleaning, and analyzing data using pandas.

In [9]:
import pandas as pd

class DataPipeline:
    def __init__(self, file_path):
        self.file_path = file_path
        self.data = None
    
    def load_data(self):
        self.data = pd.read_csv(self.file_path)
        print("Data loaded successfully.")
    
    def clean_data(self):
        if self.data is not None:
            self.data.dropna(inplace=True)  # Simple cleaning
            print("Data cleaned.")
    
    def analyze(self):
        if self.data is not None:
            return self.data.describe()
    
    def run_pipeline(self):
        self.load_data()
        self.clean_data()
        return self.analyze()

# Usage (assume 'test_data1.csv' exists)
pipeline = DataPipeline('test_data1.csv')
results = pipeline.run_pipeline()
print(results)

Data loaded successfully.
Data cleaned.
             Age
count   5.000000
mean   31.000000
std     2.738613
min    28.000000
25%    29.000000
50%    31.000000
75%    32.000000
max    35.000000
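
Because `run_pipeline` calls the individual steps as methods, a subclass can replace a single step and reuse the rest. The sketch below illustrates that idea with a dependency-free analogue of the pipeline, so it runs without a CSV file on disk. It uses the standard `csv` and `statistics` modules instead of pandas. `InMemoryPipeline`, `AdultsOnlyPipeline`, and the sample data are all hypothetical.

```python
import csv
import io
import statistics


class InMemoryPipeline:
    """Hypothetical, standard-library analogue of DataPipeline."""

    def __init__(self, csv_text, column):
        self.csv_text = csv_text
        self.column = column
        self.rows = None

    def load_data(self):
        self.rows = list(csv.DictReader(io.StringIO(self.csv_text)))
        return self  # returning self allows the steps to be chained

    def clean_data(self):
        # Drop rows where the column is empty (a rough analogue of dropna).
        self.rows = [row for row in self.rows if row.get(self.column)]
        return self

    def analyze(self):
        values = [float(row[self.column]) for row in self.rows]
        return {
            "count": len(values),
            "mean": statistics.mean(values),
            "min": min(values),
            "max": max(values),
        }

    def run_pipeline(self):
        return self.load_data().clean_data().analyze()


class AdultsOnlyPipeline(InMemoryPipeline):
    """Overrides one step; loading and analysis are inherited unchanged."""

    def clean_data(self):
        super().clean_data()
        self.rows = [row for row in self.rows if float(row[self.column]) >= 18]
        return self


# Made-up sample data, used only for this illustration.
SAMPLE_CSV = "Name,Age\nAna,28\nBen,\nCara,35\nDev,12\n"

basic = InMemoryPipeline(SAMPLE_CSV, "Age").run_pipeline()
adults = AdultsOnlyPipeline(SAMPLE_CSV, "Age").run_pipeline()
print(basic)
print(adults)
```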
