# TP01: CleaningData - Applying OOP and Design Patterns


Each exercise builds upon the previous one, so by the end we will have a complete CSV data-cleaning pipeline built with OOP, decorators, and design patterns.

## Exercise 1: Build the CSVReader Class (OOP Foundation)
Objective:
- Reinforce OOP fundamentals: classes, constructors, methods, encapsulation.
- Learn to design reusable data classes for DS projects.
Concept:
A CSVReader class should encapsulate all functionality related to reading CSV files.

Instructions:
1.	Create a class CSVReader.
2.	Define an __init__ method that takes file_path.
3.	Add a reader() method that loads the CSV file using pandas.
4.	Add a preview(n) method that prints the top n rows.


In [None]:
import pandas as pd 
class CSVReader:
    """ A class to read and preview csv files"""
    def __init__(self, file_path):
        self.file_path = file_path
        self.df = None 
    def reader(self):
        """ Load csv into self.df and return it."""
        self.df = pd.read_csv(self.file_path)
        return self.df
    def preview(self,n):
        """Return top n rows of the dataframe."""
        if self.df is None:
            self.reader()
        return self.df.head(n)
#Usage 
csv_reader = CSVReader("/Users/macbookair/Documents/Data Science 5th Year/Advanced Programing for DS/Pratice/sample_data.csv")
csv_reader.reader()
print(csv_reader.preview(5))

   id     name   age  height_cm  weight_kg         city  score
0   1    Alice  29.0      165.0       68.0     New York   85.0
1   2      Bob   NaN      172.0        NaN  Los Angeles   90.0
2   3  Charlie  35.0      168.0       72.0      Chicago    NaN
3   4    David   NaN        NaN       80.0      Houston   75.0
4   5      Eva  27.0      160.0       55.0     New York   88.0
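
Because `pd.read_csv` also accepts file-like objects, the class can be tried without a file on disk. The sketch below is only an illustration: the inline CSV text and the names `buffer`, `demo_reader`, and `top_two` are made up for the example, and the class is a condensed copy of the one above.

```python
import io
import pandas as pd

class CSVReader:
    """Condensed copy of the class above, for a self-contained demo."""
    def __init__(self, file_path):
        self.file_path = file_path  # a path or any file-like object
        self.df = None

    def reader(self):
        self.df = pd.read_csv(self.file_path)
        return self.df

    def preview(self, n):
        if self.df is None:      # lazy load on first preview
            self.reader()
        return self.df.head(n)

# Hypothetical inline data standing in for sample_data.csv
buffer = io.StringIO("id,name,age\n1,Alice,29\n2,Bob,\n3,Charlie,35\n")
demo_reader = CSVReader(buffer)
top_two = demo_reader.preview(2)  # triggers the load, then returns 2 rows
print(top_two)
```

Calling `preview()` before `reader()` works here because `preview()` loads the data on demand when `self.df` is still `None`.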



## Exercise 2: Apply the Strategy Pattern for Missing-Value Handling
Objective:
- Learn the Strategy Pattern - one interface, many interchangeable behaviors.
- Practice inheritance and polymorphism.

Concept:

We often need different strategies to handle missing data (drop, fill mean, fill mode, etc.).
The pattern allows flexible switching between methods without changing the main code.

Instructions:
1.	Create an abstract class MissingValueStrategy with a method handle(df).
2.	Create subclasses:
   - DropMissing
   - FillMean
   - FillMode
3.	Create a DataCleaner class that applies the chosen strategy.


In [18]:
import pandas as pd
from abc import ABC, abstractmethod
class MissingValueStrategy(ABC):
    """ Strategy class interface for handling missing values """
    @abstractmethod
    def handle(self,df:pd.DataFrame):
        pass
class DropMissing(MissingValueStrategy):
    """ subclass that handles strategy to drop missing values """
    def handle(self, df:pd.DataFrame):
        return df.dropna()

class FillMean(MissingValueStrategy):
    def handle(self,df:pd.DataFrame):
        n_cols = df[["age","height_cm","weight_kg","score"]].mean()
        return df.fillna(n_cols)

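class FillMode(MissingValueStrategy):
    """ subclass that fills missing values with each column's mode """
    def handle(self,df:pd.DataFrame):
        # mode() returns a DataFrame (one row per tied mode); take the first row as a Series
        n_cols = df[["age","height_cm","weight_kg","score"]].mode().iloc[0]
        return df.fillna(n_cols)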

class DataCleaner:
    """ Context class that uses a MissingValueStrategy to clean data """
    def __init__(self,strategy:MissingValueStrategy):
        self.strategy = strategy
    def clean (self,df):
        return self.strategy.handle(df)




In [19]:
#Example Usage:
cleaner = DataCleaner(DropMissing())
clean = cleaner.clean(csv_reader.reader())
print("\nDataFrame after Dropping Missing Values:")
print(clean)


DataFrame after Dropping Missing Values:
   id   name   age  height_cm  weight_kg      city  score
0   1  Alice  29.0      165.0       68.0  New York   85.0
4   5    Eva  27.0      160.0       55.0  New York   88.0
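
The point of the pattern is that the strategy can be swapped at runtime while `DataCleaner` stays unchanged. A minimal sketch of that swap follows. It assumes a toy DataFrame (`toy`) invented for the example and a generic `FillMean` that uses `mean(numeric_only=True)` instead of the hard-coded column list above.

```python
from abc import ABC, abstractmethod
import pandas as pd

class MissingValueStrategy(ABC):
    @abstractmethod
    def handle(self, df: pd.DataFrame) -> pd.DataFrame: ...

class DropMissing(MissingValueStrategy):
    def handle(self, df):
        return df.dropna()

class FillMean(MissingValueStrategy):
    def handle(self, df):
        # Assumption: fill every numeric column; fillna aligns the Series by column name
        return df.fillna(df.mean(numeric_only=True))

class DataCleaner:
    def __init__(self, strategy):
        self.strategy = strategy

    def clean(self, df):
        return self.strategy.handle(df)

# Hypothetical toy data with one missing score
toy = pd.DataFrame({"name": ["a", "b", "c"], "score": [80.0, None, 90.0]})

toy_cleaner = DataCleaner(DropMissing())
dropped = toy_cleaner.clean(toy)        # the row with the missing score is removed

toy_cleaner.strategy = FillMean()       # swap behavior without touching DataCleaner
filled = toy_cleaner.clean(toy)         # the missing score becomes the column mean
```

Neither call modifies `toy`, since `dropna()` and `fillna()` return new DataFrames by default.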


## Exercise 3: Add Decorators for Logging and Timing

Objective:
- Learn the Decorator Pattern to add reusable functionality (logging, timing).
- Understand how decorators support the “Open/Closed Principle”.

Concept:

Decorators wrap functions to extend their behavior without altering the original code.
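
As a minimal illustration of that idea, the `@name` syntax is shorthand for passing the function to the decorator and rebinding the result. The decorator `shout` and the `greet` functions below are hypothetical names used only for this sketch.

```python
from functools import wraps

def shout(func):
    """Hypothetical decorator: upper-cases the wrapped function's return value."""
    @wraps(func)                      # keeps func.__name__ and docstring on the wrapper
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper

def greet(name):
    return f"hello {name}"

@shout                                # same as: greet_decorated = shout(greet_decorated)
def greet_decorated(name):
    return f"hello {name}"

greet_manual = shout(greet)           # explicit form; greet itself is left unchanged

print(greet("eva"), greet_manual("eva"), greet_decorated("eva"))
```

The original `greet` is never edited, which is the sense in which decorators support extension without modification.
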
Instructions:
1.	Create two decorators:
   - @log_action → logs when a method starts and finishes.
   - @log_time → calculates and prints execution time.
2.	Apply them to methods in CSVReader or DataCleaner.


In [6]:
import time 
import logging
from functools import wraps
import pandas as pd 
from abc import ABC, abstractmethod

In [7]:
logging.basicConfig(level=logging.INFO) # Configure logging settings 
# --- Decorators ---
def log_action(func):
    """ Logging Decorator """
    @wraps(func)
    def wrapper(*args, **kwargs):
        logging.info(f"Starting {func.__name__}")
        result = func(*args, **kwargs)
        logging.info(f"Finished {func.__name__}")
        return result
    return wrapper
# --- Timing Decorator ---
def log_time(func):
    """ Timing Decorator """
    @wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        logging.info(f"{func.__name__} took {end_time - start_time:.4f} seconds")
        return result
    return wrapper
class CSVReader:
    def __init__(self, file_path):
        self.file_path = file_path
        self.df = None
    @log_action
    @log_time
    def reader(self):
        self.df = pd.read_csv(self.file_path)
        return self.df  # return the frame so `df = reader.reader()` is not None
    def preview(self, n):
        if self.df is None:
            self.reader()
        return self.df.head(n)
class MissingValueStrategy(ABC):
    @abstractmethod
    def handle(self,df:pd.DataFrame):
        pass
class DropMissing(MissingValueStrategy):
    def handle(self, df:pd.DataFrame):
        return df.dropna()

class FillMean(MissingValueStrategy):
    def handle(self,df:pd.DataFrame):
        n_cols = df[["age","height_cm","weight_kg","score"]].mean(numeric_only=True)
        return df.fillna(n_cols)

class FillMode(MissingValueStrategy):
    def handle(self,df:pd.DataFrame):
        # mode() returns a DataFrame (one row per tied mode); take the first row as a Series
        n_cols = df[["age","height_cm","weight_kg","score"]].mode().iloc[0]
        return df.fillna(n_cols)
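# Note: decorating the class wraps the constructor call, so the log lines
# report DataCleaner(...) instantiation rather than the clean() call.
@log_action
@log_time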
class DataCleaner:
    def __init__(self,strategy:MissingValueStrategy):
        self.strategy = strategy
    def clean (self,df):
        return self.strategy.handle(df)


        

In [8]:
reader = CSVReader("/Users/macbookair/Documents/Data Science 5th Year/Advanced Programing for DS/Pratice/sample_data.csv")
df =reader.reader()
print(reader.preview(5))

INFO:root:Starting reader
INFO:root:reader took 0.0090 seconds
INFO:root:Finished reader


   id     name   age  height_cm  weight_kg         city  score
0   1    Alice  29.0      165.0       68.0     New York   85.0
1   2      Bob   NaN      172.0        NaN  Los Angeles   90.0
2   3  Charlie  35.0      168.0       72.0      Chicago    NaN
3   4    David   NaN        NaN       80.0      Houston   75.0
4   5      Eva  27.0      160.0       55.0     New York   88.0


In [9]:
cleaner = DataCleaner(FillMean())
cleaning = cleaner.clean(reader.df)
cleaning 

INFO:root:Starting DataCleaner
INFO:root:DataCleaner took 0.0000 seconds
INFO:root:Finished DataCleaner


   id     name    age  height_cm  weight_kg         city  score
0   1    Alice   29.0      165.0       68.0     New York   85.0
1   2      Bob  30.25      172.0       70.0  Los Angeles   90.0
2   3  Charlie   35.0      168.0       72.0      Chicago   86.0
3   4    David  30.25      167.0       80.0      Houston   75.0
4   5      Eva   27.0      160.0       55.0     New York   88.0
5   6    Frank   30.0      175.0       85.0  Los Angeles   86.0
6   7    Grace  30.25      162.0       60.0      Chicago   92.0
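
The log lines in this cell name `DataCleaner` and report about 0.0000 seconds because the decorators sit on the class, so they time the constructor call. To log the cleaning step itself, the decorator can go on the method instead. The sketch below assumes a hypothetical `TimedCleaner` class and a condensed `log_time` that uses `time.perf_counter` rather than `time.time`. It is not the notebook's exact code.

```python
import logging
import time
from functools import wraps
import pandas as pd

logging.basicConfig(level=logging.INFO)

def log_time(func):
    """Condensed timing decorator (perf_counter is a substitution for time.time)."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        logging.info("%s took %.4f seconds", func.__name__, time.perf_counter() - start)
        return result
    return wrapper

class DropMissing:
    def handle(self, df):
        return df.dropna()

class TimedCleaner:
    """Hypothetical variant of DataCleaner with the decorator on the method."""
    def __init__(self, strategy):
        self.strategy = strategy

    @log_time                      # wraps clean(), so each cleaning call is logged
    def clean(self, df):
        return self.strategy.handle(df)

timed = TimedCleaner(DropMissing())
out = timed.clean(pd.DataFrame({"x": [1.0, None]}))
```

With the decorator on the method, `TimedCleaner` stays a regular class, so `isinstance` checks and subclassing keep working.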


## Exercise 4: Implement Factory Pattern for Data Transformations

Objective:

- Practice Factory Pattern for scalable creation of transformation objects.
- Apply abstraction and composition.

Concept:
Instead of manually creating objects, use a factory that decides what transformation to apply.

Instructions:
1.	Create an abstract class DataTransform with a method apply(df).
2.	Implement subclasses:
   - NormalizeColumns
   - RemoveDuplicates
   - StandardizeText
3.	Create a TransformFactory that returns transformation objects based on string input.


In [None]:
class DataTransform(ABC):
    """ Abstract base class for data transformations """
    @abstractmethod
    def apply(self, df: pd.DataFrame) -> pd.DataFrame:
        pass
class NormalizeColumns(DataTransform):
    """ Normalize numeric columns to [0,1] range """
    def apply(self, df):
        df = df.copy()  # work on a copy so the caller's DataFrame is not mutated
        numeric_cols = df.select_dtypes(include=['number']).columns
        col_min = df[numeric_cols].min()
        col_range = df[numeric_cols].max() - col_min
        # constant columns have zero range and become NaN after division
        df[numeric_cols] = (df[numeric_cols] - col_min) / col_range
        return df
class RemoveDuplicates(DataTransform):
    """ Remove duplicate rows from the DataFrame """
    def apply(self, df):
        return df.drop_duplicates()
class StandardizeText(DataTransform):
    """ Standardize text columns to lowercase and strip whitespace """
    def apply(self, df):
        df = df.copy()  # avoid modifying the input in place
        text_cols = df.select_dtypes(include=['object']).columns
        for col in text_cols:
            df[col] = df[col].str.lower().str.strip()
        return df
class TransformationFactory:
    """ Factory class to create data transformation objects """
    @staticmethod
    def create(name, **kwargs):
        if name == "normalize":
            return NormalizeColumns()
        elif name == "remove_duplicates":
            return RemoveDuplicates()
        elif name == "standardize_text":
            return StandardizeText()
        else:
            raise ValueError(f"Unknown transformation: {name}")


In [37]:
#Example Usage:
factory = TransformationFactory()
t1 = factory.create("normalize")
df_transformed = t1.apply(cleaning)
df_transformed

         id     name      age  height_cm  weight_kg         city     score
0       0.0    Alice     0.25   0.333333   0.433333     New York  0.588235
1  0.166667      Bob  0.40625        0.8        0.5  Los Angeles  0.882353
2  0.333333  Charlie      1.0   0.533333   0.566667      Chicago  0.647059
3       0.5    David  0.40625   0.466667   0.833333      Houston       0.0
4  0.666667      Eva      0.0        0.0        0.0     New York  0.764706
5  0.833333    Frank    0.375        1.0        1.0  Los Angeles  0.647059
6       1.0    Grace  0.40625   0.133333   0.166667      Chicago       1.0
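
The if/elif chain in the factory grows with every new transformation. One common alternative is a dictionary that maps names to classes. The sketch below assumes a hypothetical `RegistryFactory` and simplified stand-ins for two of the transforms. `StandardizeText` takes an explicit `columns` parameter here (an assumption made for this example) to show how `**kwargs` can be forwarded to the constructor.

```python
import pandas as pd

class RemoveDuplicates:
    def apply(self, df):
        return df.drop_duplicates()

class StandardizeText:
    def __init__(self, columns):
        self.columns = columns      # hypothetical parameter: which columns to clean

    def apply(self, df):
        out = df.copy()
        for col in self.columns:
            out[col] = out[col].str.lower().str.strip()
        return out

class RegistryFactory:
    """Hypothetical alternative to the if/elif factory: a name -> class mapping."""
    _registry = {
        "remove_duplicates": RemoveDuplicates,
        "standardize_text": StandardizeText,
    }

    @classmethod
    def create(cls, name, **kwargs):
        if name not in cls._registry:
            raise ValueError(f"Unknown transformation: {name}")
        return cls._registry[name](**kwargs)   # forward params to the constructor

# Made-up data: the first two rows differ only by case and whitespace
raw = pd.DataFrame({"city": [" Chicago", "chicago ", "Houston"]})
steps = [("standardize_text", {"columns": ["city"]}), ("remove_duplicates", {})]
for name, params in steps:
    raw = RegistryFactory.create(name, **params).apply(raw)
print(raw)
```

Adding a new transformation then means adding one dictionary entry, with no change to `create()`.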


## Exercise 5: Build a Full Cleaning Pipeline (Template Method Pattern)


Objective:
- Combine all previous concepts into one cleaning pipeline.
- Use the Template Method Pattern to define a consistent workflow.

Concept:
The Template Method defines the skeleton of a process and lets subclasses override specific steps.
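
A minimal sketch of the idea, separate from the pipeline exercise: the base class fixes the order of steps in one method, and a subclass supplies only the step that varies. The `Report` and `RowCountReport` classes and their strings are hypothetical.

```python
from abc import ABC, abstractmethod

class Report(ABC):
    """Hypothetical toy example; not part of the exercise code."""

    def render(self):
        # The template method: the order of steps is fixed here
        return "\n".join([self.header(), self.body(), self.footer()])

    def header(self):
        return "== report =="       # default step, may be overridden

    @abstractmethod
    def body(self):
        """Required step: every subclass must provide it."""

    def footer(self):
        return "== end =="          # default step, may be overridden

class RowCountReport(Report):
    def body(self):
        return "3 rows cleaned"

text = RowCountReport().render()
print(text)
```

Instantiating `Report` directly raises `TypeError`, because `body()` is abstract.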

Instructions:
1.	Create an abstract class DataPipeline with a run() method defining steps:
   - load()
   - clean()
   - transform()
   - save()
2.	Create CSVDataPipeline that implements each step using previous classes.


In [38]:
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Optional, Union, List, Dict, Any
import pandas as pd
import logging

# configure logging
logging.basicConfig(level=logging.INFO, format="[%(asctime)s] %(levelname)s %(message)s")
logger = logging.getLogger(__name__)

# ---- Abstract Template ----
class DataPipeline(ABC):
    """
    Template Method base class for data pipelines.
    Subclasses must implement load(), clean(), transform(), save().
    """

    def run(self, source: Union[str, Path, pd.DataFrame], *, output_path: Optional[Union[str, Path]] = None) -> pd.DataFrame:
        """
        The template method: orchestrates the full pipeline.
        Accepts either a file path (str/Path) or a preloaded DataFrame.
        Returns the final cleaned DataFrame.
        """
        logger.info("Pipeline started")
        df = self.load(source)
        logger.info("Loaded data: %s rows, %s cols", len(df), len(df.columns))
        df = self.clean(df)
        logger.info("After clean: %s rows, %s cols", len(df), len(df.columns))
        df = self.transform(df)
        logger.info("After transform: %s rows, %s cols", len(df), len(df.columns))
        # optional post-processing hook (subclass may override)
        df = self.postprocess(df)
        if output_path is not None:
            self.save(df, output_path)
            logger.info("Saved cleaned data to %s", output_path)
        logger.info("Pipeline finished")
        return df

    @abstractmethod
    def load(self, source: Union[str, Path, pd.DataFrame]) -> pd.DataFrame:
        """Load data from a path or accept a DataFrame directly."""
        raise NotImplementedError

    @abstractmethod
    def clean(self, df: pd.DataFrame) -> pd.DataFrame:
        """Apply missing-value strategies, type fixes, etc."""
        raise NotImplementedError

    @abstractmethod
    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        """Apply transforms (normalize, dedupe, standardize text, etc.)."""
        raise NotImplementedError

    @abstractmethod
    def save(self, df: pd.DataFrame, output_path: Union[str, Path]) -> None:
        """Persist the cleaned DataFrame (CSV by default)."""
        raise NotImplementedError

    # optional hook
    def postprocess(self, df: pd.DataFrame) -> pd.DataFrame:
        """Optional final adjustments (override when needed)."""
        return df


# ---- CSV implementation that uses previous components ----
class CSVDataPipeline(DataPipeline):
    """
    CSVDataPipeline composes:
      - a CSVReader (or pandas.read_csv)
      - a DataCleaner (strategy-based)
      - a TransformFactory to create transforms
    """

    def __init__(
        self,
        cleaner,                      # instance of DataCleaner (expects .clean(df) or strategy handle)
        transform_configs: Optional[List[Dict[str, Any]]] = None,
        reader_kwargs: Optional[Dict[str, Any]] = None,
        factory=None                  # TransformFactory instance or None (you can pass class)
    ):
        """
        Params:
            cleaner: a DataCleaner-like object (has .clean(df) or .strategy.handle)
            transform_configs: list of {"name": str, "params": {...}} applied in order
            reader_kwargs: kwargs passed to pd.read_csv when loading from path
            factory: TransformFactory instance (must provide .create(name, **params))
        """
        self.cleaner = cleaner
        self.transform_configs = transform_configs or []
        self.reader_kwargs = reader_kwargs or {}
        self.factory = factory

    def load(self, source: Union[str, Path, pd.DataFrame]) -> pd.DataFrame:
        # Accept a DataFrame directly
        if isinstance(source, pd.DataFrame):
            logger.info("Source is a DataFrame, copying to avoid mutation.")
            return source.copy()
        # Accept a path-like (string or Path)
        path = Path(source)
        if not path.exists():
            raise FileNotFoundError(f"Input file not found: {path}")
        # Use pandas directly or your CSVReader wrapper
        df = pd.read_csv(path, **self.reader_kwargs)
        return df

    def clean(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Use the injected cleaner. Two common interfaces:
        - cleaner.clean(df) -> DataFrame
        - cleaner.strategy.handle(df) -> DataFrame
        The code below tries to support either.
        """
        if hasattr(self.cleaner, "clean"):
            return self.cleaner.clean(df)
        elif hasattr(self.cleaner, "strategy") and hasattr(self.cleaner.strategy, "handle"):
            return self.cleaner.strategy.handle(df)
        else:
            raise AttributeError("Cleaner must provide .clean(df) or .strategy.handle(df)")

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Apply a list of transform configs (each config = {"name": str, "params": {...}}).
        The factory must implement create(name, **params) -> DataTransform.
        """
        out = df.copy()
        if not self.transform_configs:
            return out

        if self.factory is None:
            raise RuntimeError("TransformFactory instance not provided to CSVDataPipeline")

        for conf in self.transform_configs:
            name = conf.get("name")
            params = conf.get("params", {})
            logger.info("Applying transform %s with params %s", name, params)
            transform_obj = self.factory.create(name, **params)  # must return DataTransform
            out = transform_obj.apply(out)
        return out

    def save(self, df: pd.DataFrame, output_path: Union[str, Path]) -> None:
        p = Path(output_path)
        p.parent.mkdir(parents=True, exist_ok=True)
        df.to_csv(p, index=False)

    # Optional override for the final step
    def postprocess(self, df: pd.DataFrame) -> pd.DataFrame:
        # example: reset index, enforce dtypes, drop temporary cols
        return df.reset_index(drop=True)



In [None]:

# ---- Example usage ----
if __name__ == "__main__":
    class SimpleCleaner:
        def clean(self, df): return df.ffill()  # fillna(method="ffill") is deprecated

    class DummyFactory:
        def create(self, name, **params):
            # Minimal transform object with apply(df) method
            class IdentityTransform:
                def apply(self, df): return df
            return IdentityTransform()

    cleaner = SimpleCleaner()
    factory = DummyFactory()
    transforms = [{"name": "identity"}]

    pipeline = CSVDataPipeline(cleaner=cleaner, transform_configs=transforms, factory=factory)
    # pass either a path or DataFrame:
    sample_df = pd.DataFrame({"A":[1, None, 3], "name":["Alice", None, "Charlie"]})
    result = pipeline.run(sample_df, output_path=None)   # returns final DataFrame
    print(result)

INFO:__main__:Pipeline started
INFO:__main__:Source is a DataFrame, copying to avoid mutation.
INFO:__main__:Loaded data: 3 rows, 2 cols
INFO:__main__:After clean: 3 rows, 2 cols
INFO:__main__:Applying transform identity with params {}
INFO:__main__:After transform: 3 rows, 2 cols
INFO:__main__:Pipeline finished


     A     name
0  1.0    Alice
1  1.0    Alice
2  3.0  Charlie
