# Object-Oriented Programming (OOP) Example Pipeline

## Remarks

This notebook is based on the file `IMDB_notebook.ipynb` in the project folder.

I want to thank [Stephen Nwoye](https://github.com/Stephen-Data-Engineer-Public) for the helpful starter code and for recommending the following videos:

* [Python Tutorial - Introduction to Classes](https://www.youtube.com/watch?v=u4Ryk0YuW6A) by [Dave Ebbelaar](https://www.youtube.com/@daveebbelaar), and
* [Python OOP Tutorial 1: Classes and Instances](https://www.youtube.com/watch?v=ZDa-Z5JzLYM) by [Corey Schafer](https://www.youtube.com/@coreyms).

## Classes

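We define the `DataExtractor` class with an `extract` method. This method reads each source URL into a pandas DataFrame, choosing the delimiter from the file extension (`.csv` or `.tsv`), and returns a dictionary that maps table names to DataFrames.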
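The extension-based dispatch can be tried without network access. The following is a minimal sketch that reads from in-memory text instead of URLs; the helper `read_table` and the sample data are hypothetical and not part of the pipeline. Unlike `extract`, which silently skips other extensions, this sketch raises an error for them.

```python
import io

import pandas as pd


def read_table(source_name: str, buffer) -> pd.DataFrame:
    """Choose the delimiter from the file extension (hypothetical helper)."""
    if source_name.endswith(".csv"):
        return pd.read_csv(buffer)
    if source_name.endswith(".tsv"):
        return pd.read_csv(buffer, delimiter="\t")
    raise ValueError(f"Unsupported file type: {source_name}")


# hypothetical sample sources: file name -> raw text
sources = {
    "movies.csv": "movieId,title\n1,Toy Story\n2,Jumanji\n",
    "brands.tsv": "Brand\tTotal\nMarvel\t100\n",
}

# build a {table_name: DataFrame} dictionary, as `extract` does
data = {
    name.rsplit(".", 1)[0]: read_table(name, io.StringIO(text))
    for name, text in sources.items()
}

print(list(data))
print(data["brands"].columns.tolist())
```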

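Next, we define the `DataTransformer` class with two methods: `checks` and `clean`. The `checks` method only prints diagnostics (null counts and table info) for the needed tables, while the `clean` method handles missing values, converts currency-formatted columns to numeric types, and returns the data as a dictionary.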
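The cleaning step can be illustrated on a small table. This is a sketch under stated assumptions: the sample DataFrame is hypothetical, and it uses `pd.api.types.is_numeric_dtype` rather than the `dtype == 'object'` comparison used in the class, so that it does not depend on which string dtype pandas infers.

```python
import pandas as pd

# hypothetical sample table with currency-formatted strings
df = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C"],
    "Lifetime_Gross": ["$1,234,567", "$89,000", None],
})

# drop rows with missing values (one of the two strategies `clean` uses)
df = df.dropna().copy()

col = "Lifetime_Gross"
if not pd.api.types.is_numeric_dtype(df[col]):
    # remove dollar signs and commas, then convert to a numeric type
    stripped = (
        df[col]
        .str.replace("$", "", regex=False)
        .str.replace(",", "", regex=False)
    )
    df[col] = pd.to_numeric(stripped, errors="coerce")

print(df)
```

With `errors="coerce"`, any value that still cannot be parsed after stripping becomes `NaN` instead of raising an exception.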

We define the `DataLoader` class with a `load` method. The constructor takes the destination folder path; `load` takes the data as a dictionary, writes each table to its own CSV file, and puts the files in the given folder.
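As a rough sketch of the loading step, the snippet below writes a dictionary of DataFrames to a temporary directory, so it can be run without touching the workspace folders. The sample tables and the folder name are illustrative assumptions.

```python
import os
import tempfile

import pandas as pd

# hypothetical in-memory tables keyed by table name
data = {
    "movies": pd.DataFrame({"movieId": [1, 2], "title": ["Toy Story", "Jumanji"]}),
    "tags": pd.DataFrame({"movieId": [1], "tag": ["pixar"]}),
}

with tempfile.TemporaryDirectory() as tmp:
    destination = os.path.join(tmp, "RAW_DATA_FOLDER")

    # create the folder if needed, then write one CSV file per table
    os.makedirs(destination, exist_ok=True)
    for table_name, df in data.items():
        df.to_csv(os.path.join(destination, f"{table_name}.csv"), index=False)

    # inspect the result before the temporary directory is removed
    written = sorted(os.listdir(destination))
    reloaded = pd.read_csv(os.path.join(destination, "movies.csv"))

print(written)
print(reloaded)
```

Passing `index=False` keeps the DataFrame index out of the file, so reading the CSV back gives the same columns that were written.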

Finally, we define the `ETLPipeline` class with a single `run` method that runs the instances of `DataExtractor`, `DataTransformer` and `DataLoader` in sequence, such that after each main stage we load the data to files.



In [0]:
import pandas as pd
import os

class DataExtractor:

    # extractor constructor
    def __init__(self, tables: dict):
        self.tables = tables

    # extractor extract method
    def extract(self):
        data = {}
        for url, table_name in self.tables.items():
            if url.endswith('.csv'):
                data[table_name] = pd.read_csv(url, engine="python")
            elif url.endswith('.tsv'):
                data[table_name] = pd.read_csv(url, delimiter='\t', engine="python")
        return data


class DataTransformer:

    # transformer constructor
    def __init__(self, columns_to_clean: dict, needed_table: list):
        self.columns_to_clean = columns_to_clean
        self.needed_table = needed_table

    # transformer method that prints diagnostics for the needed tables
    def checks(self, data):

        # loop over the extracted tables
        for table, df in data.items():

            # only inspect tables listed in needed_table
            if table in self.needed_table:
                print(table)

                # null check: count of missing values per column
                print(df.isna().sum())

                # table info check; info() prints directly and returns None
                df.info()
                print("---")

    # transformer clean method
    def clean(self, data):

        # loop through items in dictionary
        for table_name, columns in self.columns_to_clean.items():

            # fill NA with zeros for the franchises table; drop rows with NA in every other table
            if table_name == "Domestic_Box_Office_Franchises":
                data[table_name] = data[table_name].fillna(0)
            else:
                data[table_name] = data[table_name].dropna()

            # inner loop over the columns to clean:
            # remove dollar signs and commas, then convert to a numeric type
            for col in columns:
                if data[table_name][col].dtype == 'object':
                    data[table_name][col] = (
                        data[table_name][col]
                        # cast to str so zeros written by fillna(0) are not turned into NaN by .str
                        .astype(str)
                        .str.replace('$', '', regex=False)
                        .str.replace(',', '', regex=False)
                    )
                    data[table_name][col] = pd.to_numeric(data[table_name][col], errors='coerce')
        return data
    


class DataLoader:

    # loader constructor
    def __init__(self, destination_folder: str):
        self.destination_folder = destination_folder

    # loader load method
    def load(self, data):
        os.makedirs(self.destination_folder, exist_ok=True)
        for table_name, df in data.items():
            path = os.path.join(self.destination_folder, f"{table_name}.csv")
            df.to_csv(path, index=False)

# pipeline class that wires together the extractor, transformer and loader
class ETLPipeline:

    # ETL pipeline constructor
    def __init__(self, tables, columns_to_clean, folders, needed_table):
        self.extractor = DataExtractor(tables)
        self.transformer = DataTransformer(columns_to_clean, needed_table)
        self.raw_folder, self.clean_folder, self.transformed_folder = folders

    # etl pipeline run method
    def run(self):
        print("Extracting data...")
        data = self.extractor.extract()

        print("Saving raw data...")
        DataLoader(self.raw_folder).load(data)

        print("Checking data...")
        self.transformer.checks(data)

        print("Cleaning data...")
        cleaned = self.transformer.clean(data)
        DataLoader(self.clean_folder).load(cleaned)

        print("Transforming data...")
        # your transform_data() logic can go here
        transformed = cleaned  # placeholder
        DataLoader(self.transformed_folder).load(transformed)

        print("ETL complete!")

# run the pipeline only when this file is executed directly, not when it is imported as a module
if __name__ == "__main__":

    TABLES = {
        "https://raw.githubusercontent.com/mansik95/IMDB-Analysis/master/Data/MovieLens_movies.csv": "movies_Id",
        "https://raw.githubusercontent.com/mansik95/IMDB-Analysis/master/Data/IMDb%20BoxOfficeMojo%20-%20Brands%20(US%20%26%20Canada).tsv": "brands_US_and_Canada",
        "https://raw.githubusercontent.com/mansik95/IMDB-Analysis/master/Data/IMDb%20BoxOfficeMojo%20-%20Brand_%20Marvel%20Comics.tsv": "brand_marvel_comics",
        "https://raw.githubusercontent.com/mansik95/IMDB-Analysis/master/Data/The%20Numbers%20-%20Domestic%20Box%20Office%20Daily%20-%20The%20Avengers.tsv": "Domestic_Box_Office_Daily_The_Avengers",
        "https://raw.githubusercontent.com/mansik95/IMDB-Analysis/master/Data/The%20Numbers%20-%20Domestic%20Box%20Office%20-%20Franchises.tsv": "Domestic_Box_Office_Franchises",
        "https://raw.githubusercontent.com/mansik95/IMDB-Analysis/master/Data/The%20Numbers%20-%20Domestic%20Box%20Office%20-%20Franchises%20-%20Marvel%20Cinematic%20Universe.tsv": "Domestic_Box_Office_Franchises_Marvel_Cinematic",
        "https://raw.githubusercontent.com/mansik95/IMDB-Analysis/master/Data/World%20Wide%20Box%20Office%20All%20Time%20Top%201000.tsv": "World_Wide_Box_Office_All_Time_Top_1000",
        "https://raw.githubusercontent.com/mansik95/IMDB-Analysis/master/Data/IMDb%20BoxOfficeMojo%20-%20Franchises%20(US%20%26%20Canada).tsv": "Franchises_us_and_Canada",
        "https://raw.githubusercontent.com/mansik95/IMDB-Analysis/master/Data/IMDb%20BoxOfficeMojo%20-%20Franchise_%20top20.tsv": "top_20_for_each_Franchise",
        "https://raw.githubusercontent.com/mansik95/IMDB-Analysis/master/Data/MovieLens_tags.csv": "tags"
    }

    COLUMNS_TO_CLEAN = {
        'Domestic_Box_Office_Franchises': ['Domestic_Box_Office', 'Infl_Adj_Dom_Box_Office', 'Worldwide_Box_Office'],
        'Domestic_Box_Office_Franchises_Marvel_Cinematic': ['Production_Budget', 'Opening_Weekend', 'Domestic_Box_Office', 'Worldwide_Box_Office'],
        'top_20_for_each_Franchise': ['Lifetime_Gross','Opening_Gross','Max_Theaters']
    }

    FOLDERS = (
        "/Workspace/Users/john.arhin@gmail.com/Full-Stack-IMBD-Data-Analysis/src/RAW_DATA_FOLDER",
        "/Workspace/Users/john.arhin@gmail.com/Full-Stack-IMBD-Data-Analysis/src/CLEANED_DATA_FOLDER",
        "/Workspace/Users/john.arhin@gmail.com/Full-Stack-IMBD-Data-Analysis/src/TRANSFORMED_DATA_FOLDER"
    )


    NEEDED_TABLE = [
        "tags",
        "Domestic_Box_Office_Franchises_Marvel_Cinematic",
        "movies_Id",
        "brands_US_and_Canada",
        "Domestic_Box_Office_Franchises",
        "World_Wide_Box_Office_All_Time_Top_1000",
        "top_20_for_each_Franchise"
    ]

    pipeline = ETLPipeline(TABLES, COLUMNS_TO_CLEAN, FOLDERS, NEEDED_TABLE)
    pipeline.run()

The `if __name__ == "__main__"` guard ensures that the pipeline runs only when the file is executed directly, and not when it is imported as a module. It is explained in the video [Python Tutorial: if __name__ == '__main__'](https://www.youtube.com/watch?v=sugvnHA7ElY).
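A minimal sketch of the idiom, assuming a hypothetical `main` function that stands in for building and running the pipeline:

```python
def main() -> str:
    """Hypothetical entry point standing in for the pipeline run."""
    return "pipeline ran"


# __name__ equals "__main__" only when this file is executed directly;
# when the file is imported, __name__ is the module name and main() is not called here
if __name__ == "__main__":
    print(main())
```

Keeping the run logic behind the guard means another notebook or script can import the classes without triggering downloads or file writes.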

A Databricks run of this notebook can be found [here](https://dbc-54b899f0-8dbe.cloud.databricks.com/jobs/3957824083901/runs/950822748280727?o=2413881793511514).