# Week #4 - Live Class
Data Pipeline Course - Sekolah Engineer - Pacmann Academy 



## Review

### Data Transformation

Data transformation methods:

- Enrichment: Adding additional information or attributes to the data.
- Aggregation: Combining multiple data points into a single summary.
- Joining: Combining data from multiple sources based on a common key.
- Anonymization: Removing or obfuscating personally identifiable information from the data.
- Filtering: Selecting specific data points based on certain criteria.
- Splitting: Dividing a single data point into multiple parts.
- Structuring: Organizing the data into a specific format or structure.
- Deduplication: Removing duplicate data points from the dataset.
- Conversion: Changing the data format or type.
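
A few of these methods can be illustrated on a tiny in-memory dataset. The sketch below is only an illustration: the records and the field names (`customer`, `email`, `amount`) are made up and do not come from the Dell DVD Store data.

```python
from collections import defaultdict

# Made-up order records; the second and third rows are exact duplicates.
orders = [
    {"order_id": 1, "customer": "ana", "email": "ana@gmail.com", "amount": 20.0},
    {"order_id": 2, "customer": "budi", "email": "budi@yahoo.com", "amount": 35.5},
    {"order_id": 2, "customer": "budi", "email": "budi@yahoo.com", "amount": 35.5},
    {"order_id": 3, "customer": "ana", "email": "ana@gmail.com", "amount": 5.0},
]

# Deduplication: keep the first record seen for each order_id.
seen = set()
deduplicated = []
for row in orders:
    if row["order_id"] not in seen:
        seen.add(row["order_id"])
        deduplicated.append(row)

# Filtering: keep only orders with an amount of at least 10.
filtered = [row for row in deduplicated if row["amount"] >= 10]

# Aggregation: total amount per customer.
totals = defaultdict(float)
for row in deduplicated:
    totals[row["customer"]] += row["amount"]

# Anonymization: hide the local part of each email address.
anonymized = [
    {**row, "email": "***@" + row["email"].split("@")[1]} for row in deduplicated
]

print(len(deduplicated), len(filtered), dict(totals))
```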


# Data Validation

Data validation is the process of ensuring that data is accurate, complete, and consistent. It may involve:
- checking for missing values,
- verifying data types,
- performing range checks,
- applying any other rules or constraints that protect the quality and integrity of the data.
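
As a rough illustration of these checks, the sketch below validates a single record. The function name `check_record`, the field names, and the 0 to 100 range are assumptions chosen for the example.

```python
def check_record(record: dict) -> list:
    """Return a list of human-readable problems found in one record."""
    problems = []

    # Missing-value check on the fields we consider mandatory.
    for field in ("title", "price"):
        if record.get(field) is None:
            problems.append(f"{field} is missing")

    price = record.get("price")
    # Data-type check: price must be numeric.
    if price is not None and not isinstance(price, (int, float)):
        problems.append("price is not numeric")
    # Range check: only evaluated when the type check passed.
    elif price is not None and not 0 <= price <= 100:
        problems.append("price is out of range")

    return problems


print(check_record({"title": "Alien", "price": 12.5}))
print(check_record({"title": None, "price": "12.5"}))
```

Returning a list of problems, instead of stopping at the first one, makes it easier to report every issue of a rejected record at once.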

## Study Case: Dell DVD Store

### Case Description

`Problem`

The Dell DVD Store is facing challenges with its current data processing. The store needs to handle data from multiple sources, each with a different role:
- Database: Stores the data produced by the current system.
- API: Serves the historical data kept in the old system.
- Spreadsheet: Contains the team's analysis of order status based on the current product stock.

<img src= https://sekolahdata-assets.s3.ap-southeast-1.amazonaws.com/notebook-images/mde-data-ingestion-spark/live_w4_2.png width="1000"> <br>

`Solution`

To address these challenges, we propose building a data pipeline for the Dell DVD Store. The pipeline involves the following steps:
- Data Extraction:
    - Sources: Extract data from spreadsheets, databases, and APIs.
    - Techniques: Use both full and incremental extraction methods to retrieve data efficiently.
- Data Load:
    - Staging: Load raw data into a staging database (PostgreSQL) without transformation.
    - Final Load: Transfer clean and transformed data to the final destination.
    - Failure Handling: Write failed data loads to MinIO object storage for reprocessing.
- Data Transformation:
    - Cleaning: Handle missing values, incorrect data formats, and other data quality issues.
    - Transforming: Add derived fields and calculated metrics as needed.
- Data Validation: Check that the data meets predefined rules.
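
The difference between full and incremental extraction can be sketched without a database. In the example below the rows and the helper name `extract` are hypothetical; the only idea carried over from the pipeline is that incremental extraction keeps rows whose `created_at` is later than the last successful run.

```python
from datetime import datetime

# Made-up source rows with a created_at timestamp.
source_rows = [
    {"id": 1, "created_at": datetime(2024, 1, 1, 9, 0)},
    {"id": 2, "created_at": datetime(2024, 1, 2, 9, 0)},
    {"id": 3, "created_at": datetime(2024, 1, 3, 9, 0)},
]


def extract(rows, last_success=None):
    """Full extraction when last_success is None, incremental otherwise."""
    if last_success is None:
        return list(rows)
    # Incremental: only rows created after the last successful run.
    return [row for row in rows if row["created_at"] > last_success]


full = extract(source_rows)
incremental = extract(source_rows, last_success=datetime(2024, 1, 2, 9, 0))
print(len(full), [row["id"] for row in incremental])
```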

`Tools and Technologies`:
- Python: For building the data pipeline.
- PostgreSQL: For the log, staging, and final data storage.
- MinIO: For storing failed and invalid data.
- Docker: For running MinIO.

### Previous Live Class
- Perform Data Ingestion (Extract and Load) from Source to Staging Area
- Load Failed Data to Object Storage

### Next Task
- Extract Data From Staging
- Transform Data
- Validate Data
- Load Invalid Data to Object Storage
- Load Valid Data to Data Warehouse


### Target Schema


<img src= https://sekolahdata-assets.s3.ap-southeast-1.amazonaws.com/notebook-images/mde-data-ingestion-spark/live_w4_1.png width="800"> <br>
DDL Schema: [Link](https://drive.google.com/file/d/1NUNR84AGnHDbxrsEhIXWnJZ-MmPTlrKa/view?usp=sharing)

### Source to Target Mapping

Source to Target Mapping Documentation: [Link](https://github.com/Kurikulum-Sekolah-Pacmann/data_pipeline_dellstore/blob/main/source_target_mapping.md)

### Data Validation

Here are some validation rules that can be applied to the previously mentioned tables to ensure data integrity and quality.

Validation rules:
1. Customer Table Validation:
    - Validate that email addresses follow the standard `local-part@domain` format (e.g., addresses at yahoo.com, hotmail.com, or gmail.com).
    - Ensure that the phone number contains exactly 10 digits.
    - Validate that the credit card expiration date is in the YYYY/MM format.
2. Products Table Validation:
    - Ensure that the price value is within the range of 0 to 100.
3. Orders Table Validation:
    - Ensure that net_amount, tax, and total_amount are positive values.
4. Orderline Table Validation:
    - Ensure that quantity is a positive number.
5. order_status_analytic Table Validation:
    - Validate that the status is either partial, fulfilled, or backordered.
Other constraints, such as NOT NULL columns, are managed by the database constraints.
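
The rules above could be expressed as small predicate functions. The sketch below is one possible reading of them, not the implementation used later in the class: the function names are hypothetical, the email pattern is a deliberately simple approximation, and "positive" is interpreted as strictly greater than zero.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")
PHONE_RE = re.compile(r"\d{10}")
CC_EXPIRATION_RE = re.compile(r"\d{4}/(0[1-9]|1[0-2])")
VALID_STATUS = {"partial", "fulfilled", "backordered"}


def is_valid_email(value) -> bool:
    return isinstance(value, str) and EMAIL_RE.fullmatch(value) is not None


def is_valid_phone(value) -> bool:
    # Exactly 10 digits, nothing else.
    return isinstance(value, str) and PHONE_RE.fullmatch(value) is not None


def is_valid_cc_expiration(value) -> bool:
    # YYYY/MM with a month between 01 and 12.
    return isinstance(value, str) and CC_EXPIRATION_RE.fullmatch(value) is not None


def is_valid_price(value) -> bool:
    return isinstance(value, (int, float)) and 0 <= value <= 100


def is_positive(value) -> bool:
    # Used for net_amount, tax, total_amount, and quantity.
    return isinstance(value, (int, float)) and value > 0


def is_valid_status(value) -> bool:
    return isinstance(value, str) and value in VALID_STATUS


print(is_valid_email("ana@gmail.com"), is_valid_phone("12345"))
```

With pandas, each predicate can be applied to a column (for example with `Series.apply`) to build a boolean mask that separates valid rows from invalid ones.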


### Staging to Warehouse

In [2]:
from dotenv import load_dotenv
import os
import pandas as pd
from datetime import datetime
import re

from minio import Minio
from io import BytesIO

from sqlalchemy import create_engine
import sqlalchemy
from pangres import upsert

#### Load File Env

In [3]:
# Load the environment variables
load_dotenv(".env")

# Get the database environment variables
DB_HOST = os.getenv("DB_HOST")
DB_USER = os.getenv("DB_USER")
DB_PASS = os.getenv("DB_PASS")

DB_NAME_STG = os.getenv("DB_NAME_STG")
DB_SHCHEMA_STG = os.getenv("DB_SHCHEMA_STG")
DB_NAME_log = os.getenv("DB_NAME_log")
DB_NAME_DW = os.getenv("DB_NAME_DW")
MODEL_PATH = os.getenv("MODEL_PATH")

# Get the Minio environment variables
ACCESS_KEY_MINIO = os.getenv("ACCESS_KEY_MINIO")
SECRET_KEY_MINIO = os.getenv("SECRET_KEY_MINIO")

To manage SQL queries efficiently using external `.sql` files, you can create a function that reads these files and returns their content.<br>
Each `.sql` file should contain the SQL query for the respective table.

In [4]:
def read_sql(table_name):
    # open the .sql file for this table (MODEL_PATH must end with a path separator)
    with open(f"{MODEL_PATH}{table_name}.sql", 'r') as file:
        content = file.read()
    
    # return the query text
    return content
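
A variant of this helper can be tried without the course repository. The sketch below is an assumption-laden alternative, not the function used in this notebook: `read_sql_file` is a hypothetical name, and the directory and `log.sql` file are created on the fly just for the demonstration.

```python
import tempfile
from pathlib import Path


def read_sql_file(model_dir, table_name: str) -> str:
    """Return the text of <model_dir>/<table_name>.sql."""
    return (Path(model_dir) / f"{table_name}.sql").read_text(encoding="utf-8")


# Demonstration with a throwaway directory and a made-up query file.
with tempfile.TemporaryDirectory() as tmp_dir:
    sql_path = Path(tmp_dir) / "log.sql"
    sql_path.write_text("SELECT MAX(etl_date) FROM etl_log", encoding="utf-8")
    query_text = read_sql_file(tmp_dir, "log")

print(query_text)
```

Joining the path with `pathlib` means the result does not depend on whether the directory setting ends with a slash.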

#### Log Function

In [5]:
def etl_log(log_msg: dict):
    """
    This function is used to save the log message to the database.
    """
    try:
        # create connection to database
        conn = create_engine(f"postgresql://{DB_USER}:{DB_PASS}@{DB_HOST}/{DB_NAME_log}")
        
        # convert dictionary to dataframe
        df_log = pd.DataFrame([log_msg])

        # append the log row to the etl_log table
        df_log.to_sql(name = "etl_log",  # Your log table
                        con = conn,
                        if_exists = "append",
                        index = False)
    except Exception as e:
        print("Can't save your log message. Cause: ", str(e))

In [6]:
def read_etl_log(filter_params: dict):
    """
    This function reads the etl_log table and returns the maximum etl_date for a specific process, step, table name, and status.
    """
    try:
        # create connection to database
        conn = create_engine(f"postgresql://{DB_USER}:{DB_PASS}@{DB_HOST}/{DB_NAME_log}")
        
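        # To support incremental loads, fetch the latest etl_date for the given filters.
        # log.sql is expected to contain a query with named bind parameters that
        # match the keys of filter_params, along the lines of:
        #   SELECT MAX(etl_date)
        #   FROM etl_log
        #   WHERE
        #       step = :step_name AND
        #       table_name ILIKE :table_name AND
        #       status = :status AND
        #       process = :process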
        query = sqlalchemy.text(read_sql("log"))

        # Execute the query with pd.read_sql
        df = pd.read_sql(sql=query, con=conn, params=(filter_params,))

        #return extracted data
        return df
    except Exception as e:
        print("Can't execute your query. Cause: ", str(e))
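
The watermark lookup can be tried locally with an in-memory SQLite database standing in for the PostgreSQL log database. Everything in this sketch is an assumption made for illustration: the table layout, the sample rows, the helper name `last_success`, and the use of `=` where the PostgreSQL query would use `ILIKE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE etl_log (step TEXT, process TEXT, status TEXT, "
    "table_name TEXT, etl_date TEXT)"
)
conn.executemany(
    "INSERT INTO etl_log VALUES (:step, :process, :status, :table_name, :etl_date)",
    [
        {"step": "warehouse", "process": "load", "status": "success",
         "table_name": "customers", "etl_date": "2024-01-01 10:00:00"},
        {"step": "warehouse", "process": "load", "status": "success",
         "table_name": "customers", "etl_date": "2024-01-02 10:00:00"},
        {"step": "warehouse", "process": "load", "status": "failed",
         "table_name": "customers", "etl_date": "2024-01-03 10:00:00"},
    ],
)

WATERMARK_QUERY = """
SELECT MAX(etl_date)
FROM etl_log
WHERE step = :step_name
  AND table_name = :table_name
  AND status = :status
  AND process = :process
"""


def last_success(connection, filter_params: dict, default: str = "1111-01-01") -> str:
    """Return the latest etl_date matching the filters, or a default for the first run."""
    row = connection.execute(WATERMARK_QUERY, filter_params).fetchone()
    # MAX() over zero rows yields a single row containing NULL (None in Python).
    return row[0] if row[0] is not None else default


filters = {"step_name": "warehouse", "table_name": "customers",
           "status": "success", "process": "load"}
print(last_success(conn, filters))
print(last_success(conn, {**filters, "table_name": "orders"}))
```

The failed run on 2024-01-03 is ignored, so the next incremental extraction would start from the last successful load.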

#### Extract Data From Staging

In [7]:
def extract_staging(table_name: str, schema_name: str) -> pd.DataFrame:
    """
    This function is used to extract data from the staging database. 
    """
    try:
        # create connection to database staging
        conn = create_engine(f"postgresql://{DB_USER}:{DB_PASS}@{DB_HOST}/{DB_NAME_STG}")

        # Get date from previous process
        filter_log = {"step_name": "warehouse",
                    "table_name": table_name,
                    "status": "success",
                    "process": "load"}
        etl_date = read_etl_log(filter_log)


        # If no successful load has been logged yet (MAX(etl_date) is NULL), fall back to '1111-01-01' so the initial run extracts everything.
        # Otherwise, retrieve only the data added since the last successful load (etl_date).
        last_etl_date = etl_date['max'][0]
        if pd.isna(last_etl_date):
            etl_date = '1111-01-01'
        else:
            etl_date = last_etl_date
            # etl_date = etl_date.strftime("%Y-%m-%d")

        # Constructs a SQL query to select all columns from the specified table_name where created_at is greater than etl_date.
        query = f"SELECT * FROM {schema_name}.{table_name} WHERE created_at > %s::timestamp"

        # Execute the query with pd.read_sql
        df = pd.read_sql(sql=query, con=conn, params=(etl_date,))
        log_msg = {
                "step" : "warehouse",
                "process":"extraction",
                "status": "success",
                "source": "database",
                "table_name": table_name,
                "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S")  # Current timestamp
        }
        return df
    except Exception as e:
        log_msg = {
            "step" : "warehouse",
            "process":"extraction",
            "status": "failed",
            "source": "database",
            "table_name": table_name,
            "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),  # Current timestamp
            "error_msg": str(e)
        }
        print(e)
    finally:
        # Save the log message
        etl_log(log_msg)


#### Handle Error Data

In [8]:
def handle_error(data, bucket_name:str, table_name:str, process:str):
    """
    This function is used to handle error or invalid data by uploading the DataFrame to a MinIO bucket.
    """
    current_date = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    # Initialize MinIO client
    client = Minio('localhost:9000',
                access_key=ACCESS_KEY_MINIO,
                secret_key=SECRET_KEY_MINIO,
                secure=False)

    # Make a bucket if it doesn't exist
    if not client.bucket_exists(bucket_name):
        client.make_bucket(bucket_name)

    # Convert DataFrame to CSV and then to bytes
    csv_bytes = data.to_csv().encode('utf-8')
    csv_buffer = BytesIO(csv_bytes)

    # Upload the CSV file to the bucket
    client.put_object(
        bucket_name=bucket_name,
        object_name=f"{process}_{table_name}_{current_date}.csv",  # object name: <process>_<table name>_<current ETL timestamp>.csv
        data=csv_buffer,
        length=len(csv_bytes),
        content_type='application/csv'
    )

    # List objects in the bucket
    objects = client.list_objects(bucket_name, recursive=True)
    for obj in objects:
        print(obj.object_name)
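
The part of this step that does not need a running MinIO server, building the object name and the CSV payload, can be sketched with the standard library only. The helper name `build_error_object` and the compact timestamp format are assumptions; the function above uses a timestamp with spaces and colons instead.

```python
import csv
import io
from datetime import datetime


def build_error_object(rows, process: str, table_name: str, now=None):
    """Return (object_name, payload_bytes) for a non-empty list of dict rows."""
    if not rows:
        raise ValueError("rows must contain at least one record")
    now = now or datetime.now()
    # A compact timestamp avoids spaces and colons in the object name.
    stamp = now.strftime("%Y%m%dT%H%M%S")
    object_name = f"{process}_{table_name}_{stamp}.csv"

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return object_name, buffer.getvalue().encode("utf-8")


name, payload = build_error_object(
    [{"customer_nk": 1, "email": "bad-email"}],
    process="validation",
    table_name="customers",
    now=datetime(2024, 1, 2, 3, 4, 5),
)
print(name, len(payload))
```

The resulting bytes can then be wrapped in `BytesIO` and uploaded in the same way as in the function above.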

#### Load to Warehouse

In [54]:
def load_warehouse(data, schema:str, table_name: str, idx_name:str, source):
    """
    This function is used to upsert data into the data warehouse and to log the result.
    """
    try:
        # create connection to database
        conn = create_engine(f"postgresql://{DB_USER}:{DB_PASS}@{DB_HOST}/{DB_NAME_DW}")
        
        # set data index or primary key
        data = data.set_index(idx_name)
        
        # Do upsert (Update for existing data and Insert for new data)
        upsert(con = conn,
                df = data,
                table_name = table_name,
                schema = schema,
                if_row_exists = "update")
        
        #create success log message
        log_msg = {
                "step" : "warehouse",
                "process":"load",
                "status": "success",
                "source": source,
                "table_name": table_name,
                "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S")  # Current timestamp
            }
        # return data
    except Exception as e:

        #create fail log message
        log_msg = {
            "step" : "warehouse",
            "process":"load",
            "status": "failed",
            "source": source,
            "table_name": table_name,
            "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S") , # Current timestamp
            "error_msg": str(e)
        }
        print(e)
        # Handling error: save data to Object Storage
        try:
            handle_error(data = data, bucket_name='error-dellstore', table_name= table_name, process='load')
        except Exception as e:
            print(e)
    finally:
        etl_log(log_msg)
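
To see what "update existing rows, insert new ones" means in practice, the sketch below reproduces the upsert semantics with plain SQL in an in-memory SQLite database. It does not use `pangres` and is not how the function above is implemented; the table, the sample rows, and the helper name `upsert_rows` are assumptions, and the `ON CONFLICT` clause requires SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE categories (category_nk INTEGER PRIMARY KEY, category_name TEXT)"
)

UPSERT = """
INSERT INTO categories (category_nk, category_name)
VALUES (:category_nk, :category_name)
ON CONFLICT (category_nk) DO UPDATE SET category_name = excluded.category_name
"""


def upsert_rows(connection, rows):
    """Insert new rows and update rows whose key already exists."""
    with connection:  # commits on success, rolls back on error
        connection.executemany(UPSERT, rows)


# First load: two new rows.
upsert_rows(conn, [
    {"category_nk": 1, "category_name": "Action"},
    {"category_nk": 2, "category_name": "Comedy"},
])
# Second load: key 2 already exists (update), key 3 is new (insert).
upsert_rows(conn, [
    {"category_nk": 2, "category_name": "Comedies"},
    {"category_nk": 3, "category_name": "Drama"},
])

result = conn.execute(
    "SELECT category_nk, category_name FROM categories ORDER BY category_nk"
).fetchall()
print(result)
```

Because the second load does not create a duplicate for key 2, rerunning the same load leaves the table unchanged, which is what makes retries after a failure safe.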

    

#### Data Transformation

##### Extract Data Warehouse

Create a function that extracts data from the data warehouse so that foreign key values can be looked up.

In [55]:
def extract_target(table_name: str):
    """
    this function is used to extract data from the data warehouse.
    """
    conn = create_engine(f"postgresql://{DB_USER}:{DB_PASS}@{DB_HOST}/{DB_NAME_DW}")

    # Construct a SQL query that selects all columns and all rows from the specified table_name.
    query = f"SELECT * FROM {table_name}"

    # Execute the query with pd.read_sql
    df = pd.read_sql(sql=query, con=conn)
    
    return df

##### Table Category


- Source Table: categories
- Target Table: categories

| Source Field | Target Field  | Transformation Rule                                  |
|--------------|---------------|------------------------------------------------------|
| category     | category_nk   | Direct Mapping                                       |
| -            | category_id   | Auto Generated using `uuid_generate_v4()`            |
| categoryname | category_name | Direct Mapping                                       |

In [11]:
def transform_categories(data: pd.DataFrame, table_name: str) -> pd.DataFrame:
    """
    This function is used to transform category data from the staging database for the data warehouse.
    """
    try:
        process = "transformation"
        # rename columns category to category_nk and categoryname to category_name
        data = data.rename(columns={'category':'category_nk', 'categoryname':'category_name'})

        # deduplication based on category_nk
        data = data.drop_duplicates(subset='category_nk')

        # drop column created_at
        data = data.drop(columns=['created_at'])
        
        log_msg = {
                "step" : "warehouse",
                "process": process,
                "status": "success",
                "source": "staging",
                "table_name": "category",
                "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S")  # Current timestamp
                }
        
        return data
    except Exception as e:
        log_msg = {
            "step" : "warehouse",
            "process": process,
            "status": "failed",
            "source": "staging",
            "table_name": "category",
            "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),  # Current timestamp,
            "error_msg": str(e)
            }
        
        # Handling error: save data to Object Storage
        try:
            handle_error(data = data, bucket_name='error-dellstore', table_name= table_name, process=process)
        except Exception as e:
            print(e)
    finally:
        # Save the log message
        etl_log(log_msg)

##### Table Customer

- Source Table: customers
- Target Table: customers

| Source Field         | Target Field           | Transformation Rule                                                  |
|----------------------|------------------------|----------------------------------------------------------------------|
| customerid           | customer_nk            | Direct Mapping                                                       |
| -                    | customer_id            | Auto Generated using `uuid_generate_v4()`                            |
| firstname            | first_name             | Direct Mapping                                                       |
| lastname             | last_name              | Direct Mapping                                                       |
| address1             | address1               | Direct Mapping                                                       |
| address2             | address2               | Direct Mapping                                                       |
| city                 | city                   | Direct Mapping                                                       |
| state                | state                  | Direct Mapping                                                       |
| zip                  | zip                    | Direct Mapping                                                       |
| country              | country                | Direct Mapping                                                       |
| region               | region                 | Direct Mapping                                                       |
| email                | email                  | Direct Mapping                                                       |
| phone                | phone                  | Direct Mapping                                                       |
| creditcardtype       | credit_card_type       | Direct Mapping                                                       |
| creditcard           | credit_card            | Direct Mapping  and Masking Value                                    |
| creditcardexpiration | credit_card_expiration | Direct Mapping                                                       |
| username             | username               | Direct Mapping                                                       |
| password             | password               | Direct Mapping                                                       |
| age                  | age                    | Direct Mapping                                                       |
| income               | income                 | Direct Mapping                                                       |
| gender               | gender                 | Direct Mapping                                                       |



In [12]:
def transform_customer(data: pd.DataFrame, table_name: str) -> pd.DataFrame:
    """
    This function is used to transform customer data from the staging database for the data warehouse.
    """
    try:
        process = "transformation"
        # rename the customer columns whose names change; all other columns keep their names
        data = data.rename(columns={'customerid':'customer_nk', 'firstname':'first_name', 
                                    'lastname':'last_name', 'creditcardtype':'credit_card_type', 
                                    'creditcard':'credit_card', 'creditcardexpiration':'credit_card_expiration'})
        
        # deduplication based on customer_nk
        data = data.drop_duplicates(subset='customer_nk')

        # Masking credit card number
        data['credit_card'] = data['credit_card'].apply(lambda x: re.sub(r'\d', 'X', x[:-4]) + x[-4:])

        # drop column created_at
        data = data.drop(columns=['created_at'])
        
        log_msg = {
                "step" : "warehouse",
                "process": process,
                "status": "success",
                "source": "staging",
                "table_name": table_name,
                "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S")  # Current timestamp
                }
        
        return data
    except Exception as e:
        log_msg = {
            "step" : "warehouse",
            "process": process,
            "status": "failed",
            "source": "staging",
            "table_name": table_name,
            "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),  # Current timestamp,
            "error_msg": str(e)
            }
        
        # Handling error: save data to Object Storage
        try:
            handle_error(data = data, bucket_name='error-dellstore', table_name= table_name, process=process)
        except Exception as e:
            print(e)
    finally:
        # Save the log message
        etl_log(log_msg)
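
The masking step can be isolated into a small helper that is easy to test. This is a sketch under assumptions: `mask_credit_card` is a hypothetical name, and unlike the lambda above it tolerates `None` and values shorter than four characters.

```python
import re


def mask_credit_card(value, visible: int = 4):
    """Replace every digit with 'X' except in the last `visible` characters."""
    if value is None:
        return None
    text = str(value)
    # Index where the visible tail starts; never negative for short values.
    cut = max(len(text) - visible, 0)
    return re.sub(r"\d", "X", text[:cut]) + text[cut:]


print(mask_credit_card("1234567890123456"))
print(mask_credit_card("1234-5678-9012-3456"))
print(mask_credit_card(None))
```

Only digits are replaced, so separators such as dashes are preserved in the masked value.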

##### Table Products

- Source Table: products
- Target Table: products

| Source Field  | Target Field     | Transformation Rule                                                                 |
|---------------|------------------|-------------------------------------------------------------------------------------|
| prod_id       | product_nk       | Direct Mapping                                                                      |
| -             | product_id       | Auto Generated using `uuid_generate_v4()`                                           |
| category      | category_id      | Lookup `category_id` from `categories` table based on `category`                    |
| title         | title            | Direct Mapping                                                                      |
| actor         | actor            | Direct Mapping                                                                      |
| price         | price            | Direct Mapping                                                                      |
| special       | special          | Direct Mapping                                                                      |
| common_prod_id| common_prod_id   | Direct Mapping                                                                      |

In [13]:
def transform_product(data: pd.DataFrame, table_name: str) -> pd.DataFrame:
    """
    This function is used to transform product data from the staging database for the data warehouse.
    """
    try:
        process = "transformation"
        # rename column product
        data = data.rename(columns={'prod_id':'product_nk', 'category':'category_nk'})
        
        # deduplication based on product_nk
        data = data.drop_duplicates(subset='product_nk')

        # Extract data from the `categories` table
        categories = extract_target('categories')

        #Lookup `category_id` from `categories` table based on `category`   
        data['category_id'] = data['category_nk'].apply(lambda x: categories.loc[categories['category_nk'] == x, 'category_id'].values[0])
        
        # drop column created_at
        data = data.drop(columns=['created_at','category_nk'])

        log_msg = {
                "step" : "warehouse",
                "process": process,
                "status": "success",
                "source": "staging",
                "table_name": table_name,
                "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S")  # Current timestamp
                }
        
        return data
    except Exception as e:
        log_msg = {
            "step" : "warehouse",
            "process": process,
            "status": "failed",
            "source": "staging",
            "table_name": table_name,
            "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),  # Current timestamp,
            "error_msg": str(e)
            }
        
        # Handling error: save data to Object Storage
        try:
            handle_error(data = data, bucket_name='error-dellstore', table_name= table_name, process=process)
        except Exception as e:
            print(e)
    finally:
        # Save the log message
        etl_log(log_msg)
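
The row-by-row lookup above raises an `IndexError` as soon as one key has no match in the warehouse table. An alternative worth considering is a left join, sketched below with made-up data; the UUID strings are placeholders and the split into `matched` and `unmatched` is an assumption about how missing keys might be handled.

```python
import pandas as pd

# Made-up staging rows; category 99 does not exist in the warehouse.
products = pd.DataFrame({"product_nk": [1, 2, 3], "category_nk": [10, 20, 99]})
# Made-up warehouse rows with placeholder surrogate keys.
categories = pd.DataFrame({"category_nk": [10, 20], "category_id": ["uuid-a", "uuid-b"]})

# Left join keeps every product; validate guards against duplicate keys
# on the categories side silently multiplying rows.
merged = products.merge(
    categories[["category_nk", "category_id"]],
    on="category_nk",
    how="left",
    validate="many_to_one",
)

# Rows without a matching category can be routed to error handling.
unmatched = merged[merged["category_id"].isna()]
matched = merged.dropna(subset=["category_id"]).drop(columns=["category_nk"])

print(matched)
print(unmatched)
```

A single merge also avoids scanning the lookup table once per row, which matters when the staging table is large.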

##### Table Inventory

- Source Table: inventory
- Target Table: inventory

| Source Field  | Target Field     | Transformation Rule                                  |
|---------------|------------------|------------------------------------------------------|
| prod_id       | product_nk       | Direct Mapping                                       |
| -             | product_id       | Use the product_id from the product table by matching the product_nk (source)          |
| quan_in_stock | quantity_stock   | Direct Mapping                                       |
| sales         | sales            | Direct Mapping                                       |


In [14]:
def transform_inventory(data: pd.DataFrame, table_name: str) -> pd.DataFrame:
    """
    This function is used to transform inventory data from the staging database for the data warehouse.
    """
    try:
        process = "transformation"
        # rename column inventory
        data = data.rename(columns={'prod_id':'product_nk', 'quan_in_stock':'quantity_stock'})
        
        # deduplication based on product_nk
        data = data.drop_duplicates(subset='product_nk')

        # Extract data from the `products` table
        products = extract_target('products')

        # Lookup `product_id` from `products` table based on `product_nk`
        data['product_id'] = data['product_nk'].apply(lambda x: products.loc[products['product_nk'] == x, 'product_id'].values[0])
        
        # drop column created_at
        data = data.drop(columns=['created_at'])


        log_msg = {
                "step" : "warehouse",
                "process": process,
                "status": "success",
                "source": "staging",
                "table_name": table_name,
                "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S")  # Current timestamp
                }
        
        return data
    except Exception as e:
        log_msg = {
            "step" : "warehouse",
            "process": process,
            "status": "failed",
            "source": "staging",
            "table_name": table_name,
            "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),  # Current timestamp,
            "error_msg": str(e)
            }
        
        # Handling error: save data to Object Storage
        try:
            handle_error(data = data, bucket_name='error-dellstore', table_name= table_name, process=process)
        except Exception as e:
            print(e)
    finally:
        # Save the log message
        etl_log(log_msg)


##### Table Orders

- Source Table: orders
- Target Table: orders

| Source Field | Target Field  | Transformation Rule                                                                 |
|--------------|---------------|-------------------------------------------------------------------------------------|
| -            | order_id      | Auto Generated using `uuid_generate_v4()`                                           |
| orderid      | order_nk      | Direct Mapping                                                                      |
| customerid   | customer_id   | Use the customer_id from the customers table by matching the customer_nk (source)   |
| orderdate    | order_date    | Direct Mapping                                                                      |
| status       | status        | Direct Mapping                                                                      |
| netamount    | net_amount    | Direct Mapping                                                                      |
| tax          | tax           | Direct Mapping                                                                      |
| totalamount  | total_amount  | Direct Mapping                                                                      |

In [15]:
def transform_orders(data: pd.DataFrame, table_name: str) -> pd.DataFrame:
    """
    This function is used to transform orders data from the staging database for the data warehouse.
    """
    try:
        process = "transformation"
        # rename column orders
        data = data.rename(columns={'orderid':'order_nk', 'customerid':'customer_nk', 'orderdate':'order_date', 
                                    'netamount':'net_amount', 'tax':'tax', 'totalamount':'total_amount'})
        
        # Extract data from the `customer` table
        customers = extract_target('customers')

        # Lookup `customer_id` from `customer` table based on `customer_nk`   
        data['customer_id'] = data['customer_nk'].apply(lambda x: customers.loc[customers['customer_nk'] == x, 'customer_id'].values[0])
        
        # drop column created_at
        data = data.drop(columns=['created_at','customer_nk'])

        log_msg = {
                "step" : "warehouse",
                "process": process,
                "status": "success",
                "source": "staging",
                "table_name": table_name,
                "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S")  # Current timestamp
                }
        
        return data
    except Exception as e:
        log_msg = {
            "step" : "warehouse",
            "process": process,
            "status": "failed",
            "source": "staging",
            "table_name": table_name,
            "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),  # Current timestamp,
            "error_msg": str(e)
            }
        print(e)
        # Handling error: save data to Object Storage
        try:
            handle_error(data = data, bucket_name='error-dellstore', table_name= table_name, process=process)
        except Exception as e:
            print(e)
    finally:
        # Save the log message
        etl_log(log_msg)



##### Table Orderlines

- Source Table: orderlines
- Target Table: orderlines

| Source Field  | Target Field  | Transformation Rule                                                                     |
|---------------|---------------|-----------------------------------------------------------------------------------------|
| orderlineid   | orderline_nk  | Direct Mapping                                                                          |
| -             | orderline_id  | Auto Generated using `uuid_generate_v4()`                                               |
| orderid       | order_id      | Lookup `order_id` from `orders` table based on `orderid`                                |
| prod_id       | product_id    | Lookup `product_id` from `products` table based on `prod_id`                            |
| quantity      | quantity      | Direct Mapping                                                                          |
| orderdate     | order_date    | Direct Mapping                                                                          |

In [16]:
def transform_orderlines(data: pd.DataFrame, table_name: str) -> pd.DataFrame:
    """
    This function is used to transform orderlines data from the staging database for the data warehouse.
    """
    try:
        process = "transformation"
        # rename column orderlines
        data = data.rename(columns={'orderlineid':'orderline_nk', 'orderid':'order_nk', 'prod_id':'product_nk', 
                                    'quantity':'quantity', 'orderdate':'order_date'})
        
        # Extract data from the `orders` table
        orders = extract_target('orders')

        # Lookup `order_id` from `orders` table based on `orderid`   
        data['order_id'] = data['order_nk'].apply(lambda x: orders.loc[orders['order_nk'] == x, 'order_id'].values[0])
        
        # Extract data from the `product` table
        products = extract_target('products')

        # Lookup `product_id` from `product` table based on `prod_id`   
        data['product_id'] = data['product_nk'].apply(lambda x: products.loc[products['product_nk'] == x, 'product_id'].values[0])
        
        # drop unnecessary columns
        data = data.drop(columns=['created_at','order_nk','product_nk'])


        log_msg = {
                "step" : "warehouse",
                "process": process,
                "status": "success",
                "source": "staging",
                "table_name": table_name,
                "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S")  # Current timestamp
                }
        
        return data
    except Exception as e:
        log_msg = {
            "step" : "warehouse",
            "process": process,
            "status": "failed",
            "source": "staging",
            "table_name": table_name,
            "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),  # Current timestamp,
            "error_msg": str(e)
            }
        print(e)
        # Handling error: save data to Object Storage
        try:
            handle_error(data = data, bucket_name='error-dellstore', table_name= table_name, process=process)
        except Exception as e:
            print(e)
    finally:
        # Save the log message
        etl_log(log_msg)
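
The row-by-row `apply` lookup above scans the target table once per row and raises `IndexError` when a natural key has no match. As a minimal sketch of an alternative, assuming the natural key is unique in the target table, the key-to-id relation can be built once and applied with `Series.map`, which leaves unmatched keys as `NaN` so they can be inspected. The `lookup_id` helper and the sample frames are hypothetical and only stand in for the result of `extract_target('orders')`.

```python
import pandas as pd

def lookup_id(keys: pd.Series, target: pd.DataFrame, nk_col: str, id_col: str) -> pd.Series:
    """Map natural keys to surrogate ids; keys without a match become NaN."""
    # Build an index of natural key -> surrogate id (the index must be unique for map)
    mapping = target.drop_duplicates(subset=nk_col).set_index(nk_col)[id_col]
    return keys.map(mapping)

# Hypothetical sample data standing in for extract_target('orders')
orders = pd.DataFrame({"order_nk": [1, 2, 3], "order_id": ["a-1", "b-2", "c-3"]})
orderlines = pd.DataFrame({"order_nk": [2, 3, 9], "quantity": [1, 4, 2]})

orderlines["order_id"] = lookup_id(orderlines["order_nk"], orders, "order_nk", "order_id")

# Rows whose natural key was not found in the target table
unmatched = orderlines[orderlines["order_id"].isna()]
```

Whether unmatched rows should be rejected, sent to the error bucket, or loaded with a null key is a design decision this sketch leaves open.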



##### Table customer_orders_history
Target:
- Table customers
- Table products
- Table orders
- Table orderlines

- Source Table: customer_orders_history
- Target Table: customers

| Source Field              | Target Field          | Transformation Rule                                                                  |
|---------------------------|-----------------------|--------------------------------------------------------------------------------------|
| customer_id               | customer_nk           | Direct Mapping                                                                       |
| -                         | customer_id           | Auto Generated using `uuid_generate_v4()`                                            |
| customer_firstname        | first_name            | Direct Mapping                                                                       |
| customer_lastname         | last_name             | Direct Mapping                                                                       |
| customer_address1         | address1              | Direct Mapping                                                                       |
| customer_address2         | address2              | Direct Mapping                                                                       |
| customer_city             | city                  | Direct Mapping                                                                       |
| customer_state            | state                 | Direct Mapping                                                                       |
| customer_zip              | zip                   | Direct Mapping                                                                       |
| customer_country          | country               | Direct Mapping                                                                       |
| customer_region           | region                | Direct Mapping                                                                       |
| customer_email            | email                 | Direct Mapping                                                                       |
| customer_phone            | phone                 | Direct Mapping                                                                       |
| customer_creditcardtype   | credit_card_type      | Direct Mapping and Masking                                                                       |
| customer_creditcard       | credit_card           | Direct Mapping                                                                       |
| customer_creditcardexpiration | credit_card_expiration | Direct Mapping                                                                   |
| customer_username         | username              | Direct Mapping                                                                       |
| customer_password         | password              | Direct Mapping                                                                       |
| customer_age              | age                   | Direct Mapping                                                                       |
| customer_income           | income                | Direct Mapping                                                                       |
| customer_gender           | gender                | Direct Mapping                                                                       |
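
The mapping above lists masking as a transformation rule, but this section does not define how it is done, and the transformation function that follows only renames, filters, and deduplicates columns. The sketch below is therefore an assumption, not the course's method: it masks a card number by keeping only the last four characters. The `mask_value` helper and the sample data are hypothetical.

```python
import pandas as pd

def mask_value(value, visible: int = 4, mask_char: str = "*") -> str:
    """Replace all but the last `visible` characters with `mask_char`."""
    text = str(value)
    if len(text) <= visible:
        # Too short to reveal anything safely: mask everything
        return mask_char * len(text)
    return mask_char * (len(text) - visible) + text[-visible:]

# Hypothetical sample rows
customers = pd.DataFrame({"credit_card": ["1234567812345678", "4111111111111111"]})
customers["credit_card"] = customers["credit_card"].apply(mask_value)
```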

- Source Table: customer_orders_history
- Target Table: products

| Source Field              | Target Field          | Transformation Rule                                                                  |
|---------------------------|-----------------------|--------------------------------------------------------------------------------------|
| product_id                | product_nk            | Direct Mapping                                                                       |
| -                         | product_id            | Auto Generated using `uuid_generate_v4()`                                            |
| product_category          | category_id           | Lookup `category_id` from `categories` table based on `product_category`             |
| product_title             | title                 | Direct Mapping                                                                       |
| product_actor             | actor                 | Direct Mapping                                                                       |
| product_price             | price                 | Direct Mapping                                                                       |
| product_special           | special               | Direct Mapping                                                                       |
| product_common_prod_id    | common_prod_id        | Direct Mapping                                                                       |

- Source Table: customer_orders_history
- Target Table: orders

| Source Field              | Target Field          | Transformation Rule                                                                  |
|---------------------------|-----------------------|--------------------------------------------------------------------------------------|
| order_id                  | order_nk              | Direct Mapping                                                                       |
| -                         | order_id              | Auto Generated using `uuid_generate_v4()`                                            |
| order_customerid          | customer_id           | Lookup `customer_id` from `customers` table based on `order_customerid`              |
| order_date                | order_date            | Direct Mapping                                                                       |
| order_netamount           | net_amount            | Direct Mapping                                                                       |
| order_tax                 | tax                   | Direct Mapping                                                                       |
| order_totalamount         | total_amount          | Direct Mapping                                                                       |

- Source Table: customer_orders_history
- Target Table: orderlines

| Source Field              | Target Field          | Transformation Rule                                                                  |
|---------------------------|-----------------------|--------------------------------------------------------------------------------------|
| orderline_id              | orderline_nk          | Direct Mapping                                                                       |
| -                         | orderline_id          | Auto Generated using `uuid_generate_v4()`                                            |
| order_id                  | order_id              | Lookup `order_id` from `orders` table based on `order_id`                            |
| product_id                | product_id            | Lookup `product_id` from `products` table based on `product_id`                      |
| orderline_quantity        | quantity              | Direct Mapping                                                                       |
| orderline_orderdate       | order_date            | Direct Mapping                                                                       |



In [17]:
def transform_order_hist_cust(data: pd.DataFrame, table_name: str) -> pd.DataFrame:
    """
    Transform customer data from the `customer_orders_history` staging table into the data warehouse format.
    """
    try:
        process = "transformation"
        # rename column for customers
        data = data.rename(columns={
                                'customer_id': 'customer_nk',
                                'customer_firstname': 'first_name',
                                'customer_lastname': 'last_name',
                                'customer_address1': 'address1',
                                'customer_address2': 'address2',
                                'customer_city': 'city',
                                'customer_state': 'state',
                                'customer_zip': 'zip',
                                'customer_country': 'country',
                                'customer_region': 'region',
                                'customer_email': 'email',
                                'customer_phone': 'phone',
                                'customer_creditcardtype': 'credit_card_type',
                                'customer_creditcard': 'credit_card',
                                'customer_creditcardexpiration': 'credit_card_expiration',
                                'customer_username': 'username',
                                'customer_password': 'password',
                                'customer_age': 'age',
                                'customer_income': 'income',
                                'customer_gender': 'gender'
                            }) 
        
        columns_to_keep = [
            'customer_nk', 'customer_id', 'first_name', 'last_name', 
            'address1', 'address2', 'city', 'state', 'zip', 
            'country', 'region', 'email', 'phone', 
            'credit_card_type', 'credit_card', 'credit_card_expiration', 
            'username', 'password', 'age', 'income', 'gender'
        ]

        # Drop unnecessary columns
        data = data.drop(columns=[col for col in data.columns if col not in columns_to_keep])

        # Deduplication based on customer_nk
        data = data.drop_duplicates(subset='customer_nk')

        log_msg = {
                "step" : "warehouse",
                "process": process,
                "status": "success",
                "source": "staging",
                "table_name": table_name,
                "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S")  # Current timestamp
                }
        
        return data
    except Exception as e:
        log_msg = {
            "step" : "warehouse",
            "process": process,
            "status": "failed",
            "source": "staging",
            "table_name": table_name,
            "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),  # Current timestamp,
            "error_msg": str(e)
            }
        print(e)
        # Handling error: save data to Object Storage
        try:
            handle_error(data = data, bucket_name='error-dellstore', table_name= table_name, process=process)
        except Exception as e:
            print(e)
    finally:
        # Save the log message
        etl_log(log_msg)
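
`drop_duplicates(subset='customer_nk')` keeps the first row it encounters for each key, so the surviving customer attributes depend on the row order of the staging data. A minimal sketch of the difference, assuming (hypothetically) that the most recent order should win; the sample frame is made up for illustration.

```python
import pandas as pd

# Hypothetical history rows: customer 1 appears twice with different emails
history = pd.DataFrame({
    "customer_nk": [1, 1, 2],
    "email": ["old@gmail.com", "new@gmail.com", "b@yahoo.com"],
    "order_date": pd.to_datetime(["2004-01-05", "2004-03-10", "2004-02-01"]),
})

# Default behaviour: keep the first occurrence in the current row order
first_seen = history.drop_duplicates(subset="customer_nk")

# Alternative: sort by date and keep the last occurrence per customer
latest = (
    history.sort_values("order_date")
           .drop_duplicates(subset="customer_nk", keep="last")
)
```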



In [18]:
def transform_order_hist_prod(data: pd.DataFrame, table_name: str) -> pd.DataFrame:
    """
    Transform product data from the `customer_orders_history` staging table into the data warehouse format.
    """
    try:
        process = "transformation"
        # rename column for products
        data = data.rename(columns={
            'product_id': 'product_nk', 
            'product_category': 'category_nk', 
            'product_title': 'title', 
            'product_actor': 'actor', 
            'product_price': 'price', 
            'product_special': 'special', 
            'product_common_prod_id': 'common_prod_id'
        })

        # Deduplication based on product_nk
        data = data.drop_duplicates(subset='product_nk')

        # Extract data from the `categories` table
        categories = extract_target('categories')

        #Lookup `category_id` from `categories` table based on `category`   
        data['category_id'] = data['category_nk'].apply(lambda x: categories.loc[categories['category_nk'] == x, 'category_id'].values[0])
        
        # Get relevant columns
        data = data[['product_nk', 'category_id', 'title', 'actor', 'price', 'special', 'common_prod_id']]



        log_msg = {
                "step" : "warehouse",
                "process": process,
                "status": "success",
                "source": "staging",
                "table_name": table_name,
                "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S")  # Current timestamp
                }
        
        return data
    except Exception as e:
        log_msg = {
            "step" : "warehouse",
            "process": process,
            "status": "failed",
            "source": "staging",
            "table_name": table_name,
            "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),  # Current timestamp,
            "error_msg": str(e)
            }
        print(e)
        # Handling error: save data to Object Storage
        try:
            handle_error(data = data, bucket_name='error-dellstore', table_name= table_name, process=process)
        except Exception as e:
            print(e)
    finally:
        # Save the log message
        etl_log(log_msg)



In [19]:
def transform_order_hist_order(data: pd.DataFrame, table_name: str) -> pd.DataFrame:
    """
    Transform order data from the `customer_orders_history` staging table into the data warehouse format.
    """
    try:
        process = "transformation"
        # rename column for orders
        data = data.rename(columns={
                    'order_id': 'order_nk', 
                    'order_customerid': 'customer_nk', 
                    'order_date': 'order_date', 
                    'order_netamount': 'net_amount', 
                    'order_tax': 'tax', 
                    'order_totalamount': 'total_amount'
                })


        # Deduplication based on order_nk
        data = data.drop_duplicates(subset='order_nk')

        # Extract data from the `customers` table
        customer = extract_target('customers')

        #Lookup `customer_id` from `customers` table based on `customer_nk`   
        data['customer_id'] = data['customer_nk'].apply(lambda x: customer.loc[customer['customer_nk'] == x, 'customer_id'].values[0])
        
        # Get relevant columns
        data = data[['order_nk', 'customer_id', 'order_date', 'net_amount', 'tax', 'total_amount']]

        log_msg = {
                "step" : "warehouse",
                "process": process,
                "status": "success",
                "source": "staging",
                "table_name": table_name,
                "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S")  # Current timestamp
                }
        
        return data
    except Exception as e:
        log_msg = {
            "step" : "warehouse",
            "process": process,
            "status": "failed",
            "source": "staging",
            "table_name": table_name,
            "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),  # Current timestamp,
            "error_msg": str(e)
            }
        print(e)
        # Handling error: save data to Object Storage
        try:
            handle_error(data = data, bucket_name='error-dellstore', table_name= table_name, process=process)
        except Exception as e:
            print(e)
    finally:
        # Save the log message
        etl_log(log_msg)



In [20]:
def transform_order_hist_orderline(data: pd.DataFrame, table_name: str) -> pd.DataFrame:
    """
    Transform orderline data from the `customer_orders_history` staging table into the data warehouse format.
    """
    try:
        process = "transformation"
        # Drop the order-level `order_date` so that `orderline_orderdate` can take that name
        data = data.drop(columns=['order_date'])

        # rename columns for orderlines
        data = data.rename(columns={
            'orderline_id': 'orderline_nk', 
            'order_id': 'order_nk', 
            'product_id': 'product_nk', 
            'orderline_quantity': 'quantity', 
            'orderline_orderdate': 'order_date'
        })

        # Deduplication based on orderline_nk, order_nk, product_nk, and quantity
        data = data.drop_duplicates(subset=['orderline_nk','order_nk','product_nk','quantity'])

        # Extract data from the `orders` table
        orders = extract_target('orders')

        # Lookup `order_id` from `orders` table based on `orderid`   
        data['order_id'] = data['order_nk'].apply(lambda x: orders.loc[orders['order_nk'] == x, 'order_id'].values[0])
        
        # Extract data from the `product` table
        products = extract_target('products')

        # Lookup `product_id` from `product` table based on `prod_id`   
        data['product_id'] = data['product_nk'].apply(lambda x: products.loc[products['product_nk'] == x, 'product_id'].values[0])
        
        
        # Get relevant columns
        data = data[['orderline_nk', 'order_id', 'product_id', 'quantity', 'order_date']]
        
        log_msg = {
                "step" : "warehouse",
                "process": process,
                "status": "success",
                "source": "staging",
                "table_name": table_name,
                "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S")  # Current timestamp
                }
        
        return data
    except Exception as e:
        log_msg = {
            "step" : "warehouse",
            "process": process,
            "status": "failed",
            "source": "staging",
            "table_name": table_name,
            "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),  # Current timestamp,
            "error_msg": str(e)
            }
        print(e)
        # Handling error: save data to Object Storage
        try:
            handle_error(data = data, bucket_name='error-dellstore', table_name= table_name, process=process)
        except Exception as e:
            print(e)
    finally:
        # Save the log message
        etl_log(log_msg)



##### Table Customer History

- Source Table: cust_hist
- Target Table: cust_hist

| Source Field | Target Field   | Transformation Rule                                                                 |
|--------------|----------------|-------------------------------------------------------------------------------------|
| customerid   | customer_id    | Lookup `customer_id` from `customers` table based on `customerid`                   |
| orderid      | order_id       | Lookup `order_id` from `orders` table based on `orderid`                             |
| prod_id      | product_id     | Lookup `product_id` from `products` table based on `prod_id`                         |
| created_at   | created_at     | Direct Mapping                                                                      |

In [21]:
def transform_cust_hist(data: pd.DataFrame, table_name: str) -> pd.DataFrame:
    """
    Transform cust_hist data from the staging database into the data warehouse format.
    """
    try:
        process = "transformation"
        # rename columns for cust_hist
        data = data.rename(columns={'customerid':'customer_nk', 'prod_id':'product_nk', 'orderid':'order_nk'})
        
        # Extract data from the `customers` table
        customers = extract_target('customers')

        # Lookup `customer_id` from `customers` table based on `customerid`   
        data['customer_id'] = data['customer_nk'].apply(lambda x: customers.loc[customers['customer_nk'] == x, 'customer_id'].values[0])
        

        # Extract data from the `orders` table
        orders = extract_target('orders')

        # Lookup `order_id` from `orders` table based on `orderid`   
        data['order_id'] = data['order_nk'].apply(lambda x: orders.loc[orders['order_nk'] == x, 'order_id'].values[0])
        
        # Extract data from the `product` table
        products = extract_target('products')

        # Lookup `product_id` from `product` table based on `prod_id`   
        data['product_id'] = data['product_nk'].apply(lambda x: products.loc[products['product_nk'] == x, 'product_id'].values[0])
        
        # drop unnecessary columns
        data = data.drop(columns=['customer_nk','order_nk','product_nk'])


        log_msg = {
                "step" : "warehouse",
                "process": process,
                "status": "success",
                "source": "staging",
                "table_name": table_name,
                "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S")  # Current timestamp
                }
        
        return data
    except Exception as e:
        log_msg = {
            "step" : "warehouse",
            "process": process,
            "status": "failed",
            "source": "staging",
            "table_name": table_name,
            "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),  # Current timestamp,
            "error_msg": str(e)
            }
        print(e)
        # Handling error: save data to Object Storage
        try:
            handle_error(data = data, bucket_name='error-dellstore', table_name= table_name, process=process)
        except Exception as e:
            print(e)
    finally:
        # Save the log message
        etl_log(log_msg)



##### Table order_status_analytic

- Source Table: order_status_analytic
- Target Table: order_status_analytic

| Source Field | Target Field     | Transformation Rule                                        |
|--------------|------------------|------------------------------------------------------------|
| orderid      | order_nk         | Direct Mapping                                             |
| orderid      | order_id         | Lookup `order_id` from `orders` table based on `orderid`   |
| sum_stock    | sum_stock        | Direct Mapping                                             |
| status       | status           | Direct Mapping                                             |

In [22]:
def transform_order_status_analytic(data: pd.DataFrame, table_name: str) -> pd.DataFrame:
    """
    Transform order_status_analytic data from the staging database into the data warehouse format.
    """
    try:
        process = "transformation"
        # rename column order_status_analytic
        data = data.rename(columns={'orderid':'order_nk'})

        # Extract data from the `orders` table
        orders = extract_target('orders')

        # Lookup `order_id` from `orders` table based on `orderid`   
        data['order_id'] = data['order_nk'].apply(lambda x: orders.loc[orders['order_nk'] == x, 'order_id'].values[0])
        
        # drop unnecessary columns
        data = data.drop(columns='created_at')

        log_msg = {
                "step" : "warehouse",
                "process": process,
                "status": "success",
                "source": "staging",
                "table_name": table_name,
                "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S")  # Current timestamp
                }
        
        return data
    except Exception as e:
        log_msg = {
            "step" : "warehouse",
            "process": process,
            "status": "failed",
            "source": "staging",
            "table_name": table_name,
            "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),  # Current timestamp,
            "error_msg": str(e)
            }
        print(e)
        # Handling error: save data to Object Storage
        try:
            handle_error(data = data, bucket_name='error-dellstore', table_name= table_name, process=process)
        except Exception as e:
            print(e)
    finally:
        # Save the log message
        etl_log(log_msg)



#### Data Validation

##### Table Customer Validation

- Customer Table Validation:
    - Validate that each email address has a well-formed local part and one of the accepted domains: yahoo.com, hotmail.com, or gmail.com.
    - Ensure that the phone number contains exactly 10 digits.
    - Validate that the credit card expiration date is in the YYYY/MM format.


In [23]:
import re  # needed by the regex validators below, if not already imported

# Validate the email format and accepted domain
def validate_email_format(email):
    email_regex = re.compile(r"^[\w\.-]+@(yahoo\.com|hotmail\.com|gmail\.com)$")
    return bool(email_regex.match(email))

In [24]:
# Ensure phone number contains 10 digits
def validate_phone_format(phone):
    phone_regex = re.compile(r"^\d{10}$")
    return bool(phone_regex.match(phone))

In [25]:
# Validate that the credit card expiration date is in YYYY/MM format
def validate_credit_card_expiration_format(expiration_date):
    expiration_date_regex = re.compile(r"^\d{4}/\d{2}$")
    return bool(expiration_date_regex.match(expiration_date))   
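
Two edge cases are worth knowing about the regex validators above: `re.match` raises `TypeError` when it receives a non-string such as `None` or `NaN`, and `$` also matches just before a trailing newline. The sketch below is a hypothetical stricter variant of the phone check (the name `validate_phone_strict` is not from the course material) that treats non-strings as invalid and uses `fullmatch`.

```python
import re

PHONE_REGEX = re.compile(r"\d{10}")

def validate_phone_strict(phone) -> bool:
    """Return True only for a string of exactly 10 digits."""
    # Missing or non-string values are treated as invalid instead of raising
    if not isinstance(phone, str):
        return False
    # fullmatch requires the whole string to match, including any trailing newline
    return PHONE_REGEX.fullmatch(phone) is not None

samples = ["0812345678", "08123", "0812345678\n", None]
results = [validate_phone_strict(value) for value in samples]
```

Whether missing values should count as invalid or be skipped depends on the data contract for the `customers` table.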

##### Table Product Validation

- Products Table Validation
    - Ensure that the price value is within the range of 0 to 100.

In [27]:
# Ensure that the price value is within the range of 0 to 100.
def validate_price_range(price):
    return 0 <= price <= 100

##### Table Orders and Orderlines

- Orders Table Validation:
    - Ensure that net_amount, tax, and total_amount are non-negative (zero or greater).
- Orderlines Table Validation:
    - Ensure that quantity is not negative.

In [44]:
# Ensure that a value is non-negative; used for net_amount, tax, total_amount, and quantity.
def validate_positive_value(value):
    return value >= 0
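
The check above uses `>=`, so a value of zero passes. That is reasonable for `tax`, but a zero `quantity` may not be. A small sketch of the distinction, assuming a stricter rule is wanted for some columns; `validate_non_negative` and `validate_strictly_positive` are hypothetical names.

```python
def validate_non_negative(value) -> bool:
    """True for zero or greater; same rule as the `value >= 0` check."""
    return value >= 0

def validate_strictly_positive(value) -> bool:
    """True only for values greater than zero."""
    return value > 0

# Zero passes the first rule but not the second; NaN fails both comparisons
zero_checks = (validate_non_negative(0), validate_strictly_positive(0))
nan_checks = (validate_non_negative(float("nan")), validate_strictly_positive(float("nan")))
```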

##### Table order_status_analytic 

- order_status_analytic Table Validation
    - Validate that the status is either partial, fulfilled, or backordered.


In [45]:
#  Validate that the status is either partial, fulfilled, or backordered.
def validate_order_status(status):
    return status in ['partial', 'fulfilled', 'backordered']

#### Validation Function

In [31]:
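def validation_data(data: pd.DataFrame, table_name: str, validation_functions: dict) -> tuple:
    """
    Validate data with the given functions (a dict of column name -> function returning True/False).
    Returns a tuple (valid_data_df, invalid_data_df); if validation fails, the error is logged and None is returned.
    """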
    try:
        # Create a report DataFrame
        report_data = {f'validate_{name}': data[name].apply(func) for name, func in validation_functions.items()}
        report_df = pd.DataFrame(report_data)

        # Summarize status data by all conditions
        report_df['all_valid'] = report_df.all(axis=1)

        # Keep valid rows (all_valid is True)
        valid_data_df = data[report_df['all_valid']]

        # Keep invalid rows (all_valid is False)
        invalid_data_df = data[~report_df['all_valid']]

        # Create success log message
        log_msg = {
            "step": "warehouse",
            "process": "validation",
            "status": "success",
            "source": "staging",
            "table_name": table_name,
            "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S")  # Current timestamp
        }
        return valid_data_df, invalid_data_df
    except Exception as e:
        # Create fail log message
        log_msg = {
            "step": "warehouse",
            "process": "validation",
            "status": "failed",
            "source": "staging",
            "table_name": table_name,
            "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),  # Current timestamp,
            "error_msg": str(e)
        }
    finally:
        etl_log(log_msg)
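
To make the report logic concrete, the sketch below runs the same steps on a tiny frame without the logging. The sample data and the `validate_title_present` rule are hypothetical; only the report construction mirrors the function above.

```python
import pandas as pd

def validate_price_range(price) -> bool:
    return 0 <= price <= 100

def validate_title_present(title) -> bool:
    # Hypothetical extra rule, added only to show two checks combining
    return isinstance(title, str) and title.strip() != ""

products = pd.DataFrame({
    "title": ["Alien", "", "Heat"],
    "price": [19.99, 9.99, 250.0],
})
validation_functions = {"price": validate_price_range, "title": validate_title_present}

# One boolean column per validated field
report_df = pd.DataFrame({
    f"validate_{name}": products[name].apply(func)
    for name, func in validation_functions.items()
})

# A row is valid only if every check passed
report_df["all_valid"] = report_df.all(axis=1)

valid_df = products[report_df["all_valid"]]
invalid_df = products[~report_df["all_valid"]]
```

Here the first row passes both checks, the second fails the title check, and the third fails the price check, so only the first row ends up in `valid_df`.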


#### Pipeline Warehouse

In [35]:
# Data Categories
df_category = extract_staging(table_name = "categories", schema_name = DB_SHCHEMA_STG)
category_tf = transform_categories(data = df_category, table_name = "categories")
load_warehouse(data = category_tf, schema = "public", table_name = "categories",
                idx_name = "category_nk", source = "staging")

In [38]:
#Data Customer
df_customer = extract_staging(table_name="customers", schema_name=DB_SHCHEMA_STG)
customer_tf = transform_customer(data=df_customer, table_name="customers")
valid_cust, invalid_cust = validation_data(data=customer_tf, table_name="customers", validation_functions={"email": validate_email_format, 
                                                                                                           "phone": validate_phone_format, 
                                                                                                           "credit_card_expiration": validate_credit_card_expiration_format})
load_warehouse(data=valid_cust, schema="public", table_name="customers",
                idx_name="customer_nk", source="staging")
if (not invalid_cust.empty):
    handle_error(data=invalid_cust, bucket_name='error-dellstore', table_name="customers", process='validation')

In [47]:
# Data Product
df_product = extract_staging(table_name="products", schema_name=DB_SHCHEMA_STG)
product_tf = transform_product(data=df_product, table_name="products")
valid_product, invalid_product = validation_data(data=product_tf, table_name="products", validation_functions={"price": validate_price_range})
load_warehouse(data=valid_product, schema="public", table_name="products", 
               idx_name="product_nk", source="staging")
if (not invalid_product.empty):
    handle_error(data=invalid_product, bucket_name='error-dellstore', table_name="products", process='validation')

In [48]:
# Data Inventory
df_inventory = extract_staging(table_name="inventory", schema_name=DB_SHCHEMA_STG)
inventory_tf = transform_inventory(data=df_inventory, table_name="inventory")
load_warehouse(data=inventory_tf, schema="public", table_name="inventory", idx_name="product_nk", source="staging")

In [49]:
# Data Orders
df_orders = extract_staging(table_name="orders", schema_name=DB_SHCHEMA_STG)
orders_tf = transform_orders(data=df_orders, table_name="orders")
valid_orders, invalid_orders = validation_data(data=orders_tf, table_name="orders", validation_functions={"net_amount": validate_positive_value, 
                                                                                                           "tax": validate_positive_value, 
                                                                                                           "total_amount": validate_positive_value})
load_warehouse(data=valid_orders, schema="public", table_name="orders",
                idx_name="order_nk", source="staging")
if (not invalid_orders.empty):
    handle_error(data=invalid_orders, bucket_name='error-dellstore', table_name="orders", process='validation')

In [50]:
# Data Orderlines
df_orderlines = extract_staging(table_name="orderlines", schema_name=DB_SHCHEMA_STG)
orderlines_tf = transform_orderlines(data=df_orderlines, table_name="orderlines")
valid_orderlines, invalid_orderlines = validation_data(data=orderlines_tf, table_name="orderlines", validation_functions={"quantity": validate_positive_value})
load_warehouse(data=valid_orderlines, schema="public", 
               table_name="orderlines", idx_name=["orderline_nk","order_id","product_id","quantity"], source="staging")
if (not invalid_orderlines.empty):
    handle_error(data=invalid_orderlines, bucket_name='error-dellstore', table_name="orderlines", process='validation')

In [56]:
# Data customer_orders_history
df_order_hist = extract_staging(table_name="customer_orders_history", schema_name=DB_SHCHEMA_STG)

# Data Customer
cust_order_hist_tf = transform_order_hist_cust(data=df_order_hist, table_name="customer_orders_history")
valid_cust_hist, invalid_cust_hist = validation_data(data=cust_order_hist_tf, table_name="customers", validation_functions={"email": validate_email_format, 
                                                                                                           "phone": validate_phone_format, 
                                                                                                           "credit_card_expiration": validate_credit_card_expiration_format})
load_warehouse(data=valid_cust_hist, schema="public", table_name="customers", idx_name=["customer_nk"], source="staging")
if (not invalid_cust_hist.empty):
    handle_error(data=invalid_cust_hist, bucket_name='error-dellstore', table_name="customers", process='validation')

# Data Product
prod_order_hist_tf = transform_order_hist_prod(data=df_order_hist, table_name="customer_orders_history")
valid_order_hist_prod, invalid_order_hist_prod = validation_data(data=prod_order_hist_tf, table_name="products", validation_functions={"price": validate_price_range})
load_warehouse(data=valid_order_hist_prod, schema="public", table_name="products", idx_name=["product_nk"], source="staging")
if (not invalid_order_hist_prod.empty):
    handle_error(data=invalid_order_hist_prod, bucket_name='error-dellstore', table_name="products", process='validation')

# Data Orders
order_hist_tf = transform_order_hist_order(data=df_order_hist, table_name="customer_orders_history")
valid_order_hist, invalid_order_hist = validation_data(
    data=order_hist_tf,
    table_name="orders",
    validation_functions={
        "net_amount": validate_positive_value,
        "tax": validate_positive_value,
        "total_amount": validate_positive_value,
    },
)
load_warehouse(data=valid_order_hist, schema="public", table_name="orders", idx_name=["order_nk"], source="staging")
if (not invalid_order_hist.empty):
    handle_error(data=invalid_order_hist, bucket_name='error-dellstore', table_name="orders", process='validation')

# Data Orderlines
orderline_hist_tf = transform_order_hist_orderline(data=df_order_hist, table_name="customer_orders_history")
valid_orderline_hist, invalid_orderline_hist = validation_data(data=orderline_hist_tf, table_name="orderlines", validation_functions={"quantity": validate_positive_value})
load_warehouse(data=valid_orderline_hist, schema="public", table_name="orderlines", 
               idx_name=["orderline_nk","order_id","product_id","quantity"], source="staging")
if (not invalid_orderline_hist.empty):
    handle_error(data=invalid_orderline_hist, bucket_name='error-dellstore', table_name="orderlines", process='validation')

In [52]:
# Data Customer History (loaded without a validation step)
df_cust_hist = extract_staging(table_name="cust_hist", schema_name=DB_SHCHEMA_STG)
cust_hist_tf = transform_cust_hist(data=df_cust_hist, table_name="cust_hist")
load_warehouse(data=cust_hist_tf, schema="public", table_name="cust_hist", 
               idx_name=["customer_id","order_id","product_id"], source="staging")

In [53]:
# Data Order Status Analytic
df_order_analytic = extract_staging(table_name="order_status_analytic", schema_name=DB_SHCHEMA_STG)
order_analytic_tf = transform_order_status_analytic(data=df_order_analytic, table_name="order_status_analytic")
valid_orders_analytic, invalid_orders_analytic = validation_data(data=order_analytic_tf, table_name="order_status_analytic", validation_functions={"status": validate_order_status})
load_warehouse(data=valid_orders_analytic, schema="public", table_name="order_status_analytic", 
               idx_name="order_id", source="staging")
if (not invalid_orders_analytic.empty):
    handle_error(data=invalid_orders_analytic, bucket_name='error-dellstore', table_name="order_status_analytic", process='validation')
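
Every cell above follows the same sequence: extract from staging, transform, validate, load the valid rows, and send the invalid rows to error handling. As a rough sketch of how that repetition could be factored out, the hypothetical helper below takes the step functions as arguments. `process_table` and its parameters are illustrative assumptions, not part of the course code, and the stubs use plain lists of dicts instead of DataFrames so that the example runs on its own.

```python
from typing import Callable, List, Optional, Tuple

Rows = List[dict]


def process_table(
    extract: Callable[[], Rows],
    transform: Callable[[Rows], Rows],
    load: Callable[[Rows], None],
    validate: Optional[Callable[[Rows], Tuple[Rows, Rows]]] = None,
    on_invalid: Optional[Callable[[Rows], None]] = None,
) -> Tuple[int, int]:
    """Run extract, transform, validate, and load for one table.

    Returns (number of loaded rows, number of rejected rows).
    """
    rows = transform(extract())
    # Tables without validation rules (like cust_hist above) load every row.
    valid, invalid = validate(rows) if validate else (rows, [])
    load(valid)
    if invalid and on_invalid:
        on_invalid(invalid)
    return len(valid), len(invalid)


# Stub steps standing in for extract_staging, validation_data, load_warehouse,
# and handle_error. They are hypothetical and only exist for this example.
def split_positive_quantity(rows: Rows) -> Tuple[Rows, Rows]:
    valid = [r for r in rows if r["quantity"] > 0]
    invalid = [r for r in rows if r["quantity"] <= 0]
    return valid, invalid


loaded: Rows = []
rejected: Rows = []

counts = process_table(
    extract=lambda: [
        {"orderline_nk": 1, "quantity": 2},
        {"orderline_nk": 2, "quantity": -1},
    ],
    transform=lambda rows: rows,
    load=loaded.extend,
    validate=split_positive_quantity,
    on_invalid=rejected.extend,
)
print(counts)
```

In the notebook, the same idea would apply to DataFrames: each cell would shrink to one call that passes the table-specific transform and validation functions, while the shared order of steps lives in a single place.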

## Module Structure

You can split the functions defined above into separate modules, grouped by pipeline stage, so that each step is easier to maintain and reuse. Here is a suggested structure:

```
project
│   README.md
│   .env
│   pipeline_staging.py
│   pipeline_warehouse.py
├───src
│   ├───log
│   │   │   log.py
│   ├───staging
│   │   ├───extract
│   │   │   │   extract_db.py
│   │   │   │   extract_api.py
│   │   │   │   extract_spreadsheet.py
│   │   ├───load
│   │   │   │   load_minio.py
│   │   │   │   load_staging.py
│   │   └───models
│   │       │   customers.sql
│   │       │   products.sql
│   └───warehouse
│       ├───extract
│       │   │   extract_staging.py
│       ├───load
│       │   │   load_minio.py
│       │   │   load_warehouse.py
│       ├───models
│       │   │   log.sql
│       ├───transform
│       │   │   products.py
│       │   │   customers.py
│       └───validation
│           │   validation_data.py
└───creds
    │   data-pipeline.json
```
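
With a layout like this, an entry-point script such as `pipeline_warehouse.py` mostly has to import the per-table step functions and call them in order. The sketch below only illustrates that shape: `run_pipeline` and the two stub steps are hypothetical stand-ins (in the real project they would be imported from packages under `src/warehouse`), and the scripts in the repository may be organized differently.

```python
import logging
from typing import Callable, Dict, List

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline_warehouse")


def run_pipeline(steps: Dict[str, Callable[[], None]]) -> List[str]:
    """Run each table step in order and return the names of the steps that failed."""
    failed = []
    for name, step in steps.items():
        try:
            step()
            logger.info("step %s finished", name)
        except Exception:
            # Log the traceback and keep going, so one broken table
            # does not block the remaining ones.
            logger.exception("step %s failed", name)
            failed.append(name)
    return failed


# Hypothetical stand-ins for per-table functions that would live in src/warehouse.
def load_customers() -> None:
    pass


def load_products() -> None:
    raise ValueError("simulated failure")


if __name__ == "__main__":
    failed_steps = run_pipeline(
        {"customers": load_customers, "products": load_products}
    )
    print("failed steps:", failed_steps)
```

Continuing after a failure is a design choice made for this sketch. When tables depend on each other (for example, orders referencing customers), stopping at the first failed step may be the safer behavior.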


Git repository: [data_pipeline_dellstore](https://github.com/Kurikulum-Sekolah-Pacmann/data_pipeline_dellstore.git)