# Week #3 - Data Load
Data Pipeline Course - Sekolah Engineer - Pacmann Academy 



## Description 
Data loading is the process of transferring extracted and/or transformed data into a target storage system. This chapter covers two things: loading extraction results into a staging database as raw data, and handling failed loads by storing the failed data in object storage as .csv files so that it remains available for the next stage.

## Case Description
<img src='pict/load1.png' width="800"> <br>

In the Data Extraction module, we successfully extracted data from the following sources:
1. Spreadsheet
2. Database
3. API

In the Data Load module, we will focus on the following tasks:
1. Handle failed data loads in object storage (MinIO) 
2. Load raw data into the staging database (PostgreSQL)

In [85]:
from dotenv import load_dotenv
import os
import pandas as pd
from sqlalchemy import create_engine
from datetime import datetime

# Import log_to_csv from the log module created in the previous section
from src.log.log import log_to_csv
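
For readers who do not have the previous section at hand, here is a minimal sketch of what a helper like `log_to_csv` might look like. The name `log_to_csv_sketch` and its behavior (append one row per call, write the header once) are assumptions for illustration; the course's actual `src.log.log.log_to_csv` may differ.

```python
import csv
import os


def log_to_csv_sketch(log_msg: dict, filename: str) -> None:
    """Append one log record to a CSV file, writing the header only once."""
    # Write the header only when the file is new or still empty
    write_header = not os.path.exists(filename) or os.path.getsize(filename) == 0

    with open(filename, mode="a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(log_msg.keys()))
        if write_header:
            writer.writeheader()
        writer.writerow(log_msg)
```

Appending rather than overwriting keeps the history of every load attempt in a single file, which is what the success and failure log messages later in this chapter rely on.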

## Load and Handle Failure Data
We will learn how to handle data load failures by storing the failed data in object storage using MinIO.

1. Create Access and Secret Key

    To interact with the MinIO server, you need to set up the MinIO client with an access key and a secret key. These keys act as credentials that allow the client to authenticate and authorize access to MinIO resources. Here’s how to create and configure these keys:

    a. Access MinIO Console: <br>
    Open the MinIO console in your browser. If you are running MinIO locally with Docker, the console is usually accessible at http://localhost:9000. <br>

    b. Go to Access Key and Create Access Key <br>
    <img src='pict/load2.png' width="800"> <br>

    c. Click Create without changing any settings <br>
    <img src='pict/load3.png' width="800"> <br>

    Once you have the access and secret keys, you need to configure them in your application code. 

2. .env File

    Save the MinIO access key and secret key obtained in the previous step in your `.env` file.

    Example:
    ```
    ACCESS_KEY_MINIO = 'YOUR ACCESS_KEY_MINIO'
    SECRET_KEY_MINIO = 'YOUR SECRET_KEY_MINIO'
    ```

In [None]:
load_dotenv(".env", override=True)

In [None]:
ACCESS_KEY_MINIO = os.getenv("ACCESS_KEY_MINIO")
SECRET_KEY_MINIO = os.getenv("SECRET_KEY_MINIO")
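
`os.getenv` returns `None` when a variable is missing, so a typo in `.env` would only surface later as an authentication error. As a minimal sketch, a small helper can fail early instead. The name `require_env` is hypothetical and not part of the course code.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; check your .env file")
    return value
```

It could be used in place of the plain lookups above, for example `ACCESS_KEY_MINIO = require_env("ACCESS_KEY_MINIO")`.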

3. Library

In [None]:
!pip install minio

In [None]:
# The minio library is used to interact with a MinIO server.
from minio import Minio

# BytesIO provides a way to work with binary data in memory as if it were a file.
from io import BytesIO
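
To make the role of `BytesIO` concrete, the sketch below wraps CSV bytes in an in-memory buffer and reads from it like a file. The CSV text is made-up sample data that stands in for the string `DataFrame.to_csv()` would return.

```python
from io import BytesIO

# Made-up sample standing in for the string returned by DataFrame.to_csv()
csv_text = "category_id,name\n1,Books\n2,Toys\n"

csv_bytes = csv_text.encode("utf-8")   # text -> bytes
csv_buffer = BytesIO(csv_bytes)        # bytes -> in-memory file object

first_line = csv_buffer.readline()     # reads like a file, no disk involved
csv_buffer.seek(0)                     # rewind so the next reader starts at the beginning
```

Rewinding with `seek(0)` matters whenever the buffer has already been read, because an uploader would otherwise start from the current position and send fewer bytes than expected.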

4. Create the handle_error function to dump failed data to MinIO

    The function converts your DataFrame to CSV in memory using BytesIO, then uploads it to a bucket.

In [None]:
def handle_error(data, bucket_name:str, table_name:str):

    current_date = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    # Initialize MinIO client
    client = Minio('localhost:9000',
                access_key=ACCESS_KEY_MINIO,
                secret_key=SECRET_KEY_MINIO,
                secure=False)

    # Make a bucket if it doesn't exist
    if not client.bucket_exists(bucket_name):
        client.make_bucket(bucket_name)

    # Convert DataFrame to CSV and then to bytes
    csv_bytes = data.to_csv().encode('utf-8')
    csv_buffer = BytesIO(csv_bytes)

    # Upload the CSV file to the bucket
    client.put_object(
        bucket_name=bucket_name,
        object_name=f"{table_name}_{current_date}.csv",  # failed table name plus the current ETL timestamp
        data=csv_buffer,
        length=len(csv_bytes),
        content_type='text/csv'  # registered MIME type for CSV
    )

    # List objects in the bucket
    objects = client.list_objects(bucket_name, recursive=True)
    for obj in objects:
        print(obj.object_name)
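
The object name built in `handle_error` contains a space and colons because of the `%Y-%m-%d %H:%M:%S` timestamp format. That works as an object name, but it can be awkward in URLs and when downloading to file systems that do not allow colons in file names. A possible alternative naming scheme is sketched below; `build_object_name` is a hypothetical helper, not part of the course code.

```python
from datetime import datetime


def build_object_name(table_name: str, when: datetime) -> str:
    """Build an object name such as 'orders/20240131T093000.csv'."""
    # Compact timestamp without spaces or colons
    stamp = when.strftime("%Y%m%dT%H%M%S")
    # The '/' makes object stores display the table name as a folder-like prefix
    return f"{table_name}/{stamp}.csv"
```

Grouping by table name first also makes it easier to list all failed loads for a single table with a prefix filter.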

## Staging Area

### Create the Staging Database

Create a staging database with the following tables:
- customer
- category
- orders

In the staging area, we will store raw data without performing any transformations.

``` sql
-- Create the category table
CREATE TABLE category (
    category_id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    description TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Create the customer table
CREATE TABLE customer (
    customer_id SERIAL PRIMARY KEY,
    first_name VARCHAR(255) NOT NULL,
    last_name VARCHAR(255) NOT NULL,
    email VARCHAR(255) NOT NULL,
    phone VARCHAR(100),
    address TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Create the order_detail table (the referenced orders table must exist first)
CREATE TABLE order_detail (
    order_detail_id SERIAL PRIMARY KEY,
    order_id int REFERENCES orders(order_id),
    product_id varchar(255) NOT NULL,
    price NUMERIC(10, 2) NOT NULL,
    quantity INT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(order_id, product_id, quantity)
);

```

### Load Process

We will apply an upsert with the pangres library, using each table's primary key as the DataFrame index.

In [None]:
from pangres import upsert
def load_staging(data, schema:str, table_name: str, idx_name:str, source):
    try:
        # create connection to database
        conn = create_engine("postgresql://postgres:aku@localhost/staging")
        
        # set data index or primary key
        data = data.set_index(idx_name)
        
        # Do upsert (Update for existing data and Insert for new data)
        upsert(con = conn,
                df = data,
                table_name = table_name,
                schema = schema,
                if_row_exists = "update")
        
        #create success log message
        log_msg = {
                "step" : "load staging",
                "status": "success",
                "source": source,
                "table_name": table_name,
                "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S")  # Current timestamp
            }
        return data
    except Exception as e:

        #create fail log message
        log_msg = {
            "step" : "load staging",
            "status": "failed",
            "source": source,
            "table_name": table_name,
            "etl_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S")  # Current timestamp
        }

        # Handling error: save data to Object Storage
        try:
            print(e)
            handle_error(data = data, bucket_name='error', table_name= table_name)
        except Exception as e:
            print(e)
    finally:
        log_to_csv(log_msg, 'log.csv')
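
To illustrate what "update existing rows, insert new ones" means, here is a minimal sketch of upsert semantics. It does not use pangres or PostgreSQL: as an assumption for the sake of a runnable example, it uses an in-memory SQLite database (SQLite 3.24 or newer) and made-up category names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE category (category_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

upsert_sql = """
    INSERT INTO category (category_id, name)
    VALUES (?, ?)
    ON CONFLICT (category_id) DO UPDATE SET name = excluded.name
"""

# First load: both rows are new, so both are inserted
conn.executemany(upsert_sql, [(1, "Books"), (2, "Toys")])
# Second load: id 2 already exists (updated), id 3 is new (inserted)
conn.executemany(upsert_sql, [(2, "Games"), (3, "Music")])
conn.commit()

rows = conn.execute("SELECT category_id, name FROM category ORDER BY category_id").fetchall()
print(rows)  # [(1, 'Books'), (2, 'Games'), (3, 'Music')]
```

Because the second load does not duplicate id 2, re-running the same load is safe, which is the property we want from a staging load that may be retried.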

    

## Extract and Load Data

Using the extraction functions from the previous section, we will load the extracted data into the staging area and save any failed data to object storage.

In [None]:
# Import the extraction functions from the previous section, now packaged as modules
from src.extract.extract_spreadsheet import extract_spreadsheet
from src.extract.extract_database import extract_database
from src.extract.extract_api import extract_api

In [None]:
# Extract Data Category from Spreadsheet
KEY_CATEGORY  = os.getenv("KEY_CATEGORY")

data_category = extract_spreadsheet(worksheet_name = 'category',
                                    key_file = KEY_CATEGORY)

In [None]:
data_category

In [None]:
# Load to Staging
column_category=['category_id','name','description']
load_staging(data=data_category[column_category], schema= "public", 
              table_name = "category", idx_name = "category_id", 
              source = "Spreadsheet")
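
Selecting columns with `data_category[column_category]` raises a `KeyError` if the source is missing any of them, and that happens before `load_staging` is called, so it is not logged. As a minimal sketch, a pre-check can report the missing names first. `missing_columns` is a hypothetical helper, not part of the course code.

```python
def missing_columns(available, required):
    """Return the required column names that are absent from `available`, in order."""
    available_set = set(available)
    return [col for col in required if col not in available_set]
```

For example, `missing_columns(data_category.columns, column_category)` would return an empty list when every staging column is present in the extracted data.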

In [None]:
# Extract Data Customer from Database
df_customer = extract_database(table_name="customer")

In [None]:
df_customer

In [None]:
# Load to Staging
column_customer=['customer_id','first_name','last_name','email','phone','address']
load_staging(data=df_customer[column_customer], schema= "public", 
              table_name = "customer", idx_name = "customer_id", 
              source = "database")

In [87]:
current_date = datetime.now().strftime("%Y-%m-%d")
link_api = "https://api-order-teal.vercel.app/api/dummydata"
list_parameter = {
    "start_date": "2020-01-01",
    "end_date": "2025-01-01"
}

df_order = extract_api(link_api, list_parameter, "orders")

In [88]:
df_order

| | created_at | customer_id | order_date | order_id | price | product_id | quantity | status | updated_at |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-01-30 00:00:00.000 | 697 | 2022-01-30 00:00:00.000 | IINI91PP812 | 1599.0 | B08ZN4B121 | 7 | Success | 2022-01-30 00:00:00.000 |
| 1 | 2022-01-30 00:00:00.000 | 697 | 2022-01-30 00:00:00.000 | IINI91PP812 | 999.0 | B0B94JPY2N | 13 | Success | 2022-01-30 00:00:00.000 |
| 2 | 2022-01-30 00:00:00.000 | 697 | 2022-01-30 00:00:00.000 | IINI91PP812 | 299.0 | B07MP21WJD | 9 | Success | 2022-01-30 00:00:00.000 |
| 3 | 2022-01-30 00:00:00.000 | 697 | 2022-01-30 00:00:00.000 | IINI91PP812 | 999.0 | B08G43CCLC | 9 | Success | 2022-01-30 00:00:00.000 |
| 4 | 2021-01-03 00:00:00.000 | 172 | 2021-01-03 00:00:00.000 | ONNA03MN757 | 3999.0 | B0B217Z5VK | 5 | Success | 2021-01-03 00:00:00.000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3624 | 2021-04-24 00:00:00.000 | 639 | 2021-04-24 00:00:00.000 | AANA44AN436 | 1230.0 | B07NKNBTT3 | 1 | Success | 2021-04-24 00:00:00.000 |
| 3625 | 2022-10-10 00:00:00.000 | 529 | 2022-10-10 00:00:00.000 | IAAC58MO380 | 1499.0 | B0083T231O | 15 | Success | 2022-10-10 00:00:00.000 |
| 3626 | 2022-10-10 00:00:00.000 | 529 | 2022-10-10 00:00:00.000 | IAAC58MO380 | 1440.0 | B07VZYMQNZ | 4 | Success | 2022-10-10 00:00:00.000 |
| 3627 | 2022-10-10 00:00:00.000 | 529 | 2022-10-10 00:00:00.000 | IAAC58MO380 | 670.0 | B09PTT8DZF | 10 | Success | 2022-10-10 00:00:00.000 |


In [89]:
# Load to Staging
# This load will fail because the data type of order_id differs from the one in the staging table,
# so the failed data will be saved to object storage
column_order=['order_id','customer_id','order_date','product_id','quantity','price','status']
load_staging(data=df_order[column_order], schema= "public", 
              table_name = "orders", idx_name = ["order_id","product_id","quantity"], 
              source = "api")

| order_id | product_id | quantity | customer_id | order_date | price | status |
|---|---|---|---|---|---|---|
| IINI91PP812 | B08ZN4B121 | 7 | 697 | 2022-01-30 00:00:00.000 | 1599.0 | Success |
| IINI91PP812 | B0B94JPY2N | 13 | 697 | 2022-01-30 00:00:00.000 | 999.0 | Success |
| IINI91PP812 | B07MP21WJD | 9 | 697 | 2022-01-30 00:00:00.000 | 299.0 | Success |
| IINI91PP812 | B08G43CCLC | 9 | 697 | 2022-01-30 00:00:00.000 | 999.0 | Success |
| ONNA03MN757 | B0B217Z5VK | 5 | 172 | 2021-01-03 00:00:00.000 | 3999.0 | Success |
| ... | ... | ... | ... | ... | ... | ... |
| AANA44AN436 | B07NKNBTT3 | 1 | 639 | 2021-04-24 00:00:00.000 | 1230.0 | Success |
| IAAC58MO380 | B0083T231O | 15 | 529 | 2022-10-10 00:00:00.000 | 1499.0 | Success |
| IAAC58MO380 | B07VZYMQNZ | 4 | 529 | 2022-10-10 00:00:00.000 | 1440.0 | Success |
| IAAC58MO380 | B09PTT8DZF | 10 | 529 | 2022-10-10 00:00:00.000 | 670.0 | Success |
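
The failure described above comes from a type mismatch: the API returns alphanumeric order IDs such as `IINI91PP812`, while the staging column expects integers. As a minimal sketch, such a mismatch could be detected before loading by checking which values cannot be parsed as integers. `non_integer_values` is a hypothetical helper, not part of the course code.

```python
def non_integer_values(values):
    """Return the values that cannot be parsed as integers."""
    bad = []
    for value in values:
        try:
            int(str(value))
        except ValueError:
            bad.append(value)
    return bad
```

Calling it on the `order_id` column, for example `non_integer_values(df_order["order_id"])`, would list the offending IDs, which can help decide whether to fix the staging schema or the source data.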


The `error` bucket containing the failed data:

<img src='pict/load4.png' width="800"> <br>
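
Once the failed data is in the bucket, a later stage will need to read it back. The sketch below shows one way this might look, assuming `client` is a `minio.Minio` instance configured as in `handle_error`. `read_failed_csv` is a hypothetical helper, and it returns every value as a string because it uses the standard `csv` module rather than pandas.

```python
import csv
from io import StringIO


def read_failed_csv(client, bucket_name: str, object_name: str) -> list:
    """Download a CSV object and return its rows as a list of dicts."""
    response = client.get_object(bucket_name, object_name)
    try:
        text = response.read().decode("utf-8")
    finally:
        # Release the underlying HTTP connection back to the pool
        response.close()
        response.release_conn()
    return list(csv.DictReader(StringIO(text)))
```

Closing the response in a `finally` block ensures the connection is released even when decoding fails.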

Git repository: https://github.com/Kurikulum-Sekolah-Pacmann/ingestion_data_pipeline.git