## ETL Product Data Pipeline
In this notebook, we will build a robust ETL pipeline to process product data from an API. The pipeline will:
* Extract paginated product data from a REST API endpoint (the base URL and response key are configurable).
* Transform the data to standardize column names, handle nested or missing values, and calculate additional metrics like discounted price.
* Load the cleaned data into a PostgreSQL database using a staging-to-main table pattern.
* Log all steps and errors for traceability and debugging.

### 1. Importing Required Libraries


In [6]:
import logging
import json
import pandas as pd
import requests
from sqlalchemy import create_engine, text

### 2. Logging

In [7]:
logging.basicConfig(
    filename="etl_product_pipeline.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)
logging.info("ETL pipeline started")

### 3. Defining Functions


### 3.1 Extract Function
* Fetches paginated data from REST API endpoints with configurable page limits.
* Extracts specified key from JSON response; falls back to full response if key missing.
* Stops early when total record count is reached to avoid unnecessary requests.
* Skips empty pages and continues fetching remaining data.
* Logs page progress, errors, and completion status for monitoring.
* Returns combined data from all pages as a pandas DataFrame.

In [8]:
def extract(base_url, key="products", limit=30, max_pages=20):
    all_data = []
    
    with requests.Session() as session:
        for page in range(max_pages):
            url = f"{base_url}?limit={limit}&skip={page * limit}"
            try:
                response = session.get(url, timeout=10)
                response.raise_for_status()
                data = response.json()
                
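                # Use data[key] when present; otherwise fall back to the full response
                page_data = data.get(key, data) if isinstance(data, dict) else data
                if not page_data:
                    logging.debug(f"Empty page {page}, continuing")
                    continue
                if isinstance(page_data, dict):
                    # Wrap a single record so extend() adds one row, not the dict's keys
                    page_data = [page_data]
                
                all_data.extend(page_data)
                logging.info(f"Page {page}: {len(page_data)} records")
                
                # Stop if we've fetched all records
                total = data.get("total") if isinstance(data, dict) else None
                if total and len(all_data) >= total: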
                    logging.info(f"All {total} records fetched")
                    break
                    
            except requests.RequestException as e:
                logging.error(f"Error on page {page}: {e}")
                break

    df = pd.DataFrame(all_data)
    logging.info(f"Extract completed: {len(df)} records fetched")
    return df
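
The pagination and early-stop logic can be checked without touching the network. The following is a minimal sketch, not the notebook's `extract` itself: `fetch_all`, `fake_fetch`, and the 70-record sample are hypothetical names and data introduced only for illustration, and the response shape (`products` plus `total`) is assumed to match the one used above.

```python
def fetch_all(fetch_page, limit=30, max_pages=20):
    """Collect records page by page until `total` is reached or pages run out."""
    records = []
    for page in range(max_pages):
        skip = page * limit  # offset of the first record on this page
        data = fetch_page(limit, skip)
        records.extend(data.get("products", []))
        total = data.get("total")
        if total is not None and len(records) >= total:
            break  # everything has been fetched; skip the remaining pages
    return records


# Hypothetical stand-in for the API: 70 fake products served in slices.
FAKE_PRODUCTS = [{"id": i} for i in range(1, 71)]
calls = []  # records every `skip` value that was requested


def fake_fetch(limit, skip):
    calls.append(skip)
    return {"products": FAKE_PRODUCTS[skip:skip + limit], "total": len(FAKE_PRODUCTS)}


rows = fetch_all(fake_fetch)
print(len(rows), calls)  # prints: 70 [0, 30, 60]
```

With `limit=30`, only three requests are made even though `max_pages` allows twenty, because the loop stops once the collected count reaches `total`.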


### 3.2 Transform Function
* Explodes nested reviews into separate rows for granular analysis.
* Extracts review ratings from nested dictionaries and removes original review column.
* Drops rows with missing critical fields (id, title, price, review_rating).
* Validates and filters data: keeps only rows with a positive price and a discount percentage between 0 and 100.
* Calculates discounted price based on original price and discount percentage.
* Casts numeric columns to explicit types (float for prices, discounts, and ratings; int for review ratings).
* Selects and reorders relevant columns, normalizes names to lowercase.
* Returns clean, analysis-ready DataFrame with consistent structure.

In [9]:
def transform(df):
   
    df = df.copy()

    # One row per review; keep only the numeric rating from each review dict
    df = df.explode("reviews", ignore_index=True)
    df["review_rating"] = df["reviews"].apply(lambda x: x.get("rating") if isinstance(x, dict) else None)
    df = df.drop(columns=["reviews"])

    # Drop rows missing any critical field
    df = df.dropna(subset=["id", "title", "price", "review_rating"])

    # Keep positive prices and discount percentages within 0-100
    df = df[df["price"] > 0]
    df = df[(df["discountPercentage"] >= 0) & (df["discountPercentage"] <= 100)]

    df["price_with_discount"] = (df["price"] * (1 - df["discountPercentage"] / 100)).round(2)

    df["price"] = df["price"].astype(float)
    df["discountPercentage"] = df["discountPercentage"].astype(float)
    df["rating"] = df["rating"].astype(float)
    df["review_rating"] = df["review_rating"].astype(int)
    df["price_with_discount"] = df["price_with_discount"].astype(float)

    df = df[["id", "title", "category", "price", "discountPercentage",
          "rating", "brand", "review_rating", "price_with_discount"]]

    df.columns = df.columns.str.lower()

    df = df.reset_index(drop=True)
    
    return df
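
To see what the explode step and the discount calculation produce, here is a small sketch on a toy DataFrame. The two sample products are invented for illustration and only mimic the columns the function relies on; this is not real API output.

```python
import pandas as pd

# Hypothetical two-product sample shaped like the records used above.
raw = pd.DataFrame({
    "id": [1, 2],
    "price": [10.0, 20.0],
    "discountPercentage": [10.0, 25.0],
    "reviews": [[{"rating": 5}, {"rating": 3}], [{"rating": 4}]],
})

# One row per review; scalar columns are repeated for each exploded row.
flat = raw.explode("reviews", ignore_index=True)
flat["review_rating"] = flat["reviews"].apply(
    lambda r: r["rating"] if isinstance(r, dict) else None
)
flat = flat.drop(columns=["reviews"])

# price_with_discount = price * (1 - discountPercentage / 100), rounded to cents
flat["price_with_discount"] = (
    flat["price"] * (1 - flat["discountPercentage"] / 100)
).round(2)

print(flat)
```

Product 1 has two reviews, so it appears twice in the result (ratings 5 and 3, discounted price 9.0); product 2 appears once (rating 4, discounted price 15.0). Keep this duplication in mind when aggregating prices after the transform.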


### 3.3 Load Function
* Creates a staging table and loads DataFrame into it temporarily.
* Inserts all records from staging into the main table with timestamps.
* Populates `created_at` and `updated_at` with `CURRENT_TIMESTAMP` during the insert; the main table must already exist with these columns.
* Uses SQL transactions to ensure data integrity (all-or-nothing).
* Optionally drops the staging table after successful load (default: True).
* Logs each step: staging load, data insert, and staging table cleanup.

In [10]:
def load(df, sql_connection, table_name, drop_staging=True):
    schema = 'etl_schema'  # target schema; the main table must already exist here
    staging_table = f"{table_name}_staging"
    
    with sql_connection.begin() as conn:
        # Load to staging
        df.to_sql(staging_table, conn, if_exists='replace', index=False, schema=schema)
        logging.info(f"Loaded {len(df)} records to {schema}.{staging_table}")
        
        # Insert all records from staging
        merge_query = text(f"""
            INSERT INTO {schema}.{table_name} (
                id, title, category, price, discountPercentage,
                rating, brand, review_rating, price_with_discount,
                created_at, updated_at
            )
            SELECT id, title, category, price, discountPercentage, rating, brand,
                review_rating, price_with_discount,
                CURRENT_TIMESTAMP, CURRENT_TIMESTAMP
            FROM {schema}.{staging_table};
        """)
        conn.execute(merge_query)
        logging.info(f"Inserted data to {schema}.{table_name}")
        
        # Drop staging
        if drop_staging:
            conn.execute(text(f"DROP TABLE IF EXISTS {schema}.{staging_table}"))
            logging.info(f"Dropped staging table {schema}.{staging_table}")

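
The staging-to-main pattern can be tried out without a PostgreSQL server. The sketch below swaps in Python's built-in `sqlite3` with an in-memory database purely for illustration; the reduced `products` table and the two sample rows are assumptions, not the notebook's actual schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The main table must exist before the load, including the timestamp columns.
conn.execute("""
    CREATE TABLE products (
        id INTEGER, title TEXT, price REAL,
        created_at TEXT, updated_at TEXT
    )
""")

# Hypothetical cleaned rows, standing in for the transformed DataFrame.
rows = [(1, "Sample A", 9.99), (2, "Sample B", 12.99)]

# `with conn` commits on success and rolls back the open transaction on error.
with conn:
    conn.execute("DROP TABLE IF EXISTS products_staging")
    conn.execute("CREATE TABLE products_staging (id INTEGER, title TEXT, price REAL)")
    conn.executemany("INSERT INTO products_staging VALUES (?, ?, ?)", rows)
    # Copy staging into main, stamping both timestamp columns at insert time.
    conn.execute("""
        INSERT INTO products (id, title, price, created_at, updated_at)
        SELECT id, title, price, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP
        FROM products_staging
    """)
    conn.execute("DROP TABLE products_staging")

loaded = conn.execute(
    "SELECT id, title, created_at FROM products ORDER BY id"
).fetchall()
print(len(loaded))  # prints: 2
```

Because the insert is a plain `INSERT ... SELECT`, running the load twice appends the same rows twice; deduplication or an upsert would need a key constraint on the main table.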

### 4. Usage 

In [11]:
# PostgreSQL Connection
engine = create_engine(
    'postgresql://etl_user:sifre@localhost:5432/etl_pipeline'  # A .env file could have been used for the credentials.
)

try:
    # Extract
    df = extract("https://dummyjson.com/products", limit=30)
    logging.info(f"Extract completed: {len(df)} raw records")
    
    # Transform
    df_t = transform(df)
    logging.info(f"Transform completed: {len(df_t)} clean records")
    
    # Load
    load(df_t, engine, table_name="products", drop_staging=True)
    logging.info("ETL Pipeline completed successfully!")
    
except Exception as e:
    logging.critical(f"Pipeline failed: {e}")
    raise
finally:
    engine.dispose()
    logging.info("Database connection closed")
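
One way to keep the password out of the notebook is to assemble the connection URL from environment variables. This is a sketch under assumptions: `build_db_url` and the `ETL_DB_*` variable names are hypothetical, and the defaults simply mirror the values in the connection string above.

```python
import os
from urllib.parse import quote


def build_db_url(env=os.environ):
    """Build a PostgreSQL URL from environment-style settings (names are hypothetical)."""
    user = env.get("ETL_DB_USER", "etl_user")
    # Required setting: raises KeyError if missing. Percent-encode characters
    # such as "@" or "/" that would otherwise break the URL.
    password = quote(env["ETL_DB_PASSWORD"], safe="")
    host = env.get("ETL_DB_HOST", "localhost")
    port = env.get("ETL_DB_PORT", "5432")
    name = env.get("ETL_DB_NAME", "etl_pipeline")
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"


# A plain dict stands in for os.environ so the example runs anywhere.
url = build_db_url({"ETL_DB_PASSWORD": "p@ss/word"})
print(url)  # prints: postgresql://etl_user:p%40ss%2Fword@localhost:5432/etl_pipeline
```

The resulting string can be passed to `create_engine` in place of the hard-coded URL.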