# 📦 Order Fact Table - Full Load Processing

## 🎯 Purpose
Transform raw order data through the Medallion Architecture (Bronze → Silver → Gold), aggregate it to monthly grain, and merge the result into the parent company fact table.

## 📊 Pipeline Overview

| Layer | Purpose | Key Operations |
|-------|---------|-----------------|
| **Bronze** | Raw order ingestion | Load CSV, add metadata, append |
| **Silver** | Clean & deduplicate | Parse dates, validate IDs, remove duplicates, join products |
| **Gold** | Monthly aggregation | Group by month, sum quantities, merge to parent |

## 🔑 Input & Output Tables

**Inputs:**
- `orders/landing/*.csv` - Raw order files from ADLS

**Outputs:**
- `fmcg.bronze.orders` - Raw orders with metadata
- `fmcg.silver.orders` - Cleaned orders with product joins
- `fmcg.gold.sb_fact_orders` - SportsBar order facts at daily grain (aggregated to monthly grain only when merged into the parent table)
- `fmcg.gold.fact_orders` - Parent company fact table (merged)

## 📋 Key Transformations
✅ **Multi-format Date Parsing:** Tries four date formats in order and keeps the first one that parses  
✅ **Customer ID Validation:** Numeric check; invalid IDs default to 999999  
✅ **Deduplication:** Remove duplicate order records  
✅ **Product Joining:** Integrate product master data  
✅ **Monthly Aggregation:** Transform daily → monthly grain  
✅ **File Management:** Archive processed files after success

---

In [0]:
# ============================================================================
# IMPORT REQUIRED LIBRARIES
# ============================================================================
# PySpark functions for data transformation and Delta Lake merge operations

from pyspark.sql import functions as F
from delta.tables import DeltaTable

# ============================================================================
# LIBRARY USAGE:
# - F: String processing, date parsing, aggregation functions
# - DeltaTable: MERGE operations for incremental/full loads
# ============================================================================

In [0]:
# ============================================================================
# IMPORT SHARED CONFIGURATION
# ============================================================================
# Load schema names and configuration from setup utilities
# This ensures consistency across all notebooks

# %run /Workspace/Project1/1_setup_catalog/utilities

In [0]:
# ============================================================================
# VERIFICATION: Print imported schema names
# ============================================================================
# Confirms utilities were successfully imported
# Expected output: bronze silver gold

print(bronze_schema, silver_schema, gold_schema)

bronze silver gold


In [0]:
# ============================================================================
# CONFIGURE DATABRICKS WIDGETS & PATHS
# ============================================================================
# Define parameters and storage paths for order processing

dbutils.widgets.text("catalog", "fmcg", "Catalog")
dbutils.widgets.text("data_source", "orders", "Data Source")

catalog = dbutils.widgets.get("catalog")
data_source = dbutils.widgets.get("data_source")

# ============================================================================
# DEFINE ADLS GEN2 PATHS
# ============================================================================
# Path structure for orders:
# - landing: Incoming CSV files waiting to be processed
# - processed: Files that have been successfully loaded (archive)

base_path = f"abfss://conatiner-de-practice@adlsgen2narayan.dfs.core.windows.net/{data_source}"
landing_path = f"{base_path}/landing/"
processed_path = f"{base_path}/processed/"

print("Base Path: ", base_path)
print("Landing Path: ", landing_path)
print("Processed Path: ", processed_path)

# ============================================================================
# DEFINE TABLE NAMES
# ============================================================================
# Fully qualified table names (catalog.schema.table)

bronze_table = f"{catalog}.{bronze_schema}.{data_source}"
silver_table = f"{catalog}.{silver_schema}.{data_source}"
gold_table = f"{catalog}.{gold_schema}.sb_fact_{data_source}"

print("Bronze Table: ", bronze_table)
print("Silver Table: ", silver_table)
print("Gold Table: ", gold_table)

Base Path:  abfss://conatiner-de-practice@adlsgen2narayan.dfs.core.windows.net/orders
Landing Path:  abfss://conatiner-de-practice@adlsgen2narayan.dfs.core.windows.net/orders/landing/
Processed Path:  abfss://conatiner-de-practice@adlsgen2narayan.dfs.core.windows.net/orders/processed/


## 🟠 BRONZE LAYER - Raw Order Ingestion

**Purpose:** Load raw order data from ADLS with minimal transformation  
**Update Pattern:** Append mode (add new records)  
**Key Characteristics:** Full lineage with metadata, Change Data Feed enabled

### Process:
1. Read CSV files from landing directory
2. Add metadata (timestamp, file name, file size)
3. Append to Bronze table (preserves historical data)

In [0]:
# ============================================================================
# BRONZE LAYER: READ RAW ORDER DATA
# ============================================================================
# Load CSV files from landing directory with metadata tracking

# landing_path already ends with "/", so the glob is appended directly
df = spark.read.options(header=True, inferSchema=True).csv(f"{landing_path}*.csv")\
    .withColumn("read_timestamp", F.current_timestamp())\
    .select("*", "_metadata.file_name", "_metadata.file_size")

print("Total Rows: ", df.count())
df.show(5)

Total Rows:  51810
+------------+--------------------+-----------+----------+---------+--------------------+--------------------+---------+
|    order_id|order_placement_date|customer_id|product_id|order_qty|      read_timestamp|           file_name|file_size|
+------------+--------------------+-----------+----------+---------+--------------------+--------------------+---------+
|FOCT62720602|Tuesday, Septembe...|     ABC987|  25891301|     71.0|2025-11-30 15:21:...|orders_2025_09_30...|    41446|
|FOCT62720602|Tuesday, Septembe...|     789720|  25891502|    125.0|2025-11-30 15:21:...|orders_2025_09_30...|    41446|
|FOCT62720602|Tuesday, Septembe...|     789720|  25891403|    462.0|2025-11-30 15:21:...|orders_2025_09_30...|    41446|
|FOCT62720602|Tuesday, Septembe...|    INVALID|  25891601|    133.0|2025-11-30 15:21:...|orders_2025_09_30...|    41446|
|FOCT62720602|Tuesday, Septembe...|     789720|  25891602|     79.0|2025-11-30 15:21:...|orders_2025_09_30...|    41446|
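+------------+--------------------+-----------+----------+---------+--------------------+--------------------+---------+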

In [0]:
# ============================================================================
# BRONZE LAYER: APPEND TO DELTA TABLE
# ============================================================================
# Persist raw orders to Bronze layer
# Mode: APPEND (preserve historical data across runs)

df.write\
 .format("delta") \
 .option("delta.enableChangeDataFeed", "true") \
 .mode("append") \
 .saveAsTable(bronze_table)

## 📁 File Management - Archive Processed Files

**Purpose:** Prevent reprocessing of the same files in future runs  
**Method:** Move files from landing → processed directory  
**Benefit:** Audit trail and data lineage tracking
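
The archival cell in this notebook moves every file found in landing, which could include files that arrived after the Bronze read. A more conservative variant moves only the files that were actually loaded. The sketch below illustrates that idea on a local filesystem with `pathlib` and `shutil` instead of `dbutils.fs`; the function name `archive_loaded_files` and its parameters are hypothetical, and in the notebook the list of loaded names would presumably come from the distinct `file_name` values of the Bronze DataFrame.

```python
import shutil
from pathlib import Path


def archive_loaded_files(landing_dir, processed_dir, loaded_names):
    """Move only the named files from landing_dir to processed_dir.

    Files that are not in loaded_names (for example, ones that arrived
    after the read) stay in landing for the next run.
    Returns the sorted list of file names that were moved.
    """
    landing = Path(landing_dir)
    processed = Path(processed_dir)
    processed.mkdir(parents=True, exist_ok=True)

    moved = []
    for name in sorted(set(loaded_names)):
        source = landing / name
        if source.is_file():  # skip names that are no longer present
            shutil.move(str(source), str(processed / name))
            moved.append(name)
    return moved
```

Returning the moved names makes it easy to log exactly which files were archived in a given run.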

In [0]:
# ============================================================================
# FILE ARCHIVAL: Move processed files to archive directory
# ============================================================================
# After successful processing, move files from landing to processed directory
# This prevents accidental reprocessing and maintains data lineage
#
# Logic:
# 1. List all files in landing directory
# 2. For each file, move to processed directory
# 3. Third parameter (True) is `recurse`: it allows directories and their contents to be moved
# ============================================================================

# processed_path already ends with "/", so the file name is appended directly
files = dbutils.fs.ls(landing_path)
for file_info in files:
    dbutils.fs.mv(
        file_info.path,
        f"{processed_path}{file_info.name}",
        True
    )

## 🟡 SILVER LAYER - Data Cleaning & Standardization

**Purpose:** Create clean, validated orders ready for analysis  
**Update Pattern:** MERGE (upsert)  
**Key Operations:**
- ✅ Filter non-null order quantities
- ✅ Validate and sanitize customer IDs
- ✅ Parse dates in multiple formats
- ✅ Remove duplicate orders
- ✅ Join with product master data

### Data Quality Steps:
1. Load Bronze orders
2. Validate order quantities (non-null)
3. Clean customer IDs (numeric check)
4. Parse dates (4 formats supported)
5. Remove duplicates
6. Join with products
7. Merge to Silver table
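
The quantity filter, customer-ID rule, and duplicate removal from the steps above can be checked outside Spark. The following is a minimal plain-Python sketch of those three rules, useful mainly for unit-testing the logic; the function name `clean_orders` and the list-of-dicts input shape are assumptions for illustration, not part of the notebook. Dates are left untouched here.

```python
import re

INVALID_CUSTOMER_ID = "999999"  # sentinel the notebook assigns to bad IDs
_NUMERIC = re.compile(r"[0-9]+")


def clean_orders(rows):
    """Apply the quantity filter, customer-ID rule, and duplicate removal.

    rows: iterable of dicts with keys order_id, order_placement_date,
    customer_id, product_id, order_qty.
    """
    key_fields = ("order_id", "order_placement_date", "customer_id",
                  "product_id", "order_qty")
    seen = set()
    cleaned = []
    for row in rows:
        if row["order_qty"] is None:  # step 2: drop null quantities
            continue
        customer_id = str(row["customer_id"])
        if not _NUMERIC.fullmatch(customer_id):  # step 3: numeric check
            customer_id = INVALID_CUSTOMER_ID
        fixed = {**row, "customer_id": customer_id}
        key = tuple(fixed[f] for f in key_fields)
        if key in seen:  # step 5: drop rows identical on all key fields
            continue
        seen.add(key)
        cleaned.append(fixed)
    return cleaned
```

As in the notebook, deduplication runs after the customer ID is sanitized, so two rows that differ only in their invalid ID collapse into one `999999` row.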

In [0]:
# ============================================================================
# SILVER LAYER: LOAD DATA FROM BRONZE
# ============================================================================
# Read raw orders from Bronze layer for transformation

df_orders = spark.sql(f"SELECT * FROM {bronze_table}")
df_orders.show(2)

+------------+--------------------+-----------+----------+---------+--------------------+--------------------+---------+
|    order_id|order_placement_date|customer_id|product_id|order_qty|      read_timestamp|           file_name|file_size|
+------------+--------------------+-----------+----------+---------+--------------------+--------------------+---------+
|FJUL33320501|          2025/07/01|     789320|  25891203|    150.0|2025-11-30 15:23:...|orders_2025_07_01...|    20744|
|FJUL33320501|          2025/07/01|     789320|  25891301|     46.0|2025-11-30 15:23:...|orders_2025_07_01...|    20744|
+------------+--------------------+-----------+----------+---------+--------------------+--------------------+---------+
only showing top 2 rows


## 🔄 Silver Layer Transformations

Apply data quality and standardization rules:

In [0]:
# ============================================================================
# DATA QUALITY STEP 1: VALIDATE ORDER QUANTITIES
# ============================================================================
# Business Rule: Only process orders with non-null quantities
# Impact: Removes invalid/incomplete orders

df_orders = df_orders.filter(F.col("order_qty").isNotNull())


# ============================================================================
# DATA QUALITY STEP 2: CLEAN CUSTOMER ID
# ============================================================================
# Problem: Some customer IDs may be invalid/non-numeric
# Solution: Keep numeric IDs, set invalid ones to 999999 (error code)
# This enables tracking of problematic records while maintaining continuity

df_orders = df_orders.withColumn(
    "customer_id",
    F.when(F.col("customer_id").rlike("^[0-9]+$"), F.col("customer_id"))
     .otherwise("999999")
     .cast("string")
)


# ============================================================================
# DATA QUALITY STEP 3: PARSE DATE - REMOVE WEEKDAY PREFIX
# ============================================================================
# Problem: Some dates have weekday prefix (e.g., "Tuesday, July 01, 2025")
# Solution: Remove weekday and comma using regex
# Result: "July 01, 2025" ready for parsing

df_orders = df_orders.withColumn(
    "order_placement_date",
    F.regexp_replace(F.col("order_placement_date"), r"^[A-Za-z]+,\s*", "")
)


# ============================================================================
# DATA QUALITY STEP 4: PARSE DATE - MULTI-FORMAT SUPPORT
# ============================================================================
# Problem: Dates come in multiple formats from different sources
# Solution: Try multiple formats with coalesce fallback
# 
# Supported Formats:
# 1. yyyy/MM/dd (2025/01/15)
# 2. dd-MM-yyyy (15-01-2025)  
# 3. dd/MM/yyyy (15/01/2025)
# 4. MMMM dd, yyyy (January 15, 2025)
#
# coalesce returns first non-NULL result

df_orders = df_orders.withColumn(
    "order_placement_date",
    F.coalesce(
        F.try_to_date("order_placement_date", "yyyy/MM/dd"),
        F.try_to_date("order_placement_date", "dd-MM-yyyy"),
        F.try_to_date("order_placement_date", "dd/MM/yyyy"),
        F.try_to_date("order_placement_date", "MMMM dd, yyyy"),
    )
)


# ============================================================================
# DATA QUALITY STEP 5: REMOVE DUPLICATE ORDERS
# ============================================================================
# Business Rule: Each order should be unique by (order_id, date, customer, product, qty)
# Deduplication prevents double-counting in aggregations

df_orders = df_orders.dropDuplicates(
    ["order_id", "order_placement_date", "customer_id", "product_id", "order_qty"]
)


# ============================================================================
# DATA QUALITY STEP 6: CONVERT PRODUCT ID TO STRING
# ============================================================================
# Standardize product_id type for consistent joins with product master

df_orders = df_orders.withColumn('product_id', F.col('product_id').cast('string'))
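
The weekday-prefix removal and the coalesce-style date fallback in the cell above can be mirrored in plain Python for quick testing. This is a sketch under assumptions: the `strptime` patterns are hand-mapped equivalents of the four Spark patterns, `parse_order_date` is a hypothetical name, `%B` assumes an English locale, and `strptime` may accept some inputs (such as single-digit days) that Spark's stricter patterns reject.

```python
import re
from datetime import datetime

# strptime equivalents of the four Spark patterns, tried in the same order
DATE_FORMATS = ("%Y/%m/%d", "%d-%m-%Y", "%d/%m/%Y", "%B %d, %Y")
_WEEKDAY_PREFIX = re.compile(r"^[A-Za-z]+,\s*")


def parse_order_date(raw):
    """Return a date for the first matching format, or None if none match."""
    if raw is None:
        return None
    # "Tuesday, July 01, 2025" -> "July 01, 2025"
    text = _WEEKDAY_PREFIX.sub("", raw)
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # fall through to the next format, like coalesce
    return None
```

Values that match none of the formats come back as `None`, which corresponds to a NULL `order_placement_date` in the Spark version, so counting NULL dates after parsing is a cheap way to spot unsupported formats.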

In [0]:
# check what's the maximum and minimum date
df_orders.agg(
    F.min("order_placement_date").alias("min_date"),
    F.max("order_placement_date").alias("max_date")
).show()

+----------+----------+
|  min_date|  max_date|
+----------+----------+
|2025-07-01|2025-11-30|
+----------+----------+



### Join with products

In [0]:
df_products = spark.table("fmcg.silver.products")
df_joined = df_orders.join(df_products, on="product_id", how="inner").select(df_orders["*"], df_products["product_code"])

df_joined.show(5)

+-------------+--------------------+-----------+----------+---------+--------------------+--------------------+---------+--------------------+
|     order_id|order_placement_date|customer_id|product_id|order_qty|      read_timestamp|           file_name|file_size|        product_code|
+-------------+--------------------+-----------+----------+---------+--------------------+--------------------+---------+--------------------+
|FJUL312422401|          2025-07-10|     789422|  25891401|    406.0|2025-11-30 15:23:...|orders_2025_07_10...|    20202|da6bfc596c1360ca0...|
|FJUL316103602|          2025-07-14|     789103|  25891403|    235.0|2025-11-30 15:23:...|orders_2025_07_14...|    20354|77b6f538a9d0e0cf8...|
|FJUL316402601|          2025-07-14|     789402|  25891601|    167.0|2025-11-30 15:23:...|orders_2025_07_14...|    20354|716fa4e54b7894c91...|
|FJUL320720201|          2025-07-19|     789720|  25891103|    358.0|2025-11-30 15:23:...|orders_2025_07_19...|    21044|102628255d24304d6...|
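
Because the join above is an inner join, orders whose `product_id` has no match in the product master are dropped without any warning. The plain-Python sketch below shows the same semantics and also returns the rows that would be dropped; `join_with_products` and its dict-based inputs are hypothetical. In PySpark, a join with `how="left_anti"` is one way to list the same unmatched orders.

```python
def join_with_products(orders, product_codes):
    """Inner-join orders to a product_id -> product_code mapping.

    Returns (joined, unmatched): joined rows gain a product_code field,
    unmatched rows are the orders an inner join would drop.
    """
    joined, unmatched = [], []
    for order in orders:
        code = product_codes.get(order["product_id"])
        if code is None:
            unmatched.append(order)
        else:
            joined.append({**order, "product_code": code})
    return joined, unmatched
```

Comparing `len(unmatched)` against zero before writing to Silver helps explain any gap between Bronze and Silver row counts.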

In [0]:
# First run: create the Silver table. Later runs: upsert on the order key.
if not (spark.catalog.tableExists(silver_table)):
    df_joined.write.format("delta").option(
        "delta.enableChangeDataFeed", "true"
    ).option("mergeSchema", "true").mode("overwrite").saveAsTable(silver_table)
else:
    merge_condition = (
        "silver.order_placement_date = updates.order_placement_date AND silver.order_id = updates.order_id "
        "AND silver.product_code = updates.product_code AND silver.customer_id = updates.customer_id"
    )
    silver_delta = DeltaTable.forName(spark, silver_table)
    silver_delta.alias("silver").merge(df_joined.alias("updates"), merge_condition) \
        .whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

## GOLD

In [0]:
df_gold = spark.sql(f"SELECT order_id, order_placement_date AS date, customer_id AS customer_code, product_code, product_id, order_qty AS sold_quantity FROM {silver_table}")

df_gold.show(2)

+-------------+----------+-------------+--------------------+----------+-------------+
|     order_id|      date|customer_code|        product_code|product_id|sold_quantity|
+-------------+----------+-------------+--------------------+----------+-------------+
|FJUL312422401|2025-07-10|       789422|da6bfc596c1360ca0...|  25891401|        406.0|
|FJUL316103602|2025-07-14|       789103|77b6f538a9d0e0cf8...|  25891403|        235.0|
+-------------+----------+-------------+--------------------+----------+-------------+
only showing top 2 rows


In [0]:
# First run: create the Gold table. Later runs: upsert on the order key.
if not (spark.catalog.tableExists(gold_table)):
    print("Creating new table")
    df_gold.write.format("delta").option(
        "delta.enableChangeDataFeed", "true"
    ).option("mergeSchema", "true").mode("overwrite").saveAsTable(gold_table)
else:
    merge_condition = (
        "target.date = source.date AND target.order_id = source.order_id "
        "AND target.product_code = source.product_code AND target.customer_code = source.customer_code"
    )
    gold_delta = DeltaTable.forName(spark, gold_table)
    gold_delta.alias("target").merge(df_gold.alias("source"), merge_condition) \
        .whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

creating New Table


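## Merging with Parent Company
- Note: The parent fact table is at monthly grain, but the child (SportsBar) data is at daily grain, so the child data must be aggregated to monthly grain before the merge.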
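
To make the grain change concrete, here is a minimal plain-Python sketch of a daily-to-monthly rollup: each date is truncated to the first day of its month and quantities are summed per key. The function name `to_monthly` and the tuple-based input are assumptions for illustration; the notebook does the equivalent with `F.trunc` and `groupBy`.

```python
from collections import defaultdict
from datetime import date


def to_monthly(rows):
    """Sum sold_quantity per (month start, product_code, customer_code).

    rows: iterable of (date, product_code, customer_code, sold_quantity).
    Returns a dict keyed by (month_start, product_code, customer_code).
    """
    totals = defaultdict(float)
    for day, product_code, customer_code, qty in rows:
        month_start = day.replace(day=1)  # e.g. 2025-07-14 -> 2025-07-01
        totals[(month_start, product_code, customer_code)] += qty
    return dict(totals)


daily = [
    (date(2025, 7, 10), "p1", "c1", 406.0),
    (date(2025, 7, 14), "p1", "c1", 94.0),
    (date(2025, 8, 3), "p1", "c1", 59.0),
]
monthly = to_monthly(daily)
```

Each output key appears exactly once, which matters later because the merge into the parent table matches on the same three columns.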

## Full Load

In [0]:
df_child = spark.sql(f"SELECT date, product_code, customer_code, sold_quantity FROM {gold_table}")
df_child.show(10)

+----------+--------------------+-------------+-------------+
|      date|        product_code|customer_code|sold_quantity|
+----------+--------------------+-------------+-------------+
|2025-07-10|da6bfc596c1360ca0...|       789422|        406.0|
|2025-07-14|77b6f538a9d0e0cf8...|       789103|        235.0|
|2025-07-14|716fa4e54b7894c91...|       789402|        167.0|
|2025-07-19|102628255d24304d6...|       789720|        358.0|
|2025-07-19|451f7167b28a25bde...|       789403|        169.0|
|2025-07-27|e91ba9d665f90254d...|       789403|        367.0|
|2025-08-03|d9ebd1ca64d23951a...|       789903|         59.0|
|2025-08-03|c68834ceaff15846b...|       789303|         85.0|
|2025-08-10|102628255d24304d6...|       789902|        436.0|
|2025-08-14|77b6f538a9d0e0cf8...|       789402|        382.0|
+----------+--------------------+-------------+-------------+
only showing top 10 rows


In [0]:
df_child.count()

40811

In [0]:
df_monthly = (
    df_child
    # 1. Get month start date (e.g., 2025-11-30 → 2025-11-01)
    .withColumn("month_start", F.trunc("date", "MM"))   # or F.date_trunc("month", "date").cast("date")

    # 2. Group at monthly grain by month_start + product_code + customer_code
    .groupBy("month_start", "product_code", "customer_code")
    .agg(
        F.sum("sold_quantity").alias("sold_quantity")
    )

    # 3. Rename month_start back to `date` to match your target schema
    .withColumnRenamed("month_start", "date")
)

df_monthly.show(5, truncate=False)

+----------+----------------------------------------------------------------+-------------+-------------+
|date      |product_code                                                    |customer_code|sold_quantity|
+----------+----------------------------------------------------------------+-------------+-------------+
|2025-07-01|da6bfc596c1360ca07bda4e0ae6bfe3b8456517fc6e8ddc265630ff940f9ab05|789422       |5011.0       |
|2025-07-01|77b6f538a9d0e0cf845db5c2cbecec46fdd30303b501e06f64baf1d4dc0e66f9|789103       |5203.0       |
|2025-07-01|716fa4e54b7894c910180276e0535d49afb25cdcfac09533fb74ae00689e5742|789402       |1726.0       |
|2025-07-01|102628255d24304d6bbe0438b1ac992054f262e0814d306d0a34d7356cef3268|789720       |3712.0       |
|2025-07-01|451f7167b28a25bde73995910e31c07dfa26411f1db47847f19e16747effbdaa|789403       |1816.0       |
+----------+----------------------------------------------------------------+-------------+-------------+
only showing top 5 rows


In [0]:
df_monthly.count()

3060

In [0]:
gold_parent_delta = DeltaTable.forName(spark, f"{catalog}.{gold_schema}.fact_orders")
merge_condition = "parent_gold.date = child_gold.date AND parent_gold.product_code = child_gold.product_code AND parent_gold.customer_code = child_gold.customer_code"
gold_parent_delta.alias("parent_gold").merge(df_monthly.alias("child_gold"), merge_condition) \
    .whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

DataFrame[num_affected_rows: bigint, num_updated_rows: bigint, num_deleted_rows: bigint, num_inserted_rows: bigint]
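
The upsert behaviour of `whenMatchedUpdateAll().whenNotMatchedInsertAll()` can be illustrated with a small plain-Python model. This is a sketch, not Delta Lake's implementation: `merge_upsert` is a hypothetical helper, it rejects any duplicate source key (Delta raises an error only when several source rows match the same target row), and it treats `None` keys as equal, whereas a SQL `=` comparison never matches NULLs.

```python
def merge_upsert(target, source, key_fields):
    """Upsert source rows into target, matching on key_fields.

    target and source are lists of dicts. Matched target rows are replaced
    by the source row; unmatched source rows are appended. Raises ValueError
    when two source rows share a key, since the update would be ambiguous.
    Returns (merged_rows, updated_count, inserted_count).
    """
    def key_of(row):
        return tuple(row[f] for f in key_fields)

    source_by_key = {}
    for row in source:
        k = key_of(row)
        if k in source_by_key:
            raise ValueError(f"duplicate source key: {k}")
        source_by_key[k] = row

    merged, matched_keys, updated = [], set(), 0
    for row in target:
        k = key_of(row)
        if k in source_by_key:  # matched: update all columns
            merged.append(dict(source_by_key[k]))
            matched_keys.add(k)
            updated += 1
        else:  # target row not touched by the merge
            merged.append(row)

    inserted = [dict(r) for k, r in source_by_key.items() if k not in matched_keys]
    return merged + inserted, updated, len(inserted)
```

Grouping `df_monthly` by `date`, `product_code`, and `customer_code` before the merge guarantees one source row per key, which is what keeps the update step unambiguous.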