## Delta Live Tables Pipeline for Inventory Dimension (Auto CDC + SCD Type 2)


### Scenario

A retail company receives frequent updates to its product/item master data (e.g., item name, brand, category, price).
Downstream dashboards require both:
- the **latest** view of each item (current state), and
- the ability to **track changes over time** (historical versions)

To solve this, we build a **Delta Live Tables (DLT)** pipeline that:
1. Reads the latest items snapshot from the source table (`pyspark_cata.source.items`)
2. Applies light standardization + de-duplication (keep the latest row per `item_id`)
3. Maintains a **Type 2 Slowly Changing Dimension** in a target dimension table using an **Auto CDC flow**

### Pipeline overview (DLT layers)

- **Bronze**: `items_raw`  
  Raw ingestion from the source table (no business logic)

- **Silver**: `items_enr`  
  Cleaned/enriched dataset with deduplication to keep the latest record per `item_id`

- **Gold/Dimension**: `items_dim`  
  Maintains full history using **SCD Type 2** based on `updated_at`

> `create_auto_cdc_flow` automatically applies changes from `items_enr` to `items_dim`
> using `item_id` as the key and `updated_at` as the sequencing column.
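To make the key/sequencing idea concrete, here is a minimal pure-Python sketch of SCD Type 2 versioning. It is an illustration under simplifying assumptions, not how DLT implements the flow: `apply_scd2` and the sample rows are hypothetical, deletes and column tracking are ignored, and `None` stands in for an open-ended `__END_AT`.

```python
from collections import defaultdict


def apply_scd2(changes, key="item_id", sequence_by="updated_at"):
    """Build SCD Type 2 versions from change rows (illustrative only)."""
    # Group incoming change rows by the business key.
    by_key = defaultdict(list)
    for row in changes:
        by_key[row[key]].append(row)

    versions = []
    for rows in by_key.values():
        # Order each key's changes by the sequencing column so that
        # late-arriving rows still land in the right place in the history.
        rows = sorted(rows, key=lambda r: r[sequence_by])
        for current, following in zip(rows, rows[1:] + [None]):
            versions.append({
                **current,
                "__START_AT": current[sequence_by],
                # The newest version stays open-ended (None marks "active").
                "__END_AT": following[sequence_by] if following else None,
            })
    return versions


# Hypothetical sample data: item 1 arrives out of order.
changes = [
    {"item_id": 1, "price": 12, "updated_at": "2024-02-01"},
    {"item_id": 1, "price": 10, "updated_at": "2024-01-01"},
    {"item_id": 2, "price": 5, "updated_at": "2024-01-15"},
]
history = apply_scd2(changes)
```

In this sketch, item 1 ends up with two versions: the older one closes at the newer row's `updated_at`, and the newer one stays open. Item 2 has a single open version.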

```python
import dlt
from pyspark.sql.functions import col, desc, row_number
from pyspark.sql.window import Window

# -----------------------------
# Bronze: Raw items ingestion
# -----------------------------
@dlt.table(
    name="items_raw"
)
def items_raw():
    # Raw read from the authoritative source table
    return spark.read.table("pyspark_cata.source.items")


# ----------------------------------------
# Silver: Clean / Deduplicate latest items
# ----------------------------------------
@dlt.table(
    name="items_enr"
)
def items_enr():
    df = dlt.read("items_raw")

    # Deduplicate by keeping the most recent record per item_id (based on updated_at)
    w = Window.partitionBy("item_id").orderBy(desc("updated_at"))
    df = df.withColumn("dedup", row_number().over(w))

    return df.where(col("dedup") == 1).drop("dedup")


# -----------------------------------------
# Gold/Dim: SCD Type 2 dimension table
# -----------------------------------------
dlt.create_streaming_table(
    name="items_dim"
)

dlt.create_auto_cdc_flow(
    target="items_dim",
    source="items_enr",
    keys=["item_id"],
    sequence_by="updated_at",
    stored_as_scd_type=2,
)
```

The DAG generated after a dry run on Databricks:

<img src = "images/pipeline_dag.png" height = "120">

### Validation checklist

After the pipeline runs:
- `items_dim` should contain historical versions per `item_id`
- Exactly one active version should exist per `item_id`: in the SCD Type 2 output, DLT adds `__START_AT` and `__END_AT` columns, and the active row is the one whose `__END_AT` is NULL
- Ordering by `item_id, updated_at` should show attribute change history
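
The checklist can be turned into small checks. The sketch below is one hedged way to do it in plain Python, assuming the rows of `items_dim` have already been collected into dictionaries (for example with `[r.asDict() for r in spark.table("items_dim").collect()]` on a small table). The helper names `active_version_violations` and `history_for` and the sample rows are hypothetical.

```python
from collections import Counter


def active_version_violations(rows, key="item_id", end_col="__END_AT"):
    """Return keys that do not have exactly one active (open-ended) version."""
    # Count rows per key whose end column is empty, i.e. the active versions.
    active = Counter(r[key] for r in rows if r[end_col] is None)
    all_keys = {r[key] for r in rows}
    return sorted(k for k in all_keys if active[k] != 1)


def history_for(rows, item_id, key="item_id", start_col="__START_AT"):
    """Return one item's versions, oldest first."""
    return sorted(
        (r for r in rows if r[key] == item_id),
        key=lambda r: r[start_col],
    )


# Hypothetical sample: item 2 has no open-ended row, so it should be flagged.
rows = [
    {"item_id": 1, "price": 12, "__START_AT": "2024-02-01", "__END_AT": None},
    {"item_id": 1, "price": 10, "__START_AT": "2024-01-01", "__END_AT": "2024-02-01"},
    {"item_id": 2, "price": 5, "__START_AT": "2024-01-15", "__END_AT": "2024-03-01"},
]
violations = active_version_violations(rows)
item_1_prices = [r["price"] for r in history_for(rows, 1)]
```

An empty `violations` list means every `item_id` has exactly one active row. For large dimensions the same checks are better expressed as aggregations in Spark rather than collecting rows to the driver.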