# Workshop 3: Lakeflow Pipelines (Spark Declarative Pipelines)

## The Story

We are building a Modern Data Warehouse for a retail company. We have raw CSV data landing in our Data Lake, and we need to transform it into a high-quality **Star Schema** for reporting.

We will use **Databricks Lakeflow Pipelines** to build a declarative pipeline that handles:
1.  **Ingestion (Bronze):** Automatically loading new files using Auto Loader.
2.  **Transformation (Silver):** Cleaning data, joining tables, and handling Slowly Changing Dimensions (SCD).
3.  **Aggregation (Gold):** Creating business-ready tables for BI.

## The Data Model (Star Schema)

We will build the following schema from our source CSV files:

### 1. Fact Table: `fact_sales`
*   **Source:** Joins `SalesOrderHeader` (Orders) and `SalesOrderDetail` (Line Items).
*   **Grain:** One row per product in an order.
*   **Business Logic:**
    *   Filter only **Shipped** orders (`Status = 5`).
    *   Use the existing `LineTotal` column as the line amount; calculate a separate `TotalLineAmount` only if you need a different definition.
*   **Quality Check (Expectations):**
    *   `valid_qty`: `OrderQty > 0`
    *   `valid_price`: `UnitPrice >= 0`
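
To make the grain and the shipped-only filter concrete, here is a minimal plain-Python sketch (no Spark required). The sample rows and the helper name `build_fact_rows` are invented for illustration; only the column names come from the source files.

```python
# Illustrative only: invented sample rows, plain Python instead of Spark.
headers = [
    {"SalesOrderID": 1, "CustomerID": 10, "Status": 5},  # shipped
    {"SalesOrderID": 2, "CustomerID": 11, "Status": 1},  # not shipped
]
details = [
    {"SalesOrderID": 1, "ProductID": 100, "OrderQty": 2, "UnitPrice": 9.5},
    {"SalesOrderID": 1, "ProductID": 101, "OrderQty": 1, "UnitPrice": 20.0},
    {"SalesOrderID": 2, "ProductID": 100, "OrderQty": 3, "UnitPrice": 9.5},
]


def build_fact_rows(headers, details):
    """Inner-join line items to shipped headers: one output row per line item."""
    shipped = {h["SalesOrderID"]: h for h in headers if h["Status"] == 5}
    return [
        {**d, "CustomerID": shipped[d["SalesOrderID"]]["CustomerID"]}
        for d in details
        if d["SalesOrderID"] in shipped
    ]


fact_rows = build_fact_rows(headers, details)
for row in fact_rows:
    print(row)
```

Order 1 has two line items, so it contributes two fact rows; order 2 is not shipped and contributes none.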

### 2. Dimension: `dim_customer` (SCD Type 2)
*   **Source:** `Customers.csv`
*   **Behavior:** **SCD Type 2 (History)**. We want to track changes in customer details (e.g., Name, Email) over time.
*   **Key Columns:** `CustomerID`, `FirstName`, `LastName`, `EmailAddress`, `CompanyName`.

### 3. Dimension: `dim_product` (SCD Type 1)
*   **Source:** Joins `Product.csv` and `ProductCategory.csv`.
*   **Behavior:** **SCD Type 1 (Overwrite)**. If a product name or category changes, we just update the record. We don't need history for product typos.
*   **Key Columns:** `ProductID`, `Name` (renamed to `ProductName` in the solution), `ProductNumber`, `Color`, `CategoryName`.

---

## How to run this?

This notebook is a **Lakeflow Pipeline definition**. You cannot run it cell-by-cell like a normal notebook!

**Steps to create the Pipeline:**

1.  **Go to Workflows -> Lakeflow Pipelines (Delta Live Tables)**.
2.  **Click "Create Pipeline"**.
3.  **Fill in the details:**
    *   **Pipeline Name:** `workshop_lakeflow_pipeline`
    *   **Product edition:** **Advanced** (required for data quality expectations; the SCD/CDC features need at least Pro).
    *   **Pipeline mode:** **Triggered** (Best for workshops/batch) or Continuous.
    *   **Source code:** Select THIS notebook (`W3_lakeflow_pipeline.ipynb`).
    *   **Destination:** Unity Catalog.
        *   **Catalog:** `workshop_catalog` (or your assigned catalog).
        *   **Target Schema:** `workshop_lakeflow`.
    *   **Configuration:**
        *   Add Key: `source_path`
        *   Value: `/Volumes/workspace/default/dataset/workshop/main/` (Check your actual path!)

> **[PLACEHOLDER: Screenshot of Pipeline Configuration]**

4.  **Click "Create"**.
5.  **Click "Start"** to run the pipeline for the first time.

> **[PLACEHOLDER: Screenshot of Running Pipeline Graph]**
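
The UI fields above correspond to a pipeline settings document. The sketch below writes them as a Python dict and prints the JSON. The field names reflect my understanding of the pipeline settings JSON and should be checked against the JSON view of your own pipeline; the notebook path is a hypothetical placeholder.

```python
import json

# Assumed field names; compare with the JSON view of your pipeline in the UI.
pipeline_settings = {
    "name": "workshop_lakeflow_pipeline",
    "edition": "ADVANCED",
    "continuous": False,  # False corresponds to Triggered mode
    "catalog": "workshop_catalog",
    "schema": "workshop_lakeflow",
    "configuration": {
        "source_path": "/Volumes/workspace/default/dataset/workshop/main/",
    },
    "libraries": [
        # Hypothetical workspace path: replace with the location of this notebook.
        {"notebook": {"path": "/Workspace/Users/<your-user>/W3_lakeflow_pipeline"}},
    ],
}

print(json.dumps(pipeline_settings, indent=2))
```

The `configuration` entry is what the notebook later reads with `spark.conf.get("source_path", ...)`.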

In [None]:
from pyspark import pipelines as dp
from pyspark.sql.functions import col, current_timestamp, expr

# Read the path from the Pipeline Configuration.
# Fall back to a default path if it is not set (for syntax checking in the editor).
# rstrip("/") avoids a double slash when file names are appended to the path.
source_path = spark.conf.get("source_path", "/Volumes/workspace/default/dataset/workshop/main/").rstrip("/")

## Step 1: Bronze Layer (Ingestion)

We use **Auto Loader** (`cloudFiles`) to incrementally ingest data from CSV files.
We need to define 5 bronze tables:
1.  `bronze_customers`
2.  `bronze_products`
3.  `bronze_categories`
4.  `bronze_headers`
5.  `bronze_details`

**Hint:**
*   Use the `@dp.table` decorator.
*   Inside the function, use `spark.readStream.format("cloudFiles")`.
*   Set `cloudFiles.format` to `csv`.
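*   Use `cloudFiles.inferColumnTypes` (Auto Loader's equivalent of the CSV reader's `inferSchema`) or provide an explicit schema; for the workshop, inference is fine. Without it, Auto Loader reads every CSV column as a string.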
*   Load from `{source_path}/FileName.csv`.
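
As a planning aid for the five bronze tables, the sketch below maps each table name to its source file and builds the path to load. It is plain Python and does not touch Spark; `BRONZE_SOURCES` and `bronze_file_path` are hypothetical helper names, and the file names are the ones listed in the data model above.

```python
import posixpath

# Table-to-file mapping, using the file names from this workshop.
BRONZE_SOURCES = {
    "bronze_customers": "Customers.csv",
    "bronze_products": "Product.csv",
    "bronze_categories": "ProductCategory.csv",
    "bronze_headers": "SalesOrderHeader.csv",
    "bronze_details": "SalesOrderDetail.csv",
}


def bronze_file_path(source_path: str, table_name: str) -> str:
    """Join the base path and file name without producing a double slash."""
    return posixpath.join(source_path.rstrip("/"), BRONZE_SOURCES[table_name])


base = "/Volumes/workspace/default/dataset/workshop/main/"
for table in BRONZE_SOURCES:
    print(table, "->", bronze_file_path(base, table))
```

Stripping the trailing slash matters because the configured `source_path` in this workshop ends with `/`.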

In [None]:
# TODO: Define Bronze Tables

# 1. bronze_customers
@dp.table(
    comment="Raw customers data ingested from CSV"
)
def bronze_customers():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .option("cloudFiles.inferColumnTypes", "true")
        .load(f"{source_path}/Customers.csv")
    )

# TODO: Implement the rest of bronze tables (products, categories, headers, details)
# ...
# ...
# ...

## Step 2: Silver Layer (Dimensions & SCD)

### 2.1 Dimension: Customers (SCD Type 2)

We want to track history of customer changes.
*   **Target Table:** `dim_customer`
*   **Source:** `bronze_customers`
*   **Keys:** `CustomerID`
*   **Sequence By:** `ModifiedDate` (to determine order of updates)
*   **SCD Type:** 2

**Hint:**
Use `dp.create_streaming_table("table_name")` first.
Then use `dp.create_auto_cdc_flow(...)`.
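
Before wiring up the flow, it can help to see what SCD Type 2 with a sequence column produces. The following is a simplified plain-Python simulation, not the Lakeflow implementation: `apply_scd2` is a hypothetical helper, the sample events are invented, and every event is treated as a change. The `__START_AT` / `__END_AT` column names mirror the validity columns that Databricks adds to SCD Type 2 targets.

```python
def apply_scd2(events, key, sequence_by):
    """Return SCD Type 2 history: each event opens a version and closes the previous one."""
    history = []
    open_rows = {}  # key value -> the currently open version for that key
    for event in sorted(events, key=lambda e: e[sequence_by]):
        previous = open_rows.get(event[key])
        if previous is not None:
            previous["__END_AT"] = event[sequence_by]  # close the old version
        row = {c: v for c, v in event.items() if c != sequence_by}
        row["__START_AT"] = event[sequence_by]
        row["__END_AT"] = None  # None marks the current version
        history.append(row)
        open_rows[event[key]] = row
    return history


# Events arrive out of order; the sequence column restores the true order.
events = [
    {"CustomerID": 1, "EmailAddress": "new@example.com", "ModifiedDate": "2024-03-01"},
    {"CustomerID": 1, "EmailAddress": "old@example.com", "ModifiedDate": "2024-01-01"},
    {"CustomerID": 2, "EmailAddress": "b@example.com", "ModifiedDate": "2024-02-01"},
]

history = apply_scd2(events, key="CustomerID", sequence_by="ModifiedDate")
for row in history:
    print(row)
```

Customer 1 ends up with two rows: the old email, valid until the change date, and the new email, still open. This is why `ModifiedDate` is used as the sequence column: arrival order alone would produce the wrong history.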

### 2.2 Dimension: Products (SCD Type 1)

We want to overwrite changes (no history).
First, we need to **JOIN** `bronze_products` with `bronze_categories` to get the Category Name.

**Hint:**
1.  Create a helper table `@dp.table` named `products_joined` that performs the join.
2.  Create the target streaming table `dim_product`.
3.  Use `dp.create_auto_cdc_flow(...)` with `stored_as_scd_type = 1`.
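
For contrast with Type 2, SCD Type 1 keeps only the latest version of each key. The sketch below simulates that in plain Python; `apply_scd1` is a hypothetical helper and the product rows are invented, so treat it as an illustration of the semantics rather than of how Lakeflow implements them.

```python
def apply_scd1(events, key, sequence_by):
    """Keep only the latest version of each key, ordered by the sequence column."""
    latest = {}
    for event in events:
        current = latest.get(event[key])
        if current is None or event[sequence_by] > current[sequence_by]:
            latest[event[key]] = event
    return sorted(latest.values(), key=lambda r: r[key])


# The corrected name arrives first; the older typo arrives later and must not win.
events = [
    {"ProductID": 1, "ProductName": "Mountain Bike", "ModifiedDate": "2024-02-01"},
    {"ProductID": 1, "ProductName": "Mountian Bike", "ModifiedDate": "2024-01-01"},
    {"ProductID": 2, "ProductName": "Helmet", "ModifiedDate": "2024-01-15"},
]

dim_product_rows = apply_scd1(events, key="ProductID", sequence_by="ModifiedDate")
for row in dim_product_rows:
    print(row)
```

There is one row per `ProductID` and no history: the typo is simply overwritten by the row with the later `ModifiedDate`.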

In [None]:
# TODO: Define dim_customer (SCD Type 2)
# dp.create_streaming_table("dim_customer")
# dp.create_auto_cdc_flow(...)


# TODO: Define products_joined (Helper Table)
# @dp.table
# def products_joined():
#     ...


# TODO: Define dim_product (SCD Type 1)
# dp.create_streaming_table("dim_product")
# dp.create_auto_cdc_flow(...)

## Step 3: Gold Layer (Fact Table)

### 3.1 Fact: Sales

We need to create `fact_sales` by joining `bronze_headers` and `bronze_details`.
We also want to enforce **Data Quality** expectations.

**Requirements:**
1.  Join `bronze_headers` (h) and `bronze_details` (d) on `SalesOrderID`.
2.  Filter for `Status = 5` (Shipped).
3.  Select relevant columns: `SalesOrderID`, `OrderDate`, `CustomerID`, `ProductID`, `OrderQty`, `UnitPrice`, `LineTotal`.
4.  **Expectations:** Drop rows where `OrderQty <= 0` (`valid_qty`) or `UnitPrice < 0` (`valid_price`).

**Hint:**
Use `@dp.expect_or_drop("name", "condition")`.
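
The effect of a drop-style expectation can be sketched in plain Python: rows that fail any condition are removed, and failures are counted per expectation name. `simulate_expect_or_drop` is a hypothetical helper and the rows are invented. Real expectations take SQL condition strings and are evaluated by the pipeline, so details such as NULL handling may differ.

```python
def simulate_expect_or_drop(rows, expectations):
    """Keep rows passing every expectation; count failures per expectation name."""
    kept = []
    failures = {name: 0 for name in expectations}
    for row in rows:
        failed = [name for name, check in expectations.items() if not check(row)]
        for name in failed:
            failures[name] += 1
        if not failed:
            kept.append(row)
    return kept, failures


expectations = {
    "valid_qty": lambda r: r["OrderQty"] > 0,
    "valid_price": lambda r: r["UnitPrice"] >= 0,
}
rows = [
    {"OrderQty": 2, "UnitPrice": 9.5},   # passes both
    {"OrderQty": 0, "UnitPrice": 9.5},   # fails valid_qty
    {"OrderQty": 1, "UnitPrice": -1.0},  # fails valid_price
]

kept, failures = simulate_expect_or_drop(rows, expectations)
print(kept)
print(failures)
```

Only the first row survives, and each expectation records one failure.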

In [None]:
# TODO: Define fact_sales
# @dp.table
# @dp.expect_or_drop("valid_qty", "OrderQty > 0")
# def fact_sales():
#     ...

## Step 4: Orchestration (Theory)

Once the pipeline is created, it can be scheduled as a **Job**.

1.  Go to **Workflows -> Jobs**.
2.  Create a new Job.
3.  Task Type: **Pipeline**.
4.  Select your `workshop_lakeflow_pipeline`.
5.  Set Schedule (e.g., "Every day at 8:00 AM").

This allows you to treat your ETL pipeline just like any other scheduled task in a Data Warehouse.
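
The same schedule can be expressed as job settings. The sketch below builds them as a Python dict; the job name, task key, pipeline ID, and time zone are hypothetical placeholders. Databricks job schedules use Quartz cron syntax (seconds, minutes, hours, day of month, month, day of week), so "every day at 8:00 AM" becomes `0 0 8 * * ?`.

```python
import json

job_settings = {
    "name": "workshop_lakeflow_daily",  # hypothetical job name
    "tasks": [
        {
            "task_key": "run_pipeline",  # hypothetical task key
            "pipeline_task": {"pipeline_id": "<your-pipeline-id>"},
        }
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 8 * * ?",  # every day at 08:00:00
        "timezone_id": "UTC",  # assumption: choose your own time zone
        "pause_status": "UNPAUSED",
    },
}

cron_fields = job_settings["schedule"]["quartz_cron_expression"].split()
print(json.dumps(job_settings, indent=2))
```

Creating the job through the UI as described above yields equivalent settings; the dict is only meant to show what the schedule looks like once saved.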

---

# Solution

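Below is the complete code for the pipeline. You can copy and paste it if you get stuck. It relies on the imports and `source_path` from the first code cell. It also defines `bronze_customers` again, so remove the Step 1 definition before running the pipeline: a table can be defined only once in the pipeline source.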

In [None]:
# ============================================================
# FULL SOLUTION - Workshop 3: Lakeflow Pipelines
# ============================================================

# --- BRONZE LAYER ---

@dp.table(comment="Raw customers data")
def bronze_customers():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .option("cloudFiles.inferColumnTypes", "true")
        .load(f"{source_path}/Customers.csv")
    )

@dp.table(comment="Raw products data")
def bronze_products():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .option("cloudFiles.inferColumnTypes", "true")
        .load(f"{source_path}/Product.csv")
    )

@dp.table(comment="Raw product categories data")
def bronze_categories():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .option("cloudFiles.inferColumnTypes", "true")
        .load(f"{source_path}/ProductCategory.csv")
    )

@dp.table(comment="Raw sales headers")
def bronze_headers():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .option("cloudFiles.inferColumnTypes", "true")
        .load(f"{source_path}/SalesOrderHeader.csv")
    )

@dp.table(comment="Raw sales details")
def bronze_details():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .option("cloudFiles.inferColumnTypes", "true")
        .load(f"{source_path}/SalesOrderDetail.csv")
    )

# --- SILVER LAYER (DIMENSIONS) ---

# SCD Type 2 for Customers
dp.create_streaming_table("dim_customer")

dp.create_auto_cdc_flow(
    target = "dim_customer",
    source = "bronze_customers",
    keys = ["CustomerID"],
    sequence_by = col("ModifiedDate"),
    stored_as_scd_type = 2
)

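# SCD Type 1 for Products (with Join)
# create_auto_cdc_flow consumes its source as a stream, so products are read
# with readStream and joined to a static snapshot of the categories.
@dp.table(comment="Products enriched with their category name")
def products_joined():
    p = spark.readStream.table("bronze_products")
    c = spark.read.table("bronze_categories")
    return (
        p.join(c, p.ProductCategoryID == c.ProductCategoryID, "left")
        .select(
            p.ProductID,
            p.Name.alias("ProductName"),
            p.ProductNumber,
            p.Color,
            c.Name.alias("CategoryName"),
            p.ModifiedDate,
        )
    )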

dp.create_streaming_table("dim_product")

dp.create_auto_cdc_flow(
    target = "dim_product",
    source = "products_joined",
    keys = ["ProductID"],
    sequence_by = col("ModifiedDate"),
    stored_as_scd_type = 1
)

# --- GOLD LAYER (FACTS) ---

@dp.table(comment="Fact table for sales analysis")
@dp.expect_or_drop("valid_qty", "OrderQty > 0")
@dp.expect_or_drop("valid_price", "UnitPrice >= 0")
def fact_sales():
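    # Batch reads of the bronze tables; spark.read.table is the recommended way to reference pipeline datasets
    h = spark.read.table("bronze_headers")
    d = spark.read.table("bronze_details")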
    
    # Filter for Shipped orders (Status = 5)
    return h.filter(col("Status") == 5) \
            .join(d, h.SalesOrderID == d.SalesOrderID, "inner") \
            .select(
                h.SalesOrderID,
                h.OrderDate,
                h.CustomerID,
                d.ProductID,
                d.OrderQty,
                d.UnitPrice,
                d.LineTotal
            )