# Workshop 3: Lakeflow Pipelines

## The Story

We are building a Modern Data Warehouse for a retail company. We have raw CSV data landing in our Data Lake, and we need to transform it into a high-quality **Star Schema** for reporting.

We will use **Databricks Lakeflow Pipelines** to build a declarative pipeline that handles:
1.  **Ingestion (Bronze):** Automatically loading new files.
2.  **Transformation (Silver):** Cleaning data and handling Slowly Changing Dimensions (SCD).
3.  **Aggregation (Gold):** Creating business-ready tables.

## The Data Model (Star Schema)

We will build the following schema:

### 1. Fact Table: `fact_sales`
*   **Source:** Joins `SalesOrderHeader` and `SalesOrderDetail`.
*   **Grain:** One row per product in an order.
*   **Key Metrics:** `OrderQty`, `UnitPrice`, `LineTotal`.
*   **Quality Check:** `OrderQty` must be greater than 0.

### 2. Dimension: `dim_customer` (SCD Type 2)
*   **Source:** `Customers.csv`
*   **Behavior:** **SCD Type 2 (History)**. We want to track changes in customer details over time.
*   **Key Columns:** `CustomerID`, `FirstName`, `LastName`, `Email`.
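
To make "track changes over time" concrete, here is a minimal plain-Python sketch of SCD Type 2 bookkeeping. It is only an illustration of the idea, not how the pipeline implements it: the `CustomerVersion` class, the `apply_scd2` helper, and the `start_at`/`end_at` field names are hypothetical. In the pipeline itself, `apply_changes` manages the validity columns for you.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CustomerVersion:
    customer_id: int
    email: str
    start_at: str            # when this version became current
    end_at: Optional[str]    # None means "still the current version"


def apply_scd2(history: List[CustomerVersion], customer_id: int,
               email: str, modified: str) -> List[CustomerVersion]:
    """Close the open version if the value changed, then open a new one."""
    current = next(
        (v for v in history
         if v.customer_id == customer_id and v.end_at is None),
        None,
    )
    if current is not None:
        if current.email == email:
            return history           # nothing changed, keep the open version
        current.end_at = modified    # close the old version
    history.append(CustomerVersion(customer_id, email, modified, None))
    return history


history: List[CustomerVersion] = []
apply_scd2(history, 1, "ann@old.example", "2024-01-01")
apply_scd2(history, 1, "ann@new.example", "2024-06-01")
```

After the second call, `history` holds two rows for customer 1: the old email with a closed validity window and the new email as the open, current version.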

### 3. Dimension: `dim_product` (SCD Type 1)
*   **Source:** Joins `Product.csv` and `ProductCategory.csv`.
*   **Behavior:** **SCD Type 1 (Overwrite)**. If a product name changes, we just update it. We don't need history for typos.
*   **Key Columns:** `ProductID`, `ProductName` (the product's `Name`), `CategoryName`.
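
For contrast with Type 2, the sketch below shows SCD Type 1 as a plain-Python upsert: the latest value simply overwrites the stored one, and an event that arrives late (with an older timestamp) is ignored. The `apply_scd1` helper and the two dictionaries are hypothetical names used only for illustration. The timestamp comparison plays the role that `sequence_by` plays in the pipeline.

```python
def apply_scd1(table: dict, seen_at: dict, product_id: int,
               name: str, modified: str) -> dict:
    """Upsert keyed by product_id; skip events older than what is stored."""
    if product_id in seen_at and modified <= seen_at[product_id]:
        return table                 # stale or duplicate event, skip it
    table[product_id] = name         # overwrite in place, no history kept
    seen_at[product_id] = modified
    return table


dim_product: dict = {}
last_seen: dict = {}
apply_scd1(dim_product, last_seen, 10, "Mountian Bike", "2024-01-01")
apply_scd1(dim_product, last_seen, 10, "Mountain Bike", "2024-02-01")  # typo fixed
apply_scd1(dim_product, last_seen, 10, "Mountian Bike", "2023-12-01")  # late arrival
```

Only one row per product survives, and it carries the most recent name. ISO-formatted date strings are used so that string comparison matches chronological order.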

---

## How to run this?

This notebook is a **Lakeflow Pipeline definition**. You cannot run it cell-by-cell like a normal notebook!

**Steps to create the Pipeline:**
1.  Go to **Lakeflow Pipelines**.
2.  Click **Create Pipeline**.
3.  **Product edition:** Advanced (SCD via `apply_changes` needs Pro or higher; data quality expectations need Advanced).
4.  **Pipeline mode:** Triggered (for workshop) or Continuous.
5.  **Source code:** Select THIS notebook.
6.  **Destination:** Unity Catalog (Schema: `workshop_lakeflow`).
7.  **Configuration:**
    *   `source_path`: `/Volumes/workspace/default/dataset/workshop/main/` (Adjust to your path)
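
The example `source_path` ends with a slash, so building file paths by appending `/Customers.csv` would give a double slash. A small sketch of one way to normalize this follows. The `source_file` helper is a hypothetical name, not part of any Databricks API.

```python
import posixpath


def source_file(source_path: str, file_name: str) -> str:
    """Join a configured base path and a file name without doubling slashes."""
    return posixpath.join(source_path.rstrip("/"), file_name)


customers_path = source_file(
    "/Volumes/workspace/default/dataset/workshop/main/", "Customers.csv"
)
```

The same helper works whether or not the configured value has a trailing slash.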

In [None]:
import dlt
from pyspark.sql.functions import col, current_timestamp

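# Read the source path from the Pipeline Configuration.
# Fall back to the workshop default if it is not set (useful for syntax checking).
# Strip any trailing slash so f"{source_path}/<file>" does not produce "//".
source_path = spark.conf.get("source_path", "/Volumes/workspace/default/dataset/workshop/main/").rstrip("/")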

## Step 1: Bronze Layer (Ingestion)

We use **Auto Loader** (`cloudFiles`) to incrementally ingest data.
Define 5 bronze tables: `bronze_customers`, `bronze_products`, `bronze_categories`, `bronze_headers`, `bronze_details`.

**Hint:**
Use the `@dlt.table` decorator.
Inside the function, use `spark.readStream.format("cloudFiles")`.
Remember to set `cloudFiles.format` to `csv`.
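
Since all five bronze tables share the same reader settings, one option is to keep them in a single dictionary. The sketch below assumes you want typed columns rather than all strings. `cloudFiles.inferColumnTypes` is the Auto Loader option for that, as far as I know, so verify it against the Auto Loader documentation. The `csv_autoloader_options` helper is a hypothetical name.

```python
def csv_autoloader_options(infer_types: bool = True) -> dict:
    """Reader options for headered CSV files ingested with Auto Loader."""
    return {
        "cloudFiles.format": "csv",
        "header": "true",
        # Without this, columns from CSV are inferred as strings.
        "cloudFiles.inferColumnTypes": str(infer_types).lower(),
    }


opts = csv_autoloader_options()

# Inside a @dlt.table function this would be used roughly as:
#   spark.readStream.format("cloudFiles").options(**opts).load(path)
```

Passing the dictionary through `.options(**opts)` keeps the five table definitions short and consistent.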

In [None]:
# TODO: Define Bronze Tables
# 1. bronze_customers
# 2. bronze_products
# 3. bronze_categories
# 4. bronze_headers
# 5. bronze_details

## Step 2: Silver Layer (Dimensions & SCD)

### 2.1 Dimension: Customers (SCD Type 2)

We want to track history. Use `dlt.apply_changes`.

**Hint:**
Use `dlt.create_streaming_table("table_name")` first.
Then use `dlt.apply_changes(...)`.
Key parameters: `target`, `source`, `keys`, `sequence_by`, and `stored_as_scd_type = 2`.

### 2.2 Dimension: Products (SCD Type 1)

We want to overwrite changes. First, join Products with Categories.

**Hint:**
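1. Create a helper table with `@dlt.table` that joins products and categories. Read products with `dlt.read_stream(...)` so the result is a streaming source that `apply_changes` can consume.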
2. Create the target streaming table.
3. Use `dlt.apply_changes(...)` with `stored_as_scd_type = 1`.

In [None]:
# TODO: Define dim_customer (SCD Type 2)


# TODO: Define intermediate joined table for products


# TODO: Define dim_product (SCD Type 1)

## Step 3: Silver Layer (Fact Table)

Create `fact_sales` by joining Headers and Details.
Add a **Quality Expectation** to drop invalid rows.

**Hint:**
Use `@dlt.table` and `@dlt.expect_or_drop(...)`.
Read source tables using `dlt.read("table_name")`.
Perform the join in PySpark.
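
To see what the expectation does to the data, here is a plain-Python sketch of the same logic: an inner join of headers and details, followed by dropping rows that fail `OrderQty > 0`. The sample rows and the `build_fact_sales` helper are made up for illustration. In the pipeline, `@dlt.expect_or_drop` does the filtering and records how many rows were dropped.

```python
def build_fact_sales(headers: list, details: list):
    """Inner-join details to headers, then drop rows with OrderQty <= 0."""
    by_id = {h["SalesOrderID"]: h for h in headers}
    joined = [
        {**by_id[d["SalesOrderID"]], **d}
        for d in details
        if d["SalesOrderID"] in by_id      # inner join: unmatched details vanish
    ]
    kept = [row for row in joined if row["OrderQty"] > 0]
    dropped = len(joined) - len(kept)      # rows failing the quality check
    return kept, dropped


headers = [{"SalesOrderID": 1, "CustomerID": 7}]
details = [
    {"SalesOrderID": 1, "ProductID": 10, "OrderQty": 2, "LineTotal": 40.0},
    {"SalesOrderID": 1, "ProductID": 11, "OrderQty": 0, "LineTotal": 0.0},
    {"SalesOrderID": 2, "ProductID": 12, "OrderQty": 1, "LineTotal": 5.0},
]
fact_sales, dropped = build_fact_sales(headers, details)
```

Here the detail for order 2 has no matching header, so it never reaches the check. Of the two joined rows, the one with `OrderQty` of 0 is dropped.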

In [None]:
# TODO: Define fact_sales with Expectation

## Step 4: Gold Layer (Aggregation)

Create a business report: `sales_by_category`.
Calculate total sales per category.

**Hint:**
Join `fact_sales` and `dim_product`.
Group by category and sum the sales.
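
The join-then-aggregate shape of this report can be sketched in plain Python as follows. The sample rows and the `sales_by_category` function are hypothetical, and the pipeline version would use a DataFrame join and `groupBy` instead.

```python
from collections import defaultdict


def sales_by_category(fact_rows: list, dim_product: dict) -> dict:
    """Sum LineTotal per CategoryName, keeping only products found in the dimension."""
    totals = defaultdict(float)
    for row in fact_rows:
        product = dim_product.get(row["ProductID"])
        if product is None:
            continue                       # inner join: skip unknown products
        totals[product["CategoryName"]] += row["LineTotal"]
    return dict(totals)


fact_rows = [
    {"ProductID": 10, "LineTotal": 40.0},
    {"ProductID": 11, "LineTotal": 15.5},
    {"ProductID": 12, "LineTotal": 4.5},
    {"ProductID": 99, "LineTotal": 1.0},   # not in dim_product
]
dim_product = {
    10: {"CategoryName": "Bikes"},
    11: {"CategoryName": "Bikes"},
    12: {"CategoryName": "Helmets"},
}
report = sales_by_category(fact_rows, dim_product)
```

Note that an inner join silently excludes sales whose product is missing from the dimension, which is worth keeping in mind when totals look too low.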

In [None]:
# TODO: Define sales_by_category report

# Solution

The complete code is below.

---

**Scheduling Note:**
To schedule this pipeline:

1. Go to the Pipeline settings.
2. Click **Schedule**.
3. Choose "Scheduled" and set the cron syntax (e.g., every hour).

Scheduling works exactly like it does for a Job!

---

In [None]:
# ============================================================
# FULL SOLUTION - Workshop 3: Lakeflow Pipelines
# ============================================================

import dlt
from pyspark.sql.functions import col, current_timestamp

# Strip any trailing slash so f"{source_path}/<file>" does not produce "//".
source_path = spark.conf.get("source_path", "/Volumes/workspace/default/dataset/workshop/main/").rstrip("/")

# --- BRONZE LAYER ---

# Auto Loader infers CSV columns as strings unless cloudFiles.inferColumnTypes
# is enabled; typed columns are needed later for OrderQty > 0 and sum(LineTotal).

@dlt.table
def bronze_customers():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .option("cloudFiles.inferColumnTypes", "true")
        .load(f"{source_path}/Customers.csv")
    )

@dlt.table
def bronze_products():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .option("cloudFiles.inferColumnTypes", "true")
        .load(f"{source_path}/Product.csv")
    )

@dlt.table
def bronze_categories():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .option("cloudFiles.inferColumnTypes", "true")
        .load(f"{source_path}/ProductCategory.csv")
    )

@dlt.table
def bronze_headers():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .option("cloudFiles.inferColumnTypes", "true")
        .load(f"{source_path}/SalesOrderHeader.csv")
    )

@dlt.table
def bronze_details():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .option("cloudFiles.inferColumnTypes", "true")
        .load(f"{source_path}/SalesOrderDetail.csv")
    )

# --- SILVER LAYER (DIMENSIONS) ---

# SCD Type 2 for Customers
dlt.create_streaming_table("dim_customer")

dlt.apply_changes(
    target = "dim_customer",
    source = "bronze_customers",
    keys = ["CustomerID"],
    sequence_by = col("ModifiedDate"),
    stored_as_scd_type = 2
)

# SCD Type 1 for Products (with Join)
# apply_changes consumes a streaming source, so products are read as a stream
# and joined against a static snapshot of categories (stream-static join).
@dlt.table
def products_joined():
    p = dlt.read_stream("bronze_products")
    c = dlt.read("bronze_categories")
    return p.join(c, p.ProductCategoryID == c.ProductCategoryID, "left") \
            .select(p.ProductID, p.Name.alias("ProductName"), p.ProductNumber, c.Name.alias("CategoryName"), p.ModifiedDate)

dlt.create_streaming_table("dim_product")

dlt.apply_changes(
    target = "dim_product",
    source = "products_joined",
    keys = ["ProductID"],
    sequence_by = col("ModifiedDate"),
    stored_as_scd_type = 1
)

# --- SILVER LAYER (FACTS) ---

@dlt.table
@dlt.expect_or_drop("valid_qty", "OrderQty > 0")
def fact_sales():
    h = dlt.read("bronze_headers")
    d = dlt.read("bronze_details")
    
    return h.join(d, h.SalesOrderID == d.SalesOrderID, "inner") \
            .select(
                h.SalesOrderID,
                h.OrderDate,
                h.CustomerID,
                d.ProductID,
                d.OrderQty,
                d.UnitPrice,
                d.LineTotal
            )

# --- GOLD LAYER ---

@dlt.table
def sales_by_category():
    f = dlt.read("fact_sales")
    p = dlt.read("dim_product")
    
    return f.join(p, f.ProductID == p.ProductID, "inner") \
            .groupBy("CategoryName") \
            .sum("LineTotal") \
            .withColumnRenamed("sum(LineTotal)", "TotalSales")