# Workshop 3: Lakeflow Pipelines (Spark Declarative Pipelines)

## The Story

We are building a modern Data Warehouse using **Lakeflow Pipelines** (formerly Delta Live Tables).
Lakeflow uses **Spark Declarative Pipelines (SDP)** to define data flows.

The business requires a **Star Schema** to analyze sales, with specific requirements for handling data changes:

1.  **Products (SCD Type 1)**: If a product name changes, simply update it. We only care about the current name.
2.  **Customers (SCD Type 2)**: If a customer updates their profile, we need to keep a history of changes.

**Your Mission:**
1.  **Ingest**: Create a **Bronze** layer using Auto Loader.
2.  **Clean**: Create a **Silver** layer with data quality expectations.
3.  **Model**: Use Auto CDC (formerly `APPLY CHANGES INTO`) to implement **SCD Type 1** and **SCD Type 2** logic.
4.  **Deploy**: Create a Lakeflow Pipeline job.

**Note:** We will use the `pyspark.pipelines` module (imported as `dp`) instead of the older `dlt` module.

**Time:** 45 minutes

In [None]:
# Note: This notebook is a Lakeflow Pipeline definition. 
# It is NOT meant to be run cell-by-cell interactively (except this setup block).
# You will deploy this as a Pipeline in the "Workflows" tab.

%run ../00_setup

# --- INDEPENDENT SETUP ---
# Ensure source files exist for the pipeline to read
source_path = f"{volume_path}/main"
required_files = ["Customers.csv", "Product.csv", "SalesOrderDetail.csv"]

print(f"Checking source data in {source_path}...")
try:
    files = [f.name for f in dbutils.fs.ls(source_path)]
    missing = [f for f in required_files if f not in files]
    if missing:
        print(f"⚠️ Missing files: {missing}. Please upload them to {source_path}.")
    else:
        print(f"✅ All required source files found in {source_path}.")
        for name in required_files:
            print(f"   - {name}")
except Exception as e:
    print(f"Error checking files: {e}")

# We will use this path in the Pipeline code
print(f"\nCopy this path for your Pipeline code: {source_path}")

## Step 1: Bronze Layer (Ingestion)

We use **Auto Loader** (`cloudFiles`) to ingest data incrementally.
We need to define three bronze tables:
1.  `bronze_customers`
2.  `bronze_products`
3.  `bronze_sales` (from SalesOrderDetail)
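
To build intuition for what "incrementally" means here, the following plain-Python sketch models a loader that remembers which files it has already processed and returns only new ones on each run. It is a conceptual model under simplifying assumptions, not Auto Loader's implementation; `IncrementalLoader` and the file name `Customers_update.csv` are hypothetical.

```python
class IncrementalLoader:
    """Toy model of incremental file ingestion (NOT Auto Loader's real implementation)."""

    def __init__(self):
        # Stands in for the checkpoint that records already-ingested files.
        self.seen = set()

    def poll(self, listing):
        """Return files from `listing` that were not ingested before, then mark them as seen."""
        new_files = sorted(f for f in listing if f not in self.seen)
        self.seen.update(new_files)
        return new_files


loader = IncrementalLoader()
first_run = loader.poll(["Customers.csv"])
# A hypothetical second file arrives before the next run.
second_run = loader.poll(["Customers.csv", "Customers_update.csv"])

print(first_run)   # ['Customers.csv']
print(second_run)  # ['Customers_update.csv']
```

The second run skips `Customers.csv` because it was already recorded, which is the behavior that lets a bronze table grow without reprocessing old files.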

**Hint:**
```python
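from pyspark import pipelines as dp

@dp.table
def bronze_customers():
  return (
    spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("header", "true")
      # Auto Loader reads CSV columns as strings unless type inference is enabled
      .option("cloudFiles.inferColumnTypes", "true")
      # input_path is the hardcoded variable defined in the code cell below
      .load(f"{input_path}/Customers.csv")
  )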
```

In [None]:
from pyspark import pipelines as dp
from pyspark.sql.functions import col

# Hardcode the path: variables defined in the setup cell are not available when the pipeline runs.
# REPLACE THIS with the path printed in the setup cell above!
input_path = "/Volumes/main/default/workshop_files/main"

# TODO: Define bronze_customers
# @dp.table
# def bronze_customers():
#     return ...

# TODO: Define bronze_products

# TODO: Define bronze_sales (SalesOrderDetail.csv)

## Step 2: Silver Layer (Cleaning)

Create "clean" versions of our tables.
*   **Customers**: Must have a valid `EmailAddress`.
*   **Products**: Must have a `Name`.

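Use `@dp.expect_or_drop` to enforce these rules: rows that fail the condition are dropped from the target table and counted in the pipeline's data quality metrics.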

**Hint:**
```python
@dp.table
@dp.expect_or_drop("valid_email", "EmailAddress IS NOT NULL")
def silver_customers():
  # Stream from bronze so this table can later feed Auto CDC as a streaming source
  return spark.readStream.table("bronze_customers")
```
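
The drop-and-count behavior of an expectation can be pictured with a small plain-Python sketch. This is only an illustration of the semantics, assuming a row is a dictionary and a rule is a Python predicate; `simulate_expect_or_drop` and the sample rows are hypothetical and not part of the pipelines API.

```python
def simulate_expect_or_drop(rows, predicate):
    """Keep rows that satisfy `predicate` and report how many were dropped."""
    kept = [row for row in rows if predicate(row)]
    return kept, len(rows) - len(kept)


# Hypothetical sample data: customer 2 has no email address.
customers = [
    {"CustomerID": 1, "EmailAddress": "a@example.com"},
    {"CustomerID": 2, "EmailAddress": None},
    {"CustomerID": 3, "EmailAddress": "c@example.com"},
]

valid, dropped = simulate_expect_or_drop(
    customers, lambda row: row["EmailAddress"] is not None
)

print([row["CustomerID"] for row in valid])  # [1, 3]
print(dropped)                               # 1
```

The dropped count corresponds to what the pipeline UI reports as failed records for the `valid_email` expectation.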

In [None]:
# TODO: Define silver_customers (Expect EmailAddress IS NOT NULL)

# TODO: Define silver_products (Expect Name IS NOT NULL)

# TODO: Define silver_sales (Pass through from bronze_sales)

## Step 3: Gold Layer - SCD Handling (Auto CDC)

This is where Lakeflow Pipelines shines. We use `dp.create_auto_cdc_flow` (Auto CDC, previously named `apply_changes`) to handle history.

### SCD Type 1: Products
We only want the **latest** version of a product.
*   `stored_as_scd_type=1`
*   Keys: `ProductID`
*   Sequence: `ModifiedDate` (to know which is newer)
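
As a rough mental model of SCD Type 1, the plain-Python sketch below keeps one row per key and lets the sequence column decide which row wins, even when changes arrive out of order. It is a simplification, not how the pipeline engine is implemented; `apply_scd1` and the sample product rows are hypothetical.

```python
def apply_scd1(changes, key, sequence_by):
    """Keep only the newest row per key, judged by the `sequence_by` column."""
    current = {}
    for row in changes:
        existing = current.get(row[key])
        if existing is None or row[sequence_by] > existing[sequence_by]:
            current[row[key]] = row
    return current


# Hypothetical change feed; the last row arrives late but is older.
product_changes = [
    {"ProductID": 1, "Name": "Road Bike", "ModifiedDate": "2024-01-01"},
    {"ProductID": 1, "Name": "Road Bike Pro", "ModifiedDate": "2024-03-01"},
    {"ProductID": 1, "Name": "Road Bke", "ModifiedDate": "2023-12-01"},
]

dim_product = apply_scd1(product_changes, "ProductID", "ModifiedDate")
print(dim_product[1]["Name"])  # Road Bike Pro
```

Without a sequence column, the late-arriving typo row would have overwritten the correct name, which is why `ModifiedDate` matters even for Type 1.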

### SCD Type 2: Customers
We want a **history** of changes.
*   `stored_as_scd_type=2`
*   Keys: `CustomerID`
*   Sequence: `ModifiedDate`
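
SCD Type 2 can be sketched the same way: instead of overwriting, each change closes the previous version and opens a new one. The sketch below is a simplified model that treats every input row as a change and uses the sequence value as the validity boundary; `apply_scd2` and the sample rows are hypothetical, and the `__START_AT` / `__END_AT` names mirror the columns the pipeline adds.

```python
def apply_scd2(changes, key, sequence_by):
    """Build a version history: each change closes the open version and starts a new one."""
    history = []
    open_index = {}  # key value -> position in `history` of the currently open version
    for row in sorted(changes, key=lambda r: r[sequence_by]):
        k = row[key]
        if k in open_index:
            # Close the previous version at the moment the new one starts.
            history[open_index[k]]["__END_AT"] = row[sequence_by]
        history.append({**row, "__START_AT": row[sequence_by], "__END_AT": None})
        open_index[k] = len(history) - 1
    return history


# Hypothetical change feed for one customer.
customer_changes = [
    {"CustomerID": 7, "EmailAddress": "old@example.com", "ModifiedDate": "2024-01-01"},
    {"CustomerID": 7, "EmailAddress": "new@example.com", "ModifiedDate": "2024-06-01"},
]

dim_customer = apply_scd2(customer_changes, "CustomerID", "ModifiedDate")
for version in dim_customer:
    print(version["EmailAddress"], version["__START_AT"], version["__END_AT"])
```

The open version (the one with `__END_AT` set to `None`) is the customer's current profile; closed versions form the history.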

**Hint:**
```python
dp.create_streaming_table("dim_product")

# create_auto_cdc_flow is the current name for apply_changes and takes the same arguments
dp.create_auto_cdc_flow(
  target = "dim_product",
  source = "silver_products",
  keys = ["ProductID"],
  sequence_by = col("ModifiedDate"),
  stored_as_scd_type = 1
)
```

In [None]:
# TODO: Create dim_product (SCD Type 1)
# dp.create_streaming_table("dim_product")
# dp.create_auto_cdc_flow(...)

# TODO: Create dim_customer (SCD Type 2)
# dp.create_streaming_table("dim_customer")
# dp.create_auto_cdc_flow(..., stored_as_scd_type = 2)

## Step 4: Gold Fact Table

Finally, create the `fact_sales` table.
In a Star Schema, the Fact table links to Dimensions via keys.

**Hint:**
```python
@dp.table
def fact_sales():
  return (
    spark.read.table("silver_sales")
      .select("SalesOrderID", "CustomerID", "ProductID", "OrderQty", "UnitPrice", "LineTotal")
  )
```
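
To see why the fact table carries only keys and measures, the plain-Python sketch below resolves `ProductID` against a dimension at query time and aggregates a measure. It assumes tiny in-memory sample data; `revenue_by_product` and all values shown are hypothetical.

```python
# Hypothetical dimension (one current row per ProductID, as in SCD Type 1).
dim_product = {1: {"Name": "Road Bike"}, 2: {"Name": "Helmet"}}

# Hypothetical fact rows: keys plus numeric measures only.
fact_sales = [
    {"SalesOrderID": 100, "ProductID": 1, "OrderQty": 2, "LineTotal": 2400.0},
    {"SalesOrderID": 101, "ProductID": 2, "OrderQty": 1, "LineTotal": 50.0},
    {"SalesOrderID": 102, "ProductID": 1, "OrderQty": 1, "LineTotal": 1200.0},
]


def revenue_by_product(facts, products):
    """Join each fact row to its product by key and sum LineTotal per product name."""
    totals = {}
    for fact in facts:
        name = products[fact["ProductID"]]["Name"]
        totals[name] = totals.get(name, 0.0) + fact["LineTotal"]
    return totals


totals = revenue_by_product(fact_sales, dim_product)
print(totals)  # {'Road Bike': 3600.0, 'Helmet': 50.0}
```

Because the product name lives only in the dimension, renaming a product updates one dimension row and every report picks up the new name without touching the fact table.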

In [None]:
# TODO: Define fact_sales
# @dp.table
# def fact_sales():
#     return ...

In [None]:
# ============================================================
# FULL SOLUTION - Workshop 3: Lakeflow Pipelines (SDP)
# ============================================================

from pyspark import pipelines as dp
from pyspark.sql.functions import col

# Path to source data (Update this!)
input_path = "/Volumes/main/default/workshop_files/main"

# --- Bronze Layer ---
# Auto Loader reads CSV columns as strings unless cloudFiles.inferColumnTypes is enabled.
@dp.table
def bronze_customers():
    return (spark.readStream.format("cloudFiles").option("cloudFiles.format", "csv")
            .option("header", "true").option("cloudFiles.inferColumnTypes", "true")
            .load(f"{input_path}/Customers.csv"))

@dp.table
def bronze_products():
    return (spark.readStream.format("cloudFiles").option("cloudFiles.format", "csv")
            .option("header", "true").option("cloudFiles.inferColumnTypes", "true")
            .load(f"{input_path}/Product.csv"))

@dp.table
def bronze_sales():
    return (spark.readStream.format("cloudFiles").option("cloudFiles.format", "csv")
            .option("header", "true").option("cloudFiles.inferColumnTypes", "true")
            .load(f"{input_path}/SalesOrderDetail.csv"))

# --- Silver Layer ---
# Silver tables stream from bronze so they can serve as streaming sources for Auto CDC.
@dp.table
@dp.expect_or_drop("valid_email", "EmailAddress IS NOT NULL")
def silver_customers():
    return spark.readStream.table("bronze_customers")

@dp.table
@dp.expect_or_drop("valid_name", "Name IS NOT NULL")
def silver_products():
    return spark.readStream.table("bronze_products")

@dp.table
def silver_sales():
    return spark.readStream.table("bronze_sales")

# --- Gold Layer (SCD) ---
# create_auto_cdc_flow is the current name for apply_changes and takes the same arguments.

# SCD Type 1: Products (Overwrite)
dp.create_streaming_table("dim_product")

dp.create_auto_cdc_flow(
    target = "dim_product",
    source = "silver_products",
    keys = ["ProductID"],
    sequence_by = col("ModifiedDate"),
    stored_as_scd_type = 1
)

# SCD Type 2: Customers (History)
dp.create_streaming_table("dim_customer")

dp.create_auto_cdc_flow(
    target = "dim_customer",
    source = "silver_customers",
    keys = ["CustomerID"],
    sequence_by = col("ModifiedDate"),
    stored_as_scd_type = 2
)

# Fact Table
@dp.table
def fact_sales():
    return spark.read.table("silver_sales")

## Step 5: Deploying the Lakeflow Pipeline

To run this code, you must create a Pipeline resource.

1.  **Navigate**: Click **Workflows** in the sidebar, then **Delta Live Tables** (soon to be renamed **Lakeflow Pipelines**).
2.  **Create Pipeline**:
    *   **Pipeline Name**: `Lakeflow_Workshop_YourName`
    *   **Product Edition**: `Advanced` (required for data quality expectations; Auto CDC / `APPLY CHANGES` needs `Pro` or `Advanced`).
    *   **Source Code**: Select **this notebook**.
    *   **Destination**: Unity Catalog (`catalog.schema`).
3.  **Start**: Click **Start** to run the pipeline.

### Observe:
*   The **Graph** visualization showing dependencies.
*   **Data Quality** metrics (how many records were dropped in Silver).
*   **SCD Handling**: Check `dim_customer` - it will have `__START_AT` and `__END_AT` columns automatically added!
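
Once `dim_customer` carries validity columns, two lookups become common: the current version of each customer, and the version that was valid at a given time. The plain-Python sketch below illustrates both on hypothetical sample rows, assuming each version is valid from `__START_AT` (inclusive) up to `__END_AT` (exclusive); `current_rows` and `as_of` are illustrative helpers, not pipeline functions.

```python
# Hypothetical rows shaped like an SCD Type 2 table.
history = [
    {"CustomerID": 7, "EmailAddress": "old@example.com",
     "__START_AT": "2024-01-01", "__END_AT": "2024-06-01"},
    {"CustomerID": 7, "EmailAddress": "new@example.com",
     "__START_AT": "2024-06-01", "__END_AT": None},
]


def current_rows(rows):
    """Current versions are the ones that have not been closed yet."""
    return [row for row in rows if row["__END_AT"] is None]


def as_of(rows, customer_id, when):
    """Return the version valid at `when`, assuming [__START_AT, __END_AT) intervals."""
    for row in rows:
        starts_before = row["__START_AT"] <= when
        still_open = row["__END_AT"] is None or when < row["__END_AT"]
        if row["CustomerID"] == customer_id and starts_before and still_open:
            return row
    return None


print(current_rows(history)[0]["EmailAddress"])        # new@example.com
print(as_of(history, 7, "2024-03-15")["EmailAddress"])  # old@example.com
```

In SQL the first lookup corresponds to filtering on `__END_AT IS NULL`.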