# Workshop 3: Lakeflow Pipelines (Delta Live Tables)

## 3.1. The Story

We are building a modern Data Warehouse using **Lakeflow Pipelines** (formerly Delta Live Tables).
Lakeflow uses **Spark Declarative Pipelines (SDP)** to define data flows.

The business requires a **Star Schema** to analyze sales, with specific requirements for handling data changes:

1.  **Products (SCD Type 1)**: If a product name changes, simply update it. We only care about the current name.
2.  **Customers (SCD Type 2)**: If a customer updates their profile, we need to keep a history of changes.
3.  **Sales (Fact)**: Transactional data linked to dimensions.
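
To make the difference between the two SCD types concrete, here is a minimal plain-Python sketch of the semantics. It is not how Lakeflow implements them; the sample rows, the `id`/`name` fields, and the `start_at`/`end_at` names are illustrative assumptions.

```python
from datetime import date

def apply_scd1(changes):
    """Type 1: overwrite in place, so only the latest version per key survives."""
    current = {}
    for row in sorted(changes, key=lambda r: r["ModifiedDate"]):
        current[row["id"]] = row
    return current

def apply_scd2(changes):
    """Type 2: keep every version and close the previous one when a new one arrives."""
    history = []
    open_rows = {}  # key -> index in history of the currently open version
    for row in sorted(changes, key=lambda r: r["ModifiedDate"]):
        previous = open_rows.get(row["id"])
        if previous is not None:
            history[previous]["end_at"] = row["ModifiedDate"]
        history.append({**row, "start_at": row["ModifiedDate"], "end_at": None})
        open_rows[row["id"]] = len(history) - 1
    return history

# Hypothetical change feed: key 1 is modified once, key 2 never changes.
changes = [
    {"id": 1, "name": "Road Bike", "ModifiedDate": date(2024, 1, 1)},
    {"id": 1, "name": "Road Bike Pro", "ModifiedDate": date(2024, 3, 1)},
    {"id": 2, "name": "Helmet", "ModifiedDate": date(2024, 2, 1)},
]

scd1 = apply_scd1(changes)  # one row per key
scd2 = apply_scd2(changes)  # one row per version
current_versions = [r for r in scd2 if r["end_at"] is None]
```

Type 1 ends with one row per key, while Type 2 ends with one row per version, and the open-ended rows (`end_at` is `None`) are the current ones.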

**Your Mission:**
1.  **Ingest**: Create a **Bronze** layer using Auto Loader.
2.  **Clean**: Create a **Silver** layer with data quality expectations.
3.  **Model**: Use `dlt.apply_changes` (the Python counterpart of SQL `APPLY CHANGES INTO`, also called Auto CDC) to implement **SCD Type 1** and **SCD Type 2** logic.
4.  **Deploy**: Create a Lakeflow pipeline and orchestrate it from a job.
5.  **Update**: Simulate data updates and observe SCD behavior.

**Time:** 45 minutes

In [None]:
# Note: This notebook is a Lakeflow Pipeline definition. 
# It is NOT meant to be run cell-by-cell interactively (except this setup block).
# You will deploy this as a Pipeline in the "Workflows" tab.

%run ../00_setup

# === CONFIGURATION ===
# We will use a separate path for this workshop to simulate data arrival
WORKSHOP_SOURCE_PATH = f"{volume_path}/lakeflow_workshop_source"
CHECKPOINT_PATH = f"{volume_path}/lakeflow_checkpoints"

# Clean up previous runs
dbutils.fs.rm(WORKSHOP_SOURCE_PATH, True)
dbutils.fs.rm(CHECKPOINT_PATH, True)

# === DATA PREPARATION (SIMULATION) ===
# We split source data into 2 batches to simulate incremental updates

# 1. Load Source Data
df_customers = spark.read.option("header", "true").csv(f"{DATASET_BASE_PATH}/workshop/main/Customers.csv")
df_products = spark.read.option("header", "true").csv(f"{DATASET_BASE_PATH}/workshop/main/Product.csv")
df_sales = spark.read.option("header", "true").csv(f"{DATASET_BASE_PATH}/workshop/main/SalesOrderDetail.csv")

# 2. Split into Batch 1 (Initial Load) and Batch 2 (arrives later)
# randomSplit places each row in exactly one batch (the 50/50 weights are approximate),
# so Batch 2 only carries a new version of a Batch 1 key if the source file
# contains several rows for that key.
cust_b1, cust_b2 = df_customers.randomSplit([0.5, 0.5], seed=42)
prod_b1, prod_b2 = df_products.randomSplit([0.5, 0.5], seed=42)
sales_b1, sales_b2 = df_sales.randomSplit([0.5, 0.5], seed=42)

# 3. Write Batch 1 to Source Path
print(f"Preparing Batch 1 in {WORKSHOP_SOURCE_PATH}...")
cust_b1.write.mode("overwrite").option("header", "true").csv(f"{WORKSHOP_SOURCE_PATH}/Customers")
prod_b1.write.mode("overwrite").option("header", "true").csv(f"{WORKSHOP_SOURCE_PATH}/Product")
sales_b1.write.mode("overwrite").option("header", "true").csv(f"{WORKSHOP_SOURCE_PATH}/SalesOrderDetail")

print("✅ Batch 1 ready. You can now create the pipeline.")
print(f"👉 Source Path for Pipeline: {WORKSHOP_SOURCE_PATH}")

## 3.2. Pipeline Definition

In the cell below, you will define the entire pipeline using **Delta Live Tables (DLT)** syntax.

### Requirements:

1.  **Bronze Layer (Ingestion)**
    *   Use `spark.readStream.format("cloudFiles")` (Auto Loader).
    *   Load `Customers`, `Product`, and `SalesOrderDetail` from the source path.
    *   Format is `csv`.

2.  **Silver Layer (Quality & Cleaning)**
    *   **Silver Customers**: Expect `EmailAddress IS NOT NULL`.
    *   **Silver Products**: Expect `Name IS NOT NULL`.
    *   **Silver Sales**: Expect `OrderQty > 0`.

3.  **Gold Layer (Dimensional Modeling)**
    *   **Dim Product (SCD Type 1)**: Use `dlt.apply_changes`. Keys: `ProductID`. Sequence: `ModifiedDate`.
    *   **Dim Customer (SCD Type 2)**: Use `dlt.apply_changes`. Keys: `CustomerID`. Sequence: `ModifiedDate`.
    *   **Fact Sales**: Join `silver_sales` with `dim_product` (current) to enrich with product details.
    *   *(Note: SalesOrderDetail does not contain CustomerID, so we will focus on Product analysis for this workshop)*
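
The Silver rules above are row-level predicates. In DLT, `@dlt.expect` keeps violating rows and only records the violation count, whereas `@dlt.expect_or_drop` removes them. The plain-Python sketch below imitates those two behaviors for the `OrderQty > 0` rule; the sample rows and the helper names `expect` and `expect_or_drop` are illustrative stand-ins, not the DLT implementation.

```python
# Hypothetical sample rows; only OrderQty matters for this rule.
rows = [
    {"id": 1, "OrderQty": 2},
    {"id": 2, "OrderQty": 0},
    {"id": 3, "OrderQty": 5},
]

def order_qty_positive(row):
    """Predicate equivalent to the SQL constraint OrderQty > 0."""
    return row["OrderQty"] > 0

def expect(rows, predicate):
    """Warn-style rule: keep every row, report how many violated it."""
    failed = sum(1 for r in rows if not predicate(r))
    return list(rows), failed

def expect_or_drop(rows, predicate):
    """Drop-style rule: keep only passing rows, report how many were dropped."""
    kept = [r for r in rows if predicate(r)]
    return kept, len(rows) - len(kept)

warned_rows, failed_count = expect(rows, order_qty_positive)
kept_rows, dropped_count = expect_or_drop(rows, order_qty_positive)
```

Choosing between the two decides whether the Silver tables contain the violating rows or only report on them.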

### Hint: SCD Type 2 Logic
```python
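import dlt
from pyspark.sql.functions import col

# The target table must exist before apply_changes can write to it
dlt.create_streaming_table("dim_customer")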

dlt.apply_changes(
  target = "dim_customer",
  source = "silver_customers",
  keys = ["CustomerID"],
  sequence_by = col("ModifiedDate"),
  stored_as_scd_type = 2
)
```

In [None]:
import dlt
from pyspark.sql.functions import col  # add further functions here as you need them

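# Path to the workshop source data. It must match WORKSHOP_SOURCE_PATH from the setup cell.
# `volume_path` is defined in 00_setup; if it is not available when the pipeline runs,
# paste the literal path printed by the setup cell instead.
source_path = f"{volume_path}/lakeflow_workshop_source"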

# ==========================================
# BRONZE LAYER
# ==========================================

# TODO: Define bronze_customers
# @dlt.table
# def bronze_customers():
#   return (
#     spark.readStream.format("cloudFiles")
#       .option("cloudFiles.format", "csv")
#       .option("header", "true")  # the batch files are written with a header row
#       .load(f"{source_path}/Customers")
#   )

# TODO: Define bronze_products

# TODO: Define bronze_sales (SalesOrderDetail)


# ==========================================
# SILVER LAYER
# ==========================================

# TODO: Define silver_customers (Expect EmailAddress IS NOT NULL)

# TODO: Define silver_products (Expect Name IS NOT NULL)

# TODO: Define silver_sales (Expect OrderQty > 0)


# ==========================================
# GOLD LAYER
# ==========================================

# TODO: Create dim_product (SCD Type 1)
# dlt.create_streaming_table("dim_product")
# dlt.apply_changes(...)

# TODO: Create dim_customer (SCD Type 2)
# dlt.create_streaming_table("dim_customer")
# dlt.apply_changes(..., stored_as_scd_type = 2)

# TODO: Create fact_sales
# @dlt.table
# def fact_sales():
#     # Read silver_sales
#     # Join with dim_product (current records only)
#     pass


## 3.3. Deployment

To run this code, you must create a Pipeline resource.

1.  **Navigate**: Click **Workflows** in the sidebar, then **Delta Live Tables**.
2.  **Create Pipeline**:
    *   **Pipeline Name**: `Lakeflow_Workshop_YourName`
    *   **Product Edition**: `Advanced` (required for data quality expectations; `apply_changes` needs `Pro` or higher).
    *   **Source Code**: Select **this notebook**.
    *   **Destination**: Unity Catalog (`catalog.schema`).
    *   **Channel**: `Current`.
3.  **Start**: Click **Start** to run the pipeline.
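
The same choices can also be written down as pipeline settings JSON. The sketch below builds such a document as a Python dictionary. The field names follow the pipeline settings format as I understand it, and every value (notebook path, catalog, schema) is a placeholder, so compare it with the JSON view of your own pipeline before relying on it.

```python
import json

# All values are placeholders for illustration; replace them with your own.
pipeline_settings = {
    "name": "Lakeflow_Workshop_YourName",
    "edition": "ADVANCED",   # product edition chosen in the UI
    "channel": "CURRENT",    # runtime channel
    "catalog": "catalog",    # Unity Catalog catalog (placeholder)
    "target": "schema",      # destination schema (placeholder)
    "continuous": False,     # triggered run, started manually
    "libraries": [
        {"notebook": {"path": "/path/to/this_notebook"}}  # hypothetical path
    ],
}

settings_json = json.dumps(pipeline_settings, indent=2)
print(settings_json)
```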

### Observe:
*   The **Graph** visualization showing dependencies.
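*   **Data Quality** metrics (how many records failed each Silver expectation; records are dropped only if you used `expect_or_drop`).
*   **SCD Handling**: Check `dim_customer`. The `__START_AT` and `__END_AT` columns are added automatically, and rows where `__END_AT` is null are the current versions.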

## 3.4. Lakeflow Jobs (Orchestration)

Once your pipeline is ready, you can orchestrate it as part of a larger workflow.

1.  Go to **Workflows** -> **Jobs**.
2.  Create a new Job.
3.  **Task 1**: Type = **Pipeline**, Select your `Lakeflow_Workshop_YourName` pipeline.
4.  **Task 2**: Type = **Notebook**, Select a notebook that queries the Gold tables (e.g., for reporting).

This allows you to chain the ingestion/ETL pipeline with downstream consumers.
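
As a rough sketch of what that chaining looks like outside the UI, the dictionary below describes the two tasks in the shape used by the Jobs API, with the notebook task depending on the pipeline task. The job name, task keys, pipeline ID, and notebook path are all hypothetical placeholders.

```python
import json

# Placeholder identifiers; substitute the real pipeline ID and notebook path.
job_spec = {
    "name": "Lakeflow_Workshop_Job",
    "tasks": [
        {
            "task_key": "run_pipeline",
            "pipeline_task": {"pipeline_id": "<your-pipeline-id>"},
        },
        {
            "task_key": "gold_report",
            # Runs only after the pipeline task has finished successfully
            "depends_on": [{"task_key": "run_pipeline"}],
            "notebook_task": {"notebook_path": "/path/to/report_notebook"},
        },
    ],
}

job_json = json.dumps(job_spec, indent=2)
print(job_json)
```

The `depends_on` entry is what turns two independent tasks into a chain: the reporting notebook starts only once the pipeline update has completed.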

In [None]:
# === SIMULATE DATA UPDATE (BATCH 2) ===
# Run this cell AFTER the pipeline has finished the first run.

print("Simulating data updates (Batch 2)...")

# Write Batch 2 to Source Path
cust_b2.write.mode("append").option("header", "true").csv(f"{WORKSHOP_SOURCE_PATH}/Customers")
prod_b2.write.mode("append").option("header", "true").csv(f"{WORKSHOP_SOURCE_PATH}/Product")
sales_b2.write.mode("append").option("header", "true").csv(f"{WORKSHOP_SOURCE_PATH}/SalesOrderDetail")

print("✅ Batch 2 arrived.")
print("👉 Now go back to your Pipeline and click 'Start' (Refresh) to process new data.")