# LAB 07: Lakeflow Declarative Pipeline

**Duration:** ~45 min | **Day:** 3 | **Difficulty:** Advanced
**After module:** M07: Medallion & Lakeflow Pipelines

> *"Build a Medallion pipeline: Bronze (streaming), Silver (validated), Gold (aggregated) using Lakeflow SQL declarations."*

This lab has two sections:
1. **Section 1: Workshop** — Build & run a complete pipeline in the Databricks UI
2. **Section 2: Practice** — Write and verify Lakeflow SQL declarations

## Setup

In [0]:
%run ../../setup/00_setup

In [0]:
# Lakeflow pipeline target schema (must match the Target Schema set in Step 2)
user_name = CATALOG.replace(f"{CATALOG_PREFIX}_", "")
user_schema = f"{user_name}_lakeflow"
print(f"Pipeline target schema: {user_schema}")

---

## Section 1: Workshop — Building the Pipeline

In this hands-on workshop, we create a complete Lakeflow pipeline from SQL files — uploading source code, configuring the pipeline in the Databricks UI, running it, and validating the results.

---

### SQL Files Overview

The pipeline source code is organized by medallion layer — each SQL file declares one table or view.

![SQL files organized by medallion layer](../../../assets/images/training_2026/day3/57bea06e969c45dab4de8bea9ec980b1.webp)

### Step 1: Upload SQL Files

**Option A: Via UI** — Workspace → Users → create `lakeflow_pipeline` folder → upload SQL files

**Option B: Via Git Folders** — Git Folders → Add Git Folder → clone training repository

### Step 2: Create Pipeline

1. **Workflows** → **Jobs & Pipelines** → **Create ETL Pipeline**
2. **Catalog:** `retailhub_<your_name>`
3. **Pipeline Name:** `lakeflow_pipeline_<your_name>`
4. **Target Schema:** `<your_name>_lakeflow`
5. **Source Code:** Add existing assets → choose pipeline root folder and source code folder

<img src="../../../assets/images/fab4fef8e72d4d5786ba818e6c2f73c5.png" width="800">

<img src="../../../assets/images/a4e21cb35d3b45f2ba184877139cdc48.png" width="800">

<img src="../../../assets/images/4e75e96544f142508c3bd8b0b4dbf446.png" width="800">

![../../../assets/images/training_2026/day3/dc9c3eea154e4cb29aee62e6edd5bcca.webp](../../../assets/images/training_2026/day3/dc9c3eea154e4cb29aee62e6edd5bcca.webp)

### Step 3: Configure Variables

**Configuration** → **Add configuration**:

| Key | Value |
|-----|-------|
| `customer_path` | `/Volumes/<your catalog>/default/datasets/customers` |
| `order_path` | `/Volumes/<your catalog>/default/datasets/orders` |
| `product_path` | `/Volumes/<your catalog>/default/datasets/products/products.parquet/` |

Open the pipeline settings and go to Pipeline Configuration:

<img src="../../../assets/images/b2fa841a52004939ac2679f9edef8dc3.png" width="800">

Add the configuration:

<img src="../../../assets/images/dc31d5dec7444c1d959b8e1f220c10ce.png" width="800">

<img src="../../../assets/images/6f16ad4f7fd941ebbb4fb286dbb8fbfd.png" width="800">
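The keys above only matter once the pipeline SQL refers to them. As a rough sketch, assuming the lab's SQL files reference configuration values with `${key}` placeholders, the substitution can be imitated locally with Python's `string.Template`. The catalog name `retailhub_example` and the declaration text are hypothetical and stand in for your own values.

```python
from string import Template

# Hypothetical catalog name; replace with your own catalog.
catalog = "retailhub_example"

# Keys mirror the configuration table in Step 3.
pipeline_config = {
    "customer_path": f"/Volumes/{catalog}/default/datasets/customers",
    "order_path": f"/Volumes/{catalog}/default/datasets/orders",
    "product_path": f"/Volumes/{catalog}/default/datasets/products/products.parquet/",
}

# Illustrative declaration only; the lab's real SQL files may differ.
declaration = Template(
    "CREATE OR REFRESH STREAMING TABLE bronze_orders AS "
    "SELECT * FROM STREAM read_files('${order_path}/stream', format => 'json');"
)

# Local imitation of the placeholder substitution, handy for spotting typos in keys.
resolved_sql = declaration.substitute(pipeline_config)
print(resolved_sql)
```

A misspelled key raises `KeyError` here, which is a cheap way to catch a mismatch between the configuration table and the placeholders before starting the pipeline.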

You should also see a DAG diagram built from the Spark Declarative Pipelines definitions.

![Pipeline DAG diagram](../../../assets/images/training_2026/day3/05f00c9fedb54b0b81ec65bf182a92af.webp)

### Step 4: Run the Pipeline

Start the pipeline and test incremental processing by adding new data files.

![../../../assets/images/training_2026/day3/8d06de8a2a674cc1bc119b5d91b2d1ce.webp](../../../assets/images/training_2026/day3/8d06de8a2a674cc1bc119b5d91b2d1ce.webp)

1. Add a new file to the `orders/stream/` folder from `/Volumes/retailhub_trener/default/datasets/demo/ingestion/orders/stream/`.
2. Run the pipeline again.
3. Check the Event Log: only the new files should be processed.

In [0]:
dbutils.fs.mv("/Volumes/retailhub_trener/default/datasets/demo/ingestion/orders/stream/", "/Volumes/retailhub_trener/default/datasets/orders/stream/", recurse=True)
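To build intuition for why the second run should touch only the new files, here is a toy model in plain Python. It is not how the pipeline tracks files internally; it only assumes that some checkpoint state remembers which files were already ingested. The file names are made up.

```python
def run_update(available_files, processed_files):
    """Return the files this run would pick up and record them as processed."""
    new_files = sorted(set(available_files) - processed_files)
    processed_files.update(new_files)
    return new_files

# Stands in for the pipeline's checkpoint state (an assumption of this sketch).
processed = set()

# First run: every file in the folder is new.
first_run = run_update(["orders_001.json", "orders_002.json"], processed)

# A new file arrives, then the pipeline runs again.
second_run = run_update(
    ["orders_001.json", "orders_002.json", "orders_003.json"], processed
)

print(first_run)
print(second_run)
```

In the real Event Log the equivalent signal is a second update whose row counts reflect only the newly added files.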

### Step 5: Verify Results

In [0]:
# Check fact_sales with joins to dimensions
display(spark.sql(f"""
    SELECT 
        f.order_id,
        c.first_name || ' ' || c.last_name AS customer_name,
        p.product_name,
        d.date,
        f.quantity,
        f.net_amount
    FROM {CATALOG}.{user_schema}.fact_sales f
    LEFT JOIN {CATALOG}.{user_schema}.dim_customer c ON f.customer_key = c.customer_key
    LEFT JOIN {CATALOG}.{user_schema}.dim_product p ON f.product_key = p.product_key
    LEFT JOIN {CATALOG}.{user_schema}.dim_date d ON f.order_date_key = d.date_key
    LIMIT 10
"""))

In [0]:
# Find customers with change history
display(spark.sql(f"""
    SELECT 
        customer_id, first_name, city,
        __START_AT, __END_AT,
        CASE WHEN __END_AT IS NULL THEN 'Current' ELSE 'Historical' END AS status
    FROM {CATALOG}.{user_schema}.silver_customers
    WHERE customer_id IN (
        SELECT customer_id 
        FROM {CATALOG}.{user_schema}.silver_customers 
        GROUP BY customer_id HAVING COUNT(*) > 1
    )
    ORDER BY customer_id, __START_AT
"""))

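The `__START_AT` and `__END_AT` columns in the query above describe validity intervals. The sketch below shows one way to read them, assuming half-open intervals (a version is valid from `__START_AT` up to, but not including, `__END_AT`) and date-typed values. The rows are invented for illustration and the helper `version_as_of` is hypothetical.

```python
from datetime import date

# Made-up history rows shaped like the query result above.
history = [
    {"customer_id": 101, "city": "City A",
     "__START_AT": date(2024, 1, 1), "__END_AT": date(2024, 6, 1)},
    {"customer_id": 101, "city": "City B",
     "__START_AT": date(2024, 6, 1), "__END_AT": None},
]

def version_as_of(rows, customer_id, as_of):
    """Return the row that was valid for customer_id on the as_of date, if any."""
    for row in rows:
        if row["customer_id"] != customer_id:
            continue
        started = row["__START_AT"] <= as_of
        not_ended = row["__END_AT"] is None or as_of < row["__END_AT"]
        if started and not_ended:
            return row
    return None

# The current version is the one whose __END_AT is NULL (None here).
current = [row for row in history if row["__END_AT"] is None]

print(version_as_of(history, 101, date(2024, 3, 15))["city"])
print(current[0]["city"])
```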
### Monitoring and Troubleshooting

Common issues encountered when running Lakeflow pipelines and how to resolve them.

| Issue | Cause | Solution |
|---------|-----------|-------------|
| Pipeline hangs | Cluster too small | Increase min workers |
| Missing data | Rows removed by an expectation with `ON VIOLATION DROP ROW` | Check the Data Quality tab |
| Schema mismatch | Source schema changed | Run a full refresh |

---

## Section 2: Practice — Lakeflow SQL Declarations

Write and verify Lakeflow SQL syntax for each medallion layer.

---
## Task 1: Write Bronze Declaration

Complete the SQL below to create a Bronze streaming table from JSON files.

**This SQL would go in a pipeline SQL file.** Here we practice the syntax.

In [0]:
# Practice: what the Bronze SQL declaration looks like
# (This won't execute outside of a Lakeflow pipeline, but verify the syntax)

bronze_sql = f"""
-- TODO: Complete the Bronze declaration
CREATE OR REFRESH ________ TABLE bronze_orders
AS SELECT * 
FROM STREAM ________('{DATASET_PATH}/orders/stream', format => 'json');
"""

print("Bronze SQL declaration:")
print(bronze_sql)

In [0]:
# -- Validation --
assert "STREAMING" in bronze_sql.upper(), "Should use STREAMING TABLE"
assert "read_files" in bronze_sql.lower(), "Should use read_files()"
print("Task 1 OK: Bronze declaration syntax correct")
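The keyword checks above do not say how many placeholders are still open. A small optional helper, sketched here under the assumption that every placeholder in this lab is a run of at least four underscores, counts the blanks left in any of the SQL strings. The name `count_blanks` is hypothetical and not part of the lab setup.

```python
import re

# A placeholder is assumed to be four or more consecutive underscores,
# so identifiers such as bronze_orders or __START_AT are not counted.
BLANK_PATTERN = re.compile(r"_{4,}")

def count_blanks(sql: str) -> int:
    """Count the unfilled placeholders remaining in a SQL template string."""
    return len(BLANK_PATTERN.findall(sql))

template = "CREATE OR REFRESH ________ ________ gold_daily_revenue"
print(count_blanks(template))
```

Calling `count_blanks(bronze_sql)` before the assertions gives a quicker hint than a failed keyword check.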

---
## Task 2: Write Silver Declaration with Expectations

Complete the Silver layer with data quality constraints.

In [0]:
# TODO: Complete Silver SQL with expectations
silver_sql = """
CREATE OR REFRESH STREAMING TABLE silver_orders (
    CONSTRAINT valid_id EXPECT (order_id IS NOT NULL) ON VIOLATION ________,
    CONSTRAINT positive_amount EXPECT (total_price > 0) ON VIOLATION ________
)
AS SELECT 
    order_id,
    customer_id,
    product_id,
    CAST(quantity AS INT) AS quantity,
    CAST(total_price AS DOUBLE) AS total_price,
    CAST(order_date AS DATE) AS order_date,
    payment_method,
    store_id,
    current_timestamp() AS processed_at
FROM STREAM(bronze_orders);
"""

print("Silver SQL declaration:")
print(silver_sql)

In [0]:
# -- Validation --
assert "CONSTRAINT" in silver_sql.upper(), "Should have CONSTRAINT declarations"
assert "DROP ROW" in silver_sql.upper(), "Should use ON VIOLATION DROP ROW"
assert "bronze" in silver_sql.lower(), "Should reference bronze_orders"
print("Task 2 OK: Silver declaration with expectations correct")
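To see what the violation action changes, the following toy model applies one expectation to a handful of invented rows. It is a simplification in plain Python, not the pipeline's implementation: with no action the rows are kept and violations are only counted, `DROP ROW` removes failing rows, and `FAIL UPDATE` stops the run.

```python
def apply_expectation(rows, predicate, on_violation=None):
    """Toy model of one expectation; returns (kept_rows, failed_count)."""
    failed = [row for row in rows if not predicate(row)]
    if on_violation == "FAIL UPDATE" and failed:
        raise ValueError(f"{len(failed)} record(s) violated the expectation")
    if on_violation == "DROP ROW":
        kept = [row for row in rows if predicate(row)]
    else:
        kept = list(rows)  # default: keep every row, only count violations
    return kept, len(failed)

def valid_id(row):
    """Python stand-in for the SQL condition order_id IS NOT NULL."""
    return row["order_id"] is not None

# Invented sample rows; the second one violates valid_id.
sample_orders = [
    {"order_id": 1, "total_price": 25.0},
    {"order_id": None, "total_price": 10.0},
    {"order_id": 3, "total_price": -5.0},
]

kept_default, failed_default = apply_expectation(sample_orders, valid_id)
kept_drop, failed_drop = apply_expectation(sample_orders, valid_id, "DROP ROW")

print(len(kept_default), failed_default)
print(len(kept_drop), failed_drop)
```

In all three modes the violation count is recorded, which is why the Data Quality metrics are populated even when no rows are removed.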

---
## Task 3: Write Gold Declaration

Create a Materialized View for daily revenue summary.

In [0]:
# TODO: Complete Gold declaration
gold_sql = """
CREATE OR REFRESH ________ ________ gold_daily_revenue
AS SELECT 
    order_date,
    SUM(total_price) AS total_revenue,
    COUNT(*) AS total_orders,
    AVG(total_price) AS avg_order_value
FROM silver_orders
GROUP BY order_date
ORDER BY order_date;
"""

print("Gold SQL declaration:")
print(gold_sql)

In [0]:
# -- Validation --
assert "MATERIALIZED VIEW" in gold_sql.upper(), "Gold should use MATERIALIZED VIEW"
assert "silver" in gold_sql.lower(), "Should reference silver_orders"
print("Task 3 OK: Gold Materialized View declaration correct")
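When checking the Gold output later, it helps to know what the aggregation should return for a few rows. The sketch below reproduces the `GROUP BY order_date` logic in plain Python on invented sample data; it is a cross-check for small inputs, not a replacement for the SQL.

```python
from collections import defaultdict

# Invented rows shaped like silver_orders (only the columns the query uses).
orders = [
    {"order_date": "2024-01-01", "total_price": 10.0},
    {"order_date": "2024-01-01", "total_price": 30.0},
    {"order_date": "2024-01-02", "total_price": 5.0},
]

def daily_revenue(rows):
    """Aggregate revenue, order count, and average order value per order_date."""
    prices_by_date = defaultdict(list)
    for row in rows:
        prices_by_date[row["order_date"]].append(row["total_price"])
    return [
        {
            "order_date": order_date,
            "total_revenue": sum(prices),
            "total_orders": len(prices),
            "avg_order_value": sum(prices) / len(prices),
        }
        for order_date, prices in sorted(prices_by_date.items())
    ]

for summary in daily_revenue(orders):
    print(summary)
```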

---
## Task 4: Compare STREAMING TABLE vs MATERIALIZED VIEW

Fill in the comparison table (markdown exercise).

| Feature | STREAMING TABLE | MATERIALIZED VIEW |
|---------|----------------|-------------------|
| Processing mode | ________ | ________ |
| Best for | ________ | ________ |
| Read from source | `STREAM(table_name)` | `table_name` |
| Supports expectations | Yes | Yes |

---
## Task 5: Verify Pipeline Results (after running the pipeline)

After creating and running the pipeline via the UI, query the results here.

In [0]:
# TODO: After running the pipeline, uncomment and verify:

# display(spark.sql(f"SELECT COUNT(*) as cnt FROM {CATALOG}.bronze.bronze_orders"))
# display(spark.sql(f"SELECT COUNT(*) as cnt FROM {CATALOG}.silver.silver_orders"))
# display(spark.sql(f"SELECT * FROM {CATALOG}.gold.gold_daily_revenue ORDER BY order_date"))

---
## Task 6: Check Pipeline Event Log

Query the pipeline event log to see data quality metrics.

In [0]:
# TODO: After running the pipeline, uncomment to check the event log:
# (the f-prefix is required so that {CATALOG} is substituted)

# display(spark.sql(f"""
#     SELECT timestamp, details:flow_progress:data_quality:expectations
#     FROM event_log(TABLE({CATALOG}.silver.silver_orders))
#     WHERE details:flow_progress:data_quality IS NOT NULL
# """))
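The `expectations` value returned by the event log query is a JSON array. As a sketch, assuming each entry carries `name`, `dataset`, `passed_records`, and `failed_records`, the pass rate per expectation can be computed as below. The counts are invented and the helper `pass_rates` is hypothetical.

```python
import json

# Invented payload shaped like one expectations value from the event log.
sample_expectations = json.dumps([
    {"name": "valid_id", "dataset": "silver_orders",
     "passed_records": 980, "failed_records": 20},
    {"name": "positive_amount", "dataset": "silver_orders",
     "passed_records": 950, "failed_records": 50},
])

def pass_rates(expectations_json):
    """Map each expectation name to its share of passing records."""
    rates = {}
    for entry in json.loads(expectations_json):
        total = entry["passed_records"] + entry["failed_records"]
        # Avoid dividing by zero when an update processed no records.
        rates[entry["name"]] = entry["passed_records"] / total if total else None
    return rates

for name, rate in pass_rates(sample_expectations).items():
    print(name, rate)
```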

---
## Lab Complete!

You have:
- Written Bronze STREAMING TABLE declarations
- Written Silver declarations with data quality expectations
- Written Gold MATERIALIZED VIEW declarations
- Compared STREAMING TABLE (ST) and MATERIALIZED VIEW (MV)
- (If pipeline ran) Verified results and checked data quality metrics

> **Exam Tip:** In Spark Declarative Pipelines, tables within the same pipeline reference each other directly by name — no prefix needed. Use `STREAM(table_name)` for streaming reads and just `table_name` for batch reads.

> **Next:** LAB 08 - Lakeflow Jobs & Orchestration

## Cleanup (Optional)

In [0]:
# Pipeline cleanup is done via Lakeflow UI (delete the pipeline)
print("LAB 07 complete. Delete the pipeline from Lakeflow UI when done.")