# M07: Medallion Architecture & Lakeflow


The Medallion architecture (Bronze → Silver → Gold) is the standard for organizing data in the Lakehouse. We'll explore declarative Lakeflow pipelines: STREAMING TABLE, MATERIALIZED VIEW, constraints (Expectations), and the FLOW mechanism. We'll build a complete pipeline in SQL and PySpark, including SCD Type 1 and Type 2 handling via AUTO CDC.

| Exam Domain | Weight |
|---|---|
| Production Pipelines | 13% |
| Incremental Data Processing | 20% |

---

## SCD Type 1 & Type 2

Slowly Changing Dimensions (SCD) define strategies for handling changes in dimensional data over time. Type 1 overwrites old values, while Type 2 preserves full history with validity timestamps.

---

### What is SCD?

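**Slowly Changing Dimension (SCD)**: a strategy for handling changes in dimensional data. The table below uses one running example: a customer whose `city` changes from Warsaw to Krakow.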

| Type | Strategy | Result |
|-----|-----------|----------|
| **SCD Type 0** | Retain original | Always Warsaw |
| **SCD Type 1** | Overwrite | Only Krakow |
| **SCD Type 2** | Track history | Both records with dates |
| **SCD Type 3** | Add column | `current_city=Krakow`, `previous_city=Warsaw` |
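
The four strategies in the table can be compared side by side with a small plain-Python sketch. It is only an illustration of the semantics, not how Lakeflow implements them; the function name `apply_scd`, the `start_at`/`end_at` field names, and the dates are assumptions made for this example.

```python
from datetime import date


def apply_scd(scd_type, row, new_city, valid_from, changed_on):
    """Return the dimension rows that exist after one city change."""
    if scd_type == 0:  # retain original: the change is ignored
        return [dict(row)]
    if scd_type == 1:  # overwrite: the old value is lost
        return [{**row, "city": new_city}]
    if scd_type == 2:  # track history: close the old row, open a new one
        closed = {**row, "start_at": valid_from, "end_at": changed_on}
        opened = {**row, "city": new_city, "start_at": changed_on, "end_at": None}
        return [closed, opened]
    if scd_type == 3:  # add column: keep only the previous value
        rest = {k: v for k, v in row.items() if k != "city"}
        return [{**rest, "current_city": new_city, "previous_city": row["city"]}]
    raise ValueError(f"unsupported SCD type: {scd_type}")


original = {"customer_id": "C1", "city": "Warsaw"}
for scd_type in range(4):
    rows = apply_scd(scd_type, original, "Krakow", date(2024, 1, 1), date(2024, 6, 1))
    print(scd_type, rows)
```

Only Type 2 returns two rows, which is why it is the one that needs validity columns.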

### SCD Type 1 — Overwrite

- **No history** — old values are overwritten
- **Use cases:** Error corrections, non-historical data
- **Implementation:** `MERGE INTO ... WHEN MATCHED THEN UPDATE SET *`

In [0]:
%sql
-- SCD Type 1 in Lakeflow
CREATE FLOW silver_products_scd1_flow
AS AUTO CDC INTO silver_products
FROM STREAM bronze_products  -- AUTO CDC reads its source as a stream
KEYS (product_id)
SEQUENCE BY ingestion_ts
STORED AS SCD TYPE 1;  -- Overwrite without history

In [0]:
%sql
-- The same SCD Type 1 upsert written by hand with MERGE
MERGE INTO dim_customer t
USING source_customers s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

### SCD Type 2 — Track History

- **Full history** of changes with `__START_AT`, `__END_AT`
- **Use cases:** Audit, historical analysis, compliance
- **Current record:** `__END_AT IS NULL`
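
To make the role of `__START_AT` and `__END_AT` concrete, here is a minimal plain-Python sketch of the two typical reads against an SCD Type 2 table: the current version and a point-in-time lookup. The sample rows, the helper names `current_rows` and `as_of`, and the half-open interval convention (`__START_AT <= ts < __END_AT`) are assumptions for illustration.

```python
from datetime import datetime

# Hypothetical SCD2 history for one customer who moved from Warsaw to Krakow
history = [
    {"customer_id": "C1", "city": "Warsaw",
     "__START_AT": datetime(2024, 1, 1), "__END_AT": datetime(2024, 6, 1)},
    {"customer_id": "C1", "city": "Krakow",
     "__START_AT": datetime(2024, 6, 1), "__END_AT": None},
]


def current_rows(rows):
    """Equivalent of WHERE __END_AT IS NULL."""
    return [r for r in rows if r["__END_AT"] is None]


def as_of(rows, ts):
    """Rows that were valid at ts; an open __END_AT means 'still valid'."""
    return [
        r for r in rows
        if r["__START_AT"] <= ts and (r["__END_AT"] is None or ts < r["__END_AT"])
    ]


print([r["city"] for r in current_rows(history)])
print([r["city"] for r in as_of(history, datetime(2024, 3, 15))])
```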

### Demo: SCD Type 2 with AUTO CDC

In [0]:
%sql
-- Creating target table SCD2
CREATE OR REFRESH STREAMING TABLE silver_customers (
  customer_id        STRING,
  first_name         STRING,
  last_name          STRING,
  city               STRING,
  -- SCD2 columns: declare them when you provide an explicit schema (same type as the SEQUENCE BY column):
  __START_AT         TIMESTAMP,
  __END_AT           TIMESTAMP
);

-- Flow with AUTO CDC for SCD2
CREATE FLOW silver_customers_scd2_flow
AS AUTO CDC INTO silver_customers
FROM STREAM bronze_customers
KEYS (customer_id)         -- Business key
SEQUENCE BY ingestion_ts   -- Column determining the order
STORED AS SCD TYPE 2;      -- SCD Type

-- Key elements:
--   KEYS                    columns identifying the business record
--   SEQUENCE BY             column determining the order of changes
--   STORED AS SCD TYPE 1|2  SCD type

---

## Lakeflow Pipelines — Declarations

Lakeflow (formerly Delta Live Tables) enables a declarative approach to building data pipelines — you define the desired outcome rather than step-by-step logic. This section covers table types, expectations, FLOW declarations, and both SQL and PySpark syntax.

---

### What is Lakeflow?

**Lakeflow** (formerly Delta Live Tables) — declarative framework for data pipelines.

| Approach | Example | You describe |
|-----------|----------|-------------|
| **Imperative** | `df.write.mode("overwrite")...` | HOW |
| **Declarative** | `CREATE TABLE AS SELECT...` | WHAT |

Key benefits: automatic dependencies, built-in data quality, unified batch/streaming, automatic recovery, lineage & monitoring.

### Table Types in Lakeflow

| Type | Usage | Processing |
|-----|--------|-----------|
| **STREAMING TABLE** | Append-only ingestion | Incremental (new rows only) |
| **MATERIALIZED VIEW** | Aggregations, Gold layer | Full recomputation |
| **VIEW** | Intermediate logic | Not materialized |

> **Exam:** STREAMING TABLE = incremental (new data only). MATERIALIZED VIEW = full recompute each refresh.

**STREAMING TABLE vs MATERIALIZED VIEW (Exam Topic):**

| Feature | STREAMING TABLE | MATERIALIZED VIEW |
|---------|----------------|-------------------|
| Data Source | Streaming (append-only) | Batch or streaming |
| Processing | Incremental (new rows only) | Full recomputation |
| Updates | Append-only | Full refresh |
| Use Case | Bronze/Silver layers | Gold aggregations |
| Supports AUTO CDC | Yes | No |
| Query Pattern | `STREAM(source)` | Regular `SELECT` |
| Idempotent | Yes (checkpoints) | Yes (full refresh) |

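**Key Exam Point:** STREAMING TABLEs process only NEW data (incremental). MATERIALIZED VIEWs return the result of recomputing the FULL query on each refresh; Databricks may apply a refresh incrementally where it can, but the result always matches a full recompute.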
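
The difference in processing model can be sketched in plain Python. This is a conceptual toy, not Lakeflow code: `StreamingTableSketch`, its integer `checkpoint`, and `materialized_view_refresh` are hypothetical names, and the source is modeled as an append-only list.

```python
class StreamingTableSketch:
    """Remembers how far it has read and only appends rows it has not seen."""

    def __init__(self):
        self.rows = []
        self.checkpoint = 0  # number of source rows already processed

    def refresh(self, source):
        new_rows = source[self.checkpoint:]
        self.rows.extend(new_rows)
        self.checkpoint = len(source)
        return len(new_rows)  # rows processed in this update


def materialized_view_refresh(source):
    """Recomputes the aggregate from the whole source on every call."""
    totals = {}
    for row in source:
        totals[row["customer_id"]] = totals.get(row["customer_id"], 0) + row["amount"]
    return totals


source = [{"customer_id": "C1", "amount": 10}, {"customer_id": "C2", "amount": 5}]
table = StreamingTableSketch()
print(table.refresh(source))              # first update reads everything
source.append({"customer_id": "C1", "amount": 7})
print(table.refresh(source))              # second update reads only the new row
print(materialized_view_refresh(source))  # aggregate over all rows
```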

### Demo: STREAMING TABLE with Constraints

Constraint actions: **`DROP ROW`** (remove invalid), **`FAIL UPDATE`** (fail pipeline), or no action (log only).

In [0]:
%sql
CREATE OR REFRESH STREAMING TABLE silver_orders
(
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL)
    ON VIOLATION DROP ROW,
  CONSTRAINT valid_customer EXPECT (customer_id IS NOT NULL)
    ON VIOLATION DROP ROW,
  CONSTRAINT valid_quantity EXPECT (quantity > 0)
    ON VIOLATION DROP ROW,
  CONSTRAINT valid_price EXPECT (unit_price >= 0)
    ON VIOLATION FAIL UPDATE
)
AS
SELECT
  order_id,
  customer_id,
  product_id,
  CAST(order_datetime AS TIMESTAMP) AS order_ts,
  quantity,
  unit_price,
  (quantity * unit_price) AS gross_amount
FROM STREAM(bronze_orders);
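
The three violation behaviors (log only, `DROP ROW`, `FAIL UPDATE`) can be sketched in plain Python to show how they differ. This is a simplified model under stated assumptions: the function `apply_expectations`, the action labels `"warn"`, `"drop"`, `"fail"`, and the sample orders are all made up for illustration.

```python
class ExpectationFailed(Exception):
    """Raised when a 'fail' expectation is violated (models FAIL UPDATE)."""


def apply_expectations(rows, expectations):
    """expectations: list of (name, predicate, action) tuples."""
    kept = []
    violations = {name: 0 for name, _, _ in expectations}
    for row in rows:
        keep = True
        for name, predicate, action in expectations:
            if predicate(row):
                continue
            violations[name] += 1
            if action == "fail":
                raise ExpectationFailed(name)  # the whole update stops
            if action == "drop":
                keep = False                   # row is excluded from the target
            # "warn": row is kept, only the violation count is recorded
        if keep:
            kept.append(row)
    return kept, violations


orders = [
    {"order_id": "O1", "quantity": 2, "unit_price": 10.0},
    {"order_id": None, "quantity": 1, "unit_price": 5.0},
    {"order_id": "O3", "quantity": 0, "unit_price": 5.0},
    {"order_id": "O4", "quantity": 1, "unit_price": -1.0},
]
rules = [
    ("valid_order_id", lambda r: r["order_id"] is not None, "drop"),
    ("valid_quantity", lambda r: r["quantity"] > 0, "drop"),
    ("valid_price", lambda r: r["unit_price"] >= 0, "warn"),
]
kept, violations = apply_expectations(orders, rules)
print([r["order_id"] for r in kept], violations)
```

Switching `valid_price` to `"fail"` makes the same input raise instead of returning, which mirrors the choice between logging a quality issue and blocking the update.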

### Demo: MATERIALIZED VIEW (Gold)

In [0]:
%sql
-- Dimension - current snapshot from SCD2
CREATE OR REFRESH MATERIALIZED VIEW dim_customer
AS
SELECT
  customer_id,
  first_name,
  last_name,
  email,
  city,
  customer_segment
FROM silver_customers
WHERE __END_AT IS NULL;

-- Date Dimension
CREATE OR REFRESH MATERIALIZED VIEW dim_date
AS
SELECT DISTINCT
  CAST(date_format(order_date, 'yyyyMMdd') AS INT) AS date_key,
  order_date AS date,
  year(order_date) AS year,
  quarter(order_date) AS quarter,
  month(order_date) AS month
FROM silver_orders;

-- Fact - streaming from Silver
CREATE OR REFRESH STREAMING TABLE fact_sales
AS
SELECT
  order_id,
  customer_id,
  product_id,
  order_date_key,
  quantity,
  gross_amount,
  net_amount
FROM STREAM(silver_orders);

### What is FLOW?

**FLOW** separates table definition from data source. Key capabilities:
1. **Multiple sources → one table** (e.g. backfill + streaming)
2. **CDC** with automatic SCD via `AUTO CDC`
3. **`INSERT INTO ONCE`** — one-time backfill
4. **`INSERT INTO`** — continuous streaming

In [0]:
%sql
-- 1. Define an empty target table
CREATE OR REFRESH STREAMING TABLE target_table;

-- 2. Define one or more FLOWs that populate it
CREATE FLOW flow_name
AS INSERT INTO target_table BY NAME
SELECT ... FROM STREAM(source);

### Demo: Backfill + Streaming Pattern

In [0]:
%sql
-- Target table
CREATE OR REFRESH STREAMING TABLE bronze_orders;

-- FLOW 1: One-time backfill
CREATE FLOW bronze_orders_backfill
AS 
INSERT INTO ONCE bronze_orders BY NAME
SELECT
  order_id,
  customer_id,
  product_id,
  order_datetime,
  'batch' AS source_system,
  _metadata.file_path AS source_file_path,
  current_timestamp() AS load_ts
FROM read_files(
  '${order_path}/orders_batch.json',
  format => 'json'
);

-- FLOW 2: Continuous streaming
CREATE FLOW bronze_orders_stream
AS 
INSERT INTO bronze_orders BY NAME
SELECT
  order_id,
  customer_id,
  'stream' AS source_system,
  _metadata.file_path AS source_file_path,
  current_timestamp() AS load_ts
FROM STREAM read_files(
  '${order_path}/stream/orders_stream_*.json',
  format => 'json'
);
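
The backfill-plus-streaming pattern above can be modeled in plain Python to show why two flows can safely share one target. The class `TargetTable`, its method names, and the file names are hypothetical; the sketch only captures "run once" versus "pick up new files on every update".

```python
class TargetTable:
    """One target populated by a one-time flow and a continuous flow."""

    def __init__(self):
        self.rows = []
        self.completed_once_flows = set()
        self.seen_files = set()

    def run_once_flow(self, flow_name, batch_rows):
        """Models INSERT INTO ONCE: later updates skip a completed flow."""
        if flow_name in self.completed_once_flows:
            return 0
        self.rows.extend(batch_rows)
        self.completed_once_flows.add(flow_name)
        return len(batch_rows)

    def run_append_flow(self, files):
        """Models the streaming INSERT INTO: only unseen files are read."""
        added = 0
        for file_name in sorted(files):
            if file_name in self.seen_files:
                continue
            self.rows.extend(files[file_name])
            self.seen_files.add(file_name)
            added += len(files[file_name])
        return added


target = TargetTable()
backfill = [{"order_id": "O1", "source_system": "batch"}]
stream_files = {"orders_stream_001.json": [{"order_id": "O2", "source_system": "stream"}]}

# First pipeline update: both flows contribute rows
print(target.run_once_flow("bronze_orders_backfill", backfill),
      target.run_append_flow(stream_files))

# Second update after a new file arrives: only that file is processed
stream_files["orders_stream_002.json"] = [{"order_id": "O3", "source_system": "stream"}]
print(target.run_once_flow("bronze_orders_backfill", backfill),
      target.run_append_flow(stream_files))
```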

### Demo: AUTO CDC for SCD Type 2

AUTO CDC matches incoming records to existing rows by `KEYS` and orders them by `SEQUENCE BY`. For SCD Type 2 it closes the previous version (sets `__END_AT`) and inserts a new one; for SCD Type 1 it overwrites the row in place.

In [0]:
%sql
-- SCD2 Table with schema
CREATE OR REFRESH STREAMING TABLE silver_customers (
  customer_id        STRING,
  first_name         STRING,
  last_name          STRING,
  email              STRING,
  city               STRING,
  __START_AT         TIMESTAMP,
  __END_AT           TIMESTAMP
);

-- AUTO CDC Flow
CREATE FLOW silver_customers_scd2_flow
AS AUTO CDC INTO silver_customers
FROM STREAM bronze_customers
KEYS (customer_id)
SEQUENCE BY ingestion_ts
STORED AS SCD TYPE 2;
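
The role of `KEYS` and `SEQUENCE BY` is easier to see in a small simulation. The sketch below is plain Python and deliberately simplified: the function `auto_cdc` is hypothetical, sequence values are plain integers, deletes are not modeled, and skipping an event whose attributes did not change is a simplifying assumption rather than a statement about the engine.

```python
def auto_cdc(events, key, sequence_by, scd_type):
    """Apply change events in sequence order and return the target rows."""
    ordered = sorted(events, key=lambda e: e[sequence_by])  # late arrivals are reordered

    if scd_type == 1:
        latest = {}
        for event in ordered:
            # Overwrite: only the newest version per key survives
            latest[event[key]] = {k: v for k, v in event.items() if k != sequence_by}
        return list(latest.values())

    history, open_rows = [], {}
    for event in ordered:
        attrs = {k: v for k, v in event.items() if k != sequence_by}
        previous = open_rows.get(event[key])
        if previous is not None:
            previous_attrs = {k: v for k, v in previous.items()
                              if k not in ("__START_AT", "__END_AT")}
            if previous_attrs == attrs:
                continue                                  # nothing changed
            previous["__END_AT"] = event[sequence_by]     # close the old version
        row = {**attrs, "__START_AT": event[sequence_by], "__END_AT": None}
        history.append(row)
        open_rows[event[key]] = row
    return history


# Events arrive out of order; ingestion_ts decides which one is newer
events = [
    {"customer_id": "C1", "city": "Krakow", "ingestion_ts": 2},
    {"customer_id": "C1", "city": "Warsaw", "ingestion_ts": 1},
]
print(auto_cdc(events, "customer_id", "ingestion_ts", scd_type=1))
print(auto_cdc(events, "customer_id", "ingestion_ts", scd_type=2))
```

Without the sequence column, the Warsaw event would be applied last and the table would end up with the stale city.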

### PySpark Declarations

Lakeflow pipelines can also be defined in Python using the `pyspark.pipelines` decorators instead of SQL.

In [0]:
from pyspark import pipelines as dp
from pyspark.sql.functions import col, current_timestamp

# STREAMING TABLE
@dp.table(
    name="bronze_customers",
    comment="Raw customers from CSV"
)
def bronze_customers():
    return (
        spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "csv")
            .option("header", "true")
            .load(spark.conf.get("customer_path"))
            .select(
                "*",
                col("_metadata.file_path").alias("source_file_path"),  # a plain string has no .alias()
                current_timestamp().alias("load_ts")
            )
    )

# MATERIALIZED VIEW
@dp.materialized_view(name="dim_customer")
def dim_customer():
    return (
        spark.read.table("silver_customers")
            .filter(col("__END_AT").isNull())
            .select("customer_id", "first_name", "last_name")
    )

### PySpark: Expectations

Decorators: **`@dp.expect`** (log only), **`@dp.expect_or_drop`** (drop record), **`@dp.expect_or_fail`** (fail pipeline).

In [0]:
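from pyspark import pipelines as dp
from pyspark.sql.functions import col

@dp.table(name="silver_orders")
@dp.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
@dp.expect_or_drop("valid_customer", "customer_id IS NOT NULL")
@dp.expect_or_drop("valid_quantity", "quantity > 0")
@dp.expect("valid_price", "unit_price >= 0")  # log only; @dp.expect_or_fail would mirror FAIL UPDATE
def silver_orders():
    return (
        spark.readStream.table("bronze_orders")
            .select(
                "order_id",
                "customer_id",
                "product_id",
                col("order_datetime").cast("timestamp").alias("order_ts"),
                # quantity and unit_price must be in the output so the expectations can reference them
                "quantity",
                "unit_price",
                (col("quantity") * col("unit_price")).alias("gross_amount")
            )
    )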

### PySpark: AUTO CDC

Python equivalent of SQL AUTO CDC — uses `dp.create_auto_cdc_flow()` to define SCD logic programmatically.

In [0]:
from pyspark import pipelines as dp

# Define the target table
dp.create_streaming_table(
    name="silver_customers",
    schema="""
        customer_id STRING,
        first_name STRING,
        city STRING,
        __START_AT TIMESTAMP,
        __END_AT TIMESTAMP
    """
)

# Define the CDC flow
dp.create_auto_cdc_flow(
    target="silver_customers",
    source="bronze_customers",
    keys=["customer_id"],
    sequence_by="ingestion_ts",
    stored_as_scd_type=2  # or 1
)

---

## Workshop — Building the Pipeline

In this hands-on demo, we create a complete Lakeflow pipeline from SQL files — uploading source code, configuring the pipeline in the Databricks UI, running it, and validating the results.

---

### SQL Files Overview

The pipeline source code is organized by medallion layer — each SQL file declares one table or view.

![image_1771356717743.png](./image_1771356717743.png "image_1771356717743.png")

### Step 1: Upload SQL Files

**Option A: Via UI** — Workspace → Users → create `lakeflow_pipeline` folder → upload SQL files

**Option B: Via Git Folders** — Git Folders → Add Git Folder → clone training repository

### Step 2: Create Pipeline

1. **Workflows** → **Jobs & Pipelines** → **Create ETL Pipeline**
2. **Catalog:** `retailhub_<your_name>`
3. **Pipeline Name:** `lakeflow_pipeline_<your_name>`
4. **Target Schema:** `<your_name>_lakeflow`
5. **Source Code:** Add existing assets → choose pipeline root folder and source code folder

<img src="../../../assets/images/fab4fef8e72d4d5786ba818e6c2f73c5.png" width="800">

<img src="../../../assets/images/a4e21cb35d3b45f2ba184877139cdc48.png" width="800">

<img src="../../../assets/images/4e75e96544f142508c3bd8b0b4dbf446.png" width="800">

![image_1771357001640.png](./image_1771357001640.png "image_1771357001640.png")



### Step 3: Configure Variables

**Configuration** → **Add configuration**:

| Key | Value |
|-----|-------|
| `customer_path` | `/Volumes/<your catalog>/default/datasets/customers` |
| `order_path` | `/Volumes/<your catalog>/default/datasets/orders` |
| `product_path` | `/Volumes/<your catalog>/default/datasets/products/products.parquet/` |
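
These keys are what the SQL files reference as `${order_path}` and what the Python example reads with `spark.conf.get("customer_path")`. The substitution itself is simple text replacement, which can be mimicked with the standard library as a rough mental model. The catalog name `my_catalog` below is a placeholder, and `string.Template` is used only for illustration; it is not what the pipeline runtime uses.

```python
from string import Template

# Hypothetical pipeline configuration (key/value pairs from the table above)
pipeline_config = {
    "order_path": "/Volumes/my_catalog/default/datasets/orders",
}

# Path expression as it appears in the streaming flow
path_template = Template("${order_path}/stream/orders_stream_*.json")

resolved = path_template.substitute(pipeline_config)
print(resolved)

# A key that is referenced but not configured is an error, not an empty string
try:
    Template("${product_path}").substitute(pipeline_config)
except KeyError as missing:
    print(f"missing configuration key: {missing}")
```

A typo in a key name therefore shows up as an unresolved path rather than as an empty table, which is worth checking first when a Bronze table fails to load.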

Open the pipeline settings and go to **Pipeline Configuration**:

<img src="../../../assets/images/b2fa841a52004939ac2679f9edef8dc3.png" width="800">

Add the configuration:

<img src="../../../assets/images/dc31d5dec7444c1d959b8e1f220c10ce.png" width="800">



<img src="../../../assets/images/6f16ad4f7fd941ebbb4fb286dbb8fbfd.png" width="800">


You should also see a DAG diagram built from the Spark Declarative Pipelines definitions.

![image_1771357068422.png](./image_1771357068422.png "image_1771357068422.png")


### Step 4: Run the Pipeline

Start the pipeline and test incremental processing by adding new data files.

![image_1771357652994.png](./image_1771357652994.png "image_1771357652994.png")

1. Add a new file to the `orders/stream/` folder
2. Run the pipeline again
3. Check the Event Log: only the new file should be processed
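
For step 1, a new test file can be generated with a few lines of Python. This is a sketch under several assumptions: the helper `write_stream_file` is hypothetical, the file is written as JSON Lines, the order fields are invented sample values, and a temporary directory stands in for the real `orders/stream/` folder in your volume.

```python
import json
import tempfile
from pathlib import Path


def write_stream_file(stream_dir, batch_no, orders):
    """Write one JSON Lines file named to match orders_stream_*.json."""
    stream_dir = Path(stream_dir)
    stream_dir.mkdir(parents=True, exist_ok=True)
    path = stream_dir / f"orders_stream_{batch_no:03d}.json"
    with path.open("w", encoding="utf-8") as handle:
        for order in orders:
            handle.write(json.dumps(order) + "\n")
    return path


# Stand-in for <order_path>/stream/ so the sketch runs anywhere
demo_dir = Path(tempfile.mkdtemp()) / "orders" / "stream"
sample_orders = [
    {"order_id": "O-1001", "customer_id": "C1", "product_id": "P1",
     "quantity": 1, "unit_price": 9.99},
]
new_file = write_stream_file(demo_dir, 4, sample_orders)
print(new_file.name)
```

Using a batch number that has not been used before matters: the streaming flow tracks files it has already read, so overwriting an existing file name is not a reliable way to trigger new processing.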

### Step 5: Verify Results

In [0]:
#Provide your user_schema
# user_schema is set via 00_setup (BRONZE_SCHEMA, SILVER_SCHEMA, GOLD_SCHEMA)
%run ../../setup/00_setup

In [0]:
# Check fact_sales with joins to dimensions
display(spark.sql(f"""
    SELECT 
        f.order_id,
        c.first_name || ' ' || c.last_name AS customer_name,
        p.product_name,
        d.date,
        f.quantity,
        f.net_amount
    FROM {CATALOG}.{user_schema}.fact_sales f
    LEFT JOIN {CATALOG}.{user_schema}.dim_customer c ON f.customer_id = c.customer_id
    LEFT JOIN {CATALOG}.{user_schema}.dim_product p ON f.product_id = p.product_id
    LEFT JOIN {CATALOG}.{user_schema}.dim_date d ON f.order_date_key = d.date_key
    LIMIT 10
"""))

In [0]:
# Find customers with change history
display(spark.sql(f"""
    SELECT 
        customer_id, first_name, city,
        __START_AT, __END_AT,
        CASE WHEN __END_AT IS NULL THEN 'Current' ELSE 'Historical' END AS status
    FROM {CATALOG}.{user_schema}.silver_customers
    WHERE customer_id IN (
        SELECT customer_id 
        FROM {CATALOG}.{user_schema}.silver_customers 
        GROUP BY customer_id HAVING COUNT(*) > 1
    )
    ORDER BY customer_id, __START_AT
"""))

### Monitoring and Troubleshooting

Common issues encountered when running Lakeflow pipelines and how to resolve them.

| Issue | Cause | Solution |
|---------|-----------|-------------|
| Pipeline hangs | Cluster too small | Increase min workers |
| Missing data | Constraint DROP ROW | Check Data Quality tab |
| Schema mismatch | Schema change | Full refresh |

## Deploying as a Lakeflow Job

The medallion pipeline can be deployed as a **multi-task Lakeflow Job** with individual notebook tasks.

**Simplified medallion notebooks** are in: `materials/medallion/`

| Layer | Notebook | Purpose |
|-------|----------|---------|
| Bronze | `bronze_customers.ipynb` | Batch CSV → Delta |
| Bronze | `bronze_orders.ipynb` | Batch JSON → Delta |
| Silver | `silver_customers.ipynb` | Dedup + normalize |
| Silver | `silver_orders_cleaned.ipynb` | Quality filters + computed columns |
| Gold | `gold_customer_orders_summary.ipynb` | Join + aggregate metrics |
| Gold | `gold_daily_orders.ipynb` | Daily order aggregation |
| Validate | `materials/orchestration/task_validate_pipeline.py` | Check all tables + event_log |

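**DAG Structure** (simplified: `gold_customer_orders_summary` also depends on `silver_orders_cleaned`, an edge the diagram does not draw):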
```
bronze_customers ──→ silver_customers ──────→ gold_customer_orders_summary ──→ validate
bronze_orders ────→ silver_orders_cleaned ──→ gold_daily_orders ─────────────↗
```
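
The execution order implied by the task dependencies can be derived with Python's standard `graphlib` module (Python 3.9+). The `depends_on` mapping below is copied from the `depends_on` entries of the job configuration in this section; grouping tasks into "waves" that could run in parallel is an illustration, not a description of how the Jobs scheduler works internally.

```python
from graphlib import TopologicalSorter

# task -> set of tasks it depends on
depends_on = {
    "bronze_customers": set(),
    "bronze_orders": set(),
    "silver_customers": {"bronze_customers"},
    "silver_orders_cleaned": {"bronze_orders"},
    "gold_customer_orders_summary": {"silver_customers", "silver_orders_cleaned"},
    "gold_daily_orders": {"silver_orders_cleaned"},
    "validate_pipeline": {"gold_customer_orders_summary", "gold_daily_orders"},
}

sorter = TopologicalSorter(depends_on)
sorter.prepare()  # raises graphlib.CycleError if the dependencies form a cycle

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
    waves.append(ready)
    sorter.done(*ready)

for number, tasks in enumerate(waves, start=1):
    print(number, tasks)
```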

---

In [0]:
# Job JSON Configuration — Medallion Pipeline
# Use this with Databricks Jobs REST API or Databricks CLI

job_config = {
    "name": "Medallion_Pipeline_Job",
    "tasks": [
        {
            "task_key": "bronze_customers",
            "notebook_task": {
                "notebook_path": "/Workspace/Users/<email>/materials/medallion/bronze_customers",
                "base_parameters": {
                    "catalog": "{{job.parameters.catalog}}",
                    "schema": "bronze",
                    "source_path": "/Volumes/{{job.parameters.catalog}}/default/landing/customers/"
                }
            },
            "new_cluster": {"spark_version": "15.4.x-scala2.12", "num_workers": 1}
        },
        {
            "task_key": "bronze_orders",
            "notebook_task": {
                "notebook_path": "/Workspace/Users/<email>/materials/medallion/bronze_orders",
                "base_parameters": {
                    "catalog": "{{job.parameters.catalog}}",
                    "schema": "bronze",
                    "source_path": "/Volumes/{{job.parameters.catalog}}/default/landing/orders/"
                }
            },
            "new_cluster": {"spark_version": "15.4.x-scala2.12", "num_workers": 1}
        },
        {
            "task_key": "silver_customers",
            "depends_on": [{"task_key": "bronze_customers"}],
            "notebook_task": {
                "notebook_path": "/Workspace/Users/<email>/materials/medallion/silver_customers",
                "base_parameters": {
                    "catalog": "{{job.parameters.catalog}}",
                    "schema_bronze": "bronze",
                    "schema_silver": "silver"
                }
            }
        },
        {
            "task_key": "silver_orders_cleaned",
            "depends_on": [{"task_key": "bronze_orders"}],
            "notebook_task": {
                "notebook_path": "/Workspace/Users/<email>/materials/medallion/silver_orders_cleaned",
                "base_parameters": {
                    "catalog": "{{job.parameters.catalog}}",
                    "schema_bronze": "bronze",
                    "schema_silver": "silver"
                }
            }
        },
        {
            "task_key": "gold_customer_orders_summary",
            "depends_on": [
                {"task_key": "silver_customers"},
                {"task_key": "silver_orders_cleaned"}
            ],
            "notebook_task": {
                "notebook_path": "/Workspace/Users/<email>/materials/medallion/gold_customer_orders_summary",
                "base_parameters": {
                    "catalog": "{{job.parameters.catalog}}",
                    "schema_silver": "silver",
                    "schema_gold": "gold"
                }
            }
        },
        {
            "task_key": "gold_daily_orders",
            "depends_on": [{"task_key": "silver_orders_cleaned"}],
            "notebook_task": {
                "notebook_path": "/Workspace/Users/<email>/materials/medallion/gold_daily_orders",
                "base_parameters": {
                    "catalog": "{{job.parameters.catalog}}",
                    "schema_silver": "silver",
                    "schema_gold": "gold"
                }
            }
        },
        {
            "task_key": "validate_pipeline",
            "depends_on": [
                {"task_key": "gold_customer_orders_summary"},
                {"task_key": "gold_daily_orders"}
            ],
            "notebook_task": {
                "notebook_path": "/Workspace/Users/<email>/materials/orchestration/task_validate_pipeline",
                "base_parameters": {
                    "catalog": "{{job.parameters.catalog}}",
                    "schema_bronze": "bronze",
                    "schema_silver": "silver",
                    "schema_gold": "gold",
                    "job_run_id": "{{run.id}}"
                }
            }
        }
    ],
    "parameters": [
        {"name": "catalog", "default": "<your_catalog>"}
    ],
    "trigger": {
        "file_arrival": {
            "url": "/Volumes/<catalog>/default/landing_zone/trigger",
            "min_time_between_triggers_seconds": 60,
            "wait_after_last_change_seconds": 15
        }
    },
    "max_concurrent_runs": 1,
    "timeout_seconds": 3600
}

import json
print(json.dumps(job_config, indent=2))

### Event Log Table

The validation task (`task_validate_pipeline`) logs results to a **pipeline_event_log** table for monitoring and auditing.

```sql
CREATE TABLE IF NOT EXISTS <catalog>.default.pipeline_event_log (
    event_id        STRING,
    event_timestamp TIMESTAMP,
    job_run_id      STRING,
    event_type      STRING,      -- e.g. 'PIPELINE_VALIDATION'
    status          STRING,      -- 'SUCCESS' or 'FAILURE'
    details         STRING       -- JSON with per-table check results
);
```

**Validation task behavior:**
- Checks every table (bronze → silver → gold) has rows
- Logs `SUCCESS` or `FAILURE` to `pipeline_event_log`
- If any table is empty/missing → task raises exception → Job marked as failed
- Dynamic `job_run_id` via `{{run.id}}` parameter
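
The behavior listed above can be sketched as follows. This is not the contents of `task_validate_pipeline`; it is a minimal plain-Python model in which `row_counts` (table name to row count, `None` for a missing table), `log_event`, and `PipelineValidationError` are hypothetical stand-ins for the Spark queries and the write to `pipeline_event_log`.

```python
import json
import uuid
from datetime import datetime, timezone


class PipelineValidationError(Exception):
    """Raised so that the job run is marked as failed."""


def build_validation_event(row_counts, job_run_id):
    """Build one event record with the columns of pipeline_event_log."""
    checks = {
        table: {"row_count": count, "ok": count is not None and count > 0}
        for table, count in row_counts.items()
    }
    status = "SUCCESS" if all(c["ok"] for c in checks.values()) else "FAILURE"
    return {
        "event_id": str(uuid.uuid4()),
        "event_timestamp": datetime.now(timezone.utc),
        "job_run_id": job_run_id,
        "event_type": "PIPELINE_VALIDATION",
        "status": status,
        "details": json.dumps(checks),
    }


def validate(row_counts, job_run_id, log_event):
    event = build_validation_event(row_counts, job_run_id)
    log_event(event)  # log first, so a failure is recorded before the task fails
    if event["status"] == "FAILURE":
        raise PipelineValidationError(event["details"])
    return event


logged = []
result = validate({"bronze.orders": 120, "gold.daily_orders": 30}, "run-1", logged.append)
print(result["status"], len(logged))
```

Logging before raising is the design choice that keeps the event log useful: a failed run still leaves a `FAILURE` row with the per-table details.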

**Query event log:**
```sql
SELECT event_id, event_timestamp, status, details
FROM <catalog>.default.pipeline_event_log
WHERE event_type = 'PIPELINE_VALIDATION'
ORDER BY event_timestamp DESC
LIMIT 10;
```

---

## Summary

Key concepts and exam-relevant keywords covered in this module.

---

| Topic | Key Concept | Exam Keywords |
|---|---|---|
| **Medallion** | Bronze → Silver → Gold | Raw, Validated, Business-ready |
| **SCD Type 1** | Overwrite, no history | `MERGE INTO ... UPDATE SET *` |
| **SCD Type 2** | Track history | `__START_AT`, `__END_AT`, `AUTO CDC` |
| **STREAMING TABLE** | Append-only, incremental | `CREATE STREAMING TABLE`, `STREAM()` |
| **MATERIALIZED VIEW** | Full recomputation | `CREATE MATERIALIZED VIEW` |
| **FLOW** | Separate source from table | `INSERT INTO ONCE` (backfill) |
| **Expectations** | Data quality constraints | `DROP ROW`, `FAIL UPDATE`, warn-only |


---

### Resources

- [Databricks Lakeflow Docs](https://docs.databricks.com/en/delta-live-tables/index.html)
- [Medallion Architecture](https://www.databricks.com/glossary/medallion-architecture)
- [SCD with Lakeflow](https://docs.databricks.com/en/delta-live-tables/cdc.html)