# M07: Medallion Architecture & Lakeflow


The Medallion architecture (Bronze → Silver → Gold) is the standard for organizing data in the Lakehouse. We'll explore declarative Lakeflow pipelines: STREAMING TABLE, MATERIALIZED VIEW, constraints (Expectations), and the FLOW mechanism. We'll build a complete pipeline in SQL and PySpark, including SCD Type 1 and Type 2 handling via AUTO CDC.

| Exam Domain | Weight |
|---|---|
| Production Pipelines | 13% |
| Incremental Data Processing | 20% |

---

## Section 1: Theory — SCD & Lakeflow Declarations

---

## SCD Type 1 & Type 2

Slowly Changing Dimensions (SCD) define strategies for handling changes in dimensional data over time. Type 1 overwrites old values, while Type 2 preserves full history with validity timestamps.

---

### What is SCD?

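**Slowly Changing Dimension (SCD)**: a strategy for handling changes in dimensional data. The table below uses one running example: a customer's city changes from Warsaw to Krakow.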

| Type | Strategy | Result |
|-----|-----------|----------|
| **SCD Type 0** | Retain original | Always Warsaw |
| **SCD Type 1** | Overwrite | Only Krakow |
| **SCD Type 2** | Track history | Both records with dates |
| **SCD Type 3** | Add column | `current_city=Krakow`, `previous_city=Warsaw` |
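
The four strategies can be compared side by side in a minimal pure-Python sketch. It is an illustration only, not Lakeflow code: the function names, the dictionary layout, and the dates are hypothetical, and plain dictionaries stand in for dimension rows.

```python
from datetime import date

def scd_type0(row, new_city):
    """Type 0: retain the original value; the change is ignored."""
    return dict(row)

def scd_type1(row, new_city):
    """Type 1: overwrite in place; no history is kept."""
    return {**row, "city": new_city}

def scd_type2(history, new_city, changed_on):
    """Type 2: close the open version and append a new one."""
    closed = [
        {**r, "end_at": changed_on} if r["end_at"] is None else r
        for r in history
    ]
    key = history[0]["customer_id"]
    new_version = {"customer_id": key, "city": new_city,
                   "start_at": changed_on, "end_at": None}
    return closed + [new_version]

def scd_type3(row, new_city):
    """Type 3: keep only the previous value in an extra column."""
    return {"customer_id": row["customer_id"],
            "current_city": new_city, "previous_city": row["city"]}

# Hypothetical sample data: one customer moving from Warsaw to Krakow
original = {"customer_id": "C1", "city": "Warsaw"}
history = [{"customer_id": "C1", "city": "Warsaw",
            "start_at": date(2024, 1, 1), "end_at": None}]

type0 = scd_type0(original, "Krakow")
type1 = scd_type1(original, "Krakow")
type2 = scd_type2(history, "Krakow", date(2025, 6, 1))
type3 = scd_type3(original, "Krakow")
```

Only Type 2 grows the table: each change adds a row, which is why it needs validity columns to tell versions apart.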

### SCD Type 1 — Overwrite

- **No history** — old values are overwritten
- **Use cases:** Error corrections, non-historical data
- **Implementation:** `MERGE INTO ... WHEN MATCHED THEN UPDATE SET *`

In [0]:
%sql
-- SCD Type 1 in Lakeflow (the target streaming table must be declared first)
CREATE OR REFRESH STREAMING TABLE silver_products;

CREATE FLOW silver_products_scd1_flow
AS AUTO CDC INTO silver_products
FROM STREAM bronze_products
KEYS (product_id)
SEQUENCE BY ingestion_ts
STORED AS SCD TYPE 1;  -- Overwrite without history

In [0]:
%sql
MERGE INTO dim_customer t
USING source_customers s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

### SCD Type 2 — Track History

- **Full history** of changes with `__START_AT`, `__END_AT`
- **Use cases:** Audit, historical analysis, compliance
- **Current record:** `__END_AT IS NULL`
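
How the validity columns are read can be sketched in plain Python. The sample rows and the helper names `current_rows` and `rows_as_of` are hypothetical, and the sketch assumes each version is valid on the half-open interval from `__START_AT` (inclusive) to `__END_AT` (exclusive).

```python
from datetime import datetime

# Hypothetical SCD Type 2 rows for one customer
rows = [
    {"customer_id": "C1", "city": "Warsaw",
     "__START_AT": datetime(2024, 1, 1), "__END_AT": datetime(2025, 6, 1)},
    {"customer_id": "C1", "city": "Krakow",
     "__START_AT": datetime(2025, 6, 1), "__END_AT": None},
]

def current_rows(rows):
    """Current versions: the rows whose __END_AT is still open (NULL)."""
    return [r for r in rows if r["__END_AT"] is None]

def rows_as_of(rows, ts):
    """Versions that were valid at ts (assumed interval: start <= ts < end)."""
    return [
        r for r in rows
        if r["__START_AT"] <= ts
        and (r["__END_AT"] is None or ts < r["__END_AT"])
    ]

now_city = current_rows(rows)[0]["city"]
past_city = rows_as_of(rows, datetime(2024, 12, 31))[0]["city"]
```

The first lookup corresponds to the `__END_AT IS NULL` filter; the second is the kind of point-in-time question that audit and historical analysis rely on.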

### Demo: SCD Type 2 with AUTO CDC

In [0]:
%sql
-- Creating target table SCD2
CREATE OR REFRESH STREAMING TABLE silver_customers (
  customer_id        STRING,
  first_name         STRING,
  last_name          STRING,
  city               STRING,
  -- SCD2 columns: with an explicit schema they must be declared here,
  -- using the same type as the SEQUENCE BY column
  __START_AT         TIMESTAMP,
  __END_AT           TIMESTAMP
);

-- Flow with AUTO CDC for SCD2
CREATE FLOW silver_customers_scd2_flow
AS AUTO CDC INTO silver_customers
FROM STREAM bronze_customers
KEYS (customer_id)         -- Business key
SEQUENCE BY ingestion_ts   -- Column determining the order
STORED AS SCD TYPE 2;      -- SCD Type

-- Key elements:
--   KEYS                   : columns identifying the business record
--   SEQUENCE BY            : column determining the order of changes
--   STORED AS SCD TYPE 1|2 : SCD type

---

## Lakeflow Pipelines — Declarations

Lakeflow (formerly Delta Live Tables) enables a declarative approach to building data pipelines — you define the desired outcome rather than step-by-step logic. This section covers table types, expectations, FLOW declarations, and both SQL and PySpark syntax.

---

### What is Lakeflow?

**Lakeflow Declarative Pipelines** (formerly Delta Live Tables; called Lakeflow in this module): a declarative framework for data pipelines.

| Approach | Example | You describe |
|-----------|----------|-------------|
| **Imperative** | `df.write.mode("overwrite")...` | HOW |
| **Declarative** | `CREATE TABLE AS SELECT...` | WHAT |

Key benefits: automatic dependencies, built-in data quality, unified batch/streaming, automatic recovery, lineage & monitoring.
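
Automatic dependencies mean the framework derives the execution order from what each dataset reads, rather than from the order in which the declarations were written. A rough sketch of that idea, using Python's standard `graphlib`; the `declared` mapping is a hypothetical stand-in for what would be extracted from the pipeline's queries, and this is not how Lakeflow is implemented internally.

```python
from graphlib import TopologicalSorter

# Hypothetical declarations: dataset name -> datasets it reads from.
# The entries are deliberately listed out of order.
declared = {
    "fact_sales": {"silver_orders"},
    "dim_customer": {"silver_customers"},
    "silver_orders": {"bronze_orders"},
    "silver_customers": {"bronze_customers"},
    "bronze_orders": set(),
    "bronze_customers": set(),
}

# static_order() yields every dataset after all of its upstream datasets
run_order = list(TopologicalSorter(declared).static_order())
print(run_order)
```

Whatever order the declarations appear in, each Bronze table is scheduled before the Silver table that reads it, and each Silver table before its Gold consumer.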

### Table Types in Lakeflow

| Type | Usage | Processing |
|-----|--------|-----------|
| **STREAMING TABLE** | Append-only ingestion | Incremental (new rows only) |
| **MATERIALIZED VIEW** | Aggregations, Gold layer | Full recomputation |
| **VIEW** | Intermediate logic | Not materialized |

> **Exam:** STREAMING TABLE = incremental (new data only). MATERIALIZED VIEW = full recompute each refresh.

**STREAMING TABLE vs MATERIALIZED VIEW (Exam Topic):**

| Feature | STREAMING TABLE | MATERIALIZED VIEW |
|---------|----------------|-------------------|
| Data Source | Streaming (append-only) | Batch or streaming |
| Processing | Incremental (new rows only) | Full recomputation |
| Updates | Append-only | Full refresh |
| Use Case | Bronze/Silver layers | Gold aggregations |
| Supports AUTO CDC | Yes | No |
| Query Pattern | `STREAM(source)` | Regular `SELECT` |
| Idempotent | Yes (checkpoints) | Yes (full refresh) |

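**Key Exam Point:** STREAMING TABLEs process only NEW data (incremental). MATERIALIZED VIEWs always return the result of the FULL query as of each refresh; for the exam, treat this as a full recomputation, although Databricks can refresh some materialized views incrementally (for example on serverless pipelines).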
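
The exam-level distinction can be modeled in a few lines of plain Python. This is a simplified sketch, not the Lakeflow runtime: the two classes are hypothetical, and a list offset stands in for a streaming checkpoint.

```python
class StreamingTable:
    """Append-only target that remembers how far into the source it has read."""

    def __init__(self, transform):
        self.transform = transform
        self.rows = []
        self.offset = 0  # stand-in for a streaming checkpoint

    def refresh(self, source):
        new_rows = source[self.offset:]          # only rows not seen before
        self.rows.extend(self.transform(r) for r in new_rows)
        self.offset = len(source)
        return len(new_rows)                     # rows processed in this update


class MaterializedView:
    """Stored query result that is rebuilt from the whole source."""

    def __init__(self, query):
        self.query = query
        self.result = None

    def refresh(self, source):
        self.result = self.query(source)         # recompute over every row
        return len(source)                       # rows processed in this update


source = [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 5.0}]
silver = StreamingTable(lambda r: {**r, "is_large": r["amount"] >= 10})
gold = MaterializedView(lambda rows: sum(r["amount"] for r in rows))

first = (silver.refresh(source), gold.refresh(source))
source.append({"order_id": 3, "amount": 2.5})
second = (silver.refresh(source), gold.refresh(source))
```

After the second refresh the streaming table has touched one new row, while the view has re-read all three to produce the new total.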

### Demo: STREAMING TABLE with Constraints

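Constraint actions: **`ON VIOLATION DROP ROW`** (discard the invalid row), **`ON VIOLATION FAIL UPDATE`** (fail the pipeline update), or no `ON VIOLATION` clause (keep the row and only record the violation in the data quality metrics).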

In [0]:
%sql
CREATE OR REFRESH STREAMING TABLE silver_orders
(
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL)
    ON VIOLATION DROP ROW,
  CONSTRAINT valid_customer EXPECT (customer_id IS NOT NULL)
    ON VIOLATION DROP ROW,
  CONSTRAINT valid_quantity EXPECT (quantity > 0)
    ON VIOLATION DROP ROW,
  CONSTRAINT valid_price EXPECT (unit_price >= 0)
    ON VIOLATION FAIL UPDATE
)
AS
SELECT
  order_id,
  customer_id,
  product_id,
  CAST(order_datetime AS TIMESTAMP) AS order_ts,
  quantity,
  unit_price,
  (quantity * unit_price) AS gross_amount
FROM STREAM(bronze_orders);
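
The three violation behaviors can be illustrated with a small pure-Python sketch. The function `apply_expectations`, the exception class, and the `"warn"`/`"drop"`/`"fail"` labels are hypothetical names chosen for this illustration; they mirror the semantics described above rather than any Lakeflow API.

```python
class ExpectationFailed(Exception):
    """Raised when a 'fail' expectation is violated (mirrors FAIL UPDATE)."""


def apply_expectations(rows, expectations):
    """Apply (name, predicate, action) expectations to a list of rows.

    action is one of: "warn" (count only), "drop" (discard the row),
    "fail" (abort the whole update).
    """
    kept = []
    violations = {name: 0 for name, _, _ in expectations}
    for row in rows:
        keep = True
        for name, predicate, action in expectations:
            if predicate(row):
                continue
            violations[name] += 1
            if action == "fail":
                raise ExpectationFailed(name)
            if action == "drop":
                keep = False
        if keep:
            kept.append(row)
    return kept, violations


expectations = [
    ("valid_order_id", lambda r: r["order_id"] is not None, "drop"),
    ("valid_quantity", lambda r: r["quantity"] > 0, "drop"),
    ("valid_price", lambda r: r["unit_price"] >= 0, "fail"),
]

# Hypothetical sample rows
orders = [
    {"order_id": 1, "quantity": 2, "unit_price": 9.5},
    {"order_id": None, "quantity": 1, "unit_price": 3.0},   # dropped
    {"order_id": 3, "quantity": 0, "unit_price": 4.0},      # dropped
]
kept, violations = apply_expectations(orders, expectations)
```

A single row with a negative `unit_price` would raise `ExpectationFailed` and no rows would be returned, which is the point of `FAIL UPDATE`: bad data stops the update instead of being silently filtered.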

### Demo: MATERIALIZED VIEW (Gold)

In [0]:
%sql
-- Dimension - current snapshot from SCD2
CREATE OR REFRESH MATERIALIZED VIEW dim_customer
AS
SELECT
  customer_id,
  first_name,
  last_name,
  email,
  city,
  customer_segment
FROM silver_customers
WHERE __END_AT IS NULL;

-- Date Dimension
CREATE OR REFRESH MATERIALIZED VIEW dim_date
AS
SELECT DISTINCT
  CAST(date_format(order_date, 'yyyyMMdd') AS INT) AS date_key,
  order_date AS date,
  year(order_date) AS year,
  quarter(order_date) AS quarter,
  month(order_date) AS month
FROM silver_orders;

-- Fact - streaming from Silver
CREATE OR REFRESH STREAMING TABLE fact_sales
AS
SELECT
  order_id,
  customer_id,
  product_id,
  order_date_key,
  quantity,
  gross_amount,
  net_amount
FROM STREAM(silver_orders);

### What is FLOW?

**FLOW** separates table definition from data source. Key capabilities:
1. **Multiple sources → one table** (e.g. backfill + streaming)
2. **CDC** with automatic SCD via `AUTO CDC`
3. **`INSERT INTO ONCE`** — one-time backfill
4. **`INSERT INTO`** — continuous streaming

In [0]:
%sql
-- 1. Define an empty target table
CREATE OR REFRESH STREAMING TABLE target_table;

-- 2. Define one or more FLOWs that populate it
CREATE FLOW flow_name
AS INSERT INTO target_table BY NAME
SELECT ... FROM STREAM(source);

### Demo: Backfill + Streaming Pattern

In [0]:
%sql
-- Target table
CREATE OR REFRESH STREAMING TABLE bronze_orders;

-- FLOW 1: One-time backfill
CREATE FLOW bronze_orders_backfill
AS 
INSERT INTO ONCE bronze_orders BY NAME
SELECT
  order_id,
  customer_id,
  product_id,
  order_datetime,
  'batch' AS source_system,
  _metadata.file_path AS source_file_path,
  current_timestamp() AS load_ts
FROM read_files(
  '${order_path}/orders_batch.json',
  format => 'json'
);

-- FLOW 2: Continuous streaming
CREATE FLOW bronze_orders_stream
AS 
INSERT INTO bronze_orders BY NAME
SELECT
  order_id,
  customer_id,
  product_id,
  order_datetime,
  'stream' AS source_system,
  _metadata.file_path AS source_file_path,
  current_timestamp() AS load_ts
FROM STREAM read_files(
  '${order_path}/stream/orders_stream_*.json',
  format => 'json'
);
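
The difference between the two flows shows up across pipeline updates: the one-time flow loads its data once, while the streaming flow keeps picking up new rows. A simplified pure-Python sketch of that behavior; `run_update` and the `state` dictionary are hypothetical stand-ins for the pipeline's own bookkeeping, and the sample rows are invented.

```python
def run_update(target, flows, state):
    """Apply every flow once per pipeline update.

    state remembers which one-time flows already ran and how far each
    streaming flow has read (a stand-in for streaming checkpoints).
    """
    for flow in flows:
        name, source = flow["name"], flow["source"]
        if flow["once"]:
            if name in state["done"]:
                continue                      # backfill already loaded
            target.extend(source)
            state["done"].add(name)
        else:
            start = state["offsets"].get(name, 0)
            target.extend(source[start:])     # only rows not seen before
            state["offsets"][name] = len(source)


batch_file = [{"order_id": 1, "source_system": "batch"},
              {"order_id": 2, "source_system": "batch"}]
stream_files = [{"order_id": 3, "source_system": "stream"}]

flows = [
    {"name": "bronze_orders_backfill", "source": batch_file, "once": True},
    {"name": "bronze_orders_stream", "source": stream_files, "once": False},
]
bronze_orders = []
state = {"done": set(), "offsets": {}}

run_update(bronze_orders, flows, state)   # first update: backfill + stream
stream_files.append({"order_id": 4, "source_system": "stream"})
run_update(bronze_orders, flows, state)   # second update: new stream row only
```

Both flows write to the same target, and the backfill rows are not duplicated on the second update.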

### Demo: AUTO CDC for SCD Type 2

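`AUTO CDC` matches incoming records to existing rows by `KEYS` and orders them by the `SEQUENCE BY` column. For SCD Type 2 it closes the current row by setting `__END_AT` and inserts a new row; for SCD Type 1 it overwrites the row in place.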

In [0]:
%sql
-- SCD2 Table with schema
CREATE OR REFRESH STREAMING TABLE silver_customers (
  customer_id        STRING,
  first_name         STRING,
  last_name          STRING,
  email              STRING,
  city               STRING,
  __START_AT         TIMESTAMP,
  __END_AT           TIMESTAMP
);

-- AUTO CDC Flow
CREATE FLOW silver_customers_scd2_flow
AS AUTO CDC INTO silver_customers
FROM STREAM bronze_customers
KEYS (customer_id)
SEQUENCE BY ingestion_ts
STORED AS SCD TYPE 2;
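
What the SCD Type 2 flow has to do with each change record can be approximated in plain Python. This is a batch-style sketch under stated assumptions, not the AUTO CDC implementation: `apply_scd2` is a hypothetical helper, events are replayed in `SEQUENCE BY` order so late arrivals land in the right place, and an event that changes no attribute is assumed to leave the open version untouched.

```python
from datetime import datetime

def apply_scd2(events, keys, sequence_by):
    """Replay change events into SCD Type 2 rows, oldest version first per key."""
    history = {}  # key tuple -> list of versions
    for event in sorted(events, key=lambda e: e[sequence_by]):
        key = tuple(event[k] for k in keys)
        attrs = {c: v for c, v in event.items()
                 if c not in keys and c != sequence_by}
        versions = history.setdefault(key, [])
        if versions:
            latest = versions[-1]
            if all(latest[c] == v for c, v in attrs.items()):
                continue                              # nothing changed
            latest["__END_AT"] = event[sequence_by]   # close the old version
        versions.append({**dict(zip(keys, key)), **attrs,
                         "__START_AT": event[sequence_by], "__END_AT": None})
    return [row for versions in history.values() for row in versions]


# Hypothetical change feed; the Krakow event arrives before the older Warsaw one
events = [
    {"customer_id": "C1", "city": "Krakow", "ingestion_ts": datetime(2025, 6, 1)},
    {"customer_id": "C1", "city": "Warsaw", "ingestion_ts": datetime(2024, 1, 1)},
    {"customer_id": "C2", "city": "Gdansk", "ingestion_ts": datetime(2024, 3, 1)},
]
scd2_rows = apply_scd2(events, keys=["customer_id"], sequence_by="ingestion_ts")
```

Ordering by the sequence column rather than by arrival order is what keeps the history correct when records show up late.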

### PySpark Declarations

Lakeflow pipelines can also be defined in Python using the `pyspark.pipelines` decorators instead of SQL.

In [0]:
from pyspark import pipelines as dp
from pyspark.sql.functions import col, current_timestamp

# STREAMING TABLE
@dp.table(
    name="bronze_customers",
    comment="Raw customers from CSV"
)
def bronze_customers():
    return (
        spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "csv")
            .option("header", "true")
            .load(spark.conf.get("customer_path"))
            .select(
                "*",
                col("_metadata.file_path").alias("source_file_path"),
                current_timestamp().alias("load_ts")
            )
    )

# MATERIALIZED VIEW (batch read; @dp.table is the decorator for streaming tables)
@dp.materialized_view(name="dim_customer")
def dim_customer():
    return (
        spark.read.table("silver_customers")
            .filter(col("__END_AT").isNull())
            .select("customer_id", "first_name", "last_name")
    )

### PySpark: Expectations

Decorators: **`@dp.expect`** (log only), **`@dp.expect_or_drop`** (drop record), **`@dp.expect_or_fail`** (fail pipeline).

In [0]:
@dp.table(name="silver_orders")
@dp.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
@dp.expect_or_drop("valid_customer", "customer_id IS NOT NULL")
@dp.expect_or_drop("valid_quantity", "quantity > 0")
@dp.expect("valid_price", "unit_price >= 0")
def silver_orders():
    return (
        spark.readStream.table("bronze_orders")
            .select(
                "order_id",
                "customer_id",
                "product_id",
                col("order_datetime").cast("timestamp").alias("order_ts"),
                # Keep the columns that the expectations above reference
                "quantity",
                "unit_price",
                (col("quantity") * col("unit_price")).alias("gross_amount")
            )
    )

### PySpark: AUTO CDC

Python equivalent of SQL AUTO CDC — uses `dp.create_auto_cdc_flow()` to define SCD logic programmatically.

In [0]:
from pyspark import pipelines as dp

# Define the target table
dp.create_streaming_table(
    name="silver_customers",
    schema="""
        customer_id STRING,
        first_name STRING,
        city STRING,
        __START_AT TIMESTAMP,
        __END_AT TIMESTAMP
    """
)

# Define the CDC flow
dp.create_auto_cdc_flow(
    target="silver_customers",
    source="bronze_customers",
    keys=["customer_id"],
    sequence_by="ingestion_ts",
    stored_as_scd_type=2  # or 1
)

---

## Section 2: UI Demo — Building a Lakeflow Pipeline

Step-by-step walkthrough of creating, configuring, and executing a Lakeflow pipeline in the Databricks workspace. This demo covers the full lifecycle: pipeline creation, asset configuration, compute provisioning, validation (dry run), and production execution.

> **Estimated time:** ~15 min (including 5-8 min for initial cluster provisioning)

---

### Step 1: Create Pipeline

Navigate to **Jobs & Pipelines** in the left sidebar and select **ETL Pipeline**.

<img src="../../../assets/images/training_2026/day3/fd6ad124b1764ec0a2867cb7c612c75d.webp" width="800">

Assign a unique, descriptive name to your pipeline (e.g., `<your_name>_lakeflow_demo`).

<img src="../../../assets/images/training_2026/day3/eb57b69aa1494a759d1f69fc1de84189.webp" width="800">

### Step 2: Configure Target — Catalog & Schema

Select the target **Unity Catalog** catalog and schema for pipeline output tables. In a shared training environment, create a dedicated schema with a unique name (e.g., `<your_name>_lakeflow_demo`) to avoid namespace conflicts with other participants.

<img src="../../../assets/images/training_2026/day3/bef9f61c293743f190f68d188265f635.webp" width="800">

### Step 3: Add Source Assets

Click **Add existing assets** to link the pipeline to repository-managed source code.

<img src="../../../assets/images/training_2026/day3/198cba2b98dd4bafbceb4d178597cfc6.webp" width="800">

Configure the following paths:
- **Pipeline root folder** → `materials/lakeflow/lakeflow_ny_demo`
- **Source code path** → `transformations` subdirectory within `lakeflow_ny_demo`

<img src="../../../assets/images/training_2026/day3/8b6b591b1b3843f6a39fbe400448c311.webp" width="800">

Verify that the final configuration matches the expected setup before proceeding:

<img src="../../../assets/images/training_2026/day3/8300f3f7570043ceaf1411a8be91e9a7.webp" width="800">

### Step 4: Review Pipeline Structure

The left panel displays the auto-discovered pipeline graph — all tables, views, and dependencies extracted automatically from the source code repository.

<img src="../../../assets/images/training_2026/day3/1e476ee5b1f94c649511adddc93a2bfd.webp" width="800">

Navigate to the **Settings** tab to configure compute resources.

<img src="../../../assets/images/training_2026/day3/5c2bfc315a2148268f2c9bf313bbc928.webp" width="800">

### Step 5: Configure Compute Resources

Open **Compute** → **Edit compute configuration** to adjust cluster settings.

<img src="../../../assets/images/training_2026/day3/7674419b96ca494d8c9e5c0ed9bc45fe.webp" width="800">

Set **Cluster mode** to `Fixed size` with **Workers = 1** to minimize resource consumption within training quota limits.

<img src="../../../assets/images/training_2026/day3/7980e9740c6b4ab99183520e23684aeb.webp" width="800">

Under **Advanced settings**, select worker type **D4ds_v5** — optimized for the training environment's vCPU quota constraints.

<img src="../../../assets/images/training_2026/day3/c410d04e1ddd4be99b52fbb35f124bac.webp" width="800">

> **Note:** Training environments have strict vCPU quotas. Using a minimal fixed-size cluster ensures reliable execution without exceeding resource limits.

Click **Save** and close the compute configuration dialog.

### Step 6: Validate Pipeline — Dry Run

Click **Dry Run** to validate the pipeline without processing data. This verifies all SQL/Python declarations, resolves table dependencies, and confirms resource availability.

<img src="../../../assets/images/training_2026/day3/b06e4093070d4febababc2678f560ead.webp" width="800">

> **Tip:** Initial cluster provisioning may take 5-8 minutes. This is expected for the first run — use this time for a break.

After successful validation, the pipeline DAG is generated — showing all tables, dependencies, and data flow paths from source to gold layer.
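
The kind of check a dry run performs can be sketched without touching any data: every dataset must read from something that exists, and the dependency graph must not loop back on itself. The `dry_run` function and the sample declarations below are hypothetical and only illustrate the idea; the actual validation in Databricks covers much more, such as query analysis and compute availability.

```python
from graphlib import CycleError, TopologicalSorter

def dry_run(declared):
    """Return a list of problems found in the declarations, reading no data."""
    problems = []
    for dataset, upstream in declared.items():
        for ref in sorted(upstream):
            if ref not in declared:
                problems.append(f"{dataset}: unknown source '{ref}'")
    try:
        TopologicalSorter(declared).prepare()   # raises if the graph has a cycle
    except CycleError as err:
        problems.append("dependency cycle: " + " -> ".join(err.args[1]))
    return problems


# Hypothetical declarations: dataset name -> datasets it reads from
valid = {
    "bronze_orders": set(),
    "silver_orders": {"bronze_orders"},
    "fact_sales": {"silver_orders"},
}
typo = {
    "bronze_orders": set(),
    "silver_orders": {"bronze_order"},          # misspelled upstream name
}
cyclic = {
    "a": {"b"},
    "b": {"a"},
}

ok_problems = dry_run(valid)
typo_problems = dry_run(typo)
cycle_problems = dry_run(cyclic)
```

Catching a misspelled table name at this stage costs seconds, whereas discovering it midway through a full refresh costs a failed run.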

<img src="../../../assets/images/training_2026/day3/0ebbedde030b43e7bed35295000c42a9.webp" width="800">

### Step 7: Execute Pipeline — Full Refresh

With validation confirmed, click **Run Pipeline** → select **Full table refresh** to execute the complete pipeline end-to-end.

<img src="../../../assets/images/training_2026/day3/075c2f87b48d4e38ab018495c631aaa7.webp" width="800">

A successful execution shows all tables with **green status indicators** and a fully green DAG — confirming that data has been processed through the complete **Bronze → Silver → Gold** medallion architecture.

<img src="../../../assets/images/training_2026/day3/044431e4088049faa10fb2aee4e9d84e.webp" width="800">

> **Result:** The Lakeflow pipeline has been executed end-to-end — from raw source ingestion (Bronze) through validated transformations (Silver) to business-ready aggregations (Gold) — using the declarative framework with automatic dependency management and data quality enforcement.

## Summary

Key concepts and exam-relevant keywords covered in this module.

---

| Topic | Key Concept | Exam Keywords |
|---|---|---|
| **Medallion** | Bronze → Silver → Gold | Raw, Validated, Business-ready |
| **SCD Type 1** | Overwrite, no history | `MERGE INTO ... UPDATE SET *` |
| **SCD Type 2** | Track history | `__START_AT`, `__END_AT`, `AUTO CDC` |
| **STREAMING TABLE** | Append-only, incremental | `CREATE STREAMING TABLE`, `STREAM()` |
| **MATERIALIZED VIEW** | Full recomputation | `CREATE MATERIALIZED VIEW` |
| **FLOW** | Separate source from table | `INSERT INTO ONCE` (backfill) |
| **Expectations** | Data quality constraints | `DROP ROW`, `FAIL UPDATE`, warn-only |


---

### Resources

- [Databricks Lakeflow Docs](https://docs.databricks.com/en/delta-live-tables/index.html)
- [Medallion Architecture](https://www.databricks.com/glossary/medallion-architecture)
- [SCD with Lakeflow](https://docs.databricks.com/en/delta-live-tables/cdc.html)