# Lakeflow Spark Declarative Pipelines

**Training Objective:** Understand the Lakeflow declarative framework for building batch and streaming pipelines, and implement a Bronze→Silver→Gold pipeline with the SQL API.

**Topics Covered:**
- Lakeflow concepts: declarative approach to pipeline definition
- SQL vs Python API (focus on SQL)
- Materialized views / streaming tables
- Expectations: warn / drop / fail (data quality)
- Event log and lineage per table
- Automatic orchestration

## Context and Requirements

- **Training Day**: Day 3 - Transformation & Governance
- **Notebook Type**: Demo
- **Technical Requirements**:
  - Databricks Runtime 16.4 LTS or newer (recommended: 17.3 LTS)
  - Unity Catalog enabled
  - Permissions: CREATE TABLE, CREATE SCHEMA, SELECT, MODIFY
  - Cluster: Standard or Serverless Compute
 
**Note:** This notebook demonstrates the **SQL API** for Lakeflow SDP. The Python API (`@dp.table`, `@dp.materialized_view`, `dp.create_streaming_table()`) is an alternative with equivalent functionality.

> **Update (June 2025):** The product name changed from "Delta Live Tables (DLT)" to "Lakeflow Spark Declarative Pipelines (SDP)". The functionality remains the same. Additionally, "Databricks Jobs" is now "Lakeflow Jobs".

## Theoretical Introduction - Lakeflow Spark Declarative Pipelines

**Section Objective:** Understand what Lakeflow SDP is and how it changes the way ETL/ELT pipelines are built.

---

### What is Lakeflow Spark Declarative Pipelines?

**Lakeflow Spark Declarative Pipelines (SDP)** is a declarative framework for creating batch and streaming data pipelines in SQL and Python. It extends Apache Spark Declarative Pipelines, running on the optimized Databricks Runtime.

```
┌─────────────────────────────────────────────────────────────┐
│ TRADITIONAL APPROACH (Procedural) │
├─────────────────────────────────────────────────────────────┤
│ 1. Write code: df = spark.read.table(...) │
│ 2. Write transformations: df.filter().groupBy()... │
│ 3. Write orchestration: if/else, try/catch, retry logic │
│ 4. Write monitoring: log metrics, track failures │
│ 5. Write quality checks: manual assertions │
│ 6. Deploy: schedule in Jobs, manage dependencies │
│ │
│ = Hundreds of lines of code, manual orchestration, error handling │
└─────────────────────────────────────────────────────────────┘

 

┌─────────────────────────────────────────────────────────────┐
│ LAKEFLOW SDP (Declarative) │
├─────────────────────────────────────────────────────────────┤
│ 1. Declare WHAT you want: │
│ CREATE OR REFRESH STREAMING TABLE bronze AS ... │
│ CREATE OR REFRESH MATERIALIZED VIEW silver AS ... │
│ CREATE OR REFRESH MATERIALIZED VIEW gold AS ... │
│ │
│ 2. Lakeflow automatically: │
│ Orchestrates order (dependency DAG) │
│ Retry on failures │
│ Incremental processing │
│ Monitoring (Event Log) │
│ Data quality (expectations) │
│ Lineage tracking │
│ │
│ = A dozen lines of SQL, zero orchestration code │
└─────────────────────────────────────────────────────────────┘
```

---

### Key Benefits of Lakeflow SDP

**1. Automatic Orchestration**

Lakeflow automatically:
- Analyzes dependencies between tables (which table reads from which)
- Builds a DAG (Directed Acyclic Graph)
- Executes tables in the correct order, running independent tables in parallel
- Retries failures at several levels: task → flow → pipeline

```sql
-- Just declare:
CREATE OR REFRESH MATERIALIZED VIEW silver AS 
 SELECT * FROM bronze; -- Lakeflow knows: silver depends on bronze

CREATE OR REFRESH MATERIALIZED VIEW gold AS 
 SELECT * FROM silver; -- Lakeflow knows: gold depends on silver

-- Execution order: bronze → silver → gold (automatic!)
```

**2. Declarative Processing**

The declarative API can reduce hundreds of lines of procedural code to a few statements:

```sql
-- Traditional (procedural):
-- 1. Read source
-- 2. Apply transformations
-- 3. Handle schema evolution
-- 4. Write to Delta
-- 5. Error handling
-- 6. Retry logic
-- 7. Metrics logging
-- = ~100+ lines of code

-- Lakeflow (declarative):
CREATE OR REFRESH STREAMING TABLE orders AS
 SELECT * FROM STREAM read_files('/path/to/orders');
-- = 2 lines, all of the above is automatic!
```

**3. Incremental Processing**

Lakeflow processes only new/changed data:

- **Streaming tables**: Append-only; each input record is processed once
- **Materialized views**: Incremental refresh where the query supports it (Databricks detects changes in the source)
- **AUTO CDC**: Handles out-of-order events and maintains SCD Type 1 or Type 2 targets
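
To make "each input record is processed once" concrete, here is a minimal plain-Python sketch of checkpoint-style incremental processing. It illustrates the idea only and is not how Lakeflow is implemented; `IncrementalReader`, its in-memory checkpoint, and the file names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class IncrementalReader:
    """Toy append-only reader that remembers which files it has already processed."""

    # Stands in for a durable checkpoint (hypothetical, in memory only)
    processed_files: set = field(default_factory=set)

    def read_new(self, available: dict) -> list:
        """Return records only from files that earlier runs have not seen."""
        new_records = []
        for path in sorted(available):
            if path in self.processed_files:
                continue  # already ingested, so no record is processed twice
            new_records.extend(available[path])
            self.processed_files.add(path)
        return new_records


reader = IncrementalReader()
landing = {"orders/001.json": [{"order_id": 1}, {"order_id": 2}]}
first_run = reader.read_new(landing)

# A new file arrives; the second run picks up only that file.
landing["orders/002.json"] = [{"order_id": 3}]
second_run = reader.read_new(landing)
third_run = reader.read_new(landing)  # nothing new to process

print(len(first_run), len(second_run), len(third_run))  # 2 1 0
```

The work done per run depends on how much new data arrived, not on the total size of the source.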

**4. Built-in Data Quality**

Expectations = SQL constraints with flexible handling:

```sql
CREATE OR REFRESH STREAMING TABLE orders (
 CONSTRAINT valid_amount EXPECT (total_amount > 0) ON VIOLATION DROP ROW,
 CONSTRAINT valid_date EXPECT (order_date IS NOT NULL) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM ...
```

---

### Basic Concepts

- **Lakeflow SDP**: Declarative framework for batch + streaming pipelines
- **Flow**: Processing unit (Append, AUTO CDC, Materialized View)
- **STREAMING TABLE**: Delta table for streaming/incremental data (append-only, low latency)
- **MATERIALIZED VIEW**: Delta table that stores query results and is refreshed incrementally where possible (batch semantics)
- **VIEW (temporary)**: Ephemeral; not persisted and recomputed each time it is queried
- **SINK**: Streaming target (Delta, Kafka, EventHub, custom Python)
- **Pipeline**: Collection of flows + tables + views + sinks (unit of deployment)
- **Expectations**: Data quality constraints (warn/drop/fail)
- **Event Log**: Delta table with metrics, lineage, quality metrics

**Why is this important?**

Lakeflow SDP eliminates boilerplate code and lets you focus on business logic instead of orchestration. The declarative model provides:
- **Separation of concerns**: WHAT (declaration) vs HOW (execution engine)
- **Reusability**: Same declarations in dev/test/prod
- **Observability**: Event Log out-of-the-box
- **Reliability**: Automatic retry and error handling

## Per-User Isolation

Run the initialization script for per-user catalog and schema isolation:

In [0]:
%run ../00_setup

## Configuration

Library imports and environment variable setup:

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.types import *
from datetime import datetime
import uuid

# Set catalog as default
spark.sql(f"USE CATALOG {CATALOG}")

# Source data paths
ORDERS_JSON = f"{DATASET_BASE_PATH}/orders/orders_batch.json"
CUSTOMERS_CSV = f"{DATASET_BASE_PATH}/customers/customers.csv"
PRODUCTS_PARQUET = f"{DATASET_BASE_PATH}/products/products.parquet"

# Display user context
display(spark.createDataFrame([
    ("Catalog", CATALOG),
    ("Schema Bronze", BRONZE_SCHEMA),
    ("Schema Silver", SILVER_SCHEMA),
    ("Schema Gold", GOLD_SCHEMA),
    ("User", raw_user),
    ("Orders path", ORDERS_JSON),
    ("Customers path", CUSTOMERS_CSV),
    ("Products path", PRODUCTS_PARQUET)
], ["Parameter", "Value"]))

---

## Section 1: Lakeflow Concepts - Flows, Tables, Views

**Theoretical Introduction:**

Lakeflow SDP operates on three key concepts: **Flows** (how data flows), **Streaming Tables** (append-only targets), and **Materialized Views** (batch targets with incremental refresh).

---

### Flow Types

A **flow** is a data processing unit in Lakeflow: it defines HOW data moves from a source to a target.

```
┌─────────────────────────────────────────────────────────────┐
│ FLOW TYPES │
├─────────────────────────────────────────────────────────────┤
│ 1. APPEND FLOW │
│ • Source: Append-only (files, Kafka, Kinesis, Delta) │
│ • Semantics: Streaming (continuous processing) │
│ • Guarantee: Exactly-once per record │
│ • Latency: Low (seconds) │
│ • Use case: Real-time ingest, log streaming │
│ • Target: STREAMING TABLE │
│ │
│ SQL Example: │
│ CREATE OR REFRESH STREAMING TABLE orders AS │
│ SELECT * FROM STREAM read_files('/path'); │
│ │
│ 2. AUTO CDC FLOW │
│ • Source: Change Data Capture (CDF-enabled Delta) │
│ • Semantics: Streaming with CDC operations │
│ • Operations: INSERT, UPDATE, DELETE, TRUNCATE │
│ • Sequencing: Out-of-order handling (automatic) │
│ • SCD: Type 1 (update) or Type 2 (history tracking) │
│ • Use case: Sync with transactional DB, audit trail │
│ • Target: STREAMING TABLE │
│ │
│ SQL Example: │
│ CREATE OR REFRESH STREAMING TABLE target_table; │
│ CREATE FLOW target_cdc AS AUTO CDC INTO target_table │
│ FROM STREAM(source_table) │
│ KEYS (user_id) │
│ APPLY AS DELETE WHEN operation = 'DELETE' │
│ SEQUENCE BY timestamp; │
│ │
│ 3. MATERIALIZED VIEW FLOW │
│ • Source: Batch read (Delta tables, views) │
│ • Semantics: Batch (scheduled/triggered) │
│ • Refresh: Incremental (only changed partitions) │
│ • Cache: Results persisted (performance) │
│ • Recompute: Full on schema changes or explicit │
│ • Use case: Aggregations, slow queries, BI dashboards │
│ • Target: MATERIALIZED VIEW │
│ │
│ SQL Example: │
│ CREATE OR REFRESH MATERIALIZED VIEW daily_summary AS │
│ SELECT date, SUM(amount) FROM orders GROUP BY date; │
└─────────────────────────────────────────────────────────────┘
```

**Key Differences:**

| Flow Type | Processing | Source | Latency | Incremental | Use Case |
|-----------|------------|--------|---------|-------------|----------|
| **Append** | Streaming | Append-only | Seconds | Yes (watermarks) | Real-time ingest |
| **AUTO CDC** | Streaming | CDC events | Seconds | Yes (sequencing) | DB sync, SCD |
| **Materialized View** | Batch | Any Delta | Minutes | Yes (smart refresh) | Aggregations, BI |
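
The sequencing behaviour of an AUTO CDC flow is easier to see on a small example. The sketch below mimics SCD Type 1 semantics in plain Python: the latest change per key wins, and events that arrive late are ignored. It is a simplified model under stated assumptions, not the engine's implementation; `apply_cdc_scd1` and the sample events are hypothetical.

```python
def apply_cdc_scd1(events, key, sequence_by, delete_when):
    """Apply CDC events to an in-memory target, keeping the latest version per key."""
    target = {}    # key value -> current row
    last_seq = {}  # key value -> highest sequence applied so far (deletes included)
    for event in events:
        k, seq = event[key], event[sequence_by]
        if k in last_seq and seq <= last_seq[k]:
            continue  # late event: a newer change for this key was already applied
        last_seq[k] = seq
        if delete_when(event):
            target.pop(k, None)
        else:
            target[k] = event
    return target


# Events in arrival order; `ts` is the sequencing column.
events = [
    {"user_id": 1, "name": "Ann", "operation": "INSERT", "ts": 1},
    {"user_id": 1, "name": "Anna", "operation": "UPDATE", "ts": 3},
    {"user_id": 1, "name": "Anne", "operation": "UPDATE", "ts": 2},   # arrives late
    {"user_id": 2, "name": "Bob", "operation": "INSERT", "ts": 1},
    {"user_id": 2, "name": "Bob", "operation": "DELETE", "ts": 4},
    {"user_id": 2, "name": "Bobby", "operation": "UPDATE", "ts": 3},  # older than the delete
]

result = apply_cdc_scd1(
    events,
    key="user_id",
    sequence_by="ts",
    delete_when=lambda e: e["operation"] == "DELETE",
)
print(result[1]["name"], 2 in result)  # Anna False
```

Remembering the sequence value of a delete is what stops the late `Bobby` update from resurrecting a removed row. An SCD Type 2 target would instead close the previous version and append a new one.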

---

### STREAMING TABLE vs MATERIALIZED VIEW

| Aspect | STREAMING TABLE | MATERIALIZED VIEW |
|--------|-----------------|-------------------|
| **Semantics** | Streaming (continuous) | Batch (scheduled/triggered) |
| **Processing** | Exactly-once per record | Incremental refresh (changed data) |
| **Source** | `STREAM` keyword required | Batch read (no STREAM) |
| **Latency** | Low (seconds) | Higher (minutes) |
| **State** | Bounded (watermarks) | Stateless (recompute) |
| **Joins** | Stream-snapshot (static dims) | Full recompute (always correct) |
| **Use case** | Real-time ingest, CDC | Aggregations, slow queries |
| **Schema evolution** | Limited (full refresh) | Flexible |

**When to use:**
- **STREAMING TABLE**: Ingest from files/Kafka, CDC, low-latency transformations
- **MATERIALIZED VIEW**: Aggregations, joins with frequent dimension changes, pre-compute slow queries

---

### VIEW (temporary)

A **VIEW** is ephemeral: it is not persisted and is recomputed whenever it is queried.

**Use cases:**
- Intermediate transformations (reusable logic)
- Data quality checks (don't publish to catalog)
- Testing (don't save to Delta)

```sql
-- Temporary view: not persisted to Delta or published to the catalog
CREATE TEMPORARY VIEW temp_filtered AS
 SELECT * FROM bronze WHERE status = 'ACTIVE';

-- Use in downstream table
CREATE OR REFRESH MATERIALIZED VIEW silver AS
 SELECT * FROM temp_filtered;
```

---

### Automatic Dependency Resolution (DAG)

Lakeflow automatically builds a DAG from the dependencies between datasets:

```sql
-- Declarations (you don't specify order):
CREATE OR REFRESH STREAMING TABLE bronze AS ...;
CREATE OR REFRESH MATERIALIZED VIEW silver AS SELECT * FROM bronze;
CREATE OR REFRESH MATERIALIZED VIEW gold AS SELECT * FROM silver;

-- Lakeflow execution order (automatic):
-- 1. bronze (no dependencies)
-- 2. silver (depends on bronze) [parallel if multiple silvers]
-- 3. gold (depends on silver) [parallel if multiple golds]
```

**Key point:** You declare WHAT, Lakeflow decides HOW and WHEN.
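
Dependency resolution of this kind is a topological sort. The sketch below uses Python's standard `graphlib` module on a hypothetical five-dataset pipeline to show how an execution order, and the groups that could run in parallel, fall out of the declared dependencies. It models the idea only; the dataset names are invented and Lakeflow's scheduler is not exposed this way.

```python
from graphlib import TopologicalSorter

# dataset -> set of datasets it reads from (hypothetical pipeline)
dependencies = {
    "bronze_orders": set(),
    "bronze_customers": set(),
    "silver_orders": {"bronze_orders"},
    "silver_customers": {"bronze_customers"},
    "gold_daily_summary": {"silver_orders", "silver_customers"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle

stages = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # every dataset whose inputs are complete
    stages.append(ready)                # datasets in one stage could run in parallel
    sorter.done(*ready)

for number, stage in enumerate(stages, start=1):
    print(number, stage)
```

Both Bronze tables land in the first stage, both Silver tables in the second, and the Gold summary runs last, regardless of the order in which the entries were declared.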

### Example 1.1: STREAMING TABLE - Bronze Layer Ingest

**Objective:** Demonstrate STREAMING TABLE for real-time ingest with Auto Loader

**Approach:**
1. Use `read_files()` for Auto Loader (SQL API)
2. `STREAM` keyword for streaming semantics
3. Write to STREAMING TABLE (append-only)

In [0]:
# Example 1.1 - Bronze Layer (traditional approach for demonstration)
# In production pipeline, we would use Lakeflow CREATE OR REFRESH STREAMING TABLE

spark.sql(f"USE SCHEMA {BRONZE_SCHEMA}")

# Bronze layer: load raw data from JSON (batch for demo)
orders_bronze_df = (
    spark.read
    .format("json")
    .option("multiLine", "true")
    .load(ORDERS_JSON)
    .withColumn("_bronze_ingest_timestamp", F.current_timestamp())
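    # input_file_name() is not supported on Unity Catalog standard/serverless compute;
    # the _metadata column exposes the source file path instead
    .withColumn("_bronze_source_file", F.col("_metadata.file_path"))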
    .withColumn("_bronze_ingested_by", F.lit(raw_user))
    .withColumn("_bronze_version", F.lit(1))
)

# Save to Delta (Bronze table)
bronze_table = "orders_bronze"
(
    orders_bronze_df
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(bronze_table)
)

# Preview
display(spark.table(bronze_table).limit(5))

---

## Section 2: Silver Layer - MATERIALIZED VIEW + Expectations

**Theoretical Introduction:**

The Silver layer cleans and validates data from Bronze. A MATERIALIZED VIEW is refreshed incrementally where the query supports it, so only changed data is reprocessed. Expectations are built-in data quality constraints.

**Key Concepts:**
- **MATERIALIZED VIEW**: Batch processing with incremental refresh
- **Expectations**: SQL constraints with actions: EXPECT (warn), DROP ROW, FAIL UPDATE
- **Data Quality Gates**: Validations between layers

**Practical Application:**
- Deduplication by business key
- Validation: NOT NULL, ranges, business rules
- Standardization: dates, text formats, type casting

### Example 2.1: MATERIALIZED VIEW with Expectations (Silver Layer)

**Objective:** Demonstrate MATERIALIZED VIEW for Silver with data quality constraints

**Lakeflow SQL Syntax (Production):**

```sql
-- In production Lakeflow pipeline:
CREATE OR REFRESH MATERIALIZED VIEW silver.orders_silver (
 -- Expectations: Data Quality Constraints
 CONSTRAINT valid_amount EXPECT (total_amount > 0) ON VIOLATION DROP ROW,
 CONSTRAINT valid_date EXPECT (order_datetime IS NOT NULL) ON VIOLATION DROP ROW,
 CONSTRAINT valid_ids EXPECT (order_id IS NOT NULL AND customer_id IS NOT NULL)
)
COMMENT 'Silver layer - cleaned orders with quality checks'
AS
SELECT
 order_id,
 customer_id,
 product_id,
 store_id,
 to_date(order_datetime) AS order_date,
 to_timestamp(order_datetime) AS order_timestamp,
 quantity,
 unit_price,
 CAST(total_amount AS DECIMAL(10,2)) AS total_amount,
 UPPER(TRIM(payment_method)) AS payment_method,
 CASE 
 WHEN total_amount > 0 THEN 'COMPLETED'
 ELSE 'UNKNOWN'
 END AS order_status,
 current_timestamp() AS _silver_processed_timestamp,
 'VALID' AS _data_quality_flag
FROM bronze.orders_bronze;
```

**Expectations Explanation:**
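- **EXPECT (warn)**: Log the violation and keep the record (default when no `ON VIOLATION` clause is given)
- **ON VIOLATION DROP ROW**: Remove the invalid record from the target and count it in the metrics
- **ON VIOLATION FAIL UPDATE**: Fail the pipeline update as soon as a record violates the constraint (strict mode)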
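
The three actions differ only in what happens to a failing row. The following plain-Python sketch reproduces that behaviour on a list of dictionaries; `apply_expectation` and `ExpectationFailed` are hypothetical helpers written for illustration, not part of any Lakeflow API.

```python
class ExpectationFailed(Exception):
    """Raised when a row violates an expectation in 'fail' mode."""


def apply_expectation(rows, name, predicate, on_violation="warn"):
    """Return (kept_rows, metrics) for one expectation applied to rows."""
    if on_violation not in {"warn", "drop", "fail"}:
        raise ValueError(f"unknown action: {on_violation}")
    kept, failed = [], 0
    for row in rows:
        if predicate(row):
            kept.append(row)
            continue
        failed += 1
        if on_violation == "fail":
            raise ExpectationFailed(f"{name} violated by {row}")
        if on_violation == "warn":
            kept.append(row)  # violation is counted, but the row is kept
        # "drop": the row is left out of the result
    metrics = {"name": name, "passed": len(rows) - failed, "failed": failed}
    return kept, metrics


orders = [
    {"order_id": 1, "total_amount": 10.0},
    {"order_id": 2, "total_amount": -5.0},
]


def is_valid(row):
    return row["total_amount"] > 0


warned, warn_metrics = apply_expectation(orders, "valid_amount", is_valid, "warn")
dropped, drop_metrics = apply_expectation(orders, "valid_amount", is_valid, "drop")
print(len(warned), len(dropped), drop_metrics["failed"])  # 2 1 1

try:
    apply_expectation(orders, "valid_amount", is_valid, "fail")
except ExpectationFailed as exc:
    print("update failed:", exc)
```

In all three modes the violation is counted, which is why warn-only expectations are still useful for monitoring data quality.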

**Traditional Implementation (for notebook demo):**

In [0]:
# Example 2.1 - Silver Layer with data quality (traditional approach for demo)

spark.sql(f"USE SCHEMA {SILVER_SCHEMA}")

# Load data from Bronze
orders_bronze_df = spark.table(f"{BRONZE_SCHEMA}.{bronze_table}")

# Silver transformations with validation (Expectations simulation)
orders_silver_df = (
    orders_bronze_df
    # Deduplication
    .dropDuplicates(["order_id"])
    # NOT NULL validation (DROP ROW equivalent)
    .filter(F.col("order_id").isNotNull())
    .filter(F.col("customer_id").isNotNull())
    .filter(F.col("product_id").isNotNull())
    # Business rule validation
    .filter(F.col("total_amount") > 0)
    .filter(F.col("order_datetime").isNotNull())
    # Standardization
    .withColumn("order_date", F.to_date(F.col("order_datetime")))
    .withColumn("order_timestamp", F.to_timestamp(F.col("order_datetime")))
    .withColumn("total_amount", F.col("total_amount").cast("decimal(10,2)"))
    .withColumn("payment_method", F.upper(F.trim(F.col("payment_method"))))
    # Derived columns
    .withColumn("order_status", 
                F.when(F.col("total_amount") > 0, "COMPLETED").otherwise("UNKNOWN"))
    # Silver metadata
    .withColumn("_silver_processed_timestamp", F.current_timestamp())
    .withColumn("_data_quality_flag", F.lit("VALID"))
)

# Quality metrics
bronze_count = orders_bronze_df.count()
silver_count = orders_silver_df.count()
rejected_count = bronze_count - silver_count
rejection_rate = (rejected_count / bronze_count * 100) if bronze_count > 0 else 0

# Data Quality Metrics
display(spark.createDataFrame([
    ("Bronze input", bronze_count),
    ("Silver output", silver_count),
    ("Rejected", rejected_count),
    ("Rejection rate %", round(rejection_rate, 2))
], ["Metric", "Value"]))

# Save to Silver schema
silver_table = "orders_silver"
(
    orders_silver_df
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(silver_table)
)

# Sample Silver data
display(spark.table(silver_table).limit(5))

---

## Section 3: Gold Layer - Business Aggregates

**Theoretical Introduction:**

The Gold layer contains pre-aggregated business metrics, denormalized tables, and KPIs. A MATERIALIZED VIEW with incremental refresh recomputes only the results affected by upstream changes when the query supports it, and falls back to a full recompute otherwise.

**Key Concepts:**
- **Business-level aggregates**: Daily/Monthly summaries, KPIs
- **Denormalization**: Pre-computed joins for performance
- **Incremental refresh**: Recompute only the affected results where possible

**Practical Application:**
- BI dashboards (Power BI, Tableau)
- Executive reporting
- ML feature stores

### Example 3.1: MATERIALIZED VIEW for Gold (Daily Aggregates)

**Objective:** Demonstrate Gold layer with business aggregates and KPIs

**Lakeflow SQL Syntax (Production):**

```sql
-- In production Lakeflow pipeline:
CREATE OR REFRESH MATERIALIZED VIEW gold.daily_order_summary
COMMENT 'Gold layer - daily order summary (KPI)'
AS
SELECT
 order_date,
 order_status,
 -- Volume metrics
 COUNT(order_id) AS total_orders,
 COUNT(DISTINCT customer_id) AS unique_customers,
 -- Revenue metrics
 SUM(total_amount) AS total_revenue,
 AVG(total_amount) AS avg_order_value,
 MIN(total_amount) AS min_order_value,
 MAX(total_amount) AS max_order_value,
 -- Derived KPIs
 ROUND(SUM(total_amount) / COUNT(DISTINCT customer_id), 2) AS revenue_per_customer,
 -- Gold metadata
 current_timestamp() AS _gold_created_timestamp,
 'DAILY' AS _gold_aggregation_level
FROM silver.orders_silver
GROUP BY order_date, order_status
ORDER BY order_date DESC, order_status;
```

**Automatic Dependency:** Lakeflow knows that `gold.daily_order_summary` depends on `silver.orders_silver` → automatic execution order!
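
The KPI definitions in the query can be checked by hand on a tiny dataset. This plain-Python sketch computes the same per-day metrics for four invented orders; the sample rows are hypothetical and exist only to show how `revenue_per_customer` differs from `avg_order_value`.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical Silver rows
orders = [
    {"order_date": "2025-01-01", "order_status": "COMPLETED", "order_id": 1,
     "customer_id": "C1", "total_amount": Decimal("10.00")},
    {"order_date": "2025-01-01", "order_status": "COMPLETED", "order_id": 2,
     "customer_id": "C1", "total_amount": Decimal("30.00")},
    {"order_date": "2025-01-01", "order_status": "COMPLETED", "order_id": 3,
     "customer_id": "C2", "total_amount": Decimal("20.00")},
    {"order_date": "2025-01-02", "order_status": "COMPLETED", "order_id": 4,
     "customer_id": "C2", "total_amount": Decimal("50.00")},
]

# GROUP BY order_date, order_status
groups = defaultdict(list)
for order in orders:
    groups[(order["order_date"], order["order_status"])].append(order)

summary = {}
for group_key, rows in groups.items():
    revenue = sum((r["total_amount"] for r in rows), Decimal("0"))
    customers = {r["customer_id"] for r in rows}  # COUNT(DISTINCT customer_id)
    summary[group_key] = {
        "total_orders": len(rows),
        "unique_customers": len(customers),
        "total_revenue": revenue,
        "avg_order_value": revenue / len(rows),
        "revenue_per_customer": round(revenue / len(customers), 2),
    }

for group_key, metrics in sorted(summary.items()):
    print(group_key, metrics)
```

On 2025-01-01, customer C1 placed two orders, so the average order value (60 / 3 orders) is lower than the revenue per customer (60 / 2 customers).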

**Traditional Implementation:**

In [0]:
# Example 3.1 - Gold Layer (Daily Aggregates)

spark.sql(f"USE SCHEMA {GOLD_SCHEMA}")

# Load data from Silver
orders_silver_df = spark.table(f"{SILVER_SCHEMA}.{silver_table}")

# Gold aggregation: Daily order summary with KPIs
daily_summary_df = (
    orders_silver_df
    .groupBy("order_date", "order_status")
    .agg(
        # Volume metrics
        F.count("order_id").alias("total_orders"),
        F.countDistinct("customer_id").alias("unique_customers"),
        # Revenue metrics
        F.sum("total_amount").alias("total_revenue"),
        F.avg("total_amount").alias("avg_order_value"),
        F.min("total_amount").alias("min_order_value"),
        F.max("total_amount").alias("max_order_value")
    )
    # Derived KPIs
    .withColumn("revenue_per_customer", 
                F.round(F.col("total_revenue") / F.col("unique_customers"), 2))
    # Gold metadata
    .withColumn("_gold_created_timestamp", F.current_timestamp())
    .withColumn("_gold_aggregation_level", F.lit("DAILY"))
    .orderBy("order_date", "order_status")
)

# Summary statistics
total_days = daily_summary_df.select("order_date").distinct().count()
total_orders_gold = daily_summary_df.agg(F.sum("total_orders")).collect()[0][0]
total_revenue_gold = daily_summary_df.agg(F.sum("total_revenue")).collect()[0][0]

# Gold layer summary
display(spark.createDataFrame([
    ("Total days aggregated", str(total_days)),
    ("Total orders", f"{total_orders_gold:,}"),
    ("Total revenue", f"${total_revenue_gold:,.2f}")
], ["Metric", "Value"]))

# Save to Gold schema
gold_table = "daily_order_summary"
(
    daily_summary_df
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(gold_table)
)

# Sample Gold data (daily KPIs)
display(spark.table(gold_table).limit(5))

---

## Section 4: Event Log and Lineage

**Theoretical Introduction:**

Lakeflow automatically logs all pipeline operations to the **event log** (a Delta table). The event log contains:
- Flow progress (success/failure per table)
- Data quality metrics (expectations violations)
- Lineage tracking (source → target)
- Performance metrics (duration, records processed)

**Key Concepts:**
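- **Event Log**: Delta table kept per pipeline. Legacy Hive metastore pipelines store it under `system/events` in the pipeline storage location; Unity Catalog pipelines expose it through the `event_log()` table-valued function
- **Event types**: `flow_definition`, `flow_progress`, `user_action`, and others; expectation results are reported inside `flow_progress` events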
- **Lineage**: Automatic dependency tracking Bronze → Silver → Gold

**Practical Application:**
- Monitoring pipeline health
- Debugging failures
- Data quality reporting
- Audit and compliance

### Example 4.1: Event Log - Monitoring and Lineage

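**Event Log Location (legacy Hive metastore pipelines):**
```
dbfs:/pipelines/<pipeline_id>/system/events
```

For Unity Catalog pipelines, which this training assumes, query the event log with the `event_log('<pipeline_id>')` table-valued function instead of reading a path.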

**Sample Event Log Queries (in production Lakeflow pipeline):**

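```python
# 1. Query the event log
# Path-based access applies to legacy Hive metastore pipelines.
# For Unity Catalog pipelines use the table-valued function instead:
#   event_log_df = spark.sql("SELECT * FROM event_log('<pipeline_id>')")
event_log_path = "dbfs:/pipelines/<pipeline_id>/system/events"
event_log_df = spark.read.format("delta").load(event_log_path)

# `details` is a JSON string, so nested fields are extracted with the `:` path operator.

# 2. Flow progress per table
flow_progress = (
    event_log_df
    .filter("event_type = 'flow_progress'")
    .selectExpr(
        "timestamp",
        "origin.flow_name AS flow_name",
        "details:flow_progress.metrics.num_output_rows AS output_records",
        "details:flow_progress.status AS status",
    )
)

# 3. Expectation violations (reported inside flow_progress events)
expectations_schema = (
    "array<struct<name: string, dataset: string, "
    "passed_records: int, failed_records: int>>"
)
expectations_df = (
    event_log_df
    .filter("event_type = 'flow_progress'")
    .selectExpr(
        "timestamp",
        "explode(from_json(details:flow_progress.data_quality.expectations, "
        f"'{expectations_schema}')) AS expectation",
    )
    .select(
        "timestamp",
        "expectation.dataset",
        "expectation.name",
        "expectation.passed_records",
        "expectation.failed_records",
    )
)

# 4. Lineage tracking
lineage_df = (
    event_log_df
    .filter("event_type = 'flow_definition'")
    .selectExpr(
        "origin.flow_name AS flow_name",
        "details:flow_definition.input_datasets AS input_datasets",
        "details:flow_definition.output_dataset AS output_dataset",
    )
)
```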
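
Data quality reporting usually means turning raw event records into pass rates per expectation. The sketch below does that in plain Python on hand-written sample events. It assumes expectation metrics arrive as a JSON array nested under `flow_progress.data_quality.expectations` in the `details` column; the sample payloads and the `expectation_totals` helper are hypothetical, so check the event log schema for your runtime before relying on the field names.

```python
import json
from collections import Counter


def make_event(event_type, details):
    """Build a sample event whose `details` field is a JSON string."""
    return {"event_type": event_type, "details": json.dumps(details)}


# Hypothetical events: two micro-batches for one expectation plus an unrelated event
sample_events = [
    make_event("flow_progress", {"flow_progress": {"data_quality": {"expectations": [
        {"name": "valid_amount", "dataset": "orders_silver",
         "passed_records": 95, "failed_records": 5}]}}}),
    make_event("flow_progress", {"flow_progress": {"data_quality": {"expectations": [
        {"name": "valid_amount", "dataset": "orders_silver",
         "passed_records": 48, "failed_records": 2}]}}}),
    make_event("user_action", {"user_action": {"action": "START"}}),
]


def expectation_totals(events):
    """Sum passed/failed counts per (dataset, expectation) and derive a pass rate."""
    passed, failed = Counter(), Counter()
    for event in events:
        if event["event_type"] != "flow_progress":
            continue
        progress = json.loads(event["details"]).get("flow_progress", {})
        for exp in progress.get("data_quality", {}).get("expectations", []):
            key = (exp["dataset"], exp["name"])
            passed[key] += exp["passed_records"]
            failed[key] += exp["failed_records"]
    totals = {}
    for key in passed:
        seen = passed[key] + failed[key]
        rate = round(100 * passed[key] / seen, 2) if seen else None
        totals[key] = {"passed": passed[key], "failed": failed[key], "pass_rate": rate}
    return totals


totals = expectation_totals(sample_events)
print(totals[("orders_silver", "valid_amount")])
```

The same aggregation can be written in SQL against the event log; the Python version is shown here because it runs without a pipeline.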

---

## Section 5: SQL vs Python API

**Introduction:**

Lakeflow SDP offers two equivalent APIs: **SQL** and **Python**. The choice depends on team preferences and use case.

### Syntax Comparison

| Aspect | SQL | Python |
|--------|-----|--------|
| **STREAMING TABLE** | `CREATE OR REFRESH STREAMING TABLE` | `@dp.table()` |
| **MATERIALIZED VIEW** | `CREATE OR REFRESH MATERIALIZED VIEW` | `@dp.materialized_view()` |
| **VIEW** | `CREATE TEMPORARY VIEW` | `@dp.view()` / `@dp.temporary_view()` |
| **Expectations** | `CONSTRAINT ... EXPECT ... ON VIOLATION` | `@dp.expect()`, `@dp.expect_or_drop()`, `@dp.expect_or_fail()` |
| **Streaming read** | `FROM STREAM table` | `spark.readStream.table()` |

---

### Example: Same Pipeline in SQL and Python

**SQL Approach:**

```sql
-- Bronze
CREATE OR REFRESH STREAMING TABLE bronze.orders AS
SELECT * FROM STREAM read_files('/path/orders', format => 'json');

-- Silver
CREATE OR REFRESH MATERIALIZED VIEW silver.orders (
 CONSTRAINT valid_amount EXPECT (total_amount > 0) ON VIOLATION DROP ROW
)
AS SELECT 
 order_id, 
 customer_id, 
 to_date(order_datetime) AS order_date, -- needed by the Gold query below
 CAST(total_amount AS DECIMAL(10,2)) AS total_amount
FROM bronze.orders;

-- Gold
CREATE OR REFRESH MATERIALIZED VIEW gold.daily_summary AS
SELECT 
 DATE(order_date) AS date,
 SUM(total_amount) AS revenue
FROM silver.orders
GROUP BY DATE(order_date);
```

**Python Approach (equivalent):**

```python
from pyspark import pipelines as dp
from pyspark.sql import functions as F

# Bronze: streaming table fed by Auto Loader.
# `name` is set explicitly so the table names match the SQL version.
@dp.table(name="bronze.orders", comment="Bronze orders")
def orders_bronze():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/path/orders")
    )

# Silver: rows that violate the expectation are dropped
@dp.materialized_view(name="silver.orders", comment="Silver orders")
@dp.expect_or_drop("valid_amount", "total_amount > 0")
def orders_silver():
    return (
        spark.read.table("bronze.orders")
        .select(
            "order_id",
            "customer_id",
            # order_date is needed by the Gold aggregation below
            F.to_date("order_datetime").alias("order_date"),
            F.col("total_amount").cast("decimal(10,2)").alias("total_amount"),
        )
    )

# Gold: daily revenue
@dp.materialized_view(name="gold.daily_summary", comment="Gold daily summary")
def daily_summary():
    return (
        spark.read.table("silver.orders")
        .groupBy(F.to_date("order_date").alias("date"))
        .agg(F.sum("total_amount").alias("revenue"))
    )
```

---

### When to Use SQL vs Python?

**Use SQL if:**
- Team has strong SQL skills
- Simple transformations (filters, aggregations)
- Integration with BI tools (SQL-native workflows)
- No need for metaprogramming

**Use Python if:**
- You need loops / dynamic table creation
- Complex transformations (Python UDFs, custom libraries)
- Integration with ML workflows
- Testing (unit tests for transformations)

**Best practice:** Mix them! SQL for simple, Python for complex.
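
The "loops / dynamic table creation" case usually follows a factory-function pattern: one function defines a table from a config entry, and a loop calls it once per source. The sketch below shows the pattern in plain Python so it can run anywhere. The `table` decorator here is a hypothetical stand-in that only records functions in a dictionary, and the source list and paths are invented; in a real pipeline the decorator would come from the pipelines module and the function would return a DataFrame.

```python
REGISTRY = {}  # table name -> defining function (stand-in for the pipeline graph)


def table(name):
    """Hypothetical stand-in for a pipeline decorator: registers func under name."""
    def register(func):
        REGISTRY[name] = func
        return func
    return register


# Hypothetical ingestion config
SOURCES = [
    {"table": "orders", "path": "/path/orders", "format": "json"},
    {"table": "customers", "path": "/path/customers", "format": "csv"},
]


def define_bronze_table(source):
    """Define one Bronze table from a config entry.

    Using a factory function gives each table its own `source` value.
    Defining the inner function directly in the loop would make every
    table see the last loop value (late-binding closures).
    """
    @table(name=f"bronze.{source['table']}")
    def load():
        # A real pipeline function would return a streaming DataFrame here.
        return f"read {source['format']} files from {source['path']}"

    return load


for source in SOURCES:
    define_bronze_table(source)

for name, func in REGISTRY.items():
    print(name, "->", func())
```

Adding a new source then means adding one config entry rather than writing another table definition.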

---

## Resource Cleanup

Clean up resources created during the notebook:

In [0]:
# Optional cleanup of test resources
# WARNING: Run only if you want to delete all created data

# Delete Demo tables
# spark.sql(f"DROP TABLE IF EXISTS {BRONZE_SCHEMA}.{bronze_table}")
# spark.sql(f"DROP TABLE IF EXISTS {SILVER_SCHEMA}.{silver_table}")
# spark.sql(f"DROP TABLE IF EXISTS {GOLD_SCHEMA}.{gold_table}")

# Clear cache
# spark.catalog.clearCache()

displayHTML("<p> To delete tables, uncomment the code above and run the cell</p>")