# Workshop 3: Lakeflow Pipelines

## The Story

Our retail company, "BikeSuperstore", is modernizing its data platform. We have raw CSV files landing in our data lake (Orders, Customers, Products), and the business intelligence team needs a reliable, up-to-date **Star Schema** to report on sales performance.

**The Challenge:**
Traditional ETL pipelines are brittle and hard to maintain. We need to use **Databricks Lakeflow (Spark Declarative Pipelines)** to build a robust system that handles:
1.  **Ingestion**: Automatically loading new files.
2.  **History**: Tracking changes in customer data (SCD Type 2).
3.  **Quality**: Enforcing data quality rules.
4.  **Modeling**: Creating a Gold layer with a Star Schema.

**Your Mission:**
You will implement the pipeline definition. You can choose to work in **SQL** or **Python** (or both!). The code provided has some "blanks" that you need to fill in to make the pipeline work.

In [None]:
%run ../00_setup

## Business Logic & Architecture

We are following the **Medallion Architecture**:

1.  **Bronze Layer (Raw)**:
    *   Ingest CSV files from `Customers`, `Product`, `ProductCategory`, `SalesOrderHeader`, `SalesOrderDetail`.
    *   Use **Auto Loader** (`cloud_files`) for efficient incremental ingestion.

2.  **Silver Layer (Cleaned & Enriched)**:
    *   **Customers**: Apply **SCD Type 2** (Slowly Changing Dimensions) to track history based on `ModifiedDate`.
    *   **Products & Categories**: Apply **SCD Type 1** (Overwrite) to keep the latest info.
    *   **Orders**: Clean data and apply **Data Quality Expectations** (e.g., `TotalDue > 0`).

3.  **Gold Layer (Star Schema)**:
    *   **Fact Table**: `fact_sales` (Transactions).
    *   **Dimensions**:
        *   `dim_customer`: Current customer details.
        *   `dim_product`: Product details enriched with Category names.
        *   `dim_date`: A date dimension derived from the order dates in `silver_orders`.

![assets/images/ibDi2xO_FDNG-IhAmH5CO.png](../../assets/images/ibDi2xO_FDNG-IhAmH5CO.png)
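
To make the difference between the two SCD strategies concrete, here is a minimal plain-Python sketch of what the Silver layer does with change records. It is an illustration only, not the Lakeflow API: the helper names `apply_scd1` and `apply_scd2` and the sample rows are hypothetical, and real pipelines handle more cases (deletes, late-arriving data). The `__START_AT` and `__END_AT` names mirror the bookkeeping columns that an SCD Type 2 target exposes.

```python
def apply_scd1(changes, key, sequence_by):
    """SCD Type 1: keep only the latest version of each key (overwrite)."""
    current = {}
    for row in sorted(changes, key=lambda r: r[sequence_by]):
        current[row[key]] = row
    return current


def apply_scd2(changes, key, sequence_by):
    """SCD Type 2: keep every version and close the previous one on change."""
    history = []
    open_versions = {}  # key value -> version that is currently open
    for row in sorted(changes, key=lambda r: r[sequence_by]):
        previous = open_versions.get(row[key])
        if previous is not None:
            previous["__END_AT"] = row[sequence_by]  # close the old version
        version = {**row, "__START_AT": row[sequence_by], "__END_AT": None}
        history.append(version)
        open_versions[row[key]] = version
    return history


# Hypothetical change feed: customer 1 changes their email address once.
changes = [
    {"CustomerID": 1, "EmailAddress": "a@old.example", "ModifiedDate": "2024-01-01"},
    {"CustomerID": 2, "EmailAddress": "b@example.com", "ModifiedDate": "2024-01-05"},
    {"CustomerID": 1, "EmailAddress": "a@new.example", "ModifiedDate": "2024-03-01"},
]

latest = apply_scd1(changes, "CustomerID", "ModifiedDate")
history = apply_scd2(changes, "CustomerID", "ModifiedDate")

# "Current" records are the ones whose version is still open.
current_rows = [r for r in history if r["__END_AT"] is None]
```

With Type 1 the old email address is gone; with Type 2 it survives as a closed row, which is why a dimension that should only show current customers has to filter on the open versions.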

## Choose Your Weapon: SQL or Python?

Lakeflow Spark Declarative Pipelines (SDP) supports both languages.
*   **Section A**: SQL Implementation (Standard for analysts/engineers).
*   **Section B**: Python Implementation (Great for complex logic and metaprogramming).


Both sections read the input location from the `source_path` pipeline configuration parameter:

*   **source_path**: `/Volumes/ecommerce_platform_<your_user_name>/default/datasets/workshop`

---

# SECTION A: SQL Implementation

In this section, you will complete the SQL DDL statements to define the pipeline.

In [None]:
-- ============================================================
-- 1. BRONZE LAYER: Ingestion
-- ============================================================
-- TODO: Complete the Auto Loader syntax to read CSV files.
-- Hint: Use cloud_files function. We need to infer schema and read headers.

-- Bronze Customers
CREATE OR REFRESH STREAMING TABLE bronze_customers
COMMENT 'Raw customers data from CSV'
AS SELECT * FROM ___('${source_path}/Customers', 'csv', map("header", "true", "inferSchema", "___"));

-- Bronze Products
CREATE OR REFRESH STREAMING TABLE bronze_products
COMMENT 'Raw products data from CSV'
AS SELECT * FROM cloud_files('${source_path}/___', 'csv', map("header", "true", "inferSchema", "___"));

-- [TASK] Complete the code for Product Categories
-- Bronze Product Categories
CREATE OR REFRESH STREAMING TABLE bronze_product_categories
COMMENT 'Raw product categories data from CSV'
AS SELECT * FROM cloud_files('${source_path}/ProductCategory', '___', map(___));

-- [TASK] Complete the code for Orders Header
-- Bronze Orders Header
CREATE OR REFRESH STREAMING TABLE bronze_orders_header
COMMENT 'Raw orders data from CSV'
AS SELECT * FROM cloud_files('${source_path}/SalesOrderHeader', '___', map(___));

-- [TASK] Complete the code for Orders Detail
-- Bronze Orders Detail
CREATE OR REFRESH STREAMING TABLE bronze_orders_detail
COMMENT 'Raw orders data from CSV'
AS SELECT * FROM ___('${source_path}/___', 'csv', map("header", "true", "inferSchema", "true"));

In [None]:
-- ============================================================
-- 2. SILVER LAYER: SCD & Quality
-- ============================================================

-- [TASK] Implement SCD Type 2 for Customers
-- We need to track history using 'ModifiedDate'
CREATE OR REFRESH STREAMING TABLE silver_customers;

AUTO CDC INTO silver_customers
FROM bronze_customers
KEYS (CustomerID)
SEQUENCE BY ___
STORED AS SCD TYPE ___;

-- [TASK] Implement SCD Type 1 for Products
CREATE OR REFRESH STREAMING TABLE silver_products;

AUTO CDC INTO silver_products
FROM bronze_products
KEYS (___)
SEQUENCE BY ___
STORED AS SCD TYPE 1;

-- [TASK] Implement SCD Type 1 for Product Categories
CREATE OR REFRESH STREAMING TABLE silver_product_categories;

AUTO CDC INTO silver_product_categories
FROM bronze_product_categories
KEYS (ProductCategoryID)
SEQUENCE BY ___
STORED AS SCD TYPE ___;

-- [TASK] Add Data Quality Expectations for Orders
-- Ensure TotalDue is greater than 0 (DROP ROW on violation)
-- Ensure CustomerID is not null (FAIL UPDATE on violation)
CREATE OR REFRESH STREAMING TABLE silver_orders
(
  CONSTRAINT valid_amount EXPECT (___) ON VIOLATION DROP ROW,
  CONSTRAINT valid_customer EXPECT (CustomerID IS NOT NULL) ON VIOLATION ___
)
AS SELECT 
  SalesOrderID, CustomerID, TotalDue, OrderDate, Status, current_timestamp() as processed_at
FROM STREAM(bronze_orders_header);

-- [TASK] Create Silver Order Details
-- Just a simple pass-through with some cleaning if needed
CREATE OR REFRESH STREAMING TABLE silver_order_details
AS SELECT 
  SalesOrderID, ___, OrderQty, ProductID, UnitPrice, LineTotal, ModifiedDate
FROM STREAM(___);

In [None]:
-- ============================================================
-- 3. GOLD LAYER: Star Schema
-- ============================================================

-- [TASK] Create the Customer Dimension
-- Filter out old records (SCD Type 2) using __END_AT
CREATE OR REFRESH MATERIALIZED VIEW dim_customer
AS SELECT 
  CustomerID, FirstName, LastName, EmailAddress, Phone
FROM ___
WHERE ___ IS NULL;

-- [TASK] Create the Product Dimension
-- Join silver_products with silver_product_categories to get CategoryName
CREATE OR REFRESH MATERIALIZED VIEW dim_product
AS SELECT 
  p.ProductID,
  p.Name as ProductName,
  pc.Name as CategoryName
FROM silver_products p
LEFT JOIN ___ pc ON p.ProductCategoryID = ___;

-- [TASK] Create the Date Dimension
-- Extract Year, Month, Day from OrderDate in silver_orders
CREATE OR REFRESH MATERIALIZED VIEW dim_date
AS SELECT ___
  cast(OrderDate as date) as DateKey,
  year(OrderDate) as Year,
  month(OrderDate) as Month,
  ___(OrderDate) as Day
FROM silver_orders;

-- [TASK] Create the Fact Table
-- Join Order Headers and Details
CREATE OR REFRESH MATERIALIZED VIEW fact_sales
AS SELECT 
  od.SalesOrderID,
  oh.OrderDate,
  oh.CustomerID,
  od.ProductID,
  od.LineTotal
FROM silver_order_details od
JOIN silver_orders oh ON ___ = ___;

# SECTION B: Python Implementation

In this section, you will use the `pyspark.pipelines` API (imported as `dp`) to define the same logic.

In [None]:
import pyspark.pipelines as dp
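# Import only the functions this notebook uses instead of a wildcard import
from pyspark.sql.functions import col, dayofmonth, month, year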

source_path = spark.conf.get("source_path")

# ============================================================
# 1. BRONZE LAYER
# ============================================================

# Bronze Customers
dp.create_streaming_table(name="bronze_customers")

@dp.append_flow(target="bronze_customers")
def bronze_customers_flow():
    return (
        spark.readStream.format("___")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load(f"{source_path}/___")
    )

# [TASK] Bronze Products
dp.create_streaming_table(name="bronze_products")

@dp.append_flow(target="bronze_products")
def bronze_products_flow():
    return (
        spark.readStream.format("___")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load(f"{source_path}/___")
    )

# [TASK] Bronze Product Categories
dp.create_streaming_table(name="bronze_product_categories")

@dp.append_flow(target="bronze_product_categories")
def bronze_product_categories_flow():
    return (
        spark.readStream.format("___")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load(f"{source_path}/___")
    )

# [TASK] Complete the Bronze Orders Header flow
dp.create_streaming_table(name="bronze_orders_header")

@dp.append_flow(target="bronze_orders_header")
def bronze_orders_header_flow():
    return (
        spark.readStream.format("___") # Hint: Auto Loader format
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .option("inferSchema", "true")  # match the other Bronze flows and the SQL version
        .load(f"{source_path}/___")
    )

# [TASK] Bronze Orders Detail
dp.create_streaming_table(name="bronze_orders_detail")

@dp.append_flow(target="bronze_orders_detail")
def bronze_orders_detail_flow():
    return (
        spark.readStream.format("___")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .option("inferSchema", "true")  # match the other Bronze flows and the SQL version
        .load(f"{source_path}/___")
    )
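
The reason Auto Loader suits the Bronze layer is that it ingests incrementally: it remembers which files it has already processed, so each pipeline update only picks up new arrivals. The toy model below illustrates that idea in plain Python. It is not how Auto Loader is implemented, and the `IncrementalLoader` class and the file names are hypothetical.

```python
class IncrementalLoader:
    """Toy model of incremental file ingestion: each file is processed once."""

    def __init__(self):
        # Stands in for the state a streaming source keeps between runs.
        self._seen = set()

    def load_new(self, listing):
        """Return only the files that have not been processed before."""
        new_files = [path for path in listing if path not in self._seen]
        self._seen.update(new_files)
        return new_files


loader = IncrementalLoader()

# First run: both files are new.
first_run = loader.load_new(["Customers/part-001.csv", "Customers/part-002.csv"])

# Second run: one more file has landed; only that one is picked up.
second_run = loader.load_new(
    ["Customers/part-001.csv", "Customers/part-002.csv", "Customers/part-003.csv"]
)
```

Here `first_run` contains both initial files, while `second_run` contains only `Customers/part-003.csv`.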

In [None]:
# ============================================================
# 2. SILVER LAYER
# ============================================================

# [TASK] Define SCD Type 2 for Customers
dp.create_streaming_table(name="silver_customers")

dp.create_auto_cdc_flow(
    target="silver_customers",
    source="bronze_customers",
    keys=["___"],
    sequence_by=col("___"), # What column tracks the change time?
    stored_as_scd_type="___" # Type 1 or 2?
)

# [TASK] Define SCD Type 1 for Products
dp.create_streaming_table(name="silver_products")

dp.create_auto_cdc_flow(
    target="silver_products",
    source="bronze_products",
    keys=["___"],
    sequence_by=col("___"),
    stored_as_scd_type="1"
)

# [TASK] Define SCD Type 1 for Product Categories
dp.create_streaming_table(name="silver_product_categories")

dp.create_auto_cdc_flow(
    target="silver_product_categories",
    source="bronze_product_categories",
    keys=["___"],
    sequence_by=col("___"),
    stored_as_scd_type="1"
)

# [TASK] Define Data Quality for Orders
# Expectations are applied as decorators on the table function
@dp.table(name="silver_orders")
@dp.expect_all_or_drop({"valid_amount": "___ > 0"})
@dp.expect_all_or_fail({"valid_customer": "___"}) # Check for not null
def silver_orders():
    return (
        spark.readStream.table("bronze_orders_header")
        .select("SalesOrderID", "CustomerID", "TotalDue", "OrderDate", "Status")
    )

# [TASK] Silver Order Details
@dp.table(name="silver_order_details")
def silver_order_details():
    return (
        spark.readStream.table("___")
        .select("SalesOrderID", "___", "OrderQty", "ProductID", "UnitPrice", "LineTotal")
    )
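
The two violation policies used for `silver_orders` behave very differently: a "drop" rule quietly removes the offending row, while a "fail" rule stops the whole update. The plain-Python sketch below mimics that behaviour so you can see the difference. It is not the Lakeflow API; `apply_expectations`, `ExpectationFailed` and the sample orders are hypothetical.

```python
class ExpectationFailed(Exception):
    """Raised when a fail-on-violation rule is broken."""


def apply_expectations(rows, drop_rules, fail_rules):
    """Return (kept_rows, dropped_count); raise if a fail rule is violated."""
    kept, dropped = [], 0
    for row in rows:
        # Fail rules abort the whole update as soon as one row violates them.
        for name, rule in fail_rules.items():
            if not rule(row):
                raise ExpectationFailed(f"{name} violated by {row}")
        # Drop rules only remove the offending row.
        if all(rule(row) for rule in drop_rules.values()):
            kept.append(row)
        else:
            dropped += 1
    return kept, dropped


drop_rules = {"valid_amount": lambda r: r["TotalDue"] > 0}
fail_rules = {"valid_customer": lambda r: r["CustomerID"] is not None}

orders = [
    {"SalesOrderID": 1, "CustomerID": 10, "TotalDue": 250.0},
    {"SalesOrderID": 2, "CustomerID": 11, "TotalDue": 0.0},  # violates valid_amount
]
kept, dropped = apply_expectations(orders, drop_rules, fail_rules)

# A missing CustomerID stops processing instead of dropping the row.
try:
    apply_expectations(
        [{"SalesOrderID": 3, "CustomerID": None, "TotalDue": 5.0}],
        drop_rules,
        fail_rules,
    )
    failed = False
except ExpectationFailed:
    failed = True
```

Choosing between the two is a business decision: dropping keeps the pipeline running at the cost of missing rows, failing protects downstream tables at the cost of availability.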

In [None]:
# ============================================================
# 3. GOLD LAYER
# ============================================================

# [TASK] Create the Customer Dimension
@dp.materialized_view(name="dim_customer")
def dim_customer():
    return (
        spark.read.table("___")
        .filter(col("___").isNull()) # Filter for current records
        .select("CustomerID", "FirstName", "LastName", "EmailAddress")
    )

# [TASK] Create the Product Dimension
@dp.materialized_view(name="dim_product")
def dim_product():
    p = spark.read.table("___")
    pc = spark.read.table("silver_product_categories")
    
    return (
        p.join(pc, "___", "left")
        .select(
            p["ProductID"],
            p["Name"].alias("ProductName"),
            pc["Name"].alias("CategoryName")
        )
    )

# [TASK] Create the Date Dimension
@dp.materialized_view(name="dim_date")
def dim_date():
    return (
        spark.read.table("___")
        .select(col("OrderDate").cast("date").alias("DateKey"))
        .___()
        .select(
            col("DateKey"),
            year("DateKey").alias("Year"),
            month("DateKey").alias("Month"),
            dayofmonth("DateKey").alias("Day")
        )
    )

# [TASK] Create the Fact Sales Materialized View
@dp.materialized_view(name="fact_sales")
def fact_sales():
    od = spark.read.table("___")
    oh = spark.read.table("silver_orders")
    
    # Perform the join
    return (
        od.join(oh, "___", "inner") # Join key?
        .select(
            od["SalesOrderID"],
            oh["OrderDate"],
            oh["CustomerID"],
            od["ProductID"],  # needed to join to dim_product, as in the SQL version
            od["LineTotal"]
        )
    )
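
Once the Gold layer exists, the BI team answers questions by joining the fact table to its dimensions. The sketch below shows that pattern in plain Python with hypothetical sample rows. It assumes `fact_sales` carries `ProductID`, as the SQL version of the fact table does, and the `sales_by_category` helper is made up for illustration.

```python
from collections import defaultdict

# Hypothetical dimension rows, keyed by ProductID.
dim_product = {
    101: {"ProductName": "Road Bike", "CategoryName": "Bikes"},
    102: {"ProductName": "Helmet", "CategoryName": "Accessories"},
}

# Hypothetical fact rows: one per order line.
fact_sales = [
    {"SalesOrderID": 1, "ProductID": 101, "LineTotal": 1200.0},
    {"SalesOrderID": 1, "ProductID": 102, "LineTotal": 50.0},
    {"SalesOrderID": 2, "ProductID": 101, "LineTotal": 1100.0},
]


def sales_by_category(facts, products):
    """Join each fact row to its product and sum LineTotal per category."""
    totals = defaultdict(float)
    for row in facts:
        category = products[row["ProductID"]]["CategoryName"]
        totals[category] += row["LineTotal"]
    return dict(totals)


totals = sales_by_category(fact_sales, dim_product)
```

The fact table stays narrow (keys and measures), and descriptive attributes such as `CategoryName` live in the dimension, which is the point of a Star Schema.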

## Orchestration & Deployment

In a real-world scenario, you wouldn't run these cells interactively to process data. Instead:

1.  You commit these files (`pipeline.py` or `.sql` files) to a Git repository.
2.  You create a **Lakeflow Spark Declarative Pipelines** pipeline (the product formerly known as Delta Live Tables, DLT) in your Databricks workspace.
3.  You point the pipeline to your source code.
4.  Databricks handles the orchestration, retries, and scaling.
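
As a rough sketch of what step 2 involves, the snippet below builds a pipeline settings document as a Python dictionary and prints it as JSON. Treat it as an assumption-laden illustration: the field names (`catalog`, `schema`, `libraries`, `configuration`, `continuous`) follow the pipeline settings JSON as commonly documented, but verify them against your workspace. The pipeline name, the user name and the choice of publishing to the same catalog that holds the source volume are hypothetical. The part to notice is `configuration`, which is where the `source_path` value read by `spark.conf.get("source_path")` and `${source_path}` comes from.

```python
import json


def build_pipeline_settings(user_name, source_file):
    """Assemble a pipeline settings dictionary (illustrative field names)."""
    catalog = f"ecommerce_platform_{user_name}"
    return {
        "name": f"lakeflow_workshop_{user_name}",  # hypothetical pipeline name
        "catalog": catalog,                        # assumed target catalog
        "schema": "default",                       # assumed target schema
        "libraries": [{"file": {"path": source_file}}],
        "configuration": {
            # Exposed to the pipeline code as the source_path parameter.
            "source_path": f"/Volumes/{catalog}/default/datasets/workshop",
        },
        "continuous": False,  # run as a triggered update, not continuously
    }


settings = build_pipeline_settings(
    "jane_doe",  # hypothetical user name
    "lakeflow/lakeflow_workshop/python/pipeline.py",
)
print(json.dumps(settings, indent=2))
```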

## Stuck?

If you are stuck or want to see the full, working solution, check the files in the `lakeflow/lakeflow_workshop` folder:
*   `lakeflow/lakeflow_workshop/sql/` for the SQL solution.
*   `lakeflow/lakeflow_workshop/python/pipeline.py` for the Python solution.