# LAB 06: Advanced Transforms -- PySpark & SQL

**Duration:** ~40 min | **Day:** 2 | **Difficulty:** Intermediate-Advanced

> *"Build analytical reports using window functions, CTEs, explode, and CTAS."*

## Setup

In [None]:
%run ../../setup/00_setup

In [None]:
# Note: this import shadows Python's built-in sum() with Spark's aggregate function
from pyspark.sql.functions import col, sum, count, desc, row_number, rank, dense_rank, lag, lead, explode, from_json
from pyspark.sql.window import Window

# Load base data
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{GOLD_SCHEMA}")

df_orders = spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.orders")
df_customers = spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers")
df_products = spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.products")

# Register as temp views for SQL tasks
df_orders.createOrReplaceTempView("orders")
df_customers.createOrReplaceTempView("customers")
df_products.createOrReplaceTempView("products")

print(f"Data loaded: {df_orders.count()} orders, {df_customers.count()} customers, {df_products.count()} products")

---
## Task 1: Window Function -- Rank Products by Revenue (PySpark)

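For each product, calculate total revenue, then rank the products from highest to lowest revenue using `row_number()`.

Hint: Use `Window.orderBy(desc("total_revenue"))`. A window without `partitionBy` moves all rows to a single partition, which is acceptable here because the input is already aggregated to one row per product.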

In [None]:
# TODO: Calculate total revenue per product and rank them
df_product_revenue = (
    df_orders
    .groupBy("product_id")
    .agg(sum("total_price").alias("total_revenue"))
)

window_spec = Window.orderBy(________("total_revenue"))

df_ranked = (
    df_product_revenue
    .withColumn("rank", ________(________))
)

display(df_ranked.limit(10))

In [None]:
# -- Validation --
assert "rank" in df_ranked.columns, "Missing 'rank' column"
first = df_ranked.orderBy("rank").first()
assert first["rank"] == 1, "First row should have rank 1"
print(f"Task 1 OK: Top product has revenue {first['total_revenue']:.2f}")

---
## Task 2: Running Total (SQL)

Write a SQL query that computes a running total of `total_price` for each customer, ordered by `order_date`.
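
The explicit `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` frame in the query matters when several rows share the same ordering value. When an `ORDER BY` is given without a frame, the default is `RANGE`-based, so tied rows all receive the same total. The sketch below illustrates the difference. It uses Python's built-in `sqlite3` as a stand-in for Spark SQL (an assumption, so that it runs without a cluster), and the `payments` table, its columns, and the `running_totals` helper are hypothetical.

```python
import sqlite3

# Stand-in engine: in-memory SQLite (window functions need SQLite 3.25+).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (account TEXT, day INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?)",
    [("a", 1, 10.0), ("a", 1, 20.0), ("a", 2, 5.0)],  # two rows share day 1
)

def running_totals(frame_clause):
    """Return (day, running_total) pairs computed with the given frame clause."""
    query = f"""
        SELECT day,
               SUM(amount) OVER (
                   PARTITION BY account
                   ORDER BY day
                   {frame_clause}
               ) AS running_total
        FROM payments
        ORDER BY day, running_total
    """
    return conn.execute(query).fetchall()

# Row-based frame: each row adds only itself to the total so far.
rows_frame = running_totals("ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW")
# No frame clause: the default frame also includes rows tied on `day`.
default_frame = running_totals("")

print(rows_frame)     # the first total is 10.0 or 20.0, depending on tie order
print(default_frame)  # [(1, 30.0), (1, 30.0), (2, 35.0)]
```

With the row-based frame the two day-1 rows get different totals, but which one comes first is not defined unless the `ORDER BY` includes a tie-breaking column.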

In [None]:
# TODO: Complete the SQL window function
df_running = spark.sql("""
    SELECT 
        customer_id,
        order_date,
        total_price,
        SUM(total_price) OVER (
            PARTITION BY ________
            ORDER BY ________
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS running_total
    FROM orders
    ORDER BY customer_id, order_date
""")

display(df_running.limit(20))

In [None]:
# -- Validation --
assert "running_total" in df_running.columns, "Missing 'running_total' column"
print(f"Task 2 OK: Running totals computed for {df_running.select('customer_id').distinct().count()} customers")

---
## Task 3: Multi-step CTE

Write a SQL query with two CTEs to find the top 5 days by total revenue:
1. `daily_sales` -- total revenue per day
2. `ranked_days` -- rank days by revenue
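
A chain of CTEs reads top to bottom: each CTE can query the ones defined before it, which makes it a named alternative to nesting subqueries. The sketch below shows that equivalence on unrelated toy data. It uses Python's built-in `sqlite3` as a stand-in for Spark SQL (an assumption), and the `scores` table and CTE names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (team TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("red", 40), ("red", 70), ("blue", 30), ("blue", 45), ("green", 90)],
)

# Two-step CTE: the second CTE reads from the first by name.
cte_rows = conn.execute("""
    WITH team_best AS (
        SELECT team, MAX(score) AS best
        FROM scores
        GROUP BY team
    ),
    qualified AS (
        SELECT team, best
        FROM team_best
        WHERE best >= 50
    )
    SELECT team, best FROM qualified ORDER BY team
""").fetchall()

# The same logic written as nested subqueries, read inside out.
nested_rows = conn.execute("""
    SELECT team, best
    FROM (
        SELECT team, best
        FROM (
            SELECT team, MAX(score) AS best
            FROM scores
            GROUP BY team
        ) AS team_best
        WHERE best >= 50
    ) AS qualified
    ORDER BY team
""").fetchall()

print(cte_rows)     # [('green', 90), ('red', 70)]
print(nested_rows)  # [('green', 90), ('red', 70)]
```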

In [None]:
# TODO: Complete the CTE query
df_top_days = spark.sql("""
    WITH daily_sales AS (
        SELECT 
            order_date,
            ________(total_price) AS daily_revenue,
            ________(*)          AS order_count
        FROM orders
        GROUP BY order_date
    ),
    ranked_days AS (
        SELECT *,
            ROW_NUMBER() OVER (ORDER BY daily_revenue ________) AS day_rank
        FROM daily_sales
    )
    SELECT * FROM ranked_days
    WHERE day_rank <= 5
    ORDER BY day_rank
""")

display(df_top_days)

In [None]:
# -- Validation --
assert df_top_days.count() <= 5, "Should return at most 5 rows"
assert df_top_days.first()["day_rank"] == 1, "First row should be rank 1"
print(f"Task 3 OK: Top {df_top_days.count()} days by revenue")

---
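## Task 4: Scalar Subquery

Find customers whose total spending is above the average total spending per customer. The subquery is uncorrelated: it does not reference the outer query, so it yields a single value for the whole comparison.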
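
To make the distinction between an uncorrelated scalar subquery and a correlated one concrete, here is a small sketch on toy data. It uses Python's built-in `sqlite3` as a stand-in for Spark SQL (an assumption), and the `sales` table, its columns, and the aliases `s` and `peers` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (rep TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("ann", "east", 100.0),
        ("ann", "east", 300.0),
        ("bob", "east", 50.0),
        ("cat", "west", 500.0),
        ("dan", "west", 900.0),
    ],
)

# Uncorrelated scalar subquery: one overall average (370.0) for every row.
uncorrelated = conn.execute("""
    SELECT rep, amount
    FROM sales
    WHERE amount > (SELECT AVG(amount) FROM sales)
    ORDER BY rep, amount
""").fetchall()

# Correlated subquery: the inner query references the outer row's region,
# so the average differs per region (east 150.0, west 700.0).
correlated = conn.execute("""
    SELECT rep, amount
    FROM sales AS s
    WHERE amount > (
        SELECT AVG(amount)
        FROM sales AS peers
        WHERE peers.region = s.region
    )
    ORDER BY rep, amount
""").fetchall()

print(uncorrelated)  # [('cat', 500.0), ('dan', 900.0)]
print(correlated)    # [('ann', 300.0), ('dan', 900.0)]
```

The query in this task follows the first pattern: its inner query never refers to a column of the outer query.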

In [None]:
# TODO: Fill in the aggregate function in the subquery
df_high_spenders = spark.sql("""
    SELECT customer_id, SUM(total_price) AS total_spent
    FROM orders
    GROUP BY customer_id
    HAVING SUM(total_price) > (
        SELECT ________(total_spent) FROM (
            SELECT customer_id, SUM(total_price) AS total_spent
            FROM orders
            GROUP BY customer_id
        )
    )
    ORDER BY total_spent DESC
""")

display(df_high_spenders)

In [None]:
# -- Validation --
assert df_high_spenders.count() > 0, "Should find at least some high spenders"
print(f"Task 4 OK: {df_high_spenders.count()} customers above average spending")

---
## Task 5: Explode Array Column

Using the sample DataFrame below, which has an array column, apply `explode()` so that each array element becomes its own row.
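
One detail worth knowing is that `explode()` drops rows whose array is empty or null, while `explode_outer()` keeps them with a null element. The plain-Python sketch below mimics that behavior on a list of tuples. It is not PySpark code, and the helper names `explode_like` and `explode_outer_like` are hypothetical.

```python
def explode_like(rows):
    """Mimic explode(): one output row per element; empty or None arrays vanish."""
    return [(key, item) for key, items in rows for item in (items or [])]

def explode_outer_like(rows):
    """Mimic explode_outer(): empty or None arrays yield one (key, None) row."""
    return [(key, item) for key, items in rows for item in (items or [None])]

rows = [(1, ["a", "b"]), (2, []), (3, None)]

print(explode_like(rows))        # [(1, 'a'), (1, 'b')]
print(explode_outer_like(rows))  # [(1, 'a'), (1, 'b'), (2, None), (3, None)]
```

The sample data in this task has no empty or null arrays, so both functions would return the same six rows.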

In [None]:
from pyspark.sql.functions import col, explode  # re-imported so this cell runs on its own

# Sample data with array column
df_with_array = spark.createDataFrame([
    (1, ["Electronics", "Books", "Clothing"]),
    (2, ["Food", "Electronics"]),
    (3, ["Books"])
], ["customer_id", "categories"])

# TODO: Explode the categories array into individual rows
df_exploded = df_with_array.select(
    "customer_id",
    ________(col("categories")).alias("category")
)

display(df_exploded)

In [None]:
# -- Validation --
assert df_exploded.count() == 6, f"Expected 6 rows after explode, got {df_exploded.count()}"
assert "category" in df_exploded.columns, "Missing 'category' column"
print(f"Task 5 OK: Exploded {df_with_array.count()} rows into {df_exploded.count()} rows")

---
## Task 6: CTAS -- Create Gold Tables

Use `CREATE TABLE AS SELECT` (CTAS) to persist the top 10 products by revenue as a Gold table.
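
CTAS runs the `SELECT` once and stores its result as a new table, so later changes to the source are not reflected until the table is rebuilt. The sketch below illustrates this snapshot behavior. It uses Python's built-in `sqlite3` as a stand-in for Spark SQL (an assumption, so nothing here involves Delta), and the `events` and `kind_totals` tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (kind TEXT, n INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("click", 3), ("view", 5), ("click", 4)],
)

# CTAS: the query result is materialized as a new table.
conn.execute("""
    CREATE TABLE kind_totals AS
    SELECT kind, SUM(n) AS total
    FROM events
    GROUP BY kind
""")

# A later insert into the source does not change the stored result.
conn.execute("INSERT INTO events VALUES ('view', 100)")

snapshot = conn.execute(
    "SELECT kind, total FROM kind_totals ORDER BY kind"
).fetchall()
live = conn.execute(
    "SELECT kind, SUM(n) FROM events GROUP BY kind ORDER BY kind"
).fetchall()

print(snapshot)  # [('click', 7), ('view', 5)]
print(live)      # [('click', 7), ('view', 105)]
```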

In [None]:
# TODO: Create gold table using CTAS
gold_table = f"{CATALOG}.{GOLD_SCHEMA}.top_products"
spark.sql(f"DROP TABLE IF EXISTS {gold_table}")

spark.sql(f"""
    ________ {gold_table} AS
    SELECT 
        product_id,
        SUM(total_price) AS total_revenue,
        COUNT(*) AS order_count
    FROM {CATALOG}.{BRONZE_SCHEMA}.orders
    GROUP BY product_id
    ORDER BY total_revenue DESC
    LIMIT 10
""")

display(spark.table(gold_table))

In [None]:
# -- Validation --
gold_count = spark.table(gold_table).count()
assert 0 < gold_count <= 10, f"Expected 1-10 rows, got {gold_count}"
detail = spark.sql(f"DESCRIBE DETAIL {gold_table}").first()
assert detail["format"] == "delta", "CTAS should create a Delta table"
print(f"Task 6 OK: Gold table '{gold_table}' created with {gold_count} rows")

---
## Lab Complete!

You have:
- Applied window functions (row_number, running SUM) in PySpark and SQL
- Written multi-step CTEs for complex analytics
- Used subqueries to filter by aggregate conditions
- Flattened arrays with explode()
- Created Gold tables using CTAS

> **Exam Tip:** Know the difference: `ROW_NUMBER()` always assigns unique sequential numbers, even to tied rows. `RANK()` assigns tied rows the same number and leaves gaps after them. `DENSE_RANK()` assigns tied rows the same number and leaves no gaps.
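
A quick way to see the three ranking functions side by side is to run them over data that contains a tie. The sketch below uses Python's built-in `sqlite3` as a stand-in for Spark SQL (an assumption), and the `scores` table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("a", 50), ("b", 40), ("c", 40), ("d", 30)],  # b and c are tied
)

rows = conn.execute("""
    SELECT points,
           ROW_NUMBER() OVER (ORDER BY points DESC) AS row_num,
           RANK()       OVER (ORDER BY points DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY points DESC) AS dense_rnk
    FROM scores
    ORDER BY row_num
""").fetchall()

for row in rows:
    print(row)
# (50, 1, 1, 1)
# (40, 2, 2, 2)
# (40, 3, 2, 2)
# (30, 4, 4, 3)
```

The last row shows the gap: `RANK()` skips to 4 after the tie, while `DENSE_RANK()` continues with 3.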

> **Next:** LAB 07 - Build a Medallion Pipeline in Lakeflow