# LAB 04: Delta Lake Optimization

**Duration:** ~30 min | **Day:** 2 | **Difficulty:** Intermediate
**After module:** M04: Delta Lake Optimization

> *"Optimize the orders table: compact files, apply Z-ORDER, clean up with VACUUM, try Liquid Clustering."*

## Setup

In [None]:
%run ../../setup/00_setup

In [None]:
# Create orders table with many small files (simulating production fragmentation)

spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{BRONZE_SCHEMA}")

orders_path = f"{DATASET_PATH}/orders/orders_batch.json"
df_orders = spark.read.format("json").load(orders_path)

# Repartition into 20 partitions so the initial write produces many small files
table_name = f"{CATALOG}.{BRONZE_SCHEMA}.orders_optimize_lab"
df_orders.repartition(20).write.mode("overwrite").saveAsTable(table_name)

# Add a few more appends to create small files
for i in range(5):
    df_orders.limit(10).write.mode("append").saveAsTable(table_name)

print(f"Table ready with fragmented files: {spark.table(table_name).count()} rows")

---
## Task 1: Inspect Table Metrics

Use `DESCRIBE DETAIL` to check the number of files and total size.
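
The two metrics are most useful together: dividing `sizeInBytes` by `numFiles` gives the average file size, which is a quick signal of fragmentation. A minimal sketch in plain Python, where `avg_file_size_mb` is a hypothetical helper (not part of the lab setup) and the sample numbers are invented for illustration:

```python
def avg_file_size_mb(num_files: int, size_in_bytes: int) -> float:
    """Average data file size in MiB; returns 0.0 for a table with no files."""
    if num_files == 0:
        return 0.0
    return size_in_bytes / num_files / (1024 * 1024)


# Illustrative values only, not measured from the lab table.
print(f"{avg_file_size_mb(25, 5 * 1024 * 1024):.2f} MiB per file")
```

In the lab you could pass `detail["numFiles"]` and `detail["sizeInBytes"]` from the validation cell to the same helper.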

In [None]:
# TODO: Inspect table detail
df_detail = spark.sql(f"________ {table_name}")
display(df_detail.select("format", "numFiles", "sizeInBytes"))

In [None]:
# -- Validation --
detail = df_detail.first()
num_files_before = detail["numFiles"]
assert detail["format"] == "delta", "Table should be Delta format"
print(f"Task 1 OK: {num_files_before} files, {detail['sizeInBytes']:,} bytes")

---
## Task 2: OPTIMIZE

Run `OPTIMIZE` to compact the small files into larger ones.
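
To build intuition for what compaction does, here is a small plain-Python model that greedily packs small files into larger ones. It is only a sketch: `compact` and `target_mb` are hypothetical names, and the real `OPTIMIZE` command chooses its own target size and binning strategy.

```python
def compact(file_sizes_mb, target_mb=128.0):
    """Greedily pack files, in order, into output files of at most target_mb.

    A single input file larger than target_mb is kept as its own output file.
    """
    outputs = []
    current = 0.0
    for size in file_sizes_mb:
        # Close the current output file if adding this one would exceed the target.
        if current > 0 and current + size > target_mb:
            outputs.append(current)
            current = 0.0
        current += size
    if current > 0:
        outputs.append(current)
    return outputs


# 20 files of 6 MB plus 5 tiny appends, loosely mirroring the setup cell.
before = [6.0] * 20 + [0.1] * 5
after = compact(before)
print(f"{len(before)} files -> {len(after)} file(s)")
```

The total data volume is unchanged; only the number of files shrinks, which is what the validation below checks through `numFiles`.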

In [None]:
# TODO: Run OPTIMIZE
spark.sql(f"________ {table_name}")

In [None]:
# Check files after OPTIMIZE
df_detail_after = spark.sql(f"DESCRIBE DETAIL {table_name}")
num_files_after = df_detail_after.first()["numFiles"]

print(f"Files BEFORE: {num_files_before}")
print(f"Files AFTER:  {num_files_after}")

In [None]:
# -- Validation --
assert num_files_after <= num_files_before, "OPTIMIZE should not increase the file count"
print(f"Task 2 OK: Compacted from {num_files_before} to {num_files_after} files")

---
## Task 3: ZORDER BY

Run `OPTIMIZE ... ZORDER BY (customer_id)` to co-locate data for customer queries.
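
Co-location pays off through data skipping: each file carries min/max statistics per column, and a query can skip files whose range cannot contain the requested value. The sketch below models that idea in plain Python. It is an illustration rather than Delta internals; `file_ranges` and `files_to_scan` are hypothetical helpers, and with a single column the clustering is modelled as an ordinary sort.

```python
import random


def file_ranges(values, rows_per_file):
    """Split values into consecutive files and record each file's (min, max)."""
    chunks = [values[i:i + rows_per_file] for i in range(0, len(values), rows_per_file)]
    return [(min(chunk), max(chunk)) for chunk in chunks]


def files_to_scan(ranges, key):
    """Count files whose min/max range could contain key; the rest are skipped."""
    return sum(1 for lo, hi in ranges if lo <= key <= hi)


random.seed(0)
customer_ids = [random.randint(1, 1000) for _ in range(10_000)]

# Unordered: every file holds a random mix of customers, so ranges overlap heavily.
unordered = file_ranges(customer_ids, rows_per_file=1000)
# Clustered: rows for the same customer sit next to each other, so ranges are narrow.
clustered = file_ranges(sorted(customer_ids), rows_per_file=1000)

print("files scanned, unordered:", files_to_scan(unordered, 250))
print("files scanned, clustered:", files_to_scan(clustered, 250))
```

The same filter touches far fewer files once the data is clustered on the filtered column.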

In [None]:
# TODO: OPTIMIZE with ZORDER
spark.sql(f"""
    OPTIMIZE {table_name}
    ________ (customer_id)
""")

In [None]:
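# -- Validation --
history = spark.sql(f"DESCRIBE HISTORY {table_name}").collect()
ops = [r["operation"] for r in history]
# Task 2 already logged an OPTIMIZE, so also check that one run recorded the Z-ORDER column
zorder_runs = [r for r in history if r["operation"] == "OPTIMIZE"
               and "customer_id" in ((r["operationParameters"] or {}).get("zOrderBy") or "")]
assert "OPTIMIZE" in ops, "Expected OPTIMIZE in history"
assert zorder_runs, "Expected an OPTIMIZE with ZORDER BY (customer_id) in history"
print(f"Task 3 OK: ZORDER applied. History: {ops[:5]}")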

---
## Task 4: VACUUM

Run `VACUUM` to remove obsolete files. Use `DRY RUN` first to preview, then execute.

> Note: We'll use 0 hours retention for the lab, which requires disabling the retention safety check. The default retention is 7 days.
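
The effect of the retention window can be sketched in plain Python. This is a simplified model, assuming each file has one timestamp for when it stopped being part of the current table version (`None` if it is still in use); `vacuum_candidates` and the file names are hypothetical, and how Delta actually measures file age is abstracted away.

```python
from datetime import datetime, timedelta


def vacuum_candidates(removed_at, now, retain_hours):
    """Return files that are no longer referenced and fall outside the retention window.

    removed_at maps file path -> time the file left the current version,
    or None if the current version still references it.
    """
    cutoff = now - timedelta(hours=retain_hours)
    return sorted(
        path for path, ts in removed_at.items()
        if ts is not None and ts <= cutoff
    )


now = datetime(2024, 1, 10, 12, 0)
removed_at = {
    "part-0.parquet": now - timedelta(days=9),   # replaced by a rewrite long ago
    "part-1.parquet": now - timedelta(hours=1),  # replaced by a rewrite just now
    "part-2.parquet": None,                      # still part of the current version
}

print(vacuum_candidates(removed_at, now, retain_hours=168))  # 7-day window
print(vacuum_candidates(removed_at, now, retain_hours=0))    # lab setting
```

With a 7-day window the recently replaced file survives, so recent versions stay readable; with 0 hours every unreferenced file becomes a candidate, which is why the setting is unsafe outside a lab.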

In [None]:
# Disable retention check (LAB ONLY - never do this in production!)
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# TODO: Run VACUUM DRY RUN first
display(spark.sql(f"VACUUM {table_name} RETAIN 0 HOURS ________"))

In [None]:
# TODO: Run actual VACUUM
spark.sql(f"________ {table_name} RETAIN 0 HOURS")
print("VACUUM complete!")

In [None]:
# -- Validation --
# VACUUM removes files that only older versions need, so time travel to them may fail;
# the current version must still be fully readable
current_count = spark.table(table_name).count()
assert current_count > 0, "Table should still have data"
print(f"Task 4 OK: VACUUM done. Current table: {current_count} rows")

# Re-enable safety check
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")

---
## Task 5: Liquid Clustering

Create a new table with Liquid Clustering enabled and populate it from the orders table in a single `CREATE TABLE ... AS SELECT` statement.

In [None]:
# TODO: Create table with Liquid Clustering
lc_table = f"{CATALOG}.{BRONZE_SCHEMA}.orders_liquid_cluster"
spark.sql(f"DROP TABLE IF EXISTS {lc_table}")

spark.sql(f"""
    CREATE TABLE {lc_table}
    ________ (customer_id)
    AS SELECT * FROM {table_name}
""")

In [None]:
# -- Validation --
lc_detail = spark.sql(f"DESCRIBE DETAIL {lc_table}").first()
lc_count = spark.table(lc_table).count()
assert lc_count > 0, "Liquid Clustered table should have data"
print(f"Task 5 OK: Liquid Clustered table created with {lc_count} rows")
print(f"  Clustering columns: {lc_detail['clusteringColumns']}")

---
## Task 6: Detect and Handle Data Skew

Create a skewed dataset, detect the skew, and fix it using a **broadcast join**.

**Scenario:** A `sales` table has 90% of rows for `customer_id = 1` (hot key). Joining it with a small `customers` lookup table through a shuffle join sends every hot-key row to the same partition, so one executor does most of the work.

**Steps:**
1. Create a skewed `sales` table
2. Detect the skew by counting rows per key
3. Use `broadcast()` hint to optimize the join
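
Before running the steps on Spark, the effect can be modelled in plain Python. This sketch uses the same key distribution as the scenario; `NUM_PARTITIONS` and the modulo routing are simplifying assumptions standing in for Spark's hash partitioning, and the even split is only a model of a broadcast join leaving the large side where it already is.

```python
from collections import Counter

# Same key distribution as the scenario: 9000 rows for key 1, 100 each for keys 2-11.
keys = [1] * 9000 + [k for k in range(2, 12) for _ in range(100)]

# Step 2 in miniature: rows per key, largest first.
rows_per_key = Counter(keys).most_common()
hot_key, hot_rows = rows_per_key[0]
print(f"hot key {hot_key}: {hot_rows / len(keys):.0%} of rows")

# A shuffle join routes every row with the same key to the same partition.
NUM_PARTITIONS = 4  # hypothetical; real shuffles hash the key and use more partitions
shuffle_sizes = Counter(k % NUM_PARTITIONS for k in keys)

# A broadcast join does not move the large side; model that as an even split by row position.
broadcast_sizes = Counter(i % NUM_PARTITIONS for i in range(len(keys)))

print("shuffle partition sizes:  ", sorted(shuffle_sizes.values(), reverse=True))
print("broadcast partition sizes:", sorted(broadcast_sizes.values(), reverse=True))
```

Under the shuffle model one partition receives every row of the hot key, while the broadcast model keeps the work evenly spread.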

In [None]:
from pyspark.sql.functions import col, count, lit, rand, round as spark_round

# Step 1: Create a skewed sales table (90% hot key)
skew_table = f"{CATALOG}.{BRONZE_SCHEMA}.sales_skew_lab"

# 9000 rows for customer_id=1, 100 rows each for customers 2-11
df_hot = spark.range(9000).withColumn("customer_id", lit(1)).withColumn("amount", spark_round(rand() * 100, 2))
df_rest = spark.range(1000).withColumn("customer_id", (col("id") % 10 + 2).cast("int")).withColumn("amount", spark_round(rand() * 100, 2))
df_skewed = df_hot.unionByName(df_rest)

df_skewed.write.mode("overwrite").saveAsTable(skew_table)
print(f"Skewed table ready: {spark.table(skew_table).count()} rows")

In [None]:
# TODO: Detect skew: count rows per customer_id, ordered DESC
# Fill in the aggregate function, the GROUP BY column, and the sort direction

df_skew_check = spark.sql(f"""
    SELECT customer_id, ________(________) as row_count
    FROM {skew_table}
    GROUP BY ________
    ORDER BY row_count ________
""")

display(df_skew_check)

In [None]:
# -- Validation --
top_row = df_skew_check.first()
assert top_row["customer_id"] == 1, "Customer 1 should have the most rows (hot key)"
assert top_row["row_count"] > 5000, f"Customer 1 should have 9000 rows, got {top_row['row_count']}"
print(f"Skew detected! Customer {top_row['customer_id']}: {top_row['row_count']} rows (hot key)")
print("Other customers: ~100 rows each")

In [None]:
from pyspark.sql.functions import broadcast, sum as spark_sum

# Create a small lookup table (customers)
customers_lookup = spark.createDataFrame(
    [(i, f"Customer_{i}") for i in range(1, 12)],
    ["customer_id", "customer_name"]
)

# TODO: Use broadcast() to join skewed sales with small customers table
# This avoids shuffle on the large skewed table

df_joined = spark.table(skew_table).join(
    ________(customers_lookup),    # hint: broadcast(customers_lookup)
    on="customer_id",
    how="left"
)

# Aggregate: total amount per customer
df_result = (
    df_joined
    .groupBy("customer_id", "customer_name")
    .agg(spark_sum("amount").alias("total_amount"))
    .orderBy("total_amount", ascending=False)
)

display(df_result)

In [None]:
# -- Validation --
assert df_result.count() > 0, "Join result should not be empty"
assert "customer_name" in df_result.columns, "customer_name should be present (from broadcast join)"
top = df_result.first()
assert top["customer_id"] == 1, "Customer 1 should have highest total (9000 rows)"
print(f"Task 6 OK: Broadcast join completed. Top customer: {top['customer_name']} = ${top['total_amount']:.2f}")
print("\nKey takeaway: broadcast() sends the small table to all executors,")
print("  avoiding shuffle of the large skewed table.")

---
## Cleanup

In [None]:
spark.sql(f"DROP TABLE IF EXISTS {table_name}")
spark.sql(f"DROP TABLE IF EXISTS {lc_table}")
spark.sql(f"DROP TABLE IF EXISTS {skew_table}")
print("Lab tables cleaned up")

---
## Lab Complete!

You have:
- Inspected table metrics with DESCRIBE DETAIL
- Compacted small files with OPTIMIZE
- Applied Z-ORDER for query optimization
- Cleaned obsolete files with VACUUM
- Created a Liquid Clustered table
- Detected data skew and resolved it with broadcast join

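> **Exam Tip:** Liquid Clustering replaces both partitioning and Z-ORDER. Use `ALTER TABLE ... CLUSTER BY (new_cols)` to change clustering columns without rewriting existing data; the new columns apply to later writes and `OPTIMIZE` runs. For data skew, a broadcast join works when one side is small enough to fit in memory on every executor: Spark chooses it automatically below `spark.sql.autoBroadcastJoinThreshold` (10 MB by default), while the `broadcast()` hint requests it regardless of that threshold. AQE handles skew automatically in most cases.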

> **Next:** LAB 05 - Streaming & Auto Loader