# LAB 04: Delta Lake Optimization

**Duration:** ~30 min | **Day:** 2 | **Difficulty:** Intermediate
**After module:** M04: Delta Lake Optimization

> *"Optimize the orders table: compact files, apply Z-ORDER, clean up with VACUUM, try Liquid Clustering."*

## Setup

In [None]:
%run ../../setup/00_setup

In [None]:
# Create orders table with many small files (simulating production fragmentation)

spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{BRONZE_SCHEMA}")

orders_path = f"{DATASET_PATH}/orders/orders_batch.json"
df_orders = spark.read.format("json").load(orders_path)

# Repartition into 20 partitions so the initial write produces many small files
table_name = f"{CATALOG}.{BRONZE_SCHEMA}.orders_optimize_lab"
df_orders.repartition(20).write.mode("overwrite").saveAsTable(table_name)

# Append five tiny batches; each append adds at least one more small file
for _ in range(5):
    df_orders.limit(10).write.mode("append").saveAsTable(table_name)

print(f"Table ready with fragmented files: {spark.table(table_name).count()} rows")

---
## Task 1: Inspect Table Metrics

Use `DESCRIBE DETAIL` to check the number of files and total size.
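
One way to read these two metrics together is as an average file size: a table with many files and a small total size is fragmented. The sketch below is plain Python and does not touch Spark. The helper name `average_file_size_mb` and the sample numbers are assumptions for illustration; in the lab you would pass in `detail["numFiles"]` and `detail["sizeInBytes"]`.

```python
def average_file_size_mb(num_files, size_in_bytes):
    """Return the mean data-file size in MiB, or 0.0 for a table with no files."""
    if num_files <= 0:
        return 0.0
    return size_in_bytes / num_files / (1024 * 1024)


# Hypothetical values standing in for numFiles and sizeInBytes
example_avg = average_file_size_mb(num_files=25, size_in_bytes=52_428_800)
print(f"Average file size: {example_avg:.2f} MiB")
```

An average of a few MiB or less, as in this made-up example, usually suggests the table would benefit from compaction.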

In [None]:
# TODO: Inspect table detail
df_detail = spark.sql(f"________ {table_name}")
display(df_detail.select("format", "numFiles", "sizeInBytes"))

In [None]:
# -- Validation --
detail = df_detail.first()
num_files_before = detail["numFiles"]
assert detail["format"] == "delta", "Table should be Delta format"
print(f"Task 1 OK: {num_files_before} files, {detail['sizeInBytes']:,} bytes")

---
## Task 2: OPTIMIZE

Run `OPTIMIZE` to compact the small files into larger ones.
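
To build intuition for what compaction does, here is a toy model in plain Python that greedily groups small files into bins up to a target size. It is only a sketch: it is not Delta Lake's actual algorithm, and the function name, file sizes, and target size are all assumptions chosen for illustration.

```python
def plan_compaction(file_sizes_kb, target_kb):
    """Greedily group files (smallest first) into bins of at most target_kb each."""
    bins = []
    current, current_size = [], 0
    for size in sorted(file_sizes_kb):
        # Start a new bin when adding this file would exceed the target size
        if current and current_size + size > target_kb:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins


# Hypothetical layout: 20 files of 1 MiB plus 5 tiny 100 KiB files
small_files = [1024] * 20 + [100] * 5
plan = plan_compaction(small_files, target_kb=16 * 1024)
print(f"{len(small_files)} small files -> {len(plan)} compacted files")
```

The total data volume is unchanged; only the number of files drops, which is the effect you should see in `numFiles` after running `OPTIMIZE`.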

In [None]:
# TODO: Run OPTIMIZE
spark.sql(f"________ {table_name}")

In [None]:
# Check files after OPTIMIZE
df_detail_after = spark.sql(f"DESCRIBE DETAIL {table_name}")
num_files_after = df_detail_after.first()["numFiles"]

print(f"Files BEFORE: {num_files_before}")
print(f"Files AFTER:  {num_files_after}")

In [None]:
# -- Validation --
assert num_files_after <= num_files_before, "OPTIMIZE should not increase the file count"
print(f"Task 2 OK: Compacted from {num_files_before} to {num_files_after} files")

---
## Task 3: ZORDER BY

Run `OPTIMIZE ... ZORDER BY (customer_id)` to co-locate rows with similar `customer_id` values in the same files, so queries that filter on `customer_id` can skip more files.
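
The benefit of co-location comes from file-level min/max statistics: a reader can skip any file whose range cannot contain the requested value. The plain-Python sketch below illustrates that idea only. It does not implement Z-ordering itself, and the helper names and the synthetic customer IDs are assumptions.

```python
def file_stats(files):
    """Return (min, max) of customer_id for each file."""
    return [(min(f), max(f)) for f in files]


def files_to_scan(stats, customer_id):
    """Count files whose [min, max] range could contain customer_id."""
    return sum(1 for lo, hi in stats if lo <= customer_id <= hi)


customer_ids = list(range(1, 1001))  # hypothetical IDs 1..1000

# Scattered layout: every file holds IDs from the whole range
scattered = [customer_ids[i::4] for i in range(4)]
# Clustered layout: each file holds one contiguous slice of IDs
clustered = [customer_ids[i:i + 250] for i in range(0, 1000, 250)]

scanned_scattered = files_to_scan(file_stats(scattered), 300)
scanned_clustered = files_to_scan(file_stats(clustered), 300)
print(f"Scattered: {scanned_scattered} files, clustered: {scanned_clustered} file(s)")
```

With the scattered layout every file's range covers ID 300, so nothing can be skipped; with the clustered layout only one file needs to be read.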

In [None]:
# TODO: OPTIMIZE with ZORDER
spark.sql(f"""
    OPTIMIZE {table_name}
    ________ (customer_id)
""")

In [None]:
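# -- Validation --
history = spark.sql(f"DESCRIBE HISTORY {table_name}").collect()
ops = [r["operation"] for r in history]
assert "OPTIMIZE" in ops, "Expected OPTIMIZE in history"
# Task 2 also logs OPTIMIZE, so check that some entry recorded the Z-ORDER column
zorder_cols = [(r["operationParameters"] or {}).get("zOrderBy") or "" for r in history]
assert any("customer_id" in z for z in zorder_cols), "Expected OPTIMIZE ... ZORDER BY (customer_id) in history"
print(f"Task 3 OK: ZORDER applied. History: {ops[:5]}")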

---
## Task 4: VACUUM

Run `VACUUM` to remove obsolete files. Use `DRY RUN` first to preview, then execute.

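> Note: The lab uses a 0-hour retention so that VACUUM has files to delete right away. The default retention is 7 days, so a shorter value requires disabling the retention safety check.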
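
The retention rule can be pictured as a simple filter: a file is a deletion candidate only if the current table version no longer references it and it is older than the retention cutoff. The sketch below is a simplified plain-Python model, not Delta Lake's implementation. The function name, file names, and timestamps are assumptions for illustration.

```python
from datetime import datetime, timedelta


def vacuum_candidates(files, now, retain_hours):
    """Return names of files that are unreferenced and at least retain_hours old."""
    cutoff = now - timedelta(hours=retain_hours)
    return [
        name
        for name, is_referenced, last_modified in files
        if not is_referenced and last_modified <= cutoff
    ]


now = datetime(2024, 1, 10, 12, 0)
# (file name, still referenced by the current version?, last modified)
files = [
    ("part-current.parquet", True, now - timedelta(hours=1)),
    ("part-old-a.parquet", False, now - timedelta(hours=2)),
    ("part-old-b.parquet", False, now - timedelta(days=10)),
]

default_retention = vacuum_candidates(files, now, retain_hours=168)  # 7 days
zero_retention = vacuum_candidates(files, now, retain_hours=0)
print("7 days :", default_retention)
print("0 hours:", zero_retention)
```

With the default retention only the 10-day-old file qualifies, while a 0-hour retention also removes the file replaced two hours ago. That is why the lab needs the shorter retention to see any effect.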

In [None]:
# Disable retention check (LAB ONLY - never do this in production!)
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# TODO: Run VACUUM DRY RUN first
display(spark.sql(f"VACUUM {table_name} RETAIN 0 HOURS ________"))

In [None]:
# TODO: Run actual VACUUM
spark.sql(f"________ {table_name} RETAIN 0 HOURS")
print("VACUUM complete!")

In [None]:
# -- Validation --
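# VACUUM deletes only unreferenced files, so the current version must still be readable
# (time travel to older versions may now fail because their files are gone)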
current_count = spark.table(table_name).count()
assert current_count > 0, "Table should still have data"
print(f"Task 4 OK: VACUUM done. Current table: {current_count} rows")

# Re-enable safety check
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")

---
## Task 5: Liquid Clustering

Create a new table with Liquid Clustering enabled and populate it from the lab table in a single `CREATE TABLE ... AS SELECT` statement.

In [None]:
# TODO: Create table with Liquid Clustering
lc_table = f"{CATALOG}.{BRONZE_SCHEMA}.orders_liquid_cluster"
spark.sql(f"DROP TABLE IF EXISTS {lc_table}")

spark.sql(f"""
    CREATE TABLE {lc_table}
    ________ (customer_id)
    AS SELECT * FROM {table_name}
""")

In [None]:
# -- Validation --
lc_detail = spark.sql(f"DESCRIBE DETAIL {lc_table}").first()
lc_count = spark.table(lc_table).count()
assert lc_count > 0, "Liquid Clustered table should have data"
print(f"Task 5 OK: Liquid Clustered table created with {lc_count} rows")
print(f"  Clustering columns: {lc_detail['clusteringColumns']}")

---
## Cleanup

In [None]:
spark.sql(f"DROP TABLE IF EXISTS {table_name}")
spark.sql(f"DROP TABLE IF EXISTS {lc_table}")
print("Lab tables cleaned up")

---
## Lab Complete!

You have:
- Inspected table metrics with DESCRIBE DETAIL
- Compacted small files with OPTIMIZE
- Applied Z-ORDER for query optimization
- Cleaned obsolete files with VACUUM
- Created a Liquid Clustered table

> **Exam Tip:** Liquid Clustering replaces both partitioning and Z-ORDER. Use `ALTER TABLE ... CLUSTER BY (new_cols)` to change clustering columns; existing data is not rewritten, and new writes and later `OPTIMIZE` runs use the new columns.
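
As a small illustration of the tip, the helper below builds the `ALTER TABLE ... CLUSTER BY` statement as a string. It is a sketch only: the function name, the table name, and the `order_date` column are hypothetical, and in the lab environment you would pass the resulting string to `spark.sql(...)`.

```python
def alter_cluster_by_sql(table, columns):
    """Build an ALTER TABLE ... CLUSTER BY statement.

    An empty column list produces CLUSTER BY NONE, which turns clustering off.
    """
    if not columns:
        return f"ALTER TABLE {table} CLUSTER BY NONE"
    return f"ALTER TABLE {table} CLUSTER BY ({', '.join(columns)})"


# Hypothetical table and column names, for illustration only
stmt = alter_cluster_by_sql(
    "main.bronze.orders_liquid_cluster", ["customer_id", "order_date"]
)
print(stmt)
```

Running the generated statement changes the clustering columns in the table metadata only, so it completes quickly even on a large table.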

> **Next:** LAB 05 - Streaming & Auto Loader