# Workshop 2: Delta Lake Optimization

## The Story

We have an IoT ingestion system that receives data from thousands of sensors. The system writes data in very small batches (simulating continuous streaming or frequent small inserts).

**The Problem:**
This behavior has created the "Small Files Problem". The `iot_events_silver` table contains thousands of tiny files, which kills read performance because Spark has to open and close too many files to read a small amount of data.

**Your Mission:**
1.  **Diagnose**: Prove that we have a "Small Files Problem" by inspecting the table metadata and physical files.
2.  **Optimize**: Use `OPTIMIZE` to compact small files into larger ones.
3.  **Index**: Use `ZORDER` to co-locate data by `device_id` for faster filtering.
4.  **Clean**: Use `VACUUM` to remove the old, fragmented files.

**Time:** 30 minutes

In [0]:
%run ../00_setup

In [0]:
from pyspark.sql.functions import col, rand, current_timestamp, expr

# --- INDEPENDENT SETUP ---
# We will create a synthetic table to simulate the "Small Files Problem"
table_name = f"{CATALOG}.{SILVER_SCHEMA}.iot_events_silver"
print(f"Preparing simulation table: {table_name}...")

# 1. Generate synthetic data (50k rows)
# Simulating IoT events: EventID, DeviceID, Temperature, Timestamp
df_synthetic = spark.range(50000).select(
    col("id").alias("event_id"),
    (rand() * 100).cast("int").alias("device_id"),
    (rand() * 30 + 20).alias("temperature"),
    current_timestamp().alias("event_time"),
    expr("case when rand() > 0.9 then 'ERROR' else 'OK' end").alias("status")
)

# 2. Write with EXTREME fragmentation
# We use repartition(10000) to force Spark to write 10000 separate files.
# This simulates the effect of 10000 separate small INSERT operations.
print("Simulating 10000 small inserts (creating 10000 small files)...")
df_synthetic.repartition(10000).write.format("delta").mode("overwrite").saveAsTable(table_name)

print(f"✅ Table {table_name} created with ~10000 small files.")

In [0]:
table_name = f"{CATALOG}.{SILVER_SCHEMA}.iot_events_silver"
print(f"Working with table: {table_name}")

## Step 1: Problem Diagnosis

### Task 1.1: Check table details

Use `DESCRIBE DETAIL` to see the number of files and total size.
Also, let's look at the **physical files** on the storage to see how messy it is.

**Hint:**
Use `DESCRIBE DETAIL table_name` to get metadata.
To list files, you need the `location` path from the details, then use `dbutils.fs.ls(location)`.

In [0]:
df = spark.sql(f"DESCRIBE DETAIL {table_name}")
display(df)

In [0]:
# TODO: Get table details


# TODO: Get the location path from details and list files

### Analysis Questions:

1.  Look at `numFiles`. Is it high? (Should be ~10000)
2.  Look at `sizeInBytes`. Calculate average file size (`sizeInBytes` / `numFiles`).
    *   Example: 10 MB / 10000 files = 0.001 MB per file.
    *   **Ideal size** for Delta/Parquet is usually 100MB - 1GB.
3.  Look at the file list. Do you see many `part-000...` files?

This confirms the **Small Files Problem**.
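
The arithmetic from question 2 can be sketched as a small helper. This is only an illustration: `average_file_size_mb` is a hypothetical name, and the sample numbers mirror the example above rather than measurements from your table. In the notebook you would pass in `sizeInBytes` and `numFiles` from the `DESCRIBE DETAIL` row.

```python
def average_file_size_mb(size_in_bytes: int, num_files: int) -> float:
    """Return the mean file size in MB (1 MB = 1024 * 1024 bytes)."""
    if num_files <= 0:
        raise ValueError("num_files must be positive")
    return size_in_bytes / num_files / (1024 * 1024)


# Sample values mirroring the example above: 10 MB spread over 10000 files.
# These are illustrative, not read from the real table.
sample_size_in_bytes = 10 * 1024 * 1024
sample_num_files = 10000

avg_mb = average_file_size_mb(sample_size_in_bytes, sample_num_files)
print(f"Average file size: {avg_mb:.4f} MB")
```

An average far below the 100MB - 1GB range is the signal to look for.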

## Step 2: Time Travel - Change History

### Task 2.1: Display table history

Delta Lake records every change. We can travel through time!

**Hint:**
```sql
DESCRIBE HISTORY catalog.schema.table_name
```


In [0]:
# TODO: Display table change history

### Task 2.2: Read an older version of the table

You can read data from a specific version or point in time.

**Hint:**
```python
# By version
spark.read.format("delta").option("versionAsOf", 0).table("name")

# By time
spark.read.format("delta").option("timestampAsOf", "2024-01-01").table("name")
```


In [0]:
# TODO: Read version 0 of the table (first version)

## Step 3: Change Data Feed (CDF)

### Task 3.1: Enable CDF

Delta Lake can track row-level changes (INSERT, UPDATE, DELETE). This is called **Change Data Feed**.
It allows us to process only changed data instead of full reloads.

**Hint:**
```sql
ALTER TABLE table_name SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
```

In [0]:
# TODO: Enable Change Data Feed

### Task 3.2: Modify Data

Let's simulate some data corrections.
1.  Fix 'ERROR' status to 'FIXED'.
2.  Delete some records.

These operations will be recorded in the CDF.

**Hint:**
```sql
UPDATE table_name SET col = 'val' WHERE col = 'old_val'
DELETE FROM table_name WHERE col = 'val'
```

In [0]:
# TODO: Update some records


# TODO: Delete some records


### Task 3.3: Read Changes (CDF)

Now we can query the **Change Data Feed** to see exactly what happened.
This is critical for ETL pipelines to propagate only changes to downstream tables.

**Hint:**
```python
spark.read.format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", 1) \
    .table("name")
```
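
To make the "propagate only changes" idea concrete, here is a minimal pure-Python sketch that assumes change rows shaped like the ones the Change Data Feed returns. The `_change_type` values (`insert`, `update_preimage`, `update_postimage`, `delete`) and the `_commit_version` column come from Delta Lake; the `apply_changes` helper and the sample rows are hypothetical and stand in for a downstream `MERGE`.

```python
def apply_changes(target: dict, changes: list) -> dict:
    """Apply CDF-style change rows to a copy of target, keyed by event_id."""
    result = dict(target)
    # Process changes in commit order so that later versions win.
    for row in sorted(changes, key=lambda r: r["_commit_version"]):
        change_type = row["_change_type"]
        if change_type in ("insert", "update_postimage"):
            result[row["event_id"]] = row["status"]
        elif change_type == "delete":
            result.pop(row["event_id"], None)
        # update_preimage carries the old value, so there is nothing to apply.
    return result


# Hypothetical downstream state and change rows (not real table output).
downstream = {1: "OK", 2: "ERROR", 3: "OK"}
changes = [
    {"event_id": 2, "status": "ERROR", "_change_type": "update_preimage", "_commit_version": 2},
    {"event_id": 2, "status": "FIXED", "_change_type": "update_postimage", "_commit_version": 2},
    {"event_id": 3, "status": "OK", "_change_type": "delete", "_commit_version": 3},
]

updated = apply_changes(downstream, changes)
print(updated)
```

Only the two touched keys are processed; the untouched row for `event_id` 1 is never re-read, which is the point of an incremental pipeline.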

In [0]:
# TODO: Read changes from the table

## Step 4: Optimization

### 4.1: Small Files Problem

Many small files = many I/O operations = slow queries.

**Solution:** `OPTIMIZE` combines small files into larger ones (default ~1GB).

### Task 4.1: Run OPTIMIZE

**Hint:**
```sql
OPTIMIZE catalog.schema.table_name
```

In [0]:
# TODO: Optimize the table

### 4.2: Z-Ordering - Data Colocation

Our analysts frequently filter by `device_id`.
Currently, data for `device_id = 42` might be scattered across all ~10000 files.

**How it works (Data Skipping):**
Delta Lake stores min/max statistics for each column in every file.
*   **Without Z-Order:** File 1 (IDs 1-100), File 2 (IDs 1-100)... Spark has to read all files.
*   **With Z-Order:** File 1 (IDs 1-10), File 2 (IDs 11-20)... Spark can **skip** files that don't contain the requested ID.
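
The skipping logic described above can be simulated without Spark. The sketch below is a simplified model, not Delta Lake's actual implementation: `files_to_read` is a hypothetical helper, and the per-file min/max ranges are invented to match the two bullet points (device IDs in this workshop fall between 0 and 99).

```python
def files_to_read(file_stats: dict, wanted_id: int) -> list:
    """Return the files whose [min, max] range may contain wanted_id."""
    return [
        name
        for name, (lowest, highest) in file_stats.items()
        if lowest <= wanted_id <= highest
    ]


# Hypothetical per-file (min, max) statistics for device_id.
# Without Z-Order: every file spans the full ID range.
scattered = {f"file_{i}": (0, 99) for i in range(10)}
# With Z-Order: each file covers a narrow, non-overlapping range.
clustered = {f"file_{i}": (i * 10, i * 10 + 9) for i in range(10)}

scattered_hits = files_to_read(scattered, 42)
clustered_hits = files_to_read(clustered, 42)

print(f"Scattered layout: {len(scattered_hits)} files to read")
print(f"Clustered layout: {len(clustered_hits)} files to read: {clustered_hits}")
```

In the scattered layout no file can be ruled out from its statistics alone, while the clustered layout narrows the read to a single file.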

### Task 4.2: Run OPTIMIZE with Z-ORDER

**Hint:**
```sql
OPTIMIZE catalog.schema.table_name ZORDER BY (column_name)
```

In [0]:
# TODO: Optimize with Z-ORDER on device_id

### Task 4.3: Check optimization metrics

After OPTIMIZE, check in history:
- How many files were added/removed?
- How did the file count change?
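
The `operationMetrics` column of `DESCRIBE HISTORY` is a map of string values; for an `OPTIMIZE` commit it includes `numAddedFiles` and `numRemovedFiles`. Below is a minimal sketch of turning that map into a summary. The helper name `summarize_optimize` and the sample numbers are made up for illustration, so expect different values on your table.

```python
def summarize_optimize(metrics: dict) -> dict:
    """Convert string-valued OPTIMIZE metrics into integer file counts."""
    added = int(metrics.get("numAddedFiles", 0))
    removed = int(metrics.get("numRemovedFiles", 0))
    return {"added": added, "removed": removed, "net_change": added - removed}


# Hypothetical metrics map, shaped like one row's operationMetrics value.
sample_metrics = {"numAddedFiles": "1", "numRemovedFiles": "10000"}

summary = summarize_optimize(sample_metrics)
print(summary)
```

A large negative `net_change` means many small files were replaced by a few larger ones.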

In [0]:
# Create a copy of the table to use later for the Liquid Clustering comparison
spark.sql(f"""
CREATE OR REPLACE TABLE {table_name}_lc AS
SELECT * FROM {table_name}""")

In [0]:
# Check history after optimization
display(spark.sql(f"DESCRIBE HISTORY {table_name} LIMIT 5"))

In [0]:
# Check if file count decreased
display(spark.sql(f"DESCRIBE DETAIL {table_name}"))

### 4.3: Liquid Clustering 

A newer alternative to Z-ORDER is **Liquid Clustering**.
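It clusters data incrementally and lets you change the clustering columns without rewriting the whole table, so you no longer need to run `OPTIMIZE ... ZORDER BY` manually every time. You still run `OPTIMIZE` to trigger clustering.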

**How to use:**
```sql
-- When creating a table
CREATE TABLE ... CLUSTER BY (column)

-- Or modifying an existing table
ALTER TABLE ... CLUSTER BY (column)
```

In [0]:
# Bonus: Enable Liquid Clustering (optional)

## Step 5: Vacuum - Cleaning Old Files

### Problem: Old Files Take Up Space

After OPTIMIZE, old small files still exist on disk (for Time Travel). `VACUUM` removes them.

**WARNING:** After VACUUM, you can no longer time travel to versions older than the retention period, because their data files are gone!

### Task 5.1: Run VACUUM (DRY RUN mode)

First, check what will be deleted:

**Hint:**
```sql
VACUUM catalog.schema.table_name RETAIN 168 HOURS DRY RUN
```
(168 hours = 7 days - default safety threshold)
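
The retention threshold can be pictured as a simple cutoff in time. The sketch below is a simplified model under stated assumptions, not the real `VACUUM` implementation: `vacuum_candidates` is a hypothetical helper, the file names and timestamps are invented, and every file in the input is assumed to be no longer referenced by the current table version.

```python
from datetime import datetime, timedelta


def vacuum_candidates(unreferenced_files: dict, now: datetime, retain_hours: int = 168) -> list:
    """Return unreferenced files that became stale before the retention cutoff.

    unreferenced_files maps a file name to the time it stopped being
    part of the current table version.
    """
    cutoff = now - timedelta(hours=retain_hours)
    return sorted(
        name for name, removed_at in unreferenced_files.items() if removed_at < cutoff
    )


# Hypothetical files and timestamps, for illustration only.
now = datetime(2024, 1, 10, 12, 0)
stale_files = {
    "part-old.parquet": datetime(2024, 1, 1, 9, 0),     # stale for 9 days
    "part-recent.parquet": datetime(2024, 1, 9, 9, 0),  # stale for about 1 day
}

default_run = vacuum_candidates(stale_files, now)                 # 168 hours
short_run = vacuum_candidates(stale_files, now, retain_hours=0)   # workshop setting

print("RETAIN 168 HOURS:", default_run)
print("RETAIN 0 HOURS:", short_run)
```

This is also why a dry run with the default threshold is likely to list nothing right after `OPTIMIZE` in this workshop: the small files only just became stale.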

In [0]:
# TODO: Check what will be deleted (DRY RUN)

### Task 5.2: Run VACUUM (for workshop with shorter time)

In a development environment, we can force a shorter retention time.

**Never do this in production!**


In [0]:
# Disable retention time check (DEV ONLY!)
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# TODO: Run VACUUM with short retention time

In [0]:
# OPTIMIZE original table with Z-ORDER on device_id
display(spark.sql(f"OPTIMIZE {table_name} ZORDER BY (device_id)"))

# OPTIMIZE the liquid clustered table
spark.sql(f"ALTER TABLE {table_name}_lc CLUSTER BY (device_id)")
display(spark.sql(f"OPTIMIZE {table_name}_lc FULL"))



In [0]:
# Benchmark query to compare performance
from time import perf_counter

In [0]:
# Benchmark Z-ORDER table
start = perf_counter()

display(spark.sql(f"""SELECT * FROM {table_name} a
                    WHERE a.device_id >= 90 
                    order by a.device_id"""))

zorder_time = perf_counter() - start

# Benchmark Liquid Clustering table
start = perf_counter()

display(spark.sql(f"""SELECT * FROM {table_name}_lc a
                  WHERE a.device_id >= 90
                  order by a.device_id"""))

liquid_time = perf_counter() - start

print(f"Z-ORDER query time: {zorder_time:.2f} seconds")
print(f"Liquid Clustering query time: {liquid_time:.2f} seconds")

In [0]:
# Benchmark Z-ORDER table
start = perf_counter()

display(spark.sql(f"""SELECT * FROM {table_name} a
                    LEFT JOIN {table_name} b  ON a.event_id = b.event_id 
                    WHERE a.device_id >= 90 
                    order by a.device_id"""))

zorder_time = perf_counter() - start

# Benchmark Liquid Clustering table
start = perf_counter()

display(spark.sql(f"""SELECT * FROM {table_name}_lc a  
                  LEFT JOIN {table_name}_lc b  ON a.event_id = b.event_id
                  WHERE a.device_id >= 90  
                  order by a.device_id"""))

liquid_time = perf_counter() - start

print(f"Z-ORDER query time: {zorder_time:.2f} seconds")
print(f"Liquid Clustering query time: {liquid_time:.2f} seconds")

# Solution

The complete code is below. Try to solve it yourself first!


In [0]:
# ============================================================
# FULL SOLUTION - Workshop 2: Delta Lake Optimization
# ============================================================

table_name = f"{CATALOG}.{SILVER_SCHEMA}.iot_events_silver"

# --- Step 1: Diagnosis ---
print("TABLE DETAILS:")
df_detail = spark.sql(f"DESCRIBE DETAIL {table_name}")
display(df_detail)

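print("\nPHYSICAL FILES:")
location = df_detail.collect()[0]['location']
print(location)
# List the files under the table location, as suggested in the Task 1.1 hint
display(dbutils.fs.ls(location))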

# --- Step 2: History ---
print("\nCHANGE HISTORY:")
display(spark.sql(f"DESCRIBE HISTORY {table_name}"))

# --- Step 3: Change Data Feed ---
print("\nENABLING CDF:")
spark.sql(f"ALTER TABLE {table_name} SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")

print("\nMODIFYING DATA:")
spark.sql(f"UPDATE {table_name} SET status = 'FIXED' WHERE status = 'ERROR'")
spark.sql(f"DELETE FROM {table_name} WHERE device_id = -1") # Dummy delete

print("\nREADING CHANGES:")
display(spark.read.format("delta").option("readChangeFeed", "true").option("startingVersion", 1).table(table_name))

# --- Step 4: Optimization ---
print("\nOPTIMIZE WITH Z-ORDER:")
display(spark.sql(f"OPTIMIZE {table_name} ZORDER BY (device_id)"))

# --- Step 5: Vacuum ---
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
print("\nVACUUM:")
display(spark.sql(f"DESCRIBE HISTORY {table_name}"))

display(spark.sql(f"VACUUM {table_name} RETAIN 0 HOURS"))

# --- Verification ---
print("\nAFTER OPTIMIZATION:")
display(spark.sql(f"DESCRIBE DETAIL {table_name}"))

print("\nOptimization completed!")