# Workshop 2: Delta Lake Optimization

## The Story

We have an IoT ingestion system that receives data from thousands of sensors. The system writes data in very small batches (simulating continuous streaming or frequent small inserts).

**The Problem:**
This behavior has created the "Small Files Problem". The `iot_events_silver` table contains thousands of tiny files, which kills read performance because Spark has to open and close too many files to read a small amount of data.
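
To get a feel for why the file count matters, here is a back-of-the-envelope model in plain Python. It is only a sketch: the per-file overhead and the throughput are assumed values chosen for illustration, not measurements, and the model ignores parallelism across executors.

```python
def estimated_scan_seconds(num_files, total_bytes,
                           per_file_overhead_s=0.02,
                           throughput_bytes_per_s=100 * 1024 * 1024):
    """Rough scan time: a fixed cost per file plus the time to stream the bytes.

    Both default constants are assumptions for illustration only.
    """
    return num_files * per_file_overhead_s + total_bytes / throughput_bytes_per_s


total_bytes = 10 * 1024 * 1024  # the same 10 MB of data in both scenarios

fragmented = estimated_scan_seconds(10_000, total_bytes)
compacted = estimated_scan_seconds(1, total_bytes)

print(f"10000 files: {fragmented:.2f} s")
print(f"1 file:      {compacted:.2f} s")
```

Under these assumptions almost all of the time in the fragmented case is per-file overhead, not data transfer.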

**Your Mission:**
1.  **Diagnose**: Prove that we have a "Small Files Problem" by inspecting the table metadata and physical files.
2.  **Optimize**: Use `OPTIMIZE` to compact small files into larger ones.
3.  **Index**: Use `ZORDER` to co-locate data by `device_id` for faster filtering.
4.  **Clean**: Use `VACUUM` to remove the old, fragmented files.

**Time:** 30 minutes

In [None]:
%run ../00_setup

In [None]:
from pyspark.sql.functions import col, rand, current_timestamp, expr

# --- INDEPENDENT SETUP ---
# We will create a synthetic table to simulate the "Small Files Problem"
table_name = f"{catalog}.{SILVER_SCHEMA}.iot_events_silver"
print(f"Preparing simulation table: {table_name}...")

# 1. Generate synthetic data (50k rows)
# Simulating IoT events: EventID, DeviceID, Temperature, Timestamp
df_synthetic = spark.range(50000).select(
    col("id").alias("event_id"),
    (rand() * 100).cast("int").alias("device_id"),
    (rand() * 30 + 20).alias("temperature"),
    current_timestamp().alias("event_time"),
    expr("case when rand() > 0.9 then 'ERROR' else 'OK' end").alias("status")
)

# 2. Write with EXTREME fragmentation
# We use repartition(10000) to force Spark to write 10000 separate files.
# This simulates the effect of 10000 separate small INSERT operations.
print("Simulating 10000 small inserts (creating ~10000 small files)...")
df_synthetic.repartition(10000).write.format("delta").mode("overwrite").saveAsTable(table_name)

print(f"✅ Table {table_name} created with ~10000 small files.")

In [None]:
table_name = f"{catalog}.{SILVER_SCHEMA}.iot_events_silver"
print(f"Working with table: {table_name}")

## Step 1: Problem Diagnosis

### Task 1.1: Check table details

Use `DESCRIBE DETAIL` to see the number of files and total size.
Also, let's look at the **physical files** on the storage to see how messy it is.

**Hint:**
```python
# Metadata
display(spark.sql(f"DESCRIBE DETAIL {table_name}"))

# Physical files (requires getting the location first)
# location = ...
# display(dbutils.fs.ls(location))
```

In [None]:
# TODO: Get table details
# df_detail = spark.sql(f"DESCRIBE DETAIL {table_name}")
# display(df_detail)

# TODO: Get the location path from details and list files
# location = df_detail.collect()[0]['location']
# print(f"Table Location: {location}")
# display(dbutils.fs.ls(location))

### Analysis Questions:

1.  Look at `numFiles`. Is it high? (Should be ~10000)
2.  Look at `sizeInBytes`. Calculate the average file size (`sizeInBytes` / `numFiles`).
    *   Example: 10 MB / 10000 files = 0.001 MB per file.
    *   **Ideal size** for Delta/Parquet is usually 100MB - 1GB.
3.  Look at the file list. Do you see many `part-000...` files?

This confirms the **Small Files Problem**.
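
The arithmetic from the analysis questions can be sketched in plain Python. The function names and the 100 MB threshold parameter are illustrative assumptions; the inputs correspond to the `sizeInBytes` and `numFiles` columns of `DESCRIBE DETAIL`.

```python
def average_file_size_mb(size_in_bytes, num_files):
    """Average file size in MB, computed from the DESCRIBE DETAIL columns."""
    if num_files == 0:
        return 0.0
    return size_in_bytes / num_files / (1024 * 1024)


def is_small_files_problem(size_in_bytes, num_files, min_target_mb=100):
    """Flag a table whose average file is below the assumed minimum target.

    A table that fits in a single file is never flagged.
    """
    return num_files > 1 and average_file_size_mb(size_in_bytes, num_files) < min_target_mb


# Example values matching the analysis above: 10 MB spread over 10000 files.
avg = average_file_size_mb(10 * 1024 * 1024, 10_000)
print(f"Average file size: {avg:.4f} MB")
print("Small files problem:", is_small_files_problem(10 * 1024 * 1024, 10_000))
```

In the notebook you could pass `df_detail.collect()[0]['sizeInBytes']` and `df_detail.collect()[0]['numFiles']` instead of the example values.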

## Step 2: Time Travel - Change History

### Task 2.1: Display table history

Delta Lake records every change. We can travel through time!

**Hint:**
```sql
DESCRIBE HISTORY catalog.schema.table_name
```


In [None]:
# TODO: Display table change history
# display(spark.sql(f"DESCRIBE HISTORY ..."))

### Task 2.2: Read an older version of the table

You can read data from a specific version or point in time.

**Hint:**
```python
# By version
spark.read.format("delta").option("versionAsOf", 0).table("name")

# By time
spark.read.format("delta").option("timestampAsOf", "2024-01-01").table("name")
```
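
Time travel is also available from SQL through the `VERSION AS OF` and `TIMESTAMP AS OF` clauses. The helper below is a small sketch that only builds the query string; the function name and the table name in the example are hypothetical.

```python
def time_travel_query(table, version=None, timestamp=None):
    """Build a Delta time travel SELECT for exactly one of version or timestamp."""
    if (version is None) == (timestamp is None):
        raise ValueError("Pass exactly one of version or timestamp")
    if version is not None:
        return f"SELECT * FROM {table} VERSION AS OF {int(version)}"
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


# Hypothetical table name, for illustration only.
query_v0 = time_travel_query("my_catalog.my_schema.iot_events_silver", version=0)
query_ts = time_travel_query("my_catalog.my_schema.iot_events_silver", timestamp="2024-01-01")
print(query_v0)
print(query_ts)
```

The resulting string can be passed to `spark.sql(...)` in the notebook.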


In [None]:
# TODO: Read version 0 of the table (first version)
# df_v0 = spark.read.format("delta").option("versionAsOf", 0).table(table_name)
# print(f"Version 0 had {df_v0.count()} rows")

## Step 3: Optimization

### 3.1: Small Files Problem

Many small files = many I/O operations = slow queries.

**Solution:** `OPTIMIZE` combines small files into larger ones (default ~1GB).
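
Conceptually, compaction is a bin-packing problem: group small files until each group approaches the target size. The sketch below is a simplified greedy illustration in plain Python, not Delta Lake's actual algorithm; the function name and the example file sizes are assumptions.

```python
def plan_compaction(file_sizes, target_bytes=1024**3):
    """Greedily group files into bins whose total size stays within target_bytes.

    A simplified illustration of compaction planning, not Delta Lake's implementation.
    """
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes):
        # Close the current bin when adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins


small_files = [1_000] * 10_000  # 10000 files of roughly 1 KB each (assumed sizes)

plan = plan_compaction(small_files)
print(f"{len(small_files)} input files -> {len(plan)} output file(s)")

# With an artificially small 1 MB target, the same data needs more output files.
plan_small = plan_compaction(small_files, target_bytes=1_000_000)
print(f"With a 1 MB target: {len(plan_small)} output files")
```

Because the whole workshop table is far below the target size, a compaction like this would be expected to leave very few files.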

### Task 3.1: Run OPTIMIZE

**Hint:**
```sql
OPTIMIZE catalog.schema.table_name
```


In [None]:
# TODO: Optimize the table
# display(spark.sql(f"OPTIMIZE {table_name}"))

### 3.2: Z-Ordering - Data Colocation

Our analysts frequently filter by `device_id`.
Without clustering, data for `device_id = 42` can be scattered across every file in the table (all ~10000 of them before compaction).
Z-ORDER will reorganize data so all records for `device_id = 42` are in the same file (or few files).
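
The effect on data skipping can be simulated in plain Python. This is a sketch under stated assumptions: each "file" is just a list of `device_id` values, skipping is modeled with per-file min/max statistics only, and Z-ordering on a single column is approximated by a plain sort. All function names are hypothetical.

```python
import random


def split_into_files(values, rows_per_file):
    """Cut a list of values into consecutive chunks that stand in for data files."""
    return [values[i:i + rows_per_file] for i in range(0, len(values), rows_per_file)]


def files_to_scan(files, wanted):
    """Count files whose [min, max] range could contain the wanted value.

    This mimics pruning based on per-file min/max statistics.
    """
    return sum(1 for f in files if min(f) <= wanted <= max(f))


rng = random.Random(42)  # fixed seed so the sketch is repeatable
device_ids = [rng.randrange(100) for _ in range(50_000)]

unordered = split_into_files(device_ids, 5_000)
clustered = split_into_files(sorted(device_ids), 5_000)

print("Files to scan, unordered:", files_to_scan(unordered, 42))
print("Files to scan, clustered:", files_to_scan(clustered, 42))
```

In the unordered layout every file's min/max range covers `42`, so nothing can be skipped; once the values are clustered, only the one or two files that actually hold `device_id = 42` remain.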

### Task 3.2: Run OPTIMIZE with Z-ORDER

**Hint:**
```sql
OPTIMIZE catalog.schema.table_name ZORDER BY (column_name)
```

In [None]:
# TODO: Optimize with Z-ORDER on device_id
# display(spark.sql(f"OPTIMIZE {table_name} ZORDER BY (device_id)"))

### Task 3.3: Check optimization metrics

After OPTIMIZE, check in history:
- How many files were added/removed?
- How did the file count change?
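
In the `DESCRIBE HISTORY` output, the `operationMetrics` column is a map of string keys to string values. The sketch below assumes the `OPTIMIZE` row exposes `numAddedFiles` and `numRemovedFiles` (check the keys in your own history output); the function name and the sample values are hypothetical.

```python
def summarize_optimize_metrics(operation_metrics):
    """Summarize file churn from an OPTIMIZE row's operationMetrics map.

    Values arrive as strings, so they are converted to int before comparing.
    """
    added = int(operation_metrics.get("numAddedFiles", 0))
    removed = int(operation_metrics.get("numRemovedFiles", 0))
    return {"added": added, "removed": removed, "net_change": added - removed}


# Hypothetical metrics, shaped like one row of the history output.
sample_metrics = {"numRemovedFiles": "10000", "numAddedFiles": "1"}
summary = summarize_optimize_metrics(sample_metrics)
print(summary)
```

A strongly negative `net_change` means many small files were replaced by a few larger ones.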

In [None]:
# Check history after optimization
display(spark.sql(f"DESCRIBE HISTORY {table_name} LIMIT 5"))

In [None]:
# Check if file count decreased
display(spark.sql(f"DESCRIBE DETAIL {table_name}"))

## Step 4: Vacuum - Cleaning Old Files

### Problem: Old Files Take Up Space

After OPTIMIZE, old small files still exist on disk (for Time Travel). `VACUUM` removes them.

**WARNING:** After VACUUM, you can no longer time travel to versions that depend on the deleted files, that is, versions older than the retention period!

### Task 4.1: Run VACUUM (DRY RUN mode)

First, check what will be deleted:

**Hint:**
```sql
VACUUM catalog.schema.table_name RETAIN 168 HOURS DRY RUN
```
(168 hours = 7 days - default safety threshold)
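
The retention rule can be sketched as a simple timestamp comparison. This is a simplified model in plain Python: it only considers when a file stopped being referenced by the table and ignores everything else `VACUUM` checks. The function names and dates are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def vacuum_cutoff(now, retain_hours=168):
    """Files removed from the table before this moment are old enough to delete."""
    return now - timedelta(hours=retain_hours)


def is_vacuum_candidate(removed_at, now, retain_hours=168):
    """True if a file was logically removed before the retention cutoff."""
    return removed_at < vacuum_cutoff(now, retain_hours)


now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)  # assumed "current" time
removed_two_days_ago = now - timedelta(days=2)
removed_ten_days_ago = now - timedelta(days=10)

print("Cutoff:", vacuum_cutoff(now))
print("Removed 2 days ago, eligible: ", is_vacuum_candidate(removed_two_days_ago, now))
print("Removed 10 days ago, eligible:", is_vacuum_candidate(removed_ten_days_ago, now))
```

By the same logic, files that `OPTIMIZE` replaced a few minutes ago are younger than 168 hours, so a dry run with the default retention would be expected to list few or no files.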


In [None]:
# TODO: Check what will be deleted (DRY RUN)
# display(spark.sql(f"VACUUM {table_name} RETAIN 168 HOURS DRY RUN"))

### Task 4.2: Run VACUUM (for workshop with shorter time)

In a development environment, we can force a shorter retention time.

**Never do this in production!**


In [None]:
# Disable retention time check (DEV ONLY!)
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# TODO: Run VACUUM with short retention time
# display(spark.sql(f"VACUUM {table_name} RETAIN 0 HOURS"))

## Step 5: Optimization Verification

### Task 5.1: Compare performance

Run a query that filters by `device_id` and check whether it is faster.


In [None]:
%%timeit -n 1 -r 1
# Cell magics such as %%timeit must be the first line of the cell.
# Test query: Filter by a specific device
# This should be faster now (Data Skipping)
spark.table(table_name).filter("device_id = 50").count()
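
If cell magics are not available in your environment, a plain Python timer is one alternative. The helper below is a minimal sketch; `time_call` is a hypothetical name, and the lambda in the example is a stand-in workload, not the Spark query.

```python
import time


def time_call(fn, repeats=3):
    """Run fn several times and return the fastest wall-clock duration in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload; replace it with the Spark query in the notebook.
elapsed = time_call(lambda: sum(range(100_000)))
print(f"Fastest run: {elapsed:.6f} s")
```

In the notebook you could pass `lambda: spark.table(table_name).filter("device_id = 50").count()` as `fn`. To make the comparison meaningful, record one timing before `OPTIMIZE` and another after it.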

## Liquid Clustering

A newer alternative to Z-ORDER - automatically maintains optimal data layout.

```sql
-- When creating a table
CREATE TABLE ... CLUSTER BY (device_id)

-- Or modifying an existing table
ALTER TABLE ... CLUSTER BY (device_id)
```


In [None]:
# Bonus: Enable Liquid Clustering (optional)
# spark.sql(f"ALTER TABLE {table_name} CLUSTER BY (device_id)")

# Solution

The complete code is below. Try to solve it yourself first!


In [None]:
# ============================================================
# FULL SOLUTION - Workshop 2: Delta Lake Optimization
# ============================================================

table_name = f"{catalog}.{SILVER_SCHEMA}.iot_events_silver"

# --- Step 1: Diagnosis ---
print("TABLE DETAILS:")
df_detail = spark.sql(f"DESCRIBE DETAIL {table_name}")
display(df_detail)

print("\nPHYSICAL FILES:")
location = df_detail.collect()[0]['location']
display(dbutils.fs.ls(location))

# --- Step 2: History ---
print("\nCHANGE HISTORY:")
display(spark.sql(f"DESCRIBE HISTORY {table_name}"))

# --- Step 3: Optimization ---
print("\nOPTIMIZE WITH Z-ORDER:")
display(spark.sql(f"OPTIMIZE {table_name} ZORDER BY (device_id)"))

# --- Step 4: Vacuum ---
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
print("\nVACUUM:")
display(spark.sql(f"VACUUM {table_name} RETAIN 0 HOURS"))

# --- Verification ---
print("\nAFTER OPTIMIZATION:")
display(spark.sql(f"DESCRIBE DETAIL {table_name}"))

print("\nOptimization completed!")