---
## 🔧 Step 0: Setup

In [None]:
%run ../00_setup

In [None]:
# Table name from Workshop 1 (or we'll create a test one)
table_name = f"{catalog}.{schema}.customers_silver"
print(f"📊 Working with table: {table_name}")

---
## 🔍 Step 1: Problem Diagnosis

### Task 1.1: Check table details

Use the `DESCRIBE DETAIL` command to see:
- Table size
- Number of files
- Location

**Hint:**
```sql
DESCRIBE DETAIL catalog.schema.table_name
```

In [None]:
# TODO: Display table details
# display(spark.sql(f"DESCRIBE DETAIL ..."))

### 🤔 Analysis Questions:

1. How many files does the table have? (`numFiles` column)
2. What is the size? (`sizeInBytes` column)
3. Is the table partitioned? (`partitionColumns` column)

**Red flag:** If you have many small files (e.g., 100+ files of a few KB each), that's a problem!
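
To make the red flag concrete, here is a minimal sketch that turns the `numFiles` and `sizeInBytes` values into an average file size. The helper name `small_file_report` and the 16 MB cutoff are assumptions made for this workshop, not Delta Lake defaults, and the sample numbers are hypothetical.

```python
def small_file_report(num_files: int, size_in_bytes: int, threshold_mb: float = 16.0) -> dict:
    """Summarize file layout from the numFiles and sizeInBytes columns of DESCRIBE DETAIL.

    threshold_mb is an arbitrary workshop cutoff, not a Delta Lake default.
    """
    avg_mb = (size_in_bytes / num_files) / (1024 * 1024) if num_files else 0.0
    return {
        "num_files": num_files,
        "avg_file_mb": round(avg_mb, 3),
        # A single file cannot be "many small files", so require more than one
        "small_files_suspected": num_files > 1 and avg_mb < threshold_mb,
    }


# Hypothetical values: 200 files totalling 1 MiB
report = small_file_report(num_files=200, size_in_bytes=1_048_576)
print(report)
```

In the notebook, the two inputs could come from `spark.sql(f"DESCRIBE DETAIL {table_name}").first()`, reading its `numFiles` and `sizeInBytes` fields.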

---
## ⏰ Step 2: Time Travel - Change History

### Task 2.1: Display table history

Delta Lake records every change. We can travel through time!

**Hint:**
```sql
DESCRIBE HISTORY catalog.schema.table_name
```

In [None]:
# TODO: Display table change history
# display(spark.sql(f"DESCRIBE HISTORY ..."))

### Task 2.2: Read an older version of the table

You can read data from a specific version or point in time.

**Hint:**
```python
# By version
spark.read.format("delta").option("versionAsOf", 0).table("name")

# By time
spark.read.format("delta").option("timestampAsOf", "2024-01-01").table("name")
```
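
The same snapshots can also be addressed in SQL with the `VERSION AS OF` and `TIMESTAMP AS OF` clauses. The sketch below only builds the query string; `time_travel_query` is a hypothetical helper and the table name is the placeholder from the hint.

```python
from typing import Optional


def time_travel_query(table: str, version: Optional[int] = None,
                      timestamp: Optional[str] = None) -> str:
    """Build a SELECT that reads one historical snapshot of a Delta table."""
    # Exactly one of the two selectors must be given
    if (version is None) == (timestamp is None):
        raise ValueError("Pass exactly one of version or timestamp")
    if version is not None:
        return f"SELECT * FROM {table} VERSION AS OF {int(version)}"
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


# Placeholder table name, mirroring the hint
print(time_travel_query("catalog.schema.table_name", version=0))
print(time_travel_query("catalog.schema.table_name", timestamp="2024-01-01"))
```

The resulting string can be passed to `spark.sql(...)` in the notebook.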

In [None]:
# TODO: Read version 0 of the table (first version)
# df_v0 = spark.read.format("delta").option("versionAsOf", 0).table(table_name)
# print(f"Version 0 had {df_v0.count()} rows")

---
## ⚡ Step 3: Optimization

### 3.1: Small Files Problem

Many small files = many I/O operations = slow queries.

**Solution:** `OPTIMIZE` combines small files into larger ones (default ~1GB).
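
As a rough back-of-the-envelope check, the expected file count after compaction is the table size divided by the target file size. This sketch assumes a 1 GiB target to mirror the figure above; the actual result may differ (for example, with partitioned tables or a configured target size), and `estimate_files_after_compaction` is a hypothetical helper.

```python
import math

ONE_GIB = 1024 ** 3  # assumed target size, mirroring the ~1GB figure above


def estimate_files_after_compaction(size_in_bytes: int, target_bytes: int = ONE_GIB) -> int:
    """Estimate how many files remain if all data is packed into target-sized files."""
    if size_in_bytes <= 0:
        return 0
    return max(1, math.ceil(size_in_bytes / target_bytes))


# Hypothetical table: 500 files of 2 MiB each
before = 500
after = estimate_files_after_compaction(500 * 2 * 1024 ** 2)
print(f"files before: {before}, estimated after: {after}")
```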

### Task 3.1: Run OPTIMIZE

**Hint:**
```sql
OPTIMIZE catalog.schema.table_name
```

In [None]:
# TODO: Optimize the table
# display(spark.sql(f"OPTIMIZE {table_name}"))

### 3.2: Z-Ordering - Data Colocation

The BI team often filters by **city** (`City`). Z-ORDER arranges data so that rows with the same city are physically close together.

**Effect:** Spark can skip entire files that don't contain the searched city!
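
The skipping effect can be illustrated with a toy model in plain Python. Each "file" is a list of city values, and a file has to be read only if the searched city falls between its minimum and maximum value. Sorting stands in for Z-ORDER here, which is a simplification: with a single column the two behave similarly, while real Z-ordering interleaves several columns. All data and helper names are hypothetical.

```python
def files_to_scan(files, city):
    """Count files whose [min, max] City range could contain the searched city."""
    return sum(1 for f in files if min(f) <= city <= max(f))


def chunk(rows, size):
    """Split rows into consecutive 'files' of the given size."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


# Hypothetical data: 8 rows, 4 cities, 2 rows per file
rows = ["Seattle", "Austin", "Boston", "Denver",
        "Austin", "Seattle", "Denver", "Boston"]

unclustered = chunk(rows, 2)        # rows in arrival order
clustered = chunk(sorted(rows), 2)  # rows with the same city stored together

print("unclustered files to scan:", files_to_scan(unclustered, "Denver"))
print("clustered files to scan:  ", files_to_scan(clustered, "Denver"))
```

In the unclustered layout every file's range covers "Denver", so nothing can be skipped; in the clustered layout only one file qualifies.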

### Task 3.2: Run OPTIMIZE with Z-ORDER

**Hint:**
```sql
OPTIMIZE catalog.schema.table_name ZORDER BY (column1, column2)
```

In [None]:
# TODO: Optimize with Z-ORDER on City column
# display(spark.sql(f"OPTIMIZE {table_name} ZORDER BY (City)"))

### Task 3.3: Check optimization metrics

After OPTIMIZE, check in history:
- How many files were added/removed?
- How did the file count change?
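
The numbers for these questions live in the `operationMetrics` column of the history, a map with string values. Below is a minimal sketch for reading it, assuming the OPTIMIZE entry exposes the `numAddedFiles` and `numRemovedFiles` keys; the helper name and the sample values are hypothetical.

```python
def summarize_optimize(operation_metrics: dict) -> dict:
    """Convert the string-valued operationMetrics map of an OPTIMIZE commit to integers."""
    added = int(operation_metrics.get("numAddedFiles", 0))
    removed = int(operation_metrics.get("numRemovedFiles", 0))
    return {
        "files_added": added,
        "files_removed": removed,
        "net_change": added - removed,  # negative means fewer files than before
    }


# Hypothetical metrics, shaped like one row of DESCRIBE HISTORY
sample = {"numAddedFiles": "1", "numRemovedFiles": "120"}
print(summarize_optimize(sample))
```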

In [None]:
# Check history after optimization
display(spark.sql(f"DESCRIBE HISTORY {table_name} LIMIT 5"))

In [None]:
# Check if file count decreased
display(spark.sql(f"DESCRIBE DETAIL {table_name}"))

---
## 🧹 Step 4: Vacuum - Cleaning Old Files

### Problem: Old Files Take Up Space

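After OPTIMIZE, the old small files remain in storage so that Time Travel can still read earlier versions. `VACUUM` deletes files that the current table version no longer references and that are older than the retention threshold.

⚠️ **WARNING:** After VACUUM, you can no longer time travel to versions that depend on the deleted files!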

### Task 4.1: Run VACUUM (DRY RUN mode)

First, check what will be deleted:

**Hint:**
```sql
VACUUM catalog.schema.table_name RETAIN 168 HOURS DRY RUN
```
(168 hours = 7 days - default safety threshold)
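
The retention threshold can be pictured as a simple cutoff on when a file stopped being part of the table. The sketch below is a simplified model in plain Python, not how VACUUM is implemented; file names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def vacuum_candidates(removed_at: dict, now: datetime, retain_hours: int = 168) -> list:
    """Return files that left the table before the retention cutoff."""
    cutoff = now - timedelta(hours=retain_hours)
    return sorted(name for name, ts in removed_at.items() if ts < cutoff)


now = datetime(2024, 1, 10, 12, 0)
removed_at = {
    "part-0001.parquet": datetime(2024, 1, 1, 9, 0),  # removed more than 7 days ago
    "part-0002.parquet": datetime(2024, 1, 9, 9, 0),  # removed about 1 day ago
}

print("RETAIN 168 HOURS:", vacuum_candidates(removed_at, now))
print("RETAIN 0 HOURS:  ", vacuum_candidates(removed_at, now, retain_hours=0))
```

With the default window only the older file is eligible; with a zero-hour window both are, which is why a short retention removes the ability to read recent older versions.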

In [None]:
# TODO: Check what will be deleted (DRY RUN)
# display(spark.sql(f"VACUUM {table_name} RETAIN 168 HOURS DRY RUN"))

### Task 4.2: Run VACUUM (for workshop with shorter time)

In a development environment, we can force a shorter retention time.

⚠️ **Never do this in production!**

In [None]:
# Disable retention time check (DEV ONLY!)
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# TODO: Run VACUUM with short retention time
# display(spark.sql(f"VACUUM {table_name} RETAIN 0 HOURS"))

---
## ✅ Step 5: Optimization Verification

### Task 5.1: Compare performance

Run a query filtering by city and check if it's faster.
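
A single timing can be noisy (caching, cluster warm-up), so it may help to repeat the measurement and compare medians. Here is a minimal sketch of such a helper; `time_action` is a hypothetical name, and the workload is a stand-in so the block runs without Spark.

```python
import time
from statistics import median


def time_action(action, repeats: int = 3) -> float:
    """Run a zero-argument callable several times and return the median wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return median(timings)


# Stand-in workload; in the notebook, pass a lambda that runs the Spark count instead
elapsed = time_action(lambda: sum(range(100_000)))
print(f"median: {elapsed:.4f} s")
```

In the notebook the callable could be `lambda: spark.table(table_name).filter("City = 'Seattle'").count()`.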

In [None]:
%%timeit -n 1 -r 1
# Test query after optimization (the cell magic must be the first line of the cell)
spark.table(table_name).filter("City = 'Seattle'").count()

---
## 🎯 Bonus: Liquid Clustering (Databricks 13.3+)

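A newer alternative to Z-ORDER: you declare the clustering columns once on the table, and subsequent `OPTIMIZE` runs maintain the data layout incrementally, with no `ZORDER BY` clause needed. It cannot be combined with partitioning or Z-ORDER on the same table.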

```sql
-- When creating a table
CREATE TABLE ... CLUSTER BY (City)

-- Or modifying an existing table
ALTER TABLE ... CLUSTER BY (City)
```

In [None]:
# Bonus: Enable Liquid Clustering (optional)
# spark.sql(f"ALTER TABLE {table_name} CLUSTER BY (City)")

---
---

# 📋 SOLUTION

⚠️ **Don't look here until you've tried it yourself!** ⚠️

In [None]:
# ============================================================
# 📋 FULL SOLUTION - Workshop 2: Delta Lake Optimization
# ============================================================

table_name = f"{catalog}.{schema}.customers_silver"

# --- Step 1: Diagnosis ---
print("📊 TABLE DETAILS:")
display(spark.sql(f"DESCRIBE DETAIL {table_name}"))

# --- Step 2: History ---
print("\n⏰ CHANGE HISTORY:")
display(spark.sql(f"DESCRIBE HISTORY {table_name}"))

# --- Step 3: Optimization ---
print("\n⚡ OPTIMIZE WITH Z-ORDER:")
display(spark.sql(f"OPTIMIZE {table_name} ZORDER BY (City)"))

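# --- Step 4: Vacuum ---
# DEV ONLY: disabling the retention check and using RETAIN 0 HOURS
# removes the files that Time Travel needs for older versions
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
print("\n🧹 VACUUM:")
display(spark.sql(f"VACUUM {table_name} RETAIN 0 HOURS"))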

# --- Verification ---
print("\n✅ AFTER OPTIMIZATION:")
display(spark.sql(f"DESCRIBE DETAIL {table_name}"))

print("\n✅ Optimization completed!")