# LAB 03: Delta DML & Time Travel

**Duration:** ~40 min | **Day:** 1 | **Difficulty:** Intermediate
**After module:** M03: Delta Lake Fundamentals

> *"Merge new customer data, handle accidental deletes, recover using Time Travel, and see how VACUUM affects it."*

## Setup

In [None]:
%run ../../setup/00_setup

In [None]:
# Recreate bronze.customers from the base CSV so the lab always starts from the same state (safe to re-run)
customers_path = f"{DATASET_PATH}/customers/customers.csv"
df_base = spark.read.format("csv").option("header", True).option("inferSchema", True).load(customers_path)

spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{BRONZE_SCHEMA}")
spark.sql(f"DROP TABLE IF EXISTS {CATALOG}.{BRONZE_SCHEMA}.customers")
df_base.write.mode("overwrite").saveAsTable(f"{CATALOG}.{BRONZE_SCHEMA}.customers")
print(f"Base table ready: {spark.table(f'{CATALOG}.{BRONZE_SCHEMA}.customers').count()} rows")

---
## Task 1: Examine the Update File

Load `customers_new.csv` and compare its row count with the base table. How many customers in the update file already exist in the base table?
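
One way to reason about the overlap is with plain Python sets. The sketch below is not Spark code; `overlap_counts` and the sample IDs are hypothetical, and it assumes `customer_id` is unique within each input.

```python
def overlap_counts(base_ids, update_ids):
    """Classify update keys as already present in the base table or new."""
    base, updates = set(base_ids), set(update_ids)
    matched = base & updates      # keys an upsert would UPDATE
    new_only = updates - base     # keys an upsert would INSERT
    return {
        "matched": len(matched),
        "new": len(new_only),
        "expected_after_merge": len(base) + len(new_only),
    }


# Made-up IDs for illustration. In the lab, the lists could come from e.g.
# [r["customer_id"] for r in df_updates.select("customer_id").collect()]
result = overlap_counts(["C001", "C002", "C003"], ["C002", "C003", "C004"])
print(result)
```

Under the uniqueness assumption, `expected_after_merge` is the row count you should see after Task 2.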

In [None]:
# TODO: Read the update file
update_path = f"{DATASET_PATH}/customers/customers_new.csv"

df_updates = (
    spark.read
    .format("csv")
    .option("header", True)
    .option("inferSchema", True)
    .load(________)
)

print(f"Existing customers: {spark.table(f'{CATALOG}.{BRONZE_SCHEMA}.customers').count()}")
print(f"Updates file: {df_updates.count()} rows")

display(df_updates.limit(5))

In [None]:
# -- Validation --
assert df_updates.count() > 0, "Updates file is empty"
print(f"Task 1 OK: {df_updates.count()} update records loaded")

---
## Task 2: MERGE INTO (Upsert)

Register `df_updates` as temp view `v_updates`, then use SQL MERGE to:
- **UPDATE** existing customers (match on `customer_id`)
- **INSERT** new customers
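
To make the upsert semantics concrete, here is a minimal plain-Python model of the two branches described above. It is only a sketch: `merge_upsert` and the sample rows are hypothetical. Unlike a real Delta MERGE, which raises an error when several source rows match the same target row, this dict-based model silently keeps the last one.

```python
def merge_upsert(target, source, key="customer_id"):
    """Return a new list of rows: matched keys replaced, unmatched keys appended."""
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        # Key already present -> every column is overwritten (the MATCHED branch).
        # Key absent          -> the row is added (the NOT MATCHED branch).
        merged[row[key]] = dict(row)
    return list(merged.values())


# Made-up rows for illustration only.
target = [
    {"customer_id": 1, "city": "Austin"},
    {"customer_id": 2, "city": "Boston"},
]
source = [
    {"customer_id": 2, "city": "Denver"},   # existing customer, changed city
    {"customer_id": 3, "city": "Miami"},    # new customer
]
result = merge_upsert(target, source)
print(result)
```

The row count grows only by the number of source keys that were not already in the target, which is why the Task 2 validation checks `new_count >= base_count` rather than a fixed sum.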

In [None]:
# Register updates as temp view
df_updates.createOrReplaceTempView("v_updates")

In [None]:
# TODO: Complete the MERGE statement
spark.sql(f"""
    MERGE INTO {CATALOG}.{BRONZE_SCHEMA}.customers AS target
    USING v_updates AS source
    ON target.customer_id = source.________
    WHEN MATCHED THEN
        UPDATE SET *
    WHEN NOT MATCHED THEN
        ________ 
""")

new_count = spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers").count()
print(f"Customers after MERGE: {new_count}")

In [None]:
# -- Validation --
base_count = df_base.count()
assert new_count >= base_count, f"Expected at least {base_count} rows after MERGE, got {new_count}"
print(f"Task 2 OK: MERGE completed. {new_count} total customers (was {base_count})")

---
## Task 3: UPDATE Records

Update the `state` column for all customers where `city = 'Austin'`. Set state to `'TX'`.

In [None]:
# TODO: UPDATE statement
spark.sql(f"""
    UPDATE {CATALOG}.{BRONZE_SCHEMA}.customers
    SET state = ________
    WHERE city = ________
""")

# Verify
display(spark.sql(f"SELECT * FROM {CATALOG}.{BRONZE_SCHEMA}.customers WHERE city = 'Austin'"))

---
## Task 4: Accidental DELETE

Simulate an accident: delete every customer whose `country` is not null.

**WARNING:** This is intentional! We will recover the data using Time Travel.

In [None]:
# Record row count BEFORE the accident
count_before = spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers").count()
print(f"Rows BEFORE delete: {count_before}")

# "Accident" - delete a large chunk
spark.sql(f"""
    DELETE FROM {CATALOG}.{BRONZE_SCHEMA}.customers
    WHERE country IS NOT NULL
""")

count_after = spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers").count()
print(f"Rows AFTER delete: {count_after} (lost {count_before - count_after} rows!)")

---
## Task 5: DESCRIBE HISTORY

Check the table history to see all operations performed.

In [None]:
# TODO: Show table history
display(spark.sql(f"________ {CATALOG}.{BRONZE_SCHEMA}.customers"))

In [None]:
# -- Validation --
history = spark.sql(f"DESCRIBE HISTORY {CATALOG}.{BRONZE_SCHEMA}.customers").collect()
operations = [row["operation"] for row in history]
assert "DELETE" in operations, "Expected DELETE in history"
assert "MERGE" in operations, "Expected MERGE in history"
print(f"Task 5 OK: {len(history)} versions found. Operations: {operations}")

---
## Task 6: Time Travel - Query Previous Version

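Read the table as it was BEFORE the accidental delete, i.e. the version immediately preceding the DELETE entry in the history. (Row count alone does not identify it: the versions written by MERGE and UPDATE can have the same number of rows.)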
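
If you want to double-check the version number you picked by hand, the lookup can be sketched as a small helper. `version_before` is a hypothetical name, and the sample history below is made up; the function only assumes each row exposes `version` and `operation` by key, which is how the Task 5 validation cell reads the rows from `DESCRIBE HISTORY`.

```python
def version_before(history, operation="DELETE"):
    """Return the version just before the most recent commit of `operation`."""
    versions = [row["version"] for row in history if row["operation"] == operation]
    if not versions:
        raise ValueError(f"No {operation} commit found in history")
    return max(versions) - 1


# Illustrative history only; the operation names in your table may differ.
sample_history = [
    {"version": 3, "operation": "DELETE"},
    {"version": 2, "operation": "UPDATE"},
    {"version": 1, "operation": "MERGE"},
    {"version": 0, "operation": "WRITE"},
]
print(version_before(sample_history))  # prints 2
```

In the notebook you could pass the `history` list collected in the Task 5 validation cell and compare the result with your own answer.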

In [None]:
# TODO: Find the version number before DELETE from the history above
# Then read that version

version_before_delete = ________  # Replace with the correct version number

df_recovered = spark.sql(f"""
    SELECT * FROM {CATALOG}.{BRONZE_SCHEMA}.customers
    VERSION AS OF {version_before_delete}
""")

print(f"Recovered version has {df_recovered.count()} rows (current has {count_after})")

In [None]:
# -- Validation --
assert df_recovered.count() > count_after, "Recovered version should have more rows than current"
print(f"Task 6 OK: Time Travel successful! Recovered {df_recovered.count()} rows")

---
## Task 7: RESTORE the Table

Use `RESTORE TABLE` to bring the table back to the version before the accidental delete.
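
Note that a restore does not erase history: it is recorded as a new commit on top of the existing ones. The toy model below illustrates that idea in plain Python; `restore` and the commit dictionaries are hypothetical and say nothing about how Delta stores its log.

```python
def restore(history, target_version):
    """Append a new commit whose snapshot copies the one at target_version."""
    snapshots = {commit["version"]: commit["rows"] for commit in history}
    new_commit = {
        "version": max(snapshots) + 1,
        "operation": "RESTORE",
        "rows": list(snapshots[target_version]),
    }
    # Earlier commits, including the accidental DELETE, stay in the history.
    return history + [new_commit]


# Made-up commits for illustration only.
history = [
    {"version": 0, "operation": "WRITE", "rows": ["a", "b"]},
    {"version": 1, "operation": "DELETE", "rows": []},
]
after = restore(history, target_version=0)
print([(c["version"], c["operation"]) for c in after])
```

This is why `DESCRIBE HISTORY` shows one more version after this task, with the DELETE still listed.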

In [None]:
# TODO: Restore the table
spark.sql(f"""
    RESTORE TABLE {CATALOG}.{BRONZE_SCHEMA}.customers
    TO VERSION AS OF ________
""")

restored_count = spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers").count()
print(f"Rows after RESTORE: {restored_count}")

In [None]:
# -- Validation --
assert restored_count == count_before, f"Expected {count_before} rows after restore, got {restored_count}"
print(f"Task 7 OK: Table restored! {restored_count} rows (matches pre-delete count)")

---
## Task 8: VACUUM and Its Impact on Time Travel

Run `VACUUM` with 0 hours retention, then try to query an old version. Time Travel **no longer works** for versions whose data files VACUUM has physically deleted.

> **Warning:** `RETAIN 0 HOURS` is for demo purposes only. The default retention threshold is **7 days**; never lower it in production without understanding the consequences.
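
The effect of the retention threshold can be sketched as simple date arithmetic. This is a deliberately simplified model, not the actual VACUUM implementation: `vacuum_eligible` is a hypothetical helper that only compares how long ago a file stopped being part of the table with the retention window (168 hours = 7 days).

```python
from datetime import datetime, timedelta


def vacuum_eligible(removed_at, now, retain_hours=168):
    """True if a file dropped from the table at `removed_at` is past the retention window."""
    return now - removed_at >= timedelta(hours=retain_hours)


# Made-up timestamps for illustration only.
now = datetime(2024, 1, 10, 12, 0)
removed_one_hour_ago = now - timedelta(hours=1)

print(vacuum_eligible(removed_one_hour_ago, now))                  # False with the 7-day default
print(vacuum_eligible(removed_one_hour_ago, now, retain_hours=0))  # True with the lab setting
```

With the default window, the files written earlier in this lab would survive a VACUUM, which is why the demo needs `RETAIN 0 HOURS`.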

In [None]:
# Step 1: Check how many versions exist before VACUUM
history_before = spark.sql(f"DESCRIBE HISTORY {CATALOG}.{BRONZE_SCHEMA}.customers").collect()
print(f"Versions available: {len(history_before)}")
print(f"Version numbers: {[r['version'] for r in history_before]}")

# Step 2: Disable safety check (LAB ONLY!)
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# TODO: Run VACUUM with 0 hours retention
spark.sql(f"________ {CATALOG}.{BRONZE_SCHEMA}.customers RETAIN 0 HOURS")

# Re-enable safety
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")
print("VACUUM complete — old data files removed!")
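
If the VACUUM statement in the cell above raises, the line that re-enables the safety check never runs. One possible way to guard against that is a small context manager that always restores the previous setting. This is a sketch: `temporary_conf` is a hypothetical helper, and `FakeConf` is a stand-in used only so the example runs without a Spark session. It assumes the real `spark.conf` object offers `get(key, default)`, `set(key, value)` and `unset(key)`.

```python
from contextlib import contextmanager


@contextmanager
def temporary_conf(conf, key, value):
    """Set a config key for the duration of a block, then restore the old value."""
    previous = conf.get(key, None)
    conf.set(key, value)
    try:
        yield
    finally:
        if previous is None:
            conf.unset(key)
        else:
            conf.set(key, previous)


class FakeConf:
    """Minimal stand-in for spark.conf, for demonstration only."""

    def __init__(self):
        self._values = {}

    def get(self, key, default=None):
        return self._values.get(key, default)

    def set(self, key, value):
        self._values[key] = value

    def unset(self, key):
        self._values.pop(key, None)


KEY = "spark.databricks.delta.retentionDurationCheck.enabled"
conf = FakeConf()
conf.set(KEY, "true")

try:
    with temporary_conf(conf, KEY, "false"):
        inside = conf.get(KEY)
        raise RuntimeError("simulated VACUUM failure")
except RuntimeError:
    pass

print(inside, conf.get(KEY))
```

In the notebook the same pattern would wrap the VACUUM call, with `spark.conf` passed in place of `FakeConf()`.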

In [None]:
# TODO: Try to query version 0 (the original table before any changes)
# This should FAIL with FileNotFoundException — VACUUM deleted the old data files

try:
    df_old = spark.sql(f"""
        SELECT * FROM {CATALOG}.{BRONZE_SCHEMA}.customers
        VERSION AS OF 0
    """)
    df_old.count()  # Force evaluation
    print("Unexpected: Time Travel still works (files not yet cleaned)")
except Exception as e:
    print("Expected error! Time Travel FAILED after VACUUM:")
    print(f"  {type(e).__name__}: {str(e)[:200]}")
    print("\n→ VACUUM removed the old data files. History metadata exists, but data is gone.")

In [None]:
# -- Validation --
# After VACUUM, the latest version should still be accessible
current = spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers").count()
assert current > 0, "Current version should still work after VACUUM"
print(f"Task 8 OK: Current table has {current} rows (latest version OK)")
print("Key takeaway: VACUUM removes old files → Time Travel breaks for vacuumed versions")
print("  Default retention: 7 days | Production best practice: never set to 0 hours")

---
## Lab Complete!

You have:
- Used MERGE INTO for upsert (insert + update)
- Performed UPDATE and DELETE on Delta tables
- Inspected history with DESCRIBE HISTORY
- Queried previous versions with Time Travel
- Restored a table with RESTORE TABLE
- Ran VACUUM and observed its impact on Time Travel

> **Exam Tip:** Time Travel uses the Delta transaction log. Data files for old versions are only removed by `VACUUM`. Default retention is **7 days**. After VACUUM, `DESCRIBE HISTORY` still shows metadata, but querying old versions fails because the underlying Parquet files are gone.

> **Next:** LAB 04 - Delta Optimization

## Cleanup (Optional)

In [None]:
# Optional cleanup
# spark.sql(f"DROP TABLE IF EXISTS {CATALOG}.{BRONZE_SCHEMA}.customers")
print("LAB 03 complete.")