# Delta Lake Advanced Operations

## Overview
This notebook covers advanced Delta Lake features: MERGE, time travel, file optimization, schema evolution, VACUUM, Change Data Feed, constraints, and table properties.

## Learning Objectives
- Perform MERGE (UPSERT) operations
- Use time travel for auditing and rollback
- Optimize Delta tables with OPTIMIZE and Z-ORDER
- Manage schema evolution
- Use VACUUM for cleanup
- Work with Change Data Feed (CDF)
- Enforce data quality with CHECK constraints
- Inspect and set table properties

---

## 1. MERGE Operations (UPSERT)

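MERGE lets you upsert (update + insert) data into a Delta table: target rows that match the join condition are updated or deleted, and source rows with no match are inserted, all in one atomic commit.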

In [None]:
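from delta.tables import DeltaTable
# Namespaced import: a wildcard import would shadow built-ins such as sum, min, and max
from pyspark.sql import functions as F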

# Create target table
target_data = [
    (1, "Alice", 25, "2024-01-01"),
    (2, "Bob", 30, "2024-01-01"),
    (3, "Charlie", 35, "2024-01-01")
]

target_df = spark.createDataFrame(
    target_data,
    ["id", "name", "age", "updated_date"]
)

# Write as Delta table
target_df.write.format("delta").mode("overwrite").save("/tmp/delta/users")

print("Initial target table:")
spark.read.format("delta").load("/tmp/delta/users").show()

In [None]:
# Create source data with updates and new records
source_data = [
    (2, "Bob", 31, "2024-01-15"),  # Update: age changed
    (3, "Charlie", 35, "2024-01-15"),  # No change
    (4, "Diana", 28, "2024-01-15")  # New record
]

source_df = spark.createDataFrame(
    source_data,
    ["id", "name", "age", "updated_date"]
)

print("Source data:")
source_df.show()

In [None]:
# Perform MERGE
delta_table = DeltaTable.forPath(spark, "/tmp/delta/users")

delta_table.alias("target").merge(
    source_df.alias("source"),
    "target.id = source.id"
).whenMatchedUpdate(set = {
    "name": "source.name",
    "age": "source.age",
    "updated_date": "source.updated_date"
}).whenNotMatchedInsert(values = {
    "id": "source.id",
    "name": "source.name",
    "age": "source.age",
    "updated_date": "source.updated_date"
}).execute()

print("After MERGE:")
spark.read.format("delta").load("/tmp/delta/users").show()
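
The `set` and `values` dictionaries above repeat every column name by hand. As a minimal sketch, the mappings can be generated from a column list instead. The helper name `merge_column_maps` is hypothetical and not part of the Delta Lake API; it only builds plain dictionaries and needs no Spark session.

```python
def merge_column_maps(columns, key_columns, source_alias="source"):
    """Build the dicts passed to whenMatchedUpdate(set=...) and
    whenNotMatchedInsert(values=...).

    Hypothetical helper for illustration; not part of the Delta Lake API.
    """
    # Inserts need every column, including the join keys
    insert_values = {col: f"{source_alias}.{col}" for col in columns}
    # Updates skip the join keys, since matched rows already agree on them
    update_set = {
        col: expr for col, expr in insert_values.items() if col not in key_columns
    }
    return update_set, insert_values


update_set, insert_values = merge_column_maps(
    ["id", "name", "age", "updated_date"], key_columns=["id"]
)
print(update_set)
print(insert_values)
```

The two results can be passed as `whenMatchedUpdate(set=update_set)` and `whenNotMatchedInsert(values=insert_values)`. When every column is copied unchanged, `whenMatchedUpdateAll()` and `whenNotMatchedInsertAll()` avoid the mapping entirely.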

### Conditional MERGE

In [None]:
# MERGE with conditions
delta_table.alias("target").merge(
    source_df.alias("source"),
    "target.id = source.id"
).whenMatchedUpdate(
    condition = "source.age > target.age",  # Only update if age increased
    set = {
        "age": "source.age",
        "updated_date": "source.updated_date"
    }
).whenNotMatchedInsert(
    values = {
        "id": "source.id",
        "name": "source.name",
        "age": "source.age",
        "updated_date": "source.updated_date"
    }
).execute()

print("Conditional MERGE complete")

### MERGE with Delete

In [None]:
# MERGE that can also delete records
delta_table.alias("target").merge(
    source_df.alias("source"),
    "target.id = source.id"
).whenMatchedDelete(
    condition = "source.age < 30"  # Delete if age < 30
).whenMatchedUpdate(
    set = {
        "age": "source.age",
        "updated_date": "source.updated_date"
    }
).whenNotMatchedInsert(
    values = {
        "id": "source.id",
        "name": "source.name",
        "age": "source.age",
        "updated_date": "source.updated_date"
    }
).execute()

print("MERGE with delete complete")
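
When a MERGE has several `whenMatched` clauses, they are evaluated in the order written, and the first clause whose condition holds is applied to the row. The sketch below models that rule in plain Python for the source rows used above. It assumes the earlier cells have run, so id 4 already exists in the target and counts as matched. The function `first_matching_action` is hypothetical and only illustrates the ordering; it is not how Delta executes a MERGE.

```python
def first_matching_action(source_row, clauses):
    """Return the action of the first clause whose condition accepts the row."""
    for action, condition in clauses:
        if condition is None or condition(source_row):
            return action
    return "no-op"


# Same clause order as the MERGE above: conditional delete first, then update
clauses = [
    ("delete", lambda src: src["age"] < 30),
    ("update", None),  # no condition: catches every remaining matched row
]

matched_source_rows = [
    {"id": 2, "name": "Bob", "age": 31},
    {"id": 3, "name": "Charlie", "age": 35},
    {"id": 4, "name": "Diana", "age": 28},
]

actions = {
    row["id"]: first_matching_action(row, clauses) for row in matched_source_rows
}
print(actions)  # {2: 'update', 3: 'update', 4: 'delete'}
```

Swapping the two clauses would change the outcome: an unconditional update listed first would catch every matched row, and the delete would never apply.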

## 2. Time Travel

Every commit to a Delta table creates a new version. Time travel lets you query an earlier version by version number or by timestamp.

In [None]:
# View history
delta_table = DeltaTable.forPath(spark, "/tmp/delta/users")
history_df = delta_table.history()

print("Table history:")
history_df.select("version", "timestamp", "operation", "operationMetrics").show(truncate=False)

In [None]:
# Query by version
version_0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/users")

print("Version 0 (original data):")
version_0.show()

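# Query by timestamp
# The timestamp must fall within the table's commit history; a date earlier
# than the first commit (such as the one below, for a table created today) fails
# df_timestamp = spark.read.format("delta") \
#     .option("timestampAsOf", "2024-01-01") \
#     .load("/tmp/delta/users")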
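
A timestamp-based read resolves to the latest version committed at or before the requested time. The sketch below models that lookup in plain Python. The commit log is hypothetical. Rejecting timestamps outside the commit history is an assumption about Delta's default behavior, so check the documentation for your version.

```python
from bisect import bisect_right
from datetime import datetime


def resolve_version(commits, as_of):
    """Return the version a timestamp-based read would resolve to.

    commits: list of (version, commit_time) tuples sorted by commit_time.
    Conceptual model only; not Delta's implementation.
    """
    commit_times = [commit_time for _, commit_time in commits]
    if as_of < commit_times[0]:
        raise ValueError("timestamp is before the earliest available version")
    if as_of > commit_times[-1]:
        # Assumption: timestamps after the latest commit are rejected by default
        raise ValueError("timestamp is after the latest available version")
    # Latest commit made at or before the requested timestamp
    index = bisect_right(commit_times, as_of) - 1
    return commits[index][0]


# Hypothetical commit log: three versions written on different days
commits = [
    (0, datetime(2024, 1, 1, 9, 0)),
    (1, datetime(2024, 1, 15, 9, 0)),
    (2, datetime(2024, 1, 20, 9, 0)),
]

print(resolve_version(commits, datetime(2024, 1, 10)))        # 0
print(resolve_version(commits, datetime(2024, 1, 15, 9, 0)))  # 1
```

In this model, midnight on 2024-01-01 precedes the first commit at 09:00 and raises an error. The same reasoning explains why a hard-coded past date is expected to fail against a table that was created when the notebook ran.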

In [None]:
# SQL time travel
# Time travel applies to tables, not views, so query the Delta table
# directly by path and pin the version
spark.sql("SELECT * FROM delta.`/tmp/delta/users` VERSION AS OF 0").show()

### Restore to Previous Version

In [None]:
# Restore to version 0
# delta_table.restoreToVersion(0)

# Or restore to timestamp
# delta_table.restoreToTimestamp("2024-01-01")

print("Restore operations available (commented out)")

## 3. OPTIMIZE and Z-ORDER

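OPTIMIZE compacts many small files into fewer, larger ones. Z-ORDER additionally co-locates related rows in the same files so that queries can skip files that cannot contain matching rows.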

In [None]:
# Create larger table for optimization demo
large_data = [(i, f"User{i}", 20 + (i % 30), "2024-01-01") for i in range(1, 10001)]
large_df = spark.createDataFrame(large_data, ["id", "name", "age", "date"])

# Write with many small files
large_df.repartition(100).write.format("delta").mode("overwrite").save("/tmp/delta/large_users")

# Check file stats before optimization
print("Before OPTIMIZE:")
spark.sql("DESCRIBE DETAIL delta.`/tmp/delta/large_users`").select("numFiles", "sizeInBytes").show()

In [None]:
# OPTIMIZE - compact small files
spark.sql("OPTIMIZE delta.`/tmp/delta/large_users`")

print("\nAfter OPTIMIZE:")
spark.sql("DESCRIBE DETAIL delta.`/tmp/delta/large_users`").select("numFiles", "sizeInBytes").show()

In [None]:
# OPTIMIZE with Z-ORDER
# Z-ORDER co-locates rows with similar values of the given columns in the same
# files, so filters on those columns can skip more files
spark.sql("OPTIMIZE delta.`/tmp/delta/large_users` ZORDER BY (age)")

print("OPTIMIZE with Z-ORDER complete")
print("On large, multi-file tables, queries filtering on 'age' can now skip more files")
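
Z-ordering on several columns relies on a space-filling curve: the column values are combined into a single sort key so that rows close together in every column end up near each other. The sketch below shows the idea with a two-column Morton code that interleaves bits. It is a conceptual illustration only; Delta's actual implementation differs in its details, and the function name `z_value` is hypothetical.

```python
def z_value(x, y, bits=8):
    """Interleave the low `bits` bits of x and y into one Morton code."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return z


# Sort a 4x4 grid of (x, y) points along the Z-order curve
points = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(points, key=lambda p: z_value(*p))

print(ordered[:4])  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

The first four points form a 2x2 block. A file holding them would have a narrow min/max range on both columns, which is what lets a filter on either column skip the file.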

## 4. Schema Evolution

In [None]:
# Create initial table
initial_data = [(1, "Alice"), (2, "Bob")]
initial_df = spark.createDataFrame(initial_data, ["id", "name"])
initial_df.write.format("delta").mode("overwrite").save("/tmp/delta/schema_test")

print("Initial schema:")
spark.read.format("delta").load("/tmp/delta/schema_test").printSchema()

In [None]:
# Try to add new column - will fail without mergeSchema
new_data = [(3, "Charlie", 25)]
new_df = spark.createDataFrame(new_data, ["id", "name", "age"])

# This would fail:
# new_df.write.format("delta").mode("append").save("/tmp/delta/schema_test")

# Enable schema evolution
new_df.write.format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save("/tmp/delta/schema_test")

print("\nAfter schema evolution:")
evolved_df = spark.read.format("delta").load("/tmp/delta/schema_test")
evolved_df.printSchema()
evolved_df.show()
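
With `mergeSchema`, the table schema becomes the existing columns plus any new columns from the incoming data, and rows written earlier read back with null in the new columns. The sketch below models that in plain Python. The function `merge_schemas` is hypothetical, and raising on every type mismatch is a simplification of the real type-compatibility rules.

```python
def merge_schemas(table_schema, incoming_schema):
    """Return the table schema extended with new columns from incoming_schema.

    Schemas are dicts of column name -> type name. Conceptual model only.
    """
    merged = dict(table_schema)
    for name, dtype in incoming_schema.items():
        if name in merged and merged[name] != dtype:
            # Simplification: treat any type difference as incompatible
            raise TypeError(f"column {name!r}: {merged[name]} vs {dtype}")
        merged.setdefault(name, dtype)
    return merged


table_schema = {"id": "bigint", "name": "string"}
incoming_schema = {"id": "bigint", "name": "string", "age": "bigint"}
merged = merge_schemas(table_schema, incoming_schema)
print(merged)

# Rows written before the change read back with None (null) in the new column
existing_rows = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
padded_rows = [{col: row.get(col) for col in merged} for row in existing_rows]
print(padded_rows)
```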

### Automatic Schema Evolution

In [None]:
# Enable automatic schema merging for the whole Spark session
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# Writes in this session now add new columns without the mergeSchema option
another_col_data = [(4, "Diana", 28, "Engineer")]
another_df = spark.createDataFrame(another_col_data, ["id", "name", "age", "title"])

another_df.write.format("delta").mode("append").save("/tmp/delta/schema_test")

print("After auto schema evolution:")
spark.read.format("delta").load("/tmp/delta/schema_test").printSchema()

## 5. VACUUM - Clean Up Old Files

In [None]:
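# VACUUM deletes data files that the table no longer references
# and that are older than the retention period (default: 7 days)

# Check what would be deleted (dry run); 168 hours is the 7-day default.
# A shorter retention is rejected by the safety check, even for a dry run.
spark.sql("VACUUM delta.`/tmp/delta/users` RETAIN 168 HOURS DRY RUN").show()

# To vacuum with less than 7 days of retention, disable the safety check first.
# Doing so breaks time travel to versions that need the deleted files.
# spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
# spark.sql("VACUUM delta.`/tmp/delta/users` RETAIN 0 HOURS")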

print("VACUUM dry run complete")
print("⚠️ Be careful with VACUUM - it deletes old files permanently")
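
The retention period decides which unreferenced files VACUUM may delete. The sketch below models that cutoff in plain Python. The file names and times are hypothetical, and the real command works from the transaction log and file timestamps rather than a dictionary like this one.

```python
from datetime import datetime, timedelta


def vacuum_candidates(unreferenced_files, now, retention=timedelta(days=7)):
    """Return paths of unreferenced files older than the retention window.

    unreferenced_files: dict mapping a file path to the time it stopped
    being referenced by the table. Conceptual model only.
    """
    cutoff = now - retention
    return sorted(
        path
        for path, removed_at in unreferenced_files.items()
        if removed_at < cutoff
    )


now = datetime(2024, 2, 1)
# Hypothetical file names and removal times
unreferenced_files = {
    "part-0001.parquet": datetime(2024, 1, 10),
    "part-0002.parquet": datetime(2024, 1, 30),
}

print(vacuum_candidates(unreferenced_files, now))
# ['part-0001.parquet']
print(vacuum_candidates(unreferenced_files, now, retention=timedelta(hours=0)))
# ['part-0001.parquet', 'part-0002.parquet']
```

With zero retention, the file removed two days ago is deleted as well, so any older table version that still depends on it can no longer be read.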

## 6. Change Data Feed (CDF)

Change Data Feed records row-level inserts, updates, and deletes for each table version, so downstream jobs can read only what changed.

In [None]:
# Create table with CDF enabled
cdf_data = [(1, "Alice", 100), (2, "Bob", 200)]
cdf_df = spark.createDataFrame(cdf_data, ["id", "name", "amount"])

cdf_df.write.format("delta") \
    .option("delta.enableChangeDataFeed", "true") \
    .mode("overwrite") \
    .save("/tmp/delta/cdf_table")

print("Table with CDF enabled created")

In [None]:
# Make some changes
updates = [(1, "Alice", 150), (3, "Charlie", 300)]
update_df = spark.createDataFrame(updates, ["id", "name", "amount"])

delta_cdf = DeltaTable.forPath(spark, "/tmp/delta/cdf_table")
delta_cdf.alias("target").merge(
    update_df.alias("source"),
    "target.id = source.id"
).whenMatchedUpdate(set = {
    "amount": "source.amount"
}).whenNotMatchedInsert(values = {
    "id": "source.id",
    "name": "source.name",
    "amount": "source.amount"
}).execute()

print("Changes made")

In [None]:
# Read change data feed
changes = spark.read.format("delta") \
    .option("readChangeData", "true") \
    .option("startingVersion", 0) \
    .load("/tmp/delta/cdf_table")

print("Change Data Feed:")
changes.select("id", "name", "amount", "_change_type", "_commit_version").show()

# _change_type values:
# - insert: new row
# - update_preimage: old value before update
# - update_postimage: new value after update
# - delete: deleted row
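
To make the four `_change_type` values concrete, the sketch below derives them by comparing the table before and after the MERGE above. This is a plain-Python illustration only: Delta records changes as part of each commit rather than by diffing snapshots, and the function `change_rows` is hypothetical.

```python
def change_rows(before, after, key="id"):
    """Derive CDF-style change rows by comparing two snapshots of a table."""
    old = {row[key]: row for row in before}
    new = {row[key]: row for row in after}
    changes = []
    for k in sorted(old.keys() | new.keys()):
        if k not in old:
            changes.append({**new[k], "_change_type": "insert"})
        elif k not in new:
            changes.append({**old[k], "_change_type": "delete"})
        elif old[k] != new[k]:
            # An update produces two rows: the old image and the new image
            changes.append({**old[k], "_change_type": "update_preimage"})
            changes.append({**new[k], "_change_type": "update_postimage"})
    return changes


before = [
    {"id": 1, "name": "Alice", "amount": 100},
    {"id": 2, "name": "Bob", "amount": 200},
]
after = [
    {"id": 1, "name": "Alice", "amount": 150},
    {"id": 2, "name": "Bob", "amount": 200},
    {"id": 3, "name": "Charlie", "amount": 300},
]

for change in change_rows(before, after):
    print(change["id"], change["amount"], change["_change_type"])
```

Alice's update produces a preimage and a postimage row, Charlie appears as an insert, and Bob is absent because nothing changed for him. A real feed read from version 0 would be expected to also include the inserts from the initial write.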

## 7. Constraints and Data Quality

In [None]:
# Add CHECK constraint
spark.sql("""
    ALTER TABLE delta.`/tmp/delta/users`
    ADD CONSTRAINT age_positive CHECK (age > 0)
""")

print("Constraint added: age must be positive")

# This would fail:
# bad_data = [(10, "Invalid", -5, "2024-01-01")]
# bad_df = spark.createDataFrame(bad_data, ["id", "name", "age", "updated_date"])
# bad_df.write.format("delta").mode("append").save("/tmp/delta/users")

## 8. Table Properties and Metadata

In [None]:
# View table details
spark.sql("DESCRIBE DETAIL delta.`/tmp/delta/users`").show(vertical=True)

# View table properties
spark.sql("SHOW TBLPROPERTIES delta.`/tmp/delta/users`").show()

# Set table properties
spark.sql("""
    ALTER TABLE delta.`/tmp/delta/users`
    SET TBLPROPERTIES (
        'delta.logRetentionDuration' = '30 days',
        'delta.deletedFileRetentionDuration' = '7 days'
    )
""")

print("Table properties updated")

## Practice Exercises

### Exercise 1: Implement SCD Type 2
Create a Slowly Changing Dimension Type 2 table using MERGE.

In [None]:
# Your solution here
# TODO: Implement SCD Type 2 with effective_date and end_date columns

### Exercise 2: Optimize for Query Performance
Given a large table, optimize it for queries filtering by date and product_id.

In [None]:
# Your solution here
# TODO: Use OPTIMIZE with Z-ORDER on appropriate columns

## Summary

In this notebook, you learned:

- ✅ MERGE operations for UPSERT patterns
- ✅ Time travel for auditing and rollback
- ✅ OPTIMIZE and Z-ORDER for performance
- ✅ Schema evolution techniques
- ✅ VACUUM for cleanup
- ✅ Change Data Feed for CDC
- ✅ Constraints and data quality
- ✅ Table properties and metadata

## Next Steps

1. Practice MERGE patterns for real scenarios
2. Implement CDC pipelines with CDF
3. Learn about partition pruning
4. Study Delta Lake internals

## Additional Resources

- [Delta Lake Guide](https://docs.delta.io/)
- [Databricks Delta Lake](https://docs.databricks.com/delta/index.html)
- [Spark By Examples - Delta Lake](https://sparkbyexamples.com/spark/spark-delta-lake-tutorial/)