# Exercise: Delta Lake Operations - MERGE, Schema Evolution, and Time Travel

## Learning Objectives

In this exercise, you will practice:
- Creating and managing Delta tables
- Performing MERGE operations (upserts)
- Handling schema evolution
- Using time travel to query historical data
- Working with data updates and inserts

## Scenario

You are working with a product inventory system. You need to:
1. Create an initial Delta table with product data
2. Handle incoming updates and new products using MERGE
3. Evolve the schema when new attributes are added
4. Use time travel to audit changes

## Part 1: Initial Setup

Run the cells below to set up the initial environment and create sample data.

In [0]:
# Import required libraries
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
from pyspark.sql.functions import col
from delta.tables import DeltaTable

# Define the Delta table path
delta_path = "/Volumes/workspace/default/databricks_demo/exercise_products_delta"

# Clean up any existing table (for fresh start)
try:
    dbutils.fs.rm(delta_path, True)  # True = delete recursively
    print(f"Cleaned up existing data at {delta_path}")
except Exception:
    print(f"No existing data to clean up at {delta_path}")

print("Setup complete!")

Cleaned up existing data at /Volumes/workspace/default/databricks_demo/exercise_products_delta
Setup complete!


In [0]:
# Create initial product data
# Schema: product_id, product_name, price, quantity_in_stock
initial_data = [
    (101, "Laptop Pro", 1299.99, 50),
    (102, "Wireless Mouse", 29.99, 200),
    (103, "Mechanical Keyboard", 149.99, 75),
    (104, "USB-C Cable", 19.99, 300),
    (105, "Monitor 27inch", 399.99, 30)
]

initial_schema = StructType([
    StructField("product_id", IntegerType(), True),
    StructField("product_name", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("quantity_in_stock", IntegerType(), True)
])

initial_df = spark.createDataFrame(initial_data, initial_schema)

print("Initial product data:")
initial_df.show()
print(f"\nTotal products: {initial_df.count()}")

Initial product data:
+----------+-------------------+-------+-----------------+
|product_id|       product_name|  price|quantity_in_stock|
+----------+-------------------+-------+-----------------+
|       101|         Laptop Pro|1299.99|               50|
|       102|     Wireless Mouse|  29.99|              200|
|       103|Mechanical Keyboard| 149.99|               75|
|       104|        USB-C Cable|  19.99|              300|
|       105|     Monitor 27inch| 399.99|               30|
+----------+-------------------+-------+-----------------+


Total products: 5


## Exercise 1: Create Initial Delta Table

**Task:** Create a Delta table from the initial DataFrame and save it to `delta_path`.

**Requirements:**
- Write the DataFrame as Delta format
- Use overwrite mode so the cell can be re-run without errors (the path was cleaned up earlier, so no data is lost)
- Verify the table was created by reading it back

**Write your code below:**

In [0]:
# TODO: Write your code here
# Hint: Use .write.format("delta").mode("overwrite").save()

initial_df.write.format("delta").mode("overwrite").save(delta_path)


In [0]:
# Verification cell - Run this after completing Exercise 1
try:
    verify_df = spark.read.format("delta").load(delta_path)
    print("✅ Delta table created successfully!")
    print(f"\nTable contains {verify_df.count()} products:")
    verify_df.show()
    print("\nSchema:")
    verify_df.printSchema()
except Exception as e:
    print(f"❌ Error: {e}")
    print("Please complete Exercise 1 first.")

✅ Delta table created successfully!

Table contains 5 products:
+----------+-------------------+-------+-----------------+
|product_id|       product_name|  price|quantity_in_stock|
+----------+-------------------+-------+-----------------+
|       101|         Laptop Pro|1299.99|               50|
|       102|     Wireless Mouse|  29.99|              200|
|       103|Mechanical Keyboard| 149.99|               75|
|       104|        USB-C Cable|  19.99|              300|
|       105|     Monitor 27inch| 399.99|               30|
+----------+-------------------+-------+-----------------+


Schema:
root
 |-- product_id: integer (nullable = true)
 |-- product_name: string (nullable = true)
 |-- price: double (nullable = true)
 |-- quantity_in_stock: integer (nullable = true)



## Exercise 2: Perform MERGE Operation (Upsert)

**Task:** Use MERGE to handle updates and inserts.

**Scenario:** You receive a batch of product updates:
- Product ID 101: Price updated to 1199.99, quantity updated to 45
- Product ID 102: Quantity updated to 180
- Product ID 106: New product "Gaming Headset" with price 199.99 and quantity 60
- Product ID 107: New product "Webcam HD" with price 89.99 and quantity 100

**Requirements:**
- Create a DataFrame with the updates/new products
- Use DeltaTable.merge() to:
  - UPDATE existing records when product_id matches
  - INSERT new records when product_id doesn't exist
- Match on `product_id`
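
The matching rules above can be modeled in plain Python before reaching for the Delta API. The sketch below is not Delta Lake code: `upsert` is a hypothetical helper, and the table is assumed to be a dict keyed by `product_id`. It also assumes the batch carries full rows, so product 102 keeps its existing name and price while only its quantity changes.

```python
# Plain-Python model of upsert (MERGE) semantics; not the Delta Lake API.
# Rows are (product_name, price, quantity_in_stock) tuples keyed by product_id.

def upsert(table, updates):
    """Return a new table: matched keys are overwritten, unmatched keys are added."""
    merged = dict(table)          # copy so the input table is left untouched
    for product_id, row in updates.items():
        merged[product_id] = row  # one assignment covers both UPDATE and INSERT
    return merged


table = {
    101: ("Laptop Pro", 1299.99, 50),
    102: ("Wireless Mouse", 29.99, 200),
    103: ("Mechanical Keyboard", 149.99, 75),
    104: ("USB-C Cable", 19.99, 300),
    105: ("Monitor 27inch", 399.99, 30),
}

# The batch from the scenario, written as full rows (an assumption of this sketch).
updates = {
    101: ("Laptop Pro", 1199.99, 45),
    102: ("Wireless Mouse", 29.99, 180),
    106: ("Gaming Headset", 199.99, 60),
    107: ("Webcam HD", 89.99, 100),
}

result = upsert(table, updates)
updated_ids = sorted(set(table) & set(updates))     # keys that matched
inserted_ids = sorted(set(updates) - set(table))    # keys that did not match

print(len(result), updated_ids, inserted_ids)
```

Under these assumptions, two rows are updated and two are inserted, so the modeled table ends with seven products.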

**Write your code below:**

In [0]:
# Step 1: Create the batch of changes (reuses the original four-column schema)
# Note: this batch updates product 105 and inserts product 106, so the
# verification checks for products 101 and 107 will not print.
new_data = [
    (105, "CPU AND KEYBOARD", 599.99, 40),
    (106, "Monitor 56inch", 399.99, 60)
]

new_df = spark.createDataFrame(new_data, initial_schema)

# Step 2: Load the Delta table (DeltaTable was imported in the setup cell)
delta_table = DeltaTable.forPath(spark, delta_path)

# Step 3: MERGE on product_id: update matched rows, insert unmatched rows
delta_table.alias("existing") \
    .merge(
        new_df.alias("new"),
        "existing.product_id = new.product_id"
    ) \
    .whenMatchedUpdate(set={
        "price": "new.price",
        "quantity_in_stock": "new.quantity_in_stock",
        "product_name": "new.product_name"
    }) \
    .whenNotMatchedInsert(values={
        "product_id": "new.product_id",
        "product_name": "new.product_name",
        "price": "new.price",
        "quantity_in_stock": "new.quantity_in_stock"
    }) \
    .execute()

DataFrame[num_affected_rows: bigint, num_updated_rows: bigint, num_deleted_rows: bigint, num_inserted_rows: bigint]

In [0]:
# Verification cell - Run this after completing Exercise 2
try:
    verify_df = spark.read.format("delta").load(delta_path)
    print("✅ MERGE operation completed!")
    print(f"\nTable now contains {verify_df.count()} products:")
    verify_df.show()
    
    # Check specific updates
    print("\nVerifying updates:")
    product_101 = verify_df.filter(col("product_id") == 101).collect()
    if product_101 and product_101[0]["price"] == 1199.99:
        print("✅ Product 101 price updated correctly")
    
    product_106 = verify_df.filter(col("product_id") == 106).collect()
    if product_106:
        print("✅ Product 106 (new product) inserted correctly")
    
    product_107 = verify_df.filter(col("product_id") == 107).collect()
    if product_107:
        print("✅ Product 107 (new product) inserted correctly")
        
except Exception as e:
    print(f"❌ Error: {e}")
    print("Please complete Exercise 2 first.")

✅ MERGE operation completed!

Table now contains 6 products:
+----------+-------------------+-------+-----------------+
|product_id|       product_name|  price|quantity_in_stock|
+----------+-------------------+-------+-----------------+
|       101|         Laptop Pro|1299.99|               50|
|       102|     Wireless Mouse|  29.99|              200|
|       103|Mechanical Keyboard| 149.99|               75|
|       104|        USB-C Cable|  19.99|              300|
|       105|   CPU AND KEYBOARD| 599.99|               40|
|       106|     Monitor 56inch| 399.99|               60|
+----------+-------------------+-------+-----------------+


Verifying updates:
✅ Product 106 (new product) inserted correctly


## Exercise 3: Time Travel - Query Historical Version

**Task:** Use time travel to view the state of the table before the MERGE operation.

**Requirements:**
- Get the history of the Delta table
- Query version 0 (the initial version before MERGE)
- Compare it with the current version
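
As a rough mental model of the steps above, the sketch below keeps one snapshot per commit and compares two of them. It is plain Python, not the Delta Lake API: `VersionedTable` and `diff` are hypothetical names, and the prices are a reduced sample. Delta itself records changes in a transaction log rather than storing full copies, so only the read interface is mirrored here.

```python
# Plain-Python model of versioned snapshots; not the Delta Lake API.
# Each commit appends a full snapshot, so older versions stay readable.

class VersionedTable:
    def __init__(self):
        self._versions = []  # index in this list == version number

    def commit(self, rows):
        """Store a copy of the rows as the next version and return its number."""
        self._versions.append(dict(rows))
        return len(self._versions) - 1

    def read(self, version_as_of=None):
        """Return the latest snapshot, or the one at the requested version."""
        index = -1 if version_as_of is None else version_as_of
        return dict(self._versions[index])


def diff(old, new):
    """Classify keys as added, removed, or changed between two snapshots."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return added, removed, changed


products = VersionedTable()
v0 = products.commit({101: 1299.99, 105: 399.99})                # initial write
v1 = products.commit({101: 1299.99, 105: 599.99, 106: 399.99})   # after an upsert

added, removed, changed = diff(products.read(version_as_of=0), products.read())
print(added, removed, changed)
```

Here product 106 shows up as added and product 105 as changed, which is the same three-way split (added, removed, changed) that anti-joins and an inner join on `product_id` produce on DataFrames.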

**Write your code below:**

In [0]:
# TODO: Write your code here
# Step 1: Get table history
delta_table = DeltaTable.forPath(spark, delta_path)
history = delta_table.history()
display(history)
# Step 2: Query version 0 (initial version)

print("\nQuerying version 0 (initial version):")
version_0 = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)
display(version_0)
# Step 3: Compare with current version
current_df = spark.read.format("delta").load(delta_path)
print("Version 0 count:", version_0.count())
print("Current version count:", current_df.count())
# Version 0 schema
schema_v0 = version_0.schema

# Current version schema
schema_current = current_df.schema

cols_v0 = {field.name for field in schema_v0}
cols_current = {field.name for field in schema_current}

added_columns = cols_current - cols_v0
print("Added columns:", added_columns)

removed_columns = cols_v0 - cols_current
print("Removed columns:", removed_columns)

# Rows whose product_id exists now but not in version 0 (inserted since then)
added_rows = current_df.join(
    version_0,
    on="product_id",
    how="left_anti"
)

display(added_rows)

# Rows whose product_id existed in version 0 but not now (deleted since then)
deleted_rows = version_0.join(
    current_df,
    on="product_id",
    how="left_anti"
)

display(deleted_rows)

# Rows present in both versions whose name, price, or quantity changed
# (col was already imported in the setup cell)
updated_rows = current_df.alias("curr") \
    .join(version_0.alias("old"), on="product_id") \
    .filter(
        (col("curr.product_name") != col("old.product_name")) |
        (col("curr.price") != col("old.price")) |
        (col("curr.quantity_in_stock") != col("old.quantity_in_stock"))
    )

display(updated_rows)




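+----------+----------------+-------+-----------------+-----------+----------------+--------------+-------+-------------------+
|product_id|    product_name|  price|quantity_in_stock|   category|   supplier_name|product_name.1|price.1|quantity_in_stock.1|
+----------+----------------+-------+-----------------+-----------+----------------+--------------+-------+-------------------+
|       101|      Laptop Pro|1199.99|               45|  Computers|TechSupplier Inc|    Laptop Pro|1299.99|                 50|
|       102|  Wireless Mouse|  29.99|              180|Accessories|TechSupplier Inc|Wireless Mouse|  29.99|                200|
|       105|CPU AND KEYBOARD| 599.99|               40|       NULL|            NULL|Monitor 27inch| 399.99|                 30|
+----------+----------------+-------+-----------------+-----------+----------------+--------------+-------+-------------------+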


In [0]:
# Verification cell - Run this after completing Exercise 3
try:
    delta_table = DeltaTable.forPath(spark, delta_path)
    history = delta_table.history()
    
    if history.count() > 0:
        print("✅ History retrieved successfully!")
        print(f"\nTotal versions: {history.count()}")
        
        version_0 = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)
        current = spark.read.format("delta").load(delta_path)
        
        print(f"\nVersion 0 had {version_0.count()} products")
        print(f"Current version has {current.count()} products")
        
        if current.count() > version_0.count():
            print("✅ Time travel working correctly - new products detected!")
    else:
        print("❌ No history found")
        
except Exception as e:
    print(f"❌ Error: {e}")
    print("Please complete Exercise 3 first.")

✅ History retrieved successfully!

Total versions: 3

Version 0 had 5 products
Current version has 6 products
✅ Time travel working correctly - new products detected!


## Part 2: Schema Evolution Setup

**Important:** Complete Exercises 1-3 before running the cells below!

The following cells simulate a real-world scenario where:
1. New data arrives with additional columns (category, supplier_name)
2. You need to evolve the schema to accommodate these new fields
3. You'll need to handle MERGE operations with the evolved schema

Run the setup cells below to prepare for the schema evolution exercises.

In [0]:
# Simulate incoming data with new schema (includes category and supplier_name)
# This represents data from a new source system that tracks additional attributes

new_schema_data = [
    (108, "SSD 1TB", 129.99, 150, "Storage", "TechSupplier Inc"),
    (109, "RAM 16GB", 89.99, 200, "Memory", "TechSupplier Inc"),
    (110, "Graphics Card RTX", 599.99, 25, "Graphics", "GamingTech Ltd")
]

evolved_schema = StructType([
    StructField("product_id", IntegerType(), True),
    StructField("product_name", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("quantity_in_stock", IntegerType(), True),
    StructField("category", StringType(), True),  # New column
    StructField("supplier_name", StringType(), True)  # New column
])

new_schema_df = spark.createDataFrame(new_schema_data, evolved_schema)

print("New data with evolved schema (includes category and supplier_name):")
new_schema_df.show()
print("\nNew schema:")
new_schema_df.printSchema()
print("\n⚠️  Note: This data has 2 additional columns that don't exist in the current Delta table!")

New data with evolved schema (includes category and supplier_name):
+----------+-----------------+------+-----------------+--------+----------------+
|product_id|     product_name| price|quantity_in_stock|category|   supplier_name|
+----------+-----------------+------+-----------------+--------+----------------+
|       108|          SSD 1TB|129.99|              150| Storage|TechSupplier Inc|
|       109|         RAM 16GB| 89.99|              200|  Memory|TechSupplier Inc|
|       110|Graphics Card RTX|599.99|               25|Graphics|  GamingTech Ltd|
+----------+-----------------+------+-----------------+--------+----------------+


New schema:
root
 |-- product_id: integer (nullable = true)
 |-- product_name: string (nullable = true)
 |-- price: double (nullable = true)
 |-- quantity_in_stock: integer (nullable = true)
 |-- category: string (nullable = true)
 |-- supplier_name: string (nullable = true)


⚠️  Note: This data has 2 additional columns that don't exist in the current Delta table!

## Exercise 4: Schema Evolution with mergeSchema

**Task:** Evolve the Delta table schema to accommodate the new columns (category and supplier_name).

**Requirements:**
- Append the new data to the Delta table
- Use the `mergeSchema` option to allow schema evolution
- Verify that existing records have NULL values for the new columns
- Verify that new records have values for all columns including the new ones
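
The two verification points above follow from how additive schema evolution behaves: the schema widens, and rows written before the change read back with NULL in the new columns. The sketch below models that in plain Python. It is not the Delta Lake API; `append_with_merged_schema` is a hypothetical helper, rows are assumed to be dicts, and `None` stands in for NULL.

```python
# Plain-Python model of additive schema evolution; not the Delta Lake API.

def append_with_merged_schema(columns, rows, new_columns, new_rows):
    """Append new_rows, widening the schema with any columns it lacks.

    Rows are dicts. Columns missing from a row are filled with None,
    which plays the role of NULL.
    """
    merged_columns = list(columns)
    for name in new_columns:
        if name not in merged_columns:
            merged_columns.append(name)  # new columns go at the end

    merged_rows = [
        {name: row.get(name) for name in merged_columns}
        for row in list(rows) + list(new_rows)
    ]
    return merged_columns, merged_rows


columns = ["product_id", "product_name"]
rows = [{"product_id": 101, "product_name": "Laptop Pro"}]

new_columns = ["product_id", "product_name", "category", "supplier_name"]
new_rows = [{
    "product_id": 108,
    "product_name": "SSD 1TB",
    "category": "Storage",
    "supplier_name": "TechSupplier Inc",
}]

merged_columns, merged_rows = append_with_merged_schema(
    columns, rows, new_columns, new_rows
)
print(merged_columns)
print(merged_rows[0]["category"], merged_rows[1]["category"])
```

In this model the old row for product 101 reports `None` for `category`, while the new row for product 108 reports `Storage`.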

**Write your code below:**

In [0]:
# TODO: Write your code here
# Hint: Use .option("mergeSchema", "true") when writing

new_schema_df.write.format("delta").mode("append").option("mergeSchema", "true").save(delta_path)


In [0]:
# Verification cell - Run this after completing Exercise 4
try:
    verify_df = spark.read.format("delta").load(delta_path)
    
    print("✅ Schema evolution completed!")
    print("\nUpdated schema:")
    verify_df.printSchema()
    
    print("\nSample data (showing new columns):")
    verify_df.select("product_id", "product_name", "category", "supplier_name").show(10)
    
    # Check if schema has new columns
    columns = verify_df.columns
    if "category" in columns and "supplier_name" in columns:
        print("\n✅ New columns 'category' and 'supplier_name' added successfully!")
        
        # Check that old records have NULL for new columns
        old_record = verify_df.filter(col("product_id") == 101).collect()[0]
        if old_record["category"] is None:
            print("✅ Old records correctly have NULL for new columns")
        
        # Check that new records have values
        new_record = verify_df.filter(col("product_id") == 108).collect()[0]
        if new_record["category"] is not None:
            print("✅ New records correctly have values for new columns")
    else:
        print("❌ Schema evolution did not work as expected")
        
except Exception as e:
    print(f"❌ Error: {e}")
    print("Please complete Exercise 4 first.")

✅ Schema evolution completed!

Updated schema:
root
 |-- product_id: integer (nullable = true)
 |-- product_name: string (nullable = true)
 |-- price: double (nullable = true)
 |-- quantity_in_stock: integer (nullable = true)
 |-- category: string (nullable = true)
 |-- supplier_name: string (nullable = true)


Sample data (showing new columns):
+----------+-------------------+--------+----------------+
|product_id|       product_name|category|   supplier_name|
+----------+-------------------+--------+----------------+
|       108|            SSD 1TB| Storage|TechSupplier Inc|
|       109|           RAM 16GB|  Memory|TechSupplier Inc|
|       110|  Graphics Card RTX|Graphics|  GamingTech Ltd|
|       101|         Laptop Pro|    NULL|            NULL|
|       102|     Wireless Mouse|    NULL|            NULL|
|       103|Mechanical Keyboard|    NULL|            NULL|
|       104|        USB-C Cable|    NULL|            NULL|
|       105|   CPU AND KEYBOARD|    NULL|            NULL|
|  

## Part 3: MERGE with Evolved Schema

Now that the schema has evolved, you'll receive updates that include the new columns.
Run the setup cell below to prepare for the final exercise.

In [0]:
# Simulate updates with evolved schema
# Some products need category and supplier information added
# Some products need price/quantity updates

updates_with_schema = [
    # Update existing product with new columns
    (101, "Laptop Pro", 1199.99, 45, "Computers", "TechSupplier Inc"),  # Update with category
    (102, "Wireless Mouse", 29.99, 180, "Accessories", "TechSupplier Inc"),  # Update with category
    # New product with full schema
    (111, "USB Hub", 34.99, 120, "Accessories", "TechSupplier Inc")
]

updates_df = spark.createDataFrame(updates_with_schema, evolved_schema)

print("Updates with evolved schema:")
updates_df.show()
print("\n⚠️  Note: These updates include category and supplier_name for existing products!")

Updates with evolved schema:
+----------+--------------+-------+-----------------+-----------+----------------+
|product_id|  product_name|  price|quantity_in_stock|   category|   supplier_name|
+----------+--------------+-------+-----------------+-----------+----------------+
|       101|    Laptop Pro|1199.99|               45|  Computers|TechSupplier Inc|
|       102|Wireless Mouse|  29.99|              180|Accessories|TechSupplier Inc|
|       111|       USB Hub|  34.99|              120|Accessories|TechSupplier Inc|
+----------+--------------+-------+-----------------+-----------+----------------+


⚠️  Note: These updates include category and supplier_name for existing products!


## Exercise 5: MERGE with Evolved Schema

**Task:** Perform a MERGE operation that updates existing products with category/supplier information and inserts new products.

**Requirements:**
- Use MERGE to update existing products (101, 102) with their category and supplier_name
- Insert new product (111)
- Match on `product_id`
- Update all columns when matched
- Insert all columns when not matched
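
These requirements call for MERGE rather than an append because an append never matches on a key. The sketch below contrasts the two in plain Python. It is not the Delta Lake API, `duplicate_ids` is a hypothetical helper, and the rows are reduced to `(product_id, quantity_in_stock, category)` tuples for brevity.

```python
# Plain-Python model contrasting append with upsert; not the Delta Lake API.
# Rows are (product_id, quantity_in_stock, category) tuples.

from collections import Counter

existing = [(101, 50, None), (102, 200, None), (103, 75, None)]
updates = [(101, 45, "Computers"), (102, 180, "Accessories"), (111, 120, "Accessories")]

# Append: every incoming row is added, even if its key is already present.
appended = existing + updates

# Upsert: incoming rows replace rows with the same key, otherwise they are added.
by_id = {row[0]: row for row in existing}
by_id.update({row[0]: row for row in updates})
merged = [by_id[key] for key in sorted(by_id)]


def duplicate_ids(rows):
    """Return the product_ids that occur more than once."""
    counts = Counter(row[0] for row in rows)
    return sorted(key for key, count in counts.items() if count > 1)


print(len(appended), duplicate_ids(appended))
print(len(merged), duplicate_ids(merged))
```

The appended list holds six rows with products 101 and 102 duplicated, while the upserted list holds four rows with no duplicate keys. A row-count check alone would not catch the difference in intent, so checking for duplicate keys is a useful extra verification.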

**Write your code below:**

In [0]:
# TODO: Write your code here
# Perform MERGE with evolved schema
delta_table = DeltaTable.forPath(spark, delta_path)

# Match on product_id: update every column of matched rows, insert unmatched rows.
# A plain append would add second copies of products 101 and 102 instead.
# The result is assigned so the cell does not echo a return value.
merge_result = (
    delta_table.alias("existing")
    .merge(updates_df.alias("new"), "existing.product_id = new.product_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

In [0]:
# Verification cell - Run this after completing Exercise 5
try:
    verify_df = spark.read.format("delta").load(delta_path)
    
    print("✅ MERGE with evolved schema completed!")
    print(f"\nTotal products: {verify_df.count()}")
    
    # Check product 101 has category now
    product_101 = verify_df.filter(col("product_id") == 101).collect()[0]
    if product_101["category"] == "Computers":
        print("✅ Product 101 updated with category correctly")
    
    # Check product 102 has category now
    product_102 = verify_df.filter(col("product_id") == 102).collect()[0]
    if product_102["category"] == "Accessories":
        print("✅ Product 102 updated with category correctly")
    
    # Check new product 111
    product_111 = verify_df.filter(col("product_id") == 111).collect()
    if product_111:
        print("✅ Product 111 inserted correctly")
    
    print("\nFinal table state:")
    verify_df.orderBy("product_id").show(15)
    
except Exception as e:
    print(f"❌ Error: {e}")
    print("Please complete Exercise 5 first.")

✅ MERGE with evolved schema completed!

Total products: 10
✅ Product 101 updated with category correctly
✅ Product 102 updated with category correctly
✅ Product 111 inserted correctly

Final table state:
+----------+-------------------+-------+-----------------+-----------+----------------+
|product_id|       product_name|  price|quantity_in_stock|   category|   supplier_name|
+----------+-------------------+-------+-----------------+-----------+----------------+
|       101|         Laptop Pro|1199.99|               45|  Computers|TechSupplier Inc|
|       102|     Wireless Mouse|  29.99|              180|Accessories|TechSupplier Inc|
|       103|Mechanical Keyboard| 149.99|               75|       NULL|            NULL|
|       104|        USB-C Cable|  19.99|              300|       NULL|            NULL|
|   

## Summary and Reflection

Congratulations on completing the Delta Lake operations exercise! 🎉

### What You Practiced:

✅ **Delta Table Creation** - Created and managed Delta tables

✅ **MERGE Operations** - Performed upserts (update existing, insert new)

✅ **Time Travel** - Queried historical versions of data

✅ **Schema Evolution** - Evolved table schema to accommodate new columns

✅ **Complex MERGE** - Handled MERGE operations with evolved schemas

### Key Takeaways:

1. **Delta Lake provides ACID transactions** - All operations are atomic and consistent

2. **MERGE is powerful** - It handles both updates and inserts in a single operation

3. **Schema evolution is flexible** - You can add new columns without breaking existing data

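4. **Time travel enables auditing** - You can query previous versions of your data for as long as their data files and log entries are retained (VACUUM and log retention settings limit how far back you can go)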

### Next Steps:

- Practice with larger datasets
- Explore Delta table optimization (OPTIMIZE, Z-ORDER)
- Learn about Delta table maintenance (VACUUM)
- Experiment with partitioning strategies