# Exercise: Advanced Delta Lake Operations with Analytics

## Learning Objectives

In this exercise, you will practice:
- Creating and managing multiple Delta tables
- Performing MERGE operations with complex scenarios
- Handling schema evolution across multiple tables
- Using time travel for data auditing
- Performing joins between Delta tables
- Using aggregations and window functions
- Creating pivot tables for analytics
- Building a complete data pipeline with Delta Lake

## Scenario: E-Commerce Sales Analytics System

You are building a sales analytics system for an e-commerce platform. The system needs to:
1. Track customer orders and product sales
2. Handle updates to order status and product information
3. Evolve schemas as new attributes are added
4. Generate analytics reports using joins, aggregations, and window functions
5. Create pivot tables for business intelligence

## Data Model

- **customers_delta**: Customer master data
- **products_delta**: Product catalog
- **orders_delta**: Order transactions

## Part 1: Initial Setup

Run the cells below to set up the initial environment and create sample data for all three tables.

In [0]:
# Import required libraries
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, DateType, TimestampType
# Note: sum and count imported here shadow Python's built-in functions of the same name
from pyspark.sql.functions import col, sum, count, avg, row_number, rank, dense_rank, window, year, month, when
from pyspark.sql.window import Window
from delta.tables import DeltaTable
from datetime import datetime, date

# Define Delta table paths
base_path = "/Volumes/workspace/default/databricks_demo/analytics_exercise"
customers_path = f"{base_path}/customers_delta"
products_path = f"{base_path}/products_delta"
orders_path = f"{base_path}/orders_delta"

# Clean up any existing tables (for fresh start)
for path in [customers_path, products_path, orders_path]:
    try:
        dbutils.fs.rm(path, True)
        print(f"Cleaned up existing data at {path}")
    except Exception:
        print(f"No existing data to clean up at {path}")

print("\n✅ Setup complete!")

In [0]:
# Create initial customers data
customers_data = [
    (1, "John Smith", "john.smith@email.com", "New York", "USA"),
    (2, "Emma Johnson", "emma.j@email.com", "London", "UK"),
    (3, "Michael Chen", "m.chen@email.com", "San Francisco", "USA"),
    (4, "Sarah Williams", "sarah.w@email.com", "Toronto", "Canada"),
    (5, "David Brown", "david.b@email.com", "Sydney", "Australia")
]

customers_schema = StructType([
    StructField("customer_id", IntegerType(), True),
    StructField("customer_name", StringType(), True),
    StructField("email", StringType(), True),
    StructField("city", StringType(), True),
    StructField("country", StringType(), True)
])

customers_df = spark.createDataFrame(customers_data, customers_schema)

print("Initial customers data:")
customers_df.show()
print(f"\nTotal customers: {customers_df.count()}")

In [0]:
# Create initial products data
products_data = [
    (101, "Laptop Pro", "Electronics", 1299.99, 50),
    (102, "Wireless Mouse", "Accessories", 29.99, 200),
    (103, "Mechanical Keyboard", "Accessories", 149.99, 75),
    (104, "USB-C Cable", "Accessories", 19.99, 300),
    (105, "Monitor 27inch", "Electronics", 399.99, 30),
    (106, "Gaming Headset", "Electronics", 199.99, 60)
]

products_schema = StructType([
    StructField("product_id", IntegerType(), True),
    StructField("product_name", StringType(), True),
    StructField("category", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("stock_quantity", IntegerType(), True)
])

products_df = spark.createDataFrame(products_data, products_schema)

print("Initial products data:")
products_df.show()
print(f"\nTotal products: {products_df.count()}")

In [0]:
# Create initial orders data
orders_data = [
    (1001, 1, 101, date(2024, 1, 15), 2, 1299.99, "Completed"),
    (1002, 2, 102, date(2024, 1, 16), 3, 29.99, "Completed"),
    (1003, 1, 103, date(2024, 1, 17), 1, 149.99, "Completed"),
    (1004, 3, 105, date(2024, 1, 18), 1, 399.99, "Processing"),
    (1005, 4, 104, date(2024, 1, 19), 5, 19.99, "Completed"),
    (1006, 2, 106, date(2024, 1, 20), 1, 199.99, "Processing"),
    (1007, 5, 101, date(2024, 1, 21), 1, 1299.99, "Pending")
]

orders_schema = StructType([
    StructField("order_id", IntegerType(), True),
    StructField("customer_id", IntegerType(), True),
    StructField("product_id", IntegerType(), True),
    StructField("order_date", DateType(), True),
    StructField("quantity", IntegerType(), True),
    StructField("unit_price", DoubleType(), True),
    StructField("status", StringType(), True)
])

orders_df = spark.createDataFrame(orders_data, orders_schema)

print("Initial orders data:")
orders_df.show()
print(f"\nTotal orders: {orders_df.count()}")

## Exercise 1: Create Delta Tables for All Entities

**Task:** Create Delta tables for customers, products, and orders.

**Requirements:**
- Create three separate Delta tables
- Use overwrite mode for initial creation
- Verify all tables are created successfully

**Write your code below:**

In [0]:
# TODO: Write your code here
# Create Delta tables for customers, products, and orders
# Hint: Use .write.format("delta").mode("overwrite").save() for each DataFrame
# Write to the paths defined in Part 1 so the verification cell can load them
customers_df.write.format("delta").mode("overwrite").save(customers_path)
products_df.write.format("delta").mode("overwrite").save(products_path)
orders_df.write.format("delta").mode("overwrite").save(orders_path)




In [0]:
# Verification cell - Run this after completing Exercise 1
try:
    customers_delta = spark.read.format("delta").load(customers_path)
    products_delta = spark.read.format("delta").load(products_path)
    orders_delta = spark.read.format("delta").load(orders_path)
    
    print("✅ All Delta tables created successfully!")
    print(f"\nCustomers: {customers_delta.count()} records")
    print(f"Products: {products_delta.count()} records")
    print(f"Orders: {orders_delta.count()} records")
    
    print("\nCustomers table sample:")
    customers_delta.show(3)
    
except Exception as e:
    print(f"❌ Error: {e}")
    print("Please complete Exercise 1 first.")

## Exercise 2: Perform MERGE Operations on Orders

**Task:** Handle order updates and new orders using MERGE.

**Scenario:** You receive a batch of order updates:
- Order 1004: Status changed from "Processing" to "Completed"
- Order 1006: Status changed from "Processing" to "Completed"
- Order 1007: Status changed from "Pending" to "Processing"
- Order 1008: New order - Customer 3, Product 102, Date 2024-01-22, Quantity 2, Price 29.99, Status "Pending"
- Order 1009: New order - Customer 1, Product 105, Date 2024-01-22, Quantity 1, Price 399.99, Status "Pending"

**Requirements:**
- Create a DataFrame with the updates/new orders
- Use MERGE to update existing orders and insert new ones
- Match on `order_id`
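
The upsert behavior described in these requirements can be sketched in plain Python, without Spark. This is only an illustration of what matching on a key means: `merge_upsert` is a hypothetical helper, not part of the Delta Lake API, and the rows are cut down to two columns for brevity.

```python
def merge_upsert(target, source, key):
    """Return merged rows: a source row replaces the target row with the
    same key (update); a source row with an unseen key is added (insert)."""
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        # Matched key: overwrite every column. Unmatched key: insert the row.
        merged[row[key]] = dict(row)
    return [merged[k] for k in sorted(merged)]

# Hypothetical, simplified rows (the real orders table has more columns)
target = [
    {"order_id": 1004, "status": "Processing"},
    {"order_id": 1007, "status": "Pending"},
]
source = [
    {"order_id": 1004, "status": "Completed"},  # existing key: update
    {"order_id": 1008, "status": "Pending"},    # new key: insert
]

result = merge_upsert(target, source, "order_id")
for row in result:
    print(row)
```

Target rows whose key does not appear in the source (order 1007 here) are left unchanged, which is also what you should see in the orders table after your MERGE.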

**Write your code below:**

In [0]:
# TODO: Write your code here
# Step 1: Create updates DataFrame for orders


# Step 2: Perform MERGE operation on orders_delta
# Hint: Use DeltaTable.forPath() and .merge() with whenMatchedUpdateAll() and whenNotMatchedInsertAll()


In [0]:
# Verification cell - Run this after completing Exercise 2
try:
    orders_delta = spark.read.format("delta").load(orders_path)
    
    print("✅ MERGE operation completed!")
    print(f"\nTotal orders: {orders_delta.count()}")
    
    # Check specific updates
    print("\nVerifying updates:")
    order_1004 = orders_delta.filter(col("order_id") == 1004).collect()
    if order_1004 and order_1004[0]["status"] == "Completed":
        print("✅ Order 1004 status updated correctly")
    
    order_1008 = orders_delta.filter(col("order_id") == 1008).collect()
    if order_1008:
        print("✅ Order 1008 (new order) inserted correctly")
    
    print("\nUpdated orders table:")
    orders_delta.orderBy("order_id").show()
        
except Exception as e:
    print(f"❌ Error: {e}")
    print("Please complete Exercise 2 first.")

## Exercise 3: Join Operations - Create Enriched Order View

**Task:** Create an enriched view of orders by joining with customers and products tables.

**Requirements:**
- Join orders with customers on `customer_id`
- Join the result with products on `product_id`
- Select: order_id, customer_name, email, product_name, category, order_date, quantity, unit_price, status
- Calculate total_amount as quantity * unit_price
- Show only completed orders

**Write your code below:**

In [0]:
# TODO: Write your code here
# Step 1: Read all three Delta tables


# Step 2: Join orders with customers


# Step 3: Join with products and select required columns
# Hint: Use .join() with join condition, then .select() and .withColumn() for calculated fields


In [0]:
# Verification cell - Run this after completing Exercise 3
try:
    orders_delta = spark.read.format("delta").load(orders_path)
    customers_delta = spark.read.format("delta").load(customers_path)
    products_delta = spark.read.format("delta").load(products_path)
    
    # Expected join result
    enriched = orders_delta.join(customers_delta, "customer_id") \
                           .join(products_delta, "product_id") \
                           .withColumn("total_amount", col("quantity") * col("unit_price")) \
                           .filter(col("status") == "Completed") \
                           .select("order_id", "customer_name", "email", "product_name", 
                                  "category", "order_date", "quantity", "unit_price", 
                                  "total_amount", "status")
    
    print("✅ Join operation completed!")
    print(f"\nCompleted orders: {enriched.count()}")
    print("\nEnriched order view:")
    enriched.orderBy("order_id").show()
        
except Exception as e:
    print(f"❌ Error: {e}")
    print("Please complete Exercise 3 first.")

## Exercise 4: Aggregations - Sales Summary by Category

**Task:** Calculate sales aggregations by product category.

**Requirements:**
- Join orders with products
- Filter for completed orders only
- Calculate total_amount for each order (quantity * unit_price)
- Group by category
- Calculate: total_sales, order_count, avg_order_value, total_quantity_sold
- Order by total_sales descending
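
If the grouping logic is unclear, the following plain-Python sketch mirrors what the group-by and aggregation steps compute. The three rows are made-up values chosen for easy arithmetic, not the exercise data, and the code does not use Spark.

```python
from collections import defaultdict

# Hypothetical completed orders: (category, quantity, unit_price)
orders = [
    ("Electronics", 2, 100.0),
    ("Accessories", 3, 10.0),
    ("Accessories", 1, 50.0),
]

# Accumulate per-category totals, similar in spirit to groupBy("category").agg(...)
groups = defaultdict(lambda: {"total_sales": 0.0, "order_count": 0, "total_quantity_sold": 0})
for category, quantity, unit_price in orders:
    g = groups[category]
    g["total_sales"] += quantity * unit_price
    g["order_count"] += 1
    g["total_quantity_sold"] += quantity

# Derive the average order value, then sort by total_sales descending
summary = sorted(
    (
        {"category": c, **g, "avg_order_value": g["total_sales"] / g["order_count"]}
        for c, g in groups.items()
    ),
    key=lambda row: row["total_sales"],
    reverse=True,
)
for row in summary:
    print(row)
```

Note that `avg_order_value` is total sales divided by the number of orders in the category, so it is an average per order, not per unit sold.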

**Write your code below:**

In [0]:
# TODO: Write your code here
# Step 1: Join orders with products and calculate total_amount


# Step 2: Group by category and calculate aggregations
# Hint: Use .groupBy() with .agg() for sum(), count(), avg(), etc.


In [0]:
# Verification cell - Run this after completing Exercise 4
try:
    orders_delta = spark.read.format("delta").load(orders_path)
    products_delta = spark.read.format("delta").load(products_path)
    
    # Expected aggregation result
    sales_summary = orders_delta.join(products_delta, "product_id") \
                                .filter(col("status") == "Completed") \
                                .withColumn("total_amount", col("quantity") * col("unit_price")) \
                                .groupBy("category") \
                                .agg(
                                    sum("total_amount").alias("total_sales"),
                                    count("order_id").alias("order_count"),
                                    (sum("total_amount") / count("order_id")).alias("avg_order_value"),
                                    sum("quantity").alias("total_quantity_sold")
                                ) \
                                .orderBy(col("total_sales").desc())
    
    print("✅ Aggregation completed!")
    print("\nSales summary by category:")
    sales_summary.show()
        
except Exception as e:
    print(f"❌ Error: {e}")
    print("Please complete Exercise 4 first.")

## Exercise 5: Window Functions - Customer Ranking and Running Totals

**Task:** Use window functions to calculate customer rankings and running totals.

**Requirements:**
- Join orders with customers
- Filter for completed orders
- Calculate total_amount per order
- For each customer, calculate:
  - Total sales (sum of all orders)
  - Rank of customers by total sales (use rank())
  - Running total of sales ordered by order_date
- Show customer_id, customer_name, order_id, order_date, total_amount, customer_total_sales, sales_rank, running_total
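
Before writing the window specifications, it may help to see how `rank()`, `dense_rank()`, and a running total behave on a handful of values. The sketch below is plain Python with hypothetical helper names and made-up numbers; it imitates the SQL semantics and is not the PySpark API.

```python
from itertools import accumulate

def rank_desc(values):
    """SQL-style rank(): ties share a rank and the following rank skips ahead."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

def dense_rank_desc(values):
    """SQL-style dense_rank(): ties share a rank and no ranks are skipped."""
    distinct = sorted(set(values), reverse=True)
    return [distinct.index(v) + 1 for v in values]

# Hypothetical customer totals, one entry per order row (the first two rows tie)
totals = [300, 300, 120, 50]
ranks = rank_desc(totals)
dense_ranks = dense_rank_desc(totals)
print(ranks)        # [1, 1, 3, 4]
print(dense_ranks)  # [1, 1, 2, 3]

# Running total for one customer's orders, already sorted by order_date
amounts_by_date = [100, 50, 25]
running_total = list(accumulate(amounts_by_date))
print(running_total)  # [100, 150, 175]
```

Because the ranking in this exercise runs over order rows rather than one row per customer, a customer with two completed orders contributes two tied rows, so `rank()` leaves a gap before the next customer while `dense_rank()` does not.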

**Write your code below:**

In [0]:
# TODO: Write your code here
# Step 1: Join orders with customers and calculate total_amount


# Step 2: Use window functions for ranking and running totals
# Hint: 
# - Use Window.partitionBy() for customer-level aggregations
# - Use Window.orderBy() for running totals
# - Use rank() or dense_rank() for rankings
# - Use sum().over() for running totals


In [0]:
# Verification cell - Run this after completing Exercise 5
try:
    orders_delta = spark.read.format("delta").load(orders_path)
    customers_delta = spark.read.format("delta").load(customers_path)
    
    # Expected window function result
    window_spec_customer = Window.partitionBy("customer_id")
    window_spec_rank = Window.orderBy(col("customer_total_sales").desc())
    window_spec_running = Window.partitionBy("customer_id").orderBy("order_date")
    
    result = orders_delta.join(customers_delta, "customer_id") \
                         .filter(col("status") == "Completed") \
                         .withColumn("total_amount", col("quantity") * col("unit_price")) \
                         .withColumn("customer_total_sales", sum("total_amount").over(window_spec_customer)) \
                         .withColumn("sales_rank", rank().over(window_spec_rank)) \
                         .withColumn("running_total", sum("total_amount").over(window_spec_running)) \
                         .select("customer_id", "customer_name", "order_id", "order_date", 
                                "total_amount", "customer_total_sales", "sales_rank", "running_total") \
                         .orderBy("customer_id", "order_date")
    
    print("✅ Window functions completed!")
    print("\nCustomer rankings and running totals:")
    result.show(20)
        
except Exception as e:
    print(f"❌ Error: {e}")
    print("Please complete Exercise 5 first.")

## Part 2: Schema Evolution Setup

**Important:** Complete Exercises 1-5 before running the cells below!

The following cells simulate a real-world scenario where:
1. New attributes are added to the orders table (discount_percentage, shipping_cost)
2. You need to evolve the schema to accommodate these new fields
3. You'll need to update existing orders and handle new orders with the evolved schema

Run the setup cells below to prepare for the schema evolution exercises.

In [0]:
# Simulate incoming orders with new schema (includes discount_percentage and shipping_cost)
# This represents data from an updated order system that tracks discounts and shipping

new_orders_data = [
    (1010, 2, 103, date(2024, 1, 23), 2, 149.99, "Pending", 10.0, 5.99),  # 10% discount
    (1011, 3, 104, date(2024, 1, 23), 10, 19.99, "Pending", 15.0, 8.99),  # 15% discount
    (1012, 4, 106, date(2024, 1, 24), 1, 199.99, "Pending", 0.0, 12.99)  # No discount
]

evolved_orders_schema = StructType([
    StructField("order_id", IntegerType(), True),
    StructField("customer_id", IntegerType(), True),
    StructField("product_id", IntegerType(), True),
    StructField("order_date", DateType(), True),
    StructField("quantity", IntegerType(), True),
    StructField("unit_price", DoubleType(), True),
    StructField("status", StringType(), True),
    StructField("discount_percentage", DoubleType(), True),  # New column
    StructField("shipping_cost", DoubleType(), True)  # New column
])

new_orders_df = spark.createDataFrame(new_orders_data, evolved_orders_schema)

print("New orders with evolved schema (includes discount_percentage and shipping_cost):")
new_orders_df.show()
print("\nNew schema:")
new_orders_df.printSchema()
print("\n⚠️  Note: This data has 2 additional columns that don't exist in the current orders Delta table!")

## Exercise 6: Schema Evolution with mergeSchema

**Task:** Evolve the orders Delta table schema to accommodate the new columns (discount_percentage and shipping_cost).

**Requirements:**
- Append the new orders to the orders Delta table
- Use the `mergeSchema` option to allow schema evolution
- Verify that existing orders have NULL values for the new columns
- Verify that new orders have values for all columns including the new ones
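
The effect you are asked to verify can be pictured with a small plain-Python model. `append_with_merge_schema` is a hypothetical helper that only imitates the outcome (a wider column list, with missing values filled as `None`); it says nothing about how Delta Lake implements schema evolution, and the rows are simplified.

```python
def append_with_merge_schema(existing_rows, new_rows):
    """Append new_rows to existing_rows, widening the column list as needed.

    A column that a row does not have is filled with None, which is how
    old orders are expected to look once the schema has evolved.
    """
    all_rows = existing_rows + new_rows
    columns = []
    for row in all_rows:
        for name in row:
            if name not in columns:
                columns.append(name)  # existing columns first, new ones at the end
    return [{name: row.get(name) for name in columns} for row in all_rows]

# Hypothetical, simplified rows
existing = [{"order_id": 1001, "status": "Completed"}]
incoming = [
    {"order_id": 1010, "status": "Pending", "discount_percentage": 10.0, "shipping_cost": 5.99}
]

table = append_with_merge_schema(existing, incoming)
for row in table:
    print(row)
```

Without the `mergeSchema` option, Delta Lake's schema enforcement rejects an append whose columns do not match the table, which is why the option is part of the requirements.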

**Write your code below:**

In [0]:
# TODO: Write your code here
# Evolve schema by appending new orders with mergeSchema option
# Hint: Use .option("mergeSchema", "true") when writing


In [0]:
# Verification cell - Run this after completing Exercise 6
try:
    orders_delta = spark.read.format("delta").load(orders_path)
    
    print("✅ Schema evolution completed!")
    print("\nUpdated schema:")
    orders_delta.printSchema()
    
    print("\nSample data (showing new columns):")
    orders_delta.select("order_id", "customer_id", "product_id", "discount_percentage", "shipping_cost").show(10)
    
    # Check if schema has new columns
    columns = orders_delta.columns
    if "discount_percentage" in columns and "shipping_cost" in columns:
        print("\n✅ New columns 'discount_percentage' and 'shipping_cost' added successfully!")
        
        # Check that old records have NULL for new columns
        old_order = orders_delta.filter(col("order_id") == 1001).collect()[0]
        if old_order["discount_percentage"] is None:
            print("✅ Old orders correctly have NULL for new columns")
        
        # Check that new records have values
        new_order = orders_delta.filter(col("order_id") == 1010).collect()[0]
        if new_order["discount_percentage"] is not None:
            print("✅ New orders correctly have values for new columns")
    else:
        print("❌ Schema evolution did not work as expected")
        
except Exception as e:
    print(f"❌ Error: {e}")
    print("Please complete Exercise 6 first.")

## Part 3: MERGE with Evolved Schema and Source Data Updates

Now that the schema has evolved, you'll receive updates that include the new columns.
Run the setup cells below to prepare for the final exercises.

In [0]:
# Simulate updates to existing orders with evolved schema
# Some orders need discount and shipping information added
# Some orders need status updates

updates_with_schema = [
    # Update existing order with new columns
    (1001, 1, 101, date(2024, 1, 15), 2, 1299.99, "Completed", 5.0, 15.99),  # Add discount and shipping
    (1002, 2, 102, date(2024, 1, 16), 3, 29.99, "Completed", 0.0, 5.99),  # Add discount and shipping
    # New order with full schema
    (1013, 5, 105, date(2024, 1, 25), 1, 399.99, "Pending", 20.0, 10.99)  # New order with discount
]

updates_df = spark.createDataFrame(updates_with_schema, evolved_orders_schema)

print("Updates with evolved schema:")
updates_df.show()
print("\n⚠️  Note: These updates include discount_percentage and shipping_cost for existing orders!")

## Exercise 7: MERGE with Evolved Schema

**Task:** Perform a MERGE operation that updates existing orders with discount/shipping information and inserts new orders.

**Requirements:**
- Use MERGE to update existing orders (1001, 1002) with their discount_percentage and shipping_cost
- Insert new order (1013)
- Match on `order_id`
- Update all columns when matched
- Insert all columns when not matched

**Write your code below:**

In [0]:
# TODO: Write your code here
# Perform MERGE with evolved schema on orders_delta


In [0]:
# Verification cell - Run this after completing Exercise 7
try:
    orders_delta = spark.read.format("delta").load(orders_path)
    
    print("✅ MERGE with evolved schema completed!")
    print(f"\nTotal orders: {orders_delta.count()}")
    
    # Check order 1001 has discount now
    order_1001 = orders_delta.filter(col("order_id") == 1001).collect()[0]
    if order_1001["discount_percentage"] == 5.0:
        print("✅ Order 1001 updated with discount correctly")
    
    # Check new order 1013
    order_1013 = orders_delta.filter(col("order_id") == 1013).collect()
    if order_1013:
        print("✅ Order 1013 inserted correctly")
    
    print("\nSample of updated orders:")
    orders_delta.select("order_id", "customer_id", "product_id", "status", 
                        "discount_percentage", "shipping_cost").orderBy("order_id").show(10)
    
except Exception as e:
    print(f"❌ Error: {e}")
    print("Please complete Exercise 7 first.")

## Exercise 8: Pivot Operation - Sales by Month and Category

**Task:** Create a pivot table showing sales by month and product category.

**Requirements:**
- Join orders with products
- Filter for completed orders only
- Calculate total_amount (considering discount: quantity * unit_price * (1 - discount_percentage/100) + shipping_cost)
- Extract year and month from order_date
- Create a pivot table with months as columns and categories as rows
- Show total sales for each category-month combination
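
The total_amount formula and the pivot layout can be checked on paper with a short plain-Python sketch. The rows below are made-up values, `order_total` is a hypothetical helper, and treating a missing discount or shipping cost as 0 is an assumption that matches how orders created before the schema change need to be handled.

```python
from collections import defaultdict

def order_total(quantity, unit_price, discount_percentage=None, shipping_cost=None):
    """Discounted line total plus shipping; missing (None) values count as 0."""
    discount = 0.0 if discount_percentage is None else discount_percentage
    shipping = 0.0 if shipping_cost is None else shipping_cost
    return quantity * unit_price * (1 - discount / 100) + shipping

# Hypothetical completed orders:
# (category, order_month, quantity, unit_price, discount_percentage, shipping_cost)
orders = [
    ("Electronics", 1, 2, 100.0, 10.0, 5.0),
    ("Electronics", 2, 1, 50.0, None, None),  # pre-evolution row: no discount or shipping
    ("Accessories", 1, 4, 5.0, 0.0, 0.0),
    ("Accessories", 1, 1, 5.0, None, None),
]

# Rows are categories, columns are months, cells are summed totals
pivot = defaultdict(dict)
for category, month, quantity, unit_price, discount, shipping in orders:
    amount = order_total(quantity, unit_price, discount, shipping)
    pivot[category][month] = pivot[category].get(month, 0.0) + amount

for category in sorted(pivot):
    print(category, pivot[category])
```

In Spark, arithmetic on a NULL value yields NULL, so orders without a discount or shipping cost need explicit handling or they drop out of the sum. Also note that every sample order in this exercise is dated January 2024, so the pivot on the real data has a single month column.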

**Write your code below:**

In [0]:
# TODO: Write your code here
# Step 1: Join orders with products and calculate total_amount with discount


# Step 2: Extract year and month, then create pivot table
# Hint: 
# - Use year() and month() functions to extract date parts
# - Use .groupBy() with .pivot() to create pivot table
# - Use .agg() with sum() for aggregations


In [0]:
# Verification cell - Run this after completing Exercise 8
try:
    orders_delta = spark.read.format("delta").load(orders_path)
    products_delta = spark.read.format("delta").load(products_path)
    
    # Calculate total amount with discount and shipping
    orders_with_amount = orders_delta.join(products_delta, "product_id") \
                                     .filter(col("status") == "Completed") \
                                     # discounted_amount is the order subtotal after the discount (NULL discount = none)
                                     .withColumn("discounted_amount",
                                                when(col("discount_percentage").isNotNull(),
                                                     col("quantity") * col("unit_price") * (1 - col("discount_percentage") / 100))
                                                .otherwise(col("quantity") * col("unit_price"))) \
                                     .withColumn("shipping", 
                                                when(col("shipping_cost").isNotNull(), col("shipping_cost"))
                                                .otherwise(0.0)) \
                                     .withColumn("total_amount", col("discounted_amount") + col("shipping")) \
                                     .withColumn("order_year", year("order_date")) \
                                     .withColumn("order_month", month("order_date"))
    
    # Create pivot table
    pivot_table = orders_with_amount.groupBy("category") \
                                   .pivot("order_month") \
                                   .agg(sum("total_amount").alias("total_sales")) \
                                   .orderBy("category")
    
    print("✅ Pivot operation completed!")
    print("\nSales by month and category (pivot table):")
    pivot_table.show()
        
except Exception as e:
    print(f"❌ Error: {e}")
    print("Please complete Exercise 8 first.")

## Exercise 9: Time Travel - Compare Sales Before and After Schema Evolution

**Task:** Use time travel to compare order data before and after schema evolution.

**Requirements:**
- Get the history of the orders Delta table
- Query a version before schema evolution (version 0 or 1)
- Query the current version
- Compare the number of orders and total sales between versions
- Show how schema evolution affected the data structure
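
The comparison you are asked to make can be pictured with a toy model in plain Python. `VersionedTable` is a hypothetical class that stores one full snapshot per commit, which is a simplification: Delta Lake reconstructs a version from its transaction log rather than keeping whole copies. The rows are made-up values.

```python
import math

class VersionedTable:
    """Toy table that keeps one full snapshot per commit (hypothetical helper)."""

    def __init__(self):
        self._snapshots = []

    def commit(self, rows):
        """Store a copy of rows as the next version and return its number."""
        self._snapshots.append([dict(r) for r in rows])
        return len(self._snapshots) - 1

    def read(self, version=None):
        """Return the rows at `version`, or at the latest version when omitted."""
        index = len(self._snapshots) - 1 if version is None else version
        return [dict(r) for r in self._snapshots[index]]

def summarize(rows):
    """Order count, column names, and undiscounted sales for a set of rows."""
    columns = sorted({name for row in rows for name in row})
    total = math.fsum(r["quantity"] * r["unit_price"] for r in rows)
    return {"orders": len(rows), "columns": columns, "total_sales": total}

table = VersionedTable()
v0 = table.commit([{"order_id": 1001, "quantity": 2, "unit_price": 10.0}])
v1 = table.commit(
    table.read()
    + [{"order_id": 1010, "quantity": 1, "unit_price": 5.0, "discount_percentage": 10.0}]
)

before = summarize(table.read(v0))  # an earlier version
after = summarize(table.read())     # the current version
new_columns = sorted(set(after["columns"]) - set(before["columns"]))
print(before)
print(after)
print("new columns:", new_columns)
```

Reading by version number never changes the stored data, so you can run the comparison as often as you like.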

**Write your code below:**

In [0]:
# TODO: Write your code here
# Step 1: Get table history


# Step 2: Query an earlier version (before schema evolution)


# Step 3: Query current version and compare


In [0]:
# Verification cell - Run this after completing Exercise 9
try:
    orders_delta = spark.read.format("delta").load(orders_path)
    delta_table = DeltaTable.forPath(spark, orders_path)
    history = delta_table.history()
    
    if history.count() > 0:
        print("✅ History retrieved successfully!")
        print(f"\nTotal versions: {history.count()}")
        
        # Find version before schema evolution (typically version 0 or 1)
        version_before = 0
        version_0 = spark.read.format("delta").option("versionAsOf", version_before).load(orders_path)
        current = spark.read.format("delta").load(orders_path)
        
        print(f"\nVersion {version_before} (before schema evolution):")
        print(f"  - Orders: {version_0.count()}")
        print(f"  - Columns: {len(version_0.columns)}")
        version_0.printSchema()
        
        print(f"\nCurrent version (after schema evolution):")
        print(f"  - Orders: {current.count()}")
        print(f"  - Columns: {len(current.columns)}")
        current.printSchema()
        
        print(f"\n✅ Schema evolution added {len(current.columns) - len(version_0.columns)} new columns!")
    else:
        print("❌ No history found")
        
except Exception as e:
    print(f"❌ Error: {e}")
    print("Please complete Exercise 9 first.")

## Summary and Reflection

Congratulations on completing the Advanced Delta Lake Operations exercise! 🎉

### What You Practiced:

✅ **Multi-Table Delta Operations** - Created and managed multiple Delta tables

✅ **Complex MERGE Operations** - Handled updates and inserts across different scenarios

✅ **Join Operations** - Joined multiple Delta tables to create enriched views

✅ **Aggregations** - Calculated sales summaries and statistics

✅ **Window Functions** - Used rankings and running totals for analytics

✅ **Schema Evolution** - Evolved table schemas to accommodate new attributes

✅ **Pivot Operations** - Created pivot tables for business intelligence

✅ **Time Travel** - Compared data across different versions

### Key Takeaways:

1. **Delta Lake supports complex analytics** - You can combine Delta operations with SQL analytics

2. **Schema evolution is flexible** - With `mergeSchema`, an append can add new columns; existing rows read as NULL for them, so queries on the original columns keep working

3. **MERGE handles complex scenarios** - Update existing records and insert new ones efficiently

4. **Time travel enables data auditing** - Track changes and compare versions

5. **Delta tables work seamlessly with Spark SQL** - The joins, aggregations, window functions, and pivots used in this exercise all ran directly on Delta tables

### Next Steps:

- Practice with larger, more complex datasets
- Explore Delta table optimization (OPTIMIZE, Z-ORDER, partitioning)
- Learn about Delta table maintenance (VACUUM, history retention)
- Experiment with streaming Delta tables
- Build end-to-end data pipelines with Delta Lake