# Module 04 - Advanced Features: Delta Lake, Unity Catalog, and Jobs

## Overview

This module covers advanced Databricks features including Delta Lake for ACID transactions, Unity Catalog for data governance, and Jobs for automation.

## Learning Objectives

By the end of this module, you will understand:
- Delta Lake: ACID transactions, time travel, and schema evolution
- Unity Catalog: Data governance and cataloging
- Jobs: Scheduling and automating notebook execution
- Workflows: Orchestrating multiple tasks
- Best practices for production workloads


## Introduction to Delta Lake

Delta Lake is an open-source storage layer that brings ACID transactions to data lakes. It stores data as Parquet files alongside a transaction log and provides:

### Key Features

1. **ACID Transactions**: Ensures data consistency
2. **Time Travel**: Query historical versions of data
3. **Schema Enforcement**: Prevents bad data from being written
4. **Schema Evolution**: Allows schema changes over time
5. **Upserts**: Update and insert operations (MERGE)
6. **Optimized Performance**: File compaction and data skipping can make queries faster than on plain Parquet


In [None]:
# Create sample data for Delta Lake demonstrations
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
from pyspark.sql.functions import col, current_timestamp

# Initial dataset
data = [
    (1, "Product A", 100.0, 10),
    (2, "Product B", 150.0, 15),
    (3, "Product C", 200.0, 20),
]

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("product", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("quantity", IntegerType(), True)
])

df = spark.createDataFrame(data, schema)
print("Initial DataFrame:")
df.display()


Initial DataFrame:


id,product,price,quantity
1,Product A,100.0,10
2,Product B,150.0,15
3,Product C,200.0,20


## Creating Delta Tables

Delta tables are created by writing DataFrames in Delta format.


In [None]:
# Create a Delta table
delta_path = "/Volumes/workspace/default/databricks_demo/products_delta"

# Write as Delta
df.write.format("delta").mode("overwrite").save(delta_path)
print(f"Delta table created at: {delta_path}")

# Read Delta table
delta_df = spark.read.format("delta").load(delta_path)
print("\nReading Delta table:")
delta_df.display()


Delta table created at: /Volumes/workspace/default/databricks_demo/products_delta

Reading Delta table:


id,product,price,quantity,category
1,Product A,100.0,10,
2,Product B,150.0,15,
3,Product C,200.0,20,


## Delta Table Operations

### 1. Append Mode


In [None]:
# Append new data to Delta table
new_data = [
    (4, "Product D", 250.0, 25),
    (5, "Product E", 300.0, 30),
]

new_df = spark.createDataFrame(new_data, schema)
new_df.write.format("delta").mode("append").save(delta_path)

# Read updated table
updated_df = spark.read.format("delta").load(delta_path)
print("After appending:")
updated_df.display()


After appending:


id,product,price,quantity,category
1,Product A,100.0,10,
2,Product B,150.0,15,
3,Product C,200.0,20,
4,Product D,250.0,25,
5,Product E,300.0,30,


### 2. Update Operations (MERGE)


In [None]:
from delta.tables import DeltaTable
from pyspark.sql.functions import lit

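# Create the updates DataFrame. The null 'category' column matches the table schema when the table
# already has that column (for example, if the Schema Evolution cell below ran in an earlier session).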
updates = [
    (1, "Product A Updated", 110.0, 12),  # Update existing
    (6, "Product F", 350.0, 35),          # New record
]

updates_df = spark.createDataFrame(updates, schema).withColumn("category", lit(None).cast("string"))

# Perform MERGE operation
delta_table = DeltaTable.forPath(spark, delta_path)

delta_table.alias("target").merge(
    updates_df.alias("source"),
    "target.id = source.id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

print("MERGE operation completed")
print("\nUpdated Delta table:")
spark.read.format("delta").load(delta_path).display()

MERGE operation completed

Updated Delta table:


id,product,price,quantity,category
1,Product A Updated,110.0,12,
6,Product F,350.0,35,
2,Product B,150.0,15,
3,Product C,200.0,20,
4,Product D,250.0,25,
5,Product E,300.0,30,
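MERGE covers upserts, but targeted updates and deletes go through the same `DeltaTable` handle. The following is a minimal sketch, assuming the `delta_table` object created in the cell above. The helper names `raise_price` and `remove_product` are hypothetical and exist only for illustration; the calls are left commented out so the outputs shown later in this module stay unchanged.

```python
def raise_price(delta_table, product_id, factor):
    """Multiply the price of one product in place (hypothetical helper)."""
    # String values in `set` are evaluated as SQL expressions against the table.
    delta_table.update(
        condition=f"id = {int(product_id)}",
        set={"price": f"price * {float(factor)}"},
    )


def remove_product(delta_table, product_id):
    """Delete all rows for one product id (hypothetical helper)."""
    delta_table.delete(f"id = {int(product_id)}")


# In the notebook, with delta_table = DeltaTable.forPath(spark, delta_path):
# raise_price(delta_table, 2, 1.1)
# remove_product(delta_table, 5)
```

Each call is recorded as its own version in the table history, so it can be inspected or undone with time travel.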


### 3. Time Travel

Delta Lake maintains a history of all changes, allowing you to query previous versions.


In [1]:
# Get history of Delta table
# Import here so the cell also runs on its own (for example, after a restart)
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, delta_path)
history = delta_table.history()

print("Delta table history:")
display(history)

# Query a specific version
print("\nQuerying version 0 (initial version):")
version_0 = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)
display(version_0)

# Query by timestamp (if you know the timestamp)
# timestamp_df = spark.read.format("delta").option("timestampAsOf", "2024-01-01 00:00:00").load(delta_path)

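A snapshot is selected with either `versionAsOf` or `timestampAsOf`, not both. The sketch below wraps that choice in two hypothetical helpers, `time_travel_options` and `read_snapshot`; only the reader options themselves come from the cell above, and `read_snapshot` assumes a `spark` session like the one Databricks notebooks provide.

```python
def time_travel_options(version=None, timestamp=None):
    """Return Delta reader options for exactly one of version or timestamp."""
    if (version is None) == (timestamp is None):
        raise ValueError("Pass exactly one of version or timestamp")
    if version is not None:
        return {"versionAsOf": str(int(version))}
    return {"timestampAsOf": str(timestamp)}


def read_snapshot(spark, path, version=None, timestamp=None):
    """Load a historical snapshot of the Delta table stored at `path`."""
    options = time_travel_options(version, timestamp)
    return spark.read.format("delta").options(**options).load(path)


print(time_travel_options(version=0))
print(time_travel_options(timestamp="2024-01-01 00:00:00"))
```

In the notebook, `read_snapshot(spark, delta_path, version=0)` would return the same DataFrame as the `versionAsOf` read above.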

### 4. Schema Evolution


In [None]:
# Add a new column to the schema
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# New data with additional column
new_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("product", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("quantity", IntegerType(), True),
    StructField("category", StringType(), True)  # New column
])

new_data_with_category = [
    (7, "Product G", 400.0, 40, "Electronics"),
    (8, "Product H", 450.0, 45, "Electronics"),
]

new_df_with_category = spark.createDataFrame(new_data_with_category, new_schema)

# Write with mergeSchema option to evolve schema
new_df_with_category.write.format("delta").mode("append").option("mergeSchema", "true").save(delta_path)

print("Schema evolved - new column 'category' added")
spark.read.format("delta").load(delta_path).show()


Schema evolved - new column 'category' added
+---+-----------------+-----+--------+-----------+
| id|          product|price|quantity|   category|
+---+-----------------+-----+--------+-----------+
|  7|        Product G|400.0|      40|Electronics|
|  8|        Product H|450.0|      45|Electronics|
|  2|        Product B|150.0|      15|       NULL|
|  3|        Product C|200.0|      20|       NULL|
|  4|        Product D|250.0|      25|       NULL|
|  5|        Product E|300.0|      30|       NULL|
|  1|Product A Updated|110.0|      12|       NULL|
|  6|        Product F|350.0|      35|       NULL|
+---+-----------------+-----+--------+-----------+
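Before enabling `mergeSchema`, it can help to see exactly which columns a write would add. The sketch below compares two schemas given as `(name, type)` pairs. The helper names are hypothetical, and the type strings assume Spark's `simpleString()` form (`int`, `double`, `string`).

```python
def fields_of(df):
    """Extract (name, type) pairs from a Spark DataFrame's schema."""
    return [(f.name, f.dataType.simpleString()) for f in df.schema.fields]


def added_columns(target_fields, source_fields):
    """Return source (name, type) pairs whose names the target lacks."""
    target_names = {name for name, _ in target_fields}
    return [(n, t) for n, t in source_fields if n not in target_names]


def type_conflicts(target_fields, source_fields):
    """Return (name, target_type, source_type) for columns whose types differ."""
    target_types = dict(target_fields)
    return [
        (n, target_types[n], t)
        for n, t in source_fields
        if n in target_types and target_types[n] != t
    ]


# Plain tuples standing in for fields_of(existing_df) and fields_of(new_df).
target = [("id", "int"), ("product", "string"), ("price", "double"), ("quantity", "int")]
source = target + [("category", "string")]

print(added_columns(target, source))   # [('category', 'string')]
print(type_conflicts(target, source))  # []
```

An empty conflict list with a short list of added columns is the case `mergeSchema` is designed for. A type conflict is a different problem, and appending with `mergeSchema` will generally not resolve it.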



## Delta Lake Best Practices

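1. **Use Delta for production tables** - ACID transactions and versioned history improve reliability over plain Parquet
2. **Partition large tables** - Improves query performance when queries filter on the partition column
3. **Compact small files** - Use OPTIMIZE to merge small files
4. **Vacuum old versions** - VACUUM deletes data files the table no longer references; versions older than the retention period can then no longer be queried with time travel
5. **Use Z-ordering** - Colocates related values for better query performance on specific columns
6. **Rely on schema enforcement** - Delta rejects writes whose schema does not match the table by default, which prevents bad data
7. **Use mergeSchema carefully** - Each new column becomes part of the table schema, and existing rows read as NULL for it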


In [None]:
# Optimize Delta table (compact small files)
delta_table = DeltaTable.forPath(spark, delta_path)
delta_table.optimize().executeCompaction()

print("Delta table optimized")

# Z-order by specific column for better query performance
# delta_table.optimize().executeZOrderBy("product")

# Vacuum old files (removes files older than retention period)
# delta_table.vacuum(retentionHours=168)  # 7 days
# print("Vacuum completed")


Delta table optimized
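The same maintenance operations are available as SQL statements. The sketch below only builds the statement strings so they can be reviewed before running. The helper names are hypothetical, and it assumes path-based table references (``delta.`<path>` ``) are permitted in your workspace.

```python
def optimize_sql(path, zorder_by=()):
    """Build an OPTIMIZE statement, optionally Z-ordering by the given columns."""
    statement = f"OPTIMIZE delta.`{path}`"
    if zorder_by:
        statement += " ZORDER BY (" + ", ".join(zorder_by) + ")"
    return statement


def vacuum_sql(path, retain_hours=168, dry_run=True):
    """Build a VACUUM statement; DRY RUN only lists the files that would be deleted."""
    statement = f"VACUUM delta.`{path}` RETAIN {int(retain_hours)} HOURS"
    if dry_run:
        statement += " DRY RUN"
    return statement


delta_path = "/Volumes/workspace/default/databricks_demo/products_delta"
print(optimize_sql(delta_path, ["product"]))
print(vacuum_sql(delta_path))

# In the notebook: spark.sql(vacuum_sql(delta_path)).show(truncate=False)
```

Defaulting `dry_run` to `True` is a deliberate choice here: vacuumed versions cannot be recovered, so previewing the file list first is the safer habit.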


## Unity Catalog (Overview)

Unity Catalog is Databricks' unified governance solution for data and AI. It provides:

### Key Features

1. **Centralized Metadata**: Single source of truth for all data assets
2. **Fine-grained Access Control**: Column and row-level security
3. **Data Lineage**: Track data flow and dependencies
4. **Audit Logging**: Monitor data access and changes
5. **Cross-cloud Support**: Works across AWS, Azure, and GCP

### Note for Free Tier

Unity Catalog may have limited features in the free tier. Check your Databricks edition for availability.


In [None]:
# Unity Catalog concepts (if available in your workspace)
# Unity Catalog uses a three-level namespace: catalog.schema.table

# Example: Query from Unity Catalog
# df = spark.table("main.default.products")

# Create table in Unity Catalog (if you have permissions)
# df.write.saveAsTable("main.default.products")

print("Unity Catalog is available in Databricks workspaces with appropriate licenses.")
print("It provides centralized governance for all your data assets.")
print("\nKey concepts:")
print("- Catalog: Top-level container (e.g., 'main')")
print("- Schema/Database: Second-level container")
print("- Table: Actual data table")


Unity Catalog is available in Databricks workspaces with appropriate licenses.
It provides centralized governance for all your data assets.

Key concepts:
- Catalog: Top-level container (e.g., 'main')
- Schema/Database: Second-level container
- Table: Actual data table
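The three-level namespace and a basic access grant can be sketched with plain string helpers. The function names and the `analysts` principal are hypothetical; the `GRANT SELECT ON TABLE ... TO ...` form follows Unity Catalog SQL as far as I know, and running it requires the appropriate privileges.

```python
def quote_identifier(name):
    """Wrap an identifier in backticks, escaping any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def qualified_name(catalog, schema, table):
    """Build a catalog.schema.table reference with each part quoted."""
    return ".".join(quote_identifier(part) for part in (catalog, schema, table))


def grant_select_sql(catalog, schema, table, principal):
    """Build a statement granting read access on one table to a principal."""
    target = qualified_name(catalog, schema, table)
    return f"GRANT SELECT ON TABLE {target} TO {quote_identifier(principal)}"


print(qualified_name("main", "default", "products"))
print(grant_select_sql("main", "default", "products", "analysts"))

# In the notebook: spark.table(qualified_name("main", "default", "products"))
```

Quoting every part keeps names with hyphens or other special characters valid.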


## Working with Managed Tables

Managed tables are tables where Databricks manages both the data and the metadata, so dropping a managed table also removes its data.


In [None]:
# Create a managed table
df.write.mode("overwrite").saveAsTable("products_managed")

print("Managed table 'products_managed' created")
print("You can query it with: SELECT * FROM products_managed")

# Query the managed table
spark.table("products_managed").show()


Managed table 'products_managed' created
You can query it with: SELECT * FROM products_managed
+---+---------+-----+--------+
| id|  product|price|quantity|
+---+---------+-----+--------+
|  1|Product A|100.0|      10|
|  2|Product B|150.0|      15|
|  3|Product C|200.0|      20|
+---+---------+-----+--------+



In [None]:
%sql
-- Query managed table using SQL
SELECT * FROM products_managed
ORDER BY price DESC


id,product,price,quantity
3,Product C,200.0,20
2,Product B,150.0,15
1,Product A,100.0,10
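To confirm that a table is managed rather than external, one option is to read the table type from `DESCRIBE TABLE EXTENDED`. This is a sketch under assumptions: the helper `table_type` is hypothetical, and it assumes the output contains a row whose `col_name` is `Type`, so check the raw output in your workspace first.

```python
def table_type(spark, table_name):
    """Return the table type reported by DESCRIBE TABLE EXTENDED, or None."""
    rows = spark.sql(f"DESCRIBE TABLE EXTENDED {table_name}").collect()
    for row in rows:
        # Assumption: the detailed section includes a 'Type' row (e.g. MANAGED).
        if row["col_name"] == "Type":
            return row["data_type"]
    return None


# In the notebook:
# print(table_type(spark, "products_managed"))
```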


## Databricks Jobs

Jobs let you run notebooks or scripts on a schedule or in response to a trigger. They're essential for production workloads.

### Job Types

1. **Notebook Jobs**: Run Databricks notebooks
2. **Python Scripts**: Run Python files
3. **JAR Jobs**: Run JAR files
4. **Spark Submit**: Run Spark applications

### Job Features

- **Scheduling**: Cron-based scheduling
- **Retries**: Automatic retry on failure
- **Notifications**: Email/Slack alerts
- **Job Clusters**: Automatic cluster management
- **Dependencies**: Chain multiple tasks within a job
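Within a job, each task can name the tasks it waits for in a `depends_on` list of `task_key` references. The sketch below checks such a task list for duplicate keys, unknown references, and cycles before it is submitted. The helper `execution_order` is hypothetical and runs locally; it does not call Databricks.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(tasks):
    """Return task keys in an order that respects depends_on, or raise ValueError."""
    keys = [task["task_key"] for task in tasks]
    if len(keys) != len(set(keys)):
        raise ValueError("Duplicate task_key in task list")

    graph = {}
    for task in tasks:
        deps = [dep["task_key"] for dep in task.get("depends_on", [])]
        unknown = [dep for dep in deps if dep not in keys]
        if unknown:
            raise ValueError(f"{task['task_key']} depends on unknown tasks: {unknown}")
        graph[task["task_key"]] = deps  # node -> tasks that must finish first

    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError("Task dependencies contain a cycle") from exc


tasks = [
    {"task_key": "transform_data", "depends_on": [{"task_key": "extract_data"}]},
    {"task_key": "extract_data"},
]
print(execution_order(tasks))  # ['extract_data', 'transform_data']
```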


In [None]:
# Jobs are typically created via the Databricks UI, CLI, or REST API
# This cell lists those options and prints an example job configuration

print("Creating Jobs:")
print("\n1. Via UI:")
print("   - Go to Workflows > Jobs")
print("   - Click 'Create Job'")
print("   - Add tasks (notebooks, scripts, etc.)")
print("   - Configure schedule and cluster")
print("   - Set up notifications")

print("\n2. Via Databricks CLI:")
print("   databricks jobs create --json-file job_config.json")

print("\n3. Via REST API:")
print("   POST /api/2.1/jobs/create")

print("\nExample job configuration:")
job_config_example = {
    "name": "Daily ETL Job",
    "tasks": [
        {
            "task_key": "extract_data",
            "notebook_task": {
                "notebook_path": "/Users/your_email@domain.com/extract_data"
            },
            "existing_cluster_id": "your-cluster-id"
        },
        {
            "task_key": "transform_data",
            "notebook_task": {
                "notebook_path": "/Users/your_email@domain.com/transform_data"
            },
            "depends_on": [{"task_key": "extract_data"}],
            "existing_cluster_id": "your-cluster-id"
        }
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # Daily at 2 AM
        "timezone_id": "America/New_York"
    },
    "email_notifications": {
        "on_success": ["your-email@domain.com"],
        "on_failure": ["your-email@domain.com"]
    }
}

print("\nJob config structure:")
for key, value in job_config_example.items():
    print(f"  {key}: {value}")


Creating Jobs:

1. Via UI:
   - Go to Workflows > Jobs
   - Click 'Create Job'
   - Add tasks (notebooks, scripts, etc.)
   - Configure schedule and cluster
   - Set up notifications

2. Via Databricks CLI:
   databricks jobs create --json-file job_config.json

3. Via REST API:
   POST /api/2.1/jobs/create

Example job configuration:

Job config structure:
  name: Daily ETL Job
  tasks: [{'task_key': 'extract_data', 'notebook_task': {'notebook_path': '/Users/your_email@domain.com/extract_data'}, 'existing_cluster_id': 'your-cluster-id'}, {'task_key': 'transform_data', 'notebook_task': {'notebook_path': '/Users/your_email@domain.com/transform_data'}, 'depends_on': [{'task_key': 'extract_data'}], 'existing_cluster_id': 'your-cluster-id'}]
  schedule: {'quartz_cron_expression': '0 0 2 * * ?', 'timezone_id': 'America/New_York'}
  email_notifications: {'on_success': ['your-email@domain.com'], 'on_failure': ['your-email@domain.com']}
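The REST option amounts to a `POST` of the job configuration as JSON, authenticated with a bearer token. The sketch below only builds the request with the standard library and does not send it. The host, token, and helper name are placeholders, not values from this module.

```python
import json
import urllib.request


def build_create_job_request(host, token, job_config):
    """Build (but do not send) a POST request to the Jobs 2.1 create endpoint."""
    url = host.rstrip("/") + "/api/2.1/jobs/create"
    body = json.dumps(job_config).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder workspace URL and token; substitute your own values.
request = build_create_job_request(
    "https://your-workspace.example.com",
    "your-access-token",
    {"name": "Daily ETL Job", "tasks": []},
)
print(request.get_method(), request.full_url)

# To submit: urllib.request.urlopen(request) and parse the JSON response.
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any job is actually created.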


## Parameterizing Notebooks for Jobs

Notebooks can accept parameters when run as jobs, making them reusable.


In [None]:
# Using widgets for parameters
dbutils.widgets.text("input_path", "/tmp/input", "Input Path")
dbutils.widgets.text("output_path", "/tmp/output", "Output Path")
dbutils.widgets.dropdown("mode", "overwrite", ["overwrite", "append"], "Write Mode")

# Get parameter values
input_path = dbutils.widgets.get("input_path")
output_path = dbutils.widgets.get("output_path")
mode = dbutils.widgets.get("mode")

print(f"Input Path: {input_path}")
print(f"Output Path: {output_path}")
print(f"Mode: {mode}")

# Use parameters in your code
# df = spark.read.parquet(input_path)
# df.write.mode(mode).parquet(output_path)


Input Path: /tmp/input
Output Path: /tmp/output
Mode: overwrite
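When the notebook runs as a job task, widget values can be supplied through the task's `base_parameters` map, keyed by widget name. The sketch below adds that map to a task definition like the ones in the job configuration earlier. The helper name is hypothetical, and values are converted to strings because widgets return strings.

```python
import copy


def with_base_parameters(task, params):
    """Return a copy of a notebook task with widget values set as base_parameters."""
    updated = copy.deepcopy(task)  # leave the caller's task definition untouched
    updated["notebook_task"]["base_parameters"] = {
        name: str(value) for name, value in params.items()
    }
    return updated


task = {
    "task_key": "extract_data",
    "notebook_task": {"notebook_path": "/Users/your_email@domain.com/extract_data"},
}
parameterized = with_base_parameters(
    task,
    {"input_path": "/tmp/input", "output_path": "/tmp/output", "mode": "append"},
)
print(parameterized["notebook_task"]["base_parameters"])
```

Inside the notebook, `dbutils.widgets.get("mode")` would then return `"append"` instead of the widget default.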


## Running Notebooks from Other Notebooks

You can orchestrate workflows by running notebooks from other notebooks.


In [None]:
# Run another notebook
# result = dbutils.notebook.run(
#     "/path/to/other/notebook",
#     timeout_seconds=300,
#     arguments={
#         "param1": "value1",
#         "param2": "value2"
#     }
# )

# print(f"Notebook execution result: {result}")

print("To run a notebook from another notebook:")
print("result = dbutils.notebook.run('/path/to/notebook', timeout_seconds=300)")
print("\nThis is useful for:")
print("- Orchestrating multi-step workflows")
print("- Reusing common logic")
print("- Building modular data pipelines")
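Child notebooks can fail for transient reasons, so orchestration code often wraps the call in a retry loop. The sketch below takes the runner as an argument so it can be tried without a cluster; in a notebook you would pass `dbutils.notebook.run`. The helper name and its retry parameters are hypothetical.

```python
import time


def run_with_retry(run_notebook, path, timeout_seconds, arguments=None,
                   max_retries=3, wait_seconds=0.0):
    """Call run_notebook(path, timeout_seconds, arguments), retrying on failure."""
    attempt = 0
    while True:
        try:
            return run_notebook(path, timeout_seconds, arguments or {})
        except Exception:
            attempt += 1
            if attempt > max_retries:
                raise  # retries exhausted: surface the last error
            time.sleep(wait_seconds)


# In the notebook:
# result = run_with_retry(dbutils.notebook.run, "/path/to/other/notebook", 300,
#                         {"param1": "value1"}, max_retries=2, wait_seconds=10)
```

The returned value is whatever string the child notebook passes to `dbutils.notebook.exit`.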


## Summary

In this module, you learned:

✅ **Delta Lake** - ACID transactions, time travel, schema evolution, and MERGE operations

✅ **Unity Catalog** - Data governance and centralized metadata management

✅ **Managed Tables** - Creating and managing tables in Databricks

✅ **Jobs** - Scheduling and automating notebook execution

✅ **Parameterization** - Making notebooks reusable with widgets

✅ **Notebook Orchestration** - Running notebooks from other notebooks

### Next Steps

In the final module, we'll explore:
- Production best practices
- Performance optimization
- Monitoring and debugging
- Real-world scenarios and case studies


## Exercise

Try these exercises to practice:

1. Create a Delta table and perform INSERT, UPDATE, and DELETE operations
2. Use time travel to query a previous version of your Delta table
3. Evolve the schema of a Delta table by adding a new column
4. Create a managed table and query it using SQL
5. Use MERGE to upsert data into a Delta table
