# M03: Delta Lake Fundamentals

| Exam Domain | Weight |
|---|---|
| ELT with Spark SQL and Python | 29% |
| Data Governance | 14% |

Delta Lake is an open-source storage layer that provides ACID transactions, schema enforcement, and time travel on top of Parquet files. It is the default table format in Databricks. In this module you will learn CRUD operations, MERGE INTO, schema management, and data versioning, all key exam topics.

---

## Setup

In [0]:
%run ../../setup/00_setup

### Configuration

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.types import *
from datetime import datetime, timedelta

# Display user context
display(
    spark.createDataFrame([
        (CATALOG, BRONZE_SCHEMA, SILVER_SCHEMA, GOLD_SCHEMA)
    ], ['catalog', 'bronze_schema', 'silver_schema', 'gold_schema'])
)

# Set catalog and schema as default
spark.sql(f"USE CATALOG {CATALOG}")
spark.sql(f"USE SCHEMA {BRONZE_SCHEMA}")

## Delta Lake Core Features

Core capabilities of Delta Lake: ACID transactions, schema enforcement, schema evolution, identity columns, and data quality constraints. These features make Delta the default table format in Databricks.

---

**Theoretical Introduction:**

Delta Lake is a transactional layer over Parquet that provides ACID properties (Atomicity, Consistency, Isolation, Durability). Every operation on a Delta table is recorded in the Delta Log, a `_delta_log/` directory of JSON commit files that describe each change to the table.

<img src="../../../assets/images/8850570fe2b147eb86cb690d51bc798c.png" width="800">


**Key Concepts:**
- **ACID Transactions**: All operations are atomic and consistent
- **Delta Log**: `_delta_log/` folder with JSON files describing each transaction
- **Schema Enforcement**: Automatic schema validation - prevents bad data from entering
- **Schema Evolution**: Controlled addition of new columns without breaking existing pipelines
- **Constraints**: Data quality rules enforced at the table level

**Practical Application:**
- Transactional updates in Data Lake
- Ensuring data quality through schema validation and constraints
- Unified data access for batch and streaming workloads

### Example: Creating the First Delta Table

**Objective:** Create a Delta table and inspect its basic properties

**Approach:**
1. Load data from Unity Catalog Volume
2. Create a managed table in Delta format
3. Explore Delta Log and metadata

In [0]:
# Load customer data from Unity Catalog Volume
customers_df = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(f"{DATASET_PATH}/customers/customers.csv")
)

In [0]:
spark.sql(f"DROP TABLE IF EXISTS {CATALOG}.{BRONZE_SCHEMA}.customers_delta")

In [0]:
# Create managed Delta table
customers_df.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta")

In [0]:
display(spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta").limit(5))

In [0]:
# Write customers_df as Delta files to an explicit path (path-based write; this does not register a table in the catalog)
external_path = f"{DATASET_PATH}/external/customers_delta"
customers_df.write.format("delta").mode("overwrite").save(external_path)

### Example: Schema Enforcement in Action

**Objective:** Demonstration of automatic schema validation during data insertion



Schema Enforcement is a critical feature that prevents "garbage in, garbage out" scenarios. Delta Lake compares incoming data schema with the target table schema and **rejects incompatible writes**.
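The rule above can be modelled in a few lines of plain Python. This is a simplified sketch, not Delta Lake's implementation: schemas are plain dicts of column name to type name, and `enforce_schema` is a hypothetical helper. It assumes the commonly documented behavior that extra columns and type mismatches are rejected on append, while columns missing from the incoming data are allowed and filled with NULL.

```python
# Simplified model of schema enforcement on append (illustration only).
def enforce_schema(target: dict, incoming: dict) -> list:
    """Return a list of violations; an empty list means the write is accepted."""
    problems = []
    for name, dtype in incoming.items():
        if name not in target:
            problems.append(f"extra column: {name}")
        elif target[name] != dtype:
            problems.append(f"type mismatch on {name}: {dtype} vs {target[name]}")
    return problems


target_schema = {"customer_id": "string", "email": "string", "registration_date": "date"}

# registration_date arrives as a string, so this write would be rejected
bad = enforce_schema(target_schema, {"customer_id": "string", "registration_date": "string"})
# after casting to date the same write passes; the missing email column is allowed
ok = enforce_schema(target_schema, {"customer_id": "string", "registration_date": "date"})

print(bad)
print(ok)
```

The cells below show the same sequence against the real table: a rejected append, a cast, and a successful append.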

In [0]:
# Check current table schema
spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta").printSchema()

In [0]:
# Attempt to append data with a mismatched schema:
# registration_date is a string here, while the table column is a date
invalid_data = spark.createDataFrame([
    ("CUST999999", "Test", "Customer", "invalid_email", "+48 123 456 789", "2025-11-28")
], ["customer_id", "first_name", "last_name", "email", "phone", "registration_date"])

print("Schema enforcement in action")

In [0]:
try:
    invalid_data.write \
        .format("delta") \
        .mode("append") \
        .saveAsTable(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta")
except Exception as e:
    # Expected: the string registration_date does not match the table's date column
    print(f"Schema enforcement rejected the write:\n{str(e)[:300]}...")

In [0]:
from pyspark.sql.functions import col
from pyspark.sql.types import DateType

invalid_data = invalid_data.withColumn("registration_date", col("registration_date").cast(DateType()))

In [0]:
invalid_data.printSchema()
display(invalid_data)

In [0]:
invalid_data.write \
    .format("delta") \
    .mode("append") \
    .saveAsTable(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta")

### Example: Identity and Generated Columns

**Objective:** Demonstrate advanced column features - auto-generated surrogate keys and computed columns

Delta Lake supports:
- **IDENTITY columns**: Auto-incrementing surrogate keys (unique, increasing, but not contiguous)
- **GENERATED columns**: Computed columns derived from other columns

In [0]:
# Create table with Identity Column and Generated Column
spark.sql(f"""
CREATE TABLE IF NOT EXISTS {CATALOG}.{BRONZE_SCHEMA}.orders_modern (
    order_sk BIGINT GENERATED ALWAYS AS IDENTITY,  -- Surrogate Key
    order_id STRING,
    total_amount DOUBLE,
    order_timestamp TIMESTAMP,
    order_date DATE GENERATED ALWAYS AS (CAST(order_timestamp AS DATE)) -- Auto-calculated
) USING DELTA
""")

> **Note:** In a distributed environment like Databricks (Spark/Delta Lake), `GENERATED ALWAYS AS IDENTITY` has specific behaviors. It guarantees **uniqueness** and an **increasing trend**, but does NOT guarantee contiguous numbering.

Now we will insert data. Note that in the `INSERT` query we omit `order_sk` and `order_date` columns:
- `order_sk`: will be generated automatically (unique number)
- `order_date`: will be calculated based on `order_timestamp`

In [0]:
# Insert data without specifying generated columns
spark.sql(f"""
INSERT INTO {CATALOG}.{BRONZE_SCHEMA}.orders_modern (order_id, total_amount, order_timestamp)
VALUES 
    ('ORD-001', 150.50, current_timestamp()),
    ('ORD-002', 200.00, current_timestamp())
""")
display(spark.table("orders_modern"))

### Example: Schema Evolution

**Objective:** Demonstration of automatic schema evolution when adding new columns


Schema Evolution allows controlled addition of new columns to existing Delta tables without interrupting running applications. Delta Lake applies additive schema changes automatically when the write enables the `mergeSchema` option.


<img src="../../../assets/images/3c274f5e758047b2b8033f29fd179c85.png" width="800">
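What an additive schema change does to rows that already exist can be modelled in plain Python. This is a minimal sketch under stated assumptions, not how Delta Lake stores data: rows are dicts, `append_with_merge_schema` is a hypothetical helper, and `CUST000001` is a made-up existing customer.

```python
# Simplified model of an append with mergeSchema enabled (illustration only).
def append_with_merge_schema(columns: list, rows: list, new_rows: list):
    """Append new_rows, adding previously unseen columns at the end of the schema."""
    merged_columns = list(columns)
    for row in new_rows:
        for name in row:
            if name not in merged_columns:
                merged_columns.append(name)
    # Every row is projected onto the merged schema; absent values become None (NULL)
    merged_rows = [{c: r.get(c) for c in merged_columns} for r in rows + new_rows]
    return merged_columns, merged_rows


columns = ["customer_id", "customer_segment"]
rows = [{"customer_id": "CUST000001", "customer_segment": "Basic"}]
new_rows = [{"customer_id": "CUST010001", "customer_segment": "Basic", "customer_tier": "Premium"}]

columns, rows = append_with_merge_schema(columns, rows, new_rows)
print(columns)
print(rows)
```

The new column is appended to the schema, and the row written before the change reads back with `None` in it, which is what NULL looks like from Python.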

In [0]:
# Data with additional column (customer_tier)
extended_customers = spark.createDataFrame([
    ("CUST010001", "New", "Customer", "new@example.com", "+48 111 222 333", "Warsaw", "MZ", "Poland", "2023-12-01", "Basic", "Premium"),
    ("CUST010002", "Another", "Customer", "another@example.com", "+48 444 555 666", "Krakow", "MP", "Poland", "2023-12-02", "Premium", "Standard")
], ["customer_id", "first_name", "last_name", "email", "phone", "city", "state", "country", "registration_date", "customer_segment", "customer_tier"])
display(extended_customers)

In [0]:
df = spark.table("customers_delta")
display(df.orderBy("customer_id"))

In [0]:
from pyspark.sql.functions import col
from pyspark.sql.types import DateType

# Cast registration_date to proper type
extended_customers = extended_customers.withColumn("registration_date", col("registration_date").cast(DateType()))

In [0]:
# Enable automatic schema evolution with mergeSchema option
extended_customers.write \
    .format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .saveAsTable(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta")

In [0]:
# Check new schema - notice the new customer_tier column
spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta").printSchema()

In [0]:
# Verify data: the appended row carries its customer_tier value (rows written before the schema change hold NULL in that column)
display(
    spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta")
    .select("customer_id", "first_name", "last_name", "customer_tier")
    .filter(col("customer_id") == "CUST010001")
)

### Example: Data Quality with Constraints

**Objective:** Enforce data quality rules at the table level using CHECK constraints

Delta Lake allows defining **Constraints** that guarantee data quality at the table level. This works similarly to traditional SQL databases.

**Constraint Types:**
- `NOT NULL`: Enforces the presence of a value
- `CHECK`: Enforces any logical condition (e.g., `age > 0`, `customer_id LIKE 'CUST%'`)
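The cells in this section exercise only `CHECK`. As a hedged sketch of the other constraint type, the statements below show how a `NOT NULL` constraint is typically declared and how a named `CHECK` constraint can be dropped again. The table name is a placeholder, and the statements are only printed because they would fail if run before the table and constraint exist.

```python
# Placeholder name; in this notebook it would be f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta"
table = "my_catalog.bronze.customers_delta"

constraint_statements = [
    # NOT NULL is declared on the column; existing rows must already satisfy it
    f"ALTER TABLE {table} ALTER COLUMN customer_id SET NOT NULL",
    # CHECK constraints are named, so they can be dropped again by name
    f"ALTER TABLE {table} DROP CONSTRAINT valid_customer_id",
]

for statement in constraint_statements:
    print(statement)  # in a notebook: spark.sql(statement)
```

Adding `NOT NULL` to a column that still contains NULL values is rejected, which is why this notebook deletes rows with a NULL `customer_id` before adding constraints.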

In [0]:
cleaned_df = spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta").na.drop(subset=["customer_id"])
display(cleaned_df.orderBy("customer_id"))

In [0]:
%sql

delete from customers_delta
where customer_id is null

In [0]:
# Add CHECK constraint: customer_id must start with CUST
try:
    spark.sql(f"""
        ALTER TABLE {CATALOG}.{BRONZE_SCHEMA}.customers_delta
        ADD CONSTRAINT valid_customer_id CHECK (customer_id LIKE 'CUST%')
    """)
    print("Constraint 'valid_customer_id' added successfully.")
except Exception as e:
    print(f"Info: {e}")

In [0]:
# Attempt to insert invalid data (customer_id does not start with CUST)
try:
    spark.sql(f"""
        INSERT INTO {CATALOG}.{BRONZE_SCHEMA}.customers_delta (customer_id, first_name, last_name, email, phone, city, state, country, registration_date, customer_segment)
        VALUES ('INVALID123', 'Bad', 'Customer', 'bad@example.com', '+48 000 000 000', 'Test', 'TS', 'Poland', '2023-01-01', 'Basic')
    """)
except Exception as e:
    print(f"Expected Data Quality error:\n{str(e)[:300]}...")

In [0]:
spark.sql(f"""
  INSERT INTO {CATALOG}.{BRONZE_SCHEMA}.customers_delta (customer_id, first_name, last_name, email, phone, city, state, country, registration_date, customer_segment)
  VALUES ('CUST123', 'Bad', 'Customer', 'bad@example.com', '+48 000 000 000', 'Test', 'TS', 'Poland', '2023-01-01', 'Basic')""")

## CRUD Operations & MERGE

INSERT, UPDATE, DELETE and MERGE INTO operations on Delta tables. MERGE is the most important of these for the exam and appears on almost every test.

---

**Theoretical Introduction:**

Delta Lake supports the full range of CRUD operations (Create, Read, Update, Delete), making it ideal for transactional workloads in Data Lake. All operations are:
- **Atomic**: Either fully complete or fully rolled back
- **ACID-compliant**: Ensuring data consistency
- **Recorded in Delta Log**: Full audit trail of all changes

Additionally, Delta Lake provides the powerful **MERGE INTO** operation (also known as "upsert") that combines INSERT and UPDATE in a single atomic transaction - essential for CDC (Change Data Capture) scenarios.

### Example: INSERT Operation

**Objective:** Adding new records to an existing table

In [0]:
# INSERT new customers
spark.sql(f"""
    INSERT INTO {CATALOG}.{BRONZE_SCHEMA}.customers_delta
    (customer_id, first_name, last_name, email, phone, city, state, country, registration_date, customer_segment, customer_tier)
    VALUES 
        ('CUST020001', 'Insert', 'Customer1', 'insert1@example.com', '+48 111 111 111', 'Warsaw', 'MZ', 'Poland', '2023-12-10', 'Premium', 'Gold'),
        ('CUST020002', 'Insert', 'Customer2', 'insert2@example.com', '+48 222 222 222', 'Gdansk', 'PM', 'Poland', '2023-12-11', 'Basic', 'Silver')
""")

In [0]:
# Verify insertion
display(
    spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta")
    .filter(F.col("customer_id").like("CUST02%"))
    .orderBy("customer_id")
)

### Example: UPDATE Operation

**Objective:** Updating existing records in the table

In [0]:
# UPDATE customer tier for specific customers
spark.sql(f"""
    UPDATE {CATALOG}.{BRONZE_SCHEMA}.customers_delta
    SET customer_tier = 'Platinum'
    WHERE customer_id IN ('CUST010001', 'CUST020001')
""")

In [0]:
# Verify update
display(
    spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta")
    .filter(F.col("customer_tier") == "Platinum")
)

### Example: DELETE Operation

**Objective:** Deleting records from a Delta table

In [0]:
# DELETE specific customer
spark.sql(f"""
    DELETE FROM {CATALOG}.{BRONZE_SCHEMA}.customers_delta
    WHERE customer_id = 'CUST020002'
""")

In [0]:
# Verify deletion
deleted_check = spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta") \
    .filter(F.col("customer_id") == "CUST020002") \
    .count()

display(
    spark.createDataFrame([
        ("Records with customer_id CUST020002", deleted_check)
    ], ["description", "count"])
)

### Example: MERGE INTO (Upsert)

**Objective:** Demonstration of upsert operation - update existing and insert new records in a single atomic transaction

MERGE INTO is especially useful when processing changes from transactional systems (CDC patterns). It allows you to:
- **Update** existing records when a match is found
- **Insert** new records when no match exists
- **Delete** records based on conditions (optional)


<img src="../../../assets/images/ff01677d3d4a45d6a6a7530d8911b785.png" width="800">
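The optional delete branch listed above is not used in the cells that follow, so here is a minimal sketch of what it could look like. The table name, the `customer_changes` source view, and its `is_deleted` flag are assumptions made for illustration; the statement is only built and printed, not executed.

```python
# Placeholder names: the table, the customer_changes view and its is_deleted
# flag are assumptions made for this sketch.
table = "my_catalog.bronze.customers_delta"

merge_with_delete_sql = f"""
MERGE INTO {table} AS target
USING customer_changes AS source
ON target.customer_id = source.customer_id

-- Matched clauses are evaluated in order, so the conditional delete comes first
WHEN MATCHED AND source.is_deleted = true THEN
    DELETE
WHEN MATCHED THEN
    UPDATE SET
        email = source.email,
        customer_tier = source.customer_tier
WHEN NOT MATCHED AND source.is_deleted = false THEN
    INSERT (customer_id, email, customer_tier)
    VALUES (source.customer_id, source.email, source.customer_tier)
"""

print(merge_with_delete_sql)
# In a notebook with a matching source view: spark.sql(merge_with_delete_sql)
```

Placing the conditional clause before the unconditional `WHEN MATCHED` matters: a matched row is handled by the first clause whose condition it satisfies.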

In [0]:
# Prepare data for merge (mix of updates and new records)
merge_data = spark.createDataFrame([
    ("CUST010001", "Updated", "Name", "updated@example.com", "+48 999 999 999", "Poznan", "WP", "Poland", "2023-12-01", "VIP", "Diamond"),  # Update existing
    ("CUST030001", "Brand", "New", "brand.new@example.com", "+48 777 777 777", "Wroclaw", "DS", "Poland", "2023-12-15", "Basic", "Bronze"),   # Insert new
    ("CUST030002", "Another", "New", "another.new@example.com", "+48 888 888 888", "Lodz", "LD", "Poland", "2023-12-16", "Premium", "Silver") # Insert new
], ["customer_id", "first_name", "last_name", "email", "phone", "city", "state", "country", "registration_date", "customer_segment", "customer_tier"])

# Create temporary view for merge operation
merge_data.createOrReplaceTempView("customer_updates")

In [0]:
# MERGE INTO operation (Upsert)
spark.sql(f"""
    MERGE INTO {CATALOG}.{BRONZE_SCHEMA}.customers_delta AS target
    USING customer_updates AS source
    ON target.customer_id = source.customer_id
    
    WHEN MATCHED THEN
        UPDATE SET
            first_name = source.first_name,
            last_name = source.last_name,
            email = source.email,
            phone = source.phone,
            city = source.city,
            state = source.state,
            country = source.country,
            customer_segment = source.customer_segment,
            customer_tier = source.customer_tier
    
    WHEN NOT MATCHED THEN
        INSERT (customer_id, first_name, last_name, email, phone, city, state, country, registration_date, customer_segment, customer_tier)
        VALUES (source.customer_id, source.first_name, source.last_name, source.email, source.phone, source.city, source.state, source.country, source.registration_date, source.customer_segment, source.customer_tier)
""")

In [0]:
# Verify MERGE results
display(
    spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta")
    .filter(F.col("customer_id").isin(["CUST010001", "CUST030001", "CUST030002"]))
    .orderBy("customer_id")
)

## Metadata and Analytics

DESCRIBE DETAIL, DESCRIBE HISTORY and Delta Log internals. Metadata commands are essential for auditing, debugging, and compliance, and they are covered on the exam.

---

**Theoretical Introduction:**

Delta Lake offers rich metadata about tables and operations that enables:
- **Auditing**: Who changed what and when
- **Debugging**: Understanding operation performance and metrics
- **Compliance**: Meeting regulatory requirements for data lineage

**Key Commands:**
- `DESCRIBE DETAIL`: File structure, partitioning, table properties
- `DESCRIBE HISTORY`: Complete audit trail of all operations
- `SHOW TBLPROPERTIES`: Table configuration and settings

### Example: DESCRIBE DETAIL

**Objective:** Analysis of Delta table metadata and physical storage details

In [0]:
# Detailed table information
display(
    spark.sql(f"DESCRIBE DETAIL {CATALOG}.{BRONZE_SCHEMA}.customers_delta")
)

### Example: Operation History Analysis

**Objective:** Deeper analysis of history and operation metrics

In [0]:
# History with additional metrics
history_df = spark.sql(f"DESCRIBE HISTORY {CATALOG}.{BRONZE_SCHEMA}.customers_delta")

display(
    history_df.select(
        "version", 
        "timestamp", 
        "operation", 
        "operationMetrics.numTargetRowsInserted",
        "operationMetrics.numTargetRowsUpdated",
        "operationMetrics.numTargetRowsDeleted"
    )
)

### Example: Delta Log Internals (Deep Dive)

**Objective:** Understanding how Delta Lake ensures ACID by looking "under the hood" at JSON files in `_delta_log`

The Delta Log is a transaction log stored in the `_delta_log/` folder. Each transaction creates a new JSON file containing:
- `add`: Adding a new Parquet file with data
- `remove`: Logical deletion of a file (e.g., during DELETE or OPTIMIZE)
- `commitInfo`: Metadata about the transaction (who, when, what operation)

<img src="../../../assets/images/f0361af0b2ef43c0a1f61ee1202e2df3.png" width="800">
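To make the `add` and `remove` actions concrete, the sketch below replays a tiny, hand-written log in plain Python and derives which data files make up the table at a given version. The commit contents and file names are hypothetical and heavily simplified; real commit files carry many more fields per action.

```python
import json

# Hypothetical, heavily simplified commit files keyed by version.
# Each line of a commit is one JSON action, as in a real Delta Log.
commits = {
    0: '{"commitInfo": {"operation": "WRITE"}}\n'
       '{"add": {"path": "part-0000.parquet"}}',
    1: '{"commitInfo": {"operation": "UPDATE"}}\n'
       '{"remove": {"path": "part-0000.parquet"}}\n'
       '{"add": {"path": "part-0001.parquet"}}',
}


def live_files(commits: dict, version: int) -> set:
    """Replay commits 0..version and return the files that form the table."""
    files = set()
    for v in sorted(commits):
        if v > version:
            break
        for line in commits[v].splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                # Logical delete: the file stays on disk until VACUUM removes it
                files.discard(action["remove"]["path"])
    return files


print(live_files(commits, 0))
print(live_files(commits, 1))
```

Replaying the log only up to an older version is also the idea behind Time Travel: the older file list is still derivable as long as the files themselves have not been physically deleted.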

In [0]:
# Get table path
table_details = spark.sql(f"DESCRIBE DETAIL {CATALOG}.{BRONZE_SCHEMA}.customers_delta")
display(table_details.select("location", "numFiles", "sizeInBytes"))

In [0]:
# View complete Delta table history
display(spark.sql(f"DESCRIBE HISTORY {CATALOG}.{BRONZE_SCHEMA}.customers_delta"))

## Time Travel and Disaster Recovery

Access previous versions of data with VERSION AS OF and TIMESTAMP AS OF, roll back with RESTORE for disaster recovery, and reclaim storage with VACUUM. All of these are heavily tested on the exam.

---

**Theoretical Introduction:**

Time Travel is a fundamental Delta Lake feature enabling access to previous versions of data. It is based on the **immutable transaction log** (`_delta_log`) - every change creates a new version of files, while old versions remain available until cleaned up by VACUUM.

**Key Capabilities:**
- **VERSION AS OF**: Query data at a specific version number
- **TIMESTAMP AS OF**: Query data at a specific point in time
- **RESTORE**: Rollback table to a previous state
- **Audit**: Compare data between versions

**Important Consideration:** VACUUM removes old files and directly impacts Time Travel capabilities. Understanding this relationship is crucial for data retention strategies.


<img src="../../../assets/images/7e9c029eeeb14e44b5ce68c5af90f350.png" width="800">

### Demo: Time Travel Setup

In [0]:
spark.sql(f"""DROP TABLE IF EXISTS {CATALOG}.{BRONZE_SCHEMA}.time_travel_demo """)

In [0]:
# Create a new table specifically for Time Travel demonstration
spark.sql(f"""
CREATE OR REPLACE TABLE {CATALOG}.{BRONZE_SCHEMA}.time_travel_demo (
    id INT,
    name STRING,
    status STRING,
    updated_at TIMESTAMP
) USING DELTA
""")

# Version 1: Insert initial data (version 0 is the empty table created by CREATE above)
spark.sql(f"""
INSERT INTO {CATALOG}.{BRONZE_SCHEMA}.time_travel_demo VALUES
    (1, 'Alice', 'active', current_timestamp()),
    (2, 'Bob', 'active', current_timestamp()),
    (3, 'Charlie', 'active', current_timestamp())
""")

print("Version 1: Initial data inserted")
display(spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.time_travel_demo"))

In [0]:
# Version 2: Update some records (CREATE was version 0, the first INSERT version 1)
spark.sql(f"""
UPDATE {CATALOG}.{BRONZE_SCHEMA}.time_travel_demo
SET status = 'premium', updated_at = current_timestamp()
WHERE name = 'Alice'
""")

print("Version 2: Alice upgraded to premium")
display(spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.time_travel_demo"))

In [0]:
# Version 3: Insert new record (version numbering starts at 0 with CREATE)
spark.sql(f"""
INSERT INTO {CATALOG}.{BRONZE_SCHEMA}.time_travel_demo VALUES
    (4, 'Diana', 'new', current_timestamp())
""")

print("Version 3: Diana added")
display(spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.time_travel_demo"))

In [0]:
# Version 4: Delete a record (version numbering starts at 0 with CREATE)
spark.sql(f"""
DELETE FROM {CATALOG}.{BRONZE_SCHEMA}.time_travel_demo
WHERE name = 'Charlie'
""")

print("Version 4: Charlie deleted")
display(spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.time_travel_demo"))

### Example: Table History Exploration

**Objective:** Use DESCRIBE HISTORY to analyze all operations on the table

In [0]:
# Show complete history of all operations
display(
    spark.sql(f"DESCRIBE HISTORY {CATALOG}.{BRONZE_SCHEMA}.time_travel_demo")
)

### Example: Time Travel Queries

**Objective:** Access previous versions of data using VERSION AS OF and TIMESTAMP AS OF

In [0]:
# Access the initial data: version 0 is the empty table from CREATE, version 1 is the first INSERT
print("Version 1 - Initial data (before any changes):")
display(
    spark.sql(f"SELECT * FROM {CATALOG}.{BRONZE_SCHEMA}.time_travel_demo VERSION AS OF 1")
)

In [0]:
# Access data after the Alice upgrade: the UPDATE is the third commit, so it is version 2
print("Version 2 - After Alice upgrade:")
display(
    spark.sql(f"SELECT * FROM {CATALOG}.{BRONZE_SCHEMA}.time_travel_demo VERSION AS OF 2")
)

In [0]:
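# Compare record counts between versions (0 up to the latest committed version)
latest_version = spark.sql(
    f"DESCRIBE HISTORY {CATALOG}.{BRONZE_SCHEMA}.time_travel_demo"
).agg(F.max("version")).first()[0]

version_counts = []
for v in range(latest_version + 1):
    count = spark.sql(f"SELECT COUNT(*) as cnt FROM {CATALOG}.{BRONZE_SCHEMA}.time_travel_demo VERSION AS OF {v}").first()[0]
    version_counts.append((f"Version {v}", count))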

display(spark.createDataFrame(version_counts, ["version", "record_count"]))
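The cells above only use `VERSION AS OF`. The timestamp form works the same way; the sketch below builds both query variants. `time_travel_query` is a hypothetical helper, and the table name and timestamp are placeholders: in practice, take a real value from the `timestamp` column of `DESCRIBE HISTORY`.

```python
from typing import Optional


def time_travel_query(table: str, version: Optional[int] = None,
                      timestamp: Optional[str] = None) -> str:
    """Build a SELECT that reads a table as of a version or a timestamp."""
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return f"SELECT * FROM {table} VERSION AS OF {version}"
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


table = "my_catalog.bronze.time_travel_demo"  # placeholder table name
print(time_travel_query(table, version=1))
print(time_travel_query(table, timestamp="2024-01-01 00:00:00"))  # placeholder timestamp
# In a notebook: display(spark.sql(time_travel_query(table, version=1)))
```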

### Example: Disaster Recovery - Accidental Deletion

**Objective:** Simulate accidental data deletion and recover using RESTORE

In [0]:
# DISASTER! Accidental deletion of ALL data
spark.sql(f"DELETE FROM {CATALOG}.{BRONZE_SCHEMA}.time_travel_demo")

print("Oh no! All data deleted!")
print("Record count after deletion:", spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.time_travel_demo").count())

In [0]:
# Check history to find the last good version
history = spark.sql(f"DESCRIBE HISTORY {CATALOG}.{BRONZE_SCHEMA}.time_travel_demo")
display(history.select("version", "timestamp", "operation"))

In [0]:
# RESTORE to the version just before the accidental deletion.
# The accidental DELETE is the most recent DELETE commit in the history,
# so the last good version is the one immediately before it.
history = spark.sql(f"DESCRIBE HISTORY {CATALOG}.{BRONZE_SCHEMA}.time_travel_demo")
bad_version = (history
    .filter(F.col("operation") == "DELETE")
    .agg(F.max("version"))
    .first()[0]
)
last_good_version = bad_version - 1

In [0]:
print(f"Restoring to version: {last_good_version}")
spark.sql(f"RESTORE TABLE {CATALOG}.{BRONZE_SCHEMA}.time_travel_demo TO VERSION AS OF {last_good_version}")

In [0]:
# Verify restoration
print("Data restored successfully!")
print("Record count after RESTORE:", spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.time_travel_demo").count())
display(spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.time_travel_demo"))

In [0]:
%sql

DESCRIBE HISTORY time_travel_demo

### Example: VACUUM and Its Impact on Time Travel

**Objective:** Understand how VACUUM affects Time Travel capabilities

**Critical Concept:** VACUUM permanently deletes data files that the current version of the table no longer references and that are older than the retention threshold. Once those files are gone, **Time Travel to versions that depended on them becomes impossible**.

**Default Retention:** 7 days (168 hours)
- With the default setting, versions whose data files were removed less than 7 days ago stay readable after VACUUM
- Older versions may still be listed in DESCRIBE HISTORY, but queries against them fail once their files are deleted
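The retention rule can be illustrated with a small plain-Python model. This is a simplified sketch, not the actual VACUUM algorithm: it assumes each unreferenced file has a known removal time, and `files_to_vacuum`, the file names, and the dates are all made up for illustration.

```python
from datetime import datetime, timedelta


def files_to_vacuum(removed_at: dict, now: datetime, retain_hours: int = 168) -> list:
    """Return unreferenced files whose removal is older than the retention window.

    removed_at maps a data file to the time it stopped being referenced by the
    table (its `remove` action in the Delta Log).
    """
    cutoff = now - timedelta(hours=retain_hours)
    return sorted(path for path, ts in removed_at.items() if ts < cutoff)


now = datetime(2024, 1, 10, 12, 0)
removed_at = {
    "part-0000.parquet": datetime(2024, 1, 1, 9, 0),  # removed about 9 days ago
    "part-0001.parquet": datetime(2024, 1, 9, 9, 0),  # removed yesterday
}

print(files_to_vacuum(removed_at, now))                  # default 7-day retention
print(files_to_vacuum(removed_at, now, retain_hours=0))  # 0 hours: every unreferenced file
```

With the default retention only the file removed nine days earlier is deleted, so versions that need the more recent file stay readable. With `retain_hours=0` both files go, which is why the demo below loses its older versions.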

In [0]:
spark.sql(f"""
INSERT INTO {CATALOG}.{BRONZE_SCHEMA}.time_travel_demo VALUES
    (4, 'Diana', 'new', current_timestamp()),
    (5, 'Eve', 'active', current_timestamp()),
    (6, 'Frank', 'active', current_timestamp())
""")

In [0]:
# Check current table size and files BEFORE VACUUM
before_vacuum = spark.sql(f"DESCRIBE DETAIL {CATALOG}.{BRONZE_SCHEMA}.time_travel_demo")
display(before_vacuum.select("numFiles", "sizeInBytes"))

In [0]:
# Let's see what versions are available BEFORE vacuum
print("Available versions before VACUUM:")
display(spark.sql(f"DESCRIBE HISTORY {CATALOG}.{BRONZE_SCHEMA}.time_travel_demo").select("version", "timestamp", "operation"))

In [0]:
# VACUUM with 0 hours retention (DEMO ONLY - requires disabling safety check)
# In production, NEVER use 0 hours - use default 7 days or more!
spark.sql("SET spark.databricks.delta.retentionDurationCheck.enabled = false")

vacuum_result = spark.sql(f"""
    VACUUM {CATALOG}.{BRONZE_SCHEMA}.time_travel_demo RETAIN 0 HOURS
""")

display(vacuum_result)

In [0]:
# Check table size AFTER VACUUM
after_vacuum = spark.sql(f"DESCRIBE DETAIL {CATALOG}.{BRONZE_SCHEMA}.time_travel_demo")
display(after_vacuum.select("numFiles", "sizeInBytes"))

In [0]:
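# Now try to read an old version whose files were superseded by later commits.
# This is expected to fail if VACUUM deleted those files.
try:
    spark.sql(f"SELECT * FROM {CATALOG}.{BRONZE_SCHEMA}.time_travel_demo VERSION AS OF 1").display()
except Exception as e:
    print(f"Time Travel failed after VACUUM:\n{str(e)[:300]}...")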

## Summary

In this notebook, we covered the **fundamentals of Delta Lake**:

| Feature | Purpose | Key Command |
|---------|---------|-------------|
| ACID Transactions | Data reliability | Automatic via Delta format |
| Schema Enforcement | Prevent bad data | Automatic on write |
| Schema Evolution | Controlled changes | `mergeSchema`, `overwriteSchema` |
| DML Operations | Insert/Update/Delete/Merge | `INSERT INTO`, `UPDATE`, `DELETE`, `MERGE INTO` |
| Time Travel | Historical access | `VERSION AS OF`, `TIMESTAMP AS OF` |
| RESTORE | Disaster recovery | `RESTORE TABLE ... TO VERSION AS OF` |
| VACUUM | Storage cleanup | `VACUUM table RETAIN x HOURS` |

**Next:** M04 covers Delta optimization techniques (OPTIMIZE, Z-ORDER, Liquid Clustering, Change Data Feed).

## Resource Cleanup

Clean up resources created during the notebook:

In [0]:
# Optional test resource cleanup
# NOTE: Run only if you want to delete all created data

cleanup_tables = [
    "customers_delta",
    "orders_modern",
    "time_travel_demo"
]

# Uncomment below to execute cleanup:
# for table in cleanup_tables:
#     spark.sql(f"DROP TABLE IF EXISTS {CATALOG}.{BRONZE_SCHEMA}.{table}")
#     print(f"Dropped: {table}")