# Delta Lake Operations

## The Story: From "Data Swamp" to Reliable Data

Your e-commerce company's previous data platform was a **Data Lake disaster**:

- Marketing updated customer segments, breaking downstream reports
- A failed job left half-written files - data corrupted for 2 days
- Finance asked for "data from last month" - no way to get it
- Data Scientists couldn't trust the data - "is this the latest version?"

**Delta Lake solves these problems.** It is the core technology of the modern Lakehouse.

---

## Why Delta Lake Matters (The Business Case)

### Problems Delta Lake Solves

| Problem | Before Delta | With Delta |
|---------|--------------|------------|
| Concurrent writes | Data corruption | ACID transactions |
| Schema changes | Silent data quality issues | Schema enforcement |
| Data recovery | "We don't have backups" | Time Travel (within the retention window) |
| Incremental updates | Full reloads only | MERGE, UPDATE, DELETE |
| Query performance | Full scans always | Z-ORDER, data skipping |

### When NOT to Use Delta

| Scenario | Better Alternative | Why |
|----------|-------------------|-----|
| Streaming with <1s latency | Kafka, Redis | Delta is optimized for seconds, not milliseconds |
| Simple file storage | Parquet, CSV | No need for transactional overhead |
| Cross-platform sharing | Iceberg | Better Snowflake/BigQuery interop |
| Append-only logs | Plain Parquet | Simpler, cheaper |

### Delta vs Iceberg vs Hudi

| Feature | Delta | Iceberg | Hudi |
|---------|-------|---------|------|
| **Primary backer** | Databricks | Netflix/Apple/Snowflake | Uber |
| **Best for** | Databricks ecosystem | Multi-engine (Spark, Trino, Flink) | Streaming upserts |
| **Governance** | Unity Catalog | Open catalog | Limited |
| **Adoption** | Highest in Databricks shops | Growing fast | Niche |

*Recommendation: If you're on Databricks, use Delta. If multi-cloud/multi-engine, consider Iceberg.*

---

## What You'll Learn

| Topic | Why It Matters |
|-------|---------------|
| Delta Log internals | Debug issues, understand versioning |
| MERGE operations | Implement SCD Type 1/2 |
| Time Travel | Disaster recovery, audit compliance |
| Optimization | Faster queries through file compaction and data skipping |
| Change Data Feed | Incremental downstream processing |

---

## Context and Requirements

- **Training Day**: Day 2 - Lakehouse & Delta Lake
- **Notebook Type**: Demo
- **Technical Requirements**:
  - Databricks Runtime 16.4 LTS or 17.3 LTS (Spark 4.0)
  - Unity Catalog enabled
  - Permissions: CREATE TABLE, CREATE SCHEMA, SELECT, MODIFY
  - Cluster: Standard with 2-4 workers (or Serverless Compute)

## Theoretical Introduction

**Section Objective:** Introduction to Delta Lake as a transactional storage layer over a Data Lake

**Key Concepts:**
- **Delta Lake**: Open-source storage layer providing ACID transactions for Apache Spark
- **Delta Log**: Transactional log storing metadata about all table changes
- **Schema Enforcement**: Automatic schema validation on write
- **Time Travel**: Ability to access previous versions of data

**Why is this important?**
Delta Lake solves fundamental Data Lake problems: lack of transactions, schema drift, update difficulties, and quality assurance. It provides Data Warehouse reliability with Data Lake flexibility.

## Per-user Isolation

Run the initialization script for per-user catalog and schema isolation:

In [0]:
%run ../00_setup

## Configuration

Import libraries and set environment variables:

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.types import *
from datetime import datetime, timedelta

# Display user context
display(
    spark.createDataFrame([
        (CATALOG, BRONZE_SCHEMA, SILVER_SCHEMA, GOLD_SCHEMA)
    ], ['catalog', 'bronze_schema', 'silver_schema', 'gold_schema'])
)

# Set catalog and schema as default
spark.sql(f"USE CATALOG {CATALOG}")
spark.sql(f"USE SCHEMA {BRONZE_SCHEMA}")

## Section 1: Delta Lake Core Features

**Theoretical Introduction:**

Delta Lake is a transactional layer over Parquet that provides ACID properties (Atomicity, Consistency, Isolation, Durability). Every operation on a Delta table is recorded in the Delta Log: an ordered series of JSON commit files (plus periodic Parquet checkpoints) that describe each change to the table.

**Key Concepts:**
- **ACID Transactions**: All operations are atomic and consistent
- **Delta Log**: `_delta_log/` folder with JSON files describing each transaction
- **Schema Enforcement**: Automatic schema validation
- **Unified Batch and Streaming**: One table supports both batch and streaming

**Practical Application:**
- Transactional updates in Data Lake
- Ensuring data quality through schema validation
- Unified data access for batch and streaming workloads

### Example 1.1: Creating the First Delta Table

**Objective:** Demonstration of creating a Delta table and basic properties

**Approach:**
1. Load data from Unity Catalog Volume
2. Create a managed table in Delta format
3. Explore Delta Log and metadata

In [0]:
# Load customer data from Unity Catalog Volume
customers_df = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(f"{DATASET_BASE_PATH}/customers/customers.csv")
)

**Create managed Delta table:**

In [0]:
# Create managed Delta table
customers_df.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta")

**Display result:**

In [0]:
display(spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta").limit(5))

**Explanation:**

A managed Delta table was created in Unity Catalog. The Delta format automatically:
- Created `_delta_log/` folder with transaction metadata
- Registered table schema in Unity Catalog
- Stored the data as compressed Parquet files, tracked by the Delta transaction log

### Example 1.2: Schema Enforcement in Action

**Objective:** Demonstration of automatic schema validation during data insertion

In [0]:
# Check current table schema
spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta").printSchema()

In [0]:
# Attempt to insert data with an invalid schema (extra column `loyalty_points` that the table does not have)
invalid_data = spark.createDataFrame([
    ("CUST999999", "Test", "Customer", "invalid_email", "+48 123 456 789", 100)
], ["customer_id", "first_name", "last_name", "email", "phone", "loyalty_points"])

try:
    invalid_data.write \
        .format("delta") \
        .mode("append") \
        .saveAsTable(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta")
except Exception as e:
    display(
        spark.createDataFrame([
            ("Schema enforcement in action", str(e)[:200] + "...")
        ], ["message", "error"])
    )

**Explanation:**

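Schema enforcement automatically rejected data whose schema is incompatible with the table. Delta Lake compares the schema of the incoming data with the table schema and blocks writes that do not match (for example, an unknown column or a mismatched data type), which keeps the table consistent.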
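The comparison can be pictured with a small plain-Python model. This is a simplified sketch, not Delta Lake's implementation: the helper `check_schema` and the column names are hypothetical, schemas are reduced to `name -> type` dictionaries, and nullability, nested types, and implicit casts are ignored.

```python
def check_schema(table_schema: dict, incoming_schema: dict) -> list:
    """Return a list of violations; an empty list means the write would be accepted."""
    violations = []
    for name, dtype in incoming_schema.items():
        if name not in table_schema:
            # Incoming data has a column the table does not know about
            violations.append(f"unknown column: {name}")
        elif table_schema[name] != dtype:
            # Column exists, but the declared types differ
            violations.append(
                f"type mismatch for {name}: table has {table_schema[name]}, data has {dtype}"
            )
    return violations


# Hypothetical table schema used only for this illustration
table_schema = {"customer_id": "string", "email": "string", "registration_date": "date"}

ok = check_schema(table_schema, {"customer_id": "string", "email": "string"})
extra = check_schema(table_schema, {"customer_id": "string", "loyalty_points": "int"})
wrong_type = check_schema(table_schema, {"customer_id": "string", "registration_date": "int"})

print(ok)
print(extra)
print(wrong_type)
```

In this model a write that carries only a subset of the table's columns passes; the unknown column and the mismatched type are what get reported.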

In [0]:
# Create table with Identity Column and Generated Column
spark.sql(f"""
CREATE TABLE IF NOT EXISTS {CATALOG}.{BRONZE_SCHEMA}.orders_modern (
    order_sk BIGINT GENERATED ALWAYS AS IDENTITY,  -- Surrogate Key
    order_id STRING,
    total_amount DOUBLE,
    order_timestamp TIMESTAMP,
    order_date DATE GENERATED ALWAYS AS (CAST(order_timestamp AS DATE)) -- Auto-calculated
) USING DELTA
""")

Now we will insert data. Note that in the `INSERT` query we omit the `order_sk` and `order_date` columns.
- `order_sk`: will be generated automatically (unique number).
- `order_date`: will be calculated based on `order_timestamp`.

In [0]:
# Insert data without specifying generated columns
spark.sql(f"""
INSERT INTO {CATALOG}.{BRONZE_SCHEMA}.orders_modern (order_id, total_amount, order_timestamp)
VALUES 
    ('ORD-001', 150.50, current_timestamp()),
    ('ORD-002', 200.00, current_timestamp())
""")

Let's check the result. Note the automatically populated columns.

In [0]:
display(spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.orders_modern"))

## Section 2: Schema Evolution

**Theoretical Introduction:**

Schema Evolution allows for controlled addition of new columns to existing Delta tables without interrupting application operations. Additive changes, such as a new column, are applied automatically once schema evolution is enabled for the write (for example with the `mergeSchema` option).

### Example 2.1: Automatic Column Addition

**Objective:** Demonstration of automatic schema evolution when adding new columns

In [0]:
# Data with additional column (customer_tier)
extended_customers = spark.createDataFrame([
    ("CUST010001", "New", "Customer", "new@example.com", "+48 111 222 333", "Warsaw", "MZ", "Poland", "2023-12-01", "Basic", "Premium"),
    ("CUST010002", "Another", "Customer", "another@example.com", "+48 444 555 666", "Krakow", "MP", "Poland", "2023-12-02", "Premium", "Standard")
], ["customer_id", "first_name", "last_name", "email", "phone", "city", "state", "country", "registration_date", "customer_segment", "customer_tier"])

**Enable automatic schema evolution:**

In [0]:
%skip
from pyspark.sql.functions import col
from pyspark.sql.types import DateType

extended_customers = extended_customers.withColumn("registration_date", col("registration_date").cast(DateType()))

In [0]:
# Enable automatic schema evolution
extended_customers.write \
    .format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .saveAsTable(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta")

In [0]:
# Check new schema
spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta").printSchema()

In [0]:
# Verify data - new column has NULL for old records
display(
    spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta")
    .select("customer_id", "first_name", "last_name", "customer_tier")
)

## Section 2.5: Data Quality & Constraints

**Theoretical Introduction:**

Delta Lake allows defining **Constraints** that guarantee data quality at the table level. This works similarly to traditional SQL databases.

**Constraint Types:**
- `NOT NULL`: Enforces the presence of a value.
- `CHECK`: Enforces any logical condition (e.g., `age > 0`).

In [0]:
# Add CHECK constraint: customer_id must start with CUST
try:
    spark.sql(f"""
        ALTER TABLE {CATALOG}.{BRONZE_SCHEMA}.customers_delta
        ADD CONSTRAINT valid_customer_id CHECK (customer_id LIKE 'CUST%')
    """)
    print("Constraint 'valid_customer_id' added successfully.")
except Exception as e:
    print(f"Info: {e}")

The constraint has been added. Now let's try to insert data that violates it (customer_id does not start with 'CUST').
We expect Delta Lake to block this operation and report a CHECK constraint violation.

In [0]:
# Attempt to insert invalid data (customer_id does not start with CUST)
try:
    spark.sql(f"""
        INSERT INTO {CATALOG}.{BRONZE_SCHEMA}.customers_delta (customer_id, first_name, last_name, email, phone, city, state, country, registration_date, customer_segment)
        VALUES ('INVALID123', 'Bad', 'Customer', 'bad@example.com', '+48 000 000 000', 'Test', 'TS', 'Poland', '2023-01-01', 'Basic')
    """)
except Exception as e:
    print(f"Expected Data Quality error:\n{str(e)[:300]}...")

## Section 3: Time Travel and Disaster Recovery

**Theoretical Introduction:**

Time Travel is a key Delta Lake feature enabling access to previous versions of data. It is based on the Copy-on-Write mechanism - every change creates a new version of files, while old versions remain available.

**Disaster Recovery:**
Thanks to Time Travel, we can not only read old data but also **restore** the table to a previous state using the `RESTORE` command. This is crucial in case of accidental data deletion or incorrect updates.
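How a past version is reconstructed can be sketched in plain Python. The model below is a simplification under stated assumptions: each commit is reduced to a list of `add`/`remove` actions, the file names are made up, and the helper `files_at_version` is hypothetical rather than part of any Delta Lake API.

```python
def files_at_version(commits, version):
    """Replay commits 0..version and return the set of data files that are live."""
    live = set()
    for commit_version, actions in enumerate(commits):
        if commit_version > version:
            break
        for action, path in actions:
            if action == "add":
                live.add(path)
            elif action == "remove":
                # Logical removal only: the file stays on storage until VACUUM
                live.discard(path)
    return live


# Hypothetical commit history (file names are illustrative)
commits = [
    [("add", "part-000.parquet")],                                  # v0: initial write
    [("add", "part-001.parquet")],                                  # v1: append
    [("remove", "part-000.parquet"), ("add", "part-002.parquet")],  # v2: update rewrites a file
]

v0_files = files_at_version(commits, 0)
v1_files = files_at_version(commits, 1)
v2_files = files_at_version(commits, 2)
print(sorted(v0_files), sorted(v1_files), sorted(v2_files))
```

Reading version 0 only needs `part-000.parquet` to still exist on storage, which is why old versions stop being readable once their files are physically deleted.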

### Example 3.1: Table History Exploration

**Objective:** Use DESCRIBE HISTORY to analyze all operations on the table

In [0]:
# Show history of all operations on the table
display(
    spark.sql(f"DESCRIBE HISTORY {CATALOG}.{BRONZE_SCHEMA}.customers_delta")
)

### Example 3.2: Time Travel queries

**Objective:** Access previous versions of data using VERSION AS OF and TIMESTAMP AS OF

In [0]:
# Access data from version 0 (before schema evolution)
version_0_data = spark.sql(f"""
    SELECT *
    FROM {CATALOG}.{BRONZE_SCHEMA}.customers_delta VERSION AS OF 0
    ORDER BY customer_id
""")

In [0]:
display(version_0_data)

In [0]:
# Compare record count between versions
current_count = spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta").count()
version_0_count = spark.sql(f"SELECT * FROM {CATALOG}.{BRONZE_SCHEMA}.customers_delta VERSION AS OF 0").count()

display(
    spark.createDataFrame([
        ("Current version", current_count),
        ("Version 0", version_0_count)
    ], ["version", "record_count"])
)

### Example 3.3: Disaster Recovery - RESTORE TABLE

**Objective:** Restore the table to the state before the erroneous operation (failure simulation).

In [0]:
# 1. Error simulation: Accidental deletion of all data
spark.sql(f"DELETE FROM {CATALOG}.{BRONZE_SCHEMA}.customers_delta")

Oops! We deleted all data. Let's check if the table is actually empty.

In [0]:
print("Record count after failure:", spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta").count())

Now we will use **Time Travel** to find the last correct version (before deletion) and restore the table using the `RESTORE` command.

In [0]:
# 2. Fix: RESTORE to version before deletion
# Get the last good version (before DELETE)
last_good_version = spark.sql(f"DESCRIBE HISTORY {CATALOG}.{BRONZE_SCHEMA}.customers_delta").select("version").limit(2).collect()[1][0]

print(f"Restoring to version: {last_good_version}")

spark.sql(f"RESTORE TABLE {CATALOG}.{BRONZE_SCHEMA}.customers_delta TO VERSION AS OF {last_good_version}")

The table has been restored. Let's verify the record count.

In [0]:
print("Record count after RESTORE:", spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta").count())

## Section 4: CRUD Operations

**Theoretical Introduction:**

Delta Lake supports the full range of CRUD operations (Create, Read, Update, Delete), making it ideal for transactional workloads in Data Lake. All operations are atomic and ACID-compliant.

### Example 4.1: INSERT operation

**Objective:** Adding new records to an existing table

In [0]:
# INSERT new customers
spark.sql(f"""
    INSERT INTO {CATALOG}.{BRONZE_SCHEMA}.customers_delta
    (customer_id, first_name, last_name, email, phone, city, state, country, registration_date, customer_segment, customer_tier)
    VALUES 
        ('CUST020001', 'Insert', 'Customer1', 'insert1@example.com', '+48 111 111 111', 'Warsaw', 'MZ', 'Poland', '2023-12-10', 'Premium', 'Gold'),
        ('CUST020002', 'Insert', 'Customer2', 'insert2@example.com', '+48 222 222 222', 'Gdansk', 'PM', 'Poland', '2023-12-11', 'Basic', 'Silver')
""")

**Verify insertion:**

In [0]:
# Verify insertion
display(
    spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta")
    .filter(F.col("customer_id").like("CUST02%"))
    .orderBy("customer_id")
)

### Example 4.2: UPDATE operation

**Objective:** Updating existing records in the table

In [0]:
# UPDATE customer tier for specific customers
spark.sql(f"""
    UPDATE {CATALOG}.{BRONZE_SCHEMA}.customers_delta
    SET customer_tier = 'Platinum'
    WHERE customer_id IN ('CUST010001', 'CUST020001')
""")

**Verify update:**

In [0]:
# Verify update
display(
    spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta")
    .filter(F.col("customer_tier") == "Platinum")
)

### Example 4.3: DELETE operation

**Objective:** Deleting records from a Delta table

In [0]:
# DELETE specific customer
spark.sql(f"""
    DELETE FROM {CATALOG}.{BRONZE_SCHEMA}.customers_delta
    WHERE customer_id = 'CUST020002'
""")

**Verify deletion:**

In [0]:
# Verify deletion
deleted_check = spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta") \
    .filter(F.col("customer_id") == "CUST020002") \
    .count()

display(
    spark.createDataFrame([
        ("Records with customer_id CUST020002", deleted_check)
    ], ["description", "count"])
)

## Section 5: MERGE INTO Operations

**Theoretical Introduction:**

MERGE INTO is a powerful operation enabling upsert (update + insert) in a single transaction. It is especially useful when processing changes from transactional systems (CDC - Change Data Capture).
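The upsert logic itself is easy to see on plain Python dictionaries. This is only a conceptual sketch: the helper `merge_upsert` is hypothetical, rows are dictionaries, and none of the transactional behaviour of a real MERGE is modelled.

```python
def merge_upsert(target, source, key):
    """Update rows whose key matches, insert rows whose key is new."""
    by_key = {row[key]: dict(row) for row in target}
    for row in source:
        if row[key] in by_key:
            by_key[row[key]].update(row)   # WHEN MATCHED THEN UPDATE
        else:
            by_key[row[key]] = dict(row)   # WHEN NOT MATCHED THEN INSERT
    return list(by_key.values())


# Illustrative rows only
target = [
    {"customer_id": "CUST010001", "city": "Warsaw", "customer_tier": "Premium"},
    {"customer_id": "CUST010002", "city": "Krakow", "customer_tier": "Standard"},
]
source = [
    {"customer_id": "CUST010001", "city": "Poznan", "customer_tier": "Diamond"},  # update
    {"customer_id": "CUST030001", "city": "Wroclaw", "customer_tier": "Bronze"},  # insert
]

merged = merge_upsert(target, source, "customer_id")
for row in merged:
    print(row)
```

Unlike this sketch, a real MERGE fails when several source rows try to update the same target row, so the source should be deduplicated on the join key first.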

### Example 5.1: Basic MERGE INTO

**Objective:** Demonstration of upsert operation - update existing and insert new records

In [0]:
# Prepare data for merge (mix of updates and new records)
merge_data = spark.createDataFrame([
    ("CUST010001", "Updated", "Name", "updated@example.com", "+48 999 999 999", "Poznan", "WP", "Poland", "2023-12-01", "VIP", "Diamond"),  # Update
    ("CUST030001", "Brand", "New", "brand.new@example.com", "+48 777 777 777", "Wroclaw", "DS", "Poland", "2023-12-15", "Basic", "Bronze"),   # Insert
    ("CUST030002", "Another", "New", "another.new@example.com", "+48 888 888 888", "Lodz", "LD", "Poland", "2023-12-16", "Premium", "Silver") # Insert
], ["customer_id", "first_name", "last_name", "email", "phone", "city", "state", "country", "registration_date", "customer_segment", "customer_tier"])

**Create temporary view for merge operation:**

In [0]:
# Create temporary view for merge operation
merge_data.createOrReplaceTempView("customer_updates")

**Execute MERGE operation (Upsert):**

In [0]:
# MERGE INTO operation
spark.sql(f"""
    MERGE INTO {CATALOG}.{BRONZE_SCHEMA}.customers_delta AS target
    USING customer_updates AS source
    ON target.customer_id = source.customer_id
    
    WHEN MATCHED THEN
        UPDATE SET
            first_name = source.first_name,
            last_name = source.last_name,
            email = source.email,
            phone = source.phone,
            city = source.city,
            state = source.state,
            country = source.country,
            customer_segment = source.customer_segment,
            customer_tier = source.customer_tier
    
    WHEN NOT MATCHED THEN
        INSERT (customer_id, first_name, last_name, email, phone, city, state, country, registration_date, customer_segment, customer_tier)
        VALUES (source.customer_id, source.first_name, source.last_name, source.email, source.phone, source.city, source.state, source.country, source.registration_date, source.customer_segment, source.customer_tier)
""")

**Verify MERGE results:**

In [0]:
# Verify MERGE results
display(
    spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta")
    .filter(F.col("customer_id").isin(["CUST010001", "CUST030001", "CUST030002"]))
    .orderBy("customer_id")
)

## Section 6: Metadata and Analytics

**Theoretical Introduction:**

Delta Lake offers rich metadata about tables and operations. DESCRIBE DETAIL provides information about file structure, partitioning, and table properties.

### Example 6.1: DESCRIBE DETAIL

**Objective:** Analysis of Delta table metadata

In [0]:
# Detailed table information
display(
    spark.sql(f"DESCRIBE DETAIL {CATALOG}.{BRONZE_SCHEMA}.customers_delta")
)

### Example 6.2: Operation History Analysis

**Objective:** Deeper analysis of history and operation metrics

In [0]:
# History with additional metrics
history_df = spark.sql(f"DESCRIBE HISTORY {CATALOG}.{BRONZE_SCHEMA}.customers_delta")

display(
    history_df.select(
        "version", 
        "timestamp", 
        "operation", 
        "operationMetrics.numTargetRowsInserted",
        "operationMetrics.numTargetRowsUpdated",
        "operationMetrics.numTargetRowsDeleted"
    )
)

### Example 6.3: Delta Log Internals (Deep Dive)

**Objective:** Understanding how Delta Lake ensures ACID by looking "under the hood" at JSON files in `_delta_log`.

**Get table path and _delta_log:**

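In [0]:
# Get the table's storage location (DESCRIBE DETAIL returns a one-row DataFrame)
detail_df = spark.sql(f"DESCRIBE DETAIL {CATALOG}.{BRONZE_SCHEMA}.customers_delta")
table_path = detail_df.select("location").first()[0]
delta_log_path = f"{table_path}/_delta_log"

In [0]:
print(table_path, delta_log_path, sep="\n")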

**Display files in _delta_log:**

Below is a preview of the content of the last transaction JSON file.
It contains metadata about operations, such as:
- `add`: adding a new Parquet file with data.
- `remove`: logical deletion of a file (e.g., during DELETE or OPTIMIZE operation).
- `commitInfo`: metadata about the transaction itself (who, when, what operation).

In [0]:
# View Delta table history using SQL
display(spark.sql(f"DESCRIBE HISTORY {CATALOG}.{BRONZE_SCHEMA}.customers_delta"))

**Analysis of the last transaction file (JSON):**
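A commit file is newline-delimited JSON with one action per line. The sketch below parses a small hand-written sample and counts the action types. The sample content is hypothetical (it is not output from this table), and the helper `summarize_commit` is introduced only for this illustration.

```python
import json
from collections import Counter

# Hand-written sample of a commit file; values are illustrative only
sample_commit = "\n".join([
    json.dumps({"commitInfo": {"operation": "DELETE", "timestamp": 1700000000000}}),
    json.dumps({"remove": {"path": "part-000.parquet", "dataChange": True}}),
    json.dumps({"add": {"path": "part-003.parquet", "size": 1024, "dataChange": True}}),
])


def summarize_commit(text):
    """Count how many actions of each type (add, remove, commitInfo, ...) a commit contains."""
    counts = Counter()
    for line in text.splitlines():
        if line.strip():
            action = json.loads(line)
            # Each line is a JSON object with a single top-level key naming the action
            counts.update(action.keys())
    return dict(counts)


summary = summarize_commit(sample_commit)
print(summary)
```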

## Section 7: Optimization (Intro)

**Theoretical Introduction:**

Delta Lake offers a range of optimization mechanisms. In this notebook, we will focus on the basic **OPTIMIZE** operation (file compaction) and **VACUUM** (cleaning).

> **Deep Dive:** Advanced techniques such as **ZORDER BY**, **Partitioning**, and **Liquid Clustering** are discussed in detail in notebook **05_optimization_best_practices.ipynb**.

### Example 7.1: OPTIMIZE (File Compaction)

**Objective:** Compacting small files (small files problem) into larger ones, which improves read performance.
We can optionally add the `ZORDER BY` clause (discussed in notebook 05) to additionally sort the data.

**Execute file compaction:**

In [0]:
# OPTIMIZE table
optimize_result = spark.sql(f"""
    OPTIMIZE {CATALOG}.{BRONZE_SCHEMA}.customers_delta
""")

display(optimize_result)

### Example 7.2: VACUUM operation

**Objective:** Removing old files (older than retention period) that are no longer needed for Time Travel.

**Disable retention check (demo only):**

In [0]:
# VACUUM - remove files older than 0 hours (demo only!)
# In production: keep the default retention of 7 days; shorter values require disabling this safety check
# Note: after RETAIN 0 HOURS, Time Travel to older versions no longer works
spark.sql("SET spark.databricks.delta.retentionDurationCheck.enabled = false")

**Run VACUUM (remove old files):**

In [0]:
vacuum_result = spark.sql(f"""
    VACUUM {CATALOG}.{BRONZE_SCHEMA}.customers_delta RETAIN 0 HOURS
""")

display(vacuum_result)

### Example 7.3: Liquid Clustering (Mention)

**Modern Alternative:**
Databricks introduced **Liquid Clustering** - a new technique that replaces traditional partitioning and ZORDER.
Liquid Clustering manages data layout around the clustering keys you choose, and those keys can be changed later without rewriting existing data.

> **Deep Dive:** Detailed discussion and examples of Liquid Clustering can be found in notebook **05_optimization_best_practices.ipynb**.

## Section 8: Change Data Feed

**Theoretical Introduction:**

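Change Data Feed (CDF) is a Delta Lake feature that tracks row-level changes in a table. Once it is enabled, every INSERT, UPDATE, and DELETE is recorded together with metadata columns such as `_change_type`, `_commit_version`, and `_commit_timestamp`. Changes made before CDF was enabled are not recorded.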
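The shape of the change records can be illustrated in plain Python. This is a conceptual sketch only: the hypothetical helper `change_feed` derives changes by comparing two snapshots, whereas Delta Lake records them as part of each write. The `_change_type` values used (`insert`, `update_preimage`, `update_postimage`, `delete`) are the ones CDF exposes.

```python
def change_feed(before, after, key, commit_version):
    """Derive CDF-style change rows by comparing two snapshots of a table."""
    old = {row[key]: row for row in before}
    new = {row[key]: row for row in after}
    changes = []
    for k, row in new.items():
        if k not in old:
            changes.append({**row, "_change_type": "insert", "_commit_version": commit_version})
        elif old[k] != row:
            # An update produces two rows: the value before and the value after
            changes.append({**old[k], "_change_type": "update_preimage", "_commit_version": commit_version})
            changes.append({**row, "_change_type": "update_postimage", "_commit_version": commit_version})
    for k, row in old.items():
        if k not in new:
            changes.append({**row, "_change_type": "delete", "_commit_version": commit_version})
    return changes


# Illustrative snapshots; the version number is arbitrary
before = [{"customer_id": "CUST040001", "customer_tier": "Bronze"}]
after = [
    {"customer_id": "CUST040001", "customer_tier": "Gold"},
    {"customer_id": "CUST040002", "customer_tier": "Silver"},
]

changes = change_feed(before, after, "customer_id", commit_version=12)
change_types = [c["_change_type"] for c in changes]
print(change_types)
```

Downstream jobs typically keep only `insert` and `update_postimage` rows when they need the latest state of each changed record.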

### Example 8.1: Enabling Change Data Feed

**Objective:** Activating CDF for an existing table

**Enable Change Data Feed:**

In [0]:
# Enable Change Data Feed
spark.sql(f"""
    ALTER TABLE {CATALOG}.{BRONZE_SCHEMA}.customers_delta 
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

### Example 8.2: Generating changes for CDF

**Objective:** Executing operations that will be tracked by CDF

**Insert new record (INSERT):**

In [0]:
# Make more changes after enabling CDF
spark.sql(f"""
    INSERT INTO {CATALOG}.{BRONZE_SCHEMA}.customers_delta
    (customer_id, first_name, last_name, email, phone, city, state, country, registration_date, customer_segment, customer_tier)
    VALUES ('CUST040001', 'CDF', 'TestCustomer', 'cdf@example.com', '+48 555 555 555', 'Szczecin', 'ZP', 'Poland', '2023-12-20', 'Basic', 'Bronze')
""")

**Update record (UPDATE):**

In [0]:
spark.sql(f"""
    UPDATE {CATALOG}.{BRONZE_SCHEMA}.customers_delta
    SET customer_tier = 'Gold'
    WHERE customer_id = 'CUST040001'
""")

### Example 8.3: Reading Change Data Feed

**Objective:** Analysis of all changes recorded by CDF

In [0]:
# Check CDF properties
table_properties = spark.sql(f"SHOW TBLPROPERTIES {CATALOG}.{BRONZE_SCHEMA}.customers_delta")
# The property must exist AND be set to "true"
cdf_enabled = table_properties.filter(
    (F.col("key") == "delta.enableChangeDataFeed") & (F.col("value") == "true")
).count() > 0

display(
    spark.createDataFrame([
        ("Change Data Feed enabled", cdf_enabled)
    ], ["property", "status"])
)

**Read changes (CDF):**

In [0]:
# Batch read change data feed from specific version
# First let's check current version
current_version = spark.sql(f"DESCRIBE HISTORY {CATALOG}.{BRONZE_SCHEMA}.customers_delta").select("version").first()[0]
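# Start at the INSERT made right after CDF was enabled; earlier versions have no recorded change data
cdf_start_version = max(0, current_version - 1)  # Assumes the INSERT and UPDATE above are the two latest commits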

display(current_version)

In [0]:
changes_batch = spark.read \
    .format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", cdf_start_version) \
    .table(f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta")

    

In [0]:
display(changes_batch)

In [0]:
display(
    changes_batch.select(
        "customer_id", "first_name", "last_name", "customer_tier", 
        "_change_type", "_commit_version", "_commit_timestamp"
    )
)

## Validation and Verification

### Checklist - What you should achieve:
- [ ] Delta table created with automatic schema enforcement
- [ ] Schema evolution - customer_tier column added
- [ ] Time Travel queries work for previous versions
- [ ] CRUD operations (INSERT, UPDATE, DELETE) executed correctly
- [ ] MERGE INTO implemented with upsert logic
- [ ] OPTIMIZE and VACUUM executed
- [ ] Change Data Feed enabled and recording changes

### Verification commands:
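The checklist items can be re-checked with a few SQL statements. As a minimal sketch, the hypothetical helper `verification_queries` below only builds the statements as strings. The table name is a placeholder, and in the notebook each statement would be passed to `spark.sql(...)` and shown with `display`.

```python
def verification_queries(table):
    """Build the SQL statements used to inspect a Delta table."""
    return {
        "history": f"DESCRIBE HISTORY {table}",        # operations and versions
        "detail": f"DESCRIBE DETAIL {table}",          # location, file count, size
        "properties": f"SHOW TBLPROPERTIES {table}",   # e.g. delta.enableChangeDataFeed
    }


# Placeholder name; in the notebook use f"{CATALOG}.{BRONZE_SCHEMA}.customers_delta"
queries = verification_queries("my_catalog.bronze.customers_delta")
for name, sql in queries.items():
    print(f"{name}: {sql}")
```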

## Best Practices

### Performance:
- Use ZORDER BY for columns frequently appearing in WHERE clauses
- Regularly run OPTIMIZE for small files compaction
- Partitioning only for very large tables (TB+) with skewed data

### Code Quality:
- Always use explicit schema instead of inferSchema in production
- Implement schema evolution strategy for backward compatibility
- Use MERGE INTO instead of separate DELETE + INSERT operations

### Data Quality:
- Enable Change Data Feed for audit trails and compliance
- Use Time Travel and `RESTORE` for short-term recovery; they do not replace backups, because `VACUUM` removes old data files
- Implement data validation rules in Delta constraints

### Governance:
- Set appropriate retention periods for compliance requirements
- Use Unity Catalog permissions, row filters, and column masks for row/column level security
- Document schema changes and business logic in table comments

## Troubleshooting

### Problem 1: Schema enforcement error
**Symptoms:**
- AnalysisException during INSERT/MERGE with incompatible schema
- "Cannot write incompatible datatype" message

**Solution:**
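```python
# Use mergeSchema option for schema evolution
# ("catalog.schema.table_name" is a placeholder for your target table)
df.write.format("delta").option("mergeSchema", "true").mode("append").saveAsTable("catalog.schema.table_name")
```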

### Problem 2: Time Travel - version not found
**Symptoms:**
File not found for specific version after VACUUM

**Solution:**
Check retention period and available versions via DESCRIBE HISTORY

### Problem 3: VACUUM removes too many files
**Symptoms:** Time Travel queries fail after VACUUM

**Solution:**
Keep the retention period at or above the 7-day default, and make sure it covers the Time Travel window you need
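The effect of the retention period can be shown with a small plain-Python model. It is a simplification: the hypothetical helper `vacuum_candidates` only looks at files already removed from the table and at the time of their removal, and the file names and dates are made up.

```python
from datetime import datetime, timedelta


def vacuum_candidates(removed_files, now, retention_hours=168):
    """Return files whose removal is older than the retention period (168 hours = 7 days)."""
    cutoff = now - timedelta(hours=retention_hours)
    return [path for path, removed_at in removed_files if removed_at < cutoff]


# Illustrative data: (file, time it was logically removed from the table)
now = datetime(2024, 1, 10, 12, 0)
removed_files = [
    ("part-000.parquet", datetime(2024, 1, 1)),  # removed 9.5 days ago
    ("part-001.parquet", datetime(2024, 1, 9)),  # removed 1.5 days ago
]

default_run = vacuum_candidates(removed_files, now)
zero_retention_run = vacuum_candidates(removed_files, now, retention_hours=0)
print(default_run)
print(zero_retention_run)
```

With zero retention every removed file is deleted, so no earlier version that depended on those files can be read afterwards.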

### Debugging tips:
- Use `DESCRIBE HISTORY` to understand table operations
- Check `DESCRIBE DETAIL` for file metadata
- Verify table properties via `SHOW TBLPROPERTIES`
- Monitor `_delta_log/` folder for troubleshooting

## Summary

### What has been achieved:
- Demonstration of Delta Lake ACID properties and schema enforcement
- Hands-on Schema Evolution with automatic column addition
- Time Travel queries for historical data access
- Complete CRUD operations (CREATE, READ, UPDATE, DELETE)
- Advanced MERGE INTO for upsert scenarios
- Table maintenance with OPTIMIZE and VACUUM (ZORDER is covered in notebook 05)
- Change Data Feed for comprehensive audit trails

### Key Takeaways:
1. **Delta Lake = Data Lake + ACID**: Combines Data Lake flexibility with transactional reliability
2. **Schema Evolution safely**: Additive changes are automatic, breaking changes require planning
3. **Time Travel + Copy-on-Write**: Previous versions remain available until VACUUM removes their files, enabling rollback and audit

### Quick Reference - Key Commands:

| Operation | PySpark | SQL |
|----------|---------|-----|
| Create Delta Table | `df.write.format("delta").saveAsTable("t")` | `CREATE TABLE t (...) USING DELTA` |
| Time Travel | `spark.read.option("versionAsOf", 1).table("t")` | `SELECT * FROM t VERSION AS OF 1` |
| MERGE | `DeltaTable.forName(spark, "t").merge(src, cond)...execute()` | `MERGE INTO target USING source` |
| Optimize | `DeltaTable.forName(spark, "t").optimize().executeCompaction()` | `OPTIMIZE t ZORDER BY (col)` |
| History | `DeltaTable.forName(spark, "t").history()` | `DESCRIBE HISTORY t` |

### Additional Resources:
- [Delta Lake Documentation](https://docs.delta.io/)
- [Delta Lake Best Practices](https://docs.databricks.com/delta/best-practices.html)

## Resource Cleanup

Clean up resources created during the notebook:

In [0]:
# Optional test resource cleanup
# NOTE: Run only if you want to delete all created data

# spark.sql(f"DROP TABLE IF EXISTS {CATALOG}.{BRONZE_SCHEMA}.customers_delta")
# spark.sql("DROP VIEW IF EXISTS customer_updates")
# spark.catalog.clearCache()

# display(spark.createDataFrame([("Resources have been cleaned", "✓")], ["status", "result"]))