# M04: Delta Optimization & Performance

| Exam Domain | Weight |
|---|---|
| ELT with Spark SQL and Python | 29% |
| Incremental Data Processing | 20% |

Topics: OPTIMIZE, Z-ORDER, Liquid Clustering, Partitioning, Change Data Feed, Deletion Vectors, Predictive Optimization.

---

## Setup

In [0]:
%run ../../setup/00_setup

### Configuration

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.types import *
from datetime import datetime, timedelta

# Display user context
display(
    spark.createDataFrame([
        (CATALOG, BRONZE_SCHEMA, SILVER_SCHEMA, GOLD_SCHEMA)
    ], ['catalog', 'bronze_schema', 'silver_schema', 'gold_schema'])
)

# Set catalog and schema as default
spark.sql(f"USE CATALOG {CATALOG}")
spark.sql(f"USE SCHEMA {BRONZE_SCHEMA}")

## 4.1. Optimization

**Theoretical Introduction:**

As data grows, query performance can degrade due to several factors:
- **Small Files Problem**: Too many small files increase metadata overhead
- **Data Layout**: Data not organized for common query patterns
- **Predicate Pushdown Inefficiency**: Scanning more data than necessary

Delta Lake provides several optimization techniques:

| Technique | Description | When to Use |
|-----------|-------------|-------------|
| **OPTIMIZE** | Compacts small files into larger ones | After many small writes |
| **Partitioning** | Physical data separation by column values | Low-cardinality filter columns |
| **Z-ORDER** | Co-locates related data for better pruning | Frequently filtered columns |
| **Liquid Clustering** | Modern alternative to partitioning + Z-ORDER | New tables (recommended) |



<img src="../../../assets/images/6653c397bbc24993975c4fc11a356ab6.png" width="800">

### 4.1.1. Example: The Small Files Problem

**Objective:** Demonstrate how many small files impact performance and how OPTIMIZE solves it

The "small files problem" occurs when:
- Streaming jobs write many small files
- Frequent small batch inserts
- High-concurrency writes

This leads to:
- Increased metadata overhead
- Slower query performance
- Higher cloud storage request costs (more list and read operations per query)
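
The effect of compaction can be sketched without Spark. The block below is a pure-Python toy model, not Delta Lake's implementation: `compact` is a hypothetical helper that greedily packs file sizes into output files, and the 1 GB target is an assumption chosen for illustration.

```python
def compact(file_sizes_mb, target_mb=1024):
    """Greedily pack small files into output files of at most target_mb each."""
    compacted, current = [], 0
    for size in file_sizes_mb:
        # Close the current output file when the next input would overflow it.
        if current and current + size > target_mb:
            compacted.append(current)
            current = 0
        current += size
    if current:
        compacted.append(current)
    return compacted


small_files = [1] * 5000          # 5,000 files of 1 MB each (hypothetical sizes)
after = compact(small_files)

print(len(small_files), "files before,", len(after), "files after")
print("total size unchanged:", sum(small_files) == sum(after))
```

The data volume stays the same; only the number of files (and therefore the per-file metadata and open/close overhead) shrinks.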

In [0]:
spark.sql(f"DROP TABLE IF EXISTS {CATALOG}.{BRONZE_SCHEMA}.small_files_demo")

In [0]:
# Create a table with many small files (simulating streaming ingestion)
spark.sql(f"""
CREATE OR REPLACE TABLE {CATALOG}.{BRONZE_SCHEMA}.small_files_demo (
    id INT,
    data STRING,
    created_at TIMESTAMP
) USING DELTA
TBLPROPERTIES (
    delta.autoOptimize.optimizeWrite = false,
    delta.autoOptimize.autoCompact = false
)
""")

# The table starts empty. A later cell writes thousands of tiny files
# in a single transaction to simulate the layout that streaming ingestion produces.

In [0]:
%sql 

DESCRIBE DETAIL small_files_demo

In [0]:
from pyspark.sql.functions import lit, expr, current_timestamp
import random

# 1. Configuration
total_files = 5000
rows_per_file = 2  # Average 2 records per file
total_rows = total_files * rows_per_file

# 2. Generate data in memory (no Python loop!)
df = (
    spark.range(0, total_rows)
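    .withColumn("id", expr("cast(rand() * 100 + 1 AS INT)"))  # random id (1-100) per row; lit(random.randint(...)) would give every row the same value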
    .withColumn("data", expr("uuid()"))
    .withColumn("created_at", current_timestamp())
)

# 3. Write with forced number of files
df.repartition(total_files).write \
    .format("delta") \
    .mode("append") \
    .saveAsTable(f"{CATALOG}.{BRONZE_SCHEMA}.small_files_demo")

print(f"Done! Created {total_files} small files in a single transaction.")
display(spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.small_files_demo"))

In [0]:
# Check the number of files BEFORE optimization
before_optimize = spark.sql(f"DESCRIBE DETAIL {CATALOG}.{BRONZE_SCHEMA}.small_files_demo")
display(before_optimize.select("numFiles", "sizeInBytes"))
display(spark.sql(f"DESCRIBE HISTORY {CATALOG}.{BRONZE_SCHEMA}.small_files_demo"))

In [0]:
# Run OPTIMIZE to compact small files
optimize_result = spark.sql(f"""
    OPTIMIZE {CATALOG}.{BRONZE_SCHEMA}.small_files_demo
""")

display(optimize_result)

In [0]:
# VACUUM with 0 hours retention (DEMO ONLY - requires disabling safety check)
# In production, NEVER use 0 hours - use default 7 days or more!
spark.sql("SET spark.databricks.delta.retentionDurationCheck.enabled = false")

vacuum_result = spark.sql(f"""
    VACUUM {CATALOG}.{BRONZE_SCHEMA}.small_files_demo RETAIN 0 HOURS
""")

In [0]:
# Check the number of files AFTER optimization
after_optimize = spark.sql(f"DESCRIBE DETAIL {CATALOG}.{BRONZE_SCHEMA}.small_files_demo")
display(after_optimize.select("numFiles", "sizeInBytes"))

### 4.1.2. Example: Partitioning

**Objective:** Demonstrate how partitioning improves query performance through partition pruning

Partitioning physically separates data into directories based on column values. This enables:
- **Partition Pruning**: Skip entire partitions that don't match query filters
- **Parallel Processing**: Process partitions independently

**Best Practices:**
- Use low-cardinality columns (date, country, status)
- Avoid over-partitioning (too many small partitions)
- Aim for 1GB+ per partition
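
Partition pruning can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not how Delta stores data: the dictionary `partitions` stands in for partition directories of a table partitioned by a date and a status column, and `prune` is a hypothetical helper.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical stand-in for partition directories:
# key = (order_date, status), value = rows stored under that directory.
partitions = defaultdict(list)
for day_offset in range(30):
    order_date = date(2024, 1, 1) + timedelta(days=day_offset)
    for i in range(100):
        status = "completed" if i % 3 != 0 else "pending"
        partitions[(order_date, status)].append(f"ORD-{day_offset:02d}-{i:04d}")


def prune(partitions, order_date):
    """Return only the partition keys whose order_date matches the filter."""
    return [key for key in partitions if key[0] == order_date]


scanned = prune(partitions, date(2024, 1, 15))
rows_read = sum(len(partitions[key]) for key in scanned)
total_rows = sum(len(rows) for rows in partitions.values())

print(f"{len(scanned)} of {len(partitions)} partitions scanned")
print(f"{rows_read} rows read instead of {total_rows}")
```

Note that with two partition columns, a filter on the date alone still touches one partition per status value for that date.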

In [0]:
spark.sql(f"DROP TABLE IF EXISTS {CATALOG}.{BRONZE_SCHEMA}.orders_partitioned")

In [0]:
# Create a partitioned table
spark.sql(f"""
CREATE OR REPLACE TABLE {CATALOG}.{BRONZE_SCHEMA}.orders_partitioned (
    order_id STRING,
    customer_id STRING,
    product_id STRING,
    order_date DATE,
    amount DOUBLE,
    status STRING
) 
USING DELTA
PARTITIONED BY (order_date,status)
""")

In [0]:
# Insert sample data across multiple dates
from datetime import date, timedelta

orders_data = []
base_date = date(2024, 1, 1)

for day_offset in range(30):  # 30 days of data
    order_date = base_date + timedelta(days=day_offset)
    for i in range(100):  # 100 orders per day
        orders_data.append((
            f"ORD-{day_offset:02d}-{i:04d}",
            f"CUST{i % 50:04d}",
            f"PROD{i % 20:03d}",
            order_date,
            50 + (i * 2.5),
            "completed" if i % 3 != 0 else "pending"
        ))

orders_df = spark.createDataFrame(orders_data, 
    ["order_id", "customer_id", "product_id", "order_date", "amount", "status"])

orders_df.write.format("delta").mode("append").partitionBy("order_date","status") \
    .saveAsTable(f"{CATALOG}.{BRONZE_SCHEMA}.orders_partitioned")

print(f"Inserted {len(orders_data)} orders across 30 days")

In [0]:
# Check partitioning structure
display(spark.sql(f"DESCRIBE DETAIL {CATALOG}.{BRONZE_SCHEMA}.orders_partitioned"))

In [0]:
# Query with partition filter - only scans relevant partitions
# Check the Spark UI to see partition pruning in action
result = spark.sql(f"""
    SELECT * FROM {CATALOG}.{BRONZE_SCHEMA}.orders_partitioned
    WHERE order_date = '2024-01-15'
""")

print("Query for a single date (should scan only that date's partitions, one per status value):")
display(result)

### 4.1.3. Example: Z-ORDER (Data Skipping)

**Objective:** Demonstrate Z-ORDER for multi-dimensional clustering

Z-ORDER is a multi-dimensional clustering technique that co-locates related data within files. This enables **data skipping** - reading only relevant files based on min/max statistics.

**When to Use:**
- Columns frequently in WHERE clauses
- High-cardinality columns (customer_id, product_id)
- Up to 4 columns (effectiveness decreases with more)

**How it Works:**
- Reorganizes data within files using Z-order curve
- Maintains min/max statistics per file
- Query engine skips files that don't match predicates
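
The three steps above can be sketched in plain Python. This is a toy model, not Delta's implementation: `interleave_bits` computes a Morton (Z-order) value for two integer columns, and `files_to_read` is a hypothetical helper that applies min/max skipping to fixed-size "files". The row counts and value ranges are assumptions for illustration.

```python
import random


def interleave_bits(x, y, bits=10):
    """Morton (Z-order) value: interleave the low `bits` bits of x and y."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def files_to_read(rows, rows_per_file, customer, product):
    """Count files whose min/max ranges could contain (customer, product)."""
    hits = 0
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        cust_min, cust_max = min(c for c, _ in chunk), max(c for c, _ in chunk)
        prod_min, prod_max = min(p for _, p in chunk), max(p for _, p in chunk)
        if cust_min <= customer <= cust_max and prod_min <= product <= prod_max:
            hits += 1
    return hits


rng = random.Random(42)  # fixed seed so the sketch is repeatable
rows = [(rng.randint(1, 1000), rng.randint(1, 500)) for _ in range(100_000)]
zordered = sorted(rows, key=lambda r: interleave_bits(r[0], r[1]))

unordered_reads = files_to_read(rows, 1_000, customer=393, product=259)
zordered_reads = files_to_read(zordered, 1_000, customer=393, product=259)
print("files read without Z-ordering:", unordered_reads, "of 100")
print("files read with Z-ordering:   ", zordered_reads, "of 100")
```

In randomly ordered files nearly every file spans the full value range, so min/max statistics rule out almost nothing; after sorting by the interleaved value, each file covers a narrow region of both columns.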


<img src="../../../assets/images/8ddc3a9209e145cf8edec763d45344d7.png" width="800">

In [0]:
spark.sql(f"DROP TABLE IF EXISTS {CATALOG}.{BRONZE_SCHEMA}.sales_zorder_demo")

In [0]:
# Create a table for Z-ORDER demonstration with auto-optimization disabled
spark.sql(f"""
CREATE OR REPLACE TABLE {CATALOG}.{BRONZE_SCHEMA}.sales_zorder_demo (
    sale_id STRING,
    customer_id STRING,
    product_id STRING,
    store_id STRING,
    sale_date DATE,
    amount DOUBLE,
    quantity INT
) USING DELTA
TBLPROPERTIES (
    delta.autoOptimize.optimizeWrite = false,
    delta.autoOptimize.autoCompact = false
)
""")

In [0]:
# Insert sample data
from datetime import date
import random

sales_data = []
for i in range(100000):  # 100K records
    sales_data.append((
        f"SALE-{i:08d}",
        f"CUST{random.randint(1, 1000):04d}",
        f"PROD{random.randint(1, 500):03d}",
        f"STORE{random.randint(1, 50):02d}",
        date(2024, random.randint(1, 12), random.randint(1, 28)),
        random.uniform(10, 500),
        random.randint(1, 10)
    ))

sales_df = spark.createDataFrame(
    sales_data, 
    ["sale_id", "customer_id", "product_id", "store_id", "sale_date", "amount", "quantity"]
)

sales_df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable(
    f"{CATALOG}.{BRONZE_SCHEMA}.sales_zorder_demo"
)
display(sales_df)

In [0]:
# Check file statistics BEFORE Z-ORDER
display(spark.sql(f"DESCRIBE DETAIL {CATALOG}.{BRONZE_SCHEMA}.sales_zorder_demo"))

In [0]:
# Baseline query before Z-ORDER (compare the files read with the same query after OPTIMIZE)
result = spark.sql(f"""
    SELECT * FROM {CATALOG}.{BRONZE_SCHEMA}.sales_zorder_demo
    WHERE customer_id = 'CUST0393' AND product_id = 'PROD259' --CUST0592	PROD011
""")

display(result)

In [0]:
# Apply Z-ORDER on frequently filtered columns
# In this case: customer_id and product_id are common filter columns
zorder_result = spark.sql(f"""
    OPTIMIZE {CATALOG}.{BRONZE_SCHEMA}.sales_zorder_demo
    ZORDER BY (customer_id, product_id)
""")

display(zorder_result)

In [0]:
# Example query that benefits from Z-ORDER
result = spark.sql(f"""
    SELECT * FROM {CATALOG}.{BRONZE_SCHEMA}.sales_zorder_demo
     WHERE customer_id = 'CUST0393' AND product_id = 'PROD259' --CUST0592	PROD011
""")

print("Query with Z-ORDER optimized columns (check Spark UI for data skipping):")
display(result)

### 4.1.4. Example: Liquid Clustering

**Objective:** Demonstrate Liquid Clustering as a modern alternative to partitioning and Z-ORDER

Liquid Clustering is Databricks' latest optimization technique that combines the benefits of partitioning and Z-ORDER while being easier to manage:

**Key Benefits:**
- **Automatic**: Databricks manages data layout automatically
- **Adaptive**: Adjusts to changing query patterns over time
- **Flexible**: Can change clustering columns without rewriting data
- **Incremental**: Works incrementally with each OPTIMIZE
- **Simpler**: No need to choose between partitioning and Z-ORDER

**When to Use:**
- New tables (recommended default)
- Tables with evolving query patterns
- When you're unsure about optimal partitioning strategy

In [0]:
spark.sql(f"DROP TABLE IF EXISTS {CATALOG}.{BRONZE_SCHEMA}.sales_liquid_clustering")

In [0]:
# Create the table with auto-optimization disabled; Liquid Clustering is enabled in a later cell with ALTER TABLE ... CLUSTER BY
spark.sql(f"""
CREATE OR REPLACE TABLE {CATALOG}.{BRONZE_SCHEMA}.sales_liquid_clustering (
    sale_id STRING,
    customer_id STRING,
    product_id STRING,
    region STRING,
    sale_date DATE,
    amount DOUBLE,
    quantity INT
) 
USING DELTA
TBLPROPERTIES (
    delta.autoOptimize.optimizeWrite = false,
    delta.autoOptimize.autoCompact = false
)
""")

In [0]:
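# NOTE: Predictive Optimization is documented as a setting at the account, catalog, or schema level
# (for example ALTER SCHEMA ... ENABLE PREDICTIVE OPTIMIZATION, see section 4.4).
# The table property below is kept from the original demo; treat it as plain table metadata, not a documented switch.
spark.sql(f"""
ALTER TABLE {CATALOG}.{BRONZE_SCHEMA}.sales_liquid_clustering
SET TBLPROPERTIES ('spark.databricks.sql.predictiveOptimization.enabled' = 'true')
""")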

In [0]:
from pyspark.sql.functions import col, rand, lit, concat, lpad, element_at, array, date_add, to_date, round

# 1. Configuration
target_files = 5000       # We want 5000 files
rows_per_file = 10        # 10 records per file
total_rows = target_files * rows_per_file # Total 50,000 records

# Array of regions for random selection
regions_list = array([lit(x) for x in ['North', 'South', 'East', 'West', 'Central']])

# 2. Data generation (no for loop!)
df = spark.range(0, total_rows).withColumnRenamed("id", "idx") \
    .withColumn("sale_id", concat(lit("SALE-"), lpad(col("idx"), 8, "0"))) \
    .withColumn("customer_id", concat(lit("CUST"), lpad((rand() * 500 + 1).cast("int"), 4, "0"))) \
    .withColumn("product_id", concat(lit("PROD"), lpad((rand() * 200 + 1).cast("int"), 3, "0"))) \
    .withColumn("region", element_at(regions_list, (rand() * 5 + 1).cast("int"))) \
    .withColumn("sale_date", date_add(to_date(lit("2024-01-01")), (rand() * 364).cast("int"))) \
    .withColumn("amount", round(rand() * 490 + 10, 2)) \
    .withColumn("quantity", (rand() * 10 + 1).cast("int")) \
    .drop("idx") # Remove helper column

# 3. Write: repartition forces roughly one small file per partition
# The table already exists (without clustering at this point), so this appends to it.
df.repartition(target_files).write \
    .format("delta") \
    .mode("append") \
    .saveAsTable(f"{CATALOG}.{BRONZE_SCHEMA}.sales_liquid_clustering")

print(f"Done! Inserted {total_rows} records in {target_files} small files.")

In [0]:
df = spark.sql(f"DESCRIBE DETAIL {CATALOG}.{BRONZE_SCHEMA}.sales_liquid_clustering")
display(df)

In [0]:
# Enable Liquid Clustering on the existing table and let Databricks choose the clustering keys (CLUSTER BY AUTO).
# To pick the keys yourself, use CLUSTER BY (col1, col2) instead.
spark.sql(f"""
ALTER TABLE {CATALOG}.{BRONZE_SCHEMA}.sales_liquid_clustering
CLUSTER BY AUTO
""")

In [0]:
# OPTIMIZE automatically applies Liquid Clustering
# No need to specify ZORDER - it's built into the table definition!
optimize_result = spark.sql(f"""
    OPTIMIZE {CATALOG}.{BRONZE_SCHEMA}.sales_liquid_clustering
""")

display(optimize_result)

In [0]:
# Check clustering information
display(spark.sql(f"DESCRIBE DETAIL {CATALOG}.{BRONZE_SCHEMA}.sales_liquid_clustering"))

In [0]:
# Queries filtering by clustering columns are automatically optimized
result = spark.sql(f"""
    SELECT region, COUNT(*) as sales_count, SUM(amount) as total_amount
    FROM {CATALOG}.{BRONZE_SCHEMA}.sales_liquid_clustering
    WHERE customer_id LIKE 'CUST00%' AND region = 'North'
    GROUP BY region
""")

display(result)

**Comparison: Partitioning vs Z-ORDER vs Liquid Clustering**

| Feature | Partitioning | Z-ORDER | Liquid Clustering |
|---------|-------------|---------|-------------------|
| When to choose | Low-cardinality columns | High-cardinality filter columns | General purpose (recommended) |
| Data layout | Directory per partition | Co-located in files | Automatic clustering |
| Changing layout columns | Requires full table rewrite | Re-run OPTIMIZE ZORDER BY with new columns | ALTER TABLE ... CLUSTER BY; existing data is not rewritten |
| Maintenance | Manual | Manual OPTIMIZE | Automatic with OPTIMIZE |
| Best for | Date/Region filters | Multi-column filters | Evolving workloads |

## 4.1.5. Data Skew & Data Distribution Problems

### What is Data Skew?

**Data skew** occurs when data is unevenly distributed across partitions. Instead of each partition having roughly the same number of rows, one or more partitions end up with significantly more data ("hot partitions"). This is one of the most common and impactful performance problems in distributed data processing.

<img src="../../../assets/images/eef58dd32657427eb8f4f2fdacff9b39.png" width="800">

---

### Common Causes of Data Skew

| Cause | Example | Impact |
|-------|---------|--------|
| **Skewed join keys** | Joining on `country` where 80% of customers are from one country | One executor handles most of the join |
| **Null values in join/group keys** | `customer_id IS NULL` for anonymous orders | All NULLs land on one partition |
| **Hot keys in groupBy** | `groupBy("product_category")` where "Electronics" has 90% of sales | One partition aggregates most data |
| **Uneven source files** | One file is 10 GB, others are 100 MB | One task reads disproportionate data |
| **Time-based partitioning** | `PARTITION BY (order_date)` when most orders are from last month | Recent partition is orders of magnitude larger |

---

### Detecting Data Skew

**Symptoms:**
- One stage takes much longer than others in Spark UI
- One task in a stage uses much more memory/time than siblings
- `Spill (Memory)` or `Spill (Disk)` metrics appear for one task in the Spark UI
- OOM (Out of Memory) errors on specific executors

**Diagnostic queries:**

```python
from pyspark.sql import functions as F

# Check partition sizes (data distribution): rows per Spark partition, largest first
df.groupBy(F.spark_partition_id().alias("partition_id")).count().orderBy("count", ascending=False).show()

# Check key distribution for join/group columns: the top keys reveal hot keys
df.groupBy("join_key_column").count().orderBy("count", ascending=False).show(20)

# In Databricks SQL:
# SELECT join_key, COUNT(*) as cnt FROM table GROUP BY join_key ORDER BY cnt DESC LIMIT 20
```

---

### Solutions for Data Skew

#### 1. Adaptive Query Execution (AQE) — Automatic

Databricks enables AQE by default. It detects skew at runtime and automatically:
- Splits skewed partitions into smaller sub-partitions
- Adjusts shuffle partition count based on data volume
- Converts sort-merge joins to broadcast joins when beneficial

```sql
-- AQE is ON by default in Databricks. To verify:
SET spark.sql.adaptive.enabled;  -- true

-- Key AQE skew optimization settings:
SET spark.sql.adaptive.skewJoin.enabled;  -- true (auto-splits skewed partitions)
SET spark.sql.adaptive.skewJoin.skewedPartitionFactor;  -- 5 (partition 5x median = skewed)
SET spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes;  -- 256MB
```
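
How the two skew settings combine can be shown with a short pure-Python sketch. It assumes the documented rule that a partition counts as skewed only when it exceeds both the factor times the median size and the byte threshold. `skewed_partitions` and `split_partition` are hypothetical helpers, the partition sizes are made up, and the 64 MB split target is an assumption; the real AQE splitting logic is more involved.

```python
from statistics import median


def skewed_partitions(sizes_mb, factor=5, threshold_mb=256):
    """Indexes of partitions larger than factor x median AND the size threshold."""
    mid = median(sizes_mb)
    return [i for i, size in enumerate(sizes_mb)
            if size > factor * mid and size > threshold_mb]


def split_partition(size_mb, target_mb):
    """Split one oversized partition into equally sized chunks near target_mb."""
    chunks = -(-size_mb // target_mb)          # ceiling division
    return [size_mb / chunks] * chunks


sizes = [64, 70, 58, 61, 2048, 66, 300]        # hypothetical shuffle partition sizes in MB
hot = skewed_partitions(sizes)

print("skewed partition indexes:", hot)
print("chunks after split:", len(split_partition(sizes[hot[0]], target_mb=64)))
```

The 300 MB partition is above the threshold but below five times the median, so only the 2048 MB partition is treated as skewed in this model.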

> **Exam Tip:** AQE is a Spark 3.x feature enabled by default in Databricks. It handles many skew scenarios automatically. On the exam, know that AQE can dynamically adjust partition counts and handle skewed joins.

#### 2. Salting — Manual Technique

When AQE is insufficient (extreme skew), **salting** splits hot keys artificially:

```python
from pyspark.sql.functions import concat, lit, floor, rand

# Add random salt (0-9) to the skewed key
num_salts = 10
df_large_salted = df_large.withColumn("salted_key", concat("join_key", lit("_"), floor(rand() * num_salts).cast("int")))

# Explode the small table to match all salt values
from pyspark.sql.functions import explode, array
df_small_salted = df_small.withColumn("salted_key",
    explode(array([concat("join_key", lit(f"_{i}")) for i in range(num_salts)])))

# Join on salted key — data distributed evenly
result = df_large_salted.join(df_small_salted, "salted_key")
```
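
Why salting helps can be seen in a small simulation without Spark. This is a toy model under stated assumptions: `partition_of` is a deterministic stand-in for a shuffle hash partitioner (CRC32 modulo the partition count), and the key distribution is invented for illustration.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8
NUM_SALTS = 10


def partition_of(key):
    """Deterministic stand-in for a shuffle hash partitioner."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical skewed data: 90% of rows share the key "US".
keys = ["US"] * 9_000 + [rng.choice(["DE", "FR", "PL", "JP"]) for _ in range(1_000)]

plain = Counter(partition_of(k) for k in keys)
salted = Counter(partition_of(f"{k}_{rng.randrange(NUM_SALTS)}") for k in keys)

print("largest partition without salt:", max(plain.values()))
print("largest partition with salt:   ", max(salted.values()))
```

The trade-off is on the other side of the join: the small table has to be replicated once per salt value so that every salted key still finds its match.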

#### 3. Broadcast Join — Small Table Optimization

If one side of the join is small enough (< 10 MB default, configurable), broadcast it to all executors to avoid shuffle entirely:

```python
from pyspark.sql.functions import broadcast

# Force broadcast join — eliminates shuffle
result = df_large.join(broadcast(df_small), "join_key")
```

```sql
-- SQL equivalent:
SELECT /*+ BROADCAST(small_table) */ *
FROM large_table JOIN small_table ON large_table.key = small_table.key;
```
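
Conceptually, a broadcast join ships a full copy of the small side to every task and joins each partition of the large side locally. The sketch below models that idea in plain Python; the tables, their contents, and the `join_partition` helper are hypothetical.

```python
# Hypothetical data: a large table split across partitions, and a small lookup table.
large_partitions = [
    [("ORD-1", "PL"), ("ORD-2", "DE")],
    [("ORD-3", "PL"), ("ORD-4", "FR")],
]
small_table = [("PL", "Poland"), ("DE", "Germany")]

# "Broadcast": every partition gets the whole small side as a lookup map.
lookup = dict(small_table)


def join_partition(rows, lookup):
    """Inner-join one partition locally; no rows move between partitions."""
    return [(order_id, code, lookup[code]) for order_id, code in rows if code in lookup]


result = [row for part in large_partitions for row in join_partition(part, lookup)]
print(result)
```

Because the large side never moves, a hot key in it cannot overload a single shuffle partition, which is why broadcasting sidesteps join skew when the small side fits in memory.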

> **Exam Tip:** The default broadcast threshold is 10 MB (`spark.sql.autoBroadcastJoinThreshold`). AQE can also automatically convert sort-merge joins to broadcast joins at runtime.

#### 4. Handling NULL Keys

```python
# Separate NULL keys from the join (they won't match anyway in inner join)
df_nulls = df.filter("join_key IS NULL")
df_non_nulls = df.filter("join_key IS NOT NULL")

# Process non-null keys normally, handle nulls separately
result = df_non_nulls.join(df_other, "join_key")
```

---

### Skew vs. Optimization Techniques — Summary

| Problem | Solution | When to Use |
|---------|----------|-------------|
| Skewed join keys | AQE (automatic) | First attempt — usually sufficient |
| Extreme skew in joins | Salting | AQE insufficient, 100:1+ key imbalance |
| Small table join | Broadcast join | One table < 10 MB (adjustable) |
| NULL keys | Filter + handle separately | Many NULLs in join/group column |
| Skewed groupBy | `repartition()` + salting | One group dominates others |
| Uneven file sizes | Auto Loader / OPTIMIZE | Ingestion or post-load compaction |
| Time partition skew | Liquid Clustering | Replace static partitioning |

> **Exam Tip:** Data skew detection and resolution is a practical skill tested indirectly. Questions about AQE, broadcast joins, and partition management often relate to skew scenarios. Remember: AQE handles most cases automatically in Databricks.

## 4.2. Change Data Feed vs Change Data Capture

**Theoretical Introduction:**

Two terms are often confused in the data engineering world: **Change Data Feed (CDF)** and **Change Data Capture (CDC)**. Understanding the difference is crucial:

### Change Data Capture (CDC)
**What it is:** A *pattern/technique* for capturing changes from source systems (databases, APIs, etc.)

**Characteristics:**
- Source-side technology
- Captures INSERT, UPDATE, DELETE from operational databases
- Tools: Debezium, AWS DMS, Fivetran, Qlik Replicate
- Produces a stream of change events

**Example:** Capturing changes from PostgreSQL and streaming them to Kafka

### Change Data Feed (CDF)
**What it is:** A *Delta Lake feature* that records row-level changes within Delta tables

**Characteristics:**
- Delta Lake native feature
- Tracks changes that happen WITHIN Delta tables
- Provides `_change_type`, `_commit_version`, `_commit_timestamp` columns
- Enables efficient incremental processing

**Example:** Reading only the rows that changed since the last pipeline run

### 4.2.1. Example: Enabling Change Data Feed

**Objective:** Enable CDF on a Delta table to track all changes

In [0]:
# Create a table with CDF enabled from the start
spark.sql(f"""
CREATE OR REPLACE TABLE {CATALOG}.{BRONZE_SCHEMA}.cdf_demo (
    user_id STRING,
    name STRING,
    email STRING,
    status STRING,
    updated_at TIMESTAMP
) 
USING DELTA
TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

print("Table created with Change Data Feed enabled")

In [0]:
# Verify CDF is enabled
properties = spark.sql(f"SHOW TBLPROPERTIES {CATALOG}.{BRONZE_SCHEMA}.cdf_demo")
display(properties.filter(F.col("key").like("%change%")))

### 4.2.2. Example: Generating and Tracking Changes

**Objective:** Perform various DML operations and observe how CDF tracks them

In [0]:
# INSERT some initial data
spark.sql(f"""
INSERT INTO {CATALOG}.{BRONZE_SCHEMA}.cdf_demo VALUES
    ('U001', 'Alice', 'alice@example.com', 'active', current_timestamp()),
    ('U002', 'Bob', 'bob@example.com', 'active', current_timestamp()),
    ('U003', 'Charlie', 'charlie@example.com', 'active', current_timestamp())
""")
print("Version 1: Initial INSERT completed")

In [0]:
# UPDATE a record
spark.sql(f"""
UPDATE {CATALOG}.{BRONZE_SCHEMA}.cdf_demo
SET status = 'premium', updated_at = current_timestamp()
WHERE user_id = 'U001'
""")
print("Version 2: UPDATE completed - Alice upgraded to premium")

In [0]:
# DELETE a record
spark.sql(f"""
DELETE FROM {CATALOG}.{BRONZE_SCHEMA}.cdf_demo
WHERE user_id = 'U002'
""")
print("Version 3: DELETE completed - Bob removed")

In [0]:
# INSERT new record
spark.sql(f"""
INSERT INTO {CATALOG}.{BRONZE_SCHEMA}.cdf_demo VALUES
    ('U004', 'Diana', 'diana@example.com', 'trial', current_timestamp())
""")
print("Version 4: INSERT completed - Diana added")

### 4.2.3. Example: Reading Change Data Feed

**Objective:** Read and analyze change data with CDF metadata columns

In [0]:
# Read the current state of the table (no change feed) for comparison
current_state = spark.read \
    .format("delta") \
    .table(f"{CATALOG}.{BRONZE_SCHEMA}.cdf_demo")

display(current_state)

In [0]:
# Read all changes from the beginning
changes = spark.read \
    .format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", 0) \
    .table(f"{CATALOG}.{BRONZE_SCHEMA}.cdf_demo")

# Show changes with CDF metadata columns
display(
    changes.select(
        "user_id", "name", "status",
        "_change_type",        # insert, update_preimage, update_postimage, delete
        "_commit_version",     # Delta version number
        "_commit_timestamp"    # When the change occurred
    ).orderBy("_commit_version", "user_id")
)

**Understanding `_change_type` values:**

| Change Type | Description |
|-------------|-------------|
| `insert` | New row inserted |
| `update_preimage` | Row value BEFORE update |
| `update_postimage` | Row value AFTER update |
| `delete` | Row that was deleted |

This enables powerful incremental processing patterns - you can process only what changed!
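
One way a downstream consumer might interpret the four change types is sketched below in plain Python. It is an illustration, not a Delta API: `apply_changes` is a hypothetical helper, the target is a dictionary keyed by `user_id`, and the change rows are hand-written to resemble the CDF output of the demo table.

```python
def apply_changes(target, changes):
    """Replay change-feed rows, in commit order, onto a dict keyed by user_id."""
    for row in sorted(changes, key=lambda r: r["_commit_version"]):
        kind = row["_change_type"]
        if kind in ("insert", "update_postimage"):
            target[row["user_id"]] = row["status"]      # upsert the new value
        elif kind == "delete":
            target.pop(row["user_id"], None)            # remove the row
        # update_preimage rows carry the old value and are skipped here
    return target


# Hypothetical change rows shaped like the CDF output of the demo table.
changes = [
    {"user_id": "U001", "status": "active", "_change_type": "insert", "_commit_version": 1},
    {"user_id": "U002", "status": "active", "_change_type": "insert", "_commit_version": 1},
    {"user_id": "U001", "status": "active", "_change_type": "update_preimage", "_commit_version": 2},
    {"user_id": "U001", "status": "premium", "_change_type": "update_postimage", "_commit_version": 2},
    {"user_id": "U002", "status": "active", "_change_type": "delete", "_commit_version": 3},
]

downstream = apply_changes({}, changes)
print(downstream)
```

In a real pipeline the same logic is usually expressed as a `MERGE INTO` against the target table rather than a Python loop.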

In [0]:
# Example: Get only new inserts since version 2
new_inserts = spark.read \
    .format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", 2) \
    .table(f"{CATALOG}.{BRONZE_SCHEMA}.cdf_demo") \
    .filter(F.col("_change_type") == "insert")

print("New inserts since version 2:")
display(new_inserts.select("user_id", "name", "status", "_commit_version"))

In [0]:
# Example: Get all deletions for audit purposes
deletions = spark.read \
    .format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", 0) \
    .table(f"{CATALOG}.{BRONZE_SCHEMA}.cdf_demo") \
    .filter(F.col("_change_type") == "delete")

print("All deleted records (for audit):")
display(deletions.select("user_id", "name", "_commit_version", "_commit_timestamp"))

### 4.2.4. Example: CDF for Incremental ETL

**Objective:** Demonstrate how CDF enables efficient incremental processing in ETL pipelines

Instead of reprocessing entire tables, use CDF to process only changed rows:

In [0]:
# Simulate an incremental ETL pipeline
# First run: process all changes (startingVersion = 0)
# Subsequent runs: startingVersion is inclusive, so start from the last processed version + 1

# Version to start reading from (in practice, persist the last processed version in a checkpoint table)
last_processed_version = 0

# Read incremental changes
incremental_changes = spark.read \
    .format("delta") \
    .option("readChangeFeed", "true") \
    .option("startingVersion", last_processed_version) \
    .table(f"{CATALOG}.{BRONZE_SCHEMA}.cdf_demo")

# Apply transformations only to changed records
transformed = incremental_changes \
    .filter(F.col("_change_type").isin(["insert", "update_postimage"])) \
    .withColumn("processed_at", F.current_timestamp()) \
    .withColumn("email_domain", F.split(F.col("email"), "@")[1])

print("Incremental processing - only changed records:")
display(transformed.select("user_id", "name", "email_domain", "status", "_change_type", "processed_at"))

## 4.3. Deletion Vectors

**Theoretical Introduction:**

Deletion Vectors are a storage optimization feature in Delta Lake that improves DELETE, UPDATE, and MERGE performance.

**How it works:**
- Instead of rewriting entire data files on DELETE/UPDATE, Delta Lake marks rows as deleted in a separate **deletion vector file**
- The actual data files remain unchanged until next OPTIMIZE or REORG
- Reads automatically filter out deleted rows using the deletion vector
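
The mechanism can be modeled in a few lines of plain Python. This is a toy sketch, not Delta's file format: the `DataFile` class, its `deleted` set (standing in for the deletion vector file), and the `rewrites` counter are all hypothetical.

```python
class DataFile:
    """Toy model of one data file plus its deletion vector."""

    def __init__(self, rows):
        self.rows = tuple(rows)          # file contents, only replaced by a rewrite
        self.deleted = set()             # row positions marked as deleted
        self.rewrites = 0                # how many times the file was rewritten

    def delete_where(self, predicate):
        """Soft delete: record row positions instead of rewriting the file."""
        self.deleted |= {i for i, row in enumerate(self.rows) if predicate(row)}

    def read(self):
        """Readers filter out the positions listed in the deletion vector."""
        return [row for i, row in enumerate(self.rows) if i not in self.deleted]

    def purge(self):
        """Rewrite the file without the deleted rows and clear the vector."""
        self.rows = tuple(self.read())
        self.deleted = set()
        self.rewrites += 1


data_file = DataFile(["U001", "U002", "U003"])
data_file.delete_where(lambda user_id: user_id == "U002")
print(data_file.read(), "rewrites so far:", data_file.rewrites)

data_file.purge()
print(data_file.rows, "rewrites so far:", data_file.rewrites)
```

The delete itself costs no rewrite; the rewrite is deferred to the purge step, which corresponds to the later OPTIMIZE or REORG run.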


<img src="../../../assets/images/3358ea8efef949f09710fd4423728a31.png" width="800">


| Aspect | Without Deletion Vectors | With Deletion Vectors |
|--------|------------------------|----------------------|
| DELETE speed | Slow (rewrite files) | Fast (mark in vector) |
| Storage during DELETE | Temporary 2x | Minimal overhead |
| Read performance | Normal | Slight overhead (filter) |
| Physical removal of deleted rows | At write time (files are rewritten) | Deferred until OPTIMIZE or `REORG TABLE ... APPLY (PURGE)` |

```sql
-- Enable deletion vectors on a table
ALTER TABLE my_table SET TBLPROPERTIES ('delta.enableDeletionVectors' = true);

-- Purge deletion vectors (rewrite files)
REORG TABLE my_table APPLY (PURGE);
```

**Exam Note:** Deletion Vectors are enabled by default on new tables in Databricks. They make DML operations faster but may slightly increase read overhead until REORG is run.

## 4.4. Predictive Optimization

**Theoretical Introduction:**

Predictive Optimization is an automatic maintenance feature in Databricks that eliminates the need for manual OPTIMIZE and VACUUM commands.

**How it works:**
- Databricks automatically monitors table health metrics
- Runs OPTIMIZE (file compaction) when small files accumulate
- Runs VACUUM when orphaned files need cleanup
- Enabled at the account, catalog, or schema level (lower levels can inherit the setting)

```sql
-- Enable at schema level
ALTER SCHEMA my_catalog.my_schema
ENABLE PREDICTIVE OPTIMIZATION;

-- Check optimization history
SELECT * FROM system.storage.predictive_optimization_operations_history
WHERE catalog_name = 'my_catalog'
ORDER BY start_time DESC;
```

| Feature | Manual | Predictive Optimization |
|---------|--------|------------------------|
| OPTIMIZE | Run manually / schedule | Automatic |
| VACUUM | Run manually / schedule | Automatic |
| Tuning | DBA responsibility | ML-based decisions |
| Cost | Compute cost on schedule | Pay per optimization |

**Exam Note:** Predictive Optimization requires Unity Catalog managed tables. It is the recommended approach for production workloads.

---

## Summary

| Topic | Key Takeaway |
|---|---|
| **OPTIMIZE** | Compacts small files. Run after streaming/frequent inserts |
| **Partitioning** | Low-cardinality columns, partition > 1GB |
| **Z-ORDER** | Co-locates data for data skipping on filter columns |
| **Liquid Clustering** | Modern replacement for partitioning + Z-ORDER. Clustering is applied incrementally by OPTIMIZE, with no ZORDER BY clause |
| **Data Skew** | AQE handles most cases. Salting, broadcast for extreme cases |
| **CDF vs CDC** | CDC = source capture. CDF = Delta Lake row-level tracking |
| **Deletion Vectors** | Fast DELETE/UPDATE via marking instead of rewriting |
| **Predictive Optimization** | Automatic OPTIMIZE + VACUUM for UC managed tables |

### Quick Reference

| Operation | Command |
|---|---|
| Optimize | `OPTIMIZE table` |
| Z-ORDER | `OPTIMIZE table ZORDER BY (col)` |
| Vacuum | `VACUUM table RETAIN X HOURS` |
| Enable CDF | `ALTER TABLE table SET TBLPROPERTIES (delta.enableChangeDataFeed = true)` |
| Read CDF | `.option("readChangeFeed", "true").option("startingVersion", 0)` |

---


## Cleanup

In [0]:
# Optional test resource cleanup
# NOTE: Run only if you want to delete all created data

# Tables to clean up:
cleanup_tables = [
    "customers_delta",
    "orders_modern", 
    "time_travel_demo",
    "small_files_demo",
    "orders_partitioned",
    "sales_zorder_demo",
    "sales_liquid_clustering",
    "cdf_demo"
]

In [0]:
# Uncomment below to execute cleanup:
# for table in cleanup_tables:
#     spark.sql(f"DROP TABLE IF EXISTS {CATALOG}.{BRONZE_SCHEMA}.{table}")
#     print(f"Dropped: {table}")

# spark.sql("DROP VIEW IF EXISTS customer_updates")
# spark.catalog.clearCache()

# print("All resources cleaned up!")