# M05: Incremental Processing

| Exam Domain | Weight |
|---|---|
| Incremental Data Processing | 20% |
| ELT with Spark SQL and Python | 29% |

Topics: COPY INTO, Auto Loader (cloudFiles), Structured Streaming, Trigger Modes, Schema Evolution, Rescued Data, Stream-Static Joins.

---

## Setup

Initialize the environment, import libraries, and prepare simulated data sources for all ingestion demos in this module.

---

In [0]:
%run ../../setup/00_setup

### Configuration

Import libraries and define paths for source data, checkpoints, and schema locations.

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.types import *
from datetime import datetime
import time

In [0]:
# Set default catalog and schema
spark.sql(f"USE CATALOG {CATALOG}")
spark.sql(f"USE SCHEMA {BRONZE_SCHEMA}")

# === SOURCE DATA (REAL DATASET) ===
SOURCE_CUSTOMERS = f"{DATASET_PATH}/customers/customers.csv"
SOURCE_ORDERS = f"{DATASET_PATH}/orders/orders_batch.json"

# === DEMO PATHS (SIMULATED ARRIVAL) ===
DEMO_BASE_PATH = f"{DATASET_PATH}/ingestion_demo"
BATCH_SOURCE_PATH = f"{DEMO_BASE_PATH}/batch_source"
STREAM_SOURCE_PATH = f"{DEMO_BASE_PATH}/stream_source"

# === TECHNICAL PATHS ===
CHECKPOINT_BASE_PATH = f"{DEMO_BASE_PATH}/checkpoints"
SCHEMA_BASE_PATH = f"{DEMO_BASE_PATH}/schemas"
BAD_RECORDS_PATH = f"{DEMO_BASE_PATH}/bad_records"

# Cleanup from previous runs
dbutils.fs.rm(DEMO_BASE_PATH, True)
print(f"Demo environment prepared at: {DEMO_BASE_PATH}")

In [0]:
display(
    spark.createDataFrame([
        ("CATALOG", CATALOG),
        ("BRONZE_SCHEMA", BRONZE_SCHEMA),
        ("SILVER_SCHEMA", SILVER_SCHEMA),
        ("USER", raw_user),
        ("CUSTOMERS_CSV", SOURCE_CUSTOMERS),
        ("STREAMING_SOURCE_PATH", STREAM_SOURCE_PATH)
    ], ["Variable", "Value"])
)

In [0]:
# === DATA PREPARATION (SIMULATION) ===

# 1. Prepare Batch Data (Customers)
# We split customers into 4 roughly equal "days" for the COPY INTO demo
df_customers = spark.read.option("header", "true").csv(SOURCE_CUSTOMERS)
df_batch_day1, df_batch_day2, df_batch_day3, df_batch_day4 = df_customers.randomSplit([0.25] * 4, seed=42)

# Save Day 1 immediately
df_batch_day1.write.mode("overwrite").option("header", "true").csv(f"{BATCH_SOURCE_PATH}/day1")
print(f"Batch Data: Day 1 ready at {BATCH_SOURCE_PATH}/day1")

# 2. Prepare Streaming Data (Orders)
# We take existing stream files, merge them, and split into 20 micro-batches for simulation
SOURCE_STREAM_FILES = f"{DATASET_PATH}/orders/stream/*.json"
df_all_orders = spark.read.json(SOURCE_STREAM_FILES)

# Split into 20 parts (5% each)
stream_batches = df_all_orders.randomSplit([0.05] * 20, seed=42)

# Save Batch 1 immediately to start the stream
stream_batches[0].write.mode("overwrite").json(f"{STREAM_SOURCE_PATH}/batch_01")
print(f"Stream Data: Batch 1 ready at {STREAM_SOURCE_PATH}/batch_01")

### Data Loading Methods Overview

<img src="../../../assets/images/d271b8dfc29049aaab22d97536aca66d.avif" width="800">

| Feature | CTAS | COPY INTO | Auto Loader |
|---------|------|-----------|-------------|
| **Incremental** | No | Yes | Yes |
| **Idempotent** | No | Yes | Yes |
| **Schema Evolution** | No | Limited | Advanced |
| **File Tracking** | No | Metadata | Checkpoint |
| **Scalability** | Low | Medium | High |
| **Streaming** | No | No | Yes |
| **Use Case** | One-time | Scheduled batch | Real-time/Streaming |

> **Exam Tip:** Auto Loader (`cloudFiles`) is the **recommended** ingestion method for new projects. COPY INTO is suitable for scheduled batch loads on the order of thousands of files; for millions of files or more, prefer Auto Loader.

---

### Demo: CTAS (Create Table As Select)

In [0]:
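# Example: Load CSV from volume to customer_cts table using CTAS
# Note: csv.`path` uses default reader options, so the header row is read as data
# and the columns are named _c0, _c1, ...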

table_name = "customer_cts"

display(spark.sql(f"""
CREATE OR REPLACE TABLE {table_name} AS
SELECT *
FROM csv.`{SOURCE_CUSTOMERS}`
"""))

In [0]:
%sql
select * from customer_cts

## COPY INTO — Batch Loading

COPY INTO provides idempotent, file-level batch ingestion from cloud storage into Delta tables, automatically tracking which files have already been loaded.

> **Exam Tip:** `COPY INTO` is **idempotent** — it tracks processed files and won't load duplicates on re-run.

---
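
The file-tracking idea behind this idempotency can be pictured with a small pure-Python model. This is only a conceptual sketch: `ToyCopyInto` and the file names are hypothetical, and it makes no claim about how Databricks actually stores its load history.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCopyInto:
    """Toy model of an idempotent loader that remembers which files it has loaded."""
    loaded_files: set = field(default_factory=set)  # stand-in for the load history
    rows: list = field(default_factory=list)        # stand-in for the target table

    def copy_into(self, source_files: dict) -> int:
        """Load only files not seen before; return the number of rows inserted."""
        inserted = 0
        for path, file_rows in sorted(source_files.items()):
            if path in self.loaded_files:
                continue  # already loaded on a previous run, so skip it
            self.rows.extend(file_rows)
            self.loaded_files.add(path)
            inserted += len(file_rows)
        return inserted


# Hypothetical source directory: file path -> rows in that file
source = {"day1/part-0000.csv": [("C1", "Ann"), ("C2", "Bob")]}

table = ToyCopyInto()
first_run = table.copy_into(source)   # both rows are loaded
second_run = table.copy_into(source)  # same files again: nothing is loaded

source["day2/part-0000.csv"] = [("C3", "Eve")]
third_run = table.copy_into(source)   # only the new file is loaded

print(first_run, second_run, third_run, len(table.rows))
```

The second call inserts nothing because the decision is made per file, not per row, which is why re-running the same command over an unchanged path is safe.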

### Demo: COPY INTO from CSV

In [0]:
TABLE_CUSTOMERS = f"{BRONZE_SCHEMA}.customers_batch"

**Creating target table:**

In [0]:
spark.sql(f"DROP TABLE IF EXISTS {TABLE_CUSTOMERS}")

In [0]:
spark.sql(f"""
CREATE TABLE IF NOT EXISTS {TABLE_CUSTOMERS} (
  customer_id STRING,
  first_name STRING,
  last_name STRING,
  email STRING,
  phone STRING,
  city STRING,
  state STRING,
  country STRING,
  registration_date DATE,
  customer_segment STRING,
  _ingestion_timestamp TIMESTAMP
) USING DELTA
COMMENT 'Customers data - Bronze layer'
""")

**Execute COPY INTO:**

In [0]:
# Load Day 1 data
result = spark.sql(f"""
COPY INTO {TABLE_CUSTOMERS}
FROM (
  SELECT 
    customer_id,
    first_name,
    last_name,
    email,
    phone,
    city,
    state,
    country,
    TO_DATE(registration_date, 'yyyy-MM-dd') as registration_date,
    customer_segment,
    current_timestamp() as _ingestion_timestamp
  FROM '{BATCH_SOURCE_PATH}/day1'
)
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'false')
COPY_OPTIONS ('mergeSchema' = 'true')
""")

display(result)
display(spark.table(TABLE_CUSTOMERS))

### Demo: Idempotency

In [0]:
count_before = spark.table(TABLE_CUSTOMERS).count()

# Re-run COPY INTO: the glob currently matches only the day1 files, which are already loaded
spark.sql(f"""
COPY INTO {TABLE_CUSTOMERS}
FROM (
  SELECT 
    customer_id, first_name, last_name, email, phone,
    city, state, country,
    TO_DATE(registration_date, 'yyyy-MM-dd') as registration_date,
    customer_segment,
    current_timestamp() as _ingestion_timestamp
  FROM '{BATCH_SOURCE_PATH}/*'
)
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'false')
""")

In [0]:
count_after = spark.table(TABLE_CUSTOMERS).count()
display(count_after)

In [0]:
display(
    spark.createDataFrame([
        ("Before", count_before),
        ("After", count_after),
        ("Difference", count_after - count_before)
    ], ["State", "Count"])
)

### Demo: Adding More Days

In [0]:
# Simulate arrival of the remaining days before re-running COPY INTO
df_batch_day2.write.mode("overwrite").option("header", "true").csv(f"{BATCH_SOURCE_PATH}/day2")

In [0]:
df_batch_day3.write.mode("overwrite").option("header", "true").csv(f"{BATCH_SOURCE_PATH}/day3")

In [0]:
df_batch_day4.write.mode("overwrite").option("header", "true").csv(f"{BATCH_SOURCE_PATH}/day4")

In [0]:
# Load all days: day1 files are skipped (already loaded), only the new files are ingested
result = spark.sql(f"""
COPY INTO {TABLE_CUSTOMERS}
FROM (
  SELECT 
    customer_id,
    first_name,
    last_name,
    email,
    phone,
    city,
    state,
    country,
    TO_DATE(registration_date, 'yyyy-MM-dd') as registration_date,
    customer_segment,
    current_timestamp() as _ingestion_timestamp
  FROM '{BATCH_SOURCE_PATH}/*'
)
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'false')
COPY_OPTIONS ('mergeSchema' = 'true')
""")

display(result)

In [0]:
display(spark.table(TABLE_CUSTOMERS).count())

display(spark.table(TABLE_CUSTOMERS))

## Auto Loader — Streaming Ingestion

Auto Loader (`cloudFiles`) provides scalable, checkpoint-based streaming ingestion from cloud storage with built-in schema evolution and exactly-once guarantees.

> **Exam Tip:** Auto Loader uses `cloudFiles` format with **checkpoint-based** exactly-once processing. It's the **recommended** method for new ingestion pipelines.

---

### Trigger Modes & Output Modes

| Trigger Mode | Behavior | Use Case |
|------|----------|----------|
| `availableNow=True` | Process all available → stop | Scheduled jobs |
| `processingTime="10 seconds"` | Run a micro-batch every N seconds | Near-real-time |
| `once=True` | Process one micro-batch → stop (deprecated; use `availableNow`) | Legacy |

<img src="../../../assets/images/113d4c6273584dc6aa6882e2afe85d0b.png" width="800">

| Output Mode | Description | Use Case |
|------|-------------|----------|
| **Append** | Only new rows | Raw data ingestion (stateless) |
| **Update** | Only updated rows | Aggregations (stateful) |
| **Complete** | Entire result rewritten | Small aggregations |

---
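
To make the difference between the three output modes concrete, here is a minimal pure-Python simulation of a running count per customer across two micro-batches. It is an illustrative sketch only: the `run` helper and the sample data are hypothetical, and append mode is shown in its stateless pass-through form rather than as a watermarked aggregation.

```python
from collections import Counter


def run(mode: str, batches: list) -> list:
    """Return what the sink would receive after each micro-batch for a given output mode."""
    state = Counter()  # running aggregate kept across micro-batches
    outputs = []
    for batch in batches:
        touched = Counter(batch)  # keys that appear in this micro-batch
        state.update(touched)
        if mode == "append":
            outputs.append(list(batch))  # stateless: only the new input rows
        elif mode == "update":
            outputs.append({k: state[k] for k in touched})  # only changed results
        elif mode == "complete":
            outputs.append(dict(state))  # the entire result, every time
        else:
            raise ValueError(f"unknown output mode: {mode}")
    return outputs


# Hypothetical stream of customer_id values arriving in two micro-batches
batches = [["C1", "C2", "C1"], ["C2", "C3"]]

append_out = run("append", batches)
update_out = run("update", batches)
complete_out = run("complete", batches)

for name, out in [("append", append_out), ("update", update_out), ("complete", complete_out)]:
    print(name, out)
```

In the second micro-batch, update mode emits only `C2` and `C3`, while complete mode re-emits `C1` as well even though its count did not change.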

In [0]:
try:
    dbutils.fs.rm(CHECKPOINT_BASE_PATH, True)
    dbutils.fs.rm(SCHEMA_BASE_PATH, True)
    dbutils.fs.rm(BAD_RECORDS_PATH, True)
except Exception:
    pass  # Paths may not exist on a first run

In [0]:
TARGET_TABLE_AL = f"{BRONZE_SCHEMA}.orders_autoloader"
CHECKPOINT_AL = f"{CHECKPOINT_BASE_PATH}/autoloader"
SCHEMA_AL = f"{SCHEMA_BASE_PATH}/autoloader"

**Auto Loader readStream configuration:**

### Auto Loader Configuration Options

| Category | Key Options |
|---|---|
| **Common** | `cloudFiles.format`, `cloudFiles.schemaLocation`, `cloudFiles.includeExistingFiles` |
| **Schema** | `cloudFiles.inferColumnTypes`, `cloudFiles.schemaEvolutionMode` (`addNewColumns`, `rescue`, `failOnNewColumns`) |
| **File Discovery** | `cloudFiles.useIncrementalListing`, `cloudFiles.maxFilesPerTrigger` |
| **Notification** | `cloudFiles.useNotifications` (cloud-specific: SQS, Event Grid) |

[Full options reference](https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/auto-loader/options/)

In [0]:
spark.sql(f"DROP TABLE IF EXISTS {TARGET_TABLE_AL}")

In [0]:
df_autoloader = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", SCHEMA_AL)
    .option("cloudFiles.inferColumnTypes", "true")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load(STREAM_SOURCE_PATH) # Reading from our simulated stream source
)

**Adding metadata columns:**

In [0]:
from pyspark.sql.functions import col

df_enriched = (df_autoloader
    .withColumn("_processing_time", F.current_timestamp())
    .withColumn("_source_file", col("_metadata.file_path"))
)

**Start streaming with `availableNow` trigger:**

> `availableNow` - processes all available data and stops (batch-like streaming)

In [0]:
query = (df_enriched.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", CHECKPOINT_AL)
    .trigger(availableNow=True)
    .toTable(TARGET_TABLE_AL)
)
query.awaitTermination()  # The query runs asynchronously; wait for it to finish before reading
display(spark.table(TARGET_TABLE_AL))

**Auto Loader results:**

In [0]:
display(
    spark.createDataFrame([
        ("Records Loaded", str(spark.table(TARGET_TABLE_AL).count())),
        ("Source Files", str(spark.table(TARGET_TABLE_AL).select("_source_file").distinct().count()))
    ], ["Metric", "Value"])
)

### Demo: Incremental Processing — Add New Data

In [0]:
# Simulate arrival of Batch 2; the next stream run picks up only this new data
stream_batches[1].write.mode("overwrite").json(f"{STREAM_SOURCE_PATH}/batch_02")

### Demo: Continuous Processing (processingTime)

In [0]:
spark.sql(f"DROP TABLE IF EXISTS {TARGET_TABLE_AL}_continuous")

In [0]:
# Start stream in background
query_continuous = (df_enriched.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", f"{CHECKPOINT_AL}_continuous")
    .trigger(processingTime="10 seconds")
    .toTable(f"{TARGET_TABLE_AL}_continuous")
)

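print("Stream started... Waiting for the first micro-batch.")
# Wait until the first micro-batch reports progress so the target table exists
while query_continuous.isActive and not query_continuous.recentProgress:
    time.sleep(1)
display(spark.table(f"{TARGET_TABLE_AL}_continuous").count())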

In [0]:
# Simulate arrival of batches 3 to 10 (batches 1 and 2 are already in the source path)
print("Starting data simulation...")

for i in range(2, 10): # List indices 2..9 -> batches 3..10
    batch_num = i + 1
    print(f"Arriving: Batch {batch_num}...", end=" ")
    
    # Write next batch
    stream_batches[i].write.mode("overwrite").json(f"{STREAM_SOURCE_PATH}/batch_{batch_num:02d}")
    
    print("Done. Waiting for stream...")
    time.sleep(4) # Pause between simulated arrivals

print("All batches arrived.")
query_continuous.processAllAvailable() # Block until the files written so far are processed
display(spark.table(f"{TARGET_TABLE_AL}_continuous").count())
display(spark.table(f"{TARGET_TABLE_AL}_continuous"))

In [0]:
# Stop stream
query_continuous.stop()

## Stream-Static Joins & Aggregations

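A stream-static join enriches a stream (e.g., Orders) with a static table (e.g., Customers). The static table is re-read at the start of each micro-batch, so new stream rows are joined against its latest snapshot; rows that were already emitted are not updated retroactively.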

---
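
The per-micro-batch re-read can be sketched in plain Python. This is a conceptual model only: `load_customers`, the version numbers, and the sample rows are hypothetical stand-ins for reading the latest snapshot of the static table.

```python
def load_customers(version: int) -> dict:
    """Hypothetical static-table reader: returns the table contents at a given version."""
    table_versions = {
        0: {"C1": "Ann"},
        1: {"C1": "Ann", "C2": "Bob"},  # C2 was added between micro-batches
    }
    return table_versions[version]


def left_join(orders: list, customers: dict) -> list:
    """Left join: every order is kept; unmatched customers yield None."""
    return [(order_id, cust_id, customers.get(cust_id)) for order_id, cust_id in orders]


micro_batches = [
    [("O1", "C1"), ("O2", "C2")],  # arrives while the static table is at version 0
    [("O3", "C2")],                # arrives after the static table reached version 1
]

results = []
for version, batch in enumerate(micro_batches):
    customers = load_customers(version)  # static side is re-read for each micro-batch
    results.append(left_join(batch, customers))

for batch_result in results:
    print(batch_result)
```

Order `O2` keeps its missing customer name even though `C2` appears later, because the join is stateless: only rows in later micro-batches see the newer snapshot.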

In [0]:
# Load Static Table
df_static_customers = spark.table(TABLE_CUSTOMERS)
display(df_static_customers)

### Demo: Stream-Static Join (Append Mode)

In [0]:
spark.sql("DROP TABLE IF EXISTS jointed_orders")

In [0]:
df_enriched.createOrReplaceTempView("enriched_orders_stream")
df_static_customers.createOrReplaceTempView("static_customers")

df_joined = spark.sql("""
SELECT
  o.order_id,
  o.total_amount,
  c.first_name,
  c.last_name,
  c.email,
  o._processing_time
FROM enriched_orders_stream o
LEFT JOIN static_customers c
  ON o.customer_id = c.customer_id
""")

In [0]:
# Write enriched stream
query_join = (df_joined.writeStream
    .format("memory")
    .queryName("enriched_orders")
    .outputMode("append") # Joins (Left Stream-Static) are append-only
    .option("checkpointLocation", f"{CHECKPOINT_BASE_PATH}/join_demo_3")
    .trigger(processingTime="5 seconds")
    .start()
)

In [0]:
# Simulate arrival of two more batches (12 and 13)
print("Starting data simulation...")

for i in range(11, 13): # List indices 11..12 -> batches 12 and 13
    batch_num = i + 1
    print(f"Arriving: Batch {batch_num}...", end=" ")
    
    # Write next batch
    stream_batches[i].write.mode("overwrite").json(f"{STREAM_SOURCE_PATH}/batch_{batch_num:02d}")
    
    print("Done. Waiting for stream...")
    time.sleep(4) # Wait for trigger to pick it up

print("All batches arrived.")
display(spark.table("enriched_orders"))

In [0]:
query_join.stop()
display(spark.sql("SELECT count(1) FROM enriched_orders "))

### Demo: Streaming Aggregation (Watermarking)

In [0]:
# 1. Define Streaming Aggregation (Orders per 30 seconds)
# We use the previously defined df_enriched (from Auto Loader)

windowed_counts = (df_enriched
    .withWatermark("_processing_time", "1 minute") # Tolerate data arriving up to 1 minute late
    .groupBy(
        F.window("_processing_time", "30 seconds"),
        "customer_id"
    )
    .count()
)

# 2. Write Stream with UPDATE mode
# Update mode is efficient for aggregations - it emits only changed windows
query_agg = (windowed_counts.writeStream
    .format("console") 
    .queryName("orders_counts")
    .outputMode("update")
    .option("checkpointLocation", f"{CHECKPOINT_BASE_PATH}/agg_demo")
    .trigger(processingTime="5 seconds")
    .start()
)

print("Aggregation stream started...")

In [0]:
# Simulate arrival of two more batches (15 and 16)
print("Starting data simulation...")

for i in range(14, 16): # List indices 14..15 -> batches 15 and 16
    batch_num = i + 1
    print(f"Arriving: Batch {batch_num}...", end=" ")
    
    # Write next batch
    stream_batches[i].write.mode("overwrite").json(f"{STREAM_SOURCE_PATH}/batch_{batch_num:02d}")
    
    print("Done. Waiting for stream...")
    time.sleep(4) # Wait for trigger to pick it up

print("All batches arrived.")
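# Note: display() on a streaming DataFrame starts its own streaming query; stop it manually when done
display(windowed_counts)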

In [0]:
# Stop for demo purposes
query_agg.stop()
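
The window and watermark arithmetic used in this demo can be traced with a small pure-Python model. It is a simplified sketch under stated assumptions: the timestamps are made-up event times (the demo itself uses `_processing_time`), the helper names are hypothetical, and the watermark is advanced once per micro-batch from the maximum timestamp seen so far.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(seconds=30)          # tumbling window length
WATERMARK_DELAY = timedelta(minutes=1)  # how late data may arrive
EPOCH = datetime(1970, 1, 1)


def window_start(ts: datetime) -> datetime:
    """Align a timestamp down to the start of its 30-second tumbling window."""
    seconds = int((ts - EPOCH).total_seconds())
    return EPOCH + timedelta(seconds=seconds - seconds % int(WINDOW.total_seconds()))


def aggregate(batches: list):
    """Count events per window, dropping events whose window is older than the watermark."""
    counts, dropped, max_seen = {}, [], None
    for batch in batches:
        # The watermark for this micro-batch is based on data seen in earlier batches
        watermark = max_seen - WATERMARK_DELAY if max_seen is not None else None
        for ts in batch:
            start = window_start(ts)
            if watermark is not None and start + WINDOW <= watermark:
                dropped.append(ts)  # window already finalized: the event is too late
                continue
            counts[start] = counts.get(start, 0) + 1
        if batch:
            max_seen = max(batch) if max_seen is None else max(max_seen, max(batch))
    return counts, dropped


t0 = datetime(2024, 1, 1, 12, 0, 0)  # hypothetical reference time
batches = [
    [t0 + timedelta(seconds=5), t0 + timedelta(seconds=40)],
    [t0 + timedelta(minutes=3)],
    [t0 + timedelta(seconds=10), t0 + timedelta(seconds=150)],  # first event is very late
]

counts, dropped = aggregate(batches)
print(len(counts), "windows;", len(dropped), "late event(s) dropped")
```

By the third micro-batch the watermark has reached 12:02:00, so the event stamped 12:00:10 belongs to a window that has already closed and is discarded, while the 12:02:30 event is still counted.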

## Error Handling

Databricks provides multiple strategies for handling malformed, corrupted, or schema-mismatched records during ingestion.

| Mode | Behavior |
|------|----------|
| `PERMISSIVE` | Parses what it can, errors → `_corrupt_record` |
| `DROPMALFORMED` | Removes malformed records |
| `FAILFAST` | Stops on first error |

> **Best Practice:** Use `badRecordsPath` to save malformed records for later analysis.

---
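
The three parser modes in the table above can be imitated with the standard `json` module. This is a rough sketch of the behaviour, not Spark's reader: `read_json_lines` and the sample lines are hypothetical.

```python
import json

VALID_MODES = {"PERMISSIVE", "DROPMALFORMED", "FAILFAST"}


def read_json_lines(lines: list, mode: str = "PERMISSIVE") -> list:
    """Parse JSON lines, handling malformed records according to the chosen mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode}")
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError("not a JSON object")
        except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
            if mode == "PERMISSIVE":
                rows.append({"_corrupt_record": line})  # keep the raw text for inspection
            elif mode == "FAILFAST":
                raise ValueError(f"malformed record: {line!r}") from err
            # DROPMALFORMED: silently skip the record
            continue
        rows.append({**record, "_corrupt_record": None})
    return rows


lines = ['{"order_id": "1"}', '{"order_id": ', '{"order_id": "2"}']

permissive = read_json_lines(lines, "PERMISSIVE")
dropped = read_json_lines(lines, "DROPMALFORMED")
try:
    read_json_lines(lines, "FAILFAST")
    failed = False
except ValueError:
    failed = True

print(len(permissive), len(dropped), failed)
```

`PERMISSIVE` preserves the row count and the offending text, which is why it is the usual choice for a Bronze layer where nothing should be lost silently.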

### Schema Evolution & Rescued Data

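| `schemaEvolutionMode` | Behavior |
|---|---|
| `addNewColumns` | Stream fails when a new column appears; the column is added to the tracked schema and picked up on restart |
| `rescue` | Schema never evolves; new columns go to the `_rescued_data` JSON column |
| `failOnNewColumns` | Stream fails on new columns until the schema is updated or the offending file is removed |
| `none` | Schema never evolves; new columns are ignored |

> **Exam Tip:** The `_rescued_data` column captures data that could not be parsed into the schema: columns missing from the schema, type mismatches, and column-name case mismatches. In `rescue` mode, new columns land there instead of evolving the schema.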

---
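
How a record is split between typed columns and the rescued-data column can be illustrated with a simplified pure-Python model. The `apply_schema` helper and its strict `isinstance` type check are assumptions made for illustration; the real `_rescued_data` column is produced by Auto Loader and may contain additional details.

```python
import json

# Hypothetical target schema: column name -> expected Python type
SCHEMA = {"order_id": str, "customer_id": str, "total_amount": float}


def apply_schema(record: dict, schema: dict) -> dict:
    """Keep values that fit the schema; move everything else into _rescued_data."""
    row = {name: None for name in schema}
    rescued = {}
    for key, value in record.items():
        if key not in schema:
            rescued[key] = value  # column is missing from the schema
        elif isinstance(value, schema[key]):
            row[key] = value      # value matches the expected type
        else:
            rescued[key] = value  # type mismatch
    row["_rescued_data"] = json.dumps(rescued, sort_keys=True) if rescued else None
    return row


good = apply_schema({"order_id": "1", "customer_id": "C1", "total_amount": 19.5}, SCHEMA)
bad = apply_schema(
    {"order_id": "99999", "total_amount": "INVALID_NUMBER", "new_col": "surprise"}, SCHEMA
)

print(good)
print(bad)
```

The clean record gets a null `_rescued_data`, while the bad record keeps its valid `order_id` and carries both the unparseable amount and the unexpected column as JSON, so nothing is lost.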

In [0]:
TARGET_TABLE_RESCUE = f"{BRONZE_SCHEMA}.orders_rescued"
CHECKPOINT_RESCUE = f"{CHECKPOINT_BASE_PATH}/rescue"
SCHEMA_RESCUE = f"{SCHEMA_BASE_PATH}/rescue"

spark.sql(f"DROP TABLE IF EXISTS {TARGET_TABLE_RESCUE}")

**Define explicit schema (partial):**

In [0]:
# Deliberately define only some columns - rest will go to _rescued_data
partial_schema = StructType([
    StructField("order_id", StringType(), True),
    StructField("customer_id", StringType(), True),
    StructField("total_amount", DoubleType(), True)
])

**Auto Loader with rescue mode:**

In [0]:
# Create BAD data (Extra column + Type mismatch)
bad_data = [{"order_id": 99999, "total_amount": "INVALID_NUMBER", "new_col": "surprise"}]
spark.createDataFrame(bad_data).write.mode("overwrite").json(f"{STREAM_SOURCE_PATH}/bad_data")

In [0]:
df_rescue = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", SCHEMA_RESCUE)
    .option("cloudFiles.schemaEvolutionMode", "rescue")  # Rescue mode!
    .schema(partial_schema)  # Partial schema
    .load(STREAM_SOURCE_PATH)
)

**Start stream:**

In [0]:
query_rescue = (df_rescue.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", CHECKPOINT_RESCUE)
    .trigger(availableNow=True)
    .toTable(TARGET_TABLE_RESCUE)
)
query_rescue.awaitTermination()  # Wait for the availableNow run to finish before reading
display(spark.table(TARGET_TABLE_RESCUE))

**Schema with `_rescued_data` column:**

In [0]:
spark.table(TARGET_TABLE_RESCUE).printSchema()

**Data with rescued columns:**

In [0]:
display(
    spark.table(TARGET_TABLE_RESCUE)
    .limit(5)
)

### Demo: badRecordsPath

In [0]:
TABLE_ERRORS = f"{BRONZE_SCHEMA}.customers_with_validation"

**Creating table with `_corrupt_record`:**

In [0]:
spark.sql(f"DROP TABLE IF EXISTS {TABLE_ERRORS}")

spark.sql(f"""
CREATE TABLE {TABLE_ERRORS} (
  customer_id STRING,
  first_name STRING,
  last_name STRING,
  email STRING,
  phone STRING,
  city STRING,
  state STRING,
  country STRING,
  registration_date DATE,
  customer_segment STRING,
  _corrupt_record STRING,
  _ingestion_timestamp TIMESTAMP
) USING DELTA
""")

**Loading with error handling:**

In [0]:
# Create a CSV with a bad row
bad_csv_data = [
    (999, "Eve", "2023-01-03"),
    (888, "Frank", "NOT_A_DATE") # This will fail date parsing
]
spark.createDataFrame(bad_csv_data, ["customer_id", "first_name", "registration_date"]).write.mode("overwrite").option("header", "true").csv(f"{BATCH_SOURCE_PATH}/bad_csv")

In [0]:
df_with_errors = (
    spark.read
    .format("csv")
    .option("header", "true")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .option("badRecordsPath", BAD_RECORDS_PATH)
    .schema("""
        customer_id STRING,
        first_name STRING,
        registration_date DATE,
        _corrupt_record STRING
    """)
    .load(f"{BATCH_SOURCE_PATH}/bad_csv")
    .withColumn("_ingestion_timestamp", F.current_timestamp())
)
display(df_with_errors)

**Bad records statistics:**

In [0]:
files = [f.path for f in dbutils.fs.ls(BAD_RECORDS_PATH)]
files = [f + "*" for f in files]
print(files)

In [0]:
from pyspark.sql.types import StructType, StructField, StringType

bad_record_schema = StructType([
    StructField("path", StringType(), True),
    StructField("record", StringType(), True),
    StructField("reason", StringType(), True)
])

df_bad_records = (
    spark.read
    .format("json")
    .schema(bad_record_schema)
    .load(files)
)
display(df_bad_records)

## Lakeflow Connect (Informational)

Zero-code SaaS ingestion via UI: Salesforce, Workday, HubSpot, SAP, ServiceNow, etc.

| Method | Use Case |
|--------|----------|
| **COPY INTO** | Files in cloud storage (batch) |
| **Auto Loader** | Files in cloud storage (streaming) |
| **Lakeflow Connect** | Data from SaaS systems |
| **Lakeflow Declarative Pipelines** | Transformations Bronze → Silver → Gold |

---

## Summary

| Topic | Key Concept | Exam Keywords |
|---|---|---|
| **COPY INTO** | Idempotent batch loading, file tracking | `COPY INTO`, `FILEFORMAT`, `mergeSchema` |
| **Auto Loader** | `cloudFiles` streaming ingestion | `cloudFiles.format`, `schemaLocation`, `schemaEvolutionMode` |
| **Trigger Modes** | `availableNow=True`, `processingTime` | Batch-like vs continuous |
| **Schema Evolution** | `rescue`, `addNewColumns`, `failOnNewColumns` | `_rescued_data` column |
| **Stream-Static Join** | Enrich stream with dimension table | Append mode, static reload |
| **Watermarking** | State cleanup in aggregations | `withWatermark()`, late data handling |
| **Lakeflow Connect** | Zero-code SaaS ingestion | Salesforce, Workday, SAP |

---


In [0]:
# List of tables created in this module
created_tables = [
    "customer_cts",
    "customers_batch",
    "orders_autoloader",
    "orders_autoloader_continuous",
    "orders_rescued",
    "customers_with_validation"
]

## Cleanup

Remove demo tables, checkpoints, and temporary data created during this module.

---

In [0]:
results = []
for table in created_tables:
    full_table = f"{CATALOG}.{BRONZE_SCHEMA}.{table}"
    try:
        if spark.catalog.tableExists(full_table):
            count = spark.table(full_table).count()
            results.append((table, "EXISTS", str(count)))
        else:
            results.append((table, "NOT FOUND", "-"))
    except Exception as e:
        results.append((table, "ERROR", str(e)[:30]))

display(spark.createDataFrame(results, ["Table", "Status", "Records"]))

In [0]:
# Cleanup flag
CLEANUP_ENABLED = False

**Execute cleanup (if enabled):**

In [0]:
if CLEANUP_ENABLED:
    results = []
    for table in created_tables:
        full_table = f"{CATALOG}.{BRONZE_SCHEMA}.{table}"
        try:
            spark.sql(f"DROP TABLE IF EXISTS {full_table}")
            results.append((table, "DROPPED"))
        except Exception as e:
            results.append((table, f"ERROR: {str(e)[:30]}"))
    
    # Cleanup checkpoints
    try:
        dbutils.fs.rm(CHECKPOINT_BASE_PATH, True)
        results.append(("checkpoints", "REMOVED"))
    except:
        results.append(("checkpoints", "NOT FOUND"))
    
    display(spark.createDataFrame(results, ["Resource", "Status"]))
else:
    display(spark.createDataFrame([
        ("CLEANUP_ENABLED", "False"),
        ("Action", "Change to True to delete resources")
    ], ["Setting", "Value"]))