# M05: Incremental Processing

## 5.1. The Story: Real-Time Analytics

Your e-commerce platform needs to process orders in real-time. The batch jobs are too slow (T-1 latency). Marketing needs to see campaign performance *now*, not tomorrow. You need to build a streaming pipeline that ingests data as it arrives, processes it incrementally, and handles schema changes automatically.

---

## 5.2. Per-user Isolation

In [0]:
%run ../../setup/00_setup

## 5.3. Configuration

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.types import *
from datetime import datetime
import time

**Paths and variables configuration:**

In [0]:
# Set default catalog and schema
spark.sql(f"USE CATALOG {CATALOG}")
spark.sql(f"USE SCHEMA {BRONZE_SCHEMA}")

# === SOURCE DATA (REAL DATASET) ===
SOURCE_CUSTOMERS = f"{DATASET_PATH}/customers/customers.csv"
SOURCE_ORDERS = f"{DATASET_PATH}/orders/orders_batch.json"

# === DEMO PATHS (SIMULATED ARRIVAL) ===
DEMO_BASE_PATH = f"{DATASET_PATH}/ingestion_demo"
BATCH_SOURCE_PATH = f"{DEMO_BASE_PATH}/batch_source"
STREAM_SOURCE_PATH = f"{DEMO_BASE_PATH}/stream_source"

# === TECHNICAL PATHS ===
CHECKPOINT_BASE_PATH = f"{DEMO_BASE_PATH}/checkpoints"
SCHEMA_BASE_PATH = f"{DEMO_BASE_PATH}/schemas"
BAD_RECORDS_PATH = f"{DEMO_BASE_PATH}/bad_records"

# Cleanup from previous runs
dbutils.fs.rm(DEMO_BASE_PATH, True)
print(f"Demo environment prepared at: {DEMO_BASE_PATH}")


**Cleanup from previous runs (for demo):**

In [0]:
try:
    dbutils.fs.rm(CHECKPOINT_BASE_PATH, True)
    dbutils.fs.rm(SCHEMA_BASE_PATH, True)
    dbutils.fs.rm(BAD_RECORDS_PATH, True)
    dbutils.fs.rm(DEMO_BASE_PATH, True)
except Exception:
    # Paths may not exist on a first run; ignore and continue
    pass

**Configuration verification:**

In [0]:
display(
    spark.createDataFrame([
        ("CATALOG", CATALOG),
        ("BRONZE_SCHEMA", BRONZE_SCHEMA),
        ("SILVER_SCHEMA", SILVER_SCHEMA),
        ("USER", raw_user),
        ("CUSTOMERS_CSV", SOURCE_CUSTOMERS),
        ("STREAMING_SOURCE_PATH", STREAM_SOURCE_PATH)
    ], ["Variable", "Value"])
)

In [0]:
# === DATA PREPARATION (SIMULATION) ===

# 1. Prepare Batch Data (Customers)
# We split customers into 4 roughly equal parts (days) for the COPY INTO demo
df_customers = spark.read.option("header", "true").csv(SOURCE_CUSTOMERS)
df_batch_day1, df_batch_day2, df_batch_day3, df_batch_day4 = df_customers.randomSplit([0.25] * 4, seed=42)

# Save Day 1 immediately
df_batch_day1.write.mode("overwrite").option("header", "true").csv(f"{BATCH_SOURCE_PATH}/day1")
print(f"Batch Data: Day 1 ready at {BATCH_SOURCE_PATH}/day1")

# 2. Prepare Streaming Data (Orders)
# We take existing stream files, merge them, and split into 20 micro-batches for simulation
SOURCE_STREAM_FILES = f"{DATASET_PATH}/orders/stream/*.json"
df_all_orders = spark.read.json(SOURCE_STREAM_FILES)

# Split into 20 parts (5% each)
stream_batches = df_all_orders.randomSplit([0.05] * 20, seed=42)

# Save Batch 1 immediately to start the stream
stream_batches[0].write.mode("overwrite").json(f"{STREAM_SOURCE_PATH}/batch_01")
print(f"Stream Data: Batch 1 ready at {STREAM_SOURCE_PATH}/batch_01")


### 5.3.1. Databricks Data Loading Methods Overview

Databricks provides three primary methods for loading data into Delta tables, each optimized for different use cases:

---

![alt text](../../assets/images/image_load.avif)

### **1. CTAS (Create Table As Select)**

**Purpose**: Quick table creation from existing data or queries

**Characteristics**:
- One-time operation (not incremental)
- Creates table and loads data in a single statement
- No built-in deduplication or file tracking
- Best for data transformations and one-off loads

**Use Cases**:
- Creating derived tables from existing data
- Data transformation and aggregation
- Ad-hoc analysis tables
- Quick prototyping

**Limitations**:
- No automatic tracking of processed files
- Re-running fails if the table already exists (`CREATE TABLE`) or rewrites the whole table (`CREATE OR REPLACE TABLE`)
- Not suitable for incremental loads

---

### **2. COPY INTO**

**Purpose**: Idempotent batch ingestion with automatic file tracking

**Characteristics**:
- **Idempotency**: Tracks processed files automatically
- **Incremental**: Only loads new files on subsequent runs
- **File Formats**: CSV, JSON, Parquet, Avro, ORC
- **Schema Evolution**: Optional merge schema support
- **Error Handling**: Built-in options for malformed records

**Use Cases**:
- Scheduled batch ingestion (daily, hourly)
- Loading from cloud storage (S3, ADLS, GCS)
- Incremental file-based loads
- Data lake ingestion where files arrive periodically

**Advantages**:
- Safe to re-run (no duplicates)
- Simple syntax with powerful options
- Good for small to medium file volumes
- Built-in file tracking via metadata

---

### **3. Auto Loader (cloudFiles)**

**Purpose**: Scalable streaming ingestion with automatic file discovery

**Characteristics**:
- **Streaming Architecture**: Processes files incrementally
- **File Discovery**: Automatic detection of new files
- **Schema Management**: Inference, evolution, and rescue modes
- **Checkpoint Management**: Fault-tolerant processing
- **Scalability**: Handles millions of files efficiently
- **Cloud-Native**: Uses file notification services (optional)

**Use Cases**:
- Real-time or near-real-time ingestion
- Large-scale file ingestion (millions of files)
- Unknown or evolving schemas
- Low-latency requirements
- Continuous data arrival

**Advantages**:
- Best scalability (millions of files)
- Automatic schema evolution
- Rescued data for unexpected changes
- Efficient file discovery (notification-based)
- Exactly-once processing semantics

---

### **Comparison Matrix**

| Feature | CTAS | COPY INTO | Auto Loader |
|---------|------|-----------|-------------|
| **Incremental** | [ERROR] No | [OK] Yes | [OK] Yes |
| **Idempotent** | [ERROR] No | [OK] Yes | [OK] Yes |
| **Schema Evolution** | [ERROR] No | [WARNING] Limited | [OK] Advanced |
| **File Tracking** | [ERROR] No | [OK] Metadata | [OK] Checkpoint |
| **Scalability** | Low | Medium | High |
| **Streaming** | [ERROR] No | [ERROR] No | [OK] Yes |
| **Error Handling** | Basic | Good | Advanced |
| **Use Case** | One-time | Scheduled batch | Real-time/Streaming |

---

### **Decision Guide**

**Choose CTAS when:**
- Creating tables from existing data
- One-time loads or transformations
- Simple, ad-hoc operations

**Choose COPY INTO when:**
- Scheduled batch ingestion (e.g., daily files)
- Need idempotency without streaming overhead
- Small to medium file volumes (<100K files)
- Simple incremental loads

**Choose Auto Loader when:**
- Real-time or near-real-time processing required
- Large number of files (>100K)
- Schema may change over time
- Need advanced error handling and rescued data
- Building production streaming pipelines
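
The decision guide above can be written down as a small helper. This is only a sketch: the function name and its parameters are hypothetical, and the 100K threshold is the rule of thumb from the guide, not a hard platform limit.

```python
def choose_ingestion_method(file_count, needs_streaming=False,
                            schema_changes=False, one_time=False):
    """Encode the decision guide as a function (illustrative only)."""
    if one_time:
        # One-off loads and transformations
        return "CTAS"
    if needs_streaming or schema_changes or file_count > 100_000:
        # Streaming, evolving schemas, or very large file counts
        return "Auto Loader"
    # Scheduled, idempotent batch loads of small to medium volumes
    return "COPY INTO"


print(choose_ingestion_method(file_count=500))
print(choose_ingestion_method(file_count=2_000_000))
print(choose_ingestion_method(file_count=10, one_time=True))
```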

---

### 5.3.2 CTAS (Create Table As Select)

In [0]:
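# Example: Load CSV from volume to customer_cts table using CTAS
# Note: csv.`path` reads with default options, so the header row is loaded
# as data and the columns are named _c0, _c1, ...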

table_name = "customer_cts"

display(spark.sql(f"""
CREATE OR REPLACE TABLE {table_name} AS
SELECT *
FROM csv.`{SOURCE_CUSTOMERS}`
"""))

In [0]:
%sql

select * from customer_cts

In [0]:
display(spark.sql(f"""
CREATE OR REPLACE TABLE {table_name}_copy AS
SELECT *
FROM {table_name} 
"""))

In [0]:
%sql

select * from customer_cts_copy

## 5.4. COPY INTO - Batch Loading

**COPY INTO** is the recommended batch ingestion method:
- **Idempotency**: Automatic tracking of processed files
- **File tracking**: Only new files are loaded on re-run
- **Format support**: CSV, JSON, Parquet, Avro, ORC
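
To make the idempotency idea concrete, here is a plain-Python toy model of file tracking. It is a sketch of the concept only, not how COPY INTO is implemented: the class `FileLedgerTable` and its method are hypothetical names, and source files are modeled as a dict of file name to rows.

```python
class FileLedgerTable:
    """Toy model of a table that remembers which source files it has loaded."""

    def __init__(self):
        self.rows = []             # loaded records
        self.loaded_files = set()  # ledger of already-ingested file names

    def copy_into(self, files):
        """Load only files not seen before; return the number of rows added."""
        added = 0
        for name in sorted(files):
            if name in self.loaded_files:
                continue  # already ingested: skipping makes re-runs safe
            self.rows.extend(files[name])
            self.loaded_files.add(name)
            added += len(files[name])
        return added


source = {"day1.csv": ["alice", "bob"]}
table = FileLedgerTable()
first = table.copy_into(source)    # loads day1
second = table.copy_into(source)   # nothing new, nothing loaded
source["day2.csv"] = ["carol"]
third = table.copy_into(source)    # loads only day2
print(first, second, third)
```

The second call adds no rows because the ledger already contains `day1.csv`. That is the behavior the idempotency example below demonstrates on a real table.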


![alt text](../../assets/images/DKXKibszdkNP36Gme2Oks.png)

---

### M05_T5.1. Example: COPY INTO from CSV

In [0]:
TABLE_CUSTOMERS = f"{BRONZE_SCHEMA}.customers_batch"

**Creating target table:**

In [0]:
spark.sql(f"DROP TABLE IF EXISTS {TABLE_CUSTOMERS}")

In [0]:
spark.sql(f"""
CREATE TABLE IF NOT EXISTS {TABLE_CUSTOMERS} (
  customer_id STRING,
  first_name STRING,
  last_name STRING,
  email STRING,
  phone STRING,
  city STRING,
  state STRING,
  country STRING,
  registration_date DATE,
  customer_segment STRING,
  _ingestion_timestamp TIMESTAMP
) USING DELTA
COMMENT 'Customers data - Bronze layer'
""")


**Execute COPY INTO:**

In [0]:
# Load Day 1 data
result = spark.sql(f"""
COPY INTO {TABLE_CUSTOMERS}
FROM (
  SELECT 
    customer_id,
    first_name,
    last_name,
    email,
    phone,
    city,
    state,
    country,
    TO_DATE(registration_date, 'yyyy-MM-dd') as registration_date,
    customer_segment,
    current_timestamp() as _ingestion_timestamp
  FROM '{BATCH_SOURCE_PATH}/day1'
)
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'false')
COPY_OPTIONS ('mergeSchema' = 'true')
""")

display(result)


In [0]:
display(spark.table(TABLE_CUSTOMERS))

### M05_T5.2. Idempotency

Running COPY INTO again will **not** add duplicates:

In [0]:
count_before = spark.table(TABLE_CUSTOMERS).count()

# Re-run COPY INTO (the glob still matches only the day1 files that were already loaded)
spark.sql(f"""
COPY INTO {TABLE_CUSTOMERS}
FROM (
  SELECT 
    customer_id, first_name, last_name, email, phone,
    city, state, country,
    TO_DATE(registration_date, 'yyyy-MM-dd') as registration_date,
    customer_segment,
    current_timestamp() as _ingestion_timestamp
  FROM '{BATCH_SOURCE_PATH}/*'
)
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'false')
""")



In [0]:
count_after = spark.table(TABLE_CUSTOMERS).count()
display(count_after)

**Comparison:**

In [0]:
display(
    spark.createDataFrame([
        ("Before", count_before),
        ("After", count_after),
        ("Difference", count_after - count_before)
    ], ["State", "Count"])
)

### M05_T5.3. Checking Results After Adding More Days

Let's check what happens when more days arrive. We write the remaining days to the source folder first, then re-run COPY INTO over the whole folder.

In [0]:
df_batch_day2.write.mode("overwrite").option("header", "true").csv(f"{BATCH_SOURCE_PATH}/day2")

In [0]:
df_batch_day3.write.mode("overwrite").option("header", "true").csv(f"{BATCH_SOURCE_PATH}/day3")

In [0]:
df_batch_day4.write.mode("overwrite").option("header", "true").csv(f"{BATCH_SOURCE_PATH}/day4")

In [0]:
# Load all days; the day1 files were already ingested and are skipped
result = spark.sql(f"""
COPY INTO {TABLE_CUSTOMERS}
FROM (
  SELECT 
    customer_id,
    first_name,
    last_name,
    email,
    phone,
    city,
    state,
    country,
    TO_DATE(registration_date, 'yyyy-MM-dd') as registration_date,
    customer_segment,
    current_timestamp() as _ingestion_timestamp
  FROM '{BATCH_SOURCE_PATH}/*'
)
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'false')
COPY_OPTIONS ('mergeSchema' = 'true')
""")

display(result)

In [0]:
display(spark.table(TABLE_CUSTOMERS).count())

display(spark.table(TABLE_CUSTOMERS))

## 5.5. Auto Loader - Streaming Ingestion

**Auto Loader (cloudFiles)** is a Databricks-managed streaming source:
- Automatic file discovery (directory listing by default, file notifications optionally)
- Incremental processing (only new files)
- Schema inference & evolution
- Checkpoint management
- Scales to millions of files




### 5.5.1. Trigger Modes & Output Modes

**Trigger** determines how often a streaming query executes micro-batches:

| Mode | Behavior | Use Case |
|------|----------|----------|
| `availableNow=True` | Process everything → stop | Scheduled jobs |
| `once=True` | Single micro-batch → stop (deprecated in favor of `availableNow`) | - |
| `processingTime="10 seconds"` | Every 10 seconds | Near-real-time |
| `continuous="1 second"` | Ultra-low latency | Experimental |
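
As a small illustration of how the two most common triggers are selected, the helper below builds the keyword arguments for `DataStreamWriter.trigger()`. The helper name `trigger_kwargs` is hypothetical. It covers only `availableNow` and `processingTime`.

```python
def trigger_kwargs(interval_seconds=None):
    """Return keyword arguments for DataStreamWriter.trigger().

    No interval: process everything available, then stop (scheduled jobs).
    With an interval: run a micro-batch every `interval_seconds` seconds.
    """
    if interval_seconds is None:
        return {"availableNow": True}
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return {"processingTime": f"{interval_seconds} seconds"}


print(trigger_kwargs())
print(trigger_kwargs(10))
```

In a notebook cell this would be used as `df.writeStream.trigger(**trigger_kwargs(10))`, which is equivalent to `.trigger(processingTime="10 seconds")`.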



![alt text](../../assets/images/12afdawfq2351fdawfaw.png)



**Output Mode** determines how data is written to the sink:

| Mode | Description | Use Case |
|------|-------------|----------|
| **Append** | Only new rows are written | Raw data ingestion (stateless) |
| **Update** | Only updated rows are written | Aggregations (stateful) |
| **Complete** | The entire result table is rewritten | Small aggregations (e.g. counts) |
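
The difference between the three output modes is easiest to see on a running count. The following plain-Python model is a simplification written for this notebook, not Spark code: `run_batches` is a hypothetical function, and append mode is shown only for the stateless case, where each input row is written once.

```python
from collections import Counter


def run_batches(batches, mode):
    """Return what the sink would receive per micro-batch in each output mode."""
    totals = Counter()  # running count per key (the "result table")
    emitted = []
    for batch in batches:
        if mode == "append":
            # Stateless ingestion: every new input row is written exactly once
            emitted.append(list(batch))
            continue
        totals.update(batch)
        if mode == "update":
            # Only keys whose count changed in this micro-batch
            emitted.append({key: totals[key] for key in set(batch)})
        elif mode == "complete":
            # The entire result table, every time
            emitted.append(dict(totals))
        else:
            raise ValueError(f"unknown mode: {mode}")
    return emitted


batches = [["a", "a", "b"], ["b", "c"]]
for mode in ("append", "update", "complete"):
    print(mode, run_batches(batches, mode))
```

In the second micro-batch, update mode emits only `b` and `c`, while complete mode emits `a` again even though it did not change.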

---

In [0]:
try:
    dbutils.fs.rm(CHECKPOINT_BASE_PATH, True)
    dbutils.fs.rm(SCHEMA_BASE_PATH, True)
    dbutils.fs.rm(BAD_RECORDS_PATH, True)
except Exception:
    # Paths may not exist yet; ignore and continue
    pass

In [0]:
TARGET_TABLE_AL = f"{BRONZE_SCHEMA}.orders_autoloader"
CHECKPOINT_AL = f"{CHECKPOINT_BASE_PATH}/autoloader"
SCHEMA_AL = f"{SCHEMA_BASE_PATH}/autoloader"

**Auto Loader readStream configuration:**

---
### Auto Loader (`cloudFiles`) Configuration Options

Auto Loader options are prefixed with `cloudFiles`. Key categories:

#### Common Options
- `cloudFiles.format`: File format (`csv`, `json`, `parquet`, `avro`, `orc`, etc.)
- `cloudFiles.schemaLocation`: Path to store the inferred schema and its later changes (separate from the stream's `checkpointLocation`)
- `cloudFiles.includeExistingFiles`: `true` to process existing files on first run

#### Directory Listing Options
- `cloudFiles.useIncrementalListing`: `auto` (default), `true`, or `false`; controls incremental directory listing (deprecated in recent runtimes)
- `cloudFiles.maxFilesPerTrigger`: Max files to process per micro-batch

#### File Notification Options
- `cloudFiles.useNotifications`: Enable file notification service (cloud-specific)
- `cloudFiles.subscriptionId`, `cloudFiles.queueName`, etc.: Notification service configs

#### File Format Options
- `cloudFiles.inferColumnTypes`: For CSV/JSON, infer column types (`true`/`false`)
- `cloudFiles.schemaEvolutionMode`: How schema changes are handled (`addNewColumns`, etc.)

#### CSV Options
Format-specific options use the standard Spark reader names, without the `cloudFiles` prefix:
- `header`: `true` if CSV files have headers
- `delimiter`: Field delimiter (default: `,`)

#### JSON Options
- `multiLine`: `true` for multi-line JSON

#### Parquet/Avro/ORC Options
- Standard Spark read options apply (e.g., `mergeSchema`)

#### Cloud-Specific Options
- AWS: `cloudFiles.region`, `cloudFiles.endpoint`, etc.
- Azure: `cloudFiles.resourceGroup`, `cloudFiles.subscriptionId`, etc.

#### Other Useful Options
- `cloudFiles.allowOverwrites`: Allow file overwrites (`true`/`false`)
- `cloudFiles.backfillInterval`: Interval for backfilling missed files

**Full list:** [Auto Loader options documentation](https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/auto-loader/options/)
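
As a sketch of how these options fit together, the helper below assembles an options dict for a CSV source. The helper name and the example path are hypothetical. The option keys are the ones listed above: Auto Loader settings carry the `cloudFiles` prefix, while format options such as `header` use their plain Spark reader names.

```python
def autoloader_csv_options(schema_location, has_header=True, delimiter=","):
    """Build an options dict for a cloudFiles CSV reader (illustrative helper)."""
    return {
        # Auto Loader options carry the cloudFiles prefix
        "cloudFiles.format": "csv",
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.inferColumnTypes": "true",
        # Format options use the plain Spark CSV reader names
        "header": str(has_header).lower(),
        "delimiter": delimiter,
    }


# "/tmp/schemas/demo" is a placeholder path for this example
opts = autoloader_csv_options("/tmp/schemas/demo")
for key, value in opts.items():
    print(f"{key} = {value}")
```

The dict can then be passed in one call, for example `spark.readStream.format("cloudFiles").options(**opts).load(path)`.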

In [0]:


spark.sql(f"DROP TABLE IF EXISTS {TARGET_TABLE_AL}")


In [0]:
df_autoloader = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", SCHEMA_AL)
    .option("cloudFiles.inferColumnTypes", "true")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load(STREAM_SOURCE_PATH) # Reading from our simulated stream source
)

**Adding metadata columns:**

In [0]:
from pyspark.sql.functions import col

df_enriched = (df_autoloader
    .withColumn("_processing_time", F.current_timestamp())
    .withColumn("_source_file", col("_metadata.file_path"))
)

**Start streaming with `availableNow` trigger:**

> `availableNow` - processes all available data and stops (batch-like streaming)

In [0]:
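query = (df_enriched.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", CHECKPOINT_AL)
    .trigger(availableNow=True)
    .toTable(TARGET_TABLE_AL)
)

# Block until all available files are processed, so the next cells read a complete table
query.awaitTermination()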

In [0]:
display(spark.table(f"{TARGET_TABLE_AL}"))

**Auto Loader results:**

In [0]:
display(
    spark.createDataFrame([
        ("Records Loaded", str(spark.table(TARGET_TABLE_AL).count())),
        ("Source Files", str(spark.table(TARGET_TABLE_AL).select("_source_file").distinct().count()))
    ], ["Metric", "Value"])
)

---
### Add new data and re-run stream with `availableNow` trigger

In [0]:
# Write Batch 2 so the next availableNow run has new files to pick up
stream_batches[1].write.mode("overwrite").json(f"{STREAM_SOURCE_PATH}/batch_02")
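
A minimal sketch of the re-run step follows, assuming the same streaming DataFrame, checkpoint and target table as in the first run. The function name `run_available_now` is hypothetical; the writer calls mirror the cell above. Because the checkpoint is reused, only files that arrived since the previous run should be processed.

```python
def run_available_now(stream_df, checkpoint_path, table_name):
    """Run one availableNow pass and block until it finishes."""
    query = (stream_df.writeStream
        .format("delta")
        .outputMode("append")
        # Reusing the checkpoint is what makes the run incremental
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)
        .toTable(table_name)
    )
    query.awaitTermination()  # wait until all available files are processed
    return query
```

In this notebook it would be called as `run_available_now(df_enriched, CHECKPOINT_AL, TARGET_TABLE_AL)`, after which the record count of the target table should grow by the size of Batch 2.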

### 5.5.4. Example: Continuous Processing (processingTime)

Now we will start a long-running stream (one micro-batch every 10 seconds) and add data while it is running.

In [0]:
spark.sql(f"DROP TABLE IF EXISTS {TARGET_TABLE_AL}_continuous")

In [0]:
# Start stream in background
query_continuous = (df_enriched.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", f"{CHECKPOINT_AL}_continuous")
    .trigger(processingTime="10 seconds")
    .toTable(f"{TARGET_TABLE_AL}_continuous")
)

print("Stream started... Waiting for initialization.")


In [0]:
display(spark.table(f"{TARGET_TABLE_AL}_continuous").count())

In [0]:
# Simulate arrival of further batches (3 to 10)
print("Starting data simulation...")

for i in range(2, 10): # List indices 2..9 map to batches 3..10
    batch_num = i + 1
    print(f"Arriving: Batch {batch_num}...", end=" ")
    
    # Write next batch
    stream_batches[i].write.mode("overwrite").json(f"{STREAM_SOURCE_PATH}/batch_{batch_num:02d}")
    
    print("Done. Waiting for stream...")
    time.sleep(4) # Pause between arrivals; the 10-second trigger picks up new files in its next micro-batch

print("All batches arrived.")


In [0]:
display(spark.table(f"{TARGET_TABLE_AL}_continuous").count())
display(spark.table(f"{TARGET_TABLE_AL}_continuous"))


In [0]:
# Stop stream
query_continuous.stop()


## 5.6. Stream-Static Joins & Aggregations

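A common pattern is enriching a stream (e.g., Orders) with a static table (e.g., Customers).
*   **Stream (Left) JOIN Static (Right)**: the usual case; inner and left outer joins are supported.
*   **Static (Left) JOIN Stream (Right)** - *less common*; inner and right outer joins are supported.

Both variants are stateless: each micro-batch is joined against the static side and nothing is kept between micro-batches.

**Note:** When the static side is a Delta table, each micro-batch reads its latest version.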

---

In [0]:
# Load Static Table
df_static_customers = spark.table(TABLE_CUSTOMERS)
display(df_static_customers)

### 5.6.1. Example: Stream-Static Join (Append Mode)

The cells below enrich the order stream with the static customers table. The join keeps no state, so its result is written in append mode.

The windowed aggregation in 5.6.2 uses a different output mode:

**Output Mode: Update** is used when you want to write only the rows that have changed since the last trigger. This is typical for aggregations.

**Watermarking** is crucial there to handle late data and clean up the state. It tells the engine: *"How long do we wait for late data before finalizing the window?"*

In 5.6.2, we count orders per customer in 30-second windows.
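
The interaction between tumbling windows and the watermark can be modeled in a few lines of plain Python. This is a deliberately simplified sketch with hypothetical function names. Timestamps are epoch seconds, and the watermark is advanced after every event, whereas Spark advances it between micro-batches.

```python
def window_start(event_ts, window_seconds=30):
    """Start of the tumbling window that contains event_ts."""
    return event_ts - (event_ts % window_seconds)


def count_per_window(events, window_seconds=30, delay_seconds=60):
    """Count events per window; drop events whose window is behind the watermark."""
    counts, dropped, max_ts = {}, [], 0
    for ts in events:
        watermark = max_ts - delay_seconds  # latest time seen minus allowed delay
        start = window_start(ts, window_seconds)
        if start + window_seconds <= watermark:
            dropped.append(ts)  # too late: this window's state is already finalized
            continue
        counts[start] = counts.get(start, 0) + 1
        max_ts = max(max_ts, ts)
    return counts, dropped


# The last event (105) arrives after an event at 200 has pushed the watermark to 140
counts, dropped = count_per_window([100, 110, 125, 200, 105])
print(counts)
print(dropped)
```

The event at 105 belongs to the window [90, 120), which ended before the watermark of 140, so it is discarded instead of reopening old state.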

In [0]:
spark.sql("DROP TABLE IF EXISTS jointed_orders")

In [0]:
df_enriched.createOrReplaceTempView("enriched_orders_stream")
df_static_customers.createOrReplaceTempView("static_customers")

df_joined = spark.sql("""
SELECT
  o.order_id,
  o.total_amount,
  c.first_name,
  c.last_name,
  c.email,
  o._processing_time
FROM enriched_orders_stream o
LEFT JOIN static_customers c
  ON o.customer_id = c.customer_id
""")

In [0]:
# Write enriched stream
query_join = (df_joined.writeStream
    .format("memory")
    .queryName("enriched_orders")
    .outputMode("append") # Joins (Left Stream-Static) are append-only
    .option("checkpointLocation", f"{CHECKPOINT_BASE_PATH}/join_demo_3")
    .trigger(processingTime="5 seconds")
    .start()
)


In [0]:
# Simulate arrival of batches 12 and 13
print("Starting data simulation...")

for i in range(11, 13): # List indices 11..12 map to batches 12..13
    batch_num = i + 1
    print(f"Arriving: Batch {batch_num}...", end=" ")
    
    # Write next batch
    stream_batches[i].write.mode("overwrite").json(f"{STREAM_SOURCE_PATH}/batch_{batch_num:02d}")
    
    print("Done. Waiting for stream...")
    time.sleep(4) # Pause between arrivals; the 5-second trigger picks up new files in its next micro-batch

print("All batches arrived.")


In [0]:
display(spark.table("enriched_orders"))

In [0]:
query_join.stop()

In [0]:
display(spark.sql("SELECT count(1) FROM enriched_orders "))

### 5.6.2. Define Streaming Aggregation (Orders per 30 seconds)


In [0]:
# 1. Define Streaming Aggregation (Orders per 30 seconds)
# We use the previously defined df_enriched (from Auto Loader)

windowed_counts = (df_enriched
    .withWatermark("_processing_time", "1 minute") # Allow data to arrive up to 1 minute late
    .groupBy(
        F.window("_processing_time", "30 seconds"),
        "customer_id"
    )
    .count()
)

# 2. Write Stream with UPDATE mode
# Update mode is efficient for aggregations - it emits only changed windows
query_agg = (windowed_counts.writeStream
    .format("console") 
    .queryName("orders_counts")
    .outputMode("update")
    .option("checkpointLocation", f"{CHECKPOINT_BASE_PATH}/agg_demo")
    .trigger(processingTime="5 seconds")
    .start()
)

print("Aggregation stream started...")



In [0]:
# Simulate arrival of batches 15 and 16
print("Starting data simulation...")

for i in range(14, 16): # List indices 14..15 map to batches 15..16
    batch_num = i + 1
    print(f"Arriving: Batch {batch_num}...", end=" ")
    
    # Write next batch
    stream_batches[i].write.mode("overwrite").json(f"{STREAM_SOURCE_PATH}/batch_{batch_num:02d}")
    
    print("Done. Waiting for stream...")
    time.sleep(4) # Pause between arrivals; the 5-second trigger picks up new files in its next micro-batch

print("All batches arrived.")


In [0]:
display(windowed_counts)

In [0]:
# Stop for demo purposes
query_agg.stop()

## 5.7. Error Handling

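**Error handling strategies for file parsing** (the `mode` reader option for CSV/JSON, also accepted in COPY INTO `FORMAT_OPTIONS`):

| Mode | Behavior |
|------|----------|
| `PERMISSIVE` | Parses what it can; malformed fields become null and the raw record goes to `_corrupt_record` |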
| `DROPMALFORMED` | Removes malformed records |
| `FAILFAST` | Stops on first error |

**`badRecordsPath`** - saves bad records to a folder for later analysis.
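
The three modes can be illustrated with a plain-Python parser. This is a sketch of the behavior described in the table, not Spark's implementation: `parse_rows` is a hypothetical function, and the input mirrors the bad CSV rows used later in this section.

```python
from datetime import date


def parse_rows(rows, mode="PERMISSIVE"):
    """Parse (customer_id, registration_date) string pairs under a parser mode."""
    if mode not in ("PERMISSIVE", "DROPMALFORMED", "FAILFAST"):
        raise ValueError(f"unknown mode: {mode}")
    parsed = []
    for customer_id, raw_date in rows:
        try:
            record = {
                "customer_id": customer_id,
                "registration_date": date.fromisoformat(raw_date),
                "_corrupt_record": None,
            }
        except ValueError:
            if mode == "FAILFAST":
                raise  # stop on the first malformed record
            if mode == "DROPMALFORMED":
                continue  # silently skip the malformed record
            # PERMISSIVE: keep the row, null the bad field, preserve the raw text
            record = {
                "customer_id": customer_id,
                "registration_date": None,
                "_corrupt_record": f"{customer_id},{raw_date}",
            }
        parsed.append(record)
    return parsed


rows = [("999", "2023-01-03"), ("888", "NOT_A_DATE")]
print(len(parse_rows(rows, "PERMISSIVE")))
print(len(parse_rows(rows, "DROPMALFORMED")))
```

`DROPMALFORMED` loses the bad row without a trace, which is why `PERMISSIVE` with a corrupt-record column (or `badRecordsPath`) is usually preferable in a Bronze layer.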

---

### 5.7.1 Schema Evolution & Rescued Data

**Rescued Data Column** is an Auto Loader mechanism for handling unexpected data:

| Scenario | Behavior |
|----------|----------|
| New columns | Saved in `_rescued_data` |
| Type mismatches | Saved in `_rescued_data` |
| Malformed records | Saved in `_rescued_data` |

**Configuration:**
```python
.option("cloudFiles.schemaEvolutionMode", "rescue")
```

**schemaEvolutionMode options:**
- `addNewColumns`: The stream stops when a new column appears, the column is added to the stored schema, and the restarted stream picks it up
- `rescue`: New columns → `_rescued_data` JSON
- `failOnNewColumns`: Fail if schema changes
- `none`: Ignores new columns (risky!)
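
To show what "rescued" means in practice, here is a plain-Python model of rescue mode. It is a simplified sketch: `apply_schema` is a hypothetical function, the schema is a dict of column name to Python type, and the real `_rescued_data` column may carry extra metadata that is omitted here.

```python
import json


def apply_schema(record, schema):
    """Keep fields that match the schema; move everything else to _rescued_data."""
    row, rescued = {}, {}
    for name, expected_type in schema.items():
        value = record.get(name)
        if value is None or isinstance(value, expected_type):
            row[name] = value
        else:
            row[name] = None       # type mismatch: the column becomes null
            rescued[name] = value  # the original value is preserved
    for name, value in record.items():
        if name not in schema:
            rescued[name] = value  # unexpected column
    row["_rescued_data"] = json.dumps(rescued, sort_keys=True) if rescued else None
    return row


schema = {"order_id": str, "customer_id": str, "total_amount": float}
bad = {"order_id": "99999", "total_amount": "INVALID_NUMBER", "new_col": "surprise"}
row = apply_schema(bad, schema)
print(row)
```

Nothing is lost: the mismatched `total_amount` and the unexpected `new_col` both end up in the JSON string, where they can be inspected or reprocessed later.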

---

In [0]:
TARGET_TABLE_RESCUE = f"{BRONZE_SCHEMA}.orders_rescued"
CHECKPOINT_RESCUE = f"{CHECKPOINT_BASE_PATH}/rescue"
SCHEMA_RESCUE = f"{SCHEMA_BASE_PATH}/rescue"

spark.sql(f"DROP TABLE IF EXISTS {TARGET_TABLE_RESCUE}")

**Define explicit schema (partial):**

In [0]:
# Deliberately define only some columns - rest will go to _rescued_data
partial_schema = StructType([
    StructField("order_id", StringType(), True),
    StructField("customer_id", StringType(), True),
    StructField("total_amount", DoubleType(), True)
])

**Auto Loader with rescue mode:**

In [0]:
# Create BAD data (Extra column + Type mismatch)
bad_data = [{"order_id": 99999, "total_amount": "INVALID_NUMBER", "new_col": "surprise"}]
spark.createDataFrame(bad_data).write.mode("overwrite").json(f"{STREAM_SOURCE_PATH}/bad_data")

In [0]:

df_rescue = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", SCHEMA_RESCUE)
    .option("cloudFiles.schemaEvolutionMode", "rescue")  # Rescue mode!
    .schema(partial_schema)  # Partial schema
    .load(STREAM_SOURCE_PATH)
)


**Start stream:**

In [0]:
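query_rescue = (df_rescue.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", CHECKPOINT_RESCUE)
    .trigger(availableNow=True)
    .toTable(TARGET_TABLE_RESCUE)
)

# Block until the run finishes, so the next cells read a complete table
query_rescue.awaitTermination()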

In [0]:
display(spark.table(TARGET_TABLE_RESCUE))

**Schema with `_rescued_data` column:**

In [0]:
spark.table(TARGET_TABLE_RESCUE).printSchema()

**Data with rescued columns:**

In [0]:
display(
    spark.table(TARGET_TABLE_RESCUE)
    .limit(5)
)

### 5.7.2. Example: Error handling with badRecordsPath

In [0]:
TABLE_ERRORS = f"{BRONZE_SCHEMA}.customers_with_validation"

**Creating table with `_corrupt_record`:**

In [0]:
spark.sql(f"DROP TABLE IF EXISTS {TABLE_ERRORS}")

spark.sql(f"""
CREATE TABLE {TABLE_ERRORS} (
  customer_id STRING,
  first_name STRING,
  last_name STRING,
  email STRING,
  phone STRING,
  city STRING,
  state STRING,
  country STRING,
  registration_date DATE,
  customer_segment STRING,
  _corrupt_record STRING,
  _ingestion_timestamp TIMESTAMP
) USING DELTA
""")

**Loading with error handling:**

In [0]:
# Create a CSV with a bad row
bad_csv_data = [
    (999, "Eve", "2023-01-03"),
    (888, "Frank", "NOT_A_DATE") # This will fail date parsing
]
spark.createDataFrame(bad_csv_data, ["customer_id", "first_name", "registration_date"]).write.mode("overwrite").option("header", "true").csv(f"{BATCH_SOURCE_PATH}/bad_csv")

In [0]:
df_with_errors = (
    spark.read
    .format("csv")
    .option("header", "true")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .option("badRecordsPath", BAD_RECORDS_PATH)
    .schema("""
        customer_id STRING,
        first_name STRING,
        registration_date DATE,
        _corrupt_record STRING
    """)
    .load(f"{BATCH_SOURCE_PATH}/bad_csv")
    .withColumn("_ingestion_timestamp", F.current_timestamp())
)

In [0]:
display(df_with_errors)

**Bad records statistics:**

In [0]:
files = [f.path for f in dbutils.fs.ls(BAD_RECORDS_PATH)]
files = [f + "*" for f in files]
print(files)

In [0]:
from pyspark.sql.types import StructType, StructField, StringType

bad_record_schema = StructType([
    StructField("path", StringType(), True),
    StructField("record", StringType(), True),
    StructField("reason", StringType(), True)
])


df_bad_records = (
    spark.read
    .format("json")
    .schema(bad_record_schema)
    .load(files)
)
display(df_bad_records)

## 5.8. Lakeflow Connect (Informational)

**Lakeflow Connect** provides managed connectors for SaaS systems that are configured without writing code:

### Supported sources:
- Salesforce
- Workday
- Google Analytics
- HubSpot
- Stripe
- SAP
- Netsuite
- ServiceNow

### Key features:

| Feature | Description |
|---------|-------------|
| **Zero-code** | Configuration via UI |
| **Managed** | Automatic scaling |
| **CDC Support** | Change Data Capture |
| **Schema Evolution** | Automatic updates |
| **Unity Catalog** | Full integration |

### How to start:
1. Workspace → **Data** → **Lakeflow** → **Connect**
2. Choose connector (e.g. Salesforce)
3. Provide credentials
4. Select objects to synchronize
5. Set schedule

### Ingestion methods comparison:

| Method | Use Case |
|--------|----------|
| **COPY INTO** | Files in cloud storage (batch) |
| **Auto Loader** | Files in cloud storage (streaming) |
| **Lakeflow Connect** | Data from SaaS systems |
| **Lakeflow Pipelines** | Transformations Bronze → Silver → Gold |

---

## 5.9. Cleanup (Optional)

---

In [0]:
# List of tables created in this notebook
created_tables = [
    "customer_cts",
    "customer_cts_copy",
    "customers_batch",
    "orders_autoloader",
    "orders_autoloader_continuous",
    "orders_rescued",
    "customers_with_validation"
]

## 5.10. Final Summary

### What was achieved:
- Built a robust streaming pipeline using Auto Loader
- Implemented incremental processing with Structured Streaming
- Handled schema evolution with `cloudFiles.schemaEvolutionMode`
- Enriched streaming data with static dimension tables (Stream-Static Join)
- Managed stateful aggregations with Watermarking
- Reviewed Lakeflow Connect as a managed option for ingesting data from SaaS systems

### Key Takeaways:
1. **Auto Loader (cloudFiles)** is the standard for ingestion - handles state, schema, and files automatically.
2. **Structured Streaming** unifies batch and streaming APIs.
3. **Watermarking** is critical for state cleanup in aggregations.
4. **Stream-Static Joins** allow enriching streams with reference data (but require careful caching/reloading).
5. **Schema Evolution** prevents pipeline failures when data changes.

---

**Created resources verification:**

In [0]:
results = []
for table in created_tables:
    full_table = f"{CATALOG}.{BRONZE_SCHEMA}.{table}"
    try:
        if spark.catalog.tableExists(full_table):
            count = spark.table(full_table).count()
            results.append((table, "EXISTS", str(count)))
        else:
            results.append((table, "NOT FOUND", "-"))
    except Exception as e:
        results.append((table, "ERROR", str(e)[:30]))

display(spark.createDataFrame(results, ["Table", "Status", "Records"]))

In [0]:
# Cleanup flag
CLEANUP_ENABLED = False

**Execute cleanup (if enabled):**
> [TIP] *Recommended: Keep data for subsequent notebooks!*

In [0]:
if CLEANUP_ENABLED:
    results = []
    for table in created_tables:
        full_table = f"{CATALOG}.{BRONZE_SCHEMA}.{table}"
        try:
            spark.sql(f"DROP TABLE IF EXISTS {full_table}")
            results.append((table, "DROPPED"))
        except Exception as e:
            results.append((table, f"ERROR: {str(e)[:30]}"))
    
    # Cleanup checkpoints
    try:
        dbutils.fs.rm(CHECKPOINT_BASE_PATH, True)
        results.append(("checkpoints", "REMOVED"))
    except:
        results.append(("checkpoints", "NOT FOUND"))
    
    display(spark.createDataFrame(results, ["Resource", "Status"]))
else:
    display(spark.createDataFrame([
        ("CLEANUP_ENABLED", "False"),
        ("Action", "Change to True to delete resources")
    ], ["Setting", "Value"]))