# LAB 05: Streaming & Auto Loader

**Duration:** ~35 min | **Day:** 2 | **Difficulty:** Intermediate-Advanced
**After module:** M05: Incremental Data Processing

> *"Set up Auto Loader for streaming JSON ingestion into the Bronze layer with exactly-once guarantees."*

## Setup

In [None]:
%run ../../setup/00_setup

In [None]:
# Prepare landing zone and checkpoint paths
landing_path = f"{DATASET_PATH}/orders/stream"
checkpoint_path = f"/tmp/{CATALOG}/lab05/checkpoint"
schema_path = f"/tmp/{CATALOG}/lab05/schema"
target_table = f"{CATALOG}.{BRONZE_SCHEMA}.orders_stream"

# Clean up from previous runs
spark.sql(f"DROP TABLE IF EXISTS {target_table}")
dbutils.fs.rm(checkpoint_path, True)
dbutils.fs.rm(schema_path, True)

print(f"Landing path: {landing_path}")
print(f"Target table: {target_table}")
print(f"Files available: {[f.name for f in dbutils.fs.ls(landing_path)]}")

---
## Task 1: COPY INTO (Batch Ingestion)

Use `COPY INTO` to load the JSON files from the landing zone into the target table.

In [None]:
# First create the target table
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {target_table}
    (order_id STRING, customer_id STRING, product_id STRING, 
     quantity INT, total_price DOUBLE, order_date STRING, 
     payment_method STRING, store_id STRING)
""")

# TODO: Use COPY INTO to load JSON files
spark.sql(f"""
    COPY INTO {target_table}
    FROM '{landing_path}'
    FILEFORMAT = ________
    FORMAT_OPTIONS ('mergeSchema' = 'true')
""")

count_after_copy = spark.table(target_table).count()
print(f"Rows after COPY INTO: {count_after_copy}")

In [None]:
# -- Validation --
assert count_after_copy > 0, "COPY INTO should have loaded data"
print(f"Task 1 OK: {count_after_copy} rows loaded via COPY INTO")

---
## Task 2: Verify COPY INTO Idempotency

Run COPY INTO again on the same files. How many new rows are loaded?
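
To see why a re-run can add nothing, here is a minimal pure-Python sketch of idempotent loading. It is a toy model, not how Databricks implements `COPY INTO`; the class `IdempotentLoader` and the file names are hypothetical.

```python
# Toy model of idempotent file loading (illustration only).
class IdempotentLoader:
    def __init__(self):
        self.loaded_files = set()   # names of files already ingested
        self.rows = []              # stands in for the target table

    def copy_into(self, files):
        """Load rows only from files not seen before; return the number of rows added."""
        added = 0
        for name, file_rows in files.items():
            if name in self.loaded_files:
                continue            # already loaded on an earlier run: skip
            self.rows.extend(file_rows)
            self.loaded_files.add(name)
            added += len(file_rows)
        return added


# Hypothetical landing zone: file name -> rows in that file
landing_zone = {
    "orders_001.json": [{"order_id": "A1"}, {"order_id": "A2"}],
    "orders_002.json": [{"order_id": "B1"}],
}

loader = IdempotentLoader()
first_run = loader.copy_into(landing_zone)
second_run = loader.copy_into(landing_zone)
print(first_run, second_run)  # 3 0
```

The second call finds every file name in `loaded_files`, so the row count stays the same, which is the behavior the validation cell in this task checks for.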

In [None]:
# TODO: Re-run the same COPY INTO
spark.sql(f"""
    COPY INTO {target_table}
    FROM '{landing_path}'
    FILEFORMAT = JSON
    FORMAT_OPTIONS ('mergeSchema' = 'true')
""")

count_after_rerun = spark.table(target_table).count()
print(f"Rows after re-run: {count_after_rerun} (was {count_after_copy})")
print(f"New rows loaded: {count_after_rerun - count_after_copy}")

In [None]:
# -- Validation --
assert count_after_rerun == count_after_copy, "COPY INTO should be idempotent - no new rows!"
print("Task 2 OK: COPY INTO is idempotent. 0 new rows on re-run.")

---
## Task 3: Auto Loader - Configure Stream

Set up Auto Loader (`cloudFiles`) to read JSON files from the landing zone.

Key options:
- `cloudFiles.format` = json
- `cloudFiles.schemaLocation` = path for inferred schema
- `cloudFiles.inferColumnTypes` = true
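
The effect of `cloudFiles.inferColumnTypes` can be pictured with a small pure-Python sketch. It assumes that, without the option, every JSON column is treated as a string. The function `infer_schema`, the type names, and the sample record are illustrative assumptions, not Auto Loader's inference code.

```python
import json


def infer_schema(json_lines, infer_column_types=False):
    """Return {column: type name} for a sample of JSON lines (toy model)."""
    schema = {}
    for line in json_lines:
        for key, value in json.loads(line).items():
            if not infer_column_types:
                schema[key] = "string"      # everything stays a string
            elif isinstance(value, bool):   # check bool first: bool is a subclass of int
                schema[key] = "boolean"
            elif isinstance(value, int):
                schema[key] = "bigint"
            elif isinstance(value, float):
                schema[key] = "double"
            else:
                schema[key] = "string"
    return schema


# Hypothetical sample record
sample = ['{"order_id": "A1", "quantity": 2, "total_price": 19.98}']

as_strings = infer_schema(sample)
typed = infer_schema(sample, infer_column_types=True)
print(as_strings)
print(typed)
```

With the flag off, `quantity` and `total_price` come back as strings; with it on, they come back as numeric types.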

In [None]:
# Reset target for Auto Loader test
al_target = f"{CATALOG}.{BRONZE_SCHEMA}.orders_autoloader"
spark.sql(f"DROP TABLE IF EXISTS {al_target}")
dbutils.fs.rm(checkpoint_path, True)
dbutils.fs.rm(schema_path, True)

In [None]:
# TODO: Configure Auto Loader readStream
df_stream = (
    spark.readStream
    .format(________)
    .option("cloudFiles.format", ________)
    .option("cloudFiles.schemaLocation", schema_path)
    .option("cloudFiles.inferColumnTypes", "true")
    .load(landing_path)
)

In [None]:
# -- Validation --
assert df_stream.isStreaming, "Should be a streaming DataFrame"
print(f"Task 3 OK: Streaming DataFrame configured with schema: {df_stream.schema.fieldNames()}")

---
## Task 4: Write Stream with trigger(availableNow=True)

Write the stream to a Delta table using `trigger(availableNow=True)`.

This processes all available files and stops automatically.
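
The "process what is there, then stop" behavior can be sketched in plain Python. This is a toy model under the assumption that the backlog is fixed when the run starts; `run_available_now` and `max_files_per_batch` are hypothetical names, not Spark APIs.

```python
def run_available_now(source_files, sink, max_files_per_batch=2):
    """Process the files present at start-up in micro-batches, then return."""
    backlog = sorted(source_files)            # snapshot of what is available now
    batches = 0
    for start in range(0, len(backlog), max_files_per_batch):
        batch = backlog[start:start + max_files_per_batch]
        sink.extend(batch)                    # "write" one micro-batch
        batches += 1
    return batches                            # returning models the query terminating


# Hypothetical file names
landing = {"f1.json", "f2.json", "f3.json"}
sink = []
batch_count = run_available_now(landing, sink)
print(batch_count, sink)  # 2 ['f1.json', 'f2.json', 'f3.json']
```

Unlike a continuously running stream, the loop ends once the backlog is drained, which is why `awaitTermination()` returns in this task.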

In [None]:
# TODO: Write stream to Delta table
query = (
    df_stream
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", checkpoint_path)
    .trigger(________=________)
    .toTable(al_target)
)

query.awaitTermination()
print(f"Stream completed. Rows loaded: {spark.table(al_target).count()}")

In [None]:
# -- Validation --
al_count = spark.table(al_target).count()
assert al_count > 0, "Auto Loader should have loaded data"
print(f"Task 4 OK: {al_count} rows loaded via Auto Loader")

---
## Task 5: Incremental Processing

Re-run the stream with the same checkpoint. Because no new files have arrived, it should write 0 new rows.

This shows that the checkpoint records which files have already been processed.
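
A minimal sketch of checkpoint-based progress tracking, assuming the state is a single JSON file of processed file names. The real checkpoint layout is different; `run_stream`, `processed_files.json`, and the file names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def run_stream(landing_files, checkpoint_dir):
    """Process files not listed in the checkpoint; return the newly processed names."""
    state_file = Path(checkpoint_dir) / "processed_files.json"
    seen = set(json.loads(state_file.read_text())) if state_file.exists() else set()
    new_files = sorted(set(landing_files) - seen)
    # Persist progress so the next run starts where this one stopped.
    state_file.write_text(json.dumps(sorted(seen | set(new_files))))
    return new_files


with tempfile.TemporaryDirectory() as checkpoint_dir:
    files = ["orders_001.json", "orders_002.json"]
    run1 = run_stream(files, checkpoint_dir)                         # first run: everything is new
    run2 = run_stream(files, checkpoint_dir)                         # same files: nothing to do
    run3 = run_stream(files + ["orders_003.json"], checkpoint_dir)   # one new file arrives

print(run1, run2, run3)
```

Because the state lives on disk rather than in the query object, a brand-new stream definition that points at the same checkpoint picks up where the previous one left off.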

In [None]:
# Re-run the stream (same checkpoint = incremental)
df_stream2 = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", schema_path)
    .option("cloudFiles.inferColumnTypes", "true")
    .load(landing_path)
)

query2 = (
    df_stream2
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)
    .toTable(al_target)
)

query2.awaitTermination()
al_count2 = spark.table(al_target).count()
print(f"Rows after re-run: {al_count2} (was {al_count})")
print(f"New rows: {al_count2 - al_count}")

In [None]:
# -- Validation --
assert al_count2 == al_count, f"Should be 0 new rows, but got {al_count2 - al_count}"
print("Task 5 OK: Incremental processing verified. 0 new rows on re-run (checkpoint works!)")

---
## Task 6: Metadata Columns

Add metadata columns to streaming data: `_processing_time` (processing timestamp) and `_source_file` (source file path from `_metadata`).

These columns are essential in production pipelines for debugging and auditing.
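
As a rough illustration of the auditing use case, the sketch below attaches the two columns to plain Python records and counts rows per source file. The helper `enrich`, the paths, and the data are hypothetical; in the lab the values come from `current_timestamp()` and `_metadata.file_path`.

```python
from collections import Counter
from datetime import datetime, timezone


def enrich(record, source_file):
    """Return a copy of the record with the two audit columns attached."""
    return {
        **record,
        "_processing_time": datetime.now(timezone.utc),
        "_source_file": source_file,
    }


# Hypothetical landing zone: file path -> rows in that file
raw = {
    "/landing/orders_001.json": [{"order_id": "A1"}, {"order_id": "A2"}],
    "/landing/orders_002.json": [{"order_id": "B1"}],
}

bronze = [enrich(row, path) for path, rows in raw.items() for row in rows]

# Audit question: how many rows did each file contribute?
rows_per_file = Counter(row["_source_file"] for row in bronze)
print(rows_per_file)
```

With the file path stored on every row, a suspicious record can be traced back to the exact file that produced it.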

**TODO:** Fill in `current_timestamp()` and `_metadata.file_path`.

In [None]:
from pyspark.sql.functions import current_timestamp, col

# Target table for metadata-enriched stream
metadata_target = f"{CATALOG}.{BRONZE_SCHEMA}.orders_with_metadata"
metadata_checkpoint = f"/tmp/{CATALOG}/lab05/checkpoint_metadata"
metadata_schema = f"/tmp/{CATALOG}/lab05/schema_metadata"

# Reset
spark.sql(f"DROP TABLE IF EXISTS {metadata_target}")
dbutils.fs.rm(metadata_checkpoint, True)
dbutils.fs.rm(metadata_schema, True)

# TODO: Configure Auto Loader with metadata columns
df_with_metadata = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", metadata_schema)
    .option("cloudFiles.inferColumnTypes", "true")
    .load(landing_path)
    .withColumn("_processing_time", ________)       # hint: current_timestamp()
    .withColumn("_source_file", ________)            # hint: col("_metadata.file_path")
)

# Write enriched stream
(
    df_with_metadata
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", metadata_checkpoint)
    .trigger(availableNow=True)
    .toTable(metadata_target)
    .awaitTermination()
)

print(f"Written to: {metadata_target}")
display(spark.table(metadata_target).limit(5))

In [None]:
# -- Validation --
meta_cols = spark.table(metadata_target).columns
assert "_processing_time" in meta_cols, "Missing '_processing_time' — did you use current_timestamp()?"
assert "_source_file" in meta_cols, "Missing '_source_file' — did you use _metadata.file_path?"
print(f"Task 6 OK: Metadata columns added — {meta_cols}")

---
## Task 7: Schema Evolution — Rescued Data

Configure Auto Loader with a partial schema (only `order_id` and `customer_id`). Set `cloudFiles.schemaEvolutionMode` to `"rescue"` so that columns missing from that schema land in `_rescued_data` instead of being dropped.
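
The idea behind the rescued data column can be sketched in plain Python. This is a simplified model that only handles extra columns; `apply_schema_with_rescue` and the sample record are hypothetical, and the real column carries more detail than shown here.

```python
import json


def apply_schema_with_rescue(record, schema_columns):
    """Keep the schema columns; pack every other field into a JSON string."""
    row = {column: record.get(column) for column in schema_columns}
    extras = {k: v for k, v in record.items() if k not in schema_columns}
    row["_rescued_data"] = json.dumps(extras) if extras else None
    return row


partial_columns = ["order_id", "customer_id"]

# Hypothetical records: one with an extra field, one that fits the schema exactly
with_extra = apply_schema_with_rescue(
    {"order_id": "A1", "customer_id": "C7", "quantity": 2}, partial_columns
)
exact_fit = apply_schema_with_rescue(
    {"order_id": "A2", "customer_id": "C8"}, partial_columns
)
print(with_extra)
print(exact_fit)
```

Nothing is lost: the extra field survives as JSON and can be parsed out later, while rows that match the schema leave `_rescued_data` empty.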

**TODO:** Fill in the `schemaEvolutionMode` value.

In [None]:
from pyspark.sql.types import StructType, StructField, StringType

rescue_target = f"{CATALOG}.{BRONZE_SCHEMA}.orders_rescued"
rescue_checkpoint = f"/tmp/{CATALOG}/lab05/checkpoint_rescue"
rescue_schema = f"/tmp/{CATALOG}/lab05/schema_rescue"

# Reset
spark.sql(f"DROP TABLE IF EXISTS {rescue_target}")
dbutils.fs.rm(rescue_checkpoint, True)
dbutils.fs.rm(rescue_schema, True)

# Partial schema — only 2 columns out of many
partial_schema = StructType([
    StructField("order_id", StringType(), True),
    StructField("customer_id", StringType(), True),
])

# TODO: Configure Auto Loader with rescue mode
df_rescued = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", rescue_schema)
    .option("cloudFiles.schemaEvolutionMode", "________")  # hint: "rescue"
    .schema(partial_schema)
    .load(landing_path)
)

# Write with rescued data
(
    df_rescued
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", rescue_checkpoint)
    .trigger(availableNow=True)
    .toTable(rescue_target)
    .awaitTermination()
)

print(f"Written to: {rescue_target}")
display(spark.table(rescue_target).limit(5))

In [None]:
# -- Validation --
rescue_cols = spark.table(rescue_target).columns
assert "_rescued_data" in rescue_cols, "Missing '_rescued_data' column — did you set schemaEvolutionMode to 'rescue'?"
rescue_count = spark.table(rescue_target).filter("_rescued_data IS NOT NULL").count()
assert rescue_count > 0, "Expected rescued data for columns not in partial schema"
print(f"Task 7 OK: {rescue_count} rows with rescued data (extra columns captured in _rescued_data)")

---
## Task 8: Stream-Static Join

Join streaming data (orders) with a **static table** (customers) to enrich the stream with customer information.

A **Stream-Static Join** lets you combine a streaming DataFrame with a batch DataFrame — the static side is re-read on every micro-batch.
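
A toy model of this behavior, assuming the static side is a dictionary that is looked up afresh for each micro-batch. `enrich_batch`, `customer_name`, and the data are hypothetical and stand in for the Spark join.

```python
def enrich_batch(orders, customers):
    """Left-join one micro-batch of orders against the current customers lookup."""
    return [
        {**order, "customer_name": customers.get(order["customer_id"])}
        for order in orders
    ]


customers = {"C1": "Ada"}                        # static side (hypothetical data)

batch_1 = [
    {"order_id": "A1", "customer_id": "C1"},
    {"order_id": "A2", "customer_id": "C2"},     # no matching customer yet
]
out_1 = enrich_batch(batch_1, customers)

customers["C2"] = "Grace"                        # static table changes between batches

batch_2 = [{"order_id": "A3", "customer_id": "C2"}]
out_2 = enrich_batch(batch_2, customers)         # this batch sees the new customer

print(out_1)
print(out_2)
```

In this sketch the left join keeps order `A2` with a missing name, and that earlier output is not revised when the customer appears later; only subsequent batches see the change.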

**TODO:** Fill in `readStream`, join columns, and `writeStream`.

In [None]:
# -- Stream-Static Join: Enrich orders with customer info --

join_target = f"{CATALOG}.{BRONZE_SCHEMA}.orders_enriched"
join_checkpoint = f"/tmp/{CATALOG}/lab05/checkpoint_enriched"

# Reset
spark.sql(f"DROP TABLE IF EXISTS {join_target}")
dbutils.fs.rm(join_checkpoint, True)

# Static side: read customers table
customers_df = spark.table(f"{CATALOG}.{BRONZE_SCHEMA}.customers")

# Stream side: read orders as a stream
# TODO: Fill in the format and table name
orders_stream = (
    spark
    .readStream
    .format("________")        # hint: "delta"
    .table(________)           # hint: target_table (the table from Task 1)
)

# TODO: Join stream with static DataFrame on customer_id
enriched_stream = orders_stream.join(
    ________,                  # hint: customers_df
    on="________",             # hint: "customer_id"
    how="left"
)

# TODO: Write the joined stream to join_target
(
    enriched_stream
    .writeStream
    .format("delta")
    .outputMode("________")          # hint: "append"
    .option("checkpointLocation", join_checkpoint)
    .trigger(availableNow=________)  # hint: True
    .toTable(________)               # hint: join_target
    .awaitTermination()
)

print(f"Stream-static join written to: {join_target}")
display(spark.table(join_target).limit(5))

In [None]:
# -- Validation --
enriched_count = spark.table(join_target).count()
enriched_cols = spark.table(join_target).columns
assert enriched_count > 0, "No rows in enriched table"
assert "customer_id" in enriched_cols, "Missing customer_id"
source_cols = set(spark.table(target_table).columns)
assert any(c not in source_cols for c in enriched_cols), \
    "Join didn't add any new columns - check join expression"
print(f"Task 8 OK: {enriched_count} enriched rows. Columns: {enriched_cols}")

---
## Task 9: Change Data Feed (CDF) for Incremental ETL

Use **Change Data Feed** to read only the changes made to a Delta table, instead of re-reading the entire table each time.

**Scenario:** 
1. Create a CDF-enabled table with initial data
2. Make some changes (INSERT + UPDATE)
3. Read only the changes using `readChangeFeed`
4. Build an incremental ETL pipeline from the change feed

> **Exam Tip:** CDF captures `_change_type` (`insert`, `update_preimage`, `update_postimage`, `delete`), `_commit_version`, and `_commit_timestamp`.
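
The scenario can be previewed with plain Python records that mimic a change feed. This is a minimal sketch: the rows and the helper `latest_values` are hypothetical, and only the `_change_type` and `_commit_version` column names come from the tip above.

```python
# Hypothetical change rows: two updates at version 2, one insert at version 3
changes = [
    {"order_id": 1, "status": "pending", "_change_type": "update_preimage", "_commit_version": 2},
    {"order_id": 1, "status": "shipped", "_change_type": "update_postimage", "_commit_version": 2},
    {"order_id": 2, "status": "pending", "_change_type": "update_preimage", "_commit_version": 2},
    {"order_id": 2, "status": "shipped", "_change_type": "update_postimage", "_commit_version": 2},
    {"order_id": 6, "status": "pending", "_change_type": "insert", "_commit_version": 3},
]

KEEP = {"insert", "update_postimage"}   # change types that carry the new values


def latest_values(change_rows, start_version):
    """Keep new values from start_version onward and drop the CDF metadata columns."""
    return [
        {k: v for k, v in row.items() if not k.startswith("_")}
        for row in change_rows
        if row["_commit_version"] >= start_version and row["_change_type"] in KEEP
    ]


incremental = latest_values(changes, start_version=2)
print(incremental)
```

Each update produces a pre-image and a post-image, so five change rows reduce to three rows of new values: two shipped orders and one new pending order.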

In [None]:
# Step 1: Create a CDF-enabled table with initial data
cdf_source = f"{CATALOG}.{BRONZE_SCHEMA}.cdf_orders_lab"
cdf_target = f"{CATALOG}.{BRONZE_SCHEMA}.cdf_orders_incremental"

spark.sql(f"DROP TABLE IF EXISTS {cdf_source}")
spark.sql(f"DROP TABLE IF EXISTS {cdf_target}")

# TODO: Create table with Change Data Feed enabled
spark.sql(f"""
    CREATE TABLE {cdf_source} (
        order_id INT, customer_id INT, amount DOUBLE, status STRING
    )
    TBLPROPERTIES (________ = ________)
""")
# hint: 'delta.enableChangeDataFeed' = 'true'

# Insert initial data
spark.sql(f"""
    INSERT INTO {cdf_source} VALUES
    (1, 101, 99.99, 'pending'),
    (2, 102, 149.50, 'pending'),
    (3, 103, 75.00, 'pending'),
    (4, 101, 200.00, 'pending'),
    (5, 104, 50.00, 'pending')
""")

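# DESCRIBE HISTORY lists the newest commit first, so first() is the version of the initial INSERT
initial_version = spark.sql(f"DESCRIBE HISTORY {cdf_source}").first()["version"]
print(f"Initial load committed at version {initial_version}; table has {spark.table(cdf_source).count()} rows")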

In [None]:
# Step 2: Make some changes (simulate business operations)
# Update: mark orders 1,2 as shipped
spark.sql(f"UPDATE {cdf_source} SET status = 'shipped' WHERE order_id IN (1, 2)")

# Insert: new order
spark.sql(f"INSERT INTO {cdf_source} VALUES (6, 105, 320.00, 'pending')")

print("Changes applied: 2 updates + 1 insert")
print(f"Current table has {spark.table(cdf_source).count()} rows")

In [None]:
# TODO: Read the Change Data Feed starting from the version AFTER initial load
# Use table_changes() SQL function

df_changes = spark.sql(f"""
    SELECT * FROM ________('{cdf_source}', {initial_version + 1})
    ORDER BY _commit_version, order_id
""")
# hint: table_changes(...)

display(df_changes)

In [None]:
# TODO: Build incremental ETL — read only inserts and update_postimage (new values)
# Filter out update_preimage and delete to get "current state of changes"

df_incremental = df_changes.filter(
    col("_change_type").isin(________, ________)
)
# hint: "insert", "update_postimage"

# Write incremental changes to target
df_incremental.drop("_change_type", "_commit_version", "_commit_timestamp").write \
    .mode("append").saveAsTable(cdf_target)

print(f"Incremental ETL: {df_incremental.count()} rows written to {cdf_target}")
display(spark.table(cdf_target))

In [None]:
# -- Validation --
cdf_count = spark.table(cdf_target).count()
assert cdf_count == 3, f"Expected 3 rows (2 updated + 1 inserted), got {cdf_count}"
statuses = [r["status"] for r in spark.table(cdf_target).collect()]
assert "shipped" in statuses, "Should contain 'shipped' status from updates"
assert "pending" in statuses, "Should contain 'pending' status from new insert"
print(f"Task 9 OK: CDF incremental ETL — {cdf_count} change rows captured")
print("  2 update_postimage (shipped) + 1 insert (new order)")

---
## Cleanup

In [None]:
# Stop any active streams
for s in spark.streams.active:
    s.stop()

# Drop lab tables
for t in [target_table, al_target, 
          f"{CATALOG}.{BRONZE_SCHEMA}.orders_with_metadata",
          f"{CATALOG}.{BRONZE_SCHEMA}.orders_rescued",
          f"{CATALOG}.{BRONZE_SCHEMA}.orders_enriched",
          f"{CATALOG}.{BRONZE_SCHEMA}.cdf_orders_lab",
          f"{CATALOG}.{BRONZE_SCHEMA}.cdf_orders_incremental"]:
    spark.sql(f"DROP TABLE IF EXISTS {t}")

# Clean temp paths
for p in [checkpoint_path, schema_path,
          f"/tmp/{CATALOG}/lab05/checkpoint_metadata",
          f"/tmp/{CATALOG}/lab05/schema_metadata",
          f"/tmp/{CATALOG}/lab05/checkpoint_rescue",
          f"/tmp/{CATALOG}/lab05/schema_rescue",
          f"/tmp/{CATALOG}/lab05/checkpoint_enriched"]:
    dbutils.fs.rm(p, True)

print("Lab cleanup complete")

---
## Lab Complete!

You have:
- Used COPY INTO for idempotent batch loading
- Configured Auto Loader (cloudFiles) for streaming ingestion
- Used trigger(availableNow=True) for incremental processing
- Verified checkpoint-based exactly-once guarantees
- Added metadata columns (`_processing_time`, `_source_file`) to streaming writes
- Used rescued data column for schema evolution handling
- Performed a stream-static join to enrich streaming data
- Used Change Data Feed (CDF) for incremental ETL

| Feature | COPY INTO | Auto Loader | CDF |
|---------|-----------|-------------|-----|
| Interface | SQL command | readStream/writeStream | readChangeFeed / table_changes() |
| Scalability | Thousands of files | Millions of files | Any Delta table |
| Schema evolution | Manual | Automatic (rescue column) | Follows source schema |
| Progress tracking | Load history kept with the target table | Checkpoint directory | Delta commit versions |
| Use case | Simple batch | File-based streaming | Change-based incremental |

> **Exam Tip:** Auto Loader uses `cloudFiles` format. COPY INTO is simpler but Auto Loader scales better (file notification mode for millions of files). CDF captures row-level changes and is ideal for propagating updates through Medallion layers.

> **Next:** LAB 06 - Advanced Transforms