# LAB 05: Streaming & Auto Loader

**Duration:** ~35 min | **Day:** 2 | **Difficulty:** Intermediate-Advanced

> *"Set up Auto Loader for streaming JSON ingestion into the Bronze layer with exactly-once guarantees."*

## Setup

In [None]:
%run ../../setup/00_setup

In [None]:
# Prepare landing zone and checkpoint paths
landing_path = f"{DATASET_PATH}/orders/stream"
checkpoint_path = f"/tmp/{CATALOG}/lab05/checkpoint"
schema_path = f"/tmp/{CATALOG}/lab05/schema"
target_table = f"{CATALOG}.{BRONZE_SCHEMA}.orders_stream"

# Clean up from previous runs
spark.sql(f"DROP TABLE IF EXISTS {target_table}")
dbutils.fs.rm(checkpoint_path, True)
dbutils.fs.rm(schema_path, True)

print(f"Landing path: {landing_path}")
print(f"Target table: {target_table}")
print(f"Files available: {[f.name for f in dbutils.fs.ls(landing_path)]}")

---
## Task 1: COPY INTO (Batch Ingestion)

Use `COPY INTO` to load the JSON files currently in the landing zone into the target table.

In [None]:
# First create the target table
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {target_table}
    (order_id STRING, customer_id STRING, product_id STRING, 
     quantity INT, total_price DOUBLE, order_date STRING, 
     payment_method STRING, store_id STRING)
""")

# TODO: Use COPY INTO to load JSON files
spark.sql(f"""
    COPY INTO {target_table}
    FROM '{landing_path}'
    FILEFORMAT = ________
    FORMAT_OPTIONS ('mergeSchema' = 'true')
""")

count_after_copy = spark.table(target_table).count()
print(f"Rows after COPY INTO: {count_after_copy}")

In [None]:
# -- Validation --
assert count_after_copy > 0, "COPY INTO should have loaded data"
print(f"Task 1 OK: {count_after_copy} rows loaded via COPY INTO")

---
## Task 2: Verify COPY INTO Idempotency

Run COPY INTO again on the same files. How many new rows are loaded?
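
To build intuition for why the re-run adds nothing, here is a small pure-Python model of idempotent loading. It is only a conceptual sketch: `load_once`, `ledger`, and the sample file names are hypothetical, and this is not how Databricks implements `COPY INTO` internally. The idea it illustrates is that a loader which remembers the files it has already ingested can be re-run safely.

```python
# Conceptual model only: a ledger of already-loaded files makes a load idempotent.
# `load_once`, `ledger`, and the file names below are hypothetical, not Databricks APIs.

def load_once(source_files, ledger, target):
    """Append rows from files not yet in the ledger; return the number of new rows."""
    new_rows = 0
    for path, rows in sorted(source_files.items()):
        if path in ledger:
            continue  # already loaded on an earlier run, so skip it
        target.extend(rows)
        ledger.add(path)
        new_rows += len(rows)
    return new_rows


source_files = {
    "orders_001.json": [{"order_id": "A1"}, {"order_id": "A2"}],
    "orders_002.json": [{"order_id": "B1"}],
}
ledger, target = set(), []

first_run = load_once(source_files, ledger, target)   # loads both files
second_run = load_once(source_files, ledger, target)  # nothing new to load
print(first_run, second_run, len(target))  # prints: 3 0 3
```

The second call returns 0 because every path is already in the ledger, which mirrors the assertion in the validation cell for this task.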

In [None]:
# TODO: Re-run the same COPY INTO
spark.sql(f"""
    COPY INTO {target_table}
    FROM '{landing_path}'
    FILEFORMAT = JSON
    FORMAT_OPTIONS ('mergeSchema' = 'true')
""")

count_after_rerun = spark.table(target_table).count()
print(f"Rows after re-run: {count_after_rerun} (was {count_after_copy})")
print(f"New rows loaded: {count_after_rerun - count_after_copy}")

In [None]:
# -- Validation --
assert count_after_rerun == count_after_copy, "COPY INTO should be idempotent - no new rows!"
print("Task 2 OK: COPY INTO is idempotent. 0 new rows on re-run.")

---
## Task 3: Auto Loader - Configure Stream

Set up Auto Loader (`cloudFiles`) to read JSON files from the landing zone.

Key options:
- `cloudFiles.format` = json
- `cloudFiles.schemaLocation` = path for inferred schema
- `cloudFiles.inferColumnTypes` = true
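
As an optional variation, the three options above can be collected in a plain dict and passed in one call through `DataStreamReader.options(**opts)`. The helper name `autoloader_options` and the placeholder path are assumptions made for this sketch; the lab itself sets each option with `.option(...)`.

```python
def autoloader_options(schema_location, file_format="json"):
    """Return the Auto Loader options listed above as a plain dict.

    `autoloader_options` is a hypothetical helper, not part of any library.
    """
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.inferColumnTypes": "true",
    }


opts = autoloader_options("/tmp/example/schema")  # placeholder path for illustration
for key, value in opts.items():
    print(f"{key} = {value}")

# In a Databricks notebook, where `spark` and `landing_path` exist, the dict
# could then be applied in a single call:
# df = spark.readStream.format("cloudFiles").options(**opts).load(landing_path)
```

Keeping the options in one dict makes it easy to reuse the same configuration when the stream is re-created later in the lab.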

In [None]:
# Reset target for Auto Loader test
al_target = f"{CATALOG}.{BRONZE_SCHEMA}.orders_autoloader"
spark.sql(f"DROP TABLE IF EXISTS {al_target}")
dbutils.fs.rm(checkpoint_path, True)
dbutils.fs.rm(schema_path, True)

In [None]:
# TODO: Configure Auto Loader readStream
df_stream = (
    spark.readStream
    .format(________)
    .option("cloudFiles.format", ________)
    .option("cloudFiles.schemaLocation", schema_path)
    .option("cloudFiles.inferColumnTypes", "true")
    .load(landing_path)
)

In [None]:
# -- Validation --
assert df_stream.isStreaming, "Should be a streaming DataFrame"
print(f"Task 3 OK: Streaming DataFrame configured with schema: {df_stream.schema.fieldNames()}")

---
## Task 4: Write Stream with trigger(availableNow=True)

Write the stream to a Delta table using `trigger(availableNow=True)`.

This processes all available files and stops automatically.
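
The behavior can be pictured with a small pure-Python model: snapshot the backlog when the run starts, drain it in one or more micro-batches, then stop. This is a conceptual sketch, not Spark code; `run_available_now` and `max_files_per_batch` are hypothetical names (the latter loosely stands in for a rate limit such as `cloudFiles.maxFilesPerTrigger`).

```python
def run_available_now(landing_zone, processed, max_files_per_batch=2):
    """Process the files present at start in micro-batches, then stop.

    Hypothetical model of an available-now run; returns the list of batches.
    """
    # Snapshot the backlog once; files that arrive later wait for the next run.
    backlog = sorted(f for f in landing_zone if f not in processed)
    batches = []
    for start in range(0, len(backlog), max_files_per_batch):
        batch = backlog[start:start + max_files_per_batch]
        processed.update(batch)  # record progress, like a checkpoint would
        batches.append(batch)
    return batches  # the run ends here instead of waiting for more data


landing_zone = {"f1.json", "f2.json", "f3.json"}
processed = set()

first = run_available_now(landing_zone, processed)
landing_zone.add("f4.json")  # a new file lands between runs
second = run_available_now(landing_zone, processed)

print(first)   # [['f1.json', 'f2.json'], ['f3.json']]
print(second)  # [['f4.json']]
```

The first run needs two micro-batches to drain three files, and the second run picks up only the file that arrived afterwards.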

In [None]:
# TODO: Write stream to Delta table
query = (
    df_stream
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", checkpoint_path)
    .trigger(________=________)
    .toTable(al_target)
)

query.awaitTermination()
print(f"Stream completed. Rows loaded: {spark.table(al_target).count()}")

In [None]:
# -- Validation --
al_count = spark.table(al_target).count()
assert al_count > 0, "Auto Loader should have loaded data"
print(f"Task 4 OK: {al_count} rows loaded via Auto Loader")

---
## Task 5: Incremental Processing

Re-run the stream. Since no new files arrived, 0 new rows should be processed.

This proves the checkpoint tracks processed files.
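
Besides comparing table counts, the number of rows a run actually read can be checked from the query's progress events. The sketch below assumes each progress entry behaves like a dict with a `numInputRows` field, as the entries of `query.recentProgress` do in PySpark; `total_input_rows` is a hypothetical helper, and the sample events are hand-written stand-ins rather than real output.

```python
def total_input_rows(progress_events):
    """Sum numInputRows over a list of dict-like progress events."""
    return sum(event.get("numInputRows", 0) for event in progress_events)


# Hand-written stand-ins for progress events (not real output from a stream).
first_run_events = [
    {"batchId": 0, "numInputRows": 120},
    {"batchId": 1, "numInputRows": 80},
]
rerun_events = []  # a re-run with no new files reports no input rows

print(total_input_rows(first_run_events))  # prints: 200
print(total_input_rows(rerun_events))      # prints: 0

# In the notebook, after query2.awaitTermination(), this could be called as:
# total_input_rows(query2.recentProgress)
```

A total of 0 for the re-run is consistent with the checkpoint having skipped every file that was already ingested.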

In [None]:
# Re-run the stream (same checkpoint = incremental)
df_stream2 = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", schema_path)
    .option("cloudFiles.inferColumnTypes", "true")
    .load(landing_path)
)

query2 = (
    df_stream2
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)
    .toTable(al_target)
)

query2.awaitTermination()
al_count2 = spark.table(al_target).count()
print(f"Rows after re-run: {al_count2} (was {al_count})")
print(f"New rows: {al_count2 - al_count}")

In [None]:
# -- Validation --
assert al_count2 == al_count, f"Should be 0 new rows, but got {al_count2 - al_count}"
print("Task 5 OK: Incremental processing verified. 0 new rows on re-run (checkpoint works!)")

---
## Cleanup

In [None]:
# Stop any active streams
for s in spark.streams.active:
    s.stop()

spark.sql(f"DROP TABLE IF EXISTS {target_table}")
spark.sql(f"DROP TABLE IF EXISTS {al_target}")
dbutils.fs.rm(checkpoint_path, True)
dbutils.fs.rm(schema_path, True)
print("Lab cleanup complete")

---
## Lab Complete!

You have:
- Used COPY INTO for idempotent batch loading
- Configured Auto Loader (cloudFiles) for streaming ingestion
- Used trigger(availableNow=True) for incremental processing
- Verified checkpoint-based exactly-once guarantees

> **Exam Tip:** Auto Loader uses the `cloudFiles` source format. `COPY INTO` is simpler, but Auto Loader scales better (file notification mode handles millions of files). Both skip files that were already loaded, so re-runs are idempotent.

| Feature | COPY INTO | Auto Loader |
|---------|-----------|-------------|
| Interface | SQL command | `readStream`/`writeStream` |
| Scalability | Thousands of files | Millions of files |
| Schema evolution | Opt-in (`mergeSchema` copy option) | Automatic (rescued data column) |
| File tracking | Load history kept with the target Delta table | Checkpoint directory |

> **Next:** LAB 06 - Advanced Transforms