# Bronze Layer – Raw Order Ingestion (Auto Loader)

## Purpose
This notebook builds the **Bronze layer** of the Crisis Recovery Analytics platform. The Bronze layer represents the *raw, immutable system of record* and is responsible for ingesting application-level order events with minimal transformation.

The design follows real-world Databricks Lakehouse best practices using **Auto Loader**, enabling scalable, fault-tolerant, and idempotent ingestion.

---

## Business Context
During a simulated crisis recovery period for a food delivery platform, large volumes of order events arrive continuously from operational systems. These events may:
- Arrive late or out of order
- Contain schema drift (new fields)
- Include partially corrupt records

The Bronze layer ensures **no data loss** while preserving raw fidelity for downstream analytics.

---

## Inputs
- JSON files written to the Landing Zone
  - Path: `/Volumes/workspace/food_delivery/landing_zone/orders`
  - Represents raw application logs (simulated)

## Outputs
- Delta table: `food_delivery.bronze_orders`
- Schema evolution metadata
- Streaming checkpoint state

---

## Key Design Decisions

### Why Auto Loader?
- Tracks processed files via checkpoints
- Prevents duplicate ingestion on notebook re-runs
- Supports schema evolution: new columns are recorded in the schema location and picked up on the next run

### Why JSON + Landing Zone?
- Mimics real-world S3 / ADLS ingestion patterns
- Decouples producers (apps) from consumers (analytics)

## 1: Landing Zone Validation

### Business Problem

In production environments, notebooks may be:
- Re-run manually
- Re-triggered by schedulers
- Restarted after failures

If raw files are regenerated on every run, it can lead to:
- Duplicate ingestion
- Inconsistent downstream metrics

---

### Approach

Before exporting any raw JSON files, we:
- Check whether the landing zone already contains data
- Skip regeneration if files exist

This ensures **notebook-level idempotency**.

---

### Design Decision

We prefer **idempotent checks** over manual cleanup because:
- Production pipelines must be restart-safe
- Engineers should not rely on human intervention
- Replays should not corrupt data

## 2: Simulating Application Raw Events

### Business Problem

Our upstream system (QuickBite app) does not exist.
However, downstream ingestion pipelines expect:
- Semi-structured JSON logs
- Partitioned by time
- Written incrementally

---

### Approach

We simulate the application by:
- Reading from `fact_orders`
- Writing JSON files into the landing zone
- Partitioning by `year` and `month`

This mirrors how:
- Mobile apps
- Backend services
- Event pipelines

write data into cloud object storage.
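
The directory layout produced by partitioning on `year` and `month` can be sketched in plain Python. This is only an illustration of the Hive-style `key=value` folder naming that Spark's `partitionBy` produces; the helper name `partition_dir` and the sample timestamp are assumptions made for the example, not part of the notebook.

```python
from datetime import datetime


def partition_dir(base_path: str, event_time: datetime) -> str:
    """Return the Hive-style partition directory for one event timestamp."""
    # year and month are written as plain integers, so March becomes month=3
    return f"{base_path}/year={event_time.year}/month={event_time.month}"


landing_path = "/Volumes/workspace/food_delivery/landing_zone/orders"

# Hypothetical order created on 15 March 2024
example_dir = partition_dir(landing_path, datetime(2024, 3, 15, 18, 30))
print(example_dir)
```

Because the partition values live in the folder names, a reader can skip whole months without opening any JSON file.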

---

### Why JSON?

- Schema-flexible
- Common for event logs
- Easy to evolve without breaking ingestion

In [0]:
%sql
CREATE VOLUME IF NOT EXISTS workspace.food_delivery.landing_zone;


In [0]:
from pyspark.sql import functions as F

# Landing zone simulating application object storage
landing_path = "/Volumes/workspace/food_delivery/landing_zone/orders"

# Utility function to check whether landing zone already has files
def landing_zone_is_empty(path):
    try:
        return len(dbutils.fs.ls(path)) == 0
    except Exception:
        # Path does not exist yet
        return True

# Idempotency check: export raw files only once
if landing_zone_is_empty(landing_path):
    print("Landing zone empty → exporting raw JSON files")

    # Source data representing application ground truth
    df_orders = spark.table("food_delivery.fact_orders")

    # Write raw events as JSON, partitioned by time
    # This simulates how real apps write logs to cloud storage
    (
        df_orders
        .withColumn("year", F.year("created_at_simulated"))
        .withColumn("month", F.month("created_at_simulated"))
        .write
        .format("json")
        .mode("append")              # Append is safe because export is guarded
        .partitionBy("year", "month")
        .save(landing_path)
    )

    print("Raw JSON files written once.")
else:
    print("Landing zone already has data → skipping export.")


## 3: Bronze Ingestion Using Auto Loader

### Business Problem

Traditional batch ingestion:
- Reprocesses files on every run
- Cannot safely handle schema changes
- Is brittle during failures

---

### Approach

We use **Databricks Auto Loader** with:
- `cloudFiles` source
- Streaming semantics
- Checkpoint-based state tracking

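Even though the data is finite, we run Auto Loader with the **`availableNow` trigger**, which processes every file available when the stream starts and then stops. This mirrors how a scheduled production job would ingest.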

---

### Why Auto Loader Instead of Batch Reads?

| Reason | Benefit |
|------|--------|
| File tracking | Prevents duplicate ingestion |
| Checkpoints | Fault-tolerant restarts |
| Schema evolution | Handles new columns safely |
| Rescued data | Preserves corrupt records |

This is a widely used ingestion pattern for file-based sources on Databricks.


## 4: Schema Evolution & Rescued Data

### Business Problem

Upstream systems change over time:
- New columns are added
- Data types drift
- Partial corruption occurs

Failing the pipeline on such events would cause:
- Downtime
- Data loss
- Broken dashboards

---

### Approach

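We explicitly enable:
- Schema evolution (`addNewColumns`)
- Schema location tracking (`cloudFiles.schemaLocation`)

Auto Loader adds the `_rescued_data` column automatically because the schema is inferred rather than supplied.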

This ensures:
- No data is dropped
- Unknown fields are preserved
- Debugging remains possible
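
To make the debugging point concrete, here is a minimal sketch of unpacking one `_rescued_data` value. It assumes the column holds a JSON string that may also carry the source file path under `_file_path`; the helper `rescued_fields`, the `tip_amount` field, and the sample file path are hypothetical and only serve the illustration.

```python
import json
from typing import Any, Dict, Optional


def rescued_fields(rescued_data: Optional[str]) -> Dict[str, Any]:
    """Return the fields that did not fit the table schema for one row."""
    if not rescued_data:
        # NULL or empty means the whole record matched the schema
        return {}
    payload = json.loads(rescued_data)
    # Drop the bookkeeping key so only the rescued business fields remain
    return {key: value for key, value in payload.items() if key != "_file_path"}


# Hypothetical row: tip_amount arrived as text where a number was expected
sample = '{"tip_amount": "five", "_file_path": "/landing/orders/part-0001.json"}'
print(rescued_fields(sample))
```

In the notebook, candidate rows could be selected first with a filter such as `F.col("_rescued_data").isNotNull()` on the Bronze table.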

## 5: Checkpoints and Exactly-Once Semantics

### Business Problem

Streaming ingestion without checkpoints can:
- Reprocess files
- Duplicate records
- Corrupt aggregates

---

### Approach

We configure a dedicated checkpoint directory:
- Tracks processed files
- Stores streaming state
- Enables safe recovery

Combined with the transactional Delta sink, this gives **exactly-once ingestion** even if:
- The notebook is re-run
- The cluster restarts
- A failure occurs mid-ingestion
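
The idea behind checkpoint-based file tracking can be shown with a toy model in plain Python. This is a conceptual sketch only and does not reflect how Auto Loader stores its state internally; the function `ingest_new_files` and the single JSON checkpoint file are assumptions made for the illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def ingest_new_files(landing_dir: Path, checkpoint_file: Path) -> List[str]:
    """Return only the files not yet recorded in the checkpoint, then record them."""
    seen = set(json.loads(checkpoint_file.read_text())) if checkpoint_file.exists() else set()
    new_files = sorted(p.name for p in landing_dir.glob("*.json") if p.name not in seen)
    # A real pipeline would write the records to the Bronze table at this point
    # and commit the data together with the checkpoint update.
    checkpoint_file.write_text(json.dumps(sorted(seen | set(new_files))))
    return new_files


with tempfile.TemporaryDirectory() as tmp:
    landing = Path(tmp) / "orders"
    landing.mkdir()
    checkpoint = Path(tmp) / "checkpoint.json"

    (landing / "a.json").write_text("{}")
    first_run = ingest_new_files(landing, checkpoint)   # sees a.json
    second_run = ingest_new_files(landing, checkpoint)  # re-run, nothing new

    (landing / "b.json").write_text("{}")
    third_run = ingest_new_files(landing, checkpoint)   # sees only b.json

print(first_run, second_run, third_run)
```

The re-run returns an empty list because the checkpoint already lists `a.json`, which is the behaviour that makes repeated notebook runs safe.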


In [0]:
# Source path where raw application JSON events arrive
source_path = "/Volumes/workspace/food_delivery/landing_zone/orders"

# Checkpoint path to track processed files (exactly-once ingestion)
checkpoint_path = "/Volumes/workspace/food_delivery/food_delivery_data/checkpoints/bronze_orders"

# Schema location to persist inferred schema and support schema evolution
schema_path = "/Volumes/workspace/food_delivery/food_delivery_data/schemas/bronze_orders"

# Target Bronze table
table_name = "food_delivery.bronze_orders"

# Idempotency comes from the checkpoint, not from a table-existence check:
# re-runs skip files already ingested and pick up any that are new or were
# left unprocessed by a failed run.
query = (
    spark.readStream
    .format("cloudFiles")                         # Enable Auto Loader
    .option("cloudFiles.format", "json")          # Raw app events are JSON
    .option("cloudFiles.inferColumnTypes", "true")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .option("cloudFiles.schemaLocation", schema_path)
    .load(source_path)

    # Write as Delta with partitioning for query efficiency
    .writeStream
    .format("delta")
    .partitionBy("year", "month")
    .option("checkpointLocation", checkpoint_path)
    .option("mergeSchema", "true")

    # availableNow = process all current files once (batch-style streaming)
    .trigger(availableNow=True)
    .toTable(table_name)
)

# Wait until ingestion completes
query.awaitTermination()
print("Bronze ingestion complete.")


In [0]:
display(spark.table("food_delivery.bronze_orders").limit(1000))

In [0]:
%sql 
SELECT COUNT(*) FROM food_delivery.fact_orders


In [0]:
%sql
SELECT COUNT(*) FROM food_delivery.bronze_orders
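
The two counts above act as a manual reconciliation check. A small helper could make that comparison explicit; the sketch below is an assumption about how one might do it, with `reconcile_counts` as a hypothetical name and placeholder numbers standing in for the real table counts so that it runs without Spark.

```python
from typing import Dict, Union


def reconcile_counts(source_count: int, bronze_count: int) -> Dict[str, Union[int, bool]]:
    """Compare source and Bronze row counts and report any difference."""
    return {
        "source": source_count,
        "bronze": bronze_count,
        "missing": max(source_count - bronze_count, 0),  # rows not ingested
        "extra": max(bronze_count - source_count, 0),    # possible duplicates
        "match": source_count == bronze_count,
    }


# In the notebook the counts would come from the two tables, for example:
#   source_count = spark.table("food_delivery.fact_orders").count()
#   bronze_count = spark.table("food_delivery.bronze_orders").count()
# Placeholder values are used here instead.
result = reconcile_counts(source_count=1000, bronze_count=1000)
print(result["match"])
```

A non-zero `extra` would point to duplicate ingestion, while a non-zero `missing` would point to files that were skipped or failed.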

## Downstream Dependencies

The Bronze table feeds:
- `silver_orders_clean` (data validation & typing)
- `silver_orders_enriched` (customer & review joins)
- ML feature pipelines
- Operational dashboards

Any error here propagates to every downstream consumer, which is why **Bronze must be boring, stable, and safe**.

---

## Summary

This notebook establishes a **production-grade ingestion foundation** by:
- Simulating real application logs
- Using Auto Loader for reliability
- Enforcing idempotency and fault tolerance
- Preserving raw data fidelity

It forms the **backbone** of the Crisis Recovery analytics platform.