# Silver Layer - Quality Firewall

## Purpose

This notebook builds the **Silver layer** of the Crisis Recovery Lakehouse.

The Silver layer is responsible for converting **raw Bronze data** into
**clean, typed, analytics-ready datasets** that can safely be used for:
- Dashboards
- Machine Learning
- AI enrichment
- Business decisioning

Unlike the Bronze layer, Silver **enforces structure and quality**.

---

## Business Context

During a crisis, raw operational data contains:
- Incorrect timestamps
- Invalid delivery durations
- Negative or zero item counts
- Schema inconsistencies from upstream systems

If such data flows directly into analytics or ML pipelines:
- KPIs become unreliable
- Models learn incorrect patterns
- Alerts generate false positives

The Silver layer acts as a **quality firewall** between ingestion and insight.

---

## Inputs and Outputs

### Input

| Source | Description |
|------|-------------|
| `food_delivery.bronze_orders` | Raw, schema-evolving order events |

---

### Outputs

| Table | Purpose |
|------|--------|
| `silver_orders_clean` | Typed & validated order records |
| `silver_orders_enriched` | Customer + review enriched orders |
| `silver_sla_metrics` | Operational & SLA-focused metrics |

---

## Design Principles of the Silver Layer

- Enforce **correct data types**
- Apply **business validity rules**
- Separate datasets by **analytical purpose**
- Avoid overloading a single “god table”
- Prepare data for **Gold and ML layers**

---

## 1: Silver Orders – Canonical Cleaning (`silver_orders_clean`)

### Business Problem

Bronze data preserves reality but:
- Uses string timestamps
- Allows invalid business values
- Cannot support event-time analytics

Downstream systems require **typed and valid data**.

---

### Approach

We create a **canonical Silver table** that:
- Casts all columns to correct types
- Removes logically invalid records
- Preserves all usable order facts

This table becomes the **single source of truth** for all downstream Silver and Gold datasets.
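A single source of truth also implies one row per order. The code cell further down deduplicates on `order_id`, keeping the row with the latest `created_at_simulated`. The plain-Python sketch below illustrates that rule on a few in-memory records; it is not the Spark implementation, and the helper name `latest_per_order` and the sample rows are hypothetical.

```python
from datetime import datetime

def latest_per_order(rows):
    """Keep one row per order_id: the row with the greatest created_at_simulated.

    Hypothetical helper for illustration only. On exact timestamp ties this
    sketch keeps the first row seen; a Spark row_number() window does not
    guarantee which tied row is kept.
    """
    latest = {}
    for row in rows:
        key = row["order_id"]
        current = latest.get(key)
        if current is None or row["created_at_simulated"] > current["created_at_simulated"]:
            latest[key] = row
    return list(latest.values())

# Made-up sample rows: order 1 arrives twice, order 2 once.
rows = [
    {"order_id": 1, "created_at_simulated": datetime(2024, 1, 1, 10, 0), "subtotal": 100},
    {"order_id": 1, "created_at_simulated": datetime(2024, 1, 1, 10, 5), "subtotal": 120},
    {"order_id": 2, "created_at_simulated": datetime(2024, 1, 1, 9, 0), "subtotal": 50},
]
deduped = latest_per_order(rows)
```

Here `deduped` holds two rows, and the row kept for order 1 is the later one.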

---

### Key Validations Applied

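- `actual_delivery_time_simulated > created_at_simulated`
- `total_items > 0`
- `subtotal >= 0`, `min_item_price >= 0`, and `max_item_price >= min_item_price`
- Non-null `order_id`, `customer_id`, and both timestamps
- One row per `order_id` (the latest `created_at_simulated` is kept)
- Numeric casting for all metrics
- Timestamp conversion for event-time processing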
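To make the first two comparison rules concrete, here is a minimal plain-Python sketch that applies them to a single record. It is an illustration only, not the PySpark filter this notebook runs; the function name `is_valid_order` and the sample records are hypothetical, and the column names follow the `_simulated` timestamp columns used in the code cell.

```python
from datetime import datetime

def is_valid_order(order: dict) -> bool:
    """Return True if one order record passes the two core validity rules.

    Hypothetical helper for illustration; the notebook expresses the
    equivalent checks as a DataFrame filter.
    """
    created = order.get("created_at_simulated")
    delivered = order.get("actual_delivery_time_simulated")
    items = order.get("total_items")
    # Missing critical values fail validation outright.
    if created is None or delivered is None or items is None:
        return False
    # Delivery must be strictly after creation, and the order must have items.
    return delivered > created and items > 0

# Made-up sample records.
good = {
    "created_at_simulated": datetime(2024, 1, 1, 12, 0),
    "actual_delivery_time_simulated": datetime(2024, 1, 1, 12, 40),
    "total_items": 3,
}
delivered_before_created = dict(good, actual_delivery_time_simulated=datetime(2024, 1, 1, 11, 0))
no_items = dict(good, total_items=0)
```

`is_valid_order(good)` returns `True`, while the other two records are rejected.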

---

### Why a Canonical Clean Table?

Instead of repeating casting logic everywhere:
- We clean once
- Reuse everywhere
- Reduce risk of inconsistent logic

This mirrors real production data platforms.

In [0]:
from pyspark.sql.functions import col, row_number, to_timestamp
from pyspark.sql.window import Window

# Deduplication window: one partition per order_id, newest created_at_simulated first
window_spec = Window.partitionBy("order_id").orderBy(col("created_at_simulated").desc())

# Quality constraints applied in the final filter:
# Constraint 1: Delivery must happen AFTER creation
# Constraint 2: Order must have at least one item
# Constraint 3: subtotal and min_item_price must be non-negative, and max_item_price >= min_item_price

if not spark.catalog.tableExists("food_delivery.silver_orders_clean"):
    silver_orders_clean = (
        spark.table("food_delivery.bronze_orders")
        # ---------- timestamps ----------
        .withColumn("created_at_simulated", to_timestamp("created_at_simulated"))
        .withColumn("actual_delivery_time_simulated", to_timestamp("actual_delivery_time_simulated"))

        # ---------- numeric sanity ----------
        .withColumn("total_items", col("total_items").cast("int"))
        .withColumn("subtotal", col("subtotal").cast("int"))
        .withColumn("num_distinct_items", col("num_distinct_items").cast("int"))
        .withColumn("min_item_price", col("min_item_price").cast("int"))
        .withColumn("max_item_price", col("max_item_price").cast("int"))

        .withColumn("estimated_order_place_duration", col("estimated_order_place_duration").cast("int"))
        .withColumn("estimated_store_to_consumer_driving_duration",
                    col("estimated_store_to_consumer_driving_duration").cast("int"))

        .withColumn("total_busy_dashers", col("total_busy_dashers").cast("int"))
        .withColumn("total_onshift_dashers", col("total_onshift_dashers").cast("int"))
        .withColumn("total_outstanding_orders", col("total_outstanding_orders").cast("int"))

        # ---------- IDs ----------
        .withColumn("order_id", col("order_id").cast("long"))
        .withColumn("customer_id", col("customer_id").cast("long"))
        .withColumn("store_id", col("store_id").cast("int"))
        .withColumn("market_id", col("market_id").cast("int"))

        # ---------- null protection on critical columns ----------
        .filter(
            col("created_at_simulated").isNotNull() &
            col("actual_delivery_time_simulated").isNotNull() &
            col("order_id").isNotNull() &
            col("customer_id").isNotNull()
        )

        # ---------- Deduplication -----------
        .withColumn("rn", row_number().over(window_spec))
        .filter(col("rn") == 1)
        .drop("rn")

        # ---------- data quality filters ----------
        .filter(
            (col("actual_delivery_time_simulated") > col("created_at_simulated")) &
            (col("total_items") > 0) &
            (col("subtotal") >= 0) &
            (col("min_item_price") >= 0) &
            (col("max_item_price") >= col("min_item_price"))
        )
    )

    # ---------- write silver_orders_clean as a Delta table ----------
    silver_orders_clean.write.format("delta") \
        .mode("overwrite") \
        .saveAsTable("food_delivery.silver_orders_clean")

    print("Silver clean table created.")

else:
    print("Silver clean table already exists → skipping creation")
# Print the schema tree to the console
spark.table("food_delivery.silver_orders_clean").printSchema()
   

## 2: Silver Customer Enrichment (`silver_orders_enriched`)

### Business Problem

Customer experience analysis requires:
- Order behavior
- Customer attributes
- Review sentiment signals

These datasets exist separately and must be joined carefully.

---

### Approach

We enrich the clean Silver orders by:
- Joining with `dim_customers`
- Joining with `fact_reviews`
- Producing a **Customer 360 view**

This table answers:
> “What happened, to whom, and how did they feel?”
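The enrichment uses left joins, so every clean order survives even when no customer or review matches. The plain-Python sketch below illustrates that behavior with dictionaries; it is not the Spark join itself. The function name `enrich` and the sample data are hypothetical, and the sketch assumes at most one customer row per `customer_id` and one review per `order_id`.

```python
def enrich(orders, customers_by_id, reviews_by_order):
    """Left-join style enrichment: keep every order, fill misses with None.

    Hypothetical helper for illustration; it assumes lookups are unique
    per key, which the real tables may or may not guarantee.
    """
    enriched = []
    for order in orders:
        customer = customers_by_id.get(order["customer_id"], {})
        review = reviews_by_order.get(order["order_id"], {})
        enriched.append({
            "order_id": order["order_id"],
            "customer_id": order["customer_id"],
            "segment": customer.get("segment"),
            "review_score": review.get("review_score"),
        })
    return enriched

# Made-up sample data: order 11 has no review, order 12 has no known customer.
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 99},
]
customers_by_id = {1: {"segment": "premium"}}
reviews_by_order = {10: {"review_score": 4}, 12: {"review_score": 2}}

result = enrich(orders, customers_by_id, reviews_by_order)
```

All three orders appear in `result`, with `None` where a lookup found nothing. If the reviews table held several reviews for the same order, a real join would emit one output row per review, so it is worth checking key uniqueness before relying on row counts.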

---

### Streaming Design Note

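The design targets **Stream → Static** joins with watermarks, but the cell below runs as a **batch** job and applies no watermark.
It does preserve the event-time columns (`created_at_simulated` and `actual_delivery_time_simulated`) in the output.

Keeping those columns makes the pipeline **future-ready** for windowed analytics and alerts.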


In [0]:
# Batch build: skip if the enriched table already exists
if not spark.catalog.tableExists("food_delivery.silver_orders_enriched"):
    silver_enriched = (
        spark.table("food_delivery.silver_orders_clean").alias("orders")
        # --------------------------------------------------
        # Enrich with Customer Dimension (Lookup Join)
        # --------------------------------------------------
        .join(
            spark.table("food_delivery.dim_customers").alias("customers"),
            on="customer_id",
            how="left"
        )
        # --------------------------------------------------
        # Enrich with Reviews Fact (Lookup Join)
        # --------------------------------------------------
        .join(
            spark.table("food_delivery.fact_reviews").alias("reviews"),
            on="order_id",
            how="left"
        )
        # --------------------------------------------------
        # Final Enriched Schema
        # --------------------------------------------------
        .select(
            "orders.order_id",
            "orders.created_at_simulated",
            "orders.actual_delivery_time_simulated",
            "orders.store_id",
            "orders.subtotal",
            "customers.customer_id",
            "customers.customer_name",
            "customers.segment",
            "reviews.review_score",
            "reviews.sentiment_category",
            "reviews.review_text"
        )
    )

    # ---------- silver_enriched table is created ----------
    silver_enriched.write \
        .format("delta") \
        .mode("overwrite") \
        .saveAsTable("food_delivery.silver_orders_enriched")

    print("Silver Customer Enriched table built (batch).")

else:
    print("Silver enriched table already exists → skipping creation")

# Print the schema tree to the console
spark.table("food_delivery.silver_orders_enriched").printSchema()

spark.table("food_delivery.silver_orders_enriched").show(50)


## 3: SLA & Operational Metrics (`silver_sla_metrics`)

### Business Problem

Operational teams care about:
- Delivery delays
- Dasher load
- Order backlog pressure

These signals are **different** from customer analytics and should not be mixed.

---

### Approach

We derive a dedicated SLA table that:
- Calculates delivery delays
- Retains operational load indicators
- Focuses on store & market performance

This separation enables:
- Faster queries
- Clear ownership
- Focused Gold aggregations
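As a worked example of the delay calculation, the plain-Python sketch below mirrors the formula used in the following cell: elapsed seconds between creation and delivery, minus the estimated store-to-consumer driving duration. The function name and the sample values are hypothetical; the notebook computes the same quantity with Spark column arithmetic.

```python
from datetime import datetime

def delivery_delay_seconds(created_at, delivered_at, est_driving_seconds):
    """Elapsed delivery time in seconds minus the estimated driving duration.

    Hypothetical helper for illustration. Returns None when the estimate is
    missing, similar to how a null operand yields null in Spark arithmetic.
    """
    if est_driving_seconds is None:
        return None
    elapsed = int((delivered_at - created_at).total_seconds())
    return elapsed - est_driving_seconds

# Made-up sample: a 45-minute delivery with a 10-minute driving estimate.
created = datetime(2024, 1, 1, 12, 0, 0)
delivered = datetime(2024, 1, 1, 12, 45, 0)
delay = delivery_delay_seconds(created, delivered, 600)  # 2700 - 600 = 2100
```

Note that under this definition the value covers everything except the driving leg (order placement, preparation, and any waiting), so it is positive for most orders and is best compared across stores or markets rather than against zero.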

In [0]:
from pyspark.sql.functions import col

# Check if the SLA metrics table already exists to avoid overwriting
if not spark.catalog.tableExists("food_delivery.silver_sla_metrics"):
    # Read cleaned order data from Silver layer
    silver_sla_metrics = (
        spark.table("food_delivery.silver_orders_clean")
        # Delivery delay in seconds: (actual delivery time - order creation time)
        # minus the estimated store-to-consumer driving duration.
        # Casting a timestamp to long yields epoch seconds.
        .withColumn(
            "delivery_delay_seconds",
            col("actual_delivery_time_simulated").cast("long")
            - col("created_at_simulated").cast("long")
            - col("estimated_store_to_consumer_driving_duration")
        )
        # Select operational and SLA-relevant columns for focused analysis
        .select(
            "order_id",
            "store_id",
            "market_id",
            "created_at_simulated",
            "actual_delivery_time_simulated",
            "estimated_order_place_duration",
            "estimated_store_to_consumer_driving_duration",
            "delivery_delay_seconds",
            "total_busy_dashers",
            "total_onshift_dashers",
            "total_outstanding_orders"
        )
    )

    # Save the resulting DataFrame as a Delta table for reliable downstream use
    silver_sla_metrics.write.format("delta") \
        .mode("overwrite") \
        .saveAsTable("food_delivery.silver_sla_metrics")

    print("Silver SLA metrics table created.")
else:
    print("Silver SLA metrics table already exists → skipping creation")


In [0]:
spark.table("food_delivery.silver_sla_metrics").show(50)

## Downstream Dependencies

Silver outputs feed:
- Gold KPI dashboards
- Churn prediction models
- AI sentiment analysis
- Geospatial intelligence pipelines

Any error here directly impacts business decisions.

---

## Summary

This notebook transforms **raw ingestion data** into
**trusted analytical assets** by enforcing structure, quality, and purpose-driven design.

It is the **most critical quality gate** in the Crisis Recovery Lakehouse.