### Which cities are struggling the most?

This analysis aggregates delivery performance metrics by city to identify the cities with the most serious operational problems.

Metrics such as average accept-to-pickup time, pickup delays, and the proportion of high-risk deliveries help operations teams prioritize cities that require process improvements or additional monitoring.


In [0]:
from pyspark.sql import functions as F

df_silver = spark.table("capstone_project.logistics.silver_last_mile_deliveries")

gold_city_metrics = (
    df_silver
    .groupBy("city")
    .agg(
        F.count("*").alias("total_deliveries"),
        F.round(F.avg("accept_to_pickup_minutes"), 2).alias("avg_accept_to_pickup_min"),
        F.round(F.avg("pickup_delay_minutes"), 2).alias("avg_pickup_delay_min"),
        F.sum("high_risk_delivery").alias("high_risk_deliveries"),
        F.round(F.sum("high_risk_delivery") / F.count("*"), 2).alias("high_risk_ratio")
    )
    .orderBy(F.desc("high_risk_ratio"))
)

gold_city_metrics.display()
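
To sanity-check the metric definitions on a handful of rows, the same aggregation can be reproduced in plain Python. This is a minimal sketch, not part of the pipeline: the `city_metrics` helper and the sample rows are hypothetical, it covers only a subset of the metrics, and it assumes that `high_risk_delivery` is a 0/1 flag and that no values are missing.

```python
from collections import defaultdict

def city_metrics(rows):
    """Aggregate per-city metrics from an iterable of dicts (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["city"]].append(row)

    result = {}
    for city, items in groups.items():
        total = len(items)
        high_risk = sum(r["high_risk_delivery"] for r in items)
        avg_accept = sum(r["accept_to_pickup_minutes"] for r in items) / total
        result[city] = {
            "total_deliveries": total,
            "avg_accept_to_pickup_min": round(avg_accept, 2),
            "high_risk_deliveries": high_risk,
            # sum / count is a ratio only because the flag is assumed to be 0/1
            "high_risk_ratio": round(high_risk / total, 2),
        }
    return result

# Made-up sample rows, for illustration only.
sample_rows = [
    {"city": "A", "accept_to_pickup_minutes": 10.0, "high_risk_delivery": 1},
    {"city": "A", "accept_to_pickup_minutes": 20.0, "high_risk_delivery": 0},
    {"city": "B", "accept_to_pickup_minutes": 30.0, "high_risk_delivery": 1},
]

metrics = city_metrics(sample_rows)

# Rank cities by high-risk ratio, highest first, like the orderBy in the query.
ranked = sorted(metrics.items(), key=lambda kv: kv[1]["high_risk_ratio"], reverse=True)
```

Two differences from the Spark query are worth keeping in mind when comparing results: Spark's `avg` skips nulls while `count("*")` counts every row, and Spark's `round` rounds half up while Python's built-in `round` rounds half to even, so exact ties can differ in the last digit.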


### Are certain couriers consistently risky?

This analysis focuses on couriers with enough delivery volume (at least 100 orders) for
the risk ratio to be a stable estimate rather than noise from a handful of orders. Couriers
whose risk ratio is 0.7 or higher are flagged, helping identify training needs, operational
issues, or potential system abuse.

Risk bands simplify interpretation for business stakeholders: Extreme (0.9 and above),
High (0.8 and above), and Moderate (all remaining flagged couriers, i.e. 0.7 up to 0.8).


In [0]:
gold_consistently_risky_couriers = (
    df_silver
    .groupBy("courier_id")
    .agg(
        F.count("*").alias("total_orders"),
        F.round(F.avg("accept_to_pickup_minutes"), 2).alias("avg_accept_to_pickup_min"),
        F.sum("missing_pickup_gps").alias("missing_gps_count"),
        F.sum("high_risk_delivery").alias("high_risk_orders"),
        F.round(F.sum("high_risk_delivery") / F.count("*"), 2).alias("risk_ratio")
    )
    .filter(
        (F.col("total_orders") >= 100) &
        (F.col("risk_ratio") >= 0.7)
    )
    .orderBy(F.col("risk_ratio").desc())
)

gold_with_risk_band = (
    gold_consistently_risky_couriers
    .withColumn(
        "risk_band",
        F.when(F.col("risk_ratio") >= 0.9, "Extreme")
         .when(F.col("risk_ratio") >= 0.8, "High")
         .otherwise("Moderate")
    )
)

risk_band_distribution = (
    gold_with_risk_band
    .groupBy("risk_band")
    .count()
)

risk_band_distribution.show()
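
The flagging and banding rules can be sketched as one plain-Python function, which makes the thresholds easy to unit test. This is an illustrative sketch rather than the pipeline code: `classify_courier` is a hypothetical helper, and the ratio is rounded to two decimals before the thresholds are applied because that is what the query above does.

```python
MIN_ORDERS = 100      # volume threshold used in the filter above
MIN_RISK_RATIO = 0.7  # ratio threshold used in the filter above

def classify_courier(total_orders, high_risk_orders):
    """Return the risk band for a courier, or None if the courier is not flagged."""
    if total_orders < MIN_ORDERS:
        return None  # too few orders for the ratio to be meaningful
    risk_ratio = round(high_risk_orders / total_orders, 2)
    if risk_ratio < MIN_RISK_RATIO:
        return None
    # Checks run top-down and the first match wins, like the F.when chain.
    if risk_ratio >= 0.9:
        return "Extreme"
    if risk_ratio >= 0.8:
        return "High"
    return "Moderate"

# A courier with 3 risky orders out of 3 has a ratio of 1.0 but is not flagged.
low_volume = classify_courier(3, 3)
high_volume = classify_courier(200, 190)
```

Note that "Moderate" only means 0.7 up to 0.8 because the filter runs first. Applied to unfiltered couriers, the `otherwise` branch would label every low-risk courier "Moderate" as well.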

### How healthy is our last-mile operation?

This aggregation provides a high-level overview of last-mile delivery performance
across all cities.

Percentile-based metrics (P95) are included to capture tail-risk behavior, which
often drives operational escalations and customer dissatisfaction.


In [0]:
gold_overall_metrics = (
    df_silver
    .agg(
        F.count("*").alias("total_deliveries"),
        F.round(F.avg("accept_to_pickup_minutes"), 2).alias("avg_accept_to_pickup_min"),
        F.round(
            F.expr("percentile_approx(accept_to_pickup_minutes, 0.95)"),
            2
        ).alias("p95_accept_to_pickup"),
        F.sum("high_risk_delivery").alias("high_risk_deliveries"),
        F.round(F.sum("high_risk_delivery") / F.count("*"), 2).alias("high_risk_ratio")
    )
)

gold_overall_metrics.display()
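
A small plain-Python example shows why P95 is reported next to the average. The numbers are made up, and `nearest_rank_percentile` is a hypothetical helper that computes an exact nearest-rank percentile. Spark's `percentile_approx` is an approximation, so its result on real data may differ slightly.

```python
import math
from statistics import mean

def nearest_rank_percentile(values, pct):
    """Exact nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# Made-up accept-to-pickup times in minutes: 18 ordinary orders and 2 very slow ones.
times = [10.0] * 18 + [120.0, 180.0]

avg_minutes = mean(times)                         # 24.0
p95_minutes = nearest_rank_percentile(times, 95)  # 120.0
```

The average of 24 minutes looks acceptable, while the P95 of 120 minutes exposes the slow tail that the average smooths over.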


### Preparing the Gold Dataset for Machine Learning

This dataset contains the cleaned, engineered columns needed for model training. Only the relevant columns are selected to reduce noise and improve model interpretability, and rows with a null in any selected column are dropped.

The resulting table is the direct input for ML pipelines, so training code reads from a
single, consistent Gold table.

In [0]:
ml_gold_dataset = (
    df_silver
    .select(
        "order_id",
        "city",
        "accept_to_pickup_minutes",
        "pickup_delay_minutes",
        "missing_pickup_gps",
        "high_risk_delivery"
    )
    .dropna()
)

ml_gold_dataset.display()
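
Because `dropna()` with no arguments removes every row that has a null in any selected column, it is worth knowing how many rows it discards. The sketch below mirrors that behavior in plain Python on made-up rows. `drop_incomplete` is a hypothetical helper, not a Spark API.

```python
FEATURE_COLUMNS = [
    "order_id",
    "city",
    "accept_to_pickup_minutes",
    "pickup_delay_minutes",
    "missing_pickup_gps",
    "high_risk_delivery",
]

def drop_incomplete(rows, columns):
    """Keep only rows where every listed column is present and not None."""
    return [r for r in rows if all(r.get(c) is not None for c in columns)]

# Made-up rows: the second one has no pickup delay and is dropped.
sample_rows = [
    {"order_id": 1, "city": "A", "accept_to_pickup_minutes": 12.0,
     "pickup_delay_minutes": 3.0, "missing_pickup_gps": 0, "high_risk_delivery": 0},
    {"order_id": 2, "city": "B", "accept_to_pickup_minutes": 40.0,
     "pickup_delay_minutes": None, "missing_pickup_gps": 1, "high_risk_delivery": 1},
]

kept = drop_incomplete(sample_rows, FEATURE_COLUMNS)
dropped_count = len(sample_rows) - len(kept)
```

On the real table, comparing the row counts before and after `dropna()` gives the same figure. A large difference would suggest checking whether the dropped rows are concentrated in high-risk deliveries.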

In [0]:
(
    ml_gold_dataset.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("capstone_project.logistics.gold_ml_last_mile_features")
)


To improve query performance, I run OPTIMIZE on the Gold table to compact small files into larger ones, which reduces scan overhead for analytics and dashboards. I also apply ZORDER BY on city, a frequently filtered column, so that rows for the same city are stored close together and data skipping can minimize I/O for business queries.

In [0]:
%sql
OPTIMIZE capstone_project.logistics.gold_ml_last_mile_features
ZORDER BY (city);
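
The idea behind data skipping can be illustrated with a toy model in plain Python. This is a conceptual sketch only: the file names, city names, and the `files_to_scan` helper are hypothetical, and it models nothing more than per-file minimum and maximum values for the `city` column.

```python
def files_to_scan(file_stats, city):
    """Return the files whose [min, max] city range could contain the given city."""
    return [name for name, (low, high) in file_stats.items() if low <= city <= high]

# Hypothetical per-file min/max statistics before clustering:
# every file spans almost the whole range of city names.
unclustered = {
    "part-0": ("Austin", "Seattle"),
    "part-1": ("Boston", "Tampa"),
    "part-2": ("Chicago", "Phoenix"),
}

# After clustering by city, each file covers a narrow, mostly disjoint range.
clustered = {
    "part-0": ("Austin", "Chicago"),
    "part-1": ("Dallas", "Miami"),
    "part-2": ("Phoenix", "Tampa"),
}

before = files_to_scan(unclustered, "Denver")  # all three files
after = files_to_scan(clustered, "Denver")     # only part-1
```

With a filter on one city, the unclustered layout forces a read of every file, while the clustered layout lets the engine skip files whose range cannot match.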
