### Loading Trained Model for Inference

The trained model is loaded from MLflow by run ID, so inference uses exactly the model artifact that was logged during training. For simplicity, the run ID is hard-coded here. In production, it would be replaced by a model registry lookup or an automated model promotion step.


In [0]:
import mlflow
import mlflow.sklearn

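# Fixed ID of the training run whose model is loaded
run_id = "bfb982dd5cfb4ebb98f345133c8f6865"

# "model" is the artifact path under which the model was logged in that run
model_uri = f"runs:/{run_id}/model"
model = mlflow.sklearn.load_model(model_uri)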
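
The registry-based alternative mentioned above could look roughly like the following. This is a minimal sketch, not the project's actual setup: `build_model_uri` is a hypothetical helper, and the registered model name `last_mile_risk_model` and the alias `champion` are assumptions. The helper only builds MLflow model URI strings (`runs:/...` for a run artifact, `models:/...` for a registry entry), which could then be passed to `mlflow.sklearn.load_model`.

```python
from typing import Optional


def build_model_uri(
    run_id: Optional[str] = None,
    registered_name: Optional[str] = None,
    version: Optional[int] = None,
    alias: Optional[str] = None,
) -> str:
    """Build an MLflow model URI from either a run ID or a registry entry."""
    if run_id is not None:
        # Same form as used in this notebook: artifact path "model" under the run
        return f"runs:/{run_id}/model"
    if registered_name is None:
        raise ValueError("Provide either run_id or registered_name")
    if alias is not None:
        return f"models:/{registered_name}@{alias}"
    if version is not None:
        return f"models:/{registered_name}/{version}"
    raise ValueError("A registry URI needs a version or an alias")


# Fixed run, as in this notebook
print(build_model_uri(run_id="bfb982dd5cfb4ebb98f345133c8f6865"))

# Hypothetical registry entries (model name and alias are assumptions)
print(build_model_uri(registered_name="last_mile_risk_model", version=3))
print(build_model_uri(registered_name="last_mile_risk_model", alias="champion"))
```

Pointing inference at an alias rather than a run ID would let a promotion step change which model version is served without editing the notebook.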


### Generating Risk Scores

The model outputs a probability score indicating the likelihood of a delivery being
high risk. This probabilistic output enables flexible decision-making instead of
binary rule-based alerts.


In [0]:
df_all = spark.table(
    "capstone_project.logistics.gold_ml_last_mile_features"
).toPandas()

# Feature columns passed to the model (assumed to match those used in training)
X_all = df_all[
    [
        "accept_to_pickup_minutes",
        "pickup_delay_minutes",
        "missing_pickup_gps"
    ]
]

# Probability of the positive (high-risk) class; column 1 assumes binary labels 0/1
df_all["risk_probability"] = model.predict_proba(X_all)[:, 1]

# Binary prediction at a 0.5 cut-off
df_all["predicted_high_risk"] = (df_all["risk_probability"] >= 0.5).astype(int)

df_all.head()
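
To illustrate why a probability is more flexible than a fixed binary alert, the sketch below counts how many deliveries would be flagged at different cut-offs. The scores are made-up illustrative values, not outputs of the trained model, and `flagged_count` is a hypothetical helper.

```python
from typing import Sequence


def flagged_count(probabilities: Sequence[float], threshold: float) -> int:
    """Count scores at or above the threshold (the same >= rule as the cell above)."""
    return sum(1 for p in probabilities if p >= threshold)


# Illustrative scores only; not real model output
scores = [0.05, 0.32, 0.48, 0.50, 0.61, 0.79, 0.80, 0.93]

# A lower cut-off flags more deliveries, a higher one flags fewer
for threshold in (0.3, 0.5, 0.8):
    print(threshold, flagged_count(scores, threshold))
```

The same scores support a cautious or an aggressive alerting policy; only the threshold changes, not the model.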


### Turning Predictions into Decisions

The model outputs the probability that a delivery is high risk.
To make this actionable, probabilities are mapped into business-friendly risk buckets:

- HIGH: ≥ 0.8 → Immediate intervention
- MEDIUM: ≥ 0.5 and < 0.8 → Monitor closely
- LOW: < 0.5 → No action required

This converts AI outputs into operational decisions.


In [0]:
import numpy as np

df_all["risk_bucket"] = np.where(
    df_all["risk_probability"] >= 0.8, "HIGH",
    np.where(df_all["risk_probability"] >= 0.5, "MEDIUM", "LOW")
)
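
The boundary behaviour of this mapping can be made explicit with a scalar version of the same rule. This is a minimal sketch: `risk_bucket` is a hypothetical helper that mirrors the nested `np.where` above, so a score of exactly 0.8 falls in HIGH and a score of exactly 0.5 falls in MEDIUM.

```python
def risk_bucket(probability: float) -> str:
    """Map a risk probability to a bucket using the thresholds above."""
    if probability >= 0.8:
        return "HIGH"
    if probability >= 0.5:
        return "MEDIUM"
    return "LOW"


# Boundary values land in the higher bucket because the comparisons use >=
for p in (0.49, 0.5, 0.79, 0.8):
    print(p, risk_bucket(p))
```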

### Persisting Predictions to Delta Lake

Predictions are written back to Delta Lake as a Gold analytics table, enabling
downstream dashboards and Genie-based insights.


In [0]:
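# Convert the scored pandas DataFrame back to a Spark DataFrame
predictions_spark_df = spark.createDataFrame(df_all)

# mode("overwrite") replaces the table contents on every run
predictions_spark_df.write \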
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable(
        "capstone_project.logistics.gold_last_mile_risk_predictions"
    )


Because this prediction table is queried heavily by dashboards and ad-hoc analytics,
I compact it and Z-order it on `city` and `risk_bucket`, the columns the queries below group by, to improve data skipping and reduce query latency.

In [0]:
%sql
OPTIMIZE capstone_project.logistics.gold_last_mile_risk_predictions
ZORDER BY (city, risk_bucket);


### Risk Distribution

In [0]:
df_all["risk_bucket"].value_counts()

### Average Pickup Delay by Risk Bucket

In [0]:
df_all.groupby("risk_bucket")["pickup_delay_minutes"].mean().round(2)

### Risk Distribution by City

In [0]:
%sql
SELECT
  city,
  risk_bucket,
  COUNT(*) AS deliveries
FROM capstone_project.logistics.gold_last_mile_risk_predictions
GROUP BY city, risk_bucket
ORDER BY city, risk_bucket;

### Cities Ranked by Average Risk Score

In [0]:
%sql
SELECT
  city,
  ROUND(AVG(risk_probability), 3) AS avg_risk_score,
  COUNT(*) AS total_deliveries
FROM capstone_project.logistics.gold_last_mile_risk_predictions
GROUP BY city
ORDER BY avg_risk_score DESC;
