# Supply Chain Management – Delivery & Quantity Risk Modeling

This notebook demonstrates the end-to-end analysis pipeline for identifying
delivery and quantity risks in procurement data.

It is designed as a **reproducible entry point** for the open-source project and covers:
- Data cleaning & anomaly labeling
- Exploratory analysis and visualization
- Feature engineering
- Machine learning models for delivery and quantity risk
- Model interpretability using SHAP

> Note: The original datasets are confidential and are not included in this repository.
> To run this notebook, place compatible semicolon-delimited CSV files at
> `data/raw/company_a.csv` and `data/raw/company_b.csv`.
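
Because the data is not shipped with the repository, it can help to check for the expected input files before running the pipeline. The snippet below is a minimal sketch: `missing_inputs` is a hypothetical helper (not part of `src/`), and it assumes the notebook runs from a direct subfolder of the repository root, with the file names used later in this notebook.

```python
from pathlib import Path


def missing_inputs(raw_dir, filenames=("company_a.csv", "company_b.csv")):
    """Return the expected input files that are not present in raw_dir."""
    raw_dir = Path(raw_dir)
    return [raw_dir / name for name in filenames if not (raw_dir / name).is_file()]


# Assumption: the notebook lives one level below the repository root
raw_dir = Path("..").resolve() / "data" / "raw"

missing = missing_inputs(raw_dir)
if missing:
    print("Missing input files:", ", ".join(str(p) for p in missing))
else:
    print("All expected input files found.")
```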

In [None]:
# Core
import os
import sys
from pathlib import Path

# Data & Visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Silence all warnings to keep notebook output compact (disable while debugging)
import warnings
warnings.filterwarnings("ignore")

In [None]:
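# Allow importing from src/ (assumes the notebook runs from a direct subfolder of the repo root)
repo_root = Path("..").resolve()
if str(repo_root) not in sys.path:
    sys.path.append(str(repo_root))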

In [None]:
from src.config import build_paths, ensure_dirs
from src.io import load_company_data
from src.cleaning import clean_data
from src.labeling import add_comment_column, add_labels
from src.features import (
    basic_feature_engineering,
    extended_feature_engineering_delivery,
    feature_engineering_quantity
)
from src.visualization import (
    compare_missing_zero_counts,
    plot_special_orders,
    delivery_delay_comparison,
    supplier_performance_analysis
)
from src.models import (
    filter_train_data_for_delivery,
    filter_train_data_for_quantity,
    evaluate_all_models,
    save_model
)
from src.explain import shap_summary_bar
from src.predict import predict_single_order

In [None]:
# Build standardized project paths
paths = build_paths(repo_root)
ensure_dirs(paths)

# Expected data locations (user must provide their own data)
A_FILE = paths.data_dir / "raw" / "company_a.csv"
B_FILE = paths.data_dir / "raw" / "company_b.csv"

print("Company A file:", A_FILE)
print("Company B file:", B_FILE)

In [None]:
# Load datasets
df_a, df_b = load_company_data(A_FILE, B_FILE, sep=";")

print("Company A shape:", df_a.shape)
print("Company B shape:", df_b.shape)

df_a.head()

In [None]:
# Cleaning
df_a = clean_data(df_a)
df_b = clean_data(df_b)

# Anomaly comments
df_a = add_comment_column(df_a)
df_b = add_comment_column(df_b)

# Labels
df_a = add_labels(df_a, delivery_threshold=1)
df_b = add_labels(df_b, delivery_threshold=1)

# Basic features
df_a = basic_feature_engineering(df_a)
df_b = basic_feature_engineering(df_b)

df_a[["delivery_status", "quantity_status"]].value_counts()
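
For readers without access to `src/labeling.py`, the following sketch shows one way a day-based threshold such as `delivery_threshold=1` could turn delivery dates into status labels. It is an illustration under stated assumptions, not the project's implementation: the column names `planned_date` and `actual_date`, the helper `label_delivery`, and the label values `early`, `on_time`, and `late` are all hypothetical.

```python
import pandas as pd


def label_delivery(df, delivery_threshold=1):
    """Label orders by how many days the actual delivery deviates from the plan."""
    out = df.copy()
    out["delay_days"] = (out["actual_date"] - out["planned_date"]).dt.days
    out["delivery_status"] = "on_time"
    # Deviations beyond the threshold (in either direction) get their own class
    out.loc[out["delay_days"] > delivery_threshold, "delivery_status"] = "late"
    out.loc[out["delay_days"] < -delivery_threshold, "delivery_status"] = "early"
    return out


# Hypothetical toy orders, all planned for the same day
orders = pd.DataFrame({
    "planned_date": pd.to_datetime(["2024-03-01", "2024-03-01", "2024-03-01"]),
    "actual_date": pd.to_datetime(["2024-03-02", "2024-03-05", "2024-02-25"]),
})

labeled = label_delivery(orders, delivery_threshold=1)
print(labeled[["delay_days", "delivery_status"]])
```

With a threshold of one day, an order arriving one day after the plan still counts as on time in this sketch; widening the threshold shifts more orders into that class.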

In [None]:
# Missing & zero analysis
compare_missing_zero_counts(df_a, df_b)

# Special orders
plot_special_orders(df_a, "A")
plot_special_orders(df_b, "B")

# Delivery delay distributions
delivery_delay_comparison(df_a, "A")
delivery_delay_comparison(df_b, "B")
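
As a rough idea of what a missing and zero comparison involves, the sketch below counts missing values and exact zeros per column for two small frames and places the results side by side. The helper `missing_zero_counts`, the toy data, and the `Delivered Quantity` column are assumptions for illustration; the actual `compare_missing_zero_counts` may compute or plot this differently.

```python
import numpy as np
import pandas as pd


def missing_zero_counts(df):
    """Count missing values and exact zeros per column."""
    return pd.DataFrame({"missing": df.isna().sum(), "zero": (df == 0).sum()})


# Hypothetical toy data for two companies
demo_a = pd.DataFrame({
    "Ordered Quantity": [10, 0, np.nan, 5],
    "Delivered Quantity": [10, 0, 0, np.nan],
})
demo_b = pd.DataFrame({
    "Ordered Quantity": [7, 7, 0],
    "Delivered Quantity": [7, np.nan, np.nan],
})

# Side-by-side table: one column group per company
comparison = pd.concat(
    {"A": missing_zero_counts(demo_a), "B": missing_zero_counts(demo_b)},
    axis=1,
)
print(comparison)
```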

In [None]:
supplier_stats_a = supplier_performance_analysis(df_a, "A")
supplier_stats_b = supplier_performance_analysis(df_b, "B")

In [None]:
# Extended features for delivery modeling
df_a_fe = extended_feature_engineering_delivery(df_a)
df_b_fe = extended_feature_engineering_delivery(df_b)

# Training sets
train_a_delivery = filter_train_data_for_delivery(df_a_fe)
train_b_delivery = filter_train_data_for_delivery(df_b_fe)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.linear_model import LogisticRegression

# Fixed seeds keep the stochastic models repeatable across runs
delivery_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "Random Forest": RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42),
    "LightGBM": LGBMClassifier(class_weight="balanced", random_state=42),
    "XGBoost": XGBClassifier(eval_metric="mlogloss", random_state=42),
    "CatBoost": CatBoostClassifier(verbose=0, random_seed=42)
}

features_delivery = [
    "Ordered Quantity", "planned_month", "planned_weekday",
    "is_weekend", "is_holiday", "supplier_freq",
    "supplier_delay_mean", "supplier_late_rate"
]

X_a = train_a_delivery[features_delivery].fillna(0)
y_a = train_a_delivery["delivery_status"]

best_a_delivery = evaluate_all_models(
    delivery_models, X_a, y_a,
    company="Company A",
    taskname="Delivery Status",
    stratified=False
)
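
The supplier-level features in `features_delivery` (`supplier_freq`, `supplier_delay_mean`, `supplier_late_rate`) are historical aggregates per supplier. The sketch below shows how such aggregates could be derived with a pandas `groupby`; the `Supplier` and `delay_days` columns, the toy data, and the one-day lateness cutoff are assumptions, and the project's own feature code may differ.

```python
import pandas as pd

# Hypothetical order history with a delay in days per order
orders = pd.DataFrame({
    "Supplier": ["S1", "S1", "S1", "S2", "S2"],
    "delay_days": [0, 3, 6, -1, 0],
})
orders["is_late"] = orders["delay_days"] > 1  # assumed lateness cutoff of one day

# One row per supplier: order count, mean delay, share of late orders
stats = orders.groupby("Supplier").agg(
    supplier_freq=("delay_days", "size"),
    supplier_delay_mean=("delay_days", "mean"),
    supplier_late_rate=("is_late", "mean"),
).reset_index()

# Attach the supplier aggregates back to each order
orders = orders.merge(stats, on="Supplier", how="left")
print(orders)
```

Aggregates computed over the full dataset include each order's own outcome, which can leak label information into the features. Computing them on the training portion only, or on orders that precede each prediction date, is a safer variant.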

In [None]:
# Quantity feature engineering
df_a_q = feature_engineering_quantity(df_a)
train_a_quantity = filter_train_data_for_quantity(df_a_q)

features_quantity = [
    "Ordered Quantity", "planned_month",
    "supplier_freq", "supplier_quantity_deviation_mean"
]

X_q = train_a_quantity[features_quantity].fillna(0)
y_q = train_a_quantity["quantity_status"]

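# Use fresh, unfitted copies so this run cannot refit the delivery estimators
# in place (best_a_delivery.model is reused for SHAP and prediction below)
from sklearn.base import clone
quantity_models = {name: clone(model) for name, model in delivery_models.items()}

best_a_quantity = evaluate_all_models(
    quantity_models, X_q, y_q,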
    company="Company A",
    taskname="Quantity Accuracy",
    stratified=True
)
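
The two evaluation calls differ in their `stratified` flag. How `evaluate_all_models` interprets it is not shown here; a common reading is a choice between plain and stratified k-fold cross-validation, where stratification keeps class proportions roughly equal in every fold. The sketch below illustrates that reading with scikit-learn on synthetic data. The helper `cv_scores`, the data, and the `f1_macro` metric are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Synthetic, imbalanced binary problem (hypothetical stand-in for the real data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 1.0).astype(int)


def cv_scores(model, X, y, stratified=True, n_splits=5):
    """Cross-validated macro F1 with either stratified or plain k-fold splits."""
    splitter_cls = StratifiedKFold if stratified else KFold
    cv = splitter_cls(n_splits=n_splits, shuffle=True, random_state=42)
    return cross_val_score(model, X, y, cv=cv, scoring="f1_macro")


model = LogisticRegression(max_iter=1000, class_weight="balanced")
scores = cv_scores(model, X, y, stratified=True)
print("Mean macro F1:", scores.mean())
```

For imbalanced labels, stratified splits reduce the chance that a fold contains very few examples of a minority class, which makes per-fold scores more stable.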

In [None]:
shap_summary_bar(
    best_a_delivery.model,
    X_a,
    feature_names=features_delivery,
    title="SHAP Feature Importance – Company A Delivery Status"
)
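
For intuition about what a SHAP summary bar plot aggregates, the sketch below uses the one case where SHAP values have a simple closed form: a linear model with features treated as independent, where the contribution of feature j to sample i is `w_j * (x_ij - mean_j)`. The weights and data are hypothetical, and this is not how `shap_summary_bar` is implemented; tree-based models require a dedicated explainer from the `shap` library.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# Hypothetical linear model f(x) = x @ weights + bias
weights = np.array([2.0, -1.0, 0.0])
bias = 0.5


def predict(X):
    return X @ weights + bias


# Closed-form SHAP values for a linear model: phi[i, j] = w_j * (x_ij - mean_j)
phi = weights * (X - X.mean(axis=0))

# Local accuracy: contributions sum to the prediction minus the average prediction
assert np.allclose(phi.sum(axis=1), predict(X) - predict(X).mean())

# A summary bar plot ranks features by mean absolute SHAP value
importance = np.abs(phi).mean(axis=0)
print(dict(zip(["f0", "f1", "f2"], importance.round(3))))
```

The third feature has zero weight, so its mean absolute contribution is zero and it would appear at the bottom of the bar plot.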

In [None]:
example_order = X_a.iloc[0].to_dict()

predict_single_order(
    best_a_delivery.model,
    example_order,
    features_delivery
)
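
Scoring a single order typically means turning a dictionary into a one-row frame whose columns follow the training feature order, then calling the model. The sketch below shows that pattern with a small logistic regression; `predict_order`, the toy data, and the label values are hypothetical and do not reflect the internals of `predict_single_order`.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

feature_names = ["Ordered Quantity", "supplier_late_rate"]

# Hypothetical training data, used only to obtain a fitted model
train = pd.DataFrame({
    "Ordered Quantity": [10, 200, 15, 180, 12, 220],
    "supplier_late_rate": [0.1, 0.8, 0.2, 0.7, 0.1, 0.9],
})
labels = ["on_time", "late", "on_time", "late", "on_time", "late"]
model = LogisticRegression(max_iter=1000).fit(train[feature_names], labels)


def predict_order(model, order, feature_names):
    """Predict the class and class probabilities for one order given as a dict."""
    # Selecting by feature_names enforces the column order used in training
    row = pd.DataFrame([order])[feature_names]
    probs = model.predict_proba(row)[0]
    return {
        "prediction": model.predict(row)[0],
        "probabilities": dict(zip(model.classes_, probs)),
    }


# Key order in the dict does not matter; the helper reorders the columns
result = predict_order(
    model,
    {"supplier_late_rate": 0.85, "Ordered Quantity": 210},
    feature_names,
)
print(result)
```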

## Summary & Next Steps

This notebook demonstrates a complete, modular, and reproducible workflow for
supply chain risk analysis.

**Key takeaways:**
- Supplier historical performance is a strong predictor of delivery risk
- Extreme dates and invalid orders must be handled explicitly
- Tree-based models (Random Forest, LightGBM, XGBoost, CatBoost) outperform the logistic regression baseline
- SHAP improves transparency and trust in model outputs

**Potential extensions:**
- Regression models for delivery delay magnitude
- Inventory simulation based on predicted risks
- Cost-sensitive learning and service-level optimization