# 05 - Model Evaluation & Comparison

**Objective**: Evaluate and compare the two trained models (Logistic Regression and Random Forest) on the held-out test set.

- ROC Curves
- Confusion Matrices
- Metric Comparison (accuracy, precision, recall, F1, AUC-ROC)
- Feature Importance Analysis
- Algorithm Complexity Analysis (Big-O)
- Final Summary & Recommendations

**Input**: Predictions & metrics from `04_model_training.ipynb`  
**Output**: Evaluation plots saved to `outputs/`

In [4]:
%pip install scikit-learn


In [None]:
# ============================================================
# CELL 1: Imports & Spark Session
# ============================================================
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend: figures are only written to disk
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (
    roc_curve, auc, confusion_matrix,
    classification_report, precision_recall_curve
)
import json, os

spark = SparkSession.builder \
    .appName("ModelEvaluation") \
    .master("local[*]") \
    .config("spark.sql.shuffle.partitions", "8") \
    .config("spark.driver.memory", "4g") \
    .getOrCreate()

DATA_DIR = r'F:\SOFTWARICA\big-data-transport-analytics\data\processed'
MODEL_DIR = os.path.join(DATA_DIR, 'models')
OUTPUT_DIR = r'F:\SOFTWARICA\big-data-transport-analytics\outputs'
os.makedirs(OUTPUT_DIR, exist_ok=True)

plt.rcParams.update({'font.size': 11, 'figure.dpi': 120})
sns.set_style('whitegrid')
print("Spark session & plotting ready!")

Spark session & plotting ready!


In [None]:
# ============================================================
# CELL 2: Load Predictions & Metrics
# ============================================================

# Load metrics from training
with open(os.path.join(MODEL_DIR, 'model_metrics.json'), 'r') as f:
    metrics = json.load(f)

with open(os.path.join(MODEL_DIR, 'feature_metadata.json'), 'r') as f:
    metadata = json.load(f)

TARGET = metadata['target']

# Load predictions (saved as CSV from notebook 04)
lr_pdf = pd.read_csv(os.path.join(DATA_DIR, 'lr_predictions.csv'))
rf_pdf = pd.read_csv(os.path.join(DATA_DIR, 'rf_predictions.csv'))

print(f"Logistic Regression predictions: {len(lr_pdf):,} rows")
print(f"Random Forest predictions:        {len(rf_pdf):,} rows")

print(f"\nMetrics loaded:")
for model, m in metrics.items():
    print(f"  {model}: accuracy={m['accuracy']:.4f}, AUC={m['auc_roc']:.4f}")

Logistic Regression predictions: 6,803 rows
Random Forest predictions:        6,803 rows

Metrics loaded:
  logistic_regression: accuracy=0.8540, AUC=0.8906
  random_forest: accuracy=0.8584, AUC=0.8795


In [7]:
# ============================================================
# CELL 3: ROC Curves - Both Models
# ============================================================

fig, ax = plt.subplots(figsize=(8, 7))

# Logistic Regression ROC
lr_fpr, lr_tpr, _ = roc_curve(lr_pdf[TARGET], lr_pdf['prob_1'])
lr_auc = auc(lr_fpr, lr_tpr)
ax.plot(lr_fpr, lr_tpr, 'b-', linewidth=2,
        label=f'Logistic Regression (AUC = {lr_auc:.4f})')

# Random Forest ROC
rf_fpr, rf_tpr, _ = roc_curve(rf_pdf[TARGET], rf_pdf['prob_1'])
rf_auc = auc(rf_fpr, rf_tpr)
ax.plot(rf_fpr, rf_tpr, 'r-', linewidth=2,
        label=f'Random Forest (AUC = {rf_auc:.4f})')

# Diagonal (random)
ax.plot([0, 1], [0, 1], 'k--', alpha=0.4, label='Random (AUC = 0.5)')

ax.set_xlabel('False Positive Rate', fontsize=12)
ax.set_ylabel('True Positive Rate', fontsize=12)
ax.set_title('ROC Curves — Model Comparison', fontweight='bold', fontsize=14)
ax.legend(loc='lower right', fontsize=11)
ax.set_xlim([-0.02, 1.02])
ax.set_ylim([-0.02, 1.02])

plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, 'roc_curves.png'), dpi=150, bbox_inches='tight')
plt.show()
print("Saved: outputs/roc_curves.png")

Saved: outputs/roc_curves.png
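
The AUC values in the legend come from `sklearn.metrics.auc`, which integrates the ROC curve with the trapezoidal rule. An equivalent reading is that AUC is the probability that a randomly chosen positive row receives a higher `prob_1` than a randomly chosen negative row, with ties counted as half. The following is a minimal pure-Python sketch of that interpretation; `pairwise_auc` and the toy data are illustrative names and values, not part of the notebook.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. The cost is O(n_pos * n_neg), so this
    is only meant for small illustrative inputs.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p_score in pos:
        for n_score in neg:
            if p_score > n_score:
                wins += 1.0
            elif p_score == n_score:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and scores (hypothetical, not taken from the prediction CSVs).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(f"AUC = {toy_auc:.2f}")
```

Under this reading, the diagonal reference line (AUC = 0.5) corresponds to a scorer that ranks positive and negative rows no better than chance.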




In [8]:
# ============================================================
# CELL 4: Precision-Recall Curves
# ============================================================

fig, ax = plt.subplots(figsize=(8, 7))

lr_prec, lr_rec, _ = precision_recall_curve(lr_pdf[TARGET], lr_pdf['prob_1'])
rf_prec, rf_rec, _ = precision_recall_curve(rf_pdf[TARGET], rf_pdf['prob_1'])

ax.plot(lr_rec, lr_prec, 'b-', linewidth=2, label='Logistic Regression')
ax.plot(rf_rec, rf_prec, 'r-', linewidth=2, label='Random Forest')

# Baseline (proportion of positive class)
pos_rate = lr_pdf[TARGET].mean()
ax.axhline(y=pos_rate, color='k', linestyle='--', alpha=0.4,
           label=f'Baseline ({pos_rate:.2f})')

ax.set_xlabel('Recall', fontsize=12)
ax.set_ylabel('Precision', fontsize=12)
ax.set_title('Precision-Recall Curves — Model Comparison',
             fontweight='bold', fontsize=14)
ax.legend(loc='upper right', fontsize=11)

plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, 'precision_recall_curves.png'),
            dpi=150, bbox_inches='tight')
plt.show()
print("Saved: outputs/precision_recall_curves.png")

Saved: outputs/precision_recall_curves.png
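
Each point on a precision-recall curve corresponds to one cut-off on `prob_1`: rows at or above the cut-off are flagged as High Risk. The sketch below computes precision and recall by hand at a few thresholds, along with the baseline used for the dashed line (the positive-class share). The helper name and the toy scores are assumptions made for illustration only.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when rows with score >= threshold are flagged positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # convention when nothing is flagged
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels and probabilities, not taken from the prediction CSVs.
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]
toy_scores = [0.05, 0.20, 0.30, 0.35, 0.55, 0.60, 0.80, 0.95]

baseline = sum(toy_labels) / len(toy_labels)  # precision of flagging every row
for t in (0.25, 0.50, 0.75):
    prec, rec = precision_recall_at(toy_labels, toy_scores, t)
    print(f"threshold={t:.2f}  precision={prec:.2f}  recall={rec:.2f}")
print(f"baseline precision = {baseline:.2f}")
```

Raising the threshold trades recall for precision, which is the trade-off the two curves trace out for each model.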




In [9]:
# ============================================================
# CELL 5: Confusion Matrices - Side by Side
# ============================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

for ax, pdf, title, color in [
    (axes[0], lr_pdf, 'Logistic Regression', 'Blues'),
    (axes[1], rf_pdf, 'Random Forest', 'Reds'),
]:
    cm = confusion_matrix(pdf[TARGET], pdf['prediction'])
    sns.heatmap(cm, annot=True, fmt='d', cmap=color, ax=ax,
                xticklabels=['Low Risk (0)', 'High Risk (1)'],
                yticklabels=['Low Risk (0)', 'High Risk (1)'],
                annot_kws={'size': 14})
    ax.set_xlabel('Predicted', fontsize=12)
    ax.set_ylabel('Actual', fontsize=12)
    ax.set_title(f'{title}\nConfusion Matrix', fontweight='bold', fontsize=13)

plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, 'confusion_matrices.png'),
            dpi=150, bbox_inches='tight')
plt.show()
print("Saved: outputs/confusion_matrices.png")

Saved: outputs/confusion_matrices.png
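
`confusion_matrix` puts actual classes on the rows and predicted classes on the columns, so for labels 0/1 the heatmap cells read `[[TN, FP], [FN, TP]]`. A small hand-rolled version makes the layout explicit; `binary_confusion` and the toy labels are hypothetical and only meant as a sketch.

```python
def binary_confusion(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]]: rows are actual classes, columns are predicted."""
    cm = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        # int() also accepts labels stored as floats such as 1.0
        cm[int(actual)][int(predicted)] += 1
    return cm


# Hypothetical labels and predictions for illustration only.
toy_actual = [0, 0, 0, 0, 1, 1, 1, 0]
toy_pred = [0, 0, 1, 0, 1, 0, 1, 0]

cm = binary_confusion(toy_actual, toy_pred)
(tn, fp), (fn, tp) = cm
print(cm)
print(f"missed high-risk rows (FN) = {fn}, false alarms (FP) = {fp}")
```

For a risk model, the bottom-left cell (FN) counts high-risk rows that were predicted as low risk, which is usually the costlier error.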




In [10]:
# ============================================================
# CELL 6: Classification Reports (sklearn)
# ============================================================

print("=" * 65)
print("LOGISTIC REGRESSION - Detailed Classification Report")
print("=" * 65)
print(classification_report(
    lr_pdf[TARGET], lr_pdf['prediction'],
    target_names=['Low Risk (0)', 'High Risk (1)']
))

print("=" * 65)
print("RANDOM FOREST - Detailed Classification Report")
print("=" * 65)
print(classification_report(
    rf_pdf[TARGET], rf_pdf['prediction'],
    target_names=['Low Risk (0)', 'High Risk (1)']
))

LOGISTIC REGRESSION - Detailed Classification Report
               precision    recall  f1-score   support

 Low Risk (0)       0.84      0.99      0.91      5224
High Risk (1)       0.94      0.40      0.56      1579

     accuracy                           0.85      6803
    macro avg       0.89      0.69      0.73      6803
 weighted avg       0.87      0.85      0.83      6803

RANDOM FOREST - Detailed Classification Report
               precision    recall  f1-score   support

 Low Risk (0)       0.85      0.99      0.91      5224
High Risk (1)       0.92      0.43      0.58      1579

     accuracy                           0.86      6803
    macro avg       0.88      0.71      0.75      6803
 weighted avg       0.87      0.86      0.84      6803
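
The `macro avg` and `weighted avg` rows differ only in how the two per-class scores are combined. The arithmetic below reworks the Logistic Regression recall rows from the per-class values printed above. Those inputs are already rounded to two decimals, so the results should agree with the report only approximately.

```python
# Per-class recall and support copied from the Logistic Regression report above.
# The inputs are rounded to two decimals, so the results are approximate.
recall = {"Low Risk (0)": 0.99, "High Risk (1)": 0.40}
support = {"Low Risk (0)": 5224, "High Risk (1)": 1579}

total = sum(support.values())

# Macro average: every class counts equally, regardless of size.
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: each class counts in proportion to its support.
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(f"macro recall    ~ {macro_recall:.3f}")
print(f"weighted recall ~ {weighted_recall:.3f}")
```

The gap between the two averages reflects class imbalance: the majority Low Risk class dominates the weighted figure and masks the much lower High Risk recall.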



In [11]:
# ============================================================
# CELL 7: Metric Comparison Bar Chart
# ============================================================

compare_metrics = ['accuracy', 'f1', 'precision', 'recall', 'auc_roc']
lr_vals = [metrics['logistic_regression'][m] for m in compare_metrics]
rf_vals = [metrics['random_forest'][m] for m in compare_metrics]

x = np.arange(len(compare_metrics))
width = 0.35

fig, ax = plt.subplots(figsize=(10, 6))
bars1 = ax.bar(x - width/2, lr_vals, width, label='Logistic Regression',
               color='#3498db', alpha=0.85)
bars2 = ax.bar(x + width/2, rf_vals, width, label='Random Forest',
               color='#e74c3c', alpha=0.85)

# Add value labels
for bars in [bars1, bars2]:
    for bar in bars:
        h = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2, h + 0.005,
                f'{h:.4f}', ha='center', va='bottom', fontsize=9)

ax.set_ylabel('Score', fontsize=12)
ax.set_title('Model Performance Comparison', fontweight='bold', fontsize=14)
ax.set_xticks(x)
ax.set_xticklabels([m.replace('_', ' ').title() for m in compare_metrics])
ax.legend(fontsize=11)
ax.set_ylim(0, 1.12)
ax.axhline(y=0.5, color='gray', linestyle=':', alpha=0.5, label='_')

plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, 'model_comparison.png'),
            dpi=150, bbox_inches='tight')
plt.show()
print("Saved: outputs/model_comparison.png")

Saved: outputs/model_comparison.png




In [12]:
# ============================================================
# CELL 8: Training Time Comparison
# ============================================================

fig, ax = plt.subplots(figsize=(7, 5))

models = ['Logistic Regression', 'Random Forest']
times = [
    metrics['logistic_regression']['train_time'],
    metrics['random_forest']['train_time']
]
colors = ['#3498db', '#e74c3c']

bars = ax.bar(models, times, color=colors, alpha=0.85, width=0.5)
for bar, t in zip(bars, times):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.05,
            f'{t:.2f}s', ha='center', va='bottom', fontsize=12, fontweight='bold')

ax.set_ylabel('Training Time (seconds)', fontsize=12)
ax.set_title('Training Time Comparison', fontweight='bold', fontsize=14)

plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, 'training_time.png'),
            dpi=150, bbox_inches='tight')
plt.show()
print("Saved: outputs/training_time.png")

Saved: outputs/training_time.png




In [13]:
# ============================================================
# CELL 9: Algorithm Complexity Analysis (Big-O)
# ============================================================

import math

with open(os.path.join(DATA_DIR, 'feature_metadata.json'), 'r') as f:
    meta = json.load(f)

n_train = meta['train_count']
n_test = meta['test_count']
p = meta['feature_vector_size']

print("=" * 70)
print("ALGORITHM COMPLEXITY ANALYSIS")
print("=" * 70)

print(f"\nDataset dimensions:")
print(f"  Training samples (n): {n_train:,}")
print(f"  Test samples:         {n_test:,}")
print(f"  Feature dimensions (p): {p}")

print("\n" + "-" * 70)
print("MODEL 1: LOGISTIC REGRESSION")
print("-" * 70)
lr_iters = 100  # iteration count used in the estimate
print(f"  Training complexity:   O(n * p * iterations)")
print(f"                       = O({n_train:,} * {p} * {lr_iters})")
print(f"                       ≈ O({n_train * p * lr_iters:,.0f}) operations")
print(f"  Prediction complexity: O(p) per sample")
print(f"                       = O({p}) per sample")
print(f"                       = O({p} * {n_test:,}) = O({p * n_test:,}) for full test set")
print(f"  Space complexity:      O(p) for model coefficients")
print(f"  Actual training time:  {metrics['logistic_regression']['train_time']:.2f}s")

print("\n" + "-" * 70)
print("MODEL 2: RANDOM FOREST")
print("-" * 70)
T = 100  # number of trees in the forest
d = 10   # maximum tree depth
log_n = math.log2(n_train)
print(f"  Training complexity:   O(T * n * p * log(n))")
print(f"                       = O({T} * {n_train:,} * {p} * {log_n:.1f})")
print(f"                       ≈ O({T * n_train * p * log_n:,.0f}) operations")
print(f"  Prediction complexity: O(T * depth) per sample")
print(f"                       = O({T} * {d}) = O({T * d}) per sample")
print(f"                       = O({T * d} * {n_test:,}) = O({T * d * n_test:,}) for full test set")
print(f"  Space complexity:      O(T * 2^depth) = O({T} * {2**d}) = O({T * 2**d:,}) leaf nodes max")
print(f"  Actual training time:  {metrics['random_forest']['train_time']:.2f}s")

print("\n" + "-" * 70)
print("SCALABILITY COMPARISON")
print("-" * 70)
ratio = (T * n_train * p * log_n) / (n_train * p * lr_iters)
print(f"  RF/LR training complexity ratio: {ratio:.1f}x")
time_ratio = metrics['random_forest']['train_time'] / max(metrics['logistic_regression']['train_time'], 0.01)
print(f"  Actual training time ratio:      {time_ratio:.1f}x")
print(f"\n  Logistic Regression scales linearly with data size.")
print(f"  Random Forest scales quasi-linearly (n * log(n)) per tree,")
print(f"  but the constant factor is {T} trees.")
print(f"\n  For real-time prediction:")
print(f"    LR:  {p} multiplications + sigmoid  → sub-millisecond")
print(f"    RF:  {T} tree traversals (depth ≤ {d})  → low milliseconds")

ALGORITHM COMPLEXITY ANALYSIS

Dataset dimensions:
  Training samples (n): 28,119
  Test samples:         6,803
  Feature dimensions (p): 35

----------------------------------------------------------------------
MODEL 1: LOGISTIC REGRESSION
----------------------------------------------------------------------
  Training complexity:   O(n * p * iterations)
                       = O(28,119 * 35 * 100)
                       ≈ O(98,416,500) operations
  Prediction complexity: O(p) per sample
                       = O(35) per sample
                       = O(35 * 6,803) = O(238,105) for full test set
  Space complexity:      O(p) for model coefficients
  Actual training time:  35.41s

----------------------------------------------------------------------
MODEL 2: RANDOM FOREST
----------------------------------------------------------------------
  Training complexity:   O(T * n * p * log(n))
                       = O(100 * 28,119 * 35 * 14.8)
                       ≈ O(1,454,522,812
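
The operation counts above are order-of-magnitude estimates: they ignore constant factors, Spark scheduling overhead, and parallelism across cores. As a rough sanity check, the sketch below recomputes the estimated RF/LR ratio from the dataset dimensions and compares it with the ratio of the measured training times reported in the comparison table. The variable names are illustrative.

```python
import math

# Values printed by the notebook: dataset size and measured training times.
n_train, p = 28_119, 35
lr_iters = 100   # iteration count used in the LR estimate
n_trees = 100    # T in the RF estimate
lr_seconds, rf_seconds = 35.410227, 52.244971

lr_ops = n_train * p * lr_iters
rf_ops = n_trees * n_train * p * math.log2(n_train)

# The n * p terms cancel, leaving T * log2(n) / iterations.
estimated_ratio = rf_ops / lr_ops
measured_ratio = rf_seconds / lr_seconds

print(f"estimated RF/LR operation ratio: {estimated_ratio:.1f}x")
print(f"measured  RF/LR time ratio:      {measured_ratio:.1f}x")
```

The estimated ratio comes out about ten times larger than the measured one, so the Big-O figures are better read as scaling behaviour than as a prediction of wall-clock time on this dataset.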

In [14]:
# ============================================================
# CELL 10: Save Final Comparison CSV & Export Predictions
# ============================================================

# Model comparison table
comparison_df = pd.DataFrame({
    'Metric': ['Accuracy', 'F1 Score', 'Precision (weighted)',
               'Recall (weighted)', 'AUC-ROC', 'Training Time (s)'],
    'Logistic Regression': [
        metrics['logistic_regression']['accuracy'],
        metrics['logistic_regression']['f1'],
        metrics['logistic_regression']['precision'],
        metrics['logistic_regression']['recall'],
        metrics['logistic_regression']['auc_roc'],
        metrics['logistic_regression']['train_time'],
    ],
    'Random Forest': [
        metrics['random_forest']['accuracy'],
        metrics['random_forest']['f1'],
        metrics['random_forest']['precision'],
        metrics['random_forest']['recall'],
        metrics['random_forest']['auc_roc'],
        metrics['random_forest']['train_time'],
    ],
})
comparison_df.to_csv(os.path.join(OUTPUT_DIR, 'model_comparison.csv'), index=False)
print(comparison_df.to_string(index=False))

# Save RF predictions for further analysis
rf_pdf.to_csv(os.path.join(OUTPUT_DIR, 'rf_predictions.csv'), index=False)

print(f"\nSaved: outputs/model_comparison.csv")
print(f"Saved: outputs/rf_predictions.csv")

              Metric  Logistic Regression  Random Forest
            Accuracy             0.854035       0.858445
            F1 Score             0.830140       0.838027
Precision (weighted)             0.867036       0.866612
   Recall (weighted)             0.854035       0.858445
             AUC-ROC             0.890608       0.879546
   Training Time (s)            35.410227      52.244971

Saved: outputs/model_comparison.csv
Saved: outputs/rf_predictions.csv
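
In the table, `Recall (weighted)` matches `Accuracy` to six decimals for both models. That is expected if the stored recall is the support-weighted average, as the row label suggests: weighting each class's recall by its share of rows gives the total number of correct predictions divided by the number of rows, which is the accuracy. A small check on a made-up confusion matrix:

```python
# Hypothetical confusion matrix, rows = actual class, columns = predicted class.
cm = [
    [90, 10],   # actual 0: 90 correct, 10 predicted as 1
    [30, 20],   # actual 1: 30 predicted as 0, 20 correct
]

n_total = sum(sum(row) for row in cm)
supports = [sum(row) for row in cm]
per_class_recall = [cm[c][c] / supports[c] for c in range(len(cm))]

accuracy = sum(cm[c][c] for c in range(len(cm))) / n_total
weighted_recall = sum(
    r * s / n_total for r, s in zip(per_class_recall, supports)
)

print(f"per-class recall = {per_class_recall}")
print(f"accuracy         = {accuracy:.4f}")
print(f"weighted recall  = {weighted_recall:.4f}")
```

The weighted recall row therefore adds no information beyond accuracy; the per-class recall in the classification reports is the more informative figure for the High Risk class.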


In [None]:
# ============================================================
# CELL 11: Final Project Summary
# ============================================================

# Select the headline model by AUC-ROC (a tie goes to Random Forest)
rf_wins = metrics['random_forest']['auc_roc'] >= metrics['logistic_regression']['auc_roc']
best_model = 'Random Forest' if rf_wins else 'Logistic Regression'
best_metrics = metrics['random_forest'] if rf_wins else metrics['logistic_regression']

print("="* 70)
print("FINAL PROJECT SUMMARY")
print("Disruption Impact Risk Prediction for Ensign Bus")
print("="* 70)

print(f"\n1. DATA PIPELINE")
print(f"   - Source: BODS TransXChange (170 XML files) + SIRI-SX (428 disruptions)")
print(f"   - Parsed: {n_train + n_test:,} timetable records with temporal disruption features")
print(f"   - Merge Strategy: Temporal (daily disruption metrics joined by date)")
print(f"   - Feature Engineering: {p} features after encoding & scaling")

print(f"\n2. MODELLING")
print(f"   - Task: Binary classification (high_disruption_risk: 0/1)")
print(f"   - Threshold: active_disruptions > {meta['threshold']}")
print(f"   - Train/Test Split: 80/20 (seed=42)")
print(f"   - Model 1: Logistic Regression (baseline)")
print(f"   - Model 2: Random Forest Classifier (100 trees, depth 10)")

print(f"\n3. RESULTS")
print(f"   - Best Model: {best_model}")
print(f"   - Accuracy:   {best_metrics['accuracy']:.4f}")
print(f"   - F1 Score:   {best_metrics['f1']:.4f}")
print(f"   - AUC-ROC:    {best_metrics['auc_roc']:.4f}")
print(f"   - Precision:  {best_metrics['precision']:.4f}")
print(f"   - Recall:     {best_metrics['recall']:.4f}")

print(f"\n4. OUTPUT FILES")
print(f"   data/processed/")
print(f"     - ensign_timetable_with_disruptions.csv (merged dataset)")
print(f"     - cleaned_dataset.csv (cleaned)")
print(f"     - featured_dataset.csv (with target & features)")
print(f"     - models/")
print(f"         - lr_model_params.pkl & rf_model_params.pkl (model parameters)")
print(f"         - feature_metadata.json (pipeline metadata)")
print(f"         - model_metrics.json (all metrics)")
print(f"   outputs/")
print(f"     - roc_curves.png")
print(f"     - precision_recall_curves.png")
print(f"     - confusion_matrices.png")
print(f"     - model_comparison.png")
print(f"     - feature_importance.png")
print(f"     - training_time.png")
print(f"     - model_comparison.csv")
print(f"     - rf_predictions.csv")

print(f"\n5. NOTEBOOKS")
print(f"   01_data_ingestion.ipynb    → XML parsing, merge, CSV export")
print(f"   02_data_cleaning.ipynb     → Nulls, dupes, types, column selection")
print(f"   03_feature_engineering.ipynb → Target, EDA, PySpark pipeline")
print(f"   04_model_training.ipynb    → LR + RF training, feature importance")
print(f"   05_evaluation.ipynb        → ROC, confusion matrices, comparison")

print(f"\n" + "=" * 70)
print(f"Project complete.")
print(f"=" * 70)

FINAL PROJECT SUMMARY
Disruption Impact Risk Prediction for Ensign Bus

1. DATA PIPELINE
   - Source: BODS TransXChange (170 XML files) + SIRI-SX (428 disruptions)
   - Parsed: 34,922 timetable records with temporal disruption features
   - Merge Strategy: Temporal (daily disruption metrics joined by date)
   - Feature Engineering: 35 features after encoding & scaling

2. MODELLING
   - Task: Binary classification (high_disruption_risk: 0/1)
   - Threshold: active_disruptions > 107.0
   - Train/Test Split: 80/20 (seed=42)
   - Model 1: Logistic Regression (baseline)
   - Model 2: Random Forest Classifier (100 trees, depth 10)

3. RESULTS
   - Best Model: Logistic Regression
   - Accuracy:   0.8540
   - F1 Score:   0.8301
   - AUC-ROC:    0.8906
   - Precision:  0.8670
   - Recall:     0.8540

4. OUTPUT FILES
   data/processed/
     - ensign_timetable_with_disruptions.csv (merged dataset)
     - cleaned_dataset.csv (cleaned)
     - featured_dataset.csv (with target & features)
     - lr
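
Cell 11 selects the best model on AUC-ROC alone. The tally below, using the values from the comparison table printed by Cell 10, is a sketch of how the ranking changes with the metric. The dictionary layout and variable names are illustrative.

```python
# Values copied from the comparison table printed by Cell 10.
scores = {
    "accuracy":  {"Logistic Regression": 0.854035, "Random Forest": 0.858445},
    "f1":        {"Logistic Regression": 0.830140, "Random Forest": 0.838027},
    "precision": {"Logistic Regression": 0.867036, "Random Forest": 0.866612},
    "recall":    {"Logistic Regression": 0.854035, "Random Forest": 0.858445},
    "auc_roc":   {"Logistic Regression": 0.890608, "Random Forest": 0.879546},
}

# For each metric, pick the model with the higher score.
winners = {
    metric: max(by_model, key=by_model.get)
    for metric, by_model in scores.items()
}

for metric, model in winners.items():
    margin = abs(
        scores[metric]["Random Forest"] - scores[metric]["Logistic Regression"]
    )
    print(f"{metric:<10} {model:<20} (margin {margin:.4f})")
```

The margins are small on every metric, so the choice between the two models may reasonably depend on other factors such as training time or interpretability.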

In [16]:
# Stop Spark
spark.stop()
print("Spark session stopped.")

Spark session stopped.
