# 04 - Model Training

**Objective**: Train two classification models to predict disruption impact risk.

- **Model 1 (Baseline)**: Logistic Regression — simple, interpretable, linear
- **Model 2 (Main)**: Random Forest Classifier — non-linear, ensemble, robust

**Input**: `data/processed/train_split.csv`, `data/processed/test_split.csv`, `data/processed/models/feature_metadata.json`  
**Output**: Predictions saved to `data/processed/`; model parameters and metrics saved to `data/processed/models/`

In [None]:
# ============================================================
# CELL 1: Imports & Spark Session
# ============================================================
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
import pandas as pd
import numpy as np
import json
import time
import os

spark = SparkSession.builder \
    .appName("ModelTraining") \
    .master("local[*]") \
    .config("spark.sql.shuffle.partitions", "8") \
    .config("spark.driver.memory", "4g") \
    .getOrCreate()

DATA_DIR = r'F:\SOFTWARICA\big-data-transport-analytics\data\processed'
MODEL_DIR = os.path.join(DATA_DIR, 'models')
OUTPUT_DIR = r'F:\SOFTWARICA\big-data-transport-analytics\outputs'
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(MODEL_DIR, exist_ok=True)

print("Spark session ready!")

Spark session ready!


In [None]:
# ============================================================
# CELL 2: Load Data, Rebuild Pipeline & Prepare Features
# ============================================================

# Load metadata from feature engineering
with open(os.path.join(MODEL_DIR, 'feature_metadata.json'), 'r') as f:
    metadata = json.load(f)

TARGET = metadata['target']
THRESHOLD = metadata['threshold']
NUM_FEATURES = metadata['num_features']
CAT_FEATURES = metadata['cat_features']
ALL_FEATURES = CAT_FEATURES + NUM_FEATURES

print(f"Target: {TARGET}")
print(f"Threshold: {THRESHOLD}")
print(f"Features: {len(NUM_FEATURES)} numeric + {len(CAT_FEATURES)} categorical")

# Load train/test CSVs into Spark
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
from pyspark.ml import Pipeline

train_pdf = pd.read_csv(os.path.join(DATA_DIR, 'train_split.csv'))
test_pdf = pd.read_csv(os.path.join(DATA_DIR, 'test_split.csv'))

train_df = spark.createDataFrame(train_pdf[ALL_FEATURES + [TARGET]])
test_df = spark.createDataFrame(test_pdf[ALL_FEATURES + [TARGET]])

print(f"\nTrain set: {train_df.count():,} rows")
print(f"Test set:  {test_df.count():,} rows")

# Rebuild preprocessing pipeline (same as notebook 03)
indexers = [
    StringIndexer(inputCol=col, outputCol=f'{col}_idx', handleInvalid='keep')
    for col in CAT_FEATURES
]
encoders = [
    OneHotEncoder(inputCol=f'{col}_idx', outputCol=f'{col}_vec')
    for col in CAT_FEATURES
]
encoded_cols = [f'{col}_vec' for col in CAT_FEATURES]
assembler = VectorAssembler(
    inputCols=NUM_FEATURES + encoded_cols,
    outputCol='features_raw'
)
scaler = StandardScaler(inputCol='features_raw', outputCol='features', withStd=True, withMean=False)

pipeline = Pipeline(stages=indexers + encoders + [assembler, scaler])

# Fit on training data, transform both
pipeline_model = pipeline.fit(train_df)
train_prepared = pipeline_model.transform(train_df)
test_prepared = pipeline_model.transform(test_df)

FEATURE_SIZE = train_prepared.select('features').first()[0].size
print(f"Feature vector size: {FEATURE_SIZE}")

print(f"\nTrain class balance:")
train_prepared.groupBy(TARGET).count().orderBy(TARGET).show()

print(f"Test class balance:")
test_prepared.groupBy(TARGET).count().orderBy(TARGET).show()

Target: high_disruption_risk
Threshold: 107.0
Features: 11 numeric + 4 categorical

Train set: 28,119 rows
Test set:  6,803 rows
Feature vector size: 35

Train class balance:
+--------------------+-----+
|high_disruption_risk|count|
+--------------------+-----+
|                   0|21550|
|                   1| 6569|
+--------------------+-----+

Test class balance:
+--------------------+-----+
|high_disruption_risk|count|
+--------------------+-----+
|                   0| 5224|
|                   1| 1579|
+--------------------+-----+
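
The class counts above show a roughly 3:1 imbalance, so accuracy needs a reference point. The sketch below, in plain Python with the test-set counts copied from the table, computes the accuracy of a trivial classifier that always predicts the majority class. The function names are illustrative and not part of the notebook.

```python
# Test-set class counts copied from the class-balance table above.
test_counts = {0: 5224, 1: 1579}


def majority_baseline_accuracy(counts):
    """Accuracy of a classifier that always predicts the most frequent class."""
    total = sum(counts.values())
    return max(counts.values()) / total


def positive_rate(counts, positive=1):
    """Share of rows that belong to the positive class."""
    return counts[positive] / sum(counts.values())


baseline = majority_baseline_accuracy(test_counts)
pos_share = positive_rate(test_counts)

print(f"Majority-class baseline accuracy: {baseline:.4f}")
print(f"Positive-class share:             {pos_share:.4f}")
```

The baseline is about 0.768, so any test accuracy reported for the trained models should be read against that figure rather than against 0.5.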



---
## Model 1: Logistic Regression (Baseline)

**Why?** Simple, fast, and interpretable. This linear model is the baseline to beat: if Random Forest cannot outperform it, the data may not contain non-linear patterns worth capturing.

**Complexity**: Training O(n * p * iterations), prediction O(p) per sample, where n is the number of training rows and p is the number of features.
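
To make the O(p) prediction cost concrete, here is a minimal NumPy sketch of how a fitted binary logistic model scores one sample: a single dot product followed by a sigmoid. The coefficient, intercept, and feature values are hypothetical placeholders, not values from the trained model.

```python
import numpy as np


def predict_proba(x, coefficients, intercept):
    """Return P(y = 1 | x) for a binary logistic model: sigmoid(w . x + b)."""
    z = float(np.dot(coefficients, x)) + intercept  # one pass over p features
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical weights and inputs, chosen only for illustration.
w = np.array([0.8, -0.5, 0.1])
b = -0.2
x = np.array([1.0, 2.0, 0.0])

p = predict_proba(x, w, b)
label = int(p >= 0.5)  # same 0.5 decision threshold used in this notebook

print(f"P(class 1) = {p:.4f}, predicted label = {label}")
```

The same arithmetic applies to a scaled feature vector of any length; only the size of the dot product changes.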

In [3]:
# ============================================================
# CELL 3: Train Logistic Regression
# ============================================================

print("=" * 60)
print("MODEL 1: LOGISTIC REGRESSION (Baseline)")
print("=" * 60)

lr = LogisticRegression(
    featuresCol='features',
    labelCol=TARGET,
    maxIter=100,
    regParam=0.01,         # L2 regularization
    elasticNetParam=0.0,   # Pure L2 (Ridge)
    threshold=0.5,
)

print("Hyperparameters:")
print(f"  maxIter:         {lr.getMaxIter()}")
print(f"  regParam (L2):   {lr.getRegParam()}")
print(f"  elasticNet:      {lr.getElasticNetParam()} (0.0 = pure L2)")
print(f"  threshold:       {lr.getThreshold()}")

# Train
start_time = time.time()
lr_model = lr.fit(train_prepared)
lr_train_time = time.time() - start_time

print(f"\nTraining completed in {lr_train_time:.2f} seconds")

# Training summary
lr_summary = lr_model.summary
print(f"\n--- Training Metrics ---")
print(f"  Training Accuracy: {lr_summary.accuracy:.4f}")
print(f"  Training AUC-ROC:  {lr_summary.areaUnderROC:.4f}")

MODEL 1: LOGISTIC REGRESSION (Baseline)
Hyperparameters:
  maxIter:         100
  regParam (L2):   0.01
  elasticNet:      0.0 (0.0 = pure L2)
  threshold:       0.5

Training completed in 35.41 seconds

--- Training Metrics ---
  Training Accuracy: 0.8507
  Training AUC-ROC:  0.8866


In [4]:
# ============================================================
# CELL 4: Logistic Regression - Test Predictions
# ============================================================

# Predict on test set
lr_predictions = lr_model.transform(test_prepared)

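# Evaluate with multiple metrics.
# Note: weightedPrecision / weightedRecall average the per-class values by class
# support, so weightedRecall always equals accuracy. They are not positive-class
# metrics. For class 1 only, Spark 3.0+ offers metricName='recallByLabel' and
# 'precisionByLabel' together with metricLabel=1.0.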
binary_eval = BinaryClassificationEvaluator(labelCol=TARGET, metricName='areaUnderROC')
acc_eval = MulticlassClassificationEvaluator(labelCol=TARGET, metricName='accuracy')
f1_eval = MulticlassClassificationEvaluator(labelCol=TARGET, metricName='f1')
prec_eval = MulticlassClassificationEvaluator(labelCol=TARGET, metricName='weightedPrecision')
rec_eval = MulticlassClassificationEvaluator(labelCol=TARGET, metricName='weightedRecall')

lr_metrics = {
    'accuracy': acc_eval.evaluate(lr_predictions),
    'f1': f1_eval.evaluate(lr_predictions),
    'precision': prec_eval.evaluate(lr_predictions),
    'recall': rec_eval.evaluate(lr_predictions),
    'auc_roc': binary_eval.evaluate(lr_predictions),
    'train_time': lr_train_time,
}

print("=" * 60)
print("LOGISTIC REGRESSION - Test Set Results")
print("=" * 60)
for metric, value in lr_metrics.items():
    print(f"  {metric:20s}: {value:.4f}")

print(f"\nConfusion Matrix (Test):")
lr_predictions.groupBy(TARGET, 'prediction').count().orderBy(TARGET, 'prediction').show()

LOGISTIC REGRESSION - Test Set Results
  accuracy            : 0.8540
  f1                  : 0.8301
  precision           : 0.8670
  recall              : 0.8540
  auc_roc             : 0.8906
  train_time          : 35.4102

Confusion Matrix (Test):
+--------------------+----------+-----+
|high_disruption_risk|prediction|count|
+--------------------+----------+-----+
|                   0|       0.0| 5185|
|                   0|       1.0|   39|
|                   1|       0.0|  954|
|                   1|       1.0|  625|
+--------------------+----------+-----+



---
## Model 2: Random Forest Classifier (Main Model)

**Why?** Handles non-linear relationships and mixed feature types, and is robust to outliers. It is an ensemble of 100 decision trees whose individual predictions are combined by voting.

**Complexity**: Training O(T * n * p * log(n)), prediction O(T * depth) per sample, where T is the number of trees, n the number of training rows, and p the number of features.
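
The sketch below illustrates two common ways an ensemble can combine per-tree outputs for one sample: a hard majority vote over each tree's predicted class, and a soft vote that averages the per-tree class probabilities. It is a toy NumPy illustration under assumed tree outputs, not Spark's internal code; which rule a given library applies should be checked in its documentation.

```python
import numpy as np

# Hypothetical outputs of three trees for ONE sample. Each row is a tree's
# [P(class 0), P(class 1)] estimate from the leaf the sample falls into.
tree_probs = np.array([
    [0.90, 0.10],
    [0.45, 0.55],
    [0.40, 0.60],
])


def hard_vote(probs):
    """Each tree casts one vote for its most likely class; the majority wins."""
    votes = probs.argmax(axis=1)
    return int(np.bincount(votes, minlength=probs.shape[1]).argmax())


def soft_vote(probs):
    """Average the class probabilities across trees, then pick the largest."""
    return int(probs.mean(axis=0).argmax())


hard = hard_vote(tree_probs)
soft = soft_vote(tree_probs)

print(f"hard vote -> class {hard}")
print(f"soft vote -> class {soft}")
```

In this example the two rules disagree: two of three trees lean toward class 1, but the one confident tree pulls the averaged probability toward class 0. Either way, scoring a sample visits one root-to-leaf path per tree, which is where the O(T * depth) prediction cost comes from.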

In [5]:
# ============================================================
# CELL 5: Train Random Forest
# ============================================================

print("=" * 60)
print("MODEL 2: RANDOM FOREST CLASSIFIER (Main Model)")
print("=" * 60)

rf = RandomForestClassifier(
    featuresCol='features',
    labelCol=TARGET,
    numTrees=100,
    maxDepth=10,
    minInstancesPerNode=5,
    featureSubsetStrategy='sqrt',
    seed=42,
)

print("Hyperparameters:")
print(f"  numTrees:              {rf.getNumTrees()}")
print(f"  maxDepth:              {rf.getMaxDepth()}")
print(f"  minInstancesPerNode:   {rf.getMinInstancesPerNode()}")
print(f"  featureSubsetStrategy: {rf.getFeatureSubsetStrategy()}")
print(f"  seed:                  {rf.getSeed()}")

# Train
start_time = time.time()
rf_model = rf.fit(train_prepared)
rf_train_time = time.time() - start_time

print(f"\nTraining completed in {rf_train_time:.2f} seconds")
print(f"Number of trees: {rf_model.getNumTrees}")

MODEL 2: RANDOM FOREST CLASSIFIER (Main Model)
Hyperparameters:
  numTrees:              100
  maxDepth:              10
  minInstancesPerNode:   5
  featureSubsetStrategy: sqrt
  seed:                  42

Training completed in 52.24 seconds
Number of trees: 100


In [6]:
# ============================================================
# CELL 6: Random Forest - Test Predictions
# ============================================================

# Predict on test set
rf_predictions = rf_model.transform(test_prepared)

rf_metrics = {
    'accuracy': acc_eval.evaluate(rf_predictions),
    'f1': f1_eval.evaluate(rf_predictions),
    'precision': prec_eval.evaluate(rf_predictions),
    'recall': rec_eval.evaluate(rf_predictions),
    'auc_roc': binary_eval.evaluate(rf_predictions),
    'train_time': rf_train_time,
}

print("=" * 60)
print("RANDOM FOREST - Test Set Results")
print("=" * 60)
for metric, value in rf_metrics.items():
    print(f"  {metric:20s}: {value:.4f}")

print(f"\nConfusion Matrix (Test):")
rf_predictions.groupBy(TARGET, 'prediction').count().orderBy(TARGET, 'prediction').show()

RANDOM FOREST - Test Set Results
  accuracy            : 0.8584
  f1                  : 0.8380
  precision           : 0.8666
  recall              : 0.8584
  auc_roc             : 0.8795
  train_time          : 52.2450

Confusion Matrix (Test):
+--------------------+----------+-----+
|high_disruption_risk|prediction|count|
+--------------------+----------+-----+
|                   0|       0.0| 5163|
|                   0|       1.0|   61|
|                   1|       0.0|  902|
|                   1|       1.0|  677|
+--------------------+----------+-----+
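
The `precision` and `recall` values printed above are support-weighted averages over both classes. As a complement, the following plain-Python sketch derives positive-class (class 1) precision, recall, and F1 directly from the two test confusion matrices shown in this notebook. The helper name `positive_class_metrics` is illustrative and not part of the original code.

```python
def positive_class_metrics(tn, fp, fn, tp):
    """Precision, recall, F1 for the positive class, plus overall accuracy."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Counts copied from the test confusion matrices printed in this notebook.
confusion = {
    "Logistic Regression": {"tn": 5185, "fp": 39, "fn": 954, "tp": 625},
    "Random Forest":       {"tn": 5163, "fp": 61, "fn": 902, "tp": 677},
}

results = {name: positive_class_metrics(**c) for name, c in confusion.items()}

for name, m in results.items():
    print(f"{name:20s} precision={m['precision']:.4f} "
          f"recall={m['recall']:.4f} f1={m['f1']:.4f} "
          f"accuracy={m['accuracy']:.4f}")
```

With these counts, positive-class recall is close to 0.40 for Logistic Regression and 0.43 for Random Forest. Both models miss more than half of the high-risk cases at the 0.5 threshold, which the weighted recall of about 0.85 does not reveal.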



In [7]:
# ============================================================
# CELL 7: Feature Importance (Random Forest)
# ============================================================
import matplotlib
matplotlib.use('Agg')  # select the non-interactive backend before importing pyplot
import matplotlib.pyplot as plt

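# Build feature names in same order as VectorAssembler
# First: numeric features
feature_names = metadata['num_features'].copy()

# Then: one-hot encoded categorical features, assuming N-1 slots per column.
# Caveat: with StringIndexer(handleInvalid='keep'), the slot that OneHotEncoder
# drops may be the extra "unknown" bucket, in which case every real label keeps
# its own slot. Check the length comparison below before trusting the names.
cat_labels = metadata['cat_labels']
for col in metadata['cat_features']:
    labels = cat_labels[col]
    for label in labels[:-1]:  # N-1 (last is treated as the reference category)
        feature_names.append(f'{col}_{label}')

importances = rf_model.featureImportances.toArray()

# A length mismatch means the names above are misaligned with the vector slots
if len(feature_names) != len(importances):
    print(f"WARNING: {len(feature_names)} names vs {len(importances)} importances")
n = min(len(feature_names), len(importances))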
feat_imp = sorted(zip(feature_names[:n], importances[:n]),
                  key=lambda x: x[1], reverse=True)

print("RANDOM FOREST - Feature Importance Ranking")
print("=" * 55)
for i, (name, imp) in enumerate(feat_imp, 1):
    bar = '█' * int(imp * 100)
    print(f"  {i:2d}. {name:25s} {imp:.4f} {bar}")

# Plot top features
fig, ax = plt.subplots(figsize=(10, 7))
top_n = min(15, len(feat_imp))
names = [x[0] for x in feat_imp[:top_n]]
vals = [x[1] for x in feat_imp[:top_n]]
colors = ['#e74c3c' if v > 0.05 else '#3498db' for v in vals]

ax.barh(range(len(names)), vals, color=colors)
ax.set_yticks(range(len(names)))
ax.set_yticklabels(names)
ax.invert_yaxis()
ax.set_xlabel('Feature Importance')
ax.set_title('Random Forest - Top Feature Importances', fontweight='bold', fontsize=13)
for i, v in enumerate(vals):
    ax.text(v + 0.002, i, f'{v:.4f}', va='center', fontsize=9)

plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, 'feature_importance.png'), dpi=150, bbox_inches='tight')
plt.close(fig)  # Agg is non-interactive; the figure is already saved to disk
print("\nSaved: outputs/feature_importance.png")

RANDOM FOREST - Feature Importance Ranking
   1. day_of_week_num           0.4445 ████████████████████████████████████████████
   2. is_weekend                0.3644 ████████████████████████████████████
   3. line_name_88              0.0469 ████
   4. line_name_x32             0.0398 ███
   5. latitude                  0.0118 █
   6. stop_sequence             0.0103 █
   7. line_name_22              0.0101 █
   8. departure_hour            0.0099 
   9. longitude                 0.0069 
  10. hour_cos                  0.0065 
  11. line_name_33              0.0056 
  12. hour_sin                  0.0053 
  13. route_complexity          0.0051 
  14. line_name_99OT            0.0036 
  15. run_time_min              0.0034 
  16. line_name_73              0.0032 
  17. time_of_day_midday        0.0032 
  18. direction_inbound         0.0030 
  19. line_name_x80             0.0015 
  20. line_name_83              0.0015 
  21. lat_zone_mid_north        0.0012 
  22. line_name_x2         



In [None]:
# ============================================================
# CELL 8: Quick Comparison & Save Model Outputs
# ============================================================

print("=" * 70)
print("MODEL COMPARISON SUMMARY")
print("=" * 70)
print(f"{'Metric':<22} {'Logistic Reg':>14} {'Random Forest':>14} {'Winner':>16}")
print("-" * 70)
for metric in ['accuracy', 'f1', 'precision', 'recall', 'auc_roc', 'train_time']:
    lr_val = lr_metrics[metric]
    rf_val = rf_metrics[metric]
    if metric == 'train_time':
        winner = 'LR (faster)' if lr_val < rf_val else 'RF (faster)'
    else:
        winner = 'Random Forest' if rf_val > lr_val else ('Logistic Reg' if lr_val > rf_val else 'Tie')
    print(f"  {metric:<20} {lr_val:>14.4f} {rf_val:>14.4f} {winner:>16}")

# --- Save predictions as CSV via pandas (avoids Hadoop/winutils) ---
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

get_prob = udf(lambda v: float(v[1]), DoubleType())

# Extract label, prediction, probability(class=1) to pandas and save
lr_out = lr_predictions.withColumn('prob_1', get_prob('probability')) \
    .select(TARGET, 'prediction', 'prob_1').toPandas()
lr_out.to_csv(os.path.join(DATA_DIR, 'lr_predictions.csv'), index=False)
print(f"\nSaved: data/processed/lr_predictions.csv ({len(lr_out):,} rows)")

rf_out = rf_predictions.withColumn('prob_1', get_prob('probability')) \
    .select(TARGET, 'prediction', 'prob_1').toPandas()
rf_out.to_csv(os.path.join(DATA_DIR, 'rf_predictions.csv'), index=False)
print(f"Saved: data/processed/rf_predictions.csv ({len(rf_out):,} rows)")

# --- Save trained models (for GUI reuse) ---
import pickle

# Try saving full PySpark models (may fail on Windows without Hadoop)
print(f"\nAttempting to save PySpark models...")
try:
    lr_model.write().overwrite().save(os.path.join(MODEL_DIR, 'lr_model_spark'))
    rf_model.write().overwrite().save(os.path.join(MODEL_DIR, 'rf_model_spark'))
    print(f"  ✓ Saved PySpark native models to data/processed/models/")
except Exception as e:
    print(f"  ✗ PySpark model save failed (expected on Windows): {type(e).__name__}")
    print(f"    Falling back to parameter export...")

# Extract and save model parameters (always works, GUI-friendly)
print(f"\nExporting model parameters for GUI...")

# Logistic Regression parameters
lr_params = {
    'type': 'LogisticRegression',
    'coefficients': lr_model.coefficients.toArray().tolist(),
    'intercept': float(lr_model.intercept),
    'num_features': lr_model.numFeatures,
    'num_classes': lr_model.numClasses,
    'feature_names': metadata['num_features'] + \
                     [f'{col}_{lbl}' for col in metadata['cat_features'] 
                      for lbl in metadata['cat_labels'][col][:-1]],
}
with open(os.path.join(MODEL_DIR, 'lr_model_params.pkl'), 'wb') as f:
    pickle.dump(lr_params, f)
print(f"  ✓ Saved: data/processed/models/lr_model_params.pkl")

# Random Forest parameters + feature importances
rf_params = {
    'type': 'RandomForestClassifier',
    'num_trees': rf_model.getNumTrees,
    'feature_importances': rf_model.featureImportances.toArray().tolist(),
    'num_features': rf_model.numFeatures,
    'feature_names': lr_params['feature_names'],  # same as LR
    'tree_weights': rf_model.treeWeights,
    # Note: Full tree structures not exported (use PySpark model for inference)
}
with open(os.path.join(MODEL_DIR, 'rf_model_params.pkl'), 'wb') as f:
    pickle.dump(rf_params, f)
print(f"  ✓ Saved: data/processed/models/rf_model_params.pkl")

print(f"\n  For GUI inference:")
print(f"    - Load params with: pickle.load(open('rf_model_params.pkl', 'rb'))")
print(f"    - Or reload full PySpark model if available")
print(f"    - Pipeline metadata in: feature_metadata.json")

# Save metrics JSON
all_metrics = {'logistic_regression': lr_metrics, 'random_forest': rf_metrics}
with open(os.path.join(MODEL_DIR, 'model_metrics.json'), 'w') as f:
    json.dump(all_metrics, f, indent=2)
print(f"\nSaved: data/processed/models/model_metrics.json")

print(f"\n--- Training Complete ---")
print(f"  Ready for 05_evaluation.ipynb")

MODEL COMPARISON SUMMARY
Metric                   Logistic Reg  Random Forest           Winner
----------------------------------------------------------------------
  accuracy                     0.8540         0.8584    Random Forest
  f1                           0.8301         0.8380    Random Forest
  precision                    0.8670         0.8666     Logistic Reg
  recall                       0.8540         0.8584    Random Forest
  auc_roc                      0.8906         0.8795     Logistic Reg
  train_time                  35.4102        52.2450      LR (faster)

Saved: data/processed/lr_predictions.csv (6,803 rows)
Saved: data/processed/rf_predictions.csv (6,803 rows)

Attempting to save PySpark models...
  ✗ PySpark model save failed (expected on Windows): Py4JJavaError
    Falling back to parameter export...

Exporting model parameters for GUI...
  ✓ Saved: data/processed/lr_model_params.pkl
  ✓ Saved: data/processed/rf_model_params.pkl

  For GUI inference:
    - L