# 🤖 Model Training for Bangkok Traffic Congestion Index Prediction

**Phase 3: Modeling, Analysis, and Evaluation**

## Overview
This notebook trains three regression models to predict the daily Traffic Congestion Index (TCI):
- **Random Forest**: bagged ensemble of decision trees
- **XGBoost**: gradient-boosted decision trees
- **Linear Regression**: simple baseline for comparison

**Target Metrics:** RMSE < 15, MAE < 10, R² > 0.70

---

**Author:** Data Science Team  
**Date:** November 2025  
**Project:** Bangkok Traffic Congestion Index Prediction (CPE312 Capstone)
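
For reference, the three target metrics named in the overview can be computed directly with NumPy. This is a minimal sketch on made-up numbers, not the project's evaluation code; `regression_metrics` is a hypothetical helper name.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R^2 for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = float(np.sqrt(np.mean(residuals ** 2)))   # root mean squared error
    mae = float(np.mean(np.abs(residuals)))          # mean absolute error
    ss_res = float(np.sum(residuals ** 2))           # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"RMSE": rmse, "MAE": mae, "R2": r2}

# Toy values chosen only to make the arithmetic easy to follow.
toy_metrics = regression_metrics([1, 2, 3, 4], [1, 2, 3, 6])
print(toy_metrics)
```

RMSE and MAE are expressed in the units of the target, while R² is unitless and compares the model against always predicting the mean.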

In [11]:
# Setup and Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
import pickle
import warnings
warnings.filterwarnings('ignore')

# Import ML libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import xgboost as xgb

# Import custom scripts
import sys
sys.path.insert(0, '../05_Scripts/')
from modeling import (
    temporal_train_test_split,
    train_xgboost_model,
    train_random_forest_model
)
from model_utils import save_model, set_random_seeds

# Set random seeds for reproducibility
set_random_seeds(42)

print("✅ All imports successful!")
print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")
print(f"XGBoost: {xgb.__version__}")

INFO:model_utils:Random seeds set to 42


✅ All imports successful!
NumPy: 2.3.4
Pandas: 2.3.3
XGBoost: 3.1.2


## 1. Load Engineered Features

In [12]:
# Define paths
DATA_PATH = Path('../02_Data/Processed/')
MODEL_PATH = Path('../02_Model_Development/Trained_Models/')
MODEL_PATH.mkdir(parents=True, exist_ok=True)

# Load engineered features (from Notebook 04)
df = pd.read_csv(DATA_PATH / 'features_engineered.csv')
print(f"Dataset shape: {df.shape}")

# Define target
target_col = 'congestion_index'

# Select only numeric columns for features (exclude date and target)
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
feature_cols = [col for col in numeric_cols if col != target_col]

# Fill any remaining NaN feature values with 0
df[feature_cols] = df[feature_cols].fillna(0)

print(f"\nTarget: {target_col}")
print(f"Features: {len(feature_cols)} numeric columns")
print(f"Sample features: {feature_cols[:5]}...")

Dataset shape: (351, 37)

Target: congestion_index
Features: 33 numeric columns
Sample features: ['traffic_volume', 'average_speed', 'year', 'month', 'day']...
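
Because `fillna(0)` replaces missing values silently, it can help to count them first. The sketch below repeats the column-selection logic from this cell on a small made-up frame (the values are illustrative, not taken from `features_engineered.csv`) and reports which feature columns contain NaN before filling.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; the real data comes from features_engineered.csv.
toy = pd.DataFrame({
    "date": ["2025-01-01", "2025-01-02", "2025-01-03"],
    "traffic_volume": [1200.0, 1350.0, 1100.0],
    "average_speed": [32.5, np.nan, 40.1],
    "congestion_index": [61.0, 70.5, 48.2],
})
target_col = "congestion_index"

# Same selection rule as the notebook: numeric columns minus the target.
numeric_cols = toy.select_dtypes(include=[np.number]).columns.tolist()
feature_cols = [c for c in numeric_cols if c != target_col]

# Count missing values per feature before they are overwritten.
nan_counts = toy[feature_cols].isna().sum()
print(nan_counts[nan_counts > 0])

filled = toy[feature_cols].fillna(0)
```

The string `date` column drops out automatically because `select_dtypes` keeps only numeric dtypes.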


## 2. Prepare Train/Validation/Test Split

In [13]:
# Temporal train/val/test split (60/20/20)
train_df, val_df, test_df = temporal_train_test_split(
    df, 
    date_col='date' if 'date' in df.columns else df.columns[0],
    train_ratio=0.6,
    val_ratio=0.2,
    test_ratio=0.2
)

# Extract features and targets
X_train = train_df[feature_cols].values
y_train = train_df[target_col].values

X_val = val_df[feature_cols].values
y_val = val_df[target_col].values

X_test = test_df[feature_cols].values
y_test = test_df[target_col].values

print(f"Training set: {X_train.shape}")
print(f"Validation set: {X_val.shape}")
print(f"Test set: {X_test.shape}")

INFO:modeling:Split sizes - Train: 210, Val: 70, Test: 71


Training set: (210, 33)
Validation set: (70, 33)
Test set: (71, 33)
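
`temporal_train_test_split` lives in `05_Scripts/modeling.py` and is not shown in this notebook. The sketch below is one plausible way such a chronological split could work, assuming rows are sorted by date and the boundaries are truncated to whole rows; `temporal_split_sketch` and the toy dates are hypothetical. Under those assumptions it reproduces the 210/70/71 sizes logged above for 351 rows.

```python
import pandas as pd

def temporal_split_sketch(df, date_col, train_ratio=0.6, val_ratio=0.2):
    """Split chronologically: oldest rows train, newest rows test (no shuffling)."""
    ordered = df.sort_values(date_col).reset_index(drop=True)
    n = len(ordered)
    train_end = int(n * train_ratio)            # truncate to a whole row count
    val_end = train_end + int(n * val_ratio)
    train = ordered.iloc[:train_end]
    val = ordered.iloc[train_end:val_end]
    test = ordered.iloc[val_end:]               # remainder goes to the test set
    return train, val, test

# Toy frame with 351 daily rows; dates and values are placeholders.
toy = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=351, freq="D"),
    "congestion_index": range(351),
})
train, val, test = temporal_split_sketch(toy, "date")
print(len(train), len(val), len(test))  # 210 70 71
```

Keeping the split in date order means validation and test rows always come after the training rows, which avoids evaluating on days that precede the training data.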


## 3. Train XGBoost Model

In [14]:
# XGBoost configuration
xgb_config = {
    'n_estimators': 100,
    'max_depth': 6,
    'learning_rate': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'random_state': 42
}

# Train XGBoost
print("Training XGBoost...")
xgb_model, xgb_info = train_xgboost_model(
    X_train, y_train,
    X_val, y_val,
    config=xgb_config,
    save_path=str(MODEL_PATH / 'xgboost_model.pkl')
)

print("✅ XGBoost training complete!")

INFO:modeling:Starting XGBoost training...


Training XGBoost...
[0]	validation_0-rmse:14.03240
[1]	validation_0-rmse:13.02203
[2]	validation_0-rmse:12.06843
[3]	validation_0-rmse:11.39006
[4]	validation_0-rmse:10.62415
[5]	validation_0-rmse:9.97655
[6]	validation_0-rmse:9.27904
[7]	validation_0-rmse:8.63545
[8]	validation_0-rmse:8.11920
[9]	validation_0-rmse:7.70414
[10]	validation_0-rmse:7.31193
[11]	validation_0-rmse:7.09016
[12]	validation_0-rmse:6.62506
[13]	validation_0-rmse:6.31771
[14]	validation_0-rmse:6.24532
[15]	validation_0-rmse:6.00407
[16]	validation_0-rmse:5.75827
[17]	validation_0-rmse:5.64827

INFO:modeling:Model saved to ../02_Model_Development/Trained_Models/xgboost_model.pkl


✅ XGBoost training complete!
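
To summarise an evaluation log like the one above without rerunning training, the lines can be parsed with a regular expression. This is a small sketch using three lines copied from the log; the captured output here stops at round 17, so the result only describes the visible portion.

```python
import re

# A few lines copied from the training log above (round index, tab, metric).
log_text = "\n".join([
    "[0]\tvalidation_0-rmse:14.03240",
    "[9]\tvalidation_0-rmse:7.70414",
    "[17]\tvalidation_0-rmse:5.64827",
])

# Capture the boosting round and the validation RMSE from each line.
pattern = re.compile(r"\[(\d+)\]\s+validation_0-rmse:([0-9.]+)")
history = [(int(m.group(1)), float(m.group(2))) for m in pattern.finditer(log_text)]

# Pick the round with the lowest validation RMSE among the parsed lines.
best_round, best_rmse = min(history, key=lambda pair: pair[1])
print(f"Lowest validation RMSE so far: {best_rmse:.5f} at round {best_round}")
```

In the visible log the validation RMSE is still falling at the last shown round, so the minimum sits at the end of the parsed history.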


## 4. Train Random Forest Model

In [15]:
# Random Forest configuration
rf_config = {
    'n_estimators': 100,
    'max_depth': 15,
    'min_samples_split': 5,
    'min_samples_leaf': 2,
    'random_state': 42,
    'n_jobs': -1
}

# Train Random Forest
print("Training Random Forest...")
rf_model, rf_info = train_random_forest_model(
    X_train, y_train,
    config=rf_config,
    save_path=str(MODEL_PATH / 'random_forest_model.pkl')
)

print("✅ Random Forest training complete!")

INFO:modeling:Starting Random Forest training...
INFO:modeling:Model saved to ../02_Model_Development/Trained_Models/random_forest_model.pkl


Training Random Forest...
✅ Random Forest training complete!


## 5. Train Linear Regression Model (Baseline)

In [16]:
# Train Linear Regression as baseline
print("Training Linear Regression...")
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Save model
lr_path = MODEL_PATH / 'linear_regression_model.pkl'
save_model(lr_model, str(lr_path))

# Quick evaluation
lr_pred = lr_model.predict(X_val)
lr_r2 = r2_score(y_val, lr_pred)
print(f"Linear Regression Validation R²: {lr_r2:.4f}")
print(f"✅ Linear Regression saved to: {lr_path.name}")

INFO:model_utils:Model saved to ../02_Model_Development/Trained_Models/linear_regression_model.pkl


Training Linear Regression...
Linear Regression Validation R²: 0.9706
✅ Linear Regression saved to: linear_regression_model.pkl


## 6. Quick Model Comparison on Validation Set

In [17]:
# Quick comparison on validation set
def evaluate_model(model, X, y_true, name):
    y_pred = model.predict(X)
    return {
        'Model': name,
        'RMSE': np.sqrt(mean_squared_error(y_true, y_pred)),
        'MAE': mean_absolute_error(y_true, y_pred),
        'R²': r2_score(y_true, y_pred)
    }

# Evaluate all models
results = []
results.append(evaluate_model(xgb_model, X_val, y_val, 'XGBoost'))
results.append(evaluate_model(rf_model, X_val, y_val, 'Random Forest'))
results.append(evaluate_model(lr_model, X_val, y_val, 'Linear Regression'))

# Display results
results_df = pd.DataFrame(results)
results_df = results_df.sort_values('R²', ascending=False)
print("\n📊 VALIDATION SET RESULTS")
print("=" * 50)
print(results_df.to_string(index=False))
print("=" * 50)

# Check against targets
best_model = results_df.iloc[0]
print(f"\n🏆 Best Model: {best_model['Model']}")
print(f"   RMSE: {best_model['RMSE']:.2f} (Target: < 15)")
print(f"   MAE: {best_model['MAE']:.2f} (Target: < 10)")
print(f"   R²: {best_model['R²']:.4f} (Target: > 0.70)")


📊 VALIDATION SET RESULTS
            Model     RMSE      MAE       R²
Linear Regression 0.915320 0.853895 0.970627
          XGBoost 3.271337 2.795112 0.624811
    Random Forest 5.090495 4.077810 0.091512

🏆 Best Model: Linear Regression
   RMSE: 0.92 (Target: < 15)
   MAE: 0.85 (Target: < 10)
   R²: 0.9706 (Target: > 0.70)
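
The RMSE and R² columns are linked by the identity R² = 1 − MSE / Var(y), so each row of the table implies a spread for the validation target. The sketch below works that arithmetic through using the values reported above; it is a consistency check on the printed numbers, not part of the original pipeline.

```python
import math

# (RMSE, R^2) pairs copied from the validation table above.
reported = {
    "Linear Regression": (0.915320, 0.970627),
    "XGBoost": (3.271337, 0.624811),
    "Random Forest": (5.090495, 0.091512),
}

# R^2 = 1 - MSE / Var(y)  =>  Var(y) = MSE / (1 - R^2), with MSE = RMSE^2.
implied_std = {
    name: math.sqrt(rmse ** 2 / (1.0 - r2))
    for name, (rmse, r2) in reported.items()
}
for name, std in implied_std.items():
    print(f"{name}: implied validation std of target = {std:.2f}")
```

All three rows imply roughly the same standard deviation (about 5.3), as expected when the models share one validation set. Read that way, a constant prediction at the validation mean would already score an RMSE near 5.3, so on this split the R² threshold is the most demanding of the three targets.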


## 7. Training Summary

In [18]:
# Training summary
print("=" * 60)
print("MODEL TRAINING SUMMARY")
print("=" * 60)
print(f"\nModels trained and saved to: {MODEL_PATH}")
print("\nTrained Models:")
print("  ✅ XGBoost - xgboost_model.pkl")
print("  ✅ Random Forest - random_forest_model.pkl")
print("  ✅ Linear Regression - linear_regression_model.pkl")

print("\n" + "=" * 60)
print("MODEL TRAINING COMPLETE!")
print("=" * 60)
print("\nNext Step: Run 06_Model_Evaluation.ipynb for detailed evaluation")

MODEL TRAINING SUMMARY

Models trained and saved to: ../02_Model_Development/Trained_Models

Trained Models:
  ✅ XGBoost - xgboost_model.pkl
  ✅ Random Forest - random_forest_model.pkl
  ✅ Linear Regression - linear_regression_model.pkl

MODEL TRAINING COMPLETE!

Next Step: Run 06_Model_Evaluation.ipynb for detailed evaluation
