# 🤖 Model Training for Traffic Flow Optimization

**Phase 3: Modeling, Analysis, and Evaluation**

## Overview
This notebook trains multiple models for traffic congestion prediction:
- **LSTM**: Long Short-Term Memory for sequential patterns (optional, requires TensorFlow)
- **XGBoost**: Gradient boosting for tabular features
- **ARIMA**: Time-series baseline
- **Random Forest**: Ensemble tree model

---

**Author:** Data Science Team  
**Date:** November 2025  
**Project:** Bangkok Traffic Flow Optimization (CPE312 Capstone)

In [12]:
# Setup and Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
import pickle
import warnings
warnings.filterwarnings('ignore')

# Import ML libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import xgboost as xgb

# Import custom scripts
import sys
sys.path.insert(0, '../05_Scripts/')
from modeling import (
    create_sequences,
    temporal_train_test_split,
    train_xgboost_model,
    train_random_forest_model,
    train_arima_model
)
from model_utils import save_model, set_random_seeds

# Set random seeds for reproducibility
set_random_seeds(42)

print("✅ All imports successful!")
print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")
print(f"XGBoost: {xgb.__version__}")

INFO:model_utils:Random seeds set to 42


✅ All imports successful!
NumPy: 2.3.4
Pandas: 2.3.3
XGBoost: 3.1.2
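
`set_random_seeds` comes from the project's `model_utils` module, whose source is not shown in this notebook. As a rough sketch of what such a helper commonly does (the name `seed_everything` and its body are assumptions, not the project's code), seeding Python's `random` module and, when it is installed, NumPy makes repeated runs draw the same pseudo-random numbers:

```python
import os
import random


def seed_everything(seed: int = 42) -> None:
    """Hypothetical stand-in for model_utils.set_random_seeds."""
    # Recorded for any subprocesses; it does not change hashing in the
    # interpreter that is already running.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np  # optional: only seeded when NumPy is installed
        np.random.seed(seed)
    except ImportError:
        pass


seed_everything(42)
first_draw = [random.random() for _ in range(3)]
seed_everything(42)
second_draw = [random.random() for _ in range(3)]
print(first_draw == second_draw)  # True: same seed, same sequence
```

Libraries with their own generators (for example TensorFlow) would need their own seeding call; whether the project's helper covers them is not visible here.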


## 1. Load Engineered Features

In [5]:
# Define paths
DATA_PATH = Path('../02_Data/Processed/')
MODEL_PATH = Path('../02_Model_Development/Trained_Models/')
MODEL_PATH.mkdir(parents=True, exist_ok=True)

# Load engineered features (from Notebook 04)
df = pd.read_csv(DATA_PATH / 'features_engineered.csv')
print(f"Dataset shape: {df.shape}")

# Define target
target_col = 'congestion_index'

# Select only numeric columns for features (exclude date and target)
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
feature_cols = [col for col in numeric_cols if col != target_col]

# Handle any remaining NaN values
df[feature_cols] = df[feature_cols].fillna(0)

print(f"\nTarget: {target_col}")
print(f"Features: {len(feature_cols)} numeric columns")
print(f"Sample features: {feature_cols[:5]}...")

Dataset shape: (1652, 37)

Target: congestion_index
Features: 33 numeric columns
Sample features: ['traffic_volume', 'average_speed', 'year', 'month', 'day']...


## 2. Prepare Train/Validation/Test Split

In [6]:
# Temporal train/val/test split (60/20/20)
train_df, val_df, test_df = temporal_train_test_split(
    df, 
    date_col='date' if 'date' in df.columns else df.columns[0],
    train_ratio=0.6,
    val_ratio=0.2,
    test_ratio=0.2
)

# Extract features and targets
X_train = train_df[feature_cols].values
y_train = train_df[target_col].values

X_val = val_df[feature_cols].values
y_val = val_df[target_col].values

X_test = test_df[feature_cols].values
y_test = test_df[target_col].values

print(f"Training set: {X_train.shape}")
print(f"Validation set: {X_val.shape}")
print(f"Test set: {X_test.shape}")

INFO:modeling:Split sizes - Train: 991, Val: 330, Test: 331


Training set: (991, 33)
Validation set: (330, 33)
Test set: (331, 33)
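
`temporal_train_test_split` lives in the project's `modeling` module and its source is not shown here. The sketch below illustrates one plausible way to do a chronological 60/20/20 split: sort by date, cut at truncated index boundaries, and let the test set take the remainder. The function name `temporal_split` and the synthetic dates are assumptions made for illustration only.

```python
from datetime import date, timedelta


def temporal_split(rows, date_key, train_ratio=0.6, val_ratio=0.2):
    """Split rows chronologically; the test set takes whatever remains."""
    ordered = sorted(rows, key=lambda row: row[date_key])
    n = len(ordered)
    train_end = int(n * train_ratio)
    val_end = train_end + int(n * val_ratio)
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Synthetic daily records with the same row count as the dataset above.
start = date(2020, 1, 1)
rows = [{"date": start + timedelta(days=i), "congestion_index": float(i % 7)}
        for i in range(1652)]

train, val, test = temporal_split(rows, "date")
print(len(train), len(val), len(test))  # 991 330 331
```

With 1,652 rows this boundary rule reproduces the split sizes reported in the log above. Because no shuffling takes place, every validation row is later than every training row, which keeps future information out of training.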


## 3. Train XGBoost Model

In [7]:
# XGBoost configuration
xgb_config = {
    'n_estimators': 100,
    'max_depth': 6,
    'learning_rate': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'random_state': 42
}

# Train XGBoost
print("Training XGBoost...")
xgb_model, xgb_info = train_xgboost_model(
    X_train, y_train,
    X_val, y_val,
    config=xgb_config,
    save_path=str(MODEL_PATH / 'xgboost_model.pkl')
)

print("✅ XGBoost training complete!")

INFO:modeling:Starting XGBoost training...


Training XGBoost...
[0]	validation_0-rmse:7.90538
[1]	validation_0-rmse:7.46043
[2]	validation_0-rmse:7.12928
[3]	validation_0-rmse:6.76722
[4]	validation_0-rmse:6.48022
[5]	validation_0-rmse:6.23008
[6]	validation_0-rmse:6.03583
[7]	validation_0-rmse:5.87765
[8]	validation_0-rmse:5.74751
[9]	validation_0-rmse:5.65878
[10]	validation_0-rmse:5.56027
[11]	validation_0-rmse:5.45645
[12]	validation_0-rmse:5.38381
[13]	validation_0-rmse:5.33062
[14]	validation_0-rmse:5.30465
[15]	validation_0-rmse:5.24887
[16]	validation_0-rmse:5.22018
[17]	validation_0-rmse:5.19039

INFO:modeling:Model saved to ../02_Model_Development/Trained_Models/xgboost_model.pkl


✅ XGBoost training complete!
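
The `validation_0-rmse` values in the log are root-mean-squared errors on the validation set, one per boosting round. As a reminder of what that number measures, here is a plain-Python sketch of RMSE alongside MAE and R². The helper name `regression_metrics` and the toy values are illustrative assumptions, not project results.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Toy values, not results from this project.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

In the notebook itself, the `mean_absolute_error`, `mean_squared_error`, and `r2_score` functions imported from scikit-learn in the setup cell serve the same purpose.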


## 4. Train Random Forest Model

In [8]:
# Random Forest configuration
rf_config = {
    'n_estimators': 100,
    'max_depth': 15,
    'min_samples_split': 5,
    'min_samples_leaf': 2,
    'random_state': 42,
    'n_jobs': -1
}

# Train Random Forest
print("Training Random Forest...")
rf_model, rf_info = train_random_forest_model(
    X_train, y_train,
    config=rf_config,
    save_path=str(MODEL_PATH / 'random_forest_model.pkl')
)

print("✅ Random Forest training complete!")

INFO:modeling:Starting Random Forest training...
INFO:modeling:Model saved to ../02_Model_Development/Trained_Models/random_forest_model.pkl


Training Random Forest...
✅ Random Forest training complete!
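
The log reports that the model was written to a `.pkl` file, and the setup cell imports `pickle`. How the project's `save_model` helper is implemented is not shown, so the following is only a sketch of a pickle round trip under that assumption. `save_pickle` and `load_pickle` are hypothetical names, and a plain dict stands in for a fitted model so the example carries no ML dependency.

```python
import pickle
import tempfile
from pathlib import Path


def save_pickle(obj, path):
    """Hypothetical stand-in for model_utils.save_model."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(obj, fh)


def load_pickle(path):
    """Load an object previously written by save_pickle."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


# A plain dict stands in for a fitted model.
dummy_model = {"name": "random_forest", "n_estimators": 100, "max_depth": 15}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "Trained_Models" / "random_forest_model.pkl"
    save_pickle(dummy_model, target)
    restored = load_pickle(target)

print(restored == dummy_model)  # True
```

Pickle files execute code on load, so they should only be loaded from trusted sources, and they are generally tied to the library versions that produced them.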


## 5. Train ARIMA Model (Time-Series Baseline)

In [9]:
# Train ARIMA as baseline
print("Training ARIMA...")
arima_model, arima_info = train_arima_model(
    y_train,
    order=(1, 1, 1),
    seasonal_order=(1, 1, 1, 7),  # Weekly seasonality
    save_path=str(MODEL_PATH / 'arima_model.pkl')
)

print(f"ARIMA AIC: {arima_info['aic']:.2f}")
print("✅ ARIMA training complete!")

Training ARIMA...


INFO:modeling:Training ARIMA model with order=(1, 1, 1), seasonal=(1, 1, 1, 7)
INFO:modeling:ARIMA AIC: 6056.26
INFO:modeling:Model saved to ../02_Model_Development/Trained_Models/arima_model.pkl


ARIMA AIC: 6056.26
✅ ARIMA training complete!
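
In `order=(1, 1, 1)` and `seasonal_order=(1, 1, 1, 7)`, the middle values are the differencing orders: the series is differenced once at lag 1 and once at lag 7 before the autoregressive and moving-average terms are fitted. The sketch below shows what those two differencing steps do to a synthetic series with a linear trend and a weekly pattern. The helper `difference` and the data are illustrative assumptions, not part of the project's `train_arima_model`.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Synthetic series: a linear trend plus a repeating weekly pattern.
weekly_pattern = [0, 1, 3, 2, 4, 8, 6]
series = [0.5 * t + weekly_pattern[t % 7] for t in range(28)]

seasonal_diff = difference(series, lag=7)       # D=1 with period 7
stationary = difference(seasonal_diff, lag=1)   # d=1

print(len(series), len(seasonal_diff), len(stationary))  # 28 21 20
print(set(seasonal_diff))  # {3.5}: weekly pattern removed, trend slope remains
print(set(stationary))     # {0.0}: trend removed as well
```

Each differencing step shortens the series by its lag, so a seasonal difference at lag 7 followed by a first difference consumes eight observations in total.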


## 6. Train LSTM Model (Optional - Requires TensorFlow)

In [10]:
# LSTM requires sequential data
# Create sequences for LSTM
SEQUENCE_LENGTH = 7

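try:
    # train_lstm_model is not imported in the setup cell, which is what raised
    # the NameError in the recorded output. Importing it here assumes that
    # modeling.py defines it; an ImportError is caught by the except below.
    from modeling import train_lstm_model
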
    # Prepare sequential data
    X_seq_train, y_seq_train = create_sequences(
        train_df[feature_cols + [target_col]].values,
        sequence_length=SEQUENCE_LENGTH,
        target_col_idx=-1
    )
    
    X_seq_val, y_seq_val = create_sequences(
        val_df[feature_cols + [target_col]].values,
        sequence_length=SEQUENCE_LENGTH,
        target_col_idx=-1
    )
    
    # LSTM configuration
    lstm_config = {
        'units_layer1': 64,
        'units_layer2': 32,
        'dropout': 0.2,
        'learning_rate': 0.001,
        'epochs': 50,
        'batch_size': 32
    }
    
    print("Training LSTM...")
    lstm_model, lstm_history = train_lstm_model(
        X_seq_train, y_seq_train,
        X_seq_val, y_seq_val,
        config=lstm_config,
        save_path=str(MODEL_PATH / 'lstm_model')
    )
    
    print("✅ LSTM training complete!")
    
except Exception as e:
    print(f"⚠️ LSTM training skipped: {e}")
    print("Install TensorFlow to enable LSTM training: pip install tensorflow")

Training LSTM...
⚠️ LSTM training skipped: name 'train_lstm_model' is not defined
Install TensorFlow to enable LSTM training: pip install tensorflow
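
`create_sequences` is a project helper whose source is not shown. A common convention for LSTM inputs, assumed here, is a sliding window: each sample holds `sequence_length` consecutive rows, and its label is the target column of the row immediately after the window. The function `make_sequences` below is a hypothetical illustration of that convention and may differ from the project's implementation.

```python
def make_sequences(rows, sequence_length, target_col_idx=-1):
    """Hypothetical sliding-window builder; not the project's create_sequences."""
    X, y = [], []
    for start in range(len(rows) - sequence_length):
        end = start + sequence_length
        X.append(rows[start:end])             # window of past rows
        y.append(rows[end][target_col_idx])   # target one step after the window
    return X, y


# Ten toy rows of [feature, target]; the target is just 10 * feature.
rows = [[float(i), 10.0 * i] for i in range(10)]
X, y = make_sequences(rows, sequence_length=7)

print(len(X), len(X[0]), len(X[0][0]))  # 3 7 2
print(y)                                # [70.0, 80.0, 90.0]
```

Under this assumed convention, the 991 training rows with a window of 7 would yield 984 samples, each covering 7 time steps of 34 columns (the 33 features plus the target).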


## 7. Training Summary

In [11]:
# Training summary
print("=" * 60)
print("MODEL TRAINING SUMMARY")
print("=" * 60)
print(f"\nModels trained and saved to: {MODEL_PATH}")
print("\nTrained Models:")
print("  ✅ XGBoost")
print("  ✅ Random Forest")
print("  ✅ ARIMA")
print("  ⚠️ LSTM (if TensorFlow available)")

print("\n" + "=" * 60)
print("MODEL TRAINING COMPLETE!")
print("=" * 60)
print("\nNext Step: Run 06_Model_Evaluation.ipynb")

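============================================================
MODEL TRAINING SUMMARY
============================================================

Models trained and saved to: ../02_Model_Development/Trained_Models

Trained Models:
  ✅ XGBoost
  ✅ Random Forest
  ✅ ARIMA
  ⚠️ LSTM (if TensorFlow available)

============================================================
MODEL TRAINING COMPLETE!
============================================================

Next Step: Run 06_Model_Evaluation.ipynb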
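
The summary cell prints fixed status lines regardless of what actually happened during training. One possible alternative, sketched here, derives the status from the artifacts on disk. The file names come from the `save_path` arguments used earlier in the notebook; the helper `training_status` is hypothetical, and the demonstration runs against a temporary directory rather than the project's real model folder.

```python
import tempfile
from pathlib import Path

# Artifact names taken from the save_path arguments used earlier in the notebook.
EXPECTED_ARTIFACTS = {
    "XGBoost": "xgboost_model.pkl",
    "Random Forest": "random_forest_model.pkl",
    "ARIMA": "arima_model.pkl",
    "LSTM": "lstm_model",
}


def training_status(model_dir):
    """Map each model name to True if its saved artifact exists."""
    model_dir = Path(model_dir)
    return {name: (model_dir / filename).exists()
            for name, filename in EXPECTED_ARTIFACTS.items()}


# Demonstration against a temporary directory, not the project's real folder.
with tempfile.TemporaryDirectory() as tmp:
    for filename in ("xgboost_model.pkl", "random_forest_model.pkl", "arima_model.pkl"):
        (Path(tmp) / filename).touch()
    status = training_status(tmp)

for name, saved in status.items():
    print(f"  {'saved' if saved else 'missing'}: {name}")
```

In the notebook, calling `training_status(MODEL_PATH)` would report the LSTM as missing whenever its training step was skipped.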
