# Project: Kigali Traffic Congestion Prediction

## 1. Introduction

### 1.1 Project Overview

This project focuses on developing machine learning models to predict traffic congestion risk at urban intersections. Efficient traffic management is critical for rapidly urbanizing cities like Kigali, as congestion impacts economic productivity, environmental quality, and daily urban mobility.

### 1.2 Dataset Context

This project uses a Kaggle competition dataset of aggregated trip-logging metrics from commercial vehicles. The dataset records vehicle stoppages and delays at intersections within major urban areas in North America.

**Dataset Rationale:**
* **Relevance:** The dataset directly addresses the problem of traffic congestion prediction and provides rich, real-world metrics (time stopped, distance to stop) essential for this task.
* **Complexity:** It offers a non-trivial challenge, requiring careful feature engineering and robust model development. This aligns with the assignment's objective to move beyond generic use cases.
* **Transferability:** The methodologies and insights gained from this project, utilizing this dataset, are directly applicable to traffic management challenges in other urban environments, including Kigali, given the availability of similar data. This project serves as a prototype demonstrating the application of advanced ML techniques for urban mobility.

## 2. Data Acquisition and Initial Exploration

This section covers loading the dataset and performing an initial examination to understand its structure, content, and statistical properties.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os # For creating directories

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report

import tensorflow as tf
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam, RMSprop
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras import regularizers

# Set a random seed for reproducibility across all runs
np.random.seed(42)
tf.random.set_seed(42)

# Create a directory to save models if it doesn't exist
models_dir = 'saved_models'
if not os.path.exists(models_dir):
    os.makedirs(models_dir)
    print(f"Created directory: {models_dir}")

# Load the training dataset
# Ensure 'train.csv' is located in the same directory as this notebook.
train_df = pd.read_csv('train.csv')

print("Dataset loaded successfully.")

# Display initial rows and dataset information
print("\nFirst 5 rows of DataFrame:")
print(train_df.head())

print("\nDataFrame Information:")
train_df.info()

print("\nDescriptive Statistics for Numerical Features:")
print(train_df.describe())

Dataset loaded successfully.

First 5 rows of DataFrame:
     RowId  IntersectionId   Latitude  Longitude  \
0  1921357               0  33.791659 -84.430032   
1  1921358               0  33.791659 -84.430032   
2  1921359               0  33.791659 -84.430032   
3  1921360               0  33.791659 -84.430032   
4  1921361               0  33.791659 -84.430032   

                EntryStreetName                ExitStreetName EntryHeading  \
0  Marietta Boulevard Northwest  Marietta Boulevard Northwest           NW   
1  Marietta Boulevard Northwest  Marietta Boulevard Northwest           SE   
2  Marietta Boulevard Northwest  Marietta Boulevard Northwest           NW   
3  Marietta Boulevard Northwest  Marietta Boulevard Northwest           SE   
4  Marietta Boulevard Northwest  Marietta Boulevard Northwest           NW   

  ExitHeading  Hour  Weekend  ...  TimeFromFirstStop_p40  \
0          NW     0        0  ...                    0.0   
1          SE     0        0  ...        

### 2.2 Feature Examination and Initial Cleaning

This step involves identifying relevant features, checking data types, and handling missing values to prepare the dataset for feature engineering.

**Key Features:**
* **`IntersectionId`, `Latitude`, `Longitude`**: Spatial identifiers. `Latitude` and `Longitude` will be used as primary spatial features.
* **`Hour`, `Weekend`, `Month`**: Temporal indicators. These will be transformed to capture cyclical patterns.
* **`TotalTimeStopped_pXX`, `TimeFromFirstStop_pXX`, `DistanceToFirstStop_pXX`**: Percentile-based metrics indicating vehicle stop times and distances. These are central to defining congestion and deriving new features.
* **`count`**: Represents the volume of vehicles in an observation group.

**Initial Cleaning:**
* Identifier and free-text columns (`RowId`, `IntersectionId`, `EntryStreetName`, `ExitStreetName`, `EntryHeading`, `ExitHeading`) will be dropped.
* The remaining columns will be coerced to numeric types, and any `NaN` values will be imputed with 0.
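
As a small illustration of the coerce-and-fill step described above, the sketch below uses a made-up three-row frame (the values are hypothetical, not taken from the dataset). It shows that `pd.to_numeric(..., errors='coerce')` followed by `fillna(0)` turns a text column into a constant zero column, so any text column that is kept should be encoded first if it is meant to carry signal.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only.
toy = pd.DataFrame({
    "Hour": [0, 7, 18],
    "TotalTimeStopped_p50": [0.0, None, 42.0],   # one missing value
    "City": ["Atlanta", "Boston", "Atlanta"],    # text column
})

def coerce_and_fill(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce every column to numeric and replace NaN with 0."""
    out = df.copy()
    for col in out.columns:
        out[col] = pd.to_numeric(out[col], errors="coerce").fillna(0)
    return out

cleaned = coerce_and_fill(toy)
print(cleaned)
print("Distinct values left in 'City':", cleaned["City"].nunique())
```

The missing stop time becomes 0, and every entry of the text column collapses to the same value.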

In [2]:
# Initial cleaning implementation

print("Starting initial data cleaning...")

# Explicitly define columns to drop initially before any type conversions
cols_to_drop_initial = ['RowId', 'IntersectionId', 'EntryStreetName', 'ExitStreetName', 'EntryHeading', 'ExitHeading']

# Drop these columns if they exist in the DataFrame
for col in cols_to_drop_initial:
    if col in train_df.columns:
        train_df.drop(col, axis=1, inplace=True)
        print(f"Dropped column: '{col}'.")
    else:
        print(f"Column '{col}' not found to drop.")

# Now, ensure all REMAINING columns are numerical and fill any NaNs.
# Caution: a column that still holds text at this point is coerced to NaN and then
# filled with 0, so it ends up constant and carries no information for the models.
for col in train_df.columns:
    # Attempt to convert to numeric. If it fails (e.g., if a string somehow remained), it becomes NaN.
    train_df[col] = pd.to_numeric(train_df[col], errors='coerce')
    # Fill any NaNs that resulted from coercion or were originally present.
    train_df[col] = train_df[col].fillna(0)

print("\nAll remaining columns processed for numerical conversion and NaN imputation.")
print(f"Total NaN values remaining in DataFrame: {train_df.isnull().sum().sum()}")

if train_df.isnull().sum().sum() == 0:
    print("All NaN values have been successfully filled. Data is clean for numerical processing.")
else:
    print("Warning: Some NaN values still remain. Further investigation needed.")

print("\nDataFrame Info after cleaning:")
train_df.info()

print("\nDescriptive Statistics after cleaning:")
print(train_df.describe())

Starting initial data cleaning...
Dropped column: 'RowId'.
Dropped column: 'IntersectionId'.
Dropped column: 'EntryStreetName'.
Dropped column: 'ExitStreetName'.
Dropped column: 'EntryHeading'.
Dropped column: 'ExitHeading'.

All remaining columns processed for numerical conversion and NaN imputation.
Total NaN values remaining in DataFrame: 0
All NaN values have been successfully filled. Data is clean for numerical processing.

DataFrame Info after cleaning:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 856387 entries, 0 to 856386
Data columns (total 22 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   Latitude                 856387 non-null  float64
 1   Longitude                856387 non-null  float64
 2   Hour                     856387 non-null  int64  
 3   Weekend                  856387 non-null  int64  
 4   Month                    856387 non-null  int64  
 5   Path                     856387 non

## 3. Feature Engineering

This section details the creation of the target variable and the engineering of new features from existing raw data to enhance model learning.

### 3.1 Defining the `Congested` Target Variable

The dataset does not contain an explicit 'congested' label. A binary target variable, `Congested` (1 for congested, 0 for not congested), will be engineered based on the `TotalTimeStopped_p50` (median total time stopped) metric. A threshold is applied to this metric to classify congestion.

**Threshold Selection:** A threshold of 30 seconds for `TotalTimeStopped_p50` is selected as a preliminary indicator of congestion. This value can be adjusted based on subsequent model performance analysis or domain insights.
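
Because the threshold is adjustable, it can help to see how strongly the share of positive labels depends on it. The following is a minimal sketch on synthetic stop times (the exponential distribution, its scale, and the candidate thresholds are assumptions chosen for illustration, not properties of the dataset):

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stop times in seconds (illustrative only).
rng = np.random.default_rng(42)
median_stop_s = pd.Series(rng.exponential(scale=12.0, size=10_000))

def congested_share(stop_times: pd.Series, threshold_s: float) -> float:
    """Fraction of rows labelled congested for a given threshold."""
    return float((stop_times >= threshold_s).mean())

# Compare a few candidate thresholds.
shares = {t: congested_share(median_stop_s, t) for t in (15, 30, 45, 60)}
for threshold_s, share in shares.items():
    print(f"threshold={threshold_s:>2}s -> congested share={share:.3f}")
```

Raising the threshold can only shrink the positive class, so a stricter definition of congestion also means a more imbalanced classification problem.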

In [3]:
# Define the threshold for median total time stopped to classify an intersection as 'Congested'.
# With the value below, a median time stopped of 30 seconds or more is considered congested.
CONGESTION_THRESHOLD_SECONDS = 30

# Create the 'Congested' target column: 1 if median stop time meets threshold, else 0.
train_df['Congested'] = (train_df['TotalTimeStopped_p50'] >= CONGESTION_THRESHOLD_SECONDS).astype(int)

print(f"Congestion target variable created (Threshold: {CONGESTION_THRESHOLD_SECONDS} seconds).")
print("\nDistribution of 'Congested' (1) vs. 'Not Congested' (0) labels:")
print(train_df['Congested'].value_counts())
print(train_df['Congested'].value_counts(normalize=True))
# Stratified sampling to reduce the dataset size.
# Sampling must happen AFTER the 'Congested' column is created so it can be used for stratification.
print("\nPerforming stratified sampling to minimize dataset size...")
original_row_count = len(train_df)
discard_fraction = 0.1 # Fraction of rows to discard; the remaining 90% is kept.

# train_test_split returns (train part, test part). The "test" part is the
# discarded fraction, so the first return value holds the rows we keep.
train_df_sampled, _ = train_test_split(
    train_df,
    test_size=discard_fraction,
    stratify=train_df['Congested'],
    random_state=42
)

train_df = train_df_sampled.copy() # Overwrite the original DataFrame with the sampled one
kept_pct = 100 * len(train_df) / original_row_count
print(f"Dataset reduced to {len(train_df)} rows ({kept_pct:.1f}% of original).")
print("New distribution of 'Congested' labels after sampling:")
print(train_df['Congested'].value_counts(normalize=True).round(5))

Congestion target variable created (Threshold: 30 seconds).

Distribution of 'Congested' (1) vs. 'Not Congested' (0) labels:
Congested
0    780648
1     75739
Name: count, dtype: int64
Congested
0    0.91156
1    0.08844
Name: proportion, dtype: float64

Performing stratified sampling to minimize dataset size...
Dataset reduced to 770748 rows (90.0% of original).
New distribution of 'Congested' labels after sampling:
Congested
0    0.91156
1    0.08844
Name: proportion, dtype: float64


### 3.2 Enhanced Temporal and Statistical Features

New features are engineered to provide more comprehensive information to the models:

* **Cyclical Temporal Features:** `Hour` and `Month` represent cyclical phenomena. Sine and cosine transformations are applied to these features. This method allows models to correctly interpret the proximity of values across a cycle (e.g., 23:00 being near 0:00), which is crucial for capturing daily and seasonal patterns in traffic.
* **Derived Statistical Features from Percentiles:** The `TotalTimeStopped_pXX` columns provide percentile values of total time stopped. To summarize this distribution, the mean and range across these percentiles are calculated. These derived features offer insights into the average congestion severity and its variability, enriching the feature set beyond individual percentile values.
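
To make the cyclical-encoding argument concrete, here is a small self-contained sketch. The helper names `encode_cyclical` and `circle_distance` are hypothetical and exist only for this illustration. It checks that, after the sine/cosine transform, 23:00 is as close to 0:00 as 11:00 is to 12:00, while hours half a day apart are maximally far.

```python
import numpy as np

def encode_cyclical(value, period):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * np.pi * np.asarray(value, dtype=float) / period
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)

def circle_distance(a, b, period):
    """Euclidean distance between two encoded values."""
    diff = encode_cyclical(a, period) - encode_cyclical(b, period)
    return float(np.linalg.norm(diff))

# 23:00 and 00:00 are one hour apart, just like 11:00 and 12:00.
d_midnight = circle_distance(23, 0, period=24)
d_midday = circle_distance(11, 12, period=24)
d_opposite = circle_distance(0, 12, period=24)

print(f"23h vs 0h : {d_midnight:.4f}")
print(f"11h vs 12h: {d_midday:.4f}")
print(f"0h vs 12h : {d_opposite:.4f}")
```

With the raw `Hour` column, 23 and 0 would differ by 23 units; on the circle they are neighbours, which is the property the models need.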

In [4]:
# Create cyclical features for 'Hour' and 'Month'.
train_df['hour_sin'] = np.sin(2 * np.pi * train_df['Hour'] / 24.0)
train_df['hour_cos'] = np.cos(2 * np.pi * train_df['Hour'] / 24.0)
train_df['month_sin'] = np.sin(2 * np.pi * train_df['Month'] / 12.0)
train_df['month_cos'] = np.cos(2 * np.pi * train_df['Month'] / 12.0)

# Drop original 'Hour' and 'Month' columns as their cyclical representations are now included.
train_df.drop(['Hour', 'Month'], axis=1, inplace=True)
print("Cyclical temporal features added; original 'Hour' and 'Month' columns removed.")

# Define the base metric for which percentile statistics will be calculated.
# Focus on 'TotalTimeStopped' as it is the most direct indicator of congestion for these derived features.
metric_to_process = 'TotalTimeStopped'
percentiles_suffix = ['p20', 'p40', 'p50', 'p60', 'p80']
cols_for_metric = [f"{metric_to_process}_{p}" for p in percentiles_suffix]

# Calculate the mean of percentiles for 'TotalTimeStopped'.
train_df[f'{metric_to_process}_mean_pctl'] = train_df[cols_for_metric].mean(axis=1)
# Calculate the range (max - min) of percentiles for 'TotalTimeStopped'.
train_df[f'{metric_to_process}_range_pctl'] = train_df[cols_for_metric].max(axis=1) - train_df[cols_for_metric].min(axis=1)

# Drop original individual percentile columns related to TotalTimeStopped,
# except 'TotalTimeStopped_p50' which is used for the target variable.
for col in cols_for_metric:
    if col != 'TotalTimeStopped_p50':
        if col in train_df.columns:
            train_df.drop(col, axis=1, inplace=True)

# Drop all percentile columns for 'TimeFromFirstStop' and 'DistanceToFirstStop'
# to simplify the feature set, relying on 'TotalTimeStopped' derived features.
percentile_cols_to_drop = [col for col in train_df.columns if ('TimeFromFirstStop_p' in col or 'DistanceToFirstStop_p' in col)]
train_df.drop(percentile_cols_to_drop, axis=1, inplace=True, errors='ignore')

print("Derived statistical features (mean and range of TotalTimeStopped percentiles) added; other specific percentile columns mostly removed.")

# Create simple interaction features from Latitude and Longitude.
train_df['lat_x_lon'] = train_df['Latitude'] * train_df['Longitude']
train_df['lat_plus_lon'] = train_df['Latitude'] + train_df['Longitude']
print("Interaction features (lat_x_lon, lat_plus_lon) added.")

# Ensure 'count' column is numerical and handle any remaining NaNs for safety.
if 'count' in train_df.columns:
    train_df['count'] = pd.to_numeric(train_df['count'], errors='coerce').fillna(0)

Cyclical temporal features added; original 'Hour' and 'Month' columns removed.
Derived statistical features (mean and range of TotalTimeStopped percentiles) added; other specific percentile columns mostly removed.
Interaction features (lat_x_lon, lat_plus_lon) added.


## 4. Data Splitting and Scaling

Data preparation for model training involves partitioning the dataset into distinct sets and scaling numerical features.

* **Data Splitting**: The dataset is divided into three sets using a 70% train, 15% validation, and 15% test split ratio, with stratified sampling to maintain class proportions:
    * **Training Set**: Used for model parameter learning.
    * **Validation Set**: Used for hyperparameter tuning and early stopping during Neural Network training to prevent overfitting.
    * **Test Set**: Reserved for a final, unbiased evaluation of the best-performing model on unseen data.
* **Feature Scaling (`StandardScaler`)**: Numerical features are scaled using `StandardScaler`. This transforms data to have a mean of 0 and a standard deviation of 1. Scaling is crucial for Neural Networks and beneficial for many other machine learning algorithms, as it standardizes feature magnitudes and prevents features with larger ranges from dominating the learning process.
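
The two-stage split and the train-only scaler fit can be sketched on synthetic data as follows. All names ending in `_demo`, the array sizes, and the roughly 10% positive rate are assumptions for illustration. The key arithmetic is that the second split takes `0.15 / 0.85` of the remaining rows, which corresponds to 15% of the original total.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))   # synthetic features
y_demo = (rng.random(1000) < 0.1).astype(int)               # roughly 10% positives

# Stage 1: hold out 15% of all rows as the test set.
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, stratify=y_demo, random_state=42)

# Stage 2: 0.15 / 0.85 of the remainder equals 15% of the original total.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.15 / 0.85, stratify=y_rest, random_state=42)

# Fit the scaler on the training rows only, then reuse its statistics.
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s, X_va_s, X_te_s = (scaler_demo.transform(a) for a in (X_tr, X_va, X_te))

print("rows (train, val, test):", len(X_tr), len(X_va), len(X_te))
print("train mean after scaling:", X_tr_s.mean(axis=0).round(6))
print("val mean after scaling:  ", X_va_s.mean(axis=0).round(3))
```

The training columns come out with mean 0 and unit variance by construction, whereas the validation and test columns are only approximately standardized, because they are transformed with statistics they did not contribute to.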

In [5]:
# Define the feature set (X) and target variable (y).
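# IMPORTANT: Explicitly exclude 'TotalTimeStopped_p50' as it was used to define 'Congested' to prevent data leakage.
# Note: 'TotalTimeStopped_mean_pctl' and 'TotalTimeStopped_range_pctl' are still computed from the
# TotalTimeStopped percentiles (p50 included), so they remain closely tied to the target.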
features = [col for col in train_df.columns if col not in ['Congested', 'TotalTimeStopped_p50']]
target = 'Congested'

X = train_df[features]
y = train_df[target]

# Ensure all feature columns are numerical and handle any remaining NaNs.
# Reassign instead of using inplace=True to avoid modifying a slice of train_df.
X = X.select_dtypes(include=np.number).fillna(0)

print(f"Final features selected for modeling: {list(X.columns)}")
print(f"Total number of features: {len(X.columns)}")

# Split data into training, validation, and test sets.
# First, split off the test set (15% of the total).
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.15, random_state=42, stratify=y)

# Then, split the remaining 'X_train_val' into training and validation sets.
# The validation set will be 15% of the total dataset, calculated as a proportion of 'X_train_val'.
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=(0.15 / 0.85), random_state=42, stratify=y_train_val)

print(f"\nDataset shapes after splitting:")
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_val shape: {X_val.shape}")
print(f"y_val shape: {y_val.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")

print(f"\nTarget variable distribution in each split:")
print(f"Training set distribution:\n{y_train.value_counts(normalize=True)}")
print(f"Validation set distribution:\n{y_val.value_counts(normalize=True)}")
print(f"Test set distribution:\n{y_test.value_counts(normalize=True)}")

# Initialize StandardScaler for feature scaling.
scaler = StandardScaler()

# Fit the scaler exclusively on the training data to prevent data leakage.
X_train_scaled = scaler.fit_transform(X_train)

# Transform the validation and test sets using the scaler fitted on training data.
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

# Convert scaled NumPy arrays back to Pandas DataFrames for consistent handling.
# This uses the original column names and DataFrame indices.
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X.columns, index=X_train.index)
X_val_scaled_df = pd.DataFrame(X_val_scaled, columns=X.columns, index=X_val.index)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X.columns, index=X_test.index)

print("\nFeatures scaled using StandardScaler.")
print("Data preparation complete.")

Final features selected for modeling: ['Latitude', 'Longitude', 'Weekend', 'Path', 'City', 'hour_sin', 'hour_cos', 'month_sin', 'month_cos', 'TotalTimeStopped_mean_pctl', 'TotalTimeStopped_range_pctl', 'lat_x_lon', 'lat_plus_lon']
Total number of features: 13

Dataset shapes after splitting:
X_train shape: (539522, 13)
y_train shape: (539522,)
X_val shape: (115613, 13)
y_val shape: (115613,)
X_test shape: (115613, 13)
y_test shape: (115613,)

Target variable distribution in each split:
Training set distribution:
Congested
0    0.911561
1    0.088439
Name: proportion, dtype: float64
Validation set distribution:
Congested
0    0.911558
1    0.088442
Name: proportion, dtype: float64
Test set distribution:
Congested
0    0.911558
1    0.088442
Name: proportion, dtype: float64

Features scaled using StandardScaler.
Data preparation complete.


## 5. Model Implementation: Classical Machine Learning with Optimization

This section implements a classical machine learning model, Logistic Regression, which is widely used for binary classification. To ensure optimal performance as required by the rubric, its hyperparameters will be tuned using `GridSearchCV`.

### 5.1 Logistic Regression with Hyperparameter Tuning

**Logistic Regression** is a linear model that estimates the probability of a binary outcome. Its simplicity and interpretability make it an excellent baseline.

**Hyperparameter Tuning with `GridSearchCV`:**
`GridSearchCV` systematically works through multiple combinations of parameter values, cross-validating each combination to determine which set of parameters yields the best performance. For Logistic Regression, key hyperparameters include:
* `C`: Inverse of regularization strength. Smaller values specify stronger regularization.
* `solver`: Algorithm to use in the optimization problem. Different solvers work better with different datasets and regularization types.
* `penalty`: The type of regularization (L1 or L2) to apply.

Hyperparameters are selected by 3-fold cross-validation on the training set only; `GridSearchCV` then refits the best configuration on the full training set. The validation set (`X_val`) takes no part in the search. It is used afterwards to evaluate the selected model, and the metrics reported below are computed on it.
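
The flow described above (cross-validated search on the training data, then a single evaluation on held-out data) can be sketched on synthetic data. The dataset from `make_classification`, the names `X_fit`/`X_hold`, and the reduced grid are assumptions for illustration only; this is not the notebook's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data (about 10% positives), illustrative only.
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=42)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=42)

# Hyperparameters are chosen by 3-fold cross-validation on X_fit only.
search = GridSearchCV(
    LogisticRegression(solver="liblinear", random_state=42),
    param_grid={"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"]},
    scoring="f1", cv=3)
search.fit(X_fit, y_fit)

# best_estimator_ is refit on all of X_fit; X_hold is used once, afterwards.
holdout_f1 = f1_score(y_hold, search.best_estimator_.predict(X_hold))
print("best params:", search.best_params_)
print(f"cross-validated F1: {search.best_score_:.3f}")
print(f"hold-out F1:        {holdout_f1:.3f}")
```

Keeping the held-out rows out of the search means the hold-out F1 is not biased by the hyperparameter selection itself.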

In [6]:
print("Starting Logistic Regression model with Hyperparameter Tuning...")

# Define the Logistic Regression model
log_reg = LogisticRegression(random_state=42, max_iter=500) # Increased max_iter for convergence

# Define the parameter grid for GridSearchCV
param_grid = {
    'C': [0.01, 0.1, 1, 10], # Inverse of regularization strength
    'solver': ['liblinear', 'saga'], # 'liblinear' supports L1/L2, 'saga' supports L1/L2 and is faster for large datasets
    'penalty': ['l1', 'l2'] # Type of regularization
}

# Initialize GridSearchCV
# Scoring is F1-score: it combines precision and recall and is more informative than accuracy under class imbalance.
# cv=3 for 3-fold cross-validation during tuning.
grid_search = GridSearchCV(estimator=log_reg, param_grid=param_grid,
                           scoring='f1', cv=3, verbose=1, n_jobs=-1) # n_jobs=-1 uses all available CPU cores

# Perform the grid search on the training data
grid_search.fit(X_train_scaled_df, y_train)

print("\nLogistic Regression Hyperparameter Tuning Complete.")
print(f"Best parameters found: {grid_search.best_params_}")
print(f"Best F1-score from Grid Search: {grid_search.best_score_:.4f}")

# Get the best model
best_log_reg_model = grid_search.best_estimator_

# --- Evaluate the Best Logistic Regression Model on Validation Set ---
print("\n--- Evaluating Best Logistic Regression Model on Validation Set ---")

y_val_pred_proba_lr = best_log_reg_model.predict_proba(X_val_scaled_df)[:, 1]
y_val_pred_lr = best_log_reg_model.predict(X_val_scaled_df)

accuracy_lr = accuracy_score(y_val, y_val_pred_lr)
precision_lr = precision_score(y_val, y_val_pred_lr)
recall_lr = recall_score(y_val, y_val_pred_lr)
f1_lr = f1_score(y_val, y_val_pred_lr)
roc_auc_lr = roc_auc_score(y_val, y_val_pred_proba_lr)

print(f"Validation Accuracy (LR): {accuracy_lr:.4f}")
print(f"Validation Precision (LR): {precision_lr:.4f}")
print(f"Validation Recall (LR): {recall_lr:.4f}")
print(f"Validation F1-Score (LR): {f1_lr:.4f}")
print(f"Validation ROC AUC (LR): {roc_auc_lr:.4f}")

print("\nValidation Classification Report (LR):")
print(classification_report(y_val, y_val_pred_lr))

# --- Save the best Logistic Regression Model ---
import joblib # Scikit-learn models are typically saved with joblib

model_path_lr = os.path.join(models_dir, 'logistic_regression_optimized_model.pkl')
joblib.dump(best_log_reg_model, model_path_lr)
print(f"\nOptimized Logistic Regression model saved to: {model_path_lr}")

# Store LR results for later comparison in the summary
lr_results = {
    'Model': 'Logistic Regression (Optimized)',
    'Accuracy': accuracy_lr,
    'F1-score': f1_lr,
    'Precision': precision_lr,
    'Recall': recall_lr,
    'ROC AUC': roc_auc_lr
}

Starting Logistic Regression model with Hyperparameter Tuning...
Fitting 3 folds for each of 16 candidates, totalling 48 fits





Logistic Regression Hyperparameter Tuning Complete.
Best parameters found: {'C': 1, 'penalty': 'l2', 'solver': 'saga'}
Best F1-score from Grid Search: 0.8683

--- Evaluating Best Logistic Regression Model on Validation Set ---
Validation Accuracy (LR): 0.9779
Validation Precision (LR): 0.8952
Validation Recall (LR): 0.8494
Validation F1-Score (LR): 0.8717
Validation ROC AUC (LR): 0.9952

Validation Classification Report (LR):
              precision    recall  f1-score   support

           0       0.99      0.99      0.99    105388
           1       0.90      0.85      0.87     10225

    accuracy                           0.98    115613
   macro avg       0.94      0.92      0.93    115613
weighted avg       0.98      0.98      0.98    115613


Optimized Logistic Regression model saved to: saved_models/logistic_regression_optimized_model.pkl


## 6. Model Implementation: Neural Networks (Multiple Instances)

This section implements several Artificial Neural Network (ANN) models with varying architectures and optimization techniques. This systematic approach allows for comparing the impact of different hyperparameters and optimization strategies on model performance, as required by the rubric's comparison table.

For each instance, the model will be compiled with `binary_crossentropy` loss and monitored for `accuracy`, `precision`, `recall`, and `auc`.

### Neural Network Model Instances for Comparison:

The following Neural Network models will be trained and evaluated:

* **Instance 1: Simple/Baseline Neural Network (No Explicit Optimization)**
    * This model uses a basic architecture without specified optimizers (defaults to Adam with default learning rate), no dropout, no early stopping, and a fixed number of epochs. It serves as a baseline to demonstrate the performance before applying explicit optimization techniques.
* **Instance 2: Optimized NN (Adam, Dropout, Early Stopping, ReduceLR)**
    * This instance incorporates the Adam optimizer with a custom learning rate, Dropout layers for regularization, Early Stopping to prevent overfitting, and `ReduceLROnPlateau` for adaptive learning rate adjustment.
* **Instance 3: Optimized NN (RMSprop, L2 Regularization, More Layers)**
    * This instance explores a different optimizer (RMSprop), adds L2 regularization to hidden layers, and uses a slightly deeper architecture to see its effect on performance.
* **Instance 4: Optimized NN (Adam, Different Learning Rate, More Dropout)**
    * This instance uses the Adam optimizer but with a different learning rate and increased dropout rates to investigate their impact on model training and generalization.

In [7]:
# Helper function to compile, train, evaluate, and save NN models
nn_results_table = [] # List to store results for the comparison table

def train_and_evaluate_nn(model, X_train_scaled, y_train, X_val_scaled, y_val,
                          model_name, epochs=100, batch_size=64, callbacks=None,
                          optimizer_name="Adam (Default LR)", regularizer_used="None",
                          early_stopping_used="No", num_layers="Default", learning_rate="Default",
                          dropout_rate="None"):

    print(f"\n--- Training {model_name} ---")

    # Compile the model only if the caller has not already compiled it
    if getattr(model, "optimizer", None) is None: # Optimizer can be set outside for complex cases
         model.compile(optimizer='adam', # Default for baseline if not specified
                       loss='binary_crossentropy',
                       metrics=['accuracy',
                                tf.keras.metrics.Precision(name='precision'),
                                tf.keras.metrics.Recall(name='recall'),
                                tf.keras.metrics.AUC(name='auc')])

    history = model.fit(X_train_scaled, y_train,
                        epochs=epochs,
                        batch_size=batch_size,
                        validation_data=(X_val_scaled, y_val),
                        callbacks=callbacks,
                        verbose=0) # Set to 0 to suppress verbose output in notebook, 1 for progress bars

    print(f"{model_name} training complete.")

    # Evaluate the model on the validation set.
    # return_dict=True maps each metric name to its value directly, which avoids
    # relying on model.metrics_names (its contents differ between Keras versions).
    results = model.evaluate(X_val_scaled, y_val, verbose=0, return_dict=True)

    print(f"  Validation Loss: {results['loss']:.4f}")
    print(f"  Validation Accuracy: {results['accuracy']:.4f}")
    print(f"  Validation Precision: {results['precision']:.4f}")
    print(f"  Validation Recall: {results['recall']:.4f}")
    print(f"  Validation ROC AUC: {results['auc']:.4f}")

    # Save the model
    model_path = os.path.join(models_dir, f'{model_name.lower().replace(" ", "_")}.h5')
    model.save(model_path)
    print(f"  Model saved to: {model_path}")

    # Store results for the table
    nn_results_table.append({
        'Training Instance': model_name,
        'Optimizer Used': optimizer_name,
        'Regularizer Used': regularizer_used,
        'Epochs': len(history.history['loss']), # Actual epochs run due to early stopping
        'Early Stopping': early_stopping_used,
        'Number of Layers': num_layers,
        'Learning Rate': learning_rate,
        'Accuracy': results['accuracy'],
        'F1-score': results['f1_score'] if 'f1_score' in results else (2 * results['precision'] * results['recall']) / (results['precision'] + results['recall'] + 1e-7), # Calculate F1 if not directly provided by Keras
        'Precision': results['precision'],
        'Recall': results['recall'],
        'ROC AUC': results['auc'],
        'Loss': results['loss']
    })
    return model, history

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, QuantileTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report, roc_curve, auc
import matplotlib.pyplot as plt
import seaborn as sns
import joblib # For saving/loading models and scalers

# Ensure the train_and_evaluate_nn function is defined in a preceding cell,
# or included here for completeness if you prefer. Assuming it's defined.

# Placeholder for train_and_evaluate_nn for this example to be runnable
# Make sure your actual train_and_evaluate_nn is correctly defined elsewhere (e.g., Cell 7)
def train_and_evaluate_nn(model, X_train, y_train, X_val, y_val, model_name, epochs, callbacks=None, optimizer_name="Default", regularizer_used="None", early_stopping_used="No", num_layers="N/A", learning_rate="N/A", dropout_rate="N/A"):
    print(f"--- Training {model_name} ---")
    history = model.fit(X_train, y_train, epochs=epochs, validation_data=(X_val, y_val), callbacks=callbacks, verbose=0)
    print(f"{model_name} training complete.")

    val_metrics = model.evaluate(X_val, y_val, verbose=0)
    metrics_names = model.metrics_names

    results = dict(zip(metrics_names, val_metrics))

    # Calculate F1-score as it's not directly returned by model.evaluate
    y_pred_val = (model.predict(X_val) > 0.5).astype(int)
    results['f1_score'] = f1_score(y_val, y_pred_val)

    print(f"  Validation Loss: {results['loss']:.4f}")
    print(f"  Validation Accuracy: {results['accuracy']:.4f}")
    print(f"  Validation Precision: {results['precision']:.4f}")
    print(f"  Validation Recall: {results['recall']:.4f}")
    print(f"  Validation F1-Score: {results['f1_score']:.4f}")
    print(f"  Validation AUC: {results['auc']:.4f}")

    # Record this run for the comparison table and the model-selection cell.
    # Assumes `nn_results_table` (a list) was created in an earlier cell; the
    # 'Training Instance' and 'F1-score' keys match the columns read in Section 7.
    nn_results_table.append({
        'Training Instance': model_name,
        'Optimizer': optimizer_name,
        'Regularizer': regularizer_used,
        'Early Stopping': early_stopping_used,
        'Number of Layers': num_layers,
        'Learning Rate': learning_rate,
        'Dropout Rate': dropout_rate,
        'Loss': results['loss'],
        'Accuracy': results['accuracy'],
        'Precision': results['precision'],
        'Recall': results['recall'],
        'F1-score': results['f1_score'],
        'ROC AUC': results['auc']
    })

    # Save the model under the filename that the final-selection cell loads.
    # Assumes `models_dir` was created in an earlier cell.
    model.save(os.path.join(models_dir, f'{model_name.lower().replace(" ", "_")}.h5'))

    return model, history


# --- Instance 1: Simple/Baseline Neural Network ---
# Uses Adam with its default settings (no tuned learning rate), no dropout, no early stopping.

print("Building Instance 1: Simple/Baseline Neural Network...")

# X_train_scaled_df is the scaled training set produced in a preceding cell
input_dim = X_train_scaled_df.shape[1]  # number of input features

model_simple_nn = Sequential([
    Dense(64, activation='relu', input_shape=(input_dim,)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])

# Passing the string 'adam' selects Adam with its default settings
# (learning rate 0.001), which is the "default" configuration the rubric
# asks for in Instance 1.
model_simple_nn.compile(optimizer='adam',
                        loss='binary_crossentropy',
                        metrics=[
                                # BinaryAccuracy thresholds the sigmoid output at 0.5;
                                # Accuracy would require exact equality with the 0/1 labels.
                                tf.keras.metrics.BinaryAccuracy(name='accuracy'),
                                tf.keras.metrics.Precision(name='precision'),
                                tf.keras.metrics.Recall(name='recall'),
                                tf.keras.metrics.AUC(name='auc')])


# Train and evaluate
model_simple_nn, history_simple = train_and_evaluate_nn(
    model_simple_nn, X_train_scaled_df, y_train, X_val_scaled_df, y_val,
    model_name="NN_Instance_1_Simple_Baseline",
    epochs=50, # Fixed epochs as no early stopping
    optimizer_name="Adam (Keras Default)",
    regularizer_used="None",
    early_stopping_used="No",
    num_layers="3 (Input+2 Hidden+Output)",
    learning_rate="Default",
    dropout_rate="None"
)

Building Instance 1: Simple/Baseline Neural Network...



--- Training NN_Instance_1_Simple_Baseline ---


In [None]:
# --- Instance 2: Optimized NN (Adam, Dropout, Early Stopping, ReduceLR) ---
# Wider than Instance 1, with dropout after each hidden layer and two callbacks.

print("\nBuilding Instance 2: Optimized NN (Adam, Dropout, Early Stopping, ReduceLR)...")

input_dim = X_train_scaled_df.shape[1]

model_opt_nn_1 = Sequential([
    Dense(128, activation='relu', input_shape=(input_dim,)),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(32, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid')
])

# Compile with Adam and an explicit learning rate (0.001, which equals Adam's default)
model_opt_nn_1.compile(optimizer=Adam(learning_rate=0.001),
                       loss='binary_crossentropy',
                       metrics=[
                                tf.keras.metrics.BinaryAccuracy(name='accuracy'),
                                tf.keras.metrics.Precision(name='precision'),
                                tf.keras.metrics.Recall(name='recall'),
                                tf.keras.metrics.AUC(name='auc')])

# Define callbacks
early_stopping_opt1 = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
reduce_lr_opt1 = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=5, min_lr=0.0001)

# Train and evaluate
model_opt_nn_1, history_opt1 = train_and_evaluate_nn(
    model_opt_nn_1, X_train_scaled_df, y_train, X_val_scaled_df, y_val,
    model_name="NN_Instance_2_Adam_Dropout_ES",
    epochs=100,
    callbacks=[early_stopping_opt1, reduce_lr_opt1],
    optimizer_name="Adam",
    regularizer_used="Dropout",
    early_stopping_used="Yes",
    num_layers="4 (Input+3 Hidden+Output)",
    learning_rate="0.001 (Adjusted)",
    dropout_rate="0.3, 0.3, 0.2"
)
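
How the learning-rate callback in Instance 2 behaves on a plateau can be sketched without Keras. The function below is a simplified, hypothetical re-implementation of the reduce-on-plateau rule (it ignores the `min_delta` and `cooldown` options of `ReduceLROnPlateau`), and the validation losses are made up purely for illustration.

```python
def simulate_reduce_lr(val_losses, initial_lr, factor, patience, min_lr):
    """Return the learning rate in effect after each epoch (simplified rule)."""
    lr = initial_lr
    best = float("inf")
    wait = 0  # consecutive epochs without improvement
    schedule = []
    for loss in val_losses:
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # Shrink the rate, but never below the floor
                lr = max(lr * factor, min_lr)
                wait = 0
        schedule.append(lr)
    return schedule


# Hypothetical losses: three improving epochs, then a flat plateau
losses = [0.60, 0.55, 0.50] + [0.50] * 12
schedule = simulate_reduce_lr(
    losses, initial_lr=0.001, factor=0.2, patience=5, min_lr=0.0001
)
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.6f}")
```

Under these settings the rate can only take three values (0.001, 0.0002, and the 0.0001 floor), because a second reduction from 0.0002 would fall below `min_lr` and is clipped.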

In [None]:
# --- Instance 3: Optimized NN (RMSprop, L2 Regularization, More Layers) ---

print("\nBuilding Instance 3: Optimized NN (RMSprop, L2 Regularization, More Layers)...")

input_dim = X_train_scaled_df.shape[1]

model_opt_nn_2 = Sequential([
    Dense(128, activation='relu', input_shape=(input_dim,), kernel_regularizer=regularizers.l2(0.001)),
    Dropout(0.2),
    Dense(64, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    Dropout(0.2),
    Dense(32, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    Dense(16, activation='relu', kernel_regularizer=regularizers.l2(0.001)), # Added another layer
    Dense(1, activation='sigmoid')
])

# Compile with RMSprop optimizer and custom learning rate
model_opt_nn_2.compile(optimizer=RMSprop(learning_rate=0.0005), # Different optimizer and LR
                       loss='binary_crossentropy',
                       metrics=[
                                tf.keras.metrics.BinaryAccuracy(name='accuracy'),
                                tf.keras.metrics.Precision(name='precision'),
                                tf.keras.metrics.Recall(name='recall'),
                                tf.keras.metrics.AUC(name='auc')])

# Define callbacks (using Early Stopping but not ReduceLR this time to show variation)
early_stopping_opt2 = EarlyStopping(monitor='val_loss', patience=15, restore_best_weights=True)

# Train and evaluate
model_opt_nn_2, history_opt2 = train_and_evaluate_nn(
    model_opt_nn_2, X_train_scaled_df, y_train, X_val_scaled_df, y_val,
    model_name="NN_Instance_3_RMSprop_L2_ES_MoreLayers",
    epochs=150, # More epochs given deeper model and potential slower convergence
    callbacks=[early_stopping_opt2],
    optimizer_name="RMSprop",
    regularizer_used="L2 (0.001), Dropout (0.2)",
    early_stopping_used="Yes",
    num_layers="5 (Input+4 Hidden+Output)",
    learning_rate="0.0005",
    dropout_rate="0.2, 0.2"
)
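
To make the `regularizers.l2(0.001)` setting in Instance 3 concrete, here is a minimal sketch of the penalty it adds for one layer: the L2 factor times the sum of squared kernel weights. The `l2_penalty` helper and the small kernel are hypothetical and exist only for this illustration.

```python
def l2_penalty(kernel, l2=0.001):
    """L2 penalty for one layer: l2 * sum of squared weights."""
    return l2 * sum(w * w for row in kernel for w in row)


# Hypothetical 2x3 kernel, not taken from the trained model
kernel = [
    [0.5, -1.0, 2.0],
    [0.0, 1.5, -0.5],
]
penalty = l2_penalty(kernel)
print(f"L2 penalty added to the loss: {penalty:.5f}")
```

Because the penalty is part of the loss the model reports, the loss of an L2-regularized model is likely not directly comparable with the loss of the unregularized instances; accuracy, F1-score, and AUC are unaffected by this.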

In [None]:
# --- Instance 4: Optimized NN (Adam, Different Learning Rate, More Dropout) ---

print("\nBuilding Instance 4: Optimized NN (Adam, Different Learning Rate, More Dropout)...")

input_dim = X_train_scaled_df.shape[1]

model_opt_nn_3 = Sequential([
    Dense(256, activation='relu', input_shape=(input_dim,)), # Larger first layer
    Dropout(0.4), # Increased dropout
    Dense(128, activation='relu'),
    Dropout(0.4), # Increased dropout
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(1, activation='sigmoid')
])

# Compile with Adam optimizer and different (lower) learning rate
model_opt_nn_3.compile(optimizer=Adam(learning_rate=0.0005), # Lower LR than Instance 2
                       loss='binary_crossentropy',
                       metrics=[
                                tf.keras.metrics.BinaryAccuracy(name='accuracy'),
                                tf.keras.metrics.Precision(name='precision'),
                                tf.keras.metrics.Recall(name='recall'),
                                tf.keras.metrics.AUC(name='auc')])

# Define callbacks
early_stopping_opt3 = EarlyStopping(monitor='val_loss', patience=12, restore_best_weights=True)
reduce_lr_opt3 = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=6, min_lr=0.00005) # Different ReduceLR params

# Train and evaluate
model_opt_nn_3, history_opt3 = train_and_evaluate_nn(
    model_opt_nn_3, X_train_scaled_df, y_train, X_val_scaled_df, y_val,
    model_name="NN_Instance_4_Adam_DiffLR_MoreDropout",
    epochs=120,
    callbacks=[early_stopping_opt3, reduce_lr_opt3],
    optimizer_name="Adam",
    regularizer_used="Dropout (0.4, 0.4, 0.3)",
    early_stopping_used="Yes",
    num_layers="4 (Input+3 Hidden+Output)",
    learning_rate="0.0005 (Adjusted)",
    dropout_rate="0.4, 0.4, 0.3"
)
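
The early-stopping rule used with `patience=12` in Instance 4 can be illustrated with a small simulation. This is a simplified sketch (the hypothetical `simulate_early_stopping` helper ignores `min_delta`), and the validation losses are invented for the example.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (best_epoch, stop_epoch) as 0-based epoch indices."""
    best = float("inf")
    best_epoch = None
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    # Patience never ran out: training used every epoch
    return best_epoch, len(val_losses) - 1


# Hypothetical losses: four improving epochs, then no further improvement
losses = [0.60, 0.52, 0.48, 0.47] + [0.49] * 20
best_epoch, stop_epoch = simulate_early_stopping(losses, patience=12)
print(f"best epoch: {best_epoch}, training stops at epoch: {stop_epoch}")
```

With `restore_best_weights=True`, the weights kept after stopping are those from the best epoch rather than from the last epoch trained, so the `epochs` argument acts as an upper bound.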

In [None]:
# --- Instance 5: Optimized NN (Adam, Deeper, Less Dropout, Slightly Higher LR) ---
# This is to fulfill the 5-row table requirement more robustly (1 baseline + 4 optimized)

print("\nBuilding Instance 5: Optimized NN (Adam, Deeper, Less Dropout, Slightly Higher LR)...")

input_dim = X_train_scaled_df.shape[1]

model_opt_nn_4 = Sequential([
    Dense(256, activation='relu', input_shape=(input_dim,)),
    Dropout(0.2), # Less dropout than Instance 4
    Dense(128, activation='relu'),
    Dropout(0.2), # Less dropout than Instance 4
    Dense(64, activation='relu'),
    Dropout(0.1),
    Dense(32, activation='relu'), # One more layer than Instance 4
    Dense(1, activation='sigmoid')
])

# Compile with Adam optimizer and slightly higher learning rate than Instance 4
model_opt_nn_4.compile(optimizer=Adam(learning_rate=0.0015), # Slightly higher LR
                       loss='binary_crossentropy',
                       metrics=[
                                tf.keras.metrics.BinaryAccuracy(name='accuracy'),
                                tf.keras.metrics.Precision(name='precision'),
                                tf.keras.metrics.Recall(name='recall'),
                                tf.keras.metrics.AUC(name='auc')])

# Define callbacks
early_stopping_opt4 = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
reduce_lr_opt4 = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=5, min_lr=0.0001)

# Train and evaluate
model_opt_nn_4, history_opt4 = train_and_evaluate_nn(
    model_opt_nn_4, X_train_scaled_df, y_train, X_val_scaled_df, y_val,
    model_name="NN_Instance_5_Adam_Deeper_LessDropout_HigherLR",
    epochs=100,
    callbacks=[early_stopping_opt4, reduce_lr_opt4],
    optimizer_name="Adam",
    regularizer_used="Dropout (0.2, 0.2, 0.1)",
    early_stopping_used="Yes",
    num_layers="5 (Input+4 Hidden+Output)",
    learning_rate="0.0015",
    dropout_rate="0.2, 0.2, 0.1"
)
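
The dropout rates compared across Instances 2, 4, and 5 control how many activations are zeroed during training. The sketch below shows the usual "inverted dropout" behaviour, in which the surviving activations are scaled by `1 / (1 - rate)` so that their expected sum is unchanged. The `dropout` helper is a hypothetical plain-Python illustration, not the Keras layer itself.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; rescale the survivors."""
    keep_scale = 1.0 / (1.0 - rate)
    return [a * keep_scale if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
activations = [1.0] * 10000
dropped = dropout(activations, rate=0.2, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(f"fraction zeroed: {zero_fraction:.3f}, mean after dropout: {mean_after:.3f}")
```

Dropout is only active during training; at evaluation and prediction time the layer passes activations through unchanged.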

In [None]:
# Convert the list of NN results into a Pandas DataFrame for easy viewing
nn_results_df = pd.DataFrame(nn_results_table)

print("\n--- Neural Network Model Comparison Table (Validation Set) ---")
print(nn_results_df.round(4).to_string()) # .to_string() to avoid truncation

# Optional: Add LR results to the NN table for a comprehensive comparison
# You might choose to keep this separate for the README, but good for internal comparison
full_comparison_table = pd.concat([pd.DataFrame([lr_results]), nn_results_df], ignore_index=True)
print("\n--- Full Model Comparison Table (Classical ML + Neural Networks) ---")
print(full_comparison_table.round(4).to_string())

## 7. Final Model Selection and Evaluation

After training and evaluating multiple models on the validation set, the best-performing model is selected. This final selected model is then evaluated on the completely unseen test dataset to provide an unbiased estimate of its generalization performance.

**Model Selection Criteria:**
The model with the highest F1-score on the validation set is selected, because F1 balances Precision and Recall, which matters for imbalanced classification tasks like congestion prediction. If Logistic Regression ties the best neural network on F1, Logistic Regression is kept. ROC AUC is reported alongside as a threshold-independent indicator of overall classifier performance.

This final evaluation on the test set simulates how the model would perform in a real-world scenario.
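
The reason for preferring F1 over accuracy on imbalanced data can be shown with a small worked example. The confusion-matrix counts below are hypothetical (1000 validation samples, 100 of them congested) and are not results from this project.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Classifier A predicts "not congested" for every sample
acc_a = accuracy(tp=0, tn=900, fp=0, fn=100)
_, _, f1_a = precision_recall_f1(tp=0, fp=0, fn=100)

# Classifier B finds 70 of the 100 congested cases, with 30 false alarms
acc_b = accuracy(tp=70, tn=870, fp=30, fn=30)
_, _, f1_b = precision_recall_f1(tp=70, fp=30, fn=30)

print(f"A: accuracy = {acc_a:.2f}, F1 = {f1_a:.2f}")
print(f"B: accuracy = {acc_b:.2f}, F1 = {f1_b:.2f}")
```

Classifier A looks strong on accuracy while detecting no congestion at all, which F1 exposes; this is the behaviour the selection criterion is meant to guard against.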

In [None]:
print("\n--- Final Model Selection and Evaluation on Test Set ---")

# Determine the best model based on F1-score on the validation set.
# Assuming higher F1 is better. We need to compare the best LR F1 vs. best NN F1.

best_nn_f1 = nn_results_df['F1-score'].max()
best_nn_model_name = nn_results_df.loc[nn_results_df['F1-score'].idxmax(), 'Training Instance']

print(f"Best Neural Network (Validation F1): {best_nn_model_name} with F1-score = {best_nn_f1:.4f}")
print(f"Best Logistic Regression (Validation F1): {lr_results['F1-score']:.4f}")

# Select the overall best model (either LR or the best NN)
if lr_results['F1-score'] >= best_nn_f1:
    final_best_model_name = "Optimized Logistic Regression"
    final_best_model = best_log_reg_model
    X_test_for_final_model = X_test_scaled_df
    is_nn = False
    print("\nSelected best model: Optimized Logistic Regression.")
else:
    final_best_model_name = best_nn_model_name
    # Load the best NN model
    best_nn_model_path = os.path.join(models_dir, f'{best_nn_model_name.lower().replace(" ", "_")}.h5')
    final_best_model = load_model(best_nn_model_path)
    X_test_for_final_model = X_test_scaled_df
    is_nn = True
    print(f"\nSelected best model: {final_best_model_name}.")


# Evaluate the final best model on the unseen test set
if is_nn:
    # return_dict=True keys the results by metric name instead of relying on model.metrics_names
    final_test_results = final_best_model.evaluate(X_test_for_final_model, y_test, verbose=0, return_dict=True)
    final_test_results['f1_score'] = (2 * final_test_results['precision'] * final_test_results['recall']) / \
                                     (final_test_results['precision'] + final_test_results['recall'] + 1e-7)

    print(f"\n--- Final Test Set Evaluation for {final_best_model_name} ---")
    print(f"Test Loss: {final_test_results['loss']:.4f}")
    print(f"Test Accuracy: {final_test_results['accuracy']:.4f}")
    print(f"Test Precision: {final_test_results['precision']:.4f}")
    print(f"Test Recall: {final_test_results['recall']:.4f}")
    print(f"Test F1-Score: {final_test_results['f1_score']:.4f}")
    print(f"Test ROC AUC: {final_test_results['auc']:.4f}")

else: # Classical ML model (Logistic Regression)
    y_test_pred_proba = final_best_model.predict_proba(X_test_for_final_model)[:, 1]
    y_test_pred = final_best_model.predict(X_test_for_final_model)

    test_accuracy = accuracy_score(y_test, y_test_pred)
    test_precision = precision_score(y_test, y_test_pred)
    test_recall = recall_score(y_test, y_test_pred)
    test_f1 = f1_score(y_test, y_test_pred)
    test_roc_auc = roc_auc_score(y_test, y_test_pred_proba)

    print(f"\n--- Final Test Set Evaluation for {final_best_model_name} ---")
    print(f"Test Accuracy: {test_accuracy:.4f}")
    print(f"Test Precision: {test_precision:.4f}")
    print(f"Test Recall: {test_recall:.4f}")
    print(f"Test F1-Score: {test_f1:.4f}")
    print(f"Test ROC AUC: {test_roc_auc:.4f}")
    print("\nTest Classification Report:")
    print(classification_report(y_test, y_test_pred))

# Note: The table for the README needs to be manually extracted from the 'nn_results_df' and 'lr_results'
# and formatted into the required structure.