# Project: Kigali Traffic Congestion Prediction

## 1. Introduction

### 1.1 Project Overview

This project focuses on developing machine learning models to predict traffic congestion risk at urban intersections. Efficient traffic management is critical for rapidly urbanizing cities like Kigali, as congestion impacts economic productivity, environmental quality, and daily urban mobility.

### 1.2 Dataset Context

This project uses a Kaggle competition dataset of aggregated trip-logging metrics from commercial vehicles. The dataset records vehicle stoppages and delays at intersections in major North American cities.

**Dataset Rationale:**
* **Relevance:** The dataset directly addresses the problem of traffic congestion prediction and provides rich, real-world metrics (time stopped, distance to stop) essential for this task.
* **Complexity:** It offers a non-trivial challenge, requiring careful feature engineering and robust model development. This aligns with the assignment's objective to move beyond generic use cases.
* **Transferability:** The methods and insights from this project carry over to traffic management in other urban environments, including Kigali, provided comparable data are available. The project serves as a prototype demonstrating the application of advanced ML techniques to urban mobility.

## 2. Data Acquisition and Initial Exploration

This section covers loading the dataset and performing an initial examination to understand its structure, content, and statistical properties.

### 2.1 Data Loading and Initial Inspection

In [8]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, log_loss, roc_auc_score
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam, RMSprop
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# Load the training dataset
# The 'train.csv' file is expected to be in the same directory as this notebook.
train_df = pd.read_csv('train.csv')

print("Dataset loaded.")

# Display initial rows and dataset information
print("\nFirst 5 rows of DataFrame:")
print(train_df.head())

print("\nDataFrame Information:")
train_df.info()

print("\nDescriptive Statistics for Numerical Features:")
print(train_df.describe())

Dataset loaded.

First 5 rows of DataFrame:
     RowId  IntersectionId   Latitude  Longitude  \
0  1921357               0  33.791659 -84.430032   
1  1921358               0  33.791659 -84.430032   
2  1921359               0  33.791659 -84.430032   
3  1921360               0  33.791659 -84.430032   
4  1921361               0  33.791659 -84.430032   

                EntryStreetName                ExitStreetName EntryHeading  \
0  Marietta Boulevard Northwest  Marietta Boulevard Northwest           NW   
1  Marietta Boulevard Northwest  Marietta Boulevard Northwest           SE   
2  Marietta Boulevard Northwest  Marietta Boulevard Northwest           NW   
3  Marietta Boulevard Northwest  Marietta Boulevard Northwest           SE   
4  Marietta Boulevard Northwest  Marietta Boulevard Northwest           NW   

  ExitHeading  Hour  Weekend  ...  TimeFromFirstStop_p40  \
0          NW     0        0  ...                    0.0   
1          SE     0        0  ...                    0

### 2.2 Feature Examination and Initial Cleaning

This step involves identifying relevant features, checking data types, and handling missing values to prepare the dataset for feature engineering.

**Key Features:**
* **`IntersectionId`, `Latitude`, `Longitude`**: Spatial identifiers. `Latitude` and `Longitude` will be used as primary spatial features.
* **`Hour`, `Weekend`, `Month`**: Temporal indicators. These will be transformed to capture cyclical patterns.
* **`TotalTimeStopped_pXX`, `TimeFromFirstStop_pXX`, `DistanceToFirstStop_pXX`**: Percentile-based metrics indicating vehicle stop times and distances. These are central to defining congestion and deriving new features.
* **`count`**: Represents the volume of vehicles in an observation group.

**Initial Cleaning:**
* Unnecessary identifier columns (`RowId`, `IntersectionId`) will be dropped.
* Numerical columns will be cast to the correct data types, and any `NaN` values will be imputed (e.g., with 0).
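
The cleaning steps listed above can be sketched on a small made-up frame. The `clean_frame` helper and the toy values are illustrative assumptions, not part of the notebook. The sketch fills missing values only in numeric columns, so text columns such as headings are left untouched.

```python
import numpy as np
import pandas as pd


def clean_frame(df, id_cols=("RowId", "IntersectionId")):
    """Drop identifier columns and fill NaNs in numeric columns with 0.

    Hypothetical helper for illustration only.
    """
    # drop() returns a new frame, so the input is not modified.
    out = df.drop(columns=[c for c in id_cols if c in df.columns])
    numeric_cols = out.select_dtypes(include=np.number).columns
    out[numeric_cols] = out[numeric_cols].fillna(0)
    return out


# Toy data standing in for a few rows of train.csv (values are made up).
toy = pd.DataFrame({
    "RowId": [1, 2, 3],
    "IntersectionId": [0, 0, 1],
    "Latitude": [33.79, 33.79, 33.75],
    "TotalTimeStopped_p50": [0.0, np.nan, 12.0],
    "EntryHeading": ["NW", "SE", "NW"],
})

cleaned = clean_frame(toy)
print(cleaned.dtypes)
print(cleaned.isnull().sum().sum())
```

Restricting the fill to numeric columns is a design choice of this sketch: coercing a text column with `pd.to_numeric(..., errors='coerce')` would replace every street name or heading with `NaN`.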

In [11]:
print("\n--- Diagnosing Remaining NaNs ---")
# Check which columns still have NaNs and how many
nan_counts = train_df.isnull().sum()
columns_with_nan = nan_counts[nan_counts > 0]

if not columns_with_nan.empty:
    print("Columns still containing NaN values:")
    print(columns_with_nan)

    # Re-apply NaN handling to every column that still contains missing values.
    print("\nAttempting to re-fill NaNs in all numerical columns with 0...")
    for col in columns_with_nan.index:
        # Coerce non-numeric columns to numeric first. Note that errors='coerce'
        # turns every unparseable string into NaN, which is then filled with 0.
        if not pd.api.types.is_numeric_dtype(train_df[col]):
            train_df[col] = pd.to_numeric(train_df[col], errors='coerce')
        train_df[col] = train_df[col].fillna(0)

    print("\n--- Re-checking NaN values after re-filling ---")
    final_nan_check = train_df.isnull().sum().sum()
    print(f"Total NaN values remaining in DataFrame: {final_nan_check}")

    if final_nan_check == 0:
        print("All NaN values have been successfully filled! Proceeding with confidence.")
    else:
        print("Warning: Some NaN values still remain. Further investigation needed.")
else:
    print("No NaN values found in any columns. Good to go!")

# Let's re-display the info and describe to confirm types and no NaNs
print("\nDataFrame Info after NaN fix attempt:")
train_df.info()

print("\nDescriptive Statistics after NaN fix attempt:")
print(train_df.describe())


--- Diagnosing Remaining NaNs ---
No NaN values found in any columns. Good to go!

DataFrame Info after NaN fix attempt:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 856387 entries, 0 to 856386
Data columns (total 26 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   Latitude                 856387 non-null  float64
 1   Longitude                856387 non-null  float64
 2   EntryStreetName          856387 non-null  float64
 3   ExitStreetName           856387 non-null  float64
 4   EntryHeading             856387 non-null  object 
 5   ExitHeading              856387 non-null  object 
 6   Hour                     856387 non-null  int64  
 7   Weekend                  856387 non-null  int64  
 8   Month                    856387 non-null  int64  
 9   Path                     856387 non-null  object 
 10  TotalTimeStopped_p20     856387 non-null  float64
 11  TotalTimeStopped_p40     856387 non-null  float

## 3. Feature Engineering

This section details the creation of the target variable and the engineering of new features from existing raw data to enhance model learning.

### 3.1 Defining the `Congested` Target Variable

The dataset does not contain an explicit 'congested' label. A binary target variable, `Congested` (1 for congested, 0 for not congested), will be engineered based on the `TotalTimeStopped_p50` (median total time stopped) metric. A threshold is applied to this metric to classify congestion.

**Threshold Selection:** A threshold of 30 seconds for `TotalTimeStopped_p50` is selected as a preliminary indicator of congestion. This value can be adjusted based on subsequent model performance analysis or domain insights.
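
Because the label depends entirely on this threshold, it can help to see how the share of positive labels moves as the cutoff changes. The following is a minimal sketch on synthetic stop times; the exponential distribution and the `positive_rate` helper are assumptions for illustration, not properties of the dataset.

```python
import numpy as np
import pandas as pd


def positive_rate(stop_times, threshold):
    """Share of rows labelled congested for a given threshold in seconds."""
    return float((stop_times >= threshold).mean())


# Synthetic median stop times; in the notebook the real column is
# TotalTimeStopped_p50.
rng = np.random.default_rng(0)
stop_times = pd.Series(rng.exponential(scale=12.0, size=10_000))

rates = {t: positive_rate(stop_times, t) for t in (15, 30, 45, 60)}
for threshold, rate in rates.items():
    print(f"threshold={threshold:>2}s  congested share={rate:.3f}")
```

Running the same loop on `train_df['TotalTimeStopped_p50']` would show how strongly the class balance reacts to the chosen cutoff.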

In [13]:
# Define the threshold for median total time stopped to classify an intersection as 'Congested'.
# For example, if median time stopped is 30 seconds or more, it's considered congested.
CONGESTION_THRESHOLD_SECONDS = 30

# Create the 'Congested' target column: 1 if median stop time meets threshold, else 0.
train_df['Congested'] = (train_df['TotalTimeStopped_p50'] >= CONGESTION_THRESHOLD_SECONDS).astype(int)

print(f"Congestion target variable created (Threshold: {CONGESTION_THRESHOLD_SECONDS} seconds).")
print("\nDistribution of 'Congested' (1) vs. 'Not Congested' (0) labels:")
print(train_df['Congested'].value_counts())
print(train_df['Congested'].value_counts(normalize=True))

Congestion target variable created (Threshold: 30 seconds).

Distribution of 'Congested' (1) vs. 'Not Congested' (0) labels:
Congested
0    780648
1     75739
Name: count, dtype: int64
Congested
0    0.91156
1    0.08844
Name: proportion, dtype: float64
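
The label counts above imply a strong class imbalance, which matters when reading accuracy later on. As a worked check using only the printed counts, a trivial classifier that always predicts "not congested" can be scored by hand (the variable names are illustrative):

```python
negatives = 780_648  # rows labelled 0 in the output above
positives = 75_739   # rows labelled 1 in the output above
total = negatives + positives

# A classifier that always predicts "not congested" gets every negative right
# and every positive wrong.
baseline_accuracy = negatives / total
baseline_recall = 0 / positives

print(f"baseline accuracy: {baseline_accuracy:.5f}")
print(f"baseline recall:   {baseline_recall:.1f}")
```

Any model should therefore be judged against roughly 91% accuracy as the floor, and precision, recall, and F1 on the congested class are likely to be more informative than accuracy alone.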


### 3.2 Enhanced Temporal and Statistical Features

New features are engineered to provide more comprehensive information to the models:

* **Cyclical Temporal Features:** `Hour` and `Month` represent cyclical phenomena. Sine and cosine transformations are applied to these features. This method allows models to correctly interpret the proximity of values across a cycle (e.g., 23:00 being near 0:00), which is crucial for capturing daily and seasonal patterns in traffic.
* **Derived Statistical Features from Percentiles:** The `TotalTimeStopped_pXX` columns provide percentile values of total time stopped. To summarize this distribution, the mean and range across these percentiles are calculated. These derived features offer insights into the average congestion severity and its variability, enriching the feature set beyond individual percentile values.
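
The effect of the sine and cosine encoding can be checked numerically. In this small sketch (the helper names `encode_cyclical` and `circle_distance` are hypothetical), 23:00 and 00:00 end up exactly as close to each other as 11:00 and 12:00, which the raw integer hours would not reflect.

```python
import numpy as np


def encode_cyclical(value, period):
    """Map a cyclical value onto the unit circle as (sin, cos)."""
    angle = 2 * np.pi * value / period
    return np.array([np.sin(angle), np.cos(angle)])


def circle_distance(a, b, period):
    """Euclidean distance between two encoded values."""
    return float(np.linalg.norm(encode_cyclical(a, period) - encode_cyclical(b, period)))


# 23:00 and 00:00 are one hour apart, just like 11:00 and 12:00.
d_wrap = circle_distance(23, 0, 24)
d_mid = circle_distance(11, 12, 24)
# Midnight and noon sit on opposite sides of the circle.
d_far = circle_distance(0, 12, 24)

print(d_wrap, d_mid, d_far)
```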

In [14]:
# Create cyclical features for 'Hour' and 'Month'.
train_df['hour_sin'] = np.sin(2 * np.pi * train_df['Hour'] / 24.0)
train_df['hour_cos'] = np.cos(2 * np.pi * train_df['Hour'] / 24.0)
train_df['month_sin'] = np.sin(2 * np.pi * train_df['Month'] / 12.0)
train_df['month_cos'] = np.cos(2 * np.pi * train_df['Month'] / 12.0)

# Drop original 'Hour' and 'Month' columns as their cyclical representations are now included.
train_df.drop(['Hour', 'Month'], axis=1, inplace=True)
print("Cyclical temporal features added; original 'Hour' and 'Month' columns removed.")

# Define the base metric for which percentile statistics will be calculated.
# Focus on 'TotalTimeStopped' as it is the most direct indicator of congestion for these derived features.
metric_to_process = 'TotalTimeStopped'
percentiles_suffix = ['p20', 'p40', 'p50', 'p60', 'p80']
cols_for_metric = [f"{metric_to_process}_{p}" for p in percentiles_suffix]

# Calculate the mean of percentiles for 'TotalTimeStopped'.
train_df[f'{metric_to_process}_mean_pctl'] = train_df[cols_for_metric].mean(axis=1)
# Calculate the range (max - min) of percentiles for 'TotalTimeStopped'.
train_df[f'{metric_to_process}_range_pctl'] = train_df[cols_for_metric].max(axis=1) - train_df[cols_for_metric].min(axis=1)

# Drop the individual TotalTimeStopped percentile columns,
# except 'TotalTimeStopped_p50', which is used for the target variable.
train_df.drop(
    columns=[col for col in cols_for_metric if col != 'TotalTimeStopped_p50'],
    inplace=True,
    errors='ignore',
)

# Drop all percentile columns for 'TimeFromFirstStop' and 'DistanceToFirstStop'
# to simplify the feature set, relying on 'TotalTimeStopped' derived features.
percentile_cols_to_drop = [col for col in train_df.columns if ('TimeFromFirstStop_p' in col or 'DistanceToFirstStop_p' in col)]
train_df.drop(percentile_cols_to_drop, axis=1, inplace=True, errors='ignore')

print("Derived statistical features (mean and range of TotalTimeStopped percentiles) added; other specific percentile columns mostly removed.")

# Create simple interaction features from Latitude and Longitude.
train_df['lat_x_lon'] = train_df['Latitude'] * train_df['Longitude']
train_df['lat_plus_lon'] = train_df['Latitude'] + train_df['Longitude']
print("Interaction features (lat_x_lon, lat_plus_lon) added.")

# Ensure 'count' column is numerical and handle any remaining NaNs for safety.
if 'count' in train_df.columns:
    train_df['count'] = pd.to_numeric(train_df['count'], errors='coerce').fillna(0)

Cyclical temporal features added; original 'Hour' and 'Month' columns removed.
Derived statistical features (mean and range of TotalTimeStopped percentiles) added; other specific percentile columns mostly removed.
Interaction features (lat_x_lon, lat_plus_lon) added.
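
To make the two percentile summaries concrete, here is a minimal sketch on two made-up rows (the values are invented for illustration). It applies the same row-wise mean and max-minus-min range used in the cell above.

```python
import pandas as pd

percentile_cols = [f"TotalTimeStopped_{p}" for p in ("p20", "p40", "p50", "p60", "p80")]

# Two made-up rows: one nearly free-flowing, one with long and variable stops.
toy = pd.DataFrame(
    [[0.0, 0.0, 0.0, 5.0, 10.0],
     [10.0, 20.0, 30.0, 40.0, 50.0]],
    columns=percentile_cols,
)

# Row-wise mean summarises typical severity; the range summarises spread.
toy["TotalTimeStopped_mean_pctl"] = toy[percentile_cols].mean(axis=1)
toy["TotalTimeStopped_range_pctl"] = (
    toy[percentile_cols].max(axis=1) - toy[percentile_cols].min(axis=1)
)
print(toy[["TotalTimeStopped_mean_pctl", "TotalTimeStopped_range_pctl"]])
```

The first row yields a mean of 3 seconds and a range of 10 seconds; the second yields a mean of 30 seconds and a range of 40 seconds.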


## 4. Data Splitting and Scaling

Data preparation for model training involves partitioning the dataset into distinct sets and scaling numerical features.

* **Data Splitting**: The dataset is divided into three sets in a 70% train, 15% validation, and 15% test ratio, using stratified sampling to maintain class proportions:
    * **Training Set**: Used for model parameter learning.
    * **Validation Set**: Used for hyperparameter tuning and early stopping during Neural Network training to prevent overfitting.
    * **Test Set**: Reserved for a final, unbiased evaluation of the best-performing model on unseen data.
* **Feature Scaling (`StandardScaler`)**: Numerical features are scaled using `StandardScaler`. This transforms data to have a mean of 0 and a standard deviation of 1. Scaling is crucial for Neural Networks and beneficial for many other machine learning algorithms, as it standardizes feature magnitudes and prevents features with larger ranges from dominating the learning process.
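
The two-step split and the train-only scaler fit can be sketched on synthetic data. The array sizes and random values below are assumptions for illustration; the point is that 15% of the total corresponds to `0.15 / 0.85` of the rows left after the test set is removed, and that the scaler statistics come from the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 1,000 rows, 3 features, 9% positives.
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:90] = 1

# Step 1: hold out 15% of all rows as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y
)
# Step 2: 15% of the total is 0.15 / 0.85 of the remaining rows.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.15 / 0.85, random_state=42, stratify=y_rest
)

# Fit the scaler on the training rows only, then reuse it for the other sets.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

print(len(X_train), len(X_val), len(X_test))
print(y_train.mean(), y_val.mean(), y_test.mean())
```

After scaling, the training features have mean 0 and standard deviation 1 by construction, while the validation and test features are only approximately standardized, which is the expected outcome when leakage is avoided.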

In [15]:
# Define the feature set (X) and target variable (y).
features = [col for col in train_df.columns if col not in ['Congested']]
target = 'Congested'

X = train_df[features]
y = train_df[target]

# Keep only numerical feature columns and fill any remaining NaNs.
# Chaining avoids calling fillna(inplace=True) on a slice of train_df.
X = X.select_dtypes(include=np.number).fillna(0)

print(f"Final features selected for modeling: {list(X.columns)}")
print(f"Total number of features: {len(X.columns)}")

# Split data into training, validation, and test sets.
# First, split off the test set (15% of the total).
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.15, random_state=42, stratify=y)

# Then, split the remaining 'X_train_val' into training and validation sets.
# The validation set will be 15% of the total dataset, calculated as a proportion of 'X_train_val'.
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=(0.15 / 0.85), random_state=42, stratify=y_train_val)

print(f"\nDataset shapes after splitting:")
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_val shape: {X_val.shape}")
print(f"y_val shape: {y_val.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")

print(f"\nTarget variable distribution in each split:")
print(f"Training set distribution:\n{y_train.value_counts(normalize=True)}")
print(f"Validation set distribution:\n{y_val.value_counts(normalize=True)}")
print(f"Test set distribution:\n{y_test.value_counts(normalize=True)}")

# Initialize StandardScaler for feature scaling.
scaler = StandardScaler()

# Fit the scaler exclusively on the training data to prevent data leakage.
X_train_scaled = scaler.fit_transform(X_train)

# Transform the validation and test sets using the scaler fitted on training data.
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

# Convert scaled NumPy arrays back to Pandas DataFrames for consistent handling.
# This uses the original column names and DataFrame indices.
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X.columns, index=X_train.index)
X_val_scaled_df = pd.DataFrame(X_val_scaled, columns=X.columns, index=X_val.index)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X.columns, index=X_test.index)

print("\nFeatures scaled using StandardScaler.")
print("Data preparation complete.")

Final features selected for modeling: ['Latitude', 'Longitude', 'EntryStreetName', 'ExitStreetName', 'Weekend', 'TotalTimeStopped_p50', 'hour_sin', 'hour_cos', 'month_sin', 'month_cos', 'TotalTimeStopped_mean_pctl', 'TotalTimeStopped_range_pctl', 'lat_x_lon', 'lat_plus_lon']
Total number of features: 14

Dataset shapes after splitting:
X_train shape: (599470, 14)
y_train shape: (599470,)
X_val shape: (128458, 14)
y_val shape: (128458,)
X_test shape: (128459, 14)
y_test shape: (128459,)

Target variable distribution in each split:
Training set distribution:
Congested
0    0.91156
1    0.08844
Name: proportion, dtype: float64
Validation set distribution:
Congested
0    0.911559
1    0.088441
Name: proportion, dtype: float64
Test set distribution:
Congested
0    0.911559
1    0.088441
Name: proportion, dtype: float64

Features scaled using StandardScaler.
Data preparation complete.
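
One point worth checking in the feature list printed above: `Congested` is defined directly from `TotalTimeStopped_p50`, and that column, together with the mean and range derived from the same percentiles, remains among the 14 features. A model can recover the label from it almost trivially, so evaluation scores may overstate real predictive skill. The following is a minimal sketch of excluding such columns before splitting, assuming that is the desired behaviour; the `drop_target_sources` helper and its prefix rule are hypothetical and not part of the notebook.

```python
def drop_target_sources(feature_names, prefix="TotalTimeStopped"):
    """Return the feature names that do not start with the given prefix."""
    return [name for name in feature_names if not name.startswith(prefix)]


# Feature list as printed by the notebook.
features = [
    "Latitude", "Longitude", "EntryStreetName", "ExitStreetName", "Weekend",
    "TotalTimeStopped_p50", "hour_sin", "hour_cos", "month_sin", "month_cos",
    "TotalTimeStopped_mean_pctl", "TotalTimeStopped_range_pctl",
    "lat_x_lon", "lat_plus_lon",
]

safe_features = drop_target_sources(features)
print(safe_features)
print(len(safe_features))
```

With the three `TotalTimeStopped` columns removed, 11 features remain, and the split and scaling steps can be rerun on `X[safe_features]` unchanged.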
