# Predictive Analysis of Public Transportation Delays (End-to-End Notebook)


This notebook is the **submission notebook** for the project. It walks through the full workflow step by step:


1. Load the raw dataset (`dirty_transport_dataset.csv`)


2. Clean and normalize the data


3. Engineer the required features (delay minutes, time-of-day, day type, weather severity, route frequency)


4. Train and evaluate **multiple ML models** (Linear Regression, kNN, Gradient Boosting, Random Forest)


5. Export deliverables to `outputs/` (cleaned dataset, metrics, feature importance)


6. (Optional) Generate interpretability outputs using SHAP (saved under `outputs/`)



> Notes:
>
> - If you run this notebook on Google Colab, upload the whole project folder, or at minimum the `src/` package and `dirty_transport_dataset.csv`.
> - All outputs are written to the `outputs/` folder.

In [2]:
# 1) Imports + project setup
# Make sure Python can import from the project root.
# This assumes the notebook's working directory is <project>/notebooks, so its parent is the root.
import sys
from pathlib import Path

project_dir = Path().resolve().parent
if str(project_dir) not in sys.path:
    sys.path.insert(0, str(project_dir))

import pandas as pd

# Import project pipeline and individual steps (for transparency)
from src.transport_delay.cleaning import clean_raw_dataset, CleaningConfig
from src.transport_delay.features import add_features, impute_delay_and_actual_time
from src.transport_delay.modeling import train_and_evaluate
from src.transport_delay.pipeline import run_pipeline
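
The setup cell takes the parent of the working directory as the project root, which only holds when the notebook is launched from `notebooks/`. On Colab or other layouts that may not be true. Below is a minimal sketch of a more forgiving lookup, assuming the project root is the nearest directory at or above the working directory that contains `src/`. `find_project_root` is a hypothetical helper, not part of the project code.

```python
import sys
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Return the nearest directory at or above `start` that contains `marker`.

    Hypothetical helper: falls back to `start` itself when no ancestor matches.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return start


# Use the detected root instead of assuming the notebook lives in /notebooks.
project_dir = find_project_root(Path.cwd())
if str(project_dir) not in sys.path:
    sys.path.insert(0, str(project_dir))
```

With this variant the same cell should work whether the notebook is opened from the project root or from a subfolder, as long as `src/` sits in the root.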

In [3]:
# 2) Load raw data
project_dir = Path().resolve().parent
input_csv = project_dir / "dirty_transport_dataset.csv"
output_dir = project_dir / "outputs"
output_dir.mkdir(parents=True, exist_ok=True)

raw = pd.read_csv(input_csv)
print("Raw shape:", raw.shape)
raw.head()

Raw shape: (300, 7)


Unnamed: 0,route_id,scheduled_time,actual_time,weather,passenger_count,latitude,longitude
0,3,1/1/2025 0:00,1:22 AM,SUN,250.0,999.0,
1,Route-4,1/1/2025 1:00,12:00 AM,Sunny,250.0,24.643878,32.636296
2,R03,1/1/2025 2:00,12:00 AM,sunny,250.0,24.363132,31.186819
3,Route-4,1/1/2025 3:00,12:00 AM,cloudy,36.0,25.533071,32.537508
4,Route-4,1/1/2025 4:00,04.56AM,cloudy,76.0,23.294248,33.419211
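
The preview already shows most of the problems the cleaning step has to handle: three spellings of route IDs (`3`, `Route-4`, `R03`), inconsistent weather labels, a latitude of `999.0`, and a missing longitude. A quick audit makes these visible before any cleaning is applied. The sketch below runs on the five preview rows re-typed by hand so it is self-contained; in the notebook you would pass `raw` instead. `audit_raw` is a hypothetical helper.

```python
import pandas as pd

# The five preview rows above, re-typed so the sketch runs on its own.
sample = pd.DataFrame(
    {
        "route_id": ["3", "Route-4", "R03", "Route-4", "Route-4"],
        "scheduled_time": [
            "1/1/2025 0:00", "1/1/2025 1:00", "1/1/2025 2:00",
            "1/1/2025 3:00", "1/1/2025 4:00",
        ],
        "actual_time": ["1:22 AM", "12:00 AM", "12:00 AM", "12:00 AM", "04.56AM"],
        "weather": ["SUN", "Sunny", "sunny", "cloudy", "cloudy"],
        "passenger_count": [250.0, 250.0, 250.0, 36.0, 76.0],
        "latitude": [999.0, 24.643878, 24.363132, 25.533071, 23.294248],
        "longitude": [None, 32.636296, 31.186819, 32.537508, 33.419211],
    }
)


def audit_raw(df: pd.DataFrame) -> dict:
    """Summarize the most visible data-quality problems (hypothetical helper)."""
    return {
        "missing_per_column": df.isna().sum().to_dict(),
        "route_spellings": sorted(df["route_id"].dropna().astype(str).unique()),
        "weather_spellings": sorted(df["weather"].dropna().astype(str).unique()),
        "latitude_out_of_range": int((df["latitude"].abs() > 90).sum()),
    }


report = audit_raw(sample)
for key, value in report.items():
    print(f"{key}: {value}")
```

Running the same audit on `cleaned` afterwards is a cheap way to confirm that each category of problem was actually resolved.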


In [8]:
# 3) Clean the dataset (normalization + missing values)
#
# The cleaning is implemented in: src/transport_delay/cleaning.py
# Main entry point: clean_raw_dataset(df)
#
# Inside clean_raw_dataset(), the following sub-steps/functions are applied:
#
# A) Route normalization
#    - _normalize_route_id(value): extracts the numeric route and formats it as R01, R02, ...
#    - Missing routes are filled with default "R00"
#
# B) Weather normalization
#    - _normalize_weather(value): lowercases/strips and maps synonyms (sun->sunny, rain->rainy)
#    - Missing weather is filled with the dataset mode (most frequent value)
#
# C) Datetime parsing
#    - scheduled_dt = pd.to_datetime(scheduled_time, errors="coerce")
#
# D) Numeric coercion + range validation
#    - passenger_count: coerced to numeric; values outside [1, 200] are set to NaN
#    - latitude/longitude: coerced to numeric; invalid ranges are set to NaN
#      * latitude must be in [-90, 90]
#      * longitude must be in [-180, 180]
#
# E) Parse actual arrival datetime
#    - _standardize_time_string(value): fixes formats like "930" -> "09:30", "04.56AM" -> "04:56AM"
#    - _parse_actual_datetime(scheduled_dt, actual_raw, cfg): parses a time-of-day and combines it with
#      the scheduled date; also handles midnight and day-crossing edge cases.
#
# F) Missing-value imputation
#    - passenger_count: filled using per-route median, then global median fallback
#    - latitude/longitude: filled using per-route median, then global median fallback

cfg = CleaningConfig()  # can tweak ranges/thresholds here if needed
cleaned = clean_raw_dataset(raw, cfg=cfg)
print("Cleaned shape:", cleaned.shape)

# Show a few key columns after cleaning
cols_preview = [
    "route_id",
    "weather",
    "scheduled_time",
    "actual_time",
    "scheduled_dt",
    "actual_dt",
    "passenger_count",
    "latitude",
    "longitude",
]
cleaned[cols_preview].head()

Cleaned shape: (300, 9)


Unnamed: 0,route_id,weather,scheduled_time,actual_time,scheduled_dt,actual_dt,passenger_count,latitude,longitude
0,R03,sunny,1/1/2025 0:00,1:22 AM,2025-01-01 00:00:00,2025-01-01 01:22:00,59.0,24.417358,32.529664
1,R04,sunny,1/1/2025 1:00,12:00 AM,2025-01-01 01:00:00,2025-01-01 00:00:00,68.0,24.643878,32.636296
2,R03,sunny,1/1/2025 2:00,12:00 AM,2025-01-01 02:00:00,2025-01-01 00:00:00,59.0,24.363132,31.186819
3,R04,cloudy,1/1/2025 3:00,12:00 AM,2025-01-01 03:00:00,2025-01-01 00:00:00,36.0,25.533071,32.537508
4,R04,cloudy,1/1/2025 4:00,04.56AM,2025-01-01 04:00:00,2025-01-01 04:56:00,76.0,23.294248,33.419211
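
The cleaning sub-steps A, B, D, E, and F listed in the cell above can be illustrated in a few lines. This is a minimal sketch under stated assumptions, not the code in `src/transport_delay/cleaning.py`: the function names, the regular expression, and the synonym table are all illustrative, and the day-crossing logic of `_parse_actual_datetime` is deliberately left out because the notebook does not describe it.

```python
import re

import numpy as np
import pandas as pd


def normalize_route_id(value, default: str = "R00") -> str:
    """Extract the first run of digits and format it as R01, R02, ..."""
    if pd.isna(value):
        return default
    match = re.search(r"\d+", str(value))
    return f"R{int(match.group()):02d}" if match else default


def normalize_weather(value):
    """Lowercase, strip, and map short synonyms to the canonical label."""
    if pd.isna(value):
        return np.nan
    synonyms = {"sun": "sunny", "rain": "rainy"}  # assumed mapping
    text = str(value).strip().lower()
    return synonyms.get(text, text)


def standardize_time_string(value: str) -> str:
    """Turn '04.56AM' into '04:56AM' and bare digits like '930' into '09:30'."""
    text = str(value).strip().replace(".", ":")
    if text.isdigit() and len(text) in (3, 4):
        text = text.zfill(4)
        text = f"{text[:2]}:{text[2:]}"
    return text


def mask_out_of_range(series: pd.Series, low: float, high: float) -> pd.Series:
    """Coerce to numeric and replace values outside [low, high] with NaN."""
    numeric = pd.to_numeric(series, errors="coerce")
    return numeric.where(numeric.between(low, high))


def fill_with_route_median(df: pd.DataFrame, column: str) -> pd.Series:
    """Fill NaN with the per-route median, then fall back to the global median."""
    per_route = df.groupby("route_id")[column].transform("median")
    return df[column].fillna(per_route).fillna(df[column].median())


demo = pd.DataFrame(
    {
        "route_id": ["3", "Route-4", "R03", "Route-4", None],
        "weather": ["SUN", " Sunny", "rain", "cloudy", None],
        "passenger_count": [250, 40, 60, "36", 80],
    }
)
demo["route_id"] = demo["route_id"].map(normalize_route_id)
demo["weather"] = demo["weather"].map(normalize_weather)
demo["weather"] = demo["weather"].fillna(demo["weather"].mode().iloc[0])
demo["passenger_count"] = mask_out_of_range(demo["passenger_count"], 1, 200)
demo["passenger_count"] = fill_with_route_median(demo, "passenger_count")
print(demo)
print(standardize_time_string("04.56AM"), standardize_time_string("930"))
```

The order matters: route IDs are normalized before imputation so that `3` and `R03` fall into the same group when the per-route median is computed. The same pattern explains the table above, where the out-of-range passenger count `250.0` and latitude `999.0` in the raw rows have been replaced by imputed values.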


In [9]:
# 4) Feature engineering (required features)
featured = add_features(cleaned)
featured = impute_delay_and_actual_time(featured)

# Confirm required engineered columns exist
required_cols = [
    "delay_minutes",
    "time_of_day",
    "day_type",
    "weather_severity",
    "route_frequency",
]
missing = [c for c in required_cols if c not in featured.columns]
print("Missing required columns:", missing)

featured[required_cols + ["route_id", "weather", "scheduled_time_iso", "actual_time_iso"]].head()

Missing required columns: []


Unnamed: 0,delay_minutes,time_of_day,day_type,weather_severity,route_frequency,route_id,weather,scheduled_time_iso,actual_time_iso
0,82.0,night,weekday,1,137,R03,sunny,2025-01-01 00:00:00,2025-01-01 01:22:00
1,-60.0,night,weekday,1,48,R04,sunny,2025-01-01 01:00:00,2025-01-01 00:00:00
2,-96.0,night,weekday,1,137,R03,sunny,2025-01-01 02:00:00,2025-01-01 00:24:00
3,-96.0,night,weekday,2,48,R04,cloudy,2025-01-01 03:00:00,2025-01-01 01:24:00
4,56.0,night,weekday,2,48,R04,cloudy,2025-01-01 04:00:00,2025-01-01 04:56:00
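
To make the five required features concrete, here is a minimal sketch of how they could be derived from the cleaned columns. It is not the implementation of `add_features`. The time-of-day cut points, the Saturday/Sunday weekend rule, the `rainy` severity value, and the reading of `route_frequency` as a per-route row count are assumptions; only `sunny = 1` and `cloudy = 2` are visible in the output above.

```python
import pandas as pd

# Assumed ordinal scale; only sunny=1 and cloudy=2 appear in the preview.
WEATHER_SEVERITY = {"sunny": 1, "cloudy": 2, "rainy": 3}


def time_of_day(hour: int) -> str:
    """Bucket an hour (0-23) into a coarse label; the cut points are assumptions."""
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 21:
        return "evening"
    return "night"


def add_features_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Derive the five required features from cleaned columns (illustrative)."""
    out = df.copy()
    # Positive delay = late arrival, negative = early arrival.
    delta = out["actual_dt"] - out["scheduled_dt"]
    out["delay_minutes"] = delta.dt.total_seconds() / 60
    out["time_of_day"] = out["scheduled_dt"].dt.hour.map(time_of_day)
    # dayofweek: Monday=0 ... Sunday=6; weekend assumed to be Saturday/Sunday.
    out["day_type"] = out["scheduled_dt"].dt.dayofweek.map(
        lambda d: "weekend" if d >= 5 else "weekday"
    )
    out["weather_severity"] = out["weather"].map(WEATHER_SEVERITY)
    # Number of rows that share each route_id.
    out["route_frequency"] = out["route_id"].map(out["route_id"].value_counts())
    return out


demo = pd.DataFrame(
    {
        "route_id": ["R03", "R04", "R04"],
        "weather": ["sunny", "sunny", "cloudy"],
        "scheduled_dt": pd.to_datetime(
            ["2025-01-01 00:00", "2025-01-01 01:00", "2025-01-01 04:00"]
        ),
        "actual_dt": pd.to_datetime(
            ["2025-01-01 01:22", "2025-01-01 00:00", "2025-01-01 04:56"]
        ),
    }
)
print(add_features_sketch(demo))
```

On these three rows the sketch yields delays of 82, -60, and 56 minutes, matching rows 0, 1, and 4 of the table above. It does not reproduce the -96 values in rows 2 and 3, which come from `impute_delay_and_actual_time` and are not covered here.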


In [6]:
# 5) Model training + evaluation (must test >= 2; we test 4)
modeling_out = train_and_evaluate(featured)
metrics = modeling_out["metrics"]
feature_importance = modeling_out["feature_importance"]

metrics

Unnamed: 0,model,MAE,MSE,RMSE,R2,CV_RMSE
3,random_forest,43.65725,2915.049392,53.991197,0.233727,60.583204
0,linear_regression,44.062291,3042.044497,55.154732,0.200344,59.879813
2,gradient_boosting,49.269921,3449.92247,58.736041,0.093126,65.228452
1,knn,45.712681,3546.236504,59.550286,0.067808,62.625015
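
For readers who want to see how a table like this comes about, the sketch below compares the same four model families with scikit-learn on synthetic data. It is not the body of `train_and_evaluate`: the split ratio, hyperparameters, 5-fold cross-validation, and the synthetic features are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the engineered features; NOT the project data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 30 + 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(scale=8, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "linear_regression": LinearRegression(),
    "knn": KNeighborsRegressor(n_neighbors=5),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

rows = []
for name, model in models.items():
    # 5-fold CV on the training split; scikit-learn reports RMSE negated.
    cv_scores = cross_val_score(
        model, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error"
    )
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    rows.append(
        {
            "model": name,
            "MAE": mean_absolute_error(y_test, pred),
            "MSE": mse,
            "RMSE": np.sqrt(mse),
            "R2": r2_score(y_test, pred),
            "CV_RMSE": -cv_scores.mean(),
        }
    )

# Best model first, as in the table above.
metrics_sketch = pd.DataFrame(rows).sort_values("RMSE")
print(metrics_sketch)
```

Two things help when reading the real table: RMSE is the square root of MSE, so the two columns always rank models identically, and `CV_RMSE` is a useful check against a lucky or unlucky test split. In the results above, random forest has the lowest test RMSE while linear regression has the lowest `CV_RMSE`, so the gap between the two is small enough that either could be preferred.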


In [7]:
# 6) Export required deliverables
# Cleaned dataset, including the engineered features (this saves `featured`, not `cleaned`)
cleaned_out_path = output_dir / "cleaned_transport_dataset.csv"
featured.to_csv(cleaned_out_path, index=False)

# Model evaluation summary (metrics)
metrics_out_path = output_dir / "model_metrics.csv"
metrics.to_csv(metrics_out_path, index=False)

# Feature importance (interpretability)
fi_out_path = output_dir / "feature_importance.csv"
feature_importance.to_csv(fi_out_path, index=False)

print("Saved:")
print("-", cleaned_out_path)
print("-", metrics_out_path)
print("-", fi_out_path)

Saved:
- C:\Users\maged\Downloads\college\SEM-5\AI\Project\outputs\cleaned_transport_dataset.csv
- C:\Users\maged\Downloads\college\SEM-5\AI\Project\outputs\model_metrics.csv
- C:\Users\maged\Downloads\college\SEM-5\AI\Project\outputs\feature_importance.csv


In [None]:
# 7) OPTIONAL: SHAP interpretability (can take time)
# This reproduces the same outputs as `run_explainability.py`:
# - outputs/shap_mean_abs.csv
# - outputs/figures/shap_summary.png

from src.transport_delay.explainability import save_shap_outputs

# Comment out the next line if you're in a hurry
save_shap_outputs(featured, output_dir)

In [None]:
# Show top feature importances (from tree-based models)
feature_importance.head(15)
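
As background on what such a table contains, the sketch below builds an importance ranking from a fitted random forest using scikit-learn's `feature_importances_` attribute. The data is synthetic, the target is constructed so that one feature dominates, and the column names are borrowed from this notebook purely for illustration. How `train_and_evaluate` actually assembles `feature_importance` is not shown in the notebook and may differ.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Illustrative column names on synthetic data; NOT the project dataset.
feature_names = ["weather_severity", "route_frequency", "passenger_count", "latitude"]
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 4)), columns=feature_names)
# The target depends strongly on the first feature and weakly on the second.
y = 20 * X["weather_severity"] + 5 * X["route_frequency"] + rng.normal(size=300)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are normalized to sum to 1 across features.
importance = (
    pd.DataFrame({"feature": feature_names, "importance": model.feature_importances_})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(importance)
```

Impurity-based importances describe how much each feature was used for splitting, not the direction or size of its effect, which is one reason the optional SHAP step is a useful complement.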

In [None]:
# Preview the cleaned + featured dataset that will be submitted
pd.read_csv(output_dir / "cleaned_transport_dataset.csv").head()