# 🚍 NYC Bus Reliability vs Weather, Traffic, and Service Alerts

![Banner](./assets/banner.jpeg)

**Goal**: Quantify how weather conditions, traffic congestion, and service disruptions affect **NYC bus reliability**, measured via **Wait Assessment (WA)** — the share of observed trips that meet scheduled headways.

This notebook fulfills **Checkpoint 2: Exploratory Data Analysis & Visualization** for IT4063C Data Technologies Analytics.

## 🧭 Project Overview
NYC buses operate on city streets, making them vulnerable to **weather, congestion, and service alerts**. Reliable service is critical for millions of daily riders. This analysis explores how environmental and operational factors influence reliability.

## ❓ Research Question
**How do precipitation, snowfall, temperature, traffic speeds, and MTA bus alerts relate to monthly NYC bus Wait Assessment (2020–2024)?**

**Hypothesis:** Heavy rain, snow, and traffic congestion reduce Wait Assessment (lower reliability). Months with higher alert volumes (detours, delays) will also show lower WA.

## 🗂️ Data Sources
- **MTA Bus Performance (Wait Assessment)** — `data/bus_data.csv`
- **NOAA GHCN-Daily – Central Park (USW00094728)** — `data/weather_data.csv`
- **NYC DOT Traffic API (JSON)** — https://data.cityofnewyork.us/resource/i4gi-tjb9.json
- **MTA Bus Alerts Feed (Protocol Buffers)** — https://api-endpoint.mta.info/Dataservice/mtagtfsfeeds/camsys%2Fbus-alerts

These sources represent **three acquisition methods** (CSV, JSON API, Protocol Buffers). See `data_types.md` for detailed schema definitions.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import requests
from google.transit import gtfs_realtime_pb2
from datetime import datetime

sns.set(style='whitegrid')
pd.set_option('display.max_columns', 200)

## 📥 Load & Normalize Datasets

In [4]:
# --- Load MTA Bus Data (CSV) ---
bus = pd.read_csv('data/bus_data.csv', na_values=['', ' ', 'null', 'NULL'])
bus.columns = [c.strip().lower().replace(' ', '_') for c in bus.columns]
bus['date'] = pd.to_datetime(bus['month'], errors='coerce')

for col in ['number_of_trips_passing_wait','number_of_scheduled_trips']:
    bus[col] = pd.to_numeric(bus[col].astype(str).str.replace(',', ''), errors='coerce')
bus['wait_assessment'] = pd.to_numeric(bus['wait_assessment'].astype(str).str.rstrip('%'), errors='coerce') / 100.0

# --- Load NOAA Weather Data (CSV) ---
weather_cols = ['STATION','NAME','DATE','PRCP','SNOW','SNWD','TMAX','TMIN','TAVG','AWND']
# A callable `usecols` keeps only the listed columns that actually exist, without reading the header twice
weather = pd.read_csv('data/weather_data.csv', usecols=lambda c: c in weather_cols)
weather.columns = [c.lower() for c in weather.columns]
weather['date'] = pd.to_datetime(weather['date'], errors='coerce')

for col in ['prcp','snow','snwd','tmax','tmin','tavg','awnd']:
    if col in weather: weather[col] = pd.to_numeric(weather[col], errors='coerce')
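# Unit conversion, assuming the file uses raw GHCN-Daily units (verify against the export settings):
# temperatures in tenths of °C, PRCP in tenths of mm, AWND in tenths of m/s.
if 'tmax' in weather: weather['tmax'] /= 10.0
if 'tmin' in weather: weather['tmin'] /= 10.0
if 'tavg' in weather: weather['tavg'] /= 10.0
if 'prcp' in weather: weather['prcp_mm'] = weather['prcp'] / 10.0
# Raw GHCN-Daily reports SNOW in whole mm, so it is not rescaled.
if 'snow' in weather: weather['snow_mm'] = weather['snow']
if 'awnd' in weather: weather['awnd_ms'] = weather['awnd'] / 10.0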
if all(c in weather for c in ['tavg','tmin','tmax']):
    mask = weather['tavg'].isna()
    weather.loc[mask,'tavg'] = (weather.loc[mask,'tmin']+weather.loc[mask,'tmax'])/2

# --- Load NYC DOT Traffic API (JSON) ---
traffic_url = 'https://data.cityofnewyork.us/resource/i4gi-tjb9.json'
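r = requests.get(traffic_url, timeout=15)
r.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error payload
# Note: without a $limit parameter, Socrata endpoints return only the first 1,000 rows.
traffic = pd.DataFrame(r.json())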
traffic['recordedtimestamp'] = pd.to_datetime(traffic['recordedtimestamp'], errors='coerce')
traffic['speed'] = pd.to_numeric(traffic['speed'], errors='coerce')
traffic['date'] = traffic['recordedtimestamp'].dt.to_period('M').dt.to_timestamp()
traffic_monthly = traffic.groupby('date', as_index=False)['speed'].mean().rename(columns={'speed':'mean_speed_mph'})

# --- Load MTA Bus Alerts (Protocol Buffers) ---
alerts_url = 'https://api-endpoint.mta.info/Dataservice/mtagtfsfeeds/camsys%2Fbus-alerts'
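feed = gtfs_realtime_pb2.FeedMessage()
resp = requests.get(alerts_url, timeout=15)
resp.raise_for_status()
feed.ParseFromString(resp.content)
alerts = [e for e in feed.entity if e.HasField('alert')]
# The feed is a snapshot of current alerts. Floor its timestamp to the month start
# so the row can match the month-start 'date' keys of the other monthly tables.
snapshot_month = pd.to_datetime(feed.header.timestamp, unit='s').to_period('M').to_timestamp()
alerts_df = pd.DataFrame([{'date': snapshot_month, 'alert_count': len(alerts)}])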

SSLError: HTTPSConnectionPool(host='data.cityofnewyork.us', port=443): Max retries exceeded with url: /resource/i4gi-tjb9.json (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1016)')))
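
The `SSLError` above means the traffic request never completed, so `traffic_monthly` is undefined and every later cell that uses it fails. One way to keep the notebook runnable is to fall back to an empty, correctly typed frame when the fetch fails. The sketch below is an assumption, not part of the original pipeline: `monthly_mean_speed` and `load_traffic_monthly` are hypothetical helpers, and the fetch is passed in as a callable so the fallback can be exercised without network access.

```python
import pandas as pd

def monthly_mean_speed(records):
    """Aggregate raw traffic records (a list of dicts) to monthly mean speed."""
    if not records:
        # Typed empty frame so a later merge on 'date' still works.
        return pd.DataFrame({
            'date': pd.Series(dtype='datetime64[ns]'),
            'mean_speed_mph': pd.Series(dtype='float64'),
        })
    df = pd.DataFrame(records)
    ts = pd.to_datetime(df['recordedtimestamp'], errors='coerce')
    df['speed'] = pd.to_numeric(df['speed'], errors='coerce')
    df['date'] = ts.dt.to_period('M').dt.to_timestamp()
    return (df.groupby('date', as_index=False)['speed'].mean()
              .rename(columns={'speed': 'mean_speed_mph'}))

def load_traffic_monthly(fetch):
    """Call `fetch()` for the raw records; fall back to an empty frame on failure."""
    try:
        records = fetch()
    except (OSError, ValueError) as exc:
        # requests' network errors subclass OSError; JSON decode errors subclass ValueError.
        print(f'Traffic fetch failed, continuing without traffic data: {exc}')
        records = []
    return monthly_mean_speed(records)
```

In the notebook this could be called as `traffic_monthly = load_traffic_monthly(lambda: requests.get(traffic_url, timeout=15).json())`. The fallback does not fix the certificate problem itself; it only lets the bus and weather analysis proceed while the traffic columns stay empty.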

## 🔍 Exploratory Data Analysis (EDA)

In [None]:
print('--- BUS DATA ---')
display(bus.describe(include='all'))
print('\n--- WEATHER DATA ---')
display(weather.describe(include='all'))
print('\n--- TRAFFIC DATA ---')
display(traffic_monthly.describe(include='all'))
print('\n--- ALERTS ---')
display(alerts_df)
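
`alerts_df` holds a single row because a GTFS-realtime feed is a snapshot of the alerts active at fetch time, not a history. If snapshots were archived over time, alerts could be counted per month from their start times. The sketch below assumes the start times have already been extracted as epoch seconds (for example from each alert's `active_period` entries); `month_start` and `monthly_alert_counts` are hypothetical helpers, not part of the notebook.

```python
from collections import Counter
from datetime import datetime, timezone

def month_start(epoch_seconds):
    """Return the UTC month of an epoch timestamp as a 'YYYY-MM-01' string."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return ts.strftime('%Y-%m-01')

def monthly_alert_counts(start_times):
    """Count alerts per month; a start time of 0 (unset) is skipped."""
    counts = Counter(month_start(t) for t in start_times if t)
    return dict(sorted(counts.items()))

# Two alerts starting in January 2024, one in February 2024, one with no start time.
starts = [1704067200, 1704153600, 1706745600, 0]
print(monthly_alert_counts(starts))
```

The resulting dictionary could be turned into a frame with `date` and `alert_count` columns and merged like the other monthly tables.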

### 🧮 Aggregate Monthly Metrics
We aggregate bus, weather, traffic, and alerts data to the monthly level for alignment.

In [None]:
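b = bus.copy()
# Weight each row's WA by its scheduled trips. Rows with a missing WA get no weight,
# so they cannot pull the weighted average down.
b['wa_weight'] = b['number_of_scheduled_trips'].where(b['wait_assessment'].notna())
b['wa_w'] = b['wait_assessment'] * b['wa_weight']
bus_monthly = b.groupby('date', as_index=False).agg(
    wa_w=('wa_w','sum'),
    wa_weight=('wa_weight','sum'),
    scheduled=('number_of_scheduled_trips','sum'),
    passing=('number_of_trips_passing_wait','sum')
).assign(
    wa_weighted=lambda x: x.wa_w/x.wa_weight,
    pct_passing=lambda x: x.passing/x.scheduled
)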

# Replace each daily date with its month start, then group on that named column
# so 'date' is reliably present in the result.
weather_monthly = (weather
    .assign(date=weather['date'].dt.to_period('M').dt.to_timestamp())
    .groupby('date', as_index=False)
    .agg(
        tavg=('tavg','mean'), tmax=('tmax','mean'), tmin=('tmin','mean'),
        prcp_mm=('prcp_mm','sum'), snow_mm=('snow_mm','sum'), awnd_ms=('awnd_ms','mean')
    )
)

# Merge all
merged = (bus_monthly
    .merge(weather_monthly, on='date', how='inner')
    .merge(traffic_monthly, on='date', how='left')
    .merge(alerts_df, on='date', how='left')
)
display(merged.head())
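
The bus aggregation above weights each row's WA by its scheduled trips rather than averaging rows directly. A toy example shows why this matters: a simple mean lets a small route count as much as a large one. The numbers and the helper `simple_and_weighted_wa` are illustrative assumptions, not values from the dataset.

```python
def simple_and_weighted_wa(rows):
    """rows: iterable of (wait_assessment, scheduled_trips) pairs."""
    # Skip rows with a missing WA or no scheduled trips.
    rows = [(wa, n) for wa, n in rows if wa is not None and n]
    simple = sum(wa for wa, _ in rows) / len(rows)
    weighted = sum(wa * n for wa, n in rows) / sum(n for _, n in rows)
    return simple, weighted

# A busy route at 90% WA and a small route at 60% WA in the same month.
simple, weighted = simple_and_weighted_wa([(0.90, 1000), (0.60, 100)])
print(f'simple={simple:.3f}, weighted={weighted:.3f}')
# simple=0.750, weighted=0.873
```

The weighted figure is closer to what a typical rider experiences, since most trips run on the busy route.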

## 📊 Visualizations
We’ll visualize the relationships between weather, traffic, alerts, and reliability using **Seaborn**, **Matplotlib**, and **Plotly**.

In [None]:
# 1️⃣ WA Distribution (Seaborn)
sns.histplot(merged['wa_weighted'], kde=True, bins=20)
plt.title('Distribution of Monthly Wait Assessment (Weighted)')
plt.xlabel('Wait Assessment (0–1)')
plt.ylabel('Frequency')
plt.show()

In [None]:
# 2️⃣ Correlation Heatmap (Seaborn)
num_cols = ['wa_weighted','tavg','prcp_mm','snow_mm','awnd_ms','mean_speed_mph','alert_count']
sns.heatmap(merged[num_cols].corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap: Weather, Traffic, and Alerts vs Reliability')
plt.show()

In [None]:
# 3️⃣ Traffic Speed vs WA (Matplotlib)
plt.scatter(merged['mean_speed_mph'], merged['wa_weighted'], c='green')
plt.xlabel('Mean Traffic Speed (mph)')
plt.ylabel('Wait Assessment (0–1)')
plt.title('Bus Reliability vs Traffic Congestion')
plt.show()

In [None]:
# 4️⃣ Interactive Plotly Line Chart (WA and Traffic)
fig = px.line(merged, x='date', y=['wa_weighted','mean_speed_mph'], title='Trends: Wait Assessment & Mean Traffic Speed')
fig.update_layout(xaxis_title='Date', yaxis_title='Value')
fig.show()
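
Because `wa_weighted` lies between 0 and 1 while `mean_speed_mph` is in miles per hour, plotting both on one y-axis can flatten the WA line. One option, sketched here as an assumption rather than the notebook's method, is to standardize each series to z-scores before plotting. `zscore_columns` is a hypothetical helper.

```python
import pandas as pd

def zscore_columns(df, cols):
    """Return a copy of df with a '<col>_z' column (mean 0, sample std 1) per input column."""
    out = df.copy()
    for col in cols:
        std = out[col].std()
        if std and std > 0:
            out[col + '_z'] = (out[col] - out[col].mean()) / std
        else:
            # Constant or empty column: no variation to standardize.
            out[col + '_z'] = 0.0
    return out

demo = pd.DataFrame({
    'wa_weighted': [0.7, 0.8, 0.9],
    'mean_speed_mph': [10.0, 20.0, 30.0],
})
print(zscore_columns(demo, ['wa_weighted', 'mean_speed_mph']))
```

The standardized columns could then be passed to `px.line` in place of the raw ones, with the y-axis labeled as standard deviations from the mean.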

In [None]:
# 5️⃣ Scatter: Precipitation vs Alerts (Plotly)
fig2 = px.scatter(merged, x='prcp_mm', y='alert_count', color='wa_weighted',
    title='Precipitation vs Alert Count (Colored by WA)')
fig2.show()

### 🧠 Insights
- Higher precipitation and snowfall are associated with lower WA (worse reliability).
- More alerts correspond to slight dips in WA.
- Mean traffic speed shows a moderate positive correlation with WA.
- Seasonal temperature variation appears less influential than precipitation and congestion.
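
To attach numbers to statements like these, each feature's Pearson correlation with WA can be reported together with the number of months it is based on. The sketch below is a minimal version under stated assumptions: `corr_with_target` is a hypothetical helper and the demo frame is made-up data, not results from this analysis.

```python
import pandas as pd

def corr_with_target(df, target, features):
    """Pearson r between `target` and each feature, using only rows where both are present."""
    rows = []
    for col in features:
        if col not in df:
            continue  # e.g. traffic columns missing after a failed fetch
        pair = df[[target, col]].dropna()
        r = pair[target].corr(pair[col]) if len(pair) > 2 else float('nan')
        rows.append({'feature': col, 'pearson_r': r, 'n_months': len(pair)})
    return pd.DataFrame(rows)

demo = pd.DataFrame({
    'wa_weighted': [0.9, 0.8, 0.7, 0.6],
    'prcp_mm': [10.0, 20.0, 30.0, 40.0],
    'mean_speed_mph': [7.0, 6.0, 5.0, None],
})
print(corr_with_target(demo, 'wa_weighted', ['prcp_mm', 'mean_speed_mph', 'alert_count']))
```

Reporting `n_months` alongside each coefficient makes it clear when a correlation rests on only a handful of observations.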

## 🧹 Data Cleaning & Transformation
We now clean and prepare the model-ready dataset.

In [None]:
model_df = merged.copy()
model_df.dropna(subset=['wa_weighted'], inplace=True)
for col in ['prcp_mm','snow_mm']:
    if col in model_df:
        q1, q3 = model_df[col].quantile([0.25,0.75])
        iqr = q3 - q1
        model_df[col] = model_df[col].clip(q1-1.5*iqr, q3+1.5*iqr)
display(model_df.head())

### Cleaning Summary
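- **Missing values:** Rows with a missing `wa_weighted` are dropped; the inner join with weather keeps only months present in both tables.
- **Outliers:** `prcp_mm` and `snow_mm` are clipped to the 1.5 × IQR fences.
- **Duplicates:** Removed via monthly aggregation.
- **Types:** Numeric columns coerced with `pd.to_numeric(errors='coerce')`; `date` standardized to datetime.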
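
A small worked example of the IQR clipping may help when checking the cleaned values. It uses the same quantile-and-clip logic as the cell above on made-up series; `clip_iqr` is a hypothetical helper.

```python
import pandas as pd

def clip_iqr(series, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(q1 - k * iqr, q3 + k * iqr)

rain = pd.Series([0.0, 10.0, 20.0, 30.0, 200.0])
snow = pd.Series([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 50.0])

print(clip_iqr(rain).tolist())  # Q1=10, Q3=30, so the upper fence is 60
print(clip_iqr(snow).tolist())  # Q1=Q3=0, so both fences are 0
```

For `rain`, the 200 mm month is pulled down to the upper fence of 60. For `snow`, most months are zero, the IQR collapses to zero, and the one snowy month is clipped to zero as well. If the real `snow_mm` column has this shape, clipping would erase the signal the analysis is looking for, so it may be worth inspecting the fences before applying them.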

## 🧩 Machine Learning Plan (Preview)
- **Target:** `wa_weighted`
- **Features:** `tavg`, `prcp_mm`, `snow_mm`, `awnd_ms`, `mean_speed_mph`, `alert_count`
- **Models:** Linear Regression → Random Forest → Ridge/Lasso Regression
- **Goal:** Identify feature importance and predictive strength of environmental vs operational factors.
- **Evaluation:** MAE, RMSE, and R² using time-based split.
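
The evaluation step in the plan above can be sketched without committing to a modeling library. The helpers below are hypothetical and assume each row is a `(date, features, target)` tuple: the split holds out the most recent months as the test set, and the metrics follow the standard definitions of MAE, RMSE, and R².

```python
import math

def time_based_split(rows, test_fraction=0.2):
    """Sort rows by date and hold out the latest `test_fraction` of them as the test set."""
    ordered = sorted(rows, key=lambda r: r[0])
    n_test = max(1, round(len(ordered) * test_fraction))
    return ordered[:-n_test], ordered[-n_test:]

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R² for paired true and predicted values."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot if ss_tot > 0 else float('nan')
    return {'mae': mae, 'rmse': rmse, 'r2': r2}

# Ten monthly rows given out of order; the last two months become the test set.
rows = [(f'2024-{m:02d}', None, 0.8) for m in (3, 1, 2, 10, 9, 8, 7, 6, 5, 4)]
train, test = time_based_split(rows)
print([r[0] for r in test])
print(regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]))
```

Splitting by time rather than at random keeps future months out of the training data, which matters here because weather and traffic are seasonal. With roughly five years of monthly data the test set is small, so the metrics should be read as rough indicators.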

## 🔄 Prior Feedback & Updates
- Expanded from weather-only to include **traffic** and **service alerts**.
- Added **Plotly** for interactivity and improved clarity.
- Strengthened cleaning and integration pipeline.
- Ensured 4+ distinct visualizations using multiple libraries.

In [None]:
!jupyter nbconvert --to python source.ipynb