# NYC Bus Reliability vs. Weather

![Banner](./assets/banner.jpeg)

**Goal**: Quantify how weather conditions (temperature, precipitation, snow) are associated with **NYC bus reliability**, measured via **Wait Assessment** (share of observed trips that meet scheduled headways).

## Topic
Bus riders and planners need reliable headways to avoid long waits and bunching. Buses operate on streets, so they are **directly exposed** to weather (rain, snow, wind) and road conditions. This project tests whether bus reliability declines during adverse weather and which conditions matter most.

## Project Question
**How do precipitation, snowfall, and temperature bands relate to monthly NYC bus Wait Assessment (2020–2024)?**

**Hypothesis**: Rain and snow correlate with **lower** Wait Assessment (worse reliability). Extremely cold or hot months show additional degradation.

## Data Sources
- **MTA Bus Performance (Wait Assessment)**: CSV file → `data/bus_data.csv`
- **NOAA GHCN-Daily, Central Park (USW00094728)**: CSV file → `data/weather_data.csv`
- **OpenWeatherMap API (current NYC weather)**: optional live API demo (JSON) to satisfy the multi-source requirement

See `data_sources.md` for field documentation and provenance.

## Setup & Imports

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import requests

pd.set_option('display.max_columns', 200)
sns.set(style='whitegrid')

## Load & Normalize Datasets
We parse the bus CSV (commas in counts, percent strings), clean the weather CSV (NOAA units), and create useful flags and categories. **All sanity checks are explicit**.

In [None]:
# =========================
# Load & Normalize Datasets
# =========================
# ---------- Load BUS ----------
bus = pd.read_csv(
    'data/bus_data.csv',
    na_values=['', ' ', 'null', 'NULL'],
    low_memory=False
)

# Normalize column names
bus.columns = [c.strip().lower().replace(' ', '_') for c in bus.columns]

expected_bus_cols = {
    'month','borough','day_type','trip_type','route_id','period',
    'number_of_trips_passing_wait','number_of_scheduled_trips','wait_assessment'
}
missing = expected_bus_cols - set(bus.columns)
assert not missing, f"Missing bus columns: {missing}"

# Parse date (these are monthly stamps like 2020-01-01)
bus['date'] = pd.to_datetime(bus['month'], errors='coerce')
assert bus['date'].notna().all(), 'Some bus dates failed to parse.'

# Clean numeric fields: remove commas from counts; strip % and convert to float in [0,1]
for col in ['number_of_trips_passing_wait','number_of_scheduled_trips']:
    bus[col] = bus[col].astype(str).str.replace(',', '', regex=False)
    bus[col] = pd.to_numeric(bus[col], errors='coerce')

bus['wait_assessment'] = (
    bus['wait_assessment']
        .astype(str)
        .str.strip()
        .str.rstrip('%')
)
bus['wait_assessment'] = pd.to_numeric(bus['wait_assessment'], errors='coerce') / 100.0

# Sanity checks
assert (bus['number_of_scheduled_trips'] >= 0).all(), 'Negative scheduled trips found.'
assert (bus['number_of_trips_passing_wait'] >= 0).all(), 'Negative passing trips found.'
assert bus['wait_assessment'].between(0,1).all(), 'Wait assessment must be in [0,1].'

# ---------- Load NOAA Weather ----------
wanted = ['STATION','NAME','DATE','PRCP','SNOW','SNWD','TMAX','TMIN','TAVG','AWND']
available_cols = pd.read_csv('data/weather_data.csv', nrows=0).columns.tolist()
usecols = [c for c in wanted if c in available_cols]

weather = pd.read_csv(
    'data/weather_data.csv',
    usecols=usecols,
    na_values=['', ' ', 'null', 'NULL'],
    low_memory=False
)

weather.columns = [c.lower() for c in weather.columns]
weather['date'] = pd.to_datetime(weather['date'], errors='coerce')

for col in ['prcp','snow','snwd','tmax','tmin','tavg','awnd']:
    if col in weather.columns:
        weather[col] = pd.to_numeric(weather[col], errors='coerce')

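# GHCN-Daily raw units: TMAX/TMIN/TAVG in tenths of °C, PRCP in tenths of mm,
# AWND in tenths of m/s. SNOW and SNWD are already whole mm, so they are not rescaled.
# Assumption: the CSV holds raw GHCN-Daily values; skip the /10 if the export is already in metric units.
if 'tmax' in weather: weather['tmax'] = weather['tmax'] / 10.0
if 'tmin' in weather: weather['tmin'] = weather['tmin'] / 10.0
if 'tavg' in weather: weather['tavg'] = weather['tavg'] / 10.0
if 'prcp' in weather: weather['prcp_mm'] = weather['prcp'] / 10.0
if 'snow' in weather: weather['snow_mm'] = weather['snow']
if 'awnd' in weather: weather['awnd_ms'] = weather['awnd'] / 10.0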

# Impute tavg if missing but tmin/tmax exist
if all(c in weather.columns for c in ['tavg','tmin','tmax']):
    na_tavg = weather['tavg'].isna() & weather['tmin'].notna() & weather['tmax'].notna()
    weather.loc[na_tavg, 'tavg'] = (weather.loc[na_tavg, 'tmin'] + weather.loc[na_tavg, 'tmax'])/2.0

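# Rain/snow day flags. The Series default keeps .fillna() valid when a source column is absent
# (a scalar default of 0 would raise AttributeError).
no_obs = pd.Series(0.0, index=weather.index)
weather['is_rain'] = weather.get('prcp_mm', no_obs).fillna(0) > 0.1
weather['is_snow'] = weather.get('snow_mm', no_obs).fillna(0) > 0.1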

def temp_band_from_tavg(t):
    if pd.isna(t): return 'unknown'
    if t < 0:     return '<0°C'
    if t < 10:    return '0–10°C'
    if t < 20:    return '10–20°C'
    if t < 30:    return '20–30°C'
    return '≥30°C'

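# Fall back to an all-NaN Series (band 'unknown') when tavg is absent; a bare np.nan has no .apply()
weather['temp_band'] = weather.get('tavg', pd.Series(np.nan, index=weather.index)).apply(temp_band_from_tavg)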

# Checks: every weather date must parse (.any() would pass with a single valid date)
assert weather['date'].notna().all(), 'Some weather dates failed to parse.'
display(bus.head())
display(weather.head())
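
The parsing rules for the bus file can be exercised on a toy frame before trusting them on the full CSV. The sketch below is illustrative only: the helper names `parse_count` and `parse_percent` and the sample values are hypothetical, not part of the notebook or the MTA data.

```python
import pandas as pd

# Hypothetical raw values that mimic the formatting described above.
raw = pd.DataFrame({
    "number_of_scheduled_trips": ["1,250", "980", " 12,004 "],
    "wait_assessment": ["78.4%", " 81% ", "n/a"],
})

def parse_count(s: pd.Series) -> pd.Series:
    """Drop thousands separators and convert to numeric (NaN on failure)."""
    cleaned = s.astype(str).str.replace(",", "", regex=False).str.strip()
    return pd.to_numeric(cleaned, errors="coerce")

def parse_percent(s: pd.Series) -> pd.Series:
    """Convert strings like '78.4%' to a fraction in [0, 1] (NaN on failure)."""
    cleaned = s.astype(str).str.strip().str.rstrip("%")
    return pd.to_numeric(cleaned, errors="coerce") / 100.0

parsed = pd.DataFrame({
    "scheduled": parse_count(raw["number_of_scheduled_trips"]),
    "wa": parse_percent(raw["wait_assessment"]),
})
print(parsed)
```

Unparseable entries such as `n/a` become `NaN`, which the range assertions in the cell above would then flag.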

## Exploratory Data Analysis (EDA)
We summarize dtypes and missingness, then aggregate to **monthly** KPIs weighted by scheduled trips to align with the bus dataset.

In [None]:
# =========================
# EDA: dtypes, summaries, nulls
# =========================
print('— Bus dtypes —')
bus.info()
print('\n— Weather dtypes —')
weather.info()

print('\n— Bus describe (numeric) —')
display(bus.select_dtypes(include=[np.number]).describe().T)
print('\nNulls in bus:')
display(bus.isnull().sum())

print('\n— Weather describe (numeric) —')
display(weather.select_dtypes(include=[np.number]).describe().T)
print('\nNulls in weather:')
display(weather.isnull().sum())

### Build Monthly Bus KPIs (Weighted by Scheduled Trips)
MTA reports Wait Assessment as a **share** for each route/period row. A simple mean of those shares would give a lightly scheduled route the same influence as a busy one, so we weight each row by **number_of_scheduled_trips** and then aggregate to monthly system-level KPIs.

In [None]:
# Weighted sums at the most granular level available
b = bus.copy()
b['wa_w'] = b['wait_assessment'] * b['number_of_scheduled_trips']

# Monthly total across all routes/periods/boroughs
bus_monthly = (
    b.groupby('date', as_index=False)
     .agg(
         wa_w=('wa_w','sum'),
         scheduled=('number_of_scheduled_trips','sum'),
         passing=('number_of_trips_passing_wait','sum')
     )
)
bus_monthly['wa_weighted'] = bus_monthly['wa_w'] / bus_monthly['scheduled']
bus_monthly['pct_passing'] = bus_monthly['passing'] / bus_monthly['scheduled']

# Sanity: wa_weighted and pct_passing in [0,1]
assert bus_monthly['wa_weighted'].between(0,1).all(), 'Monthly WA out of range.'
assert bus_monthly['pct_passing'].between(0,1).all(), 'Monthly pct_passing out of range.'

display(bus_monthly.head())
print('Monthly rows:', len(bus_monthly), '| Range:', bus_monthly['date'].min(), '→', bus_monthly['date'].max())
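
To see why the weighting matters, here is a minimal sketch with two hypothetical route rows in one month (the numbers are made up for illustration). The busy route dominates the weighted figure, while the simple mean treats both routes equally.

```python
import pandas as pd

# Hypothetical rows: one busy route and one small route in the same month.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-01"]),
    "wait_assessment": [0.80, 0.50],
    "number_of_scheduled_trips": [9000, 1000],
})

# Same recipe as the cell above: sum(WA * trips) / sum(trips).
toy["wa_w"] = toy["wait_assessment"] * toy["number_of_scheduled_trips"]
monthly = toy.groupby("date", as_index=False).agg(
    wa_w=("wa_w", "sum"),
    scheduled=("number_of_scheduled_trips", "sum"),
    wa_simple=("wait_assessment", "mean"),
)
monthly["wa_weighted"] = monthly["wa_w"] / monthly["scheduled"]
print(monthly[["date", "wa_simple", "wa_weighted"]])
```

Here the weighted value is (0.80 × 9000 + 0.50 × 1000) / 10000 = 0.77, against a simple mean of 0.65.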

### Convert Daily Weather → Monthly
We convert daily NOAA observations to **monthly** values (means for temperature and wind, totals for precipitation and snowfall) to align with bus months.

In [None]:
w = weather.copy()
w['year_month'] = w['date'].dt.to_period('M').dt.to_timestamp()

agg_map = {}
if 'tavg'   in w: agg_map['tavg']    = 'mean'
if 'tmin'   in w: agg_map['tmin']    = 'mean'
if 'tmax'   in w: agg_map['tmax']    = 'mean'
if 'prcp_mm'in w: agg_map['prcp_mm'] = 'sum'   # total precip in month
if 'snow_mm'in w: agg_map['snow_mm'] = 'sum'   # total snow in month
if 'awnd_ms'in w: agg_map['awnd_ms'] = 'mean'

weather_monthly = (
    w.groupby('year_month', as_index=False)
     .agg(agg_map)
     .rename(columns={'year_month':'date'})
)
display(weather_monthly.head())
print('Weather monthly rows:', len(weather_monthly))
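
Monthly totals are only comparable when each month has full daily coverage, since a month with missing days will understate precipitation. The following sketch, on a synthetic daily series, shows one way to count observed days per month. The `coverage` frame and its column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: January complete, February missing its last 10 days.
days = pd.date_range("2021-01-01", "2021-02-18", freq="D")
daily = pd.DataFrame({"date": days, "prcp_mm": np.ones(len(days))})

daily["year_month"] = daily["date"].dt.to_period("M").dt.to_timestamp()
coverage = daily.groupby("year_month", as_index=False).agg(
    prcp_mm=("prcp_mm", "sum"),
    n_days=("prcp_mm", "count"),  # non-null daily observations
)
coverage["days_in_month"] = coverage["year_month"].dt.days_in_month
coverage["coverage"] = coverage["n_days"] / coverage["days_in_month"]
print(coverage)
```

Months with coverage well below 1.0 could be flagged or excluded before merging; the threshold is a judgment call.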

### Merge Bus (Monthly) with Weather (Monthly)
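We use an inner join on the monthly `date` to align bus reliability with weather conditions. Months present in only one source are dropped, so the merged row count should be compared against both inputs.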

In [None]:
merged = pd.merge(bus_monthly, weather_monthly, on='date', how='inner')
merged = merged.sort_values('date')
display(merged.head())
print(merged.shape)
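
An inner join drops unmatched months silently. One way to audit this, sketched here on hypothetical monthly frames, is an outer join with `indicator=True`, which labels each row as `left_only`, `right_only`, or `both`.

```python
import pandas as pd

# Hypothetical monthly frames with partially overlapping months.
bus_m = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"]),
    "wa_weighted": [0.78, 0.76, 0.80],
})
wx_m = pd.DataFrame({
    "date": pd.to_datetime(["2020-02-01", "2020-03-01", "2020-04-01"]),
    "prcp_mm": [60.0, 95.0, 110.0],
})

# The _merge column records which side(s) each month came from.
audit = pd.merge(bus_m, wx_m, on="date", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]
print(unmatched[["date", "_merge"]])
```

In this toy case January exists only in the bus frame and April only in the weather frame, so an inner join would keep two of the four months.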

## Visualizations (≥ 4; matplotlib + seaborn)
Each chart includes a short interpretation per the rubric.

In [None]:
# 1) Distribution of monthly Wait Assessment (weighted)
sns.histplot(merged['wa_weighted'].dropna(), bins=20, kde=True)
plt.title('Distribution of Monthly Wait Assessment (Weighted)')
plt.xlabel('Wait Assessment (0–1)')
plt.ylabel('Months')
plt.show()
print('Interpretation: Most months cluster at relatively high WA, with a lower tail indicating worse reliability months.')

In [None]:
# 2) Wait Assessment over time
plt.plot(merged['date'], merged['wa_weighted'])
plt.title('Monthly Wait Assessment Over Time')
plt.xlabel('Date')
plt.ylabel('WA (0–1)')
plt.show()
print('Interpretation: Visual check for seasonality and long-run shifts in reliability.')

In [None]:
# 3) Precipitation vs Wait Assessment (scatter)
if 'prcp_mm' in merged.columns:
    plt.scatter(merged['prcp_mm'], merged['wa_weighted'])
    plt.title('Monthly Precipitation vs Wait Assessment')
    plt.xlabel('Monthly Total Precipitation (mm)')
    plt.ylabel('Wait Assessment (0–1)')
    plt.show()
    # quick correlation print
    print('Pearson corr (precip, WA):', merged[['prcp_mm','wa_weighted']].corr().iloc[0,1])
else:
    print('No precipitation column present.')
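
Pearson correlation is sensitive to a single extreme month. As a complementary check, Spearman's rank correlation can be computed as the Pearson correlation of the ranks. The data below are made up to show the difference and say nothing about the real result.

```python
import pandas as pd

# Hypothetical monthly values; the last month is an extreme-precipitation outlier.
toy = pd.DataFrame({
    "prcp_mm": [40.0, 60.0, 80.0, 100.0, 400.0],
    "wa_weighted": [0.80, 0.79, 0.78, 0.77, 0.76],
})

pearson = toy["prcp_mm"].corr(toy["wa_weighted"])
# Spearman's rho equals Pearson's r computed on ranks.
spearman = toy["prcp_mm"].rank().corr(toy["wa_weighted"].rank())
print(f"Pearson: {pearson:.3f} | Spearman: {spearman:.3f}")
```

The relationship here is perfectly monotone, so Spearman is -1, while the outlier pulls Pearson away from -1.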

In [None]:
# 4) Temperature band vs Wait Assessment (via categorical bands from tavg)
def band(t):
    if pd.isna(t): return 'unknown'
    if t < 0: return '<0°C'
    if t < 10: return '0–10°C'
    if t < 20: return '10–20°C'
    if t < 30: return '20–30°C'
    return '≥30°C'

if 'tavg' in merged.columns:
    merged['temp_band'] = merged['tavg'].apply(band)
    order = ['<0°C','0–10°C','10–20°C','20–30°C','≥30°C','unknown']
    sns.barplot(data=merged, x='temp_band', y='wa_weighted', order=[o for o in order if o in merged['temp_band'].unique()])
    plt.title('Wait Assessment by Temperature Band')
    plt.xlabel('Avg Monthly Temp Band')
    plt.ylabel('WA (0–1)')
    plt.show()
    print('Interpretation: Compare bands to see whether colder/hotter months systematically reduce WA.')
else:
    print('No tavg column present.')
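
The band function is defined twice in this notebook (daily and monthly). A vectorized alternative, sketched below under the assumption that the same edges apply, is `pd.cut` with left-closed bins, which also yields an ordered categorical for plotting.

```python
import numpy as np
import pandas as pd

labels = ["<0°C", "0–10°C", "10–20°C", "20–30°C", "≥30°C"]
edges = [-np.inf, 0, 10, 20, 30, np.inf]

temps = pd.Series([-5.0, 0.0, 9.9, 10.0, 25.0, 30.0, np.nan])
# right=False closes each bin on the left, matching the `t < edge` tests above.
bands = pd.cut(temps, bins=edges, labels=labels, right=False)
# Missing temperatures get an explicit 'unknown' category.
bands = bands.cat.add_categories("unknown").fillna("unknown")
print(bands.tolist())
```

Boundary values land in the upper band (0.0 in `0–10°C`, 30.0 in `≥30°C`), the same as the `if` chain.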

In [None]:
# 5) Correlation heatmap
num_cols = [c for c in ['wa_weighted','pct_passing','tavg','tmin','tmax','prcp_mm','snow_mm','awnd_ms'] if c in merged.columns]
corr = merged[num_cols].corr()
sns.heatmap(corr, annot=True, fmt='.2f', linewidths=0.5)
plt.title('Correlation Heatmap: Weather vs Bus Reliability')
plt.show()
print('Interpretation: Expect negative correlation between precip/snow and WA; wind/temperature effects may be weaker but present.')

## Data Issues Noted So Far
- **Missing values**: NOAA `tavg` is occasionally missing and is imputed as `(tmin+tmax)/2`. Bus WA arrives as percent strings and counts contain thousands separators; both are parsed to numeric.
- **Outliers**: High precipitation/snow months may depress WA; we keep these months rather than dropping them and rely on robust statistics.
- **Types**: All dates normalized to month start; NOAA values rescaled from tenths to whole metric units where applicable; WA constrained to [0,1].
- **Duplicates**: Each month has many route/period rows; the weighted aggregation collapses them to one row per month.

## Data Cleaning & Transformation (Applied)
We already performed conversions, imputations, and monthly alignment above. Here we assemble a **model-ready** table and add simple caps for extreme outliers (optional).

In [None]:
# Model-ready frame
model_df = merged[['date','wa_weighted','pct_passing']].copy()
for col in ['tavg','tmin','tmax','prcp_mm','snow_mm','awnd_ms']:
    if col in merged.columns:
        model_df[col] = merged[col]

# Optional: cap extreme weather to reduce undue leverage in simple models
for col in ['prcp_mm','snow_mm']:
    if col in model_df:
        q1, q3 = model_df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - 1.5*iqr, q3 + 1.5*iqr
        model_df[col] = model_df[col].clip(lower, upper)

display(model_df.head())
assert model_df['wa_weighted'].between(0,1).all(), 'WA out of range after processing.'
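
The capping rule can be wrapped in a small helper so that the sensitivity analysis can switch it on and off. This is a sketch only: `cap_iqr` is a hypothetical name and the snowfall values are invented.

```python
import pandas as pd

def cap_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those fences."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Invented monthly snowfall totals with one extreme month.
snow = pd.Series([0.0, 0.0, 10.0, 20.0, 30.0, 40.0, 500.0])
capped = cap_iqr(snow)
n_capped = int((capped != snow).sum())
print(capped.tolist(), "| values capped:", n_capped)
```

With Q1 = 5 and Q3 = 35 the upper fence is 80, so only the 500 mm month is changed. Reporting `n_capped` alongside model results makes the effect of capping visible.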

### Cleaning Rationale (Write-up)
- **Parsing**: Converted percent strings and comma-formatted counts to numeric; enforced WA in [0,1].
- **Alignment**: Aggregated route/period records to **monthly** weighted WA using scheduled trips as weights; converted daily NOAA to monthly to match.
- **Imputation**: Filled missing `tavg` with `(tmin+tmax)/2` when both are present; avoided imputing precip/snow (true zeros are informative).
- **Outliers**: Extreme precip/snow months capped in `model_df` for modeling stability; uncapped values remain in `merged` for transparency.
- **Sanity checks**: Assertions on ranges and parsed dates to fail fast if inputs change.

## Optional Third Source: OpenWeatherMap (Live Snapshot)
Small demo of API ingestion to satisfy the “3 sources / 2 types” requirement. Set the env var `OWM_API_KEY` locally. If the key is missing the cell skips the call, and if the request fails it prints the error and continues.

In [None]:
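API_KEY = os.getenv('OWM_API_KEY')
CITY = 'New York'
if API_KEY:
    # HTTPS keeps the API key out of plaintext traffic; params handles URL encoding of the city name
    url = 'https://api.openweathermap.org/data/2.5/weather'
    params = {'q': CITY, 'appid': API_KEY, 'units': 'metric'}
    try:
        r = requests.get(url, params=params, timeout=8)
        r.raise_for_status()  # surface HTTP errors (e.g., 401 for an invalid key)
        d = r.json()
        df_api = pd.DataFrame([{
            'city': d.get('name'),
            # timezone-aware UTC timestamp (datetime.utcfromtimestamp is deprecated in Python 3.12+)
            'date_time_utc': pd.to_datetime(d.get('dt', 0), unit='s', utc=True),
            'temp_c': d.get('main',{}).get('temp'),
            'humidity_pct': d.get('main',{}).get('humidity'),
            'wind_mps': d.get('wind',{}).get('speed'),
            'condition': (d.get('weather') or [{}])[0].get('description')
        }])
        display(df_api)
    except (requests.RequestException, ValueError) as e:
        print('OpenWeatherMap call failed:', e)
else:
    print('No OWM_API_KEY set; skipping live API demo.')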
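
Because the live call depends on a key and a network connection, the JSON flattening can be checked offline by pulling it into a function. In this sketch `flatten_owm` is a hypothetical helper, and the payload is hand-written to mirror the fields the cell reads. It is not a real API response.

```python
import pandas as pd

def flatten_owm(d: dict) -> dict:
    """Flatten the fields used above; missing keys become None."""
    return {
        "city": d.get("name"),
        "date_time_utc": pd.to_datetime(d.get("dt", 0), unit="s", utc=True),
        "temp_c": d.get("main", {}).get("temp"),
        "humidity_pct": d.get("main", {}).get("humidity"),
        "wind_mps": d.get("wind", {}).get("speed"),
        "condition": (d.get("weather") or [{}])[0].get("description"),
    }

# Hand-written payload for illustration only.
sample = {
    "name": "New York",
    "dt": 1700000000,
    "main": {"temp": 8.5, "humidity": 60},
    "wind": {"speed": 4.1},
    "weather": [{"description": "light rain"}],
}
row = flatten_owm(sample)
partial = flatten_owm({})  # an empty payload should not raise
print(pd.DataFrame([row, partial]))
```

The empty-payload case shows that absent sections degrade to `None` rather than raising a `KeyError`.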

## Prior Feedback & Updates
- Scope narrowed to **bus reliability** (more weather-exposed than subway). Data sources finalized (bus CSV + NOAA CSV + optional OWM API).
- Clear monthly alignment and weighted WA calculation documented.
- Added sanity checks and explicit cleaning rationale to aid peer review.

## Machine Learning Plan (Preview)
- **Target**: `wa_weighted` (monthly). Secondary: `pct_passing`.
- **Features**: `tavg`, `tmin`, `tmax`, `prcp_mm`, `snow_mm`, `awnd_ms` + month index/season dummies.
- **Models**: Linear regression baseline; tree ensembles (Random Forest/GBM) for nonlinearities; regularized linear (Ridge/Lasso) to manage collinearity.
- **Challenges**: Limited monthly sample size; potential confounders (service changes); multicollinearity across temperature metrics; seasonality.
- **Next steps**: Time-series split; MAE/RMSE; SHAP/feature importance; sensitivity analysis with/without outlier capping.
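
The time-ordered evaluation in the plan can be sketched as follows. Everything here is an assumption for illustration: the data are synthetic, the 75/25 split is arbitrary, and the baseline is fitted with NumPy least squares rather than any particular modeling library. scikit-learn's `TimeSeriesSplit` would be one option for rolling folds.

```python
import numpy as np
import pandas as pd

# Synthetic monthly data standing in for model_df (not real results).
rng = np.random.default_rng(0)
n = 48
df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=n, freq="MS"),
    "prcp_mm": rng.uniform(20, 150, n),
    "tavg": rng.uniform(-2, 27, n),
})
df["wa_weighted"] = 0.80 - 0.0002 * df["prcp_mm"] + rng.normal(0, 0.005, n)

# Chronological split: train on the earliest 75%, test on the most recent 25%.
split = int(n * 0.75)
train, test = df.iloc[:split], df.iloc[split:]

def design(frame: pd.DataFrame) -> np.ndarray:
    """Design matrix with an intercept column plus the weather features."""
    X = frame[["prcp_mm", "tavg"]].to_numpy()
    return np.column_stack([np.ones(len(frame)), X])

# Ordinary least squares baseline.
coef, *_ = np.linalg.lstsq(design(train), train["wa_weighted"].to_numpy(), rcond=None)
pred = design(test) @ coef

err = test["wa_weighted"].to_numpy() - pred
mae = float(np.mean(np.abs(err)))
rmse = float(np.sqrt(np.mean(err ** 2)))
print(f"MAE: {mae:.4f} | RMSE: {rmse:.4f}")
```

Splitting by time rather than at random keeps future months out of the training set, which matters when reliability trends or seasonality are present.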

## Conclusions (So Far)
- Monthly WA varies across months, with a lower tail of poor-reliability months.
- The scatter suggests **precipitation** is associated with **lower** Wait Assessment (a negative correlation is expected); with a limited monthly sample this is an association, not a causal estimate.
- The temperature band comparison points to weaker but plausible effects (extreme cold/hot bands underperform).
- Cleaning choices (weighted monthly WA, NOAA unit conversions, `tavg` imputation) provide a robust base for modeling.

## Convert Notebook to Python Script
Run this cell before submission to generate `source.py` for peer review.

In [None]:
!jupyter nbconvert --to python source.ipynb