# Public Transit Reliability vs. Weather

![Banner](./assets/banner.jpeg)

**Goal**: Quantify how weather conditions (temperature, precipitation, snow) are associated with NYC subway reliability (additional platform time, additional train time, on-time performance).

## Topic
Public transit riders and city planners need reliable service. Delays often rise during adverse weather (rain, snow, extreme temperatures). This project analyzes whether and how weather relates to NYC subway reliability, and which conditions are most impactful.

## Project Question
**How do different weather conditions (precipitation, snowfall, temperature extremes) affect average subway delays and reliability in New York City (2020–2024)?**

**Hypothesis**: Delays increase on rainy and snowy days; extreme cold exacerbates platform wait time and travel time.

## Data Sources
- **MTA Subway Customer Journey-Focused Metrics (2020–2024)**: CSV file, saved as `data/subway_data.csv`
- **NOAA GHCN-Daily, Central Park Station (USW00094728)**: CSV file, saved as `data/weather_data.csv`
- **OpenWeatherMap API (Current Weather for NYC)**: JSON fetched live in the notebook (optionally cached to `data/weather_api_data.csv`)

See `data_sources.md` for full documentation of fields and sources.

## Setup & Imports

In [None]:
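import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import requests
from IPython.display import display  # explicit import so display() also resolves in the exported script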

pd.set_option('display.max_columns', 100)
sns.set(style='whitegrid')

## Load Data
We load the two CSVs placed in `data/`. Column names in the subway file contain spaces, so we standardize them to snake_case. Raw GHCN-Daily files store temperature, precipitation, and wind speed in tenths of a unit (e.g., `TMAX` is in tenths of °C), so we convert those columns to standard units.

In [None]:
# --- Load Subway ---
subway_raw = pd.read_csv('data/subway_data.csv')
subway = subway_raw.copy()

# Standardize column names
subway.columns = [c.strip().lower().replace(' ', '_') for c in subway.columns]

# Ensure expected columns exist
expected_subway_cols = {
    'month','division','line','period','num_passengers','additional_platform_time',
    'additional_train_time','total_apt','total_att','over_five_mins','over_five_mins_perc',
    'customer_journey_time_performance'
}
missing_subway = expected_subway_cols - set(subway.columns)
assert not missing_subway, f'Missing subway columns: {missing_subway}'

# Parse date
subway['date'] = pd.to_datetime(subway['month'])

# --- Load NOAA Weather ---
weather_raw = pd.read_csv('data/weather_data.csv')
weather = weather_raw.copy()

# Keep core columns if present
keep_cols = [c for c in ['STATION','NAME','DATE','PRCP','SNOW','SNWD','TMAX','TMIN','TAVG','AWND'] if c in weather.columns]
weather = weather[keep_cols]

# Normalize colnames
weather.columns = [c.lower() for c in weather.columns]
weather['date'] = pd.to_datetime(weather['date'])

# Convert NOAA tenths-based units to standard units where applicable
for col in ['tmax','tmin','tavg']:
    if col in weather.columns:
        weather[col] = weather[col] / 10.0  # °C
if 'prcp' in weather.columns:
    weather['prcp_mm'] = weather['prcp'] / 10.0  # mm
if 'snow' in weather.columns:
    # GHCN-Daily reports SNOW in whole millimetres (not tenths), so no rescaling is needed
    weather['snow_mm'] = weather['snow']
if 'awnd' in weather.columns:
    weather['awnd_ms'] = weather['awnd'] / 10.0  # m/s

# Derive helpful flags/categories
weather['is_rain'] = weather.get('prcp_mm', 0) > 0.1
weather['is_snow'] = weather.get('snow_mm', 0) > 0.1
def temp_band(tavg_c):
    if pd.isna(tavg_c):
        return 'unknown'
    if tavg_c < 0:      return '<0°C'
    if tavg_c < 10:     return '0–10°C'
    if tavg_c < 20:     return '10–20°C'
    if tavg_c < 30:     return '20–30°C'
    return '≥30°C'
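# Guard against a missing TAVG column (calling .apply on a scalar default would raise AttributeError)
if 'tavg' in weather.columns:
    weather['temp_band'] = weather['tavg'].apply(temp_band)
else:
    weather['temp_band'] = 'unknown'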

subway.head(), weather.head()
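
The unit handling above can be checked on a toy frame before trusting it on the real file. The sketch below assumes the raw GHCN-Daily convention for temperature and precipitation (tenths of °C and tenths of mm); the helper name `convert_ghcn_units` and the sample values are hypothetical. If the CSV was exported with units already converted, this rescaling would not apply.

```python
import pandas as pd

# Hypothetical raw rows in GHCN-Daily style: TMAX/TMIN in tenths of °C, PRCP in tenths of mm.
toy_raw = pd.DataFrame({
    "DATE": ["2022-01-15", "2022-07-20"],
    "TMAX": [-21, 333],
    "TMIN": [-93, 244],
    "PRCP": [0, 127],
})

def convert_ghcn_units(df):
    """Return a copy with lower-case names and tenths-based columns rescaled."""
    out = df.rename(columns=str.lower).copy()
    out["date"] = pd.to_datetime(out["date"])
    for col in ("tmax", "tmin"):
        out[col] = out[col] / 10.0          # tenths of °C -> °C
    out["prcp_mm"] = out["prcp"] / 10.0     # tenths of mm -> mm
    out["is_rain"] = out["prcp_mm"] > 0.1   # same threshold as the cell above
    return out

toy = convert_ghcn_units(toy_raw)
print(toy[["date", "tmax", "tmin", "prcp_mm", "is_rain"]])
```

A quick look at the converted ranges (for example, summer `tmax` in the low 30s rather than the 300s) is usually enough to tell whether the source file needed rescaling.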

## Exploratory Data Analysis (EDA)
We start with descriptive statistics, distributions, correlations, and data quality checks. For the merge, we'll aggregate subway line/period data to **daily** with **passenger-weighted** averages to reflect rider impact.

In [None]:
# Dtypes overview (explicit per rubric)
subway.info()
weather.info()

# --- Subway Stats ---
display(subway.describe(include='all'))
print('Nulls in subway:')
display(subway.isnull().sum())

# --- Weather Stats ---
display(weather.describe(include='all'))
print('Nulls in weather:')
display(weather.isnull().sum())

# Create daily subway aggregates (weighted by num_passengers)
def wavg(group, value_col, weight_col='num_passengers'):
    g = group.dropna(subset=[value_col, weight_col])
    # np.average raises ZeroDivisionError when the weights sum to zero
    if g.empty or g[weight_col].sum() == 0:
        return np.nan
    return np.average(g[value_col], weights=g[weight_col])

agg = subway.groupby('date').apply(lambda g: pd.Series({
    'w_additional_platform_time': wavg(g, 'additional_platform_time'),
    'w_additional_train_time': wavg(g, 'additional_train_time'),
    'w_over5min_pct': wavg(g, 'over_five_mins_perc'),
    'w_cjtp': wavg(g, 'customer_journey_time_performance'),
    'total_passengers': g['num_passengers'].sum()
})).reset_index()

agg.head(), agg.describe()
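
To see why passenger weighting matters, here is a minimal sketch on made-up rows: two lines report for the same date, one carrying nine times the riders of the other. The data and the helper name `weighted_mean` are hypothetical; the logic mirrors `wavg` above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: two lines reporting for the same date, one much busier.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-01", "2023-02-01"]),
    "line": ["A", "G", "A"],
    "num_passengers": [900_000, 100_000, 500_000],
    "additional_platform_time": [1.0, 3.0, 2.0],
})

def weighted_mean(group, value_col, weight_col="num_passengers"):
    """Passenger-weighted mean; NaN when there is nothing to weight."""
    g = group.dropna(subset=[value_col, weight_col])
    if g.empty or g[weight_col].sum() == 0:
        return np.nan
    return np.average(g[value_col], weights=g[weight_col])

# One weighted value per date, next to the plain (unweighted) mean.
weighted = {
    date: weighted_mean(g, "additional_platform_time")
    for date, g in toy.groupby("date")
}
simple = toy.groupby("date")["additional_platform_time"].mean()

jan = pd.Timestamp("2023-01-01")
print(weighted[jan], simple[jan])
```

For the first date the weighted figure is 1.2 minutes against an unweighted 2.0, because the busier line dominates what a typical rider experienced.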

### Merge Subway (daily) with Weather (daily)
We perform an inner join on date to align reliability metrics with daily weather conditions.

In [None]:
merged = pd.merge(agg, weather, on='date', how='inner')
merged.sort_values('date', inplace=True)
merged.head(), merged.shape
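
One thing worth verifying after this merge is the row count. The subway file is keyed on a `month` column, so if those values are month-start dates (an assumption to check against the data), an inner join on `date` keeps only the weather from the first day of each month. A possible alternative is to summarize weather per calendar month before merging. The sketch below uses toy data; names such as `weather_monthly` and `rain_days` are hypothetical.

```python
import pandas as pd

# Hypothetical daily weather covering two months.
days = pd.date_range("2023-01-01", "2023-02-28", freq="D")
weather_daily = pd.DataFrame({
    "date": days,
    "tavg": [0.0] * 31 + [4.0] * 28,
    "prcp_mm": [1.0 if d.day <= 10 else 0.0 for d in days],
})

# Hypothetical monthly subway metrics keyed on month-start dates.
subway_monthly = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-02-01"]),
    "w_additional_platform_time": [1.4, 1.1],
})

# Collapse weather to one row per calendar month, keyed on the month start.
weather_daily["month_start"] = weather_daily["date"].dt.to_period("M").dt.to_timestamp()
weather_monthly = (
    weather_daily.groupby("month_start")
    .agg(
        tavg_mean=("tavg", "mean"),
        prcp_total_mm=("prcp_mm", "sum"),
        rain_days=("prcp_mm", lambda s: int((s > 0.1).sum())),
    )
    .reset_index()
    .rename(columns={"month_start": "date"})
)

# validate= raises if either side has duplicate keys, which catches silent fan-out.
merged_monthly = subway_monthly.merge(
    weather_monthly, on="date", how="inner", validate="one_to_one"
)
print(merged_monthly)
```

Counting rainy days per month keeps some of the day-level signal that a monthly mean would smooth away.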

### Visualizations (≥ 4 and using ≥ 2 libraries: Matplotlib + Seaborn)
Each plot is followed by a brief interpretation.

In [None]:
# 1) Distribution of weighted additional platform time
sns.histplot(merged['w_additional_platform_time'].dropna(), bins=30, kde=True)
plt.title('Distribution of Additional Platform Time (Weighted)')
plt.xlabel('Minutes')
plt.ylabel('Count')
plt.show()

**What this shows:** Right-skewed distribution; most days have low additional platform time with a long tail of high-delay days (likely storms).

**Why it matters:** Confirms the need for robust statistics/outlier handling and motivates exploring weather as a driver of extreme days.

In [None]:
# 2) Average platform time by temperature band
order = ['<0°C','0–10°C','10–20°C','20–30°C','≥30°C','unknown']
sns.barplot(data=merged, x='temp_band', y='w_additional_platform_time', order=[b for b in order if b in merged['temp_band'].unique()])
plt.title('Additional Platform Time by Temperature Band')
plt.xlabel('Average Daily Temp Band (°C)')
plt.ylabel('Minutes')
plt.show()

**What this shows:** Colder temperature bands tend to have higher platform time.

**Why it matters:** Suggests cold-related slowdowns (equipment, boarding) that align with the hypothesis.

In [None]:
# 3) Precipitation vs platform time (scatter)
if 'prcp_mm' in merged.columns:
    plt.scatter(merged['prcp_mm'], merged['w_additional_platform_time'])
    plt.title('Precipitation (mm) vs Additional Platform Time')
    plt.xlabel('Daily Precipitation (mm)')
    plt.ylabel('Additional Platform Time (min)')
    plt.show()

**What this shows:** A positive association—rainier days tend to have higher platform time.

**Why it matters:** Supports weather as a factor and motivates correlation/regression analysis.
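
Because precipitation is zero-inflated and skewed, a rank-based correlation can be a useful companion to Pearson's r when quantifying this association. The sketch below uses made-up values and computes Spearman's coefficient as the Pearson correlation of ranks, which avoids any extra dependency.

```python
import pandas as pd

# Hypothetical values; one very wet day dominates the linear (Pearson) estimate.
toy = pd.DataFrame({
    "prcp_mm": [0.0, 0.0, 2.0, 5.0, 12.0, 60.0],
    "w_additional_platform_time": [1.0, 1.1, 1.2, 1.3, 1.5, 1.6],
})

x, y = toy["prcp_mm"], toy["w_additional_platform_time"]
pearson = x.corr(y)                  # linear association
spearman = x.rank().corr(y.rank())   # Pearson on average ranks = Spearman's rho
print(round(pearson, 2), round(spearman, 2))
```

A large gap between the two coefficients suggests the relationship is monotonic but not linear, or that a few extreme points are driving the linear fit.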

In [None]:
# 4) Correlation heatmap (numeric columns)
num_cols = ['w_additional_platform_time','w_additional_train_time','w_over5min_pct','w_cjtp','total_passengers','tmax','tmin','tavg','prcp_mm','snow_mm','awnd_ms']
num_cols = [c for c in num_cols if c in merged.columns]
corr = merged[num_cols].corr()
sns.heatmap(corr, annot=True, fmt='.2f', linewidths=0.5)
plt.title('Correlation Heatmap: Weather vs Reliability Metrics')
plt.show()

**What this shows:** Strength and direction of relationships between weather and reliability metrics.

**Why it matters:** Expect PRCP/SNOW to correlate positively with delay metrics and negatively with `w_cjtp` (on-time performance).

In [None]:
# 5) Time series: Additional platform time across the period
plt.plot(merged['date'], merged['w_additional_platform_time'])
plt.title('Additional Platform Time Over Time (Weighted)')
plt.xlabel('Date')
plt.ylabel('Minutes')
plt.show()

**What this shows:** Seasonal peaks in delay that likely align with winter storms or heavy rain periods.

**Why it matters:** Time-context for weather impacts; suggests adding seasonal controls in modeling.
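
One simple way to add seasonal controls is to encode the calendar month cyclically, so that December and January sit next to each other instead of eleven units apart. This is a sketch on a toy frame; the column names `month_sin`, `month_cos`, and `is_winter` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one row per month.
toy = pd.DataFrame({"date": pd.date_range("2022-01-01", periods=12, freq="MS")})

month = toy["date"].dt.month
# Map month 1..12 onto a circle so the encoding wraps around at year end.
toy["month_sin"] = np.sin(2 * np.pi * month / 12)
toy["month_cos"] = np.cos(2 * np.pi * month / 12)
# A coarse alternative: a single winter indicator (Dec, Jan, Feb).
toy["is_winter"] = month.isin([12, 1, 2])
print(toy.head())
```

The sine/cosine pair suits linear models; tree-based models can often work with the raw month number or the indicator alone.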

### Data Issues Noted So Far
- **Missing values**: NOAA `tavg` sometimes missing; precip/snow may be zero-inflated. Subway metrics occasionally missing.
- **Outliers**: High-delay days (storms) create long tails.
- **Types**: Dates parsed; numeric columns confirmed; NOAA units adjusted.
- **Duplicates**: Possible duplicates across line/period; handled by daily aggregation step.

## Data Cleaning & Transformation
We address nulls, duplicates, outliers, and ensure types are correct. We also create a compact modeling frame.

In [None]:
# Handle missing values: fill TAVG from (TMIN+TMAX)/2 when missing (or when the column is absent)
if 'tmin' in merged.columns and 'tmax' in merged.columns:
    midrange = (merged['tmin'] + merged['tmax']) / 2
    merged['tavg'] = merged['tavg'].fillna(midrange) if 'tavg' in merged.columns else midrange
    # Recompute bands so newly filled days no longer show as 'unknown'
    merged['temp_band'] = merged['tavg'].apply(temp_band)

# Remove duplicates by date (if any)
merged = merged.drop_duplicates(subset=['date'])

# Outlier handling on platform/train time via IQR (cap extremes rather than drop)
for col in ['w_additional_platform_time','w_additional_train_time']:
    if col in merged.columns:
        q1, q3 = merged[col].quantile(0.25), merged[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - 1.5*iqr, q3 + 1.5*iqr
        merged[col] = merged[col].clip(lower, upper)

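# Final modeling frame (keep only columns that exist, matching the guarded checks above)
model_cols = ['date','w_additional_platform_time','w_additional_train_time','w_over5min_pct','w_cjtp',
              'tavg','tmin','tmax','prcp_mm','snow_mm','awnd_ms','is_rain','is_snow','temp_band']
model_df = merged[[c for c in model_cols if c in merged.columns]].copy()
model_df.head()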

### Cleaning Rationale (Write-up)
- **Missing values**: Imputed `tavg` from `(tmin+tmax)/2` when absent to avoid losing days.
- **Duplicates**: Removed by `date` after daily aggregation to ensure 1 row/day.
- **Outliers**: Capped extreme delay values via IQR to reduce distortion in correlations/visuals while keeping information.
- **Types & Units**: Ensured datetime parsing; converted NOAA tenths to standard units (°C, mm); created rain/snow flags and temperature bands for categorical analysis.
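
The two cleaning rules above (midrange imputation and IQR capping) are easy to verify on small inputs. The values and the helper name `iqr_clip` below are hypothetical and only illustrate the mechanics.

```python
import pandas as pd

def iqr_clip(s, k=1.5):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] instead of dropping them."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# One extreme value: Q1 = 2, Q3 = 4, so the upper fence is 4 + 1.5 * 2 = 7.
delays = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
capped = iqr_clip(delays)
print(capped.tolist())  # [1.0, 2.0, 3.0, 4.0, 7.0]

# Midrange imputation: fill a missing tavg with (tmin + tmax) / 2.
temps = pd.DataFrame({"tmin": [-2.0, 10.0], "tmax": [4.0, 20.0], "tavg": [None, 14.0]})
temps["tavg"] = temps["tavg"].fillna((temps["tmin"] + temps["tmax"]) / 2)
print(temps["tavg"].tolist())  # [1.0, 14.0]
```

Capping keeps the row count intact, at the cost of understating how bad the worst periods were; that trade-off is worth revisiting if extreme events are the main interest.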

## Third Data Source: OpenWeatherMap API (Live NYC Weather)
This demonstrates dynamic API ingestion alongside the historical CSVs. The key is read from the `OWM_API_KEY` environment variable; if it is not set, the fetch is skipped so that no secret is stored in the notebook.

In [None]:
# Read the key from the environment only; never hard-code or commit secrets
API_KEY = os.getenv('OWM_API_KEY')
CITY = 'New York'
OWM_URL = 'https://api.openweathermap.org/data/2.5/weather'

df_api = None
try:
    if not API_KEY:
        raise RuntimeError('OWM_API_KEY is not set')
    # Pass query parameters via params= so they are URL-encoded correctly
    resp = requests.get(OWM_URL, params={'q': CITY, 'appid': API_KEY, 'units': 'metric'}, timeout=10)
    resp.raise_for_status()  # surface HTTP errors instead of parsing an error payload
    data = resp.json()
    api_row = {
        'city': data.get('name'),
        'date_time_utc': pd.to_datetime(data.get('dt'), unit='s', utc=True),
        'temp_c': data.get('main',{}).get('temp'),
        'feels_like_c': data.get('main',{}).get('feels_like'),
        'humidity_pct': data.get('main',{}).get('humidity'),
        'wind_mps': data.get('wind',{}).get('speed'),
        'condition': (data.get('weather') or [{}])[0].get('description')
    }
    df_api = pd.DataFrame([api_row])
    display(df_api)
    # Optional cache
    # df_api.to_csv('data/weather_api_data.csv', index=False)
except Exception as e:
    print('API fetch skipped or failed:', e)
    # If cached earlier: df_api = pd.read_csv('data/weather_api_data.csv')
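
The flattening step can also be pulled into a pure function so it can be exercised without a network call or an API key. This is a sketch: the function name `flatten_owm_payload` is hypothetical, and the sample payload is made up to match only the fields the cell above reads.

```python
import pandas as pd

def flatten_owm_payload(data):
    """Flatten a current-weather JSON payload into a one-row dict."""
    main = data.get("main") or {}
    wind = data.get("wind") or {}
    conditions = data.get("weather") or [{}]
    return {
        "city": data.get("name"),
        "date_time_utc": pd.to_datetime(data.get("dt"), unit="s", utc=True),
        "temp_c": main.get("temp"),
        "feels_like_c": main.get("feels_like"),
        "humidity_pct": main.get("humidity"),
        "wind_mps": wind.get("speed"),
        "condition": conditions[0].get("description"),
    }

# Hypothetical payload; the values are invented for illustration.
sample = {
    "name": "New York",
    "dt": 1700000000,
    "main": {"temp": 8.5, "feels_like": 6.1, "humidity": 55},
    "wind": {"speed": 4.2},
    "weather": [{"description": "light rain"}],
}

row = flatten_owm_payload(sample)
df_sample = pd.DataFrame([row])
print(df_sample.T)
```

Keeping the parsing separate from the request makes it straightforward to replay a cached response when the API is unavailable.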

### Quick Comparison: Current vs Historical Average
A tiny demo chart that contrasts current API temperature to historical average from NOAA.

In [None]:
# Check preconditions explicitly rather than relying on a broad except
if 'tavg' not in weather.columns:
    print('No TAVG in weather; skipping comparison.')
elif df_api is None or pd.isna(df_api['temp_c'].iloc[0]):
    print('Skipping comparison bar chart (no API data in this run).')
else:
    hist_avg_c = weather['tavg'].mean()
    current_c = float(df_api['temp_c'].iloc[0])
    plt.bar(['Historical Avg (°C)','Today (°C)'], [hist_avg_c, current_c])
    plt.title('NYC Temperature: Current (API) vs Historical Avg (NOAA)')
    plt.ylabel('°C')
    plt.show()

## Prior Feedback & Updates
- No formal peer feedback on Checkpoint 1; refined scope to NYC only and finalized data sources (MTA Subway CSV, NOAA GHCN-Daily CSV, OpenWeatherMap API).
- Clarified merge keys and created daily, rider-weighted reliability metrics for stronger interpretability.

## Machine Learning Plan (Preview)
- **Task**: Supervised regression to predict daily subway reliability metrics (e.g., weighted additional platform time) from weather features: `tavg`, `tmin`, `tmax`, `prcp_mm`, `snow_mm`, `awnd_ms`, plus categorical flags (`is_rain`, `is_snow`, `temp_band`).
- **Models**: Linear Regression baseline; Random Forest/Gradient Boosting for nonlinearity. Optionally regularized linear (Lasso/Ridge) for feature shrinkage.
- **Challenges**: Missing weather on some days; multicollinearity (tmin/tmax/tavg); seasonality and non-stationarity; potential confounders (service changes).
- **Next Steps**: Train/validation split by time; feature scaling where needed; evaluate MAE/RMSE; SHAP/feature importance to interpret weather impacts.
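
The time-based split and error metrics from the plan can be sketched as follows. Everything here is illustrative: the data are synthetic, the 75/25 cut is an arbitrary choice, and the least-squares fit via `np.linalg.lstsq` stands in for a library linear-regression baseline.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 48  # hypothetical: 48 time-ordered observations

toy = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=n, freq="MS"),
    "tavg": rng.normal(12, 8, n),
    "prcp_mm": rng.gamma(2.0, 40.0, n),
})
# Synthetic target so the example runs; it is not derived from real MTA data.
toy["y"] = 1.5 - 0.01 * toy["tavg"] + 0.002 * toy["prcp_mm"] + rng.normal(0, 0.05, n)

# Split by time: earliest 75% for training, latest 25% for validation (no shuffling).
toy = toy.sort_values("date")
cut = int(n * 0.75)
train, valid = toy.iloc[:cut], toy.iloc[cut:]

def design(df):
    """Design matrix with an intercept column followed by the weather features."""
    return np.column_stack([np.ones(len(df)), df["tavg"], df["prcp_mm"]])

# Ordinary least squares baseline fitted on the training window only.
coef, *_ = np.linalg.lstsq(design(train), train["y"].to_numpy(), rcond=None)
pred = design(valid) @ coef

err = valid["y"].to_numpy() - pred
mae = np.mean(np.abs(err))
rmse = np.sqrt(np.mean(err ** 2))
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}")
```

Splitting by date rather than at random keeps future observations out of the training set, which matters when the series is seasonal or non-stationary.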

## Conclusions (So Far)
- Distributions indicate skew in delay metrics with storm-related extremes.
- Colder temperature bands and precipitation associate with higher platform time.
- Correlation matrix shows expected positive associations between PRCP/SNOW and delay metrics, and negative with on-time performance.
- Cleaning choices (imputation, clipping, daily weighting) produce a robust analysis frame for modeling.

## Convert Notebook to Python Script
Run this cell before submission to generate `source.py` for peer review.

In [None]:
!jupyter nbconvert --to python source.ipynb