# NYC Curbside Congestion - Data Exploration & Preprocessing

This notebook explores the output of the preprocessing pipeline, which covers:
1. Loading 311 complaint data from NYC Open Data API
2. Filtering for truck/delivery-related complaints
3. Feature engineering (temporal, spatial, weather)
4. Creating the final modeling dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import requests
import sys
sys.path.insert(0, '..')

# Project modules
from src.config import *
from src.api_311 import fetch_recent_complaints, process_live_complaints

# Display settings
pd.set_option('display.max_columns', None)
plt.style.use('dark_background')

## 1. Load Processed Data

We use pre-processed data that has already been filtered for truck-related complaints.
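
The filtering step itself is not shown in this notebook. As a rough illustration only, a keyword-based filter over the 311 `complaint_type` and `descriptor` fields might look like the sketch below. The keyword list and the function name `filter_truck_related` are hypothetical and are not taken from the project code.

```python
import re

import pandas as pd

# Hypothetical keyword list; the project's actual filter may differ.
TRUCK_KEYWORDS = ["truck", "commercial", "delivery", "double parked"]


def filter_truck_related(complaints, keywords=TRUCK_KEYWORDS):
    """Keep rows whose complaint type or descriptor mentions any keyword."""
    # Escape each keyword so it is matched literally inside the regex.
    pattern = "|".join(re.escape(k) for k in keywords)
    text = (
        complaints["complaint_type"].fillna("")
        + " "
        + complaints["descriptor"].fillna("")
    )
    mask = text.str.contains(pattern, case=False, regex=True)
    return complaints[mask]


# Small synthetic sample, not real 311 records.
sample_complaints = pd.DataFrame({
    "complaint_type": ["Illegal Parking", "Noise - Residential",
                       "Illegal Parking", "Street Condition"],
    "descriptor": ["Commercial Overnight Parking", "Loud Music/Party",
                   "Double Parked Blocking Traffic", None],
})
truck_sample = filter_truck_related(sample_complaints)
```

Matching on free-text fields is simple but can produce false positives, so any real keyword list would need checking against a labeled sample.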

In [None]:
# Load the features dataset
df = pd.read_csv('../data/complaints_with_features.csv')
print(f"Dataset shape: {df.shape}")
df.head()

In [None]:
# Data info
df.info()

In [None]:
# Parse dates
df['created_date'] = pd.to_datetime(df['created_date'])
print(f"Date range: {df['created_date'].min()} to {df['created_date'].max()}")
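
The plots in the next section rely on `hour` and `day_of_week` columns that are assumed to already exist in the CSV. A minimal sketch of how such columns could be derived from `created_date` follows, using a small synthetic frame. The `is_weekend` column is a hypothetical extra, and the Monday=0 convention is the pandas default rather than something confirmed by the project code.

```python
import pandas as pd

# Tiny synthetic sample standing in for the real complaints data.
sample = pd.DataFrame({
    "created_date": [
        "2024-01-01 08:15:00",  # Monday
        "2024-01-06 14:40:00",  # Saturday
        "2024-01-07 23:05:00",  # Sunday
    ]
})
sample["created_date"] = pd.to_datetime(sample["created_date"])

# pandas convention: dayofweek is 0 for Monday through 6 for Sunday.
sample["hour"] = sample["created_date"].dt.hour
sample["day_of_week"] = sample["created_date"].dt.dayofweek
sample["is_weekend"] = sample["day_of_week"] >= 5
```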

## 2. Exploratory Data Analysis

In [None]:
# Complaint types distribution
print("Top Complaint Types:")
df['complaint_type'].value_counts().head(10)

In [None]:
# Hourly distribution
fig, ax = plt.subplots(figsize=(12, 5))
# Reindex so hours with no complaints still appear as zero-height bars
df['hour'].value_counts().reindex(range(24), fill_value=0).plot(kind='bar', ax=ax, color='#10b981')
ax.set_xlabel('Hour of Day')
ax.set_ylabel('Number of Complaints')
ax.set_title('Complaints by Hour of Day')
plt.tight_layout()
plt.show()

In [None]:
# Day of week distribution
day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
fig, ax = plt.subplots(figsize=(10, 5))
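# Assumes day_of_week uses 0=Monday ... 6=Sunday; reindex so a missing day
# becomes a zero count instead of misaligning the bars with day_names
day_counts = df['day_of_week'].value_counts().reindex(range(7), fill_value=0)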
ax.bar(day_names, day_counts.values, color='#3b82f6')
ax.set_xlabel('Day of Week')
ax.set_ylabel('Number of Complaints')
ax.set_title('Complaints by Day of Week')
plt.tight_layout()
plt.show()

In [None]:
# Heatmap: Hour vs Day of Week
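heatmap_data = (
    df.groupby(['day_of_week', 'hour']).size().unstack(fill_value=0)
    # Guarantee a full 7 x 24 grid even if some day or hour has no complaints
    .reindex(index=range(7), columns=range(24), fill_value=0)
)
heatmap_data.index = day_names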

fig, ax = plt.subplots(figsize=(14, 6))
sns.heatmap(heatmap_data, cmap='YlOrRd', ax=ax, cbar_kws={'label': 'Complaints'})
ax.set_xlabel('Hour of Day')
ax.set_ylabel('Day of Week')
ax.set_title('Complaint Frequency: Hour vs Day of Week')
plt.tight_layout()
plt.show()

## 3. Spatial Analysis

In [None]:
# Grid distribution
if 'grid_id' in df.columns:
    print(f"Number of unique grid cells: {df['grid_id'].nunique()}")
    print("\nTop 10 busiest grid cells:")
    print(df['grid_id'].value_counts().head(10))
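
How `grid_id` is built is not shown here. One common approach, sketched below as an assumption rather than the project's actual method, is to snap each latitude/longitude pair to a fixed-size cell and use the cell's row and column as the identifier. The cell size `CELL_SIZE_DEG` and the function `assign_grid_id` are hypothetical.

```python
import numpy as np

# Hypothetical cell size in degrees (about half a kilometre north-south).
CELL_SIZE_DEG = 0.005


def assign_grid_id(lat, lon, cell_size=CELL_SIZE_DEG):
    """Map coordinates to string ids of the form '<row>_<col>'."""
    # floor() snaps every point to the south-west corner of its cell.
    rows = np.floor(np.atleast_1d(lat) / cell_size).astype(int)
    cols = np.floor(np.atleast_1d(lon) / cell_size).astype(int)
    return [f"{r}_{c}" for r, c in zip(rows, cols)]


# Two nearby points in Midtown and one point in Lower Manhattan.
lats = np.array([40.7580, 40.7585, 40.7128])
lons = np.array([-73.9855, -73.9851, -74.0060])
grid_ids = assign_grid_id(lats, lons)
```

With this scheme the two Midtown points fall in the same cell, while the Lower Manhattan point gets a different id.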

In [None]:
# Geographic scatter plot
fig, ax = plt.subplots(figsize=(10, 12))
ax.scatter(df['longitude'], df['latitude'], alpha=0.3, s=1, c='#10b981')
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.set_title('Complaint Locations in Manhattan')
plt.tight_layout()
plt.show()

## 4. Modeling Dataset

Load the aggregated dataset used for model training.

In [None]:
# Load modeling dataset
model_df = pd.read_csv('../data/modeling_dataset.csv')
print(f"Modeling dataset shape: {model_df.shape}")
model_df.head()

In [None]:
# Class distribution
print("Target variable distribution:")
print(model_df['high_congestion'].value_counts(normalize=True).round(3))
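
If the target turns out to be imbalanced, one option is to weight classes inversely to their frequency during training. The sketch below computes such weights with the common `n_samples / (n_classes * class_count)` heuristic on a synthetic label series. The function name `balanced_class_weights` is hypothetical, and whether the project's models use class weights at all is an assumption.

```python
import pandas as pd


def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * count)."""
    counts = labels.value_counts()
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Synthetic labels with a 30/70 split, not the real modeling dataset.
labels = pd.Series([1] * 3 + [0] * 7, name="high_congestion")
weights = balanced_class_weights(labels)
```

The minority class receives the larger weight, so misclassifying a high-congestion sample costs more than misclassifying a low-congestion one.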

In [None]:
# Feature statistics
model_df.describe()

## Summary

Key findings from EDA:
- **Peak hours**: 8-10 AM and 2-6 PM have the highest complaint volumes
- **Weekdays vs. weekends**: Weekdays have more complaints than weekends
- **Spatial patterns**: Midtown and Lower Manhattan are hotspots
- **Class imbalance**: ~30% high congestion, ~70% low congestion
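
Because raw day-of-week totals depend on how many of each day the date range contains, the weekday/weekend comparison is easier to check on a per-day basis. The sketch below, run on synthetic timestamps, averages complaints per calendar day within each group. The function name `mean_daily_complaints` is hypothetical.

```python
import pandas as pd


def mean_daily_complaints(created):
    """Average complaints per calendar day, split into weekday and weekend."""
    # Count complaints per calendar day. Days with zero complaints do not
    # appear in the counts, so the averages cover active days only.
    daily = created.dt.floor("D").value_counts()
    is_weekend = daily.index.dayofweek >= 5  # 5 = Saturday, 6 = Sunday
    return {
        "weekday": daily[~is_weekend].mean(),
        "weekend": daily[is_weekend].mean(),
    }


# Synthetic timestamps: three on a Monday, one on a Tuesday, one on a Saturday.
created = pd.Series(pd.to_datetime([
    "2024-01-01 08:00", "2024-01-01 09:30", "2024-01-01 15:00",
    "2024-01-02 10:00",
    "2024-01-06 11:00",
]))
daily_means = mean_daily_complaints(created)
```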