# NYC Delivery Truck Congestion – Step 4: Spatial Features & Data Aggregation
*Author: Karan Chauhan*  

This notebook creates spatial features and aggregates complaints into a prediction-ready dataset.

**Part A: Spatial Features**
- Create geographic grid cells (0.01° resolution, roughly 1.1 km north-south by 0.85 km east-west at NYC's latitude)
- Assign each complaint to a grid cell

**Part B: Data Aggregation**
- Group complaints by (grid cell + hour + day of week)
- Create target variable for prediction
- Save the modeling dataset for model training (Step 5)

---

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from folium.plugins import HeatMap
from sklearn.model_selection import train_test_split

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

---
# Part A: Spatial Features (Grid Cells)
---

## Load Dataset with Temporal Features

In [None]:
df = pd.read_csv('../data/complaints_with_features.csv')

print(f"Loaded {len(df):,} complaints with temporal features")
df.head()

## Clean Spatial Data

In [None]:
# Drop rows with missing lat/lon
df_clean = df.dropna(subset=['latitude', 'longitude']).copy()
print(f"After removing missing location data: {len(df_clean):,} complaints")
print(f"Dropped: {len(df) - len(df_clean):,} rows ({(len(df) - len(df_clean))/len(df)*100:.1f}%)")

## Create Grid Cells

In [None]:
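GRID_PRECISION = 0.01  # degrees; roughly 1.1 km north-south
GRID_DECIMALS = 2      # decimal places matching GRID_PRECISION

# Snap coordinates to the nearest grid point; the final round() strips float
# artifacts from the multiplication that would otherwise end up in grid_id
df_clean['grid_lat'] = ((df_clean['latitude'] / GRID_PRECISION).round() * GRID_PRECISION).round(GRID_DECIMALS)
df_clean['grid_lon'] = ((df_clean['longitude'] / GRID_PRECISION).round() * GRID_PRECISION).round(GRID_DECIMALS)
df_clean['grid_id'] = df_clean['grid_lat'].astype(str) + '_' + df_clean['grid_lon'].astype(str)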

print(f"Created {df_clean['grid_id'].nunique():,} unique grid cells")
print(f"Average complaints per cell: {len(df_clean) / df_clean['grid_id'].nunique():.1f}")
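
To make the grid resolution more concrete, the sketch below estimates the physical size of a 0.01° cell and builds cell ids from integer indices rather than from float strings. It is an illustration, not part of the pipeline: the helper names `cell_size_km` and `grid_id_from_index`, the constant `KM_PER_DEG_LAT`, and the sample coordinates are all assumptions introduced here.

```python
import math

import pandas as pd

GRID_PRECISION = 0.01    # degrees, same value as in the cell above
KM_PER_DEG_LAT = 111.32  # approximate km per degree of latitude (assumed constant)


def cell_size_km(lat_deg, precision=GRID_PRECISION):
    """Approximate (north-south, east-west) size of one cell in km."""
    ns = precision * KM_PER_DEG_LAT
    # Longitude degrees shrink with the cosine of the latitude
    ew = precision * KM_PER_DEG_LAT * math.cos(math.radians(lat_deg))
    return ns, ew


def grid_id_from_index(lat, lon, precision=GRID_PRECISION):
    """Build ids from integer cell indices so no float formatting is involved."""
    lat_idx = (lat / precision).round().astype(int)
    lon_idx = (lon / precision).round().astype(int)
    return lat_idx.astype(str) + "_" + lon_idx.astype(str)


# Hypothetical coordinates, for illustration only
demo = pd.DataFrame({
    "latitude": [40.7512, 40.7538, 40.7612],
    "longitude": [-73.9840, -73.9838, -73.9712],
})
demo["grid_id"] = grid_id_from_index(demo["latitude"], demo["longitude"])
ns_km, ew_km = cell_size_km(40.75)

print(demo)
print(f"Cell size near 40.75°N: {ns_km:.2f} km x {ew_km:.2f} km")
```

The first two sample points land in the same cell and the third in a neighboring one. Integer-index ids are one possible alternative to the string ids used in this notebook; whether they are preferable depends on how later steps consume `grid_id`.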

## Visualize Grid Coverage

In [None]:
cell_counts = df_clean['grid_id'].value_counts()

plt.figure(figsize=(12, 4))
plt.hist(cell_counts, bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Complaints per Grid Cell')
plt.ylabel('Number of Grid Cells')
plt.title('Distribution of Complaints Across Grid Cells')
plt.axvline(cell_counts.median(), color='red', linestyle='--', label=f'Median: {cell_counts.median():.0f}')
plt.legend()
plt.tight_layout()
plt.show()

print("\nTop 10 busiest grid cells:")
print(cell_counts.head(10))

---
# Part B: Data Aggregation for Modeling
---

## Create Aggregated Dataset

Transform individual complaints into aggregated observations:
- Each row = (grid cell + hour + day of week)
- Target variable = complaint count in that combination
- This enables prediction: "What's the risk in grid X at hour Y on day Z?"
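
The reshaping described above can be sketched on a toy example. The data below is made up, and `toy` and `toy_agg` are names introduced only for this illustration.

```python
import pandas as pd

# Hypothetical complaints: three share the same (cell, hour, day) combination
toy = pd.DataFrame({
    "grid_id": ["40.75_-73.98"] * 3 + ["40.76_-73.98"],
    "hour": [8, 8, 8, 17],
    "day_of_week": [0, 0, 0, 4],
})

# One row per observed combination, with the number of complaints in it
toy_agg = (
    toy.groupby(["grid_id", "hour", "day_of_week"])
       .size()
       .reset_index(name="complaint_count")
)
print(toy_agg)
```

Four complaints collapse into two observations. Note that only combinations with at least one complaint produce a row, so a dataset built this way contains no zero-count observations.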

In [None]:
# Group by grid cell, hour, and day of week
agg_data = df_clean.groupby(['grid_id', 'grid_lat', 'grid_lon', 'hour', 'day_of_week']).agg({
    'created_date': 'count',  # Count complaints
    'is_weekend': 'first',     # Weekend flag (same for all in group)
    'is_rush_hour': 'first',   # Rush hour flag
    'month': lambda x: x.mode()[0] if len(x.mode()) > 0 else x.iloc[0]  # Most common month
}).reset_index()

# Rename complaint count column
agg_data.rename(columns={'created_date': 'complaint_count'}, inplace=True)

print(f"Aggregated {len(df_clean):,} complaints into {len(agg_data):,} observations")
print("\nEach observation represents: (grid cell + hour + day of week)")

agg_data.head(10)

## Create Target Variable

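Define the prediction task as binary classification: a (grid cell, hour, day of week) combination is labeled high congestion when its complaint count is at or above the 75th percentile of all combinations, and low congestion otherwise.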

In [None]:
# Define high congestion threshold (75th percentile)
threshold = agg_data['complaint_count'].quantile(0.75)
agg_data['high_congestion'] = (agg_data['complaint_count'] >= threshold).astype(int)

print(f"High congestion threshold: {threshold:.1f} complaints")
print("\nClass distribution:")
print(agg_data['high_congestion'].value_counts())
print(f"\nHigh congestion percentage: {agg_data['high_congestion'].mean()*100:.1f}%")
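
Count data of this kind is often heavily skewed toward small values, and then the 75th percentile can coincide with the most common count. The toy example below, with made-up counts, shows how the choice between `>=` and `>` changes the class balance in that situation. It is a sanity check worth running on the real data, not a statement about what the real distribution looks like.

```python
import pandas as pd

# Hypothetical, strongly skewed counts: most combinations have one complaint
skewed = pd.Series([1, 1, 1, 1, 1, 1, 1, 5], name="complaint_count")

thr = skewed.quantile(0.75)
share_ge = (skewed >= thr).mean()  # rule used in this notebook
share_gt = (skewed > thr).mean()   # stricter alternative

print(f"Threshold: {thr}")
print(f"Share labeled high with >=: {share_ge:.1%}")
print(f"Share labeled high with > : {share_gt:.1%}")
```

If the printed high congestion percentage for the real data is far from 25%, ties at the threshold are a likely cause, and a different percentile or a strict inequality may be worth considering.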

## Feature Summary

In [None]:
print("Modeling dataset shape:", agg_data.shape)
print("\nFeatures available:")
print(list(agg_data.columns))

print("\nFeature value counts:")
print(f"Unique grid cells: {agg_data['grid_id'].nunique()}")
print(f"Unique hours: {agg_data['hour'].nunique()}")
print(f"Unique days: {agg_data['day_of_week'].nunique()}")

## Aggregated Data Statistics

In [None]:
print("Complaint count distribution:")
print(agg_data['complaint_count'].describe())

plt.figure(figsize=(12, 4))
plt.hist(agg_data['complaint_count'], bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Complaint Count')
plt.ylabel('Frequency')
plt.title('Distribution of Complaint Counts (Aggregated by Grid + Time)')
plt.axvline(threshold, color='red', linestyle='--', label=f'High Congestion Threshold: {threshold:.1f}')
plt.legend()
plt.tight_layout()
plt.show()

## Save Modeling Dataset

In [None]:
# Save aggregated modeling dataset
modeling_path = '../data/modeling_dataset.csv'
agg_data.to_csv(modeling_path, index=False)

print(f"Saved modeling dataset: {modeling_path}")
print(f"Rows: {len(agg_data):,}")
print(f"Columns: {len(agg_data.columns)}")
print("\nReady for model training in Step 5!")

## Summary

**What we accomplished:**
1. Created spatial grid cells (90 unique cells covering Manhattan)
2. Aggregated 110k complaints into prediction-ready format
3. Defined target variable (high/low congestion)
4. Saved modeling dataset ready for ML

**Features available for modeling:**
- Spatial: `grid_id`, `grid_lat`, `grid_lon`
- Temporal: `hour`, `day_of_week`, `is_weekend`, `is_rush_hour`, `month`
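- Target: `high_congestion` (binary, for classification) or `complaint_count` (integer count, for regression). Because `high_congestion` is derived from `complaint_count`, the count must not be used as an input feature when predicting the binary label.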
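
As a minimal sketch of how these columns might be assembled for Step 5, the example below builds a feature matrix and a stratified train/test split with scikit-learn's `train_test_split`. The `demo` frame is a hypothetical stand-in for `agg_data`, and `feature_cols`, the 25% test size, and the random seed are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for agg_data, using the column names listed above
demo = pd.DataFrame({
    "grid_lat": [40.75, 40.75, 40.76, 40.76, 40.77, 40.77, 40.78, 40.78],
    "grid_lon": [-73.98] * 8,
    "hour": [8, 17, 8, 17, 9, 18, 9, 18],
    "day_of_week": [0, 1, 2, 3, 4, 5, 6, 0],
    "is_weekend": [0, 0, 0, 0, 0, 1, 1, 0],
    "is_rush_hour": [1, 1, 1, 1, 1, 0, 0, 1],
    "month": [1, 2, 3, 4, 5, 6, 7, 8],
    "complaint_count": [1, 6, 1, 7, 2, 8, 1, 9],
    "high_congestion": [0, 1, 0, 1, 0, 1, 0, 1],
})

# complaint_count is left out on purpose: the label is derived from it
feature_cols = ["grid_lat", "grid_lon", "hour", "day_of_week",
                "is_weekend", "is_rush_hour", "month"]
X = demo[feature_cols]
y = demo["high_congestion"]

# stratify=y keeps the class ratio similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```

`grid_id` is omitted here because it is a string and would need encoding before most models can use it; `grid_lat` and `grid_lon` carry the same location information numerically.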

**Next step:** Build and evaluate predictive models (Step 5)