## 📦 Cell 1: Geocoding Parking Violations with Mapbox
This cell samples 60,000 NYC parking violations, builds an address string for each one, and geocodes the unique addresses to latitude/longitude with the Mapbox geocoding API.
To avoid geocoding the same address twice, it stores results in a local geocode_cache.json and only requests addresses that are not yet cached.
Rows without coordinates or outside an NYC bounding box are dropped, and the rest are exported to geocoded_fines_sample_50k.csv (the file name says 50k, but the sample size is 60,000).
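
The cache is keyed by the address string, so two spellings of the same address (extra spaces, different case) would be geocoded twice. The following is a minimal sketch of one way to normalize the parts before joining them. The helper name `build_address` and the county value "Brooklyn" are illustrative assumptions, not part of the notebook.

```python
import re

def build_address(house_number, street_name, county, city="NYC"):
    """Join address parts into one string, normalizing whitespace and case."""
    cleaned = []
    for value in (house_number, street_name, county):
        # Collapse runs of whitespace, trim the ends, and upper-case the text
        cleaned.append(re.sub(r"\s+", " ", str(value)).strip().upper())
    house, street, county_clean = cleaned
    return f"{house} {street}, {county_clean}, {city}"

# Street names taken from the notebook output; the county value is a placeholder
a = build_address("489", "LENOX RD", "Brooklyn")
b = build_address(" 489", "Lenox   Rd ", "brooklyn")
print(a)
print(a == b)
```

Both calls produce the same key, so the second spelling would hit the cache instead of triggering a new request.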

In [5]:
import pandas as pd
from mapbox import Geocoder
from tqdm import tqdm
import time
import os
import json

# ----------------------------------------
# CONFIGURATION
# ----------------------------------------
# Read the token from the environment instead of hardcoding it in the notebook
MAPBOX_TOKEN = os.environ["MAPBOX_TOKEN"]
CSV_PATH = "data/nyc_parking_violations_sample.csv"
CACHE_PATH = "geocode_cache.json"
BACKUP_PATH = "geocode_cache_backup.json"
SAVE_EVERY = 1000
SLEEP_TIME = 0.05  # Pause between requests (seconds) to avoid hammering the API
SAMPLE_SIZE = 60000

# ----------------------------------------
# LOAD & SAMPLE DATA
# ----------------------------------------
cols = ['house_number', 'street_name', 'violation_county']
df = pd.read_csv(CSV_PATH, usecols=cols, low_memory=False)
df = df.dropna(subset=cols).sample(SAMPLE_SIZE, random_state=42)

# Create full address
df['full_address'] = df['house_number'].astype(str) + " " + df['street_name'] + ", " + df['violation_county'] + ", NYC"

# ----------------------------------------
# LOAD OR CREATE CACHE
# ----------------------------------------
if os.path.exists(CACHE_PATH):
    with open(CACHE_PATH, "r") as f:
        cache = json.load(f)
else:
    cache = {}

# ----------------------------------------
# SETUP GEOCODER
# ----------------------------------------
geocoder = Geocoder(access_token=MAPBOX_TOKEN)

# Deduplicate addresses and filter out already cached ones
unique_addresses = df['full_address'].unique()
uncached = [addr for addr in unique_addresses if addr not in cache]
print(f"Geocoding {len(uncached):,} new addresses from {SAMPLE_SIZE:,} fines...")

# ----------------------------------------
# GEOCODE LOOP
# ----------------------------------------
for i, addr in enumerate(tqdm(uncached)):
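    try:
        response = geocoder.forward(addr, limit=1)
        response.raise_for_status()  # treat HTTP errors (e.g. 401, 429) as failures, not as "no match"
        features = response.geojson().get('features', [])
        if features:
            coords = features[0]['geometry']['coordinates']  # GeoJSON order: [lon, lat]
            cache[addr] = [coords[1], coords[0]]  # stored as [lat, lon]
        else:
            cache[addr] = [None, None]  # no match found
    except Exception as exc:
        # Leave the address uncached so it is retried on the next run
        tqdm.write(f"Failed to geocode {addr!r}: {exc}")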
    time.sleep(SLEEP_TIME)

    # Save progress
    if (i + 1) % SAVE_EVERY == 0:
        with open(CACHE_PATH, "w") as f:
            json.dump(cache, f)
        with open(BACKUP_PATH, "w") as f:
            json.dump(cache, f)
        print(f"💾 Checkpoint saved at {i+1} geocoded")

# Final save
with open(CACHE_PATH, "w") as f:
    json.dump(cache, f)
with open(BACKUP_PATH, "w") as f:
    json.dump(cache, f)
print("✅ Final geocode cache saved.")

# ----------------------------------------
# APPLY GEOCOORDINATES TO DATAFRAME
# ----------------------------------------
df[['lat', 'lon']] = df['full_address'].apply(lambda x: pd.Series(cache.get(x, [None, None])))

# Filter to valid NYC area
df = df.dropna(subset=['lat', 'lon'])
df = df[(df['lat'].between(40.49, 40.92)) & (df['lon'].between(-74.26, -73.68))]

# Save final dataset
df.to_csv("geocoded_fines_sample_50k.csv", index=False)
print(f"✅ Geocoded fines saved: {len(df):,} rows with coordinates.")


Geocoding 0 new addresses from 60,000 fines...
0it [00:00, ?it/s]
✅ Final geocode cache saved.
✅ Geocoded fines saved: 30,164 rows with coordinates.
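
The checkpoints above write geocode_cache.json in place, so an interruption in the middle of a write could leave a truncated file. One possible way to reduce that risk is to write to a temporary file and then swap it into place. The sketch below uses only the standard library; `save_cache_atomic` is a hypothetical helper and the cache entry is made up.

```python
import json
import os
import tempfile

def save_cache_atomic(cache, path):
    """Write the cache to a temporary file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(cache, f)
        # os.replace is atomic when both paths are on the same filesystem
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temporary file if anything failed before the swap
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

# Example in a throwaway directory with a made-up cache entry
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "geocode_cache.json")
    save_cache_atomic({"example address": [40.7, -74.0]}, demo_path)
    with open(demo_path) as f:
        reloaded = json.load(f)
print(reloaded)
```

With a helper like this, the separate backup file becomes less critical, because the main cache file is always either the old or the new complete version.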


## 🗺️ Cell 2: Spatial Join – Mapping Fines to Neighborhoods
This cell reads the geocoded fines and converts them into a GeoDataFrame.
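It then loads the Neighborhood Tabulation Areas (NTAs) from a GeoJSON file, reprojects them to the same coordinate reference system (EPSG:4326), and performs a left spatial join that attaches each fine to the neighborhood polygon it intersects. Fines that fall outside every polygon are kept, with empty neighborhood fields.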

In [7]:
import geopandas as gpd
import pandas as pd

# Load geocoded fine data
df = pd.read_csv("geocoded_fines_sample_50k.csv")

# Filter to NYC bounds (just in case)
# df = df[(df['lat'].between(40.49, 40.92)) & (df['lon'].between(-74.26, -73.68))]

# Convert to GeoDataFrame
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df['lon'], df['lat']),
    crs="EPSG:4326"
)

# ✅ Load the clean GeoJSON instead of the shapefile
nta = gpd.read_file("data/nynta2020.geojson")  # replace path if needed
nta = nta.to_crs("EPSG:4326")  # ensure same CRS

# Spatial join: fines → neighborhoods
joined = gpd.sjoin(gdf, nta, how="left", predicate="intersects")


# Preview matched fines with neighborhood info
print(joined[['house_number', 'street_name', 'boroname', 'ntaname']].dropna().head())


  house_number    street_name   boroname                              ntaname
0          290       Broadway   Brooklyn                         Williamsburg
1            E  Lefferts Blvd     Queens                     South Ozone Park
2          489       LENOX RD   Brooklyn                East Flatbush-Erasmus
3           30      E 18th St  Manhattan  Midtown South-Flatiron-Union Square
4          320   Atlantic Ave   Brooklyn  Downtown Brooklyn-DUMBO-Boerum Hill
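
To make the spatial join less of a black box: for point data, the join asks, for each fine, which neighborhood polygon contains it. The sketch below shows a simplified ray-casting point-in-polygon test in plain Python. It is not how geopandas implements `sjoin` (which relies on a spatial index and proper geometry predicates), and the rectangular "neighborhood" is made up for illustration.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the polygon given by `ring`?

    `ring` is a list of (lon, lat) vertices; the last vertex is implicitly
    connected back to the first. Points exactly on an edge are not treated
    specially in this sketch.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray going east from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Hypothetical rectangular area, not a real NTA boundary
toy_area = [(-74.00, 40.70), (-73.98, 40.70), (-73.98, 40.72), (-74.00, 40.72)]
inside_point = point_in_polygon(-73.99, 40.71, toy_area)
outside_point = point_in_polygon(-73.95, 40.71, toy_area)
print(inside_point, outside_point)
```

Each crossing of the ray with an edge toggles the inside/outside state, so an odd number of crossings means the point lies inside.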


## 📊 Cell 3: Count Fines by Neighborhood
Here, we group the spatially joined data by ntaname (neighborhood name) and count how many fines occurred in each neighborhood.
This gives us raw counts of violations by area.

In [8]:
# Count number of fines per neighborhood
nta_counts = joined.groupby('ntaname').size().reset_index(name='num_fines')

# Preview top 10
print(nta_counts.sort_values(by='num_fines', ascending=False).head(10))


                                 ntaname  num_fines
131                 Midtown-Times Square       1098
32                  Chelsea-Hudson Yards        673
88                     Greenwich Village        669
204        Upper East Side-Carnegie Hill        651
207            Upper West Side (Central)        649
61                          East Village        626
130  Midtown South-Flatiron-Union Square        608
216                         West Village        603
93                        Hell's Kitchen        493
142                 Murray Hill-Kips Bay        488
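
Because the join in Cell 2 is a left join, fines that matched no neighborhood have an empty ntaname, and pandas' groupby leaves such rows out of the counts by default. The following plain-Python sketch mirrors that behavior on a tiny made-up list of labels, as a way to see how many fines would go uncounted.

```python
from collections import Counter

# One neighborhood label per fine; None marks a fine that matched no polygon.
# The labels are illustrative, not real counts.
labels = ["Williamsburg", "South Ozone Park", None, "Williamsburg"]

counts = Counter(name for name in labels if name is not None)
unmatched = sum(name is None for name in labels)

print(counts.most_common())
print(f"unmatched fines: {unmatched}")
```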


## 👥 Cell 4: Normalize by Population
We load NYC population data by NTA and merge it with the joined fines by matching on neighborhood name.
We then compute fines per 1,000 residents, which normalizes the violation counts by population size and gives a fairer comparison across neighborhoods.

In [9]:
pop_df = pd.read_csv("data/New_York_City_Population_By_Neighborhood_Tabulation_Areas.csv")

# Strip whitespace and normalize
joined['ntaname'] = joined['ntaname'].str.strip()
pop_df['NTA Name'] = pop_df['NTA Name'].str.strip()

# Merge using ntaname
merged = joined.merge(pop_df, left_on='ntaname', right_on='NTA Name', how='left')

# Group and compute stats
nta_stats = (
    merged.groupby(['ntaname', 'ntaabbrev', 'boroname', 'Population'])
    .size()
    .reset_index(name='num_fines')
)

nta_stats['fines_per_1000'] = (nta_stats['num_fines'] / nta_stats['Population']) * 1000

# Preview
print(nta_stats.sort_values(by='fines_per_1000', ascending=False).head())



                           ntaname   ntaabbrev   boroname  Population  \
38                    East Village      EstVlg  Manhattan     41746.0   
39                    East Village      EstVlg  Manhattan     44136.0   
62                        Gramercy      Grmrcy  Manhattan     26184.0   
63                        Gramercy      Grmrcy  Manhattan     27988.0   
144  Upper East Side-Carnegie Hill  UES_CrngHl  Manhattan     61207.0   

     num_fines  fines_per_1000  
38         626       14.995449  
39         626       14.183433  
62         299       11.419187  
63         299       10.683150  
144        651       10.636038  
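
In the preview above, East Village and Gramercy each appear twice with different Population values, which suggests the population table has more than one row per neighborhood name. A quick check for duplicate keys before merging can surface this early. The sketch below is plain Python with hypothetical helper names; the East Village figures come from the preview, and "Example NTA" is made up.

```python
from collections import defaultdict

def find_duplicate_keys(rows, key, value):
    """Return {key: [values...]} for keys that occur in more than one row."""
    seen = defaultdict(list)
    for row in rows:
        seen[row[key]].append(row[value])
    return {k: v for k, v in seen.items() if len(v) > 1}

def fines_per_1000(num_fines, population):
    """Fines per 1,000 residents."""
    return num_fines / population * 1000

pop_rows = [
    {"NTA Name": "East Village", "Population": 41746.0},
    {"NTA Name": "East Village", "Population": 44136.0},
    {"NTA Name": "Example NTA", "Population": 10000.0},  # hypothetical row
]

duplicates = find_duplicate_keys(pop_rows, "NTA Name", "Population")
print(duplicates)
print(round(fines_per_1000(626, 41746.0), 3))
```

If duplicates are found, one option is to reduce the population table to a single row per neighborhood before the merge, so that each fine is matched exactly once.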


## 🧹 Cell 5: Clean and Aggregate Neighborhood Stats
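Neighborhoods can appear more than once in nta_stats: in the Cell 4 preview, East Village and Gramercy each appear twice with different Population values, which suggests the population table holds more than one row per neighborhood.
This step groups by neighborhood name again and averages Population and fines per 1,000 residents, giving one row per neighborhood for visualization.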

In [10]:
nta_cleaned = (
    nta_stats.groupby('ntaname')
    .agg({
        # Each duplicate row repeats the same fine count, so take it once;
        # summing would double-count the fines
        'num_fines': 'first',
        'Population': 'mean',
        'fines_per_1000': 'mean'
    })
    .reset_index()
    .sort_values(by='fines_per_1000', ascending=False)
)
print(nta_cleaned.head())


                          ntaname  num_fines  Population  fines_per_1000
19                   East Village        626     42941.0       14.589441
31                       Gramercy        299     27086.0       11.051169
72  Upper East Side-Carnegie Hill        651     62435.5       10.430799
51           Murray Hill-Kips Bay        488     49580.5        9.847984
73                   West Village        603     67681.5        8.910627
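
One detail worth being aware of: averaging the per-row rates is not the same as dividing the fine count by the averaged population. The sketch below compares the two using the East Village rows from the Cell 4 preview. Which definition is preferable depends on the analysis, so this is an illustration rather than a recommendation.

```python
populations = [41746.0, 44136.0]  # East Village rows from the Cell 4 preview
num_fines = 626                   # same fine count in both rows

# Approach used in the cell above: average the per-row rates
rates = [num_fines / p * 1000 for p in populations]
mean_of_rates = sum(rates) / len(rates)

# Alternative: compute a single rate from the averaged population
mean_population = sum(populations) / len(populations)
rate_from_mean_population = num_fines / mean_population * 1000

print(round(mean_of_rates, 3), round(rate_from_mean_population, 3))
```

The mean of the rates comes out slightly higher, because the row with the smaller population contributes a disproportionately large rate.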


In [13]:
# Save cleaned neighborhood-level stats
nta_cleaned.to_csv("data/nta_fine_stats_cleaned.csv", index=False)

# Also save raw joined fine points (optional)
joined.to_file("data/fines_joined_with_neighborhoods.geojson", driver="GeoJSON")

# Save GeoJSON with stats merged in (for choropleth)
geo_merged = nta.merge(nta_cleaned, on='ntaname', how='left')
geo_merged.to_file("data/nta_with_fine_stats.geojson", driver="GeoJSON")
