- Dataset source: [NYC OpenData](https://data.cityofnewyork.us/Health/Modified-Zip-Code-Tabulation-Areas-MODZCTA-/pri4-ifjk/about_data)
- There are 178 rows, one per unique zip code. The columns are `MODZCTA` (a single major zip code that combines minor ones; stable and used by NYC Health), `ZCTA` (the zip codes covered by that row; where a major zip code has additional minor ones, they are comma-separated), `label` (similar to `ZCTA`; its purpose is unclear), `pop_est` (population counts stored as strings; remove commas and cast to int before using), and `the_geom` (a WKT-formatted MultiPolygon string).
- I updated crime_data.csv, removing the columns `Lat_Lon` and `New Georeferenced Column` and replacing them with a new `ZIP` column. The 2638 rows with a zero lat/lon were removed; of the remaining rows, 27 are missing a ZIP.
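
The `pop_est` cleanup mentioned above could look roughly like the sketch below. The helper name `parse_pop_est` and the sample strings are made up for illustration; the only assumption about the real column is that it uses commas as thousands separators.

```python
def parse_pop_est(value: str) -> int:
    """Turn a population string such as '23,072' into an int."""
    # Drop surrounding whitespace and thousands separators, then cast
    return int(value.strip().replace(",", ""))


# Made-up sample values, not taken from the dataset
samples = ["23,072", "905", " 1,204,318 "]
parsed = [parse_pop_est(s) for s in samples]

# On the real frame this would be something like:
#   zip_data['pop_est'] = zip_data['pop_est'].map(parse_pop_est)
```

Because the CSV is read with `dtype=str`, every value arrives as a string, so the cast has to be explicit.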

In [None]:
import pandas as pd
from shapely import wkt
from tqdm import tqdm

In [None]:
zip_data = pd.read_csv('NYC_Zip_Codes.csv', dtype=str)
crime_data = pd.read_csv('crime_data.csv')

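# Only look up each distinct point once
point_strs = crime_data['New Georeferenced Column'].unique()
# row[0] is the MODZCTA code, row[4] is the_geom (WKT MultiPolygon)
zip_areas = [(row[0], wkt.loads(row[4])) for row in zip_data.values.tolist()]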

In [None]:
zip_map = {}
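for s in tqdm(point_strs):
  pt = wkt.loads(s)  # parse the point once, not once per polygon
  # First zip area whose polygon contains the point, or None if there is none
  zip_map[s] = next((z for z, poly in zip_areas if poly.contains(pt)), None)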

100%|██████████| 70063/70063 [01:52<00:00, 621.76it/s] 
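
A small sketch of how the lookup behaves on toy data, to show when it returns `None`. The two squares and the labels `A` and `B` are invented stand-ins for the zip polygons, and `lookup_zip` is a hypothetical helper wrapping the same `next(...)` expression. Whether boundary points explain any of the missing ZIPs in the real data is only a guess.

```python
from shapely import wkt

# Two made-up adjacent unit squares standing in for zip polygons
zip_areas = [
    ("A", wkt.loads("POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))")),
    ("B", wkt.loads("POLYGON ((1 0, 2 0, 2 1, 1 1, 1 0))")),
]


def lookup_zip(point_wkt, areas=zip_areas):
    """Return the label of the first polygon containing the point, else None."""
    pt = wkt.loads(point_wkt)
    return next((z for z, poly in areas if poly.contains(pt)), None)


inside = lookup_zip("POINT (0.5 0.5)")   # strictly inside square A
on_edge = lookup_zip("POINT (1 0.5)")    # on the edge shared by A and B
outside = lookup_zip("POINT (5 5)")      # outside every polygon
```

`contains` is false for a point that lies exactly on a polygon's boundary, so `on_edge` comes back as `None` just like `outside`. If boundary points should count, `poly.covers(pt)` is the predicate that includes them.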


In [127]:
# Look up each row's ZIP from its point string (missing where no polygon matched)
crime_data['ZIP'] = crime_data['New Georeferenced Column'].map(zip_map)
# Drop the zero lat/lon rows, then the two columns that ZIP replaces
crime_data_condensed = (
  crime_data[crime_data['Lat_Lon'] != '(0.0, 0.0)']
  .drop(['Lat_Lon', 'New Georeferenced Column'], axis=1)
)

In [None]:
crime_data_condensed.to_csv('Full_crimedata_with_zip.csv', index=False)