**Goal**
Understand the size, coverage, and basic patterns in the NYPD Complaint Data so we can build crime-based neighborhood features around restaurants.


**Plan**
1. Load the NYPD complaint CSV from `data/raw`.
2. Keep only the columns we need (ID, date/time, borough, precinct, offense, severity, completion status, lat/long).
3. Convert dates and numeric fields to the right types and handle missing values.
4. Filter out invalid records (rows with missing coordinates or borough).
5. Summarize the cleaned data: size, date range, borough counts, top offense types.
6. Save a cleaned version to `data/processed`.
7. Write a short summary of the data.

Many of the columns in the raw file are not needed for this analysis, so only the ten listed in `keep_cols` are loaded.

In [1]:

import pandas as pd

# Columns to load from the raw file
keep_cols = [
    "CMPLNT_NUM",        # complaint ID
    "CMPLNT_FR_DT",      # date of incident
    "CMPLNT_FR_TM",      # time of incident
    "BORO_NM",           # borough name
    "ADDR_PCT_CD",       # precinct code
    "OFNS_DESC",         # offense description
    "LAW_CAT_CD",        # felony/misdemeanor/violation
    "CRM_ATPT_CPTD_CD",  # completed vs attempted
    "Latitude",          # latitude
    "Longitude",         # longitude
]

nypd_path = "../data/raw/NYPD_Complaint_Data_Historic.csv"
print("Loading NYPD data (all rows, selected columns)...")

nypd = pd.read_csv(
    nypd_path,
    usecols=keep_cols,
    dtype=str
)

print(f"Loaded {len(nypd):,} rows and {len(nypd.columns)} columns")



Loading NYPD data (all rows, selected columns)...
Loaded 9,491,946 rows and 10 columns


In [2]:
nypd.head()

| | CMPLNT_NUM | CMPLNT_FR_DT | CMPLNT_FR_TM | ADDR_PCT_CD | OFNS_DESC | CRM_ATPT_CPTD_CD | LAW_CAT_CD | BORO_NM | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39468181 | 02/20/2008 | 07:00:00 | 88 | BURGLARY | COMPLETED | FELONY | BROOKLYN | 40.692464 | -73.972708 |
| 1 | 50539499 | 08/21/2008 | 22:00:00 | 19 | GRAND LARCENY | COMPLETED | FELONY | MANHATTAN | 40.771341 | -73.953418 |
| 2 | 45223390 | 04/03/2008 | 03:35:00 | 77 | FELONY ASSAULT | COMPLETED | FELONY | BROOKLYN | 40.671245 | -73.926713 |
| 3 | 50594658 | 08/19/2008 | 09:00:00 | 32 | PETIT LARCENY | COMPLETED | MISDEMEANOR | MANHATTAN | 40.813412 | -73.943226 |
| 4 | 44451016 | 03/10/2008 | 22:00:00 | 67 | ROBBERY | COMPLETED | FELONY | BROOKLYN | 40.650142 | -73.944674 |


### Columns kept and what they mean

- **CMPLNT_NUM** – Randomly generated persistent ID for each complaint
- **CMPLNT_FR_DT** – Exact date of occurrence for the reported event (or starting date of occurrence, if CMPLNT_TO_DT exists)
- **CMPLNT_FR_TM** – Exact time of occurrence for the reported event (or starting time of occurrence, if CMPLNT_TO_TM exists)
- **BORO_NM** – The name of the borough in which the incident occurred
- **ADDR_PCT_CD** – The precinct in which the incident occurred
- **OFNS_DESC** – Description of offense corresponding with key code
- **LAW_CAT_CD** – Level of offense: felony, misdemeanor, violation
- **CRM_ATPT_CPTD_CD** – Indicator of whether crime was successfully completed or attempted, but failed or was interrupted prematurely
- **Latitude** – Midblock Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)
- **Longitude** – Midblock Longitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)


In [3]:
# Check missing values
print("Missing values before cleaning:")
print(nypd.isnull().sum())
print(f"\nTotal rows: {len(nypd):,}")


Missing values before cleaning:
CMPLNT_NUM            0
CMPLNT_FR_DT        655
CMPLNT_FR_TM          0
ADDR_PCT_CD         771
OFNS_DESC             0
CRM_ATPT_CPTD_CD      0
LAW_CAT_CD            0
BORO_NM               0
Latitude            479
Longitude           479
dtype: int64

Total rows: 9,491,946
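
Step 3 of the plan calls for converting dates and numeric fields, but everything above was loaded with `dtype=str`. The following is a minimal sketch of one way to do that conversion, not the notebook's actual method. It runs on a tiny sample: the first two rows come from the `nypd.head()` output, and the third row is hypothetical and deliberately malformed. The `MM/DD/YYYY` date format is an assumption based on the head output.

```python
import pandas as pd

# First two rows are copied from the nypd.head() output; the third row is
# hypothetical (made up for illustration) to show how bad values are handled.
sample = pd.DataFrame(
    {
        "CMPLNT_NUM": ["39468181", "50539499", "00000000"],
        "CMPLNT_FR_DT": ["02/20/2008", "08/21/2008", "not a date"],
        "ADDR_PCT_CD": ["88", "19", None],
        "Latitude": ["40.692464", "40.771341", "abc"],
        "Longitude": ["-73.972708", "-73.953418", "-73.9"],
    }
)

typed = sample.copy()

# Assumes MM/DD/YYYY dates; anything that does not match becomes NaT.
typed["CMPLNT_FR_DT"] = pd.to_datetime(
    typed["CMPLNT_FR_DT"], format="%m/%d/%Y", errors="coerce"
)

# Coordinates become floats; unparseable values become NaN.
for col in ["Latitude", "Longitude"]:
    typed[col] = pd.to_numeric(typed[col], errors="coerce")

# The nullable Int64 dtype keeps missing precinct codes as <NA>.
typed["ADDR_PCT_CD"] = pd.to_numeric(
    typed["ADDR_PCT_CD"], errors="coerce"
).astype("Int64")

print(typed.dtypes)
print(typed.isna().sum())
```

With `errors="coerce"`, malformed values show up in the missing-value counts instead of raising an error, so it is worth re-running the `isnull().sum()` check after conversion.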


In [4]:
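# Filter out rows with missing coordinates
nypd_clean = nypd.dropna(subset=['Latitude', 'Longitude']).copy()

# Also drop rows where borough is NaN. Note: this does not catch the literal
# "(null)" strings in BORO_NM, which are ordinary text rather than missing values.
nypd_clean = nypd_clean.dropna(subset=['BORO_NM'])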

print(f"Rows after removing missing coordinates/borough: {len(nypd_clean):,}")
print(f"Rows removed: {len(nypd) - len(nypd_clean):,}")


Rows after removing missing coordinates/borough: 9,491,467
Rows removed: 479
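
Dropping missing coordinates does not guarantee that the remaining ones are plausible. One possible extra check for step 4 is a rough bounding-box filter, sketched below. The box limits are a loose approximation of New York City chosen for illustration, not values from the dataset documentation. The helper name `within_nyc_bounds` and the last two sample rows are hypothetical.

```python
import pandas as pd

# Loose bounding box around NYC in WGS 84 degrees. These limits are an
# assumption for illustration, not taken from the dataset documentation.
LAT_MIN, LAT_MAX = 40.4, 41.0
LON_MIN, LON_MAX = -74.3, -73.6


def within_nyc_bounds(df: pd.DataFrame) -> pd.Series:
    """Return a boolean mask for rows whose coordinates fall inside the box."""
    lat = pd.to_numeric(df["Latitude"], errors="coerce")
    lon = pd.to_numeric(df["Longitude"], errors="coerce")
    # NaN coordinates compare as False, so they are excluded as well.
    return lat.between(LAT_MIN, LAT_MAX) & lon.between(LON_MIN, LON_MAX)


# First two rows come from the head() output; the last two are hypothetical.
sample = pd.DataFrame(
    {
        "CMPLNT_NUM": ["39468181", "50539499", "hypothetical-1", "hypothetical-2"],
        "Latitude": ["40.692464", "40.771341", "0.0", "40.7"],
        "Longitude": ["-73.972708", "-73.953418", "0.0", "73.9"],
    }
)

mask = within_nyc_bounds(sample)
print(f"Rows inside box: {mask.sum()} of {len(sample)}")
```

On the full data, `nypd_clean[within_nyc_bounds(nypd_clean)]` would keep only the rows inside the box. Whether any rows actually fall outside it has not been checked here.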


In [5]:
# Summary statistics by borough
print("Complaints by Borough:")
print(nypd_clean['BORO_NM'].value_counts())
print("\n" + "="*50)
print("\nCrime severity (LAW_CAT_CD):")
print(nypd_clean['LAW_CAT_CD'].value_counts())


Complaints by Borough:
BORO_NM
BROOKLYN         2777599
MANHATTAN        2288039
BRONX            2054056
QUEENS           1928783
STATEN ISLAND     434271
(null)              8719
Name: count, dtype: int64


Crime severity (LAW_CAT_CD):
LAW_CAT_CD
MISDEMEANOR    5216099
FELONY         2980919
VIOLATION      1294449
Name: count, dtype: int64
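
The borough counts include 8,719 rows labeled `(null)`, even though the missing-value check reported zero missing boroughs. The placeholder is an ordinary string, so `isnull()` and `dropna()` do not see it. The following is a minimal sketch, on a small made-up sample, of one way to turn that placeholder into a real missing value before dropping.

```python
import pandas as pd

# Small illustrative sample; "(null)" is the placeholder seen in the counts above.
sample = pd.DataFrame({"BORO_NM": ["BROOKLYN", "(null)", "MANHATTAN", None]})

# isna() only detects true missing values, so the placeholder is not counted.
print("Missing before:", sample["BORO_NM"].isna().sum())

# Strip whitespace, then blank out the placeholder and empty strings.
boro = sample["BORO_NM"].str.strip()
boro = boro.mask(boro.isin(["(null)", ""]))
print("Missing after: ", boro.isna().sum())

# With the placeholder converted, dropna() removes those rows as intended.
clean = sample.assign(BORO_NM=boro).dropna(subset=["BORO_NM"])
print(clean["BORO_NM"].value_counts())
```

Passing `na_values=["(null)"]` to `pd.read_csv` should have a similar effect at load time. Either approach would change the row counts reported in this notebook.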


In [6]:
# Top 15 offense types
print("Top 15 Offense Types:")
print(nypd_clean['OFNS_DESC'].value_counts().head(15))


Top 15 Offense Types:
OFNS_DESC
PETIT LARCENY                     1666722
HARRASSMENT 2                     1272969
ASSAULT 3 & RELATED OFFENSES       998300
CRIMINAL MISCHIEF & RELATED OF     916259
GRAND LARCENY                      831909
DANGEROUS DRUGS                    471829
OFF. AGNST PUB ORD SENSBLTY &      455170
FELONY ASSAULT                     393342
ROBBERY                            331497
BURGLARY                           310262
MISCELLANEOUS PENAL LAW            252593
DANGEROUS WEAPONS                  191024
GRAND LARCENY OF MOTOR VEHICLE     188110
OFFENSES AGAINST PUBLIC ADMINI     165394
VEHICLE AND TRAFFIC LAWS           161061
Name: count, dtype: int64
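
Step 5 of the plan also lists the date range, which the cells above do not compute. The sketch below shows how it could be derived, using only the five dates from the `nypd.head()` output and again assuming `MM/DD/YYYY` formatting. The results on the full dataset are not shown because they have not been run here.

```python
import pandas as pd

# Dates copied from the nypd.head() output.
dates = pd.Series(
    ["02/20/2008", "08/21/2008", "04/03/2008", "08/19/2008", "03/10/2008"],
    name="CMPLNT_FR_DT",
)

# Unparseable dates become NaT, which min() and max() skip by default.
parsed = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")

print(f"Earliest: {parsed.min():%Y-%m-%d}")
print(f"Latest:   {parsed.max():%Y-%m-%d}")

# Complaints per year, in chronological order.
print(parsed.dt.year.value_counts().sort_index())
```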


## Save Cleaned Data


In [7]:
# Save cleaned NYPD data
output_path = '../data/processed/nypd_complaints_clean.csv'
nypd_clean.to_csv(output_path, index=False)

print(f"✓ Saved {len(nypd_clean):,} crime complaints to: {output_path}")
print(f"✓ Columns: {nypd_clean.columns.tolist()}")
print("\nNYPD data cleaning complete!")


✓ Saved 9,491,467 crime complaints to: ../data/processed/nypd_complaints_clean.csv
✓ Columns: ['CMPLNT_NUM', 'CMPLNT_FR_DT', 'CMPLNT_FR_TM', 'ADDR_PCT_CD', 'OFNS_DESC', 'CRM_ATPT_CPTD_CD', 'LAW_CAT_CD', 'BORO_NM', 'Latitude', 'Longitude']

NYPD data cleaning complete!
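
One caveat when reusing the saved file: CSV does not store dtypes, so reading it back without `dtype=str` lets pandas re-infer them. The sketch below illustrates this with two rows from the head output and a temporary directory in place of `data/processed`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two rows copied from the nypd.head() output, stored as strings like the notebook's data.
sample = pd.DataFrame(
    {
        "CMPLNT_NUM": ["39468181", "50539499"],
        "ADDR_PCT_CD": ["88", "19"],
        "BORO_NM": ["BROOKLYN", "MANHATTAN"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    # A temporary path stands in for the real output location.
    path = Path(tmp) / "nypd_complaints_clean.csv"
    sample.to_csv(path, index=False)

    inferred = pd.read_csv(path)            # pandas guesses the dtypes
    as_text = pd.read_csv(path, dtype=str)  # everything stays as strings

print(inferred.dtypes)
print(as_text.dtypes)
```

Downstream notebooks that expect string IDs or precinct codes would need to pass `dtype=str` (or explicit per-column dtypes) when loading the cleaned file.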
