**Goal**
Understand the size, coverage, and basic patterns of the NYPD Complaint Data so we can build crime-based neighborhood features around restaurants.


**Plan**
1. Load the NYPD complaint CSV from `data/raw`.
2. Keep only the columns we need (ID, date/time, borough, offense, lat/long).
3. Convert dates and numeric fields to the right types and handle missing values.
4. Filter out invalid records.
5. Summarize the cleaned data: size, date range, borough counts, top offense types.
6. Save a cleaned version.
7. Write a short summary of the data.

In [1]:
import pandas as pd

nypd_path = "../data/raw/NYPD_Complaint_Data_Historic.csv"
nypd = pd.read_csv(nypd_path)
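
Because only a handful of columns are used later, one option is to restrict the read to those columns up front, which should reduce memory use on a file this large. The sketch below is a hedged illustration, not the notebook's actual loading step: `load_complaints` is a hypothetical helper, the column names are the ones kept later in this notebook, and a tiny in-memory CSV stands in for the real file. Reading `CMPLNT_NUM` as text is an assumption made to keep the ID column a single consistent type.

```python
import io

import pandas as pd

# Columns used later in this notebook.
KEEP_COLS = [
    "CMPLNT_NUM", "CMPLNT_FR_DT", "CMPLNT_FR_TM", "BORO_NM", "ADDR_PCT_CD",
    "OFNS_DESC", "LAW_CAT_CD", "CRM_ATPT_CPTD_CD", "Latitude", "Longitude",
]


def load_complaints(source):
    """Hypothetical helper: read only the needed columns, keeping the ID as text."""
    return pd.read_csv(source, usecols=KEEP_COLS, dtype={"CMPLNT_NUM": str})


# Tiny stand-in for the real CSV, with one extra column that should be dropped.
sample_csv = io.StringIO(
    "CMPLNT_NUM,CMPLNT_FR_DT,CMPLNT_FR_TM,BORO_NM,ADDR_PCT_CD,OFNS_DESC,"
    "LAW_CAT_CD,CRM_ATPT_CPTD_CD,Latitude,Longitude,VIC_SEX\n"
    "39468181,02/20/2008,07:00:00,BROOKLYN,88,BURGLARY,FELONY,COMPLETED,"
    "40.692464,-73.972708,F\n"
)
sample = load_complaints(sample_csv)
print(sample.columns.tolist())
```

With the real data, the same helper could be called as `load_complaints(nypd_path)`.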



In [2]:
nypd.head()

Unnamed: 0,CMPLNT_NUM,CMPLNT_FR_DT,CMPLNT_FR_TM,CMPLNT_TO_DT,CMPLNT_TO_TM,ADDR_PCT_CD,RPT_DT,KY_CD,OFNS_DESC,PD_CD,...,SUSP_SEX,TRANSIT_DISTRICT,Latitude,Longitude,Lat_Lon,PATROL_BORO,STATION_NAME,VIC_AGE_GROUP,VIC_RACE,VIC_SEX
0,39468181,02/20/2008,07:00:00,02/23/2008,08:00:00,88.0,02/23/2008,107,BURGLARY,221.0,...,(null),,40.692464,-73.972708,"(40.692464, -73.972708)",PATROL BORO BKLYN NORTH,(null),25-44,WHITE,F
1,50539499,08/21/2008,22:00:00,08/21/2008,23:00:00,19.0,08/22/2008,109,GRAND LARCENY,438.0,...,(null),,40.771341,-73.953418,"(40.771341, -73.953418)",PATROL BORO MAN NORTH,(null),45-64,WHITE HISPANIC,F
2,45223390,04/03/2008,03:35:00,04/03/2008,03:50:00,77.0,04/03/2008,106,FELONY ASSAULT,109.0,...,(null),,40.671245,-73.926713,"(40.671245, -73.926713)",PATROL BORO BKLYN NORTH,(null),25-44,BLACK,F
3,50594658,08/19/2008,09:00:00,,(null),32.0,08/27/2008,341,PETIT LARCENY,349.0,...,(null),,40.813412,-73.943226,"(40.813412, -73.943226)",PATROL BORO MAN NORTH,(null),(null),UNKNOWN,M
4,44451016,03/10/2008,22:00:00,03/10/2008,22:10:00,67.0,03/11/2008,105,ROBBERY,397.0,...,M,,40.650142,-73.944674,"(40.650142, -73.944674)",PATROL BORO BKLYN SOUTH,(null),25-44,BLACK,M


The full column list shows that many of the columns are not needed for this analysis.

In [3]:
nypd.columns.tolist()

['CMPLNT_NUM',
 'CMPLNT_FR_DT',
 'CMPLNT_FR_TM',
 'CMPLNT_TO_DT',
 'CMPLNT_TO_TM',
 'ADDR_PCT_CD',
 'RPT_DT',
 'KY_CD',
 'OFNS_DESC',
 'PD_CD',
 'PD_DESC',
 'CRM_ATPT_CPTD_CD',
 'LAW_CAT_CD',
 'BORO_NM',
 'LOC_OF_OCCUR_DESC',
 'PREM_TYP_DESC',
 'JURIS_DESC',
 'JURISDICTION_CODE',
 'PARKS_NM',
 'HADEVELOPT',
 'HOUSING_PSA',
 'X_COORD_CD',
 'Y_COORD_CD',
 'SUSP_AGE_GROUP',
 'SUSP_RACE',
 'SUSP_SEX',
 'TRANSIT_DISTRICT',
 'Latitude',
 'Longitude',
 'Lat_Lon',
 'PATROL_BORO',
 'STATION_NAME',
 'VIC_AGE_GROUP',
 'VIC_RACE',
 'VIC_SEX']

In [4]:
keep_cols = [
    "CMPLNT_NUM", # complaint ID
    "CMPLNT_FR_DT", # date of incident
    "CMPLNT_FR_TM", # time of incident
    "BORO_NM", # borough name
    "ADDR_PCT_CD", # precinct code
    "OFNS_DESC", # offense description
    "LAW_CAT_CD", # felony/misdemeanor/violation
    "CRM_ATPT_CPTD_CD", # completed vs attempted
    "Latitude", # latitude
    "Longitude" # longitude
]
nypd_small = nypd[keep_cols].copy()
nypd_small.head()

Unnamed: 0,CMPLNT_NUM,CMPLNT_FR_DT,CMPLNT_FR_TM,BORO_NM,ADDR_PCT_CD,OFNS_DESC,LAW_CAT_CD,CRM_ATPT_CPTD_CD,Latitude,Longitude
0,39468181,02/20/2008,07:00:00,BROOKLYN,88.0,BURGLARY,FELONY,COMPLETED,40.692464,-73.972708
1,50539499,08/21/2008,22:00:00,MANHATTAN,19.0,GRAND LARCENY,FELONY,COMPLETED,40.771341,-73.953418
2,45223390,04/03/2008,03:35:00,BROOKLYN,77.0,FELONY ASSAULT,FELONY,COMPLETED,40.671245,-73.926713
3,50594658,08/19/2008,09:00:00,MANHATTAN,32.0,PETIT LARCENY,MISDEMEANOR,COMPLETED,40.813412,-73.943226
4,44451016,03/10/2008,22:00:00,BROOKLYN,67.0,ROBBERY,FELONY,COMPLETED,40.650142,-73.944674


### Columns kept and what they mean

- **CMPLNT_NUM** – Randomly generated persistent ID for each complaint
- **CMPLNT_FR_DT** – Exact date of occurrence for the reported event (or starting date of occurrence, if CMPLNT_TO_DT exists)
- **CMPLNT_FR_TM** – Exact time of occurrence for the reported event (or starting time of occurrence, if CMPLNT_TO_TM exists)
- **BORO_NM** – The name of the borough in which the incident occurred
- **ADDR_PCT_CD** – The precinct in which the incident occurred
- **OFNS_DESC** – Description of offense corresponding with key code
- **LAW_CAT_CD** – Level of offense: felony, misdemeanor, violation
- **CRM_ATPT_CPTD_CD** – Indicator of whether crime was successfully completed or attempted, but failed or was interrupted prematurely
- **Latitude** – Midblock Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)
- **Longitude** – Midblock Longitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)
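
The date and time columns arrive as strings, so step 3 of the plan needs them parsed before any date-range summary is possible. A minimal sketch of one way to do that follows, assuming the `MM/DD/YYYY` and `HH:MM:SS` formats seen in the preview above. `add_occurrence_datetime` and the `occurred_at` column are hypothetical names, and the toy frame is made up for illustration.

```python
import pandas as pd


def add_occurrence_datetime(df):
    """Hypothetical helper: combine date and time strings into one datetime column."""
    out = df.copy()
    # str.cat leaves the result missing when either part is missing.
    combined = out["CMPLNT_FR_DT"].str.cat(out["CMPLNT_FR_TM"], sep=" ")
    # errors="coerce" turns missing or unparseable values into NaT instead of raising.
    out["occurred_at"] = pd.to_datetime(
        combined, format="%m/%d/%Y %H:%M:%S", errors="coerce"
    )
    return out


# Toy rows: one valid, one with a missing date, one with an unparseable date.
toy = pd.DataFrame({
    "CMPLNT_FR_DT": ["02/20/2008", None, "not a date"],
    "CMPLNT_FR_TM": ["07:00:00", "22:00:00", "09:00:00"],
})
toy = add_occurrence_datetime(toy)
print(toy["occurred_at"])
```

Rows that end up as `NaT` can then be counted or dropped explicitly, which keeps the handling of bad dates visible rather than silent.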


In [None]:
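# Inspect dtypes and memory usage of the reduced frame.
nypd_small.info()
# Make sure the coordinates are numeric (they already load as float64 here, so this is a safeguard).
nypd_small["Latitude"] = pd.to_numeric(nypd_small["Latitude"])
nypd_small["Longitude"] = pd.to_numeric(nypd_small["Longitude"])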

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9491946 entries, 0 to 9491945
Data columns (total 10 columns):
 #   Column            Dtype  
---  ------            -----  
 0   CMPLNT_NUM        object 
 1   CMPLNT_FR_DT      object 
 2   CMPLNT_FR_TM      object 
 3   BORO_NM           object 
 4   ADDR_PCT_CD       float64
 5   OFNS_DESC         object 
 6   LAW_CAT_CD        object 
 7   CRM_ATPT_CPTD_CD  object 
 8   Latitude          float64
 9   Longitude         float64
dtypes: float64(3), object(7)
memory usage: 724.2+ MB
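
Step 4 of the plan (filtering out invalid records) could be sketched as below. This is an illustration under stated assumptions rather than the notebook's actual rule: the bounding box is a rough, assumed envelope around New York City, `filter_valid_locations` is a hypothetical helper, and treating the literal string `(null)` as a missing borough is an assumption based on how missing values appear in other columns of the preview.

```python
import pandas as pd

# Assumed approximate bounding box for New York City; adjust as needed.
LAT_MIN, LAT_MAX = 40.4, 41.0
LON_MIN, LON_MAX = -74.3, -73.6


def filter_valid_locations(df):
    """Hypothetical helper: keep rows with a borough and in-range coordinates."""
    # between() is False for NaN, so rows with missing coordinates are dropped too.
    in_box = (
        df["Latitude"].between(LAT_MIN, LAT_MAX)
        & df["Longitude"].between(LON_MIN, LON_MAX)
    )
    has_boro = df["BORO_NM"].notna() & (df["BORO_NM"] != "(null)")
    return df[in_box & has_boro].reset_index(drop=True)


# Toy rows: valid, missing latitude, out-of-range coordinates, placeholder borough.
toy = pd.DataFrame({
    "CMPLNT_NUM": ["a", "b", "c", "d"],
    "BORO_NM": ["BROOKLYN", "MANHATTAN", "QUEENS", "(null)"],
    "Latitude": [40.692464, float("nan"), 0.0, 40.75],
    "Longitude": [-73.972708, -73.95, 0.0, -73.98],
})
valid = filter_valid_locations(toy)
print(len(toy), "->", len(valid))
```

Comparing the row counts before and after the filter gives a quick sense of how much data each rule removes.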


In [5]:
nypd_small.head()

Unnamed: 0,CMPLNT_NUM,CMPLNT_FR_DT,CMPLNT_FR_TM,BORO_NM,ADDR_PCT_CD,OFNS_DESC,LAW_CAT_CD,CRM_ATPT_CPTD_CD,Latitude,Longitude
0,39468181,02/20/2008,07:00:00,BROOKLYN,88.0,BURGLARY,FELONY,COMPLETED,40.692464,-73.972708
1,50539499,08/21/2008,22:00:00,MANHATTAN,19.0,GRAND LARCENY,FELONY,COMPLETED,40.771341,-73.953418
2,45223390,04/03/2008,03:35:00,BROOKLYN,77.0,FELONY ASSAULT,FELONY,COMPLETED,40.671245,-73.926713
3,50594658,08/19/2008,09:00:00,MANHATTAN,32.0,PETIT LARCENY,MISDEMEANOR,COMPLETED,40.813412,-73.943226
4,44451016,03/10/2008,22:00:00,BROOKLYN,67.0,ROBBERY,FELONY,COMPLETED,40.650142,-73.944674
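
Step 5 of the plan (size, date range, borough counts, top offense types) might look like the following sketch. `summarize` is a hypothetical helper, and the toy frame simply reuses the five rows shown in the preview above, so the numbers say nothing about the full dataset.

```python
import pandas as pd


def summarize(df, top_n=3):
    """Hypothetical helper: collect basic summary statistics for the cleaned data."""
    dates = pd.to_datetime(df["CMPLNT_FR_DT"], format="%m/%d/%Y", errors="coerce")
    return {
        "n_rows": len(df),
        "date_min": dates.min(),
        "date_max": dates.max(),
        "by_borough": df["BORO_NM"].value_counts(),
        "top_offenses": df["OFNS_DESC"].value_counts().head(top_n),
    }


# The five preview rows, reduced to the columns the summary needs.
toy = pd.DataFrame({
    "CMPLNT_FR_DT": ["02/20/2008", "08/21/2008", "04/03/2008", "08/19/2008", "03/10/2008"],
    "BORO_NM": ["BROOKLYN", "MANHATTAN", "BROOKLYN", "MANHATTAN", "BROOKLYN"],
    "OFNS_DESC": ["BURGLARY", "GRAND LARCENY", "FELONY ASSAULT", "PETIT LARCENY", "ROBBERY"],
})
summary = summarize(toy)
for key, value in summary.items():
    print(key, value, sep=": ")
```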


In [None]:
# Save the cleaned file
nypd_small.to_csv("../data/processed/nypd_clean.csv", index=False)