## Rat Inspection Data Cleaning and EDA

This notebook is an initial study of the rat inspection data.

In [None]:
## Importing Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

import os
import glob

In [None]:
## Import the rat inspection data from the split-up CSV files and concatenate them into one dataframe called rat_insp.

path = r'data/split_up_rat_inspection_data' 
all_files = glob.glob(os.path.join(path , "*.csv"))
rat_insp = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)
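A side note on the load above: `glob.glob` makes no promise about the order of the files it returns, so the row positions in the concatenated dataframe can differ between machines. Below is a minimal sketch of the same pattern with a fixed read order. It runs on two throwaway CSV files in a temporary folder; the file names `part_1.csv` and `part_2.csv` and their contents are made up for illustration and are not the real data files.

```python
import glob
import os
import tempfile

import pandas as pd

# Toy data only: two tiny CSV parts written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    pd.DataFrame({"RESULT": ["Passed"]}).to_csv(
        os.path.join(tmp_dir, "part_2.csv"), index=False
    )
    pd.DataFrame({"RESULT": ["Rat Activity"]}).to_csv(
        os.path.join(tmp_dir, "part_1.csv"), index=False
    )

    # sorted() fixes the read order, so row positions are stable across runs.
    part_files = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))
    combined = pd.concat((pd.read_csv(f) for f in part_files), ignore_index=True)

print(combined)
```

Whether this matters depends on whether anything downstream refers to rows by position.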

In [None]:
display(rat_insp.sample(3))  # get a sense of what the data looks like

print("Below are the columns in the dataframe.\n")
display(rat_insp.columns)

Unnamed: 0,INSPECTION_TYPE,JOB_TICKET_OR_WORK_ORDER_ID,JOB_ID,JOB_PROGRESS,BBL,BORO_CODE,BLOCK,LOT,HOUSE_NUMBER,STREET_NAME,...,BOROUGH,INSPECTION_DATE,RESULT,APPROVED_DATE,LOCATION,COMMUNITY BOARD,COUNCIL DISTRICT,CENSUS TRACT,BIN,NTA
2876440,Initial,11852107,PC7049289,1,3045580000.0,3.0,4558.0,93.0,650,SAPPHIRE STREET,...,Brooklyn,07/05/2012 10:46:14 AM,Passed,07/10/2012 08:34:56 AM,"(40.66438414972, -73.855833315954)",5.0,42.0,1220.0,3350430.0,East New York-City Line
466810,Initial,12201794,PC6205946,1,1021570000.0,1.0,2157.0,75.0,1534,ST NICHOLAS AVENUE,...,Manhattan,10/17/2016 03:31:23 PM,Failed for Rat Act,10/18/2016 01:27:05 PM,"(40.852846128948, -73.931134848875)",12.0,10.0,269.0,1063791.0,Washington Heights (North)
2913208,Initial,12582060,PC7279343,1,3018490000.0,3.0,1849.0,26.0,4,VERONA PLACE,...,Brooklyn,03/26/2019 11:40:23 AM,Passed,03/27/2019 09:29:19 AM,"(40.681104465761, -73.947659923703)",3.0,36.0,249.0,3053240.0,Bedford-Stuyvesant (West)


Below are the columns in the dataframe.



Index(['INSPECTION_TYPE', 'JOB_TICKET_OR_WORK_ORDER_ID', 'JOB_ID',
       'JOB_PROGRESS', 'BBL', 'BORO_CODE', 'BLOCK', 'LOT', 'HOUSE_NUMBER',
       'STREET_NAME', 'ZIP_CODE', 'X_COORD', 'Y_COORD', 'LATITUDE',
       'LONGITUDE', 'BOROUGH', 'INSPECTION_DATE', 'RESULT', 'APPROVED_DATE',
       'LOCATION', 'COMMUNITY BOARD', 'COUNCIL DISTRICT', 'CENSUS TRACT',
       'BIN', 'NTA'],
      dtype='str')

In [None]:
# Normalize the column headers: drop any text from '(' onward, strip surrounding whitespace,
# make letters lowercase, and replace spaces with underscores.
rat_insp.columns = [t.partition('(')[0].strip().lower().replace(' ', '_') for t in rat_insp.columns]
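To see what each step of the renaming does, here is a small sketch that applies the same expression to a plain list of header strings. The first two names come from the dataset; `" Census Tract "` and `"Location (lat, lon)"` are hypothetical headers added only to exercise the `strip` and `partition('(')` steps.

```python
# Two real headers plus two hypothetical ones that exercise strip() and partition('(').
raw_columns = [
    "INSPECTION_TYPE",
    "COMMUNITY BOARD",
    " Census Tract ",        # hypothetical: surrounding whitespace
    "Location (lat, lon)",   # hypothetical: parenthetical suffix
]

clean_columns = [
    name.partition("(")[0].strip().lower().replace(" ", "_") for name in raw_columns
]

print(clean_columns)
```

Note that `partition` runs before `strip`, so the space left in front of the parenthesis is removed rather than turned into a trailing underscore.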


In [None]:
rat_insp.info()

<class 'pandas.DataFrame'>
RangeIndex: 2990130 entries, 0 to 2990129
Data columns (total 25 columns):
 #   Column                       Dtype  
---  ------                       -----  
 0   inspection_type              str    
 1   job_ticket_or_work_order_id  int64  
 2   job_id                       str    
 3   job_progress                 int64  
 4   bbl                          float64
 5   boro_code                    float64
 6   block                        float64
 7   lot                          float64
 8   house_number                 str    
 9   street_name                  str    
 10  zip_code                     float64
 11  x_coord                      float64
 12  y_coord                      float64
 13  latitude                     float64
 14  longitude                    float64
 15  borough                      str    
 16  inspection_date              str    
 17  result                       str    
 18  approved_date                str    
 19  location   

In [None]:
# boro_code and borough appear to be redundant information. We check if the borough code corresponds to borough names.

display(rat_insp['boro_code'].value_counts())
display(rat_insp['borough'].value_counts())

boro_code
1.0    956780
3.0    886910
2.0    831929
4.0    246998
5.0     67182
9.0       330
Name: count, dtype: int64

borough
Manhattan        956780
Brooklyn         886910
Bronx            831929
Queens           246998
Staten Island     67182
Name: count, dtype: int64

In [None]:
# boro_code 9 has no counterpart in the borough counts above, so it may mark an unspecified borough.
# We check which borough values the rows with boro_code 9 have.
rat_insp[rat_insp['boro_code'] == 9]['borough'].value_counts()

Series([], Name: count, dtype: int64)
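An empty result here does not mean there are no such rows: `value_counts` leaves out missing values by default, so rows whose borough is NaN are simply not counted. The sketch below shows the difference on a toy dataframe (the four rows are made up) using `dropna=False` and an explicit `isna()` count.

```python
import numpy as np
import pandas as pd

# Toy rows only: two ordinary boroughs and two code-9 rows with a missing borough.
toy = pd.DataFrame(
    {
        "boro_code": [1.0, 3.0, 9.0, 9.0],
        "borough": ["Manhattan", "Brooklyn", np.nan, np.nan],
    }
)

code_9_borough = toy.loc[toy["boro_code"] == 9, "borough"]

default_counts = code_9_borough.value_counts()              # NaN rows are skipped
counts_with_nan = code_9_borough.value_counts(dropna=False)  # NaN rows are counted
n_missing = int(code_9_borough.isna().sum())

print(default_counts)
print(counts_with_nan)
print(f"Rows with boro_code 9 and a missing borough: {n_missing}")
```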

In [None]:
# Let's set the rows with boro_code 9 to have borough 'Unspecified' just to be safe.
rat_insp.loc[rat_insp['boro_code'] == 9, 'borough'] = 'Unspecified'

In [None]:
# Now, we drop boro_code since we have the borough column which is more descriptive.
rat_insp.drop(columns=['boro_code'], inplace=True)

In [None]:
# It looks like location, latitude, and longitude are also redundant. We check if the location corresponds to the lat and long values.
display(rat_insp[['location', 'latitude', 'longitude']].sample(5))

Unnamed: 0,location,latitude,longitude
989608,"(40.830206408828, -73.91696797985)",40.830206,-73.916968
1274127,"(40.687051215297, -73.906870517922)",40.687051,-73.906871
1882059,"(40.871590788488, -73.905378075553)",40.871591,-73.905378
1344629,"(40.811849155158, -73.943380606722)",40.811849,-73.943381
17815,"(40.833927859298, -73.912702708301)",40.833928,-73.912703


In [None]:
# To-do: write a code block that drops the latitude and longitude columns. Before dropping, make sure
# both lat and long are present, and check that they match location to be safe.
# Location is more descriptive than lat and long, so we will keep location and drop lat and long.
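One possible shape for this to-do, sketched on a toy dataframe so it does not touch `rat_insp`. The assumptions are ours: `location` always has the form `"(lat, lon)"`, a match means agreement within an absolute tolerance of `1e-6` degrees, and the names `parsed`, `mismatched`, `lost_info`, and `safe_to_drop` are hypothetical helpers. The columns are dropped only if no row disagrees and no row would lose coordinates that `location` lacks.

```python
import numpy as np
import pandas as pd

# Toy rows only; the last row has no coordinates at all.
toy = pd.DataFrame(
    {
        "location": [
            "(40.830206408828, -73.91696797985)",
            "(40.687051215297, -73.906870517922)",
            None,
        ],
        "latitude": [40.830206408828, 40.687051215297, np.nan],
        "longitude": [-73.91696797985, -73.906870517922, np.nan],
    }
)

# ASSUMPTION: location is formatted as "(lat, lon)"; pull out the two numbers.
number = r"(-?\d+(?:\.\d+)?)"
parsed = toy["location"].str.extract(rf"\(\s*{number}\s*,\s*{number}\s*\)")
parsed = parsed.apply(pd.to_numeric)
parsed.columns = ["lat_from_location", "lon_from_location"]

# ASSUMPTION: agreement within 1e-6 degrees counts as a match (rtol=0 keeps it absolute).
lat_ok = np.isclose(
    parsed["lat_from_location"].to_numpy(), toy["latitude"].to_numpy(), rtol=0, atol=1e-6
)
lon_ok = np.isclose(
    parsed["lon_from_location"].to_numpy(), toy["longitude"].to_numpy(), rtol=0, atol=1e-6
)
matches = pd.Series(lat_ok & lon_ok, index=toy.index)

# Rows where all three fields are present but the values disagree.
all_present = toy[["location", "latitude", "longitude"]].notna().all(axis=1)
mismatched = all_present & ~matches

# Rows where dropping lat/long would lose information that location does not hold.
lost_info = toy["location"].isna() & toy[["latitude", "longitude"]].notna().any(axis=1)

safe_to_drop = not mismatched.any() and not lost_info.any()
reduced = toy.drop(columns=["latitude", "longitude"]) if safe_to_drop else toy

print(f"Mismatched rows: {int(mismatched.sum())}, rows that would lose info: {int(lost_info.sum())}")
print(reduced.columns.tolist())
```

On the real data the tolerance may need adjusting, and any mismatched rows are worth inspecting before deciding which column to trust.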


In [None]:
# Here, we drop a lot of the extra columns we might not need for our analysis. 
# We can always add them back in later if we find that we need them.

# It looks like job_ticket_or_work_order_id, job_id, and job_progress are all related to the same thing. 
# It also looks like x_coord, y_coord, community board, council district, and census tract are all related to location.
# We drop all of these.

rat_insp.drop(columns=['job_ticket_or_work_order_id', 'job_id', 'job_progress', 'x_coord', 'y_coord', 'community_board', 'council_district', 'census_tract'], inplace=True)

# We might also want to drop house_number and street_name, depending on what we focus on.

rat_insp.drop(columns=['house_number', 'street_name'], inplace=True)

# Same for block, lot, and nta.

rat_insp.drop(columns=['block', 'lot', 'nta'], inplace=True)

# We also probably won't be using bbl for anything.

rat_insp.drop(columns=['bbl'], inplace=True)

# Same for bin.

rat_insp.drop(columns=['bin'], inplace=True)




In [None]:
rat_insp.info()

<class 'pandas.DataFrame'>
RangeIndex: 2990130 entries, 0 to 2990129
Data columns (total 9 columns):
 #   Column           Dtype  
---  ------           -----  
 0   inspection_type  str    
 1   zip_code         float64
 2   latitude         float64
 3   longitude        float64
 4   borough          str    
 5   inspection_date  str    
 6   result           str    
 7   approved_date    str    
 8   location         str    
dtypes: float64(3), str(6)
memory usage: 501.0 MB


In [None]:
rat_insp.sample(4)

Unnamed: 0,inspection_type,zip_code,latitude,longitude,borough,inspection_date,result,approved_date,location
2706157,Initial,10021.0,40.769218,-73.958064,Manhattan,06/27/2012 10:51:50 AM,Passed,07/02/2012 11:54:37 AM,"(40.769218494125, -73.958063661984)"
1908822,Compliance,10461.0,40.851759,-73.832226,Bronx,09/17/2025 12:37:33 PM,Passed,09/19/2025 10:18:22 AM,"(40.851758934858, -73.832226323242)"
2543539,Initial,10128.0,40.784481,-73.949734,Manhattan,03/23/2017 11:15:11 AM,Passed,04/04/2017 09:58:28 AM,"(40.784481474253, -73.949734145375)"
1942261,Initial,10467.0,40.87827,-73.880424,Bronx,09/30/2025 10:31:43 AM,Passed,10/01/2025 08:18:38 AM,"(40.878270164636, -73.880424201122)"


In [None]:
# Let's look at the "results" of the inspections.

rat_insp['result'].value_counts()

# "Failed for Other R" seems to be irrelevant if we are focused on inspections involving rats.
# "Bait applied" could indicate that there were rats, but it could also indicate that there were just signs of rats.
# We will keep it for now and see if we can find more information about it later.

# It is not clear what "Stoppage done" and "Cleanup done" mean. We need to look into this more later as well.

# To-do: Clean up this column based on what we intend to do with the data later.

result
Passed                1761938
Bait applied           410256
Failed for Rat Act     289592
Failed for Other R     251783
Rat Activity           214102
Monitoring visit        46892
Stoppage done           12153
Cleanup done             3389
Name: count, dtype: int64
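As a starting point for that clean-up, here is a minimal sketch on toy rows. It rests on two assumptions that still need to be confirmed against the dataset documentation: that "Failed for Other R" can be excluded from a rat-focused analysis, and that "Rat Activity" and "Failed for Rat Act" are the labels recording observed rat activity. The names `EXCLUDED_LABELS`, `RAT_ACTIVITY_LABELS`, and `rat_activity_flag` are hypothetical.

```python
import pandas as pd

# Toy rows only, using labels that appear in the result column.
toy = pd.DataFrame(
    {
        "result": [
            "Passed",
            "Rat Activity",
            "Failed for Other R",
            "Bait applied",
            "Failed for Rat Act",
        ]
    }
)

# ASSUMPTION: this label is unrelated to rats and can be left out.
EXCLUDED_LABELS = {"Failed for Other R"}
# ASSUMPTION: these labels are the ones that record observed rat activity.
RAT_ACTIVITY_LABELS = {"Rat Activity", "Failed for Rat Act"}

rat_focused = toy[~toy["result"].isin(EXCLUDED_LABELS)].copy()
rat_focused["rat_activity_flag"] = rat_focused["result"].isin(RAT_ACTIVITY_LABELS)

print(rat_focused)
```

"Bait applied" is left unflagged here only because its meaning is still open, not because it is known to mean no rats.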

In [None]:
# To-do: Clean up the "inspection_type" column as well.
# Let's check the inspection_type column to see whether there are any types of inspections we might want to focus on or exclude.
rat_insp['inspection_type'].value_counts()

inspection_type
Initial       2059794
Compliance     457646
Treatments     457148
Stoppage        12153
Clean Ups        3389
Name: count, dtype: int64
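Some of these counts line up with the result counts: Stoppage (12153) equals "Stoppage done", Clean Ups (3389) equals "Cleanup done", and Treatments (457148) equals "Bait applied" plus "Monitoring visit" (410256 + 46892). That suggests, but does not prove, that some result values are tied to a single inspection type. A cross-tabulation would confirm it; the sketch below shows the idea on toy rows, not the real data.

```python
import pandas as pd

# Toy rows only: each non-Initial type is paired with the results it appears to produce.
toy = pd.DataFrame(
    {
        "inspection_type": [
            "Initial", "Initial", "Treatments", "Treatments", "Stoppage", "Clean Ups",
        ],
        "result": [
            "Passed", "Rat Activity", "Bait applied", "Monitoring visit",
            "Stoppage done", "Cleanup done",
        ],
    }
)

# Rows are inspection types, columns are results, cells are row counts.
type_by_result = pd.crosstab(toy["inspection_type"], toy["result"])

# Number of distinct results seen for each inspection type.
results_per_type = toy.groupby("inspection_type")["result"].nunique()

print(type_by_result)
print(results_per_type)
```

If the real cross-tabulation has the same block structure, one of the two columns may be enough for the Stoppage and Clean Ups rows.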
