## Rat Inspection Data Cleaning and EDA

This notebook is an initial study of the rat inspection data.

In [137]:
## Importing Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

import os
import glob

In [138]:
## Import the rat inspection data from the split-up CSV files and concatenate them into one dataframe called rat_insp.

path = r'data/split_up_rat_inspection_data'
all_files = glob.glob(os.path.join(path, "*.csv"))
rat_insp = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)
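
`glob.glob` returns files in an arbitrary, filesystem-dependent order, so the row order of the concatenated dataframe can differ between machines. A minimal sketch of a loader that sorts the file list and fails early on an empty folder; the helper name `load_csv_folder` is hypothetical and not part of the original notebook.

```python
import glob
import os

import pandas as pd


def load_csv_folder(folder):
    """Read every CSV in `folder`, in filename order, into one DataFrame."""
    files = sorted(glob.glob(os.path.join(folder, "*.csv")))
    if not files:
        # Fail with a clear message instead of letting pd.concat raise on an empty input.
        raise FileNotFoundError(f"No CSV files found in {folder!r}")
    return pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
```

Under these assumptions, `rat_insp = load_csv_folder('data/split_up_rat_inspection_data')` would replace the three loading lines above.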

In [139]:
display(rat_insp.sample(3))  # get a sense of what the data looks like

print("Below are the columns in the dataframe.\n")
display(rat_insp.columns)

Unnamed: 0,INSPECTION_TYPE,JOB_TICKET_OR_WORK_ORDER_ID,JOB_ID,JOB_PROGRESS,BBL,BORO_CODE,BLOCK,LOT,HOUSE_NUMBER,STREET_NAME,...,BOROUGH,INSPECTION_DATE,RESULT,APPROVED_DATE,LOCATION,COMMUNITY BOARD,COUNCIL DISTRICT,CENSUS TRACT,BIN,NTA
2452182,Compliance,13757247,PC8123827,7,2025250000.0,2.0,2525.0,19.0,1046,SUMMIT AVENUE,...,Bronx,11/21/2023 11:00:06 AM,Failed for Rat Act,11/29/2023 08:41:32 AM,"(40.834252354227, -73.929596635982)",4.0,16.0,189.0,2003463.0,Highbridge
1352527,Treatments,2743969,PC7251335,4,3032340000.0,3.0,3234.0,30.0,166,WILSON AVENUE,...,Brooklyn,06/02/2019 08:45:14 AM,Monitoring visit,07/01/2019 11:09:36 AM,"(40.699934361835, -73.924401472105)",4.0,37.0,423.0,3073532.0,Bushwick (West)
2216866,Treatments,2794536,PC7625785,4,3017930000.0,3.0,1793.0,57.0,318,NOSTRAND AVENUE,...,Brooklyn,11/23/2020 12:10:03 PM,Monitoring visit,11/30/2020 10:30:38 AM,"(40.688843579377, -73.95127410267)",3.0,36.0,243.0,3414291.0,Bedford-Stuyvesant (West)


Below are the columns in the dataframe.



Index(['INSPECTION_TYPE', 'JOB_TICKET_OR_WORK_ORDER_ID', 'JOB_ID',
       'JOB_PROGRESS', 'BBL', 'BORO_CODE', 'BLOCK', 'LOT', 'HOUSE_NUMBER',
       'STREET_NAME', 'ZIP_CODE', 'X_COORD', 'Y_COORD', 'LATITUDE',
       'LONGITUDE', 'BOROUGH', 'INSPECTION_DATE', 'RESULT', 'APPROVED_DATE',
       'LOCATION', 'COMMUNITY BOARD', 'COUNCIL DISTRICT', 'CENSUS TRACT',
       'BIN', 'NTA'],
      dtype='str')

In [140]:
# Normalize the column headers: drop everything from the first '(' onward, strip whitespace,
# make letters lowercase, and replace spaces with underscores.
rat_insp.columns = [t.partition('(')[0].strip().lower().replace(' ', '_') for t in rat_insp.columns]
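
To see what the header normalization does step by step, here is a small standalone sketch. The function name `clean_column_name` is hypothetical, and `"Area (sq ft)"` is a made-up header used only to show the effect of `partition('(')`; the other two names come from this dataset.

```python
def clean_column_name(name):
    """Apply the same chain of string operations used on the column headers."""
    before_paren = name.partition("(")[0]   # keep text before the first '('
    return before_paren.strip().lower().replace(" ", "_")


raw = ["COMMUNITY BOARD", "INSPECTION_DATE", "Area (sq ft)"]
cleaned = [clean_column_name(c) for c in raw]
print(cleaned)  # ['community_board', 'inspection_date', 'area']
```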


In [141]:
rat_insp.info()

<class 'pandas.DataFrame'>
RangeIndex: 2990130 entries, 0 to 2990129
Data columns (total 25 columns):
 #   Column                       Dtype  
---  ------                       -----  
 0   inspection_type              str    
 1   job_ticket_or_work_order_id  int64  
 2   job_id                       str    
 3   job_progress                 int64  
 4   bbl                          float64
 5   boro_code                    float64
 6   block                        float64
 7   lot                          float64
 8   house_number                 str    
 9   street_name                  str    
 10  zip_code                     float64
 11  x_coord                      float64
 12  y_coord                      float64
 13  latitude                     float64
 14  longitude                    float64
 15  borough                      str    
 16  inspection_date              str    
 17  result                       str    
 18  approved_date                str    
 19  location   

In [142]:
# boro_code and borough appear to be redundant information. We check if the borough code corresponds to borough names.

display(rat_insp['boro_code'].value_counts())
display(rat_insp['borough'].value_counts())

boro_code
1.0    956780
3.0    886910
2.0    831929
4.0    246998
5.0     67182
9.0       330
Name: count, dtype: int64

borough
Manhattan        956780
Brooklyn         886910
Bronx            831929
Queens           246998
Staten Island     67182
Name: count, dtype: int64
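
Matching counts suggest that each code maps to one borough name, but they do not prove it. A minimal sketch of a more direct check, shown on a toy dataframe (the values in `toy` are made up for illustration): list the distinct code/name pairs and count how many names each code has.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "boro_code": [1.0, 1.0, 2.0, 3.0, 9.0],
    "borough": ["Manhattan", "Manhattan", "Bronx", "Brooklyn", np.nan],
})

# Distinct (code, name) pairs; rows with a missing name are kept.
pairs = (
    toy[["boro_code", "borough"]]
    .drop_duplicates()
    .sort_values("boro_code")
    .reset_index(drop=True)
)

# Number of distinct non-missing names per code; 1 means a clean mapping, 0 means no name at all.
names_per_code = toy.groupby("boro_code")["borough"].nunique()
print(pairs)
print(names_per_code)
```

On the real data, the same two expressions with `rat_insp` in place of `toy` would show whether any code has more than one name.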

In [143]:
# boro_code 9 seems to correspond to 'Unspecified' borough. We check if all rows with boro_code 9 have borough as 'Unspecified'.
rat_insp[rat_insp['boro_code'] == 9]['borough'].value_counts()

Series([], Name: count, dtype: int64)

In [144]:
# The empty result above means borough is missing (NaN) for every row with boro_code 9, since value_counts() skips NaN by default.
# Let's set borough to 'Unspecified' for those rows.
rat_insp.loc[rat_insp['boro_code'] == 9, 'borough'] = 'Unspecified'
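
`value_counts()` drops missing values by default, so a column that is entirely NaN yields an empty Series. A short sketch on a toy dataframe (values made up for illustration) showing how `dropna=False` or `isna()` makes the missing entries visible:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "boro_code": [1.0, 9.0, 9.0, 2.0],
    "borough": ["Manhattan", np.nan, np.nan, "Bronx"],
})

code_9 = toy.loc[toy["boro_code"] == 9, "borough"]

default_counts = code_9.value_counts()           # NaN is dropped, so this is empty
full_counts = code_9.value_counts(dropna=False)  # NaN is counted as its own entry
n_missing = code_9.isna().sum()                  # number of missing borough names
```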

In [145]:
# Now, we drop boro_code since we have the borough column which is more descriptive.
rat_insp.drop(columns=['boro_code'], inplace=True)

In [146]:
# It looks like location is redundant with latitude and longitude. We check if the location corresponds to the lat and long values.
display(rat_insp[['location', 'latitude', 'longitude']].sample(5))

Unnamed: 0,location,latitude,longitude
1151981,"(40.690924611242, -73.935586619693)",40.690925,-73.935587
2840213,"(40.836282543231, -73.912305694118)",40.836283,-73.912306
12489,"(40.852344787629, -73.916043778358)",40.852345,-73.916044
2129580,"(40.759427780238, -73.906588448085)",40.759428,-73.906588
620932,"(40.695438076851, -73.948542968487)",40.695438,-73.948543


In [147]:
# To-Do: write a code block to drop the latitude and longitude columns.
# Before dropping, make sure both lat/lon are present, and check that they match location to be safe.
# Location is more descriptive than lat and long, so we will keep location and drop lat and long.
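
One possible way to carry out this To-Do is sketched below; it is not the notebook's implementation. It assumes `location` strings look like `"(lat, lon)"` as in the sample above, and the function name `location_matches_lat_lon` and the tolerance `tol=1e-6` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def location_matches_lat_lon(df, tol=1e-6):
    """Return a boolean Series: True where location agrees with latitude/longitude."""
    # Pull the two numbers out of strings such as "(40.69, -73.93)".
    parsed = df["location"].str.extract(
        r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)"
    )
    lat = pd.to_numeric(parsed[0], errors="coerce").astype("float64").to_numpy()
    lon = pd.to_numeric(parsed[1], errors="coerce").astype("float64").to_numpy()
    lat_ok = np.isclose(lat, df["latitude"].to_numpy(dtype="float64"), atol=tol, rtol=0)
    lon_ok = np.isclose(lon, df["longitude"].to_numpy(dtype="float64"), atol=tol, rtol=0)
    # Rows where all three fields are missing are treated as consistent.
    all_missing = df["location"].isna() & df["latitude"].isna() & df["longitude"].isna()
    return pd.Series(lat_ok & lon_ok, index=df.index) | all_missing


# Toy data (made up): the second row disagrees, the third is missing everywhere.
toy = pd.DataFrame({
    "location": ["(40.690924611242, -73.935586619693)", "(40.5, -73.9)", None],
    "latitude": [40.690924611242, 40.6, np.nan],
    "longitude": [-73.935586619693, -73.9, np.nan],
})
matches = location_matches_lat_lon(toy)

# Drop the numeric columns only when every row agrees.
if matches.all():
    toy = toy.drop(columns=["latitude", "longitude"])
```

Here the mismatch in the second row keeps the columns in place; inspecting `toy[~matches]` would show which rows need attention before dropping anything.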


In [148]:
# Here, we drop a lot of the extra columns we might not need for our analysis. 
# We can always add them back in later if we find that we need them.

# It looks like job_ticket_or_work_order_id, job_id, and job_progress are all related to the same thing. 
# It also looks like x_coord, y_coord, community board, council district, and census tract are all related to location.
# We drop all of these.

rat_insp.drop(columns=['job_ticket_or_work_order_id', 'job_id', 'job_progress', 'x_coord', 'y_coord', 'community_board', 'council_district', 'census_tract'], inplace=True)

# We might also want to drop house_number and street_name, depending on what we focus on.

rat_insp.drop(columns=['house_number', 'street_name'], inplace=True)

# Same for block, lot, and nta.

rat_insp.drop(columns=['block', 'lot', 'nta'], inplace=True)

# We also probably won't be using bbl for anything.

rat_insp.drop(columns=['bbl'], inplace=True)

# Same for bin.

rat_insp.drop(columns=['bin'], inplace=True)




In [149]:
rat_insp.info()

<class 'pandas.DataFrame'>
RangeIndex: 2990130 entries, 0 to 2990129
Data columns (total 9 columns):
 #   Column           Dtype  
---  ------           -----  
 0   inspection_type  str    
 1   zip_code         float64
 2   latitude         float64
 3   longitude        float64
 4   borough          str    
 5   inspection_date  str    
 6   result           str    
 7   approved_date    str    
 8   location         str    
dtypes: float64(3), str(6)
memory usage: 501.0 MB


In [150]:
rat_insp.sample(4)

Unnamed: 0,inspection_type,zip_code,latitude,longitude,borough,inspection_date,result,approved_date,location
914846,Initial,10459.0,40.829448,-73.892881,Bronx,02/26/2016 04:22:54 PM,Passed,03/01/2016 11:54:57 AM,"(40.829448139376, -73.8928813085)"
359576,Initial,10031.0,40.824785,-73.948623,Manhattan,03/29/2011 03:26:08 PM,Passed,03/30/2011 10:54:31 AM,"(40.824784850775, -73.948623221328)"
1375658,Treatments,10028.0,40.775732,-73.952748,Manhattan,12/13/2017 12:10:13 PM,Bait applied,12/13/2017 03:09:45 PM,"(40.775732479232, -73.952748385486)"
667677,Initial,11429.0,40.709256,-73.746549,Queens,03/02/2018 08:30:57 AM,Failed for Other R,03/05/2018 10:41:24 AM,"(40.709255696835, -73.746549267391)"


In [151]:
# Let's look at the "result" column of the inspections.

rat_insp['result'].value_counts()

# "Failed for Other R" seems to be irrelevant if we are focused on inspections involving rats.
# "Bait applied" could indicate that there were rats, but it could also indicate that there were only signs of rats.
# We will keep it for now and see if we can find more information about it later.

# It is not clear what "Stoppage done" and "Cleanup done" mean. We need to look into this later as well.

# To-Do: Clean up this column based on what we intend to do with the data later.

result
Passed                1761938
Bait applied           410256
Failed for Rat Act     289592
Failed for Other R     251783
Rat Activity           214102
Monitoring visit        46892
Stoppage done           12153
Cleanup done             3389
Name: count, dtype: int64

In [152]:
# To-Do: Clean up the "inspection_type" column as well.
# Let's check the inspection_type column and see if there are any types of inspections that we might want to focus on or exclude.
rat_insp['inspection_type'].value_counts()

inspection_type
Initial       2059794
Compliance     457646
Treatments     457148
Stoppage        12153
Clean Ups        3389
Name: count, dtype: int64
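
The counts for the "Stoppage" (12153) and "Clean Ups" (3389) inspection types equal the counts for the "Stoppage done" and "Cleanup done" results, which hints that these results may belong to those inspection types. A cross-tabulation would confirm or refute this. A minimal sketch on toy data (the rows in `toy` are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "inspection_type": ["Initial", "Initial", "Stoppage", "Clean Ups", "Treatments"],
    "result": ["Passed", "Rat Activity", "Stoppage done", "Cleanup done", "Bait applied"],
})

# Rows are inspection types, columns are results, cells are row counts.
type_by_result = pd.crosstab(toy["inspection_type"], toy["result"])
print(type_by_result)
```

On the real data, `pd.crosstab(rat_insp['inspection_type'], rat_insp['result'])` would show which results occur under each inspection type, which may help decide how to clean both columns.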