## Rat Inspection Data Cleaning

This notebook cleans the data in the folder scr/data/split_up_rat_inspection_data and saves the results to the folder scr/data/cleaned_rat_inspection_data. The main steps of the cleaning process are as follows.

1. We made column names lowercase, replaced spaces with underscores, and dropped any text from an opening parenthesis onward.

2. We updated the entries for borough based on the boro_code column.

3. We dropped the redundant boro_code column. The location column duplicates latitude and longitude, so we dropped it as well.

4. We observed outliers in inspection_date. Most of the data falls between 2010-01-01 and the present day, so we dropped rows with dates outside this range.

5. We observed problematic longitude and latitude entries. For entries outside New York City's boundaries, we set latitude and longitude to np.nan.

6. We saw that the zip_code entries 0, 12345, 458, and 1045 were problematic in different ways. A zip_code of 0 indicated a row with sparse location information, e.g. missing longitude or latitude and not enough other information to determine the zip code. Rows with a zip_code of 12345 almost always had sufficient longitude and latitude data to determine the zip code, so we updated these entries accordingly. The zip_code entries 458 and 1045 each appeared in only one row with insufficient information for use, so we dropped those rows.

7. After this clean-up process, we exported the data to scr/data/cleaned_rat_inspection_data. We chose to split up the data by year of the inspection for ease of use. At the very end, we quantified the missingness of the data by using missingno's matrix and heatmap.
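
Step 1 can be sketched on its own with a few made-up headers. The helper name `normalize_column` and the sample headers below are hypothetical and serve only as an illustration; the actual column names come from the CSV files.

```python
def normalize_column(name: str) -> str:
    """Drop text from the first '(' onward, trim, lowercase, and replace spaces with underscores."""
    return name.partition("(")[0].strip().lower().replace(" ", "_")

# Hypothetical raw headers, for illustration only.
raw_columns = ["INSPECTION_TYPE", "Zip Code", "Location (Lat/Long)"]
clean_columns = [normalize_column(c) for c in raw_columns]
print(clean_columns)
```

Stripping before replacing spaces matters here: otherwise the space left in front of a removed parenthesis would turn into a trailing underscore.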

In [None]:
# Import Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import os
import glob




In [None]:
# Read the rat inspection CSV files and concatenate them into one dataframe called rat_insp.

path = r'../data/split_up_rat_inspection_data' 
all_files = glob.glob(os.path.join(path , "*.csv"))
rat_insp = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)

In [None]:
display(rat_insp.sample(3)) #get a sense of what data looks like

print("Below are the columns in the dataframe.\n")
display(rat_insp.columns)

In [None]:
# Make letters lowercase, replace spaces with underscores, get rid of text after '(' etc
rat_insp.columns = [t.partition('(')[0].strip().lower().replace(' ', '_') for t in rat_insp.columns] #apply to column headers

display(rat_insp.columns)


In [None]:
# boro_code and borough appear to be redundant information.
display(rat_insp['boro_code'].value_counts())
display(rat_insp['borough'].value_counts())

In [None]:
# boro_code 9 seems to correspond to 'Unspecified' borough. 
# check if all rows with boro_code 9 have borough as 'Unspecified'.
# display() is needed because this is not the last expression in the cell.
display(rat_insp.loc[rat_insp['boro_code'] == 9, 'borough'].value_counts())

# boro_code 9 corresponds to 'Unspecified' so we set those with boro_code 9 to have borough as 'Unspecified' just to be safe. 
rat_insp.loc[rat_insp['boro_code'] == 9, 'borough'] = 'Unspecified'

# drop boro_code since we have the borough column which is more descriptive.
rat_insp.drop(columns=['boro_code'], inplace=True)

In [None]:
# Convert inspection_date to datetime.

rat_insp['inspection_date'] = pd.to_datetime(rat_insp['inspection_date']) 

In [None]:
# looks like location and latitude and longitude are also redundant. 
display(rat_insp[['location', 'latitude', 'longitude']].sample(5))
# we drop the location column.
rat_insp.drop(columns=['location'], inplace=True)

In [None]:
# Let's look at the "results" of the inspections.
rat_insp['result'].value_counts()

In [None]:
# Let's check the inspection_type column and see if there are any types of inspections that we might want to focus on or exclude.
rat_insp['inspection_type'].value_counts()

In [None]:
failed_rat_act = rat_insp[rat_insp['result'] == 'Failed for Rat Act']
failedidate = failed_rat_act.groupby(failed_rat_act['inspection_date'].dt.date).size().reset_index(name='count')
notfail = rat_insp[rat_insp['result'] != 'Failed for Rat Act']
idate = notfail.groupby(notfail['inspection_date'].dt.date).size().reset_index(name='count')


plt.figure(figsize=(35,20))
plt.plot(idate['inspection_date'], idate['count'], 'o', color="b", alpha=0.50, label='All Other Results')
plt.plot(failedidate['inspection_date'], failedidate['count'], 'o', color="r", alpha=0.50, label='Failed for Rat Activity')
plt.xlabel('Inspection Date')
plt.ylabel('Count of Inspections')
plt.legend()
plt.title('Count of Inspections Over Time (Blue = All Other Results, Red = Failed for Rat Activity)')
plt.show()

In [None]:
# Most of the data appears concentrated between 2010 and the present day.

rat_insp['inspection_date'].describe()

In [None]:
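# We keep entries between 2010-01-01 and the present day.

# Compare against a Timestamp (midnight today) rather than a formatted string.
today = pd.Timestamp.today().normalize()

# .copy() avoids SettingWithCopyWarning when the filtered frame is modified later.
rat_insp = rat_insp[(rat_insp['inspection_date'] >= '2010-01-01') & (rat_insp['inspection_date'] <= today)].copy()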

In [None]:
import plotly.figure_factory as ff


# Add a dummy column to count each row
rat_insp['dummy_count'] = 1

fig = ff.create_hexbin_mapbox(
    data_frame=rat_insp,
    lat="latitude",
    lon="longitude",
    nx_hexagon=20,             # Number of hexagons in x direction
    color="dummy_count",       # Sum of dummy_count = number of occurrences
    agg_func=np.sum,           # Sum the dummy column
    opacity=0.85,
    labels={"color": "Number of Inspections"},
)

fig.update_layout(
    mapbox_style="open-street-map",
    margin=dict(b=0, t=0, l=0, r=0),
)
fig.show()

rat_insp.drop(columns=['dummy_count'], inplace=True)


In [None]:
# The above map has points not in New York City.
display(rat_insp[['latitude', 'longitude']].describe())

In [None]:
# Let's look at the rows with the minimum and maximum latitude and longitude values to see if there are any obvious errors or outliers.
display(rat_insp[rat_insp['latitude'] == rat_insp['latitude'].min()])
display(rat_insp[rat_insp['latitude'] == rat_insp['latitude'].max()])
display(rat_insp[rat_insp['longitude'] == rat_insp['longitude'].min()])
display(rat_insp[rat_insp['longitude'] == rat_insp['longitude'].max()])

In [None]:
# For these entries, let's set their latitude and longitude values to NaN since they are likely to be errors.
rat_insp.loc[rat_insp['latitude'] == rat_insp['latitude'].min(), ['latitude', 'longitude']] = np.nan
rat_insp.loc[rat_insp['latitude'] == rat_insp['latitude'].max(), ['latitude', 'longitude']] = np.nan
rat_insp.loc[rat_insp['longitude'] == rat_insp['longitude'].min(), ['latitude', 'longitude']] = np.nan
rat_insp.loc[rat_insp['longitude'] == rat_insp['longitude'].max(), ['latitude', 'longitude']] = np.nan
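
The cell above only clears the rows holding the extreme values. An alternative is to clear every coordinate pair that falls outside a bounding box, sketched below. The box limits are rough assumed values for the five boroughs, not taken from this notebook, and `null_out_of_bounds` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Approximate bounding box around New York City (assumed values, not from the notebook).
NYC_LAT_RANGE = (40.47, 40.93)
NYC_LON_RANGE = (-74.27, -73.68)

def null_out_of_bounds(df, lat_col="latitude", lon_col="longitude"):
    """Return a copy of df with coordinate pairs outside the box replaced by NaN."""
    out = df.copy()
    inside = out[lat_col].between(*NYC_LAT_RANGE) & out[lon_col].between(*NYC_LON_RANGE)
    # A row with a missing coordinate also fails the check, so both of its
    # coordinates end up as NaN.
    out.loc[~inside, [lat_col, lon_col]] = np.nan
    return out

# Toy rows: a plausible NYC point, a point at (0, 0), and a row with a missing latitude.
toy = pd.DataFrame({
    "latitude": [40.7128, 0.0, np.nan],
    "longitude": [-74.0060, 0.0, -73.95],
})
cleaned = null_out_of_bounds(toy)
print(cleaned)
```

A rectangle is only a coarse stand-in for the city boundary, so points just across the border in neighboring areas would still pass this check.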

In [None]:
%pip install geopy pandas

In [None]:
# Let's make sure that we deal with entries with weird 'zip_code' entries.
# The weird zip-codes are 0, 458, 1045, 12345.



In [None]:
zipcodes = rat_insp['zip_code'].values
zipcodes = np.unique(zipcodes)
display(zipcodes)


In [None]:
display(rat_insp[rat_insp['zip_code'].isna()])

In [None]:
display(rat_insp[rat_insp['zip_code']== 0])

In [None]:
display(rat_insp[rat_insp['zip_code']== 458])
display(rat_insp[rat_insp['zip_code']== 1045])

# We drop the entries with 'zip_code' 458 or 1045 for lack of information.
# .copy() avoids SettingWithCopyWarning when zip codes are reassigned later.
rat_insp = rat_insp[~rat_insp['zip_code'].isin([458, 1045])].copy()

In [None]:
display(rat_insp[rat_insp['zip_code']== 12345])

In [None]:
null_coords = rat_insp[rat_insp['latitude'].isna() | rat_insp['longitude'].isna()]
display(null_coords)

In [None]:
from scipy.spatial import cKDTree

# load the zip_code data
zip_db = pd.read_csv("map_data_for_cleaning/uszips.csv")
zip_db = zip_db[['zip', 'lat', 'lng']].dropna()

# Remove invalid (NaN or inf) coordinates
zip_db = zip_db[np.isfinite(zip_db['lat']) & np.isfinite(zip_db['lng'])]

# Build KDTree
tree = cKDTree(zip_db[['lat', 'lng']].values)

def nearest_zip(lat, lon):
    """Return ZIP code nearest to a given latitude/longitude."""
    if not np.isfinite(lat) or not np.isfinite(lon):
        return pd.NA  # skip invalid coordinates
    distance, idx = tree.query([lat, lon])
    return int(zip_db.iloc[idx]['zip'])

# fix problematic rows
zip_codes_to_fix = {0, 12345}
mask = rat_insp["zip_code"].isin(zip_codes_to_fix)

# Only apply to rows with valid lat/lon
valid_mask = mask & rat_insp['latitude'].notna() & rat_insp['longitude'].notna()

rat_insp.loc[valid_mask, 'zip_code'] = rat_insp.loc[valid_mask].apply(
    lambda r: nearest_zip(r['latitude'], r['longitude']),
    axis=1
)
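
The nearest-neighbor lookup above measures distance directly in degrees. The toy sketch below shows the same idea on three rounded, illustrative centroids (not taken from uszips.csv) and adds one assumed refinement: longitude is scaled by the cosine of a reference latitude so that east-west and north-south distances are roughly comparable. The names `LON_SCALE`, `to_plane`, and `toy_nearest_zip` are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

# Illustrative ZIP centroids (rounded, hypothetical values): columns are lat, lng.
zips = np.array([10001, 11201, 10451])
centroids = np.array([
    [40.75, -73.99],
    [40.69, -73.99],
    [40.82, -73.92],
])

# Near 40.7 degrees north, one degree of longitude is shorter than one degree
# of latitude by roughly this factor (assumed reference latitude).
LON_SCALE = np.cos(np.radians(40.7))

def to_plane(latlon):
    """Scale longitude so Euclidean distance approximates ground distance locally."""
    return np.asarray(latlon, dtype=float) * np.array([1.0, LON_SCALE])

toy_tree = cKDTree(to_plane(centroids))

def toy_nearest_zip(lat, lon):
    """Return the ZIP whose centroid is closest to (lat, lon) in the scaled plane."""
    _, idx = toy_tree.query(to_plane([lat, lon]))
    return int(zips[idx])

print(toy_nearest_zip(40.70, -73.98))
```

Nearest centroid is still an approximation: a point can sit closer to a neighboring ZIP's centroid than to its own, so a polygon-based lookup would be needed for exact assignment.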

In [None]:
# Let's save the cleaned dataframe to a new csv file for future use.
# Since the dataframe is quite large, we will split it up into multiple csv files 
# based on the year of the inspection date.
for year in rat_insp['inspection_date'].dt.year.unique():
    yearly_data = rat_insp[rat_insp['inspection_date'].dt.year == year]
    yearly_data.to_csv(f'../data/cleaned_rat_inspection_data/cleaned_rat_inspection_{year}.csv', index=False)
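
A variant of the per-year export, sketched with `groupby` and an explicit `os.makedirs` so that the output folder is created if it is missing. The function name `export_by_year`, the toy dataframe, and the temporary output folder are assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

def export_by_year(df, out_dir, date_col="inspection_date"):
    """Write one CSV per calendar year of date_col and return the paths written."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    # groupby iterates over the years in sorted order and filters each year once.
    for year, yearly in df.groupby(df[date_col].dt.year):
        path = os.path.join(out_dir, f"cleaned_rat_inspection_{year}.csv")
        yearly.to_csv(path, index=False)
        written.append(path)
    return written

# Toy data, for illustration only.
toy = pd.DataFrame({
    "inspection_date": pd.to_datetime(["2010-05-01", "2010-06-01", "2011-01-15"]),
    "borough": ["Bronx", "Queens", "Brooklyn"],
})
out_dir = os.path.join(tempfile.mkdtemp(), "cleaned")
paths = export_by_year(toy, out_dir)
print([os.path.basename(p) for p in paths])
```

Rows whose date is missing are skipped by `groupby`, so they would not appear in any yearly file.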

In [None]:
msno.matrix(rat_insp)
msno.heatmap(rat_insp)