## Summary
This dataset contains the results of health inspections conducted by the Department of Public Health from 2024 to the present. It includes the name and location of each facility inspected, the facility status after the inspection (Pass, Conditional Pass, or Closure), and the violations observed.


## Description of our dataset
* **inspection_date** (Floating Timestamp): Date and time when the inspection occurred.
* **inspector** (Text): Name of the inspector who conducted the inspection.
* **district** (Text): District in which the inspection took place.
* **subdistrict** (Text): Sub-district where the inspection was performed.
* **subsector** (Text): Specific sub-sector of the inspection area.
* **permit_number** (Text): Permit number associated with the facility, if applicable.
* **dba** (Text): “Doing Business As” name of the facility. Public trade name of the establishment.
* **permit_type** (Text): Type of permit held by the facility.
* **street_address** (Text): Street address of the facility.
* **street_address_clean** (Text): Cleaned and standardized street address.
* **inspection_type** (Text): Type/category of the inspection conducted.
* **inspection_frequency_type** (Text): Frequency classification of inspections.
* **total_time** (Number): Total duration of the inspection in minutes.
* **facility_rating_status** (Text): Rating status of the facility after inspection.
* **census** (Text): Census information linked to the facility location.
* **suspension_notes** (Text): Notes regarding any permit suspensions.
* **inspection_notes** (Text): Additional notes recorded during the inspection.
* **violation_count** (Number): Total number of violations observed.
* **violation_codes** (Text): Codes corresponding to the observed violations.
* **latitude** (Number): Latitude coordinate of the facility.
* **longitude** (Number): Longitude coordinate of the facility.
* **point** (Point): Geospatial point combining latitude and longitude.
* **analysis_neighborhood** (Text): Neighborhood used for analysis purposes.
* **supervisor_district** (Number): Supervisor district number for administrative purposes.
* **data_as_of** (Floating Timestamp): Date when the dataset was last updated in the source system.
* **data_loaded_at** (Floating Timestamp): Timestamp when the data was uploaded to the open data portal.


In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

# from sklearn.preprocessing import StandardScaler

# from sklearn.linear_model import LogisticRegression
# from sklearn.neighbors import KNeighborsClassifier 
# from sklearn.ensemble import RandomForestClassifier,BaggingClassifier
# from sklearn.tree import DecisionTreeClassifier
# from xgboost import XGBClassifier

# from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedShuffleSplit, cross_validate
# from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_recall_curve, f1_score,roc_auc_score, roc_curve, precision_recall_fscore_support
# from sklearn.inspection import PartialDependenceDisplay
# from sklearn import tree

# from treeinterpreter import treeinterpreter
# from waterfall_chart import plot as waterfall

# from imblearn.over_sampling import SMOTE,ADASYN,RandomOverSampler,BorderlineSMOTE,SVMSMOTE
# from imblearn.under_sampling import RandomUnderSampler

In [2]:
pd.set_option('display.max_columns',200)
pd.set_option('display.max_rows',200)
pd.set_option('display.max_colwidth',200)

In [3]:
# os.getcwd()
df_org = pd.read_csv("../data/clean/sfData_cleaned.csv")


In [4]:
df_org

Unnamed: 0,inspection_date,inspector,district,subdistrict,subsector,permit_number,dba,permit_type,street_address,street_address_clean,inspection_type,inspection_frequency_type,total_time,facility_rating_status,census,suspension_notes,inspection_notes,violation_count,violation_codes,latitude,longitude,point,analysis_neighborhood,supervisor_district,data_as_of,data_loaded_at,canonical_address
0,2025-04-23,Michael Mooney,1,103,607,06734928,Surfside - Walk Thru,H36 - STADIUM CONCESSIONS (PERM),24 WILLIE MAYS PLZ # PROMEN,3RD ST & KING ST,Routine,1,10,Pass,STADIUM,,,,,37.778130,-122.391855,POINT (-122.391855 37.77813),Mission Bay,6.0,2025-07-01 10:09:15,2025-10-30 02:37:07,3rd street king street
1,2025-04-23,Michael Mooney,2,201,106,06735187,HARBOR EMPEROR,H33 - COMMISSARIES,41 EMBARCADERO,41 EMBARCADERO,Routine,2,60,Conditional Pass,106,,,4.0,"113953(c), 114163(a)(3), 114189, 114192.1, 114195, 114279 - Immediately provide hot running potable water of at least 120°F or greater for the entire food preparation facility., 113953, 113953.1,...",37.787126,-122.387925,POINT (-122.387924588 37.787126305),Financial District/South Beach,6.0,2025-07-01 10:09:15,2025-10-30 02:37:07,41 embarcadero
2,2025-04-23,Patrick Wood,1,101,176A,06743776,RA @ BLOOMBERG FLOOR 22,H85 - EMPLOYEE CAFETERIA LIMITED FOOD PREP,140 NEW MONTGOMERY ST FL 22,140 NEW MONTGOMERY ST,New Ownership (I),1,30,Pass,176A,,,,,37.786617,-122.400018,POINT (-122.400018 37.786617),Financial District/South Beach,6.0,2025-07-01 10:09:15,2025-10-30 02:37:07,140 new montgomery street
3,2025-04-23,Rochelle Veloso,5,502,401,102419,MOKUKU,"H26 - RESTAURANT OVER 2,000 SQFT",332 CLEMENT ST,332 CLEMENT ST,Site Visit,2,30,Pass,401,,,,,37.783269,-122.462932,POINT (-122.4629325 37.783269),Inner Richmond,1.0,2025-07-01 10:09:15,2025-10-30 02:37:07,332 clement street
4,2025-04-23,Sojeatta Khim,1,103,607,06734548,STEEP CREAMERY & TEA - SECTION 110,H36 - STADIUM CONCESSIONS (PERM),24 WILLIE MAYS PLZ,3RD ST & KING ST,Routine,1,10,Pass,STADIUM,,,,,37.778130,-122.391855,POINT (-122.391855 37.77813),Mission Bay,6.0,2025-07-01 10:09:15,2025-10-30 02:37:07,3rd street king street
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19051,2025-10-29,Michael Mooney,4,404,164,87648,,"H25 - RESTAURANT 1,000 - 2,000 SQFT",762 DIVISADERO,762 DIVISADERO,,1,45,,,,,6.0,"113947.1-113947.6, 113948 - Obtain a state approved food safety certification course within 60 days of operation or when a Food Safety Manager has left the facility. Maintain copies of food safety...",37.776511,-122.438032,POINT (-122.438031803 37.776511475),Hayes Valley,5.0,2025-10-30 02:00:05,2025-10-30 02:37:07,762 divisadero
19052,2024-10-29,Amy Johnson,No District,No Sub District,No Sub Sector,102395,PIZZALICIOUS LLC,"H24 - RESTAURANT UNDER 1,000 SQFT",1210 POLK E ST,1210 POLK E ST,Routine,1,75,Pass,167,,,7.0,"114143(d), 114266, 114268, 114268.1, 114271, 114272 - Provide walls / ceilings using materials that are durable, smooth, nonabsorbent, light-colored, and washable surfaces. All floor surfaces, oth...",,,,,,2025-07-01 10:09:15,2025-10-30 02:37:07,1210 polk e street
19053,2024-09-19,Danny Nguyen,2,202,179T,06733434,DA BOOT TO DA BAY,H79 - MOBILE FOOD FACILITY CLASS 5,900 AVENUE ST BLDG D,900 AVENUE ST BLDG D,Reinspection,1,20,Pass,MFF,,,,,,,,,,2025-07-01 10:09:15,2025-10-30 02:37:07,900 avenue street bldg d
19054,2024-08-19,Michael Mooney,No District,No Sub District,No Sub Sector,H0306732464,MV ARGO,H07 - RETAIL MKTS W/FOOD PREP (UNDER 5001),PIER FISHERMANS WHARF PIER,PIER FISHERMANS WHARF PIER,Routine,1,25,Pass,106,,,2.0,"113953, 113953.1, 113953.2, 114067(f) - Provide soap and single-use towels in dispensers, or a drying device at each handwash sink at all times. Maintain all handwash sinks unobstructed and access...",,,,,,2025-07-01 10:09:15,2025-10-30 02:37:07,pier fishermans wharf pier


In [5]:
df_org.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19056 entries, 0 to 19055
Data columns (total 27 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   inspection_date            19056 non-null  object 
 1   inspector                  19056 non-null  object 
 2   district                   19054 non-null  object 
 3   subdistrict                19042 non-null  object 
 4   subsector                  19035 non-null  object 
 5   permit_number              19038 non-null  object 
 6   dba                        18148 non-null  object 
 7   permit_type                19056 non-null  object 
 8   street_address             19056 non-null  object 
 9   street_address_clean       19056 non-null  object 
 10  inspection_type            16673 non-null  object 
 11  inspection_frequency_type  18319 non-null  object 
 12  total_time                 18980 non-null  object 
 13  facility_rating_status     16533 non-null  obj

In [6]:
df_org.describe(include='all')

Unnamed: 0,inspection_date,inspector,district,subdistrict,subsector,permit_number,dba,permit_type,street_address,street_address_clean,inspection_type,inspection_frequency_type,total_time,facility_rating_status,census,suspension_notes,inspection_notes,violation_count,violation_codes,latitude,longitude,point,analysis_neighborhood,supervisor_district,data_as_of,data_loaded_at,canonical_address
count,19056,19056,19054.0,19042.0,19035.0,19038.0,18148,19056,19056,19056,16673,18319.0,18980.0,16533,17622.0,0.0,0.0,10184.0,10184,18919.0,18919.0,18919,18874,18874.0,19056,19056,19056
unique,483,52,8.0,41.0,160.0,8102.0,6857,70,6813,6332,14,3.0,204.0,3,164.0,,,,6416,,,5096,41,,88,1,5871
top,2025-04-23,Michael Mooney,2.0,103.0,607.0,18306.0,BON APPETIT MANAGEMENT CO,"H25 - RESTAURANT 1,000 - 2,000 SQFT",49 S Van Ness Ave 7th Floor,3RD ST & KING ST,Routine,2.0,60.0,Pass,607.0,,,,"114259, 114259.1, 114259.4, 114259.5 - Eliminate the infestation/activity of cockroaches/rodents/flies/vermin from the food facility by using only approved methods. Remove all evidence of the infe...",,,POINT (-122.391855 37.77813),Mission,,2025-07-01 10:09:15,2025-10-30 02:37:07,3rd street king street
freq,197,1180,4268.0,1345.0,958.0,41.0,42,4269,207,243,9268,10487.0,3740.0,15575,553.0,,,,291,,,243,2583,,18148,19056,243
mean,,,,,,,,,,,,,,,,,,4.053712,,37.772371,-122.423539,,,5.529194,,,
std,,,,,,,,,,,,,,,,,,3.315183,,0.03691,0.042882,,,2.688655,,,
min,,,,,,,,,,,,,,,,,,1.0,,33.761824,-122.510677,,,1.0,,,
25%,,,,,,,,,,,,,,,,,,2.0,,37.760283,-122.43429,,,3.0,,,
50%,,,,,,,,,,,,,,,,,,3.0,,37.777527,-122.416182,,,6.0,,,
75%,,,,,,,,,,,,,,,,,,5.0,,37.788984,-122.40522,,,8.0,,,


In [7]:
df_org['facility_rating_status'].value_counts(normalize=True)*100

facility_rating_status
Pass                94.205528
Conditional Pass     3.477893
Closure              2.316579
Name: proportion, dtype: float64

In [8]:
# Let's check the unique values in each column.
for i in df_org.columns:
    print(i,df_org[i].unique())

inspection_date ['2025-04-23' '2025-04-22' '2025-04-21' '2025-04-18' '2025-04-17'
 '2025-04-16' '2025-04-15' '2025-04-14' '2025-04-11' '2025-04-10'
 '2025-04-09' '2025-04-08' '2025-04-07' '2025-04-04' '2025-04-03'
 '2025-04-02' '2025-04-01' '2025-05-06' '2025-03-31' '2025-03-28'
 '2025-03-27' '2025-03-26' '2025-03-25' '2025-03-24' '2025-03-21'
 '2025-03-20' '2025-03-19' '2025-03-18' '2025-03-17' '2025-03-14'
 '2025-03-13' '2025-03-12' '2025-03-11' '2025-03-10' '2025-03-07'
 '2025-03-06' '2025-03-05' '2025-03-04' '2025-03-03' '2025-02-28'
 '2025-02-27' '2025-02-26' '2025-02-25' '2025-02-24' '2025-02-21'
 '2025-02-20' '2025-02-19' '2025-02-18' '2025-02-14' '2025-02-13'
 '2025-02-12' '2025-02-11' '2025-02-10' '2025-02-07' '2025-02-06'
 '2025-02-05' '2025-02-04' '2025-02-03' '2025-01-31' '2025-01-30'
 '2025-01-29' '2025-01-28' '2025-01-27' '2025-01-24' '2025-01-23'
 '2025-01-22' '2025-01-21' '2025-01-17' '2025-01-16' '2025-01-15'
 '2025-01-14' '2025-01-13' '2025-01-11' '2025-01-10' '2025-0

street_address_clean ['3RD ST & KING ST' '41 EMBARCADERO' '140 NEW MONTGOMERY ST' ...
 '2020   LOMBARD ST' '762   DIVISADERO' '1210 POLK E ST']
inspection_type ['Routine' 'New Ownership (I)' 'Site Visit' 'Reinspection'
 'New Ownership (R)' nan 'Plan Check (I)' 'Complaint (I)' 'Plan Check (R)'
 'Foodborne Illness' 'Structural' 'Complaint (R)' 'Change of Ownership'
 'Plan Check' 'Consultation']
inspection_frequency_type ['1' '2' nan 'High Hazard']
total_time ['10' '60' '30' '15' '150' '120' '45' '105' '35' '90' '25' '55' '95' '40'
 '80' '20' '75' '50' '140' '85' '65' '70' '100' '110' '-660' '180' '130'
 '-690' '91' '135' '19' '7' '42' '4' '11' '23' '-630' '59' '-645' nan '47'
 '5' '44' '165' '61' '250' '-680' '-705' '38' '0' '138' '-695' '210'
 '-615' '765' '115' '-631' '155' '330' '185' '750' '145' '735' '125' '740'
 '-30' '34' '245' '810' '225' '270' '53' '74' '240' '220' '62' '865' '79'
 '215' '-585' '78' '-650' '108' '780' '37' '29' '-652' '41' '-710' '101'
 '8' '-675' '48' '-570' '5

In [9]:
# total_time contains negative values.

In [8]:
# Let's see the fraction of null values in each column
(df_org.isnull().sum()).sort_values(ascending=False)/len(df_org)

inspection_notes             1.000000
suspension_notes             1.000000
violation_codes              0.465575
violation_count              0.465575
facility_rating_status       0.132399
inspection_type              0.125052
census                       0.075252
dba                          0.047649
inspection_frequency_type    0.038675
analysis_neighborhood        0.009551
supervisor_district          0.009551
longitude                    0.007189
point                        0.007189
latitude                     0.007189
total_time                   0.003988
subsector                    0.001102
permit_number                0.000945
subdistrict                  0.000735
district                     0.000105
street_address_clean         0.000000
street_address               0.000000
permit_type                  0.000000
inspector                    0.000000
inspection_date              0.000000
data_as_of                   0.000000
data_loaded_at               0.000000
canonical_ad

In [9]:
# 'suspension_notes' and 'inspection_notes' are 100% null, so we can drop them.
df_org = df_org.drop(['suspension_notes','inspection_notes'],axis=1)

In [10]:
df_org.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19056 entries, 0 to 19055
Data columns (total 25 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   inspection_date            19056 non-null  object 
 1   inspector                  19056 non-null  object 
 2   district                   19054 non-null  object 
 3   subdistrict                19042 non-null  object 
 4   subsector                  19035 non-null  object 
 5   permit_number              19038 non-null  object 
 6   dba                        18148 non-null  object 
 7   permit_type                19056 non-null  object 
 8   street_address             19056 non-null  object 
 9   street_address_clean       19056 non-null  object 
 10  inspection_type            16673 non-null  object 
 11  inspection_frequency_type  18319 non-null  object 
 12  total_time                 18980 non-null  object 
 13  facility_rating_status     16533 non-null  obj

In [11]:
# Compare street_address and street_address_clean to see whether we can drop one.
df_org[['street_address','street_address_clean']].head(100)

Unnamed: 0,street_address,street_address_clean
0,24 WILLIE MAYS PLZ # PROMEN,3RD ST & KING ST
1,41 EMBARCADERO,41 EMBARCADERO
2,140 NEW MONTGOMERY ST FL 22,140 NEW MONTGOMERY ST
3,332 CLEMENT ST,332 CLEMENT ST
4,24 WILLIE MAYS PLZ,3RD ST & KING ST
5,24 WILLIE MAYS PLZ # PROMEN,3RD ST & KING ST
6,24 WILLIE MAYS PLZ # VIEW L,3RD ST & KING ST
7,24 WILLIE MAYS PLZ # # 5319,3RD ST & KING ST
8,24 WILLIE MAYS PLZ # VIEW L,3RD ST & KING ST
9,1201 ORTEGA ST,1201 ORTEGA ST


In [12]:
# Compare the two columns element-wise
matches = df_org['street_address'] == df_org['street_address_clean']

# Sum the True values to count how many rows match exactly
num_matches = matches.sum()

# Sum the False values to count how many rows do not match
num_non_matches = (~matches).sum()

print("Number of exact matches:", num_matches)
print("Number of non-matches:", num_non_matches)


Number of exact matches: 16321
Number of non-matches: 2735


In [15]:
# The two columns are almost the same: about 16k of the 19k rows match exactly. canonical_address is kept as the
# address field, so we will remove both street_address and street_address_clean.
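
Some of the non-matches may differ only in formatting (for example, the unique values above include `'762   DIVISADERO'` with repeated spaces). A minimal sketch of a comparison after light normalization, run on toy rows modelled on the values shown above rather than on the real file; `normalize_address` is a hypothetical helper, not part of the notebook:

```python
import pandas as pd

def normalize_address(s: pd.Series) -> pd.Series:
    """Uppercase, trim, and collapse runs of whitespace to a single space."""
    return (
        s.astype(str)
         .str.upper()
         .str.strip()
         .str.replace(r"\s+", " ", regex=True)
    )

# Toy rows (assumed for illustration only).
sample = pd.DataFrame({
    "street_address": ["41 EMBARCADERO", "762   DIVISADERO", "140 NEW MONTGOMERY ST FL 22"],
    "street_address_clean": ["41 EMBARCADERO", "762 DIVISADERO", "140 NEW MONTGOMERY ST"],
})

# Exact string comparison, as in the cell above.
exact = sample["street_address"] == sample["street_address_clean"]

# Comparison after normalizing both sides.
normalized = (
    normalize_address(sample["street_address"])
    == normalize_address(sample["street_address_clean"])
)

print("Exact matches:", int(exact.sum()))
print("Matches after normalization:", int(normalized.sum()))
```

Rows that still differ after normalization, such as the one with the floor suffix, carry a real difference in content rather than formatting.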

In [13]:
# Remove unneeded columns: 'data_as_of', 'data_loaded_at', 'street_address', and 'street_address_clean'
df_org = df_org.drop(['data_as_of','data_loaded_at','street_address', 'street_address_clean'],axis=1)

In [14]:
df_org.head()

Unnamed: 0,inspection_date,inspector,district,subdistrict,subsector,permit_number,dba,permit_type,inspection_type,inspection_frequency_type,total_time,facility_rating_status,census,violation_count,violation_codes,latitude,longitude,point,analysis_neighborhood,supervisor_district,canonical_address
0,2025-04-23,Michael Mooney,1,103,607,6734928,Surfside - Walk Thru,H36 - STADIUM CONCESSIONS (PERM),Routine,1,10,Pass,STADIUM,,,37.77813,-122.391855,POINT (-122.391855 37.77813),Mission Bay,6.0,3rd street king street
1,2025-04-23,Michael Mooney,2,201,106,6735187,HARBOR EMPEROR,H33 - COMMISSARIES,Routine,2,60,Conditional Pass,106,4.0,"113953(c), 114163(a)(3), 114189, 114192.1, 114195, 114279 - Immediately provide hot running potable water of at least 120°F or greater for the entire food preparation facility., 113953, 113953.1,...",37.787126,-122.387925,POINT (-122.387924588 37.787126305),Financial District/South Beach,6.0,41 embarcadero
2,2025-04-23,Patrick Wood,1,101,176A,6743776,RA @ BLOOMBERG FLOOR 22,H85 - EMPLOYEE CAFETERIA LIMITED FOOD PREP,New Ownership (I),1,30,Pass,176A,,,37.786617,-122.400018,POINT (-122.400018 37.786617),Financial District/South Beach,6.0,140 new montgomery street
3,2025-04-23,Rochelle Veloso,5,502,401,102419,MOKUKU,"H26 - RESTAURANT OVER 2,000 SQFT",Site Visit,2,30,Pass,401,,,37.783269,-122.462932,POINT (-122.4629325 37.783269),Inner Richmond,1.0,332 clement street
4,2025-04-23,Sojeatta Khim,1,103,607,6734548,STEEP CREAMERY & TEA - SECTION 110,H36 - STADIUM CONCESSIONS (PERM),Routine,1,10,Pass,STADIUM,,,37.77813,-122.391855,POINT (-122.391855 37.77813),Mission Bay,6.0,3rd street king street


In [15]:
df_org.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19056 entries, 0 to 19055
Data columns (total 21 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   inspection_date            19056 non-null  object 
 1   inspector                  19056 non-null  object 
 2   district                   19054 non-null  object 
 3   subdistrict                19042 non-null  object 
 4   subsector                  19035 non-null  object 
 5   permit_number              19038 non-null  object 
 6   dba                        18148 non-null  object 
 7   permit_type                19056 non-null  object 
 8   inspection_type            16673 non-null  object 
 9   inspection_frequency_type  18319 non-null  object 
 10  total_time                 18980 non-null  object 
 11  facility_rating_status     16533 non-null  object 
 12  census                     17622 non-null  object 
 13  violation_count            10184 non-null  flo

Columns that still contain null values:
'district', 'subdistrict', 'subsector', 'permit_number', 'dba', 'inspection_type', 'inspection_frequency_type', 'total_time',
'facility_rating_status', 'census', 'violation_count', 'violation_codes', 'latitude', 'longitude', 'point', 'analysis_neighborhood',
'supervisor_district'

In [19]:
# The target variable facility_rating_status has null values. We will drop those rows: imputing the target
# and then predicting those imputed labels does not make sense.

In [16]:
# Drop rows where 'facility_rating_status' is null
df_org = df_org.dropna(subset=['facility_rating_status'])

In [17]:
df_org.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16533 entries, 0 to 19055
Data columns (total 21 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   inspection_date            16533 non-null  object 
 1   inspector                  16533 non-null  object 
 2   district                   16533 non-null  object 
 3   subdistrict                16533 non-null  object 
 4   subsector                  16533 non-null  object 
 5   permit_number              16530 non-null  object 
 6   dba                        16533 non-null  object 
 7   permit_type                16533 non-null  object 
 8   inspection_type            15138 non-null  object 
 9   inspection_frequency_type  16182 non-null  object 
 10  total_time                 16480 non-null  object 
 11  facility_rating_status     16533 non-null  object 
 12  census                     16124 non-null  object 
 13  violation_count            9263 non-null   float64


In [18]:
# Let's see the fraction of null values in each column
(df_org.isnull().sum()).sort_values(ascending=False)/len(df_org)

violation_codes              0.439727
violation_count              0.439727
inspection_type              0.084377
census                       0.024738
inspection_frequency_type    0.021230
analysis_neighborhood        0.007137
supervisor_district          0.007137
point                        0.005807
longitude                    0.005807
latitude                     0.005807
total_time                   0.003206
permit_number                0.000181
inspection_date              0.000000
subsector                    0.000000
subdistrict                  0.000000
district                     0.000000
inspector                    0.000000
facility_rating_status       0.000000
permit_type                  0.000000
dba                          0.000000
canonical_address            0.000000
dtype: float64

Columns that still contain null values:
'permit_number', 'inspection_type', 'inspection_frequency_type', 'total_time', 'census', 'violation_count',
'violation_codes', 'latitude', 'longitude', 'point', 'analysis_neighborhood', 'supervisor_district'

In [19]:
# Handle the remaining null values

# Distribution of facility_rating_status (in %) among rows with a null violation_count
null_violation_count_pass = df_org[df_org['violation_count'].isna()]['facility_rating_status'].value_counts(normalize=True) * 100
print(null_violation_count_pass)

# Distribution of facility_rating_status (in %) among rows with null violation_codes
null_violation_code_pass = df_org[df_org['violation_codes'].isna()]['facility_rating_status'].value_counts(normalize=True) * 100
print(null_violation_code_pass)


facility_rating_status
Pass                99.724897
Conditional Pass     0.165062
Closure              0.110041
Name: proportion, dtype: float64
facility_rating_status
Pass                99.724897
Conditional Pass     0.165062
Closure              0.110041
Name: proportion, dtype: float64


In [24]:
# Since over 99% of the rows with nulls in violation_count and violation_codes are rated Pass, it makes sense to assume
# that null here means no violation. So we will impute 0 for all NaN values in violation_count and an empty string in violation_codes.
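
A small fraction of the null rows are rated Conditional Pass or Closure, so "null means no violation" is an assumption rather than a certainty. One optional way to keep that information is to record which rows were originally null before imputing. This is a minimal sketch on toy data; the column name `violation_info_missing` is hypothetical and not part of the dataset:

```python
import numpy as np
import pandas as pd

# Toy rows (assumed for illustration only).
toy = pd.DataFrame({
    "violation_count": [4.0, np.nan, 2.0, np.nan],
    "violation_codes": ["code A", np.nan, "code B", np.nan],
})

# Hypothetical flag column: 1 where violation_count was originally null.
toy["violation_info_missing"] = toy["violation_count"].isna().astype(int)

# Same imputation as described above: 0 for the count, empty string for the codes.
toy["violation_count"] = toy["violation_count"].fillna(0)
toy["violation_codes"] = toy["violation_codes"].fillna("")

print(toy)
```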

In [20]:
# Fill nulls in violation_count with 0
df_org['violation_count'] = df_org['violation_count'].fillna(0)

# Fill nulls in violation_codes with an empty string
df_org['violation_codes'] = df_org['violation_codes'].fillna('')


In [21]:
df_org.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16533 entries, 0 to 19055
Data columns (total 21 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   inspection_date            16533 non-null  object 
 1   inspector                  16533 non-null  object 
 2   district                   16533 non-null  object 
 3   subdistrict                16533 non-null  object 
 4   subsector                  16533 non-null  object 
 5   permit_number              16530 non-null  object 
 6   dba                        16533 non-null  object 
 7   permit_type                16533 non-null  object 
 8   inspection_type            15138 non-null  object 
 9   inspection_frequency_type  16182 non-null  object 
 10  total_time                 16480 non-null  object 
 11  facility_rating_status     16533 non-null  object 
 12  census                     16124 non-null  object 
 13  violation_count            16533 non-null  float64


In [22]:
# Look at the percentage distribution of values in each column that still has null values.
for i in ['permit_number', 'inspection_type', 'inspection_frequency_type', 'total_time', 
          'census', 'latitude', 'longitude', 'point', 'analysis_neighborhood', 'supervisor_district']:
    print(i, df_org[i].value_counts(normalize=True)*100, sep='\n\n')
    print('-'*75)

permit_number

permit_number
105555         0.193587
18306          0.187538
2133           0.120992
06733818       0.114943
H2606732772    0.090744
                 ...   
110488         0.006050
83644          0.006050
84321          0.006050
79599          0.006050
FYF119407      0.006050
Name: proportion, Length: 7167, dtype: float64
---------------------------------------------------------------------------
inspection_type

inspection_type
Routine                55.641432
Reinspection           19.447747
New Ownership (I)       7.993130
New Ownership (R)       4.643942
Site Visit              4.564672
Complaint (I)           4.492007
Foodborne Illness       1.770379
Complaint (R)           0.858766
Plan Check (I)          0.277447
Structural              0.178359
Plan Check (R)          0.105694
Plan Check              0.019818
Change of Ownership     0.006606
Name: proportion, dtype: float64
---------------------------------------------------------------------------
inspection_fr

In [28]:
# inspection_type and inspection_frequency_type are categorical, so we impute their null values with the mode of each column.
# We drop the remaining rows with null values since there are very few of them.

In [23]:
# 1. impute mode for inspection_type and inspection_frequency_type
for col in ["inspection_type", "inspection_frequency_type"]:
    mode_val = df_org[col].mode(dropna=True)[0]
    df_org[col] = df_org[col].fillna(mode_val)

# 2. drop remaining null rows
df_org = df_org.dropna()

In [24]:
df_org.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15975 entries, 0 to 18943
Data columns (total 21 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   inspection_date            15975 non-null  object 
 1   inspector                  15975 non-null  object 
 2   district                   15975 non-null  object 
 3   subdistrict                15975 non-null  object 
 4   subsector                  15975 non-null  object 
 5   permit_number              15975 non-null  object 
 6   dba                        15975 non-null  object 
 7   permit_type                15975 non-null  object 
 8   inspection_type            15975 non-null  object 
 9   inspection_frequency_type  15975 non-null  object 
 10  total_time                 15975 non-null  object 
 11  facility_rating_status     15975 non-null  object 
 12  census                     15975 non-null  object 
 13  violation_count            15975 non-null  float64


In [25]:
# Convert date columns to datetime
df_org['inspection_date'] = pd.to_datetime(df_org['inspection_date'], errors='coerce')

# Convert numeric columns
df_org['violation_count'] = pd.to_numeric(df_org['violation_count'], errors='coerce')
df_org['latitude'] = pd.to_numeric(df_org['latitude'], errors='coerce')
df_org['longitude'] = pd.to_numeric(df_org['longitude'], errors='coerce')
df_org['supervisor_district'] = pd.to_numeric(df_org['supervisor_district'], errors='coerce')

# Convert categorical/text columns to string (object)
text_columns = [
    'inspector', 'district', 'subdistrict', 'subsector', 'permit_number', 
    'dba', 'permit_type', 'canonical_address', 'inspection_type',
    'inspection_frequency_type', 'facility_rating_status', 
    'violation_codes', 'analysis_neighborhood'
]
for col in text_columns:
    df_org[col] = df_org[col].astype(str)


In [26]:
df_org.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15975 entries, 0 to 18943
Data columns (total 21 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   inspection_date            15975 non-null  datetime64[ns]
 1   inspector                  15975 non-null  object        
 2   district                   15975 non-null  object        
 3   subdistrict                15975 non-null  object        
 4   subsector                  15975 non-null  object        
 5   permit_number              15975 non-null  object        
 6   dba                        15975 non-null  object        
 7   permit_type                15975 non-null  object        
 8   inspection_type            15975 non-null  object        
 9   inspection_frequency_type  15975 non-null  object        
 10  total_time                 15975 non-null  object        
 11  facility_rating_status     15975 non-null  object        
 12  census   

In [27]:
negative_rows = df_org[df_org['total_time'].str.startswith('-')]
len(negative_rows)

185

In [34]:
# We have 185 negative values in total_time column


In [28]:
df_org['facility_rating_status'].value_counts(normalize=True)*100

facility_rating_status
Pass                94.084507
Conditional Pass     3.580595
Closure              2.334898
Name: proportion, dtype: float64

## OUTLINE (moving forward):

**Total time with negative values**

* Investigate the cause first. Negative values likely indicate data entry errors, e.g., end time < start time.
* Options:
  * Remove the rows if they are errors.
  * Take absolute values if negative time is an artifact of computation.
  * Recompute from start and end times if available.
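
The first two options can be sketched as follows. This is a minimal illustration on toy values, not a decision about which option is right for this dataset; `clean_total_time` and its `strategy` argument are hypothetical names. Recomputing from start and end times is not shown because no such columns appear in the data above.

```python
import pandas as pd

def clean_total_time(values: pd.Series, strategy: str = "drop") -> pd.Series:
    """Return total_time as numeric minutes with negative values handled.

    strategy="drop": negative values become NaN (the rows can then be dropped).
    strategy="abs":  negative values are replaced by their absolute value.
    """
    # total_time is stored as text, so convert it to numbers first.
    minutes = pd.to_numeric(values, errors="coerce")
    if strategy == "drop":
        return minutes.mask(minutes < 0)
    if strategy == "abs":
        return minutes.abs()
    raise ValueError(f"unknown strategy: {strategy!r}")

# Toy values in the same string format as the column (assumed for illustration).
toy = pd.Series(["10", "60", "-660", "-30", None])

dropped = clean_total_time(toy, "drop")
absolute = clean_total_time(toy, "abs")
print(dropped.tolist())
print(absolute.tolist())
```

Converting to numeric first also replaces the string test `str.startswith('-')` with a plain `< 0` comparison.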

In [36]:
# Figure out what the negative values in total_time mean
# Change the dtype
# Remove duplicates and unnecessary columns
# Visualize and analyse the data
# (Feature engineering)
# Modelling

In [29]:
df = df_org


In [32]:
# Ensure date field is datetime
df["inspection_date"] = pd.to_datetime(df["inspection_date"], errors="coerce")

# Convert violation_count to numeric (if not already)
df["violation_count"] = pd.to_numeric(df["violation_count"], errors="coerce").fillna(0)

# Normalize facility_rating_status for trend calculation
df["rating_binary"] = df["facility_rating_status"].map({
    "Pass": 1,
    "Conditional Pass": 0.5,
    "Closure": 0,
    "": np.nan,
    None: np.nan
})

# Sort data properly
df = df.sort_values(["permit_number", "inspection_date"])

In [33]:
df["avg_violations_last_3"] = (
    df.groupby("permit_number")["violation_count"]
          .rolling(3, min_periods=1)
          .mean()
          .reset_index(level=0, drop=True)
)


In [34]:
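# Flag failed inspections. "Fail" does not occur in this dataset (only Pass, Conditional Pass,
# and Closure), so in practice this flags closures.
df["fail_flag"] = df["facility_rating_status"].isin(["Closure", "Fail"]).astype(int)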

df["fail_rate_last_3"] = (
    df.groupby("permit_number")["fail_flag"]
          .rolling(3, min_periods=1)
          .mean()
          .reset_index(level=0, drop=True)
)
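
Note that the rolling windows above include the current inspection. If these features are later used to predict the outcome of that same inspection, a variant that only looks at earlier inspections may be preferable. A minimal sketch on toy data, assuming the rows are already sorted by permit and date; the column name `avg_violations_prev_3` is hypothetical:

```python
import pandas as pd

# Toy rows, already sorted by permit and inspection order (assumed for illustration).
toy = pd.DataFrame({
    "permit_number": ["A", "A", "A", "A", "B", "B"],
    "violation_count": [2, 4, 0, 6, 1, 3],
})

# Shift within each permit first, so the window only sees earlier inspections.
toy["avg_violations_prev_3"] = (
    toy.groupby("permit_number")["violation_count"]
       .transform(lambda s: s.shift(1).rolling(3, min_periods=1).mean())
)

print(toy)
```

The first inspection of each permit has no history, so its value is NaN and needs the same kind of fill as any other first-inspection feature.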


In [35]:
df["prev_inspection_date"] = (
    df.groupby("permit_number")["inspection_date"].shift(1)
)

df["days_since_last_inspection"] = (
    (df["inspection_date"] - df["prev_inspection_date"]).dt.days
)

# First inspection for each restaurant will be NaN → fill with a neutral value or median later
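
One possible way to do that fill, sketched on toy data: keep a flag for first inspections and fill the gap with the median. The column name `is_first_inspection` is hypothetical, and the median is only one choice of neutral value.

```python
import pandas as pd

# Toy rows (assumed for illustration only).
toy = pd.DataFrame({
    "permit_number": ["A", "A", "A", "B", "B"],
    "inspection_date": pd.to_datetime(
        ["2024-01-10", "2024-03-10", "2024-03-20", "2024-02-01", "2024-02-15"]
    ),
}).sort_values(["permit_number", "inspection_date"])

# Same computation as above: days since the previous inspection of the same permit.
prev = toy.groupby("permit_number")["inspection_date"].shift(1)
toy["days_since_last_inspection"] = (toy["inspection_date"] - prev).dt.days

# Hypothetical flag: 1 for the first inspection of each permit.
toy["is_first_inspection"] = toy["days_since_last_inspection"].isna().astype(int)

# Fill the missing gaps with the median of the observed gaps.
median_gap = toy["days_since_last_inspection"].median()
toy["days_since_last_inspection"] = toy["days_since_last_inspection"].fillna(median_gap)

print(toy)
```

If the data is later split into training and test sets, computing the median on the training portion only avoids carrying information across the split.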


In [36]:
def compute_trend(group):
    ratings = group["rating_binary"].rolling(3, min_periods=2).apply(
        lambda x: np.polyfit(range(len(x)), x, 1)[0]  # slope of line
    )
    return ratings

df["trend_last_3"] = df.groupby("permit_number", group_keys=False).apply(compute_trend)
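
To make the trend feature easier to read, here is what the slope looks like on a few hand-made rating sequences, using the same 1 / 0.5 / 0 encoding as `rating_binary`. The helper `slope` is a hypothetical name that wraps the same `np.polyfit` call:

```python
import numpy as np

def slope(values):
    """Least-squares slope of the values against positions 0, 1, 2, ..."""
    return np.polyfit(range(len(values)), values, 1)[0]

improving = slope([0.0, 0.5, 1.0])  # Closure -> Conditional Pass -> Pass
declining = slope([1.0, 0.5, 0.0])  # Pass -> Conditional Pass -> Closure
steady = slope([1.0, 1.0, 1.0])     # three passes in a row

print(improving, declining, steady)
```

A positive slope means ratings are improving over the window, a negative slope means they are getting worse, and a slope near zero means no change.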

In [38]:
df["facility_rating_status"].value_counts(dropna=False)


facility_rating_status
Pass                15030
Conditional Pass      572
Closure               373
Name: count, dtype: int64

In [38]:
# Save the cleaned dataset with the added features

In [37]:
df.to_csv('../data/clean/AddedFeatures.csv', index=False)
