## Summary
This dataset contains the results of health inspections conducted by the Department of Public Health from 2024 to Present. It includes the name and location of each facility inspected, the facility status (Pass, Conditional Pass, or Closure) after the inspection, and violations observed.


## Description of our dataset
* **inspection_date** (Floating Timestamp): Date and time when the inspection occurred.
* **inspector** (Text): Name of the inspector who conducted the inspection.
* **district** (Text): District in which the inspection took place.
* **subdistrict** (Text): Sub-district where the inspection was performed.
* **subsector** (Text): Specific sub-sector of the inspection area.
* **permit_number** (Text): Permit number associated with the facility, if applicable.
* **dba** (Text): “Doing Business As” name of the facility. Public trade name of the establishment.
* **permit_type** (Text): Type of permit held by the facility.
* **street_address** (Text): Street address of the facility.
* **street_address_clean** (Text): Cleaned and standardized street address.
* **inspection_type** (Text): Type/category of the inspection conducted.
* **inspection_frequency_type** (Text): Frequency classification of inspections.
* **total_time** (Number): Total duration of the inspection in minutes.
* **facility_rating_status** (Text): Rating status of the facility after inspection.
* **census** (Text): Census information linked to the facility location.
* **suspension_notes** (Text): Notes regarding any permit suspensions.
* **inspection_notes** (Text): Additional notes recorded during the inspection.
* **violation_count** (Number): Total number of violations observed.
* **violation_codes** (Text): Codes corresponding to the observed violations.
* **latitude** (Number): Latitude coordinate of the facility.
* **longitude** (Number): Longitude coordinate of the facility.
* **point** (Point): Geospatial point combining latitude and longitude.
* **analysis_neighborhood** (Text): Neighborhood used for analysis purposes.
* **supervisor_district** (Number): Supervisor district number for administrative purposes.
* **data_as_of** (Floating Timestamp): Date when the dataset was last updated in the source system.
* **data_loaded_at** (Floating Timestamp): Timestamp when the data was uploaded to the open data portal.


In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

# from sklearn.preprocessing import StandardScaler

# from sklearn.linear_model import LogisticRegression
# from sklearn.neighbors import KNeighborsClassifier 
# from sklearn.ensemble import RandomForestClassifier,BaggingClassifier
# from sklearn.tree import DecisionTreeClassifier
# from xgboost import XGBClassifier

# from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedShuffleSplit, cross_validate
# from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, precision_recall_curve, f1_score,roc_auc_score, roc_curve, precision_recall_fscore_support
# from sklearn.inspection import PartialDependenceDisplay
# from sklearn import tree

# from treeinterpreter import treeinterpreter
# from waterfall_chart import plot as waterfall

# from imblearn.over_sampling import SMOTE,ADASYN,RandomOverSampler,BorderlineSMOTE,SVMSMOTE
# from imblearn.under_sampling import RandomUnderSampler

In [2]:
pd.set_option('display.max_columns',200)
pd.set_option('display.max_rows',200)
pd.set_option('display.max_colwidth',200)

In [3]:
df_org = pd.read_csv('Health_Inspection_Scores_(2024-Present)_20251030.csv',low_memory = False)

In [4]:
df_org

Unnamed: 0,inspection_date,inspector,district,subdistrict,subsector,permit_number,dba,permit_type,street_address,street_address_clean,inspection_type,inspection_frequency_type,total_time,facility_rating_status,census,suspension_notes,inspection_notes,violation_count,violation_codes,latitude,longitude,point,analysis_neighborhood,supervisor_district,data_as_of,data_loaded_at
0,2025/04/23 12:00:00 AM,Michael Mooney,1,103,607,06734928,Surfside - Walk Thru,H36 - STADIUM CONCESSIONS (PERM),24 WILLIE MAYS PLZ # PROMEN,3RD ST & KING ST,Routine,1,10,Pass,STADIUM,,,,,37.778130,-122.391855,POINT (-122.391855 37.77813),Mission Bay,6.0,2025/07/01 10:09:15 AM,2025/10/30 02:37:07 AM
1,2025/04/23 12:00:00 AM,Michael Mooney,2,201,106,06735187,HARBOR EMPEROR,H33 - COMMISSARIES,41 EMBARCADERO,41 EMBARCADERO,Routine,2,60,Conditional Pass,106,,,4.0,"113953(c), 114163(a)(3), 114189, 114192.1, 114195, 114279 - Immediately provide hot running potable water of at least 120°F or greater for the entire food preparation facility., 113953, 113953.1, ...",37.787126,-122.387925,POINT (-122.387924588 37.787126305),Financial District/South Beach,6.0,2025/07/01 10:09:15 AM,2025/10/30 02:37:07 AM
2,2025/04/23 12:00:00 AM,Patrick Wood,1,101,176A,06743776,RA @ BLOOMBERG FLOOR 22,H85 - EMPLOYEE CAFETERIA LIMITED FOOD PREP,140 NEW MONTGOMERY ST FL 22,140 NEW MONTGOMERY ST,New Ownership (I),1,30,Pass,176A,,,,,37.786617,-122.400018,POINT (-122.400018 37.786617),Financial District/South Beach,6.0,2025/07/01 10:09:15 AM,2025/10/30 02:37:07 AM
3,2025/04/23 12:00:00 AM,Rochelle Veloso,5,502,401,102419,MOKUKU,"H26 - RESTAURANT OVER 2,000 SQFT",332 CLEMENT ST,332 CLEMENT ST,Site Visit,2,30,Pass,401,,,,,37.783269,-122.462932,POINT (-122.4629325 37.783269),Inner Richmond,1.0,2025/07/01 10:09:15 AM,2025/10/30 02:37:07 AM
4,2025/04/23 12:00:00 AM,Sojeatta Khim,1,103,607,06734548,STEEP CREAMERY & TEA - SECTION 110,H36 - STADIUM CONCESSIONS (PERM),24 WILLIE MAYS PLZ,3RD ST & KING ST,Routine,1,10,Pass,STADIUM,,,,,37.778130,-122.391855,POINT (-122.391855 37.77813),Mission Bay,6.0,2025/07/01 10:09:15 AM,2025/10/30 02:37:07 AM
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19051,2025/10/29 12:00:00 AM,Michael Mooney,4,404,164,87648,,"H25 - RESTAURANT 1,000 - 2,000 SQFT",762 DIVISADERO,762 DIVISADERO,,1,45,,,,,6.0,"113947.1-113947.6, 113948 - Obtain a state approved food safety certification course within 60 days of operation or when a Food Safety Manager has left the facility. Maintain copies of food safety...",37.776511,-122.438032,POINT (-122.438031803 37.776511475),Hayes Valley,5.0,2025/10/30 02:00:05 AM,2025/10/30 02:37:07 AM
19052,2024/10/29 12:00:00 AM,Amy Johnson,No District,No Sub District,No Sub Sector,102395,PIZZALICIOUS LLC,"H24 - RESTAURANT UNDER 1,000 SQFT",1210 POLK E ST,1210 POLK E ST,Routine,1,75,Pass,167,,,7.0,"114143(d), 114266, 114268, 114268.1, 114271, 114272 - Provide walls / ceilings using materials that are durable, smooth, nonabsorbent, light-colored, and washable surfaces. All floor surfaces, oth...",,,,,,2025/07/01 10:09:15 AM,2025/10/30 02:37:07 AM
19053,2024/09/19 12:00:00 AM,Danny Nguyen,2,202,179T,06733434,DA BOOT TO DA BAY,H79 - MOBILE FOOD FACILITY CLASS 5,900 AVENUE ST BLDG D,900 AVENUE ST BLDG D,Reinspection,1,20,Pass,MFF,,,,,,,,,,2025/07/01 10:09:15 AM,2025/10/30 02:37:07 AM
19054,2024/08/19 12:00:00 AM,Michael Mooney,No District,No Sub District,No Sub Sector,H0306732464,MV ARGO,H07 - RETAIL MKTS W/FOOD PREP (UNDER 5001),PIER FISHERMANS WHARF PIER,PIER FISHERMANS WHARF PIER,Routine,1,25,Pass,106,,,2.0,"113953, 113953.1, 113953.2, 114067(f) - Provide soap and single-use towels in dispensers, or a drying device at each handwash sink at all times. Maintain all handwash sinks unobstructed and access...",,,,,,2025/07/01 10:09:15 AM,2025/10/30 02:37:07 AM


In [5]:
df_org.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19056 entries, 0 to 19055
Data columns (total 26 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   inspection_date            19056 non-null  object 
 1   inspector                  19056 non-null  object 
 2   district                   19054 non-null  object 
 3   subdistrict                19042 non-null  object 
 4   subsector                  19035 non-null  object 
 5   permit_number              19038 non-null  object 
 6   dba                        18148 non-null  object 
 7   permit_type                19056 non-null  object 
 8   street_address             19056 non-null  object 
 9   street_address_clean       19056 non-null  object 
 10  inspection_type            16673 non-null  object 
 11  inspection_frequency_type  18319 non-null  object 
 12  total_time                 18980 non-null  object 
 13  facility_rating_status     16533 non-null  obj

In [6]:
df_org.describe(include='all')

Unnamed: 0,inspection_date,inspector,district,subdistrict,subsector,permit_number,dba,permit_type,street_address,street_address_clean,inspection_type,inspection_frequency_type,total_time,facility_rating_status,census,suspension_notes,inspection_notes,violation_count,violation_codes,latitude,longitude,point,analysis_neighborhood,supervisor_district,data_as_of,data_loaded_at
count,19056,19056,19054.0,19042.0,19035.0,19038.0,18148,19056,19056,19056,16673,18319.0,18980.0,16533,17622.0,0.0,0.0,10184.0,10184,18919.0,18919.0,18919,18874,18874.0,19056,19056
unique,483,52,8.0,41.0,160.0,8102.0,6857,70,6813,6332,14,3.0,204.0,3,164.0,,,,6416,,,5096,41,,88,1
top,2025/04/23 12:00:00 AM,Michael Mooney,2.0,103.0,607.0,18306.0,AGRICULTURAL INSTITUTE OF MARIN,"H25 - RESTAURANT 1,000 - 2,000 SQFT",49 S Van Ness Ave 7th Floor,3RD ST & KING ST,Routine,2.0,60.0,Pass,607.0,,,,"114259, 114259.1, 114259.4, 114259.5 - Eliminate the infestation/activity of cockroaches/rodents/flies/vermin from the food facility by using only approved methods. Remove all evidence of the infe...",,,POINT (-122.391855 37.77813),Mission,,2025/07/01 10:09:15 AM,2025/10/30 02:37:07 AM
freq,197,1180,4268.0,1345.0,958.0,41.0,42,4269,207,243,9268,10487.0,3740.0,15575,553.0,,,,291,,,243,2583,,18148,19056
mean,,,,,,,,,,,,,,,,,,4.053712,,37.772371,-122.423539,,,5.529194,,
std,,,,,,,,,,,,,,,,,,3.315183,,0.03691,0.042882,,,2.688655,,
min,,,,,,,,,,,,,,,,,,1.0,,33.761824,-122.510677,,,1.0,,
25%,,,,,,,,,,,,,,,,,,2.0,,37.760283,-122.43429,,,3.0,,
50%,,,,,,,,,,,,,,,,,,3.0,,37.777527,-122.416182,,,6.0,,
75%,,,,,,,,,,,,,,,,,,5.0,,37.788984,-122.40522,,,8.0,,


In [7]:
# Let's check the unique values in each column.
for i in df_org.columns:
    print(i,df_org[i].unique())

inspection_date ['2025/04/23 12:00:00 AM' '2025/04/22 12:00:00 AM'
 '2025/04/21 12:00:00 AM' '2025/04/18 12:00:00 AM'
 '2025/04/17 12:00:00 AM' '2025/04/16 12:00:00 AM'
 '2025/04/15 12:00:00 AM' '2025/04/14 12:00:00 AM'
 '2025/04/11 12:00:00 AM' '2025/04/10 12:00:00 AM'
 '2025/04/09 12:00:00 AM' '2025/04/08 12:00:00 AM'
 '2025/04/07 12:00:00 AM' '2025/04/04 12:00:00 AM'
 '2025/04/03 12:00:00 AM' '2025/04/02 12:00:00 AM'
 '2025/04/01 12:00:00 AM' '2025/05/06 12:00:00 AM'
 '2025/03/31 12:00:00 AM' '2025/03/28 12:00:00 AM'
 '2025/03/27 12:00:00 AM' '2025/03/26 12:00:00 AM'
 '2025/03/25 12:00:00 AM' '2025/03/24 12:00:00 AM'
 '2025/03/21 12:00:00 AM' '2025/03/20 12:00:00 AM'
 '2025/03/19 12:00:00 AM' '2025/03/18 12:00:00 AM'
 '2025/03/17 12:00:00 AM' '2025/03/14 12:00:00 AM'
 '2025/03/13 12:00:00 AM' '2025/03/12 12:00:00 AM'
 '2025/03/11 12:00:00 AM' '2025/03/10 12:00:00 AM'
 '2025/03/07 12:00:00 AM' '2025/03/06 12:00:00 AM'
 '2025/03/05 12:00:00 AM' '2025/03/04 12:00:00 AM'
 '2025/03/03 12

In [8]:
# total_time contains negative values, which should not happen for a duration in minutes.

In [9]:
# Let's see the fraction of null values in each column
(df_org.isnull().sum()).sort_values(ascending=False)/len(df_org)

suspension_notes             1.000000
inspection_notes             1.000000
violation_codes              0.465575
violation_count              0.465575
facility_rating_status       0.132399
inspection_type              0.125052
census                       0.075252
dba                          0.047649
inspection_frequency_type    0.038675
supervisor_district          0.009551
analysis_neighborhood        0.009551
longitude                    0.007189
point                        0.007189
latitude                     0.007189
total_time                   0.003988
subsector                    0.001102
permit_number                0.000945
subdistrict                  0.000735
district                     0.000105
data_as_of                   0.000000
inspection_date              0.000000
inspector                    0.000000
street_address_clean         0.000000
street_address               0.000000
permit_type                  0.000000
data_loaded_at               0.000000
dtype: float
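
The same per-column null fraction can be computed a little more directly with `isna().mean()`. The following is a minimal sketch on a toy frame; the `toy` DataFrame and its values are made up for illustration and only stand in for `df_org`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_org (values are invented for illustration).
toy = pd.DataFrame({
    "inspection_notes": [np.nan, np.nan, np.nan, np.nan],
    "violation_count": [4.0, np.nan, np.nan, 2.0],
    "inspector": ["A", "B", "C", "D"],
})

# The mean of a boolean mask is the fraction of True values,
# so this is equivalent to isnull().sum() / len(df).
null_fraction = toy.isna().mean().sort_values(ascending=False)
print(null_fraction)
```

Multiplying the result by 100 gives percentages, if that is easier to read.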

In [10]:
# 'suspension_notes' and 'inspection_notes' are 100% null, so we can drop both columns.
df_org = df_org.drop(['suspension_notes','inspection_notes'],axis=1)

In [11]:
df_org.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19056 entries, 0 to 19055
Data columns (total 24 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   inspection_date            19056 non-null  object 
 1   inspector                  19056 non-null  object 
 2   district                   19054 non-null  object 
 3   subdistrict                19042 non-null  object 
 4   subsector                  19035 non-null  object 
 5   permit_number              19038 non-null  object 
 6   dba                        18148 non-null  object 
 7   permit_type                19056 non-null  object 
 8   street_address             19056 non-null  object 
 9   street_address_clean       19056 non-null  object 
 10  inspection_type            16673 non-null  object 
 11  inspection_frequency_type  18319 non-null  object 
 12  total_time                 18980 non-null  object 
 13  facility_rating_status     16533 non-null  obj

In [12]:
#Compare street_address and street_address_clean to see if we can drop one.
df_org[['street_address','street_address_clean']].head(100)

Unnamed: 0,street_address,street_address_clean
0,24 WILLIE MAYS PLZ # PROMEN,3RD ST & KING ST
1,41 EMBARCADERO,41 EMBARCADERO
2,140 NEW MONTGOMERY ST FL 22,140 NEW MONTGOMERY ST
3,332 CLEMENT ST,332 CLEMENT ST
4,24 WILLIE MAYS PLZ,3RD ST & KING ST
5,24 WILLIE MAYS PLZ # PROMEN,3RD ST & KING ST
6,24 WILLIE MAYS PLZ # VIEW L,3RD ST & KING ST
7,24 WILLIE MAYS PLZ # # 5319,3RD ST & KING ST
8,24 WILLIE MAYS PLZ # VIEW L,3RD ST & KING ST
9,1201 ORTEGA ST,1201 ORTEGA ST


In [13]:
# Compare the two columns element-wise
matches = df_org['street_address'] == df_org['street_address_clean']

# Sum the True values to count how many rows match exactly
num_matches = matches.sum()

# Sum the False values to count how many rows do not match
num_non_matches = (~matches).sum()

print("Number of exact matches:", num_matches)
print("Number of non-matches:", num_non_matches)


Number of exact matches: 16321
Number of non-matches: 2735
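
Exact string equality treats differences in case or whitespace as mismatches, and it does not say how the remaining rows differ. The sketch below, on a small toy frame, normalizes both columns before comparing and separately flags rows where the cleaned address is a prefix of the raw one (for example, a floor or unit suffix was dropped). The `normalize` helper and the toy rows are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Toy rows modelled on the address pairs shown above; the third row is invented.
toy = pd.DataFrame({
    "street_address": [
        "41 EMBARCADERO",
        "140 NEW MONTGOMERY ST FL 22",
        "49 S Van Ness Ave ",
        "24 WILLIE MAYS PLZ",
    ],
    "street_address_clean": [
        "41 EMBARCADERO",
        "140 NEW MONTGOMERY ST",
        "49 S VAN NESS AVE",
        "3RD ST & KING ST",
    ],
})

def normalize(s: pd.Series) -> pd.Series:
    """Upper-case, collapse runs of whitespace, and strip the ends."""
    return s.str.upper().str.replace(r"\s+", " ", regex=True).str.strip()

raw = normalize(toy["street_address"])
clean = normalize(toy["street_address_clean"])

# Rows that agree once case and whitespace are ignored.
exact = raw == clean

# Rows where the cleaned address is only the leading part of the raw address.
prefix = pd.Series(
    [r.startswith(c) for r, c in zip(raw, clean)], index=toy.index
) & ~exact

print("Normalized matches:", int(exact.sum()))
print("Prefix-only matches:", int(prefix.sum()))
print("Genuinely different:", int((~exact & ~prefix).sum()))
```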


In [14]:
# The two columns mostly agree: about 16k of the 19k rows match street_address_clean exactly, so we will drop street_address.

In [15]:
# Remove unneeded columns: 'data_as_of', 'data_loaded_at', and 'street_address'
df_org = df_org.drop(['data_as_of','data_loaded_at','street_address'],axis=1)

In [16]:
df_org.head()

Unnamed: 0,inspection_date,inspector,district,subdistrict,subsector,permit_number,dba,permit_type,street_address_clean,inspection_type,inspection_frequency_type,total_time,facility_rating_status,census,violation_count,violation_codes,latitude,longitude,point,analysis_neighborhood,supervisor_district
0,2025/04/23 12:00:00 AM,Michael Mooney,1,103,607,6734928,Surfside - Walk Thru,H36 - STADIUM CONCESSIONS (PERM),3RD ST & KING ST,Routine,1,10,Pass,STADIUM,,,37.77813,-122.391855,POINT (-122.391855 37.77813),Mission Bay,6.0
1,2025/04/23 12:00:00 AM,Michael Mooney,2,201,106,6735187,HARBOR EMPEROR,H33 - COMMISSARIES,41 EMBARCADERO,Routine,2,60,Conditional Pass,106,4.0,"113953(c), 114163(a)(3), 114189, 114192.1, 114195, 114279 - Immediately provide hot running potable water of at least 120°F or greater for the entire food preparation facility., 113953, 113953.1, ...",37.787126,-122.387925,POINT (-122.387924588 37.787126305),Financial District/South Beach,6.0
2,2025/04/23 12:00:00 AM,Patrick Wood,1,101,176A,6743776,RA @ BLOOMBERG FLOOR 22,H85 - EMPLOYEE CAFETERIA LIMITED FOOD PREP,140 NEW MONTGOMERY ST,New Ownership (I),1,30,Pass,176A,,,37.786617,-122.400018,POINT (-122.400018 37.786617),Financial District/South Beach,6.0
3,2025/04/23 12:00:00 AM,Rochelle Veloso,5,502,401,102419,MOKUKU,"H26 - RESTAURANT OVER 2,000 SQFT",332 CLEMENT ST,Site Visit,2,30,Pass,401,,,37.783269,-122.462932,POINT (-122.4629325 37.783269),Inner Richmond,1.0
4,2025/04/23 12:00:00 AM,Sojeatta Khim,1,103,607,6734548,STEEP CREAMERY & TEA - SECTION 110,H36 - STADIUM CONCESSIONS (PERM),3RD ST & KING ST,Routine,1,10,Pass,STADIUM,,,37.77813,-122.391855,POINT (-122.391855 37.77813),Mission Bay,6.0


In [17]:
df_org.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19056 entries, 0 to 19055
Data columns (total 21 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   inspection_date            19056 non-null  object 
 1   inspector                  19056 non-null  object 
 2   district                   19054 non-null  object 
 3   subdistrict                19042 non-null  object 
 4   subsector                  19035 non-null  object 
 5   permit_number              19038 non-null  object 
 6   dba                        18148 non-null  object 
 7   permit_type                19056 non-null  object 
 8   street_address_clean       19056 non-null  object 
 9   inspection_type            16673 non-null  object 
 10  inspection_frequency_type  18319 non-null  object 
 11  total_time                 18980 non-null  object 
 12  facility_rating_status     16533 non-null  object 
 13  census                     17622 non-null  obj

Columns that still contain null values:

'district', 'subdistrict', 'subsector', 'permit_number', 'dba', 'inspection_type', 'inspection_frequency_type', 'total_time',
'facility_rating_status', 'census', 'violation_count', 'violation_codes', 'latitude', 'longitude', 'point', 'analysis_neighborhood',
'supervisor_district'

In [18]:
# The target variable facility_rating_status has null values. We will drop those rows:
# imputing the target and then predicting those imputed labels would not make sense.

In [19]:
# Drop rows where 'facility_rating_status' is null
df_org = df_org.dropna(subset=['facility_rating_status'])

In [20]:
df_org.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16533 entries, 0 to 19055
Data columns (total 21 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   inspection_date            16533 non-null  object 
 1   inspector                  16533 non-null  object 
 2   district                   16533 non-null  object 
 3   subdistrict                16533 non-null  object 
 4   subsector                  16533 non-null  object 
 5   permit_number              16530 non-null  object 
 6   dba                        16533 non-null  object 
 7   permit_type                16533 non-null  object 
 8   street_address_clean       16533 non-null  object 
 9   inspection_type            15138 non-null  object 
 10  inspection_frequency_type  16182 non-null  object 
 11  total_time                 16480 non-null  object 
 12  facility_rating_status     16533 non-null  object 
 13  census                     16124 non-null  obj

In [21]:
# Let's see the fraction of null values in each column
(df_org.isnull().sum()).sort_values(ascending=False)/len(df_org)

violation_codes              0.439727
violation_count              0.439727
inspection_type              0.084377
census                       0.024738
inspection_frequency_type    0.021230
analysis_neighborhood        0.007137
supervisor_district          0.007137
latitude                     0.005807
longitude                    0.005807
point                        0.005807
total_time                   0.003206
permit_number                0.000181
permit_type                  0.000000
street_address_clean         0.000000
dba                          0.000000
inspector                    0.000000
facility_rating_status       0.000000
subsector                    0.000000
subdistrict                  0.000000
district                     0.000000
inspection_date              0.000000
dtype: float64

Columns that still contain null values after dropping rows with a missing target:

'permit_number', 'inspection_type', 'inspection_frequency_type', 'total_time', 'census', 'violation_count',
'violation_codes', 'latitude', 'longitude', 'point', 'analysis_neighborhood', 'supervisor_district'

In [22]:
#Fix all null values

# Percentage of rows with null violation count that are Pass
null_violation_count_pass = df_org[df_org['violation_count'].isna()]['facility_rating_status'].value_counts(normalize=True) * 100
print(null_violation_count_pass)

# Percentage of rows with null violation codes that are Pass
null_violation_code_pass = df_org[df_org['violation_codes'].isna()]['facility_rating_status'].value_counts(normalize=True) * 100
print(null_violation_code_pass)


Pass                99.724897
Conditional Pass     0.165062
Closure              0.110041
Name: facility_rating_status, dtype: float64
Pass                99.724897
Conditional Pass     0.165062
Closure              0.110041
Name: facility_rating_status, dtype: float64
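
A cross-tabulation makes the same check in one step and also shows the rating distribution for rows where the violation count is present, which gives a baseline to compare against. This is a minimal sketch on toy data; the values and the `count_is_null` label are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data standing in for df_org (values are invented for illustration).
toy = pd.DataFrame({
    "violation_count": [np.nan, np.nan, np.nan, 4.0, 2.0, np.nan],
    "facility_rating_status": [
        "Pass", "Pass", "Pass", "Conditional Pass", "Pass", "Closure",
    ],
})

# Rows: whether violation_count is null. Columns: rating status.
# normalize="index" makes each row sum to 1, so * 100 gives row percentages.
table = pd.crosstab(
    toy["violation_count"].isna().rename("count_is_null"),
    toy["facility_rating_status"],
    normalize="index",
) * 100
print(table.round(2))
```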


In [23]:
# Since 99.7% of the rows with nulls in violation_count and violation_codes are rated Pass, it makes sense to assume
# that null here means no violation. So we will impute 0 for NaN values in violation_count and an empty string in violation_codes.

In [24]:
# Fill nulls in violation_count with 0
df_org['violation_count'] = df_org['violation_count'].fillna(0)

# Fill nulls in violation_codes with an empty string
df_org['violation_codes'] = df_org['violation_codes'].fillna('')


In [25]:
df_org.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16533 entries, 0 to 19055
Data columns (total 21 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   inspection_date            16533 non-null  object 
 1   inspector                  16533 non-null  object 
 2   district                   16533 non-null  object 
 3   subdistrict                16533 non-null  object 
 4   subsector                  16533 non-null  object 
 5   permit_number              16530 non-null  object 
 6   dba                        16533 non-null  object 
 7   permit_type                16533 non-null  object 
 8   street_address_clean       16533 non-null  object 
 9   inspection_type            15138 non-null  object 
 10  inspection_frequency_type  16182 non-null  object 
 11  total_time                 16480 non-null  object 
 12  facility_rating_status     16533 non-null  object 
 13  census                     16124 non-null  obj

In [26]:
# Look at the percentage share of each value in the columns that still have null values.
for i in ['permit_number', 'inspection_type', 'inspection_frequency_type', 'total_time', 
          'census', 'latitude', 'longitude', 'point', 'analysis_neighborhood', 'supervisor_district']:
    print(i, df_org[i].value_counts(normalize=True)*100, sep='\n\n')
    print('-'*75)

permit_number

105555         0.193587
18306          0.187538
2133           0.120992
06733818       0.114943
H2606732772    0.090744
                 ...   
H2606732589    0.006050
86478          0.006050
06733120       0.006050
113229         0.006050
90213          0.006050
Name: permit_number, Length: 7167, dtype: float64
---------------------------------------------------------------------------
inspection_type

Routine                55.641432
Reinspection           19.447747
New Ownership (I)       7.993130
New Ownership (R)       4.643942
Site Visit              4.564672
Complaint (I)           4.492007
Foodborne Illness       1.770379
Complaint (R)           0.858766
Plan Check (I)          0.277447
Structural              0.178359
Plan Check (R)          0.105694
Plan Check              0.019818
Change of Ownership     0.006606
Name: inspection_type, dtype: float64
---------------------------------------------------------------------------
inspection_frequency_type

2       

In [None]:
# inspection_type and inspection_frequency_type each have a dominant most-frequent value, so we can impute the nulls in these 2 columns with the mode.
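
The mode imputation described above could look roughly like the following. It is a sketch on a toy frame rather than the final implementation; the toy values are invented, and on the real data the loop would run over `df_org` instead.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_org (values are invented for illustration).
toy = pd.DataFrame({
    "inspection_type": ["Routine", "Routine", np.nan, "Reinspection"],
    "inspection_frequency_type": ["2", np.nan, "2", "1"],
})

for col in ["inspection_type", "inspection_frequency_type"]:
    # mode() returns a Series because ties are possible; take the first entry.
    mode_value = toy[col].mode(dropna=True).iloc[0]
    toy[col] = toy[col].fillna(mode_value)

print(toy)
```

If the model is later evaluated on a held-out split, the mode should be computed on the training rows only to avoid leaking information from the test rows.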

In [None]:
# point is nothing but the latitude/longitude pair, stored as POINT (longitude latitude).

## OUTLINE (moving forward):

Geographical columns (census, latitude, longitude, point, analysis_neighborhood, supervisor_district)

All are related to location. Nulls here can be imputed using geo-mapping:

* If you have the street address, you can use geocoding APIs (Google Maps, OpenStreetMap/Nominatim, or geopy in Python) to get latitude, longitude, and potentially district, neighborhood, and census info.
* If you have district or neighborhood, you can infer missing census or analysis_neighborhood values based on that aggregation.
* point is a geometric representation (latitude-longitude pair or a GIS point object), which can also be reconstructed from latitude and longitude.
* A null supervisor_district can be filled if each district/neighborhood has a known supervisor mapping.
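
Two of these ideas need no external service and can be sketched directly in pandas: filling `supervisor_district` from the most common district seen in the same `analysis_neighborhood`, and rebuilding `point` from `latitude` and `longitude`. The toy frame below is invented for illustration, and the neighborhood-to-district lookup is an assumption: a neighborhood may straddle more than one supervisor district, so the result should be treated as an approximation.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_org (values are invented for illustration).
toy = pd.DataFrame({
    "analysis_neighborhood": ["Mission Bay", "Mission Bay", "Inner Richmond", "Inner Richmond"],
    "supervisor_district": [6.0, np.nan, 1.0, 1.0],
    "latitude": [37.77813, 37.77813, 37.783269, np.nan],
    "longitude": [-122.391855, -122.391855, -122.4629325, np.nan],
    "point": ["POINT (-122.391855 37.77813)", None, None, None],
})

# Most common supervisor district observed in each neighborhood.
district_by_hood = (
    toy.dropna(subset=["supervisor_district"])
       .groupby("analysis_neighborhood")["supervisor_district"]
       .agg(lambda s: s.mode().iloc[0])
)

# Fill missing districts from the neighborhood lookup.
toy["supervisor_district"] = toy["supervisor_district"].fillna(
    toy["analysis_neighborhood"].map(district_by_hood)
)

# Rebuild the point text (longitude first, then latitude) where both coordinates exist.
has_coords = toy["latitude"].notna() & toy["longitude"].notna()
rebuilt = (
    "POINT (" + toy["longitude"].astype(str) + " " + toy["latitude"].astype(str) + ")"
)
toy["point"] = toy["point"].where(toy["point"].notna(), rebuilt.where(has_coords))

print(toy[["supervisor_district", "point"]])
```

Rows with no coordinates at all keep a null `point`; those would still need geocoding from the street address.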

Permit number: nulls can be assigned a placeholder such as "NA" or "Missing".

Total time with negative values

Investigate the cause first. Negative values likely indicate data entry errors, e.g., end time < start time.

Options:

* Remove the rows if they are errors.
* Take absolute values if negative time is an artifact of computation.
* Recompute from start and end times if available.
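
The first two options can be sketched as follows, assuming `total_time` has already been converted to a numeric dtype. The toy frame is invented for illustration. The third option is not shown because this dataset has no start and end time columns to recompute from.

```python
import numpy as np
import pandas as pd

# Toy frame with total_time already numeric (values are invented for illustration).
toy = pd.DataFrame({
    "dba": ["A", "B", "C", "D", "E"],
    "total_time": [10.0, 60.0, -30.0, np.nan, 45.0],
})

# Flag negative durations; NaN compares as False, so missing values are not flagged.
negative = toy["total_time"] < 0
print("Rows with negative total_time:", int(negative.sum()))

# Option 1: drop the rows with negative durations.
dropped = toy.loc[~negative].copy()

# Option 2: keep the rows but take the absolute value.
absolute = toy.assign(total_time=toy["total_time"].abs())
```

Which option is appropriate depends on what the investigation into the negative values turns up.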

In [40]:
#Change the data type
dtype_mapping = {
    'total_time': 'int64',       # number
    'inspection_date': 'datetime64[ns]',  # timestamp
}

for col, dtype in dtype_mapping.items():
    df_org[col] = df_org[col].astype(dtype)


ValueError: cannot convert float NaN to integer
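
The error arises because NumPy's `int64` has no way to represent a missing value. One possible workaround, sketched here on toy data, is to parse the column with `pd.to_numeric` and store it in pandas' nullable `Int64` dtype, and to parse the dates with `pd.to_datetime` and an explicit format. The toy values are invented, and the date format string is an assumption based on the values displayed above. The cast to `Int64` also assumes every duration is a whole number of minutes; it would raise if fractional values were present.

```python
import pandas as pd

# Toy frame mimicking the string-typed columns (values are invented for illustration).
toy = pd.DataFrame({
    "total_time": ["10", "60", None, "-30"],
    "inspection_date": [
        "2025/04/23 12:00:00 AM",
        "2025/04/22 12:00:00 AM",
        "2024/10/29 12:00:00 AM",
        "2024/08/19 12:00:00 AM",
    ],
})

# Strings -> numbers; anything unparseable becomes NaN instead of raising.
# The nullable "Int64" dtype (capital I) can then hold the missing values as <NA>.
toy["total_time"] = pd.to_numeric(toy["total_time"], errors="coerce").astype("Int64")

# An explicit format avoids ambiguous parsing of the 12-hour timestamps.
toy["inspection_date"] = pd.to_datetime(
    toy["inspection_date"], format="%Y/%m/%d %I:%M:%S %p"
)

print(toy.dtypes)
```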

In [41]:
(df_org['total_time'] < 0).sum()

TypeError: '<' not supported between instances of 'str' and 'int'
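
The comparison fails because `total_time` is stored as strings (dtype `object`). A sketch of how the negative values might be counted without changing the column in place is shown below, using a toy Series with invented values. It also reports how many non-null entries failed to parse, which is worth checking before trusting the count.

```python
import pandas as pd

# Toy Series mimicking the string-typed total_time column (values are invented).
total_time = pd.Series(["10", "-5", "60", None, "-30"], name="total_time")

# Parse to numbers; unparseable entries become NaN rather than raising.
numeric_time = pd.to_numeric(total_time, errors="coerce")

# NaN < 0 evaluates to False, so missing values are not counted as negative.
n_negative = int((numeric_time < 0).sum())

# Entries that were present as text but could not be parsed as numbers.
n_unparsed = int(numeric_time.isna().sum() - total_time.isna().sum())

print("Negative durations:", n_negative)
print("Unparseable entries:", n_unparsed)
```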

In [42]:
# Fill in NA values
# Change the dtype
# Remove duplicates and unnecessary columns
# Visualize and analyse the data
# Modelling