In [1]:
# Import necessary libraries and modules
import pandas as pd
import numpy as np
import john_acquire as a  # Custom module for data acquisition
%load_ext autoreload
%autoreload 2

# Set the option to display all columns in DataFrames
pd.set_option('display.max_columns', None)


### Dataset Overview

| Column | Description | Data Type |
| --- | --- | --- |
| camis | Unique identifier for each record | int64 |
| dba | Doing Business As (DBA) name | object |
| boro | Borough where the establishment is located | object |
| building | Building number | object |
| street | Street name | object |
| zipcode | Zip code | float64 |
| phone | Phone number | object |
| cuisine\_description | Description of the cuisine type | object |
| inspection\_date | Date of inspection | object |
| action | Action taken during inspection | object |
| critical\_flag | Indicator of critical violations | object |
| score | Inspection score | float64 |
| record\_date | Date of record | object |
| inspection\_type | Type of inspection | object |
| latitude | Latitude of the establishment | float64 |
| longitude | Longitude of the establishment | float64 |
| community\_board | Community board district | float64 |
| council\_district | Council district | float64 |
| census\_tract | Census tract | float64 |
| bin | Building identification number | float64 |
| bbl | Borough block and lot number | float64 |
| nta | Neighborhood Tabulation Area (NTA) | object |
| violation\_code | Code indicating violations | object |
| violation\_description | Description of the violations | object |
| grade | Inspection grade | object |
| grade\_date | Date of inspection grade | object |

In [2]:
# Load the acquired data from the CSV file
inspection_df = pd.read_csv('nyc_health_inspections_2000_to_2023.csv', index_col=False)

# Save the first few rows of the loaded DataFrame to a CSV file
inspection_df.head().to_csv('inspections_df_head.csv')

In [1360]:
pd.DataFrame({
    'Numeric_Zero_Count': (inspection_df == 0).sum(),
    'String_Zero_Count': (inspection_df == '0').sum(),
    'Null_Count': inspection_df.isna().sum()
})

Unnamed: 0,Numeric_Zero_Count,String_Zero_Count,Null_Count
camis,0,0,0
dba,0,0,0
boro,0,0,0
building,0,430,289
street,0,0,0
zipcode,0,0,2642
phone,0,0,6
cuisine_description,0,0,0
inspection_date,0,0,0
action,0,0,0


In [1318]:
# Display information about the dataset, including non null counts per column
inspection_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207365 entries, 0 to 207364
Data columns (total 26 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   camis                  207365 non-null  int64  
 1   dba                    207365 non-null  object 
 2   boro                   207365 non-null  object 
 3   building               207076 non-null  object 
 4   street                 207365 non-null  object 
 5   zipcode                204723 non-null  float64
 6   phone                  207359 non-null  object 
 7   cuisine_description    207365 non-null  object 
 8   inspection_date        207365 non-null  object 
 9   action                 207365 non-null  object 
 10  critical_flag          207365 non-null  object 
 11  score                  199793 non-null  float64
 12  record_date            207365 non-null  object 
 13  inspection_type        207365 non-null  object 
 14  latitude               207119 non-nu

### Checking for Missing Values

Summarizing Missing Values by Column

In [1319]:
# Calculate the count of missing values in each column
null_counts_by_column = inspection_df.isnull().sum()

# Filter and display columns with missing values
null_counts_by_column[null_counts_by_column > 0]

building                    289
zipcode                    2642
phone                         6
score                      7572
latitude                    246
longitude                   246
community_board            3153
council_district           3157
census_tract               3157
bin                        4144
bbl                         511
nta                        3153
violation_code             1157
violation_description      1157
grade                    104404
grade_date               113157
dtype: int64

##### Inferring Missing Values

Our next step is to strategize how to address these missing values by leveraging available data in other columns. The proposed hierarchy for inference is as follows:

`lat&long < building < bin < bbl < nta, zipcode* < community board < council district < census tract`

Given the relatively low count of missing values in the BBL column, it appears to be a promising candidate for inferring related data such as NTA (Neighborhood Tabulation Area), Community Board, Council District, and Census Tract.
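As a rough illustration of the inference idea described above, the sketch below fills a missing NTA from other rows that share the same BBL. The toy values, the `lookup` name, and the `nta_filled` column are assumptions made for the example, not part of the dataset workflow.

```python
import pandas as pd

# Toy data (hypothetical): two lots, each with one known and one missing NTA
toy = pd.DataFrame({
    "bbl": [1000010010.0, 1000010010.0, 3011970000.0, 3011970000.0],
    "nta": ["MN25", None, "BK60", None],
})

# Build a bbl -> nta lookup from rows where both values are known
lookup = (
    toy.dropna(subset=["bbl", "nta"])
       .drop_duplicates("bbl")
       .set_index("bbl")["nta"]
)

# Fill missing NTA values by mapping each row's BBL through the lookup
toy["nta_filled"] = toy["nta"].fillna(toy["bbl"].map(lookup))
print(toy)
```

This only works when the key column (here BBL) is itself reliable, which is what the next cells check.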

Let's examine the first few unique values in the BBL column to understand its content.

In [1320]:
# Let's take a look at the first few unique values in the BBL column
sorted(inspection_df.bbl.unique())[:10]


[1.0,
 2.0,
 3.0,
 4.0,
 1000000000.0,
 1000010010.0,
 1000020002.0,
 1000030001.0,
 1000047501.0,
 1000057501.0]

The initial examination of the BBL column shows the presence of non-standard values such as 1.0, 2.0, 3.0, 4.0, etc., which do not conform to the expected 10-digit format.
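A more general check than listing specific values is to test each value against the 10-digit shape. The sketch below assumes a standard BBL is a whole number whose first digit is a borough code from 1 to 5; the helper name `is_standard_bbl` is hypothetical.

```python
import numpy as np
import pandas as pd

def is_standard_bbl(bbl: pd.Series) -> pd.Series:
    """True where the value is a whole 10-digit number starting with 1-5 (assumed format)."""
    whole = bbl % 1 == 0                                  # NaN compares as False
    in_range = bbl.between(1_000_000_000, 5_999_999_999)  # NaN compares as False
    return whole & in_range

sample = pd.Series([1.0, 2.0, np.nan, 1000010010.0, 3011970000.0])
mask = is_standard_bbl(sample)
print((~mask).sum(), "non-standard values")
```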

Now, let's find out the count of these non-standard values:

In [1321]:
# Define non-standard BBL values
bbl_values = [np.nan, 1.0, 2.0, 3.0, 4.0, 5.0]

# Calculate the count of these non-standard values in the BBL column
inspection_df['bbl'].isin(bbl_values).sum()

4144

The count of non-standard BBL values (4144) matches the number of NaN values in the BIN column, and the related geographic columns are missing at a similar scale, which suggests that the same records lack data across these key columns.

- census_tract               3157
- bin                        4144
- bbl                        4144
- nta                        3153
- community_board            3153
- council_district           3157

We cannot rely on BBL to make inferences, so we must abandon the hierarchical inference plan. To proceed, we drop rows with NaN values in the BIN column.

In [1322]:
# Dropping rows with null values in the 'bin' column
inspection_df = inspection_df.dropna(subset=['bin'])


##### Reevaluating Null Counts After BIN Column Cleanup

In [1323]:
# Calculate the count of missing values in each column
null_counts_by_column = inspection_df.isnull().sum()

# Filter and display columns with missing values
null_counts_by_column[null_counts_by_column > 0]

zipcode                      30
phone                         6
score                      7440
community_board              30
council_district             34
census_tract                 34
nta                          30
violation_code             1076
violation_description      1076
grade                    102689
grade_date               111279
dtype: int64

As expected, most of the NaNs in the geographic columns sat in the same rows as the missing BIN values, so their counts dropped sharply.
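One way to confirm that nulls in different columns fall on the same rows is to count, among rows where one column is null, how many are also null in the others. This is a minimal sketch on made-up values; the column choice is only illustrative.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with partially overlapping nulls
toy = pd.DataFrame({
    "zipcode": [10001.0, np.nan, np.nan, 11225.0],
    "nta": ["MN13", None, None, "BK60"],
    "council_district": [3.0, np.nan, np.nan, np.nan],
})

null_flags = toy.isna()

# Among rows with a null council_district, count nulls in every column
overlap = null_flags[null_flags["council_district"]].sum()
print(overlap)
```

If the counts for the other columns are close to the count for the reference column, dropping rows on that one column removes most of the remaining nulls as well.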

##### Handling Remaining Null Values

Only a small number of NaNs remain in the geographic columns (e.g., 30 in ZIP code), so we can safely drop those rows with limited impact on the dataset. I chose to drop rows with a null 'council_district' to see whether this also removes the other NaNs.

In [1324]:
# Dropping rows with null values in the 'council_district' column
inspection_df = inspection_df.dropna(subset=['council_district'])

# Reassessing the null counts in the dataset
null_counts_by_column = inspection_df.isnull().sum()
null_counts_by_column[null_counts_by_column > 0]

phone                         6
score                      7438
violation_code             1076
violation_description      1076
grade                    102677
grade_date               111265
dtype: int64

After dropping null values in the 'council\_district' column, we are left with a few nulls in columns such as 'phone,' 'score,' 'violation\_code,' 'violation\_description,' 'grade,' and 'grade\_date.'

### Analyzing Phone

Only six phone values are missing, so we can simply fill them with a placeholder of all zeros, '0000000000':

In [1325]:
# Fill the remaining nulls in the 'phone' column with the placeholder '0000000000'
inspection_df['phone'] = inspection_df['phone'].fillna('0000000000')

# Reassessing the null counts in the dataset
null_counts_by_column = inspection_df.isnull().sum()
null_counts_by_column[null_counts_by_column > 0]

score                      7438
violation_code             1076
violation_description      1076
grade                    102677
grade_date               111265
dtype: int64

### Grades Column

We drop the 'grade' and 'grade\_date' columns. According to the documentation, not all inspections receive a grade, and a recorded grade may not match the score because of input errors. If a grade is needed, it can be calculated from the score.
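As a sketch of how a grade could be derived from the score, the example below uses the commonly published NYC thresholds (0 to 13 is A, 14 to 27 is B, 28 or more is C). These cut-offs and the helper name `score_to_grade` are assumptions for illustration; they should be checked against the dataset documentation, and not every inspection type is gradable.

```python
import pandas as pd

def score_to_grade(score: pd.Series) -> pd.Series:
    """Map inspection scores to letter grades using assumed thresholds."""
    bins = [-1, 13, 27, float("inf")]   # intervals (-1, 13], (13, 27], (27, inf]
    return pd.cut(score, bins=bins, labels=["A", "B", "C"])

grades = score_to_grade(pd.Series([0, 13, 14, 27, 28, 45]))
print(grades.tolist())
```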

In [1326]:
# Dropping the 'grade' and 'grade_date' columns
inspection_df = inspection_df.drop(['grade', 'grade_date'], axis=1)

In [1327]:
# Reassessing the null counts in the dataset
null_counts_by_column = inspection_df.isnull().sum()
null_counts_by_column[null_counts_by_column > 0]

score                    7438
violation_code           1076
violation_description    1076
dtype: int64

##### Identifying Relevant Inspection Types

Before proceeding with the score nulls, let's identify and focus on inspection types related to food safety.

In [1328]:
# Collect the unique inspection types and sort them for review
unique_inspection_types = inspection_df['inspection_type'].unique()
sorted_inspection_types = sorted(unique_inspection_types.tolist())
sorted_inspection_types

We will exclude types such as "Calorie Posting," "Pre-permit," "Smoke-Free Air Act," and "Trans Fat," as they do not directly pertain to food safety.

In [1329]:
# List of inspection types to be removed
remove_types = ["Calorie Posting", "Pre-permit", "Smoke-Free Air Act", "Trans Fat"]

# Filter the DataFrame in a single step
inspection_df = inspection_df[~inspection_df['inspection_type'].str.startswith(tuple(remove_types))]
len(inspection_df)

155970

In [1330]:
null_counts_by_column = inspection_df.isnull().sum()
null_counts_by_column[null_counts_by_column > 0]

score                    6024
violation_code            797
violation_description     797
dtype: int64

Some rows had overlapping null values when we dropped them, resulting in a minor reduction in overall null counts. However, we still need to investigate a few remaining violation codes.

Now, let's examine the history of a restaurant with a null score to understand why these nulls occur.

In [1331]:
# Step 1: Group by 'camis' and 'inspection_date' and flag groups with any null 'score'
grouped = inspection_df.groupby(['camis', 'inspection_date'])
groups_with_nulls = grouped.apply(lambda x: x['score'].isna().any())

# Step 2: Filter the DataFrame to include only those groups
filtered_df = inspection_df[inspection_df.set_index(['camis', 'inspection_date']).index.isin(groups_with_nulls[groups_with_nulls].index)].reset_index(drop=True)

# Now, 'filtered_df' contains only the inspections that have at least one null 'score'
filtered_df.sort_values(by='camis').head(3)


Unnamed: 0,camis,dba,boro,building,street,zipcode,phone,cuisine_description,inspection_date,action,critical_flag,score,record_date,inspection_type,latitude,longitude,community_board,council_district,census_tract,bin,bbl,nta,violation_code,violation_description
5179,30112340,WENDY'S,Brooklyn,469,FLATBUSH AVENUE,11225.0,7182875005,Hamburgers,2022-07-13T00:00:00.000,Violations were cited in the following area(s).,Critical,11.0,2023-12-01T06:00:08.000,Cycle Inspection / Initial Inspection,40.662652,-73.962081,309.0,40.0,32700.0,3029737.0,3011970000.0,BK60,02G,Cold TCS food item held above 41 °F; smoked or...
2213,30112340,WENDY'S,Brooklyn,469,FLATBUSH AVENUE,11225.0,7182875005,Hamburgers,2022-07-13T00:00:00.000,Violations were cited in the following area(s).,Not Critical,,2023-12-01T06:00:08.000,Administrative Miscellaneous / Initial Inspection,40.662652,-73.962081,309.0,40.0,32700.0,3029737.0,3011970000.0,BK60,20-06,Current letter grade or Grade Pending card not...
3508,30112340,WENDY'S,Brooklyn,469,FLATBUSH AVENUE,11225.0,7182875005,Hamburgers,2022-07-13T00:00:00.000,Violations were cited in the following area(s).,Not Critical,11.0,2023-12-01T06:00:08.000,Cycle Inspection / Initial Inspection,40.662652,-73.962081,309.0,40.0,32700.0,3029737.0,3011970000.0,BK60,10F,Non-food contact surface or equipment made of ...


Using the date, we can see that each violation from a single inspection is recorded in its own row, and that a single inspection can include more than one inspection type. Notably, the 'Administrative Miscellaneous' rows appear to be the ones with a missing 'score'. Let's determine how many rows of each inspection type have a NaN score.
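To see how often one visit mixes inspection types, the distinct types can be counted per restaurant and date. The sketch below uses made-up rows modeled on the example above; `types_per_visit` and `mixed_visits` are hypothetical names.

```python
import pandas as pd

# Toy rows (hypothetical): one visit with two inspection types, one with a single type
toy = pd.DataFrame({
    "camis": [30112340, 30112340, 30112340, 50064240],
    "inspection_date": ["2022-07-13", "2022-07-13", "2022-07-13", "2022-09-21"],
    "inspection_type": [
        "Cycle Inspection / Initial Inspection",
        "Administrative Miscellaneous / Initial Inspection",
        "Cycle Inspection / Initial Inspection",
        "Cycle Inspection / Initial Inspection",
    ],
})

# Number of distinct inspection types recorded for each visit
types_per_visit = toy.groupby(["camis", "inspection_date"])["inspection_type"].nunique()

# Visits that combine more than one inspection type
mixed_visits = types_per_visit[types_per_visit > 1]
print(mixed_visits)
```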

In [1332]:
# Group by 'inspection_type' and count null 'score' entries
null_score_count = inspection_df.groupby('inspection_type').apply(lambda x: x['score'].isnull().sum())

# The result is a Series where the index is 'inspection_type' and the values are the counts of null 'score'
print(null_score_count)


inspection_type
Administrative Miscellaneous / Compliance Inspection             99
Administrative Miscellaneous / Initial Inspection              4899
Administrative Miscellaneous / Re-inspection                    975
Administrative Miscellaneous / Reopening Inspection              43
Administrative Miscellaneous / Second Compliance Inspection       8
Cycle Inspection / Compliance Inspection                          0
Cycle Inspection / Initial Inspection                             0
Cycle Inspection / Re-inspection                                  0
Cycle Inspection / Reopening Inspection                           0
Cycle Inspection / Second Compliance Inspection                   0
Inter-Agency Task Force / Initial Inspection                      0
Inter-Agency Task Force / Re-inspection                           0
dtype: int64


We can see that all the null values in the 'score' column are associated with the "Administrative Miscellaneous" inspection types. It's possible that this inspection type is used to record violations of a different category, especially since there's a mix of "Administrative" and "Cycle" rows for a single visit. This observation suggests that we might be able to infer the score or consider dropping the "Administrative" inspection type altogether.
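The per-type null counts can also be computed without `groupby(...).apply(...)`, by grouping the boolean null mask directly. This is an equivalent formulation shown on toy data, not the code used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({
    "inspection_type": [
        "Administrative Miscellaneous / Initial Inspection",
        "Administrative Miscellaneous / Initial Inspection",
        "Cycle Inspection / Initial Inspection",
    ],
    "score": [np.nan, np.nan, 11.0],
})

# Group the null mask by inspection type and sum the True values
null_score_by_type = toy["score"].isna().groupby(toy["inspection_type"]).sum()
print(null_score_by_type)
```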

To make an informed decision, let's examine the types of violations that we commonly see in "Administrative" inspections. This analysis will provide further insights into whether these inspections are relevant for our data analysis and whether they have a meaningful impact on the overall score.

-   We filtered the dataset to include rows where the 'inspection\_type' starts with "Administrative."
-   We counted and displayed the unique 'violation\_description' values for these rows, shedding light on the types of violations associated with "Administrative" inspections

In [1333]:
# Filter for rows where 'inspection_type' starts with "Administrative"
administrative_rows = inspection_df[inspection_df['inspection_type'].str.startswith("Administrative")]

# Get a count of each unique 'violation_description' in these rows
violation_description_counts = administrative_rows['violation_description'].value_counts()

# Display the counts
violation_description_counts

violation_description
Food allergy information poster not conspicuously posted where food is being prepared or processed by food workers.                                                                                                                                                                                                                                  699
Current letter grade or Grade Pending card not posted                                                                                                                                                                                                                                                                                                600
Failure to post or conspicuously post healthy eating information                                                                                                                                                                                                                                

The analysis revealed that "Administrative" inspections primarily include non-food safety violations, such as missing posters, signage, or documentation, rather than critical food safety issues. Common violations in "Administrative" inspections include:

-   Missing "Choking first aid" and "Alcohol and pregnancy" posters.
-   Failure to post or conspicuously post current letter grades or Grade Pending cards.
-   Providing certain items without customer request, such as plastic straws.

Given that "Administrative" inspections do not contribute to our food safety analysis and that they primarily involve non-critical violations, we made the decision to drop rows where the 'inspection_type' starts with "Administrative." 

In [1334]:
# Drop rows where 'inspection_type' starts with "Administrative"
inspection_df = inspection_df[~inspection_df['inspection_type'].str.startswith("Administrative")]
null_counts_by_column = inspection_df.isnull().sum()
null_counts_by_column[null_counts_by_column > 0]

violation_code           438
violation_description    438
dtype: int64

As a result, no null values remain in the 'score' column. We will continue to investigate the remaining null values in 'violation_code' and 'violation_description' to understand why they exist, even though they are relatively few in number.

In [1335]:
# Group by 'inspection_type' and count null 'violation_code' entries
null_violation_count = inspection_df.groupby('inspection_type').apply(lambda x: x['violation_code'].isnull().sum())
null_violation_count

inspection_type
Cycle Inspection / Compliance Inspection             2
Cycle Inspection / Initial Inspection              256
Cycle Inspection / Re-inspection                    40
Cycle Inspection / Reopening Inspection             70
Cycle Inspection / Second Compliance Inspection      0
Inter-Agency Task Force / Initial Inspection        69
Inter-Agency Task Force / Re-inspection              1
dtype: int64

As observed, the presence of null values in the 'violation_code' and 'violation_description' columns varies depending on the inspection type. While this insight doesn't directly explain why these nulls exist, it's a useful observation. To delve deeper into the reasons behind these nulls, we can examine the 'action' column.

In [1336]:
violation_code_null = inspection_df[inspection_df['violation_code'].isna()]
# Group by 'action' and count null 'violation_code' entries
null_violation_count = violation_code_null.groupby('action').apply(lambda x: x['violation_code'].isnull().sum())
null_violation_count

action
Establishment re-opened by DOHMH.                               70
No violations were recorded at the time of this inspection.    364
Violations were cited in the following area(s).                  4
dtype: int64

In [1337]:
null_violation_count.sum()

438

Our analysis has revealed that a significant portion of the null values in the 'violation_code' and 'violation_description' columns are associated with cases where no violations were found during inspections.

We plan to handle null values as follows:

1.  For rows with the action "No violations were recorded at the time of this inspection.", set 'violation_code' to 'none' and 'violation_description' to "No violations were recorded."


In [1338]:
# Identify rows where 'action' starts with the specified strings and 'violation_code' is null
condition = inspection_df['violation_code'].isna() & inspection_df['action'].str.startswith("No violations were recorded at the time of this inspection.")

# Update 'violation_code' and 'violation_description' for these rows
inspection_df.loc[condition, ['violation_code', 'violation_description']] = ['none', 'No violations were recorded']

In [1339]:
violation_code_null = inspection_df[inspection_df['violation_code'].isna()]
# Group by 'action' and count null 'violation_code' entries
null_violation_count = violation_code_null.groupby('action').apply(lambda x: x['violation_code'].isnull().sum())
null_violation_count

action
Establishment re-opened by DOHMH.                  70
Violations were cited in the following area(s).     4
dtype: int64


To gain further clarity and address the remaining nulls, we will focus on the few remaining rows. Let's begin by examining the rows related to reopening inspections to understand why some of them have null values in these columns.

In [1340]:
# Filter rows where 'action' starts with "Establishment re-opened by DOHMH"
action_reopened = inspection_df[inspection_df['action'].str.startswith("Establishment re-opened by DOHMH")]
len(action_reopened)

1360

In [1341]:
action_reopened.head(5)

Unnamed: 0,camis,dba,boro,building,street,zipcode,phone,cuisine_description,inspection_date,action,critical_flag,score,record_date,inspection_type,latitude,longitude,community_board,council_district,census_tract,bin,bbl,nta,violation_code,violation_description
9,50078860,A FEI CHINESE RESTAURANT,Brooklyn,553,THROOP AVENUE,11216.0,7184535205,Chinese,2022-02-23T00:00:00.000,Establishment re-opened by DOHMH.,Not Applicable,0.0,2023-12-01T06:00:08.000,Cycle Inspection / Reopening Inspection,40.68316,-73.940966,303.0,36.0,27500.0,3052864.0,3018410000.0,BK35,,
88,50032753,AGAVI ORGANIC JUICEBAR,Manhattan,72,EAST 7 STREET,10003.0,2123908042,"Juice, Smoothies, Fruit Salads",2022-03-11T00:00:00.000,Establishment re-opened by DOHMH.,Not Applicable,0.0,2023-12-01T06:00:08.000,Cycle Inspection / Reopening Inspection,40.72739,-73.986766,103.0,2.0,3800.0,1006277.0,1004480000.0,MN22,,
139,50005590,FAMOUS SICHUAN,Manhattan,10,PELL STREET,10013.0,2122333888,Chinese,2022-09-29T00:00:00.000,Establishment re-opened by DOHMH.,Not Applicable,0.0,2023-12-01T06:00:08.000,Cycle Inspection / Reopening Inspection,40.714729,-73.997598,103.0,1.0,2900.0,1001776.0,1001630000.0,MN27,,
227,41549281,DUNKIN,Manhattan,316,WEST 34 STREET,10001.0,2127602600,Donuts,2022-07-29T00:00:00.000,Establishment re-opened by DOHMH.,Not Applicable,0.0,2023-12-01T06:00:08.000,Cycle Inspection / Reopening Inspection,40.752494,-73.994221,104.0,3.0,10300.0,1013552.0,1007570000.0,MN13,,
249,41642570,JOHN'S DELI,Brooklyn,2438,STILLWELL AVENUE,11223.0,7187144377,American,2022-01-25T00:00:00.000,Establishment re-opened by DOHMH.,Not Applicable,0.0,2023-12-01T06:00:08.000,Cycle Inspection / Reopening Inspection,40.588029,-73.983622,313.0,47.0,30800.0,3187046.0,3069050000.0,BK26,,


Some reopening inspections have violation codes and descriptions while others have NaN values. We can reasonably assume that the NaN values indicate that no violations were found during those inspections.

We plan to handle null values as follows:

1.  For rows with "Establishment re-opened by DOHMH" action, replace NaN values with "No violations were recorded."


In [1342]:
# Identify rows where 'action' starts with the specified strings and 'violation_code' is null
condition = inspection_df['violation_code'].isna() & inspection_df['action'].str.startswith("Establishment re-opened by DOHMH")

# Update 'violation_code' and 'violation_description' for these rows
inspection_df.loc[condition, ['violation_code', 'violation_description']] = ['none', 'No violations were recorded']

In [1343]:
violation_code_null = inspection_df[inspection_df['violation_code'].isna()]
# Group by 'action' and count null 'violation_code' entries
null_violation_count = violation_code_null.groupby('action').apply(lambda x: x['violation_code'].isnull().sum())
null_violation_count

action
Violations were cited in the following area(s).    4
dtype: int64

That leaves us with:
- Violations were cited in the following area(s).


In [1344]:
# Select rows where 'action' starts with "Violations were cited in the following area(s)"
action_violationcited = inspection_df[inspection_df['action'].str.startswith("Violations were cited in the following area(s)")]
action_violationcited.head()

Unnamed: 0,camis,dba,boro,building,street,zipcode,phone,cuisine_description,inspection_date,action,critical_flag,score,record_date,inspection_type,latitude,longitude,community_board,council_district,census_tract,bin,bbl,nta,violation_code,violation_description
2,50064240,DAXI SICHUAN,Queens,136-20,ROOSEVELT AVENUE,11354.0,9175631983,Chinese,2022-09-21T00:00:00.000,Violations were cited in the following area(s).,Not Critical,13.0,2023-12-01T06:00:08.000,Cycle Inspection / Initial Inspection,40.759778,-73.829235,407.0,20.0,85300.0,4113546.0,4050190000.0,QN22,09B,Thawing procedure improper.
4,50069583,PHO BEST,Queens,4235,MAIN ST,11355.0,9173618878,Southeast Asian,2022-05-09T00:00:00.000,Violations were cited in the following area(s).,Critical,30.0,2023-12-01T06:00:08.000,Cycle Inspection / Initial Inspection,40.754418,-73.827881,407.0,20.0,85300.0,4573539.0,4051358000.0,QN22,02B,Hot food item not held at or above 140º F.
7,50089970,FOO ON RESTAURANT,Queens,18304,HILLSIDE AVE,11432.0,7182971287,Chinese,2022-04-08T00:00:00.000,Violations were cited in the following area(s).,Not Critical,19.0,2023-12-01T06:00:08.000,Cycle Inspection / Initial Inspection,40.71387,-73.778712,412.0,23.0,47000.0,4212729.0,4099300000.0,QN61,10J,Hand wash sign not posted
8,40986189,LIEBMAN'S DELI,Bronx,552,WEST 235 STREET,10463.0,7185484534,Jewish/Kosher,2023-10-18T00:00:00.000,Violations were cited in the following area(s).,Critical,23.0,2023-12-01T06:00:08.000,Cycle Inspection / Initial Inspection,40.885579,-73.909622,208.0,11.0,29500.0,2084091.0,2057860000.0,BX29,02B,Hot TCS food item not held at or above 140 °F.
11,41433469,THE CAPITAL GRILLE,Manhattan,120,BROADWAY,10271.0,2123741811,American,2019-10-21T00:00:00.000,Violations were cited in the following area(s).,Critical,11.0,2023-12-01T06:00:08.000,Cycle Inspection / Re-inspection,40.708539,-74.011041,101.0,1.0,700.0,1001026.0,1000478000.0,MN25,02B,Hot food item not held at or above 140º F.


Inspections with the action "Violations were cited in the following area(s)" have a mix of nulls and codes. Because this action implies that a violation was cited, we assume the four nulls are data-entry errors.

We plan to handle null values as follows:

1.  Drop the rows with this action whose 'violation_code' is null.

In [1345]:
inspection_df = inspection_df.drop(inspection_df[(inspection_df['violation_code'].isna()) & (inspection_df['action'].str.startswith("Violations were cited in the following area(s)"))].index)

There are no violation_code nulls left.

In [1346]:
null_counts_by_column = inspection_df.isnull().sum()
null_counts_by_column[null_counts_by_column > 0]

Series([], dtype: int64)

We have successfully addressed all the nulls in the DataFrame.

In [1347]:
pd.DataFrame({
    'Numeric_Zero_Count': (inspection_df == 0).sum(),
    'String_Zero_Count': (inspection_df == '0').sum(),
    'Null_Count': inspection_df.isna().sum()
})

Unnamed: 0,Numeric_Zero_Count,String_Zero_Count,Null_Count
camis,0,0,0
dba,0,0,0
boro,0,0,0
building,0,276,0
street,0,0,0
zipcode,0,0,0
phone,0,0,0
cuisine_description,0,0,0
inspection_date,0,0,0
action,0,0,0


#### Dealing with 0s

The 'building' column contains some '0' string values, but there is not much we can do about them, so we leave them as is.

Regarding the 'score' column, we can infer that a score of 0 indicates no violations.
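The inference about zero scores can be sanity-checked by looking at which actions accompany a score of 0. The sketch below uses made-up rows; on the real data the same expression would be applied to `inspection_df`.

```python
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({
    "score": [0, 0, 13, 0],
    "action": [
        "No violations were recorded at the time of this inspection.",
        "Establishment re-opened by DOHMH.",
        "Violations were cited in the following area(s).",
        "No violations were recorded at the time of this inspection.",
    ],
})

# Distribution of actions among rows with a score of 0
zero_score_actions = toy.loc[toy["score"] == 0, "action"].value_counts()
print(zero_score_actions)
```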

### Dealing with Data Types

In [1348]:
inspection_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 149942 entries, 0 to 207364
Data columns (total 24 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   camis                  149942 non-null  int64  
 1   dba                    149942 non-null  object 
 2   boro                   149942 non-null  object 
 3   building               149942 non-null  object 
 4   street                 149942 non-null  object 
 5   zipcode                149942 non-null  float64
 6   phone                  149942 non-null  object 
 7   cuisine_description    149942 non-null  object 
 8   inspection_date        149942 non-null  object 
 9   action                 149942 non-null  object 
 10  critical_flag          149942 non-null  object 
 11  score                  149942 non-null  float64
 12  record_date            149942 non-null  object 
 13  inspection_type        149942 non-null  object 
 14  latitude               149942 non-null  f

### Building Column
First, let's address the building column.

In [1349]:
inspection_df['building'].str.isalpha().any()

True

Because 'building' contains a mix of letters and numbers, it must remain an object type.
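Note that `str.isalpha()` is only true for values made up entirely of letters. A broader check, sketched below on made-up values, flags any building entry that contains a non-digit character, such as the hyphenated Queens-style numbers seen earlier.

```python
import pandas as pd

# Toy values (hypothetical): plain number, hyphenated number, zero placeholder, letters only
buildings = pd.Series(["469", "136-20", "0", "NKA"])

# True where the value contains at least one non-digit character
has_non_digit = buildings.str.contains(r"\D", regex=True)
print(buildings[has_non_digit].tolist())
```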

### Score Column

To prepare the 'score' column for numerical analysis, we convert it to an integer type.

In [1350]:
inspection_df['score'] = inspection_df['score'].astype(int)

### Float data type columns

The following columns should exclusively contain whole numbers. Currently, they are in float type. To ensure their integrity:

1. I will initially verify if they already consist of whole numbers.
2. Then, I will convert them to integers to confirm the absence of special characters.
3. Finally, I will convert them back to strings, as these columns are categorical features.
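The intermediate integer step matters because converting a float column directly to string keeps the trailing `.0`. A small sketch with made-up ZIP codes shows the difference:

```python
import pandas as pd

zip_float = pd.Series([11225.0, 10003.0])

# Direct conversion keeps the decimal part in the text
direct = zip_float.astype(str).tolist()

# Going through int first yields clean categorical labels
via_int = zip_float.astype(int).astype(str).tolist()

print(direct, via_int)
```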

In [1351]:
columns_to_check = ['zipcode', 'score', 'community_board', 'council_district', 'census_tract', 'bin', 'bbl']

for column in columns_to_check:
    is_integer = (inspection_df[column] % 1 == 0).all()
    print(f"{column} Column: {is_integer}")

zipcode Column: True
score Column: True
community_board Column: True
council_district Column: True
census_tract Column: True
bin Column: True
bbl Column: True


In [1352]:
for column in columns_to_check:
    inspection_df[column] = inspection_df[column].astype(int)
    inspection_df[column] = inspection_df[column].astype(str)

### Phone Column

Let's work on the 'phone' column. We will perform the following steps:

1. Remove all non-numerical characters from the 'phone' column.
2. Replace missing or empty values with '1000000000' to avoid having all zeros.
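The two steps above can be sketched on a few made-up phone strings; the sample values are assumptions chosen to show each case.

```python
import pandas as pd

# Toy values (hypothetical): formatted numbers, the earlier zero placeholder, and an empty string
phones = pd.Series(["718-287-5005", "(212) 390 8042", "0000000000", ""])

# Step 1: strip every non-digit character
digits = phones.str.replace(r"\D", "", regex=True)

# Step 2: replace empty strings and the all-zero placeholder with '1000000000'
cleaned = digits.replace(["", "0000000000"], "1000000000")
print(cleaned.tolist())
```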


In [1353]:
# Use regex to remove all non-digit characters from the 'phone' column
inspection_df['phone'] = inspection_df['phone'].str.replace(r'\D', '', regex=True)

In [1354]:
# Remove spaces and replace empty values with '1000000000' in the 'phone' column
inspection_df['phone'] = inspection_df['phone'].str.strip().replace(['', '0000000000'], '1000000000')

### Inspection Date Column

To standardize the 'inspection_date' column, we will follow these steps:

1. Begin by printing the 'inspection_date' from the first row of the DataFrame to verify the initial format, which is in the format 'YYYY-MM-DDThh:mm:ss.sss'.
2. Next, convert the 'inspection_date' column to datetime format and format it to display only the date in 'YYYY-MM-DD' format.
3. Finally, print the 'inspection_date' from the first row of the DataFrame again to confirm that it has been standardized to 'YYYY-MM-DD'.
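Formatting with `strftime` produces strings, so the column stays an object type. If date arithmetic is needed later, an alternative is to keep a datetime dtype and only drop the time component. The sketch below contrasts the two on made-up values; it is an option, not what this notebook does.

```python
import pandas as pd

raw = pd.Series(["2021-09-12T00:00:00.000", "2022-07-13T00:00:00.000"])

# Approach used in this notebook: text in 'YYYY-MM-DD' form
as_text = pd.to_datetime(raw).dt.strftime("%Y-%m-%d")

# Alternative: keep a datetime dtype with the time set to midnight
as_dates = pd.to_datetime(raw).dt.normalize()

print(as_text.tolist())
print(as_dates.dtype)
```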



In [1355]:
# Print the 'inspection_date' from the first row of the DataFrame
inspection_df.loc[0, 'inspection_date']

'2021-09-12T00:00:00.000'

In [1356]:
# Convert the 'inspection_date' column to datetime and format it to display only the date (YYYY-MM-DD)
inspection_df['inspection_date'] = pd.to_datetime(inspection_df['inspection_date']).dt.strftime('%Y-%m-%d')


In [1357]:
# Print the 'inspection_date' from the first row of the DataFrame
inspection_df.loc[0, 'inspection_date']

'2021-09-12'

The DataFrame 'inspection_df' has been thoroughly checked and cleaned, resulting in the following characteristics:

- No null values exist in any of the columns.
- The data types of the columns are appropriate.

The data is now ready for further analysis and exploration.
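A compact way to re-verify these two properties after any later change is a per-column summary of null counts and dtypes. The helper name `summarize_quality` and the toy frame are hypothetical.

```python
import pandas as pd

def summarize_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Return the null count and dtype of every column."""
    return pd.DataFrame({
        "null_count": df.isna().sum(),
        "dtype": df.dtypes.astype(str),
    })

# Toy frame (hypothetical) standing in for the cleaned DataFrame
toy = pd.DataFrame({
    "camis": [30112340, 50064240],
    "score": [11, 13],
    "phone": ["7182875005", "9175631983"],
})

report = summarize_quality(toy)
print(report)
```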

In [1314]:
inspection_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 149942 entries, 0 to 207364
Data columns (total 24 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   camis                  149942 non-null  int64  
 1   dba                    149942 non-null  object 
 2   boro                   149942 non-null  object 
 3   building               149942 non-null  object 
 4   street                 149942 non-null  object 
 5   zipcode                149942 non-null  object 
 6   phone                  149942 non-null  object 
 7   cuisine_description    149942 non-null  object 
 8   inspection_date        149942 non-null  object 
 9   action                 149942 non-null  object 
 10  critical_flag          149942 non-null  object 
 11  score                  149942 non-null  object 
 12  record_date            149942 non-null  object 
 13  inspection_type        149942 non-null  object 
 14  latitude               149942 non-null  f