## OVERVIEW

This project aims to show that health inspections can be reorganized to make them more efficient. The first point considered is the number of health inspections each inspector carries out per day. Ideally, an inspector should carry out one inspection per day, or two at most, if their work is to be thorough. The second point considered is whether non-chain restaurants should be inspected more frequently than chain restaurants. We will show that the answer is yes, since chain restaurants tend to have far fewer health violations.

## Name and PID

Name: Samuel Dossou 

PID: A11502037

## Research Question

How can the frequency and order of the health inspections that an inspector conducts each day be adjusted to not only improve the inspection process but also prevent further food safety violations?


## Background and Prior work

Foodborne illnesses are a serious public health concern, as they cause millions of illnesses, hundreds of thousands of hospitalizations, and thousands of deaths each year [1]. That is precisely why restaurant health inspections are crucial to preventing their spread. A health inspection done right can stop restaurants with multiple food violations from serving unsafe food. Conversely, an inspection done improperly can allow foodborne illnesses to spread.

First, it is important to note that restaurant health inspections are subject to certain unconscious biases. For instance, one study has shown that health inspectors tend to cite fewer violations at each successive establishment they visit during the day [2]. Another study shows that more frequent health inspections do not necessarily decrease the number of violations in chain restaurants but do decrease the number of violations in nonchain restaurants [1].

References:

1) https://journals.sagepub.com/doi/full/10.1177/0033354916687741

2) https://hbr.org/2019/05/to-improve-food-inspections-change-the-way-theyre-scheduled

## Hypothesis 

I expect that if we limit each health inspector to 2 inspection sites per day, schedule nonchain restaurants as the first inspection of any given day, and inspect nonchain restaurants more frequently than chain restaurants, then we would see less foodborne illness.

## Datasets

Variables: restaurant ID (HSISID), number of inspections per restaurant in a given year, number of violations per inspection, score given to the restaurant after inspection, number of inspections per day per inspector, whether a restaurant is a chain or nonchain restaurant, zip code, and the median family income in that zip code.

HSISID, number of violations, and whether the restaurant is a chain or nonchain restaurant would normally be collected from the websites where health inspection results are published (they are usually public). In this case, however, the data files are provided to us. restaurants.csv gives us the facilitytype and the restaurant ID (HSISID). inspections.csv gives us the inspection score, zip code, date, and the name of the inspector. zipcodes.csv gives us the median family income associated with every zip code. Finally, violations.csv gives us information on the number of violations per inspection.

This data will be collected within a single state, in this case North Carolina. For the purpose of this analysis one state is enough, but the data should not come from only one city in that state; data from multiple cities should be collected.

The data should also be collected over a period of 4 years. Since a given restaurant is not inspected often within a year, we need to accumulate 4 years' worth of data to have enough for a proper analysis.

Once data collection is complete, putting the data in a usable form is straightforward: import the relevant CSV files, take the relevant columns from each dataframe, and merge them into a single dataframe with 8 features.
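The combination step described above could be sketched roughly as follows. This is a minimal illustration on made-up rows, assuming left merges keyed on `hsisid` and `zip`; the `n_critical` column and the toy values are assumptions for the example, not part of the provided files.

```python
import pandas as pd

# Toy stand-ins for the four CSV files (all values are made up).
inspections = pd.DataFrame({
    "hsisid": [1, 1, 2],
    "score": [96.0, 98.0, 93.5],
    "zip": [27610, 27610, 27513],
    "date": ["2012-09-21", "2013-03-02", "2012-09-24"],
    "inspectedby": ["Inspector A", "Inspector B", "Inspector A"],
})
restaurants = pd.DataFrame({"hsisid": [1, 2],
                            "facilitytype": ["Restaurant", "Food Stand"]})
zipcodes = pd.DataFrame({"zip": [27610],
                         "median_family_income_dollars": [49213]})
violations = pd.DataFrame({"hsisid": [1, 1, 2],
                           "critical": ["Yes", "No", "Yes"]})

# Collapse violations to one row per restaurant first, so that the
# merge cannot multiply inspection rows.
critical_counts = (
    violations.assign(is_critical=violations["critical"].eq("Yes"))
    .groupby("hsisid", as_index=False)["is_critical"].sum()
    .rename(columns={"is_critical": "n_critical"})
)

# Left merges keep every inspection row; keys with no match become NaN.
merged = (
    inspections
    .merge(restaurants, on="hsisid", how="left")
    .merge(zipcodes, on="zip", how="left")
    .merge(critical_counts, on="hsisid", how="left")
)
print(merged)
```

A left merge keeps one row per inspection, and a zip code that is absent from the zip code table simply yields a missing income value instead of an error.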




## Setup

In [92]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Data Cleaning/Processing 

In [93]:
#loading the CSV files
df_inspections = pd.read_csv('Documents/data/inspections.csv')
df_restaurants = pd.read_csv('Documents/data/restaurants.csv')
df_violations = pd.read_csv('Documents/data/violations.csv')
df_zipcodes = pd.read_csv('Documents/data/zipcodes.csv')

In [94]:
#taking the relevant columns from the dataframes
df_restaurants = df_restaurants[['hsisid','facilitytype']]
df_inspections = df_inspections[['hsisid','score', 'zip', 'date', 'inspectedby']]
df_zipcodes = df_zipcodes[['zip', 'median_family_income_dollars']]
df_violations = df_violations[['hsisid', 'critical']]

#### merging the data frames to get one clean data frame

This would normally be a full merge. However, if we do a full merge, then the facilitytype, median family income, and critical columns won't line up with the appropriate hsisid. Thus, instead of doing a full merge right away, I will create 3 dictionaries.

These dictionaries store which facilitytype and critical value correspond to each HSISID, and which median family income corresponds to each zip code. Once the dictionaries are created, I can create new columns and assign the appropriate value to each row by looking up its HSISID or zip code.



In [95]:
#creating the lookup dictionaries
median_family_income_dollars_dict = pd.Series(df_zipcodes.median_family_income_dollars.values,index=df_zipcodes.zip).to_dict()
#note: if an hsisid appears more than once in df_violations, to_dict() keeps only its last 'critical' value
critical_dict = pd.Series(df_violations.critical.values,index=df_violations.hsisid).to_dict()
facility_type_dict = pd.Series(df_restaurants.facilitytype.values,index=df_restaurants.hsisid).to_dict()

#working on a copy so that df_inspections itself is left unchanged
df = df_inspections.copy()

#filling the new columns with the information
#map() looks up each key in the dictionary and leaves NaN where the key is missing
#(direct lookups such as median_family_income_dollars_dict[zipp] raised KeyError: 27512,
#because zip code 27512 is not a key in the zip code dictionary)
df['median_family_income_dollars'] = df['zip'].map(median_family_income_dollars_dict)
df['critical'] = df['hsisid'].map(critical_dict)
df['facilitytype'] = df['hsisid'].map(facility_type_dict)

In [96]:
df
        

| | hsisid | score | zip | date | inspectedby | median_family_income_dollars | critical | facilitytype |
|---|---|---|---|---|---|---|---|---|
| 0 | 4092013748 | 96.0 | 27610 | 2012-09-21T00:00:00Z | Melissa Harrison | 49213 | Yes | Restaurant |
| 1 | 4092014046 | 98.0 | 27610 | 2012-09-21T00:00:00Z | Christopher Walker | 49213 | No | Restaurant |
| 2 | 4092015191 | 97.0 | 27610 | 2012-09-21T00:00:00Z | Anne Bartoli | 49213 | Yes | Restaurant |
| 3 | 4092016122 | 99.0 | 27513 | 2012-09-21T00:00:00Z | Lisa McCoy | 109736 | Yes | Restaurant |
| 4 | 4092021513 | 97.0 | 27597 | 2012-09-21T00:00:00Z | Christopher Walker | 59395 | No | Food Stand |
| 5 | 4092110151 | 99.0 | 27587 | 2012-09-21T00:00:00Z | Naterra McQueen | 96247 | No | Public School Lunchrooms |
| 6 | 4092013134 | 96.0 | 27606 | 2012-09-24T00:00:00Z | Jennifer Edwards | 61324 | No | Restaurant |
| 7 | 4092013281 | 95.5 | 27529 | 2012-09-24T00:00:00Z | Anne Bartoli | 68627 | Yes | Restaurant |
| 8 | 4092110005 | 99.0 | 27511 | 2012-09-24T00:00:00Z | David Adcock | 82292 | No | Public School Lunchrooms |
| 9 | 4092160070 | 93.5 | 27513 | 2012-09-24T00:00:00Z | Andrea Anover | 109736 | No | Institutional Food Service |


## Data analysis & Results

For the data analysis, I included the median family income of each zip code to see whether that feature is a confounder. One of the avenues I am supposed to investigate is whether non-chain restaurants typically have more violations than chain restaurants. Median family income, which indicates whether a restaurant is in a wealthy area, might therefore affect the result of the research.
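One hedged way to probe this confounding concern is to split the zip codes into income bands and compare scores within each band. The sketch below uses made-up rows and a hypothetical `chain_status` column, since the cleaned dataframe in this notebook has no chain/non-chain flag; the median split is also just an assumption for illustration.

```python
import pandas as pd

# Made-up rows; chain_status is a hypothetical column for illustration.
toy = pd.DataFrame({
    "chain_status": ["chain", "non-chain", "chain", "non-chain"],
    "median_family_income_dollars": [49213, 59395, 109736, 96247],
    "score": [97.0, 95.0, 99.0, 98.0],
})

# Split the rows into two income bands at the median income.
cutoff = toy["median_family_income_dollars"].median()
toy["income_band"] = (
    (toy["median_family_income_dollars"] > cutoff)
    .map({True: "higher", False: "lower"})
)

# Mean score per income band and chain status, one column per status.
score_by_band = (
    toy.groupby(["income_band", "chain_status"])["score"]
    .mean()
    .unstack("chain_status")
)
print(score_by_band)
```

If a gap between chain and non-chain restaurants persists inside each income band, income alone is less likely to explain it.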

In the cleaned data, the date column together with the inspectedby column tells us how many inspections an inspector carries out on a given day.
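As a rough sketch of that count, the example below groups made-up rows by inspector and calendar day. The column names `date` and `inspectedby` match the cleaned dataframe; the `MAX_PER_DAY` constant is an assumption taken from the two-per-day limit in the hypothesis.

```python
import pandas as pd

MAX_PER_DAY = 2  # assumed limit, taken from the hypothesis

# Made-up rows in the same date format as inspections.csv.
toy = pd.DataFrame({
    "date": ["2012-09-21T00:00:00Z"] * 4 + ["2012-09-24T00:00:00Z"],
    "inspectedby": ["Inspector A", "Inspector A", "Inspector A",
                    "Inspector B", "Inspector A"],
})

# Reduce each timestamp to its calendar day, then count rows
# per (inspector, day) pair.
toy["day"] = pd.to_datetime(toy["date"]).dt.date
per_day = (
    toy.groupby(["inspectedby", "day"]).size()
    .reset_index(name="inspections_that_day")
)

# Inspector-days that exceed the assumed limit.
over_limit = per_day[per_day["inspections_that_day"] > MAX_PER_DAY]
print(per_day)
print(over_limit)
```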


## Ethics and Privacy 

For this section we will assume that our hypothesis is true, since I have not done any data analysis and have no results yet.

First and foremost, it is essential to protect the identity of the health inspectors. Their names should therefore be replaced with numerical IDs to ensure anonymity. Consent should also be requested from the appropriate counties in order to use their observations, and that consent can be withdrawn at any time. In addition, no restaurant should be identifiable. If the full analysis and results are published, they could lead people to assume that certain restaurants are poorly maintained and have poor hygiene.
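Replacing names with numerical IDs could look roughly like the sketch below, which uses `pd.factorize` on made-up rows. The `inspector_id` column name is an assumption. Note that `factorize` assigns IDs in order of first appearance, so the mapping from names to IDs would need to be kept private.

```python
import pandas as pd

# Made-up rows standing in for the cleaned dataframe.
toy = pd.DataFrame({
    "inspectedby": ["Inspector A", "Inspector B", "Inspector A"],
    "score": [96.0, 98.0, 97.0],
})

# factorize returns one integer code per row plus the unique names.
codes, names = pd.factorize(toy["inspectedby"])
toy["inspector_id"] = codes

# Drop the name column so only the numeric ID remains.
toy_anon = toy.drop(columns=["inspectedby"])
print(toy_anon)
```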

Moreover, if the analysis and results support the hypothesis, then people could think that non-chain restaurants generally have poor hygiene.

Also, in this case study we assume that every health inspector has similar capabilities. However, that may not be the case. Results could differ with an inspector's experience, gender, and perhaps zip code. Failing to include these factors may therefore lead to a different conclusion.


## Conclusion & Discussion

If the hypothesis is right, then the conclusion would be that non-chain restaurants should be inspected more often than chain restaurants. Furthermore, the number of inspections per inspector per day should be kept to a minimum to ensure thorough work.

In hindsight, I should have started this project much earlier, or simply done it as a group project.
