# COGS 108 - Final Project 

# Overview

*Fill in your overview here*

# Name & PID

- Name: Jack Giddings
- PID: A12785560

# Research Question

Do lower health inspection scores and/or the frequency of inspections correlate with lower-income cities in North Carolina? If so, does this have an effect on the number of COVID-19 cases in those areas?

## Background and Prior Work

During the COVID-19 pandemic that we are currently experiencing, the health and safety of a community have become its top priority. Food-safety precautions have become increasingly common, with people disinfecting their groceries as soon as they get home. At a time like this, ensuring that health inspections are carried out properly and to the correct standards is critical to public safety and to halting the spread of germs. This prompts us to look for ways to improve health inspections and to ask about the relationship between inspection ratings and the number of COVID cases per neighborhood in LA County. Do worse health inspection ratings correlate with a higher number of cases in a specific neighborhood?
    
In LA County, health inspections are scored on a scale of 1 to 100. A business starts with the full 100 points and loses points according to the severity of each infraction found during the inspection. Depending on its score, it is then issued a grade of A for 90-100, B for 80-89, C for 70-79, or a Fail for anything below 70. The three levels of violations are minor, major, and critical, with any critical violation resulting in an automatic fail regardless of the score [1]. Because the COVID-19 outbreak occurred so recently, I found no prior projects on this exact topic, though a few are similar. One project looks at Yelp ratings for restaurants in neighborhoods around Toronto to see whether lower-income neighborhoods have worse service, and therefore worse ratings [2]. This is similar to my question, as the number of COVID cases is also broken down by neighborhood in Los Angeles, which makes it possible to analyze any correlation. Although that project determined that there is no correlation between Yelp ratings and neighborhood income, this does not rule out a relationship between neighborhood income and health inspection ratings. Neighborhood income might therefore have an unseen effect on health inspections, and thus an unseen effect on the spread of COVID-19.
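
The grading rule described above can be written as a small function. This is only a sketch of the rule as summarized in this section; the function name `inspection_grade` and the `has_critical_violation` flag are hypothetical names chosen for illustration, not part of any official LA County tool.

```python
def inspection_grade(score: float, has_critical_violation: bool = False) -> str:
    """Map an inspection score to the letter grade described in the text.

    Assumption: a critical violation is passed in as a boolean flag; how it
    would be detected in real inspection data is not covered here.
    """
    # Any critical violation fails the inspection regardless of the score.
    if has_critical_violation:
        return "Fail"
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    return "Fail"


print(inspection_grade(93))
print(inspection_grade(85))
print(inspection_grade(95, has_critical_violation=True))
```

Keeping the critical-violation check first makes the "automatic fail" rule explicit, so a high score can never override it.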

The second project I found is more directly related to my question: it used the Yelp health inspection system, called LIVES, to analyze and identify specific regions of San Francisco with high numbers of health violations [3]. That project did not examine neighborhood income, however; instead, it found that more violations came from higher-traffic areas of SF. My question takes this one step further by asking how LA County neighborhoods with more frequent and more severe health inspection infractions affect the spread of COVID-19.

References (include links):
- 1) http://publichealth.lacounty.gov/eh/misc/ehpost.htm
- 2) https://medium.com/swlh/is-there-a-correlation-between-a-restaurants-ratings-and-the-income-levels-of-a-neighborhood-5fe41165e4f1
- 3) https://nycdatascience.com/blog/student-works/san-francisco-restaurant-inspection-analysis-visualization/

# Hypothesis


I believe that lower health inspection scores do in fact correlate with a higher number of COVID-19 cases, which could help explain why lower-income neighborhoods have been affected the most by this pandemic. Because the COVID-19 virus can live on surfaces, it stands to reason that restaurants with worse health inspection ratings are not as clean and might therefore contribute more heavily to the spread of the virus. If this is true, it would explain a higher number of cases in those neighborhoods, which I predict to be lower-income neighborhoods.

# Dataset(s)

- Dataset Name: violations.csv
- Link to the dataset: https://raw.githubusercontent.com/jrgiddin/individual_sp20/master/violations.csv
- Number of observations: 189,803

- Dataset Name: inspections.csv
- Link to the dataset: https://raw.githubusercontent.com/jrgiddin/individual_sp20/master/inspections.csv
- Number of observations: 18,467

- Dataset Name: zipcodes.csv
- Link to the dataset: https://raw.githubusercontent.com/jrgiddin/individual_sp20/master/zipcodes.csv
- Number of observations: 39

The violations.csv dataset lists all health code violations for the state of North Carolina from 2012-2016, while the inspections.csv dataset contains the scores of all health code inspections done from 2012-2016. The zipcodes.csv dataset contains financial information (median household income, etc.) for thirty-nine (39) cities in NC. To get these datasets into one usable dataframe, I will merge the violations.csv and inspections.csv datasets on their 'hsisid' columns.
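
As a rough illustration of the merge on 'hsisid', here is a minimal sketch using toy data. The rows below are made up, and the variable names `inspections`, `violations`, and `merged` are placeholders; only the column names `hsisid`, `score`, `zip`, `shortdesc`, and `pointvalue` come from the real datasets.

```python
import pandas as pd

# Toy stand-ins for inspections.csv and violations.csv (all values are made up).
inspections = pd.DataFrame({
    "hsisid": [1001, 1002, 1003],
    "score": [96.5, 88.0, 91.0],
    "zip": [27610, 27529, 27603],
})
violations = pd.DataFrame({
    "hsisid": [1001, 1001, 1002],
    "shortdesc": ["Hands clean", "Food separated", "Proper cooling"],
    "pointvalue": [2, 1, 1],
})

# A left merge keeps every inspection row, even if no violation was recorded
# for that establishment; those rows get NaN in the violation columns.
merged = pd.merge(inspections, violations, on="hsisid", how="left")
print(merged)
```

In the real data the same 'hsisid' appears on many inspection rows (one per inspection date), so merging on 'hsisid' alone pairs every inspection of a restaurant with every violation of that restaurant. Matching on the inspection date as well may be needed, though that is an assumption I have not verified against the data.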


# Setup

In [1]:
# All imports used in the project
%matplotlib inline

import pandas as pd 
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt 
import random


# Data Cleaning

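For each dataset, I drop any column in which more than 10% of the values are missing (that is, I keep only columns that are at least 90% non-null) and then count the missing values that remain in each column.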
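
The cleaning cells below rely on `dropna` with a `thresh` argument, which keeps a column only if it has at least that many non-null values. Here is a minimal sketch on a toy dataframe; the data and the names `toy`, `min_non_null`, and `cleaned` are made up for illustration.

```python
import pandas as pd

# Toy dataframe with 10 rows (all values are made up).
toy = pd.DataFrame({
    "score": [95, 88, 91, 79, 100, 85, 90, 92, 87, 99],                   # 0 missing
    "phonenumber": ["a", None, "c", "d", "e", "f", "g", "h", "i", "j"],   # 1 missing
    "notes": [None, None, None, "x", None, "y", None, None, None, None],  # 8 missing
})

# Keep a column only if at least 90% of its values are non-null.
min_non_null = int(len(toy) * 0.9)
cleaned = toy.dropna(thresh=min_non_null, axis="columns")

print(cleaned.columns.tolist())
print(cleaned.isna().sum())
```

Note that `dropna` returns a new dataframe rather than modifying the original, so the result has to be assigned back to a variable for the cleaning to take effect.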

In [2]:
# Read in all databases given
inspections_db = pd.read_csv("https://raw.githubusercontent.com/jrgiddin/individual_sp20/master/inspections.csv", sep = ',')
#restaurants_db = pd.read_csv("https://raw.githubusercontent.com/jrgiddin/individual_sp20/master/restaurants.csv", sep = ',')
violations_db = pd.read_csv("https://raw.githubusercontent.com/jrgiddin/individual_sp20/master/violations.csv", sep = ',')
#yelp_db = pd.read_csv("https://raw.githubusercontent.com/jrgiddin/individual_sp20/master/yelp.csv", sep = ',')
zipcode_db = pd.read_csv("https://raw.githubusercontent.com/jrgiddin/individual_sp20/master/zipcodes.csv", sep = ',')

In [3]:
# Clean Data for Inspections
inspections_db.head()
inspections_db = inspections_db.dropna(thresh = len(inspections_db) * 0.9, axis = 'columns')
inspections_db.isna().sum()

hsisid                             0
date                               0
name                               0
address1                           0
city                               0
state                              0
postalcode                         0
phonenumber                      501
restaurantopendate                 0
days_from_open_date                0
facilitytype                       0
x                                  0
y                                  0
geocodestatus                      0
zip                                0
type                               0
inspectedby                        5
inspection_num                     0
inspector_id                       5
score                              0
num_critical                       0
num_non_critical                   0
avg_neighbor_num_critical          6
avg_neighbor_num_non_critical      6
top_match                          6
second_match                       6
critical                           0
dtype: int64

In [14]:
# Clean Data for Restaurants Database
#restaurants_db = restaurants_db.dropna(thresh = len(restaurants_db) * 0.9, axis = 'columns')
#restaurants_db.isna().sum()

X.objectid              0
hsisid                  0
name                    0
address1                0
city                    0
state                   0
postalcode              0
phonenumber           145
restaurantopendate      0
facilitytype            0
x                       0
y                       0
geocodestatus           0
dtype: int64

In [4]:
# Clean Data for Violations Database
violations_db = violations_db.dropna(thresh = len(violations_db) * 0.9, axis = 'columns')
violations_db.isna().sum()

X.objectid            0
hsisid                0
inspectdate           0
category              0
statecode             0
critical           8957
questionno            0
violationcode         0
severity           8957
shortdesc             0
inspectedby          78
comments            448
pointvalue            0
observationtype     279
dtype: int64

In [5]:
# Clean Data for Zipcode Database
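# Assign the result back; dropna returns a new dataframe and does not modify zipcode_db in place
zipcode_db = zipcode_db.dropna(thresh = len(zipcode_db) * 0.9, axis = 'columns')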
zipcode_db.isna().sum()

zip                                     0
median_family_income_dollars            0
median_household_income_dollars         0
per_capita_income_dollars               0
percent_damilies_below_poverty_line     0
percent_snap_benefits                   0
percent_supplemental_security_income    0
percent_nonwhite                        0
dtype: int64

In [21]:
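# Merge Datasets
# NOTE: this cell needs restaurants_db, whose read_csv call in the loading cell is currently commented out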
insp_rest_db = pd.merge(inspections_db, restaurants_db, on = ['hsisid', 'state', 'phonenumber', 'postalcode', 'restaurantopendate'], how = 'inner')
insp_rest_db
#insp_rest_db.isna().sum()

Unnamed: 0,hsisid,date,name_x,address1_x,city_x,state,postalcode,phonenumber,restaurantopendate,days_from_open_date,...,second_match,critical,X.objectid,name_y,address1_y,city_y,facilitytype_y,x_y,y_y,geocodestatus_y
0,4092013748,2012-09-21T00:00:00Z,Cafe 3000 At Wake Med,3000 New Bern Ave,raleigh,NC,27610,(919) 350-8047,2002-12-21T00:00:00Z,3562.0,...,,1.0,,,,,,,,
1,4092013748,2013-02-14T00:00:00Z,Cafe 3000 At Wake Med,3000 New Bern Ave,raleigh,NC,27610,(919) 350-8047,2002-12-21T00:00:00Z,3708.0,...,4.092016e+09,1.0,,,,,,,,
2,4092013748,2013-08-08T00:00:00Z,Cafe 3000 At Wake Med,3000 New Bern Ave,raleigh,NC,27610,(919) 350-8047,2002-12-21T00:00:00Z,3883.0,...,4.092016e+09,1.0,,,,,,,,
3,4092013748,2014-04-03T00:00:00Z,Cafe 3000 At Wake Med,3000 New Bern Ave,raleigh,NC,27610,(919) 350-8047,2002-12-21T00:00:00Z,4121.0,...,4.092016e+09,1.0,,,,,,,,
4,4092013748,2014-10-03T00:00:00Z,Cafe 3000 At Wake Med,3000 New Bern Ave,raleigh,NC,27610,(919) 350-8047,2002-12-21T00:00:00Z,4304.0,...,4.092016e+09,1.0,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21785,4092016658,,,,,NC,27529,(919) 662-1700,2014-04-03T00:00:00.000Z,,...,,,2996.0,LA ROMA PIZZA,1322 FIFTH AVE,GARNER,Restaurant,-78.621859,35.709485,M
21786,4092016663,,,,,NC,27603,(919) 772-4512,2014-04-08T00:00:00.000Z,,...,,,2997.0,BOJANGLES #5,3301 S WILMINGTON ST,RALEIGH,Restaurant,-78.649803,35.735063,M
21787,4092016557,,,,,NC,27587,(919) 556-7773,2013-10-31T00:00:00.000Z,,...,,,2998.0,BURGER KING #19795,22114 S MAIN ST,Wake Forest,Restaurant,0.000000,0.000000,U
21788,4092017227,,,,,NC,27560,(984) 465-0347,2016-05-19T00:00:00.000Z,,...,,,2999.0,QUICKLY,4141 DAVIS DR,MORRISVILLE,Restaurant,-78.858116,35.835626,M


# Data Analysis & Results

Include cells that describe the steps in your data analysis.

In [5]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION
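
One possible way to approach the research question is to summarize inspections per zip code and compare the summaries against income. The sketch below uses toy data only; the values are made up, the names `per_zip` and `corr` are placeholders, and only the column names `zip`, `score`, and `median_household_income_dollars` come from the real datasets. It is not the analysis for this project, just an outline of the steps under those assumptions.

```python
import pandas as pd

# Toy stand-ins for the cleaned inspections and zipcode dataframes (made-up values).
inspections = pd.DataFrame({
    "zip": [27610, 27610, 27529, 27529, 27603, 27603, 27603],
    "score": [90.0, 92.0, 95.0, 97.0, 93.0, 94.0, 95.0],
})
zipcodes = pd.DataFrame({
    "zip": [27610, 27529, 27603],
    "median_household_income_dollars": [40000, 60000, 50000],
})

# Step 1: mean score and number of inspections per zip code.
# Step 2: attach the income figures for each zip code.
per_zip = (
    inspections.groupby("zip")
    .agg(mean_score=("score", "mean"), num_inspections=("score", "size"))
    .reset_index()
    .merge(zipcodes, on="zip", how="inner")
)

# Step 3: pairwise Pearson correlations between the per-zip summaries and income.
corr = per_zip[
    ["mean_score", "num_inspections", "median_household_income_dollars"]
].corr(method="pearson")
print(corr)
```

With only 39 zip codes in zipcodes.csv, any correlation computed this way would rest on a small sample and should be read with caution.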

# Ethics & Privacy

*Fill in your ethics & privacy discussion here*

# Conclusion & Discussion

*Fill in your discussion information here*