# COGS 108 - Final Project 

# Overview

*Fill in your overview here*

# Name & PID

- Name: Sophie Truong
- PID: A15279448

# Research Question

Do health inspection ratings differ based on the socioeconomic areas chain restaurants are located in?

## Background and Prior Work

The references I found examine restaurant violations and how health inspections vary with the socioeconomic characteristics of a restaurant's location. A common finding is that the number of violations correlates positively with a higher concentration of minority residents, and that health inspections are conducted more frequently in those areas.

A 2-year study explored the outcomes of health inspections in relation to restaurant inspection frequency and neighborhood sociodemographics, and found that chain restaurants had significantly fewer violations per inspection (1). Chain restaurants serve less perishable food, so foodborne illnesses were less likely. In addition, the total number of violations per inspection was not associated with any block group sociodemographic characteristics. However, the study also noted that “the number of foodborne-illness risk factor violations per inspection was significantly associated with the proportion of black residents, whereas the number of good retail practice violations per inspection was significantly positively associated with proportion of black residents and hispanics.” The takeaway of this study was that, whether a restaurant is chain or non-chain, health inspections should focus more on restaurants in areas that appear to be at higher risk, in order to reduce the frequency of violations.

Another study focused on tracking critical health violations (CHV) in communities with different socioeconomic statuses and demographics. Using geographic information systems (GIS), the authors were able to tie health indicators to specific geographic locations in Philadelphia. “Overall, food service facilities in higher poverty areas had a greater number of [food service facilities] (with at least one CHV) and had more frequent inspections than facilities in lower poverty areas. The facilities in lower poverty areas, however, had a higher average number of CHV per inspection” (2). In areas with high concentrations of minority populations, Hispanic facilities had more CHV than other demographics, and Hispanic and African American facilities had fewer days between inspections. This study shows how subjective health inspections can be and suggests that other factors might be affecting inspection frequency and the identification of CHV.

References (include links):
- 1) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3323064/
- 2) https://www.ncbi.nlm.nih.gov/pubmed/21902922/

# Hypothesis


My hypothesis is that chain restaurants in regions associated with lower socioeconomic status tend to receive lower health inspection scores. From the research I have done, I have noticed a trend of more health violations in areas with more lower-income residents, and that whether a restaurant is chain or non-chain does not have much influence on the outcome.

# Dataset(s)

- Dataset Name: df_restaurants
- Link to the dataset: restaurants.csv
- Number of observations: 18466 

Dataset of restaurants' names and locations. This will be used to match their zip codes with df_violations and df_zipcodes to determine their socioeconomic rating.

- Dataset Name: df_violations
- Link to the dataset: violations.csv
- Number of observations: 189802 

Dataset of violations from restaurants, each with a short description and labels. This will be joined with df_restaurants and df_zipcodes to determine their socioeconomic rating.

- Dataset Name: df_zipcodes
- Link to the dataset: zipcodes.csv
- Number of observations: 38 

Dataset of zip codes and their associated income and poverty rate. This will be used to match zip codes with df_restaurants and df_violations to determine their socioeconomic rating.


# Setup

In [11]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
sns.set()
sns.set_context('talk')

import warnings
warnings.filterwarnings('ignore')

import patsy
import statsmodels.api as sm
import scipy.stats as stats
from scipy.stats import ttest_ind, chisquare, normaltest

In [12]:
# Reading in CSVs
df_inspections = pd.read_csv('inspections.csv')
df_restaurants = pd.read_csv('restaurants.csv')
df_violations = pd.read_csv('violations.csv')
df_yelp = pd.read_csv('yelp.csv')
df_zipcodes = pd.read_csv('zipcodes.csv')

In [13]:
# inspections
df_inspections

Unnamed: 0,hsisid,date,name,address1,address2,city,state,postalcode,phonenumber,restaurantopendate,...,num_non_critical,num_critical_previous,num_non_critical_previous,num_critical_mean_previous,num_non_critical_mean_previous,avg_neighbor_num_critical,avg_neighbor_num_non_critical,top_match,second_match,critical
0,4092013748,2012-09-21T00:00:00Z,Cafe 3000 At Wake Med,3000 New Bern Ave,,raleigh,NC,27610,(919) 350-8047,2002-12-21T00:00:00Z,...,7,,,,,,,,,1
1,4092014046,2012-09-21T00:00:00Z,Overtime Sports Pub,1030-149 N Rogers Ln,,raleigh,NC,27610,(919) 255-9556,2004-05-04T00:00:00Z,...,11,,,,,,,,,0
2,4092015191,2012-09-21T00:00:00Z,TASTE OF CHINA,6209 ROCK QUARRY RD,STE 126,raleigh,NC,27610,(919) 773-2285,2008-08-04T00:00:00Z,...,8,,,,,,,,,1
3,4092016122,2012-09-21T00:00:00Z,Panera Bread #1643,1065 Darrington DR,,cary,NC,27513,,2012-03-28T00:00:00Z,...,3,,,,,,,,,1
4,4092021513,2012-09-21T00:00:00Z,WalMart Supercenter #4499-00 Deli/Bakery,841 E Gannon AVE,,zebulon,NC,27597,(919) 269-2221 ext. 304,2008-02-25T00:00:00Z,...,4,,,,,,,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18461,4092021142,2016-11-03T00:00:00Z,Sino Wok,5959-1108 Triangle Town Blv,,raleigh,NC,27616,(919) 792-2499,2002-08-19T00:00:00Z,...,13,1.0,4.0,1.000000,4.777778,4.692525,9.863838,4.092022e+09,4.092017e+09,1
18462,4092110100,2016-11-03T00:00:00Z,Weatherstone Elem. Sch. Cafeteria,1000 Olde Weatherstone Way,,cary,NC,27513,(919) 380-6985,1995-09-05T00:00:00Z,...,3,2.0,5.0,0.750000,3.250000,5.853333,5.972381,4.092013e+09,4.092016e+09,1
18463,4092110487,2016-11-03T00:00:00Z,ALSTON RIDGE ELEMENTARY SCHOOL CAFETERIA,11555 GREEN LEVEL CHURCH RD,,cary,NC,27519,,2009-11-13T00:00:00Z,...,4,0.0,2.0,0.875000,1.750000,2.466667,3.000000,4.092017e+09,4.092015e+09,1
18464,4092300177,2016-11-03T00:00:00Z,Food Lion #996 Meat Market,7971 FAYETTEVILLE RD,,raleigh,NC,27603-5631,(919) 772-0317,2000-07-01T00:00:00Z,...,2,0.0,3.0,1.111111,2.000000,3.737302,4.606349,4.092016e+09,4.092015e+09,1


In [14]:
# restaurants
df_restaurants

Unnamed: 0,X.objectid,hsisid,name,address1,address2,city,state,postalcode,phonenumber,restaurantopendate,facilitytype,x,y,geocodestatus
0,1001,4092017230,SPRING CAFE 2,2900-104 SPRING FOREST RD,,RALEIGH,NC,27616-1895,(919) 977-3679,2016-05-26T00:00:00.000Z,Restaurant,-78.591634,35.855487,M
1,1002,4092040338,CAROLINA CLASSIC HOT DOGS #2 (WCID #549),309 HOLLOMAN ST,,APEX,NC,27502,,2016-07-01T00:00:00.000Z,Pushcarts,-78.855348,35.730219,M
2,1003,4092014444,Taco Bell #22798,2207 S MAIN ST,,WAKE FOREST,NC,27587,(919) 554-4924,2005-12-05T00:00:00.000Z,Restaurant,-78.536145,35.946695,M
3,1004,4092015333,THE REMEDY DINER,137 E HARGETT ST,,RALEIGH,NC,27601,(919) 835-3553,2009-02-04T00:00:00.000Z,Restaurant,-78.636895,35.777999,M
4,1005,4092160069,ZEBULON HOUSE (KITCHEN),551 PONY RD,,ZEBULON,NC,27597,,2009-02-18T00:00:00.000Z,Institutional Food Service,-78.332138,35.816779,M
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3319,2996,4092016658,LA ROMA PIZZA,1322 FIFTH AVE,,GARNER,NC,27529,(919) 662-1700,2014-04-03T00:00:00.000Z,Restaurant,-78.621859,35.709485,M
3320,2997,4092016663,BOJANGLES #5,3301 S WILMINGTON ST,,RALEIGH,NC,27603,(919) 772-4512,2014-04-08T00:00:00.000Z,Restaurant,-78.649803,35.735063,M
3321,2998,4092016557,BURGER KING #19795,22114 S MAIN ST,,Wake Forest,NC,27587,(919) 556-7773,2013-10-31T00:00:00.000Z,Restaurant,0.000000,0.000000,U
3322,2999,4092017227,QUICKLY,4141 DAVIS DR,,MORRISVILLE,NC,27560,(984) 465-0347,2016-05-19T00:00:00.000Z,Restaurant,-78.858116,35.835626,M


In [15]:
# violations
df_violations

Unnamed: 0,X.objectid,hsisid,inspectdate,category,statecode,critical,questionno,violationcode,severity,shortdesc,inspectedby,comments,pointvalue,observationtype,violationtype,count,cdcriskfactor,cdcdataitem
0,2149,4092015279,2014-09-22T00:00:00.000Z,Chemical,".2653,.2657",Yes,26,7-102.11,Priority Foundation,"Toxic substances properly identified, stored, ...",Lucy Schrum,7-102.11; Priority Foundation - Found unlabele...,0,Out,R,,,
1,2150,4092014572,2014-09-29T00:00:00.000Z,Chemical,".2653,.2657",Yes,26,7-102.11,Priority Foundation,"Toxic substances properly identified, stored, ...",Daryl Beasley,7-102.11; Priority Foundation; One sanitizer b...,0,Out,CDI,,,
2,2151,4092015906,2014-10-01T00:00:00.000Z,Chemical,".2653,.2657",Yes,26,7-102.11,Priority Foundation,"Toxic substances properly identified, stored, ...",Lucy Schrum,7-102.11; Priority Foundation - Found an unlab...,1,Out,CDI,,,
3,2152,4092013840,2014-10-08T00:00:00.000Z,Chemical,".2653,.2657",Yes,26,7-102.11,Priority Foundation,"Toxic substances properly identified, stored, ...",Lucy Schrum,7-102.11; Priority Foundation - Found unlabele...,0,Out,CDI,,,
4,2153,4092021788,2014-10-09T00:00:00.000Z,Chemical,".2653,.2657",Yes,26,7-102.11,Priority Foundation,"Toxic substances properly identified, stored, ...",Lucy Schrum,7-102.11; Priority Foundation - Found one unla...,0,Out,CDI,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
189797,2144,4092015549,2014-09-10T00:00:00.000Z,Chemical,".2653,.2657",Yes,26,7-102.11,Priority Foundation,"Toxic substances properly identified, stored, ...",Lucy Schrum,7-102.11; Priority Foundation - Found a few un...,0,Out,CDI,,,
189798,2145,4092016135,2014-09-11T00:00:00.000Z,Chemical,".2653,.2657",Yes,26,7-102.11,Priority Foundation,"Toxic substances properly identified, stored, ...",James Smith,7-102.11; Priority Foundation; Spray bottle o...,1,Out,R,,,
189799,2146,4092020997,2014-09-12T00:00:00.000Z,Chemical,".2653,.2657",Yes,26,7-102.11,Priority Foundation,"Toxic substances properly identified, stored, ...",Andrea Anover,7-102.11; Priority Foundation; 1 bottle of oil...,0,Out,CDI,,,
189800,2147,4092021798,2014-09-19T00:00:00.000Z,Chemical,".2653,.2657",Yes,26,7-102.11,Priority Foundation,"Toxic substances properly identified, stored, ...",Thomas Jumalon,7-102.11; Priority Foundation; REQUIRES CHEMC...,1,Out,,,,


In [16]:
# zipcodes
df_zipcodes

Unnamed: 0,zip,median_family_income_dollars,median_household_income_dollars,per_capita_income_dollars,percent_damilies_below_poverty_line,percent_snap_benefits,percent_supplemental_security_income,percent_nonwhite
0,27501,59408,51121,21631,10.5,15.5,5.2,17.9
1,27502,109891,95857,36763,3.4,2.4,0.8,18.9
2,27511,82292,67392,33139,9.6,4.5,2.2,24.8
3,27513,109736,87262,41232,3.8,2.4,1.5,27.8
4,27518,125432,98247,49865,5.5,1.0,1.7,19.9
5,27519,137193,121815,45778,3.2,2.3,2.3,35.2
6,27520,67939,58455,25628,5.0,7.8,2.4,23.5
7,27522,66250,59221,25513,6.0,7.0,4.2,31.4
8,27523,89184,68342,36976,3.1,3.1,1.8,22.4
9,27526,74666,66025,28074,8.4,8.3,3.9,22.5


# Data Cleaning

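The cleaning steps below rename the zip column in df_zipcodes to postalcode and cast it to a string so that it can be used as a merge key, join df_restaurants with df_violations, join the result with df_zipcodes on postal code, and then split the combined table into two groups around the midpoint of the lowest and highest median household income, keeping only the columns needed for the analysis.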

In [17]:
# rename 'zip' to 'postalcode' and cast it to str so it can be used as a merge key later
df_zipcodes.rename(columns={'zip': 'postalcode'}, inplace=True)
df_zipcodes['postalcode'] = df_zipcodes['postalcode'].astype(str)
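
Some postal codes in the restaurant data are stored in ZIP+4 form (for example 27616-1895), while df_zipcodes holds 5-digit codes, so an exact string match would miss those rows. A minimal sketch of one way to normalize them before merging is shown here. The toy frames, the zip5 column name, and the income values for 27616 and 27603 are assumptions made for illustration and are not taken from the project data.

```python
import pandas as pd

# Toy stand-ins for the real tables (income values are illustrative only)
restaurants = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'postalcode': ['27616-1895', '27502', '27603-5631'],
})
zipcodes = pd.DataFrame({
    'postalcode': ['27616', '27502', '27603'],
    'median_household_income_dollars': [50000, 95857, 60000],
})

# Keep only the leading five digits so ZIP+4 values match the 5-digit keys
restaurants['zip5'] = (
    restaurants['postalcode'].astype(str).str.extract(r'^(\d{5})', expand=False)
)

# Left merge keeps every restaurant, so unmatched zip codes show up as NaN
merged = restaurants.merge(
    zipcodes, left_on='zip5', right_on='postalcode',
    how='left', suffixes=('', '_zip'),
)
print(merged[['name', 'zip5', 'median_household_income_dollars']])
```

Using a left merge here makes it easy to count how many restaurants have no matching zip code row, which an inner merge would silently drop.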

In [18]:
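# matching restaurants and violations on X.objectid
# (note: both tables also carry hsisid, which appears to be the shared facility ID;
# that is why the merged output below has hsisid_x and hsisid_y columns)
df_restViol = pd.merge(df_restaurants, df_violations, on='X.objectid')
df_restViol.head()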

Unnamed: 0,X.objectid,hsisid_x,name,address1,address2,city,state,postalcode,phonenumber,restaurantopendate,...,severity,shortdesc,inspectedby,comments,pointvalue,observationtype,violationtype,count,cdcriskfactor,cdcdataitem
0,1001,4092017230,SPRING CAFE 2,2900-104 SPRING FOREST RD,,RALEIGH,NC,27616-1895,(919) 977-3679,2016-05-26T00:00:00.000Z,...,Priority Foundation,"Required records available: shellstock tags, p...",Karla Crowder,"3-402.12;(C) If raw, raw-marinated, partially ...",1,Out,VR,,Food from Unsafe Source,Records
1,1002,4092040338,CAROLINA CLASSIC HOT DOGS #2 (WCID #549),309 HOLLOMAN ST,,APEX,NC,27502,,2016-07-01T00:00:00.000Z,...,Priority Foundation,"Required records available: shellstock tags, p...",Lucy Schrum,Pf - 3-402.12 - Observed improper parasite des...,0,Out,CDI,,Food from Unsafe Source,Records
2,1003,4092014444,Taco Bell #22798,2207 S MAIN ST,,WAKE FOREST,NC,27587,(919) 554-4924,2005-12-05T00:00:00.000Z,...,Priority Foundation,"Required records available: shellstock tags, p...",Lucy Schrum,Pf - 3-402.12 - Observed improper parasite des...,0,Out,R,,Food from Unsafe Source,Records
3,1004,4092015333,THE REMEDY DINER,137 E HARGETT ST,,RALEIGH,NC,27601,(919) 835-3553,2009-02-04T00:00:00.000Z,...,Priority Foundation,"Required records available: shellstock tags, p...",Karla Crowder,3-402.12; Priority Foundation; A written agree...,1,Out,CDI,,Food from Unsafe Source,Records
4,1005,4092160069,ZEBULON HOUSE (KITCHEN),551 PONY RD,,ZEBULON,NC,27597,,2009-02-18T00:00:00.000Z,...,Priority Foundation,"Required records available: shellstock tags, p...",Chris Askew,3-402.12;No records stating that aquacultered ...,0,Out,CDI,,Food from Unsafe Source,Records
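
The merged table above carries an hsisid_x column, which means the two tables each have their own hsisid and the values were not used as the join key. If hsisid is the facility identifier shared by both tables (an assumption based on the column appearing in every table), a merge on that column would attach each violation to the restaurant it belongs to. The sketch below uses toy rows: the restaurant rows come from the display above, while the violation rows and their hsisid assignments are made up.

```python
import pandas as pd

restaurants = pd.DataFrame({
    'X.objectid': [1001, 1003],
    'hsisid': [4092017230, 4092014444],
    'name': ['SPRING CAFE 2', 'Taco Bell #22798'],
})
# Hypothetical violation rows for illustration only
violations = pd.DataFrame({
    'X.objectid': [2149, 2150, 2151],
    'hsisid': [4092014444, 4092014444, 4092017230],
    'pointvalue': [0, 1, 1],
})

# Drop the per-table row ID from violations so only hsisid is shared,
# and let pandas verify that each hsisid appears once on the restaurant side.
rest_viol = restaurants.merge(
    violations.drop(columns='X.objectid'),
    on='hsisid',
    how='inner',
    validate='one_to_many',
)
print(rest_viol)
```

The validate argument raises an error if a restaurant ID is duplicated in the left table, which guards against accidentally multiplying violation rows.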


In [19]:
# combining df_zipcodes, df_restaurants, and df_violations based on their postal code
df_zipRestViol = pd.merge(df_zipcodes, df_restViol, on='postalcode')
lowest_income = df_zipRestViol['median_household_income_dollars'].min()
highest_income = df_zipRestViol['median_household_income_dollars'].max()
# midpoint between the lowest and highest values (not the mean), used as the cutoff below
average_income = (lowest_income + highest_income) / 2

# getting numbers for income
print('lowest median household income: ', lowest_income)
print('highest median household income: ', highest_income)
print('midpoint of lowest and highest: ', average_income)

lowest median household income:  27564
highest median household income:  121815
midpoint of lowest and highest:  74689.5
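
The cutoff above is computed as (lowest + highest) / 2, which is the midpoint of the range rather than a mean, so a single extreme zip code can shift it. As a rough illustration of how the choice of cutoff changes the split, the sketch below compares the midpoint with the mean and the median, using only the ten rows of df_zipcodes displayed earlier (not the full table, so the numbers differ from the output above).

```python
import pandas as pd

# Median household income for the ten zip codes displayed in df_zipcodes
income = pd.Series(
    [51121, 95857, 67392, 87262, 98247, 121815, 58455, 59221, 68342, 66025],
    index=['27501', '27502', '27511', '27513', '27518',
           '27519', '27520', '27522', '27523', '27526'],
    name='median_household_income_dollars',
)

midpoint = (income.min() + income.max()) / 2  # same formula as the cell above
mean = income.mean()
median = income.median()

print('midpoint:', midpoint, 'mean:', mean, 'median:', median)
# Number of zip codes that would land in the lower group under each cutoff
print('below midpoint:', (income < midpoint).sum())
print('below median:', (income < median).sum())
```

Splitting at the median gives two groups of equal size by construction, which may be preferable when the groups are later compared statistically.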


In [24]:
# dataframe for areas where the median household income is below the threshold (average_income)
df_belowAvgIncome = df_zipRestViol.loc[df_zipRestViol['median_household_income_dollars'] < average_income]

# keeping only the necessary columns
df_cleanBelowAvgIncome = df_belowAvgIncome[['postalcode', 'median_household_income_dollars',
                                             'percent_damilies_below_poverty_line', 'X.objectid',
                                             'name', 'facilitytype', 'category', 'critical',
                                             'questionno', 'violationcode', 'severity',
                                             'shortdesc', 'comments', 'pointvalue', 'violationtype']]
# fixing the misspelled column title; rename returns a new DataFrame,
# which avoids modifying a slice in place
df_cleanBelowAvgIncome = df_cleanBelowAvgIncome.rename(
    columns={'percent_damilies_below_poverty_line': 'percent_below_poverty_line'})
df_cleanBelowAvgIncome.head()


Unnamed: 0,postalcode,median_household_income_dollars,percent_below_poverty_line,X.objectid,name,facilitytype,category,critical,questionno,violationcode,severity,shortdesc,comments,pointvalue,violationtype
0,27501,51121,10.5,1530,FILIPINO CUISINE,Restaurant,Chemical,Yes,26,7-102.11,Priority Foundation,"Toxic substances properly identified, stored, ...",Pf - 7-102.11 - Found two unlabeled bottles of...,0,R
116,27511,67392,9.6,1037,CHINA CARY,Restaurant,Chemical,Yes,26,7-102.11,Priority Foundation,"Toxic substances properly identified, stored, ...",7-102.11/Pf; Bottle of green liquid mislabeled...,1,
117,27511,67392,9.6,1091,Harris Teeter #257 Deli,Restaurant,Chemical,Yes,26,7-102.11,Priority Foundation,"Toxic substances properly identified, stored, ...","Pf - 7-102.11 - Found one bottle labeled ""blea...",1,CDI
118,27511,67392,9.6,1124,CARIBOU COFFEE #306,Restaurant,Chemical,No,26,7-209.11,Core,"Toxic substances properly identified, stored, ...",7-209.11 - Bottle of medication on same shelf ...,0,CDI
119,27511,67392,9.6,1129,CASABLANCA MARKET,Meat Market,Chemical,No,26,7-209.11,Core,"Toxic substances properly identified, stored, ...","7-209.11 ; Core; Soap, shampoo and body wash s...",1,CDI


In [26]:
# dataframe for areas where the median household income is at or above the threshold (average_income)
df_AvgIncomeOrHigher = df_zipRestViol.loc[df_zipRestViol['median_household_income_dollars'] >= average_income]

# keeping only the necessary columns
df_cleanAvgIncomeOrHigher = df_AvgIncomeOrHigher[['postalcode', 'median_household_income_dollars',
                                                   'percent_damilies_below_poverty_line', 'X.objectid',
                                                   'name', 'facilitytype', 'category', 'critical',
                                                   'questionno', 'violationcode', 'severity',
                                                   'shortdesc', 'comments', 'pointvalue', 'violationtype']]
# fixing the misspelled column title; rename returns a new DataFrame,
# which avoids modifying a slice in place
df_cleanAvgIncomeOrHigher = df_cleanAvgIncomeOrHigher.rename(
    columns={'percent_damilies_below_poverty_line': 'percent_below_poverty_line'})
# display the higher-income group (the original cell displayed the below-average table again)
df_cleanAvgIncomeOrHigher.head()
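
The research question concerns chain restaurants, but none of the columns shown above marks which facilities are chains. One possible heuristic, sketched here, is to treat a store-number suffix such as "#22798" in the name as a sign of a chain location. The brand helper and the is_chain column are hypothetical names introduced for this sketch, and the rule is only an assumption: it would miss chains without a store number and could flag non-chains such as a numbered pushcart.

```python
import re
import pandas as pd

# Names taken from the tables displayed above
df = pd.DataFrame({'name': [
    'Taco Bell #22798', 'BURGER KING #19795', 'BOJANGLES #5',
    'Panera Bread #1643', 'THE REMEDY DINER', 'LA ROMA PIZZA',
]})

def brand(name: str) -> str:
    """Lowercase the name and strip everything from a store number like '#1643' onward."""
    return re.sub(r'\s*#\s*\d+.*$', '', name).strip().lower()

df['brand'] = df['name'].map(brand)
# Hypothetical rule: a store-number suffix suggests a chain location
df['is_chain'] = df['name'].str.contains(r'#\s*\d+', regex=True)

print(df)
```

A stricter variant could additionally require that the same brand appears at more than one address before labeling it a chain.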


# Data Analysis & Results

Include cells that describe the steps in your data analysis.

In [22]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION
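
One way the two income groups built during data cleaning could be compared is Welch's t-test on a per-restaurant violation measure, using the ttest_ind function imported in Setup. The sketch below is not a result of this project: the numbers are made up purely to show the call, and the choice of "mean point value per restaurant" as the measure is an assumption.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Toy per-restaurant mean point values (made-up numbers, not project results)
below = pd.Series([1.0, 0.5, 1.5, 2.0, 1.0, 0.5])
above = pd.Series([0.5, 1.0, 0.0, 1.5, 0.5, 1.0])

# equal_var=False selects Welch's t-test, which does not assume
# that the two groups have the same variance
stat, pvalue = ttest_ind(below, above, equal_var=False)
print(f't = {stat:.3f}, p = {pvalue:.3f}')
```

Aggregating to one value per restaurant before testing keeps restaurants with many violation rows from dominating the comparison.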

# Ethics & Privacy

Issues related to my topic area could include certain chain restaurants losing business and, in the worst case, possibly shutting down. The analysis is potentially problematic in terms of data privacy because people may be able to identify specific restaurant locations, and workers may be harassed as a result. To mitigate these issues, I would leave restaurant names and cities out of my analysis. Instead, I would label the restaurants as Restaurant 1, Restaurant 2, and so on, together with the socioeconomic status of the city they are in; for example, a location would be described as Restaurant 1 in the high socioeconomic city.

Race might also be inferred because of the bias that associates poorer neighborhoods with minorities. I would try to minimize this risk by leaving out anything that could identify the neighborhoods, such as zip codes or descriptions of the area.
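
The anonymization described above could be sketched as follows: each distinct restaurant name is replaced by a label such as "Restaurant 1", and the name and zip code columns are dropped before anything is reported. The toy rows, the income_group values, and the restaurant_id column name are assumptions made for illustration.

```python
import pandas as pd

# Toy rows; income_group and pointvalue are illustrative values only
df = pd.DataFrame({
    'name': ['Taco Bell #22798', 'BOJANGLES #5', 'Taco Bell #22798'],
    'postalcode': ['27587', '27603', '27587'],
    'income_group': ['high', 'low', 'high'],
    'pointvalue': [0, 1, 1],
})

# factorize assigns integer codes 0, 1, 2, ... in order of first appearance
codes, _ = pd.factorize(df['name'])
df['restaurant_id'] = ['Restaurant ' + str(code + 1) for code in codes]

# Report only the pseudonym, the coarse income group, and the measure
public = df.drop(columns=['name', 'postalcode'])
print(public)
```

Because the mapping from names to labels is not stored in the reported table, the same restaurant stays linkable across rows without exposing its name or neighborhood.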

# Conclusion & Discussion

*Fill in your discussion information here*