# COGS 108 - Final Project 

# Overview

Using restaurant inspection and violation datasets, we examine whether the number of violations correlates with restaurant type (chain vs non-chain).

# Name & PID

- Name: Andy Dong
- PID: A15321789

# Research Question

Do chain restaurants perform better than nonchain restaurants with regard to cleanliness (cases of food-borne illness)?

## Background and Prior Work

According to Pei Liu (Source 3), independent restaurants are 1.64 times more likely to receive critical violations than chain restaurants. In addition, the public health report (Source 1) states that chain restaurants have fewer total violations than nonchain restaurants (6.5 versus 9.6). Even among chain restaurants, in this case fast-food restaurants, the amount of training employees received correlates with the number of violations they committed (Source 2). Source 1 collected restaurant type (chain vs nonchain), inspection frequency, and violation type. Source 2 collected the amount of training provided at each restaurant and the violation types. Similarly, Source 3 collected restaurant type and violation type.

References:
- 1) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5349477/
- 2) https://www.tandfonline.com/doi/full/10.1080/15332840802156881
- 3) https://www.tandfonline.com/doi/abs/10.1080/15378020.2016.1206770?scroll=top&needAccess=true&journalCode=wfbr20

# Hypothesis


The cleanliness of a restaurant (food inspection violations and cases of foodborne illness) is related to the amount of food-hygiene training its employees receive. The more time an employer spends training its staff, the fewer violations and illness cases the restaurant will have.

# Dataset(s)

- Dataset Name: inspections.csv
- Link to the dataset: 
- Number of observations: 18466

Contains the results of restaurant inspections, including the number of critical and non-critical violations, the inspection date, and the restaurant name.

- Dataset Name: restaurants.csv
- Link to the dataset:
- Number of observations: 3324

Contains the facilities that were inspected, along with their facility type. Used to restrict the analysis to restaurants only.

- Dataset Name: violations.csv
- Link to the dataset:
- Number of observations: 189802

Contains the individual violations that restaurants were cited for. Used to find the type of each violation.

- Dataset Name: FastFoodRestaurants.csv
- Link to the dataset:
- Number of observations: 10000

Contains the names and addresses of fast-food restaurants in the US. Used to determine whether a restaurant is a chain.

# Setup

In [5]:
# Imports 
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
sns.set()
sns.set_context('talk')

import warnings
warnings.filterwarnings('ignore')

import patsy
import statsmodels.api as sm
import scipy.stats as stats
from scipy.stats import ttest_ind, chisquare, normaltest

# Data Cleaning

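For each dataset we keep only the columns needed for the analysis and lower-case the text fields used for matching (names, addresses, cities) so that records can be compared across files. Inspection dates are truncated to their first 10 characters, and the restaurants table is filtered to facilities whose type is `restaurant`.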

In [107]:
## Clean up for Inspections
inspections = pd.read_csv('./data/inspections.csv')
inspections = inspections[['hsisid', 'date', 'name', 'address1', 'city', 'state', 'num_non_critical', 'critical']]

# Collect any rows that contain missing values (none were found)
empty_row = inspections[inspections.isnull().any(axis=1)]

def lower_string(string):
    """Return the string in lower case."""
    return string.lower()

def normal_date(string):
    """Keep only the first 10 characters of the date string."""
    return string[:10]

# Lower-case the text columns and trim the date column
inspections['name'] = inspections['name'].apply(lower_string)
inspections['address1'] = inspections['address1'].apply(lower_string)
inspections['date'] = inspections['date'].apply(normal_date)
inspections['city'] = inspections['city'].apply(lower_string)

# Violation columns per inspection, plus one accumulator slot per unique restaurant
sum_violations = inspections[['hsisid', 'num_non_critical', 'critical']]

unique = sum_violations['hsisid'].unique()
non_critical = [0] * len(unique)
critical = [0] * len(unique)
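
The `non_critical` and `critical` lists are allocated but never filled in. A minimal sketch of one way to get per-restaurant totals is a pandas `groupby` on `hsisid`. The frame below is invented toy data, not rows from `inspections.csv`; the names `toy_inspections` and `totals` are hypothetical, and the sketch assumes both violation columns hold numeric counts.

```python
import pandas as pd

# Toy stand-in for the inspections frame (values are invented for illustration).
toy_inspections = pd.DataFrame({
    'hsisid': [101, 101, 202],
    'num_non_critical': [3, 2, 5],
    'critical': [1, 0, 2],
})

# Sum the violation columns across all inspections of each restaurant.
totals = (
    toy_inspections
    .groupby('hsisid')[['num_non_critical', 'critical']]
    .sum()
    .reset_index()
)
print(totals)
```

This yields one row per `hsisid`, which avoids maintaining parallel Python lists by hand.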

In [71]:
## Clean up for Restaurants 
restaurants = pd.read_csv('./data/restaurants.csv')
restaurants = restaurants[['hsisid', 'name','address1', 'city', 'state', 'facilitytype']]
restaurants['name'] = restaurants['name'].apply(lower_string)
restaurants['address1'] = restaurants['address1'].apply(lower_string)
restaurants['city'] = restaurants['city'].apply(lower_string)
restaurants['facilitytype'] = restaurants['facilitytype'].apply(lower_string)

# Keep only facilities whose type is 'restaurant'
restaurants = restaurants[restaurants['facilitytype'] == 'restaurant']
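
To carry the restaurant-only filter over to the inspection records, one option is to keep only the inspections whose `hsisid` appears in the filtered restaurants frame. The sketch below uses invented toy data and assumes `hsisid` identifies the same facility in both files; `toy_restaurants`, `toy_inspections`, and `restaurant_inspections` are hypothetical names.

```python
import pandas as pd

# Toy stand-ins for the two frames (values are invented for illustration).
toy_restaurants = pd.DataFrame({
    'hsisid': [101, 202, 303],
    'facilitytype': ['restaurant', 'food stand', 'restaurant'],
})
toy_inspections = pd.DataFrame({
    'hsisid': [101, 202, 303, 303],
    'num_non_critical': [3, 1, 4, 2],
})

# Same filter as in the cell above, applied to the toy frame.
only_restaurants = toy_restaurants[toy_restaurants['facilitytype'] == 'restaurant']

# Keep inspections that belong to one of the remaining restaurants.
restaurant_inspections = toy_inspections[
    toy_inspections['hsisid'].isin(only_restaurants['hsisid'])
]
print(restaurant_inspections)
```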

In [86]:
len(inspections['hsisid'].unique())

3045

In [36]:
## Clean up for Violations
violations = pd.read_csv('./data/violations.csv')
violations = violations[['hsisid', 'category', 'critical', 'severity', 'shortdesc', 'violationtype']]
violations.head(5)

| | hsisid | category | critical | severity | shortdesc | violationtype |
|---|---|---|---|---|---|---|
| 0 | 4092015279 | Chemical | Yes | Priority Foundation | Toxic substances properly identified, stored, ... | R |
| 1 | 4092014572 | Chemical | Yes | Priority Foundation | Toxic substances properly identified, stored, ... | CDI |
| 2 | 4092015906 | Chemical | Yes | Priority Foundation | Toxic substances properly identified, stored, ... | CDI |
| 3 | 4092013840 | Chemical | Yes | Priority Foundation | Toxic substances properly identified, stored, ... | CDI |
| 4 | 4092021788 | Chemical | Yes | Priority Foundation | Toxic substances properly identified, stored, ... | CDI |
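
Since this file has one row per violation, a per-restaurant breakdown by the `critical` flag can be obtained with a cross-tabulation. This is only a sketch on invented toy data: the output above shows only the value `Yes`, so the presence of a `No` value is an assumption, and `toy_violations` and `counts` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the violations frame (values are invented for illustration).
toy_violations = pd.DataFrame({
    'hsisid': [101, 101, 101, 202],
    'critical': ['Yes', 'No', 'Yes', 'No'],
})

# Rows: restaurants; columns: values of `critical`; cells: number of violations.
counts = pd.crosstab(toy_violations['hsisid'], toy_violations['critical'])
print(counts)
```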


In [89]:
## Clean up for Fast Food Restaurants
ff = pd.read_csv('./data/FastFoodRestaurants.csv')
ff = ff[['address']]
ff['address'] = ff['address'].apply(lower_string)

| | address |
|---|---|
| 0 | 324 main st |
| 1 | 530 clinton ave |
| 2 | 408 market square dr |
| 3 | 6098 state highway 37 |
| 4 | 139 columbus rd |
| ... | ... |
| 9995 | 3013 peach orchard rd |
| 9996 | 678 northwest hwy |
| 9997 | 1708 main st |
| 9998 | 67740 highway 111 |
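
With lower-cased addresses on both sides, a restaurant could be flagged as a chain when its address appears in the fast-food list. The sketch below illustrates that idea on invented toy data; the `is_chain` column and the `toy_*` frames are hypothetical, and exact string matching on the street address is an assumption about how the two files would be linked.

```python
import pandas as pd

# Toy stand-ins for the two frames (values are invented for illustration).
toy_restaurants = pd.DataFrame({
    'name': ['burger place', 'local diner'],
    'address1': ['324 main st', '12 oak ave'],
})
toy_ff = pd.DataFrame({'address': ['324 main st', '530 clinton ave']})

# Flag a restaurant as a chain when its address appears in the fast-food list.
toy_restaurants['is_chain'] = toy_restaurants['address1'].isin(toy_ff['address'])
print(toy_restaurants)
```

Matching on street address alone can produce false positives, because the same street address can exist in several cities. If city or state columns are available in the fast-food file, matching on them as well would be stricter.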


# Data Analysis & Results

Include cells that describe the steps in your data analysis.

In [5]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION
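
Once each restaurant has a chain flag and a violation total, the research question reduces to comparing two groups. The following is a minimal sketch using Welch's t-test via `scipy.stats.ttest_ind` (imported in Setup). All numbers are invented for illustration and are not results from the project data; the column names `is_chain` and `total_violations` are hypothetical.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Invented toy data: three chain and three non-chain restaurants.
toy = pd.DataFrame({
    'is_chain': [True, True, True, False, False, False],
    'total_violations': [4, 6, 5, 9, 11, 10],
})

chain = toy.loc[toy['is_chain'], 'total_violations']
nonchain = toy.loc[~toy['is_chain'], 'total_violations']

# equal_var=False selects Welch's t-test, which does not assume equal variances.
t_stat, p_value = ttest_ind(chain, nonchain, equal_var=False)
print(chain.mean(), nonchain.mean(), t_stat, p_value)
```

A t-test is only one option; violation counts are often skewed, so a check such as `normaltest` (also imported in Setup) may be worth running before relying on it.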

# Ethics & Privacy

We need to anonymize the names of the restaurants we use so that no specific restaurant is singled out.
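
One simple way to do this, sketched here on invented toy data, is to replace each distinct name with an opaque label and drop the original column. The `restaurant_id` column and the label format are hypothetical choices, not part of the datasets.

```python
import pandas as pd

# Toy stand-in for a frame that still carries restaurant names.
toy = pd.DataFrame({
    'name': ['burger place', 'local diner', 'burger place'],
    'critical': [1, 0, 2],
})

# factorize assigns one integer code per distinct name, in order of first appearance.
codes, _ = pd.factorize(toy['name'])
toy['restaurant_id'] = ['restaurant_' + str(code) for code in codes]

# Drop the identifying column so only the opaque label remains.
anonymized = toy.drop(columns=['name'])
print(anonymized)
```

Street addresses and `hsisid` values can also identify a restaurant, so they would likely need the same treatment before any results are shared.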

# Conclusion & Discussion

*Fill in your discussion information here*