# Final Project: Something Smells Fishy 

# Name & GitHub

Name : Fatima Housni
GitHub Username : fhousni
PID : A10197077

# I) Overview

In this report, I grouped seafood restaurants by zip code and assigned each zip code a population size using census data. I then tested whether restaurants in zip codes with high populations had higher health inspection scores than restaurants in zip codes with low populations.

The data analysis showed that there was no real relationship between population size and health inspection score.

# II) Research Question

_Does the size of a population in a given area affect the score of health inspections in restaurants that serve seafood?_ _If so, does population size also determine which type of violation a seafood restaurant is likely to be issued?_

Growing up, my family moved around a lot. We have lived in both small towns and large metropolitan cities throughout Southern California. Regardless of where our home was, we were always on the lookout for our next favorite seafood restaurant. From experience, I found that densely populated areas tended to have restaurants that serve fresher seafood. This holds true even within the same city or county. When looking for good, safe seafood to eat, I always look for restaurants located in areas where most people live. I usually avoid restaurants that are in quiet, out-of-the-way locations with fewer people. I am suspicious of those establishments, and do not want to risk my safety by consuming ill-prepared seafood.

I held these beliefs for years. Everything changed, however, after a brief conversation with my sister. She believes the exact opposite. She prefers seafood establishments located in low-density areas. She feels that the best seafood comes from places with less foot traffic and fewer patrons. You see, my sister worked as a waitress in a small-town sushi restaurant for two years. Her experience lends weight to her opinion, but I am not ready to give up yet. I decided to dedicate this project to discovering which of us is correct.

This project will analyse each health inspection's overall score together with the population density of the zip code in which the inspected restaurant is located.

### Background and Prior Work

Foodservice establishments are inspected by environmental health specialists to make sure the food served there is safe for public consumption. This prevents food-borne illnesses from ravaging communities. Depending on its conditions, the inspector gives the establishment a score ranging from 70 to 100. Each violation of the health safety code is given a point value of 1 or 2. These points are then added up and subtracted from 100. An 'A' rating is a score from 90 to 100, a 'B' from 80 to 89.5, and a 'C' from 70 to 79.5.
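
As a rough illustration of the scoring arithmetic described above, the following sketch subtracts violation points from 100 and maps the result to a letter grade. The function names `inspection_score` and `letter_grade` are hypothetical, and returning `None` below 70 is an assumption, since the text does not say what happens under that threshold.

```python
def inspection_score(violation_points):
    """Subtract the total violation points from a perfect score of 100."""
    return 100 - sum(violation_points)


def letter_grade(score):
    """Map a numeric score to the letter grades described in the text."""
    if score >= 90:
        return 'A'
    if score >= 80:
        return 'B'
    if score >= 70:
        return 'C'
    return None  # assumption: scores below 70 fall outside the A-C scale


score = inspection_score([2, 1, 2])  # three hypothetical violations
print(score, letter_grade(score))
```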

In addition to being assigned a point value, each violation is also assigned a code that indicates the nature of the violation. Violations can be food related, such as temperature control, labeling, contamination, improper sources, or improper cooking. They can also fall under other categories, such as poor management and training, chemical risks, or equipment. A restaurant that is not careful can lose points due to any one of these factors.

By law, food establishments must display their food score rating. This information is also available through online government or state websites. This project uses health score information to compile the data and locate the businesses on a map. Through web scraping, a list of zip codes and the number of people living in each one was compiled. Data from Yelp, an online platform that allows businesses and customers to interact, is also used.

References:
1. About the Inspection Process, 2020, www.forsyth.cc/PublicHealth/EnvironmentalHealth/aboutInspections.aspx.
2. Hicks, D. (2012, May 30). Seafood Safety: What Consumers Need to Know. Retrieved from https://web.uri.edu/foodsafety/seafood-safety-what-consumers-need-to-know/

### Hypothesis

$H_0$: There is no relationship between population size and health inspection score

$H_a$: There is a relationship between population size and health inspection score

I hypothesize that there is a relationship between population size and health inspection score. I predict that as population in a given area increases, the health inspection score will also increase.

    My reasoning for this is the following: 
    
    The state or county authorities deploy more frequent health inspections to restaurants in densely populated areas than in sparse areas because of the higher risk to public safety. So a restaurant in downtown LA will receive more yearly inspections than one in Elysian Park, for example. Therefore, these establishments will have more experience and motivation to remain clean and avoid violations. 
    
    Also, when the population in a given area is large, restaurants in that area will naturally have more foot traffic and more patrons. A busy establishment must buy food on a regular basis since it is consumed quickly, so the food will be fresher and pose less of a safety risk to people.

# III) Datasets

#### Yelp Data

I used the Yelp dataset to find restaurants and food establishments that fall under the food category of seafood or sushi. Sushi was included in this analysis since it is also seafood. The Yelp dataset includes phone numbers of establishments. This came in handy when merging Yelp with the other datasets.

#### Restaurant Data

The restaurant dataset was useful because it was the bridge connecting the Yelp food category with the health inspections for each restaurant. Restaurant data gives an identification for each health inspection. This identification (hsis id) was used to find the corresponding inspection. Additionally, this dataset gives the zip code (i.e., postal code) of the restaurant location. 

#### Data Scraping for Population Data

This dataset was obtained through web scraping. A list of all relevant zip codes and their respective populations was found on an official state website. All population data for these zip codes came from a 2018 census. Therefore, this population data is applicable to this analysis. 

Source:  https://www.northcarolina-demographics.com/zip_codes_by_population

#### Inspection Data

The inspection data was essential to this report. This dataset provided me with the inspection score from each inspection done. Each row is a different inspection, which means that the same restaurant may have been inspected twice. This dataset was connected to the population data through zip codes. 

#### Violation Data

This dataset includes all violations and violation types. This dataset was important in determining the type of violation a restaurant is likely to have. This dataset was connected to the population data through zip codes. 

### Important: Assumptions / Simplifications 

I am making a few generalizations about the data to simplify my report. 

1. Each area a zip code encompasses has a unique size and shape, which affects how the population density is distributed within that area. Unfortunately, it is impossible to take this into account here. For this reason, I am assuming each zip code area is approximately equal in square mileage (even though in reality, this is not the case). 

2. The population density is equally distributed within a given zip code area.

3. Restaurants that share a zip code have similar customer numbers and foot traffic coming their way. Of course, this is not true. You can have two restaurants next to each other with wildly different popularities, but for the sake of this report we will ignore that. 

# The Setup

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
sns.set()
sns.set_context('talk')

import patsy
import statsmodels.api as sm
import scipy.stats as stats
from scipy.stats import ttest_ind, chisquare, normaltest

# Import nltk package 
#   PennTreeBank word tokenizer 
#   English language stopwords
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

from math import radians, cos, sin, asin, sqrt

import warnings
warnings.filterwarnings('ignore')

# scikit-learn imports
#   SVM (Support Vector Machine) classifier 
#   Vectorizer, which transforms text data into bag-of-words feature
#   TF-IDF Vectorizer that first removes widely used words in the dataset and then transforms test data
#   Metrics functions to evaluate performance
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report, precision_recall_fscore_support

### Load the Data

In [None]:
df_ins = pd.read_csv('inspections.csv')
df_res = pd.read_csv('restaurants.csv')
df_vio = pd.read_csv('violations.csv')
df_yelp = pd.read_csv('yelp.csv')
df_zip = pd.read_csv('zipcodes.csv')
df_pop = pd.read_csv('dataminer.csv')

# IV) Data Cleaning

### Remove, Rename, and Reorganize columns of dataframes

In [None]:
#Violation Dataframe

df_vio = df_vio.drop(['inspectdate','statecode','questionno','inspectedby','comments',
                      'shortdesc'], axis=1)
df_vio = df_vio.rename(columns={"category":"violation_category"})

In [None]:
# Inspection Dataframe


df_ins = df_ins.drop(['date','address1','address2','restaurantopendate','days_from_open_date',
                      'name', 'x','y','geocodestatus', 'facilitytype', 'type', 'description', 
                      'inspectedby', 'inspection_num', 'inspector_id','previous_inspection_date', 
                      'days_since_previous_inspection', 'previous_inspection_by_same_inspector', 
                      'num_critical_previous','num_non_critical_previous', 'num_critical_mean_previous',
                      'num_non_critical_mean_previous', 'avg_neighbor_num_critical','avg_neighbor_num_non_critical', 
                      'top_match', 'second_match'], axis=1)


df_ins = df_ins.rename(columns={"phonenumber":"phone"})
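# '-....' is a regular expression (hyphen plus any four characters), so request regex matching explicitly
df_ins['postalcode'] = df_ins['postalcode'].str.replace('-....', '', regex=True)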
df_ins = df_ins.dropna(subset=['phone'])
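
The pattern `'-....'` is meant as a regular expression: a hyphen followed by any four characters, which is the shape of a ZIP+4 suffix. A small sketch with made-up postal codes shows the intended effect. Passing `regex=True` explicitly is a precaution, because newer pandas versions treat the pattern as literal text by default. The stricter pattern used here is an illustrative variation, not the one from the cell above.

```python
import pandas as pd

# Made-up postal codes, some with a ZIP+4 suffix
codes = pd.Series(['27603-1234', '27511', '27560-0001'])

# Anchoring with $ and matching digits only is slightly stricter than '-....'
five_digit = codes.str.replace(r'-\d{4}$', '', regex=True)
print(five_digit.tolist())
```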

In [None]:
# Restaurant Dataframe


df_res = df_res.drop(['address1','address2','restaurantopendate', 'x','y','name','geocodestatus'], axis=1)
df_res = df_res.rename(columns={"phonenumber":"phone"})
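# Strip the ZIP+4 suffix; the pattern is a regular expression, so request regex matching explicitly
df_res['postalcode'] = df_res['postalcode'].str.replace('-....', '', regex=True)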
df_res = df_res.dropna(subset=['phone'])

In [None]:
# Yelp Dataframe

df_yelp = df_yelp.drop(['price', 'italian','newamerican','chicken_wings', 'delis', 'bars', 'salad', 'burgers', 'mexican',
                       'grocery','breakfast_brunch','coffee','chinese','bbq','bakeries','hotdogs','sandwiches','pizza',
                       'tradamerican', 'longitude', 'latitude', 'is_closed', 'name','id','address1','zip_code'], axis=1)
df_yelp = df_yelp.dropna(subset=['phone'])

# Keep only Seafood or Sushi places

df_yelp = df_yelp.loc[(df_yelp['seafood'] == True) | (df_yelp['sushi'] == True)]

# Remove leading 1 in Yelp phone numbers

df_yelp['phone'] = df_yelp['phone'].str.lstrip('1')


In [None]:
# Population Dataframe

df_pop = df_pop.rename(columns={"Zip Code":"postalcode"})
df_pop = df_pop.rename(columns={"Population":"population"})

### Standardize

In [None]:
# The phone numbers in Restaurants are formatted differently, so we must write a function to correct that

def standardize_phonenumber(string):

    string = string.lower()
    string = string.strip()
    
    string = string.replace("(", "")
    string = string.replace(")", "")
    string = string.replace("-", "")
    string = string.replace(".", "")
    string = string.replace(" ", "")
    
    string = string.strip()
    
   
    return string
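
The same cleanup can also be written as a single vectorized pandas call instead of a row-by-row function. This is only a sketch on invented phone numbers; the character class removes the same five characters that `standardize_phonenumber` strips.

```python
import pandas as pd

# Invented phone numbers in the mixed formats the function is meant to handle
phones = pd.Series(['(919) 555-0100', '919.555.0101', ' 919-555-0102 '])

# Remove parentheses, hyphens, periods, and spaces in one pass
clean = phones.str.lower().str.replace(r'[()\-. ]', '', regex=True)
print(clean.tolist())
```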

In [None]:
# Standardize Population for df_pop

def standardize_population(string):

    string = string.lower()
    string = string.strip()
    
    string = string.replace(",", "")
    string = string.replace(" ", "")
    
    string = string.strip()
    
   
    return string

In [None]:
# Standardize Violation category

def standardize_category(string):

    string = string.lower()
    string = string.strip()
    
    string = string.replace("approved source", "food")

    string = string.strip()
    
   
    return string

In [None]:
# Standardize Violation codes

def standardize_codes(string):

    string = string.lower()
    string = string.strip()
    
    string = string.replace("-", "")
    string = string.replace(".", "")

    string = string.strip()
    
   
    return string

In [None]:
# Apply the standardization

df_res['phone'] = df_res['phone'].apply(standardize_phonenumber)
df_ins['phone'] = df_ins['phone'].apply(standardize_phonenumber)
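# 'ext...' is a regular expression ("ext" plus any three characters), so request regex matching explicitly
df_ins['phone'] = df_ins['phone'].str.replace('ext...', '', regex=True)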

df_pop['population'] = df_pop['population'].apply(standardize_population)
df_pop['population'] = df_pop.population.astype(int)

df_vio['violationcode'] = df_vio['violationcode'].apply(standardize_codes)
df_vio['violation_category'] = df_vio['violation_category'].apply(standardize_category)

### Merge Dataframes

In [None]:
# Merge restaurants with yelp and name it df_ry

df_ry = pd.merge(df_res, df_yelp, on='phone', how='inner')
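
Because phone numbers are the only key linking the two sources, it can be worth checking how many rows actually match before relying on the inner merge. The sketch below uses two tiny made-up tables; `indicator=True` is a standard `pd.merge` option that adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Made-up stand-ins for the restaurant and Yelp tables
res = pd.DataFrame({'hsisid': [1, 2, 3],
                    'phone': ['9195550100', '9195550101', '9195550102']})
yelp = pd.DataFrame({'phone': ['9195550100', '9195550102', '9195550199'],
                     'sushi': [True, False, True]})

# An outer merge with an indicator keeps unmatched rows so they can be counted
check = pd.merge(res, yelp, on='phone', how='outer', indicator=True)
match_counts = check['_merge'].value_counts()
print(match_counts)
```

Rows marked `left_only` or `right_only` are the ones an inner merge would silently drop.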

In [None]:
# Merge inspections with df_ry using hsis id

df_ir = pd.merge(df_ry, df_ins, on='hsisid', how='inner')

df_ir = df_ir.drop(['X.objectid','num_critical','num_non_critical','hsisid', 
                    'city_x', 'state_x', 'phone_x','facilitytype', 'rating', 
                    'review_count', 'seafood', 'sushi', 'city_y', 'state_y', 
                    'postalcode_y', 'phone_y', 'zip'], axis=1)

df_ir = df_ir.rename(columns={"postalcode_x":"postalcode"})
df_ir['postalcode'] = df_ir.postalcode.astype(int)


In [None]:
# Merge df_ir with df_pop and name it df_i

df_i = pd.merge(df_ir, df_pop, on='postalcode', how='inner')

In [None]:
# Merge violations with df_ry using object ID

df_ryv = pd.merge(df_ry, df_vio, on='X.objectid', how='inner')

In [None]:
# Clean the new table of unnecessary columns

df_ryv = df_ryv.drop(['X.objectid','violationtype',
                      'facilitytype', 'rating', 
                      'review_count', 'seafood','sushi','hsisid_y','critical',
                      'severity','pointvalue','observationtype','count',
                      'cdcriskfactor','cdcdataitem','hsisid_x', 'state',
                      'phone','city'], axis=1)


In [None]:
# Merge df_ryv with df_pop 

df_ryv["postalcode"] = df_ryv["postalcode"].astype(str).astype(int)
df = pd.merge(df_pop, df_ryv, on='postalcode', how='outer')

In [None]:
# Drop rows with nan values for violation code
# Rows will be organized from highest population to lowest

df = df.dropna(subset=['violationcode'])

### Our final Dataframes

df_i is a dataframe of inspections done on restaurants. Each row is a different inspection. A restaurant can have multiple inspections taken at different times.

In [2]:
df_i

df is a dataframe of rows from violations only. This includes the violation types. The violation code refers to what kind of violation it is. A code beginning in 3 is a food-related violation. A code beginning in 7 is a chemical-related violation.

    Food related violation: contamination, bad temperature control, bad source of seafood, rot, poor cooking, insufficient freezing or cleaning of the seafood, etc
    
    Chemical related violations: cleaning products out, lubricants, exposed medications out, toxic chemicals, etc


In [None]:
df

# Data Analysis & Results

In [None]:
df_i.describe()

In [None]:
df.describe()

There is a tendency towards higher scores within our dataset.

In [None]:
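plt.subplot(1, 3, 1)
# histplot replaces the deprecated distplot; kde=True and stat='density' keep a similar look
sns.histplot(df_i['score'], kde=True, stat='density', color='#DE2D26')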

A graph of score vs. population. There seems to be a slight negative correlation which suggests that a higher population leads to a lower score. We cannot be sure of this relationship without doing some numerical analysis.

In [None]:
sns.lmplot(x='population', y='score', data=df_i, 
           fit_reg= True, height=6, aspect=2, 
           x_jitter=.5, y_jitter=.5);
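
One simple numerical check of the trend seen in the scatter plot is a Pearson correlation. The sketch below runs on synthetic numbers generated with no built-in relationship, not on the project data, so its output says nothing about the real result. With the real data, `df_i['population']` and `df_i['score']` would take the place of the two arrays.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Synthetic data: populations and scores drawn independently of each other
population = rng.integers(5_000, 80_000, size=200)
score = np.clip(rng.normal(95, 3, size=200), 70, 100)

# r is the correlation coefficient, p_value tests the null of no correlation
r, p_value = pearsonr(population, score)
print(f'r = {r:.3f}, p = {p_value:.3f}')
```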

A critical value of 1 means that the inspector found a dangerous violation during their inspection. A critical value of 0 means that there was no violation or that the violation was minor and posed no immediate threat to public health. 

The graph below shows a slight skew, but this cannot be determined by eye alone. We must carry out a regression analysis.

In [None]:
sns.lmplot(x='population', y='critical', data=df_i, fit_reg=False);

#### Violation Category: Food or Chemical

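The box plot below compares the population of the zip codes in which each violation category was recorded. It shows the spread of population per category rather than the number of violations.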

In [None]:
ax = sns.boxplot(x='violation_category', y='population', data=df)

ax.set_title('Relationship of Violation Type to Population Size', loc='left')
ax.set_ylabel('Population')
ax.set_xlabel('Violation Type');

In [None]:
# Create two dataframes: high population (> 50,000) and low population (< 25,000)
# The population column is numeric, so the lookup values are numbers rather than strings

high = [78475, 66583, 55111, 53803, 53364]
low = [20875, 18856, 10445, 5490]

# .copy() avoids assigning into a view of df when columns are overwritten below
df_high = df[df['population'].isin(high)].copy()
df_low = df[df['population'].isin(low)].copy()

df_high['violationcode'] = df_high['violationcode'].astype('int')
df_low['violationcode'] = df_low['violationcode'].astype('int')
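
An alternative to listing the population values by hand is to filter on the thresholds themselves, which keeps working if the set of zip codes changes. A minimal sketch on a made-up frame, using the 50,000 and 25,000 cut-offs named in the comment above:

```python
import pandas as pd

# Made-up rows standing in for df
toy = pd.DataFrame({'population': [78475, 53364, 30000, 20875, 5490],
                    'violation_category': ['food', 'chemical', 'food',
                                           'chemical', 'chemical']})

# Boolean masks select rows above and below the two thresholds
toy_high = toy[toy['population'] > 50_000].copy()
toy_low = toy[toy['population'] < 25_000].copy()
print(len(toy_high), len(toy_low))
```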

In [None]:
# High Population Areas: Sum up the population for food violations and chemical violations

df_high['population'] = df_high.groupby('violation_category')['population'].transform('sum')
df_high['population'].unique()

In [None]:
# Low Population Areas: Sum up the population for food violations and chemical violations

df_low['population'] = df_low.groupby('violation_category')['population'].transform('sum')
df_low['population'].unique()

In [None]:
df_high_plot = pd.DataFrame({'Type':['food', 'chemical'], 'val':[306028, 944503]})

ax = df_high_plot.plot.bar(x='Type', y='val', rot=0)
ax.set_title('Most Common Health Violation Type in High-Population Areas', loc='left')
ax.set_ylabel('Population')
ax.set_xlabel('Violation Type');

In [None]:
df_low_plot = pd.DataFrame({'Type':['food', 'chemical'], 'val':[20875, 109313]})

ax = df_low_plot.plot.bar(x='Type', y='val', rot=0)
ax.set_title('Most Common Health Violation Type in Low-Population Areas', loc='left')
ax.set_ylabel('Population')
ax.set_xlabel('Violation Type');

Both of the graphs show an overwhelmingly larger number of chemical violations compared to food violations. The results of the graphs above show that population size and violation type are most likely not related. More data will be needed for a better analysis.
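
One way to put a number on whether violation type depends on population group is a chi-square test of independence on the counts of violations. The counts below are invented purely to show the mechanics. With the real data, something like `pd.crosstab` of population group against `violation_category` would supply the table.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Invented counts: rows = population group, columns = violation category
#                  food  chemical
table = np.array([[30, 90],    # high-population zip codes
                  [10, 32]])   # low-population zip codes

# expected holds the counts that independence of row and column would predict
chi2, p_value, dof, expected = chi2_contingency(table)
print(f'chi2 = {chi2:.3f}, p = {p_value:.3f}, dof = {dof}')
```

A large p-value here would be consistent with violation type being unrelated to population group.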

### Regression Analysis

In [None]:
# Is population correlated with health inspection score?
outcome, predictors = patsy.dmatrices('score ~ population', df_i)

In [None]:
# Now, that those are specified, let's run the model
mod = sm.OLS(outcome, predictors)

The high p-value and small coefficient indicate that there is no evidence of a relationship between population size and health inspection score for seafood restaurants.

In [None]:
## fit the model
res = mod.fit()

## look at the results
print(res.summary())

# Ethics & Privacy


    My biggest concern during this report was the privacy of the inspectors whose names were listed in the Inspections dataframe. Sometimes, restaurant owners can become disgruntled after receiving a low score from an inspector, so it is important to protect the privacy of each inspector in order to ensure their safety. Their names were not relevant to this report, so any data on the inspectors was removed. 

    Yelp posts its data set onto its site for consumer purposes. All data used in this report is for educational purposes only.

    No specific locations of restaurants were disclosed. Furthermore, all names and identifiers of each establishment were removed since they had no relevance to this report. 


# Conclusion & Discussion

No correlation was found between population size and the health inspection scores of seafood restaurants. A p-value above 0.47 means the observed variation is consistent with chance. The coefficient of -0.00000587 is the estimated change in score per additional resident, which is an incredibly weak negative relationship between our independent variable (population) and our dependent variable (health inspection score). As such, there is not sufficient evidence to reject the null hypothesis.
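
To get a feel for how small the coefficient is, it can be multiplied by the population gap between the largest and smallest zip codes used in the high/low comparison earlier (78,475 and 5,490 residents). This is plain arithmetic on the reported coefficient, not a new model:

```python
slope = -0.00000587          # reported coefficient: score points per resident
pop_high, pop_low = 78475, 5490

# Predicted difference in score between the two zip codes under the fitted line
predicted_gap = slope * (pop_high - pop_low)
print(round(predicted_gap, 2))
```

Under the fitted line, the two extremes differ by less than half a point on a 100-point scale.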

Alas, my initial goal of validating my beliefs over my sister's has failed. The data does not provide any evidence that population affects health inspection score. My sister and I must agree to disagree on this matter. 

In an ideal world, if I could do this project again, I would use a much wider dataset that encompasses the whole of the United States. I would have many more zip codes to work with, and the variance in population between zip codes would be much higher.

_Parting thoughts:_

I believe that a much more elegant approach would be to test whether there is a relationship between health inspection score and distance from a city. Using mapping data, we could test whether inspection scores get better or worse the farther a restaurant is from a dense city. This is a project I would be interested in revisiting in the future when my coding skills improve.
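
For this distance-based follow-up, the great-circle distance between a restaurant and a city center could be computed with the haversine formula, which is what the `radians, cos, sin, asin, sqrt` imports in the setup cell are typically used for. The function name, the Earth radius of 6371 km, and the coordinates below are assumptions for illustration.

```python
from math import radians, cos, sin, asin, sqrt


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))


# One degree of longitude along the equator
d = haversine_km(0.0, 0.0, 0.0, 1.0)
print(round(d, 1))
```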
