# COGS 108 - Final Project

# Overview

There has long been a bias in how White and non-White establishments are judged as trustworthy. Especially during this turbulent time, it's important to review our actions and learn from our past mistakes.

# Name & GitHub

- Name: Alexa Acosta
- GitHub Username: A12870957

# Research Question

I am asking whether restaurants in a given zip code are likelier to be flagged as a critical risk than restaurants in other zip codes, depending on the percentage of non-White residents in the area.
By examining this relationship, we can see whether implicit and explicit bias negatively impacts restaurants that do not deserve such harsh criticism.

## Background and Prior Work

Unfortunately, I don't have background research or prior work to cite, because I got a bit too stressed with reading this week.
Thus, I'm relying only on my own experiences with implicit bias (coming from within my own family and my upbringing) after being exposed to a radically different culture during my formative years (moving from the Philippines to the U.S. at age 13), and with explicit bias (experienced and observed by myself and my close friends).

Additionally, it's hard to ignore something as important as bias (racism) when its impact is so enormous.

# Hypothesis

I hypothesize that the mean number of critical violations is higher in predominantly non-White areas than in predominantly White areas.

# Dataset(s)

- Dataset Name: inspections
- Link to the dataset: 'data/inspections.csv'
- Number of observations: 18466

Each row of this CSV describes a single inspection of a specific establishment. It includes basic information like the date of inspection, any comments from the inspector, the inspector and their ID, as well as how critical/severe any problems are, the number of problems found, and statistics referring to previous inspections.

- Dataset Name: violations
- Link to the dataset: 'data/violations.csv'
- Number of observations: 189802

Information associated with a single violation found within a restaurant. It's basically an expanded version of the violations found during inspection in inspections.csv.

- Dataset Name: yelp
- Link to the dataset: 'data/yelp.csv'
- Number of observations: 3688

Details of restaurants that have Yelp! reviews associated with them, including ratings and how many people have given reviews. It will be helpful with showing a more subjective review of a restaurant.

- Dataset Name: zipcodes
- Link to the dataset: 'data/zipcodes.csv'
- Number of observations: 38

Contains aggregate income information for the people in each zip code, as well as the percentage of non-White residents in the area. This will be helpful in determining whether restaurants in an area reflect its quality of life.

The inspections and violations datasets are easily connected through the unique 'hsisid' property. The yelp dataset will be connected through the 'x' and 'y' values ('longitude' and 'latitude'). The zipcodes dataset will be used separately, as most of its data deals with a larger, general area.

# Setup

This setup provides what I need to work with DataFrames, build models, make plots, and run statistical tests once the data is cleaned.

In [1]:
# Imports 
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re

import seaborn as sns
sns.set()
sns.set_context('talk')

import warnings
warnings.filterwarnings('ignore')

import patsy
import statsmodels.api as sm
import scipy.stats as stats
from scipy.stats import ttest_ind, chisquare, normaltest

I'm also reading in the CSV files from the 'data' folder.

In [2]:
inspections = pd.read_csv('data/inspections.csv')
violations = pd.read_csv('data/violations.csv')
yelp = pd.read_csv('data/yelp.csv')
zipcodes = pd.read_csv('data/zipcodes.csv')

# Data Cleaning

- I have created simple standardizing functions for easier parsing.
- I'm scrubbing a lot of the identifiable information (except 'x' and 'y' or 'longitude' and 'latitude' for use in connecting the dataframes with each other) since I'm primarily concerned with the geographic information and population.

In [3]:
def standardize_hsisid(hsisid):
    # Drop the first four characters of the ID and keep the rest as an integer.
    return int(str(hsisid)[4:])

def standardize_zip(code):
    # Keep only the part before the hyphen (e.g. a ZIP+4 code) as an integer.
    return int(str(code).split('-')[0])
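A quick check of the two helpers on made-up values shows what they return. The sample IDs and zip strings below are illustrative only and are not taken from the datasets.

```python
def standardize_hsisid(hsisid):
    return int(str(hsisid)[4:])

def standardize_zip(code):
    code = str(code)
    if code.find('-') == -1:
        return int(code)
    return int(code[0:code.find('-')])

# Hypothetical sample values, for illustration only.
sample_ids = [4092013748, 4092014046]
sample_zips = ['27610', '27513-2400', 27587]

clean_ids = [standardize_hsisid(h) for h in sample_ids]
clean_zips = [standardize_zip(z) for z in sample_zips]

print(clean_ids)   # [13748, 14046]
print(clean_zips)  # [27610, 27513, 27587]
```

Note that converting to `int` discards any leading zeros left after the first four characters are removed, which is fine as long as the shortened IDs stay unique.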

In [4]:
to_drop_ins = ['date', 'name', 'address1', 'address2', 'city', 'state',
       'postalcode', 'phonenumber', 'restaurantopendate', 'critical',
       'days_from_open_date', 'facilitytype', 'x', 'y', 'geocodestatus',
       'type', 'description', 'inspectedby', 'inspection_num', 'inspector_id',
       'previous_inspection_date', 'days_since_previous_inspection',
       'previous_inspection_by_same_inspector', 'top_match', 'second_match',
       'num_critical_previous','num_non_critical_previous','num_critical_mean_previous','num_non_critical_mean_previous',
       'avg_neighbor_num_critical','avg_neighbor_num_non_critical']

inspections_smaller = inspections.drop(columns=to_drop_ins)
inspections_smaller['hsisid'] = inspections_smaller['hsisid'].apply(standardize_hsisid)
inspections_smaller['zip'] = inspections_smaller['zip'].apply(standardize_zip)

In [5]:
to_drop_vio = ['X.objectid', 'inspectdate', 'category', 'statecode',
       'questionno', 'violationcode', 'shortdesc', 'count',
       'inspectedby', 'comments', 'observationtype', 'cdcdataitem']

violations_smaller = violations.drop(columns=to_drop_vio)
violations_smaller['hsisid'] = violations_smaller['hsisid'].apply(standardize_hsisid)

In [6]:
to_drop_yelp = ['id', 'name', 'is_closed', 'address1',
       'latitude', 'longitude', 'phone', 'hotdogs',
       'sandwiches', 'pizza', 'tradamerican', 'burgers', 'mexican', 'grocery',
       'breakfast_brunch', 'coffee', 'chinese', 'italian', 'newamerican',
       'chicken_wings', 'delis', 'bars', 'salad', 'seafood', 'bbq', 'bakeries',
       'sushi']

yelp_smaller = yelp.drop(columns=to_drop_yelp)

I'm creating a holistic DataFrame that holds all of the unique observations (restaurants) and their associated inspections and violations.

In [7]:
df = violations_smaller.merge(inspections_smaller, on='hsisid')
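One thing worth checking about a merge like this: if a restaurant has several inspections and several violations, joining on 'hsisid' alone pairs every violation with every inspection, which multiplies rows and can skew later averages. The sketch below uses small made-up frames (not the real data) to show the effect. Collapsing inspections to one row per restaurant first is only one possible remedy, and whether it suits this project is an assumption.

```python
import pandas as pd

# Toy data, made up for illustration: restaurant 1 has 2 inspections and 3 violations.
toy_inspections = pd.DataFrame({
    'hsisid': [1, 1, 2],
    'zip': [27610, 27610, 27513],
    'num_critical': [4, 2, 1],
})
toy_violations = pd.DataFrame({
    'hsisid': [1, 1, 1, 2],
    'critical': ['Yes', 'No', 'Yes', 'No'],  # hypothetical column
})

# Joining on hsisid alone is many-to-many for restaurant 1:
# 3 violations x 2 inspections = 6 rows, plus 1 row for restaurant 2.
merged = toy_violations.merge(toy_inspections, on='hsisid')
print(len(merged))

# One possible remedy: average the inspections per restaurant before merging.
per_restaurant = (
    toy_inspections
    .groupby(['hsisid', 'zip'], as_index=False)['num_critical']
    .mean()
)

# validate='many_to_one' raises an error if hsisid is not unique on the right side.
merged_once = toy_violations.merge(per_restaurant, on='hsisid', validate='many_to_one')
print(len(merged_once))
```

Comparing `len(df)` against `len(violations_smaller)` after the real merge would show whether this duplication is happening.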

# Data Analysis & Results

I computed the mean, minimum, and maximum number of critical violations grouped by zip code, and did the same for non-critical violations. I also calculated the mean of the scores given by the inspectors.

In [8]:
mean_critical = df.groupby('zip')['num_critical'].mean()
min_critical = df.groupby('zip')['num_critical'].min()
max_critical = df.groupby('zip')['num_critical'].max()

mean_ncritical = df.groupby('zip')['num_non_critical'].mean()
min_ncritical = df.groupby('zip')['num_non_critical'].min()
max_ncritical = df.groupby('zip')['num_non_critical'].max()

scores = df.groupby('zip')['score'].mean()
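The same per-zip summaries can also be collected into a single table and joined with the zip-level demographics, which is the shape a later model would need. This is a sketch on made-up data. The column name `percent_nonwhite` is an assumption about zipcodes.csv, and `toy_df` and `toy_zipcodes` are stand-ins for the real frames.

```python
import pandas as pd

# Toy stand-ins for df and zipcodes; all values are made up.
toy_df = pd.DataFrame({
    'zip': [27610, 27610, 27513, 27513, 27587],
    'num_critical': [4, 2, 1, 3, 0],
    'score': [92.0, 95.0, 97.5, 96.5, 99.0],
})
toy_zipcodes = pd.DataFrame({
    'zip': [27610, 27513, 27587],
    'percent_nonwhite': [70.0, 30.0, 20.0],  # assumed column name
})

# Named aggregation: one groupby call produces every summary column.
by_zip = toy_df.groupby('zip').agg(
    mean_critical=('num_critical', 'mean'),
    min_critical=('num_critical', 'min'),
    max_critical=('num_critical', 'max'),
    mean_score=('score', 'mean'),
).reset_index()

# Attach the demographic column for each zip code.
zip_summary = by_zip.merge(toy_zipcodes, on='zip', how='inner')
print(zip_summary)
```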

Then, I create a model of mean critical violations per zip code against the percentage of non-White residents per zip code. I also do the same for non-critical violations.

In [9]:
## CODE HERE
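Since this cell is still a placeholder, here is one way the model might look; it is not the project's actual model. The sketch fits a simple linear regression with `scipy.stats.linregress`, standing in for the patsy/statsmodels OLS that the imports suggest (with a single predictor, the slope and intercept are the same). The data are made up, and `percent_nonwhite` and `mean_critical` are assumed column names.

```python
import pandas as pd
from scipy.stats import linregress

# Made-up per-zip summary; the real table would come from the groupby results
# joined with zipcodes.
zip_summary = pd.DataFrame({
    'percent_nonwhite': [10.0, 20.0, 30.0, 40.0, 50.0],
    'mean_critical': [1.0, 1.8, 1.9, 2.7, 3.1],
})

# Least-squares fit of mean critical violations on percent non-White.
fit = linregress(zip_summary['percent_nonwhite'], zip_summary['mean_critical'])

print('slope:', fit.slope)
print('intercept:', fit.intercept)
print('r:', fit.rvalue)
print('p-value for slope:', fit.pvalue)
```

A positive slope with a small p-value would be consistent with the hypothesis. It would not by itself show that bias is the cause, since income and other zip-level factors are not controlled for here.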

Plotting the models gives a visual of how the non-White percentage per zip code relates to critical and non-critical violations.

In [10]:
## CODE HERE
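As a placeholder for the missing plot, the sketch below draws a scatter of made-up per-zip values with a least-squares line from `numpy.polyfit`. The column names `percent_nonwhite` and `mean_critical` are assumptions, and the numbers are illustrative rather than results.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Made-up per-zip summary, for illustration only.
zip_summary = pd.DataFrame({
    'percent_nonwhite': [10.0, 20.0, 30.0, 40.0, 50.0],
    'mean_critical': [1.0, 1.8, 1.9, 2.7, 3.1],
})

x = zip_summary['percent_nonwhite']
y = zip_summary['mean_critical']

# Degree-1 polynomial fit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(x, y, label='zip codes')

# Draw the fitted line across the observed range of x.
xs = np.linspace(x.min(), x.max(), 50)
ax.plot(xs, slope * xs + intercept, color='C1', label='least-squares fit')

ax.set_xlabel('Percent non-White residents')
ax.set_ylabel('Mean critical violations')
ax.legend()
```

With seaborn already imported in the setup, `sns.regplot(x='percent_nonwhite', y='mean_critical', data=zip_summary)` would be a shorter alternative that also draws a confidence band.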

I test the models with a two-tailed t-test to give a more concrete view of how strongly the non-White percentage per zip code relates to violations.

In [11]:
## CODE HERE
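One possible form for this test, sketched on made-up data: split the zip codes into two groups by non-White percentage and compare their mean critical violations with a two-sided Welch's t-test (`ttest_ind` with `equal_var=False`). The 50% threshold, the column names, and the choice of Welch's variant are all assumptions, not the project's settled method.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Made-up per-zip summary, for illustration only.
zip_summary = pd.DataFrame({
    'percent_nonwhite': [15.0, 22.0, 35.0, 41.0, 55.0, 63.0, 72.0, 80.0],
    'mean_critical': [1.2, 1.6, 1.4, 1.9, 2.3, 2.1, 2.8, 2.6],
})

# Assumed cutoff for "predominantly non-White".
threshold = 50.0
is_majority_nonwhite = zip_summary['percent_nonwhite'] > threshold

majority_nonwhite = zip_summary.loc[is_majority_nonwhite, 'mean_critical']
majority_white = zip_summary.loc[~is_majority_nonwhite, 'mean_critical']

# Two-sided by default; equal_var=False avoids assuming equal group variances.
t_stat, p_value = ttest_ind(majority_nonwhite, majority_white, equal_var=False)

print('t statistic:', t_stat)
print('p-value:', p_value)
```

With only 38 zip codes in the real data, each group would be small, so a result like this should be read cautiously.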

# Ethics & Privacy

I have scrubbed sensitive identifying information according to the Safe Harbor method. Unfortunately, I had to retain information about a restaurant's zip/postal code as it is vital to the question at hand.

While Safe Harbor allows zip code information to be kept (limited to the first three digits) when the area it covers has more than 20,000 residents, I have no way to confirm that the zip codes here meet that threshold. Thus, I cannot guarantee full anonymity for the restaurants in these datasets.

# Conclusion & Discussion

There would be a discussion here if I had finished this on time. Unfortunately, I did not.

If I had to guess, restaurants in predominantly non-White zip codes may have more critical violations, as well as lower Yelp! ratings, than other establishments.