# Revisiting Food-Safety Inspections from the Chicago Dataset - A Tutorial

## 1. Introduction

Foodborne illnesses afflict an estimated 48 million Americans each year, resulting in 128,000 hospitalizations and 3,000 fatalities [1]. City governments curb the spread of illness by enforcing stringent standards for food preparation, but often operate under heavy resource constraints. Building on work by the Chicago Department of Public Health, this tutorial explores a predictive model of food code violations that could provide real-time risk analysis to both public officials and the facilities they inspect.

### 1.1. Past Work

In 2014 the Chicago Department of Public Health (CDPH) and other local agencies worked to develop a model that would forecast critical violations, allowing the CDPH to prioritize inspections of facilities expected to be in violation of health codes [2] [3].

After exploring a number of datasets and features, the team found the strongest predictors of critical violations to be the inspector assigned, previous inspection outcomes, age at inspection, licenses for alcohol or tobacco, the temperature at time of inspection, nearby burglaries, nearby sanitation complaints and nearby garbage cart requests.

Using a glmnet model trained on data from January 2011 through January 2014, the team was able to demonstrate a data-driven inspections schedule that significantly reduced the time to find a critical violation. This model has since been adopted by the CDPH.

### 1.2. Premise

In light of the CDPH's success, we began this project intending to explore and extend their work. Whereas the Chicago team set out to predict critical violations, we are working to develop a series of models to forecast individual violations and to explore neural networks for the prediction model.

By developing a higher-resolution model we hope to provide actionable information not just to public officials but to restaurants themselves. Beyond their general interest in customers' health, restaurants are fined $500 for each critical violation and $250 for each serious violation found, and so are highly motivated to comply with health codes. A real-time violation risk analysis could thus give restaurants actionable information on an ongoing basis, improving inspection outcomes and safeguarding public health.

[COULD PUT SOME OF THE COST ESTIMATE, AVERAGE INSPECTIONS AND FAILS WORK HERE]

## 2. Pre-processing and Data Structuring

While we may bring in additional datasets as work progresses, we decided to start by preparing the datasets that the Chicago team found to be significant, translating from R to Python and adapting to suit our purposes as needed.

The datasets used are as follows:
* [food inspections](https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5/data)
* [business licenses](https://data.cityofchicago.org/Community-Economic-Development/Business-Licenses/r5kz-chrr/data)
* [garbage cart requests](https://www.cityofchicago.org/city/en/dataset/garbage_cart_stolenormissing.html)
* [sanitation code complaints](https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints/me59-5fac/data)
* [crimes](https://data.cityofchicago.org/Public-Safety/Crimes-2018/3i3m-jwuy/data)
* [weather](https://darksky.net/dev)

With the exception of weather data from darksky.net, all data used is available through the [Chicago Data Portal](https://data.cityofchicago.org/). We will not be working with data on the inspector assigned, as this is not publicly available and would not provide useful insight in practice.

We used a Jupyter Notebook running a Python 3 kernel to perform the initial data gathering, preprocessing, and structuring.

### 2.1. Preparing Food Inspections Data

To download the inspections data we used sodapy, a Python client for the Socrata Open Data API used by the Chicago Data Portal:

In [1]:
!pip install sodapy




In [2]:
# Adapted from https://github.com/socrata/dev.socrata.com/blob/39c6581986466edb5e7f72f5beea5ce69238f8de/snippets/pandas.py

import pandas as pd
from sodapy import Socrata

# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.cityofchicago.org", None)



Conveniently, sodapy converts all column names to snake case, removes special characters and standardizes date formats by default:

In [3]:
# First 50000 results, returned as JSON from the API
# Converted to a Python list of dictionaries by sodapy
# Column names converted to snake case, special characters removed
# Dates and locations formatted
results = client.get("4ijn-s7e5", limit=50000)

# Convert to pandas DataFrame
inspections = pd.DataFrame.from_records(results)

As Socrata restricts queries to 50000 entries we then paged through the remainder to access the full dataset: 

In [4]:
# Download remaining food inspections (limit 50000 / call)
start = 50000
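while results:
    print(start)
    results = client.get("4ijn-s7e5", limit=50000, offset=start)
    # DataFrame.append was removed in pandas 2.0, so use pd.concat;
    # ignore_index keeps the row index unique across pages
    inspections = pd.concat([inspections, pd.DataFrame.from_records(results)], ignore_index=True)
    start += 50000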

50000
100000
150000
200000
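
The paging pattern can also be sketched in isolation. This is a minimal illustration rather than the code used in this project: `fetch_page` is a hypothetical stand-in for `client.get`, and `fake_fetch` simulates a 120-row dataset so the loop runs without network access. Stopping as soon as a page comes back short avoids one extra, empty request.

```python
def fetch_all(fetch_page, page_size):
    """Collect every record from a paged source.

    fetch_page is a hypothetical callable taking (limit, offset) and
    returning a list of records, standing in for client.get.
    """
    records = []
    offset = 0
    while True:
        page = fetch_page(limit=page_size, offset=offset)
        records.extend(page)
        # A short (or empty) page means there is nothing left to fetch
        if len(page) < page_size:
            break
        offset += page_size
    return records


# Simulated dataset of 120 rows, used only to exercise the loop
FAKE_ROWS = [{"inspection_id": str(i)} for i in range(120)]


def fake_fetch(limit, offset):
    return FAKE_ROWS[offset:offset + limit]


all_rows = fetch_all(fake_fetch, page_size=50)
print(len(all_rows))
```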


After exploring the data and reading through the CDPH's work in R, we then applied a series of filters to produce a consistent and usable dataset:

In [5]:

# Remove trailing underscore (left over from sodapy conversion of "License #" column)
inspections.rename(columns={"license_": "license"}, inplace=True)

In [6]:
# Drop rows with missing data
inspections.dropna(subset=["inspection_date", "license"], inplace=True)

In [7]:
# Drop duplicates
inspections.drop_duplicates("inspection_id", inplace=True)

In [8]:
# Drop "0" licenses
inspections = inspections[inspections.license != "0"]

In [9]:
# Only consider successful inspections
inspections = inspections[~inspections.results.isin(["Out of Business", "Business Not Located", "No Entry"])]

In addition to semi-regular canvass inspections, facilities may also be inspected due to complaints or a recent failed inspection. To ensure that our data is representative of restaurants operating as usual, we filtered the dataset to consist only of canvass inspections:

In [10]:
# Only consider canvass inspections (not complaints or re-inspections)
inspections = inspections[inspections.inspection_type == "Canvass"]

We also filtered inspections by facility type, to eliminate the inconsistency of e.g. hospitals and schools, which follow different inspection schedules:

In [11]:
# Only consider restaurants and grocery stores (subject to change)
inspections = inspections[inspections.facility_type.isin(["Restaurant", "Grocery Store"])]

Finally we saved the resulting dataframe as a CSV file, ready for later use:

In [0]:
import os.path
root_path = os.path.dirname(os.getcwd())

# Save result
inspections.to_csv(os.path.join(root_path, "DATA/food_inspections.csv"), index=False)

### 2.2. Preparing Remaining Socrata Data

Downloading and filtering business licenses, garbage cart requests, sanitation complaints and crimes was much the same as with food inspections - for specifics please see the [CODE folder of our project repository](https://github.com/Sustainabilist/ChicagoDataAnalysis/tree/master/CODE). 

In brief:
* All datasets were formatted by sodapy and filtered to remove duplicates and missing data.
* Crimes were filtered to include only burglaries since 2010 to reduce size and ensure consistency.
* Garbage cart requests and sanitation complaints were filtered to include only completed or open requests.
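
Filters like these can also be pushed to the server instead of being applied after download. As a hedged sketch: sodapy's `client.get` accepts SoQL keyword arguments such as `where`, `limit` and `offset`. The column names `primary_type` and `date` below are assumptions about the crimes dataset and should be checked against its schema, and `burglary_query` is a hypothetical helper, not part of our repository.

```python
def burglary_query(since_year, limit=50000, offset=0):
    """Build keyword arguments for a sodapy client.get call.

    Assumes the crimes dataset exposes 'primary_type' and 'date' columns.
    """
    where = (
        "primary_type = 'BURGLARY' "
        f"AND date >= '{since_year}-01-01T00:00:00'"
    )
    return {"where": where, "limit": limit, "offset": offset}


params = burglary_query(2010)
print(params["where"])
# Usage (requires network access and the crimes dataset identifier):
# results = client.get(CRIMES_DATASET_ID, **params)
```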

### 2.3. Preparing Weather Data

COMING SOON!

### 2.4. Calculating Violation Data
From the violations column of the inspections dataset we derived a dataset pairing inspection ids with a binary value for the presence or absence of each violation. We then tallied the critical, serious and minor violations for each inspection.

Each entry in the violations column is made up of a number of violation/comment pairs joined into a single string:

In [15]:
inspections.iloc[1].violations

'31. CLEAN MULTI-USE UTENSILS AND SINGLE SERVICE ARTICLES PROPERLY STORED: NO REUSE OF SINGLE SERVICE ARTICLES - Comments: OBSERVED THE KNIFES IMPROPERLY STORED BETWEEN  WALL AND PREP TABLE IN REAR AND IN FRONT BETWEEN TWO COOLERS, INSTRUCTED TO PROVIDE A KNIFE RACK FOR PROPER STORAGE.  | 35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTRUCTED PER CODE: GOOD REPAIR, SURFACES CLEAN AND DUST-LESS CLEANING METHODS - Comments: OBSERVED DUSTY CEILING VENTS, INSTRUCTED TO CLEAN. | 33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSILS CLEAN, FREE OF ABRASIVE DETERGENTS - Comments: OBSERVED THE #10 CAN OPENER, AND THE FRONT CUTING BOARDS NOT CLEAN, INSTRUCTED TO CLEAN AND SANITIZE. ALSO CLEAN CHAR. GRILL DRIP TRAY.'

To turn this string into features, we first split each violations entry into a set of binary columns indicating the presence of each violation:

In [16]:
# Split violations into binary values for each violation
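def split_violations(violations):
    # Explicit dtype avoids the pandas warning for an empty Series
    values_row = pd.Series(dtype="float64")

    # Missing entries (NaN) are not strings and yield an empty row
    if isinstance(violations, str):
        for violation in violations.split(' | '):
            # Each entry starts with its violation number, e.g. "31. CLEAN ..."
            index = "v_" + violation.split('.')[0].strip()
            values_row[index] = 1
    return values_row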

In [17]:
# Takes roughly 5 minutes to run
values_data = inspections.violations.apply(split_violations)
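
To make the string format concrete, here is a small standalone sketch of the parsing step using only the standard library. It is an illustration, not the function used above: `parse_violations` is a hypothetical helper, and the sample string is a shortened entry in the same format as the dataset.

```python
def parse_violations(violations):
    """Split a violations string into {violation number: comment}.

    Entries are separated by ' | '; each starts with its number and a
    period, and the comment (if any) follows ' - Comments: '.
    """
    parsed = {}
    if not isinstance(violations, str) or not violations.strip():
        return parsed  # missing values (NaN) yield no violations
    for entry in violations.split(" | "):
        number, _, rest = entry.strip().partition(".")
        _, _, comment = rest.partition(" - Comments: ")
        parsed[int(number)] = comment.strip()
    return parsed


# Shortened example in the same format as the dataset
sample = (
    "31. CLEAN MULTI-USE UTENSILS PROPERLY STORED - Comments: "
    "PROVIDE A KNIFE RACK.  | 35. WALLS, CEILINGS CONSTRUCTED PER CODE "
    "- Comments: OBSERVED DUSTY CEILING VENTS, INSTRUCTED TO CLEAN."
)
parsed = parse_violations(sample)
# Binary flags named like the columns used in this section
flags = {"v_" + str(num): 1 for num in parsed}
print(flags)
```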

We then generated a series of column titles of the form "v_1" for violations 1-14 (critical), 15-29 (serious) and 30-44 plus 70 (minor):

In [18]:
# Generate column names
critical_columns = [("v_" + str(num)) for num in range(1, 15)]
serious_columns = [("v_" + str(num)) for num in range(15, 30)]
minor_columns = [("v_" + str(num)) for num in range(30, 45)]
minor_columns.append("v_70")

columns = critical_columns + serious_columns + minor_columns

These column headings were then combined with the violation values and paired with the inspection id for each row:

In [19]:
# Ensure no missing columns, fill NaN
values = values_data.reindex(columns=columns).fillna(0)

values['inspection_id'] = inspections.inspection_id

We created a separate dataframe for critical, serious and minor violation counts by pairing each inspection id with the sum of the appropriate subset of violation values:

In [20]:
# Count violations
counts = pd.DataFrame({
    "critical_count": values[critical_columns].sum(axis=1),
    "serious_count": values[serious_columns].sum(axis=1),
    "minor_count": values[minor_columns].sum(axis=1)
})

counts['inspection_id'] = inspections.inspection_id
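
As a cross-check on the column ranges, the same tally can be written without pandas. This is a sketch under the ranges stated above (1-14 critical, 15-29 serious, 30-44 plus 70 minor); `severity` and `count_by_severity` are hypothetical helper names.

```python
def severity(number):
    """Map a violation number to its severity, using the ranges above."""
    if 1 <= number <= 14:
        return "critical"
    if 15 <= number <= 29:
        return "serious"
    if 30 <= number <= 44 or number == 70:
        return "minor"
    raise ValueError(f"unexpected violation number: {number}")


def count_by_severity(numbers):
    """Tally violation numbers into critical, serious and minor counts."""
    counts = {"critical": 0, "serious": 0, "minor": 0}
    for number in numbers:
        counts[severity(number)] += 1
    return counts


# Violation numbers from the example inspection shown earlier
example_counts = count_by_severity([31, 35, 33])
print(example_counts)
```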

Lastly we saved both datasets for later use:

In [None]:
# Save violation values and counts
values.to_csv(os.path.join(root_path, "DATA/violation_values.csv"), index=False)
counts.to_csv(os.path.join(root_path, "DATA/violation_counts.csv"), index=False)

### 2.5. Calculating Heat Map Data
* Gaussian kernel density estimate (KDE)
* 90-day window
* heat maps for garbage cart requests, crimes and sanitation complaints

COMING SOON!
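
Since this section is still to come, the following is only a rough sketch of the idea in the notes above, not our implementation: a Gaussian kernel density estimate over events from the 90 days before an inspection. It uses only the standard library with a fixed, assumed bandwidth (in degrees); a library such as `scipy.stats.gaussian_kde` would normally choose the bandwidth from the data. All names and values here are hypothetical.

```python
import math
from datetime import date, timedelta


def kde_score(point, when, events, window_days=90, bandwidth=0.01):
    """Gaussian kernel density at `point` from events in the prior window.

    point: (latitude, longitude); when: date of the inspection.
    events: list of (latitude, longitude, date) tuples.
    bandwidth: kernel standard deviation in degrees (an assumed value).
    """
    start = when - timedelta(days=window_days)
    recent = [e for e in events if start <= e[2] < when]
    if not recent:
        return 0.0
    total = 0.0
    for lat, lon, _ in recent:
        sq_dist = (point[0] - lat) ** 2 + (point[1] - lon) ** 2
        total += math.exp(-sq_dist / (2 * bandwidth ** 2))
    # Normalize so the density integrates to 1 over the plane
    return total / (len(recent) * 2 * math.pi * bandwidth ** 2)


# Made-up events: two recent ones nearby, one outside the window
events = [
    (41.880, -87.630, date(2018, 3, 1)),
    (41.881, -87.631, date(2018, 3, 20)),
    (41.880, -87.630, date(2017, 1, 1)),
]
near = kde_score((41.880, -87.630), date(2018, 4, 1), events)
far = kde_score((41.980, -87.630), date(2018, 4, 1), events)
print(near > far)
```

Because the score is divided by the number of recent events, it describes where events concentrate rather than how many there were; a count-weighted variant would skip that division.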

## 3. Exploratory Analysis

### 3.1. Inspections Map
* Used folium to set up the interactive map layer
  * Added points to the map using latitude/longitude coordinates
    * Mapped business licenses
    * Mapped inspections
* Plotted violations using a Pareto chart package
  * Had to modify it to work in a Python 3 Jupyter notebook
  * Created a simple bar graph of violation counts

In [23]:
!pip install folium

# Import necessary packages 
import folium
from folium import plugins
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline



In [24]:
m = folium.Map([41.8600, -87.6298], zoom_start=10)

In [25]:
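# Coordinates arrive from the API as strings, so convert them to numbers
# and drop rows without a location before plotting
coords = inspections[["latitude", "longitude"]].apply(pd.to_numeric, errors="coerce").dropna()
coords.apply(lambda row: folium.CircleMarker(location=[row.latitude, row.longitude],
                                             radius=.5, fill_color="grey")
                                             .add_to(m), axis=1)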

In [None]:
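# Convert to (n, 2) nd-array format for heatmap, coercing the string
# coordinates to numbers and dropping rows without a location
inspections_arr = inspections[["latitude", "longitude"]].apply(pd.to_numeric, errors="coerce").dropna().values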

# Plot heatmap
m.add_child(plugins.HeatMap(inspections_arr.tolist(), radius=17))

### 3.2. Comments Word-cloud
We split each inspection into a series of rows, one for each violation and its comment. We then used wordcloud and matplotlib.pyplot to generate a word cloud from the violation comments:
[WORDCLOUD HERE]
Although qualitative, the results point to cleaning, maintenance & prep area issues as the most common violations.
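
A word cloud visualizes word frequencies, so a rough numeric counterpart is a frequency count over the comment text. The sketch below uses only the standard library; the comments are made up in the style of the dataset, and the stop-word list is a tiny assumed one rather than the list the wordcloud package applies.

```python
import re
from collections import Counter

# A small, assumed stop-word list; a real analysis would use a fuller one
STOP_WORDS = {"the", "and", "to", "in", "of", "at", "on", "a"}


def word_counts(comments):
    """Count words across comments, ignoring case and stop words."""
    counts = Counter()
    for comment in comments:
        words = re.findall(r"[a-z]+", comment.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts


# Made-up comments in the style of the dataset
comments = [
    "OBSERVED DUSTY CEILING VENTS, INSTRUCTED TO CLEAN.",
    "OBSERVED THE CUTTING BOARDS NOT CLEAN, INSTRUCTED TO CLEAN AND SANITIZE.",
]
top = word_counts(comments).most_common(3)
print(top)
```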

### 3.3. Violations Pareto-chart
We used the same distribution of individual violations to generate a Pareto chart, and read it alongside a [table detailing violation numbers and corresponding health standards](https://webapps1.cityofchicago.org/healthinspection/Code_Violations.jsp) to gain a more quantitative perspective.

[PUT PARETO CHART HERE]

The six most common violations (comprising over 60% of all violations) were all minor and pertained to cleanliness and maintenance of floors, walls & ceilings; sanitation & maintenance of utensils and equipment; proper venting; and proper organization of food and cleaning supplies.

The most common serious violation was improper control of vermin, and the most common critical violations related to improper storage of hot and cold foods.
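
The numbers behind a Pareto chart are counts sorted in descending order together with their cumulative share of the total. A minimal sketch of that calculation follows; the counts are made up for illustration and are not results from the dataset, and `pareto_table` is a hypothetical helper rather than the chart package we used.

```python
def pareto_table(counts):
    """Return (label, count, cumulative percent) rows, largest count first."""
    total = sum(counts.values())
    rows = []
    running = 0
    for label, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        running += count
        rows.append((label, count, 100 * running / total))
    return rows


# Made-up counts for illustration only (not results from the dataset)
toy_counts = {"v_33": 20, "v_34": 40, "v_18": 10, "v_35": 30}
for label, count, cum_pct in pareto_table(toy_counts):
    print(f"{label:>5} {count:>4} {cum_pct:6.1f}%")
```

Reading down the cumulative column shows how few violation types account for a given share of all violations, which is how a statement like "the six most common violations comprise over 60%" is obtained.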

### 3.4. Inspection History
The Chicago team found that one of the greatest predictors of critical violations and failed inspections was the establishment's recent inspection history. To investigate this, we grouped inspections by license, shifted each group to find the previous inspection, and built a contingency table using crosstab. We found that [RESULTS SEEM WEIRD, PROBABLY DON'T INCLUDE?]
* P(fail | fail) = 0.11
* P(fail | pass) = 0.25
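
The group, shift and crosstab steps can be sketched on toy data as follows. The rows are made up and the column names follow the inspections dataset; this is an illustration of the technique, not the exact code behind the figures above. Normalizing by row makes each cell read as P(current outcome | previous outcome).

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "license": ["A", "A", "A", "B", "B", "C"],
    "inspection_date": pd.to_datetime([
        "2016-01-10", "2016-07-02", "2017-01-15",
        "2016-03-01", "2017-03-05", "2016-05-20",
    ]),
    "results": ["Fail", "Pass", "Fail", "Pass", "Pass", "Fail"],
})

# Sort so that shifting within a license yields the previous inspection
toy = toy.sort_values(["license", "inspection_date"])
toy["failed"] = (toy["results"] == "Fail").astype(int)
toy["past_failed"] = toy.groupby("license")["failed"].shift(1)

# Drop first inspections, which have no history to condition on
history = toy.dropna(subset=["past_failed"])
history = history.assign(past_failed=history["past_failed"].astype(int))

# Rows: previous outcome; columns: current outcome; each row sums to 1
table = pd.crosstab(history["past_failed"], history["failed"], normalize="index")
print(table)
```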

### 3.5. Opportunity Research
We wanted to know how valuable a violation forecast would be to businesses. For example, if businesses rarely failed an inspection and only failed in summer, additional information would not be very useful.

To explore this we first grouped inspections by license, and then for each license group we determined the number of fails, inspection duration, yearly inspections and yearly fails.
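
A sketch of this per-license summary on made-up data is shown below. The definitions are assumptions made for illustration: inspection duration is taken as the span between a license's first and last inspection, yearly rates divide by that span, and only histories longer than 6 months are kept. The restriction to currently operating restaurants is not shown.

```python
import pandas as pd

# Made-up inspections for two licenses
toy = pd.DataFrame({
    "license": ["A", "A", "A", "B", "B"],
    "inspection_date": pd.to_datetime([
        "2015-01-01", "2016-01-01", "2017-01-01",
        "2016-06-01", "2017-06-01",
    ]),
    "results": ["Fail", "Pass", "Fail", "Pass", "Pass"],
})
toy["failed"] = toy["results"] == "Fail"

per_license = toy.groupby("license").agg(
    inspections=("inspection_date", "count"),
    fails=("failed", "sum"),
    first_date=("inspection_date", "min"),
    last_date=("inspection_date", "max"),
)
# Length of the inspection history in years (365.25 days per year)
span = per_license["last_date"] - per_license["first_date"]
per_license["years"] = span.dt.days / 365.25

# Keep only histories longer than 6 months
per_license = per_license[per_license["years"] > 0.5].copy()
per_license["yearly_inspections"] = per_license["inspections"] / per_license["years"]
per_license["yearly_fails"] = per_license["fails"] / per_license["years"]
print(per_license[["years", "yearly_inspections", "yearly_fails"]])
```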

Using crosstab revealed a number of patterns:
[INTERSPERSE WITH CROSSTAB]

For this exploration, only current restaurants with an inspection history longer than 6 months were considered. Of these, roughly 65% were inspected 1-2 times per year and failed 0-0.5 times per year.
[WHAT ABOUT PRIORITY?]

Looking at yearly fails against years of inspection history, most restaurants failed 0-0.5 times per year (including somewhat more of the established restaurants). Many failed 0.5-1.5 times per year (including slightly more new restaurants). A smaller group failed more than 1.5 times per year, and almost all of these were new. This suggests a die-off, with restaurants that fail more inspections not lasting as long.

Overall, it seems as though enough restaurants fail at least once a year that a predictive tool would be worthwhile (IS THAT THE CASE?).


## 4. Calculating Model Data
We then set out to derive all features originally considered by the Chicago team plus several novel features, translating from R to Python in the process. While the relative weights may differ when forecasting individual violations rather than the number of critical violations, we expect the significant databases and [NEEDS REPHRASING & STUFF]

[MAYBE HAVE BULLET POINTS FOR DATABASES, FEATURES]

Each row of the final training dataset represents an inspection, its outcomes, and a number of other features. [SOMETHING ABOUT TRAINING METHODOLOGY, MAYBE ABOUT FIGURING OUT WHICH FEATURES MATTER FOR WHICH]

We used the inspections database as the basis for our training data.

### 4.1. Features from Food Inspection Data
* used subset of inspections as basis
* merged with violation values & counts on inspection id
* created pass / fail flags
* sorted by date, shifted to get previous info
  * past fail
  * past critical, serious, minor
  * inspections with no past default to 0
* calculate time since last
  * binary first_record
  * time since last defaults to 2
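
The outline above can be sketched on toy data as follows. This is an illustration under stated assumptions, not our final feature code: the rows are made up, `failed` and `critical_count` stand in for the merged outcome columns, and the unit of the default value 2 is assumed to be years.

```python
import pandas as pd

# Made-up rows; 'failed' and 'critical_count' stand in for merged outcomes
toy = pd.DataFrame({
    "license": ["A", "A", "B"],
    "inspection_date": pd.to_datetime(["2016-01-01", "2016-07-01", "2016-03-01"]),
    "failed": [1, 0, 1],
    "critical_count": [2, 0, 1],
})
toy = toy.sort_values(["license", "inspection_date"]).reset_index(drop=True)

# Previous-inspection outcomes; inspections with no history default to 0
toy["past_fail"] = toy.groupby("license")["failed"].shift(1).fillna(0).astype(int)
toy["past_critical"] = (
    toy.groupby("license")["critical_count"].shift(1).fillna(0).astype(int)
)

# Days since the previous inspection of the same license (NaN if none)
previous_date = toy.groupby("license")["inspection_date"].shift(1)
days_since = (toy["inspection_date"] - previous_date).dt.days

# Flag first records, then default their time since last inspection to 2
# (the unit, years, is an assumption)
toy["first_record"] = days_since.isna().astype(int)
toy["time_since_last"] = (days_since / 365.25).fillna(2)
print(toy)
```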
  
### 4.2. Features from Business License Data

### 4.3. Features from Heat Maps

### 4.4. Features from Weather

[1] https://www.cdc.gov/foodborneburden/index.html

[2] https://chicago.github.io/food-inspections-evaluation/

[3] https://github.com/Chicago/food-inspections-evaluatio]

## 5. Next Steps
COMING SOON!

## About the Authors