# Revisiting Food-Safety Inspections from the Chicago Dataset - A Tutorial (Part 2)
David Lewis, Russell Hofvendahl, Jason Trager
* I switched name order here and put my bio second at the bottom

## 0. Foreword
* probably touch this up

Sustainabilist often works on data that is related to quality assurance and control (QA/QC) inspections of public or private infrastructure. Typically, this infrastructure takes the form of solar energy systems or energy efficiency upgrades for buildings. These data sets almost exclusively belong to private entities that have commissioned a study to evaluate how safe and/or well-installed the infrastructure that they financed is. For this reason, it has been very difficult to publish anything about how our work is conducted or to document publicly what kind of analysis we do.

Enter Epicodus, a coding bootcamp in Portland, OR. Several weeks ago, I met David and Russell - two eager students who were just learning how to code. They were attending the first meeting of CleanWeb Portland, which Sustainabilist organized. We were talking about the lack of public datasets in sustainability, and I mentioned how Chicago’s food inspection data set was very similar to many of the QA/QC data sets that I have looked at. Just like that, a project was born.

The coding work demonstrated herein is 100% that of the student interns, under my guidance for how to structure, examine, and explore the data. The work was conducted using Google Colaboratory, IPython notebooks, and Anaconda’s scientific computing packages.

## 1. Review
* foreword?
* To prevent foodborne illness, inspectors enforce stringent food codes, sometimes with the help of predictive violation models
* We seek to expand the work of the CDPH, exploring high-resolution predictions and neural nets
* We want to focus on helping restaurants prevent illness and avoid costly violations
* We cleaned and pre-processed data from the following sources (databases)
* ...(probably more stuff)

## 2. Feature engineering
* something on how the model works, what we're building it for, the thing about blinding the model to outcome and then comparing it to actual outcome
* how by training the model to guess the outcome of canvass inspections we're building a tool that we can feed the same parameters at any time to guess the outcome of a simulated canvass inspection
* Something on feature selection, why it makes sense to try out what we're trying out
* should we explain features here or below? idk

## 3. Food Inspection Features
* load inspections and select what we want from it to use as basis for model data
* Something on what this data is, where it comes from, why we're using it?

In [1]:
import numpy as np
import pandas as pd
import os.path

root_path = os.path.dirname(os.getcwd())

# Load food inspection data
inspections = pd.read_csv(os.path.join(root_path, "DATA/food_inspections.csv"))

# Create basis for model_data
data = inspections.loc[:, ["inspection_id", "license", "inspection_date", "facility_type"]]

### 3.1. Pass / Fail Flags
* pass fail flags denote inspection outcome, this is something that will be "covered" so model can guess it
* converted to individual presence/absence flags to help with something or other (what and why specifically?)

In [2]:
# Create pass / fail flags
data["pass_flag"] = inspections.results.apply(lambda x: 1 if x == "Pass" else 0)
data["fail_flag"] = inspections.results.apply(lambda x: 1 if x == "Fail" else 0)
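
As a minimal sketch of what these flags look like, the toy frame below builds the same two columns with a vectorized comparison instead of `apply`. The `toy` frame and its result strings are made up for illustration and are not taken from the inspections file.

```python
import pandas as pd

# Toy results column (illustrative values only)
toy = pd.DataFrame({"results": ["Pass", "Fail", "Pass w/ Conditions", "Pass"]})

# Compare the whole column at once, then cast booleans to 0/1
toy["pass_flag"] = (toy.results == "Pass").astype(int)
toy["fail_flag"] = (toy.results == "Fail").astype(int)

# A result that is neither "Pass" nor "Fail" gets 0 in both columns
print(toy)
```

Note that any result other than an exact "Pass" or "Fail" ends up with both flags set to 0, which is worth keeping in mind when interpreting the model targets.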

### 3.2. Facility Risk Flags
* Facilities like restaurants pose greater risk than packaged food kiosks and are given higher risk levels
* Higher risk levels mean greater inspection frequency also (unsure if this is relevant)
* Again converted to numeric form to fit with (specs? what?)

In [3]:
# Create risk flags
data["risk_1"] = inspections.risk.apply(lambda x: 1 if x == "Risk 1 (High)" else 0)
data["risk_2"] = inspections.risk.apply(lambda x: 1 if x == "Risk 2 (Medium)" else 0)
data["risk_3"] = inspections.risk.apply(lambda x: 1 if x == "Risk 3 (Low)" else 0)

### 3.3. Violation Data
* Violation data is also something the model will be guessing, another part of the inspection outcome
* The data consists of a bunch of rows (representing inspection outcomes) with binary values for whether a specific health code was violated in that inspection
* Merged on inspection ID (each row of data is matched and merged with a violation data row with same ID. rows with no matches are excluded.)


In [4]:
# Load violation data
values = pd.read_csv(os.path.join(root_path, "DATA/violation_values.csv"))
counts = pd.read_csv(os.path.join(root_path, "DATA/violation_counts.csv"))

# Merge with violation data, filtering missing data
data = pd.merge(data, values, on="inspection_id")
data = pd.merge(data, counts, on="inspection_id")
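
To make the filtering effect of the merge concrete, here is a small sketch on made-up frames (`left`, `right` and their values are hypothetical). It shows that the default inner join silently drops unmatched inspections, and how an outer merge with `indicator=True` can be used to audit how many rows would be lost.

```python
import pandas as pd

left = pd.DataFrame({"inspection_id": [1, 2, 3], "license": [10, 11, 12]})
right = pd.DataFrame({"inspection_id": [1, 3, 4], "critical_count": [0, 2, 1]})

# pd.merge defaults to how="inner": only IDs present in both frames survive
merged = pd.merge(left, right, on="inspection_id")
print(merged.inspection_id.tolist())

# An outer merge with indicator=True labels each row by where it was found
audit = pd.merge(left, right, on="inspection_id", how="outer", indicator=True)
print(audit["_merge"].value_counts())
```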

### 3.4. Past Fails
* Past fails refers to the previous inspection outcome for that license (as a binary flag)
* This is a strong predictor of inspection outcomes
* Past fails is something the model will have access to when predicting inspection outcomes, and will be used to guess the actual and current outcome.
* We first create a dataframe of past data by arranging inspections chronologically, grouping by license and shifting each group of inspections by 1, so that the data for each inspection lines up with the row of the next inspection (the first row for each license will be empty and the last inspection is not used). The pre-grouping order is preserved upon shifting.
* (this could use visualization)
* We can then simply attach the fail_flag column to our data as past fails, setting the empty first value as 0 (no previous fail)

In [None]:
# Sort inspections by date
data.sort_values(by="inspection_date", inplace=True)

# Find previous inspections by shifting each license group
past_data = data.groupby("license").shift(1)

# Add past fails, with 0 for first inspections
data["past_fail"] = past_data.fail_flag.fillna(0)
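
The shift can be easier to follow on a handful of rows. The sketch below uses an invented `toy` frame with two licenses; the dates and flags are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "license": [1, 2, 1, 2, 1],
    "inspection_date": ["2015-01-10", "2015-02-01", "2016-01-12",
                        "2016-03-05", "2017-01-09"],
    "fail_flag": [1, 0, 0, 1, 0],
})
toy = toy.sort_values("inspection_date")

# Within each license, every row receives the fail_flag of the previous
# inspection; first inspections have no predecessor and are filled with 0
toy["past_fail"] = toy.groupby("license").fail_flag.shift(1).fillna(0).astype(int)
print(toy)
```

License 1 failed its first inspection, so only its second inspection carries `past_fail == 1`; the final failure of license 2 is never used because no later inspection follows it.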

### 3.5. Past Violation Data
* individual past violation values might well be good for predicting individual violations (eg watch out mr. restaurant, you violated these codes last inspection so you're at risk for them)
* We can use the same past_data to get past violation values
* We'll modify the names to pv_1, etc
* If we drop inspection_id we can just tack them on to the end of the data using join
* first records are set to 0 (no past violation)
* For past_critical, past_serious and past_minor we can similarly just grab each column and add it as a new column in data

In [None]:
# Select past violation values, remove past inspection id
past_values = past_data[values.columns].drop("inspection_id", axis=1).add_prefix("p")

# Add past values to model data, with 0 for first records
data = data.join(past_values.fillna(0))

In [None]:
# Add past violation counts, with 0 for first records
data["past_critical"] = past_data.critical_count.fillna(0)
data["past_serious"] = past_data.serious_count.fillna(0)
data["past_minor"] = past_data.minor_count.fillna(0)
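
A compact sketch of the prefix-and-join step follows. The violation column names `v_1` and `v_2` are assumptions made for illustration (chosen so that the prefix yields `pv_1`, `pv_2`); the real column names come from the violation values file.

```python
import pandas as pd

toy = pd.DataFrame({
    "inspection_id": [100, 101, 102],
    "license": [1, 1, 1],
    "v_1": [1, 0, 1],   # hypothetical violation flag columns
    "v_2": [0, 0, 1],
})

# Shift within each license so each row sees the previous inspection
past = toy.groupby("license").shift(1)

# Keep only the violation columns, rename v_* to pv_*, fill first records
past_values = past[["v_1", "v_2"]].add_prefix("p").fillna(0).astype(int)

# join aligns on the row index, which shift leaves unchanged
toy = toy.join(past_values)
print(toy)
```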

### 3.6. Time Since Last
* One potential risk factor is greater time since last inspection (do we say we got this from Chicago team or just give our own justification?)
* To calculate this, convert each inspection date to a pandas datetime, subtract the previous inspection's datetime from the current one to create a series of timedelta objects, and convert to years.
* For first inspections, which have no previous date, the value defaults to two years.

In [None]:
# Calculate time since previous inspection
deltas = pd.to_datetime(data.inspection_date) - pd.to_datetime(past_data.inspection_date)

# Add years since previous inspection (default to 2)
data["time_since_last"] = deltas.apply(lambda x: x.days / 365.25).fillna(2)
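
The same calculation can be sketched without `apply` by using the `.dt` accessor. The `toy` frame below is invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "license": [1, 1, 2],
    "inspection_date": ["2015-01-01", "2016-01-01", "2016-06-01"],
})

dates = pd.to_datetime(toy.inspection_date)
previous = dates.groupby(toy.license).shift(1)

# .dt.days is NaN where there is no previous inspection, so fillna
# supplies the two-year default for first inspections
toy["time_since_last"] = ((dates - previous).dt.days / 365.25).fillna(2)
print(toy)
```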

### 3.7. First Record
* Actually not sure why this would matter in predicting outcomes? (check)
* Maybe first records are more likely to fail?
* To get it we simply put 1s for rows where data is absent in the shifted past_data.

In [None]:
# Check if first record
data["first_record"] = past_data.inspection_id.map(lambda x: 1 if pd.isnull(x) else 0)
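
As a quick sanity check on this flag, the sketch below derives it two ways on an invented `toy` frame: from the nulls left by the shift, and from `cumcount`, which numbers each license's rows starting at 0.

```python
import pandas as pd

toy = pd.DataFrame({
    "inspection_id": [100, 101, 102, 103],
    "license": [1, 2, 1, 1],
})

# Nulls in the shifted frame mark rows with no previous inspection
past = toy.groupby("license").shift(1)
toy["first_record"] = past.inspection_id.isnull().astype(int)

# Cross-check: the first row of each license has cumcount 0
check = (toy.groupby("license").cumcount() == 0).astype(int)
print(toy.assign(check=check))
```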

## 4. Business License Features
* These are the features derived from the business license dataset
* What is a business license? other background info?

### 4.1. Matching Inspections with Licenses
* Load data, see publication 1

In [None]:
# Load business license data
licenses = pd.read_csv(os.path.join(root_path, "DATA/business_licenses.csv"))

* In order to link food inspections to the business licenses of the facilities inspected we create a table of matches, each linking an inspection to a license
* Many business licenses can be matched by license number to an inspection, but to account for license discrepancies we also matched based on venue (street address and name)
* Due to formatting differences it was necessary to use only the street number

In [None]:
# Business licenses have numbers on end preventing simple match
# so using street number instead
def get_street_number(address):
    return address.split()[0]

licenses["street_number"] = licenses.address.apply(get_street_number)
inspections["street_number"] = inspections.address.apply(get_street_number)

# Match based on DBA name and street number
venue_matches = pd.merge(inspections, licenses, left_on=["dba_name", "street_number"], right_on=["doing_business_as_name", "street_number"])

# Match based on license numbers
license_matches = pd.merge(inspections, licenses, left_on="license", right_on="license_number")
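
To illustrate why the street number is used rather than the full address, here is a sketch on two invented frames in which the license addresses carry a trailing suffix. All names and addresses are hypothetical.

```python
import pandas as pd

toy_inspections = pd.DataFrame({
    "inspection_id": [1, 2],
    "dba_name": ["CAFE A", "DINER B"],
    "address": ["123 N STATE ST", "45 W LAKE ST"],
})
toy_licenses = pd.DataFrame({
    "id": ["L-1", "L-2"],
    "doing_business_as_name": ["CAFE A", "DINER B"],
    "address": ["123 N STATE ST 1", "47 W LAKE ST 2"],
})

# First whitespace-separated token; missing addresses stay NaN
toy_inspections["street_number"] = toy_inspections.address.str.split().str[0]
toy_licenses["street_number"] = toy_licenses.address.str.split().str[0]

# Full addresses never match here, but name plus street number does
venue = pd.merge(toy_inspections, toy_licenses,
                 left_on=["dba_name", "street_number"],
                 right_on=["doing_business_as_name", "street_number"])
print(venue[["inspection_id", "id", "street_number"]])
```

Only CAFE A matches, because DINER B has a different street number in the two frames. One caveat: pandas matches null keys to each other in a merge, so rows with missing addresses may need to be dropped first.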

* to create the working matches dataset we then appended venue and licence matches and dropped any duplicate inspection / business licence matches.

In [None]:


# Join matches, reset index, drop duplicates
matches = pd.concat([venue_matches, license_matches], sort=False)
matches.reset_index(drop=True, inplace=True)
matches.drop_duplicates(["inspection_id", "id"], inplace=True)

# Restrict to matches where inspection falls within license period
matches = matches.loc[matches.inspection_date.between(matches.license_start_date, matches.expiration_date)]
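
The combine, de-duplicate and date-filter steps can be sketched on small invented tables as follows. This version converts the date columns to datetimes before calling `between`, since comparing raw date strings is lexicographic and is only safe for ISO-formatted dates. The frames `venue` and `by_number` and their values are hypothetical.

```python
import pandas as pd

# Two match tables sharing one duplicate pair (inspection 1, license L-1)
venue = pd.DataFrame({
    "inspection_id": [1, 2],
    "id": ["L-1", "L-2"],
    "inspection_date": ["2015-06-01", "2018-01-15"],
    "license_start_date": ["2015-01-01", "2016-01-01"],
    "expiration_date": ["2015-12-31", "2016-12-31"],
})
by_number = pd.DataFrame({
    "inspection_id": [1, 3],
    "id": ["L-1", "L-3"],
    "inspection_date": ["2015-06-01", "2017-03-01"],
    "license_start_date": ["2015-01-01", "2017-01-01"],
    "expiration_date": ["2015-12-31", "2017-12-31"],
})

# Stack both tables and keep one row per inspection / license pair
combined = pd.concat([venue, by_number], ignore_index=True)
combined = combined.drop_duplicates(["inspection_id", "id"])

for column in ["inspection_date", "license_start_date", "expiration_date"]:
    combined[column] = pd.to_datetime(combined[column])

# Keep matches where the inspection falls inside the license period
in_period = combined.inspection_date.between(
    combined.license_start_date, combined.expiration_date)
valid = combined.loc[in_period]
print(valid[["inspection_id", "id"]])
```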

### 4.2. Filtering by Category
* (This isn't a feature but is only convenient to do once we have the matches dataset. what to do?)
* many non-retail establishments eg schools, hospitals follow different inspection schedules, so to ensure consistent data we filter matches to include only inspections of retail food establishments
* to do this we select the inspection id's of all retail matches, drop any duplicates and merge these id's with the model data
* by default merge includes only rows with keys present in each dataset (inner join)

In [None]:
# Select retail food establishment inspection IDs
retail = matches.loc[matches.license_description == "Retail Food Establishment", ["inspection_id"]]
retail.drop_duplicates(inplace=True)

# FILTER: ONLY CONSIDER INSPECTIONS MATCHED WITH RETAIL LICENSES
data = pd.merge(data, retail, on="inspection_id")
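
An equivalent way to express this filter, sketched on invented frames, is a boolean mask with `isin`. It keeps the same rows as the inner merge without adding or reordering columns; `toy_matches` and `toy_data` are hypothetical.

```python
import pandas as pd

toy_matches = pd.DataFrame({
    "inspection_id": [1, 1, 2, 3],
    "license_description": ["Retail Food Establishment", "Tavern",
                            "Tobacco", "Retail Food Establishment"],
})
toy_data = pd.DataFrame({"inspection_id": [1, 2, 3], "fail_flag": [0, 1, 0]})

# Unique IDs of inspections matched to at least one retail food license
is_retail = toy_matches.license_description == "Retail Food Establishment"
retail_ids = toy_matches.loc[is_retail, "inspection_id"].unique()

# Keep only model rows whose inspection ID is in that set
filtered = toy_data[toy_data.inspection_id.isin(retail_ids)]
print(filtered)
```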

### 4.3. Calculating Age at Inspection
* What might age at inspection tell?
* One feature previously found significant in predicting inspection outcomes is the age of the facility
* To calculate this we first convert all dates to datetime objects
* We then group by licence and within each group find the earliest license start date
* Finally we subtract this min date from the inspection date and merge the resulting age in with our model data

In [None]:
# Convert dates to datetime format
matches.inspection_date = pd.to_datetime(matches.inspection_date)
matches.license_start_date = pd.to_datetime(matches.license_start_date)

def get_age_data(group):
    min_date = group.license_start_date.min()
    deltas = group.inspection_date - min_date
    group["age_at_inspection"] = deltas.apply(lambda x: x.days / 365.25)
    return group[["inspection_id", "age_at_inspection"]]

# Calculate (3 mins), drop duplicates
age_data = matches.groupby("license").apply(get_age_data).drop_duplicates()

# Merge in age_at_inspection
data = pd.merge(data, age_data, on="inspection_id", how="left")
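
Since the grouped `apply` above is noted as taking a few minutes, a vectorized variant may be worth trying. The sketch below, on an invented `toy` frame, uses `transform("min")` to broadcast each license's earliest start date back onto its rows. It is an alternative formulation, not the method used in the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "inspection_id": [1, 2, 3],
    "license": [10, 10, 20],
    "inspection_date": pd.to_datetime(["2016-01-01", "2017-01-01", "2017-06-01"]),
    "license_start_date": pd.to_datetime(["2015-01-01", "2016-01-01", "2017-01-01"]),
})

# Earliest license start date per license, aligned row by row
first_start = toy.groupby("license").license_start_date.transform("min")

# Age in days, then in years
days = (toy.inspection_date - first_start).dt.days
toy["age_at_inspection"] = days / 365.25
print(toy[["inspection_id", "age_at_inspection"]])
```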

### 4.4. Calculating Category Data
* The Chicago team found the categories of licenses attributed to an establishment to be significant in predicting violation outcomes
* This data is derived from the license_description column of the business licenses dataset
* We will be noting the presence or absence of these categories as a series of binary flags
* To derive these features we first set up a dictionary linking the column entries to our desired snake case column titles
* We then group matches by inspection id to gather all licence descriptions for each inspection
* To generate the entries we apply our get_category_data function, using our dictionary to translate from license_description entries to column titles
* Finally we fill missing entries as 0 and merge the results in with our model data

In [None]:
# Translate categories to snake-case titles
categories = {
    "Consumption on Premises - Incidental Activity": "consumption_on_premises_incidental_activity",
    "Tobacco": "tobacco",
    "Package Goods": "package_goods",
    "Limited Business License": "limited_business_license",
    "Outdoor Patio": "outdoor_patio",
    "Public Place of Amusement": "public_place_of_amusement",
    "Children's Services Facility License": "childrens_services_facility_license",
    "Tavern": "tavern",
    "Regulated Business License": "regulated_business_license",
    "Filling Station": "filling_station",
    "Caterer's Liquor License": "caterers_liquor_license",
    "Mobile Food License": "mobile_food_license"
}

# Create binary markers for license categories
def get_category_data(group):
    df = group[["inspection_id"]].iloc[[0]]
    for category in group.license_description:
        if category in categories:
            df[categories[category]] = 1
    return df
    
# group by inspection, get categories (2 mins)
category_data = matches.groupby("inspection_id").apply(get_category_data)

# Reset index, set absent categories to 0
category_data.reset_index(drop=True, inplace=True)
category_data.fillna(0, inplace=True)

# Merge in category data, fill nan with 0
data = pd.merge(data, category_data, on="inspection_id", how="left").fillna(0)
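
The same presence/absence flags can also be sketched with `pd.crosstab`, which avoids the per-group `apply`. The example uses a shortened, three-entry version of the dictionary and invented match rows; it is an alternative under those assumptions, not the method used above.

```python
import pandas as pd

# Shortened category dictionary for illustration
toy_categories = {"Tavern": "tavern", "Tobacco": "tobacco",
                  "Outdoor Patio": "outdoor_patio"}
toy_matches = pd.DataFrame({
    "inspection_id": [1, 1, 2, 3],
    "license_description": ["Tavern", "Outdoor Patio", "Tobacco",
                            "Retail Food Establishment"],
})
toy_data = pd.DataFrame({"inspection_id": [1, 2, 3]})

# Keep only descriptions we have a column title for
known = toy_matches[toy_matches.license_description.isin(list(toy_categories))]

# Count categories per inspection, then cap counts at 1 to get flags
flags = pd.crosstab(known.inspection_id,
                    known.license_description.map(toy_categories)).clip(upper=1)

# Ensure every category has a column, even if it never occurs
flags = flags.reindex(columns=list(toy_categories.values()), fill_value=0)
flags = flags.reset_index()

# Left merge keeps inspections without any listed category (all zeros)
result = pd.merge(toy_data, flags, on="inspection_id", how="left")
result = result.fillna(0).astype(int)
print(result)
```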

## 5. Density Estimates for Crime, Garbage Carts, Complaints

In [None]:
# Load observation datasets
burglaries = pd.read_csv(os.path.join(root_path, "DATA/burglaries.csv"))
carts = pd.read_csv(os.path.join(root_path, "DATA/garbage_carts.csv"))
complaints = pd.read_csv(os.path.join(root_path, "DATA/sanitation_complaints.csv"))

In [None]:
# Create datetime columns
inspections["datetime"] = pd.to_datetime(inspections.inspection_date)
burglaries["datetime"] = pd.to_datetime(burglaries.date)
carts["datetime"] = pd.to_datetime(carts.creation_date)
complaints["datetime"] = pd.to_datetime(complaints.creation_date)

In [None]:
# FILTER: consider only inspections since 2012
# Otherwise early inspections have few/no observations within window
inspections = inspections.loc[inspections.datetime >= "2012-01-01"]

In [None]:
from datetime import datetime, timedelta
from scipy import stats

def get_kde(observations, column_name, window, bandwidth):

    # Sort chronologically and index by datetime
    observations.sort_values("datetime", inplace=True)
    observations.index = observations.datetime.values
    
    # Generate kernel from the `window` days of observations up to each inspection date
    def get_kde_given_date(group):
        stop = group.datetime.iloc[0]
        start = stop - timedelta(days=window)
        recent = observations.loc[start:stop]
        
        x1 = recent.longitude
        y1 = recent.latitude
        values = np.vstack([x1, y1])
        kernel = stats.gaussian_kde(values)

        x2 = group.longitude
        y2 = group.latitude
        samples = np.vstack([x2, y2])
        group[column_name] = kernel(samples)
        return group[["inspection_id", column_name]]

    # Group inspections by date, generate kernels, sample
    return inspections.groupby("inspection_date").apply(get_kde_given_date)

In [None]:
# Calculate burglary density estimates
burglary_kde = get_kde(burglaries, "burglary_kde", 90, 1)

# Calculate garbage cart density estimates
cart_kde = get_kde(carts, "cart_kde", 90, 1)

# Calculate sanitation complaint density estimates
complaint_kde = get_kde(complaints, "complaint_kde", 90, 1)

In [None]:
# FILTER: only consider data since 2012 (with good kde data)
data = pd.merge(data, burglary_kde, on="inspection_id")
data = pd.merge(data, cart_kde, on="inspection_id")
data = pd.merge(data, complaint_kde, on="inspection_id")
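
For readers unfamiliar with kernel density estimation, the sketch below fits `scipy.stats.gaussian_kde` to synthetic points clustered around one location and samples the density at a nearby and a distant point. The coordinates are made up. Note that `get_kde` above accepts a `bandwidth` argument but does not pass it on, so `gaussian_kde` falls back to its default bandwidth rule (Scott's rule); its `bw_method` parameter is where a custom value would go.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic (longitude, latitude) observations around a single hotspot
obs = rng.normal(loc=[-87.63, 41.88], scale=0.01, size=(200, 2))

# gaussian_kde expects an array of shape (n_dimensions, n_points)
kernel = stats.gaussian_kde(obs.T)

# Sample the estimated density at the hotspot and at a distant point
near = kernel([[-87.63], [41.88]])[0]
far = kernel([[-87.70], [41.95]])[0]
print(near > far)
```

The density is highest where observations cluster, which is what lets the burglary, cart and complaint estimates act as location-based features for each inspection.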

## 6. Weather

In [None]:
# Load weather data
weather = pd.read_csv(os.path.join(root_path, "DATA/weather.csv"))

# Merge weather data with model data
data = pd.merge(data, weather, on="inspection_id")

## 7. Next Steps

* Russell Hofvendahl is a web application developer with a great fondness for data-driven decision making. Russell is excited to explore the applications of data science and machine learning in improving human judgement.
* David Lewis is a seasoned corporate responsibility professional working to utilize technology to help improve the health and well-being of human populations through environmental stewardship.
* Jason S. Trager, Ph.D. is the managing partner at Sustainabilist and an expert in process improvement for distributed systems. Jason’s work portfolio includes the creation of novel data-driven methods for improving contractor performance, machine learning to optimize value in energy efficiency sales, and equipment maintenance optimization methodologies.