## Homework 2: Food Safety

### Cleaning and Exploring Data with Pandas


This Assignment
In this homework, we will investigate restaurant food safety scores for restaurants in San Francisco. The scores and violation information have been made available by the San Francisco Department of Public Health. The main goal for this assignment is to walk through the process of Data Cleaning and EDA.

As we clean and explore these data, you will gain practice with:

Reading simple csv files and using Pandas
Working with data at different levels of granularity
Identifying the type of data collected, missing values, anomalies, etc.
Exploring characteristics and distributions of individual variables

### Before You Start
For each question in the assignment, please write down your answer in the answer cell(s) right below the question.

We understand that it is helpful to have extra cells breaking down the process towards reaching your final answer. If you happen to create new cells below your answer to run codes, NEVER add cells between a question cell and the answer cell below it. It will cause errors when we run the autograder, and it will sometimes cause a failure to generate the PDF file.

Important note: The local autograder tests will not be comprehensive. You can pass the automated tests in your notebook but still fail tests in the autograder. Please be sure to check your results carefully.

Finally, unless we state otherwise, do not use for loops or list comprehensions. The majority of this assignment can be done using builtin commands in Pandas and numpy. Our autograder isn't smart enough to check, but you're depriving yourself of key learning objectives if you write loops / comprehensions, and you also won't be read for the midterm.

In [60]:
import otter
grader = otter.Notebook("hw02.ipynb")

ModuleNotFoundError: No module named 'otter'

In [None]:
import numpy as np
import pandas as pd

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
plt.style.use('fivethirtyeight')

import zipfile
from pathlib import Path

import os
import plotly

from IPython.display import display, Image 
def display_figure_for_grader(fig):
    plotly.io.write_image(fig, 'temp.png')
    display(Image('temp.png'))    


## Obtaining the Data

### File Systems and I/O

In general, we will focus on using python commands to investigate files. However, it can sometimes be easier to use shell commands in your local operating system. The following cells demonstrate how to do this.

In [None]:
from pathlib import Path
data_dir = Path('.')
data_dir.mkdir(exist_ok = True)
file_path = data_dir / Path('data.zip')
dest_path = file_path

After running the cell above, if you list the contents of the directory containing this notebook, you should see data.zip.

Note: The command below starts with an !. This tells our Jupyter notebook to pass this command to the operating system. In this case, the command is the ls Unix command which lists files in the current directory.

In [None]:
# !ls

1: Loading Food Safety Data

check all files

We have data, but we don't have any specific questions about the data yet. Let's focus on understanding the structure of the data; this involves answering questions such as:

Is the data in a standard format or encoding?
Is the data organized in records?
What are the fields in each record?
Let's start by looking at the contents of data.zip. It's not just a single file but rather a compressed directory of multiple files. We could inspect it by uncompressing it using a shell command such as !unzip data.zip, but in this homework we're going to do almost everything in Python for maximum portability.

Looking Inside and Extracting the Zip Files
The following codeblocks are setup. Simply run the cells; do not modify them. Question 1a is where you will start to write code.

Here, we assign my_zip to a zipfile.Zipfile object representing data.zip, and assign list_names to a list of all the names of the contents in data.zip.

In [None]:
my_zip = zipfile.ZipFile(dest_path, 'r')
list_names = my_zip.namelist()
list_names

# # print file name
# print(r_zip.filename)

# # open zip file and download in this source
# r_zip.extractall()

['data/',
 'data/bus.csv',
 'data/ins.csv',
 'data/ins2vio.csv',
 'data/vio.csv',
 'data/sf_zipcodes.json',
 'data/legend.csv']

In [None]:
my_zip = zipfile.ZipFile(dest_path, 'r')
for info in my_zip.infolist():
    print('{}\t{}'.format(info.filename, info.file_size)) # check file name and file size

data/	0
data/bus.csv	665365
data/ins.csv	1860919
data/ins2vio.csv	1032799
data/vio.csv	4213
data/sf_zipcodes.json	474
data/legend.csv	120


In [None]:
bus = pd.read_csv('data/bus.csv')
ins = pd.read_csv('data/bus.csv')
ins2vio = pd.read_csv('data/ins2vio.csv')
vio = pd.read_csv('data/vio.csv')
legend = pd.read_csv('data/legend.csv')

### Question 1a: Examining the Business Data File

In [67]:
# bus.head()
bus = bus.rename(columns={"business id column": "bid"})
bus.head()

Unnamed: 0,bid,name,address,city,state,postal_code,latitude,longitude,phone_number
0,1000,HEUNG YUEN RESTAURANT,3279 22nd St,San Francisco,CA,94110,37.755282,-122.420493,-9999
1,100010,ILLY CAFFE SF_PIER 39,PIER 39 K-106-B,San Francisco,CA,94133,-9999.0,-9999.0,14154827284
2,100017,AMICI'S EAST COAST PIZZERIA,475 06th St,San Francisco,CA,94103,-9999.0,-9999.0,14155279839
3,100026,LOCAL CATERING,1566 CARROLL AVE,San Francisco,CA,94124,-9999.0,-9999.0,14155860315
4,100030,OUI OUI! MACARON,2200 JERROLD AVE STE C,San Francisco,CA,94124,-9999.0,-9999.0,14159702675


In [68]:
# bus.describe()
# bus.value_counts()

In [69]:
bus.isnull().sum()

bid             0
name            0
address         0
city            0
state           0
postal_code     0
latitude        0
longitude       0
phone_number    0
dtype: int64

In [66]:
bid_unique = bus['bid'].unique()
# bid_unique
len(bid_unique) == bus.shape[0]

True

#### Question 1b
We will now work with some important fields in bus.

Assign top_names to an iterable containing the top 5 most frequently used business names, from most frequent to least frequent.
Assign top_addresses to an iterable containing the top 5 addressses where businesses are located, from most popular to least popular.
Recall from CS88 or CS61A that "an iterable value is anything that can be passed to the built-in iter function. Iterables include sequence values such as strings and tuples, as well as other containers such as sets and dictionaries."

Hint: You may find value_counts() helpful.

Hint 2: You'll need to somehow get the names / addresses, NOT the counts associated with each. If you're not sure how to do this, try looking through the class notes or using a search engine. We know this is annoying but we're trying to help you build independence.

Hint 3: To check your answer, top_names[0] should return the string Peet's Coffee & Tea. It should not be a number.

In [126]:
top_names = bus.name.value_counts().index[:5].tolist()
top_addresses = bus.address.value_counts().index[:5].tolist()

top_names, top_addresses

(["Peet's Coffee & Tea",
  'Starbucks Coffee',
  "McDonald's",
  'Jamba Juice',
  'STARBUCKS'],
 ['Off The Grid', '428 11th St', '2948 Folsom St', '3251 20th Ave', 'Pier 41'])

In [127]:
top_names[0] == "Peet's Coffee & Tea"

True

#### Question 1c
Based on the above exploration, what does each record represent?

A. "One location of a restaurant." B. "A chain of restaurants." C. "A city block."

Answer in the following cell. Your answer should be a string, either "A", "B", or "C".

In [129]:
# What does each record represent?  Valid answers are:
#    "One location of a restaurant."
#    "A chain of restaurants."
#    "A city block."
q1c = 'A'

### 2: Cleaning the Business Data Postal Codes
The business data contains postal code information that we can use to aggregate the ratings over regions of the city. Let's examine and clean the postal code field. The postal code (sometimes also called a ZIP code) partitions the city into regions:

#### Question 2a
How many restaurants are in each ZIP code?

In the cell below, create a series where the index is the postal code and the value is the number of records with that postal code in descending order of count. You may need to use groupby(), size(), or value_counts(). Do you notice any odd/invalid zip codes?

In [147]:
zip_counts = bus.postal_code.value_counts()
print(zip_counts.to_string())

94103         562
94110         555
94102         456
94107         408
94133         398
94109         382
94111         259
94122         255
94105         249
94118         231
94115         230
94108         229
94124         218
94114         200
-9999         194
94112         192
94117         189
94123         177
94121         157
94104         142
94132         132
94116          97
94158          90
94134          82
94127          67
94131          49
94130           8
94143           5
94301           2
94188           2
94101           2
CA              2
94013           2
941102019       1
941             1
95112           1
94105-2907      1
94102-5917      1
94124-1917      1
94621           1
95122           1
95132           1
95109           1
95133           1
95117           1
94901           1
94105-1420      1
94544           1
64110           1
94122-1909      1
00000           1
94080           1
Ca              1
94602           1
94129           1
94014     

In [149]:
zip_counts.shape

(63,)

#### Question 2b
Answer the following questions about the postal_code column in the bus dataframe.

1. The ZIP code column is which of the following type of data:
A. Quantitative Continuous
B. Quantitative Discrete
C. Qualitative Ordinal
D. Qualitative Nominal
2. What Python data type is used to represent a ZIP code?
A. str
B. int
C. bool
D. float
Note: ZIP codes and postal codes are the same thing.

Please write your answers in the cell below. Your answer should be a string, either "A", "B", "C", or "D".

In [None]:
# The ZIP code column is which of the following type of data:
q2b_part1 = 'D'

# What Python data type is used to represent a ZIP code? 
q2b_part2 = 'A'

#### Question 2c
In question 2a we noticed a large number of potentially invalid ZIP codes (e.g., "Ca"). These are likely due to data entry errors. To get a better understanding of the potential errors in the zip codes we will:

Import a list of valid San Francisco ZIP codes by using pd.read_json to load the file data/sf_zipcodes.json and ultimately create a series of type str containing the valid ZIP codes.
Construct a DataFrame containing only the businesses which DO NOT have valid ZIP codes. (step 2 below).

Step 1

In [153]:
bus.head()

Unnamed: 0,bid,name,address,city,state,postal_code,latitude,longitude,phone_number
0,1000,HEUNG YUEN RESTAURANT,3279 22nd St,San Francisco,CA,94110,37.755282,-122.420493,-9999
1,100010,ILLY CAFFE SF_PIER 39,PIER 39 K-106-B,San Francisco,CA,94133,-9999.0,-9999.0,14154827284
2,100017,AMICI'S EAST COAST PIZZERIA,475 06th St,San Francisco,CA,94103,-9999.0,-9999.0,14155279839
3,100026,LOCAL CATERING,1566 CARROLL AVE,San Francisco,CA,94124,-9999.0,-9999.0,14155860315
4,100030,OUI OUI! MACARON,2200 JERROLD AVE STE C,San Francisco,CA,94124,-9999.0,-9999.0,14159702675


In [161]:
valid_zips = pd.read_json('data/sf_zipcodes.json')['zip_codes']
valid_zips.head()

0    94102
1    94103
2    94104
3    94105
4    94107
Name: zip_codes, dtype: int64

In [162]:
valid_zips.dtype

dtype('int64')

In [163]:
valid_zips = valid_zips.astype("string")
type(valid_zips.dtype)

pandas.core.arrays.string_.StringDtype

Step 2

In [173]:
has_valid_zip = bus.postal_code.isin(valid_zips)
invalid_zip_bus = bus[~has_valid_zip]
invalid_zip_bus.head(20)

Unnamed: 0,bid,name,address,city,state,postal_code,latitude,longitude,phone_number
22,100126,Lamas Peruvian Food Truck,Private Location,San Francisco,CA,-9999,-9999.0,-9999.0,-9999
68,100417,"COMPASS ONE, LLC",1 MARKET ST. FL,San Francisco,CA,94105-1420,-9999.0,-9999.0,14154324000
96,100660,TEAPENTER,1518 IRVING ST,San Francisco,CA,94122-1909,-9999.0,-9999.0,14155868318
109,100781,LE CAFE DU SOLEIL,200 FILLMORE ST,San Francisco,CA,94117-3504,-9999.0,-9999.0,14155614215
144,101084,Deli North 200,1 Warriors Way Level 300 North East,San Francisco,CA,94518,-9999.0,-9999.0,-9999
156,101129,Vendor Room 200,1 Warriors Way Level 300 South West,San Francisco,CA,-9999,-9999.0,-9999.0,-9999
177,101192,Cochinita #2,2 Marina Blvd Fort Mason,San Francisco,CA,-9999,-9999.0,-9999.0,14150429222
276,102014,"DROPBOX (Section 3, Floor 7)",1800 Owens St,San Francisco,CA,-9999,-9999.0,-9999.0,-9999
295,102245,Vessell CA Operations (#4),2351 Mission St,San Francisco,CA,-9999,-9999.0,-9999.0,-9999
298,10227,The Napper Tandy,3200 24th St,San Francisco,CA,-9999,37.752581,-122.416482,-9999


#### Question 2d
In the previous question, many of the businesses had a common invalid postal code that was likely used to encode a MISSING postal code. Do they all share a potentially "interesting address"?

In the following cell, construct a series that counts the number of businesses at each address that have this single likely MISSING postal code value. Order the series in descending order by count.

After examining the output, please answer the following question (2e) by filling in the appropriate variable. If we were to drop businesses with MISSING postal code values would a particular class of business be affected? If you are unsure try to search the web for the most common addresses.

In [234]:
missing_code = invalid_zip_bus.postal_code.value_counts()
interesting_code = missing_code.index[0]
interesting_code

'-9999'

In [235]:
missing_zip_address_count = invalid_zip_bus[invalid_zip_bus.postal_code==interesting_code]['address'].value_counts()
missing_zip_address_count.head()

Off The Grid                  39
Off the Grid                  10
OTG                            4
Approved Locations             3
Approved Private Locations     3
Name: address, dtype: int64

#### Question 2e
If we were to drop businesses with MISSING postal code values, what specific types of businesses would we be excluding? In other words, is there a commonality among businesses with missing postal codes?

Hint: You may want to look at the names of the businesses with missing postal codes. Feel free to reuse parts of your code from 2d, but we will not be grading your code.

Type your answer here, replacing this text.

#### Question 2f
Examine the invalid_zip_bus dataframe we computed above and look at the businesses that DO NOT have the special MISSING ZIP code value. Some of the invalid postal codes are just the full 9 digit code rather than the first 5 digits. Create a new column named postal5 in the original bus dataframe which contains only the first 5 digits of the postal_code column.

Then, for any of the postal5 ZIP code entries that were not a valid San Francisco ZIP Code (according to valid_zips), the provided code will set the postal5 value to None.

In [280]:
bus['postal5'] = bus.postal_code.apply(lambda x:x[:5])
bus.loc[~bus['postal5'].isin(valid_zips), 'postal5'] = None
bus.loc[invalid_zip_bus.index, ['bid', 'name', 'postal_code', 'postal5']]

Unnamed: 0,bid,name,postal_code,postal5
22,100126,Lamas Peruvian Food Truck,-9999,
68,100417,"COMPASS ONE, LLC",94105-1420,94105
96,100660,TEAPENTER,94122-1909,94122
109,100781,LE CAFE DU SOLEIL,94117-3504,94117
144,101084,Deli North 200,94518,
...,...,...,...,...
6173,99369,HOTEL BIRON,94102-5917,94102
6174,99376,Mashallah Halal Food truck Ind,-9999,
6199,99536,FAITH SANDWICH #2,94105-2907,94105
6204,99681,Twister,95112,


### 3: Investigate the Inspection Data
Let's now turn to the inspection DataFrame. Earlier, we found that ins has 4 columns named iid, score, date and type. In this section, we determine the granularity of ins and investigate the kinds of information provided for the inspections.

Let's start by looking again at the first 5 rows of ins to see what we're working with.

In [281]:
ins.head(5)

Unnamed: 0,business id column,name,address,city,state,postal_code,latitude,longitude,phone_number
0,1000,HEUNG YUEN RESTAURANT,3279 22nd St,San Francisco,CA,94110,37.755282,-122.420493,-9999
1,100010,ILLY CAFFE SF_PIER 39,PIER 39 K-106-B,San Francisco,CA,94133,-9999.0,-9999.0,14154827284
2,100017,AMICI'S EAST COAST PIZZERIA,475 06th St,San Francisco,CA,94103,-9999.0,-9999.0,14155279839
3,100026,LOCAL CATERING,1566 CARROLL AVE,San Francisco,CA,94124,-9999.0,-9999.0,14155860315
4,100030,OUI OUI! MACARON,2200 JERROLD AVE STE C,San Francisco,CA,94124,-9999.0,-9999.0,14159702675
