In [1]:
# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display
import myst_nb

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)
        
# businesses
bus = pd.read_csv('data/businesses.csv', encoding='ISO-8859-1')

# inspections
insp = pd.read_csv("data/inspections.csv")

# violations
viol = pd.read_csv("data/violations.csv")

(ch:wrangling_granularity)=
# Table Shape and Granularity


In this section, we'll work with the San Francisco food safety datasets.

As described earlier, we refer to a dataset's **structure** as a mental
representation of the data, and in particular, we represent data that has a
**table** structure by arranging data values in rows and columns. For example,
we've seen that the dataframe of San Francisco restaurants
has 6406 rows and 9 columns.

In [2]:
bus.shape

(6406, 9)

But what does a row of the table represent? The answer to this question is
what we refer to as the **granularity** of the table. Simply looking at the
dataframe, we get the impression that each row/record represents a single
restaurant/business.

In [3]:
bus.head(2)

Unnamed: 0,business_id,name,address,city,...,postal_code,latitude,longitude,phone_number
0,19,NRGIZE LIFESTYLE CAFE,"1200 VAN NESS AVE, 3RD FLOOR",San Francisco,...,94109,37.79,-122.42,14157763262
1,24,OMNI S.F. HOTEL - 2ND FLOOR PANTRY,"500 CALIFORNIA ST, 2ND FLOOR",San Francisco,...,94104,37.79,-122.4,14156779494


:::{note}

By default, `pandas` restricts its output to only show a few rows and columns
at once. To see more possible values, we can ask `pandas` to display more rows
and columns,
though the output can be verbose if there are many rows to display.

In this book, we've defined a function called `display_df` as a shorthand to
display more than the default number of rows and columns.

:::
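
As a rough sketch of the mechanism behind a helper like this, `pd.option_context` changes display options only inside a `with` block and restores them afterward. The dataframe `toy` below is made up for illustration.

```python
import pandas as pd

# A made-up dataframe with more rows than we usually want to print.
toy = pd.DataFrame({"x": range(30)})

before = pd.get_option("display.max_rows")

# Inside the with-block the row limit is 30, so all of toy is printed.
with pd.option_context("display.max_rows", 30):
    inside = pd.get_option("display.max_rows")
    print(toy)

# Once the block ends, the previous limit is back in effect.
after = pd.get_option("display.max_rows")
```

Because the change is temporary, a wide display in one cell does not affect the output of later cells.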

Is there one record per business in the table? The field `business_id` seems to
imply that it is the unique identifier for the business. We can confirm this by
checking whether the number of records in `bus` matches the number of unique
values in `business_id`.

In [4]:
print("Number of records:", len(bus))
print("Number of unique business ids:", len(bus['business_id'].unique()))

Number of records: 6406
Number of unique business ids: 6406


Alternatively, we can count the number of occurrences of each `business_id` in
the table and check that they are all 1.


In [5]:
# Since value_counts() sorts the counts from largest to smallest, we can see
# that all IDs only appear 1 time.
bus['business_id'].value_counts()

83969    1
90801    1
37575    1
        ..
83394    1
7617     1
2047     1
Name: business_id, Length: 6406, dtype: int64

We call `business_id` the **primary key** in the business table. Either
approach confirms that `business_id` uniquely identifies each record.
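
For reference, `pandas` also offers more direct ways to run this kind of check. The sketch below uses a small made-up table, `toy_bus`, rather than the real data; the names in it are placeholders.

```python
import pandas as pd

# A made-up table standing in for the business data.
toy_bus = pd.DataFrame({
    "business_id": [19, 24, 31],
    "name": ["CAFE A", "HOTEL B", "DINER C"],
})

# True when no value in the column repeats.
id_is_unique = toy_bus["business_id"].is_unique

# duplicated(keep=False) flags every row whose ID appears more than once,
# so this selects the offending rows (none in this toy table).
repeated = toy_bus[toy_bus["business_id"].duplicated(keep=False)]
```

When a candidate key turns out not to be unique, looking at `repeated` shows exactly which records are in conflict.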


And what does each column represent? The names of the columns are helpful in
identifying the contents: the name, address, city, state, postal code,
latitude, longitude, phone number, and zip code of the restaurant. It's a bit
odd that one column is for a postal code and another is for a zip code, and the
records that we have examined show the exact same values in these two fields.
In practice, we will want to check that these columns are indeed redundant and
possibly remove one of them. This step is left as an exercise for the reader.
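
As one possible starting point, here is a minimal sketch of such a redundancy check on made-up data. The column names `postal_code` and `zip_code` in `toy` are assumptions for illustration and may not match the names in the real table.

```python
import pandas as pd

# Made-up data with two columns that we suspect hold the same values.
toy = pd.DataFrame({
    "postal_code": ["94109", "94104", "94133"],
    "zip_code": ["94109", "94104", "94133"],
})

# equals() is True only if the two columns match element by element
# (missing values in the same positions count as equal).
same = toy["postal_code"].equals(toy["zip_code"])

# Rows where the two columns disagree; useful to inspect when same is False.
mismatches = toy[toy["postal_code"] != toy["zip_code"]]

# Drop the redundant column only after confirming that nothing is lost.
deduped = toy.drop(columns="zip_code") if same else toy
```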

Let's continue the examination of the inspections and violations data frames
and find their granularity.

## Granularity of Restaurant Inspections and Violations

Now, let's look at the data for restaurant inspections.
We notice that it also contains a field called `business_id`,
but there are duplicate values of the ID.

In [44]:
insp.head(4)

Unnamed: 0,business_id,score,date,type
0,19,94,20160513,routine
1,19,94,20171211,routine
2,24,98,20171101,routine
3,24,98,20161005,routine


We see two records for business #19. When we cross check this ID with the
business table, we see that the business name is NRGIZE LIFESTYLE CAFE. The
values in the `date` field differ for these two records, which implies that there
is one record for each inspection of a restaurant. In other words, the
granularity of this table is a restaurant inspection. If this is indeed the
case, that would mean that the unique identifier of a row is the combination of
`business_id` and `date`: the primary key consists of two fields.

To confirm this, we can group `insp` by both `business_id` and `date`, then
find the size of each group. If `business_id` and `date` uniquely define each
row of the dataframe, then each group should have a size of 1.

In [45]:
(insp
 .groupby(['business_id', 'date'])
 .size()
 .sort_values(ascending=False)
 .head(5)
)

business_id  date    
64859        20150924    2
87440        20160801    2
77427        20170706    2
19           20160513    1
71416        20171213    1
dtype: int64

It looks like restaurant `64859` got two different scores on the date 
`20150924` (Sept. 24, 2015).

In [46]:
insp.query('business_id == 64859 and date == 20150924')

Unnamed: 0,business_id,score,date,type
7742,64859,96,20150924,routine
7744,64859,91,20150924,routine


Unfortunately, we can see that on three occasions, a business has multiple
rows corresponding to a particular date. How could this happen? It may
be that a restaurant had two inspections in one day, or it might be an error.

These sorts of questions are exactly what we address when checking data quality.
In either case, the granularity of this table is an inspection event, and for
all intents and purposes the key is the combination of restaurant ID and date
of the inspection.
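
To pull out every record that shares its key with another record, we can use `DataFrame.duplicated` with the key columns as the subset. The sketch below runs on `toy_insp`, a four-row table built from values displayed above, not on the full data.

```python
import pandas as pd

# A small table built from inspection records shown above.
toy_insp = pd.DataFrame({
    "business_id": [19, 19, 64859, 64859],
    "score": [94, 94, 96, 91],
    "date": [20160513, 20171211, 20150924, 20150924],
})

key = ["business_id", "date"]

# keep=False marks all members of a repeated key, not just the later ones.
repeats = toy_insp[toy_insp.duplicated(subset=key, keep=False)]
```

Here `repeats` holds the two rows for business 64859, which we can then examine side by side.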

Note that the `business_id` field in the inspections table acts as a reference
to the primary key in the business table; for this reason, `business_id` in
`insp` is called a **foreign key**.
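
A foreign key is only useful if its values actually appear in the table it refers to. The following sketch shows one way to check this with `isin`. Both tables are made up, and ID 99 is a deliberately invalid reference.

```python
import pandas as pd

# Made-up tables; business 99 has an inspection but no business record.
toy_bus = pd.DataFrame({"business_id": [19, 24]})
toy_insp = pd.DataFrame({
    "business_id": [19, 19, 24, 99],
    "score": [94, 94, 98, 90],
})

# True for inspections whose business_id appears in the business table.
has_match = toy_insp["business_id"].isin(toy_bus["business_id"])

# Inspections that point to a business we know nothing about.
orphans = toy_insp[~has_match]
```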


Finally, let's determine the granularity of the violations table.

In [47]:
viol

Unnamed: 0,business_id,date,description
0,19,20171211,Inadequate food safety knowledge or lack of ce...
1,19,20171211,Unapproved or unmaintained equipment or utensils
2,19,20160513,Unapproved or unmaintained equipment or utensi...
...,...,...,...
39039,94231,20171214,High risk vermin infestation [ date violation...
39040,94231,20171214,Moderate risk food holding temperature [ dat...
39041,94231,20171214,Wiping cloths not clean or properly stored or ...


Just looking at the first few records in `viol`, we see that each inspection has
multiple entries. In other words, the granularity is at the level of a
violation. Reading the descriptions, we see that if a violation was corrected,
the date of the correction is listed in square brackets in the description.


In [48]:
viol.loc[39039, 'description']

'High risk vermin infestation  [ date violation corrected: 12/15/2017 ]'
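
As a sketch of how this embedded date could be pulled out into its own column, we can use a regular expression with `str.extract`. The pattern assumes that every correction note follows the same wording as the example above, which we have not verified for the full table.

```python
import pandas as pd

# Two descriptions copied from the output above: one with a correction
# date in brackets and one without.
desc = pd.Series([
    "High risk vermin infestation  [ date violation corrected: 12/15/2017 ]",
    "Unapproved or unmaintained equipment or utensils",
])

# Capture the month/day/year text inside the square brackets.
pattern = r"\[\s*date violation corrected:\s*(\d{1,2}/\d{1,2}/\d{4})\s*\]"
date_text = desc.str.extract(pattern, expand=False)

# Descriptions without a bracketed date become NaT (a missing timestamp).
corrected = pd.to_datetime(date_text, format="%m/%d/%Y")
```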

We have found that these three tables have different granularities. If we are
interested in studying inspections, we can, say, aggregate the violations table
to find the number of violations that occurred in an inspection, and then add
this information to the inspection table. We can also reduce the inspection
table by selecting one inspection, say the most recent one, for each
restaurant. This reduced data table essentially has a granularity of
restaurant, which may be useful for a restaurant-based analysis. These kinds of
actions reshape the data table, transform columns, and create new columns.
We'll cover these operations later in this chapter, in
{numref}`Section %s <ch:wrangling_transformations>`.
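
To preview what these changes in granularity look like, here is a minimal sketch on small made-up tables, `toy_insp` and `toy_viol`. The column names mirror the real tables, but the violation descriptions are placeholders.

```python
import pandas as pd

toy_insp = pd.DataFrame({
    "business_id": [19, 19, 24],
    "score": [94, 94, 98],
    "date": [20160513, 20171211, 20171101],
})
toy_viol = pd.DataFrame({
    "business_id": [19, 19, 19],
    "date": [20171211, 20171211, 20160513],
    "description": ["violation a", "violation b", "violation c"],
})

key = ["business_id", "date"]

# Violation level -> inspection level: count the violations per inspection.
num_viol = toy_viol.groupby(key).size().reset_index(name="num_violations")

# A left merge keeps inspections that have no recorded violations.
insp_with_counts = toy_insp.merge(num_viol, on=key, how="left")
insp_with_counts["num_violations"] = (
    insp_with_counts["num_violations"].fillna(0).astype(int)
)

# Inspection level -> restaurant level: keep the most recent inspection.
# Dates stored as YYYYMMDD integers sort in chronological order.
latest = (insp_with_counts
          .sort_values("date")
          .drop_duplicates("business_id", keep="last"))
```

Note that the merge relies on `business_id` and `date` identifying an inspection. For the few key pairs that repeat in the real data, each matching inspection row would receive the same violation count.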

## DAWN Survey Shape and Granularity

As a second example, we examine the DAWN survey data to determine its shape and
granularity. Let's begin by examining its encoding with the `file` CLI tool.


In [6]:
!file data/DAWN-Data.txt

data/DAWN-Data.txt: ASCII text, with very long lines


We find that the source file is ASCII plain text, and we are also informed that
the lines are very long! The `wc` tool confirms that indeed the lines must be
quite long because there are about 230 thousand lines and over 280 million
characters. On average, there are about 1200 characters per line.


In [7]:
!wc data/DAWN-Data.txt

  229211 22695570 280095842 data/DAWN-Data.txt
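
We can work out the average line length directly from the `wc` counts. The first number reported is the line count and the third is the byte count, which equals the character count for ASCII text.

```python
# Counts copied from the wc output above.
n_lines = 229211
n_chars = 280095842

# Whole characters per line, and whatever is left over.
chars_per_line, leftover = divmod(n_chars, n_lines)
print(chars_per_line, leftover)
```

The division comes out even, at 1222 characters per line including the newline. This is consistent with every line having the same length, although the counts alone cannot prove it.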


Given the line length, let's look at just one line in the file. We can pass the
`-n 1` argument to `head` to do this. The display of the first line below has been
formatted to fit more easily in the code block.

```bash
!head -n 1 data/DAWN-Data.txt
# The output is one (long) line:
```

```
     1 2251082    .9426354082   3 4 1 2201141 2 865 105 1102005 1 2 1
2.00-7.00-7.0000-7.0000-7.00001255 105 1142032 4 1 1 2.50 5.00
5.0100-7.0000-7.0000  -7  -7  -7  
-7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  
-7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  
-7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  
-7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  
-7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  
-7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  
-7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  
-7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  
-7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  
-7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  
-7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7 
-7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7 
-7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  
-7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  
-7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  
-7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  
-7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  
-7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  
-7-7-7-7-7.00-7.00-7.0000-7.0000-7.0000  -7  -7  -7  
-7-7-7-7-7.00-7.00-7.0000-7.0000-7.00008 611001
```

What do you notice about the line of data? The values seem to run into each
other. For some of the numbers, it's hard to figure out where one field stops
and another begins. There are many values of -7, -7.00, and -7.0000. This is a
fixed-width formatted file, and we need to read the codebook to find out
where the fields are. For example, the codebook in
{numref}`Figure %s <DAWN_Age>` tells us that age
appears in positions 34 and 35 in the row, and it is categorized into 11 age
groups: 1 stands for 5 and under, 2 for 6 to 11, ..., and 11 for 65 or older.
Also, -8 represents a missing value.


```{figure} figures/DAWN_Age.png
---
name: DAWN_Age
---

Screenshot of a portion of the DAWN coding for age.
```

 

We read just a few variables into a data frame to show how you can do this. 
`pandas` provides a `pd.read_fwf` function for reading in fixed-width format
files. We specify the positions of the fields in each line to extract, the 
associated names of the fields, and other information about the header and 
index. 

In [9]:
colspecs = [(0,6), (14,29), (33,35), (35, 37), (37, 39), (1213, 1214)]
varNames = ["id", "wt", "age", "sex", "race","type"]
dawn = pd.read_fwf('data/DAWN-Data.txt', colspecs=colspecs, 
                   header=None, index_col=0, names=varNames)
dawn.head()

id,wt,age,sex,race,type
1,0.94,4,1,2,8
2,5.99,11,1,3,4
3,4.72,11,2,2,4
4,4.08,2,1,3,4
5,5.18,6,1,3,8


It appears that we have correctly loaded the dataset into a data frame. We can compare the shape to the number of lines from `wc`. The number of rows matches the number of lines in the file, but the number of columns does not because we read in only a handful of features. 

In [10]:
dawn.shape

(229211, 5)
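
One detail of `colspecs` is easy to get wrong: codebooks typically give 1-based, inclusive positions, while `pd.read_fwf` expects 0-based, half-open intervals. This is why age, in positions 34 and 35, is specified as `(33, 35)`. The sketch below shows the same conversion on two made-up records with a much simpler layout than the DAWN file.

```python
import io
import pandas as pd

# Two made-up fixed-width records. In codebook terms (1-based, inclusive):
# id is in positions 1-3, age in positions 4-5, and sex in position 6.
raw = "  1 42\n  2111\n"

# The same fields as 0-based, half-open intervals: subtract 1 from the
# start position and leave the end position unchanged.
colspecs = [(0, 3), (3, 5), (5, 6)]

toy = pd.read_fwf(io.StringIO(raw), colspecs=colspecs,
                  header=None, names=["id", "age", "sex"])
```

Even though the values run together in the second record, `read_fwf` separates them correctly because it relies on position rather than on delimiters.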

The granularity is a bit more complicated due to the survey design.  Recall that these data are part of a large scientific study, with a complex sampling scheme. A row represents an emergency room visit, so the granularity is at the emergency room visit level. However, in order to reflect the sampling scheme and be representative of the population, weights are provided. We apply a weight to each record when computing summary statistics, building histograms, and fitting models. 

The `wt` field contains a weight that takes into account the probability of an ER visit like this one appearing in the sample. By "like this one" we mean a visit with similar visitor age and race and visit location, time of day, etc. We examine the different values in `wt`.

In [11]:
dawn['wt'].value_counts()

0.94     1719
84.26    1617
1.72     1435
         ... 
3.33        1
6.20        1
3.31        1
Name: wt, Length: 3500, dtype: int64

:::{note}

What do these weights mean? As a simplified example,
suppose you ran a survey and 75% of your respondents reported their sex
as female.
Since you know from the Census that roughly 50% of the
U.S. population is female, you could adjust your survey responses by using
a small weight (less than 1) for female responses and a
larger weight (greater than 1) for male responses.
The DAWN survey uses the same idea, except that they split the groups
much more finely.

:::
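
The simplified example in the note can be worked through numerically. In this sketch, each group's weight is its population share divided by its sample share. The four respondents are made up, and this is an illustration of the idea rather than a description of how the DAWN weights were actually constructed.

```python
import numpy as np

# Made-up sample of four respondents: 1 is female, 0 is male (75% female).
is_female = np.array([1, 1, 1, 0])

# Weight = population share / sample share for each group.
w_female = 0.50 / 0.75   # about 0.67, less than 1
w_male = 0.50 / 0.25     # 2.0, greater than 1
weights = np.where(is_female == 1, w_female, w_male)

unweighted = np.average(is_female)
weighted = np.average(is_female, weights=weights)

# np.average with weights computes sum(w * x) / sum(w).
by_hand = (weights * is_female).sum() / weights.sum()
```

The unweighted proportion is 0.75, while the weighted proportion is 0.5, which matches the population share that the weights were built to reproduce.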

It is critical to include the survey weights in your analysis. For example, we can compare the calculation of the proportion of females among the ER visits with and without the weights.  

In [12]:
np.average((dawn["sex"] == 2))

0.48003804354939333

In [13]:
np.average((dawn["sex"] == 2), weights=dawn["wt"])

0.523468490709998

These figures differ by about 4 percentage points. The weighted version is considered a more accurate estimate of the proportion of females among the entire population of ER visits.

## Summary

After looking at the granularity of your datasets, you should have answers to the following questions.
Here, we provide answers for the three restaurant inspections datasets.
As an exercise, answer the questions for the DAWN dataset.

**What does a record represent?**

In the `bus` table, each record represents a restaurant; in the `insp` table, a record corresponds to an inspection; and in the `viol` table each record represents a violation found during an inspection.

**Do all records capture granularity at the same level? (Sometimes a table will contain summary rows.)**

Yes, the records within each of the `bus`, `insp`, and `viol` tables have the same granularity.

**If the data were aggregated, how was the aggregation performed? Sampling and averaging are common aggregations.**

As far as we can tell, the records in these tables have not been aggregated: each one describes a single business, inspection, or violation rather than a summary of several.

**What kinds of aggregations can we perform on the data?**

These tables allow for many useful aggregations. As mentioned earlier, we can aggregate information in the `viol` table, such as counting the number of violations per inspection. Another type of aggregation is to choose one record to represent the aggregate. For example, we can select the most recent inspection of a restaurant to aggregate the inspections table to the restaurant level. Alternatively, we can aggregate inspections and report the average inspection score for a restaurant.
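
As a brief sketch of the last option, the following code averages scores per restaurant. The table `toy_insp` is built from the four inspection records displayed earlier, and the output column names are our own choice.

```python
import pandas as pd

# The four inspection records displayed earlier in this section.
toy_insp = pd.DataFrame({
    "business_id": [19, 19, 24, 24],
    "score": [94, 94, 98, 98],
    "date": [20160513, 20171211, 20171101, 20161005],
})

# One row per restaurant: the mean score and the number of inspections.
summary = (toy_insp
           .groupby("business_id")["score"]
           .agg(avg_score="mean", num_inspections="count"))
```

Reporting the count alongside the mean is a useful habit, since an average based on one inspection deserves less confidence than one based on several.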