In [66]:
# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display
import myst_nb

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
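# Use the fully qualified option name; the bare 'precision' pattern
# can match more than one option in recent pandas versions.
pd.set_option('display.precision', 2)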
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)
       

(ch:files_granularity)=
# Table Shape and Granularity


As described earlier, we refer to a dataset's **structure** as a mental representation of the data, and in particular, we represent data that has a **table** structure by arranging data values in rows and columns. Now that we have investigated the restaurant inspection files, we load them into DataFrames and examine their shapes.

In [67]:
bus = pd.read_csv('data/businesses.csv', encoding='ISO-8859-1')
insp = pd.read_csv("data/inspections.csv")
viol = pd.read_csv("data/violations.csv")

In [68]:
print(" Restaurants shape:", bus.shape, "\n Inspections shape:", insp.shape,
      "\n Violations shape:", viol.shape)

 Restaurants shape: (6406, 9) 
 Inspections shape: (14222, 4) 
 Violations shape: (39042, 3)


We find that the restaurants table has 6,406 rows and 9 columns. But what does a row in the table represent? The answer to this question is what we refer to as the table's **granularity**.

In [69]:
bus.head(2)

Unnamed: 0,business_id,name,address,city,...,postal_code,latitude,longitude,phone_number
0,19,NRGIZE LIFESTYLE CAFE,"1200 VAN NESS AVE, 3RD FLOOR",San Francisco,...,94109,37.79,-122.42,14157763262
1,24,OMNI S.F. HOTEL - 2ND FLOOR PANTRY,"500 CALIFORNIA ST, 2ND FLOOR",San Francisco,...,94104,37.79,-122.4,14156779494


:::{note}

By default, `pandas` restricts its output to only show a few rows and columns
at once. To see more possible values, we can ask `pandas` to display more rows
and columns,
though the output can be verbose if there are many rows to display.

In this book, we've defined a function called `display_df` as a shorthand to
display more than the default number of rows and columns.

:::
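
As a rough illustration of how display limits can be changed temporarily, the sketch below uses `pd.option_context`, the same mechanism that `display_df` relies on. The `toy` table is made up for this example and is not part of the restaurant data.

```python
import pandas as pd

# Made-up table with more rows than we want to print at once.
toy = pd.DataFrame({"id": range(10), "value": [x * 1.5 for x in range(10)]})

# Temporarily lower the row limit; the setting reverts when the block exits.
with pd.option_context("display.max_rows", 4):
    truncated = repr(toy)

# Temporarily remove the row limit so that every row is shown.
with pd.option_context("display.max_rows", None):
    full = repr(toy)

print(truncated)
print(full)
```

Because the options are restored on exit, the rest of the notebook keeps its compact default output.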

Simply looking at the dataframe, we get the impression that each row/record represents a single restaurant/business. Is there one record per business in the table? The field `business_id` seems to imply that it is the unique identifier for the business. We can confirm this by checking whether the number of records in `bus` matches the number of unique values in `business_id`.

In [70]:
print("Number of records:", len(bus))
print("Number of unique business ids:", len(bus['business_id'].unique()))

Number of records: 6406
Number of unique business ids: 6406


Alternatively, we can count the number of occurrences of each `business_id` in
the table and check that they are all 1.


In [71]:
# Since value_counts() sorts the counts from largest to smallest, 
# we can see that all IDs only appear 1 time in the table.
bus['business_id'].value_counts()

2047     1
71088    1
5528     1
        ..
64154    1
2716     1
83969    1
Name: business_id, Length: 6406, dtype: int64

Indeed, both approaches confirm that `business_id` uniquely identifies each record in the DataFrame. We call `business_id` the **primary key** in the business table. 
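
A more compact version of the same check is sketched below on a small made-up table (`toy_bus` is a hypothetical stand-in for `bus`). It uses `Series.is_unique` and `Series.duplicated`, which are standard `pandas` methods.

```python
import pandas as pd

# Hypothetical stand-in for the business table.
toy_bus = pd.DataFrame({
    "business_id": [19, 24, 31, 45],
    "name": ["CAFE A", "PANTRY B", "DELI C", "GRILL D"],
})

# True when no value in the column repeats.
id_is_unique = toy_bus["business_id"].is_unique

# Rows whose ID appears more than once; empty when the ID is a primary key.
repeats = toy_bus[toy_bus["business_id"].duplicated(keep=False)]

print(id_is_unique, len(repeats))
```

When the check fails, `repeats` shows exactly which records share an ID, which is often more useful than a single count.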

And what does each column represent? The names of the columns are helpful in identifying the contents: the name, address, city, state, postal code, latitude, longitude, phone number, and zip code of the restaurant. It's a bit odd that one column is for a postal code and another is for a zip code, and the records that we have examined show the exact same values in these two fields. In practice, we will want to check that these columns are indeed redundant and possibly remove one of them.
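
One way to check whether two columns are redundant is sketched below. The table and the column names `postal_code` and `zip_code` are assumptions made for illustration; substitute the actual column names in your data.

```python
import pandas as pd

# Made-up table; the column names are hypothetical.
toy = pd.DataFrame({
    "postal_code": ["94109", "94104", "94133"],
    "zip_code": ["94109", "94104", "94133"],
})

# Series.equals compares the two columns value by value.
same = toy["postal_code"].equals(toy["zip_code"])

# Rows where the two fields disagree, worth inspecting before dropping anything.
mismatch = toy[toy["postal_code"] != toy["zip_code"]]

# Drop the duplicate column only when the two agree everywhere.
if same:
    toy = toy.drop(columns="zip_code")

print(same, len(mismatch), list(toy.columns))
```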

Let's continue the examination of the inspections and violations data frames
and find their granularity.

## Granularity of Restaurant Inspections and Violations

Next, let's look at the data for restaurant inspections. There are many more rows in the inspections table compared to the business table. We take a closer look at the first few rows in the table.

In [72]:
insp.head(4)

Unnamed: 0,business_id,score,date,type
0,19,94,20160513,routine
1,19,94,20171211,routine
2,24,98,20171101,routine
3,24,98,20161005,routine


We notice that it also contains a field called `business_id`, but the ID values repeat. We see two records for business #19, and the dates differ for these two records. This implies that the table of inspections holds one record for each inspection of a restaurant. In other words, the granularity of this table is a restaurant inspection. If this is indeed the case, then the unique identifier of a row is the combination of
`business_id` and `date`; that is, the primary key consists of two fields.

To confirm that the two fields form the primary key, we can group `insp` by the combination of `business_id` and `date`, and then find the size of each group. If `business_id` and `date` uniquely define each row of the dataframe, then each group should have size 1.

In [73]:
(insp
 .groupby(['business_id', 'date'])
 .size()
 .sort_values(ascending=False)
 .head(5)
)

business_id  date    
77427        20170706    2
64859        20150924    2
87440        20160801    2
94231        20171214    1
7640         20161228    1
dtype: int64

The combination of ID and date uniquely identifies each record in the inspections table, with the exception of three restaurants, which have two records for an ID-date combination. For example, it looks like restaurant `64859` got two different scores on the date `20150924` (Sept. 24, 2015). How could this happen? It may be that the restaurant had two inspections in one day, or it might be an error.

In [74]:
insp.query('business_id == 64859 and date == 20150924')

Unnamed: 0,business_id,score,date,type
7742,64859,96,20150924,routine
7744,64859,91,20150924,routine


We address these sorts of questions when we consider data quality in {numref}`Chapter %s <ch:wrangling>`. In any case, for practical purposes, the primary key for the inspections table is the combination of restaurant ID and inspection date.
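
To list every record that shares an ID-date pair in one step, `DataFrame.duplicated` with a `subset` is an alternative to grouping. The sketch below runs on a few rows copied from the outputs above into a made-up table named `toy_insp`.

```python
import pandas as pd

# A handful of inspection records taken from the outputs shown earlier.
toy_insp = pd.DataFrame({
    "business_id": [19, 19, 24, 64859, 64859],
    "score": [94, 94, 98, 96, 91],
    "date": [20160513, 20171211, 20171101, 20150924, 20150924],
})

key = ["business_id", "date"]

# keep=False marks every row of a repeated key, not just the later copies.
dupes = toy_insp[toy_insp.duplicated(subset=key, keep=False)]

# The candidate key is a true primary key only if no combination repeats.
is_key = not toy_insp.duplicated(subset=key).any()

print(dupes)
print(is_key)
```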

Note that the `business_id` field in the inspections table acts as a reference to the primary key in the business table. For this reason, `business_id` in `insp` is called a **foreign key**.
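
A foreign key is only useful if its values actually appear in the table it references. The following is a minimal sketch of that check on made-up tables; ID 99 is invented here to show what an unmatched reference looks like.

```python
import pandas as pd

# Made-up tables; ID 99 has no matching business on purpose.
toy_bus = pd.DataFrame({"business_id": [19, 24, 31]})
toy_insp = pd.DataFrame({
    "business_id": [19, 19, 24, 99],
    "score": [94, 94, 98, 88],
})

# For each inspection, does its ID exist in the business table?
has_match = toy_insp["business_id"].isin(toy_bus["business_id"])

# Inspections that point at a business we have no record of.
orphans = toy_insp[~has_match]

print(orphans)
```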

Briefly, we examine the granularity of the violations table.

In [75]:
viol

Unnamed: 0,business_id,date,description
0,19,20171211,Inadequate food safety knowledge or lack of ce...
1,19,20171211,Unapproved or unmaintained equipment or utensils
2,19,20160513,Unapproved or unmaintained equipment or utensi...
...,...,...,...
39039,94231,20171214,High risk vermin infestation [ date violation...
39040,94231,20171214,Moderate risk food holding temperature [ dat...
39041,94231,20171214,Wiping cloths not clean or properly stored or ...


Just looking at the first few records in `viol`, we see that each inspection
can have multiple entries. In other words, the granularity is at the level of a
violation. Reading the descriptions, we see that if a violation was corrected,
the date of correction is listed in the description within square brackets.


In [76]:
viol.loc[39039, 'description']

'High risk vermin infestation  [ date violation corrected: 12/15/2017 ]'
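
Since the correction date is embedded in the text, one possible way to pull it into its own column is a regular expression with `Series.str.extract`. This is a sketch only: the pattern assumes that every bracketed note follows the format seen above, which we have not verified for the whole table.

```python
import pandas as pd

# Two descriptions taken from the outputs above, one with a correction date.
toy_viol = pd.DataFrame({"description": [
    "High risk vermin infestation  [ date violation corrected: 12/15/2017 ]",
    "Unapproved or unmaintained equipment or utensils",
]})

# Capture the date inside the bracketed note, tolerating extra spaces.
pattern = r"\[\s*date violation corrected:\s*(\d{1,2}/\d{1,2}/\d{4})\s*\]"
corrected = toy_viol["description"].str.extract(pattern, expand=False)

# Descriptions without a bracketed note yield a missing date (NaT).
toy_viol["corrected_on"] = pd.to_datetime(corrected, format="%m/%d/%Y")

# The violation text with the bracketed note stripped off.
toy_viol["violation"] = toy_viol["description"].str.replace(
    r"\s*\[.*\]$", "", regex=True)

print(toy_viol[["violation", "corrected_on"]])
```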

We have found that these three tables have different granularities. If we are interested in studying inspections, we can, say, aggregate the violations table to find the number of violations that occurred in an inspection, and then add this information to the inspection table. We can also reduce the inspection table by selecting one inspection, say the most recent one, for each restaurant. This reduced data table essentially has a granularity of restaurant, which may be useful for a restaurant-based analysis. These kinds of actions reshape the data table, transform columns, and create new columns. We'll cover these operations later, in {numref}`Section %s <ch:wrangling_transformations>`.
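
As a preview, counting violations per inspection and attaching the counts to the inspections might look like the sketch below. The tables are small made-up versions of `insp` and `viol`, and the column name `num_violations` is our own choice.

```python
import pandas as pd

toy_insp = pd.DataFrame({
    "business_id": [19, 19, 24],
    "score": [94, 94, 98],
    "date": [20160513, 20171211, 20171101],
})
toy_viol = pd.DataFrame({
    "business_id": [19, 19, 19],
    "date": [20171211, 20171211, 20160513],
    "description": ["violation a", "violation b", "violation c"],
})

# One row per inspection, holding the number of violations recorded for it.
num_viol = (toy_viol
            .groupby(["business_id", "date"])
            .size()
            .reset_index(name="num_violations"))

# A left join keeps inspections with no violations; fill their counts with 0.
insp_with_counts = toy_insp.merge(num_viol, on=["business_id", "date"], how="left")
insp_with_counts["num_violations"] = (
    insp_with_counts["num_violations"].fillna(0).astype(int))

print(insp_with_counts)
```

The result keeps the granularity of an inspection, so it can be analyzed alongside the inspection scores.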

## DAWN Survey Shape and Granularity

As noted in {numref}`Section %s <ch:reading_format>`, the DAWN file has fixed-width formatting, and we need to rely on a codebook to find out where the fields are. For example, the codebook in
{numref}`Figure %s <DAWN_Age>` tells us that age appears in positions 34 and 35 in the row and is categorized into 11 age groups: 1 stands for 5 and under, 2 for 6 to 11, ..., and 11 for 65 and older. Also, -8 represents a missing value.

```{figure} figures/DAWN_Age.png
---
name: DAWN_Age
---

Screenshot of a portion of the DAWN coding for age.
```
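
Codes like these usually need to be recoded before analysis. The sketch below shows one way to treat -8 as missing and to attach labels. The values in `toy_age` are made up, and only the three labels spelled out in the text are mapped; the remaining codes are left unlabeled rather than guessed.

```python
import pandas as pd

# Made-up age codes, including the special value -8 for "missing".
toy_age = pd.Series([1, 11, -8, 2, 6], name="age")

# Replace -8 with NaN so it cannot be mistaken for a real category.
age_clean = toy_age.where(toy_age != -8)

# Labels for the codes described in the text; other codes map to NaN here.
known_labels = {1: "5 and under", 2: "6 to 11", 11: "65 and older"}
age_label = toy_age.map(known_labels)

print(age_clean)
print(age_label)
```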

 

Given the tremendous amount of information on each line, we read just a few variables into a data frame. (In {numref}`Section %s <ch:reading_command_line>` we determined that the file contains over 200 thousand lines and over 280 million characters, so on average there are about 1,200 characters per line.) We can use the `pandas.read_fwf` function to do this. We must specify the positions of the fields to extract, the associated names of the fields, and other information about the header and index. Note that `colspecs` uses zero-based, half-open intervals, so the age field in positions 34 and 35 becomes `(33, 35)`.

In [77]:
colspecs = [(0,6), (14,29), (33,35), (35, 37), (37, 39), (1213, 1214)]
varNames = ["id", "wt", "age", "sex", "race","type"]
dawn = pd.read_fwf('data/DAWN-Data.txt', colspecs=colspecs, 
                   header=None, index_col=0, names=varNames)
dawn.head()

id,wt,age,sex,race,type
1,0.94,4,1,2,8
2,5.99,11,1,3,4
3,4.72,11,2,2,4
4,4.08,2,1,3,4
5,5.18,6,1,3,8


It appears that we have correctly loaded the dataset. We can compare the rows in the table to the number of lines in the file. 

In [78]:
dawn.shape

(229211, 5)

The number of rows in the data frame matches the number of lines in the file. That's good. The number of columns does not match because we read only a handful of features.
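
To see how codebook positions relate to `colspecs`, here is a small sketch on a made-up fixed-width string. The helper `to_colspec` is hypothetical: it converts the 1-based, inclusive positions that a codebook reports into the 0-based, half-open intervals that `pd.read_fwf` expects.

```python
import io
import pandas as pd

def to_colspec(first, last):
    """Convert 1-based inclusive positions to a 0-based half-open interval."""
    return (first - 1, last)

# Two made-up records: id in positions 1-3, age in 4-5, sex in 6-7.
raw = "001 4 1\n00211 2\n"

colspecs = [to_colspec(1, 3), to_colspec(4, 5), to_colspec(6, 7)]
toy = pd.read_fwf(io.StringIO(raw), colspecs=colspecs,
                  header=None, names=["id", "age", "sex"])

print(colspecs)
print(toy)
```

With this convention, `to_colspec(34, 35)` gives `(33, 35)`, the interval used for age when reading the DAWN file.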

The granularity of the data frame is a bit complicated due to the survey design.  Recall that these data are part of a large scientific study, with a complex sampling scheme. A row represents an emergency room visit, so the granularity is at the emergency room visit level. However, in order to reflect the sampling scheme and be representative of the population, weights are provided. We must apply a weight to each record when computing summary statistics, building histograms, and fitting models. 

The `wt` field contains a weight value that takes into account the probability of an ER visit like this one appearing in the sample. By "like this one" we mean a visit with similar visitor age and race and visit location, time of day, etc. We examine the different values in `wt`.

In [79]:
dawn['wt'].value_counts()

0.94     1719
84.26    1617
1.72     1435
         ... 
1.51        1
3.33        1
3.31        1
Name: wt, Length: 3500, dtype: int64

:::{note}

What do these weights mean? As a simplified example,
suppose you ran a survey and 75% of your respondents reported their sex
as female.
Since you know from the Census that roughly 50% of the
U.S. population is female, you could adjust your survey responses by using
a small weight (less than 1) for female responses and a
larger weight (greater than 1) for male responses.
The DAWN survey uses the same idea, except that they split the groups
much more finely.

:::
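
The simplified example in the note can be worked through numerically. In the sketch below, each weight is the population share divided by the sample share for that group. This is an assumption that illustrates the idea and is not the DAWN weighting procedure.

```python
import numpy as np

# Simplified survey: 100 respondents, 75 female (1) and 25 male (0).
is_female = np.array([1] * 75 + [0] * 25)

# Weight = population share / sample share for each group.
w_female = 0.50 / 0.75   # less than 1: females are over-represented
w_male = 0.50 / 0.25     # greater than 1: males are under-represented
weights = np.where(is_female == 1, w_female, w_male)

unweighted = is_female.mean()
weighted = np.average(is_female, weights=weights)

print(unweighted, weighted)
```

The unweighted proportion reproduces the skewed sample (0.75), while the weighted proportion recovers the population share of one half.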

It is critical to include the survey weights in your analysis. For example, we can compare the calculation of the proportion of females among the ER visits both with and without using the weights.  

In [80]:
np.average((dawn["sex"] == 2))

0.48003804354939333

In [81]:
np.average((dawn["sex"] == 2), weights=dawn["wt"])

0.523468490709998

These figures differ by 4 percentage points. The weighted version is considered a more accurate estimate of the proportion of females among the entire population of drug-related ER visits.  
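
To make explicit what the weighted calculation does, the sketch below reproduces `np.average` by hand on the five records displayed earlier: the weights of the female visits are summed and divided by the total weight.

```python
import numpy as np
import pandas as pd

# The five records shown in the output of dawn.head() above.
toy = pd.DataFrame({
    "wt": [0.94, 5.99, 4.72, 4.08, 5.18],
    "sex": [1, 1, 2, 1, 1],
})

is_female = (toy["sex"] == 2)

# Weighted proportion: total weight of female visits over total weight.
manual = (toy["wt"] * is_female).sum() / toy["wt"].sum()
via_numpy = np.average(is_female, weights=toy["wt"])

print(manual, via_numpy)
```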

## Summary

After looking at the granularity of your dataset, you should have answers to the following questions. (We provide answers for the three restaurant food safety datasets.)
As an exercise, answer the questions for the DAWN dataset.

**What does a record represent?**

In the `bus` table, each record represents a restaurant; in the `insp` table, a record corresponds to an inspection of a restaurant; and in the `viol` table each record represents a violation found during an inspection.

**Do all records in a table capture granularity at the same level? (Sometimes a table will contain summary rows.)**

Yes, the records in each of the `bus`, `insp`, and `viol` tables have the same granularity within their respective tables.

**If the data were aggregated, how was the aggregation performed? Sampling and averaging are common aggregations.**

These tables have not been aggregated, and in fact, they have different levels of granularity. 

**What kinds of aggregations can we perform on the data?**

These three tables allow for some useful aggregations. As mentioned earlier, we can aggregate information in the `viol` table, such as a count of the number of violations per inspection. Another type of aggregation is to choose one inspection record to represent the restaurant. For example, we can select the most recent inspection as an "aggregation" of the inspections table to the restaurant level. Alternatively, we can aggregate inspections and report the average inspection score for a restaurant.
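
The two restaurant-level aggregations of the inspections table can be sketched as follows, using the four inspection records displayed earlier in a made-up table named `toy_insp`.

```python
import pandas as pd

# The four inspection records shown in the output of insp.head(4).
toy_insp = pd.DataFrame({
    "business_id": [19, 19, 24, 24],
    "score": [94, 94, 98, 98],
    "date": [20160513, 20171211, 20171101, 20161005],
})

# Dates in YYYYMMDD form sort chronologically as integers, so the last
# row of each group after sorting is the most recent inspection.
most_recent = (toy_insp
               .sort_values("date")
               .groupby("business_id")
               .tail(1)
               .sort_values("business_id"))

# Average inspection score per restaurant.
avg_score = toy_insp.groupby("business_id")["score"].mean()

print(most_recent)
print(avg_score)
```

Both results have one row per restaurant, so either can be joined to the business table on `business_id`.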