In [1]:
# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display
import myst_nb

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
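# 'display.precision' is the full option name; the bare 'precision' alias
# was deprecated and later removed from pandas
pd.set_option('display.precision', 2)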
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)

(ch:files_datasets)=
# Data Source Examples

In this chapter, we use two datasets as examples: a government survey about
drug abuse and administrative data from the City of San Francisco about
restaurant inspections. In later sections, we demonstrate how taking stock of
the format, encoding, and size of the files that contain the "raw" data can
solve problems with loading the source file into a data frame. Before we get started,
we want to give an overview of the datasets and their scope ({numref}`Chapter %s <ch:data_scope>`).

## Drug Abuse Warning Network (DAWN) Survey

DAWN is a national healthcare survey that monitors trends in drug abuse and the
emergence of new substances of abuse. The survey also aims to estimate the
impact of drug abuse on the country's health care system and to improve how
emergency departments monitor substance-abuse crises. DAWN is administered by
the U.S. Department of Health and Human Services' Substance Abuse and Mental
Health Services Administration (SAMHSA). DAWN was administered annually from
1998 through 2011. Due in part to the opioid epidemic, DAWN began again in
2018. We examine the 2011 data that have been made available through the SAMHSA
Data Archive [^SAMHSA].

[^SAMHSA]: https://www.datafiles.samhsa.gov/study-series/drug-abuse-warning-network-dawn-nid13516

The target population of the survey is all drug-related emergency room visits
in the U.S. These visits are accessed through a frame of emergency rooms in
hospitals (and their records). Hospitals are selected for the survey through
probability sampling (see {numref}`Chapter %s <ch:theory_datadesign>`), and all
drug-related visits to the sampled hospital's emergency room are included in
the survey. All types of drug-related visits are included, such as drug misuse,
abuse, accidental ingestion, suicide attempts, malicious poisonings, and
adverse reactions.  For each visit, as many as 16 different drugs can be
recorded, including illegal drugs, prescription drugs, and over-the-counter
medications. 
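
To make the sampling scheme concrete, here is a minimal sketch in which we sample hospitals and then keep every visit from the sampled hospitals. The hospital labels, the visit table, and the use of a simple random sample are all assumptions made for illustration; the actual DAWN design is more complex.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical frame: one row per hospital with an emergency room
frame = pd.DataFrame({"hospital": ["A", "B", "C", "D", "E", "F"]})

# Hypothetical drug-related ER visits, each belonging to one hospital
visits = pd.DataFrame({
    "hospital": ["A", "A", "B", "C", "C", "C", "D", "E", "E", "F"],
    "visit_id": range(10),
})

# Stage 1: choose hospitals by chance (a simple random sample here,
# which is an assumption for illustration only)
sampled = rng.choice(frame["hospital"].to_numpy(), size=2, replace=False)

# Stage 2: include every visit made to a sampled hospital
sample_visits = visits[visits["hospital"].isin(sampled)]
print(sample_visits)
```

Note that the rows of `sample_visits` are visits, not hospitals or patients, even though hospitals are the units that were sampled.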

The source file for this dataset is an example of fixed-width formatting, which requires a codebook to decipher. It is also a reasonably large file, which motivates the topic of how to find a file's size. Finally, the granularity is unusual because an ER visit, not a person, is the subject of investigation and because the survey design is complex.
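
As a rough preview of what fixed-width formatting involves, the sketch below reads a few made-up records with `pd.read_fwf` and checks the size of a file on disk with `os.path.getsize`. The records, the field positions, and the code labels are hypothetical stand-ins for a codebook; they are not taken from the DAWN files.

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical fixed-width records: each field sits at fixed character
# positions, with no delimiter between fields
raw = (
    "0001071\n"
    "0002112\n"
    "0003041\n"
)

# Hypothetical codebook excerpt: field name -> (start, end) positions
positions = {"case_id": (0, 4), "age_cat": (4, 6), "sex": (6, 7)}
# Hypothetical codebook excerpt: coded value -> meaning
sex_labels = {1: "male", 2: "female"}

visits = pd.read_fwf(
    io.StringIO(raw),
    colspecs=list(positions.values()),
    names=list(positions.keys()),
    header=None,
)
visits["sex_label"] = visits["sex"].map(sex_labels)
print(visits)

# Write the records to a temporary file and ask the OS for its size
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as f:
    f.write(raw.encode("ascii"))
    path = f.name
size_bytes = os.path.getsize(path)
os.remove(path)
print(size_bytes, "bytes")
```

Without the codebook, a line like `0001071` is just a run of digits; the positions tell us where each field begins and ends, and the labels tell us what the codes mean.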

The San Francisco restaurant files have different characteristics that also make them a good example for this chapter.

## San Francisco Restaurant Food Safety

The San Francisco Department of Public Health routinely makes unannounced
visits to restaurants and inspects them for food safety.  The inspector
calculates a score based on the violations observed and provides descriptions
of the violations that were found. The target population here is all
restaurants in the City of San Francisco. These restaurants are accessed
through a frame of restaurant inspections that were conducted between 2013 and
2016. Some restaurants have multiple inspections in a year, and not all of the
7000+ restaurants are inspected annually.

These food safety scores are available through the city's Open Data initiative,
called DataSF. DataSF is one example of city governments around the world
making their data publicly available; the DataSF mission is to "empower the use
of data in decision making and service delivery" with the goal of improving the
quality of life and work for residents, employers, employees and visitors.[^DATASF]

The City of San Francisco requires restaurants to publicly display their scores
(see {numref}`Figure %s <scoreCard>` below for an example placard).[^CARDS] Data from
the inspections offer an example of multiple datasets with different structures,
fields, and granularity. One dataset contains summary results of inspections, another
provides details about violations found during an inspection, and a third
contains information about the restaurants. The violations include serious
problems related to the transmission of foodborne illnesses and contamination
of food-contact surfaces, and minor issues such as not properly displaying the
inspection placard.
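
To illustrate how the three datasets differ in granularity, the following sketch builds three tiny tables and counts rows per restaurant and per inspection. The column names and all of the values are made up for illustration and do not come from the DataSF files.

```python
import pandas as pd

# Hypothetical restaurant table: one row per restaurant
businesses = pd.DataFrame({
    "business_id": [1, 2],
    "name": ["Cafe A", "Diner B"],
})

# Hypothetical inspection table: one row per inspection
inspections = pd.DataFrame({
    "business_id": [1, 1, 2],
    "date": ["2015-03-01", "2016-04-10", "2016-01-20"],
    "score": [92, 88, 100],
})

# Hypothetical violation table: one row per violation found
violations = pd.DataFrame({
    "business_id": [1, 1, 1],
    "date": ["2015-03-01", "2016-04-10", "2016-04-10"],
    "description": [
        "inspection placard not properly displayed",
        "unclean food-contact surfaces",
        "inspection placard not properly displayed",
    ],
})

# A restaurant can have several inspections...
inspections_per_business = inspections.groupby("business_id").size()
# ...and an inspection can have several violations (or none at all)
violations_per_inspection = violations.groupby(["business_id", "date"]).size()

print(inspections_per_business)
print(violations_per_inspection)
```

Because each table has a different unit per row, combining them requires deciding which granularity the analysis needs.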

[^DATASF]: https://data.sfgov.org/Health-and-Social-Services/Restaurant-Scores-LIVES-Standard/pyih-qa8i/data

[^CARDS]: In 2020, the city began giving restaurants color-coded placards indicating whether the restaurant passed (green), conditionally passed (yellow), or failed (red) the inspection, rather than a placard that displays the numeric score from the inspection. However, a restaurant's inspection scores and violations are still available through DataSF.


```{figure} figures/scoreCard.jpg
---
name: scoreCard
---

An example food safety scorecard given to a restaurant in San Francisco.
Restaurants receive a score between 0 and 100.
```

Both the DAWN survey data and the San Francisco restaurant inspection data are available online as plain text files. However, their formatting is different, and in the next section, we demonstrate how to figure out the file format so that we can read the data into a data frame.