# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display
import myst_nb

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
# Use the full option name; the bare 'precision' key is ambiguous in newer pandas
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)

# The Three Datasets Used in This Chapter

In this chapter, we'll perform data wrangling on three different datasets:
a scientific study (measurements from Mauna Loa Observatory), a government
survey (a drug abuse survey), and administrative data (San Francisco
restaurant scores). As we'll see from these datasets, data cleaning and
preparation can be quite different depending on the data. Below, we give an
overview of why and how these data were collected and their scope.

## Carbon Dioxide Measured at Mauna Loa Observatory

Carbon dioxide (CO2) is an important signal of global warming. CO2 is a gas
that traps heat in the earth's atmosphere. Without it, earth would be
impossibly cold, but increases in CO2 also drive global warming and threaten
our planet's climate. CO2 concentrations have been monitored at Mauna Loa
Observatory since 1958, and these data offer a crucial benchmark for
understanding the threat of global warming.


Located on the Big Island of Hawaii, Mauna Loa is the largest active volcano in
the world; it rises 4 km above sea level and 17 km above its base. The
observatory (see Figure XX) sits at 3.4 km in elevation, above the inversion
layer where local CO2 effects may be present. Given its location, far from any
continent, and its elevation, air samples taken at the observatory are
uncontaminated and representative of the central Pacific atmosphere. Daily
measurements of atmospheric carbon dioxide have been taken at Mauna Loa since
1958, making it the longest continuous record of CO2 in the world. These
measurements are collected by the Global Monitoring Laboratory (GML) of the US
National Oceanic and Atmospheric Administration (NOAA), and their recordings
are made available at the NOAA website [^noaa].

[^noaa]: https://www.esrl.noaa.gov/gmd/ccgg/trends/

```{figure} figures/MaunaLoaObservatory.png
---
name: MaunaLoaObservatory
---

The Mauna Loa observatory makes daily measurements of CO2; we'll analyze its
recordings in this chapter.
```

The GML routinely calibrates the monitoring equipment and estimates instrument
bias. For example, multiple times a day, gases with known and stable CO2
concentrations are measured and used to calibrate the instrument. In addition,
samples of air are taken independently each week and sent to NOAA headquarters
to be measured along with air samples from around the world to estimate the
bias of the instrument.  The bias has been consistently measured at about 0.2
ppm, which is small (0.05%) in comparison to the typical measurements of about
400 ppm. More details of the instrumentation are available at the 
NOAA website [^about_co2].

[^about_co2]: https://gml.noaa.gov/ccgg/about/co2_measurements.html
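
The relative size of the bias quoted above can be checked with a line of
arithmetic. The snippet below is only an illustration; the variable names are
ours, and the two values are the approximate figures given in the text.

```python
# Approximate values quoted in the text (not read from the NOAA data files)
bias_ppm = 0.2       # estimated instrument bias, in ppm
typical_ppm = 400    # typical CO2 measurement, in ppm

# Bias as a fraction of a typical measurement
relative_bias = bias_ppm / typical_ppm
print(f"{relative_bias:.2%}")   # prints 0.05%
```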

## Drug Abuse Warning Network (DAWN) Survey

DAWN is a national health care survey that monitors trends in drug abuse and the
emergence of new substances of abuse. The survey also aims to estimate the
impact of drug abuse on the country's health care system and improve how
emergency departments monitor substance-abuse crises. DAWN is administered by
the U.S. Department of Health and Human Services' Substance Abuse and Mental
Health Services Administration (SAMHSA). DAWN was administered annually from
1998 through 2011. Due in part to the opioid epidemic, DAWN began again in
2018. We examine the 2011 data that have been made available through the SAMHSA
Data Archive [^SAMHDA].


[^SAMHDA]: https://www.datafiles.samhsa.gov/study-series/drug-abuse-warning-network-dawn-nid13516


The target population of the survey is all emergency room visits in the U.S.
These visits are accessed through a frame of emergency rooms in hospitals (and
their records). Hospitals are selected for the survey through probability
sampling [^prob_sampling], and all drug-related visits to the hospital's
emergency room are included in the survey. All types of drug-related visits are
included, such as drug misuse, abuse, accidental ingestion, suicide attempts,
malicious poisonings, and adverse reactions.  For each visit, up to 16 drugs
can be recorded, including illegal drugs, prescription drugs, and
over-the-counter medications. 

[^prob_sampling]: We cover probability sampling in Section XX.
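
To make the two-stage structure described above concrete, here is a minimal
sketch using synthetic data. It assumes a simplified design in which hospitals
are drawn with equal probability, which may differ from the actual DAWN design.
The column names `hospital_id` and `drug_related`, the number of hospitals, and
the visit counts are all hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical frame: 50 hospitals and 2,000 synthetic emergency room visits
n_hospitals = 50
visits = pd.DataFrame({
    "hospital_id": rng.integers(0, n_hospitals, size=2000),
    "drug_related": rng.random(2000) < 0.1,
})

# Stage 1: choose hospitals at random (equal probability is our simplification)
sampled_hospitals = rng.choice(n_hospitals, size=10, replace=False)

# Stage 2: keep every drug-related visit at the sampled hospitals
in_sample = visits["hospital_id"].isin(sampled_hospitals) & visits["drug_related"]
survey = visits[in_sample]
```

In this sketch the hospital, not the visit, is the unit chosen by chance; once a
hospital is selected, all of its drug-related visits enter the survey.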

## San Francisco Restaurant Food Safety

The San Francisco Department of Public Health routinely makes unannounced
visits to restaurants and inspects them for food safety. The inspector
calculates a score based on the violations observed and provides descriptions
of the violations that were found. These food safety scores are available
through the city's Open Data initiative, called DataSF. DataSF is one example
of city governments around the world making their data publicly available; the
DataSF mission is to "empower the use of data in decision making and service
delivery" with the goal of improving the quality of life and work for
residents, employers, employees and visitors.


The City of San Francisco requires restaurants to publicly display their scores
(see {ref}`scoreCard` below for an example placard). Data from the inspections
offer an example of multiple datasets with different structures and fields. One
dataset contains summary results of inspections, another provides details about
violations found during an inspection, and a third contains information about
the restaurants. These inspections occurred between 2013 and 2016. Some
restaurants have multiple inspections in a year, and not all of the 7000+
restaurants are inspected annually. The violations include serious problems
related to the transmission of foodborne illnesses and contamination of
food-contact surfaces, and minor issues such as not properly displaying the
inspection placard.
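
The relationship among the three datasets can be sketched with tiny made-up
tables. Everything below is hypothetical: the column names, the restaurant
names, and the values are invented for illustration and are not taken from the
DataSF files.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables
businesses = pd.DataFrame({
    "business_id": [1, 2, 3],
    "name": ["Cafe A", "Diner B", "Taqueria C"],
})
inspections = pd.DataFrame({
    "business_id": [1, 1, 2],
    "date": pd.to_datetime(["2015-03-01", "2015-09-15", "2016-01-20"]),
    "score": [92, 88, 100],
})
violations = pd.DataFrame({
    "business_id": [1, 1, 1],
    "date": pd.to_datetime(["2015-03-01", "2015-03-01", "2015-09-15"]),
    "description": ["Unclean food-contact surface",
                    "Placard not posted",
                    "Improper food storage"],
})

# Count inspections per restaurant per year; some restaurants have several
per_year = (
    inspections.assign(year=inspections["date"].dt.year)
    .groupby(["business_id", "year"])
    .size()
)

# A left join keeps restaurants that were never inspected (their score is NaN)
merged = businesses.merge(inspections, on="business_id", how="left")
```

Here restaurant 1 has two inspections in 2015, while restaurant 3 has none, so
the tables have different numbers of rows and need to be joined on a shared key.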

```{figure} figures/scoreCard.png
---
name: scoreCard
---

An example food safety scorecard given to a restaurant in San Francisco. 
Restaurants receive a score between 0 and 100.
```

Note that as of 2020, the City gives restaurants color-coded placards indicating
whether the restaurant passed (green), conditionally passed (yellow), or failed
(red) the inspection, rather than a scorecard that displays a number. However,
the actual inspection scores and violations for restaurants are still available
through DataSF.