In [None]:
# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
# Use the full option name; the bare 'precision' pattern is ambiguous in newer pandas
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)

# A Construct for Data Scope

An important initial step in the data life cycle is to express the question of interest in the context of the subject area and to consider the connection between that question and the data collected to answer it. It’s good practice to do this before even thinking about the analysis or modeling steps, because it may uncover a disconnect where the question of interest cannot be directly addressed with the data. As part of making the connection between the data collection process and the topic of investigation, we identify the population, the means of accessing the population, the instruments of measurement, and any additional protocols used in the collection process. We use these concepts to create a construct for the scope of the data.

## The Topic of Investigation

We analyze data to gain knowledge about a target population, scientific quantity, physical model, social behavior, etc. Prior to our analysis, we should clarify the main question of interest and how the data are associated with the question. Understanding this connection helps us identify any limitations the data might have in addressing the question.

## Target Population, Access Frame, Sample

The *target population* is the collection of elements that the data scientist ultimately intends to describe and draw conclusions about.  

The *access frame* is the collection of elements that are accessible to us for measurement and observation. These are the units by which we access the target population. Ideally, the frame and population are perfectly aligned, i.e., they have the exact same units. However, the units in an access frame may be only a subset of the target population; additionally, the frame may include units that don’t belong to the population. 

The *sample* is the subset of units taken from the access frame to measure, observe, and analyze. 

Two factors are important in determining whether the data can be considered representative of the target population: how the contents of the access frame compare to the target population, and the method used to select units from the frame for the sample. If the access frame is not representative of the target population, then the data from the sample are most likely not representative either. And if the units are sampled in a biased manner, problems with representativeness also arise.
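
The effect of a misaligned access frame can be illustrated with a small simulation. The sketch below is not drawn from any of the studies in this chapter; the population, its two groups, and all numeric values are hypothetical, chosen only to show that a random sample from a frame that misses part of the population stays off target.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical target population of 10,000 units with one numeric trait.
# The first 7,000 units (group A) are centered near 50;
# the last 3,000 units (group B) are centered near 70.
population = np.concatenate([
    rng.normal(50, 5, size=7_000),
    rng.normal(70, 5, size=3_000),
])

# Frame 1 is perfectly aligned with the population.
aligned_frame = population
# Frame 2 cannot reach group B at all (an assumption for illustration).
partial_frame = population[:7_000]


def sample_mean(frame, n=500):
    """Mean of a simple random sample of n units, drawn without replacement."""
    return rng.choice(frame, size=n, replace=False).mean()


pop_mean = population.mean()
mean_aligned = sample_mean(aligned_frame)
mean_partial = sample_mean(partial_frame)

print(f"population mean:           {pop_mean:.1f}")
print(f"sample from aligned frame: {mean_aligned:.1f}")
print(f"sample from partial frame: {mean_partial:.1f}")
```

Both samples are drawn by the same chance mechanism, so the gap between the two sample means comes entirely from the frame, not from the sampling method.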

For each of the following examples, we provide a diagram for the data-scope construct. The target population, access frame, and sample are represented by circles with shaded interior, and for each example, the configuration of their overlap represents the scope.  


**TODO: figure**

### Example: Informal Rewards and Peer Production 

The content on Wikipedia is written and edited by volunteers who belong to the Wikipedia community. This online community is crucial to the success and vitality of Wikipedia. In trying to understand how to incentivize members of online communities, researchers carried out an experiment with Wikipedia contributors as subjects {cite}`restivo2012`. The target population is the collection of active contributors, that is, those who made at least one contribution to Wikipedia in the month before the start of the study. The access frame is restricted to the top 1% of contributors, excluding anyone who had already received an informal incentive. The sample is a randomly selected set of 200 contributors from the frame. The sampled contributors were observed for 90 days, and digital traces of their activities on Wikipedia were collected. Notice that the contributor population is not static; there is regular turnover. In the month prior to the start of the study, more than 144,000 volunteers produced content for Wikipedia. Selecting top contributors from among this group limits the generalizability of the findings. However, given the size of the group of top contributors, if an informal reward can influence them to maintain or increase their contributions, that is a valuable finding.
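
The narrowing from population to frame to sample can be sketched in code. This is only an illustration under stated assumptions: the contributor table is simulated, the column names (`edits_last_month`, `has_prior_award`) are hypothetical, and ranking contributors by last month's edit count is our guess at what "top 1%" might mean, not the researchers' actual procedure.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2012)

# Simulated population of active contributors (all values are made up).
n_active = 144_000
contributors = pd.DataFrame({
    "user_id": np.arange(n_active),
    # Heavy-tailed edit counts; every active contributor has at least one edit.
    "edits_last_month": rng.pareto(1.5, size=n_active).round().astype(int) + 1,
    # Hypothetical flag for contributors who already received an informal award.
    "has_prior_award": rng.random(n_active) < 0.05,
})

# Access frame: roughly the top 1% by activity, minus prior award recipients.
cutoff = contributors["edits_last_month"].quantile(0.99)
frame = contributors[
    (contributors["edits_last_month"] >= cutoff)
    & ~contributors["has_prior_award"]
]

# Sample: 200 contributors chosen at random from the frame, without replacement.
sample = frame.sample(n=200, random_state=0)

print(f"population: {len(contributors):,}  frame: {len(frame):,}  sample: {len(sample)}")
```

Each step shrinks the collection of units, which makes it easy to see which contributors any conclusions can reasonably be extended to.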

**TODO: figure**

In many experiments and studies, we don’t have the ability to include all population units in the frame. It is often the case that the access frame consists of volunteers who are willing to join the study/experiment. 

### Example: Pollution and Health

The [CalEnviroScreen project](https://oehha.ca.gov/calenviroscreen) studies connections between population health and environmental pollution in California communities. The California Environmental Protection Agency (CalEPA), the California Office of Environmental Health Hazard Assessment (OEHHA), and the public developed the CalEnviroScreen project. The project uses data collected from several sources, including demographic summaries from the U.S. census, health statistics from the California Office of Statewide Health Planning and Development, and pollution measurements from, e.g., air monitoring stations around the state maintained by the California Air Resources Board. That is, CalEnviroScreen has combined and repurposed administrative data to answer new questions of interest. One area they study is the relationship between the levels of particulate matter in the air and asthma. Ideally, we could examine these relationships for individuals. However, since this information is only available through federal administrative records for census tracts, we can only examine the prevalence of asthma at the level of a census tract. This is an example of how we need to align the data collected with the topic of investigation. This does not mean that our study is flawed, but we do need to be cautious with our conclusions. Relationships between aggregated quantities tend to appear stronger than relationships measured on individuals. Ideally, the target population consists of the residents of California, but in this situation the access frame consists of groups of residents, i.e., census tracts. The units in the frame (census tracts) cannot be disaggregated to examine individuals; this impacts how we analyze the data and the conclusions that we can draw. The sample is a census, i.e., it consists of all units in the frame.
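
The caution about aggregated quantities can be seen in a short simulation. The sketch below does not use CalEnviroScreen data; the tracts, exposures, outcomes, and effect sizes are all simulated assumptions, chosen so that the individual-level relationship is weak by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

n_tracts, n_per_tract = 200, 100

# Hypothetical average pollution exposure for each tract.
tract_exposure = rng.normal(10, 2, size=n_tracts)

# Individual exposure = tract level + personal variation.
exposure = (np.repeat(tract_exposure, n_per_tract)
            + rng.normal(0, 2, size=n_tracts * n_per_tract))

# Individual health outcome: weakly tied to exposure, with a lot of noise.
outcome = 0.5 * exposure + rng.normal(0, 4, size=exposure.size)

# Correlation measured on individuals.
r_individual = np.corrcoef(exposure, outcome)[0, 1]

# Correlation measured on tract averages (rows of the reshaped arrays are tracts).
exposure_by_tract = exposure.reshape(n_tracts, n_per_tract).mean(axis=1)
outcome_by_tract = outcome.reshape(n_tracts, n_per_tract).mean(axis=1)
r_tract = np.corrcoef(exposure_by_tract, outcome_by_tract)[0, 1]

print(f"individual-level correlation: {r_individual:.2f}")
print(f"tract-level correlation:      {r_tract:.2f}")
```

Averaging within tracts cancels much of the person-to-person noise, so the tract-level correlation is much larger even though the underlying individual relationship has not changed.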


**TODO: figure**

### Example: 2016 Presidential Election Upset 

The outcome of the US presidential election in 2016 took many people and many pollsters by surprise. Many pre-election polls predicted Clinton would beat Trump by a wide margin. Political polling is a type of public opinion survey held prior to an election that attempts to gauge who people will vote for. Since opinions change over time, the survey starts with a “horse-race” question, where respondents are asked for whom they would vote in a head-to-head race if the election were tomorrow: Candidate A or Candidate B. In these pre-election surveys, the target population consists of those who will vote in the election, which in this example was the 2016 US presidential election. However, pollsters can only guess at whether someone will vote in the election, so the access frame consists of adults who have a landline or mobile phone (so they can be contacted by the pollster) and are determined to be likely voters (this is usually based on their past voting record, but other factors may be used to determine this). The sample consists of those people in the frame who are chosen according to a random sampling scheme. Later in this chapter, we discuss the impact on the election predictions of people’s unwillingness to answer their phone or participate in the poll.


**TODO: figure**

## Instruments and Protocols

When we consider the scope of the data, we also consider the instrument being used to take the measurements and the protocol, i.e., the procedure for taking measurements. In a survey, the instrument might be a questionnaire that the sampled individuals answer. The protocol includes how the sample is chosen, how nonrespondents are followed up, interviewer training, protections for confidentiality, etc. Section XX is dedicated to describing chance mechanisms for selecting samples, and Chapter XX introduces the theory behind why these chance mechanisms are preferable for sampling. Section XX describes common ways in which the sample might not be representative of the population and draws connections to the protocol.

Good instruments and protocols are important to all kinds of data collection. If we want to measure a natural phenomenon, such as the speed of light, we need to quantify the accuracy of the instrument. The protocol for calibrating the instrument and taking measurements is vital to obtaining accurate measurements. Instruments can get out of alignment and measurements can drift over time, leading to poor, highly inaccurate measurements (see Section XX).

Protocols are also critical in experiments. Ideally, any factor that can influence the outcome of the experiment is controlled; e.g., temperature, time of day, confidentiality, and the order of taking measurements need to be kept consistent. These ideas are discussed in more detail in Section XX.

With digital traces, the algorithms used to support online activity are dynamic and continually re-engineered. For example, Google’s search algorithms are continually tweaked to improve user service and advertising revenue. Changes to the search algorithms can impact the data generated from the searches, which in turn impact systems built from these data, such as the Google Flu Trends tracking system (see Example XX). This changing environment can make it untenable to maintain data collection protocols and difficult to replicate findings.
  
Many data science projects involve linking data together from multiple sources. Each source should be examined through this data-scope construct and any difference across sources considered. Additionally, matching algorithms used to combine data from multiple sources need to be clearly understood so that populations and frames from the sources can be compared.
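
One simple way to compare frames across linked sources is to check which units appear in each. The sketch below assumes two hypothetical tract-level tables with made-up IDs and values, and uses the `indicator` argument of the pandas `merge` method to label where each unit came from.

```python
import pandas as pd

# Hypothetical tract-level tables from two sources (IDs and values are made up).
health = pd.DataFrame({
    "tract_id": ["T01", "T02", "T03", "T04"],
    "asthma_rate": [52.1, 47.3, 60.8, 39.5],
})
air = pd.DataFrame({
    "tract_id": ["T02", "T03", "T04", "T05"],
    "pm25": [11.2, 14.9, 8.7, 12.3],
})

# An outer join keeps every unit from both sources; the _merge column records
# whether a unit was found in the left table, the right table, or both.
linked = health.merge(air, on="tract_id", how="outer", indicator=True)

n_both = (linked["_merge"] == "both").sum()
n_health_only = (linked["_merge"] == "left_only").sum()
n_air_only = (linked["_merge"] == "right_only").sum()

print(linked)
print(f"in both: {n_both}, health only: {n_health_only}, air only: {n_air_only}")
```

Units that appear in only one source are exactly where the frames differ, so they deserve a closer look before the linked data are analyzed.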


**TODO: cross-refs**

## Target, Access, Sample: Measuring Natural Phenomena

The construct introduced for observing populations can be extended to the situation where we want to measure a quantity such as the count of particles in the air, the age of a fossil, etc. In these cases we consider the quantity we want to measure as an unknown parameter value. In our diagram, we shrink the target to a bullseye that represents this parameter. The instrument’s accuracy acts as the frame, and the sample consists of the measurements taken by the instrument within the frame, like a scatter shot in the frame.  An example helps clarify the analogy.   

### Example: Purple Air 

In the US, sensors to measure air pollution are widely used by individuals, community groups, and state and local air monitoring agencies {cite}`hug2020`, {cite}`owyang2020`. For example, on two days in September, 2020, approximately 600,000 Californians and 500,000 Oregonians viewed PurpleAir’s map as fire spread through their states and evacuations were planned. PurpleAir creates air quality maps from crowdsourced data that streams in from their sensors. See the map of monitor readings in Berkeley on Aug 21, 2020 (screenshot taken by Josh Hug).  



**TODO: figure**

We can think of the data scope as follows: at any location and point in time, there is a true particle composition in the air surrounding the sensor; this is our target, the bullseye. Our instrument, the sensor, takes many measurements, in some cases a reading every second. These form a sample contained in the access frame. If the instrument is working properly, the frame is centered on the bullseye, and the sample falls close to the bullseye. Researchers have found that low humidity can distort the readings so that they are too high {cite}`hug2020`. In Section XX, we address how to use data science to calibrate these instruments to improve their accuracy.
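
The bullseye analogy can be made concrete with a simulation. This is a minimal sketch, not a model of an actual PurpleAir sensor: the true value, the noise level, and the size of the distortion are all hypothetical numbers chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# The bullseye: a hypothetical true particle concentration.
true_pm25 = 35.0


def readings(bias, noise_sd, n=3_600):
    """Simulate n sensor readings: the true value plus a fixed bias plus random noise."""
    return true_pm25 + bias + rng.normal(0, noise_sd, size=n)


# A well-calibrated sensor: readings scatter around the bullseye.
calibrated = readings(bias=0.0, noise_sd=3.0)
# A distorted sensor: the whole frame is shifted, so readings scatter around
# the wrong center (the shift of 8 units is an assumption).
distorted = readings(bias=8.0, noise_sd=3.0)

print(f"true value:            {true_pm25:.1f}")
print(f"calibrated mean (sd):  {calibrated.mean():.1f} ({calibrated.std():.1f})")
print(f"distorted mean (sd):   {distorted.mean():.1f} ({distorted.std():.1f})")
```

Taking more readings narrows the scatter of the average but does nothing about the shift, which is why calibration matters more than sample size when the frame is off center.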

**TODO: cite**

**TODO: figure**