In [1]:
# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)

(sec:scope_construct)=
# Target Population, Access Frame, Sample

An important initial step in the data life cycle is to express the question of interest in the context of the subject area and consider the connection between the question and the data collected to answer that question.  It’s good practice to do this before even thinking about the analysis or modelling steps because it may uncover a disconnect where the question of interest cannot be directly addressed with these data. As part of making the connection between the data collection process and the topic of investigation, we identify the population, the means of accessing the population, instruments of measurement, and additional protocols used in the collection process. These concepts help us understand the scope of the data, whether we aim to gain knowledge about a population, scientific quantity, physical model, social behavior, etc.   

The *target population* consists of the collection of elements that you ultimately intend to describe and draw conclusions about. By *element* we mean those individuals that make up our population. The element may be a person in a group of people, a voter in an election, a tweet from a collection of tweets, or a county in a state. We sometimes call an element a *unit* or an *atom*.   

The *access frame* is the collection of elements that are accessible to you for measurement and observation. These are the units by which you can access the target population. Ideally, the frame and population are perfectly aligned, meaning they consist of exactly the same elements. However, the units in an access frame may be only a subset of the target population; additionally, the frame may include units that don’t belong to the population. For example, to find out how a voter intends to vote in an election, you might call people by phone. Someone you call may not be a voter, so they are in your frame but not in the population. On the other hand, a voter who never answers a call from an unknown number can't be reached, so they are in the population but not in your frame.

The *sample* is the subset of units taken from the access frame to measure, observe, and analyze. The sample gives you the data to analyze to make predictions or generalizations about the population of interest.   

The contents of the access frame, in comparison to the target population, and the method used to select units from the frame to be in the sample are important factors in determining whether or not the data can be considered representative of the target population.  If the access frame is not representative of the target population, then the data from the sample is most likely not representative either. And, if the units are sampled in a biased manner, problems with representativeness also arise.
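One way to make these definitions concrete is to treat the population and the frame as sets of unit identifiers. The following is a minimal sketch, not drawn from any real study: the ID ranges, the sample size of 50, and the variable names are all assumptions chosen for illustration.

```python
import random

# Hypothetical unit IDs; none of these come from a real study.
population = set(range(0, 1000))      # target population: units 0..999
frame = set(range(200, 1100))         # access frame: units 200..1099

covered = population & frame          # units in both the population and the frame
missed = population - frame           # in the population but not reachable
extraneous = frame - population       # reachable but not in the population

# Simple random sample of 50 units drawn from the frame (not the population).
rng = random.Random(42)
sample = set(rng.sample(sorted(frame), k=50))

print(len(covered), len(missed), len(extraneous))
print(len(sample & population), "of", len(sample), "sampled units belong to the population")
```

With these made-up ranges, 800 units are covered, 200 are missed, and 100 are extraneous. Because the sample is drawn from the frame, it can never contain a missed unit, no matter how carefully the random selection is carried out.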

For each of the following examples, we provide a diagram of the data scope. The target population, access frame, and sample are represented by circles with shaded interiors, and for each example, the configuration of their overlap represents the scope.


__EXAMPLE: Informal Rewards and Peer Production.__ 
Content on Wikipedia is written and edited by volunteers who belong to the Wikipedia community. This online community is crucial to the success and vitality of Wikipedia. In trying to understand how to incentivize members of online communities, researchers carried out an experiment with Wikipedia contributors as subjects {cite}`restivo2012`. The target population is the collection of active contributors: those who made at least one contribution to Wikipedia in the month before the start of the study. The target population was further restricted to the top 1% of contributors. The access frame eliminated anyone in the population who had received an informal incentive that month. The researchers purposely excluded these contributors because they wanted to measure the impact of an incentive, and those who had already received one might behave differently. (See {numref}`fig:WikipediaExptConstruct`.)

```{figure} WikipediaExptConstruct.png
---
name: fig:WikipediaExptConstruct
---
The access frame does not include the entire population because the experiment included only those contributors who had not already received an incentive. The sample is a randomly selected subset from the frame. 
```

The sample is a randomly selected set of 200 contributors from the frame. The contributors in the sample were observed for 90 days, and digital traces of their activities on Wikipedia were collected. Notice that the contributor population is not static; there is regular turnover. In the month prior to the start of the study, more than 144,000 volunteers produced content for Wikipedia. Selecting top contributors from among this group limits the generalizability of the findings. However, given the size of the group of top contributors, if they can be influenced by an informal reward to maintain or increase their contributions, that is a valuable finding.
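The three stages of this design (restrict to the top 1%, exclude those already rewarded, then sample 200 at random) can be sketched with simulated data. Everything in the table below is made up for illustration: the round figure of 144,000 active contributors, the edit counts, the 5% reward rate, and the column names are assumptions, not values from the study.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated stand-in for the active contributors; all values are invented.
n_active = 144_000
contributors = pd.DataFrame({
    "contributor_id": np.arange(n_active),
    # At least one edit each, so everyone counts as "active".
    "edits_last_month": rng.poisson(lam=5, size=n_active) + 1,
    # Assumed flag for having already received an informal reward.
    "already_rewarded": rng.random(n_active) < 0.05,
})

# Target population: the top 1% of active contributors by edit count.
n_top = n_active // 100
target = contributors.nlargest(n_top, "edits_last_month")

# Access frame: drop anyone who already received a reward.
frame = target[~target["already_rewarded"]]

# Sample: 200 contributors selected at random from the frame.
sample = frame.sample(n=200, random_state=0)

print(len(target), len(frame), len(sample))
```

Each step narrows the collection of units, so any finding from the sample speaks most directly to top contributors who had not yet been rewarded.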

In many experiments and studies, we don’t have the ability to include all population units in the frame. It is often the case that the access frame consists of volunteers who are willing to join the study/experiment. $\blacksquare$

__EXAMPLE: 2016 Presidential Election Upset.__ 
The outcome of the US presidential election in 2016 took many people and many pollsters by surprise. Most pre-election polls predicted Clinton would beat Trump by a wide margin. Political polling is a type of public opinion survey held prior to an election that attempts to gauge who people will vote for. Since opinions change over time, the survey starts with a "horse-race" question, where respondents are asked for whom they would vote in a head-to-head race if the election were tomorrow: Candidate A or Candidate B.  

In these pre-election surveys, the target population consists of those who will vote in the election, which in this example was the 2016 US presidential election. However, pollsters can only guess at whether someone will vote in the election so the access frame consists of adults who have a landline or mobile phone (so they can be contacted by the pollster) and are determined to be likely voters (this is usually based on their past voting record, but other factors may be used to determine this). The sample is those people in the frame who are chosen according to a random dialing scheme. (See {numref}`fig:ElectionPollConstruct`).

```{figure} ElectionPollConstruct.png
---
name: fig:ElectionPollConstruct
---
This representation is typical of many surveys. The access frame does not cover all of the population and includes some who are not in the population.
```

Later in {numref}`Chapter %s <ch:theory_datadesign>`, we discuss the impact on the election predictions of people’s unwillingness to answer their phone or participate in the poll. $\blacksquare$

__EXAMPLE: Pollution and Health.__
The [CalEnviroScreen project](https://oehha.ca.gov/calenviroscreen) studies connections between population health and environmental pollution in California communities. The California Environmental Protection Agency (CalEPA), the California Office of Environmental Health Hazard Assessment (OEHHA), and the public developed the CalEnviroScreen project. The project uses data collected from several sources, including demographic summaries from the U.S. census, health statistics from the California Office of Statewide Health Planning and Development, and pollution measurements from air monitoring stations around the state maintained by the California Air.

Ideally, we want to study the people of California and assess the impact of these environmental hazards on an individual's health. However, in this situation, the data can only be obtained at the level of a census tract. 
The access frame consists of groups of residents living in the same census tract. So, the units in the frame are census tracts and the sample is a census--all of the tracts--since data are provided for all of the tracts in the state. (See {numref}`fig:CalEnviroConstruct`).

```{figure} CalEnviroConstruct.png
---
name: fig:CalEnviroConstruct
---
The grid in the access frame represents the census tracts. The population, frame, and sample cover all Californians, but the grid limits measurements to the level of census tract. 
```

Unfortunately, we cannot disaggregate the information in a tract to examine what happens to an individual person. This aggregation impacts how we analyze the data and the conclusions that we can draw. $\blacksquare$
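To see what is lost in aggregation, consider the following small sketch. The individual-level table is entirely hypothetical (tract labels, exposure values, and column names are invented), since such records are exactly what the tract-level data do not provide.

```python
import pandas as pd

# Hypothetical individual-level records; these values are made up.
people = pd.DataFrame({
    "tract": ["A", "A", "A", "B", "B", "B"],
    "exposure": [1.0, 2.0, 9.0, 4.0, 4.0, 4.0],
})

# What a tract-level data set reports: one summary row per tract.
tracts = people.groupby("tract", as_index=False).agg(
    mean_exposure=("exposure", "mean"),
    n_residents=("exposure", "size"),
)
print(tracts)

# The within-tract spread is visible only with the individual records.
spread = people.groupby("tract")["exposure"].std()
print(spread)
```

Both invented tracts have the same mean exposure, yet the residents of tract A differ widely from one another while those in tract B do not. An analysis that sees only the `tracts` table cannot tell these two situations apart.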

These examples have demonstrated some of the configurations a target population, access frame, and sample might have, and the exercises provide a few more examples. When a frame doesn't reach everyone, we should consider how this missing information might impact our findings. Similarly, we ask what might happen when a frame includes those not in the population. Additionally, the techniques for drawing the sample can impact how representative of the population the sample is. We begin to address this topic in {numref}`sec:variationtypes`. When you think about the generalizability of your data findings, you will also want to consider the quality of the instruments and procedures used to collect the data. If your sample is a census that matches your target, but the information is poorly collected, then your findings will be of little value. This is the topic of the next section.

(sec:scope_protocols)=
# Instruments and Protocols

When we consider the scope of the data, we also consider the instrument being used to take the measurements and the procedure for taking measurements, which we call the *protocol*. For a survey, the instrument is typically the questionnaire that an individual in the sample answers. The protocol for a survey includes how the sample is chosen, how nonrespondents are followed up, interviewer training, protections for confidentiality, etc. 

Good instruments and protocols are important to all kinds of data collection. If we want to measure a natural phenomenon, such as the speed of light, we need to quantify the accuracy of the instrument. The protocol for calibrating the instrument and taking measurements is also vital to obtaining accurate measurements. Instruments can go out of alignment and measurements can drift over time, leading to highly inaccurate measurements (see {numref}`sec:variationtypes`).  

Protocols are also critical in experiments. Ideally, any factor that can influence the outcome of the experiment is controlled. For example, temperature, time of day, confidentiality of a medical record, and even the order of taking measurements need to be kept consistent so that these factors can be ruled out as explanations for the results.  

With digital traces, the algorithms used to support online activity are dynamic and continually re-engineered. For example, Google’s search algorithms are continually tweaked to improve user service and advertising revenue. Changes to the search algorithms can impact the data generated from the searches, which in turn impact systems built from these data, such as the Google Flu Trends tracking system (see {numref}`sec:scope_bigdata`). This changing environment can make it untenable to maintain data collection protocols and difficult to replicate findings. 
  
Many data science projects involve linking data together from multiple sources. Each source should be examined through this data-scope construct and any difference across sources considered. Additionally, matching algorithms used to combine data from multiple sources need to be clearly understood so that populations and frames from the sources can be compared.
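When linking sources, it helps to check which units appear in one source, the other, or both, since that reveals how their frames differ. Below is a minimal sketch using the `indicator` option of `DataFrame.merge` in `pandas`. The two tables, their column names, and their values are hypothetical, and an exact key match is only one of many possible matching rules.

```python
import pandas as pd

# Two hypothetical sources keyed by a tract ID; all values are made up.
health = pd.DataFrame({
    "tract_id": [101, 102, 103, 104],
    "asthma_rate": [5.1, 7.3, 6.0, 4.4],
})
pollution = pd.DataFrame({
    "tract_id": [102, 103, 104, 105],
    "pm25": [12.0, 9.5, 14.2, 8.8],
})

# An outer join keeps every unit from either source; the `_merge` column
# records whether a unit came from the left source, the right, or both.
linked = health.merge(pollution, on="tract_id", how="outer", indicator=True)
coverage = linked["_merge"].value_counts()
print(coverage)

# Only units found in both sources have complete information.
matched = linked[linked["_merge"] == "both"]
print(matched[["tract_id", "asthma_rate", "pm25"]])
```

Units that appear in only one source are the ones to investigate: they indicate that the sources do not share the same frame, and dropping them silently changes the scope of the combined data.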

Measurements from an instrument taken to study a natural phenomenon can also be cast in the Venn diagram of a target, access frame, and sample (see {numref}`sec:scope_construct`). This approach is helpful in understanding their accuracy.  


(sec:scope_naturalphenomenon)=
# Measuring Natural Phenomena

The Venn diagram introduced for observing a target population can be extended to the situation where we want to measure a quantity such as the count of particles in the air, the age of a fossil, the speed of light, etc. In these cases we consider the quantity we want to measure as an unknown value. (This unknown value is often referred to as a *parameter*.) In our diagram, we shrink the target to a point that represents this unknown. The instrument’s accuracy acts as the frame, and the sample consists of the measurements taken by the instrument within the frame. You might think of the frame as a dart board, where the instrument is the person throwing the darts. If they are reasonably good, the darts land within the circle, scattered around the bullseye. The scatter of darts corresponds to the measurements taken by the instrument. The target point is not seen by the dart thrower, but ideally it coincides with the bullseye. 
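The dart board picture can be mimicked with a small simulation. This is only a sketch under strong assumptions: the true value of 10, the normally distributed scatter, the offset of 2 for a misaligned instrument, and the function name `take_measurements` are all invented for illustration. In a real study the true value is unknown; we fix it here only so that we can see how the measurements relate to it.

```python
import numpy as np

rng = np.random.default_rng(1)

# The parameter we want to measure. It is known here only because we simulate.
true_value = 10.0

def take_measurements(n, offset, spread):
    """Simulate n readings scattered around true_value + offset.

    offset: how far the bullseye sits from the target (0 means aligned).
    spread: standard deviation of the scatter of the darts.
    """
    return true_value + offset + rng.normal(0.0, spread, size=n)

# A well-aligned instrument: the bullseye coincides with the target.
aligned = take_measurements(1000, offset=0.0, spread=0.5)

# A misaligned instrument: same scatter, but centered away from the target.
shifted = take_measurements(1000, offset=2.0, spread=0.5)

print("aligned mean:", round(aligned.mean(), 2))
print("shifted mean:", round(shifted.mean(), 2))
```

Taking more readings tightens the average around the bullseye, but it does nothing to move the bullseye onto the target. That is why the calibration protocol matters as much as the number of measurements.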

To illustrate the concepts of measurement error and the connection to sampling error, we examine the problem of calibrating air quality sensors.  

__EXAMPLE: Purple Air.__ 
Across the US, sensors to measure air pollution are widely used by individuals, community groups, and state and local air monitoring agencies {cite}`hug2020,owyang2020`. For example, on two days in September, 2020, approximately 600,000 Californians and 500,000 Oregonians viewed PurpleAir’s map as fire spread through their states and evacuations were planned. PurpleAir creates air quality maps from crowdsourced data that streams in from their sensors. See the map of monitor readings in Berkeley on Aug 21, 2020 (screenshot taken by Josh Hug).  

```{figure} PurpleAirConstruct.png
---
name: fig:PurpleAirConstruct
---
This representation is typical of many measurement processes. The access frame represents the measurement process which reflects the accuracy of the instrument.
```

We can think of the data scope as follows: at any location and point in time, there is a true particle composition in the air surrounding the sensor. This is our target. Our instrument, the sensor, takes many measurements, in some cases a reading every second. These form a sample contained in the access frame, the dart board. If the instrument is working properly, the measurements are centered around the bullseye, and the target coincides with the bullseye. Researchers have found that low humidity can distort the readings so that they are too high {cite}`hug2020`. In {numref}`Chapter %s <ch:pa>`, we address how to use data science to calibrate these instruments to improve their accuracy. $\blacksquare$

We continue the dart board analogy in the next section to introduce the concepts of bias and variation, describe common ways in which a sample might not be representative of the population, and draw connections between accuracy and the protocol. 