In [1]:
# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
# Use the full option name; the bare 'precision' pattern is ambiguous in newer pandas
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)

(ch:data_scope)=
# Data Scope

As data scientists, we use data to answer questions, and the quality of the data collection process can significantly affect the validity and accuracy of the data, the strength of the conclusions we draw from an analysis, and the decisions we make. In this chapter, we describe a general approach for understanding data collection and evaluating the usefulness of the data in addressing the question of interest. Ideally, we aim for data to be representative of the phenomenon that we are studying, whether that phenomenon is a population characteristic, a physical model, or some type of social behavior. Typically, our data do not contain complete information (the scope is restricted in some way), yet we want to use the data to accurately describe a population, estimate a scientific quantity, infer the form of a relationship between features, or predict future outcomes. In all of these situations, if our data are not representative of the object of our study, then our conclusions can be limited, possibly misleading, or even wrong.

To motivate the need to think about these issues, we begin with an example of the power of big data and what can go wrong ({numref}`Section %s <sec:scope_bigdata>`). We then provide a framework that can help you connect the goal of your study with the data collection process. We refer to this as the **data scope**. 
{numref}`Sections %s <sec:scope_construct>` and {numref}` %s <sec:scope_protocols>` provide terminology to help describe data scope, along with examples from surveys, government data, scientific instruments, and online resources. Later, in {numref}`Section %s <sec:scope_accuracy>`, we consider what it means for data to be accurate. There, we introduce different forms of bias and variation, and describe conditions where they can arise. Throughout, the examples cover the spectrum of data that you may work with as a data scientist; they come from science, political elections, public health, and online communities.
