# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
# Use the full option name; the bare 'precision' key is ambiguous in newer pandas
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)

# Summary

No matter the kind of data you are working with, before diving into cleaning, exploration, and analysis, take a moment to look into the data's source. If you didn't collect the data, ask yourself:

+ Who collected the data?
+ Why were the data collected?

Answers to these questions can help determine whether these found data can be used to address the question of interest to you. 

Consider the scope of the data. Questions about the temporal and location aspects of data collection can provide valuable insights: 
 
+ When were the data collected? 
+ Where were the data collected?

Answers to these questions help you determine whether your findings are relevant to the situation that interests you, or whether your situation may not be comparable to this other place and time.

Core to the notion of scope are answers to the following questions: 

+ What is the target population (or unknown parameter value)?
+ How was the target accessed?
+ What methods were used to select samples/take measurements?
+ What instruments were used and how were they calibrated? 

Answering as many of these questions as possible can give you valuable insight into how much trust you can place in your findings and how far you can generalize them.

This chapter has provided you with terminology and a framework for thinking about and answering these questions. The chapter has also outlined ways to identify possible sources of bias and variance that can affect the accuracy of your findings. To help you reason about bias and variance, we have introduced the following models:

+ Scope diagram to indicate the overlap between target population, access frame, and sample;
+ Dart board to describe an instrument's bias and variance; and 
+ Urn model for situations when a chance mechanism has been used to select a sample from an access frame, divide a group into experimental treatment groups, or take measurements from a well-calibrated instrument.
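
As a reminder of how the urn and dart board models translate into simulation, here is a minimal sketch. Everything in it is an illustrative assumption rather than data from this chapter: the urn of 1,000 marbles with 300 marked, the sample size of 100, the 20 experimental units, and the instrument with a bias of 2.0 and a standard deviation of 0.5 are all hypothetical values chosen only to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(42)

# --- Urn model: select a sample from an access frame ---
# Hypothetical access frame of 1,000 units; 300 are marked 1, the rest 0.
urn = np.array([1] * 300 + [0] * 700)


def draw_sample(urn, n, rng):
    """Draw n marbles from the urn without replacement (a simple random sample)."""
    return rng.choice(urn, size=n, replace=False)


# Repeat the chance mechanism many times to see how the sample proportion varies.
sample_props = np.array(
    [draw_sample(urn, 100, rng).mean() for _ in range(2000)]
)

# --- Urn model: divide a group into treatment groups ---
# Shuffle 20 hypothetical unit IDs and split them in half.
units = np.arange(20)
shuffled = rng.permutation(units)
treatment, control = shuffled[:10], shuffled[10:]

# --- Dart board: an instrument with bias and variance ---
# Assumed instrument: reads 2.0 too high on average, with noise of SD 0.5.
true_value = 50.0
measurements = true_value + 2.0 + rng.normal(0, 0.5, size=1000)
estimated_bias = measurements.mean() - true_value
estimated_sd = measurements.std()

print(f"Average sample proportion: {sample_props.mean():.3f}")
print(f"SD of sample proportions:  {sample_props.std():.3f}")
print(f"Estimated instrument bias: {estimated_bias:.2f}")
print(f"Estimated instrument SD:   {estimated_sd:.2f}")
```

The sample proportions scatter around the urn's proportion of marked marbles (variance without bias), while the simulated instrument's readings cluster tightly around a point that is offset from the true value (bias with small variance).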

{numref}`Chapter %s <ch:theory_datadesign>` continues the development of the urn model to more formally quantify accuracy. 