# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display, set_matplotlib_formats
import myst_nb

import plotly
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'plotly_mimetype+svg'
pio.templates['book'] = go.layout.Template(
    layout=dict(
        margin=dict(l=10, r=10, t=10, b=10),
        autosize=True,
        width=350, height=250,
    )
)
pio.templates.default = 'seaborn+book'

set_matplotlib_formats('svg')
sns.set()
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
# Use the full option name; the bare 'precision' pattern is ambiguous in newer pandas
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)

(sec:lifecycle_obtain)=
# Obtaining Data

In this step of the data science lifecycle, we obtain our data and
understand how the data were collected.
One of our goals in this stage is to understand what kinds of research
questions we can answer using the data that we have.
In our lifecycle, data analyses can begin with asking a question
(the previous stage) or with obtaining data (this stage).
When data are expensive and hard to gather, we
define a precise research question first and then collect the exact data
we need to answer the question.
Other times, data are cheap and easily accessed.
This is especially true for online data sources.
For example, Twitter lets people quickly download millions of data
points through its API[^twitter].
When data are plentiful, we can also start an analysis by obtaining data,
exploring them, and then asking research questions.

[^twitter]: https://developer.twitter.com/en/docs/twitter-api

When we obtain data, we write down how the data were collected
and what information the data contain.
This isn't just for bookkeeping: the types of research questions
we can answer depend greatly on the way the data were collected.
We explore this topic in 
{numref}`Chapter %s <ch:data_scope>`.
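
One way to make this record-keeping concrete is to store the notes next to the data in a small, structured form. The sketch below is only an illustration: the `DataSourceNote` class, its field names, and the example dataset are all hypothetical and are not part of any library or of this book's code.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DataSourceNote:
    """Hypothetical record of how a dataset was collected and what it contains."""
    name: str
    source: str              # where the data came from (URL or description)
    collected_by: str        # who gathered the data
    collection_method: str   # e.g. survey, API download, sensor log
    date_obtained: date
    fields: dict = field(default_factory=dict)  # column name -> meaning

    def summary(self) -> str:
        """Return a short plain-text description of the dataset."""
        lines = [
            f"{self.name} (obtained {self.date_obtained.isoformat()})",
            f"  source: {self.source}",
            f"  collected by: {self.collected_by}",
            f"  method: {self.collection_method}",
        ]
        for column, meaning in self.fields.items():
            lines.append(f"  field {column}: {meaning}")
        return "\n".join(lines)


# A made-up dataset, used only to show the shape of the notes.
note = DataSourceNote(
    name="example_posts",
    source="https://developer.twitter.com/en/docs/twitter-api",
    collected_by="course staff (hypothetical)",
    collection_method="API download",
    date_obtained=date(2021, 1, 15),
    fields={"text": "body of the post", "created_at": "time the post was made"},
)
print(note.summary())
```

Keeping the collection method and field descriptions together makes it easier to check later whether a research question is one the data can actually answer.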