# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
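from IPython.display import display
# set_matplotlib_formats is deprecated in IPython.display; it now lives in matplotlib_inline
from matplotlib_inline.backend_inline import set_matplotlib_formats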
import myst_nb

import plotly
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'plotly_mimetype+svg'
pio.templates['book'] = go.layout.Template(
    layout=dict(
        margin=dict(l=10, r=10, t=10, b=10),
        autosize=True,
        width=350, height=250,
    )
)
pio.templates.default = 'seaborn+book'

set_matplotlib_formats('svg')
sns.set()
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
# Use the full option name; the short 'precision' is ambiguous in newer pandas
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)

(sec:lifecycle_data)=
# Understanding the Data


After obtaining data, we want to understand the data we have.
A key part of understanding the data is doing *exploratory data analysis*,
where we look at actual data values and create plots to summarize
the data visually.
We also look for problems in the data.
Most real-world datasets have missing values, implausible values, or other
anomalies that we need to account for.
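
As a minimal sketch of what these checks can look like, the example below builds a small, made-up table of temperature readings and then looks at the values, counts missing entries, and flags readings outside a plausible range. The `temps` table, its column names, the `-999` sentinel, and the range bounds are all assumptions chosen for illustration, not data from this book.

```python
import numpy as np
import pandas as pd

# Hypothetical data: daily temperature readings with two typical problems,
# a missing value (NaN) and a -999 placeholder that stands in for "no reading".
temps = pd.DataFrame({
    'day': [1, 2, 3, 4, 5],
    'temp_f': [68.0, np.nan, 71.5, -999.0, 70.2],
})

# Look at actual data values.
print(temps.head())

# Count missing values in each column.
missing_counts = temps.isna().sum()
print(missing_counts)

# Flag values outside a plausible range (bounds are illustrative assumptions).
implausible = temps[(temps['temp_f'] < -100) | (temps['temp_f'] > 150)]
print(implausible)
```

Checks like these rarely settle a question on their own. They tell us which rows to examine more closely and which decisions, such as how to treat the placeholder value, we need to make before analysis.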

In our experience, this stage of the lifecycle is highly iterative.
What we learn about the data can send us to any of the
other stages in the data science lifecycle.
As we understand the data more, we often revise our research
questions, or realize that we need to get data from a different source.

This stage incorporates both programming and statistical knowledge.
To manipulate data, we write programs that clean data, transform data, and
create plots.
To find patterns and trends in the data, we use summary statistics and 
statistical models.
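
To make the clean, transform, and summarize steps concrete, here is a short sketch on a made-up table. The `readings` table, the `-999` sentinel, and the column names are hypothetical; a real analysis would choose its cleaning rules from what the data and its documentation say.

```python
import numpy as np
import pandas as pd

# Hypothetical readings from two sites; -999 marks a failed measurement.
readings = pd.DataFrame({
    'site': ['A', 'A', 'B', 'B', 'B'],
    'temp_f': [68.0, -999.0, 50.0, 59.0, 68.0],
})

# Clean: treat the -999 sentinel as a missing value.
cleaned = readings.assign(temp_f=readings['temp_f'].replace(-999.0, np.nan))

# Transform: convert Fahrenheit to Celsius.
cleaned['temp_c'] = (cleaned['temp_f'] - 32) * 5 / 9

# Summarize: mean temperature per site (missing values are skipped by default).
site_means = cleaned.groupby('site')['temp_c'].mean()
print(site_means)
```

Had we skipped the cleaning step, the sentinel would have been averaged in as if it were a real temperature, which is one reason cleaning and summarizing tend to alternate in practice.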

When our research questions are purely exploratory, we are concerned only
with patterns in the data.
In these cases, our analysis can end at this stage of the lifecycle.
When our research questions are inferential or predictive, however, we
proceed to the next stage of the lifecycle: understanding the world.