# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display
import myst_nb

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
# Use the full option name; the short form 'precision' is ambiguous in newer pandas
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)

(sec:eda_guidelines)=
# Guidelines for Exploration


So far in this chapter, we have:

- introduced the notion of feature types;
- seen how the feature types help you figure out what plots to make; and
- described how to read distributions and relationships in a visualization.

Let's now describe the process of EDA in general terms. You have already seen
EDA in action in Chapter X, when we developed checks for data quality and
transformed features to make them more useful in data analysis. Below is a set
of questions to guide you when making plots.

- How are the values of Feature X distributed?
- How do Feature X and Feature Y relate to each other?
- Is the distribution of Feature X the same for subgroups defined by Feature Z?
- Are there any unusual observations in X? In the combination (X, Y)? In X
  for a subgroup of Z?
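
As a rough sketch of how these questions map onto plots, the cell below builds
a small synthetic data frame and draws one plot per question with plain
matplotlib. The features `x`, `y`, and `z` are made up for illustration, and
the 1.5 × IQR cutoff used to flag unusual values is one common rule of thumb,
assumed here rather than prescribed by this chapter.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data for illustration only; x, y, and z are hypothetical features.
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "z": rng.choice(["a", "b", "c"], size=n),       # qualitative feature
    "x": rng.lognormal(mean=3, sigma=0.5, size=n),  # quantitative feature
})
df["y"] = 2 * df["x"] + rng.normal(scale=10, size=n)  # quantitative, related to x

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# How are the values of x distributed?
axes[0].hist(df["x"], bins=30)
axes[0].set_xlabel("x")

# How do x and y relate to each other?
axes[1].scatter(df["x"], df["y"], alpha=0.5)
axes[1].set_xlabel("x")
axes[1].set_ylabel("y")

# Is the distribution of x the same for subgroups defined by z?
names, values = zip(*[(name, g["x"].to_numpy()) for name, g in df.groupby("z")])
axes[2].boxplot(values)
axes[2].set_xticklabels(names)
axes[2].set_xlabel("z")
axes[2].set_ylabel("x")

fig.tight_layout()

# Are there unusual observations in x? Flag values more than 1.5 IQRs
# beyond the quartiles (an assumed rule of thumb, not the only choice).
q1, q3 = df["x"].quantile([0.25, 0.75])
iqr = q3 - q1
unusual = df[(df["x"] < q1 - 1.5 * iqr) | (df["x"] > q3 + 1.5 * iqr)]
```

The rows in `unusual` are candidates for a closer look, not errors to delete:
the same cutoff can be applied within each subgroup of `z`, or to the residuals
of `y` given `x`, to address the other variants of the last question.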


One approach that can help you develop intuition about the distributions and
relationships of different kinds of features is to guess what you will see
before you make the plot. That is, first sketch or describe your best answer to
the questions above, and then make the plot. For example, distributions with a
natural lower or upper bound on their values tend to have a long tail on the
other side. The distribution of income (bounded below by 0) tends to have a
long right tail, and the distribution of exam scores (bounded above by 100)
tends to have a long left tail.
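
One way to practice this guess-then-check habit is with simulated data. The
sketch below is only an illustration: the "income" and "score" values are
random draws chosen to respect the bounds mentioned above (a lognormal bounded
below by 0, and a scaled beta bounded above by 100), not real measurements.
Before running it, guess the sign of each skewness and whether the mean falls
above or below the median.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated values, not real data. The distributions and their parameters
# are assumptions chosen only to respect the natural bounds.
income = pd.Series(rng.lognormal(mean=10.5, sigma=0.8, size=5000))  # >= 0
score = pd.Series(100 * rng.beta(a=8, b=2, size=5000))              # <= 100

summary = pd.DataFrame({
    "skew": [income.skew(), score.skew()],
    "mean": [income.mean(), score.mean()],
    "median": [income.median(), score.median()],
}, index=["income", "score"])

# A long right tail shows up as positive skew and mean > median;
# a long left tail shows up as negative skew and mean < median.
print(summary)
```

A histogram of each series gives the visual counterpart of these summaries.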


As you answer each of the above questions, it is important to tie your answer
back to the feature and the dataset. It is also important to adopt an active,
inquisitive approach to the investigation. One way to guide your exploration
is to ask "what next" and "so what" questions, such as the following.

- Do you have reason to expect that one group/observation might be different?
- Why might your observation about the data shape matter?
- What comparison might bring added value to the investigation?
- Are there any potentially important features to create comparisons
  with/against?

In the next section, we put these guidelines into practice with an example of
the EDA process.
