(ch:eda)=
# Exploratory Data Analysis

John Tukey, author of the influential book *Exploratory Data Analysis* {cite}`tukeyExploratory1977`, avidly promoted an alternative type of data analysis that broke from the formal world of confidence intervals, hypothesis tests, and modeling. Today, Exploratory Data Analysis (EDA) is a popular approach to data analysis and is considered good practice. Tukey describes EDA as a philosophical approach to working with data:

> an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe to be there. 

This is a deviation from the tradition of proposing a hypothesis before looking at the data, testing the hypothesis on the data, and making a decision based on the p-value of the test. Instead, EDA is a creative search for the unexpected using simple summary statistics and visualizations. According to Tukey, "EDA is actively incisive, rather than passively descriptive, with real emphasis on the discovery of the unexpected."

As a data scientist, you will want to use EDA in every stage of the data life cycle from checking the quality of your data to preparing the data for formal modeling to confirming your model is reasonable.  Indeed, the work described in {numref}`Chapter %s <ch:wrangling>` to clean and transform the data relied heavily on the EDA approach to guide our quality checks and transformations.

In an EDA-type investigation, we enter a process of discovery, constantly asking questions and diving into uncharted territory to explore ideas. We use plots to uncover features of the data, examine distributions of values, and reveal relationships that cannot be detected from simple numerical summaries. This exploration involves transforming, visualizing, and summarizing data to build and confirm our understanding, identify and address potential issues with the data, and inform subsequent analysis. EDA is creative and fun! And, it takes practice. One of the best ways to learn how to carry out an exploratory data analysis is to follow others as they describe their thought process while carrying out an EDA, and there are many online sources to help.
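To make the point about numerical summaries concrete, here is a minimal sketch with simulated data. The two samples, their parameters, and the helper `standardize` are made up for illustration. Both samples are rescaled to have the same mean and standard deviation, yet a crude text histogram shows that one has a single peak and the other has two.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Two simulated samples (illustrative only): one peak versus two peaks.
unimodal = rng.normal(size=n)
bimodal = np.concatenate(
    [rng.normal(-1, 0.3, n // 2), rng.normal(1, 0.3, n // 2)]
)

def standardize(x):
    """Rescale x to have mean 0 and standard deviation 1."""
    return (x - x.mean()) / x.std()

unimodal, bimodal = standardize(unimodal), standardize(bimodal)

# The summaries match, but the binned counts do not.
bins = np.linspace(-3, 3, 13)
for name, x in [("unimodal", unimodal), ("bimodal", bimodal)]:
    counts, _ = np.histogram(x, bins=bins)
    print(f"{name}: mean={x.mean():.2f}, sd={x.std():.2f}")
    for left, count in zip(bins[:-1], counts):
        print(f"  {left:5.1f} | {'#' * (int(count) // 10)}")
```

A proper histogram from a plotting library would show the same thing more clearly. The text version is used here only to keep the sketch free of plotting dependencies.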

But while EDA can provide valuable insights, you need to be cautious about the conclusions that you draw. It is important to recognize that EDA can bias your view. EDA is a winnowing process and a decision-making process that can impact the replicability of your later, model-based findings. With enough data, if you look hard, you can dredge up something interesting that is entirely spurious.
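The danger of dredging can be seen in a small simulation. This is a sketch under made-up assumptions: the sample size, the number of features, and the variable names are all hypothetical. Every feature is pure noise and unrelated to the outcome, yet when we search across many of them, the strongest correlation we find can look noteworthy.

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs, n_features = 50, 1000  # hypothetical sizes chosen for illustration

# Outcome and features are independent random noise by construction.
X = rng.normal(size=(n_obs, n_features))
y = rng.normal(size=n_obs)

# Pearson correlation between each noise feature and the outcome.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
corrs = (Xc * yc[:, None]).sum(axis=0) / (
    np.sqrt((Xc**2).sum(axis=0)) * np.sqrt((yc**2).sum())
)

# "Look hard": pick the feature with the largest correlation in magnitude.
best = int(np.argmax(np.abs(corrs)))
print(f"Typical |r|: {np.median(np.abs(corrs)):.2f}")
print(f"Largest |r|: {abs(corrs[best]):.2f} (feature {best})")
```

Any single correlation here is small on average, but the largest of many is not. If only that one feature were reported, a reader would have no way of knowing it was selected from a thousand candidates.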

The role of EDA in the scientific reproducibility crisis has been noted, and data scientists have cautioned against overdoing it. For example, Gelman and Loken note {cite}`gelmanStatistical2017`:

> even in settings where a single analysis has been carried out on the given data, the issue of multiple comparisons [data dredging] emerges because different choices about combining variables, inclusion and exclusion of cases, transformations of variables, tests for interactions in the absence of main effects, and many other steps in the analysis could well have occurred with different data.

It's good practice to report and provide the code from your EDA so that others are aware of the choices that you made and the paths you took in analyzing your data.

The topic of visualization is split across three chapters. In {numref}`Chapter %s <ch:wrangling>`, we used plots to inform us in our data wrangling. The plots there were basic and the findings straightforward. We didn't dwell on interpretations and choices of plots. Now, in this chapter, we spend more time on learning how to choose the right plot and interpret it. We usually take the default parameter settings of the plotting functions since our goal is to make plots quickly as we carry out EDA. Then in {numref}`Chapter %s <ch:viz>`, we provide guidelines for making effective and informative plots and give advice on how to make your visual argument clear and compelling.

The EDA process typically involves creating simple visualizations. To do this, we need to choose an appropriate visualization for a feature, and our choice depends on the kind of data that have been collected. This mapping of plot type to feature type is the topic of the next section.  From there, we go on to describe how to "read" a plot, what to look for, and how to interpret what you see. We first discuss what to look for in a one-variable plot, then focus on reading relationships between two variables, and finally describe plots for three or more variables. After we have introduced the visualization tools for EDA, we provide guidelines for carrying out an EDA, and then walk through an example as we follow these guidelines.  