(exploratory-data-analysis)=
# Exploratory Data Analysis

## Introduction

This chapter will show you how to use visualisation and transformation to explore your data in a systematic way, a task that data scientists call exploratory data analysis, or EDA for short. EDA is an iterative cycle; you:

1.  Generate questions about your data.

2.  Search for answers by visualising, transforming, and modelling your data.

3.  Use what you learn to refine your questions and/or generate new questions.

EDA is not a formal process with a strict set of rules and, during the initial phases of EDA, you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends. As your exploration continues, you will home in on a few particularly productive areas that you'll eventually write up and communicate to others. As you explore, keep some pitfalls in mind: always think about how the data were collected, what might be missing, and whether there are quality problems, and be strict about the difference between correlation and causation (this is a huge topic in itself!).

### Prerequisites

For doing EDA, we'll use the **pandas**, **skimpy**, and **pandas-profiling** packages. You are likely to already have **pandas** installed. We'll also need **seaborn** for data visualisation, which you can install with `pip install --pre seaborn`. To install the other two packages, open up a terminal in Visual Studio Code and run `pip install skimpy` and `pip install pandas-profiling`.

In [None]:
import matplotlib_inline.backend_inline
import matplotlib.pyplot as plt

# Plot settings
plt.style.use("https://github.com/aeturrell/python4DS/raw/main/plot_style.txt")
matplotlib_inline.backend_inline.set_matplotlib_formats("svg")

## Questions

> "There are no routine statistical questions, only questionable statistical routines." --- Sir David Cox
> "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise." --- John Tukey

Your goal during EDA is to develop an understanding of your data. The easiest way to do this is to use questions as tools to guide your investigation. When you ask a question, the question focuses your attention on a specific part of your dataset and helps you decide which graphs, models, or transformations to make.

EDA is fundamentally a creative process. And like most creative processes, the key to asking *quality* questions is to generate a large *quantity* of questions. It is difficult to ask revealing questions at the start of your analysis because you do not know what insights are contained in your dataset. On the other hand, each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery. You can quickly drill down into the most interesting parts of your data---and develop a set of thought-provoking questions---if you follow up each question with a new question based on what you find.

There is no rule about which questions you should ask to guide your research. However, two types of questions will always be useful for making discoveries within your data. You can loosely word these questions as:

1.  What type of variation occurs within my variables?

2.  What type of covariation occurs between my variables?

The rest of this chapter will look at these two questions. We'll explain what variation and covariation are, and we'll show you several ways to answer each question. To make the discussion easier, let's define some terms:

-   A **variable** is a quantity, quality, or property that you can measure.

-   A **value** is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.

-   An **observation** is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable. We'll sometimes refer to an observation as a data point.

-   **Tabular data** is a set of values, each associated with a variable and an observation. Tabular data is *tidy* if each value is placed in its own "cell", each variable in its own column, and each observation in its own row.
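The definitions above can be made concrete with a tiny data frame. This is only a sketch: the name `gems` and the three rows in it are made up for illustration and are not part of the diamonds dataset used later.

```python
import pandas as pd

# Hypothetical toy data: three observations (rows) of two variables (columns)
gems = pd.DataFrame(
    {
        "carat": [0.23, 0.31, 1.01],
        "cut": ["Ideal", "Good", "Premium"],
    }
)

variables = list(gems.columns)  # each column holds one variable
observation = gems.iloc[0]      # each row holds one observation
value = gems.loc[0, "carat"]    # each cell holds one value

print(variables)
print(observation.to_dict())
print(value)
```

Because every variable has its own column, every observation its own row, and every value its own cell, this small table is tidy.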

So far, all of the data that you've seen has been tidy. In real life, most data isn't tidy, so we'll come back to how to clean untidy data later in the book.

## Variation

**Variation** is the tendency of the values of a variable to change from measurement to measurement. You can see variation easily in real life; if you measure any continuous variable twice, you will get two different results. This is true even if you measure quantities that are constant, like the speed of light. Each of your measurements includes a small amount of error that varies from measurement to measurement. Variables can also vary if you measure across different subjects (e.g. the eye colours of different people) or different times (e.g. the energy levels of an electron at different moments).
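As a rough illustration of measurement error, the sketch below simulates repeated measurements of a quantity that is truly constant. The true value, the size of the error, and the assumption that the error is normally distributed are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical set-up: a constant quantity measured five times by an
# instrument with normally distributed error (both numbers are assumptions)
true_value = 10.0
measurement_error_sd = 0.05
measurements = true_value + rng.normal(0.0, measurement_error_sd, size=5)

print(measurements)        # five slightly different readings
print(measurements.std())  # spread that comes purely from measurement error
```

Even though the underlying quantity never changes, no two readings agree exactly.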

Every variable has its own pattern of variation, which can reveal interesting information about how that variable varies between measurements on the same observation as well as across observations. The best way to understand that pattern is to visualise the distribution of the variable's values.

### Visualising distributions

How you visualise the distribution of a variable will depend on whether the variable is categorical or continuous. A variable is **categorical** if it can only take one of a small set of values. In data analysis in Python, categorical variables are usually saved as the 'category' type in **pandas** data frames. To examine the distribution of a categorical variable, you can use a bar chart. First let's load up **seaborn** for visualisation and load up the diamonds dataset in **pandas**.

In [None]:
import seaborn.objects as so

In [None]:
import pandas as pd

diamonds = pd.read_csv(
    "https://github.com/mwaskom/seaborn-data/raw/master/diamonds.csv"
)
diamonds["cut"] = diamonds["cut"].astype("category")
diamonds.head()

Now we can visualise the data using a bar chart:

In [None]:
(so.Plot(diamonds, "cut").add(so.Bar(), so.Hist()))

The height of the bars displays how many observations occurred with each x value. You can compute these values directly with **pandas** too:

In [None]:
diamonds["cut"].value_counts()

A variable is **continuous** if it can take any of an infinite set of ordered values. Numbers and date-times are two examples of continuous variables. To examine the distribution of a continuous variable, you can use a histogram:

In [None]:
(so.Plot(diamonds, "carat").add(so.Bar(), so.Hist(binwidth=0.5)))

You can also compute this directly with **pandas**: use `pd.cut` to assign a category (an interval) to each row and then `value_counts()` to count the number of rows in each category.

In [None]:
pd.cut(diamonds["carat"], bins=11).value_counts()

A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. You can set the width of the intervals in a histogram plot with the `binwidth=` keyword argument, which is measured in the units of the `x` variable. You should always explore a variety of binwidths when working with histograms, as different binwidths can reveal different patterns.

For example, here is how the graph above looks when we zoom into just the diamonds with a size of less than three carats and choose a smaller binwidth.


In [None]:
(so.Plot(diamonds.query("carat < 3"), "carat").add(so.Bar(), so.Hist(binwidth=0.1)))
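The effect of the bin width can also be checked numerically. The sketch below counts observations in equal-width bins for two different widths; the helper `counts_for_binwidth` and the small sample of carat values are hypothetical and only loosely mimic what the plotting code does internally.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of carat weights (not taken from the real dataset)
carat = pd.Series(
    [0.23, 0.30, 0.31, 0.50, 0.52, 0.70, 0.71, 0.90, 1.00, 1.01, 1.02, 1.50, 2.00]
)


def counts_for_binwidth(values, binwidth):
    """Count observations in equal-width, left-closed bins starting at zero."""
    # Extend the edges past the maximum so the largest value lands in a bin
    edges = np.arange(0, values.max() + 2 * binwidth, binwidth)
    return pd.cut(values, bins=edges, right=False).value_counts(sort=False)


wide = counts_for_binwidth(carat, 0.5)
narrow = counts_for_binwidth(carat, 0.1)

print(wide)
print(narrow[narrow > 0])
```

Both tables describe the same data, but the narrower bins separate values that the wider bins lump together, which is why trying several widths is worthwhile.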

Now that you can visualise variation, what should you look for in your plots? And what type of follow-up questions should you ask? Below is a list of the most useful types of information that you will find in your graphs, along with some follow-up questions for each type of information. The key to asking good follow-up questions will be to rely on your curiosity (What do you want to learn more about?) as well as your skepticism (How could this be misleading?).

### Typical Values

In both bar charts and histograms, tall bars show the common values of a variable, and shorter bars show less-common values.
Places that do not have bars reveal values that were not seen in your data.
To turn this information into useful questions, look for anything unexpected:

-   Which values are the most common?
    Why?

-   Which values are rare?
    Why?
    Does that match your expectations?

-   Can you see any unusual patterns?
    What might explain them?

As an example, the histogram below suggests several interesting questions:

-   Why are there more diamonds at whole carats and common fractions of carats?

-   Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?

In [None]:
(so.Plot(diamonds.query("carat < 3"), "carat").add(so.Bar(), so.Hist(binwidth=0.01)))

Clusters of similar values suggest that subgroups exist in your data.
To understand the subgroups, ask:

-   How are the observations within each cluster similar to each other?

-   How are the observations in separate clusters different from each other?

-   How can you explain or describe the clusters?

-   Why might the appearance of clusters be misleading?

Many of the questions above will prompt you to explore a relationship *between* variables, for example, to see if the values of one variable can explain the behavior of another variable. We'll get to that shortly.

### Unusual Values

Outliers are observations that are unusual: data points that don't seem to fit the pattern.
Sometimes outliers are data entry errors; other times outliers suggest important new science.
When you have a lot of data, outliers are sometimes difficult to see in a histogram.
For example, take the distribution of the `y` variable from the diamonds dataset.
The only evidence of outliers is the unusually wide limits on the x-axis.

In [None]:
(so.Plot(diamonds, "y").add(so.Bar(), so.Hist(binwidth=0.5)))

There are so many observations in the common bins that the rare bins are very short, making it very difficult to see them (although maybe if you stare intently at 0 you'll spot something). To make it easy to see the unusual values, we need to zoom in to small values on the y-axis by limiting its range:

In [None]:
# Restrict the visible y-axis range so that the rare bins become visible
(
    so.Plot(diamonds, x="y")
    .add(so.Bar(), so.Hist(binwidth=0.5))
    .limit(y=(0, 50))
)

The `y` variable measures one of the three dimensions of these diamonds, in mm.
We know that diamonds can't have a width of 0mm, so these values must be incorrect.
We might also suspect that measurements of 32mm and 59mm are implausible: those diamonds are over an inch long, but don't cost hundreds of thousands of dollars!
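Once a plot hints at unusual values, it can help to pull out the offending rows and look at them directly. The sketch below does this with `query()` and `sort_values()` on a hypothetical stand-in for a few rows of the diamonds data; the numbers in `toy` are made up, and the cut-offs of 3 and 20 are a judgment call rather than a rule.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the diamonds data
toy = pd.DataFrame(
    {
        "price": [5139, 2075, 12210, 326, 554],
        "x": [0.00, 5.15, 8.09, 3.95, 4.30],
        "y": [0.0, 31.8, 58.9, 3.98, 4.34],
        "z": [0.00, 5.12, 8.06, 2.43, 2.71],
    }
)

# Keep only rows whose y value lies outside a plausible range, then sort by y
unusual = toy.query("y < 3 or y > 20").sort_values("y")
print(unusual)
```

Seeing the other columns next to the suspicious `y` values makes it easier to judge whether a row is a data entry error or a genuinely unusual stone.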

It's good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results, and you can't figure out why they're there, it's reasonable to omit them, and move on.

However, if they have a substantial effect on your results, you shouldn't drop them without justification.
You'll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up.
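One simple way to repeat an analysis with and without outliers is to compute the same summary on both versions of the data and compare. This is a minimal sketch on invented numbers; the plausible range of 3 to 20 is an assumption carried over from the discussion of `y`.

```python
import pandas as pd

# Hypothetical measurements, two of which (0.0 and 58.9) look implausible
y = pd.Series([3.98, 4.34, 5.10, 5.75, 6.20, 0.0, 58.9])
plausible = y.between(3, 20)

# Same statistics, computed on all rows and on the plausible rows only
summary = pd.DataFrame(
    {
        "with_outliers": y.agg(["mean", "median", "std"]),
        "without_outliers": y[plausible].agg(["mean", "median", "std"]),
    }
)
print(summary.round(2))
```

In this toy example the mean and standard deviation change a lot while the median does not, so whether the outliers matter depends on which statistic your conclusions rest on.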

### Replacing Unusual Values

If you’ve encountered unusual values in your dataset, and simply want to move on to the rest of your analysis (not your only response—you can also consider filling in the missing data), you have two options:

1. Drop the entire row with the strange values. You can do this by just working with a subset of the data, eg `diamonds.query('3 <= y <= 20')`. This option isn't generally recommended though: just because one measurement is invalid, it doesn’t mean all the measurements are. Additionally, if you have low quality data, by the time that you’ve applied this approach to every variable you might find that you don’t have any data left!
2. Replace the unusual values with missing values (ie blank out just those values). The easiest way to do this is to use `assign()` to replace the variable with a modified copy. You can use the `np.where()` function to replace unusual values with `np.nan`, the **numpy** representation of a missing value:


In [None]:
import numpy as np

diamonds = diamonds.assign(
    y=np.where((diamonds["y"] > 3) & (diamonds["y"] < 20), diamonds["y"], np.nan)
)

`np.where()` typically has three arguments. The first argument condition should be a column of booleans. If `True`, then the next argument will be used; if `False`, the third. So we get the pattern `np.where(<CONDITION>, <VALUE IF CONDITION TRUE>, <VALUE IF CONDITION FALSE>)`.
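To see the three-argument pattern in isolation, here is a small sketch on a made-up series. The values are hypothetical, and the bounds of 3 and 20 mirror the ones used for `y` above.

```python
import numpy as np
import pandas as pd

# Hypothetical y measurements, including two implausible ones
y = pd.Series([0.0, 3.98, 4.34, 58.9])

# First argument: a boolean column, True where the value is plausible
condition = (y > 3) & (y < 20)

# Where the condition is True keep y; where it is False use np.nan
cleaned = np.where(condition, y, np.nan)

print(condition.tolist())
print(cleaned)
```

Note that `np.where()` returns a **numpy** array rather than a **pandas** series, which is fine when you assign the result back to a column of the same length.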

## **pandas** built-in tools for EDA

**pandas** has some great options for built-in EDA; in fact we've already seen one of them, `df.info()`, which, as well as reporting datatypes and memory usage, also tells us how many entries in each column are non-null, ie not missing.
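If you want the non-null counts as data rather than as printed text, `count()` and `isna()` give the same information. The sketch below uses a hypothetical two-column frame; note that a value of `0.0` still counts as non-null, because only missing entries are excluded.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing entry per column
toy = pd.DataFrame({"y": [0.0, 4.3, np.nan], "cut": ["Ideal", None, "Good"]})

non_null = toy.count()      # number of non-missing entries per column
missing = toy.isna().sum()  # number of missing entries per column

print(non_null)
print(missing)
```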

### Exploratory tables and descriptive statistics

A small step beyond `.info()` to get tables is to use `.describe()` which, if you have mixed datatypes that include floats, will report some basic summary statistics:

In [None]:
diamonds.describe()

Although helpful, that sure is hard to read! We can improve this by using the `round()` method too:


In [None]:
sum_table = diamonds.describe().round(1)
sum_table

Published summary statistics tables often list one variable per row, and if your dataframe has many variables, `describe()` can quickly get too wide to read easily. You can transpose it using the `T` property (or the `transpose()` method):

In [None]:
sum_table = sum_table.T
sum_table

Of course, the stats provided in this pre-built table are not very customised. So what do we do to get the table that we actually want? Well, the answer is to draw on the contents of the previous data chapters, particularly the introduction to data analysis. Groupbys, merges, aggregations: use all of them to produce the EDA table that you want.
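As one possible shape for such a table, the sketch below uses `groupby()` with named aggregations to pick exactly the statistics wanted for each group. The data in `toy` and the choice of statistics are invented for illustration.

```python
import pandas as pd

# Hypothetical subset of diamond-like data
toy = pd.DataFrame(
    {
        "cut": ["Ideal", "Ideal", "Good", "Good", "Good"],
        "price": [326, 554, 327, 400, 2757],
        "carat": [0.23, 0.31, 0.23, 0.30, 0.70],
    }
)

# Named aggregation: new_column=(source_column, statistic)
custom_table = toy.groupby("cut").agg(
    n=("price", "size"),
    median_price=("price", "median"),
    mean_carat=("carat", "mean"),
)
print(custom_table)
```

Each keyword argument becomes a column in the result, so the table can mix counts, medians, and means across different variables.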

If you're exploring data, you might also want to be able to read everything clearly and see any deviations from what you'd expect quickly. **pandas** has some built-in functionality that styles dataframes to help you. Many of these styles also persist when you export the dataframe to, say, Excel.

Here's an example that highlights some ways of styling dataframes, making use of several features such as: unstacking into a wider format (`unstack`), changing the units (`lambda` function; note that `1e3` is scientific notation for `1000`), filling NaNs with unobtrusive strings (`.fillna('-')`), limiting the number of digits after the decimal point (`.style.format(precision=2)`), and adding a caption (`.set_caption`).

In [None]:
(
    diamonds.groupby(["cut", "color"])["price"]
    .mean()
    .unstack()
    .apply(lambda x: x / 1e3)
    .fillna("-")
    .style.format(precision=2)
    .set_caption("Sale price (thousands)")
)

Although a neater one than we've seen, this is still a drab table of numbers. The eye is not immediately drawn to it!

To remedy that, let's take a look at another styling technique: the use of colour. Let's say we wanted to make a table that showed a cross-tabulation between cut and color; that is the counts of objects appearing in both of these fields according to the categories.

To perform a cross-tabulation, we'll use the built-in `pd.crosstab` but we'll ask that the values that appear in the table (counts) be lit up with a heatmap using `style.background_gradient` too:

In [None]:
pd.crosstab(diamonds["color"], diamonds["cut"]).style.background_gradient(cmap="plasma")

By default, `background_gradient` highlights each number relative to the others in its column (`axis=0`); you can highlight by row using `axis=1` or relative to all table values using `axis=None`. And of course `plasma` is just one of [many available colormaps](https://matplotlib.org/stable/tutorials/colors/colormaps.html)!

```{admonition} Exercise
Do a new cross-tabulation using a different colourmap.
```

Here are a couple of other styling tips for dataframes.

First, use bars to show ordering:

In [None]:
(
    pd.crosstab(diamonds["color"], diamonds["cut"])
    .style.format(precision=0)
    .bar(color="#d65f5f")
)

Use `.highlight_max`, and similar commands, to show important entries:

In [None]:
pd.crosstab(diamonds["color"], diamonds["cut"]).style.highlight_max().format("{:.0f}")

You can find a full set of styling commands [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html#Styling).

### Exploratory Plotting with **pandas**

**pandas** has some built-in plotting options to help you look at data quickly. These can be accessed via `.plot.*` or `.plot()`, depending on the context. Let's make a quick `.plot()` using a dataset on taxis.

In [None]:
taxis = pd.read_csv("https://github.com/mwaskom/seaborn-data/raw/master/taxis.csv")
# set the column types; the two time columns become datetimes
taxis = taxis.astype(
    {
        "dropoff": "datetime64[ns]",
        "pickup": "datetime64[ns]",
        "color": "category",
        "payment": "category",
        "pickup_zone": "string",
        "dropoff_zone": "string",
        "pickup_borough": "category",
        "dropoff_borough": "category",
    }
)
taxis.head()

In [None]:
taxis.info()

In [None]:
(
    taxis.set_index("pickup")
    .groupby(pd.Grouper(freq="D"))["total"]
    .mean()
    .plot(
        title="Mean taxi fares",
        xlabel="",
        ylabel="Fare (USD)",
    )
);
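The reshaping step in the cell above can be inspected on its own, without plotting. The sketch below applies the same `set_index` and `pd.Grouper(freq="D")` pattern to a hypothetical four-trip frame so that the daily means are easy to verify by hand.

```python
import pandas as pd

# Hypothetical trips: two on 1 March 2019 and two on 2 March 2019
trips = pd.DataFrame(
    {
        "pickup": pd.to_datetime(
            [
                "2019-03-01 08:00",
                "2019-03-01 18:30",
                "2019-03-02 09:15",
                "2019-03-02 23:59",
            ]
        ),
        "total": [10.0, 20.0, 7.0, 9.0],
    }
)

# Group rows by calendar day of pickup and average the total within each day
daily_mean = trips.set_index("pickup").groupby(pd.Grouper(freq="D"))["total"].mean()
print(daily_mean)
```

The result is a series with one row per day, which is the shape that `.plot()` turns into a single line.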

Again, if you can get the data in the right shape, you can plot it. The same function works with multiple lines:


In [None]:
(
    taxis.set_index("pickup")
    .groupby(pd.Grouper(freq="D"))[["fare", "tip", "tolls"]]
    .mean()
    .plot(
        style=["-", ":", "-."],
        title="Components of taxi fares",
        xlabel="",
        ylabel="USD",
    )
);

Now let's see some of the other quick `.plot.*` options.

A bar chart (use `barh` for horizontal orientation; `rot` sets rotation of labels):

In [None]:
taxis.value_counts("payment").sort_index().plot.bar(title="Counts", rot=0);

This next one uses `.plot.hist` to create a histogram:

In [None]:
taxis["tip"].plot.hist(bins=30, title="Tip");

Boxplot:

In [None]:
(taxis[["fare", "tolls", "tip"]].plot.box());

Scatter plot:

In [None]:
taxis.plot.scatter(x="fare", y="tip", alpha=0.7, ylim=(0, None));

## Other tools for EDA

Between **pandas** and visualisation packages, you have a lot of what you need for EDA. But there are some tools dedicated to making EDA easier that are worth knowing about.

### **skimpy** for summary statistics

The **skimpy** package is a lightweight tool that provides summary statistics about variables in data frames in the console (rather than in a big HTML report, which is what the other EDA packages in the rest of this chapter produce). Sometimes running `.describe()` on a data frame isn't enough, and **skimpy** fills this gap. It also comes with the `clean_columns` function for cleaning column names that we saw in an earlier chapter. To install **skimpy**, run `pip install skimpy` in the terminal.

Let's see **skimpy** in action.

In [None]:
from skimpy import skim

skim(taxis)

### The **pandas-profiling** package

The EDA we did using the built-in **pandas** functions was a bit limited and user-input heavy. The [**pandas-profiling**](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/) library aims to automate the legwork of EDA for you. It generates 'profile' reports from a pandas DataFrame. For each column, many statistics are computed and then relayed in an interactive HTML report. To install it, run `pip install pandas-profiling` in the terminal.

Let's generate a report on our dataset. If you are using a large dataset, you may wish to employ the `minimal=True` setting that cuts out a lot of computationally expensive extras:

In [None]:
from pandas_profiling import ProfileReport


profile = ProfileReport(taxis, minimal=True, title="Profiling Report: Taxis Dataset")
profile.to_notebook_iframe()

This is a full-on report about everything in our dataset! We can see, for instance, that we have 14 variables and what kind each of them is.

The alerts page shows where **pandas-profiling** really shines. It flags *potential* issues with the data that should be taken into account in any subsequent analysis. For example, although not relevant here, the report will say if there are very unbalanced classes in a low cardinality categorical variable.
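To get a feel for what an unbalanced categorical variable looks like, you can compute the class shares by hand. The sketch below is not how **pandas-profiling** produces its alerts; the data, the variable names, and the 0.8 cut-off are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical payment column in which one class dominates
payment = pd.Series(["credit card"] * 9 + ["cash"], dtype="category")

# Share of observations in each class
shares = payment.value_counts(normalize=True)

imbalance_threshold = 0.8  # hypothetical cut-off chosen for illustration
is_unbalanced = shares.max() > imbalance_threshold

print(shares)
print(is_unbalanced)
```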

Another good package for automated EDA is [dataprep](https://dataprep.ai/).