(data-visualise)=
# Data Visualisation

## Introduction

> "The simple graph has brought more information to the data analyst's mind than any other device." --- John Tukey

This chapter will teach you how to visualise your data using the **seaborn** package.

There is a plethora of other options (and packages) for data visualisation using code. Broadly, there are two categories of approach: imperative, where you build what you want, and declarative, where you say what you want. Choosing which to use involves a trade-off: imperative libraries offer you flexibility but at the cost of some verbosity; declarative libraries offer you a quick way to plot your data, but only if it’s in the right format to begin with, and customisation may be more difficult.

**seaborn** is a declarative visualisation package, and these can be easier to get started with. But it's built on top of an imperative package, the incredibly powerful **matplotlib**, so you can always dig further and tweak details if you need to. However, in this chapter, we'll focus on using **seaborn** declaratively.

In [None]:
import matplotlib_inline.backend_inline
import matplotlib.pyplot as plt

# Plot settings
plt.style.use(
    "https://github.com/aeturrell/coding-for-economists/raw/main/plot_style.txt"
)
matplotlib_inline.backend_inline.set_matplotlib_formats("svg")

### Prerequisites

You will need to install the **seaborn** package for this chapter. This chapter uses the next generation version of **seaborn**, which can be installed by running the following on the command line (aka in the terminal): 

```bash
pip install --pre seaborn
```

Once you've done this, you'll need to import the **seaborn** library into your session using

In [None]:
import seaborn.objects as so

This import brings in the objects interface, which is the plotting part of **seaborn** used in this chapter.

## First Steps

Let's use our first graph to answer a question: Do cars with big engines use more fuel than cars with small engines? You probably already have an answer, but try to make your answer precise. What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Non-linear?

### The `mpg` data frame

You can test your answer with the `mpg` data frame, which we'll download from the internet using the **pandas** package.

A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). `mpg` contains observations collected by the US Environmental Protection Agency on 38 car models.

In [None]:
import pandas as pd

mpg = pd.read_csv(
    "https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/mpg.csv", index_col=0
)
mpg

Among the variables in `mpg` are:

1.  `displ`, a car's engine size, in litres.

2.  `hwy`, a car's fuel efficiency on the highway, in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.
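The rows-and-columns structure described above can be checked directly with **pandas**. The sketch below builds a tiny stand-in data frame (the values are made up for illustration and are not taken from `mpg`) and inspects its shape; the same attributes should work on the real `mpg` data frame.

```python
import pandas as pd

# Hypothetical stand-in for mpg: the values are illustrative, not real data
toy = pd.DataFrame(
    {
        "displ": [1.8, 2.8, 5.7],  # engine size in litres
        "hwy": [29, 26, 17],  # highway fuel efficiency in miles per gallon
    }
)

# .shape is a tuple of (number of observations, number of variables)
n_rows, n_cols = toy.shape
print(f"{n_rows} observations, {n_cols} variables: {list(toy.columns)}")
```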

### Creating a Plot

To plot `mpg`, run this code to put `displ` on the x-axis and `hwy` on the y-axis:

In [None]:
so.Plot(mpg, x="displ", y="hwy").add(so.Scatter())

The plot shows a negative relationship between engine size (`displ`) and fuel efficiency (`hwy`). In other words, cars with smaller engine sizes have higher fuel efficiency and, in general, as engine size increases, fuel efficiency decreases. Does this confirm or refute your hypothesis about fuel efficiency and engine size?
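One way to put a number on a "negative relationship" is a correlation coefficient. The sketch below is an illustration rather than part of the original analysis: it uses a small made-up data frame in place of `mpg`, and relies on `Series.corr`, which computes the Pearson correlation by default.

```python
import pandas as pd

# Made-up values standing in for mpg; swap in the real data frame to test the claim
toy = pd.DataFrame(
    {"displ": [1.8, 2.0, 2.8, 3.5, 4.7, 5.7], "hwy": [29, 31, 26, 25, 19, 17]}
)

# Pearson correlation: negative when hwy tends to fall as displ rises
corr = toy["displ"].corr(toy["hwy"])
print(f"Correlation between displ and hwy: {corr:.2f}")
```

A value close to -1 would point to a strong, roughly linear negative relationship; a value near 0 would not rule out a non-linear one, which is one reason to look at the plot as well.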

With **seaborn**, you begin a plot with the function `so.Plot()`. **seaborn** creates a coordinate system that you can add layers to. The first argument of `so.Plot()` is the dataset to use in the graph. So `so.Plot(mpg)` creates an empty graph, but it's not very interesting so I'm not going to show it here.

You complete your graph by adding one or more layers to the plot. The method call `.add(so.Scatter())` adds a layer of points to your plot, creating a scatterplot. You can tell **seaborn** which variables go on the x and y axes either in `so.Plot` or by passing them directly to `.add`.

**seaborn** comes with many functions that each add a different type of layer to a plot. You'll learn a whole bunch of them throughout this chapter.

### A graphing template

Let's turn this code into a reusable template for making graphs with **seaborn**. To make a graph, replace the bracketed sections in the code below with a dataset, the variables for the x and y axes, and a plot style (a mark such as `Scatter()`).

```python
so.Plot(<data>, x=<X VARIABLE>, y=<Y VARIABLE>).add(so.<PLOT STYLE>)
```

The rest of this chapter will show you how to complete and extend this template to make different types of graphs.

### Exercises

1.  Run `so.Plot(mpg)`.
    What do you see?

2.  How many rows are in `mpg` (the data frame)?
    How many columns?

3.  Make a scatterplot of `hwy` vs `cyl`.

4.  What happens if you make a scatterplot of `class` vs `drv`? Why is the plot not useful?

## Aesthetic mappings

> "The greatest value of a picture is when it forces us to notice what we never expected to see." --- John Tukey

In the plot below, one group of points (highlighted in red) seems to fall outside of the linear trend. These cars have a higher mileage than you might expect. How can you explain these cars?


In [None]:
# hide input
so.Plot(mpg, x="displ", y="hwy").add(so.Scatter()).add(
    so.Scatter(color="red", pointsize=5), data=mpg.query("displ > 5 and hwy > 20")
)

Let's hypothesise that the cars are hybrids. One way to test this hypothesis is to look at the `class` value for each car.
The `class` variable of the `mpg` dataset classifies cars into groups such as compact, midsize, and SUV. If the outlying points are hybrids, they should be classified as compact cars or, perhaps, subcompact cars (keep in mind that this data was collected before hybrid trucks and SUVs became popular).

You can add a third variable, like `class`, to a two-dimensional scatterplot by mapping it to another dimension of the plot, such as the size, the shape, or the colour of your points.

For example, you can map the colours of your points to the `class` variable to reveal the class of each car.

In [None]:
so.Plot(mpg, x="displ", y="hwy", color="class").add(so.Scatter())

To map another dimension in the plot to a variable, assign that dimension to the variable, for example `color="class"` within `so.Plot` or within `.add`. **seaborn** will automatically assign a unique level of the dimension (here a unique colour) to each unique value of the variable, a process known as scaling. **seaborn** will also add a legend that explains which levels correspond to which values.
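Conceptually, scaling is a lookup from each unique value of the variable to one level of the dimension. The sketch below mimics that idea in plain **pandas**; it is not how **seaborn** is implemented, and both the class labels and the colour list are assumptions chosen for illustration.

```python
import pandas as pd

# Illustrative class labels, one per car
classes = pd.Series(["compact", "suv", "2seater", "suv", "compact"])

# A hypothetical palette; seaborn chooses its own colours
palette = ["tab:blue", "tab:orange", "tab:green"]

# Scaling: pair each unique value with one colour
scale = dict(zip(classes.unique(), palette))

# Apply the scale to get a colour for every observation
point_colours = classes.map(scale)
print(scale)
```

The `scale` dictionary plays the role of the legend here: it records which colour stands for which class.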

The colours reveal that many of the unusual points (with engine size greater than 5 litres and highway fuel efficiency greater than 20 miles per gallon) are two-seater cars. These cars don't seem like hybrids, and are, in fact, sports cars! Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have large engines.
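The same subset can be inspected as a table rather than by eye. This sketch reuses the `query` filter from the highlighted plot on a made-up stand-in for `mpg` and counts the classes among the rows that pass; running it against the real data frame is a way to check the claim above.

```python
import pandas as pd

# Made-up rows standing in for mpg
toy = pd.DataFrame(
    {
        "displ": [1.8, 5.7, 6.2, 5.3, 7.0],
        "hwy": [29, 26, 25, 17, 24],
        "class": ["compact", "2seater", "2seater", "suv", "2seater"],
    }
)

# Keep cars with large engines and unexpectedly high highway mileage
unusual = toy.query("displ > 5 and hwy > 20")

# Count how many of those cars fall in each class
class_counts = unusual["class"].value_counts()
print(class_counts)
```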


In the above example, we mapped `class` to colour, but we could have mapped `class` to the size of points in the same way. In this case, the exact size of each point would reveal its class affiliation. Big warning here though: mapping an unordered variable (`class`) to an ordered dimension (point size) is generally not a good idea, because it suggests a ranking of the classes that doesn't exist.

In [None]:
so.Plot(mpg, x="displ", y="hwy", pointsize="class").add(so.Scatter())

Similarly, we could have mapped `class` to *alpha* level, which controls the transparency of the points, or to the *marker* variable, which controls the shape of the points.

In [None]:
so.Plot(mpg, x="displ", y="hwy", alpha="class").add(so.Scatter())

In [None]:
so.Plot(mpg, x="displ", y="hwy", marker="class").add(so.Scatter())

Once you map variables to dimensions, **seaborn** takes care of the rest. It selects a reasonable scale to use with the dimension, and it constructs a legend that explains the mapping between levels and values.

You can also *set* a dimension property in your plot directly. For example, we can make all of the points in our plot purple:

In [None]:
so.Plot(mpg, x="displ", y="hwy").add(so.Scatter(color="purple"))

Here, the colour doesn't convey information about a variable, but only changes the appearance of the plot.
To set a dimension manually like this, put it within the specific layer it applies to (eg `.add(so.Scatter(color="purple"))`) rather than in the part that maps variables to dimensions (eg not in `so.Plot(mpg, x="displ", y="hwy")`).

When assigning values to dimensions, you'll need to pick values that make sense, for example:

-   The name of a colour as a string, eg `color="purple"`
-   The size of a point as a number (the diameter in points), eg `pointsize=5`
-   The shape of a marker as a string, eg `marker="*"` for a star

In [None]:
# Hide input
from matplotlib.lines import Line2D


text_style = dict(
    horizontalalignment="right",
    verticalalignment="center",
    fontsize=12,
    fontfamily="monospace",
)
marker_style = dict(
    linestyle=":",
    color="0.8",
    markersize=10,
    markerfacecolor="tab:blue",
    markeredgecolor="tab:blue",
)


def format_axes(ax):
    ax.margins(0.2)
    ax.set_axis_off()
    ax.invert_yaxis()


def split_list(a_list):
    i_half = len(a_list) // 2
    return a_list[:i_half], a_list[i_half:]


fig, axs = plt.subplots(ncols=2)
fig.suptitle("Un-filled markers", fontsize=14)

# Filter out filled markers and marker settings that do nothing.
unfilled_markers = [
    m
    for m, func in Line2D.markers.items()
    if func != "nothing" and m not in Line2D.filled_markers
]

for ax, markers in zip(axs, split_list(unfilled_markers)):
    for y, marker in enumerate(markers):
        ax.text(-0.5, y, repr(marker), **text_style)
        ax.plot([y] * 3, marker=marker, **marker_style)
    format_axes(ax)

plt.show()

fig, axs = plt.subplots(ncols=2)
fig.suptitle("Filled markers", fontsize=14)
for ax, markers in zip(axs, split_list(Line2D.filled_markers)):
    for y, marker in enumerate(markers):
        ax.text(-0.5, y, repr(marker), **text_style)
        ax.plot([y] * 3, marker=marker, **marker_style)
    format_axes(ax)

plt.show()

You can find more information on markers in the [**matplotlib** documentation](https://matplotlib.org/stable/gallery/lines_bars_and_markers/marker_reference.html).

## Facets

