(data-visualise)=
# Data Visualisation

## Introduction

> "The simple graph has brought more information to the data analyst's mind than any other device." --- John Tukey

This chapter will teach you how to visualise your data using the **seaborn** package.

There are a plethora of other options (and packages) for data visualisation using code. There are broadly two categories of approach to using code to create data visualisations: imperative, where you build what you want, and declarative, where you say what you want. Choosing which to use involves a trade-off: imperative libraries offer you flexibility but at the cost of some verbosity; declarative libraries offer you a quick way to plot your data, but only if it’s in the right format to begin with, and customisation may be more difficult.

**seaborn** is a declarative visualisation package, and these can be easier to get started with. But it's built on top of an imperative package, the incredibly powerful **matplotlib**, so you can always dig further and tweak details if you need to. However, in this chapter, we'll focus on using **seaborn** declaratively.

In [None]:
# remove cell
import matplotlib_inline.backend_inline
import matplotlib.pyplot as plt

# Plot settings
plt.style.use("https://github.com/aeturrell/python4DS/raw/main/plot_style.txt")
matplotlib_inline.backend_inline.set_matplotlib_formats("svg")

### Prerequisites

You will need to install the **seaborn** package for this chapter (`pip install seaborn`). Once you've done this, import the **seaborn** objects interface into your session using:

In [None]:
import seaborn.objects as so

This import brings in the plotting part of **seaborn** that we'll use throughout this chapter, under the short name `so`.

## First Steps

Let's use our first graph to answer a question: Do cars with big engines use more fuel than cars with small engines? You probably already have an answer, but try to make your answer precise. What does the relationship between engine size and fuel efficiency look like? Is it positive? Negative? Linear? Non-linear?

### The `mpg` data frame

You can test your answer with the `mpg` data frame, which we'll download from the internet using the **pandas** package.

A data frame is a rectangular collection of variables (in the columns) and observations (in the rows). `mpg` contains observations collected by the US Environmental Protection Agency on 38 car models.

In [None]:
import pandas as pd

mpg = pd.read_csv(
    "https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/mpg.csv", index_col=0
)
mpg

Among the variables in `mpg` are:

1.  `displ`, a car's engine size, in litres.

2.  `hwy`, a car's fuel efficiency on the highway, in miles per gallon (mpg). A car with a low fuel efficiency consumes more fuel than a car with a high fuel efficiency when they travel the same distance.

### Creating a Plot

To plot `mpg`, run this code to put `displ` on the x-axis and `hwy` on the y-axis:

In [None]:
so.Plot(mpg, x="displ", y="hwy").add(so.Dot())

The plot shows a negative relationship between engine size (`displ`) and fuel efficiency (`hwy`). In other words, cars with smaller engine sizes have higher fuel efficiency and, in general, as engine size increases, fuel efficiency decreases. Does this confirm or refute your hypothesis about fuel efficiency and engine size?

With **seaborn**, you begin a plot with the function `so.Plot()`. **seaborn** creates a coordinate system that you can add layers to. The first argument of `so.Plot()` is the dataset to use in the graph. So `so.Plot(mpg)` creates an empty graph, but it's not very interesting so I'm not going to show it here.

You complete your graph by adding one or more layers to the plot. The function `.add(so.Dot())` adds a layer of points to your plot, creating a scatterplot. You can choose between telling `so.Plot` what the x and y axis variables are or passing it directly to `.add`.

**seaborn** comes with many functions that each add a different type of layer to a plot. You'll learn a whole bunch of them throughout this chapter.

### A graphing template

Let's turn this code into a reusable template for making graphs with **seaborn**. To make a graph, replace the bracketed sections in the code below with a dataset, the variables to put on the x- and y-axes, and a plot style (what **seaborn** calls a mark).

```python
so.Plot(<data>, x=<X VARIABLE>, y=<Y VARIABLE>).add(so.<PLOT STYLE>)
```

The rest of this chapter will show you how to complete and extend this template to make different types of graphs.

### Exercises

1.  Run `so.Plot(mpg)`.
    What do you see?

2.  How many rows are in `mpg` (the data frame)?
    How many columns?

3.  Make a scatterplot of `hwy` vs `cyl`.

4.  What happens if you make a scatterplot of `class` vs `drv`? Why is the plot not useful?

## Aesthetic mappings

> "The greatest value of a picture is when it forces us to notice what we never expected to see." --- John Tukey

In the plot below, one group of points (highlighted in red) seems to fall outside of the linear trend. These cars have a higher mileage than you might expect. How can you explain these cars?


In [None]:
# remove input
so.Plot(mpg, x="displ", y="hwy").add(so.Dot()).add(
    so.Dot(color="red", pointsize=5), data=mpg.query("displ > 5 and hwy > 20")
)

Let's hypothesize that the cars are hybrids. One way to test this hypothesis is to look at the `class` value for each car.
The `class` variable of the `mpg` dataset classifies cars into groups such as compact, midsize, and SUV. If the outlying points are hybrids, they should be classified as compact cars or, perhaps, subcompact cars (keep in mind that this data was collected before hybrid trucks and SUVs became popular).

You can add a third variable, like `class`, to a two dimensional scatterplot by mapping it to another dimension of the plot. These could be things like the size, the shape, or the colour of your points.

For example, you can map the colours of your points to the `class` variable to reveal the class of each car.

In [None]:
so.Plot(mpg, x="displ", y="hwy", color="class").add(so.Dot())

To map another dimension in the plot to a variable, assign that dimension to the variable, for example `color="class"` within `so.Plot` or within `.add`. **seaborn** will automatically assign a unique level of the dimension (here a unique colour) to each unique value of the variable, a process known as scaling. **seaborn** will also add a legend that explains which levels correspond to which values.

The colours reveal that many of the unusual points (with engine size greater than 5 litres and highway fuel efficiency greater than 20 miles per gallon) are two-seater cars. These cars don't seem like hybrids, and are, in fact, sports cars! Sports cars have large engines like SUVs and pickup trucks, but small bodies like midsize and compact cars, which improves their gas mileage. In hindsight, these cars were unlikely to be hybrids since they have large engines.


In the above example, we mapped `class` to colour, but we could have mapped `class` to the size of points in the same way. In this case, the exact size of each point would reveal its class affiliation. Big warning here though: mapping an unordered variable (`class`) to an ordered dimension (`pointsize`) is generally not a good idea, because it suggests a ranking that isn't there.

In [None]:
so.Plot(mpg, x="displ", y="hwy", pointsize="class").add(so.Dot())

Similarly, we could have mapped `class` to *alpha* level, which controls the transparency of the points, or to the *marker* variable, which controls the shape of the points.

In [None]:
so.Plot(mpg, x="displ", y="hwy", alpha="class").add(so.Dot())

In [None]:
so.Plot(mpg, x="displ", y="hwy", marker="class").add(so.Dot())

Once you map variables to dimensions, **seaborn** takes care of the rest. It selects a reasonable scale to use with the dimension, and it constructs a legend that explains the mapping between levels and values.

You can also *set* a dimension property in your plot directly. For example, we can make all of the points in our plot purple:

In [None]:
so.Plot(mpg, x="displ", y="hwy").add(so.Dot(color="purple"))

Here, the colour doesn't convey information about a variable, but only changes the appearance of the plot.
To set a dimension manually like this, put it within the specific layer it applies to (eg `.add(so.Dot(color="purple"))`) rather than in the part that maps variables to dimensions (eg not in `so.Plot(mpg, x="displ", y="hwy")`).

When assigning values to dimensions, you'll need to pick values that make sense, for example:

-   The name of a colour as a string, eg `color="purple"`
-   The size of a point as a number (the diameter of the marker, in points), eg `pointsize=5`
-   The shape of a marker as a string, eg `marker="*"` for a star

In [None]:
# remove cell
from matplotlib.lines import Line2D


text_style = dict(
    horizontalalignment="right",
    verticalalignment="center",
    fontsize=12,
    fontfamily="monospace",
)
marker_style = dict(
    linestyle=":",
    color="0.8",
    markersize=10,
    markerfacecolor="tab:blue",
    markeredgecolor="tab:blue",
)


def format_axes(ax):
    ax.margins(0.2)
    ax.set_axis_off()
    ax.invert_yaxis()


def split_list(a_list):
    i_half = len(a_list) // 2
    return a_list[:i_half], a_list[i_half:]


fig, axs = plt.subplots(ncols=2)
fig.suptitle("Un-filled markers", fontsize=14)

# Filter out filled markers and marker settings that do nothing.
unfilled_markers = [
    m
    for m, func in Line2D.markers.items()
    if func != "nothing" and m not in Line2D.filled_markers
]

for ax, markers in zip(axs, split_list(unfilled_markers)):
    for y, marker in enumerate(markers):
        ax.text(-0.5, y, repr(marker), **text_style)
        ax.plot([y] * 3, marker=marker, **marker_style)
    format_axes(ax)

plt.show()

fig, axs = plt.subplots(ncols=2)
fig.suptitle("Filled markers", fontsize=14)
for ax, markers in zip(axs, split_list(Line2D.filled_markers)):
    for y, marker in enumerate(markers):
        ax.text(-0.5, y, repr(marker), **text_style)
        ax.plot([y] * 3, marker=marker, **marker_style)
    format_axes(ax)

plt.show()

You can find more information on markers in the [**matplotlib** documentation](https://matplotlib.org/stable/gallery/lines_bars_and_markers/marker_reference.html).

## Facets

One way to add additional variables to a plot is by mapping them to a dimension. Another way, which is particularly useful for categorical variables, is to split your plot into **facets**, subplots that each display one subset of the data.

To facet your plot by a single variable, use `.facet(<VARIABLE>)`; this should be a discrete variable.

In [None]:
(
    so.Plot(
        mpg,
        "displ",
        "hwy",
    )
    .facet("cyl")
    .add(so.Dot())
)

## Geometric objects

How are these two plots similar?

In [None]:
so.Plot(mpg, x="displ", y="hwy").add(so.Dot())

In [None]:
so.Plot(mpg, x="displ", y="hwy").add(so.Line(), so.Agg())

Both plots contain the same x variable, the same y variable, and both show the same data (to some extent). But the plots are not identical. Each plot uses a different visual object to represent the data. In **seaborn** language, these are represented by different *marks*: one is a scatter and the other a (mean) line (which introduces an aggregation).
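
To see what the aggregation is doing, here is roughly the same calculation done by hand with **pandas**: the mean of `hwy` at each distinct value of `displ`. This is a sketch of the idea on made-up numbers, not a description of how **seaborn** computes it internally.

```python
import pandas as pd

# Hypothetical toy data: several cars share the same engine size
cars = pd.DataFrame(
    {"displ": [1.8, 1.8, 2.0, 2.0, 2.0], "hwy": [28, 30, 31, 27, 32]}
)

# One mean per engine size; these are the heights the mean line passes through
mean_hwy = cars.groupby("displ")["hwy"].mean()
print(mean_hwy)
```

Five observations collapse to two points on the line, one for each engine size.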

A mark is a geometrical object that shows where data occur in x, y, and any other dimension-space you care to use. For example, the plot below is a line plot but we've added a discrete dimension of colour so that—instead of a single aggregate line—we get one for each distinct value of `"drv"`. One line describes all of the points that have a `4` value, one line describes all of the points that have an `f` value, and one line describes all of the points that have an `r` value. Here, `4` stands for four-wheel drive, `f` for front-wheel drive, and `r` for rear-wheel drive.

In [None]:
so.Plot(mpg, x="displ", y="hwy", color="drv").add(so.Line(), so.Agg())

You can achieve the same effect without distinguishing by colour by using the `group` keyword. The `group` keyword uses a categorical variable to draw multiple objects; **seaborn** will draw a separate object for each unique value of the grouping variable.


In [None]:
so.Plot(mpg, x="displ", y="hwy").add(so.Line(), so.Agg(), group="drv")

**seaborn** will allow you to add multiple layers to the base plot. In the plot below, we show both the points (using `.add(so.Dot())`) and an aggregate line per value of `"drv"`. Because we passed colour into `.Plot`, both of these layers are distinguished by different colours.

In [None]:
so.Plot(mpg, x="displ", y="hwy", color="drv").add(so.Line(), so.Agg()).add(so.Dot())

If you map variables to the dimensions in `.Plot`, **seaborn** will use them for all subsequent layers. But if you map a variable within a specific layer, the mapping applies to that layer only:

In [None]:
so.Plot(mpg, x="displ", y="hwy").add(so.Line(), so.Agg()).add(so.Dot(), color="class")

Each layer can have its own cut of the data too. Here, our line displays just a subset of the `mpg` dataset, the subcompact cars. We get this by explicitly adding a `data=` keyword argument to the same `.add` command as the line. The scatter plot has all points, while the line has just those for subcompact cars, as specified by the filter we applied to the **pandas** data frame (try running `mpg.loc[mpg["class"] == "subcompact"]` to see the data that make up the line).

In [None]:
(
    so.Plot(mpg, x="displ", y="hwy")
    .add(so.Dot())
    .add(so.Line(color="blue"), so.Agg(), data=mpg.loc[mpg["class"] == "subcompact"])
)

## Statistical Transformations

We've already seen `so.Agg()` for aggregating multiple points into a single, mean line. Now let's take a look at another statistical transform, `so.Hist()`, which we'll use to make a bar chart. We'll use the diamonds dataset:

In [None]:
import seaborn as sns

diamonds = sns.load_dataset("diamonds")
diamonds.head()

Let's now create a bar chart of counts, aka a histogram, of the numbers of diamonds of different cuts. This only requires one dimension, `"cut"`, and then an instruction to use `so.Hist()` alongside `so.Bar()` in the (single) layer on top of the plot.

In [None]:
(so.Plot(diamonds, "cut").add(so.Bar(), so.Hist()))

On the x-axis, the chart displays cut, a variable from diamonds. On the y-axis, it displays count, but count is not a variable in diamonds! Where does count come from? Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot:

- bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin

- aggregations fit a mean line to your data 

- boxplots compute a summary of the distribution and display it as a box

The algorithm used to calculate new values for a graph is called a Stat, short for statistical transformation.
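
To make the idea of a stat concrete, here is roughly the calculation behind the bar heights, done by hand with **pandas**. The short `cuts` series is made-up data, and this is a sketch of the idea rather than a description of how **seaborn** computes it internally.

```python
import pandas as pd

# Hypothetical toy column standing in for diamonds["cut"]
cuts = pd.Series(
    ["Ideal", "Premium", "Ideal", "Good", "Ideal", "Premium"], name="cut"
)

# Count how many observations fall into each category
counts = cuts.value_counts()
print(counts)
```

The counts are a new variable derived from the data, which is why `count` can appear on the y-axis without being a column in the data frame.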

## Adding More Information to Plots


There’s one more piece of magic associated with bar charts. You can colour a bar chart using the `color=` keyword argument within the `.add` layer:

In [None]:
(so.Plot(diamonds, "cut").add(so.Bar(), so.Hist(), color="cut"))

But you can also choose another variable and thereby add extra info to your chart, for example here by adding information on clarity:

In [None]:
(so.Plot(diamonds, "cut").add(so.Bar(), so.Hist(), color="clarity"))

### Overplotting

**seaborn** functions have parameters that allow adjustments for overplotting, for example by putting the bars for different groups next to each other on the same chart. These include `dodge` in several categorical functions, `jitter` in several functions based on scatterplots, and the `multiple=` parameter in distribution functions. These adjustments are abstracted away from the particular visual representation into the concept of a 'move':

In [None]:
(so.Plot(diamonds, "cut", color="clarity").add(so.Bar(), so.Hist(), so.Dodge()))

A move can also accept parameters that control how the information is separated out:

In [None]:
(
    so.Plot(diamonds, "cut", color="clarity").add(
        so.Bar(), so.Hist(), so.Dodge(empty="fill", gap=0.5)
    )
)

There's another type of adjustment that's not useful for bar charts, but can be very useful for scatterplots. Recall our first scatterplot. Did you notice that the plot displays only 126 points, even though there are 234 observations in the dataset? And that some points appear darker than others?

In [None]:
so.Plot(mpg, x="displ", y="hwy").add(so.Dot())


The underlying values of `hwy` and `displ` are rounded so the points appear on a grid and many points overlap each other. This problem is known as **overplotting**. This arrangement makes it difficult to see the distribution of the data. Because scatterplot points are, by default, plotted with some transparency you can get a sense of which parts of the grid have multiple points on them, but you may wish to use a different technique.
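
You can check for overplotting directly by comparing the number of observations with the number of distinct (x, y) positions. The sketch below does this with **pandas** on a made-up `points` data frame; the same idea should carry over to `mpg[["displ", "hwy"]]`.

```python
import pandas as pd

# Hypothetical rounded measurements: some rows share the same position
points = pd.DataFrame(
    {"displ": [1.8, 1.8, 2.0, 2.0, 2.0], "hwy": [29, 29, 31, 31, 30]}
)

n_obs = len(points)  # number of observations
n_distinct = len(points.drop_duplicates())  # number of distinct positions

# How many observations sit on each position
per_position = points.groupby(["displ", "hwy"]).size()
print(n_obs, n_distinct)
print(per_position)
```

Here five observations occupy only three positions, so two of the dots each hide a second observation.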

Another way to show the overlap is to use the "jitter" option. Passing `so.Jitter()` as an argument adds a small amount of random noise to each point. This spreads the points out, because no two points are likely to receive the same amount of random noise; the numerical option you pass controls how much noise is added.


In [None]:
so.Plot(mpg, x="displ", y="hwy").add(so.Dot(), so.Jitter(1))

Adding randomness seems like a strange way to improve your plot, but while it makes your graph less accurate at small scales, it makes your graph *more* revealing at large scales.

## Co-ordinates and Scales

### Co-ordinates

The co-ordinates of a plot determine which variable is attached to which axis, typically the horizontal (x) axis and the vertical (y) axis. This is set by the arguments to `so.Plot`, so to swap the axes of the plot from before we simply swap the arguments:

In [None]:
so.Plot(mpg, "hwy", "displ").add(so.Dot())

You can also do this explicitly by setting `x="hwy"` and `y="displ"`, and there's a lot to be said for being explicit (when you read your code back later, it's very helpful indeed).

### Scales

Let's say you create a chart but the data vary on a scale that isn't shown well by the default axes. If you find yourself in this situation, you may wish to change the *scale* of one or both of the axes. This is controlled by the `.scale()` method in **seaborn**.

The notion of scaling will probably not be unfamiliar; it means that a mathematical transformation, such as log, is made to the coordinate (or axes) variables.
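
A quick worked example shows why a log transformation helps when values span several orders of magnitude. The numbers are made up for illustration.

```python
import math

# Hypothetical values spanning several orders of magnitude
values = [1, 10, 100, 1000, 10000]

# Gaps between neighbouring values on the original scale grow rapidly
raw_gaps = [b - a for a, b in zip(values, values[1:])]

# After a base-10 log, neighbouring values are (approximately) evenly spaced
logged = [math.log10(v) for v in values]
log_gaps = [b - a for a, b in zip(logged, logged[1:])]

print(raw_gaps)
print(log_gaps)
```

On the original scale the first few values would be squashed together near zero; on the log scale each factor of ten takes up the same amount of space.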

We'll show this using the `planets` dataset, which has lots of variation in it!

In [None]:
planets = sns.load_dataset("planets").query("distance < 1000")
planets.head()

In [None]:
(so.Plot(planets, x="mass", y="distance").scale(x="log", y="log").add(so.Dot()))

Here we used a log scale for both the x- and y-axes because both mass and distance vary over many orders of magnitude.

But the scale property can apply to other dimensions that we are visualising in our plots too; here's an example where we're using colour (in the below, plasma is the name of a built-in continuous colourmap, a way of representing a continuous number line with colour gradients):

In [None]:
(
    so.Plot(planets, x="mass", y="distance", color="orbital_period")
    .scale(x="log", y="log", color=so.Continuous("plasma", trans="log"))
    .add(so.Dot())
)

Sometimes you *don't* want to apply the transform to everything, and that's okay too. Here's an example where the log scale *doesn't* apply to the mass variable, even though mass is shown through the size of the points.

In [None]:
(
    so.Plot(planets, x="distance", y="orbital_period", pointsize="mass")
    .scale(x="log", y="log", pointsize=None)
    .add(so.Dot())
)

## Summing Up

In the above, you've got to grips with some of the basics of visualisation with **seaborn**. You can find much more information in the documentation for that project. But let's recap the grammar of a **seaborn** plot. The typical call will look something like this:

```python
(
    so.Plot(<DATA FRAME>, x=<X STRING>, y=<Y STRING>, <OTHER KEYWORD ARGUMENTS EG POINTSIZE>)
    .scale(x=<SCALE STRING>, <OTHER DIMENSION KEYWORD ARGUMENTS AND SCALE STRINGS>)
    .add(<TYPE OF CHART EG so.Dot()>, <TRANSFORMS EG so.Jitter()>, <KEYWORD ARGUMENTS FOR THIS LAYER EG DATA>)
    .add(<FURTHER LAYERS>)
)
```