# Data Visualisation with Seaborn

Seaborn is a data visualisation library for Python which builds on the
matplotlib package.

It is designed primarily with data exploration in mind. In particular:

- It integrates more closely with pandas data structures than matplotlib does
- It is capable of performing operations on entire datasets
- Its visualisation functions are designed to quickly produce detailed and
  informative statistical plots with few lines of code

When importing seaborn, the convention is to use the alias `sns`:


In [None]:
import seaborn as sns

Running the following code will set all figures to seaborn's default plotting theme:


In [None]:
sns.set_theme()

## 1 Motivation

Let's begin by motivating why seaborn is a good choice for statistical visualisations!

To do this, we will create a basic regression plot with seaborn and attempt to
replicate it with matplotlib.

We will need some data. Let's use the popular "iris" dataset:


In [None]:
iris = sns.load_dataset("iris")
iris.head()

_NB. The complete list of seaborn datasets can be found in the
[seaborn-data repository](https://github.com/mwaskom/seaborn-data)._

Regression plots are trivial with seaborn:


In [None]:
sns.regplot(data=iris, x="sepal_length", y="petal_length")

Let's attempt to make a similar figure with matplotlib:


In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Calculate the linear relationship
x, y = iris["sepal_length"], iris["petal_length"]
lin = np.polyfit(x, y, 1)
pred = np.poly1d(lin)

# Generate the plot
plt.scatter(x, y)
plt.plot(x, pred(x))
plt.xlabel("sepal_length")
plt.ylabel("petal_length")

This requires many more lines of code, and highlights a number of drawbacks of using
matplotlib alone:

- Matplotlib has no built-in regression functionality, so we had to fit the line with NumPy
- The data points and the trend line have to be added separately
- Plotting from a `DataFrame` is less direct: here we extracted the columns ourselves
- We have to supply the axis labels manually
Seaborn actually *wraps around* matplotlib, giving us detailed statistical
visualisations with much shorter code.

As we'll show later on, seaborn figures can still be customised using matplotlib syntax.

## 2 Visualisation functions

We will now cover a range of visualisation functions provided by seaborn.

This is by no means intended as a complete guide. After completing this tutorial, we
recommend referring to
[seaborn's wonderful documentation](https://seaborn.pydata.org/tutorial.html).


### 2.1 Bivariate relationships

#### Scatter plots

We will begin with a basic scatter plot of petal length versus sepal length:


In [None]:
sns.scatterplot(
    data=iris,
    x="sepal_length",
    y="petal_length",
)

We are able to control the formatting of the markers (including the colour,
shapes and sizes) using the data:


In [None]:
sns.scatterplot(
    data=iris,
    x="sepal_length",
    y="petal_length",
    hue="species",
    style="species",
)

#### Line plots

To demonstrate line plots, let's load in some time series data:


In [None]:
flights = sns.load_dataset("flights")
flights.head()

This time we will call the `lineplot()` function:


In [None]:
sns.lineplot(data=flights, x="year", y="passengers")

In addition to a solid line, which represents the mean, we also get a shaded
error region which, by default, represents the 95% confidence interval.

_NB. We have an error region because the flights data is in long format (i.e.
multiple entries per year)._
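To make the idea of "long format" concrete, the sketch below builds a tiny table shaped like the flights data and aggregates it per year with pandas. The passenger numbers are invented for illustration and are not taken from the dataset. The per-year mean is the quantity the solid line traces, and the spread within each year is what gives rise to the error region.

```python
import pandas as pd

# Hypothetical long-format data: one row per (year, month) observation.
# The passenger values are invented for illustration only.
toy_flights = pd.DataFrame(
    {
        "year": [1949, 1949, 1949, 1950, 1950, 1950],
        "month": ["Jan", "Feb", "Mar", "Jan", "Feb", "Mar"],
        "passengers": [110, 120, 130, 115, 125, 150],
    }
)

# Several rows share each year, so there is something to average over
yearly = toy_flights.groupby("year")["passengers"].agg(["count", "mean"])
print(yearly)
```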

We can also control the line style, colour, etc. using the data. For example:


In [None]:
sns.lineplot(
    data=flights,
    x="year",
    y="passengers",
    hue="month",
    style="month",
)

### Exercise 1

Please complete *Q1* of [the exercise sheet](exercises.ipynb#Q1\))

### 2.2 Optimisation

With `regplot()`, it's possible to apply a polynomial fit via the `order`
parameter. Let's demonstrate this with seaborn's penguins dataset:


In [None]:
penguins = sns.load_dataset("penguins")
penguins.head()

In [None]:
sns.regplot(
    data=penguins,
    x="flipper_length_mm",
    y="body_mass_g",
    order=2,
)

One way to assess our goodness-of-fit is by inspecting residuals, which
can be visualised using the `residplot()` function:


In [None]:
sns.residplot(
    data=penguins,
    x="flipper_length_mm",
    y="body_mass_g",
    order=2,
)
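As a rough illustration of what a residual plot shows, the sketch below fits a second-order polynomial to synthetic data with NumPy and computes the observed minus the fitted values. The data and variable names are made up for this example, and it is not meant to describe how `residplot()` works internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a curved trend plus noise (illustrative only)
x = np.linspace(0, 10, 50)
y = 0.5 * x**2 - 2.0 * x + 3.0 + rng.normal(scale=1.0, size=x.size)

# Fit a second-order polynomial, mirroring order=2
coefficients = np.polyfit(x, y, deg=2)
fitted = np.polyval(coefficients, x)

# Residuals: observed values minus fitted values
residuals = y - fitted

print("mean residual:", residuals.mean())
print("residual spread:", residuals.std())
```

If the model is adequate, the residuals should scatter around zero with no obvious structure. A curved pattern in the residuals suggests that a different order may be needed.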

### Exercise 2

Please complete *Q2* of [the exercise sheet](exercises.ipynb#Q2\))

### 2.3 Distributions

#### Histograms

Staying with the penguins dataset, let's plot a histogram of the body mass:

In [None]:
sns.histplot(
    data=penguins,
    x="body_mass_g",
    bins=18,
)

As with scatter plots, we can use `hue` to split the histogram by some
variable.


In [None]:
sns.histplot(
    data=penguins,
    x="body_mass_g",
    bins=18,
    hue="species",
)

#### Kernel density estimation

Kernel density estimation (KDE) is used to obtain a continuous estimate of a
distribution by placing a Gaussian kernel on each observation and summing the results.

We can generate such a plot using the `kdeplot()` function:


In [None]:
sns.kdeplot(
    data=penguins,
    x="body_mass_g",
    hue="species",
    bw_adjust=1,
)

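The smoothness of the estimated density is controlled by the bandwidth of the
Gaussian kernel. `bw_adjust` is a multiplier applied to the automatically chosen
bandwidth: values above 1 give a smoother curve, values below 1 a more detailed one.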
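To illustrate how the bandwidth affects the estimate, here is a minimal hand-rolled Gaussian KDE in NumPy. It assumes a common rule of thumb (Scott's rule) for the base bandwidth and uses synthetic data. The function name `gaussian_kde_1d` is hypothetical, and the sketch is not seaborn's own implementation.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bw_adjust=1.0):
    """Evaluate a Gaussian KDE of `samples` at the points in `grid`."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    # Scott's rule of thumb for the bandwidth, scaled by bw_adjust
    bandwidth = bw_adjust * samples.std(ddof=1) * n ** (-1 / 5)
    # One Gaussian bump per observation, evaluated at every grid point
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    # The density estimate is the average of the bumps
    return bumps.mean(axis=1)


rng = np.random.default_rng(0)
samples = rng.normal(size=200)  # synthetic data, for illustration only
grid = np.linspace(-10, 10, 2001)

detailed = gaussian_kde_1d(samples, grid, bw_adjust=0.2)
smooth = gaussian_kde_1d(samples, grid, bw_adjust=2.0)
```

Plotting `detailed` and `smooth` against `grid` shows the trade-off: a small multiplier follows individual observations closely, while a large one can blur out genuine features.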

#### Bivariate distributions

We now consider bivariate distributions, where we plot the 2-dimensional
distributions of pairs of variables.

To create such a plot in seaborn, take the code we've been using to create
histograms and include a `y` variable:


In [None]:
sns.histplot(
    data=penguins,
    x="body_mass_g",
    y="flipper_length_mm",
    cbar=True,
)

Darker shades represent areas of higher density.

Let's display a KDE instead, which for a 2D distribution appears as
contours:


In [None]:
sns.kdeplot(
    data=penguins,
    x="body_mass_g",
    y="flipper_length_mm",
    levels=[0.05, 0.32]
)

Each contour is drawn at an *iso-proportion* of the density: for a level `p`, a
proportion `p` of the estimated probability mass lies outside the contour.

Here we show the contours which enclose approximately 95% and 68% of the probability mass.
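The link between a level and the mass it encloses can be checked numerically. The sketch below is one way to find the density value below which a given proportion of the mass lies, applied to a standard bivariate normal evaluated on a grid. The helper `density_level` is a hypothetical name, and this is not presented as seaborn's internal code.

```python
import numpy as np


def density_level(density, proportion):
    """Return the density value below which `proportion` of the mass lies."""
    values = np.sort(density.ravel())
    cumulative = np.cumsum(values) / values.sum()
    index = np.searchsorted(cumulative, proportion)
    return values[min(index, values.size - 1)]


# Standard bivariate normal density on a regular grid (illustrative only)
axis = np.linspace(-5, 5, 201)
xx, yy = np.meshgrid(axis, axis)
density = np.exp(-0.5 * (xx**2 + yy**2)) / (2 * np.pi)

# Density value of the contour drawn for a level of 0.05
threshold = density_level(density, 0.05)

# Fraction of the total mass lying on or inside that contour
enclosed = density[density >= threshold].sum() / density.sum()
print(enclosed)
```

For this example, `enclosed` comes out close to 0.95, matching the idea that a level of 0.05 leaves about 5% of the mass outside the contour.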

### Exercise 3

Please complete *Q3* of [the exercise sheet](exercises.ipynb#Q3\))

### 2.4 Categorical data

We can also generate plots that are specific to categorical data.

If we are primarily interested in the spread of the points, we could use a box
plot:


In [None]:
sns.boxplot(
    data=penguins,
    x="species",
    y="flipper_length_mm",
    hue="sex",
)

We can also create bar plots:


In [None]:
sns.barplot(
    data=flights,
    x="month",
    y="passengers",
)

This operates on the entire flights data set, providing:

- an estimate of the mean for each category

- an error bar which, by default, shows the 95% confidence interval of that mean.
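To give a feel for where such an error bar can come from, the sketch below computes a mean and a 95% percentile-bootstrap interval for a single category with NumPy. The values are invented for illustration, and the bootstrap shown here is one common way to obtain such an interval rather than a description of seaborn's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical passenger counts for one category (invented values)
values = np.array([100, 120, 150, 170, 200, 210, 240, 280, 310, 340, 360, 420])

# Resample with replacement many times and record each resample's mean
n_boot = 1000
resamples = rng.choice(values, size=(n_boot, values.size), replace=True)
boot_means = resamples.mean(axis=1)

# The middle 95% of the bootstrap means gives the interval
lower, upper = np.percentile(boot_means, [2.5, 97.5])
mean = values.mean()
print(f"mean = {mean:.1f}, 95% interval = ({lower:.1f}, {upper:.1f})")
```

The bar height corresponds to `mean`, and the error bar spans from `lower` to `upper`.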

## 3 Multi-panel plots

To finish, we will look at how to construct complex multi-panel figures with
seaborn.

### 3.1 Facet grids

Facet grids can be used to construct multiple plots using subsets of a
dataset, split on the values of variables.

Let's take our distribution of penguin body mass example from earlier, and
create multiple panels based on sex and species using `FacetGrid()`:

In [None]:
g = sns.FacetGrid(
    data=penguins,
    row="sex",
    col="species",
)
g.map(
    sns.histplot,
    "body_mass_g",
    element="step",
)

Here we made two rows (for Male and Female) and three columns (for Adelie,
Chinstrap and Gentoo).

We then mapped plots onto our panels with `FacetGrid`'s `.map()` method.

The first input of `.map()` should be the plotting function itself (passed without
calling it), followed by the inputs we would use if calling that function on its own.

### 3.2 Pair grids

Pair plots are used to show the relationships between every combination of two
variables in a data set.

In seaborn, they can be created by defining a `PairGrid` object:


In [None]:
g = sns.PairGrid(data=penguins, diag_sharey=False)

g.map_upper(sns.scatterplot) \
 .map_lower(sns.kdeplot) \
 .map_diag(sns.kdeplot)

The dataset is supplied to the `data` argument in `PairGrid()`.

`PairGrid` defines the methods:

- `.map_diag()`, to fill all _diagonal panels_ in a pair grid

- `.map_upper()` and `.map_lower()`, to fill the panels _above and below_ the
  diagonal

- A few other methods not shown here, e.g. `.map_offdiag()` and `.map()`.

We could also colour the data by one of our categorical variables using `hue`:


In [None]:
g = sns.PairGrid(
    data=penguins,
    diag_sharey=False,
    hue="species",
)
g.map_upper(sns.scatterplot) \
 .map_lower(sns.kdeplot) \
 .map_diag(sns.kdeplot) \
 .add_legend()

Here we have used `.add_legend()` to display the legend.

_NB. Setting_ `diag_sharey=False` _ensures KDEs on the diagonal use the full height of
the vertical axis, as the axes will not be shared with the other panels._

### Exercise 4

Please complete *Q4* of [the exercise sheet](exercises.ipynb#Q4\))

## 4 Customisation

As we have already touched on in Exercise 2, it is possible to pass a matplotlib `Axes`
object into many of seaborn's plotting functions.

This allows you to initialise and customise your figure using matplotlib, then add
your seaborn visualisations to the plot panels. For example:


In [None]:
fig, ax = plt.subplots(figsize=(5, 6))
sns.regplot(
    data=iris,
    x="sepal_length",
    y="petal_length",
    ax=ax,
)
ax.set_xlabel("Sepal length")
ax.set_ylabel("Petal length")

Customisation of `FacetGrid` and `PairGrid` figures is a bit more complicated. 

`FacetGrid()` and `PairGrid()` actually initialise a matplotlib figure internally. We
can access the `Figure` and `Axes` objects using the `.figure` and `.axes` attributes.
Then we can customise the figure using familiar matplotlib syntax:


In [None]:
g = sns.FacetGrid(
    data=penguins,
    row="sex",
    col="species",
).map(
    sns.histplot,
    "body_mass_g",
    element="step",
)

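# g.axes is a 2D array of Axes, indexed as [row, column]
fig, axes = g.figure, g.axes
for ax in axes[-1, :]:
    # Relabel the x-axis on the bottom row of panels only
    ax.set_xlabel("Body mass [g]")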

### Exercise 5

Please complete *Q5* of [the exercise sheet](exercises.ipynb#Q5\))

Thanks for participating in this tutorial. We hope you found it useful!

If you're interested in learning more about seaborn's vast range of visualisation
functions, we *strongly* recommend seaborn's
[excellent documentation](https://seaborn.pydata.org/tutorial.html).
