# Data visualisation

For data visualisation, you will be using the `matplotlib` package. Like `numpy`, this is an external package that can be used with Python. It allows you to produce graphics, and has *many* options for customisation. If you're viewing this page through MyBinder or Google Colab, Matplotlib has already been installed for you. If you're working along using the Anaconda distribution, Matplotlib is also included. Otherwise, you'll have to install it before continuing.

You can import matplotlib into Python with `import matplotlib`, but today you will mostly be using its submodule `pyplot`, which you can import like this:

In [None]:
from matplotlib import pyplot

Some people like to abbreviate this. As with `numpy` and `np`, this can make your code harder to read (although the abbreviation is so commonplace that many people will recognise it), but it will make writing a bit quicker.

In [None]:
from matplotlib import pyplot as plt

#### Loading the dataset on Google Colab

Ignore this if you're copying code into your own Python installation, or if you're doing this on MyBinder. If you're on Google Colab, follow these steps:

1. Download the relevant files to your computer. They can be downloaded via [this link](https://github.com/esdalmaijer/PyBrain_Python_Intro/blob/main/files_for_google_colab/files_for_google_colab.zip).
2. Unzip the archive you downloaded at step 1.
3. Go to your Google Drive and create the folder `PyBrain`. (You can name it differently if you want to, but if you do, make sure to change the name in the following code too.)
4. Upload your files to Google Drive, into the `PyBrain` folder.
5. Now run the code below. This should provide you with a prompt. Click on the URL, allow access, and then copy the code you were given into the prompt.

In [None]:
from google.colab import drive
drive.mount("/content/drive")

In [None]:
# Run this if you are on Google Colab:

import numpy

# Set the path to the data on Google Colab.
file_name = "/content/drive/My Drive/PyBrain/example.csv"

# Load the data.
my_data = numpy.loadtxt(file_name, delimiter=",",
    skiprows=1)

# Split the data into its columns: ppnr, x, y, and z.
ppnr = my_data[:,0]
x = my_data[:,1]
y = my_data[:,2]
z = my_data[:,3]

#### Loading the dataset

If you're NOT on Google Colab, use the following code to load a simple dataset before we start:

In [None]:
import numpy

my_data = numpy.loadtxt("example.csv", delimiter=",",
    skiprows=1)

# Split the data into its columns: ppnr, x, y, and z.
ppnr = my_data[:,0]
x = my_data[:,1]
y = my_data[:,2]
z = my_data[:,3]
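
To see what `delimiter` and `skiprows` do without needing the file, here is a minimal sketch that reads a small made-up CSV from memory. The column names and values below are invented for illustration, and are not the contents of `example.csv`.

```python
import io
import numpy

# A made-up CSV with a header row (hypothetical values, NOT example.csv).
csv_text = "ppnr,x,y,z\n1,0.5,-1.0,2.0\n2,1.5,0.0,3.0\n3,2.5,1.0,4.0\n"

# skiprows=1 skips the header line, which cannot be converted to numbers.
demo_data = numpy.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

# Rows are observations, columns are variables.
print(demo_data.shape)
# Select a whole column with [:, column_index].
demo_x = demo_data[:, 1]
print(demo_x)
```

The same indexing is what splits the real dataset into its columns above.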

## Scatter plots

There are two ways to create scatterplots. The first one is by using the `plot` function:

In [None]:
pyplot.plot(x, y, "o")

This is a bit simple, so let's get funky! You can customise the colours, the type of markers, their size, and their opacity (alpha):

In [None]:
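# Set the marker type with the marker keyword only (passing "o" as well
# would define the marker twice), and switch off the connecting line.
pyplot.plot(x, y, linestyle="none", color="#FF69B4", marker=">",
    markersize=20, alpha=0.2)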

You can also create scatterplots by using the `scatter` function:

In [None]:
pyplot.scatter(x, y, marker="o", color="red")

Here, you can play around with the marker sizes by passing a NumPy array. For example, you can scale them according to their `z` values.

In [None]:
# Shift the z values to make sure all marker sizes are at least 1.
marker_size = z + abs(numpy.min(z)) + 1
# Plot the points.
pyplot.scatter(x, y, s=marker_size*15, marker="o", color="red")
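
As a variation, `scatter` can also map a third variable onto colour instead of size, through its `c` and `cmap` arguments. The sketch below uses randomly generated stand-in data (not the worksheet's dataset), and the `viridis` colour map is an arbitrary choice.

```python
import numpy
from matplotlib import pyplot

# Stand-in data, generated here so that the example runs on its own.
rng = numpy.random.default_rng(1)
demo_x = rng.normal(size=50)
demo_y = rng.normal(size=50)
demo_z = rng.normal(size=50)

# Colour every point according to its z value.
points = pyplot.scatter(demo_x, demo_y, c=demo_z, cmap="viridis", marker="o")
# Add a colour bar that shows which colour matches which z value.
pyplot.colorbar(points, label="z")
```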

## Line plots

The same `plot` function can be used to create line plots:

In [None]:
x_values = numpy.linspace(0, 8*numpy.pi, 1000)
y1 = numpy.sin(x_values)
y2 = numpy.cos(x_values)
pyplot.plot(x_values, y1, "-", lw=3, color="red")
pyplot.plot(x_values, y2, "-", lw=3, color="green")

You can add labels to your plot, and automatically turn them into a legend:

In [None]:
pyplot.plot(x_values, y1, "-", lw=3, color="red", label="sin")
pyplot.plot(x_values, y2, "-", lw=3, color="green", label="cos")
pyplot.legend()

In addition to the legend, you can add axis labels too:

In [None]:
pyplot.plot(x_values, y1, "-", lw=3, color="red", label="sin")
pyplot.plot(x_values, y2, "-", lw=3, color="green", label="cos")
pyplot.legend()
pyplot.xlabel("X values", fontsize=14)
pyplot.ylabel("Y values", fontsize=14)

#### Saving figures

While you can use `pyplot` functions directly for drawing operations, it is often neater to create a figure instance and use its methods. This works in a very similar way:

In [None]:
# Create a new figure. The size is in inches.
# You can also set the DPI (dots per inch).
fig = pyplot.figure(figsize=(8.0, 6.0), dpi=100)
# Add axes to draw in the figure. Give them a 15% border
# by making them start at 0.15 from the left and 0.15 from
# the bottom, and giving them a width and height of 0.7.
ax = fig.add_axes([0.15,0.15,0.7,0.7])

# Now plot the data into the axes.
ax.plot(x_values, y1, "-", lw=3, color="red", label="sin")
ax.plot(x_values, y2, "-", lw=3, color="green", label="cos")
# Draw the legend.
ax.legend(loc="lower left", fontsize=14)
# Set the axis labels. Note that these methods are named
# subtly differently: set_xlabel and set_ylabel, rather
# than xlabel and ylabel.
ax.set_xlabel("X values", fontsize=18)
ax.set_ylabel("Y values", fontsize=18)

If you were running this on your local computer, you could easily save the figure by calling the `savefig` method:

In [None]:
fig.savefig("my_figure.png")
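
`savefig` takes the file format from the file name's extension, and accepts a few optional keyword arguments. The sketch below assumes you would like a higher-resolution PNG and a PDF; the file names are only examples, and the files are written to a temporary folder so that the example does not clutter your working directory.

```python
import os
import tempfile
import numpy
from matplotlib import pyplot

# Create a small figure to save.
demo_fig = pyplot.figure(figsize=(4.0, 3.0))
demo_ax = demo_fig.add_axes([0.15, 0.15, 0.7, 0.7])
demo_values = numpy.linspace(0, 2*numpy.pi, 100)
demo_ax.plot(demo_values, numpy.sin(demo_values), "-", color="red")

# Save into a temporary folder (replace this with your own folder).
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "my_figure.png")
pdf_path = os.path.join(out_dir, "my_figure.pdf")
# dpi sets the resolution of the saved image; bbox_inches="tight"
# trims empty space around the axes.
demo_fig.savefig(png_path, dpi=300, bbox_inches="tight")
# The extension determines the format, so this writes a PDF.
demo_fig.savefig(pdf_path)
# Close the figure to free up memory.
pyplot.close(demo_fig)
```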

## Histograms

Histograms count the numbers of observations within each `bin`: a section of the space that the data is in. To make them, you can combine NumPy's `histogram` function with Matplotlib's `bar` function. First, create a histogram:

In [None]:
# Create a histogram of the data by using NumPy's "histogram" function.
# Note that this does not produce a plot just yet! You can set the number
# of bins that you would like to divide the data over, and the function
# will then use that to create the bounds for each bin. The function will
# also return the number of observations in each bin.
hist, bin_bounds = numpy.histogram(x, bins=10)

We just allowed NumPy to choose the bin edges, which should generate equal-sized bins. You also have the option to choose the bin edges yourself, or to have them be determined by an algorithm.
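
Both options go through the same `bins` argument: pass a sequence to set the edges yourself, or a string to name one of NumPy's bin-selection rules. Here is a small sketch with randomly generated stand-in values; the edges and the `"auto"` rule are arbitrary choices for illustration.

```python
import numpy

rng = numpy.random.default_rng(0)
demo_values = rng.normal(size=200)

# Option 1: choose the bin edges yourself. Five edges make four bins.
# Values outside of the outer edges are not counted.
my_edges = numpy.array([-2.0, -1.0, 0.0, 1.0, 2.0])
hist_manual, bounds_manual = numpy.histogram(demo_values, bins=my_edges)

# Option 2: let an algorithm decide on the number of bins.
hist_auto, bounds_auto = numpy.histogram(demo_values, bins="auto")

print(hist_manual, bounds_manual)
print(len(hist_auto), "bins chosen automatically")
```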

Let's see what the bins look like by plotting them in a bar plot. Note that the bar plot only needs the left bound of every bin, whereas `bin_bounds` contains the left bound of every bin AND the right bound of the last bin. You can select all the bounds but the last one by *indexing* or *slicing* the `bin_bounds` variable like this: `bin_bounds[0:-1]`. It means "*From bin_bounds, select all values from position 0 to position -1 (the last position), not inclusive of the end point.*"

In [None]:
# Compute the bin width. You can set this as the distance between
# each of the bin edges.
bin_width = bin_bounds[1] - bin_bounds[0]

# Plot the bins using pyplot's bar function. We align the plot at the
# left edges, because those are what we obtained from the histogram
# function.
pyplot.bar(bin_bounds[0:-1], hist, align="edge", width=bin_width)
# Finish the plot by adding some information on the x and y axes.
pyplot.xlabel("Value", fontsize=16)
pyplot.ylabel("Number of observations", fontsize=16)
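
If the slicing is hard to picture, it can help to try it on a tiny example. The numbers below are made up for illustration:

```python
import numpy

# Six made-up values, divided over three bins.
demo_values = numpy.array([0.0, 1.0, 1.5, 2.0, 2.5, 3.0])
demo_hist, demo_bounds = numpy.histogram(demo_values, bins=3)

# Three bins have four bounds: three left bounds, plus the right
# bound of the last bin.
print(demo_bounds)        # [0. 1. 2. 3.]
# All bounds but the last one are the left bounds.
print(demo_bounds[0:-1])  # [0. 1. 2.]
# All bounds but the first one are the right bounds.
print(demo_bounds[1:])    # [1. 2. 3.]
# The last bin includes its right bound, so the value 3.0 is counted.
print(demo_hist)          # [1 2 3]
```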

It's clear that the `x` values are spread in a somewhat normal range, centred roughly around 0, and with a standard deviation of roughly 1 maybe?

Let's do the same thing for the `y` values...

In [None]:
# Create a histogram of the y values.
hist, bin_bounds = numpy.histogram(y, bins=10)
bin_centres = bin_bounds[:-1] + numpy.diff(bin_bounds)/2.0
bin_width = bin_bounds[1]-bin_bounds[0]
# Plot the bins using pyplot's bar function. This time we align the
# bars at the bin centres that we just computed. The bar width is
# set to the bin width so that the bars touch.
pyplot.bar(bin_centres, hist, align="center", width=bin_width)
# Finish the plot by adding some information on the x and y axes.
pyplot.xlabel("Value", fontsize=16)
pyplot.ylabel("Number of observations", fontsize=16)

## Bar plot

There's a very high likelihood that you've seen a bar plot before. They're everywhere, from scientific journals to newspapers. Traditionally, bar plots show the mean of each group. In our case, that would look like this:

In [None]:
# Compute the mean of the groups.
m_x = numpy.mean(x)
m_y = numpy.mean(y)
# Plot both means in a bar plot. The means will determine the height
# of each group's bar on the y-axis, but the position on the x-axis is
# something we need to set ourselves. Let's just go with 0.5 and 1.
# Plot the mean of the x group.
pyplot.bar(0.5, m_x, width=0.5, label="X group")
# Plot the mean of the y group.
pyplot.bar(1.0, m_y, width=0.5, label="Y group")

# Now finish the plot by adding a sensible y-axis label and a legend.
pyplot.ylabel("Score")
pyplot.legend()

From the current bar plot, it looks like the `x` group is doing very well compared to the `y` group (assuming positive values are better). It also hides that there were any negative values. (Note that the last statement might not entirely be true; sometimes it might randomly happen that the `y` group's average is below 0!)

In reality, there are values in the `y` group that extend beyond the range of the `x` group. Furthermore, the two groups were drawn from different types of distributions. We don't see that in the plot.

## Error bars

One way to improve bar plots is by adding *error bars*. These bars sit atop the already plotted bars, and usually indicate the standard error of the mean. (You've already learned about this in the previous practicals. As a reminder, it is a measure of how well the sample reflects the population; i.e. how representative our `x` group is of the group of all X in the world.)

We can compute the standard error by using the formula that you have already seen: Divide the sample standard deviation by the square root of the number of observations in that sample.

In [None]:
# Count the number of observations in each group by taking the length
# of each group's vector.
n_x = len(x)
n_y = len(y)
# Calculate the standard deviations of each group. Note that we are
# calculating the sample standard deviation, i.e. the square root of
# the sum of squared deviations divided by (n-1). This is what the
# ddof value indicates.
sd_x = numpy.std(x, ddof=1)
sd_y = numpy.std(y, ddof=1)
# Calculate the standard error of the mean for both groups.
sem_x = sd_x / numpy.sqrt(n_x)
sem_y = sd_y / numpy.sqrt(n_y)
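
To check your understanding of the formula, you can work through it on a handful of numbers. The eight values below are made up so that the arithmetic comes out cleanly; the sketch also shows what `ddof` changes.

```python
import numpy

# Eight made-up observations with a mean of exactly 5.
demo = numpy.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(demo)

# The squared deviations from the mean sum to 32.
sum_of_squares = numpy.sum((demo - numpy.mean(demo))**2)

# ddof=0 (NumPy's default) divides by n: sqrt(32/8) = 2.
sd_population = numpy.std(demo, ddof=0)
# ddof=1 divides by n-1: sqrt(32/7), which is a bit larger.
sd_sample = numpy.std(demo, ddof=1)

# The standard error of the mean uses the sample standard deviation.
sem = sd_sample / numpy.sqrt(n)
print(sum_of_squares, sd_population, sd_sample, sem)
```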

Now let's use the calculated standard errors to draw error bars into your bar plot. This is almost the same code as before, but mind the `yerr` keyword argument that specifies the error bar size on the y-axis.

In [None]:
# Plot both means in a bar plot. This is the same code as before, but 
# with the yerr keyword specified.
# Plot the mean of the x group.
pyplot.bar(0.5, m_x, yerr=sem_x, width=0.5, label="X group")
# Plot the mean of the y group.
pyplot.bar(1.0, m_y, yerr=sem_y, width=0.5, label="Y group")

# Now finish the plot by adding a sensible y-axis label and a legend.
pyplot.ylabel("Score")
pyplot.legend()

Now, as you are aware, the standard error of the mean indicates how well the sample mean represents the population mean. It's a measure that reflects something about your sampling process, not necessarily about what your data looks like. The plot above thus shows that the two groups might well differ in what their means are, but it doesn't teach you much else.

One thing you can do is use error bars to plot the standard deviation:

In [None]:
# Plot both means in a bar plot. This is the same code as before, 
# but now the yerr represents the standard deviation.
# Plot the mean of the x group.
pyplot.bar(0.5, m_x, yerr=sd_x, width=0.5, label="X group")
# Plot the mean of the y group.
pyplot.bar(1.0, m_y, yerr=sd_y, width=0.5, label="Y group")

# Now finish the plot by adding a sensible y-axis label and a legend.
pyplot.ylabel("Score")
pyplot.legend()

In this plot, it's a bit clearer what the distributions of both groups look like. But it's not pretty... It's quite unclear why the bars are there in the first place, as the only thing they indicate is how far from 0 a group's mean is. In addition, although the standard deviation gives some indication of the spread of each group, we don't quite see what the exact distributions look like: it's just a black line.

## Are bar plots really that bad?

If you haven't seen a lot of datasets, you might not appreciate how summary statistics (mean, median, standard deviation, etcetera) can be misleading. To illustrate just how different datasets can be while having the exact same mean, median, standard deviation, and correlation, please have a look at Alberto Cairo's datasaurus, and a [dozen extremely different plots](https://www.autodeskresearch.com/publications/samestats) that all have the exact same summary statistics:

![](datasaurus_alberto_cairo.png)

This is why it's important to be aware of the underlying distribution of data, and to not simply rely on summary statistics. Bar plots only show summary statistics, and can thus hide potentially important differences between groups.
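
You can recreate a small version of this effect yourself. The sketch below generates two made-up samples (one bell-shaped, one with two separate clusters), and rescales both so that they have the same mean and standard deviation. The sample sizes and cluster positions are arbitrary choices for illustration.

```python
import numpy

rng = numpy.random.default_rng(42)

def standardise(values):
    # Rescale to a mean of 0 and a (sample) standard deviation of 1.
    return (values - numpy.mean(values)) / numpy.std(values, ddof=1)

# One sample from a single bell-shaped distribution...
bell = standardise(rng.normal(size=1000))
# ...and one sample made up of two clearly separated clusters.
clusters = standardise(numpy.concatenate([
    rng.normal(loc=-3.0, size=500), rng.normal(loc=3.0, size=500)]))

# The summary statistics are the same for both samples.
print(numpy.mean(bell), numpy.std(bell, ddof=1))
print(numpy.mean(clusters), numpy.std(clusters, ddof=1))

# The share of values close to the mean is very different, though.
near_bell = numpy.mean(numpy.abs(bell) < 0.5)
near_clusters = numpy.mean(numpy.abs(clusters) < 0.5)
print(near_bell, near_clusters)
```

A bar plot with error bars would look identical for these two samples, even though far fewer of the clustered values lie near the mean.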

## Box plot

One type of plot that *does* reflect properties of distributions is the box plot, or box-and-whiskers plot. It quite literally is two stacked boxes with whiskers on either side. Each element of the plot represents a quartile of the data (that's 25% of the observations). This type of plot thus tells you about each group's median (the 50th percentile lies between the second and third quartile), and gives you a rough idea of what the distribution looks like. Some boxplots also include 'fliers': Values that lie outside the typical range, and could be outliers.
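
The values that separate the quartiles can be computed with NumPy's `percentile` function. The sketch below uses made-up numbers. Note one detail about Matplotlib's default settings: its whiskers do not always run to the minimum and maximum, but stop at the furthest observation within 1.5 times the box height (the interquartile range) from the box. Anything beyond that is drawn as a flier.

```python
import numpy

# Eleven made-up observations; the last one is unusually large.
demo = numpy.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 30.0])

# The 25th, 50th, and 75th percentiles separate the four quartiles.
q1, median, q3 = numpy.percentile(demo, [25, 50, 75])
# The box runs from q1 to q3; its height is the interquartile range.
iqr = q3 - q1

# Observations further than 1.5 * IQR from the box count as fliers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
fliers = demo[(demo < lower_limit) | (demo > upper_limit)]
print(q1, median, q3)  # 3.5 6.0 8.5
print(fliers)          # [30.]
```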

Let's draw box plots for our two groups:

In [None]:
# Draw a box for values from group x and group y. You can pass both
# variables at the same time by combining them into a list, i.e. as
# [x,y]. The same is true for the labels you would like to associate
# with the groups.
# NOTE: From Matplotlib 3.9 onwards, the labels keyword is deprecated
# in favour of tick_labels.
pyplot.boxplot([x,y], labels=["Group X","Group Y"])

This is a pretty good visualisation of the two groups. We can see their central tendency, because the median is represented by the coloured horizontal line. In addition, we can see how observations are spread out. For group `x`, all quartiles are roughly equally big, which suggests that the data could be uniformly distributed. For group `y`, we can see that the second and third quartile (the boxes) are smaller than the first and fourth quartile (the whiskers). This illustrates that the distribution is denser around the median.

What we still can't see from the current plot is what the shape of the distribution is. For example, it seems like group `y` is a normal distribution, but it could also be that all values within the second and third quartile are the same. For example, they could all be -0.5 and 0.5, and it would result in the same box plot.

## Violin plot

Where box plots do not typically reveal the exact shape of a distribution, violin plots are designed to do exactly that. They apply a *kernel density estimate* to characterise the shape of a distribution, and plot that instead of boxes and whiskers. Fliers are still denoted with a different marker (although what is considered a "flier" can differ between box plots and violin plots, or more accurately, per what standards are set within the function to draw the plots).
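
To give an idea of what a kernel density estimate does: it places a small bell curve (the kernel) on every observation, and averages all of those curves. Below is a minimal NumPy sketch of that idea, not Matplotlib's actual implementation. The Gaussian kernel and the bandwidth rule of thumb used here (Scott's rule) are assumptions chosen for illustration, and `gaussian_kde_sketch` is a hypothetical helper name.

```python
import numpy

def gaussian_kde_sketch(samples, grid):
    # Bandwidth (kernel width) from Scott's rule of thumb for 1D data.
    n = len(samples)
    bandwidth = numpy.std(samples, ddof=1) * n ** (-1.0 / 5.0)
    # Distance between every grid point and every sample, in units
    # of the bandwidth. The result has shape (len(grid), n).
    d = (grid[:, numpy.newaxis] - samples[numpy.newaxis, :]) / bandwidth
    # Evaluate a Gaussian bell curve for every distance.
    kernels = numpy.exp(-0.5 * d**2) / numpy.sqrt(2 * numpy.pi)
    # Average over samples, and rescale so the area under the curve is 1.
    return numpy.mean(kernels, axis=1) / bandwidth

# Stand-in data, generated here so that the example runs on its own.
rng = numpy.random.default_rng(3)
demo_samples = rng.normal(size=300)
demo_grid = numpy.linspace(-6.0, 6.0, 601)
density = gaussian_kde_sketch(demo_samples, demo_grid)

# The estimated density should cover a total area of roughly 1.
area = numpy.sum(density) * (demo_grid[1] - demo_grid[0])
print(area)
```

You could draw the estimate with `pyplot.plot(demo_grid, density)`. A violin plot shows the same kind of curve, mirrored around a vertical axis.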

Let's see what a violin plot of our two groups would look like:

In [None]:
# Draw a violin plot for values from group x and group y. As 
# with the box plot, you can pass both variables at the same
# time by combining them into a list, i.e. as [x,y].
pyplot.violinplot([x,y], showmeans=True)

As you can see, the violin plot gives a much clearer picture of the actual distribution of your data.

## Choosing a visualisation type

As you have seen, different types of data visualisations exist, and each comes with its own benefits and downsides. Bar plots can be easily understood, but also give you very little information about what the data underlying an average looks like. In addition, whether or not it makes sense to draw a bar highly depends on what kind of data you're visualising. Adding error bars to bar plots to indicate the standard error of the mean tells you something about the sampling process, whereas adding error bars to indicate the standard deviation tells you something about the sample.

If you're interested in visualising distributions in a more detailed way, you could turn to box or violin plots. These provide a clearer picture of what your data look like, and are still quite easy to interpret.

What the best type of plot is depends on the data, and on what message you would like your graph to illustrate. If you're simply saying "these groups have different means", a bar plot with error bars that indicate the standard error of the mean could work very well. However, if you're trying to illustrate that two groups are from distributions with different properties, you might need to turn to box or violin plots.

Finally, you are free to combine plots and types of visualisations. For example, you could simply throw everything into one combined plot:

In [None]:
# Determine the positions of the two groups' visualisations
# on the x-axis.
pos = [0.5, 1.5]

# Draw violin plots for each group, but don't draw the mean,
# median, or extrema.
vplot = pyplot.violinplot([x,y], positions=pos,
    showmeans=False, showmedians=False, showextrema=False)
# Set the colour of the violin plot.
for violin in vplot["bodies"]:
    violin.set_color("#FF69B4")
# Draw box plots for each group on the same positions.
bplot = pyplot.boxplot([x,y], positions=pos,
    labels=["Group X", "Group Y"])
# Set the colour of horizontal lines that indicate the median
# in each box plot.
for line in bplot["medians"]:
    line.set_color("#FF69B4")

# Finally, draw the individual data points for each group. The
# points need to be plotted to the right of each of the 
# box/violin plots. To do so, we first need to create two 
# vectors that code for the position of each sample from each 
# group on the x-axis.
x_pos = numpy.ones(x.shape) * pos[0] + 0.3
y_pos = numpy.ones(y.shape) * pos[1] + 0.3
# Then, we simply need to plot the samples. The alpha keyword
# sets the opacity of each sample (lower is more transparent).
pyplot.plot(x_pos, x, '.', color="#FF69B4", alpha=0.1)
pyplot.plot(y_pos, y, '.', color="#FF69B4", alpha=0.1)

# Assignments

The following assignments combine the previous worksheet on NumPy and data handling with the current one on plotting.

#### Assignment 1: Visual noise

Your assignment is to create some random noise, and to plot it using Matplotlib's `imshow` function. Make the noise 300 "pixels" wide and 200 high.

In [None]:
# Create a new figure.
fig = pyplot.figure()
# Add new axes without a border.
ax = fig.add_axes([0,0,1,1])

# Create random 2-dimensional noise.
noise = numpy.random.rand(200,300)

# Plot the noise using the imshow function.
ax.imshow(noise, cmap="gray")

# Save the figure.
fig.savefig("grey_noise.png")

#### Assignment 2: Colourful noise

Create 3-dimensional random noise, and use Matplotlib's `imshow` function to show it. TIP: Don't set the colour map this time.

In [None]:
# Create a new figure.
fig = pyplot.figure()
# Add new axes without a border.
ax = fig.add_axes([0,0,1,1])

# Create random 3-dimensional noise.
noise = numpy.random.rand(200,300, 3)

# Plot the noise using the imshow function.
ax.imshow(noise)

# Save the figure.
fig.savefig("colour_noise.png")

#### Assignment 3: Plot the data file

The data file you loaded at the start of this worksheet has three features in it. Plot all three features so that they can be compared. Choose the type of plot that you think is best, but make sure to give each feature a different colour.

In [None]:
# Load the data.
# NOTE: Change this if you're on Google Colab.
my_data = numpy.loadtxt("example.csv", delimiter=",",
    skiprows=1)

# Split the data into its columns: ppnr, x, y, and z.
ppnr = my_data[:,0]
x = my_data[:,1]
y = my_data[:,2]
z = my_data[:,3]

# Determine the positions of the three groups' visualisations
# on the x-axis.
pos = [0.5, 1.5, 2.5]
# Determine the colours for the three groups.
cols = ["red", "green", "blue"]

# Draw violin plots for each group, but don't draw the mean,
# median, or extrema.
vplot = pyplot.violinplot([x,y,z], positions=pos,
    showmeans=False, showmedians=False, showextrema=False)
# Set the colour of the violin plot.
for i, violin in enumerate(vplot["bodies"]):
    violin.set_color(cols[i])
# Draw box plots for each group on the same positions.
bplot = pyplot.boxplot([x,y,z], positions=pos,
    labels=["Group X", "Group Y", "Group Z"])
# Set the colour of horizontal lines that indicate the median
# in each box plot.
for i, line in enumerate(bplot["medians"]):
    line.set_color(cols[i])