# Overview of common plot types

`matplotlib` offers [a variety of plots](https://matplotlib.org/stable/plot_types/index.html) to choose from. Which plot is the right one in a given scenario depends on the nature and structure of your data.

The plot types are organized into three major groups:

1. Pairwise data: _There is a y for every x._
2. Statistical distributions: You want to visualize the _structure_ of your data.
3. Gridded data: You want to visualize your data on a (regular or irregular) grid.

In this session, we will take a closer look at the first two groups.

You can find more information on all the available plot types [in the documentation](https://matplotlib.org/stable/plot_types/index.html). There are also [many examples available](https://matplotlib.org/stable/gallery/index.html) for you to learn from, download, and modify.

The plotting functions share a lot of options that let you change the size and color of lines and markers and similar commonly used customizations. [The previous notebook](./01-basics.ipynb) already covered many of these. In this notebook, we will instead focus on some of the properties that are specific to the individual plot functions.


In [None]:
import matplotlib.pyplot as plt

## Pairwise data

Pairwise data means you want to visualize two variables with the same number of values against each other, i.e., one variable goes on the x axis and the other on the y axis.

In [the previous notebook](./01-basics.ipynb), all examples were based on pairwise data.

Some commonly used visualizations for pairwise data are the line, scatter, and bar plot. Let's repeat what we have already seen in the previous notebook:


### Line plot

The most basic version of a line plot draws a line through the specified pairs of x and y values:


In [None]:
x = [1, 2, 6, 7]
y = [2, 0, 3, 4]

fig, ax = plt.subplots()
ax.plot(x, y)

If we want to plot multiple lines, we can use the plot command multiple times:


In [None]:
x1 = [1, 2, 6, 7]
y1 = [2, 0, 3, 4]

x2 = [2, 3, 4, 5]
y2 = [1, 2, 3, 4]

fig, ax = plt.subplots()
ax.plot(x1, y1)
ax.plot(x2, y2)

Alternatively, we can organize our data structure in a way that makes it clear to `matplotlib` how to pair up the values.

`matplotlib` reads such a structure row by row, just like a `for` loop would: the first row holds the first value of each line, the second row holds the second value of each line, and so on.

To keep multiple variables in a single data structure, we could therefore use a list of lists, where each inner list has one value per line to be plotted:


In [None]:
x = [
    [1, 2],
    [2, 3],
    [6, 4],
    [7, 5],
]

y = [
    [2, 1],
    [0, 2],
    [3, 3],
    [4, 4],
]

fig, ax = plt.subplots()
ax.plot(x, y)
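
To double-check which values end up in which line, we can transpose the nested lists ourselves. The following is only a small sketch in plain Python; the names `x_per_line` and `y_per_line` are made up for this illustration and are not part of `matplotlib`:

```python
x = [
    [1, 2],
    [2, 3],
    [6, 4],
    [7, 5],
]

y = [
    [2, 1],
    [0, 2],
    [3, 3],
    [4, 4],
]

# zip(*x) regroups the rows into columns: one column per line to be plotted
x_per_line = [list(column) for column in zip(*x)]
y_per_line = [list(column) for column in zip(*y)]

for number, (xs, ys) in enumerate(zip(x_per_line, y_per_line), start=1):
    print(f"Line {number}: x = {xs}, y = {ys}")
# Line 1: x = [1, 2, 6, 7], y = [2, 0, 3, 4]
# Line 2: x = [2, 3, 4, 5], y = [1, 2, 3, 4]
```

The two columns are exactly the `x1`, `y1` and `x2`, `y2` lists from the previous example.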

<div class="alert alert-block alert-info">

This may seem very cumbersome when using Python's built-in `list` type. However, `matplotlib` is usually used in conjunction with `numpy`, whose `array` type provides much more elegant ways to organize your data this way.

If you are interested in a `numpy` workshop, <a href="mailto:researchdatahelp@groups.dartmouth.edu">let us know!</a>

</div>
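
As a brief preview of what this could look like, the sketch below builds the same two-line plot from separate lists. It assumes that `numpy` is installed and uses its `column_stack()` function to place each list in its own column:

```python
import matplotlib.pyplot as plt
import numpy as np

x1 = [1, 2, 6, 7]
y1 = [2, 0, 3, 4]

x2 = [2, 3, 4, 5]
y2 = [1, 2, 3, 4]

# Each input list becomes one column of a 4-by-2 array
x = np.column_stack([x1, x2])
y = np.column_stack([y1, y2])

fig, ax = plt.subplots()

# One line is drawn per column
lines = ax.plot(x, y)
```

This way, the data can stay in one list per line and `numpy` takes care of arranging it.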


### Scatter plot

A scatter plot can be helpful if your data is not sorted in ascending order along the x axis and/or if interpolating between the existing data points has no meaningful interpretation.

A simple scatter plot could look like this:


In [None]:
fig, ax = plt.subplots()

x = [1, 2, 6, 7]
y = [2, 0, 3, 4]

ax.scatter(x, y)

**Note:** Unlike `plot()`, the `scatter()` function does not automatically cycle through different colors when you pass multiple sets of points in a single call. All points are drawn in the same color:


In [None]:
fig, ax = plt.subplots()

x = [
    [1, 2],
    [2, 3],
    [6, 4],
    [7, 5],
]

y = [
    [2, 1],
    [0, 2],
    [3, 3],
    [4, 4],
]

ax.scatter(x, y)
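
If we want each set of points in its own color, one option is to call `scatter()` once per set, because each separate call picks the next color from the color cycle. This is a minimal sketch; the labels `"First set"` and `"Second set"` are placeholders chosen for this example:

```python
import matplotlib.pyplot as plt

x1 = [1, 2, 6, 7]
y1 = [2, 0, 3, 4]

x2 = [2, 3, 4, 5]
y2 = [1, 2, 3, 4]

fig, ax = plt.subplots()

# One call per set of points: each call uses the next color in the cycle
first = ax.scatter(x1, y1, label="First set")
second = ax.scatter(x2, y2, label="Second set")

# The labels given above show up in the legend
_ = ax.legend()
```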

### Bar plot

A bar plot is often used when we want to plot some kind of quantity across a categorical variable. For example, the average number of hours of sleep across days of the week:


In [None]:
day = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
sleeping_hours = [7, 6, 7, 7, 8, 9, 8]

fig, ax = plt.subplots()

ax.bar(day, sleeping_hours)

We can also stack multiple bar charts on top of each other:


In [None]:
day = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
sleeping_hours = [7, 6, 7, 7, 8, 9, 8]
working_hours = [8, 9, 10, 8, 8, 0, 0]
fig, ax = plt.subplots()

ax.bar(day, sleeping_hours)
ax.bar(day, working_hours, bottom=sleeping_hours)

If you would like to change the width of the bars or the alignment to the x coordinates, check out [the documentation for `bar()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html).
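
As a small sketch of these two options: `width` sets the width of each bar (the default is 0.8) and `align="edge"` lines up the left edge of each bar with its x coordinate instead of centering it. The value 0.5 is an arbitrary choice for this illustration:

```python
import matplotlib.pyplot as plt

day = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
sleeping_hours = [7, 6, 7, 7, 8, 9, 8]

fig, ax = plt.subplots()

# Narrower bars whose left edge sits at the tick position
bars = ax.bar(day, sleeping_hours, width=0.5, align="edge")
```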


## Statistical distributions

Whenever you want to visualize the distribution of a continuous variable for one or more categorical variables, you can use the plots for statistical distributions in `matplotlib`.

Before we demonstrate two of them, let us first generate some data!

Let's stick with the example of visualizing the number of hours slept, but now we increase the sample to an entire year and are interested in breaking this down by day of the week.


In [None]:
import random

# Generate sleeping hours for each day of a year
sleeping_hours = [
    [random.normalvariate(mu=8, sigma=1) for _ in range(53)],
    [random.normalvariate(mu=8, sigma=1) for _ in range(52)],
    [random.normalvariate(mu=8, sigma=1) for _ in range(52)],
    [random.normalvariate(mu=8, sigma=1) for _ in range(52)],
    [random.normalvariate(mu=8, sigma=2) for _ in range(52)],
    [random.normalvariate(mu=8, sigma=2) for _ in range(52)],
    [random.normalvariate(mu=8, sigma=1) for _ in range(52)],
]

# Generate a variable to hold the day of the week
weekday = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

### Histogram

A histogram is a great tool to show the distribution of a random variable. It looks similar to a bar chart, but involves additional calculations:

For a bar chart, our data structure contains the height of each bar directly.

For a histogram, our data need to be binned into ranges and then counted. We could do these calculations by hand and then use a bar chart to visualize the results, or we can conveniently use the function `hist()`:


In [None]:
fig, ax = plt.subplots()

# Histogram for the distribution of sleeping hours on Mondays
ax.hist(sleeping_hours[0])
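
To make the binning step less abstract, here is a rough sketch of the by-hand calculation followed by a bar chart. It generates its own data so that it can run on its own; the names `sample`, `n_bins`, and `counts` are made up for this illustration, and the 10 equal-width bins mirror the default of `hist()`. Values that fall exactly on a bin edge may be counted slightly differently than in `hist()` due to rounding.

```python
import random

import matplotlib.pyplot as plt

random.seed(42)  # fixed seed so the example is reproducible
sample = [random.normalvariate(mu=8, sigma=1) for _ in range(52)]

# Split the range of the data into equal-width bins
n_bins = 10
lowest, highest = min(sample), max(sample)
bin_width = (highest - lowest) / n_bins

# Count how many values fall into each bin
counts = [0] * n_bins
for value in sample:
    index = int((value - lowest) / bin_width)
    index = min(index, n_bins - 1)  # the maximum belongs to the last bin
    counts[index] += 1

# The left edge of each bin is the x coordinate of its bar
left_edges = [lowest + i * bin_width for i in range(n_bins)]

fig, ax = plt.subplots()
_ = ax.bar(left_edges, counts, width=bin_width, align="edge")
```

Calling `hist()` saves us all of these steps.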

Histograms work well to visualize the distribution of a single random variable. If you want to visualize the distributions of multiple variables, however, they can quickly get cluttered.

Visualizing the distribution of the sleeping hours on each day, for example, would look like this:


In [None]:
fig, ax = plt.subplots()

ax.hist(sleeping_hours)
ax.legend(weekday)

It is very difficult to make meaningful comparisons between the weekdays here.

To visualize the distributions in multiple groups, a box plot is usually the better choice.


### Box plot

Box plots are a compact representation of several key statistics of a random variable.

From [the `matplotlib` documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.boxplot.html#matplotlib.pyplot.boxplot):

> The box extends from the first quartile (Q1) to the third quartile (Q3) of the data, with a line at the median. The whiskers extend from the box to the farthest data point lying within 1.5x the inter-quartile range (IQR) from the box. Flier points are those past the end of the whiskers. See https://en.wikipedia.org/wiki/Box_plot for reference.
>
> ```
>     Q1-1.5IQR   Q1   median  Q3   Q3+1.5IQR
>                   |-----:-----|
>   o      |--------|     :     |--------|    o  o
>                   |-----:-----|
> flier             <----------->            fliers
>                        IQR
> ```

We can use the `boxplot()` function to neatly compare the distributions across the levels of the categorical variable:


In [None]:
fig, ax = plt.subplots()

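# The `labels` argument of `boxplot()` is deprecated since matplotlib 3.9
# (renamed to `tick_labels`), so we set the tick labels separately
ax.boxplot(sleeping_hours)
_ = ax.set_xticklabels(weekday)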
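
To connect the plot to the definition quoted above, the following sketch computes the same statistics by hand for a small, made-up list of values (`hours` is invented for this illustration). It uses `statistics.quantiles()` from the standard library with `method="inclusive"`, which interpolates linearly between data points. We assume here that this matches how `matplotlib` computes quartiles by default, so treat small differences as possible.

```python
import statistics

# Made-up sleeping hours, including one unusually long night
hours = [6.1, 6.8, 7.2, 7.5, 7.9, 8.0, 8.3, 8.6, 9.1, 12.5]

# The box: first quartile, median, third quartile
q1, median, q3 = statistics.quantiles(hours, n=4, method="inclusive")
iqr = q3 - q1

# Whiskers may reach at most 1.5 x IQR beyond the box ...
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# ... but they end at the farthest data point within those limits
whisker_low = min(value for value in hours if value >= lower_limit)
whisker_high = max(value for value in hours if value <= upper_limit)

# Everything beyond the whiskers is drawn as a flier
fliers = [value for value in hours if value < lower_limit or value > upper_limit]

print(f"Box from {q1:.3f} to {q3:.3f}, median at {median:.3f}")
print(f"Whiskers from {whisker_low} to {whisker_high}, fliers: {fliers}")
# Box from 7.275 to 8.525, median at 7.950
# Whiskers from 6.1 to 9.1, fliers: [12.5]
```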

There are a number of customization options in [the documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.boxplot.html#matplotlib.pyplot.boxplot), both for how the statistics are calculated and for how the plot is displayed.


## Practice area

The best way to learn how to use `matplotlib` is to start using it. So here are a few variables that you can use to try out different plots. Don't forget to label and style them!

The data here is taken from the [Palmer Penguins dataset](https://github.com/allisonhorst/palmerpenguins). Since the focus of this session is not on data structures, we are providing the data in separate variables instead of in a `numpy` array or the more commonly used data frame.


In [None]:
species = ["Adelie"] * 10 + ["Chinstrap"] * 10 + ["Gentoo"] * 10

bill_length_mm = [
    [39.0, 37.7, 33.5, 35.3, 37.6, 42.9, 41.4, 34.0, 33.1, 36.0],
    [50.6, 45.4, 50.5, 50.8, 46.9, 50.8, 51.3, 52.2, 49.2, 51.5],
    [47.5, 46.1, 50.0, 50.0, 43.3, 48.5, 49.4, 48.6, 47.5, 44.5],
]

bill_depth_mm = [
    [18.7, 16.0, 19.0, 18.9, 19.1, 17.6, 18.6, 17.1, 16.1, 17.8],
    [19.4, 18.7, 18.4, 19.0, 16.6, 18.5, 19.2, 18.8, 18.2, 18.7],
    [14.2, 15.1, 15.3, 15.9, 13.4, 15.0, 15.8, 16.0, 14.0, 14.7],
]

body_mass_g = [
    [3650.0, 3075.0, 3600.0, 3800.0, 3750.0, 4700.0, 3700.0, 3400.0, 2900.0, 3450.0],
    [3800.0, 3525.0, 3400.0, 4100.0, 2700.0, 4450.0, 3650.0, 3450.0, 4400.0, 3250.0],
    [4600.0, 5100.0, 5550.0, 5350.0, 4400.0, 4850.0, 4925.0, 5800.0, 4875.0, 4850.0],
]

In [None]:
# Your code here!

## Next steps

We have only scratched the surface of what you can do with `matplotlib`. There is much more to explore, and there are [a number of official tutorials](https://matplotlib.org/stable/tutorials/index.html) that can guide you on your journey.

In those tutorials, you will encounter `numpy` a lot. As mentioned above, we have so far avoided using more elaborate data structures than the built-in `list`. To really get the most out of `matplotlib`, however, you should use `numpy` to organize your data. If you want to learn more about it, you can check out [the official quickstart for `numpy`](https://numpy.org/devdocs/user/quickstart.html)!

Finally, if you work with tabular data in [`pandas`](https://pandas.pydata.org/), `matplotlib`'s interface can be a bit clunky to use. Fortunately, there is another library built on top of `matplotlib` called [`seaborn`](https://seaborn.pydata.org/), which we will explore in a separate workshop!


<table >
<tbody>
  <tr>
    <td style="padding:0px;border-width:0px;vertical-align:center">    
    Created by Simon Stone for Dartmouth College Library under <a href="https://creativecommons.org/licenses/by/4.0/">Creative Commons CC BY-NC 4.0 License</a>.<br>For questions, comments, or improvements, email <a href="mailto:researchdatahelp@groups.dartmouth.edu">Research Data Services</a>.
    </td>
    <td style="padding:0 0 0 1em;border-width:0px;vertical-align:center"><img alt="Creative Commons License" src="https://i.creativecommons.org/l/by/4.0/88x31.png"/></td>
  </tr>
</tbody>
</table>
