# Plotting (continued)

## Questions

- How can I plot my data?
- How can I save my plot for publishing?

## Objectives

- Create a time series plot showing a single data set.
- Create a scatter plot showing the relationship between two data sets.


## [matplotlib](https://matplotlib.org/) is the most widely used scientific plotting library in Python.

- Pandas uses matplotlib "under the hood" for plotting.
- Manipulating these plots, however, requires interacting with matplotlib directly.
- We commonly use a sub-library called `matplotlib.pyplot`.
- The Jupyter Notebook will render plots inline by default.


In [None]:
import matplotlib.pyplot as plt

- Simple plots are then (fairly) simple to create.


In [None]:
time = [0, 1, 2, 3]
position = [0, 100, 200, 300]

plt.plot(time, position)
plt.xlabel("Time (hr)")
plt.ylabel("Position (km)")

## Annotate plots created through pandas using matplotlib


In [None]:
import pandas as pd

data = pd.read_csv("../../../data/gapminder_gdp_oceania.csv", index_col="country")

# Extract the year from each column name.
# The column names have the form 'gdpPercap_(year)', so we keep only the
# (year) part for clarity when plotting GDP against year.
# str.replace() swaps the given substring for a replacement (here, an empty string).
# The column labels are a pandas Index, so we use its vectorized .str string methods.

years = data.columns.str.replace("gdpPercap_", "")

# Convert year values to integers, saving results back to dataframe

data.columns = years.astype(int)

data.T.plot()
plt.ylabel("GDP per capita")

## Many styles of plot are available.

- For example, do a bar plot using a fancier style.


In [None]:
plt.style.use("ggplot")
data.T.plot(kind="bar")
plt.ylabel("GDP per capita")
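
To see which other styles are installed, or to try one without changing the style for the rest of the session, something like the following sketch may help. It relies on `plt.style.available` and `plt.style.context`, which are not introduced above, and reuses the time and position values from the first example.

```python
import matplotlib.pyplot as plt

# List the style names that this matplotlib installation provides.
print(plt.style.available)

time = [0, 1, 2, 3]
position = [0, 100, 200, 300]

# Apply "ggplot" only inside this block; unlike plt.style.use(),
# the previous style is restored when the block ends.
with plt.style.context("ggplot"):
    plt.figure()
    plt.plot(time, position)
    plt.xlabel("Time (hr)")
    plt.ylabel("Position (km)")
```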

## Data can also be plotted by calling the matplotlib plot function directly.

- The command is `plt.plot(x, y)`.
- The color and style of lines and markers can also be specified with an optional third argument (a format string), e.g., `b-` is a solid blue line and `g--` is a green dashed line.


## Get Australia data from dataframe


In [None]:
years = data.columns
gdp_australia = data.loc["Australia"]

plt.plot(years, gdp_australia, "g--")
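
A format string can combine a color, a marker, and a line style. As a rough sketch (reusing the time and position values from the first example rather than the GDP data, with arbitrary offsets so the lines do not overlap), a few combinations look like this:

```python
import matplotlib.pyplot as plt

time = [0, 1, 2, 3]
position = [0, 100, 200, 300]

plt.figure()
# "g--": green dashed line, no markers.
(dashed,) = plt.plot(time, position, "g--")
# "ro": red circle markers, no connecting line.
(circles,) = plt.plot(time, [p + 50 for p in position], "ro")
# "k:s": black dotted line with square markers.
(dotted,) = plt.plot(time, [p + 100 for p in position], "k:s")
```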

## Can plot many sets of data together.

Often when plotting multiple datasets on the same figure it is desirable to have a legend describing the data.

This can be done in `matplotlib` in two stages:

1. Provide a label for each dataset in the figure
2. Instruct `matplotlib` to create the legend


In [None]:
# Select two countries' worth of data.
gdp_australia = data.loc["Australia"]
gdp_nz = data.loc["New Zealand"]

# Plot with differently colored lines.
plt.plot(years, gdp_australia, "b-", label="Australia")
plt.plot(years, gdp_nz, "g-", label="New Zealand")

# Create legend.
plt.legend(loc="upper left")
plt.xlabel("Year")
plt.ylabel("GDP per capita ($)")

- Create a scatter plot comparing the GDP of Australia and New Zealand.
- Use either `plt.scatter` or `DataFrame.plot.scatter`.


In [None]:
plt.scatter(gdp_australia, gdp_nz)

In [None]:
data.T.plot.scatter(x="Australia", y="New Zealand")

## Seaborn is pandas' best friend.

- Another important data visualization library is `seaborn`, which specializes in statistical graphs.
- `seaborn` is particularly useful when working with `pandas` DataFrames
  - Adds labels based on variable names in dataframe
- `seaborn` uses `matplotlib` under the hood, so all of the above still applies
- To get the most out of it, the data should be in long form (a.k.a. _tidy_)

<figure>
<img src="https://seaborn.pydata.org/_images/data_structure_19_0.png" style="width:60%">
<figcaption align="left">Long-form versus wide-form data. Each color denotes a variable.</figcaption>
</figure>

- The gapminder data is in wide format, but we can make it long format using `melt`


In [None]:
data_long = data.melt(
    ignore_index=False, var_name="year", value_name="gdpPercap"
).reset_index()
data_long
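
To make the reshaping concrete, here is a small sketch on a made-up wide table (the country names and values are placeholders, not gapminder data). Each year column becomes a set of rows, so two countries and three years give six rows:

```python
import pandas as pd

# Hypothetical wide-form table: one row per country, one column per year.
wide = pd.DataFrame(
    {1952: [10.0, 20.0], 1957: [11.0, 21.0], 1962: [12.0, 22.0]},
    index=pd.Index(["Country A", "Country B"], name="country"),
)

# Same call as above: keep the country index while melting,
# then turn it into an ordinary column.
long = wide.melt(
    ignore_index=False, var_name="year", value_name="gdpPercap"
).reset_index()

print(long)
```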

- Now we can create a lineplot of the dataframe:


In [None]:
import seaborn as sns

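# Pass the dataframe by keyword so the call also works on seaborn
# versions that do not accept it as a positional argument.
sns.lineplot(
    data=data_long,
    x="year",
    y="gdpPercap",
)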

plt.xlabel("Year")
plt.ylabel("GDP per capita")

- When there are multiple values for each `x`, `seaborn` by default shows the mean and the confidence interval of those values.
- To draw a separate line for each level of `country`, we can use the `hue` and/or `style` parameter:


In [None]:
sns.lineplot(data=data_long, x="year", y="gdpPercap", hue="country")

plt.xlabel("Year")
plt.ylabel("GDP per capita")

## Saving your plot to a file

If you are satisfied with the plot you see, you may want to save it to a file, perhaps to include it in a publication. There is a function in the `matplotlib.pyplot` module that accomplishes this: `savefig`. Calling this function, e.g. with

```{python}
plt.savefig('my_figure.png')
```

will save the current figure to the file `my_figure.png`. The file format is deduced automatically from the file name extension (other formats include pdf, ps, eps and svg).

Note that functions in `plt` refer to a global figure variable, and after a figure has been displayed to the screen (e.g. with `plt.show`) `matplotlib` makes this variable refer to a new, empty figure. Therefore, make sure you call `plt.savefig` before the plot is displayed to the screen; otherwise you may end up with a file containing an empty plot.
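
As a sketch of that ordering, the example below saves the same figure in two formats before anything is shown. The temporary output directory is only an assumption made to keep the example self-contained; in practice you would choose your own path.

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt

time = [0, 1, 2, 3]
position = [0, 100, 200, 300]

plt.figure()
plt.plot(time, position)
plt.xlabel("Time (hr)")
plt.ylabel("Position (km)")

# Hypothetical output location: a fresh temporary directory.
out_dir = Path(tempfile.mkdtemp())
png_path = out_dir / "my_figure.png"
pdf_path = out_dir / "my_figure.pdf"

# Save while this figure is still the current one; the format
# follows from the file name extension.
plt.savefig(png_path)
plt.savefig(pdf_path)

# Only display the figure (e.g. with plt.show()) after saving.
```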

When using dataframes, data is often generated and plotted to screen in one line. In addition to using `plt.savefig`, we can save a reference to the current figure in a local variable (with `plt.gcf`) and call the `savefig` method on that variable to save the figure to file.

```{python}
data.plot(kind='bar')
fig = plt.gcf() # get current figure
fig.savefig('my_figure.png')
```
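
An alternative sketch, assuming you keep the `Axes` object that `DataFrame.plot` returns: you can ask that object for its figure instead of calling `plt.gcf`. The dataframe and the temporary output path below are placeholders made up for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data, used only to have something to plot.
example = pd.DataFrame(
    {"1952": [10.0, 20.0], "1957": [11.0, 21.0]},
    index=["Country A", "Country B"],
)

ax = example.plot(kind="bar")  # pandas returns a matplotlib Axes
fig = ax.get_figure()          # the figure that contains those axes

# Hypothetical output location: a file in a fresh temporary directory.
out_path = Path(tempfile.mkdtemp()) / "my_figure.png"
fig.savefig(out_path)
```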


## Making your plots accessible

Whenever you are generating plots to go into a paper or a presentation, there are a few things you can do to make sure that everyone can understand your plots.

- Always make sure your text is large enough to read. Use the `fontsize` parameter in `xlabel`, `ylabel`, `title`, and `legend`, and [`tick_params` with `labelsize`](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.tick_params.html) to increase the text size of the numbers on your axes.
- Similarly, you should make your graph elements easy to see. Use `s` to increase the size of your scatterplot markers and `linewidth` to increase the sizes of your plot lines.
- Using color (and nothing else) to distinguish between different plot elements will make your plots unreadable to anyone who is colorblind, or who happens to have a black-and-white office printer. For lines, the `linestyle` parameter lets you use different types of lines. For scatterplots, `marker` lets you change the shape of your points. If you’re unsure about your colors, you can use [coblis](https://www.color-blindness.com/coblis-color-blindness-simulator/) or [color oracle](https://colororacle.org/) to simulate what your plots would look like to those with colorblindness.
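
The points above might be combined as in the following sketch. The second series, the labels, and the specific font sizes and line widths are arbitrary choices for illustration, not recommendations from the lesson.

```python
import matplotlib.pyplot as plt

time = [0, 1, 2, 3]
position_a = [0, 100, 200, 300]  # values from the first example
position_b = [0, 60, 150, 280]   # made-up second series for comparison

plt.figure()
# Distinguish the series by line style and marker shape, not only by color.
(line_a,) = plt.plot(
    time, position_a, color="b", linestyle="-", marker="o",
    linewidth=3, label="Series A",
)
(line_b,) = plt.plot(
    time, position_b, color="g", linestyle="--", marker="s",
    linewidth=3, label="Series B",
)

# Larger text for the axis labels, title, tick labels, and legend.
plt.xlabel("Time (hr)", fontsize=16)
plt.ylabel("Position (km)", fontsize=16)
plt.title("Position over time", fontsize=18)
plt.tick_params(labelsize=14)
plt.legend(fontsize=14)
```

Even when printed in black and white, the two series remain distinguishable by their dash pattern and marker shape.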


## Key points

- `matplotlib` is the most widely used, general purpose scientific plotting library in Python.
- Plot data directly from a Pandas dataframe.
- Select and transform data, then plot it.
- `seaborn` works closely with Pandas dataframes but expects the data in long (tidy) form.
- Many styles of plot are available: see the Python Graph Gallery for more options.
- Many sets of data can be plotted together.


Licensed under [CC-BY 4.0](http://swcarpentry.github.io/python-novice-gapminder/18-style/index.html) 2018–2023 by [The Carpentries](https://carpentries.org/)

Licensed under [CC-BY 4.0](http://swcarpentry.github.io/python-novice-gapminder/18-style/index.html) 2016–2018 by [Software Carpentry Foundation](https://software-carpentry.org/)
