# **PYTHON FOR BIOLOGISTS 3: PLOTTING WITH MATPLOTLIB AND SEABORN**

Once again, make sure to run the cell below to improve the formatting of the notebook.

In [1]:
from IPython.display import display, HTML

display(HTML(data="""
<style>
    div.cell h1,div.cell h2,div.cell h3,div.cell h4,div.cell h5,div.cell p,div.cell li{font-family: "Times New Roman", sans-serif; color: #444}
    .rendered_html code{font-size: 14px;}
    div.cell p, div.cell li{line-height: 1.6; font-size: 18px; color: #444}
    div.code_cell{padding: 24px 16px;}
    div.prompt{display: none;}
    div.cell{width: 65%; max-width: 720px; margin: auto}
    .rendered_html h1{font-size: 32px;}
    .rendered_html h2{font-size: 28px;}
    .rendered_html h3{font-size: 24px;}
</style>
"""))

# MATPLOTLIB

## Introduction
`matplotlib` is Python's most used library for plotting data. It is both easy to use and very versatile, offering options for graphs going from scatter plots to pie charts. It is imported as follows:

In [None]:
import matplotlib.pyplot as plt

The simplest thing to plot is just a list of numbers:

In [None]:
plt.plot([1, 1, 2, 3, 5, 8, 13, 21, 34, 55])
plt.show()

Here, the `.plot()` method assumes that the given list holds the **y values** and uses the index of each element as its **x value**. It then draws straight lines between these discrete points. Finally, the `.show()` method displays the created plot.
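When the x values are not simply 0, 1, 2, ..., both lists can be passed explicitly, x first and then y. A minimal sketch with made-up measurements (the names `hours` and `cell_counts` are illustrative only and are not part of the course dataset):

```python
import matplotlib.pyplot as plt

# Made-up example data: time points (x) and cell counts (y)
hours = [0, 2, 4, 8, 16]
cell_counts = [100, 140, 210, 400, 1500]

# First argument: x values, second argument: y values
lines = plt.plot(hours, cell_counts)
plt.show()
```

The two lists must have the same length, since each x value is paired with the y value at the same position.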

Roughly, the general process of plotting with `matplotlib` is done in 3 steps.
- Plot the data.
- Customize the appearance of the plot.
- Show/save the plot.

For each of these steps, we use different `plt` methods that we call on the `plt` object. When doing so, we modify the current figure, which is shown and then reset whenever we call `.show()`.
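The three steps can be sketched as follows, this time saving the figure to a file as well as showing it. The file name and the use of the system's temporary folder are assumptions made for the example; any path works. Because `.show()` resets the current figure, `.savefig()` is called first.

```python
import os
import tempfile

import matplotlib.pyplot as plt

# 1. Plot the data
plt.plot([1, 4, 9, 16, 25])

# 2. Customize the appearance
plt.title("Squares")

# 3. Save, then show (saving after .show() can produce an empty image)
output_path = os.path.join(tempfile.gettempdir(), "squares_example.png")
plt.savefig(output_path)
plt.show()
```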

<div style="background-color:#e8f1ff; padding: 20px;box-shadow: 0 4px 8px 0 rgba(0,0,0,0.2);
  transition: 0.3s;"> 
<h2 style="margin:0; text-align: center">EXERCISE: BASIC PLOT</h2>

In the code cell below:
1. Plot the first 15 powers of 2 (2, 4, 8...). Generate the list of powers of 2 programmatically using a list comprehension (see below).
</div>

> ### ASIDE: List Comprehensions
> List comprehensions are a Python technique for quickly generating lists based on other collections. The general syntax for them is:
>```python
> [<function to apply to element> for <element variable name> in <collection object>]
>```
> For example, if we wanted to create a list of the squares of 0 through 9:

In [None]:
# number * number is the function we want to apply
# number is the variable name given to an element
# range(10) is the collection

squares = [number * number for number in range(10)]
print(squares)
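A comprehension can also filter elements by adding an `if` clause at the end. A short sketch, using a made-up list of DNA sequences:

```python
# Made-up example sequences
sequences = ["ATGC", "GG", "ATATAT", "C"]

# Length of every sequence
lengths = [len(sequence) for sequence in sequences]

# Keep only the sequences with at least 4 bases
long_sequences = [sequence for sequence in sequences if len(sequence) >= 4]

print(lengths)         # [4, 2, 6, 1]
print(long_sequences)  # ['ATGC', 'ATATAT']
```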

In [None]:
# EXERCISE: BASIC PLOT

## Basics

### Titles and Annotations

- The `.title(<title>)` method can be used to give the plot a title.
- The `.xlabel(<x axis name>)` method can be used to label the x-axis.
- The `.ylabel(<y axis name>)` method can be used to label the y-axis.



In [None]:
plt.plot([1, 1, 2, 3, 5, 8, 13, 21, 34, 55])
plt.title("Fibonacci Numbers")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()

### Text
Text can be added to any plot using the `.text()` method. The `.text()` method takes as arguments the `x`, `y` position where the text should be displayed as well as a string representing the text. We can also specify the fontsize and the color using the `fontsize` and `color` arguments respectively.

In [None]:
plt.plot(range(10))
plt.text(0, 8, "Wow, this line is very straight!", fontsize = 14, color = "navy")
plt.show()

### Axis and Figure Size
We can modify the limits of both axes using the `.xlim(<start>, <end>)` and `.ylim(<start>, <end>)` methods.

The overall figure size can be modified with `.figure(figsize = (<width>, <height>))`. The method must be used **before we plot the data**.

In [None]:
plt.figure(figsize = (15,15))
plt.plot(range(10))
plt.xlim(0, 5)
plt.ylim(0, 4)
plt.show()

### Superimposition

Multiple sets of data can be superimposed on the same plot simply by calling `.plot()` multiple times, once for each set. For example, we could graph the Fibonacci numbers and the powers of 2 on the same graph.

In [None]:
fibonacci = [1, 1]
twos = []

# Populate lists of values
for i in range(15):
    fibonacci.append(fibonacci[i] + fibonacci[i+1])
    twos.append(2 ** i)

plt.plot(fibonacci)
plt.plot(twos)
plt.show()

To facilitate understanding which line is which, we can add a **legend** to our plot.

### Legend
To do so, we use the `label` argument when plotting data (for example, when calling `.plot()` on the Fibonacci sequence, we would specify `label="Fibonacci"`).

To display the legend, we use the `.legend()` method, adding the `loc` argument if we want to specify the location of the legend on the plot.

`loc` can take on 9 different values (think of a 3x3 grid). For example, to put the legend in the top right, use `loc="upper right"`. The other locations are given by the table below:

<table>
    <tr>
        <td>upper left</td>
        <td>upper center</td>
        <td>upper right</td>
    </tr>
    <tr>
        <td>center left</td>
        <td>center</td>
        <td>center right</td>
    </tr>
    <tr>
        <td>lower left</td>
        <td>lower center</td>
        <td>lower right</td>
    </tr>
</table>

In [None]:
plt.plot(fibonacci, label = "Fibonacci")
plt.plot(twos, label = "Powers of 2")
plt.legend(loc = "upper left")
plt.show()

### Log Axis
We can change the scale of the x or y axis using the `.xscale()` and `.yscale()` methods. `matplotlib` defaults to using `linear` for both but we can also use `log` for a logarithmic scale.

In [None]:
plt.plot(twos)
plt.yscale('log')

plt.show()

<div style="background-color:#e8f1ff; padding: 20px;box-shadow: 0 4px 8px 0 rgba(0,0,0,0.2);
  transition: 0.3s;"> 
<h2 style="margin:0; text-align: center">EXERCISE: BASICS</h2>

In the code cell below, create a plot that:
1. Is of size 7x7.
2. Has as axis labels `x values` and `y values`.
3. Has axes that start at 0 and end at 100.
4. Plots 3 different, labelled mathematical functions of your choice (quadratic, linear, etc.). Just include the first 15 values (i.e. when x=1, x=2, ...).
5. Has a legend in the upper right.
6. Has a title.
</div>

In [None]:
# EXERCISE: BASICS

## Plot Types

With the basics out of the way, we can now test out the different plot options that `matplotlib` provides. We'll go over the most popular/useful types but others can be found by looking at [matplotlib's documentation](https://matplotlib.org/gallery.html).

To start, we'll need to import some data that we can then plot. We'll be using a dataset that's similar to the one we used in the `pandas` course.

In [None]:
import pandas as pd

variants = pd.read_csv("plotting.csv")

# Preliminary look at dataset
print(variants.head())
print(variants.shape)

# Create a list of populations that will be useful later on
populations = ["eas", "sas", "nfe", "asj", "amr", "fin", "afr"]

We can then use the different plots provided by `matplotlib` to illustrate different characteristics of the dataset and ultimately show interesting patterns.

### Bar
Bar charts are good when you have multiple categories and corresponding numerical values. For example, we can visualize the allele frequency of various variants using a bar graph.

In [None]:
# Sort by AF
variants.sort_values(by = ["AF"], inplace = True, ascending = False)

# Since there are over 90 variants, we'll just plot the 5 most frequent
plot_data = variants.head()

With the data imported and sorted, we now need to pass the data to `matplotlib` to create the graph. For a bar graph, we need to pass 2 lists to the `bar()` method: the list of categories (the variant IDs) and the list of associated numerical values (allele frequencies). For example:

In [None]:
heights = [150, 160, 170, 180]
names = ["John", "Bob", "Ron", "Lee"]

# The first argument is the categories and the second the associated numerical values
plt.bar(names, heights)
plt.show()

However, our data is in a `DataFrame` right now. While we could convert the associated columns to lists, `matplotlib` offers the `data` argument which allows us to pass labeled data and then just use the labels as arguments for the list of categories/numerical values. For example, in our case, we can set `data=plot_data` and then pass `"VAR_ID"` and `"AF"` as the data labels. 

In [None]:
# Pass the label of the categories first, then the label of the numerical values
plt.bar("VAR_ID", "AF", data = plot_data)

plt.show()

While it's working, there's some bothersome overlap between the labels on the x-axis. To remedy this, we can rotate the labels on the x-axis using the `.xticks()` method and the `rotation` argument (passing an integer corresponding to the angle of rotation).

In [None]:
plt.bar("VAR_ID", "AF", data = plot_data)
plt.xticks(rotation = 90)
plt.show()

It is also possible to include superimposed bars for each category. To do so, we simply call the `.bar()` method multiple times with different datasets.
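A minimal sketch of superimposed bars, using made-up heights rather than the `variants` data. Each call to `.bar()` draws on top of the previous one, so the taller set is drawn first to keep the shorter bars visible.

```python
import matplotlib.pyplot as plt

# Made-up example data: heights (cm) of four people at two ages
names = ["John", "Bob", "Ron", "Lee"]
heights_adult = [150, 160, 170, 180]
heights_child = [110, 120, 115, 130]

# Draw the taller bars first so the shorter ones are not hidden behind them
adult_bars = plt.bar(names, heights_adult, label="Adult")
child_bars = plt.bar(names, heights_child, label="Child")
plt.legend(loc="upper left")
plt.show()
```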

<div style="background-color:#e8f1ff; padding: 20px;box-shadow: 0 4px 8px 0 rgba(0,0,0,0.2);
  transition: 0.3s;"> 
<h2 style="margin:0; text-align: center">EXERCISE: BAR PLOT</h2>

In the code cell below, create a bar plot where each bar is a different population and the height corresponds to the allele frequency of the first variant of `plot_data`. To do so:
1. Start by accessing the first variant (remember `.iloc[]`).
2. Create a list called `frequencies`.
3. Loop over the `populations` list and for each element add the corresponding allele frequency of that variant to the list.
4. Call `.bar()` using `populations` as the `x` argument and `frequencies` as the `height` argument.

</div>

In [None]:
# EXERCISE: BAR PLOT

### Histogram

Histograms require a set of bins (i.e. intervals). A histogram then shows how many values fall in each bin. Bins can be passed to the `.hist()` method either as a `list` or as a `range`. For example, if we want our bins to start at 1 and have a width of 1:

In [None]:
plt.hist("AC", range(1,100) , data = variants)
plt.show()

We could also specify a width of 5 (the third argument of `range()` specifies the step width):

In [None]:
plt.hist("AC", range(1, 100, 5) , data = variants)
plt.show()

However, the bars of these plots lack borders, which makes it hard to differentiate between bars. This issue can be fixed by specifying `edgecolor="black"`.

Also, the x-axis starts at 0 even though it would make more sense for it to start at 1 (if we detect a variant, then the allele count must be at least 1). 

In [None]:
plt.hist("AC", range(1, 50) , data = variants, edgecolor = "black")
plt.xlim(1, 50)
plt.show()

As the histogram demonstrates, the vast majority of variants are very rare, showing up in only a couple of genomes. However, the histogram misses the few variants that are much more frequent. In fact, if we check the 25 variants with the highest allele counts we get:

In [None]:
highest_ac = variants.sort_values(by = ["AC"], ascending = False)
print(highest_ac.head(25)["AC"])

This clearly shows the presence of a few outliers that fall outside the range of the histogram's bins (something to be aware of).
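One way to check how many values a histogram leaves out is to compare the counts it returns with the size of the data: `.hist()` returns the per-bin counts, the bin edges, and the drawn bars. A sketch on a small made-up list (not the `variants` data):

```python
import matplotlib.pyplot as plt

# Made-up allele counts, including two large outliers
allele_counts = [1, 1, 2, 2, 2, 3, 5, 8, 250, 900]

# Bin edges 1, 2, ..., 10: values above 10 are not counted in any bin
counts, bin_edges, bars = plt.hist(allele_counts, range(1, 11), edgecolor="black")
plt.show()

# Values that did not land in any bin
n_outside = len(allele_counts) - int(counts.sum())
print(n_outside, "values fall outside the bins")
```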

<div style="background-color:#e8f1ff; padding: 20px;box-shadow: 0 4px 8px 0 rgba(0,0,0,0.2);
  transition: 0.3s;"> 
<h2 style="margin:0; text-align: center">EXERCISE: HISTOGRAM</h2>

In the code cell below:
1. Create a similar histogram using the allele count of the `nfe` population.
2. Label the axes and add a title.

</div>

In [None]:
# EXERCISE: HISTOGRAM

### Pie
Pie charts are useful for showing proportions, e.g. what proportion of people with some variant are `nfe`, `eas`, etc. To do so, we use the `.pie()` method, passing it the sizes of the wedges and `populations` as `labels`.

We'll create a pie chart for the variant with the highest allele frequency. Since our data is in a `DataFrame`, we'll need to create a list of allele counts for the different populations by iterating over the columns.

In [None]:
# Get top variant
top_variant = plot_data.iloc[0]

wedges = []
for population in populations:
    wedges.append(top_variant["AC_" + population]) # Gets AC_eas, then AC_sas ...

plt.pie(wedges, labels = populations)
plt.show()

We can also add the percentage occupied by each slice with the `autopct` argument as well as change the size of the pie chart with the `radius` argument.

In [None]:
# '%1.1f%%' specifies the format of the percentage
plt.pie(wedges, labels = populations, autopct = '%1.1f%%', radius = 2) 
plt.show()

<div style="background-color:#e8f1ff; padding: 20px;box-shadow: 0 4px 8px 0 rgba(0,0,0,0.2);
  transition: 0.3s;"> 
<h2 style="margin:0; text-align: center">EXERCISE: PIE CHART</h2>

In the code cell below, create a pie chart similar to the one above but that includes all variants (i.e. that uses the sum of the allele counts as wedge sizes):
1. Create a `wedges` list. 
2. Loop over the populations, for each adding the sum of the corresponding `AC` column as a wedge size (use `<dataframe variable>["<column name>"].sum()` to get the sum of all the values in that column).
3. Use the `wedges` list and the `populations` list to create the pie chart.

</div>

In [None]:
# EXERCISE: PIE CHART

### Scatter
Scatter plots are good if we have a set of datapoints with information on 2 numerical variables. We can plot them in 2D space and even draw a regression line to see how correlated they are. Let's examine how related allele frequency is between 2 populations with the `.scatter()` method.

In [None]:
plt.scatter("AF_nfe", "AF_fin", data = variants)
plt.show()

The size of the points can be modified using the `s=<size>` argument, where `<size>` is a number.

In [None]:
plt.scatter("AF_nfe", "AF_sas", data = variants, s = 10)
plt.xlim(0, 0.023)
plt.ylim(0, 0.02)
plt.show()

Adding a line representing the result of a linear regression is more complicated. Here, we'll make use of `sklearn`, a Python library for machine learning, since it has easy-to-use functions for running linear regressions.

In [None]:
from sklearn.linear_model import LinearRegression

We need to create an instance of `LinearRegression` and take advantage of the `.fit()` method to get the intercept/slope and the `.predict()` method to calculate the values the line takes over the domain.

In [None]:
linear_model = LinearRegression() # Create instance

X = variants["AF_nfe"].values.reshape(-1, 1) # sklearn requires a 2D array
Y = variants["AF_fin"].values.reshape(-1, 1)

# Perform linear regression using nfe as X values and fin as Y values
linear_model.fit(X, Y) 

# Get values of line based on slope/intercept calculated by .fit()
y_values = linear_model.predict(X)

Now that we have the values of the line representing the linear regression, we just need to plot them with the scatter plot.

In [None]:
plt.scatter("AF_nfe", "AF_fin", data = variants, s=20) # Plots the points (same columns as the regression)
plt.plot(variants["AF_nfe"], y_values, color = "green", linewidth = 0.5) # Plots the line
plt.xlim(0, 0.023)
plt.ylim(0, 0.02)

plt.show()

<div style="background-color:#e8f1ff; padding: 20px;box-shadow: 0 4px 8px 0 rgba(0,0,0,0.2);
  transition: 0.3s;"> 
<h2 style="margin:0; text-align: center">EXERCISE: SCATTER PLOT</h2>

In the code cell below:
1. Create a similar scatter plot using the `nfe` and `amr` populations. 
2. Limit the x-axis and y-axis to only values above 0 and less than 0.03, then label the axes and add a title.
3. Using `sklearn`, add a red regression line.

</div>

In [None]:
# EXERCISE: SCATTER PLOT

We can get the slope and intercept of the line with the `.coef_` and `.intercept_` fields of the `LinearRegression` object.

In [None]:
print(linear_model.coef_)
print(linear_model.intercept_)
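To make concrete what a slope and an intercept describe, the sketch below fits a line to a few made-up points and rebuilds the line by hand as `slope * x + intercept`. It uses NumPy's `polyfit` as a stand-in, only because it needs nothing beyond NumPy; it is not the `sklearn` approach used above, although both fit an ordinary least-squares line.

```python
import numpy as np

# Made-up points that lie exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

# Degree-1 polynomial fit: returns the slope first, then the intercept
slope, intercept = np.polyfit(x, y, 1)

# Rebuild the fitted line by hand from the two numbers
line = slope * x + intercept
print(round(slope, 3), round(intercept, 3))
```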

We can also compute the Pearson correlation coefficient using `<df object>.corr()`.

In [None]:
# Returns a dataframe of all correlations between all numerical values (i.e. tries all possible pairs)
print(variants[["AF_nfe", "AF_fin"]].corr())

# Just gets the correlation between the two variables of interest
variants[["AF_nfe", "AF_fin"]].corr().iloc[0,1]

## Subplots

With `matplotlib`, it is possible to include multiple plots in the same figure, like the output of the following code block.

In [None]:
x = range(10)
y = range(10)

fig, ax = plt.subplots(nrows=2, ncols=2)

for row in ax:
    for col in row:
        col.plot(x, y)

plt.show()

To do so, we make use of the `.subplots()` method which creates multiple `Axes`, in an arrangement specified by the number of rows (`nrows`) and the number of columns (`ncols`).

The method returns both a `Figure` object and, for a grid like this one, a 2D NumPy array of `Axes` objects (which can be iterated over like a list of lists). We store them in the variables `fig` and `ax` respectively.

> ### ASIDE: Assigning values to multiple variables
> The following syntax allows us to capture the output of a function that returns 2 objects in 2 separate variables:
> ```python
> <var name 1>, <var name 2> = <function that returns 2 objects>
> ```
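As a small illustration of this syntax, here is a function that returns two objects. The name `min_and_max` is a hypothetical helper written for this example only.

```python
def min_and_max(values):
    # Returns two objects at once (Python packs them into a tuple)
    return min(values), max(values)

# Each returned object is captured in its own variable
smallest, largest = min_and_max([3, 1, 4, 1, 5, 9])
print(smallest, largest)  # 1 9
```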

A `Figure` object is essentially a collection of `Axes`, while an `Axes` object is an individual plot (and not the axis of a plot). Essentially all the plotting methods we use on `plt` (`.plot()`, `.bar()`, `.hist()`, etc.) can be used on an `Axes` object. However, some other methods change name: for example, `.title()` becomes `.set_title()`.

For example, we could create a pie chart for multiple variants as follows.

In [None]:
# Create the axes objects
fig, ax = plt.subplots(nrows=3, ncols=3, figsize=(15,15))

# Counter variable to keep track of which variant we're at
i = 0

# Iterate over the lists to access each one individually
for row in ax:
    for col in row:
        
        # Get wedges for that variant
        top_variant = variants.sort_values(by=["AF"], ascending=False).iloc[i]

        wedges = []
        for population in populations:
            wedges.append(top_variant["AC_" + population])
        
        # Plot
        col.pie(wedges)
        
        # Use the variant ID as a plot title
        col.set_title(top_variant["VAR_ID"])
        
        # Add legend to upper right
        col.legend(populations, loc="upper right")
        
        # Increase the value of i so next iteration, we get the next variant.
        i += 1

plt.show()
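The renaming of `plt` methods on `Axes` objects can be sketched on a simpler figure with made-up data. Note that with a single row, `ax` is a 1D array, so each plot is reached with a single index.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

# Left plot: a line, customized with the Axes versions of the plt methods
ax[0].plot(range(10))
ax[0].set_title("Linear")     # plt.title()  -> .set_title()
ax[0].set_xlabel("X")         # plt.xlabel() -> .set_xlabel()
ax[0].set_ylim(0, 10)         # plt.ylim()   -> .set_ylim()

# Right plot: plotting methods keep the same name
ax[1].bar(["a", "b", "c"], [3, 1, 2])
ax[1].set_title("Bars")

plt.show()
```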

<div style="background-color:#e8f1ff; padding: 20px;box-shadow: 0 4px 8px 0 rgba(0,0,0,0.2);
  transition: 0.3s;"> 
<h2 style="margin:0; text-align: center">EXERCISE: SUBPLOTS</h2>

In the code cell below:
1. Create a figure with 4 plots, each one a histogram of allele counts of a different population.

</div>

In [None]:
# EXERCISE: SUBPLOTS

# SEABORN

`seaborn` is another popular Python plotting library that builds upon the functionality of `matplotlib` to allow the creation of more complex and aesthetic plots. `seaborn` works in tandem with `matplotlib` and, as we'll see, a lot of `matplotlib`'s methods are still used when working with `seaborn`.

## Basics

As with all libraries, we first need to import `seaborn`.

In [None]:
import seaborn as sns

Then, to enable a `seaborn` style for our plots, we can use the `.set()` method.

In [None]:
plt.plot([1, 1, 2, 3, 5, 8, 13, 21, 34, 55])
plt.show()

becomes:

In [None]:
sns.set()
plt.plot([1, 1, 2, 3, 5, 8, 13, 21, 34, 55])
plt.show()

Seaborn has tons of customization options, for example:

Removing the grid:

In [None]:
sns.set_style('dark')
plt.plot([1, 1, 2, 3, 5, 8, 13, 21, 34, 55])
plt.show()

Using a white grid:

In [None]:
sns.set_style('whitegrid')
plt.plot([1, 1, 2, 3, 5, 8, 13, 21, 34, 55])
plt.show()

and much [more](https://seaborn.pydata.org/tutorial/aesthetics.html).

## Colors

`seaborn` allows the use of color palettes to ensure a more consistent and aesthetic theme throughout our plots. 

Even if we don't need plot types that are specific to `seaborn`, it's not a bad idea to use it just to improve the aesthetic of our plots (simply calling `sns.set()` will change the appearance of `matplotlib` plots as well). In fact, `seaborn`'s default color scheme is designed to be a more aesthetically pleasing version of `matplotlib`'s.

The current color palette is given by `.color_palette()`. Seaborn also has the `.palplot()` method to display color palettes.

In [None]:
# Display default seaborn color palette
sns.palplot(sns.color_palette())

The default color palette has a [few variations](https://seaborn.pydata.org/tutorial/color_palettes.html):

In [None]:
sns.palplot(sns.color_palette("pastel"))
sns.palplot(sns.color_palette("bright"))
sns.palplot(sns.color_palette("colorblind"))
sns.palplot(sns.color_palette("muted"))

`seaborn` also has sequential palettes that are good for displaying varying intensities and circular palettes that are good for displaying varying categories. 

In [None]:
# Circular color palette (second argument specifies the number of colors to include)
sns.palplot(sns.color_palette("hls", 8))

# Sequential color palette
sns.palplot(sns.color_palette("Blues", 8))

Finally, we can add our own color palette by passing a list of hex color codes (in string format) to `.color_palette()`.

> ### ASIDE: Hex Color Codes
> These color codes are hexadecimal strings preceded by a #. Each one represents a different color (for example, a standard red is #FF0000). 
>
> To pick colors and find the corresponding color code, use [this tool](https://www.google.com/search?q=color+picker). 

In [None]:
my_color_scheme = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71"]
sns.palplot(sns.color_palette(my_color_scheme))
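A palette behaves like a list of colors, so individual entries can be picked out and handed to `matplotlib` through the `color` argument. A minimal sketch (the choice of the `"Blues"` palette and of which entries to use is arbitrary):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Four colors from a sequential palette, stored as RGB tuples
blues = sns.color_palette("Blues", 4)
hex_codes = blues.as_hex()  # the same colors as hex strings
print(hex_codes)

# Use two of the palette's colors for individual lines
plt.plot([1, 2, 3, 4], color=blues[1])
plt.plot([4, 3, 2, 1], color=blues[3])
plt.show()
```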

To actually select a color palette, use the `.set_palette()` method with whatever arguments you would pass to `.color_palette()`. For example, to use the `pastel` color palette:

In [None]:
sns.set_palette("pastel")

<div style="background-color:#e8f1ff; padding: 20px;box-shadow: 0 4px 8px 0 rgba(0,0,0,0.2);
  transition: 0.3s;"> 
<h2 style="margin:0; text-align: center">EXERCISE: COLORS</h2>

In the code cell below:
1. Create and set your own color palette to be used for the rest of the course.

</div>

In [None]:
# EXERCISE: COLORS

## Plot Types

### Swarm Plot

Swarm plots are specific to `seaborn`. They allow for the display of data points that have 2 categorical dimensions and one numerical dimension. They give a good impression of how the data is spread out and at which points it accumulates.

From our dataset, an interesting swarm plot would be the allele frequency of different variants for different clinical significance (`CLIN_SIG`) and LOF values. To do so, we use the `.swarmplot()` method similarly to the way we used plotting methods for `matplotlib`. 

We need to specify the following arguments:
- `x`: Categorical variable column
- `y`: Numerical variable column
- `hue`: Second categorical variable column (optional)
- `data`: Dataframe

In [None]:
sns.swarmplot(x = "CLIN_SIG", y = "AF", hue = "LOF", data = variants)

# Rotate labels to avoid overlap
plt.xticks(rotation = 90)

# Ensure the y-axis starts at 0
plt.ylim(0, 0.015)

plt.show()

<div style="background-color:#e8f1ff; padding: 20px;box-shadow: 0 4px 8px 0 rgba(0,0,0,0.2);
  transition: 0.3s;"> 
<h2 style="margin:0; text-align: center">EXERCISE: SWARM PLOT</h2>

In the code cell below:
1. Create a swarm plot that shows the `AF_nfe` where the points are separated based on the `INESSS` column. Set the color to depend on the `LOF`.
2. Give the plot a title.

</div>

In [None]:
# EXERCISE: SWARM PLOT

### Box Plot

Box plots can be made for a single category (by only specifying an `x` argument) or multiple (by also specifying a `y` argument, in which case `x` corresponds to the categories and `y` to the numerical values).

In [None]:
# Just plotting a single category
sns.boxplot(x="AF", data=variants)

plt.show()

The boxplot above looks strange since, by default, `seaborn` will display outliers which have a much higher allele frequency than most variants. To remove them, we can use the `showfliers` argument and set it to `False`.

In [None]:
sns.boxplot(x="AF", data=variants, showfliers = False)
plt.show()

Using the `y` argument instead gives a vertical boxplot.

In [None]:
sns.boxplot(y="AF", data=variants, showfliers = False)
plt.show()

What if we wanted to display boxplots for each population, side-by-side? Unfortunately, our `variants` dataframe isn't laid out for the task. Instead, we need to create a `DataFrame` where each row has a `Population` column and an `AF` column, essentially splitting the multiple columns of each row of `variants` into multiple rows.

In [None]:
# We'll create our dataframe from a list of dictionaries
dict_list = []

# Iterate over rows of DataFrame
for key, row in variants.iterrows():
    
    # For each population, add a dictionary to the list with a "Population" and an "AF" entry
    for population in populations:
        dict_list.append({
            "Population": population,
            "AF": row["AF_" + population]
        })
        
# Create DataFrame from list of dictionaries
box_df = pd.DataFrame(dict_list)

sns.boxplot(x="Population", y="AF", data=box_df, showfliers=False)
plt.show()
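As an alternative to the explicit loop, `pandas` also offers `.melt()`, which performs this kind of wide-to-long reshaping in one call. The sketch below uses a small made-up table rather than `variants`; the column names mirror the ones above but the values are invented.

```python
import pandas as pd

# Made-up wide table: one row per variant, one AF column per population
wide_df = pd.DataFrame({
    "VAR_ID": ["v1", "v2"],
    "AF_nfe": [0.010, 0.002],
    "AF_fin": [0.020, 0.001],
})

# One row per (variant, population) pair
long_df = wide_df.melt(id_vars="VAR_ID", var_name="Population", value_name="AF")

# Strip the "AF_" prefix so the labels match the population names
long_df["Population"] = long_df["Population"].str.replace("AF_", "", regex=False)
print(long_df)
```

The resulting table has the same `Population` and `AF` columns as `box_df`, so it could be passed to `sns.boxplot()` in the same way.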

### Violin

A violin plot is a more elaborate box plot that also contains information about the probability density function (PDF) of the data. `seaborn` estimates the PDF using kernel density estimation. Other than that, creating a violin plot works similarly to creating a box plot.

In [None]:
sns.violinplot(x="Population", y="AF", data=box_df)
plt.show()

<div style="background-color:#e8f1ff; padding: 20px;box-shadow: 0 4px 8px 0 rgba(0,0,0,0.2);
  transition: 0.3s;"> 
<h2 style="margin:0; text-align: center">EXERCISE: BOX/VIOLIN PLOT</h2>

In the code cell below:
1. Create both a box plot and a violin plot that show the distribution of `AF` for different `CLIN_SIG` values (no need to create a new `DataFrame`, just use the appropriate labels).

</div>

### Heatmap
Heatmaps allow us to display intensity information for a 2D grid. To create one, we need to pass either a 2D NumPy array or a `DataFrame` to the `.heatmap()` function.

For example, we could create a heatmap of the correlations between allele frequencies of different populations. To do so, we need to create a `DataFrame` where the rows and columns are both population names and the value of a cell is the correlation between the row population and the column population.

To create the `DataFrame`, we'll first create a dictionary of dictionaries with a nested for loop.

In [None]:
correlations = {}

# Create rows
for row_population in populations:
    row = {}
    
    # Populate columns of row
    for column_population in populations:
        row[column_population] = variants[["AF_" + row_population, "AF_" + column_population]].corr().iloc[0,1]
    
    # Set entry in dictionary to be that row
    correlations[row_population] = row

# Transform to dataframe
df = pd.DataFrame(correlations)

# Ensure index order matches column order
df = df.reindex(populations)

print(df)
sns.heatmap(df)
plt.show()
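When `.corr()` is called on several columns at once, it already returns the full table of pairwise correlations, so the nested loop is mainly useful for seeing how the table is built. A sketch on made-up allele frequencies (three invented columns, not the real dataset):

```python
import pandas as pd

# Made-up allele frequencies for three populations
toy = pd.DataFrame({
    "AF_nfe": [0.010, 0.002, 0.030, 0.004],
    "AF_fin": [0.020, 0.001, 0.025, 0.006],
    "AF_afr": [0.001, 0.015, 0.002, 0.010],
})

# One call computes every pairwise Pearson correlation
corr_matrix = toy.corr()
print(corr_matrix.round(2))
```

The rows and columns of `corr_matrix` are both the column names, and a table of this shape can be passed to `sns.heatmap()` directly.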

<div style="background-color:#e8f1ff; padding: 20px;box-shadow: 0 4px 8px 0 rgba(0,0,0,0.2);
  transition: 0.3s;"> 
<h2 style="margin:0; text-align: center">EXERCISE: HEATMAP</h2>

In the code cell below, create a heatmap of allele frequency for different populations for the first 5 variants of `variants`:
1. Set the index of `variants` to be the `VAR_ID` column using the `.set_index()` method.
2. Create a new dataframe that consists of only the first 5 variants (use `.head()`).
3. Select only the columns with population frequency (e.g. `AF_nfe`, `AF_fin`, etc.) and pass the resulting dataframe as argument to `.heatmap()`.

</div>

<div style="background-color:#c4ffb3; padding: 20px;box-shadow: 0 4px 8px 0 rgba(0,0,0,0.2);
  transition: 0.3s;"> 
<h2 style="margin:0; text-align: center">FINAL EXERCISE</h2>

The goal of this final exercise is to plot the coverage of INESSS variants relative to all problematic variants for different populations. To do so, you'll create a bar plot with 2 superimposed bars, the first will represent the allele frequency of INESSS variants and the second the allele frequency for all variants for the given population.

### Preparing Data

1. Create a new dataframe called `inesss` that only includes variants with `True` in the `INESSS` column.
2. Create a new list called `inesss_freq`. Then, looping over `populations`, get the sum of the corresponding allele frequency column of `inesss`.
3. Repeat the process with a list called `non_inesss_freq`, using the whole `variants` dataframe instead.

### Customizing Plot
1. Set `seaborn` styles with a `"darkgrid"` background.
2. Give the plot a title and label the axes.

### Plotting Data
1. Plot the `non_inesss_freq` list as a bar graph using `populations` as categories, in blue. 
2. Plot the `inesss_freq` list as a bar graph using `populations` as categories, in pink.
3. Include appropriate labels for both.
4. Add a legend.

### Adding Text
1. Create a list called `proportions` where each entry is equal to the corresponding `inesss_freq` entry divided by the corresponding `non_inesss_freq` entry.
2. Create a function called `to_percentage` that takes a number and:
    - Rounds the number to 2 decimal places.
    - Multiplies it by 100 and converts it to an integer.
    - Converts it to a string and adds `%` at the end.
    - Returns the resulting string.
3. Create a new list called `percentages` which consists of `to_percentage` applied to each element of `proportions`. 
4. Add them to the plot using the `.text()` method.
    - For the x coordinate, put the index of the bar (i.e. for the first bar put 0, the second 1, etc.)
    - For the y coordinate, put the height of the corresponding bar in terms of frequency.
    - Add the argument `ha = "center"` to center the text.
5. Show the plot.

</div>

# Conclusion

You should now have a solid grasp of the fundamentals of plotting different types of data with Python. While the plot types we went over should cover the majority of your graphing needs, there are still more that can be found by looking at either [matplotlib's](https://matplotlib.org/3.1.1/index.html) or [seaborn's](https://seaborn.pydata.org/) documentation.