# Week 2 Solutions

We begin by importing the relevant packages and data sets.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

In [None]:
penguins = sns.load_dataset("penguins")

In [None]:
metabric = pd.read_csv("metabric_clinical_and_expression_data.csv")

In [None]:
%matplotlib inline

In [None]:
penguins.head()

In [None]:
metabric.head()

We also set a theme:

In [None]:
sns.set_theme(context="notebook", style="white")

## Exercises from the Text

### A

**<span style="color:blue">Exercise</span>**: From the penguins data set, create a plot of flipper length vs bill length where the points are also sized by the overall mass of the penguin, and see if you can control the range of sizes used. There is clearly a relationship between body mass and both flipper and bill length. Is this plot the most appropriate way to show this?

The code for the original plot is as follows:

In [None]:
plt.figure(figsize=(10,10))

sns.scatterplot(data=penguins, x="bill_length_mm", y="flipper_length_mm", s=50,
                hue="species", palette={"Adelie":"red", "Chinstrap":"blue", "Gentoo":"teal"})

plt.xlabel("Bill Length (mm)", fontsize=15, color="blue")
plt.ylabel("Flipper Length (mm)", fontsize=15)
plt.legend(title="Species")

plt.show()

To control the point size via a variable, we use the argument `size`. We can control the range of sizes used via the argument `sizes`, to which we provide a tuple with the minimum and maximum size. I experimented a bit and found that a minimum and maximum size of 10 and 100 respectively gave a good visual, but you may decide that different parameters look better.

In [None]:
plt.figure(figsize=(10,10))

scatter_plot = sns.scatterplot(data=penguins, x="bill_length_mm", y="flipper_length_mm", 
                size="body_mass_g", sizes=(10,100),
                hue="species", palette={"Adelie":"red", "Chinstrap":"blue", "Gentoo":"teal"})

plt.xlabel("Bill Length (mm)", fontsize=15)
plt.ylabel("Flipper Length (mm)", fontsize=15)

plt.show()

**<span style="color:#830051">Corporate bonus exercise</span>**: Update this plot to use the official AstraZeneca colour palette.

We can achieve this via the use of hex codes.

In [None]:
plt.figure(figsize=(10,10))

sns.scatterplot(data=penguins, x="bill_length_mm", y="flipper_length_mm", 
                size="body_mass_g", sizes=(10,100),
                hue="species", palette={"Adelie":"#D0006F", "Chinstrap":"#68D2DF", "Gentoo":"#3F4444"})

plt.xlabel("Bill Length (mm)", fontsize=15)
plt.ylabel("Flipper Length (mm)", fontsize=15)

plt.show()

**Note**: Controlling the legend titles is a little trickier now that there are two of them. The best way I could find of doing this was to rename the columns themselves within the `data` argument of the `scatterplot()` function:

In [None]:
plt.figure(figsize=(10,10))

scatter_plot = sns.scatterplot(data=penguins.rename(columns={"species":"Species", "body_mass_g":"Body Mass (g)"}), x="bill_length_mm", y="flipper_length_mm", 
                size="Body Mass (g)", sizes=(10,100),
                hue="Species", palette={"Adelie":"red", "Chinstrap":"blue", "Gentoo":"teal"})

plt.xlabel("Bill Length (mm)", fontsize=15)
plt.ylabel("Flipper Length (mm)", fontsize=15)

plt.show()
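
Renaming inside the `data` argument is safe because `.rename()` returns a new data frame by default and leaves the original untouched. The following is a minimal sketch of that behaviour, using a small made-up data frame (`toy` is a hypothetical stand-in for `penguins`, not the real data):

```python
import pandas as pd

# Toy stand-in for the penguins data frame (made-up values, for illustration only)
toy = pd.DataFrame({"species": ["Adelie", "Gentoo"], "body_mass_g": [3750, 5000]})

# .rename() returns a renamed copy; it does not modify `toy` in place
renamed = toy.rename(columns={"species": "Species", "body_mass_g": "Body Mass (g)"})

print(list(renamed.columns))  # column names of the renamed copy
print(list(toy.columns))      # column names of the original, which are unchanged
```

So the legend picks up the tidy names, while later cells can keep referring to `species` and `body_mass_g`.
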

### B

**<span style="color:blue">Exercise</span>**: To visualise the distribution of a discrete variable, a bar chart would be appropriate. Pick a discrete variable from the metabric data set and use the function `sns.countplot()` to plot a bar chart showing its distribution. Ensure that all plot properties such as the axis labels are customised to your satisfaction.

Let's have a look at the distribution of tumour stages in the metabric dataset.

In [None]:
plt.figure()

sns.countplot(data=metabric, x="Tumour_stage", edgecolor="black") # The argument edgecolor adds a nice outline 
plt.xlabel("Tumour Stage", fontsize=15)
plt.ylabel("Count", fontsize=15)

plt.show()
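
If you want to check the bar heights numerically, the pandas method `.value_counts()` tabulates the same counts. This is only a sketch on a small made-up column (the values below are not from the metabric data):

```python
import pandas as pd

# Toy stand-in for the Tumour_stage column (made-up values, including one missing)
toy = pd.DataFrame({"Tumour_stage": [1, 2, 2, 3, 2, None, 1]})

# Count how often each stage occurs, then sort by stage rather than by frequency
stage_counts = toy["Tumour_stage"].value_counts().sort_index()

print(stage_counts)
```

Missing values are dropped by default; passing `dropna=False` to `.value_counts()` counts them as well.
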

### C

**<span style="color:blue">Exercise</span>**: Create plots to answer the following questions:
- Does the expression profile of ESR1 differ between patients who have and haven't undergone chemotherapy?
- What about radiotherapy?
- What about the four different combinations of chemotherapy and radiotherapy?

The first two parts can be easily answered using beeswarm-violin plots.

Let's put both plots in a grid of subplots to make the comparison easier.

In [None]:
plt.subplots(2, 1, sharex=True, figsize=(10, 10)) # Create a 2 x 1 grid of subplots

# Set the first subplot to be active, then draw onto this a beeswarm-violin
# plot of ESR1 expression stratified by chemotherapy
plt.subplot(2, 1, 1)
sns.violinplot(data=metabric, x="ESR1", y="Chemotherapy", palette=["w"])
sns.swarmplot(data=metabric, x="ESR1", y="Chemotherapy", s=3)

plt.ylabel("Chemotherapy", fontsize=20)
plt.xlabel("") # It's cleaner to keep only the label on the bottom plot
plt.yticks(fontsize=12.5)

# Set the second subplot to be active, then draw onto this a beeswarm-violin
# plot of ESR1 expression stratified by radiotherapy
plt.subplot(2, 1, 2)
sns.violinplot(data=metabric, x="ESR1", y="Radiotherapy", palette=["w"])
sns.swarmplot(data=metabric, x="ESR1", y="Radiotherapy", s=3)

plt.ylabel("Radiotherapy", fontsize=20)
plt.xlabel("ESR1", fontsize=20)
plt.yticks(fontsize=12.5)

plt.show()

To look at the four combinations, it is easiest to create a new variable that encodes them.

In [None]:
# First create an empty column (None gives an object column, which can hold strings)
metabric["Therapy_combinations"] = None

# Then use conditional subsetting to fill in this column
# Note: Assigning through .loc avoids the warning about trying to set a value on a copy of a data frame
metabric.loc[(metabric.Chemotherapy=="YES") & (metabric.Radiotherapy=="YES"), "Therapy_combinations"] = "Both"
metabric.loc[(metabric.Chemotherapy=="YES") & (metabric.Radiotherapy=="NO"), "Therapy_combinations"] = "Chemo Only"
metabric.loc[(metabric.Chemotherapy=="NO") & (metabric.Radiotherapy=="YES"), "Therapy_combinations"] = "Radio Only"
metabric.loc[(metabric.Chemotherapy=="NO") & (metabric.Radiotherapy=="NO"), "Therapy_combinations"] = "Neither"
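
As an alternative, the same kind of column can be built in a single step with `np.select()`, which takes a list of conditions and a matching list of labels. The sketch below uses a small made-up data frame (`toy` is hypothetical, with the same YES/NO coding), and the `"Unknown"` default label is my own addition for rows that match none of the conditions:

```python
import numpy as np
import pandas as pd

# Toy stand-in for metabric (made-up rows, same YES/NO coding)
toy = pd.DataFrame({
    "Chemotherapy": ["YES", "YES", "NO", "NO"],
    "Radiotherapy": ["YES", "NO", "YES", "NO"],
})

# One Boolean condition per therapy combination
conditions = [
    (toy.Chemotherapy == "YES") & (toy.Radiotherapy == "YES"),
    (toy.Chemotherapy == "YES") & (toy.Radiotherapy == "NO"),
    (toy.Chemotherapy == "NO") & (toy.Radiotherapy == "YES"),
    (toy.Chemotherapy == "NO") & (toy.Radiotherapy == "NO"),
]
labels = ["Both", "Chemo Only", "Radio Only", "Neither"]

# For each row, np.select picks the label of the first condition that is True;
# rows matching no condition (e.g. missing values) get the default label
toy["Therapy_combinations"] = np.select(conditions, labels, default="Unknown")

print(toy)
```
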

We can then make the plot fairly simply.

In [None]:
plt.figure(figsize=(10, 10))

# Beeswarm-violin plot creation. The `order` argument is used to more sensibly order the different therapy combinations.  
sns.violinplot(data=metabric, x="ESR1", y="Therapy_combinations", palette=["w"], order=["Both", "Chemo Only", "Radio Only", "Neither"])
sns.swarmplot(data=metabric, x="ESR1", y="Therapy_combinations", s=3, order=["Both", "Chemo Only", "Radio Only", "Neither"])

plt.ylabel("Therapy Combination", fontsize=20)
plt.xlabel("ESR1 Expression", fontsize=20)

plt.yticks(fontsize=12.5)

plt.show()

## Exercise 1

In this exercise, we shall use the `sns.heatmap()` function to look at the correlation between the gene expression variables in the metabric data set.

- If you look at the documentation for the heatmap function, you will see that it requires a matrix as input. Use the `.corr()` method for data frames to create a correlation matrix for all the gene expression variables. *Note:* The `.loc` subsetting functionality may also be useful here.
- Create a heatmap visualising these correlations. Do you notice anything? Try manually ordering the different variables to best highlight the patterns in the heatmap.

We first create a correlation matrix for the different genes with expression data:
- Recall we can use the `.loc[]` functionality to use variable names to define a range of columns
- The `.corr()` method can be applied to data frames to produce a correlation matrix

In [None]:
gene_expression_correlation = metabric.loc[:,"ESR1":"MLPH"].corr(method="spearman")

In [None]:
gene_expression_correlation
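
One detail worth remembering is that label-based slices with `.loc` include both end points, unlike ordinary Python slices. A minimal sketch on a made-up data frame (the column names `GENE_A` to `GENE_C` are hypothetical):

```python
import pandas as pd

# Toy data frame: one identifier column followed by three made-up expression columns
toy = pd.DataFrame({
    "Patient_ID": ["P1", "P2", "P3", "P4"],
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.1, 5.9, 8.2],
    "GENE_C": [4.0, 3.0, 2.0, 1.0],
})

# The slice "GENE_A":"GENE_C" selects GENE_A, GENE_B and GENE_C (both ends included)
expression = toy.loc[:, "GENE_A":"GENE_C"]

# Spearman correlation works on ranks, so any monotonic relationship gives +1 or -1
toy_corr = expression.corr(method="spearman")

print(toy_corr)
```
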

In [None]:
plt.figure()

sns.heatmap(data=gene_expression_correlation)

plt.show()

Note the scale! In this plot it goes up to 1 but down only to -0.2, so there are actually no pairs of genes with a substantial inverse correlation. The arguments `vmin` and `vmax` can be used to manually set the lower and upper limits of the colour range.
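
As a rough sketch of how `vmin` and `vmax` might be used, the example below fixes the colour range to the full span of a correlation coefficient, from -1 to 1. The matrix `toy_corr` is made up for illustration, and the choice of the diverging colour map `"vlag"` is my own suggestion rather than part of the exercise:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy correlation matrix (made-up values, hypothetical gene names)
genes = ["GENE_A", "GENE_B", "GENE_C"]
toy_corr = pd.DataFrame(
    [[1.0, 0.6, -0.2],
     [0.6, 1.0, 0.1],
     [-0.2, 0.1, 1.0]],
    index=genes, columns=genes,
)

plt.figure()

# Fix the colour range to [-1, 1] so that zero sits in the middle of the colour map
ax = sns.heatmap(data=toy_corr, vmin=-1, vmax=1, cmap="vlag")

plt.show()
```

With a symmetric range, weak negative correlations are no longer stretched across the whole lower half of the colour bar.
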

Annoyingly, `sns.heatmap()` has no argument to manually control the order of the variables. Instead, the ordering can be set by re-selecting the rows and columns of the input matrix in the desired order.

In [None]:
plt.figure()

sns.heatmap(data=gene_expression_correlation.loc[["ESR1", "GATA3", "FOXA1", "MLPH", "PGR", "ERBB2", "TP53", "PIK3CA"],
                                                 ["ESR1", "GATA3", "FOXA1", "MLPH", "PGR", "ERBB2", "TP53", "PIK3CA"]])

plt.show()

**<span style="color:Seagreen">For Thought</span>**: Here we were able to look at the heatmap manually and pick out an order of genes that best highlighted their correlation structure. If we had many more genes this would not be possible. How might we automatically determine an order for them?
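
One possible answer, offered here as a sketch rather than the only approach, is hierarchical clustering: treat `1 - correlation` as a distance, cluster the variables, and order them by the leaves of the resulting tree. The example assumes SciPy is installed (it is not imported elsewhere in these solutions) and uses a made-up correlation matrix with hypothetical gene names:

```python
import pandas as pd
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import squareform

# Toy correlation matrix (made-up values): A pairs with C, and B pairs with D
genes = ["A", "B", "C", "D"]
toy_corr = pd.DataFrame(
    [[1.0, 0.1, 0.8, 0.0],
     [0.1, 1.0, 0.2, 0.7],
     [0.8, 0.2, 1.0, 0.1],
     [0.0, 0.7, 0.1, 1.0]],
    index=genes, columns=genes,
)

# Turn correlations into distances: highly correlated genes are close together
distance = 1 - toy_corr

# linkage() expects the condensed (upper-triangle) form of the distance matrix
condensed = squareform(distance.to_numpy(), checks=False)
tree = linkage(condensed, method="average")

# Read the genes off the tree from left to right and reorder rows and columns
order = toy_corr.index[leaves_list(tree)]
reordered = toy_corr.loc[order, order]

print(reordered)
```

The reordered matrix can be passed to `sns.heatmap()` as before. Seaborn also provides `sns.clustermap()`, which clusters and draws the heatmap in one call.
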

## Exercise 2

Earlier we used the `sns.histplot()` function to visualise the age distribution of the patients in our data set. In this exercise, we look at some options for visualising how this age distribution breaks down across the different cohorts.

- One option is to use the `sns.kdeplot()` function, which only plots the density curve, not the actual bars. Use this function to plot five different density curves on a single axis, one for the age distribution of patients within each cohort.
- Another option is to use subplots. Create a figure where the histogram for the age distribution within each cohort is on a different set of axes. Creating five different subplots is a bit laborious, so you might want to explore using a `for` loop to save time.

Which of these options do you prefer? 

To draw a separate density curve for each cohort, one option is to set cohort as the variable to colour by.

In [None]:
plt.figure()

sns.kdeplot(data=metabric, x="Age_at_diagnosis", hue="Cohort")
plt.xlabel("Age at Diagnosis")
plt.ylabel("Density")

plt.show()

One option to create subplots is as follows:

In [None]:
plt.subplots(2, 3, sharex=True, figsize=(10,8))

plt.subplot(2, 3, 1)
sns.histplot(data=metabric[metabric.Cohort==1], x="Age_at_diagnosis")
plt.xlabel("")
plt.ylabel("Count")
plt.title("Cohort 1")

plt.subplot(2, 3, 2)
sns.histplot(data=metabric[metabric.Cohort==2], x="Age_at_diagnosis")
plt.xlabel("")
plt.ylabel("Count")
plt.title("Cohort 2")

plt.subplot(2, 3, 3)
sns.histplot(data=metabric[metabric.Cohort==3], x="Age_at_diagnosis")
plt.xlabel("")
plt.ylabel("Count")
plt.title("Cohort 3")

plt.subplot(2, 3, 4)
sns.histplot(data=metabric[metabric.Cohort==4], x="Age_at_diagnosis")
plt.xlabel("")
plt.ylabel("Count")
plt.title("Cohort 4")

plt.subplot(2, 3, 5)
sns.histplot(data=metabric[metabric.Cohort==5], x="Age_at_diagnosis")
plt.xlabel("Age at Diagnosis", fontsize=20)
plt.ylabel("Count")
plt.title("Cohort 5")

# Turn the axes off on the last subplot, as we are not using it 
plt.subplot(2, 3, 6)
plt.axis("off")

# The tight_layout() function ensures no overlap with axis labels and adjacent plots 
plt.tight_layout()

plt.show()

This can be done more concisely using a `for` loop:

In [None]:
plt.subplots(2, 3, sharex=True, figsize=(10,8))

for i in range(5):
    plt.subplot(2, 3, i+1)
    sns.histplot(data=metabric[metabric.Cohort==i+1], x="Age_at_diagnosis")
    plt.xlabel("")
    plt.ylabel("Count")
    plt.title("Cohort " + str(i+1)) # Recall that strings add like "a" + "b" = "ab"

# Add the x-label on the 5th subplot
plt.subplot(2, 3, 5)
plt.xlabel("Age at Diagnosis", fontsize=20)

# Turn axes off on the last subplot 
plt.subplot(2, 3, 6)
plt.axis("off")

plt.tight_layout()

plt.show()