session2.Rmd

---
title: "Session 2"
author: "Kevin Wang and Garth Tarr"
date: "2 Dec 2019"
output:
  html_document:
    fig_height: 8
    fig_width: 8
    toc: yes
    number_sections: true
    toc_depth: 3
    toc_float: yes
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(results = "hide", fig.show = "hide", warning = FALSE, message = FALSE)
```

# Learning outcomes of this session

+ Join two data together using the `left_join` function
+ Making basic plots using `ggplot2`: boxplots, histograms and scatter plots
+ Use `pivot_longer` and `pivot_wider` to reshape data for visualisation purposes
+ Saving cleaned data for next stage of analysis

# Loading packages

```{r}
library(tidyverse)
```

# Loading cleaned sample data

Let's begin today by reading in the csv file that we created in session 1. 

```{r}
session1_data = read_csv(file = "data/clean_sample_data.csv")
```

# Reading in gene expression data (top 6 rows) {#reading_ge}

Remember our aim for this workshop is to join the mice sample data with a mice gene expression data. `session1_data` is the sample data we have already cleaned yesterday, so we should now read in the gene expression data now. 

This gene expression data was downloaded straight from the Gene Expression Omnibus and it was saved as a `txt` file. This `txt` file is more or less like a rectangular Excel sheet, with each row separated by a new line and each column is separated by a a single press of the "tab" button on the eyboard. Thus, we call this tab as the "delimiter" of this file. `R` understands this "tab" as `\t`. 

The whole data has 54 rows, but for the sake of brevity, we will only read in the first 6 rows. We will also get rid of the first column of this raw data (the column of manufacturer's ID).

```{r}
raw_ge_data = read_delim(file = "data/GSE43251_NanoString_non-normalized.txt", delim = "\t", n_max = 6)
## Alternatively:
## raw_ge_data = read_tsv(file = "data/GSE43251_NanoString_non-normalized.txt", n_max = 6)

clean_ge_data = raw_ge_data %>% 
  dplyr::select(-1)

clean_ge_data
```


Let's try to understand this data before we proceed to the next section. The first column of this data contains the gene symbols that was measured by the gene expression machine. From the second column onwards, each column records the gene expression measurement for one mouse and the sample ID of the mouse is the column name (i.e. HW1, HW2, etc.). 


So how should we join these two data? If we just stack the two data together, there are no guarantees to say that we will have the right ordering of sample ID. In fact, if you try to just bind the two sample IDs together, `R` will complain about this. 

```{r, error = TRUE}
length(colnames(clean_ge_data)[-1]) ## Not counting the gene symbol column
length(session1_data$sample)

cbind(
  colnames(clean_ge_data)[-1],
  session1_data$sample)
```


This is where the `left_join` function from the `dplyr` package can help us. 

# `left_join` sample data

Let's begin by looking at the samples that have gene expression data available. We will make a `data.frame` documenting the sample IDs and another column that indicates the availability of the gene expression data (which always have a value of `"yes`, since we extracted the ID from the gene expression data itself). 

```{r}
ge_sample_data = data.frame(
  sample = colnames(clean_ge_data)[-1],
  ge_data_available = "yes"
)

head(ge_sample_data)
```

Now we will apply the `left_join` function between the `session1_data`, the full sample data, and the `ge_sample_data`, which only contains the samples with gene expression information available. The `sample` column in both data serves as an "identifier" to match each mice in the `session1_data` to each mice in `ge_sample_data`. See below.


```{r}
joined_sample_data = left_join(session1_data, ge_sample_data, by = "sample")

joined_sample_data
```

The resulted data has the same number of rows as `session1_data`, and this is no surprise because it is the first input into the function and hence the name. The `by` input specify which column should be used as the identifier. 

Looking at this data seems to suggest that nothing has gone wrong in the first 10 rows. But if we apply a small sampling on the number of rows, we can see that there are indeed sample mismatching between the sample and the gene expression data. This makes our task of combining data more difficult. Because we only know how to combine two datasets through a identifier column. This is not an issue with the `session1_data`, but the challenge here is the gene expression data, which has all identifiers as the column names. 

If only there is a way that we can grab the column names of the gene expression data into a column while preserving the gene expression values. That'd make things so much easier! Let's learn about the `pivot_longer` function first before returning to this problem.

![](https://www.fromthebottomoftheheap.net/assets/img/posts/tidyr-longer-wider.gif)

# `tidyr::pivot_longer` the gene expression data

The `pivot_longer` function takes a "wide" data into a "long" data. We simply tell the function which column we want to preserve, in this case, it will be the gene symbols, because we still want to distinguish each genes for each of the samples after the pivot operation. 

```{r}
clean_ge_data %>% 
  tidyr::pivot_longer(
    cols = -GENE_SYMBOL)
```

Of course, it is be better to give the new columns some informative names. 

```{r}
ge_data_long = clean_ge_data %>% 
  tidyr::pivot_longer(
    cols = -GENE_SYMBOL,
    names_to = "sample", 
    values_to = "gene_expression")

ge_data_long
```


Pay attention to how this data still maintains its tidy structure but it is now in a better format to be `left_join` with the `session1_data`. This may look simple, but you should appreciate the thinking behind this process as it will be ever more important as your data gets more complex. 

```{r}
long_sample_ge_data = session1_data %>% 
  left_join(ge_data_long, by = "sample") %>% 
  dplyr::filter(!is.na(GENE_SYMBOL)) %>% 
  janitor::clean_names()

long_sample_ge_data
```


Now that we have a very nice and cleaned data, we are now in a good position to learn about visualisation. 


# `ggplot2` visualisation

There are many plotting systems and packages out there, so why should we use this `ggplot2` system?

The `gg` in `ggplot2` stands for "grammar of graphics". That means in order to produce a ggplot, a very strict set of coding structure must be followed. These restrictions are designed in such a way that you can always look at a ggplot and know exactly which characteristic of the plot comes from which column of the data. This is extremely important for the purpose of reproducibility. 

The `ggplot2` plotting system is based on a layered structure, which means the final plot can always be decompose into several layers. We will illustrate what this means now step-by-step. 

Let's say we want to draw a scatter plot of the strains and the gene expression. We will first begin with a data and pipe that into the `ggplot()` function. This will give us a blank canvas for plotting. 

```{r}
long_sample_ge_data

long_sample_ge_data %>% 
  ggplot()
```


Now is a good time to think about what are the components of a scatter plot? 

+ x-coordinates of points
+ y-coordinates of points
+ Actually placement of points

The first two components are referred to as "mapping" in ggplot. Because the x-coordinates should be stored as a column in the `data.frame`, so we will "map" this column to be a characteristic on our ggplot. Similary, the y-coordinates should also be a mapping. This is why we write our code in the following manner. Note, we are using the `aes()` function to wrap everything. 


```{r}
long_sample_ge_data %>% 
  ggplot(mapping = aes(x = strain, y = gene_expression))
```


Even though we still have a blank canvas, but notice how the x and y axes are now labelled with the variables/columns in the `data.frame`. So what are we missing for a scatter plot? 
Actually putting the points on the canvas!

In `ggplot2`, a plot is always a combination of layers. We have now a layer of blank canvas with axes, we will want to draw on this canvas using `geom_point()` which places the actual points on the canvas based on the mappings we have already specified. In `ggplot2`, geometries are functions that actually draws on the canvas based on the aesthetic mappings given. These are referred to as "geometries". 

```{r}
long_sample_ge_data %>% 
  ggplot(aes(x = strain, 
             y = gene_expression)) + 
  geom_point()
```


This is your first ggplot! Let's give it a title!

```{r}
long_sample_ge_data %>% 
  ggplot(aes(x = strain, 
             y = gene_expression)) + 
  geom_point() +
  labs(title = "My first ggplot!")
```


We see that the scale of data points can be very different. This is due to some genes always tend to have a higher expression than others (due to biology). So it will be a good idea to make a plot for each gene instead. This means that we need to make this plot for each value in the `gene_symbol` column in our data. This can be done using the `facet_wrap` function, which will produce a plot that we have already made, but split by all unique values in the given variable, in this case `gene_symbol`. 

```{r}
long_sample_ge_data %>% 
  ggplot(aes(x = strain, 
             y = gene_expression)) + 
  geom_point() +
  facet_wrap(~gene_symbol) +
  labs(title = "My first ggplot!")
```


This plot shows that "Aldh3a1" is the main culprit for having a very different scale compared all other five genes. 

One way to deal with this scaling is to use the logarithm scale. `ggplot2` has a very convenient way to doing this. The `scale_y_log10` function will scale the y-axis to log10 scale and this is a very convenient function because it retains the labelling of the original data but the points are on a log10-scale.

```{r}
long_sample_ge_data %>% 
  ggplot(aes(x = strain, 
             y = gene_expression)) + 
  geom_point() +
  facet_wrap(~gene_symbol) +
  scale_y_log10() +
  labs(title = "My first ggplot!")
```


We now have the scatter plots on a good scale, and we can already see some interesting features from these plots. For example, the "Aldh3a1" gene seems to show off some differences between the two strains, and the "Bbs2" gene seems to have a different variance between the two strains. 

So, how can we improve on this plot further?

We can layer plots of other features onto this! For example, boxplots.

```{r}
long_sample_ge_data %>% 
  ggplot(aes(x = strain, 
             y = gene_expression)) + 
  geom_point() +
  geom_boxplot() +
  facet_wrap(~gene_symbol) +
  scale_y_log10() +
  labs(title = "My first ggplot!")
```

Now it is a good time to stop and think about what we did. 

We started off with a `data.frame` and a blank canvas with the aim to make a scatter plot. We added aesthetic mappings onto our ggplot and evaluated the quality of our plot as we go along. Every time we identify deficiencies in our plot, we simply add more layers to the plot. This detective-work is what makes `ggplot2` is so popular, because it is very flexible in handling new ideas and plotting aesthetics. But at the same time, every aesthetic of this plot can be traced back to a particular column in the `data.frame`. 

There are so many functions and tricks that we can't really go through in details. Here is a very helpful [cheatsheet](https://rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf). And in general, the best way is to Google the plot that you want and copy the code that someone else already made. 

Have a play with the following code. Try deleting components of the code to understand which aspect of the plot it was controlling

```{r}
long_sample_ge_data %>% 
  ggplot(aes(x = strain, 
             y = gene_expression, 
             colour = strain)) + 
  geom_point() +
  geom_boxplot() +
  facet_wrap(~ gene_symbol, scales = "free_y") +
  scale_y_log10(labels = scales::comma) +
  scale_color_manual(values = c("red", "blue")) +
  labs(x = "Strain", 
       y = "Gene Expression", 
       title = "Boxplot of genes across strains") +
  theme_bw() +
  theme(legend.position = "bottom")
```


## Histogram

We will now try to make a histogram of each gene. 

Let's begin with the question: what are the aesthetic mappings of a histogram? 

+ x-axis should be numerical values of gene expression 

Let's try this out. For making a histogram, we need to use the `geom_histogram()` geometry. 


```{r}
long_sample_ge_data %>% 
  ggplot(aes(x = gene_expression)) + 
  geom_histogram()
```

You might be wondering, how is it possible that we didn't specify what should be on the y-axis but still got a plot out? This is because by the very definition of a histogram, only counts or density of the give x-variable should be plotted. Here, the default of `ggplot2` is to use counts. 

Again, we see the same scaling issue as the above boxplots. Hence, we will try to apply the same functions to manipulate the plots, namely: 

+ `facet_wrap` which creates a plot for each gene in small panels,
+ `scale_x_log10` which plots the x-axis on the log10 scale

```{r}
long_sample_ge_data %>% 
  ggplot(aes(x = gene_expression)) + 
  geom_histogram() +
  facet_wrap(~gene_symbol) +
  scale_x_log10(labels = scales::comma)
```


This plot looks pretty good! We can see that some genes are unimodal while others are bimodal. This is an new characteristic of the data that we didn't observe from the boxplots above. 

We can make the modality more obvious by imposing the density of each gene onto the histogram. Let's try out the `geom_density` function then!

```{r}
long_sample_ge_data %>% 
  ggplot(aes(x = gene_expression)) + 
  geom_histogram() +
  geom_density() +
  facet_wrap(~gene_symbol) +
  scale_x_log10(labels = scales::comma)
```


This doesn't look quite right, so what happened?

If we pay attention to the y-axis of our ggplot so far, we see that it is `count`, which means that each of bin in the histograms is showing the number of observations in that interval. However, when we talk about density, it is on a very different scale to the number of counts. 

So how can we make a density-based histogram? 

Let's think this logically: right now, the count of observation is being mapped to the y-axis (default of `geom_histogram`), and we need to overwrite this by making density to be mapped to the y-axis. So the answer would be adding an extra input of `y = ..density..` into the `aes()` function. The `..` is just a technical way that `ggplot` handles such problem which you don't have to worry too much on the details about. (The reason that we can't use `y = density` is because in that case ggplot will interpret this line of code as mapping a column named `density` from the `data.frame` to the plot - but such column does not exist.) 


```{r}
long_sample_ge_data %>% 
  ggplot(aes(x = gene_expression, y = ..density..)) + 
  geom_histogram() +
  geom_density() +
  facet_wrap(~gene_symbol, scales = "free") +
  scale_x_log10(labels = scales::comma)
```


# Plotting two genes against each other

Now, let's consider how we could make a scatter plot of two genes, say the "Ahrrr" and the "Cxcr7" gene. So, let's answer the question, what are the components of a scatter plot in this case?

+ Ahrrr gene expression values on the x-axis
+ Cxcr7 gene expression values on the y-axis

But if we look at the `data.frame` we are using for plotting, `long_sample_ge_data`, it has all the gene expression values mixed in that `gene_symbol` column. So we can't effectively map the expression of only "Ahrrr" and "Cxcr7" to the x and y axes respectively. 


```{r}
long_sample_ge_data 
```


So we need to think about how to separate out the gene expression values into distinct columns. You may realise this as to be exactly the opposite of the operation of `pivot_longer`. In the `tidyr` package, the `tidyr::pivot_wider` allows for this operation. 

```{r}
wide_sample_ge_data = long_sample_ge_data %>% 
  tidyr::pivot_wider(names_from = "gene_symbol",
                     values_from = "gene_expression")

wide_sample_ge_data
```


Now that we have each gene as a column, we can perform our familiar ggplot operations. Have a play with each of the layers below. Notice that we also added the colour of the strain as well as a new geometry, `geom_smooth` that allows for adding linear regression lines. 

```{r}
wide_sample_ge_data %>% 
  ggplot(aes(x = Ahrr,
             y = Cxcr7,
             colour = strain)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_x_log10(labels = scales::comma) +
  scale_y_log10(labels = scales::comma)
```


# Saving data

```{r, eval = FALSE}
write_csv(long_sample_ge_data, path = "data/long_sample_ge_data.csv")
write_csv(wide_sample_ge_data, path = "data/wide_sample_ge_data.csv")
```

# Making gene expression boxplots with p-values

```{r, results="show", fig.show = "show"}
library(ggpubr)

ggpubr::ggboxplot(
  data = long_sample_ge_data,
  x = "strain", 
  y = "gene_expression",
  facet.by = "gene_symbol",
  scales = "free") +
  ggpubr::stat_compare_means()
```


# References

Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686

# Session Info

```{r, results = "show"}
sessionInfo()
```