RNA-seq/exercise_01-exploratory_data_analysis.Rmd

---
title: "Introduction to RNA-seq exploratory data analysis - Exercises"
author: "CCDL for ALSF"
date: "2020"
output:
  html_notebook:
    toc: true
    toc_float: true
---

In this notebook, we will use an RNA-seq experiment comprised of medulloblastoma orthotopic patient-derived xenograft (PDX) models from [Huang et al. _Sci Transl Med._ 2018.](https://doi.org/10.1126/scitranslmed.aat0150) to practice more exploratory data analysis.

The experiment, [`SRP150101`](https://www.refine.bio/experiments/SRP150101/transcriptome-analysis-of-group-3-and-4-medulloblastoma-orthotopic-xenograft-mice-with-digoxin-treatment), was processed by [refine.bio](https://www.refine.bio/).

Here's an excerpt of the experiment description:

> RNA-seq data of Group 3 and 4 medulloblastoma with digoxin treatment.

Group 3 and Group 4 are subgroups of medulloblastoma. 
You can read more about these subgroups of medulloblastoma in [Menyhárt et al. _J Hematol Oncol ._ 2019.](https://doi.org/10.1186/s13045-019-0712-y)

[Digoxin](https://en.wikipedia.org/wiki/Digoxin) is a treatment for various cardiovascular diseases.

## Libraries

```{r library, solution = TRUE}
# Load in DESeq2

# Load in ggplot2

# Load the SummarizedExperiment package

# Load the magrittr package for the %>% operator

```

## Read in and reorder data

We have prepared an RDS file that contains the output of `tximeta` steps in the form of a `SummarizedExperiment` object. 

```{r input-files}
txi_file <- file.path("data",
                      "medulloblastoma",
                      "txi",
                      "medulloblastoma_txi.RDS")
```

Use the chunk below to read in the gene-level summarized `tximeta` object and save it as `gene_summarized`.

```{r read-in-files, solution = TRUE}
# Read in the tximeta file and save it as an object called gene_summarized

```

## DESeq2 

Now we're ready to create a `DESeqDataSet` from the `tximeta` object.
Because we plan to use the `DESeqDataSet` to compare samples in a manner that is unbiased by the prior information about samples, we don't need to enter information about the experimental design to the `design` argument of `DESeqDataSet()`; we can use `~ 1` instead.

A few things to note:

1. The way we perform transformations with `vst()` in the notebook we use for exploratory analysis (`03-gastric_cancer_exploratory`) is blinded to experimental design by default even though we don't use `~ 1` there (the default argument to `vst()` is `blind = TRUE`).
2. If we wanted to go on to perform differential expression analysis or other downstream analyses, we'd want to create a new object where we specified information about the experimental design to `design`.

```{r ddset-creation}
ddset <- DESeqDataSet(gene_summarized,
                      # using design = ~ 1 blinds the DESeq2 object
                      # from the experimental design
                      design = ~ 1)
```

### Normalized counts

In the next chunk, we'll demonstrate how to get normalized counts from a `DESeqDataSet` object.
The first step is to calculate size or normalization factors with `estimateSizeFactors` using the median ratio method from [Anders and Huber. _Genome Biology._ 2010.](https://doi.org/10.1186/gb-2010-11-10-r106)
Dividing samples by their size factor will bring the counts to a common scale, making samples sequenced to different depths comparable.

```{r normalized-counts}
# Using estimateSizeFactors() will add the size factors to the ddset object
ddset <- estimateSizeFactors(ddset)

# The counts() function, when used with normalized = TRUE, will return the 
# scaled counts as a matrix
normalized_counts <- counts(ddset, normalized = TRUE)
```

Let's look at the relationship between mean and standard deviation in the normalized count data with a plot via the `meanSdPlot()` function from the `vsn` library.

```{r counts-mean-sd-plot}
vsn::meanSdPlot(normalized_counts, ranks = FALSE)
```

We can see that genes with higher count values tend to have higher variance (variance is the square of standard deviation).
The point of the transformation we do before visualizing RNA-seq data is to remove the dependence of the variance on the mean ([Data transformations and visualization, _Analyzing RNA-seq data with DESeq2_](https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization)).
Without transformation, visualizations like clustering would be dominated by genes with high expression values ([Love et al. _Genome Biology._ 2014.](https://doi.org/10.1186/s13059-014-0550-8)).

### Transformation

The variance-stabilizing transformation makes it such that the variance is _stabilized_ across mean values.
The resulting values will also be on a log2-like scale for high counts.
Use the next chunk to run `vst()` on the dataset.

```{r vst, solution = TRUE}
# Save object as vst_data

```

`assay()` can be used to get the transformed data from the `DESeqTransform` object, `vst_data` as a matrix.

```{r vst-matrix}
vst_matrix <- assay(vst_data)
```

Now make the same plot as you did above with `vsn::meanSdPlot()` using the transformed matrix.

```{r vst-mean-sd-plot, solution = TRUE}

```

> What do you notice about the relationship between the mean and standard deviation? > What about the scale of the mean values?

### Principal Components Analysis

Make a PCA plot using `plotPCA()`, which takes the `vst()`-transformed object.
The `intgroup` argument of `plotPCA()` takes the column name(s) from the `colData()` and will use that to color the points in the PCA plot.

```{r plot-pca-1, solution = TRUE}
# Perform PCA and color the points by whether or not a sample is treated with
# digoxin

```

Run the chunk below to reveal a plot we'll replicate!

```{r include-graphics}
plot_to_replicate <- file.path("diagrams", 
                               "medulloblastoma_PCA_plot_treatment_group.png")
knitr::include_graphics(plot_to_replicate)
```

Use the following two chunks to replicate the plot. 
First, you'll need to do a bit of prep to get the data you need to make the plot.

```{r replicate-pca-plot-1, solution = TRUE}
# Use plotPCA again, but now we are interested in representing digoxin treatment 
# and subgroup on the same plot so we need `returnData = TRUE` and to save the 
# output to a new variable.

# We'll need the percent variance from the PCA data  

```

Now use the chunk below to replicate the plot with `ggplot2`.

```{r replicate-pca-plot-2, solution = TRUE}

```

> What is your interpretation of this plot?
`plotPCA()` uses the top 500 genes with highest variance as the input data for PCA by default.

## Session Info

```{r session_info}
sessionInfo()
```