/
exercise_01-exploratory_data_analysis.Rmd
166 lines (110 loc) · 6.54 KB
/
exercise_01-exploratory_data_analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
---
title: "Introduction to RNA-seq exploratory data analysis - Exercises"
author: "CCDL for ALSF"
date: "2020"
output:
html_notebook:
toc: true
toc_float: true
---
In this notebook, we will use an RNA-seq experiment comprised of medulloblastoma orthotopic patient-derived xenograft (PDX) models from [Huang et al. _Sci Transl Med._ 2018.](https://doi.org/10.1126/scitranslmed.aat0150) to practice more exploratory data analysis.
The experiment, [`SRP150101`](https://www.refine.bio/experiments/SRP150101/transcriptome-analysis-of-group-3-and-4-medulloblastoma-orthotopic-xenograft-mice-with-digoxin-treatment), was processed by [refine.bio](https://www.refine.bio/).
Here's an excerpt of the experiment description:
> RNA-seq data of Group 3 and 4 medulloblastoma with digoxin treatment.
Group 3 and Group 4 are subgroups of medulloblastoma.
You can read more about these subgroups of medulloblastoma in [Menyhárt et al. _J Hematol Oncol ._ 2019.](https://doi.org/10.1186/s13045-019-0712-y)
[Digoxin](https://en.wikipedia.org/wiki/Digoxin) is a treatment for various cardiovascular diseases.
## Libraries
```{r library, solution = TRUE}
# Load in DESeq2
# Load in ggplot2
# Load the SummarizedExperiment package
# Load the magrittr package for the %>% operator
```
## Read in and reorder data
We have prepared an RDS file that contains the output of `tximeta` steps in the form of a `SummarizedExperiment` object.
```{r input-files}
txi_file <- file.path("data",
"medulloblastoma",
"txi",
"medulloblastoma_txi.RDS")
```
Use the chunk below to read in the gene-level summarized `tximeta` object and save it as `gene_summarized`.
```{r read-in-files, solution = TRUE}
# Read in the tximeta file and save it as an object called gene_summarized
```
## DESeq2
Now we're ready to create a `DESeqDataSet` from the `tximeta` object.
Because we plan to use the `DESeqDataSet` to compare samples in a manner that is unbiased by the prior information about samples, we don't need to enter information about the experimental design to the `design` argument of `DESeqDataSet()`; we can use `~ 1` instead.
A few things to note:
1. The way we perform transformations with `vst()` in the notebook we use for exploratory analysis (`03-gastric_cancer_exploratory`) is blinded to experimental design by default even though we don't use `~ 1` there (the default argument to `vst()` is `blind = TRUE`).
2. If we wanted to go on to perform differential expression analysis or other downstream analyses, we'd want to create a new object where we specified information about the experimental design to `design`.
```{r ddset-creation}
ddset <- DESeqDataSet(gene_summarized,
# using design = ~ 1 blinds the DESeq2 object
# from the experimental design
design = ~ 1)
```
### Normalized counts
In the next chunk, we'll demonstrate how to get normalized counts from a `DESeqDataSet` object.
The first step is to calculate size or normalization factors with `estimateSizeFactors` using the median ratio method from [Anders and Huber. _Genome Biology._ 2010.](https://doi.org/10.1186/gb-2010-11-10-r106)
Dividing samples by their size factor will bring the counts to a common scale, making samples sequenced to different depths comparable.
```{r normalized-counts}
# Using estimateSizeFactors() will add the size factors to the ddset object
ddset <- estimateSizeFactors(ddset)
# The counts() function, when used with normalized = TRUE, will return the
# scaled counts as a matrix
normalized_counts <- counts(ddset, normalized = TRUE)
```
Let's look at the relationship between mean and standard deviation in the normalized count data with a plot via the `meanSdPlot()` function from the `vsn` library.
```{r counts-mean-sd-plot}
vsn::meanSdPlot(normalized_counts, ranks = FALSE)
```
We can see that genes with higher count values tend to have higher variance (variance is the square of standard deviation).
The point of the transformation we do before visualizing RNA-seq data is to remove the dependence of the variance on the mean ([Data transformations and visualization, _Analyzing RNA-seq data with DESeq2_](https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization)).
Without transformation, visualizations like clustering would be dominated by genes with high expression values ([Love et al. _Genome Biology._ 2014.](https://doi.org/10.1186/s13059-014-0550-8)).
### Transformation
The variance-stabilizing transformation makes it such that the variance is _stabilized_ across mean values.
The resulting values will also be on a log2-like scale for high counts.
Use the next chunk to run `vst()` on the dataset.
```{r vst, solution = TRUE}
# Save object as vst_data
```
`assay()` can be used to get the transformed data from the `DESeqTransform` object, `vst_data` as a matrix.
```{r vst-matrix}
vst_matrix <- assay(vst_data)
```
Now make the same plot as you did above with `vsn::meanSdPlot()` using the transformed matrix.
```{r vst-mean-sd-plot, solution = TRUE}
```
> What do you notice about the relationship between the mean and standard deviation? > What about the scale of the mean values?
### Principal Components Analysis
Make a PCA plot using `plotPCA()`, which takes the `vst()`-transformed object.
The `intgroup` argument of `plotPCA()` takes the column name(s) from the `colData()` and will use that to color the points in the PCA plot.
```{r plot-pca-1, solution = TRUE}
# Perform PCA and color the points by whether or not a sample is treated with
# digoxin
```
Run the chunk below to reveal a plot we'll replicate!
```{r include-graphics}
plot_to_replicate <- file.path("diagrams",
"medulloblastoma_PCA_plot_treatment_group.png")
knitr::include_graphics(plot_to_replicate)
```
Use the following two chunks to replicate the plot.
First, you'll need to do a bit of prep to get the data you need to make the plot.
```{r replicate-pca-plot-1, solution = TRUE}
# Use plotPCA again, but now we are interested in representing digoxin treatment
# and subgroup on the same plot so we need `returnData = TRUE` and to save the
# output to a new variable.
# We'll need the percent variance from the PCA data
```
Now use the chunk below to replicate the plot with `ggplot2`.
```{r replicate-pca-plot-2, solution = TRUE}
```
> What is your interpretation of this plot?
`plotPCA()` uses the top 500 genes with highest variance as the input data for PCA by default.
## Session Info
```{r session_info}
sessionInfo()
```