---
title: "Introduction to R - Exercises"
author: "CCDL for ALSF"
date: "2020"
output:
  html_notebook:
    toc: true
    toc_float: true
---
The goal of these exercises is to help you get comfortable with using R and R notebooks by continuing to play with the gene results dataset we used in the [01-intro_to_base_R](01-intro_to_base_R-live.Rmd) and [02-intro_to_ggplot2](02-intro_to_ggplot2-live.Rmd) notebooks.
It is a pre-processed [astrocytoma microarray dataset](https://www.refine.bio/experiments/GSE44971/gene-expression-data-from-pilocytic-astrocytoma-tumour-samples-and-normal-cerebellum-controls)
on which we performed a set of [differential expression analyses](scripts/00-setup-intro-to-R.R).
### Set Up
Use this chunk to load the `tidyverse` package.
```{r tidyverse, solution = TRUE}
```
Create a results directory if it doesn't exist.
```{r results_dir, solution = TRUE}
```
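As a minimal sketch of the "create it only if it is missing" pattern, the example below uses base R's `dir.exists()` and `dir.create()`. The directory name `example_results` and its location under `tempdir()` are assumptions made for illustration, so the example does not touch your project folder.

```r
# Hypothetical scratch directory, used only for illustration
example_dir <- file.path(tempdir(), "example_results")

# Only create the directory if it is not already there
if (!dir.exists(example_dir)) {
  dir.create(example_dir)
}

# Check that the directory now exists
dir.exists(example_dir)
```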
## Read in the gene results file
Use `readr::read_tsv()` to read in the file "gene_results_GSE44971.tsv" and
assign it to the variable `stats_df`.
Recall that this notation means the `read_tsv()` function from the `readr` package.
If you have already loaded the `tidyverse` package above with `library()`,
you can use the function `read_tsv()` on its own without the preceding `readr::` as the `readr` package is loaded as part of `tidyverse`.
```{r read-data, solution = TRUE}
```
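If you would like to see `read_tsv()` in action before using it on the real file, here is a small sketch. The file contents, the temporary file, and the name `toy_df` are all made up for illustration; they are not part of the exercise data.

```r
# Write a tiny, made-up TSV file to a temporary location
toy_file <- tempfile(fileext = ".tsv")
writeLines(c("gene\tscore", "geneA\t1.5", "geneB\t2.5"), toy_file)

# Read the file in as a data frame and give it a name
toy_df <- readr::read_tsv(toy_file)

# Print the data frame
toy_df
```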
Use this chunk to explore what your data frame, `stats_df`, looks like.
```{r explore-df, solution = TRUE}
```
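There is no single right way to explore a data frame. The sketch below shows a few base R functions you might try, using a made-up data frame called `toy_df` (an assumption for illustration only).

```r
# A small made-up data frame, used only for illustration
toy_df <- data.frame(
  gene = c("geneA", "geneB", "geneC"),
  score = c(1.5, 2.5, 3.5)
)

head(toy_df, n = 2)  # the first two rows
dim(toy_df)          # number of rows, then number of columns
colnames(toy_df)     # the column names
str(toy_df)          # column types and a preview of the values
```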
## Read in the metadata
Use `readr::read_tsv()` to read in the file "cleaned_metadata_GSE44971.tsv" and assign it the name `metadata`.
```{r read-metadata, solution = TRUE}
```
Use this chunk to explore what your data frame, `metadata`, looks like.
```{r explore-metadata, solution = TRUE}
```
### Selecting from data frames
Use `$` syntax to look at the `avg_expression` column of the `stats_df`
data frame.
```{r dollar, solution = TRUE}
```
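As a reminder of how `$` works, here is a short sketch on a made-up data frame. The names `toy_df` and `score` are hypothetical and only serve the example.

```r
# A small made-up data frame, used only for illustration
toy_df <- data.frame(
  gene = c("geneA", "geneB", "geneC"),
  score = c(4, 8, 16)
)

# `$` pulls out one column as a vector
toy_df$score
```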
Use the `min()` function to find what the minimum average expression in this dataset is.
Remember you can use `?min` or the help panel to find out more about a function.
```{r minimum-expr, solution = TRUE}
# Find the minimum average expression value
```
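One detail from `?min` worth knowing is the `na.rm` argument. The sketch below uses a made-up vector, `toy_scores`, to show what happens when a value is missing; whether your own data contain missing values is something to check rather than assume.

```r
# A made-up vector with one missing value
toy_scores <- c(4, 8, 16, NA)

# With a missing value present, the minimum is also missing
min(toy_scores)

# Setting na.rm = TRUE ignores missing values
min(toy_scores, na.rm = TRUE)
```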
Find the `log()`, using base 2, of the average expression values.
```{r log2-expr, solution = TRUE}
# Find the log base 2 of the average expression
```
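The `base` argument of `log()` controls which logarithm you get. Here is a brief sketch on a made-up vector (`toy_scores` is a hypothetical name); base R also provides `log2()` as a shortcut.

```r
# A made-up vector of powers of two
toy_scores <- c(4, 8, 16)

# log() is applied to every element of the vector
log(toy_scores, base = 2)

# log2() is a shortcut for the same calculation
log2(toy_scores)
```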
## Using logical arguments
Display the `adj_p_value` column of the `stats_df` data frame.
```{r show-p, solution = TRUE }
```
Find out which of these adjusted p-values are below a `0.05` cutoff using a logical statement.
```{r small-p, solution = TRUE}
```
Name the logical vector you created above as `significant_vector`.
```{r save-bool, solution = TRUE}
```
Use `sum()` with the object `significant_vector` to count how many adjusted p-values in the total set are below this cutoff.
To solve this, you might think about `TRUE` and `FALSE` values as an alternative way to represent `1` and `0`.
```{r sum-sig, solution = TRUE}
```
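To see why `sum()` can count `TRUE` values, consider this sketch with made-up numbers. The names `toy_p` and `is_small` are hypothetical and are not objects from this exercise.

```r
# Four made-up values to compare against a cutoff
toy_p <- c(0.01, 0.20, 0.04, 0.50)

# A comparison returns one TRUE or FALSE per element
is_small <- toy_p < 0.05
is_small

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
sum(is_small)
```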
## Filter the dataset
Select the column `contrast` from `stats_df`.
```{r select-contrast, solution = TRUE}
```
Construct a logical vector using the `contrast` column you selected above that
indicates which rows of `stats_df` are from the `astrocytoma_normal`
contrast test.
```{r contrast-logical, solution = TRUE}
```
Use `dplyr::filter()` to keep only the data for the `astrocytoma_normal` contrast
in `stats_df`, and assign the result to `astrocytoma_normal_df`.
```{r filter-contrast, solution = TRUE}
```
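Here is a small sketch of `dplyr::filter()` with an `==` comparison, followed by a row count to check the result. The data frame `toy_df` and its `group` column are made up for illustration.

```r
# A small made-up data frame, used only for illustration
toy_df <- data.frame(
  gene = c("geneA", "geneB", "geneC"),
  group = c("treated", "control", "treated")
)

# Keep only the rows where `group` is exactly "treated"
treated_df <- dplyr::filter(toy_df, group == "treated")

# Count the rows that were kept
nrow(treated_df)
```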
Use the `nrow()` function on `astrocytoma_normal_df` to see if your filter worked.
You should have `2268` rows.
```{r contrast-rows, solution = TRUE}
```
Save your filtered data to a TSV file using `readr::write_tsv()`.
Call it `astrocytoma_normal_contrast_results.tsv` and save it to the `results`
directory.
```{r write-df, solution = TRUE}
```
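Writing a TSV file mirrors reading one. In the sketch below, the data frame, the file name `toy_results.tsv`, and the use of `tempdir()` are assumptions for illustration; `file.path()` builds the path in a way that works across operating systems.

```r
# A small made-up data frame, used only for illustration
toy_df <- data.frame(
  gene = c("geneA", "geneB"),
  score = c(1.5, 2.5)
)

# Build the output path from a directory and a file name
toy_out <- file.path(tempdir(), "toy_results.tsv")

# Write the data frame to that path as tab-separated values
readr::write_tsv(toy_df, toy_out)

# Check that the file was written
file.exists(toy_out)
```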
### Create a density plot
Set up a ggplot object for `astrocytoma_normal_df` and set `x` as the average
expression variable.
Use the `+` to add on a layer called `geom_density()`.
```{r density-plot, solution = TRUE}
```
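The general shape of a density plot in ggplot2 is sketched below on randomly generated numbers. The data frame `toy_df`, its `score` column, and the seed value are assumptions made for illustration.

```r
library(ggplot2)

# Made-up data: 100 random draws from a normal distribution
set.seed(2020)
toy_df <- data.frame(score = rnorm(100))

# Start a plot with `score` on the x-axis, then add a density layer
toy_plot <- ggplot(toy_df, aes(x = score)) +
  geom_density()

# Print the plot
toy_plot
```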
Starting from the plot above, add a theme layer (e.g. `theme_classic()`) to change its overall look.
See the [ggplot2 complete themes reference](https://ggplot2.tidyverse.org/reference/ggtheme.html)
for a list of theme options.
```{r density-theme, solution = TRUE}
```
Feel free to make other customizations to this plot by adding more layers with `+`.
You can start by adding a `ylab()` and `xlab()` and then by getting inspiration
from this [handy cheatsheet for ggplot2](https://rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf).
```{r customize-plot, solution = TRUE}
# Customize your plot!
```
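As one possible starting point, the sketch below stacks a theme and axis labels onto a density plot of made-up data. The choice of `theme_bw()` and the label text are arbitrary examples, not recommendations.

```r
library(ggplot2)

# A small made-up data frame, used only for illustration
toy_df <- data.frame(score = c(1, 2, 2, 3, 3, 3, 4, 4, 5))

# Each `+` adds another layer or setting to the plot
toy_plot <- ggplot(toy_df, aes(x = score)) +
  geom_density() +
  theme_bw() +
  xlab("Made-up score") +
  ylab("Density")

# Print the plot
toy_plot
```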
Save your plot as a `PNG`.
```{r save-plot, solution = TRUE}
```
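One way to save a plot is `ggplot2::ggsave()`, which picks the file type from the file extension. In this sketch, the plot, the file name `toy_density.png`, and the width and height values are assumptions for illustration.

```r
library(ggplot2)

# A small made-up plot, used only for illustration
toy_df <- data.frame(score = c(1, 2, 2, 3, 3, 3, 4, 4, 5))
toy_plot <- ggplot(toy_df, aes(x = score)) +
  geom_density()

# The .png extension tells ggsave() to write a PNG file
toy_png <- file.path(tempdir(), "toy_density.png")
ggsave(filename = toy_png, plot = toy_plot, width = 5, height = 4)

# Check that the file was written
file.exists(toy_png)
```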
### Session Info
```{r}
sessionInfo()
```