-
Notifications
You must be signed in to change notification settings - Fork 1
/
workshop.qmd
518 lines (347 loc) · 21.5 KB
/
workshop.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
---
title: "Workshop"
subtitle: "Summarising data in with several variables and the role of variables in analysis"
toc: true
toc-location: right
---
```{r}
#| include: false
library(tidyverse)
```
# Introduction
![Data data Artwork from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst](images/ggplot2_masterpiece.png){fig-alt="Stylized text providing an overview of Tidy Data. The top reads 'Tidy data is a standard way of mapping the meaning of a dataset to its structure. - Hadley Wickham.' On the left reads 'In tidy data: each variable forms a column; each observation forms a row; each cell is a single measurement.' There is an example table on the lower right with columns ‘id’, ‘name’ and ‘color’ with observations for different cats, illustrating tidy data structure."}
## Session overview
In this workshop you will learn to summarise and plot datasets with more than one variable and how to write figures to files. You will also get more practice with working directories, importing data, formatting figures and the pipe.
## Philosophy
Workshops are not a test. It is expected that you often don't know how to start, make a lot of mistakes and need help. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. Tips
- don't worry about making mistakes
- don't let what you can not do interfere with what you can do
- discussing code with your neighbours will help
- look things up in the independent study material
- look things up in your own code from earlier workshops
- there are no stupid questions
::: callout-note
## Key
These four symbols are used at the beginning of each instruction so you know where to carry out the instruction.
![](images/do_on_your_computer.png) Something you need to do on your computer. It may be opening programs or documents or locating a file.
![](images/do_in_R.png) Something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.
![](images/do_on_internet.png) Something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.
![](images/answer.png) A question for you to think about and answer. Record your answers in your script for future reference.
:::
# Getting started
![](images/do_on_your_computer.png) Start RStudio from the Start menu.
![](images/do_in_R.png) Make an RStudio project for this workshop by clicking on the drop-down menu on top right where it says `Project: (None)` and choosing New Project, then New Directory, then New Project. Navigate to the `data-analysis-in-r-1` folder and name the RStudio Project 'week-9'.
![](images/do_in_R.png) Make a new folder called `data-raw`. You can do this on the Files Pane by clicking New Folder and typing into the box that appears.
![](images/do_in_R.png) Make a new script then save it with a name like `analysis.R` to carry out the rest of the work.
![](images/do_in_R.png) Add a comment to the script: `# Summarising data in with several variables and the role of variables in analysis`
![](images/do_in_R.png) Add code to load the **`tidyverse`** package [@tidyverse]
# Exercises
You are going to practice how to summarise and plot data as follows:
- one continuous response and one nominal explanatory variable with three groups
- one continuous response and one nominal explanatory variable with two groups but the supplied data first need converting to tidy format
- two continuous variables
Note that the [Independent Study to prepare for workshop](study_before_workshop.qmd) covered one continuous response and one nominal explanatory variable with two groups.
## Myoglobin in seal muscle
The myoglobin concentration of skeletal muscle of three species of seal in grams per kilogram of muscle was determined and the data are given in [seal.csv](data-raw/seal.csv). Each row represents an individual seal. The first column gives the myoglobin concentration and the second column indicates species.
### Import
![](images/do_on_your_computer.png) Save [seal.csv](data-raw/seal.csv) to your `data-raw` folder
![](images/do_in_R.png) Read the data into a dataframe called `seal`. . You might want to look up [data import from last week](../week-8/workshop.html#importing-data-from-files).
```{r}
#| include: false
#---CODING ANSWER---
seal <- read_csv("data-raw/seal.csv")
```
![](images/answer.png) What types of variables do you have in the `seal` dataframe? What role would you expect them to play in analysis?
<!-- #---THINKING ANSWER--- -->
<!-- there are two variables: `myoglobin` is the response and is continuous -->
<!-- and `species` is explanatory. `species` is categorical with three -->
<!-- levels (groups). -->
The key point here is that the fundamental structure of:
- one continuous response and one nominal explanatory variable with two groups ([adipocytes](data-raw/adipocytes.txt)), and
- one continuous response and one nominal explanatory variable with three groups ([seals](data-raw/seal.csv))
is the same! The only thing that differs is the number of groups (the number of values in the nominal variable). This means the code for summarising and plotting is *identical* except for the variable names!
::: callout-tip
## Tip
When two datasets have the same number of columns and the response variable and the explanatory variables have the same data types then the code you need is the same.
:::
### Summarise
Summarising the data for each species is the next sensible step. The most useful summary statistics for a continuous variable like `myoglobin` are the means, standard deviations, sample sizes and standard errors. You might remember from last week that we use the `group_by()` and `summarise()` functions along with the functions that do the calculations.
![](images/do_in_R.png) Create a data frame called `seal_summary` that contains the means, standard deviations, sample sizes and standard errors for the control and nicotinic acid treated samples.
```{r}
seal_summary <- seal |>
group_by(species) |>
summarise(mean = mean(myoglobin),
std = sd(myoglobin),
n = length(myoglobin),
se = std/sqrt(n))
```
You should get the following numbers:
```{r}
#| echo: false
knitr::kable(seal_summary) |> kableExtra::kable_styling()
```
### Visualise
Most commonly, we put the explanatory variable on the *x* axis and the response variable on the *y* axis. A continuous response, particularly one that follows the normal distribution, is best summarised with the mean and the standard error. In my opinion, you should also show all the raw data points if possible.
We are going to create a figure like this:
```{r}
#| echo: false
ggplot() +
geom_point(data = seal, aes(x = species, y = myoglobin),
position = position_jitter(width = 0.1, height = 0),
colour = "gray50") +
geom_errorbar(data = seal_summary,
aes(x = species, ymin = mean - se, ymax = mean + se),
width = 0.3) +
geom_errorbar(data = seal_summary,
aes(x = species, ymin = mean, ymax = mean),
width = 0.2) +
scale_y_continuous(name = "Myoglobin (g/kg)",
limits = c(0, 80),
expand = c(0, 0)) +
scale_x_discrete(name = "Species") +
theme_classic()
```
In this figure, we have the data points themselves which are in `seal` dataframe and the means and standard errors which are in the `seal_summary` dataframe. That is, we have two dataframes we want to plot.
Here you will learn that dataframes and aesthetics can be specified *within a `geom_xxxx`* (rather than in the `ggplot()`). This is very useful if the geom only applies to some of the data you want to plot.
::: callout-tip
## Tip: `ggplot()`
You put the `data` argument and `aes()` inside `ggplot()` if you want all the `geoms` to use that dataframe and variables. If you want a different dataframe for a `geom`, put the `data` argument and `aes()` inside the `geom_xxxx()`
:::
I will build the plot up in small steps **but you should edit your *existing* `ggplot()` command as we go.**
![](images/do_in_R.png) Plot the data points first.
```{r}
ggplot() +
geom_point(data = seal,
aes(x = species, y = myoglobin))
```
Notice how we have given the data argument and the aesthetics inside the geom. The variables `species` and `myoglobin` are in the `seal` dataframe
![](images/do_in_R.png) So the data points don't overlap, we can add some random jitter in the *x* direction (edit your existing code):
```{r}
ggplot() +
geom_point(data = seal,
aes(x = species, y = myoglobin),
position = position_jitter(width = 0.1, height = 0))
```
Note that `position = position_jitter(width = 0.1, height = 0)` is inside the `geom_point()` parentheses, after the `aes()` and a comma.
We've set the vertical jitter to 0 because, in contrast to the categorical *x*-axis, movement on the *y*-axis has meaning (the myoglobin levels).
![](images/do_in_R.png) Let's make the points a light grey (edit your existing code):
```{r}
ggplot() +
geom_point(data = seal,
aes(x = species, y = myoglobin),
position = position_jitter(width = 0.1, height = 0),
colour = "grey50")
```
Now to add the errorbars. These go from one standard error below the mean to one standard error above the mean.
![](images/do_in_R.png) Add a `geom_errorbar()` for errorbars (edit your existing code):
```{r}
ggplot() +
geom_point(data = seal, aes(x = species, y = myoglobin),
position = position_jitter(width = 0.1, height = 0),
colour = "grey50") +
geom_errorbar(data = seal_summary,
aes(x = species, ymin = mean - se, ymax = mean + se),
width = 0.3)
```
We have specified the `seal_summary` dataframe and the variables `species`, `mean` and `se` are in that.
There are several ways you could add the mean. You could use `geom_point()` but I like to use `geom_errorbar()` again with the `ymin` and `ymax` both set to the mean.
![](images/do_in_R.png) Add a `geom_errorbar()` for the mean (edit your existing code):
```{r}
ggplot() +
geom_point(data = seal, aes(x = species, y = myoglobin),
position = position_jitter(width = 0.1, height = 0),
colour = "grey50") +
geom_errorbar(data = seal_summary,
aes(x = species, ymin = mean - se, ymax = mean + se),
width = 0.3) +
geom_errorbar(data = seal_summary,
aes(x = species, ymin = mean, ymax = mean),
width = 0.2)
```
![](images/do_in_R.png) Alter the axis labels and limits using `scale_y_continuous()` and `scale_x_discrete()` (edit your existing code):
```{r}
ggplot() +
geom_point(data = seal, aes(x = species, y = myoglobin),
position = position_jitter(width = 0.1, height = 0),
colour = "grey50") +
geom_errorbar(data = seal_summary,
aes(x = species, ymin = mean - se, ymax = mean + se),
width = 0.3) +
geom_errorbar(data = seal_summary,
aes(x = species, ymin = mean, ymax = mean),
width = 0.2) +
scale_y_continuous(name = "Myoglobin (g/kg)",
limits = c(0, 80),
expand = c(0, 0)) +
scale_x_discrete(name = "Species")
```
You only need to use `scale_y_continuous()` and `scale_x_discrete()` to use labels that are different from those in the dataset. Often this is to use proper terminology and captialisation.
![](images/do_in_R.png) Format the figure in a way that is more suitable for including in a report using `theme_classic()` (edit your existing code):
```{r}
ggplot() +
geom_point(data = seal, aes(x = species, y = myoglobin),
position = position_jitter(width = 0.1, height = 0),
colour = "grey50") +
geom_errorbar(data = seal_summary,
aes(x = species, ymin = mean - se, ymax = mean + se),
width = 0.3) +
geom_errorbar(data = seal_summary,
aes(x = species, ymin = mean, ymax = mean),
width = 0.2) +
scale_y_continuous(name = "Myoglobin (g/kg)",
limits = c(0, 80),
expand = c(0, 0)) +
scale_x_discrete(name = "Species") +
theme_classic()
```
### Writing figures to file
![](images/do_in_R.png) Make a new folder called `figures`.
![](images/do_in_R.png) Edit you ggplot code so that you assign the figure to a variable.
```{r}
sealfig <- ggplot() +
geom_point(data = seal, aes(x = species, y = myoglobin),
position = position_jitter(width = 0.1, height = 0),
colour = "grey50") +
geom_errorbar(data = seal_summary,
aes(x = species, ymin = mean - se, ymax = mean + se),
width = 0.3) +
geom_errorbar(data = seal_summary,
aes(x = species, ymin = mean, ymax = mean),
width = 0.2) +
scale_y_continuous(name = "Myoglobin (g/kg)",
limits = c(0, 80),
expand = c(0, 0)) +
scale_x_discrete(name = "Species") +
theme_classic()
```
The figure won't be shown in the Plots tab - the output has gone into `sealfig` rather than to the Plots tab. To make it appear in the Plots tab type `sealfig`
![](images/do_in_R.png) The `ggsave()` command will write a ggplot figure to a file:
```{r}
ggsave("figures/seal-muscle.png",
plot = sealfig,
device = "png",
width = 4,
height = 3,
units = "in",
dpi = 300)
```
`figuresseal-muscle.png` is the name of the file, including the relative path.
![](images/do_in_R.png) Look up `ggsave()` in the manual to understand the arguments. You can do this by putting your cursor on the command and pressing F1
## Pigeons
The data in [pigeon.txt](data-raw/pigeon.txt) are 40 measurements of interorbital width (in mm) for two populations of domestic pigeons measured to the nearest 0.1mm
![Interorbital width is the distance between the eyes](images/interorbital.png){fig-alt="Photo of a domestic pigeon, head on, with an arrow showing the distance between the eyes."}
### Import
![](images/do_on_your_computer.png) Save [pigeon.txt](data-raw/pigeon.txt) to your `data-raw` folder
![](images/do_in_R.png) Read the data into a dataframe called `pigeons`.
```{r}
#| include: false
#---CODING ANSWER---
pigeons <- read_table("data-raw/pigeon.txt")
```
![](images/answer.png) What variables are there in the `pigeons` dataframe?
<!-- #---THINKING ANSWER--- -->
<!-- This is not obvious - there are two columns called A and B. -->
<!-- Both seem to contain contain continuous values, presumably the interorbital -->
<!-- distances. The A and B seem to be the names of the populations -->
Hummmm, these data are not organised like the other data sets we have used. The population is given as the column names and the interorbital distances for one population are given in a different column than those for the other population. The first row has data from two pigeons which have nothing in common, they just happen to be the first individual recorded in each population.
```{r}
#| echo: false
knitr::kable(pigeons) |>
kableExtra::kable_styling() |>
kableExtra::scroll_box(height = "200px")
```
This data is not in 'tidy' format [@Wickham2014-nl].
Tidy format has variables in column and observations in rows. All of the distance measurements should be in one column and a second column should give the population.
```{r}
#| echo: false
pigeons |>
pivot_longer(cols = everything(),
names_to = "population",
values_to = "distance") |>
knitr::kable() |>
kableExtra::kable_styling() |>
kableExtra::scroll_box(height = "200px")
```
Data which is in tidy format is easier to summarise, analyses and plot because the organisation matches the conceptual structure of the data:
- it is more obvious what the variables are because they columns are named with them - in the untidy format, that the measures are distances is not clear and what A and B are isn't clear
- it is more obvious that there is no relationship between any of the pigeons except for population
- functions are designed to work with variables in columns
### Tidying data
We can put this data in such a format with the `pivot_longer()` function from the `tidyverse`:
`pivot_longer()` collects the values from specified columns (`cols`) into a single column (`values_to`) and creates a column to indicate the group (`names_to`).
![](images/do_in_R.png) Put the data in tidy format:
```{r}
pigeons <- pivot_longer(data = pigeons,
cols = everything(),
names_to = "population",
values_to = "distance")
```
We have overwritten the original dataframe. If you wanted to keep the original you would need to give a new name on the left side of the assignment `<-` Note: the data in the file are unchanged.
## Ulna and height
The datasets we have used up to this point, have had a continuous variable and a categorical variable where it makes sense to summarise the response for each of the different groups in the categorical variable and plot the response on the *y*-axis. We will now summarise a dataset with two continuous variables. The data in [height.txt](data-raw/height.txt) are the ulna length (cm) and height (m) of 30 people. In this case, it is more appropriate to summarise both of thee variables and to plot them as a scatter plot.
We will use `summarise()` again but we do not need the `group_by()` function this time. We will also need to use each of the summary functions, such as `mean()`, twice, once for each variable.
### Import
![](images/do_on_your_computer.png) Save [height.txt](data-raw/height.txt) to your `data-raw` folder
![](images/do_in_R.png) Read the data into a dataframe called `ulna_heights`.
```{r}
#| include: false
#---CODING ANSWER---
ulna_heights <- read_table("data-raw/height.txt")
```
### Summarise
![](images/do_in_R.png) Create a data frame called `ulna_heights_summary` that contains the sample size and means, standard deviations and standard errors for both variables.
```{r}
ulna_heights_summary <- ulna_heights |>
summarise(n = length(ulna),
mean_ulna = mean(ulna),
std_ulna = sd(ulna),
se_ulna = std_ulna/sqrt(n),
mean_height = mean(height),
std_height = sd(height),
se_height = std_height/sqrt(n))
```
You should get the following numbers:
```{r}
#| echo: false
knitr::kable(ulna_heights_summary) |> kableExtra::kable_styling()
```
### Visualise
To plot make a scatter plot we need to use `geom_point()` again but without any scatter. In this case, it does not really matter which variable is on the *x*-axis and which is on the *y*-axis.
![](images/do_in_R.png) Make a simple scatter plot
```{r}
ggplot(data = ulna_heights, aes(x = ulna, y = height)) +
geom_point()
```
If you have time, you may want to format the figure more appropriately.
```{r}
#| echo: false
#---CODING ANSWER---
ulnafig <- ggplot(data = ulna_heights, aes(x = ulna, y = height)) +
geom_point() +
scale_x_continuous(name = "Ulna length (cm)",
limits = c(0, 40),
expand = c(0, 0)) +
scale_y_continuous(name = "Height (m)",
limits = c(0, 2.1),
expand = c(0, 0)) +
theme_classic()
ggsave("figures/ulna_heights.png",
plot = ulnafig,
device = "png",
width = 3,
height = 3,
units = "in",
dpi = 300)
```
<!-- ## Look after future you! -->
<!-- Future you is going to summarise and plot data from the "River practicals". You can make this much easier by documenting what you have done now. At the moment all of your code from this workshop is in a single file, probably called `analysis.R`. I recommend making a new script for each of data set and copying the code which imports, summarises and plots it. This will make it easier for future you to find the code you need. Here is an example: [seals-analysis.R](assets/seals-analysis.R). You may wish to comment your version much more. -->
You're finished!
# 🥳 Well Done! 🎉
![Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst](images/tidy-data.jpg){fig-alt="Stylized text providing an overview of Tidy Data. The top reads 'Tidy data is a standard way of mapping the meaning of a dataset to its structure. - Hadley Wickham.' On the left reads 'In tidy data: each variable forms a column; each observation forms a row; each cell is a single measurement.' There is an example table on the lower right with columns ‘id’, ‘name’ and ‘color’ with observations for different cats, illustrating tidy data structure"}
# Independent study following the workshop
[Consolidate](study_after_workshop.qmd)
# The Code file
This contains all the code needed in the workshop even where it is not visible on the webpage.
The `workshop.qmd` file is the file I use to compile the practical. Qmd stands for Quarto markdown. It allows code and ordinary text to be interweaved to produce well-formatted reports including webpages. [View the Qmd in Browser](https://github.com/3mmaRand/R4BABS/blob/main/r4babs1/week-9/workshop.qmd). Coding and thinking answers are marked with `#---CODING ANSWER---` and `#---THINKING ANSWER---`
Pages made with R [@R-core], Quarto [@allaire2022], `knitr` [@knitr], `kableExtra` [@kableExtra]
# References