# Data skills - Part 2 {#dataskills2}
```{r include=FALSE, message=FALSE, warning=FALSE}
suppressPackageStartupMessages({
library(doBy)
library(Hmisc)
library(dplyr)
library(lubridate)
library(lgrdata)
library(ggplot2)
library(ggthemes)
library(padr)
library(gplots)
library(reshape2)
library(stringr)
library(glue)
library(tidyr)
library(zoo)
})
source("R/theme_datapelikaan.R")
library(showtext)
font_add_google(name = "Lato", family = "Lato", regular.wt = 400, bold.wt = 700)
library(ggplot2)
theme_set(theme_datapelikaan(base_family = "Lato"))
library(knitr)
current_output <- opts_knit$get("rmarkdown.pandoc.to")
opts_knit$set(kable.force.latex = TRUE)
knit_theme$set("earendel")
opts_chunk$set(background="grey94",
fig.showtext = TRUE,
dev = ifelse(current_output == "latex", "pdf", "svg"))
```
## Summarizing dataframes
There are a few useful functions to print general summaries of a dataframe, to see which variables are included, what types of data they contain, and so on. We already looked at `summary` and `str` in Section \@ref(dataframes).
Two more very useful functions are from the `Hmisc` package. The first, `describe`, is much like `summary`, but offers slightly more sophisticated statistics. The second, `contents`, is similar to `str`, but does a very nice job of summarizing the `factor` variables in your dataframe, and prints the number of missing values, the number of rows, and so on.
```{r echo=FALSE}
suppressPackageStartupMessages(library(Hmisc))
```
```{r}
# Load data
data(pupae)
# Make sure CO2_treatment is a factor (it will be read as a number)
pupae$CO2_treatment <- as.factor(pupae$CO2_treatment)
# Show contents:
library(Hmisc)
contents(pupae)
```
```{r echo=FALSE}
detach("package:Hmisc")
```
Here, `storage` refers to the internal storage type of the variable: note that the factor variables are stored as 'integer', and other numbers as 'double' (this refers to the precision of the number).
### Making summary tables {#tapplyaggregate}
#### Summarizing vectors with `tapply` {#tapply}
If we have the following dataset called `plantdat`,
```{r echo=FALSE, out.width='33%'}
knitr::include_graphics("screenshots/exampledata.png")
```
and execute the command
```{r eval=FALSE}
with(plantdat, tapply(Plantbiomass, Treatment, mean))
```
we get the result
```{r echo=FALSE, out.width='33%'}
knitr::include_graphics("screenshots/tapplyresult.png")
```
Note that the result is a `vector` (elements of a vector can have names, like columns of a dataframe).
If we have the following dataset called `plantdat2`,
```{r echo=FALSE, out.width='33%'}
knitr::include_graphics("screenshots/exampledatalarger.png")
```
and execute the command
```{r eval=FALSE}
with(plantdat2, tapply(Plantbiomass, list(Species, Treatment), mean))
```
we get the result
```{r echo=FALSE, out.width='33%'}
knitr::include_graphics("screenshots/tapplyresultlarger.png")
```
Note that the result here is a `matrix`, where `A` and `B`, the species codes, are the rownames of this matrix.
Often, you want to summarize a variable by the levels of another variable. For example, in the `rain` data, the `Rain` variable gives daily values, but we might want to calculate annual sums,
```{r}
# Read data
data(rain)
# Annual rain totals.
with(rain, tapply(Rain, Year, FUN=sum))
```
The `tapply` function applies a function (`sum`) to a vector (`Rain`), that is split into chunks depending on another variable (`Year`).
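To see the splitting at work, here is a minimal toy example (the vectors are made up purely for illustration): `x` is split into chunks according to the grouping vector `g`, and `sum` is applied to each chunk.

```{r}
x <- c(1, 2, 3, 10, 20)
g <- c("a", "a", "a", "b", "b")
# One sum per group: a = 1+2+3 = 6, b = 10+20 = 30
tapply(x, g, sum)
```

The result is a named vector with one element per level of the grouping variable.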
We can also use the `tapply` function on more than one variable at a time. Consider these examples on the `pupae` data.
```{r}
# Average pupal weight by CO2 and T treatment:
with(pupae, tapply(PupalWeight, list(CO2_treatment, T_treatment), FUN=mean))
# Further split the averages, by gender of the pupae.
with(pupae, tapply(PupalWeight, list(CO2_treatment, T_treatment, Gender), FUN=mean))
```
As the examples show, the `tapply` function produces summary tables by one or more factors. The resulting object is either a vector (when using one factor), or a matrix (as in the examples using the pupae data).
The limitations of `tapply` are that you can only summarize one variable at a time, and that the result is not a dataframe.
The main advantage of `tapply` is that we can use it as input to `barplot`, as the following example demonstrates (Fig. \@ref(fig:pupgroupedbar)).
```{r pupgroupedbar, fig.cap="A grouped barplot of average pupal weight by CO2 and Gender for the pupae dataset. This is easily achieved via the use of tapply.", opts.label="smallsquare"}
# Pupal weight by CO2 and Gender. Result is a matrix.
pupm <- with(pupae, tapply(PupalWeight, list(CO2_treatment,Gender),
mean, na.rm=TRUE))
# When barplot is provided a matrix, it makes a grouped barplot.
# We specify xlim to make some room for the legend.
barplot(pupm, beside=TRUE, legend.text=TRUE, xlim=c(0,8),
xlab="Gender", ylab="Pupal weight")
```
#### Quick summary tables with `summaryBy` {#summaryby}
If we have the following dataset called `plantdat`,
```{r echo=FALSE, out.width='33%'}
knitr::include_graphics("screenshots/exampledata.png")
```
and execute the command
```{r eval=FALSE}
library(doBy)
summaryBy(Plantbiomass ~ Treatment, FUN=mean, data=plantdat)
```
we get the result
```{r echo=FALSE, out.width='33%'}
knitr::include_graphics("screenshots/summarybyresult.png")
```
Note that the result here is a `dataframe`.
If we have the following dataset called `plantdat2`,
```{r echo=FALSE, out.width='33%'}
knitr::include_graphics("screenshots/exampledatalarger.png")
```
and execute the command
```{r eval=FALSE}
summaryBy(Plantbiomass ~ Species + Treatment, FUN=mean, data=plantdat2)
```
we get the result
```{r echo=FALSE, out.width='33%'}
knitr::include_graphics("screenshots/summarybyresultlarger.png")
```
Note that the result here is a `dataframe`.
In practice, it is often useful to make summary tables of multiple variables at once, and to end up with a dataframe. In this book we first use `summaryBy`, from the `doBy` package, to achieve this.
With `summaryBy`, we can generate multiple summaries (mean, standard deviation, etc.) on more than one variable in a dataframe at once. We can use a convenient formula interface for this. It is of the form,
```{r eval=FALSE}
summaryBy(Yvar1 + Yvar2 ~ Groupvar1 + Groupvar2, FUN=c(mean,sd), data=mydata)
```
where we summarize the (numeric) variables `Yvar1` and `Yvar2` by all combinations of the (factor) variables `Groupvar1` and `Groupvar2`.
```{r echo=FALSE}
suppressPackageStartupMessages(library(doBy))
```
```{r}
# Load the doBy package
library(doBy)
# read pupae data if you have not already
data(pupae)
# Get mean and standard deviation of Frass by CO2 and T treatments
summaryBy(Frass ~ CO2_treatment + T_treatment,
data=pupae, FUN=c(mean,sd))
# Note that there is a missing value. We can specify na.rm=TRUE,
# which will be passed to both mean() and sd(). It works because those
# functions recognize that argument (i.e. na.rm is NOT an argument of
# summaryBy itself!)
summaryBy(Frass ~ CO2_treatment + T_treatment,
data=pupae, FUN=c(mean,sd), na.rm=TRUE)
# However, if we use a function that does not recognize it, we first have to
# exclude all missing values before making a summary table, like this:
pupae_nona <- pupae[complete.cases(pupae),]
# Get mean and standard deviation for
# the pupae data (Pupal weight and Frass), by CO2 and T treatment.
# Note that length() does not recognize na.rm (see ?length), which is
# why we have excluded any NA from pupae first.
summaryBy(PupalWeight+Frass ~ CO2_treatment + T_treatment,
data=pupae_nona,
FUN=c(mean,sd,length))
```
You can also use any function that returns a vector of results. In the following example we calculate the 5% and 95% quantiles of all numeric variables in the allometry dataset. To do this, use `.` for the left-hand side of the formula.
```{r}
# . ~ species means 'all numeric variables by species'.
# Extra arguments to the function used (in this case quantile) can be set here as well,
# they will be passed to that function (see ?quantile).
data(allometry)
summaryBy(. ~ species, data=allometry, FUN=quantile, probs=c(0.05, 0.95))
```
#### Summarizing dataframes with `dplyr` {#dplyr}
We started with `summaryBy` (from the `doBy` package), since it is so easy to use. A more modern and popular approach is to use `dplyr` for all your dataframe summarizing needs. One major advantage is that `dplyr` is very fast. For datasets with hundreds of thousands of rows, you will notice a substantial speed increase. For millions of rows, you really need `dplyr` (or `data.table`, but we do not cover that package in this book).
Making summary tables as we did in the above examples requires two steps:
1. Group your dataframe by one or more factor variables
2. Apply summarizing functions to each of the groups
As is the norm, we use the pipe operator (`%>%`) to keep the steps apart. Here is a simple example to get started.
```{r eval=TRUE, echo=TRUE}
library(dplyr)
group_by(pupae, CO2_treatment, T_treatment) %>%
summarize(Frass_mean = mean(Frass, na.rm=TRUE),
Frass_sd = sd(Frass, na.rm=TRUE))
```
The `dplyr` package (and others) produces tibbles rather than plain dataframes; tibbles are really just dataframes with some adjusted printing methods. If you want an actual dataframe, for example to compare the output directly with the previous examples, wrap the result in `as.data.frame`.
In `summarize` you can specify each of the new variables that should be produced, in this case giving mean and standard deviation of Frass. If we want to apply a number of functions over many variables, we can use `summarize_at`, like so:
```{r}
group_by(pupae, CO2_treatment, T_treatment) %>%
summarize_at(.vars = c("Frass", "PupalWeight"),
.funs = c("mean", "sd"))
```
The result is identical to our last example with `summaryBy`.
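Note that in recent versions of `dplyr` (1.0 and later), `summarize_at` is superseded by `across`. The same table can be produced like this (a sketch, assuming an up-to-date `dplyr`):

```{r}
# across() takes the columns to summarize and a named list of functions;
# output columns are named <variable>_<function> by default.
group_by(pupae, CO2_treatment, T_treatment) %>%
  summarize(across(c(Frass, PupalWeight), list(mean = mean, sd = sd)))
```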
Let's look at a more advanced example using weather data collected at the Hawkesbury Forest Experiment in 2008. The data given are in half-hourly time steps. It is a reasonable request to provide data as daily averages (for temperature) and daily sums (for precipitation).
The following code produces a daily weather dataset, and Fig. \@ref(fig:hfemetaggregate).
```{r hfemetaggregate, fig.cap='Daily rainfall at the HFE in 2008', opts.label="smallsquare", warning=FALSE}
# Read data, convert DateTime field to a proper Datetime class using
# lubridate's mdy_hm function, and add a Date column with as.Date.
# Instead of loading packages, we use :: in the example below to make sure
# the functions are used from the right package.
data(hfemet2008)
hfemet_agg <- hfemet2008 %>%
mutate(DateTime = lubridate::mdy_hm(DateTime),
Date = as.Date(DateTime)) %>%
group_by(Date) %>%
summarize(Rain = sum(Rain),
Tair = mean(Tair))
# A simple plot of daily rainfall.
library(ggplot2)
ggplot(hfemet_agg, aes(x = Date, y = Rain)) +
geom_bar(stat="identity")
```
#### Using `padr` to aggregate timeseries data
In the previous example we saw a quick and concise method to aggregate and filter a dataframe that includes a single date-time column (here loosely referred to as timeseries data). In that example we calculated daily totals and averages, but what if we want to aggregate by other timespans, like two-week intervals, 6 months, or 15 minutes? The `padr` package is very convenient for this sort of operation.
The following example uses `padr` in combination with several functions from `dplyr` to make a table of total rainfall in 4 hour increments for one day, using the `hfemet2008` dataset.
```{r}
library(dplyr)
library(padr)
library(lubridate)
# From the lgrdata package
data(hfemet2008)
mutate(hfemet2008, DateTime = mdy_hm(DateTime)) %>% # convert to proper DateTime
  filter(as.Date(DateTime) == "2008-06-03") %>% # select a single day
thicken("4 hours", round="down") %>% # add datetime in 4hour steps
group_by(DateTime_4_hour) %>% # set grouping variable
summarize(Rain = sum(Rain)) %>% # sum over the grouping variable
select(DateTime = DateTime_4_hour, Rain) # show two variables
```
### Tables of counts {#xtabs}
It is often useful to count the number of observations by one or more factors. One option is to use `tapply` or `summaryBy` in combination with the `length` function. A much better alternative is to use the `xtabs` and `ftable` functions, in addition to the simple `table` function. Alternatively, `dplyr` provides a `count` function. We will look at both options.
Consider these examples using the Titanic data.
```{r tidy = FALSE}
# Read titanic data
data(titanic)
# Count observations by passenger class
table(titanic$PClass)
# With more grouping variables, it is more convenient to use xtabs.
# Count observations by combinations of passenger class, sex, and whether they survived:
xtabs( ~ PClass + Sex + Survived, data=titanic)
# The previous output is hard to read, consider using ftable on the result:
ftable(xtabs( ~ PClass + Sex + Survived, data=titanic))
# Using dplyr, the result is a dataframe (actually, a tibble)
library(dplyr)
titanic %>% count(PClass, Sex, Survived)
```
### Adding summary variables to dataframes {#summaryvars}
We saw how `tapply` can make simple tables of averages (or totals, or other functions) of some variable by the levels of one or more factor variables. The result of `tapply` is typically a vector with a length equal to the number of levels of the factor you summarized by (see examples in Section \@ref(tapply)).
Consider the `allometry` dataset, which includes tree height for three species. Suppose you want to add a new variable 'MaxHeight', that is the maximum tree height observed per species. We can use `ave` to achieve this:
```{r}
# Read data
allom <- allometry %>%
mutate(MaxHeight = ave(height, species, FUN=max))
# Look at first few rows (or just type allom to see whole dataset)
head(allom)
```
Note that you can use any function in place of `max`, as long as that function can take a vector as an argument, and returns a single number.
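A minimal toy illustration (made-up numbers) of how `ave` differs from `tapply`: the group summary is repeated for every row, so the result has the same length as the input and can be assigned directly as a new column.

```{r}
# The group maximum, repeated within each group:
# group "a" has max 3, group "b" has max 10.
ave(c(1, 2, 3, 10), c("a", "a", "a", "b"), FUN = max)
```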
```{block2, type="rmdtry"}
If you want results similar to `ave`, you can use `summaryBy` with the argument `full.dimension=TRUE`. Try `summaryBy` on the `pupae` dataset with that argument set, and compare the result to `full.dimension=FALSE`, which is the default.
```
### Reordering factor levels based on a summary variable {#reorder}
It is often useful to tabulate your data in a meaningful order. We saw that, when using `summaryBy`, `tapply` or similar functions, the results are always in the order of your factor levels. Recall that the default order is alphabetical. This is rarely what you want.
You can reorder the factor levels by some summary variable. For example,
```{r warning=FALSE}
# Reorder factor levels for 'Manufacturer' in the cereal data
# by the mean amount of sodium.
# Read data, show default (alphabetical) levels:
data(cereals)
levels(cereals$Manufacturer)
# Now reorder:
cereals <- mutate(cereals,
Manufacturer = reorder(Manufacturer, sodium,
median, na.rm=TRUE))
# And inspect the new levels
levels(cereals$Manufacturer)
# Tables are now printed in order:
with(cereals, tapply(sodium, Manufacturer, median))
```
This trick comes in handy when making barplots; it is customary to plot them in ascending order if there is no specific order to the factor levels, as in this example.
The following code produces Fig. \@ref(fig:coweetabar).
```{r coweetabar, fig.cap='An ordered barplot for the coweeta tree data (error bars are 1 SD).', warning=FALSE, opts.label="largesquare"}
# Here we read the data, add a reordered factor variable,
# and continue with making the summary table used in the plot.
data(coweeta)
coweeta_table <- coweeta %>%
mutate(species = reorder(species, height, mean, na.rm=TRUE)) %>%
group_by(species) %>%
dplyr::summarize(height_mean = mean(height, na.rm=TRUE),
height_sd = sd(height, na.rm=TRUE))
ggplot(coweeta_table, aes(x = species, y = height_mean)) +
geom_bar(stat="identity", fill="red") +
geom_errorbar(aes(ymin = height_mean - height_sd,
ymax = height_mean + height_sd), width=0.2) +
labs(y = "Height (m)", x = "Species")
```
The above example uses the more modern approach with `ggplot2` and `dplyr`, but we can get practically the same result with `doBy` and `gplots` - below is the code for comparison (output not shown).
```{r results='hide', fig.keep='none'}
library(doBy)
coweeta$species <- with(coweeta, reorder(species, height, mean, na.rm=TRUE))
coweeta_agg <- summaryBy(height ~ species, data=coweeta, FUN=c(mean,sd))
# For barplot2, which adds options for error bars
library(gplots)
# This par setting makes the x-axis labels vertical, so they don't overlap.
par(las=2)
with(coweeta_agg, barplot2(height.mean, names.arg=species,
space=0.3, col="red",plot.grid=TRUE,
ylab="Height (m)",
plot.ci=TRUE,
ci.l=height.mean - height.sd,
ci.u=height.mean + height.sd))
```
```{block2, type="rmdtry"}
The above example orders the factor levels by increasing mean height.
Try reversing the factor levels, using the following code after `reorder`:
`coweeta$species <- factor(coweeta$species, levels=rev(levels(coweeta$species)))`
Here we used `rev` to reverse the levels.
```
## Combining dataframes
### Merging dataframes {#merge}
If we have the following dataset called `plantdat`,
```{r echo=FALSE, out.width='33%'}
knitr::include_graphics("screenshots/exampledata.png")
```
and we have another dataset, that includes the same `PlantID` variable (but is not necessarily ordered, nor does it have to include values for every plant):
```{r echo=FALSE, out.width='33%'}
knitr::include_graphics("screenshots/leafnitrogendata.png")
```
and execute the command
```{r eval=FALSE}
merge(plantdat, leafnitrogendata, by="PlantID")
```
we get the result
```{r echo=FALSE, out.width='33%'}
knitr::include_graphics("screenshots/mergeresult.png")
```
Note the missing value (`NA`) for the plant for which no leaf nitrogen data was available.
In many problems, you do not have a single dataset that contains all the measurements you are interested in -- unlike most of the example datasets in this tutorial. Suppose you have two datasets that you would like to combine, or `merge`. This is straightforward in R, but there are some pitfalls.
Let's start with a common situation when you need to combine two datasets that have a different number of rows.
```{r}
# Two dataframes
data1 <- data.frame(unit=c("x","x","x","y","z","z"),Time=c(1,2,3,1,1,2))
data2 <- data.frame(unit=c("y","z","x"), height=c(3.4,5.6,1.2))
# Look at the dataframes
data1
data2
# Merge dataframes:
combdata <- merge(data1, data2, by="unit")
# Combined data
combdata
```
Sometimes, the variable you are merging with has a different name in either dataframe. In that case, you can either rename the variable before merging, or use the following option:
```{r eval = FALSE}
merge(data1, data2, by.x="unit", by.y="item")
```
Where `data1` has a variable called 'unit', and `data2` has a variable called 'item'.
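As a self-contained sketch with made-up data: note that the merge column keeps the name from the first dataframe (`unit`).

```{r}
d1 <- data.frame(unit = c("x", "y", "z"), Time = c(1, 1, 2))
d2 <- data.frame(item = c("y", "z", "x"), height = c(3.4, 5.6, 1.2))
# Match d1$unit against d2$item; the result has a 'unit' column, no 'item'.
merge(d1, d2, by.x = "unit", by.y = "item")
```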
Other times you need to merge two dataframes with multiple key variables. Consider this example, where two dataframes have measurements on the same units at some of the same times, but on different variables:
```{r}
# Two dataframes
data1 <- data.frame(unit=c("x","x","x","y","y","y","z","z","z"),
Time=c(1,2,3,1,2,3,1,2,3),
Weight=c(3.1,5.2,6.9,2.2,5.1,7.5,3.5,6.1,8.0))
data2 <- data.frame(unit=c("x","x","y","y","z","z"),
Time=c(1,2,2,3,1,3),
Height=c(12.1,24.4,18.0,30.8,10.4,32.9))
# Look at the dataframes
data1
data2
# Merge dataframes:
combdata <- merge(data1, data2, by=c("unit","Time"))
# By default, only those times appear in the dataset that have measurements
# for both Weight (data1) and Height (data2)
combdata
# To include all data, use this command. This produces missing values for some times:
merge(data1, data2, by=c("unit","Time"), all=TRUE)
# Compare this result with 'combdata' above!
```
### Using join from `dplyr` {#dplyrjoin}
We showed how to use the `merge` function above, which is provided by base R. For larger datasets, it is advisable to use the `join*` functions from `dplyr`.
Instead of specifying which rows to keep with arguments `all.x`, `all`, etc., `dplyr` provides several functions that should make some intuitive sense. The table below compares `merge` and `join*`.
+------------------------------------+-------------------------------+
| `merge()` | `dplyr::join*` |
+====================================+===============================+
| `merge(dat1, dat2, all = FALSE)` | `inner_join(dat1, dat2)` |
+------------------------------------+-------------------------------+
| `merge(dat1, dat2, all.x = TRUE)` | `left_join(dat1, dat2)` |
+------------------------------------+-------------------------------+
| `merge(dat1, dat2, all.y = TRUE)` | `right_join(dat1, dat2)` |
+------------------------------------+-------------------------------+
| `merge(dat1, dat2, all = TRUE)` | `full_join(dat1, dat2)` |
+------------------------------------+-------------------------------+
One other function is provided that has no simple equivalent in base R, `anti_join`, which can be used to find all observations that have *no match* between the two datasets. This can be handy for error-checking.
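A minimal sketch of `anti_join` with made-up data: it returns the rows of the first dataframe that have no match in the second.

```{r}
library(dplyr)
d1 <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
d2 <- data.frame(id = c(2, 3, 4))
# Only the row with id = 1 survives: it has no match in d2.
anti_join(d1, d2, by = "id")
```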
Consider the cereal dataset, which gives measurements of all sorts of contents of cereals. Suppose the measurements for 'protein', 'vitamins' and 'sugars' were all produced by different laboratories, and each lab sent you a separate dataset. To make things worse, some measurements for sugars and vitamins are missing, because samples were lost in those labs.
```{r}
# Read the three datasets given to you from the three different labs:
data(cereal1)
data(cereal2)
data(cereal3)
# As always, look at the first few rows of each dataset.
head(cereal1, 3)
head(cereal2, 3)
head(cereal3, 3)
# The name of the variable that ties the datasets together
# (the cereal name) differs between the datasets, as does the number of rows.
# We can use merge() three times, but the data are easiest to merge with dplyr:
library(dplyr)
cereal_combined <- full_join(cereal1, cereal2, by=c("Cereal.name" = "cerealbrand")) %>%
full_join(cereal3, by=c("Cereal.name" = "cerealname"))
cereal_combined
# Note that missing values (NA) have been inserted where some data were not available.
```
### Row-binding dataframes {#rbind}
If we have the following dataset called `plantdat`,
```{r echo=FALSE, out.width='33%'}
knitr::include_graphics("screenshots/rbindinput.png")
```
and we have another dataset (`plantdatmore`), *with exactly the same columns* (including the names and order of the columns),
```{r echo=FALSE, out.width='33%'}
knitr::include_graphics("screenshots/moredataforrbind.png")
```
and execute the command
```{r eval=FALSE}
rbind(plantdat, plantdatmore)
```
we get the result
```{r echo=FALSE, out.width='33%'}
knitr::include_graphics("screenshots/rbindresult.png")
```
Using `merge`, we were able to glue dataframes together side-by-side based on one or more 'index' variables. Sometimes you have multiple datasets that can be glued together top-to-bottom, for example when you have multiple very similar dataframes. We can use the `rbind` function, like so:
```{r}
# Some fake data
mydata1 <- data.frame(var1=1:3, var2=5:7)
mydata2 <- data.frame(var1=4:6, var2=8:10)
# The dataframes have the same column names, in the same order:
mydata1
mydata2
# So we can use rbind to row-bind them together:
rbind(mydata1, mydata2)
```
Let's look at the above `rbind` example again, but with a modification where some observations are duplicated between the dataframes. This might happen, for example, when working with files containing time-series data where there is some overlap between the two datasets. The `union` function from the `dplyr` package returns only unique observations:
```{r}
# Some fake data
mydata1 <- data.frame(var1=1:3, var2=5:7)
mydata2 <- data.frame(var1=2:4, var2=6:8)
# The dataframes have the same column names, in the same order:
mydata1
mydata2
# 'rbind' leads to duplicate observations, 'union' removes these:
dplyr::union(mydata1, mydata2)
rbind(mydata1, mydata2)
```
Sometimes, you want to `rbind` dataframes together but the column names do not exactly match. One option is to first process the dataframes so that they do match (using subscripting). Or, just use the `bind_rows` function from `dplyr`. Look at this example where we have two dataframes that have only one column in common, but we want to keep all the columns (and fill with `NA` where necessary),
```{r}
# Some fake data
mydata1 <- data.frame(index=c("A","B","C"), var1=5:7)
mydata2 <- data.frame(var1=8:10, species=c("one","two","three"))
# Bind the dataframes together, filling unmatched columns with NA
dplyr::bind_rows(mydata1, mydata2)
```
*Note:* an equivalent function to bind dataframes side-by-side is `cbind`, which can be used instead of `merge` when no index variables are present. However, in this book, the use of `cbind` is discouraged for dataframes as it can lead to problems that are difficult to fix, and in all practical applications a merge is preferable.
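To see one of the problems with `cbind` on dataframes (a toy illustration): when the row counts divide evenly, the shorter dataframe is silently recycled, pairing rows that do not belong together.

```{r}
# b is recycled to 1, 2, 1, 2 without any warning:
cbind(data.frame(a = 1:4), data.frame(b = 1:2))
```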
## Reshaping data {#reshaping}
### From wide to long
In the majority of analyses in R, we like to have our data in 'long' format (nowadays sometimes called the 'tidy' format), where the data for some kind of measurement are in one column, and we have one or more factor variables distinguishing groups like individual, date, treatment, and so on. It is, however, common to encounter data in 'wide' format, which can be converted to long format by reshaping appropriately.
The first example uses the dutch election data.
```{r}
data(dutchelection)
head(dutchelection,3)
```
The data include percentage votes for 11 Dutch political parties in a 2012 election; somewhat intuitively, the parties have been ordered as columns (so that each row adds up to ca. 100%, ignoring some tiny parties). For purposes of analysis, we would like to reshape this dataset to long format, so that we have just the columns 'Date', 'Party', and 'Poll'. Right now 'Party' is in the column names, and 'Poll' represents the polling numbers: the actual data in the cells.
As is often the case, there are many ways to solve this. For most basic reshaping tasks we recommend the `tidyr` package, however some other methods may be needed for more complex tasks, as we find in further examples below.
```{r}
library(tidyr)
# The new dataframe will have a column 'Party' with current columns,
# *except* Date (hence the -Date),
# and values will be stored in 'Poll'
elect_long <- gather(dutchelection, Party, Poll, -Date)
head(elect_long,3)
```
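In current versions of `tidyr` (1.0 and later), `gather` is superseded by `pivot_longer`, which performs the same reshaping (a sketch, assuming an up-to-date `tidyr`):

```{r}
# All columns except Date become the 'Party'/'Poll' pairs:
elect_long2 <- pivot_longer(dutchelection, -Date,
                            names_to = "Party", values_to = "Poll")
```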
```{r eval=FALSE}
# Identical results can be had with melt() from reshape2,
# a function which otherwise includes more options.
# See `?melt.data.frame` for more options (not `?melt`,
# which is the generic function).
library(reshape2)
elect_long <- melt(dutchelection, variable.name="Party", value.name="Poll", id.vars="Date")
```
The advantage of `melt` is that we can include more than one ID variable, as in the following example. This example also shows the common need for some text processing after reshaping.
In this simple dataset, length of feet and hands was measured on two persons on three consecutive days. We want to reshape it to end up with columns 'Day', 'Person', 'Length' and 'BodyPart' (feet or hands).
```{r}
dat <- data.frame(Day=rep(1:3, each=2), Person=rep(letters[1:2], 3),
Length.feet = rep(c(2,3), 3),
Length.hands=rep(c(3,4),3))
dat
# Now reshape to long format, using melt.data.frame
dat_long <- melt(dat, id.vars=c("Day","Person"), value.name="Length",
variable.name="BodyPart") %>%
  mutate(BodyPart = gsub("Length.", "", BodyPart, fixed=TRUE))
dat_long
```
### From long to wide
Occasionally we like to use wide format, where groups of data are placed next to each other instead of on top of each other. For the first example, consider this simple dataset with a set of questions asked of two persons,
```{r}
survey <- read.table(text="user question answer
a question_1 hi
a question_2 hey
a question_3 oh
a question_4 no
a question_5 yes
a question_6 obv
b question_1 cool
b question_2 good
b question_3 yes
b question_4 sweet
b question_5 wow
b question_6 no", header=TRUE)
survey
```
In this case we actually want to have all data for each 'user' in a single row, thus creating columns 'question_1', 'question_2' and so on. The first solution uses `spread` from `tidyr`.
```{r}
library(tidyr)
spread(survey, question, answer)
```
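As with `gather`, `spread` is superseded in `tidyr` 1.0 and later; the equivalent call uses `pivot_wider` (a sketch, assuming an up-to-date `tidyr`):

```{r}
# One row per user, one column per question:
pivot_wider(survey, names_from = question, values_from = answer)
```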
The second solution uses `dcast` from `reshape2`. I show this solution because unlike `spread`, `dcast` can be used for more complicated reshaping problems.
```{r}
library(reshape2)
dcast(survey, user ~ question, value.var="answer")
```
Here, the formula indicates 'user goes in rows, question in columns', and `value.var` is the name of the variable used to populate the cells.
For the second, more complex, example we use a small version of the `eucface_gasexchange` dataset, which includes measurements of photosynthesis (`Photo`) on two experimental plots (`Plot`), on three dates (`Date`), under two experimental treatments (`treatment`). Clearly we have multiple levels of nesting in our data, and each nesting level is represented by a different variable.
```{r}
gas <- read.table(text="Date Plot treatment Photo
A 1 Amb 14.4
A 1 Ele 16.3
A 2 Amb 17.8
A 2 Ele 13.3
B 1 Amb 16.4
B 1 Ele 19
B 2 Amb 15
B 2 Ele 10.8
C 1 Amb 17.5
C 1 Ele 19.8
C 2 Amb 13.5
C 2 Ele 12.1
", header=TRUE)
```
Now we have two variables that describe which measurements 'belong together' (experimental plot, and treatment). We can no longer use `spread` (which uses just one of those variables), but `dcast` does work nicely with multiple groupings:
```{r}
dcast(gas, Date + Plot ~ treatment, value.var="Photo")
```
Another possibility is to instead place the values for the individual plots and treatments next to each other. Note that we now move `Plot` to the right-hand side of the formula in `dcast`.
```{r}
dcast(gas, Date ~ Plot + treatment, value.var="Photo")
```
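As an aside, `pivot_wider` from tidyr (version 1.0 and later) can also handle multiple grouping variables, by passing several columns to `names_from` (a sketch; the resulting column names are pasted together with an underscore, e.g. `1_Amb`):

```{r eval=FALSE}
# pivot_wider with multiple grouping variables (tidyr >= 1.0),
# producing one column per Plot x treatment combination.
library(tidyr)
pivot_wider(gas, names_from = c(Plot, treatment), values_from = Photo)
```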
## More complex `dplyr` examples
In this chapter, we have learned many new tools to work with data - filtering, summarizing, reshaping, and so on. One advantage of the `dplyr` package over base R functions is that the various operations can be efficiently combined with the `%>%` operator, chaining all your data conversion and reshaping steps into a single pipeline.
This will become clear with some examples.
### Tree growth data: filter and multiple groupings
The first example uses the `hfeifbytree` data from the `lgrdata` package. This dataset contains measurements of height and stem diameter on nearly 1200 Eucalyptus trees in Sydney, repeated 17 times. The trees grow in plots subjected to different treatments (control, fertilization, irrigation, or both).
On some of the dates, all trees were measured, while on other dates only a small subsample was taken (ca. 10% of all trees). We want to make a plot of average tree height by treatment over time, but use only the dates where all trees were measured. The following code makes Fig. \@ref(fig:hfeifmeanplot).
```{r hfeifmeanplot, fig.cap="Average tree height by treatment over time, for the hfeifbytree data"}
data(hfeifbytree)
library(dplyr)
library(ggplot2)
library(ggthemes)
hfeif_meanh <-
mutate(hfeifbytree, Date = as.Date(Date)) %>% # Convert to proper Date
group_by(Date) %>% # Set up the grouping variable
filter(n() > 500) %>% # Keep groups with more than 500 observations
group_by(Date, treat) %>% # New grouping; to get average by Date and treatment
summarize(height = mean(height, na.rm=TRUE)) %>% # Average height, discard NA.
rename(treatment = treat) # rename a variable
# It is possible to make the plot inside the data pipeline, but
# we recommend separating them!
ggplot(hfeif_meanh, aes(x = Date, y = height, col = treatment)) +
geom_point(size = 2) +
geom_line() +
scale_colour_tableau() +
ylim(c(0,20))
```
### Crude oil production: find top exporters
In the second example we will use the `oil` data, a dataset with annual crude oil production for the top 8 oil-producing countries since 1971. We want to find the top three countries for the period 1980-1985, sort them, and replace the country abbreviations with country names.
```{r}
data(oil)
# First we make a dataframe with the abbreviations.
# Very handy here is tribble from the tibble package, which makes
# it easy to write small dataframes directly into your script.
library(tibble)
abb_key <- tribble(~country, ~country_full,
"MEX", "Mexico",
"USA", "USA",
"CHN", "China",
"IRN", "Iran",
"SAU", "Saudi-Arabia",
"IRQ", "Iraq",
"KWT", "Kuwait",
"VEN", "Venezuela")
# To avoid a warning, we first convert country to character
# (not strictly necessary)
oil_top <-
mutate(oil,
country = as.character(country),
production_M = production / 1000) %>% # Convert to million tonnes.
full_join(abb_key, by="country") %>% # Add full country name
filter(year %in% 1980:1985) %>% # Subset for years between 1980-1985
group_by(country_full) %>%
summarize(production_M = mean(production_M)) %>%
arrange(desc(production_M)) %>% # Sort by production; in descending order
as.data.frame %>% # To be consistent, output a dataframe
head(., 3) # Only show the top 3
# Finally make a table that looks nice in an rmarkdown document
library(pander)
oil_top %>% pander(.,
caption = "Top three oil producing countries, 1980-1985.",
col.names = c("Country", "Annual production (MTOE)"))
```
```{block2 type="rmdtry"}
Run the example above, and modify the code so that we find the *maximum* oil production for each country, and the table shows the two countries with the lowest maximum.
```
### Weight loss data: make an irregular timeseries regular {#weightlossdplyr}
In this example we will use the `weightloss` data, again from the `lgrdata` package. A person recorded his weight at irregular intervals (every 1-3 days), to check whether his diet was having the desired effect. Because the dataset is irregular, it is not easy to calculate daily weight loss across the entire dataset. We can use a combination of `dplyr`, `zoo` and `padr` to clean up this dataset.
The `pad` function (from `padr`) is quite powerful: it takes a dataframe, figures out the time-indexing variable (in our case, `Date`), and stretches the dataset to include all dates in the observed range, inserting NA where no data were available.
The `zoo` package contains many useful functions for timeseries data. Here we use `na.approx` to linearly interpolate missing values. We also make a plot of the weight loss data, clarifying which data were interpolated. The following code makes Fig. \@ref(fig:weightlossinterp).
```{r weightlossinterp, fig.cap="The weightloss data, with linearly interpolated missing values (red)."}
library(lubridate)
library(dplyr)
library(zoo)
library(padr)
data(weightloss)
weightloss2 <- mutate(weightloss,
Date = dmy(Date), # proper Date class
Weight = Weight * 0.4536) %>% # convert to kg
pad(interval ="day") %>% # timeseries is now regular
mutate(measured = !is.na(Weight), # FALSE when the value was interpolated
Weight = na.approx(Weight)) # linear interpolation (zoo)
ggplot(weightloss2, aes(x = Date, y = Weight)) +
geom_point(colour = ifelse(weightloss2$measured, "darkgrey", "red2"))
```
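With the series now regular, the daily weight change that motivated this exercise becomes a one-line computation with `lag` from `dplyr` (a sketch; the column name `daily_change` is our own invention):

```{r eval=FALSE}
# Day-to-day weight change; negative values indicate weight loss.
# lag() shifts the Weight column down by one row (dplyr),
# so the first row is NA.
weightloss2 <- mutate(weightloss2,
                      daily_change = Weight - lag(Weight))
```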
## Exercises
```{r child="exercises/02_dataskills2_exercises.Rmd", echo=FALSE, results='hide', fig.show='hide'}
```