---
title: "Introduction to collapse"
subtitle: "Advanced and Fast Data Transformation in R"
author: "Sebastian Krantz"
date: "2021-06-27"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Introduction to collapse}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{css, echo=FALSE}
pre {
  max-height: 500px;
  overflow-y: auto;
}
pre[class] {
  max-height: 500px;
}
```
```{r, echo=FALSE}
NCRAN <- identical(Sys.getenv("NCRAN"), "TRUE")
RUNBENCH <- NCRAN && identical(Sys.getenv("RUNBENCH"), "TRUE")
oldopts <- options(width = 100L)
```
```{r, echo = FALSE, message = FALSE, warning=FALSE, eval=NCRAN}
library(vars)
library(dplyr) # Needed because otherwise dplyr is loaded in benchmark chunk not run on CRAN !!
library(magrittr)
library(microbenchmark) # Same thing
library(collapse)
library(data.table)
B <- collapse::B # making sure it masks vars::B by loading into GE
knitr::opts_chunk$set(error = FALSE, message = FALSE, warning = FALSE,
comment = "#", tidy = FALSE, cache = FALSE, collapse = TRUE,
fig.width = 8, fig.height = 5,
out.width = '100%')
X = mtcars[1:2]
by = mtcars$cyl
F <- getNamespace("collapse")$F
set.seed(101)
```
*collapse* is a C/C++ based package for data transformation and statistical computing in R. Its aims are:
1. To facilitate complex data transformation, exploration and computing tasks in R.
2. To help make R code fast, flexible, parsimonious and programmer friendly.
This vignette demonstrates these two points and introduces all main features of the package in a structured way. The chapters are largely self-contained; however, the first chapters introduce the data and the faster data manipulation functions that are used throughout the rest of this vignette.
<!-- structured with an overarching aim to instruct the reader how to write fast R code for (advanced) data manipulation with *collapse*. -->
***
**Notes:**
- Apart from this vignette, *collapse* comes with a built-in structured documentation available under `help("collapse-documentation")` after installing the package, and `help("collapse-package")` provides a compact set of examples for quick-start. A cheat sheet is available at [Rstudio](<https://raw.githubusercontent.com/rstudio/cheatsheets/master/collapse.pdf>).
<!-- To learn *collapse* as quickly as possible it is possibly better to consult those resources than going through this document. -->
- The two other vignettes focus on the integration of *collapse* with *dplyr* workflows (recommended for *dplyr* / *tidyverse* users), and on the integration of *collapse* with the *plm* package (+ some advanced programming with panel data).
- Documentation and vignettes can also be viewed [online](<https://sebkrantz.github.io/collapse/>).
<!-- - All benchmarks are run with a Windows 8.1 laptop, 2x 2.2 GHZ Intel i5 processor, 8GB DDR3 RAM and a Samsung 850 EVO SSD. -->
***
## Why *collapse*?
*collapse* is a high-performance package that extends and enhances the data-manipulation capabilities of R and existing popular packages (such as *dplyr*, *data.table*, and matrix packages). Its main focus is on grouped and weighted statistical programming, complex aggregations and transformations, time series and panel data operations, and programming with lists of data objects. The lead author is an applied economist and created the package mainly to facilitate advanced computations on varied and complex data, in particular surveys, (multivariate) time series, multilevel / panel data, and lists / model objects.
A secondary aspect of applied work is that data is often imported into R from richer data structures (such as STATA, SPSS or SAS files imported with *haven*). This called for an intelligent suite of data manipulation functions that can both utilize aspects of the richer data structure (such as variable labels) and preserve the data structure / attributes in computations. Sometimes specialized classes like *xts*, *pdata.frame* and *grouped_df* can also become very useful to manipulate certain types of data. Thus *collapse* was built to explicitly support these classes, while preserving most other classes / data structures in R.
<!-- , enabling flexible, efficient and non-destructive workflows with complex data. -->
Another objective was to radically improve the speed of R code by extensively relying on efficient algorithms in C/C++ and the faster components of base R. *collapse* ranks among the fastest R packages, and performs many grouped and/or weighted computations noticeably faster than *dplyr* or *data.table*.
A final development objective was to channel this performance through a stable and well conceived user API providing extensive and optimized programming capabilities (in standard evaluation) while also facilitating quick use and easy integration with existing data manipulation frameworks (in particular *dplyr* / *tidyverse* and *data.table*, both relying on non-standard evaluation).
<!--
*collapse* also provides exemplary documentation (built-in, vignettes, website) and testing. Testing currently covers all core features of the package, amounting to > 7700 unit tests, with extensive testing of statistical functions and computations.
The name of the package derives from the 'collapse' command for multi-type aggregation in the STATA statistical software. This was the first function in this package (later renamed to `collap` to avoid naming conflicts with *dplyr*). -->
## 1. Data and Summary Tools
We begin by introducing some powerful summary tools along with the 2 panel datasets *collapse* provides, which are used throughout this vignette. If you are just interested in programming you can skip this section. Apart from the 2 datasets that come with *collapse* (`wlddev` and `GGDC10S`), this vignette uses a few well known datasets from base R: `mtcars`, `iris`, `airquality`, and the time series `AirPassengers` and `EuStockMarkets`.
### 1.1 `wlddev` - World Bank Development Data
This dataset contains 5 key World Bank Development Indicators covering 216 countries for up to 61 years (1960-2020). It is a balanced panel with $216 \times 61 = 13176$ observations.
```{r, eval=NCRAN}
library(collapse)
head(wlddev)
# The variables have "label" attributes. Use vlabels() to get and set labels
namlab(wlddev, class = TRUE)
```
Of the categorical identifiers, the date variable was artificially generated so that this example dataset contains all common data types frequently encountered in R. A detailed statistical description of this data is computed by `descr`:
```{r, eval=NCRAN}
# A fast and detailed statistical description
descr(wlddev)
```
The output of `descr` can be converted into a tidy data frame using:
```{r, eval=NCRAN}
head(as.data.frame(descr(wlddev)))
```
Note that `descr` does not require data to be labeled. Since `wlddev` is a panel data set tracking countries over time, we might be interested in checking which variables are time-varying, with the function `varying`:
```{r, eval=NCRAN}
varying(wlddev, wlddev$iso3c)
```
`varying` tells us that all 5 variables `PCGDP`, `LIFEEX`, `GINI`, `ODA` and `POP` vary over time. However, the `OECD` variable does not, so this data does not track when countries entered the OECD. We can also take a more detailed look by letting `varying` check the variation within each country:
```{r, eval=NCRAN}
head(varying(wlddev, wlddev$iso3c, any_group = FALSE))
```
`NA` indicates that there are no data for this country. In general data is varying if it has two or more distinct non-missing values. We could also take a closer look at observation counts and distinct values using:
```{r, eval=NCRAN}
head(fnobs(wlddev, wlddev$iso3c))
head(fndistinct(wlddev, wlddev$iso3c))
```
Note that `varying` is more efficient than `fndistinct`, although both functions are very fast.
Even more powerful summary methods for multilevel / panel data are provided by `qsu` (shorthand for *quick-summary*). It is modeled after *STATA*'s *summarize* and *xtsummarize* commands. Calling `qsu` on the data gives a concise summary. We can subset columns internally using the `cols` argument:
```{r, eval=NCRAN}
qsu(wlddev, cols = 9:12, higher = TRUE) # higher adds skewness and kurtosis
```
We could easily compute these statistics by region:
```{r, eval=NCRAN}
qsu(wlddev, by = ~region, cols = 9:12, vlabels = TRUE, higher = TRUE)
```
Computing summary statistics by country is of course also possible but would be too much information. Fortunately `qsu` lets us do something much more powerful:
```{r, eval=NCRAN}
qsu(wlddev, pid = ~ iso3c, cols = c(1,4,9:12), vlabels = TRUE, higher = TRUE)
```
The above output reports 3 sets of summary statistics for each variable: Statistics computed on the *Overall* (raw) data, and on the *Between*-country (i.e. country averaged) and *Within*-country (i.e. country-demeaned) data^[in the *Within* data, the overall mean was added back after subtracting out country means, to preserve the level of the data, see also section 6.5.]. This is a powerful way to summarize panel data because aggregating the data by country gives us a cross-section of countries with no variation over time, whereas subtracting country specific means from the data eliminates all cross-sectional variation. <!-- Thus we summarize the variation in our panel in 3 different ways: First we consider the raw data, then we create a cross-section of countries and summarize that, and then we sweep that cross-section out of the raw data and pretend we have a time series. -->
So what can these statistics tell us about our data? The `N/T` column shows that for `PCGDP` we have 8995 total observations, that we observe GDP data for 203 countries, and that we have on average 44.3 observations (time periods) per country. In contrast, the GINI index is only available for 161 countries, with 8.4 observations on average. The *Overall* and *Within* mean of the data are identical by definition, and the *Between* mean would also be the same in a balanced panel with no missing observations. In practice we have unequal numbers of observations for different countries, thus countries have different weights in the *Overall* mean, and the difference between the *Overall* and *Between*-country mean reflects this discrepancy. The most interesting statistic in this summary is arguably the standard deviation, in particular the comparison of the *Between*-SD, reflecting the variation between countries, and the *Within*-SD, reflecting the average variation over time. This comparison shows that PCGDP, LIFEEX and GINI vary more between countries, but ODA received varies more within countries over time. The 0 *Between*-SD for the year variable, together with the fact that the *Overall* and *Within*-SD are equal, shows that year is individual-invariant. Thus `qsu` also provides the same information as `varying`, but with additional details on the relative magnitudes of cross-sectional and time series variation. It is also a common pattern that the *kurtosis* increases in within-transformed data, while the *skewness* decreases in most cases.
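To make the decomposition concrete, the three standard deviations reported for a series like `PCGDP` can be replicated manually (a quick sketch using `fsd`, `fmean` and `fwithin`, which are formally introduced in sections 4 and 6; adding back the overall mean in the *Within* data does not affect the SD):
```{r, eval=NCRAN}
fsd(wlddev$PCGDP)                        # Overall SD
fsd(fmean(wlddev$PCGDP, wlddev$iso3c))   # Between SD: SD of the country means
fsd(fwithin(wlddev$PCGDP, wlddev$iso3c)) # Within SD: SD of the country-demeaned data
```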
<!-- The output above is a 3D array of statistics which can also be subsetted (`[`) or permuted using `aperm()`. -->
We could also do all of that by regions to have a look at the between and within country variations inside and across different World regions:
```{r, eval=NCRAN}
qsu(wlddev, by = ~ region, pid = ~ iso3c, cols = 9:12, vlabels = TRUE, higher = TRUE)
```
Notice that the output here is a 4D array of summary statistics, which we could also subset (`[`) or permute (`aperm`) to view these statistics in any convenient way. If we don't like the array, we can also output as a nested list of statistics matrices:
```{r, eval=NCRAN}
l <- qsu(wlddev, by = ~ region, pid = ~ iso3c, cols = 9:12, vlabels = TRUE,
higher = TRUE, array = FALSE)
str(l, give.attr = FALSE)
```
Such a list of statistics matrices could, for example, be converted into a tidy data frame using `unlist2d` (more about this in the section on list-processing):
```{r, eval=NCRAN}
head(unlist2d(l, idcols = c("Variable", "Trans"), row.names = "Region"))
```
This is not yet the end of `qsu`'s functionality: we can also do all of the above on panel-surveys utilizing weights (`w` argument).
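For example (a sketch; weighting by population is purely illustrative, and `w` also accepts one-sided formulas):
```{r, eval=NCRAN}
# Population-weighted overall, between- and within-country statistics
qsu(wlddev, pid = ~ iso3c, w = ~ POP, cols = 9:10, vlabels = TRUE)
```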
Finally, we can look at (weighted) pairwise correlations in this data:
```{r, eval=NCRAN}
pwcor(wlddev[9:12], N = TRUE, P = TRUE)
```
which can of course also be computed on averaged and within-transformed data:
```{r, eval=NCRAN}
print(pwcor(fmean(wlddev[9:12], wlddev$iso3c), N = TRUE, P = TRUE), show = "lower.tri")
# N is same as overall N shown above...
print(pwcor(fwithin(wlddev[9:12], wlddev$iso3c), P = TRUE), show = "lower.tri")
```
A useful function called by `pwcor` is `pwnobs`, which is very handy to explore the joint observation structure when selecting variables to include in a statistical model:
```{r, eval=NCRAN}
pwnobs(wlddev)
```
Note that both `pwcor/pwcov` and `pwnobs` are faster on matrices.
<!-- *Note:* Other distributional statistics like the *median* and *quantiles* are currently not implemented for reasons having to do with computation speed (>10x faster than `base::summary` and suitable for really large panels) and the algorithm^[`qsu` uses a numerically stable online algorithm generalized from Welford's Algorithm to compute variances.] behind `qsu`, but might come in a further update of `qsu`. -->
### 1.2 `GGDC10S` - GGDC 10-Sector Database
The Groningen Growth and Development Centre 10-Sector Database provides long-run data on sectoral productivity performance in Africa, Asia, and Latin America. Variables covered in the data set are annual series of value added (VA, in local currency), and persons employed (EMP) for 10 broad sectors.
```{r, eval=NCRAN}
head(GGDC10S)
namlab(GGDC10S, class = TRUE)
fnobs(GGDC10S)
fndistinct(GGDC10S)
# The countries included:
cat(funique(GGDC10S$Country, sort = TRUE))
```
The first problem in summarizing this data is that value added (VA) is in local currency; the second is that it contains 2 different variables (VA and EMP) stacked in the same column. One way of solving the first problem is to convert the data to percentages by dividing by the overall VA and EMP contained in the last column. A different solution involving grouped scaling is introduced in section 6.4. The second problem is again nicely handled by `qsu`, which can also compute panel-statistics by groups.
```{r, eval=NCRAN}
# Converting data to percentages of overall VA / EMP; ftransformv (introduced in section 2.4) keeps the attributes
pGGDC10S <- ftransformv(GGDC10S, 6:15, `*`, 100 / SUM)
# Summarizing the sectoral data by variable, overall, between and within countries
su <- qsu(pGGDC10S, by = ~ Variable, pid = ~ Variable + Country,
cols = 6:16, higher = TRUE)
# This gives a 4D array of summary statistics
str(su)
# Permuting this array to a more readable format
aperm(su, c(4L, 2L, 3L, 1L))
```
The statistics show that the dataset is very consistent: Employment data cover 42 countries and 53 time-periods in almost all sectors. Agriculture is the largest sector in terms of employment, amounting to a 35% share of employment across countries and time, with a standard deviation (SD) of around 27%. The between-country SD in agricultural employment share is 24% and the within SD is 12%, indicating that processes of structural change are very gradual and most of the variation in structure is between countries. The next largest sectors after agriculture are manufacturing, wholesale and retail trade and government, each claiming an approx. 15% share of the economy. In these sectors the between-country SD is also about twice as large as the within-country SD.
In terms of value added, the data covers 43 countries in 50 time-periods. Agriculture, manufacturing, wholesale and retail trade and government are also the largest sectors in terms of VA, but with a diminished agricultural share (around 17%) and a greater share for manufacturing (around 20%). The variation between countries is again greater than the variation within countries, but it seems that at least in terms of agricultural VA share there is also a considerable within-country SD of 8%. This is also true for the finance and real estate sector with a within SD of 9%, suggesting (using a bit of common sense) that a diminishing VA share in agriculture and increased VA share in finance and real estate was a pattern characterizing most of the countries in this sample.
As a final step we consider a plot function which can be used to plot the structural transformation of any supported country. Below for Botswana:
```{r, eval=NCRAN}
library(data.table)
library(ggplot2)
library(magrittr)
plotGGDC <- function(ctry) {
# Select and subset
fsubset(GGDC10S, Country == ctry, Variable, Year, AGR:SUM) %>%
# Convert to shares and replace negative values with NA
ftransform(fselect(., AGR:OTH) %>%
lapply(`*`, 1 / SUM) %>%
replace_outliers(0, NA, "min")) %>%
# Remove totals column and make proper variable labels
ftransform(Variable = recode_char(Variable,
VA = "Value Added Share",
EMP = "Employment Share"),
SUM = NULL) %>%
# Fast conversion to data.table
qDT %>%
# data.table's melt function
melt(1:2, variable.name = "Sector", na.rm = TRUE) %>%
# ggplot with some scales provided by the 'scales' package
ggplot(aes(x = Year, y = value, fill = Sector)) +
geom_area(position = "fill", alpha = 0.9) + labs(x = NULL, y = NULL) +
theme_linedraw(base_size = 14L) + facet_wrap( ~ Variable) +
scale_fill_manual(values = sub("#00FF66", "#00CC66", rainbow(10L))) +
scale_x_continuous(breaks = scales::pretty_breaks(n = 7L), expand = c(0, 0)) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 10L), expand = c(0, 0),
labels = scales::percent) +
theme(axis.text.x = element_text(angle = 315, hjust = 0, margin = ggplot2::margin(t = 0)),
strip.background = element_rect(colour = "grey20", fill = "grey20"),
strip.text = element_text(face = "bold"))
}
# Plotting the structural transformation of Botswana
plotGGDC("BWA")
```
## 2. Fast Data Manipulation
A lot of R code is not concerned with statistical computations but with preliminary data wrangling.
<!-- Very frequent operations include selecting, replacing, subsetting, ordering, adding/computing, and deleting data / columns.-->
For various reasons R development has focused on data frames as the main medium to contain data, although matrices / arrays provide significantly faster methods for common manipulations.
A first essential step towards optimizing R code is thus to speed up very frequent manipulations on data frames. *collapse* introduces a set of highly optimized functions to efficiently manipulate (mostly) data frames. Most manipulations can be conducted in non-standard evaluation or standard evaluation (utilizing different functions), and all functions preserve the data structure (i.e. they can be used with data.table, tbl_df, grouped_df, pdata.frame etc.).
<!-- Some of these functions (`fselect`, `roworder`, `franame`, `fsubset`, `ss` and `ftransform`) represent improved versions of existing ones (`dplyr::select`, `dplyr::arrange`, `dplyr::rename`, `base::subset`, `base::[.data.frame` and `base::transform`) while others are added in. Also some functions (`fselect`, `roworder`, `frename`, `colorder`, `fsubset`, `ftransform`, `settransform`, `fcompute`) use non-standard evaluation, whereas others (`get_vars`, `roworderv`, `colorderv`, `ss`, `add_vars`, `num_vars`, etc.) offer some of the same functionality with standard evaluation and are thus more programmer friendly. Here we run through all of them briefly: -->
### 2.1 Selecting and Replacing Columns
`fselect` is an analogue to `dplyr::select`, but executes about 100x faster. It can be used to select variables using expressions involving variable names:
```{r, eval=NCRAN}
library(magrittr) # Pipe operators
fselect(wlddev, country, year, PCGDP:ODA) %>% head(2)
fselect(wlddev, -country, -year, -(PCGDP:ODA)) %>% head(2)
library(microbenchmark)
microbenchmark(fselect = collapse::fselect(wlddev, country, year, PCGDP:ODA),
select = dplyr::select(wlddev, country, year, PCGDP:ODA))
```
In contrast to `dplyr::select`, `fselect` has a replacement method
```{r, eval=NCRAN}
# Computing the log of columns
fselect(wlddev, PCGDP:POP) <- lapply(fselect(wlddev, PCGDP:POP), log)
head(wlddev, 2)
# Efficient deleting
fselect(wlddev, country, year, PCGDP:POP) <- NULL
head(wlddev, 2)
rm(wlddev)
```
and it can also return information about the selected columns other than the data itself.
```{r, eval=NCRAN}
fselect(wlddev, PCGDP:POP, return = "names")
fselect(wlddev, PCGDP:POP, return = "indices")
fselect(wlddev, PCGDP:POP, return = "named_indices")
fselect(wlddev, PCGDP:POP, return = "logical")
fselect(wlddev, PCGDP:POP, return = "named_logical")
```
<!-- `fselect` is a lot faster than `dplyr::select` and maintains this performance on large data. -->
While `fselect` is faster than `dplyr::select`, it is also simpler and does not offer special methods for grouped tibbles (e.g. where grouping columns are always selected) and some other *dplyr*-specific features of `select`. We will see that this is not a problem at all when working with statistical functions in *collapse* that have a grouped_df method, but users should be careful replacing `dplyr::select` with `fselect` in *dplyr* scripts. From *collapse* 1.6.0, `fselect` has explicit support for *sf* data frames.
<!-- For some reason this does not work in build !! -->
<!-- ```{r, error=FALSE, warning=FALSE} -->
<!-- library(microbenchmark) -->
<!-- library(dplyr) -->
<!-- identical(select(wlddev, country, year, PCGDP:ODA), fselect(wlddev, country, year, PCGDP:ODA)) -->
<!-- microbenchmark(select(wlddev, country, year, PCGDP:ODA), fselect(wlddev, country, year, PCGDP:ODA)) -->
<!-- ``` -->
The standard-evaluation analogue to `fselect` is the function `get_vars`. `get_vars` can be used to select variables using names, indices, logical vectors, functions or regular expressions evaluated against column names:
```{r, eval=NCRAN}
get_vars(wlddev, 9:13) %>% head(1)
get_vars(wlddev, c("PCGDP","LIFEEX","GINI","ODA","POP")) %>% head(1)
get_vars(wlddev, "[[:upper:]]", regex = TRUE) %>% head(1)
get_vars(wlddev, "PC|LI|GI|OD|PO", regex = TRUE) %>% head(1)
# Same as above, vectors of regular expressions are sequentially passed to grep
get_vars(wlddev, c("PC","LI","GI","OD","PO"), regex = TRUE) %>% head(1)
get_vars(wlddev, is.numeric) %>% head(1)
# Returning other information
get_vars(wlddev, is.numeric, return = "names")
get_vars(wlddev, "[[:upper:]]", regex = TRUE, return = "named_indices")
```
Replacing operations work analogously:
```{r, eval=NCRAN}
get_vars(wlddev, 9:13) <- lapply(get_vars(wlddev, 9:13), log)
get_vars(wlddev, 9:13) <- NULL
head(wlddev, 2)
rm(wlddev)
```
`get_vars` is about 2x faster than `[.data.frame`, and `get_vars<-` is about 6-8x faster than `[<-.data.frame`.
<!-- ```{r} -->
<!-- series <- wlddev[9:12] -->
<!-- microbenchmark(get_vars(wlddev, 9:12), wlddev[9:12]) -->
<!-- microbenchmark(get_vars(wlddev, 9:12) <- series, wlddev[9:12] <- series) -->
<!-- microbenchmark(get_vars(wlddev, 9:12) <- get_vars(wlddev, 9:12), wlddev[9:12] <- wlddev[9:12]) -->
<!-- ``` -->
In addition to `get_vars`, *collapse* offers a set of functions to efficiently select and replace data by data type: `num_vars`, `cat_vars` (for categorical = non-numeric columns), `char_vars`, `fact_vars`, `logi_vars` and `date_vars` (for date and date-time columns).
```{r, eval=NCRAN}
head(num_vars(wlddev), 2)
head(cat_vars(wlddev), 2)
head(fact_vars(wlddev), 2)
# Replacing
fact_vars(wlddev) <- fact_vars(wlddev)
```
### 2.2 Subsetting
`fsubset` is an enhanced version of `base::subset` using C functions from the *data.table* package for fast subsetting operations. In contrast to `base::subset`, `fsubset` allows multiple comma-separated select arguments after the subset argument, and it also preserves all attributes of subsetted columns:
```{r, eval=NCRAN}
# Returning only value-added data after 1990
fsubset(GGDC10S, Variable == "VA" & Year > 1990, Country, Year, AGR:GOV) %>% head(2)
# Same thing
fsubset(GGDC10S, Variable == "VA" & Year > 1990, -(Regioncode:Variable), -(OTH:SUM)) %>% head(2)
```
It is also possible to use standard evaluation with `fsubset`, but for these purposes the function `ss` exists as a fast and more secure alternative to `[.data.frame`:
```{r, eval=NCRAN}
ss(GGDC10S, 1:2, 6:16) # or fsubset(GGDC10S, 1:2, 6:16), but not recommended.
ss(GGDC10S, -(1:2), c("AGR","MIN")) %>% head(2)
```
Thanks to the *data.table* C code and optimized R code, `fsubset` is very fast.
```{r, eval=NCRAN}
microbenchmark(base = subset(GGDC10S, Variable == "VA" & Year > 1990, AGR:SUM),
collapse = fsubset(GGDC10S, Variable == "VA" & Year > 1990, AGR:SUM))
microbenchmark(GGDC10S[1:10, 1:10], ss(GGDC10S, 1:10, 1:10))
```
Like `base::subset`, `fsubset` is an S3 generic with methods for vectors, matrices and data frames. For certain classes such as factors, `fsubset.default` also improves upon `[`, but the largest improvements are with the data frame method.
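A small sketch of the default method on a factor:
```{r, eval=NCRAN}
# The default method takes logical or integer subsets and preserves attributes such as factor levels
sp <- iris$Species
str(fsubset(sp, 1:3))
fsubset(sp, sp == "setosa") %>% head(3)
```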
### 2.3 Reordering Rows and Columns
`roworder` is a fast analogue to `dplyr::arrange`. The syntax is inspired by `data.table::setorder`, so that negative variable names indicate descending sort.
```{r, eval=NCRAN}
roworder(GGDC10S, -Variable, Country) %>% ss(1:2, 1:8)
microbenchmark(collapse = collapse::roworder(GGDC10S, -Variable, Country),
dplyr = dplyr::arrange(GGDC10S, desc(Variable), Country))
```
In contrast to `data.table::setorder`, `roworder` creates a copy of the data frame (unless data are already sorted). If this copy is not required, `data.table::setorder` is faster. The function `roworderv` is a standard evaluation analogue to `roworder`:
```{r, eval=NCRAN}
# Same as above
roworderv(GGDC10S, c("Variable", "Country"), decreasing = c(TRUE, FALSE)) %>% ss(1:2, 1:8)
```
With `roworderv`, it is also possible to move or exchange rows in a data frame:
```{r, eval=NCRAN}
# If length(neworder) < fnrow(data), the default (pos = "front") brings rows to the front
roworderv(GGDC10S, neworder = which(GGDC10S$Country == "GHA")) %>% ss(1:2, 1:8)
# pos = "end" brings rows to the end
roworderv(GGDC10S, neworder = which(GGDC10S$Country == "BWA"), pos = "end") %>% ss(1:2, 1:8)
# pos = "exchange" arranges selected rows in the order they are passed, without affecting other rows
roworderv(GGDC10S, neworder = with(GGDC10S, c(which(Country == "GHA"),
which(Country == "BWA"))), pos = "exchange") %>% ss(1:2, 1:8)
```
Similarly, the pair `colorder` / `colorderv` facilitates efficient reordering of columns in a data frame. These functions do not require a deep copy of the data and are very fast. To reorder columns by reference, see also `data.table::setcolorder`.
```{r, eval=NCRAN}
# The default is again pos = "front" which brings selected columns to the front / left
colorder(GGDC10S, Variable, Country, Year) %>% head(2)
```
### 2.4 Transforming and Computing New Columns
`ftransform` is an improved version of `base::transform` for data frames and lists. `ftransform` can be used to compute new columns or modify and delete existing columns, and always returns the entire data frame.
```{r, eval=NCRAN}
ftransform(GGDC10S, AGR_perc = AGR / SUM * 100, # Computing Agricultural percentage
Year = as.integer(Year), # Coercing Year to integer
AGR = NULL) %>% tail(2) # Deleting column AGR
# Computing scalar results replicates them
ftransform(GGDC10S, MIN_mean = fmean(MIN), Intercept = 1) %>% tail(2)
```
The modification `ftransformv` exists to transform specific columns using a function:
<!--
# Same thing using fselect to get the right indices
# GGDC10S %>% ftransformv(fselect(., AGR:SUM, return = "indices"), `*`, 100/SUM) %>% tail(2)
-->
```{r, eval=NCRAN}
# Apply the log to columns 6-16
GGDC10S %>% ftransformv(6:16, log) %>% tail(2)
# Convert data to percentage terms
GGDC10S %>% ftransformv(6:16, `*`, 100/SUM) %>% tail(2)
# Apply log to numeric columns
GGDC10S %>% ftransformv(is.numeric, log) %>% tail(2)
```
Instead of passing comma-separated `column = value` expressions, it is also possible to bulk-process data with `ftransform` by passing a single list of expressions (such as a data frame). This is useful for more complex transformations involving multiple steps:
```{r, eval=NCRAN}
# Same as above, but also replacing any generated infinite values with NA
GGDC10S %>% ftransform(num_vars(.) %>% lapply(log) %>% replace_Inf) %>% tail(2)
```
This mode of usage toggles automatic column matching and replacement; non-matching columns are added to the data frame. Apart from `ftransform`, the function `settransform(v)` can be used to change the input data frame by reference:
<!-- and is a simple wrapper around `X <- ftransform(X, ...)`: -->
```{r, eval=NCRAN}
# Computing a new column and deleting some others by reference
settransform(GGDC10S, FIRE_MAN = FIRE / MAN,
Regioncode = NULL, Region = NULL)
tail(GGDC10S, 2)
rm(GGDC10S)
# Bulk-processing the data into percentage terms
settransformv(GGDC10S, 6:16, `*`, 100/SUM)
tail(GGDC10S, 2)
# Same thing via replacement
ftransform(GGDC10S) <- fselect(GGDC10S, AGR:SUM) %>% lapply(`*`, 100/.$SUM)
# Or using double pipes
GGDC10S %<>% ftransformv(6:16, `*`, 100/SUM)
rm(GGDC10S)
```
Another convenient addition is provided by the function `fcompute`, which can be used to compute new columns in a data frame environment and returns the computed columns in a new data frame:
```{r, eval=NCRAN}
fcompute(GGDC10S, AGR_perc = AGR / SUM * 100, FIRE_MAN = FIRE / MAN) %>% tail(2)
```
For more complex tasks see `?ftransform`.
### 2.5 Adding and Binding Columns
For cases where multiple columns are computed and need to be added to a data frame (regardless of whether names are duplicated or not), *collapse* introduces the predicate `add_vars`. Together with `add_vars`, the function `add_stub` is useful to add a prefix (default) or postfix to computed variables keeping the variable names unique:
```{r, eval=NCRAN}
# Efficient adding logged versions of some variables
add_vars(wlddev) <- get_vars(wlddev, 9:13) %>% lapply(log10) %>% add_stub("log10.")
head(wlddev, 2)
rm(wlddev)
```
By default `add_vars` appends a data frame towards the (right) end, but it can also add columns in front or at other positions in the data frame:
```{r, eval=NCRAN}
add_vars(wlddev, "front") <- get_vars(wlddev, 9:13) %>% lapply(log10) %>% add_stub("log10.")
head(wlddev, 2)
rm(wlddev)
add_vars(wlddev, c(10L,12L,14L,16L,18L)) <- get_vars(wlddev, 9:13) %>% lapply(log10) %>% add_stub("log10.")
head(wlddev, 2)
rm(wlddev)
```
`add_vars` can also be used without replacement, where it serves as a more efficient version of `cbind.data.frame`, with the difference that the data structure and attributes of the first argument are preserved:
```{r, eval=NCRAN}
add_vars(wlddev, get_vars(wlddev, 9:13) %>% lapply(log) %>% add_stub("log."),
get_vars(wlddev, 9:13) %>% lapply(log10) %>% add_stub("log10.")) %>% head(2)
add_vars(wlddev, get_vars(wlddev, 9:13) %>% lapply(log) %>% add_stub("log."),
get_vars(wlddev, 9:13) %>% lapply(log10) %>% add_stub("log10."),
pos = c(10L,13L,16L,19L,22L,11L,14L,17L,20L,23L)) %>% head(2)
identical(cbind(wlddev, wlddev), add_vars(wlddev, wlddev))
microbenchmark(cbind(wlddev, wlddev), add_vars(wlddev, wlddev))
```
### 2.6 Renaming Columns
`frename` is a fast substitute for `dplyr::rename`:
```{r, eval=NCRAN}
frename(GGDC10S, AGR = Agriculture, MIN = Mining) %>% head(2)
frename(GGDC10S, tolower) %>% head(2)
frename(GGDC10S, tolower, cols = .c(AGR, MIN)) %>% head(2)
```
The function `setrename` does this by reference:
```{r, eval=NCRAN}
setrename(GGDC10S, AGR = Agriculture, MIN = Mining)
head(GGDC10S, 2)
setrename(GGDC10S, Agriculture = AGR, Mining = MIN)
rm(GGDC10S)
```
Both functions are not limited to data frames but can be applied to any R object with a 'names' attribute.
### 2.7 Using Shortcuts
The most frequently required of the functions introduced above can be abbreviated as follows: `fselect -> slt`, `fsubset -> sbt`, `ftransform(v) -> tfm(v)`, `settransform(v) -> settfm(v)`, `get_vars -> gv`, `num_vars -> nv`, `add_vars -> av`. This makes it possible to write faster and more parsimonious code, but is recommended only for personally kept scripts. A lazy person may also decide to code everything using shortcuts and then do a Ctrl+F replacement with the long names on the finished script.
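For example, the following expressions are equivalent (a small sketch; the shortcuts are plain aliases for the full names):
```{r, eval=NCRAN}
identical(fselect(wlddev, country, year, PCGDP), slt(wlddev, country, year, PCGDP))
sbt(wlddev, year > 2018, country, year, PCGDP) %>% head(2)
tfm(gv(wlddev, c("country", "year", "POP")), POP_M = POP / 1e6) %>% head(2)
```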
<!--
and to facilitate the avoidance of piped `%>%` expressions. Needless to say pipes `%>%` have become a very convenient feature of the R language and do a great job avoiding complex nested calls. They do however require reconstructing the entire call before evaluating it, and thus take out a lot of speed:
```{r, eval=NCRAN}
microbenchmark(standard = tfm(gv(wlddev, 9:12), ODA_GDP = ODA/PCGDP),
piped = get_vars(wlddev, 9:12) %>% ftransform(ODA_GDP = ODA/PCGDP))
```
-->
### 2.8 Missing Values / Rows
The function `na_omit` is a much faster alternative to `stats::na.omit` for vectors, matrices and data frames. By default the 'na.action' attribute containing the removed cases is omitted, but it can be added with the option `na.attr = TRUE`. Like `fsubset`, `na_omit` preserves all column attributes as well as attributes of the data frame itself.
```{r, eval=NCRAN}
microbenchmark(na_omit(wlddev, na.attr = TRUE), na.omit(wlddev))
```
Another added feature is the removal of cases missing on certain columns only:
```{r, eval=NCRAN}
na_omit(wlddev, cols = .c(PCGDP, LIFEEX)) %>% head(2)
# only removing missing data from numeric columns -> same and slightly faster than na_omit(wlddev)
na_omit(wlddev, cols = is.numeric) %>% head(2)
```
For atomic vectors there is also the function `na_rm`, which is about 2x faster than `x[!is.na(x)]`. Both `na_omit` and `na_rm` return their argument if no missing cases were found.
The existence of missing cases can be checked using `missing_cases`, which is also considerably faster than `complete.cases` for data frames.
There is also a function `na_insert` to randomly insert missing values into vectors, matrices and data frames. The default is `na_insert(X, prop = 0.1)` so that 10% of values are randomly set to missing.
Finally, a function `allNA` provides the much needed opposite of `anyNA` for atomic vectors.
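The following sketch illustrates these helpers (`na_insert`, `na_rm`, `missing_cases` and `allNA`) on a small vector and on `airquality`:
```{r, eval=NCRAN}
x <- na_insert(rnorm(10), prop = 0.3)  # Randomly setting 30% of values to missing
na_rm(x)                               # Fast removal of missing values
head(missing_cases(airquality))        # Faster than !complete.cases(airquality)
allNA(x[is.na(x)])                     # TRUE: the opposite of anyNA
```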
### 2.9 Unique Values / Rows
Similar to `na_omit`, the function `funique` is a much faster alternative to `base::unique` for atomic vectors and data frames. Like most *collapse* functions it also seeks to preserve attributes.
```{r, eval=NCRAN}
funique(GGDC10S$Variable) # Unique values in order of appearance
funique(GGDC10S$Variable, sort = TRUE) # Sorted unique values
# If all values/rows are unique, the original data is returned (no copy)
identical(funique(GGDC10S), GGDC10S)
# Can remove duplicate rows by a subset of columns
funique(GGDC10S, cols = .c(Country, Variable)) %>% ss(1:2, 1:8)
funique(GGDC10S, cols = .c(Country, Variable), sort = TRUE) %>% ss(1:2, 1:8)
```
### 2.10 Recoding and Replacing Values
With `recode_num`, `recode_char`, `replace_NA`, `replace_Inf` and `replace_outliers`, *collapse* also introduces a set of functions to efficiently recode and replace numeric and character values in matrix-like objects (vectors, matrices, arrays, data frames, lists of atomic objects). When called on a data frame, `recode_num`, `replace_Inf` and `replace_outliers` will skip non-numeric columns, and `recode_char` skips non-character columns, whereas `replace_NA` replaces missing values in all columns.
```{r, eval=NCRAN}
# Efficient replacing missing values with 0
microbenchmark(replace_NA(GGDC10S, 0))
# Adding log-transformed sectoral data: Some NaN and Inf values generated
add_vars(GGDC10S, 6:16*2-5) <- fselect(GGDC10S, AGR:SUM) %>%
lapply(log) %>% replace_Inf %>% add_stub("log.")
head(GGDC10S, 2)
rm(GGDC10S)
```
`recode_num` and `recode_char` follow the syntax of `dplyr::recode` and provide more or less the same functionality except that they can efficiently be applied to matrices and data frames, and that `recode_char` allows for regular expression matching implemented via `base::grepl`:
<!-- that `dplyr::recode` offers a method for factors not provided in *collapse*, -->
```{r, eval=NCRAN}
month.name
recode_char(month.name, ber = "C", "^J" = "A", default = "B", regex = TRUE)
```
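`recode_num` works analogously for numeric values (a minimal sketch, following the `dplyr::recode` syntax as stated above):
```{r, eval=NCRAN}
# Recoding numeric values; 'missing' assigns a replacement for NA's
recode_num(c(1, 2, 3, NA), `1` = 0, `3` = 30, missing = -99)
```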
Perhaps the most interesting function in this ensemble is `replace_outliers`, which replaces values falling outside a 1- or 2-sided numeric threshold, or outside a certain number of column standard deviations, with a value (the default is `NA`).
```{r, eval=NCRAN}
# replace all values below 2 and above 100 with NA
replace_outliers(mtcars, c(2, 100)) %>% head(3)
# replace all values smaller than 2 with NA
replace_outliers(mtcars, 2, single.limit = "min") %>% head(3)
# replace all values larger than 100 with NA
replace_outliers(mtcars, 100, single.limit = "max") %>% head(3)
# replace all values above or below 3 column-standard-deviations from the column-mean with NA
replace_outliers(mtcars, 3) %>% tail(3)
```
## 3. Quick Data Object Conversions
Apart from the code employed for the manipulation of data and the actual statistical computations performed, frequently used data object conversions with base functions like `as.data.frame`, `as.matrix` or `as.factor` have a significant share in slowing down R code. Ideally, code would be written without such conversions, but sometimes they are necessary, and thus *collapse* provides a set of functions (`qDF`, `qDT`, `qTBL`, `qM`, `qF`, `mrtl` and `mctl`) to speed these conversions up quite a bit. These functions are fast because they are non-generic, dispatch different objects internally, perform critical steps in C++, and, when passed lists of objects, only check the length of the first column.
`qDF`, `qDT` and `qTBL` efficiently convert vectors, matrices, higher-dimensional arrays and suitable lists to data.frame, *data.table* and *tibble* respectively.
```{r, eval=NCRAN}
str(EuStockMarkets)
# Efficient Conversion of data frames and matrices to data.table
microbenchmark(qDT(wlddev), qDT(EuStockMarkets), as.data.table(wlddev), as.data.frame(EuStockMarkets))
# Converting a time series to data.frame
head(qDF(AirPassengers))
```
By default these functions drop all unnecessary attributes from matrices or lists / data frames in the conversion, but this can be changed using the `keep.attr = TRUE` argument.
A useful additional feature of `qDF` and `qDT` is the `row.names.col` argument, enabling the saving of names / row-names in a column when converting from vector, matrix, array or data frame:
```{r, eval=NCRAN}
# This saves the row-names in a column named 'car'
head(qDT(mtcars, "car"))
N_distinct <- fndistinct(GGDC10S)
N_distinct
# Converting a vector to data.frame, saving names
head(qDF(N_distinct, "variable"))
```
For the conversion of matrices to lists there are also the programmer functions `mrtl` and `mctl`, which row- or column-wise convert a matrix into a plain list, data.frame or *data.table*.
```{r, eval=NCRAN}
# This converts the matrix to a list of 1860 row-vectors of length 4.
microbenchmark(mrtl(EuStockMarkets))
```
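`mctl` works analogously by columns; a quick sketch (assuming here that keeping dimension names via `names = TRUE` is desired):
```{r, eval=NCRAN}
# Column-wise conversion of a matrix to a plain list or data.frame
m <- qM(mtcars)
str(mctl(m, names = TRUE)[1:2])
mctl(m, names = TRUE, return = "data.frame") %>% head(2)
```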
For the reverse operation, `qM` converts vectors, higher-dimensional arrays, data frames and suitable lists to matrix.
```{r, eval=NCRAN}
# Note: kit::psum is the most efficient way to do this
microbenchmark(rowSums(qM(mtcars)), rowSums(mtcars), kit::psum(mtcars))
```
Finally, `qF` converts vectors to factor and is quite a bit faster than `as.factor`:
```{r, eval=NCRAN}
# Converting from character
str(wlddev$country)
fndistinct(wlddev$country)
microbenchmark(qF(wlddev$country), as.factor(wlddev$country))
# Converting from numeric
str(wlddev$PCGDP)
fndistinct(wlddev$PCGDP)
microbenchmark(qF(wlddev$PCGDP), as.factor(wlddev$PCGDP))
```
<!-- by default `qF` converts vectors to ordered factor, regulated by the default argument `ordered = TRUE`. This behavior is the same as `as.factor`, but `as.factor` does not attach a class 'ordered'. `qF(x, ordered = FALSE)` will not sort the levels and is slightly faster. Another difference to `as.factor` is that `qF` always adds a `NA` level for missing values, however by default an integer missing value (`NA_integer_`) is provided as the value for that level (thus `qF` behaves identical to `as.factor` with exception of the added level). a slight modification of this behavior can be achieved with `qF(x, na.exclude = FALSE)`, which will still have the `NA` level but now also assigns a positive integer value to the level (implying that missing values can no longer be detected using `is.na`), and attaches an additional class 'na.included'. Factor generation using `qF(x, na.exclude = FALSE)` is advised when using factors to carry out grouped computations with *collapse*'s fast functions, as the class 'na.included' prevents *collapse* functions to execute a missing value check on the factor, and thus yields a performance improvement. -->
## 4. Advanced Statistical Programming
Having introduced some of the more basic *collapse* data manipulation infrastructure in the preceding chapters, this chapter introduces some of the package's core functionality for programming with data.
### 4.1 Fast (Grouped, Weighted) Statistical Functions
A key feature of *collapse* is its broad set of *Fast Statistical Functions* (`fsum, fprod, fmean, fmedian, fmode, fvar, fsd, fmin, fmax, fnth, ffirst, flast, fnobs, fndistinct`), which are able to tangibly speed up column-wise, grouped and weighted statistical computations on vectors, matrices or data frames. The basic syntax common to all of these functions is:
```{r eval=FALSE}
FUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,] use.g.names = TRUE, drop = TRUE)
```
where `x` is a vector, matrix or data frame, `g` takes groups supplied as vector, factor, list of vectors or *GRP* object, and `w` takes a weight vector (supported by `fsum, fprod, fmean, fmedian, fmode, fnth, fvar` and `fsd`). `TRA` can be used to transform `x` using the computed statistics and one of 10 available transformations (`"replace_fill", "replace", "-", "-+", "/", "%", "+", "*", "%%", "-%%"`, discussed in section 6.3). `na.rm` efficiently skips missing values during the computation and is `TRUE` by default. `use.g.names = TRUE` generates new row-names from the unique groups supplied to `g`, and `drop = TRUE` returns a vector when performing simple (non-grouped) computations on matrix or data frame columns.
With that in mind, let's start with some simple examples. To calculate simple column-wise means, it is sufficient to type:
```{r, eval=NCRAN}
fmean(mtcars$mpg) # Vector
fmean(mtcars)
fmean(mtcars, drop = FALSE) # This returns a 1-row data-frame
m <- qM(mtcars) # Generate matrix
fmean(m)
fmean(m, drop = FALSE) # This returns a 1-row matrix
```
Note that separate methods for vectors, matrices and data frames are written in C++, thus no conversions are needed and computations on matrices and data frames are equally efficient.
If we had a weight vector, weighted statistics are easily computed:
```{r, eval=NCRAN}
weights <- abs(rnorm(fnrow(mtcars))) # fnrow is a bit faster for data frames
fmean(mtcars, w = weights) # Weighted mean
fmedian(mtcars, w = weights) # Weighted median
fsd(mtcars, w = weights) # Frequency-weighted standard deviation
fmode(mtcars, w = weights) # Weighted statistical mode (i.e. the value with the largest sum of weights)
```
Fast grouped statistics can be calculated by simply passing grouping vectors or lists of grouping vectors to the fast functions:
```{r, eval=NCRAN}
fmean(mtcars, mtcars$cyl)
fmean(mtcars, fselect(mtcars, cyl, vs, am))
# Getting column indices
ind <- fselect(mtcars, cyl, vs, am, return = "indices")
fmean(get_vars(mtcars, -ind), get_vars(mtcars, ind))
```
<!-- `get_vars` also subsets data.table columns and other data.frame-like classes, and is about 2x the speed of `[.data.frame`. Replacements of the form `get_vars(data, ind) <- newcols` are about 4x as fast as `data[ind] <- newcols`. It is also possible to subset with functions i.e. `get_vars(mtcars, is.ordered)` and regular expressions i.e. `get_vars(mtcars, c("c","v","a"), regex = TRUE)` or `get_vars(mtcars, "c|v|a", regex = TRUE)`. Next to `get_vars` there are also the predicates `num_vars`, `cat_vars`, `char_vars`, `fact_vars`, `logi_vars` and `date_vars` to subset and replace data by type. -->
### 4.2 Factors, Grouping Objects and Grouped Data Frames
This kind of programming becomes more efficient when passing *factors* or *grouping objects* to the `g` argument, as otherwise vectors and lists of vectors are grouped internally.
```{r, eval=NCRAN}
# This creates a factor, na.exclude = FALSE attaches a class 'na.included'
f <- qF(mtcars$cyl, na.exclude = FALSE)
# The 'na.included' attribute skips a missing value check on this factor
attributes(f)
# Saving data without grouping columns
dat <- get_vars(mtcars, -ind)
# Grouped standard-deviation
fsd(dat, f)
# Without option na.exclude = FALSE, anyNA needs to be called on the factor (noticeable on larger data).
f2 <- qF(mtcars$cyl)
microbenchmark(fsd(dat, f), fsd(dat, f2))
```
For programming purposes *GRP* objects are preferable over factors because they never require further checks and they provide additional information about the grouping (such as group sizes and the original unique values in each group). The `GRP` function creates grouping objects (of class *GRP*) from vectors or lists of columns. Grouping is done very efficiently via radix ordering in C (using the `radixorder` function):
```{r, eval=NCRAN}
# This creates a 'GRP' object.
g <- GRP(mtcars, ~ cyl + vs + am) # Using the formula interface, could also use c("cyl","vs","am") or c(2,8:9)
str(g)
```
The first three elements of this object provide information about the number of groups, the group to which each row belongs, and the size of each group. A print and a plot method provide further information about the grouping:
```{r, eval=NCRAN}
print(g)
plot(g)
```
The important elements of the *GRP* object are directly handed down to the compiled C++ code of the statistical functions, making repeated computations over the same groups very efficient.
```{r, eval=NCRAN}
fsd(dat, g)
# Grouped computation with and without prior grouping
microbenchmark(fsd(dat, g), fsd(dat, get_vars(mtcars, ind)))
```
<!-- Note at this point that by default (`sort = TRUE`, `order = 1L`), groups are sorted in ascending order, corresponding to *data.table* grouping with `keyby`. The ordering can be reversed (`order = -1L`), or set individually for each of the 3 grouping columns (i.e. `order = c(1L, -1L, -1L)`). If `sort = FALSE`, the grouping is unordered corresponding to *data.table* grouping with `by`. -->
Yet another possibility is creating a grouped data frame (class *grouped_df*). This can either be done using `dplyr::group_by`, which creates a grouped tibble and requires a conversion of the grouping object using `GRP.grouped_df`, or using the more efficient `fgroup_by` provided in *collapse*:
```{r, eval=NCRAN}
gmtcars <- fgroup_by(mtcars, cyl, vs, am) # fgroup_by() can also be abbreviated as gby()
fmedian(gmtcars)
head(fgroup_vars(gmtcars))
fmedian(gmtcars, keep.group_vars = FALSE)
```
<!-- By default, both are ordered, but must not be. For multiple variables, `GRP` is always superior to creating multiple factors and interacting them, and it is also faster than `base::interaction` for lists of factors. -->
<!-- With factors or *GRP* objects, computations are faster since the fast functions would otherwise internally group the vectors every time they are executed. Compared to factors, grouped computations using `GRP` objects are a bit more efficient, primarily because they require no further checks, while factors are checked for missing values^[Because missing values are stored as the smallest integer in C++, and the values of the factor are used directly to index result vectors in grouped computations. Subsetting a vector with the smallest integer would break the C++ code of the *Fast Statistical Functions* and terminate the R session, which must be avoided.] unless a class '*na.included*' is attached. By default `qF` acts just like `as.factor` and preserves missing values when generating factors. Therefore the most effective way of programming with factors is to use `qF(x, na.exclude = FALSE)` to create the factor. This will create an underlying integer for `NA`'s and attach a class '*na.included*', so that no further checks are run on that factor in the *collapse* ecosystem. -->
Now suppose we wanted to create a new dataset which contains the *mean*, *sd*, *min* and *max* of the variables *mpg* and *disp* grouped by *cyl*, *vs* and *am*:
```{r, eval=NCRAN}
# Standard evaluation
dat <- get_vars(mtcars, c("mpg", "disp"))
add_vars(g[["groups"]],
add_stub(fmean(dat, g, use.g.names = FALSE), "mean_"),
add_stub(fsd(dat, g, use.g.names = FALSE), "sd_"),
add_stub(fmin(dat, g, use.g.names = FALSE), "min_"),
add_stub(fmax(dat, g, use.g.names = FALSE), "max_"))
# Non-Standard evaluation
fgroup_by(mtcars, cyl, vs, am) %>% fselect(mpg, disp) %>% {
add_vars(fgroup_vars(., "unique"),
fmean(., keep.group_vars = FALSE) %>% add_stub("mean_"),
fsd(., keep.group_vars = FALSE) %>% add_stub("sd_"),
fmin(., keep.group_vars = FALSE) %>% add_stub("min_"),
fmax(., keep.group_vars = FALSE) %>% add_stub("max_"))
}
```
### 4.3 Grouped and Weighted Computations
<!-- , and we could decide to include the original grouping columns and omit the generated row-names, as shown below -->
We could also calculate groupwise frequency-weighted means and standard deviations using a weight vector^[You may wonder why with weights the standard deviations in the group '4.0.1' are `0` while they were `NA` without weights. This stems from the fact that group '4.0.1' only has one observation, and in the Bessel-corrected estimate of the variance there is an `n - 1` in the denominator, which becomes `0` if `n = 1`; division by `0` then yields `NA` (`fvar` was designed that way to match the behavior of `stats::var`). In the weighted version the denominator is `sum(w) - 1`, and if `sum(w)` is not 1, then the denominator is not `0`. The standard deviation is nevertheless still `0` because the sum of squares in the numerator is `0`. In other words, in a weighted aggregation singleton groups are not treated like singleton groups unless the corresponding weight is `1`.].
<!-- There is also a *collapse* predicate `add_vars` which serves as a much faster and more versatile alternative to `cbind.data.frame`. The intention behind `add_vars` is to be able to efficiently add multiple columns to an existing data.frame. Thus in a call `add_vars(data, newcols1, newcols2)`, `newcols1` and `newcols2` are added (by default) at the end of `data`, while preserving all attributes of `data`. -->
```{r, eval=NCRAN}
# Grouped and weighted mean and sd and grouped min and max
add_vars(g[["groups"]],
add_stub(fmean(dat, g, weights, use.g.names = FALSE), "w_mean_"),
add_stub(fsd(dat, g, weights, use.g.names = FALSE), "w_sd_"),
add_stub(fmin(dat, g, use.g.names = FALSE), "min_"),
add_stub(fmax(dat, g, use.g.names = FALSE), "max_"))
# Binding and reordering columns in a single step: Add columns in specific positions
add_vars(g[["groups"]],
add_stub(fmean(dat, g, weights, use.g.names = FALSE), "w_mean_"),
add_stub(fsd(dat, g, weights, use.g.names = FALSE), "w_sd_"),
add_stub(fmin(dat, g, use.g.names = FALSE), "min_"),
add_stub(fmax(dat, g, use.g.names = FALSE), "max_"),
pos = c(4,8,5,9,6,10,7,11))
```
The R overhead of this kind of programming in standard-evaluation is very low:
```{r, eval=NCRAN}
microbenchmark(call = add_vars(g[["groups"]],
add_stub(fmean(dat, g, weights, use.g.names = FALSE), "w_mean_"),
add_stub(fsd(dat, g, weights, use.g.names = FALSE), "w_sd_"),
add_stub(fmin(dat, g, use.g.names = FALSE), "min_"),
add_stub(fmax(dat, g, use.g.names = FALSE), "max_")))
```
### 4.4 Transformations Using the `TRA` Argument
As a final layer of added complexity, we could utilize the `TRA` argument to generate groupwise-weighted demeaned, and scaled data, with additional columns giving the group-minimum and maximum values:
```{r, eval=NCRAN}
head(add_vars(get_vars(mtcars, ind),
add_stub(fmean(dat, g, weights, "-"), "w_demean_"), # This calculates weighted group means and uses them to demean the data
add_stub(fsd(dat, g, weights, "/"), "w_scale_"), # This calculates weighted group sd's and uses them to scale the data
add_stub(fmin(dat, g, "replace"), "min_"), # This replaces all observations by their group-minimum
add_stub(fmax(dat, g, "replace"), "max_"))) # This replaces all observations by their group-maximum
```
It is also possible to `add_vars<-` to `mtcars` itself. The default option would add these columns at the end, but we could also specify positions:
```{r, eval=NCRAN}
# This defines the positions where we want to add these columns
pos <- as.integer(c(2, 8, 3, 9, 4, 10, 5, 11))
add_vars(mtcars, pos) <- c(add_stub(fmean(dat, g, weights, "-"), "w_demean_"),
                           add_stub(fsd(dat, g, weights, "/"), "w_scale_"),
                           add_stub(fmin(dat, g, "replace"), "min_"),
                           add_stub(fmax(dat, g, "replace"), "max_"))
head(mtcars)
rm(mtcars)
```
Together with `ftransform`, things can become arbitrarily complex:
```{r, eval=NCRAN}
# Two different grouped and weighted computations (mutate operations) performed in one call
settransform(mtcars, carb_dwmed_cyl = fmedian(carb, cyl, weights, "-"),
                     carb_wsd_vs_am = fsd(carb, list(vs, am), weights, "replace"))
# Multivariate
settransform(mtcars, c(fmedian(list(carb_dwmed_cyl = carb, mpg_dwmed_cyl = mpg), cyl, weights, "-"),
                       fsd(list(carb_wsd_vs_am = carb, mpg_wsd_vs_am = mpg), list(vs, am), weights, "replace")))
# Nested: computing the weighted 3rd quartile of mpg, grouped by cyl and by whether carb
# is greater than its weighted median (computed by groups of vs)
settransform(mtcars,
             mpg_gwQ3_cyl = fnth(mpg, 0.75, list(cyl, carb > fmedian(carb, vs, weights, 1L)), weights, 1L))
head(mtcars)
rm(mtcars)
```
With the full set of 14 *Fast Statistical Functions*, and the additional vector-valued functions and operators (`fscale/STD, fbetween/B, fwithin/W, fhdbetween/HDB, fhdwithin/HDW, flag/L/F, fdiff/D, fgrowth/G`) discussed later, *collapse* provides extraordinary new possibilities for highly complex and efficient statistical programming in R. Computation speeds generally exceed those of packages like *dplyr* or *data.table*, sometimes by orders of magnitude. Column-wise matrix computations are also highly efficient and comparable to packages like *matrixStats* and base R functions like `colSums`. In particular, the ability to perform grouped and weighted computations on matrices is new to R and very useful for complex computations (such as aggregating input-output tables, etc.).
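As a brief sketch of the matrix methods just mentioned, we can reuse the `dat`, `g` and `weights` objects from above (`qM` is *collapse*'s quick converter from data frame to matrix):
```{r, eval=NCRAN}
m <- qM(dat)                  # Convert the data frame to a matrix
fmean(m, g, weights)          # Weighted group means, computed column-wise
head(fwithin(m, g, weights))  # Weighted within-group centering, same as W(m, g, weights)
```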
Note that the above examples merely provide suggestions for the use of these features and focus on programming with data frames (as functions like `get_vars` and `add_vars` are made for data frames). Equally efficient code could be written using vectors or matrices.
## 5. Advanced Data Aggregation
The grouped statistical programming introduced in the previous section is the fastest and most customizable way of dealing with many data transformation problems. However, some tasks, such as multivariate aggregations on a single data frame, are so common that they called for a more compact solution which efficiently integrates multiple computational steps.
For such purposes `collap` was created as a fast multi-purpose aggregation command designed to solve complex aggregation problems efficiently and with a minimum of coding. `collap` performs optimally together with the *Fast Statistical Functions*, but will also work with other functions.
To perform the above aggregation with `collap`, one would simply need to type:
```{r, eval=NCRAN}
collap(mtcars, mpg + disp ~ cyl + vs + am, list(fmean, fsd, fmin, fmax),
       w = weights, keep.col.order = FALSE)
```
`collap` here also saves the sum of the weights in a column. The original idea behind `collap` is, however, better demonstrated with a different dataset. Consider the *World Development Dataset* `wlddev` introduced in section 1:
```{r, eval=NCRAN}
head(wlddev)
```
Suppose we would like to aggregate this data by country and decade, but keep all the categorical information. With `collap` this is extremely simple:
```{r, eval=NCRAN}
collap(wlddev, ~ iso3c + decade) %>% head
```
Note that the columns of the data are in the original order and also retain all their attributes. To understand this result let us briefly examine the syntax of `collap`:
```{r eval=FALSE}
collap(X, by, FUN = fmean, catFUN = fmode, cols = NULL, w = NULL, wFUN = fsum,
       custom = NULL, keep.by = TRUE, keep.w = TRUE, keep.col.order = TRUE,
       sort.row = TRUE, parallel = FALSE, mc.cores = 1L,
       return = c("wide", "list", "long", "long_dupl"), give.names = "auto") # , ...
```
It is clear that `X` is the data and `by` supplies the grouping information, which can be a one- or two-sided formula, or alternatively grouping vectors, factors, lists and `GRP` objects (like the *Fast Statistical Functions*). Then `FUN` provides the function(s) applied only to numeric variables in `X` and defaults to `fmean`, while `catFUN` provides the function(s) applied only to categorical variables in `X` and defaults to `fmode`^[I.e. the most frequent value. By default a first-mode is computed.]. `keep.col.order = TRUE` specifies that the data is to be returned with the original column order. Thus, in the above example, it was sufficient to supply `X` and `by`, and `collap` did the rest for us.
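To illustrate the division of labor between `FUN` and `catFUN`, here is a quick sketch overriding both defaults, aggregating numeric columns with `fmedian` and categorical columns with `ffirst`:
```{r, eval=NCRAN}
collap(wlddev, ~ iso3c + decade, FUN = fmedian, catFUN = ffirst) %>% head
```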
Suppose we only want to aggregate 4 series in this dataset.
```{r, eval=NCRAN}
# Same as collap(wlddev, ~ iso3c + decade, cols = 9:12)
collap(wlddev, PCGDP + LIFEEX + GINI + ODA ~ iso3c + decade) %>% head
```
As before, we could use multiple functions by putting them in a named or unnamed list^[If the list is unnamed, `collap` uses `all.vars(substitute(list(FUN1, FUN2, ...)))` to get the function names. Alternatively, it is also possible to pass a character vector of function names, as sketched after the next chunk.]:
```{r, eval=NCRAN}
collap(wlddev, ~ iso3c + decade, list(fmean, fmedian, fsd), cols = 9:12) %>% head
```
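Equivalently, as mentioned in the footnote above, a character vector of function names can be passed:
```{r, eval=NCRAN}
collap(wlddev, ~ iso3c + decade, c("fmean", "fmedian", "fsd"), cols = 9:12) %>% head
```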
With multiple functions, we could also request `collap` to return the data in a long format:
```{r, eval=NCRAN}
collap(wlddev, ~ iso3c + decade, list(fmean, fmedian, fsd), cols = 9:12, return = "long") %>% head
```
A very important feature of `collap` to highlight at this point is the `custom` argument, which allows the user to circumvent the broad distinction between numeric and categorical data (and the associated `FUN` and `catFUN` arguments) and to specify exactly which columns to aggregate using which functions:
```{r, eval=NCRAN}
collap(wlddev, ~ iso3c + decade,
       custom = list(fmean = 9:10, fmedian = 11:12,
                     ffirst = c("country", "region", "income"),
                     flast = c("year", "date"),
                     fmode = "OECD")) %>% head
```
Since *collapse* 1.5.0, it is also possible to perform weighted aggregations, and to append function names with `_uw` to yield an unweighted computation:
```{r, eval=NCRAN}
# This aggregates using weighted mean and mode, and unweighted median, first and last value
collap(wlddev, ~ region + year, w = ~ POP,
       custom = list(fmean = 9:10, fmedian_uw = 11:12,
                     ffirst_uw = c("country", "region", "income"),
                     flast_uw = c("year", "date"),
                     fmode = "OECD"), keep.w = FALSE) %>% head
```
Next to `collap`, the function `collapv` provides a programmer's alternative allowing grouping and weighting columns to be passed using column names or indices, and the function `collapg` operates on grouped data frames.
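A minimal sketch of both (the grouped data frame is created with `fgroup_by`; the chosen columns and functions are merely illustrative):
```{r, eval=NCRAN}
# collapv: grouping columns passed as names (or indices)
collapv(wlddev, c("iso3c", "decade"), cols = 9:12) %>% head
# collapg: operates on a grouped data frame
wlddev %>% fgroup_by(iso3c, decade) %>% collapg(fmedian) %>% head
```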