---
title: "Introduction to collapse"
subtitle: "Advanced and Fast Data Transformation in R"
author: "Sebastian Krantz"
date: "2021-06-27"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Introduction to collapse}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{css, echo=FALSE}
pre {
  max-height: 500px;
  overflow-y: auto;
}
pre[class] {
  max-height: 500px;
}
```
```{r, echo=FALSE}
NCRAN <- identical(Sys.getenv("NCRAN"), "TRUE")
RUNBENCH <- NCRAN && identical(Sys.getenv("RUNBENCH"), "TRUE")
oldopts <- options(width = 100L)
```
```{r, echo = FALSE, message = FALSE, warning=FALSE, eval=NCRAN}
library(vars)
library(dplyr) # Needed because otherwise dplyr is loaded in benchmark chunk not run on CRAN !!
library(magrittr)
library(microbenchmark) # Same thing
library(collapse)
library(data.table)
B <- collapse::B # making sure it masks vars::B by loading into GE
knitr::opts_chunk$set(error = FALSE, message = FALSE, warning = FALSE,
comment = "#", tidy = FALSE, cache = FALSE, collapse = TRUE,
fig.width = 8, fig.height = 5,
out.width = '100%')
X = mtcars[1:2]
by = mtcars$cyl
F <- getNamespace("collapse")$F
set.seed(101)
```
*collapse* is a C/C++ based package for data transformation and statistical computing in R. Its aims are:
1. To facilitate complex data transformation, exploration and computing tasks in R.
2. To help make R code fast, flexible, parsimonious and programmer friendly.
This vignette demonstrates these two points and introduces all main features of the package in a structured way. The chapters are largely self-contained; however, the first chapters introduce the data and the faster data manipulation functions that are used throughout the rest of this vignette.
<!-- structured with an overarching aim to instruct the reader how to write fast R code for (advanced) data manipulation with *collapse*. -->
***
**Notes:**
- Apart from this vignette, *collapse* comes with a built-in structured documentation available under `help("collapse-documentation")` after installing the package, and `help("collapse-package")` provides a compact set of examples for quick-start. A cheat sheet is available at [Rstudio](<https://raw.githubusercontent.com/rstudio/cheatsheets/master/collapse.pdf>).
<!-- To learn *collapse* as quickly as possible it is possibly better to consult those resources than going through this document. -->
- The two other vignettes focus on the integration of *collapse* with *dplyr* workflows (recommended for *dplyr* / *tidyverse* users), and on the integration of *collapse* with the *plm* package (+ some advanced programming with panel data).
- Documentation and vignettes can also be viewed [online](<https://sebkrantz.github.io/collapse/>).
<!-- - All benchmarks are run with a Windows 8.1 laptop, 2x 2.2 GHZ Intel i5 processor, 8GB DDR3 RAM and a Samsung 850 EVO SSD. -->
***
## Why *collapse*?
*collapse* is a high-performance package that extends and enhances the data-manipulation capabilities of R and existing popular packages (such as *dplyr*, *data.table*, and matrix packages). Its main focus is on grouped and weighted statistical programming, complex aggregations and transformations, time series and panel data operations, and programming with lists of data objects. The lead author is an applied economist and created the package mainly to facilitate advanced computations on varied and complex data, in particular surveys, (multivariate) time series, multilevel / panel data, and lists / model objects.
A secondary aspect of applied work is that data is often imported into R from richer data structures (such as STATA, SPSS or SAS files imported with *haven*). This called for an intelligent suite of data manipulation functions that can both utilize aspects of the richer data structure (such as variable labels) and preserve the data structure / attributes in computations. Sometimes specialized classes like *xts*, *pdata.frame* and *grouped_df* can also become very useful to manipulate certain types of data. Thus *collapse* was built to explicitly support these classes, while preserving most other classes / data structures in R.
<!-- , enabling flexible, efficient and non-destructive workflows with complex data. -->
Another objective was to radically improve the speed of R code by extensively relying on efficient algorithms in C/C++ and the faster components of base R. *collapse* ranks among the fastest R packages, and performs many grouped and/or weighted computations noticeably faster than *dplyr* or *data.table*.
A final development objective was to channel this performance through a stable and well conceived user API providing extensive and optimized programming capabilities (in standard evaluation) while also facilitating quick use and easy integration with existing data manipulation frameworks (in particular *dplyr* / *tidyverse* and *data.table*, both relying on non-standard evaluation).
<!--
*collapse* also provides exemplary documentation (built-in, vignettes, website) and testing. Testing currently covers all core features of the package, amounting to > 7700 unit tests, with extensive testing of statistical functions and computations.
The name of the package derives from the 'collapse' command for multi-type aggregation in the STATA statistical software. This was the first function in this package (later renamed to `collap` to avoid naming conflicts with *dplyr*). -->
## 1. Data and Summary Tools
We begin by introducing some powerful summary tools along with the 2 panel datasets *collapse* provides, which are used throughout this vignette. If you are just interested in programming you can skip this section. Apart from the 2 datasets that come with *collapse* (`wlddev` and `GGDC10S`), this vignette uses a few well known datasets from base R: `mtcars`, `iris`, `airquality`, and the time series `AirPassengers` and `EuStockMarkets`.
### 1.1 `wlddev` - World Bank Development Data
This dataset contains 5 key World Bank Development Indicators covering 216 countries for up to 61 years (1960-2020). It is a balanced panel with $216 \times 61 = 13176$ observations.
```{r, eval=NCRAN}
library(collapse)
head(wlddev)
# The variables have "label" attributes. Use vlabels() to get and set labels
namlab(wlddev, class = TRUE)
```
Of the categorical identifiers, the date variable was artificially generated so that this example dataset contains all common data types frequently encountered in R. A detailed statistical description of this data is computed by `descr`:
```{r, eval=NCRAN}
# A fast and detailed statistical description
descr(wlddev)
```
The output of `descr` can be converted into a tidy data frame using:
```{r, eval=NCRAN}
head(as.data.frame(descr(wlddev)))
```
Note that `descr` does not require data to be labeled. Since `wlddev` is a panel data set tracking countries over time, we might be interested in checking which variables are time-varying, with the function `varying`:
```{r, eval=NCRAN}
varying(wlddev, wlddev$iso3c)
```
`varying` tells us that all 5 variables `PCGDP`, `LIFEEX`, `GINI`, `ODA` and `POP` vary over time. However, the `OECD` variable does not, so this data does not track when countries entered the OECD. We can also take a more detailed look by letting `varying` check the variation within each country:
```{r, eval=NCRAN}
head(varying(wlddev, wlddev$iso3c, any_group = FALSE))
```
`NA` indicates that there are no data for this country. In general data is varying if it has two or more distinct non-missing values. We could also take a closer look at observation counts and distinct values using:
```{r, eval=NCRAN}
head(fnobs(wlddev, wlddev$iso3c))
head(fndistinct(wlddev, wlddev$iso3c))
```
Note that `varying` is more efficient than `fndistinct`, although both functions are very fast.
Even more powerful summary methods for multilevel / panel data are provided by `qsu` (shorthand for *quick-summary*). It is modeled after *STATA*'s *summarize* and *xtsummarize* commands. Calling `qsu` on the data gives a concise summary. We can subset columns internally using the `cols` argument:
```{r, eval=NCRAN}
qsu(wlddev, cols = 9:12, higher = TRUE) # higher adds skewness and kurtosis
```
We could easily compute these statistics by region:
```{r, eval=NCRAN}
qsu(wlddev, by = ~region, cols = 9:12, vlabels = TRUE, higher = TRUE)
```
Computing summary statistics by country is of course also possible but would be too much information. Fortunately `qsu` lets us do something much more powerful:
```{r, eval=NCRAN}
qsu(wlddev, pid = ~ iso3c, cols = c(1,4,9:12), vlabels = TRUE, higher = TRUE)
```
The above output reports 3 sets of summary statistics for each variable: Statistics computed on the *Overall* (raw) data, and on the *Between*-country (i.e. country averaged) and *Within*-country (i.e. country-demeaned) data^[in the *Within* data, the overall mean was added back after subtracting out country means, to preserve the level of the data, see also section 6.5.]. This is a powerful way to summarize panel data because aggregating the data by country gives us a cross-section of countries with no variation over time, whereas subtracting country specific means from the data eliminates all cross-sectional variation. <!-- Thus we summarize the variation in our panel in 3 different ways: First we consider the raw data, then we create a cross-section of countries and summarize that, and then we sweep that cross-section out of the raw data and pretend we have a time series. -->
So what can these statistics tell us about our data? The `N/T` column shows that for `PCGDP` we have 8995 total observations, that we observe GDP data for 203 countries, and that we have on average 44.3 observations (time periods) per country. In contrast, the GINI index is only available for 161 countries, with 8.4 observations on average. The *Overall* and *Within* mean of the data are identical by definition, and the *Between* mean would also be the same in a balanced panel with no missing observations. In practice we have unequal numbers of observations for different countries, thus countries have different weights in the *Overall* mean, and the difference between the *Overall* and *Between*-country mean reflects this discrepancy. The most interesting statistic in this summary is arguably the standard deviation, in particular the comparison of the *Between*-SD, reflecting the variation between countries, and the *Within*-SD, reflecting the average variation over time. This comparison shows that PCGDP, LIFEEX and GINI vary more between countries, but ODA received varies more within countries over time. The 0 *Between*-SD for the year variable, together with the fact that the *Overall* and *Within*-SD are equal, shows that year is individual-invariant. Thus `qsu` also provides the same information as `varying`, but with additional details on the relative magnitudes of cross-sectional and time series variation. It is also a common pattern that the *kurtosis* increases in within-transformed data, while the *skewness* decreases in most cases.
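To make the decomposition concrete, the three standard deviations reported for a series like `PCGDP` can be replicated manually (a quick sketch using `fsd`, `fmean` and `fwithin`, which are formally introduced in sections 4 and 6; adding back the overall mean in the *Within* data does not affect the SD):
```{r, eval=NCRAN}
fsd(wlddev$PCGDP)                        # Overall SD
fsd(fmean(wlddev$PCGDP, wlddev$iso3c))   # Between SD: SD of the country means
fsd(fwithin(wlddev$PCGDP, wlddev$iso3c)) # Within SD: SD of the country-demeaned data
```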
<!-- The output above is a 3D array of statistics which can also be subsetted (`[`) or permuted using `aperm()`. -->
We could also do all of that by regions to have a look at the between and within country variations inside and across different World regions:
```{r, eval=NCRAN}
qsu(wlddev, by = ~ region, pid = ~ iso3c, cols = 9:12, vlabels = TRUE, higher = TRUE)
```
Notice that the output here is a 4D array of summary statistics, which we could also subset (`[`) or permute (`aperm`) to view these statistics in any convenient way. If we don't like the array, we can also output as a nested list of statistics matrices:
```{r, eval=NCRAN}
l <- qsu(wlddev, by = ~ region, pid = ~ iso3c, cols = 9:12, vlabels = TRUE,
higher = TRUE, array = FALSE)
str(l, give.attr = FALSE)
```
Such a list of statistics matrices could, for example, be converted into a tidy data frame using `unlist2d` (more about this in the section on list-processing):
```{r, eval=NCRAN}
head(unlist2d(l, idcols = c("Variable", "Trans"), row.names = "Region"))
```
This is not yet the end of `qsu`'s functionality: we can also do all of the above on panel-surveys utilizing weights (`w` argument).
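For example (a sketch; weighting by population is purely illustrative, and `w` also accepts one-sided formulas):
```{r, eval=NCRAN}
# Population-weighted overall, between- and within-country statistics
qsu(wlddev, pid = ~ iso3c, w = ~ POP, cols = 9:10, vlabels = TRUE)
```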
Finally, we can look at (weighted) pairwise correlations in this data:
```{r, eval=NCRAN}
pwcor(wlddev[9:12], N = TRUE, P = TRUE)
```
which can of course also be computed on averaged and within-transformed data:
```{r, eval=NCRAN}
print(pwcor(fmean(wlddev[9:12], wlddev$iso3c), N = TRUE, P = TRUE), show = "lower.tri")
# N is same as overall N shown above...
print(pwcor(fwithin(wlddev[9:12], wlddev$iso3c), P = TRUE), show = "lower.tri")
```
A useful function called by `pwcor` is `pwnobs`, which is very handy to explore the joint observation structure when selecting variables to include in a statistical model:
```{r, eval=NCRAN}
pwnobs(wlddev)
```
Note that both `pwcor/pwcov` and `pwnobs` are faster on matrices.
<!-- *Note:* Other distributional statistics like the *median* and *quantiles* are currently not implemented for reasons having to do with computation speed (>10x faster than `base::summary` and suitable for really large panels) and the algorithm^[`qsu` uses a numerically stable online algorithm generalized from Welford's Algorithm to compute variances.] behind `qsu`, but might come in a further update of `qsu`. -->
### 1.2 `GGDC10S` - GGDC 10-Sector Database
The Groningen Growth and Development Centre 10-Sector Database provides long-run data on sectoral productivity performance in Africa, Asia, and Latin America. Variables covered in the data set are annual series of value added (VA, in local currency), and persons employed (EMP) for 10 broad sectors.
```{r, eval=NCRAN}
head(GGDC10S)
namlab(GGDC10S, class = TRUE)
fnobs(GGDC10S)
fndistinct(GGDC10S)
# The countries included:
cat(funique(GGDC10S$Country, sort = TRUE))
```
The first problem in summarizing this data is that value added (VA) is in local currency; the second is that it contains 2 different variables (VA and EMP) stacked in the same column. One way of solving the first problem is to convert the data to percentages by dividing by the overall VA and EMP contained in the last column. A different solution involving grouped scaling is introduced in section 6.4. The second problem is again nicely handled by `qsu`, which can also compute panel-statistics by groups.
```{r, eval=NCRAN}
# Converting data to percentages of overall VA / EMP; ftransformv (introduced in section 2.4) keeps the attributes
pGGDC10S <- ftransformv(GGDC10S, 6:15, `*`, 100 / SUM)
# Summarizing the sectoral data by variable, overall, between and within countries
su <- qsu(pGGDC10S, by = ~ Variable, pid = ~ Variable + Country,
cols = 6:16, higher = TRUE)
# This gives a 4D array of summary statistics
str(su)
# Permuting this array to a more readable format
aperm(su, c(4L, 2L, 3L, 1L))
```
The statistics show that the dataset is very consistent: Employment data cover 42 countries and 53 time-periods in almost all sectors. Agriculture is the largest sector in terms of employment, amounting to a 35% share of employment across countries and time, with a standard deviation (SD) of around 27%. The between-country SD in agricultural employment share is 24% and the within SD is 12%, indicating that processes of structural change are very gradual and most of the variation in structure is between countries. The next largest sectors after agriculture are manufacturing, wholesale and retail trade and government, each claiming an approx. 15% share of the economy. In these sectors the between-country SD is also about twice as large as the within-country SD.
In terms of value added, the data covers 43 countries in 50 time-periods. Agriculture, manufacturing, wholesale and retail trade and government are also the largest sectors in terms of VA, but with a diminished agricultural share (around 17%) and a greater share for manufacturing (around 20%). The variation between countries is again greater than the variation within countries, but it seems that at least in terms of agricultural VA share there is also a considerable within-country SD of 8%. This is also true for the finance and real estate sector with a within SD of 9%, suggesting (using a bit of common sense) that a diminishing VA share in agriculture and increased VA share in finance and real estate was a pattern characterizing most of the countries in this sample.
As a final step we consider a plot function which can be used to plot the structural transformation of any supported country. Below for Botswana:
```{r, eval=NCRAN}
library(data.table)
library(ggplot2)
library(magrittr)
plotGGDC <- function(ctry) {
# Select and subset
fsubset(GGDC10S, Country == ctry, Variable, Year, AGR:SUM) %>%
# Convert to shares and replace negative values with NA
ftransform(fselect(., AGR:OTH) %>%
lapply(`*`, 1 / SUM) %>%
replace_outliers(0, NA, "min")) %>%
# Remove totals column and make proper variable labels
ftransform(Variable = recode_char(Variable,
VA = "Value Added Share",
EMP = "Employment Share"),
SUM = NULL) %>%
# Fast conversion to data.table
qDT %>%
# data.table's melt function
melt(1:2, variable.name = "Sector", na.rm = TRUE) %>%
# ggplot with some scales provided by the 'scales' package
ggplot(aes(x = Year, y = value, fill = Sector)) +
geom_area(position = "fill", alpha = 0.9) + labs(x = NULL, y = NULL) +
theme_linedraw(base_size = 14L) + facet_wrap( ~ Variable) +
scale_fill_manual(values = sub("#00FF66", "#00CC66", rainbow(10L))) +
scale_x_continuous(breaks = scales::pretty_breaks(n = 7L), expand = c(0, 0)) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 10L), expand = c(0, 0),
labels = scales::percent) +
theme(axis.text.x = element_text(angle = 315, hjust = 0, margin = ggplot2::margin(t = 0)),
strip.background = element_rect(colour = "grey20", fill = "grey20"),
strip.text = element_text(face = "bold"))
}
# Plotting the structural transformation of Botswana
plotGGDC("BWA")
```
## 2. Fast Data Manipulation
A lot of R code is not concerned with statistical computations but with preliminary data wrangling.
<!-- Very frequent operations include selecting, replacing, subsetting, ordering, adding/computing, and deleting data / columns.-->
For various reasons R development has focused on data frames as the main medium to contain data, although matrices / arrays provide significantly faster methods for common manipulations.
A first essential step towards optimizing R code is thus to speed up very frequent manipulations on data frames. *collapse* introduces a set of highly optimized functions to efficiently manipulate (mostly) data frames. Most manipulations can be conducted in non-standard evaluation or standard evaluation (utilizing different functions), and all functions preserve the data structure (i.e. they can be used with data.table, tbl_df, grouped_df, pdata.frame etc.).
<!-- Some of these functions (`fselect`, `roworder`, `franame`, `fsubset`, `ss` and `ftransform`) represent improved versions of existing ones (`dplyr::select`, `dplyr::arrange`, `dplyr::rename`, `base::subset`, `base::[.data.frame` and `base::transform`) while others are added in. Also some functions (`fselect`, `roworder`, `frename`, `colorder`, `fsubset`, `ftransform`, `settransform`, `fcompute`) use non-standard evaluation, whereas others (`get_vars`, `roworderv`, `colorderv`, `ss`, `add_vars`, `num_vars`, etc.) offer some of the same functionality with standard evaluation and are thus more programmer friendly. Here we run through all of them briefly: -->
### 2.1 Selecting and Replacing Columns
`fselect` is an analogue to `dplyr::select`, but executes about 100x faster. It can be used to select variables using expressions involving variable names:
```{r, eval=NCRAN}
library(magrittr) # Pipe operators
fselect(wlddev, country, year, PCGDP:ODA) %>% head(2)
fselect(wlddev, -country, -year, -(PCGDP:ODA)) %>% head(2)
library(microbenchmark)
microbenchmark(fselect = collapse::fselect(wlddev, country, year, PCGDP:ODA),
select = dplyr::select(wlddev, country, year, PCGDP:ODA))
```
In contrast to `dplyr::select`, `fselect` has a replacement method
```{r, eval=NCRAN}
# Computing the log of columns
fselect(wlddev, PCGDP:POP) <- lapply(fselect(wlddev, PCGDP:POP), log)
head(wlddev, 2)
# Efficient deleting
fselect(wlddev, country, year, PCGDP:POP) <- NULL
head(wlddev, 2)
rm(wlddev)
```
and it can also return information about the selected columns other than the data itself.
```{r, eval=NCRAN}
fselect(wlddev, PCGDP:POP, return = "names")
fselect(wlddev, PCGDP:POP, return = "indices")
fselect(wlddev, PCGDP:POP, return = "named_indices")
fselect(wlddev, PCGDP:POP, return = "logical")
fselect(wlddev, PCGDP:POP, return = "named_logical")
```
<!-- `fselect` is a lot faster than `dplyr::select` and maintains this performance on large data. -->
While `fselect` is faster than `dplyr::select`, it is also simpler and does not offer special methods for grouped tibbles (e.g. where grouping columns are always selected) and some other *dplyr*-specific features of `select`. We will see that this is not a problem at all when working with statistical functions in *collapse* that have a grouped_df method, but users should be careful replacing `dplyr::select` with `fselect` in *dplyr* scripts. From *collapse* 1.6.0, `fselect` has explicit support for *sf* data frames.
<!-- For some reason this does not work in build !! -->
<!-- ```{r, error=FALSE, warning=FALSE} -->
<!-- library(microbenchmark) -->
<!-- library(dplyr) -->
<!-- identical(select(wlddev, country, year, PCGDP:ODA), fselect(wlddev, country, year, PCGDP:ODA)) -->
<!-- microbenchmark(select(wlddev, country, year, PCGDP:ODA), fselect(wlddev, country, year, PCGDP:ODA)) -->
<!-- ``` -->
The standard-evaluation analogue to `fselect` is the function `get_vars`. `get_vars` can be used to select variables using names, indices, logical vectors, functions or regular expressions evaluated against column names:
```{r, eval=NCRAN}
get_vars(wlddev, 9:13) %>% head(1)
get_vars(wlddev, c("PCGDP","LIFEEX","GINI","ODA","POP")) %>% head(1)
get_vars(wlddev, "[[:upper:]]", regex = TRUE) %>% head(1)
get_vars(wlddev, "PC|LI|GI|OD|PO", regex = TRUE) %>% head(1)
# Same as above, vectors of regular expressions are sequentially passed to grep
get_vars(wlddev, c("PC","LI","GI","OD","PO"), regex = TRUE) %>% head(1)
get_vars(wlddev, is.numeric) %>% head(1)
# Returning other information
get_vars(wlddev, is.numeric, return = "names")
get_vars(wlddev, "[[:upper:]]", regex = TRUE, return = "named_indices")
```
Replacing operations work analogously:
```{r, eval=NCRAN}
get_vars(wlddev, 9:13) <- lapply(get_vars(wlddev, 9:13), log)
get_vars(wlddev, 9:13) <- NULL
head(wlddev, 2)
rm(wlddev)
```
`get_vars` is about 2x faster than `[.data.frame`, and `get_vars<-` is about 6-8x faster than `[<-.data.frame`.
<!-- ```{r} -->
<!-- series <- wlddev[9:12] -->
<!-- microbenchmark(get_vars(wlddev, 9:12), wlddev[9:12]) -->
<!-- microbenchmark(get_vars(wlddev, 9:12) <- series, wlddev[9:12] <- series) -->
<!-- microbenchmark(get_vars(wlddev, 9:12) <- get_vars(wlddev, 9:12), wlddev[9:12] <- wlddev[9:12]) -->
<!-- ``` -->
In addition to `get_vars`, *collapse* offers a set of functions to efficiently select and replace data by data type: `num_vars`, `cat_vars` (for categorical = non-numeric columns), `char_vars`, `fact_vars`, `logi_vars` and `date_vars` (for date and date-time columns).
```{r, eval=NCRAN}
head(num_vars(wlddev), 2)
head(cat_vars(wlddev), 2)
head(fact_vars(wlddev), 2)
# Replacing
fact_vars(wlddev) <- fact_vars(wlddev)
```
### 2.2 Subsetting
`fsubset` is an enhanced version of `base::subset` using C functions from the *data.table* package for fast subsetting operations. In contrast to `base::subset`, `fsubset` allows multiple comma-separated select arguments after the subset argument, and it also preserves all attributes of subsetted columns:
```{r, eval=NCRAN}
# Returning only value-added data after 1990
fsubset(GGDC10S, Variable == "VA" & Year > 1990, Country, Year, AGR:GOV) %>% head(2)
# Same thing
fsubset(GGDC10S, Variable == "VA" & Year > 1990, -(Regioncode:Variable), -(OTH:SUM)) %>% head(2)
```
It is also possible to use standard evaluation with `fsubset`, but for these purposes the function `ss` exists as a fast and more secure alternative to `[.data.frame`:
```{r, eval=NCRAN}
ss(GGDC10S, 1:2, 6:16) # or fsubset(GGDC10S, 1:2, 6:16), but not recommended.
ss(GGDC10S, -(1:2), c("AGR","MIN")) %>% head(2)
```
Thanks to the *data.table* C code and optimized R code, `fsubset` is very fast.
```{r, eval=NCRAN}
microbenchmark(base = subset(GGDC10S, Variable == "VA" & Year > 1990, AGR:SUM),
collapse = fsubset(GGDC10S, Variable == "VA" & Year > 1990, AGR:SUM))
microbenchmark(GGDC10S[1:10, 1:10], ss(GGDC10S, 1:10, 1:10))
```
Like `base::subset`, `fsubset` is an S3 generic with methods for vectors, matrices and data frames. For certain classes such as factors, `fsubset.default` also improves upon `[`, but the largest improvements are with the data frame method.
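A small sketch of the default method on a factor:
```{r, eval=NCRAN}
# The default method takes logical or integer subsets and preserves attributes such as factor levels
sp <- iris$Species
str(fsubset(sp, 1:3))
fsubset(sp, sp == "setosa") %>% head(3)
```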
### 2.3 Reordering Rows and Columns
`roworder` is a fast analogue to `dplyr::arrange`. The syntax is inspired by `data.table::setorder`, so that negative variable names indicate descending sort.
```{r, eval=NCRAN}
roworder(GGDC10S, -Variable, Country) %>% ss(1:2, 1:8)
microbenchmark(collapse = collapse::roworder(GGDC10S, -Variable, Country),
dplyr = dplyr::arrange(GGDC10S, desc(Variable), Country))
```
In contrast to `data.table::setorder`, `roworder` creates a copy of the data frame (unless data are already sorted). If this copy is not required, `data.table::setorder` is faster. The function `roworderv` is a standard evaluation analogue to `roworder`:
```{r, eval=NCRAN}
# Same as above
roworderv(GGDC10S, c("Variable", "Country"), decreasing = c(TRUE, FALSE)) %>% ss(1:2, 1:8)
```
With `roworderv`, it is also possible to move or exchange rows in a data frame:
```{r, eval=NCRAN}
# If length(neworder) < fnrow(data), the default (pos = "front") brings rows to the front
roworderv(GGDC10S, neworder = which(GGDC10S$Country == "GHA")) %>% ss(1:2, 1:8)
# pos = "end" brings rows to the end
roworderv(GGDC10S, neworder = which(GGDC10S$Country == "BWA"), pos = "end") %>% ss(1:2, 1:8)
# pos = "exchange" arranges selected rows in the order they are passed, without affecting other rows
roworderv(GGDC10S, neworder = with(GGDC10S, c(which(Country == "GHA"),
which(Country == "BWA"))), pos = "exchange") %>% ss(1:2, 1:8)
```
Similarly, the pair `colorder` / `colorderv` facilitates efficient reordering of columns in a data frame. These functions do not require a deep copy of the data and are very fast. To reorder columns by reference, see also `data.table::setcolorder`.
```{r, eval=NCRAN}
# The default is again pos = "front" which brings selected columns to the front / left
colorder(GGDC10S, Variable, Country, Year) %>% head(2)
```
### 2.4 Transforming and Computing New Columns
`ftransform` is an improved version of `base::transform` for data frames and lists. `ftransform` can be used to compute new columns or modify and delete existing columns, and always returns the entire data frame.
```{r, eval=NCRAN}
ftransform(GGDC10S, AGR_perc = AGR / SUM * 100, # Computing Agricultural percentage
Year = as.integer(Year), # Coercing Year to integer
AGR = NULL) %>% tail(2) # Deleting column AGR
# Computing scalar results replicates them
ftransform(GGDC10S, MIN_mean = fmean(MIN), Intercept = 1) %>% tail(2)
```
The modification `ftransformv` exists to transform specific columns using a function:
<!--
# Same thing using fselect to get the right indices
# GGDC10S %>% ftransformv(fselect(., AGR:SUM, return = "indices"), `*`, 100/SUM) %>% tail(2)
-->
```{r, eval=NCRAN}
# Apply the log to columns 6-16
GGDC10S %>% ftransformv(6:16, log) %>% tail(2)
# Convert data to percentage terms
GGDC10S %>% ftransformv(6:16, `*`, 100/SUM) %>% tail(2)
# Apply log to numeric columns
GGDC10S %>% ftransformv(is.numeric, log) %>% tail(2)
```
Instead of passing comma-separated `column = value` expressions, it is also possible to bulk-process data with `ftransform` by passing a single list of expressions (such as a data frame). This is useful for more complex transformations involving multiple steps:
```{r, eval=NCRAN}
# Same as above, but also replacing any generated infinite values with NA
GGDC10S %>% ftransform(num_vars(.) %>% lapply(log) %>% replace_Inf) %>% tail(2)
```
This mode of usage toggles automatic column matching and replacement; non-matching columns are added to the data frame. Apart from `ftransform`, the function `settransform(v)` can be used to change the input data frame by reference:
<!-- and is a simple wrapper around `X <- ftransform(X, ...)`: -->
```{r, eval=NCRAN}
# Computing a new column and deleting some others by reference
settransform(GGDC10S, FIRE_MAN = FIRE / MAN,
Regioncode = NULL, Region = NULL)
tail(GGDC10S, 2)
rm(GGDC10S)
# Bulk-processing the data into percentage terms
settransformv(GGDC10S, 6:16, `*`, 100/SUM)
tail(GGDC10S, 2)
# Same thing via replacement
ftransform(GGDC10S) <- fselect(GGDC10S, AGR:SUM) %>% lapply(`*`, 100/.$SUM)
# Or using double pipes
GGDC10S %<>% ftransformv(6:16, `*`, 100/SUM)
rm(GGDC10S)
```
Another convenient addition is provided by the function `fcompute`, which can be used to compute new columns in a data frame environment and returns the computed columns in a new data frame:
```{r, eval=NCRAN}
fcompute(GGDC10S, AGR_perc = AGR / SUM * 100, FIRE_MAN = FIRE / MAN) %>% tail(2)
```
For more complex tasks see `?ftransform`.
### 2.5 Adding and Binding Columns
For cases where multiple columns are computed and need to be added to a data frame (regardless of whether names are duplicated or not), *collapse* introduces the predicate `add_vars`. Together with `add_vars`, the function `add_stub` is useful to add a prefix (default) or postfix to computed variables keeping the variable names unique:
```{r, eval=NCRAN}
# Efficient adding logged versions of some variables
add_vars(wlddev) <- get_vars(wlddev, 9:13) %>% lapply(log10) %>% add_stub("log10.")
head(wlddev, 2)
rm(wlddev)
```
By default `add_vars` appends a data frame towards the (right) end, but it can also add columns in front or at other positions in the data frame:
```{r, eval=NCRAN}
add_vars(wlddev, "front") <- get_vars(wlddev, 9:13) %>% lapply(log10) %>% add_stub("log10.")
head(wlddev, 2)
rm(wlddev)
add_vars(wlddev, c(10L,12L,14L,16L,18L)) <- get_vars(wlddev, 9:13) %>% lapply(log10) %>% add_stub("log10.")
head(wlddev, 2)
rm(wlddev)
```
`add_vars` can also be used without replacement, where it serves as a more efficient version of `cbind.data.frame`, with the difference that the data structure and attributes of the first argument are preserved:
```{r, eval=NCRAN}
add_vars(wlddev, get_vars(wlddev, 9:13) %>% lapply(log) %>% add_stub("log."),
get_vars(wlddev, 9:13) %>% lapply(log10) %>% add_stub("log10.")) %>% head(2)
add_vars(wlddev, get_vars(wlddev, 9:13) %>% lapply(log) %>% add_stub("log."),
get_vars(wlddev, 9:13) %>% lapply(log10) %>% add_stub("log10."),
pos = c(10L,13L,16L,19L,22L,11L,14L,17L,20L,23L)) %>% head(2)
identical(cbind(wlddev, wlddev), add_vars(wlddev, wlddev))
microbenchmark(cbind(wlddev, wlddev), add_vars(wlddev, wlddev))
```
### 2.6 Renaming Columns
`frename` is a fast substitute for `dplyr::rename`:
```{r, eval=NCRAN}
frename(GGDC10S, AGR = Agriculture, MIN = Mining) %>% head(2)
frename(GGDC10S, tolower) %>% head(2)
frename(GGDC10S, tolower, cols = .c(AGR, MIN)) %>% head(2)
```
The function `setrename` does this by reference:
```{r, eval=NCRAN}
setrename(GGDC10S, AGR = Agriculture, MIN = Mining)
head(GGDC10S, 2)
setrename(GGDC10S, Agriculture = AGR, Mining = MIN)
rm(GGDC10S)
```
Both functions are not limited to data frames but can be applied to any R object with a 'names' attribute.
### 2.7 Using Shortcuts
The most frequently required of the functions introduced above can be abbreviated as follows: `fselect -> slt`, `fsubset -> sbt`, `ftransform(v) -> tfm(v)`, `settransform(v) -> settfm(v)`, `get_vars -> gv`, `num_vars -> nv`, `add_vars -> av`. This makes it possible to write faster and more parsimonious code, but is recommended only for personally kept scripts. A lazy person may also decide to code everything using shortcuts and then do a Ctrl+F replacement with the long names on the finished script.
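For example, the following expressions are equivalent (a small sketch; the shortcuts are plain aliases for the full names):
```{r, eval=NCRAN}
identical(fselect(wlddev, country, year, PCGDP), slt(wlddev, country, year, PCGDP))
sbt(wlddev, year > 2018, country, year, PCGDP) %>% head(2)
tfm(gv(wlddev, c("country", "year", "POP")), POP_M = POP / 1e6) %>% head(2)
```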
<!--
and to facilitate the avoidance of piped `%>%` expressions. Needless to say pipes `%>%` have become a very convenient feature of the R language and do a great job avoiding complex nested calls. They do however require reconstructing the entire call before evaluating it, and thus take out a lot of speed:
```{r, eval=NCRAN}
microbenchmark(standard = tfm(gv(wlddev, 9:12), ODA_GDP = ODA/PCGDP),
piped = get_vars(wlddev, 9:12) %>% ftransform(ODA_GDP = ODA/PCGDP))
```
-->
### 2.8 Missing Values / Rows
The function `na_omit` is a much faster alternative to `stats::na.omit` for vectors, matrices and data frames. By default the 'na.action' attribute containing the removed cases is omitted, but it can be added with the option `na.attr = TRUE`. Like `fsubset`, `na_omit` preserves all column attributes as well as attributes of the data frame itself.
```{r, eval=NCRAN}
microbenchmark(na_omit(wlddev, na.attr = TRUE), na.omit(wlddev))
```
Another added feature is the removal of cases missing on certain columns only:
```{r, eval=NCRAN}
na_omit(wlddev, cols = .c(PCGDP, LIFEEX)) %>% head(2)
# only removing missing data from numeric columns -> same and slightly faster than na_omit(wlddev)
na_omit(wlddev, cols = is.numeric) %>% head(2)
```
For atomic vectors there is also the function `na_rm`, which is about 2x faster than `x[!is.na(x)]`. Both `na_omit` and `na_rm` return their argument if no missing cases were found.
The existence of missing cases can be checked using `missing_cases`, which is also considerably faster than `complete.cases` for data frames.
There is also a function `na_insert` to randomly insert missing values into vectors, matrices and data frames. The default is `na_insert(X, prop = 0.1)` so that 10% of values are randomly set to missing.
Finally, a function `allNA` provides the much needed opposite of `anyNA` for atomic vectors.
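The following sketch illustrates these helpers (`na_insert`, `na_rm`, `missing_cases` and `allNA`) on a small vector and on `airquality`:
```{r, eval=NCRAN}
x <- na_insert(rnorm(10), prop = 0.3)  # Randomly setting 30% of values to missing
na_rm(x)                               # Fast removal of missing values
head(missing_cases(airquality))        # Faster than !complete.cases(airquality)
allNA(x[is.na(x)])                     # TRUE: the opposite of anyNA
```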
### 2.9 Unique Values / Rows
Similar to `na_omit`, the function `funique` is a much faster alternative to `base::unique` for atomic vectors and data frames. Like most *collapse* functions it also seeks to preserve attributes.
```{r, eval=NCRAN}
funique(GGDC10S$Variable) # Unique values in order of appearance
funique(GGDC10S$Variable, sort = TRUE) # Sorted unique values
# If all values/rows are unique, the original data is returned (no copy)
identical(funique(GGDC10S), GGDC10S)
# Can remove duplicate rows by a subset of columns
funique(GGDC10S, cols = .c(Country, Variable)) %>% ss(1:2, 1:8)
funique(GGDC10S, cols = .c(Country, Variable), sort = TRUE) %>% ss(1:2, 1:8)
```
### 2.10 Recoding and Replacing Values
With `recode_num`, `recode_char`, `replace_NA`, `replace_Inf` and `replace_outliers`, *collapse* also introduces a set of functions to efficiently recode and replace numeric and character values in matrix-like objects (vectors, matrices, arrays, data frames, lists of atomic objects). When called on a data frame, `recode_num`, `replace_Inf` and `replace_outliers` will skip non-numeric columns, and `recode_char` skips non-character columns, whereas `replace_NA` replaces missing values in all columns.
```{r, eval=NCRAN}
# Efficient replacing missing values with 0
microbenchmark(replace_NA(GGDC10S, 0))
# Adding log-transformed sectoral data: Some NaN and Inf values generated
add_vars(GGDC10S, 6:16*2-5) <- fselect(GGDC10S, AGR:SUM) %>%
lapply(log) %>% replace_Inf %>% add_stub("log.")
head(GGDC10S, 2)
rm(GGDC10S)
```
`recode_num` and `recode_char` follow the syntax of `dplyr::recode` and provide more or less the same functionality except that they can efficiently be applied to matrices and data frames, and that `recode_char` allows for regular expression matching implemented via `base::grepl`:
<!-- that `dplyr::recode` offers a method for factors not provided in *collapse*, -->
```{r, eval=NCRAN}
month.name
recode_char(month.name, ber = "C", "^J" = "A", default = "B", regex = TRUE)
```
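`recode_num` works analogously for numeric values (a minimal sketch, following the `dplyr::recode` syntax as stated above):
```{r, eval=NCRAN}
# Recoding numeric values; 'missing' assigns a replacement for NA's
recode_num(c(1, 2, 3, NA), `1` = 0, `3` = 30, missing = -99)
```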
Perhaps the most interesting function in this ensemble is `replace_outliers`, which replaces values falling outside a 1- or 2-sided numeric threshold, or outside a certain number of column standard deviations, with a value (the default is `NA`).
```{r, eval=NCRAN}
# replace all values below 2 and above 100 with NA
replace_outliers(mtcars, c(2, 100)) %>% head(3)
# replace all values smaller than 2 with NA
replace_outliers(mtcars, 2, single.limit = "min") %>% head(3)
# replace all values larger than 100 with NA
replace_outliers(mtcars, 100, single.limit = "max") %>% head(3)
# replace all values above or below 3 column-standard-deviations from the column-mean with NA
replace_outliers(mtcars, 3) %>% tail(3)
```
## 3. Quick Data Object Conversions
Apart from the code employed for the manipulation of data and the actual statistical computations performed, frequently used data object conversions with base functions like `as.data.frame`, `as.matrix` or `as.factor` have a significant share in slowing down R code. Ideally, code would be written without such conversions, but sometimes they are necessary, and thus *collapse* provides a set of functions (`qDF`, `qDT`, `qTBL`, `qM`, `qF`, `mrtl` and `mctl`) to speed these conversions up quite a bit. These functions are fast because they are non-generic, dispatch different objects internally, perform critical steps in C++, and, when passed lists of objects, only check the length of the first column.
`qDF`, `qDT` and `qTBL` efficiently convert vectors, matrices, higher-dimensional arrays and suitable lists to data.frame, *data.table* and *tibble* respectively.
```{r, eval=NCRAN}
str(EuStockMarkets)
# Efficient Conversion of data frames and matrices to data.table
microbenchmark(qDT(wlddev), qDT(EuStockMarkets), as.data.table(wlddev), as.data.frame(EuStockMarkets))
# Converting a time series to data.frame
head(qDF(AirPassengers))
```
By default these functions drop all unnecessary attributes from matrices or lists / data frames in the conversion, but this can be changed using the `keep.attr = TRUE` argument.
A useful additional feature of `qDF` and `qDT` is the `row.names.col` argument, enabling the saving of names / row-names in a column when converting from vector, matrix, array or data frame:
```{r, eval=NCRAN}
# This saves the row-names in a column named 'car'
head(qDT(mtcars, "car"))
N_distinct <- fndistinct(GGDC10S)
N_distinct
# Converting a vector to data.frame, saving names
head(qDF(N_distinct, "variable"))
```
For the conversion of matrices to lists there are also the programmer functions `mrtl` and `mctl`, which row- or column-wise convert a matrix into a plain list, data.frame or *data.table*.
```{r, eval=NCRAN}
# This converts the matrix to a list of 1860 row-vectors of length 4.
microbenchmark(mrtl(EuStockMarkets))
```
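`mctl` works analogously by columns; a quick sketch (assuming here that keeping dimension names via `names = TRUE` is desired):
```{r, eval=NCRAN}
# Column-wise conversion of a matrix to a plain list or data.frame
m <- qM(mtcars)
str(mctl(m, names = TRUE)[1:2])
mctl(m, names = TRUE, return = "data.frame") %>% head(2)
```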
For the reverse operation, `qM` converts vectors, higher-dimensional arrays, data frames and suitable lists to matrix.
```{r, eval=NCRAN}
# Note: kit::psum is the most efficient way to do this
microbenchmark(rowSums(qM(mtcars)), rowSums(mtcars), kit::psum(mtcars))
```
Finally, `qF` converts vectors to factor and is quite a bit faster than `as.factor`:
```{r, eval=NCRAN}
# Converting from character
str(wlddev$country)
fndistinct(wlddev$country)
microbenchmark(qF(wlddev$country), as.factor(wlddev$country))
# Converting from numeric
str(wlddev$PCGDP)
fndistinct(wlddev$PCGDP)
microbenchmark(qF(wlddev$PCGDP), as.factor(wlddev$PCGDP))
```
<!-- by default `qF` converts vectors to ordered factor, regulated by the default argument `ordered = TRUE`. This behavior is the same as `as.factor`, but `as.factor` does not attach a class 'ordered'. `qF(x, ordered = FALSE)` will not sort the levels and is slightly faster. Another difference to `as.factor` is that `qF` always adds a `NA` level for missing values, however by default an integer missing value (`NA_integer_`) is provided as the value for that level (thus `qF` behaves identical to `as.factor` with exception of the added level). a slight modification of this behavior can be achieved with `qF(x, na.exclude = FALSE)`, which will still have the `NA` level but now also assigns a positive integer value to the level (implying that missing values can no longer be detected using `is.na`), and attaches an additional class 'na.included'. Factor generation using `qF(x, na.exclude = FALSE)` is advised when using factors to carry out grouped computations with *collapse*'s fast functions, as the class 'na.included' prevents *collapse* functions to execute a missing value check on the factor, and thus yields a performance improvement. -->
## 4. Advanced Statistical Programming
Having introduced some of the more basic *collapse* data manipulation infrastructure in the preceding chapters, this chapter introduces some of the package's core functionality for programming with data.
### 4.1 Fast (Grouped, Weighted) Statistical Functions
A key feature of *collapse* is its broad set of *Fast Statistical Functions* (`fsum, fprod, fmean, fmedian, fmode, fvar, fsd, fmin, fmax, fnth, ffirst, flast, fnobs, fndistinct`), which are able to tangibly speed up column-wise, grouped and weighted statistical computations on vectors, matrices or data frames. The basic syntax common to all of these functions is:
```{r eval=FALSE}
FUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,] use.g.names = TRUE, drop = TRUE)
```
where `x` is a vector, matrix or data frame, `g` takes groups supplied as vector, factor, list of vectors or *GRP* object, and `w` takes a weight vector (supported by `fsum, fprod, fmean, fmedian, fmode, fnth, fvar` and `fsd`). `TRA` can be used to transform `x` using the computed statistics and one of 10 available transformations (`"replace_fill", "replace", "-", "-+", "/", "%", "+", "*", "%%", "-%%"`, discussed in section 6.3). `na.rm` efficiently skips missing values during the computation and is `TRUE` by default. `use.g.names = TRUE` generates new row-names from the unique groups supplied to `g`, and `drop = TRUE` returns a vector when performing simple (non-grouped) computations on matrix or data frame columns.
With that in mind, let's start with some simple examples. To calculate simple column-wise means, it is sufficient to type:
```{r, eval=NCRAN}
fmean(mtcars$mpg) # Vector
fmean(mtcars)
fmean(mtcars, drop = FALSE) # This returns a 1-row data-frame
m <- qM(mtcars) # Generate matrix
fmean(m)
fmean(m, drop = FALSE) # This returns a 1-row matrix
```
Note that separate methods for vectors, matrices and data frames are written in C++, thus no conversions are needed and computations on matrices and data frames are equally efficient.
If we had a weight vector, weighted statistics are easily computed:
```{r, eval=NCRAN}
weights <- abs(rnorm(fnrow(mtcars))) # fnrow is a bit faster for data frames
fmean(mtcars, w = weights) # Weighted mean
fmedian(mtcars, w = weights) # Weighted median
fsd(mtcars, w = weights) # Frequency-weighted standard deviation
fmode(mtcars, w = weights) # Weighted statistical mode (i.e. the value with the largest sum of weights)
```
Fast grouped statistics can be calculated by simply passing grouping vectors or lists of grouping vectors to the fast functions:
```{r, eval=NCRAN}
fmean(mtcars, mtcars$cyl)
fmean(mtcars, fselect(mtcars, cyl, vs, am))
# Getting column indices
ind <- fselect(mtcars, cyl, vs, am, return = "indices")
fmean(get_vars(mtcars, -ind), get_vars(mtcars, ind))
```
<!-- `get_vars` also subsets data.table columns and other data.frame-like classes, and is about 2x the speed of `[.data.frame`. Replacements of the form `get_vars(data, ind) <- newcols` are about 4x as fast as `data[ind] <- newcols`. It is also possible to subset with functions i.e. `get_vars(mtcars, is.ordered)` and regular expressions i.e. `get_vars(mtcars, c("c","v","a"), regex = TRUE)` or `get_vars(mtcars, "c|v|a", regex = TRUE)`. Next to `get_vars` there are also the predicates `num_vars`, `cat_vars`, `char_vars`, `fact_vars`, `logi_vars` and `date_vars` to subset and replace data by type. -->
### 4.2 Factors, Grouping Objects and Grouped Data Frames
This kind of programming becomes more efficient when passing *factors* or *grouping objects* to the `g` argument, as otherwise vectors and lists of vectors are grouped internally.
```{r, eval=NCRAN}
# This creates a factor, na.exclude = FALSE attaches a class 'na.included'
f <- qF(mtcars$cyl, na.exclude = FALSE)
# The 'na.included' attribute skips a missing value check on this factor
attributes(f)
# Saving data without grouping columns
dat <- get_vars(mtcars, -ind)
# Grouped standard-deviation
fsd(dat, f)
# Without option na.exclude = FALSE, anyNA needs to be called on the factor (noticeable on larger data).
f2 <- qF(mtcars$cyl)
microbenchmark(fsd(dat, f), fsd(dat, f2))
```
For programming purposes *GRP* objects are preferable over factors because they never require further checks and they provide additional information about the grouping (such as group sizes and the original unique values in each group). The `GRP` function creates grouping objects (of class *GRP*) from vectors or lists of columns. Grouping is done very efficiently via radix ordering in C (using the `radixorder` function):
```{r, eval=NCRAN}
# This creates a 'GRP' object.
g <- GRP(mtcars, ~ cyl + vs + am) # Using the formula interface, could also use c("cyl","vs","am") or c(2,8:9)
str(g)
```
The first three elements of this object provide information about the number of groups, the group to which each row belongs, and the size of each group. A print and a plot method provide further information about the grouping:
```{r, eval=NCRAN}
print(g)
plot(g)
```
The important elements of the *GRP* object are directly handed down to the compiled C++ code of the statistical functions, making repeated computations over the same groups very efficient.
```{r, eval=NCRAN}
fsd(dat, g)
# Grouped computation with and without prior grouping
microbenchmark(fsd(dat, g), fsd(dat, get_vars(mtcars, ind)))
```
<!-- Note at this point that by default (`sort = TRUE`, `order = 1L`), groups are sorted in ascending order, corresponding to *data.table* grouping with `keyby`. The ordering can be reversed (`order = -1L`), or set individually for each of the 3 grouping columns (i.e. `order = c(1L, -1L, -1L)`). If `sort = FALSE`, the grouping is unordered corresponding to *data.table* grouping with `by`. -->
Yet another possibility is creating a grouped data frame (class *grouped_df*). This can either be done using `dplyr::group_by`, which creates a grouped tibble and requires a conversion of the grouping object using `GRP.grouped_df`, or using the more efficient `fgroup_by` provided in *collapse*:
```{r, eval=NCRAN}
gmtcars <- fgroup_by(mtcars, cyl, vs, am) # fgroup_by() can also be abbreviated as gby()
fmedian(gmtcars)
head(fgroup_vars(gmtcars))
fmedian(gmtcars, keep.group_vars = FALSE)
```
<!-- By default, both are ordered, but must not be. For multiple variables, `GRP` is always superior to creating multiple factors and interacting them, and it is also faster than `base::interaction` for lists of factors. -->
<!-- With factors or *GRP* objects, computations are faster since the fast functions would otherwise internally group the vectors every time they are executed. Compared to factors, grouped computations using `GRP` objects are a bit more efficient, primarily because they require no further checks, while factors are checked for missing values^[Because missing values are stored as the smallest integer in C++, and the values of the factor are used directly to index result vectors in grouped computations. Subsetting a vector with the smallest integer would break the C++ code of the *Fast Statistical Functions* and terminate the R session, which must be avoided.] unless a class '*na.included*' is attached. By default `qF` acts just like `as.factor` and preserves missing values when generating factors. Therefore the most effective way of programming with factors is to use `qF(x, na.exclude = FALSE)` to create the factor. This will create an underlying integer for `NA`'s and attach a class '*na.included*', so that no further checks are run on that factor in the *collapse* ecosystem. -->
Now suppose we wanted to create a new dataset which contains the *mean*, *sd*, *min* and *max* of the variables *mpg* and *disp* grouped by *cyl*, *vs* and *am*:
```{r, eval=NCRAN}
# Standard evaluation
dat <- get_vars(mtcars, c("mpg", "disp"))
add_vars(g[["groups"]],
add_stub(fmean(dat, g, use.g.names = FALSE), "mean_"),
add_stub(fsd(dat, g, use.g.names = FALSE), "sd_"),
add_stub(fmin(dat, g, use.g.names = FALSE), "min_"),
add_stub(fmax(dat, g, use.g.names = FALSE), "max_"))
# Non-Standard evaluation
fgroup_by(mtcars, cyl, vs, am) %>% fselect(mpg, disp) %>% {
add_vars(fgroup_vars(., "unique"),
fmean(., keep.group_vars = FALSE) %>% add_stub("mean_"),
fsd(., keep.group_vars = FALSE) %>% add_stub("sd_"),
fmin(., keep.group_vars = FALSE) %>% add_stub("min_"),
fmax(., keep.group_vars = FALSE) %>% add_stub("max_"))
}
```
### 4.3 Grouped and Weighted Computations
<!-- , and we could decide to include the original grouping columns and omit the generated row-names, as shown below -->
We could also calculate groupwise frequency-weighted means and standard deviations using a weight vector^[You may wonder why with weights the standard deviations in the group '4.0.1' are `0` while they were `NA` without weights. This stems from the fact that group '4.0.1' only has one observation, and in the Bessel-corrected estimate of the variance there is an `n - 1` in the denominator, which becomes `0` if `n = 1`; division by `0` then yields `NA` (`fvar` was designed that way to match the behavior of `stats::var`). In the weighted version the denominator is `sum(w) - 1`, and if `sum(w)` is not 1, then the denominator is not `0`. The standard deviation is nevertheless still `0` because the sum of squares in the numerator is `0`. In other words, in a weighted aggregation singleton groups are not treated like singleton groups unless the corresponding weight is `1`.].
<!-- There is also a *collapse* predicate `add_vars` which serves as a much faster and more versatile alternative to `cbind.data.frame`. The intention behind `add_vars` is to be able to efficiently add multiple columns to an existing data.frame. Thus in a call `add_vars(data, newcols1, newcols2)`, `newcols1` and `newcols2` are added (by default) at the end of `data`, while preserving all attributes of `data`. -->
```{r, eval=NCRAN}
# Grouped and weighted mean and sd and grouped min and max
add_vars(g[["groups"]],
add_stub(fmean(dat, g, weights, use.g.names = FALSE), "w_mean_"),
add_stub(fsd(dat, g, weights, use.g.names = FALSE), "w_sd_"),
add_stub(fmin(dat, g, use.g.names = FALSE), "min_"),
add_stub(fmax(dat, g, use.g.names = FALSE), "max_"))
# Binding and reordering columns in a single step: Add columns in specific positions
add_vars(g[["groups"]],
add_stub(fmean(dat, g, weights, use.g.names = FALSE), "w_mean_"),
add_stub(fsd(dat, g, weights, use.g.names = FALSE), "w_sd_"),
add_stub(fmin(dat, g, use.g.names = FALSE), "min_"),
add_stub(fmax(dat, g, use.g.names = FALSE), "max_"),
pos = c(4,8,5,9,6,10,7,11))
```
The R overhead of this kind of programming in standard-evaluation is very low:
```{r, eval=NCRAN}
microbenchmark(call = add_vars(g[["groups"]],
add_stub(fmean(dat, g, weights, use.g.names = FALSE), "w_mean_"),
add_stub(fsd(dat, g, weights, use.g.names = FALSE), "w_sd_"),
add_stub(fmin(dat, g, use.g.names = FALSE), "min_"),
add_stub(fmax(dat, g, use.g.names = FALSE), "max_")))
```
### 4.4 Transformations Using the `TRA` Argument
As a final layer of added complexity, we could utilize the `TRA` argument to generate groupwise-weighted demeaned, and scaled data, with additional columns giving the group-minimum and maximum values:
```{r, eval=NCRAN}
head(add_vars(get_vars(mtcars, ind),
add_stub(fmean(dat, g, weights, "-"), "w_demean_"), # This calculates weighted group means and uses them to demean the data
add_stub(fsd(dat, g, weights, "/"), "w_scale_"), # This calculates weighted group sd's and uses them to scale the data
add_stub(fmin(dat, g, "replace"), "min_"), # This replaces all observations by their group-minimum
add_stub(fmax(dat, g, "replace"), "max_"))) # This replaces all observations by their group-maximum
```
It is also possible to `add_vars<-` to `mtcars` itself. The default option would add these columns at the end, but we could also specify positions:
```{r, eval=NCRAN}
# This defines the positions where we want to add these columns
pos <- as.integer(c(2, 8, 3, 9, 4, 10, 5, 11))
add_vars(mtcars, pos) <- c(add_stub(fmean(dat, g, weights, "-"), "w_demean_"),
                           add_stub(fsd(dat, g, weights, "/"), "w_scale_"),
                           add_stub(fmin(dat, g, "replace"), "min_"),
                           add_stub(fmax(dat, g, "replace"), "max_"))
head(mtcars)
rm(mtcars)
```
Together with `ftransform`, things can become arbitrarily complex:
```{r, eval=NCRAN}
# Two different grouped and weighted computations (mutate operations) performed in one call
settransform(mtcars, carb_dwmed_cyl = fmedian(carb, cyl, weights, "-"),
                     carb_wsd_vs_am = fsd(carb, list(vs, am), weights, "replace"))
# Multivariate
settransform(mtcars, c(fmedian(list(carb_dwmed_cyl = carb, mpg_dwmed_cyl = mpg), cyl, weights, "-"),
                       fsd(list(carb_wsd_vs_am = carb, mpg_wsd_vs_am = mpg), list(vs, am), weights, "replace")))
# Nested: computing the weighted 3rd quartile of mpg, grouped by cyl and by whether carb
# is greater than its weighted median (computed by groups of vs)
settransform(mtcars,
             mpg_gwQ3_cyl = fnth(mpg, 0.75, list(cyl, carb > fmedian(carb, vs, weights, 1L)), weights, 1L))
head(mtcars)
rm(mtcars)
```
With the full set of 14 *Fast Statistical Functions*, and the additional vector-valued functions and operators (`fscale/STD, fbetween/B, fwithin/W, fhdbetween/HDB, fhdwithin/HDW, flag/L/F, fdiff/D, fgrowth/G`) discussed later, *collapse* provides extraordinary new possibilities for highly complex and efficient statistical programming in R. Computation speeds generally exceed those of packages like *dplyr* or *data.table*, sometimes by orders of magnitude. Column-wise matrix computations are also highly efficient and comparable to packages like *matrixStats* and base R functions like `colSums`. In particular, the ability to perform grouped and weighted computations on matrices is new to R and very useful for complex computations (such as aggregating input-output tables, etc.).
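As a brief sketch of the matrix methods just mentioned, we can reuse the `dat`, `g` and `weights` objects from above (`qM` is *collapse*'s quick converter from data frame to matrix):
```{r, eval=NCRAN}
m <- qM(dat)                  # Convert the data frame to a matrix
fmean(m, g, weights)          # Weighted group means, computed column-wise
head(fwithin(m, g, weights))  # Weighted within-group centering, same as W(m, g, weights)
```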
Note that the above examples merely provide suggestions for the use of these features and focus on programming with data frames (as functions like `get_vars` and `add_vars` are made for data frames). Equally efficient code could be written using vectors or matrices.
## 5. Advanced Data Aggregation
The grouped statistical programming introduced in the previous section is the fastest and most customizable way of dealing with many data transformation problems. However, some tasks, such as multivariate aggregations on a single data frame, are so common that they called for a more compact solution which efficiently integrates multiple computational steps.
For such purposes `collap` was created as a fast multi-purpose aggregation command designed to solve complex aggregation problems efficiently and with a minimum of coding. `collap` performs optimally together with the *Fast Statistical Functions*, but will also work with other functions.
To perform the above aggregation with `collap`, one would simply need to type:
```{r, eval=NCRAN}
collap(mtcars, mpg + disp ~ cyl + vs + am, list(fmean, fsd, fmin, fmax),
       w = weights, keep.col.order = FALSE)
```
`collap` here also saves the sum of the weights in a column. The original idea behind `collap` is, however, better demonstrated with a different dataset. Consider the *World Development Dataset* `wlddev` introduced in section 1:
```{r, eval=NCRAN}
head(wlddev)
```
Suppose we would like to aggregate this data by country and decade, but keep all the categorical information. With `collap` this is extremely simple:
```{r, eval=NCRAN}
collap(wlddev, ~ iso3c + decade) %>% head
```
Note that the columns of the data are in the original order and also retain all their attributes. To understand this result let us briefly examine the syntax of `collap`:
```{r eval=FALSE}
collap(X, by, FUN = fmean, catFUN = fmode, cols = NULL, w = NULL, wFUN = fsum,
       custom = NULL, keep.by = TRUE, keep.w = TRUE, keep.col.order = TRUE,
       sort.row = TRUE, parallel = FALSE, mc.cores = 1L,
       return = c("wide", "list", "long", "long_dupl"), give.names = "auto") # , ...
```
It is clear that `X` is the data and `by` supplies the grouping information, which can be a one- or two-sided formula, or alternatively grouping vectors, factors, lists and `GRP` objects (like the *Fast Statistical Functions*). Then `FUN` provides the function(s) applied only to numeric variables in `X` and defaults to `fmean`, while `catFUN` provides the function(s) applied only to categorical variables in `X` and defaults to `fmode`^[I.e. the most frequent value. By default a first-mode is computed.]. `keep.col.order = TRUE` specifies that the data is to be returned with the original column order. Thus, in the above example, it was sufficient to supply `X` and `by`, and `collap` did the rest for us.
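To illustrate the division of labor between `FUN` and `catFUN`, here is a quick sketch overriding both defaults, aggregating numeric columns with `fmedian` and categorical columns with `ffirst`:
```{r, eval=NCRAN}
collap(wlddev, ~ iso3c + decade, FUN = fmedian, catFUN = ffirst) %>% head
```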
Suppose we only want to aggregate 4 series in this dataset.
```{r, eval=NCRAN}
# Same as collap(wlddev, ~ iso3c + decade, cols = 9:12)
collap(wlddev, PCGDP + LIFEEX + GINI + ODA ~ iso3c + decade) %>% head
```
As before, we could use multiple functions by putting them in a named or unnamed list^[If the list is unnamed, `collap` uses `all.vars(substitute(list(FUN1, FUN2, ...)))` to get the function names. Alternatively, it is also possible to pass a character vector of function names, as sketched after the next chunk.]:
```{r, eval=NCRAN}
collap(wlddev, ~ iso3c + decade, list(fmean, fmedian, fsd), cols = 9:12) %>% head
```
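Equivalently, as mentioned in the footnote above, a character vector of function names can be passed:
```{r, eval=NCRAN}
collap(wlddev, ~ iso3c + decade, c("fmean", "fmedian", "fsd"), cols = 9:12) %>% head
```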
With multiple functions, we could also request `collap` to return the data in a long format:
```{r, eval=NCRAN}
collap(wlddev, ~ iso3c + decade, list(fmean, fmedian, fsd), cols = 9:12, return = "long") %>% head
```
A very important feature of `collap` to highlight at this point is the `custom` argument, which allows the user to circumvent the broad distinction between numeric and categorical data (and the associated `FUN` and `catFUN` arguments) and to specify exactly which columns to aggregate using which functions:
```{r, eval=NCRAN}
collap(wlddev, ~ iso3c + decade,
       custom = list(fmean = 9:10, fmedian = 11:12,
                     ffirst = c("country", "region", "income"),
                     flast = c("year", "date"),
                     fmode = "OECD")) %>% head
```
Since *collapse* 1.5.0, it is also possible to perform weighted aggregations, and to append function names with `_uw` to yield an unweighted computation:
```{r, eval=NCRAN}
# This aggregates using weighted mean and mode, and unweighted median, first and last value
collap(wlddev, ~ region + year, w = ~ POP,
       custom = list(fmean = 9:10, fmedian_uw = 11:12,
                     ffirst_uw = c("country", "region", "income"),
                     flast_uw = c("year", "date"),
                     fmode = "OECD"), keep.w = FALSE) %>% head
```
Next to `collap`, the function `collapv` provides a programmer's alternative allowing grouping and weighting columns to be passed using column names or indices, and the function `collapg` operates on grouped data frames.
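A minimal sketch of both (the grouped data frame is created with `fgroup_by`; the chosen columns and functions are merely illustrative):
```{r, eval=NCRAN}
# collapv: grouping columns passed as names (or indices)
collapv(wlddev, c("iso3c", "decade"), cols = 9:12) %>% head
# collapg: operates on a grouped data frame
wlddev %>% fgroup_by(iso3c, decade) %>% collapg(fmedian) %>% head
```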