20 Jan 23:20

SebKrantz

d483fe4

collapse version 1.9.1

Fixed minor C/C++ issues flagged by CRAN's detailed checks.
Added functions set_collapse() and get_collapse(), allowing you to globally set defaults for the nthreads and na.rm arguments to all functions in the package. E.g. set_collapse(nthreads = 4, na.rm = FALSE) could be a suitable setting for larger data without missing values. This is implemented using an internal environment by the name of .op, such that these defaults are retrieved using e.g. .op[["nthreads"]], at the computational cost of a few nanoseconds (8-10x faster than getOption("nthreads") which would take about 1 microsecond). .op is not accessible by the user, so function get_collapse() can be used to retrieve settings. Exempt from this are functions .quantile, and a new function .range (alias of frange), which go directly to C for maximum performance in repeated executions, and are not affected by these global settings. Function descr(), which internally calls a bunch of statistical functions, is also not affected by these settings.
Further improvements in thread safety for fsum() and fmean() in grouped computations across data frame columns. All OpenMP enabled functions in collapse can now be considered thread safe i.e. they pass the full battery of tests in multithreaded mode.
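The settings mechanism behind set_collapse() and get_collapse() can be illustrated with a small base R sketch. The environment `.op_demo` and the functions `set_demo()`, `get_demo()` and `sum_demo()` are hypothetical stand-ins and not collapse internals; they only show how defaults stored in an environment can feed function arguments at call time.

```r
# Hypothetical stand-in for an internal settings environment.
.op_demo <- new.env()
.op_demo[["nthreads"]] <- 1L
.op_demo[["na.rm"]] <- TRUE

# Setter and getter in the spirit of set_collapse() / get_collapse().
set_demo <- function(...) {
  opts <- list(...)
  for (nm in names(opts)) assign(nm, opts[[nm]], envir = .op_demo)
  invisible(NULL)
}
get_demo <- function(name) .op_demo[[name]]

# A function whose default is looked up in the environment when it is called.
sum_demo <- function(x, na.rm = .op_demo[["na.rm"]]) sum(x, na.rm = na.rm)

x <- c(1, 2, NA)
res_default <- sum_demo(x)               # uses na.rm = TRUE
set_demo(nthreads = 4L, na.rm = FALSE)
res_changed <- sum_demo(x)               # now uses na.rm = FALSE
```

Because the default is an expression evaluated on each call, changing the environment changes the behaviour of all later calls without touching the function itself.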


14 Jan 23:13

SebKrantz

v1.9.0

86e7db6

collapse version 1.9.0

collapse 1.9.0, released mid-January 2023, provides improvements in performance and versatility in many areas, as well as greater statistical capabilities, most notably efficient (grouped, weighted) estimation of sample quantiles.

Changes to functionality

All functions renamed in collapse 1.6.0 are now deprecated, to be removed end of 2023. These functions had already been giving messages since v1.6.0. See help("collapse-renamed").
The lead operator F() is not exported anymore from the package namespace, to avoid clashes with base::F flagged by multiple people. The operator is still part of the package and can be accessed using collapse:::F. I have also added an option "collapse_export_F", such that setting options(collapse_export_F = TRUE) before loading the package exports the operator as before. Thanks @matthewross07 (#100), @edrubin (#194), and @arthurgailes (#347).
Function fnth() has a new default ties = "q7", which gives the same result as quantile(..., type = 7) (R's default). More details below.
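To make the meaning of the new ties = "q7" default concrete, here is a minimal base R sketch of the type 7 quantile definition of Hyndman and Fan, checked against stats::quantile. The helper name `q7_sketch()` is hypothetical; according to the note above, fnth(x, 0.25) in collapse 1.9.0 or later should give the same value.

```r
# Type 7: interpolate linearly between the order statistics surrounding
# position h = (n - 1) * p + 1 in the sorted data.
q7_sketch <- function(x, p) {
  xs <- sort(x)
  h <- (length(xs) - 1) * p + 1
  lo <- floor(h)
  hi <- ceiling(h)
  xs[lo] + (h - lo) * (xs[hi] - xs[lo])
}

x <- mtcars$mpg
manual <- q7_sketch(x, 0.25)
reference <- unname(quantile(x, 0.25, type = 7))
```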

Bug Fixes

fmode() gave wrong results for singleton groups (groups of size 1) on unsorted data. I had optimized fmode() for singleton groups to directly return the corresponding element, but it did not access the element through the (internal) ordering vector, so the first element/row of the entire vector/data was taken. The same mistake occurred for fndistinct if singleton groups were NA, which were counted as 1 instead of 0 under the na.rm = TRUE default (provided the first element of the vector/data was not NA). The mistake did not occur with data sorted by the groups, because here the data pointer already pointed to the first element of the group. (My apologies for this bug, it took me more than half a year to discover it, using collapse on a daily basis, and it escaped 700 unit tests as well).
Function groupid(x, na.skip = TRUE) returned uninitialized first elements if the first values in x were NA. Thanks for reporting @Henrik-P (#335).
Fixed a bug in the .names argument to across(). Passing a naming function such as .names = function(c, f) paste0(c, "-", f) now works as intended i.e. the function is applied to all combinations of columns (c) and functions (f) using outer(). Previously this was just internally evaluated as .names(cols, funs), which did not work if there were multiple cols and multiple funs. There is also now a possibility to set .names = "flip", which names columns f_c instead of c_f.
fnrow() was rewritten in C and also supports data frames with 0 columns. Similarly for seq_row(). Thanks @NicChr (#344).
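The difference between the old and new evaluation of a naming function passed to .names in across() can be seen in base R alone. This is only a sketch of the two evaluation strategies described above; the order in which collapse arranges the resulting names is not shown here and should not be inferred from it.

```r
cols <- c("mpg", "hp")
funs <- c("mean", "sd")
namer <- function(c, f) paste0(c, "-", f)

# Direct call: arguments are paired element-wise, so only 2 names result
# although 2 columns x 2 functions produce 4 output columns.
paired <- namer(cols, funs)

# outer() applies the function to every column/function combination.
combos <- outer(cols, funs, namer)
```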

Additions

Added functions fcount() and fcountv(): a versatile and blazing fast alternative to dplyr::count. It also works with vectors, matrices, as well as grouped and indexed data.
Added function fquantile(): Fast (weighted) continuous quantile estimation (methods 5-9 following Hyndman and Fan (1996)), implemented fully in C based on quickselect and radixsort algorithms, and also supports an ordering vector as optional input to speed up the process. It is up to 2x faster than stats::quantile on larger vectors, but also especially fast on smaller data, where the R overhead of stats::quantile becomes burdensome. For maximum performance during repeated executions, a programmer's version .quantile() with different defaults is also provided.
Added function fdist(): A fast and versatile replacement for stats::dist. It computes a full Euclidean distance matrix around 4x faster than stats::dist in serial mode, with additional gains possible through multithreading along the distance matrix columns (decreasing thread loads as the matrix is lower triangular). It also supports computing the distance of a matrix with a single row-vector, or simply between two vectors. E.g. fdist(mat, mat[1, ]) is the same as sqrt(colSums((t(mat) - mat[1, ])^2)), but about 20x faster in serial mode, and fdist(x, y) is the same as sqrt(sum((x-y)^2)), about 3x faster in serial mode. In both cases (sub-column level) multithreading is available. Note that fdist does not skip missing values i.e. NA's will result in NA distances. There is also no internal implementation for integers or data frames. Such inputs will be coerced to numeric matrices.
Added function GRPid() to easily fetch the group id from a grouping object, especially inside grouped fmutate() calls. This addition was warranted especially by the new improved fnth.default() method which allows orderings to be supplied for performance improvements. See comments on fnth() and the example provided below.
fsummarize() was added as a synonym to fsummarise. Thanks @arthurgailes for the PR.
C API: collapse exports around 40 C functions that provide functionality that is either convenient or rather complicated to implement from scratch. The exported functions can be found at the bottom of src/ExportSymbols.c. The API does not include the Fast Statistical Functions, which I thought are too closely related to how collapse works internally to be of much use to a C programmer (e.g. they expect grouping objects or certain kinds of integer vectors). But you are free to request the export of additional functions, including C++ functions.
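The two base R expressions quoted in the fdist() item can be checked against stats::dist directly. The sketch below uses only base R and a small slice of mtcars chosen for illustration; per the note, fdist(mat, mat[1, ]) and fdist(x, y) are described as returning the same quantities faster.

```r
mat <- as.matrix(mtcars[1:5, c("mpg", "hp", "wt")])

# Distance of every row to the first row (formula quoted in the notes).
to_first <- sqrt(colSums((t(mat) - mat[1, ])^2))

# The same numbers appear in the first column of the full distance matrix.
full <- as.matrix(dist(mat))

# Distance between two plain vectors.
x <- mat[1, ]
y <- mat[2, ]
d_xy <- sqrt(sum((x - y)^2))
```

Note that t(mat) - mat[1, ] relies on column-wise recycling: each column of the transposed matrix is one observation, so the reference row is subtracted from every observation.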

Improvements

fnth() and fmedian() were rewritten in C, with significant gains in performance and versatility. Notably, fnth() now supports (grouped, weighted) continuous quantile estimation like fquantile() (fmedian(), which is a wrapper around fnth(), can also estimate various quantile based weighted medians). The new default for fnth() is ties = "q7", which gives the same result as (f)quantile(..., type = 7) (R's default). OpenMP multithreading across groups is also much more effective in both the weighted and unweighted case. Finally, fnth.default gained an additional argument o to pass an ordering vector, which can dramatically speed up repeated invocations of the function on the same data:

library(collapse)
library(magrittr) # provides the %>% pipe used below
# Estimating multiple weighted-grouped quantiles on mpg: pre-computing an ordering provides extra speed.
mtcars %>% fgroup_by(cyl, vs, am) %>%
    fmutate(o = radixorder(GRPid(), mpg)) %>% # On grouped data, need to account for GRPid()
    fsummarise(mpg_Q1 = fnth(mpg, 0.25, o = o, w = wt),
               mpg_median = fmedian(mpg, o = o, w = wt),
               mpg_Q3 = fnth(mpg, 0.75, o = o, w = wt))
# Note that without weights this is not always faster. Quickselect can be very efficient, so it depends 
# on the data, the number of groups, whether they are sorted (which speeds up radixorder), etc...

BY now supports data-length arguments to be passed e.g. BY(mtcars, mtcars$cyl, fquantile, w = mtcars$wt), making it effectively a generic grouped mapply function as well. Furthermore, the grouped_df method now also expands grouping columns for output length > 1.
collap(), which internally uses BY with non-Fast Statistical Functions, now also supports arbitrary further arguments passed down to functions to be split by groups. Thus users can also apply custom weighted functions with collap(). Furthermore, the parsing of the FUN, catFUN and wFUN arguments was improved and brought in-line with the parsing of .fns in across(). The main benefit of this is that Fast Statistical Functions are now also detected and optimizations carried out when passed in a list providing a new name e.g. collap(data, ~ id, list(mean = fmean)) is now optimized! Thanks @ttrodrigz (#358) for requesting this.
descr(), by virtue of fquantile and the improvements to BY, supports full-blown grouped and weighted descriptions of data. This is implemented through additional by and w arguments. The function has also been turned into an S3 generic, with a default and a 'grouped_df' method. The 'descr' methods as.data.frame and print also feature various improvements, and a new compact argument to print.descr, allowing a more compact printout. Users will also notice improved performance, mainly due to fquantile: on the M1 descr(wlddev) is now 2x faster than summary(wlddev), and 41x faster than Hmisc::describe(wlddev). Thanks @statzhero for the request (#355).
radixorder is about 25% faster on characters and doubles. This also benefits grouping performance. Note that group() may still be substantially faster on unsorted data, so if performance is critical try the sort = FALSE argument to functions like fgroup_by and compare.
Most list processing functions are noticeably faster, as checking the data types of elements in a list is now also done in C, and I have made some improvements to collapse's version of rbindlist() (used in unlist2d(), and various other places).
fsummarise and fmutate gained an ability to evaluate arbitrary expressions that result in lists / data frames without the need to use across(). For example: mtcars |> fgroup_by(cyl, vs, am) |> fsummarise(mctl(cor(cbind(mpg, wt, carb)), names = TRUE)) or mtcars |> fgroup_by(cyl) |> fsummarise(mctl(lmtest::coeftest(lm(mpg ~ wt + carb)), names = TRUE)). There is also the possibility to compute expressions using .data e.g. mtcars |> fgroup_by(cyl) |> fsummarise(mctl(lmtest::coeftest(lm(mpg ~ wt + carb, .data)), names = TRUE)) yields the same thing, but is less efficient because the whole dataset (including 'cyl') is split by groups. For greater efficiency and convenience, you can pre-select columns using a global .cols argument, e.g. mtcars |> fgroup_by(cyl, vs, am) |> fsummarise(mctl(cor(.data), names = TRUE), .cols = .c(mpg, wt, carb)) gives the same as above. Three Notes about this:
- No grouped vectorizations for fast statistical functions i.e. the entire expression is evaluated for each group. (Let m...
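The idea of BY() acting as a generic grouped mapply, mentioned in the BY item above, can be sketched in base R. Here weighted.mean is only a stand-in for a function taking a data-length argument such as w; the sketch does not reproduce BY()'s output format.

```r
# Split both the data column and the data-length argument by group,
# then apply the function to the matching pieces pairwise.
g <- mtcars$cyl
wmean_by_cyl <- mapply(
  function(x, w) weighted.mean(x, w),
  split(mtcars$mpg, g),
  split(mtcars$wt, g)
)
```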

Contributors

statzhero, Henrik-P, and 5 other contributors


07 Oct 13:55

SebKrantz

v1.8.9

1004893

collapse version 1.8.9

Fixed some warnings on rchk and newer C compilers (LLVM clang 10+).
.pseries / .indexed_series methods now also change the implicit class of the vector (attached after "pseries") if the data type changed. Previously, e.g., calling a function like fgrowth on an integer pseries changed the data type to double, but the "integer" class was still attached after "pseries".
Fixed bad testing for SE inputs in fgroup_by() and findex_by(). See #320.
Added rsplit.matrix method.
descr() now by default also reports 10% and 90% quantiles for numeric variables (in line with STATA's detailed summary statistics), and can also be applied to 'pseries' / 'indexed_series'. Furthermore, descr() itself now has an argument stepwise such that descr(big_data, stepwise = TRUE) yields computation of summary statistics on a variable-by-variable basis (and the finished 'descr' object is returned invisibly). The printed result is thus identical to print(descr(big_data), stepwise = TRUE), with the difference that the latter first does the entire computation whereas the former computes statistics on demand.

Function ss() has a new argument check = TRUE. Setting check = FALSE allows subsetting data frames / lists with positive integers without checking whether integers are positive or in-range. For programmers.
Function get_vars() has a new argument rename allowing select-renaming of columns in standard evaluation programming, e.g. get_vars(mtcars, c(newname = "cyl", "vs", "am"), rename = TRUE). The default is rename = FALSE, to warrant full backwards compatibility. See #327.
Added helper function setattrib(), to set a new attribute list for an object by reference + invisible return. This is different from the existing function setAttrib() (note the capital A), which takes a shallow copy of list-like objects and returns the result.
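What select-renaming with get_vars(..., rename = TRUE) amounts to can be sketched in base R. This is an analogue written for illustration, not the package's implementation: columns are selected by the values of the character vector and renamed wherever a name was supplied.

```r
sel <- c(newname = "cyl", "vs", "am")

# Select the columns, then use the supplied names where present
# and fall back to the original column names otherwise.
out <- mtcars[unname(sel)]
new_names <- names(sel)
names(out) <- unname(ifelse(nzchar(new_names), new_names, sel))
```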


15 Aug 14:34

SebKrantz

v1.8.8

e824a46

collapse version 1.8.8

flm and fFtest are now internal generics with an added formula method e.g. flm(mpg ~ hp + carb, mtcars, weights = wt) or fFtest(mpg ~ hp + carb | vs + am, mtcars, weights = wt) in addition to the programming interface. Thanks to Grant McDermott for suggesting.
Added method as.data.frame.qsu, to efficiently turn the default array outputs from qsu() into tidy data frames.
Major improvements to setv and copyv, generalizing the scope of operations that can be performed to all common cases. This means that even simple base R operations such as X[v] <- R can now be done significantly faster using setv(X, v, R).
n and qtab can now be added to options("collapse_mask") e.g. options(collapse_mask = c("manip", "helper", "n", "qtab")). This will export a function n() to get the (group) count in fsummarise and fmutate (which can also always be done using GRPN() but n() is more familiar to dplyr users), and will mask table() with qtab(), which is principally a fast drop-in replacement, but with some different further arguments.
Added C-level helper function all_funs, which fetches all the functions called in an expression, similar to setdiff(all.names(x), all.vars(x)) but better because it takes account of the syntax. For example let x = quote(sum(sum)) i.e. we are summing a column named sum. Then all.names(x) = c("sum", "sum") and all.vars(x) = "sum" so that the difference is character(0), whereas all_funs(x) returns "sum". This function makes collapse smarter when parsing expressions in fsummarise and fmutate and deciding which ones to vectorize.
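The all.names() / all.vars() example from the all_funs item can be reproduced in base R, together with a rough syntax-aware alternative. The helper `funs_called()` is a hypothetical, simplified R sketch of the idea and not the C-level all_funs; in particular it does not handle empty arguments such as in x[1, ].

```r
x <- quote(sum(sum))          # summing a column that is itself named "sum"

nms <- all.names(x)           # function and variable names together
vars <- all.vars(x)           # variable names only
naive <- setdiff(nms, vars)   # empty: the function call is lost

# Collect only symbols that appear in call position.
funs_called <- function(e) {
  if (!is.call(e)) return(character(0))
  fn <- e[[1L]]
  own <- if (is.symbol(fn)) as.character(fn) else funs_called(fn)
  c(own, unlist(lapply(as.list(e)[-1L], funs_called), use.names = FALSE))
}

aware <- unique(funs_called(x))
```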


23 Jul 18:01

SebKrantz

v1.8.7

d959fa6

collapse version 1.8.7

Fixed a bug in fscale.pdata.frame where the default C++ method was being called instead of the list method (i.e. the method didn't work at all).
Fixed 2 minor rchk issues (the remaining ones are spurious).
fsum has an additional argument fill = TRUE (default FALSE) that initializes the result vector with 0 instead of NA when na.rm = TRUE, so that fsum(NA, fill = TRUE) gives 0 like base::sum(NA, na.rm = TRUE).
Slight performance increase in fmean with groups if na.rm = TRUE (the default).
Significant performance improvement when using base R expressions involving multiple functions and one column e.g. mid_col = (min(col) + max(col)) / 2 or lorentz_col = cumsum(sort(col)) / sum(col) etc. inside fsummarise and fmutate. Instead of evaluating such expressions on a data subset of one column for each group, they are now turned into a function e.g. function(x) cumsum(sort(x)) / sum(x) which is applied to a single vector split by groups.
Argument return.groups from GRP.default is now also available in fgroup_by, allowing grouped data frames without materializing the unique grouping columns. This allows more efficient mutate-only operations e.g. mtcars |> fgroup_by(cyl, return.groups = FALSE) |> fmutate(across(hp:carb, fscale)). Similarly for aggregation with dropping of grouping columns mtcars |> fgroup_by(cyl, return.groups = FALSE) |> fmean() is equivalent to and faster than mtcars |> fgroup_by(cyl) |> fmean(keep.group_vars = FALSE).
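The optimization for single-column base R expressions in fsummarise and fmutate, described in the list above, can be sketched in base R: the expression is turned into a function of one argument, which is then applied to the column split by groups. The code below is an illustration under that description, not the package's internal code; the column hp and grouping by cyl are chosen for the example.

```r
expr <- quote((min(hp) + max(hp)) / 2)

# Replace the column name by a generic argument and wrap it in a function.
f <- function(x) NULL
body(f) <- do.call(substitute, list(expr, list(hp = quote(x))))

# Apply the function once per group to the single column split by groups.
mid_hp <- vapply(split(mtcars$hp, mtcars$cyl), f, numeric(1))
```

Splitting a single vector is cheaper than subsetting a data frame for every group, which is where the gain described in the note would come from.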


14 Jun 09:06

SebKrantz

v1.8.6

21dee0b

collapse version 1.8.6

collapse 1.8.6

Fixed further minor issues:
- some inline functions in TRA.c needed to be declared 'static' to be local in scope (#275)
- timeid.Rd now uses zoo package conditionally and limits size of printout

collapse 1.8.5

Fixed some issues flagged by CRAN:
- Installation on some linux distributions failed because omp.h was included after Rinternals.h
- Some signed integer overflows while running tests caused UBSAN warnings. (This happened inside a hash function where overflows are not a problem. I changed to unsigned int to avoid the UBSAN warning.)
- Ensured that package passes R CMD Check without suggested packages


08 Jun 15:27

SebKrantz

v1.8.4

65ef679

collapse version 1.8.4

A few improvements and fixes to make collapse 1.8 acceptable to CRAN. The changes may be summarised as follows:

collapse 1.8.4

Makevars text substitution hack to have CRAN accept a package that combines C, C++ and OpenMP. Thanks also to @MichaelChirico for pointing me in the right direction.

collapse 1.8.3

Significant speed improvement in qF/qG (factor-generation) for character vectors with more than 100,000 obs and many levels if sort = TRUE (the default). For details see the method argument of ?qF.
Optimizations in fmode and fndistinct for singleton groups.

collapse 1.8.2

Fixed some rchk issues found by Thomas Kalibera from CRAN.
Faster funique.default method.
group now also internally optimizes on 'qG' objects.

collapse 1.8.1

Added function fnunique (yet another alternative to data.table::uniqueN, kit::uniqLen or dplyr::n_distinct, and principally a simple wrapper for attr(group(x), "N.groups")). At present fnunique generally outperforms the others on data frames.
finteraction has an additional argument factor = TRUE. Setting factor = FALSE returns a 'qG' object, which is more efficient if just an integer id but no factor object itself is required.
Operators (see .OPERATOR_FUN) have been improved a bit such that id-variables selected in the .data.frame (by, w or t arguments) or .pdata.frame methods (variables in the index) are not computed upon even if they are numeric (since the default is cols = is.numeric). In general, if cols is a function used to select columns of a certain data type, id variables are excluded from computation even if they are of that data type. It is still possible to compute on id variables by explicitly selecting them using names or indices passed to cols, or including them in the lhs of a formula passed to by.
Further efforts to facilitate adding the group-count in fsummarise and fmutate:
- if options(collapse_mask = "all") before loading the package, an additional function n() is exported that works just like dplyr::n(). (Note that internal optimization flags for n are always on, so if you really want the function to be called n() without setting options(collapse_mask = "all"), you could also do n <- GRPN or n <- collapse:::n)
- otherwise the same can now always be done using GRPN(). The previous uses of GRPN are unaltered i.e. GRPN can also:
  - fetch group sizes directly from a grouping object or grouped data frame i.e. data |> gby(id) |> GRPN() or data %>% gby(id) %>% ftransform(N = GRPN(.)) (note the dot).
  - compute group sizes on the fly, for example fsubset(data, GRPN(id) > 10L) or fsubset(data, GRPN(list(id1, id2)) > 10L) or GRPN(data, by = ~ id1 + id2).
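For readers unfamiliar with the group-count idiom, the following base R sketch mirrors the on-the-fly use of GRPN() and the quantity fnunique() is described as returning. It uses a small made-up data frame and base functions only, so it is an analogue in spirit rather than a demonstration of collapse itself.

```r
id <- c("a", "b", "a", "c", "a", "b")
data <- data.frame(id = id, value = seq_along(id))

# Group size repeated for each row, the kind of vector a filter such as
# fsubset(data, GRPN(id) > 10L) would compare against a threshold.
n_per_row <- ave(seq_along(id), id, FUN = length)

# Keep only rows belonging to groups with more than one observation.
kept <- data[n_per_row > 1L, ]

# Number of distinct values.
n_unique <- length(unique(id))
```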

Contributors

MichaelChirico


11 May 00:26

SebKrantz

v1.8.0

e802e2b

collapse version 1.8.0

collapse 1.8.0, released mid-May 2022, brings enhanced support for indexed computations on time series and panel data by introducing flexible 'indexed_frame' and 'indexed_series' classes and surrounding infrastructure, sets a modest start to OpenMP multithreading as well as data transformation by reference in statistical functions, and enhances the package's descriptive statistics toolset.