Releases: SebKrantz/collapse
collapse version 1.9.1
- Fixed minor C/C++ issues flagged by CRAN's detailed checks.
- Added functions set_collapse() and get_collapse(), allowing you to globally set defaults for the nthreads and na.rm arguments to all functions in the package. E.g. set_collapse(nthreads = 4, na.rm = FALSE) could be a suitable setting for larger data without missing values. This is implemented using an internal environment by the name of .op, such that these defaults are received using e.g. .op[["nthreads"]], at the computational cost of a few nanoseconds (8-10x faster than getOption("nthreads"), which would take about 1 microsecond). .op is not accessible by the user, so function get_collapse() can be used to retrieve settings. Exempt from this are functions .quantile, and a new function .range (alias of frange), which go directly to C for maximum performance in repeated executions, and are not affected by these global settings. Function descr(), which internally calls a bunch of statistical functions, is also not affected by these settings.
- Further improvements in thread safety for fsum() and fmean() in grouped computations across data frame columns. All OpenMP enabled functions in collapse can now be considered thread safe, i.e. they pass the full battery of tests in multithreaded mode.
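A minimal sketch of how the global na.rm default affects a Fast Statistical Function, assuming collapse 1.9.1 or later is installed. The vector x and the variable names are illustrative only; the example resets na.rm to TRUE (the package default) at the end.

```r
library(collapse)

x <- c(1, NA, 3)

# With the package default (na.rm = TRUE) the missing value is skipped
mean_default <- fmean(x)

# Switch the global default; functions now propagate missing values
set_collapse(na.rm = FALSE)
mean_no_rm <- fmean(x)

# Restore the package default so later code is unaffected
set_collapse(na.rm = TRUE)

mean_default
mean_no_rm
```

get_collapse() can be called at any point to inspect the current settings.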
collapse version 1.9.0
collapse 1.9.0, released in mid-January 2023, provides improvements in performance and versatility in many areas, as well as greater statistical capabilities, most notably efficient (grouped, weighted) estimation of sample quantiles.
Changes to functionality
- All functions renamed in collapse 1.6.0 are now deprecated, to be removed end of 2023. These functions had already been giving messages since v1.6.0. See help("collapse-renamed").
- The lead operator F() is not exported anymore from the package namespace, to avoid clashes with base::F flagged by multiple people. The operator is still part of the package and can be accessed using collapse:::F. I have also added an option "collapse_export_F", such that setting options(collapse_export_F = TRUE) before loading the package exports the operator as before. Thanks @matthewross07 (#100), @edrubin (#194), and @arthurgailes (#347).
- Function fnth() has a new default ties = "q7", which gives the same result as quantile(..., type = 7) (R's default). More details below.
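As a quick check of the new default, the following sketch compares fnth() with stats::quantile() on a built-in dataset. It assumes collapse 1.9.0 or later; the variable names are illustrative.

```r
library(collapse)

x <- mtcars$mpg

# Base R: type 7 is the default quantile definition
q_base <- quantile(x, 0.25, type = 7, names = FALSE)

# collapse: ties = "q7" is the default from 1.9.0 onwards
q_fnth <- fnth(x, 0.25)

all.equal(as.numeric(q_fnth), q_base)
```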
Bug Fixes
- fmode() gave wrong results for singleton groups (groups of size 1) on unsorted data. I had optimized fmode() for singleton groups to directly return the corresponding element, but it did not access the element through the (internal) ordering vector, so the first element/row of the entire vector/data was taken. The same mistake occurred for fndistinct if singleton groups were NA, which were counted as 1 instead of 0 under the na.rm = TRUE default (provided the first element of the vector/data was not NA). The mistake did not occur with data sorted by the groups, because here the data pointer already pointed to the first element of the group. (My apologies for this bug, it took me more than half a year to discover it, using collapse on a daily basis, and it escaped 700 unit tests as well.)
- Function groupid(x, na.skip = TRUE) returned uninitialized first elements if the first values in x were NA. Thanks for reporting @Henrik-P (#335).
- Fixed a bug in the .names argument to across(). Passing a naming function such as .names = function(c, f) paste0(c, "-", f) now works as intended, i.e. the function is applied to all combinations of columns (c) and functions (f) using outer(). Previously this was just internally evaluated as .names(cols, funs), which did not work if there were multiple cols and multiple funs. There is also now a possibility to set .names = "flip", which names columns f_c instead of c_f.
- fnrow() was rewritten in C and also supports data frames with 0 columns. Similarly for seq_row(). Thanks @NicChr (#344).
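The .names fix for across() can be illustrated in base R alone. The sketch below is not the package's internal code; it only shows why calling the naming function directly on two vectors of different lengths falls short, and how outer() produces every column/function combination. The vectors cols and funs are made up for illustration.

```r
cols <- c("mpg", "wt")
funs <- c("mean", "sd", "median")
namer <- function(c, f) paste0(c, "-", f)

# Direct call: paste0() recycles the shorter vector, so only 3 names result
recycled <- namer(cols, funs)

# outer() evaluates the function on all 2 x 3 combinations
all_names <- as.vector(outer(cols, funs, namer))

recycled
all_names
```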
Additions
- Added functions fcount() and fcountv(): a versatile and blazing fast alternative to dplyr::count. It also works with vectors, matrices, as well as grouped and indexed data.
- Added function fquantile(): Fast (weighted) continuous quantile estimation (methods 5-9 following Hyndman and Fan (1996)), implemented fully in C based on quickselect and radixsort algorithms, and also supports an ordering vector as optional input to speed up the process. It is up to 2x faster than stats::quantile on larger vectors, but also especially fast on smaller data, where the R overhead of stats::quantile becomes burdensome. For maximum performance during repeated executions, a programmer's version .quantile() with different defaults is also provided.
- Added function fdist(): A fast and versatile replacement for stats::dist. It computes a full Euclidean distance matrix around 4x faster than stats::dist in serial mode, with additional gains possible through multithreading along the distance matrix columns (decreasing thread loads as the matrix is lower triangular). It also supports computing the distance of a matrix with a single row-vector, or simply between two vectors. E.g. fdist(mat, mat[1, ]) is the same as sqrt(colSums((t(mat) - mat[1, ])^2)), but about 20x faster in serial mode, and fdist(x, y) is the same as sqrt(sum((x-y)^2)), about 3x faster in serial mode. In both cases (sub-column level) multithreading is available. Note that fdist does not skip missing values, i.e. NA's will result in NA distances. There is also no internal implementation for integers or data frames. Such inputs will be coerced to numeric matrices.
- Added function GRPid() to easily fetch the group id from a grouping object, especially inside grouped fmutate() calls. This addition was warranted especially by the new improved fnth.default() method which allows orderings to be supplied for performance improvements. See comments on fnth() and the example provided below.
- fsummarize() was added as a synonym to fsummarise. Thanks @arthurgailes for the PR.
- C API: collapse exports around 40 C functions that provide functionality that is either convenient or rather complicated to implement from scratch. The exported functions can be found at the bottom of src/ExportSymbols.c. The API does not include the Fast Statistical Functions, which I thought are too closely related to how collapse works internally to be of much use to a C programmer (e.g. they expect grouping objects or certain kinds of integer vectors). But you are free to request the export of additional functions, including C++ functions.
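The equivalences stated for fquantile() and fdist() can be checked with a short sketch, assuming collapse 1.9.0 or later is installed. The simulated data and variable names are illustrative, and no timing claims are made here.

```r
library(collapse)

set.seed(1)
x <- rnorm(1000)

# fquantile() versus stats::quantile() with the default type 7
q_fast <- fquantile(x, c(0.1, 0.5, 0.9))
q_base <- quantile(x, c(0.1, 0.5, 0.9), type = 7)

# fdist() of a matrix with one of its rows versus the base R expression
mat <- matrix(rnorm(50), ncol = 5)
d_fast <- fdist(mat, mat[1, ])
d_base <- sqrt(colSums((t(mat) - mat[1, ])^2))

all.equal(as.numeric(q_fast), as.numeric(q_base))
all.equal(as.numeric(d_fast), as.numeric(d_base))
```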
Improvements
- fnth() and fmedian() were rewritten in C, with significant gains in performance and versatility. Notably, fnth() now supports (grouped, weighted) continuous quantile estimation like fquantile() (fmedian(), which is a wrapper around fnth(), can also estimate various quantile based weighted medians). The new default for fnth() is ties = "q7", which gives the same result as (f)quantile(..., type = 7) (R's default). OpenMP multithreading across groups is also much more effective in both the weighted and unweighted case. Finally, fnth.default gained an additional argument o to pass an ordering vector, which can dramatically speed up repeated invocations of the function on the same data:

      # Estimating multiple weighted-grouped quantiles on mpg: pre-computing an ordering provides extra speed.
      mtcars %>% fgroup_by(cyl, vs, am) %>%
        fmutate(o = radixorder(GRPid(), mpg)) %>% # On grouped data, need to account for GRPid()
        fsummarise(mpg_Q1 = fnth(mpg, 0.25, o = o, w = wt),
                   mpg_median = fmedian(mpg, o = o, w = wt),
                   mpg_Q3 = fnth(mpg, 0.75, o = o, w = wt))
      # Note that without weights this is not always faster. Quickselect can be very efficient, so it depends
      # on the data, the number of groups, whether they are sorted (which speeds up radixorder), etc...

- BY now supports data-length arguments to be passed, e.g. BY(mtcars, mtcars$cyl, fquantile, w = mtcars$wt), making it effectively a generic grouped mapply function as well. Furthermore, the grouped_df method now also expands grouping columns for output length > 1.
- collap(), which internally uses BY with non-Fast Statistical Functions, now also supports arbitrary further arguments passed down to functions to be split by groups. Thus users can also apply custom weighted functions with collap(). Furthermore, the parsing of the FUN, catFUN and wFUN arguments was improved and brought in line with the parsing of .fns in across(). The main benefit of this is that Fast Statistical Functions are now also detected and optimizations carried out when passed in a list providing a new name, e.g. collap(data, ~ id, list(mean = fmean)) is now optimized! Thanks @ttrodrigz (#358) for requesting this.
- descr(), by virtue of fquantile and the improvements to BY, supports full-blown grouped and weighted descriptions of data. This is implemented through additional by and w arguments. The function has also been turned into an S3 generic, with a default and a 'grouped_df' method. The 'descr' methods as.data.frame and print also feature various improvements, and a new compact argument to print.descr, allowing a more compact printout. Users will also notice improved performance, mainly due to fquantile: on the M1, descr(wlddev) is now 2x faster than summary(wlddev), and 41x faster than Hmisc::describe(wlddev). Thanks @statzhero for the request (#355).
- radixorder is about 25% faster on characters and doubles. This also benefits grouping performance. Note that group() may still be substantially faster on unsorted data, so if performance is critical try the sort = FALSE argument to functions like fgroup_by and compare.
- Most list processing functions are noticeably faster, as checking the data types of elements in a list is now also done in C, and I have made some improvements to collapse's version of rbindlist() (used in unlist2d(), and various other places).
- fsummarise and fmutate gained an ability to evaluate arbitrary expressions that result in lists / data frames without the need to use across(). For example: mtcars |> fgroup_by(cyl, vs, am) |> fsummarise(mctl(cor(cbind(mpg, wt, carb)), names = TRUE)) or mtcars |> fgroup_by(cyl) |> fsummarise(mctl(lmtest::coeftest(lm(mpg ~ wt + carb)), names = TRUE)). There is also the possibility to compute expressions using .data, e.g. mtcars |> fgroup_by(cyl) |> fsummarise(mctl(lmtest::coeftest(lm(mpg ~ wt + carb, .data)), names = TRUE)) yields the same thing, but is less efficient because the whole dataset (including 'cyl') is split by groups. For greater efficiency and convenience, you can pre-select columns using a global .cols argument, e.g. mtcars |> fgroup_by(cyl, vs, am) |> fsummarise(mctl(cor(.data), names = TRUE), .cols = .c(mpg, wt, carb)) gives the same as above. Three notes about this:
  - No grouped vectorizations for fast statistical functions, i.e. the entire expression is evaluated for each group. (Let m...
collapse version 1.8.9
- Fixed some warnings on rchk and newer C compilers (LLVM clang 10+).
- .pseries / .indexed_series methods also change the implicit class of the vector (attached after "pseries"), if the data type changed. E.g. calling a function like fgrowth on an integer pseries changed the data type to double, but the "integer" class was still attached after "pseries".
- Fixed bad testing for SE inputs in fgroup_by() and findex_by(). See #320.
- Added rsplit.matrix method.
- descr() now by default also reports 10% and 90% quantiles for numeric variables (in line with Stata's detailed summary statistics), and can also be applied to 'pseries' / 'indexed_series'. Furthermore, descr() itself now has an argument stepwise such that descr(big_data, stepwise = TRUE) yields computation of summary statistics on a variable-by-variable basis (and the finished 'descr' object is returned invisibly). The printed result is thus identical to print(descr(big_data), stepwise = TRUE), with the difference that the latter first does the entire computation whereas the former computes statistics on demand.
- Function ss() has a new argument check = TRUE. Setting check = FALSE allows subsetting data frames / lists with positive integers without checking whether integers are positive or in-range. For programmers.
- Function get_vars() has a new argument rename allowing select-renaming of columns in standard evaluation programming, e.g. get_vars(mtcars, c(newname = "cyl", "vs", "am"), rename = TRUE). The default is rename = FALSE, to warrant full backwards compatibility. See #327.
- Added helper function setattrib(), to set a new attribute list for an object by reference + invisible return. This is different from the existing function setAttrib() (note the capital A), which takes a shallow copy of list-like objects and returns the result.
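The select-renaming described for get_vars() can be tried on a built-in dataset. This is a small sketch assuming collapse 1.8.9 or later; the object name sel is illustrative.

```r
library(collapse)

# Named elements of the selection vector become the new column names
sel <- get_vars(mtcars, c(newname = "cyl", "vs", "am"), rename = TRUE)

names(sel)
```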
collapse version 1.8.8
- flm and fFtest are now internal generic with an added formula method, e.g. flm(mpg ~ hp + carb, mtcars, weights = wt) or fFtest(mpg ~ hp + carb | vs + am, mtcars, weights = wt), in addition to the programming interface. Thanks to Grant McDermott for suggesting.
- Added method as.data.frame.qsu, to efficiently turn the default array outputs from qsu() into tidy data frames.
- Major improvements to setv and copyv, generalizing the scope of operations that can be performed to all common cases. This means that even simple base R operations such as X[v] <- R can now be done significantly faster using setv(X, v, R).
- n and qtab can now be added to options("collapse_mask"), e.g. options(collapse_mask = c("manip", "helper", "n", "qtab")). This will export a function n() to get the (group) count in fsummarise and fmutate (which can also always be done using GRPN(), but n() is more familiar to dplyr users), and will mask table() with qtab(), which is principally a fast drop-in replacement, but with some different further arguments.
- Added C-level helper function all_funs, which fetches all the functions called in an expression, similar to setdiff(all.names(x), all.vars(x)) but better because it takes account of the syntax. For example, let x = quote(sum(sum)), i.e. we are summing a column named sum. Then all.names(x) = c("sum", "sum") and all.vars(x) = "sum" so that the difference is character(0), whereas all_funs(x) returns "sum". This function makes collapse smarter when parsing expressions in fsummarise and fmutate and deciding which ones to vectorize.
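The all_funs idea can be sketched in base R. The helper call_heads() below is hypothetical and written only for illustration; it is not the C-level all_funs and does not handle every kind of expression (for example calls with empty arguments such as x[1, ]). It walks the expression tree and collects the symbol in call position, which is what makes the approach syntax-aware.

```r
# Hypothetical helper: collect the names of all functions called in an expression
call_heads <- function(e) {
  if (!is.call(e)) return(character(0))
  head <- e[[1L]]
  head_name <- if (is.symbol(head)) as.character(head) else call_heads(head)
  args <- as.list(e)[-1L]
  unique(c(head_name, unlist(lapply(args, call_heads), use.names = FALSE)))
}

x <- quote(sum(sum))                    # summing a column named 'sum'

naive <- setdiff(all.names(x), all.vars(x))
aware <- call_heads(x)

naive                                   # the name-based difference misses the call
aware
call_heads(quote(mean(a) / sd(b)))
```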
collapse version 1.8.7
- Fixed a bug in fscale.pdata.frame where the default C++ method was being called instead of the list method (i.e. the method didn't work at all).
- Fixed 2 minor rchk issues (the remaining ones are spurious).
- fsum has an additional argument fill = TRUE (default FALSE) that initializes the result vector with 0 instead of NA when na.rm = TRUE, so that fsum(NA, fill = TRUE) gives 0 like base::sum(NA, na.rm = TRUE).
- Slight performance increase in fmean with groups if na.rm = TRUE (the default).
- Significant performance improvement when using base R expressions involving multiple functions and one column, e.g. mid_col = (min(col) + max(col)) / 2 or lorentz_col = cumsum(sort(col)) / sum(col) etc. inside fsummarise and fmutate. Instead of evaluating such expressions on a data subset of one column for each group, they are now turned into a function, e.g. function(x) cumsum(sort(x)) / sum(x), which is applied to a single vector split by groups.
- Argument return.groups from GRP.default is now also available in fgroup_by, allowing grouped data frames without materializing the unique grouping columns. This allows more efficient mutate-only operations, e.g. mtcars |> fgroup_by(cyl, return.groups = FALSE) |> fmutate(across(hp:carb, fscale)). Similarly for aggregation with dropping of grouping columns: mtcars |> fgroup_by(cyl, return.groups = FALSE) |> fmean() is equivalent to and faster than mtcars |> fgroup_by(cyl) |> fmean(keep.group_vars = FALSE).
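The rewrite of single-column expressions into a function applied to a split vector can be pictured with base R only. This is an illustrative sketch of the idea, not the internal implementation; mid_fun and the variable names are made up.

```r
col <- mtcars$mpg
g   <- mtcars$cyl

# The expression (min(col) + max(col)) / 2 expressed as a function of one vector
mid_fun <- function(x) (min(x) + max(x)) / 2

# Split the single column by groups once, then apply the function per group
mid_by_group <- vapply(split(col, g), mid_fun, numeric(1))

mid_by_group
```

Only one vector is split here, rather than a data subset being created for every group.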
collapse version 1.8.6
collapse 1.8.6
- Fixed further minor issues:
- some inline functions in TRA.c needed to be declared 'static' to be local in scope (#275)
- timeid.Rd now uses zoo package conditionally and limits size of printout
collapse 1.8.5
- Fixed some issues flagged by CRAN:
- Installation on some Linux distributions failed because omp.h was included after Rinternals.h
- Some signed integer overflows while running tests caused UBSAN warnings. (This happened inside a hash function where overflows are not a problem. I changed to unsigned int to avoid the UBSAN warning.)
- Ensured that package passes R CMD Check without suggested packages
collapse version 1.8.4
A few improvements and fixes to make collapse 1.8 acceptable to CRAN. The changes may be summarised as follows:
collapse 1.8.4
- Makevars text substitution hack to have CRAN accept a package that combines C, C++ and OpenMP. Thanks also to @MichaelChirico for pointing me in the right direction.
collapse 1.8.3
- Significant speed improvement in qF/qG (factor-generation) for character vectors with more than 100,000 obs and many levels if sort = TRUE (the default). For details see the method argument of ?qF.
- Optimizations in fmode and fndistinct for singleton groups.
collapse 1.8.2
- Fixed some rchk issues found by Thomas Kalibera from CRAN.
- Faster funique.default method.
- group now also internally optimizes on 'qG' objects.
collapse 1.8.1
- Added function fnunique (yet another alternative to data.table::uniqueN, kit::uniqLen or dplyr::n_distinct, and principally a simple wrapper for attr(group(x), "N.groups")). At present fnunique generally outperforms the others on data frames.
- finteraction has an additional argument factor = TRUE. Setting factor = FALSE returns a 'qG' object, which is more efficient if just an integer id but no factor object itself is required.
- Operators (see .OPERATOR_FUN) have been improved a bit such that id-variables selected in the .data.frame (by, w or t arguments) or .pdata.frame methods (variables in the index) are not computed upon even if they are numeric (since the default is cols = is.numeric). In general, if cols is a function used to select columns of a certain data type, id variables are excluded from computation even if they are of that data type. It is still possible to compute on id variables by explicitly selecting them using names or indices passed to cols, or including them in the lhs of a formula passed to by.
- Further efforts to facilitate adding the group-count in fsummarise and fmutate:
  - if options(collapse_mask = "all") before loading the package, an additional function n() is exported that works just like dplyr:::n(). (Note that internal optimization flags for n are always on, so if you really want the function to be called n() without setting options(collapse_mask = "all"), you could also do n <- GRPN or n <- collapse:::n)
  - otherwise the same can now always be done using GRPN(). The previous uses of GRPN are unaltered, i.e. GRPN can also:
    - fetch group sizes directly from a grouping object or grouped data frame, i.e. data |> gby(id) |> GRPN() or data %>% gby(id) %>% ftransform(N = GRPN(.)) (note the dot).
    - compute group sizes on the fly, for example fsubset(data, GRPN(id) > 10L) or fsubset(data, GRPN(list(id1, id2)) > 10L) or GRPN(data, by = ~ id1 + id2).
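The on-the-fly use of GRPN() can be tried on mtcars. This sketch assumes a recent collapse version in which GRPN() expands group sizes to the length of the data by default; the variable names are illustrative.

```r
library(collapse)

# One group size per row, computed on the fly from the cyl column
n_fly <- GRPN(mtcars$cyl)

# Base R counterpart for comparison
n_base <- ave(mtcars$cyl, mtcars$cyl, FUN = length)

# Keep only rows belonging to groups with more than 10 observations
big <- fsubset(mtcars, GRPN(cyl) > 10L)

all(as.integer(n_fly) == as.integer(n_base))
nrow(big)
```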
collapse version 1.8.0
collapse 1.8.0, released in mid-May 2022, brings enhanced support for indexed computations on time series and panel data by introducing flexible 'indexed_frame' and 'indexed_series' classes and surrounding infrastructure, sets a modest start to OpenMP multithreading as well as data transformation by reference in statistical functions, and enhances the package's descriptive statistics toolset.
Changes to functionality
- Functions Recode, replace_non_finite, deprecated since collapse v1.1.0, and is.regular, deprecated since collapse v1.5.1 and clashing with a more important function in the zoo package, are now removed.
- Fast Statistical Functions operating on numeric data (such as fmean, fmedian, fsum, fmin, fmax, ...) now preserve attributes in more cases. Previously these functions did not preserve attributes for simple computations using the default method, and only preserved attributes in grouped computations if !is.object(x) (see NEWS section for collapse 1.4.0). This meant that fmin and fmax did not preserve the attributes of Date or POSIXct objects, and none of these functions preserved 'units' objects (used a lot by the sf package). Now, attributes are preserved if !inherits(x, "ts"), that is, the new default of these functions is to generally keep attributes, except for 'ts' objects where doing so obviously causes an unwanted error (note that 'xts' and others are handled by the matrix or data.frame method where other principles apply, see NEWS for 1.4.0). An exception are the functions fnobs and fndistinct where the previous default is kept.
- Time Series Functions flag, fdiff, fgrowth and psacf/pspacf/psccf (and the operators L/F/D/Dlog/G) now internally process time objects passed to the t argument (where is.object(t) && is.numeric(unclass(t))) via a new function called timeid which turns them into integer vectors based on the greatest common divisor (GCD) (see below). Previously such objects were converted to factor. This can change behavior of code, e.g. a 'Date' variable representing monthly data may be regular when converted to factor, but is now irregular and regarded as daily data (with a GCD of 1) because of the different day counts of the months. Users should fix such code either by calling qG on the time variable (for grouping / factor-conversion) or by using appropriate classes, e.g. zoo::yearmon. Note that plain numeric vectors where !is.object(t) are still used directly for indexation without passing them through timeid (which can still be applied manually if desired).
- BY now has an argument reorder = TRUE, which casts elements in the original order if NROW(result) == NROW(x) (like fmutate). Previously the result was just in order of the groups, regardless of the length of the output. To obtain the former outcome users need to set reorder = FALSE.
- options("collapse_DT_alloccol") was removed, the default is now fixed at 100. The reason is that data.table automatically expands the range of overallocated columns if required (so the option is not really necessary), and calling R options from C slows down C code and can cause problems in parallel code.
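The change from factor conversion to a GCD-based integer id can be illustrated in base R. The functions gcd2() and time_id_sketch() below are hypothetical stand-ins written for this example, not the timeid implementation; they assume integer-valued time values with no missing values and at least two distinct time points.

```r
# Euclid's algorithm for two non-negative numbers
gcd2 <- function(a, b) if (b == 0) a else gcd2(b, a %% b)

# Hypothetical sketch: integer id from the GCD of the distinct time steps
time_id_sketch <- function(t) {
  v <- unclass(t)
  steps <- unique(diff(sort(unique(v))))
  g <- Reduce(gcd2, steps)
  as.integer((v - min(v)) / g) + 1L
}

monthly <- seq(as.Date("2022-01-01"), by = "month", length.out = 4)

factor_id <- as.integer(factor(monthly))  # regular under factor conversion
gcd_id    <- time_id_sketch(monthly)      # irregular: months have 31, 28, 31 days

factor_id
gcd_id
time_id_sketch(c(2000, 2002, 2004))       # regular steps of 2 give consecutive ids
```

Because the day counts between the monthly dates have a GCD of 1, the dates are treated as gaps in a daily series, which is the behavior change described above.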
Bug Fixes
- Fixed a bug in fcumsum that caused a segfault during grouped operations on larger data, due to flawed internal memory allocation. Thanks @Gulde91 for reporting #237.
- Fixed a bug in across caused by two function(x) statements being passed in a list, e.g. mtcars |> fsummarise(acr(mpg, list(ssdd = function(x) sd(x), mu = function(x) mean(x)))). Thanks @trang1618 for reporting #233.
- Fixed an issue in across() when logical vectors were used to select columns on grouped data, e.g. mtcars %>% gby(vs, am) %>% smr(acr(startsWith(names(.), "c"), fmean)) now works without error.
- qsu gives proper output for length 1 vectors, e.g. qsu(1).
- collapse depends on R > 3.3.0, due to the use of newer C-level macros introduced then. The earlier indication of R > 2.1.0 was only based on R-level code and misleading. Thanks @ben-schwen for reporting #236. I will try to maintain this dependency for as long as possible, without being too restrained by development in R's C API and the ALTREP system in particular, which collapse might utilize in the future.
Additions
- Introduction of 'indexed_frame', 'indexed_series' and 'index_df' classes: fast and flexible indexed time series and panel data classes that inherit from plm's 'pdata.frame', 'pseries' and 'pindex' classes. These classes take full advantage of collapse's computational infrastructure, are class-agnostic, i.e. they can be superimposed upon any data frame or vector/matrix like object while maintaining most of the functionality of that object, support both time series and panel data, natively handle irregularity, and support ad-hoc computations inside arbitrary data masking functions and model formulas. This infrastructure comprises additional functions and methods, and modification of some existing functions and 'pdata.frame' / 'pseries' methods.
  - New functions: findex_by/iby, findex/ix, unindex, reindex, is_irregular, to_plm.
  - New methods: [.indexed_series, [.indexed_frame, [<-.indexed_frame, $.indexed_frame, $<-.indexed_frame, [[.indexed_frame, [[<-.indexed_frame, [.index_df, fsubset.pseries, fsubset.pdata.frame, funique.pseries, funique.pdata.frame, roworder(v) (internal), na_omit (internal), print.indexed_series, print.indexed_frame, print.index_df, Math.indexed_series, Ops.indexed_series.
  - Modification of 'pseries' and 'pdata.frame' methods for functions flag/L/F, fdiff/D/Dlog, fgrowth/G, fcumsum, psmat, psacf/pspacf/psccf, fscale/STD, fbetween/B, fwithin/W, fhdbetween/HDB, fhdwithin/HDW, qsu and varying to take advantage of 'indexed_frame' and 'indexed_series' while continuing to work as before with 'pdata.frame' and 'pseries'.

  For more information and details see help("indexing").
- Added function timeid: Generation of an integer-id/time-factor from time or date sequences represented by integer or double vectors (such as 'Date', 'POSIXct', 'ts', 'yearmon', 'yearquarter' or plain integers / doubles) by a numerically quite robust greatest common divisor method (see below). This function is used internally in findex_by, reindex and also in evaluation of the t argument to functions like flag/fdiff/fgrowth whenever is.object(t) && is.numeric(unclass(t)) (see also note above).
- Programming helper function vgcd to efficiently compute the greatest common divisor from a vector of positive integer or double values (which should ideally be unique and sorted as well, timeid uses vgcd(sort(unique(diff(sort(unique(na_rm(x)))))))). Precision for doubles is up to 6 digits.
- Programming helper function frange: A significantly faster alternative to base::range, which calls both min and max. Note that frange inherits collapse's global na.rm = TRUE default.
- Added function qtab/qtable: A versatile and computationally more efficient alternative to base::table. Notably, it also supports tabulations with frequency weights, and computation of a statistic over combinations of variables. Objects are of class 'qtab' that inherits from 'table'. Thus all 'table' methods apply to it.
- TRA was rewritten in C, and now has an additional argument set = TRUE which toggles data transformation by reference. The function setTRA was added as a shortcut which additionally returns the result invisibly. Since TRA is usually accessed internally through the like-named argument to Fast Statistical Functions, passing set = TRUE to those functions yields an internal call to setTRA. For example fmedian(num_vars(iris), g = iris$Species, TRA = "-", set = TRUE) subtracts the species-wise median from the numeric variables in the iris dataset, modifying the data in place and returning the result invisibly. Similarly the argument can be added in other workflows such as iris |> fgroup_by(Species) |> fmutate(across(1:2, fmedian, set = TRUE)) or mtcars |> ftransform(mpg = mpg %+=% hp, wt = fsd(wt, cyl, TRA = "replace_fill", set = TRUE)). Note that such chains must be ended by invisible() if no printout is wanted.
- Exported helper function greorder, the companion to gsplit to reorder output in fmutate (and now also in BY): let g be a 'GRP' object (or something coercible such as a vector) and x a vector, then greorder orders data in y = unlist(gsplit(x, g)) such that identical(greorder(y, g), x).
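The round trip between gsplit and greorder described in the last item can be sketched as follows, assuming collapse 1.8.0 or later. The choice of mtcars columns is illustrative.

```r
library(collapse)

x <- mtcars$mpg
g <- GRP(mtcars$cyl)

# Split by group and flatten: values are now arranged group by group
y <- unlist(gsplit(x, g))

# greorder() restores the original row order
x_back <- greorder(y, g)

all.equal(as.numeric(x_back), x)
```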
Improvements
- fmean, fprod, fmode and fndistinct were rewritten in C, providing performance improvements, particularly in fmode and fndistinct, and improvements for integers in fmean and fprod.
- OpenMP multithreading in fsum, fmean, fmedian, fnth, fmode and fndistinct, implemented via an additional nthreads argument. The default is to use 1 thread, which internally calls a serial version of the code in fsum and fmean (thus no change in the default behavior). The plan is to slowly roll this out over all statistical functions and then introduce options to set alternative global defaults. Multi-threading internally works differently for different functions, see the nthreads argument documentation of a particular function. Unfortunately I currently cannot guarantee thread safety, as parallelization of complex loops entails some tricky bugs and I have limited time to sort these out. So please report bugs, and if you happen to have experience with OpenMP please consider examining the code and making some suggestions.
- TRA has an additional option "replace_NA", e.g. wlddev |> fgroup_by(iso3c) |> fmutate(across(PCGDP:POP, fmedian, TRA = "replace_NA")) performs median value imputation of missing values. Similarly fo...
collapse version 1.7.6
- Corrected a C-level bug in gsplit that could lead R to crash in some instances (gsplit is used internally in fsummarise, fmutate, BY and collap to perform computations with base R (non-optimized) functions).
- Ensured that BY.grouped_df always (by default) returns grouping columns in aggregations, i.e. iris |> gby(Species) |> nv() |> BY(sum) now gives the same as iris |> gby(Species) |> nv() |> fsum().
- A dot (.) was added to the first argument of functions fselect, fsubset, colorder and fgroup_by, i.e. fselect(x, ...) -> fselect(.x, ...). The reason for this is that over time I added the option to select-rename columns, e.g. fselect(mtcars, cylinders = cyl), which was not offered when these functions were created. This presents problems if columns should be renamed into x, e.g. fselect(mtcars, x = cyl) failed, see #221. Renaming the first argument to .x somewhat guards against such situations. I think this change is worthwhile to implement, because it makes the package more robust going forward, and usually the first argument of these functions is never invoked explicitly. I really hope this breaks nobody's code.
- Added a function GRPN to make it easy to add a column of group sizes, e.g. mtcars %>% fgroup_by(cyl,vs,am) %>% ftransform(Sizes = GRPN(.)) or mtcars %>% ftransform(Sizes = GRPN(list(cyl, vs, am))) or GRPN(mtcars, by = ~cyl+vs+am).
- Added [.pwcor and [.pwcov, to be able to subset correlation/covariance matrices without losing the print formatting.
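The effect of renaming the first argument to .x can be seen in a short sketch, assuming collapse 1.7.6 or later. The object name res is illustrative.

```r
library(collapse)

# Select-rename: 'cyl' is returned under the new name 'x'.
# With the first argument called .x, 'x = cyl' is no longer mistaken for the data argument.
res <- fselect(mtcars, x = cyl, mpg)

names(res)
```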
collapse version 1.7.5
collapse 1.7.5
- In the development version on GitHub, a dot (.) was added to the first argument of functions fselect, fsubset, colorder and fgroup_by, i.e. fselect(x, ...) -> fselect(.x, ...). The reason for this is that over time I added the option to select-rename columns, e.g. fselect(mtcars, cylinders = cyl), which was not offered when these functions were created. This presents problems if columns should be renamed into x, e.g. fselect(mtcars, x = cyl) fails, see e.g. #221. Renaming the first argument to .x somewhat guards against such situations. I think this API change is worthwhile to implement, because it makes the package more robust going forward, and usually the first argument of these functions is never invoked explicitly. For now it remains in the development version which you can install using remotes::install_github("SebKrantz/collapse"). If you have strong objections to this change (because it will break your code or you know of people that have a programming style where they explicitly set the first argument of data manipulation functions), please let me know!
- Also ensuring tidyverse examples are in \donttest{} and building without the dplyr testing file to avoid issues with static code analysis on CRAN.
- 20-50% speed improvement in gsplit (and therefore in fsummarise, fmutate, collap and BY when invoked with base R functions) when grouping with GRP(..., sort = TRUE, return.order = TRUE). To enable this by default, the default for argument return.order in GRP was set to sort, which retains the ordering vector (needed for the optimization). Retaining the ordering vector uses up some memory which can possibly adversely affect computations with big data, but with big data sort = FALSE usually gives faster results anyway, and you can also always set return.order = FALSE (also in fgroup_by, collap), so this default gives the best of both worlds.
- An ancient deprecated argument sort.row (replaced by sort in 2020) is now removed from collap. Also arguments return.order and method were added to collap, providing full control of the grouping that happens internally.
collapse 1.7.4
- Tests needed to be adjusted for the upcoming release of dplyr 1.0.8 which involves an API change in mutate. fmutate will not take over these changes, i.e. fmutate(..., .keep = "none") will continue to work like dplyr::transmute. Furthermore, no more tests involving dplyr are run on CRAN, and I will also not follow along with any future dplyr API changes.
- The C-API macro installTrChar (used in the new massign function) was replaced with installChar to maintain backwards compatibility with R versions prior to 3.6.0. Thanks @tedmoorman #213.
- Minor improvements to group(), providing increased performance for doubles and also increased performance when the second grouping variable is integer, which turned out to be very slow in some instances.