
Releases: SebKrantz/collapse

collapse version 1.7.6

12 Feb 15:21
cbb8d07
  • Corrected a C-level bug in gsplit that could cause R to crash in some instances (gsplit is used internally in fsummarise, fmutate, BY and collap to perform computations with base R (non-optimized) functions).

  • Ensured that BY.grouped_df always (by default) returns grouping columns in aggregations, i.e. iris |> gby(Species) |> nv() |> BY(sum) now gives the same result as iris |> gby(Species) |> nv() |> fsum().

  • A . was added to the first argument of the functions fselect, fsubset, colorder and fgroup_by, i.e. fselect(x, ...) -> fselect(.x, ...). The reason is that over time I added the option to select and rename columns, e.g. fselect(mtcars, cylinders = cyl), which was not offered when these functions were created. This causes problems if a column should be renamed to x, e.g. fselect(mtcars, x = cyl) failed, see #221. Renaming the first argument to .x somewhat guards against such situations. I think this change is worthwhile to implement, because it makes the package more robust going forward, and the first argument of these functions is usually never invoked explicitly. I really hope this breaks nobody's code.

  • Added a function GRPN to make it easy to add a column of group sizes, e.g. mtcars %>% fgroup_by(cyl, vs, am) %>% ftransform(Sizes = GRPN(.)) or mtcars %>% ftransform(Sizes = GRPN(list(cyl, vs, am))) or GRPN(mtcars, by = ~ cyl + vs + am).

  • Added [.pwcor and [.pwcov, so that correlation and covariance matrices can be subset without losing the print formatting.

collapse version 1.7.5

05 Feb 13:55
24f06e4

  • In the development version on GitHub, a . was added to the first argument of the functions fselect, fsubset, colorder and fgroup_by, i.e. fselect(x, ...) -> fselect(.x, ...). The reason is that over time I added the option to select and rename columns, e.g. fselect(mtcars, cylinders = cyl), which was not offered when these functions were created. This causes problems if a column should be renamed to x, e.g. fselect(mtcars, x = cyl) fails, see #221. Renaming the first argument to .x somewhat guards against such situations. I think this API change is worthwhile to implement, because it makes the package more robust going forward, and the first argument of these functions is usually never invoked explicitly. For now it remains in the development version, which you can install using remotes::install_github("SebKrantz/collapse"). If you have strong objections to this change (because it will break your code, or you know of people with a programming style that explicitly sets the first argument of data manipulation functions), please let me know!

  • Also ensured that tidyverse examples are in \donttest{} and that the package builds without the dplyr testing file, to avoid issues with static code analysis on CRAN.

  • 20-50% speed improvement in gsplit (and therefore in fsummarise, fmutate, collap and BY when invoked with base R functions) when grouping with GRP(..., sort = TRUE, return.order = TRUE). To enable this by default, the default for argument return.order in GRP was set to sort, which retains the ordering vector (needed for the optimization). Retaining the ordering vector uses some memory, which can adversely affect computations with big data, but with big data sort = FALSE usually gives faster results anyway, and you can also always set return.order = FALSE (also in fgroup_by and collap), so this default gives the best of both worlds.

  • An ancient deprecated argument sort.row (replaced by sort in 2020) is now removed from collap. Also, arguments return.order and method were added to collap, providing full control of the grouping that happens internally.

collapse 1.7.4

  • Tests needed to be adjusted for the upcoming release of dplyr 1.0.8, which involves an API change in mutate. fmutate will not adopt these changes, i.e. fmutate(..., .keep = "none") will continue to work like dplyr::transmute. Furthermore, no more tests involving dplyr are run on CRAN, and I will also not follow along with any future dplyr API changes.

  • The C-API macro installTrChar (used in the new massign function) was replaced with installChar to maintain backwards compatibility with R versions prior to 3.6.0. Thanks @tedmoorman #213.

  • Minor improvements to group(), providing increased performance for doubles, and also when the second grouping variable is integer, which previously was very slow in some instances.

collapse version 1.7.3

26 Jan 23:57
5a79c77
  • Removed tests involving the weights package (which is not available on R-devel CRAN checks).

  • fgroup_by is more flexible, supporting computing columns e.g. fgroup_by(GGDC10S, Variable, Decade = floor(Year / 10) * 10) and various programming options e.g. fgroup_by(GGDC10S, 1:3), fgroup_by(GGDC10S, c("Variable", "Country")), or fgroup_by(GGDC10S, is.character). You can also use column sequences e.g. fgroup_by(GGDC10S, Country:Variable, Year), but this should not be mixed with computing columns. Compute expressions may also not include the : function.

  • More memory efficient attribute handling in C/C++ (using C-API macro SHALLOW_DUPLICATE_ATTRIB instead of DUPLICATE_ATTRIB) in most places.

collapse version 1.7.2

21 Jan 13:51
6fd48fa
  • Ensured that the base pipe |> is not used in tests or examples, to avoid errors on CRAN checks with older versions of R.

collapse version 1.7.1

21 Jan 00:39
0e09a57

  • Fixed minor C/C++ issues flagged in CRAN checks.

  • Added option ties = "last" to fmode.

  • Added argument stable.algo to qsu. Setting stable.algo = FALSE toggles a faster calculation of the standard deviation, yielding a roughly 2x speedup on large datasets.
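    For illustration, a minimal sketch (the data is just simulated; stable.algo is the argument described above):

      library(collapse)
      set.seed(101)
      x <- rnorm(1e7)
      qsu(x)                       # default: numerically stable (Welford-type) algorithm
      qsu(x, stable.algo = FALSE)  # faster one-pass calculation of the SD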

  • Fast Statistical Functions now internally use group for grouping data if both g and TRA arguments are used, yielding efficiency gains on unsorted data.

  • Ensured that fmutate and fsummarise can be called if collapse is not attached.

collapse version 1.7.0

14 Jan 10:57
a5d668d

collapse 1.7.0, released mid January 2022, brings major improvements in the computational backend of the package, its data manipulation capabilities, and a whole set of new functions that enable more flexible and memory-efficient R programming - significantly enhancing the language itself. For the vast majority of code, updating to 1.7 should not cause any problems.

Changes to functionality

  • num_vars is now implemented in C, yielding a massive performance increase over checking columns using vapply(x, is.numeric, logical(1)). It selects columns where (is.double(x) || is.integer(x)) && !is.object(x). This provides the same results for the most common classes found in data frames (e.g. factors and date columns are not numeric); however, users can define is.numeric methods for other objects, which num_vars will no longer respect. A prominent example is base R's 'ts' class: is.numeric(AirPassengers) returns TRUE, but is.object(AirPassengers) is also TRUE, so the above yields FALSE - implying that, if you happened to work with data frames of 'ts' columns, num_vars will no longer select those. Please make me aware if there are other important classes that are found in data frames and for which is.numeric returns TRUE. num_vars is also used internally in collap, so this might affect your aggregations.
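    A small illustration of the new selection rule (hypothetical data):

      library(collapse)
      d <- data.frame(a = 1:3, b = c(1.5, 2.5, 3.5), f = factor(c("x", "y", "z")))
      d$ts <- ts(1:3)   # 'ts' column: numeric type, but is.object() is TRUE
      num_vars(d)       # now selects only 'a' and 'b'; the factor and 'ts' columns are objects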

  • In flag, fdiff and fgrowth, if a plain numeric vector is passed to the t argument such that is.double(t) && !is.object(t), it is coerced to integer using as.integer(t) and used directly as the time variable, rather than applying ordered grouping first. This avoids the inefficiency of grouping, and owes to the fact that in most data imported into R with various packages, time (year) variables are coded as double although they should be integer (I also don't know of any cases where time needs to be indexed by a non-date variable with decimal places). Note that the algorithm internally handles irregularity in the time variable, so this is not a problem. Should this break any code, kindly raise an issue on GitHub.
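    A small hypothetical illustration of passing a double-typed time variable directly:

      library(collapse)
      x    <- c(5, 6, 8, 9)
      year <- c(2000, 2001, 2003, 2004)  # double with no decimal places; 2002 is missing
      flag(x, 1, t = year)               # year is coerced via as.integer() and used directly;
                                         # the lag for 2003 is NA since 2002 is absent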

  • The function setrename now truly renames objects by reference (without creating a shallow copy). The same is true for vlabels<- (which was rewritten in C) and a new function setrelabel. Thus additional care needs to be taken (with use inside functions etc.) as the renaming will take global effects unless a shallow copy of the data was created by some prior operation inside the function. If in doubt, better use frename or relabel which do create a shallow copy.

  • Some improvements to the BY function, both in terms of performance and security. Performance is enhanced through a new C function gsplit, providing split-apply-combine computing speeds competitive with dplyr on a much broader range of R objects. Regarding security: if the result of the computation has the same length as the original data, names / rownames and grouping columns (for grouped data) are only added to the result object if they are known to be valid, i.e. if the data was originally sorted by the grouping columns (information recorded by GRP.default(..., sort = TRUE), which is called internally on non-factor/GRP/qG objects). This is because BY does not reorder data after the split-apply-combine step (unlike dplyr::mutate); data are simply recombined in the order of the groups. Because of this, BY should in general be used to compute summary statistics (unless data are sorted before grouping). The added security makes this explicit.
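    For example, grouped summary statistics with a base R function (a minimal sketch):

      library(collapse)
      g <- GRP(iris, ~ Species)   # sorted grouping; ordering information is recorded
      BY(nv(iris), g, median)     # grouped medians, computed via the new gsplit backend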

  • Added a method length.GRP giving the length of a grouping object. This could break code calling length on a grouping object before (which just returned the length of the list).

  • Functions renamed in collapse 1.6.0 will now print a message telling you to use the updated names. The functions under the old names will stay around for 1-3 more years.

  • The passing of argument order instead of sort in function GRP (from a very early version of collapse), is now disabled.

Bug Fixes

  • Fixed a bug in some functions using Welford's online algorithm (fvar, fsd, fscale and qsu) to calculate variances, occurring when initial or final zero weights caused the running sum of weights in the algorithm to be zero, yielding a division by zero and NA as output although a value was expected. These functions now skip zero weights alongside missing weights, which also implies that you can pass a logical vector to the weights argument to very efficiently calculate statistics on a subset of the data (e.g. using qsu).
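    For example, a logical vector passed to w now acts as an efficient subset (a minimal sketch; zero weights are simply skipped):

      library(collapse)
      x   <- rnorm(100)
      sub <- x > 0        # logical weights: FALSE = 0, TRUE = 1
      fmean(x, w = sub)   # mean of the positive values only
      fsd(x, w = sub)     # standard deviation of the positive values only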

Additions

Basic Computational Infrastructure

  • Function group was added, providing a low-level interface to a new unordered grouping algorithm based on hashing in C and optimized for R's data structures. The algorithm was heavily inspired by the great kit package of Morgan Jacob, and now feeds into the package through multiple central functions (including GRP / fgroup_by, funique and qF) when invoked with argument sort = FALSE. It is also used in internal groupings performed by data transformation functions such as fwithin (when no factor or 'GRP' object is provided to the g argument). The speed of the algorithm is very promising (often superior to radixorder), and it could be used in more places still. I welcome any feedback on its performance on different datasets.
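    A minimal sketch of calling group directly (it returns an integer vector of group ids in first-appearance order, with the number of groups attached as an attribute):

      library(collapse)
      g <- group(mtcars[c("cyl", "vs", "am")])  # unordered grouping of 3 columns
      attr(g, "N.groups")                       # number of distinct groups found
      fmean(mtcars$mpg, g)                      # group ids can be passed to g arguments directly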

  • Function gsplit provides an efficient alternative to split based on grouping objects. It is used as the new backend of rsplit (which also supports data frames) as well as BY, collap, fsummarise and fmutate - for more efficient grouped operations with functions external to the package.
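    A minimal sketch of gsplit with a grouping object:

      library(collapse)
      g <- GRP(mtcars, ~ cyl)
      gsplit(mtcars$mpg, g)                  # list of 'mpg' vectors, one per group
      sapply(gsplit(mtcars$mpg, g), median)  # split-apply-combine with a base R function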

  • Added multiple functions to facilitate memory-efficient programming (written in C). These include elementary mathematical operations by reference (setop, %+=%, %-=%, %*=%, %/=%), supporting computations involving integers and doubles on vectors, matrices and data frames (including row-wise operations via setop) with no copies at all. Furthermore, a set of functions check a single value against a vector without generating logical vectors: whichv and whichNA (with operators %==% and %!=%, which return indices and are significantly faster than ==, especially inside functions like fsubset), anyv and allv (allNA was already added before). Finally, the functions setv and copyv speed up operations involving the replacement of a value (x[x == 5] <- 6) or of a sequence of values from an equally sized object (x[x == 5] <- y[x == 5], or x[ind] <- y[ind] where ind could be pre-computed vectors or indices) in vectors and data frames, without generating any logical vectors or materializing vector subsets.
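    Some minimal sketches of these primitives (the reference operations modify their first argument in place):

      library(collapse)
      x <- c(1, 5, 3, 5, NA)
      x %+=% 1        # in place: x is now c(2, 6, 4, 6, NA); same as setop(x, "+", 1)
      whichv(x, 6)    # indices where x == 6, without creating a logical vector
      x %==% 6        # operator form of the same
      whichNA(x)      # index of the missing value
      setv(x, 6, 7)   # in-place replacement: x[x == 6] <- 7 without intermediate vectors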

  • Function vlengths was added as a more efficient alternative to lengths (without method dispatch, simply coded in C).

  • Function massign provides a multivariate version of assign (written in C, and supporting all basic vector types). In addition, the operator %=% was added as an efficient multiple assignment operator. (It is called %=% and not %<-% to facilitate the translation of Matlab or Python code into R, and because the zeallot package already provides multiple assignment operators (%<-% and %->%), which are significantly more versatile, but orders of magnitude slower than %=%.)
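    A minimal sketch of multiple assignment (objects are created in the calling environment):

      library(collapse)
      massign(c("a", "b"), list(1:3, letters[1:3]))  # a <- 1:3; b <- letters[1:3]
      c("lo", "hi") %=% list(0, 10)                  # operator form: lo <- 0; hi <- 10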

High-Level Features

  • Fully fledged fmutate function that provides functionality analogous to dplyr::mutate (sequential evaluation of arguments, including arbitrary tagged expressions and across statements). fmutate is optimized to work together with the package's Fast Statistical and Data Transformation Functions, yielding fast, vectorized execution, but it also benefits from gsplit for other operations.
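    For example (a minimal sketch):

      library(collapse)
      mtcars |>
        fgroup_by(cyl) |>
        fmutate(demeaned_mpg = mpg - fmean(mpg),  # vectorized via the Fast Statistical Functions
                hp_rank = rank(hp)) |>            # base R function: evaluated by groups via gsplit
        fungroup() |>
        head()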

  • across() function implemented for use inside fsummarise and fmutate. It is also optimized for the Fast Statistical and Data Transformation Functions, but performs well with other functions too. It has an additional argument .apply = FALSE which applies functions to the entire subset of the data instead of individual columns, and thus allows for nesting tibbles and estimating models or correlation matrices by groups, etc. across() also supports an arbitrary number of additional arguments which are split and evaluated by groups if necessary. Multiple across() statements can be combined with tagged vector expressions in a single call to fsummarise or fmutate. Thus the computational framework is pretty general and similar to data.table, although less efficient with big datasets.
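    A brief sketch of across() inside fsummarise, combined with a tagged expression:

      library(collapse)
      mtcars |>
        fgroup_by(cyl) |>
        fsummarise(across(c(mpg, hp), list(mean = fmean, sd = fsd)),
                   n = fnobs(mpg))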

  • Added functions relabel and setrelabel to make interactive dealing with variable labels a bit easier. Note that both functions operate by reference. (Through vlabels<- which is implemented in C. Taking a shallow copy of the data frame is useless in this case because variable labels are attributes of the columns, not of the frame). The only difference between the two is that setrelabel returns the result invisibly.

  • Function shortcuts rnm and mtt were added for frename and fmutate. across can also be abbreviated as acr.

  • Added two options that can be invoked before loading the package to change the namespace: options(collapse_mask = c(...)) can be set to export copies of selected (or all) functions in the package that start with f, removing the leading f, e.g. fsubset -> subset (both fsubset and subset will be exported). This allows masking base R and dplyr functions (even basic functions such as sum, mean, unique etc. if desired) with collapse's fast functions, facilitating the optimization of existing code and allowing you to work with collapse using a more natural namespace. The package has been internally insulated against such changes, but of course they might have major effects on...
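    A sketch of how the masking option described above might be used (it must be set before the package is loaded; the particular function names chosen here are just examples):

      options(collapse_mask = c("fsubset", "fsummarise", "fmutate", "fgroup_by"))
      library(collapse)  # now also exports subset, summarise, mutate and group_by
      mtcars |> group_by(cyl) |> summarise(mpg = fmean(mpg))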


collapse version 1.6.5

24 Jul 18:12
ca9ad35

  • Use of VECTOR_PTR in the C API now gives an error on R-devel even if USE_RINTERNALS is defined. Thus this patch gets rid of all remaining usage of this macro to avoid errors on CRAN checks using the development version of R.

  • The print method for qsu now uses an apostrophe (') instead of a comma (,) to designate digits in the millions. This is to avoid confusion with the decimal point, and with the typical use of the comma for thousands (which I don't like).

collapse version 1.6.4

09 Jul 11:45
498817f

A patch for 1.6.0 which fixes (minor) issues flagged by CRAN and adds a few handy extras.

Bug Fixes

  • Puts examples using the new base pipe |> inside \donttest{} so that they don't fail CRAN tests on older R versions.

  • Fixes an LTO issue caused by a small mistake in a header file (which does not have any implications for the user but was detected by CRAN checks).

  • Fixes an additional issue flagged by checks with the gcc11 compiler: a pointer pointing to element -1 of a C array (which I had done on purpose in order to index it with an R integer vector).

  • Fixes a valgrind issue caused by comparing an uninitialized value to something.

Additions

  • Added a function fcomputev, which allows selecting columns and transforming them with a function in one go. The keep argument can be used to add columns to the selection that are not transformed.
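    A minimal sketch of fcomputev:

      library(collapse)
      # Log-transform all numeric columns of iris, keeping the untransformed Species column
      head(fcomputev(iris, is.numeric, log, keep = "Species"))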

  • Added a function setLabels as a wrapper around vlabels<- to facilitate setting variable labels inside pipes.

  • Function rm_stub now has an argument regex = TRUE which triggers a call to gsub and allows general removal of character sequences from column names on the fly.

Improvements

  • vlabels<- and setLabels now support lists of variable labels or other attributes (i.e. the value is internally subset using [[, not [). Thus they are now general functions to attach a vector or list of attributes to columns in a list / data frame.

Other Changes

  • CRAN maintainers have asked me to remove a line in a Makevars file intended to reduce the size of Rcpp object files (which has been there since version 1.4). So the installed size of the package may now be larger.

collapse version 1.6.0

27 Jun 22:09
61e72fe

collapse 1.6.0, released end of June 2021, presents some significant improvements in the user-friendliness, compatibility and programmability of the package, as well as a few function additions.

Changes to Functionality

  • ffirst, flast, fnobs, fsum, fmin and fmax were rewritten in C. The former three now also support list columns (where NULL or empty list elements are considered missing values when na.rm = TRUE), and are extremely fast for grouped aggregation if na.rm = FALSE. The latter three also support and return integers, with significant performance gains, even compared to base R. Code using these functions expecting an error for list-columns or expecting double output even if the input is integer should be adjusted.

  • collapse now directly supports sf data frames through functions like fselect, fsubset, num_vars, qsu, descr, varying, funique, roworder, rsplit, fcompute etc., which will take along the geometry column even if it is not explicitly selected (mirroring dplyr methods for sf data frames). This is mostly done internally at C-level, so functions remain simple and fast. Existing code that explicitly selects the geometry column is unaffected by the change, but code of the form sf_data %>% num_vars %>% qDF %>% ..., where columns excluding geometry were selected and the object later converted to a data frame, needs to be rewritten as sf_data %>% qDF %>% num_vars %>% .... A short vignette was added describing the integration of collapse and sf.

  • I've received several requests for increased namespace consistency. collapse functions were named to be consistent with base R, dplyr and data.table, resulting in names like is.Date, fgroup_by or settransformv. To me this makes sense, but I've been convinced that a bit more consistency is advantageous. Towards that end I have decided to eliminate the '.' notation of base R and to remove some unexpected capitalizations in function names giving some people the impression I was using camel-case. The following functions are renamed:
    fNobs -> fnobs, fNdistinct -> fndistinct, pwNobs -> pwnobs, fHDwithin -> fhdwithin, fHDbetween -> fhdbetween, as.factor_GRP -> as_factor_GRP, as.factor_qG -> as_factor_qG, is.GRP -> is_GRP, is.qG -> is_qG, is.unlistable -> is_unlistable, is.categorical -> is_categorical, is.Date -> is_date, as.numeric_factor -> as_numeric_factor, as.character_factor -> as_character_factor, Date_vars -> date_vars.
    This is done in a very careful manner: the old names will stick around for a long while (until the end of 2022), and the generics of fNobs, fNdistinct, fHDbetween and fHDwithin will be kept in the package for an indeterminate period, but their core methods will not be exported beyond 2022. I will start warning about these renamed functions in 2022. In the future I will undogmatically stick to a function naming style with lowercase function names and underscores where words need to be split. Other function names will be kept. To say something about this: the quick-conversion functions qDF, qDT, qM, qF and qG are consistent and in line with data.table (setDT etc.), and similarly the operators L, F, D, Dlog, G, B, W, HDB, HDW. I'll keep GRP, BY and TRA, for lack of better names, for parsimony, and because they are central to the package. The camel case will be kept in helper functions like setDimnames because they work like stats::setNames and do not modify the argument by reference (unlike settransform or setrename and various data.table functions). Functions copyAttrib and copyMostAttrib are exports of like-named functions in the C API and are thus kept as they are. Finally, I want to keep fFtest the way it is because the F-distribution is widely recognized by a capital F.

  • I've updated the wlddev dataset with the latest data from the World Bank, and also added a variable giving the total population (which may be useful e.g. for population-weighted aggregations across regions). The extra column could invalidate codes used to demonstrate something (I had to adjust some examples, tests and code in vignettes).

Additions

  • Added a function fcumsum (written in C), permitting flexible (grouped, ordered) cumulative summations on matrix-like objects (integer or double typed), with extra methods for grouped data frames and panel series and data frames. Apart from the internal grouping, and an ordering argument allowing cumulative sums in a different order than the data appear in, fcumsum has two options to deal with missing values. The default (na.rm = TRUE) is to skip (preserve) missing values, whereas setting fill = TRUE allows missing values to be populated with the previous value of the cumulative sum (starting from 0).
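    For example, using the package's wlddev data (a minimal sketch):

      library(collapse)
      # Cumulative ODA received by country (g), accumulated in order of year (o),
      # skipping missing values (the default na.rm = TRUE)
      wlddev$ODA_cum <- fcumsum(wlddev$ODA, g = wlddev$iso3c, o = wlddev$year)
      head(wlddev[c("iso3c", "year", "ODA", "ODA_cum")])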

  • Added a function alloc to efficiently generate vectors initialized with any value (faster than rep_len).

  • Added a function pad to efficiently pad vectors / matrices / data.frames with a value (default is NA). This function was mainly created to make it easy to expand results coming from a statistical model fitted on data with missing values to the original length. For example let data <- na_insert(mtcars); mod <- lm(mpg ~ cyl, data), then we can do settransform(data, resid = pad(resid(mod), mod$na.action)), or we could do pad(model.matrix(mod), mod$na.action) or pad(model.frame(mod), mod$na.action) to receive matrices and data frames from model data matching the rows of data. pad is a general function that will also work with mixed-type data. It is also possible to pass a vector of indices matching the rows of the data to pad, in which case pad will fill gaps in those indices with a value/row in the data.

Improvements

  • Full data.table support, including reference semantics (set*, :=)!! There is some complex C-level programming behind data.table's operations by reference. Notably, additional (hidden) column pointers are allocated to be able to add columns without taking a shallow copy of the data.table, and an ".internal.selfref" attribute containing an external pointer is used to check if any shallow copy was made using base R commands like <-. This is done to avoid even a shallow copy of the data.table in manipulations using := (and is in my opinion not worth it, as even large tables are shallow copied by base R (>= 3.1.0) within microseconds, and all of this complicates development immensely). Previously, collapse treated data.tables like any other data frame, using shallow copies in manipulations and preserving the attributes (thus ignoring how data.table works internally). This produced a warning whenever you wanted to use data.table reference semantics (set*, :=) after passing the data.table through a collapse function such as collap, fselect, fsubset, fgroup_by etc. From v1.6.0, I have adopted essential C code from data.table to do the overallocation and generate the ".internal.selfref" attribute, thus seamless workflows combining collapse and data.table are now possible. This comes at a cost of about 2-3 microseconds per function, as to do this I have to shallow copy the data.table again and add extra column pointers and an ".internal.selfref" attribute telling data.table that this table was not copied (it seems to be the only way to do it for now). This integration encompasses all data manipulation functions in collapse, but not the Fast Statistical Functions themselves. Thus you can do agDT <- DT %>% fselect(id, col1:coln) %>% collap(~id, fsum); agDT[, newcol := 1], but you would need to add qDT after a function like fsum if you want to use reference semantics without incurring a warning: agDT <- DT %>% fselect(id, col1:coln) %>% fgroup_by(id) %>% fsum %>% qDT; agDT[, newcol := 1]. collapse appears to be the first package that attempts to account for data.table's internal workings without importing data.table, and qDT is now the fastest way to create a fully functional data.table from any R object. A global option "collapse_DT_alloccol" was added to regulate how many columns collapse overallocates when creating data.tables. The default is 100, which is lower than the data.table default of 1024. This was done to increase the efficiency of the additional shallow copies, and may be changed by the user.

  • Programming enabled with fselect and fgroup_by (you can now pass vectors containing column names or indices). Note that instead of fselect you should use get_vars for standard eval programming.

  • fselect and fsubset support in-place renaming e.g. fselect(data, newname = var1, var3:varN),
    fsubset(data, vark > varp, newname = var1, var3:varN).

  • collap supports renaming columns in the custom argument, e.g. collap(data, ~ id, custom = list(fmean = c(newname = "var1", "var2"), fmode = c(newname = 3), flast = is_date)).

  • Performance improvements: fsubset / ss return the data or perform a simple column subset without deep copying the data if all rows are selected through a logical expression. fselect and get_vars, num_vars etc. are slightly faster through data frame subsetting done fully in C. ftransform / fcompute use alloc instead of base::rep to replicate a scalar value which is slightly more efficient.

  • fcompute now has a keep argument, to preserve several existing columns when computing columns on a data frame.

  • replace_NA now has a cols argument, so we can do replace_NA(data, cols = is.numeric) to replace NAs in numeric columns. I note that for big numeric data, data.table::setnafill is the most efficient solution.

  • fhdbetween and fhdwithin have an effect argument in plm methods, allowing centering on selected identifiers. The default is still to center on all panel identifiers.
    ...


collapse version 1.5.3

09 Mar 21:33
5ce5a53

Changes to Functionality

  • The first argument of ftransform was renamed to .data from X. This was done to enable the user to transform columns named "X". For the same reason the first argument of frename was renamed to .x from x (not .data, to make it explicit that .x can be any R object with a "names" attribute). It is not possible to deprecate X and x without at the same time undoing the benefits of the argument renaming, thus this change is immediate and code-breaking in rare cases where the first argument is explicitly set.

  • The function is.regular, to check whether an R object is atomic or list-like, is deprecated and will be removed before the end of the year. This was done to avoid a namespace clash with the zoo package (#127).

Bug Fixes

  • For reasons of efficiency, most statistical and transformation functions used the C macro SHALLOW_DUPLICATE_ATTRIB to copy column attributes in a data frame. Since this macro does not copy S4 object bits, it caused some problems with S4 object columns such as POSIXct (e.g. computing lags/leads, first and last values on these columns). This is now fixed, all statistical functions (apart from fvar and fsd) now use DUPLICATE_ATTRIB and thus preserve S4 object columns (#91).

  • unlist2d produced a subsetting error if an empty list was present in the list-tree. This is now fixed, empty or NULL elements in the list-tree are simply ignored (#99).

Additions

  • A function fsummarise was added to facilitate translating dplyr / data.table code to collapse. Like collap, it is only very fast when used with the Fast Statistical Functions.
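    A minimal sketch (fast when used with the Fast Statistical Functions):

      library(collapse)
      fsummarise(fgroup_by(mtcars, cyl),
                 mpg_mean = fmean(mpg),
                 hp_max   = fmax(hp))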

  • A function t_list is made available to efficiently transpose lists of lists.

Improvements

  • C files are compiled with -O3 on Windows, which gives a boost of around 20% for the grouping mechanism applied to character data.