Faster uniqueN(v) for logical v (#2648)
HughParsonage authored and mattdowle committed Mar 2, 2018
1 parent 5a0ba36 commit a664ea4
Showing 6 changed files with 62 additions and 3 deletions.
16 changes: 14 additions & 2 deletions NEWS.md
@@ -65,10 +65,22 @@

13. `unique(DT)` now returns `DT` early when there are no duplicates to save RAM, [#2013](https://github.com/Rdatatable/data.table/issues/2013). Thanks to Michael Chirico for the PR.

-14. Subsetting optimization with keys and indices is now possible for compound queries like `DT[a==1 & b==2]`, [#2472](https://github.com/Rdatatable/data.table/issues/2472).
+14. `uniqueN()` is now faster on logical vectors. Thanks to Hugh Parsonage for [PR#2648](https://github.com/Rdatatable/data.table/pull/2648).
+    ```
+    N = 1e9
+                                   was      now
+    x = c(TRUE,FALSE,NA,rep(TRUE,N))
+    uniqueN(x) == 3                5.4s     0.00s
+    x = c(TRUE,rep(FALSE,N), NA)
+    uniqueN(x,na.rm=TRUE) == 2     5.4s     0.00s
+    x = c(rep(TRUE,N),FALSE,NA)
+    uniqueN(x) == 3                6.7s     0.38s
+    ```
+
+15. Subsetting optimization with keys and indices is now possible for compound queries like `DT[a==1 & b==2]`, [#2472](https://github.com/Rdatatable/data.table/issues/2472).
Thanks to @MichaelChirico for reporting and to @MarkusBonsch for the implementation.

-15. `melt.data.table` now offers friendlier functionality for providing `value.name` for `list` input to `measure.vars`, [#1547](https://github.com/Rdatatable/data.table/issues/1547). Thanks @MichaelChirico and @franknarf1 for the suggestion and use cases, @jangorecki and @mrdwab for implementation feedback, and @MichaelChirico for ultimate implementation.
+16. `melt.data.table` now offers friendlier functionality for providing `value.name` for `list` input to `measure.vars`, [#1547](https://github.com/Rdatatable/data.table/issues/1547). Thanks @MichaelChirico and @franknarf1 for the suggestion and use cases, @jangorecki and @mrdwab for implementation feedback, and @MichaelChirico for ultimate implementation.

#### BUG FIXES

5 changes: 4 additions & 1 deletion R/duplicated.R
@@ -142,7 +142,10 @@ uniqueN <- function(x, by = if (is.list(x)) seq_along(x) else NULL, na.rm=FALSE)
if (is.null(x)) return(0L)
if (!is.atomic(x) && !is.data.frame(x))
stop("x must be an atomic vector or data.frames/data.tables")
-if (is.atomic(x)) x = as_list(x)
+if (is.atomic(x)) {
+  if (is.logical(x)) return(.Call(CuniqueNlogical, x, na.rm=na.rm))
+  x = as_list(x)
+}
if (is.null(by)) by = seq_along(x)
o = forderv(x, by=by, retGrp=TRUE, na.last=if (!na.rm) FALSE else NA)
starts = attr(o, 'starts')
16 changes: 16 additions & 0 deletions inst/tests/tests.Rraw
@@ -6486,6 +6486,22 @@ DT <- data.table(A = rep(1:3, each=4), B = rep(1:4, each=3), C = rep(1:2, 6))
test(1475.1, uniqueN(DT), 10L)
test(1475.2, DT[, .(uN=uniqueN(.SD)), by=A], data.table(A=1:3, uN=c(3L,4L,3L)))

# specialized uniqueN for logical vectors, PR#2648
test(1475.3, uniqueN(c(NA, TRUE, FALSE)), 3L)
test(1475.4, uniqueN(c(NA, TRUE, FALSE), na.rm = TRUE), 2L)
test(1475.5, uniqueN(c(TRUE, FALSE), na.rm = TRUE), 2L)
test(1475.6, uniqueN(c(TRUE, FALSE)), 2L)
test(1475.7, uniqueN(c(TRUE, NA)), 2L)
test(1475.8, uniqueN(c(TRUE, NA), na.rm=TRUE), 1L)
test(1475.9, uniqueN(c(FALSE, NA)), 2L)
test(1475.11, uniqueN(c(FALSE, NA), na.rm=TRUE), 1L)
test(1475.12, uniqueN(c(NA,NA)), 1L)
test(1475.13, uniqueN(c(NA,NA), na.rm=TRUE), 0L)
test(1475.14, uniqueN(NA), 1L)
test(1475.15, uniqueN(NA, na.rm=TRUE), 0L)
test(1475.16, uniqueN(logical()), 0L)
test(1475.17, uniqueN(logical(), na.rm=TRUE), 0L)

# preserve class attribute in GForce mean (and sum)
DT <- data.table(x = rep(1:3, each = 3), y = as.Date(seq(Sys.Date(), (Sys.Date() + 8), by = "day")))
test(1476.1, DT[, .(y=mean(y)), x], setDT(aggregate(y ~ x, DT, mean)))
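
Tests 1475.3 to 1475.17 above pin down what `uniqueN()` should return for every small combination of `TRUE`, `FALSE` and `NA`, with and without `na.rm`. As a rough cross-check of those expected values, here is a minimal C sketch of a naive full-scan reference (no short-circuiting) together with the same cases as a table. It is not part of the commit: `unique_n_reference`, `ref_cases`, `count_reference_mismatches` and `REF_NA` are hypothetical names, and NA is assumed to be encoded as `INT_MIN`, as the comment in `src/uniqlist.c` says of `NA_LOGICAL`.

```c
#include <assert.h>
#include <limits.h>   /* INT_MIN */
#include <stdbool.h>
#include <stddef.h>   /* size_t */

/* Assumption: NA is encoded as INT_MIN, as R's NA_LOGICAL is. */
#define REF_NA INT_MIN

/* Hypothetical reference: scans the whole vector and records which of
   TRUE, FALSE and NA occur. Deliberately has no early exit. */
int unique_n_reference(const int *x, size_t n, bool narm)
{
    assert(x != NULL || n == 0);
    bool seen_true = false, seen_false = false, seen_na = false;
    for (size_t i = 0; i < n; i++) {
        if (x[i] == REF_NA) seen_na = true;
        else if (x[i])      seen_true = true;
        else                seen_false = true;
    }
    /* NA only counts as a distinct value when na.rm is FALSE */
    return seen_true + seen_false + (narm ? 0 : seen_na);
}

struct ref_case { int x[3]; size_t n; bool narm; int expected; };

/* Inputs and expected results copied from tests 1475.3 to 1475.17. */
static const struct ref_case ref_cases[] = {
    {{REF_NA, 1, 0},    3, false, 3},  /* 1475.3  */
    {{REF_NA, 1, 0},    3, true,  2},  /* 1475.4  */
    {{1, 0},            2, true,  2},  /* 1475.5  */
    {{1, 0},            2, false, 2},  /* 1475.6  */
    {{1, REF_NA},       2, false, 2},  /* 1475.7  */
    {{1, REF_NA},       2, true,  1},  /* 1475.8  */
    {{0, REF_NA},       2, false, 2},  /* 1475.9  */
    {{0, REF_NA},       2, true,  1},  /* 1475.11 */
    {{REF_NA, REF_NA},  2, false, 1},  /* 1475.12 */
    {{REF_NA, REF_NA},  2, true,  0},  /* 1475.13 */
    {{REF_NA},          1, false, 1},  /* 1475.14 */
    {{REF_NA},          1, true,  0},  /* 1475.15 */
    {{0},               0, false, 0},  /* 1475.16: logical() */
    {{0},               0, true,  0},  /* 1475.17: logical() */
};

/* Returns how many table rows disagree with the reference. */
int count_reference_mismatches(void)
{
    int bad = 0;
    for (size_t k = 0; k < sizeof ref_cases / sizeof ref_cases[0]; k++) {
        const struct ref_case *c = &ref_cases[k];
        if (unique_n_reference(c->x, c->n, c->narm) != c->expected) bad++;
    }
    return bad;
}
```

Calling `count_reference_mismatches()` from a small driver should give zero if the table and the reference agree. The full scan is what the specialized C routine avoids; the reference only documents the intended semantics.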
1 change: 1 addition & 0 deletions src/data.table.h
@@ -4,6 +4,7 @@
// #include <signal.h> // the debugging machinery + breakpoint aidee
// raise(SIGINT);
#include <stdint.h> // for uint64_t rather than unsigned long long
#include <stdbool.h>
#include "myomp.h"

// data.table depends on R>=3.0.0 when R_xlen_t was introduced
2 changes: 2 additions & 0 deletions src/init.c
@@ -76,6 +76,7 @@ SEXP fsort();
SEXP inrange();
SEXP between();
SEXP hasOpenMP();
SEXP uniqueNlogical();

// .Externals
SEXP fastmean();
@@ -154,6 +155,7 @@ R_CallMethodDef callMethods[] = {
{"Cinrange", (DL_FUNC) &inrange, -1},
{"Cbetween", (DL_FUNC) &between, -1},
{"ChasOpenMP", (DL_FUNC) &hasOpenMP, -1},
{"CuniqueNlogical", (DL_FUNC) &uniqueNlogical, -1},
{NULL, NULL, 0}
};

25 changes: 25 additions & 0 deletions src/uniqlist.c
@@ -228,3 +228,28 @@ SEXP nestedid(SEXP l, SEXP cols, SEXP order, SEXP grps, SEXP resetvals, SEXP mul
UNPROTECT(1);
return(ans);
}

SEXP uniqueNlogical(SEXP x, SEXP narmArg) {
  // single pass; short-circuit and return as soon as all 3 values are found
  if (!isLogical(x)) error("x is not a logical vector");
  if (!isLogical(narmArg) || length(narmArg)!=1 || INTEGER(narmArg)[0]==NA_INTEGER) error("na.rm must be TRUE or FALSE");
  bool narm = LOGICAL(narmArg)[0]==1;
  const R_xlen_t n = xlength(x);
  if (n==0)
    return ScalarInteger(0); // empty vector
  Rboolean first = LOGICAL(x)[0];
  R_xlen_t i=0;
  while (++i<n && LOGICAL(x)[i]==first);
  if (i==n)
    return ScalarInteger(first==NA_INTEGER && narm ? 0 : 1); // all one value
  Rboolean second = LOGICAL(x)[i];
  // we've found 2 different values (first and second). Which one didn't we find? Then just look for that.
  // NA_LOGICAL == INT_MIN checked in init.c
  const int third = (first+second == 1) ? NA_LOGICAL : ( first+second == INT_MIN ? TRUE : FALSE );
  if (third==NA_LOGICAL && narm)
    return ScalarInteger(2); // TRUE and FALSE found before any NA, but na.rm=TRUE so we're done
  while (++i<n) if (LOGICAL(x)[i]==third)
    return ScalarInteger(3-narm);
  return ScalarInteger(2-(narm && third!=NA_LOGICAL));
}
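
To experiment with the short-circuit logic of `uniqueNlogical()` outside an R session, the sketch below restates it on a plain `int` array. It is an illustration under stated assumptions rather than the committed code: `unique_n_logical_sketch` and `SKETCH_NA` are hypothetical names, NA is assumed to be `INT_MIN` (as the comment above says of `NA_LOGICAL`), and the `first+second` sum trick is swapped for explicit comparisons to make the choice of the missing third value easier to follow.

```c
#include <assert.h>
#include <limits.h>   /* INT_MIN */
#include <stdbool.h>
#include <stddef.h>   /* size_t */

/* Assumption: NA is encoded as INT_MIN, mirroring R's NA_LOGICAL. */
#define SKETCH_NA INT_MIN

/* Hypothetical standalone counterpart of uniqueNlogical(): counts the
   distinct values among TRUE (1), FALSE (0) and NA in x[0..n-1],
   returning as soon as the answer is known. */
int unique_n_logical_sketch(const int *x, size_t n, bool narm)
{
    assert(x != NULL || n == 0);
    if (n == 0) return 0;                       /* empty vector */

    /* Phase 1: skip the leading run of the first value. */
    const int first = x[0];
    size_t i = 1;
    while (i < n && x[i] == first) i++;
    if (i == n)                                 /* only one value present */
        return (first == SKETCH_NA && narm) ? 0 : 1;

    /* Phase 2: two distinct values are known; work out the missing one. */
    const int second = x[i];
    int third;
    if (first != SKETCH_NA && second != SKETCH_NA)
        third = SKETCH_NA;                      /* saw TRUE and FALSE */
    else if (first == 0 || second == 0)
        third = 1;                              /* saw NA and FALSE */
    else
        third = 0;                              /* saw NA and TRUE */

    /* TRUE and FALSE already found and NA does not count: done. */
    if (third == SKETCH_NA && narm) return 2;

    /* Phase 3: look only for the missing value. */
    for (i++; i < n; i++)
        if (x[i] == third)
            return narm ? 2 : 3;                /* all three values present */

    /* Third value absent; drop the NA seen in phase 2 if na.rm. */
    return (narm && third != SKETCH_NA) ? 1 : 2;
}
```

Read this way, the timings in the NEWS entry look plausible: when all three values appear at the start of the vector the function can return after a handful of elements, whereas `c(rep(TRUE,N),FALSE,NA)` forces phase 1 to walk the whole run of `TRUE` before a second value turns up, which would account for that case being the only one with a measurable time.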
