Faster i #4585

ColeMiller1 · 2020-07-01T00:34:56Z

Towards #3735 (maybe closes...?)
Resolves this TODO comment in the code:

TODO: Incorporate which_ here on DT[!i] where i is logical. Should avoid i = !i (above) - inefficient.

dt[i, ] is around twice as fast as before.

library(data.table)
## Base
allIterations <- data.frame(v1 = runif(1e5), v2 = runif(1e5))
DoSomething <- function(row) someCalculation <- row[["v1"]] + 1
system.time({for (r in 1:nrow(allIterations)) {DoSomething(allIterations[r, ])}})

##   user  system elapsed 
##   5.67    0.02    5.91 

setDT(allIterations)
system.time({for (r in 1:nrow(allIterations)) {DoSomething(allIterations[r, ])}})

## Before Patch
##   user  system elapsed 
##  17.53    0.58   18.46

## After Patch
##   user  system elapsed 
##   9.53    0.00    9.67

For dt[!lgl], memory use drops substantially (1.12GB to 763MB below) and speed improves modestly (median 1150ms to 925ms):

library(data.table)
set.seed(123L)
n = 1e8L
dt = data.table(rep.int(1L, n))
inds = sample(c(FALSE, TRUE), n, TRUE)
bench::mark(dt[!inds])

## Before Patch
##  expression   min median `itr/sec` mem_alloc
##  <bch:expr> <bch> <bch:>     <dbl> <bch:byt>
##1 dt[!inds]  1150ms  1150ms     0.873    1.12GB

## After Patch
##  expression   min median `itr/sec` mem_alloc
##  <bch:expr> <bch> <bch:>     <dbl> <bch:byt>
##1 dt[!inds]  925ms  925ms      1.08     763MB
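
To picture where the memory saving could come from, here is a minimal sketch, not the PR's actual code. It assumes the idea in the TODO: compute the matching positions directly with a negate flag instead of first allocating a full negated copy of i. The name which_negate and its signature are hypothetical, and logical values are modelled as plain ints.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: writes the 1-based positions of the wanted value
   into out and returns how many were written.
   negate == false selects elements equal to 1 (TRUE);
   negate == true  selects elements equal to 0 (FALSE).
   Any other value (for example an NA sentinel) is skipped in both modes.
   out must have room for n entries. */
size_t which_negate(const int *x, size_t n, bool negate, int *out)
{
  assert(x != NULL && out != NULL);
  const int want = negate ? 0 : 1;
  size_t k = 0;
  for (size_t i = 0; i < n; ++i) {
    if (x[i] == want) out[k++] = (int)(i + 1);
  }
  return k;
}
```

With this shape, the negated vector never exists: the only allocation is the output of positions.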

CconvertNegAndZeroIdx is also faster: when only 1 thread is in use, it now breaks out of the loop early. Skipping OpenMP entirely when threads are set to 1 also improves performance, at least on Windows.

Note: there could be follow-up PRs on the default number of threads (for me, with only 2 threads, the break-even point is somewhere between 1e5 and 1e6 rows). Also, with this early break, c(0, seq_len(1025L)) is somehow faster than seq_len(1025L) within the function. It seems surprising that removing a zero is faster than returning the inds as is.

library(data.table)

## small scenario just over the 1024 row threshold of 2 threads:
inds = seq_len(1025L)

setDTthreads(1L)
system.time(for (i in 1:100000) .Call(data.table:::CconvertNegAndZeroIdx, inds, 2000L, TRUE))
##   user  system elapsed 
##   1.05    0.00    1.22 

setDTthreads(2L)
system.time(for (i in 1:100000) .Call(data.table:::CconvertNegAndZeroIdx, inds, 2000L, TRUE))
##   user  system elapsed 
##   2.90    1.52    4.61 

## early break scenario, which is the best case:
inds = c(0L, inds)

setDTthreads(1L)
system.time(for (i in 1:100000) .Call(data.table:::CconvertNegAndZeroIdx, inds, 2000L, TRUE))
##   user  system elapsed 
##   0.62    0.00    0.63 

setDTthreads(2L)
system.time(for (i in 1:100000) .Call(data.table:::CconvertNegAndZeroIdx, inds, 2000L, TRUE))
##   user  system elapsed 
##   3.75    1.54    5.73 

## a normal scenario - 1 million rows
inds = seq_len(1e6L)

setDTthreads(1L)
system.time(for (i in 1:5000) .Call(data.table:::CconvertNegAndZeroIdx, inds, 2000L, TRUE))
##   user  system elapsed 
##   4.98    0.00    5.09 

setDTthreads(2L)
system.time(for (i in 1:5000) .Call(data.table:::CconvertNegAndZeroIdx, inds, 2000L, TRUE))
##   user  system elapsed 
##   7.27    0.12    4.19

codecov · 2020-07-01T00:44:19Z

Codecov Report

Merging #4585 into master will decrease coverage by 0.02%.
The diff coverage is 95.60%.

@@            Coverage Diff             @@
##           master    #4585      +/-   ##
==========================================
- Coverage   99.61%   99.58%   -0.03%     
==========================================
  Files          73       73              
  Lines       14119    14120       +1     
==========================================
- Hits        14064    14061       -3     
- Misses         55       59       +4

Impacted Files	Coverage Δ
src/subset.c	`98.00% <60.00%> (-2.00%)`	⬇️
R/data.table.R	`100.00% <100.00%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ba32f3c...f5d5f13.

MichaelChirico · 2020-07-01T03:14:23Z

R/data.table.R

+      tt_isub = substitute(i)
+      tt_jsub = substitute(j)
+      if (!is.null(names(sys.call())) &&  # not relying on nargs() as it considers DT[,] to have 3 arguments, #3163
+          tryCatch(!is.symbol(tt_isub), error=function(e)TRUE) &&   # a symbol that inherits missingness from caller isn't missing for our purpose; test 1974


what errors are being caught here?

Here's the test. In this case, cols is missing I believe.

# no error when j is supplied but inherits missingness from caller
DT = data.table(a=1:3, b=4:6)
f = function(cols) DT[,cols]
test(1974.1, f(), output="a.*b.*3:.*6")

Edit: I did try removing this branch, but that produced errors. It's a real head-scratcher, so I just kept it. It has only been moved.

MichaelChirico · 2020-07-01T03:20:38Z

R/data.table.R

-      # #932 related so that !(v1 == 1) becomes v1 == 1 instead of (v1 == 1) after removing "!"
-      if (isub %iscall% "(" && !is.name(isub[[2L]]))
+      if (isub %iscall% "eval") {  # TO DO: or ..()
+        isub = eval(.massagei(isub[[2L]]), list(.N = nrow(x)), parent.frame())


can we add .SD=x to envir arg here & get .SD to work in i just like that?

I just compiled with adding .SD and success!

Note: previously, .N was assigned in parent.frame() and then restored if necessary. Because of that, all 4 eval calls related to processing i were largely the same.

While skipping that approach is faster, we now have to deal with associating each of the 4 eval calls with .N or whatever special variable(s) we want to use so there's a little more accounting. In theory we could have also used the previous approach to also assign .SD to the parent.frame.

ha, came across this comment again 🙃

ColeMiller1 · 2020-07-02T01:19:33Z

src/subset.c

+      if ((elem < 1 && elem != NA_INTEGER) || elem > max) stop = true;
+    }
+  } else {
+    #pragma omp parallel for num_threads(nth)


The OpenMP loop is what is missing in coverage, and I am unsure why. I foolishly included Rprintf("OpenMP_Loop") within the loop, and during at least one of the tests my console filled with "OpenMP_Loop" statements, so the tests do reach it locally. That would suggest that the code coverage bot only has 1 thread, but then I would have expected similar issues in #4558, as I incorporated the approach from that PR.

ColeMiller1 added 3 commits June 30, 2020 17:02

Update data.table.R

22678aa

Update subset.c

427e8dd

Update is.null(i)

cf4a964

ColeMiller1 commented Jul 1, 2020

R/data.table.R

use gettextf in pattern warning

044369f

MichaelChirico reviewed Jul 1, 2020

R/data.table.R (outdated)

MichaelChirico reviewed Jul 1, 2020

R/data.table.R

ColeMiller1 added 3 commits July 1, 2020 05:55

Changes based on Michael's comments plus coverage

a3ad9d8

Corrected error message

48cfac8

error message space

f5d5f13

ColeMiller1 commented Jul 2, 2020

ColeMiller1 mentioned this pull request Jul 29, 2020

Subsetting with Date produces error #4651

Open

ColeMiller1 mentioned this pull request Nov 15, 2020

data.table spark/databases interface #1828

Closed

ColeMiller1 mentioned this pull request Feb 16, 2021

Make .I available in i #4695

Open

MichaelChirico Jul 1, 2020

ColeMiller1 Jul 1, 2020 (edited)

MichaelChirico Jul 1, 2020

ColeMiller1 Jul 2, 2020

MichaelChirico Apr 20, 2024

ColeMiller1 Jul 2, 2020
