Move some setDT validation checks to C #5427

MichaelChirico · 2022-08-01T01:44:07Z

It looks like we are mostly bottlenecked by base R here -- setting the names to V1:Vncol requires building a huge character vector which appears to be choking the global string cache (our best friend :) )

To see evidence of this, here are timings for (1) a plain setDT() as cited in the original issue; (2) a re-run of setDT(), where the names are already in the string cache; and (3) a run where the input is already named (so the branch running paste0(...) is skipped):

# __ON MASTER__
library(data.table)

DT = replicate(1e7, 1, simplify = FALSE)
system.time(setDT(DT))
#    user  system elapsed 
#  25.518   0.316  25.835 

DT = replicate(1e7, 1, simplify = FALSE)
system.time(setDT(DT))
#    user  system elapsed 
#  20.219   0.292  20.511 

DT = replicate(1e7, 1, simplify = FALSE)
# _not_ V1:Vncol, so that these names aren't cached yet (doesn't matter much but still)
names(DT) = seq_along(DT)
system.time(setDT(DT))
#    user  system elapsed 
#  16.325   0.148  16.474

This PR represents a definite improvement (especially for repeated runs, which TBH probably aren't all that practically relevant), but still hits that base bottleneck:

# __THIS PR__
library(data.table)

DT = replicate(1e7, 1, simplify = FALSE)
system.time(setDT(DT))
#    user  system elapsed 
#  17.822   0.211  18.035 

DT = replicate(1e7, 1, simplify = FALSE)
system.time(setDT(DT))
#    user  system elapsed 
#   5.834   0.023   5.858 

DT = replicate(1e7, 1, simplify = FALSE)
# _not_ V1:Vncol, so that these names aren't cached yet (doesn't matter much but still)
names(DT) = seq_along(DT)
system.time(setDT(DT))
#    user  system elapsed 
#   3.841   0.007   3.850
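
To isolate the base R step described above, a rough sketch (not taken from the PR) is to time only the construction of the default names. The size `n` and the variable names here are illustrative assumptions, and `n` is kept small so the snippet runs quickly; absolute timings will vary by machine.

```r
# Illustrative only: build default column names the way the discussion describes,
# i.e. a character vector V1, V2, ..., Vn created with paste0().
n <- 1e5L                      # assumed size, far smaller than the 1e7 used above
timing <- system.time(
  nm <- paste0("V", seq_len(n))
)

# Every name is a distinct string, so each one has to be created and entered
# into R's global string cache the first time it is seen.
stopifnot(length(nm) == n, !anyDuplicated(nm))
print(timing[["elapsed"]])
print(head(nm, 3L))
```

Scaling `n` up toward 1e7 should show how much of the setDT() time is spent before any data.table code runs.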

jangorecki · 2022-08-01T07:36:37Z

There is already a PR related to making setDT more lightweight, #4477. AFAIU it won't address the problem described here.

jangorecki · 2022-08-01T11:58:45Z

Some functionality discussed here in comments is already implemented in src/utils.c in https://github.com/Rdatatable/data.table/pull/4370/files

codecov · 2024-03-09T16:26:01Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.49%. Comparing base (6f3fc8d) to head (1872f47).
Report is 19 commits behind head on master.

❗ Current head 1872f47 differs from pull request most recent head af48a80. Consider uploading reports for the commit af48a80 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5427      +/-   ##
==========================================
- Coverage   97.51%   97.49%   -0.03%     
==========================================
  Files          80       80              
  Lines       14979    14880      -99     
==========================================
- Hits        14607    14507     -100     
- Misses        372      373       +1

MichaelChirico · 2024-03-10T03:56:21Z

R/data.table.R

  if (is.data.table(x)) {
    # fix for #1078 and #1128, see .resetclass() for explanation.
    setattr(x, 'class', .resetclass(x, 'data.table'))
    if (!missing(key)) setkeyv(x, key) # fix for #1169
    if (check.names) setattr(x, "names", make.names(names(x), unique=TRUE))
    if (selfrefok(x) > 0L) return(invisible(x)) else setalloccol(x)
  } else if (is.data.frame(x)) {
+    # check no matrix-like columns, #3760. Allow a single list(matrix) is unambiguous and depended on by some revdeps, #3581
+    # for performance, only warn on the first such column, #5426
+    for (jj in seq_along(x)) {


It might be preferable to combine the logic for the for loops instead of just emitting the warning, but I didn't see an easy way to do so -- the setdt_nrows parallel also does the other checks in the same loop. Unless we think looping over columns twice is fine to exchange for the clarity of sharing this logic. But this loop is pretty straightforward.

MichaelChirico · 2024-03-10T05:15:44Z

src/assign.c

+      }
+      len_xi = INTEGER(dim_xi)[0];
+    } else {
+      len_xi = LENGTH(xi);


with #5981 in mind I do have some wariness about potentially introducing issues by skipping dispatch.

HughParsonage

Some comments but nothing that would require changes.

HughParsonage · 2024-03-13T07:07:08Z

I'd say the loop overhead is likely to be negligible compared to even this simple logic, so I don't think it's worth much instinctively.

HughParsonage · 2024-03-13T07:12:05Z

This is a good performance improvement.

tdhock · 2024-04-22T18:35:37Z

here is a performance analysis from #6094

From the left-most panel, we see that the proposed changes would reduce time complexity to what appears to be constant (grey Fast curve, independent of the number of columns N). Is that expected? I thought it should still be linear in the number of columns, even after moving to C code, because there is still a for loop over columns, right? Maybe we are seeing a constant trend because the C code is so fast that we cannot see the asymptotic trend?

MichaelChirico · 2024-04-23T05:41:26Z

(aside: @Anirban166, I think it would be much better to give a clearer title to the graphs, "Regression fixed in #5463" is not very useful as I have to remember which issue is which, there's no linking in the chart title, etc, better to summarize what's being benchmarked, exactly, e.g. "memrecycle performance")

MichaelChirico · 2024-04-23T05:48:39Z

> the proposed changes would reduce time complexity to what appears to be constant

Something seems off, I tried locally and don't see that, e.g.

l=replicate(1e5, 1, simplify=FALSE)
system.time(setDT(l))
#    user  system elapsed 
#   0.054   0.000   0.054 

l=replicate(1e6, 1, simplify=FALSE)
system.time(setDT(l))
#    user  system elapsed 
#   0.544   0.012   0.556 

l=replicate(1e7, 1, simplify=FALSE)
system.time(setDT(l))
#    user  system elapsed 
#   6.898   0.084   6.982

This still represents a big improvement over master:

#    user  system elapsed 
#   0.190   0.004   0.194 
#    user  system elapsed 
#   2.613   0.020   2.633 
#    user  system elapsed 
#  32.088   0.316  32.405
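
As a quick sanity check on the scaling, the elapsed times quoted above can be fit on a log-log scale; a slope near 1 would indicate roughly linear growth in the number of columns rather than constant time. This is only a sketch over the three reported data points per branch, and the variable and function names are my own.

```r
# Elapsed times (seconds) copied from the timings quoted above.
n_cols <- c(1e5, 1e6, 1e7)
pr     <- c(0.054, 0.556, 6.982)    # this PR
master <- c(0.194, 2.633, 32.405)   # master

# Slope of log10(time) against log10(N): ~1 means linear, ~0 means constant.
loglog_slope <- function(time, n) unname(coef(lm(log10(time) ~ log10(n)))[2L])
slope_pr     <- loglog_slope(pr, n_cols)
slope_master <- loglog_slope(master, n_cols)

# Ratio of master to PR timings at each size.
speedup <- master / pr

print(round(c(pr = slope_pr, master = slope_master), 2))
print(round(speedup, 1))
```

Both slopes come out close to 1, which is consistent with linear scaling on both branches, and the ratio of master to PR timings stays between roughly 3x and 5x across these sizes.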

tdhock · 2024-04-23T18:33:27Z

> Something seems off, I tried locally and don't see that, e.g.

the first one takes time, and the subsequent runs are a no-op:

> l=replicate(1e5, 1, simplify=FALSE)

> system.time(data.table.1872f473b20fdcddc5c1b35d79fe9229cd9a1d15::setDT(l))
   user  system elapsed 
   0.05    0.02    0.06 
> system.time(data.table.1872f473b20fdcddc5c1b35d79fe9229cd9a1d15::setDT(l))
   user  system elapsed 
      0       0       0 
> system.time(data.table.1872f473b20fdcddc5c1b35d79fe9229cd9a1d15::setDT(l))
   user  system elapsed 
      0       0       0

MichaelChirico · 2024-04-23T18:51:43Z

Oh, right, so the benchmarked operation should probably be setDT(copy(l)), or otherwise {setDT(l); setattr(l, "class", NULL)} to reset without needing to deep-copy the object (which has overhead).
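
A minimal sketch of the two reset strategies, assuming a small illustrative size `n` (the names `n` and `dt1` are mine, not from the PR). The assertions only check that each approach leaves a plain list for setDT() to convert; they do not measure anything.

```r
library(data.table)

n <- 1e4L                                  # assumed small size so this runs quickly
l <- replicate(n, 1, simplify = FALSE)     # plain list, as in the timings above

# Option 1: convert a copy, leaving `l` a plain list for the next run.
# The deep copy itself then becomes part of whatever gets timed.
dt1 <- setDT(copy(l))
stopifnot(is.data.table(dt1), !is.data.table(l))

# Option 2: convert in place, then drop the class so the next setDT() call
# does not take the early-return branch for inputs that are already data.tables.
setDT(l)
stopifnot(is.data.table(l))
setattr(l, "class", NULL)
stopifnot(!is.data.table(l))
setDT(l)                                   # performs the conversion again
stopifnot(is.data.table(l), ncol(l) == n)
```

In a benchmark, the expression being timed would be the setDT() call that follows the reset, so each iteration starts from a non-data.table input.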

tdhock · 2024-04-23T18:52:11Z

There is some inconsistency between system.time and bench::mark. bench::mark does a memory measurement first, which makes l a data.table, so the subsequent timings are no-ops. As a result, bench reports that the call is really fast (microseconds), and that is what we are using in atime, whereas system.time is slow and, I think, correct.

> l=replicate(1e6, 1, simplify=FALSE)
> bench::mark(
+ Fast=data.table.1872f473b20fdcddc5c1b35d79fe9229cd9a1d15::setDT(l),
+ iterations=1)
# A tibble: 1 × 13
  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 Fast         79.9µs 79.9µs    12516.    38.2MB        0     1     0     79.9µs
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
> bench::mark(
+ Fast=data.table.1872f473b20fdcddc5c1b35d79fe9229cd9a1d15::setDT(l),
+ iterations=1)
# A tibble: 1 × 13
  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 Fast         83.6µs 83.6µs    11962.        0B        0     1     0     83.6µs
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
> l=replicate(1e6, 1, simplify=FALSE)
> system.time(data.table.1872f473b20fdcddc5c1b35d79fe9229cd9a1d15::setDT(l))
   user  system elapsed 
   0.66    0.00    0.66 
> system.time(data.table.1872f473b20fdcddc5c1b35d79fe9229cd9a1d15::setDT(l))
   user  system elapsed 
      0       0       0

tdhock · 2024-04-23T20:15:58Z

here is a solution

(expr.list <- atime::atime_versions_exprs(
  pkg.path = "~/R/data.table",
  pkg.edit.fun = pkg.edit.fun,
  expr = {
    ##data.table:::setDT(data.table::copy(L))
    data.table:::setattr(L,"class",NULL)
    data.table:::setDT(L)
  },
  sha.vec=sha.vec))
library(data.table)
atime.result <- atime::atime(
  expr.list=expr.list,
  N=10^seq(1,6,by=0.5),
  setup = {
    L <- replicate(N, 1, simplify = FALSE)
    setDT(L)
  })  
plot(atime.result)

tdhock

lgtm
performance increase is great

MichaelChirico · 2024-04-23T20:45:17Z

thanks! let's merge #6094 first to get some "official" atime reporting here as a "production" proof

Anirban166 · 2024-04-23T20:56:39Z

> thanks! let's merge #6094 first to get some "official" atime reporting here as a "production" proof

Yup, we are working towards that!

> (aside: @Anirban166, I think it would be much better to give a clearer title to the graphs, "Regression fixed in #5463" is not very useful as I have to remember which issue is which, there's no linking in the chart title, etc, better to summarize what's being benchmarked, exactly, e.g. "memrecycle performance")

Agreed here, and how about "memrecycle regression fixed in #5463"? (Toby and I discussed this a while ago and we think it's useful to keep the number for quick reference, and that using 'performance' every time would be redundant)

Moved tests to the more appropriate directory and added a test case based on the performance improvement to be brought by #5427

github-actions · 2024-04-24T17:28:20Z

Generated via commit 1e758e6

Download link for the artifact containing the test results: ↓ atime-results.zip

Time taken to finish the standard R installation steps: 12 minutes and 41 seconds

Time taken to run atime::atime_pkg on the tests: 3 minutes and 33 seconds

tdhock · 2024-04-24T17:33:24Z

hi @MichaelChirico the performance test looks good for this PR, and I added a NEWS item, so please feel free to merge.

MichaelChirico · 2024-04-24T17:50:04Z

love it, thanks @Anirban166 and @tdhock for the work on CB!!

MichaelChirico added 2 commits July 31, 2022 18:10

move setDT validation checks to C

7cc4da4

fix some tests

bd2fc1e

ColeMiller1 reviewed Aug 1, 2022

R/data.table.R (resolved)

MichaelChirico marked this pull request as draft January 7, 2024 15:45

MichaelChirico added 3 commits March 9, 2024 11:11

Merge branch 'master' into setdt-wide

835f118

restore prototype

2831c59

Rename to reflect return value

b53d967

MichaelChirico added 2 commits March 9, 2024 22:37

stab at sharing message at R+C levels

5a2d8e9

Compiles & basically runs

c929792

MichaelChirico commented Mar 10, 2024

MichaelChirico added 3 commits March 10, 2024 04:09

Run POSIXlt test first; save LENGTH(dim) to reuse

0696944

new test

d0e0216

warning, not error

fb45cd8

MichaelChirico marked this pull request as ready for review March 10, 2024 05:11

MichaelChirico requested review from HughParsonage and jangorecki as code owners March 10, 2024 05:11

MichaelChirico added this to the 1.16.0 milestone Mar 10, 2024

MichaelChirico commented Mar 10, 2024

HughParsonage approved these changes Mar 13, 2024

MichaelChirico commented Mar 13, 2024

src/assign.c (outdated, resolved)

correct comment

1872f47

This was referenced Apr 18, 2024

Moved tests to the more appropriate directory and added a test case based on the performance improvement to be brought by #5427 #6094

Merged

Testing workflow changes and commit SHAs Anirban166/data.table#9

Closed

jangorecki removed their request for review April 23, 2024 10:42

tdhock approved these changes Apr 23, 2024

tdhock added a commit that referenced this pull request Apr 24, 2024

Merge pull request #6094 from Rdatatable/add-a-test-and-move-to-ci

29a0f13

Moved tests to the more appropriate directory and added a test case based on the performance improvement to be brought by #5427

tdhock and others added 2 commits April 24, 2024 10:11

Merge branch 'master' into setdt-wide

1e758e6

setDT faster

af48a80

MichaelChirico merged commit 2487c61 into master Apr 24, 2024
5 checks passed

MichaelChirico deleted the setdt-wide branch April 24, 2024 17:50

Anirban166 mentioned this pull request Apr 29, 2024

Refactor colClasses handling for readability #6106

Merged

MichaelChirico mentioned this pull request Apr 30, 2024

Investigate atime CI failure for setDT update #6110

Closed

tdhock restored the setdt-wide branch April 30, 2024 15:53

tdhock mentioned this pull request May 1, 2024

Performance Test on data.table Issues.qmd file rdatatable-community/The-Raft#16

Merged

tdhock mentioned this pull request May 31, 2024

adding an atime test case Performance Regression with .N and := #PR5463 #6160

Closed

ben-schwen mentioned this pull request Sep 9, 2024

setDT should check if any column is POSIXlt #4800

Open
