Move some setDT validation checks to C #5427

Merged
merged 13 commits into from
Apr 24, 2024

MichaelChirico
Member

@MichaelChirico MichaelChirico commented Aug 1, 2022

Closes #5426

It looks like we are mostly bottlenecked by base R here -- setting the names to V1:Vncol requires building a huge character vector which appears to be choking the global string cache (our best friend :) )

To see evidence of this, here's timings for (1) a plain setDT() as cited in the original issue; (2) a re-run of setDT(), where the names are already in the string cache; and (3) a run where the input is already named (so the branch running paste0(...) is skipped):

# __ON MASTER__
library(data.table)

DT = replicate(1e7, 1, simplify = FALSE)
system.time(setDT(DT))
#    user  system elapsed 
#  25.518   0.316  25.835 

DT = replicate(1e7, 1, simplify = FALSE)
system.time(setDT(DT))
#    user  system elapsed 
#  20.219   0.292  20.511 

DT = replicate(1e7, 1, simplify = FALSE)
# _not_ V1:Vncol, so that these names aren't cached yet (doesn't matter much but still)
names(DT) = seq_along(DT)
system.time(setDT(DT))
#    user  system elapsed 
#  16.325   0.148  16.474 

This PR represents a definite improvement (especially for repeated runs, which TBH probably aren't all that practically relevant), but still hits that base bottleneck:

# __THIS PR__
library(data.table)

DT = replicate(1e7, 1, simplify = FALSE)
system.time(setDT(DT))
#    user  system elapsed 
#  17.822   0.211  18.035 

DT = replicate(1e7, 1, simplify = FALSE)
system.time(setDT(DT))
#    user  system elapsed 
#   5.834   0.023   5.858 

DT = replicate(1e7, 1, simplify = FALSE)
# _not_ V1:Vncol, so that these names aren't cached yet (doesn't matter much but still)
names(DT) = seq_along(DT)
system.time(setDT(DT))
#    user  system elapsed 
#   3.841   0.007   3.850 
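
To separate the base R cost from setDT() itself, the construction of the default names can be timed on its own. This is a minimal sketch, not code from the PR: `time_default_names` is a hypothetical helper, and it assumes the names are built as paste0("V", seq_len(n)), as the description above suggests.

```r
# Hypothetical helper (not part of data.table): time only the construction of
# default column names, assuming they are built as paste0(prefix, seq_len(n)).
time_default_names <- function(n, prefix = "V") {
  # The assignment inside system.time() is evaluated in this function's frame.
  elapsed <- system.time(nm <- paste0(prefix, seq_len(n)))[["elapsed"]]
  list(names = nm, elapsed = elapsed)
}

res <- time_default_names(1e5)
head(res$names, 3)
res$elapsed
```

Passing a different `prefix` on each run avoids timing strings that are already in R's global string cache.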

@jangorecki
Member

jangorecki commented Aug 1, 2022

There is already a PR related to making setDT more lightweight, #4477. AFAIU it won't address the problem described here.

@jangorecki
Member

Some functionality discussed here in comments is already implemented in src/utils.c in https://github.com/Rdatatable/data.table/pull/4370/files

@MichaelChirico MichaelChirico marked this pull request as draft January 7, 2024 15:45

codecov bot commented Mar 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.49%. Comparing base (6f3fc8d) to head (1872f47).
Report is 19 commits behind head on master.

❗ Current head 1872f47 differs from pull request most recent head af48a80. Consider uploading reports for the commit af48a80 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5427      +/-   ##
==========================================
- Coverage   97.51%   97.49%   -0.03%     
==========================================
  Files          80       80              
  Lines       14979    14880      -99     
==========================================
- Hits        14607    14507     -100     
- Misses        372      373       +1     


if (is.data.table(x)) {
# fix for #1078 and #1128, see .resetclass() for explanation.
setattr(x, 'class', .resetclass(x, 'data.table'))
if (!missing(key)) setkeyv(x, key) # fix for #1169
if (check.names) setattr(x, "names", make.names(names(x), unique=TRUE))
if (selfrefok(x) > 0L) return(invisible(x)) else setalloccol(x)
} else if (is.data.frame(x)) {
# check no matrix-like columns, #3760. Allow a single list(matrix) is unambiguous and depended on by some revdeps, #3581
# for performance, only warn on the first such column, #5426
for (jj in seq_along(x)) {
Member Author

It might be preferable to combine the logic for the for loops instead of just emitting the warning, but I didn't see an easy way to do so -- the setdt_nrows parallel also does the other checks in the same loop. Unless we think looping over columns twice is fine to exchange for the clarity of sharing this logic. But this loop is pretty straightforward.

Member

I'd say the loop overhead is likely to be negligible compared to even this simple logic, so I don't think it's worth much instinctively.

Member

This is a good performance improvement.
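
For readers following this thread, the "only warn on the first such column" idea can be sketched in plain R. This is an illustration rather than the PR's code: `first_matrix_col` is a hypothetical helper, and treating any column with more than one dimension as matrix-like is an assumption.

```r
# Hypothetical helper (not the PR's code): return the index of the first
# column that has more than one dimension, or 0L if there is none.
first_matrix_col <- function(x) {
  for (jj in seq_along(x)) {
    if (length(dim(x[[jj]])) > 1L) return(jj)  # stop at the first hit
  }
  0L
}

df <- data.frame(a = 1:2)
df$m <- matrix(1:4, nrow = 2)  # first matrix column
df$n <- matrix(5:8, nrow = 2)  # second matrix column, never inspected

jj <- first_matrix_col(df)
if (jj > 0L) message("column ", jj, " is matrix-like")  # reported once
```

Returning at the first hit keeps the cost low when an early column is matrix-like, and the loop still visits every column when none is.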

@MichaelChirico MichaelChirico marked this pull request as ready for review March 10, 2024 05:11
@MichaelChirico MichaelChirico added this to the 1.16.0 milestone Mar 10, 2024
}
len_xi = INTEGER(dim_xi)[0];
} else {
len_xi = LENGTH(xi);
Member Author

with #5981 in mind I do have some wariness about potentially introducing issues by skipping dispatch.
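
The general concern about skipping dispatch can be illustrated in plain R: C-level LENGTH() reads the stored length of the vector, whereas R's length() may dispatch to an S3 method. The class `wide` and its method below are hypothetical and exist only for this illustration.

```r
# A list-based object whose class defines its own length() method.
x <- structure(list(1, 2, 3), class = "wide")
length.wide <- function(x) 1L  # hypothetical S3 method

length(x)           # dispatches to length.wide
length(unclass(x))  # underlying list length, which is what C-level LENGTH() reads
```

When the two values differ, C code that bypasses dispatch sees a different length than R-level code would.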

Member

@HughParsonage HughParsonage left a comment

Some comments but nothing that would require changes.

@tdhock
Member

tdhock commented Apr 22, 2024

here is a performance analysis from #6094
image

from the left-most panel, we see that the proposed changes would reduce time complexity to what appears to be constant (grey Fast curve, independent of the number of columns N). Is that expected? I thought that it should still be linear in the number of columns, even after moving to C code, because there is still a for loop over columns, right? Maybe we are seeing a constant trend because the C code is so fast that we cannot see the asymptotic trend?

@MichaelChirico
Member Author

(aside: @Anirban166, I think it would be much better to give a clearer title to the graphs, "Regression fixed in #5463" is not very useful as I have to remember which issue is which, there's no linking in the chart title, etc, better to summarize what's being benchmarked, exactly, e.g. "memrecycle performance")

@MichaelChirico
Member Author

MichaelChirico commented Apr 23, 2024

the proposed changes would reduce time complexity to what appears to be constant

Something seems off, I tried locally and don't see that, e.g.

l=replicate(1e5, 1, simplify=FALSE)
system.time(setDT(l))
#    user  system elapsed 
#   0.054   0.000   0.054 

l=replicate(1e6, 1, simplify=FALSE)
system.time(setDT(l))
#    user  system elapsed 
#   0.544   0.012   0.556 

l=replicate(1e7, 1, simplify=FALSE)
system.time(setDT(l))
#   user  system elapsed 
#   6.898   0.084   6.982

This still represents a big improvement over master:

#    user  system elapsed 
#   0.190   0.004   0.194 
#    user  system elapsed 
#   2.613   0.020   2.633 
#    user  system elapsed 
#  32.088   0.316  32.405 
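
The timings above can be checked for linear scaling with a quick log-log fit. This is a rough sketch on three data points per branch, using only the elapsed times reported in this comment; `slope` is a hypothetical helper.

```r
# Elapsed times (seconds) copied from the runs above.
n      <- c(1e5, 1e6, 1e7)
pr     <- c(0.054, 0.556, 6.982)    # this PR
master <- c(0.194, 2.633, 32.405)   # master

# Slope of log10(time) against log10(n); a slope near 1 suggests linear scaling.
slope <- function(t) unname(coef(lm(log10(t) ~ log10(n)))[2])
slope(pr)
slope(master)

# Speedup of this PR over master at each size.
round(master / pr, 1)
```

Both slopes come out close to 1, which is consistent with time growing linearly in the number of columns on both branches, with this PR faster by a roughly constant factor.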

@jangorecki jangorecki removed their request for review April 23, 2024 10:42
@tdhock
Member

tdhock commented Apr 23, 2024

Something seems off, I tried locally and don't see that, e.g.

the first one takes time, and the subsequent runs are a no-op:

> l=replicate(1e5, 1, simplify=FALSE)

> system.time(data.table.1872f473b20fdcddc5c1b35d79fe9229cd9a1d15::setDT(l))
   user  system elapsed 
   0.05    0.02    0.06 
> system.time(data.table.1872f473b20fdcddc5c1b35d79fe9229cd9a1d15::setDT(l))
   user  system elapsed 
      0       0       0 
> system.time(data.table.1872f473b20fdcddc5c1b35d79fe9229cd9a1d15::setDT(l))
   user  system elapsed 
      0       0       0 

@MichaelChirico
Member Author

oh, right, so the benchmarked operation should probably be setDT(copy(l)), or otherwise {setDT(l); setattr(l, "class", NULL)} to reset the object without needing to deep-copy it (which has overhead)
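
A minimal sketch of the reset idea, assuming data.table is installed; the list size is arbitrary and chosen only to keep the example fast.

```r
library(data.table)

l <- replicate(1e4, 1, simplify = FALSE)

setDT(l)                   # first call does the real work
is.data.table(l)           # l is now a data.table, so a repeated setDT(l) has little left to do

setattr(l, "class", NULL)  # drop the class by reference, without a deep copy
is.data.table(l)           # no longer a data.table, so setDT(l) must run its checks again
setDT(l)
```

Resetting the class by reference keeps the copy out of the timed expression, which matters when the copy itself is expensive for very wide inputs.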

@tdhock
Member

tdhock commented Apr 23, 2024

there is some inconsistency between system.time and bench::mark: bench::mark does a memory measurement first, which makes L a data.table, so the subsequent timed runs are no-ops. As a result, bench says it is really fast (microseconds), and that is what we are using in atime, whereas system.time is slow but correct, I think.

> l=replicate(1e6, 1, simplify=FALSE)
> bench::mark(
+ Fast=data.table.1872f473b20fdcddc5c1b35d79fe9229cd9a1d15::setDT(l),
+ iterations=1)
# A tibble: 1 × 13
  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 Fast         79.9µs 79.9µs    12516.    38.2MB        0     1     0     79.9µs
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
> bench::mark(
+ Fast=data.table.1872f473b20fdcddc5c1b35d79fe9229cd9a1d15::setDT(l),
+ iterations=1)
# A tibble: 1 × 13
  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 Fast         83.6µs 83.6µs    11962.        0B        0     1     0     83.6µs
# ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>
> l=replicate(1e6, 1, simplify=FALSE)
> system.time(data.table.1872f473b20fdcddc5c1b35d79fe9229cd9a1d15::setDT(l))
   user  system elapsed 
   0.66    0.00    0.66 
> system.time(data.table.1872f473b20fdcddc5c1b35d79fe9229cd9a1d15::setDT(l))
   user  system elapsed 
      0       0       0 

@tdhock
Member

tdhock commented Apr 23, 2024

here is a solution

(expr.list <- atime::atime_versions_exprs(
  pkg.path = "~/R/data.table",
  pkg.edit.fun = pkg.edit.fun,
  expr = {
    ##data.table:::setDT(data.table::copy(L))
    data.table:::setattr(L,"class",NULL)
    data.table:::setDT(L)
  },
  sha.vec=sha.vec))
library(data.table)
atime.result <- atime::atime(
  expr.list=expr.list,
  N=10^seq(1,6,by=0.5),
  setup = {
    L <- replicate(N, 1, simplify = FALSE)
    setDT(L)
  })  
plot(atime.result)

image

Member

@tdhock tdhock left a comment

lgtm
performance increase is great

@MichaelChirico
Member Author

MichaelChirico commented Apr 23, 2024

thanks! let's merge #6094 first to get some "official" atime reporting here as a "production" proof

@Anirban166
Member

thanks! let's merge #6094 first to get some "official" atime reporting here as a "production" proof

Yup, we are working towards that!

(aside: @Anirban166, I think it would be much better to give a clearer title to the graphs, "Regression fixed in #5463" is not very useful as I have to remember which issue is which, there's no linking in the chart title, etc, better to summarize what's being benchmarked, exactly, e.g. "memrecycle performance")

Agreed here, and how about "memrecycle regression fixed in #5463"? (Toby and I discussed this a while ago and we think it's useful to keep the number for quick reference, and that using 'performance' every time would be redundant)

tdhock added a commit that referenced this pull request Apr 24, 2024
Moved tests to the more appropriate directory and added a test case based on the performance improvement to be brought by #5427

github-actions bot commented Apr 24, 2024

Comparison Plot

Generated via commit 1e758e6

Download link for the artifact containing the test results: ↓ atime-results.zip

Time taken to finish the standard R installation steps: 12 minutes and 41 seconds

Time taken to run atime::atime_pkg on the tests: 3 minutes and 33 seconds

@tdhock
Member

tdhock commented Apr 24, 2024

hi @MichaelChirico the performance test looks good for this PR, and I added a NEWS item, so please feel free to merge.

@MichaelChirico
Member Author

love it, thanks @Anirban166 and @tdhock for the work on CB!!

Development

Successfully merging this pull request may close these issues.

setDT extremely slow for very wide input
6 participants