Marking columns as externally-referenced - WIP #4902

OfekShilon · 2021-02-13T18:58:55Z

This PR is not mergeable yet, it is intended to start a conversation. I hope to get your feedback on the basic approach: marking columns with an attribute saying they are also referenced outside the DT, and then use this attribute to bypass memrecycle.
It solves the immediate issue, but:
(1) Many tests break, as the output of setDT is indeed changed (additional attribute).
(2) I'm not entirely comfortable with this invasive approach. Perhaps it would be better to contain the change inside the DT itself (adding a list attribute to it), and not add attributes to raw data. I'm not sure which modifications would be required to make this work - cbind? Elsewhere? Any thoughts are welcome.

jangorecki · 2021-02-14T08:00:21Z

I like the idea, although I think we could see if we could utilize R C api ref count. This PR seems to reinvent what is there under NAMED/MAYBE_SHARED flag.

OfekShilon · 2021-02-15T13:22:53Z

@jangorecki That sounds like a good idea, but I don't understand R's ref count and don't trust them (this discussion in r-devel by Matt Dowle shows it's more complicated than it seems).

This is what I get when I try to inspect it:

> a <- 1:100
> .Internal(inspect(a))
@5595afc47e60 13 INTSXP g1c0 [MARK,NAM(7)] (len=100, tl=0) 1,2,3,4,5,...
> b <- a
> .Internal(inspect(a))
@5595afc47e60 13 INTSXP g1c0 [MARK,NAM(7)] (len=100, tl=0) 1,2,3,4,5,...

So it seems the refcount (7?) doesn't count the number of names bound to the data. This is for R3.6 - do you get different results?

MichaelChirico · 2021-02-15T15:53:18Z

there were some changes in R ≥ 4.0 that may have an affect here (sorry I don't know the full details)

…

On Mon, Feb 15, 2021, 5:23 AM Ofek ***@***.***> wrote: @jangorecki <https://github.com/jangorecki> That sounds like a good idea, but I don't understand R's ref count and don't trust them (this discussion in r-devel <https://stat.ethz.ch/pipermail/r-devel/2011-November/062653.html> by Matt Dowle shows it's more complicated than it seems). This is what I get when I try to inspect it: > a <- 1:100 > .Internal(inspect(a)) @5595afc47e60 13 INTSXP g1c0 [MARK,NAM(7)] (len=100, tl=0) 1,2,3,4,5,... > b <- a > .Internal(inspect(a)) @5595afc47e60 13 INTSXP g1c0 [MARK,NAM(7)] (len=100, tl=0) 1,2,3,4,5,... So it seems the refcount (7?) doesn't count the number of names bound to the data. This is for R3.6 - do you get different results? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#4902 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AB2BA5NZT2TQ7ERJCLD3AQ3S7EN4HANCNFSM4XSNCHCQ> .

OfekShilon · 2021-02-15T16:14:05Z

This is what I get on R4.0.3:

> a <- 1:100
> .Internal(inspect(a))
@564a1336a8b0 13 INTSXP g0c0 [REF(65535)]  1 : 100 (compact)
> b <- a
> .Internal(inspect(a))
@564a1336a8b0 13 INTSXP g0c0 [REF(65535)]  1 : 100 (compact)
> a<- c(1,3,5)
> .Internal(inspect(a))
@564a1334bfb8 14 REALSXP g0c3 [REF(1)] (len=3, tl=0) 1,3,5
> b <- a
> .Internal(inspect(a))
@564a1334bfb8 14 REALSXP g0c3 [REF(2)] (len=3, tl=0) 1,3,5

I assume the difference is because 1:100 is now stored as ALTREP.

This seems encouraging, but is an R4.0-only solution considered acceptable?

MichaelChirico · 2021-02-15T16:56:51Z

as long as the package works with R 3.1.0 it should be fine. it's ok for the package to work better w newer versions of R.

…

On Mon, Feb 15, 2021, 8:14 AM Ofek ***@***.***> wrote: This is what I get on R4.0.3: > a <- 1:100 > .Internal(inspect(a)) @564a1336a8b0 13 INTSXP g0c0 [REF(65535)] 1 : 100 (compact) > b <- a > .Internal(inspect(a)) @564a1336a8b0 13 INTSXP g0c0 [REF(65535)] 1 : 100 (compact) > a<- c(1,3,5) > .Internal(inspect(a)) @564a1334bfb8 14 REALSXP g0c3 [REF(1)] (len=3, tl=0) 1,3,5 > b <- a > .Internal(inspect(a)) @564a1334bfb8 14 REALSXP g0c3 [REF(2)] (len=3, tl=0) 1,3,5 I assume the difference is because 1:100 is now stored as ALTREP. This seems encouraging, but is an R4.0 only solution acceptable? — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#4902 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AB2BA5ITJORUHKJ25BJ6HBLS7FB55ANCNFSM4XSNCHCQ> .

OfekShilon · 2021-02-15T18:15:46Z

It seems ref counts are still very hard to reason about in R4:

> a <- c(1,3,5)
> .Internal(inspect(a))
@5611e3b71b78 14 REALSXP g0c3 [REF(1)] (len=3, tl=0) 1,3,5
> dt1 <- data.table(a=a, b=c(2,4,6))
> .Internal(inspect(dt1[["a"]]))    # 2 seems reasonable. But why was it copied??
@5611e3b71df8 14 REALSXP g0c3 [REF(2)] (len=3, tl=0) 1,3,5
> .Internal(inspect(dt1[["b"]]))    # Why 2? Not referenced externally
@5611e3b71e48 14 REALSXP g0c3 [REF(2)] (len=3, tl=0) 2,4,6
> a <- 1
> .Internal(inspect(dt1[["a"]]))     # not decremented :(
@5611e3b71df8 14 REALSXP g0c3 [REF(2)] (len=3, tl=0) 1,3,5
> 
> 
> df <- data.frame(a=11)
> .Internal(inspect(df[["a"]]))     # surprisingly high..
@5611e4099438 14 REALSXP g0c1 [REF(11)] (len=1, tl=0) 11
> dt <- setDT(df)
> .Internal(inspect(dt[["a"]]))     # at least it wasn't copied
@5611e4099438 14 REALSXP g0c1 [REF(12)] (len=1, tl=0) 11
> 
> rm(df)
> .Internal(inspect(dt[["a"]]))     # not decremented  :(
@5611e4099438 14 REALSXP g0c1 [REF(12)] (len=1, tl=0) 11

Am I missing something? It seems 'manual' solutions are still in order.

ColeMiller1 · 2021-02-16T12:35:27Z

Won’t this lead to all columns being copied as they are being assigned? All the columns are being marked externally referrenced which would lead to memrecycle.

Maybe it would be better to have use cases of what setDT leads to copies or doesn’t. Related, there probably should be a way for advanced users to override this behavior like setDT(..., copy = FALSE).

As for expected behavior, these are some examples for design.

dt = setDT(list(x = 1)) # no copies needed

y = 2
dt = setDT(list(x = 1, y = y)) # y will need copies

DF = data.frame(x = 1)
setDT(DF) # no copies

y = 2
DF = data.frame(x = 1, y = y)
setDT(DF) # y will need copies

DF1 = data.frame(x = 1)
DF2 = DF1
setDT(DF1) # x will need copies, DF2 is unaffected

f = function (DF) {
  setDT(DF)
  DF[, x := 5]
  DF
}
DF3 = data.frame(x = 1)
DT = f(DF3) # DF3 is still data.frame but f() returns data.table. x will need copies

OfekShilon · 2021-02-16T13:31:38Z

Won’t this lead to all columns being copied as they are being assigned? All the columns are being marked externally referrenced which would lead to memrecycle.

All columns in a setDT-generated DT will indeed be marked as externally referenced. This doesn't mean they are always copied when assigned: if plonking takes place (i.e. i argument to [ is empty and the column is replaced entirely) no copy takes place.

Related, there probably should be a way for advanced users to override this behavior like setDT(..., copy = FALSE).

Sounds good. I think I would prefer a global option personally.

As for expected behavior, these are some examples for design.
...
DF = data.frame(x = 1)
setDT(DF) # no copies

Why not?

...

We tried in-house some similar analysis of arguments to setDT (whether its argument is a function call, or whether its argument is a caller-argument etc.) and didn't get very far - as there was an overwhelming amount of corner cases.

Did anyone ever measure the performance impact of memrecycle? Personally I think giving it up altogether might be a worthy trade off for consistent semantics.

ColeMiller1 · 2021-02-20T01:01:18Z

I like to design with use cases in mind. While I agree that it can be easy to get bogged down by edge cases, these are very common use cases.

As for expected behavior, these are some examples for design.
...
DF = data.frame(x = 1)
setDT(DF) # no copies

Why not?

I expect objects that have no other references would not need to copy the vectors down the road. Why would we need to copy anything other than the pointers to the data?

I can't really speak to the performance considerations of memrecycle.

OfekShilon · 2021-02-20T19:52:44Z

@ColeMiller1 examples of internal attempts to work around this limitation:
(1) We tried to detect if setDT is called on a function argument and assert. This is the common case of data-modification-leakage, but misses cases like d1<-data.frame; d2<-d1; setDT(d2).
(2) We tried to allow setDT to operate only on function calls, like setDT(generateData(...)) - to force it to operate only on "new", unreferenced, data. This missed function calls like cbind, or even user-defined functions that leave column data in place.
And more.

In this context, 'design by use case' might mean - have the implementation work by the contract most of the time, and neglect some harder cases. As a user (of this library and others) I'm not in favor. In real-life size code bases this won't mean less bugs, but rather harder to isolate ones. The right thing to do is to declare a contract and comply 100% of the time - or the package would be very hard to use. This might mean changing the contract to make such compliance possible: a different example in DT is - it cannot really apply to its argument by reference, so I think it shouldn't try to. The same goes here: either := modifications apply to all names bound to the date, or only to the current one. The current state, where sometimes it's one and sometimes the other - is extremely hard to work with, whether the 'other' is 10% or 0.1% of the cases.

Regarding profiling:
I came across @jangorecki 's benchmark. Would it be possible to use it to benchmark the main data.table against a version without memory recycling? (I can try myself but it would have to be on a machine serving many other users, not ideal for benchmarking...)

Clarification: My fork is not currently ready for this. If you agree I'll prepare it.

jangorecki · 2021-02-21T08:19:45Z

We can run db-benchmark tests for data.table branch other than master, it happened few times in the past already.
I would like to hear @mattdowle and/or @arunsrinivasan comments on this idea. If it is a lot of work, then best to wait for their comments before investing more time into to.

OfekShilon · 2021-02-21T08:39:23Z

We can run db-benchmark tests for data.table branch other than master, it happened few times in the past already.
I would like to hear @mattdowle and/or @arunsrinivasan comments on this idea. If it is a lot of work, then best to wait for their comments before investing more time into to.

Thanks @jangorecki. If it's a lot of work on your side too then I'll try to find a dedicated machine to benchmark on myself.

jangorecki · 2021-02-21T09:31:26Z

Not at all, I just install branch to custom library location and run benchmark script.

ColeMiller1 · 2021-02-21T14:39:37Z

@OfekShilon I had no intention that edge case would be fail - more so that there are expected behaviors that we should keep in mind. Specifically, if there is a way to design this so that columns unique to a data table are unaffected, that's what I would advocate for.

This is tangentially relevant to the discussions and is at the very least a great reference:

RcppCore/Rcpp#943

And this comment stood out as it is referring to copying pointers and not the underlying memory for Rcpp.

RcppCore/Rcpp#943 (comment)

OfekShilon · 2021-03-16T12:28:38Z

@jangorecki I disabled in-place assignment completely in this branch. Could you please try and benchmark it?

jangorecki · 2021-03-16T14:31:19Z

@OfekShilon I am happy to run benchmarks of it, but I don't have any benchmark code for stressing this. None of db-benchmark tests utilize in-place assignment. Do you have any code that I can run and scale up?

OfekShilon · 2021-03-21T18:36:47Z

I just ran a quick and dirty benchmark myself, attached is the test script (note that in real life my R often crashes after install_github and I can't run the script in one piece). The key code I run is -

...
dt <- setDT(df)
microbenchmark( dt[!is.na(a), a := newDat] )

Printing the microbenchmark results:

> mbVanillaIP
Unit: milliseconds
        expr      min       lq     mean   median       uq      max neval
 vanilla i-p 202.3709 216.1531 245.5188 227.0159 280.0774 334.3973   100
> mbTestIP
Unit: milliseconds
        expr      min       lq     mean   median       uq      max neval
 no in-place 247.6703 317.6447 321.4421 319.6458 321.6315 364.8844   100

So there is a noticeable - but I'd say not overwhelming - performance impact.
I think it's safe to say that in any form the assignment of data is never an interesting bottleneck - it is always its processing. Personally I'll be very, very happy to pay this small price in the initial creation of data if that means avoiding its leakage to places it doesn't belong. (you can see again a demo of the problem and the solution in the attached script).

What do you guys think?
@jangorecki, @MichaelChirico, @ColeMiller1

testNoInPlaceDT.zip

OfekShilon · 2021-03-28T08:16:26Z

The discussion diverged too much - I will close this PR and start a new one with the full suggested code in place.

Marking columns as externally-referenced - WIP

82ef134

jangorecki added the WIP label Feb 14, 2021

OfekShilon closed this Mar 28, 2021

OfekShilon mentioned this pull request Mar 30, 2021

Fix #4783 with a designated option #4947

Closed

MichaelChirico removed the WIP label Feb 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Marking columns as externally-referenced - WIP #4902

Marking columns as externally-referenced - WIP #4902

OfekShilon commented Feb 13, 2021

jangorecki commented Feb 14, 2021

OfekShilon commented Feb 15, 2021

MichaelChirico commented Feb 15, 2021 via email

OfekShilon commented Feb 15, 2021 •

edited

MichaelChirico commented Feb 15, 2021 via email

OfekShilon commented Feb 15, 2021

ColeMiller1 commented Feb 16, 2021

OfekShilon commented Feb 16, 2021 •

edited

ColeMiller1 commented Feb 20, 2021

OfekShilon commented Feb 20, 2021 •

edited

jangorecki commented Feb 21, 2021 •

edited

OfekShilon commented Feb 21, 2021

jangorecki commented Feb 21, 2021

ColeMiller1 commented Feb 21, 2021

OfekShilon commented Mar 16, 2021

jangorecki commented Mar 16, 2021 •

edited

OfekShilon commented Mar 21, 2021 •

edited

OfekShilon commented Mar 28, 2021

Marking columns as externally-referenced - WIP #4902

Marking columns as externally-referenced - WIP #4902

Conversation

OfekShilon commented Feb 13, 2021

jangorecki commented Feb 14, 2021

OfekShilon commented Feb 15, 2021

MichaelChirico commented Feb 15, 2021 via email

OfekShilon commented Feb 15, 2021 • edited

MichaelChirico commented Feb 15, 2021 via email

OfekShilon commented Feb 15, 2021

ColeMiller1 commented Feb 16, 2021

OfekShilon commented Feb 16, 2021 • edited

ColeMiller1 commented Feb 20, 2021

OfekShilon commented Feb 20, 2021 • edited

jangorecki commented Feb 21, 2021 • edited

OfekShilon commented Feb 21, 2021

jangorecki commented Feb 21, 2021

ColeMiller1 commented Feb 21, 2021

OfekShilon commented Mar 16, 2021

jangorecki commented Mar 16, 2021 • edited

OfekShilon commented Mar 21, 2021 • edited

OfekShilon commented Mar 28, 2021

OfekShilon commented Feb 15, 2021 •

edited

OfekShilon commented Feb 16, 2021 •

edited

OfekShilon commented Feb 20, 2021 •

edited

jangorecki commented Feb 21, 2021 •

edited

jangorecki commented Mar 16, 2021 •

edited

OfekShilon commented Mar 21, 2021 •

edited