Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Marking columns as externally-referenced - WIP #4902

Closed
wants to merge 1 commit into from
Closed

Marking columns as externally-referenced - WIP #4902

wants to merge 1 commit into from

Conversation

OfekShilon
Copy link
Contributor

This PR is not mergeable yet, it is intended to start a conversation. I hope to get your feedback on the basic approach: marking columns with an attribute saying they are also referenced outside the DT, and then use this attribute to bypass memrecycle.
It solves the immediate issue, but:
(1) Many tests break, as the output of setDT is indeed changed (additional attribute).
(2) I'm not entirely comfortable with this invasive approach. Perhaps it would be better to contain the change inside the DT itself (adding a list attribute to it), and not add attributes to raw data. I'm not sure which modifications would be required to make this work - cbind? Elsewhere? Any thoughts are welcome.

@jangorecki
Copy link
Member

I like the idea, although I think we could see if we could utilize R C api ref count. This PR seems to reinvent what is there under NAMED/MAYBE_SHARED flag.

@jangorecki jangorecki added the WIP label Feb 14, 2021
@OfekShilon
Copy link
Contributor Author

@jangorecki That sounds like a good idea, but I don't understand R's ref count and don't trust them (this discussion in r-devel by Matt Dowle shows it's more complicated than it seems).

This is what I get when I try to inspect it:

> a <- 1:100
> .Internal(inspect(a))
@5595afc47e60 13 INTSXP g1c0 [MARK,NAM(7)] (len=100, tl=0) 1,2,3,4,5,...
> b <- a
> .Internal(inspect(a))
@5595afc47e60 13 INTSXP g1c0 [MARK,NAM(7)] (len=100, tl=0) 1,2,3,4,5,...

So it seems the refcount (7?) doesn't count the number of names bound to the data. This is for R3.6 - do you get different results?

@MichaelChirico
Copy link
Member

MichaelChirico commented Feb 15, 2021 via email

@OfekShilon
Copy link
Contributor Author

OfekShilon commented Feb 15, 2021

This is what I get on R4.0.3:

> a <- 1:100
> .Internal(inspect(a))
@564a1336a8b0 13 INTSXP g0c0 [REF(65535)]  1 : 100 (compact)
> b <- a
> .Internal(inspect(a))
@564a1336a8b0 13 INTSXP g0c0 [REF(65535)]  1 : 100 (compact)
> a<- c(1,3,5)
> .Internal(inspect(a))
@564a1334bfb8 14 REALSXP g0c3 [REF(1)] (len=3, tl=0) 1,3,5
> b <- a
> .Internal(inspect(a))
@564a1334bfb8 14 REALSXP g0c3 [REF(2)] (len=3, tl=0) 1,3,5

I assume the difference is because 1:100 is now stored as ALTREP.

This seems encouraging, but is an R4.0-only solution considered acceptable?

@MichaelChirico
Copy link
Member

MichaelChirico commented Feb 15, 2021 via email

@OfekShilon
Copy link
Contributor Author

It seems ref counts are still very hard to reason about in R4:

> a <- c(1,3,5)
> .Internal(inspect(a))
@5611e3b71b78 14 REALSXP g0c3 [REF(1)] (len=3, tl=0) 1,3,5
> dt1 <- data.table(a=a, b=c(2,4,6))
> .Internal(inspect(dt1[["a"]]))    # 2 seems reasonable. But why was it copied??
@5611e3b71df8 14 REALSXP g0c3 [REF(2)] (len=3, tl=0) 1,3,5
> .Internal(inspect(dt1[["b"]]))    # Why 2? Not referenced externally
@5611e3b71e48 14 REALSXP g0c3 [REF(2)] (len=3, tl=0) 2,4,6
> a <- 1
> .Internal(inspect(dt1[["a"]]))     # not decremented :(
@5611e3b71df8 14 REALSXP g0c3 [REF(2)] (len=3, tl=0) 1,3,5
> 
> 
> df <- data.frame(a=11)
> .Internal(inspect(df[["a"]]))     # surprisingly high..
@5611e4099438 14 REALSXP g0c1 [REF(11)] (len=1, tl=0) 11
> dt <- setDT(df)
> .Internal(inspect(dt[["a"]]))     # at least it wasn't copied
@5611e4099438 14 REALSXP g0c1 [REF(12)] (len=1, tl=0) 11
> 
> rm(df)
> .Internal(inspect(dt[["a"]]))     # not decremented  :(
@5611e4099438 14 REALSXP g0c1 [REF(12)] (len=1, tl=0) 11

Am I missing something? It seems 'manual' solutions are still in order.

@ColeMiller1
Copy link
Contributor

Won’t this lead to all columns being copied as they are being assigned? All the columns are being marked externally referrenced which would lead to memrecycle.

Maybe it would be better to have use cases of what setDT leads to copies or doesn’t. Related, there probably should be a way for advanced users to override this behavior like setDT(..., copy = FALSE).

As for expected behavior, these are some examples for design.

dt = setDT(list(x = 1)) # no copies needed

y = 2
dt = setDT(list(x = 1, y = y)) # y will need copies

DF = data.frame(x = 1)
setDT(DF) # no copies

y = 2
DF = data.frame(x = 1, y = y)
setDT(DF) # y will need copies

DF1 = data.frame(x = 1)
DF2 = DF1
setDT(DF1) # x will need copies, DF2 is unaffected

f = function (DF) {
  setDT(DF)
  DF[, x := 5]
  DF
}
DF3 = data.frame(x = 1)
DT = f(DF3) # DF3 is still data.frame but f() returns data.table. x will need copies

@OfekShilon
Copy link
Contributor Author

OfekShilon commented Feb 16, 2021

Won’t this lead to all columns being copied as they are being assigned? All the columns are being marked externally referrenced which would lead to memrecycle.

All columns in a setDT-generated DT will indeed be marked as externally referenced. This doesn't mean they are always copied when assigned: if plonking takes place (i.e. i argument to [ is empty and the column is replaced entirely) no copy takes place.

Related, there probably should be a way for advanced users to override this behavior like setDT(..., copy = FALSE).

Sounds good. I think I would prefer a global option personally.

As for expected behavior, these are some examples for design.
...
DF = data.frame(x = 1)
setDT(DF) # no copies

Why not?

...

We tried in-house some similar analysis of arguments to setDT (whether its argument is a function call, or whether its argument is a caller-argument etc.) and didn't get very far - as there was an overwhelming amount of corner cases.

Did anyone ever measure the performance impact of memrecycle? Personally I think giving it up altogether might be a worthy trade off for consistent semantics.

@ColeMiller1
Copy link
Contributor

I like to design with use cases in mind. While I agree that it can be easy to get bogged down by edge cases, these are very common use cases.

As for expected behavior, these are some examples for design.
...
DF = data.frame(x = 1)
setDT(DF) # no copies

Why not?

I expect objects that have no other references would not need to copy the vectors down the road. Why would we need to copy anything other than the pointers to the data?

I can't really speak to the performance considerations of memrecycle.

@OfekShilon
Copy link
Contributor Author

OfekShilon commented Feb 20, 2021

@ColeMiller1 examples of internal attempts to work around this limitation:
(1) We tried to detect if setDT is called on a function argument and assert. This is the common case of data-modification-leakage, but misses cases like d1<-data.frame; d2<-d1; setDT(d2).
(2) We tried to allow setDT to operate only on function calls, like setDT(generateData(...)) - to force it to operate only on "new", unreferenced, data. This missed function calls like cbind, or even user-defined functions that leave column data in place.
And more.

In this context, 'design by use case' might mean - have the implementation work by the contract most of the time, and neglect some harder cases. As a user (of this library and others) I'm not in favor. In real-life size code bases this won't mean less bugs, but rather harder to isolate ones. The right thing to do is to declare a contract and comply 100% of the time - or the package would be very hard to use. This might mean changing the contract to make such compliance possible: a different example in DT is - it cannot really apply to its argument by reference, so I think it shouldn't try to. The same goes here: either := modifications apply to all names bound to the date, or only to the current one. The current state, where sometimes it's one and sometimes the other - is extremely hard to work with, whether the 'other' is 10% or 0.1% of the cases.

Regarding profiling:
I came across @jangorecki 's benchmark. Would it be possible to use it to benchmark the main data.table against a version without memory recycling? (I can try myself but it would have to be on a machine serving many other users, not ideal for benchmarking...)

Clarification: My fork is not currently ready for this. If you agree I'll prepare it.

@jangorecki
Copy link
Member

jangorecki commented Feb 21, 2021

We can run db-benchmark tests for data.table branch other than master, it happened few times in the past already.
I would like to hear @mattdowle and/or @arunsrinivasan comments on this idea. If it is a lot of work, then best to wait for their comments before investing more time into to.

@OfekShilon
Copy link
Contributor Author

We can run db-benchmark tests for data.table branch other than master, it happened few times in the past already.
I would like to hear @mattdowle and/or @arunsrinivasan comments on this idea. If it is a lot of work, then best to wait for their comments before investing more time into to.

Thanks @jangorecki. If it's a lot of work on your side too then I'll try to find a dedicated machine to benchmark on myself.

@jangorecki
Copy link
Member

Not at all, I just install branch to custom library location and run benchmark script.

@ColeMiller1
Copy link
Contributor

@OfekShilon I had no intention that edge case would be fail - more so that there are expected behaviors that we should keep in mind. Specifically, if there is a way to design this so that columns unique to a data table are unaffected, that's what I would advocate for.

This is tangentially relevant to the discussions and is at the very least a great reference:

RcppCore/Rcpp#943

And this comment stood out as it is referring to copying pointers and not the underlying memory for Rcpp.

RcppCore/Rcpp#943 (comment)

@OfekShilon
Copy link
Contributor Author

@jangorecki I disabled in-place assignment completely in this branch. Could you please try and benchmark it?

@jangorecki
Copy link
Member

jangorecki commented Mar 16, 2021

@OfekShilon I am happy to run benchmarks of it, but I don't have any benchmark code for stressing this. None of db-benchmark tests utilize in-place assignment. Do you have any code that I can run and scale up?

@OfekShilon
Copy link
Contributor Author

OfekShilon commented Mar 21, 2021

I just ran a quick and dirty benchmark myself, attached is the test script (note that in real life my R often crashes after install_github and I can't run the script in one piece). The key code I run is -

...
dt <- setDT(df)
microbenchmark( dt[!is.na(a), a := newDat] )

Printing the microbenchmark results:

> mbVanillaIP
Unit: milliseconds
        expr      min       lq     mean   median       uq      max neval
 vanilla i-p 202.3709 216.1531 245.5188 227.0159 280.0774 334.3973   100
> mbTestIP
Unit: milliseconds
        expr      min       lq     mean   median       uq      max neval
 no in-place 247.6703 317.6447 321.4421 319.6458 321.6315 364.8844   100

So there is a noticeable - but I'd say not overwhelming - performance impact.
I think it's safe to say that in any form the assignment of data is never an interesting bottleneck - it is always its processing. Personally I'll be very, very happy to pay this small price in the initial creation of data if that means avoiding its leakage to places it doesn't belong. (you can see again a demo of the problem and the solution in the attached script).

What do you guys think?
@jangorecki, @MichaelChirico, @ColeMiller1

testNoInPlaceDT.zip

@OfekShilon
Copy link
Contributor Author

The discussion diverged too much - I will close this PR and start a new one with the full suggested code in place.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants