Marking columns as externally-referenced - WIP #4902
I like the idea, although I think we should check whether we could use the reference count in R's C API. This PR seems to reinvent what is already there under the NAMED/MAYBE_SHARED flag. |
@jangorecki That sounds like a good idea, but I don't understand R's ref counts and don't trust them (this discussion in r-devel <https://stat.ethz.ch/pipermail/r-devel/2011-November/062653.html> by Matt Dowle shows it's more complicated than it seems). This is what I get when I try to inspect it:
> a <- 1:100
> .Internal(inspect(a))
@5595afc47e60 13 INTSXP g1c0 [MARK,NAM(7)] (len=100, tl=0) 1,2,3,4,5,...
> b <- a
> .Internal(inspect(a))
@5595afc47e60 13 INTSXP g1c0 [MARK,NAM(7)] (len=100, tl=0) 1,2,3,4,5,...
So it seems the refcount (7?) doesn't count the number of names bound to the data. This is for R 3.6 - do you get different results? |
There were some changes in R ≥ 4.0 that may have an effect here (sorry, I don't know the full details). |
This is what I get on R 4.0.3:
> a <- 1:100
> .Internal(inspect(a))
@564a1336a8b0 13 INTSXP g0c0 [REF(65535)] 1 : 100 (compact)
> b <- a
> .Internal(inspect(a))
@564a1336a8b0 13 INTSXP g0c0 [REF(65535)] 1 : 100 (compact)
> a <- c(1,3,5)
> .Internal(inspect(a))
@564a1334bfb8 14 REALSXP g0c3 [REF(1)] (len=3, tl=0) 1,3,5
> b <- a
> .Internal(inspect(a))
@564a1334bfb8 14 REALSXP g0c3 [REF(2)] (len=3, tl=0) 1,3,5
I assume the difference is because 1:100 is now stored as ALTREP. This seems encouraging, but is an R 4.0-only solution considered acceptable? |
As long as the package works with R 3.1.0 it should be fine. It's OK for the package to work better with newer versions of R. |
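For readers who want to observe sharing without reading `.Internal(inspect())` output, a minimal sketch follows. It assumes only that data.table is installed and uses its exported `address()` helper; the variable names are illustrative and not taken from the PR. It shows that two names bound to the same vector report the same address until one of them is modified.

```r
library(data.table)

a <- c(1, 3, 5)
b <- a                      # b is bound to the same vector; nothing is copied yet

# Both names point at the same memory at this stage.
shared_before <- identical(address(a), address(b))

b[1] <- 10                  # copy-on-modify: b now gets its own vector

# After the modification the two names no longer share memory.
shared_after <- identical(address(a), address(b))
```

This only shows whether two bindings currently share memory; it says nothing about how many references exist, which is the harder question raised in this thread.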
It seems ref counts are still very hard to reason about in R4:
Am I missing something? It seems 'manual' solutions are still in order. |
Won’t this lead to all columns being copied as they are being assigned? All the columns are being marked externally referenced, which would lead to memrecycle. Maybe it would be better to have use cases of what
As for expected behavior, these are some examples for design:
dt = setDT(list(x = 1)) # no copies needed
y = 2
dt = setDT(list(x = 1, y = y)) # y will need copies
DF = data.frame(x = 1)
setDT(DF) # no copies
y = 2
DF = data.frame(x = 1, y = y)
setDT(DF) # y will need copies
DF1 = data.frame(x = 1)
DF2 = DF1
setDT(DF1) # x will need copies, DF2 is unaffected
f = function (DF) {
  setDT(DF)
  DF[, x := 5]
  DF
}
DF3 = data.frame(x = 1)
DT = f(DF3) # DF3 is still data.frame but f() returns data.table. x will need copies
|
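The sharing behind these use cases can be checked directly. The sketch below is an illustration only, assuming data.table is installed. It uses `address()` and `copy()` from data.table to show when a column and an outside variable point at the same memory; the variable names are made up for the example.

```r
library(data.table)

# A column built from an existing vector shares memory with that vector,
# so a later sub-assignment by reference may write into `y` as well.
y <- c(2, 2)
dt <- setDT(list(x = c(1, 1), y = y))
shares_y <- identical(address(dt$y), address(y))

# Copying the vector up front decouples the column from the outside variable.
dt_safe <- setDT(list(x = c(1, 1), y = copy(y)))
shares_y_safe <- identical(address(dt_safe$y), address(y))

# Two names bound to the same data.frame share their columns too.
DF1 <- data.frame(x = c(1, 1))
DF2 <- DF1
shares_x <- identical(address(DF1$x), address(DF2$x))
```

Address equality only detects sharing that exists at the moment of the check. It does not tell `setDT` whether a reference was created earlier in the caller's code, which is why the thread keeps returning to reference counts or explicit marking.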
All columns in a
Sounds good. I think I would prefer a global option personally.
Why not?
We tried in-house some similar analysis of arguments to setDT (whether its argument is a function call, or whether its argument is a caller-argument etc.) and didn't get very far - as there was an overwhelming amount of corner cases. Did anyone ever measure the performance impact of |
I like to design with use cases in mind. While I agree that it can be easy to get bogged down by edge cases, these are very common use cases.
I expect objects that have no other references would not need to copy the vectors down the road. Why would we need to copy anything other than the pointers to the data? I can't really speak to the performance considerations of |
@ColeMiller1 examples of internal attempts to work around this limitation:
In this context, 'design by use case' might mean: have the implementation work by the contract most of the time, and neglect some harder cases. As a user (of this library and others) I'm not in favor. In real-life-size code bases this won't mean fewer bugs, but rather harder-to-isolate ones. The right thing to do is to declare a contract and comply 100% of the time, or the package would be very hard to use. This might mean changing the contract to make such compliance possible: a different example in DT is - it cannot really apply to its argument by reference, so I think it shouldn't try to. The same goes here: either
Regarding profiling:
Clarification: My fork is not currently ready for this. If you agree I'll prepare it. |
We can run db-benchmark tests for a data.table branch other than master; that has happened a few times in the past already. |
Thanks @jangorecki. If it's a lot of work on your side too then I'll try to find a dedicated machine to benchmark on myself. |
Not at all, I just install the branch to a custom library location and run the benchmark script. |
@OfekShilon I did not mean that edge cases should fail; rather, there are expected behaviors that we should keep in mind. Specifically, if there is a way to design this so that columns unique to a data table are unaffected, that's what I would advocate for. This is tangentially relevant to the discussion and is at the very least a great reference: And this comment stood out as it is referring to copying pointers and not the underlying memory for Rcpp. |
@jangorecki I disabled in-place assignment completely in this branch. Could you please try and benchmark it? |
@OfekShilon I am happy to run benchmarks of it, but I don't have any benchmark code for stressing this. None of the db-benchmark tests use in-place assignment. Do you have any code that I can run and scale up? |
I just ran a quick and dirty benchmark myself, attached is the test script (note that in real life my R often crashes after ...
dt <- setDT(df)
microbenchmark(dt[!is.na(a), a := newDat])
Printing the microbenchmark results:
So there is a noticeable - but I'd say not overwhelming - performance impact. What do you guys think? |
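A rough way to get a feel for the cost of giving up in-place sub-assignment is sketched below. This is not the script attached to the comment above, which is elided here. The table size, the column name `a`, the replacement value `new_val`, and the use of a full `copy()` as a stand-in for "no in-place assignment" are all assumptions made for illustration. It requires the data.table and microbenchmark packages.

```r
library(data.table)
library(microbenchmark)

n <- 1e5L                                   # assumed size, scale up as needed
df <- data.frame(a = as.numeric(seq_len(n)))
dt <- setDT(df)                             # converts df to a data.table by reference
new_val <- 0                                # assumed replacement value

res <- microbenchmark(
  # sub-assign by reference into the existing column
  in_place   = dt[!is.na(a), a := new_val],
  # copy the table first, then assign into the copy
  copy_first = copy(dt)[!is.na(a), a := new_val],
  times = 10L
)
```

Copying the whole table probably overstates the cost of copying just one column, so the `copy_first` timing is better read as an upper bound than as a measurement of the branch discussed here.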
The discussion diverged too much - I will close this PR and start a new one with the full suggested code in place. |
This PR is not mergeable yet; it is intended to start a conversation. I hope to get your feedback on the basic approach: marking columns with an attribute saying they are also referenced outside the DT, and then using this attribute to bypass memrecycle. It solves the immediate issue, but:
(1) Many tests break, as the output of setDT is indeed changed (additional attribute).
(2) I'm not entirely comfortable with this invasive approach. Perhaps it would be better to contain the change inside the DT itself (adding a list attribute to it), and not add attributes to raw data. I'm not sure which modifications would be required to make this work - cbind? Elsewhere? Any thoughts are welcome.
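The marking idea can be imitated at R level to make the trade-off concrete. The sketch below is not the PR's C implementation. The attribute name `ext_ref` and the helpers `mark_external()` and `set_guarded()` are hypothetical names invented for this illustration; only `setattr()`, `set()`, `copy()` and `setDT()` are real data.table functions.

```r
library(data.table)

EXT_ATTR <- "ext_ref"   # hypothetical attribute name, not the one used in the PR

# Hypothetical helper: flag a column as also referenced outside the table.
mark_external <- function(dt, col) {
  setattr(dt[[col]], EXT_ATTR, TRUE)   # set by reference on the column vector
  invisible(dt)
}

# Hypothetical helper: if the column is flagged, swap in a private copy
# before sub-assigning, so the outside vector is left alone.
set_guarded <- function(dt, i, col, value) {
  if (isTRUE(attr(dt[[col]], EXT_ATTR))) {
    fresh <- copy(dt[[col]])
    setattr(fresh, EXT_ATTR, NULL)     # the private copy is no longer shared
    set(dt, j = col, value = fresh)    # replace the whole column
  }
  set(dt, i = i, j = col, value = value)
  invisible(dt)
}

y <- c(2, 2)
dt <- setDT(list(x = c(1, 1), y = y))
mark_external(dt, "y")
set_guarded(dt, 1L, "y", 99)
```

Because the flag is set by reference on a shared vector, the outside variable `y` ends up carrying the attribute as well. That is the invasiveness raised in point (2), and keeping the bookkeeping in an attribute of the table itself would avoid it.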