Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Be less conservative about ALTREP #4697

Open
MichaelChirico opened this issue Sep 7, 2020 · 7 comments
Open

Be less conservative about ALTREP #4697

MichaelChirico opened this issue Sep 7, 2020 · 7 comments
Labels
performance top request One of our most-requested issues

Comments

@MichaelChirico
Copy link
Member

MichaelChirico commented Sep 7, 2020

Discovered via SO Q&A:

https://stackoverflow.com/a/63770450/3576984

Currently j expands all ALTREPs, which can lead to significant performance degradation:

system.time(for (i in 1:40732170) {})
#    user  system elapsed 
#   0.764   0.004   0.880 
data.table()[ , system.time(for (i in 1:40732170) {})]
#    user  system elapsed 
#   3.756   0.020   3.927

I don't know enough about ALTREP to say for sure which cases we can/can't avoid expansion, but hopefully we can do better than the above.

(and actually I'm not 100% sure ALTREP is to blame here, but seems the likeliest candidate)

@ColeMiller1
Copy link
Contributor

Should we also be more aggressive about using ALTREP? That is, .I is almost free now. irows could be refactored from being NULL and might (?) end up in simpler code.

R 3.5.0 is 2.5 years old so I think we're getting close to where incorporating it makes sense.

@MichaelChirico
Copy link
Member Author

Great talk from Gabe Becker here:

https://youtu.be/8i7ziLqsE2s

I think a low hanging fruit could be to set (first) keys to KNOWN_INCREASING, that could e.g. let R by very fast in doing unique(DT$key).

@jangorecki
Copy link
Member

Unless there are NAs which R places at the end and not at the front.

@MichaelChirico
Copy link
Member Author

Actually no! one of the sortedness enum fit our case if I'm not mistaken: SORTED_INCR_NA_1ST

link to slides here:

https://www.bioconductor.org/help/course-materials/

@ben-schwen
Copy link
Member

Well if we manage to support altrep vectors as columns then out-of-memory data.table should be easier targetable.

@tdhock
Copy link
Member

tdhock commented Jan 18, 2024

is there a more convincing real example?
it seems like in a real example if you are computing something over a large loop you will also be returning a large number of rows, and you should have memory for that, so optimizing for ALTREP is not necessary.

@jangorecki
Copy link
Member

out-of-memory data.table is very desired feature due to severity of consequences when you just don't have enough memory. Together with long vector support these were two main points on the roadmap for data.table new features that Matt was working on. As he said, long vector support is not that great deal as long as there is no out-of-memory data.table.

@MichaelChirico MichaelChirico added the top request One of our most-requested issues label Apr 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance top request One of our most-requested issues
Projects
None yet
Development

No branches or pull requests

5 participants