Be less conservative about ALTREP #4697

MichaelChirico · 2020-09-07T02:04:27Z

Discovered via SO Q&A:

https://stackoverflow.com/a/63770450/3576984

Currently j expands all ALTREPs, which can lead to significant performance degradation:

system.time(for (i in 1:40732170) {})
#    user  system elapsed 
#   0.764   0.004   0.880 
data.table()[ , system.time(for (i in 1:40732170) {})]
#    user  system elapsed 
#   3.756   0.020   3.927

I don't know enough about ALTREP to say for sure which cases we can/can't avoid expansion, but hopefully we can do better than the above.

(and actually I'm not 100% sure ALTREP is to blame here, but seems the likeliest candidate)
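One way to check the ALTREP hypothesis is to test whether a given vector is still stored as a compact sequence or has been expanded. The base R sketch below is an illustration under assumptions, not data.table code: `is_compact()` is a hypothetical helper, and it relies on the text printed by `.Internal(inspect())`, an internal debugging tool whose output format is not a stable API.

```r
# Minimal sketch, assuming R >= 3.5.0 (the release that introduced ALTREP).
n <- 1e6L
compact  <- 1:n            # stored as a compact ALTREP integer sequence
expanded <- compact + 0L   # arithmetic returns an ordinary, fully allocated vector

# Hypothetical helper: TRUE if inspect() reports the object as "compact".
# Assumption: this parses debugging output, which may change between R versions.
is_compact <- function(x) {
  out <- capture.output(.Internal(inspect(x)))
  any(grepl("compact", out, fixed = TRUE))
}

same_values      <- identical(compact, expanded)  # same values, different storage
compact_status   <- is_compact(compact)
expanded_status  <- is_compact(expanded)
```

Calling such a helper on a vector before and after it passes through `j` would show whether expansion actually happens there, which would help confirm or rule out ALTREP as the cause of the timing difference.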


ColeMiller1 · 2020-09-11T00:40:15Z

Should we also be more aggressive about using ALTREP? That is, .I is almost free now. irows could be refactored so that it is no longer NULL, which might (?) result in simpler code.

R 3.5.0 is 2.5 years old so I think we're getting close to where incorporating it makes sense.

MichaelChirico · 2020-11-22T15:42:43Z

Great talk from Gabe Becker here:

https://youtu.be/8i7ziLqsE2s

I think a low-hanging fruit could be to set (first) keys to KNOWN_INCREASING; that could, e.g., let R be very fast in doing unique(DT$key).

jangorecki · 2020-11-22T16:08:05Z

Unless there are NAs which R places at the end and not at the front.
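To make the NA placement concrete, here is a minimal base R sketch (the vector `x` is made up for illustration). `order()` defaults to `na.last = TRUE`, so NAs land at the end, whereas data.table keys sort NAs to the front, which corresponds to `na.last = FALSE` here.

```r
# Illustrative data only.
x <- c(3L, NA, 1L, 2L)

# Base R default: order() uses na.last = TRUE, so NA sorts to the end.
na_last <- x[order(x)]

# NAs first, the placement data.table uses for keyed columns.
na_first <- x[order(x, na.last = FALSE)]

print(na_last)
print(na_first)
```

Any sortedness flag attached to a key column would therefore have to describe an increasing order with NAs first rather than the base R default.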

MichaelChirico · 2020-11-22T16:25:24Z

Actually no! One of the sortedness enum values fits our case, if I'm not mistaken: SORTED_INCR_NA_1ST

link to slides here:

https://www.bioconductor.org/help/course-materials/

ben-schwen · 2024-01-18T13:03:36Z

Well, if we manage to support ALTREP vectors as columns, then an out-of-memory data.table should be an easier target.

tdhock · 2024-01-18T15:55:25Z

Is there a more convincing real example?
It seems that in a real example, if you are computing something over a large loop, you will also be returning a large number of rows, and you should have memory for that, so optimizing for ALTREP is not necessary.

jangorecki · 2024-01-18T16:14:27Z

An out-of-memory data.table is a much-desired feature because of the severity of the consequences when you just don't have enough memory. Together with long vector support, it was one of the two main points on the roadmap of new data.table features that Matt was working on. As he said, long vector support is not that big a deal as long as there is no out-of-memory data.table.

MichaelChirico added the performance label Sep 7, 2020

MichaelChirico mentioned this issue May 4, 2021

unique can be optimized on keyed data.tables #2947

Open

MichaelChirico mentioned this issue May 14, 2021

Master list of most-requested issues #3189

Open

76 tasks

ColeMiller1 mentioned this issue Sep 9, 2023

Formal governance document #5676

Closed

MichaelChirico mentioned this issue Jan 18, 2024

implement frev as fast base::rev alternative #5907

Open

4 tasks

MichaelChirico added the top request One of our most-requested issues label Apr 14, 2024
