Description
I am opening this issue as a follow-up to a discussion on Slack so that we do not lose track of it.
Rationale: in data-science workflows it is very common to have very large tables with columns that consist of many unique strings (e.g. a product ID that is a non-numeric character sequence).
In such cases the current design of the GC combined with the `String` type causes us to create a lot of small, individually allocated strings. The effect is described e.g. in h2oai/db-benchmark#210 (in these benchmarks the high-cardinality string column does not take part in any computations - it just sits there using up memory and causing GC strain). The issue is especially apparent in multi-threaded contexts (i.e. when the operation you want to perform is parallelized and generally fast, but is paused by GC collection cycles that get triggered).
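A minimal sketch of the effect, for illustration only: it times one full collection while a column is kept alive, once for a column of unique strings and once for a column of integers. The helper name `full_gc_time` and the column size are assumptions made for this example, and the timings depend on the machine and the Julia version, so no numbers are claimed here.

```julia
# Hypothetical helper (not part of Base): build a column, keep it alive,
# and time one full collection while it is live.
function full_gc_time(make_column)
    column = make_column()
    t = GC.@preserve column begin
        GC.gc()              # clean up garbage left over from construction
        @elapsed GC.gc()     # time a full collection with `column` still live
    end
    return t, length(column)
end

n = 200_000  # illustrative size; real tables are often much larger

# Every element is a separate heap-allocated String the GC has to visit.
t_str, n_str = full_gc_time(() -> [string("id", i) for i in 1:n])

# A bits-type column is a single allocation with nothing to trace inside it.
t_int, n_int = full_gc_time(() -> collect(1:n))

println("full GC with ", n_str, " live strings:  ", t_str, " s")
println("full GC with ", n_int, " live integers: ", t_int, " s")
```

The string column does not take part in any computation here either; it only has to exist for the collector to spend time on it.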
I think - given we want Julia to be fast in data science workflows - this issue critically needs to be resolved (it is apparent in H2O benchmarks, but I get this problem constantly reported by users of DataFrames.jl).
As this issue touches deep Julia Base internals, I am probably not the best person to decide what should be done (there are surely many considerations to weigh before making a decision), but once a decision is made I can help implement the changes (unless of course the core devs would be willing to do them). Here is a list of options I can see (some of them might immediately make no sense to the Julia core devs - in that case please comment, but I do not want to limit myself at this stage of thinking about the issue):
- improve the "generational" aspect of GC (related: The GC often doesn't act generational #40644)
- have special handling of the `String` type in GC (related to the above, but we might e.g. decide to always treat `String` as very old; possibly this could be enabled/disabled by some run-time option)
- have a run-time option to turn `String` interning on/off (thus fully disabling GC for them when interning is on) - this would have the additional benefit of faster comparisons at the expense of creation time
- have a special representation of short strings that would be non-allocating (if you have very many strings, most likely they are short)
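To make the last two options more concrete, here is a rough user-level sketch. It is not a proposal for how Base should implement either of them. `InternPool`, `intern!`, `ShortString`, and `decode` are hypothetical names invented for this example, and the 7-byte limit is an arbitrary choice that makes the value fit in a `UInt64`.

```julia
# --- Interning (hypothetical pool; runtime-level interning would differ) ---
struct InternPool
    index::Dict{String,UInt32}   # string -> handle
    values::Vector{String}       # handle -> string
end
InternPool() = InternPool(Dict{String,UInt32}(), String[])

# Return a small integer handle; each distinct string is stored only once.
# Creation pays for a hash lookup, but comparing handles is an integer compare.
function intern!(pool::InternPool, s::AbstractString)
    key = String(s)
    return get!(pool.index, key) do
        push!(pool.values, key)
        UInt32(length(pool.values))
    end
end

# --- Non-allocating short strings (hypothetical layout) ---
# Up to 7 bytes of data in the high bytes, the length in the lowest byte.
struct ShortString
    bits::UInt64
end

function ShortString(s::String)
    n = ncodeunits(s)
    n <= 7 || throw(ArgumentError("string longer than 7 bytes"))
    bits = UInt64(n)
    for (i, b) in enumerate(codeunits(s))
        bits |= UInt64(b) << (8 * (8 - i))
    end
    return ShortString(bits)
end

# Rebuild an ordinary String (this allocates; it is only for display/checks).
function decode(x::ShortString)
    n = Int(x.bits & 0xff)
    return String(UInt8[(x.bits >> (8 * (8 - i))) % UInt8 for i in 1:n])
end

pool = InternPool()
h1 = intern!(pool, "prod-001")
h2 = intern!(pool, "prod-002")
h3 = intern!(pool, "prod-001")   # same handle as h1

col = [ShortString(string("p", i)) for i in 1:1000]
println(isbitstype(ShortString), " ", decode(col[42]), " ", h1 == h3)
```

Because `ShortString` is a bits type, a `Vector{ShortString}` stores its elements inline in a single allocation, so the GC has no per-element objects to trace. Note that the pool only helps when values repeat; for a column of mostly unique IDs the inline representation is the more relevant of the two sketches.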
In the meantime @quinnj is working on improving the handling of this issue on the CSV.jl side (to avoid allocating strings at all), but I think that is a second-best approach and we should have a good solution in Julia Base.