Make String more GC friendly

I am opening this issue as a follow-up to a discussion on Slack not to loose track of it.

Rationale: in data-science workflows it is very common to have very large tables that hold columns that consist of many unique strings (e.g. product ID that is non-numeric character sequence).

In such cases the current design of combination of GC and `String` type cause us to create a lot of small strings. The effect is described e.g. here https://github.com/h2oai/db-benchmark/issues/210 (in these benchmarks the high-count string column is not taking part in any computations - it just sits there using-up memory and causing GC strain). The issue is especially apparent in multi-threading contexts (i.e. when the operation you want to do is parallelized and fast in general, but is paused by triggered GC collection cycles).

I think - given we want Julia to be fast in data science workflows - this issue critically needs to be resolved (it is apparent in H2O benchmarks, but I get this problem constantly reported by users of DataFrames.jl).

As this issue touches deep Julia Base internals, I am probably not the best person to decide what should be done (as there are for sure many considerations that have to be made before making a decision), but once the decision on what to do is made I can help implementing the changes (unless of course core devs would be willing to do them). Here is a list of options I can see (some of them might immediately make no sense for Julia core devs - in such case please comment, but I do not want to limit myself at this stage of thinking about the issue):
* improve the "generational" aspect of GC (related: https://github.com/JuliaLang/julia/issues/40644)
* have a special handling of `String` type in GC (related to the above, but we might e.g. decide to always treat `String` as very old; possibly this could be enabled/disabled by some run-time option)
* have a run-time option to turn on/off `String` interning (thus fully disabling GC for them when interning is on) - this would have an additional benefit of faster comparisons at the expense of creation time
* have a special representation of short strings that would be non-allocating (if you have very many strings most likely they are short)

In the mean time @quinnj is working on improving the handling of this issue on CSV.jl side (to avoid allocation of strings at all), but I think it is kind of a second-best and we should have a good solution in Julia Base.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Make String more GC friendly #40840

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Make String more GC friendly #40840

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions