
Make String more GC friendly #40840

@bkamins


I am opening this issue as a follow-up to a discussion on Slack so that we do not lose track of it.

Rationale: in data-science workflows it is very common to have very large tables with columns that consist of many unique strings (e.g. a product ID that is a non-numeric character sequence).

In such cases, the current combination of the GC design and the String type means that we create a lot of small strings. The effect is described, e.g., in h2oai/db-benchmark#210 (in these benchmarks the high-cardinality string column does not take part in any computation; it just sits there using up memory and causing GC strain). The issue is especially apparent in multi-threaded contexts (i.e. when the operation you want to perform is parallelized and generally fast, but is paused by the GC collection cycles it triggers).
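To make the effect concrete, here is a minimal sketch that times a full collection before and after a large vector of unique strings is kept alive. The helper `full_gc_seconds` and the size `n` are my own illustrative choices, not something taken from the benchmark, and the actual timings will depend on the machine and the Julia version.

```julia
# Hypothetical helper: time one full collection, in seconds.
function full_gc_seconds()
    GC.gc()                  # clear pending garbage so it does not distort the timing
    return @elapsed GC.gc()  # time a second full collection
end

n = 10^6  # illustrative size only

baseline = full_gc_seconds()

# One separately heap-allocated String per element; the global `ids`
# keeps all of them alive, so every full collection has to visit them.
ids = [string("ID", i) for i in 1:n]

with_strings = full_gc_seconds()

println("full GC without the string column: ", baseline, " s")
println("full GC with the string column:    ", with_strings, " s")
```

Note that `ids` is never used in any computation here, just as in the benchmark described above; it only needs to be reachable to add work to each full collection.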

I think that, given we want Julia to be fast in data-science workflows, this issue critically needs to be resolved (it is apparent in the H2O benchmarks, but I also get this problem constantly reported by users of DataFrames.jl).

As this issue touches deep Julia Base internals, I am probably not the best person to decide what should be done (there are surely many considerations to weigh before making a decision), but once a decision is made I can help implement the changes (unless, of course, the core devs are willing to do them). Here is a list of the options I can see (some of them might immediately make no sense to the Julia core devs; in that case please comment, but I do not want to limit myself at this stage of thinking about the issue):

  • improve the "generational" aspect of the GC (related: The GC often doesn't act generational #40644)
  • add special handling of the String type in the GC (related to the above, but we might e.g. decide to always treat a String as very old; possibly this could be enabled/disabled by a run-time option)
  • add a run-time option to turn String interning on/off (thus fully disabling GC for strings when interning is on); this would have the additional benefit of faster comparisons, at the expense of creation time
  • add a special, non-allocating representation of short strings (if you have very many strings, most of them are likely short)
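As a rough illustration of the interning option, here is a user-level sketch of an intern pool. The names `POOL` and `intern` are hypothetical and not part of Julia Base. It only shows the deduplication and the cheaper comparison; the GC side of the proposal (not tracking interned strings at all) would need runtime support and is not modelled here.

```julia
# Hypothetical user-level intern pool; `POOL` and `intern` are not Base APIs.
const POOL = Dict{String,String}()

# Return the pooled copy of `s`, adding `s` to the pool the first time it is seen.
# The dictionary lookup is the extra creation-time cost mentioned in the list above.
function intern(s::AbstractString)
    key = String(s)
    return get!(POOL, key, key)
end

# Two separately constructed strings with equal contents...
a = intern(string("prod-", 42))
b = intern(string("prod-", 42))

# ...end up referring to the same underlying object, so comparing the data
# pointers can stand in for a byte-by-byte comparison.
# (`===` is not used here because it compares String contents in Julia.)
same_object = pointer(a) == pointer(b)

println(same_object)
println(length(POOL))
```

A pool like this keeps every interned string alive for the lifetime of the process, which is the trade-off a run-time switch would let users opt into.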

In the meantime, @quinnj is working on improving the handling of this issue on the CSV.jl side (avoiding the allocation of strings altogether), but I think that is a second-best approach and we should have a good solution in Julia Base.


Labels: GC, performance, strings
