Optimize symbol table #13258


Merged: 2 commits into NixOS:master on Jun 7, 2025

Conversation

NaN-git (Contributor) commented May 24, 2025

Motivation

The symbol table is an append-only data structure that allocates a lot of small strings. Thus the first optimization is to use a std::pmr::monotonic_buffer_resource for string allocations.

Then there is no need to store a std::string_view in the set, because a pointer into the ChunkedVector can be used instead, which frees a size_t. Furthermore, the size of std::string is 32 bytes on x86_64-linux with libstdc++, but the strings never grow, so storing a pointer and a size is sufficient.

Some builtins transform symbols stored in the symbol table, i.e. a value is allocated which points to a string in the symbol table. This allocation can be avoided if the value is stored in the ChunkedVector itself. On x86_64-linux a Value holding a pointer to a string is 24 bytes, so it adds some overhead.
Combining everything into a SymbolValue stores 32 bytes per entry, while the index into the ChunkedVector no longer needs to be stored in the set, freeing another uint32_t (8 bytes after alignment).
The Value optimization reduces the maximum memory usage of nix search noticeably: I measured a 6% reduction (~300 MiB) with this PR on current nixpkgs, along with a 2-3% speedup.

Finally, the symbol table does not require that pointers to set entries are stable, so the cache-friendly boost::unordered_flat_set can be used; it stores the set entries in place and therefore has less memory overhead, too.
This implementation requires a good hash function whose output bits are uniformly distributed. I don't know whether the hash functions of libstdc++ and libc++ (especially older versions) are good enough, so I decided to use the hash function provided by Boost.

Context

Making the symbol table thread-safe wouldn't be much extra effort now: boost::concurrent_flat_set could be used instead. The question is which operations should be lock-free. Making operator[] thread-safe requires some changes to ChunkedVector.


Add 👍 to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

NaN-git requested review from roberth and edolstra as code owners May 24, 2025 18:20
Ericson2314 (Member)

CC @xokdvium

tomberek (Contributor)

Random tidbit: during eval of a NixOS system, we allocate a lot of common small strings:

   7561 sym: 1 # 981 = "
   7651 sym: 5 # 591 = users
   8191 sym: 3 # 64 = lib
   9219 sym: 7 # 5307 = systemd
  10940 sym: 6 # 120 = config
  15525 sym: 1 # 28 = 1
  19273 sym: 10 # 5414 = modulePath
  19285 sym: 5 # 171 = order
  19315 sym: 7 # 6171 = control
  19696 sym: 6 # 105 = :anon-
  21718 sym: 8 # 852 = settings
  22510 sym: 8 # 1494 = services
  22585 sym: 5 # 4292 = check
  22668 sym: 12 # 122 = freeformType
  27293 sym: 4 # 3963 = name
  28222 sym: 11 # 2335 = specialArgs
  38879 sym: 0 # 5 =
  40980 sym: 6 # 419 = enable
  45039 sym: 7 # 89 = _module
  53297 sym: 4 # 4926 = args

NaN-git force-pushed the opt-symbol-table branch from 838c84f to 5ad93a1 on May 25, 2025 22:00
NaN-git (Contributor, Author) commented May 25, 2025

@tomberek How are these strings allocated? Is mkString involved or does the PR improve the situation?

NaN-git force-pushed the opt-symbol-table branch from 5ad93a1 to e8912c0 on May 25, 2025 23:25
NaN-git force-pushed the opt-symbol-table branch 2 times, most recently from 43c3477 to 1fbe558 on May 27, 2025 20:07
xokdvium (Contributor)

Great work!

For reference, some ad-hoc measurements really show an amazing improvement in terms of speed as well as memory usage (compared against current master):

Command                                                        Mean [ms]    Min [ms]  Max [ms]  Relative
result/bin/nix eval nixpkgs#hello --no-eval-cache --read-only  348.3 ± 5.1  339.2     356.1     1.00
nix eval nixpkgs#hello --no-eval-cache --read-only             365.2 ± 3.9  359.0     370.6     1.05 ± 0.02

$ command time -v result/bin/nix eval nixpkgs#nixosTests.gnome --no-eval-cache --read-only
Maximum resident set size (kbytes): 800468
$ command time -v nix eval nixpkgs#nixosTests.gnome --no-eval-cache --read-only
Maximum resident set size (kbytes): 871412

edolstra (Member)

Can you expand a bit on how well this would be amenable to multi-threaded evaluation? In #10938 I reduced lock contention for symbol table insertions by sharding the mapping from strings to symbol offsets, and made symbol reads lock-free entirely by making sure that symbols never move in memory.

xokdvium (Contributor)

I reduced lock contention for symbol table insertions by sharding the mapping from strings to symbol offsets, and made symbol reads lock-free entirely by making sure that symbols never move in memory.

Making the symbol table thread-safe isn't too much effort anymore by using boost::concurrent_flat_set instead.

Looking at the API (https://live.boost.org/doc/libs/develop/libs/unordered/doc/html/unordered/reference/unordered_flat_set.html), this does look like a drop-in replacement. Boost's concurrent implementation doesn't use sharding but does something more clever, it seems: https://bannalia.blogspot.com/2023/07/inside-boostconcurrentflatmap.html. Judging by the benchmarks, its performance is very agreeable, and I doubt we could hand-roll something more performant.

NaN-git (Contributor, Author) commented May 27, 2025

Can you expand a bit on how well this would be amenable to multi-threaded evaluation? In #10938 I reduced lock contention for symbol table insertions by sharding the mapping from strings to symbol offsets, and made symbol reads lock-free entirely by making sure that symbols never move in memory.

I'm not sure whether this is thread-safe. There is no guarantee that the thread which calls operator[] actually observes the string written to memory. The function reads arena.size, which implies a memory barrier, but the writes in SymbolTable::create happen after this atomic variable is updated. There seems to be no other synchronization point.

Read locks are also quite expensive.

Having a maximum size for the symbol table seems questionable. Also, I'm not sure whether reserving so much virtual address space on a 32-bit system is a good idea.

Sharding could be applied to ChunkedVector, too, or it could be rewritten to be mostly lock free.

boost::concurrent_flat_set itself should avoid contention; it's SIMD-friendly, and its inserts/lookups are atomic. Then the whole table would be thread-safe if operator[] and SymbolStr(const Key &) are thread-safe.

tomberek (Contributor) commented May 29, 2025

Builds and tests pass on {x86_64,aarch64}-{darwin,linux} locally as well. I'm seeing a ~8% decrease in memory and CPU usage.

roberth (Member) left a comment

Very nice!

What's more, this removes a significant number of Values from the GC-managed heap, so it saves not just allocations but also some of the time spent in GC.

NaN-git force-pushed the opt-symbol-table branch from 1fbe558 to ed4e512 on June 1, 2025 19:01
nixos-discourse

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/a-look-at-nixos-nixpkgs-evaluation-times-over-the-years/65114/9

roberth merged commit 1022598 into NixOS:master on Jun 7, 2025
12 checks passed

7 participants