Skip to content

libexpr: Reduce the size of Value down to 16 bytes (on 64 bit systems) #13407

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jul 2, 2025

Conversation

xokdvium
Copy link
Contributor

@xokdvium xokdvium commented Jun 28, 2025

Motivation

This shaves off a very significant (~20% percent reduction in maximum heap size and ~17% in total bytes) amount of memory used for evaluation as well as reduces the GC-managed heap. See the results below. The fraction of collected memory is not affected.

Memory usage comparison

Below are the comparisons for NIX_SHOW_STATS=1 nix-env --query --available --out-path --file ../nixpkgs --eval-system x86_64-linux > /dev/null

Before

{
  ....
  "gc": {
    "cycles": 14,
    "heapSize": 14818668544,
    "totalBytes": 27457473440
  },
  ....
  "sizes": {
    "Attr": 16,
    "Bindings": 16,
    "Env": 8,
    "Value": 24
  },
  ....
  "values": {
    "bytes": 7345779024,
    "number": 306074126
  }
}

After

{
  ....
  "gc": {
    "cycles": 14,
    "heapSize": 11916210176,
    "totalBytes": 22553098160
  },
  ....
  "sizes": {
    "Attr": 16,
    "Bindings": 16,
    "Env": 8,
    "Value": 16
  },
  ....
  "values": {
    "bytes": 4897186016,
    "number": 306074126
  }
}

So the savings are around 4.9 GiB on a full nixpkgs eval.

Previously the union discriminator (InternalType) was
stored as a separate field in the Value, which takes up
whole 8 bytes due to padding needed for member alignment.
This effectively wasted 7 whole bytes of memory. Instead
of doing that InternalType is packed into pointer
alignment niches. As it turns out, there's more than enough
unused bits there for the bit packing to be effective.

See the doxygen comment in the ValueStorage specialization
for more details.

This does not add any performance overhead, even though
we now consistently assert the InternalType in all getters.

This can also be made atomic with a double width compare-and-swap
instruction on x86_64 (CMPXCHG16B instruction) for parallel evaluation.

Context

This is the best we can do with the current data model to address #8621.
The improvement is very significant and should give more leeway to the ever growing nixpkgs.


Add 👍 to pull requests you find important.

The Nix maintainer team uses a GitHub project board to schedule and track reviews.

@github-actions github-actions bot added new-cli Relating to the "nix" command c api Nix as a C library with a stable interface labels Jun 28, 2025
@xokdvium xokdvium changed the title libexpr: Reduce the size of Value down to 16 bytes libexpr: Reduce the size of Value down to 16 bytes (on 64 bit systems) Jun 28, 2025
Copy link
Member

@Mic92 Mic92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool stuff. Have only went over briefly. But I am interested in a review from @NaN-git

@xokdvium
Copy link
Contributor Author

Also fixed i686-linux builds. Needed some SFINAE trickery to make the compiler not instantiate ValueStorage member functions that would be ill-formed on a non-64 bit system.

@xokdvium
Copy link
Contributor Author

FYI, some prior discussion about slimming down Value by using pointer tagging #9895 (comment).

@NaN-git
Copy link
Contributor

NaN-git commented Jun 29, 2025

@xokdvium Does this PR slow down the garbage collection due to the tagged/misaligned pointers?
Otherwise the computational overhead seems to be reasonable.

@xokdvium
Copy link
Contributor Author

xokdvium commented Jun 29, 2025

Does this PR slow down the garbage collection due to the tagged/misaligned pointers?

Considering that the GC has to work much less (significantly less memory to scan) it works out that the fraction of the time spent in GC is even reduced (~5% difference for hello package and ~10% for nixosTests.gnome):

(For reference, here's how Boehm does the marking with specified offsets: https://github.com/ivmai/bdwgc/blob/88a61fc6f2688d8e1ea94f3f810c23db89a3f30b/new_hblk.c#L203-L206, https://github.com/ivmai/bdwgc/blob/88a61fc6f2688d8e1ea94f3f810c23db89a3f30b/include/private/gc_pmark.h#L246-L249, https://github.com/ivmai/bdwgc/blob/88a61fc6f2688d8e1ea94f3f810c23db89a3f30b/include/private/gc_pmark.h#L272-L274)

GC_INITIAL_HEAP_SIZE=1M NIX_SHOW_STATS=1 result/bin/nix-instantiate ../nixpkgs -A hello --readonly-mode

I'm actually seeing a modest speedup across the board both for exercising the gc and with large initial heap sizes.

Also I wouldn't really worry about the compute overhead. All of the pointer tagging and untagging are the cheapest possible instructions, which take up a single clock cycle and considering the huge out-of-order windows and plenty of ALUs with large superscalar pipelines of modern processors this works out to be negligible (at least compared to the wins from doing less memory allocations, which are syscalls).

For example, here's what getInternalType compiles down to on x86_64 (but it also gets inlined everywhere that accesses happen):

0000000000080790 <nix::ValueStorage<8ul, void>::getInternalType() const>:
   80790:	       48 8b 07             	mov    (%rdi),%rax
   80793:	       89 c2                	mov    %eax,%edx
   80795:	       83 e2 07             	and    $0x7,%edx
   80798:	       83 fa 04             	cmp    $0x4,%edx
   8079b:	,----- 7f 23                	jg     807c0 <nix::ValueStorage<8ul, void>::getInternalType() const+0x30>
   8079d:	|      89 c6                	mov    %eax,%esi
   8079f:	|      8d 4a 0b             	lea    0xb(%rdx),%ecx
   807a2:	|      83 e6 06             	and    $0x6,%esi
   807a5:	|  ,-- 75 0b                	jne    807b2 <nix::ValueStorage<8ul, void>::getInternalType() const+0x22>
   807a7:	|  |   48 c1 e8 03          	shr    $0x3,%rax
   807ab:	|  |   85 d2                	test   %edx,%edx
   807ad:	|  |   89 c1                	mov    %eax,%ecx
   807af:	|  |   0f 44 ce             	cmove  %esi,%ecx
   807b2:	|  '-> 89 c8                	mov    %ecx,%eax
   807b4:	|      31 d2                	xor    %edx,%edx
   807b6:	|      31 c9                	xor    %ecx,%ecx
   807b8:	|      31 f6                	xor    %esi,%esi
   807ba:	|      31 ff                	xor    %edi,%edi
   807bc:	|      c3                   	ret
   807bd:	|      0f 1f 00             	nopl   (%rax)
   807c0:	'----> 83 fa 05             	cmp    $0x5,%edx
   807c3:	   ,-- 75 15                	jne    807da <nix::ValueStorage<8ul, void>::getInternalType() const+0x4a> < Predictable branch to unreachable
   807c5:	   |   48 8b 4f 08          	mov    0x8(%rdi),%rcx
   807c9:	   |   83 e1 07             	and    $0x7,%ecx
   807cc:	   |   83 c1 08             	add    $0x8,%ecx
   807cf:	   |   89 c8                	mov    %ecx,%eax
   807d1:	   |   31 d2                	xor    %edx,%edx
   807d3:	   |   31 c9                	xor    %ecx,%ecx
   807d5:	   |   31 f6                	xor    %esi,%esi
   807d7:	   |   31 ff                	xor    %edi,%edi
   807d9:	   |   c3                   	ret
   807da:	   '-> 50                   	push   %rax
   807db:	       48 8d 15 a0 cb 14 00 	lea    0x14cba0(%rip),%rdx        # 1cd382 <_fini+0x2f0da>
   807e2:	       be 17 02 00 00       	mov    $0x217,%esi
   807e7:	       48 8d 3d 5d ca 14 00 	lea    0x14ca5d(%rip),%rdi        # 1cd24b <_fini+0x2efa3>
   807ee:	       e8 0d 59 fb ff       	call   36100 <nix::panic(char const*, int, char const*)@plt>
   807f3:	       90                   	nop
   807f4:	       66 66 2e 0f 1f 84 00 	data16 cs nopw 0x0(%rax,%rax,1)
   807fb:	       00 00 00 00 
   807ff:	       90                   	nop

(So basically just several bitshifts and branches to unreachable, which are always predictable)

The pointer tagging for setPairOfPointersPayload is just a simple or instruction:

00000000000834d0 <void nix::ValueStorage<8ul, void>::setPairOfPointersPayload<(nix::InternalType)11, nix::Env, nix::Expr>(nix::Env*, nix::Expr*)>:
   834d0:	    40 f6 c6 07          	test   $0x7,%sil
   834d4:	,-- 75 1f                	jne    834f5 <void nix::ValueStorage<8ul, void>::setPairOfPointersPayload<(nix::InternalType)11, nix::Env, nix::Expr>(nix::Env*, nix::Expr*)+0x25>
   834d6:	|   48 83 ce 05          	or     $0x5,%rsi < This is the pointer tagging
   834da:	|   48 89 37             	mov    %rsi,(%rdi)
   834dd:	|   f6 c2 07             	test   $0x7,%dl
   834e0:	+-- 75 13                	jne    834f5 <void nix::ValueStorage<8ul, void>::setPairOfPointersPayload<(nix::InternalType)11, nix::Env, nix::Expr>(nix::Env*, nix::Expr*)+0x25>
   834e2:	|   48 83 ca 03          	or     $0x3,%rdx

One thing I'm really inclined to do is just get rid of the assertions, which would really simplify the code generation for all of these functions. That way we'd get even more performance wins on top of the reduced memory usage. In my measurements with assertions the runtime is slightly better, but without them the improvement (compared to current master) is the ballpark of 3.3%:

Benchmark 1: GC_INITIAL_HEAP_SIZE=16G result/bin/nix-instantiate ../nixpkgs -A nixosTests.gnome --readonly-mode
  Time (mean ± σ):      2.979 s ±  0.020 s    [User: 2.355 s, System: 0.623 s]
  Range (min … max):    2.953 s …  3.011 s    10 runs
Benchmark 1: GC_INITIAL_HEAP_SIZE=16G nix-instantiate ../nixpkgs -A nixosTests.gnome --readonly-mode
  Time (mean ± σ):      3.072 s ±  0.033 s    [User: 2.346 s, System: 0.725 s]
  Range (min … max):    3.033 s …  3.141 s    10 runs

@xokdvium xokdvium force-pushed the smaller-value branch 2 times, most recently from 6b99241 to 6315294 Compare July 1, 2025 22:17
Copy link
Member

@roberth roberth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic!

Just minor comments. This is close to done.

// These should be removed eventually, by putting the functionality that's
// needed by callers into methods of this type

// type() == nThunk
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't see that it was about the group at first. Doesn't seem like a particularly valuable comment.

Suggested change
// type() == nThunk

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-existing. Can be done separately.

xokdvium added 6 commits July 2, 2025 21:51
The following commits will touch this file significantly, so
it's better to get the formatting out of the way first.
…alueStorage`

This factors out most of the value representation into a mixin class.
`finishValue` is now gone for good and replaced with a simple template
function `setStorage` which derives the type information/disriminator from
the type of the argument. Likewise, reading of the value goes through function
template `getStorage`.

An empty type `Null` is introduced to make the bijection InternalType <-> C++ type
complete.
This also makes it possible to make `payload` field private
in the `ValueStorage` class template.
This shaves off a very significand amount of memory used
for evaluation as well as reduces the GC-managed heap.

Previously the union discriminator (InternalType) was
stored as a separate field in the Value, which takes up
whole 8 bytes due to padding needed for member alignment.
This effectively wasted 7 whole bytes of memory. Instead
of doing that InternalType is instead packed into pointer
alignment niches. As it turns out, there's more than enough
unused bits there for the bit packing to be effective.

See the doxygen comment in the ValueStorage specialization
for more details.

This does not add any performance overhead, even though
we now consistently assert the InternalType in all getters.

This can also be made atomic with a double width compare-and-swap
instruction on x86_64 (CMPXCHG16B instruction) for parallel evaluation.
@roberth roberth merged commit c79b801 into NixOS:master Jul 2, 2025
12 checks passed
@NaN-git
Copy link
Contributor

NaN-git commented Jul 2, 2025

It would have been nice not to use the phrase dword because on a 64bit system I would expect that the word size is 64bit, although e.g. on x86_64 64bit arithmetic instructions might not have the most efficient encoding.

Microsoft denotes 32bit integers as dword, so that this can create a lot of confusion.

for storing the type information. This way tagged pointers are considered
to be valid, even when they are not aligned. */
if constexpr (detail::useBitPackedValueStorage<sizeof(void *)>)
for (std::size_t i = 1; i < sizeof(std::uintptr_t); ++i)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't introduce std::uintptr_t here because sizeof(std::uintptr_t) == sizeof(void *) isn't necessarily true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here because sizeof(std::uintptr_t) == sizeof(void *) isn't necessarily true.

Yeah, that's true. Though I can't think of a platform where that is actually the case.

* discriminator is stored in the first double word.
*
* The layout of discriminator bits is determined by the 3 bits of PrimaryDiscriminator,
* which are always stored in the lower 3 bits of the first dword of the payload.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using dword is ambiguous. My expectation is that the word size on a 64bit machine is 64bit, although on x86_64 64bit arithmetic instruction might not have the most efficient encoding.
Historically Microsoft uses dword for 32bit values.

return nFloat;
case tThunk:
case tApp:
return nThunk;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please introduce formatting changes in another PR.

/* Whether to use a specialization of ValueStorage that does bitpacking into
alignment niches. */
template<std::size_t ptrSize>
inline constexpr bool useBitPackedValueStorage = (ptrSize == 8) && (__STDCPP_DEFAULT_NEW_ALIGNMENT__ >= 8);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably the GC isn't affected by __STDCPP_DEFAULT_NEW_ALIGNMENT__, i.e. this makes sense when the default allocator is used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably the GC isn't affected by STDCPP_DEFAULT_NEW_ALIGNMENT, i.e. this makes sense when the default allocator is used.

I've added that because Value also stores pointers that are not owned by Boehm (like SourceAccessor). Those are most likely allocated with the default allocator.

@NaN-git
Copy link
Contributor

NaN-git commented Jul 2, 2025

argh I did a review, but I forgot to publish it.

@xokdvium xokdvium deleted the smaller-value branch July 4, 2025 08:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c api Nix as a C library with a stable interface new-cli Relating to the "nix" command
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants