Investigate why NUMA interleaving isn't reliable on Threadripper. #15
The curious thing here is that it's either 100% or 50%. That corresponds to a perfect distribution and a 3-to-1 distribution across the nodes (3x more memory on one node than the other).
This seems too "round" to be a coincidence. Running out of memory on one node wouldn't explain this.
I've never observed this on my dated quad-opteron. And unfortunately, I do not have access to a Threadripper system. So this might take a while to track down.
Had a discussion with Oliver Kruse. And while he wasn't able to reproduce it with a 1950X, he did bring up a point which seems to be the likely cause of this on the 2990WX. So huge thanks to him!
The screenshots on the forum post show that Windows (and thus y-cruncher) reads the hardware as 4 NUMA nodes despite there being only 2 memory domains.
Windows uses the CPU topology to define nodes. And since the 2990WX has 4 dies, it has 4 nodes. 2 of them have memory, the other 2 don't.
y-cruncher reads the hardware as having 4 NUMA nodes and attempts to allocate memory evenly across the 4 nodes. These allocations are done using
However, 2 of the nodes have no memory. Therefore
If the two empty nodes get bound to different nodes, the distribution will be perfect (100%/2). If they both get bound to the same node, the memory distribution will be 3-to-1, thus giving the 50%/2.
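To make the arithmetic concrete, here is a small sketch of the two cases. It is only a model of the reasoning above, not y-cruncher's code: the node numbering (0 and 2 have memory, 1 and 3 are empty), the `fallback` mapping, and the function name `effective_distribution` are all assumptions made for illustration.

```python
from collections import Counter


def effective_distribution(num_nodes, memory_nodes, fallback):
    """Return the share of memory that ends up on each memory-backed node.

    The allocator asks for an equal share on every node. A request aimed at
    a node without memory is served by whichever node `fallback` maps it to
    (hypothetical mapping; the real choice is made by the OS).
    """
    share = Counter()
    for node in range(num_nodes):
        target = node if node in memory_nodes else fallback[node]
        share[target] += 1
    return {n: share[n] / num_nodes for n in sorted(memory_nodes)}


# Empty nodes 1 and 3 fall back to different memory nodes: even split.
even = effective_distribution(4, {0, 2}, {1: 0, 3: 2})

# Empty nodes 1 and 3 both fall back to node 0: 3-to-1 split.
skewed = effective_distribution(4, {0, 2}, {1: 0, 3: 0})

print(even)
print(skewed)
```

In the second case node 0 holds three quarters of the memory and node 2 holds one quarter, which is the 3-to-1 ratio described above.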
The solution is to exclude NUMA nodes that don't have memory. This should take care of Threadripper and other similar cases. But it won't solve the more general case of heterogeneous systems.
This is easy to do on Windows. But Linux will take some more investigation.
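For the Linux side, one possible way to find the nodes that actually have memory is to read the per-node `meminfo` files under sysfs. The sketch below is an assumption about how this could be done, not what y-cruncher does: it assumes the usual `/sys/devices/system/node/nodeN/meminfo` layout with a `Node N MemTotal: ... kB` line, and the function names are made up for illustration.

```python
import glob
import os
import re

# Matches lines such as "Node 0 MemTotal:       16305380 kB".
_MEMTOTAL = re.compile(r"Node\s+(\d+)\s+MemTotal:\s+(\d+)\s+kB")


def read_sysfs_meminfo(root="/sys/devices/system/node"):
    """Read each node's meminfo text (assumed sysfs layout)."""
    result = {}
    for path in glob.glob(os.path.join(root, "node[0-9]*", "meminfo")):
        node_dir = os.path.basename(os.path.dirname(path))
        node_id = int(node_dir[len("node"):])
        with open(path) as handle:
            result[node_id] = handle.read()
    return result


def nodes_with_memory(meminfo_by_node):
    """Return the sorted node ids whose MemTotal is greater than zero."""
    usable = []
    for node_id, text in sorted(meminfo_by_node.items()):
        match = _MEMTOTAL.search(text)
        if match and int(match.group(2)) > 0:
            usable.append(node_id)
    return usable


# Made-up sample resembling a 4-node system where only 2 nodes have memory.
sample = {
    0: "Node 0 MemTotal:       16305380 kB\n",
    1: "Node 1 MemTotal:              0 kB\n",
    2: "Node 2 MemTotal:       16512296 kB\n",
    3: "Node 3 MemTotal:              0 kB\n",
}
print(nodes_with_memory(sample))
```

On a real machine, `nodes_with_memory(read_sysfs_meminfo())` would give the candidate list to interleave across, assuming the sysfs layout above is present.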
A temporary work-around is to manually select the NUMA nodes in the memory allocator. You will need to experiment to see which 2 of the 4 nodes are the ones with memory.
I rolled out v0.7.6.9488 yesterday which will disregard NUMA nodes that have no memory.
This has been tested on Windows using an artificial environment. If the cause of the bug is as described above, then this should be fixed for Windows.
On Linux, the fix remains completely untested, so I'm less confident it works there.