
MPI-PR: quadratic scaling of the number of open file descriptors with respect to the number of processes/node #257

Open
edoapra opened this issue Apr 4, 2022 · 8 comments

@edoapra
Contributor

edoapra commented Apr 4, 2022

Since every shared-memory allocation in MPI-PR opens a memory-mapped file for each rank, and each rank needs to know the file descriptors of all the ranks on the same node, you end up with (procs per node)^2 file descriptors opened for every shared-memory allocation.
Since 128-core hardware is becoming commonplace and 128*128 = 16K, we have already seen reports of Global Arrays runs that required increasing the kernel limit /proc/sys/fs/file-max to values of O(10^6)-O(10^7).

https://groups.google.com/g/nwchem-forum/c/Q-qvcHP9vP4
nwchemgit/nwchem#338
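
To make the arithmetic concrete, here is a minimal MPI + POSIX shared-memory sketch of the pattern described above (an illustration with a made-up segment naming scheme and no error checking, not the actual comex code). Each of the N on-node ranks creates one segment per allocation and then maps the segment of every peer, so a single allocation costs N*N shm_open() calls per node; with N = 128 that is the 16K figure above. Compile with mpicc (older glibc may also need -lrt).

```c
/* Illustration only: with N ranks per node, one "allocation" results in
 * N segments, each opened and mapped by all N ranks -> N*N shm_open() calls. */
#include <fcntl.h>
#include <mpi.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    MPI_Comm node;                       /* ranks sharing this node */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node);
    int me, np;
    MPI_Comm_rank(node, &me);
    MPI_Comm_size(node, &np);

    const size_t bytes = 1 << 20;        /* one 1 MiB "allocation" */
    char name[64];

    /* each rank creates its own backing segment ... */
    snprintf(name, sizeof(name), "/fd_demo_%d", me);   /* hypothetical name */
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, (off_t)bytes);
    char *mine = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                           /* descriptor closed once mapped */
    mine[0] = (char)me;
    MPI_Barrier(node);

    /* ... and then attaches to every peer's segment: N ranks x N segments */
    for (int r = 0; r < np; r++) {
        snprintf(name, sizeof(name), "/fd_demo_%d", r);
        int pfd = shm_open(name, O_RDWR, 0600);
        void *p = mmap(NULL, bytes, PROT_READ, MAP_SHARED, pfd, 0);
        close(pfd);
        (void)p;                         /* real code would keep this pointer */
    }

    MPI_Barrier(node);
    snprintf(name, sizeof(name), "/fd_demo_%d", me);
    shm_unlink(name);                    /* clean up own segment */
    MPI_Finalize();
    return 0;
}
```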

Can we try to address this from the GA side?
Possible solutions that come to mind (I have no idea about their feasibility):

  • Disable shared memory
  • Split the physical node into smaller "virtual nodes" (possibly aligned with NUMA or socket domains)
@bjpalmer
Member

bjpalmer commented Apr 4, 2022

We definitely need to hold on to shared memory. The alternative would be to run everything through the progress rank, which would hit performance, probably significantly.

Have you tried increasing the number of progress ranks? There is a good chance this would be equivalent to creating virtual nodes. If it isn't, then we could probably fix it up so that it is. The variable to set is GA_NUM_PROGRESS_RANKS_PER_NODE.

@jeffhammond
Member

It's possible to allocate one SHM slab per GA. MPI RMA does this under the hood.
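
For reference, a minimal sketch of the MPI-3 shared-memory window route being alluded to here, where each allocation is a single node-wide slab exposed through one window rather than one per-rank segment (an illustration of the idea, not how comex/GA is implemented):

```c
/* Sketch of "one slab per allocation" via MPI-3 shared-memory windows. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    MPI_Comm node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node);

    /* One window per "allocation": each rank contributes its local part of a
     * single node-wide slab instead of creating its own segment. */
    MPI_Aint mybytes = 1000 * sizeof(double);
    double *mybase;
    MPI_Win win;
    MPI_Win_allocate_shared(mybytes, sizeof(double), MPI_INFO_NULL, node,
                            &mybase, &win);

    /* A peer's portion is reached through an address query on the same
     * window, not by opening additional per-rank files. */
    MPI_Aint sz;
    int disp;
    double *rank0_base;
    MPI_Win_shared_query(win, 0, &sz, &disp, &rank0_base);
    (void)sz; (void)disp; (void)rank0_base;   /* unused in this sketch */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

How many descriptors this actually consumes is up to the MPI implementation, but the user-visible pattern is one slab per GA rather than one segment per rank.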

@edoapra
Contributor Author

edoapra commented Apr 4, 2022

> Have you tried increasing the number of progress ranks? There is a good chance this would be equivalent to creating virtual nodes. If it isn't, then we could probably fix it up so that it is. The variable to set is GA_NUM_PROGRESS_RANKS_PER_NODE.

A quick test on 2 nodes x 64 processes/node does not seem to show any change in the high-water mark for the number of open file descriptors, which stays around 960K.
The previous test was still using GA_NUM_PROGRESS_RANKS_PER_NODE=1 (my bad).

@bjpalmer
Member

bjpalmer commented Apr 4, 2022

Okay, I'll take a look. How many progress ranks per node were you using? My guess is that if you double the number of progress ranks per node, it should be possible to decrease the number of file descriptors by a factor of 4.

@edoapra
Contributor Author

edoapra commented Apr 5, 2022

> Okay, I'll take a look. How many progress ranks per node were you using? My guess is that if you double the number of progress ranks per node, it should be possible to decrease the number of file descriptors by a factor of 4.

Now that I am correctly setting GA_NUM_PROGRESS_RANKS_PER_NODE, I am getting a roughly linear decrease in the number of file descriptors. Could it be that every process on a node still creates its own mapped file, but only sees the file descriptors of its own sub-group?

Results for a single-node run with 128 processes:

| GA_NUM_PROGRESS_RANKS_PER_NODE | no. of file descriptors |
| --- | --- |
| 1 | 4,505,000 |
| 2 | 2,276,000 |
| 4 | 1,163,000 |
| 8 | 607,000 |

@bjpalmer
Member

bjpalmer commented Apr 5, 2022

So what happens when you increase the number of progress ranks?

@bjpalmer
Member

bjpalmer commented Apr 5, 2022

You are correct. Each process creates its own mapped file but only sees the other processes in its own subgroup, so I guess a linear decrease is what you would expect. Is this good enough? The only other possibility would be to have one process do a single large allocation and then divide that up among all other processes. That would probably take some significant effort.
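
A rough way to see the expected scaling, assuming each of the $P$ progress-rank subgroups behaves like an independent virtual node of $N/P$ ranks that still creates one mapped file per rank:

$$P \times \left(\frac{N}{P}\right)^2 = \frac{N^2}{P}$$

file descriptors per node for each allocation, so doubling $P$ roughly halves the total rather than quartering it, which is consistent with the ~2x drop per row in the table above.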

Also, it looks like the code opens a shared memory segment (which creates a file descriptor), gets the pointer using mmap, and then closes the file descriptor, so the descriptors are not hanging around for any great length of time. Do your numbers reflect the total number of file descriptors created or the maximum open at any one time?

@edoapra
Contributor Author

edoapra commented Apr 5, 2022

> You are correct. Each process creates its own mapped file but only sees the other processes in its own subgroup, so I guess a linear decrease is what you would expect. Is this good enough? The only other possibility would be to have one process do a single large allocation and then divide that up among all other processes. That would probably take some significant effort.

This is probably good enough for the time being.
If we try the option of a single large shared allocation, we would end up doing something similar to what we had in the old SysV shared-memory code, with all the associated bugs that came with it ...

On a related topic, I still don't understand how the comex mpi-pr code decides the size and number of allocations. I have edited the part of the NWChem code I am using for this experiment to drastically reduce the number of ga_create() calls (and ga_destroy() calls, of course). While the wall time decreased, I have not seen any significant change in the number of file descriptors used (it might well be that I am doing something silly on the NWChem side, though).

> Also, it looks like the code opens a shared memory segment (which creates a file descriptor), gets the pointer using mmap, and then closes the file descriptor, so the descriptors are not hanging around for any great length of time. Do your numbers reflect the total number of file descriptors created or the maximum open at any one time?

The number of file descriptors I am quoting is the output of lsof. That seems to reflect what the kernel complains about (i.e., the count that should not exceed /proc/sys/fs/file-max):

ssh compute-node /usr/sbin/lsof | grep nwchem | grep cmx | wc -l
