
MPI-PR: quadratic scaling of the number of open file descriptors with respect to the number of processes/node #257

Open
edoapra opened this issue Apr 4, 2022 · 8 comments

@edoapra
Contributor

edoapra commented Apr 4, 2022

Since every shared-memory allocation in MPI-PR opens a memory-mapped file for each rank, and each rank needs to know the file descriptors of all the ranks on the same node, you end up with (procs per node)^2 file descriptors opened for every shared-memory allocation.
Since 128-core hardware is becoming commonplace and 128*128 = 16K, we have already seen reports of Global Arrays runs that required increasing the kernel limit /proc/sys/fs/file-max to values of O(10^6)-O(10^7).

https://groups.google.com/g/nwchem-forum/c/Q-qvcHP9vP4
nwchemgit/nwchem#338
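
To make the arithmetic concrete, here is a minimal MPI + POSIX shared-memory sketch of the pattern described above (an illustration with a made-up segment naming scheme and no error checking, not the actual comex code). Each of the N on-node ranks creates one segment per allocation and then maps the segment of every peer, so a single allocation costs N*N shm_open() calls per node; with N = 128 that is the 16K figure above. Compile with mpicc (older glibc may also need -lrt).

```c
/* Illustration only: with N ranks per node, one "allocation" results in
 * N segments, each opened and mapped by all N ranks -> N*N shm_open() calls. */
#include <fcntl.h>
#include <mpi.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    MPI_Comm node;                       /* ranks sharing this node */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node);
    int me, np;
    MPI_Comm_rank(node, &me);
    MPI_Comm_size(node, &np);

    const size_t bytes = 1 << 20;        /* one 1 MiB "allocation" */
    char name[64];

    /* each rank creates its own backing segment ... */
    snprintf(name, sizeof(name), "/fd_demo_%d", me);   /* hypothetical name */
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, (off_t)bytes);
    char *mine = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                           /* descriptor closed once mapped */
    mine[0] = (char)me;
    MPI_Barrier(node);

    /* ... and then attaches to every peer's segment: N ranks x N segments */
    for (int r = 0; r < np; r++) {
        snprintf(name, sizeof(name), "/fd_demo_%d", r);
        int pfd = shm_open(name, O_RDWR, 0600);
        void *p = mmap(NULL, bytes, PROT_READ, MAP_SHARED, pfd, 0);
        close(pfd);
        (void)p;                         /* real code would keep this pointer */
    }

    MPI_Barrier(node);
    snprintf(name, sizeof(name), "/fd_demo_%d", me);
    shm_unlink(name);                    /* clean up own segment */
    MPI_Finalize();
    return 0;
}
```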

Can we try to address this from the GA side?
Possible solutions that come to mind (I have no idea about their feasibility):

  • Disable shared memory
  • Split the physical node into smaller "virtual nodes" (possibly aligned with NUMA or socket domains)
@bjpalmer
Member

bjpalmer commented Apr 4, 2022

We definitely need to hold on to shared memory. The alternative would be to run everything through the progress rank, which would hit performance, probably significantly.

Have you tried increasing the number of progress ranks? There is a good chance this would be equivalent to creating virtual nodes. If it isn't, then we could probably fix it up so that it is. The variable to set is GA_NUM_PROGRESS_RANKS_PER_NODE.

@jeffhammond
Member

It's possible to allocate one SHM slab per GA. MPI RMA does this under the hood.
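
For reference, a minimal sketch of the MPI-3 shared-memory window route being alluded to here, where each allocation is a single node-wide slab exposed through one window rather than one per-rank segment (an illustration of the idea, not how comex/GA is implemented):

```c
/* Sketch of "one slab per allocation" via MPI-3 shared-memory windows. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    MPI_Comm node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node);

    /* One window per "allocation": each rank contributes its local part of a
     * single node-wide slab instead of creating its own segment. */
    MPI_Aint mybytes = 1000 * sizeof(double);
    double *mybase;
    MPI_Win win;
    MPI_Win_allocate_shared(mybytes, sizeof(double), MPI_INFO_NULL, node,
                            &mybase, &win);

    /* A peer's portion is reached through an address query on the same
     * window, not by opening additional per-rank files. */
    MPI_Aint sz;
    int disp;
    double *rank0_base;
    MPI_Win_shared_query(win, 0, &sz, &disp, &rank0_base);
    (void)sz; (void)disp; (void)rank0_base;   /* unused in this sketch */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

How many descriptors this actually consumes is up to the MPI implementation, but the user-visible pattern is one slab per GA rather than one segment per rank.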

@edoapra
Contributor Author

edoapra commented Apr 4, 2022

> Have you tried increasing the number of progress ranks? There is a good chance this would be equivalent to creating virtual nodes. If it isn't, then we could probably fix it up so that it is. The variable to set is GA_NUM_PROGRESS_RANKS_PER_NODE.

A quick test on 2 nodes x 64 processes/node does not seem to show any change in the high-water mark for the number of open file descriptors, which stays around 960K.
The previous test was still using GA_NUM_PROGRESS_RANKS_PER_NODE=1 (my bad).

@bjpalmer
Member

bjpalmer commented Apr 4, 2022

Okay, I'll take a look. How many progress ranks per node were you using? My guess is that if you double the number of progress ranks per node, it should be possible to decrease the number of file descriptors by a factor of 4.

@edoapra
Contributor Author

edoapra commented Apr 5, 2022

> Okay, I'll take a look. How many progress ranks per node were you using? My guess is that if you double the number of progress ranks per node, it should be possible to decrease the number of file descriptors by a factor of 4.

Now that I am correctly setting GA_NUM_PROGRESS_RANKS_PER_NODE, I am getting a roughly linear decrease in the number of file descriptors. Could it be that every process on a node still creates its own mapped file, but only sees the file descriptors of its own sub-group?

Results for a single-node run with 128 processes:

| GA_NUM_PROGRESS_RANKS_PER_NODE | no. of file descriptors |
| --- | --- |
| 1 | 4,505,000 |
| 2 | 2,276,000 |
| 4 | 1,163,000 |
| 8 | 607,000 |

@bjpalmer
Member

bjpalmer commented Apr 5, 2022

So what happens when you increase the number of progress ranks?

@bjpalmer
Member

bjpalmer commented Apr 5, 2022

You are correct. Each process creates its own mapped file but only sees the other processes in its own subgroup, so I guess a linear decrease is what you would expect. Is this good enough? The only other possibility would be to have one process do a single large allocation and then divide that up among all other processes. That would probably take some significant effort.
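
A rough way to see the expected scaling, assuming each of the $P$ progress-rank subgroups behaves like an independent virtual node of $N/P$ ranks that still creates one mapped file per rank:

$$P \times \left(\frac{N}{P}\right)^2 = \frac{N^2}{P}$$

file descriptors per node for each allocation, so doubling $P$ roughly halves the total rather than quartering it, which is consistent with the ~2x drop per row in the table above.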

Also, it looks like the code opens a shared memory segment (which creates a file descriptor), gets the pointer using mmap, and then closes the file descriptor, so the descriptors are not hanging around for any great length of time. Do your numbers reflect the total number of file descriptors created or the maximum open at any one time?

@edoapra
Contributor Author

edoapra commented Apr 5, 2022

> You are correct. Each process creates its own mapped file but only sees the other processes in its own subgroup, so I guess a linear decrease is what you would expect. Is this good enough? The only other possibility would be to have one process do a single large allocation and then divide that up among all other processes. That would probably take some significant effort.

This is probably good enough for the time being.
If we try the option of a single large shared allocation, we would end up doing something similar to what we had in the old SysV shared-memory code, with all the associated bugs that came with it ...

On a related topic, I still don't understand how the comex mpi-pr code decides the size and number of allocations. I have edited the part of the NWChem code I am using for this experiment to drastically reduce the number of ga_create() calls (and ga_destroy() calls, of course). While the wall time decreased, I have not seen any significant change in the number of file descriptors used (it might well be that I am doing something silly on the NWChem side, though).

> Also, it looks like the code opens a shared memory segment (which creates a file descriptor), gets the pointer using mmap, and then closes the file descriptor, so the descriptors are not hanging around for any great length of time. Do your numbers reflect the total number of file descriptors created or the maximum open at any one time?

The number of file descriptors I am quoting is the output of lsof. That seems to reflect what the kernel complains about (i.e., the count that should not exceed /proc/sys/fs/file-max):

ssh compute-node /usr/sbin/lsof | grep nwchem | grep cmx | wc -l
