
(Still) Excessive memory usage #118

Open
fstein93 opened this issue Aug 15, 2022 · 3 comments

Comments

@fstein93

fstein93 commented Aug 15, 2022

Dear authors,

I am one of the CP2K developers, working on our quartically-scaling SOS-MP2 and RPA implementations. Marko Kabic used energy-only RPA calculations to benchmark COSMA (test system: 128 water molecules).
I am currently implementing gradients for these methods. I know that my gradient implementation (available in the CP2K master trunk) requires roughly 3-4 times the memory of an energy-only calculation. I am testing the code on the GPU partition of Daint. The code runs well with ScaLAPACK (libsci_acc). With COSMA, I can run a smaller system (up to 64 water molecules) and see a decent speedup of the PDGEMM calls compared to ScaLAPACK. Unfortunately, I cannot run larger systems (such as 128 water molecules) even on 1000 nodes.

A gradient calculation consists of two PDGEMM calls with the following global sizes in the case of 128 H2O molecules:

  1. n=m=17,408 and k=3,473,408 (also in case of energy-only calculations)
  2. n=3,473,408 and m=k=17,408 (not required in case of energy-only calculations).
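For reference, the dense storage required by these two calls can be estimated directly from the shapes listed above (a back-of-the-envelope sketch, assuming double precision and counting only A, B, and C, not any library-internal buffers):

```python
# Rough estimate of the dense storage for the two PDGEMM calls above,
# assuming double precision (8 bytes per element). This counts only the
# three distributed matrices, not COSMA's or ScaLAPACK's work buffers.

BYTES_PER_ELEMENT = 8  # double precision

def pdgemm_bytes(m: int, n: int, k: int) -> int:
    """Bytes to hold A (m x k), B (k x n), and C (m x n) densely."""
    return (m * k + k * n + m * n) * BYTES_PER_ELEMENT

# Call 1: n = m = 17,408 and k = 3,473,408 (energy and gradients)
call1 = pdgemm_bytes(17_408, 17_408, 3_473_408)
# Call 2: n = 3,473,408 and m = k = 17,408 (gradients only)
call2 = pdgemm_bytes(17_408, 3_473_408, 17_408)

for name, total in (("call 1", call1), ("call 2", call2)):
    print(f"{name}: {total / 1e9:.1f} GB for A, B and C combined")
```

Each call needs roughly 1 TB just for the matrices, so any constant-factor overhead in the communication buffers translates into hundreds of extra gigabytes across the machine.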

Depending on the setup, I observe out-of-memory events both on the GPU and on the CPU when COSMA is called.

My questions are:

  1. What are COSMA's memory requirements, or at least what scaling behavior should I expect?
  2. Could you add a hint that displays the actual amount of missing memory whenever COSMA is able to catch the OOM event?
  3. Could you provide a function that asks COSMA to release its buffers, so that COSMA's idle resources can be used for other operations?

EDIT:
I can run energy-only calculations with 128 water molecules (just PDGEMM call 1) on 64 nodes. I can run the gradient calculation on 2048 Daint nodes. Nevertheless, the memory requirements are extremely high, and it is very frustrating (and a waste of resources) to find a suitable number of nodes for a given calculation.

EDIT2:
The calculation with COSMA on 2048 nodes requires 3 times the resources of the same calculation with ScaLAPACK on 128 nodes.

@airmler

airmler commented Nov 25, 2022

I am not a COSMA developer, but I can offer some advice: simply set
`export COSMA_CPU_MAX_MEMORY=XXX`
to a value around 2-3 times what you need to store the matrices. This should be enough to find a reasonable setting for COSMA, and you should outperform ScaLAPACK (at least for the large-k case).
ScaLAPACK should need roughly twice the memory, as it uses the SUMMA algorithm.
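A small helper to derive such a cap (a sketch of the advice above, not part of the COSMA API; the 2.5x factor follows the 2-3x suggestion, and the megabytes-per-rank unit for `COSMA_CPU_MAX_MEMORY` is an assumption you should verify against the COSMA README):

```python
# Sketch: derive a COSMA_CPU_MAX_MEMORY value as ~2.5x the per-rank storage
# of the matrices, following the advice above. The megabytes-per-rank unit
# is an ASSUMPTION here -- check the COSMA documentation for the exact unit.

def suggested_cap_mb(m: int, n: int, k: int, ranks: int, factor: float = 2.5) -> int:
    """Suggested per-rank memory cap in MB (assumed unit) for one PDGEMM."""
    elements = m * k + k * n + m * n          # A, B and C, double precision
    bytes_per_rank = elements * 8 / ranks
    return int(factor * bytes_per_rank / 1e6)

# Example: the large-k multiplication from the issue on 128 ranks
cap = suggested_cap_mb(17_408, 17_408, 3_473_408, ranks=128)
print(f"export COSMA_CPU_MAX_MEMORY={cap}")
```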

@fstein93
Author

It does not help with the default settings. I could get it running by simply setting `COSMA_ADAPT_STRATEGY=OFF`, but I wonder why the default strategy does not handle this case properly.

@ajaypanyala

I have the same issue with GPU runs on NERSC Perlmutter. I am running the COSMA matrix-multiply miniapp with m=n=k=25000. It fails with OOM errors even on 100 nodes. I built COSMA with the regular CUDA options (no NCCL or GPU-aware MPI).
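For scale (a quick sanity check, not a statement from the COSMA developers): the matrices in this miniapp run are tiny compared to the aggregate memory of 100 nodes, which suggests the OOM comes from internal buffers rather than the matrices themselves.

```python
# Baseline storage for the miniapp case m = n = k = 25,000 in double
# precision: three 25,000 x 25,000 matrices (A, B and C).

m = n = k = 25_000
total_gb = 3 * m * n * 8 / 1e9
per_node_gb = total_gb / 100  # spread over 100 nodes

print(f"total: {total_gb:.1f} GB, per node: {per_node_gb:.2f} GB")
# prints: total: 15.0 GB, per node: 0.15 GB
```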
