
(Still) Excessive memory usage #118

Open
fstein93 opened this issue Aug 15, 2022 · 3 comments

Comments

@fstein93

fstein93 commented Aug 15, 2022

Dear authors,

I am one of the CP2K developers, working on our quartically-scaling SOS-MP2 and RPA implementations. Marko Kabic used energy-only RPA calculations to benchmark COSMA (test system: 128 water molecules).
I am currently implementing gradients for these methods. I know that my gradient implementation (available in the CP2K master trunk) requires roughly 3-4 times the memory of an energy-only calculation. I am testing the code on the GPU partition of Daint. The code runs well with ScaLAPACK (libsci_acc). With COSMA, I can run a smaller system (up to 64 water molecules) and see a decent speedup of the PDGEMM calls compared to ScaLAPACK. Unfortunately, I cannot run larger systems (such as 128 water molecules) even on 1000 nodes.

A gradient calculation consists of two PDGEMM calls with the following global sizes in the case of 128 H2O molecules:

  1. n=m=17,408 and k=3,473,408 (also in case of energy-only calculations)
  2. n=3,473,408 and m=k=17,408 (not required in case of energy-only calculations).
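For reference, the dense storage required by these two calls can be estimated directly from the shapes listed above (a back-of-the-envelope sketch, assuming double precision and counting only A, B, and C, not any library-internal buffers):

```python
# Rough estimate of the dense storage for the two PDGEMM calls above,
# assuming double precision (8 bytes per element). This counts only the
# three distributed matrices, not COSMA's or ScaLAPACK's work buffers.

BYTES_PER_ELEMENT = 8  # double precision

def pdgemm_bytes(m: int, n: int, k: int) -> int:
    """Bytes to hold A (m x k), B (k x n), and C (m x n) densely."""
    return (m * k + k * n + m * n) * BYTES_PER_ELEMENT

# Call 1: n = m = 17,408 and k = 3,473,408 (energy and gradients)
call1 = pdgemm_bytes(17_408, 17_408, 3_473_408)
# Call 2: n = 3,473,408 and m = k = 17,408 (gradients only)
call2 = pdgemm_bytes(17_408, 3_473_408, 17_408)

for name, total in (("call 1", call1), ("call 2", call2)):
    print(f"{name}: {total / 1e9:.1f} GB for A, B and C combined")
```

Each call needs roughly 1 TB just for the matrices, so any constant-factor overhead in the communication buffers translates into hundreds of extra gigabytes across the machine.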

Depending on the setup, I observe out-of-memory events both on the GPU and on the CPU when COSMA is called.

My questions are:

  1. What are COSMA's memory requirements, or at least what scaling behavior should I expect?
  2. Could you add a hint that displays the actual amount of missing memory whenever COSMA is able to catch the OOM event?
  3. Could you provide a function that asks COSMA to release its buffers, so that COSMA's idle resources can be used for other operations?

EDIT:
I can run energy-only calculations with 128 water molecules (just PDGEMM call 1) on 64 nodes. I can run the gradient calculation on 2048 Daint nodes. Nevertheless, the memory requirements are extremely high, and it is very frustrating (and a waste of resources) to find a suitable number of nodes for a given calculation.

EDIT2:
The calculation with COSMA on 2048 nodes requires 3 times the resources of the same calculation with ScaLAPACK on 128 nodes.

@airmler

airmler commented Nov 25, 2022

I am not a COSMA developer, but I can offer some advice: simply set
`export COSMA_CPU_MAX_MEMORY=XXX`
to a value around 2-3 times what you need to store the matrices. This should be enough to find a reasonable setting for COSMA, and you should outperform ScaLAPACK (at least for the large-k case).
ScaLAPACK should need roughly twice the memory, as it uses the SUMMA algorithm.
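A small helper to derive such a cap (a sketch of the advice above, not part of the COSMA API; the 2.5x factor follows the 2-3x suggestion, and the megabytes-per-rank unit for `COSMA_CPU_MAX_MEMORY` is an assumption you should verify against the COSMA README):

```python
# Sketch: derive a COSMA_CPU_MAX_MEMORY value as ~2.5x the per-rank storage
# of the matrices, following the advice above. The megabytes-per-rank unit
# is an ASSUMPTION here -- check the COSMA documentation for the exact unit.

def suggested_cap_mb(m: int, n: int, k: int, ranks: int, factor: float = 2.5) -> int:
    """Suggested per-rank memory cap in MB (assumed unit) for one PDGEMM."""
    elements = m * k + k * n + m * n          # A, B and C, double precision
    bytes_per_rank = elements * 8 / ranks
    return int(factor * bytes_per_rank / 1e6)

# Example: the large-k multiplication from the issue on 128 ranks
cap = suggested_cap_mb(17_408, 17_408, 3_473_408, ranks=128)
print(f"export COSMA_CPU_MAX_MEMORY={cap}")
```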

@fstein93
Author

It does not help with the default settings. I could get it running by simply setting `COSMA_ADAPT_STRATEGY=OFF`, but I wonder why the default strategy does not handle this case properly.

@ajaypanyala

I have the same issue with GPU runs on NERSC Perlmutter. I am running the COSMA matrix-multiply miniapp with m=n=k=25000. It fails with OOM errors even on 100 nodes. I built COSMA with the regular CUDA options (no NCCL or GPU-aware MPI).
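For scale (a quick sanity check, not a statement from the COSMA developers): the matrices in this miniapp run are tiny compared to the aggregate memory of 100 nodes, which suggests the OOM comes from internal buffers rather than the matrices themselves.

```python
# Baseline storage for the miniapp case m = n = k = 25,000 in double
# precision: three 25,000 x 25,000 matrices (A, B and C).

m = n = k = 25_000
total_gb = 3 * m * n * 8 / 1e9
per_node_gb = total_gb / 100  # spread over 100 nodes

print(f"total: {total_gb:.1f} GB, per node: {per_node_gb:.2f} GB")
# prints: total: 15.0 GB, per node: 0.15 GB
```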
