
ERROR: CudaCustomAllocator out of memory in SPA using Relion 5.0-beta-0-commit-90d239 #1075

Closed
oleuns opened this issue Feb 1, 2024 · 1 comment


oleuns commented Feb 1, 2024

Describe your problem

Hey,
I encountered a CUDA memory allocation issue during both 3D classification and 3D auto-refinement, with similar error messages in each case. In my hands the error occurs intermittently, at a seemingly random iteration of the run. I have seen it on two different workstations with two different particle sets.

Environment:

  • OS: Ubuntu 22.04.2 LTS (both workstations)
  • MPI runtime: I am not sure
  • RELION version: 5.0-beta-0-commit-90d239 (both workstations)
  • Memory: 128 GB / 512 GB
  • GPU: Quadro RTX 5000 / RTX 3080 Ti

Dataset:

  • Box size: 220 px / 350 px
  • Pixel size: 1.0152 Å/px (both datasets)
  • Number of particles: 70,000 / 130,000
  • Description: monomeric protein, around 120 kDa / dimeric protein, around 140 kDa

Job options:

  • Type of job: 3D Auto-Refine
  • Number of MPI processes: 3
  • Number of threads: 10
  • Full command (see note.txt in the job directory):

++++ with the following command(s):
which relion_refine_mpi --o Refine3D/job043/run --auto_refine --split_random_halves --blush --i Select/job033/particles.star --ref Class3D/job028/run_it050_class005.mrc --ini_high 30 --dont_combine_weights_via_disc --scratch_dir /media/supervisor/DATA/Test --pool 10 --pad 2 --ctf --particle_diameter 140 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale --j 16 --gpu "2" --reuse_scratch --pipeline_control Refine3D/job043/
++++

Error message:

ERROR: CudaCustomAllocator out of memory
[requestedSpace: 64752640 B]
[largestContinuousFreeSpace: 42727424 B]
[totalFreeSpace: 166338048 B]
(194048B) (195584B) [512B] (512B) (512B) [512B] (512B) (512B) (512B) (512B) [512B] (512B) [512B] (512B) [1024B] (512B) (512B) (512B) [1024B] (512B) (512B) (512B) [512B] (512B) (512B) [512B] (512B) [512B] (512B) (1024B) [512B] (512B) (512B) [512B] (1024B) (2048B) (1024B) [1536B] (1536B) (512B) [512B] (512B) (512B) (512B) [3072B] (512B) (13312B) [7680B] (4096B) (4096B) (2048B) (13312B) [58368B] (36864B) (36864B) [5120B] (6656B) [27648B] (3072B) [5632B] (6656B) [1536B] (3072B) [43520B] (56832B) (10240B) (10240B) [9216B] (194048B) (195584B) (113152B) [156672B] (80896B) (113152B) [6144B] (21504B) (43008B) (43008B) (43008B) (36864B) (36864B) (36864B) (36864B) [43520B] (43008B) (36864B) [22016B] (194048B) [99840B] (36864B) [10752B] (113152B) (113152B) [397824B] (36864B) [22016B] (194048B) (147456B) [46592B] (195584B) (194048B) [92672B] (194048B) (195584B) (195584B) [99328B] (194048B) (195584B) (161280B) [91648B] (194048B) (147456B) [46592B] (147456B) [147456B] (161280B) (161280B) (147456B) (147456B) (147456B) [124928B] (147456B) (161280B) (294912B) (147456B) (194048B) [562176B] (147456B) [163840B] (195584B) [636416B] (195584B) [397824B] (294912B) [291328B] (194048B) [147456B] (195584B) (294912B) (294912B) [459264B] (1903616B) [142848B] (3096576B) (294912B) (294912B) [294400B] (194048B) (195584B) (1903616B) (294912B) (294912B) [290816B] (194048B) (195584B) [294912B] (294912B) [119808B] (194048B) (195584B) (294912B) (294912B) (294912B) [440320B] (194048B) (194048B) (195584B) (195584B) (194048B) (195584B) [1140736B] (194048B) (195584B) [1330176B] (3096576B) (1327104B) (1327104B) [4610048B] (1327104B) [1511424B] (3096576B) (3082240B) [1680384B] (3096576B) (3096576B) [3011584B] (3096576B) (3096576B) [42727424B] (3096576B) [1004544B] (67702784B) (135405568B) (135405568B) (135405568B) (135405568B) (3096576B) (3096576B) [37564416B] (3082240B) [14336B] (65076736B) (88330752B) (130152960B) (19625472B) (130152960B) (176660992B) (19625472B) (176660992B) (130152960B) (130152960B) (176660992B) (176660992B) (3807232B) (3096576B) (3807232B) (3096576B) (3096576B) [5833216B] (9789952B) (19579392B) (19579392B) (19579392B) (19579392B) (31375872B) (31375872B) (16241664B) (16241664B) (39250944B) (39250944B) (32483328B) (32483328B) (3096576B) [30427648B] (62751232B) (62751232B) [29662208B] = 2725961728B

RELION version: 5.0-beta-0-commit-90d239
exiting with an error ...

hwloc/linux: Ignoring PCI device with non-16bit domain.
Pass --enable-32bits-pci-domain to configure to support such devices
(warning: it would break the library ABI, don't enable unless really needed).
in: /home/supervisor/relion/src/acc/cuda/custom_allocator.cuh, line 539
ERROR:

You ran out of memory on the GPU(s).

Each MPI-rank running on a GPU increases the use of GPU-memory. Relion
tries to distribute load over multiple GPUs to increase performance,
but doing this in a general and memory-efficient way is difficult.

  1. Check the device-mapping presented at the beginning of each run,
    and be particularly wary of 'device X is split between N followers', which
    will result in a higher memory cost on GPU X. In classifications, GPU-
    sharing between MPI-ranks is typically fine, whereas it will usually
    cause out-of-memory during the last iteration of high-resolution refinement.

  2. If you are not GPU-sharing across MPI-follower ranks, then you might be using a
    too-big box-size for the GPU memory. Currently, N-pixel particle images
    will require roughly

         (1.1e-8)*(N*2)^3  GB  
    

    of memory (per rank) during the final iteration of refinement (using
    single-precision GPU code, which is default). 450-pixel images can therefore
    just about fit into a GPU with 8GB of memory, since 11*(450*2)^3 ~= 8.02
    During classifications, resolution is typically lower and N is suitably
    reduced, which means that memory use is much lower.

  3. If the above estimation fits onto (all of) your GPU(s), you may have
    a very large number of orientations which are found as possible during
    the expectation step, which results in large arrays being needed on the
    GPU. If this is the case, you should find large (>10'000) values of
    '_rlnNrOfSignificantSamples' in your _data.star output files. You can try
    adding the --maxsig <P> flag, where P is an integer limit, but you
    should probably also consult expertise or re-evaluate your data and/or
    input reference. Seeing large such values means relion is finding nothing
    to align.
    If none of the above applies, please report the error to the relion
    developers at github.com/3dem/relion/issues.

in: /home/supervisor/relion/src/acc/cuda/custom_allocator.cuh, line 539
ERROR: (same out-of-memory message as above)

follower 2 encountered error: === Backtrace ===
/home/supervisor/relion/build/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x7d) [0x55584e7c528d]
/home/supervisor/relion/build/bin/relion_refine_mpi(+0xe4ec4) [0x55584e79bec4]
/home/supervisor/relion/build/bin/relion_refine_mpi(+0x3770cd) [0x55584ea2e0cd]
/lib/x86_64-linux-gnu/libgomp.so.1(+0x1dc0e) [0x7f1326665c0e]
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f1325894ac3]
/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f1325926850]

ERROR: (same out-of-memory message as above)


MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
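For reference, here is the back-of-the-envelope arithmetic from point 2 of the message, applied to the two box sizes in this report (220 px and 350 px) together with the sharing factor implied by the command above (three MPI ranks with --gpu "2" puts both follower ranks on device 2). The sketch below only evaluates the formula quoted in the error text; it is not anything RELION itself computes here. Note also that the allocator dump shows the failing request (64,752,640 B) is larger than the largest contiguous free block (42,727,424 B) even though ~166 MB is free in total, so the pool is fragmented as well as tight.

```python
# Sketch only: per-rank GPU memory estimate quoted in the RELION error message,
# roughly (1.1e-8) * (2*N)^3 GB during the final refinement iteration
# (single-precision GPU code). Box sizes and the two-followers-per-GPU figure
# are taken from this report, not queried from RELION.

PER_RANK_CONST = 1.1e-8      # GB per (2*N)^3, as printed in the error text
FOLLOWERS_PER_GPU = 2        # 3 MPI ranks with --gpu "2": both followers share device 2

for box in (220, 350):       # the two datasets described above
    per_rank = PER_RANK_CONST * (2 * box) ** 3
    print(f"box {box} px: ~{per_rank:.1f} GB per rank, "
          f"~{per_rank * FOLLOWERS_PER_GPU:.1f} GB on the shared GPU")
```

For the 350 px box this works out to roughly 3.8 GB per rank, about 7.5 GB on the shared device before any other buffers, which fits the warning in point 1 that GPU sharing between MPI ranks tends to bite in the last, high-resolution iterations of a refinement.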

I accidentally deleted the job folders for the 3D classifications. The next time the error appears there, I will post those as well.

Job options:

  • Type of job: 3D Classification
  • Number of MPI processes: 1
  • Number of threads: 16
  • Full command (see note.txt in the job directory):
@biochem-fan (Member) commented:

This happens when you have some really bad particles which are hard to align. There are many discussions about this on the CCPEM mailing list, for example: https://www.jiscmail.ac.uk/cgi-bin/wa-jisc.exe?A2=ind2212&L=CCPEM&P=R43497&K=2 and https://www.jiscmail.ac.uk/cgi-bin/wa-jisc.exe?A2=ind1610&L=CCPEM&P=R52316.
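One way to check for such hard-to-align particles is the _rlnNrOfSignificantSamples column that the error text above points to. Below is a minimal standalone sketch (not part of RELION) that counts how many particles in a *_data.star file exceed the 10,000 threshold quoted in the message; it assumes the usual loop_ layout of RELION STAR files and does nothing more than that.

```python
#!/usr/bin/env python3
"""Minimal sketch (not part of RELION): count particles with large
_rlnNrOfSignificantSamples in a *_data.star file."""
import sys

def significant_samples(star_path):
    col = None                  # 0-based column index, once found in a loop_ header
    in_loop = False
    values = []
    with open(star_path) as fh:
        for raw in fh:
            line = raw.strip()
            if line == "loop_":                      # a new table starts; forget the old column
                in_loop, col = True, None
            elif in_loop and line.startswith("_rln"):
                label, _, idx = line.partition("#")  # header lines look like "_rlnLabel #12"
                if label.strip() == "_rlnNrOfSignificantSamples":
                    col = int(idx) - 1               # STAR column numbers are 1-based
            elif in_loop and line and not line.startswith(("data_", "_", "#")):
                if col is not None:                  # a data row of the table that has our column
                    fields = line.split()
                    if col < len(fields):
                        values.append(float(fields[col]))
    return values

if __name__ == "__main__":
    vals = significant_samples(sys.argv[1])
    big = sum(1 for v in vals if v > 10000)          # threshold quoted in the error message
    print(f"{len(vals)} particles, {big} with more than 10,000 significant samples")
    if vals:
        print(f"max = {max(vals):.0f}, mean = {sum(vals) / len(vals):.1f}")
```

Point it at the _data.star file written by the failing job; if a large fraction of particles is above the threshold, that supports the bad-particle explanation here, in line with the "re-evaluate your data and/or input reference" advice in the error message.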
