
ERROR: CudaCustomAllocator out of memory in SPA using Relion 5.0-beta-0-commit-90d239 #1075

Closed
oleuns opened this issue Feb 1, 2024 · 1 comment


oleuns commented Feb 1, 2024

Describe your problem

Hey,
I encountered a CUDA memory allocation issue during both 3D classification and 3D auto-refinement, with similar error messages in each case. In my hands the error occurs intermittently, at a seemingly random iteration of the run. I have seen it on two different workstations with two different particle sets.

Environment:

  • OS: Ubuntu 22.04.2 LTS (both workstations)
  • MPI runtime: I am not sure
  • RELION version: 5.0-beta-0-commit-90d239 (both workstations)
  • Memory: 128 GB / 512 GB
  • GPU: Quadro RTX 5000 / RTX 3080 Ti

Dataset:

  • Box size: 220 px / 350 px
  • Pixel size: 1.0152 Å/px (both datasets)
  • Number of particles: 70,000 / 130,000
  • Description: monomeric protein, around 120 kDa / dimeric protein, around 140 kDa

Job options:

  • Type of job: 3D Auto-Refine
  • Number of MPI processes: 3
  • Number of threads: 10
  • Full command (see note.txt in the job directory):

++++ with the following command(s):
which relion_refine_mpi --o Refine3D/job043/run --auto_refine --split_random_halves --blush --i Select/job033/particles.star --ref Class3D/job028/run_it050_class005.mrc --ini_high 30 --dont_combine_weights_via_disc --scratch_dir /media/supervisor/DATA/Test --pool 10 --pad 2 --ctf --particle_diameter 140 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale --j 16 --gpu "2" --reuse_scratch --pipeline_control Refine3D/job043/
++++

Error message:

ERROR: CudaCustomAllocator out of memory
[requestedSpace: 64752640 B]
[largestContinuousFreeSpace: 42727424 B]
[totalFreeSpace: 166338048 B]
(194048B) (195584B) [512B] (512B) (512B) [512B] (512B) (512B) (512B) (512B) [512B] (512B) [512B] (512B) [1024B] (512B) (512B) (512B) [1024B] (512B) (512B) (512B) [512B] (512B) (512B) [512B] (512B) [512B] (512B) (1024B) [512B] (512B) (512B) [512B] (1024B) (2048B) (1024B) [1536B] (1536B) (512B) [512B] (512B) (512B) (512B) [3072B] (512B) (13312B) [7680B] (4096B) (4096B) (2048B) (13312B) [58368B] (36864B) (36864B) [5120B] (6656B) [27648B] (3072B) [5632B] (6656B) [1536B] (3072B) [43520B] (56832B) (10240B) (10240B) [9216B] (194048B) (195584B) (113152B) [156672B] (80896B) (113152B) [6144B] (21504B) (43008B) (43008B) (43008B) (36864B) (36864B) (36864B) (36864B) [43520B] (43008B) (36864B) [22016B] (194048B) [99840B] (36864B) [10752B] (113152B) (113152B) [397824B] (36864B) [22016B] (194048B) (147456B) [46592B] (195584B) (194048B) [92672B] (194048B) (195584B) (195584B) [99328B] (194048B) (195584B) (161280B) [91648B] (194048B) (147456B) [46592B] (147456B) [147456B] (161280B) (161280B) (147456B) (147456B) (147456B) [124928B] (147456B) (161280B) (294912B) (147456B) (194048B) [562176B] (147456B) [163840B] (195584B) [636416B] (195584B) [397824B] (294912B) [291328B] (194048B) [147456B] (195584B) (294912B) (294912B) [459264B] (1903616B) [142848B] (3096576B) (294912B) (294912B) [294400B] (194048B) (195584B) (1903616B) (294912B) (294912B) [290816B] (194048B) (195584B) [294912B] (294912B) [119808B] (194048B) (195584B) (294912B) (294912B) (294912B) [440320B] (194048B) (194048B) (195584B) (195584B) (194048B) (195584B) [1140736B] (194048B) (195584B) [1330176B] (3096576B) (1327104B) (1327104B) [4610048B] (1327104B) [1511424B] (3096576B) (3082240B) [1680384B] (3096576B) (3096576B) [3011584B] (3096576B) (3096576B) [42727424B] (3096576B) [1004544B] (67702784B) (135405568B) (135405568B) (135405568B) (135405568B) (3096576B) (3096576B) [37564416B] (3082240B) [14336B] (65076736B) (88330752B) (130152960B) (19625472B) (130152960B) (176660992B) (19625472B) (176660992B) (130152960B) (130152960B) (176660992B) (176660992B) (3807232B) (3096576B) (3807232B) (3096576B) (3096576B) [5833216B] (9789952B) (19579392B) (19579392B) (19579392B) (19579392B) (31375872B) (31375872B) (16241664B) (16241664B) (39250944B) (39250944B) (32483328B) (32483328B) (3096576B) [30427648B] (62751232B) (62751232B) [29662208B] = 2725961728B

RELION version: 5.0-beta-0-commit-90d239
exiting with an error ...

hwloc/linux: Ignoring PCI device with non-16bit domain.
Pass --enable-32bits-pci-domain to configure to support such devices
(warning: it would break the library ABI, don't enable unless really needed).
in: /home/supervisor/relion/src/acc/cuda/custom_allocator.cuh, line 539
ERROR:

You ran out of memory on the GPU(s).

Each MPI-rank running on a GPU increases the use of GPU-memory. Relion
tries to distribute load over multiple GPUs to increase performance,
but doing this in a general and memory-efficient way is difficult.

  1. Check the device-mapping presented at the beginning of each run,
    and be particularly wary of 'device X is split between N followers', which
    will result in a higher memory cost on GPU X. In classifications, GPU-
    sharing between MPI-ranks is typically fine, whereas it will usually
    cause out-of-memory during the last iteration of high-resolution refinement.

  2. If you are not GPU-sharing across MPI-follower ranks, then you might be using a
    too-big box-size for the GPU memory. Currently, N-pixel particle images
    will require roughly

         (1.1e-8)*(N*2)^3  GB  
    

    of memory (per rank) during the final iteration of refinement (using
    single-precision GPU code, which is default). 450-pixel images can therefore
    just about fit into a GPU with 8GB of memory, since 11*(450*2)^3 ~= 8.02
    During classifications, resolution is typically lower and N is suitably
    reduced, which means that memory use is much lower.

  3. If the above estimation fits onto (all of) your GPU(s), you may have
    a very large number of orientations which are found as possible during
    the expectation step, which results in large arrays being needed on the
    GPU. If this is the case, you should find large (>10'000) values of
    '_rlnNrOfSignificantSamples' in your _data.star output files. You can try
    adding the --maxsig <P> flag, where P is an integer limit, but you
    should probably also consult expertise or re-evaluate your data and/or
    input reference. Seeing large such values means relion is finding nothing
    to align.
    If none of the above applies, please report the error to the relion
    developers at github.com/3dem/relion/issues.

in: /home/supervisor/relion/src/acc/cuda/custom_allocator.cuh, line 539
ERROR: (same out-of-memory message as above)

follower 2 encountered error: === Backtrace ===
/home/supervisor/relion/build/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x7d) [0x55584e7c528d]
/home/supervisor/relion/build/bin/relion_refine_mpi(+0xe4ec4) [0x55584e79bec4]
/home/supervisor/relion/build/bin/relion_refine_mpi(+0x3770cd) [0x55584ea2e0cd]
/lib/x86_64-linux-gnu/libgomp.so.1(+0x1dc0e) [0x7f1326665c0e]
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f1325894ac3]
/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f1325926850]

ERROR: (same out-of-memory message as above)


MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
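For reference, here is the back-of-the-envelope arithmetic from point 2 of the message, applied to the two box sizes in this report (220 px and 350 px) together with the sharing factor implied by the command above (three MPI ranks with --gpu "2" puts both follower ranks on device 2). The sketch below only evaluates the formula quoted in the error text; it is not anything RELION itself computes here. Note also that the allocator dump shows the failing request (64,752,640 B) is larger than the largest contiguous free block (42,727,424 B) even though ~166 MB is free in total, so the pool is fragmented as well as tight.

```python
# Sketch only: per-rank GPU memory estimate quoted in the RELION error message,
# roughly (1.1e-8) * (2*N)^3 GB during the final refinement iteration
# (single-precision GPU code). Box sizes and the two-followers-per-GPU figure
# are taken from this report, not queried from RELION.

PER_RANK_CONST = 1.1e-8      # GB per (2*N)^3, as printed in the error text
FOLLOWERS_PER_GPU = 2        # 3 MPI ranks with --gpu "2": both followers share device 2

for box in (220, 350):       # the two datasets described above
    per_rank = PER_RANK_CONST * (2 * box) ** 3
    print(f"box {box} px: ~{per_rank:.1f} GB per rank, "
          f"~{per_rank * FOLLOWERS_PER_GPU:.1f} GB on the shared GPU")
```

For the 350 px box this works out to roughly 3.8 GB per rank, about 7.5 GB on the shared device before any other buffers, which fits the warning in point 1 that GPU sharing between MPI ranks tends to bite in the last, high-resolution iterations of a refinement.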

I accidentally deleted the job folders for the 3D classifications. The next time the error appears there, I will post those as well.

Job options:

  • Type of job: 3D Classification
  • Number of MPI processes: 1
  • Number of threads: 16
  • Full command (see note.txt in the job directory):
@biochem-fan (Member) commented:

This happens when you have some really bad particles which are hard to align. There are many discussions about this on the CCPEM mailing list, for example: https://www.jiscmail.ac.uk/cgi-bin/wa-jisc.exe?A2=ind2212&L=CCPEM&P=R43497&K=2 and https://www.jiscmail.ac.uk/cgi-bin/wa-jisc.exe?A2=ind1610&L=CCPEM&P=R52316.
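One way to check for such hard-to-align particles is the _rlnNrOfSignificantSamples column that the error text above points to. Below is a minimal standalone sketch (not part of RELION) that counts how many particles in a *_data.star file exceed the 10,000 threshold quoted in the message; it assumes the usual loop_ layout of RELION STAR files and does nothing more than that.

```python
#!/usr/bin/env python3
"""Minimal sketch (not part of RELION): count particles with large
_rlnNrOfSignificantSamples in a *_data.star file."""
import sys

def significant_samples(star_path):
    col = None                  # 0-based column index, once found in a loop_ header
    in_loop = False
    values = []
    with open(star_path) as fh:
        for raw in fh:
            line = raw.strip()
            if line == "loop_":                      # a new table starts; forget the old column
                in_loop, col = True, None
            elif in_loop and line.startswith("_rln"):
                label, _, idx = line.partition("#")  # header lines look like "_rlnLabel #12"
                if label.strip() == "_rlnNrOfSignificantSamples":
                    col = int(idx) - 1               # STAR column numbers are 1-based
            elif in_loop and line and not line.startswith(("data_", "_", "#")):
                if col is not None:                  # a data row of the table that has our column
                    fields = line.split()
                    if col < len(fields):
                        values.append(float(fields[col]))
    return values

if __name__ == "__main__":
    vals = significant_samples(sys.argv[1])
    big = sum(1 for v in vals if v > 10000)          # threshold quoted in the error message
    print(f"{len(vals)} particles, {big} with more than 10,000 significant samples")
    if vals:
        print(f"max = {max(vals):.0f}, mean = {sum(vals) / len(vals):.1f}")
```

Point it at the _data.star file written by the failing job; if a large fraction of particles is above the threshold, that supports the bad-particle explanation here, in line with the "re-evaluate your data and/or input reference" advice in the error message.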
