Describe your problem
Hey,
I encountered a CUDA memory-allocation error during both 3D classification and 3D auto-refinement, with similar error messages in each case. In my hands it occurs only occasionally, at a random iteration of the run, and I have seen it on two different workstations with two different particle sets.
Environment:
RELION version: 5.0-beta-0-commit-90d239
Dataset:
Job options:
Full command (see note.txt in the job directory):
++++ with the following command(s):
`which relion_refine_mpi` --o Refine3D/job043/run --auto_refine --split_random_halves --blush --i Select/job033/particles.star --ref Class3D/job028/run_it050_class005.mrc --ini_high 30 --dont_combine_weights_via_disc --scratch_dir /media/supervisor/DATA/Test --pool 10 --pad 2 --ctf --particle_diameter 140 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale --j 16 --gpu "2" --reuse_scratch --pipeline_control Refine3D/job043/
++++
Error message:
RELION version: 5.0-beta-0-commit-90d239
exiting with an error ...
hwloc/linux: Ignoring PCI device with non-16bit domain.
Pass --enable-32bits-pci-domain to configure to support such devices
(warning: it would break the library ABI, don't enable unless really needed).
in: /home/supervisor/relion/src/acc/cuda/custom_allocator.cuh, line 539
ERROR:
You ran out of memory on the GPU(s).
Each MPI-rank running on a GPU increases the use of GPU-memory. Relion
tries to distribute load over multiple GPUs to increase performance,
but doing this in a general and memory-efficient way is difficult.
Check the device-mapping presented at the beginning of each run,
and be particularly wary of 'device X is split between N followers', which
will result in a higher memory cost on GPU X. In classifications, GPU-
sharing between MPI-ranks is typically fine, whereas it will usually
cause out-of-memory during the last iteration of high-resolution refinement.
If you are not GPU-sharing across MPI-follower ranks, then you might be using a
too-big box-size for the GPU memory. Currently, N-pixel particle images
will require roughly
(1.1e-8)*(N*2)^3 GB
of memory (per rank) during the final iteration of refinement (using
single-precision GPU code, which is default). 450-pixel images can therefore
just about fit into a GPU with 8GB of memory, since 11*(450*2)^3 ~= 8.02
During classifications, resolution is typically lower and N is suitably
reduced, which means that memory use is much lower.
If the above estimation fits onto (all of) your GPU(s), you may have
a very large number of orientations which are found as possible during
the expectation step, which results in large arrays being needed on the
GPU. If this is the case, you should find large (>10'000) values of
'_rlnNrOfSignificantSamples' in your _data.star output files. You can try
adding the --maxsig <P> flag, where P is an integer limit, but you
should probably also consult expertise or re-evaluate your data and/or
input reference. Seeing large such values means relion is finding nothing
to align.
If none of the above applies, please report the error to the relion
developers at github.com/3dem/relion/issues.
follower 2 encountered error: === Backtrace ===
/home/supervisor/relion/build/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x7d) [0x55584e7c528d]
/home/supervisor/relion/build/bin/relion_refine_mpi(+0xe4ec4) [0x55584e79bec4]
/home/supervisor/relion/build/bin/relion_refine_mpi(+0x3770cd) [0x55584ea2e0cd]
/lib/x86_64-linux-gnu/libgomp.so.1(+0x1dc0e) [0x7f1326665c0e]
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f1325894ac3]
/lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x7f1325926850]
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
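
For reference, here is a quick sanity check of the per-rank memory estimate quoted in the error message. The 1.1e-8 constant and the 450-pixel example come directly from the message itself; the other box sizes and the helper name are just illustrative, not values from my data:

```python
# Per-rank GPU memory estimate from the RELION error message above:
#   memory (GB) ~ (1.1e-8) * (N*2)^3, with N the particle box size in pixels.
# The constant and the 450-pixel example are taken from the message; the other
# box sizes below are only illustrative.

def estimated_gpu_gb(box_size_px: int) -> float:
    """Rough per-rank GPU memory (GB) during the final refinement iteration."""
    return 1.1e-8 * (box_size_px * 2) ** 3

for n in (256, 360, 450, 512):
    print(f"box {n} px -> ~{estimated_gpu_gb(n):.2f} GB per rank")
# box 256 px -> ~1.48 GB per rank
# box 360 px -> ~4.11 GB per rank
# box 450 px -> ~8.02 GB per rank
# box 512 px -> ~11.81 GB per rank
```

If several follower ranks share one physical GPU (the 'device X is split between N followers' case described in the message), each of them needs roughly this much memory on that card.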
I accidentally deleted the folders for the 3D classification jobs; the next time I see the error there I will post those details too.
Job options:
Type of job: 3D Classification
Number of MPI processes: 1
Number of threads: 16
Full command (see note.txt in the job directory):
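
Following the error message's suggestion about large _rlnNrOfSignificantSamples values, this is the kind of quick check I would run on the run_it*_data.star output files. It is only a standard-library sketch that assumes the usual RELION STAR layout (a loop_ header listing labels, then one row per particle); the default path in the script is a hypothetical example, so pass the real file as an argument:

```python
#!/usr/bin/env python3
"""Count large _rlnNrOfSignificantSamples values in a RELION _data.star file."""
import sys

LABEL = "_rlnNrOfSignificantSamples"

def read_significant_samples(star_path):
    """Return every value of LABEL found in loop_ blocks of a STAR file."""
    values, columns, in_header = [], [], False
    with open(star_path) as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue
            if line.startswith("data_"):      # new data block: reset state
                columns, in_header = [], False
            elif line == "loop_":             # start collecting column labels
                columns, in_header = [], True
            elif line.startswith("_"):        # a label line, e.g. "_rlnFoo #3"
                if in_header:
                    columns.append(line.split()[0])
            else:                             # a data row
                in_header = False
                if LABEL in columns:
                    fields = line.split()
                    idx = columns.index(LABEL)
                    if idx < len(fields):
                        values.append(int(float(fields[idx])))
    return values

if __name__ == "__main__":
    # Hypothetical default path; pass the actual _data.star file as an argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "Refine3D/job043/run_it010_data.star"
    vals = read_significant_samples(path)
    if not vals:
        print(f"{LABEL} not found in {path}")
    else:
        n_large = sum(v > 10000 for v in vals)
        print(f"{len(vals)} particles, max {LABEL} = {max(vals)}, "
              f"{n_large} particles with more than 10000 significant samples")
```

I deliberately avoided third-party STAR parsers here so the check can run on any machine with plain Python.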