MFC simulation doesn't build with `--debug` on GPU #123

sbryngelson · 2023-03-18T20:05:17Z

MFC simulation doesn't build (compile) with --debug on GPU.

Here's a dump from wingtip; the same failure occurs on Phoenix. It builds fine without --debug, and -j 8 isn't the problem.

[I]wingtip-gpu3: sbryngelson3/MFC $ ./mfc.sh build -j 8 -t simulation --debug --gpu
mfc: Found CMake: /nethome/sbryngelson3/MFC/build/cmake/bin/cmake.
mfc: OK > (venv) Entered the Python virtual environment.
      ___            ___          ___
     /__/\          /  /\        /  /\       sbryngelson3@wingtip-gpu3 [Linux]
    |  |::\        /  /:/_      /  /:/       ---------------------------------
    |  |:|:\      /  /:/ /\    /  /:/        --jobs 8
  __|__|:|\:\    /  /:/ /:/   /  /:/  ___    --mpi
 /__/::::| \:\  /__/:/ /:/   /__/:/  /  /\   --gpu
 \  \:\~~\__\/  \  \:\/:/    \  \:\ /  /:/   --debug
  \  \:\         \  \::/      \  \:\  /:/    --targets simulation
   \  \:\         \  \:\       \  \:\/:/     --------------------------------------------
    \  \:\         \  \:\       \  \::/      $ ./mfc.sh [build, run, test, clean] --help
     \__\/          \__\/        \__\/

Building simulation:

  $ cmake -DMFC_SIMULATION=ON -Wno-dev -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DCMAKE_BUILD_TYPE=Debug -DCMAKE_PREFIX_PATH=/nethome/sbryngelson3/MFC/build/install
-DCMAKE_FIND_ROOT_PATH=/nethome/sbryngelson3/MFC/build/install -DCMAKE_INSTALL_PREFIX=/nethome/sbryngelson3/MFC/build/install -DMFC_MPI=ON -DMFC_OpenACC=ON -S
/nethome/sbryngelson3/MFC/ -B /nethome/sbryngelson3/MFC/build/simulation

-- The C compiler identification is NVHPC 22.11.0
-- The Fortran compiler identification is NVHPC 22.11.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/nvidia/hpc_sdk/Linux_x86_64/22.11/compilers/bin/nvc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting Fortran compiler ABI info
-- Detecting Fortran compiler ABI info - done
-- Check for working Fortran compiler: /opt/nvidia/hpc_sdk/Linux_x86_64/22.11/compilers/bin/nvfortran - skipped
-- Found MPI_Fortran: /opt/nvidia/hpc_sdk/Linux_x86_64/22.11/comm_libs/openmpi/openmpi-3.1.5/lib/libmpi_usempif08.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1") found components: Fortran
-- Found OpenACC_C: -acc
-- Found OpenACC_Fortran: -acc
-- Found CUDAToolkit: /opt/nvidia/hpc_sdk/Linux_x86_64/22.11/cuda/11.8/include (found version "11.8.89")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- CUDAToolkit_LIBRARY_ROOT=
-- Could NOT find cuTENSOR (missing: cuTENSOR_LIBRARY)
CMake Warning at CMakeLists.txt:239 (message):
  Failed to locate the NVIDIA cuTENSOR library.  MFC will be built without
  support for it, disallowing the use of cu_tensor=T.  This can result in
  degraded performance.
Call Stack (most recent call first):
  CMakeLists.txt:280 (MFC_SETUP_TARGET)


-- Configuring done
-- Generating done
-- Build files have been written to: /nethome/sbryngelson3/MFC/build/simulation

  $ cmake --build /nethome/sbryngelson3/MFC/build/simulation --target simulation -j 8 --config Debug

[  2%] Preprocessing (Fypp) p_main.fpp
[ 17%] Preprocessing (Fypp) m_bubbles.fpp
[ 17%] Preprocessing (Fypp) m_cbc.fpp
[ 17%] Preprocessing (Fypp) m_data_output.fpp
[ 17%] Preprocessing (Fypp) m_compute_cbc.fpp
[ 17%] Preprocessing (Fypp) m_constants.fpp
[ 17%] Preprocessing (Fypp) m_global_parameters.fpp
[ 17%] Preprocessing (Fypp) m_fftw.fpp
[ 19%] Preprocessing (Fypp) m_hypoelastic.fpp
[ 21%] Preprocessing (Fypp) m_mpi_common.fpp
[ 26%] Preprocessing (Fypp) m_mpi_proxy.fpp
[ 26%] Preprocessing (Fypp) m_qbmm.fpp
[ 28%] Preprocessing (Fypp) m_monopole.fpp
[ 30%] Preprocessing (Fypp) m_rhs.fpp
[ 32%] Preprocessing (Fypp) m_riemann_solvers.fpp
[ 34%] Preprocessing (Fypp) m_start_up.fpp
[ 36%] Preprocessing (Fypp) m_time_steppers.fpp
[ 39%] Preprocessing (Fypp) m_variables_conversion.fpp
[ 41%] Preprocessing (Fypp) m_viscous.fpp
[ 43%] Preprocessing (Fypp) m_weno.fpp
[ 45%] Building Fortran object CMakeFiles/simulation.dir/src/simulation/autogen/m_constants.fpp.f90.o
[ 47%] Building Fortran object CMakeFiles/simulation.dir/src/simulation/m_nvtx.f90.o
nvfortran-Warning-CUDA Fortran or OpenACC GPU targets disables -Mbounds
nvfortran-Warning-CUDA Fortran or OpenACC GPU targets disables -Mbounds
[ 50%] Building Fortran object CMakeFiles/simulation.dir/src/common/m_derived_types.f90.o
nvfortran-Warning-CUDA Fortran or OpenACC GPU targets disables -Mbounds
[ 52%] Building Fortran object CMakeFiles/simulation.dir/src/simulation/autogen/m_global_parameters.fpp.f90.o
nvfortran-Warning-CUDA Fortran or OpenACC GPU targets disables -Mbounds
NVFORTRAN-W-0951-Extraneous tokens ignored following # directive (/nethome/sbryngelson3/MFC/src/simulation/autogen/m_global_parameters.fpp.f90: 5)
NVFORTRAN-W-0951-Extraneous tokens ignored following # directive (/nethome/sbryngelson3/MFC/src/simulation/autogen/m_global_parameters.fpp.f90: 6)
NVFORTRAN-W-0951-Extraneous tokens ignored following # directive (/nethome/sbryngelson3/MFC/src/simulation/autogen/m_global_parameters.fpp.f90: 6)
NVFORTRAN-W-0951-Extraneous tokens ignored following # directive (/nethome/sbryngelson3/MFC/src/simulation/autogen/m_global_parameters.fpp.f90: 45)
s_initialize_global_parameters_module:
      7, include 'm_global_parameters.fpp'
         471, Generating update device(weno_polyn)
         472, Generating update device(nb)
         550, Generating enter data create(r0(nb),v0(nb),weight(nb))
         551, Generating enter data create(bub_idx%rs(nb),bub_idx%vs(nb))
         552, Generating enter data create(bub_idx%ms(nb),bub_idx%ps(nb))
         561, Generating enter data create(bub_idx%moms(nb,6))
         652, Generating enter data create(bub_idx%rs(nb),bub_idx%vs(nb))
         653, Generating enter data create(bub_idx%ms(nb),bub_idx%ps(nb))
         654, Generating enter data create(r0(nb),v0(nb),weight(nb))
         707, Generating enter data create(re_idx(1:2,1:re_size$r6))
         736, Generating update device(re_size(:))
         761, Generating enter data create(ptil(ix%beg:ix%end,iy%beg:iy%end,iz%beg:iz%end))
         779, Generating update device(starty,startx,startz)
         803, Generating update device(strxb,momxe,advxe,contxe,bubxe,sys_size,strxe,intxb,e_idx,bubxb,alf_idx,contxb,buff_size,advxb,momxb,intxe)
         806, Generating enter data create(x_cb(-(buff_size)-1:m+buff_size))
         807, Generating enter data create(x_cc(-buff_size:m+buff_size))
         808, Generating enter data create(dx(-buff_size:m+buff_size))
         812, Generating enter data create(y_cb(-(buff_size)-1:n+buff_size))
         813, Generating enter data create(y_cc(-buff_size:n+buff_size))
         814, Generating enter data create(dy(-buff_size:n+buff_size))
         818, Generating enter data create(z_cb(-(buff_size)-1:p+buff_size))
         819, Generating enter data create(z_cc(-buff_size:p+buff_size))
         820, Generating enter data create(dz(-buff_size:p+buff_size))
s_initialize_nonpoly:
      7, include 'm_global_parameters.fpp'
         848, Generating enter data create(pb0(nb),mass_n0(nb),mass_v0(nb),pe_t(nb))
         849, Generating enter data create(k_v(nb),k_n(nb),omegan(nb))
         850, Generating enter data create(im_trans_c(nb),im_trans_t(nb),re_trans_t(nb),re_trans_c(nb))
s_finalize_global_parameters_module:
      7, include 'm_global_parameters.fpp'
        1006, Generating exit data delete(re_idx(:,:))
        1010, Generating exit data delete(x_cb(:),dx(:),x_cc(:))
        1013, Generating exit data delete(y_cb(:),dy(:),y_cc(:))
        1016, Generating exit data delete(z_cc(:),z_cb(:),dz(:))
s_simpson:
      7, include 'm_global_parameters.fpp'
nvfortran-Fatal-/opt/nvidia/hpc_sdk/Linux_x86_64/22.11/compilers/bin/tools/fort2 TERMINATED by signal 11
Arguments to /opt/nvidia/hpc_sdk/Linux_x86_64/22.11/compilers/bin/tools/fort2
/opt/nvidia/hpc_sdk/Linux_x86_64/22.11/compilers/bin/tools/fort2 /tmp/nvfortranUejVgkE8S8y-R.ilm -fn /nethome/sbryngelson3/MFC/src/simulation/autogen/m_global_parameters.fpp.f90 -debug -x 120 0x8000 -opt 0 -terse 1 -inform warn -x 51 0x20 -x 119 0xa10000 -x 122 0x40 -x 123 0x1000 -x 127 4 -x 127 17 -x 19 0x400000 -x 28 0x40000 -x 120 0x10000000 -x 70 0x8000 -x 122 1 -x 125 0x20000 -quad -x 59 4 -tp skylake-avx512 -x 124 0x1400 -y 15 2 -x 57 0x3b0000 -x 58 0x48000000 -x 49 0x100 -astype 0 -x 121 1 -x 183 4 -x 121 0x800 -x 68 0x1 -x 8 0x40000000 -x 70 0x40000000 -x 56 0x10 -x 54 0x10 -x 120 0x2000000 -x 249 140 -x 68 0x20 -x 70 0x40000000 -x 8 0x40000000 -x 164 0x800000 -x 71 0x2000 -x 71 0x4000 -x 34 0x40000000 -x 83 0x1 -x 85 0x1 -x 206 0x02 -x 120 0x1000000 -x 68 0x1 -x 39 4 -x 56 0x10 -x 26 0x10 -x 26 1 -x 56 0x4000 -x 124 1 -accel tesla -accel host -x 197 0 -x 175 0 -x 203 0 -x 204 0 -x 163 0xc0008 -x 180 0x40 -x 180 0x800 -x 163 0x400000 -cudaroot /opt/nvidia/hpc_sdk/Linux_x86_64/22.11/cuda/11.8 -x 176 0x100 -cudacap 80 -x 180 0x4000400 -x 121 0xc00 -x 186 0x80 -x 180 0x4000400 -x 121 0xc00 -x 194 0x40000 -x 163 0x1 -x 186 0x80000 -cudaver 11080 -x 176 0x100 -cudacap 80 -cudaroot /opt/nvidia/hpc_sdk/Linux_x86_64/22.11/cuda/11.8 -x 189 0x8000 -y 163 0xc0000000 -x 163 0x800000 -x 189 0x10 -y 189 0x4000000 -cudaroot /opt/nvidia/hpc_sdk/Linux_x86_64/22.11/cuda/11.8 -x 187 0x40000 -x 187 0x8000000 -x 60 512 -x 124 0x20 -x 0 0x1000000 -x 2 0x100000 -x 0 0x2000000 -x 161 16384 -x 162 16384 -x 192 0x40000000 -x 215 0x60 -cci /tmp/nvfortranEejVgApGJPSQJ.cci -cmdline '+nvfortran /nethome/sbryngelson3/MFC/src/simulation/autogen/m_global_parameters.fpp.f90 -DMFC_DEBUG -DMFC_MPI -DMFC_SIMULATION -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/22.11/comm_libs/openmpi/openmpi-3.1.5/include -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/22.11/comm_libs/openmpi/openmpi-3.1.5/lib -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/22.11/cuda/11.8/include -isystem 
/opt/nvidia/hpc_sdk/Linux_x86_64/22.11/math_libs/include -g -O0 -Mbounds -r8 -Mfreeform -Mr8intrinsics -cpp -Mpreprocess -Minfo=accel -gpu=keep,ptxinfo,lineinfo,autocompare,debug -acc -Mpreprocess -c -o CMakeFiles/simulation.dir/src/simulation/autogen/m_global_parameters.fpp.f90.o' -stbfile /tmp/nvfortranoejVgQkIgvDS8.stb -asm /tmp/nvfortranoejVgQBMqyd5U.ll
make[3]: *** [CMakeFiles/simulation.dir/build.make:365: CMakeFiles/simulation.dir/src/simulation/autogen/m_global_parameters.fpp.f90.o] Error 127
make[2]: *** [CMakeFiles/Makefile2:84: CMakeFiles/simulation.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:91: CMakeFiles/simulation.dir/rule] Error 2
make: *** [Makefile:170: simulation] Error 2


Error: Failed to build the simulation target.

Terminated
mfc: (venv) Exiting the Python virtual environment.


henryleberre · 2023-03-18T21:49:24Z

I first reported this last summer when merging the GPU branch with master (on MFC-develop): https://forums.developer.nvidia.com/t/nvhpc-22-5-fort2-terminated-by-signal-11/219545. The bug disappeared and then reappeared. It is a bug in NVHPC (the compiler shouldn't segfault). --debug was very helpful. I could write a script to go back and find which commit started all of this again. Should I?

henryleberre · 2023-03-18T21:50:39Z

Back then MFC's build system was less than desirable and it complicated the reporting process.

sbryngelson · 2023-03-18T21:53:20Z

I would report it again, then. If you believe this was triggered by a commit revision, then it would be useful to go through old releases.
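The commit search discussed above could be automated with `git bisect run`. The following is only a sketch under stated assumptions: the file name `bisect_step.sh` and the function names are hypothetical, the build command is copied from the log in this issue, and it assumes every commit in the searched range accepts the same `./mfc.sh` options. A commit is marked bad only when the log contains the `fort2 TERMINATED by signal 11` message; any other build failure is skipped.

```bash
#!/usr/bin/env bash
# Hypothetical helper (bisect_step.sh) for `git bisect run`; not part of MFC.

# Classify one build attempt.
#   $1: exit status of the build command
#   $2: path to the captured build log
bisect_verdict() {
    local status="$1" log="$2"
    if [ "$status" -eq 0 ]; then
        echo good
    elif grep -q 'fort2 TERMINATED by signal 11' "$log"; then
        echo bad
    else
        echo skip   # the build failed for some unrelated reason
    fi
}

# Build the current checkout and translate the verdict into the exit codes
# `git bisect run` understands: 0 = good, 1 = bad, 125 = skip this commit.
bisect_step() {
    local log status verdict
    log="$(mktemp)"
    # Assumption: every commit in the range accepts these mfc.sh options.
    ./mfc.sh build -j 8 -t simulation --debug --gpu >"$log" 2>&1
    status=$?
    verdict="$(bisect_verdict "$status" "$log")"
    rm -f "$log"
    case "$verdict" in
        good) return 0 ;;
        bad)  return 1 ;;
        *)    return 125 ;;
    esac
}

# Only build when invoked with --run, so the functions can be sourced safely.
if [ "${1:-}" = "--run" ]; then
    bisect_step
    exit $?
fi
```

With the script kept outside the repository tree, a session might look like `git bisect start <bad-commit> <good-commit>` followed by `git bisect run /path/to/bisect_step.sh --run`. Stale build directories may need to be cleaned between steps.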

sbryngelson · 2023-03-22T18:16:45Z

From Mat Colgrove:

Oddly, I’m not able to reproduce the error with 22.5 or 22.11, but can with 23.1 and our development compiler. Sometimes this can mean that there’s a UMR or other memory issue where it only shows up occasionally, but running the 22.11 back-end compiler through valgrind shows no issues. Hence while it’s likely the same issue, I’m not 100% sure.

I filed a problem report, TPR #33317, and sent it to engineering for investigation.

In my case, if I add an opt level, i.e. change “-O0” to “-O1” or “-O2”, the error goes away. You might try this as well, just as a work around until we can get this fixed.

-Mat

https://forums.developer.nvidia.com/t/nvhpc-22-5-fort2-terminated-by-signal-11/219545/8
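One way to try the suggested workaround on just the failing object, without touching the build system, is to re-run its compile command with the optimization flag swapped. The helper below is a hypothetical sketch (the name `swap_opt_level` is not from MFC or NVHPC): it rewrites a standalone `-O0` token in a command line and leaves everything else alone.

```bash
# Hypothetical helper: replace a standalone -O0 token in a compile command.
#   $1: the full compile command line
#   $2: replacement flag (defaults to -O1)
swap_opt_level() {
    local cmd="$1" new="${2:--O1}"
    # Match -O0 only when delimited by whitespace or the line boundaries,
    # so that tokens such as -O01 or paths containing "-O0" are untouched.
    printf '%s\n' "$cmd" |
        sed -E "s/(^|[[:space:]])-O0([[:space:]]|\$)/\1${new}\2/g"
}

# Example with an abbreviated form of the failing command from the log:
swap_opt_level 'nvfortran -g -O0 -Mbounds -acc -c m_global_parameters.fpp.f90'
```

Because the configure step passes `-DCMAKE_EXPORT_COMPILE_COMMANDS=ON`, the full command for each object should be available in `compile_commands.json` in the build directory; the rewritten command can then be run by hand there to check whether the segfault goes away.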

sbryngelson · 2023-03-22T18:27:05Z

I can confirm that changing the -O0 optimization flag to a higher level works around the error. Will submit a PR to this effect.

sbryngelson added the `bug` label Mar 18, 2023

sbryngelson assigned henryleberre Mar 18, 2023

wilfonba mentioned this issue Mar 18, 2023

MPI_Abort fix #125

Merged

henryleberre pushed a commit to henryleberre/MFC that referenced this issue Mar 21, 2023

Merge pull request MFlowCode#123 from henryleberre/master

837c824

sbryngelson mentioned this issue Mar 22, 2023

Update CMakeLists.txt for --debug fix on nvidia GPUs #130

Merged

sbryngelson assigned sbryngelson and unassigned henryleberre Mar 22, 2023

sbryngelson closed this as completed in #130 Mar 22, 2023
