MFC simulation doesn't build with --debug on GPU #123

Closed
sbryngelson opened this issue Mar 18, 2023 · 5 comments · Fixed by #130

@sbryngelson
Member

MFC simulation doesn't build (compile) with --debug on GPU.

Here's a build log from wingtip; the same failure occurs on Phoenix. It builds fine without --debug, and -j 8 isn't the problem.

[I]wingtip-gpu3: sbryngelson3/MFC $ ./mfc.sh build -j 8 -t simulation --debug --gpu
mfc: Found CMake: /nethome/sbryngelson3/MFC/build/cmake/bin/cmake.
mfc: OK > (venv) Entered the Python virtual environment.
      ___            ___          ___
     /__/\          /  /\        /  /\       sbryngelson3@wingtip-gpu3 [Linux]
    |  |::\        /  /:/_      /  /:/       ---------------------------------
    |  |:|:\      /  /:/ /\    /  /:/        --jobs 8
  __|__|:|\:\    /  /:/ /:/   /  /:/  ___    --mpi
 /__/::::| \:\  /__/:/ /:/   /__/:/  /  /\   --gpu
 \  \:\~~\__\/  \  \:\/:/    \  \:\ /  /:/   --debug
  \  \:\         \  \::/      \  \:\  /:/    --targets simulation
   \  \:\         \  \:\       \  \:\/:/     --------------------------------------------
    \  \:\         \  \:\       \  \::/      $ ./mfc.sh [build, run, test, clean] --help
     \__\/          \__\/        \__\/

Building simulation:

  $ cmake -DMFC_SIMULATION=ON -Wno-dev -DCMAKE_EXPORT_COMPILE_COMMANDS=ON -DCMAKE_BUILD_TYPE=Debug -DCMAKE_PREFIX_PATH=/nethome/sbryngelson3/MFC/build/install -DCMAKE_FIND_ROOT_PATH=/nethome/sbryngelson3/MFC/build/install -DCMAKE_INSTALL_PREFIX=/nethome/sbryngelson3/MFC/build/install -DMFC_MPI=ON -DMFC_OpenACC=ON -S /nethome/sbryngelson3/MFC/ -B /nethome/sbryngelson3/MFC/build/simulation

-- The C compiler identification is NVHPC 22.11.0
-- The Fortran compiler identification is NVHPC 22.11.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/nvidia/hpc_sdk/Linux_x86_64/22.11/compilers/bin/nvc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting Fortran compiler ABI info
-- Detecting Fortran compiler ABI info - done
-- Check for working Fortran compiler: /opt/nvidia/hpc_sdk/Linux_x86_64/22.11/compilers/bin/nvfortran - skipped
-- Found MPI_Fortran: /opt/nvidia/hpc_sdk/Linux_x86_64/22.11/comm_libs/openmpi/openmpi-3.1.5/lib/libmpi_usempif08.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1") found components: Fortran
-- Found OpenACC_C: -acc
-- Found OpenACC_Fortran: -acc
-- Found CUDAToolkit: /opt/nvidia/hpc_sdk/Linux_x86_64/22.11/cuda/11.8/include (found version "11.8.89")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- CUDAToolkit_LIBRARY_ROOT=
-- Could NOT find cuTENSOR (missing: cuTENSOR_LIBRARY)
CMake Warning at CMakeLists.txt:239 (message):
  Failed to locate the NVIDIA cuTENSOR library.  MFC will be built without
  support for it, disallowing the use of cu_tensor=T.  This can result in
  degraded performance.
Call Stack (most recent call first):
  CMakeLists.txt:280 (MFC_SETUP_TARGET)


-- Configuring done
-- Generating done
-- Build files have been written to: /nethome/sbryngelson3/MFC/build/simulation

  $ cmake --build /nethome/sbryngelson3/MFC/build/simulation --target simulation -j 8 --config Debug

[  2%] Preprocessing (Fypp) p_main.fpp
[ 17%] Preprocessing (Fypp) m_bubbles.fpp
[ 17%] Preprocessing (Fypp) m_cbc.fpp
[ 17%] Preprocessing (Fypp) m_data_output.fpp
[ 17%] Preprocessing (Fypp) m_compute_cbc.fpp
[ 17%] Preprocessing (Fypp) m_constants.fpp
[ 17%] Preprocessing (Fypp) m_global_parameters.fpp
[ 17%] Preprocessing (Fypp) m_fftw.fpp
[ 19%] Preprocessing (Fypp) m_hypoelastic.fpp
[ 21%] Preprocessing (Fypp) m_mpi_common.fpp
[ 26%] Preprocessing (Fypp) m_mpi_proxy.fpp
[ 26%] Preprocessing (Fypp) m_qbmm.fpp
[ 28%] Preprocessing (Fypp) m_monopole.fpp
[ 30%] Preprocessing (Fypp) m_rhs.fpp
[ 32%] Preprocessing (Fypp) m_riemann_solvers.fpp
[ 34%] Preprocessing (Fypp) m_start_up.fpp
[ 36%] Preprocessing (Fypp) m_time_steppers.fpp
[ 39%] Preprocessing (Fypp) m_variables_conversion.fpp
[ 41%] Preprocessing (Fypp) m_viscous.fpp
[ 43%] Preprocessing (Fypp) m_weno.fpp
[ 45%] Building Fortran object CMakeFiles/simulation.dir/src/simulation/autogen/m_constants.fpp.f90.o
[ 47%] Building Fortran object CMakeFiles/simulation.dir/src/simulation/m_nvtx.f90.o
nvfortran-Warning-CUDA Fortran or OpenACC GPU targets disables -Mbounds
nvfortran-Warning-CUDA Fortran or OpenACC GPU targets disables -Mbounds
[ 50%] Building Fortran object CMakeFiles/simulation.dir/src/common/m_derived_types.f90.o
nvfortran-Warning-CUDA Fortran or OpenACC GPU targets disables -Mbounds
[ 52%] Building Fortran object CMakeFiles/simulation.dir/src/simulation/autogen/m_global_parameters.fpp.f90.o
nvfortran-Warning-CUDA Fortran or OpenACC GPU targets disables -Mbounds
NVFORTRAN-W-0951-Extraneous tokens ignored following # directive (/nethome/sbryngelson3/MFC/src/simulation/autogen/m_global_parameters.fpp.f90: 5)
NVFORTRAN-W-0951-Extraneous tokens ignored following # directive (/nethome/sbryngelson3/MFC/src/simulation/autogen/m_global_parameters.fpp.f90: 6)
NVFORTRAN-W-0951-Extraneous tokens ignored following # directive (/nethome/sbryngelson3/MFC/src/simulation/autogen/m_global_parameters.fpp.f90: 6)
NVFORTRAN-W-0951-Extraneous tokens ignored following # directive (/nethome/sbryngelson3/MFC/src/simulation/autogen/m_global_parameters.fpp.f90: 45)
s_initialize_global_parameters_module:
      7, include 'm_global_parameters.fpp'
         471, Generating update device(weno_polyn)
         472, Generating update device(nb)
         550, Generating enter data create(r0(nb),v0(nb),weight(nb))
         551, Generating enter data create(bub_idx%rs(nb),bub_idx%vs(nb))
         552, Generating enter data create(bub_idx%ms(nb),bub_idx%ps(nb))
         561, Generating enter data create(bub_idx%moms(nb,6))
         652, Generating enter data create(bub_idx%rs(nb),bub_idx%vs(nb))
         653, Generating enter data create(bub_idx%ms(nb),bub_idx%ps(nb))
         654, Generating enter data create(r0(nb),v0(nb),weight(nb))
         707, Generating enter data create(re_idx(1:2,1:re_size$r6))
         736, Generating update device(re_size(:))
         761, Generating enter data create(ptil(ix%beg:ix%end,iy%beg:iy%end,iz%beg:iz%end))
         779, Generating update device(starty,startx,startz)
         803, Generating update device(strxb,momxe,advxe,contxe,bubxe,sys_size,strxe,intxb,e_idx,bubxb,alf_idx,contxb,buff_size,advxb,momxb,intxe)
         806, Generating enter data create(x_cb(-(buff_size)-1:m+buff_size))
         807, Generating enter data create(x_cc(-buff_size:m+buff_size))
         808, Generating enter data create(dx(-buff_size:m+buff_size))
         812, Generating enter data create(y_cb(-(buff_size)-1:n+buff_size))
         813, Generating enter data create(y_cc(-buff_size:n+buff_size))
         814, Generating enter data create(dy(-buff_size:n+buff_size))
         818, Generating enter data create(z_cb(-(buff_size)-1:p+buff_size))
         819, Generating enter data create(z_cc(-buff_size:p+buff_size))
         820, Generating enter data create(dz(-buff_size:p+buff_size))
s_initialize_nonpoly:
      7, include 'm_global_parameters.fpp'
         848, Generating enter data create(pb0(nb),mass_n0(nb),mass_v0(nb),pe_t(nb))
         849, Generating enter data create(k_v(nb),k_n(nb),omegan(nb))
         850, Generating enter data create(im_trans_c(nb),im_trans_t(nb),re_trans_t(nb),re_trans_c(nb))
s_finalize_global_parameters_module:
      7, include 'm_global_parameters.fpp'
        1006, Generating exit data delete(re_idx(:,:))
        1010, Generating exit data delete(x_cb(:),dx(:),x_cc(:))
        1013, Generating exit data delete(y_cb(:),dy(:),y_cc(:))
        1016, Generating exit data delete(z_cc(:),z_cb(:),dz(:))
s_simpson:
      7, include 'm_global_parameters.fpp'
nvfortran-Fatal-/opt/nvidia/hpc_sdk/Linux_x86_64/22.11/compilers/bin/tools/fort2 TERMINATED by signal 11
Arguments to /opt/nvidia/hpc_sdk/Linux_x86_64/22.11/compilers/bin/tools/fort2
/opt/nvidia/hpc_sdk/Linux_x86_64/22.11/compilers/bin/tools/fort2 /tmp/nvfortranUejVgkE8S8y-R.ilm -fn /nethome/sbryngelson3/MFC/src/simulation/autogen/m_global_parameters.fpp.f90 -debug -x 120 0x8000 -opt 0 -terse 1 -inform warn -x 51 0x20 -x 119 0xa10000 -x 122 0x40 -x 123 0x1000 -x 127 4 -x 127 17 -x 19 0x400000 -x 28 0x40000 -x 120 0x10000000 -x 70 0x8000 -x 122 1 -x 125 0x20000 -quad -x 59 4 -tp skylake-avx512 -x 124 0x1400 -y 15 2 -x 57 0x3b0000 -x 58 0x48000000 -x 49 0x100 -astype 0 -x 121 1 -x 183 4 -x 121 0x800 -x 68 0x1 -x 8 0x40000000 -x 70 0x40000000 -x 56 0x10 -x 54 0x10 -x 120 0x2000000 -x 249 140 -x 68 0x20 -x 70 0x40000000 -x 8 0x40000000 -x 164 0x800000 -x 71 0x2000 -x 71 0x4000 -x 34 0x40000000 -x 83 0x1 -x 85 0x1 -x 206 0x02 -x 120 0x1000000 -x 68 0x1 -x 39 4 -x 56 0x10 -x 26 0x10 -x 26 1 -x 56 0x4000 -x 124 1 -accel tesla -accel host -x 197 0 -x 175 0 -x 203 0 -x 204 0 -x 163 0xc0008 -x 180 0x40 -x 180 0x800 -x 163 0x400000 -cudaroot /opt/nvidia/hpc_sdk/Linux_x86_64/22.11/cuda/11.8 -x 176 0x100 -cudacap 80 -x 180 0x4000400 -x 121 0xc00 -x 186 0x80 -x 180 0x4000400 -x 121 0xc00 -x 194 0x40000 -x 163 0x1 -x 186 0x80000 -cudaver 11080 -x 176 0x100 -cudacap 80 -cudaroot /opt/nvidia/hpc_sdk/Linux_x86_64/22.11/cuda/11.8 -x 189 0x8000 -y 163 0xc0000000 -x 163 0x800000 -x 189 0x10 -y 189 0x4000000 -cudaroot /opt/nvidia/hpc_sdk/Linux_x86_64/22.11/cuda/11.8 -x 187 0x40000 -x 187 0x8000000 -x 60 512 -x 124 0x20 -x 0 0x1000000 -x 2 0x100000 -x 0 0x2000000 -x 161 16384 -x 162 16384 -x 192 0x40000000 -x 215 0x60 -cci /tmp/nvfortranEejVgApGJPSQJ.cci -cmdline '+nvfortran /nethome/sbryngelson3/MFC/src/simulation/autogen/m_global_parameters.fpp.f90 -DMFC_DEBUG -DMFC_MPI -DMFC_SIMULATION -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/22.11/comm_libs/openmpi/openmpi-3.1.5/include -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/22.11/comm_libs/openmpi/openmpi-3.1.5/lib -isystem /opt/nvidia/hpc_sdk/Linux_x86_64/22.11/cuda/11.8/include -isystem 
/opt/nvidia/hpc_sdk/Linux_x86_64/22.11/math_libs/include -g -O0 -Mbounds -r8 -Mfreeform -Mr8intrinsics -cpp -Mpreprocess -Minfo=accel -gpu=keep,ptxinfo,lineinfo,autocompare,debug -acc -Mpreprocess -c -o CMakeFiles/simulation.dir/src/simulation/autogen/m_global_parameters.fpp.f90.o' -stbfile /tmp/nvfortranoejVgQkIgvDS8.stb -asm /tmp/nvfortranoejVgQBMqyd5U.ll
make[3]: *** [CMakeFiles/simulation.dir/build.make:365: CMakeFiles/simulation.dir/src/simulation/autogen/m_global_parameters.fpp.f90.o] Error 127
make[2]: *** [CMakeFiles/Makefile2:84: CMakeFiles/simulation.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:91: CMakeFiles/simulation.dir/rule] Error 2
make: *** [Makefile:170: simulation] Error 2


Error: Failed to build the simulation target.

Terminated
mfc: (venv) Exiting the Python virtual environment.
@sbryngelson sbryngelson added the bug Something isn't working label Mar 18, 2023
@wilfonba wilfonba mentioned this issue Mar 18, 2023
@henryleberre
Member

I first reported this when merging the GPU branch with master last summer (on MFC-develop): https://forums.developer.nvidia.com/t/nvhpc-22-5-fort2-terminated-by-signal-11/219545. The bug disappeared and then reappeared. This is a bug in NVHPC (the compiler shouldn't segfault). --debug was very helpful. I could write a script to go back and find which commit started all of this again. Should I?

@henryleberre
Member

Back then MFC's build system was less than desirable and it complicated the reporting process.

@sbryngelson
Member Author

I would report it again, then. If you believe this was triggered by a specific commit, it would be useful to go through old releases.
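
The commit search discussed above could be automated roughly as follows. This is only a sketch: the names `first_bad_commit`, `builds_ok`, and `builds_ok_with_mfc` are hypothetical, and it assumes that `./mfc.sh build` exits with a nonzero status when the build fails. It also assumes the failure is monotonic (once a commit breaks the build, every later commit is broken too); since the bug reportedly disappeared and reappeared, the search may land on one good-to-bad transition rather than the earliest one.

```python
import subprocess


def first_bad_commit(commits, builds_ok):
    """Binary-search `commits` (ordered oldest to newest) for the first failing one.

    Assumes commits[0] builds, commits[-1] fails, and that the build stays
    broken once it breaks.
    """
    lo, hi = 0, len(commits) - 1  # lo: known-good index, hi: known-bad index
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if builds_ok(commits[mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi]


def builds_ok_with_mfc(commit):
    """Hypothetical predicate: check out `commit` and try the failing build.

    The build command mirrors the one in the log above; treating a zero exit
    status as success is an assumption about mfc.sh.
    """
    subprocess.run(["git", "checkout", "--quiet", commit], check=True)
    result = subprocess.run(
        ["./mfc.sh", "build", "-t", "simulation", "--debug", "--gpu"]
    )
    return result.returncode == 0
```

In practice one would pass a list of commit hashes (for example, release tags from oldest to newest) together with `builds_ok_with_mfc`. `git bisect run` performs the same search natively and may be the simpler choice.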

henryleberre pushed a commit to henryleberre/MFC that referenced this issue Mar 21, 2023
@sbryngelson
Member Author

sbryngelson commented Mar 22, 2023

From Mat Colgrove:

Oddly, I’m not able to reproduce the error with 22.5 or 22.11, but can with 23.1 and our development compiler. Sometimes this can mean that there’s a UMR or other memory issue where it only shows up occasionally, but running the 22.11 back-end compiler through valgrind shows no issues. Hence while it’s likely the same issue, I’m not 100% sure.

I filed a problem report, TPR #33317, and sent it to engineering for investigation.

In my case, if I add an opt level, i.e. change “-O0” to “-O1” or “-O2”, the error goes away. You might try this as well, just as a work around until we can get this fixed.

-Mat

https://forums.developer.nvidia.com/t/nvhpc-22-5-fort2-terminated-by-signal-11/219545/8
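
Mat's workaround amounts to swapping the optimization flag in the failing compile command. A minimal sketch of that substitution, using the user-facing flags copied from the nvfortran command line in the build log above; the helper name `with_opt_level` is hypothetical, and how MFC's build system actually sets these flags is not shown in this thread.

```python
import shlex

# Flags copied from the failing nvfortran invocation in the build log.
FAILING_FLAGS = (
    "-g -O0 -Mbounds -r8 -Mfreeform -Mr8intrinsics -cpp -Mpreprocess "
    "-Minfo=accel -gpu=keep,ptxinfo,lineinfo,autocompare,debug -acc"
)


def with_opt_level(flags, level):
    """Return the flags as a list, with any -O<digit> replaced by -O<level>.

    If no optimization flag is present, -O<level> is appended.
    """
    out = []
    replaced = False
    for flag in shlex.split(flags):
        if len(flag) == 3 and flag.startswith("-O") and flag[2].isdigit():
            out.append(f"-O{level}")
            replaced = True
        else:
            out.append(flag)
    if not replaced:
        out.append(f"-O{level}")
    return out
```

For example, `with_opt_level(FAILING_FLAGS, 1)` yields the same flag list with `-O1` in place of `-O0`, which is the variant Mat reports as compiling without the crash.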

@sbryngelson
Member Author

I can confirm that raising the optimization level above -O0 (as Mat suggests) avoids the error. Will submit a PR to this effect.
