From dd70fc8035184f407f8c696a9bb602c1aaa7f63b Mon Sep 17 00:00:00 2001 From: Tanush Prathi Date: Fri, 18 Jul 2025 21:14:44 -0400 Subject: [PATCH 1/3] Added GPU debugging and update cursor rules --- .cursor/rules/mfc-agent-rules.mdc | 18 ++-- docs/documentation/gpuDebugging.md | 155 +++++++++++++++++++++++++++++ docs/documentation/readme.md | 1 + 3 files changed, 166 insertions(+), 8 deletions(-) create mode 100644 docs/documentation/gpuDebugging.md diff --git a/.cursor/rules/mfc-agent-rules.mdc b/.cursor/rules/mfc-agent-rules.mdc index 02fe5d7e0e..6a8baddd64 100644 --- a/.cursor/rules/mfc-agent-rules.mdc +++ b/.cursor/rules/mfc-agent-rules.mdc @@ -16,7 +16,7 @@ Written primarily for Fortran/Fypp; the OpenACC and style sections matter only w - Most sources are `.fpp`; CMake transpiles them to `.f90`. - **Fypp macros** live in `src//include/` you should scan these first. `` ∈ {`simulation`,`common`,`pre_process`,`post_process`}. -- Only `simulation` (+ its `common` calls) is GPU-accelerated via **OpenACC**. +- Only `simulation` (+ its `common` calls) is GPU-accelerated via **OpenACC** or **OpenMP**. - Assume free-form Fortran 2008+, `implicit none`, explicit `intent`, and modern intrinsics. - Prefer `module … contains … subroutine foo()`; avoid `COMMON` blocks and @@ -56,27 +56,29 @@ Written primarily for Fortran/Fypp; the OpenACC and style sections matter only w * Every variable: `intent(in|out|inout)` + appropriate `dimension` / `allocatable` / `pointer`. * Use `s_mpi_abort()` for errors, not `stop`. -* Mark OpenACC-callable helpers that are called from OpenACC parallel loops immediately after declaration: +* Mark GPU-callable helpers that are called from GPU parallel loops immediately after declaration: ```fortran subroutine s_flux_update(...) - !$acc routine seq + $:GPU_ROUTINE(function_name='s_flux_update', parallelism='[seq]') ... 
end subroutine ``` --- -# 3 OpenACC Programming Guidelines (for kernels) +# 3 FYPP Macros for GPU acceleration Programming Guidelines (for kernels) + +Do not use OpenACC or OpenMP directives directly. Instead, use the FYPP macros contained in `src/common/include/parallel_macros.fpp`. Wrap tight loops with ```fortran -!$acc parallel loop gang vector default(present) reduction(...) +$:GPU_PARALLEL_FOR(private='[...]', copy='[...]') ``` -* Add `collapse(n)` to merge nested loops when safe. -* Declare loop-local variables with `private(...)`. +* Add `collapse=n` to merge nested loops when safe. +* Declare loop-local variables with `private='[...]'`. * Allocate large arrays with `managed` or move them into a persistent - `!$acc enter data` region at start-up. + `$:GPU_ENTER_DATA(...)` region at start-up. * **Do not** place `stop` / `error stop` inside device code. * Must compile with Cray `ftn` and NVIDIA `nvfortran` for GPU offloading; also build CPU-only with GNU `gfortran` and Intel `ifx`/`ifort`. diff --git a/docs/documentation/gpuDebugging.md b/docs/documentation/gpuDebugging.md new file mode 100644 index 0000000000..daac529552 --- /dev/null +++ b/docs/documentation/gpuDebugging.md @@ -0,0 +1,155 @@ +# Debugging Tools and Tips for GPUs + +## Compiler agnostic tools + +## OpenMP tools +```bash +OMP_DISPLAY_ENV=true | false | verbose +``` +- Prints out the internal control values and environment variables at beginning of program if `true` or `verbose` +- `verbose` will also print out vendor-specific internal control values and environment variables + +```bash +OMP_TARGET_OFFLOAD = MANDATORY | DISABLED | DEFAULT +``` +- Quick way to turn off off-load (DISABLED) or make it abort if a GPU isn't found (MANDATORY) +- great first test: does the problem disappear when you drop back to the CPU?
+ +```bash +OMP_THREAD_LIMIT= +``` +- Sets maximum number of OpenMP threads to use in a contention group +- Might be useful in checking for issues with contention or race conditions + +```bash +OMP_DISPLAY_AFFINITY=TRUE +``` +- Will display affinity bindings for each OpenMP thread, containing hostname, process identifier, OS thread identifier, OpenMP thread identifier, and affinity binding. + +## Cray Compiler Tools + +### Cray General Options + +```bash +CRAY_ACC_DEBUG: 0 (off), 1, 2, 3 (very noisy) +``` +- Dumps a time-stamped log line ("ACC: …) for every allocation, data transfer, kernel launch, wait, etc. Great first stop when "nothing seems to run on the GPU. + +- Outputs on STDERR by default. Can be changed by setting `CRAY_ACC_DEBUG_FILE`. + - Recognizes `stderr`, `stdout`, and `process`. + - `process` automatically generates a new file based on `pid` (each MPI process will have a different file) + +- While this enviornment variable specifies ACC, it can be used for both OpenACC and OpenMP + +```bash +CRAY_ACC_FORCE_EARLY_INIT=1 +``` +- Force full GPU initialization at program start so you can see start-up hangs immediately +- Default behavior without environment variable is to defer initalization on first use +- Device initialization includes initializing the GPU vendor’s low-level device runtime library (e.g., libcuda for NVIDIA GPUs) and establishing all necessary software contexts for interacting with the device + +### Cray OpenACC Options + +```bash +CRAY_ACC_PRESENT_DUMP_SAVE_NAMES=1 +``` +- Will cause acc_present_dump() to output variable names and file locations in addition to variable mappings +- Add acc_present_dump() around hotspots to help find problems with data movements + - Helps more if adding `CRAY_ACC_DEBUG` environment variable + +## NVHPC Compiler Options + +### NVHPC General Options + +```bash +STATIC_RANDOM_SEED=1 +``` +- Forces the seed returned by RANDOM_SEED to be constant, so generates same sequence of random numbers +- Useful for 
testing issues with randomized data + +```bash +NVCOMPILER_TERM=option[,option] +``` +- `[no]debug`: Enables/disables just-in-time debugging (debugging invoked on error) +- `[no]trace`: Enables/disables stack traceback on error + +### NVHPC OpenACC Options + +```bash +NVCOMPILER_ACC_NOTIFY= +``` +- Set the environment variable to a bitmask to print information to stderr about the following events + - kernel launches: 1 + - data transfers: 2 + - region entry/exit: 4 + - wait operations or synchronizations with the device: 8 + - device memory allocations and deallocations: 16 +- 1 (kernels only) is the usual first step. 3 (kernels + copies) is great for "why is it so slow?" + +```bash +NVCOMPILER_ACC_TIME=1 +``` +- Lightweight profiler +- Prints a tidy end-of-run table with per-region and per-kernel times and bytes moved +- Do not use with CUDA profiler at the same time + +```bash +NVCOMPILER_ACC_DEBUG=1 +``` +- Spews everything the runtime sees: host/device addresses, mapping events, present-table look-ups, etc. +- Great for "partially present" or "pointer went missing" errors. +- [Doc for NVCOMPILER_ACC_DEBUG](https://docs.nvidia.com/hpc-sdk/archive/20.9/pdf/hpc209openacc_gs.pdf) + - Ctrl+F for `NVCOMPILER_ACC_DEBUG` + +### NVHPC OpenMP Options + +```bash +LIBOMPTARGET_PROFILE=run.json +``` +- Emits a Chrome-trace (JSON) timeline you can open in chrome://tracing or Speedscope +- great lightweight profiler when Nsight is over-kill. +- Granularity in µs via `LIBOMPTARGET_PROFILE_GRANULARITY` (default 500). + +```bash +LIBOMPTARGET_INFO= +``` +- Prints out different types of runtime information +- Human-readable log of data-mapping inserts/updates, kernel launches, copies, waits. +- Perfect first stop for "why is nothing copied?"
+- Flags + - Print all data arguments upon entering an OpenMP device kernel: 0x01 + - Indicate when a mapped address already exists in the device mapping table: 0x02 + - Dump the contents of the device pointer map at kernel exit: 0x04 + - Indicate when an entry is changed in the device mapping table: 0x08 + - Print OpenMP kernel information from device plugins: 0x10 + - Indicate when data is copied to and from the device: 0x20 + +```bash +LIBOMPTARGET_DEBUG=1 +``` +- Developer-level trace (host-side) +- Much noisier than INFO +- only works if the runtime was built with -DOMPTARGET_DEBUG. + +```bash +LIBOMPTARGET_JIT_OPT_LEVEL=-O{0,1,2,3} +``` +- This environment variable can be used to change the optimization pipeline used to optimize the embedded device code as part of the device JIT. +- The value is corresponds to the -O{0,1,2,3} command line argument passed to clang. + +```bash +LIBOMPTARGET_JIT_SKIP_OPT=1 +``` +- This environment variable can be used to skip the optimization pipeline during JIT compilation. +- If set, the image will only be passed through the backend. +- The backend is invoked with the LIBOMPTARGET_JIT_OPT_LEVEL flag. 
+ +## Compiler Documentation +- [Cray & OpenMP Docs](https://cpe.ext.hpe.com/docs/24.11/cce/man7/intro_openmp.7.html#environment-variables) +- [Cray & OpenACC Docs](https://cpe.ext.hpe.com/docs/24.11/cce/man7/intro_openacc.7.html#environment-variables) +- [NVHPC & OpenACC Docs](https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html?highlight=NVCOMPILER_#environment-variables) +- [NVHPC & OpenMP Docs](https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html?highlight=NVCOMPILER_#id2) +- [LLVM & OpenMP Docs] (https://openmp.llvm.org/design/Runtimes.html) + - NVHPC is built on top of LLVM +- [OpenMP Docs](https://www.openmp.org/spec-html/5.1/openmp.html) +- [OpenACC Docs](https://www.openacc.org/sites/default/files/inline-files/OpenACC.2.7.pdf) \ No newline at end of file diff --git a/docs/documentation/readme.md b/docs/documentation/readme.md index 8bd2957332..5ca45150e4 100644 --- a/docs/documentation/readme.md +++ b/docs/documentation/readme.md @@ -10,6 +10,7 @@ - [Flow Visualization](visualization.md) - [Performance](expectedPerformance.md) - [GPU Parallelization](gpuParallelization.md) +- [GPU Debugging](gpuDebugging.md) - [MFC's Authors](authors.md) - [References](references.md) From c7c04bd8410be71b538e979cadc3dd3c6c0ca1a1 Mon Sep 17 00:00:00 2001 From: Tanush Prathi Date: Fri, 18 Jul 2025 21:19:49 -0400 Subject: [PATCH 2/3] Ran spellcheck --- docs/documentation/gpuDebugging.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/documentation/gpuDebugging.md b/docs/documentation/gpuDebugging.md index daac529552..379a8caa2d 100644 --- a/docs/documentation/gpuDebugging.md +++ b/docs/documentation/gpuDebugging.md @@ -39,13 +39,13 @@ CRAY_ACC_DEBUG: 0 (off), 1, 2, 3 (very noisy) - Recognizes `stderr`, `stdout`, and `process`. 
- `process` automatically generates a new file based on `pid` (each MPI process will have a different file) -- While this enviornment variable specifies ACC, it can be used for both OpenACC and OpenMP +- While this environment variable specifies ACC, it can be used for both OpenACC and OpenMP ```bash CRAY_ACC_FORCE_EARLY_INIT=1 ``` - Force full GPU initialization at program start so you can see start-up hangs immediately -- Default behavior without environment variable is to defer initalization on first use +- Default behavior without environment variable is to defer initialization on first use - Device initialization includes initializing the GPU vendor’s low-level device runtime library (e.g., libcuda for NVIDIA GPUs) and establishing all necessary software contexts for interacting with the device ### Cray OpenACC Options From 3e4982b07a23215c1f34a963e7401f93ec39349a Mon Sep 17 00:00:00 2001 From: Spencer Bryngelson Date: Sat, 19 Jul 2025 09:39:51 -0400 Subject: [PATCH 3/3] Update gpuDebugging.md --- docs/documentation/gpuDebugging.md | 33 +++++++++++++++--------------- 1 file changed, 17 insertions(+), 16 deletions(-) diff --git a/docs/documentation/gpuDebugging.md b/docs/documentation/gpuDebugging.md index 379a8caa2d..6535dbbbbd 100644 --- a/docs/documentation/gpuDebugging.md +++ b/docs/documentation/gpuDebugging.md @@ -6,19 +6,19 @@ ```bash OMP_DISPLAY_ENV=true | false | verbose ``` -- Prints out the internal control values and environment variables at beginning of program if `true` or `verbose` +- Prints out the internal control values and environment variables at the beginning of the program if `true` or `verbose` - `verbose` will also print out vendor-specific internal control values and environment variables ```bash OMP_TARGET_OFFLOAD = MANDATORY | DISABLED | DEFAULT ``` -- Quick way to turn off off-load (DISABLED) or make it abort if a GPU isn't found (MANDATORY) -- great first test: does the problem disappear when you drop back to the CPU? 
+- Quick way to turn off offload (`DISABLED`) or make it abort if a GPU isn't found (`MANDATORY`) +- Great first test: does the problem disappear when you drop back to the CPU? ```bash OMP_THREAD_LIMIT= ``` -- Sets maximum number of OpenMP threads to use in a contention group +- Sets the maximum number of OpenMP threads to use in a contention group - Might be useful in checking for issues with contention or race conditions ```bash @@ -33,7 +33,7 @@ OMP_DISPLAY_AFFINITY=TRUE ```bash CRAY_ACC_DEBUG: 0 (off), 1, 2, 3 (very noisy) ``` -- Dumps a time-stamped log line ("ACC: …) for every allocation, data transfer, kernel launch, wait, etc. Great first stop when "nothing seems to run on the GPU. +- Dumps a time-stamped log line (`ACC: ...`) for every allocation, data transfer, kernel launch, wait, etc. Great first stop when "nothing seems to run on the GPU." - Outputs on STDERR by default. Can be changed by setting `CRAY_ACC_DEBUG_FILE`. - Recognizes `stderr`, `stdout`, and `process`. @@ -45,16 +45,16 @@ CRAY_ACC_DEBUG: 0 (off), 1, 2, 3 (very noisy) CRAY_ACC_FORCE_EARLY_INIT=1 ``` - Force full GPU initialization at program start so you can see start-up hangs immediately -- Default behavior without environment variable is to defer initialization on first use -- Device initialization includes initializing the GPU vendor’s low-level device runtime library (e.g., libcuda for NVIDIA GPUs) and establishing all necessary software contexts for interacting with the device +- Without this environment variable, the default behavior is to defer initialization until first use +- Device initialization includes initializing the GPU vendor’s low-level device runtime library (e.g., libcuda for NVIDIA GPUs) and establishing all necessary software contexts for interacting with the device ### Cray OpenACC Options ```bash CRAY_ACC_PRESENT_DUMP_SAVE_NAMES=1 ``` -- Will cause acc_present_dump() to output variable names and file locations in addition to variable mappings -- Add acc_present_dump() around
hotspots to help find problems with data movements +- Will cause `acc_present_dump()` to output variable names and file locations in addition to variable mappings +- Add `acc_present_dump()` around hotspots to help find problems with data movements - Helps more if adding `CRAY_ACC_DEBUG` environment variable ## NVHPC Compiler Options @@ -64,7 +64,7 @@ CRAY_ACC_PRESENT_DUMP_SAVE_NAMES=1 ```bash STATIC_RANDOM_SEED=1 ``` -- Forces the seed returned by RANDOM_SEED to be constant, so generates same sequence of random numbers +- Forces the seed returned by `RANDOM_SEED` to be constant, so it generates the same sequence of random numbers - Useful for testing issues with randomized data ```bash @@ -107,7 +107,7 @@ NVCOMPILER_ACC_DEBUG=1 LIBOMPTARGET_PROFILE=run.json ``` - Emits a Chrome-trace (JSON) timeline you can open in chrome://tracing or Speedscope -- great lightweight profiler when Nsight is over-kill. +- Great lightweight profiler when Nsight is overkill. - Granularity in µs via `LIBOMPTARGET_PROFILE_GRANULARITY` (default 500). ```bash @@ -128,23 +128,24 @@ LIBOMPTARGET_INFO= LIBOMPTARGET_DEBUG=1 ``` - Developer-level trace (host-side) -- Much noisier than INFO -- only works if the runtime was built with -DOMPTARGET_DEBUG. +- Much noisier than `INFO` +- Only works if the runtime was built with `-DOMPTARGET_DEBUG`. ```bash LIBOMPTARGET_JIT_OPT_LEVEL=-O{0,1,2,3} ``` - This environment variable can be used to change the optimization pipeline used to optimize the embedded device code as part of the device JIT. -- The value is corresponds to the -O{0,1,2,3} command line argument passed to clang. +- The value corresponds to the `-O{0,1,2,3}` command line argument passed to clang. ```bash LIBOMPTARGET_JIT_SKIP_OPT=1 ``` - This environment variable can be used to skip the optimization pipeline during JIT compilation. - If set, the image will only be passed through the backend. -- The backend is invoked with the LIBOMPTARGET_JIT_OPT_LEVEL flag. 
+- The backend is invoked with the `LIBOMPTARGET_JIT_OPT_LEVEL` flag. ## Compiler Documentation + - [Cray & OpenMP Docs](https://cpe.ext.hpe.com/docs/24.11/cce/man7/intro_openmp.7.html#environment-variables) - [Cray & OpenACC Docs](https://cpe.ext.hpe.com/docs/24.11/cce/man7/intro_openacc.7.html#environment-variables) - [NVHPC & OpenACC Docs](https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html?highlight=NVCOMPILER_#environment-variables) @@ -152,4 +153,4 @@ LIBOMPTARGET_JIT_SKIP_OPT=1 - [LLVM & OpenMP Docs] (https://openmp.llvm.org/design/Runtimes.html) - NVHPC is built on top of LLVM - [OpenMP Docs](https://www.openmp.org/spec-html/5.1/openmp.html) -- [OpenACC Docs](https://www.openacc.org/sites/default/files/inline-files/OpenACC.2.7.pdf) \ No newline at end of file +- [OpenACC Docs](https://www.openacc.org/sites/default/files/inline-files/OpenACC.2.7.pdf)
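
Usage note for the bitmask variables documented in `gpuDebugging.md` (`NVCOMPILER_ACC_NOTIFY` and `LIBOMPTARGET_INFO`): the listed values can be combined with a bitwise OR. The sketch below is illustrative only and is not part of the patch; the shell variable names (`KERNEL_LAUNCHES`, `PRINT_KERNEL_ARGS`, and so on) are hypothetical labels chosen here, while the numeric values are the ones listed in the documentation above.

```bash
#!/usr/bin/env bash
# Sketch: compose the bitmask values listed in gpuDebugging.md.
# The helper names below are illustrative labels, not runtime-defined symbols.

# NVCOMPILER_ACC_NOTIFY bits (NVHPC OpenACC runtime)
KERNEL_LAUNCHES=1
DATA_TRANSFERS=2

# 1 | 2 = 3: report kernel launches and data transfers
export NVCOMPILER_ACC_NOTIFY=$(( KERNEL_LAUNCHES | DATA_TRANSFERS ))

# LIBOMPTARGET_INFO flags (OpenMP offload runtime)
PRINT_KERNEL_ARGS=0x01   # print data arguments on kernel entry
PRINT_DATA_COPIES=0x20   # report copies to and from the device

# 0x01 | 0x20 = 0x21, exported in decimal
export LIBOMPTARGET_INFO=$(( PRINT_KERNEL_ARGS | PRINT_DATA_COPIES ))

echo "NVCOMPILER_ACC_NOTIFY=${NVCOMPILER_ACC_NOTIFY}"
echo "LIBOMPTARGET_INFO=${LIBOMPTARGET_INFO}"
```

Sourcing this before launching a run would enable only the selected message classes, which keeps the log smaller than turning on every bit.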