diff --git a/.cursor/rules/mfc-agent-rules.mdc b/.cursor/rules/mfc-agent-rules.mdc index 02fe5d7e0e..6a8baddd64 100644 --- a/.cursor/rules/mfc-agent-rules.mdc +++ b/.cursor/rules/mfc-agent-rules.mdc @@ -16,7 +16,7 @@ Written primarily for Fortran/Fypp; the OpenACC and style sections matter only w - Most sources are `.fpp`; CMake transpiles them to `.f90`. - **Fypp macros** live in `src//include/` you should scan these first. `` ∈ {`simulation`,`common`,`pre_process`,`post_process`}. -- Only `simulation` (+ its `common` calls) is GPU-accelerated via **OpenACC**. +- Only `simulation` (+ its `common` calls) is GPU-accelerated via **OpenACC** or **OpenMP**. - Assume free-form Fortran 2008+, `implicit none`, explicit `intent`, and modern intrinsics. - Prefer `module … contains … subroutine foo()`; avoid `COMMON` blocks and @@ -56,27 +56,29 @@ Written primarily for Fortran/Fypp; the OpenACC and style sections matter only w * Every variable: `intent(in|out|inout)` + appropriate `dimension` / `allocatable` / `pointer`. * Use `s_mpi_abort()` for errors, not `stop`. -* Mark OpenACC-callable helpers that are called from OpenACC parallel loops immediately after declaration: +* Mark GPU-callable helpers that are called from GPU parallel loops immediately after declaration: ```fortran subroutine s_flux_update(...) - !$acc routine seq + $:GPU_ROUTINE(function_name='s_flux_update', parallelism='[seq]') ... end subroutine ``` --- -# 3 OpenACC Programming Guidelines (for kernels) +# 3 FYPP Macros for GPU acceleration Pogramming Guidelines (for kernels) + +Do not directly use OpenACC or OpenMP directives directly. Instead, use the FYPP macros contained in src/common/include/parallel_macros.fpp Wrap tight loops with ```fortran -!$acc parallel loop gang vector default(present) reduction(...) +$:GPU_PARALLEL_FOR(private='[...]', copy='[...]') ``` -* Add `collapse(n)` to merge nested loops when safe. -* Declare loop-local variables with `private(...)`. +* Add `collapse=n` to merge nested loops when safe. +* Declare loop-local variables with `private='[...]'`. * Allocate large arrays with `managed` or move them into a persistent - `!$acc enter data` region at start-up. + `$:GPU_ENTER_DATA(...)` region at start-up. * **Do not** place `stop` / `error stop` inside device code. * Must compile with Cray `ftn` and NVIDIA `nvfortran` for GPU offloading; also build CPU-only with GNU `gfortran` and Intel `ifx`/`ifort`. diff --git a/docs/documentation/gpuDebugging.md b/docs/documentation/gpuDebugging.md new file mode 100644 index 0000000000..6535dbbbbd --- /dev/null +++ b/docs/documentation/gpuDebugging.md @@ -0,0 +1,156 @@ +# Debugging Tools and Tips for GPUs + +## Compiler agnostic tools + +## OpenMP tools +```bash +OMP_DISPLAY_ENV=true | false | verbose +``` +- Prints out the internal control values and environment variables at the beginning of the program if `true` or `verbose` +- `verbose` will also print out vendor-specific internal control values and environment variables + +```bash +OMP_TARGET_OFFLOAD = MANDATORY | DISABLED | DEFAULT +``` +- Quick way to turn off off-load (`DISABLED`) or make it abort if a GPU isn't found (`MANDATORY`) +- Great first test: does the problem disappear when you drop back to the CPU? + +```bash +OMP_THREAD_LIMIT= +``` +- Sets the maximum number of OpenMP threads to use in a contention group +- Might be useful in checking for issues with contention or race conditions + +```bash +OMP_DISPLAY_AFFINITY=TRUE +``` +- Will display affinity bindings for each OpenMP thread, containing hostname, process identifier, OS thread identifier, OpenMP thread identifier, and affinity binding. + +## Cray Compiler Tools + +### Cray General Options + +```bash +CRAY_ACC_DEBUG: 0 (off), 1, 2, 3 (very noisy) +``` +- Dumps a time-stamped log line (`"ACC: ...`) for every allocation, data transfer, kernel launch, wait, etc. Great first stop when "nothing seems to run on the GPU. + +- Outputs on STDERR by default. Can be changed by setting `CRAY_ACC_DEBUG_FILE`. + - Recognizes `stderr`, `stdout`, and `process`. + - `process` automatically generates a new file based on `pid` (each MPI process will have a different file) + +- While this environment variable specifies ACC, it can be used for both OpenACC and OpenMP + +```bash +CRAY_ACC_FORCE_EARLY_INIT=1 +``` +- Force full GPU initialization at program start so you can see start-up hangs immediately +- Default behavior without an environment variable is to defer initialization on first use +- Device initialization includes initializing the GPU vendor’s low-level device runtime library (e.g., libcuda for NVIDIA GPUs) and establishing all necessary software contexts for interacting with the device + +### Cray OpenACC Options + +```bash +CRAY_ACC_PRESENT_DUMP_SAVE_NAMES=1 +``` +- Will cause `acc_present_dump()` to output variable names and file locations in addition to variable mappings +- Add `acc_present_dump()` around hotspots to help find problems with data movements + - Helps more if adding `CRAY_ACC_DEBUG` environment variable + +## NVHPC Compiler Options + +### NVHPC General Options + +```bash +STATIC_RANDOM_SEED=1 +``` +- Forces the seed returned by `RANDOM_SEED` to be constant, so it generates the same sequence of random numbers +- Useful for testing issues with randomized data + +```bash +NVCOMPILER_TERM=option[,option] +``` +- `[no]debug`: Enables/disables just-in-time debugging (debugging invoked on error) +- `[no]trace`: Enables/disables stack traceback on error + +### NVHPC OpenACC Options + +```bash +NVCOMPILER_ACC_NOTIFY= +``` +- Assign the environment variable to a bitmask to print out information to stderr for the following + - kernel launches: 1 + - data transfers: 2 + - region entry/exit: 4 + - wait operation of synchronizations with the device: 8 + - device memory allocations and deallocations: 16 +- 1 (kernels only) is the usual first step.3 (kernels + copies) is great for "why is it so slow?" + +```bash +NVCOMPILER_ACC_TIME=1 +``` +- Lightweight profiler +- prints a tidy end-of-run table with per-region and per-kernel times and bytes moved +- Do not use with CUDA profiler at the same time + +```bash +NVCOMPILER_ACC_DEBUG=1 +``` +- Spews everything the runtime sees: host/device addresses, mapping events, present-table look-ups, etc. +- Great for "partially present" or "pointer went missing" errors. +- [Doc for NVCOMPILER_ACC_DEBUG](https://docs.nvidia.com/hpc-sdk/archive/20.9/pdf/hpc209openacc_gs.pdf) + - Ctrl+F for `NVCOMPILER_ACC_DEBUG` + +### NVHPC OpenMP Options + +```bash +LIBOMPTARGET_PROFILE=run.json +``` +- Emits a Chrome-trace (JSON) timeline you can open in chrome://tracing or Speedscope +- Great lightweight profiler when Nsight is overkill. +- Granularity in µs via `LIBOMPTARGET_PROFILE_GRANULARITY` (default 500). + +```bash +LIBOMPTARGET_INFO= +``` +- Prints out different types of runtime information +- Human-readable log of data-mapping inserts/updates, kernel launches, copies, waits. +- Perfect first stop for "why is nothing copied?" +- Flags + - Print all data arguments upon entering an OpenMP device kernel: 0x01 + - Indicate when a mapped address already exists in the device mapping table: 0x02 + - Dump the contents of the device pointer map at kernel exit: 0x04 + - Indicate when an entry is changed in the device mapping table: 0x08 + - Print OpenMP kernel information from device plugins: 0x10 + - Indicate when data is copied to and from the device: 0x20 + +```bash +LIBOMPTARGET_DEBUG=1 +``` +- Developer-level trace (host-side) +- Much noisier than `INFO` +- Only works if the runtime was built with `-DOMPTARGET_DEBUG`. + +```bash +LIBOMPTARGET_JIT_OPT_LEVEL=-O{0,1,2,3} +``` +- This environment variable can be used to change the optimization pipeline used to optimize the embedded device code as part of the device JIT. +- The value corresponds to the `-O{0,1,2,3}` command line argument passed to clang. + +```bash +LIBOMPTARGET_JIT_SKIP_OPT=1 +``` +- This environment variable can be used to skip the optimization pipeline during JIT compilation. +- If set, the image will only be passed through the backend. +- The backend is invoked with the `LIBOMPTARGET_JIT_OPT_LEVEL` flag. + +## Compiler Documentation + +- [Cray & OpenMP Docs](https://cpe.ext.hpe.com/docs/24.11/cce/man7/intro_openmp.7.html#environment-variables) +- [Cray & OpenACC Docs](https://cpe.ext.hpe.com/docs/24.11/cce/man7/intro_openacc.7.html#environment-variables) +- [NVHPC & OpenACC Docs](https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html?highlight=NVCOMPILER_#environment-variables) +- [NVHPC & OpenMP Docs](https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html?highlight=NVCOMPILER_#id2) +- [LLVM & OpenMP Docs] (https://openmp.llvm.org/design/Runtimes.html) + - NVHPC is built on top of LLVM +- [OpenMP Docs](https://www.openmp.org/spec-html/5.1/openmp.html) +- [OpenACC Docs](https://www.openacc.org/sites/default/files/inline-files/OpenACC.2.7.pdf) diff --git a/docs/documentation/readme.md b/docs/documentation/readme.md index 8bd2957332..5ca45150e4 100644 --- a/docs/documentation/readme.md +++ b/docs/documentation/readme.md @@ -10,6 +10,7 @@ - [Flow Visualization](visualization.md) - [Performance](expectedPerformance.md) - [GPU Parallelization](gpuParallelization.md) +- [GPU Debugging](gpuDebugging.md) - [MFC's Authors](authors.md) - [References](references.md)