Skip to content

AMD Flang Weno Optimizations#1374

Merged
sbryngelson merged 19 commits intoMFlowCode:masterfrom
danieljvickers:weno-array-unwrap
Apr 21, 2026
Merged

AMD Flang Weno Optimizations#1374
sbryngelson merged 19 commits intoMFlowCode:masterfrom
danieljvickers:weno-array-unwrap

Conversation

@danieljvickers
Copy link
Copy Markdown
Member

Description

Loop unwrapping to optimize AMD Flang GPU performance for WENO.

Type of change

  • Bug fix

Testing

Performance checking of the examples/3D_performance_test case

AI code reviews

Reviews are not triggered automatically. To request a review, comment on the PR:

  • @coderabbitai review — incremental review (new changes only)
  • @coderabbitai full review — full review from scratch
  • /review — Qodo review
  • /improve — Qodo code suggestions
  • @claude full review — Claude full review (also triggers on PR open/reopen/ready)
  • Add label claude-full-review — Claude full review via label

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 16, 2026

Claude Code Review

Head SHA: 26e982e

Files changed:

  • 2
  • CMakeLists.txt
  • src/simulation/m_weno.fpp

Findings

1. WENO5 final omega normalization hardcodes stencil indices 0–2 instead of using a loop (lines ~1104–1106 and ~1145–1147)

The PR replaces omega = alpha/sum(alpha) in the WENO5 path with:

omega(0) = alpha(0)/(alpha(0) + alpha(1) + alpha(2))
omega(1) = alpha(1)/(alpha(0) + alpha(1) + alpha(2))
omega(2) = alpha(2)/(alpha(0) + alpha(1) + alpha(2))

This hardcodes weno_num_stencils == 2. Every other conversion in this PR consistently uses do q = 0, weno_num_stencils. A do q loop with a precomputed denominator would be both consistent and avoid recomputing the sum three times. The current form silently gives wrong or incomplete results if weno_num_stencils differs.

2. Intermediate omega = alpha/sum(alpha) inside the mapped_weno branch is not converted (WENO3 and WENO5)

The mapped_weno branch computes an initial omega before remapping alpha:

do q = 0, weno_num_stencils
    alpha(q) = d_cbL/R_${XYZ}$(q, j)/(beta(q)**2._wp)
end do
omega = alpha/sum(alpha)   ! <- still an array operation
do q = 0, weno_num_stencils
    alpha(q) = ...remapped formula using omega(q)...
end do

The surrounding alpha assignments were converted to do q loops to fix AMD flang GPU array-section issues, but the intermediate omega = alpha/sum(alpha) — which uses the same sum(array_section) construct — was left unchanged. This intermediate statement will still hit the same compiler limitation on AMD flang OpenMP GPU builds, silently producing wrong omega values when mapped_weno is active.

3. WENO3 final omega = alpha/sum(alpha) not converted

The WENO5 path's final omega = alpha/sum(alpha) is converted (finding 1 above), but the identical statement in the WENO3 path (outside the if/else block, in the weno_order == 3 section) is left as an array operation. If AMD flang OpenMP GPU builds cannot handle omega = alpha/sum(alpha) for WENO5, the same issue applies to WENO3 (weno_num_stencils == 1, two stencils). The fix appears incomplete for third-order WENO.

@danieljvickers danieljvickers marked this pull request as ready for review April 20, 2026 19:12
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 20, 2026

📝 Walkthrough

Walkthrough

This pull request contains two primary modifications: First, the CMakeLists.txt expands compile options for OpenMP GPU offloading with the LLVMFlang compiler, adding -fopenmp-target-fast, -fopenmp-assume-threads-oversubscription, and -fopenmp-assume-teams-oversubscription to the existing flags while keeping link options unchanged. Second, the m_weno.fpp file refactors WENO computation logic by replacing array section assignments with explicit element-wise loops indexed by q across multiple WENO schemes (wenojs, mapped_weno, wenoz), and restructures omega normalization in higher-order WENO paths from array division to individual scalar assignments.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The description is incomplete. It lacks required sections: missing 'Fixes #(issue)' reference, unchecked type-of-change boxes, incomplete testing details, and missing GPU testing checklist items. Complete the pull request description template: reference any related issue, check appropriate type-of-change boxes, provide detailed testing steps, and fill out the GPU changes checklist with results.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: loop unwrapping optimizations for WENO code targeting AMD Flang GPU performance.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
src/simulation/m_weno.fpp (2)

1107-1109: Optional: hoist the shared denominator and consider applying the same unroll to order-3/order-7.

Two minor points on the order-5 omega normalization changes at 1107-1109 and 1148-1150:

  1. The expression (alpha(0) + alpha(1) + alpha(2)) is recomputed three times per side. A compiler will likely CSE, but an explicit temporary makes intent clearer and is cheaper to read/verify:

    -                                        omega(0) = alpha(0)/(alpha(0) + alpha(1) + alpha(2))
    -                                        omega(1) = alpha(1)/(alpha(0) + alpha(1) + alpha(2))
    -                                        omega(2) = alpha(2)/(alpha(0) + alpha(1) + alpha(2))
    +                                        tau = alpha(0) + alpha(1) + alpha(2)   ! reuse scratch or add a local
    +                                        omega(0) = alpha(0)/tau
    +                                        omega(1) = alpha(1)/tau
    +                                        omega(2) = alpha(2)/tau

    (Use a dedicated local, not tau, if it's still live elsewhere — here it is not after this point in the left block. For the right block at 1148-1150, tau is still reusable by wenoz above, so add a fresh scalar and declare it private in the outer GPU_PARALLEL_LOOP.)

  2. The same omega = alpha/sum(alpha) pattern still exists at lines 986, 1015 (order-3), 1304, 1366 (order-7), and inside the order-5 mapped_weno/teno branches (1070, 1094). If the AMD Flang perf regression you targeted is specifically the array sum() reduction, consider applying the unroll uniformly (or justify in a comment why only the order-5 final normalization was unrolled). As-is, the half-done refactor is a small consistency wart.


965-983: Add GPU_LOOP(parallelism='[seq]') markers to new q loops for consistency with adjacent code.

The newly introduced do q = 0, weno_num_stencils loops inside GPU_PARALLEL_LOOP kernels (order-3 at lines 965–983 and 996–1012, order-5 at lines 1063–1075 and 1123–1135, order-7 at lines 1264–1276 and 1340–1352) lack explicit $:GPU_LOOP(parallelism='[seq]') annotations, while sibling wenoz/teno q loops in the same kernels carry these markers (e.g., lines 1080, 1088, 1096, 1137, 1142, 1280, 1292, 1354, 1360). Although inner loops default to sequential execution when unmarked (confirmed across nvfortran, Cray ftn, and AMD Flang), adding explicit seq markers maintains code clarity and consistency with established patterns in this function.

♻️ Example for order-3 left block
                                     if (wenojs) then
+                                        $:GPU_LOOP(parallelism='[seq]')
                                         do q = 0, weno_num_stencils
                                             alpha(q) = d_cbL_${XYZ}$ (q, j)/(beta(q)**2._wp)
                                         end do
                                     else if (mapped_weno) then
+                                        $:GPU_LOOP(parallelism='[seq]')
                                         do q = 0, weno_num_stencils
                                             alpha(q) = d_cbL_${XYZ}$ (q, j)/(beta(q)**2._wp)
                                         end do
                                         omega = alpha/sum(alpha)
+                                        $:GPU_LOOP(parallelism='[seq]')
                                         do q = 0, weno_num_stencils
                                             alpha(q) = (d_cbL_${XYZ}$ (q, j)*(1._wp + d_cbL_${XYZ}$ (q, &
                                                   & j) - 3._wp*omega(q)) + omega(q)**2._wp)*(omega(q)/(d_cbL_${XYZ}$ (q, &
                                                   & j)**2._wp + omega(q)*(1._wp - 2._wp*d_cbL_${XYZ}$ (q, j))))
                                         end do
                                     else if (wenoz) then
                                         ! Borges, et al. (2008)
                                         tau = abs(beta(1) - beta(0))
                                         do q = 0, weno_num_stencils
                                             alpha(q) = d_cbR_${XYZ}$ (q, j)*(1._wp + tau/beta(q))
                                         end do

Apply the same treatment to the mirrored right-side blocks (lines 996–1012, 1123–1135, 1340–1352) and order-5/7 counterparts (lines 1063–1075, 1264–1276).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: da41e88b-a73a-4de0-abe0-bebeba2682cb

📥 Commits

Reviewing files that changed from the base of the PR and between 92585b1 and 26e982e.

📒 Files selected for processing (2)
  • CMakeLists.txt
  • src/simulation/m_weno.fpp

Comment thread CMakeLists.txt
target_link_options(${a_target} PRIVATE -fopenmp)
elseif(CMAKE_Fortran_COMPILER_ID STREQUAL "LLVMFlang")
target_compile_options(${a_target} PRIVATE -fopenmp --offload-arch=gfx90a)
target_compile_options(${a_target} PRIVATE -fopenmp --offload-arch=gfx90a -fopenmp-target-fast -fopenmp-assume-threads-oversubscription -fopenmp-assume-teams-oversubscription)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

fc="${FC:-flang}"

if ! command -v "$fc" >/dev/null 2>&1; then
  echo "Fortran compiler '$fc' not found; set FC to the LLVMFlang/flang binary to verify these flags." >&2
  exit 2
fi

"$fc" --help 2>&1 | grep -E -- \
  '-fopenmp-target-fast|-fopenmp-assume-threads-oversubscription|-fopenmp-assume-teams-oversubscription|--offload-arch' \
  || {
    echo "One or more LLVMFlang OpenMP offload flags are not advertised by this compiler." >&2
    exit 1
  }

Repository: MFlowCode/MFC

Length of output: 152


🏁 Script executed:

# Check if CMakeLists.txt exists and get file size
wc -l CMakeLists.txt

Repository: MFlowCode/MFC

Length of output: 76


🏁 Script executed:

# Read lines 630-650 to see the context around line 639
sed -n '630,650p' CMakeLists.txt

Repository: MFlowCode/MFC

Length of output: 1484


🏁 Script executed:

# Search for GPU architecture configuration patterns in CMakeLists.txt
rg -n "offload-arch|AMDGPU|AMD_GPU|GPU_ARCH" CMakeLists.txt

Repository: MFlowCode/MFC

Length of output: 352


🏁 Script executed:

# Search for LLVMFlang or flang compiler handling
rg -n "LLVMFlang\|flang" CMakeLists.txt

Repository: MFlowCode/MFC

Length of output: 39


🏁 Script executed:

# Search for GPU architecture configuration patterns more broadly
rg -n "GPU|ARCH|offload" CMakeLists.txt | head -30

Repository: MFlowCode/MFC

Length of output: 1453


🏁 Script executed:

# Search for MFC_* CMake options
rg -n "^option\|^set.*CACHE" CMakeLists.txt | head -40

Repository: MFlowCode/MFC

Length of output: 39


🏁 Script executed:

# Get more context around the LLVMFlang block - lines 625-660
sed -n '625,660p' CMakeLists.txt

Repository: MFlowCode/MFC

Length of output: 2309


🏁 Script executed:

# Check if there are any other references to gfx90a or MI250X in the repo
rg -n "gfx90a|MI250X|AMD" --type-add 'cmake:*.cmake' -t cmake

Repository: MFlowCode/MFC

Length of output: 559


🏁 Script executed:

# Search for any existing CMake configuration or detection of AMD GPU architecture
rg -n "AMDGPU\|amdgpu\|ROCM\|rocm\|HIP" CMakeLists.txt

Repository: MFlowCode/MFC

Length of output: 39


🏁 Script executed:

# Check git history to see if these lines were modified in this PR
git log -p --follow -S "offload-arch=gfx90a" -- CMakeLists.txt | head -100

Repository: MFlowCode/MFC

Length of output: 39


🏁 Script executed:

# Get the git diff to see what actually changed in this PR
git diff HEAD~1 CMakeLists.txt | grep -A 5 -B 5 "LLVMFlang" || echo "No diff found or file unchanged"

Repository: MFlowCode/MFC

Length of output: 292


🏁 Script executed:

# Search for any mention of build type (Release/Debug) constraints on GPU flags
rg -n "CMAKE_BUILD_TYPE\|Release\|Debug" CMakeLists.txt | head -20

Repository: MFlowCode/MFC

Length of output: 39


Gate the LLVMFlang fast-offload assumptions and GPU arch.

Lines 639–640 hardcode --offload-arch=gfx90a for both compile and link, and line 639 unconditionally enables -fopenmp-target-fast -fopenmp-assume-threads-oversubscription -fopenmp-assume-teams-oversubscription for all LLVMFlang builds, including Debug and RelDebug. ROCm documents -fopenmp-target-fast as requiring its constraints to be satisfied and implying extra assumptions (e.g., ignoring target env vars, applying -O3 when no -O* flag is set), which can alter debugging and performance-tuning behavior. Additionally, hardcoding --offload-arch=gfx90a limits this path to MI250X-class GPUs; the codebase's own FIXME comment at line 645 (in the GNU section) acknowledges this limitation.

Make the GPU architecture configurable via CMake and gate the fast assumptions behind Release build type and/or an explicit option.

Suggested direction
+option(MFC_AMD_OPENMP_FAST "Enable LLVMFlang AMD OpenMP fast target assumptions" ON)
+set(MFC_AMDGPU_ARCH "native" CACHE STRING "AMD GPU architecture for LLVMFlang OpenMP offload")

 elseif(CMAKE_Fortran_COMPILER_ID STREQUAL "LLVMFlang")
-    target_compile_options(${a_target} PRIVATE -fopenmp --offload-arch=gfx90a -fopenmp-target-fast -fopenmp-assume-threads-oversubscription -fopenmp-assume-teams-oversubscription)
-    target_link_options(${a_target} PRIVATE -fopenmp --offload-arch=gfx90a)
+    target_compile_options(${a_target} PRIVATE -fopenmp "--offload-arch=${MFC_AMDGPU_ARCH}")
+    target_link_options(${a_target} PRIVATE -fopenmp "--offload-arch=${MFC_AMDGPU_ARCH}")
+
+    if (MFC_AMD_OPENMP_FAST AND CMAKE_BUILD_TYPE STREQUAL "Release")
+        target_compile_options(${a_target} PRIVATE
+            -fopenmp-target-fast
+            -fopenmp-assume-threads-oversubscription
+            -fopenmp-assume-teams-oversubscription)
+    endif()
 endif()

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.66%. Comparing base (92585b1) to head (26e982e).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1374      +/-   ##
==========================================
+ Coverage   64.62%   64.66%   +0.04%     
==========================================
  Files          71       71              
  Lines       18407    18432      +25     
  Branches     1516     1517       +1     
==========================================
+ Hits        11895    11919      +24     
  Misses       5555     5555              
- Partials      957      958       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sbryngelson sbryngelson merged commit ee8ad84 into MFlowCode:master Apr 21, 2026
161 of 198 checks passed
danieljvickers added a commit to danieljvickers/MFC-Dan that referenced this pull request Apr 26, 2026
Co-authored-by: Daniel Vickers <danieljvickers@login09.frontier.olcf.ornl.gov>
Co-authored-by: Daniel Vickers <danieljvickers@frontier00007.frontier.olcf.ornl.gov>
Co-authored-by: Daniel Vickers <danieljvickers@login12.frontier.olcf.ornl.gov>
Co-authored-by: wilfonba <bwilfong3@gatech.edu>
Co-authored-by: Daniel Vickers <danieljvickers@frontier01665.frontier.olcf.ornl.gov>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants