Refactor GPU support and improve solvers with CUDA enhancements #203
Merged
jameslehoux merged 30 commits into working · Mar 29, 2026
Conversation
Major architecture overhaul: modular solvers, Fortran→C++ migration, GPU acceleration
- Fix GPU wheel CI: use sameli/manylinux_2_34_x86_64_cuda_12.6 (the manylinux_2_28 variant with CUDA 12.6 does not exist on Docker Hub)
- Update dnf repo from 'powertools' to 'crb' for the RHEL 9-based image
- Add tFloodFill integration test: validates parallelFloodFill and collectBoundarySeeds with 4 test cases (full flood, partial flood, boundary seeds, multi-label)
- Add tTortuosityMLMG integration test: validates the MLMG matrix-free solver against the analytical tau=(N-1)/N on a uniform block, with a directional symmetry test (Y direction)

https://claude.ai/code/session_01WR9HkUD95rp3XzZU95j2y7
…ing-notebook-JUBs9 Fix GPU wheel Docker image and add FloodFill/MLMG solver tests
MLMG applies Dirichlet BCs at domain faces (external), not at cell centers like HYPRE, so tau=1.0 for a uniform block (not (N-1)/N). https://claude.ai/code/session_01WR9HkUD95rp3XzZU95j2y7
…ing-notebook-JUBs9 Fix MLMG test expected tau and apply clang-format
The GPU workflow now patches pyproject.toml to set the distribution name to 'openimpala-cuda' before building. The import name stays 'openimpala' — only the pip package name differs, so users choose between `pip install openimpala` (CPU) and `pip install openimpala-cuda` (CUDA GPU). __init__.py resolves version from either package. All tutorials and notebooks updated to install openimpala-cuda to leverage Colab T4 GPUs. https://claude.ai/code/session_01WR9HkUD95rp3XzZU95j2y7
The manylinux_2_34 container ships GCC 14+, which is unsupported by CUDA 12.6's nvcc (max GCC 13). Install gcc-toolset-13 and set CC/CXX/FC to GCC 13 for all dependency builds and wheel compilation. Also fixes:
- Quote semicolons in CMAKE_CUDA_ARCHITECTURES to prevent shell splitting (was causing "sh: line 1: 70: command not found" errors)
- Pass CUDAFLAGS="-allow-unsupported-compiler" to the HYPRE build as a safety net
- Set CUDAHOSTCXX and CMAKE_CUDA_HOST_COMPILER for AMReX and OpenImpala
- Bump the cache key to force a rebuild with the new toolchain

https://claude.ai/code/session_01WR9HkUD95rp3XzZU95j2y7
…ing-notebook-JUBs9 Publish GPU wheel as separate openimpala-cuda PyPI package
AMReX built with CUDA defines AMREX_GPU_HOST_DEVICE as __host__ __device__ in all its headers. Any .cpp file that includes AMReX headers must be compiled by nvcc, not the regular C++ compiler, or the CUDA keywords cause "does not name a type" errors. Fix by setting LANGUAGE CUDA on all source files when GPU_BACKEND=CUDA:
- src/CMakeLists.txt: IO_SOURCES, PROPS_SOURCES, Diffusion.cpp
- python/CMakeLists.txt: BINDING_SOURCES
- tests/CMakeLists.txt: test source files via openimpala_add_test()

https://claude.ai/code/session_01WR9HkUD95rp3XzZU95j2y7
…ing-notebook-JUBs9 Fix CUDA build: compile all sources as CUDA when GPU backend is enabled
CUDA's atomicAdd supports int, unsigned int, unsigned long long, float, and double, but NOT signed long long. When compiling with nvcc, the Gpu::Atomic::Add(&long_long_ptr, 1LL) calls fail. Switch DeviceVector counters from long long to int in:
- ConnectedComponents.cpp: component volume counting
- ThroughThicknessProfile.cpp: per-slice phase counting

int is sufficient for voxel counts (max ~2 billion cells per component). The host-side m_volumes (vector<long long>) is populated via .assign(), which widens int→long long safely.

https://claude.ai/code/session_01WR9HkUD95rp3XzZU95j2y7
…ing-notebook-JUBs9 Fix CUDA atomicAdd: replace long long with int for GPU atomic counters
CUDA's extended __device__ lambdas cannot appear inside private or protected member functions. The error: "The enclosing parent function for an extended __device__ lambda cannot have private or protected access within its class"

Move methods containing AMREX_GPU_DEVICE lambdas from private/protected to public in:
- ConnectedComponents.H: run(), findNextUnlabeled()
- PercolationCheck.H: run()
- TortuositySolverBase.H: buildDiffusionCoeffField(), generateActivityMask(), globalFluxes(), computePlaneFluxes(), solve(), preconditionPhaseFab(), parallelFloodFill(), writeSolutionPlotfile()
- TortuosityHypre.H: solve(), setupMatrixEquation(), preconditionPhaseFab(), generateActivityMask(), global_fluxes(), computePlaneFluxes()

Data members remain private/protected. This is the standard pattern for AMReX-based GPU codes.

https://claude.ai/code/session_01WR9HkUD95rp3XzZU95j2y7
…ing-notebook-JUBs9 Fix CUDA: make methods with __device__ lambdas publicly accessible
Same nvcc restriction: __device__ lambdas cannot be inside private or protected member functions.
- HDF5Reader.H: move readAndThresholdFab() to public
- TortuosityDirect.H: move solve(), advance(), and other methods containing AMREX_GPU_DEVICE lambdas to public

https://claude.ai/code/session_01WR9HkUD95rp3XzZU95j2y7
…ing-notebook-JUBs9 Fix CUDA: make remaining private methods with GPU lambdas public
NVCC forbids extended __device__ lambdas inside constructors. Extract the ParallelFor loops from the TortuosityHypre and EffectiveDiffusivityHypre constructors into separate member functions (initializeDiffCoeff, buildTraversableMask). Also fix setVal overload resolution by using amrex::DestComp/NumComps wrapper types instead of raw int arguments. https://claude.ai/code/session_01RKnn97qiD7sbCeABHH3eQk
…ild-yptxM Fix CUDA compilation errors: extract device lambdas from constructors and fix setVal overload
NVCC also requires that enclosing functions for __device__ lambdas have public access within their class. Move initializeDiffCoeff and buildTraversableMask to public sections. For setVal, explicitly specify the RunOn::Host template parameter to resolve the deduction failure. https://claude.ai/code/session_01RKnn97qiD7sbCeABHH3eQk
…ild-yptxM Fix CUDA: make device-lambda methods public and fix setVal template deduction
amrex::Gpu::DeviceScalar has a deleted move-assignment operator. Use htod_memcpy to reset the device flag to zero instead of constructing a new DeviceScalar each iteration. https://claude.ai/code/session_01RKnn97qiD7sbCeABHH3eQk
…ild-yptxM Fix CUDA: replace deleted DeviceScalar move-assignment with htod_memcpy
NVCC cannot deduce the run_on template parameter for the 6-argument BaseFab::copy overload. Explicitly specify RunOn::Host. https://claude.ai/code/session_01RKnn97qiD7sbCeABHH3eQk
…ild-yptxM Fix CUDA: explicit RunOn::Host for BaseFab::copy template deduction in REVStudy
NVCC forbids extended __device__ lambdas in constructors. Move the ParallelFor kernel into a public compute() method called from the constructor. https://claude.ai/code/session_01RKnn97qiD7sbCeABHH3eQk
…ild-yptxM Fix CUDA: extract device lambda from ThroughThicknessProfile constructor
NVCC requires functions containing __device__ lambdas to have public access within their class. https://claude.ai/code/session_01RKnn97qiD7sbCeABHH3eQk
…ild-yptxM Fix CUDA: move TortuosityMLMG::solve() from protected to public
…ild-yptxM Fix clang-format violations in ThroughThicknessProfile and EffectiveDiffusivityHypre
Code Coverage Report (generated by CI; coverage data from gcovr)