
Local Aware IBM #1378

Draft

danieljvickers wants to merge 55 commits into MFlowCode:master from danieljvickers:local-aware-ibm

Conversation

@danieljvickers
Member

Description

The current code has an issue scaling much past 10k particles due to limitations of the MPI AllReduce sum in the IB force computation. This PR attempts to alleviate this by limiting the IBs any given rank can be aware of to its own and its neighbors'. This turns the AllReduce into an MPI neighbor computation, removing the communication bottleneck. To support this, a massive overhaul of IB ownership between ranks was required.
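
For orientation, below is a minimal sketch of the neighbor-exchange idea. It is illustrative only: the routine name, the MPI_Cart_shift/MPI_Sendrecv calls, and the dimension-by-dimension sweep are assumptions for the sketch, not the PR's actual s_communicate_ib_forces implementation.

! Sketch: sum per-IB force contributions over the 3x3x3 rank neighborhood
! instead of a global MPI_Allreduce (illustrative names, not from the PR).
subroutine sketch_neighbor_force_sum(cart_comm, forces, n)
    use mpi
    integer, intent(in) :: cart_comm, n
    real(8), intent(inout) :: forces(3, n)
    real(8) :: snap(3, n), recv_buf(3, n)
    integer :: d, disp, src, dst, ierr
    ! old pattern: one global collective over all ranks and all IBs
    !   call MPI_Allreduce(MPI_IN_PLACE, forces, 3*n, MPI_DOUBLE_PRECISION, MPI_SUM, cart_comm, ierr)
    do d = 0, 2
        snap = forces                      ! contributions accumulated so far along previous axes
        do disp = -1, 1, 2
            call MPI_Cart_shift(cart_comm, d, disp, src, dst, ierr)
            call MPI_Sendrecv(snap, 3*n, MPI_DOUBLE_PRECISION, dst, 0, &
                              recv_buf, 3*n, MPI_DOUBLE_PRECISION, src, 0, &
                              cart_comm, MPI_STATUS_IGNORE, ierr)
            if (src /= MPI_PROC_NULL) forces = forces + recv_buf
        end do
    end do
end subroutine sketch_neighbor_force_sum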

Type of change

  • Refactor

Testing

TBD

Checklist

  • I added or updated tests for new behavior

AI code reviews

Reviews are not triggered automatically. To request a review, comment on the PR:

  • @coderabbitai review — incremental review (new changes only)
  • @coderabbitai full review — full review from scratch
  • /review — Qodo review
  • /improve — Qodo code suggestions
  • @claude full review — Claude full review (also triggers on PR open/reopen/ready)
  • Add label claude-full-review — Claude full review via label

@github-actions

Claude Code Review

Head SHA: ee0bc0c

Files changed (8):

  • docs/documentation/case.md
  • src/common/m_constants.fpp
  • src/common/m_derived_types.fpp
  • src/simulation/m_global_parameters.fpp
  • src/simulation/m_ib_patches.fpp
  • src/simulation/m_ibm.fpp
  • src/simulation/m_start_up.fpp
  • src/simulation/m_time_steppers.fpp

Findings:

[Correctness — Out-of-bounds array access in s_reduce_ib_patch_array]
src/simulation/m_start_up.fpp

patch_ib_gbl is a fixed-size local array of size num_ib_patches_max = 50000 (from m_constants.fpp). In the 3D branch, num_aware_ibs = num_local_ibs_max * 27 = 2000 * 27 = 54000. Both the num_procs == 1 MPI path and the non-MPI #else path then execute:

patch_ib(:) = patch_ib_gbl(1:num_aware_ibs)   ! 1:54000 from a 50000-element array

Accessing elements 50001–54000 of patch_ib_gbl is an out-of-bounds array access for any 3D simulation. The 2D case (num_aware_ibs = 18000 < 50000) is unaffected. Fix: either increase patch_ib_gbl's declared size to accommodate the 3D case (e.g. max(num_ib_patches_max, num_local_ibs_max*27)), or clamp the right-hand-side slice to min(num_aware_ibs, num_ib_patches_max) (with acknowledgement that patches beyond index 50000 would be dropped).
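
Either fix, sketched below. Identifiers are taken from the finding where possible; the derived-type name ib_patch_parameters is an assumption (from m_derived_types.fpp), and placement inside s_reduce_ib_patch_array has not been verified.

! Option A: size the scratch array for the 3D worst case (27 neighbor cells)
type(ib_patch_parameters) :: patch_ib_gbl(max(num_ib_patches_max, 27*num_local_ibs_max))

! Option B: clamp the copy, accepting that patches past num_ib_patches_max are dropped
patch_ib(1:min(num_aware_ibs, num_ib_patches_max)) = patch_ib_gbl(1:min(num_aware_ibs, num_ib_patches_max))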


[MPI Correctness — Global allreduce replaced by 2-hop neighborhood reduce]
src/simulation/m_ibm.fpp

The previous s_mpi_allreduce_vectors_sum(forces, ..., num_ibs, 3) (a global collective) is replaced by s_communicate_ib_forces, which propagates force/torque contributions at most 2 hops in each Cartesian direction. This is only correct if every processor that contributed ghost-point force calculations for a given IB is within 2 MPI ranks of the owning rank along every axis. An IB whose ghost-point stencil spans more than 2 subdomains in any direction (possible with fine MPI decompositions or large gp_layers) will accumulate under-counted forces and torques silently, producing wrong physics with no diagnostic. The comments acknowledge "local neighborhood" but do not document the assumption or enforce it with a runtime guard (e.g. asserting gp_layers is small relative to subdomain size).
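
One possible guard, following the project's @:PROHIBIT convention. The exact condition (comparing gp_layers against the local subdomain extents m, n, p) is illustrative rather than taken from the PR:

@:PROHIBIT(gp_layers > min(m, n, p), "gp_layers too large for the 2-rank neighborhood assumed by s_communicate_ib_forces")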


[Memory Management / GPU Correctness — patch_ib allocated with raw allocate/deallocate]
src/simulation/m_global_parameters.fpp, src/simulation/m_start_up.fpp

patch_ib is now allocatable and listed in GPU_DECLARE. The initial allocation in s_set_default_values:

allocate (patch_ib(num_ib_patches_max))

and the resize in s_reduce_ib_patch_array:

deallocate (patch_ib)
...
allocate (patch_ib(num_aware_ibs))

both use bare Fortran statements instead of @:ALLOCATE/@:DEALLOCATE. Per CLAUDE.md, every @:ALLOCATE must have a matching @:DEALLOCATE. More critically, after s_reduce_ib_patch_array populates the freshly-allocated host patch_ib (via the filtering loop and assignments), there is no $:GPU_UPDATE(device=[patch_ib]) call to push the updated data to the device. On GPU builds, the device-side patch_ib will be stale or uninitialized after this function returns, corrupting any subsequent GPU kernel that reads it before the data is otherwise re-uploaded.
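
The expected pattern, using the macro names cited above (exact placement within s_reduce_ib_patch_array not verified):

@:DEALLOCATE(patch_ib)
@:ALLOCATE(patch_ib(1:num_aware_ibs))
! ... filtering loop repopulates the host-side patch_ib ...
$:GPU_UPDATE(device=[patch_ib])   ! push the rebuilt array to the device before any kernel reads it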

@sbryngelson
Member

Claude Code Review

PR #1378 — Local Aware IBM

This is a draft PR. The concept (replacing global allreduce with neighborhood-local communication for IBM forces) is sound and the scalability motivation is clear. Several correctness issues need to be addressed before this is merge-ready.


Critical Issues

1. f_local_rank_owns_location returns uninitialized result for single-process MPI builds
src/simulation/m_collisions.fpp

The new function wraps its entire ownership logic in #ifdef MFC_MPI / if (num_procs > 1) but provides no else branch. If num_procs == 1 inside an MPI build, owns_collision is never set, producing an uninitialized logical. The old f_local_rank_owns_collision handled this correctly via if (num_procs == 1) then; owns_collision = .true. Fix: add an else branch that sets owns_collision = .true. to the if (num_procs > 1) construct.
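
A minimal sketch of the suggested fix (the surrounding structure is paraphrased from the finding, not copied from the source):

#ifdef MFC_MPI
    if (num_procs > 1) then
        ! ... existing neighborhood ownership logic ...
    else
        owns_collision = .true.   ! single rank owns everything
    end if
#else
    owns_collision = .true.
#endif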

2. f_neighborhood_ranks_own_location silently disables locality check for 1–2 rank runs
src/simulation/m_collisions.fpp

The guard if (num_procs > 2) means that for num_procs == 2 the function unconditionally returns .true., so every IB is treated as neighborhood-owned with no filtering. Two-rank runs are the canonical small test configuration. The threshold should be > 1, not > 2.

3. Out-of-bounds potential in s_reduce_ib_patch_array
src/simulation/m_start_up.fpp

In 3D with 27 neighbor cells, num_aware_ibs can reach 27 * (local IB count), which for any non-trivial simulation exceeds num_local_ibs_max = 2000. There is no @:PROHIBIT or bounds guard before indexing local_ib_patch_ids(1:num_local_ibs_max). Silent memory corruption or segfault will occur once num_aware_ibs > 2000.
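
A guard along the lines suggested later in this review (variable names follow the finding; the message text is illustrative):

@:PROHIBIT(num_aware_ibs > num_local_ibs_max, "num_aware_ibs exceeds num_local_ibs_max; increase num_local_ibs_max in m_constants.fpp")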

4. Analytic MIBM code generation uses local loop index, not global patch ID
toolchain/mfc/case.py — __get_analytic_mib_fpp

The template still emits if (i == {pid}) where pid is the global patch number from the Python case file and i is the local loop counter over num_ibs. After this PR, num_ibs counts locally-owned IBs, so local index i no longer maps to the global patch number. All simulations with analytically-defined IB motion will silently apply velocity to the wrong patches in multi-rank runs.
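
A hedged sketch of one possible shape for the generated code; the field name global_id is hypothetical (the PR may instead recover the global patch number via s_decode_patch_periodicity or another mechanism):

! emitted by __get_analytic_mib_fpp (sketch): match on a stored global patch id,
! not the local loop counter
do i = 1, num_ibs
    if (patch_ib(i)%global_id == {pid}) then
        ! ... apply the analytically prescribed motion for global patch {pid} ...
    end if
end do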


Important Issues

5. error stop in s_compute_image_points
src/simulation/m_ibm.fpp:485

if (bounds_error) error stop "Ghost Point and Image Point on Different Processors. Exiting"

error stop is forbidden by project standards (CLAUDE.md). Must be replaced with call s_mpi_abort(...).
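
The direct replacement, using the project's standard abort path with a message string:

if (bounds_error) call s_mpi_abort("Ghost Point and Image Point on Different Processors. Exiting")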

6. models array allocated without matching @:DEALLOCATE
src/simulation/m_ibm.fpp:56

s_initialize_ibm_module has @:ALLOCATE(models(num_ibs)) but s_finalize_ibm_module has no matching @:DEALLOCATE(models). This violates the project rule requiring every @:ALLOCATE to have a paired @:DEALLOCATE, and leaks GPU device memory.
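
The matching cleanup, to be added to s_finalize_ibm_module (placement relative to the module's other deallocations not verified):

@:DEALLOCATE(models)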

7. Post-process s_write_ib_variable called from all ranks with orphaned mesh
src/post_process/m_data_output.fpp

The PR removes the if (proc_rank == 0) guard, moving DBPUTPM (mesh definition) to rank 0 only but leaving DBPUTPV1 (variable write, inside s_write_ib_variable) called on all ranks. Non-zero ranks will write a point variable onto a mesh that was never defined in their per-process SILO file, producing corrupt output or a library error.

8. normal_axis normalization bug in s_compute_moment_of_inertia
src/simulation/m_ibm.fpp:1154

normal_axis = axis/sqrt(sum(axis))   ! BUG: missing **2

Should be sqrt(sum(axis**2)). As written this is sqrt(sum of components) rather than Euclidean norm, corrupting moment-of-inertia calculation for any 3D rotating IB. Pre-existing issue but touched by this PR.
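
A corrected form (norm2 is the standard Fortran 2008 Euclidean-norm intrinsic; either spelling works):

normal_axis = axis/sqrt(sum(axis**2))   ! or: axis/norm2(axis)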


Suggestions

  • num_local_ibs_max = 2000: add @:PROHIBIT(num_local_ibs > num_local_ibs_max, "num_local_ibs exceeds num_local_ibs_max") at the point of use.
  • Mixed #:if defined('MFC_MPI') vs #ifdef MFC_MPI within the same module — pick one for consistency.
  • Testing is marked "TBD". At minimum the existing 2D IBM/MIBM regression tests should be shown passing and a multi-rank (≥4) run with >10 IBs demonstrated before leaving draft.

Strengths

  • The architectural direction (local-only IB awareness replacing global allreduce) is the correct approach for scaling to O(10⁴) particles.
  • The s_encode_patch_periodicity/s_decode_patch_periodicity interface cleanly maintains global identity through the periodic marker field.
  • The 26-neighbor rank lookup table in s_compute_ib_neighbor_ranks avoids repeated topology queries at runtime.
  • The post-process SILO simplification (single mesh instead of per-rank mesh) is a positive cleanup.

