[JAX] Add wait per multi-proc cleanup in L0_jax_distributed_unittest #2979

Merged
phu0ngng merged 2 commits into NVIDIA:main from phu0ngng:cgemm_mprocs_fix
May 12, 2026

Conversation

@phu0ngng
Collaborator

Description

Add a wait after each multi-process cleanup in L0_jax_distributed_unittest so that a later test's processes do not start before the previous test's cleanup has finished. This helps prevent the mismatch issues in the CGEMM tests reported by QA.
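A minimal sketch of the mechanism, assuming bash: sending a signal only requests termination, while `wait` blocks until the shell has actually collected the child. The `sleep 30` worker is a hypothetical stand-in for a test process and is not taken from the changed scripts.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for one background test process.
sleep 30 &
pid=$!

# Request termination; kill returns as soon as the signal is sent.
kill -KILL "$pid" 2>/dev/null

# Block until the shell has reaped the killed child.
wait "$pid" 2>/dev/null
status=$?

# bash reports 128 + signal number for a child killed by a signal (9 for SIGKILL).
echo "worker exit status: $status"
```

Without the `wait`, the script could continue, or exit, while the killed child is still being torn down.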

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
@phu0ngng phu0ngng requested a review from ptrendx May 12, 2026 18:12
@phu0ngng
Collaborator Author

/te-ci JAX L0

@greptile-apps
Contributor

greptile-apps Bot commented May 12, 2026

Greptile Summary

This PR adds a wait call immediately after the explicit cleanup() invocation at the end of two JAX distributed test shell scripts to ensure all killed background processes are fully reaped by the shell before the script exits. This prevents GPU/resource contention when the outer L0_jax_distributed_unittest harness starts the next test script.

  • run_test_cgemm.sh: wait added after cleanup() so that SIGKILL'd processes are reaped before the script terminates, closing the race window between CGEMM test runs.
  • run_test_multiprocessing_encoder.sh: Same one-line addition for consistency and to prevent the same class of mismatches in encoder tests.

Confidence Score: 5/5

Safe to merge — the one-line addition to each script correctly closes the process-reaping race at script exit.

Both changes are minimal and targeted: adding wait after cleanup() ensures the shell blocks until all SIGKILL'd child processes are fully reaped before the script returns, which is the correct mechanism for preventing the GPU resource contention described in the PR. The existing wait calls inside the test loop already handle normal inter-test-case synchronization; the new wait covers the edge case of lingering processes at exit. No logic paths are altered and the kill -0 guards in cleanup() make repeated cleanup calls safe.
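The interaction between the `kill -0` guard and `wait` can be sketched as follows. This is an illustrative reconstruction, not the scripts' actual `cleanup()`: the `PIDS` array, the `SIGNALLED` counter, and the `sleep` workers are assumptions, and only SIGKILL is sent to keep the example short.

```shell
#!/usr/bin/env bash
PIDS=()
SIGNALLED=0   # illustration only: counts how many signals cleanup sends

cleanup() {
  local pid
  for pid in "${PIDS[@]}"; do
    # kill -0 delivers no signal; it only checks that the PID can be signalled.
    if kill -0 "$pid" 2>/dev/null; then
      kill -KILL "$pid" 2>/dev/null && SIGNALLED=$((SIGNALLED + 1))
    fi
  done
}

# Two hypothetical background workers.
sleep 30 & PIDS+=($!)
sleep 30 & PIDS+=($!)

cleanup                        # first call: both workers are alive and get killed
first=$SIGNALLED

wait 2>/dev/null               # reap the killed children

cleanup                        # second call: the PIDs are gone, so the guard skips them
second=$((SIGNALLED - first))

echo "first call signalled $first, second call signalled $second"
```

On Linux, a killed child that has not yet been reaped still passes the `kill -0` check, so the `wait` between the two calls is what turns the second call into a no-op.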

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| examples/jax/collective_gemm/run_test_cgemm.sh | Adds wait after cleanup() at script end to block until killed child processes are reaped, preventing resource contention with subsequent test invocations. |
| examples/jax/encoder/run_test_multiprocessing_encoder.sh | Same one-line wait addition after cleanup() for the encoder multiprocessing test script, matching the fix applied to the CGEMM script. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Start test loop] --> B[Spawn N GPU processes as background jobs]
    B --> C[wait — all test processes finish]
    C --> D[Check log for PASS/FAIL/SKIP]
    D --> E[wait — before log cleanup]
    E --> F[rm log files]
    F --> G{More test cases?}
    G -- Yes --> A
    G -- No --> H[wait — post-loop]
    H --> I[cleanup — send SIGTERM + SIGKILL to any lingering PIDs]
    I --> J["wait NEW — block until killed processes are reaped"]
    J --> K[exit HAS_FAILURE]
    K --> L[EXIT trap fires cleanup again — harmless, kill -0 guards no-op]
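
The flow in the chart can be condensed into a small bash sketch. Everything here is hypothetical scaffolding (the `run_suite` wrapper, the case names, the `sleep` workers, and the omitted log checks); it only shows where the new `wait` sits relative to the loop, the explicit `cleanup`, and the EXIT trap.

```shell
#!/usr/bin/env bash
# Runs in a subshell (note the parentheses) so that exit and the EXIT trap
# stay contained, the way they would in a standalone test script.
run_suite() (
  PIDS=()
  HAS_FAILURE=0

  cleanup() {
    for pid in "${PIDS[@]}"; do
      # Only signal PIDs that still exist.
      kill -0 "$pid" 2>/dev/null && kill -KILL "$pid" 2>/dev/null
    done
  }
  trap cleanup EXIT

  for case_name in case_a case_b; do
    PIDS=()
    for rank in 0 1; do
      sleep 0.1 &            # stand-in for one per-GPU test process
      PIDS+=($!)
    done
    wait                     # all processes of this test case finish
    # (log inspection and log removal omitted)
  done

  wait                       # post-loop wait
  cleanup                    # signal anything still lingering
  wait                       # the added step: block until killed children are reaped
  exit "$HAS_FAILURE"        # the EXIT trap then runs cleanup once more
)

run_suite
echo "suite exit status: $?"
```

Because every PID has been reaped by the time the trap fires, the trailing `cleanup` finds nothing to signal, which matches the last node of the chart.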

Reviews (2): Last reviewed commit: "Merge branch 'main' into cgemm_mprocs_fi..."

@phu0ngng phu0ngng merged commit 4eab389 into NVIDIA:main May 12, 2026
9 of 12 checks passed
@phu0ngng phu0ngng deleted the cgemm_mprocs_fix branch May 12, 2026 21:01
faradawn pushed a commit to faradawn/TransformerEngine that referenced this pull request May 14, 2026
NVIDIA#2979)

add wait per multi-proc test cleanup

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>


3 participants