[PyTorch] Isolate CP pool worker stdout from NCCL/library banners#3074
Conversation
Greptile SummaryThis PR fixes a CI regression where
Confidence Score: 5/5Safe to merge — test infrastructure only, no library or runtime code changed. The fd-level isolation is correctly placed before import torch so even import-time C-level writes are captured. The _JSON_STDOUT dup is only used on rank 0, eliminating any null-dereference risk on other ranks. The parent-side changes are comment-only. No files require special attention. Important Files Changed
Sequence DiagramsequenceDiagram
participant Parent as pytest (parent)
participant Pipe as stdout pipe (fd 1 original)
participant Stderr as stderr
participant Worker as torchrun worker (rank-0)
Note over Worker: Module load BEFORE import torch
Worker->>Pipe: os.dup(1) _JSON_STDOUT (private fd)
Worker->>Stderr: os.dup2(2, 1) fd 1 now points at stderr
Worker->>Stderr: "sys.stdout = sys.stderr"
Note over Worker: import torch (NCCL init, banners)
Worker->>Stderr: NCCL banner via fd 1 to stderr
Note over Worker: main() dispatch loop
Parent->>Worker: JSON request via stdin
Worker->>Pipe: _JSON_STDOUT.write(json.dumps(payload))
Parent->>Pipe: select() + readline()
Parent->>Parent: json.loads(line) clean JSON
Reviews (4): Last reviewed commit: "Merge branch 'main' into sudhakars/cp_po..." | Re-trigger Greptile |
| _RESP_PREFIX = "[CP_POOL_RESP] " | ||
|
|
||
|
|
||
| def _send_response(rank: int, payload: dict) -> None: |
There was a problem hiding this comment.
_RESP_PREFIX duplicated across worker and parent without a shared source
The sentinel string "[CP_POOL_RESP] " is defined independently in both run_attention_with_cp_pool.py and test_attention_with_cp.py. If they ever drift (e.g. a typo during a rename), the parent will silently spin through every line until it times out, emitting a confusing "timed out" failure rather than a protocol-mismatch error. A brief comment calling out the coupling would make this risk explicit.
| _RESP_PREFIX = "[CP_POOL_RESP] " | |
| def _send_response(rank: int, payload: dict) -> None: | |
| # NOTE: this string must stay identical to PoolWorker._RESP_PREFIX in | |
| # test_attention_with_cp.py; the parent uses it to identify response lines. | |
| _RESP_PREFIX = "[CP_POOL_RESP] " | |
| def _send_response(rank: int, payload: dict) -> None: |
There was a problem hiding this comment.
Resolved by removing the sentinel entirely in e543ef89 — the rank-0 fd redirect keeps the pipe clean on its own (verified: 0 non-protocol lines under NCCL_DEBUG=INFO), so there's no longer a duplicated _RESP_PREFIX to keep in sync across the two files.
sudhakarsingh27
left a comment
There was a problem hiding this comment.
Targeted fix, mostly minimal. Main asks: drop the unreachable _JSON_STDOUT is None raise in _send_response and the dead re-entry guard in _redirect_rank0_stdout_to_stderr. The fd-redirect + sentinel pair is justified (worker process vs torchrun agent process — different coverage), but confirm the agent actually writes to this pipe; if that's speculative the sentinel alone suffices. fd-level ops are correct and the import-time placement is right. Comments are on the verbose side but explain real non-obvious intent, so fine to keep.
The CP attention test pool worker (PR NVIDIA#2993) uses rank-0 stdout as a line-delimited JSON protocol channel to the parent pytest process. When NCCL_DEBUG is set to VERSION or INFO, NCCL writes an "NCCL version ...\n" banner to stdout (fd 1); that banner reaches the parent ahead of the first JSON response, json.loads raises, and because the pool fixture is session-scoped and killed on first failure, every subsequent CP test in the file fails. CI runners that export NCCL_DEBUG hit this on all ~200 non-skipped cases. The prior mitigation only redirected non-rank-0 stdout to /dev/null, so rank 0's own banner still corrupted the stream. Fix it at the source: in the worker, before importing torch, dup rank-0's original fd 1 into a private stream reserved for JSON, then point fd 1 at stderr. Any banner from NCCL/cuBLAS/cuDNN/torch (Python or C level, import-time or runtime) now lands on stderr — still drained into CI logs by the parent's stderr thread — instead of the protocol pipe. Combined with the existing non-rank-0 /dev/null redirect, the pipe carries only rank-0 JSON, so the parent's single-line read needs no change. Validated on 8xH100 (TE built from this commit). With NCCL_DEBUG=VERSION and =INFO, flash (p2p/all_gather/a2a) and fused cases pass and zero non-protocol lines reach the pipe; without the fix the same cases fail with "pool worker JSON protocol broke". Control with NCCL_DEBUG unset also passes. Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
8332034 to
e543ef8
Compare
|
/te-ci pytorch L3 |
cyanguwa
left a comment
There was a problem hiding this comment.
Please make sure you test with NCCL_DEBUG turned on. Thanks for the PR!
Problem
The CP attention test pool worker (added in #2993) uses rank-0 stdout as a line-delimited JSON protocol channel between the worker and the parent pytest process.
PoolWorker._submit_oncereads one line andjson.loadsit.When
NCCL_DEBUGis set toVERSIONorINFO, NCCL writes a banner line —NCCL version 2.x.y+cudaA.B\n— to stdout (fd 1), not stderr. That banner reaches the parent ahead of the first JSON response,json.loadsraises, and because thecp_poolfixture is session-scoped and the pool is killed on first failure, every subsequent CP test in the file fails with:CI runners that export
NCCL_DEBUG=VERSION(e.g. the cuDNN nightly GB200 job) hit this on all ~200 non-skippedtest_cp_with_flash_attention/test_cp_with_fused_attentioncases. WithNCCL_DEBUGunset the banner isn't emitted, so the suite passes — which is why the regression only shows up on those runners.The existing mitigation only redirected non-rank-0 stdout to
/dev/null, so rank 0's own banner still corrupted the stream.Fix
Worker —
run_attention_with_cp_pool.py: beforeimport torch,dup()rank-0's original fd 1 into a private stream (_JSON_STDOUT) reserved for the JSON protocol, thendup2(2, 1)so fd 1 points at stderr. Any banner from NCCL / cuBLAS / cuDNN / torch — Python-level or C-level, import-time or runtime — now lands on stderr (still drained into CI logs by the parent's stderr thread) instead of the protocol pipe. Responses are written through_JSON_STDOUT. The redirect runs at module scope, beforeimport torch, so even import-time/C-level writes are covered.The pre-existing
_silence_non_rank0_stdout(rank>0 →/dev/null) is kept, so between the two redirects every rank's fd 1 is diverted away from the pipe. The asymmetry is intentional and documented in the code: rank 0 keeps a dup of fd 1 for the JSON channel and sends the rest to stderr (preserved in CI logs); rank>0 has nothing to preserve, so it goes to/dev/null.Parent —
test_attention_with_cp.py: comment-only update. The read path is unchanged (singleselect+readline+json.loads), since with both redirects in place the pipe carries only rank-0's JSON line. A non-JSON line is still turned into a clearAssertionError("pool worker JSON protocol broke") rather than a rawJSONDecodeError.Why fd-level isolation rather than a stdout sentinel/skip
An earlier revision prefixed responses with a
[CP_POOL_RESP]sentinel and had the parent skip non-protocol lines. That was dropped after measuring it: with the rank-0 fd redirect in place, instrumenting the parent to count non-sentinel lines on the pipe underNCCL_DEBUG=INFO(max verbosity) across multiple cases yields zero — the torchrun agent logs to stderr, not this stdout pipe, so the sentinel filtered nothing. fd isolation also behaves better underNCCL_DEBUG=INFO: the (large) banner volume goes to stderr instead of forcing the parent to read+skip hundreds of lines within the response deadline. Net result is a smaller, single-mechanism fix.Validation
8×H100, TE built from this commit, with the NCCL banner confirmed emitted in every run (
grep "NCCL version"):NCCL_DEBUG=VERSION, flash p2p + all_gather + a2a (one session, pool reused across the banner)NCCL_DEBUG=INFO(verbose), flash p2p + fused p2pNCCL_DEBUG=VERSION, fused p2pNCCL_DEBUGunset (control)qa/L1_pytorch_distributed_unittest/test.shpattern,NCCL_DEBUG=VERSION)NCCL_DEBUG=VERSIONpool worker JSON protocol broke(regression reproduced)Type of change
Checklist
NCCL_DEBUG=VERSION/INFO/unset; L1 det+non-det)