Skip to content

fix(env): cache MORI_ENABLE_SDMA / MORI_DISABLE_P2P at Context construction#317

Merged
jhchouuu merged 4 commits into
mainfrom
jiahzhou/sdma-env-cache
May 13, 2026
Merged

fix(env): cache MORI_ENABLE_SDMA / MORI_DISABLE_P2P at Context construction#317
jhchouuu merged 4 commits into
mainfrom
jiahzhou/sdma-env-cache

Conversation

@jhchouuu
Copy link
Copy Markdown
Collaborator

@jhchouuu jhchouuu commented May 13, 2026

Problem

MORI_ENABLE_SDMA and MORI_DISABLE_P2P were read independently from multiple call sites:

Location When read What it controls
Context::InitializePossibleTransports once at shmem init transport selection (P2P / SDMA / RDMA)
SymmMemManager::Malloc every allocation hipMalloc vs hipExtMallocWithFlags(hipDeviceMallocUncached)
EpDispatchCombineHandle ctor every handle construct config.enableSdma flag

If MORI_ENABLE_SDMA was set after Context init (e.g. a session-scoped pytest fixture spawns workers, workers init shmem, then the test function does os.environ["MORI_ENABLE_SDMA"] = "1"), the transport layer still believed P2P but per-allocation paths flipped to uncached buffers — producing a cache/IPC inconsistency that silently hung async_ll.

This was the latent reason tests/python/ops/test_dispatch_combine_async_ll.py had to be invoked with the env on the pytest command line in .github/workflows/ci.yml (PR #313); setting it inside the test function had no effect on transport selection but did flip allocation cacheability — the worst combination.

Fix

Snapshot both env vars exactly once in Context's constructor and expose them through Context::IsSdmaEnabled() / Context::IsP2PDisabled(). All later readers consult the cached value:

  • SymmMemManager::Malloccontext.IsSdmaEnabled() (already had a Context& reference)
  • EpDispatchCombineHandle → new mori::shmem::ShmemSdmaEnabled() accessor that delegates to Context::IsSdmaEnabled()

After this PR: either the env is set before shmem_init and takes effect end-to-end, or it isn't and is silently ignored — there is no half-on state.

Files

  • include/mori/application/context/context.hpp — add bool sdmaEnabled / bool p2pDisabled members + getters
  • src/application/context/context.cpp — initialize them in ctor; switch internal sites to member accessors; remove the file-local free functions
  • src/application/memory/symmetric_memory.cppSymmMemManager::Malloc consults context.IsSdmaEnabled()
  • include/mori/shmem/shmem_api.hpp + src/shmem/runtime.cpp — add ShmemSdmaEnabled() accessor (mirrors existing ShmemNumQpPerPe())
  • src/ops/dispatch_combine/dispatch_combine.cppEpDispatchCombineHandle ctor uses ShmemSdmaEnabled()

Test plan

  • CI green on MI300X-BNXT and MI355X-AINIC (this PR's box is MI308, SDMA queue init flaky on this hardware so local async_ll over SDMA isn't a reliable signal)
  • intranode and internode_v1 jobs unchanged (no SDMA path involved → no behavior change)
  • async_ll job (which sets MORI_ENABLE_SDMA=1 on the pytest cmdline) keeps passing — confirms the cached path is wired correctly

jhchouuu and others added 4 commits May 13, 2026 14:02
…uction

Both env vars used to be read independently from multiple call sites:
  * Context::InitializePossibleTransports (transport selection, one-shot)
  * SymmMemManager::Malloc (per-allocation, picks hipMalloc vs uncached
    hipExtMallocWithFlags(hipDeviceMallocUncached))
  * EpDispatchCombineHandle ctor (per-handle config flag)

When the env was mutated AFTER Context init -- e.g. a pytest fixture
spawns workers, workers init shmem, and the test function then sets
os.environ["MORI_ENABLE_SDMA"] = "1" -- the transport layer still
believed P2P but per-allocation paths flipped to uncached buffers,
producing a cache/IPC inconsistency that hung async_ll.

Snapshot both env vars exactly once in Context's constructor and
expose them through Context::IsSdmaEnabled() / IsP2PDisabled(). All
later readers (SymmMemManager via its Context reference, dispatch_combine
via the new ShmemSdmaEnabled() shmem accessor) now consult the cached
value, so the order in which env vars are set relative to shmem init
no longer matters: either it's set before init and takes effect, or it
isn't and is silently ignored -- no half-on state.
Remove late os.environ["MORI_DISABLE_P2P"] mutations from worker
functions — with the env-cache fix these are no-ops anyway, and they
prevented running tests under different transports.

Split the single async_ll CI step into two independent steps:
- MORI_ENABLE_SDMA=1: exercises the SDMA (anvil) intranode path
- MORI_DISABLE_P2P=1: exercises the IBGDA (RDMA) path

Both env vars are now set on the pytest process command line (before
worker spawn), consistent with the cached-env design.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Set MORI_ENABLE_SDMA=1 at module level via os.environ.setdefault so
that running the test locally without any env config picks the SDMA
path.  CI can still override per step:
- SDMA step passes MORI_ENABLE_SDMA=1 (redundant, harmless)
- IBGDA step passes MORI_DISABLE_P2P=1 (routes to RDMA; SDMA flag ignored)

Using setdefault avoids clobbering an explicit external override.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The test file sets os.environ.setdefault("MORI_ENABLE_SDMA", "1") at
module level for convenient local runs.  Without an explicit override,
the IBGDA step would inherit sdmaEnabled=true, causing Malloc() to use
hipExtMallocWithFlags(uncached) while transport is RDMA — mismatch that
leads to timeout (exit code 124).

MORI_ENABLE_SDMA=0 prevents setdefault from firing so the Context
caches sdmaEnabled=false and allocations use normal hipMalloc.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jhchouuu jhchouuu merged commit a0598d0 into main May 13, 2026
18 of 19 checks passed
@jhchouuu jhchouuu deleted the jiahzhou/sdma-env-cache branch May 13, 2026 10:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant