fix(env): cache MORI_ENABLE_SDMA / MORI_DISABLE_P2P at Context construction #317
Merged
Both env vars used to be read independently from multiple call sites:
* Context::InitializePossibleTransports (transport selection, one-shot)
* SymmMemManager::Malloc (per-allocation, picks hipMalloc vs uncached
hipExtMallocWithFlags(hipDeviceMallocUncached))
* EpDispatchCombineHandle ctor (per-handle config flag)
When the env was mutated AFTER Context init (e.g. a pytest fixture
spawns workers, workers init shmem, and the test function then sets
os.environ["MORI_ENABLE_SDMA"] = "1"), the transport layer still
assumed P2P but per-allocation paths flipped to uncached buffers,
producing a cache/IPC inconsistency that hung async_ll.
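The failure mode above can be reproduced with a small Python model. This is a sketch, not mori code: LegacyContext, legacy_malloc_kind and _env_flag are hypothetical stand-ins for the C++ call sites, MORI_DISABLE_P2P is left out for brevity, and only the exact value "1" is assumed to enable a flag. Transport is fixed once at init, while the allocation path re-reads the environment on every call, so a late mutation splits them apart.

```python
import os

def _env_flag(name):
    # Assumption: only the exact value "1" enables a flag.
    return os.environ.get(name) == "1"

class LegacyContext:
    """Hypothetical model of the pre-fix Context: transport fixed at init."""

    def __init__(self):
        self.transport = "SDMA" if _env_flag("MORI_ENABLE_SDMA") else "P2P"

def legacy_malloc_kind():
    """Hypothetical model of the pre-fix Malloc: re-reads the env on every call."""
    return "uncached" if _env_flag("MORI_ENABLE_SDMA") else "hipMalloc"

os.environ.pop("MORI_ENABLE_SDMA", None)  # env unset when workers init shmem
ctx = LegacyContext()                     # transport selection sees P2P
os.environ["MORI_ENABLE_SDMA"] = "1"      # late mutation inside the test body
print(ctx.transport, legacy_malloc_kind())  # -> P2P uncached
```

The printed pair is exactly the half-on state: a P2P transport paired with uncached buffers.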
Snapshot both env vars exactly once in Context's constructor and
expose them through Context::IsSdmaEnabled() / IsP2PDisabled(). All
later readers (SymmMemManager via its Context reference, dispatch_combine
via the new ShmemSdmaEnabled() shmem accessor) now consult the cached
value, so the order in which env vars are set relative to shmem init
no longer matters: either it's set before init and takes effect, or it
isn't and is silently ignored; there is no half-on state.
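The snapshot-at-construction pattern can be modelled in a few lines of Python. This is a hedged sketch: Context, SymmMemManager and their snake_case methods are hypothetical mirrors of the C++ Context::IsSdmaEnabled() / IsP2PDisabled() and SymmMemManager::Malloc, and treating only "1" as enabled is an assumption about how the flags are parsed.

```python
import os

def _env_flag(env, name):
    # Assumption: only the exact value "1" enables a flag.
    return env.get(name) == "1"

class Context:
    """Hypothetical Python model of Context with a cached env snapshot."""

    def __init__(self, env=None):
        env = os.environ if env is None else env
        # Read both env vars exactly once, at construction.
        self._sdma_enabled = _env_flag(env, "MORI_ENABLE_SDMA")
        self._p2p_disabled = _env_flag(env, "MORI_DISABLE_P2P")

    def is_sdma_enabled(self):
        return self._sdma_enabled

    def is_p2p_disabled(self):
        return self._p2p_disabled

class SymmMemManager:
    """Hypothetical model: allocation kind comes from the Context, not os.environ."""

    def __init__(self, context):
        self.context = context

    def malloc_kind(self):
        return "uncached" if self.context.is_sdma_enabled() else "hipMalloc"

os.environ.pop("MORI_ENABLE_SDMA", None)
ctx = Context()                       # snapshot taken here
mgr = SymmMemManager(ctx)
os.environ["MORI_ENABLE_SDMA"] = "1"  # too late: silently ignored
print(ctx.is_sdma_enabled(), mgr.malloc_kind())  # -> False hipMalloc
```

Because every reader goes through the same cached flag, transport selection and allocation can no longer disagree about whether SDMA is on.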
Remove late os.environ["MORI_DISABLE_P2P"] mutations from worker functions. With the env-cache fix these are no-ops anyway, and they prevented running tests under different transports.
Split the single async_ll CI step into two independent steps:
- MORI_ENABLE_SDMA=1: exercises the SDMA (anvil) intranode path
- MORI_DISABLE_P2P=1: exercises the IBGDA (RDMA) path
Both env vars are now set on the pytest process command line (before worker spawn), consistent with the cached-env design.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
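Setting the variable on the command line works because a child process receives a copy of its parent's environment at spawn time. The sketch below models each CI step as a spawned Python worker; run_step and the one-line child script are hypothetical stand-ins, since the actual workflow file and pytest invocation are not shown here.

```python
import os
import subprocess
import sys

# Child script standing in for a spawned worker: reports what it inherited.
CHILD = ("import os; "
         "print(os.environ.get('MORI_ENABLE_SDMA'), os.environ.get('MORI_DISABLE_P2P'))")

def run_step(overrides):
    """Hypothetical helper: spawn a worker with step-specific env vars fixed at spawn."""
    env = dict(os.environ)
    for key in ("MORI_ENABLE_SDMA", "MORI_DISABLE_P2P"):
        env.pop(key, None)  # isolate each step from the caller's environment
    env.update(overrides)
    result = subprocess.run([sys.executable, "-c", CHILD], env=env,
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

print(run_step({"MORI_ENABLE_SDMA": "1"}))  # SDMA step  -> 1 None
print(run_step({"MORI_DISABLE_P2P": "1"}))  # IBGDA step -> None 1
```

Each worker sees its step's value before it can initialize shmem, which is the only point at which the cached Context reads it.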
Set MORI_ENABLE_SDMA=1 at module level via os.environ.setdefault so that running the test locally without any env config picks the SDMA path. CI can still override per step:
- SDMA step passes MORI_ENABLE_SDMA=1 (redundant, harmless)
- IBGDA step passes MORI_DISABLE_P2P=1 (routes to RDMA; SDMA flag ignored)
Using setdefault avoids clobbering an explicit external override.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
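The setdefault semantics can be checked on a plain dict, which behaves like os.environ here without touching the real process environment. effective_env is a hypothetical helper that applies the module-level default to a given step's variables.

```python
def effective_env(step_env):
    """Hypothetical helper: apply the test module's import-time default.

    Mirrors os.environ.setdefault("MORI_ENABLE_SDMA", "1") on a copy.
    """
    env = dict(step_env)
    env.setdefault("MORI_ENABLE_SDMA", "1")  # no-op if the key is already present
    return env

print(effective_env({}))                          # local run: default applies
print(effective_env({"MORI_ENABLE_SDMA": "1"}))   # SDMA step: already "1"
print(effective_env({"MORI_DISABLE_P2P": "1"}))   # IBGDA step: default still fires
```

An explicit value, including "0", is never overwritten. The IBGDA case, however, still ends up with MORI_ENABLE_SDMA set to "1", which is the situation the following commit message describes.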
The test file sets os.environ.setdefault("MORI_ENABLE_SDMA", "1") at
module level for convenient local runs. Without an explicit override,
the IBGDA step would inherit sdmaEnabled=true, causing Malloc() to use
hipExtMallocWithFlags(uncached) while the transport is RDMA, a mismatch
that leads to a timeout (exit code 124).
Passing MORI_ENABLE_SDMA=0 explicitly in the IBGDA step prevents
setdefault from firing, so the Context caches sdmaEnabled=false and
allocations use normal hipMalloc.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
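The outcome per CI step can be tabulated with a hedged model. step_outcome is hypothetical, and two rules in it are assumptions rather than mori's code: disabling P2P is taken to route to RDMA regardless of the SDMA flag, and the consistency check (uncached buffers only alongside the SDMA transport) is inferred from the two failure modes described in this PR.

```python
def _flag(env, name):
    # Assumption: only the exact value "1" enables a flag.
    return env.get(name) == "1"

def step_outcome(step_env):
    """Hypothetical model: (transport, allocation, consistent) for one CI step."""
    env = dict(step_env)
    env.setdefault("MORI_ENABLE_SDMA", "1")  # the test module's import-time default
    sdma = _flag(env, "MORI_ENABLE_SDMA")
    p2p_disabled = _flag(env, "MORI_DISABLE_P2P")
    # Assumed precedence: disabling P2P routes to RDMA whatever the SDMA flag says.
    transport = "RDMA" if p2p_disabled else ("SDMA" if sdma else "P2P")
    allocation = "uncached" if sdma else "hipMalloc"
    # Inferred rule: uncached buffers are only expected on the SDMA transport.
    consistent = (allocation == "uncached") == (transport == "SDMA")
    return transport, allocation, consistent

for name, step_env in [
    ("SDMA", {"MORI_ENABLE_SDMA": "1"}),
    ("IBGDA, no override", {"MORI_DISABLE_P2P": "1"}),
    ("IBGDA, MORI_ENABLE_SDMA=0", {"MORI_DISABLE_P2P": "1", "MORI_ENABLE_SDMA": "0"}),
]:
    print(name, step_outcome(step_env))
```

Only the middle configuration comes out inconsistent (RDMA transport with uncached buffers), matching the timeout this commit fixes.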
Problem
MORI_ENABLE_SDMA and MORI_DISABLE_P2P were read independently from multiple call sites:
* Context::InitializePossibleTransports (transport selection)
* SymmMemManager::Malloc (hipMalloc vs hipExtMallocWithFlags(hipDeviceMallocUncached))
* EpDispatchCombineHandle ctor (config.enableSdma flag)
If MORI_ENABLE_SDMA was set after Context init (e.g. a session-scoped pytest fixture spawns workers, workers init shmem, then the test function does os.environ["MORI_ENABLE_SDMA"] = "1"), the transport layer still assumed P2P but per-allocation paths flipped to uncached buffers, producing a cache/IPC inconsistency that silently hung async_ll.
This was the latent reason tests/python/ops/test_dispatch_combine_async_ll.py had to be invoked with the env var on the pytest command line in .github/workflows/ci.yml (PR #313): setting it inside the test function had no effect on transport selection but did flip allocation cacheability, the worst combination.
Fix
Snapshot both env vars exactly once in Context's constructor and expose them through Context::IsSdmaEnabled() / Context::IsP2PDisabled(). All later readers consult the cached value:
* SymmMemManager::Malloc → context.IsSdmaEnabled() (it already held a Context& reference)
* EpDispatchCombineHandle → new mori::shmem::ShmemSdmaEnabled() accessor that delegates to Context::IsSdmaEnabled()
After this PR, either the env var is set before shmem_init and takes effect end-to-end, or it isn't and is silently ignored; there is no half-on state.
Files
* include/mori/application/context/context.hpp: add bool sdmaEnabled / bool p2pDisabled members + getters
* src/application/context/context.cpp: initialize them in the ctor; switch internal sites to the member accessors; remove the file-local free functions
* src/application/memory/symmetric_memory.cpp: SymmMemManager::Malloc consults context.IsSdmaEnabled()
* include/mori/shmem/shmem_api.hpp + src/shmem/runtime.cpp: add the ShmemSdmaEnabled() accessor (mirrors the existing ShmemNumQpPerPe())
* src/ops/dispatch_combine/dispatch_combine.cpp: the EpDispatchCombineHandle ctor uses ShmemSdmaEnabled()
Test plan
* intranode and internode_v1 jobs unchanged (no SDMA path involved → no behavior change)
* async_ll job (which sets MORI_ENABLE_SDMA=1 on the pytest cmdline) keeps passing, confirming the cached path is wired correctly
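The async_ll job exercises the chain described under Fix: the EpDispatchCombineHandle ctor asks shmem, and shmem delegates to the cached Context flag. The Python model below sketches that wiring under loud assumptions: shmem_init, shmem_sdma_enabled, the module-level _context, and the enable_sdma attribute are hypothetical mirrors of the C++ names, and the "1"-only parsing is assumed.

```python
import os

class Context:
    """Hypothetical model: caches the SDMA flag once, at construction."""

    def __init__(self, env):
        self._sdma_enabled = env.get("MORI_ENABLE_SDMA") == "1"  # assumed parsing

    def is_sdma_enabled(self):
        return self._sdma_enabled

_context = None  # hypothetical module state, set once by shmem_init()

def shmem_init(env=None):
    global _context
    _context = Context(os.environ if env is None else env)

def shmem_sdma_enabled():
    """Models ShmemSdmaEnabled(): delegate to the cached Context flag."""
    if _context is None:
        raise RuntimeError("shmem_init() has not been called")
    return _context.is_sdma_enabled()

class EpDispatchCombineHandle:
    """Hypothetical model: the ctor no longer reads the env itself."""

    def __init__(self):
        self.enable_sdma = shmem_sdma_enabled()

shmem_init({"MORI_ENABLE_SDMA": "1"})  # env fixed before init, as in the CI step
handle = EpDispatchCombineHandle()
print(handle.enable_sdma)
```

Raising before shmem_init is a design choice of this sketch; it makes a handle constructed too early fail loudly instead of silently defaulting.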