feat(steering): cross-rank applied-action checksum in dynamic status #233
Merged
RhizoNymph merged 1 commit on Jul 5, 2026
Failure mode
Sync capture consumers run identically and independently on every TP rank in lock-step, with zero communication. Each rank applies the returned steering actions via SteeringModelRunnerMixin._apply_steering_actions. The contract is enforced only by convention: if one rank hits a local fault (consumer OOM, a swallowed exception at the observer boundary in _run_sync_consumers), its dyn_id allocation and steering tables silently desync from its siblings forever, corrupting output with no error surfaced anywhere.
Detector
An always-on, cheap rolling checksum of applied steering actions per worker:
- steering_model_runner_mixin.py: _apply_steering_actions folds every action that is actually applied (rejected actions excluded) into self._steering_action_checksum (u64). The fold is zlib.crc32 over a compact, PYTHONHASHSEED-free digest of the action content (class, target req_id/config_hash/dyn_id, hook/layer, source, and a bit-exact shape+CRC of any vector/probe payload), mixed splitmix-style in application order, plus a per-drain-batch ordinal so that "same actions, different step" yields a different checksum. Digests are bit-exact (not norms) because actions are host-side numpy built from rank-identical inputs: a bit-exact digest is strictly stronger and never legitimately diverges across ranks. Cost is O(applied actions), with zero cost on idle steps. The rollback of a failed declarative override is itself folded. get_dynamic_steering_status exposes action_checksum (hex) + action_count (picklable primitives).
- _merge.py + api_router.py: GET /v1/steering/dynamic compares checksums across workers via check_action_determinism. A mismatch does not return a 500: the response carries determinism: {consistent: false, checksums: {...}} and a rate-limited server-side ERROR fires; on a match the response carries determinism: {consistent: true, action_count: N}.
- steering_update_accepted is a new thin wrapper over the existing _validate_update, so the batched vector-update path folds exactly the applied set with no duplicated validation logic.
Topology scoping / granularity caveat
Comparison is scoped within each PP stage (grouped by pp_rank), mirroring deep_merge_status: TP ranks in a stage own identical layers and must match, while PP stages own disjoint layers and may legitimately differ. Sync-consumer-originated actions only exist at pipeline_parallel_size == 1 (enforced in vllm/v1/capture/registry.py), where this reduces to an all-workers comparison, which is exactly the intended check. Granularity is poll-time: a desync is detected on the next status poll, so the checksum bounds (but does not prevent) corruption; pair it with periodic polling.
Docs
New docs/design/dynamic_steering.md §6.1 documents the detector and its poll-time granularity.
Expected sibling-PR conflicts
- chore/steering-row-owner touches steering_manager.py and the mixin's scale/monitor apply methods (different regions of the same file).
- docs/design/dynamic_steering.md: expected to conflict with chore/steering-trust-hardening.
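For readers who want a concrete picture of the detector described in this PR, the following is a minimal, stdlib-only sketch and not the code from this PR. The names splitmix64, action_digest, RollingActionChecksum, and compare_within_stages are hypothetical, as are the digest field layout, the way the batch ordinal is mixed in, and the choice of action_count reported on a match. It only illustrates the idea: a seed-independent CRC32 digest per applied action, folded in application order with a splitmix-style mixer, then compared across workers that share a pp_rank.

```python
import zlib
from collections import defaultdict

MASK64 = (1 << 64) - 1


def splitmix64(x: int) -> int:
    """splitmix64 output function; a bijection on 64-bit integers."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


def action_digest(kind, target, hook, layer, shape=(), payload=b""):
    """CRC32 of a canonical byte string. No hash() call is involved, so the
    result does not depend on PYTHONHASHSEED. The field set is illustrative."""
    payload_crc = zlib.crc32(payload)  # bit-exact fingerprint of the payload
    text = f"{kind}|{target}|{hook}|{layer}|{shape}|{payload_crc:08x}"
    return zlib.crc32(text.encode("utf-8"))


class RollingActionChecksum:
    """Hypothetical per-worker accumulator (not the PR's class)."""

    def __init__(self):
        self.value = 0  # u64 running checksum
        self.count = 0  # number of applied actions folded so far

    def fold_batch(self, batch_ordinal, digests):
        """Fold digests of applied actions, in application order."""
        for d in digests:  # an empty (idle) batch does no work
            mixed_in = d ^ ((batch_ordinal << 32) & MASK64)
            self.value = splitmix64(self.value ^ mixed_in)
            self.count += 1

    def status(self, pp_rank=0):
        return {"pp_rank": pp_rank,
                "action_checksum": f"{self.value:016x}",
                "action_count": self.count}


def compare_within_stages(statuses):
    """Group worker statuses by pp_rank; checksums must agree per group."""
    by_stage = defaultdict(list)
    for st in statuses:
        by_stage[st["pp_rank"]].append(st["action_checksum"])
    if all(len(set(group)) == 1 for group in by_stage.values()):
        # Assumption: report the largest count seen across workers.
        count = max((st["action_count"] for st in statuses), default=0)
        return {"consistent": True, "action_count": count}
    return {"consistent": False,
            "checksums": {f"pp{st['pp_rank']}/worker{w}": st["action_checksum"]
                          for w, st in enumerate(statuses)}}


# Two TP ranks apply the same batch and therefore agree.
rank0, rank1 = RollingActionChecksum(), RollingActionChecksum()
batch = [action_digest("AddVector", "req-1", "resid_post", 3, (4,), b"\x00" * 16)]
rank0.fold_batch(1, batch)
rank1.fold_batch(1, batch)
print(compare_within_stages([rank0.status(), rank1.status()]))
```

Because each fold step feeds the previous value back through the mixer, the result depends on application order, and because the ordinal is XORed into the input, the same action applied in a different batch yields a different checksum. Grouping by pp_rank before comparing keeps legitimately different pipeline stages from being reported as a desync.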