Add single-mode dispatch to spec_compression_stress.sh #33
Merged
The script ran all three modes (baseline / self_verify / cpu_verify)
sequentially in one process. Each mode regenerates
astra-sim/inputs/{network.yml,system.json,memory_expansion.json} from
the cluster config, so naively backgrounding the three calls races on
those shared input files and corrupts the in-flight sim's view.
Two changes to enable parallel dispatch via container-per-mode (which
is filesystem-isolated by default, so the inputs/* race goes away):
* Accept a positional MODE argument: ``./script.sh <mode>`` runs only
the named mode. ``./script.sh`` keeps the legacy "all three
sequentially" behaviour. Unknown modes error out.
* Add a ``prepare`` mode that does just the workload synthesis +
scaled-config generation and exits. Lets a parallel dispatcher
pre-seed the shared OUTPUT_DIR once before fanning out the three
simulator jobs, instead of every job racing on those steps.
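The dispatch described in the two bullets above can be sketched as a single ``case`` statement. This is a minimal illustration, not the script's actual code: ``run_mode`` and ``prepare_inputs`` are hypothetical stand-ins for the real simulation and workload/config generation steps, and the sketch assumes the legacy no-argument path prepares once before running the three modes.

```shell
#!/bin/sh
# Sketch only: run_mode and prepare_inputs are hypothetical placeholders
# for the real per-mode simulation and the workload/config generation.
set -eu

run_mode()       { echo "run:$1"; }
prepare_inputs() { echo "prepare"; }

dispatch() {
  mode="${1:-all}"            # no argument selects the legacy path
  case "$mode" in
    all)
      # Assumed legacy behaviour: prepare once, then all modes in order.
      prepare_inputs
      for m in baseline self_verify cpu_verify; do
        run_mode "$m"
      done
      ;;
    prepare)
      # Generate shared inputs and stop; no simulation is run.
      prepare_inputs
      ;;
    baseline|self_verify|cpu_verify)
      run_mode "$mode"
      ;;
    *)
      # Unknown mode: print a usage hint and fail with a non-zero status.
      echo "usage: $0 [baseline|self_verify|cpu_verify|prepare]" >&2
      return 2
      ;;
  esac
}

dispatch "$@"
```

Keeping the unknown-mode branch last means a typo in the mode name fails fast instead of silently falling through to the full three-mode run.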
The header docstring documents the docker-based parallel pattern:
prepare on the host, then launch one sim image container per mode
with the same OUTPUT_DIR mounted in. Each container's astra-sim/
lives in its own overlay layer so the input-file race is impossible.
Summary
Lets ``serving/spec_compression_stress.sh`` run a single mode per invocation, so each mode can be fanned out into its own filesystem-isolated container in parallel.
Why
The script ran ``baseline → self_verify → cpu_verify`` sequentially in one process. The simulator regenerates ``astra-sim/inputs/{network.yml, system.json, memory_expansion.json}`` from the cluster config on every run, so simply backgrounding the three calls would race on those shared files and corrupt the in-flight sim's view of its own network / system / memory config. The cleanest fix is to leave the script's per-mode logic unchanged and let the caller run each mode in its own sim docker container: the per-container overlay layer isolates the writes, so no code patch is needed.
Changes
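* ``./script.sh`` (no arg): runs all three modes sequentially (legacy behaviour).
* ``./script.sh baseline`` / ``self_verify`` / ``cpu_verify``: runs only the named mode.
* ``./script.sh prepare``: runs just the workload synthesis and scaled-config generation, then exits.
* ``./script.sh <anything else>``: errors out with the usage hint and a non-zero exit.
Parallel dispatch pattern (documented in header)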
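A minimal sketch of what such a dispatcher might look like. The image name ``sim-image:latest``, the ``/out`` mount point, the in-container script path, and passing ``OUTPUT_DIR`` as an environment variable are all assumptions made for illustration; ``DOCKER`` can be set to ``echo docker`` for a dry run.

```shell
#!/bin/sh
# Sketch only. SIM_IMAGE, the /out mount point, and the in-container
# script path are hypothetical; OUTPUT_DIR as an env var is an assumption.
set -eu

SIM_IMAGE="${SIM_IMAGE:-sim-image:latest}"   # hypothetical image name
OUTPUT_DIR="${OUTPUT_DIR:-$PWD/out}"         # must be an absolute path for -v
DOCKER="${DOCKER:-docker}"                   # set to "echo docker" to dry-run

# Step 1 (host): seed the shared output dir once, before fanning out.
prepare_on_host() {
  OUTPUT_DIR="$OUTPUT_DIR" ./serving/spec_compression_stress.sh prepare
}

# Step 2: one container per mode, all mounting the same OUTPUT_DIR.
launch_all() {
  pids=""
  for mode in baseline self_verify cpu_verify; do
    $DOCKER run --rm \
      -v "$OUTPUT_DIR:/out" \
      -e OUTPUT_DIR=/out \
      "$SIM_IMAGE" ./serving/spec_compression_stress.sh "$mode" &
    pids="$pids $!"
  done
  # Wait for every job and report failure if any container failed.
  rc=0
  for pid in $pids; do
    wait "$pid" || rc=1
  done
  return "$rc"
}
```

Calling ``prepare_on_host && launch_all`` runs the preparation once and then the three modes concurrently. Only the output directory is bind-mounted, so it is the only path the containers share.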
Each container gets its own overlay layer, so the simulator's ``astra-sim/inputs/*`` writes are local to that mode. No code change in the simulator was needed.
Test plan
* ``./script.sh`` (no arg) still runs all three sequentially with identical outputs to before.
* ``./script.sh baseline`` produces just ``baseline.csv`` and ``baseline.log`` under the output dir.
* ``./script.sh prepare`` produces ``workload.jsonl`` (and the scaled configs if ``H100_SCALE != 1``) and exits.
* ``./script.sh nonsense`` errors out with the usage hint and a non-zero exit.
* … ``OUTPUT_DIR``, producing all three CSVs without ``astra-sim/inputs/*`` corruption.
Generated by Claude Code