
Migrate agentic-coding benchmarks to aiperf v0.2 (reopened)#1393

Merged
cquil11 merged 134 commits into main from chore/agentx-v0.2-aiperf-testing-reopen
May 17, 2026

Conversation

@cquil11
Collaborator

@cquil11 cquil11 commented May 16, 2026

Summary

Migrates scenario-type: agentic-coding from kv-cache-tester to aiperf (cjq/weka-live-assistant-responses), adds Kimi K2.5 agentic sweeps across B200, B300, H100, H200, MI355X, and pulls in a stack of agentic-specific aiperf features. bmk_agentic JSON schema preserved — downstream aggregators unaffected.

Dataset (v0.2): semianalysisai/cc-traces-weka-no-subagents-051226. Pulled via aiperf's --public-dataset semianalysis_cc_traces_weka flag — the loader name is a stable alias for the dated HF revision, so bumping the dataset means just rev'ing the alias on the aiperf side.

InferenceX repo additions since v0.1

  • benchmarks/benchmark_lib.sh: new build_replay_cmd emits aiperf profile --scenario inferencex-agentx-mvp … with --streaming --use-server-token-count --random-seed 42 --failed-request-threshold 0.05 --trajectory-start-{min,max}-ratio 0.25/0.75; install_agentic_deps editable-installs aiperf in-container; AIPERF_DATASET_{CONFIGURATION_TIMEOUT,MMAP_CACHE_DIR,WEKA_LIVE_ASSISTANT_RESPONSES} wired
  • utils/process_agentic_result.py rewritten to consume aiperf profile_export.jsonl + profile_export_aiperf.json + server_metrics_export.json; preserves all keys summarize.py consumes; theoretical cache-hit computed from trace metadata
  • utils/summarize.py per-metric stat set: drop median / p99 / p99.9, add p75 / p95 (keep mean / p90 / std)
  • utils/test_process_agentic_result.py fixture-driven contract test for every key downstream reads
  • 6 new agentic launchers: kimik2.5_{fp4_b200, fp4_b300, fp4_mi355x, int4_b200, int4_h100, int4_h200}.sh with per-target tunings (fp8 KV on Hopper, --block-size=1 + AITER on MI355X, lazy_offload=true JSON form on H200+MI355X, VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 on B200 for TP=4)
  • .github/configs/{nvidia,amd}-master.yaml: agentic-coding blocks for all 5 SKUs with explicit conc-list per offloading mode + runner: h200-dgxc pin (cache mount availability)
  • runners/launch_h200-dgxc-slurm.sh: mount /home/sa-shared/gharunners/ai-perf-cache/aiperf_mmap_cache, export AIPERF_DATASET_MMAP_CACHE_DIR (parity with B200 DGXC + MI355X)
  • Removed: utils/find_reusable_sweep_run.py, utils/validate_reusable_sweep_artifacts.py + tests (aiperf's content-addressed mmap cache replaces them)
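
A rough Python picture of what `build_replay_cmd` assembles, using only the flags named in this summary. The real helper is a shell function in benchmarks/benchmark_lib.sh whose body is not shown here. The name `build_replay_cmd_sketch`, the `extra_args` parameter, and placing `--public-dataset` in the same command are assumptions made for illustration. Arguments elided in the summary are not guessed.

```python
import shlex
from typing import List, Sequence


def build_replay_cmd_sketch(extra_args: Sequence[str] = ()) -> List[str]:
    """Assemble the aiperf invocation from flags listed in the PR summary.

    Hypothetical stand-in for the shell helper. Anything the summary elides
    must be supplied by the caller through extra_args.
    """
    cmd = [
        "aiperf", "profile",
        "--scenario", "inferencex-agentx-mvp",
        # Assumption: the dataset alias is passed in the same command.
        "--public-dataset", "semianalysis_cc_traces_weka",
        "--streaming",
        "--use-server-token-count",
        "--random-seed", "42",
        "--failed-request-threshold", "0.05",
        "--trajectory-start-min-ratio", "0.25",
        "--trajectory-start-max-ratio", "0.75",
    ]
    cmd.extend(extra_args)
    return cmd


# Print a copy-pasteable shell line for inspection.
print(shlex.join(build_replay_cmd_sketch()))
```

Returning a list rather than a string keeps quoting out of the picture until the final `shlex.join`, which is one way to avoid word-splitting surprises when a launcher appends its own arguments.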

aiperf submodule additions since v0.1

  • inferencex-agentx-mvp scenario plugin: locks --num-dataset-entries, --inter-turn-delay-cap-seconds 60, --cache-bust first_turn_prefix, timing_mode=AGENTIC_REPLAY, requires --ignore-eos and weka_trace loader
  • AGENTIC_REPLAY timing strategy + TrajectorySource: per-trajectory k_i sampling, WARMUP→PROFILING resume at k_i+1, FIFO recycle queue, context-overflow trajectory short-circuit
  • weka_trace loader with AIPERF_DATASET_WEKA_LIVE_ASSISTANT_RESPONSES=1 mode (user-only deltas, server-threaded assistant responses) — sources from semianalysisai/cc-traces-weka-no-subagents-051226
  • ContextOverflowCountMetric + scenario.context_overflow.is_context_overflow_response substring classifier; RecordProcessor drops overflow records before metrics push in AGENTIC_REPLAY scenarios
  • --failed-request-threshold (float in [0,1]): in-flight ProfileCancelCommand broadcast when error rate exceeds threshold after max(concurrency, 10) grace floor
  • --trajectory-start-{min,max}-ratio: replaces hardcoded 0%-70% k_i range; seed-deterministic per (random_seed, trace_id)
  • TrajectorySource build-time summary table (range cfg + per-lane k_i/N/pct/trace_id); per-trajectory warmup-completion log lines on credit return
  • Content-addressed mmap dataset cache (AIPERF_DATASET_MMAP_CACHE_DIR); AIPERF_DATASET_CONFIGURATION_TIMEOUT + AIPERF_SERVICE_PROFILE_CONFIGURE_TIMEOUT env knobs (default 300s → 1800s in our launchers)
  • --use-server-token-count: derive ISL/OSL from server usage.prompt_tokens/completion_tokens instead of client tokenizer (avoids CPU-pinning at high conc on custom-tokenizer models)
  • Realtime stats row: srv line with server-side running-avg throughput; intvty per-user row replaces split tin/tout; ISL/OSL p50/p75/p90/p99 row; per-warmup-completion lines emitted in non-TTY runs
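
For the `--use-server-token-count` bullet above, a minimal sketch of reading ISL/OSL from the server's `usage` object instead of re-tokenizing on the client. The field names `usage.prompt_tokens` and `usage.completion_tokens` come from the bullet; the helper names and the assumption that streamed chunks are plain dicts with usage on a late chunk are illustrative, not aiperf's code.

```python
from typing import Iterable, Optional, Tuple


def token_counts_from_usage(payload: dict) -> Optional[Tuple[int, int]]:
    """Return (ISL, OSL) if the payload carries a usage object, else None."""
    usage = payload.get("usage")
    if not usage:
        return None
    return usage["prompt_tokens"], usage["completion_tokens"]


def last_usage(chunks: Iterable[dict]) -> Optional[Tuple[int, int]]:
    """Scan streamed chunks and keep the last usage report seen."""
    counts = None
    for chunk in chunks:
        found = token_counts_from_usage(chunk)
        if found is not None:
            counts = found
    return counts
```

No tokenizer is loaded on this path, which is the point of the flag: the client does no per-request encode work at high concurrency.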

Test plan

  • Kimi K2.5 INT4 H200 (lazy_offload + cache mount): 17/17 clean
  • Kimi K2.5 FP4 B200 TP=8 + TP=4 (cudagraph estimator off): 24/24 clean
  • Kimi K2.5 FP4 B300 none: 9/9 clean
  • Kimi K2.5 FP4 B300 cpu: blocked on cluster recovery
  • Kimi K2.5 FP4 MI355X: re-dispatch after g09 + g11 drained

Results

[4 screenshots: CleanShot 2026-05-15 at 16:50:16, 16:51:01, 16:50:38, 16:51:23]

cquil11 and others added 30 commits April 28, 2026 09:11
Adds end-to-end agentic-coding benchmark infrastructure on top of the
existing fixed-seq-len harness. New components:

Trace replayer
- New utils/trace-replay submodule (kv-cache-tester @ agentx-minimized)
  driving multi-turn HF-dataset traces against any OpenAI-compatible
  endpoint at fixed concurrency.
- --debug-trace captures full per-request prompt/response, every
  streamed chunk via chunk.model_dump(), and integer token IDs
  (apply_chat_template prompt + logprobs.content completion) into
  debug_trace.jsonl.
- Per-model delta-field abstraction (gpt-oss → delta.reasoning, default
  → delta.reasoning_content) so reasoning-heavy responses are counted
  and appended to conversation history correctly.
- Input-token metric reads server's usage.prompt_tokens (authoritative)
  rather than the local apply_chat_template estimate which breaks for
  gpt-oss harmony's chat template.
- Per-user 8-token salt prefix on conversation[0] so two in-flight
  users replaying the same trace_id don't accidentally share KV-cache
  blocks.
- Period summary: counts up elapsed instead of down remaining; replaces
  the dispatch-jitter "Wait time" with the trace's true "Inter-turn
  time" sourced from RequestMetrics.delay_expected.
- 5s quiesce between warmup completion and metrics-collector start so
  warmup-tail prefill doesn't bleed into period 1.

Workflow plumbing
- e2e-tests.yml: workflow_dispatch + workflow_call inputs for
  debug-trace (boolean) and duration-override (string seconds), forwarded
  to test-sweep-agentic and test-sweep-multi-node-agentic jobs.
- benchmark-tmpl.yml + benchmark-multinode-tmpl.yml: debug-trace input
  mapped to DEBUG_TRACE env var; duration override threads through to
  matrix.config.duration.
- benchmark_lib.sh: build_replay_cmd / resolve_trace_source /
  install_agentic_deps / write_agentic_result_json helpers; consumes
  DEBUG_TRACE → --debug-trace.
- runners/launch_*.sh: shared agentic mode dispatch + scenario routing.
- runners/launch_b200-dgxc-slurm.sh → launch_b200-dgxc.sh rename to
  match the actual runner.name observed by the workflow.

Result aggregation
- utils/agentic-benchmark/{bench,analysis,scripts}: metrics collector
  (vllm/sglang Prometheus parsers), pareto plotter, per-config
  distribution analyzer, sweep aggregator.
- utils/process_agentic_result.py: per-job results.json builder.
- utils/matrix_logic: agentic-coding scenario plumbing in
  generate_sweep_configs.py + validation.py.

Examples (one per vendor)
- benchmarks/single_node/agentic/dsr1_fp4_b200.sh — NVIDIA.
- benchmarks/single_node/agentic/dsr1_fp4_mi355x.sh — AMD.
- Matching agentic-coding sections in nvidia-master.yaml
  (dsr1-fp4-b200-sglang) and amd-master.yaml (dsr1-fp4-mi355x-sglang).

All other model-specific launchers and matrix entries are deliberately
left out of this PR; downstream PRs add them on a per-model basis.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same value, two names — collapse to one. Workflow templates already
exposed both CONC and USERS env vars (USERS was a mirror of inputs.conc),
and the agentic matrix entries carried both `users: int` and
`conc: [users]`. Drop the duplicates and standardize on conc/CONC:

- benchmark-tmpl.yml / benchmark-multinode-tmpl.yml: drop redundant
  USERS env var (CONC remains)
- e2e-tests.yml / run-sweep.yml: pass `conc: ${{ matrix.config.conc }}`
  to template; build agentic conc-list as `'[${{ matrix.config.conc }}]'`
  since matrix.config.conc is now a scalar
- generate_sweep_configs.py: agentic entries emit Fields.CONC.value (int)
  only; loop variable renamed from `users` to `conc`; exp-name template
  now uses `_conc{N}` instead of `_users{N}`
- validation.py: drop Fields.USERS; agentic Pydantic models use `conc: int`
- process_agentic_result.py: read CONC env var, emit single `"conc"` key
- collect_sweep_results.py: regex updated to match `_conc{N}_offload`
- benchmark_lib.sh / agentic launcher scripts: $USERS → $CONC

The trace-replayer's --start-users / --max-users CLI flags are upstream's
API and are left unchanged; benchmark_lib.sh just passes $CONC into them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pick up these submodule commits (callanjfox/kv-cache-tester):
- 7b7f883 silence kimi: target the actual loaded-tokenizer module logger
- 5b87e43 silence kimi: replace static logger lookup with content filter
- 3394450 silence Kimi tokenization_kimi.py per-call encode warning
- 7ad6a9e delta-field map: add 'kimi' substring (uses delta.reasoning like gpt-oss)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5 new agentic-coding launcher scripts brought over from
chore/agentx-integration, with USERS → CONC normalization:
- benchmarks/single_node/agentic/gptoss_fp4_h100.sh
- benchmarks/single_node/agentic/gptoss_fp4_h200.sh
- benchmarks/single_node/agentic/gptoss_fp4_mi300x.sh
- benchmarks/single_node/agentic/gptoss_fp4_mi325x.sh
- benchmarks/single_node/agentic/kimik2.5_fp4_b200.sh

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings utils/agentic-benchmark/analysis/ (plot_pareto.py — sweep
visualizer for cross-config performance comparison) and updates
requirements.txt with transformers/xlsxwriter/tqdm/datasets/tiktoken
needed by the analyzer + by trace-replay's tokenizer paths.

The bench/ directory is intentionally NOT added: bench/metrics_collector.py
duplicated utils/trace-replay/server_metrics.py and was already removed
on this branch; bench/run_metrics_collector.py depends on it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds agentic-coding scenario blocks to the master configs for the
five models whose launchers were just brought over:
- kimik2.5-fp4-b200-vllm (image bumped to v0.19.1)
- gptoss-fp4-h100-vllm
- gptoss-fp4-h200-vllm
- gptoss-fp4-mi300x-vllm
- gptoss-fp4-mi325x-vllm

Each scenario sweeps tp 4/8 (and 1/2 on AMD/H200) at offloading=none for
low/mid concurrency and offloading=cpu for high concurrency, with a
crossover at conc=64. Other agentic-coding sections present on
chore/agentx-integration (trtllm/srt-slurm based) are left for follow-up
since several of the underlying model entries were restructured by main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The agentic-coding scenario type uses benchmarks/single_node/agentic/
launchers, gated by SCENARIO_SUBDIR='agentic/' from benchmark-tmpl.yml.
b200-cw, b200-dgxc, b200-nb, and b300-nv all built BENCH_BASE without
honoring SCENARIO_SUBDIR, so dispatch always landed in single_node/
even for agentic runs. Other runners (h100-*, h200-*, mi*) already had
this plumbing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…H200

- minimaxm2.5-fp8-b200-vllm
- qwen3.5-bf16-b200-sglang
- glm5-fp8-b200-sglang
- dsv4-fp8-h200-vllm

Each launcher mirrors its fixed-seq-len sibling but: uses CONC env for
max-num-seqs / cuda-graph-max-bs, sources benchmark_lib.sh, calls the
trace replayer via build_replay_cmd, and emits the agentic result JSON.
Master config gets an agentic-coding scenario block sweeping conc 1..32
at offloading=none; b200-dsv4 entries left untouched since that runner
type isn't registered in runners.yaml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- minimaxm2.5-fp8-mi355x-vllm
- qwen3.5-fp8-mi355x-sglang
- glm5.1-fp4-mi355x-sglang
- kimik2.5-fp4-mi355x-vllm

Each mirrors its fixed-seq-len sibling with ROCm-specific tweaks
(VLLM_ROCM_USE_AITER, ROCM_QUICK_REDUCE_QUANTIZATION, etc.) and feeds
CONC into max-num-seqs / cuda-graph-max-bs. Master configs gain matching
agentic-coding scenarios sweeping conc 1..32 at offloading=none.

dsv4-fp8-mi355x is intentionally skipped since the existing fixed-seq
launcher requires a bespoke vLLM PR rebuild that adds risk to
trace-replayer testing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…5-fp4

Phase-2 coverage extension across precision (int4 vs fp4 for kimi,
fp4 vs fp8 for minimax) and runner (b200 vs h100/h200 for gptoss).

- gptoss-fp4-b200-vllm
- kimik2.5-int4-b200-vllm
- minimaxm2.5-fp4-b200-vllm

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bf16 image lmsysorg/sglang:nightly-dev-20260216-d3bae71e fails on
B200 with PyTorch/CuDNN compatibility errors at server start. Add an
fp8 variant using lmsysorg/sglang:v0.5.9-cu130-amd64 to provide a
working qwen3.5 trace-replayer test on NVIDIA.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the launcher matrix at benchmarks/single_node/agentic/, how to
dispatch debug runs via gh workflow run, and what fields in the result
JSON to inspect for verification (num_requests_successful,
total_generation_tokens, median_ttft, median_tpot, total_tput_tps, etc.).

Notes the two known-failing configs (qwen3.5 sglang on B200 — pytorch/
pytorch#168167; dsv4-fp4-b200-sglang — runner b200-dsv4 not in
runners.yaml) so future testers don't repeat them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
15 debug runs across 7 model families × NVIDIA/AMD HW. 10 PASS / 5 FAIL
(1 still in flight); failures are all image- or vLLM-parser-level, not
replayer bugs. Replayer's per-model delta-field routing + long-prefill
agentic flow verified end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All 16 dispatched runs are now complete. Final tally: 10 PASS, 6 FAIL.
The 6 failures are all infrastructure or vLLM-side issues (PyTorch/CuDNN
image incompatibility, vLLM deepseek_v4 reasoning parser bug, sglang-rocm
qwen3.5 streaming, SLURM time limit) — none indicate a bug in the trace
replayer itself. All 7 active model families have at least one PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The exp-name template emits offload{none|cpu|ssd} (per the matrix
generator's f"{model_code}_tp{tp}_conc{conc}_offload{offloading}"),
but the regex was looking for offload(on|off) — so every artifact
directory failed to parse, the aggregator wrote nothing to aggregated/,
and collect-agentic-results uploaded no files ("No files were found
with the provided path: aggregated/").

Verified the fix matches real artifact names from this branch's runs
(b200/h100, none/cpu).
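
A sketch of the mismatch, assuming artifact names follow the exp-name template quoted above exactly. The pattern and `parse_exp_name` are hypothetical and are not the regex in collect_sweep_results.py.

```python
import re
from typing import Optional

# Matches f"{model_code}_tp{tp}_conc{conc}_offload{offloading}".
EXP_NAME_RE = re.compile(
    r"^(?P<model>.+)_tp(?P<tp>\d+)_conc(?P<conc>\d+)"
    r"_offload(?P<offloading>none|cpu|ssd)$"
)

# The shape the old regex expected, per the commit message.
OLD_RE = re.compile(r"offload(on|off)")


def parse_exp_name(name: str) -> Optional[dict]:
    """Split an exp-name into its fields, or return None if it does not match."""
    m = EXP_NAME_RE.match(name)
    if m is None:
        return None
    return {
        "model": m["model"],
        "tp": int(m["tp"]),
        "conc": int(m["conc"]),
        "offloading": m["offloading"],
    }
```

With names like `kimik2.5_tp8_conc64_offloadnone`, `OLD_RE` finds nothing because `offload` is never followed by `on` or `off`, which is consistent with the empty `aggregated/` directory described above.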

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For the 6 vllm configs (kimik2.5-fp4/int4-b200, minimaxm2.5-fp8-b200,
gptoss-fp4-b200, kimik2.5-fp4-mi355x, minimaxm2.5-fp8-mi355x): add
offloading=cpu at high concurrency (typically conc 64+) where KV cache
pressure exceeds GPU HBM. Overlap at conc=64 between none and cpu so
the crossover region is sampled by both. cpu-offload sweep tail uses
larger conc points (96, 128, 192, 256) since the only reason to enable
cpu offload is when concurrency stresses HBM.

For glm5-fp8-b200-sglang and glm5.1-fp4-mi355x-sglang (sglang launchers
without the OFFLOADING=cpu plumbing): expand the conc range on
offloading=none. sglang manages its own KV eviction via the radix
cache, so concurrency above HBM capacity is handled internally rather
than via vLLM's --kv_offloading_backend.

dsr1-fp4-{b200,mi355x}-sglang sweeps already cover conc 1..256 (b200
also has tp=4 ep=4 / tp=8 ep=8 split and tp=8 going to conc=512), so
left as-is.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both nodes are currently dropping every job that lands on them:
- NCCL barrier dies during sglang Scheduler.init_model_worker with
  RuntimeError: NCCL error: unhandled cuda error  (stale CUDA contexts
  from a previous job that didn't tear down cleanly)
- HuggingFace CAS download for moonshotai/Kimi-K2.5 fails with
  RuntimeError: Data processing error: CAS service error : IO Error:
  No space left on device (os error 28)

Adding --exclude=gpu-10,gpu-15 to salloc keeps SLURM from allocating to
them. Drop this once sa-shared admins clean up the nodes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vLLM's OffloadingConnector (--kv_offloading_backend native) is incompatible
with the hybrid-KV-cache-manager (HMA) for models with mixed attention
layouts. When HMA is enabled, the OffloadingConnector init fails with:

  RuntimeError: Worker failed with error 'Connector OffloadingConnector
  does not support HMA but HMA is enabled.
  Please set --disable-hybrid-kv-cache-manager'.

This bit kimik2.5-fp4-mi355x's full sweep: every offload=cpu sub-job
failed with the above error while every offload=none sub-job passed
(see run 25117841192). Kimi-K2.5 uses hybrid attention so HMA kicks in.
MiniMax-M2.5 doesn't, which is why its prior cpu-offload sweeps passed
even with the broken flag.

Switching all 11 cpu-offload launchers to --disable-hybrid-kv-cache-manager
is correctness-safe across the board: HMA is a pure optimization, and
disabling it is required for OffloadingConnector regardless of model.
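
One way to picture the launcher change, written in Python for brevity (the launchers themselves are shell scripts). Both flags are the ones quoted in this message; the function name, the mode strings, and the idea of a single shared helper are assumptions.

```python
from typing import List


def kv_offload_args(offloading: str) -> List[str]:
    """Extra vLLM server args for a given offloading mode (illustrative only)."""
    if offloading == "none":
        return []
    if offloading == "cpu":
        return [
            "--kv_offloading_backend", "native",
            # OffloadingConnector refuses to start while HMA is enabled.
            "--disable-hybrid-kv-cache-manager",
        ]
    # Other modes are not described in this message, so do not guess flags.
    raise ValueError(f"unsupported offloading mode: {offloading!r}")
```

Keeping the two flags in one place means a model with hybrid attention cannot pick up the connector without also disabling HMA.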

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nfigs

KV offloading via OffloadingConnector hits multiple upstream bugs on
older vllm tags:
  - v0.15.1 (gpt-oss-fp4-b200, kimi-int4-b200): flashinfer kv_cache_permute
    assertion in TRTLLM-attention path
  - v0.18.0-rocm (kimi-fp4-mi355x): HMA + OffloadingConnector incompat
  - v0.19.0 (minimaxm2.5-fp8 b200/mi355x): not yet verified clean

Bumping to v0.19.1 (or v0.19.1-rocm) — proven-good on kimi-fp4-b200
(23/23 sweep PASS) and gptoss-fp4 h100/h200/mi300x/mi325x.
Add agentic-coding sections + launchers for MiniMax-M2.5 FP8 across
H100, H200, B200, B300, MI300X, MI355X (excluding MI325X). Conc ranges
sized from per-SKU GPU KV cache capacity:

  KV per token (fp8, 62 layers × 8 KV heads × 128 dim × 2): ~124 KB
  Per-SKU GPU cache cap with tp=4 + 0.90 mem-util:
    H100   58 GB  -> 0.46M tok  (saturate ~conc 6)
    H200  277 GB  -> 2.19M tok  (saturate ~conc 29)
    B200  461 GB  -> 3.63M tok  (saturate ~conc 48)
    B300  807 GB  -> 6.35M tok  (saturate ~conc 85)
    MI300X 500 GB -> 3.93M tok  (saturate ~conc 52)
    MI355X 864 GB -> 6.81M tok  (saturate ~conc 91)
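
The arithmetic behind the table can be sketched as follows. Layer count, KV heads, head dim, and the per-SKU cache sizes are taken from the lines above; treating GB as 10^9 bytes and using about 75k cached tokens per concurrent user are assumptions back-derived from the table, not figures stated in the commit. The output tracks the table to within rounding (a few rows differ by one in the last digit).

```python
LAYERS, KV_HEADS, HEAD_DIM = 62, 8, 128
BYTES_PER_ELEM = 1  # fp8 KV cache


def kv_bytes_per_token() -> int:
    """Bytes of KV cache per token: K and V for every layer and KV head."""
    return LAYERS * KV_HEADS * HEAD_DIM * 2 * BYTES_PER_ELEM


def cache_tokens(cache_gb: float) -> float:
    """Tokens that fit in cache_gb of KV cache (GB taken as 1e9 bytes)."""
    return cache_gb * 1e9 / kv_bytes_per_token()


def saturation_conc(cache_gb: float, tokens_per_user: int = 75_000) -> int:
    """Concurrency at which the cache fills, for an assumed per-user footprint."""
    return int(cache_tokens(cache_gb) // tokens_per_user)


for sku, gb in [("H100", 58), ("H200", 277), ("B200", 461)]:
    print(f"{sku}: {cache_tokens(gb) / 1e6:.2f}M tok, ~conc {saturation_conc(gb)}")
```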

NVIDIA configs include offload=cpu starting at the saturation point
(simple cpu offload via OffloadingConnector requires vllm ≥ 0.19.1).
AMD configs do not enable cpu offload — vllm simple offloading isn't
supported on the rocm build for these models. AMD pushes offload=none
to a higher conc to demonstrate where GPU cache saturates.

Image bumps: h100/h200/mi300x v0.18.0/v0.16.0 -> v0.19.1; b300
v0.19.0-cu130 -> v0.19.1.
vllm v0.19.1 fp8 quantization rejects tp=8 for MiniMax-M2.5: gate/up
weight output_size 1536 / tp=8 = 192, not divisible by block_n=128.
Same constraint at vllm/model_executor/layers/quantization/fp8.py:638.
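
The divisibility check described above, as a standalone sketch. The numbers (output_size 1536, block_n 128) are from this message; the function name is made up and this is not the vLLM code at the cited line.

```python
def fp8_block_shard_ok(output_size: int, tp: int, block_n: int = 128) -> bool:
    """True if each tensor-parallel shard is a whole number of fp8 blocks."""
    if output_size % tp != 0:
        return False
    return (output_size // tp) % block_n == 0


for tp in (2, 4, 8):
    shard = 1536 // tp
    print(f"tp={tp}: shard={shard}, ok={fp8_block_shard_ok(1536, tp)}")
```

At tp=8 the shard is 192 wide, which leaves a remainder of 64 against block_n=128; at tp=4 the shard is 384, exactly three blocks.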

Per fixed-seq-len reference TPs:
  H100   tp=4 ep=4 (tp=8 ep=8 commented out in fixed-seq-len for fp8)
  H200   fixed-seq-len has only tp=8 (broken on v0.19.1 fp8); winging tp=4
  B200   tp=4 (fixed-seq-len has tp=2,4; tp=2 too tight for agentic ISL)
  B300   tp=4 (primary; fixed-seq-len has tp=1,2,4 with various ep)
  MI300X tp=4 (fixed-seq-len has tp=2,4)
  MI355X tp=4 ep=4 (fixed-seq-len has tp=2 ep=2, tp=4 ep=4, tp=8 ep=8)

Concurrency expanded across the saturation cliff for each SKU; cpu
offload range extended to 384/512 on NVIDIA where applicable.
Per empirical compute ceilings observed in prior runs (mean in-flight reqs
mid-test on each platform):

  H100   tp=4 ep=4  ceiling ~10  (KV cliff ~6   -> cpu zone 6-10)
  H200   tp=4       ceiling ~35  (KV cliff ~29  -> cpu zone 29-35)
  B200   tp=4       ceiling ~50  (KV cliff ~48  -> very narrow)
  B300   tp=4       ceiling ~60  (KV cliff ~85  -> compute saturates first)
  MI300X tp=4       ceiling ~20  (estimated)
  MI355X tp=4 ep=4  ceiling ~60

Previous conc lists (1..256, even up to 512) wasted 30-min slots on
sub-jobs that just queue 200+ requests waiting on a server only running
4-50 in flight, leading to client-side 600s timeout cascades. New lists
"creep up" to 2-3x the ceiling, then stop.

NVIDIA cpu offload range narrowed to the zone between KV cliff and
compute ceiling, where offloading can actually relieve KV pressure
without compute already being the bottleneck.

AMD (mi300x, mi355x) keeps offload=none only.
Per user feedback: past the compute ceiling, throughput plateaus and
extra conc just adds queue depth and client timeouts -- wasted slots.
Reallocate sampling budget to densify around the cliff(s) for each SKU.

Per-SKU strategy (compute ceiling empirical, KV cliff analytical):

  H100   tp=4 ep=4  ceil 10  KV 6   -> dense 4-12  (sweet spot for cpu demo)
  H200   tp=4       ceil 35  KV 29  -> dense 24-40 (narrow cpu window)
  B200   tp=4       ceil 50  KV 48  -> dense 32-56 (cliffs colocated)
  B300   tp=4       ceil 60  KV 85  -> dense 48-72 (compute first; cpu won't help)
  MI300X tp=4       ceil 25  KV 52  -> dense 16-32 (compute first; AMD no cpu)
  MI355X tp=4 ep=4  ceil 60  KV 91  -> dense 48-72 (compute first; AMD no cpu)

Dense step (every 4-8 conc) around the cliffs to resolve the inflection;
sparse step (doubling) below the cliffs for baseline; one point ~1.3-1.5x
ceiling to confirm plateau.
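
A sketch of the sampling rule just described: doubling below the dense window, a fixed step inside it, and one point past the ceiling. The function, its parameters, and the 1.4x plateau factor are illustrative choices within the stated 1.3-1.5x range; the lists it produces are not the lists committed to the master configs.

```python
from typing import List


def build_conc_list(dense_lo: int, dense_hi: int, dense_step: int,
                    ceiling: int, plateau_factor: float = 1.4) -> List[int]:
    """Sparse doubling up to dense_lo, dense steps to dense_hi, one plateau point."""
    points: List[int] = []
    c = 1
    while c < dense_lo:          # sparse baseline: 1, 2, 4, ...
        points.append(c)
        c *= 2
    points.extend(range(dense_lo, dense_hi + 1, dense_step))  # dense window
    plateau = round(ceiling * plateau_factor)
    if not points or plateau > points[-1]:
        points.append(plateau)   # single point past the ceiling
    return points


print(build_conc_list(dense_lo=4, dense_hi=12, dense_step=4, ceiling=10))
```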

NVIDIA cpu offload range overlaps with none from KV cliff to ~ceiling
for direct same-conc comparison; doesn't extend past 1.3x ceiling.
- AMD launchers (mi300x, mi355x) drop VLLM_USE_SIMPLE_KV_OFFLOAD env
  var. SimpleCPUOffloadConnector isn't supported on rocm; native
  OffloadingConnector works (still passes --kv_offloading_backend
  native flag).
- Add cpu offload entries to AMD master configs (mi300x, mi355x).
- Add b300-p1 runner group (subset of b300 nodes 13-17 with the
  b300-p1 label) and target it from the b300 minimax config.
The agentic-coding benchmark IS a prefix-cache benchmark — the whole
point is measuring KV reuse across multi-turn conversations and
across users (with the per-user salt enabling deterministic prefix
overlap). Disabling prefix caching defeats the entire purpose.

Removed from 7 launchers that had it:
  dsv4_fp8_h200.sh
  gptoss_fp4_b200.sh (was in config.yaml)
  kimik2.5_fp4_mi355x.sh
  kimik2.5_int4_b200.sh
  minimaxm2.5_fp4_b200.sh
  minimaxm2.5_fp8_mi300x.sh
  minimaxm2.5_fp8_mi355x.sh

vLLM defaults to prefix caching ON when no flag is passed.
ROCM_AITER_FA was the suspect for both:
1. Worker dies on cpu offload (gpt-oss using UNIFIED_ATTN works fine
   on the same launcher pattern + image)
2. Prefix-cache Prometheus counters never increment (observability gap
   on FA backend, while UNIFIED_ATTN reports correctly on mi300x)

Swap to ROCM_AITER_UNIFIED_ATTN to test both fixes in one shot.
The cpu range needs full overlap with none past the KV cliff so the
no-offload throughput collapse is visible at the same conc points
where cpu offload sustains throughput.

B200 tp=4 (KV cliff conc=48):
  none: [1,2,4,8,16,32,48,56,64,96,128]   (was capped at 64)
  cpu:  [48,56,64,96,128]                  (was capped at 64)

B300 tp=4 (KV cliff conc=85):
  none: [1,2,4,8,16,32,48,64,96,128,192]  (was capped at 96)
  cpu:  [48,64,96,128,192]                 (was capped at 96)

Past the cliff, the no-offload curve should collapse (recompute storm,
client-side timeouts), while cpu-offload sustains the compute ceiling.
cquil11 added 10 commits May 15, 2026 12:44
- TP=8 none: [1, 2, 4, 8, 16, 24, 32, 40, 48] (unchanged baseline)
- TP=8 cpu:  [32, 40, 48, 56] (was [1..48])

Lower concurrencies fit entirely on-GPU at MI355X's 288 GB HBM; running
cpu offload at conc<32 just adds the offload-path overhead without
measuring anything new. Restrict cpu to the cliff region where it
actually matters, and probe one step past the prior cap with conc=56.
Final per-metric stat set is now mean / p75 / p90 / p95 / std (was
mean / median / p90 / p99 / p99.9 / std). Applied across:

- utils/process_agentic_result.py: stats_for(), QPS aggregator,
  input/output_tokens, output_tokens_expected
- utils/summarize.py: single-node and multi-node CSV column headers
  and row formatters
- utils/test_process_agentic_result.py: SUMMARIZE_KEYS contract

Rationale: p99/p99.9 were dominated by trace-end stragglers and
weren't useful operational signal at the concurrency we sweep; p75
captures the tail where the agentic workload actually starts
diverging from the median, and p95 is the standard 'tail-but-not-
catastrophe' percentile that fits between p90 and the dropped p99.
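
For reference, the new stat set can be computed with the standard library alone. This is a sketch, not the body of `stats_for()`: the interpolation method (inclusive) and the use of population standard deviation are assumptions, and the real aggregator may differ on both.

```python
import statistics
from typing import Dict, Sequence


def stat_set(values: Sequence[float]) -> Dict[str, float]:
    """Return mean / p75 / p90 / p95 / std for a list of samples."""
    if len(values) < 2:
        raise ValueError("need at least two samples")
    # 99 cut points; index i-1 holds the i-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(values),
        "p75": cuts[74],
        "p90": cuts[89],
        "p95": cuts[94],
        "std": statistics.pstdev(values),
    }
```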
Picks up cquil11/aiperf@4efdd6e8 "[RecordProcessor] Drop context-overflow
records for AGENTIC_REPLAY scenarios": context-overflow errors mid-
trajectory are already handled by agentic_replay.handle_credit_return
(recycles the conversation, spawns a fresh trajectory), so the parser-
classified context_overflow records were being double-counted as both
end-of-trajectory signals AND error metrics. Now they're dropped at the
record_processor_service layer before the MetricRecordsMessage push --
no contribution to failure totals, no entry in profile_export.jsonl, no
tick on error counters. Existing ContextOverflowCountMetric continues
to work outside AGENTIC_REPLAY scenarios for diagnostic purposes.

Effect on Kimi agentic results in this repo: the "errors=N" line in the
per-job logs and the failure column in aggregated CSVs will only count
real failures (server 5xx, parse errors, malformed responses), not the
expected end-of-trajectory context-overflow events.
Picks up cquil11/aiperf@8f41bc7b. New --failed-request-threshold flag
(float in [0,1], default None=disabled) on the agentic-benchmark
profile entrypoint. When PROFILING-phase error_records/total_records
exceeds the threshold after a grace floor of max(concurrency, 10)
records, RecordsManager broadcasts ProfileCancelCommand on the message
bus, the timing/server-metrics/gpu-telemetry managers tear down their
work, and the run exits non-zero via the existing cancel path.

Composes cleanly with the prior context-overflow drop: in AGENTIC_REPLAY
scenarios, context-overflow events are excluded from error_records, so
the threshold measures only real failures (server 5xx, parse errors,
malformed responses) and won't trip on the expected end-of-trajectory
overflow signal.

Usage example:
  aiperf profile ... --failed-request-threshold 0.05
(abort if >5% real failures after grace floor)
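
The abort rule reads roughly like the sketch below. The names are hypothetical and this is not RecordsManager's code. Whether the grace floor is inclusive, and whether the comparison is strict, are assumptions taken from the wording "exceeds the threshold after a grace floor".

```python
from typing import Optional


def should_cancel(error_records: int, total_records: int,
                  concurrency: int, threshold: Optional[float] = None) -> bool:
    """Decide whether the in-flight error rate warrants cancelling the run."""
    if threshold is None:                 # default: feature disabled
        return False
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0, 1]")
    grace_floor = max(concurrency, 10)    # do not judge on too few records
    if total_records < grace_floor:
        return False
    return error_records / total_records > threshold
```

The grace floor is what keeps a single early failure at low concurrency (1 error in 5 records is 20%) from killing a run that would otherwise settle well under 5%.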
aiperf submodule pointer -> 343f33c6 picks up:
- [LoadGen] --failed-request-threshold (in-flight abort; already wired via
  earlier ee76801 bump)
- [AgenticReplay] --trajectory-start-min-ratio / --trajectory-start-max-ratio
  (configurable replacement for the previously hardcoded 0%-70% k_i range)
- [AgenticReplay] per-trajectory warmup completion log lines (start_turn,
  trace_id, status)

benchmark_lib.sh wires three new aiperf flags into build_replay_cmd for
all agentic launchers:
- --failed-request-threshold 0.05  (kill run early if real-failure rate > 5%)
- --trajectory-start-min-ratio 0.25
- --trajectory-start-max-ratio 0.75 (sample k_i from 25%-75% of the trace)
Picks up cquil11/aiperf@fccb8471. TrajectorySource now emits a one-block
info line right after building the trajectory list, showing per-lane
(k_i, num_turns, pct) plus configured/observed range summary. Lets you
verify that --trajectory-start-{min,max}-ratio produced the expected
distribution before any requests fire, no need to wait for warmup
completion lines.
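
To illustrate what "seed-deterministic per (random_seed, trace_id)" sampling of k_i within a ratio range could look like, here is a sketch. Hashing the pair into a private RNG, drawing uniformly, and clamping to the last turn are all assumptions; aiperf's actual derivation is not shown in this PR.

```python
import hashlib
import random


def sample_start_turn(random_seed: int, trace_id: str, num_turns: int,
                      min_ratio: float = 0.25, max_ratio: float = 0.75) -> int:
    """Pick a start turn k_i in [min_ratio, max_ratio] of the trace, reproducibly."""
    if num_turns < 1:
        raise ValueError("trace must have at least one turn")
    if not 0.0 <= min_ratio <= max_ratio <= 1.0:
        raise ValueError("need 0 <= min_ratio <= max_ratio <= 1")
    # Derive a per-trace seed so lanes do not depend on iteration order.
    digest = hashlib.sha256(f"{random_seed}:{trace_id}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    ratio = rng.uniform(min_ratio, max_ratio)
    return min(int(ratio * num_turns), num_turns - 1)
```

Because the RNG is keyed on the pair rather than shared, adding or removing a trace from the dataset does not shift the start turn of any other trace.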
Conflicts resolved:
- .github/configs/amd-master.yaml (dsv4-fp4-mi355x-atom): took main's
  simplified single-range conc form from PR #1311 (we had the older
  discrete-point version)
- .github/configs/nvidia-master.yaml (kimik2.5-int4-b200-vllm): kept our
  bump-rationale comment alongside main's v0.20.2 image (both sides
  agreed on the image, only the comment was new on ours)
- .github/configs/nvidia-master.yaml (minimaxm2.5-fp8-{h100,h200}-vllm):
  took main's v0.20.2 image bumps (we still had v0.19.1)

Cleanup:
- Drop our .gitignore additions (the 'scripts/debug_*.sh' line) per
  review feedback -- match main
- Drop docs/AGENTIC_TEST_COVERAGE.md and docs/AGENTIC_TEST_RESULTS.md
  (agent-generated planning slop, not load-bearing)
We don't need to plot any pareto frontiers from this repo -- aiperf
has its own plotting tutorial and any downstream visualization can
read the bmk_agentic JSON / aggregate exports directly.

Removed:
- utils/agentic-benchmark/scripts/plot_sweep_overview.py (v0.1 carryover)
- utils/agentic-benchmark/analysis/plot_pareto.py
- utils/generate_aiperf_plots.py (added earlier in this PR; not needed)
Picks up cquil11/aiperf@7d880a1e. The earlier context-overflow drop
(commit 4efdd6e8) broke the records-side <-> credit-side counter
invariant by returning early from _on_inference_results: records-side
total_records lagged credit-side final_requests_completed by one for
every overflow event, so the completion barrier at
records_tracker.py:144-147 never converged. End of every PROFILING
phase hung for the full benchmark_grace_period before timing out and
cancelling in-flight credits.

Fix preserves the original intent (context-overflow events stay out
of metrics) while keeping the invariant intact: overflow records flow
through normally but carry a context_overflow_skip flag on the
MetricRecordMetadata; RecordsManager counts them toward total_records
(classified as success so error counters stay at 0) but skips the
error tracker, accumulators, stream exporters, and the
--failed-request-threshold abort check.
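
A toy model of the fixed counting rule, to make the invariant concrete. The class and the `metric_records` field are hypothetical; only `total_records`, the error counter, and the `context_overflow_skip` flag correspond to names in this message.

```python
from dataclasses import dataclass


@dataclass
class RecordCounters:
    total_records: int = 0   # must track credit-side completions one-for-one
    error_records: int = 0   # feeds the failed-request threshold
    metric_records: int = 0  # records that reach accumulators and exporters

    def on_record(self, is_error: bool, context_overflow_skip: bool = False) -> None:
        """Count every record, but keep overflow records out of metrics and errors."""
        self.total_records += 1
        if context_overflow_skip:
            return
        self.metric_records += 1
        if is_error:
            self.error_records += 1
```

The earlier drop returned before the first increment, which is why `total_records` fell one behind for every overflow event and the completion barrier never converged.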
Per PR review feedback, this branch must not alter any fixed-seq-len
scenarios or non-agentic functionality from origin/main.

Restored to match origin/main exactly:
- amd-master.yaml: re-add qwen3.5-fp4-mi355x-atom + minimaxm2.5-fp4-mi355x-atom
  entries (both have only fixed-seq-len scenarios; were missing from our
  branch since the v0.1 merge)
- nvidia-master.yaml: replace dsv4-fp4-gb200-dynamo-vllm fixed-seq-len
  block with origin/main version (we had drifted to a 1k/1k extrapolated
  layout; main is the canonical 8k/1k Pareto-mirrored block)
- nvidia-master.yaml: kimik2.5-int4-h100-vllm new entry has agentic-coding
  only (no fixed-seq-len) to keep the fixed-seq-len surface identical to
  main
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Comment on lines +1033 to 1038

# Generate metrics_plots.png from the same aiperf artifacts. Best-effort:
# don't fail the launcher if plot generation has trouble (e.g. matplotlib
# missing in a stripped-down image). The agg JSON is the success gate.
python3 "$INFMAX_CONTAINER_WORKSPACE/utils/generate_aiperf_plots.py" "$result_dir" 2>&1 || true
}

🔴 The new write_agentic_result_json calls python3 $INFMAX_CONTAINER_WORKSPACE/utils/generate_aiperf_plots.py "$result_dir" to produce results/metrics_plots.png, and the workflow lists that PNG in the agentic artifact upload bundle — but utils/generate_aiperf_plots.py is not present anywhere in this PR or the repo. The call is wrapped in 2>&1 || true and the upload uses if-no-files-found: ignore, so the launcher and workflow appear green while every agentic run silently fails to emit the advertised plot. Either commit the missing script or remove the invocation at benchmarks/benchmark_lib.sh:1037 and the results/metrics_plots.png line at .github/workflows/benchmark-tmpl.yml:248.

Extended reasoning...

What the bug is

write_agentic_result_json in benchmarks/benchmark_lib.sh (lines 1026-1038) runs the post-run aggregator and then invokes a plotter:

python3 "$INFMAX_CONTAINER_WORKSPACE/utils/generate_aiperf_plots.py" "$result_dir" 2>&1 || true

This function is the final step of every agentic launcher under benchmarks/single_node/agentic/*.sh in this PR. Its output results/metrics_plots.png is then listed as one of the files in the agentic artifact upload at .github/workflows/benchmark-tmpl.yml:248. The comment block immediately above the plotter call advertises it as an intentional feature (Generate metrics_plots.png from the same aiperf artifacts), and the build_replay_cmd comment at lines 1008-1014 explicitly justifies the 1-second --slice-duration "so the post-run plotter has per-window time series … Without this, aiperf only emits aggregate stats and the 6x2 panels collapse to flat lines."

But utils/generate_aiperf_plots.py is not committed anywhere in the PR or the repo. A repo-wide search finds exactly one reference — the invocation itself — and no file matching **/generate_aiperf_plots*.

Why nothing currently fails

The call is wrapped in 2>&1 || true, so python: can't open file '.../utils/generate_aiperf_plots.py': [Errno 2] No such file or directory is captured into benchmark.log and the launcher's exit status stays clean. Separately, the workflow upload step uses if-no-files-found: ignore, so the missing results/metrics_plots.png is silently dropped from the uploaded bundle. There is no other check that would surface the missing artifact.

Step-by-step proof

  1. An agentic-coding job runs, e.g. benchmarks/single_node/agentic/dsv4_fp4_b200_vllm.sh.
  2. The script ends with write_agentic_result_json "$RESULT_DIR" (line 156 of the new launcher).
  3. write_agentic_result_json invokes python3 $INFMAX_CONTAINER_WORKSPACE/utils/generate_aiperf_plots.py "$result_dir".
  4. Python exits non-zero with No such file or directory because the script does not exist.
  5. || true swallows the exit code; only benchmark.log records the error.
  6. The aggregated *.json is created, so the workflow's retry-based existence check at .github/workflows/benchmark-tmpl.yml passes.
  7. The Upload agentic raw results step lists results/metrics_plots.png but the file does not exist; if-no-files-found: ignore causes the missing file to be silently skipped.
  8. The artifact bundle is published without the advertised PNG, and no warning surfaces in the run summary.

Impact

The metrics_plots.png artifact is advertised in the PR description (build_replay_cmd comment) and explicitly listed in the workflow upload, but it will never be produced for any agentic run. Every agentic benchmark.log will also carry a noisy python: can't open file … line, complicating future log triage. This is not a correctness bug for the agg JSON path (the success gate is the JSON, not the PNG), so the pipeline still goes green — but the promised per-window time-series visualization is missing from every run.

How to fix

Two options, either is sufficient:

  1. Commit utils/generate_aiperf_plots.py alongside this PR. The 1-second --slice-duration plumbing in build_replay_cmd and the workflow artifact reference were clearly added in anticipation of this file.
  2. Remove the plotter call from write_agentic_result_json (benchmarks/benchmark_lib.sh:1034-1037) and drop results/metrics_plots.png from .github/workflows/benchmark-tmpl.yml:248. The --slice-duration 1.0 flag in build_replay_cmd can also be removed if no other consumer needs the per-window timeslice JSON, but profile_export_aiperf_timeslices.{json,csv} are also in the upload bundle, so it may still be useful.

Option 1 matches the apparent intent (the launcher comment is written assuming the plotter exists). Option 2 is the safer "make the PR self-consistent" path.
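
Whichever option is chosen, the silent-failure mode described above could be surfaced by an explicit post-run check. The following is a minimal sketch, not code from this PR: `find_missing_artifacts` is a hypothetical helper, and the artifact names in the usage note are only the ones mentioned in this review.

```python
from pathlib import Path


def find_missing_artifacts(result_dir: Path, expected: list[str]) -> list[str]:
    """Return the expected artifact names that are absent or empty under result_dir.

    Hypothetical helper: the PR itself has no such check, which is why a
    missing plot goes unnoticed behind `|| true` and `if-no-files-found: ignore`.
    """
    missing = []
    for name in expected:
        path = result_dir / name
        # Treat zero-byte files as missing too; they usually mean a failed writer.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

A launcher could call this with something like `["profile_export.jsonl", "metrics_plots.png"]` and print a warning for each missing name, keeping plot generation best-effort while still making the gap visible in the log.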

# Trace metadata lookup: conversation_id (= trace id) -> per-turn dict with
# ``hash_ids`` and ``output_length``. Built lazily from the HF dataset cache.
_TRACE_METADATA_CACHE: dict[str, list[dict]] | None = None
_HF_DATASET = "semianalysisai/cc-traces-weka-042026"

🔴 utils/process_agentic_result.py:40 hardcodes _HF_DATASET = "semianalysisai/cc-traces-weka-042026" (the v0.1 name) while benchmarks/benchmark_lib.sh:908 in this same PR downloads semianalysisai/cc-traces-weka-no-subagents-051226 (v0.2, matching the PR description and the aiperf submodule loader). _hf_traces_dir() builds its lookup path from _HF_DATASET, so the production HF cache directory datasets--semianalysisai--cc-traces-weka-no-subagents-051226/snapshots/ is never found, _load_trace_metadata() returns {}, and every shipped agentic agg JSON has theoretical_cache_hit_rate=null with all output_tokens_expected stats missing — silently breaking the "theoretical cache-hit computed from trace metadata" feature this PR advertises. Fix is a one-line constant bump (and the matching test fixture path at utils/test_process_agentic_result.py:408); ideally derive the name from a shared constant / env var shared with resolve_trace_source.

Extended reasoning...

What goes wrong

Two locations in this PR disagree on the HF dataset name:

  • Producer: benchmarks/benchmark_lib.sh:908 (resolve_trace_source) calls hf download --repo-type dataset semianalysisai/cc-traces-weka-no-subagents-051226. This is the only producer of the local HF cache for the agentic path, and it matches the PR description ("v0.2: semianalysisai/cc-traces-weka-no-subagents-051226") and the aiperf submodule loader (semianalysis_cc_traces_weka is the stable alias for this dated revision).
  • Consumer: utils/process_agentic_result.py:40 hardcodes _HF_DATASET = "semianalysisai/cc-traces-weka-042026", the v0.1 name from before this PR.

_hf_traces_dir() (process_agentic_result.py:118–146) builds its lookup path from _HF_DATASET:

org, name = _HF_DATASET.split("/", 1)
snapshots = cache_root / f"datasets--{org}--{name}" / "snapshots"

So it searches $HF_HUB_CACHE/datasets--semianalysisai--cc-traces-weka-042026/snapshots/, which never exists in production. The function returns None, _load_trace_metadata() returns an empty dict, and:

  1. compute_cache_stats() (process_agentic_result.py ~395–432): the theoretical-cache-hit walk runs only if metadata is truthy. Empty dict ⇒ result["theoretical_cache_hit_rate"] stays None.
  2. compute_workload_stats() (process_agentic_result.py ~277–294): the mean/p75/p90/p95/std_output_tokens_expected block is gated by if metadata:. Empty dict ⇒ none of those keys are emitted.

Impact

Every agentic-coding result JSON shipped to the downstream aggregator has theoretical_cache_hit_rate: null and is missing all five output_tokens_expected stats — directly contradicting this PR's claim: "theoretical cache-hit computed from trace metadata." The per-launcher print(f" Theoretical cache hit rate: ...") at the end of process_agentic_result.py simply never prints in production. End-users consuming the schema downstream see silent data loss, not a crash.

Why CI doesn't catch this

The new unit test utils/test_process_agentic_result.py:408 (test_processor_loads_traces_jsonl_for_theoretical_cache) builds its fake HF cache snapshot under hf_cache / "datasets--semianalysisai--cc-traces-weka-042026" / "snapshots" / "abc" — mirroring the same stale name the production code reads. The test passes because both sides agree on the wrong name. Nothing else exercises _HF_DATASET against the real dataset, and test_processor_emits_required_summarize_keys (line 264) doesn't include the optional theoretical_cache_hit_rate / output_tokens_expected keys in its required-key set, so their absence isn't flagged either.

Step-by-step proof

  1. A B200 job runs benchmarks/single_node/agentic/kimik2.5_fp4_b200.sh, which calls resolve_trace_source (benchmark_lib.sh:907–916).
  2. That downloads semianalysisai/cc-traces-weka-no-subagents-051226 into $HF_HUB_CACHE. After download, $HF_HUB_CACHE contains datasets--semianalysisai--cc-traces-weka-no-subagents-051226/snapshots/<rev>/traces.jsonl (the HF hub layout).
  3. The launcher runs aiperf, which writes results/trace_replay/profile_export.jsonl etc.
  4. write_agentic_result_json invokes python3 utils/process_agentic_result.py.
  5. _load_trace_metadata() calls _hf_traces_dir(), which computes snapshots = $HF_HUB_CACHE/datasets--semianalysisai--cc-traces-weka-042026/snapshots from _HF_DATASET. snapshots.is_dir() is False (the cache directory uses the real dataset name -no-subagents-051226). Returns None.
  6. _load_trace_metadata() returns {}. _TRACE_METADATA_CACHE = {}.
  7. In compute_cache_stats(), if metadata: is False ⇒ theoretical_cache_hit_rate stays None.
  8. In compute_workload_stats(), the trace-metadata block at the bottom of the function never executes ⇒ no *_output_tokens_expected keys are written.
  9. Aggregator emits agg_*.json with "theoretical_cache_hit_rate": null and no expected-output stats. The per-launcher print(" Theoretical cache hit rate: ...") is gated on agg.get("theoretical_cache_hit_rate") is not None, so it never prints.

How to fix

Cheapest fix is a one-line constant bump in utils/process_agentic_result.py:

_HF_DATASET = "semianalysisai/cc-traces-weka-no-subagents-051226"

Plus the matching path in utils/test_process_agentic_result.py:408 (hf_cache / "datasets--semianalysisai--cc-traces-weka-no-subagents-051226" / "snapshots" / "abc"). Better: read the dataset name from a shared module-level constant (or env var) that both resolve_trace_source and process_agentic_result.py consume, so future dataset bumps can't desync again. As a regression guard, add an assertion in the test that the constant equals what resolve_trace_source actually downloads — or simply have the test set _HF_DATASET via the env-based shared constant rather than hardcoding the path.
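
The shared-constant idea can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: the environment variable name `AGENTIC_HF_DATASET` and both function names are hypothetical, while the `datasets--{org}--{name}/snapshots` layout is the one quoted from `_hf_traces_dir()` above.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Hypothetical env var name; the PR does not define one.
DATASET_ENV_VAR = "AGENTIC_HF_DATASET"
# Dataset name taken from the PR description (v0.2).
DEFAULT_DATASET = "semianalysisai/cc-traces-weka-no-subagents-051226"


def dataset_name(env: Optional[Mapping[str, str]] = None) -> str:
    """Resolve the dataset name once, so producer and consumer cannot drift."""
    env = os.environ if env is None else env
    return env.get(DATASET_ENV_VAR) or DEFAULT_DATASET


def snapshots_dir(cache_root: Path, dataset: str) -> Path:
    """Build the HF hub cache path the same way _hf_traces_dir() does."""
    org, name = dataset.split("/", 1)
    return cache_root / f"datasets--{org}--{name}" / "snapshots"
```

If the shell side exported the same variable before calling `hf download`, a dataset bump would become a single edit, and a test could build its fake cache via `snapshots_dir(tmp, dataset_name())` instead of hardcoding the directory name.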

Comment on lines 540 to +555

is_multinode = env_bool('IS_MULTINODE')
tp = env_int('TP', 1)
ep = env_int('EP_SIZE', 1)
dp_attention = os.environ.get('DP_ATTENTION', 'false')
def build_agg(
records: list[dict],
aggregate: dict,
server_metrics: dict,
) -> dict:
"""Compose the agg_*.json body from the three aiperf inputs."""
is_multinode = env_bool("IS_MULTINODE")
tp = env_int("TP", 1)
ep = env_int("EP_SIZE", 1)
dp_attention = os.environ.get("DP_ATTENTION", "false")
num_gpus = tp

if is_multinode:
prefill_num_workers = env_int('PREFILL_NUM_WORKERS')
prefill_tp = env_int('PREFILL_TP')
prefill_ep = env_int('PREFILL_EP', 1)
prefill_dp_attention = os.environ.get('PREFILL_DP_ATTN', 'false')
decode_num_workers = env_int('DECODE_NUM_WORKERS')
decode_tp = env_int('DECODE_TP')
decode_ep = env_int('DECODE_EP', 1)
decode_dp_attention = os.environ.get('DECODE_DP_ATTN', 'false')
prefill_num_workers = env_int("PREFILL_NUM_WORKERS")
prefill_tp = env_int("PREFILL_TP")

🟡 build_agg() in utils/process_agentic_result.py sets both "num_requests_total" and "num_requests_successful" to len(records), but load_records() above explicitly filters out error records via 'if obj.get("error"): continue'. The two fields are therefore always equal — any downstream consumer computing a failure rate as 1 - successful/total will see 0% even when aiperf actually had errored requests. Fix is small: count error records during load_records and surface the count, or pull the true total from the aiperf aggregate JSON (request_count is already loaded but unused for these keys).

Extended reasoning...

What the bug is

utils/process_agentic_result.py rewrites the legacy agg JSON for the aiperf-based agentic pipeline. The legacy CSV-based implementation correctly distinguished the two fields:

# legacy (removed in this PR)
"num_requests_total": len(rows),         # all CSV rows
"num_requests_successful": len(successful),  # filtered by success == 'True'

The new code sets both to the same value:

# utils/process_agentic_result.py (build_agg, ~lines 601-603)
"num_requests_total": len(records),
"num_requests_successful": len(records),

But records is the output of load_records() (lines 93–105), which explicitly drops error rows:

obj = json.loads(line)
if obj.get("error"):
    continue
records.append(obj)

So records only contains successful requests; the two emitted fields are mathematically identical for every run.

Step-by-step proof

  1. aiperf writes profile_export.jsonl with one line per request. Failed requests carry an error key (verified in aiperf source).
  2. load_records() reads that file and skips any line where obj.get("error") is truthy.
  3. build_agg() then does "num_requests_total": len(records) and "num_requests_successful": len(records) — same records list.
  4. Imagine a run with 950 records emitted by aiperf, 50 of which carry "error": "...". load_records() returns a list of 900 entries. The agg JSON now claims num_requests_total = 900 and num_requests_successful = 900, so any consumer computing 1 - successful/total reads 0% failure rate even though aiperf saw 50 failures.

Why existing code doesn't prevent it

aiperf's --failed-request-threshold 0.05 aborts the run upstream above 5% error rate, but that's a coarse gate — non-zero error counts below 5% still pass through silently, and the labels still lie about the true total. The aggregate JSON (profile_export_aiperf.json) IS loaded by this script and contains request_count (the true total including errors) but it is never consulted for these two keys.

Impact

In-repo blast radius is bounded: utils/summarize.py headers don't read these fields, and the only consumer asserting on them is utils/test_process_agentic_result.py:491. The two fields nominally exist to track distinct quantities — that's why the pre-PR code emitted them separately — so any external dashboard / script (downstream Pareto analysis, the team's spreadsheets, etc.) computing failure rate from the emitted schema will see 0% regardless of actual aiperf error count. This is a silent schema regression.

How to fix

Two clean options:

  1. Count errors during loading. Change load_records to return (records, error_count) (or a second pass), then in build_agg:

    "num_requests_total": len(records) + error_count,
    "num_requests_successful": len(records),
  2. Pull from the aggregate. aggregate["request_count"] is already loaded:

    "num_requests_total": aggregate.get("request_count", len(records)),
    "num_requests_successful": len(records),

Either restores the original semantics of the two-field design.
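
Option 1 might look roughly like the sketch below. It is self-contained for illustration: `split_records` and `request_counts` are hypothetical names (the real function is `load_records`, which reads from a file), and the only behavior assumed from aiperf is the one stated above, namely that failed requests carry a truthy `error` key.

```python
import json
from typing import Iterable


def split_records(lines: Iterable[str]) -> tuple[list[dict], int]:
    """Parse JSONL lines, returning (successful records, number of error records)."""
    records: list[dict] = []
    error_count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        if obj.get("error"):
            # Count the failure instead of dropping it silently.
            error_count += 1
            continue
        records.append(obj)
    return records, error_count


def request_counts(records: list[dict], error_count: int) -> dict:
    """Emit the two schema keys with distinct meanings again."""
    return {
        "num_requests_total": len(records) + error_count,
        "num_requests_successful": len(records),
    }
```

With the numbers from the proof above (900 successful, 50 errored), this yields a total of 950 and a failure rate of 50/950 rather than 0%.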

Comment on lines +451 to +477
    def _final_value(metric_name: str) -> float | None:
        entry = metrics_by_name.get(metric_name)
        if not isinstance(entry, dict):
            return None
        series = entry.get("series") or []
        if not isinstance(series, list):
            return None
        for stats_key in ("total", "max", "avg"):
            agg = 0.0
            found = False
            for s in series:
                if not isinstance(s, dict):
                    continue
                stats = s.get("stats")
                if not isinstance(stats, dict):
                    continue
                v = stats.get(stats_key)
                if v is None:
                    continue
                try:
                    agg += float(v)
                    found = True
                except (TypeError, ValueError):
                    continue
            if found:
                return agg
        return None

🟡 compute_cache_stats._final_value() sums values across all per-engine metric series, which is correct for counters (vllm:prefix_cache_hits/queries, vllm:prompt_tokens, vllm:kv_offload_bytes_*) but wrong for the one percentage gauge in the mapping at lines 489–498: vllm:cpu_kv_cache_usage_perc → cpu_kv_cache_usage_pct. With the DP-attn configs this PR adds (dsv4-fp4-b200-vllm dp-attn:true, dsv4-fp4-b300-vllm, dsv4-fp8-h200-vllm), vLLM emits one /metrics series per DP engine, so two engines at 50% report 100% and eight at 50% report 400%. Easiest fix: pick avg (not total) for percentage gauges, e.g. a small per-metric strategy table keyed on metric name.

Extended reasoning...

What the bug is. _final_value() walks ("total", "max", "avg") and, for the first stats key any series has, accumulates agg += float(v) across every series entry. The intent (see the comment at lines 447–448: "We aggregate across series (multiple endpoints / label sets) and prefer total for counters, then max/avg for gauges") is to sum counters and aggregate gauges, but the implementation sums in both cases.

Which metric is affected. Of the five metrics looked up via this helper for compute_cache_stats:

  • vllm:prefix_cache_hits / vllm:prefix_cache_queries — counters. Summing across series (and dividing in lockstep at lines 481–482) is correct.
  • vllm:cpu_prefix_cache_hits / vllm:cpu_prefix_cache_queries — counters, same lockstep ratio.
  • vllm:prompt_tokens / vllm:generation_tokens — counters.
  • vllm:kv_offload_bytes_* / vllm:kv_offload_time_* — counters.
  • vllm:cpu_kv_cache_usage_perc → cpu_kv_cache_usage_pct, a gauge expressing a percentage. This one is wrong when there are multiple series.

Why DP-attn configs trigger it. In a single-engine layout (TP-only), vLLM exposes one /metrics series, so the loop sums a single value and the result is right by accident. With DP-attn (added in this PR for dsv4-fp4-b200-vllm, dsv4-fp4-b300-vllm, dsv4-fp8-h200-vllm), each DP engine runs its own scheduler and emits its own series — typically tagged by engine index. The loop now sums N independent gauge readings.

Step-by-step proof. Take the dsv4-fp8-h200-vllm agentic-coding entry added in this PR:

- { tp: 8, ep: 8, dp-attn: true, offloading: none, conc-list: [1, 2, 4, 8, 16] }

The launcher (dsv4_fp8_h200.sh) runs --data-parallel-size 8 → 8 DP engines, each exporting vllm:cpu_kv_cache_usage_perc. Say each engine has its CPU offload pool 50% full. The scrape produces a metrics dict shaped like:

"vllm:cpu_kv_cache_usage_perc": {
  "series": [
    {"labels": {"engine": "0"}, "stats": {"max": 50.0, "avg": 50.0, ...}},
    {"labels": {"engine": "1"}, "stats": {"max": 50.0, "avg": 50.0, ...}},
    ... 6 more identical entries ...
  ]
}

_final_value() tries "total" first — not present on a gauge — then "max", which is present on every series. It iterates and accumulates agg = 50 + 50 + 50 + 50 + 50 + 50 + 50 + 50 = 400.0, then returns 400.0. That value is written to cpu_kv_cache_usage_pct in the agg_*.json payload. The true aggregate utilization is ~50%, not 400%.

Impact. cpu_kv_cache_usage_pct is a diagnostic field for analyzing CPU-offload behaviour, not a headline metric. utils/summarize.py doesn't read it (verified — it consumes only the throughput/latency keys plus topology fields), so summary tables are unaffected. But the field is published in the per-run JSON and would mislead anyone inspecting the offload regime of the new DP-attn configs — exactly the configs this PR is intended to characterize. The misreport is also obviously wrong (>100%, often by a large multiple) so it's likely to be spotted, but it crowds noise into the very signal the new sweeps are designed to produce.

Suggested fix. Cheapest correct version is a per-metric aggregation strategy table, e.g.:

_AGG = {
    "vllm:cpu_kv_cache_usage_perc": "avg",
    # everything else defaults to summing "total" -> "max" -> "avg"
}

def _final_value(metric_name):
    mode = _AGG.get(metric_name, "sum")
    ...
    if mode == "avg":
        # mean across series of stats["avg"] (or "max" if "avg" missing)
    else:
        # existing sum behavior

Alternatively, pick avg instead of max when the only key available is a gauge stat — but the table form is clearer about which metrics are gauges. Existing tests (test_processor_aggregates_across_multiple_series) cover only the counter case; adding a percentage-gauge fixture would lock the behaviour in.
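
A fuller version of the strategy-table idea is sketched below, as one possible shape rather than the fix that was merged. It assumes the series layout shown in the proof above (`{"series": [{"stats": {...}}, ...]}`); `GAUGE_METRICS` and the standalone `final_value` signature, which takes `metrics_by_name` as an argument instead of closing over it, are choices made for this illustration.

```python
from statistics import fmean
from typing import Optional

# Metrics whose per-engine readings must be averaged, not summed.
# Only the one gauge named in this review is listed.
GAUGE_METRICS = {"vllm:cpu_kv_cache_usage_perc"}


def final_value(metrics_by_name: dict, metric_name: str) -> Optional[float]:
    """Aggregate one metric across series: sum counters, average gauges."""
    entry = metrics_by_name.get(metric_name)
    if not isinstance(entry, dict):
        return None
    series = entry.get("series") or []
    if not isinstance(series, list):
        return None
    is_gauge = metric_name in GAUGE_METRICS
    # Gauges have no "total"; prefer "avg" and fall back to "max".
    keys = ("avg", "max") if is_gauge else ("total", "max", "avg")
    for key in keys:
        values = []
        for s in series:
            stats = s.get("stats") if isinstance(s, dict) else None
            if not isinstance(stats, dict) or stats.get(key) is None:
                continue
            try:
                values.append(float(stats[key]))
            except (TypeError, ValueError):
                continue
        if values:
            return fmean(values) if is_gauge else sum(values)
    return None
```

Under this sketch, eight engines each reporting 50% yield 50.0 for the gauge, while counters such as vllm:prefix_cache_hits keep the existing summing behavior.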

cquil11 added 4 commits May 15, 2026 21:58
Per PR review: this branch must not modify any existing master-yaml
entry. If our agentic-coding work needs different metadata (image,
runner, or extra scenarios) than what main has for an existing entry,
the change must live in a separate sibling config-key with the
'-agentic' suffix — leaving the original entry byte-identical to
origin/main so the fixed-seq-len test surface and any other consumer
of the original key sees zero change from this PR.

For every entry that previously diverged from main, this commit:
  1. Restores the original entry text verbatim to match origin/main
  2. Emits a sibling '<name>-agentic' entry containing our metadata
     overrides + only the agentic-coding scenarios block

amd-master.yaml: +6 sibling -agentic entries
  - glm5.1-fp4-mi355x-sglang-agentic
  - kimik2.5-fp4-mi355x-vllm-agentic
  - minimaxm2.5-fp8-mi300x-vllm-agentic
  - minimaxm2.5-fp8-mi325x-vllm-agentic
  - minimaxm2.5-fp8-mi355x-vllm-agentic
  - qwen3.5-fp8-mi355x-sglang-agentic

nvidia-master.yaml: +16 sibling -agentic entries
  - dsv4-fp4-b200-vllm-agentic, dsv4-fp4-b300-vllm-agentic,
    dsv4-fp8-h200-vllm-agentic, glm5-fp8-b200-sglang-agentic,
    gptoss-fp4-b200-vllm-agentic, kimik2.5-fp4-b200-vllm-agentic,
    kimik2.5-fp4-b300-vllm-agentic, kimik2.5-int4-b200-vllm-agentic,
    kimik2.5-int4-h200-vllm-agentic, minimaxm2.5-fp4-b200-vllm-agentic,
    minimaxm2.5-fp8-b200-vllm-agentic, minimaxm2.5-fp8-b300-vllm-agentic,
    minimaxm2.5-fp8-h100-vllm-agentic, minimaxm2.5-fp8-h200-vllm-agentic,
    qwen3.5-bf16-b200-sglang-agentic, qwen3.5-fp8-b200-sglang-agentic

Brand-new entries that don't exist on main (only kimik2.5-int4-h100-vllm
in this PR) stay as-is with agentic-coding scenarios only — no
fixed-seq-len block added.

Dispatch instructions: agentic sweeps now reference the '-agentic'
suffixed config-keys, e.g.
  gh workflow run ... -f generate-cli-command="test-config
      --config-keys kimik2.5-int4-h200-vllm-agentic
      --scenario-type agentic-coding"
Every split sibling now leads with a comment block explaining why it
exists. Two flavors:

Metadata divergence — lists the specific field(s) that differ from main:
  # Diverged from kimik2.5-int4-h200-vllm (agentic-coding sibling). Reasons below;
  # the original kimik2.5-int4-h200-vllm entry is left identical to origin/main so
  # its fixed-seq-len sweep is unaffected.
  #   - runner: 'h200' -> 'h200-dgxc'

Scenarios-only divergence — metadata matches main exactly; the split
exists because we added or modified the agentic-coding scenarios:
  # Diverged from dsv4-fp4-b300-vllm (agentic-coding sibling). Metadata is
  # identical to origin/main's dsv4-fp4-b300-vllm; the split exists because this
  # PR adds an agentic-coding scenarios block that differs from main
  # (either main had none or had a different conc/offload sweep).
  # The original dsv4-fp4-b300-vllm entry stays byte-identical to origin/main.

Annotations were generated programmatically from the field-level diff
against origin/main (utils/aiperf scripted; no manual edits). Existing
in-entry rationale comments are preserved below the header.
…oad modes

Extends kimik2.5-fp4-mi355x-vllm-agentic with TP=4 sweep at cliff-region
concurrencies on both offload modes. MI355X has 288 GB HBM/GPU so the
TP=4 half-node weight footprint (~62 GB/GPU) leaves plenty of headroom
unlike B200's 192 GB constraint. Restricted to cliff concurrencies (no
low-conc points) since the TP=4 vs TP=8 comparison is most useful at the
KV-pressure transition, not at lightly-loaded points.
out: dict[str, list[dict]] = {}
traces_dir = _hf_traces_dir()
if traces_dir is None:
_TRACE_METADATA_CACHE = out
)
if per_turn:
out[trace_id] = per_turn
_TRACE_METADATA_CACHE = out
@cquil11 cquil11 merged commit e92a9bf into main May 17, 2026
8 checks passed
@cquil11 cquil11 deleted the chore/agentx-v0.2-aiperf-testing-reopen branch May 17, 2026 16:29