Migrate agentic-coding benchmarks to aiperf v0.2 (reopened) by cquil11 · Pull Request #1393 · SemiAnalysisAI/InferenceX

cquil11 · 2026-05-16T01:03:21Z

Summary

Migrates scenario-type: agentic-coding from kv-cache-tester to aiperf (cjq/weka-live-assistant-responses), adds Kimi K2.5 agentic sweeps across B200, B300, H100, H200, MI355X, and pulls in a stack of agentic-specific aiperf features. bmk_agentic JSON schema preserved — downstream aggregators unaffected.

Dataset (v0.2): semianalysisai/cc-traces-weka-no-subagents-051226. Pulled via aiperf's --public-dataset semianalysis_cc_traces_weka flag — the loader name is a stable alias for the dated HF revision, so bumping the dataset means just rev'ing the alias on the aiperf side.

InferenceX repo additions since v0.1

benchmarks/benchmark_lib.sh: new build_replay_cmd emits aiperf profile --scenario inferencex-agentx-mvp … with --streaming --use-server-token-count --random-seed 42 --failed-request-threshold 0.05 --trajectory-start-{min,max}-ratio 0.25/0.75; install_agentic_deps editable-installs aiperf in-container; AIPERF_DATASET_{CONFIGURATION_TIMEOUT,MMAP_CACHE_DIR,WEKA_LIVE_ASSISTANT_RESPONSES} wired
utils/process_agentic_result.py rewritten to consume aiperf profile_export.jsonl + profile_export_aiperf.json + server_metrics_export.json; preserves all keys summarize.py consumes; theoretical cache-hit computed from trace metadata
utils/summarize.py per-metric stat set: drop median / p99 / p99.9, add p75 / p95 (keep mean / p90 / std)
utils/test_process_agentic_result.py fixture-driven contract test for every key downstream reads
6 new agentic launchers: kimik2.5_{fp4_b200, fp4_b300, fp4_mi355x, int4_b200, int4_h100, int4_h200}.sh with per-target tunings (fp8 KV on Hopper, --block-size=1 + AITER on MI355X, lazy_offload=true JSON form on H200+MI355X, VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 on B200 for TP=4)
.github/configs/{nvidia,amd}-master.yaml: agentic-coding blocks for all 5 SKUs with explicit conc-list per offloading mode + runner: h200-dgxc pin (cache mount availability)
runners/launch_h200-dgxc-slurm.sh: mount /home/sa-shared/gharunners/ai-perf-cache → /aiperf_mmap_cache, export AIPERF_DATASET_MMAP_CACHE_DIR (parity with B200 DGXC + MI355X)
Removed: utils/find_reusable_sweep_run.py, utils/validate_reusable_sweep_artifacts.py + tests (aiperf's content-addressed mmap cache replaces them)

aiperf submodule additions since v0.1

inferencex-agentx-mvp scenario plugin: locks --num-dataset-entries, --inter-turn-delay-cap-seconds 60, --cache-bust first_turn_prefix, timing_mode=AGENTIC_REPLAY, requires --ignore-eos and weka_trace loader
AGENTIC_REPLAY timing strategy + TrajectorySource: per-trajectory k_i sampling, WARMUP→PROFILING resume at k_i+1, FIFO recycle queue, context-overflow trajectory short-circuit
weka_trace loader with AIPERF_DATASET_WEKA_LIVE_ASSISTANT_RESPONSES=1 mode (user-only deltas, server-threaded assistant responses) — sources from semianalysisai/cc-traces-weka-no-subagents-051226
ContextOverflowCountMetric + scenario.context_overflow.is_context_overflow_response substring classifier; RecordProcessor drops overflow records before metrics push in AGENTIC_REPLAY scenarios
--failed-request-threshold (float in [0,1]): in-flight ProfileCancelCommand broadcast when error rate exceeds threshold after max(concurrency, 10) grace floor
--trajectory-start-{min,max}-ratio: replaces hardcoded 0%-70% k_i range; seed-deterministic per (random_seed, trace_id)
TrajectorySource build-time summary table (range cfg + per-lane k_i/N/pct/trace_id); per-trajectory warmup-completion log lines on credit return
Content-addressed mmap dataset cache (AIPERF_DATASET_MMAP_CACHE_DIR); AIPERF_DATASET_CONFIGURATION_TIMEOUT + AIPERF_SERVICE_PROFILE_CONFIGURE_TIMEOUT env knobs (default 300s → 1800s in our launchers)
--use-server-token-count: derive ISL/OSL from server usage.prompt_tokens/completion_tokens instead of client tokenizer (avoids CPU-pinning at high conc on custom-tokenizer models)
Realtime stats row: srv line with server-side running-avg throughput; intvty per-user row replaces split tin/tout; ISL/OSL p50/p75/p90/p99 row; per-warmup-completion lines emitted in non-TTY runs

Test plan

Kimi K2.5 INT4 H200 (lazy_offload + cache mount): 17/17 clean
Kimi K2.5 FP4 B200 TP=8 + TP=4 (cudagraph estimator off): 24/24 clean
Kimi K2.5 FP4 B300 none: 9/9 clean
Kimi K2.5 FP4 B300 cpu: blocked on cluster recovery
Kimi K2.5 FP4 MI355X: re-dispatch after g09 + g11 drained

Results

Adds end-to-end agentic-coding benchmark infrastructure on top of the existing fixed-seq-len harness. New components:

Trace replayer
- New utils/trace-replay submodule (kv-cache-tester @ agentx-minimized) driving multi-turn HF-dataset traces against any OpenAI-compatible endpoint at fixed concurrency.
- --debug-trace captures full per-request prompt/response, every streamed chunk via chunk.model_dump(), and integer token IDs (apply_chat_template prompt + logprobs.content completion) into debug_trace.jsonl.
- Per-model delta-field abstraction (gpt-oss → delta.reasoning, default → delta.reasoning_content) so reasoning-heavy responses are counted and appended to conversation history correctly.
- Input-token metric reads server's usage.prompt_tokens (authoritative) rather than the local apply_chat_template estimate which breaks for gpt-oss harmony's chat template.
- Per-user 8-token salt prefix on conversation[0] so two in-flight users replaying the same trace_id don't accidentally share KV-cache blocks.
- Period summary: counts up elapsed instead of down remaining; replaces the dispatch-jitter "Wait time" with the trace's true "Inter-turn time" sourced from RequestMetrics.delay_expected.
- 5s quiesce between warmup completion and metrics-collector start so warmup-tail prefill doesn't bleed into period 1.

Workflow plumbing
- e2e-tests.yml: workflow_dispatch + workflow_call inputs for debug-trace (boolean) and duration-override (string seconds), forwarded to test-sweep-agentic and test-sweep-multi-node-agentic jobs.
- benchmark-tmpl.yml + benchmark-multinode-tmpl.yml: debug-trace input mapped to DEBUG_TRACE env var; duration override threads through to matrix.config.duration.
- benchmark_lib.sh: build_replay_cmd / resolve_trace_source / install_agentic_deps / write_agentic_result_json helpers; consumes DEBUG_TRACE → --debug-trace.
- runners/launch_*.sh: shared agentic mode dispatch + scenario routing.
- runners/launch_b200-dgxc-slurm.sh → launch_b200-dgxc.sh rename to match the actual runner.name observed by the workflow.

Result aggregation
- utils/agentic-benchmark/{bench,analysis,scripts}: metrics collector (vllm/sglang Prometheus parsers), pareto plotter, per-config distribution analyzer, sweep aggregator.
- utils/process_agentic_result.py: per-job results.json builder.
- utils/matrix_logic: agentic-coding scenario plumbing in generate_sweep_configs.py + validation.py.

Examples (one per vendor)
- benchmarks/single_node/agentic/dsr1_fp4_b200.sh: NVIDIA.
- benchmarks/single_node/agentic/dsr1_fp4_mi355x.sh: AMD.
- Matching agentic-coding sections in nvidia-master.yaml (dsr1-fp4-b200-sglang) and amd-master.yaml (dsr1-fp4-mi355x-sglang).

All other model-specific launchers and matrix entries are deliberately left out of this PR; downstream PRs add them on a per-model basis.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Same value, two names: collapse to one. Workflow templates already exposed both CONC and USERS env vars (USERS was a mirror of inputs.conc), and the agentic matrix entries carried both `users: int` and `conc: [users]`. Drop the duplicates and standardize on conc/CONC:
- benchmark-tmpl.yml / benchmark-multinode-tmpl.yml: drop redundant USERS env var (CONC remains)
- e2e-tests.yml / run-sweep.yml: pass `conc: ${{ matrix.config.conc }}` to template; build agentic conc-list as `'[${{ matrix.config.conc }}]'` since matrix.config.conc is now a scalar
- generate_sweep_configs.py: agentic entries emit Fields.CONC.value (int) only; loop variable renamed from `users` to `conc`; exp-name template now uses `_conc{N}` instead of `_users{N}`
- validation.py: drop Fields.USERS; agentic Pydantic models use `conc: int`
- process_agentic_result.py: read CONC env var, emit single `"conc"` key
- collect_sweep_results.py: regex updated to match `_conc{N}_offload`
- benchmark_lib.sh / agentic launcher scripts: $USERS → $CONC

The trace-replayer's --start-users / --max-users CLI flags are upstream's API and are left unchanged; benchmark_lib.sh just passes $CONC into them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pick up these submodule commits (callanjfox/kv-cache-tester):
- 7b7f883 silence kimi: target the actual loaded-tokenizer module logger
- 5b87e43 silence kimi: replace static logger lookup with content filter
- 3394450 silence Kimi tokenization_kimi.py per-call encode warning
- 7ad6a9e delta-field map: add 'kimi' substring (uses delta.reasoning like gpt-oss)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

5 new agentic-coding launcher scripts brought over from chore/agentx-integration, with USERS → CONC normalization:
- benchmarks/single_node/agentic/gptoss_fp4_h100.sh
- benchmarks/single_node/agentic/gptoss_fp4_h200.sh
- benchmarks/single_node/agentic/gptoss_fp4_mi300x.sh
- benchmarks/single_node/agentic/gptoss_fp4_mi325x.sh
- benchmarks/single_node/agentic/kimik2.5_fp4_b200.sh

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Brings utils/agentic-benchmark/analysis/ (plot_pareto.py — sweep visualizer for cross-config performance comparison) and updates requirements.txt with transformers/xlsxwriter/tqdm/datasets/tiktoken needed by the analyzer + by trace-replay's tokenizer paths. The bench/ directory is intentionally NOT added: bench/metrics_collector.py duplicated utils/trace-replay/server_metrics.py and was already removed on this branch; bench/run_metrics_collector.py depends on it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds agentic-coding scenario blocks to the master configs for the five models whose launchers were just brought over:
- kimik2.5-fp4-b200-vllm (image bumped to v0.19.1)
- gptoss-fp4-h100-vllm
- gptoss-fp4-h200-vllm
- gptoss-fp4-mi300x-vllm
- gptoss-fp4-mi325x-vllm

Each scenario sweeps tp 4/8 (and 1/2 on AMD/H200) at offloading=none for low/mid concurrency and offloading=cpu for high concurrency, with a crossover at conc=64. Other agentic-coding sections present on chore/agentx-integration (trtllm/srt-slurm based) are left for follow-up since several of the underlying model entries were restructured by main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The agentic-coding scenario type uses benchmarks/single_node/agentic/ launchers, gated by SCENARIO_SUBDIR='agentic/' from benchmark-tmpl.yml. b200-cw, b200-dgxc, b200-nb, and b300-nv all built BENCH_BASE without honoring SCENARIO_SUBDIR, so dispatch always landed in single_node/ even for agentic runs. Other runners (h100-*, h200-*, mi*) already had this plumbing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…H200
- minimaxm2.5-fp8-b200-vllm
- qwen3.5-bf16-b200-sglang
- glm5-fp8-b200-sglang
- dsv4-fp8-h200-vllm

Each launcher mirrors its fixed-seq-len sibling but: uses CONC env for max-num-seqs / cuda-graph-max-bs, sources benchmark_lib.sh, calls the trace replayer via build_replay_cmd, and emits the agentic result JSON. Master config gets an agentic-coding scenario block sweeping conc 1..32 at offloading=none; b200-dsv4 entries left untouched since that runner type isn't registered in runners.yaml.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- minimaxm2.5-fp8-mi355x-vllm
- qwen3.5-fp8-mi355x-sglang
- glm5.1-fp4-mi355x-sglang
- kimik2.5-fp4-mi355x-vllm

Each mirrors its fixed-seq-len sibling with ROCm-specific tweaks (VLLM_ROCM_USE_AITER, ROCM_QUICK_REDUCE_QUANTIZATION, etc.) and feeds CONC into max-num-seqs / cuda-graph-max-bs. Master configs gain matching agentic-coding scenarios sweeping conc 1..32 at offloading=none. dsv4-fp8-mi355x is intentionally skipped since the existing fixed-seq launcher requires a bespoke vLLM PR rebuild that adds risk to trace-replayer testing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…5-fp4

Phase-2 coverage extension across precision (int4 vs fp4 for kimi, fp4 vs fp8 for minimax) and runner (b200 vs h100/h200 for gptoss).
- gptoss-fp4-b200-vllm
- kimik2.5-int4-b200-vllm
- minimaxm2.5-fp4-b200-vllm

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The bf16 image lmsysorg/sglang:nightly-dev-20260216-d3bae71e fails on B200 with PyTorch/CuDNN compatibility errors at server start. Add an fp8 variant using lmsysorg/sglang:v0.5.9-cu130-amd64 to provide a working qwen3.5 trace-replayer test on NVIDIA. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Documents the launcher matrix at benchmarks/single_node/agentic/, how to dispatch debug runs via gh workflow run, and what fields in the result JSON to inspect for verification (num_requests_successful, total_generation_tokens, median_ttft, median_tpot, total_tput_tps, etc.). Notes the two known-failing configs (qwen3.5 sglang on B200: pytorch/pytorch#168167; dsv4-fp4-b200-sglang: runner b200-dsv4 not in runners.yaml) so future testers don't repeat them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

15 debug runs across 7 model families × NVIDIA/AMD HW. 10 PASS / 5 FAIL (1 still in flight); failures are all image- or vLLM-parser-level, not replayer bugs. Replayer's per-model delta-field routing + long-prefill agentic flow verified end-to-end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

All 16 dispatched runs are now complete. Final tally: 10 PASS, 6 FAIL. The 6 failures are all infrastructure or vLLM-side issues (PyTorch/CuDNN image incompatibility, vLLM deepseek_v4 reasoning parser bug, sglang-rocm qwen3.5 streaming, SLURM time limit) — none indicate a bug in the trace replayer itself. All 7 active model families have at least one PASS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The exp-name template emits offload{none|cpu|ssd} (per the matrix generator's f"{model_code}_tp{tp}_conc{conc}_offload{offloading}"), but the regex was looking for offload(on|off) — so every artifact directory failed to parse, the aggregator wrote nothing to aggregated/, and collect-agentic-results uploaded no files ("No files were found with the provided path: aggregated/"). Verified the fix matches real artifact names from this branch's runs (b200/h100, none/cpu). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
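The mismatch is easy to reproduce. A minimal sketch, assuming the exp-name layout quoted in the commit message; both patterns and the sample name are illustrative and are not copied from collect_sweep_results.py:

```python
import re

# Assumed shape of the old pattern, per the description: offload(on|off).
OLD_PATTERN = re.compile(
    r"_tp(?P<tp>\d+)_conc(?P<conc>\d+)_offload(?P<offloading>on|off)$"
)
# Assumed shape of the fix: accept the values the matrix generator emits.
NEW_PATTERN = re.compile(
    r"_tp(?P<tp>\d+)_conc(?P<conc>\d+)_offload(?P<offloading>none|cpu|ssd)$"
)


def parse_exp_name(name, pattern=NEW_PATTERN):
    """Return (tp, conc, offloading), or None when the name does not parse."""
    match = pattern.search(name)
    if match is None:
        return None
    return int(match["tp"]), int(match["conc"]), match["offloading"]


# Hypothetical artifact name built with the f-string from the commit message.
model_code, tp, conc, offloading = "kimik2.5", 8, 64, "cpu"
sample = f"{model_code}_tp{tp}_conc{conc}_offload{offloading}"

print(parse_exp_name(sample, OLD_PATTERN))  # the old pattern finds no match
print(parse_exp_name(sample))               # the new pattern recovers the fields
```

With the old pattern every directory name parses to None, which matches the empty aggregated/ output described above.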

For the 5 vllm models (kimik2.5-fp4/int4-b200, minimaxm2.5-fp8-b200, gptoss-fp4-b200, kimik2.5-fp4-mi355x, minimaxm2.5-fp8-mi355x): add offloading=cpu at high concurrency (typically conc 64+) where KV cache pressure exceeds GPU HBM. Overlap at conc=64 between none and cpu so the crossover region is sampled by both. cpu-offload sweep tail uses larger conc points (96, 128, 192, 256) since the only reason to enable cpu offload is when concurrency stresses HBM.

For glm5-fp8-b200-sglang and glm5.1-fp4-mi355x-sglang (sglang launchers without the OFFLOADING=cpu plumbing): expand the conc range on offloading=none. sglang manages its own KV eviction via the radix cache, so concurrency above HBM capacity is handled internally rather than via vLLM's --kv_offloading_backend.

dsr1-fp4-{b200,mi355x}-sglang sweeps already cover conc 1..256 (b200 also has tp=4 ep=4 / tp=8 ep=8 split and tp=8 going to conc=512), so left as-is.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Both nodes are currently dropping every job that lands on them:
- NCCL barrier dies during sglang Scheduler.init_model_worker with RuntimeError: NCCL error: unhandled cuda error (stale CUDA contexts from a previous job that didn't tear down cleanly)
- HuggingFace CAS download for moonshotai/Kimi-K2.5 fails with RuntimeError: Data processing error: CAS service error : IO Error: No space left on device (os error 28)

Adding --exclude=gpu-10,gpu-15 to salloc keeps SLURM from allocating to them. Drop this once sa-shared admins clean up the nodes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vLLM's OffloadingConnector (--kv_offloading_backend native) is incompatible with the hybrid-KV-cache-manager (HMA) for models with mixed attention layouts. When HMA is enabled, the OffloadingConnector init fails with: RuntimeError: Worker failed with error 'Connector OffloadingConnector does not support HMA but HMA is enabled. Please set --disable-hybrid-kv-cache-manager'. This bit kimik2.5-fp4-mi355x's full sweep: every offload=cpu sub-job failed with the above error while every offload=none sub-job passed (see run 25117841192). Kimi-K2.5 uses hybrid attention so HMA kicks in. MiniMax-M2.5 doesn't, which is why its prior cpu-offload sweeps passed even with the broken flag. Switching all 11 cpu-offload launchers to --disable-hybrid-kv-cache-manager is correctness-safe across the board: HMA is a pure optimization, and disabling it is required for OffloadingConnector regardless of model. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…nfigs

KV offloading via OffloadingConnector hits multiple upstream bugs on older vllm tags:
- v0.15.1 (gpt-oss-fp4-b200, kimi-int4-b200): flashinfer kv_cache_permute assertion in TRTLLM-attention path
- v0.18.0-rocm (kimi-fp4-mi355x): HMA + OffloadingConnector incompat
- v0.19.0 (minimaxm2.5-fp8 b200/mi355x): not yet verified clean

Bumping to v0.19.1 (or v0.19.1-rocm), which is proven-good on kimi-fp4-b200 (23/23 sweep PASS) and gptoss-fp4 h100/h200/mi300x/mi325x.

Add agentic-coding sections + launchers for MiniMax-M2.5 FP8 across H100, H200, B200, B300, MI300X, MI355X (excluding MI325X). Conc ranges sized from per-SKU GPU KV cache capacity:

KV per token (fp8, 62 layers × 8 KV heads × 128 dim × 2): ~124 KB

Per-SKU GPU cache cap with tp=4 + 0.90 mem-util:
  H100     58 GB -> 0.46M tok (saturate ~conc 6)
  H200    277 GB -> 2.19M tok (saturate ~conc 29)
  B200    461 GB -> 3.63M tok (saturate ~conc 48)
  B300    807 GB -> 6.35M tok (saturate ~conc 85)
  MI300X  500 GB -> 3.93M tok (saturate ~conc 52)
  MI355X  864 GB -> 6.81M tok (saturate ~conc 91)

NVIDIA configs include offload=cpu starting at the saturation point (simple cpu offload via OffloadingConnector requires vllm ≥ 0.19.1). AMD configs do not enable cpu offload: vllm simple offloading isn't supported on the rocm build for these models. AMD pushes offload=none to a higher conc to demonstrate where GPU cache saturates.

Image bumps: h100/h200/mi300x v0.18.0/v0.16.0 -> v0.19.1; b300 v0.19.0-cu130 -> v0.19.1.
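The sizing arithmetic can be sketched as follows. The layer, head, and dimension figures come from the commit message; reading "GB" as 1e9 bytes, one byte per fp8 element, and the tokens-per-user figure are assumptions (the last is back-derived from the H100 row, roughly 0.46M tokens / conc 6, and is not stated in the source):

```python
# Figures quoted in the commit message.
LAYERS = 62
KV_HEADS = 8
HEAD_DIM = 128
KV_TENSORS = 2        # one K and one V tensor per layer (the "x 2")
BYTES_PER_ELEM = 1    # assumption: fp8 stores one byte per element


def kv_bytes_per_token():
    """Bytes of KV cache one token occupies across all layers."""
    return LAYERS * KV_HEADS * HEAD_DIM * KV_TENSORS * BYTES_PER_ELEM


def cache_tokens(cache_gb):
    """Tokens that fit in cache_gb, assuming decimal GB (1e9 bytes)."""
    return cache_gb * 1e9 / kv_bytes_per_token()


def saturation_conc(cache_gb, tokens_per_user):
    """Concurrent users whose contexts fit in cache, rounded down."""
    return int(cache_tokens(cache_gb) // tokens_per_user)


print(kv_bytes_per_token())  # 126976 bytes, i.e. 124 KiB
for sku, gb in [("H100", 58), ("B200", 461)]:
    # 75_000 tokens per user is a hypothetical working-set size.
    print(sku, round(cache_tokens(gb) / 1e6, 2), saturation_conc(gb, 75_000))
```

The table's rounding does not match this sketch to the last digit for every SKU, so treat it as an order-of-magnitude check rather than the author's exact calculation.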

vllm v0.19.1 fp8 quantization rejects tp=8 for MiniMax-M2.5: gate/up weight output_size 1536 / tp=8 = 192, not divisible by block_n=128. Same constraint at vllm/model_executor/layers/quantization/fp8.py:638.

Per fixed-seq-len reference TPs:
  H100    tp=4 ep=4 (tp=8 ep=8 commented out in fixed-seq-len for fp8)
  H200    fixed-seq-len has only tp=8 (broken on v0.19.1 fp8); winging tp=4
  B200    tp=4 (fixed-seq-len has tp=2,4; tp=2 too tight for agentic ISL)
  B300    tp=4 (primary; fixed-seq-len has tp=1,2,4 with various ep)
  MI300X  tp=4 (fixed-seq-len has tp=2,4)
  MI355X  tp=4 ep=4 (fixed-seq-len has tp=2 ep=2, tp=4 ep=4, tp=8 ep=8)

Concurrency expanded across the saturation cliff for each SKU; cpu offload range extended to 384/512 on NVIDIA where applicable.
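The divisibility constraint can be checked with a few lines. This is a simplified restatement of the rule as the commit message describes it, not vLLM's actual validation code; the function name is hypothetical:

```python
def shard_is_block_aligned(output_size, tp, block_n=128):
    """True if output_size splits evenly over tp ranks and every
    per-rank shard is a whole number of block_n-wide blocks."""
    if output_size % tp != 0:
        return False
    return (output_size // tp) % block_n == 0


# output_size=1536 and block_n=128 are the values from the commit message.
for tp in (1, 2, 4, 8):
    print(f"tp={tp} shard={1536 // tp} aligned={shard_is_block_aligned(1536, tp)}")
```

Only tp=8 fails: 1536 / 8 = 192 is not a multiple of 128, while 1536 / 4 = 384 is exactly three blocks, which is consistent with the move to tp=4.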

Per empirical compute ceilings observed in prior runs (mean in-flight reqs mid-test on each platform):
  H100   tp=4 ep=4  ceiling ~10 (KV cliff ~6 -> cpu zone 6-10)
  H200   tp=4       ceiling ~35 (KV cliff ~29 -> cpu zone 29-35)
  B200   tp=4       ceiling ~50 (KV cliff ~48 -> very narrow)
  B300   tp=4       ceiling ~60 (KV cliff ~85 -> compute saturates first)
  MI300X tp=4       ceiling ~20 (estimated)
  MI355X tp=4 ep=4  ceiling ~60

Previous conc lists (1..256, even up to 512) wasted 30-min slots on sub-jobs that just queue 200+ requests waiting on a server only running 4-50 in flight, leading to client-side 600s timeout cascades. New lists "creep up" to 2-3x the ceiling, then stop.

NVIDIA cpu offload range narrowed to the zone between KV cliff and compute ceiling, where offloading can actually relieve KV pressure without compute already being the bottleneck. AMD (mi300x, mi355x) keeps offload=none only.

Per user feedback: past the compute ceiling, throughput plateaus and extra conc just adds queue depth and client timeouts -- wasted slots. Reallocate sampling budget to densify around the cliff(s) for each SKU.

Per-SKU strategy (compute ceiling empirical, KV cliff analytical):
  H100   tp=4 ep=4  ceil 10  KV 6  -> dense 4-12 (sweet spot for cpu demo)
  H200   tp=4       ceil 35  KV 29 -> dense 24-40 (narrow cpu window)
  B200   tp=4       ceil 50  KV 48 -> dense 32-56 (cliffs colocated)
  B300   tp=4       ceil 60  KV 85 -> dense 48-72 (compute first; cpu won't help)
  MI300X tp=4       ceil 25  KV 52 -> dense 16-32 (compute first; AMD no cpu)
  MI355X tp=4 ep=4  ceil 60  KV 91 -> dense 48-72 (compute first; AMD no cpu)

Dense step (every 4-8 conc) around the cliffs to resolve the inflection; sparse step (doubling) below the cliffs for baseline; one point ~1.3-1.5x ceiling to confirm plateau. NVIDIA cpu offload range overlaps with none from KV cliff to ~ceiling for direct same-conc comparison; doesn't extend past 1.3x ceiling.

- AMD launchers (mi300x, mi355x) drop VLLM_USE_SIMPLE_KV_OFFLOAD env var. SimpleCPUOffloadConnector isn't supported on rocm; native OffloadingConnector works (still passes --kv_offloading_backend native flag).
- Add cpu offload entries to AMD master configs (mi300x, mi355x).
- Add b300-p1 runner group (subset of b300 nodes 13-17 with the b300-p1 label) and target it from the b300 minimax config.

The agentic-coding benchmark IS a prefix-cache benchmark: the whole point is measuring KV reuse across multi-turn conversations and across users (with the per-user salt enabling deterministic prefix overlap). Disabling prefix caching defeats the entire purpose.

Removed from 7 launchers that had it:
- dsv4_fp8_h200.sh
- gptoss_fp4_b200.sh (was in config.yaml)
- kimik2.5_fp4_mi355x.sh
- kimik2.5_int4_b200.sh
- minimaxm2.5_fp4_b200.sh
- minimaxm2.5_fp8_mi300x.sh
- minimaxm2.5_fp8_mi355x.sh

vLLM defaults to prefix caching ON when no flag is passed.

ROCM_AITER_FA was the suspect for both:
1. Worker dies on cpu offload (gpt-oss using UNIFIED_ATTN works fine on the same launcher pattern + image)
2. Prefix-cache Prometheus counters never increment (observability gap on FA backend, while UNIFIED_ATTN reports correctly on mi300x)

Swap to ROCM_AITER_UNIFIED_ATTN to test both fixes in one shot.

The cpu range needs full overlap with none past the KV cliff so the no-offload throughput collapse is visible at the same conc points where cpu offload sustains throughput.

B200 tp=4 (KV cliff conc=48):
  none: [1,2,4,8,16,32,48,56,64,96,128] (was capped at 64)
  cpu:  [48,56,64,96,128] (was capped at 64)

B300 tp=4 (KV cliff conc=85):
  none: [1,2,4,8,16,32,48,64,96,128,192] (was capped at 96)
  cpu:  [48,64,96,128,192] (was capped at 96)

Past the cliff, the no-offload curve should collapse (recompute storm, client-side timeouts), while cpu-offload sustains the compute ceiling.

…lenty)

- TP=8 none: [1, 2, 4, 8, 16, 24, 32, 40, 48] (unchanged baseline)
- TP=8 cpu: [32, 40, 48, 56] (was [1..48])

Lower concurrencies fit entirely on-GPU at MI355X's 288 GB HBM; running cpu offload at conc<32 just adds the offload-path overhead without measuring anything new. Restrict cpu to the cliff region where it actually matters, and probe one step past the prior cap with conc=56.

Final per-metric stat set is now mean / p75 / p90 / p95 / std (was mean / median / p90 / p99 / p99.9 / std). Applied across:
- utils/process_agentic_result.py: stats_for(), QPS aggregator, input/output_tokens, output_tokens_expected
- utils/summarize.py: single-node and multi-node CSV column headers and row formatters
- utils/test_process_agentic_result.py: SUMMARIZE_KEYS contract

Rationale: p99/p99.9 were dominated by trace-end stragglers and weren't useful operational signal at the concurrency we sweep; p75 captures the tail where the agentic workload actually starts diverging from the median, and p95 is the standard 'tail-but-not-catastrophe' percentile that fits between p90 and the dropped p99.
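For illustration, the new stat set could be computed like this. It is a stdlib-only sketch, not the body of stats_for(); the function names, the linear-interpolation percentile convention, and the use of population standard deviation are all assumptions:

```python
import math
import statistics


def percentile(sorted_values, pct):
    """Linearly interpolated percentile over an already sorted list."""
    if not sorted_values:
        raise ValueError("percentile() needs at least one sample")
    rank = (len(sorted_values) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    frac = rank - lo
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * frac


def summarize_stats(values):
    """Return the mean / p75 / p90 / p95 / std set for one metric."""
    ordered = sorted(values)
    return {
        "mean": statistics.fmean(ordered),
        "p75": percentile(ordered, 75),
        "p90": percentile(ordered, 90),
        "p95": percentile(ordered, 95),
        # Assumption: population std; the source does not say which form is used.
        "std": statistics.pstdev(ordered),
    }


# Hypothetical per-request latencies in seconds.
print(summarize_stats([1.0, 2.0, 3.0, 4.0, 5.0]))
```

Dropping median, p99, and p99.9 from the returned keys is what the contract test has to track, since downstream CSV headers are built from this key set.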

Picks up cquil11/aiperf@4efdd6e8 "[RecordProcessor] Drop context-overflow records for AGENTIC_REPLAY scenarios": context-overflow errors mid-trajectory are already handled by agentic_replay.handle_credit_return (recycles the conversation, spawns a fresh trajectory), so the parser-classified context_overflow records were being double-counted as both end-of-trajectory signals AND error metrics. Now they're dropped at the record_processor_service layer before the MetricRecordsMessage push -- no contribution to failure totals, no entry in profile_export.jsonl, no tick on error counters. Existing ContextOverflowCountMetric continues to work outside AGENTIC_REPLAY scenarios for diagnostic purposes.

Effect on Kimi agentic results in this repo: the "errors=N" line in the per-job logs and the failure column in aggregated CSVs will only count real failures (server 5xx, parse errors, malformed responses), not the expected end-of-trajectory context-overflow events.

Picks up cquil11/aiperf@8f41bc7b. New --failed-request-threshold flag (float in [0,1], default None=disabled) on the agentic-benchmark profile entrypoint. When PROFILING-phase error_records/total_records exceeds the threshold after a grace floor of max(concurrency, 10) records, RecordsManager broadcasts ProfileCancelCommand on the message bus, the timing/server-metrics/gpu-telemetry managers tear down their work, and the run exits non-zero via the existing cancel path. Composes cleanly with the prior context-overflow drop: in AGENTIC_REPLAY scenarios, context-overflow events are excluded from error_records, so the threshold measures only real failures (server 5xx, parse errors, malformed responses) and won't trip on the expected end-of-trajectory overflow signal. Usage example: aiperf profile ... --failed-request-threshold 0.05 (abort if >5% real failures after grace floor)
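The abort rule described above can be sketched as a pure function. This is not aiperf's code: the function name is hypothetical, and whether the grace floor is inclusive (checking starts at exactly max(concurrency, 10) records) is an assumption:

```python
def should_abort(error_records, total_records, concurrency, threshold):
    """Decide whether a PROFILING-phase run should be cancelled.

    threshold is a float in [0, 1], or None when the flag is unset.
    """
    if threshold is None:
        return False  # default: feature disabled
    grace_floor = max(concurrency, 10)
    if total_records < grace_floor:
        return False  # too few records to judge the error rate
    return error_records / total_records > threshold


# One early failure does not trip the check before the grace floor is reached.
print(should_abort(error_records=1, total_records=5, concurrency=4, threshold=0.05))
# After the floor, a 10% error rate exceeds the 5% threshold.
print(should_abort(error_records=1, total_records=10, concurrency=4, threshold=0.05))
```

In AGENTIC_REPLAY scenarios the error_records input would already exclude context-overflow events, per the commit message, so only real failures move the ratio.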

aiperf submodule pointer -> 343f33c6 picks up:
- [LoadGen] --failed-request-threshold (in-flight abort; already wired via earlier ee76801 bump)
- [AgenticReplay] --trajectory-start-min-ratio / --trajectory-start-max-ratio (configurable replacement for the previously hardcoded 0%-70% k_i range)
- [AgenticReplay] per-trajectory warmup completion log lines (start_turn, trace_id, status)

benchmark_lib.sh wires three new aiperf flags into build_replay_cmd for all agentic launchers:
- --failed-request-threshold 0.05 (kill run early if real-failure rate > 5%)
- --trajectory-start-min-ratio 0.25
- --trajectory-start-max-ratio 0.75 (sample k_i from 25%-75% of the trace)
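One way to get a start turn that is deterministic per (random_seed, trace_id) and bounded by the two ratios is sketched below. This is purely illustrative: the hashing scheme, the function name, and the treatment of k_i as a zero-based turn index are assumptions, not aiperf's implementation:

```python
import hashlib
import random


def sample_start_turn(random_seed, trace_id, num_turns,
                      min_ratio=0.25, max_ratio=0.75):
    """Pick a start turn k_i in [min_ratio, max_ratio] of the trace length."""
    if num_turns < 1:
        raise ValueError("trace must contain at least one turn")
    if not 0.0 <= min_ratio <= max_ratio <= 1.0:
        raise ValueError("ratios must satisfy 0 <= min <= max <= 1")
    # Stable per-trace seed: Python's built-in hash() is salted per process,
    # so a cryptographic digest is used for reproducibility across runs.
    digest = hashlib.sha256(f"{random_seed}:{trace_id}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    lo = min(int(num_turns * min_ratio), num_turns - 1)
    hi = min(max(lo, int(num_turns * max_ratio)), num_turns - 1)
    return rng.randint(lo, hi)


# Same seed and trace id always give the same start turn.
k_first = sample_start_turn(42, "trace-0001", num_turns=40)
k_again = sample_start_turn(42, "trace-0001", num_turns=40)
print(k_first, k_first == k_again)
```

Keying the generator on both values means changing --random-seed reshuffles every trajectory, while a fixed seed keeps each trace's start turn stable across sweep points.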

Picks up cquil11/aiperf@fccb8471. TrajectorySource now emits a one-block info line right after building the trajectory list, showing per-lane (k_i, num_turns, pct) plus configured/observed range summary. Lets you verify that --trajectory-start-{min,max}-ratio produced the expected distribution before any requests fire, no need to wait for warmup completion lines.

Conflicts resolved:
- .github/configs/amd-master.yaml (dsv4-fp4-mi355x-atom): took main's simplified single-range conc form from PR #1311 (we had the older discrete-point version)
- .github/configs/nvidia-master.yaml (kimik2.5-int4-b200-vllm): kept our bump-rationale comment alongside main's v0.20.2 image (both sides agreed on the image, only the comment was new on ours)
- .github/configs/nvidia-master.yaml (minimaxm2.5-fp8-{h100,h200}-vllm): took main's v0.20.2 image bumps (we still had v0.19.1)

Cleanup:
- Drop our .gitignore additions (the 'scripts/debug_*.sh' line) per review feedback -- match main
- Drop docs/AGENTIC_TEST_COVERAGE.md and docs/AGENTIC_TEST_RESULTS.md (agent-generated planning slop, not load-bearing)

We don't need to plot any pareto frontiers from this repo -- aiperf has its own plotting tutorial and any downstream visualization can read the bmk_agentic JSON / aggregate exports directly.

Removed:
- utils/agentic-benchmark/scripts/plot_sweep_overview.py (v0.1 carryover)
- utils/agentic-benchmark/analysis/plot_pareto.py
- utils/generate_aiperf_plots.py (added earlier in this PR; not needed)

Picks up cquil11/aiperf@7d880a1e. The earlier context-overflow drop (commit 4efdd6e8) broke the records-side <-> credit-side counter invariant by returning early from _on_inference_results: records-side total_records lagged credit-side final_requests_completed by one for every overflow event, so the completion barrier at records_tracker.py:144-147 never converged. End of every PROFILING phase hung for the full benchmark_grace_period before timing out and cancelling in-flight credits. Fix preserves the original intent (context-overflow events stay out of metrics) while keeping the invariant intact: overflow records flow through normally but carry a context_overflow_skip flag on the MetricRecordMetadata; RecordsManager counts them toward total_records (classified as success so error counters stay at 0) but skips the error tracker, accumulators, stream exporters, and the --failed-request-threshold abort check.
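The invariant at stake can be shown with a toy model. The class and field names below are hypothetical stand-ins, not aiperf's RecordsManager; the sketch only demonstrates why counting overflow records toward the total, while skipping error tracking and export, lets the completion barrier converge:

```python
from dataclasses import dataclass


@dataclass
class RecordCounters:
    """Toy stand-in for the records-side counters."""
    total_records: int = 0
    error_records: int = 0
    exported_records: int = 0

    def on_record(self, is_error, context_overflow_skip=False):
        # Every record counts toward the completion barrier.
        self.total_records += 1
        if context_overflow_skip:
            # Treated as success for counting; no error tick, no export.
            return
        if is_error:
            self.error_records += 1
        self.exported_records += 1


# Five credits complete; two of them ended in a context overflow.
final_requests_completed = 5
counters = RecordCounters()
for overflow in (False, True, False, True, False):
    counters.on_record(is_error=overflow, context_overflow_skip=overflow)

print(counters.total_records == final_requests_completed)
print(counters.error_records, counters.exported_records)
```

Under the earlier early-return behavior, total_records in this toy run would stop at 3 and never reach the 5 completed credits, which corresponds to the hang described above.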

Per PR review feedback, this branch must not alter any fixed-seq-len scenarios or non-agentic functionality from origin/main. Restored to match origin/main exactly:
- amd-master.yaml: re-add qwen3.5-fp4-mi355x-atom + minimaxm2.5-fp4-mi355x-atom entries (both have only fixed-seq-len scenarios; were missing from our branch since the v0.1 merge)
- nvidia-master.yaml: replace dsv4-fp4-gb200-dynamo-vllm fixed-seq-len block with origin/main version (we had drifted to a 1k/1k extrapolated layout; main is the canonical 8k/1k Pareto-mirrored block)
- nvidia-master.yaml: kimik2.5-int4-h100-vllm new entry has agentic-coding only (no fixed-seq-len) to keep the fixed-seq-len surface identical to main

github-actions · 2026-05-16T01:03:29Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

claude · 2026-05-16T01:13:06Z

+
+    # Generate metrics_plots.png from the same aiperf artifacts. Best-effort:
+    # don't fail the launcher if plot generation has trouble (e.g. matplotlib
+    # missing in a stripped-down image). The agg JSON is the success gate.
+    python3 "$INFMAX_CONTAINER_WORKSPACE/utils/generate_aiperf_plots.py" "$result_dir" 2>&1 || true
 }


🔴 The new write_agentic_result_json calls python3 $INFMAX_CONTAINER_WORKSPACE/utils/generate_aiperf_plots.py "$result_dir" to produce results/metrics_plots.png, and the workflow lists that PNG in the agentic artifact upload bundle — but utils/generate_aiperf_plots.py is not present anywhere in this PR or the repo. The call is wrapped in 2>&1 || true and the upload uses if-no-files-found: ignore, so the launcher and workflow appear green while every agentic run silently fails to emit the advertised plot. Either commit the missing script or remove the invocation at benchmarks/benchmark_lib.sh:1037 and the results/metrics_plots.png line at .github/workflows/benchmark-tmpl.yml:248.

Extended reasoning...

What the bug is

write_agentic_result_json in benchmarks/benchmark_lib.sh (lines 1026-1038) runs the post-run aggregator and then invokes a plotter:

python3 "$INFMAX_CONTAINER_WORKSPACE/utils/generate_aiperf_plots.py" "$result_dir" 2>&1 || true

This function is the final step of every agentic launcher under benchmarks/single_node/agentic/*.sh in this PR. Its output results/metrics_plots.png is then listed as one of the files in the agentic artifact upload at .github/workflows/benchmark-tmpl.yml:248. The comment block immediately above the plotter call advertises it as an intentional feature (Generate metrics_plots.png from the same aiperf artifacts), and the build_replay_cmd comment at lines 1008-1014 explicitly justifies the 1-second --slice-duration "so the post-run plotter has per-window time series … Without this, aiperf only emits aggregate stats and the 6x2 panels collapse to flat lines."

But utils/generate_aiperf_plots.py is not committed anywhere in the PR or the repo. A repo-wide search finds exactly one reference — the invocation itself — and no file matching **/generate_aiperf_plots*.

Why nothing currently fails

The call is wrapped in 2>&1 || true, so python: can't open file '.../utils/generate_aiperf_plots.py': [Errno 2] No such file or directory is captured into benchmark.log and the launcher's exit status stays clean. Separately, the workflow upload step uses if-no-files-found: ignore, so the missing results/metrics_plots.png is silently dropped from the uploaded bundle. There is no other check that would surface the missing artifact.
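The masking effect described above can be reproduced in a few lines. This is a minimal sketch, not code from this PR: `run_strict` and `run_best_effort` are hypothetical helpers, and the script path is a placeholder that is assumed not to exist on the machine running the sketch.

```python
import subprocess
import sys


def run_strict(script: str) -> int:
    """Run a Python script and return its real exit status."""
    proc = subprocess.run(
        [sys.executable, script],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # same effect as the shell's 2>&1
        text=True,
    )
    return proc.returncode


def run_best_effort(script: str) -> int:
    """Mimic `cmd 2>&1 || true`: the failure is discarded, status is always 0."""
    run_strict(script)
    return 0


# Placeholder path, assumed not to exist (hypothetical location).
MISSING = "/nonexistent/utils/generate_aiperf_plots.py"

strict_status = run_strict(MISSING)
masked_status = run_best_effort(MISSING)
print(f"strict={strict_status} masked={masked_status}")
```

The strict call reports a non-zero status, while the best-effort wrapper reports success for the same missing file, which is why the launcher stays green.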

Step-by-step proof

1. An agentic-coding job runs, e.g. benchmarks/single_node/agentic/dsv4_fp4_b200_vllm.sh.
2. The script ends with write_agentic_result_json "$RESULT_DIR" (line 156 of the new launcher).
3. write_agentic_result_json invokes python3 $INFMAX_CONTAINER_WORKSPACE/utils/generate_aiperf_plots.py "$result_dir".
4. Python exits non-zero with No such file or directory because the script does not exist.
5. || true swallows the exit code; only benchmark.log records the error.
6. The aggregated *.json is created, so the workflow's retry-based existence check at .github/workflows/benchmark-tmpl.yml passes.
7. The Upload agentic raw results step lists results/metrics_plots.png but the file does not exist; if-no-files-found: ignore causes the missing file to be silently skipped.
8. The artifact bundle is published without the advertised PNG, and no warning surfaces in the run summary.

Impact

The metrics_plots.png artifact is advertised in the PR description (build_replay_cmd comment) and explicitly listed in the workflow upload, but it will never be produced for any agentic run. Every agentic benchmark.log will also carry a noisy python: can't open file … line, complicating future log triage. This is not a correctness bug for the agg JSON path (the success gate is the JSON, not the PNG), so the pipeline still goes green — but the promised per-window time-series visualization is missing from every run.

How to fix

Two options, either is sufficient:

1. Commit utils/generate_aiperf_plots.py alongside this PR. The 1-second --slice-duration plumbing in build_replay_cmd and the workflow artifact reference were clearly added in anticipation of this file.

2. Remove the plotter call from write_agentic_result_json (benchmarks/benchmark_lib.sh:1034-1037) and drop results/metrics_plots.png from .github/workflows/benchmark-tmpl.yml:248. The --slice-duration 1.0 flag in build_replay_cmd can also be removed if no other consumer needs the per-window timeslice JSON, but profile_export_aiperf_timeslices.{json,csv} are also in the upload bundle, so it may still be useful.

Option 1 matches the apparent intent (the launcher comment is written assuming the plotter exists). Option 2 is the safer "make the PR self-consistent" path.
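Whichever option is taken, a small consistency check could keep script references and committed files in sync. The sketch below is an illustration only: `missing_script_refs` is a hypothetical helper, the regex covers just the `utils/*.py` naming seen in this review, and the demo builds a throwaway fake repo rather than reading the real one.

```python
import re
import tempfile
from pathlib import Path

# Matches references such as utils/generate_aiperf_plots.py inside shell text.
SCRIPT_REF = re.compile(r"utils/[\w\-]+\.py")


def missing_script_refs(shell_text: str, repo_root: Path) -> list:
    """Return utils/*.py paths referenced in shell_text but absent under repo_root."""
    refs = sorted(set(SCRIPT_REF.findall(shell_text)))
    return [ref for ref in refs if not (repo_root / ref).is_file()]


# Self-contained demo: a fake repo in which only the aggregator exists.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "utils").mkdir()
    (root / "utils" / "process_agentic_result.py").write_text("# stub\n")
    shell = (
        'python3 "$INFMAX_CONTAINER_WORKSPACE/utils/process_agentic_result.py"\n'
        'python3 "$INFMAX_CONTAINER_WORKSPACE/utils/generate_aiperf_plots.py" '
        '"$result_dir" 2>&1 || true\n'
    )
    missing = missing_script_refs(shell, root)

print(missing)
```

Run against the real benchmarks/benchmark_lib.sh text in CI, a non-empty result would flag the dangling plotter reference before merge.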

claude · 2026-05-16T01:13:06Z

+# Trace metadata lookup: conversation_id (= trace id) -> per-turn dict with
+# ``hash_ids`` and ``output_length``. Built lazily from the HF dataset cache.
+_TRACE_METADATA_CACHE: dict[str, list[dict]] | None = None
+_HF_DATASET = "semianalysisai/cc-traces-weka-042026"


🔴 utils/process_agentic_result.py:40 hardcodes _HF_DATASET = "semianalysisai/cc-traces-weka-042026" (the v0.1 name) while benchmarks/benchmark_lib.sh:908 in this same PR downloads semianalysisai/cc-traces-weka-no-subagents-051226 (v0.2, matching the PR description and the aiperf submodule loader). _hf_traces_dir() builds its lookup path from _HF_DATASET, so the production HF cache directory datasets--semianalysisai--cc-traces-weka-no-subagents-051226/snapshots/ is never found, _load_trace_metadata() returns {}, and every shipped agentic agg JSON has theoretical_cache_hit_rate=null with all output_tokens_expected stats missing — silently breaking the "theoretical cache-hit computed from trace metadata" feature this PR advertises. Fix is a one-line constant bump (and the matching test fixture path at utils/test_process_agentic_result.py:408); ideally derive the name from a shared constant / env var shared with resolve_trace_source.

Extended reasoning...

What goes wrong

Two locations in this PR disagree on the HF dataset name:

Producer — benchmarks/benchmark_lib.sh:908 (resolve_trace_source) calls hf download --repo-type dataset semianalysisai/cc-traces-weka-no-subagents-051226. This is the only producer of the local HF cache for the agentic path, and it matches the PR description ("v0.2: semianalysisai/cc-traces-weka-no-subagents-051226") and the aiperf submodule loader (semianalysis_cc_traces_weka is the stable alias for this dated revision).

Consumer — utils/process_agentic_result.py:40 hardcodes _HF_DATASET = "semianalysisai/cc-traces-weka-042026" — the v0.1 name from before this PR.

_hf_traces_dir() (process_agentic_result.py:118–146) builds its lookup path from _HF_DATASET:

org, name = _HF_DATASET.split("/", 1)
snapshots = cache_root / f"datasets--{org}--{name}" / "snapshots"

So it searches $HF_HUB_CACHE/datasets--semianalysisai--cc-traces-weka-042026/snapshots/, which never exists in production. The function returns None, _load_trace_metadata() returns an empty dict, and:

compute_cache_stats() (process_agentic_result.py ~395–432): the theoretical-cache-hit walk runs only if metadata is truthy. Empty dict ⇒ result["theoretical_cache_hit_rate"] stays None.

compute_workload_stats() (process_agentic_result.py ~277–294): the mean/p75/p90/p95/std_output_tokens_expected block is gated by if metadata:. Empty dict ⇒ none of those keys are emitted.
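The path mismatch can be made concrete with the same derivation applied to both dataset names. This is a minimal sketch: `snapshots_dir` is a hypothetical stand-in for the path logic quoted from _hf_traces_dir(), and `/hf-cache` is a placeholder for $HF_HUB_CACHE.

```python
from pathlib import Path


def snapshots_dir(cache_root: Path, dataset: str) -> Path:
    """Build the HF hub cache snapshots path for a repo id like 'org/name'."""
    org, name = dataset.split("/", 1)
    return cache_root / f"datasets--{org}--{name}" / "snapshots"


cache_root = Path("/hf-cache")  # placeholder for $HF_HUB_CACHE

# Name downloaded by resolve_trace_source (v0.2).
downloaded = snapshots_dir(
    cache_root, "semianalysisai/cc-traces-weka-no-subagents-051226"
)
# Name hardcoded in _HF_DATASET (v0.1).
looked_up = snapshots_dir(cache_root, "semianalysisai/cc-traces-weka-042026")

print(downloaded)
print(looked_up)
print(downloaded == looked_up)
```

The two directories differ, so an is_dir() check on the second path fails even though the download succeeded.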

Impact

Every agentic-coding result JSON shipped to the downstream aggregator has theoretical_cache_hit_rate: null and is missing all five output_tokens_expected stats — directly contradicting this PR's claim: "theoretical cache-hit computed from trace metadata." The per-launcher print(f" Theoretical cache hit rate: ...") at the end of process_agentic_result.py simply never prints in production. End-users consuming the schema downstream see silent data loss, not a crash.

Why CI doesn't catch this

The new unit test utils/test_process_agentic_result.py:408 (test_processor_loads_traces_jsonl_for_theoretical_cache) builds its fake HF cache snapshot under hf_cache / "datasets--semianalysisai--cc-traces-weka-042026" / "snapshots" / "abc" — mirroring the same stale name the production code reads. The test passes because both sides agree on the wrong name. Nothing else exercises _HF_DATASET against the real dataset, and test_processor_emits_required_summarize_keys (line 264) doesn't include the optional theoretical_cache_hit_rate / output_tokens_expected keys in its required-key set, so their absence isn't flagged either.

Step-by-step proof

1. A B200 job runs benchmarks/single_node/agentic/kimik2.5_fp4_b200.sh, which calls resolve_trace_source (benchmark_lib.sh:907–916).
2. That downloads semianalysisai/cc-traces-weka-no-subagents-051226 into $HF_HUB_CACHE. After download, $HF_HUB_CACHE contains datasets--semianalysisai--cc-traces-weka-no-subagents-051226/snapshots/<rev>/traces.jsonl (the HF hub layout).
3. The launcher runs aiperf, which writes results/trace_replay/profile_export.jsonl etc.
4. write_agentic_result_json invokes python3 utils/process_agentic_result.py.
5. _load_trace_metadata() calls _hf_traces_dir(), which computes snapshots = $HF_HUB_CACHE/datasets--semianalysisai--cc-traces-weka-042026/snapshots from _HF_DATASET. snapshots.is_dir() is False (the cache directory uses the real dataset name -no-subagents-051226). Returns None.
6. _load_trace_metadata() returns {}. _TRACE_METADATA_CACHE = {}.
7. In compute_cache_stats(), if metadata: is False ⇒ theoretical_cache_hit_rate stays None.
8. In compute_workload_stats(), the trace-metadata block at the bottom of the function never executes ⇒ no *_output_tokens_expected keys are written.
9. Aggregator emits agg_*.json with "theoretical_cache_hit_rate": null and no expected-output stats. The per-launcher print(" Theoretical cache hit rate: ...") is gated on agg.get("theoretical_cache_hit_rate") is not None, so it never prints.

How to fix

Cheapest fix is a one-line constant bump in utils/process_agentic_result.py:

_HF_DATASET = "semianalysisai/cc-traces-weka-no-subagents-051226"

Plus the matching path in utils/test_process_agentic_result.py:408 (hf_cache / "datasets--semianalysisai--cc-traces-weka-no-subagents-051226" / "snapshots" / "abc"). Better: read the dataset name from a shared module-level constant (or env var) that both resolve_trace_source and process_agentic_result.py consume, so future dataset bumps can't desync again. As a regression guard, add an assertion in the test that the constant equals what resolve_trace_source actually downloads — or simply have the test set _HF_DATASET via the env-based shared constant rather than hardcoding the path.
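One possible shape for the shared-constant approach and the regression guard is sketched below. Everything here is an assumption for illustration: the `AGENTIC_TRACE_DATASET` environment variable, `trace_dataset`, and `downloaded_dataset` are hypothetical names, and `shell_line` stands in for the relevant line of benchmarks/benchmark_lib.sh rather than reading the file.

```python
from __future__ import annotations

import os
import re

DEFAULT_DATASET = "semianalysisai/cc-traces-weka-no-subagents-051226"


def trace_dataset() -> str:
    """Resolve the dataset repo id; AGENTIC_TRACE_DATASET is a hypothetical override."""
    return os.environ.get("AGENTIC_TRACE_DATASET", DEFAULT_DATASET)


def downloaded_dataset(shell_text: str) -> str | None:
    """Extract the repo id from an `hf download --repo-type dataset <id>` line."""
    m = re.search(r"hf download --repo-type dataset\s+(\S+)", shell_text)
    return m.group(1) if m else None


# Stand-in for the download line in resolve_trace_source.
shell_line = (
    "hf download --repo-type dataset "
    "semianalysisai/cc-traces-weka-no-subagents-051226"
)

os.environ.pop("AGENTIC_TRACE_DATASET", None)  # use the default for the demo
in_sync = downloaded_dataset(shell_line) == trace_dataset()
print(in_sync)
```

A test built along these lines would fail as soon as the producer and consumer names drift apart, instead of passing because both sides share the same stale value.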

claude · 2026-05-16T01:13:06Z


-    is_multinode = env_bool('IS_MULTINODE')
-    tp = env_int('TP', 1)
-    ep = env_int('EP_SIZE', 1)
-    dp_attention = os.environ.get('DP_ATTENTION', 'false')
+def build_agg(
+    records: list[dict],
+    aggregate: dict,
+    server_metrics: dict,
+) -> dict:
+    """Compose the agg_*.json body from the three aiperf inputs."""
+    is_multinode = env_bool("IS_MULTINODE")
+    tp = env_int("TP", 1)
+    ep = env_int("EP_SIZE", 1)
+    dp_attention = os.environ.get("DP_ATTENTION", "false")
    num_gpus = tp

    if is_multinode:
-        prefill_num_workers = env_int('PREFILL_NUM_WORKERS')
-        prefill_tp = env_int('PREFILL_TP')
-        prefill_ep = env_int('PREFILL_EP', 1)
-        prefill_dp_attention = os.environ.get('PREFILL_DP_ATTN', 'false')
-        decode_num_workers = env_int('DECODE_NUM_WORKERS')
-        decode_tp = env_int('DECODE_TP')
-        decode_ep = env_int('DECODE_EP', 1)
-        decode_dp_attention = os.environ.get('DECODE_DP_ATTN', 'false')
+        prefill_num_workers = env_int("PREFILL_NUM_WORKERS")
+        prefill_tp = env_int("PREFILL_TP")


🟡 build_agg() in utils/process_agentic_result.py sets both "num_requests_total" and "num_requests_successful" to len(records), but load_records() above explicitly filters out error records via 'if obj.get("error"): continue'. The two fields are therefore always equal — any downstream consumer computing a failure rate as 1 - successful/total will see 0% even when aiperf actually had errored requests. Fix is small: count error records during load_records and surface the count, or pull the true total from the aiperf aggregate JSON (request_count is already loaded but unused for these keys).

Extended reasoning...

What the bug is

utils/process_agentic_result.py rewrites the legacy agg JSON for the aiperf-based agentic pipeline. The legacy CSV-based implementation correctly distinguished the two fields:

# legacy (removed in this PR)
"num_requests_total": len(rows),  # all CSV rows
"num_requests_successful": len(successful),  # filtered by success == 'True'

The new code sets both to the same value:

# utils/process_agentic_result.py (build_agg, ~lines 601-603)
"num_requests_total": len(records),
"num_requests_successful": len(records),

But records is the output of load_records() (lines 93–105), which explicitly drops error rows:

obj = json.loads(line)
if obj.get("error"):
    continue
records.append(obj)

So records only contains successful requests; the two emitted fields are mathematically identical for every run.

Step-by-step proof

1. aiperf writes profile_export.jsonl with one line per request. Failed requests carry an error key (verified in aiperf source).
2. load_records() reads that file and skips any line where obj.get("error") is truthy.
3. build_agg() then does "num_requests_total": len(records) and "num_requests_successful": len(records), both from the same records list.
4. Imagine a run with 950 records emitted by aiperf, 50 of which carry "error": "...". load_records() returns a list of 900 entries. The agg JSON now claims num_requests_total = 900 and num_requests_successful = 900, so any consumer computing 1 - successful/total reads 0% failure rate even though aiperf saw 50 failures.

Why existing code doesn't prevent it

aiperf's --failed-request-threshold 0.05 aborts the run upstream above 5% error rate, but that's a coarse gate — non-zero error counts below 5% still pass through silently, and the labels still lie about the true total. The aggregate JSON (profile_export_aiperf.json) IS loaded by this script and contains request_count (the true total including errors) but it is never consulted for these two keys.

Impact

In-repo blast radius is bounded: utils/summarize.py headers don't read these fields, and the only consumer asserting on them is utils/test_process_agentic_result.py:491. The two fields nominally exist to track distinct quantities — that's why the pre-PR code emitted them separately — so any external dashboard / script (downstream Pareto analysis, the team's spreadsheets, etc.) computing failure rate from the emitted schema will see 0% regardless of actual aiperf error count. This is a silent schema regression.

How to fix

Two clean options:

1. Count errors during loading. Change load_records to return (records, error_count) (or a second pass), then in build_agg:

   "num_requests_total": len(records) + error_count,
   "num_requests_successful": len(records),

2. Pull from the aggregate. aggregate["request_count"] is already loaded:

   "num_requests_total": aggregate.get("request_count", len(records)),
   "num_requests_successful": len(records),

Either restores the original semantics of the two-field design.
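The first option can be sketched end to end with the 950-record example from the proof. This is an illustration under stated assumptions, not the PR's code: the function takes an iterable of JSONL lines rather than a file path, and the `id` field and `"timeout"` error string are made-up placeholders.

```python
import json


def load_records(lines):
    """Split JSONL lines into successful records and a count of errored ones."""
    records, error_count = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        if obj.get("error"):
            error_count += 1  # keep the count instead of dropping silently
            continue
        records.append(obj)
    return records, error_count


# Synthetic input: 950 lines, 50 of which carry an error key.
lines = [json.dumps({"id": i}) for i in range(900)]
lines += [json.dumps({"id": 900 + i, "error": "timeout"}) for i in range(50)]

records, error_count = load_records(lines)
total = len(records) + error_count
failure_rate = 1 - len(records) / total
print(total, len(records), round(failure_rate, 4))
```

With the error count preserved, the two emitted fields differ again (950 versus 900) and a consumer's 1 - successful/total comes out at roughly 5.3% rather than 0%.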

claude · 2026-05-16T01:13:06Z

+    def _final_value(metric_name: str) -> float | None:
+        entry = metrics_by_name.get(metric_name)
+        if not isinstance(entry, dict):
+            return None
+        series = entry.get("series") or []
+        if not isinstance(series, list):
+            return None
+        for stats_key in ("total", "max", "avg"):
+            agg = 0.0
+            found = False
+            for s in series:
+                if not isinstance(s, dict):
+                    continue
+                stats = s.get("stats")
+                if not isinstance(stats, dict):
+                    continue
+                v = stats.get(stats_key)
+                if v is None:
+                    continue
+                try:
+                    agg += float(v)
+                    found = True
+                except (TypeError, ValueError):
+                    continue
+            if found:
+                return agg
+        return None


🟡 compute_cache_stats._final_value() sums values across all per-engine metric series, which is correct for counters (vllm:prefix_cache_hits/queries, vllm:prompt_tokens, vllm:kv_offload_bytes_*) but wrong for the one percentage gauge in the mapping at lines 489–498 — vllm:cpu_kv_cache_usage_perc → cpu_kv_cache_usage_pct. With the DP-attn configs this PR adds (dsv4-fp4-b200-vllm dp-attn:true, dsv4-fp4-b300-vllm, dsv4-fp8-h200-vllm), vLLM emits one /metrics series per DP engine, so two engines at 50% report 100% and eight at 50% report 400%. Easiest fix: pick avg (not total) for percentage gauges, e.g. a small per-metric strategy table keyed on metric name.

Extended reasoning...

What the bug is. _final_value() walks ("total", "max", "avg") and, for the first stats key any series has, accumulates agg += float(v) across every series entry. The intent (see the comment at lines 447–448: "We aggregate across series (multiple endpoints / label sets) and prefer total for counters, then max/avg for gauges") is to sum counters and aggregate gauges, but the implementation sums in both cases.

Which metric is affected. Of the five metrics looked up via this helper for compute_cache_stats:

vllm:prefix_cache_hits / vllm:prefix_cache_queries — counters. Summing across series (and dividing in lockstep at lines 481–482) is correct.

vllm:cpu_prefix_cache_hits / vllm:cpu_prefix_cache_queries — counters, same lockstep ratio.

vllm:prompt_tokens / vllm:generation_tokens — counters.

vllm:kv_offload_bytes_* / vllm:kv_offload_time_* — counters.

vllm:cpu_kv_cache_usage_perc → cpu_kv_cache_usage_pct — a gauge expressing a percentage. This one is wrong when there are multiple series.

Why DP-attn configs trigger it. In a single-engine layout (TP-only), vLLM exposes one /metrics series, so the loop sums a single value and the result is right by accident. With DP-attn (added in this PR for dsv4-fp4-b200-vllm, dsv4-fp4-b300-vllm, dsv4-fp8-h200-vllm), each DP engine runs its own scheduler and emits its own series — typically tagged by engine index. The loop now sums N independent gauge readings.

Step-by-step proof. Take the dsv4-fp8-h200-vllm agentic-coding entry added in this PR:

- { tp: 8, ep: 8, dp-attn: true, offloading: none, conc-list: [1, 2, 4, 8, 16] }

The launcher (dsv4_fp8_h200.sh) runs --data-parallel-size 8 → 8 DP engines, each exporting vllm:cpu_kv_cache_usage_perc. Say each engine has its CPU offload pool 50% full. The scrape produces a metrics dict shaped like:

"vllm:cpu_kv_cache_usage_perc": {
  "series": [
    {"labels": {"engine": "0"}, "stats": {"max": 50.0, "avg": 50.0, ...}},
    {"labels": {"engine": "1"}, "stats": {"max": 50.0, "avg": 50.0, ...}},
    ... 6 more identical entries ...
  ]
}

_final_value() tries "total" first — not present on a gauge — then "max", which is present on every series. It iterates and accumulates agg = 50 + 50 + 50 + 50 + 50 + 50 + 50 + 50 = 400.0, then returns 400.0. That value is written to cpu_kv_cache_usage_pct in the agg_*.json payload. The true aggregate utilization is ~50%, not 400%.

Impact. cpu_kv_cache_usage_pct is a diagnostic field for analyzing CPU-offload behaviour, not a headline metric. utils/summarize.py doesn't read it (verified — it consumes only the throughput/latency keys plus topology fields), so summary tables are unaffected. But the field is published in the per-run JSON and would mislead anyone inspecting the offload regime of the new DP-attn configs — exactly the configs this PR is intended to characterize. The misreport is also obviously wrong (>100%, often by a large multiple) so it's likely to be spotted, but it crowds noise into the very signal the new sweeps are designed to produce.

Suggested fix. Cheapest correct version is a per-metric aggregation strategy table, e.g.:

_AGG = {
    "vllm:cpu_kv_cache_usage_perc": "avg",
    # everything else defaults to summing "total" -> "max" -> "avg"
}

def _final_value(metric_name):
    mode = _AGG.get(metric_name, "sum")
    ...
    if mode == "avg":
        # mean across series of stats["avg"] (or "max" if "avg" missing)
    else:
        # existing sum behavior

Alternatively, pick avg instead of max when the only key available is a gauge stat — but the table form is clearer about which metrics are gauges. Existing tests (test_processor_aggregates_across_multiple_series) cover only the counter case; adding a percentage-gauge fixture would lock the behaviour in.
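A runnable version of the strategy-table idea might look like the following. It is a sketch under assumptions: `final_value` takes the metrics dict as an explicit argument (the original helper is a closure), the fixture shapes mirror the example in this comment, and the counter values 10 and 15 are invented for the demo.

```python
from __future__ import annotations

# Gauges that should be averaged across series; everything else is summed.
_AGG = {"vllm:cpu_kv_cache_usage_perc": "avg"}


def final_value(metrics_by_name: dict, metric_name: str) -> float | None:
    entry = metrics_by_name.get(metric_name)
    if not isinstance(entry, dict):
        return None
    series = entry.get("series") or []
    if not isinstance(series, list):
        return None
    mode = _AGG.get(metric_name, "sum")
    # Gauges prefer "avg" then "max"; counters keep the original key order.
    keys = ("avg", "max") if mode == "avg" else ("total", "max", "avg")
    for key in keys:
        values = []
        for s in series:
            stats = s.get("stats") if isinstance(s, dict) else None
            if not isinstance(stats, dict) or stats.get(key) is None:
                continue
            try:
                values.append(float(stats[key]))
            except (TypeError, ValueError):
                continue
        if values:
            total = sum(values)
            return total / len(values) if mode == "avg" else total
    return None


metrics = {
    "vllm:cpu_kv_cache_usage_perc": {
        "series": [
            {"labels": {"engine": str(i)}, "stats": {"max": 50.0, "avg": 50.0}}
            for i in range(8)
        ]
    },
    "vllm:prefix_cache_hits": {
        "series": [{"stats": {"total": 10}}, {"stats": {"total": 15}}]
    },
}

gauge = final_value(metrics, "vllm:cpu_kv_cache_usage_perc")
counter = final_value(metrics, "vllm:prefix_cache_hits")
print(gauge, counter)
```

Eight engines at 50% now report 50.0 rather than 400.0, while the counter path still sums across series.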

Per PR review: this branch must not modify any existing master-yaml entry. If our agentic-coding work needs different metadata (image, runner, or extra scenarios) than what main has for an existing entry, the change must live in a separate sibling config-key with the '-agentic' suffix, leaving the original entry byte-identical to origin/main so the fixed-seq-len test surface and any other consumer of the original key sees zero change from this PR.

For every entry that previously diverged from main, this commit:

1. Restores the original entry text verbatim to match origin/main
2. Emits a sibling '<name>-agentic' entry containing our metadata overrides + only the agentic-coding scenarios block

amd-master.yaml: +6 sibling -agentic entries
- glm5.1-fp4-mi355x-sglang-agentic
- kimik2.5-fp4-mi355x-vllm-agentic
- minimaxm2.5-fp8-mi300x-vllm-agentic
- minimaxm2.5-fp8-mi325x-vllm-agentic
- minimaxm2.5-fp8-mi355x-vllm-agentic
- qwen3.5-fp8-mi355x-sglang-agentic

nvidia-master.yaml: +16 sibling -agentic entries
- dsv4-fp4-b200-vllm-agentic, dsv4-fp4-b300-vllm-agentic, dsv4-fp8-h200-vllm-agentic, glm5-fp8-b200-sglang-agentic, gptoss-fp4-b200-vllm-agentic, kimik2.5-fp4-b200-vllm-agentic, kimik2.5-fp4-b300-vllm-agentic, kimik2.5-int4-b200-vllm-agentic, kimik2.5-int4-h200-vllm-agentic, minimaxm2.5-fp4-b200-vllm-agentic, minimaxm2.5-fp8-b200-vllm-agentic, minimaxm2.5-fp8-b300-vllm-agentic, minimaxm2.5-fp8-h100-vllm-agentic, minimaxm2.5-fp8-h200-vllm-agentic, qwen3.5-bf16-b200-sglang-agentic, qwen3.5-fp8-b200-sglang-agentic

Brand-new entries that don't exist on main (only kimik2.5-int4-h100-vllm in this PR) stay as-is with agentic-coding scenarios only, with no fixed-seq-len block added.

Dispatch instructions: agentic sweeps now reference the '-agentic' suffixed config-keys, e.g.

gh workflow run ... -f generate-cli-command="test-config --config-keys kimik2.5-int4-h200-vllm-agentic --scenario-type agentic-coding"
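The "siblings only" rule from this commit message lends itself to an automated check. The sketch below is hypothetical and simplified: `check_config_split` compares already-parsed dicts (weaker than the byte-identical requirement, which would need a text diff), and the runner values in the demo are illustrative placeholders rather than the contents of the real master yaml files.

```python
from __future__ import annotations


def check_config_split(
    main_cfg: dict, branch_cfg: dict, new_keys: set[str]
) -> list[str]:
    """Return violations of the rule that existing entries must stay unchanged."""
    problems = []
    # Every entry on main must survive unchanged on the branch.
    for key, entry in main_cfg.items():
        if branch_cfg.get(key) != entry:
            problems.append(f"modified or removed existing entry: {key}")
    # Added entries must be '-agentic' siblings or explicitly allowed new keys.
    for key in branch_cfg.keys() - main_cfg.keys():
        if not key.endswith("-agentic") and key not in new_keys:
            problems.append(f"unexpected new entry without -agentic suffix: {key}")
    return sorted(problems)


main_cfg = {"kimik2.5-int4-h200-vllm": {"runner": "h200"}}
good = {
    **main_cfg,
    "kimik2.5-int4-h200-vllm-agentic": {"runner": "h200-dgxc"},
    "kimik2.5-int4-h100-vllm": {"runner": "h100"},  # placeholder metadata
}
bad = {"kimik2.5-int4-h200-vllm": {"runner": "h200-dgxc"}}

ok = check_config_split(main_cfg, good, {"kimik2.5-int4-h100-vllm"})
violations = check_config_split(main_cfg, bad, set())
print(ok)
print(violations)
```

The allowlist argument exists because brand-new entries such as kimik2.5-int4-h100-vllm legitimately lack the suffix.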

Every split sibling now leads with a comment block explaining why it exists. Two flavors:

Metadata divergence: lists the specific field(s) that differ from main:

# Diverged from kimik2.5-int4-h200-vllm (agentic-coding sibling). Reasons below;
# the original kimik2.5-int4-h200-vllm entry is left identical to origin/main so
# its fixed-seq-len sweep is unaffected.
# - runner: 'h200' -> 'h200-dgxc'

Scenarios-only divergence: metadata matches main exactly; the split exists because we added or modified the agentic-coding scenarios:

# Diverged from dsv4-fp4-b300-vllm (agentic-coding sibling). Metadata is
# identical to origin/main's dsv4-fp4-b300-vllm; the split exists because this
# PR adds an agentic-coding scenarios block that differs from main
# (either main had none or had a different conc/offload sweep).
# The original dsv4-fp4-b300-vllm entry stays byte-identical to origin/main.

Annotations were generated programmatically from the field-level diff against origin/main (utils/aiperf scripted; no manual edits). Existing in-entry rationale comments are preserved below the header.
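The generator script itself is not shown in this PR page, so the following is only a guess at how a field-level diff could be rendered into the metadata-divergence header. `divergence_header` is a hypothetical function, the `scenarios` and `image` field names are assumptions about the entry layout, and the image value is a placeholder.

```python
def divergence_header(name: str, main_entry: dict, sibling_entry: dict) -> list:
    """Render comment lines listing metadata fields that differ from main."""
    lines = [f"# Diverged from {name} (agentic-coding sibling)."]
    for field in sorted(main_entry.keys() | sibling_entry.keys()):
        if field == "scenarios":
            continue  # scenarios blocks are expected to differ
        old, new = main_entry.get(field), sibling_entry.get(field)
        if old != new:
            lines.append(f"# - {field}: {old!r} -> {new!r}")
    return lines


main_entry = {"runner": "h200", "image": "example/image:tag"}
sibling_entry = {
    "runner": "h200-dgxc",
    "image": "example/image:tag",
    "scenarios": {"agentic-coding": []},
}

header = divergence_header("kimik2.5-int4-h200-vllm", main_entry, sibling_entry)
print("\n".join(header))
```

Only the runner line is emitted here because the image matches and the scenarios block is skipped by design.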

…oad modes

Extends kimik2.5-fp4-mi355x-vllm-agentic with TP=4 sweep at cliff-region concurrencies on both offload modes. MI355X has 288 GB HBM/GPU so the TP=4 half-node weight footprint (~62 GB/GPU) leaves plenty of headroom unlike B200's 192 GB constraint. Restricted to cliff concurrencies (no low-conc points) since the TP=4 vs TP=8 comparison is most useful at the KV-pressure transition, not at lightly-loaded points.

+    out: dict[str, list[dict]] = {}
+    traces_dir = _hf_traces_dir()
+    if traces_dir is None:
+        _TRACE_METADATA_CACHE = out


+            )
+        if per_turn:
+            out[trace_id] = per_turn
+    _TRACE_METADATA_CACHE = out


cquil11 and others added 30 commits April 28, 2026 09:11

cleanup

9b12096

agentic minimax-fp8-b300: revert to standard b300 runner tag

689ef0e

agentic minimax-fp8-b300: bump cpu DRAM offload to 2.2 TB (B300 has plenty)

e074201

cquil11 added 10 commits May 15, 2026 12:44

cquil11 requested a review from a team May 16, 2026 01:03

cquil11 requested review from 1am9trash, billishyahao, chunfangamd, jgangani, kedarpotdar-nv, seungrokj and yctseng0211 as code owners May 16, 2026 01:03

github-project-automation Bot added this to InferenceMAX Board May 16, 2026

claude Bot reviewed May 16, 2026

cquil11 added 4 commits May 15, 2026 21:58

Merge branch 'main' into chore/agentx-v0.2-aiperf-testing-reopen

3256ae3

github-code-quality Bot found potential problems May 17, 2026

cquil11 merged commit e92a9bf into main May 17, 2026
8 checks passed

cquil11 deleted the chore/agentx-v0.2-aiperf-testing-reopen branch May 17, 2026 16:29

github-project-automation Bot moved this to Done in InferenceMAX Board May 17, 2026

claude Bot mentioned this pull request May 17, 2026

Update qwen3.5-fp8-mi355x-sglang and mtp SGLang ROCm image to v0.5.12-rocm720-mi35x-20260517 #1444

Merged

cquil11 merged 134 commits into main from chore/agentx-v0.2-aiperf-testing-reopen