Migrate agentic-coding benchmarks to aiperf v0.2 (reopened) #1393
Adds end-to-end agentic-coding benchmark infrastructure on top of the
existing fixed-seq-len harness. New components:
Trace replayer
- New utils/trace-replay submodule (kv-cache-tester @ agentx-minimized)
driving multi-turn HF-dataset traces against any OpenAI-compatible
endpoint at fixed concurrency.
- --debug-trace captures full per-request prompt/response, every
streamed chunk via chunk.model_dump(), and integer token IDs
(apply_chat_template prompt + logprobs.content completion) into
debug_trace.jsonl.
- Per-model delta-field abstraction (gpt-oss → delta.reasoning, default
→ delta.reasoning_content) so reasoning-heavy responses are counted
and appended to conversation history correctly.
- Input-token metric reads server's usage.prompt_tokens (authoritative)
rather than the local apply_chat_template estimate which breaks for
gpt-oss harmony's chat template.
- Per-user 8-token salt prefix on conversation[0] so two in-flight
users replaying the same trace_id don't accidentally share KV-cache
blocks.
- Period summary: counts up elapsed instead of down remaining; replaces
the dispatch-jitter "Wait time" with the trace's true "Inter-turn
time" sourced from RequestMetrics.delay_expected.
- 5s quiesce between warmup completion and metrics-collector start so
warmup-tail prefill doesn't bleed into period 1.
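The per-model delta-field routing described in the list above can be sketched roughly as follows. This is a minimal illustration, not the replayer's actual code: the names `reasoning_field_for` and `extract_reasoning`, the substring table, and the dict-shaped delta are assumptions. Only the mapping itself (gpt-oss uses `delta.reasoning`, everything else uses `delta.reasoning_content`) comes from the description.

```python
# Hypothetical sketch: choose which streamed-delta field carries reasoning text.
# Only the mapping (gpt-oss -> "reasoning", default -> "reasoning_content")
# is taken from the PR description; all names here are illustrative.
REASONING_FIELD_BY_SUBSTRING = {
    "gpt-oss": "reasoning",
}
DEFAULT_REASONING_FIELD = "reasoning_content"


def reasoning_field_for(model_name: str) -> str:
    """Return the delta field name to read for the given model."""
    lowered = model_name.lower()
    for substring, field in REASONING_FIELD_BY_SUBSTRING.items():
        if substring in lowered:
            return field
    return DEFAULT_REASONING_FIELD


def extract_reasoning(model_name: str, delta: dict) -> str:
    """Pull reasoning text out of one streamed delta (assumed dict-shaped)."""
    return delta.get(reasoning_field_for(model_name)) or ""
```

Keeping the routing in a lookup table means a new model family only needs one more substring entry rather than a new code path.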
Workflow plumbing
- e2e-tests.yml: workflow_dispatch + workflow_call inputs for
debug-trace (boolean) and duration-override (string seconds), forwarded
to test-sweep-agentic and test-sweep-multi-node-agentic jobs.
- benchmark-tmpl.yml + benchmark-multinode-tmpl.yml: debug-trace input
mapped to DEBUG_TRACE env var; duration override threads through to
matrix.config.duration.
- benchmark_lib.sh: build_replay_cmd / resolve_trace_source /
install_agentic_deps / write_agentic_result_json helpers; consumes
DEBUG_TRACE → --debug-trace.
- runners/launch_*.sh: shared agentic mode dispatch + scenario routing.
- runners/launch_b200-dgxc-slurm.sh → launch_b200-dgxc.sh rename to
match the actual runner.name observed by the workflow.
Result aggregation
- utils/agentic-benchmark/{bench,analysis,scripts}: metrics collector
(vllm/sglang Prometheus parsers), pareto plotter, per-config
distribution analyzer, sweep aggregator.
- utils/process_agentic_result.py: per-job results.json builder.
- utils/matrix_logic: agentic-coding scenario plumbing in
generate_sweep_configs.py + validation.py.
Examples (one per vendor)
- benchmarks/single_node/agentic/dsr1_fp4_b200.sh — NVIDIA.
- benchmarks/single_node/agentic/dsr1_fp4_mi355x.sh — AMD.
- Matching agentic-coding sections in nvidia-master.yaml
(dsr1-fp4-b200-sglang) and amd-master.yaml (dsr1-fp4-mi355x-sglang).
All other model-specific launchers and matrix entries are deliberately
left out of this PR; downstream PRs add them on a per-model basis.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same value, two names — collapse to one. Workflow templates already
exposed both CONC and USERS env vars (USERS was a mirror of inputs.conc),
and the agentic matrix entries carried both `users: int` and
`conc: [users]`. Drop the duplicates and standardize on conc/CONC:
- benchmark-tmpl.yml / benchmark-multinode-tmpl.yml: drop redundant
USERS env var (CONC remains)
- e2e-tests.yml / run-sweep.yml: pass `conc: ${{ matrix.config.conc }}`
to template; build agentic conc-list as `'[${{ matrix.config.conc }}]'`
since matrix.config.conc is now a scalar
- generate_sweep_configs.py: agentic entries emit Fields.CONC.value (int)
only; loop variable renamed from `users` to `conc`; exp-name template
now uses `_conc{N}` instead of `_users{N}`
- validation.py: drop Fields.USERS; agentic Pydantic models use `conc: int`
- process_agentic_result.py: read CONC env var, emit single `"conc"` key
- collect_sweep_results.py: regex updated to match `_conc{N}_offload`
- benchmark_lib.sh / agentic launcher scripts: $USERS → $CONC
The trace-replayer's --start-users / --max-users CLI flags are upstream's
API and are left unchanged; benchmark_lib.sh just passes $CONC into them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pick up these submodule commits (callanjfox/kv-cache-tester):
- 7b7f883 silence kimi: target the actual loaded-tokenizer module logger
- 5b87e43 silence kimi: replace static logger lookup with content filter
- 3394450 silence Kimi tokenization_kimi.py per-call encode warning
- 7ad6a9e delta-field map: add 'kimi' substring (uses delta.reasoning like gpt-oss)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5 new agentic-coding launcher scripts brought over from chore/agentx-integration, with USERS → CONC normalization:
- benchmarks/single_node/agentic/gptoss_fp4_h100.sh
- benchmarks/single_node/agentic/gptoss_fp4_h200.sh
- benchmarks/single_node/agentic/gptoss_fp4_mi300x.sh
- benchmarks/single_node/agentic/gptoss_fp4_mi325x.sh
- benchmarks/single_node/agentic/kimik2.5_fp4_b200.sh
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings utils/agentic-benchmark/analysis/ (plot_pareto.py — sweep visualizer for cross-config performance comparison) and updates requirements.txt with transformers/xlsxwriter/tqdm/datasets/tiktoken needed by the analyzer + by trace-replay's tokenizer paths. The bench/ directory is intentionally NOT added: bench/metrics_collector.py duplicated utils/trace-replay/server_metrics.py and was already removed on this branch; bench/run_metrics_collector.py depends on it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds agentic-coding scenario blocks to the master configs for the five models whose launchers were just brought over:
- kimik2.5-fp4-b200-vllm (image bumped to v0.19.1)
- gptoss-fp4-h100-vllm
- gptoss-fp4-h200-vllm
- gptoss-fp4-mi300x-vllm
- gptoss-fp4-mi325x-vllm
Each scenario sweeps tp 4/8 (and 1/2 on AMD/H200) at offloading=none for low/mid concurrency and offloading=cpu for high concurrency, with a crossover at conc=64. Other agentic-coding sections present on chore/agentx-integration (trtllm/srt-slurm based) are left for follow-up since several of the underlying model entries were restructured by main.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The agentic-coding scenario type uses benchmarks/single_node/agentic/ launchers, gated by SCENARIO_SUBDIR='agentic/' from benchmark-tmpl.yml. b200-cw, b200-dgxc, b200-nb, and b300-nv all built BENCH_BASE without honoring SCENARIO_SUBDIR, so dispatch always landed in single_node/ even for agentic runs. Other runners (h100-*, h200-*, mi*) already had this plumbing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…H200
- minimaxm2.5-fp8-b200-vllm
- qwen3.5-bf16-b200-sglang
- glm5-fp8-b200-sglang
- dsv4-fp8-h200-vllm
Each launcher mirrors its fixed-seq-len sibling but: uses CONC env for max-num-seqs / cuda-graph-max-bs, sources benchmark_lib.sh, calls the trace replayer via build_replay_cmd, and emits the agentic result JSON. Master config gets an agentic-coding scenario block sweeping conc 1..32 at offloading=none; b200-dsv4 entries left untouched since that runner type isn't registered in runners.yaml.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- minimaxm2.5-fp8-mi355x-vllm
- qwen3.5-fp8-mi355x-sglang
- glm5.1-fp4-mi355x-sglang
- kimik2.5-fp4-mi355x-vllm
Each mirrors its fixed-seq-len sibling with ROCm-specific tweaks (VLLM_ROCM_USE_AITER, ROCM_QUICK_REDUCE_QUANTIZATION, etc.) and feeds CONC into max-num-seqs / cuda-graph-max-bs. Master configs gain matching agentic-coding scenarios sweeping conc 1..32 at offloading=none. dsv4-fp8-mi355x is intentionally skipped since the existing fixed-seq launcher requires a bespoke vLLM PR rebuild that adds risk to trace-replayer testing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…5-fp4
Phase-2 coverage extension across precision (int4 vs fp4 for kimi, fp4 vs fp8 for minimax) and runner (b200 vs h100/h200 for gptoss).
- gptoss-fp4-b200-vllm
- kimik2.5-int4-b200-vllm
- minimaxm2.5-fp4-b200-vllm
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bf16 image lmsysorg/sglang:nightly-dev-20260216-d3bae71e fails on B200 with PyTorch/CuDNN compatibility errors at server start. Add an fp8 variant using lmsysorg/sglang:v0.5.9-cu130-amd64 to provide a working qwen3.5 trace-replayer test on NVIDIA. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Documents the launcher matrix at benchmarks/single_node/agentic/, how to dispatch debug runs via gh workflow run, and what fields in the result JSON to inspect for verification (num_requests_successful, total_generation_tokens, median_ttft, median_tpot, total_tput_tps, etc.). Notes the two known-failing configs (qwen3.5 sglang on B200: pytorch/pytorch#168167; dsv4-fp4-b200-sglang: runner b200-dsv4 not in runners.yaml) so future testers don't repeat them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
15 debug runs across 7 model families × NVIDIA/AMD HW. 10 PASS / 5 FAIL (1 still in flight); failures are all image- or vLLM-parser-level, not replayer bugs. Replayer's per-model delta-field routing + long-prefill agentic flow verified end-to-end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All 16 dispatched runs are now complete. Final tally: 10 PASS, 6 FAIL. The 6 failures are all infrastructure or vLLM-side issues (PyTorch/CuDNN image incompatibility, vLLM deepseek_v4 reasoning parser bug, sglang-rocm qwen3.5 streaming, SLURM time limit) — none indicate a bug in the trace replayer itself. All 7 active model families have at least one PASS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The exp-name template emits offload{none|cpu|ssd} (per the matrix
generator's f"{model_code}_tp{tp}_conc{conc}_offload{offloading}"),
but the regex was looking for offload(on|off) — so every artifact
directory failed to parse, the aggregator wrote nothing to aggregated/,
and collect-agentic-results uploaded no files ("No files were found
with the provided path: aggregated/").
Verified the fix matches real artifact names from this branch's runs
(b200/h100, none/cpu).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
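To illustrate the mismatch described in this commit, here is a small sketch of a parser for names built from the quoted f-string. The pattern and the `parse_exp_name` helper are assumptions for illustration; the real regex in collect_sweep_results.py may be anchored differently or tolerate extra prefixes and suffixes on artifact directory names.

```python
import re
from typing import Optional

# Hypothetical pattern; follows the layout of the f-string quoted above:
#   f"{model_code}_tp{tp}_conc{conc}_offload{offloading}"
EXP_NAME_RE = re.compile(
    r"^(?P<model_code>.+)_tp(?P<tp>\d+)_conc(?P<conc>\d+)"
    r"_offload(?P<offloading>none|cpu|ssd)$"
)


def parse_exp_name(name: str) -> Optional[dict]:
    """Return the parsed fields, or None when the name does not match."""
    match = EXP_NAME_RE.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    fields["tp"] = int(fields["tp"])
    fields["conc"] = int(fields["conc"])
    return fields
```

With the old `offload(on|off)` alternation, a name ending in `_offloadcpu` would fall through to the `None` branch, which is consistent with the aggregator writing nothing.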
For the 5 vllm models (kimik2.5-fp4/int4-b200, minimaxm2.5-fp8-b200,
gptoss-fp4-b200, kimik2.5-fp4-mi355x, minimaxm2.5-fp8-mi355x): add
offloading=cpu at high concurrency (typically conc 64+) where KV cache
pressure exceeds GPU HBM. Overlap at conc=64 between none and cpu so
the crossover region is sampled by both. cpu-offload sweep tail uses
larger conc points (96, 128, 192, 256) since the only reason to enable
cpu offload is when concurrency stresses HBM.
For glm5-fp8-b200-sglang and glm5.1-fp4-mi355x-sglang (sglang launchers
without the OFFLOADING=cpu plumbing): expand the conc range on
offloading=none. sglang manages its own KV eviction via the radix
cache, so concurrency above HBM capacity is handled internally rather
than via vLLM's --kv_offloading_backend.
dsr1-fp4-{b200,mi355x}-sglang sweeps already cover conc 1..256 (b200
also has tp=4 ep=4 / tp=8 ep=8 split and tp=8 going to conc=512), so
left as-is.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both nodes are currently dropping every job that lands on them:
- NCCL barrier dies during sglang Scheduler.init_model_worker with RuntimeError: NCCL error: unhandled cuda error (stale CUDA contexts from a previous job that didn't tear down cleanly)
- HuggingFace CAS download for moonshotai/Kimi-K2.5 fails with RuntimeError: Data processing error: CAS service error : IO Error: No space left on device (os error 28)
Adding --exclude=gpu-10,gpu-15 to salloc keeps SLURM from allocating to them. Drop this once sa-shared admins clean up the nodes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vLLM's OffloadingConnector (--kv_offloading_backend native) is incompatible with the hybrid-KV-cache-manager (HMA) for models with mixed attention layouts. When HMA is enabled, the OffloadingConnector init fails with: RuntimeError: Worker failed with error 'Connector OffloadingConnector does not support HMA but HMA is enabled. Please set --disable-hybrid-kv-cache-manager'. This bit kimik2.5-fp4-mi355x's full sweep: every offload=cpu sub-job failed with the above error while every offload=none sub-job passed (see run 25117841192). Kimi-K2.5 uses hybrid attention so HMA kicks in. MiniMax-M2.5 doesn't, which is why its prior cpu-offload sweeps passed even with the broken flag. Switching all 11 cpu-offload launchers to --disable-hybrid-kv-cache-manager is correctness-safe across the board: HMA is a pure optimization, and disabling it is required for OffloadingConnector regardless of model. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nfigs
KV offloading via OffloadingConnector hits multiple upstream bugs on
older vllm tags:
- v0.15.1 (gpt-oss-fp4-b200, kimi-int4-b200): flashinfer kv_cache_permute
assertion in TRTLLM-attention path
- v0.18.0-rocm (kimi-fp4-mi355x): HMA + OffloadingConnector incompat
- v0.19.0 (minimaxm2.5-fp8 b200/mi355x): not yet verified clean
Bumping to v0.19.1 (or v0.19.1-rocm) — proven-good on kimi-fp4-b200
(23/23 sweep PASS) and gptoss-fp4 h100/h200/mi300x/mi325x.
Add agentic-coding sections + launchers for MiniMax-M2.5 FP8 across
H100, H200, B200, B300, MI300X, MI355X (excluding MI325X). Conc ranges
sized from per-SKU GPU KV cache capacity:
KV per token (fp8, 62 layers × 8 KV heads × 128 dim × 2): ~124 KB
Per-SKU GPU cache cap with tp=4 + 0.90 mem-util:
H100 58 GB -> 0.46M tok (saturate ~conc 6)
H200 277 GB -> 2.19M tok (saturate ~conc 29)
B200 461 GB -> 3.63M tok (saturate ~conc 48)
B300 807 GB -> 6.35M tok (saturate ~conc 85)
MI300X 500 GB -> 3.93M tok (saturate ~conc 52)
MI355X 864 GB -> 6.81M tok (saturate ~conc 91)
NVIDIA configs include offload=cpu starting at the saturation point
(simple cpu offload via OffloadingConnector requires vllm ≥ 0.19.1).
AMD configs do not enable cpu offload — vllm simple offloading isn't
supported on the rocm build for these models. AMD pushes offload=none
to a higher conc to demonstrate where GPU cache saturates.
Image bumps: h100/h200/mi300x v0.18.0/v0.16.0 -> v0.19.1; b300
v0.19.0-cu130 -> v0.19.1.
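The capacity arithmetic in the table above can be reproduced with a short sketch. The helper names are hypothetical, GB is taken as 10^9 bytes, and the per-conversation token count of about 75,000 is inferred here from the ratio of the table's token capacities to its saturation points; the commit does not state it.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 1) -> int:
    """KV-cache bytes for one token; the factor 2 covers K and V."""
    return layers * kv_heads * head_dim * 2 * bytes_per_elem


def cache_capacity_tokens(cache_bytes: float, per_token_bytes: int) -> int:
    """How many tokens fit in a KV cache of the given size."""
    return int(cache_bytes // per_token_bytes)


def saturation_conc(capacity_tokens: int, tokens_per_user: int) -> int:
    """Concurrency at which resident contexts fill the cache (assumed model)."""
    return capacity_tokens // tokens_per_user


# fp8 -> 1 byte per element; layer/head/dim values are from the commit message.
per_token = kv_bytes_per_token(layers=62, kv_heads=8, head_dim=128)
h100_tokens = cache_capacity_tokens(58e9, per_token)  # H100 row: 58 GB
h100_conc = saturation_conc(h100_tokens, tokens_per_user=75_000)  # assumed size
```

The per-token figure works out to 126,976 bytes, which is the ~124 KB quoted above; the other rows follow by swapping in their cache sizes, and small differences from the table are expected because its GB values are rounded.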
vllm v0.19.1 fp8 quantization rejects tp=8 for MiniMax-M2.5: gate/up weight output_size 1536 / tp=8 = 192, not divisible by block_n=128. Same constraint at vllm/model_executor/layers/quantization/fp8.py:638.
Per fixed-seq-len reference TPs:
H100 tp=4 ep=4 (tp=8 ep=8 commented out in fixed-seq-len for fp8)
H200 fixed-seq-len has only tp=8 (broken on v0.19.1 fp8); winging tp=4
B200 tp=4 (fixed-seq-len has tp=2,4; tp=2 too tight for agentic ISL)
B300 tp=4 (primary; fixed-seq-len has tp=1,2,4 with various ep)
MI300X tp=4 (fixed-seq-len has tp=2,4)
MI355X tp=4 ep=4 (fixed-seq-len has tp=2 ep=2, tp=4 ep=4, tp=8 ep=8)
Concurrency expanded across the saturation cliff for each SKU; cpu offload range extended to 384/512 on NVIDIA where applicable.
Per empirical compute ceilings observed in prior runs (mean in-flight reqs mid-test on each platform):
H100 tp=4 ep=4 ceiling ~10 (KV cliff ~6 -> cpu zone 6-10)
H200 tp=4 ceiling ~35 (KV cliff ~29 -> cpu zone 29-35)
B200 tp=4 ceiling ~50 (KV cliff ~48 -> very narrow)
B300 tp=4 ceiling ~60 (KV cliff ~85 -> compute saturates first)
MI300X tp=4 ceiling ~20 (estimated)
MI355X tp=4 ep=4 ceiling ~60
Previous conc lists (1..256, even up to 512) wasted 30-min slots on sub-jobs that just queue 200+ requests waiting on a server only running 4-50 in flight, leading to client-side 600s timeout cascades.
New lists "creep up" to 2-3x the ceiling, then stop. NVIDIA cpu offload range narrowed to the zone between KV cliff and compute ceiling, where offloading can actually relieve KV pressure without compute already being the bottleneck. AMD (mi300x, mi355x) keeps offload=none only.
Per user feedback: past the compute ceiling, throughput plateaus and extra conc just adds queue depth and client timeouts -- wasted slots. Reallocate sampling budget to densify around the cliff(s) for each SKU.
Per-SKU strategy (compute ceiling empirical, KV cliff analytical):
H100 tp=4 ep=4 ceil 10 KV 6 -> dense 4-12 (sweet spot for cpu demo)
H200 tp=4 ceil 35 KV 29 -> dense 24-40 (narrow cpu window)
B200 tp=4 ceil 50 KV 48 -> dense 32-56 (cliffs colocated)
B300 tp=4 ceil 60 KV 85 -> dense 48-72 (compute first; cpu won't help)
MI300X tp=4 ceil 25 KV 52 -> dense 16-32 (compute first; AMD no cpu)
MI355X tp=4 ep=4 ceil 60 KV 91 -> dense 48-72 (compute first; AMD no cpu)
Dense step (every 4-8 conc) around the cliffs to resolve the inflection; sparse step (doubling) below the cliffs for baseline; one point ~1.3-1.5x ceiling to confirm plateau. NVIDIA cpu offload range overlaps with none from KV cliff to ~ceiling for direct same-conc comparison; doesn't extend past 1.3x ceiling.
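As a rough illustration of the sampling strategy in this commit (doubling below the dense window, a fixed step inside it, and one plateau check above the ceiling), here is a sketch of a generator for such a list. `build_conc_list` and its parameters are hypothetical; the lists in the master configs appear to be hand-chosen, so this will not reproduce them exactly.

```python
from typing import List


def build_conc_list(ceiling: int, dense_lo: int, dense_hi: int,
                    dense_step: int = 4,
                    plateau_factor: float = 1.4) -> List[int]:
    """Sparse doubling below dense_lo, fixed step inside [dense_lo, dense_hi],
    plus one point at plateau_factor * ceiling to confirm the plateau."""
    points = set()
    conc = 1
    while conc < dense_lo:          # sparse baseline: 1, 2, 4, ...
        points.add(conc)
        conc *= 2
    points.update(range(dense_lo, dense_hi + 1, dense_step))  # dense window
    points.add(round(ceiling * plateau_factor))               # plateau check
    return sorted(points)


# H100 row from the strategy above: ceiling 10, dense window 4-12.
h100_points = build_conc_list(ceiling=10, dense_lo=4, dense_hi=12)
```

The plateau factor of 1.4 is simply a value inside the 1.3-1.5x range named in the commit.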
- AMD launchers (mi300x, mi355x) drop VLLM_USE_SIMPLE_KV_OFFLOAD env var. SimpleCPUOffloadConnector isn't supported on rocm; native OffloadingConnector works (still passes --kv_offloading_backend native flag).
- Add cpu offload entries to AMD master configs (mi300x, mi355x).
- Add b300-p1 runner group (subset of b300 nodes 13-17 with the b300-p1 label) and target it from the b300 minimax config.
The agentic-coding benchmark IS a prefix-cache benchmark: the whole point is measuring KV reuse across multi-turn conversations and across users (with the per-user salt enabling deterministic prefix overlap). Disabling prefix caching defeats the entire purpose. Removed from 7 launchers that had it:
- dsv4_fp8_h200.sh
- gptoss_fp4_b200.sh (was in config.yaml)
- kimik2.5_fp4_mi355x.sh
- kimik2.5_int4_b200.sh
- minimaxm2.5_fp4_b200.sh
- minimaxm2.5_fp8_mi300x.sh
- minimaxm2.5_fp8_mi355x.sh
vLLM defaults to prefix caching ON when no flag is passed.
ROCM_AITER_FA was the suspect for both:
1. Worker dies on cpu offload (gpt-oss using UNIFIED_ATTN works fine on the same launcher pattern + image)
2. Prefix-cache Prometheus counters never increment (observability gap on FA backend, while UNIFIED_ATTN reports correctly on mi300x)
Swap to ROCM_AITER_UNIFIED_ATTN to test both fixes in one shot.
The cpu range needs full overlap with none past the KV cliff so the no-offload throughput collapse is visible at the same conc points where cpu offload sustains throughput.
B200 tp=4 (KV cliff conc=48):
none: [1,2,4,8,16,32,48,56,64,96,128] (was capped at 64)
cpu: [48,56,64,96,128] (was capped at 64)
B300 tp=4 (KV cliff conc=85):
none: [1,2,4,8,16,32,48,64,96,128,192] (was capped at 96)
cpu: [48,64,96,128,192] (was capped at 96)
Past the cliff, the no-offload curve should collapse (recompute storm, client-side timeouts), while cpu-offload sustains the compute ceiling.
- TP=8 none: [1, 2, 4, 8, 16, 24, 32, 40, 48] (unchanged baseline)
- TP=8 cpu: [32, 40, 48, 56] (was [1..48])
Lower concurrencies fit entirely on-GPU at MI355X's 288 GB HBM; running cpu offload at conc<32 just adds the offload-path overhead without measuring anything new. Restrict cpu to the cliff region where it actually matters, and probe one step past the prior cap with conc=56.
Final per-metric stat set is now mean / p75 / p90 / p95 / std (was mean / median / p90 / p99 / p99.9 / std). Applied across:
- utils/process_agentic_result.py: stats_for(), QPS aggregator, input/output_tokens, output_tokens_expected
- utils/summarize.py: single-node and multi-node CSV column headers and row formatters
- utils/test_process_agentic_result.py: SUMMARIZE_KEYS contract
Rationale: p99/p99.9 were dominated by trace-end stragglers and weren't useful operational signal at the concurrency we sweep; p75 captures the tail where the agentic workload actually starts diverging from the median, and p95 is the standard 'tail-but-not-catastrophe' percentile that fits between p90 and the dropped p99.
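A minimal sketch of what a helper producing the new stat set could look like, using only the standard library. This is not the repository's `stats_for()`: the name `stats_for_sketch`, the linear-interpolation percentile, and the choice of population standard deviation are all assumptions.

```python
import statistics
from typing import Dict, List, Sequence


def percentile(sorted_values: List[float], pct: float) -> float:
    """Linear-interpolation percentile over an already-sorted list."""
    if not sorted_values:
        raise ValueError("percentile() needs at least one value")
    rank = (len(sorted_values) - 1) * pct / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = rank - lo
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * frac


def stats_for_sketch(values: Sequence[float]) -> Dict[str, float]:
    """Return mean / p75 / p90 / p95 / std for one metric."""
    ordered = sorted(values)
    return {
        "mean": statistics.fmean(ordered),
        "p75": percentile(ordered, 75),
        "p90": percentile(ordered, 90),
        "p95": percentile(ordered, 95),
        "std": statistics.pstdev(ordered),  # population std; an assumption
    }
```

Whatever percentile method the real helper uses, keeping the key set in one place makes it easier for the SUMMARIZE_KEYS contract test to stay in sync.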
Picks up cquil11/aiperf@4efdd6e8 "[RecordProcessor] Drop context-overflow records for AGENTIC_REPLAY scenarios": context-overflow errors mid-trajectory are already handled by agentic_replay.handle_credit_return (recycles the conversation, spawns a fresh trajectory), so the parser-classified context_overflow records were being double-counted as both end-of-trajectory signals AND error metrics. Now they're dropped at the record_processor_service layer before the MetricRecordsMessage push -- no contribution to failure totals, no entry in profile_export.jsonl, no tick on error counters. Existing ContextOverflowCountMetric continues to work outside AGENTIC_REPLAY scenarios for diagnostic purposes. Effect on Kimi agentic results in this repo: the "errors=N" line in the per-job logs and the failure column in aggregated CSVs will only count real failures (server 5xx, parse errors, malformed responses), not the expected end-of-trajectory context-overflow events.
Picks up cquil11/aiperf@8f41bc7b. New --failed-request-threshold flag (float in [0,1], default None=disabled) on the agentic-benchmark profile entrypoint. When PROFILING-phase error_records/total_records exceeds the threshold after a grace floor of max(concurrency, 10) records, RecordsManager broadcasts ProfileCancelCommand on the message bus, the timing/server-metrics/gpu-telemetry managers tear down their work, and the run exits non-zero via the existing cancel path. Composes cleanly with the prior context-overflow drop: in AGENTIC_REPLAY scenarios, context-overflow events are excluded from error_records, so the threshold measures only real failures (server 5xx, parse errors, malformed responses) and won't trip on the expected end-of-trajectory overflow signal. Usage example: aiperf profile ... --failed-request-threshold 0.05 (abort if >5% real failures after grace floor)
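The abort rule described here can be sketched as a pure function. This is an illustration under assumptions, not aiperf's implementation: `should_abort` is a hypothetical name, and it assumes the check becomes active once `total_records` reaches the grace floor and that the comparison is strictly greater-than.

```python
from typing import Optional


def should_abort(error_records: int, total_records: int,
                 concurrency: int, threshold: Optional[float]) -> bool:
    """Decide whether the failure rate warrants cancelling the run."""
    if threshold is None:            # default: feature disabled
        return False
    grace_floor = max(concurrency, 10)
    if total_records < grace_floor:  # too few records to judge yet
        return False
    return error_records / total_records > threshold
```

The grace floor keeps a single early failure (for example 1 error out of 2 records) from tripping a 5% threshold before enough records exist for the ratio to be meaningful.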
aiperf submodule pointer -> 343f33c6 picks up:
- [LoadGen] --failed-request-threshold (in-flight abort; already wired via earlier ee76801 bump)
- [AgenticReplay] --trajectory-start-min-ratio / --trajectory-start-max-ratio (configurable replacement for the previously hardcoded 0%-70% k_i range)
- [AgenticReplay] per-trajectory warmup completion log lines (start_turn, trace_id, status)
benchmark_lib.sh wires three new aiperf flags into build_replay_cmd for all agentic launchers:
- --failed-request-threshold 0.05 (kill run early if real-failure rate > 5%)
- --trajectory-start-min-ratio 0.25
- --trajectory-start-max-ratio 0.75 (sample k_i from 25%-75% of the trace)
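One plausible reading of the ratio flags is that each trajectory's start turn k_i is drawn uniformly between the two ratios of its trace length. The sketch below encodes that reading only; `sample_start_turn`, the rounding, and the clamping to the last turn are assumptions, and aiperf's actual sampling may differ.

```python
import random


def sample_start_turn(num_turns: int, min_ratio: float, max_ratio: float,
                      rng: random.Random) -> int:
    """Pick a start turn k_i within [min_ratio, max_ratio] of the trace."""
    if num_turns < 1:
        raise ValueError("num_turns must be >= 1")
    if not 0.0 <= min_ratio <= max_ratio <= 1.0:
        raise ValueError("ratios must satisfy 0 <= min <= max <= 1")
    last = num_turns - 1
    lo = min(last, int(num_turns * min_ratio))
    hi = min(last, max(lo, int(num_turns * max_ratio)))
    return rng.randint(lo, hi)  # inclusive on both ends


# Values wired in by benchmark_lib.sh per the commit: 0.25 and 0.75.
k_i = sample_start_turn(100, 0.25, 0.75, random.Random(0))
```

Passing an explicit `random.Random` keeps the draw reproducible, which helps when comparing against the per-trajectory log lines.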
Picks up cquil11/aiperf@fccb8471. TrajectorySource now emits a one-block info line right after building the trajectory list, showing per-lane (k_i, num_turns, pct) plus configured/observed range summary. Lets you verify that --trajectory-start-{min,max}-ratio produced the expected distribution before any requests fire, no need to wait for warmup completion lines.
Conflicts resolved:
- .github/configs/amd-master.yaml (dsv4-fp4-mi355x-atom): took main's simplified single-range conc form from PR #1311 (we had the older discrete-point version)
- .github/configs/nvidia-master.yaml (kimik2.5-int4-b200-vllm): kept our bump-rationale comment alongside main's v0.20.2 image (both sides agreed on the image, only the comment was new on ours)
- .github/configs/nvidia-master.yaml (minimaxm2.5-fp8-{h100,h200}-vllm): took main's v0.20.2 image bumps (we still had v0.19.1)
Cleanup:
- Drop our .gitignore additions (the 'scripts/debug_*.sh' line) per review feedback -- match main
- Drop docs/AGENTIC_TEST_COVERAGE.md and docs/AGENTIC_TEST_RESULTS.md (agent-generated planning slop, not load-bearing)
We don't need to plot any pareto frontiers from this repo -- aiperf has its own plotting tutorial and any downstream visualization can read the bmk_agentic JSON / aggregate exports directly. Removed:
- utils/agentic-benchmark/scripts/plot_sweep_overview.py (v0.1 carryover)
- utils/agentic-benchmark/analysis/plot_pareto.py
- utils/generate_aiperf_plots.py (added earlier in this PR; not needed)
Picks up cquil11/aiperf@7d880a1e. The earlier context-overflow drop (commit 4efdd6e8) broke the records-side <-> credit-side counter invariant by returning early from _on_inference_results: records-side total_records lagged credit-side final_requests_completed by one for every overflow event, so the completion barrier at records_tracker.py:144-147 never converged. End of every PROFILING phase hung for the full benchmark_grace_period before timing out and cancelling in-flight credits. Fix preserves the original intent (context-overflow events stay out of metrics) while keeping the invariant intact: overflow records flow through normally but carry a context_overflow_skip flag on the MetricRecordMetadata; RecordsManager counts them toward total_records (classified as success so error counters stay at 0) but skips the error tracker, accumulators, stream exporters, and the --failed-request-threshold abort check.
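The invariant fix can be pictured with a toy counter. This is a deliberately simplified sketch, not aiperf's RecordsManager: the class, field, and method names are hypothetical, and only the behavior (overflow records count toward the total but skip error tracking and export) follows the commit message.

```python
from dataclasses import dataclass


@dataclass
class RecordCounters:
    total_records: int = 0      # must match the credit-side completed count
    error_records: int = 0      # feeds the failure-threshold check
    exported_records: int = 0   # stands in for accumulators and exporters

    def on_record(self, is_error: bool,
                  context_overflow_skip: bool = False) -> None:
        # Every record is counted so the completion barrier can converge.
        self.total_records += 1
        if context_overflow_skip:
            # Flagged records stay out of errors, metrics, and exports.
            return
        if is_error:
            self.error_records += 1
        self.exported_records += 1
```

The earlier behavior corresponds to returning before the `total_records` increment, which is what left the records-side count one short per overflow event.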
Per PR review feedback, this branch must not alter any fixed-seq-len scenarios or non-agentic functionality from origin/main. Restored to match origin/main exactly:
- amd-master.yaml: re-add qwen3.5-fp4-mi355x-atom + minimaxm2.5-fp4-mi355x-atom entries (both have only fixed-seq-len scenarios; were missing from our branch since the v0.1 merge)
- nvidia-master.yaml: replace dsv4-fp4-gb200-dynamo-vllm fixed-seq-len block with origin/main version (we had drifted to a 1k/1k extrapolated layout; main is the canonical 8k/1k Pareto-mirrored block)
- nvidia-master.yaml: kimik2.5-int4-h100-vllm new entry has agentic-coding only (no fixed-seq-len) to keep the fixed-seq-len surface identical to main
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
# Generate metrics_plots.png from the same aiperf artifacts. Best-effort:
# don't fail the launcher if plot generation has trouble (e.g. matplotlib
# missing in a stripped-down image). The agg JSON is the success gate.
python3 "$INFMAX_CONTAINER_WORKSPACE/utils/generate_aiperf_plots.py" "$result_dir" 2>&1 || true
}
🔴 The new write_agentic_result_json calls python3 $INFMAX_CONTAINER_WORKSPACE/utils/generate_aiperf_plots.py "$result_dir" to produce results/metrics_plots.png, and the workflow lists that PNG in the agentic artifact upload bundle — but utils/generate_aiperf_plots.py is not present anywhere in this PR or the repo. The call is wrapped in 2>&1 || true and the upload uses if-no-files-found: ignore, so the launcher and workflow appear green while every agentic run silently fails to emit the advertised plot. Either commit the missing script or remove the invocation at benchmarks/benchmark_lib.sh:1037 and the results/metrics_plots.png line at .github/workflows/benchmark-tmpl.yml:248.
What the bug is
write_agentic_result_json in benchmarks/benchmark_lib.sh (lines 1026-1038) runs the post-run aggregator and then invokes a plotter:
python3 "$INFMAX_CONTAINER_WORKSPACE/utils/generate_aiperf_plots.py" "$result_dir" 2>&1 || true
This function is the final step of every agentic launcher under benchmarks/single_node/agentic/*.sh in this PR. Its output results/metrics_plots.png is then listed as one of the files in the agentic artifact upload at .github/workflows/benchmark-tmpl.yml:248. The comment block immediately above the plotter call advertises it as an intentional feature (Generate metrics_plots.png from the same aiperf artifacts), and the build_replay_cmd comment at lines 1008-1014 explicitly justifies the 1-second --slice-duration "so the post-run plotter has per-window time series … Without this, aiperf only emits aggregate stats and the 6x2 panels collapse to flat lines."
But utils/generate_aiperf_plots.py is not committed anywhere in the PR or the repo. A repo-wide search finds exactly one reference — the invocation itself — and no file matching **/generate_aiperf_plots*.
Why nothing currently fails
The call is wrapped in 2>&1 || true, so python: can't open file '.../utils/generate_aiperf_plots.py': [Errno 2] No such file or directory is captured into benchmark.log and the launcher's exit status stays clean. Separately, the workflow upload step uses if-no-files-found: ignore, so the missing results/metrics_plots.png is silently dropped from the uploaded bundle. There is no other check that would surface the missing artifact.
Step-by-step proof
- An agentic-coding job runs, e.g. benchmarks/single_node/agentic/dsv4_fp4_b200_vllm.sh.
- The script ends with write_agentic_result_json "$RESULT_DIR" (line 156 of the new launcher).
- write_agentic_result_json invokes python3 $INFMAX_CONTAINER_WORKSPACE/utils/generate_aiperf_plots.py "$result_dir".
- Python exits non-zero with No such file or directory because the script does not exist.
- || true swallows the exit code; only benchmark.log records the error.
- The aggregated *.json is created, so the workflow's retry-based existence check at .github/workflows/benchmark-tmpl.yml passes.
- The Upload agentic raw results step lists results/metrics_plots.png but the file does not exist; if-no-files-found: ignore causes the missing file to be silently skipped.
- The artifact bundle is published without the advertised PNG, and no warning surfaces in the run summary.
Impact
The metrics_plots.png artifact is advertised in the PR description (build_replay_cmd comment) and explicitly listed in the workflow upload, but it will never be produced for any agentic run. Every agentic benchmark.log will also carry a noisy python: can't open file … line, complicating future log triage. This is not a correctness bug for the agg JSON path (the success gate is the JSON, not the PNG), so the pipeline still goes green — but the promised per-window time-series visualization is missing from every run.
How to fix
Two options, either is sufficient:
- Commit utils/generate_aiperf_plots.py alongside this PR. The 1-second --slice-duration plumbing in build_replay_cmd and the workflow artifact reference were clearly added in anticipation of this file.
- Remove the plotter call from write_agentic_result_json (benchmarks/benchmark_lib.sh:1034-1037) and drop results/metrics_plots.png from .github/workflows/benchmark-tmpl.yml:248. The --slice-duration 1.0 flag in build_replay_cmd can also be removed if no other consumer needs the per-window timeslice JSON, but profile_export_aiperf_timeslices.{json,csv} are also in the upload bundle, so it may still be useful.
Option 1 matches the apparent intent (the launcher comment is written assuming the plotter exists). Option 2 is the safer "make the PR self-consistent" path.
# Trace metadata lookup: conversation_id (= trace id) -> per-turn dict with
# ``hash_ids`` and ``output_length``. Built lazily from the HF dataset cache.
_TRACE_METADATA_CACHE: dict[str, list[dict]] | None = None
_HF_DATASET = "semianalysisai/cc-traces-weka-042026"
🔴 utils/process_agentic_result.py:40 hardcodes _HF_DATASET = "semianalysisai/cc-traces-weka-042026" (the v0.1 name) while benchmarks/benchmark_lib.sh:908 in this same PR downloads semianalysisai/cc-traces-weka-no-subagents-051226 (v0.2, matching the PR description and the aiperf submodule loader). _hf_traces_dir() builds its lookup path from _HF_DATASET, so the production HF cache directory datasets--semianalysisai--cc-traces-weka-no-subagents-051226/snapshots/ is never found, _load_trace_metadata() returns {}, and every shipped agentic agg JSON has theoretical_cache_hit_rate=null with all output_tokens_expected stats missing — silently breaking the "theoretical cache-hit computed from trace metadata" feature this PR advertises. Fix is a one-line constant bump (and the matching test fixture path at utils/test_process_agentic_result.py:408); ideally derive the name from a shared constant / env var shared with resolve_trace_source.
What goes wrong
Two locations in this PR disagree on the HF dataset name:
- Producer: benchmarks/benchmark_lib.sh:908 (resolve_trace_source) calls hf download --repo-type dataset semianalysisai/cc-traces-weka-no-subagents-051226. This is the only producer of the local HF cache for the agentic path, and it matches the PR description ("v0.2: semianalysisai/cc-traces-weka-no-subagents-051226") and the aiperf submodule loader (semianalysis_cc_traces_weka is the stable alias for this dated revision).
- Consumer: utils/process_agentic_result.py:40 hardcodes _HF_DATASET = "semianalysisai/cc-traces-weka-042026", the v0.1 name from before this PR.
_hf_traces_dir() (process_agentic_result.py:118–146) builds its lookup path from _HF_DATASET:
    org, name = _HF_DATASET.split("/", 1)
    snapshots = cache_root / f"datasets--{org}--{name}" / "snapshots"
So it searches $HF_HUB_CACHE/datasets--semianalysisai--cc-traces-weka-042026/snapshots/, which never exists in production. The function returns None, _load_trace_metadata() returns an empty dict, and:
- compute_cache_stats() (process_agentic_result.py ~395–432): the theoretical-cache-hit walk runs only if metadata is truthy. Empty dict ⇒ result["theoretical_cache_hit_rate"] stays None.
- compute_workload_stats() (process_agentic_result.py ~277–294): the mean/p75/p90/p95/std_output_tokens_expected block is gated by if metadata:. Empty dict ⇒ none of those keys are emitted.
Impact
Every agentic-coding result JSON shipped to the downstream aggregator has theoretical_cache_hit_rate: null and is missing all five output_tokens_expected stats — directly contradicting this PR's claim: "theoretical cache-hit computed from trace metadata." The per-launcher print(f" Theoretical cache hit rate: ...") at the end of process_agentic_result.py simply never prints in production. End-users consuming the schema downstream see silent data loss, not a crash.
Why CI doesn't catch this
The new unit test utils/test_process_agentic_result.py:408 (test_processor_loads_traces_jsonl_for_theoretical_cache) builds its fake HF cache snapshot under hf_cache / "datasets--semianalysisai--cc-traces-weka-042026" / "snapshots" / "abc" — mirroring the same stale name the production code reads. The test passes because both sides agree on the wrong name. Nothing else exercises _HF_DATASET against the real dataset, and test_processor_emits_required_summarize_keys (line 264) doesn't include the optional theoretical_cache_hit_rate / output_tokens_expected keys in its required-key set, so their absence isn't flagged either.
Step-by-step proof
- A B200 job runs benchmarks/single_node/agentic/kimik2.5_fp4_b200.sh, which calls resolve_trace_source (benchmark_lib.sh:907–916).
- That downloads semianalysisai/cc-traces-weka-no-subagents-051226 into $HF_HUB_CACHE. After download, $HF_HUB_CACHE contains datasets--semianalysisai--cc-traces-weka-no-subagents-051226/snapshots/<rev>/traces.jsonl (the HF hub layout).
- The launcher runs aiperf, which writes results/trace_replay/profile_export.jsonl etc.
- write_agentic_result_json invokes python3 utils/process_agentic_result.py.
- _load_trace_metadata() calls _hf_traces_dir(), which computes snapshots = $HF_HUB_CACHE/datasets--semianalysisai--cc-traces-weka-042026/snapshots from _HF_DATASET.
- snapshots.is_dir() is False (the cache directory uses the real dataset name -no-subagents-051226). Returns None.
- _load_trace_metadata() returns {}. _TRACE_METADATA_CACHE = {}.
- In compute_cache_stats(), if metadata: is False ⇒ theoretical_cache_hit_rate stays None.
- In compute_workload_stats(), the trace-metadata block at the bottom of the function never executes ⇒ no *_output_tokens_expected keys are written.
- Aggregator emits agg_*.json with "theoretical_cache_hit_rate": null and no expected-output stats. The per-launcher print(" Theoretical cache hit rate: ...") is gated on agg.get("theoretical_cache_hit_rate") is not None, so it never prints.
How to fix
Cheapest fix is a one-line constant bump in utils/process_agentic_result.py:
    _HF_DATASET = "semianalysisai/cc-traces-weka-no-subagents-051226"
Plus the matching path in utils/test_process_agentic_result.py:408 (hf_cache / "datasets--semianalysisai--cc-traces-weka-no-subagents-051226" / "snapshots" / "abc"). Better: read the dataset name from a shared module-level constant (or env var) that both resolve_trace_source and process_agentic_result.py consume, so future dataset bumps can't desync again. As a regression guard, add an assertion in the test that the constant equals what resolve_trace_source actually downloads, or simply have the test set _HF_DATASET via the env-based shared constant rather than hardcoding the path.
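A minimal sketch of the shared-constant idea, assuming an environment variable is used to carry the dataset name. The variable name AGENTIC_HF_DATASET and the helper names hf_dataset_name / hf_snapshots_dir are hypothetical and do not exist in this PR; only the datasets--{org}--{name}/snapshots layout comes from the code quoted above.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Default mirrors the dataset that resolve_trace_source downloads in this PR.
DEFAULT_HF_DATASET = "semianalysisai/cc-traces-weka-no-subagents-051226"


def hf_dataset_name(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the dataset name, preferring a (hypothetical) env override."""
    env = os.environ if env is None else env
    # AGENTIC_HF_DATASET is an assumed name, not something the PR defines.
    return env.get("AGENTIC_HF_DATASET") or DEFAULT_HF_DATASET


def hf_snapshots_dir(cache_root: Path, dataset: str) -> Path:
    """Build the HF hub cache path the same way _hf_traces_dir() does."""
    org, name = dataset.split("/", 1)
    return cache_root / f"datasets--{org}--{name}" / "snapshots"


# With no override, the lookup path follows the v0.2 dataset name.
default_dir = hf_snapshots_dir(Path("hf_cache"), hf_dataset_name({}))
# With an override, producer and consumer can be bumped together.
override_dir = hf_snapshots_dir(
    Path("hf_cache"), hf_dataset_name({"AGENTIC_HF_DATASET": "org/other-traces"})
)
```

If the shell side exported the same variable before calling hf download, a dataset bump would touch one place instead of two. A test could then build its fake cache through hf_snapshots_dir rather than a hardcoded directory string.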
| is_multinode = env_bool('IS_MULTINODE') | ||
| tp = env_int('TP', 1) | ||
| ep = env_int('EP_SIZE', 1) | ||
| dp_attention = os.environ.get('DP_ATTENTION', 'false') | ||
| def build_agg( | ||
| records: list[dict], | ||
| aggregate: dict, | ||
| server_metrics: dict, | ||
| ) -> dict: | ||
| """Compose the agg_*.json body from the three aiperf inputs.""" | ||
| is_multinode = env_bool("IS_MULTINODE") | ||
| tp = env_int("TP", 1) | ||
| ep = env_int("EP_SIZE", 1) | ||
| dp_attention = os.environ.get("DP_ATTENTION", "false") | ||
| num_gpus = tp | ||
| if is_multinode: | ||
| prefill_num_workers = env_int('PREFILL_NUM_WORKERS') | ||
| prefill_tp = env_int('PREFILL_TP') | ||
| prefill_ep = env_int('PREFILL_EP', 1) | ||
| prefill_dp_attention = os.environ.get('PREFILL_DP_ATTN', 'false') | ||
| decode_num_workers = env_int('DECODE_NUM_WORKERS') | ||
| decode_tp = env_int('DECODE_TP') | ||
| decode_ep = env_int('DECODE_EP', 1) | ||
| decode_dp_attention = os.environ.get('DECODE_DP_ATTN', 'false') | ||
| prefill_num_workers = env_int("PREFILL_NUM_WORKERS") | ||
| prefill_tp = env_int("PREFILL_TP") |
🟡 build_agg() in utils/process_agentic_result.py sets both "num_requests_total" and "num_requests_successful" to len(records), but load_records() above explicitly filters out error records via 'if obj.get("error"): continue'. The two fields are therefore always equal — any downstream consumer computing a failure rate as 1 - successful/total will see 0% even when aiperf actually had errored requests. Fix is small: count error records during load_records and surface the count, or pull the true total from the aiperf aggregate JSON (request_count is already loaded but unused for these keys).
What the bug is
utils/process_agentic_result.py rewrites the legacy agg JSON for the aiperf-based agentic pipeline. The legacy CSV-based implementation correctly distinguished the two fields:
    # legacy (removed in this PR)
    "num_requests_total": len(rows),             # all CSV rows
    "num_requests_successful": len(successful),  # filtered by success == 'True'
The new code sets both to the same value:
    # utils/process_agentic_result.py (build_agg, ~lines 601-603)
    "num_requests_total": len(records),
    "num_requests_successful": len(records),
But records is the output of load_records() (lines 93–105), which explicitly drops error rows:
    obj = json.loads(line)
    if obj.get("error"):
        continue
    records.append(obj)
So records only contains successful requests; the two emitted fields are mathematically identical for every run.
Step-by-step proof
- aiperf writes profile_export.jsonl with one line per request. Failed requests carry an error key (verified in aiperf source).
- load_records() reads that file and skips any line where obj.get("error") is truthy.
- build_agg() then does "num_requests_total": len(records) and "num_requests_successful": len(records), using the same records list for both.
- Imagine a run with 950 records emitted by aiperf, 50 of which carry "error": "...". load_records() returns a list of 900 entries. The agg JSON now claims num_requests_total = 900 and num_requests_successful = 900, so any consumer computing 1 - successful/total reads 0% failure rate even though aiperf saw 50 failures.
Why existing code doesn't prevent it
aiperf's --failed-request-threshold 0.05 aborts the run upstream above 5% error rate, but that's a coarse gate — non-zero error counts below 5% still pass through silently, and the labels still lie about the true total. The aggregate JSON (profile_export_aiperf.json) IS loaded by this script and contains request_count (the true total including errors) but it is never consulted for these two keys.
Impact
In-repo blast radius is bounded: utils/summarize.py headers don't read these fields, and the only consumer asserting on them is utils/test_process_agentic_result.py:491. The two fields nominally exist to track distinct quantities — that's why the pre-PR code emitted them separately — so any external dashboard / script (downstream Pareto analysis, the team's spreadsheets, etc.) computing failure rate from the emitted schema will see 0% regardless of actual aiperf error count. This is a silent schema regression.
How to fix
Two clean options:
- Count errors during loading. Change load_records to return (records, error_count) (or a second pass), then in build_agg:
      "num_requests_total": len(records) + error_count,
      "num_requests_successful": len(records),
- Pull from the aggregate. aggregate["request_count"] is already loaded:
      "num_requests_total": aggregate.get("request_count", len(records)),
      "num_requests_successful": len(records),
Either restores the original semantics of the two-field design.
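The first option can be sketched as follows, replaying the 950-request example from the proof. This is an illustration under assumptions, not the PR's code: the function names load_records_with_errors and request_counts are hypothetical, the loader takes an iterable of JSONL lines instead of a file path, and the record shape ({"id": ...}, "error": "timeout") is invented for the demo.

```python
import json
from typing import Iterable


def load_records_with_errors(lines: Iterable[str]) -> tuple[list[dict], int]:
    """Keep successful records and count the ones carrying an error key."""
    records: list[dict] = []
    error_count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        if obj.get("error"):
            error_count += 1  # counted instead of silently dropped
            continue
        records.append(obj)
    return records, error_count


def request_counts(records: list[dict], error_count: int) -> dict:
    """Emit the two agg fields with distinct meanings again."""
    return {
        "num_requests_total": len(records) + error_count,
        "num_requests_successful": len(records),
    }


# Synthetic input: 900 successful requests and 50 failed ones.
lines = [json.dumps({"id": i}) for i in range(900)]
lines += [json.dumps({"id": i, "error": "timeout"}) for i in range(900, 950)]

records, error_count = load_records_with_errors(lines)
counts = request_counts(records, error_count)
failure_rate = 1 - counts["num_requests_successful"] / counts["num_requests_total"]
```

Here the total is 950 and the successful count is 900, so a consumer computing 1 - successful/total gets 50/950 (about 5.3%) rather than 0%.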
| def _final_value(metric_name: str) -> float | None: | ||
| entry = metrics_by_name.get(metric_name) | ||
| if not isinstance(entry, dict): | ||
| return None | ||
| series = entry.get("series") or [] | ||
| if not isinstance(series, list): | ||
| return None | ||
| for stats_key in ("total", "max", "avg"): | ||
| agg = 0.0 | ||
| found = False | ||
| for s in series: | ||
| if not isinstance(s, dict): | ||
| continue | ||
| stats = s.get("stats") | ||
| if not isinstance(stats, dict): | ||
| continue | ||
| v = stats.get(stats_key) | ||
| if v is None: | ||
| continue | ||
| try: | ||
| agg += float(v) | ||
| found = True | ||
| except (TypeError, ValueError): | ||
| continue | ||
| if found: | ||
| return agg | ||
| return None |
🟡 compute_cache_stats._final_value() sums values across all per-engine metric series, which is correct for counters (vllm:prefix_cache_hits/queries, vllm:prompt_tokens, vllm:kv_offload_bytes_*) but wrong for the one percentage gauge in the mapping at lines 489–498 — vllm:cpu_kv_cache_usage_perc → cpu_kv_cache_usage_pct. With the DP-attn configs this PR adds (dsv4-fp4-b200-vllm dp-attn:true, dsv4-fp4-b300-vllm, dsv4-fp8-h200-vllm), vLLM emits one /metrics series per DP engine, so two engines at 50% report 100% and eight at 50% report 400%. Easiest fix: pick avg (not total) for percentage gauges, e.g. a small per-metric strategy table keyed on metric name.
What the bug is. _final_value() walks ("total", "max", "avg") and, for the first stats key any series has, accumulates agg += float(v) across every series entry. The intent (see the comment at lines 447–448: "We aggregate across series (multiple endpoints / label sets) and prefer total for counters, then max/avg for gauges") is to sum counters and aggregate gauges, but the implementation sums in both cases.
Which metric is affected. Of the five metrics looked up via this helper for compute_cache_stats:
- vllm:prefix_cache_hits / vllm:prefix_cache_queries: counters. Summing across series (and dividing in lockstep at lines 481–482) is correct.
- vllm:cpu_prefix_cache_hits / vllm:cpu_prefix_cache_queries: counters, same lockstep ratio.
- vllm:prompt_tokens / vllm:generation_tokens: counters.
- vllm:kv_offload_bytes_* / vllm:kv_offload_time_*: counters.
- vllm:cpu_kv_cache_usage_perc → cpu_kv_cache_usage_pct: a gauge expressing a percentage. This one is wrong when there are multiple series.
Why DP-attn configs trigger it. In a single-engine layout (TP-only), vLLM exposes one /metrics series, so the loop sums a single value and the result is right by accident. With DP-attn (added in this PR for dsv4-fp4-b200-vllm, dsv4-fp4-b300-vllm, dsv4-fp8-h200-vllm), each DP engine runs its own scheduler and emits its own series — typically tagged by engine index. The loop now sums N independent gauge readings.
Step-by-step proof. Take the dsv4-fp8-h200-vllm agentic-coding entry added in this PR:
    - { tp: 8, ep: 8, dp-attn: true, offloading: none, conc-list: [1, 2, 4, 8, 16] }
The launcher (dsv4_fp8_h200.sh) runs --data-parallel-size 8 → 8 DP engines, each exporting vllm:cpu_kv_cache_usage_perc. Say each engine has its CPU offload pool 50% full. The scrape produces a metrics dict shaped like:
    "vllm:cpu_kv_cache_usage_perc": {
      "series": [
        {"labels": {"engine": "0"}, "stats": {"max": 50.0, "avg": 50.0, ...}},
        {"labels": {"engine": "1"}, "stats": {"max": 50.0, "avg": 50.0, ...}},
        ... 6 more identical entries ...
      ]
    }
_final_value() tries "total" first — not present on a gauge — then "max", which is present on every series. It iterates and accumulates agg = 50 + 50 + 50 + 50 + 50 + 50 + 50 + 50 = 400.0, then returns 400.0. That value is written to cpu_kv_cache_usage_pct in the agg_*.json payload. The true aggregate utilization is ~50%, not 400%.
Impact. cpu_kv_cache_usage_pct is a diagnostic field for analyzing CPU-offload behaviour, not a headline metric. utils/summarize.py doesn't read it (verified — it consumes only the throughput/latency keys plus topology fields), so summary tables are unaffected. But the field is published in the per-run JSON and would mislead anyone inspecting the offload regime of the new DP-attn configs — exactly the configs this PR is intended to characterize. The misreport is also obviously wrong (>100%, often by a large multiple) so it's likely to be spotted, but it crowds noise into the very signal the new sweeps are designed to produce.
Suggested fix. Cheapest correct version is a per-metric aggregation strategy table, e.g.:
    _AGG = {
        "vllm:cpu_kv_cache_usage_perc": "avg",
        # everything else defaults to summing "total" -> "max" -> "avg"
    }
    def _final_value(metric_name):
        mode = _AGG.get(metric_name, "sum")
        ...
        if mode == "avg":
            # mean across series of stats["avg"] (or "max" if "avg" missing)
        else:
            # existing sum behavior
Alternatively, pick avg instead of max when the only key available is a gauge stat, but the table form is clearer about which metrics are gauges. Existing tests (test_processor_aggregates_across_multiple_series) cover only the counter case; adding a percentage-gauge fixture would lock the behaviour in.
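One way the strategy table could be filled in is sketched below, using the eight-engine example from the proof. It is a standalone variant under assumptions: final_value takes metrics_by_name as an explicit argument (the original is a closure), _AGG_MODE is a hypothetical name, and the fixture values are invented for illustration.

```python
from typing import Optional

# Hypothetical strategy table: metrics listed here are averaged across
# series; everything else keeps the existing summing behaviour.
_AGG_MODE = {"vllm:cpu_kv_cache_usage_perc": "avg"}


def final_value(metrics_by_name: dict, metric_name: str) -> Optional[float]:
    entry = metrics_by_name.get(metric_name)
    if not isinstance(entry, dict):
        return None
    series = entry.get("series") or []
    if not isinstance(series, list):
        return None
    mode = _AGG_MODE.get(metric_name, "sum")
    # Gauges prefer "avg" then "max"; counters keep "total" -> "max" -> "avg".
    keys = ("avg", "max") if mode == "avg" else ("total", "max", "avg")
    for stats_key in keys:
        values = []
        for s in series:
            stats = s.get("stats") if isinstance(s, dict) else None
            if not isinstance(stats, dict):
                continue
            v = stats.get(stats_key)
            if v is None:
                continue
            try:
                values.append(float(v))
            except (TypeError, ValueError):
                continue
        if values:
            total = sum(values)
            # Mean across series for gauges, plain sum for counters.
            return total / len(values) if mode == "avg" else total
    return None


# Eight DP engines, each reporting a 50% full CPU offload pool.
metrics = {
    "vllm:cpu_kv_cache_usage_perc": {
        "series": [
            {"labels": {"engine": str(i)}, "stats": {"max": 50.0, "avg": 50.0}}
            for i in range(8)
        ]
    },
    "vllm:prefix_cache_hits": {
        "series": [{"stats": {"total": 10}}, {"stats": {"total": 5}}]
    },
}
gauge = final_value(metrics, "vllm:cpu_kv_cache_usage_perc")
counter = final_value(metrics, "vllm:prefix_cache_hits")
```

With this fixture the gauge comes out at 50.0 instead of 400.0, while the counter still sums to 15.0. The same two entries could serve as the percentage-gauge fixture the review asks for.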
Per PR review: this branch must not modify any existing master-yaml
entry. If our agentic-coding work needs different metadata (image,
runner, or extra scenarios) than what main has for an existing entry,
the change must live in a separate sibling config-key with the
'-agentic' suffix — leaving the original entry byte-identical to
origin/main so the fixed-seq-len test surface and any other consumer
of the original key sees zero change from this PR.
For every entry that previously diverged from main, this commit:
1. Restores the original entry text verbatim to match origin/main
2. Emits a sibling '<name>-agentic' entry containing our metadata
overrides + only the agentic-coding scenarios block
amd-master.yaml: +6 sibling -agentic entries
- glm5.1-fp4-mi355x-sglang-agentic
- kimik2.5-fp4-mi355x-vllm-agentic
- minimaxm2.5-fp8-mi300x-vllm-agentic
- minimaxm2.5-fp8-mi325x-vllm-agentic
- minimaxm2.5-fp8-mi355x-vllm-agentic
- qwen3.5-fp8-mi355x-sglang-agentic
nvidia-master.yaml: +16 sibling -agentic entries
- dsv4-fp4-b200-vllm-agentic, dsv4-fp4-b300-vllm-agentic,
dsv4-fp8-h200-vllm-agentic, glm5-fp8-b200-sglang-agentic,
gptoss-fp4-b200-vllm-agentic, kimik2.5-fp4-b200-vllm-agentic,
kimik2.5-fp4-b300-vllm-agentic, kimik2.5-int4-b200-vllm-agentic,
kimik2.5-int4-h200-vllm-agentic, minimaxm2.5-fp4-b200-vllm-agentic,
minimaxm2.5-fp8-b200-vllm-agentic, minimaxm2.5-fp8-b300-vllm-agentic,
minimaxm2.5-fp8-h100-vllm-agentic, minimaxm2.5-fp8-h200-vllm-agentic,
qwen3.5-bf16-b200-sglang-agentic, qwen3.5-fp8-b200-sglang-agentic
Brand-new entries that don't exist on main (only kimik2.5-int4-h100-vllm
in this PR) stay as-is with agentic-coding scenarios only — no
fixed-seq-len block added.
Dispatch instructions: agentic sweeps now reference the '-agentic'
suffixed config-keys, e.g.
gh workflow run ... -f generate-cli-command="test-config
--config-keys kimik2.5-int4-h200-vllm-agentic
--scenario-type agentic-coding"
Every split sibling now leads with a comment block explaining why it exists. Two flavors:
Metadata divergence: lists the specific field(s) that differ from main:
    # Diverged from kimik2.5-int4-h200-vllm (agentic-coding sibling). Reasons below;
    # the original kimik2.5-int4-h200-vllm entry is left identical to origin/main so
    # its fixed-seq-len sweep is unaffected.
    # - runner: 'h200' -> 'h200-dgxc'
Scenarios-only divergence: metadata matches main exactly; the split exists because we added or modified the agentic-coding scenarios:
    # Diverged from dsv4-fp4-b300-vllm (agentic-coding sibling). Metadata is
    # identical to origin/main's dsv4-fp4-b300-vllm; the split exists because this
    # PR adds an agentic-coding scenarios block that differs from main
    # (either main had none or had a different conc/offload sweep).
    # The original dsv4-fp4-b300-vllm entry stays byte-identical to origin/main.
Annotations were generated programmatically from the field-level diff against origin/main (utils/aiperf scripted; no manual edits). Existing in-entry rationale comments are preserved below the header.
…oad modes
Extends kimik2.5-fp4-mi355x-vllm-agentic with a TP=4 sweep at cliff-region concurrencies on both offload modes. MI355X has 288 GB HBM/GPU, so the TP=4 half-node weight footprint (~62 GB/GPU) leaves plenty of headroom, unlike B200's 192 GB constraint. Restricted to cliff concurrencies (no low-conc points) since the TP=4 vs TP=8 comparison is most useful at the KV-pressure transition, not at lightly-loaded points.
| out: dict[str, list[dict]] = {} | ||
| traces_dir = _hf_traces_dir() | ||
| if traces_dir is None: | ||
| _TRACE_METADATA_CACHE = out |
| ) | ||
| if per_turn: | ||
| out[trace_id] = per_turn | ||
| _TRACE_METADATA_CACHE = out |
Summary
Migrates scenario-type: agentic-coding from kv-cache-tester to aiperf (cjq/weka-live-assistant-responses), adds Kimi K2.5 agentic sweeps across B200, B300, H100, H200, MI355X, and pulls in a stack of agentic-specific aiperf features. bmk_agentic JSON schema preserved; downstream aggregators unaffected.
Dataset (v0.2): semianalysisai/cc-traces-weka-no-subagents-051226. Pulled via aiperf's --public-dataset semianalysis_cc_traces_weka flag. The loader name is a stable alias for the dated HF revision, so bumping the dataset means just rev'ing the alias on the aiperf side.
InferenceX repo additions since v0.1
- benchmarks/benchmark_lib.sh: new build_replay_cmd emits aiperf profile --scenario inferencex-agentx-mvp … with --streaming --use-server-token-count --random-seed 42 --failed-request-threshold 0.05 --trajectory-start-{min,max}-ratio 0.25/0.75; install_agentic_deps editable-installs aiperf in-container; AIPERF_DATASET_{CONFIGURATION_TIMEOUT,MMAP_CACHE_DIR,WEKA_LIVE_ASSISTANT_RESPONSES} wired
- utils/process_agentic_result.py rewritten to consume aiperf profile_export.jsonl + profile_export_aiperf.json + server_metrics_export.json; preserves all keys summarize.py consumes; theoretical cache-hit computed from trace metadata
- utils/summarize.py per-metric stat set: drop median / p99 / p99.9, add p75 / p95 (keep mean / p90 / std)
- utils/test_process_agentic_result.py fixture-driven contract test for every key downstream reads
- kimik2.5_{fp4_b200, fp4_b300, fp4_mi355x, int4_b200, int4_h100, int4_h200}.sh with per-target tunings (fp8 KV on Hopper, --block-size=1 + AITER on MI355X, lazy_offload=true JSON form on H200+MI355X, VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 on B200 for TP=4)
- .github/configs/{nvidia,amd}-master.yaml: agentic-coding blocks for all 5 SKUs with explicit conc-list per offloading mode + runner: h200-dgxc pin (cache mount availability)
- runners/launch_h200-dgxc-slurm.sh: mount /home/sa-shared/gharunners/ai-perf-cache → /aiperf_mmap_cache, export AIPERF_DATASET_MMAP_CACHE_DIR (parity with B200 DGXC + MI355X)
- utils/find_reusable_sweep_run.py, utils/validate_reusable_sweep_artifacts.py + tests (aiperf's content-addressed mmap cache replaces them)
aiperf submodule additions since v0.1
- inferencex-agentx-mvp scenario plugin: locks --num-dataset-entries, --inter-turn-delay-cap-seconds 60, --cache-bust first_turn_prefix, timing_mode=AGENTIC_REPLAY, requires --ignore-eos and weka_trace loader
- AGENTIC_REPLAY timing strategy + TrajectorySource: per-trajectory k_i sampling, WARMUP→PROFILING resume at k_i+1, FIFO recycle queue, context-overflow trajectory short-circuit
- weka_trace loader with AIPERF_DATASET_WEKA_LIVE_ASSISTANT_RESPONSES=1 mode (user-only deltas, server-threaded assistant responses); sources from semianalysisai/cc-traces-weka-no-subagents-051226
- ContextOverflowCountMetric + scenario.context_overflow.is_context_overflow_response substring classifier; RecordProcessor drops overflow records before metrics push in AGENTIC_REPLAY scenarios
- --failed-request-threshold (float in [0,1]): in-flight ProfileCancelCommand broadcast when error rate exceeds threshold after max(concurrency, 10) grace floor
- --trajectory-start-{min,max}-ratio: replaces hardcoded 0%-70% k_i range; seed-deterministic per (random_seed, trace_id)
- TrajectorySource build-time summary table (range cfg + per-lane k_i/N/pct/trace_id); per-trajectory warmup-completion log lines on credit return
- AIPERF_DATASET_MMAP_CACHE_DIR); AIPERF_DATASET_CONFIGURATION_TIMEOUT + AIPERF_SERVICE_PROFILE_CONFIGURE_TIMEOUT env knobs (default 300s → 1800s in our launchers)
- --use-server-token-count: derive ISL/OSL from server usage.prompt_tokens / completion_tokens instead of client tokenizer (avoids CPU-pinning at high conc on custom-tokenizer models)
- srv line with server-side running-avg throughput; intvty per-user row replaces split tin/tout; ISL/OSL p50/p75/p90/p99 row; per-warmup-completion lines emitted in non-TTY runs
Test plan
- none: 9/9 clean
- cpu: blocked on cluster recovery
Results