[disclaimer: MVP/experimental] feat: agentic trace replay benchmark MVP v0.1 #1201
Adds end-to-end agentic-coding benchmark infrastructure on top of the
existing fixed-seq-len harness. New components:
Trace replayer
- New utils/trace-replay submodule (kv-cache-tester @ agentx-minimized)
driving multi-turn HF-dataset traces against any OpenAI-compatible
endpoint at fixed concurrency.
- --debug-trace captures full per-request prompt/response, every
streamed chunk via chunk.model_dump(), and integer token IDs
(apply_chat_template prompt + logprobs.content completion) into
debug_trace.jsonl.
- Per-model delta-field abstraction (gpt-oss → delta.reasoning, default
→ delta.reasoning_content) so reasoning-heavy responses are counted
and appended to conversation history correctly (see the sketch after
this list).
- Input-token metric reads the server's usage.prompt_tokens (authoritative)
rather than the local apply_chat_template estimate, which breaks for
the gpt-oss harmony chat template.
- Per-user 8-token salt prefix on conversation[0] so two in-flight
users replaying the same trace_id don't accidentally share KV-cache
blocks.
- Period summary: counts up elapsed instead of down remaining; replaces
the dispatch-jitter "Wait time" with the trace's true "Inter-turn
time" sourced from RequestMetrics.delay_expected.
- 5s quiesce between warmup completion and metrics-collector start so
warmup-tail prefill doesn't bleed into period 1.
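A minimal sketch of the delta-field selection and salt-prefix mechanics above, with hypothetical helper names (`reasoning_field_for`, `salt_conversation`); the actual logic lives in the trace-replay submodule:

```python
import secrets

# Which streamed delta field carries reasoning tokens differs per model:
# gpt-oss streams `delta.reasoning`; the default is `delta.reasoning_content`.
def reasoning_field_for(model_name: str) -> str:
    return "reasoning" if "gpt-oss" in model_name else "reasoning_content"

def extract_reasoning(delta, model_name: str) -> str:
    # getattr with a default, so chunks without reasoning count as empty.
    return getattr(delta, reasoning_field_for(model_name), None) or ""

# Salt conversation[0] per user so two in-flight users replaying the same
# trace_id never share KV-cache blocks.
def salt_conversation(conversation: list[dict], user_id: int) -> list[dict]:
    salt = f"[u{user_id}:{secrets.token_hex(4)}] "  # ~8 tokens, tokenizer-dependent
    salted = [dict(m) for m in conversation]
    salted[0]["content"] = salt + salted[0]["content"]
    return salted
```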
Workflow plumbing
- e2e-tests.yml: workflow_dispatch + workflow_call inputs for
debug-trace (boolean) and duration-override (string seconds), forwarded
to test-sweep-agentic and test-sweep-multi-node-agentic jobs.
- benchmark-tmpl.yml + benchmark-multinode-tmpl.yml: debug-trace input
mapped to DEBUG_TRACE env var; duration override threads through to
matrix.config.duration.
- benchmark_lib.sh: build_replay_cmd / resolve_trace_source /
install_agentic_deps / write_agentic_result_json helpers; consumes
DEBUG_TRACE → --debug-trace.
- runners/launch_*.sh: shared agentic mode dispatch + scenario routing.
- runners/launch_b200-dgxc-slurm.sh → launch_b200-dgxc.sh rename to
match the actual runner.name observed by the workflow.
Result aggregation
- utils/agentic-benchmark/{bench,analysis,scripts}: metrics collector
(vllm/sglang Prometheus parsers; see the sketch after this list),
pareto plotter, per-config distribution analyzer, sweep aggregator.
- utils/process_agentic_result.py: per-job results.json builder.
- utils/matrix_logic: agentic-coding scenario plumbing in
generate_sweep_configs.py + validation.py.
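To give a feel for what the metrics collector does, here is a minimal sketch that scrapes an engine's Prometheus `/metrics` endpoint and diffs cumulative token counters across a period. The metric name and URL below are typical vLLM defaults and should be treated as assumptions, not a spec of the actual collector:

```python
import re
import urllib.request

# Prometheus text format: metric_name{labels} value
LINE_RE = re.compile(r'^([a-zA-Z_:][\w:]*)(\{[^}]*\})?\s+([0-9.eE+-]+)$')

def scrape(url: str = "http://localhost:8000/metrics") -> dict[str, float]:
    metrics: dict[str, float] = {}
    with urllib.request.urlopen(url) as resp:
        for raw in resp.read().decode().splitlines():
            if raw.startswith("#"):
                continue  # skip HELP/TYPE comment lines
            m = LINE_RE.match(raw.strip())
            if m:
                metrics[m.group(1)] = float(m.group(3))  # labels ignored for brevity
    return metrics

# Period throughput = delta of cumulative counters across the period.
start = scrape()
# ... wait one collection period ...
end = scrape()
prompt_tokens = (end.get("vllm:prompt_tokens_total", 0.0)
                 - start.get("vllm:prompt_tokens_total", 0.0))
```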
Examples (one per vendor)
- benchmarks/single_node/agentic/dsr1_fp4_b200.sh — NVIDIA.
- benchmarks/single_node/agentic/dsr1_fp4_mi355x.sh — AMD.
- Matching agentic-coding sections in nvidia-master.yaml
(dsr1-fp4-b200-sglang) and amd-master.yaml (dsr1-fp4-mi355x-sglang).
All other model-specific launchers and matrix entries are deliberately
left out of this PR; downstream PRs add them on a per-model basis.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that, after merging, all GitHub Actions jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
DISCLAIMER: Since AgentX is currently in its MVP v0.1 phase, everything here is subject to change at any time.
This is AgentX MVP v0.1; a new & improved MVP will ship every week until we hit v1.0, i.e. v0.2 coming in a week, v0.3 coming in two weeks, etc.
Note: there are approximately 4k lines of actual diff; the rest comes from indentation refactors / backfill / etc.
Foreword: as a human reading this code, here are the files that are actually important to read and verify line by line:
- .github/workflows
- benchmarks/
- utils/agentic-benchmark/bench
- utils/trace-replay (the submodule) -- this is the most important, as it is the core logic for trace replaying!!!

Everything else is largely not important and can be thought of as new-age assembly.
This is not intended to be a final implementation; rather, it is an MVP. The goal is to merge this in as the foundation for the agentic trace replay benchmark and iterate fast as a community to converge on a final solution.
The dataset itself is expected to change as SemiAnalysis collects more representative production traces. Currently the traces are out of date and may inflate KV cache hit rate due to fewer subagent branches. Nevertheless, the general workload shapes are very similar (avg ISL / avg OSL, inter-turn latency, etc.), and thus this dataset can be used for testing in the meantime.
The benchmark has so far been publicly verified on:
The actual trace replaying logic is contained in the trace replayer submodule. This points to a minimized branch of Callan Fox's (WEKA) https://github.com/callanjfox/kv-cache-tester -- specifically it is this branch. It is undecided whether this will remain a submodule or be pulled directly into upstream. The code is relatively simple. It expects a set of trace JSON objects and replays them independently subject to some level of concurrency. This code is subject to change as we iterate from this v0.1 version.
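In sketch form, that core loop looks roughly like the following; the trace schema (`trace["turns"]`, `turn["prompt"]`) and endpoint are illustrative assumptions, not the submodule's actual API:

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="-")

async def replay_trace(trace: dict, model: str, sem: asyncio.Semaphore):
    """Replay one multi-turn trace: send each user turn, append the reply."""
    history: list[dict] = []
    async with sem:  # one trace per client slot at a time
        for turn in trace["turns"]:
            history.append({"role": "user", "content": turn["prompt"]})
            resp = await client.chat.completions.create(
                model=model, messages=history)
            history.append({"role": "assistant",
                            "content": resp.choices[0].message.content})

async def main(traces: list[dict], model: str, concurrency: int):
    sem = asyncio.Semaphore(concurrency)  # fixed client concurrency
    await asyncio.gather(*(replay_trace(t, model, sem) for t in traces))

# asyncio.run(main(traces, "my-model", concurrency=32))
```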
Full RFC
DRAFT: InferenceX AgentX RFC [MVP v0.1]
Introduction
InferenceX currently only has single-turn, "fixed sequence length" scenarios (8k1k, 1k1k). These scenarios all use random data and are isolated single-turn requests, thus they do not benefit from prefix caching. Prefix caching is the default in production systems, as it significantly increases input token throughput and decreases TTFT in multi-turn scenarios by caching computed key and value vectors for each request. Current InferenceX results can be thought of as a baseline, i.e., how efficiently can chip X + engine Y serve model Z (possibly with some extra optimizations like speculative decoding) with some fixed input/output tensor shapes (with some slight variance).
Relevant code: #1103
Agentic Coding Workloads
The current "baseline" benchmark is still useful, as results are a strong indication of the raw end-to-end capability of the chips + SW stacks. However, InferenceX will benefit from additional scenarios that benchmark specific real-world workloads such as agentic coding, which is the topic of this RFC. With the rise of coding harnesses such as Claude Code, Codex, Cursor, OpenCode, etc., agentic workflows have dominated. These workflows are characterized by:
Some (but not all) of the allowed optimizations/restrictions:
As always, the goal of InferenceX is to highlight the most optimal inference configurations so long as they are actually employed in the real world. This agentic benchmark will shift closer to an end-to-end system benchmark rather than just a chip+kernel benchmark.
InferenceX Integration
To add this scenario to InferenceX, the main challenge is not the integration but rather finding a dataset that accurately reflects production agentic workflows. These datasets are few and far between because they are extremely valuable, and not many are made publicly available.
Over the past month, in collaboration with AMD (and others in the community whose core engineers talk with us), we have hosted a proxy that captures internal Claude Code traces. So far, we have collected 36B tokens worth of anonymized traces. In order to respect users' privacy, none of the raw conversation (text) is stored. Rather, prompts are tokenized by the proxy and then hashed into blocks of 64 tokens. The blocks are unique to each session/conversation. This allows traces to be replayed back in a way that emulates KV cache reuse patterns without exposing the original conversation.
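As a sketch of that anonymization step (the exact hashing scheme here is illustrative, not the proxy's actual implementation):

```python
import hashlib

BLOCK = 64  # tokens per block, per the proxy's design

def anonymize(token_ids: list[int], session_id: str) -> list[str]:
    """Replace each 64-token block with a session-salted hash.

    No raw text or token ids survive; identical prefixes within a session
    still map to identical hash sequences, so the KV-cache reuse structure
    is preserved in the anonymized trace.
    """
    hashes = []
    for i in range(0, len(token_ids), BLOCK):
        block = token_ids[i:i + BLOCK]
        digest = hashlib.sha256(
            session_id.encode() + str(block).encode()).hexdigest()
        hashes.append(digest)
    return hashes
```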
Although our sample size is still relatively small, this is the best way to collect and replay realistic production agentic coding traces.
There are many trace replayer implementations that could be used and, to be frank, which one is chosen matters little. At the core, a trace replayer just spawns clients, fills in anonymous token blocks with synthetic data, constructs multi-turn conversations as they were originally captured, sends these conversations to an API server, and then records basic telemetry data like TTFT, TPOT, QPS, etc.
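For example, filling each hashed block deterministically, seeded by the block hash, guarantees that shared prefixes expand to identical synthetic tokens, preserving the original KV-cache reuse pattern (a sketch under that assumption):

```python
import random

def fill_block(block_hash: str, vocab_size: int = 32000,
               block_len: int = 64) -> list[int]:
    # Seeding with the hash makes the expansion deterministic: identical
    # block hashes (shared prefixes) yield identical synthetic tokens.
    rng = random.Random(block_hash)
    return [rng.randrange(vocab_size) for _ in range(block_len)]
```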
For InferenceX integration, we choose to use Callan Fox's (WEKA) replayer, which can be found here. We chose this because Callan has done a lot of work over the past year on this topic and has been using this replayer himself. It is simple. Beyond this, there is no legitimate reason this replayer should necessarily be chosen over another, or vice versa. Trace replayers are quite simple, especially when the cost of coding custom solutions has gone to zero. This can be changed at any time.
To make the scope manageable, we will prioritize running this benchmark on the following models:
Methodology
In this section we will explain the way in which we will run the agentic benchmark and collect results.
To obtain a Pareto curve of configurations, we will "sweep" over the number of concurrent conversations. Each client replays one trace at a time; when a client finishes a trace, the trace gets recycled into the queue of available traces. The traces will be replayed at their original speed, including the time between turns but not including the time it took to receive the response originally. We cap the maximum time between turns at 60 seconds.
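The pacing rule in sketch form (the timestamp field names are assumptions):

```python
import asyncio

MAX_INTER_TURN_S = 60.0  # cap on replayed think-time between turns

async def wait_between_turns(prev_turn: dict, turn: dict) -> None:
    # Replay the user's original think-time between turns, excluding the
    # time the original server spent generating its response.
    gap = turn["client_ts"] - prev_turn["response_done_ts"]
    await asyncio.sleep(min(max(gap, 0.0), MAX_INTER_TURN_S))
```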
The traces may (and often will) include subagents. Many of these can be launched in parallel, thus increasing the number of total in-flight requests above the specified concurrency for a brief period. The assumption for the benchmark is that all subagents will be routed to the same model, regardless of which model the subagent was routed to in the original trace.
Some of the traces will have context lengths longer than the model's max_model_len. When this occurs, the conversations will be truncated accordingly.

In order to simulate a steady-state start, the initial herd of conversations is randomly started anywhere between 0-70% of their conversation length. For warmup, we will prefill all of these requests before beginning the metrics collection of the benchmark. We will collect metrics between the end of warmup (when the steady-state benchmark begins) and the end of the duration; we will not capture metrics during cooldown (when in-flight requests are finishing).
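Both rules in sketch form; the warm-start range (0-70%) is from above, while the truncation policy (keep the most recent tokens, reserve an output budget) is one plausible choice, not necessarily the replayer's:

```python
import random

def warm_start_turn(num_turns: int, rng: random.Random) -> int:
    # Start each initial conversation 0-70% of the way through so the
    # benchmark opens with a steady-state-like mix of context lengths.
    return int(num_turns * rng.uniform(0.0, 0.7))

def truncate_to_context(token_ids: list[int], max_model_len: int,
                        reserve_output: int = 1024) -> list[int]:
    # Keep the most recent tokens that fit, leaving room for the output.
    budget = max_model_len - reserve_output
    return token_ids[-budget:] if len(token_ids) > budget else token_ids
```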
For this MVP, the benchmarks have been run for 30 minutes, which has been showing steady-state results. We can potentially run the benchmarks for a shorter duration, but likely 15 minutes at the very minimum in order to actually run deep enough to reach steady state and trigger KV offloading (if enabled). Developers can also test with a shorter duration to get a proxy of how the configurations perform and then run for longer on the official submission. We will need to decide on official guidelines for submission such that all submissions are on an even playing field. The main issue for consideration here is that lower-concurrency runs see fewer requests by construction.
We will collect P50, P75, P90, and P99 metrics for TTFT, TPOT, etc. We can decide later on what makes sense to actually advertise statistically on inferencex dot com based on the sample sizes we observe.
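Computing those percentiles from the collected samples is straightforward, e.g.:

```python
import numpy as np

def summarize(samples_ms: list[float]) -> dict[str, float]:
    arr = np.asarray(samples_ms)
    return {f"p{p}": float(np.percentile(arr, p)) for p in (50, 75, 90, 99)}

print(summarize([12.0, 15.5, 13.1, 80.2, 14.9]))  # e.g. TTFT samples in ms
```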
For this MVP, we are not using traces captured by our internal Claude Code proxy. This is because of the immaturity of the proxy, which we must iterate on to collect a sufficient amount of quality traces. Luckily, WEKA has provided some great traces already that can be found here. The HuggingFace dataset card shows the metrics for the traces in this dataset. You can see that the defining statistics are roughly the same as those we collected internally. The main "issue" with this dataset is that it is slightly out of date and has far fewer subagents spawned in the conversations. This ultimately causes higher cache hit rates because of fewer context branches (i.e., the conversations only grow and never split, leading to fewer prefill passes).
To be clear, we plan on using our internal traces soon. We are using the existing traces for this MVP so that we can iterate fast.
Outstanding Considerations
There are some outstanding things to consider when replaying traces:
- How to fill in anonymized token blocks. There are a few options:
  - A vocabulary.py that has a pool of words related to different things (coding, etc.) and sentence structures, then attempts to model semi-coherent English sentences for the prompts (see the sketch after this list)
- Whether to follow the original traces' output token lengths exactly or let the model generate naturally
  - e.g., replaying exact lengths via --ignore-eos, which causes gibberish output
- How to ensure some level of determinism without encouraging gamifying of the workload shapes?
- Is running for a fixed duration actually the correct way to do things?
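A toy sketch of the vocabulary.py option referenced above; the word pools and sentence templates here are made up, and the real file would carry much larger domain-specific pools:

```python
import random

NOUNS = ["function", "buffer", "tensor", "scheduler", "cache"]
VERBS = ["allocates", "flushes", "compiles", "rewrites", "profiles"]
TEMPLATES = ["the {n} {v} the {n2}", "please check why the {n} {v} slowly"]

def synthetic_sentence(rng: random.Random) -> str:
    return rng.choice(TEMPLATES).format(
        n=rng.choice(NOUNS), v=rng.choice(VERBS), n2=rng.choice(NOUNS))

def fill_block_text(block_hash: str, target_words: int = 48) -> str:
    # Seed with the block hash so identical blocks yield identical text,
    # keeping prefix-cache reuse while looking like plausible English.
    rng = random.Random(block_hash)
    words: list[str] = []
    while len(words) < target_words:
        words.extend(synthetic_sentence(rng).split())
    return " ".join(words[:target_words])
```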