[None][feat] adp_router: round-robin first turn of each conversation by lancelly · Pull Request #14744 · NVIDIA/TensorRT-LLM

lancelly · 2026-05-29T08:49:44Z

No description provided.

…aware routers

Both the disagg-level ConversationRouter and KvCacheAwareRouter now report whether a request is the first turn of its conversation, surfaced in the metadata dict returned by get_next_server and in a debug log line.

- Factor the conversation-id lookup into a shared BlockHashMixin._get_conversation_id staticmethod (reads the serve-layer disaggregated_params.conversation_id populated upstream from session headers by OpenAIDisaggServer._extract_conversation_id), and drop ConversationRouter's duplicate copy.
- ConversationRouter derives is_first_turn from session-table membership (no new state) and includes it in the STICKY / IMPLICIT / FALLBACK return paths.
- KvCacheAwareRouter has no session tracking, so add a small LRU-bounded _seen_conversations set (max_sessions, default 100000) plus _record_conversation_turn() to flag the first turn with matching semantics.

is_first_turn is False when no conversation id is available (e.g. the client sent no X-Session-ID header), i.e. 'not known to be a first turn'.

Adds unit tests covering the seen-set LRU helper and the is_first_turn output of both routers.

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
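The LRU-bounded seen-set described in this commit can be sketched as follows. This is a minimal illustration, not the PR's code: the class name `SeenConversations` and the method `record_turn` are hypothetical stand-ins for `_seen_conversations` and `_record_conversation_turn()`, and the eviction policy (drop the least recently seen id) is an assumption based on the "LRU-bounded" wording.

```python
from collections import OrderedDict
from typing import Optional


class SeenConversations:
    """LRU-bounded set of conversation ids (hypothetical helper for illustration)."""

    def __init__(self, max_sessions: int = 100_000) -> None:
        self._max_sessions = max_sessions
        # OrderedDict used as an ordered set: oldest entry first, newest last.
        self._seen: "OrderedDict[str, None]" = OrderedDict()

    def record_turn(self, conversation_id: Optional[str]) -> bool:
        """Return True only when this id is not currently tracked.

        A missing id yields False, i.e. 'not known to be a first turn'.
        """
        if conversation_id is None:
            return False
        if conversation_id in self._seen:
            # Refresh recency so active conversations are evicted last.
            self._seen.move_to_end(conversation_id)
            return False
        self._seen[conversation_id] = None
        if len(self._seen) > self._max_sessions:
            # Drop the least recently seen conversation.
            self._seen.popitem(last=False)
        return True
```

One consequence of bounding the set: a conversation whose id has been evicted is reported as a first turn again when it returns, so `max_sessions` should comfortably exceed the number of concurrently active conversations.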

…ayer

conversation_id existed only on the serve-layer DisaggregatedParams (openai_protocol). to_llm_disaggregated_params() did not copy it and the executor-layer DisaggregatedParams (tensorrt_llm/disaggregated_params.py, == LlmDisaggregatedParams) had no such field, so it was silently dropped when a worker converted the incoming request (openai_server.py to_llm_disaggregated_params). As a result worker-side consumers that read request.py_disaggregated_params.conversation_id (e.g. the ADP router) only ever saw None, even though the orchestrator routed on a real conversation id from the X-Session-ID header.

- Add conversation_id to the executor-layer DisaggregatedParams.
- Copy it through both to_llm_disaggregated_params and to_disaggregated_params so the serve<->executor round-trip preserves it.

This makes the conversation id available all the way to the worker; the orchestrator-level routers are unaffected (they already read the serve layer directly).

Extends the converter unit tests to assert conversation_id propagates and adds a round-trip test.

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
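The class of bug fixed here (a field present on one params type but not copied by the converter) can be shown with a small sketch. The two dataclasses and converter functions below are simplified, hypothetical stand-ins for the serve-layer and executor-layer DisaggregatedParams; the real classes carry more fields and the real converters have different signatures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServeParams:
    """Stand-in for the serve-layer DisaggregatedParams (hypothetical)."""
    request_type: str
    conversation_id: Optional[str] = None


@dataclass
class ExecutorParams:
    """Stand-in for the executor-layer DisaggregatedParams (hypothetical)."""
    request_type: str
    conversation_id: Optional[str] = None


def to_executor_params(p: ServeParams) -> ExecutorParams:
    # Copy every field explicitly; omitting conversation_id here is
    # exactly how the id would be silently dropped on the worker side.
    return ExecutorParams(
        request_type=p.request_type,
        conversation_id=p.conversation_id,
    )


def to_serve_params(p: ExecutorParams) -> ServeParams:
    return ServeParams(
        request_type=p.request_type,
        conversation_id=p.conversation_id,
    )


original = ServeParams(request_type="context_only", conversation_id="sess-42")
round_tripped = to_serve_params(to_executor_params(original))
```

A round-trip equality check like `round_tripped == original` is the kind of assertion that catches a field added to one layer but forgotten in a converter.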

KV cache-aware ADP routing pins every new conversation to whichever rank first cached the shared system-prompt prefix: a first turn matches that ~2k-token shared prefix (>match_rate_threshold) so cache affinity routes it there, and once it lands it sticks. With a long shared system prompt this funnels new conversations onto one or two ranks (observed cumulative routed CV ~40-56%, max/min >4x across 8 DEP ranks), which then form a chunked-prefill backlog and raise TTFT.

Add kv_cache_routing_first_turn_round_robin (AttentionDpConfig, default False). When enabled, the first turn of each conversation (is_first_turn, computed by the serve router and propagated to the worker via disaggregated_params) is routed round-robin across eligible ranks instead of by affinity score; subsequent turns keep the affinity path and stick to the rank holding their own conversation history. This spreads unit births evenly, so every rank caches the shared prefix and affinity stops funnelling.

Plumb is_first_turn alongside conversation_id: add the field to the executor- and serve-layer DisaggregatedParams, copy it through both converters, and have the disagg service write the router's is_first_turn onto the ctx request (ctx-first and gen-first paths).

The round-robin cursor is mutated identically on every TP rank (same new_requests, same order), matching the existing cold-start warmup, so the distributed routing protocol stays in lock-step.

Measured on DSv4-Pro 1p1d c=96 (DEP8): routed CV 41-56% -> 7%, max/min 4.2x -> 1.2x, cache hit 95.2% -> 96.5%, TTFT p95 8.4s -> 7.4s, server output throughput +30%. Affinity stickiness for turn>=2 preserved (96% of decisions still AFFINITY).

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
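The routing decision described in this commit can be sketched roughly as below. This is an illustration under stated assumptions, not the PR's implementation: `FirstTurnRoundRobinRouter`, its `route` signature, and the per-rank `affinity_scores` list are all hypothetical, and the real router's scoring, eligibility filtering, and load handling are not modelled.

```python
from typing import List


class FirstTurnRoundRobinRouter:
    """Hypothetical sketch: round-robin first turns, affinity otherwise."""

    def __init__(self, first_turn_round_robin: bool = False) -> None:
        self.first_turn_round_robin = first_turn_round_robin
        self._cursor = 0  # advanced only by first-turn decisions

    def route(self, is_first_turn: bool, affinity_scores: List[float],
              eligible_ranks: List[int]) -> int:
        if self.first_turn_round_robin and is_first_turn:
            # Ignore the shared-prefix match; spread new conversations evenly.
            rank = eligible_ranks[self._cursor % len(eligible_ranks)]
            self._cursor += 1
            return rank
        # Later turns follow cache affinity to the rank holding their history.
        return max(eligible_ranks, key=lambda r: affinity_scores[r])


def decide_all(router: FirstTurnRoundRobinRouter, requests) -> List[int]:
    """Route (is_first_turn, scores) pairs in order across 4 ranks."""
    return [router.route(first, scores, [0, 1, 2, 3]) for first, scores in requests]


# Every first turn matches the shared prefix cached on rank 0.
shared = [0.9, 0.1, 0.0, 0.0]
requests = [(True, shared), (True, shared),
            (False, [0.0, 0.0, 0.8, 0.1]), (True, shared)]

rank_a = decide_all(FirstTurnRoundRobinRouter(True), requests)
rank_b = decide_all(FirstTurnRoundRobinRouter(True), requests)
baseline = decide_all(FirstTurnRoundRobinRouter(False), requests)
```

Because the cursor is the only mutable state and it advances deterministically, two router instances fed the same requests in the same order produce identical decisions (`rank_a == rank_b`), which is the lock-step property the commit relies on. With the flag off, every first turn in this example lands on rank 0.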

…->rank affinity

Adds an instance-level ADP router that round-robins the first request of each conversation across ranks, then pins every subsequent request with the same conversation_id to that conversation's first-turn rank. This keeps a multi-turn conversation's growing KV-cache prefix on one rank (maximizing block reuse, minimizing cross-rank migration) while spreading new conversations evenly.

Unlike KVCacheAwareADPRouter (which infers affinity from probed prefix-match length and loses a conversation when its blocks are evicted), the conversation_id -> rank map is explicit and survives eviction. Inspired by the serve-level ConversationRouter and the first-turn-round-robin idea from NVIDIA#14744, applied at the intra-instance ADP-rank level.

conversation_id is read from py_disaggregated_params.conversation_id (serve-side propagated from X-Session-ID); falls back to load-balanced round-robin when it is absent, so behavior degrades gracefully.

Selected via the new attention_dp_config.kv_cache_routing_conversation_affinity flag (kv_cache_routing_max_sessions bounds the LRU map).

Includes unit tests covering first-turn RR, stickiness, conv-less fallback, cross-rank determinism, LRU eviction, sticky overflow, and factory selection.

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
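A minimal sketch of an explicit conversation_id -> rank map with LRU bounding, assuming nothing about the actual ConversationAwareADPRouter beyond what the commit message states. The class name `ConversationRankMap` and its `route` signature are hypothetical, and the fallback for requests without a conversation id is simplified here to "pick the least-loaded rank", standing in for the load-balanced round-robin the commit describes.

```python
from collections import OrderedDict
from typing import List, Optional


class ConversationRankMap:
    """Hypothetical sketch of explicit conversation -> rank affinity."""

    def __init__(self, num_ranks: int, max_sessions: int) -> None:
        self._num_ranks = num_ranks
        self._max_sessions = max_sessions
        self._cursor = 0
        self._rank_of: "OrderedDict[str, int]" = OrderedDict()

    def route(self, conversation_id: Optional[str], loads: List[int]) -> int:
        if conversation_id is None:
            # Simplified fallback (assumption): least-loaded rank, lowest index on ties.
            return min(range(self._num_ranks), key=lambda r: loads[r])
        if conversation_id in self._rank_of:
            # Sticky path: reuse the first-turn rank and refresh recency.
            self._rank_of.move_to_end(conversation_id)
            return self._rank_of[conversation_id]
        # First turn: assign the next rank in round-robin order.
        rank = self._cursor % self._num_ranks
        self._cursor += 1
        self._rank_of[conversation_id] = rank
        if len(self._rank_of) > self._max_sessions:
            # Bound memory by forgetting the least recently used conversation.
            self._rank_of.popitem(last=False)
        return rank
```

The map survives KV-cache block eviction because it never consults the cache, but it does not survive its own LRU eviction: a forgotten conversation is treated as new and may be assigned a different rank on its next turn.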

lancelly added 2 commits May 28, 2026 20:08

github-actions Bot assigned lancelly May 29, 2026

lancelly force-pushed the feat/router-first-turn-rr branch from 57758a3 to c978776 May 29, 2026 09:05

lancelly mentioned this pull request Jun 5, 2026

[None][feat] Add ConversationAwareADPRouter: explicit conversation->rank affinity for attention DP #14983

Draft

lancelly wants to merge 3 commits into NVIDIA:feat/deepseek_v4 from lancelly:feat/router-first-turn-rr
