[None][feat] adp_router: round-robin first turn of each conversation #14744
Draft
lancelly wants to merge 3 commits into
…aware routers

Both the disagg-level ConversationRouter and KvCacheAwareRouter now report whether a request is the first turn of its conversation, surfaced in the metadata dict returned by get_next_server and in a debug log line.

- Factor the conversation-id lookup into a shared BlockHashMixin._get_conversation_id staticmethod (reads the serve-layer disaggregated_params.conversation_id populated upstream from session headers by OpenAIDisaggServer._extract_conversation_id), and drop ConversationRouter's duplicate copy.
- ConversationRouter derives is_first_turn from session-table membership (no new state) and includes it in the STICKY / IMPLICIT / FALLBACK return paths.
- KvCacheAwareRouter has no session tracking, so add a small LRU-bounded _seen_conversations set (max_sessions, default 100000) plus _record_conversation_turn() to flag the first turn with matching semantics.

is_first_turn is False when no conversation id is available (e.g. the client sent no X-Session-ID header), i.e. 'not known to be a first turn'.

Adds unit tests covering the seen-set LRU helper and the is_first_turn output of both routers.

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
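For readers following along, the LRU-bounded seen-set described in this commit can be sketched roughly as below. This is an illustration only, not the PR's code: the class name SeenConversations and the method record_turn are hypothetical stand-ins for _seen_conversations and _record_conversation_turn(), and only the max_sessions default and the None-handling semantics are taken from the commit message.

```python
from collections import OrderedDict
from typing import Optional


class SeenConversations:
    """Hypothetical LRU-bounded set of conversation ids (illustrative only)."""

    def __init__(self, max_sessions: int = 100_000) -> None:
        self._max_sessions = max_sessions
        # OrderedDict used as an ordered set: oldest entry first.
        self._seen: "OrderedDict[str, None]" = OrderedDict()

    def record_turn(self, conversation_id: Optional[str]) -> bool:
        """Return True only when this id has not been seen before."""
        if conversation_id is None:
            # No id available: "not known to be a first turn".
            return False
        if conversation_id in self._seen:
            # Known conversation: refresh its recency, not a first turn.
            self._seen.move_to_end(conversation_id)
            return False
        self._seen[conversation_id] = None
        if len(self._seen) > self._max_sessions:
            # Evict the least recently used conversation id.
            self._seen.popitem(last=False)
        return True
```

One consequence of bounding the set is that a conversation evicted from it is reported as a first turn again on its next request; whether that matters depends on how max_sessions compares to the number of live conversations.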
…ayer

conversation_id existed only on the serve-layer DisaggregatedParams (openai_protocol). to_llm_disaggregated_params() did not copy it and the executor-layer DisaggregatedParams (tensorrt_llm/disaggregated_params.py, == LlmDisaggregatedParams) had no such field, so it was silently dropped when a worker converted the incoming request (openai_server.py to_llm_disaggregated_params). As a result worker-side consumers that read request.py_disaggregated_params.conversation_id (e.g. the ADP router) only ever saw None, even though the orchestrator routed on a real conversation id from the X-Session-ID header.

- Add conversation_id to the executor-layer DisaggregatedParams.
- Copy it through both to_llm_disaggregated_params and to_disaggregated_params so the serve<->executor round-trip preserves it.

This makes the conversation id available all the way to the worker; the orchestrator-level routers are unaffected (they already read the serve layer directly).

Extends the converter unit tests to assert conversation_id propagates and adds a round-trip test.

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
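The failure mode and the fix can be illustrated with a minimal sketch. The two dataclasses and the converter functions below are hypothetical stand-ins for the serve-layer and executor-layer DisaggregatedParams and their converters; the real classes carry more fields, and request_type is used here only as a placeholder for them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServeParams:
    """Hypothetical stand-in for the serve-layer DisaggregatedParams."""
    request_type: str
    conversation_id: Optional[str] = None


@dataclass
class ExecutorParams:
    """Hypothetical stand-in for the executor-layer DisaggregatedParams."""
    request_type: str
    conversation_id: Optional[str] = None  # the field this commit adds


def to_executor(params: ServeParams) -> ExecutorParams:
    # Before the fix the conversion omitted conversation_id, so the
    # worker side always observed None. Copy it explicitly.
    return ExecutorParams(
        request_type=params.request_type,
        conversation_id=params.conversation_id,
    )


def to_serve(params: ExecutorParams) -> ServeParams:
    # Mirror the copy in the reverse direction so a round trip is lossless.
    return ServeParams(
        request_type=params.request_type,
        conversation_id=params.conversation_id,
    )
```

A round-trip assertion such as `to_serve(to_executor(p)) == p` is the kind of check the added converter test presumably performs.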
KV cache-aware ADP routing pins every new conversation to whichever rank first cached the shared system-prompt prefix: a first turn matches that ~2k-token shared prefix (>match_rate_threshold) so cache affinity routes it there, and once it lands it sticks. With a long shared system prompt this funnels new conversations onto one or two ranks (observed cumulative routed CV ~40-56%, max/min >4x across 8 DEP ranks), which then form a chunked-prefill backlog and raise TTFT.

Add kv_cache_routing_first_turn_round_robin (AttentionDpConfig, default False). When enabled, the first turn of each conversation (is_first_turn, computed by the serve router and propagated to the worker via disaggregated_params) is routed round-robin across eligible ranks instead of by affinity score; subsequent turns keep the affinity path and stick to the rank holding their own conversation history. This spreads unit births evenly, so every rank caches the shared prefix and affinity stops funnelling.

Plumb is_first_turn alongside conversation_id: add the field to the executor- and serve-layer DisaggregatedParams, copy it through both converters, and have the disagg service write the router's is_first_turn onto the ctx request (ctx-first and gen-first paths).

The round-robin cursor is mutated identically on every TP rank (same new_requests, same order), matching the existing cold-start warmup, so the distributed routing protocol stays in lock-step.

Measured on DSv4-Pro 1p1d c=96 (DEP8): routed CV 41-56% -> 7%, max/min 4.2x -> 1.2x, cache hit 95.2% -> 96.5%, TTFT p95 8.4s -> 7.4s, server output throughput +30%. Affinity stickiness for turn>=2 preserved (96% of decisions still AFFINITY).

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
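The routing decision described in this commit can be sketched as follows. This is a simplified illustration under stated assumptions, not the ADP router's implementation: the class FirstTurnRoundRobin, the pick_rank signature, and the flat per-rank affinity_scores list are all hypothetical, and load balancing and the match_rate_threshold check are omitted.

```python
from typing import Sequence


class FirstTurnRoundRobin:
    """Hypothetical sketch: round-robin first turns, affinity otherwise."""

    def __init__(self, num_ranks: int) -> None:
        self._num_ranks = num_ranks
        self._cursor = 0  # next rank to try for a first turn

    def pick_rank(
        self,
        is_first_turn: bool,
        affinity_scores: Sequence[float],
        eligible: Sequence[int],
    ) -> int:
        if not eligible:
            raise ValueError("no eligible rank")
        if is_first_turn:
            # Advance the cursor until it lands on an eligible rank.
            for _ in range(self._num_ranks):
                rank = self._cursor
                self._cursor = (self._cursor + 1) % self._num_ranks
                if rank in eligible:
                    return rank
            raise ValueError("eligible ranks are out of range")
        # Later turns: pick the eligible rank with the best cache affinity.
        return max(eligible, key=lambda r: affinity_scores[r])
```

Because the cursor depends only on the sequence of calls, every rank that processes the same requests in the same order computes the same assignment, which is the lock-step property the commit message relies on.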
57758a3 to c978776
lancelly added a commit to lancelly/TensorRT-LLM that referenced this pull request, Jun 5, 2026
…->rank affinity

Adds an instance-level ADP router that round-robins the first request of each conversation across ranks, then pins every subsequent request with the same conversation_id to that conversation's first-turn rank. This keeps a multi-turn conversation's growing KV-cache prefix on one rank (maximizing block reuse, minimizing cross-rank migration) while spreading new conversations evenly.

Unlike KVCacheAwareADPRouter (which infers affinity from probed prefix-match length and loses a conversation when its blocks are evicted), the conversation_id -> rank map is explicit and survives eviction. Inspired by the serve-level ConversationRouter and the first-turn-round-robin idea from NVIDIA#14744, applied at the intra-instance ADP-rank level.

conversation_id is read from py_disaggregated_params.conversation_id (serve-side propagated from X-Session-ID); falls back to load-balanced round-robin when it is absent, so behavior degrades gracefully.

Selected via the new attention_dp_config.kv_cache_routing_conversation_affinity flag (kv_cache_routing_max_sessions bounds the LRU map). Includes unit tests covering first-turn RR, stickiness, conv-less fallback, cross-rank determinism, LRU eviction, sticky overflow, and factory selection.

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
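A rough sketch of the explicit conversation_id -> rank map follows. The class and method names are hypothetical, and the sketch simplifies the referenced commit in one labeled way: requests without a conversation_id take a plain round-robin slot here, whereas the commit describes a load-balanced round-robin fallback.

```python
from collections import OrderedDict
from typing import Optional


class ConversationAffinityRouter:
    """Hypothetical sketch of first-turn round-robin plus sticky ranks."""

    def __init__(self, num_ranks: int, max_sessions: int) -> None:
        self._num_ranks = num_ranks
        self._max_sessions = max_sessions
        self._cursor = 0
        # conversation_id -> rank, ordered from least to most recently used.
        self._rank_of: "OrderedDict[str, int]" = OrderedDict()

    def route(self, conversation_id: Optional[str]) -> int:
        if conversation_id is not None and conversation_id in self._rank_of:
            # Sticky path: reuse the rank chosen on the first turn.
            self._rank_of.move_to_end(conversation_id)
            return self._rank_of[conversation_id]
        # First turn, or no conversation id: take the next round-robin slot.
        rank = self._cursor
        self._cursor = (self._cursor + 1) % self._num_ranks
        if conversation_id is not None:
            self._rank_of[conversation_id] = rank
            if len(self._rank_of) > self._max_sessions:
                # Bound memory: forget the least recently used conversation.
                self._rank_of.popitem(last=False)
        return rank
```

The map is independent of KV-cache block eviction, so a conversation stays on its rank even after its cached blocks are gone; it is forgotten only when the LRU bound pushes it out of the map.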
No description provided.