[None][feat] adp_router: round-robin first turn of each conversation #14744

Draft
lancelly wants to merge 3 commits into NVIDIA:feat/deepseek_v4 from lancelly:feat/router-first-turn-rr

Conversation

lancelly (Collaborator) commented May 29, 2026

No description provided.

lancelly added 2 commits May 28, 2026 20:08
…aware routers

Both the disagg-level ConversationRouter and KvCacheAwareRouter now report
whether a request is the first turn of its conversation, surfaced in the
metadata dict returned by get_next_server and in a debug log line.

- Factor the conversation-id lookup into a shared
  BlockHashMixin._get_conversation_id staticmethod (reads the
  serve-layer disaggregated_params.conversation_id populated upstream
  from session headers by OpenAIDisaggServer._extract_conversation_id),
  and drop ConversationRouter's duplicate copy.
- ConversationRouter derives is_first_turn from session-table membership
  (no new state) and includes it in the STICKY / IMPLICIT / FALLBACK
  return paths.
- KvCacheAwareRouter has no session tracking, so add a small
  LRU-bounded _seen_conversations set (max_sessions, default 100000)
  plus _record_conversation_turn() to flag the first turn with matching
  semantics.

is_first_turn is False when no conversation id is available (e.g. the
client sent no X-Session-ID header), i.e. 'not known to be a first turn'.
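
The seen-set semantics described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the class name SeenConversations and the method name record_turn are hypothetical stand-ins for _seen_conversations and _record_conversation_turn(), and an OrderedDict is assumed as the LRU container.

```python
from collections import OrderedDict
from typing import Optional


class SeenConversations:
    """LRU-bounded set of conversation ids (hypothetical helper)."""

    def __init__(self, max_sessions: int = 100_000) -> None:
        self._max_sessions = max_sessions
        self._seen: "OrderedDict[str, None]" = OrderedDict()

    def record_turn(self, conversation_id: Optional[str]) -> bool:
        """Record one turn; return True only for a known first turn."""
        if conversation_id is None:
            # No id available: "not known to be a first turn".
            return False
        if conversation_id in self._seen:
            # Later turn: refresh recency so active sessions are kept.
            self._seen.move_to_end(conversation_id)
            return False
        self._seen[conversation_id] = None
        if len(self._seen) > self._max_sessions:
            # Evict the least recently used conversation id.
            self._seen.popitem(last=False)
        return True
```

One consequence of bounding the set: a conversation whose id has been evicted is reported as a first turn again on its next request.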

Adds unit tests covering the seen-set LRU helper and the is_first_turn
output of both routers.

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
…ayer

conversation_id existed only on the serve-layer DisaggregatedParams
(openai_protocol). The executor-layer DisaggregatedParams
(tensorrt_llm/disaggregated_params.py, == LlmDisaggregatedParams) had no
such field and to_llm_disaggregated_params() did not copy it, so the id
was silently dropped when a worker converted the incoming request (the
to_llm_disaggregated_params call in openai_server.py). As a result,
worker-side consumers that read
request.py_disaggregated_params.conversation_id (e.g. the ADP router) only
ever saw None, even though the orchestrator routed on a real conversation
id from the X-Session-ID header.

- Add conversation_id to the executor-layer DisaggregatedParams.
- Copy it through both to_llm_disaggregated_params and
  to_disaggregated_params so the serve<->executor round-trip preserves it.

This makes the conversation id available all the way to the worker; the
orchestrator-level routers are unaffected (they already read the serve
layer directly).

Extends the converter unit tests to assert conversation_id propagates and
adds a round-trip test.
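
The round-trip property can be sketched as follows. This is a simplified illustration, not the real classes: both dataclasses are stand-ins carrying a single illustrative request_type field next to conversation_id, and the two converters are written as free functions for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServeDisaggregatedParams:
    """Stand-in for the serve-layer params (fields are illustrative)."""
    request_type: str
    conversation_id: Optional[str] = None


@dataclass
class LlmDisaggregatedParams:
    """Stand-in for the executor-layer params (fields are illustrative)."""
    request_type: str
    conversation_id: Optional[str] = None


def to_llm_disaggregated_params(
        p: ServeDisaggregatedParams) -> LlmDisaggregatedParams:
    # Serve -> executor: copy conversation_id instead of dropping it.
    return LlmDisaggregatedParams(request_type=p.request_type,
                                  conversation_id=p.conversation_id)


def to_disaggregated_params(
        p: LlmDisaggregatedParams) -> ServeDisaggregatedParams:
    # Executor -> serve: the reverse copy keeps the round trip lossless.
    return ServeDisaggregatedParams(request_type=p.request_type,
                                    conversation_id=p.conversation_id)


original = ServeDisaggregatedParams("context_only", conversation_id="sess-1")
round_tripped = to_disaggregated_params(to_llm_disaggregated_params(original))
```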

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
KV cache-aware ADP routing pins every new conversation to whichever rank
first cached the shared system-prompt prefix: a first turn matches that
~2k-token shared prefix (>match_rate_threshold) so cache affinity routes
it there, and once it lands it sticks. With a long shared system prompt
this funnels new conversations onto one or two ranks (observed cumulative
routed CV ~40-56%, max/min >4x across 8 DEP ranks), which then form a
chunked-prefill backlog and raise TTFT.

Add kv_cache_routing_first_turn_round_robin (AttentionDpConfig, default
False). When enabled, the first turn of each conversation
(is_first_turn, computed by the serve router and propagated to the worker
via disaggregated_params) is routed round-robin across eligible ranks
instead of by affinity score; subsequent turns keep the affinity path and
stick to the rank holding their own conversation history. This spreads
new conversations evenly across ranks, so every rank caches the shared
prefix and affinity stops funnelling.

Plumb is_first_turn alongside conversation_id: add the field to the
executor- and serve-layer DisaggregatedParams, copy it through both
converters, and have the disagg service write the router's is_first_turn
onto the ctx request (ctx-first and gen-first paths).

The round-robin cursor is mutated identically on every TP rank (same
new_requests, same order), matching the existing cold-start warmup, so
the distributed routing protocol stays in lock-step.
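
The lock-step argument can be sketched as follows. This is a minimal illustration under simplifying assumptions: every rank is treated as eligible, the affinity path is passed in as a callback, and the names Request, FirstTurnRoundRobin and affinity_rank are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Request:
    """Hypothetical request carrying the propagated is_first_turn flag."""
    request_id: int
    is_first_turn: bool = False


class FirstTurnRoundRobin:
    """Round-robin first turns; defer later turns to an affinity callback."""

    def __init__(self, num_ranks: int) -> None:
        self.num_ranks = num_ranks
        self._cursor = 0

    def route(self, new_requests: Sequence[Request],
              affinity_rank: Callable[[Request], int]) -> List[int]:
        assignments = []
        for req in new_requests:
            if req.is_first_turn:
                # The cursor advances only on first turns, in request order.
                rank = self._cursor
                self._cursor = (self._cursor + 1) % self.num_ranks
            else:
                rank = affinity_rank(req)
            assignments.append(rank)
        return assignments
```

Because the cursor depends only on the sequence of new_requests, two instances fed the same requests in the same order make identical decisions without exchanging any state.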

Measured on DSv4-Pro 1p1d c=96 (DEP8): routed CV 41-56% -> 7%,
max/min 4.2x -> 1.2x, cache hit 95.2% -> 96.5%, TTFT p95 8.4s -> 7.4s,
server output throughput +30%. Affinity stickiness for turn>=2 preserved
(96% of decisions still AFFINITY).
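
The two imbalance metrics quoted above can be computed from per-rank routed counts roughly as follows. This is a sketch assuming CV means population standard deviation divided by mean; the commit does not state which estimator was used, and the counts in the example are made up for illustration, not measurements from this PR.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def imbalance(routed_counts: Sequence[int]) -> Tuple[float, float]:
    """Return (coefficient of variation, max/min ratio) of per-rank counts."""
    cv = pstdev(routed_counts) / mean(routed_counts)
    ratio = max(routed_counts) / min(routed_counts)
    return cv, ratio


# Hypothetical cumulative routed counts for 8 ranks (illustrative only).
cv, ratio = imbalance([2, 4, 4, 4, 5, 5, 7, 9])
```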

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
lancelly force-pushed the feat/router-first-turn-rr branch from 57758a3 to c978776 May 29, 2026 09:05
lancelly added a commit to lancelly/TensorRT-LLM that referenced this pull request Jun 5, 2026
…->rank affinity

Adds an instance-level ADP router that round-robins the first request of each
conversation across ranks, then pins every subsequent request with the same
conversation_id to that conversation's first-turn rank. This keeps a multi-turn
conversation's growing KV-cache prefix on one rank (maximizing block reuse,
minimizing cross-rank migration) while spreading new conversations evenly.

Unlike KVCacheAwareADPRouter (which infers affinity from probed prefix-match
length and loses a conversation when its blocks are evicted), the
conversation_id -> rank map is explicit and survives eviction. Inspired by the
serve-level ConversationRouter and the first-turn-round-robin idea from NVIDIA#14744,
applied at the intra-instance ADP-rank level.

conversation_id is read from py_disaggregated_params.conversation_id (serve-side
propagated from X-Session-ID); falls back to load-balanced round-robin when it
is absent, so behavior degrades gracefully. Selected via the new
attention_dp_config.kv_cache_routing_conversation_affinity flag
(kv_cache_routing_max_sessions bounds the LRU map).

Includes unit tests covering first-turn RR, stickiness, conv-less fallback,
cross-rank determinism, LRU eviction, sticky overflow, and factory selection.
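
The explicit conversation_id -> rank map can be sketched as follows. This is a minimal illustration, not the referenced commit's code: the class name ConversationAffinityRouter is hypothetical, the fallback for requests without an id is plain round-robin rather than the load-balanced variant described above, and sticky overflow handling is not modeled.

```python
from collections import OrderedDict
from typing import Optional


class ConversationAffinityRouter:
    """Round-robin first turns, then pin a conversation to its first rank."""

    def __init__(self, num_ranks: int, max_sessions: int = 100_000) -> None:
        self.num_ranks = num_ranks
        self.max_sessions = max_sessions
        self._rank_of: "OrderedDict[str, int]" = OrderedDict()
        self._cursor = 0

    def route(self, conversation_id: Optional[str]) -> int:
        if conversation_id is not None and conversation_id in self._rank_of:
            # Sticky path: reuse the first-turn rank and refresh recency.
            self._rank_of.move_to_end(conversation_id)
            return self._rank_of[conversation_id]
        # First turn, or no conversation id: take the next rank in order.
        rank = self._cursor
        self._cursor = (self._cursor + 1) % self.num_ranks
        if conversation_id is not None:
            self._rank_of[conversation_id] = rank
            if len(self._rank_of) > self.max_sessions:
                # Bound the map by evicting the least recently used entry.
                self._rank_of.popitem(last=False)
        return rank
```

The mapping does not depend on which KV blocks are still resident, so a conversation keeps its rank after its blocks are evicted; it loses the pin only when its own map entry is evicted.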

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>