[None][feat] Add ConversationAwareADPRouter: explicit conversation->rank affinity for attention DP by lancelly · Pull Request #14983 · NVIDIA/TensorRT-LLM

lancelly · 2026-06-05T02:06:47Z

What

Adds ConversationAwareADPRouter, an instance-level attention-DP router that:

- round-robins the first request of each conversation across DP ranks, then
- pins every subsequent request carrying the same conversation_id to that conversation's first-turn rank.

This keeps a multi-turn conversation's growing KV-cache prefix on a single rank — maximizing block reuse and minimizing recompute / cross-rank migration — while spreading the birth of new conversations evenly.
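The two-step policy above can be sketched in a few lines. This is a toy model, not the router from this PR: the class name `StickyRoundRobin` and its `route` method are hypothetical, and the conversation-less path uses a plain round-robin cursor where the PR describes a load-balanced one.

```python
from typing import Dict, Optional


class StickyRoundRobin:
    """Toy model: the first turn round-robins, later turns reuse the first-turn rank."""

    def __init__(self, num_ranks: int) -> None:
        self.num_ranks = num_ranks
        self.cursor = 0  # next rank handed to a new conversation
        self.conv_to_rank: Dict[str, int] = {}

    def route(self, conversation_id: Optional[str]) -> int:
        # Later turns: pin to the rank chosen on the first turn.
        if conversation_id is not None and conversation_id in self.conv_to_rank:
            return self.conv_to_rank[conversation_id]
        # First turn (or no conversation id): take the next rank in the cycle.
        rank = self.cursor
        self.cursor = (self.cursor + 1) % self.num_ranks
        # Requests without a conversation id are routed but never recorded.
        if conversation_id is not None:
            self.conv_to_rank[conversation_id] = rank
        return rank


router = StickyRoundRobin(num_ranks=4)
ranks = [router.route(c) for c in ["a", "b", "a", "c", "b", "a"]]
print(ranks)  # [0, 1, 0, 2, 1, 0]
```

Conversations "a", "b", and "c" are born on ranks 0, 1, and 2 in arrival order, and every later turn lands on its conversation's birth rank.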

Why / vs `KVCacheAwareADPRouter`

KVCacheAwareADPRouter infers affinity from probed prefix-match length. That affinity is lost the moment a conversation's blocks are evicted: the request then re-routes by load and the conversation can migrate ranks. ConversationAwareADPRouter keeps an explicit conversation_id -> rank LRU map, so stickiness is deterministic and survives eviction. It also does not require KV-cache block reuse to function (though it is most beneficial with it).
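To illustrate the explicit conversation_id -> rank LRU map, here is a minimal sketch built on `collections.OrderedDict`. The class name `ConversationRankLRU` and its `get`/`put` methods are assumptions made for illustration; the PR does not show the actual data structure, only that the map is bounded and evicts least-recently-used entries.

```python
from collections import OrderedDict
from typing import Optional


class ConversationRankLRU:
    """Bounded conversation_id -> rank map with least-recently-used eviction."""

    def __init__(self, max_sessions: int) -> None:
        if max_sessions <= 0:
            raise ValueError("max_sessions must be positive")
        self.max_sessions = max_sessions
        self._map: "OrderedDict[str, int]" = OrderedDict()

    def get(self, conversation_id: str) -> Optional[int]:
        rank = self._map.get(conversation_id)
        if rank is not None:
            # A hit refreshes recency, so active conversations stay resident.
            self._map.move_to_end(conversation_id)
        return rank

    def put(self, conversation_id: str, rank: int) -> None:
        self._map[conversation_id] = rank
        self._map.move_to_end(conversation_id)
        # Drop the oldest entries once the bound is exceeded.
        while len(self._map) > self.max_sessions:
            self._map.popitem(last=False)


lru = ConversationRankLRU(max_sessions=2)
lru.put("a", 0)
lru.put("b", 1)
lru.get("a")     # refreshes "a", leaving "b" as the oldest entry
lru.put("c", 2)  # exceeds the bound, so "b" is evicted
```

Because the map is keyed by conversation id rather than by cached blocks, an entry outlives KV-cache eviction and is dropped only when the session bound is exceeded.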

Inspired by the serve-level ConversationRouter (tensorrt_llm/serve/router.py) and the first-turn-round-robin idea from #14744, applied at the intra-instance ADP-rank level.

How it is wired

- Selected via AttentionDpConfig.kv_cache_routing_conversation_affinity (takes precedence over enable_kv_cache_aware_routing when both are set). kv_cache_routing_max_sessions bounds the LRU map.
- conversation_id is read from req.py_disaggregated_params.conversation_id (serve-side propagated from the X-Session-ID header). When it is absent (header not sent, non-disaggregated request, or propagation not in place), the request falls back to load-balanced round-robin and is not recorded, so behavior degrades gracefully to DefaultADPRouter-style spreading.
- Deterministic across TP ranks: route_requests runs locally on every rank with no broadcast, so the round-robin cursor and the conversation -> rank map evolve identically (same new_requests, same order). This is the same invariant the existing warmup cursor relies on; divergence would deadlock the allgather protocol.
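The determinism argument can be checked with a small simulation. In this sketch every simulated rank runs the same pure routing function over the same batches, and the per-rank decisions are compared. The function `simulate_local_routing` is hypothetical and uses a simplified cursor-plus-map policy, not the real route_requests.

```python
from typing import Dict, List, Optional, Sequence


def simulate_local_routing(
    batches: Sequence[Sequence[Optional[str]]], num_ranks: int
) -> List[List[int]]:
    """Replay batches of conversation ids with purely local state."""
    cursor = 0
    conv_to_rank: Dict[str, int] = {}
    decisions: List[List[int]] = []
    for batch in batches:
        assigned: List[int] = []
        for conv_id in batch:
            if conv_id is not None and conv_id in conv_to_rank:
                rank = conv_to_rank[conv_id]
            else:
                rank = cursor
                cursor = (cursor + 1) % num_ranks
                if conv_id is not None:
                    conv_to_rank[conv_id] = rank
            assigned.append(rank)
        decisions.append(assigned)
    return decisions


batches = [["a", "b", None], ["a", "c"], ["b", None, "c"]]

# Each simulated rank keeps its own state but sees identical input.
per_rank = [simulate_local_routing(batches, num_ranks=4) for _ in range(4)]
assert all(d == per_rank[0] for d in per_rank)

# If one rank saw the first batch in a different order, its state would diverge.
reordered = [["b", "a", None], ["a", "c"], ["b", None, "c"]]
diverged = simulate_local_routing(reordered, num_ranks=4)
print(per_rank[0])  # [[0, 1, 2], [0, 3], [1, 0, 3]]
print(diverged)     # [[0, 1, 2], [1, 3], [0, 0, 3]]
```

The second run shows why input order matters: swapping two requests in the first batch changes later assignments, which in the real system would leave ranks disagreeing about who owns which request.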

Tests

tests/unittest/_torch/executor/test_adp_router.py::TestConversationAwareADPRouter covers: first-turn round-robin, stickiness across turns, conversation-less fallback (unrecorded), cross-rank determinism, LRU eviction, sticky overflow (keeps mapping), explicit attention_dp_rank, and factory selection (enabled/disabled).

Notes

Draft. The full benefit requires conversation_id reaching the worker's py_disaggregated_params (serve-side propagation from X-Session-ID); without it the router is a safe round-robin no-op.

Update — now self-contained (2 commits)

[None][fix] serve: propagate conversation_id to the executor/worker layer — the base this router needs: adds conversation_id to the executor-layer DisaggregatedParams and copies it through to_llm_disaggregated_params / to_disaggregated_params, so the worker actually sees the id the orchestrator routed on (without it the router safely degrades to round-robin).
[None][feat] Add ConversationAwareADPRouter … — the router. Sticky returns are capped at the loose fair_share_multiplier * fair_share (expected), never the hard max_num_active_requests: a rank exceeding expected breaks the ADP padding invariant (py_executor._pad_attention_dp_dummy_request) and hangs the instance. Regression test test_returned_expected_covers_every_rank guards this.
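The sticky cap can be illustrated with a short sketch. The PR does not show how fair_share or expected are computed, so the formulas here are assumptions: fair_share is taken as the ceiling of total requests over ranks, and expected as the ceiling of fair_share_multiplier * fair_share. The helper names `expected_cap` and `place_sticky` are hypothetical, as is the least-loaded overflow choice.

```python
import math
from typing import List


def expected_cap(total_requests: int, num_ranks: int, fair_share_multiplier: float) -> int:
    """Assumed formula: loose per-rank cap derived from the fair share."""
    fair_share = math.ceil(total_requests / num_ranks)
    return math.ceil(fair_share_multiplier * fair_share)


def place_sticky(sticky_rank: int, assigned: List[int], cap: int) -> int:
    """Honor the sticky rank while it is under the cap, else overflow."""
    if assigned[sticky_rank] < cap:
        target = sticky_rank
    else:
        # Overflow to the least-loaded rank, lowest index first on ties.
        # The conversation -> rank mapping itself is left unchanged.
        target = min(range(len(assigned)), key=lambda r: (assigned[r], r))
    assigned[target] += 1
    return target


cap = expected_cap(total_requests=8, num_ranks=4, fair_share_multiplier=1.5)
assigned = [0, 0, 0, 0]
placements = [place_sticky(0, assigned, cap) for _ in range(5)]
print(cap, placements, assigned)  # 3 [0, 0, 0, 1, 2] [3, 1, 1, 0]
```

Five requests pinned to rank 0 fill it up to the cap of 3, and the remaining two spill to other ranks, so no rank ever holds more than expected.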

…ayer

conversation_id existed only on the serve-layer DisaggregatedParams (openai_protocol). to_llm_disaggregated_params() did not copy it and the executor-layer DisaggregatedParams (tensorrt_llm/disaggregated_params.py, == LlmDisaggregatedParams) had no such field, so it was silently dropped when a worker converted the incoming request (openai_server.py to_llm_disaggregated_params). As a result worker-side consumers that read request.py_disaggregated_params.conversation_id (e.g. the ADP router) only ever saw None, even though the orchestrator routed on a real conversation id from the X-Session-ID header.

- Add conversation_id to the executor-layer DisaggregatedParams.
- Copy it through both to_llm_disaggregated_params and to_disaggregated_params so the serve<->executor round-trip preserves it.

This makes the conversation id available all the way to the worker; the orchestrator-level routers are unaffected (they already read the serve layer directly).

Extends the converter unit tests to assert conversation_id propagates and adds a round-trip test.

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
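The propagation fix amounts to carrying one extra field through a pair of converters. The sketch below uses two toy dataclasses, `ServeParams` and `ExecutorParams`, as stand-ins for the serve-layer and executor-layer DisaggregatedParams. These names, the `request_type` field, and the converter signatures are hypothetical and do not mirror the real classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServeParams:
    """Stand-in for the serve-layer params (hypothetical, reduced fields)."""
    request_type: str
    conversation_id: Optional[str] = None


@dataclass
class ExecutorParams:
    """Stand-in for the executor-layer params (hypothetical, reduced fields)."""
    request_type: str
    conversation_id: Optional[str] = None


def to_executor(params: ServeParams) -> ExecutorParams:
    # Copy conversation_id explicitly; omitting it here silently yields None.
    return ExecutorParams(
        request_type=params.request_type,
        conversation_id=params.conversation_id,
    )


def to_serve(params: ExecutorParams) -> ServeParams:
    return ServeParams(
        request_type=params.request_type,
        conversation_id=params.conversation_id,
    )


original = ServeParams(request_type="context_only", conversation_id="sess-42")
round_tripped = to_serve(to_executor(original))
assert round_tripped == original
```

A round-trip equality check of this kind catches the failure mode the commit describes, where a converter that forgets the field still runs without error and the worker only ever sees None.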

…->rank affinity

Adds an instance-level ADP router that round-robins the first request of each conversation across ranks, then pins every subsequent request with the same conversation_id to that conversation's first-turn rank. This keeps a multi-turn conversation's growing KV-cache prefix on one rank (maximizing block reuse, minimizing cross-rank migration) while spreading new conversations evenly.

Unlike KVCacheAwareADPRouter (which infers affinity from probed prefix-match length and loses a conversation when its blocks are evicted), the conversation_id -> rank map is explicit and survives eviction.

Inspired by the serve-level ConversationRouter and the first-turn-round-robin idea from NVIDIA#14744, applied at the intra-instance ADP-rank level.

conversation_id is read from py_disaggregated_params.conversation_id (serve-side propagated from X-Session-ID); falls back to load-balanced round-robin when it is absent, so behavior degrades gracefully.

Selected via the new attention_dp_config.kv_cache_routing_conversation_affinity flag (kv_cache_routing_max_sessions bounds the LRU map).

Includes unit tests covering first-turn RR, stickiness, conv-less fallback, cross-rank determinism, LRU eviction, sticky overflow, and factory selection.

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>

github-actions Bot assigned lancelly Jun 5, 2026

lancelly force-pushed the feat/conversation-aware-adp-router-main branch from bbf3ae0 to 823ca8a Compare June 5, 2026 02:43

lancelly added 2 commits June 4, 2026 20:34

lancelly force-pushed the feat/conversation-aware-adp-router-main branch from 823ca8a to 5b29ef5 Compare June 5, 2026 03:34

lancelly wants to merge 2 commits into NVIDIA:main from lancelly:feat/conversation-aware-adp-router-main
