[None][feat] Add ConversationAwareADPRouter: explicit conversation->rank affinity for attention DP#14983
Draft
lancelly wants to merge 2 commits into
bbf3ae0 to 823ca8a
…ayer

conversation_id existed only on the serve-layer DisaggregatedParams (openai_protocol). to_llm_disaggregated_params() did not copy it and the executor-layer DisaggregatedParams (tensorrt_llm/disaggregated_params.py, == LlmDisaggregatedParams) had no such field, so it was silently dropped when a worker converted the incoming request (openai_server.py to_llm_disaggregated_params). As a result, worker-side consumers that read request.py_disaggregated_params.conversation_id (e.g. the ADP router) only ever saw None, even though the orchestrator routed on a real conversation id from the X-Session-ID header.

- Add conversation_id to the executor-layer DisaggregatedParams.
- Copy it through both to_llm_disaggregated_params and to_disaggregated_params so the serve<->executor round-trip preserves it.

This makes the conversation id available all the way to the worker; the orchestrator-level routers are unaffected (they already read the serve layer directly).

Extends the converter unit tests to assert conversation_id propagates and adds a round-trip test.

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
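The propagation fix described in this commit can be illustrated with a small sketch. It is not the real code: ServeParams, ExecutorParams, to_executor and to_serve are hypothetical stand-ins for the two DisaggregatedParams classes and their converters, which carry more fields and have their own signatures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServeParams:
    """Hypothetical stand-in for the serve-layer DisaggregatedParams."""
    request_type: str
    conversation_id: Optional[str] = None


@dataclass
class ExecutorParams:
    """Hypothetical stand-in for the executor-layer DisaggregatedParams."""
    request_type: str
    # The kind of field the commit adds; without it the id has nowhere to go.
    conversation_id: Optional[str] = None


def to_executor(p: ServeParams) -> ExecutorParams:
    # Serve -> executor: copy conversation_id instead of silently dropping it.
    return ExecutorParams(request_type=p.request_type,
                          conversation_id=p.conversation_id)


def to_serve(p: ExecutorParams) -> ServeParams:
    # Executor -> serve: the reverse copy, so a round-trip is lossless.
    return ServeParams(request_type=p.request_type,
                       conversation_id=p.conversation_id)


original = ServeParams(request_type="context_only", conversation_id="session-42")
round_tripped = to_serve(to_executor(original))
```

With both converters copying the field, round_tripped equals original, and a worker-side consumer reading the executor-layer object sees "session-42" rather than None.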
…->rank affinity

Adds an instance-level ADP router that round-robins the first request of each conversation across ranks, then pins every subsequent request with the same conversation_id to that conversation's first-turn rank. This keeps a multi-turn conversation's growing KV-cache prefix on one rank (maximizing block reuse, minimizing cross-rank migration) while spreading new conversations evenly.

Unlike KVCacheAwareADPRouter (which infers affinity from probed prefix-match length and loses a conversation when its blocks are evicted), the conversation_id -> rank map is explicit and survives eviction. Inspired by the serve-level ConversationRouter and the first-turn-round-robin idea from NVIDIA#14744, applied at the intra-instance ADP-rank level.

conversation_id is read from py_disaggregated_params.conversation_id (serve-side propagated from X-Session-ID); falls back to load-balanced round-robin when it is absent, so behavior degrades gracefully.

Selected via the new attention_dp_config.kv_cache_routing_conversation_affinity flag (kv_cache_routing_max_sessions bounds the LRU map).

Includes unit tests covering first-turn RR, stickiness, conv-less fallback, cross-rank determinism, LRU eviction, sticky overflow, and factory selection.

Signed-off-by: Lance Liao <108499334+lancelly@users.noreply.github.com>
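The routing policy in this commit message can be sketched as below. This is a minimal illustration under stated assumptions, not the PR's ConversationAwareADPRouter: the class name ConversationAffinitySketch, the single-request route method, the per-rank load counters, and the least-loaded fallback are all hypothetical simplifications (the real router works on batches through route_requests).

```python
from collections import OrderedDict
from typing import List, Optional


class ConversationAffinitySketch:
    """Illustrative only: first-turn round-robin plus an LRU-bounded sticky map."""

    def __init__(self, num_ranks: int, max_sessions: int) -> None:
        self.num_ranks = num_ranks
        self.max_sessions = max_sessions  # plays the role of kv_cache_routing_max_sessions
        self.cursor = 0                   # round-robin cursor for first turns
        self.conv_to_rank: "OrderedDict[str, int]" = OrderedDict()
        self.load: List[int] = [0] * num_ranks  # assumed: requests routed so far

    def route(self, conversation_id: Optional[str]) -> int:
        if conversation_id is None:
            # Assumed fallback: least-loaded rank, lowest index breaks ties.
            # Nothing is recorded in the map.
            rank = min(range(self.num_ranks), key=lambda r: (self.load[r], r))
        elif conversation_id in self.conv_to_rank:
            # Later turn: stay on the first-turn rank and refresh LRU recency.
            rank = self.conv_to_rank[conversation_id]
            self.conv_to_rank.move_to_end(conversation_id)
        else:
            # First turn: take the next rank in round-robin order and record it.
            rank = self.cursor
            self.cursor = (self.cursor + 1) % self.num_ranks
            self.conv_to_rank[conversation_id] = rank
            if len(self.conv_to_rank) > self.max_sessions:
                self.conv_to_rank.popitem(last=False)  # evict least recently used
        self.load[rank] += 1
        return rank


router = ConversationAffinitySketch(num_ranks=2, max_sessions=2)
ranks = [router.route(c) for c in ["a", "b", "a", None, "c", "a"]]
# ranks == [0, 1, 0, 1, 0, 0]; "b" was evicted from the map when "c" arrived
```

Because the sketch has no randomness and no external state, two instances fed the same sequence produce the same assignments, which is the property the cross-rank determinism test checks for in the real router.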
823ca8a to 5b29ef5
What

Adds ConversationAwareADPRouter, an instance-level attention-DP router that:

- Round-robins the first request of each conversation across ranks.
- Pins every subsequent request with the same conversation_id to that conversation's first-turn rank.

This keeps a multi-turn conversation's growing KV-cache prefix on a single rank, maximizing block reuse and minimizing recompute / cross-rank migration, while spreading new conversations evenly across ranks.
Why / vs KVCacheAwareADPRouter

KVCacheAwareADPRouter infers affinity from probed prefix-match length. That affinity is lost the moment a conversation's blocks are evicted: the request then re-routes by load and the conversation can migrate ranks.

ConversationAwareADPRouter keeps an explicit conversation_id -> rank LRU map, so stickiness is deterministic and survives eviction. It also does not require KV-cache block reuse to function (though it is most beneficial with it).

Inspired by the serve-level ConversationRouter (tensorrt_llm/serve/router.py) and the first-turn-round-robin idea from #14744, applied at the intra-instance ADP-rank level.

How it is wired
- Selected via AttentionDpConfig.kv_cache_routing_conversation_affinity (takes precedence over enable_kv_cache_aware_routing when both are set). kv_cache_routing_max_sessions bounds the LRU map.
- conversation_id is read from req.py_disaggregated_params.conversation_id (serve-side propagated from the X-Session-ID header). When it is absent (header not sent, non-disaggregated, or propagation not present), the request falls back to load-balanced round-robin and is not recorded, so behavior degrades gracefully to DefaultADPRouter-style spreading.
- route_requests runs locally on every rank with no broadcast, so the round-robin cursor and the conversation -> rank map must evolve identically on all ranks (same new_requests, same order). This is the same invariant the existing warmup cursor relies on. Divergence would deadlock the allgather protocol.

Tests
tests/unittest/_torch/executor/test_adp_router.py::TestConversationAwareADPRouter covers: first-turn round-robin, stickiness across turns, conversation-less fallback (unrecorded), cross-rank determinism, LRU eviction, sticky overflow (keeps mapping), explicit attention_dp_rank, and factory selection (enabled/disabled).

Notes
- Requires conversation_id reaching the worker's py_disaggregated_params (serve-side propagation from X-Session-ID); without it the router is a safe round-robin no-op.

Update: now self-contained (2 commits)
- [None][fix] serve: propagate conversation_id to the executor/worker layer (the base this router needs). Adds conversation_id to the executor-layer DisaggregatedParams and copies it through to_llm_disaggregated_params / to_disaggregated_params, so the worker actually sees the id the orchestrator routed on (without it the router safely degrades to round-robin).
- [None][feat] Add ConversationAwareADPRouter … (the router). Sticky returns are capped at the loose fair_share_multiplier * fair_share (expected), never the hard max_num_active_requests: a rank exceeding expected breaks the ADP padding invariant (py_executor._pad_attention_dp_dummy_request) and hangs the instance. Regression test test_returned_expected_covers_every_rank guards this.
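The loose cap on sticky assignments can be sketched as follows. How the PR actually computes fair_share and expected is not shown on this page, so the sketch assumes fair_share is the ceiling of total requests divided by the number of ranks, and that an over-cap request spills to the least-loaded rank. The function name pick_rank_with_cap and its parameters are hypothetical.

```python
import math
from typing import List


def pick_rank_with_cap(sticky_rank: int,
                       load: List[int],
                       total_requests: int,
                       fair_share_multiplier: float) -> int:
    """Honor the sticky rank unless it has reached its loose cap.

    Assumptions (not taken from the PR): fair_share = ceil(total / ranks),
    cap = ceil(multiplier * fair_share), overflow goes to the least-loaded rank.
    """
    num_ranks = len(load)
    fair_share = math.ceil(total_requests / num_ranks)
    cap = math.ceil(fair_share_multiplier * fair_share)
    if load[sticky_rank] < cap:
        return sticky_rank
    # Over the cap: spill this one request elsewhere. The caller leaves the
    # conversation -> rank mapping untouched, so later turns can return.
    return min(range(num_ranks), key=lambda r: (load[r], r))


# fair_share = 2, cap = 3: rank 0 still has room with load 2, but not with load 3.
within_cap = pick_rank_with_cap(0, load=[2, 1], total_requests=4, fair_share_multiplier=1.5)
over_cap = pick_rank_with_cap(0, load=[3, 1], total_requests=4, fair_share_multiplier=1.5)
```

Here within_cap is 0 and over_cap is 1. Spilling a single request rather than rewriting the map mirrors the "sticky overflow (keeps mapping)" case listed under Tests.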