perf: optimize guided decoding with xgrammar upgrade, batched API, and async D2H overlap #4605
windreamer wants to merge 6 commits into
9a942b6 to 3fdff4a
Pull request overview
This PR optimizes guided decoding performance in TurboMind by upgrading xgrammar and refactoring guided-decoding paths to use batched matcher APIs plus CUDA stream/event orchestration to overlap host-device transfers with GPU work. It also reduces Python-side overhead by reusing a lazily constructed GrammarCompiler and fixes a PyTorch guided-decoding type mismatch introduced by the xgrammar upgrade.
Changes:
- Upgrade xgrammar to v0.2.1 and switch C++ guided decoding to batched matcher APIs (`BatchFillNextTokenBitmask` / `BatchAcceptToken`).
- Overlap `output_ids` D2H copies with GPU kernels via a secondary CUDA stream and split guided decoding update into `ScheduleUpdate` + `FinishUpdate`.
- Cache `GrammarCompiler` per `TurboMind` instance (lazy init) and fix PyTorch `accept_token` to pass a Python `int` via `.item()`.
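The lazy, per-instance caching named in the last bullet can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `FakeGrammarCompiler`, `Engine`, and `create_request` are hypothetical stand-ins, and the real `GrammarCompiler` constructor is not reproduced here.

```python
from functools import cached_property


class FakeGrammarCompiler:
    """Hypothetical stand-in for a compiler that is expensive to construct."""

    instances_built = 0  # counts constructions so the caching is observable

    def __init__(self, vocab):
        FakeGrammarCompiler.instances_built += 1
        self.vocab = vocab

    def compile(self, schema):
        # Placeholder for real grammar compilation.
        return f"compiled({schema})"


class Engine:
    """Hypothetical engine that owns one compiler for its whole lifetime."""

    def __init__(self, vocab):
        self._vocab = vocab

    @cached_property
    def grammar_compiler(self):
        # Built on first access only; later accesses reuse the same object.
        return FakeGrammarCompiler(self._vocab)

    def create_request(self, schema):
        # Each request reuses the shared compiler instead of building its own.
        return self.grammar_compiler.compile(schema)


engine = Engine(vocab=["a", "b"])
built_before_use = FakeGrammarCompiler.instances_built
results = [engine.create_request(s) for s in ("s1", "s2", "s3")]
built_after_use = FakeGrammarCompiler.instances_built
```

Nothing is constructed until the first request needs it, and three requests share a single compiler. If requests can be created from several threads at once, the first construction may need a lock, since `cached_property` does not guarantee single construction under concurrent access in recent Python versions.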
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/turbomind/generation/guided_decoding.h | Adds batched matcher + CUDA stream/event members; splits Update into two phases. |
| src/turbomind/generation/guided_decoding.cc | Implements batched xgrammar calls and async D2H overlap using events/streams; adds needs_apply gating. |
| src/turbomind/generation/generation.cc | Integrates ScheduleUpdate/FinishUpdate around AppendTokenIds and stop_criteria to enable overlap. |
| src/turbomind/generation/CMakeLists.txt | Exposes xgrammar/core linkage publicly for guided_decoding consumers. |
| lmdeploy/turbomind/turbomind.py | Introduces lazy-shared GrammarCompiler and removes per-request instantiation. |
| lmdeploy/pytorch/engine/logits_process.py | Passes token id as Python int (.item()) to guided decoding manager. |
| CMakeLists.txt | Bumps FetchContent xgrammar tag to v0.2.1. |
Comments suppressed due to low confidence (1)
src/turbomind/generation/guided_decoding.cc:135
- Similarly, `FinishUpdate()` allocates `active_matchers` and `active_token_ids` every step without reserving. Reserving (or persisting these vectors in the phase `Data`) would reduce per-token allocation overhead, especially for large batch sizes.
// Collect active matchers and their token IDs for batch AcceptToken
std::vector<xgrammar::GrammarMatcher> active_matchers;
std::vector<int32_t> active_token_ids;
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (2)
src/turbomind/generation/guided_decoding.cc:129
`FinishUpdate()` calls `d2h_done_.Sync()` on all TP ranks even though only rank 0 performs `BatchAcceptToken`. On non-zero ranks this host-side wait is pure overhead (and can also introduce unnecessary CPU/GPU synchronization points). Consider moving the sync + matcher update under the `tp_group_->rank() == 0` branch, or early-returning for non-zero ranks.
if (auto& d = *data_.at(phase); d.active) {
// Wait only for the D2H copy to complete — the main stream's
// AppendTokenIds + stop_criteria may still be executing on GPU.
d2h_done_.Sync();
if (tp_group_->rank() == 0) {
lmdeploy/pytorch/engine/logits_process.py:484
- The guided-decoding `accept_token` call site changed to pass a Python `int`, but there is no unit test in `tests/pytorch/engine/test_logits_process.py` covering the `guided_decoding_manager` integration path (e.g., that `accept_token` is invoked with the expected token values/types). Adding a small test with a stub `GuidedDecodingManager` would help prevent regressions when sampling runs on CUDA tensors.
if self.guided_decoding_manager and self.guided_processors:
    for i, processor in self.guided_processors.items():
        self.guided_decoding_manager.accept_token(processor, result[i].item())
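A stub-based check along the lines the comment suggests might look like the sketch below. It is not taken from the repository's test suite: `FakeScalar` is a hypothetical stand-in for a 0-dim tensor (so the example runs without PyTorch), and `accept_sampled_tokens` is a hypothetical free function that mirrors the quoted call site.

```python
from unittest.mock import MagicMock


class FakeScalar:
    """Hypothetical stand-in for a 0-dim tensor; only .item() is modeled."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return int(self._value)


def accept_sampled_tokens(manager, processors, result):
    # Mirrors the quoted call site: pass a plain Python int per processor.
    if manager and processors:
        for i, processor in processors.items():
            manager.accept_token(processor, result[i].item())


manager = MagicMock()  # stub playing the role of GuidedDecodingManager
processors = {0: "proc-a", 2: "proc-c"}  # only slots 0 and 2 are guided
result = [FakeScalar(11), FakeScalar(22), FakeScalar(33)]

accept_sampled_tokens(manager, processors, result)

# Record what the stub received so values and types can both be checked.
calls = [c.args for c in manager.accept_token.call_args_list]
all_ints = all(type(token) is int for _, token in calls)
```

Checking `type(token) is int` rather than equality alone matters here, because a tensor-like scalar can compare equal to an integer while still being the wrong type for the callee.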
b5a3678 to 49c617f
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
src/turbomind/generation/guided_decoding.cc:139
`FinishUpdate()` iterates over all `d.matchers` and calls batch `AcceptToken` using `output_ids_buf_[i]` for every non-terminated matcher. However, only the first `generation_size` sequences receive a newly sampled token each step; for the remaining slots `output_ids_buf_` may contain stale values (since sampling runs with `batch_size = logits.shape(0)`). This can advance grammar state incorrectly for sequences that were not generating this step. Limit the loop to `generation_size` (saved from `ScheduleUpdate()`), or gate on the per-request `generating` mask.
if (auto& d = *data_.at(phase); d.active && tp_group_->rank() == 0) {
// Wait only for the D2H copy to complete — the main stream's
// AppendTokenIds + stop_criteria may still be executing on GPU.
d2h_done_.Sync();
// Collect active matchers and their token IDs for batch AcceptToken
std::vector<xgrammar::GrammarMatcher> active_matchers;
std::vector<int32_t> active_token_ids;
active_matchers.reserve(d.matchers.size());
active_token_ids.reserve(d.matchers.size());
for (size_t i = 0; i < d.matchers.size(); ++i) {
if (const auto& m = d.matchers[i]; m && !m->IsTerminated()) {
active_matchers.emplace_back(*m);
active_token_ids.emplace_back(output_ids_buf_[i]);
}
FillMask, ScheduleUpdate, and FinishUpdate previously iterated over d.matchers.size() entries, but only the first generation_size (= logits.shape(0)) slots are actively generating. Entries beyond that index contain stale output_ids and unused bitmasks.
- FillMask: limit matcher iteration and reserve to gs = logits.shape(0)
- ScheduleUpdate: copy only gs output_ids entries for D2H transfer
- FinishUpdate: add TensorMap& env param, iterate only over gs slots
Fixes review comments on PR InternLM#4605 (3280137130, 3280137198).
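The effect of bounding the loop by `gs` can be modeled in a few lines of plain Python. This is a simplified sketch, not the C++ from the commit: `FakeMatcher` and `collect_active` are hypothetical names, and empty slots are represented as `None`.

```python
from dataclasses import dataclass


@dataclass
class FakeMatcher:
    """Hypothetical matcher; only the terminated flag is modeled."""

    name: str
    terminated: bool = False


def collect_active(matchers, output_ids, generation_size):
    """Pair each live matcher in the first generation_size slots with its token."""
    active = []
    for i in range(min(generation_size, len(matchers))):
        m = matchers[i]
        if m is not None and not m.terminated:
            active.append((m.name, output_ids[i]))
    return active


matchers = [
    FakeMatcher("a"),
    None,                              # slot without guided decoding
    FakeMatcher("c", terminated=True),  # grammar already finished
    FakeMatcher("d"),
    FakeMatcher("e"),                  # beyond generation_size this step
]
output_ids = [101, 102, 103, 104, 999]  # slot 4 holds a stale value
active = collect_active(matchers, output_ids, generation_size=4)
```

With `generation_size=4`, matcher `e` is skipped even though it is live, so the stale `999` never reaches its grammar state.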
ea5c50b to e24c147
e24c147 to 33563fe
33563fe to ba4efab
… CUDA stream
Split GuidedDecoding::Update() into ScheduleUpdate() + FinishUpdate() to enable D2H copy of output_ids on a secondary CUDA stream, overlapping with AppendTokenIds and stop_criteria GPU kernels on the main stream.
- ScheduleUpdate(): records sampling_done event on main stream, launches async D2H copy on d2h_stream_ (waits for sampling_done first)
- FinishUpdate(): syncs on d2h_done event, then runs BatchAcceptToken on CPU
- Adds d2h_stream_, sampling_done_, d2h_done_ members (created once in ctor)
- Eliminates the blocking cudaStreamSynchronize that previously stalled the CPU between sampling and AcceptToken
This is optimization 5 (Plan I): independent CUDA stream for D2H copy parallelism, removing a sync point in the decode step hot path.
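The schedule/finish split can be illustrated with a CPU-only analogy. The sketch below is not CUDA and not the commit's code: a single worker thread plays the role of the secondary copy stream, a future plays the role of the `d2h_done` event, and `TwoPhaseUpdater` with its methods is a hypothetical name. The `sampling_done` ordering between streams is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor


class TwoPhaseUpdater:
    """CPU-only analogy; a worker thread stands in for the copy stream."""

    def __init__(self):
        self._copy_worker = ThreadPoolExecutor(max_workers=1)
        self._pending = None
        self.accepted = []

    def schedule_update(self, device_output_ids):
        # Start the "copy" in the background and return immediately.
        self._pending = self._copy_worker.submit(list, device_output_ids)

    def finish_update(self):
        # Block only on the copy, then consume the tokens on the host.
        host_ids = self._pending.result()
        self._pending = None
        self.accepted.extend(host_ids)

    def close(self):
        self._copy_worker.shutdown()


updater = TwoPhaseUpdater()
updater.schedule_update((7, 8, 9))
other_work = sum(range(1000))  # stands in for work that overlaps the copy
updater.finish_update()
updater.close()
accepted = updater.accepted
```

The point of the split is that the caller waits only when it actually needs the copied tokens, so whatever runs between the two calls can overlap with the transfer.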
…nternal CMake dep
ba4efab to 84a90a2
Motivation
Guided decoding in TurboMind has several performance bottlenecks: per-matcher loops for `FillNextTokenBitmask`/`AcceptToken`, a synchronous D2H copy blocking the main stream, and a `GrammarCompiler` instantiated per request. This PR addresses all three to reduce guided decoding overhead.
Modification
- Upgrade xgrammar to v0.2.1 for the batched matcher APIs (`BatchFillNextTokenBitmask`, `BatchAcceptToken`).
- Replace the per-matcher `FillNextTokenBitmask`/`AcceptToken` loops with `BatchGrammarMatcher::BatchFillNextTokenBitmask` and `BatchAcceptToken`, reducing per-token overhead proportional to batch size.
- Run the `output_ids` D2H copy on an independent CUDA stream so it overlaps with `AppendTokenIds` + `stop_criteria` on the main stream, hiding the copy latency.
- Cached `GrammarCompiler` (Python): create `GrammarCompiler` once per `TurboMind` instance (lazily) instead of per request, avoiding repeated tokenizer introspection.
- PyTorch `accept_token`: convert the tensor to a Python int (`.item()`) before passing it to xgrammar, fixing a type mismatch with the new API.
- CMake: change `target_link_libraries` from PRIVATE to PUBLIC for the `guided_decoding` static library so that dependent targets correctly propagate xgrammar headers.
BC-breaking (Optional)
None. The API surface is unchanged; only internal guided decoding paths are affected.
Checklist