bench: fix v5 tokenizer fix when --model is an HF Hub repo id #1381
Merged
_fix_tokenizer_for_sglang resolved tokenizer.json via Path(model_path), which only works for local model directories. For HF Hub repo ids (e.g. "nvidia/DeepSeek-R1-0528-FP4-V2") the path lookup silently no-ops and a bare `except Exception: pass` hid the failure. Combined with the transformers 5.6.0 AutoTokenizer dispatch change that now returns a broken LlamaTokenizer for ByteLevel-BPE models, this produced a ~5x client/server tokenizer mismatch and a false throughput regression (DeepSeek-R1 FP4 on B200: 1970 -> 425 tok/s).

Fall back to huggingface_hub.hf_hub_download for repo ids, surface resolution failures as warnings, and add an info log when the fix actually rewires pre_tokenizer/decoder so client/server alignment is visible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
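The resolution order described in this commit message could look roughly like the sketch below. It is not the code from this PR: the function name `resolve_tokenizer_json` and the injectable `download` parameter are hypothetical, added only so the logic can be exercised offline. The one real library call is `huggingface_hub.hf_hub_download(repo_id=..., filename=...)`.

```python
import logging
from pathlib import Path
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def _hub_download(repo_id: str, filename: str) -> str:
    # Imported lazily so the local-directory path works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename)


def resolve_tokenizer_json(
    model_path: str,
    download: Callable[[str, str], str] = _hub_download,  # hypothetical hook
) -> Optional[Path]:
    """Return a path to tokenizer.json for a local dir or an HF Hub repo id."""
    # Case 1: --model points at a local model directory.
    local = Path(model_path) / "tokenizer.json"
    if local.is_file():
        return local

    # Case 2: treat --model as a Hub repo id and fetch the file.
    try:
        return Path(download(model_path, "tokenizer.json"))
    except Exception as exc:
        # Surface the failure instead of swallowing it with a bare `pass`.
        logger.warning("could not resolve tokenizer.json for %r: %s", model_path, exc)
        return None
```

A caller would skip the tokenizer rewiring when this returns `None`; the warning then makes the skipped fix visible in the benchmark log.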
cquil11
approved these changes
May 15, 2026
Collaborator
cquil11
left a comment
seems reasonable. thx for the fix
functionstackx
added a commit
that referenced
this pull request
May 20, 2026
…o-MTP/MTP) (#1502)

* perf-changelog: re-run DSR1 SGLang agg configs to pick up tokenizer fix

  Re-runs DSR1 SGLang agg configs on B200/B300 (FP8/FP4, no-MTP/MTP) to pick up the tokenizer fix from #1381.

* perf-changelog: set PR link to #1502

* launcher(b300-nv): drop nodelist pinning; restore perf-changelog entries

  - runners/launch_b300-nv.sh: remove --nodelist=b300-[001-006,008-012,017-020] from salloc so jobs can land on any healthy B300 node.
  - perf-changelog.yaml: restore ~18 entries that were unintentionally dropped during a prior rebase; net effect of this branch is now just the new DSR1 SGLang agg re-run entry.

* perf-changelog: sync to origin/main and append DSR1 re-run entry at end

* Update perf-changelog.yaml

---------

Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
functionstackx
pushed a commit
that referenced
this pull request
May 27, 2026
vLLM's tokenizer wrapper can raise AttributeError on all_special_tokens_extended with newer transformers (e.g. Qwen3.5). Use backend_request_func.get_tokenizer on fallback so client tokenization keeps the sglang v5 fix (#1381); bare AutoTokenizer only if backend is unavailable.

Co-authored-by: cliu1004@amd.com <cliu1004@amd.com@mia1-p01-g18.mia.tensorwave.lan>
Co-authored-by: Cursor <cursoragent@cursor.com>
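The fallback order in this commit message (primary wrapper, then the backend helper, then a bare tokenizer as a last resort) can be sketched as an ordered list of loaders. This is an illustration under assumptions, not the referenced commit: `load_tokenizer_with_fallback` and the loader labels are hypothetical, and the real loaders would be whatever callables the benchmark script already uses.

```python
import logging
from typing import Any, Callable, List, Tuple

logger = logging.getLogger(__name__)

Loader = Callable[[str], Any]


def load_tokenizer_with_fallback(
    model_id: str, loaders: List[Tuple[str, Loader]]
) -> Tuple[str, Any]:
    """Try each (label, loader) in order; return the first that succeeds."""
    for label, loader in loaders:
        try:
            tokenizer = loader(model_id)
        except (AttributeError, ImportError) as exc:
            # Log and move on, so the chosen fallback is visible in the output.
            logger.warning("tokenizer loader %r failed: %s", label, exc)
            continue
        return label, tokenizer
    raise RuntimeError(f"no tokenizer loader succeeded for {model_id!r}")
```

Returning the label alongside the tokenizer lets the caller log which path was taken, which matters here because only some paths carry the v5 tokenizer fix.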
With transformers 5.6.0, an AutoTokenizer dispatch change returns a broken LlamaTokenizer for ByteLevel-BPE models. This inflates the token count by ~4x and shows up as an apparent throughput regression (see sgl-project/sglang#25315).
#937 has a patch to fix this, but it was not applied when an HF Hub model ID was passed to --model; it only applied when a local directory was passed. This PR applies it in both cases and also removes the exception hiding, so that future issues are more apparent.