v3 collapse closures + tokenizers.Encoding offset API#87
Open
hallerite wants to merge 1 commit into
Open
Conversation
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit c3c51a7. Configure here.
ApprovabilityVerdict: Needs human review This PR changes core tokenization functionality by switching from HuggingFace's tokenizer API to the underlying Rust You can customize Macroscope's approvability policy. Learn more. |
7878261 to
bf84a66
Compare
transformers from the offset pathTwo related refactors of the emit_text_segments / attribute_text_segments pipeline: 1. ``emit_text_segments`` closures across 8 hand-coded renderers (qwen3, qwen35, glm45, glm5, deepseek_v3, nemotron3, laguna_xs2, minimax_m2) get a "collapse-or-fallback" pattern: adjacent same-label segments are folded into one ``emit_text`` call (preserves internal BPE merges, skips the offset path); only genuinely mixed-label runs go through ``attribute_text_segments``. Most rendering paths end up homogeneous after collapse, so the offset machinery only runs when it actually has to. 2. ``attribute_text_segments`` is rewritten to use the Rust ``tokenizers.Encoding`` API directly — ``.encode().ids`` / ``.encode().offsets`` — instead of going through ``transformers``'s ``return_offsets_mapping=True`` dict API. This unblocks the future ``transformers``-optional path (issue #31): a BYO ``tokenizers.Tokenizer`` works without any ``transformers`` wrapper. ``_get_offset_tokenizer`` becomes a 2-path resolver (direct Rust tokenizer, or extract ``.backend_tokenizer`` from a ``PreTrainedTokenizerFast``); no second tokenizer load, no probe-verify, no AutoTokenizer fallback — all of those existed in the previous version of this PR to coordinate with the fastokens shim, which is gone after #95. ``minimax_m2.emit_token_overlap_body`` and ``qwen3_vl._Emitter._flush`` are updated to call the new ``Encoding``-based offset API directly. ``tokenizers>=0.20`` becomes an explicit core dependency — it was already a transitive of ``transformers``, but the new ``attribute_text_segments`` imports from ``tokenizers`` at the module level so we declare it. Tests: 2248 passed, 88 skipped, 1 xfailed (baseline parity with #95). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bf84a66 to
afa4c9b
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Context
#95 (remove fastokens) merged into main. With fastokens gone, the offset path collapses to "use the user's tokenizer's
backend_tokenizerdirectly" — no second tokenizer load, no probe-verify, no AutoTokenizer fallback. This PR builds on that with two related refactors.What this does
1. v3 collapse-or-fallback closures (8 renderers)
emit_text_segmentsclosures inqwen3,qwen35,glm45,glm5,deepseek_v3,nemotron3,laguna_xs2,minimax_m2get a "collapse adjacent same-label segments, then attribute the rest" pattern:Homogeneous-label runs (most rendering paths after collapse) go through a single
emit_text— preserves internal BPE merges, skips the offset attribution path entirely. Only genuinely mixed-label runs hitattribute_text_segments.2.
attribute_text_segmentsrewritten on thetokenizers.EncodingAPIPreviously called
tokenizer(text, return_offsets_mapping=True)(transformers dict API). Now uses the Rust library's nativeEncoding.ids/Encoding.offsetsdirectly:This unblocks the future
transformers-optional path (issue #31): a BYOtokenizers.Tokenizer(no transformers wrapper) works directly.minimax_m2.emit_token_overlap_bodyandqwen3_vl._Emitter._flushare updated to the same API.3.
_get_offset_tokenizersimplified to two pathsDirect
tokenizers.Tokenizer(BYO Rust BPE) or extract.backend_tokenizerfrom aPreTrainedTokenizerFast. No second tokenizer load, no probe-verify, no AutoTokenizer fallback — all of those existed in this PR's pre-rebase version to coordinate with the fastokens shim, which is gone after #95.4.
tokenizers>=0.20explicit depAlready transitive via
transformers, butattribute_text_segmentsimports fromtokenizersat module level so we declare it.Tests
test_get_offset_tokenizer_rejects_offsetless_byoupdated to match the new error message ("fast tokenizer with atokenizers.Tokenizerbackend").🤖 Generated with Claude Code
Note
Drop
transformersfrom the offset encoding path by usingtokenizers.Tokenizerdirectlyreturn_offsets_mapping=Truetokenizer call in_get_offset_tokenizer(renderers/base.py) with a direct return of atokenizers.Tokenizer(either passed directly or viabackend_tokenizer).tokenizers>=0.20as a runtime dependency in pyproject.toml, removing the implicit dependency ontransformersfor offset-aware encoding.attribute_text_segmentsto usetokenizers.Encoding.offsetsinstead of the HuggingFace fast tokenizer offset mapping API.emit_text_segmentsacross all renderer implementations.PreTrainedTokenizerFast(with a Rust backend) or a baretokenizers.Tokenizer; slow tokenizers without abackend_tokenizernow raise aRuntimeError.Macroscope summarized afa4c9b.