Conversation
Adds packages/renderers/ — a standalone package for deterministic
message-to-token conversion that replaces Jinja chat templates.
Renderers (6 total):
- Qwen3Renderer, Qwen35Renderer (Qwen family)
- GLM5Renderer, GLM45Renderer (GLM family)
- MiniMaxM2Renderer (MiniMax M2/M2.5)
- DefaultRenderer (fallback: uses tokenizer.apply_chat_template)
Each renderer implements:
- render_ids(messages) -> token IDs (messages -> tokens)
- parse_response(token_ids) -> ParsedResponse (tokens -> structured message)
- get_stop_token_ids() -> stop tokens
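A minimal sketch of what this interface might look like, assuming a Python Protocol and a simple ParsedResponse dataclass (both illustrative, not the package's actual definitions):

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class ParsedResponse:
    # Illustrative fields only; the real ParsedResponse may differ.
    content: str = ""
    reasoning: str = ""
    tool_calls: list[dict] = field(default_factory=list)


class Renderer(Protocol):
    def render_ids(self, messages: list[dict]) -> list[int]:
        """Messages -> token IDs, matching the model's chat template exactly."""
        ...

    def parse_response(self, token_ids: list[int]) -> ParsedResponse:
        """Completion token IDs -> structured message."""
        ...

    def get_stop_token_ids(self) -> list[int]:
        """Token IDs that terminate an assistant turn."""
        ...
```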
RendererClient: new verifiers client type ("renderer") that uses
renderers for all tokenization. Sends token IDs to vLLM /v1/completions
directly. No MITO/TITO prefix matching, no /tokenize calls.
Auto-detection: create_renderer(tokenizer) picks the right renderer
from tokenizer special tokens. Falls back to DefaultRenderer for
unsupported models.
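Hypothetical usage of that factory; `create_renderer(tokenizer)` is taken from the text above, the model name and messages are just for illustration:

```python
from transformers import AutoTokenizer

from renderers import create_renderer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
renderer = create_renderer(tokenizer)  # picks Qwen3Renderer; unknown models fall back to DefaultRenderer

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
prompt_ids = renderer.render_ids(messages)  # messages -> token IDs
stop_ids = renderer.get_stop_token_ids()    # passed to vLLM as stop tokens
```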
… attribution
175 parametrized tests across 7 models × 25 cases:
- test_render_ids: token-for-token correctness against apply_chat_template
- test_parse_response: content/reasoning/tool extraction
- test_build_helpers: supervised sample + trajectory step
Fixes:
- GLM-5/GLM-4.5/MiniMax: None content rendered as "None" (matches Jinja)
- GLM-4.5: BPE boundary fix for content + \n before <tool_call>
- DefaultRenderer: incremental rendering for per-token message attribution
Adding a new model family = add one entry to conftest.RENDERER_MODELS.
INTELLECT-3.1: auto-detects to DefaultRenderer (apply_chat_template fallback) because its tokenizer has aggressive BPE merges that break piece-by-piece encoding. The IntellectRenderer is available as "intellect" for future optimization but is not the auto-detect default.
Kimi K2.5: identified as needing a custom KimiRenderer (TODO). The template uses unique tokens (<|im_user|>, <|im_assistant|>, <|im_middle|>) and always appends a generation prompt, making the DefaultRenderer's incremental approach incompatible. Skipped in barrage tests for now.
Also:
- Fixed DefaultRenderer to always pass tokenize=True (Kimi returns str by default)
- Fixed _expected() in tests to handle tokenizers returning str
- 200 barrage tests passing across 8 models
Model coverage:
- Qwen3 (custom) ✓
- Qwen3.5 (custom) ✓
- GLM-5 / GLM-4.7-Flash (custom) ✓
- GLM-4.5-Air (custom) ✓
- MiniMax-M2.5 (custom) ✓
- INTELLECT-3.1 (default) ✓
- Qwen2.5 (default) ✓
- Kimi K2.5 — TODO: needs KimiRenderer
KimiRenderer for moonshotai/Kimi-K2.5:
- Unique format: <|im_user|>/<|im_assistant|>/<|im_middle|> role tokens
- TypeScript namespace tool definitions
- Tool calls via <|tool_calls_section_begin|>/<|tool_call_begin|> tokens
- All 25 barrage tests passing
Auto-detection: replaced fragile token-sniffing heuristics with a simple MODEL_RENDERER_MAP that maps model name prefixes to renderer names. Falls back to DefaultRenderer for unknown models.
225 barrage tests across 9 models, all passing.
New renderers/parsing.py with extraction functions ported from vLLM:
- extract_reasoning_qwen (qwen3_reasoning_parser)
- extract_reasoning_glm (basic think/content split)
- extract_reasoning_minimax (minimax_m2_reasoning_parser)
- extract_reasoning_kimi (kimi_k2_reasoning_parser)
- extract_tool_calls_hermes (hermes_tool_parser — Qwen3 JSON)
- extract_tool_calls_qwen35xml (qwen3xml_tool_parser — Qwen3.5 XML)
- extract_tool_calls_glm (glm4_moe/glm47_moe_tool_parser)
- extract_tool_calls_minimax (minimax_m2_tool_parser)
- extract_tool_calls_kimi (kimi_k2_tool_parser)
Same regex patterns, same edge-case handling as vLLM. All renderers now delegate parse_response() to these shared functions.
Truncation: <think> present without </think> → truncated reasoning. No <think> marker → plain content (not assumed truncated).
312 barrage tests passing.
…ded text
Replaced all decode-then-regex parsing with token ID scanning:
- Find special token boundaries (</think>, <tool_call>, etc.) by their token IDs directly in the sequence
- Decode only the text segments between boundaries
- No false positives from content that happens to look like special tokens
Each model family has a dedicated parse function:
- parse_qwen3: Hermes JSON tool calls by <tool_call> token ID
- parse_qwen35: XML tool calls + <think>/</think> by token ID
- parse_glm: <arg_key>/<arg_value> pairs by token ID
- parse_minimax: <minimax:tool_call> by token ID, invoke/parameter by text
- parse_kimi: full token-level (section/begin/end/arg_begin all by ID)
Truncation: <think> token present without </think> → truncated reasoning. No <think> token → plain content.
312 barrage tests passing.
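To illustrate the idea, here is a hypothetical helper (not one of the parse_* functions above): split on a special token's ID first, and only then decode the surrounding segments.

```python
def split_on_token(token_ids: list[int], boundary_id: int) -> tuple[list[int], list[int]]:
    """Return the token IDs (before, after) the first occurrence of boundary_id."""
    if boundary_id in token_ids:
        i = token_ids.index(boundary_id)
        return token_ids[:i], token_ids[i + 1:]
    return token_ids, []


# e.g. a reasoning/content split keyed on the </think> token ID:
# reasoning_ids, content_ids = split_on_token(completion_ids, think_close_id)
# reasoning = tokenizer.decode(reasoning_ids)
# content = tokenizer.decode(content_ids)
# Text inside the content that merely *looks* like "</think>" cannot trigger a
# split, because the scan is on token IDs, not decoded text.
```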
- Strip at first stop token (truncate) instead of only trailing
- Remove <think> from reasoning_ids regardless of position (fixes GLM-4.5 where \n precedes <think> in completion)
Verified end-to-end for all 6 model families:
✓ Qwen3: thinking + content + tool calls + names + args
✓ Qwen3.5: thinking + content + tool calls + names + args
✓ GLM-5: thinking + content + tool calls + names + args
✓ GLM-4.5: thinking + content + tool calls + names + args
✓ MiniMax: thinking + content + tool calls + names + args
✓ Kimi: content + tool calls + args (no thinking/names by design)
312 barrage tests passing.
- Removed KimiRenderer (too complex for now, needs more iteration)
- Removed unused messages.py (normalize_messages, deserialize_tool_calls, strip_message_content — not imported anywhere in the package)
- Cleaned up parse_kimi from parsing.py
- 273 barrage tests passing across 7 models
The proxy now forwards to /v1/generate (our custom endpoint) instead of /v1/completions. For VLMs, it extracts raw images from messages and sends them alongside the renderer's token IDs. vLLM processes images server-side while text tokenization is fully client-side via the Renderer. Also updated client.py to use /v1/generate.
…S and strong Message typing
Add five new model-family renderers with full render/parse support:
- DeepSeekV3Renderer: fullwidth Unicode tokens, <think> text tags, tool call section markers
- KimiK2Renderer: im_user/im_assistant/im_system format, tool_calls_section markers, default system prompt
- KimiK25Renderer: extends K2 with <think> prefill, vision/media support, TypeScript tool declarations
- Nemotron3Renderer: Qwen-style im_start/im_end with XML tool declarations, universal thinking blocks
- GptOssRenderer: Harmony channel-based format (analysis/commentary/final), TypeScript tools
Also introduces strong typing across the package:
- Message, Content, ContentPart, ToolCall, ToolSpec TypedDicts in base.py
- All renderer signatures updated from dict[str, Any] to proper types
- Renderer protocol updated to use Message and ToolSpec
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cast OpenAI SDK message/tool types to Message/ToolSpec at the renderer_client boundary. Add override annotations for methods that legitimately change the response type from OpenAIChatResponse to dict. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tCompletionsClient
RendererClient does client-side tokenization via /v1/generate, not the chat completions API. Inheriting from OpenAIChatCompletionsClient was wrong — it forced type mismatches (OpenAIChatResponse vs dict) that required override annotations.
Now inherits Client[AsyncOpenAI, list[RendererMessage], dict, ToolSpec] with its own to_native_prompt that converts verifiers Pydantic messages to renderer TypedDicts cleanly. No casts, no type: ignore overrides.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RendererClient now uses a shared RendererPool (32 slots by default) that offloads render_ids() and parse_response() to threads via asyncio.to_thread(). HuggingFace fast tokenizers release the GIL during Rust encoding, so concurrent rollouts tokenize in parallel instead of serializing on the event loop.
Benchmarks on a 30-core EPYC with 22K-token conversations:
- N=8: 164ms → 46ms (3.6x)
- N=16: 330ms → 103ms (3.2x)
- N=32: 659ms → 196ms (3.4x)
When a single Renderer is passed (tests, simple usage), the original non-threaded path is preserved.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
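A simplified sketch of the offloading pattern (not the actual RendererPool implementation): check a renderer out of a fixed set of slots and run its sync methods in a worker thread.

```python
import asyncio


class SimpleRendererPool:
    """Illustrative pool; slot count and method names are assumptions."""

    def __init__(self, renderers):
        self._slots: asyncio.Queue = asyncio.Queue()
        for renderer in renderers:  # e.g. 32 independently constructed renderers
            self._slots.put_nowait(renderer)

    async def render_ids(self, messages):
        renderer = await self._slots.get()
        try:
            # HF fast tokenizers release the GIL in their Rust core, so
            # to_thread gives real parallelism rather than just yielding.
            return await asyncio.to_thread(renderer.render_ids, messages)
        finally:
            self._slots.put_nowait(renderer)
```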
`_encode_tools_typescript` filtered with `tool.get("type") != "function"`
which silently dropped every flat `ToolSpec` (the TypedDict in
`renderers.base`: `{name, description, parameters}` with no `type` key).
Production callers pass `ToolSpec`; tests happen to use the OpenAI
envelope format `{"type":"function","function":{...}}`, which is why
the regression slipped through.
Now accept both shapes: unwrap `tool["function"]` when the envelope is
present, otherwise treat the dict as a flat ToolSpec. Non-function
envelope types (e.g. `"_plugin"`) are still skipped.
Caught by Cursor Bugbot in PR #1068 thread r3151243707.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
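A sketch of the shape handling described in the message above; `_unwrap_tool` is a hypothetical helper name used here for illustration, not necessarily what the package calls it.

```python
def _unwrap_tool(tool: dict) -> dict | None:
    """Return the flat {name, description, parameters} dict, or None to skip."""
    if "function" in tool:
        # OpenAI envelope: {"type": "function", "function": {...}}
        if tool.get("type") != "function":
            return None  # still skip non-function envelope types like "_plugin"
        return tool["function"]
    # Flat ToolSpec from renderers.base: no "type" key, already the right shape.
    return tool
```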
vLLM serves gpt-oss via the `openai-harmony` reference encoder, which is also what the model was trained on. The previous hand-rolled ~550-line implementation only covered a subset of the harmony spec (no system preamble, no auto channel-routing line, no canonical ``<|return|>`` for terminal turns, partial TS schema rendering for tools), and HF's chat_template.jinja diverges from harmony in small ways anyway.
Approach: thin adapter over `openai-harmony`. Per-message `enc.render(m)` produces token streams that concatenate byte-identical to `enc.render_conversation` (verified empirically), so we get per-token attribution for free — emit `enc.render(m)` for each caller message and tag tokens with the caller index. The system+developer prefix needs `enc.render_conversation` (not per-message) because harmony injects a channel-routing line into SystemContent based on conversation-level info; per-message rendering doesn't see that.
Caller messages map to harmony as:
- first `system` → `DeveloperContent.with_instructions(content)`
- `user` → `Role.USER`
- `assistant` → final-channel for text + commentary-channel recipient=`functions.<name>` per tool_call
- `tool` → `Role.TOOL` with `name=functions.<name>`, recipient=`assistant`, channel=`commentary`
- historical `reasoning_content` is dropped (matches harmony's `render_conversation` behaviour — analysis-channel messages are stripped from rendered history; reasoning is per-turn only)
The last assistant final-channel close gets patched from `<|end|>` to `<|return|>` to match `render_conversation_for_training`.
Tests:
- `packages/renderers/tests/test_gpt_oss_harmony_parity.py`: 7 new parity tests against `enc.render_conversation_for_training` — no-system, system+user, terminal-assistant `<|return|>`, tools layout, full tool-call+result cycle, reasoning-content stripping, generation prompt scaffolding.
- conftest matrix: `openai/gpt-oss-20b` added; an autouse fixture skips it for HF-parity test files (`test_render_ids`, `test_build_helpers`, `test_parse_response*`) since the renderer intentionally matches harmony, not HF Jinja. Filed under conftest so a single skip rule covers all four files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
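The per-message vs whole-conversation parity claim can be checked with a small script. The sketch below assumes the `openai-harmony` names referenced in the message above (`load_harmony_encoding`, `Message.from_role_and_content`, `Conversation.from_messages`, `enc.render`, `enc.render_conversation`); treat it as a rough illustration rather than a verified snippet.

```python
from openai_harmony import (
    Conversation,
    HarmonyEncodingName,
    Message,
    Role,
    load_harmony_encoding,
)

enc = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)

messages = [
    Message.from_role_and_content(Role.USER, "Hello"),
    Message.from_role_and_content(Role.ASSISTANT, "Hi there!"),
]

# Per-message rendering: one token stream per caller message, so every token
# can be attributed back to the message index that produced it.
per_message = [enc.render(m) for m in messages]
flat = [tok for stream in per_message for tok in stream]

# Whole-conversation rendering: the reference layout.
whole = enc.render_conversation(Conversation.from_messages(messages))

# The adapter relies on these concatenating identically.
assert flat == list(whole)
```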
`OpenAIChatCompletionsTokenClient` and its tests were dropped on the renderers branch in favour of the new `RendererClient`. Restore them so both paths coexist on this branch — callers can opt into either via `ClientConfig.client_type`:
- `"renderer"` → renderers-v2 path (client-side tokenization + `/v1/generate`)
- `"openai_chat_completions_token"` → server-side token-aware chat (`/v1/chat/completions/tokens`)
Files restored verbatim from origin/main:
- verifiers/clients/openai_chat_completions_token_client.py
- tests/test_openai_chat_completions_token_client.py
Plumbing:
- verifiers/clients/__init__.py: re-add the import, the `resolve_client` dispatch case, and the `__all__` entry.
- verifiers/types.py: re-add `"openai_chat_completions_token"` to the `ClientType` Literal.
The server-side endpoint (`serving_chat_with_tokens.py`) lives in prime-rl and will need a matching restoration on that side before this client can be exercised end-to-end on the renderers branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`auto_system_injected` (line 160) was computed but never read; the actual auto-injection bookkeeping uses `auto_system_idx`. The unused loop variable `name` in `_render_assistant`'s tool_calls block was dead since the K2 template emits arguments only — function names live in the `tool_call['id']` field. CI fix only — no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The logger init and the bridge-metrics module-level helpers were declared between two import groups, which made ruff flag every import below them as E402 (module-level import not at top of file). Move them after the imports — same code, just relocated — so ruff is happy. CI fix only — no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three classes of pre-existing test failures on the renderers branch
(visible in CI for weeks); none affect production behavior, all
unblock the test job from going green.
1. ``test_renderer_client_honors_configured_renderer_name`` and
``test_renderer_client_uses_renderer_model_name_override``:
``_get_renderer_or_pool`` now passes ``tool_parser=None,
reasoning_parser=None`` to ``create_renderer``. Update the
``assert_called_once_with`` mocks to match.
2. ``test_get_incremental_prompt_ids_*`` (3 tests with the
``_BridgeRenderer`` fake): the fake had only ``render_ids`` and
``parse_response`` from the old diff-based bridge protocol —
``_get_incremental_prompt_ids`` now calls
``renderer.bridge_to_next_turn``. Add a ``bridge_to_next_turn``
method that returns ``prev_prompt + prev_completion + trailing +
extension``, mimicking what a real bridge stitches together. Track
bridge calls separately so the "without re-rendering completion"
test can assert that ``render_ids`` is NOT called and the bridge
path is taken (see the sketch after this commit message).
3. Two parametrize cases that test features that were prototyped but
not merged into the renderers package — strict xfail so they
auto-flip to xpass once the feature lands:
- ``test_get_incremental_prompt_ids_bridges_over_truncated_step
[Qwen/Qwen2.5-0.5B-Instruct]``: DefaultRenderer's
``bridge_to_next_turn`` always returns None.
``synthesize_close_on_truncation`` for unknown templates was
prototyped in site-packages but never merged.
- ``test_extension_break_emits_diagnostic_log``: the bridge no
longer surfaces a break for Qwen3.5's strip-thinking-from-history
pattern, so the diagnostic log never fires for this scenario.
Needs a different repro (e.g. trajectory with empty
``completion_ids`` or an invalid tail) to exercise the log path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
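The ``_BridgeRenderer`` change described in item 2 might look roughly like this (an illustrative sketch of the test fake, not the actual fixture):

```python
class _BridgeRenderer:
    """Fake renderer: counts calls so tests can assert which path was taken."""

    def __init__(self):
        self.render_calls = 0
        self.bridge_calls = 0

    def render_ids(self, messages):
        self.render_calls += 1
        return [1, 2, 3]

    def bridge_to_next_turn(self, prev_prompt, prev_completion, new_messages):
        # Mimic what a real bridge stitches together: keep the prior step's
        # tokens verbatim, append a (fake) turn close, then the new turn.
        self.bridge_calls += 1
        trailing = [99]        # stands in for a synthesized close token
        extension = [7, 8, 9]  # stands in for the new messages' tokens
        return prev_prompt + prev_completion + trailing + extension
```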
… tests
CI runs ``ruff format --check`` in addition to ``ruff check``. The format job was failing on 17 files — purely formatting-only changes that were never normalised after recent edits to:
- packages/renderers/renderers/ (client, glm45, glm5, gpt_oss, kimi_k25, minimax_m2, parsers, parsing)
- packages/renderers/tests/ (test_bridge, test_client, test_gpt_oss_harmony_parity, test_incremental, test_parsers, test_roundtrip)
- tests/test_renderer_client.py, tests/test_renderer_e2e.py
- verifiers/clients/renderer_client.py
Ran ``uv run ruff format``; both ``ruff format --check`` and ``ruff check`` are now clean. No behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…erClient
When TITO was originally dropped on the renderers branch, both docs
mentioning ``openai_chat_completions_token`` were rewritten to point
at the new ``renderer`` client. Now that TITO is restored (commit
``748d03e0``) and lives next to ``RendererClient``, the docs need to
list both:
- ``docs/evaluation.md``: extend the ``--api-client-type`` flag's
enumerated list to include both ``openai_chat_completions_token``
and ``renderer``.
- ``docs/reference.md``:
- re-add ``"openai_chat_completions_token"`` to the ``ClientType``
Literal block (matches ``verifiers/types.py``).
- re-add the ``OpenAIChatCompletionsTokenClient`` row to the
Built-in Client Implementations table, with a one-line note
distinguishing it from the renderer client (server-side
templating + ``/v1/chat/completions/tokens`` route vs client-side
tokenization through the ``renderers`` package).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sample
The function returns ``(token_ids, loss_mask)`` for any caller-defined masking policy — its only specifically-supervised aspect was the name. ``build_training_sample`` better reflects the canonical ``(ids, mask)`` builder used across both SFT and RL training paths, matches the wording in trainer code/configs, and reads cleaner alongside ``build_trajectory_step``.
The previous name is dropped without a deprecation alias because ``renderers`` isn't released yet — no external users to break.
Touches:
- packages/renderers/renderers/base.py (def)
- packages/renderers/renderers/__init__.py (import + __all__)
- packages/renderers/tests/test_build_helpers.py (module docstring, import, 2 test names, 2 call sites)
- 4 docstring/comment mentions in default.py, kimi_k2.py, nemotron3.py, test_message_indices.py
prime-rl has no callers (verified via grep).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
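For orientation, a hypothetical call site; the exact signature is an assumption based on the ``(token_ids, loss_mask)`` return described above.

```python
# Assumed usage; the real signature may take different arguments.
from renderers import build_training_sample

token_ids, loss_mask = build_training_sample(renderer, messages)
assert len(token_ids) == len(loss_mask)
# loss_mask[i] marks whether token i is supervised under the caller's policy
# (e.g. True only on assistant-produced tokens for SFT, or per-step masks for RL).
```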
- openai_chat_completions_token_client: tokenize() default
extra_kwargs={} is a mutable-default footgun; use None and lazily
initialize (see the sketch after this commit message).
- environment / types: restore start_timer (perf_counter) for
monotonic elapsed-ms; the consolidation onto start_time (time.time)
could produce negative deltas on NTP step.
- docs/reference: ClientType list was missing nemorl_chat_completions;
add it and the corresponding row in the Built-in Clients table.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
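The mutable-default fix from the first bullet, sketched; the parameter name comes from the text above, while the surrounding class and body are illustrative.

```python
class _Example:
    """Illustrative only, not the actual client class."""

    def __init__(self, tokenizer):
        self._tokenizer = tokenizer

    # Before (a single dict shared across calls, the footgun):
    # def tokenize(self, text: str, extra_kwargs: dict = {}): ...

    # After: default to None and build the dict lazily per call.
    def tokenize(self, text: str, extra_kwargs: dict | None = None):
        if extra_kwargs is None:
            extra_kwargs = {}
        return self._tokenizer.encode(text, **extra_kwargs)
```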
The DEBUG-only diagnostic in _log_extension_break (and the last_reason / last_detail tracking around it in _get_incremental_prompt_ids) was reaching into renderer._tokenizer to decode token windows on bridge failure, which was an encapsulation violation through a private attr. The diagnostic was xfailing in tests and not exercised on main; better diagnostics can be reintroduced once the design is clearer. Removes ~270 lines: the helper, the dedicated logger, the per-step category tracking, and the xfailing observability test in test_renderer_e2e. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two cleanup passes folded into one commit since they share files:
* Drop synthesize_close_on_truncation: every hand-coded renderer hardcoded True at the class level and DefaultRenderer's bridge_to_next_turn returns None unconditionally regardless of the flag, so the runtime knob never did anything user-visible. Removed from the Renderer Protocol, both factories (create_renderer, create_renderer_pool), DefaultRenderer.__init__, the misleading "ignoring for renderer=X" log, and from every hand-coded renderer's bridge (the `synthesize_close=(self._x if self.synthesize_close_on_truncation else None)` ternary becomes a direct `synthesize_close=self._x`). Tests updated: test_bridge drops the opt-out branch, test_incremental drops the obsolete opt-out test, test_renderer_client loses its `synth_close` parameter and the now-impossible Qwen2.5/default xfail entry.
* Strip multimodal: the renderer pipeline can't carry image bytes through /v1/generate, and the config validator already routes VLMs to MITO. Removed ImagePart from the type union (and the package's public exports), all multimodal handling in Qwen3VLRenderer (~250 lines: image loaders, processor wiring, multimodal-content branches in render/render_ids/bridge), KimiK25Renderer's _emit_image + media tokens, and the image/video branches in Nemotron3 / Qwen35 content rendering. Passing image content to any renderer now raises ValueError. README §6 updated. Multimodal tests in test_render_ids dropped (kept the auto-routing smoke test for Qwen3-VL).
Net: +45 / -644 lines.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
client_idx: int = 0
client_type: ClientType = "openai_chat_completions"
renderer: str = "auto"
hm, I'm wondering if we should bundle those somehow? like a discriminated union of client types with some shared args but mostly disjoint
### What we gain
- **RL correctness.** A prompt/completion split we control, which is exactly what `bridge_to_next_turn` relies on to keep rollouts from fragmenting under truncation or re-tokenization.
- **Testable parity.** Per-model renderers are plain Python. We can render the same conversation through the renderer and through HF's `apply_chat_template` and assert token-level parity. Every edge case (empty thinking, multiple tool calls, truncated turns) becomes a unit test instead of undefined behavior buried inside Jinja.
- **Escape hatch.** Anything without a hand-coded renderer falls back to `DefaultRenderer` (a generic `apply_chat_template` wrapper), which mirrors the previous TITO path.
what sort of guarantees, if any, can we give for this fallback? what are the pros and cons compared to falling back to MITO?
yeah, let me be more clear about this
### Per-renderer bridges
Each hand-coded renderer implements `bridge_to_next_turn` directly for its model's chat template — no shared generic helper, just Python that knows what tokens the template would insert between turns. Qwen3's bridge knows about `<|im_start|>role\n … <|im_end|>\n`; GLM's bridge knows that turns end when the next role marker appears; DeepSeek V3, Kimi K2/K2.5, Nemotron-3, GPT-OSS, MiniMax each have their own. On a clean stop, vLLM's `completion_ids` already includes the template's close token; on truncation, the renderer synthesizes the canonical close (`<|im_end|>`, `<|endoftext|>`, or the equivalent for that model) so the extension invariant still holds, and the synthetic close is masked out of the loss because the model didn't produce it.
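A toy ChatML-style bridge illustrating that contract; the token literals and helper shape are assumptions for the sketch, not the package's implementation.

```python
IM_END = "<|im_end|>"


def bridge_to_next_turn(tokenizer, prev_prompt, prev_completion, new_messages):
    """Extend prev_prompt + prev_completion with the new turn's tokens, or return None."""
    im_end_id = tokenizer.convert_tokens_to_ids(IM_END)

    completion = list(prev_completion)
    if im_end_id not in completion:
        # Truncated turn: synthesize the canonical close so the next prompt still
        # extends the prior step verbatim; the caller masks this token out of the loss.
        completion.append(im_end_id)

    extension: list[int] = []
    for message in new_messages:
        if message["role"] == "assistant":
            return None  # refuse to re-tokenize model-sampled assistant content
        extension += tokenizer.encode(
            f"<|im_start|>{message['role']}\n{message['content']}{IM_END}\n",
            add_special_tokens=False,
        )
    # Generation prompt for the next assistant turn.
    extension += tokenizer.encode("<|im_start|>assistant\n", add_special_tokens=False)
    return prev_prompt + completion + extension
```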
* docs/evaluation.md: --api-client-type list was missing nemorl_chat_completions and ordered inconsistently with verifiers/types.py; align with the source of truth.
* base.py / renderer_client.py: simplify factory closures from default-arg-as-pseudo-closure (factory(_name=…, _model=…, …)) to plain factory() -> Renderer; the captured locals are stable for the function's lifetime, no late-binding footgun.
* clients: drop redundant maybe_normalize_messages calls from OpenAIChatCompletionsClient.to_native_prompt and RendererClient.to_native_prompt. PR #1027 explicitly centralized message normalization in the env loop and removed downstream copies; we re-introduced the redundancy by accident.
* renderer_client.py: collapse _get_renderer_or_pool's inline factory + RendererPool() construction into a direct create_renderer_pool() call (single source of truth for pool construction lives in packages/renderers/renderers/base.py).
* README: switch the create_renderer / create_renderer_pool import examples to the top-level package path.
Tests in test_renderer_client.py updated to patch create_renderer_pool instead of the now-uncalled create_renderer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The renderer ClientType is RL-specific (token preservation across turns, multi-turn extension invariant via bridge_to_next_turn, truncation-safe close synthesis), but neither training.md nor faqs.md mentioned it.
* training.md: add an "Inference Client Types" subsection under "RL Rules of Thumb" that contrasts MITO / TITO / renderer and recommends renderer (or TITO as fallback) for RL workloads. Also updates "Non-Increasing Chat Templates" to note that the renderer client sidesteps the Qwen3/DeepSeek <think>-stripping issue by tokenizing client-side.
* faqs.md: add a Training FAQ "Which client_type should I use for RL training?" with the same three-way breakdown and a link to training.md for details.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The renderer client type ships hand-coded support for only a subset of models, and corner cases are still being shaken out. Production RL workloads should use openai_chat_completions_token (TITO) — it's the tried-and-tested path with broad coverage. Try renderer when you want the stronger token-preservation guarantees and your model has a hand-coded renderer.
* training.md / faqs.md: tag renderer with *(experimental)* and flip the recommendation to TITO-first.
* renderers/README.md: add an explicit "Status: experimental" callout in the intro; drop the misleading "Replaces the old TITO client" line — TITO and renderer ship side-by-side.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ients
Mika #7 — the renderer client had 15 module-level helpers as a trailer block after the class, while every other client (openai_chat_completions, openai_chat_completions_token, etc.) co-locates helpers above the class. Pure positional move; helpers stay module-level (they're pure functions tested directly via import). Nothing extracted to clients/utils since _normalize_for_comparison has renderer-specific semantics (tool_call.arguments JSON-decode, None-filtering) that don't match the TITO version.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two trailing blank lines left over from removing the diagnostic test in 9afd448. ruff format --check caught it; ruff check passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bugbot findings on PR #1068:
1. ``build_training_sample`` accepted a ``collapse_consecutive_tool_messages`` parameter that was never referenced in the body. No actual caller (verified across verifiers + prime-rl) — prime-rl SFT uses its own ``build_incremental_token_mask`` which has its own collapse implementation. Drop the dead parameter.
2. ``KimiK25Renderer`` silently dropped ``role="tool_declare"`` messages in the input list with ``if role == "tool_declare": continue``, regardless of whether ``tools=`` was passed. The K2.5 chat template actually iterates every message — tool_declare included — through ``set_roles`` + ``render_content``, emitting ``<|im_system|>tool_declare<|im_middle|>{content}<|im_end|>``. The ``tools=`` parameter is a separate path that fires once before the loop, not a deduplicating gate. Removing the early-skip lets tool_declare messages flow through the existing generic content handler, matching the template exactly.
Adds a regression test ``test_kimi_k25_tool_declare_message_without_tools_param`` verifying parity against ``apply_chat_template`` for this case.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
Reviewed by Cursor Bugbot for commit e2ede3f.
…ink>
KimiK2Renderer was calling _extract_thinking on every assistant turn, which split inline <think>...</think> out of content and then discarded the extracted reasoning. Result: any caller passing content like "<think>secret</think>visible" got "visible" emitted instead of the verbatim string, disagreeing with apply_chat_template.
Kimi K2's chat template emits ``message.content`` verbatim — there is no reasoning_content support, no inline-tag stripping. The separate reasoning_content field is just dropped (the template never reads it).
Drop _extract_thinking entirely (single caller) and emit content directly. Adds test_kimi_k2_inline_think_tags_render_verbatim which asserts parity against apply_chat_template for the bugbot-flagged case.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
completions_request was building its return dict without threading through id, model, or created from vLLM's /generate response — so RendererClient.from_native_response always fell back to its defaults (id="", created=0, model=""), and downstream Response objects had empty metadata even when vLLM populated those fields.
Pass the three fields through with safe defaults (empty string / 0) so callers using them for logging, request correlation, or model attribution see real values.
Adds test_from_native_response_propagates_id_model_created as a regression.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
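A sketch of the pass-through described above; the field names on the vLLM response and the helper shape are assumptions for illustration.

```python
def _build_completion_payload(vllm_response: dict, choices: list[dict]) -> dict:
    """Thread id/model/created from the /generate response into the returned dict."""
    return {
        "id": vllm_response.get("id", ""),
        "model": vllm_response.get("model", ""),
        "created": vllm_response.get("created", 0),
        "choices": choices,
    }
```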

Summary
Adds `packages/renderers/` — a standalone package that owns message ↔ token conversion as an alternative to vLLM's Jinja chat templates and the existing TITO/MITO server-side machinery. Every sampled assistant turn keeps its exact tokens across the rollout boundary; the trainer sees bit-for-bit what vLLM produced.
Full design, motivation, and examples of the failure modes this fixes: `packages/renderers/README.md`.
Renderer matrix
- `Qwen3Renderer` / `Qwen35Renderer` / `Qwen36Renderer` / `Qwen3VLRenderer`
- `GLM5Renderer` / `GLM45Renderer`
- `MiniMaxM2Renderer`
- `KimiK2Renderer` / `KimiK25Renderer`
- `DeepSeekV3Renderer`
- `Nemotron3Renderer`
- `GptOssRenderer`
- `DefaultRenderer` (fallback wrapping `apply_chat_template`, with pluggable `tool_parser` / `reasoning_parser`)
Architecture
The Renderer Protocol:
- `render()` / `render_ids()` — messages → tokens with per-token message attribution for loss masking.
- `parse_response()` — completion tokens → structured message via token-ID boundary scanning (no regex on decoded text).
- `get_stop_token_ids()` — turn-close tokens.
- `bridge_to_next_turn()` — extends `prev_prompt + prev_completion` with the new turn's tokens; returns `None` if the renderer can't prove prefix-stability (caller falls back to a fresh render).
Key design decisions
Per-renderer bridges, hand-coded. No shared `chatml_bridge` / `glm_bridge` helper — that approach rendered `[dummy_assistant, *new_messages]` and diffed against `[dummy_assistant]` to extract extension tokens, which broke on templates that treated the dummy as an invalid prefix (GLM-5.1 wraps the last assistant with empty `<think></think>`, harmony's assistant uses different channels historical vs latest, Kimi auto-injects a default system). Each renderer's bridge now hand-emits the new-turn tokens by calling the same per-role inline helpers that `render()` uses, so the two paths can't silently diverge. Two small shared primitives remain: `trim_to_turn_close` (scan `prev_completion` for a template-specific close token; on truncation, append the canonical close so the bridge still extends) and `reject_assistant_in_extension` (bridges refuse to re-tokenize model-sampled assistant content).
Truncation handling. Hand-coded renderers always synthesize the canonical turn-close (`<|im_end|>`, `<|endoftext|>`, harmony's `<|end|>`, …) when vLLM hits `max_tokens` and the prior completion has no close token, so the next prompt still extends the prior step's tokens verbatim; the synthetic token lands in the merged sample's `prompt_ids` (mask=False) and never enters loss or KL. `DefaultRenderer.bridge_to_next_turn` returns `None` unconditionally — it wraps an unknown Jinja template and can't prove the extension contract holds — so the caller falls back to a fresh re-render.
Pluggable parsers for `DefaultRenderer`. Hand-coded renderers bake parsing in. `DefaultRenderer` takes optional `tool_parser=` / `reasoning_parser=` kwargs wired to registries in `renderers.parsers`. Built-ins today: `qwen3`, `qwen3.5`, `glm`, `deepseek_v3` for tools; `think` for reasoning.
RendererClient + /v1/generate
Adds `RendererClient` (verifiers client type `"renderer"`) — renders messages client-side, POSTs raw token IDs to vLLM's `/v1/generate`, parses completions back into structured responses. Multi-turn rollouts reuse the prior step's exact tokens through `bridge_to_next_turn`; no re-rendering of sampled content.
A `RendererPool` offloads sync tokenization to threads so concurrent rollouts tokenize in parallel instead of blocking the event loop.
Test plan
- `packages/renderers/tests/test_render_ids.py` — multi-model parity matrix vs `apply_chat_template` (1 documented xfail for an upstream Jinja bug on Qwen3-VL `content=None`).
- `packages/renderers/tests/test_roundtrip.py` — render → parse round-trip per renderer: content, reasoning, single and multiple tool-calls.
- `packages/renderers/tests/test_bridge.py` — bridge contract invariants per hand-coded renderer: extends prev verbatim, rejects assistant-role extension, synthesizes close on truncation, extension contains the new-message content.
- `packages/renderers/tests/test_incremental.py` — unit coverage of `trim_to_turn_close` + `reject_assistant_in_extension` edge cases.
- `packages/renderers/tests/test_parsers.py` / `test_parse_response.py` / `test_parse_response_robustness.py` — parsing on truncated / malformed output; includes regression test for `parse_qwen3` JSON-decode-error fallback.
- `tests/test_renderer_e2e.py` — end-to-end TITO rollout with scripted vLLM; asserts token preservation and multi-turn bridge extension.
`samples_per_rollout` = 1.00, reward + KL track main. Qwen3.5-35B-A3B + mini-swe-agent-plus: 0 breaks vs main's 32 in the same step (see README §4 for the concrete break modes).
🤖 Generated with Claude Code
Note
Medium Risk
Large, mostly additive change that introduces new tokenization/parsing/bridging logic and a new dependency; errors here can subtly affect rollout correctness and RL training data integrity, though existing client paths remain available.
Overview
Introduces a new standalone `packages/renderers` package that implements a `Renderer` protocol, per-model message→token renderers (Qwen/GLM/Kimi/DeepSeek/MiniMax/Nemotron/GPT-OSS) with `bridge_to_next_turn` support, and a fallback `DefaultRenderer` that wraps `tokenizer.apply_chat_template` with optional tool/reasoning parsers.
renderers.client.completions_request) that sendsprompt_token_idsto vLLM’s/generateendpoint, parses completion token IDs back into structured outputs, and supports parallel tokenization viaRendererPool.Updates docs (
evaluation.md,training.md,faqs.md,reference.md) to document the newrendererclient_type, expand client type listings/descriptions, and clarify RL training tradeoffs between MITO/TITO/renderer approaches.Reviewed by Cursor Bugbot for commit ee6cdb5. Bugbot is set up for automated code reviews on this repo. Configure here.