
feat: add renderers package #1068

Merged
hallerite merged 89 commits into main from renderers
Apr 29, 2026

Conversation


@hallerite hallerite commented Mar 25, 2026

Summary

Adds packages/renderers/ — a standalone package that owns message ↔ token conversion as an alternative to vLLM's Jinja chat templates and the existing TITO/MITO server-side machinery. Every sampled assistant turn keeps its exact tokens across the rollout boundary; the trainer sees bit-for-bit what vLLM produced.

Status: experimental. Available alongside the production openai_chat_completions_token (TITO) client, which remains the recommended path for production training. Renderers offer stronger token-preservation guarantees but only ship hand-coded support for a subset of models, and corner cases are still being shaken out. The package is text-only — Qwen3VLRenderer works against the Qwen3-VL tokenizer for text conversations only; multimodal training should continue to use MITO.

Full design, motivation, and examples of the failure modes this fixes: packages/renderers/README.md.

Renderer matrix

| Template family | Renderer class | Models |
| --- | --- | --- |
| Qwen chatml | Qwen3Renderer / Qwen35Renderer / Qwen36Renderer / Qwen3VLRenderer | Qwen3, Qwen3.5, Qwen3.6, Qwen3-VL (text-only) |
| GLM (next-turn markers) | GLM5Renderer / GLM45Renderer | GLM-5, GLM-5.1, GLM-4.5, GLM-4.7 |
| MiniMax chatml-variant | MiniMaxM2Renderer | MiniMax-M2 / M2.5 |
| Kimi im-role format | KimiK2Renderer / KimiK25Renderer | Kimi K2, K2.5, K2.6 |
| DeepSeek V3 tool-section | DeepSeekV3Renderer | DeepSeek V3 |
| Nemotron chatml + XML tools | Nemotron3Renderer | Nemotron-3 Nano / Super |
| OpenAI harmony | GptOssRenderer | GPT-OSS |
| Fallback (apply_chat_template) | DefaultRenderer | anything else, with optional tool_parser / reasoning_parser |

Architecture

messages → renderer.render_ids() → [token IDs]
                                       ↓
                              vLLM /v1/generate
                                       ↓
[completion IDs] → renderer.parse_response() → ParsedResponse(content, reasoning_content, tool_calls)

The Renderer Protocol:

  • render() / render_ids() — messages → tokens with per-token message attribution for loss masking.
  • parse_response() — completion tokens → structured message via token-ID boundary scanning (no regex on decoded text).
  • get_stop_token_ids() — turn-close tokens.
  • bridge_to_next_turn() — extends prev_prompt + prev_completion with the new turn's tokens; returns None if the renderer can't prove prefix-stability (caller falls back to fresh render).
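
For reference, a minimal sketch of this protocol surface. Method names and ParsedResponse fields follow the description above; exact signatures (and the Message/ToolSpec TypedDicts in packages/renderers) may differ.

```python
# Minimal sketch of the Renderer protocol described above. Method names and
# ParsedResponse fields follow the PR text; exact signatures (and the Message /
# ToolSpec TypedDicts used in packages/renderers) may differ.
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class ParsedResponse:
    content: str | None = None
    reasoning_content: str | None = None
    tool_calls: list[dict[str, Any]] = field(default_factory=list)


class Renderer(Protocol):
    def render_ids(
        self, messages: list[dict[str, Any]], tools: list[dict[str, Any]] | None = None
    ) -> list[int]:
        """messages -> prompt token IDs, with per-token message attribution."""
        ...

    def parse_response(self, token_ids: list[int]) -> ParsedResponse:
        """completion token IDs -> structured message via token-ID boundary scanning."""
        ...

    def get_stop_token_ids(self) -> list[int]:
        """Turn-close tokens to hand to the inference engine as stop tokens."""
        ...

    def bridge_to_next_turn(
        self,
        prev_prompt: list[int],
        prev_completion: list[int],
        new_messages: list[dict[str, Any]],
    ) -> list[int] | None:
        """Extend prev_prompt + prev_completion with the new turn's tokens, or
        return None when prefix-stability can't be proven (caller re-renders)."""
        ...
```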

Key design decisions

  • Per-renderer bridges, hand-coded. No shared chatml_bridge / glm_bridge helper — that approach rendered [dummy_assistant, *new_messages] and diffed against [dummy_assistant] to extract extension tokens, which broke on templates that treated the dummy as an invalid prefix (GLM-5.1 wraps the last assistant with empty <think></think>, harmony's assistant uses different channels historical vs latest, Kimi auto-injects a default system). Each renderer's bridge now hand-emits the new-turn tokens by calling the same per-role inline helpers that render() uses, so the two paths can't silently diverge. Two small shared primitives remain: trim_to_turn_close (scan prev_completion for a template-specific close token; on truncation, append the canonical close so the bridge still extends) and reject_assistant_in_extension (bridges refuse to re-tokenize model-sampled assistant content).

  • Truncation handling. Hand-coded renderers always synthesize the canonical turn-close (<|im_end|>, <|endoftext|>, harmony's <|end|>, …) when vLLM hits max_tokens and the prior completion has no close token, so the next prompt still extends the prior step's tokens verbatim; the synthetic token lands in the merged sample's prompt_ids (mask=False) and never enters loss or KL. DefaultRenderer.bridge_to_next_turn returns None unconditionally — it wraps an unknown Jinja template and can't prove the extension contract holds — so the caller falls back to a fresh re-render. The caller-side pattern is sketched after this list.

  • Pluggable parsers for DefaultRenderer. Hand-coded renderers bake parsing in. DefaultRenderer takes optional tool_parser= / reasoning_parser= kwargs wired to registries in renderers.parsers. Built-ins today: qwen3, qwen3.5, glm, deepseek_v3 for tools; think for reasoning.
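
To make the bridge and truncation bullets above concrete, here is an illustrative caller-side sketch of the bridge-or-re-render pattern (a hypothetical helper, not the actual verifiers/prime-rl code):

```python
# Illustrative caller-side pattern for the bridge contract above (hypothetical
# helper, not the actual client code): extend the previous step's tokens
# verbatim when the renderer can prove prefix-stability, otherwise fall back
# to a fresh render of the full history.
def next_prompt_ids(renderer, prev_prompt, prev_completion, full_messages, new_messages):
    bridged = renderer.bridge_to_next_turn(prev_prompt, prev_completion, new_messages)
    if bridged is not None:
        prefix = prev_prompt + prev_completion
        # Extension invariant: the prior step's tokens are reused verbatim; any
        # synthesized close token sits on the prompt side and is masked out of
        # loss/KL by the trainer.
        assert bridged[: len(prefix)] == prefix
        return bridged
    # DefaultRenderer (unknown Jinja template) returns None unconditionally.
    return renderer.render_ids(full_messages)
```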

RendererClient + /v1/generate

Adds RendererClient (verifiers client type "renderer") — renders messages client-side, POSTs raw token IDs to vLLM's /v1/generate, parses completions back into structured responses. Multi-turn rollouts reuse the prior step's exact tokens through bridge_to_next_turn; no re-rendering of sampled content.
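
A hedged sketch of that request flow. Only the /v1/generate route and the render/parse steps come from this PR; the JSON field names below are assumptions for illustration.

```python
# Hedged sketch of the renderer-backed request flow: render client-side, POST
# raw token IDs to vLLM's /v1/generate, parse completion IDs back into a
# structured message. Field names ("prompt_token_ids", "token_ids") are
# assumptions for illustration, not the confirmed payload schema.
import httpx


async def renderer_generate(base_url, renderer, messages, tools=None, max_tokens=1024):
    prompt_ids = renderer.render_ids(messages, tools)
    async with httpx.AsyncClient(base_url=base_url) as client:
        resp = await client.post(
            "/v1/generate",
            json={
                "prompt_token_ids": prompt_ids,
                "max_tokens": max_tokens,
                "stop_token_ids": renderer.get_stop_token_ids(),
            },
        )
        resp.raise_for_status()
    completion_ids = resp.json()["token_ids"]  # assumed response shape
    return prompt_ids, completion_ids, renderer.parse_response(completion_ids)
```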

A RendererPool offloads sync tokenization to threads so concurrent rollouts tokenize in parallel instead of blocking the event loop.
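
The offloading pattern, roughly (class shape is illustrative; the real RendererPool lives in packages/renderers):

```python
# Rough sketch of the thread-offloading pattern described above: a fixed pool
# of renderers behind an asyncio.Queue, with the sync render call pushed onto
# a worker thread. HF fast tokenizers release the GIL during Rust encoding, so
# concurrent rollouts genuinely tokenize in parallel. Class shape is
# illustrative, not the actual RendererPool implementation.
import asyncio


class RendererPoolSketch:
    def __init__(self, renderer_factory, slots: int = 32):
        self._free: asyncio.Queue = asyncio.Queue()
        for _ in range(slots):
            self._free.put_nowait(renderer_factory())

    async def render_ids(self, messages, tools=None):
        renderer = await self._free.get()
        try:
            return await asyncio.to_thread(renderer.render_ids, messages, tools)
        finally:
            self._free.put_nowait(renderer)
```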

Test plan

  • packages/renderers/tests/test_render_ids.py — multi-model parity matrix vs apply_chat_template (1 documented xfail for an upstream Jinja bug on Qwen3-VL content=None).
  • packages/renderers/tests/test_roundtrip.py — render → parse round-trip per renderer: content, reasoning, single and multiple tool-calls.
  • packages/renderers/tests/test_bridge.py — bridge contract invariants per hand-coded renderer: extends prev verbatim, rejects assistant-role extension, synthesizes close on truncation, extension contains the new-message content.
  • packages/renderers/tests/test_incremental.py — unit coverage of trim_to_turn_close + reject_assistant_in_extension edge cases.
  • packages/renderers/tests/test_parsers.py / test_parse_response.py / test_parse_response_robustness.py — parsing on truncated / malformed output; includes regression test for parse_qwen3 JSON-decode-error fallback.
  • tests/test_renderer_e2e.py — end-to-end TITO rollout with scripted vLLM; asserts token preservation and multi-turn bridge extension.
  • E2E in prime-rl #2278: wordle multi-turn samples_per_rollout = 1.00, reward + KL track main. Qwen3.5-35B-A3B + mini-swe-agent-plus: 0 breaks vs main's 32 in the same step (see README §4 for the concrete break modes).

🤖 Generated with Claude Code


Note

Medium Risk
Large, mostly additive change that introduces new tokenization/parsing/bridging logic and a new dependency; errors here can subtly affect rollout correctness and RL training data integrity, though existing client paths remain available.

Overview
Introduces a new standalone packages/renderers package that implements a Renderer protocol, per-model message→token renderers (Qwen/GLM/Kimi/DeepSeek/MiniMax/Nemotron/GPT-OSS) with bridge_to_next_turn support, and a fallback DefaultRenderer that wraps tokenizer.apply_chat_template with optional tool/reasoning parsers.

Adds a renderer-backed inference path (renderers.client.completions_request) that sends prompt_token_ids to vLLM’s /generate endpoint, parses completion token IDs back into structured outputs, and supports parallel tokenization via RendererPool.

Updates docs (evaluation.md, training.md, faqs.md, reference.md) to document the new renderer client_type, expand client type listings/descriptions, and clarify RL training tradeoffs between MITO/TITO/renderer approaches.

Reviewed by Cursor Bugbot for commit ee6cdb5.

@hallerite hallerite changed the title from "feat: add renderers package — deterministic chat template rendering" to "feat: add renderers package" on Mar 25, 2026
@hallerite hallerite marked this pull request as draft March 26, 2026 00:03
hallerite added 11 commits April 7, 2026 14:32
Adds packages/renderers/ — a standalone package for deterministic
message-to-token conversion that replaces Jinja chat templates.

Renderers (6 total):
- Qwen3Renderer, Qwen35Renderer (Qwen family)
- GLM5Renderer, GLM45Renderer (GLM family)
- MiniMaxM2Renderer (MiniMax M2/M2.5)
- DefaultRenderer (fallback: uses tokenizer.apply_chat_template)

Each renderer implements:
- render_ids(messages) -> token IDs (messages -> tokens)
- parse_response(token_ids) -> ParsedResponse (tokens -> structured message)
- get_stop_token_ids() -> stop tokens

RendererClient: new verifiers client type ("renderer") that uses
renderers for all tokenization. Sends token IDs to vLLM /v1/completions
directly. No MITO/TITO prefix matching, no /tokenize calls.

Auto-detection: create_renderer(tokenizer) picks the right renderer
from tokenizer special tokens. Falls back to DefaultRenderer for
unsupported models.
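
For illustration, typical usage of that factory; import path, model name, and signature details are examples, and a later commit replaces special-token sniffing with a model-name prefix map.

```python
# Illustrative usage of the auto-detection factory described above. Import
# path, model name, and signature details are examples only.
from transformers import AutoTokenizer
from renderers import create_renderer  # assumed top-level export

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
renderer = create_renderer(tokenizer)  # DefaultRenderer for unsupported models

prompt_ids = renderer.render_ids([{"role": "user", "content": "hi"}])
```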
… attribution

175 parametrized tests across 7 models × 25 cases:
- test_render_ids: token-for-token correctness against apply_chat_template
- test_parse_response: content/reasoning/tool extraction
- test_build_helpers: supervised sample + trajectory step

Fixes:
- GLM-5/GLM-4.5/MiniMax: None content rendered as "None" (matches Jinja)
- GLM-4.5: BPE boundary fix for content + \n before <tool_call>
- DefaultRenderer: incremental rendering for per-token message attribution

Adding a new model family = add one entry to conftest.RENDERER_MODELS.
INTELLECT-3.1: auto-detects to DefaultRenderer (apply_chat_template fallback)
because its tokenizer has aggressive BPE merges that break piece-by-piece
encoding. The IntellectRenderer is available as "intellect" for future
optimization but is not the auto-detect default.

Kimi K2.5: identified as needing a custom KimiRenderer (TODO).
The template uses unique tokens (<|im_user|>, <|im_assistant|>, <|im_middle|>)
and always appends a generation prompt, making the DefaultRenderer's
incremental approach incompatible. Skipped in barrage tests for now.

Also:
- Fixed DefaultRenderer to always pass tokenize=True (Kimi returns str by default)
- Fixed _expected() in tests to handle tokenizers returning str
- 200 barrage tests passing across 8 models

Model coverage:
- Qwen3 (custom) ✓
- Qwen3.5 (custom) ✓
- GLM-5 / GLM-4.7-Flash (custom) ✓
- GLM-4.5-Air (custom) ✓
- MiniMax-M2.5 (custom) ✓
- INTELLECT-3.1 (default) ✓
- Qwen2.5 (default) ✓
- Kimi K2.5 — TODO: needs KimiRenderer
KimiRenderer for moonshotai/Kimi-K2.5:
- Unique format: <|im_user|>/<|im_assistant|>/<|im_middle|> role tokens
- TypeScript namespace tool definitions
- Tool calls via <|tool_calls_section_begin|>/<|tool_call_begin|> tokens
- All 25 barrage tests passing

Auto-detection: replaced fragile token-sniffing heuristics with a simple
MODEL_RENDERER_MAP that maps model name prefixes to renderer names.
Falls back to DefaultRenderer for unknown models.
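
Illustrative shape of that dispatch (map entries and helper name are examples, not the actual contents):

```python
# Illustrative shape of the prefix-map dispatch described above; the real
# MODEL_RENDERER_MAP contents and lookup helper may differ.
MODEL_RENDERER_MAP = {
    "Qwen/Qwen3": "qwen3",
    "zai-org/GLM-4.5": "glm45",
    "MiniMaxAI/MiniMax-M2": "minimax_m2",
}


def resolve_renderer_name(model_name: str) -> str:
    for prefix, renderer_name in MODEL_RENDERER_MAP.items():
        if model_name.startswith(prefix):
            return renderer_name
    return "default"  # DefaultRenderer fallback for unknown models
```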

225 barrage tests across 9 models, all passing.
New renderers/parsing.py with extraction functions ported from vLLM:
- extract_reasoning_qwen (qwen3_reasoning_parser)
- extract_reasoning_glm (basic think/content split)
- extract_reasoning_minimax (minimax_m2_reasoning_parser)
- extract_reasoning_kimi (kimi_k2_reasoning_parser)
- extract_tool_calls_hermes (hermes_tool_parser — Qwen3 JSON)
- extract_tool_calls_qwen35xml (qwen3xml_tool_parser — Qwen3.5 XML)
- extract_tool_calls_glm (glm4_moe/glm47_moe_tool_parser)
- extract_tool_calls_minimax (minimax_m2_tool_parser)
- extract_tool_calls_kimi (kimi_k2_tool_parser)

Same regex patterns, same edge case handling as vLLM.
All renderers now delegate parse_response() to these shared functions.

Truncation: <think> present without </think> → truncated reasoning.
No <think> marker → plain content (not assumed truncated).

312 barrage tests passing.
…ded text

Replaced all decode-then-regex parsing with token ID scanning:
- Find special token boundaries (</think>, <tool_call>, etc.) by their
  token IDs directly in the sequence
- Decode only the text segments between boundaries
- No false positives from content that happens to look like special tokens

Each model family has a dedicated parse function:
- parse_qwen3: Hermes JSON tool calls by <tool_call> token ID
- parse_qwen35: XML tool calls + <think>/</think> by token ID
- parse_glm: <arg_key>/<arg_value> pairs by token ID
- parse_minimax: <minimax:tool_call> by token ID, invoke/parameter by text
- parse_kimi: full token-level (section/begin/end/arg_begin all by ID)

Truncation: <think> token present without </think> → truncated reasoning.
No <think> token → plain content.
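
Illustrative boundary scan (a hypothetical helper, not one of the package's parse functions), assuming <think>/</think> are single special tokens in the vocab:

```python
# Illustrative token-ID boundary scan (hypothetical helper, not one of the
# package's parse functions). Assumes <think> / </think> are single special
# tokens; decodes only the segments between boundaries, so content that merely
# looks like a tag can't cause false positives.
def split_reasoning(token_ids, tokenizer):
    think_open = tokenizer.convert_tokens_to_ids("<think>")
    think_close = tokenizer.convert_tokens_to_ids("</think>")
    if think_open not in token_ids:
        return None, tokenizer.decode(token_ids)  # no marker: plain content
    start = token_ids.index(think_open) + 1
    if think_close not in token_ids:
        return tokenizer.decode(token_ids[start:]), ""  # truncated reasoning
    end = token_ids.index(think_close)
    return tokenizer.decode(token_ids[start:end]), tokenizer.decode(token_ids[end + 1:])
```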

312 barrage tests passing.
- Strip at first stop token (truncate) instead of only trailing
- Remove <think> from reasoning_ids regardless of position (fixes GLM-4.5
  where \n precedes <think> in completion)

Verified end-to-end for all 6 model families:
  ✓ Qwen3:   thinking + content + tool calls + names + args
  ✓ Qwen3.5: thinking + content + tool calls + names + args
  ✓ GLM-5:   thinking + content + tool calls + names + args
  ✓ GLM-4.5: thinking + content + tool calls + names + args
  ✓ MiniMax: thinking + content + tool calls + names + args
  ✓ Kimi:    content + tool calls + args (no thinking/names by design)

312 barrage tests passing.
- Removed KimiRenderer (too complex for now, needs more iteration)
- Removed unused messages.py (normalize_messages, deserialize_tool_calls,
  strip_message_content — not imported anywhere in the package)
- Cleaned up parse_kimi from parsing.py
- 273 barrage tests passing across 7 models
hallerite and others added 10 commits April 7, 2026 15:49
The proxy now forwards to /v1/generate (our custom endpoint) instead of
/v1/completions. For VLM, extracts raw images from messages and sends
them alongside renderer's token IDs. vLLM processes images server-side
while text tokenization is fully client-side via the Renderer.

Also updated client.py to use /v1/generate.
…S and strong Message typing

Add five new model-family renderers with full render/parse support:
- DeepSeekV3Renderer: fullwidth Unicode tokens, <think> text tags, tool call section markers
- KimiK2Renderer: im_user/im_assistant/im_system format, tool_calls_section markers, default system prompt
- KimiK25Renderer: extends K2 with <think> prefill, vision/media support, TypeScript tool declarations
- Nemotron3Renderer: Qwen-style im_start/im_end with XML tool declarations, universal thinking blocks
- GptOssRenderer: Harmony channel-based format (analysis/commentary/final), TypeScript tools

Also introduces strong typing across the package:
- Message, Content, ContentPart, ToolCall, ToolSpec TypedDicts in base.py
- All renderer signatures updated from dict[str, Any] to proper types
- Renderer protocol updated to use Message and ToolSpec

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Cast OpenAI SDK message/tool types to Message/ToolSpec at the
renderer_client boundary. Add override annotations for methods
that legitimately change the response type from OpenAIChatResponse
to dict.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tCompletionsClient

RendererClient does client-side tokenization via /v1/generate, not the
chat completions API. Inheriting from OpenAIChatCompletionsClient was
wrong — it forced type mismatches (OpenAIChatResponse vs dict) that
required override annotations.

Now inherits Client[AsyncOpenAI, list[RendererMessage], dict, ToolSpec]
with its own to_native_prompt that converts verifiers Pydantic messages
to renderer TypedDicts cleanly. No casts, no type: ignore overrides.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hallerite hallerite marked this pull request as ready for review April 14, 2026 17:19
RendererClient now uses a shared RendererPool (32 slots by default)
that offloads render_ids() and parse_response() to threads via
asyncio.to_thread(). HuggingFace fast tokenizers release the GIL
during Rust encoding, so concurrent rollouts tokenize in parallel
instead of serializing on the event loop.

Benchmarks on 30-core EPYC with 22K-token conversations:
  N=8:  164ms → 46ms  (3.6x)
  N=16: 330ms → 103ms (3.2x)
  N=32: 659ms → 196ms (3.4x)

When a single Renderer is passed (tests, simple usage), the original
non-threaded path is preserved.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hallerite and others added 9 commits April 28, 2026 23:34
`_encode_tools_typescript` filtered with `tool.get("type") != "function"`
which silently dropped every flat `ToolSpec` (the TypedDict in
`renderers.base`: `{name, description, parameters}` with no `type` key).
Production callers pass `ToolSpec`; tests happen to use the OpenAI
envelope format `{"type":"function","function":{...}}`, which is why
the regression slipped through.

Now accept both shapes: unwrap `tool["function"]` when the envelope is
present, otherwise treat the dict as a flat ToolSpec. Non-function
envelope types (e.g. `"_plugin"`) are still skipped.
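
The fix, roughly (hypothetical helper name; logic as described above):

```python
# Rough sketch of the both-shapes handling described above (hypothetical helper
# name): unwrap the OpenAI {"type": "function", "function": {...}} envelope when
# present, otherwise treat the dict as a flat ToolSpec; skip non-function
# envelope types.
def normalize_tool(tool: dict) -> dict | None:
    if "type" in tool:
        if tool["type"] != "function":
            return None  # e.g. "_plugin": still skipped
        return tool["function"]  # unwrap the OpenAI envelope
    return tool  # already a flat {name, description, parameters} ToolSpec
```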

Caught by Cursor Bugbot in PR #1068 thread r3151243707.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vLLM serves gpt-oss via the `openai-harmony` reference encoder, which
is also what the model was trained on. The previous hand-rolled
~550-line implementation only covered a subset of the harmony spec
(no system preamble, no auto channel-routing line, no canonical
``<|return|>`` for terminal turns, partial TS schema rendering for
tools), and HF's chat_template.jinja diverges from harmony in small
ways anyway.

Approach: thin adapter over `openai-harmony`. Per-message
`enc.render(m)` produces token streams that concatenate byte-identical
to `enc.render_conversation` (verified empirically), so we get
per-token attribution for free — emit `enc.render(m)` for each caller
message and tag tokens with the caller index. The system+developer
prefix needs `enc.render_conversation` (not per-message) because
harmony injects a channel-routing line into SystemContent based on
conversation-level info; per-message rendering doesn't see that.

Caller messages map to harmony as:
- first `system` → `DeveloperContent.with_instructions(content)`
- `user` → `Role.USER`
- `assistant` final-channel for text + commentary-channel
  recipient=`functions.<name>` per tool_call
- `tool` → `Role.TOOL` with `name=functions.<name>`,
  recipient=`assistant`, channel=`commentary`
- historical `reasoning_content` is dropped (matches harmony's
  `render_conversation` behaviour — analysis-channel messages are
  stripped from rendered history; reasoning is per-turn only)

Last assistant final-channel close gets patched from `<|end|>` to
`<|return|>` to match `render_conversation_for_training`.

Tests:
- `packages/renderers/tests/test_gpt_oss_harmony_parity.py`: 7 new
  parity tests against `enc.render_conversation_for_training` —
  no-system, system+user, terminal-assistant `<|return|>`, tools
  layout, full tool-call+result cycle, reasoning-content stripping,
  generation prompt scaffolding.
- conftest matrix: `openai/gpt-oss-20b` added; an autouse fixture
  skips it for HF-parity test files (`test_render_ids`,
  `test_build_helpers`, `test_parse_response*`) since the renderer
  intentionally matches harmony, not HF Jinja. Filed under conftest
  so a single skip rule covers all four files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`OpenAIChatCompletionsTokenClient` and its tests were dropped on the
renderers branch in favour of the new `RendererClient`. Restore them
so both paths coexist on this branch — callers can opt into either
via `ClientConfig.client_type`:

- `"renderer"` → renderers-v2 path (client-side tokenization +
  `/v1/generate`)
- `"openai_chat_completions_token"` → server-side token-aware chat
  (`/v1/chat/completions/tokens`)

Files restored verbatim from origin/main:
- verifiers/clients/openai_chat_completions_token_client.py
- tests/test_openai_chat_completions_token_client.py

Plumbing:
- verifiers/clients/__init__.py: re-add the import, the
  `resolve_client` dispatch case, and the `__all__` entry.
- verifiers/types.py: re-add `"openai_chat_completions_token"` to the
  `ClientType` Literal.

The server-side endpoint (`serving_chat_with_tokens.py`) lives in
prime-rl and will need a matching restoration on that side before this
client can be exercised end-to-end on the renderers branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`auto_system_injected` (line 160) was computed but never read; the
actual auto-injection bookkeeping uses `auto_system_idx`. The unused
loop variable `name` in `_render_assistant`'s tool_calls block was
dead since the K2 template emits arguments only — function names live
in the `tool_call['id']` field.

CI fix only — no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The logger init and the bridge-metrics module-level helpers were
declared between two import groups, which made ruff flag every import
below them as E402 (module-level import not at top of file). Move
them after the imports — same code, just relocated — so ruff is happy.

CI fix only — no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three classes of pre-existing test failures on the renderers branch
(visible in CI for weeks); none affect production behavior, all
unblock the test job from going green.

1. ``test_renderer_client_honors_configured_renderer_name`` and
   ``test_renderer_client_uses_renderer_model_name_override``:
   ``_get_renderer_or_pool`` now passes ``tool_parser=None,
   reasoning_parser=None`` to ``create_renderer``. Update the
   ``assert_called_once_with`` mocks to match.

2. ``test_get_incremental_prompt_ids_*`` (3 tests with the
   ``_BridgeRenderer`` fake): the fake had only ``render_ids`` and
   ``parse_response`` from the old diff-based bridge protocol —
   ``_get_incremental_prompt_ids`` now calls
   ``renderer.bridge_to_next_turn``. Add a ``bridge_to_next_turn``
   method that returns ``prev_prompt + prev_completion + trailing +
   extension``, mimicking what a real bridge stitches together. Track
   bridge calls separately so the "without re-rendering completion"
   test can assert that ``render_ids`` is NOT called and the bridge
   path is taken.

3. Two parametrize cases that test features that were prototyped but
   not merged into the renderers package — strict xfail so they
   auto-flip to xpass once the feature lands:
   - ``test_get_incremental_prompt_ids_bridges_over_truncated_step
     [Qwen/Qwen2.5-0.5B-Instruct]``: DefaultRenderer's
     ``bridge_to_next_turn`` always returns None.
     ``synthesize_close_on_truncation`` for unknown templates was
     prototyped in site-packages but never merged.
   - ``test_extension_break_emits_diagnostic_log``: the bridge no
     longer surfaces a break for Qwen3.5's strip-thinking-from-history
     pattern, so the diagnostic log never fires for this scenario.
     Needs a different repro (e.g. trajectory with empty
     ``completion_ids`` or an invalid tail) to exercise the log path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… tests

CI runs ``ruff format --check`` in addition to ``ruff check``. The
format job was failing on 17 files — purely formatting-only changes
that were never normalised after recent edits to:

- packages/renderers/renderers/ (client, glm45, glm5, gpt_oss,
  kimi_k25, minimax_m2, parsers, parsing)
- packages/renderers/tests/ (test_bridge, test_client,
  test_gpt_oss_harmony_parity, test_incremental, test_parsers,
  test_roundtrip)
- tests/test_renderer_client.py, tests/test_renderer_e2e.py
- verifiers/clients/renderer_client.py

Ran ``uv run ruff format``; both ``ruff format --check`` and ``ruff
check`` are now clean. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…erClient

When TITO was originally dropped on the renderers branch, both docs
mentioning ``openai_chat_completions_token`` were rewritten to point
at the new ``renderer`` client. Now that TITO is restored (commit
``748d03e0``) and lives next to ``RendererClient``, the docs need to
list both:

- ``docs/evaluation.md``: extend the ``--api-client-type`` flag's
  enumerated list to include both ``openai_chat_completions_token``
  and ``renderer``.
- ``docs/reference.md``:
  - re-add ``"openai_chat_completions_token"`` to the ``ClientType``
    Literal block (matches ``verifiers/types.py``).
  - re-add the ``OpenAIChatCompletionsTokenClient`` row to the
    Built-in Client Implementations table, with a one-line note
    distinguishing it from the renderer client (server-side
    templating + ``/v1/chat/completions/tokens`` route vs client-side
    tokenization through the ``renderers`` package).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sample

The function returns ``(token_ids, loss_mask)`` for any caller-defined
masking policy — its only specifically-supervised aspect was the name.
``build_training_sample`` better reflects the canonical
``(ids, mask)`` builder used across both SFT and RL training paths,
matches the wording in trainer code/configs, and reads cleaner
alongside ``build_trajectory_step``.

The previous name is dropped without a deprecation alias because
``renderers`` isn't released yet — no external users to break.

Touches:
- packages/renderers/renderers/base.py (def)
- packages/renderers/renderers/__init__.py (import + __all__)
- packages/renderers/tests/test_build_helpers.py
  (module docstring, import, 2 test names, 2 call sites)
- 4 docstring/comment mentions in default.py, kimi_k2.py,
  nemotron3.py, test_message_indices.py

prime-rl has no callers (verified via grep).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hallerite and others added 3 commits April 29, 2026 04:26
- openai_chat_completions_token_client: tokenize() default
  extra_kwargs={} is a mutable-default footgun; use None and lazily
  initialize.
- environment / types: restore start_timer (perf_counter) for
  monotonic elapsed-ms; the consolidation onto start_time (time.time)
  could produce negative deltas on NTP step.
- docs/reference: ClientType list was missing nemorl_chat_completions;
  add it and the corresponding row in the Built-in Clients table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The DEBUG-only diagnostic in _log_extension_break (and the
last_reason / last_detail tracking around it in
_get_incremental_prompt_ids) was reaching into renderer._tokenizer
to decode token windows on bridge failure, which was an
encapsulation violation through a private attr. The diagnostic
was xfailing in tests and not exercised on main; better
diagnostics can be reintroduced once the design is clearer.

Removes ~270 lines: the helper, the dedicated logger, the per-step
category tracking, and the xfailing observability test in
test_renderer_e2e.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two cleanup passes folded into one commit since they share files:

* Drop synthesize_close_on_truncation: every hand-coded renderer
  hardcoded True at the class level and DefaultRenderer's
  bridge_to_next_turn returns None unconditionally regardless of
  the flag, so the runtime knob never did anything user-visible.
  Removed from the Renderer Protocol, both factories
  (create_renderer, create_renderer_pool), DefaultRenderer.__init__,
  the misleading "ignoring for renderer=X" log, and from every
  hand-coded renderer's bridge (the
  `synthesize_close=(self._x if self.synthesize_close_on_truncation
  else None)` ternary becomes a direct `synthesize_close=self._x`).
  Tests updated: test_bridge drops the opt-out branch,
  test_incremental drops the obsolete opt-out test, test_renderer_client
  loses its `synth_close` parameter and the now-impossible
  Qwen2.5/default xfail entry.

* Strip multimodal: the renderer pipeline can't carry image bytes
  through /v1/generate, and the config validator already routes
  VLMs to MITO. Removed ImagePart from the type union (and the
  package's public exports), all multimodal handling in
  Qwen3VLRenderer (~250 lines: image loaders, processor wiring,
  multimodal-content branches in render/render_ids/bridge),
  KimiK25Renderer's _emit_image + media tokens, and the image/video
  branches in Nemotron3 / Qwen35 content rendering. Passing image
  content to any renderer now raises ValueError. README §6 updated.
  Multimodal tests in test_render_ids dropped (kept the auto-routing
  smoke test for Qwen3-VL).

Net: +45 / -644 lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread verifiers/types.py

client_idx: int = 0
client_type: ClientType = "openai_chat_completions"
renderer: str = "auto"

hm I'm wondering if we should bundle those somehow? so like a discriminated union of client types with some shared args but mostly disjoint

### What we gain

- **RL correctness.** A prompt/completion split we control, which is exactly what `bridge_to_next_turn` relies on to keep rollouts from fragmenting under truncation or re-tokenization.
- **Testable parity.** Per-model renderers are plain Python. We can render the same conversation through the renderer and through HF's `apply_chat_template` and assert token-level parity. Every edge case (empty thinking, multiple tool calls, truncated turns) becomes a unit test instead of undefined behavior buried inside Jinja.

LFGGG


- **RL correctness.** A prompt/completion split we control, which is exactly what `bridge_to_next_turn` relies on to keep rollouts from fragmenting under truncation or re-tokenization.
- **Testable parity.** Per-model renderers are plain Python. We can render the same conversation through the renderer and through HF's `apply_chat_template` and assert token-level parity. Every edge case (empty thinking, multiple tool calls, truncated turns) becomes a unit test instead of undefined behavior buried inside Jinja.
- **Escape hatch.** Anything without a hand-coded renderer falls back to `DefaultRenderer` (a generic `apply_chat_template` wrapper), which mirrors the previous TITO path.

what sort of guarantees, if any, can we give for this fallback? what are the pros and cons compared to falling back to MITO?


yeah, let me be more clear about this

Comment thread packages/renderers/README.md

### Per-renderer bridges

Each hand-coded renderer implements `bridge_to_next_turn` directly for its model's chat template — no shared generic helper, just Python that knows what tokens the template would insert between turns. Qwen3's bridge knows about `<|im_start|>role\n … <|im_end|>\n`; GLM's bridge knows that turns end when the next role marker appears; DeepSeek V3, Kimi K2/K2.5, Nemotron-3, GPT-OSS, MiniMax each have their own. On a clean stop, vLLM's `completion_ids` already includes the template's close token; on truncation, the renderer synthesizes the canonical close (`<|im_end|>`, `<|endoftext|>`, or the equivalent for that model) so the extension invariant still holds, and the synthetic close is masked out of the loss because the model didn't produce it.

legendary

hallerite and others added 3 commits April 29, 2026 05:05
* docs/evaluation.md: --api-client-type list was missing
  nemorl_chat_completions and ordered inconsistently with
  verifiers/types.py; align with the source of truth.
* base.py / renderer_client.py: simplify factory closures from
  default-arg-as-pseudo-closure (factory(_name=…, _model=…, …)) to
  plain factory() -> Renderer; the captured locals are stable for
  the function's lifetime, no late-binding footgun.
* clients: drop redundant maybe_normalize_messages calls from
  OpenAIChatCompletionsClient.to_native_prompt and
  RendererClient.to_native_prompt. PR #1027 explicitly centralized
  message normalization in the env loop and removed downstream
  copies; we re-introduced the redundancy by accident.
* renderer_client.py: collapse _get_renderer_or_pool's inline factory
  + RendererPool() construction into a direct create_renderer_pool()
  call (single source of truth for pool construction lives in
  packages/renderers/renderers/base.py).
* README: switch the create_renderer / create_renderer_pool import
  examples to the top-level package path.

Tests in test_renderer_client.py updated to patch
create_renderer_pool instead of the now-uncalled create_renderer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The renderer ClientType is RL-specific (token preservation across
turns, multi-turn extension invariant via bridge_to_next_turn,
truncation-safe close synthesis), but neither training.md nor
faqs.md mentioned it.

* training.md: add an "Inference Client Types" subsection under
  "RL Rules of Thumb" that contrasts MITO / TITO / renderer and
  recommends renderer (or TITO as fallback) for RL workloads.
  Also updates "Non-Increasing Chat Templates" to note that the
  renderer client sidesteps the Qwen3/DeepSeek <think>-stripping
  issue by tokenizing client-side.
* faqs.md: add a Training FAQ "Which client_type should I use for
  RL training?" with the same three-way breakdown and a link to
  training.md for details.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The renderer client type ships hand-coded support for only a subset
of models and corner cases are still being shaken out. Production
RL workloads should use openai_chat_completions_token (TITO) — it's
the tried-and-tested path with broad coverage. Try renderer when
you want the stronger token-preservation guarantees and your model
has a hand-coded renderer.

* training.md / faqs.md: tag renderer with *(experimental)* and flip
  the recommendation to TITO-first.
* renderers/README.md: add an explicit "Status: experimental" callout
  in the intro; drop the misleading "Replaces the old TITO client"
  line — TITO and renderer ship side-by-side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hallerite and others added 3 commits April 29, 2026 05:28
…ients

Mika #7 — the renderer client had 15 module-level helpers as a
trailer block after the class, while every other client
(openai_chat_completions, openai_chat_completions_token, etc.)
co-locates helpers above the class. Pure positional move; helpers
stay module-level (they're pure functions tested directly via
import). Nothing extracted to clients/utils since
_normalize_for_comparison has renderer-specific semantics
(tool_call.arguments JSON-decode, None-filtering) that don't match
the TITO version.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two trailing blank lines left over from removing the diagnostic
test in 9afd448. ruff format --check caught it; ruff check passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two bugbot findings on PR #1068:

1. ``build_training_sample`` accepted a
   ``collapse_consecutive_tool_messages`` parameter that was never
   referenced in the body. No actual caller (verified across verifiers
   + prime-rl) — prime-rl SFT uses its own ``build_incremental_token_mask``
   which has its own collapse implementation. Drop the dead parameter.

2. ``KimiK25Renderer`` silently dropped ``role="tool_declare"`` messages
   in the input list with ``if role == "tool_declare": continue``,
   regardless of whether ``tools=`` was passed. The K2.5 chat template
   actually iterates every message — tool_declare included — through
   ``set_roles`` + ``render_content``, emitting
   ``<|im_system|>tool_declare<|im_middle|>{content}<|im_end|>``. The
   ``tools=`` parameter is a separate path that fires once before the
   loop, not a deduplicating gate. Removing the early-skip lets
   tool_declare messages flow through the existing generic content
   handler, matching the template exactly.

Adds a regression test
``test_kimi_k25_tool_declare_message_without_tools_param`` verifying
parity against ``apply_chat_template`` for this case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).


Reviewed by Cursor Bugbot for commit e2ede3f.

…ink>

KimiK2Renderer was calling _extract_thinking on every assistant turn,
which split inline <think>...</think> out of content and then discarded
the extracted reasoning. Result: any caller passing content like
"<think>secret</think>visible" got "visible" emitted instead of the
verbatim string, disagreeing with apply_chat_template.

Kimi K2's chat template emits ``message.content`` verbatim — there is
no reasoning_content support, no inline-tag stripping. The separate
reasoning_content field is just dropped (the template never reads it).

Drop _extract_thinking entirely (single caller) and emit content
directly. Adds test_kimi_k2_inline_think_tags_render_verbatim which
asserts parity against apply_chat_template for the bugbot-flagged case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hallerite hallerite requested a review from willccbb April 29, 2026 01:39
completions_request was building its return dict without threading
through id, model, or created from vLLM's /generate response — so
RendererClient.from_native_response always fell back to its defaults
(id="", created=0, model=""), and downstream Response objects had
empty metadata even when vLLM populated those fields.

Pass the three fields through with safe defaults (empty string / 0)
so callers using them for logging, request correlation, or model
attribution see real values. Adds
test_from_native_response_propagates_id_model_created as a regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hallerite hallerite merged commit b7dd85f into main Apr 29, 2026
7 of 8 checks passed
