
feat: preserve_*_thinking override flags#3

Merged
hallerite merged 4 commits into main from feat/preserve-thinking-flags
May 6, 2026

Conversation

@hallerite
Member

Summary

Adds two override flags to Renderer.render / render_ids:

  • preserve_thinking_between_tool_calls — when a past assistant message has tool_calls and is immediately followed by a tool result, emit its reasoning_content even when the chat template default would drop it. Captures the in-flight asst→tool→asst→tool chain where the trace is load-bearing for the next call.
  • preserve_all_thinking — emit reasoning_content for every past-assistant turn. Strict superset of the above.

These are override knobs only: they restore thinking the template would drop, never strip thinking it emits. Defaults are byte-identical to apply_chat_template (existing 784-test parity suite confirms).
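The "restore, never strip" semantics can be sketched as a small truth table (a minimal illustrative sketch; `emit_reasoning` and its parameters are hypothetical names, not the library's API):

```python
def emit_reasoning(template_would_emit: bool,
                   in_tool_cycle: bool,
                   preserve_thinking_between_tool_calls: bool = False,
                   preserve_all_thinking: bool = False) -> bool:
    """Override-only semantics: the flags can restore reasoning the
    chat-template default would drop, but never strip reasoning the
    template already emits."""
    if template_would_emit:
        return True  # never strip what the template emits
    if preserve_all_thinking:
        return True  # strict superset: every past-assistant turn
    # otherwise only the in-flight asst -> tool -> asst chain qualifies
    return preserve_thinking_between_tool_calls and in_tool_cycle
```

With both flags off, every branch falls through to the template's own decision, which is why defaults stay byte-identical.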

Per-renderer impact

| Renderer | Behaviour |
| --- | --- |
| Qwen3, Qwen3.5, Qwen3.6 | Drop condition `msg_idx > last_query_index` extended with override |
| GLM-4.5, GLM-5, GLM-5.1, GLM-4.7-Flash | `msg_idx > last_user_index` extended with override |
| MiniMax-M2 | `conv_idx > last_user_index` extended with override |
| Nemotron-3 | `is_last_turn` extended with override |
| Kimi-K2.5 / K2.6 | hist/suffix split extended on `is_suffix` |
| gpt-oss | analysis-channel message emitted on past-asst when override fires |
| Qwen3-VL | No-op — template doesn't emit thinking for past-asst (Protocol parity only) |
| Kimi-K2 | No-op — template never reads `reasoning_content` |
| DeepSeek-V3 | No-op — template already always preserves `reasoning_content` |
| DefaultRenderer | Raises `NotImplementedError` — `apply_chat_template` fallback can't selectively re-emit |

Tests

tests/test_preserve_thinking.py covers:

  • Classification correctness for should_preserve_past_thinking helper
  • Default render unchanged when flags are explicitly off (matrix)
  • Strict-subset growth: len(default) ≤ len(between_tool_calls) ≤ len(all) (matrix)
  • No-op renderers actually no-op
  • DefaultRenderer raises on either flag

48 new tests pass; 2 cleanly skipped on Qwen2.5 (uses DefaultRenderer, exercised separately).
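The strict-subset invariant above reads naturally as a reusable check (a sketch; `render_ids` here stands for any callable accepting the two flags as keyword arguments, not the library's exact signature):

```python
def check_subset_growth(render_ids) -> None:
    """Assert len(default) <= len(between_tool_calls) <= len(all):
    each flag may only add tokens to the rendering, never remove any."""
    default = render_ids()
    btc = render_ids(preserve_thinking_between_tool_calls=True)
    all_on = render_ids(preserve_all_thinking=True)
    assert len(default) <= len(btc), "between_tool_calls removed tokens"
    assert len(btc) <= len(all_on), "all_thinking removed tokens"
```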

Version

Bumps 0.1.6 → 0.1.7. Defaults are unchanged so no breakage for existing callers.

hallerite added 4 commits May 5, 2026 16:00
Two override knobs on ``Renderer.render`` / ``render_ids``:

- ``preserve_thinking_between_tool_calls``: when a past assistant
  message has ``tool_calls`` and is immediately followed by a ``tool``
  result, emit its ``reasoning_content`` even when the chat template
  default would drop it. Captures the in-flight asst→tool→asst→tool
  chain where the trace is load-bearing for the next call.
- ``preserve_all_thinking``: emit ``reasoning_content`` for every
  past-assistant turn. Strict superset of the above.

Both flags are *override* knobs — they only restore thinking the
template would drop, never strip thinking it emits. Defaults are a
no-op: rendered output stays byte-identical to ``apply_chat_template``,
which the existing 784-test parity suite confirms.

Per-renderer impact:

- Qwen3 / Qwen3.5 / Qwen3.6 / GLM-4.5 / GLM-5 / GLM-5.1 / GLM-4.7-Flash
  / MiniMax-M2 / Nemotron-3 / Kimi-K2.5 / Kimi-K2.6: drop conditions
  extended to honour the override.
- Qwen3-VL / Kimi-K2 / DeepSeek-V3: templates already preserve thinking
  (or have no place to re-emit it). Flags accepted as no-ops with
  comments — Protocol parity only.
- gpt-oss: harmony's analysis-channel message is emitted on past-asst
  when the override fires (default still drops, matching trained
  history-style rendering).
- DefaultRenderer (apply_chat_template fallback): raises
  ``NotImplementedError`` if either flag is set. Selective re-emit
  isn't possible without knowing the template's structure.

Tests: ``tests/test_preserve_thinking.py`` covers default-unchanged,
strict-subset growth (default ≤ between ≤ all), and DefaultRenderer
raise behaviour across the full conftest matrix.

Bump version 0.1.6 → 0.1.7.
Previously the classifier only fired on an asst with ``tool_calls``
whose immediate next message was a ``tool`` — that misses the
post-tool-result asst that closes the same loop. The intent of
"between tool calls" is the whole asst→tool→asst→...→tool→asst chain
between two user turns.

New rule: an asst at index ``i`` is in a tool-cycle segment if the
user-bounded segment containing ``i`` (left/right bounds being the
adjacent ``user`` messages, or start/end of conversation) contains at
least one ``tool`` message. Once a new ``user`` arrives, the cycle
closes and template default resumes.

In a ``S-U-A-T-A-U-A-T-A`` shape with reasoning on every asst, this
flag now keeps thinking on A2, A4, A6, A8 (whereas the previous rule
only kept A2 and the default-template-kept A6/A8). For a mixed shape
like ``S-U-A-U-A-T-A-U-A`` the flag preserves only the middle pair
(A4, A6 — the ones inside the tool-bearing segment) and stays
strictly weaker than ``preserve_all_thinking``.
Previous fix was too aggressive — it preserved thinking for any asst in
*any* user-bounded segment that contained a tool message, including
older segments behind a subsequent ``user`` turn.

The intent of the flag is narrower: keep thinking inside the *current*
A-T-...-A loop (after the most recent ``user``). Once a new ``user``
arrives, the previous tool block becomes "older" and its thinking is
dropped (template default). That matches how chat templates already
treat multi-turn contexts and avoids stuffing the prompt with stale
reasoning the model wasn't trained to attend over user-turn boundaries.
``preserve_all_thinking`` is the escape hatch for callers who do want
older blocks preserved too.

New rule: ``msg_idx > last_user_idx`` AND the current segment has at
least one ``tool`` message. Validated end-to-end against Qwen3 on
``S-U-A-T-A-U-A-T-A`` and ``S-U-A-T-A`` — older block drops, current
block keeps, classification helper unit-tested for both shapes.
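The final rule reduces to a small predicate. This is an illustrative reimplementation of the ``should_preserve_past_thinking`` idea described above, not the library's actual helper; message dicts with a ``role`` key are assumed:

```python
def in_current_tool_cycle(messages: list, msg_idx: int) -> bool:
    """True iff the assistant message at msg_idx sits after the most
    recent user turn AND that current segment contains at least one
    tool message -- the 'between tool calls' rule."""
    last_user_idx = max(
        (i for i, m in enumerate(messages) if m["role"] == "user"),
        default=-1,
    )
    if msg_idx <= last_user_idx:
        return False  # older segment: template default applies
    current_segment = messages[last_user_idx + 1:]
    return any(m["role"] == "tool" for m in current_segment)
```

On the ``S-U-A-T-A-U-A-T-A`` shape (0-indexed) this keeps A6 and A8 and drops A2/A4, matching the behaviour described above.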

Adds ``test_preserve_btc_on_live_cycle_matches_all`` per renderer:
in a live tool cycle (no trailing user), btc and all should produce
identical token sequences since they preserve the same set of asst
messages.
Per-renderer test that walks the full conftest matrix on a
``S-U-A-T-A-U-A-T-A`` conversation with unique sentinel strings
(``REASON-A2``..``REASON-A8``) in each asst's ``reasoning_content``,
then greps the decoded output to verify which sentinels are visible.

- ``preserve_all_thinking=True``: every sentinel must appear (or none,
  for never-preserves renderers like Kimi-K2 / Qwen3-VL — flags are
  Protocol-only no-ops there).
- ``preserve_thinking_between_tool_calls=True``: the current-block
  sentinels (A6, A8) must always appear; older-block sentinels (A2,
  A4) follow template default and aren't asserted universally.

Also drops the manual ``Bridge interaction`` mismatch concern: each of
the 14 active renderers now has end-to-end visibility coverage rather
than just the strict-subset token-count invariant.
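The sentinel grep itself is a one-liner (a sketch; the ``REASON-A*`` marker strings come from the test description above, everything else here is illustrative):

```python
def visible_sentinels(decoded_output: str, sentinels) -> set:
    """Return the subset of unique reasoning markers that survived
    rendering into the decoded prompt string."""
    return {s for s in sentinels if s in decoded_output}
```

Under ``preserve_thinking_between_tool_calls=True`` the assertion then checks that the current-block markers are a subset of the visible set, leaving the older-block sentinels unasserted.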
@hallerite hallerite marked this pull request as ready for review May 6, 2026 23:05
@hallerite hallerite merged commit d222cf3 into main May 6, 2026
6 checks passed
hallerite added a commit that referenced this pull request May 7, 2026
PyPI's latest published renderers release is 0.1.5 — the in-tree bumps in #3 → #5 → this PR (0.1.6 → 0.1.7 → 0.1.8 → 0.1.9 → 0.1.10) accumulated
against PyPI without ever publishing. Roll forward to 0.1.6 once for
the actual release; tag ``renderers-v0.1.6`` post-merge to fire the
publish workflow (.github/workflows/publish.yml asserts the tag's
``-vX.Y.Z`` suffix matches pyproject.toml).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hallerite added a commit that referenced this pull request May 7, 2026
…preserve_thinking) (#6)

* feat: drop legacy chat-template-pass-through kwargs

``GLM5Renderer.__init__`` no longer accepts ``clear_thinking``, and
``Qwen36Renderer.__init__`` no longer accepts ``preserve_thinking``.
Both were chat-template-kwarg pass-throughs whose default values are
now baked straight into the renderer:

- GLM-5 default: drop ``<think>`` on past-cycle assistants, keep on
  in-flight cycle and trailing assistant. (What ``clear_thinking=True``
  did before.) The render gate is now ``msg_idx > last_user_index or
  preserve_thinking`` — clean and consistent with the rest of the
  family.
- Qwen3.6 default: identical to Qwen3.5 (``msg_idx > last_query_index``
  via the inherited ``_should_render_thinking``). The class shrinks to
  just the ``_render_arg_value`` override, which is the only real
  template delta vs Qwen3.5.

Callers who want the toggled-on behaviour pass
``preserve_all_thinking=True`` to ``create_renderer`` — the
renderer-agnostic spelling of the same intent landed in #3/#5. Two
regression tests pin the dropped kwargs (constructor must raise
``TypeError`` when either is passed).
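The regression pin can be sketched with a toy stand-in (hypothetical class and helper; the real tests target the ``GLM5Renderer`` / ``Qwen36Renderer`` constructors):

```python
class RendererSketch:
    """Toy stand-in for a renderer whose __init__ dropped its legacy
    chat-template-kwarg pass-through (e.g. clear_thinking)."""
    def __init__(self, model_name: str):
        self.model_name = model_name

def rejects_kwarg(cls, **legacy_kwargs) -> bool:
    """True iff constructing cls with the legacy kwarg raises TypeError,
    i.e. the kwarg really was removed from the signature."""
    try:
        cls("glm-5", **legacy_kwargs)
        return False
    except TypeError:
        return True
```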

The README's §1 "Preserving past reasoning" subsection (added in #5)
already motivates the override flags; no new prose needed here. §3.d
is unchanged (still cross-links to the override flag for compaction).

Bump 0.1.9 → 0.1.10. 895 tests pass locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: consolidate version to 0.1.6 (PyPI-aligned)

PyPI's latest published renderers release is 0.1.5 — the in-tree bumps in #3 → #5 → this PR (0.1.6 → 0.1.7 → 0.1.8 → 0.1.9 → 0.1.10) accumulated
against PyPI without ever publishing. Roll forward to 0.1.6 once for
the actual release; tag ``renderers-v0.1.6`` post-merge to fire the
publish workflow (.github/workflows/publish.yml asserts the tag's
``-vX.Y.Z`` suffix matches pyproject.toml).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>