Two override knobs on ``Renderer.render`` / ``render_ids``:

- ``preserve_thinking_between_tool_calls``: when a past assistant message has ``tool_calls`` and is immediately followed by a ``tool`` result, emit its ``reasoning_content`` even when the chat template default would drop it. Captures the in-flight asst→tool→asst→tool chain where the trace is load-bearing for the next call.
- ``preserve_all_thinking``: emit ``reasoning_content`` for every past-assistant turn. Strict superset of the above.

Both flags are *override* knobs — they only restore thinking the template would drop, never strip thinking it emits. Defaults are a no-op: rendered output stays byte-identical to ``apply_chat_template``, which the existing 784-test parity suite confirms.

Per-renderer impact:

- Qwen3 / Qwen3.5 / Qwen3.6 / GLM-4.5 / GLM-5 / GLM-5.1 / GLM-4.7-Flash / MiniMax-M2 / Nemotron-3 / Kimi-K2.5 / Kimi-K2.6: drop conditions extended to honour the override.
- Qwen3-VL / Kimi-K2 / DeepSeek-V3: templates already preserve thinking (or have no place to re-emit it). Flags accepted as no-ops with comments — Protocol parity only.
- gpt-oss: harmony's analysis-channel message is emitted on past-asst when the override fires (default still drops, matching trained history-style rendering).
- DefaultRenderer (``apply_chat_template`` fallback): raises ``NotImplementedError`` if either flag is set. Selective re-emit isn't possible without knowing the template's structure.

Tests: ``tests/test_preserve_thinking.py`` covers default-unchanged, strict-subset growth (default ≤ between ≤ all), and DefaultRenderer raise behaviour across the full conftest matrix.

Bump version 0.1.6 → 0.1.7.
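The override semantics above can be sketched on a toy renderer. This is illustrative only: the ``render`` signature and flag names mirror the PR, but the output format and the "template default drops past thinking" rule are stand-ins, not the library's actual rendering.

```python
def render(messages, *, preserve_thinking_between_tool_calls=False,
           preserve_all_thinking=False):
    """Toy renderer whose template default drops reasoning_content on every
    past assistant turn; the flags only *restore* thinking, never strip it."""
    lines = []
    last = len(messages) - 1
    for i, msg in enumerate(messages):
        keep = (
            i == last                      # trailing assistant keeps thinking
            or preserve_all_thinking       # restore on every past asst
            or (preserve_thinking_between_tool_calls
                and msg.get("tool_calls")  # asst that issued a tool call...
                and i < last
                and messages[i + 1]["role"] == "tool")  # ...answered next
        )
        if msg["role"] == "assistant" and msg.get("reasoning_content") and keep:
            lines.append(f"<think>{msg['reasoning_content']}</think>")
        lines.append(f"{msg['role']}: {msg.get('content', '')}")
    return "\n".join(lines)
```

With a ``user → asst(tool_calls) → tool → asst`` conversation, the default output carries only the trailing assistant's thinking, ``preserve_thinking_between_tool_calls`` restores the in-flight asst's trace as well, and ``preserve_all_thinking`` is a strict superset of both.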
Previously the classifier only fired on an asst with ``tool_calls`` whose immediate next message was a ``tool`` — that misses the post-tool-result asst that closes the same loop. The intent of "between tool calls" is the whole asst→tool→asst→...→tool→asst chain between two user turns.

New rule: an asst at index ``i`` is in a tool-cycle segment if the user-bounded segment containing ``i`` (left/right bounds being the adjacent ``user`` messages, or start/end of conversation) contains at least one ``tool`` message. Once a new ``user`` arrives, the cycle closes and the template default resumes.

In a ``S-U-A-T-A-U-A-T-A`` shape with reasoning on every asst, this flag now keeps thinking on A2, A4, A6, A8 (whereas the previous rule only kept A2 and the default-template-kept A6/A8). For a mixed shape like ``S-U-A-U-A-T-A-U-A`` the flag preserves only the middle pair (A4, A6 — the ones inside the tool-bearing segment) and stays strictly weaker than ``preserve_all_thinking``.
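The user-bounded-segment rule above can be sketched directly (a standalone helper; the real classifier lives in the package and may differ in name and shape):

```python
def in_tool_cycle_segment(messages, i):
    """True if assistant index i sits in a user-bounded segment that
    contains at least one tool message (the asst→tool→…→asst chain)."""
    # Left bound: just past the nearest user before i, else conversation start.
    lo = 0
    for j in range(i - 1, -1, -1):
        if messages[j]["role"] == "user":
            lo = j + 1
            break
    # Right bound: the nearest user after i, else conversation end.
    hi = len(messages)
    for j in range(i + 1, len(messages)):
        if messages[j]["role"] == "user":
            hi = j
            break
    return any(m["role"] == "tool" for m in messages[lo:hi])
```

On ``S-U-A-T-A-U-A-T-A`` this marks all four assistants; on ``S-U-A-U-A-T-A-U-A`` it marks only the middle pair inside the tool-bearing segment, matching the shapes worked through above.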
Previous fix was too aggressive — it preserved thinking for any asst in *any* user-bounded segment that contained a tool message, including older segments behind a subsequent ``user`` turn. The intent of the flag is narrower: keep thinking inside the *current* A-T-...-A loop (after the most recent ``user``). Once a new ``user`` arrives, the previous tool block becomes "older" and its thinking is dropped (template default). That matches how chat templates already treat multi-turn contexts and avoids stuffing the prompt with stale reasoning the model wasn't trained to attend over user-turn boundaries. ``preserve_all_thinking`` is the escape hatch for callers who do want older blocks preserved too.

New rule: ``msg_idx > last_user_idx`` AND the current segment has at least one ``tool`` message. Validated end-to-end against Qwen3 on ``S-U-A-T-A-U-A-T-A`` and ``S-U-A-T-A`` — older block drops, current block keeps, classification helper unit-tested for both shapes.

Adds ``test_preserve_btc_on_live_cycle_matches_all`` per renderer: in a live tool cycle (no trailing user), btc and all should produce identical token sequences since they preserve the same set of asst messages.
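The narrowed rule is small enough to sketch as a standalone predicate (names are illustrative; the real helper in the package may differ):

```python
def preserve_in_live_cycle(messages, i):
    """Narrowed rule: only assistants after the most recent user turn,
    and only when that trailing segment contains a tool message."""
    last_user = max(
        (j for j, m in enumerate(messages) if m["role"] == "user"),
        default=-1,
    )
    if i <= last_user:
        return False  # older block: template default applies again
    return any(m["role"] == "tool" for m in messages[last_user + 1:])
```

On ``S-U-A-T-A-U-A-T-A`` the older tool block (A2, A4) drops while the current one (A6, A8) keeps; on ``S-U-A-T-A`` the whole live cycle keeps, matching the two validated shapes.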
Per-renderer test that walks the full conftest matrix on a ``S-U-A-T-A-U-A-T-A`` conversation with unique sentinel strings (``REASON-A2``..``REASON-A8``) in each asst's ``reasoning_content``, then greps the decoded output to verify which sentinels are visible.

- ``preserve_all_thinking=True``: every sentinel must appear (or none, for never-preserves renderers like Kimi-K2 / Qwen3-VL — flags are Protocol-only no-ops there).
- ``preserve_thinking_between_tool_calls=True``: the current-block sentinels (A6, A8) must always appear; older-block sentinels (A2, A4) follow template default and aren't asserted universally.

Also drops the manual ``Bridge interaction`` mismatch concern: each of the 14 active renderers now has end-to-end visibility coverage rather than just the strict-subset token-count invariant.
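The sentinel-grepping setup can be sketched as two small helpers (the real test file builds the conversation via the conftest matrix; these stand-ins only show the conversation shape and the visibility check):

```python
import re

def build_cycle_convo():
    """S-U-A-T-A-U-A-T-A with a unique sentinel per assistant turn."""
    roles = ["system", "user", "assistant", "tool", "assistant",
             "user", "assistant", "tool", "assistant"]
    msgs = []
    for idx, role in enumerate(roles):
        m = {"role": role, "content": f"{role}-{idx}"}
        if role == "assistant":
            m["reasoning_content"] = f"REASON-A{idx}"  # sentinel to grep for
        msgs.append(m)
    return msgs

def visible_sentinels(rendered):
    """Which REASON-A* sentinels survived into the decoded output."""
    return set(re.findall(r"REASON-A\d+", rendered))
```

A per-renderer assertion then reduces to comparing ``visible_sentinels(decoded)`` against the expected set for each flag combination — e.g. ``{"REASON-A6", "REASON-A8"}`` as the always-present subset under ``preserve_thinking_between_tool_calls=True``.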
This was referenced May 7, 2026
hallerite added a commit that referenced this pull request on May 7, 2026.
The latest renderers release published on PyPI is 0.1.5 — the in-tree bumps across #3 → #5 → this PR (0.1.6 → 0.1.7 → 0.1.8 → 0.1.9 → 0.1.10) accumulated without ever being published. Roll forward to 0.1.6 once for the actual release; tag ``renderers-v0.1.6`` post-merge to fire the publish workflow (.github/workflows/publish.yml asserts the tag's ``-vX.Y.Z`` suffix matches pyproject.toml). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
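The publish gate described above (tag suffix must match the pyproject.toml version) can be mirrored as a small check — an assumption about the workflow's assertion shape, not its literal implementation:

```python
import re

def tag_matches_pyproject(tag: str, pyproject_version: str) -> bool:
    """True iff the tag ends in -vX.Y.Z and that version equals the one
    declared in pyproject.toml (the publish workflow's precondition)."""
    m = re.fullmatch(r".+-v(\d+\.\d+\.\d+)", tag)
    return bool(m) and m.group(1) == pyproject_version
```

Under this sketch, ``renderers-v0.1.6`` publishes only when pyproject.toml also says ``0.1.6``; a stale in-tree bump like ``0.1.10`` would fail the gate.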
hallerite added a commit that referenced this pull request on May 7, 2026.
…preserve_thinking) (#6)

* feat: drop legacy chat-template-pass-through kwargs

``GLM5Renderer.__init__`` no longer accepts ``clear_thinking``, and ``Qwen36Renderer.__init__`` no longer accepts ``preserve_thinking``. Both were chat-template-kwarg pass-throughs whose default values are now baked straight into the renderer:

- GLM-5 default: drop ``<think>`` on past-cycle assistants, keep on in-flight cycle and trailing assistant. (What ``clear_thinking=True`` did before.) The render gate is now ``msg_idx > last_user_index or preserve_thinking`` — clean and consistent with the rest of the family.
- Qwen3.6 default: identical to Qwen3.5 (``msg_idx > last_query_index`` via the inherited ``_should_render_thinking``). The class shrinks to just the ``_render_arg_value`` override, which is the only real template delta vs Qwen3.5.

Callers who want the toggled-on behaviour pass ``preserve_all_thinking=True`` to ``create_renderer`` — the renderer-agnostic spelling of the same intent landed in #3/#5. Two regression tests pin the dropped kwargs (the constructor must raise ``TypeError`` when either is passed).

The README's §1 "Preserving past reasoning" subsection (added in #5) already motivates the override flags; no new prose is needed here. §3.d is unchanged (still cross-links to the override flag for compaction). Bump 0.1.9 → 0.1.10. 895 tests pass locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: consolidate version to 0.1.6 (PyPI-aligned)

PyPI's latest published renderers is 0.1.5 — the in-tree bumps in #3 → #5 → this PR (0.1.6 → 0.1.7 → 0.1.8 → 0.1.9 → 0.1.10) accumulated without ever being published. Roll forward to 0.1.6 once for the actual release; tag ``renderers-v0.1.6`` post-merge to fire the publish workflow (.github/workflows/publish.yml asserts the tag's ``-vX.Y.Z`` suffix matches pyproject.toml).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
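The ``TypeError`` regression pin can be sketched with stand-in classes — the real ``GLM5Renderer`` / ``Qwen36Renderer`` live in the package; these stubs only illustrate that a dropped kwarg now fails at construction:

```python
class GLM5Renderer:
    """Stand-in: the real constructor no longer accepts clear_thinking."""
    def __init__(self):
        pass

class Qwen36Renderer:
    """Stand-in: the real constructor no longer accepts preserve_thinking."""
    def __init__(self):
        pass

def rejects_legacy_kwarg(cls, **kwargs):
    """True iff constructing cls with the given kwargs raises TypeError."""
    try:
        cls(**kwargs)
    except TypeError:
        return True
    return False
```

In the real suite the same shape would be written with ``pytest.raises(TypeError)`` around each constructor call.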
Summary

Adds two override flags to ``Renderer.render`` / ``render_ids``:

- ``preserve_thinking_between_tool_calls`` — when a past assistant message has ``tool_calls`` and is immediately followed by a ``tool`` result, emit its ``reasoning_content`` even when the chat template default would drop it. Captures the in-flight asst→tool→asst→tool chain where the trace is load-bearing for the next call.
- ``preserve_all_thinking`` — emit ``reasoning_content`` for every past-assistant turn. Strict superset of the above.

These are override knobs only: they restore thinking the template would drop, never strip thinking it emits. Defaults are byte-identical to ``apply_chat_template`` (the existing 784-test parity suite confirms).

Per-renderer impact

- ``msg_idx > last_query_index`` gate: extended with the override
- ``msg_idx > last_user_index`` gate: extended with the override
- ``conv_idx > last_user_index`` gate: extended with the override
- ``is_last_turn`` gate: extended with the override
- ``is_suffix`` / always-emitted ``reasoning_content`` renderers: templates already preserve thinking, flags accepted as no-ops
- DefaultRenderer: ``NotImplementedError`` — the apply_chat_template fallback can't selectively re-emit

Tests

``tests/test_preserve_thinking.py`` covers:

- the ``should_preserve_past_thinking`` helper
- ``len(default) ≤ len(between_tool_calls) ≤ len(all)`` (matrix)

48 new tests pass; 2 cleanly skipped on Qwen2.5 (uses DefaultRenderer, exercised separately).

Version

Bumps 0.1.6 → 0.1.7. Defaults are unchanged, so there is no breakage for existing callers.