
feat: preserve_*_thinking override flags#3

Merged
hallerite merged 4 commits into main from feat/preserve-thinking-flags
May 6, 2026

Conversation

@hallerite
Member

Summary

Adds two override flags to Renderer.render / render_ids:

  • preserve_thinking_between_tool_calls — when a past assistant message has tool_calls and is immediately followed by a tool result, emit its reasoning_content even when the chat template default would drop it. Captures the in-flight asst→tool→asst→tool chain where the trace is load-bearing for the next call.
  • preserve_all_thinking — emit reasoning_content for every past-assistant turn. Strict superset of the above.

These are override knobs only: they restore thinking the template would drop, never strip thinking it emits. Defaults are byte-identical to apply_chat_template (existing 784-test parity suite confirms).
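The "restore, never strip" semantics can be sketched as a small truth table (a minimal illustrative sketch; `emit_reasoning` and its parameters are hypothetical names, not the library's API):

```python
def emit_reasoning(template_would_emit: bool,
                   in_tool_cycle: bool,
                   preserve_thinking_between_tool_calls: bool = False,
                   preserve_all_thinking: bool = False) -> bool:
    """Override-only semantics: the flags can restore reasoning the
    chat-template default would drop, but never strip reasoning the
    template already emits."""
    if template_would_emit:
        return True  # never strip what the template emits
    if preserve_all_thinking:
        return True  # strict superset: every past-assistant turn
    # otherwise only the in-flight asst -> tool -> asst chain qualifies
    return preserve_thinking_between_tool_calls and in_tool_cycle
```

With both flags off, every branch falls through to the template's own decision, which is why defaults stay byte-identical.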

Per-renderer impact

| Renderer | Behaviour |
| --- | --- |
| Qwen3, Qwen3.5, Qwen3.6 | Drop condition `msg_idx > last_query_index` extended with override |
| GLM-4.5, GLM-5, GLM-5.1, GLM-4.7-Flash | `msg_idx > last_user_index` extended with override |
| MiniMax-M2 | `conv_idx > last_user_index` extended with override |
| Nemotron-3 | `is_last_turn` extended with override |
| Kimi-K2.5 / K2.6 | hist/suffix split extended on `is_suffix` |
| gpt-oss | analysis-channel message emitted on past-asst when override fires |
| Qwen3-VL | No-op — template doesn't emit thinking for past-asst (Protocol parity only) |
| Kimi-K2 | No-op — template never reads `reasoning_content` |
| DeepSeek-V3 | No-op — template already always preserves `reasoning_content` |
| DefaultRenderer | Raises `NotImplementedError` — `apply_chat_template` fallback can't selectively re-emit |

Tests

tests/test_preserve_thinking.py covers:

  • Classification correctness for should_preserve_past_thinking helper
  • Default render unchanged when flags are explicitly off (matrix)
  • Strict-subset growth: len(default) ≤ len(between_tool_calls) ≤ len(all) (matrix)
  • No-op renderers actually no-op
  • DefaultRenderer raises on either flag

48 new tests pass; 2 cleanly skipped on Qwen2.5 (uses DefaultRenderer, exercised separately).
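The strict-subset invariant above reads naturally as a reusable check (a sketch; `render_ids` here stands for any callable accepting the two flags as keyword arguments, not the library's exact signature):

```python
def check_subset_growth(render_ids) -> None:
    """Assert len(default) <= len(between_tool_calls) <= len(all):
    each flag may only add tokens to the rendering, never remove any."""
    default = render_ids()
    btc = render_ids(preserve_thinking_between_tool_calls=True)
    all_on = render_ids(preserve_all_thinking=True)
    assert len(default) <= len(btc), "between_tool_calls removed tokens"
    assert len(btc) <= len(all_on), "all_thinking removed tokens"
```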

Version

Bumps 0.1.6 → 0.1.7. Defaults are unchanged so no breakage for existing callers.

hallerite added 4 commits May 5, 2026 16:00
Two override knobs on ``Renderer.render`` / ``render_ids``:

- ``preserve_thinking_between_tool_calls``: when a past assistant
  message has ``tool_calls`` and is immediately followed by a ``tool``
  result, emit its ``reasoning_content`` even when the chat template
  default would drop it. Captures the in-flight asst→tool→asst→tool
  chain where the trace is load-bearing for the next call.
- ``preserve_all_thinking``: emit ``reasoning_content`` for every
  past-assistant turn. Strict superset of the above.

Both flags are *override* knobs — they only restore thinking the
template would drop, never strip thinking it emits. Defaults are a
no-op: rendered output stays byte-identical to ``apply_chat_template``,
which the existing 784-test parity suite confirms.

Per-renderer impact:

- Qwen3 / Qwen3.5 / Qwen3.6 / GLM-4.5 / GLM-5 / GLM-5.1 / GLM-4.7-Flash
  / MiniMax-M2 / Nemotron-3 / Kimi-K2.5 / Kimi-K2.6: drop conditions
  extended to honour the override.
- Qwen3-VL / Kimi-K2 / DeepSeek-V3: templates already preserve thinking
  (or have no place to re-emit it). Flags accepted as no-ops with
  comments — Protocol parity only.
- gpt-oss: harmony's analysis-channel message is emitted on past-asst
  when the override fires (default still drops, matching trained
  history-style rendering).
- DefaultRenderer (apply_chat_template fallback): raises
  ``NotImplementedError`` if either flag is set. Selective re-emit
  isn't possible without knowing the template's structure.

Tests: ``tests/test_preserve_thinking.py`` covers default-unchanged,
strict-subset growth (default ≤ between ≤ all), and DefaultRenderer
raise behaviour across the full conftest matrix.

Bump version 0.1.6 → 0.1.7.
Previously the classifier only fired on an asst with ``tool_calls``
whose immediate next message was a ``tool`` — that misses the
post-tool-result asst that closes the same loop. The intent of
"between tool calls" is the whole asst→tool→asst→...→tool→asst chain
between two user turns.

New rule: an asst at index ``i`` is in a tool-cycle segment if the
user-bounded segment containing ``i`` (left/right bounds being the
adjacent ``user`` messages, or start/end of conversation) contains at
least one ``tool`` message. Once a new ``user`` arrives, the cycle
closes and template default resumes.

In a ``S-U-A-T-A-U-A-T-A`` shape with reasoning on every asst, this
flag now keeps thinking on A2, A4, A6, A8 (whereas the previous rule
only kept A2 and the default-template-kept A6/A8). For a mixed shape
like ``S-U-A-U-A-T-A-U-A`` the flag preserves only the middle pair
(A4, A6 — the ones inside the tool-bearing segment) and stays
strictly weaker than ``preserve_all_thinking``.
Previous fix was too aggressive — it preserved thinking for any asst in
*any* user-bounded segment that contained a tool message, including
older segments behind a subsequent ``user`` turn.

The intent of the flag is narrower: keep thinking inside the *current*
A-T-...-A loop (after the most recent ``user``). Once a new ``user``
arrives, the previous tool block becomes "older" and its thinking is
dropped (template default). That matches how chat templates already
treat multi-turn contexts and avoids stuffing the prompt with stale
reasoning the model wasn't trained to attend over user-turn boundaries.
``preserve_all_thinking`` is the escape hatch for callers who do want
older blocks preserved too.

New rule: ``msg_idx > last_user_idx`` AND the current segment has at
least one ``tool`` message. Validated end-to-end against Qwen3 on
``S-U-A-T-A-U-A-T-A`` and ``S-U-A-T-A`` — older block drops, current
block keeps, classification helper unit-tested for both shapes.
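The final rule reduces to a small predicate. This is an illustrative reimplementation of the ``should_preserve_past_thinking`` idea described above, not the library's actual helper; message dicts with a ``role`` key are assumed:

```python
def in_current_tool_cycle(messages: list, msg_idx: int) -> bool:
    """True iff the assistant message at msg_idx sits after the most
    recent user turn AND that current segment contains at least one
    tool message -- the 'between tool calls' rule."""
    last_user_idx = max(
        (i for i, m in enumerate(messages) if m["role"] == "user"),
        default=-1,
    )
    if msg_idx <= last_user_idx:
        return False  # older segment: template default applies
    current_segment = messages[last_user_idx + 1:]
    return any(m["role"] == "tool" for m in current_segment)
```

On the ``S-U-A-T-A-U-A-T-A`` shape (0-indexed) this keeps A6 and A8 and drops A2/A4, matching the behaviour described above.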

Adds ``test_preserve_btc_on_live_cycle_matches_all`` per renderer:
in a live tool cycle (no trailing user), btc and all should produce
identical token sequences since they preserve the same set of asst
messages.
Per-renderer test that walks the full conftest matrix on a
``S-U-A-T-A-U-A-T-A`` conversation with unique sentinel strings
(``REASON-A2``..``REASON-A8``) in each asst's ``reasoning_content``,
then greps the decoded output to verify which sentinels are visible.

- ``preserve_all_thinking=True``: every sentinel must appear (or none,
  for never-preserves renderers like Kimi-K2 / Qwen3-VL — flags are
  Protocol-only no-ops there).
- ``preserve_thinking_between_tool_calls=True``: the current-block
  sentinels (A6, A8) must always appear; older-block sentinels (A2,
  A4) follow template default and aren't asserted universally.

Also drops the manual ``Bridge interaction`` mismatch concern: each of
the 14 active renderers now has end-to-end visibility coverage rather
than just the strict-subset token-count invariant.
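The sentinel grep itself is a one-liner (a sketch; the ``REASON-A*`` marker strings come from the test description above, everything else here is illustrative):

```python
def visible_sentinels(decoded_output: str, sentinels) -> set:
    """Return the subset of unique reasoning markers that survived
    rendering into the decoded prompt string."""
    return {s for s in sentinels if s in decoded_output}
```

Under ``preserve_thinking_between_tool_calls=True`` the assertion then checks that the current-block markers are a subset of the visible set, leaving the older-block sentinels unasserted.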
@hallerite hallerite marked this pull request as ready for review May 6, 2026 23:05
@hallerite hallerite merged commit d222cf3 into main May 6, 2026
6 checks passed
hallerite added a commit that referenced this pull request May 7, 2026
PyPI's latest published renderers release is 0.1.5 — the in-tree bumps in #3 → #5 → this PR (0.1.6 → 0.1.7 → 0.1.8 → 0.1.9 → 0.1.10) accumulated
against PyPI without ever publishing. Roll forward to 0.1.6 once for
the actual release; tag ``renderers-v0.1.6`` post-merge to fire the
publish workflow (.github/workflows/publish.yml asserts the tag's
``-vX.Y.Z`` suffix matches pyproject.toml).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hallerite added a commit that referenced this pull request May 7, 2026
…preserve_thinking) (#6)

* feat: drop legacy chat-template-pass-through kwargs

``GLM5Renderer.__init__`` no longer accepts ``clear_thinking``, and
``Qwen36Renderer.__init__`` no longer accepts ``preserve_thinking``.
Both were chat-template-kwarg pass-throughs whose default values are
now baked straight into the renderer:

- GLM-5 default: drop ``<think>`` on past-cycle assistants, keep on
  in-flight cycle and trailing assistant. (What ``clear_thinking=True``
  did before.) The render gate is now ``msg_idx > last_user_index or
  preserve_thinking`` — clean and consistent with the rest of the
  family.
- Qwen3.6 default: identical to Qwen3.5 (``msg_idx > last_query_index``
  via the inherited ``_should_render_thinking``). The class shrinks to
  just the ``_render_arg_value`` override, which is the only real
  template delta vs Qwen3.5.

Callers who want the toggled-on behaviour pass
``preserve_all_thinking=True`` to ``create_renderer`` — the
renderer-agnostic spelling of the same intent landed in #3/#5. Two
regression tests pin the dropped kwargs (constructor must raise
``TypeError`` when either is passed).
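The regression pin can be sketched with a toy stand-in (hypothetical class and helper; the real tests target the ``GLM5Renderer`` / ``Qwen36Renderer`` constructors):

```python
class RendererSketch:
    """Toy stand-in for a renderer whose __init__ dropped its legacy
    chat-template-kwarg pass-through (e.g. clear_thinking)."""
    def __init__(self, model_name: str):
        self.model_name = model_name

def rejects_kwarg(cls, **legacy_kwargs) -> bool:
    """True iff constructing cls with the legacy kwarg raises TypeError,
    i.e. the kwarg really was removed from the signature."""
    try:
        cls("glm-5", **legacy_kwargs)
        return False
    except TypeError:
        return True
```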

The README's §1 "Preserving past reasoning" subsection (added in #5)
already motivates the override flags; no new prose needed here. §3.d
is unchanged (still cross-links to the override flag for compaction).

Bump 0.1.9 → 0.1.10. 895 tests pass locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: consolidate version to 0.1.6 (PyPI-aligned)

PyPI's latest published renderers release is 0.1.5 — the in-tree bumps in #3 → #5 → this PR (0.1.6 → 0.1.7 → 0.1.8 → 0.1.9 → 0.1.10) accumulated
against PyPI without ever publishing. Roll forward to 0.1.6 once for
the actual release; tag ``renderers-v0.1.6`` post-merge to fire the
publish workflow (.github/workflows/publish.yml asserts the tag's
``-vX.Y.Z`` suffix matches pyproject.toml).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>