
Fix TITO bridge extraction and truncation handling#1005

Merged

eligotts merged 9 commits into main from eli/fix-tito
Mar 30, 2026
Conversation

@eligotts
Contributor

@eligotts eligotts commented Mar 11, 2026

Summary

  • Fix bridge extraction: Rewrote TITO bridge token extraction to use a dummy-assistant dual-tokenization approach that correctly handles all chat templates (Qwen3, Qwen3.5, GLM-4.5/4.7) including GLM's stop-token-as-role-marker pattern and Qwen3's context-dependent think block injection.
  • Fix truncation gate: The TITO truncation gate now checks both tokens["is_truncated"] (seq_len overflow) and response.message.is_truncated (finish_reason="length" from vLLM). Previously only seq_len overflow was checked, so max_tokens truncation was missed — TITO would attempt bridge stitching on a completion without a stop token, producing malformed token sequences.
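The two-source truncation gate in the second bullet can be sketched as follows. This is a minimal illustration, not the actual client code: the function name is hypothetical, and it assumes a dict-like `tokens` record plus a response object exposing `message.is_truncated`, as described above.

```python
from types import SimpleNamespace  # only used for the example at the bottom

def should_fallback_to_mito(tokens: dict, response) -> bool:
    """Return True when bridge stitching is unsafe and MITO should be used."""
    # Source 1: seq_len overflow detected during token parsing
    seq_len_truncated = bool(tokens.get("is_truncated", False))
    # Source 2: finish_reason="length" from vLLM (max_tokens hit, so no
    # stop token was emitted at the end of the completion)
    max_tokens_truncated = bool(getattr(response.message, "is_truncated", False))
    return seq_len_truncated or max_tokens_truncated

# Example: max_tokens truncation alone now triggers the fallback,
# which was the case the old single-flag gate missed.
resp = SimpleNamespace(message=SimpleNamespace(is_truncated=True))
assert should_fallback_to_mito({"is_truncated": False}, resp)
```

Previously only the first source was consulted, so the second case slipped through to bridge extraction.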

Problem

When a completion hit max_tokens (e.g., thinking model burns through token budget), vLLM returns finish_reason="length" and does not include a stop token in completion token_ids. The TITO client's truncation gate only checked TrajectoryStepTokens["is_truncated"], which reflects seq_len overflow but not max_tokens truncation. So it proceeded with bridge extraction, grabbed a random content token as the "stop token" for gap calculation, and produced broken stitched sequences.

Testing

Empirically validated stop token behavior across 6 models on prime-rl's custom vLLM server (/v1/chat/completions/tokens endpoint):

| Model | Stop Token | stop ✓ | tool_calls ✓ | length (no stop) ✓ |
| --- | --- | --- | --- | --- |
| Qwen3-4B-Instruct-2507 | 151645 `<\|im_end\|>` | ✓ | ✓ | ✓ |
| Qwen3-0.6B | 151645 `<\|im_end\|>` | ✓ | n/a | ✓ |
| Qwen3-8B | 151645 `<\|im_end\|>` | ✓ | ✓ | ✓ |
| Qwen3-30B-A3B | 151645 `<\|im_end\|>` | ✓ | ✓ | ✓ |
| Qwen3.5-4B | 248046 | ✓ | ✓ | ✓ |
| GLM-4.7-Flash | 154827 `<\|user\|>` / 154829 `<\|observation\|>` | ✓ | ✓ | ✓ |

Confirmed across all models: finish_reason=length never includes a stop token in token_ids. Bridge extraction and GLM dedup logic validated for both tool-call multi-turn (wiki-search style) and user-message multi-turn (alphabet-sort style).
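The invariant confirmed above — a `finish_reason="length"` completion never ends in a stop token — can be checked with a small helper. The ids below are taken from the table; the dict and function name are illustrative, not part of the codebase.

```python
# Known stop-token ids per model family, from the validation table above.
# GLM-4.7-Flash reuses role-marker tokens as stop tokens.
STOP_TOKEN_IDS = {
    "Qwen3": {151645},                   # <|im_end|>
    "Qwen3.5": {248046},
    "GLM-4.7-Flash": {154827, 154829},   # <|user|> / <|observation|>
}

def ends_with_stop_token(token_ids: list[int], family: str) -> bool:
    """True if the completion's final token is one of the family's stop ids."""
    return bool(token_ids) and token_ids[-1] in STOP_TOKEN_IDS[family]
```

A gate like the one in this PR can use such a check as a sanity assertion: if the completion was not truncated, the last token should be a known stop id before bridge stitching proceeds.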

Ran RL training experiments with use_token_client=true across Qwen3-4B, Qwen3-0.6B, Qwen3-30B-A3B, and Qwen3.5-4B on wiki-search and alphabet-sort environments, confirming no TITO errors.

Test plan

  • Run alphabet-sort with low max_tokens (e.g., 32) to trigger frequent finish_reason=length on multi-turn — verify MITO fallback fires and rollout completes correctly
  • Run wiki-search with use_token_client=true — verify TITO path works for tool-call multi-turn
  • Verify no regression on single-turn environments

🤖 Generated with Claude Code


Note

Medium Risk
Touches core prompt/token stitching logic used for KV-cache reuse across turns; failures can corrupt prompts or force MITO fallback, though changes add additional validation and safe fallbacks.

Overview
Fixes TITO prompt stitching in OpenAIChatCompletionsTokenClient.get_prompt_ids by replacing full-prompt slicing/suffix caching with a dummy-assistant dual-tokenization approach that extracts only the minimal “bridge” tokens needed after a cached prefix, including handling templates where stop tokens can act as role markers.

Adds stricter gating to avoid stitching when the matched previous completion was truncated (checking both token parsing overflow and finish_reason="length" via response.message.is_truncated), plus validates that the post-prefix “env tail” is composed of tool messages with an optional trailing user message; otherwise it falls back to message-based inference (MITO).

Written by Cursor Bugbot for commit ed76a51.

mikasenghaas and others added 5 commits March 6, 2026 23:14
Replaces the previous bridge extraction that tokenized the real assistant
message (breaking on Qwen3's context-dependent think block injection) with
a robust dual-tokenization approach:

- Tokenize [dummy_assistant + env_messages] with gen=True
- Tokenize [dummy_assistant] with gen=False
- Extract bridge via subtraction, accounting for the gap between the
  engine's stop token and the template's inter-turn separator
- Dedup stop tokens that double as role markers (GLM's <|observation|>)
- Handle truncated completions by falling back to MITO
- Support multiple trailing env messages (multi-tool responses)

Verified across Qwen3, Qwen2.5, Hermes, and GLM model families with
exact bridge matches on all edge cases (empty content, unicode, injection
attempts, multi-tool, multi-turn chains).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
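The dual-tokenization subtraction in the commit message above can be sketched as follows. Here `apply_template` stands in for a real `tokenizer.apply_chat_template`, and the sketch deliberately omits the stop-token gap accounting and GLM role-marker dedup that the real implementation adds.

```python
def extract_bridge(apply_template, dummy_assistant: dict, env_messages: list[dict]) -> list[int]:
    """Extract the 'bridge' tokens a chat template emits between an assistant
    turn and the next generation prompt, via dual tokenization."""
    # 1. Tokenize [dummy_assistant + env_messages] with a generation prompt
    with_env = apply_template([dummy_assistant, *env_messages], add_generation_prompt=True)
    # 2. Tokenize [dummy_assistant] alone, without a generation prompt
    alone = apply_template([dummy_assistant], add_generation_prompt=False)
    # 3. The bridge is whatever the first tokenization appends after the
    #    shared prefix (real code also bridges the engine-stop-token gap)
    assert with_env[: len(alone)] == alone, "template diverged before the bridge"
    return with_env[len(alone):]
```

Because only a dummy assistant message is tokenized, Qwen3's context-dependent think block injection on the *real* assistant content can no longer perturb the extracted bridge.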
The TITO client's truncation gate only checked
TrajectoryStepTokens["is_truncated"], which reflects seq_len overflow
but not max_tokens truncation (finish_reason="length" from vLLM).

When a completion hit max_tokens, is_truncated was False (total
sequence fit within seq_len), so TITO proceeded with bridge extraction.
Without a stop token at the end of the truncated completion, the gap
calculation used a random content token, producing malformed stitched
sequences.

Now checks both sources:
- tokens["is_truncated"] for seq_len overflow
- response.message.is_truncated for finish_reason="length"

Either triggers MITO fallback, which re-tokenizes from messages
correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Trailing env message count misses mixed tool+user sequences
    • Removed the premature break after counting a trailing user message so preceding trailing tool/observation messages are now included in the env-message bridge count.

Preview (8c90cb20f7):
diff --git a/verifiers/clients/openai_chat_completions_token_client.py b/verifiers/clients/openai_chat_completions_token_client.py
--- a/verifiers/clients/openai_chat_completions_token_client.py
+++ b/verifiers/clients/openai_chat_completions_token_client.py
@@ -42,7 +42,6 @@
         elif role == "user" and count == 0:
             # A user follow-up (not a tool response) is also an env message
             count = 1
-            break
         else:
             break
     return count


cursoragent and others added 4 commits March 11, 2026 03:07
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace _count_trailing_env_messages with direct derivation from the
prefix match result. The env messages are simply
prompt_messages[prefix_len:] — no need to independently re-derive
from the tail.

Validate the pattern: all tool messages, with optionally a user
message last. Falls back to MITO on unexpected message shapes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
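The env-tail validation described in that last commit ("all tool messages, with optionally a user message last") can be sketched as below. Names are hypothetical; this is not the merged implementation, just the shape of the check.

```python
def is_valid_env_tail(env_messages: list[dict]) -> bool:
    """Validate the messages after the cached prefix: all tool messages,
    with at most one user message, allowed only in the final position.
    Any other shape should trigger the MITO fallback."""
    roles = [m["role"] for m in env_messages]
    if not roles:
        return False  # nothing after the prefix: nothing to stitch
    if roles[-1] == "user":
        roles = roles[:-1]  # optional single trailing user follow-up
    return all(r == "tool" for r in roles)
```

Deriving the tail directly as `prompt_messages[prefix_len:]` and then validating it avoids the counting bug Bugbot flagged, where a trailing user message caused preceding tool messages to be dropped from the bridge count.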
@eligotts eligotts merged commit 8ac737c into main Mar 30, 2026
6 checks passed
