Description
wmc.py estimates token counts with a fixed ratio of 1 token ≈ 4 characters. Cosmos Reason2 2B uses a sub-word tokenizer, so complex vocabulary, code, emojis, or non-English text can tokenize to significantly more tokens than the estimate suggests. A WMC budget of 1400 estimated tokens may therefore correspond to 1600–1800 real tokens, pushing the total context over the 2048-token hard limit.
Steps to Reproduce
- Have a conversation containing code snippets or technical terms
- Let WMC fill up to budget
- Send another message — vLLM may return 400 Bad Request
Expected Behavior
Context window always stays safely under 2048 tokens.
Actual Behavior
The rough estimate undercounts real tokens. Combined with the system prompt (~300 tokens) and EMC injection (~300 tokens), the total may exceed 2048, causing vLLM to return an error or silently truncate the prompt.
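The overflow arithmetic can be checked directly. A minimal sketch, assuming a ~4/3 worst-case inflation factor (i.e. real text averaging ~3 chars/token where the heuristic assumes 4, consistent with the 1600–1800 range above):

```python
HARD_LIMIT = 2048
WMC_TOKEN_BUDGET = 1400  # current budget in wmc.py
SYSTEM_PROMPT = 300      # approximate, per this report
EMC_INJECTION = 300      # approximate, per this report

# Real tokens run ~4/3x the char-based estimate on code-heavy text
real_wmc = WMC_TOKEN_BUDGET * 4 // 3               # ~1866 real tokens
total = real_wmc + SYSTEM_PROMPT + EMC_INJECTION   # ~2466 total
print(total, total > HARD_LIMIT)  # over the 2048 hard limit
```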
Error Logs
```
# Possible:
[ERROR] [cnc]: ❌ vLLM HTTP 400
# Or silent truncation mid-sentence
```
Environment
- Hardware: Jetson Orin Nano Super 8GB
- OS: Ubuntu 22.04 / JetPack 6.2.2
- ROS2: Humble
- Python: 3.10
- Package: scs
- Node: wmc
Affected Files
wmc.py line 34 — CHARS_PER_TOKEN = 4
wmc.py line 37 — _estimate_tokens()
Possible Fix
Option A — Conservative budget (quick fix):

```python
WMC_TOKEN_BUDGET = 1100  # reduce budget to give more headroom
CHARS_PER_TOKEN = 3      # more conservative estimate
```

Option B — Use actual tokenizer (accurate fix):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-VL-2B-Instruct")

def _estimate_tokens(text: str) -> int:
    return len(tokenizer.encode(text))
```

Additional Context
Cosmos Reason2 2B is based on Qwen3-VL-2B. Option A is safe for M1. Option B is accurate but adds startup latency and RAM usage. Recommended: Option A now, Option B in M2.
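The headroom Option A buys can be sanity-checked with the same arithmetic as above. Assuming 3 chars/token already matches the worst case observed on code-heavy text (so the estimate no longer undercounts), the reduced budget keeps the total comfortably under the limit:

```python
HARD_LIMIT = 2048
SYSTEM_PROMPT = 300  # approximate, per this report
EMC_INJECTION = 300  # approximate, per this report

# Option A: budget of 1100 with a 3 chars/token estimate, which is
# assumed here to no longer undercount even on code-heavy text
budget = 1100
total = budget + SYSTEM_PROMPT + EMC_INJECTION  # 1700 tokens
headroom = HARD_LIMIT - total                   # 348 tokens spare
print(total, headroom)
```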