Description
LM Studio prefix caching requires the tokenized prompt prefix to remain identical across turns. With Qwen / QwQ-family chat templates served via LM Studio's OpenAI-compatible endpoint, two things make the prefix unstable across turns when OpenCode replays the conversation history as-is:
- Assistant reasoning content is replayed. Qwen chat templates render historical assistant reasoning (think) blocks differently on later turns once a new user message is appended, so the same historical assistant turn tokenizes differently the next time it appears in the prompt.
- Raw role: "tool" messages are tokenized non-deterministically. Qwen-style chat templates render past tool results inconsistently when they sit as standalone role: "tool" messages, but render them stably when they appear as <tool_response>...</tool_response> blocks inside the preceding user/assistant text.
Either behavior breaks LM Studio prefix caching even though the underlying conversation history has not meaningfully changed, so a long context can fully re-tokenize on every turn.
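One way to picture a possible mitigation is a history-normalization pass that runs before the request is sent. The sketch below is illustrative only and is not OpenCode's or LM Studio's actual code: `normalize_history` is a hypothetical helper, the `reasoning_content` field name and the inline `<think>...</think>` form are assumptions about how reasoning shows up in replayed messages, and folding each tool result into the immediately preceding message is just one reading of "inside the preceding user/assistant text".

```python
import copy
import re

# Assumption: inline reasoning appears in assistant text as <think>...</think>.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def normalize_history(messages):
    """Return a copy of `messages` with reasoning stripped and tool results inlined.

    Hypothetical helper: the message shape follows the OpenAI-style chat format
    (dicts with "role" and "content"); `reasoning_content` is an assumed field name.
    """
    normalized = []
    for message in messages:
        msg = copy.deepcopy(message)  # never mutate the caller's history
        role = msg.get("role")

        if role == "assistant":
            # Drop replayed reasoning so the historical turn renders the same way
            # no matter how many messages follow it.
            msg.pop("reasoning_content", None)
            if isinstance(msg.get("content"), str):
                msg["content"] = THINK_BLOCK.sub("", msg["content"])

        if role == "tool" and normalized:
            # Fold the tool result into the preceding message's text instead of
            # keeping a standalone role: "tool" message.
            result = msg.get("content") or ""
            block = f"<tool_response>\n{result}\n</tool_response>"
            previous = normalized[-1]
            existing = previous.get("content") or ""
            previous["content"] = f"{existing}\n{block}" if existing else block
            continue

        normalized.append(msg)
    return normalized
```

Applied to a three-message history (user, assistant with a think block, tool result), this returns two messages: the user turn unchanged and an assistant turn whose text ends with the `<tool_response>` block. The pass is deterministic, so running it again on the same history gives the same output, which is the property the prefix cache depends on.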
Expected: when OpenCode replays the same conversation history through LM Studio's OpenAI-compatible endpoint with a Qwen-family model, the prompt prefix is byte-stable, so the prefix cache hits on every turn after the first.
Actual: the prefix changes turn-over-turn, causing a large or full prompt-cache miss against LM Studio.
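The gap between expected and actual behavior can be measured by comparing the prompts rendered for two consecutive turns. The following is a minimal sketch with hypothetical helper names. It assumes the caller already has both prompts as sequences, either token IDs or raw strings, and leaves open how they are obtained (for example by applying the chat template outside LM Studio).

```python
def shared_prefix_length(previous, current):
    """Count the leading elements that two sequences have in common."""
    count = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        count += 1
    return count


def prefix_reuse_ratio(previous, current):
    """Fraction of the previous turn's prompt that the next turn reuses as a prefix.

    1.0 means the earlier prompt is fully preserved (the expected behavior);
    values near 0.0 correspond to a large or full cache miss.
    """
    if not previous:
        return 1.0
    return shared_prefix_length(previous, current) / len(previous)
```

With a stable prefix, `prefix_reuse_ratio([1, 2, 3], [1, 2, 3, 4, 5])` is 1.0. If an early token changes, as in `prefix_reuse_ratio([1, 2, 3, 4], [1, 9, 3, 4])`, the ratio falls to 0.25 because everything after the first mismatch can no longer be reused.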
This is provider/model compatibility behavior specific to LM Stud