#26750 — LM Studio + Qwen: prompt prefix is not byte-stable across turns, breaks prefix cache #907

@ElioNeto

Description

LM Studio prefix caching requires the tokenized prompt prefix to remain identical across turns. With Qwen / QwQ-family chat templates served via LM Studio's OpenAI-compatible endpoint, two things make the prefix unstable across turns when OpenCode replays the conversation history as-is:

  1. Assistant reasoning content is replayed. Qwen chat templates render historical assistant reasoning (think) blocks differently on later turns once a new user message is appended, so the same historical assistant turn tokenizes differently the next time it appears in the prompt.
  2. Raw role: "tool" messages are tokenized non-deterministically. Qwen-style chat templates render past tool results inconsistently when they sit as standalone role: "tool" messages, but render them stably when they appear as <tool_response>...</tool_response> blocks inside the preceding user/assistant text.

Either behavior breaks LM Studio prefix caching even though the underlying conversation history has not meaningfully changed, so a long context can fully re-tokenize on every turn.
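One possible mitigation for the two behaviors above, sketched under stated assumptions: before replaying history, drop reasoning from past assistant turns and fold standalone role: "tool" messages into the preceding message as <tool_response> blocks. The function name `stabilize_history`, the `reasoning_content` field, and the literal `<think>` tag are illustrative assumptions, not OpenCode's or LM Studio's actual API, and the sketch assumes message content is a plain string.

```python
import re
from copy import deepcopy

# Assumption: reasoning, when inlined in content, is wrapped in <think>...</think>.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def stabilize_history(messages):
    """Return a normalized copy of `messages`; the input list is not modified."""
    out = []
    for msg in messages:
        msg = deepcopy(msg)
        role = msg.get("role")
        if role == "assistant":
            # Drop replayed reasoning so the historical turn renders the same way every time.
            msg.pop("reasoning_content", None)  # hypothetical field name
            msg["content"] = THINK_RE.sub("", msg.get("content") or "")
            out.append(msg)
        elif role == "tool" and out:
            # Inline the tool result into the preceding message instead of
            # leaving it as a standalone role: "tool" message.
            block = f"<tool_response>\n{msg.get('content') or ''}\n</tool_response>"
            prev = out[-1]
            prev_text = prev.get("content") or ""
            prev["content"] = f"{prev_text}\n{block}" if prev_text else block
        else:
            # User/system messages (and a tool message with nothing before it) pass through.
            out.append(msg)
    return out
```

Applying the same normalization on every turn is the point: the output for turns 1..N must not change when turn N+1 is appended, otherwise the prefix shifts again.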

Expected: when OpenCode replays the same conversation history through LM Studio's OpenAI-compatible endpoint with a Qwen-family model, the prompt prefix is byte-stable, so the prefix cache hits on every turn after the first.

Actual: the prefix changes turn-over-turn, causing a large or full prompt-cache miss against LM Studio.
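To check the expected versus actual behavior, one rough diagnostic is to compare the rendered prompt text of two consecutive turns and see how many leading bytes they share. This is only a sketch: it assumes the rendered prompts can be captured as strings (for example from server-side logging), and the helper names below are hypothetical. Identical bytes are a proxy for identical tokens, not a guarantee of a cache hit.

```python
def shared_prefix_len(prev: bytes, curr: bytes) -> int:
    """Number of leading bytes that `prev` and `curr` have in common."""
    limit = min(len(prev), len(curr))
    for i in range(limit):
        if prev[i] != curr[i]:
            return i
    return limit


def is_prefix_stable(prev_prompt: str, curr_prompt: str) -> bool:
    """True when the previous turn's prompt is an exact byte prefix of the current one."""
    prev = prev_prompt.encode("utf-8")
    curr = curr_prompt.encode("utf-8")
    return shared_prefix_len(prev, curr) == len(prev)
```

A small shared length relative to the previous prompt's size points at the first message whose rendering changed between turns.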

This is provider/model compatibility behavior specific to LM Stud

[Truncated: 2479 chars total]

Metadata

Assignees

No one assigned

    Labels

    DOR (Definition of Ready: issue meets readiness criteria), area:core, bug (Something isn't working), medium

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
