feat(openaiContentGenerator): Avoid quantized models on OpenRouter#348
mgalgs wants to merge 1 commit into
By default OpenRouter can route your request to providers that serve quantized versions of the model [1], which can result in substantially worse output for coding. Avoid getting quantized models by setting the `provider.quantizations` parameter to full precision (fp8 or better) when using OpenRouter.

[1] https://openrouter.ai/docs/features/provider-routing#quantization
Co-authored-by: Gregory Shikhman <shikhman@google.com>
wenshao left a comment:
No issues found. LGTM! ✅ (gpt-5.4 via Qwen Code /review)
Please resolve the merge conflicts before this is merged. Thanks! (gpt-5.4 via Qwen Code /review)
Thanks for the contribution, but closing this one: the premise isn't quite right for the CLI layer. Provider routing on OpenRouter (which providers, which precisions, fallback behavior) is a user-configurable concern. OpenRouter exposes it through both the API ( … If finer control from inside qwen-code is desirable, the right shape would be a generic passthrough for OpenRouter …
OpenRouter is the default BYOK catalog — 200+ models behind one OpenAI-shape
endpoint. Landing it first forces the trait surface against the maximum
amount of provider variance in a single integration.
Wire format verified 2026-05-14 against openrouter.ai/docs:
- /api/reference/authentication: Bearer + HTTP-Referer + X-OpenRouter-Title
(the rename from X-Title — adapter authors who copy old SDK examples will
silently lose attribution; we get it right from line one)
- /api/reference/streaming: data: {json}\n\n framing, [DONE] terminator,
": OPENROUTER PROCESSING" keepalive comments to ignore
- /guides/routing/provider-selection: provider.quantizations accepts
int4 | int8 | fp4 | fp6 | fp8 | fp16 | bf16 | fp32 | unknown
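The streaming framing above can be sketched as a per-line classifier. This is a minimal illustration, not the adapter's actual code: `SseLine` and `classify_line` are hypothetical names, the JSON payload is left unparsed, and multi-line `data:` fields are not handled.

```rust
/// One classified line of a server-sent-events stream (hypothetical type).
#[derive(Debug, PartialEq)]
enum SseLine<'a> {
    /// A `data: {json}` payload; the JSON is left unparsed in this sketch.
    Data(&'a str),
    /// The `data: [DONE]` terminator.
    Done,
    /// A comment line such as `: OPENROUTER PROCESSING`; safe to ignore.
    Comment,
    /// A blank separator line or a field this sketch does not handle.
    Other,
}

/// Classify a single line that has already been split on `\n`.
fn classify_line(line: &str) -> SseLine<'_> {
    // Tolerate CRLF line endings.
    let line = line.trim_end_matches('\r');
    // SSE comments start with a colon.
    if line.starts_with(':') {
        return SseLine::Comment;
    }
    if let Some(rest) = line.strip_prefix("data:") {
        // A single space after the colon is part of the framing, not the payload.
        let payload = rest.strip_prefix(' ').unwrap_or(rest);
        if payload == "[DONE]" {
            return SseLine::Done;
        }
        return SseLine::Data(payload);
    }
    SseLine::Other
}

fn main() {
    let stream = ": OPENROUTER PROCESSING\n\ndata: {\"id\":\"x\"}\n\ndata: [DONE]\n\n";
    for line in stream.lines() {
        println!("{:?}", classify_line(line));
    }
}
```

Keeping classification separate from JSON decoding means the keepalive and terminator handling can be tested without any fixture payloads.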
Defaults that matter:
- Precision::Exact ON by default → quantizations = ["fp16","bf16","fp32"].
Refuses OpenRouter's default FP4/Int4 routing — the CJK-encoding-breaks
failure mode documented in RooCodeInc/Roo-Code#11325, QwenLM/qwen-code#348.
Mixed precision is opt-in via with_routing().
- DataCollection::Deny by default. Bet 3 trust hinges on users believing
their prompts aren't training data.
- stream_options.include_usage=true so cost telemetry isn't blocked on the
socket closing.
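As a rough sketch of how the first two defaults could translate into the `provider` object of the request body: `Precision` comes from the notes above, while `provider_json`, the boolean flag, and the `data_collection` wire key are assumptions made for illustration. A real adapter would use a JSON library instead of string formatting.

```rust
/// Precision policy named in the notes above.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Precision {
    Exact,
    Mixed,
}

/// Quantization levels pinned by `Precision::Exact`, per the defaults above.
const EXACT_QUANTIZATIONS: &[&str] = &["fp16", "bf16", "fp32"];

/// Returns the list to send, or `None` when the field should be omitted.
fn quantizations(p: Precision) -> Option<&'static [&'static str]> {
    match p {
        Precision::Exact => Some(EXACT_QUANTIZATIONS),
        Precision::Mixed => None,
    }
}

/// Hypothetical helper: renders the `provider` object as a JSON string.
/// The `data_collection` key name is an assumption about the wire format.
fn provider_json(p: Precision, deny_data_collection: bool) -> String {
    let mut fields: Vec<String> = Vec::new();
    if let Some(levels) = quantizations(p) {
        let items: Vec<String> = levels.iter().map(|s| format!("\"{}\"", s)).collect();
        fields.push(format!("\"quantizations\":[{}]", items.join(",")));
    }
    let policy = if deny_data_collection { "deny" } else { "allow" };
    fields.push(format!("\"data_collection\":\"{}\"", policy));
    format!("{{{}}}", fields.join(","))
}

fn main() {
    println!("{}", provider_json(Precision::Exact, true));
    println!("{}", provider_json(Precision::Mixed, true));
}
```

Omitting the field under `Precision::Mixed`, rather than sending an empty list, leaves routing entirely to OpenRouter's own default.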
Adapter shape:
- build_body() is sync + pure — unit-testable without HTTP mocks
- map_chunk() handles tool-call accumulation: ToolCallStart → ToolInputDelta*
→ ToolInputEnd → Finish (closing every started call before Finish)
- Synthesizes tool-call index when upstream doesn't expose it (answers the
"every tool call is index 0" bug)
- delta.reasoning passthrough → Event::ReasoningDelta (covers DeepSeek and
Anthropic-extended-thinking routed via OR)
- prompt_tokens_details.cached_tokens → Usage.cache_read_tokens
- completion_tokens_details.reasoning_tokens → Usage.reasoning_tokens
- ProviderError carries typed ErrorClass from HTTP code or message heuristic
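The tool-call ordering and index synthesis described above might look roughly like the following. The event names follow the list; `ToolDelta`, `ToolCallTracker`, and the rule "a fragment with a name but no index starts a new call" are assumptions for illustration, not the adapter's confirmed logic.

```rust
/// Normalized events, named after the sequence in the list above.
#[derive(Debug, PartialEq)]
enum Event {
    ToolCallStart { index: usize, name: String },
    ToolInputDelta { index: usize, fragment: String },
    ToolInputEnd { index: usize },
    Finish,
}

/// Hypothetical view of one upstream tool-call fragment.
struct ToolDelta<'a> {
    index: Option<usize>,  // may be missing upstream
    name: Option<&'a str>, // assumed present only on a call's first fragment
    arguments: &'a str,    // partial JSON arguments
}

/// Tracks started-but-unclosed calls so each one is closed before Finish.
#[derive(Default)]
struct ToolCallTracker {
    open: Vec<usize>,
    next_index: usize,
}

impl ToolCallTracker {
    fn on_delta(&mut self, d: &ToolDelta<'_>) -> Vec<Event> {
        let mut out = Vec::new();
        let index = match (d.index, d.name) {
            (Some(i), _) => i,
            // No upstream index: a name marks a new call, so synthesize one.
            (None, Some(_)) => self.next_index,
            // No index and no name: continue the most recently opened call.
            (None, None) => match self.open.last() {
                Some(&i) => i,
                None => return out, // nothing to attach the fragment to
            },
        };
        if !self.open.contains(&index) {
            self.open.push(index);
            self.next_index = self.next_index.max(index + 1);
            out.push(Event::ToolCallStart {
                index,
                name: d.name.unwrap_or("").to_string(),
            });
        }
        if !d.arguments.is_empty() {
            out.push(Event::ToolInputDelta {
                index,
                fragment: d.arguments.to_string(),
            });
        }
        out
    }

    /// Closes every started call, then emits Finish.
    fn on_finish(&mut self) -> Vec<Event> {
        let mut out: Vec<Event> = self
            .open
            .drain(..)
            .map(|index| Event::ToolInputEnd { index })
            .collect();
        out.push(Event::Finish);
        out
    }
}

fn main() {
    let mut tracker = ToolCallTracker::default();
    let deltas = [
        ToolDelta { index: None, name: Some("read_file"), arguments: "{\"path\":" },
        ToolDelta { index: None, name: None, arguments: "\"a.rs\"}" },
        ToolDelta { index: None, name: Some("list_dir"), arguments: "{}" },
    ];
    for d in &deltas {
        for e in tracker.on_delta(d) {
            println!("{:?}", e);
        }
    }
    for e in tracker.on_finish() {
        println!("{:?}", e);
    }
}
```

This sketch defers every `ToolInputEnd` until the finish signal; an adapter could also close a call as soon as the next one starts.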
Trait-surface deltas (required to make the adapter ergonomic):
- Call.tools is now Vec<Arc<Tool>> (was Vec<ToolId>). Adapters need
schema + description, not just IDs, to build the wire shape.
- Projected.tools is now Vec<Arc<Tool>> matching that. The speculative
BTreeMap<SchemaHash, Bytes> dedup map went away — OpenAI's tools array
doesn't dedup by schema-hash anyway. Re-add when a provider actually needs
it (system-prompt-based tool definitions might).
- ToolRegistry::schema_bytes(hash) added for adapters that want to budget
on-wire token cost before sending.
15 new tests:
- 9 in openrouter_request.rs — body shape: required OpenAI fields,
Precision::Exact pins quantizations correctly, Precision::Mixed omits the
field, DataCollection round-trips, tool function-call shape,
assistant/tool-result message pair round-trips, provider.order +
provider.ignore, temperature/max_tokens/stop pass-through
- 6 in openrouter_sse.rs — text delta accumulation + Finish + usage, comment
lines ignored, delta.reasoning maps to ReasoningDelta, full tool-call
streaming dance (Start → 3 deltas → End → Finish), byte-boundary fragments
reassemble correctly, cached_tokens/reasoning_tokens map to Usage
Deliberately not in this commit (follow-ups):
- Live integration test (would bill OpenRouter; recorded fixtures cover wire
invariants without spending money)
- models.dev catalog hydration (separate concern, lives in catalog/ module)
- USD cost calculation (needs catalog pricing — tracks with catalog work)
- Image base64 dep — tiny inline encoder for v0; swap to `base64` crate when
a second adapter needs it
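For reference, a tiny inline encoder of the kind mentioned in the last item could be as small as the sketch below. It implements standard padded base64 (RFC 4648 alphabet) using only the standard library; `base64_encode` is a hypothetical name and is not taken from the commit.

```rust
/// Standard base64 alphabet (RFC 4648, section 4).
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Encodes bytes as padded base64.
fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        // Pack up to three bytes into a 24-bit group, zero-filling the tail.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        // The first two symbols are always present.
        out.push(ALPHABET[((n >> 18) & 63) as usize] as char);
        out.push(ALPHABET[((n >> 12) & 63) as usize] as char);
        // The last two symbols become '=' padding when input bytes are missing.
        out.push(if chunk.len() > 1 {
            ALPHABET[((n >> 6) & 63) as usize] as char
        } else {
            '='
        });
        out.push(if chunk.len() > 2 {
            ALPHABET[(n & 63) as usize] as char
        } else {
            '='
        });
    }
    out
}

fn main() {
    println!("{}", base64_encode(b"foobar"));
}
```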
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TLDR
By default OpenRouter can route your request to providers that serve quantized versions of the model [1], which can result in substantially worse output for coding.
This patch attempts to avoid quantized models by setting the `provider.quantizations` parameter to full precision (fp8 or better) when using OpenRouter.

[1] https://openrouter.ai/docs/features/provider-routing#quantization
Dive Deeper
I'm not totally sure this is working at the moment... I tried testing this by setting the `quantizations` field to just `['fp4']`, but my requests aren't being routed to the `fp4` provider that I know is available for the model I was testing...

So for now this PR is more for discussion about the feasibility of this approach. If the maintainers are open to this I'll keep pounding on it and will also add new tests.
Reviewer Test Plan
`qwen/qwen3-coder` model.
`quantizations` field in this patch to be `['fp4']` and start qwen-code.
Testing Matrix