feat(openaiContentGenerator): Avoid quantized models on OpenRouter by mgalgs · Pull Request #348 · QwenLM/qwen-code

mgalgs · 2025-08-15T23:20:46Z

TLDR

By default OpenRouter can route your request to providers that serve quantized versions of the model [1], which can result in substantially worse output for coding.

This patch is an attempt at avoiding getting quantized models by setting the provider.quantizations parameter to full precision (fp8 or better) when using OpenRouter.

[1] https://openrouter.ai/docs/features/provider-routing#quantization
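
For discussion, here is a minimal sketch of the idea rather than the actual patch. It assumes that "fp8 or better" means the allowlist `fp8`, `fp16`, `bf16`, `fp32`, and that OpenRouter can be detected from the configured base URL. `withQuantizationFilter`, `isOpenRouter`, and `ChatRequestBody` are hypothetical names used only for illustration.

```typescript
// Quantization labels that OpenRouter documents for provider.quantizations.
type Quantization =
  | "int4" | "int8" | "fp4" | "fp6" | "fp8"
  | "fp16" | "bf16" | "fp32" | "unknown";

// Hypothetical, trimmed-down request body shape.
interface ChatRequestBody {
  model: string;
  messages: { role: string; content: string }[];
  provider?: { quantizations?: Quantization[] };
}

// Assumption: "fp8 or better" maps to this allowlist; the real patch may differ.
const FP8_OR_BETTER: Quantization[] = ["fp8", "fp16", "bf16", "fp32"];

// Hypothetical detection based on the configured base URL.
function isOpenRouter(baseUrl: string): boolean {
  return baseUrl.indexOf("openrouter.ai") !== -1;
}

// Returns a copy of the body with provider.quantizations set when the
// request targets OpenRouter; other endpoints get the body back unchanged.
function withQuantizationFilter(
  body: ChatRequestBody,
  baseUrl: string,
  allowed: Quantization[] = FP8_OR_BETTER,
): ChatRequestBody {
  if (!isOpenRouter(baseUrl)) return body;
  return { ...body, provider: { ...body.provider, quantizations: allowed } };
}
```

Passing `["fp4"]` as the third argument corresponds to the temporary change described in the Reviewer Test Plan.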

Dive Deeper

I'm not totally sure this is working at the moment. I tried testing it by setting the quantizations field to just ['fp4'], but my requests aren't being routed to the fp4 provider that I know is available for the model I was testing.

So for now this PR is more for discussion about the feasibility of this approach. If the maintainers are open to it, I'll keep pounding on it and will also add new tests.

Reviewer Test Plan

1. Configure qwen-code to use OpenRouter and the qwen/qwen3-coder model.
2. Temporarily modify the quantizations field in this patch to be ['fp4'] and start qwen-code.
3. Make some requests and verify in your OpenRouter activity dashboard that your requests are being routed to a provider serving the fp4 version of the model (DeepInfra (Turbo) is serving an fp4 version at the moment).

Testing Matrix

|          | 🍏 | 🪟 | 🐧 |
| -------- | --- | --- | --- |
| npm run  | ❓ | ❓ | ✅ |
| npx      | ❓ | ❓ | ❓ |
| Docker   | ❓ | ❓ | ❓ |
| Podman   | ❓ | - | - |
| Seatbelt | ❓ | - | - |

By default OpenRouter can route your request to providers that serve quantized versions of the model [1], which can result in substantially worse output for coding. Avoid getting quantized models by setting the `provider.quantizations` parameter to full precision (fp8 or better) when using OpenRouter. [1] https://openrouter.ai/docs/features/provider-routing#quantization

Co-authored-by: Gregory Shikhman <shikhman@google.com>

wenshao

No issues found. LGTM! ✅ — gpt-5.4 via Qwen Code /review

wenshao · 2026-04-19T10:01:58Z

Please resolve the merge conflicts before this is merged. Thanks! — gpt-5.4 via Qwen Code /review

tanzhenxin · 2026-04-27T02:06:08Z

Thanks for the contribution, but closing this one — the premise isn't quite right for the CLI layer.

Provider routing on OpenRouter (which providers, which precisions, fallback behavior) is a user-configurable concern. OpenRouter exposes it through both the API (provider.* fields) and account-level preferences. Hardcoding a quantizations allowlist in qwen-code takes that decision away from users who may legitimately want quantized variants for cost or latency, silently changes behavior for everyone using OpenRouter today, and the list will rot as new precision tiers appear (fp4, mxfp4, etc.).

If finer control from inside qwen-code is desirable, the right shape would be a generic passthrough for OpenRouter provider options, not a baked-in precision opinion. Happy to look at that as a separate proposal.
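
A generic passthrough along those lines might look like the sketch below. It is only an illustration under stated assumptions: `ProviderOptions`, `PassthroughRequestBody`, and `applyProviderPassthrough` are hypothetical names, and where the user-supplied options would come from (settings file, environment, flag) is left open.

```typescript
// Hypothetical: an opaque object supplied by the user and forwarded as-is.
// qwen-code would not interpret or validate individual provider.* fields.
type ProviderOptions = Record<string, unknown>;

// Hypothetical, trimmed-down request body shape.
interface PassthroughRequestBody {
  model: string;
  messages: { role: string; content: string }[];
  provider?: ProviderOptions;
}

// Forwards user-supplied options under `provider`. With no options set,
// the body is returned unchanged, so existing behavior is preserved.
function applyProviderPassthrough(
  body: PassthroughRequestBody,
  userOptions?: ProviderOptions,
): PassthroughRequestBody {
  if (!userOptions || Object.keys(userOptions).length === 0) return body;
  return { ...body, provider: { ...userOptions } };
}
```

With this shape, a user who wants unquantized routing could supply `{ quantizations: ["fp16", "bf16", "fp32"] }` themselves, and nothing changes for users who supply nothing.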

OpenRouter is the default BYOK catalog: 200+ models behind one OpenAI-shape endpoint. Landing it first forces the trait surface against the maximum amount of provider variance in a single integration.

Wire format verified 2026-05-14 against openrouter.ai/docs:
- /api/reference/authentication: Bearer + HTTP-Referer + X-OpenRouter-Title (the rename from X-Title; adapter authors who copy old SDK examples will silently lose attribution, we get it right from line one)
- /api/reference/streaming: data: {json}\n\n framing, [DONE] terminator, ": OPENROUTER PROCESSING" keepalive comments to ignore
- /guides/routing/provider-selection: provider.quantizations accepts int4 | int8 | fp4 | fp6 | fp8 | fp16 | bf16 | fp32 | unknown

Defaults that matter:
- Precision::Exact ON by default → quantizations = ["fp16","bf16","fp32"]. Refuses OpenRouter's default FP4/Int4 routing, the CJK-encoding-breaks failure mode documented in RooCodeInc/Roo-Code#11325, QwenLM/qwen-code#348. Mixed precision is opt-in via with_routing().
- DataCollection::Deny by default. Bet 3 trust hinges on users believing their prompts aren't training data.
- stream_options.include_usage=true so cost telemetry isn't blocked on the socket closing.
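
The streaming framing described above (`data: {json}` events separated by a blank line, a `[DONE]` terminator, and comment lines beginning with `:`) can be sketched as a small incremental parser. This is an illustration in TypeScript, not the referenced adapter's code: `SseParser` and `SseParseResult` are hypothetical names, and the sketch assumes LF-only line endings and input already decoded to text.

```typescript
// Result of feeding one chunk of text to the parser.
interface SseParseResult {
  payloads: unknown[]; // parsed JSON bodies of complete `data:` lines
  done: boolean;       // true once `data: [DONE]` has been seen
}

// Minimal incremental parser. Incomplete events stay buffered until the
// blank-line separator arrives, so chunks may split an event anywhere.
class SseParser {
  private buffer = "";
  private finished = false;

  feed(chunk: string): SseParseResult {
    this.buffer += chunk;
    const payloads: unknown[] = [];
    let sep = this.buffer.indexOf("\n\n");
    while (sep !== -1) {
      const rawEvent = this.buffer.slice(0, sep);
      this.buffer = this.buffer.slice(sep + 2);
      for (const line of rawEvent.split("\n")) {
        if (this.finished) break;                    // ignore anything after [DONE]
        if (line.charAt(0) === ":") continue;        // keepalive comment
        if (line.slice(0, 5) !== "data:") continue;  // other SSE fields are skipped
        const data = line.slice(5).trim();
        if (data === "[DONE]") {
          this.finished = true;
          break;
        }
        payloads.push(JSON.parse(data));
      }
      sep = this.buffer.indexOf("\n\n");
    }
    return { payloads, done: this.finished };
  }
}
```

Feeding `': OPENROUTER PROCESSING\n\ndata: {"cho'` yields no payloads, because the comment is ignored and the second event is still incomplete. The payload appears once a later chunk supplies the rest of the event and its blank-line separator.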
Adapter shape:
- build_body() is sync + pure: unit-testable without HTTP mocks
- map_chunk() handles tool-call accumulation: ToolCallStart → ToolInputDelta* → ToolInputEnd → Finish (closing every started call before Finish)
- Synthesizes tool-call index when upstream doesn't expose it (answers the "every tool call is index 0" bug)
- delta.reasoning passthrough → Event::ReasoningDelta (covers DeepSeek and Anthropic-extended-thinking routed via OR)
- prompt_tokens_details.cached_tokens → Usage.cache_read_tokens
- completion_tokens_details.reasoning_tokens → Usage.reasoning_tokens
- ProviderError carries typed ErrorClass from HTTP code or message heuristic

Trait-surface deltas (required to make the adapter ergonomic):
- Call.tools is now Vec<Arc<Tool>> (was Vec<ToolId>). Adapters need schema + description, not just IDs, to build the wire shape.
- Projected.tools is now Vec<Arc<Tool>> matching that. The speculative BTreeMap<SchemaHash, Bytes> dedup map went away; OpenAI's tools array doesn't dedup by schema-hash anyway. Re-add when a provider actually needs it (system-prompt-based tool definitions might).
- ToolRegistry::schema_bytes(hash) added for adapters that want to budget on-wire token cost before sending.
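
The tool-call event sequence in the adapter-shape list (ToolCallStart → ToolInputDelta* → ToolInputEnd → Finish, with a synthesized index) can be illustrated as follows. This is a TypeScript sketch, not the referenced Rust adapter, and its index heuristic is an assumption: a delta carrying a new `id` starts a new call with the next sequential index, the upstream `index` is ignored, and calls are assumed to arrive one after another rather than interleaved. `ToolCallMapper` and the type names are hypothetical.

```typescript
type StreamEvent =
  | { type: "ToolCallStart"; index: number; id: string; name: string }
  | { type: "ToolInputDelta"; index: number; delta: string }
  | { type: "ToolInputEnd"; index: number }
  | { type: "Finish"; reason: string };

// Subset of an OpenAI-style streaming chunk that matters for tool calls.
interface ToolCallDelta {
  index?: number;
  id?: string;
  function?: { name?: string; arguments?: string };
}
interface Chunk {
  choices: {
    delta?: { tool_calls?: ToolCallDelta[] };
    finish_reason?: string | null;
  }[];
}

class ToolCallMapper {
  private open: number[] = []; // synthesized indices of started, unclosed calls
  private nextIndex = 0;
  private currentId: string | undefined = undefined;
  private currentIndex = -1;

  mapChunk(chunk: Chunk): StreamEvent[] {
    const events: StreamEvent[] = [];
    for (const choice of chunk.choices) {
      for (const tc of choice.delta?.tool_calls ?? []) {
        // Assumption: an unseen id marks the start of a new tool call.
        if (tc.id !== undefined && tc.id !== this.currentId) {
          this.currentId = tc.id;
          this.currentIndex = this.nextIndex++;
          this.open.push(this.currentIndex);
          events.push({
            type: "ToolCallStart",
            index: this.currentIndex,
            id: tc.id,
            name: tc.function?.name ?? "",
          });
        }
        // Non-empty argument fragments attach to the most recent call.
        const args = tc.function?.arguments;
        if (args && this.currentIndex >= 0) {
          events.push({ type: "ToolInputDelta", index: this.currentIndex, delta: args });
        }
      }
      if (choice.finish_reason) {
        // Close every started call before emitting Finish.
        for (const index of this.open) events.push({ type: "ToolInputEnd", index });
        this.open = [];
        events.push({ type: "Finish", reason: choice.finish_reason });
      }
    }
    return events;
  }
}
```

Under this heuristic, two calls that both arrive with an upstream index of 0 are still emitted with indices 0 and 1, which is one way to sidestep the "every tool call is index 0" problem.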
15 new tests:
- 9 in openrouter_request.rs (body shape): required OpenAI fields, Precision::Exact pins quantizations correctly, Precision::Mixed omits the field, DataCollection round-trips, tool function-call shape, assistant/tool-result message pair round-trips, provider.order + provider.ignore, temperature/max_tokens/stop pass-through
- 6 in openrouter_sse.rs: text delta accumulation + Finish + usage, comment lines ignored, delta.reasoning maps to ReasoningDelta, full tool-call streaming dance (Start → 3 deltas → End → Finish), byte-boundary fragments reassemble correctly, cached_tokens/reasoning_tokens map to Usage

Deliberately not in this commit (follow-ups):
- Live integration test (would bill OpenRouter; recorded fixtures cover wire invariants without spending money)
- models.dev catalog hydration (separate concern, lives in catalog/ module)
- USD cost calculation (needs catalog pricing; tracks with catalog work)
- Image base64 dep: tiny inline encoder for v0; swap to `base64` crate when a second adapter needs it

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

halfaipg pushed a commit to AIPowerGrid/grid-code that referenced this pull request Aug 16, 2025

Add UI memory indicator. (QwenLM#348)

74c6fe5

Co-authored-by: Gregory Shikhman <shikhman@google.com>

wenshao approved these changes Apr 19, 2026


tanzhenxin closed this Apr 27, 2026

mgalgs wants to merge 1 commit into QwenLM:main from mgalgs:main
