feat(openaiContentGenerator): Avoid quantized models on OpenRouter #348

Closed
mgalgs wants to merge 1 commit into QwenLM:main from mgalgs:main

@mgalgs mgalgs commented Aug 15, 2025

TLDR

By default OpenRouter can route your request to providers that serve quantized versions of the model [1], which can result in substantially worse output for coding.

This patch tries to avoid quantized models by setting the provider.quantizations parameter to allow only higher-precision formats (fp8 or better) when using OpenRouter.

[1] https://openrouter.ai/docs/features/provider-routing#quantization
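To make the idea concrete, here is a minimal TypeScript sketch of such a filter. It is not the code from this PR: the helper names (`isOpenRouter`, `withQuantizationFilter`), the `ChatRequest` shape, and the reading of "fp8 or better" as the four labels below are all assumptions made for illustration.

```typescript
// Quantization labels that OpenRouter documents for `provider.quantizations`.
type Quantization =
  | "int4" | "int8" | "fp4" | "fp6" | "fp8" | "fp16" | "bf16" | "fp32" | "unknown";

// Simplified request shape (assumption: only the fields needed here).
interface ChatRequest {
  model: string;
  messages: { role: string; content: string }[];
  provider?: { quantizations?: Quantization[] };
}

// Assumption: "fp8 or better" is read as exactly these labels.
const FP8_OR_BETTER: Quantization[] = ["fp8", "fp16", "bf16", "fp32"];

// Hypothetical helper: detect OpenRouter from the configured base URL.
function isOpenRouter(baseUrl: string): boolean {
  try {
    const host = new URL(baseUrl).hostname;
    return host === "openrouter.ai" || host.endsWith(".openrouter.ai");
  } catch {
    return false; // not a parseable URL, so treat it as "not OpenRouter"
  }
}

// Returns a copy of the request with the allowlist attached, or the request
// unchanged when the endpoint is not OpenRouter.
function withQuantizationFilter(baseUrl: string, req: ChatRequest): ChatRequest {
  if (!isOpenRouter(baseUrl)) return req;
  return { ...req, provider: { ...req.provider, quantizations: FP8_OR_BETTER } };
}
```

Requests to any other OpenAI-compatible endpoint pass through untouched, so the filter only affects OpenRouter users.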

Dive Deeper

I'm not totally sure this is working at the moment... I tried testing this by setting the quantizations field to just ['fp4'], but my requests aren't being routed to the fp4 provider that I know is available for the model I was testing...

So for now this PR is more for discussion about the feasibility of this approach. If the maintainers are open to this, I'll keep pounding on it and will also add new tests.

Reviewer Test Plan

  • Configure qwen-code to use OpenRouter and the qwen/qwen3-coder model.
  • Temporarily modify the quantizations field in this patch to be ['fp4'] and start qwen-code.
  • Make some requests and verify in your OpenRouter activity dashboard that your requests are being routed to a provider serving the fp4 version of the model (DeepInfra (Turbo) is serving an fp4 version at the moment).

Testing Matrix

|          | 🍏 | 🪟 | 🐧 |
| -------- | -- | -- | -- |
| npm run  |    |    |    |
| npx      |    |    |    |
| Docker   |    |    |    |
| Podman   |    | -  | -  |
| Seatbelt |    | -  | -  |

By default OpenRouter can route your request to providers that serve
quantized versions of the model [1], which can result in substantially worse
output for coding.

Avoid getting quantized models by setting the `provider.quantizations`
parameter to full precision (fp8 or better) when using OpenRouter.

[1] https://openrouter.ai/docs/features/provider-routing#quantization
halfaipg pushed a commit to AIPowerGrid/grid-code that referenced this pull request Aug 16, 2025
Co-authored-by: Gregory Shikhman <shikhman@google.com>
Collaborator

@wenshao wenshao left a comment

No issues found. LGTM! ✅ — gpt-5.4 via Qwen Code /review

Collaborator

wenshao commented Apr 19, 2026

Please resolve the merge conflicts before this is merged. Thanks! — gpt-5.4 via Qwen Code /review

@tanzhenxin
Collaborator

Thanks for the contribution, but closing this one — the premise isn't quite right for the CLI layer.

Provider routing on OpenRouter (which providers, which precisions, fallback behavior) is a user-configurable concern. OpenRouter exposes it through both the API (provider.* fields) and account-level preferences. Hardcoding a quantizations allowlist in qwen-code takes that decision away from users who may legitimately want quantized variants for cost or latency, silently changes behavior for everyone using OpenRouter today, and the list will rot as new precision tiers appear (fp4, mxfp4, etc.).

If finer control from inside qwen-code is desirable, the right shape would be a generic passthrough for OpenRouter provider options, not a baked-in precision opinion. Happy to look at that as a separate proposal.
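As a rough illustration of what a generic passthrough might look like, the sketch below forwards a user-supplied options object into the request's `provider` field without interpreting it. The settings key `openRouterProvider` and the function name are hypothetical. They are not existing qwen-code options.

```typescript
// Opaque to the CLI: whatever the user configured is forwarded unchanged.
type ProviderOptions = Record<string, unknown>;

interface Settings {
  // Hypothetical settings key, not an existing qwen-code option.
  openRouterProvider?: ProviderOptions;
}

interface RequestBody {
  model: string;
  provider?: ProviderOptions;
  [key: string]: unknown;
}

function applyProviderPassthrough(body: RequestBody, settings: Settings): RequestBody {
  const opts = settings.openRouterProvider;
  // Nothing configured: leave the request untouched so OpenRouter's
  // account-level preferences keep applying.
  if (!opts || Object.keys(opts).length === 0) return body;
  return { ...body, provider: { ...opts } };
}
```

With this shape, a user who wants only high-precision providers could set `{ "quantizations": ["fp16", "bf16", "fp32"] }` themselves, while everyone else sees no change in behavior.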

@tanzhenxin tanzhenxin closed this Apr 27, 2026
stubbi added a commit to stubbi/ditto that referenced this pull request May 14, 2026
… (78 total)

OpenRouter is the default BYOK catalog — 200+ models behind one OpenAI-shape
endpoint. Landing it first forces the trait surface against the maximum
amount of provider variance in a single integration.

Wire format verified 2026-05-14 against openrouter.ai/docs:
- /api/reference/authentication: Bearer + HTTP-Referer + X-OpenRouter-Title
  (the rename from X-Title — adapter authors who copy old SDK examples will
  silently lose attribution; we get it right from line one)
- /api/reference/streaming: data: {json}\n\n framing, [DONE] terminator,
  ": OPENROUTER PROCESSING" keepalive comments to ignore
- /guides/routing/provider-selection: provider.quantizations accepts
  int4 | int8 | fp4 | fp6 | fp8 | fp16 | bf16 | fp32 | unknown
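
The streaming framing described above can be sketched as a small incremental parser. This is an illustrative TypeScript version, not the adapter's code (which is Rust). The class and type names are made up, and it assumes `\n` line endings and single-line `data:` fields.

```typescript
type SseEvent = { kind: "data"; json: unknown } | { kind: "done" };

class SseParser {
  private buffer = "";

  // Feed a decoded text fragment; returns every event completed by it.
  push(fragment: string): SseEvent[] {
    this.buffer += fragment;
    const events: SseEvent[] = [];
    let end: number;
    // Events are separated by a blank line; anything after the last
    // separator stays buffered until the next fragment arrives.
    while ((end = this.buffer.indexOf("\n\n")) !== -1) {
      const block = this.buffer.slice(0, end);
      this.buffer = this.buffer.slice(end + 2);
      for (const line of block.split("\n")) {
        if (line.startsWith(":")) continue; // keepalive comment, ignore
        if (!line.startsWith("data:")) continue; // other SSE fields unused here
        const payload = line.slice(5).trim();
        if (payload === "[DONE]") events.push({ kind: "done" });
        else if (payload !== "") events.push({ kind: "data", json: JSON.parse(payload) });
      }
    }
    return events;
  }
}
```

When the input is raw bytes rather than strings, decoding with `new TextDecoder().decode(chunk, { stream: true })` before calling `push` keeps multi-byte characters intact across chunk boundaries.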

Defaults that matter:
- Precision::Exact ON by default → quantizations = ["fp16","bf16","fp32"].
  Refuses OpenRouter's default FP4/Int4 routing — the CJK-encoding-breaks
  failure mode documented in RooCodeInc/Roo-Code#11325, QwenLM/qwen-code#348.
  Mixed precision is opt-in via with_routing().
- DataCollection::Deny by default. Bet 3 trust hinges on users believing
  their prompts aren't training data.
- stream_options.include_usage=true so cost telemetry isn't blocked on the
  socket closing.
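
A sketch of how these defaults could translate into a request body, written in TypeScript for illustration (the commit itself is Rust). `buildBody`, `Routing`, and the lowercase enum values are stand-ins for the commit's `build_body()`, `with_routing()`, `Precision`, and `DataCollection`; their exact signatures are not shown in this message.

```typescript
type Precision = "exact" | "mixed";
type DataCollection = "allow" | "deny";

interface Routing {
  precision: Precision;
  dataCollection: DataCollection;
}

// Defaults as described above: exact precision, data collection denied.
const DEFAULT_ROUTING: Routing = { precision: "exact", dataCollection: "deny" };

interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

// Pure function: no I/O, so it can be unit-tested without HTTP mocks.
function buildBody(model: string, messages: Message[], routing: Routing = DEFAULT_ROUTING) {
  const provider: { data_collection: DataCollection; quantizations?: string[] } = {
    data_collection: routing.dataCollection,
  };
  // "exact" pins the allowlist; "mixed" omits the field entirely so
  // OpenRouter applies its own routing.
  if (routing.precision === "exact") {
    provider.quantizations = ["fp16", "bf16", "fp32"];
  }
  return {
    model,
    messages,
    stream: true,
    // Ask for a usage chunk in the stream instead of waiting for close.
    stream_options: { include_usage: true },
    provider,
  };
}
```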

Adapter shape:
- build_body() is sync + pure — unit-testable without HTTP mocks
- map_chunk() handles tool-call accumulation: ToolCallStart → ToolInputDelta*
  → ToolInputEnd → Finish (closing every started call before Finish)
- Synthesizes tool-call index when upstream doesn't expose it (answers the
  "every tool call is index 0" bug)
- delta.reasoning passthrough → Event::ReasoningDelta (covers DeepSeek and
  Anthropic-extended-thinking routed via OR)
- prompt_tokens_details.cached_tokens → Usage.cache_read_tokens
- completion_tokens_details.reasoning_tokens → Usage.reasoning_tokens
- ProviderError carries typed ErrorClass from HTTP code or message heuristic
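
The tool-call part of this list can be sketched as follows. This is a TypeScript illustration with made-up names, not the commit's `map_chunk()`. It assumes that when the upstream omits `index`, a delta carrying an `id` starts a new call and a delta without one continues the most recent call.

```typescript
interface ToolCallDelta {
  index?: number;
  id?: string;
  function?: { name?: string; arguments?: string };
}

type ToolEvent =
  | { type: "ToolCallStart"; index: number; id: string; name: string }
  | { type: "ToolInputDelta"; index: number; text: string }
  | { type: "ToolInputEnd"; index: number }
  | { type: "Finish"; reason: string };

class ToolCallTracker {
  private started = new Set<number>(); // iterates in insertion order
  private next = 0; // next synthetic index
  private last = -1; // most recently started index

  onDelta(d: ToolCallDelta): ToolEvent[] {
    let index: number;
    if (d.index !== undefined) index = d.index; // upstream index wins
    else if (d.id !== undefined || this.last < 0) index = this.next; // new call
    else index = this.last; // continuation of the latest call

    const out: ToolEvent[] = [];
    if (!this.started.has(index)) {
      this.started.add(index);
      this.next = Math.max(this.next, index + 1);
      this.last = index;
      out.push({
        type: "ToolCallStart",
        index,
        id: d.id ?? `call_${index}`,
        name: d.function?.name ?? "",
      });
    }
    const text = d.function?.arguments;
    if (text) out.push({ type: "ToolInputDelta", index, text });
    return out;
  }

  // Close every started call before emitting Finish.
  finish(reason: string): ToolEvent[] {
    const out: ToolEvent[] = [];
    for (const index of this.started) out.push({ type: "ToolInputEnd", index });
    this.started.clear();
    out.push({ type: "Finish", reason });
    return out;
  }
}
```

Keeping the tracker separate from HTTP handling means the ordering guarantee (every `ToolCallStart` is matched by a `ToolInputEnd` before `Finish`) can be checked with plain unit tests.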

Trait-surface deltas (required to make the adapter ergonomic):
- Call.tools is now Vec<Arc<Tool>> (was Vec<ToolId>). Adapters need
  schema + description, not just IDs, to build the wire shape.
- Projected.tools is now Vec<Arc<Tool>> matching that. The speculative
  BTreeMap<SchemaHash, Bytes> dedup map went away — OpenAI's tools array
  doesn't dedup by schema-hash anyway. Re-add when a provider actually needs
  it (system-prompt-based tool definitions might).
- ToolRegistry::schema_bytes(hash) added for adapters that want to budget
  on-wire token cost before sending.

15 new tests:
- 9 in openrouter_request.rs — body shape: required OpenAI fields,
  Precision::Exact pins quantizations correctly, Precision::Mixed omits the
  field, DataCollection round-trips, tool function-call shape,
  assistant/tool-result message pair round-trips, provider.order +
  provider.ignore, temperature/max_tokens/stop pass-through
- 6 in openrouter_sse.rs — text delta accumulation + Finish + usage, comment
  lines ignored, delta.reasoning maps to ReasoningDelta, full tool-call
  streaming dance (Start → 3 deltas → End → Finish), byte-boundary fragments
  reassemble correctly, cached_tokens/reasoning_tokens map to Usage

Deliberately not in this commit (follow-ups):
- Live integration test (would bill OpenRouter; recorded fixtures cover wire
  invariants without spending money)
- models.dev catalog hydration (separate concern, lives in catalog/ module)
- USD cost calculation (needs catalog pricing — tracks with catalog work)
- Image base64 dep — tiny inline encoder for v0; swap to `base64` crate when
  a second adapter needs it

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>