Scale native qwen35 CUDA lane to beat Ollama on 2B, 4B, and 9B #606
Closed
Description
Summary
Psionic now has a native CUDA qwen35 lane that is ahead of local Ollama on qwen3.5:0.8b decode throughput on this host.
Current documented checkpoint:
- Psionic qwen3.5:0.8b: about 523.20 tok/s decode
- local Ollama qwen3.5:0.8b: about 328.72 tok/s decode
- canonical docs: docs/INFERENCE_ENGINE.md and docs/NON_GPT_OSS_QWEN35_PILOT.md
- current optimization headroom still sits in the native qwen35 CUDA runtime, not in subprocess delegation
Goal
Scale the same native Psionic CUDA lane up the Qwen3.5 Ollama rows and beat local Ollama on the same host for:
- qwen3.5:2b
- qwen3.5:4b
- qwen3.5:9b
The immediate delivery bar is:
- download the Ollama rows locally
- resolve the exact GGUF blob paths and digests for each pulled row
- benchmark Psionic versus local Ollama on the same prompt and token cap for each row
- beat local Ollama on decode throughput for 2b and 4b
- keep one master comparison doc in docs/ updated with Psionic versus Ollama results and the exact checkpoint that produced them
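Resolving the blob path and digest for a pulled row could look roughly like the sketch below. It assumes the usual Ollama model store layout (manifests under manifests/registry.ollama.ai/library/<name>/<tag>, blobs named sha256-<hex>) and the application/vnd.ollama.image.model media type for the weights layer; both should be checked against the actual store on this host. The function name resolve_gguf_blob is hypothetical.

```python
import json
from pathlib import Path

# Assumption: media type Ollama uses for the GGUF weights layer in a manifest.
MODEL_MEDIA_TYPE = "application/vnd.ollama.image.model"


def resolve_gguf_blob(store: Path, name: str, tag: str) -> tuple[Path, str]:
    """Return (blob_path, digest) for the weights layer of one pulled row.

    `store` is the Ollama models directory. The manifest and blob layout
    used here is an assumption about the local install, not a guarantee.
    """
    manifest_path = (
        store / "manifests" / "registry.ollama.ai" / "library" / name / tag
    )
    manifest = json.loads(manifest_path.read_text())
    for layer in manifest.get("layers", []):
        if layer.get("mediaType") == MODEL_MEDIA_TYPE:
            digest = layer["digest"]  # expected form: "sha256:<hex>"
            # Blob files are assumed to be named with "-" in place of ":".
            blob_path = store / "blobs" / digest.replace(":", "-")
            return blob_path, digest
    raise LookupError(f"no model layer found in {manifest_path}")
```

A call such as resolve_gguf_blob(Path.home() / ".ollama" / "models", "qwen3.5", "2b") would cover a default per-user install; a system service install or an OLLAMA_MODELS override puts the store elsewhere.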
Required Work
- Pull qwen3.5:2b, qwen3.5:4b, and qwen3.5:9b through local Ollama and capture the installed blob paths under the Ollama model store.
- Extend the existing qwen35 benchmark runner and docs from the 0.8b pilot to a multi-row comparison matrix.
- Measure native Psionic CUDA decode throughput versus local Ollama on the same prompt and generation cap.
- Identify the first native Psionic regressions that appear as model width and layer count increase.
- Push native CUDA optimizations until Psionic is ahead on 2b and 4b.
- Record the remaining gap and next bottlenecks for 9b if it does not clear in the same pass.
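For the Ollama side of the matrix, a minimal sketch of the measurement arithmetic is below. It relies on the eval_count and eval_duration (nanoseconds) fields that Ollama returns from a non-streaming /api/generate call, and on the num_predict option as the generation cap. The helper names, the Row type, and the table layout are hypothetical and are not the existing qwen35 benchmark runner.

```python
from dataclasses import dataclass


def build_request(model: str, prompt: str, token_cap: int) -> dict:
    """Payload for POST /api/generate with a fixed prompt and generation cap."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": token_cap},
    }


def ollama_decode_tok_s(response: dict) -> float:
    """Decode throughput: generated tokens over decode time (ns -> s)."""
    return response["eval_count"] * 1e9 / response["eval_duration"]


@dataclass
class Row:
    model: str
    psionic_tok_s: float
    ollama_tok_s: float


def render_matrix(rows: list[Row]) -> str:
    """Render one plain-text comparison line per measured row."""
    lines = ["model | psionic tok/s | ollama tok/s | ratio | psionic ahead"]
    for r in rows:
        ratio = r.psionic_tok_s / r.ollama_tok_s
        ahead = "yes" if r.psionic_tok_s > r.ollama_tok_s else "no"
        lines.append(
            f"{r.model} | {r.psionic_tok_s:.2f} | {r.ollama_tok_s:.2f} "
            f"| {ratio:.2f}x | {ahead}"
        )
    return "\n".join(lines)


# Only the documented 0.8b checkpoint is filled in; larger rows get added
# as they are measured on this host.
print(render_matrix([Row("qwen3.5:0.8b", 523.20, 328.72)]))
```

Using the same build_request payload for every row keeps the prompt and token cap identical across models, so differences in the matrix come from the runtimes rather than the workload.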
Notes
- The runtime target is native Psionic CUDA inference. No Ollama subprocess execution belongs in the serving path.
- The current qwen35 lane is already generic at the architecture level; the main unknown is larger-row throughput, not admission.
- Benchmark claims must stay tied to exact host-local measurements and exact artifact digests.
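One way to keep a benchmark claim tied to an exact artifact is to re-hash the GGUF blob before each run and refuse to record results on a mismatch. The sketch below assumes digests are written as sha256:<hex> over the raw file bytes; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return 'sha256:<hex>'."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Read in chunks so multi-gigabyte blobs are never held in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return "sha256:" + h.hexdigest()


def verify_artifact(path: Path, expected_digest: str) -> None:
    """Raise if the file on disk is not the artifact the claim refers to."""
    actual = sha256_digest(path)
    if actual != expected_digest:
        raise ValueError(
            f"digest mismatch for {path}: expected {expected_digest}, got {actual}"
        )
```

Calling verify_artifact at the start of a benchmark run means a re-pulled or updated row cannot silently inherit numbers measured against an older blob.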