
Scale native qwen35 CUDA lane to beat Ollama on 2B, 4B, and 9B #606

@AtlantisPleb

Description

Summary

Psionic now has a native CUDA qwen35 lane that beats local Ollama on qwen3.5:0.8b decode throughput on this host.

Current documented checkpoint:

  • Psionic qwen3.5:0.8b: ~523.20 tok/s decode
  • local Ollama qwen3.5:0.8b: ~328.72 tok/s decode
  • canonical docs: docs/INFERENCE_ENGINE.md and docs/NON_GPT_OSS_QWEN35_PILOT.md
  • current optimization headroom still sits in the native qwen35 CUDA runtime, not in subprocess delegation
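The checkpoint above can be sanity-checked as a throughput ratio. A minimal sketch, using only the figures quoted in this issue:

```python
def decode_tok_per_s(tokens: int, seconds: float) -> float:
    """Decode throughput: generated tokens divided by wall-clock decode time."""
    return tokens / seconds

# Checkpoint figures from the 0.8b pilot above.
psionic = 523.20
ollama = 328.72
speedup = psionic / ollama
print(f"Psionic is {speedup:.2f}x local Ollama on qwen3.5:0.8b decode")
```

At these numbers the ratio comes out to roughly 1.59x; the goal below is to hold a ratio above 1.0 as the rows grow.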

Goal

Scale the same native Psionic CUDA lane to the larger Qwen3.5 Ollama rows and beat local Ollama on the same host for:

  • qwen3.5:2b
  • qwen3.5:4b
  • qwen3.5:9b

The immediate delivery bar is:

  • download the Ollama rows locally
  • resolve the exact GGUF blob paths and digests for each pulled row
  • benchmark Psionic versus local Ollama on the same prompt and token cap for each row
  • beat local Ollama on decode throughput for 2b and 4b
  • keep one master comparison doc in docs/ updated with Psionic versus Ollama results and the exact checkpoint that produced them
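Resolving the exact GGUF blob path and digest for each pulled row can be sketched against Ollama's standard on-disk store (JSON manifests whose "layers" carry a mediaType and sha256 digest, with blobs stored as `sha256-<hex>` files). The function name and the exact store root are assumptions for illustration:

```python
import json
from pathlib import Path

def gguf_blob_path(manifest_path: Path, store: Path) -> tuple[Path, str]:
    """Resolve the GGUF model blob for a pulled Ollama row.

    Assumes the standard Ollama store layout: manifests are JSON files
    whose "layers" entries carry a mediaType and a sha256 digest, and
    blobs live under <store>/blobs/sha256-<hex>.
    """
    manifest = json.loads(manifest_path.read_text())
    for layer in manifest["layers"]:
        if layer["mediaType"] == "application/vnd.ollama.image.model":
            digest = layer["digest"]  # e.g. "sha256:abc..."
            blob = store / "blobs" / digest.replace(":", "-")
            return blob, digest
    raise KeyError("no model layer in manifest")
```

On a default install the store root is typically `Path.home() / ".ollama" / "models"`, but that should be confirmed against the host being benchmarked.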

Required Work

  • Pull qwen3.5:2b, qwen3.5:4b, and qwen3.5:9b through local Ollama and capture the installed blob paths under the Ollama model store.
  • Extend the existing qwen35 benchmark runner and docs from the 0.8b pilot to a multi-row comparison matrix.
  • Measure native Psionic CUDA decode throughput versus local Ollama on the same prompt and generation cap.
  • Identify the first native Psionic regressions that appear as model width and layer count increase.
  • Push native CUDA optimizations until Psionic is ahead on 2b and 4b.
  • Record the remaining gap and next bottlenecks for 9b if it does not clear in the same pass.
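The apples-to-apples measurement above (same prompt, same generation cap, decode-only clock) can be sketched as a small timing helper. The `stream` adapter is hypothetical; it stands in for whatever interface the benchmark runner uses to pull tokens from Psionic or Ollama:

```python
import time
from typing import Callable, Iterable

def bench_decode(stream: Callable[[str, int], Iterable[str]],
                 prompt: str, max_tokens: int) -> float:
    """Time a token stream and return decode tokens/second.

    `stream` is a hypothetical engine adapter yielding one token at a
    time; the clock starts at the first token so prefill is excluded,
    and the first token itself is not counted toward decode rate.
    """
    start = None
    count = 0
    for _tok in stream(prompt, max_tokens):
        if start is None:
            start = time.perf_counter()  # first token: decode clock starts
        count += 1
    if start is None or count < 2:
        return 0.0
    elapsed = time.perf_counter() - start
    return (count - 1) / elapsed if elapsed > 0 else 0.0
```

Running the same helper against both engines on identical prompts keeps the comparison tied to one host and one measurement method.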

Notes

  • The runtime target is native Psionic CUDA inference. No Ollama subprocess execution belongs in the serving path.
  • The current qwen35 lane is already generic at the architecture level; the main unknown is larger-row throughput, not admission.
  • Benchmark claims must stay tied to exact host-local measurements and exact artifact digests.
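Tying results to exact artifact digests can be done by hashing each GGUF blob at benchmark time. A minimal sketch (function name is an assumption; streamed so multi-GB blobs are not read into memory at once):

```python
import hashlib
from pathlib import Path

def file_digest(path: Path, chunk: int = 1 << 20) -> str:
    """sha256 of an artifact, streamed in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return "sha256:" + h.hexdigest()
```

Recording this digest next to each throughput number in the comparison doc makes every claim reproducible against the exact blob that produced it.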
