
Scale native qwen35 CUDA lane to beat Ollama on 2B, 4B, and 9B #606

@AtlantisPleb

Description

Summary

Psionic now has a native CUDA qwen35 lane that beats local Ollama on qwen3.5:0.8b decode throughput on this host.

Current documented checkpoint:

  • Psionic qwen3.5:0.8b: ~523.20 tok/s decode
  • local Ollama qwen3.5:0.8b: ~328.72 tok/s decode
  • canonical docs: docs/INFERENCE_ENGINE.md and docs/NON_GPT_OSS_QWEN35_PILOT.md
  • current optimization headroom still sits in the native qwen35 CUDA runtime, not in subprocess delegation
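The checkpoint above can be sanity-checked as a throughput ratio. A minimal sketch, using only the figures quoted in this issue:

```python
def decode_tok_per_s(tokens: int, seconds: float) -> float:
    """Decode throughput: generated tokens divided by wall-clock decode time."""
    return tokens / seconds

# Checkpoint figures from the 0.8b pilot above.
psionic = 523.20
ollama = 328.72
speedup = psionic / ollama
print(f"Psionic is {speedup:.2f}x local Ollama on qwen3.5:0.8b decode")
```

At these numbers the ratio comes out to roughly 1.59x; the goal below is to hold a ratio above 1.0 as the rows grow.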

Goal

Scale the same native Psionic CUDA lane to the larger Qwen3.5 Ollama rows and beat local Ollama on the same host for:

  • qwen3.5:2b
  • qwen3.5:4b
  • qwen3.5:9b

The immediate delivery bar is:

  • download the Ollama rows locally
  • resolve the exact GGUF blob paths and digests for each pulled row
  • benchmark Psionic versus local Ollama on the same prompt and token cap for each row
  • beat local Ollama on decode throughput for 2b and 4b
  • keep one master comparison doc in docs/ updated with Psionic versus Ollama results and the exact checkpoint that produced them
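Resolving the exact GGUF blob path and digest for each pulled row can be sketched against Ollama's standard on-disk store (JSON manifests whose "layers" carry a mediaType and sha256 digest, with blobs stored as `sha256-<hex>` files). The function name and the exact store root are assumptions for illustration:

```python
import json
from pathlib import Path

def gguf_blob_path(manifest_path: Path, store: Path) -> tuple[Path, str]:
    """Resolve the GGUF model blob for a pulled Ollama row.

    Assumes the standard Ollama store layout: manifests are JSON files
    whose "layers" entries carry a mediaType and a sha256 digest, and
    blobs live under <store>/blobs/sha256-<hex>.
    """
    manifest = json.loads(manifest_path.read_text())
    for layer in manifest["layers"]:
        if layer["mediaType"] == "application/vnd.ollama.image.model":
            digest = layer["digest"]  # e.g. "sha256:abc..."
            blob = store / "blobs" / digest.replace(":", "-")
            return blob, digest
    raise KeyError("no model layer in manifest")
```

On a default install the store root is typically `Path.home() / ".ollama" / "models"`, but that should be confirmed against the host being benchmarked.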

Required Work

  • Pull qwen3.5:2b, qwen3.5:4b, and qwen3.5:9b through local Ollama and capture the installed blob paths under the Ollama model store.
  • Extend the existing qwen35 benchmark runner and docs from the 0.8b pilot to a multi-row comparison matrix.
  • Measure native Psionic CUDA decode throughput versus local Ollama on the same prompt and generation cap.
  • Identify the first native Psionic regressions that appear as model width and layer count increase.
  • Push native CUDA optimizations until Psionic is ahead on 2b and 4b.
  • Record the remaining gap and next bottlenecks for 9b if it does not clear in the same pass.
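The apples-to-apples measurement above (same prompt, same generation cap, decode-only clock) can be sketched as a small timing helper. The `stream` adapter is hypothetical; it stands in for whatever interface the benchmark runner uses to pull tokens from Psionic or Ollama:

```python
import time
from typing import Callable, Iterable

def bench_decode(stream: Callable[[str, int], Iterable[str]],
                 prompt: str, max_tokens: int) -> float:
    """Time a token stream and return decode tokens/second.

    `stream` is a hypothetical engine adapter yielding one token at a
    time; the clock starts at the first token so prefill is excluded,
    and the first token itself is not counted toward decode rate.
    """
    start = None
    count = 0
    for _tok in stream(prompt, max_tokens):
        if start is None:
            start = time.perf_counter()  # first token: decode clock starts
        count += 1
    if start is None or count < 2:
        return 0.0
    elapsed = time.perf_counter() - start
    return (count - 1) / elapsed if elapsed > 0 else 0.0
```

Running the same helper against both engines on identical prompts keeps the comparison tied to one host and one measurement method.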

Notes

  • The runtime target is native Psionic CUDA inference. No Ollama subprocess execution belongs in the serving path.
  • The current qwen35 lane is already generic at the architecture level; the main unknown is larger-row throughput, not admission.
  • Benchmark claims must stay tied to exact host-local measurements and exact artifact digests.
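Tying results to exact artifact digests can be done by hashing each GGUF blob at benchmark time. A minimal sketch (function name is an assumption; streamed so multi-GB blobs are not read into memory at once):

```python
import hashlib
from pathlib import Path

def file_digest(path: Path, chunk: int = 1 << 20) -> str:
    """sha256 of an artifact, streamed in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return "sha256:" + h.hexdigest()
```

Recording this digest next to each throughput number in the comparison doc makes every claim reproducible against the exact blob that produced it.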
