
Faster Qwen3-TTS iq4xs #2173

@FNsi

Description


Currently Vulkan is too slow, so I made some changes with AI assistance, and it works. I cannot open a PR from a fork of kcpp because I already forked the original llama.cpp (and I'd like to keep that fork), so I'm filing this as an issue instead.

Edit, one-sentence conclusion: there is no need to dequantize the 4-bit tensors to F32 on the fly, for any backend.

Summary from AI

Session Summary: Qwen3TTS Vulkan Backend Optimization

Initial Problem

The Qwen3TTS implementation was running at RTF=6.2 (0.16x realtime) on the Vulkan GPU backend, meaning it took 6.2 seconds to generate 1 second of audio. The goal was to reach RTF < 1 (faster than real time).
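
For reference, RTF here is generation wall-clock time divided by the duration of the audio produced; a quick sketch of that arithmetic (my illustration, not code from the repo):

// RTF (real-time factor) = seconds spent generating / seconds of audio produced.
// RTF < 1 means faster than real time.
static double rtf(double gen_seconds, double audio_seconds) {
    return gen_seconds / audio_seconds;
}
// rtf(6.2, 1.0) == 6.2, i.e. 1/6.2 ≈ 0.16x realtime, matching the numbers above.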

Investigation Phase

  1. Analyzed timing data with QWEN3_TTS_TIMING enabled
  2. Discovered the real bottleneck: not graph building/allocation (which was only ~3 ms/frame), but the actual computation
  3. Found the root cause: the code was casting the ffn_down weights to F32 before every matmul (see the back-of-the-envelope sketch after this list), forcing the Vulkan backend to:
    • Dequantize the 4-bit weights to F32 on the fly
    • Use an F32 matmul instead of the native quantized matmul (2x memory bandwidth)
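
As promised above, a back-of-the-envelope sketch of the weight-traffic cost (my numbers; the ~4.25 bits/weight figure for iq4_xs is the usual estimate, not something measured in this issue):

// Bytes of weight data the matmul kernel has to read, per weight.
const double bits_per_weight_iq4_xs = 4.25; // 4-bit values plus block scales (assumed)
const double bits_per_weight_f32    = 32.0;
const double traffic_ratio = bits_per_weight_f32 / bits_per_weight_iq4_xs; // ≈ 7.5x
// On top of that, the dequantization pass itself has to write the full F32 copy first.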

The Fix (Minimal Change)

Removed the unnecessary ggml_cast to F32 for the ffn_down weights in 5 places in tts_transformer.cpp:

// BEFORE (slow): cast the weights to F32, then matmul in F32
struct ggml_tensor * ffn_down_f32 = ggml_cast(ctx0, layer.ffn_down, GGML_TYPE_F32);
cur = mul_mat(ctx0, ffn_down_f32, cur);

// AFTER (fast): matmul directly on the quantized weights
cur = mul_mat(ctx0, layer.ffn_down, cur);
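
For context, a minimal self-contained sketch (my example, not the actual tts_transformer.cpp helper) of why the cast is unnecessary: ggml_mul_mat takes the quantized weight tensor as src0 directly and still yields an F32 result, letting each backend dispatch its native quantized kernel:

#include "ggml.h"

// Quantized-weight matmul in ggml: the weight keeps its on-disk type
// (e.g. GGML_TYPE_IQ4_XS), the activations stay F32, and the output is
// F32 either way, so no ggml_cast of the weights is needed.
static struct ggml_tensor * ffn_down_mul(struct ggml_context * ctx,
                                         struct ggml_tensor * ffn_down, // quantized weights
                                         struct ggml_tensor * cur) {    // F32 activations
    return ggml_mul_mat(ctx, ffn_down, cur);
}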

Locations Fixed

  1. build_prefill_forward_graph - Talker prefill
  2. build_step_graph - Talker step
  3. build_code_pred_graph - Code predictor non-AR
  4. build_code_pred_prefill_graph - Code predictor prefill
  5. build_code_pred_step_graph_impl - Code predictor step

Results

Test Case                  Before     After
"Hello world."             RTF=6.2    RTF=0.925
"The quick brown fox..."   RTF=~6.2   RTF=0.874
Long sentence              RTF=~6.2   RTF=0.867
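
Dividing before by after gives 6.2 / 0.925 ≈ 6.7x on the short case and 6.2 / 0.867 ≈ 7.2x on the longest, which is where the ~7x figure below comes from.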

~7x speedup from a tiny diff (10 lines removed, 5 lines added)!

What We Also Tried (But Didn't Work)

  1. Graph caching - attempted to cache step graphs to avoid rebuilding them, but this hit ggml scheduler assertion errors when reusing cached graphs across resets
  2. Non-autoregressive code prediction - built a parallel code predictor with RoPE and a causal mask, but it produced incorrect results (the model was trained autoregressively)
  3. Different quantizations - tested q8_0 against iq4_xs; iq4_xs was already optimal

Key Insight

The performance issue was not about algorithmic complexity or graph overhead. It was a simple issue where the code was unnecessarily forcing F32 computation when the Vulkan backend could handle quantized weights natively. This is a common pattern in ML inference: always let the backend use its native format when possible.
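
If a backend ever did lack a kernel for a given quantized type, a guarded fallback would keep the fast path as the default. A hedged sketch using ggml's real ggml_backend_supports_op (the surrounding variable names, such as backend, are assumptions):

// Try the native quantized matmul first; cast to F32 only if this
// backend reports it cannot run the op.
struct ggml_tensor * mm = ggml_mul_mat(ctx0, layer.ffn_down, cur);
if (!ggml_backend_supports_op(backend, mm)) {
    struct ggml_tensor * w_f32 = ggml_cast(ctx0, layer.ffn_down, GGML_TYPE_F32);
    mm = ggml_mul_mat(ctx0, w_f32, cur);
}
cur = mm;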
