
Faster Qwen3-TTS iq4xs #2173

@FNsi

Description


Currently Vulkan is too slow, so I made some changes with AI assistance, and it works. I cannot open a PR from a fork of kcpp because I already forked the original llama.cpp (and I'd like to keep that fork), so I'm filing this as an issue instead.

Edit, one-sentence conclusion: there is no need to dequantize the 4-bit tensors to F32 on the fly, for any backend.

Summary from AI

Session Summary: Qwen3TTS Vulkan Backend Optimization

Initial Problem

The Qwen3TTS implementation was running at RTF=6.2 (0.16x realtime) on the Vulkan GPU backend, meaning it took 6.2 seconds to generate 1 second of audio. The goal was to reach RTF < 1 (faster than real time).
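
For reference, RTF here is generation wall-clock time divided by the duration of the audio produced; a quick sketch of that arithmetic (my illustration, not code from the repo):

// RTF (real-time factor) = seconds spent generating / seconds of audio produced.
// RTF < 1 means faster than real time.
static double rtf(double gen_seconds, double audio_seconds) {
    return gen_seconds / audio_seconds;
}
// rtf(6.2, 1.0) == 6.2, i.e. 1/6.2 ≈ 0.16x realtime, matching the numbers above.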

Investigation Phase

  1. Analyzed timing data with QWEN3_TTS_TIMING enabled
  2. Discovered the real bottleneck: not graph building/allocation (which was only ~3 ms/frame), but the actual computation
  3. Found the root cause: the code was casting the ffn_down weights to F32 before every matmul (see the back-of-the-envelope sketch after this list), forcing the Vulkan backend to:
    • Dequantize the 4-bit weights to F32 on the fly
    • Use an F32 matmul instead of the native quantized matmul (2x memory bandwidth)
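
As promised above, a back-of-the-envelope sketch of the weight-traffic cost (my numbers; the ~4.25 bits/weight figure for iq4_xs is the usual estimate, not something measured in this issue):

// Bytes of weight data the matmul kernel has to read, per weight.
const double bits_per_weight_iq4_xs = 4.25; // 4-bit values plus block scales (assumed)
const double bits_per_weight_f32    = 32.0;
const double traffic_ratio = bits_per_weight_f32 / bits_per_weight_iq4_xs; // ≈ 7.5x
// On top of that, the dequantization pass itself has to write the full F32 copy first.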

The Fix (Minimal Change)

Removed the unnecessary ggml_cast to F32 for the ffn_down weights in 5 places in tts_transformer.cpp:

// BEFORE (slow): cast the weights to F32, then matmul in F32
struct ggml_tensor * ffn_down_f32 = ggml_cast(ctx0, layer.ffn_down, GGML_TYPE_F32);
cur = mul_mat(ctx0, ffn_down_f32, cur);

// AFTER (fast): matmul directly on the quantized weights
cur = mul_mat(ctx0, layer.ffn_down, cur);
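
For context, a minimal self-contained sketch (my example, not the actual tts_transformer.cpp helper) of why the cast is unnecessary: ggml_mul_mat takes the quantized weight tensor as src0 directly and still yields an F32 result, letting each backend dispatch its native quantized kernel:

#include "ggml.h"

// Quantized-weight matmul in ggml: the weight keeps its on-disk type
// (e.g. GGML_TYPE_IQ4_XS), the activations stay F32, and the output is
// F32 either way, so no ggml_cast of the weights is needed.
static struct ggml_tensor * ffn_down_mul(struct ggml_context * ctx,
                                         struct ggml_tensor * ffn_down, // quantized weights
                                         struct ggml_tensor * cur) {    // F32 activations
    return ggml_mul_mat(ctx, ffn_down, cur);
}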

Locations Fixed

  1. build_prefill_forward_graph - Talker prefill
  2. build_step_graph - Talker step
  3. build_code_pred_graph - Code predictor non-AR
  4. build_code_pred_prefill_graph - Code predictor prefill
  5. build_code_pred_step_graph_impl - Code predictor step

Results

Test Case                  Before     After
"Hello world."             RTF=6.2    RTF=0.925
"The quick brown fox..."   RTF=~6.2   RTF=0.874
Long sentence              RTF=~6.2   RTF=0.867
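
Dividing before by after gives 6.2 / 0.925 ≈ 6.7x on the short case and 6.2 / 0.867 ≈ 7.2x on the longest, which is where the ~7x figure below comes from.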

~7x speedup from a tiny diff (10 lines removed, 5 lines added)!

What We Also Tried (But Didn't Work)

  1. Graph caching - attempted to cache step graphs to avoid rebuilding them, but this hit ggml scheduler assertion errors when reusing cached graphs across resets
  2. Non-autoregressive code prediction - built a parallel code predictor with RoPE and a causal mask, but it produced incorrect results (the model was trained autoregressively)
  3. Different quantizations - tested q8_0 against iq4_xs; iq4_xs was already optimal

Key Insight

The performance issue was not about algorithmic complexity or graph overhead. It was a simple issue where the code was unnecessarily forcing F32 computation when the Vulkan backend could handle quantized weights natively. This is a common pattern in ML inference: always let the backend use its native format when possible.
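
If a backend ever did lack a kernel for a given quantized type, a guarded fallback would keep the fast path as the default. A hedged sketch using ggml's real ggml_backend_supports_op (the surrounding variable names, such as backend, are assumptions):

// Try the native quantized matmul first; cast to F32 only if this
// backend reports it cannot run the op.
struct ggml_tensor * mm = ggml_mul_mat(ctx0, layer.ffn_down, cur);
if (!ggml_backend_supports_op(backend, mm)) {
    struct ggml_tensor * w_f32 = ggml_cast(ctx0, layer.ffn_down, GGML_TYPE_F32);
    mm = ggml_mul_mat(ctx0, w_f32, cur);
}
cur = mm;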
