Misc. bug: cannot stop output

### Name and Version

v0.3.0

### Operating systems

Linux

### Which llama.cpp modules do you know to be affected?

llama-server

### Command line

```shell
~/beellama.cpp/build/bin/llama-server -m ~/models/Qwen3.6-27B/Qwen3.6-27B-Q4_K_M-MTP.gguf --spec-type draft-mtp --spec-draft-n-max 2 --parallel 1 --host 0.0.0.0 --port 8080 --no-mmap -ctk q4_0 -ctv q4_0 --ctx-size 131072
```

### Problem description & steps to reproduce

Using the v0.3.0 branch, I asked the LLM to generate an HTML page. The first thing I noticed is that it generates the entire code inside the thinking process. It then keeps emitting `I still have errors. I'll write it more carefully.` and writes the code again, so the output never stops.

With the same prompt on an older llama version, I think the code is written only after the thinking process has finished.

### First Bad Commit

_No response_

### Relevant log output

<details>
<summary>Logs</summary>


```
0.00.061.256 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.061.261 I device_info:
0.00.209.057 I   - CUDA0   : Tesla V100-SXM2-16GB (16144 MiB, 15833 MiB free)
0.00.325.048 I   - CUDA1   : Tesla V100-SXM2-16GB (16144 MiB, 15833 MiB free)
0.00.325.064 I   - CPU     : Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz (31931 MiB, 31931 MiB free)
0.00.325.188 I system_info: n_threads = 14 (n_threads_batch = 14) / 28 | CUDA : ARCHS = 700 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
0.00.325.310 I srv          init: using 27 threads for HTTP server
0.00.325.493 I srv         start: binding port with default address family
0.00.326.812 I srv  llama_server: loading model
0.00.326.825 I srv    load_model: loading model '/home/lunaon/models/Qwen3.6-27B/Qwen3.6-27B-Q4_K_M-MTP.gguf'
0.01.057.228 I srv    load_model: [spec] estimated memory usage of MTP context is 2121.07 MiB
0.01.057.245 I srv    load_model: auto-enabled kv-unified: spec decode backup doesn't need separate KV stream
0.01.057.255 I common_init_result: fitting params to device memory ...
0.01.057.256 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.33.342.188 W llama_context: n_ctx_seq (131072) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.33.895.843 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
TCQ decode: context-adaptive V alpha enabled
0.34.095.169 I srv    load_model: shrunk recurrent state to 1 cells before draft load (deferred 1 backup cells)
0.34.095.178 I srv    load_model: creating MTP draft context against the target model '/home/lunaon/models/Qwen3.6-27B/Qwen3.6-27B-Q4_K_M-MTP.gguf'
0.34.095.215 W llama_context: n_ctx_seq (131072) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.34.636.023 I srv    load_model: expanded recurrent state to 2 cells before speculative GPU buffers
0.34.636.028 I srv    load_model: initializing slots, n_slots = 1
0.35.295.128 I common_speculative_impl_draft_mtp: adding speculative implementation 'draft-mtp'
0.35.295.136 I common_speculative_impl_draft_mtp: - n_max=2, n_min=0, p_min=0.00, n_embd=5120, backend_sampling=1
0.35.295.138 I common_speculative_impl_draft_mtp: - gpu_layers=-1, cache_k=f16, cache_v=f16, ctx_tgt=yes, ctx_dft=yes, devices=[default]
0.35.295.310 I srv    load_model: speculative decoding context initialized
0.35.295.316 I slot   load_model: id  0 | task -1 | speculative decoding context initialized
0.35.295.317 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 131072
0.35.295.407 I srv    load_model: prompt cache is enabled, size limit: 8192 MiB
0.35.295.409 I srv    load_model: use `--cache-ram 0` to disable the prompt cache
0.35.295.410 I srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
0.35.295.411 I srv    load_model: context checkpoints enabled, max = 32, min spacing = 256
0.35.295.457 I srv          init: idle slots will be saved to prompt cache and cleared upon starting a new task
0.35.326.975 I init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>
'
0.35.350.422 I srv          init: init: chat template, thinking = 1
0.35.350.470 I srv  llama_server: model loaded
0.35.350.476 I srv  llama_server: server is listening on http://0.0.0.0:8080
0.35.350.491 I srv  update_slots: all slots are idle
1.18.038.356 I srv  params_from_: Chat format: peg-native
1.18.038.650 I slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
1.18.129.792 I srv  recurrent_sh: shrunk recurrent state to 1 cells for prompt cache (before prompt cache save/load, removed 1 backup cells)
1.18.129.797 I srv  get_availabl: updating prompt cache
1.18.129.802 I srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
1.18.129.808 I srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 131072 tokens, 8589934592 est)
1.18.201.455 I srv  recurrent_ex: expanded recurrent state to 2 cells after prompt cache (after prompt cache save/load)
1.18.201.460 I srv  get_availabl: prompt cache update took 71.66 ms
1.18.201.519 I reasoning-budget: activated, budget=2147483647 tokens
1.18.201.541 I slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
1.20.166.308 I slot create_check: id  0 | task 0 | created context checkpoint 1 of 32 (pos_min = 239, pos_max = 239, n_tokens = 240, size = 150.568 MiB)
1.22.603.194 I slot print_timing: id  0 | task 0 | n_decoded =    100, tg =  41.97 t/s
1.25.650.471 I slot print_timing: id  0 | task 0 | n_decoded =    215, tg =  39.60 t/s
1.28.694.704 I slot print_timing: id  0 | task 0 | n_decoded =    318, tg =  37.53 t/s
1.31.743.199 I slot print_timing: id  0 | task 0 | n_decoded =    428, tg =  37.14 t/s
1.34.765.830 I slot print_timing: id  0 | task 0 | n_decoded =    596, tg =  40.98 t/s
1.37.772.368 I slot print_timing: id  0 | task 0 | n_decoded =    723, tg =  41.19 t/s
1.40.815.818 I slot print_timing: id  0 | task 0 | n_decoded =    882, tg =  42.83 t/s
1.43.824.605 I slot print_timing: id  0 | task 0 | n_decoded =   1017, tg =  43.12 t/s

```
</details>
