0.00.061.256 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.061.261 I device_info:
0.00.209.057 I - CUDA0 : Tesla V100-SXM2-16GB (16144 MiB, 15833 MiB free)
0.00.325.048 I - CUDA1 : Tesla V100-SXM2-16GB (16144 MiB, 15833 MiB free)
0.00.325.064 I - CPU : Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz (31931 MiB, 31931 MiB free)
0.00.325.188 I system_info: n_threads = 14 (n_threads_batch = 14) / 28 | CUDA : ARCHS = 700 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
0.00.325.310 I srv init: using 27 threads for HTTP server
0.00.325.493 I srv start: binding port with default address family
0.00.326.812 I srv llama_server: loading model
0.00.326.825 I srv load_model: loading model '/home/lunaon/models/Qwen3.6-27B/Qwen3.6-27B-Q4_K_M-MTP.gguf'
0.01.057.228 I srv load_model: [spec] estimated memory usage of MTP context is 2121.07 MiB
0.01.057.245 I srv load_model: auto-enabled kv-unified: spec decode backup doesn't need separate KV stream
0.01.057.255 I common_init_result: fitting params to device memory ...
0.01.057.256 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.33.342.188 W llama_context: n_ctx_seq (131072) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.33.895.843 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
TCQ decode: context-adaptive V alpha enabled
0.34.095.169 I srv load_model: shrunk recurrent state to 1 cells before draft load (deferred 1 backup cells)
0.34.095.178 I srv load_model: creating MTP draft context against the target model '/home/lunaon/models/Qwen3.6-27B/Qwen3.6-27B-Q4_K_M-MTP.gguf'
0.34.095.215 W llama_context: n_ctx_seq (131072) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.34.636.023 I srv load_model: expanded recurrent state to 2 cells before speculative GPU buffers
0.34.636.028 I srv load_model: initializing slots, n_slots = 1
0.35.295.128 I common_speculative_impl_draft_mtp: adding speculative implementation 'draft-mtp'
0.35.295.136 I common_speculative_impl_draft_mtp: - n_max=2, n_min=0, p_min=0.00, n_embd=5120, backend_sampling=1
0.35.295.138 I common_speculative_impl_draft_mtp: - gpu_layers=-1, cache_k=f16, cache_v=f16, ctx_tgt=yes, ctx_dft=yes, devices=[default]
0.35.295.310 I srv load_model: speculative decoding context initialized
0.35.295.316 I slot load_model: id 0 | task -1 | speculative decoding context initialized
0.35.295.317 I slot load_model: id 0 | task -1 | new slot, n_ctx = 131072
0.35.295.407 I srv load_model: prompt cache is enabled, size limit: 8192 MiB
0.35.295.409 I srv load_model: use `--cache-ram 0` to disable the prompt cache
0.35.295.410 I srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
0.35.295.411 I srv load_model: context checkpoints enabled, max = 32, min spacing = 256
0.35.295.457 I srv init: idle slots will be saved to prompt cache and cleared upon starting a new task
0.35.326.975 I init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>
'
0.35.350.422 I srv init: init: chat template, thinking = 1
0.35.350.470 I srv llama_server: model loaded
0.35.350.476 I srv llama_server: server is listening on http://0.0.0.0:8080
0.35.350.491 I srv update_slots: all slots are idle
1.18.038.356 I srv params_from_: Chat format: peg-native
1.18.038.650 I slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
1.18.129.792 I srv recurrent_sh: shrunk recurrent state to 1 cells for prompt cache (before prompt cache save/load, removed 1 backup cells)
1.18.129.797 I srv get_availabl: updating prompt cache
1.18.129.802 I srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
1.18.129.808 I srv update: - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 131072 tokens, 8589934592 est)
1.18.201.455 I srv recurrent_ex: expanded recurrent state to 2 cells after prompt cache (after prompt cache save/load)
1.18.201.460 I srv get_availabl: prompt cache update took 71.66 ms
1.18.201.519 I reasoning-budget: activated, budget=2147483647 tokens
1.18.201.541 I slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
1.20.166.308 I slot create_check: id 0 | task 0 | created context checkpoint 1 of 32 (pos_min = 239, pos_max = 239, n_tokens = 240, size = 150.568 MiB)
1.22.603.194 I slot print_timing: id 0 | task 0 | n_decoded = 100, tg = 41.97 t/s
1.25.650.471 I slot print_timing: id 0 | task 0 | n_decoded = 215, tg = 39.60 t/s
1.28.694.704 I slot print_timing: id 0 | task 0 | n_decoded = 318, tg = 37.53 t/s
1.31.743.199 I slot print_timing: id 0 | task 0 | n_decoded = 428, tg = 37.14 t/s
1.34.765.830 I slot print_timing: id 0 | task 0 | n_decoded = 596, tg = 40.98 t/s
1.37.772.368 I slot print_timing: id 0 | task 0 | n_decoded = 723, tg = 41.19 t/s
1.40.815.818 I slot print_timing: id 0 | task 0 | n_decoded = 882, tg = 42.83 t/s
1.43.824.605 I slot print_timing: id 0 | task 0 | n_decoded = 1017, tg = 43.12 t/s
Name and Version
v0.3.0
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
Problem description & steps to reproduce
Using the v0.3.0 branch, I asked the LLM to generate an HTML page. The first thing I noticed is that it generates the entire code inside the thinking process, then keeps outputting "I still have errors. I'll write it more carefully." and writes the code again. I think that with the same prompt, the old version of llama wrote the code after the thinking process.
First Bad Commit
No response
Relevant log output
Logs
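
To make the behavior from the problem description easier to confirm, here is a minimal sketch that checks where the code ends up in a single chat response. It assumes the response message is an OpenAI-style dict with the thinking text in a `reasoning_content` field and the final answer in `content`. That field layout, the helper name `locate_code`, and the sample message are assumptions made for illustration, not output captured from the server.

```python
import re

# Heuristic markers suggesting a chunk of text contains HTML source code.
_CODE_MARKERS = re.compile(r"```|<!DOCTYPE html|<html\b", re.IGNORECASE)


def locate_code(message: dict) -> dict:
    """Report whether code-like text appears in the reasoning and/or the answer.

    `message` is assumed to look like choices[0].message of an OpenAI-style
    chat completion, with the thinking text in `reasoning_content`
    (an assumption about the response shape, not confirmed by the logs).
    """
    reasoning = message.get("reasoning_content") or ""
    answer = message.get("content") or ""
    return {
        "code_in_reasoning": bool(_CODE_MARKERS.search(reasoning)),
        "code_in_answer": bool(_CODE_MARKERS.search(answer)),
        # Count how often the retry sentence quoted in the report shows up.
        "retry_phrases": reasoning.count("I still have errors"),
    }


# Hypothetical message, made up to mirror the reported behavior.
sample = {
    "reasoning_content": (
        "```html\n<html>...</html>\n```\n"
        "I still have errors. I'll write it more carefully.\n"
        "```html\n<html>...</html>\n```"
    ),
    "content": "Here is the page.",
}

print(locate_code(sample))
```

Running the same prompt against both versions and comparing the two results should show whether the code moved from the answer into the thinking block.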