Merged

55 commits
c6c4f7c
Update chat.cpp to support (at least) qwen3 + tool_choice = required
ExtReMLapin Aug 11, 2025
42937a5
refactored changes to follow string tern op
ExtReMLapin Aug 11, 2025
5796938
fixing editorconfig-checker CI (tailing whitespace)
ExtReMLapin Aug 12, 2025
de07a43
Merge branch 'ggml-org:master' into fix_qwen_reasoning_tool_calling_r…
ExtReMLapin Aug 25, 2025
79e4a7b
hermes 2 pro tool calling, better support for thinking (thinking tag …
Aug 25, 2025
dbae921
qwen hermes tool calling : fixed grammar rules names
Aug 25, 2025
86493dd
fixed really weird grammar crash `Unexpected empty grammar stack afte…
Aug 25, 2025
bb5e352
also apply the hotcrashfix here, just in case
Aug 25, 2025
6d5f561
reverted changes done to grammar_lazy for hermes 2
Aug 26, 2025
352274e
if there is enable_thinking enabled but hermes model doesn't support …
Aug 26, 2025
0e55830
fix thinking-content eating closing think tag | ref #8953
Aug 26, 2025
e62cd70
removed `?` from grammar as it doesn't crash on linux, probably worth…
Aug 26, 2025
fbef0fa
server: higher timeout for tests (#15621)
JohannesGaessler Aug 27, 2025
5a0e3ef
cuda: Add cublasLt_static linking when GGML_STATIC is enabled (#15622)
matiaslin Aug 28, 2025
46d9caa
model-conversion : add mmproj conversion target (#15628)
danbev Aug 28, 2025
d35a1e8
cli : change log to warning to explain reason for stopping (#15604)
jrincayc Aug 28, 2025
64387f6
gguf-py: byteswapping improvements (#12851)
AlekseiNikiforovIBM Aug 28, 2025
8a4280c
kv-cache : remove LLAMA_SET_ROWS checks (#15505)
ggerganov Aug 28, 2025
55042b3
scripts: add sqlite3 check for compare-commits.sh (#15633)
am17an Aug 28, 2025
84ab83c
model : jina-embeddings-v3 support (#13693)
CISC Aug 28, 2025
c8d0d14
kv-cache : fix find_slot to not search for continuous slot (#15638)
ggerganov Aug 28, 2025
7380414
ggml : fix SSM_SCAN for n_groups > 1 (#15625)
compilade Aug 28, 2025
6c442f4
ggml-cpu: fix invalid hsum build in debug s390x (#15634)
taronaeo Aug 28, 2025
c97dc09
CUDA: add conv2d (#15635)
mnehete32 Aug 28, 2025
2f28a1c
Merge branch 'ggml-org:master' into fix_qwen_reasoning_tool_calling_r…
ExtReMLapin Aug 28, 2025
310701b
fixed crash with "auto" mode, trigger was missing
Aug 28, 2025
a8bca68
fix: Compute the full sum in llama-eval-callback, not just the sum of…
gabe-l-hart Aug 28, 2025
e8d99dd
nvidia nemotron nano v2 (nemotronh) (#15507)
gabe-l-hart Aug 29, 2025
009b709
CUDA: fuse adds, fuse add with rms norm (#15631)
am17an Aug 29, 2025
60e5eee
chat : Seed OSS thinking + tool call support (#15552)
pwilkin Aug 29, 2025
8101786
CUDA: fix bug in rms_norm fusion (#15660)
am17an Aug 29, 2025
792b44f
server : add documentation for `parallel_tool_calls` param (#15647)
ExtReMLapin Aug 29, 2025
3d16b29
scripts: strip "AMD Instinct" from GPU name (#15668)
JohannesGaessler Aug 29, 2025
d82f6aa
server : removed obsolete doc (#15670)
l29ah Aug 29, 2025
ef47691
CANN: FIx compiler warnings (#15661)
noemotiovon Aug 30, 2025
696fccf
vulkan: Skip syncing for prealloc_y when it is reused (#15544)
jeffbolznv Aug 30, 2025
38ad381
CUDA: use FP32 arithmetic for conv2d (#15683)
JohannesGaessler Aug 30, 2025
e81b8e4
llama: use FA + max. GPU layers by default (#15434)
JohannesGaessler Aug 30, 2025
dd89255
Update build.md to remove MSVC arm64 notes (#15684)
slaren Aug 30, 2025
4d74393
ggml: update kleidiai to v1.13.0 (#15663)
chaxu01 Aug 30, 2025
94e82c7
vulkan: clamp matmul and FA results to the max finite value (#15652)
jeffbolznv Aug 31, 2025
b97c9ed
vulkan: Allow fallback to sysmem memory when vidmem is full (#15649)
jeffbolznv Aug 31, 2025
5c16b9c
vulkan : remove unused portability_enumeration_ext variable (#15679)
danbev Aug 31, 2025
c37052a
vulkan: mul_mat_id coopmat2 optimizations (#15546)
jeffbolznv Aug 31, 2025
bbbf5ec
vulkan: handle large sizes for get_rows (#15686)
jeffbolznv Aug 31, 2025
7d3c9f2
ci : explicitly set fa off or on (#15692)
CISC Aug 31, 2025
9777032
llama : separate compute buffer reserve from fattn check (#15696)
slaren Aug 31, 2025
2749662
llama : fix fattn reserve call n_seqs parameter (#15699)
slaren Aug 31, 2025
4efd5a8
metal : fix checks for available FA kernels (#15700)
ggerganov Aug 31, 2025
0d161f0
server : enable /slots by default and make it secure (#15630)
ggerganov Aug 31, 2025
e92d53b
sampling : optimize samplers by reusing bucket sort (#15665)
ggerganov Aug 31, 2025
3dc7397
CANN: fix RoPE cache issue on multi-device (#15629)
hipudding Sep 1, 2025
b9382c3
CANN: Optimize MUL_MAT_ID (#15658)
hipudding Sep 1, 2025
b66df9d
CUDA: fix build error from ambiguous __half conversions in conv2d (#1…
qnixsynapse Sep 1, 2025
5688afa
Merge branch 'ggml-org:master' into fix_qwen_reasoning_tool_calling_r…
ExtReMLapin Sep 1, 2025
20 changes: 10 additions & 10 deletions ci/run.sh
@@ -386,10 +386,10 @@ function gg_run_open_llama_7b_v2 {

(time ./bin/llama-imatrix --model ${model_f16} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-imatrix.log

(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 10 -c 0 ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 10 -c 0 -fa ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 0 ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 0 -fa ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 10 -c 0 -fa off ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 10 -c 0 -fa on ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 0 -fa off ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 0 -fa on ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log

function check_ppl {
qnt="$1"
@@ -520,8 +520,8 @@ function gg_run_pythia_1_4b {

(time ./bin/llama-imatrix --model ${model_f16} -f ${wiki_test_60} -ngl 99 -c 128 -b 128 --chunks 1 ) 2>&1 | tee -a $OUT/${ci}-imatrix.log

(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 0 ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 0 -fa ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 0 -fa off ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 0 -fa on ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log

function check_ppl {
qnt="$1"
@@ -651,10 +651,10 @@ function gg_run_pythia_2_8b {

(time ./bin/llama-imatrix --model ${model_f16} -f ${wiki_test} -t 1 -ngl 99 -c 2048 -b 512 --chunks 4 ) 2>&1 | tee -a $OUT/${ci}-imatrix.log

(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 10 -c 0 ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 10 -c 0 -fa ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 0 ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 0 -fa ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 10 -c 0 -fa off ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 10 -c 0 -fa on ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 0 -fa off ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log
(time ./bin/llama-save-load-state --model ${model_q4_0} -ngl 99 -c 0 -fa on ) 2>&1 | tee -a $OUT/${ci}-save-load-state.log

function check_ppl {
qnt="$1"
48 changes: 21 additions & 27 deletions common/arg.cpp
@@ -1545,10 +1545,18 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
}
).set_examples({LLAMA_EXAMPLE_IMATRIX, LLAMA_EXAMPLE_PERPLEXITY, LLAMA_EXAMPLE_RETRIEVAL}));
add_opt(common_arg(
{"-fa", "--flash-attn"},
string_format("enable Flash Attention (default: %s)", params.flash_attn ? "enabled" : "disabled"),
[](common_params & params) {
params.flash_attn = true;
{"-fa", "--flash-attn"}, "FA",
string_format("set Flash Attention use ('on', 'off', or 'auto', default: '%s')", llama_flash_attn_type_name(params.flash_attn_type)),
[](common_params & params, const std::string & value) {
if (value == "on" || value == "enabled") {
params.flash_attn_type = LLAMA_FLASH_ATTN_TYPE_ENABLED;
} else if (value == "off" || value == "disabled") {
params.flash_attn_type = LLAMA_FLASH_ATTN_TYPE_DISABLED;
} else if (value == "auto") {
params.flash_attn_type = LLAMA_FLASH_ATTN_TYPE_AUTO;
} else {
throw std::runtime_error(string_format("error: unknown value for --flash-attn: '%s'\n", value.c_str()));
}
}
).set_env("LLAMA_ARG_FLASH_ATTN"));
add_opt(common_arg(
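The hunk above changes `-fa` / `--flash-attn` from a boolean switch into a valued option ('on', 'off', 'auto'), which is why the updated ci/run.sh earlier in this diff now spells out `-fa off` and `-fa on` instead of merely toggling the flag. A minimal sketch of the mapping this introduces — the enum and helper below are illustrative stand-ins, not the actual llama.cpp API:

#include <stdexcept>
#include <string>

// Illustrative tri-state, mirroring LLAMA_FLASH_ATTN_TYPE_{DISABLED,ENABLED,AUTO} from the diff.
enum class flash_attn_mode { disabled, enabled, automatic };

// Hypothetical helper reproducing the parsing logic added above:
// "on"/"enabled" and "off"/"disabled" are accepted aliases, anything else is an error.
static flash_attn_mode parse_flash_attn_mode(const std::string & value) {
    if (value == "on"  || value == "enabled")  { return flash_attn_mode::enabled;  }
    if (value == "off" || value == "disabled") { return flash_attn_mode::disabled; }
    if (value == "auto")                       { return flash_attn_mode::automatic; }
    throw std::runtime_error("unknown value for --flash-attn: '" + value + "'");
}

The default printed through llama_flash_attn_type_name is presumably the auto setting, so omitting `-fa` leaves the decision to the runtime rather than disabling flash attention.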
@@ -2555,15 +2563,15 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
{"--lora"}, "FNAME",
"path to LoRA adapter (can be repeated to use multiple adapters)",
[](common_params & params, const std::string & value) {
params.lora_adapters.push_back({ std::string(value), 1.0, nullptr });
params.lora_adapters.push_back({ std::string(value), 1.0, "", "", nullptr });
}
// we define this arg on both COMMON and EXPORT_LORA, so when showing help message of export-lora, it will be categorized as "example-specific" arg
).set_examples({LLAMA_EXAMPLE_COMMON, LLAMA_EXAMPLE_EXPORT_LORA}));
add_opt(common_arg(
{"--lora-scaled"}, "FNAME", "SCALE",
"path to LoRA adapter with user defined scaling (can be repeated to use multiple adapters)",
[](common_params & params, const std::string & fname, const std::string & scale) {
params.lora_adapters.push_back({ fname, std::stof(scale), nullptr });
params.lora_adapters.push_back({ fname, std::stof(scale), "", "", nullptr });
}
// we define this arg on both COMMON and EXPORT_LORA, so when showing help message of export-lora, it will be categorized as "example-specific" arg
).set_examples({LLAMA_EXAMPLE_COMMON, LLAMA_EXAMPLE_EXPORT_LORA}));
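The `""` placeholders added to the `lora_adapters` initializers above imply two new string members in the adapter-info struct, sitting between the scale and the adapter pointer (they arrive with the jina-embeddings-v3 support merged in from master). A sketch of the implied layout — the member names are guesses for illustration only, not taken from the diff or verified against the headers:

#include <string>

// Hypothetical shape implied by "{ fname, std::stof(scale), "", "", nullptr }".
// Only the order and types follow from the initializer; the names are assumptions.
struct lora_adapter_info_sketch {
    std::string path;          // FNAME given on the command line
    float       scale;         // 1.0 for --lora, user-supplied for --lora-scaled
    std::string task_name;     // new string field, default-initialized to ""
    std::string prompt_prefix; // new string field, default-initialized to ""
    void *      ptr;           // adapter handle, loaded later (nullptr at parse time)
};

Leaving the two new fields empty at argument-parse time matches the diff: the CLI only knows the path and scale, and whatever the new fields carry is filled in elsewhere.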
@@ -2954,20 +2962,20 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
params.endpoint_metrics = true;
}
).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_ENDPOINT_METRICS"));
add_opt(common_arg(
{"--slots"},
string_format("enable slots monitoring endpoint (default: %s)", params.endpoint_slots ? "enabled" : "disabled"),
[](common_params & params) {
params.endpoint_slots = true;
}
).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_ENDPOINT_SLOTS"));
add_opt(common_arg(
{"--props"},
string_format("enable changing global properties via POST /props (default: %s)", params.endpoint_props ? "enabled" : "disabled"),
[](common_params & params) {
params.endpoint_props = true;
}
).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_ENDPOINT_PROPS"));
add_opt(common_arg(
{"--slots"},
string_format("enable slots monitoring endpoint (default: %s)", params.endpoint_slots ? "enabled" : "disabled"),
[](common_params & params) {
params.endpoint_slots = true;
}
).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_ENDPOINT_SLOTS"));
add_opt(common_arg(
{"--no-slots"},
"disables slots monitoring endpoint",
@@ -3459,8 +3467,6 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
params.model.hf_repo = "ggml-org/Qwen2.5-Coder-1.5B-Q8_0-GGUF";
params.model.hf_file = "qwen2.5-coder-1.5b-q8_0.gguf";
params.port = 8012;
params.n_gpu_layers = 99;
params.flash_attn = true;
params.n_ubatch = 1024;
params.n_batch = 1024;
params.n_ctx = 0;
@@ -3475,8 +3481,6 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
params.model.hf_repo = "ggml-org/Qwen2.5-Coder-3B-Q8_0-GGUF";
params.model.hf_file = "qwen2.5-coder-3b-q8_0.gguf";
params.port = 8012;
params.n_gpu_layers = 99;
params.flash_attn = true;
params.n_ubatch = 1024;
params.n_batch = 1024;
params.n_ctx = 0;
@@ -3491,8 +3495,6 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
params.model.hf_repo = "ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF";
params.model.hf_file = "qwen2.5-coder-7b-q8_0.gguf";
params.port = 8012;
params.n_gpu_layers = 99;
params.flash_attn = true;
params.n_ubatch = 1024;
params.n_batch = 1024;
params.n_ctx = 0;
@@ -3508,10 +3510,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
params.model.hf_file = "qwen2.5-coder-7b-q8_0.gguf";
params.speculative.model.hf_repo = "ggml-org/Qwen2.5-Coder-0.5B-Q8_0-GGUF";
params.speculative.model.hf_file = "qwen2.5-coder-0.5b-q8_0.gguf";
params.speculative.n_gpu_layers = 99;
params.port = 8012;
params.n_gpu_layers = 99;
params.flash_attn = true;
params.n_ubatch = 1024;
params.n_batch = 1024;
params.n_ctx = 0;
@@ -3527,10 +3526,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
params.model.hf_file = "qwen2.5-coder-14b-q8_0.gguf";
params.speculative.model.hf_repo = "ggml-org/Qwen2.5-Coder-0.5B-Q8_0-GGUF";
params.speculative.model.hf_file = "qwen2.5-coder-0.5b-q8_0.gguf";
params.speculative.n_gpu_layers = 99;
params.port = 8012;
params.n_gpu_layers = 99;
params.flash_attn = true;
params.n_ubatch = 1024;
params.n_batch = 1024;
params.n_ctx = 0;
@@ -3545,8 +3541,6 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
params.model.hf_repo = "ggml-org/Qwen3-Coder-30B-A3B-Instruct-Q8_0-GGUF";
params.model.hf_file = "qwen3-coder-30b-a3b-instruct-q8_0.gguf";
params.port = 8012;
params.n_gpu_layers = 99;
params.flash_attn = true;
params.n_ubatch = 1024;
params.n_batch = 1024;
params.n_ctx = 0;
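The preset hunks above consistently drop `params.n_gpu_layers = 99;`, `params.speculative.n_gpu_layers = 99;`, and `params.flash_attn = true;`, in line with the merged commit "llama: use FA + max. GPU layers by default (#15434)": full offload and flash attention no longer need to be forced per preset. A minimal sketch of what a preset reduces to under that assumption — the struct and function below are illustrative stand-ins, not the real common_params:

#include <string>

// Cut-down stand-in for the fields a preset still has to set; names are assumptions.
struct preset_params_sketch {
    std::string hf_repo, hf_file;
    int port = 8080, n_ubatch = 512, n_batch = 2048, n_ctx = 4096;
    // n_gpu_layers / flash_attn intentionally absent: global defaults now cover them.
};

static preset_params_sketch qwen25_coder_7b_preset() {
    preset_params_sketch p;
    p.hf_repo  = "ggml-org/Qwen2.5-Coder-7B-Q8_0-GGUF";
    p.hf_file  = "qwen2.5-coder-7b-q8_0.gguf";
    p.port     = 8012;
    p.n_ubatch = 1024;
    p.n_batch  = 1024;
    p.n_ctx    = 0;
    return p;
}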