Releases: ggml-org/llama.cpp

b4988

28 Mar 21:56
3714c3e
llama : fix incorrect Qwen2Moe ffn_moe_out graph callback (#12631)

b4987

28 Mar 19:06
b4ae508
metal : improve FA + improve MoE (#12612)

* ggml : FA with different K, V head sizes (CPU)

ggml-ci

* metal : add FA with HS=192

* metal : extend FA to support different K and V head sizes

ggml-ci

* metal : add FA vector kernels for heads K 192 and V 128

ggml-ci

* ggml : restrict op on other backends to equal head sizes

ggml-ci

* metal : optimize FA-vec kernel

ggml-ci

* metal : FA remove mq registers

* metal : improve MoE mul_mat_id condition

ggml-ci

* metal : fix comments + remove unnecessary addition

ggml-ci

* metal : avoid too much shared memory usage with mul_mat_id

ggml-ci
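As a rough illustration of what "different K and V head sizes" means for attention, the sketch below computes plain (non-tiled) softmax attention in pure Python. It is not the Metal or CPU kernel from this change and does none of the blocking that flash attention performs. The function name `attention` and the test shapes are assumptions made for the example. The point it shows is that the score computation only involves the K head size, while the output width follows the V head size.

```python
import math

def attention(q, k, v):
    """Plain softmax attention.

    q: [n_q][d_k], k: [n_kv][d_k], v: [n_kv][d_v]  ->  out: [n_q][d_v]
    d_k and d_v are allowed to differ (e.g. 192 and 128).
    """
    d_k = len(q[0])
    d_v = len(v[0])
    scale = 1.0 / math.sqrt(d_k)
    out = []
    for qi in q:
        # Scores depend only on the K head size.
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)  # subtract the max for numerical stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        # The weighted sum over V rows yields a vector of the V head size.
        out.append([sum(w[j] * v[j][c] for j in range(len(v))) / z
                    for c in range(d_v)])
    return out
```

With a 192-wide query and key and 128-wide values, `attention` returns rows of width 128.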

b4986

28 Mar 18:42
b86f600
vulkan: fix coopmat shader generation when cross-compiling (#12272)

* vulkan: fix coopmat shader generation when cross-compiling

Previously, the status of coopmat{,2} support was not passed to the
vulkan-shaders-gen project built on the host, which led to a build
failure because the cross-compiled code expected coopmat{,2}
shaders that were never generated.

Fix this by passing the coopmat{,2} support status to the vulkan-shaders
subproject.

Signed-off-by: Icenowy Zheng <uwu@icenowy.me>

* Only call coop-mat shaders once

* Fix whitespace

---------

Signed-off-by: Icenowy Zheng <uwu@icenowy.me>
Co-authored-by: bandoti <141645996+bandoti@users.noreply.github.com>

b4985

28 Mar 18:02
dd373dd
llama: fix error on bad grammar (#12628)

b4984

28 Mar 08:59
5d01670
server : include speculative decoding stats when timings_per_token is…

b4982

28 Mar 08:31
1373176
llamafile : ppc64le GEMV forwarding for FP32. (#12594)

This patch enables usage of MMA when one of the
dimensions of the matrix (i.e. either M or N) is 1. This
is useful for token generation, where N < 2.

The concept of 'GEMV Forwarding' is used: when one
of the matrices has a single row/column, its elements are
broadcast instead of being prepacked by the packing
routine.

This change results in a 5% - 15% improvement in total
speed (i.e. all tokens/total time) across various batch
sizes, in comparison with the corresponding
dot product implementation.

The patch was tested with FP32 models of Meta-Llama-3-8B,
Mistral-7B and Llama-2-7B-chat-hf on an IBM POWER10 machine.

Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
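The dispatch idea described above can be sketched in pure Python. This is only an illustration of the control flow, not the POWER10 MMA code: the helper names (`pack_columns`, `gemm_packed`, `gemv_forward`, `matmul`) are hypothetical, and "packing" is reduced to a simple copy into column panels. When the right-hand matrix has a single column, the sketch skips the packing step and broadcasts each vector element across the rows of the other matrix.

```python
def pack_columns(b):
    """Stand-in for a packing routine: copy B into per-column panels."""
    return [list(col) for col in zip(*b)]

def gemm_packed(a, b):
    """General path: pack B first, then compute every row/column dot product."""
    cols = pack_columns(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def gemv_forward(a, x):
    """y = A @ x without packing: each x[k] is broadcast across all rows of A."""
    y = [0.0] * len(a)
    for k, xk in enumerate(x):
        for i, row in enumerate(a):
            y[i] += row[k] * xk
    return y

def matmul(a, b):
    """Dispatch to the vector path when B has exactly one column (N == 1)."""
    if len(b[0]) == 1:
        x = [row[0] for row in b]
        return [[value] for value in gemv_forward(a, x)]
    return gemm_packed(a, b)
```

Both paths compute the same product; the single-column path simply avoids the extra copy that packing would make for a one-column operand.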

b4981

28 Mar 07:03
ab6ab8f
rpc : send hash when tensor data is above some fixed threshold (#12496)

* rpc : send hash when tensor data is above some fixed threshold

ref #10095

* rpc : put cache under $HOME/.cache/llama.cpp

* try to fix win32 build

* another try to fix win32 build

* remove llama as dependency
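A minimal sketch of the "send a hash instead of the data" idea follows. It is not the ggml RPC protocol: the hash function (SHA-256), the threshold value, and the names `HASH_THRESHOLD_BYTES`, `TensorServer` and `upload_tensor` are all assumptions made for illustration, and an in-memory dictionary stands in for the on-disk cache under $HOME/.cache/llama.cpp. The client first offers only the digest for large tensors and falls back to sending the full data when the server does not have it cached.

```python
import hashlib

HASH_THRESHOLD_BYTES = 1024  # hypothetical threshold, not the real value

class TensorServer:
    """Toy server that caches large tensor payloads by content hash."""

    def __init__(self):
        self.cache: dict[bytes, bytes] = {}   # digest -> tensor bytes
        self.tensors: dict[str, bytes] = {}   # tensor name -> tensor bytes

    def set_tensor(self, name: str, data: bytes) -> None:
        self.tensors[name] = data
        if len(data) > HASH_THRESHOLD_BYTES:
            self.cache[hashlib.sha256(data).digest()] = data

    def set_tensor_hash(self, name: str, digest: bytes) -> bool:
        """Return True if the digest was found in the cache and applied."""
        data = self.cache.get(digest)
        if data is None:
            return False
        self.tensors[name] = data
        return True

def upload_tensor(server: TensorServer, name: str, data: bytes) -> int:
    """Upload a tensor and return the number of payload bytes 'sent'."""
    if len(data) > HASH_THRESHOLD_BYTES:
        digest = hashlib.sha256(data).digest()
        if server.set_tensor_hash(name, digest):
            return len(digest)              # cache hit: only the hash travels
        server.set_tensor(name, data)       # cache miss: send the full data
        return len(digest) + len(data)
    server.set_tensor(name, data)           # small tensor: always send data
    return len(data)
```

In this sketch, uploading the same large tensor a second time costs only the digest size, which is the saving the change is after when the same model weights are loaded repeatedly.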

b4980

27 Mar 23:32
2099a9d
server : Support listening on a unix socket (#12613)

* server : Bump cpp-httplib to include AF_UNIX windows support

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>

* server : Allow running the server example on a unix socket

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>

---------

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
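For readers who want to talk to an HTTP server that listens on a unix socket, a minimal client sketch using only the Python standard library is shown below. The function name `http_get_unix`, the socket path and the `/health` target are assumptions for the example; how the socket path is passed to the server is not covered by this entry. The sketch sends a bare HTTP/1.1 GET and reads until the peer closes the connection, so it does not handle chunked transfer encoding.

```python
import socket

def http_get_unix(sock_path: str, target: str = "/health",
                  timeout: float = 5.0) -> tuple[int, bytes]:
    """Send a minimal HTTP/1.1 GET over a Unix domain socket.

    Returns (status_code, raw_body). Chunked responses are not decoded.
    """
    request = (
        f"GET {target} HTTP/1.1\r\n"
        "Host: localhost\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        s.connect(sock_path)          # a filesystem path instead of host:port
        s.sendall(request)
        chunks = []
        while True:
            chunk = s.recv(4096)
            if not chunk:             # peer closed the connection
                break
            chunks.append(chunk)
    raw = b"".join(chunks)
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line = head.split(b"\r\n", 1)[0]
    status = int(status_line.split()[1])
    return status, body
```

A call such as `http_get_unix("/tmp/llama.sock")` (hypothetical path) returns the status code and body of the response.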

b4978

27 Mar 16:00
5dec47d
opencl: add multi and vision rope, `gelu_quick` and `im2col` (#12600)

* opencl: add `im2col`

* opencl: add `gelu_quick`

* opencl: add mrope

* opencl: add vision rope

b4977

27 Mar 11:54
f125b8d
llama : add PLM GGUF Conversion & Inference Support (#12457)

* add edgellm model arch [conversation feature doesn't work]

* remove output.weight layer for edgellm arch

* [Model] update the name of the model

* update the name of model arch in convert gguf

* [Model] Refactor the model arch into llama-model

* [Bug] Fix the bug in create attn kv

* [Code] Fix editorconfig errors

* [Code] Remove Trailing whitespace

* [Code] Remove Trailing whitespace

* [Code] Change the order of model arch in list

* [Code] Fix flake8 Lint errors

* Remove trailing white space

* [Code] Remove  call in model arch