Releases · ggml-org/llama.cpp

28 Mar 21:56

3714c3e

b4988 Latest

llama : fix incorrect Qwen2Moe ffn_moe_out graph callback (#12631)

Assets 25

cudart-llama-bin-win-cu11.7-x64.zip  303 MB  2025-03-28T21:56:29Z
cudart-llama-bin-win-cu12.4-x64.zip  373 MB  2025-03-28T21:56:42Z
llama-b4988-bin-macos-arm64.zip  24.7 MB  2025-03-28T21:56:59Z
llama-b4988-bin-macos-x64.zip  26.4 MB  2025-03-28T21:57:01Z
llama-b4988-bin-ubuntu-arm64.zip  26.9 MB  2025-03-28T21:57:03Z
llama-b4988-bin-ubuntu-vulkan-x64.zip  33 MB  2025-03-28T21:57:04Z
llama-b4988-bin-ubuntu-x64.zip  28.5 MB  2025-03-28T21:57:07Z
llama-b4988-bin-win-avx-x64.zip  17.2 MB  2025-03-28T21:57:08Z
llama-b4988-bin-win-avx2-x64.zip  17.2 MB  2025-03-28T21:57:10Z
llama-b4988-bin-win-avx512-x64.zip  17.3 MB  2025-03-28T21:57:12Z
Source code (zip)  2025-03-28T21:13:02Z
Source code (tar.gz)  2025-03-28T21:13:02Z

28 Mar 19:06

github-actions

b4987

b4ae508

b4987

metal : improve FA + improve MoE (#12612)

* ggml : FA with different K, V head sizes (CPU)

ggml-ci

* metal : add FA with HS=192

* metal : extend FA to support different K and V head sizes

ggml-ci

* metal : add FA vector kernels for heads K 192 and V 128

ggml-ci

* ggml : restrict op on other backends to equal head sizes

ggml-ci

* metal : optimize FA-vec kernel

ggml-ci

* metal : FA remove mq registers

* metal : improve MoE mul_mat_id condition

ggml-ci

* metal : fix comments + remove unnecessary addition

ggml-ci

* metal : avoid too much shared memory usage with mul_mat_id

ggml-ci
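
To make "different K and V head sizes" concrete: the query/key dot products run over the K head size, while each output row has the V head size. The sketch below is plain (non-flash) single-head attention in pure Python, meant only to illustrate the shape relationship. The function name and the toy sizes are illustrative assumptions, not the Metal or CPU kernels from this change.

```python
import math

def attention(q, k, v):
    """Plain single-head attention.

    q: [n_q][d_k], k: [n_kv][d_k], v: [n_kv][d_v]; returns [n_q][d_v].
    d_k and d_v are allowed to differ.
    """
    d_k = len(q[0])
    d_v = len(v[0])
    scale = 1.0 / math.sqrt(d_k)
    out = []
    for q_row in q:
        # Scores use the K head size: one dot product per key row.
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        # Numerically stable softmax over the scores.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        # The output row has the V head size.
        out.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v)) / total
            for j in range(d_v)
        ])
    return out

# Toy sizes standing in for e.g. K head size 192 and V head size 128.
D_K, D_V = 6, 4
q = [[0.1 * (i + 1)] * D_K for i in range(2)]
k = [[0.05 * (i + 1)] * D_K for i in range(3)]
v = [[float(i)] * D_V for i in range(3)]
out = attention(q, k, v)
```

Here `out` has one row per query and `D_V` columns, regardless of `D_K`.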

Assets 26

28 Mar 18:42

github-actions

b4986

b86f600

b4986

vulkan: fix coopmat shader generation when cross-compiling (#12272)

* vulkan: fix coopmat shader generation when cross-compiling

Previously the status of coopmat{,2} support wasn't passed to the
vulkan-shaders-gen project built on the host, which led to a build
failure because the cross-compiling code expected coopmat{,2}
shaders that were never generated.

Fix this by passing the coopmat{,2} support status to the vulkan-shaders
subproject.

Signed-off-by: Icenowy Zheng <uwu@icenowy.me>

* Only call coop-mat shaders once

* Fix whitespace

---------

Signed-off-by: Icenowy Zheng <uwu@icenowy.me>
Co-authored-by: bandoti <141645996+bandoti@users.noreply.github.com>

Assets 25

28 Mar 18:02

github-actions

b4985

dd373dd

b4985

llama: fix error on bad grammar (#12628)

Assets 25

28 Mar 08:59

github-actions

b4984

5d01670

b4984

server : include speculative decoding stats when timings_per_token is…

Assets 26

28 Mar 08:31

github-actions

b4982

1373176

b4982

llamafile : ppc64le GEMV forwarding for FP32. (#12594)

This patch enables the use of MMA when one of the
dimensions of the matrix (i.e., either M or N) is 1. This
is useful for token generation, where N < 2.

It uses the concept of 'GEMV forwarding': when one
of the matrices has a single row/column, its elements are
broadcast instead of being prepacked by the packing
routine.

This change results in a 5% - 15% improvement in total
speed (i.e., all tokens / total time) across various batch
sizes, compared with the corresponding
dot product implementation.

The patch was tested with FP32 models of Meta-Llama-3-8B,
Mistral-7B, and Llama-2-7B-chat-hf on an IBM POWER10 machine.

Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
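
The dispatch idea described above can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the ppc64le MMA kernel: the function names are hypothetical, only the N = 1 case is handled, and "broadcast" here simply means each vector element is reused across every row of the matrix without packing the vector into tiles first.

```python
def gemm(a, b):
    """General C = A @ B with a: [m][k], b: [k][n]."""
    m, k, n = len(a), len(b), len(b[0])
    return [
        [sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]

def gemv(a, x):
    """y = A @ x; each x[p] is reused (broadcast) across all rows of A."""
    y = [0.0] * len(a)
    for p, x_p in enumerate(x):
        for i, row in enumerate(a):
            y[i] += row[p] * x_p
    return y

def matmul(a, b):
    """Forward to the GEMV path when B has a single column (N == 1)."""
    if len(b[0]) == 1:
        x = [row[0] for row in b]
        return [[y_i] for y_i in gemv(a, x)]
    return gemm(a, b)

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0], [6.0]]
result = matmul(a, b)
```

Both paths produce the same result for N = 1; the point of forwarding is to skip the packing step, which this sketch does not model.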

Assets 26

28 Mar 07:03

github-actions

b4981

ab6ab8f

b4981

rpc : send hash when tensor data is above some fixed threshold (#12496)

* rpc : send hash when tensor data is above some fixed threshold

ref #10095

* rpc : put cache under $HOME/.cache/llama.cpp

* try to fix win32 build

* another try to fix win32 build

* remove llama as dependency
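
The idea of sending a hash instead of large tensor data can be sketched as follows. This is a minimal in-process illustration, not the actual RPC protocol: the threshold value, the use of SHA-256, the class and method names, and the in-memory dictionary standing in for the on-disk cache under $HOME/.cache/llama.cpp are all assumptions made for the example.

```python
import hashlib

THRESHOLD = 1024  # hypothetical size threshold in bytes

def digest(data: bytes) -> str:
    # SHA-256 is a stand-in; the real hash function may differ.
    return hashlib.sha256(data).hexdigest()

class Server:
    def __init__(self):
        self.cache = {}    # hash -> data; stands in for the on-disk cache
        self.tensors = {}  # tensor name -> data

    def set_tensor(self, name: str, data: bytes) -> None:
        self.tensors[name] = data
        if len(data) > THRESHOLD:
            self.cache[digest(data)] = data

    def set_tensor_hash(self, name: str, h: str) -> bool:
        """Return True if the data for this hash is already cached."""
        data = self.cache.get(h)
        if data is None:
            return False
        self.tensors[name] = data
        return True

def upload(server: Server, name: str, data: bytes) -> str:
    """Return which path delivered the tensor: 'hash' or 'data'."""
    if len(data) > THRESHOLD and server.set_tensor_hash(name, digest(data)):
        return "hash"
    server.set_tensor(name, data)
    return "data"

server = Server()
big = bytes(range(256)) * 16   # 4096 bytes, above the threshold
small = b"tiny"
first = upload(server, "w0", big)     # cache miss: full data is sent
second = upload(server, "w1", big)    # cache hit: only the hash is sent
third = upload(server, "b0", small)   # below threshold: always sent as data
```

On a repeated upload of the same large tensor, only the short hash crosses the wire, which is where the saving comes from.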

Assets 26

27 Mar 23:32

github-actions

b4980

2099a9d

b4980

server : Support listening on a unix socket (#12613)

* server : Bump cpp-httplib to include AF_UNIX windows support

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>

* server : Allow running the server example on a unix socket

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>

---------

Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@docker.com>
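
For readers unfamiliar with HTTP over a Unix domain socket, a client only needs to swap the TCP connection for an AF_UNIX one; the HTTP bytes are unchanged. The sketch below uses only the Python standard library. The socket path and request path passed to `http_get_unix` are hypothetical, and the server-side option that enables this mode is not shown here.

```python
import socket

def build_get(path: str, host: str = "localhost") -> bytes:
    """Build a minimal HTTP/1.1 GET request."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")

def read_all(sock: socket.socket) -> bytes:
    """Read until the peer closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)

def http_get_unix(sock_path: str, path: str) -> bytes:
    """Send a GET request over a Unix domain socket and return the raw response."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(sock_path)
        s.sendall(build_get(path))
        return read_all(s)
```

A call might look like `http_get_unix("/tmp/llama.sock", "/health")`, where both arguments are placeholders for whatever the server was actually started with.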

Assets 26

27 Mar 16:00

github-actions

b4978

5dec47d

b4978

opencl: add multi and vision rope, `gelu_quick` and `im2col` (#12600)

* opencl: add `im2col`

* opencl: add `gelu_quick`

* opencl: add mrope

* opencl: add vision rope
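
As a reference for what `gelu_quick` computes, the usual definition is x * sigmoid(1.702 * x), a cheap approximation of GELU. The sketch below is a scalar Python version for illustration only; it is not the OpenCL kernel, and the helper names are assumptions.

```python
import math

GELU_QUICK_COEF = 1.702

def sigmoid(z: float) -> float:
    # Numerically stable form that avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def gelu_quick(x: float) -> float:
    """Quick GELU approximation: x * sigmoid(1.702 * x)."""
    return x * sigmoid(GELU_QUICK_COEF * x)
```

For large positive inputs the function approaches x, and for large negative inputs it approaches 0.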

Assets 26

27 Mar 11:54

github-actions

b4977

f125b8d

b4977

llama : add PLM GGUF Conversion & Inference Support (#12457)

* add edgellm model arch [conversation feature doesn't work]

* remove output.weight layer for edgellm arch

* [Model] update the name of the model

* update the name of model arch in convert gguf

* [Model] Refactor the model arch into llama-model

* [Bug] Fix the bug in create attn kv

* [Code] Fix editorconfig errors

* [Code] Remove Trailing whitespace

* [Code] Remove Trailing whitespace

* [Code] Change the order of model arch in list

* [Code] Fix flake8 Lint errors

* Remove trailing white space

* [Code] Remove  call in model arch

Assets 25
