Releases: ggerganov/llama.cpp

b3030

29 May 03:17
504f0c3
ggml : fix typo in ggml.c (#7603)

b3029

28 May 23:45
b864b50
[SYCL] Align GEMM dispatch (#7566)

* align GEMM dispatch

b3028

28 May 21:58
02c1eca
Tokenizer WPM fixes (#7500)

* Update random test: add_bos_token.
* Update random test: add WPM models for testing.
* Build vocab.special_tokens_cache using vocab token types.
* Fix and improve WPM preprocessing (see the sketch after this list).
  - Fix unicode edge-case combinations.
  - Split by whitespace in the same pass.
* Discard all tokens when no match is found.
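A minimal sketch of the single-pass whitespace split, assuming ASCII whitespace only; the actual llama.cpp preprocessing also handles unicode normalization and punctuation/CJK splitting in the same pass:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Sketch, not the actual implementation: collect candidate words in a
// single pass, treating any run of whitespace as a separator.
static std::vector<std::string> wpm_split_words(const std::string & text) {
    std::vector<std::string> words;
    std::string cur;
    for (unsigned char c : text) {
        if (std::isspace(c)) {
            // whitespace closes the current word
            if (!cur.empty()) { words.push_back(cur); cur.clear(); }
        } else {
            cur.push_back((char) c);
        }
    }
    if (!cur.empty()) { words.push_back(cur); }
    return words;
}
```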

b3027

28 May 21:47
6bd12ce
sycl : fix assert (#7563)

b3026

28 May 21:19
5442939
llama : support small Granite models (#7481)

* Add optional MLP bias for Granite models

Add an optional MLP bias for ARCH_LLAMA to support Granite models (a minimal sketch follows below).
Partially addresses ggerganov/llama.cpp/issues/7116.
Further changes are still needed to properly support Granite.
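A hedged sketch of the optional-bias idea using the ggml graph API; the function and tensor names are illustrative, and the real llama FFN also has a gate projection that is omitted here:

```cpp
#include "ggml.h"

// Sketch only: an FFN block where the bias tensors may be null.
// The bias is added only when the model actually provides it,
// which is what makes the MLP bias optional per model.
static struct ggml_tensor * build_ffn_with_optional_bias(
        struct ggml_context * ctx,
        struct ggml_tensor * inp,
        struct ggml_tensor * up_w,   struct ggml_tensor * up_b,
        struct ggml_tensor * down_w, struct ggml_tensor * down_b) {
    struct ggml_tensor * cur = ggml_mul_mat(ctx, up_w, inp);
    if (up_b) {
        cur = ggml_add(ctx, cur, up_b);   // optional up-projection bias
    }
    cur = ggml_silu(ctx, cur);            // activation (architecture-dependent)
    cur = ggml_mul_mat(ctx, down_w, cur);
    if (down_b) {
        cur = ggml_add(ctx, cur, down_b); // optional down-projection bias
    }
    return cur;
}
```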

* llama: honor add_space_prefix from the model configuration

Propagate the add_space_prefix setting from the HF model
configuration to the gguf file and honor it in the gpt2 tokenizer (see the sketch below).
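A hedged sketch of what honoring the flag amounts to at tokenization time; the function is illustrative, only the flag name comes from the changelog:

```cpp
#include <string>

// Sketch: if the gguf metadata says the tokenizer expects a leading
// space, prepend one before tokenizing; otherwise leave the text as-is.
static std::string apply_space_prefix(const std::string & text, bool add_space_prefix) {
    if (add_space_prefix && !text.empty() && text.front() != ' ') {
        return " " + text;   // honor the model's configuration
    }
    return text;             // unchanged when the flag is disabled
}
```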

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

* llama: add support for small granite models

This works only for the small 3b and 8b models.

The convert-hf-to-gguf.py script uses the vocabulary size of the
Granite models to detect Granite and set the correct configuration
(illustrated below).
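An illustrative C++ analogue of the converter's heuristic; the real check lives in convert-hf-to-gguf.py, and the vocabulary size below is a hypothetical placeholder, not a value taken from the PR:

```cpp
// Illustrative only: key model-specific configuration off the
// vocabulary size, mirroring how convert-hf-to-gguf.py detects Granite.
constexpr int GRANITE_VOCAB_SIZE = 49152; // placeholder for illustration

struct model_overrides {
    bool is_granite = false;
    bool mlp_bias   = false;
};

static model_overrides detect_overrides(int n_vocab) {
    model_overrides o;
    if (n_vocab == GRANITE_VOCAB_SIZE) {
        o.is_granite = true;
        o.mlp_bias   = true; // Granite ships MLP bias tensors (see above)
    }
    return o;
}
```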

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

---------

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-authored-by: Steffen Roecker <sroecker@redhat.com>

b3025

28 May 19:50
56411a9
vulkan: properly initialize vulkan devices for LLAMA_SPLIT_MODE_NONE (#7552)

b3024

28 May 18:37
2b737ca
rpc : resource management rework (#7562)

* rpc : resource management rework

* address review comments

b3023

28 May 17:41
ee3dff6
Add support for DeepseekV2ForCausalLM (#7519)

* common : increase max number of experts to 160

* common : add tensors ATTN_Q_A, ATTN_Q_A_NORM, ATTN_Q_B, ATTN_KV_A_MQA, ATTN_KV_A_NORM, ATTN_KV_B needed by DeepSeek-V2 MLA (multi-head latent attention) architecture

* common : add model header parameters: leading_dense_block_count, expert_feed_forward_length, expert_shared_count, expert_weights_scale, attention.q_lora_rank, attention.kv_lora_rank, rope.scaling.yarn_log_multiplier

* convert-hf : add model conversion support for DeepseekV2ForCausalLM

* llama : add model types for DeepSeek-V2 and DeepSeek-V2-Lite models

* llama : add two new llm_build_moe_ffn() arguments: scale_w (whether to scale the weights of the selected MoE experts) and w_scale (the numerical value of the scaling factor); see the sketch after this list

* llama : add inference support for LLM_ARCH_DEEPSEEK2
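
A minimal self-contained sketch of the scale_w / w_scale idea: the argument names mirror the changelog entry above, while the surrounding function is illustrative, not the actual llm_build_moe_ffn() signature:

```cpp
#include <vector>

// Sketch: when scale_w is set, multiply each selected expert's routing
// weight by the constant factor w_scale (taken from the model hparams).
static void scale_expert_weights(std::vector<float> & weights,
                                 bool scale_w, float w_scale) {
    if (!scale_w) {
        return;              // default path: weights stay as routed
    }
    for (float & w : weights) {
        w *= w_scale;        // DeepSeek-V2 applies a per-model scaling factor
    }
}
```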

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>

b3021

28 May 15:55
8b99e2a
llama : handle unknown utf8 bytes (#7588)

b3019

28 May 12:04
e2b0650
[SYCL] fix ggml_sycl_mul_mat_id() to match the API change (#7436)

* fix mul_mat_id to match the API change

* rm comment

* rm unused or duplicated code, rename per review comments