
Conversation


xsxszab commented on Mar 4, 2025

Description

This pull request updates Nexa's llama.cpp fork to incorporate the latest upstream changes (commit 06c2b1) while preserving Nexa-specific modifications.

Key Updates

  • Merged all upstream changes while ensuring Nexa's custom modifications remain intact.
  • Updated Nexa-specific header and source files under ./common/ to align with llama.cpp changes.
  • Updated Nexa's example implementations (OmniVLM, OmniAudio, QwenAudio) for compatibility with the latest llama.cpp API, and replaced deprecated-but-still-functional calls as well.

thxCode and others added 30 commits January 26, 2025 16:20
Signed-off-by: thxCode <thxcode0824@gmail.com>
* Add initial ggml cmake package

* Add build numbers to ggml find-package

* Expand variables with GGML_ prefix

* Guard against adding to cache variable twice

* Add git to msys2 workflow

* Handle ggml-cpu-* variants

* Link ggml/ggml-base libraries to their targets

* Replace main-cmake-pkg with simple-cmake-pkg

* Interface features require c_std_90

* Fix typo

* Removed unnecessary bracket from status message

* Update examples/simple-cmake-pkg/README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update examples/simple-cmake-pkg/README.md

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* metal : use residency sets

ggml-ci

* metal : restore commandBufferWithUnretainedReferences calls [no ci]

* metal : release descriptors

ggml-ci

* metal : check env GGML_METAL_NO_RESIDENCY

ggml-ci

* metal : fix build + clean-up

ggml-ci
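
The GGML_METAL_NO_RESIDENCY check above boils down to an opt-out environment variable; a minimal sketch, with a hypothetical helper name (only the variable name comes from the commit message):

```cpp
#include <cstdlib>

// Hypothetical helper: residency sets stay enabled unless the user opts out
// by setting GGML_METAL_NO_RESIDENCY (the variable named in the commit above).
static bool metal_residency_enabled() {
    // getenv returns nullptr when the variable is unset; any value disables the feature.
    return std::getenv("GGML_METAL_NO_RESIDENCY") == nullptr;
}
```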
* ci : do not fail-fast for docker

* build arm64/amd64 separately

* fix pip

* no fast fail

* vulkan: try jammy
…-org#11441)

This fixes segmentation fault error when running tests when no metal
devices are available (for example, when not linked with Core Graphics
framework or otherwise).
* impl::load: change the bpe_ranks map to an unordered map, reducing impl::load time by ~30%

* llama_model_loader::init_mapping: replace `new llama_mmap` with std::make_unique<llama_mmap> for cleaner code and to roughly halve the time spent in init_mappings

* Update src/llama-vocab.cpp

---------

Co-authored-by: lexasub <empty@empty.ru>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
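A minimal sketch of the two changes in the commit above, with illustrative names (the real bpe_ranks key type, hash helper, and llama_mmap constructor differ): an unordered map turns the O(log n) rank lookups during vocab loading into average O(1), and std::make_unique replaces a raw new when building the mapping.

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Illustrative hash so a pair of strings can key an unordered_map
// (llama.cpp has its own helper; this one is just a sketch).
struct pair_hash {
    size_t operator()(const std::pair<std::string, std::string> & p) const {
        return std::hash<std::string>{}(p.first) ^ (std::hash<std::string>{}(p.second) << 1);
    }
};

// Before: std::map<std::pair<std::string, std::string>, int> bpe_ranks;
// After:  average O(1) lookups instead of O(log n) while loading merges.
using bpe_rank_map = std::unordered_map<std::pair<std::string, std::string>, int, pair_hash>;

struct llama_mmap_stub { /* stands in for the real llama_mmap */ };

// Before: mapping.reset(new llama_mmap_stub(...));
// After:  std::make_unique avoids the bare new and is exception-safe.
static std::unique_ptr<llama_mmap_stub> make_mapping() {
    return std::make_unique<llama_mmap_stub>();
}
```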
The value provided by `minor` doesn't include the stepping for AMD; parse the value returned by gcnArchName instead to retrieve an accurate ID.
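
A rough sketch of the idea, assuming a gcnArchName string of the usual "gfxNNN[:features]" form; the real parsing in the HIP backend may differ:

```cpp
#include <cstdio>

// Extract the numeric architecture ID from a gcnArchName string such as
// "gfx906:sramecc+:xnack-"; unlike the `minor` field, this keeps the stepping.
static int parse_gcn_arch(const char * gcn_arch_name) {
    int id = 0;
    if (std::sscanf(gcn_arch_name, "gfx%x", &id) == 1) {
        return id; // e.g. 0x906 for gfx906, 0x90a for gfx90a
    }
    return -1; // unrecognised format
}
```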
The HTTP client in llama-run only prints an error when the download of a
resource fails. If the model name is missing from the CLI parameter list,
the application crashes.
To prevent this, a check for the required model parameter has been added,
and errors from resource downloads are propagated to the caller.

Signed-off-by: Michael Engel <mengel@redhat.com>
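
Roughly the pattern described above, sketched with hypothetical names (the real llama-run code is structured differently): validate the required model argument up front and return the download status to the caller rather than only printing it.

```cpp
#include <cstdio>
#include <string>

// Stand-in for the real HTTP download; returns 0 on success, non-zero on failure.
static int download_resource(const std::string & /*url*/) { return 0; }

static int pull_model(const std::string & model) {
    if (model.empty()) {
        std::fprintf(stderr, "error: no model specified\n");
        return 1; // fail early instead of crashing later on a missing file
    }
    const int ret = download_resource(model);
    if (ret != 0) {
        std::fprintf(stderr, "error: failed to download '%s'\n", model.c_str());
    }
    return ret; // propagated to the caller instead of being swallowed
}
```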
Implemented F16 src1 (mask) support for ggml_sycl_op_soft_max(), for which a pragma deprecation warning was added in ggml-org#5021.
To do this, it had to be decoupled from ggml_sycl_op_flatten, which always assumed src1 to be of fp32 type (many OP functions depend on that).

* SYCL: SOFTMAX F16 mask support and other fixes

* test-backend-ops: Add F16 mask test cases
Signed-off-by: rare-magma <rare-magma@posteo.eu>
Signed-off-by: rare-magma <rare-magma@posteo.eu>
As pulling protocols to llama-run

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
…le instantiation bug (ggml-org#11080)

This disables the workaround on rocblas fixed versions (>=4.0.0) to eliminate the runtime cost and unnecessary VRAM allocation of loading all tensile objects.
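
In outline, the change gates the old workaround on the rocBLAS version; a sketch under the assumption that the rocBLAS headers expose ROCBLAS_VERSION_MAJOR and rocblas_initialize() (check the actual header for your release):

```cpp
#include <rocblas/rocblas.h>

static void maybe_preload_tensile_objects() {
#if defined(ROCBLAS_VERSION_MAJOR) && ROCBLAS_VERSION_MAJOR < 4
    // Older rocBLAS still hits the template instantiation bug, so eagerly load
    // all tensile objects up front (extra startup time and VRAM).
    rocblas_initialize();
#endif
    // rocBLAS >= 4.0.0 is fixed, so the workaround is skipped entirely.
}
```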
Loops with bounds not known at compile time cannot be unrolled.
When ncols_template == 0, the bounds of the loop are not constexpr, so LLVM can't unroll the loops here.
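
A minimal sketch of the fix, using an illustrative kernel helper rather than the actual ggml code (the `#pragma unroll` here is the CUDA/HIP one):

```cpp
// Unroll only the specialisation whose trip count is a compile-time constant;
// the dynamic-bound path gets a plain loop, since LLVM cannot unroll a loop
// whose bound is not constexpr.
template <int ncols_template>
static float row_sum(const float * x, int ncols_dynamic) {
    float sum = 0.0f;
    if constexpr (ncols_template > 0) {
#pragma unroll
        for (int i = 0; i < ncols_template; ++i) { // constant bound: safe to unroll
            sum += x[i];
        }
    } else {
        for (int i = 0; i < ncols_dynamic; ++i) {  // runtime bound: no unroll request
            sum += x[i];
        }
    }
    return sum;
}
```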
* ci : fix build CPU arm64

* failed, trying ubuntu 22

* vulkan: ubuntu 24

* vulkan : jammy --> noble
…-org#11473)

The test_completion_stream_with_openai_library() function actually ran with stream=False by default, while test_completion_with_openai_library() ran with stream=True
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
This commit enables the `--no-warmup` option for llama-embeddings.

The motivation for this change is to allow the user to disable the
warmup when running the program.
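
In essence, the option is a boolean in the shared parameters that skips the throw-away warmup decode; a hypothetical sketch (the real common-args handling and field name may differ):

```cpp
#include <cstring>

// Hypothetical parameter struct: warmup defaults to on.
struct params_stub {
    bool warmup = true;
};

static void parse_arg(params_stub & params, const char * arg) {
    if (std::strcmp(arg, "--no-warmup") == 0) {
        params.warmup = false; // user opted out of the warmup pass
    }
}
```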
…(ggml/1065)

Some threads kept looping and failed to terminate properly after an abort during CPU execution.

Co-authored-by: issi <issi@gmail.com>
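A generic sketch of the failure mode being fixed, not the actual ggml threadpool code: worker loops must observe an abort flag, otherwise an abort raised on one thread leaves the others spinning forever.

```cpp
#include <atomic>

static std::atomic<bool> g_abort{false};

// Each worker checks the abort flag on every iteration, so an abort during
// CPU execution lets all threads terminate promptly instead of looping.
static void worker_loop(std::atomic<int> & work_counter) {
    while (!g_abort.load(std::memory_order_relaxed)) {
        if (work_counter.fetch_sub(1, std::memory_order_relaxed) <= 0) {
            break; // no work left
        }
        // ... process one chunk of work ...
    }
}
```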
* Add option to not print stack on abort

Add option/envvar to disable stack printing on abort.
Also link some unittests with Threads to fix link errors on
ubuntu/g++11.

* Update ggml/src/ggml.c

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
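Conceptually the option is a check in the abort path before dumping a backtrace; a sketch with a hypothetical variable name (the actual flag/envvar in ggml may be named differently):

```cpp
#include <cstdio>
#include <cstdlib>

static void abort_with_message(const char * msg) {
    std::fprintf(stderr, "fatal: %s\n", msg);
    // Hypothetical opt-out: skip the stack dump when the variable is set.
    if (std::getenv("GGML_NO_BACKTRACE") == nullptr) {
        // print_backtrace();  // stack printing elided in this sketch
    }
    std::abort();
}
```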
People search for ollama models using the web UI; this change
allows one to copy the URL from the browser and have it be
compatible with llama-run.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
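
A rough sketch of the URL handling described above, with an illustrative prefix check (the real llama-run parsing covers more cases): a model page copied from the browser is reduced to the name the puller already understands.

```cpp
#include <string>

// "https://ollama.com/library/model-name" -> "model-name"
static std::string normalize_ollama_url(std::string model) {
    const std::string prefix = "https://ollama.com/library/";
    if (model.rfind(prefix, 0) == 0) {       // starts_with, pre-C++20 style
        model = model.substr(prefix.size()); // keep only "name[:tag]"
    }
    return model;
}
```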
…gml-org#11436)

* vulkan: Catch pipeline creation failure and print an error message

Also, fix some warnings from my on-demand compile change.

* vulkan: fix pipeline creation logging
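
The gist of the first change above, sketched against the raw Vulkan C API rather than ggml's own wrappers: check the result of pipeline creation and report which pipeline failed instead of continuing with a null handle.

```cpp
#include <stdexcept>
#include <string>
#include <vulkan/vulkan.h>

static VkPipeline create_compute_pipeline(VkDevice device,
                                          const VkComputePipelineCreateInfo & info,
                                          const std::string & name) {
    VkPipeline pipeline = VK_NULL_HANDLE;
    VkResult res = vkCreateComputePipelines(device, VK_NULL_HANDLE, 1, &info, nullptr, &pipeline);
    if (res != VK_SUCCESS) {
        // Surface a readable error naming the pipeline instead of failing silently.
        throw std::runtime_error("failed to create pipeline '" + name +
                                 "' (VkResult " + std::to_string(static_cast<int>(res)) + ")");
    }
    return pipeline;
}
```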
* server : update auto gen files comments

This commit updates the 'auto generated files' comments in server.cpp
and removes `deps.sh` from the comment.

The motivation for this change is that `deps.sh` was removed in
Commit 91c36c2 ("server : (web ui)
Various improvements, now use vite as bundler (ggml-org#10599)").

* squash! server : update auto gen files comments [no ci]

Move comments about file generation to README.md.

* squash! server : update auto gen files comments [no ci]

Remove the comments in server.cpp that mention that information
can be found in the README.md file.
…-org#11360)

* vulkan: initial support for IQ3_S

* vulkan: initial support for IQ3_XXS

* vulkan: initial support for IQ2_XXS

* vulkan: initial support for IQ2_XS

* vulkan: optimize Q3_K by removing branches

* vulkan: implement dequantize variants for coopmat2

* vulkan: initial support for IQ2_S

* vulkan: vertically realign code

* port failing dequant callbacks from mul_mm

* Fix array length mismatches

* vulkan: avoid using workgroup size before it is referenced

* tests: increase timeout for Vulkan llvmpipe backend

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
Davidqian123 merged commit 41aa79a into master on Mar 5, 2025
1 check passed