Skip to content

[pull] master from ggml-org:master#96

Merged
pull[bot] merged 6 commits into
CrazyForks:masterfrom
ggml-org:master
May 25, 2026
Merged

[pull] master from ggml-org:master#96
pull[bot] merged 6 commits into
CrazyForks:masterfrom
ggml-org:master

Conversation

@pull
Copy link
Copy Markdown

@pull pull Bot commented May 25, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

ggerganov and others added 6 commits May 25, 2026 12:43
* Refactored Compressed Tensors NVFP4 support for new base.py

* Support compressed-tensors NVFP4 conversion

* Moved Qwen MTP remap into filter_tensors

* simplify

* pathlib no longer used

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* CUDA: add fast walsh-hadamard transform

* review: add unrolls + change size_t -> int

* warp size 64

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
ffn_latent_down/up are declared GGML_OP_MUL in LLM_TENSOR_INFOS but
nemotron-h feeds them through ggml_mul_mat. The loader buft probe asks
the backend about the declared op, so it tested an elementwise MUL on a
q8_0 weight. That used to return true unconditionally and the weight
stayed on GPU by luck. Once supports_op told the truth, the probe got a
no and the loader pushed the weight and its matmul to CPU, splitting the
graph. Tagging it MUL_MAT asks the real question, the math is unchanged.

Verified on Nemotron 3 Super 120B Q5_K_M: from 64.9 back to 103.22 t/s.
@pull pull Bot locked and limited conversation to collaborators May 25, 2026
@pull pull Bot added the ⤵️ pull label May 25, 2026
@pull pull Bot merged commit 328874d into CrazyForks:master May 25, 2026
3 of 20 checks passed
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants