Build llama quantize
tools/convert.py outputs an unquantized F16/BF16 .gguf. To get smaller Q*_K_M / IQ* quants you also need a patched llama-quantize binary built from llama.cpp at tag b3962 with tools/lcpp.patch applied.
The patch:
- Doubles GGML_MAX_NAME (64 → 128) so quantization can preserve longer diffusion-model tensor names.
- Adds gguf_set_tensor_ndim() so the writer can override the on-disk ndim metadata for tensors whose stored shape differs from the runtime ndim (used by Comfy diffusion archs where 5-D tensors get reshaped to 4-D for storage).
- Adjusts src/llama.cpp tensor-name handling for the new GGML_MAX_NAME bound.
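A quick way to check whether a source tree carries the first of these changes is to read the define back from the header. The sketch below assumes the header lives at ggml/include/ggml.h (the path named in the re-patching notes on this page) and that the define is written either as `#define GGML_MAX_NAME N` or with spaces after the `#`; the helper name ggml_max_name is made up for illustration.

```shell
#!/bin/sh
# ggml_max_name HEADER: print the numeric value of GGML_MAX_NAME, or nothing
# if the define is not found. Accepts "#define" and "#   define" spellings.
ggml_max_name() {
    sed -n 's/^[[:space:]]*#[[:space:]]*define[[:space:]]\{1,\}GGML_MAX_NAME[[:space:]]\{1,\}\([0-9]\{1,\}\).*/\1/p' "$1" | head -n 1
}

# Assumed header location; adjust if your checkout is laid out differently.
header="ggml/include/ggml.h"
if [ -f "$header" ]; then
    value=$(ggml_max_name "$header")
    if [ "$value" = "128" ]; then
        echo "patched: GGML_MAX_NAME=$value"
    else
        echo "not patched: GGML_MAX_NAME=${value:-missing}"
    fi
fi
```

Run it from the root of the llama.cpp checkout. It only inspects the source tree, so it says nothing about whether an already-built binary was compiled from that tree.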
The previous-generation instructions (clone upstream → checkout tag → git apply lcpp.patch → fight line endings → cmake) still work and live in tools/README.md. This page documents the shortcut version: clone a pre-patched fork and build directly.
| OS | Need |
|---|---|
| Linux | build-essential, cmake ≥ 3.18, git. Optionally NVIDIA CUDA toolkit ≥ 12.0 for GPU quantize. |
| macOS | Xcode CLT (xcode-select --install), cmake (Homebrew or via the Python cmake wheel). |
| Windows | Visual Studio 2022 (Desktop C++ workload) or MinGW-w64. CMake on PATH. |
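The Linux and macOS rows can be checked with a small script before starting. This is a minimal sketch, not part of the repository: version_ge is a hypothetical helper that compares dotted version numbers, and the script only looks for git and cmake, not for a compiler or the CUDA toolkit.

```shell
#!/bin/sh
# version_ge HAVE WANT: succeed when dotted version HAVE is >= WANT.
# Missing components count as 0, so "3.18" equals "3.18.0".
version_ge() {
    awk -v have="$1" -v want="$2" 'BEGIN {
        n = split(have, h, ".")
        m = split(want, w, ".")
        max = (n > m) ? n : m
        for (i = 1; i <= max; i++) {
            a = (i <= n) ? h[i] + 0 : 0
            b = (i <= m) ? w[i] + 0 : 0
            if (a > b) exit 0
            if (a < b) exit 1
        }
        exit 0
    }'
}

# Report tools that are not on PATH.
for tool in git cmake; do
    command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done

# The first line of `cmake --version` reads "cmake version X.Y.Z".
if command -v cmake >/dev/null 2>&1; then
    have=$(cmake --version | awk 'NR == 1 { print $3 }')
    if version_ge "$have" 3.18; then
        echo "cmake $have is new enough"
    else
        echo "cmake $have is older than 3.18"
    fi
fi
```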
CPU-only llama-quantize is fine for almost all diffusion model sizes (a 12 B Flux F16 takes a couple of minutes on a modern desktop CPU). Use CUDA only if you regularly quantize very large models.
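For orientation, a CPU quantize run with the finished binary takes an input file, an output file and a quant type as positional arguments. The sketch below assumes a file called model-F16.gguf and an output naming scheme that swaps the -F16 / -BF16 suffix for the quant type; both are placeholders chosen for illustration, not conventions this project prescribes.

```shell
#!/bin/sh
# quant_name INPUT TYPE: derive an output file name by replacing a trailing
# -F16 or -BF16 tag with the quant type (naming scheme is an assumption).
quant_name() {
    base=${1%.gguf}
    base=${base%-BF16}
    base=${base%-F16}
    printf '%s-%s.gguf\n' "$base" "$2"
}

quantize="./build/bin/llama-quantize"   # build output path used on this page
input="model-F16.gguf"                  # hypothetical output of tools/convert.py
qtype="Q4_K_M"

# Only run when both the binary and the input actually exist.
if [ -x "$quantize" ] && [ -f "$input" ]; then
    "$quantize" "$input" "$(quant_name "$input" "$qtype")" "$qtype"
fi
```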
git clone -b city96 https://github.com/Randy420Marsh/llama.cpp.git
cd llama.cpp
The city96 branch on Randy420Marsh/llama.cpp is upstream ggml-org/llama.cpp at tag b3962 with lcpp.patch already applied (commit f8dfcc87). No git apply step. No CRLF normalisation. No --ignore-whitespace workaround.
If you ever want to update the base — e.g. apply the patch on top of a newer tag than b3962 — see Re-patching against a different upstream tag at the bottom.
cmake -B build \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CXX_STANDARD=17 \
-DCMAKE_CXX_STANDARD_REQUIRED=ON
cmake --build build --config Release -j$(nproc) --target llama-quantize
The -DCMAKE_CXX_STANDARD=17 flag is the important one. CUDA toolkits 12.x and 13.x require C++17, and CMake defaults to C++11 on some toolchains.
Output: llama.cpp/build/bin/llama-quantize.
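The -j$(nproc) part of the build command relies on nproc, a GNU coreutils tool that may be absent on macOS. One possible workaround is to work out the job count with a fallback chain. detect_jobs is a hypothetical helper, and the script only prints the resulting build command rather than running it.

```shell
#!/bin/sh
# detect_jobs: print a parallel job count, trying nproc, then getconf,
# then sysctl (macOS/BSD), and falling back to 1.
detect_jobs() {
    if command -v nproc >/dev/null 2>&1; then
        nproc
    elif n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) && [ -n "$n" ]; then
        echo "$n"
    elif n=$(sysctl -n hw.ncpu 2>/dev/null) && [ -n "$n" ]; then
        echo "$n"
    else
        echo 1
    fi
}

njobs=$(detect_jobs)
# Print the command so it can be reviewed before running it by hand.
echo "cmake --build build --config Release -j$njobs --target llama-quantize"
```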
cmake -B build \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CXX_STANDARD=17 \
-DCMAKE_CXX_STANDARD_REQUIRED=ON \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_STANDARD=17
cmake --build build --config Release -j$(nproc) --target llama-quantize
Add -DCMAKE_CUDA_ARCHITECTURES=89 (or your card's compute capability, e.g. 86 for Ampere, 75 for Turing) if CMake complains about not knowing what architectures to target.
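If you are unsure which number to pass, the compute capability can usually be read from the driver. The sketch below assumes an nvidia-smi recent enough to support the compute_cap query field; older drivers may not report it, in which case the script prints nothing. cap_to_arch is a hypothetical helper.

```shell
#!/bin/sh
# cap_to_arch CAP: turn a compute capability such as "8.9" into the
# dot-free form CMAKE_CUDA_ARCHITECTURES expects ("89").
cap_to_arch() {
    printf '%s\n' "$1" | tr -d '. '
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # Take the first GPU only; multi-GPU machines may want a list instead.
    cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1)
    if [ -n "$cap" ]; then
        echo "-DCMAKE_CUDA_ARCHITECTURES=$(cap_to_arch "$cap")"
    fi
fi
```

With CMake 3.24 or newer, -DCMAKE_CUDA_ARCHITECTURES=native is another option that targets whatever GPU is present on the build machine.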
cmake -B build ^
-G "Visual Studio 17 2022" -A x64 ^
-DCMAKE_CXX_STANDARD=17 ^
-DCMAKE_CXX_STANDARD_REQUIRED=ON
cmake --build build --config Release -j --target llama-quantize
MSVC defaults to C++14, so the explicit CMAKE_CXX_STANDARD=17 is mandatory.
Output: llama.cpp\build\bin\Release\llama-quantize.exe.
The patched build links llama-quantize against shared libggml.so / libllama.so inside the build tree, not against any system-installed copy. Run from another directory and you get:
error while loading shared libraries: libggml.so: cannot open shared object file
Fix:
export LD_LIBRARY_PATH=/path/to/llama.cpp/build/src:/path/to/llama.cpp/build/ggml/src:$LD_LIBRARY_PATH
Make it persistent in ~/.bashrc / ~/.zshrc, or set it from the script that launches gguf_gui.py (the GUI spawns llama-quantize as a subprocess and inherits the env).
macOS uses DYLD_LIBRARY_PATH instead. Windows doesn't need any equivalent — the loader picks up DLLs from build\bin\Release\ automatically.
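A small wrapper can pick the right variable for the platform and launch the binary in one step. This is a sketch under stated assumptions: LLAMA_DIR is a placeholder for your checkout, and lib_var_for and prepend_path are made-up helper names. prepend_path avoids leaving a trailing colon when the variable was previously unset, since the Linux loader treats an empty entry as the current directory.

```shell
#!/bin/sh
# lib_var_for OS: loader search-path variable for a `uname -s` value.
lib_var_for() {
    case "$1" in
        Darwin) echo DYLD_LIBRARY_PATH ;;
        *)      echo LD_LIBRARY_PATH ;;
    esac
}

# prepend_path NEW OLD: join the two, omitting the colon when OLD is empty.
prepend_path() {
    if [ -n "$2" ]; then
        echo "$1:$2"
    else
        echo "$1"
    fi
}

LLAMA_DIR="${LLAMA_DIR:-/path/to/llama.cpp}"   # placeholder, set to your checkout
libs="$LLAMA_DIR/build/src:$LLAMA_DIR/build/ggml/src"

case "$(lib_var_for "$(uname -s)")" in
    DYLD_LIBRARY_PATH)
        export DYLD_LIBRARY_PATH="$(prepend_path "$libs" "${DYLD_LIBRARY_PATH:-}")" ;;
    *)
        export LD_LIBRARY_PATH="$(prepend_path "$libs" "${LD_LIBRARY_PATH:-}")" ;;
esac

# Hand over to the real binary only if it exists, forwarding all arguments.
if [ -x "$LLAMA_DIR/build/bin/llama-quantize" ]; then
    exec "$LLAMA_DIR/build/bin/llama-quantize" "$@"
fi
```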
./build/bin/llama-quantize --help 2>&1 | head -5
If it prints usage you're done. If it crashes with a shared-library error, re-read step 4. If --help works but a real quantize step crashes on a long tensor name, you didn't actually pick up the patch. Verify with:
strings ./build/bin/llama-quantize | grep -c gguf_set_tensor_ndim
You should see ≥ 1. Zero means you built an unpatched binary (probably the wrong checkout).
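Because the build links against shared libraries, the symbol may also (or only) show up in libggml.so or libllama.so rather than in the executable itself. That is an assumption about the build layout, not something this page confirms. A hedged variant of the check scans all three files and works without the strings tool; has_symbol and scan_build are hypothetical helpers, and the library paths are the ones from the LD_LIBRARY_PATH line above.

```shell
#!/bin/sh
# has_symbol NAME FILE: succeed if FILE contains NAME as a byte string.
# grep -a treats binary input as text, so `strings` is not required.
has_symbol() {
    grep -a -q -- "$1" "$2" 2>/dev/null
}

# scan_build NAME FILE...: print each existing file that contains NAME.
scan_build() {
    name=$1
    shift
    for f in "$@"; do
        if [ -f "$f" ] && has_symbol "$name" "$f"; then
            echo "$f"
        fi
    done
}

# Run from the llama.cpp checkout; on macOS the libraries end in .dylib.
scan_build gguf_set_tensor_ndim \
    build/bin/llama-quantize \
    build/src/libllama.so \
    build/ggml/src/libggml.so
```

No output at all points to an unpatched build; at least one listed file suggests the patch made it in.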
Re-patching against a different upstream tag
The city96 branch is pinned to b3962 (Oct 2024), the tag lcpp.patch was authored against. If you need a newer base (e.g. for a CUDA-toolkit-compatibility fix that landed after b3962), regenerating the patch is straightforward:
git clone https://github.com/ggml-org/llama.cpp.git llama-new
cd llama-new
git checkout <newer-tag>
git apply ../ComfyUI-GGUF/tools/lcpp.patch # likely conflicts
# resolve conflicts in ggml/include/ggml.h, ggml/src/ggml.c, src/llama.cpp
git diff > ../ComfyUI-GGUF/tools/lcpp.patch # save new patch
Then either push the rebuilt patch to your own fork as a new branch, or open a PR against Randy420Marsh/llama.cpp to update the city96 branch. Note that the branch name is conventional, not a guarantee of a specific upstream tag.