Build llama quantize

Easter Ledge edited this page May 22, 2026 · 1 revision

Build the patched llama-quantize

tools/convert.py outputs an unquantized F16/BF16 .gguf. To get smaller Q*_K_M / IQ* quants you also need a patched llama-quantize binary built from llama.cpp at tag b3962 with tools/lcpp.patch applied.

The patch:

  • Doubles GGML_MAX_NAME (64 → 128) so quantization can preserve longer diffusion-model tensor names.
  • Adds gguf_set_tensor_ndim() so the writer can override the on-disk ndim metadata for tensors whose stored shape differs from the runtime ndim (used by Comfy diffusion archs where 5-D tensors get reshaped to 4-D for storage).
  • Adjusts src/llama.cpp tensor-name handling for the new GGML_MAX_NAME bound.

The previous-generation instructions (clone upstream → checkout tag → git apply lcpp.patch → fight line endings → cmake) still work and live in tools/README.md. This page documents the shortcut version: clone a pre-patched fork and build directly.


1. Prerequisites

OS       Need
Linux    build-essential, cmake ≥ 3.18, git. Optionally NVIDIA CUDA toolkit ≥ 12.0 for GPU quantize.
macOS    Xcode CLT (xcode-select --install), cmake (Homebrew or via the Python cmake wheel).
Windows  Visual Studio 2022 (Desktop C++ workload) or MinGW-w64. CMake on PATH.

CPU-only llama-quantize is fine for almost all diffusion model sizes (a 12 B Flux F16 takes a couple of minutes on a modern desktop CPU). Use CUDA only if you regularly quantize very large models.
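
Before cloning, it can save time to confirm that the basic tools are present. The following is a minimal sketch rather than anything shipped with the repository: the helper names version_ge and check_prereqs are made up for illustration, and the check only covers git and cmake on PATH plus the cmake ≥ 3.18 minimum stated above.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of ComfyUI-GGUF or llama.cpp.

# version_ge HAVE WANT: succeed when dotted version HAVE >= WANT.
# Compares up to three numeric fields; missing fields count as 0.
version_ge() {
    for i in 1 2 3; do
        # The trailing dots guarantee that cut always finds fields 1..3.
        h=$(printf '%s..\n' "$1" | cut -d. -f"$i")
        w=$(printf '%s..\n' "$2" | cut -d. -f"$i")
        # Drop any non-numeric suffix such as "-rc1".
        h=${h%%[!0-9]*}; w=${w%%[!0-9]*}
        h=${h:-0}; w=${w:-0}
        [ "$h" -gt "$w" ] && return 0
        [ "$h" -lt "$w" ] && return 1
    done
    return 0
}

# check_prereqs: report missing tools and a too-old cmake on stderr.
check_prereqs() {
    missing=0
    for tool in git cmake; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    [ "$missing" -eq 0 ] || return 1
    # `cmake --version` prints "cmake version X.Y.Z" on its first line.
    have=$(cmake --version | head -n 1 | awk '{print $3}')
    if ! version_ge "$have" 3.18; then
        echo "cmake $have is older than 3.18" >&2
        return 1
    fi
}
```

Running check_prereqs && echo "toolchain looks OK" after sourcing the file gives a quick yes/no answer. Compiler detection is deliberately left out because it differs per platform.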


2. Clone the pre-patched branch

git clone -b city96 https://github.com/Randy420Marsh/llama.cpp.git
cd llama.cpp

The city96 branch on Randy420Marsh/llama.cpp is upstream ggml-org/llama.cpp at tag b3962 with lcpp.patch already applied (commit f8dfcc87). No git apply step. No CRLF normalisation. No --ignore-whitespace workaround.

If you ever want to update the base — e.g. apply the patch on top of a newer tag than b3962 — see Re-patching against a different upstream tag at the bottom.
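
To confirm that a checkout really carries the patch before spending time on a build, one option is to look for the doubled GGML_MAX_NAME value in the header. This is a sketch under two assumptions: that the define lives in ggml/include/ggml.h (one of the files the patch touches) and that it is written as a plain #define with a numeric value. The function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of ComfyUI-GGUF or llama.cpp.

# max_name_value HEADER: print the numeric value of the GGML_MAX_NAME define.
max_name_value() {
    sed -n 's/^[[:space:]]*#[[:space:]]*define[[:space:]][[:space:]]*GGML_MAX_NAME[[:space:]][[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# header_is_patched HEADER: succeed when the value is the patched 128.
header_is_patched() {
    [ "$(max_name_value "$1")" = "128" ]
}
```

From the llama.cpp directory, header_is_patched ggml/include/ggml.h || echo "unpatched checkout" flags a wrong branch early. A value of 64 indicates the unpatched upstream default.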


3. Configure & build

Linux / macOS (CPU)

cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CXX_STANDARD=17 \
  -DCMAKE_CXX_STANDARD_REQUIRED=ON
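# nproc is not available on stock macOS, so fall back to sysctl there
cmake --build build --config Release -j"$(nproc 2>/dev/null || sysctl -n hw.ncpu)" --target llama-quantize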

The -DCMAKE_CXX_STANDARD=17 flag is the important one. CUDA toolkits 12.x and 13.x require C++17, and CMake defaults to C++11 on some toolchains.

Output: llama.cpp/build/bin/llama-quantize.

Linux (CUDA)

cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CXX_STANDARD=17 \
  -DCMAKE_CXX_STANDARD_REQUIRED=ON \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_STANDARD=17
cmake --build build --config Release -j$(nproc) --target llama-quantize

Add -DCMAKE_CUDA_ARCHITECTURES=89 (Ada Lovelace; substitute your card's compute capability written without the dot, e.g. 86 for consumer Ampere, 75 for Turing) if CMake complains that it cannot determine which architectures to target.
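
If you are unsure which value to pass, the number can usually be read from the driver. The sketch below assumes an nvidia-smi recent enough to support the compute_cap query field, which older drivers lack. Both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of ComfyUI-GGUF or llama.cpp.

# cap_to_arch CAP: turn a compute capability such as "8.9" into "89".
cap_to_arch() {
    printf '%s\n' "$1" | tr -d '. '
}

# detect_cuda_arch: print the architecture value for the first visible GPU.
# Assumes nvidia-smi is on PATH and understands --query-gpu=compute_cap.
detect_cuda_arch() {
    cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1)
    [ -n "$cap" ] || return 1
    cap_to_arch "$cap"
}
```

The result can then be passed on the configure line as -DCMAKE_CUDA_ARCHITECTURES="$(detect_cuda_arch)". On machines with mixed GPUs only the first card is considered, so set the value by hand there.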

Windows (Visual Studio 2022)

cmake -B build ^
  -G "Visual Studio 17 2022" -A x64 ^
  -DCMAKE_CXX_STANDARD=17 ^
  -DCMAKE_CXX_STANDARD_REQUIRED=ON
cmake --build build --config Release -j --target llama-quantize

MSVC defaults to C++14 — the explicit CMAKE_CXX_STANDARD=17 is mandatory.

Output: llama.cpp\build\bin\Release\llama-quantize.exe.


4. (Linux only) Export LD_LIBRARY_PATH

The patched build links llama-quantize against shared libggml.so / libllama.so inside the build tree, not against any system-installed copy. Run from another directory and you get:

error while loading shared libraries: libggml.so: cannot open shared object file

Fix:

export LD_LIBRARY_PATH=/path/to/llama.cpp/build/src:/path/to/llama.cpp/build/ggml/src${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}

Make it persistent in ~/.bashrc / ~/.zshrc, or set it from the script that launches gguf_gui.py (the GUI spawns llama-quantize as a subprocess, which inherits the GUI's environment).

macOS uses DYLD_LIBRARY_PATH instead. Windows doesn't need any equivalent — the loader picks up DLLs from build\bin\Release\ automatically.
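
For the launcher-script route, a small helper keeps the variable tidy. The dynamic loader treats an empty entry in LD_LIBRARY_PATH, such as the one left by a trailing colon, as the current directory, so it is worth adding the colon only when the variable already has a value. This is a sketch: prepend_path and LLAMA_CPP_DIR are hypothetical names, and the way gguf_gui.py is started in the commented usage is an assumption.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not part of ComfyUI-GGUF or llama.cpp.

# prepend_path NEW CURRENT: print NEW in front of CURRENT, adding the
# separating colon only when CURRENT is non-empty.
prepend_path() {
    printf '%s\n' "$1${2:+:$2}"
}

# Possible launcher usage (paths and the python invocation are assumptions):
#   LLAMA_CPP_DIR=/path/to/llama.cpp
#   LD_LIBRARY_PATH=$(prepend_path "$LLAMA_CPP_DIR/build/src:$LLAMA_CPP_DIR/build/ggml/src" "$LD_LIBRARY_PATH")
#   export LD_LIBRARY_PATH
#   exec python gguf_gui.py
```

Because the variable is exported before exec, the GUI and every llama-quantize subprocess it starts see the same library path.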


5. Smoke test

./build/bin/llama-quantize --help 2>&1 | head -5

If it prints usage you're done. If it crashes with a shared-library error, re-read step 4. If --help works but a real quantize step crashes on a long tensor name, you didn't actually pick up the patch — verify with:

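strings ./build/bin/llama-quantize ./build/ggml/src/libggml.so | grep -c gguf_set_tensor_ndim

You should see ≥ 1. The shared library is scanned along with the executable because, in the shared build described in step 4, the ggml code lives in libggml.so (libggml.dylib on macOS) rather than in the executable itself. Zero means you built an unpatched binary (probably the wrong checkout).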


Re-patching against a different upstream tag

The city96 branch is pinned to b3962 (Oct 2024) — that's the tag lcpp.patch was authored against. If you need a newer base (e.g. for a CUDA-toolkit-compatibility fix that landed after b3962), regenerating the patch is straightforward:

git clone https://github.com/ggml-org/llama.cpp.git llama-new
cd llama-new
git checkout <newer-tag>
git apply --reject ../ComfyUI-GGUF/tools/lcpp.patch     # hunks that fail are written to *.rej files
# apply the rejected hunks by hand in ggml/include/ggml.h, ggml/src/ggml.c, src/llama.cpp
git diff > ../ComfyUI-GGUF/tools/lcpp.patch    # save new patch

Then either push the rebuilt patch to your own fork as a new branch, or open a PR against Randy420Marsh/llama.cpp to update the city96 branch — note that the branch name is conventional, not a guarantee of a specific upstream tag.
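
When re-patching, it helps to know whether any hunks are still unresolved before the new patch is written out. The sketch below assumes the patch was applied with git apply --reject, which leaves each failed hunk in a *.rej file next to its target. The function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of ComfyUI-GGUF or llama.cpp.

# pending_rejects DIR: list *.rej files left behind by `git apply --reject`.
pending_rejects() {
    find "$1" -name '*.rej' -type f
}

# ready_to_export DIR: succeed only when no rejected hunks remain.
ready_to_export() {
    [ -z "$(pending_rejects "$1")" ]
}
```

A typical use is ready_to_export . && git diff > ../ComfyUI-GGUF/tools/lcpp.patch, deleting each .rej file once its hunk has been merged by hand. Running git apply --check on the patch beforehand shows whether it applies cleanly without touching the tree.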