Build llama quantize

Build the patched `llama-quantize`

tools/convert.py outputs an unquantized F16/BF16 .gguf. To get smaller Q*_K_M / IQ* quants you also need a patched llama-quantize binary built from llama.cpp at tag b3962 with tools/lcpp.patch applied.

The patch:

Doubles GGML_MAX_NAME (64 → 128) so quantization can preserve longer diffusion-model tensor names.
Adds gguf_set_tensor_ndim() so the writer can override the on-disk ndim metadata for tensors whose stored shape differs from the runtime ndim (used by Comfy diffusion archs where 5-D tensors get reshaped to 4-D for storage).
Adjusts src/llama.cpp tensor-name handling for the new GGML_MAX_NAME bound.

The previous-generation instructions (clone upstream → checkout tag → git apply lcpp.patch → fight line endings → cmake) still work and live in tools/README.md. This page documents the shortcut version: clone a pre-patched fork and build directly.

1. Prerequisites

| OS | Requirements |
|---|---|
| Linux | `build-essential`, `cmake` ≥ 3.18, `git`. Optionally the NVIDIA CUDA toolkit ≥ 12.0 for GPU quantize. |
| macOS | Xcode CLT (`xcode-select --install`), `cmake` (Homebrew or via the Python `cmake` wheel). |
| Windows | Visual Studio 2022 (Desktop C++ workload) or MinGW-w64. CMake on `PATH`. |

CPU-only llama-quantize is fine for almost all diffusion model sizes (a 12 B Flux F16 takes a couple of minutes on a modern desktop CPU). Use CUDA only if you regularly quantize very large models.
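Before configuring, it can help to confirm that the tools listed above are actually on `PATH`. The following is a minimal sketch for Linux/macOS, not something shipped with the repo: `missing_tools` is a hypothetical helper, and the tool list (`git`, `cmake`, `c++`) is an assumption standing in for the requirements in the table.

```shell
# Hypothetical helper: print each named tool that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed tool list for a CPU build; add nvcc if you plan a CUDA build.
missing=$(missing_tools git cmake c++)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing build tools:" $missing
else
    echo "All required build tools found."
fi
```

An empty result means every tool was found; anything printed needs to be installed before step 3.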

2. Clone the pre-patched branch

git clone -b city96 https://github.com/Randy420Marsh/llama.cpp.git
cd llama.cpp

The city96 branch on Randy420Marsh/llama.cpp is upstream ggml-org/llama.cpp at tag b3962 with lcpp.patch already applied (commit f8dfcc87). No git apply step. No CRLF normalisation. No --ignore-whitespace workaround.
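To confirm the clone really sits on the patched commit, you can compare `HEAD` against the abbreviated hash quoted above. This is only a sketch: `hash_has_prefix` is a hypothetical helper, and the check assumes the branch tip is still `f8dfcc87`, which may no longer hold if the branch has been updated since this page was written.

```shell
# Hypothetical helper: succeed if full hash $2 starts with abbreviated hash $1.
hash_has_prefix() {
    case "$2" in
        "$1"*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Run from inside the clone. f8dfcc87 is the commit named on this page; a
# mismatch only means the branch tip differs, not necessarily that it is wrong.
if head_hash=$(git rev-parse HEAD 2>/dev/null); then
    if hash_has_prefix f8dfcc87 "$head_hash"; then
        echo "HEAD is the patched commit"
    else
        echo "HEAD is $head_hash, not f8dfcc87; inspect 'git log' before building"
    fi
fi
```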

If you ever want to update the base — e.g. apply the patch on top of a newer tag than b3962 — see Re-patching against a different upstream tag at the bottom.

3. Configure & build

Linux / macOS (CPU)

cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CXX_STANDARD=17 \
  -DCMAKE_CXX_STANDARD_REQUIRED=ON
cmake --build build --config Release -j"$(nproc 2>/dev/null || sysctl -n hw.ncpu)" --target llama-quantize

The -DCMAKE_CXX_STANDARD=17 flag is the important one. CUDA toolkits 12.x and 13.x require C++17, and CMake defaults to C++11 on some toolchains.

Output: llama.cpp/build/bin/llama-quantize.

Linux (CUDA)

cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CXX_STANDARD=17 \
  -DCMAKE_CXX_STANDARD_REQUIRED=ON \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_STANDARD=17
cmake --build build --config Release -j$(nproc) --target llama-quantize

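If CMake complains that it does not know which architectures to target, add -DCMAKE_CUDA_ARCHITECTURES=<arch>, where <arch> is your card's compute capability written without the dot: 89 for Ada Lovelace (RTX 40-series), 86 for consumer Ampere (RTX 30-series), 75 for Turing.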
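If you are unsure of the compute capability, the driver can usually report it. The sketch below assumes a driver recent enough that `nvidia-smi` supports the `compute_cap` query field; `cap_to_arch` is a hypothetical helper that only strips the dot.

```shell
# Hypothetical helper: turn a compute capability such as "8.9" into the
# form CMAKE_CUDA_ARCHITECTURES expects ("89").
cap_to_arch() {
    printf '%s\n' "$1" | tr -d '.'
}

# Query the first GPU if nvidia-smi is installed. The compute_cap field is
# an assumption about the driver version; older drivers may not provide it.
if command -v nvidia-smi >/dev/null 2>&1; then
    cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n 1)
    if [ -n "$cap" ]; then
        echo "-DCMAKE_CUDA_ARCHITECTURES=$(cap_to_arch "$cap")"
    fi
fi
```

On a machine with a supported GPU this prints a flag you can paste into the configure command; without `nvidia-smi` it prints nothing.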

Windows (Visual Studio 2022)

cmake -B build ^
  -G "Visual Studio 17 2022" -A x64 ^
  -DCMAKE_CXX_STANDARD=17 ^
  -DCMAKE_CXX_STANDARD_REQUIRED=ON
cmake --build build --config Release -j --target llama-quantize

MSVC defaults to C++14 — the explicit CMAKE_CXX_STANDARD=17 is mandatory.

Output: llama.cpp\build\bin\Release\llama-quantize.exe.

4. (Linux only) Export `LD_LIBRARY_PATH`

The patched build links llama-quantize against shared libggml.so / libllama.so inside the build tree, not against any system-installed copy. Run from another directory and you get:

error while loading shared libraries: libggml.so: cannot open shared object file

Fix:

export LD_LIBRARY_PATH=/path/to/llama.cpp/build/src:/path/to/llama.cpp/build/ggml/src${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}

Make it persistent in ~/.bashrc / ~/.zshrc, or set it from the script that launches gguf_gui.py (the GUI spawns llama-quantize as a subprocess and inherits the env).

macOS uses DYLD_LIBRARY_PATH instead. Windows doesn't need any equivalent — the loader picks up DLLs from build\bin\Release\ automatically.
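If you would rather not change the environment of your whole shell, a small wrapper can set the variable for a single command, which is also convenient in a launch script. This is a sketch for Linux: `with_llama_libs` is a hypothetical name, and the two library directories are the ones from the export line above.

```shell
# Hypothetical wrapper: run a command with the build tree's library
# directories prepended to LD_LIBRARY_PATH, leaving the parent shell as-is.
# Usage: with_llama_libs <build-dir> <command> [args...]
with_llama_libs() {
    build_dir=$1
    shift
    # ${VAR:+...} appends the old value only when it is non-empty, so no
    # stray trailing colon (which would add the current directory) appears.
    env LD_LIBRARY_PATH="$build_dir/src:$build_dir/ggml/src${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" "$@"
}

# Example (paths are placeholders):
# with_llama_libs /path/to/llama.cpp/build /path/to/llama.cpp/build/bin/llama-quantize --help
```

Any child process started by the wrapped command inherits the same variable.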

5. Smoke test

./build/bin/llama-quantize --help 2>&1 | head -5

If it prints usage you're done. If it crashes with a shared-library error, re-read step 4. If --help works but a real quantize step crashes on a long tensor name, you didn't actually pick up the patch — verify with:

strings ./build/bin/llama-quantize | grep -c gguf_set_tensor_ndim

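You should see ≥ 1. Zero usually means you built an unpatched binary (probably the wrong checkout). One caveat: in a shared-library build (step 4) the function may be present only in libggml.so rather than in the executable, so if the count is zero, run the same check against build/ggml/src/libggml.so before concluding that the patch is missing.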
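Because the build links against shared libraries (step 4), the symbol may end up in `libggml.so` or `libllama.so` rather than in the executable itself. The sketch below checks all three; `has_symbol` is a hypothetical helper, the build-tree layout is assumed from the paths used elsewhere on this page, and `grep -a` is used in place of `strings` so that it also works where binutils is not installed.

```shell
# Hypothetical helper: succeed if the file contains the given byte string.
# grep -a treats binary files as text (supported by GNU and BSD grep).
has_symbol() {
    grep -aq -- "$2" "$1"
}

# Assumed build-tree layout; override BUILD_DIR if yours differs.
BUILD_DIR="${BUILD_DIR:-./build}"
for f in "$BUILD_DIR/bin/llama-quantize" \
         "$BUILD_DIR"/ggml/src/libggml*.so* \
         "$BUILD_DIR"/src/libllama*.so*; do
    # Unmatched globs stay literal, so skip anything that is not a real file.
    [ -f "$f" ] || continue
    if has_symbol "$f" gguf_set_tensor_ndim; then
        echo "patched:   $f"
    else
        echo "unpatched: $f"
    fi
done
```

At least one file reporting "patched" suggests the patch made it into the build; if every file reports "unpatched", re-check the checkout.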

Re-patching against a different upstream tag

The city96 branch is pinned to b3962 (Oct 2024) — that's the tag lcpp.patch was authored against. If you need a newer base (e.g. for a CUDA-toolkit-compatibility fix that landed after b3962), regenerating the patch is straightforward:

git clone https://github.com/ggml-org/llama.cpp.git llama-new
cd llama-new
git checkout <newer-tag>
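# plain `git apply` is all-or-nothing; --reject applies what it can and writes failed hunks to *.rej files
git apply --reject ../ComfyUI-GGUF/tools/lcpp.patch
# apply the rejected hunks by hand in ggml/include/ggml.h, ggml/src/ggml.c, src/llama.cpp, then delete the *.rej files
git diff > ../ComfyUI-GGUF/tools/lcpp.patch    # save new patch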

Then either push the rebuilt patch to your own fork as a new branch, or open a PR against Randy420Marsh/llama.cpp to update the city96 branch — note that the branch name is conventional, not a guarantee of a specific upstream tag.
