High-performance llama.cpp bindings for Go: run LLMs locally with the power of C++ and the simplicity of Go.
Getting Started • Features • API Reference • GPU Acceleration • Examples
Note: The original go-skynet/go-llama.cpp repository was unmaintained for over a year. As the llama.cpp ecosystem evolved rapidly with new features, samplers, and breaking API changes, the Go bindings fell behind. I decided to fork and actively maintain this project so that the Go community has access to the latest llama.cpp capabilities. This fork is fully updated to support the modern llama.cpp API, including the new sampler chain architecture and the GGUF format.
- ✅ Updated to latest llama.cpp: full compatibility with modern GGUF models
- ✅ New Sampler Chain API: modern sampling architecture with composable samplers
- ✅ XTC Sampler: "Exclude Top Choices" sampling for improved generation quality
- ✅ DRY Sampler: "Don't Repeat Yourself" penalty to reduce repetition
- ✅ TopNSigma Sampler: statistical sampling for better token selection
- ✅ Model Info API: query model metadata (vocab size, layers, parameters, etc.)
- ✅ Chat Templates: native support for model chat templates
- ✅ Fixed Build System: proper static linking with all CPU optimizations
```
┌───────────────────────────────────────────────────────────────────┐
│ 🎯 Performance First                                               │
│ ────────────────────                                               │
│ • Zero-copy data passing to C++                                    │
│ • Minimal CGO overhead                                             │
│ • Native CPU optimizations (AVX, AVX2, AVX-512)                    │
├───────────────────────────────────────────────────────────────────┤
│ 🔧 Flexible Sampling                                               │
│ ────────────────────                                               │
│ • Temperature, Top-K, Top-P, Min-P                                 │
│ • Repetition & Presence Penalties                                  │
│ • XTC, DRY, TopNSigma (NEW!)                                       │
│ • Mirostat v1 & v2                                                 │
├───────────────────────────────────────────────────────────────────┤
│ ⚡ GPU Acceleration                                                │
│ ────────────────────                                               │
│ • NVIDIA CUDA / cuBLAS                                             │
│ • AMD ROCm / hipBLAS                                               │
│ • Apple Metal (M1/M2/M3)                                           │
│ • OpenCL / CLBlast                                                 │
├───────────────────────────────────────────────────────────────────┤
│ 📦 Model Support                                                   │
│ ────────────────────                                               │
│ • All GGUF quantization formats                                    │
│ • LLaMA, Mistral, Qwen, Phi, and 100+ architectures                │
│ • LoRA adapter loading                                             │
│ • Embeddings generation                                            │
└───────────────────────────────────────────────────────────────────┘
```
- Go 1.20+
- C/C++ compiler (GCC, Clang, or MSVC)
- CMake 3.14+
- (Optional) CUDA Toolkit for NVIDIA GPU support
- (Optional) ROCm for AMD GPU support
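A quick way to confirm the toolchain is in place before building (tool names assume a typical Linux or macOS setup; the `nvcc` check only matters for CUDA builds):

```bash
# Verify required tools are installed and meet the minimum versions
go version        # want go1.20 or newer
cmake --version   # want 3.14 or newer
cc --version      # any recent GCC or Clang
nvcc --version    # optional: only needed for BUILD_TYPE=cublas
```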
```bash
# Clone with submodules
git clone --recurse-submodules https://github.com/AshkanYarmoradi/go-llama.cpp
cd go-llama.cpp

# Build the bindings
make libbinding.a

# Run the example
LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go run ./examples -m "/path/to/model.gguf" -t 8
```

```go
package main
import (
	"fmt"

	llama "github.com/AshkanYarmoradi/go-llama.cpp"
)

func main() {
	// Load model
	model, err := llama.New("model.gguf",
		llama.SetContext(2048),
		llama.SetGPULayers(35),
	)
	if err != nil {
		panic(err)
	}
	defer model.Free()

	// Generate text
	response, err := model.Predict("Explain quantum computing in simple terms:",
		llama.SetTemperature(0.7),
		llama.SetTopP(0.9),
		llama.SetTokens(256),
	)
	if err != nil {
		panic(err)
	}

	fmt.Println(response)
}
```
```go
// XTC (Exclude Top Choices) - probabilistically removes the most likely tokens to improve generation quality
llama.SetXTC(probability, threshold float64)

// DRY (Don't Repeat Yourself) - reduces repetitive patterns
llama.SetDRY(multiplier, base float64, allowedLength, penaltyLastN int)

// TopNSigma - statistical sampling based on standard deviations
llama.SetTopNSigma(n float64)
```
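A rough sketch of how these compose with `Predict` (the parameter values are illustrative, not recommendations):

```go
// Hypothetical values for illustration; tune per model and task.
response, err := model.Predict("Write a haiku about the sea:",
	llama.SetTemperature(1.0),
	llama.SetXTC(0.5, 0.1),          // skip top choices 50% of the time, above a 0.1 probability threshold
	llama.SetDRY(0.8, 1.75, 2, 256), // penalize sequences repeated within the last 256 tokens
	llama.SetTokens(64),
)
if err != nil {
	panic(err)
}
fmt.Println(response)
```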
```go
// Get comprehensive model metadata
info := model.GetModelInfo()
fmt.Printf("Model: %s\n", info.Description)
fmt.Printf("Vocabulary: %d tokens\n", info.NVocab)
fmt.Printf("Context Length: %d\n", info.NCtxTrain)
fmt.Printf("Embedding Dim: %d\n", info.NEmbd)
fmt.Printf("Layers: %d\n", info.NLayer)
fmt.Printf("Parameters: %d\n", info.NParams)
fmt.Printf("Size: %d bytes\n", info.Size)
// Get chat template for chat models
template := model.GetChatTemplate()
```
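`GetChatTemplate` returns the raw template string stored in the GGUF metadata; rendering it faithfully generally requires a Jinja-style engine, which these bindings do not provide. A minimal sketch, assuming a ChatML-style model (check the returned template to see what your model actually expects):

```go
// Hand-built ChatML prompt for models that use that layout.
prompt := "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n" +
	"<|im_start|>user\nWhat is GGUF?<|im_end|>\n" +
	"<|im_start|>assistant\n"
response, _ := model.Predict(prompt, llama.SetTokens(256))
fmt.Println(response)
```

The full set of sampling options: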
| Option | Description | Default |
|---|---|---|
| `SetTemperature(t)` | Randomness (0.0 = deterministic) | 0.8 |
| `SetTopK(k)` | Limit to top K tokens | 40 |
| `SetTopP(p)` | Nucleus sampling threshold | 0.9 |
| `SetMinP(p)` | Minimum probability threshold | 0.05 |
| `SetRepeatPenalty(p)` | Penalize repeated tokens | 1.1 |
| `SetPresencePenalty(p)` | Penalize token presence | 0.0 |
| `SetFrequencyPenalty(p)` | Penalize token frequency | 0.0 |
| `SetXTC(prob, thresh)` | Exclude Top Choices sampling | disabled |
| `SetDRY(...)` | Don't Repeat Yourself penalty | disabled |
| `SetTopNSigma(n)` | Statistical sigma sampling | disabled |
| `SetMirostat(mode)` | Mirostat sampling (1 or 2) | disabled |
| `SetMirostatTAU(tau)` | Mirostat target entropy | 5.0 |
| `SetMirostatETA(eta)` | Mirostat learning rate | 0.1 |
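For example, Mirostat replaces fixed Top-K/Top-P truncation with a feedback loop that steers output toward a target entropy. A sketch using the options above (values shown are the documented defaults):

```go
// Illustrative Mirostat v2 configuration.
response, err := model.Predict("Summarize the plot of Hamlet:",
	llama.SetMirostat(2),      // use Mirostat v2
	llama.SetMirostatTAU(5.0), // target entropy
	llama.SetMirostatETA(0.1), // learning rate
	llama.SetTokens(200),
)
if err != nil {
	panic(err)
}
fmt.Println(response)
```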
This library works exclusively with the modern GGUF file format. The legacy GGML format is no longer supported.

Need `ggml` support? Use the legacy `pre-gguf` tag.
```bash
# Convert HuggingFace models to GGUF
python llama.cpp/convert_hf_to_gguf.py /path/to/model --outfile model.gguf

# Quantize for smaller size
./llama.cpp/build/bin/llama-quantize model.gguf model-q4_k_m.gguf Q4_K_M
```

The default build uses optimized CPU code with automatic SIMD detection.
```bash
make libbinding.a
LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go run ./examples -m "model.gguf" -t 8
```
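An optional sanity check, on Linux only, to see which SIMD extensions your CPU advertises (and therefore what the default build can take advantage of):

```bash
# List AVX-family flags reported by the CPU
grep -oE 'avx512[a-z]*|avx2|avx|fma' /proc/cpuinfo | sort -u
```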
CGO_LDFLAGS="-lopenblas" LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go run -tags openblas ./examples -m "model.gguf" -t 8BUILD_TYPE=cublas make libbinding.a
CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" \
LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go run ./examples -m "model.gguf" -ngl 35BUILD_TYPE=hipblas make libbinding.a
```bash
BUILD_TYPE=hipblas make libbinding.a
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ \
CGO_LDFLAGS="-O3 --hip-link --rtlib=compiler-rt -unwindlib=libgcc -lrocblas -lhipblas" \
LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go run ./examples -m "model.gguf" -ngl 64
```
CGO_LDFLAGS="-lOpenCL -lclblast -L/usr/local/lib64/" \
LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go run ./examples -m "model.gguf"BUILD_TYPE=metal make libbinding.a
CGO_LDFLAGS="-framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders" \
LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go build ./examples/main.go
cp build/bin/ggml-metal.metal .
./main -m "model.gguf" -ngl 1model.SetTokenCallback(func(token string) bool {
	fmt.Print(token)
	return true // continue generation
})

model.Predict("Write a story about a robot:",
	llama.SetTokens(500),
	llama.SetTemperature(0.8),
)
```

```go
model, _ := llama.New("model.gguf", llama.EnableEmbeddings())
embeddings, _ := model.Embeddings("The quick brown fox")
fmt.Printf("Vector dimension: %d\n", len(embeddings))model, _ := llama.New("base-model.gguf",
```go
// Apply a LoRA adapter on top of a base model
model, _ := llama.New("base-model.gguf",
	llama.SetLoraAdapter("adapter.bin"),
	llama.SetLoraBase("base-model.gguf"),
)
```

Contributions are welcome! This fork is actively maintained, and I'm happy to review PRs for:
- Bug fixes
- Performance improvements
- New llama.cpp feature bindings
- Documentation improvements
- Test coverage
- llama.cpp: the C++ inference engine
- GGUF Format: model file format specification
- Hugging Face GGUF Models: pre-quantized models
MIT License. See LICENSE for details.
Built with 🦙 by the Go + LLM community

If you find this useful, consider giving it a ⭐