🦙 go-llama.cpp

Blazing Fast LLM Inference in Go


High-performance llama.cpp bindings for Go: run LLMs locally with the power of C++ and the simplicity of Go.

Getting Started • Features • API Reference • GPU Acceleration • Examples


🌟 About This Fork

Note: The original go-skynet/go-llama.cpp repository was unmaintained for over a year. As the llama.cpp ecosystem evolved rapidly with new features, samplers, and breaking API changes, the Go bindings fell behind.

I decided to fork and actively maintain this project to ensure the Go community has access to the latest llama.cpp capabilities. This fork is fully updated to support the modern llama.cpp API including the new sampler chain architecture and GGUF format.

What's New in This Fork (December 2025)

  • βœ… Updated to latest llama.cpp β€” Full compatibility with modern GGUF models
  • βœ… New Sampler Chain API β€” Modern sampling architecture with composable samplers
  • βœ… XTC Sampler β€” Cross-Token Coherence for improved generation quality
  • βœ… DRY Sampler β€” "Don't Repeat Yourself" penalty to reduce repetition
  • βœ… TopNSigma Sampler β€” Statistical sampling for better token selection
  • βœ… Model Info API β€” Query model metadata (vocab size, layers, parameters, etc.)
  • βœ… Chat Templates β€” Native support for model chat templates
  • βœ… Fixed Build System β€” Proper static linking with all CPU optimizations

🚀 Features

🎯 Performance First

  • Zero-copy data passing to C++
  • Minimal CGO overhead
  • Native CPU optimizations (AVX, AVX2, AVX-512)

🔧 Flexible Sampling

  • Temperature, Top-K, Top-P, Min-P
  • Repetition & Presence Penalties
  • XTC, DRY, TopNSigma (NEW!)
  • Mirostat v1 & v2

⚡ GPU Acceleration

  • NVIDIA CUDA / cuBLAS
  • AMD ROCm / hipBLAS
  • Apple Metal (M1/M2/M3)
  • OpenCL / CLBlast

📦 Model Support

  • All GGUF quantization formats
  • LLaMA, Mistral, Qwen, Phi, and 100+ architectures
  • LoRA adapter loading
  • Embeddings generation

📋 Requirements

  • Go 1.20+
  • C/C++ compiler (GCC, Clang, or MSVC)
  • CMake 3.14+
  • (Optional) CUDA Toolkit for NVIDIA GPU support
  • (Optional) ROCm for AMD GPU support

🎯 Quick Start

Installation

# Clone with submodules
git clone --recurse-submodules https://github.com/AshkanYarmoradi/go-llama.cpp
cd go-llama.cpp

# Build the bindings
make libbinding.a

# Run the example
LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go run ./examples -m "/path/to/model.gguf" -t 8

Basic Usage

package main

import (
    "fmt"
    llama "github.com/AshkanYarmoradi/go-llama.cpp"
)

func main() {
    // Load model
    model, err := llama.New("model.gguf",
        llama.SetContext(2048),
        llama.SetGPULayers(35),
    )
    if err != nil {
        panic(err)
    }
    defer model.Free()

    // Generate text
    response, err := model.Predict("Explain quantum computing in simple terms:",
        llama.SetTemperature(0.7),
        llama.SetTopP(0.9),
        llama.SetTokens(256),
    )
    if err != nil {
        panic(err)
    }
    
    fmt.Println(response)
}

πŸŽ›οΈ API Reference

New Sampler Options

// XTC (Exclude Top Choices) - Probabilistically removes the most likely tokens to encourage more varied output
llama.SetXTC(probability, threshold float64)

// DRY (Don't Repeat Yourself) - Reduces repetitive patterns  
llama.SetDRY(multiplier, base float64, allowedLength, penaltyLastN int)

// TopNSigma - Statistical sampling based on standard deviations
llama.SetTopNSigma(n float64)
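
These options compose with the existing sampling options in a single Predict call. A minimal sketch (the parameter values below are illustrative, not tuned recommendations):

response, err := model.Predict("Summarize the plot of Hamlet in two sentences:",
    llama.SetTokens(256),
    llama.SetTemperature(0.8),
    llama.SetXTC(0.5, 0.1),          // exclusion probability, probability threshold
    llama.SetDRY(0.8, 1.75, 2, 256), // multiplier, base, allowedLength, penaltyLastN
    llama.SetTopNSigma(1.0),         // keep tokens within 1 standard deviation of the top logit
)
if err != nil {
    panic(err)
}
fmt.Println(response)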

Model Information API

// Get comprehensive model metadata
info := model.GetModelInfo()
fmt.Printf("Model: %s\n", info.Description)
fmt.Printf("Vocabulary: %d tokens\n", info.NVocab)
fmt.Printf("Context Length: %d\n", info.NCtxTrain)
fmt.Printf("Embedding Dim: %d\n", info.NEmbd)
fmt.Printf("Layers: %d\n", info.NLayer)
fmt.Printf("Parameters: %d\n", info.NParams)
fmt.Printf("Size: %d bytes\n", info.Size)

// Get chat template for chat models
template := model.GetChatTemplate()
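
As a small usage sketch, the metadata can double as a sanity check, e.g. warning when the requested context window is larger than the context the model was trained with (using the fields listed above):

// Warn if the configured context exceeds the model's training context.
requestedCtx := 8192
if requestedCtx > int(info.NCtxTrain) {
    fmt.Printf("warning: context %d exceeds training context %d\n",
        requestedCtx, info.NCtxTrain)
}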

All Sampling Options

| Option | Description | Default |
|---|---|---|
| SetTemperature(t) | Randomness (0.0 = deterministic) | 0.8 |
| SetTopK(k) | Limit to top K tokens | 40 |
| SetTopP(p) | Nucleus sampling threshold | 0.9 |
| SetMinP(p) | Minimum probability threshold | 0.05 |
| SetRepeatPenalty(p) | Penalize repeated tokens | 1.1 |
| SetPresencePenalty(p) | Penalize token presence | 0.0 |
| SetFrequencyPenalty(p) | Penalize token frequency | 0.0 |
| SetXTC(prob, thresh) | Exclude Top Choices sampling | disabled |
| SetDRY(...) | Don't Repeat Yourself penalty | disabled |
| SetTopNSigma(n) | Statistical sigma sampling | disabled |
| SetMirostat(mode) | Mirostat sampling (1 or 2) | disabled |
| SetMirostatTAU(tau) | Mirostat target entropy | 5.0 |
| SetMirostatETA(eta) | Mirostat learning rate | 0.1 |
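
For example, Mirostat v2 replaces fixed Top-K/Top-P truncation with entropy-targeted sampling; a minimal sketch using the defaults from the table:

response, err := model.Predict("Write a haiku about the sea:",
    llama.SetMirostat(2),      // Mirostat v2
    llama.SetMirostatTAU(5.0), // target entropy
    llama.SetMirostatETA(0.1), // learning rate
    llama.SetTokens(64),
)
if err != nil {
    panic(err)
}
fmt.Println(response)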

⚠️ Important Notes

GGUF Format Only

This library works exclusively with the modern GGUF file format. The legacy GGML format is no longer supported.

Need GGML support? Use the legacy tag: pre-gguf

Converting Models

# Convert HuggingFace models to GGUF
python llama.cpp/convert_hf_to_gguf.py /path/to/model --outfile model.gguf

# Quantize for smaller size
./llama.cpp/build/bin/llama-quantize model.gguf model-q4_k_m.gguf Q4_K_M

⚡ Acceleration

CPU (Default)

The default build uses optimized CPU code with automatic SIMD detection.

make libbinding.a
LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go run ./examples -m "model.gguf" -t 8

OpenBLAS

BUILD_TYPE=openblas make libbinding.a
CGO_LDFLAGS="-lopenblas" LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go run -tags openblas ./examples -m "model.gguf" -t 8

🟢 NVIDIA CUDA

BUILD_TYPE=cublas make libbinding.a
CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" \
  LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go run ./examples -m "model.gguf" -ngl 35

🔴 AMD ROCm

BUILD_TYPE=hipblas make libbinding.a
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ \
  CGO_LDFLAGS="-O3 --hip-link --rtlib=compiler-rt -unwindlib=libgcc -lrocblas -lhipblas" \
  LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go run ./examples -m "model.gguf" -ngl 64

🔵 Intel OpenCL

BUILD_TYPE=clblas CLBLAS_DIR=/path/to/clblast make libbinding.a
CGO_LDFLAGS="-lOpenCL -lclblast -L/usr/local/lib64/" \
  LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go run ./examples -m "model.gguf"

🍎 Apple Metal (M1/M2/M3)

BUILD_TYPE=metal make libbinding.a
CGO_LDFLAGS="-framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders" \
  LIBRARY_PATH=$PWD C_INCLUDE_PATH=$PWD go build ./examples/main.go
cp build/bin/ggml-metal.metal .
./main -m "model.gguf" -ngl 1

📖 Examples

Streaming Output

model.SetTokenCallback(func(token string) bool {
    fmt.Print(token)
    return true // continue generation
})

model.Predict("Write a story about a robot:",
    llama.SetTokens(500),
    llama.SetTemperature(0.8),
)
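
Returning false from the callback stops generation early (as the comment above indicates). A sketch that stops once a marker string appears in the output; the marker and prompt are just examples, and it uses the standard strings package:

var out strings.Builder
model.SetTokenCallback(func(token string) bool {
    out.WriteString(token)
    fmt.Print(token)
    // Keep generating until the stop marker shows up.
    return !strings.Contains(out.String(), "\n###")
})

model.Predict("Q: What is Go?\nA:", llama.SetTokens(200))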

Embeddings

model, _ := llama.New("model.gguf", llama.EnableEmbeddings())

embeddings, _ := model.Embeddings("The quick brown fox")
fmt.Printf("Vector dimension: %d\n", len(embeddings))

With LoRA Adapter

model, _ := llama.New("base-model.gguf",
    llama.SetLoraAdapter("adapter.bin"),
    llama.SetLoraBase("base-model.gguf"),
)

🤝 Contributing

Contributions are welcome! This fork is actively maintained and I'm happy to review PRs for:

  • Bug fixes
  • Performance improvements
  • New llama.cpp feature bindings
  • Documentation improvements
  • Test coverage



📄 License

MIT License – see LICENSE for details.


Built with 🦙 by the Go + LLM community

If you find this useful, consider giving it a ⭐
