Pure Go BERT-family transformer inference. No CGO. No ONNX. No native dependencies.
```go
import "github.com/MichaelAyles/goformer"

model, err := goformer.Load("./bge-small-en-v1.5")
if err != nil {
    log.Fatal(err)
}
embedding, err := model.Embed("DMA channel configuration")
// embedding is a []float32 of length model.Dims()
```

A Go library that loads BERT-family model weights directly from the HuggingFace safetensors format and runs inference to produce embeddings. Point it at a model directory downloaded from HuggingFace, call Embed(), get a float32 slice. No Python export step, no ONNX conversion, no native libraries.
Reference model: BGE-small-en-v1.5 (384-dim, 6 layers, 33M params). Any BERT-family model in safetensors format with a compatible tokeniser should work.
Every existing pure Go option for running transformer models requires ONNX format. To get ONNX, you need a Python environment with transformers, optimum, torch, and onnx — a non-trivial dependency chain just to produce the artifact your Go binary needs. If the point of writing in Go is to escape Python in production, requiring Python in your build pipeline undermines the argument.
goformer loads directly from the canonical safetensors weights published by model authors on HuggingFace. Same files, same format, same config. No intermediate conversion.
```go
// Load reads model weights from a HuggingFace model directory
// (config.json, tokenizer.json, model.safetensors).
func Load(path string) (*Model, error)

// Embed produces a normalised embedding vector for the input text.
func (m *Model) Embed(text string) ([]float32, error)

// EmbedBatch produces embeddings for multiple texts, padded to the
// longest sequence and processed together.
func (m *Model) EmbedBatch(texts []string) ([][]float32, error)

// Dims returns the embedding dimensionality (e.g. 384 for BGE-small).
func (m *Model) Dims() int

// MaxSeqLen returns the maximum sequence length the model supports.
func (m *Model) MaxSeqLen() int
```

That is the entire public surface.
Download a model from HuggingFace:

```shell
# Using git (requires git-lfs)
git clone https://huggingface.co/BAAI/bge-small-en-v1.5

# Or download files manually: config.json, tokenizer.json, model.safetensors
```
Load and embed:

```go
model, err := goformer.Load("./bge-small-en-v1.5")
if err != nil {
    log.Fatal(err)
}

// Single embedding
vec, err := model.Embed("What is a DMA controller?")

// Batch embedding
vecs, err := model.EmbedBatch([]string{
    "What is a DMA controller?",
    "How does SPI communication work?",
    "Configure the UART baud rate",
})
```
Compute similarity:

```go
func cosineSimilarity(a, b []float32) float32 {
    var dot float32
    for i := range a {
        dot += a[i] * b[i]
    }
    return dot // vectors are already L2-normalised
}
```
Benchmarked with BGE-small-en-v1.5 on Apple M1:
| Input | goformer (pure Go) | PyTorch (CPU) | ONNX Runtime (CPU) |
|---|---|---|---|
| Short (~5 tokens) | 154ms | 12.9ms | 3.0ms |
| Medium (~11 tokens) | 287ms | 13.4ms | 4.2ms |
| Long (~40 tokens) | 1.1s | 13.8ms | 8.8ms |
| Batch of 8 | 2.4s | 22.5ms | 17.6ms |
goformer is ~10-50x slower than optimised native runtimes. The trade-off is zero native dependencies — no CGO, no ONNX conversion pipeline, no Python in your build. For applications where embedding latency is not the bottleneck (offline indexing, RAG pipelines, document processing), this is an acceptable cost.
| Operation | Time | Allocated |
|---|---|---|
| Model load | 91ms | 261MB |
| MatMul 384×384 | 31ms | 0.6MB |
| MatMul 384×1536 | 158ms | 2.3MB |
| LayerNorm (128×384) | 139µs | 0 |
| Softmax (12×128×128) | 1.0ms | 0 |
| GELU (128×1536) | 436µs | 0 |
| Tokenise | 1.9µs | 1KB |
MatMul dominates inference time. There is headroom for tiling optimisation and SIMD.
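goformer's actual kernel isn't reproduced here, but the loop-tiling idea mentioned above can be sketched in a few lines. This is an illustration only; the function name and tile size are made up for the example:

```go
package main

import "fmt"

// matMulTiled multiplies an m×k matrix a by a k×n matrix b (both
// row-major float32 slices) into c, walking fixed-size tiles so the
// working set of each inner loop stays in cache. Tile size 64 is
// illustrative; the best value depends on the target CPU.
func matMulTiled(a, b, c []float32, m, k, n int) {
	const tile = 64
	for i0 := 0; i0 < m; i0 += tile {
		for p0 := 0; p0 < k; p0 += tile {
			for j0 := 0; j0 < n; j0 += tile {
				iMax, pMax, jMax := minInt(i0+tile, m), minInt(p0+tile, k), minInt(j0+tile, n)
				for i := i0; i < iMax; i++ {
					for p := p0; p < pMax; p++ {
						aip := a[i*k+p]
						for j := j0; j < jMax; j++ {
							c[i*n+j] += aip * b[p*n+j]
						}
					}
				}
			}
		}
	}
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func main() {
	a := []float32{1, 2, 3, 4} // 2×2: [[1 2] [3 4]]
	b := []float32{5, 6, 7, 8} // 2×2: [[5 6] [7 8]]
	c := make([]float32, 4)
	matMulTiled(a, b, c, 2, 2, 2)
	fmt.Println(c) // [19 22 43 50]
}
```

Hoisting `a[i*k+p]` out of the innermost loop keeps that loop a pure streaming multiply-add over contiguous rows of b and c, which is also the shape a future SIMD version would want.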
- Safetensors parser reads the binary weight file directly — 8-byte header length, JSON metadata, raw float32 data at specified offsets.
- WordPiece tokeniser parses the HuggingFace `tokenizer.json` and produces token IDs + attention masks.
- BERT forward pass: embedding lookup → N transformer layers (self-attention + FFN + layer norm with residual connections) → mean pooling → L2 normalisation.
- All math is pure Go float32 operations. Matrix multiplication uses loop tiling for cache locality.
All outputs validated against the HuggingFace Python transformers library:
| Test case | Cosine similarity | Max element-wise diff |
|---|---|---|
| `DMA channel configuration` | 1.000000 | 0.000292 |
| `The quick brown fox jumps over the lazy dog` | 0.999999 | 0.000389 |
| `Hello` | 0.999999 | 0.000213 |
| Long paragraph (40 tokens) | 0.999999 | 0.000241 |
| `café résumé naïve` (unicode) | 0.999999 | 0.000211 |
| `Hello, world! How's it going?` | 1.000000 | 0.000199 |
- Token IDs: exact match against Python for all test cases
- Batch embeddings match single embeddings
- CPU only. No GPU acceleration.
- Inference only. No training or fine-tuning.
- BERT-family only. Encoder models with safetensors weights. Not GPT, T5, or other architectures.
- F32 and F16 weights only. Float16 weights are converted to float32 at load time.
MIT
