# MemRAG

High-performance, zero-allocation embedding inference in Go.

MemRAG is a Go library that provides high-performance, zero-allocation embedding inference for retrieval-augmented generation (RAG) applications. It uses the MemPipe inference engine to run ONNX-based embedding models with minimal memory allocation and high throughput.

## Features
- Zero-Allocation Hot Path: Pre-allocated buffers for tokenizer and pooling operations eliminate GC pressure
- Dynamic Sequence Length: Engine reshapes to actual token count for faster processing of short inputs
- Multiple Pooling Strategies: Mean pooling, CLS pooling, and raw output support
- Concurrent Inference: Thread-safe engine pool with bounded concurrency via semaphores
- Extensible Operator Registry: Pluggable operator system for custom inference operations
- Model Descriptors: Decoupled model configuration for easy addition of new models
- Multiple Tokenizer Support: WordPiece (BERT), BPE, and SentencePiece tokenizers
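To make the tokenizer feature concrete, here is a minimal sketch of WordPiece's greedy longest-match algorithm with the BERT `##` continuation prefix. This is an illustration of the general technique only, not MemRAG's zero-allocation implementation; the `wordPiece` function and its vocabulary shape are assumptions for the example.

```go
package main

import "fmt"

// wordPiece splits a word into subword pieces by greedily matching the
// longest vocabulary entry, prefixing non-initial pieces with "##"
// (the BERT convention). Unknown words map to a single [UNK] token.
func wordPiece(word string, vocab map[string]bool) []string {
	var pieces []string
	start := 0
	for start < len(word) {
		end := len(word)
		var cur string
		// Try the longest substring first, shrinking until a match is found.
		for end > start {
			sub := word[start:end]
			if start > 0 {
				sub = "##" + sub
			}
			if vocab[sub] {
				cur = sub
				break
			}
			end--
		}
		if cur == "" {
			return []string{"[UNK]"}
		}
		pieces = append(pieces, cur)
		start = end
	}
	return pieces
}

func main() {
	vocab := map[string]bool{"em": true, "##bed": true, "##ding": true}
	fmt.Println(wordPiece("embedding", vocab)) // [em ##bed ##ding]
}
```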
## Architecture

MemRAG provides a complete embedding pipeline:

```
    Text Input
        │
        ▼
┌─────────────────┐
│    Tokenizer    │  WordPiece/BPE/SentencePiece
│   (zero-alloc)  │
└────────┬────────┘
         │  token IDs, attention mask, type IDs
         ▼
┌─────────────────┐
│  MemPipe Engine │  ONNX inference with fused operators
└────────┬────────┘
         │  hidden states
         ▼
┌─────────────────┐
│     Pooler      │  Mean/CLS/No pooling
└────────┬────────┘
         │  pooled vector
         ▼
┌─────────────────┐
│   Normalizer    │  L2 normalization (optional)
└────────┬────────┘
         │
         ▼
  Embedding Vector
```
## Installation

```shell
go get github.com/GoMemPipe/memrag
```

## Model Conversion

First, convert a HuggingFace embedding model to MemPipe format:
```shell
# Using the provided conversion script
python scripts/convert_bge_small.py --output ./models/bge-small-en-v1.5/
```

This creates:

- `model.mpmodel` - the inference model
- `vocab.txt` - the tokenizer vocabulary
## Quick Start

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/GoMemPipe/memrag/model"
	"github.com/GoMemPipe/memrag/model/descriptors"
	"github.com/GoMemPipe/memrag/pipeline"
)

func main() {
	// Load model descriptor
	desc, ok := descriptors.Get("BAAI/bge-small-en-v1.5")
	if !ok {
		log.Fatal("descriptor not found")
	}

	// Load model assets
	assets, err := model.LoadAssetsFromDir("./models/bge-small-en-v1.5/")
	if err != nil {
		log.Fatal(err)
	}

	// Create embedding pipeline
	pipe, err := pipeline.New(desc, assets)
	if err != nil {
		log.Fatal(err)
	}
	defer pipe.Close()

	// Generate embedding
	ctx := context.Background()
	vec, err := pipe.Embed(ctx, "Hello, world!")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("Embedding dimension: %d\n", len(vec))
	// Output: Embedding dimension: 384
}
```

## Concurrent Inference

For production workloads with high concurrency, use the engine pool:
```go
// Create an engine factory (frozenReg is a previously built operator registry)
factory := pool.NewEngineFactory(desc, assets, frozenReg)

// Create a pool with capacity based on CPU cores
ep, err := pool.NewEnginePool(factory, runtime.NumCPU())
if err != nil {
	log.Fatal(err)
}

// Wrap in a service for easier access
service := pool.NewEmbeddingService(ep)

// Embed a batch concurrently
texts := []string{
	"Hello, world!",
	"Welcome to MemRAG",
	"Embedding models are useful",
}
results, err := service.EmbedBatch(ctx, texts)
if err != nil {
	log.Fatal(err)
}
fmt.Println(len(results)) // one embedding per input text
```

## Project Structure

```
memrag/
├── cmd/                  # Command-line tools
│   ├── embed/            # Embedding CLI
│   └── embed-wasm/       # WebAssembly embedding demo
├── docs/                 # Documentation
│   └── MODEL_CONVERSION.md
├── examples/             # Usage examples
│   └── bge_embedding_example.go
├── model/                # Model loading and descriptors
│   ├── assets.go
│   ├── descriptor.go
│   └── descriptors/      # Built-in model descriptors
├── operator/             # Operator registry and middleware
├── pipeline/             # Core embedding pipeline
├── pool/                 # Concurrent engine pool
├── tokenizer/            # Tokenizer implementations
│   ├── wordpiece.go      # BERT-style WordPiece
│   └── ...
└── memrag.go             # Package entry point
```
## Supported Models

- `BAAI/bge-small-en-v1.5` - 384 dimensions, 512 max sequence length

Additional models can be added via the descriptor system.
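The descriptor system decouples model configuration from the pipeline, so adding a model is a matter of registering its metadata. The sketch below shows the general registry pattern behind a `Get(name) (Descriptor, bool)` lookup like the one in the Quick Start; the `Descriptor` fields and the `Register` function are illustrative assumptions, not MemRAG's actual API.

```go
package main

import "fmt"

// Descriptor captures per-model configuration. The field names here are
// assumptions for illustration, not the library's real struct.
type Descriptor struct {
	Name      string
	Dim       int    // embedding dimension
	MaxSeqLen int    // maximum token count
	Pooling   string // "mean", "cls", or "none"
}

var registry = map[string]Descriptor{}

// Register stores a descriptor under its model name.
func Register(d Descriptor) { registry[d.Name] = d }

// Get mirrors the descriptors.Get(name) (Descriptor, bool) lookup pattern.
func Get(name string) (Descriptor, bool) {
	d, ok := registry[name]
	return d, ok
}

func main() {
	Register(Descriptor{Name: "BAAI/bge-small-en-v1.5", Dim: 384, MaxSeqLen: 512, Pooling: "mean"})
	if d, ok := Get("BAAI/bge-small-en-v1.5"); ok {
		fmt.Printf("%s: %d dims, %d max tokens\n", d.Name, d.Dim, d.MaxSeqLen)
	}
}
```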
## Documentation

- Model Conversion Guide (`docs/MODEL_CONVERSION.md`) - how to convert HuggingFace models
- Examples (`examples/`) - detailed usage examples
- API Reference - GoDoc
## Performance

MemRAG is designed for minimal memory allocation and high throughput:

- Zero-allocation hot path: pre-allocated buffers for tokenizer and pooling
- Dynamic sequence length: the engine reshapes to the actual token count
- Instance pooling: reuses pipeline instances via sync.Pool
- Bounded concurrency: semaphore-based concurrency control
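The bounded-concurrency point above can be sketched with Go's idiomatic buffered-channel semaphore: at most `limit` goroutines run the expensive work at once, while the rest block on acquire. This is a minimal illustration of the technique, not MemRAG's pool implementation; `embedBatch` and its callback are assumptions for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// embedBatch runs embed over texts concurrently, but never with more than
// `limit` calls in flight, using a buffered channel as a counting semaphore.
func embedBatch(texts []string, limit int, embed func(string) int) []int {
	sem := make(chan struct{}, limit) // capacity = max concurrent workers
	results := make([]int, len(texts))
	var wg sync.WaitGroup
	for i, t := range texts {
		wg.Add(1)
		go func(i int, t string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot (blocks when full)
			defer func() { <-sem }() // release the slot
			results[i] = embed(t)
		}(i, t)
	}
	wg.Wait()
	return results
}

func main() {
	texts := []string{"a", "bb", "ccc"}
	// Stand-in "embedding": just the text length.
	lens := embedBatch(texts, 2, func(s string) int { return len(s) })
	fmt.Println(lens) // [1 2 3]
}
```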
## Requirements

- Go 1.25+
- MemPipe v1.0.0
- A converted model in `.mpmodel` format
## License

MIT License - see LICENSE for details.
## Acknowledgments

- MemPipe - high-performance ONNX inference engine
- transformers - HuggingFace library used for model conversion