# MemRAG

High-performance, zero-allocation embedding inference in Go.

MemRAG is a Go library that provides high-performance, zero-allocation embedding inference for retrieval-augmented generation (RAG) applications. It uses the MemPipe inference engine to run ONNX-based embedding models with minimal memory allocation and high throughput.

## Features
- Zero-Allocation Hot Path: Pre-allocated buffers for tokenizer and pooling operations eliminate GC pressure
- Dynamic Sequence Length: Engine reshapes to actual token count for faster processing of short inputs
- Multiple Pooling Strategies: Mean pooling, CLS pooling, and raw output support
- Concurrent Inference: Thread-safe engine pool with bounded concurrency via semaphores
- Extensible Operator Registry: Pluggable operator system for custom inference operations
- Model Descriptors: Decoupled model configuration for easy addition of new models
- Multiple Tokenizer Support: WordPiece (BERT), BPE, and SentencePiece tokenizers
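To make the tokenizer feature concrete, here is a minimal sketch of WordPiece's greedy longest-match algorithm with the BERT `##` continuation prefix. This is an illustration of the general technique only, not MemRAG's zero-allocation implementation; the `wordPiece` function and its vocabulary shape are assumptions for the example.

```go
package main

import "fmt"

// wordPiece splits a word into subword pieces by greedily matching the
// longest vocabulary entry, prefixing non-initial pieces with "##"
// (the BERT convention). Unknown words map to a single [UNK] token.
func wordPiece(word string, vocab map[string]bool) []string {
	var pieces []string
	start := 0
	for start < len(word) {
		end := len(word)
		var cur string
		// Try the longest substring first, shrinking until a match is found.
		for end > start {
			sub := word[start:end]
			if start > 0 {
				sub = "##" + sub
			}
			if vocab[sub] {
				cur = sub
				break
			}
			end--
		}
		if cur == "" {
			return []string{"[UNK]"}
		}
		pieces = append(pieces, cur)
		start = end
	}
	return pieces
}

func main() {
	vocab := map[string]bool{"em": true, "##bed": true, "##ding": true}
	fmt.Println(wordPiece("embedding", vocab)) // [em ##bed ##ding]
}
```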
## Architecture

MemRAG provides a complete embedding pipeline:

```
    Text Input
        │
        ▼
┌─────────────────┐
│    Tokenizer    │  WordPiece/BPE/SentencePiece
│   (zero-alloc)  │
└────────┬────────┘
         │  token IDs, attention mask, type IDs
         ▼
┌─────────────────┐
│  MemPipe Engine │  ONNX inference with fused operators
└────────┬────────┘
         │  hidden states
         ▼
┌─────────────────┐
│     Pooler      │  Mean/CLS/No pooling
└────────┬────────┘
         │  pooled vector
         ▼
┌─────────────────┐
│   Normalizer    │  L2 normalization (optional)
└────────┬────────┘
         │
         ▼
  Embedding Vector
```
## Installation

```shell
go get github.com/GoMemPipe/memrag
```

## Model Conversion

First, convert a HuggingFace embedding model to MemPipe format:
```shell
# Using the provided conversion script
python scripts/convert_bge_small.py --output ./models/bge-small-en-v1.5/
```

This creates:

- `model.mpmodel` - the inference model
- `vocab.txt` - the tokenizer vocabulary
## Quick Start

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/GoMemPipe/memrag/model"
	"github.com/GoMemPipe/memrag/model/descriptors"
	"github.com/GoMemPipe/memrag/pipeline"
)

func main() {
	// Load model descriptor
	desc, ok := descriptors.Get("BAAI/bge-small-en-v1.5")
	if !ok {
		log.Fatal("descriptor not found")
	}

	// Load model assets
	assets, err := model.LoadAssetsFromDir("./models/bge-small-en-v1.5/")
	if err != nil {
		log.Fatal(err)
	}

	// Create embedding pipeline
	pipe, err := pipeline.New(desc, assets)
	if err != nil {
		log.Fatal(err)
	}
	defer pipe.Close()

	// Generate embedding
	ctx := context.Background()
	vec, err := pipe.Embed(ctx, "Hello, world!")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("Embedding dimension: %d\n", len(vec))
	// Output: Embedding dimension: 384
}
```

## Concurrent Inference

For production workloads with high concurrency, use the engine pool:
```go
// Create an engine factory (frozenReg is a previously built operator registry)
factory := pool.NewEngineFactory(desc, assets, frozenReg)

// Create a pool with capacity based on CPU cores
ep, err := pool.NewEnginePool(factory, runtime.NumCPU())
if err != nil {
	log.Fatal(err)
}

// Wrap in a service for easier access
service := pool.NewEmbeddingService(ep)

// Embed a batch concurrently
texts := []string{
	"Hello, world!",
	"Welcome to MemRAG",
	"Embedding models are useful",
}
results, err := service.EmbedBatch(ctx, texts)
if err != nil {
	log.Fatal(err)
}
fmt.Println(len(results)) // one embedding per input text
```

## Project Structure

```
memrag/
├── cmd/                  # Command-line tools
│   ├── embed/            # Embedding CLI
│   └── embed-wasm/       # WebAssembly embedding demo
├── docs/                 # Documentation
│   └── MODEL_CONVERSION.md
├── examples/             # Usage examples
│   └── bge_embedding_example.go
├── model/                # Model loading and descriptors
│   ├── assets.go
│   ├── descriptor.go
│   └── descriptors/      # Built-in model descriptors
├── operator/             # Operator registry and middleware
├── pipeline/             # Core embedding pipeline
├── pool/                 # Concurrent engine pool
├── tokenizer/            # Tokenizer implementations
│   ├── wordpiece.go      # BERT-style WordPiece
│   └── ...
└── memrag.go             # Package entry point
```
## Supported Models

- `BAAI/bge-small-en-v1.5` - 384 dimensions, 512 max sequence length

Additional models can be added via the descriptor system.
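The descriptor system decouples model configuration from the pipeline, so adding a model is a matter of registering its metadata. The sketch below shows the general registry pattern behind a `Get(name) (Descriptor, bool)` lookup like the one in the Quick Start; the `Descriptor` fields and the `Register` function are illustrative assumptions, not MemRAG's actual API.

```go
package main

import "fmt"

// Descriptor captures per-model configuration. The field names here are
// assumptions for illustration, not the library's real struct.
type Descriptor struct {
	Name      string
	Dim       int    // embedding dimension
	MaxSeqLen int    // maximum token count
	Pooling   string // "mean", "cls", or "none"
}

var registry = map[string]Descriptor{}

// Register stores a descriptor under its model name.
func Register(d Descriptor) { registry[d.Name] = d }

// Get mirrors the descriptors.Get(name) (Descriptor, bool) lookup pattern.
func Get(name string) (Descriptor, bool) {
	d, ok := registry[name]
	return d, ok
}

func main() {
	Register(Descriptor{Name: "BAAI/bge-small-en-v1.5", Dim: 384, MaxSeqLen: 512, Pooling: "mean"})
	if d, ok := Get("BAAI/bge-small-en-v1.5"); ok {
		fmt.Printf("%s: %d dims, %d max tokens\n", d.Name, d.Dim, d.MaxSeqLen)
	}
}
```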
## Documentation

- Model Conversion Guide (`docs/MODEL_CONVERSION.md`) - how to convert HuggingFace models
- Examples (`examples/`) - detailed usage examples
- API Reference - GoDoc
## Performance

MemRAG is designed for minimal memory allocation and high throughput:

- Zero-allocation hot path: pre-allocated buffers for tokenizer and pooling
- Dynamic sequence length: the engine reshapes to the actual token count
- Instance pooling: reuses pipeline instances via sync.Pool
- Bounded concurrency: semaphore-based concurrency control
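The bounded-concurrency point above can be sketched with Go's idiomatic buffered-channel semaphore: at most `limit` goroutines run the expensive work at once, while the rest block on acquire. This is a minimal illustration of the technique, not MemRAG's pool implementation; `embedBatch` and its callback are assumptions for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// embedBatch runs embed over texts concurrently, but never with more than
// `limit` calls in flight, using a buffered channel as a counting semaphore.
func embedBatch(texts []string, limit int, embed func(string) int) []int {
	sem := make(chan struct{}, limit) // capacity = max concurrent workers
	results := make([]int, len(texts))
	var wg sync.WaitGroup
	for i, t := range texts {
		wg.Add(1)
		go func(i int, t string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot (blocks when full)
			defer func() { <-sem }() // release the slot
			results[i] = embed(t)
		}(i, t)
	}
	wg.Wait()
	return results
}

func main() {
	texts := []string{"a", "bb", "ccc"}
	// Stand-in "embedding": just the text length.
	lens := embedBatch(texts, 2, func(s string) int { return len(s) })
	fmt.Println(lens) // [1 2 3]
}
```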
## Requirements

- Go 1.25+
- MemPipe v1.0.0
- A converted model in `.mpmodel` format
## License

MIT License - see LICENSE for details.
## Acknowledgments

- MemPipe - high-performance ONNX inference engine
- transformers - HuggingFace library used for model conversion