MemRAG

High-performance, zero-allocation embedding inference in Go


Overview

MemRAG is a Go library that provides high-performance, zero-allocation embedding inference for retrieval-augmented generation (RAG) applications. It leverages the MemPipe inference engine to run ONNX-based embedding models with minimal memory allocation and high throughput.

Key Features

  • Zero-Allocation Hot Path: Pre-allocated buffers for tokenizer and pooling operations eliminate GC pressure
  • Dynamic Sequence Length: Engine reshapes to actual token count for faster processing of short inputs
  • Multiple Pooling Strategies: Mean pooling, CLS pooling, and raw output support
  • Concurrent Inference: Thread-safe engine pool with bounded concurrency via semaphores
  • Extensible Operator Registry: Pluggable operator system for custom inference operations
  • Model Descriptors: Decoupled model configuration for easy addition of new models
  • Multiple Tokenizer Support: WordPiece (BERT), BPE, and SentencePiece tokenizers

Architecture

MemRAG provides a complete embedding pipeline:

Text Input
    │
    ▼
┌─────────────────┐
│   Tokenizer     │  WordPiece/BPE/SentencePiece
│  (zero-alloc)   │
└────────┬────────┘
         │ token IDs, attention mask, type IDs
         ▼
┌─────────────────┐
│  MemPipe Engine │  ONNX inference with fused operators
└────────┬────────┘
         │ hidden states
         ▼
┌─────────────────┐
│     Pooler      │  Mean/CLS/No pooling
└────────┬────────┘
         │ pooled vector
         ▼
┌─────────────────┐
│   Normalizer    │  L2 normalization (optional)
└────────┬────────┘
         │
         ▼
  Embedding Vector
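The tokenizer stage for BERT-style models uses greedy longest-match-first WordPiece segmentation. A minimal illustrative sketch follows; the tiny vocabulary is made up for the example, not the real vocab.txt:

```go
package main

import "fmt"

// wordPiece splits one word using greedy longest-match-first lookup
// against the vocabulary. Continuation pieces carry the "##" prefix;
// a word with no match collapses to [UNK], as in BERT tokenizers.
func wordPiece(word string, vocab map[string]bool) []string {
	var pieces []string
	start := 0
	for start < len(word) {
		end := len(word)
		var match string
		for end > start {
			sub := word[start:end]
			if start > 0 {
				sub = "##" + sub
			}
			if vocab[sub] {
				match = sub // longest piece found from this position
				break
			}
			end--
		}
		if match == "" {
			return []string{"[UNK]"}
		}
		pieces = append(pieces, match)
		start = end
	}
	return pieces
}

func main() {
	vocab := map[string]bool{"embed": true, "##ding": true, "##s": true}
	fmt.Println(wordPiece("embeddings", vocab)) // [embed ##ding ##s]
}
```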

Installation

go get github.com/GoMemPipe/memrag

Quick Start

1. Convert a Model

First, convert a HuggingFace embedding model to MemPipe format:

# Using the provided conversion script
python scripts/convert_bge_small.py --output ./models/bge-small-en-v1.5/

This creates:

  • model.mpmodel - The inference model
  • vocab.txt - The tokenizer vocabulary

2. Generate Embeddings

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/GoMemPipe/memrag/model"
    "github.com/GoMemPipe/memrag/model/descriptors"
    "github.com/GoMemPipe/memrag/pipeline"
)

func main() {
    // Load model descriptor
    desc, ok := descriptors.Get("BAAI/bge-small-en-v1.5")
    if !ok {
        log.Fatal("descriptor not found")
    }

    // Load model assets
    assets, err := model.LoadAssetsFromDir("./models/bge-small-en-v1.5/")
    if err != nil {
        log.Fatal(err)
    }

    // Create embedding pipeline
    pipe, err := pipeline.New(desc, assets)
    if err != nil {
        log.Fatal(err)
    }
    defer pipe.Close()

    // Generate embedding
    ctx := context.Background()
    vec, err := pipe.Embed(ctx, "Hello, world!")
    if err != nil {
        log.Fatal(err)
    }

    fmt.Printf("Embedding dimension: %d\n", len(vec))
    // Output: Embedding dimension: 384
}

3. Concurrent Embedding with Pool

For production workloads with high concurrency:

// Create engine factory. desc and assets are loaded as in the Quick Start
// above; frozenReg is a frozen operator registry (see the operator package).
factory := pool.NewEngineFactory(desc, assets, frozenReg)

// Create pool with capacity based on CPU cores
ep, err := pool.NewEnginePool(factory, runtime.NumCPU())
if err != nil {
    log.Fatal(err)
}

// Wrap in service for easier access
service := pool.NewEmbeddingService(ep)

// Embed batch concurrently
texts := []string{
    "Hello, world!",
    "Welcome to MemRAG",
    "Embedding models are useful",
}
results, err := service.EmbedBatch(ctx, texts)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("embedded %d texts\n", len(results))

Project Structure

memrag/
├── cmd/                    # Command-line tools
│   ├── embed/             # Embedding CLI
│   └── embed-wasm/        # WebAssembly embedding demo
├── docs/                  # Documentation
│   └── MODEL_CONVERSION.md
├── examples/              # Usage examples
│   └── bge_embedding_example.go
├── model/                 # Model loading and descriptors
│   ├── assets.go
│   ├── descriptor.go
│   └── descriptors/       # Built-in model descriptors
├── operator/              # Operator registry and middleware
├── pipeline/              # Core embedding pipeline
├── pool/                  # Concurrent engine pool
├── tokenizer/             # Tokenizer implementations
│   ├── wordpiece.go       # BERT-style WordPiece
│   └── ...
└── memrag.go              # Package entry point

Supported Models

Currently supported models include:

  • BAAI/bge-small-en-v1.5 - 384 dimensions, 512 max sequence length
  • Additional models can be added via the descriptor system
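A descriptor-based registry like the one described above can be sketched as below. The field names (`Name`, `Dim`, `MaxSeqLen`, `Pooling`) and the `Register`/`Get` pair are illustrative assumptions about the pattern, not the actual memrag API:

```go
package main

import "fmt"

// Descriptor holds per-model configuration, decoupled from the engine.
// These fields are hypothetical; consult model/descriptor.go for the
// real definition.
type Descriptor struct {
	Name      string // HuggingFace-style model identifier
	Dim       int    // embedding dimension
	MaxSeqLen int    // maximum input sequence length
	Pooling   string // e.g. "mean" or "cls"
}

var registry = map[string]Descriptor{}

// Register adds a descriptor under its model name.
func Register(d Descriptor) { registry[d.Name] = d }

// Get looks up a descriptor by model name.
func Get(name string) (Descriptor, bool) {
	d, ok := registry[name]
	return d, ok
}

func main() {
	Register(Descriptor{Name: "BAAI/bge-small-en-v1.5", Dim: 384, MaxSeqLen: 512, Pooling: "cls"})
	d, ok := Get("BAAI/bge-small-en-v1.5")
	fmt.Println(ok, d.Dim) // true 384
}
```

The point of the pattern is that adding a model means registering data, not writing new inference code.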

Performance

MemRAG is designed for minimal memory allocation and high throughput:

  • Zero-allocation hot path: Pre-allocated buffers for tokenizer and pooling
  • Dynamic sequence length: Reshapes engine to actual token count
  • Connection pooling: Reuses pipeline instances via sync.Pool
  • Bounded concurrency: Semaphore-based concurrency control
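Semaphore-based bounded concurrency can be sketched with a buffered channel, the idiomatic Go semaphore: at most `limit` jobs run at once, the rest block on acquire. `embedOne` is a stand-in for real inference, not a memrag function:

```go
package main

import (
	"fmt"
	"sync"
)

// embedAll runs embedOne over texts with at most limit goroutines
// doing work at any moment. A buffered channel acts as the semaphore.
func embedAll(texts []string, limit int, embedOne func(string) int) []int {
	sem := make(chan struct{}, limit)
	results := make([]int, len(texts))
	var wg sync.WaitGroup
	for i, t := range texts {
		wg.Add(1)
		go func(i int, t string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot (blocks when full)
			defer func() { <-sem }() // release the slot
			results[i] = embedOne(t)
		}(i, t)
	}
	wg.Wait()
	return results
}

func main() {
	texts := []string{"a", "bb", "ccc"}
	// Stand-in "embedding": just the text length.
	out := embedAll(texts, 2, func(s string) int { return len(s) })
	fmt.Println(out) // [1 2 3]
}
```

Each goroutine writes only its own index of `results`, so no mutex is needed around the slice.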

Requirements

  • Go 1.25+
  • MemPipe v1.0.0
  • Converted model in .mpmodel format

License

MIT License - see LICENSE for details.
