OpenSymbolicAI/benchmark-go-MultiHopRAG

OpenSymbolicAI

MultiHop RAG Benchmark (Go)

A Go implementation of the MultiHop RAG benchmark using the GoalSeeking pattern from OpenSymbolicAI. The agent answers complex questions requiring evidence from multiple documents through iterative retrieval-augmented generation.

Built as a single static binary with no external server dependencies — the vector database (chromem-go) is embedded and persisted to disk.

Benchmark Results

81.6% accuracy on the full MultiHop-RAG dataset (2,556 queries) using gpt-oss-120b:

| Query Type | Accuracy | Correct / Total |
|---|---|---|
| Inference  | 87.7% | 716 / 816 |
| Comparison | 73.9% | 633 / 856 |
| Temporal   | 75.0% | 437 / 583 |
| Null       | 99.3% | 299 / 301 |
| **Overall** | **81.6%** | **2,085 / 2,556** |

Cross-language comparison (same model, same dataset):

|         | Python | C#    | Go    |
|---|---|---|---|
| Overall | 82.9%  | 83.8% | 81.6% |

See BENCHMARK_REPORT.md for full analysis, failure breakdown, and iteration history.

Prerequisites

  1. Go 1.22+
  2. A Fireworks AI API key (used for document embeddings and the default LLM)

Step-by-Step Setup

1. Build

```
go generate ./...
go build -o multihop-rag .
```

`go generate` scans `//go:opensymbolicai` directives in `multihoprag/agent.go` and produces the dispatch/metadata code. You only need to re-run it when you add or modify primitives or decompositions.

2. Configure API Keys

Create a .env file in the project root:

```
FIREWORKS_API_KEY=your-fireworks-api-key-here

# Optional — only needed if you use --provider openai/anthropic/groq
OPENAI_API_KEY=your-key
ANTHROPIC_API_KEY=your-key
GROQ_API_KEY=your-key
```

Get a Fireworks key at fireworks.ai. The .env file is loaded automatically — no need to export variables.

3. Ingest the Corpus

The benchmark uses the MultiHop-RAG dataset: 609 news articles across 6 categories (tech, sports, entertainment, business, science, health).

Articles are downloaded from HuggingFace, chunked into ~300-word segments respecting paragraph boundaries, embedded via Fireworks (nomic-embed-text-v1.5), and stored locally via chromem-go.
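The chunking step can be sketched roughly as below. This is an illustrative stand-in for the repository's actual chunker in `dataset.go` — the function name and the blank-line paragraph heuristic are assumptions:

```go
package main

import (
	"fmt"
	"strings"
)

// chunkText splits text into chunks of roughly targetWords words while
// respecting paragraph boundaries: whole paragraphs are accumulated
// until the running word count would exceed the target.
func chunkText(text string, targetWords int) []string {
	paragraphs := strings.Split(text, "\n\n")
	var chunks, current []string
	count := 0
	for _, p := range paragraphs {
		p = strings.TrimSpace(p)
		if p == "" {
			continue
		}
		words := len(strings.Fields(p))
		if count > 0 && count+words > targetWords {
			chunks = append(chunks, strings.Join(current, "\n\n"))
			current, count = nil, 0
		}
		current = append(current, p)
		count += words
	}
	if len(current) > 0 {
		chunks = append(chunks, strings.Join(current, "\n\n"))
	}
	return chunks
}

func main() {
	text := "one two three\n\nfour five six\n\nseven eight nine"
	for i, c := range chunkText(text, 6) {
		fmt.Printf("chunk %d: %q\n", i, c)
	}
}
```

Because splits only occur between paragraphs, a single very long paragraph can produce a chunk larger than the target — the real chunker may handle that case differently.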

Load the full corpus (609 articles, ~4,000 chunks):

```
./multihop-rag setup
```

Quick setup for testing (50 articles):

```
./multihop-rag setup --quick
```

Clear existing data and reload:

```
./multihop-rag setup --clear
```

All setup flags:

| Flag | Default | Description |
|---|---|---|
| `--quick` |  | Load only the first 50 articles |
| `--max-articles N` | all (609) | Limit to N articles |
| `--clear` |  | Wipe existing data before loading |
| `--chunk-size N` | 300 | Target words per chunk |
| `--max-chunks N` | 20 | Max chunks per article |

Data is persisted in ./chromem_db/. On subsequent runs, existing data is reused automatically.

4. Verify Setup

Quick smoke test — run a single query:

```
./multihop-rag query "What company was reported to invest billions to maintain its default search engine status?"
```

You should see the agent iterate through retrieval hops and produce an answer.

Running the Benchmark

Interactive Mode

Chat-like REPL — type questions, get multi-hop answers:

```
./multihop-rag interactive
```

Single Query

```
./multihop-rag query "Who was found guilty in the crypto trial?"
```

Benchmark Run

Loads queries from the MultiHopRAG dataset, runs them through the agent, and evaluates answers against ground truth using an LLM judge.

```
# Run 2 queries per type (default)
./multihop-rag run

# Run 5 inference queries only
./multihop-rag run --type inference --num 5

# Run 10 queries of each type with 10 parallel workers
./multihop-rag run --type all --num 10 --parallel 10

# Full dataset (2,556 queries)
./multihop-rag run --type all --num 0 --parallel 10

# Use a different model
./multihop-rag run --provider openai --model gpt-4o
```

Results are logged to logs/<timestamp>/ with per-query markdown traces, a summary, and JSON results.

All queries and interactive sessions are logged — not just benchmark runs.

CLI Reference

```
multihop-rag setup       [--quick] [--max-articles N] [--clear] [--chunk-size N] [--max-chunks N]
multihop-rag run         [--model M] [--provider P] [--type T] [--num N] [--parallel N] [--max-iterations N]
multihop-rag query       [--model M] [--provider P] [--max-iterations N] "your question"
multihop-rag interactive [--model M] [--provider P] [--max-iterations N]
```

| Flag | Default | Description |
|---|---|---|
| `--model` | accounts/fireworks/models/gpt-oss-120b | LLM model |
| `--provider` | fireworks | LLM provider: fireworks, openai, anthropic, groq, ollama |
| `--type` | all | Query type filter: inference, comparison, temporal, null, all |
| `--num` | 2 | Number of queries per type |
| `--parallel` | 1 | Concurrent workers (goroutine pool with semaphore) |
| `--max-iterations` | 5 | Maximum GoalSeeking iterations per query |
| `--quick` |  | Quick corpus setup (50 articles) if KB is empty |
| `--reinit` |  | Clear and reload the knowledge base before running |

Dataset

The MultiHop-RAG dataset contains:

  • 609 news articles across 6 categories: technology, sports, entertainment, business, science, health
  • 2,556 benchmark queries with ground-truth answers
  • 4 query types:
    • Inference (32%) — connect facts across multiple articles
    • Comparison (33%) — compare reporting between named sources
    • Temporal (23%) — assess consistency/change across time periods
    • Null (12%) — questions where the corpus lacks sufficient information

Each query requires evidence from 2–4 documents.

Architecture

The agent implements GoalSeekingAgent[*MultiHopContext] for iterative plan-execute-evaluate cycles:

```
User Query
    │
MultiHopGoalAgent via GoalSeeking.Seek(query)
    │── LOOP (max 5 iterations):
    │   │── 1. PLAN       ── LLM generates Go code (primitive calls)
    │   │── 2. EXECUTE    ── interpreter walks AST, dispatches primitives
    │   │── 3. INTROSPECT ── UpdateContext(): raw trace → structured context
    │   │── 4. EVALUATE   ── Evaluate(): sufficient + answer ready?
    │   └── 5. CONTINUE?  ── stop if achieved or max iterations
    │
    └── GoalSeekingResult { Status, Iterations, FinalAnswer }
```
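The cycle can be sketched in Go as follows. Every name here (the `seek` function, the helper steps, the `Context` and `Result` types) is an illustrative stand-in, not the actual OpenSymbolicAI API:

```go
package main

import "fmt"

// Illustrative context and result types; the real framework's differ.
type Context struct {
	Sufficient bool
	Answer     string
}

type Result struct {
	Status      string
	Iterations  int
	FinalAnswer string
}

// Stubs standing in for the LLM-backed steps.
func planStep(query string, ctx *Context) string { return "Retrieve(query, 5)" }
func executeStep(code string) string             { return "raw trace" }
func updateContext(ctx *Context, trace string) { // the introspection boundary
	ctx.Answer = "stub answer"
	ctx.Sufficient = true
}
func evaluateStep(ctx *Context) bool { return ctx.Sufficient && ctx.Answer != "" }

// seek runs the plan-execute-introspect-evaluate cycle up to
// maxIterations times, mirroring the diagram above.
func seek(query string, maxIterations int) Result {
	ctx := &Context{}
	for i := 1; i <= maxIterations; i++ {
		code := planStep(query, ctx) // 1. PLAN: LLM generates primitive calls
		trace := executeStep(code)   // 2. EXECUTE: interpreter dispatches primitives
		updateContext(ctx, trace)    // 3. INTROSPECT: raw trace -> structured context
		if evaluateStep(ctx) {       // 4. EVALUATE: goal achieved?
			return Result{Status: "achieved", Iterations: i, FinalAnswer: ctx.Answer}
		} // 5. CONTINUE otherwise
	}
	return Result{Status: "max_iterations", Iterations: maxIterations, FinalAnswer: "Insufficient information."}
}

func main() {
	fmt.Printf("%+v\n", seek("example question", 5))
}
```

The fallback return value mirrors the max-iterations behavior described later in this README.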

The Introspection Boundary

UpdateContext() is the critical architectural feature. It transforms raw execution traces into structured context fields (evidence, entities, queries tried, sufficiency). The planner and evaluator never see raw traces — only these curated fields:

```go
type MultiHopContext struct {
    Evidence          []EvidencePiece  // accumulated facts from each hop
    EntitiesFound     []string         // bridge entities linking documents
    CurrentAnswer     *string          // latest synthesized answer
    AnswerConfidence  float64          // computed from evidence + sufficiency
    QueriesTried      []string         // prevents duplicate queries
    Sufficient        bool             // evidence sufficiency flag
    InsufficientCount int              // tracks repeated insufficiency
}
```

Agent Primitives

| Primitive | Purpose |
|---|---|
| `Retrieve(query, k)` | Semantic search across corpus |
| `RetrieveByCategory(query, category, k)` | Filter by news category |
| `RetrieveBySource(query, source, k)` | Filter by news outlet name |
| `RetrieveFiltered(query, source, category, dateFrom, dateTo, k)` | Combined metadata filters with date range post-filtering |
| `ExtractEvidence(context, question)` | LLM: extract relevant facts from documents |
| `IdentifyEntities(text)` | LLM: find named entities for bridge reasoning |
| `GenerateNextQuery(question, evidenceSoFar)` | LLM: plan next retrieval hop |
| `SynthesizeAnswer(question, evidence)` | LLM: synthesize final answer from evidence |
| `AssessSufficiency(question, evidence)` | LLM: check if evidence is sufficient |
| `CombineContexts(documents)` | Merge documents into labeled context string |

Primitives are registered via `//go:opensymbolicai primitive` directives and dispatch code is auto-generated by `go generate`.
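For illustration, a directive-annotated primitive might look like the sketch below. Only the directive name comes from this README — the agent and retriever types and the method body are hypothetical stand-ins for the real code in `multihoprag/agent.go`:

```go
package main

import "fmt"

// Hypothetical stand-in types, not the repository's real ones.
type Document struct{ ID, Text string }

type retriever struct{ docs []Document }

func (r *retriever) Search(query string, k int) []Document {
	if k > len(r.docs) {
		k = len(r.docs)
	}
	return r.docs[:k] // a real retriever ranks by embedding similarity
}

type MultiHopRAGAgent struct{ retriever *retriever }

//go:opensymbolicai primitive
// Retrieve performs semantic search across the corpus and returns the
// top-k matching chunks. `go generate` reads the directive above and
// emits dispatch code so the LLM planner can call it by name.
func (a *MultiHopRAGAgent) Retrieve(query string, k int) ([]Document, error) {
	return a.retriever.Search(query, k), nil
}

func main() {
	a := &MultiHopRAGAgent{retriever: &retriever{docs: []Document{{"d1", "some text"}}}}
	docs, _ := a.Retrieve("example query", 5)
	fmt.Println(len(docs))
}
```

Like `//go:generate`, the directive is an ordinary comment with no space after `//`, so it has no effect on compilation; it only guides the code generator.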

Phased Decomposition Examples (13 patterns)

The agent teaches the LLM planner via //go:opensymbolicai decomposition directives. Examples are split into hop-1 (retrieve + extract + assess) and hop-2 (synthesize from accumulated knowledge) to teach the planner to spread work across iterations:

Hop 1 — Gather evidence (no synthesis):

  1. Inference hop — retrieve, extract, identify entities, assess sufficiency
  2. Multi-source inference hop — RetrieveBySource from each named source, extract, assess
  3. Inference corroboration hop — generate follow-up query, retrieve from different angle
  4. Source comparison hop — RetrieveBySource for each source, extract claims, assess
  5. Claim verification hop — verify each "does A while B" claim independently
  6. Temporal hop — RetrieveFiltered with date ranges for each period, extract, assess
  7. Chronological ordering hop — retrieve with dates, extract with publication dates
  8. Temporal change hop — extract specific positions/assessments from each period
  9. Cross-source entity hop — retrieve from two sources, extract, identify entities

Hop 2 — Synthesize from accumulated knowledge:

  10. Inference synthesis — assess sufficiency + synthesize answer
  11. Comparison synthesis — assess + synthesize comparison across sources
  12. Temporal synthesis — assess + synthesize Yes/No consistency/change answer

Project Structure

```
├── go.mod                          # Module: core-go + chromem-go
├── main.go                         # CLI entry point: setup, run, query, interactive
│
├── multihoprag/                    # Core package
│   ├── models.go                   # Document, EvidencePiece, MultiHopContext, QueryItem
│   ├── retriever.go                # chromem-go wrapper + Fireworks embeddings
│   ├── agent.go                    # 10 primitives + 7 decompositions (//go:opensymbolicai)
│   ├── goal_agent.go               # GoalSeeking wrapper: UpdateContext, Evaluate, BuildGoalTask
│   ├── multihopragagent_opensymbolicai.go  # Auto-generated: Dispatch, Primitives, Decompositions
│   ├── dataset.go                  # HuggingFace downloader + text chunker
│   ├── evaluation.go               # LLM judge for semantic answer matching
│   ├── logging.go                  # Per-query markdown + summary.md + results.json
│   └── dotenv.go                   # .env file loader
│
├── chromem_db/                     # Persistent vector storage (auto-created)
└── logs/                           # Timestamped run logs (auto-created)
```

Go-Specific Design Decisions

Embedded vector DB — Uses chromem-go instead of ChromaDB. No external server process needed. Data persists as gob-encoded files on disk. For ~600 documents, exhaustive cosine search takes <1ms.

Date range post-filtering — chromem-go supports only exact-match metadata filters. RetrieveFiltered handles date ranges by fetching a larger result set (5× k) and filtering by published_ts in Go. Negligible cost at this corpus size.

Parallel benchmark — The --parallel N flag runs queries concurrently using a goroutine pool with a semaphore. Each worker gets its own LLM client and agent instance.
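A minimal sketch of that pattern, using a buffered channel as the semaphore (the `runAll` helper and its signature are assumptions, not the benchmark's actual code):

```go
package main

import (
	"fmt"
	"sync"
)

// runAll executes runQuery for every query with at most `parallel`
// goroutines in flight, returning results in input order.
func runAll(queries []string, parallel int, runQuery func(string) string) []string {
	results := make([]string, len(queries))
	sem := make(chan struct{}, parallel) // buffered channel as semaphore
	var wg sync.WaitGroup
	for i, q := range queries {
		wg.Add(1)
		go func(i int, q string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it
			results[i] = runQuery(q) // each worker writes its own index
		}(i, q)
	}
	wg.Wait()
	return results
}

func main() {
	out := runAll([]string{"a", "b", "c"}, 2, func(q string) string {
		return "answer:" + q
	})
	fmt.Println(out) // [answer:a answer:b answer:c]
}
```

In the real benchmark each worker would also construct its own LLM client and agent instance before calling the agent, as noted above.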

Code generation — `go generate ./multihoprag/` scans `//go:opensymbolicai` directives and produces the `Dispatch()`, `Primitives()`, `Decompositions()`, and `PrimitiveNames()` methods. The generated code lives in `multihoprag/*_opensymbolicai.go` and should not be edited by hand.

Single binary — `go build -o multihop-rag .` produces one self-contained executable. No runtime dependencies, no node_modules, no Python environment.

Retry with backoff — Embedding API calls in AddDocuments retry up to 3 times with exponential backoff (2s, 4s, 8s) to handle transient Fireworks 500 errors during corpus ingestion.

Max-iterations fallback — When the GoalSeeking loop hits max iterations with no synthesized answer, the benchmark defaults to "Insufficient information." — correct for null queries and graceful degradation for API rate-limiting.

Phased decompositions — Decomposition examples are split into hop-1 (gather) and hop-2 (synthesize) patterns. This teaches the LLM planner to spread work across iterations instead of cramming retrieval + synthesis into a single plan — critical for temporal and comparison accuracy.

Makefile

```makefile
build: generate
	go build -o multihop-rag .

generate:
	go generate ./...

setup: build
	./multihop-rag setup

run: build
	./multihop-rag run

clean:
	rm -rf multihop-rag chromem_db logs
```

Troubleshooting

"FIREWORKS_API_KEY environment variable is required"

  • Add the key to a .env file in the project root (loaded automatically), or export it: `export FIREWORKS_API_KEY=your-key`

Queries return "Insufficient information"

  • If you used --quick (50 articles), many queries won't have matching articles
  • Load the full corpus: ./multihop-rag setup --clear

Slow embedding during setup

  • First-time setup embeds ~4,000 chunks via the Fireworks API. This takes a few minutes
  • Subsequent runs load from disk instantly

"query failed" errors during benchmark

  • Check that setup completed successfully: ./multihop-rag setup should report document count
  • Verify API key is valid: the embedding API and LLM API both use FIREWORKS_API_KEY

Many "(none)" answers in large benchmark runs

  • This is API rate-limiting from the LLM provider during long runs
  • The max-iterations fallback converts these to "Insufficient information." automatically
  • Reduce parallelism (--parallel 5) or use a provider with higher rate limits

Embedding API 500 errors during setup

  • Transient Fireworks errors. The retry logic handles these automatically (3 retries with exponential backoff)
  • If setup fails repeatedly, wait a few minutes and re-run without --clear — existing chunks are persisted
