A Go implementation of the MultiHop RAG benchmark using the GoalSeeking pattern from OpenSymbolicAI. The agent answers complex questions requiring evidence from multiple documents through iterative retrieval-augmented generation.
Built as a single static binary with no external server dependencies — the vector database (chromem-go) is embedded and persisted to disk.
81.6% accuracy on the full MultiHop-RAG dataset (2,556 queries) using gpt-oss-120b:
| Query Type | Accuracy | Correct / Total |
|---|---|---|
| Inference | 87.7% | 716 / 816 |
| Comparison | 73.9% | 633 / 856 |
| Temporal | 75.0% | 437 / 583 |
| Null | 99.3% | 299 / 301 |
| Overall | 81.6% | 2,085 / 2,556 |
Cross-language comparison (same model, same dataset):
| | Python | C# | Go |
|---|---|---|---|
| Overall | 82.9% | 83.8% | 81.6% |
See BENCHMARK_REPORT.md for full analysis, failure breakdown, and iteration history.
- Go 1.22+
- A Fireworks AI API key (used for document embeddings and the default LLM)
```shell
go generate ./...
go build -o multihop-rag .
```

`go generate` scans `//go:opensymbolicai` directives in `multihoprag/agent.go` and produces the dispatch/metadata code. You only need to re-run it when you add or modify primitives or decompositions.
Create a .env file in the project root:
```
FIREWORKS_API_KEY=your-fireworks-api-key-here

# Optional — only needed if you use --provider openai/anthropic/groq
OPENAI_API_KEY=your-key
ANTHROPIC_API_KEY=your-key
GROQ_API_KEY=your-key
```

Get a Fireworks key at fireworks.ai. The `.env` file is loaded automatically — no need to export variables.
The benchmark uses the MultiHop-RAG dataset: 609 news articles across 6 categories (tech, sports, entertainment, business, science, health).
Articles are downloaded from HuggingFace, chunked into ~300-word segments respecting paragraph boundaries, embedded via Fireworks (nomic-embed-text-v1.5), and stored locally via chromem-go.
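The chunking step can be sketched as a word-count splitter that keeps whole paragraphs together whenever they fit in the budget. This is a minimal illustration; the function name and exact behavior of the repo's chunker in `dataset.go` may differ:

```go
package main

import "strings"

// chunkText splits an article into chunks of roughly targetWords words,
// flushing the current chunk at a paragraph boundary once adding the
// next paragraph would exceed the budget. A single oversized paragraph
// still becomes its own chunk.
func chunkText(text string, targetWords int) []string {
	paragraphs := strings.Split(text, "\n\n")
	var chunks []string
	var current []string
	count := 0
	for _, p := range paragraphs {
		words := strings.Fields(p)
		if len(words) == 0 {
			continue // skip empty paragraphs
		}
		if count > 0 && count+len(words) > targetWords {
			chunks = append(chunks, strings.Join(current, "\n\n"))
			current, count = nil, 0
		}
		current = append(current, p)
		count += len(words)
	}
	if count > 0 {
		chunks = append(chunks, strings.Join(current, "\n\n"))
	}
	return chunks
}
```

With `targetWords = 300` this yields the ~300-word segments described above.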
Load the full corpus (609 articles, ~4,000 chunks):
```shell
./multihop-rag setup
```

Quick setup for testing (50 articles):

```shell
./multihop-rag setup --quick
```

Clear existing data and reload:

```shell
./multihop-rag setup --clear
```

All setup flags:
| Flag | Default | Description |
|---|---|---|
| `--quick` | — | Load only the first 50 articles |
| `--max-articles N` | all (609) | Limit to N articles |
| `--clear` | — | Wipe existing data before loading |
| `--chunk-size N` | 300 | Target words per chunk |
| `--max-chunks N` | 20 | Max chunks per article |
Data is persisted in `./chromem_db/`. On subsequent runs, existing data is reused automatically.
Quick smoke test — run a single query:
```shell
./multihop-rag query "What company was reported to invest billions to maintain its default search engine status?"
```

You should see the agent iterate through retrieval hops and produce an answer.
Chat-like REPL — type questions, get multi-hop answers:
```shell
./multihop-rag interactive
./multihop-rag query "Who was found guilty in the crypto trial?"
```

The benchmark (`run`) loads queries from the MultiHop-RAG dataset, runs them through the agent, and evaluates answers against ground truth using an LLM judge.
```shell
# Run 2 queries per type (default)
./multihop-rag run

# Run 5 inference queries only
./multihop-rag run --type inference --num 5

# Run 10 queries of each type with 10 parallel workers
./multihop-rag run --type all --num 10 --parallel 10

# Full dataset (2,556 queries)
./multihop-rag run --type all --num 0 --parallel 10

# Use a different model
./multihop-rag run --provider openai --model gpt-4o
```

Results are logged to `logs/<timestamp>/` with per-query markdown traces, a summary, and JSON results.
All queries and interactive sessions are logged — not just benchmark runs.
```
multihop-rag setup [--quick] [--max-articles N] [--clear] [--chunk-size N] [--max-chunks N]
multihop-rag run [--model M] [--provider P] [--type T] [--num N] [--parallel N] [--max-iterations N]
multihop-rag query [--model M] [--provider P] [--max-iterations N] "your question"
multihop-rag interactive [--model M] [--provider P] [--max-iterations N]
```
| Flag | Default | Description |
|---|---|---|
| `--model` | `accounts/fireworks/models/gpt-oss-120b` | LLM model |
| `--provider` | `fireworks` | LLM provider: fireworks, openai, anthropic, groq, ollama |
| `--type` | `all` | Query type filter: inference, comparison, temporal, null, all |
| `--num` | 2 | Number of queries per type |
| `--parallel` | 1 | Concurrent workers (goroutine pool with semaphore) |
| `--max-iterations` | 5 | Maximum GoalSeeking iterations per query |
| `--quick` | — | Quick corpus setup (50 articles) if the knowledge base is empty |
| `--reinit` | — | Clear and reload the knowledge base before running |
The MultiHop-RAG dataset contains:
- 609 news articles across 6 categories: technology, sports, entertainment, business, science, health
- 2,556 benchmark queries with ground-truth answers
- 4 query types:
- Inference (32%) — connect facts across multiple articles
- Comparison (33%) — compare reporting between named sources
- Temporal (23%) — assess consistency/change across time periods
- Null (12%) — questions where the corpus lacks sufficient information
Each query requires evidence from 2–4 documents.
The agent implements GoalSeekingAgent[*MultiHopContext] for iterative plan-execute-evaluate cycles:
```
User Query
   │
MultiHopGoalAgent via GoalSeeking.Seek(query)
   │── LOOP (max 5 iterations):
   │    │── 1. PLAN ── LLM generates Go code (primitive calls)
   │    │── 2. EXECUTE ── interpreter walks AST, dispatches primitives
   │    │── 3. INTROSPECT ── UpdateContext(): raw trace → structured context
   │    │── 4. EVALUATE ── Evaluate(): sufficient + answer ready?
   │    └── 5. CONTINUE? ── stop if achieved or max iterations
   │
   └── GoalSeekingResult { Status, Iterations, FinalAnswer }
```
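The plan-execute-evaluate loop above can be sketched in Go. The types and method names here (`Context`, `Agent`, `seek`) are simplified stand-ins, not the actual OpenSymbolicAI API:

```go
package main

// Simplified stand-ins for the GoalSeeking cycle; the real library's
// types and signatures differ.
type Context struct {
	Sufficient  bool
	FinalAnswer string
}

type Agent struct {
	Plan          func(ctx *Context, query string) string // 1. PLAN: generate primitive calls
	Execute       func(plan string) string                // 2. EXECUTE: run plan, return raw trace
	UpdateContext func(ctx *Context, trace string)        // 3. INTROSPECT: trace -> structured fields
	Evaluate      func(ctx *Context) bool                 // 4. EVALUATE: goal achieved?
}

// seek runs the iterative loop and returns the answer plus the number
// of iterations used.
func seek(a Agent, query string, maxIterations int) (string, int) {
	ctx := &Context{}
	for i := 1; i <= maxIterations; i++ {
		trace := a.Execute(a.Plan(ctx, query))
		a.UpdateContext(ctx, trace)
		if a.Evaluate(ctx) { // 5. CONTINUE? stop once the goal is achieved
			return ctx.FinalAnswer, i
		}
	}
	// Max-iterations fallback: no synthesized answer.
	return "Insufficient information.", maxIterations
}
```

The key property this sketch preserves is that the planner and evaluator only ever see the curated `Context`, never the raw trace.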
UpdateContext() is the critical architectural feature. It transforms raw execution traces into structured context fields (evidence, entities, queries tried, sufficiency). The planner and evaluator never see raw traces — only these curated fields:
```go
type MultiHopContext struct {
    Evidence          []EvidencePiece // accumulated facts from each hop
    EntitiesFound     []string        // bridge entities linking documents
    CurrentAnswer     *string         // latest synthesized answer
    AnswerConfidence  float64         // computed from evidence + sufficiency
    QueriesTried      []string        // prevents duplicate queries
    Sufficient        bool            // evidence sufficiency flag
    InsufficientCount int             // tracks repeated insufficiency
}
```

| Primitive | Purpose |
|---|---|
| `Retrieve(query, k)` | Semantic search across corpus |
| `RetrieveByCategory(query, category, k)` | Filter by news category |
| `RetrieveBySource(query, source, k)` | Filter by news outlet name |
| `RetrieveFiltered(query, source, category, dateFrom, dateTo, k)` | Combined metadata filters with date range post-filtering |
| `ExtractEvidence(context, question)` | LLM: extract relevant facts from documents |
| `IdentifyEntities(text)` | LLM: find named entities for bridge reasoning |
| `GenerateNextQuery(question, evidenceSoFar)` | LLM: plan next retrieval hop |
| `SynthesizeAnswer(question, evidence)` | LLM: synthesize final answer from evidence |
| `AssessSufficiency(question, evidence)` | LLM: check if evidence is sufficient |
| `CombineContexts(documents)` | Merge documents into labeled context string |
Primitives are registered via `//go:opensymbolicai primitive` directives, and dispatch code is auto-generated by `go generate`.
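In spirit, the generated dispatcher is a switch mapping primitive names to method calls. The sketch below is purely hypothetical — the agent type, the single primitive shown, and the signatures are illustrative, not the actual generated code in `multihoprag/*_opensymbolicai.go`:

```go
package main

import "fmt"

// agent is an illustrative stand-in for the real agent type.
type agent struct{}

// Retrieve is a fake primitive used only to show the dispatch shape.
func (a *agent) Retrieve(query string, k int) (string, error) {
	return fmt.Sprintf("top-%d results for %q", k, query), nil
}

// Dispatch routes a primitive name plus positional arguments to the
// corresponding method, as the generated code does for all primitives.
func (a *agent) Dispatch(name string, args []any) (any, error) {
	switch name {
	case "Retrieve":
		return a.Retrieve(args[0].(string), args[1].(int))
	default:
		return nil, fmt.Errorf("unknown primitive: %s", name)
	}
}
```

This is why the generated files should never be edited by hand: re-running `go generate` would overwrite any manual change.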
The agent teaches the LLM planner via //go:opensymbolicai decomposition directives. Examples are split into hop-1 (retrieve + extract + assess) and hop-2 (synthesize from accumulated knowledge) to teach the planner to spread work across iterations:
Hop 1 — Gather evidence (no synthesis):
- Inference hop — retrieve, extract, identify entities, assess sufficiency
- Multi-source inference hop — RetrieveBySource from each named source, extract, assess
- Inference corroboration hop — generate follow-up query, retrieve from different angle
- Source comparison hop — RetrieveBySource for each source, extract claims, assess
- Claim verification hop — verify each "does A while B" claim independently
- Temporal hop — RetrieveFiltered with date ranges for each period, extract, assess
- Chronological ordering hop — retrieve with dates, extract with publication dates
- Temporal change hop — extract specific positions/assessments from each period
- Cross-source entity hop — retrieve from two sources, extract, identify entities
Hop 2 — Synthesize from accumulated knowledge:

10. Inference synthesis — assess sufficiency + synthesize answer
11. Comparison synthesis — assess + synthesize comparison across sources
12. Temporal synthesis — assess + synthesize Yes/No consistency/change answer
```
├── go.mod                 # Module: core-go + chromem-go
├── main.go                # CLI entry point: setup, run, query, interactive
│
├── multihoprag/           # Core package
│   ├── models.go          # Document, EvidencePiece, MultiHopContext, QueryItem
│   ├── retriever.go       # chromem-go wrapper + Fireworks embeddings
│   ├── agent.go           # 10 primitives + 7 decompositions (//go:opensymbolicai)
│   ├── goal_agent.go      # GoalSeeking wrapper: UpdateContext, Evaluate, BuildGoalTask
│   ├── multihopragagent_opensymbolicai.go  # Auto-generated: Dispatch, Primitives, Decompositions
│   ├── dataset.go         # HuggingFace downloader + text chunker
│   ├── evaluation.go      # LLM judge for semantic answer matching
│   ├── logging.go         # Per-query markdown + summary.md + results.json
│   └── dotenv.go          # .env file loader
│
├── chromem_db/            # Persistent vector storage (auto-created)
└── logs/                  # Timestamped run logs (auto-created)
```
Embedded vector DB — Uses chromem-go instead of ChromaDB. No external server process needed. Data persists as gob-encoded files on disk. For ~600 documents, exhaustive cosine search takes <1ms.
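At this corpus size, exhaustive search is just a scored loop plus a sort. A minimal sketch (illustrative names, not chromem-go's internals):

```go
package main

import (
	"math"
	"sort"
)

// cosine computes cosine similarity between two embedding vectors of
// equal length (assumed non-zero).
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topK scores every corpus vector against the query and returns the
// indices of the k most similar — exhaustive, no index structure.
func topK(query []float32, corpus [][]float32, k int) []int {
	idx := make([]int, len(corpus))
	scores := make([]float64, len(corpus))
	for i, v := range corpus {
		idx[i] = i
		scores[i] = cosine(query, v)
	}
	sort.Slice(idx, func(x, y int) bool { return scores[idx[x]] > scores[idx[y]] })
	if k > len(idx) {
		k = len(idx)
	}
	return idx[:k]
}
```

For ~4,000 chunks this scan is trivially fast, which is why no approximate-nearest-neighbor index is needed.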
Date range post-filtering — chromem-go supports only exact-match metadata filters. RetrieveFiltered handles date ranges by fetching a larger result set (5× k) and filtering by published_ts in Go. Negligible cost at this corpus size.
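The post-filtering step can be sketched as follows. The `result` type and `filterByDate` function are illustrative stand-ins, not the repo's actual code:

```go
package main

// result is a hypothetical stand-in for a retrieved chunk with its
// published_ts metadata (unix seconds).
type result struct {
	ID          string
	PublishedTS int64
}

// filterByDate keeps results whose timestamp falls within [from, to]
// and truncates to k. A zero bound means "unbounded" on that side.
// The caller over-fetches (e.g. 5x k) so enough survivors remain.
func filterByDate(results []result, from, to int64, k int) []result {
	var out []result
	for _, r := range results {
		if from != 0 && r.PublishedTS < from {
			continue
		}
		if to != 0 && r.PublishedTS > to {
			continue
		}
		out = append(out, r)
		if len(out) == k {
			break
		}
	}
	return out
}
```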
Parallel benchmark — The --parallel N flag runs queries concurrently using a goroutine pool with a semaphore. Each worker gets its own LLM client and agent instance.
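A minimal version of such a pool, using a buffered channel as the semaphore (`runQuery` here is a hypothetical stand-in for the per-query work, where each worker would construct its own LLM client and agent):

```go
package main

import "sync"

// runAll executes runQuery over all queries with at most `parallel`
// goroutines in flight, bounded by a buffered-channel semaphore.
// Each goroutine writes to its own slot, so no mutex is needed.
func runAll(queries []string, parallel int, runQuery func(string) string) []string {
	results := make([]string, len(queries))
	sem := make(chan struct{}, parallel)
	var wg sync.WaitGroup
	for i, q := range queries {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot
		go func(i int, q string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			results[i] = runQuery(q)
		}(i, q)
	}
	wg.Wait()
	return results
}
```

Giving each worker its own client and agent instance avoids sharing mutable state across goroutines.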
Code generation — go generate ./multihoprag/ scans //go:opensymbolicai directives and produces the Dispatch(), Primitives(), Decompositions(), and PrimitiveNames() methods. The generated code lives in multihoprag/*_opensymbolicai.go and should not be edited by hand.
Single binary — go build -o multihop-rag . produces one self-contained executable. No runtime dependencies, no node_modules, no Python environment.
Retry with backoff — Embedding API calls in AddDocuments retry up to 3 times with exponential backoff (2s, 4s, 8s) to handle transient Fireworks 500 errors during corpus ingestion.
Max-iterations fallback — When the GoalSeeking loop hits max iterations with no synthesized answer, the benchmark defaults to "Insufficient information." — correct for null queries and graceful degradation for API rate-limiting.
Phased decompositions — Decomposition examples are split into hop-1 (gather) and hop-2 (synthesize) patterns. This teaches the LLM planner to spread work across iterations instead of cramming retrieval + synthesis into a single plan — critical for temporal and comparison accuracy.
```makefile
build: generate
	go build -o multihop-rag .

generate:
	go generate ./...

setup: build
	./multihop-rag setup

run: build
	./multihop-rag run

clean:
	rm -rf multihop-rag chromem_db logs
```

"FIREWORKS_API_KEY environment variable is required"
- Export it: `export FIREWORKS_API_KEY=your-key`, or add it to `.env`
Queries return "Insufficient information"
- If you used `--quick` (50 articles), many queries won't have matching articles
- Load the full corpus: `./multihop-rag setup --clear`
Slow embedding during setup
- First-time setup embeds ~4,000 chunks via the Fireworks API. This takes a few minutes
- Subsequent runs load from disk instantly
"query failed" errors during benchmark
- Check that setup completed successfully: `./multihop-rag setup` should report a document count
- Verify your API key is valid: the embedding API and LLM API both use `FIREWORKS_API_KEY`
Many "(none)" answers in large benchmark runs
- This is API rate-limiting from the LLM provider during long runs
- The max-iterations fallback converts these to "Insufficient information." automatically
- Reduce parallelism (`--parallel 5`) or use a provider with higher rate limits
Embedding API 500 errors during setup
- Transient Fireworks errors. The retry logic handles these automatically (3 retries with exponential backoff)
- If setup fails repeatedly, wait a few minutes and re-run without `--clear` — existing chunks are persisted