A Go implementation of the MultiHop RAG benchmark using the GoalSeeking pattern from OpenSymbolicAI. The agent answers complex questions requiring evidence from multiple documents through iterative retrieval-augmented generation.
Built as a single static binary with no external server dependencies — the vector database (chromem-go) is embedded and persisted to disk.
81.6% accuracy on the full MultiHop-RAG dataset (2,556 queries) using gpt-oss-120b:
| Query Type | Accuracy | Correct / Total |
|---|---|---|
| Inference | 87.7% | 716 / 816 |
| Comparison | 73.9% | 633 / 856 |
| Temporal | 75.0% | 437 / 583 |
| Null | 99.3% | 299 / 301 |
| Overall | 81.6% | 2,085 / 2,556 |
Cross-language comparison (same model, same dataset):
| | Python | C# | Go |
|---|---|---|---|
| Overall | 82.9% | 83.8% | 81.6% |
See BENCHMARK_REPORT.md for full analysis, failure breakdown, and iteration history.
- Go 1.22+
- A Fireworks AI API key (used for document embeddings and the default LLM)
```shell
go generate ./...
go build -o multihop-rag .
```

`go generate` scans `//go:opensymbolicai` directives in `multihoprag/agent.go` and produces the dispatch/metadata code. You only need to re-run it when you add or modify primitives or decompositions.
Create a .env file in the project root:
```
FIREWORKS_API_KEY=your-fireworks-api-key-here

# Optional — only needed if you use --provider openai/anthropic/groq
OPENAI_API_KEY=your-key
ANTHROPIC_API_KEY=your-key
GROQ_API_KEY=your-key
```

Get a Fireworks key at fireworks.ai. The `.env` file is loaded automatically — no need to export variables.
The benchmark uses the MultiHop-RAG dataset: 609 news articles across 6 categories (tech, sports, entertainment, business, science, health).
Articles are downloaded from HuggingFace, chunked into ~300-word segments respecting paragraph boundaries, embedded via Fireworks (nomic-embed-text-v1.5), and stored locally via chromem-go.
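The chunking step can be sketched as a word-count splitter that keeps whole paragraphs together whenever they fit in the budget. This is a minimal illustration; the function name and exact behavior of the repo's chunker in `dataset.go` may differ:

```go
package main

import "strings"

// chunkText splits an article into chunks of roughly targetWords words,
// flushing the current chunk at a paragraph boundary once adding the
// next paragraph would exceed the budget. A single oversized paragraph
// still becomes its own chunk.
func chunkText(text string, targetWords int) []string {
	paragraphs := strings.Split(text, "\n\n")
	var chunks []string
	var current []string
	count := 0
	for _, p := range paragraphs {
		words := strings.Fields(p)
		if len(words) == 0 {
			continue // skip empty paragraphs
		}
		if count > 0 && count+len(words) > targetWords {
			chunks = append(chunks, strings.Join(current, "\n\n"))
			current, count = nil, 0
		}
		current = append(current, p)
		count += len(words)
	}
	if count > 0 {
		chunks = append(chunks, strings.Join(current, "\n\n"))
	}
	return chunks
}
```

With `targetWords = 300` this yields the ~300-word segments described above.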
Load the full corpus (609 articles, ~4,000 chunks):
```shell
./multihop-rag setup
```

Quick setup for testing (50 articles):

```shell
./multihop-rag setup --quick
```

Clear existing data and reload:

```shell
./multihop-rag setup --clear
```

All setup flags:
| Flag | Default | Description |
|---|---|---|
| `--quick` | — | Load only the first 50 articles |
| `--max-articles N` | all (609) | Limit to N articles |
| `--clear` | — | Wipe existing data before loading |
| `--chunk-size N` | 300 | Target words per chunk |
| `--max-chunks N` | 20 | Max chunks per article |
Data is persisted in `./chromem_db/`. On subsequent runs, existing data is reused automatically.
Quick smoke test — run a single query:
```shell
./multihop-rag query "What company was reported to invest billions to maintain its default search engine status?"
```

You should see the agent iterate through retrieval hops and produce an answer.
Chat-like REPL — type questions, get multi-hop answers:
```shell
./multihop-rag interactive
./multihop-rag query "Who was found guilty in the crypto trial?"
```

The benchmark (`run`) loads queries from the MultiHop-RAG dataset, runs them through the agent, and evaluates answers against ground truth using an LLM judge.
```shell
# Run 2 queries per type (default)
./multihop-rag run

# Run 5 inference queries only
./multihop-rag run --type inference --num 5

# Run 10 queries of each type with 10 parallel workers
./multihop-rag run --type all --num 10 --parallel 10

# Full dataset (2,556 queries)
./multihop-rag run --type all --num 0 --parallel 10

# Use a different model
./multihop-rag run --provider openai --model gpt-4o
```

Results are logged to `logs/<timestamp>/` with per-query markdown traces, a summary, and JSON results.
All queries and interactive sessions are logged — not just benchmark runs.
```
multihop-rag setup [--quick] [--max-articles N] [--clear] [--chunk-size N] [--max-chunks N]
multihop-rag run [--model M] [--provider P] [--type T] [--num N] [--parallel N] [--max-iterations N]
multihop-rag query [--model M] [--provider P] [--max-iterations N] "your question"
multihop-rag interactive [--model M] [--provider P] [--max-iterations N]
```
| Flag | Default | Description |
|---|---|---|
| `--model` | `accounts/fireworks/models/gpt-oss-120b` | LLM model |
| `--provider` | `fireworks` | LLM provider: fireworks, openai, anthropic, groq, ollama |
| `--type` | `all` | Query type filter: inference, comparison, temporal, null, all |
| `--num` | 2 | Number of queries per type |
| `--parallel` | 1 | Concurrent workers (goroutine pool with semaphore) |
| `--max-iterations` | 5 | Maximum GoalSeeking iterations per query |
| `--quick` | — | Quick corpus setup (50 articles) if the knowledge base is empty |
| `--reinit` | — | Clear and reload the knowledge base before running |
The MultiHop-RAG dataset contains:
- 609 news articles across 6 categories: technology, sports, entertainment, business, science, health
- 2,556 benchmark queries with ground-truth answers
- 4 query types:
- Inference (32%) — connect facts across multiple articles
- Comparison (33%) — compare reporting between named sources
- Temporal (23%) — assess consistency/change across time periods
- Null (12%) — questions where the corpus lacks sufficient information
Each query requires evidence from 2–4 documents.
The agent implements GoalSeekingAgent[*MultiHopContext] for iterative plan-execute-evaluate cycles:
```
User Query
   │
MultiHopGoalAgent via GoalSeeking.Seek(query)
   │── LOOP (max 5 iterations):
   │    │── 1. PLAN ── LLM generates Go code (primitive calls)
   │    │── 2. EXECUTE ── interpreter walks AST, dispatches primitives
   │    │── 3. INTROSPECT ── UpdateContext(): raw trace → structured context
   │    │── 4. EVALUATE ── Evaluate(): sufficient + answer ready?
   │    └── 5. CONTINUE? ── stop if achieved or max iterations
   │
   └── GoalSeekingResult { Status, Iterations, FinalAnswer }
```
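The plan-execute-evaluate loop above can be sketched in Go. The types and method names here (`Context`, `Agent`, `seek`) are simplified stand-ins, not the actual OpenSymbolicAI API:

```go
package main

// Simplified stand-ins for the GoalSeeking cycle; the real library's
// types and signatures differ.
type Context struct {
	Sufficient  bool
	FinalAnswer string
}

type Agent struct {
	Plan          func(ctx *Context, query string) string // 1. PLAN: generate primitive calls
	Execute       func(plan string) string                // 2. EXECUTE: run plan, return raw trace
	UpdateContext func(ctx *Context, trace string)        // 3. INTROSPECT: trace -> structured fields
	Evaluate      func(ctx *Context) bool                 // 4. EVALUATE: goal achieved?
}

// seek runs the iterative loop and returns the answer plus the number
// of iterations used.
func seek(a Agent, query string, maxIterations int) (string, int) {
	ctx := &Context{}
	for i := 1; i <= maxIterations; i++ {
		trace := a.Execute(a.Plan(ctx, query))
		a.UpdateContext(ctx, trace)
		if a.Evaluate(ctx) { // 5. CONTINUE? stop once the goal is achieved
			return ctx.FinalAnswer, i
		}
	}
	// Max-iterations fallback: no synthesized answer.
	return "Insufficient information.", maxIterations
}
```

The key property this sketch preserves is that the planner and evaluator only ever see the curated `Context`, never the raw trace.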
UpdateContext() is the critical architectural feature. It transforms raw execution traces into structured context fields (evidence, entities, queries tried, sufficiency). The planner and evaluator never see raw traces — only these curated fields:
```go
type MultiHopContext struct {
    Evidence          []EvidencePiece // accumulated facts from each hop
    EntitiesFound     []string        // bridge entities linking documents
    CurrentAnswer     *string         // latest synthesized answer
    AnswerConfidence  float64         // computed from evidence + sufficiency
    QueriesTried      []string        // prevents duplicate queries
    Sufficient        bool            // evidence sufficiency flag
    InsufficientCount int             // tracks repeated insufficiency
}
```

| Primitive | Purpose |
|---|---|
| `Retrieve(query, k)` | Semantic search across corpus |
| `RetrieveByCategory(query, category, k)` | Filter by news category |
| `RetrieveBySource(query, source, k)` | Filter by news outlet name |
| `RetrieveFiltered(query, source, category, dateFrom, dateTo, k)` | Combined metadata filters with date range post-filtering |
| `ExtractEvidence(context, question)` | LLM: extract relevant facts from documents |
| `IdentifyEntities(text)` | LLM: find named entities for bridge reasoning |
| `GenerateNextQuery(question, evidenceSoFar)` | LLM: plan next retrieval hop |
| `SynthesizeAnswer(question, evidence)` | LLM: synthesize final answer from evidence |
| `AssessSufficiency(question, evidence)` | LLM: check if evidence is sufficient |
| `CombineContexts(documents)` | Merge documents into labeled context string |
Primitives are registered via `//go:opensymbolicai primitive` directives, and dispatch code is auto-generated by `go generate`.
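In spirit, the generated dispatcher is a switch mapping primitive names to method calls. The sketch below is purely hypothetical — the agent type, the single primitive shown, and the signatures are illustrative, not the actual generated code in `multihoprag/*_opensymbolicai.go`:

```go
package main

import "fmt"

// agent is an illustrative stand-in for the real agent type.
type agent struct{}

// Retrieve is a fake primitive used only to show the dispatch shape.
func (a *agent) Retrieve(query string, k int) (string, error) {
	return fmt.Sprintf("top-%d results for %q", k, query), nil
}

// Dispatch routes a primitive name plus positional arguments to the
// corresponding method, as the generated code does for all primitives.
func (a *agent) Dispatch(name string, args []any) (any, error) {
	switch name {
	case "Retrieve":
		return a.Retrieve(args[0].(string), args[1].(int))
	default:
		return nil, fmt.Errorf("unknown primitive: %s", name)
	}
}
```

This is why the generated files should never be edited by hand: re-running `go generate` would overwrite any manual change.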
The agent teaches the LLM planner via //go:opensymbolicai decomposition directives. Examples are split into hop-1 (retrieve + extract + assess) and hop-2 (synthesize from accumulated knowledge) to teach the planner to spread work across iterations:
Hop 1 — Gather evidence (no synthesis):
- Inference hop — retrieve, extract, identify entities, assess sufficiency
- Multi-source inference hop — RetrieveBySource from each named source, extract, assess
- Inference corroboration hop — generate follow-up query, retrieve from different angle
- Source comparison hop — RetrieveBySource for each source, extract claims, assess
- Claim verification hop — verify each "does A while B" claim independently
- Temporal hop — RetrieveFiltered with date ranges for each period, extract, assess
- Chronological ordering hop — retrieve with dates, extract with publication dates
- Temporal change hop — extract specific positions/assessments from each period
- Cross-source entity hop — retrieve from two sources, extract, identify entities
Hop 2 — Synthesize from accumulated knowledge:

10. Inference synthesis — assess sufficiency + synthesize answer
11. Comparison synthesis — assess + synthesize comparison across sources
12. Temporal synthesis — assess + synthesize Yes/No consistency/change answer
```
├── go.mod                 # Module: core-go + chromem-go
├── main.go                # CLI entry point: setup, run, query, interactive
│
├── multihoprag/           # Core package
│   ├── models.go          # Document, EvidencePiece, MultiHopContext, QueryItem
│   ├── retriever.go       # chromem-go wrapper + Fireworks embeddings
│   ├── agent.go           # 10 primitives + 7 decompositions (//go:opensymbolicai)
│   ├── goal_agent.go      # GoalSeeking wrapper: UpdateContext, Evaluate, BuildGoalTask
│   ├── multihopragagent_opensymbolicai.go  # Auto-generated: Dispatch, Primitives, Decompositions
│   ├── dataset.go         # HuggingFace downloader + text chunker
│   ├── evaluation.go      # LLM judge for semantic answer matching
│   ├── logging.go         # Per-query markdown + summary.md + results.json
│   └── dotenv.go          # .env file loader
│
├── chromem_db/            # Persistent vector storage (auto-created)
└── logs/                  # Timestamped run logs (auto-created)
```
Embedded vector DB — Uses chromem-go instead of ChromaDB. No external server process needed. Data persists as gob-encoded files on disk. For ~600 documents, exhaustive cosine search takes <1ms.
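At this corpus size, exhaustive search is just a scored loop plus a sort. A minimal sketch (illustrative names, not chromem-go's internals):

```go
package main

import (
	"math"
	"sort"
)

// cosine computes cosine similarity between two embedding vectors of
// equal length (assumed non-zero).
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topK scores every corpus vector against the query and returns the
// indices of the k most similar — exhaustive, no index structure.
func topK(query []float32, corpus [][]float32, k int) []int {
	idx := make([]int, len(corpus))
	scores := make([]float64, len(corpus))
	for i, v := range corpus {
		idx[i] = i
		scores[i] = cosine(query, v)
	}
	sort.Slice(idx, func(x, y int) bool { return scores[idx[x]] > scores[idx[y]] })
	if k > len(idx) {
		k = len(idx)
	}
	return idx[:k]
}
```

For ~4,000 chunks this scan is trivially fast, which is why no approximate-nearest-neighbor index is needed.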
Date range post-filtering — chromem-go supports only exact-match metadata filters. RetrieveFiltered handles date ranges by fetching a larger result set (5× k) and filtering by published_ts in Go. Negligible cost at this corpus size.
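The post-filtering step can be sketched as follows. The `result` type and `filterByDate` function are illustrative stand-ins, not the repo's actual code:

```go
package main

// result is a hypothetical stand-in for a retrieved chunk with its
// published_ts metadata (unix seconds).
type result struct {
	ID          string
	PublishedTS int64
}

// filterByDate keeps results whose timestamp falls within [from, to]
// and truncates to k. A zero bound means "unbounded" on that side.
// The caller over-fetches (e.g. 5x k) so enough survivors remain.
func filterByDate(results []result, from, to int64, k int) []result {
	var out []result
	for _, r := range results {
		if from != 0 && r.PublishedTS < from {
			continue
		}
		if to != 0 && r.PublishedTS > to {
			continue
		}
		out = append(out, r)
		if len(out) == k {
			break
		}
	}
	return out
}
```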
Parallel benchmark — The --parallel N flag runs queries concurrently using a goroutine pool with a semaphore. Each worker gets its own LLM client and agent instance.
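A minimal version of such a pool, using a buffered channel as the semaphore (`runQuery` here is a hypothetical stand-in for the per-query work, where each worker would construct its own LLM client and agent):

```go
package main

import "sync"

// runAll executes runQuery over all queries with at most `parallel`
// goroutines in flight, bounded by a buffered-channel semaphore.
// Each goroutine writes to its own slot, so no mutex is needed.
func runAll(queries []string, parallel int, runQuery func(string) string) []string {
	results := make([]string, len(queries))
	sem := make(chan struct{}, parallel)
	var wg sync.WaitGroup
	for i, q := range queries {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot
		go func(i int, q string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			results[i] = runQuery(q)
		}(i, q)
	}
	wg.Wait()
	return results
}
```

Giving each worker its own client and agent instance avoids sharing mutable state across goroutines.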
Code generation — go generate ./multihoprag/ scans //go:opensymbolicai directives and produces the Dispatch(), Primitives(), Decompositions(), and PrimitiveNames() methods. The generated code lives in multihoprag/*_opensymbolicai.go and should not be edited by hand.
Single binary — go build -o multihop-rag . produces one self-contained executable. No runtime dependencies, no node_modules, no Python environment.
Retry with backoff — Embedding API calls in AddDocuments retry up to 3 times with exponential backoff (2s, 4s, 8s) to handle transient Fireworks 500 errors during corpus ingestion.
Max-iterations fallback — When the GoalSeeking loop hits max iterations with no synthesized answer, the benchmark defaults to "Insufficient information." — correct for null queries and graceful degradation for API rate-limiting.
Phased decompositions — Decomposition examples are split into hop-1 (gather) and hop-2 (synthesize) patterns. This teaches the LLM planner to spread work across iterations instead of cramming retrieval + synthesis into a single plan — critical for temporal and comparison accuracy.
```makefile
build: generate
	go build -o multihop-rag .

generate:
	go generate ./...

setup: build
	./multihop-rag setup

run: build
	./multihop-rag run

clean:
	rm -rf multihop-rag chromem_db logs
```

"FIREWORKS_API_KEY environment variable is required"
- Export it: `export FIREWORKS_API_KEY=your-key`, or add it to `.env`
Queries return "Insufficient information"
- If you used `--quick` (50 articles), many queries won't have matching articles
- Load the full corpus: `./multihop-rag setup --clear`
Slow embedding during setup
- First-time setup embeds ~4,000 chunks via the Fireworks API. This takes a few minutes
- Subsequent runs load from disk instantly
"query failed" errors during benchmark
- Check that setup completed successfully: `./multihop-rag setup` should report a document count
- Verify your API key is valid: the embedding API and LLM API both use `FIREWORKS_API_KEY`
Many "(none)" answers in large benchmark runs
- This is API rate-limiting from the LLM provider during long runs
- The max-iterations fallback converts these to "Insufficient information." automatically
- Reduce parallelism (`--parallel 5`) or use a provider with higher rate limits
Embedding API 500 errors during setup
- Transient Fireworks errors. The retry logic handles these automatically (3 retries with exponential backoff)
- If setup fails repeatedly, wait a few minutes and re-run without `--clear` — existing chunks are persisted