
MultiHop-RAG Benchmark (C#)

A multi-hop question-answering benchmark built with the GoalSeeking blueprint from OpenSymbolicAI. This is a C# port of the Python benchmark. It answers complex questions that require reasoning across multiple documents by iteratively retrieving evidence -- each retrieval "hop" is one iteration in the GoalSeeking loop.

What is MultiHopRAG?

MultiHopRAG is a benchmark for evaluating retrieval-augmented generation on questions that cannot be answered from a single document. It contains 609 news articles and 2,556 queries where the answer requires connecting facts across 2-4 articles.

Example problem:

Q: Who was the individual associated with cryptocurrency who was found guilty?

A single article mentions a "crypto executive on trial" but not the name. Another article names "Sam Bankman-Fried" in a fraud case but doesn't mention the verdict. A third article reports the guilty verdict. The system must retrieve all three, link the entities, and synthesize: Sam Bankman-Fried.

Query types span inference (connecting facts across articles), comparison (across news sources), temporal (across time periods), and null (insufficient information in the corpus).

Why GoalSeeking?

Traditional RAG retrieves documents once and generates an answer. This works for simple look-up questions but fails when the answer depends on evidence scattered across multiple articles -- the retriever doesn't know what to look for until it has seen the first result.

GoalSeeking makes retrieval iterative: each hop sees the accumulated evidence and decides what to search for next, like a researcher following leads. This structured approach lets a 120B-parameter open-source model (gpt-oss-120b, Fireworks AI) outperform GPT-4 on this benchmark:

| Method | Overall |
|---|---|
| GoalSeeking + gpt-oss-120b (ours, C#) | 83.8% |
| GoalSeeking + gpt-oss-120b (Python) | 82.9% |
| IRCoT + RAG (Llama-8B) | 75.0% |
| Multi-Meta-RAG + GPT-4 | 60.6% |
| GPT-4 RAG baseline | 56.0% |

The gain comes from the retrieval strategy, not model scale alone.

Prerequisites

  • .NET SDK (the dotnet CLI is used to restore, run, and test)
  • A Fireworks AI API key (embeddings + default LLM; see Environment Variables)

Quick Start

```shell
# 1. Clone and enter the repo
cd benchmark-cs-MultiHopRAG

# 2. Copy .env.example and fill in your API key
cp .env.example .env
# Edit .env with your FIREWORKS_API_KEY

# 3. Restore dependencies
dotnet restore

# 4. Run with quick corpus setup (50 articles, ~1 min)
dotnet run --project src/MultiHopRAG -- --quick --query "Who was found guilty in the crypto trial?"
```

For the full 609-article corpus (better accuracy, takes longer to embed):

```shell
dotnet run --project src/MultiHopRAG -- --query "Who was found guilty in the crypto trial?"
```

How It Works

Iterative Retrieval vs. Single-Shot RAG

Traditional RAG pipelines generate the entire retrieval plan in one shot. This is brittle for multi-hop questions because the planner must anticipate all hops before seeing any evidence.

GoalSeeking makes each hop adaptive:

```
SeekAsync("Who is the individual linked to crypto that was found guilty?")

  Iteration 1:  Retrieve("crypto individual guilty") -> ExtractEvidence
                -> UpdateContext: found "Sam Bankman-Fried", gap: "verdict details"
                -> EvaluateAsync: only 1 source, low confidence -> CONTINUE

  Iteration 2:  Retrieve("Bankman-Fried trial verdict") -> ExtractEvidence
                -> UpdateContext: corroborating evidence from 2nd source
                -> EvaluateAsync: 2+ sources, sufficient -> ACHIEVED

  -> "Sam Bankman-Fried"
```

The planner sees accumulated knowledge (not raw results) and decides what to search next. The evaluator checks if enough cross-referenced evidence has been gathered.
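The shape of that loop can be sketched in a few lines of C#. Everything below is illustrative: the stub primitives stand in for the LLM-backed planner, retriever, and evaluator, and none of the names are the library's actual API.

```csharp
// Minimal sketch of the adaptive hop loop (illustrative names, not the
// library's API). Stub primitives stand in for the LLM-backed calls.
using System;
using System.Collections.Generic;

public static class SeekLoopSketch
{
    const int MaxIterations = 5;

    public static string Seek(string question)
    {
        var evidence = new List<string>();
        var query = question;                    // hop 1 searches the question itself
        for (int i = 0; i < MaxIterations; i++)
        {
            var doc = Retrieve(query);           // one retrieval hop
            evidence.Add(ExtractEvidence(doc));  // raw text -> structured fact
            if (IsSufficient(evidence))          // evaluator: enough corroboration?
                return Synthesize(evidence);
            query = NextQuery(evidence);         // planner sees accumulated evidence
        }
        return Synthesize(evidence);             // best effort at the iteration cap
    }

    // Hypothetical stand-ins for the retriever/LLM primitives:
    static string Retrieve(string q) =>
        q.Contains("verdict") ? "second source confirms guilty verdict"
                              : "article names Sam Bankman-Fried, verdict unclear";
    static string ExtractEvidence(string doc) => doc;
    static bool IsSufficient(List<string> ev) => ev.Count >= 2;  // 2+ sources rule
    static string NextQuery(List<string> ev) => "Bankman-Fried trial verdict";
    static string Synthesize(List<string> ev) => string.Join(" | ", ev);
}
```

With these stubs, Seek terminates after two hops: the first finds the entity, the second corroborates the verdict, and the 2+ source rule trips the evaluator.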

Architecture

```
User Query
    |
    v
MultiHopRagAgent.SeekAsync(query)
    |
    +-- CreateContext() -> MultiHopContext(Evidence=[], EntitiesFound=[], ...)
    |
    +-- LOOP (max 5 iterations):
        |
        +-- 1. BuildGoalTask()    <- LLM sees accumulated evidence, plans next hop
        +-- 2. RunAsync()         <- executes C# plan via Roslyn (retrieve, extract, synthesize)
        +-- 3. UpdateContext()    <- INTROSPECTION BOUNDARY: raw -> structured insights
        +-- 4. EvaluateAsync()    <- [Evaluator] checks: Sufficient + CurrentAnswer ready?
        +-- 5. ShouldContinue()   <- stop if achieved or max iterations
            |
            v
        GoalSeekingResult<MultiHopContext>(FinalAnswer, Iterations, Status)
```

The Introspection Boundary

UpdateContext() is the key architectural feature. It converts raw ExecutionResult (via CapturedCalls) into structured fields on MultiHopContext:

| Primitive Called | Context Updated |
|---|---|
| Retrieve / RetrieveByCategory / RetrieveBySource | QueriesTried -- tracks search angles used |
| ExtractEvidence | Evidence -- accumulates EvidencePiece records |
| IdentifyEntities | EntitiesFound -- bridge entities for cross-referencing |
| AssessSufficiency | Sufficient -- flag when evidence is enough |
| SynthesizeAnswer | CurrentAnswer + AnswerConfidence |

The planner and evaluator only see these structured fields -- never the raw execution results.
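A minimal sketch of that boundary, assuming a simplified call-capture shape (CapturedCall and MultiHopContextSketch are illustrative names, not the repo's types):

```csharp
// Sketch of the introspection boundary: captured primitive calls are
// folded into structured context fields exactly once, and only the
// structured fields are shown to the planner/evaluator afterwards.
using System.Collections.Generic;

public record CapturedCall(string Primitive, string Argument, string Result);

public class MultiHopContextSketch
{
    public List<string> QueriesTried { get; } = new();
    public List<string> Evidence { get; } = new();
    public List<string> EntitiesFound { get; } = new();
    public bool Sufficient { get; set; }
    public string? CurrentAnswer { get; set; }

    // Raw execution results cross this boundary here and nowhere else.
    public void UpdateFrom(IEnumerable<CapturedCall> calls)
    {
        foreach (var c in calls)
        {
            switch (c.Primitive)
            {
                case "Retrieve":          QueriesTried.Add(c.Argument);   break;
                case "ExtractEvidence":   Evidence.Add(c.Result);         break;
                case "IdentifyEntities":  EntitiesFound.Add(c.Result);    break;
                case "AssessSufficiency": Sufficient = c.Result == "yes"; break;
                case "SynthesizeAnswer":  CurrentAnswer = c.Result;       break;
            }
        }
    }
}
```

The design choice this illustrates: the planner prompt stays small and stable across iterations because it is built from these fields, not from ever-growing raw retrieval output.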

Primitives

| Primitive | Purpose |
|---|---|
| Retrieve(query, k) | Semantic search over the news corpus |
| RetrieveByCategory(query, category, k) | Filtered by news category (tech, sports, etc.) |
| RetrieveBySource(query, source, k) | Filtered by news outlet name |
| RetrieveFiltered(query, source, category, dateFrom, dateTo, k) | Combined metadata filters (pass "" to skip a filter) |
| ExtractEvidence(context, question) | Pull relevant facts from retrieved text |
| IdentifyEntities(text) | Find named entities / bridge entities |
| GenerateNextQuery(question, evidence) | Plan the next retrieval hop |
| SynthesizeAnswer(question, evidence) | Combine multi-source evidence into answer |
| AssessSufficiency(question, evidence) | Check if evidence is enough to answer |
| CombineContexts(documents) | Merge documents into a context string |

Decomposition Patterns (few-shot examples)

Seven patterns teach the LLM planner how to compose primitives:

  1. Two-hop inference -- Retrieve -> ExtractEvidence -> GenerateNextQuery -> Retrieve -> SynthesizeAnswer
  2. Source comparison -- RetrieveBySource(A) -> ExtractEvidence -> RetrieveBySource(B) -> ExtractEvidence -> SynthesizeAnswer
  3. Single retrieval with sufficiency check -- Retrieve -> ExtractEvidence -> AssessSufficiency -> SynthesizeAnswer
  4. Consistency comparison -- RetrieveBySource(A) -> ExtractEvidence -> RetrieveBySource(B) -> ExtractEvidence -> compare
  5. Cross-source entity resolution -- RetrieveBySource(A) -> IdentifyEntities -> RetrieveBySource(B) -> IdentifyEntities -> SynthesizeAnswer
  6. Temporal source comparison -- RetrieveFiltered(date_A) -> ExtractEvidence -> RetrieveFiltered(date_B) -> ExtractEvidence -> compare
  7. Yes/No temporal consistency -- RetrieveFiltered(period_A) -> ExtractEvidence -> RetrieveFiltered(period_B) -> ExtractEvidence -> Yes/No
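Pattern 1 can be written out as straight-line plan code. The primitives below are toy stand-ins (the real ones call the retriever and LLM), so this shows the composition, not the implementation:

```csharp
// Pattern 1 (two-hop inference) as a composed plan, with toy primitives.
// The stand-ins are assumptions; only the call sequence mirrors the pattern.
public static class TwoHopPlanSketch
{
    // Toy primitives (assumptions, not the repo's implementations):
    public static string Retrieve(string query, int k) =>
        query.Contains("verdict") ? "found guilty on all counts"
                                  : "Sam Bankman-Fried on trial";
    public static string ExtractEvidence(string context, string question) => context;
    public static string GenerateNextQuery(string question, string evidence) =>
        "Bankman-Fried trial verdict";
    public static string SynthesizeAnswer(string question, string evidence) =>
        evidence.Contains("guilty") ? "Sam Bankman-Fried" : "Insufficient information";

    public static string Run(string question)
    {
        var e1 = ExtractEvidence(Retrieve(question, 5), question);                // hop 1
        var next = GenerateNextQuery(question, e1);                               // plan hop 2
        var e2 = ExtractEvidence(Retrieve(next, 5), question);                    // hop 2
        return SynthesizeAnswer(question, e1 + " " + e2);                         // combine
    }
}
```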

Differences from the Python Version

| Aspect | Python | C# |
|---|---|---|
| Vector store | ChromaDB (external process) | LiteDB v6 (embedded, single-file) |
| Embedding cache | ChromaDB stores embeddings | LiteDB stores embeddings alongside content |
| Code execution | Custom AST sanitizer (on_code_extracted) | Roslyn scripting + PlanValidator (default-deny allowlist) |
| Metadata extraction | Runtime introspection | Source generator at compile time ([Primitive], [Decomposition]) |
| Type safety | GoalContext with Pydantic | GoalSeeking<MultiHopContext> with generic constraint |
| Introspection | Inspects ExecutionStep.args | Uses CapturedCalls (typed argument capture) |
| Async | Synchronous primitives | Task<T> throughout |
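The default-deny idea behind PlanValidator can be sketched as follows. The real validator presumably works on Roslyn syntax trees; this regex-based toy only illustrates the policy (any call not on the allowlist rejects the plan):

```csharp
// Toy default-deny validator: generated plan code may only invoke
// named primitives. Illustrative only -- not the real PlanValidator.
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PlanValidatorSketch
{
    // Allowlist of callable primitives (abridged).
    static readonly HashSet<string> Allowed = new()
    {
        "Retrieve", "ExtractEvidence", "IdentifyEntities",
        "GenerateNextQuery", "SynthesizeAnswer", "AssessSufficiency",
    };

    // Default-deny: every identifier followed by '(' must be allowlisted.
    public static bool IsAllowed(string planCode)
    {
        foreach (Match m in Regex.Matches(planCode, @"(\w+)\s*\("))
            if (!Allowed.Contains(m.Groups[1].Value))
                return false;
        return true;
    }
}
```

Default-deny matters here because the plan is LLM-generated code: anything not explicitly permitted (file I/O, network, reflection) is rejected before execution.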

Project Structure

```
benchmark-cs-MultiHopRAG/
|-- src/MultiHopRAG/
|   |-- Program.cs                 # CLI entry point (interactive, --query, --demo)
|   |-- Models/
|   |   |-- Document.cs            # Retrieved document with metadata
|   |   |-- EvidencePiece.cs       # Structured evidence from documents
|   |   |-- MultiHopContext.cs     # GoalContext subclass (introspection boundary)
|   |   +-- QueryItem.cs           # Benchmark query from dataset
|   |-- Retriever/
|   |   |-- IRetriever.cs          # Retriever interface
|   |   |-- EmbeddingClient.cs     # IEmbeddingClient + FireworksEmbeddingClient
|   |   +-- LiteDbRetriever.cs     # LiteDB v6 vector store with cosine similarity
|   |-- Agent/
|   |   +-- MultiHopRagAgent.cs    # GoalSeeking<MultiHopContext> with 10 primitives, 7 decompositions
|   |-- Evaluation/
|   |   +-- LlmMatchEvaluator.cs   # LLM-based semantic answer matching
|   |-- Data/
|   |   |-- DatasetLoader.cs       # HuggingFace dataset download + corpus loading
|   |   +-- TextChunker.cs         # Paragraph-aware text splitting
|   |-- Logging/
|   |   +-- BenchmarkLogger.cs     # Markdown traces + JSON results
|   +-- RetryHandler.cs            # HTTP retry with exponential backoff (429, 5xx)
|-- tests/MultiHopRAG.Tests/
|   |-- ModelTests.cs              # Data model unit tests
|   |-- TextChunkerTests.cs        # Text chunking tests
|   +-- RetrieverTests.cs          # Retriever tests (mocked embeddings)
|-- MultiHopRAG.sln
|-- .env.example
+-- README.md
```
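LiteDbRetriever ranks stored chunks by cosine similarity between the query embedding and each cached document embedding. The scoring function is standard and can be written independently of LiteDB (a sketch of the math, not the repo's code; vectors are assumed to have equal length):

```csharp
// Cosine similarity between two embedding vectors:
// dot(a, b) / (|a| * |b|), in [-1, 1]; 1 means identical direction.
using System;

public static class Cosine
{
    public static double Similarity(double[] a, double[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];   // accumulate dot product
            na  += a[i] * a[i];   // squared norm of a
            nb  += b[i] * b[i];   // squared norm of b
        }
        return dot / (Math.Sqrt(na) * Math.Sqrt(nb));
    }
}
```

An embedded store like LiteDB has no native vector index, so a retriever of this kind typically scans candidate rows and scores each one with this function, keeping the top k.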

Dataset: MultiHop-RAG

| Property | Value |
|---|---|
| Articles | 609 news articles (tech, sports, entertainment, business, science, health) |
| Queries | 2,556 with ground-truth answers |
| Query types | inference (32%), comparison (33%), temporal (23%), null (12%) |
| Evidence per query | 2-4 documents |
| Source | HuggingFace: yixuantt/MultiHopRAG |

Query Types

  • Inference queries -- require connecting facts across multiple articles to identify a person, event, or outcome
  • Comparison queries -- compare claims or reporting between two named news sources
  • Temporal queries -- assess consistency or change in reporting across different time periods
  • Null queries -- questions where the corpus does not contain sufficient information (expected answer: "Insufficient information")

Benchmark Results

Run on the full MultiHop-RAG dataset (2,556 queries across all four types) using GoalSeeking with iterative multi-hop retrieval. The backbone LLM is gpt-oss-120b (120B parameters, served by Fireworks AI).

| Metric | C# | Python | Delta |
|---|---|---|---|
| Overall accuracy | 83.8% (2,143 / 2,556) | 82.9% (2,118 / 2,556) | +0.9pp |
| Goals achieved | 100% (2,556 / 2,556) | 99.6% (2,545 / 2,556) | +0.4pp |
| Avg iterations | 1.4 | 1.9 | -26% |
| LLM calls | 3,619 | 4,733 | -24% |
| Total tokens | 16.2M (11.5M in + 4.7M out) | 23.4M (18.4M in + 5.0M out) | -31% |
| Estimated cost | ~$14.58 | ~$21.09 | -$6.51 |

Per-Type Breakdown

| Query Type | C# | Python | Delta |
|---|---|---|---|
| Inference | 91.3% (745/816) | 88.0% (718/816) | +3.3pp |
| Comparison | 78.4% (671/856) | 78.2% (669/856) | +0.2pp |
| Temporal | 74.4% (434/583) | 76.5% (446/583) | -2.1pp |
| Null | 97.3% (293/301) | 94.7% (285/301) | +2.6pp |
| Overall | 83.8% (2,143/2,556) | 82.9% (2,118/2,556) | +0.9pp |

Comparison with Published Results

Caveat on cross-method comparisons: The results below come from different studies using different LLM backbones (GPT-4, Llama 3.1 8B/70B, gpt-oss-120b), different embedding models, different retrieval corpora or index configurations, and different evaluation splits. The original MultiHop-RAG paper excluded null queries from its accuracy calculation. No official leaderboard exists. These numbers provide directional context, not a controlled ablation.

| Method | Backbone | Inference | Comparison | Temporal | Null | Overall |
|---|---|---|---|---|---|---|
| GoalSeeking (ours, C#) | gpt-oss-120b (120B) | 91.3% | 78.4% | 74.4% | 97.3% | 83.8% |
| GoalSeeking (Python) | gpt-oss-120b (120B) | 88.0% | 78.2% | 76.5% | 94.7% | 82.9% |
| IRCoT + RAG [1] | Llama 3.1 8B | 96.2% | 65.0% | 57.6% | 80.1% | 75.0% |
| IRCoT + GraphRAG [1] | Llama 3.1 8B | 95.0% | 65.9% | 60.4% | 69.4% | 74.3% |
| Community-GraphRAG Local [1] | Llama 3.1 70B | 92.0% | 60.2% | 49.1% | 88.7% | 71.2% |
| HippoRAG2 [1] | Llama 3.1 8B | 91.5% | 58.4% | 49.9% | 85.7% | 70.3% |
| SCMRAG (AAMAS 2025) [2] | -- | -- | ~64% | ~58% | -- | ~67.6% |
| Multi-Meta-RAG [3] | GPT-4 | 95.1% | 38.2% | 25.6% | 98.7% | 60.6% |
| GPT-4 RAG baseline [4] | GPT-4 | -- | -- | -- | excl. | 56.0% |
| GPT-4 + ground-truth chunks [4] | GPT-4 | -- | -- | -- | excl. | 89.0% |

Key takeaways:

  • +8.8pp overall vs. the previous best non-GoalSeeking method (IRCoT + RAG at 75.0%)
  • +0.9pp over Python GoalSeeking with 26% fewer iterations (1.4 vs 1.9 avg)
  • 100% goal achievement (all queries produce an answer, vs 99.6% in Python)
  • 97.3% on null queries -- near-perfect at recognizing when the corpus lacks the answer
  • The only system with balanced performance across all four query types (>74% each)

References:

  1. RAG vs. GraphRAG: A Systematic Evaluation (arXiv:2502.11371)
  2. SCMRAG -- Self-Corrective Multihop RAG (AAMAS 2025)
  3. Multi-Meta-RAG (arXiv:2406.13213)
  4. MultiHop-RAG benchmark paper (arXiv:2401.15391)

CLI Reference

```shell
# Interactive mode (default)
dotnet run --project src/MultiHopRAG

# Single query
dotnet run --project src/MultiHopRAG -- --query "Who was found guilty in the crypto trial?"

# Quick corpus setup (50 articles) + interactive
dotnet run --project src/MultiHopRAG -- --quick

# Benchmark demo
dotnet run --project src/MultiHopRAG -- --demo --type inference --num 3

# Run 10 queries across all types, 3 in parallel
dotnet run --project src/MultiHopRAG -- --demo --num 10 --parallel 3

# Use a different provider/model
dotnet run --project src/MultiHopRAG -- --provider ollama --model llama3.2
```
| Flag | Description | Default |
|---|---|---|
| --model MODEL | LLM model name | accounts/fireworks/models/gpt-oss-120b |
| --provider | fireworks, ollama, openai, anthropic, groq | fireworks |
| --query / -q | Single query mode (non-interactive) | -- |
| --demo | Run benchmark queries from the dataset | off |
| --type / -t | Filter demo queries: inference, comparison, temporal, null, all | all |
| --num / -n | Number of queries per type in demo mode | 2 |
| --parallel / -p | Concurrent queries in demo mode | 5 |
| --max-iterations | GoalSeeking max iterations per query | 5 |
| --quick | Quick corpus setup (50 articles) if DB is empty | off |
| --reinit | Clear and reload the knowledge base | off |

Logging

Every run creates a timestamped directory under logs/ with:

  • query_N.md -- Full trace for each query: plan code, execution steps, arguments, results, timing, and evaluation
  • summary.md -- Aggregate statistics (accuracy, iterations, per-type breakdown)
  • results.json -- Machine-readable results for programmatic analysis

Environment Variables

| Variable | Required | Description |
|---|---|---|
| FIREWORKS_API_KEY | Yes | Fireworks AI API key (embeddings + default LLM) |
| GROQ_API_KEY | No | Groq API key (if using --provider groq) |
| ANTHROPIC_API_KEY | No | Anthropic API key (if using --provider anthropic) |

Running Tests

```shell
dotnet test
```

The test suite (18 tests) covers:

  • Models -- Document, EvidencePiece, MultiHopContext, QueryItem creation and defaults
  • TextChunker -- Paragraph splitting, overlap, long paragraph fallback, small chunk filtering
  • Retriever -- Add, count, query, category/source filtering, clear, initialization markers (all with mocked embeddings via NSubstitute)
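The retriever tests avoid network calls by mocking the embedding client. A dependency-free version of the same idea (the real suite uses NSubstitute; the interface shape below is an assumption, not the repo's IEmbeddingClient):

```csharp
// Hand-rolled fake embedding client: deterministic vectors, no HTTP.
// Plays the role NSubstitute plays in the real test suite.
using System.Threading.Tasks;

public interface IEmbeddingClientSketch
{
    Task<float[]> EmbedAsync(string text);
}

public sealed class FakeEmbeddingClient : IEmbeddingClientSketch
{
    // Encode the input length so tests can assert on a known vector.
    public Task<float[]> EmbedAsync(string text) =>
        Task.FromResult(new float[] { text.Length, 1f });
}
```

A retriever built against the interface can then be exercised end to end (add, count, query, clear) without ever touching the Fireworks embeddings API.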
