
QA eval pipeline for retrieval #1754

Merged
jperez999 merged 19 commits into NVIDIA:main from KyleZheng1284:feature/qa-harness-fullpage-pipeline
Apr 16, 2026

Conversation


@KyleZheng1284 KyleZheng1284 commented Mar 30, 2026

Description

  • Adds a pluggable QA evaluation harness for measuring Retrieval quality end-to-end using multi-tier scoring.

Capabilities:

  • Multi-tier scoring -- Tier 1 retrieval recall (answer-in-context), Tier 2 programmatic (exact match + token F1), and Tier 3 LLM-as-judge (1-5 rubric) run together in a single pass at zero extra retrieval cost.
  • Full-page markdown retrieval -- Reconstructs complete document pages from NeMo Retriever extraction records via to_markdown_by_page().
  • Pluggable retrieval -- Any retrieval system (vector search, agentic, hybrid, BM25) plugs in by producing a standard JSON (queries → chunks); no harness code changes required.
  • Pluggable datasets -- Any CSV with query/answer columns loads via csv:path/to/file.csv; default ground truth is data/bo767_annotations.csv (1007 Q&A pairs, all modalities).
  • Pluggable LLMs -- Generator and judge models swap via env var or YAML config using litellm prefix routing (nvidia_nim/, openai/, huggingface/).
  • Multi-model sweep -- Set GEN_MODELS to evaluate multiple generators in a single run with side-by-side score comparisons.
  • Failure classification -- Per-query categorization into correct, partial, retrieval_miss, generation_miss, no_context, thinking_truncated to pinpoint exactly where the pipeline fails.
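As a rough sketch of how the Tier-1 and Tier-2 metrics above can be computed (illustrative only; the actual implementations live in nemo_retriever.evaluation.scoring and may normalize text differently):

```python
from collections import Counter


def answer_in_context(reference: str, context: str) -> bool:
    """Tier 1: did retrieval surface the reference answer at all? (sketch)"""
    return reference.strip().lower() in context.lower()


def token_f1(prediction: str, reference: str) -> float:
    """Tier 2: token-level F1 between generated and reference answers (sketch)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Tier 3 then layers an LLM judge score (1-5) on top of these programmatic signals, which is why all three tiers can be filled in from one retrieval-plus-generation pass.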

Note - the CSV containing the Q&A pairs is a subset of the existing https://github.com/NVIDIA/NeMo-Retriever/blob/main/data/digital_corpora_10k_annotations.csv. There is currently a separate PR up with subset annotations for only the bo767-specific files here - #1730

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, I have ensured those are mirrored in the Helm values.yaml file.

@KyleZheng1284 KyleZheng1284 requested review from a team as code owners March 30, 2026 21:26
@KyleZheng1284 KyleZheng1284 requested a review from nkmcalli March 30, 2026 21:26

copy-pr-bot bot commented Mar 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread tools/harness/ingest_bo767.py Outdated
Comment thread tools/harness/extract_bo767_parquet.py Outdated
Comment thread tools/harness/export_retrieval_nemo.py Outdated
print(f" Page index key check: {matched}/{len(sampled)} sampled source_ids found")


def main() -> int:
Collaborator

Why not make this a tool we can call via import, instead of a main function.

Member Author

core evaluation logic has been moved into nemo_retriever.evaluation (importable package, pip-installable via nemo_retriever[eval])

Comment thread tools/harness/export_retrieval_nemo.py Outdated
Comment thread tools/harness/build_page_markdown_index.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/evaluation/types.py
Comment thread tools/harness/src/nv_ingest_harness/utils/qa/orchestrator.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/evaluation/judges.py
Comment thread nemo_retriever/src/nemo_retriever/evaluation/generators.py
@KyleZheng1284 KyleZheng1284 force-pushed the feature/qa-harness-fullpage-pipeline branch from d7c48fa to 9262c63 on April 3, 2026 at 21:56
Contributor

greptile-apps bot commented Apr 3, 2026

Greptile Summary

This PR introduces a pluggable multi-tier QA evaluation harness (nemo_retriever/evaluation/) with Tier-1 retrieval recall, Tier-2 token F1, and Tier-3 LLM-as-judge scoring, plus supporting utilities for Parquet-to-page-index building and LanceDB export. All previously raised P1 findings have been resolved — the answer_in_context substring bug, thinking_truncated misclassification, IndexError on empty evaluations, bo767_infographic None-data_dir crash, and the retrieval_loader exception-dispatch issue.

  • P1 — Missing tests: 15+ new source modules are added with no corresponding test files; test-mirrors-source-structure and test-coverage-new-code require coverage for the pure-computation scoring functions at minimum.
  • P1 — Bare except Exception: pass in io/dataframe.py: Four silent fallback blocks in read_extraction_parquet discard failure details with no logging, making Parquet read failures undiagnosable.

Confidence Score: 4/5

Safe to merge once unit tests are added and the bare-except logging gaps in io/dataframe.py are addressed.

All previously raised P1 correctness bugs are fixed. Two new P1 findings remain: missing tests across 15+ new modules (rules violation), and silent exception swallowing in read_extraction_parquet that makes Parquet read failures undiagnosable. The core scoring and orchestration logic is well-structured.

Files to focus on: nemo_retriever/src/nemo_retriever/io/dataframe.py (bare except), and any new test files that should mirror the evaluation/ package.

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_retriever/src/nemo_retriever/io/dataframe.py | read_extraction_parquet adds multi-strategy Parquet reading, but four bare `except Exception: pass` blocks swallow fallback failures without any logging, violating the no-bare-except rule. |
| nemo_retriever/src/nemo_retriever/evaluation/scoring.py | Programmatic multi-tier scoring; prior substring-matching and thinking_truncated misclassification bugs are fixed; no unit tests. |
| nemo_retriever/src/nemo_retriever/evaluation/orchestrator.py | QAEvalPipeline wiring retrieval → generation → judging → scoring; architecture is solid, the late-binding closure bug is guarded, no tests. |
| nemo_retriever/src/nemo_retriever/evaluation/generators.py | LiteLLMClient wrapper; generation failures are caught and returned as error results but logged only at DEBUG level, making LLM API errors invisible in production. |
| nemo_retriever/src/nemo_retriever/evaluation/config.py | Config loading with env-var expansion and legacy/new schema normalization; previously flagged IndexError on empty evaluations is fixed. |
| nemo_retriever/src/nemo_retriever/evaluation/ground_truth.py | Dataset loaders for bo767_infographic, ViDoRe v3, and generic CSV; previously flagged bo767_infographic None data_dir crash is fixed. |
| nemo_retriever/src/nemo_retriever/io/markdown.py | Adds build_page_index and _read_parquet_for_markdown with column-selection optimization to avoid loading large image/embedding columns. |
| nemo_retriever/src/nemo_retriever/export.py | New module for querying LanceDB and exporting FileRetriever-compatible JSON; well-documented, no tests. |

Sequence Diagram

sequenceDiagram
    participant CLI as retriever eval run
    participant Loader as RetrievalLoaderOperator
    participant FR as FileRetriever
    participant GT as GroundTruth CSV
    participant Gen as QAGenerationOperator(LiteLLMClient)
    participant Judge as JudgingOperator(LLMJudge)
    participant Score as ScoringOperator

    CLI->>Loader: process(None)
    Loader->>GT: load_generic_csv / get_qa_dataset_loader
    GT-->>Loader: qa_pairs
    Loader->>FR: retrieve(query, top_k) per pair
    FR-->>Loader: RetrievalResult(chunks, metadata)
    Loader-->>Gen: DataFrame(query, reference_answer, context)

    Gen->>Gen: ThreadPoolExecutor (max_workers)
    Gen->>LiteLLM: litellm.completion(messages)
    LiteLLM-->>Gen: raw_answer
    Gen->>Gen: strip_think_tags(raw_answer)
    Gen-->>Judge: DataFrame + answer columns

    Judge->>Judge: ThreadPoolExecutor (max_workers)
    Judge->>LiteLLM: litellm.completion(judge_prompt)
    LiteLLM-->>Judge: JSON score 1-5
    Judge-->>Score: DataFrame + judge columns

    Score->>Score: answer_in_context() Tier 1
    Score->>Score: token_f1() Tier 2
    Score->>Score: classify_failure() Tier 3
    Score-->>CLI: DataFrame with all metrics

    CLI->>CLI: write timestamped results JSON
    CLI->>CLI: print multi-tier summary
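The strip_think_tags step in the diagram can be pictured roughly like this (a sketch; the harness's real helper may differ, e.g. in how it flags truncated blocks for the thinking_truncated failure class):

```python
import re

_THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_think_tags(raw_answer: str) -> str:
    # Drop <think>...</think> reasoning blocks so only the final answer is scored.
    return _THINK_RE.sub("", raw_answer).strip()
```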
Prompt To Fix All With AI

This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/evaluation/generators.py
Line: 533-540

Comment:
**LLM API failures logged only at DEBUG level — invisible in production**

The `except Exception` handler uses `logger.debug` without `exc_info=True`. In a default INFO-level deployment any API connectivity failure, authentication error, or rate-limit exception will be silently swallowed at the logging layer — the only record of the failure is the string stored in `GenerationResult.error`. Operators that call `generate()` in a thread pool (e.g. `QAEvalPipeline.process()`) will emit no visible log output when a model endpoint is unreachable, making large-scale eval runs very hard to diagnose.

Raise to `logger.warning` (or `logger.error`) with `exc_info=True`:

```suggestion
        except Exception as exc:
            logger.warning("Generation failed for model=%s: %s", self.model, exc, exc_info=True)
            return GenerationResult(
                answer="",
                latency_s=0.0,
                model=self.model,
                error=str(exc),
            )
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/evaluation/scoring.py
Line: 1-5

Comment:
**No unit tests for any new evaluation modules**

This PR adds 15+ new source modules (`scoring.py`, `generators.py`, `judges.py`, `orchestrator.py`, `config.py`, `ground_truth.py`, `retrieval_loader.py`, `runner.py`, `retrievers.py`, `export.py`, etc.) with zero corresponding test files. The PR checklist marks "New or existing tests cover these changes" as done, but the diff contains no test files at all.

Per the project's `test-mirrors-source-structure` and `test-coverage-new-code` rules, new modules must have test counterparts under `nemo_retriever/tests/`. The pure-computation functions (`token_f1`, `answer_in_context`, `classify_failure`, `_normalize`, `_parse_judge_response`, `_sanitize_prefix`) are especially easy to unit-test and would provide high confidence in the scoring logic correctness.

How can I resolve this? If you propose a fix, please make it concise.
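For instance, classify_failure-style logic is trivially unit-testable; a sketch of what a mirrored test module could look like (the stub below is a simplified stand-in, and the real function's signature and thresholds may differ):

```python
# Hypothetical tests/evaluation/test_scoring.py; classify_failure here is a
# simplified stand-in for the real nemo_retriever.evaluation.scoring function.
def classify_failure(answer_in_ctx: bool, judge_score: int) -> str:
    if judge_score >= 4:
        return "correct"
    if not answer_in_ctx:
        return "retrieval_miss"
    if judge_score >= 2:
        return "partial"
    return "generation_miss"


def test_retrieval_miss_when_answer_absent():
    assert classify_failure(answer_in_ctx=False, judge_score=1) == "retrieval_miss"


def test_generation_miss_when_context_had_answer():
    assert classify_failure(answer_in_ctx=True, judge_score=1) == "generation_miss"


def test_correct_on_high_judge_score():
    assert classify_failure(answer_in_ctx=True, judge_score=5) == "correct"
```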

Reviews (13): Last reviewed commit: "add reference to harness readme + minor ..."

Comment thread tools/harness/run_qa_eval.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/evaluation/scoring.py
Comment thread nemo_retriever/src/nemo_retriever/evaluation/scoring.py Outdated
Comment thread tools/harness/src/nv_ingest_harness/cases/qa_eval.py Outdated
Comment thread tools/harness/src/nv_ingest_harness/utils/qa/__init__.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/evaluation/config.py
Comment thread nemo_retriever/src/nemo_retriever/evaluation/ground_truth.py Outdated
Collaborator

@jperez999 jperez999 left a comment

Moving in the right direction. Let's remove all the changes to the harness that are not in nemo_retriever; that will slim down the PR quite a bit. Also, unless you feel it is really helpful, let's remove all the extra tools you added and replace them with helper functions for those actions. We should refactor to make it possible to tack these operators onto the graph in graph_pipeline.py or onto the Retriever object already in use. We should be trying to reuse as many of the objects that we have as possible. Keep in mind, everything here is a discussion; if you feel it is better the way you have it, please explain it to me.

Comment thread nemo_retriever/src/nemo_retriever/evaluation/config.py
Comment thread nemo_retriever/src/nemo_retriever/evaluation/config.py
Comment thread nemo_retriever/src/nemo_retriever/evaluation/config.py
Comment thread nemo_retriever/src/nemo_retriever/evaluation/orchestrator.py
Comment thread nemo_retriever/src/nemo_retriever/evaluation/orchestrator.py
# ---------------------------------------------------------------------------


def run_agentic_retrieval(
Collaborator

So is this something that we need to do separately from the graph_pipeline.py entry point? Can't we just add in the operators we want and use that same entry point? It would then allow us to make changes to the query file and datasets and should still get the same behavior.

--output data/test_retrieval/bo767_retrieval_dense.json
"""

from __future__ import annotations
Collaborator

Why create a whole new file to do what graph_pipeline already mostly does?

Member Author

This script exists because retrieval-bench only works with HuggingFace datasets out of the box. We would need this file to load our extraction Parquets, expand chunk hits to full-page markdown, and output the FileRetriever JSON that our QA eval pipeline expects.
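The chunk-to-full-page expansion described here can be sketched as follows (a hypothetical helper with assumed field names; the real logic lives in this script and io/markdown.py):

```python
def expand_to_pages(chunk_hits: list[dict], page_index: dict) -> list[str]:
    """Map retrieved chunk hits to deduplicated full-page markdown (sketch).

    Assumes page_index maps (source_id, page_number) -> page markdown, as a
    build_page_index-style helper would produce; hit field names are assumptions.
    """
    seen: set[tuple] = set()
    pages: list[str] = []
    for hit in chunk_hits:
        key = (hit["source_id"], hit["page_number"])
        if key not in seen:
            seen.add(key)
            pages.append(page_index[key])
    return pages
```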

import json
import os

from nv_ingest_harness.cases.e2e import main as e2e_main
Collaborator

Again, it seems like you are creating a whole new graph specifically for this, when what I think we want is to be able to tack these operations onto any graph.

Comment thread tools/harness/src/nv_ingest_harness/cli/run.py Outdated
from nemo_retriever.evaluation.types import RetrievalResult


class TopKRetriever:
Collaborator

Why are you adding this in the harness? This should exist in nemo_retriever. All code changes in legacy nv-ingest can be removed unless they are necessary to make nemo_retriever work.

Member Author

Moving it would pull harness dependencies into nemo_retriever, right? That isn't what we want. It makes more sense in my mind if the harness consumes the nemo_retriever protocol instead of vice versa.

Comment thread nemo_retriever/src/nemo_retriever/evaluation/retrieval_loader.py
Collaborator

@jperez999 jperez999 left a comment

There are things we should polish in another round, but this can get merged. We are going to want to move away from the eval CLI command completely. This needs to be incorporated so it is usable when activated in graph_pipeline.py.

return records


def load_vidore_v3_qa(dataset_name: str, cache_dir: Optional[str] = None) -> list[dict]:
Collaborator

If I have the vidore_v3 question and answer pairs, why do I need the datasets library? If I have the CSV, do I really still need the datasets library?

from nemo_retriever.evaluation.retrievers import FileRetriever

source = self._ground_truth_csv
try:
Collaborator

Wouldn't it be better if you just required the question and answer pairs, instead of trying to handle pulling the data out of a particular file?

Collaborator

This is heavily coupled to the specifics of the datasets we support. I know you have the catch-all CSV reader after this, but I wouldn't even do that. What happens if the user has this information in a different format? Now we require them to read the information in and convert it to CSV. If we made it so we took in the question-answer pairs and ground truth directly, we would put loading that information on the user.

Minimal required JSON format::

{
"queries": {
Collaborator

I think this example is missing a set of list brackets; queries should be a list of dict records, right?
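If queries is indeed meant to be a list of records, the minimal JSON would look something like this (the field names below are illustrative assumptions, not the verified schema from types.py):

```python
import json

# Hypothetical minimal retrieval JSON with "queries" as a list of dict records.
payload = {
    "queries": [
        {
            "query": "What is the warranty period?",
            "chunks": [
                {"text": "The warranty period is 24 months.", "score": 0.91},
            ],
        },
    ],
}
print(json.dumps(payload, indent=2))
```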

def score_dataframe(df: pd.DataFrame) -> pd.DataFrame:
"""Apply all programmatic scoring metrics to a DataFrame.

Input DataFrame must have ``reference_answer``, ``answer``, ``context``
Collaborator

It would be nice if we made it possible to calculate this for any DataFrame. We could keep the column names as defaults, but if we add a place where the user can set their own mappings, it removes the extra column-name normalization step.

from nemo_retriever.evaluation.scoring import score_dataframe


class ScoringOperator(EvalOperator):
Collaborator

But it is only used in this module. If it's only called in your scoring operator, how does putting it nearer to where it is called couple it to non-LLM scoring logic? It seems the evaluation subfolder is all tied to LLM and judge evaluation.



@runtime_checkable
class RetrieverStrategy(Protocol):
Collaborator

OK, but what is the difference between this and the Retriever?



@runtime_checkable
class LLMClient(Protocol):
Collaborator

retrieval_results = [<results inside>]
analysis_results = []
for llm_ref in llms:
    res = analyze_results(retrieval_results, llm_ref)
    analysis_results.append(res)

Why wouldn't something like this work?



def build_page_index(
parquet_dir: str | Path | None = None,
Collaborator

I think this would break if I did something like build_page_index(dataframe). It would need to be build_page_index(None, dataframe) to work correctly, right?

LANCEDB_TABLE = "nv-ingest"


def reload_parquet_to_lancedb(
Collaborator

Since this is lancedb specific it should go in the vector_store subfolder.

@KyleZheng1284 KyleZheng1284 force-pushed the feature/qa-harness-fullpage-pipeline branch from f51a25a to 881eb64 on April 16, 2026 at 16:51
Comment on lines +35 to 67
import pyarrow.parquet as pq

try:
    table = pq.ParquetFile(path).read()
    try:
        table = table.combine_chunks()
    except Exception:
        pass
    try:
        return table.to_pandas(split_blocks=False)
    except Exception:
        return _arrow_table_to_pandas_via_pylist(table)
except Exception:
    pass
try:
    return pd.read_parquet(path, engine="fastparquet")
except Exception:
    pass
try:
    table = pq.ParquetFile(path).read()
    return _arrow_table_to_pandas_via_pylist(table)
except Exception:
    pass
return pd.read_parquet(path)


def read_dataframe(path: Path) -> pd.DataFrame:
    suffix = path.suffix.lower()
    if suffix == ".parquet":
        return read_extraction_parquet(path)
    if suffix in {".jsonl", ".json"}:
        text = path.read_text(encoding="utf-8")
        if suffix == ".jsonl":
Contributor

P1 Bare except Exception: pass clauses swallow errors without logging

Four intermediate except Exception: pass blocks silently discard the failure reason. When every fallback strategy fails, the only error visible to the caller is the one from the final pd.read_parquet(path) call at the bottom — not the original failure from the primary PyArrow path. This makes it impossible to diagnose why a specific parquet cannot be read.

Per the project's no-bare-except rule, bare-except blocks at non-boundary sites must log the caught exception. Add logger.debug (or logger.warning) with exc_info=True at each fallback transition so failures are traceable:

try:
    table = pq.ParquetFile(path).read()
    try:
        table = table.combine_chunks()
    except Exception:
        pass  # combine_chunks is best-effort; chunked array still usable
    try:
        return table.to_pandas(split_blocks=False)
    except Exception as e:
        logger.debug("to_pandas(split_blocks=False) failed for %s: %s; using pylist fallback", path, e)
        return _arrow_table_to_pandas_via_pylist(table)
except Exception as e:
    logger.debug("Primary PyArrow read failed for %s: %s; trying fastparquet", path, e)
try:
    return pd.read_parquet(path, engine="fastparquet")
except Exception as e:
    logger.debug("fastparquet read failed for %s: %s; retrying pylist", path, e)
...

@jperez999 jperez999 merged commit 89d9965 into NVIDIA:main Apr 16, 2026
5 checks passed