QA eval pipeline for retrieval #1754
Conversation
print(f" Page index key check: {matched}/{len(sampled)} sampled source_ids found")
def main() -> int:
Why not make this a tool we can call via import, instead of a main function?
Core evaluation logic has been moved into nemo_retriever.evaluation (an importable package, pip-installable via nemo_retriever[eval]).
Greptile Summary
This PR introduces a pluggable multi-tier QA evaluation harness.
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/io/dataframe.py | read_extraction_parquet adds multi-strategy Parquet reading, but four bare except Exception: pass blocks swallow fallback failures without any logging, violating the project's no-bare-except rule. |
| nemo_retriever/src/nemo_retriever/evaluation/scoring.py | Programmatic multi-tier scoring; prior substring-matching and thinking_truncated misclassification bugs are fixed; no unit tests. |
| nemo_retriever/src/nemo_retriever/evaluation/orchestrator.py | QAEvalPipeline wiring retrieval→generation→judging→scoring; architecture is solid, late-binding closure bug is guarded, no tests. |
| nemo_retriever/src/nemo_retriever/evaluation/generators.py | LiteLLMClient wrapper; generation failures caught and returned as error results but logged only at DEBUG level, making LLM API errors invisible in production. |
| nemo_retriever/src/nemo_retriever/evaluation/config.py | Config loading with env-var expansion and legacy/new schema normalization; previously flagged IndexError on empty evaluations is fixed. |
| nemo_retriever/src/nemo_retriever/evaluation/ground_truth.py | Dataset loaders for bo767_infographic, ViDoRe v3, and generic CSV; previously flagged bo767_infographic None data_dir crash is fixed. |
| nemo_retriever/src/nemo_retriever/io/markdown.py | Adds build_page_index and _read_parquet_for_markdown with column-selection optimization to avoid loading large image/embedding columns. |
| nemo_retriever/src/nemo_retriever/export.py | New module for querying LanceDB and exporting FileRetriever-compatible JSON; well-documented, no tests. |
Sequence Diagram
sequenceDiagram
participant CLI as retriever eval run
participant Loader as RetrievalLoaderOperator
participant FR as FileRetriever
participant GT as GroundTruth CSV
participant Gen as QAGenerationOperator(LiteLLMClient)
participant Judge as JudgingOperator(LLMJudge)
participant Score as ScoringOperator
CLI->>Loader: process(None)
Loader->>GT: load_generic_csv / get_qa_dataset_loader
GT-->>Loader: qa_pairs
Loader->>FR: retrieve(query, top_k) per pair
FR-->>Loader: RetrievalResult(chunks, metadata)
Loader-->>Gen: DataFrame(query, reference_answer, context)
Gen->>Gen: ThreadPoolExecutor (max_workers)
Gen->>LiteLLM: litellm.completion(messages)
LiteLLM-->>Gen: raw_answer
Gen->>Gen: strip_think_tags(raw_answer)
Gen-->>Judge: DataFrame + answer columns
Judge->>Judge: ThreadPoolExecutor (max_workers)
Judge->>LiteLLM: litellm.completion(judge_prompt)
LiteLLM-->>Judge: JSON score 1-5
Judge-->>Score: DataFrame + judge columns
Score->>Score: answer_in_context() Tier 1
Score->>Score: token_f1() Tier 2
Score->>Score: classify_failure() Tier 3
Score-->>CLI: DataFrame with all metrics
CLI->>CLI: write timestamped results JSON
CLI->>CLI: print multi-tier summary
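The three scoring tiers at the bottom of the diagram can be sketched as plausible stand-ins. The actual nemo_retriever implementations may differ; in particular, the 0.5 F1 cutoff and the failure labels below are assumptions, not the real taxonomy.

```python
import re
import string


def _normalize(text: str) -> str:
    # Lowercase, strip punctuation, collapse whitespace.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def answer_in_context(reference_answer: str, context: str) -> bool:
    # Tier 1: normalized substring containment of the reference answer.
    return _normalize(reference_answer) in _normalize(context)


def token_f1(prediction: str, reference: str) -> float:
    # Tier 2: bag-of-tokens F1 between generated and reference answers.
    pred, ref = _normalize(prediction).split(), _normalize(reference).split()
    if not pred or not ref:
        return float(pred == ref)
    ref_counts: dict = {}
    for t in ref:
        ref_counts[t] = ref_counts.get(t, 0) + 1
    common = 0
    for t in pred:
        if ref_counts.get(t, 0) > 0:
            ref_counts[t] -= 1
            common += 1
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)


def classify_failure(answer: str, f1: float, retrieved: bool) -> str:
    # Tier 3: coarse failure taxonomy (labels and 0.5 cutoff are assumptions).
    if not retrieved:
        return "retrieval_miss"
    if not answer.strip():
        return "empty_answer"
    return "generation_error" if f1 < 0.5 else "ok"
```

Functions like these are pure computations, which is also why they are easy to unit-test.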
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/io/dataframe.py
Line: 35-67
Comment:
**Bare `except Exception: pass` clauses swallow errors without logging**
Four intermediate `except Exception: pass` blocks silently discard the failure reason. When every fallback strategy fails, the only error visible to the caller is the one from the final `pd.read_parquet(path)` call at the bottom — not the original failure from the primary PyArrow path. This makes it impossible to diagnose *why* a specific parquet cannot be read.
Per the project's `no-bare-except` rule, bare-except blocks at non-boundary sites must log the caught exception. Add `logger.debug` (or `logger.warning`) with `exc_info=True` at each fallback transition so failures are traceable:
```python
try:
table = pq.ParquetFile(path).read()
try:
table = table.combine_chunks()
except Exception:
pass # combine_chunks is best-effort; chunked array still usable
try:
return table.to_pandas(split_blocks=False)
except Exception as e:
logger.debug("to_pandas(split_blocks=False) failed for %s: %s; using pylist fallback", path, e)
return _arrow_table_to_pandas_via_pylist(table)
except Exception as e:
logger.debug("Primary PyArrow read failed for %s: %s; trying fastparquet", path, e)
try:
return pd.read_parquet(path, engine="fastparquet")
except Exception as e:
logger.debug("fastparquet read failed for %s: %s; retrying pylist", path, e)
...
```
How can I resolve this? If you propose a fix, please make it concise.
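A generic, self-contained illustration of the pattern the comment asks for: log the reason at each fallback transition, and let only the final strategy raise. The reader callables here are hypothetical stand-ins, not the actual parquet strategies.

```python
import logging

logger = logging.getLogger("fallback_demo")


def read_with_fallbacks(path, readers):
    """Try (name, reader) strategies in order, logging each failed transition."""
    for name, reader in readers[:-1]:
        try:
            return reader(path)
        except Exception as exc:
            # Log the reason before moving on, so the original failure is
            # traceable even when a later strategy succeeds.
            logger.debug("%s failed for %s: %s; trying next strategy",
                         name, path, exc, exc_info=True)
    # Let the final strategy raise, so the caller still sees an error.
    _final_name, final_reader = readers[-1]
    return final_reader(path)


def flaky_reader(path):
    raise ValueError("simulated primary-path failure")


def working_reader(path):
    return "dataframe-stand-in"


result = read_with_fallbacks(
    "example.parquet",
    [("primary", flaky_reader), ("fallback", working_reader)],
)
```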
---
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/evaluation/generators.py
Line: 533-540
Comment:
**LLM API failures logged only at DEBUG level — invisible in production**
The `except Exception` handler uses `logger.debug` without `exc_info=True`. In a default INFO-level deployment any API connectivity failure, authentication error, or rate-limit exception will be silently swallowed at the logging layer — the only record of the failure is the string stored in `GenerationResult.error`. Operators that call `generate()` in a thread pool (e.g. `QAEvalPipeline.process()`) will emit no visible log output when a model endpoint is unreachable, making large-scale eval runs very hard to diagnose.
Raise to `logger.warning` (or `logger.error`) with `exc_info=True`:
```suggestion
except Exception as exc:
logger.warning("Generation failed for model=%s: %s", self.model, exc, exc_info=True)
return GenerationResult(
answer="",
latency_s=0.0,
model=self.model,
error=str(exc),
)
```
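A self-contained demonstration of why the DEBUG-level call is invisible under a default INFO-level configuration; the logger name and messages are illustrative, not the actual module's.

```python
import io
import logging

# Hypothetical logger setup mirroring a default INFO-level deployment.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
logger = logging.getLogger("generation_demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

try:
    raise ConnectionError("model endpoint unreachable")
except Exception as exc:
    logger.debug("Generation failed: %s", exc)                    # filtered out at INFO
    logger.warning("Generation failed: %s", exc, exc_info=True)   # visible, with traceback

output = stream.getvalue()
```

Only the warning reaches the handler, and `exc_info=True` attaches the full traceback to it.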
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: nemo_retriever/src/nemo_retriever/evaluation/scoring.py
Line: 1-5
Comment:
**No unit tests for any new evaluation modules**
This PR adds 15+ new source modules (`scoring.py`, `generators.py`, `judges.py`, `orchestrator.py`, `config.py`, `ground_truth.py`, `retrieval_loader.py`, `runner.py`, `retrievers.py`, `export.py`, etc.) with zero corresponding test files. The PR checklist marks "New or existing tests cover these changes" as done, but the diff contains no test files at all.
Per the project's `test-mirrors-source-structure` and `test-coverage-new-code` rules, new modules must have test counterparts under `nemo_retriever/tests/`. The pure-computation functions (`token_f1`, `answer_in_context`, `classify_failure`, `_normalize`, `_parse_judge_response`, `_sanitize_prefix`) are especially easy to unit-test and would provide high confidence in the scoring logic correctness.
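As one example of how cheap such tests would be, here is a plausible stand-in for `_parse_judge_response` (the real function's behavior may differ) together with pytest-style tests:

```python
import json
import re
from typing import Optional


def _parse_judge_response(raw: str) -> Optional[int]:
    """Hypothetical stand-in: extract the first JSON object from the judge's
    reply and return its integer "score" (1-5), or None if unparseable."""
    match = re.search(r"\{.*?\}", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    score = payload.get("score")
    if isinstance(score, int) and 1 <= score <= 5:
        return score
    return None


def test_parses_plain_json():
    assert _parse_judge_response('{"score": 4}') == 4


def test_parses_json_embedded_in_prose():
    assert _parse_judge_response('Sure: {"score": 5} hope that helps') == 5


def test_rejects_out_of_range_and_garbage():
    assert _parse_judge_response('{"score": 9}') is None
    assert _parse_judge_response("no json here") is None
```

Placed under `nemo_retriever/tests/` mirroring the source layout, tests like these would satisfy both cited rules with minimal effort.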
Reviews (13): Last reviewed commit: "add reference to harness readme + minor ..."
jperez999 left a comment:
Moving in the right direction. Let's remove all the changes to the harness that are not in nemo_retriever; that will slim down the PR quite a bit. Also, unless you feel it is really helpful, let's remove all the extra tools you added and replace them with helper functions for those actions. We should refactor to make it possible to tack these operators onto the graph in graph_pipeline.py or into the Retriever object already in use. We should be trying to reuse as many of the objects we already have as possible. Keep in mind, everything here is a discussion; if you feel it is better the way you have it, please explain it to me.
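The "tack these operators onto the graph" idea could look roughly like this minimal sketch. The `Operator`/`Graph` names and the `process` signature are assumptions, not the actual graph_pipeline.py API:

```python
from typing import Any, Protocol


class Operator(Protocol):
    # Assumed minimal operator contract; the real operators in
    # graph_pipeline.py may expose a richer interface.
    def process(self, data: Any) -> Any: ...


class Graph:
    def __init__(self) -> None:
        self.operators: list = []

    def add(self, op: Operator) -> "Graph":
        # Tacking an eval stage onto an existing graph is just appending
        # one more operator; no separate entry point is required.
        self.operators.append(op)
        return self

    def run(self, data: Any) -> Any:
        for op in self.operators:
            data = op.process(data)
        return data


class UpperCaseOperator:
    def process(self, data):
        return [s.upper() for s in data]


class TagOperator:
    def process(self, data):
        return [f"eval:{s}" for s in data]
```

With this shape, eval stages could be appended to the existing graph rather than run behind a separate entry point.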
# ---------------------------------------------------------------------------


def run_agentic_retrieval(
So this is something that we need to do separately from the graph_pipeline.py entry point? Can't we just add in the operators we want and use that same entry point? It would then allow us to make changes to the query file and datasets and should still get the same behavior.
    --output data/test_retrieval/bo767_retrieval_dense.json
"""

from __future__ import annotations
Why create a whole new file to do what graph_pipeline already mostly does?
This script exists because retrieval-bench only works with HuggingFace datasets out of the box. We need this file to load our extraction Parquets, expand chunk hits to full-page markdown, and output the FileRetriever JSON that our QA eval pipeline expects.
import json
import os

from nv_ingest_harness.cases.e2e import main as e2e_main
Again, it seems like you are creating a whole new graph specifically for this, when what I think we want is to be able to tack these operations onto any graph.
from nemo_retriever.evaluation.types import RetrievalResult


class TopKRetriever:
Why are you adding this in the harness? This should exist in nemo_retriever. All code changes in legacy nv-ingest can be removed unless necessary to make nemo_retriever work.
Moving it would pull harness dependencies into nemo_retriever, which isn't what we want. It makes more sense in my mind if the harness consumes the nemo_retriever protocol instead of vice versa.
jperez999 left a comment:
There are things we should polish in another round, but this can get merged. We are going to want to move away from the eval CLI command completely; this needs to be incorporated, activated in graph_pipeline.py, to be usable.
    return records


def load_vidore_v3_qa(dataset_name: str, cache_dir: Optional[str] = None) -> list[dict]:
If I have the vidore_v3 question and answer pairs, why do I need to have the datasets library? If I have the CSV, do I really still need the datasets library?
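For the CSV case the stdlib alone is indeed enough; the datasets library would only matter for downloading or converting the original HuggingFace dataset. A hedged sketch, where the `question`/`answer` column names are assumptions:

```python
import csv
import io


def load_qa_csv(fileobj) -> list:
    # Read QA pairs from an already-available CSV using only the stdlib;
    # no datasets dependency needed.
    reader = csv.DictReader(fileobj)
    return [
        {"query": row["question"], "reference_answer": row["answer"]}
        for row in reader
    ]


sample = io.StringIO("question,answer\nWhat is 2+2?,4\n")
pairs = load_qa_csv(sample)
```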
from nemo_retriever.evaluation.retrievers import FileRetriever

source = self._ground_truth_csv
try:
Wouldn't it be better if you just required the question and answer pairs, instead of trying to handle pulling the data out of a particular file?
This is heavily coupled to the specifics of the datasets we support. I know you have the catch-all after to read CSV, but I wouldn't even do that. What happens if the user has this information in a different format? Now we require them to read the information in and change it to CSV. If we made it so we took in the question-answer pairs and ground truth, then we put loading that information on the user.
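The suggested inversion, where the caller loads data however they like and the pipeline only ever sees pairs, might look like this sketch; the names and fields are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    # Hypothetical minimal input contract: callers load their own data, in
    # any format, and hand the pipeline plain pairs.
    query: str
    reference_answer: str


def run_eval(qa_pairs):
    # Sketch: consume pairs directly; no file parsing inside the pipeline.
    return [
        {"query": p.query, "reference_answer": p.reference_answer, "answer": None}
        for p in qa_pairs
    ]


rows = run_eval([QAPair("Who wrote Hamlet?", "Shakespeare")])
```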
Minimal required JSON format::

{
    "queries": {
I think this example is missing a set of list brackets; queries should be a list of dict records, right?
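With the list brackets the reviewer describes, the minimal format would parse like this; the `id`/`text` field names are illustrative, not the actual schema:

```python
import json

# "queries" as a list of record dicts, as suggested in the review comment.
doc = json.loads("""
{
  "queries": [
    {"id": "q1", "text": "What is the total revenue?"},
    {"id": "q2", "text": "Which page shows the org chart?"}
  ]
}
""")
```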
def score_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Apply all programmatic scoring metrics to a DataFrame.

    Input DataFrame must have ``reference_answer``, ``answer``, ``context``
It would be nice if we made it possible to calculate this for any DataFrame. We could keep the column names as defaults, but if we add a place where the user can set their own mappings, it removes the extra column-name normalization step.
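A toy version of that mapping idea, using plain record dicts instead of the real `score_dataframe` to keep it self-contained; the default column names match the docstring above, the mapping mechanism is the suggestion being sketched:

```python
def score_records(records, column_map=None):
    # Default column names, overridable by the caller's own mapping,
    # so no up-front renaming step is needed.
    cols = {"reference_answer": "reference_answer",
            "answer": "answer",
            "context": "context"}
    cols.update(column_map or {})
    scored = []
    for row in records:
        ref = row[cols["reference_answer"]]
        ctx = row[cols["context"]]
        # Stand-in metric: case-insensitive containment of the reference.
        scored.append({**row, "answer_in_context": ref.lower() in ctx.lower()})
    return scored


rows = score_records(
    [{"gt": "Paris", "pred": "Paris", "passage": "Paris is the capital."}],
    column_map={"reference_answer": "gt", "answer": "pred", "context": "passage"},
)
```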
from nemo_retriever.evaluation.scoring import score_dataframe


class ScoringOperator(EvalOperator):
But it is only used in this module. If it's only called in your scoring operator, how does putting it nearer to where it is called couple it to non-LLM scoring logic? It seems the evaluation subfolder is all tied to LLM and judge evaluation.
@runtime_checkable
class RetrieverStrategy(Protocol):
OK, but what is the difference between this and the Retriever?
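For context on what a `@runtime_checkable` Protocol buys: it is a structural contract, so any concrete retriever with a matching method satisfies it without importing or inheriting from anything. A self-contained sketch, where the `retrieve` signature and the toy `FileRetriever` body are assumptions:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class RetrieverStrategy(Protocol):
    # Structural contract: anything with a matching method qualifies,
    # no inheritance needed. (This signature is an assumption.)
    def retrieve(self, query: str, top_k: int) -> list: ...


class FileRetriever:
    # Concrete class that never references the protocol.
    def __init__(self, corpus):
        self.corpus = corpus

    def retrieve(self, query: str, top_k: int) -> list:
        hits = [d for d in self.corpus if query.lower() in d.lower()]
        return hits[:top_k]


r = FileRetriever(["cats purr", "dogs bark"])
```

A concrete Retriever class, by contrast, would be one particular implementation rather than the contract that all implementations share.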
@runtime_checkable
class LLMClient(Protocol):
retrieval_results = [<results inside>]
analysis_results = []
for llm_ref in llms:
    res = analyze_results(retrieval_results, llm_ref)
    analysis_results.append(res)
Why wouldn't something like this work?
def build_page_index(
    parquet_dir: str | Path | None = None,
I think this would break if I did something like build_page_index(dataframe). It would need to be build_page_index(None, dataframe) to work correctly, right?
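One way to avoid that positional-binding pitfall is to make the parameters keyword-only; a sketch of the idea, not the actual signature:

```python
def build_page_index(*, parquet_dir=None, dataframe=None):
    # All parameters keyword-only: build_page_index(df) raises TypeError
    # immediately instead of silently binding df to parquet_dir.
    if (parquet_dir is None) == (dataframe is None):
        raise ValueError("pass exactly one of parquet_dir or dataframe")
    return "dir" if parquet_dir is not None else "dataframe"
```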
LANCEDB_TABLE = "nv-ingest"


def reload_parquet_to_lancedb(
Since this is LanceDB-specific, it should go in the vector_store subfolder.
import pyarrow.parquet as pq

    try:
        table = pq.ParquetFile(path).read()
        try:
            table = table.combine_chunks()
        except Exception:
            pass
        try:
            return table.to_pandas(split_blocks=False)
        except Exception:
            return _arrow_table_to_pandas_via_pylist(table)
    except Exception:
        pass
    try:
        return pd.read_parquet(path, engine="fastparquet")
    except Exception:
        pass
    try:
        table = pq.ParquetFile(path).read()
        return _arrow_table_to_pandas_via_pylist(table)
    except Exception:
        pass
    return pd.read_parquet(path)


def read_dataframe(path: Path) -> pd.DataFrame:
    suffix = path.suffix.lower()
    if suffix == ".parquet":
        return read_extraction_parquet(path)
    if suffix in {".jsonl", ".json"}:
        text = path.read_text(encoding="utf-8")
        if suffix == ".jsonl":
Description
Capabilities:
Note: the CSV containing the Q-A pairs is a subset of the existing https://github.com/NVIDIA/NeMo-Retriever/blob/main/data/digital_corpora_10k_annotations.csv. There is currently a separate PR up with subset annotations for only bo767-specific files: #1730
Checklist