
Keep original NaN or null scores from LLM judge in eval output #1058

Merged
rapids-bot[bot] merged 9 commits into NVIDIA:release/1.3 from yczhang-nv:yuchen-fix-nan-score
Oct 21, 2025

Conversation

@yczhang-nv
Contributor

@yczhang-nv yczhang-nv commented Oct 21, 2025

Description

When running nat eval, NaN or null scores returned by the LLM judge are converted to 0.0 in order to calculate the average score. Those scores are also written out as 0.0 in the eval output, which is misleading to users: NaN or null indicates that an error occurred during eval, while 0.0 means the eval succeeded but the quality of the answer is poor.

This PR keeps the original NaN or null scores from LLM judge in eval output, while still converting them to 0.0 when calculating the average score.

Closes #584
Closes nvbugs-5441058
Closes AIQ-1787

By Submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • Bug Fixes
    • Fixed evaluation output to preserve original metric scores, including missing or invalid values, rather than using normalized versions in results.

@yczhang-nv yczhang-nv self-assigned this Oct 21, 2025
@yczhang-nv yczhang-nv requested a review from a team as a code owner October 21, 2025 18:21
@yczhang-nv yczhang-nv added bug Something isn't working non-breaking Non-breaking change labels Oct 21, 2025

@yczhang-nv yczhang-nv changed the base branch from develop to release/1.3 October 21, 2025 18:21
@coderabbitai

coderabbitai bot commented Oct 21, 2025

📥 Commits

Reviewing files that changed from the base of the PR and between 659c504 and 7758b20.

📒 Files selected for processing (1)
  • tests/nat/eval/rag_evaluator/test_rag_evaluate.py (2 hunks)

Walkthrough

The change modifies ragas_to_eval_output to preserve original per-metric scores (including NaN/None) separately from the normalized scores used for averaging. When constructing output items, the function uses the preserved original values instead of the zero-normalized ones, so zero-coerced values no longer mask NaN/None in the output.

Changes

Cohort / File(s) Change Summary
Score preservation logic
src/nat/eval/rag_evaluator/evaluate.py
Modified ragas_to_eval_output to store original per-metric scores in original_scores_dict while maintaining normalized scores_dict for averaging. Updated EvalOutputItem construction to use preserved original scores for the first metric instead of zero-coerced values.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
  • Linked Issues Check — ❓ Inconclusive. Explanation: The PR addresses key coding-related requirements from issue #584 by fixing the metrics/aggregation code (ragas_to_eval_output) to preserve original NaN/None scores in output while still using normalized 0.0 values for average computation, and by treating NaN as a distinct state rather than numeric zero. However, the changeset is missing unit and integration tests that validate the NaN handling behavior as specified in the linked issue. While the PR description indicates tests will be added before final review, the current implementation does not include these verification mechanisms, leaving the compliance assessment incomplete. Resolution: The PR should include unit and integration tests that verify NaN scores are properly preserved in output and not converted to 0.0, similar to the minimal repro test mentioned in issue #584. Additionally, consider adding tests that simulate judge failures to ensure the error handling paths produce NaN/skipped entries correctly. Review the test requirements stated in the linked issue and ensure they are addressed before merging.

✅ Passed checks (4 passed)
  • Description Check — ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: The pull request title "Keep original NaN or null scores from LLM judge in eval output" is 67 characters (within the 72-character limit), uses clear imperative mood with the verb "Keep," and is concise and descriptive. The title directly corresponds to the main change in the changeset, which modifies the ragas_to_eval_output function to preserve original NaN/None scores in output rather than converting them to 0.0. The title accurately summarizes the primary objective of the pull request.
  • Out of Scope Changes Check — ✅ Passed: The changes are narrowly focused on the ragas_to_eval_output function in src/nat/eval/rag_evaluator/evaluate.py, specifically modifying how NaN/None scores are handled during output construction. No alterations to exported or public entities are present, and all modifications are directly related to fixing the bug described in issue #584 where NaN scores were incorrectly converted to 0.0 in evaluation output.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.


bbednarski9 and others added 7 commits October 21, 2025 11:22
Further improvements to flow and documentation of getting started notebooks 1, 2, 3, and 4 per VDR feedback. What was originally notebook 3 has been split into what is now notebooks 3 and 4. The biggest changes include: addressing specific VDR feedback around clarity of each notebook, adding a table of contents, making cells default collapsed in google colab.
Closes


## Summary by CodeRabbit

* **Documentation**
  * Expanded and reorganized getting started notebook with improved navigation and structured sections
  * Added four new comprehensive tutorials covering agent integration, tool development, multi-agent orchestration, and observability
  * Updated README files with clearer notebook descriptions and enhanced learning pathways

Authors:
  - Bryan Bednarski (https://github.com/bbednarski9)

Approvers:
  - Will Killian (https://github.com/willkill07)

URL: NVIDIA#1039
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
Address feedback for improving the top-level function groups documentation.

Closes AIQ-{2136,2137,2138,12139,2140}


## Summary by CodeRabbit

* **Documentation**
  * Major rewrite and expansion of Function Groups docs: new overview, rationale, key concepts (namespacing, access patterns), end-to-end workflows, programmatic usage, advanced filtering, resource-sharing patterns, best practices, testing, and troubleshooting; branding updated to "NVIDIA NeMo Agent toolkit".
* **Examples**
  * Sample function-group config field renamed from "expose" to "include" (no behavioral change).
* **Chores**
  * CI vocabulary regex broadened and path-check allowlist extended to accept additional terms/tokens.

Authors:
  - Will Killian (https://github.com/willkill07)

Approvers:
  - https://github.com/lvojtku
  - Michael Demoret (https://github.com/mdemoret-nv)

URL: NVIDIA#1025
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
Closes


## Summary by CodeRabbit

* **Chores**
  * Enhanced CI/CD validation by expanding pre-commit checks to run on all files in the codebase, strengthening overall code quality standards.

Authors:
  - Will Killian (https://github.com/willkill07)

Approvers:
  - Zhongxuan (Daniel) Wang (https://github.com/zhongxuanwang-nv)

URL: NVIDIA#1048
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
* Add Langfuse and services required by Langfuse (clickhouse & postgres) to E2E services.
* Fix scoping of `mysql_server` and `minio_server` fixtures
* Improve `minio_server` fixture to run `client.list_buckets()` rather than catching exceptions regarding a missing bucket.


## Summary by CodeRabbit

* **New Features**
  * Added Langfuse observability integration support alongside ClickHouse and PostgreSQL services.

* **Tests**
  * Added an integration test validating the Langfuse end-to-end workflow.
  * Extended test infrastructure with new service fixtures, health checks, and longer-lived test services.

* **Chores**
  * Updated CI configuration to include new services and environment variables.
  * Improved test-running docs with automated password generation and expanded environment setup.

Authors:
  - David Gardner (https://github.com/dagardner-nv)

Approvers:
  - Will Killian (https://github.com/willkill07)

URL: NVIDIA#1047
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
Closes AIQ-2148


## Summary by CodeRabbit

* **Documentation**
  * Added an "Open in Colab" badge to the README top badge area for quick Colab access.
  * Added Colab badge links in the examples docs near the Notebooks and Building an Agentic System sections to simplify notebook launching.
  * Documentation-only update; no functional or behavioral changes.

Authors:
  - Will Killian (https://github.com/willkill07)

Approvers:
  - Michael Demoret (https://github.com/mdemoret-nv)

URL: NVIDIA#1051
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
Improve optimizer feature and example documentation. Minor quality of life update to how parameters are named in output files.


## Summary by CodeRabbit

* **Documentation**
  * Major expansion of the Parameter Optimization guide: concepts, configuration sections, GA terminology, workflows, step‑by‑step examples, best practices, multi‑objective strategies, outputs interpretation, CLI usage, and visualization guidance.

* **Improvements**
  * Visualizations handle both legacy and metric-named result formats.
  * Exported optimization CSVs use metric-aware column names while preserving post‑export normalization and reporting artifacts.

* **Examples**
  * Example configs updated: prompt‑optimization enabled with init/recombination hooks, evaluation metrics standardized, numeric trial and GA defaults tuned, and README run/visualization guidance expanded.

Authors:
  - Dhruv Nandakumar (https://github.com/dnandakumar-nv)
  - Michael Demoret (https://github.com/mdemoret-nv)

Approvers:
  - Will Killian (https://github.com/willkill07)
  - Michael Demoret (https://github.com/mdemoret-nv)

URL: NVIDIA#1026
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
@yczhang-nv yczhang-nv force-pushed the yuchen-fix-nan-score branch from 9d7471e to f4c3389 on October 21, 2025 18:24
…an-score

Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/nat/eval/rag_evaluator/evaluate.py (1)

124-124: Consider iterator-based access for minor efficiency gain.

The current slice creates an intermediate list. Using next(iter(scores_dict.keys())) is slightly more efficient and idiomatic.

Apply this diff:

-        first_metric_name = list(scores_dict.keys())[0] if scores_dict else None
+        first_metric_name = next(iter(scores_dict.keys())) if scores_dict else None
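
A quick illustration of the suggested idiom, which fetches only the first key instead of materializing the whole key list:

```python
scores_dict = {"accuracy": 0.9, "relevance": 0.7}

# next(iter(...)) pulls the first key lazily; since dicts preserve insertion
# order, this is equivalent to list(scores_dict.keys())[0] without the copy.
first_metric_name = next(iter(scores_dict)) if scores_dict else None
```

The trailing `if scores_dict else None` guard matters: `next()` on an exhausted iterator raises `StopIteration` unless a default is supplied, so the empty-dictionary case must be handled explicitly.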
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5173df6 and 659c504.

📒 Files selected for processing (1)
  • src/nat/eval/rag_evaluator/evaluate.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{py,yaml,yml}

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

**/*.{py,yaml,yml}: Configure response_seq as a list of strings; values cycle per call, and [] yields an empty string.
Configure delay_ms to inject per-call artificial latency in milliseconds for nat_test_llm.

Files:

  • src/nat/eval/rag_evaluator/evaluate.py
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

**/*.py: Programmatic use: create TestLLMConfig(response_seq=[...], delay_ms=...), add with builder.add_llm("", cfg).
When retrieving the test LLM wrapper, use builder.get_llm(name, wrapper_type=LLMFrameworkEnum.) and call the framework’s method (e.g., ainvoke, achat, call).

**/*.py: In code comments/identifiers use NAT abbreviations as specified: nat for API namespace/CLI, nvidia-nat for package name, NAT for env var prefixes; do not use these abbreviations in documentation
Follow PEP 20 and PEP 8; run yapf with column_limit=120; use 4-space indentation; end files with a single trailing newline
Run ruff check --fix as linter (not formatter) using pyproject.toml config; fix warnings unless explicitly ignored
Respect naming: snake_case for functions/variables, PascalCase for classes, UPPER_CASE for constants
Treat pyright warnings as errors during development
Exception handling: use bare raise to re-raise; log with logger.error() when re-raising to avoid duplicate stack traces; use logger.exception() when catching without re-raising
Provide Google-style docstrings for every public module, class, function, and CLI command; first line concise and ending with a period; surround code entities with backticks
Validate and sanitize all user input, especially in web or CLI interfaces
Prefer httpx with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use async/await for I/O-bound work; profile CPU-heavy paths with cProfile or mprof before optimizing; cache expensive computations with functools.lru_cache or external cache; leverage NumPy vectorized operations when beneficial

Files:

  • src/nat/eval/rag_evaluator/evaluate.py
src/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

All importable Python code must live under src/ (or packages//src/)

Files:

  • src/nat/eval/rag_evaluator/evaluate.py
src/nat/**/*

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Changes in src/nat should prioritize backward compatibility

Files:

  • src/nat/eval/rag_evaluator/evaluate.py

⚙️ CodeRabbit configuration file

This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.

Files:

  • src/nat/eval/rag_evaluator/evaluate.py
{src/**/*.py,packages/*/src/**/*.py}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

All public APIs must have Python 3.11+ type hints on parameters and return values; prefer typing/collections.abc abstractions; use typing.Annotated when useful

Files:

  • src/nat/eval/rag_evaluator/evaluate.py
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards. - For Python code, follow
    PEP 20 and
    PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values.
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.

Documentation Review Instructions
  • Verify that documentation and comments are clear and comprehensive.
  • Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum".
  • Verify that the documentation doesn't contain any offensive or outdated terms.
  • Verify that documentation and comments are free of spelling mistakes; ensure the documentation doesn't contain any words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file. Words that might appear to be spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

Misc.
  • All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0, and should contain an Apache License 2.0 header comment at the top of each file.
  • Confirm that copyright years are up-to-date whenever a file is changed.

Files:

  • src/nat/eval/rag_evaluator/evaluate.py
🧬 Code graph analysis (1)
src/nat/eval/rag_evaluator/evaluate.py (1)
src/nat/eval/evaluator/evaluator_model.py (1)
  • EvalOutputItem (50-53)
🪛 Ruff (0.14.1)
src/nat/eval/rag_evaluator/evaluate.py

124-124: Prefer next(iter(scores_dict.keys())) over single element slice

Replace with next(iter(scores_dict.keys()))

(RUF015)

🔇 Additional comments (2)
src/nat/eval/rag_evaluator/evaluate.py (2)

119-126: Excellent separation of concerns for NaN preservation.

The dual-dictionary approach correctly preserves original NaN/None values in original_scores_dict for output while maintaining normalized 0.0 values in scores_dict for safe average calculation. This aligns perfectly with the PR objective to reflect judge failures accurately in eval output without corrupting aggregated metrics.


143-153: Core fix correctly uses preserved original scores.

Line 147 now references original_scores_dict instead of the normalized scores_dict, ensuring that NaN/None values from the LLM judge are written to the output as-is rather than being converted to 0.0. This directly addresses the misleading zero-scoring issue described in the PR objectives.

Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
@yczhang-nv
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 5778703 into NVIDIA:release/1.3 Oct 21, 2025
17 checks passed
@yczhang-nv yczhang-nv deleted the yuchen-fix-nan-score branch October 21, 2025 19:28

Labels

bug Something isn't working non-breaking Non-breaking change


Development

Successfully merging this pull request may close these issues.

[BUG]: 0 score for accuracy when LLM judge returns NAN

5 participants