
Keep original NaN or null scores from LLM judge in eval output #1058

Merged
rapids-bot[bot] merged 9 commits into NVIDIA:release/1.3 from yczhang-nv:yuchen-fix-nan-score
Oct 21, 2025

Conversation

@yczhang-nv
Contributor

@yczhang-nv yczhang-nv commented Oct 21, 2025

Description

When running nat eval, NaN or null scores returned by the LLM judge are converted to 0.0 in order to calculate the average score. Those scores are also written out as 0.0 in the eval output, which is misleading to users: NaN or null indicates that an error occurred during eval, while 0.0 means the eval succeeded but the quality of the answer is poor.

This PR keeps the original NaN or null scores from LLM judge in eval output, while still converting them to 0.0 when calculating the average score.

Closes #584
Closes nvbugs-5441058
Closes AIQ-1787

By Submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • Bug Fixes
    • Fixed evaluation output to preserve original metric scores, including missing or invalid values, rather than using normalized versions in results.

@yczhang-nv yczhang-nv self-assigned this Oct 21, 2025
@yczhang-nv yczhang-nv requested a review from a team as a code owner October 21, 2025 18:21
@yczhang-nv yczhang-nv added bug Something isn't working non-breaking Non-breaking change labels Oct 21, 2025

@yczhang-nv yczhang-nv changed the base branch from develop to release/1.3 October 21, 2025 18:21
@coderabbitai

coderabbitai bot commented Oct 21, 2025

📥 Commits

Reviewing files that changed from the base of the PR and between 659c504 and 7758b20.

📒 Files selected for processing (1)
  • tests/nat/eval/rag_evaluator/test_rag_evaluate.py (2 hunks)

Walkthrough

The change modifies ragas_to_eval_output to preserve original per-metric scores (including NaN/None) separately from the normalized scores used for averaging. When constructing output items, the function uses the preserved original values instead of the zero-normalized ones, so zero-coerced values no longer mask NaN/None in the output.

Changes

Cohort / File(s) Change Summary
Score preservation logic
src/nat/eval/rag_evaluator/evaluate.py
Modified ragas_to_eval_output to store original per-metric scores in original_scores_dict while maintaining normalized scores_dict for averaging. Updated EvalOutputItem construction to use preserved original scores for the first metric instead of zero-coerced values.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
  • Linked Issues Check — ❓ Inconclusive. Explanation: The PR addresses key coding-related requirements from issue #584 by fixing the metrics/aggregation code (ragas_to_eval_output) to preserve original NaN/None scores in output while still using normalized 0.0 values for average computation, and by treating NaN as a distinct state rather than numeric zero. However, the changeset is missing unit and integration tests that validate the NaN handling behavior as specified in the linked issue. While the PR description indicates tests will be added before final review, the current implementation does not include these verification mechanisms, leaving the compliance assessment incomplete. Resolution: The PR should include unit and integration tests that verify NaN scores are properly preserved in output and not converted to 0.0, similar to the minimal repro test mentioned in issue #584. Additionally, consider adding tests that simulate judge failures to ensure the error handling paths produce NaN/skipped entries correctly. Review the test requirements stated in the linked issue and ensure they are addressed before merging.

✅ Passed checks (4 passed)
  • Description Check — ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: The pull request title "Keep original NaN or null scores from LLM judge in eval output" is 67 characters (within the 72-character limit), uses clear imperative mood with the verb "Keep," and is concise and descriptive. The title directly corresponds to the main change in the changeset, which modifies the ragas_to_eval_output function to preserve original NaN/None scores in output rather than converting them to 0.0. The title accurately summarizes the primary objective of the pull request.
  • Out of Scope Changes Check — ✅ Passed: The changes are narrowly focused on the ragas_to_eval_output function in src/nat/eval/rag_evaluator/evaluate.py, specifically modifying how NaN/None scores are handled during output construction. No alterations to exported or public entities are present, and all modifications are directly related to fixing the bug described in issue #584 where NaN scores were incorrectly converted to 0.0 in evaluation output.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.


bbednarski9 and others added 7 commits October 21, 2025 11:22
Further improvements to flow and documentation of getting started notebooks 1, 2, 3, and 4 per VDR feedback. What was originally notebook 3 has been split into what is now notebooks 3 and 4. The biggest changes include: addressing specific VDR feedback around clarity of each notebook, adding a table of contents, making cells default collapsed in google colab.
Closes


## Summary by CodeRabbit

* **Documentation**
  * Expanded and reorganized getting started notebook with improved navigation and structured sections
  * Added four new comprehensive tutorials covering agent integration, tool development, multi-agent orchestration, and observability
  * Updated README files with clearer notebook descriptions and enhanced learning pathways

Authors:
  - Bryan Bednarski (https://github.com/bbednarski9)

Approvers:
  - Will Killian (https://github.com/willkill07)

URL: NVIDIA#1039
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
Address feedback for improving the top-level function groups documentation.

Closes AIQ-{2136,2137,2138,12139,2140}


## Summary by CodeRabbit

* **Documentation**
  * Major rewrite and expansion of Function Groups docs: new overview, rationale, key concepts (namespacing, access patterns), end-to-end workflows, programmatic usage, advanced filtering, resource-sharing patterns, best practices, testing, and troubleshooting; branding updated to "NVIDIA NeMo Agent toolkit".
* **Examples**
  * Sample function-group config field renamed from "expose" to "include" (no behavioral change).
* **Chores**
  * CI vocabulary regex broadened and path-check allowlist extended to accept additional terms/tokens.

Authors:
  - Will Killian (https://github.com/willkill07)

Approvers:
  - https://github.com/lvojtku
  - Michael Demoret (https://github.com/mdemoret-nv)

URL: NVIDIA#1025
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
Closes


## Summary by CodeRabbit

* **Chores**
  * Enhanced CI/CD validation by expanding pre-commit checks to run on all files in the codebase, strengthening overall code quality standards.

Authors:
  - Will Killian (https://github.com/willkill07)

Approvers:
  - Zhongxuan (Daniel) Wang (https://github.com/zhongxuanwang-nv)

URL: NVIDIA#1048
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
* Add Langfuse and services required by Langfuse (clickhouse & postgres) to E2E services.
* Fix scoping of `mysql_server` and `minio_server` fixtures
* Improve `minio_server` fixture to run `client.list_buckets()` rather than catching exceptions regarding a missing bucket.


## Summary by CodeRabbit

* **New Features**
  * Added Langfuse observability integration support alongside ClickHouse and PostgreSQL services.

* **Tests**
  * Added an integration test validating the Langfuse end-to-end workflow.
  * Extended test infrastructure with new service fixtures, health checks, and longer-lived test services.

* **Chores**
  * Updated CI configuration to include new services and environment variables.
  * Improved test-running docs with automated password generation and expanded environment setup.

Authors:
  - David Gardner (https://github.com/dagardner-nv)

Approvers:
  - Will Killian (https://github.com/willkill07)

URL: NVIDIA#1047
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
Closes AIQ-2148


## Summary by CodeRabbit

* **Documentation**
  * Added an "Open in Colab" badge to the README top badge area for quick Colab access.
  * Added Colab badge links in the examples docs near the Notebooks and Building an Agentic System sections to simplify notebook launching.
  * Documentation-only update; no functional or behavioral changes.

Authors:
  - Will Killian (https://github.com/willkill07)

Approvers:
  - Michael Demoret (https://github.com/mdemoret-nv)

URL: NVIDIA#1051
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
Improve optimizer feature and example documentation. Minor quality of life update to how parameters are named in output files.


## Summary by CodeRabbit

* **Documentation**
  * Major expansion of the Parameter Optimization guide: concepts, configuration sections, GA terminology, workflows, step‑by‑step examples, best practices, multi‑objective strategies, outputs interpretation, CLI usage, and visualization guidance.

* **Improvements**
  * Visualizations handle both legacy and metric-named result formats.
  * Exported optimization CSVs use metric-aware column names while preserving post‑export normalization and reporting artifacts.

* **Examples**
  * Example configs updated: prompt‑optimization enabled with init/recombination hooks, evaluation metrics standardized, numeric trial and GA defaults tuned, and README run/visualization guidance expanded.

Authors:
  - Dhruv Nandakumar (https://github.com/dnandakumar-nv)
  - Michael Demoret (https://github.com/mdemoret-nv)

Approvers:
  - Will Killian (https://github.com/willkill07)
  - Michael Demoret (https://github.com/mdemoret-nv)

URL: NVIDIA#1026
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
@yczhang-nv yczhang-nv force-pushed the yuchen-fix-nan-score branch from 9d7471e to f4c3389 on October 21, 2025 18:24
…an-score

Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/nat/eval/rag_evaluator/evaluate.py (1)

124-124: Consider iterator-based access for minor efficiency gain.

The current slice creates an intermediate list. Using next(iter(scores_dict.keys())) is slightly more efficient and idiomatic.

Apply this diff:

-        first_metric_name = list(scores_dict.keys())[0] if scores_dict else None
+        first_metric_name = next(iter(scores_dict.keys())) if scores_dict else None
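
A quick illustration of the suggested idiom, which fetches only the first key instead of materializing the whole key list:

```python
scores_dict = {"accuracy": 0.9, "relevance": 0.7}

# next(iter(...)) pulls the first key lazily; since dicts preserve insertion
# order, this is equivalent to list(scores_dict.keys())[0] without the copy.
first_metric_name = next(iter(scores_dict)) if scores_dict else None
```

The trailing `if scores_dict else None` guard matters: `next()` on an exhausted iterator raises `StopIteration` unless a default is supplied, so the empty-dictionary case must be handled explicitly.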
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5173df6 and 659c504.

📒 Files selected for processing (1)
  • src/nat/eval/rag_evaluator/evaluate.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{py,yaml,yml}

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

**/*.{py,yaml,yml}: Configure response_seq as a list of strings; values cycle per call, and [] yields an empty string.
Configure delay_ms to inject per-call artificial latency in milliseconds for nat_test_llm.

Files:

  • src/nat/eval/rag_evaluator/evaluate.py
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/nat-test-llm.mdc)

**/*.py: Programmatic use: create TestLLMConfig(response_seq=[...], delay_ms=...), add with builder.add_llm("", cfg).
When retrieving the test LLM wrapper, use builder.get_llm(name, wrapper_type=LLMFrameworkEnum.) and call the framework’s method (e.g., ainvoke, achat, call).

**/*.py: In code comments/identifiers use NAT abbreviations as specified: nat for API namespace/CLI, nvidia-nat for package name, NAT for env var prefixes; do not use these abbreviations in documentation
Follow PEP 20 and PEP 8; run yapf with column_limit=120; use 4-space indentation; end files with a single trailing newline
Run ruff check --fix as linter (not formatter) using pyproject.toml config; fix warnings unless explicitly ignored
Respect naming: snake_case for functions/variables, PascalCase for classes, UPPER_CASE for constants
Treat pyright warnings as errors during development
Exception handling: use bare raise to re-raise; log with logger.error() when re-raising to avoid duplicate stack traces; use logger.exception() when catching without re-raising
Provide Google-style docstrings for every public module, class, function, and CLI command; first line concise and ending with a period; surround code entities with backticks
Validate and sanitize all user input, especially in web or CLI interfaces
Prefer httpx with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use async/await for I/O-bound work; profile CPU-heavy paths with cProfile or mprof before optimizing; cache expensive computations with functools.lru_cache or external cache; leverage NumPy vectorized operations when beneficial

Files:

  • src/nat/eval/rag_evaluator/evaluate.py
src/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

All importable Python code must live under src/ (or packages//src/)

Files:

  • src/nat/eval/rag_evaluator/evaluate.py
src/nat/**/*

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Changes in src/nat should prioritize backward compatibility

Files:

  • src/nat/eval/rag_evaluator/evaluate.py

⚙️ CodeRabbit configuration file

This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.

Files:

  • src/nat/eval/rag_evaluator/evaluate.py
{src/**/*.py,packages/*/src/**/*.py}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

All public APIs must have Python 3.11+ type hints on parameters and return values; prefer typing/collections.abc abstractions; use typing.Annotated when useful

Files:

  • src/nat/eval/rag_evaluator/evaluate.py
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards. - For Python code, follow
    PEP 20 and
    PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values.
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.

Documentation Review Instructions
  • Verify that documentation and comments are clear and comprehensive.
  • Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum".
  • Verify that the documentation doesn't contain any offensive or outdated terms.
  • Verify that documentation and comments are free of spelling mistakes; ensure the documentation doesn't contain any words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file. Words that might appear to be spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

Misc.
  • All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0, and should contain an Apache License 2.0 header comment at the top of each file.
  • Confirm that copyright years are up-to-date whenever a file is changed.

Files:

  • src/nat/eval/rag_evaluator/evaluate.py
🧬 Code graph analysis (1)
src/nat/eval/rag_evaluator/evaluate.py (1)
src/nat/eval/evaluator/evaluator_model.py (1)
  • EvalOutputItem (50-53)
🪛 Ruff (0.14.1)
src/nat/eval/rag_evaluator/evaluate.py

124-124: Prefer next(iter(scores_dict.keys())) over single element slice

Replace with next(iter(scores_dict.keys()))

(RUF015)

🔇 Additional comments (2)
src/nat/eval/rag_evaluator/evaluate.py (2)

119-126: Excellent separation of concerns for NaN preservation.

The dual-dictionary approach correctly preserves original NaN/None values in original_scores_dict for output while maintaining normalized 0.0 values in scores_dict for safe average calculation. This aligns perfectly with the PR objective to reflect judge failures accurately in eval output without corrupting aggregated metrics.


143-153: Core fix correctly uses preserved original scores.

Line 147 now references original_scores_dict instead of the normalized scores_dict, ensuring that NaN/None values from the LLM judge are written to the output as-is rather than being converted to 0.0. This directly addresses the misleading zero-scoring issue described in the PR objectives.

Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
@yczhang-nv
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 5778703 into NVIDIA:release/1.3 Oct 21, 2025
17 checks passed
@yczhang-nv yczhang-nv deleted the yuchen-fix-nan-score branch October 21, 2025 19:28

Labels

bug Something isn't working non-breaking Non-breaking change


Development

Successfully merging this pull request may close these issues.

[BUG]: 0 score for accuracy when LLM judge returns NAN

5 participants