Keep original NaN or null scores from LLM judge in eval output #1058
rapids-bot[bot] merged 9 commits into NVIDIA:release/1.3
Commits in this PR:

* Further improvements to the flow and documentation of getting started notebooks 1, 2, 3, and 4 per VDR feedback. What was originally notebook 3 has been split into what is now notebooks 3 and 4; the biggest changes include addressing specific VDR feedback on the clarity of each notebook, adding a table of contents, and making cells collapsed by default in Google Colab. (NVIDIA#1039, authored by Bryan Bednarski, approved by Will Killian)
* Address feedback for improving the top-level function groups documentation: new overview, rationale, key concepts (namespacing, access patterns), end-to-end workflows, programmatic usage, advanced filtering, resource-sharing patterns, best practices, testing, and troubleshooting. The sample function-group config field is renamed from `expose` to `include` (no behavioral change), and the CI vocabulary regex and path-check allowlist are extended. Closes AIQ-{2136,2137,2138,12139,2140}. (NVIDIA#1025, authored by Will Killian, approved by lvojtku and Michael Demoret)
* Expand pre-commit checks to run on all files in the codebase. (NVIDIA#1048, authored by Will Killian, approved by Zhongxuan (Daniel) Wang)
* Add Langfuse and the services it requires (ClickHouse and PostgreSQL) to the E2E services; fix the scoping of the `mysql_server` and `minio_server` fixtures, and improve the `minio_server` fixture to run `client.list_buckets()` rather than catching exceptions about a missing bucket. (NVIDIA#1047, authored by David Gardner, approved by Will Killian)
* Add an "Open in Colab" badge to the README and Colab badge links in the examples docs; documentation-only change. Closes AIQ-2148. (NVIDIA#1051, authored by Will Killian, approved by Michael Demoret)
* Improve optimizer feature and example documentation, plus a minor quality-of-life update to how parameters are named in output files; visualizations now handle both legacy and metric-named result formats, and exported optimization CSVs use metric-aware column names. (NVIDIA#1026, authored by Dhruv Nandakumar and Michael Demoret, approved by Will Killian and Michael Demoret)

All commits signed off by Yuchen Zhang <yuchenz@nvidia.com>.
9d7471e to f4c3389
…an-score Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/nat/eval/rag_evaluator/evaluate.py (1)
124-124: Consider iterator-based access for a minor efficiency gain.

The current slice creates an intermediate list. Using `next(iter(scores_dict.keys()))` is slightly more efficient and idiomatic. Apply this diff:

```diff
- first_metric_name = list(scores_dict.keys())[0] if scores_dict else None
+ first_metric_name = next(iter(scores_dict.keys())) if scores_dict else None
```
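As a quick illustration of the difference (using a hypothetical scores dictionary, not the toolkit's actual data), both forms return the first key, but the iterator form avoids materializing the full key list:

```python
# Hypothetical per-item scores, as an LLM judge might return them.
scores_dict = {"accuracy": 0.9, "groundedness": float("nan")}

# Slice-based access builds a throwaway list of every key first.
first_via_list = list(scores_dict.keys())[0] if scores_dict else None

# Iterator-based access (RUF015's suggestion) stops after the first key.
first_via_iter = next(iter(scores_dict.keys())) if scores_dict else None

print(first_via_list, first_via_iter)  # → accuracy accuracy
```

Both guards fall back to `None` when the dictionary is empty, so behavior is unchanged; only the intermediate allocation goes away.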
🔇 Additional comments (2)
src/nat/eval/rag_evaluator/evaluate.py (2)
119-126: Excellent separation of concerns for NaN preservation.

The dual-dictionary approach correctly preserves original NaN/None values in `original_scores_dict` for output while maintaining normalized 0.0 values in `scores_dict` for safe average calculation. This aligns perfectly with the PR objective to reflect judge failures accurately in eval output without corrupting aggregated metrics.

143-153: Core fix correctly uses preserved original scores.

Line 147 now references `original_scores_dict` instead of the normalized `scores_dict`, ensuring that NaN/None values from the LLM judge are written to the output as-is rather than being converted to 0.0. This directly addresses the misleading zero-scoring issue described in the PR objectives.
Signed-off-by: Yuchen Zhang <yuchenz@nvidia.com>
/merge
Description
When running `nat eval`, the `NaN` or `null` scores returned by the LLM judge are converted to `0.0` to calculate the average score. Those scores are also output as `0.0` in the eval output, which is misleading to users, since `NaN` or `null` indicates an error occurred during eval, while `0.0` means the eval succeeded but the quality of the answer is poor.

This PR keeps the original `NaN` or `null` scores from the LLM judge in the eval output, while still converting them to `0.0` when calculating the average score.

Closes #584
Closes nvbugs-5441058
Closes AIQ-1787
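As a sketch of how preserved values can surface in a JSON eval output (the dictionary and field names here are illustrative, not the toolkit's actual output schema): `json.dumps` would otherwise emit the non-standard literal `NaN`, so mapping NaN to `None` (JSON `null`) before serializing lets users distinguish a judge failure from a genuine `0.0` score:

```python
import json
import math

# Illustrative per-item scores: one judge call failed (NaN), one returned null.
item_scores = {"accuracy": 0.8, "groundedness": float("nan"), "relevance": None}

# Map NaN to None so the output serializes as JSON null; a null score then
# signals "the judge failed here", not "the answer scored 0.0".
serializable = {
    name: None if isinstance(value, float) and math.isnan(value) else value
    for name, value in item_scores.items()
}

print(json.dumps(serializable))
# → {"accuracy": 0.8, "groundedness": null, "relevance": null}
```

The average, by contrast, would still be computed from a copy in which those entries are replaced with `0.0`, as described above.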
By Submitting this PR I confirm:
Summary by CodeRabbit