Add evaluator reasoning to Weave score logs by thepatrickchin · Pull Request #1141 · NVIDIA/NeMo-Agent-Toolkit

thepatrickchin · 2025-11-03T09:53:59Z

Description

This PR adds evaluator reasoning to the Weave logs to supplement each item's score.

By Submitting this PR I confirm:

- I am familiar with the Contributing Guidelines.
- We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
  - Any contribution which contains commits that are not Signed-Off will not be accepted.
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

New Features
- Score logging now includes optional reasoning information alongside numeric scores, enriching evaluation data with additional context.

Signed-off-by: Patrick Chin <8509935+thepatrickchin@users.noreply.github.com>

coderabbitai · 2025-11-03T09:54:11Z

Walkthrough

The `alog_score` function now constructs a `score_value` dictionary containing the numeric score and optional reasoning, and passes this dictionary to the underlying `pred_logger.alog_score` call instead of a raw scalar value. The coroutine gathering logic remains unchanged.

Changes

| Cohort / File(s) | Change Summary |
| --- | --- |
| Payload Structure Enhancement `src/nat/eval/utils/weave_eval.py` | Modified `alog_score` to wrap numeric score and optional reasoning into a dictionary structure before passing to `pred_logger.alog_score`. Augments observable data while maintaining existing control flow. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

- Verify the dictionary structure matches expected schema in `pred_logger.alog_score`
- Confirm optional reasoning field handling when not present
- Test payload serialization with and without reasoning data
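The last two checks could be exercised with something like the sketch below. It assumes the payload is a plain dictionary with a `"score"` key and an optional `"reasoning"` key, and it uses a JSON round trip as a stand-in for whatever serialization the logging backend actually performs.

```python
import json

# Hypothetical payloads: one with reasoning, one without.
with_reasoning = {"score": 0.8, "reasoning": "Answer matches reference."}
without_reasoning = {"score": 0.8}

restored_payloads = []
for payload in (with_reasoning, without_reasoning):
    # Round-trip through JSON to confirm the payload survives serialization unchanged.
    restored = json.loads(json.dumps(payload))
    restored_payloads.append(restored)

# Readers should use .get() so a missing "reasoning" key does not raise KeyError.
reasoning_values = [p.get("reasoning") for p in restored_payloads]
```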

Pre-merge checks and finishing touches

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title 'Add evaluator reasoning to Weave score logs' is concise (43 characters), descriptive, uses imperative mood with the verb 'Add', and clearly summarizes the main change. The title accurately reflects the core objective of the PR, which is to augment Weave score logs with evaluator reasoning information. The title is directly related to the changeset that modifies alog_score to construct a score_value dictionary containing reasoning data. |

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between abf527b and 57361bc.

📒 Files selected for processing (1)

src/nat/eval/utils/weave_eval.py (1 hunks)

🧰 Additional context used

📓 Path-based instructions (5)

**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.py: In code comments use the abbreviations: nat (API namespace/CLI), nvidia-nat (package), NAT (env var prefixes); never use these abbreviations in documentation
Follow PEP 20 and PEP 8 for Python style
Run yapf with column_limit=120; yapf is used for formatting (run second)
Indent with 4 spaces (no tabs) and end each file with a single trailing newline
Use ruff (ruff check --fix) as a linter (not formatter) per pyproject.toml; fix warnings unless explicitly ignored
Respect Python naming schemes: snake_case for functions/variables, PascalCase for classes, UPPER_CASE for constants
When re-raising exceptions, use bare raise to preserve stack trace; log with logger.error(), not logger.exception()
When catching and logging without re-raising, use logger.exception() to capture full stack trace
Provide Google-style docstrings for every public module, class, function, and CLI command
Docstring first line must be a concise description ending with a period
Validate and sanitize all user input, especially in web or CLI interfaces
Prefer httpx with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use async/await for I/O-bound work (HTTP, DB, file I/O)
Cache expensive computations with functools.lru_cache or an external cache when appropriate
Leverage NumPy vectorized operations when beneficial and feasible

Files:

src/nat/eval/utils/weave_eval.py

{src/**/*.py,packages/*/src/**/*.py}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

{src/**/*.py,packages/*/src/**/*.py}: All importable Python code must live under src/ or packages/*/src/
All public APIs must have Python 3.11+ type hints on parameters and return values
Prefer typing/collections.abc abstractions (e.g., Sequence over list)
Use typing.Annotated for units or metadata when useful
Treat pyright warnings as errors during development

Files:

src/nat/eval/utils/weave_eval.py

src/nat/**/*

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

src/nat/**/* contains core functionality; changes should prioritize backward compatibility

Files:

src/nat/eval/utils/weave_eval.py

⚙️ CodeRabbit configuration file

This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.

Files:

src/nat/eval/utils/weave_eval.py

{**/*.py,**/*.sh,**/*.md,**/*.toml,**/*.y?(a)ml,**/*.json,**/*.txt,**/*.ini,**/*.cfg,**/*.ipynb}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

{**/*.py,**/*.sh,**/*.md,**/*.toml,**/*.y?(a)ml,**/*.json,**/*.txt,**/*.ini,**/*.cfg,**/*.ipynb}: Every file must start with the standard SPDX Apache-2.0 header
Confirm copyright years are up to date when a file is changed
All source files must include the SPDX Apache-2.0 header template (copy from an existing file)

Files:

src/nat/eval/utils/weave_eval.py

**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions
- Ensure the code follows best practices and coding standards.
- For Python code, follow PEP 20 and PEP 8 for style guidelines.
- Check for security vulnerabilities and potential issues.
- Python methods should use type hints for all parameters and return values. Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
- For Python exception handling, ensure proper stack trace preservation:
  - When re-raising exceptions: use bare raise statements to maintain the original stack trace, and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
  - When catching and logging exceptions without re-raising: always use logger.exception() to capture the full stack trace information.

Documentation Review Instructions
- Verify that documentation and comments are clear and comprehensive.
- Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum".
- Verify that the documentation doesn't contain any offensive or outdated terms.
- Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

Misc.
- All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0, and should contain an Apache License 2.0 header comment at the top of each file.
- Confirm that copyright years are up-to date whenever a file is changed.

Files:

src/nat/eval/utils/weave_eval.py
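The exception-handling rule in the review instructions above distinguishes two cases. A minimal sketch of both, using hypothetical function names unrelated to the PR's code, might look like this:

```python
import logging

logger = logging.getLogger(__name__)


def parse_score(raw: str) -> float:
    """Convert text to a float, re-raising on failure."""
    try:
        return float(raw)
    except ValueError:
        # Re-raising: logger.error() avoids printing the stack trace twice,
        # and the bare raise keeps the original traceback intact.
        logger.error("Could not parse score %r", raw)
        raise


def parse_score_or_default(raw: str, default: float = 0.0) -> float:
    """Convert text to a float, falling back to a default on failure."""
    try:
        return float(raw)
    except ValueError:
        # Not re-raising: logger.exception() records the full stack trace
        # because nothing upstream will ever see this exception.
        logger.exception("Could not parse score %r; using default", raw)
        return default
```

In the first function the caller is expected to handle or report the exception, so a second full trace in the log would be noise; in the second the log entry is the only record of the failure.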

🔇 Additional comments (1)

src/nat/eval/utils/weave_eval.py (1)

139-150: No further action required; Weave API supports dict payloads.

The Weave API explicitly supports passing dicts into the log_score (and corresponding alog_score) methods. The implementation correctly constructs a dictionary payload with "score" and optional "reasoning" keys, which aligns with Weave's design. The code change is compatible with the Weave API and will render correctly in the UI.


willkill07 · 2025-11-03T17:46:39Z

/merge

Add evaluator reasoning to Weave score logs

57361bc

Signed-off-by: Patrick Chin <8509935+thepatrickchin@users.noreply.github.com>

thepatrickchin requested a review from a team as a code owner November 3, 2025 09:54

willkill07 self-assigned this Nov 3, 2025

willkill07 added the improvement (Improvement to existing functionality) and non-breaking (Non-breaking change) labels Nov 3, 2025

willkill07 approved these changes Nov 3, 2025

rapids-bot bot merged commit 1904bb4 into NVIDIA:develop Nov 3, 2025
16 of 17 checks passed

thepatrickchin deleted the weave-reasoning-output branch November 4, 2025 04:09
