Add evaluator reasoning to Weave score logs #1141

Merged
rapids-bot[bot] merged 1 commit into NVIDIA:develop from thepatrickchin:weave-reasoning-output
Nov 3, 2025

Conversation

thepatrickchin (Member) commented Nov 3, 2025

Description

This PR adds the evaluator's reasoning to Weave score logs, supplementing each item's numeric score.

By submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • New Features
    • Score logging now includes optional reasoning information alongside numeric scores, enriching evaluation data with additional context.

Signed-off-by: Patrick Chin <8509935+thepatrickchin@users.noreply.github.com>
@thepatrickchin thepatrickchin requested a review from a team as a code owner November 3, 2025 09:54
coderabbitai bot commented Nov 3, 2025

Walkthrough

Modified alog_score function to construct a score_value dictionary containing the numeric score and optional reasoning, then pass this dictionary to the underlying pred_logger.alog_score call instead of passing a raw scalar value. Coroutine gathering logic remains unchanged.
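The change described in this walkthrough can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `FakePredLogger`, `build_score_value`, and `log_item_scores` are hypothetical names, the stub only imitates the shape of a `pred_logger.alog_score` call, and omitting the "reasoning" key when no reasoning is available is an assumption.

```python
import asyncio
from collections.abc import Sequence
from typing import Any


class FakePredLogger:
    """Hypothetical stand-in for the Weave prediction logger; it only records calls."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, Any]] = []

    async def alog_score(self, scorer: str, score: Any) -> None:
        self.calls.append((scorer, score))


def build_score_value(score: float, reasoning: Any = None) -> dict[str, Any]:
    """Wrap a numeric score and optional reasoning in one dictionary."""
    score_value: dict[str, Any] = {"score": score}
    # Assumption: the "reasoning" key is left out when no reasoning exists.
    if reasoning is not None:
        score_value["reasoning"] = reasoning
    return score_value


async def log_item_scores(pred_logger: FakePredLogger, scores: Sequence[tuple[str, float, Any]]) -> None:
    """Log every (scorer name, score, reasoning) triple, gathering the coroutines."""
    coros = [
        pred_logger.alog_score(scorer=name, score=build_score_value(value, reasoning))
        for name, value, reasoning in scores
    ]
    await asyncio.gather(*coros)
```

Passing a dictionary instead of a raw scalar keeps the numeric score available under a fixed key while letting the reasoning travel with it.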

Changes

| Cohort / File(s) | Change Summary |
| --- | --- |
| Payload Structure Enhancement: src/nat/eval/utils/weave_eval.py | Modified alog_score to wrap numeric score and optional reasoning into a dictionary structure before passing to pred_logger.alog_score. Augments observable data while maintaining existing control flow. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Verify the dictionary structure matches expected schema in pred_logger.alog_score
  • Confirm optional reasoning field handling when not present
  • Test payload serialization with and without reasoning data

Pre-merge checks and finishing touches

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title 'Add evaluator reasoning to Weave score logs' is concise (43 characters), descriptive, uses imperative mood with the verb 'Add', and clearly summarizes the main change. The title accurately reflects the core objective of the PR, which is to augment Weave score logs with evaluator reasoning information. The title is directly related to the changeset that modifies alog_score to construct a score_value dictionary containing reasoning data. |

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between abf527b and 57361bc.

📒 Files selected for processing (1)
  • src/nat/eval/utils/weave_eval.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.py: In code comments use the abbreviations: nat (API namespace/CLI), nvidia-nat (package), NAT (env var prefixes); never use these abbreviations in documentation
Follow PEP 20 and PEP 8 for Python style
Run yapf with column_limit=120; yapf is used for formatting (run second)
Indent with 4 spaces (no tabs) and end each file with a single trailing newline
Use ruff (ruff check --fix) as a linter (not formatter) per pyproject.toml; fix warnings unless explicitly ignored
Respect Python naming schemes: snake_case for functions/variables, PascalCase for classes, UPPER_CASE for constants
When re-raising exceptions, use bare raise to preserve stack trace; log with logger.error(), not logger.exception()
When catching and logging without re-raising, use logger.exception() to capture full stack trace
Provide Google-style docstrings for every public module, class, function, and CLI command
Docstring first line must be a concise description ending with a period
Validate and sanitize all user input, especially in web or CLI interfaces
Prefer httpx with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use async/await for I/O-bound work (HTTP, DB, file I/O)
Cache expensive computations with functools.lru_cache or an external cache when appropriate
Leverage NumPy vectorized operations when beneficial and feasible

Files:

  • src/nat/eval/utils/weave_eval.py
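The two exception-logging rules in the list above can be illustrated with a short sketch. The functions `parse_score` and `parse_score_or_default` are hypothetical and are not part of the file under review; they only show where `logger.error()` with a bare `raise` applies and where `logger.exception()` applies.

```python
import logging

logger = logging.getLogger(__name__)


def parse_score(raw: str) -> float:
    """Re-raise path: log with logger.error() and use a bare raise."""
    try:
        return float(raw)
    except ValueError:
        # The caller sees the original traceback, so no stack trace is logged here.
        logger.error("Could not parse score %r", raw)
        raise


def parse_score_or_default(raw: str, default: float = 0.0) -> float:
    """Swallow path: logger.exception() records the full stack trace."""
    try:
        return float(raw)
    except ValueError:
        # Nothing is re-raised, so this log entry is the only record of the failure.
        logger.exception("Could not parse score %r; using default", raw)
        return default
```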
{src/**/*.py,packages/*/src/**/*.py}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

{src/**/*.py,packages/*/src/**/*.py}: All importable Python code must live under src/ or packages//src/
All public APIs must have Python 3.11+ type hints on parameters and return values
Prefer typing/collections.abc abstractions (e.g., Sequence over list)
Use typing.Annotated for units or metadata when useful
Treat pyright warnings as errors during development

Files:

  • src/nat/eval/utils/weave_eval.py
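As a small illustration of the type-hint rules above, the sketch below uses `collections.abc.Sequence` instead of `list` and `typing.Annotated` to attach a unit. The alias `Seconds` and the function `mean_latency` are hypothetical names chosen for the example.

```python
from collections.abc import Sequence
from typing import Annotated

# Annotated attaches metadata (here, a unit) without changing the runtime type.
Seconds = Annotated[float, "seconds"]


def mean_latency(latencies: Sequence[Seconds]) -> Seconds:
    """Return the arithmetic mean of the given latencies."""
    if not latencies:
        raise ValueError("latencies must not be empty")
    return sum(latencies) / len(latencies)
```

Accepting a `Sequence` lets callers pass a list or a tuple without conversion.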
src/nat/**/*

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

src/nat/**/* contains core functionality; changes should prioritize backward compatibility

Files:

  • src/nat/eval/utils/weave_eval.py

⚙️ CodeRabbit configuration file

This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.

Files:

  • src/nat/eval/utils/weave_eval.py
{**/*.py,**/*.sh,**/*.md,**/*.toml,**/*.y?(a)ml,**/*.json,**/*.txt,**/*.ini,**/*.cfg,**/*.ipynb}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

{**/*.py,**/*.sh,**/*.md,**/*.toml,**/*.y?(a)ml,**/*.json,**/*.txt,**/*.ini,**/*.cfg,**/*.ipynb}: Every file must start with the standard SPDX Apache-2.0 header
Confirm copyright years are up to date when a file is changed
All source files must include the SPDX Apache-2.0 header template (copy from an existing file)

Files:

  • src/nat/eval/utils/weave_eval.py
**/*

⚙️ CodeRabbit configuration file

**/*: Code Review Instructions

  • Ensure the code follows best practices and coding standards.
  • For Python code, follow PEP 20 and PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues.
  • Python methods should use type hints for all parameters and return values.
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.

Documentation Review Instructions

  • Verify that documentation and comments are clear and comprehensive.
  • Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum".
  • Verify that the documentation doesn't contain any offensive or outdated terms.
  • Verify that documentation and comments are free of spelling mistakes. Ensure the documentation doesn't contain any words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file; words that might appear to be spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

Misc.

  • All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0 and should contain an Apache License 2.0 header comment at the top of each file.
  • Confirm that copyright years are up to date whenever a file is changed.

Files:

  • src/nat/eval/utils/weave_eval.py
🔇 Additional comments (1)
src/nat/eval/utils/weave_eval.py (1)

139-150: No further action required; Weave API supports dict payloads.

The Weave API explicitly supports passing dicts into the log_score (and corresponding alog_score) methods. The implementation correctly constructs a dictionary payload with "score" and optional "reasoning" keys, which aligns with Weave's design. The code change is compatible with the Weave API and will render correctly in the UI.



willkill07 self-assigned this Nov 3, 2025
willkill07 added the improvement (Improvement to existing functionality) and non-breaking (Non-breaking change) labels Nov 3, 2025

willkill07 (Member) commented:

/merge

@rapids-bot rapids-bot bot merged commit 1904bb4 into NVIDIA:develop Nov 3, 2025
16 of 17 checks passed
@thepatrickchin thepatrickchin deleted the weave-reasoning-output branch November 4, 2025 04:09