
Add support for LangSmith evaluators #1592

Merged
rapids-bot[bot] merged 20 commits into NVIDIA:develop from mpenn:mpenn_lc-evaluators
Feb 19, 2026

Add support for LangSmith evaluators#1592
rapids-bot[bot] merged 20 commits intoNVIDIA:developfrom
mpenn:mpenn_lc-evaluators

Conversation

@mpenn
Contributor

@mpenn mpenn commented Feb 11, 2026

Description

Closes #1585

Adds two new evaluator plugins to the LangChain integration package that allow users to leverage existing LangSmith SDK and openevals evaluators within NAT's evaluation harness:

  • langsmith -- Import any LangSmith-compatible evaluator by Python dotted path. Auto-detects the calling convention (RunEvaluator, (run, example), or (inputs, outputs, reference_outputs)) and adapts it to NAT's BaseEvaluator interface.
  • langsmith_judge -- LLM-as-judge evaluator powered by openevals create_llm_as_judge. Supports prebuilt prompts (e.g., correctness, hallucination), custom prompt templates, continuous/discrete/boolean scoring, few-shot examples, custom output schemas, and retry configuration.
  • Both plugins support an extra_fields mapping to forward additional dataset fields to evaluators, and sync evaluators are offloaded to an executor to avoid blocking the event loop.
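The executor offloading mentioned in the last bullet (running synchronous evaluators without blocking the event loop) can be sketched as follows. This is an illustrative pattern using `asyncio` and `contextvars`, not the PR's actual adapter code; the function and evaluator names are hypothetical.

```python
import asyncio
import contextvars
import functools


async def run_sync_evaluator(evaluator, *args, **kwargs):
    """Run a synchronous evaluator in the default thread-pool executor
    so it does not block the event loop, preserving contextvars."""
    loop = asyncio.get_running_loop()
    ctx = contextvars.copy_context()
    call = functools.partial(ctx.run, evaluator, *args, **kwargs)
    return await loop.run_in_executor(None, call)


def blocking_evaluator(score):
    # Stand-in for a sync LangSmith-style evaluator.
    return {"key": "score", "score": score}


result = asyncio.run(run_sync_evaluator(blocking_evaluator, 0.9))
```

Copying the context before dispatching to the executor keeps any contextvar-based state (e.g. tracing flags) visible inside the worker thread.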

By submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • New Features

    • LangSmith evaluator integration: dynamic evaluator loading, multi-convention adapter, and optional extra-field mappings.
    • LangSmith judge evaluator: LLM-as-judge with configurable prompts, scoring modes, retries, and output-schema support.
    • Custom evaluator support: import-by-path evaluators with automatic convention detection.
    • Utilities to translate evaluation inputs/results between systems.
  • Tests

    • Extensive unit tests covering registration, adapters, input mappings, result conversion, judge behavior, and extra-fields.
  • Chores

    • Added openevals dependency and registration/packaging housekeeping.

This includes langsmith and openevals, allowing developers to use
out-of-the-box langsmith/openevals evaluators and/or evaluators they
have already written in these frameworks.

Signed-off-by: Matthew Penn <mpenn@nvidia.com>
This reduces the mental acrobatics developers need to do with mutually
exclusive fields

Signed-off-by: Matthew Penn <mpenn@nvidia.com>
…a fields and improved error handling

- Updated `LangSmithEvaluatorAdapter` to pass additional context fields to evaluators.
- Enhanced `LangSmithJudgeConfig` with new fields for system messages and few-shot examples.
- Introduced validation for overlapping keys in `judge_kwargs` and typed fields.
- Added utility functions for importing attributes and extracting nested fields from results.
- Updated tests to cover new configurations and ensure proper handling of extra fields.

Signed-off-by: Matthew Penn <mpenn@nvidia.com>
@coderabbitai

coderabbitai bot commented Feb 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds LangSmith/openevals evaluator support: new langsmith, langsmith_judge, and langsmith_custom evaluator registrations, adapter, utilities, pyproject dependency pin, and extensive unit tests for importing, convention detection, data mapping, and result conversion.

Changes

Cohort / File(s) / Summary

  • Example config (examples/evaluation_and_profiling/simple_calculator_eval/src/nat_simple_calculator_eval/configs/config-custom-dataset-format.yml): trivial single-line replacement; no functional change.
  • Packaging (packages/nvidia_nat_langchain/pyproject.toml): add dynamic dependency constraint openevals>=0.1.3,<1.0.0.
  • Top-level plugin registration (packages/nvidia_nat_langchain/src/nat/plugins/langchain/register.py, packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/register.py): import the eval registration module to ensure evaluator registrations run on import.
  • Eval package init (packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/__init__.py): new package initializer containing the license header only (no runtime logic).
  • LangSmith evaluator core (packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/langsmith_evaluator.py): registers langsmith evaluators, a registry of openevals evaluators, LangSmithEvaluatorConfig, a resolver, and an async register function yielding EvaluatorInfo.
  • Custom evaluator registration (packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/langsmith_custom_evaluator.py): dotted-path importer, evaluator instantiation, calling-convention detection, LangSmithCustomEvaluatorConfig, and register_langsmith_custom_evaluator.
  • LangSmith judge (packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/langsmith_judge.py): LangSmithJudgeConfig and register function: prompt resolution (builtin/custom), typed fields, judge_kwargs validation, optional auto-retry, and adapter wrapping.
  • Adapter implementation (packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/langsmith_evaluator_adapter.py): LangSmithEvaluatorAdapter supporting three calling conventions, an async/sync invocation adapter preserving contextvars, tracing disabled during calls, and result normalization to NAT EvalOutputItem.
  • Utilities (packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/utils.py): dotted-path importer, mapping EvalInputItem to openevals kwargs, Run/Example constructors, dot-path field extraction, and comprehensive result-to-EvalOutputItem conversion with robust error handling.
  • Tests: infra & fixtures (packages/nvidia_nat_langchain/tests/eval/__init__.py, packages/nvidia_nat_langchain/tests/eval/conftest.py): add test package init and fixtures/helpers (mock builder, register helper, sample EvalInput fixtures including context cases).
  • Tests: langsmith evaluator (packages/nvidia_nat_langchain/tests/eval/test_langsmith_evaluator.py): extensive tests for config validation, registry registration, adapter behavior across conventions, extra_fields, error wrapping, multi-item/empty inputs, and metadata.
  • Tests: langsmith judge (packages/nvidia_nat_langchain/tests/eval/test_langsmith_judge.py): extensive tests for judge config validation, prompt resolution, typed-field propagation, judge_kwargs validation, retry behavior, output_schema handling, and evaluation flow with mocked openevals.
  • Tests: utils (packages/nvidia_nat_langchain/tests/eval/test_utils.py): comprehensive tests for data mapping, dot-path extraction, Run/Example construction, result conversion (lists, dicts, custom schema), and score_field propagation.
  • Custom evaluator tests (packages/nvidia_nat_langchain/tests/eval/test_langsmith_custom_evaluator.py): tests for dotted-path import, instantiation, convention detection, extra_fields behavior, and error cases during registration/evaluation.
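The dot-path field extraction listed in the utilities summary can be sketched like this. The helper name and None-on-missing behavior are assumptions for illustration; the PR's actual utility may differ.

```python
def extract_field(data, dotted_path):
    """Walk a nested dict using a dot-separated path such as
    "scores.judge.score". Returns None when any segment is missing
    or the current value is not a dict."""
    current = data
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


value = extract_field({"scores": {"judge": {"score": 0.7}}}, "scores.judge.score")
```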

Sequence Diagram(s)

sequenceDiagram
    participant Config as Configuration
    participant Reg as Registration
    participant Import as DynamicImporter
    participant Detect as ConventionDetector
    participant Adapter as LangSmithAdapter
    participant Eval as Evaluator

    Config->>Reg: load langsmith config (evaluator dotted path)
    Reg->>Import: import evaluator by dotted path / resolve registered name
    Import-->>Reg: return object or callable
    Reg->>Detect: detect calling convention
    Detect-->>Reg: return convention
    Reg->>Adapter: instantiate adapter (evaluator, convention, extra_fields)
    Reg-->>Config: yield EvaluatorInfo

    rect rgba(100,150,200,0.5)
    Note over Adapter,Eval: Evaluation time
    Adapter->>Eval: route call based on convention
    alt run_evaluator class
        Adapter->>Eval: evaluator.aevaluate_run(run, example)
    else run/example function
        Adapter->>Eval: evaluator(run, example)
    else openevals function
        Adapter->>Eval: evaluator(**openevals_kwargs)
    end
    Eval-->>Adapter: return result
    Adapter-->>Config: convert to EvalOutputItem
    end
sequenceDiagram
    participant Config as Configuration
    participant JudgeReg as JudgeRegistration
    participant Prompt as PromptResolver
    participant Builder as EvalBuilder
    participant OpenEvals as OpenEvals
    participant Adapter as LangSmithAdapter

    Config->>JudgeReg: load langsmith_judge config (prompt, llm_name)
    JudgeReg->>Prompt: resolve prompt (builtin or custom)
    Prompt-->>JudgeReg: prompt template
    JudgeReg->>Builder: get llm (llm_name)
    Builder-->>JudgeReg: llm instance
    JudgeReg->>OpenEvals: create_async_llm_as_judge(llm, prompt, options)
    OpenEvals-->>JudgeReg: judge evaluator
    JudgeReg->>Adapter: wrap evaluator with LangSmithEvaluatorAdapter
    JudgeReg-->>Config: yield EvaluatorInfo (judge)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 60.12%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title 'Add support for LangSmith evaluators' is concise, descriptive, uses imperative mood, and directly summarizes the main change in the changeset.
  • Linked Issues check: ✅ Passed. The PR implements all primary objectives from issue #1585: the langsmith plugin for custom evaluators, langsmith_judge for LLM-as-judge, extra_fields support, and sync executor offloading.
  • Out of Scope Changes check: ✅ Passed. All changes are scoped to the LangSmith evaluator plugins and their supporting infrastructure. A minor config file change and the dependency addition in pyproject.toml are necessary supporting changes, not out of scope.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@mpenn mpenn marked this pull request as ready for review February 11, 2026 16:14
@mpenn mpenn requested review from a team as code owners February 11, 2026 16:14

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In
`@examples/evaluation_and_profiling/simple_calculator_eval/src/nat_simple_calculator_eval/configs/config-custom-dataset-format.yml`:
- Line 43: Remove the trailing whitespace on the blank line in
config-custom-dataset-format.yml (the blank line around the middle of the file);
open the file, delete any spaces or tabs at the end of that empty line so it is
truly empty, then save and run a quick whitespace/lint check to ensure no other
lines end with trailing whitespace.

In
`@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/langsmith_judge.py`:
- Around line 176-183: create_kwargs currently always sets "choices":
config.choices even when it's None; adjust the construction of create_kwargs
(the dict built before calling create_llm_as_judge) to only include the
"choices" key when config.choices is not None, mirroring how "system" and
"few_shot_examples" are conditionally added; this ensures create_llm_as_judge
receives the key only when explicitly provided rather than an explicit None
value.
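The conditional-key pattern this comment describes can be sketched as follows. Parameter names are taken from the comment; the actual config fields and call site in langsmith_judge.py may differ.

```python
def build_create_kwargs(prompt, choices=None, system=None, few_shot_examples=None):
    """Build kwargs for a judge factory, including optional keys only
    when explicitly provided so the callee can distinguish an absent
    key from one set to None."""
    create_kwargs = {"prompt": prompt}
    if choices is not None:
        create_kwargs["choices"] = choices
    if system is not None:
        create_kwargs["system"] = system
    if few_shot_examples is not None:
        create_kwargs["few_shot_examples"] = few_shot_examples
    return create_kwargs
```

Passing an explicit `choices=None` could otherwise override the callee's own default handling, which is why omitting the key entirely is the safer contract.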

In `@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/utils.py`:
- Around line 60-63: The current getattr call uses None as the default which
conflates “attribute missing” with an attribute explicitly set to None; change
the logic in the block around obj = getattr(module, attr_name, None) to use a
unique sentinel (e.g., create a local sentinel object) and check identity
against that sentinel (if obj is sentinel) before raising the AttributeError;
keep the same error text referring to module_path and attr_name and the existing
attribute listing logic (dir(module) filtered for non-underscored names) so
attributes that are present but None are not treated as missing.
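The sentinel pattern the comment asks for can be sketched like this (a minimal standalone version; the real function in utils.py carries different error text and surrounding logic):

```python
import importlib

_MISSING = object()  # unique sentinel, distinct from any real attribute value


def import_attr(module_path, attr_name):
    """Import attr_name from module_path, distinguishing a missing
    attribute from one explicitly set to None."""
    module = importlib.import_module(module_path)
    obj = getattr(module, attr_name, _MISSING)
    if obj is _MISSING:
        available = [n for n in dir(module) if not n.startswith("_")]
        raise AttributeError(
            f"Module '{module_path}' has no attribute '{attr_name}'. "
            f"Available attributes: {available}")
    return obj
```

With this shape, a module-level attribute that happens to be None is returned as-is rather than being reported as missing.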

In `@packages/nvidia_nat_langchain/tests/eval/test_langsmith_judge.py`:
- Around line 293-309: The test function
test_system_passed_to_create_llm_as_judge declares an unused fixture parameter
eval_input_matching; remove eval_input_matching from the function signature so
the test reads async def test_system_passed_to_create_llm_as_judge(self): to
satisfy Ruff ARG002 and avoid confusion when running tests, leaving the body and
assertions (including the patch of openevals.llm.create_llm_as_judge) unchanged.
🧹 Nitpick comments (6)
examples/evaluation_and_profiling/simple_calculator_eval/src/nat_simple_calculator_eval/configs/config-custom-dataset-format.yml (1)

67-94: Large commented-out block in example config.

The commented-out tunable_rag_evaluator block is quite lengthy. Consider either removing it entirely or moving it to a separate example config file (e.g., config-tunable-rag.yml) to keep this example focused on the new langsmith_judge evaluator.

packages/nvidia_nat_langchain/tests/eval/test_utils.py (1)

31-41: Fixture naming convention: add name argument and use fixture_ prefix.

Per coding guidelines, pytest fixtures should define the name argument and the decorated function should use the fixture_ prefix.

♻️ Proposed fix
-@pytest.fixture
-def sample_item():
+@pytest.fixture(name="sample_item")
+def fixture_sample_item():
     return EvalInputItem(

As per coding guidelines, "Pytest fixtures should define the name argument when applying the pytest.fixture decorator. The fixture function being decorated should be named using the fixture_ prefix, using snake_case."

packages/nvidia_nat_langchain/tests/eval/conftest.py (1)

57-153: All fixtures missing name argument and fixture_ prefix.

All five fixtures (eval_input_matching, eval_input_non_matching, eval_input_multi_item, item_with_context, eval_input_with_context) should follow the project convention.

♻️ Proposed fix (showing pattern for each)
-@pytest.fixture
-def eval_input_matching():
+@pytest.fixture(name="eval_input_matching")
+def fixture_eval_input_matching():
-@pytest.fixture
-def eval_input_non_matching():
+@pytest.fixture(name="eval_input_non_matching")
+def fixture_eval_input_non_matching():
-@pytest.fixture
-def eval_input_multi_item():
+@pytest.fixture(name="eval_input_multi_item")
+def fixture_eval_input_multi_item():
-@pytest.fixture
-def item_with_context():
+@pytest.fixture(name="item_with_context")
+def fixture_item_with_context():
-@pytest.fixture
-def eval_input_with_context(item_with_context):
+@pytest.fixture(name="eval_input_with_context")
+def fixture_eval_input_with_context(item_with_context):

As per coding guidelines, "Pytest fixtures should define the name argument when applying the pytest.fixture decorator. The fixture function being decorated should be named using the fixture_ prefix, using snake_case."

packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/langsmith_evaluator.py (1)

86-118: Convention detection uses intersection — any single matching parameter triggers classification.

A function with only an inputs parameter (but no outputs or reference_outputs) will be classified as openevals_function, and one with only a run parameter as run_example_function. This is a loose heuristic. If precision matters, consider requiring at least two matching params (e.g., len(openevals_params.intersection(param_names)) >= 2). Likely fine in practice since real evaluators will have the full signature.
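The stricter detection the comment suggests (at least two matching parameters for the openevals shape, the full pair for run/example) can be sketched as follows; the return values and function name are illustrative, not the PR's actual enum or helper.

```python
import inspect

OPENEVALS_PARAMS = {"inputs", "outputs", "reference_outputs"}
RUN_EXAMPLE_PARAMS = {"run", "example"}


def detect_convention(fn):
    """Classify an evaluator by its signature, requiring at least two
    openevals-style parameters (or both run/example parameters) so a
    single coincidental name does not trigger classification."""
    param_names = set(inspect.signature(fn).parameters)
    if len(OPENEVALS_PARAMS & param_names) >= 2:
        return "openevals_function"
    if RUN_EXAMPLE_PARAMS <= param_names:
        return "run_example_function"
    raise ValueError(f"Cannot detect calling convention for {fn!r}")
```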

packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/utils.py (1)

264-326: EvaluationResults batch handling only uses the first result.

Lines 299-302: When a dict has a "results" key, only results_list[0] is used and the rest are silently dropped. If batched results are expected, this loses data. Consider using _handle_list_result for the full list, or at minimum logging a warning when len(results_list) > 1.

Suggested improvement
     if isinstance(result, dict) and "results" in result:
         results_list = result["results"]
-        if results_list:
+        if isinstance(results_list, list) and len(results_list) > 1:
+            return _handle_list_result(item_id, results_list)
+        elif results_list:
             result = results_list[0]
         else:
             return EvalOutputItem(
packages/nvidia_nat_langchain/tests/eval/test_langsmith_evaluator.py (1)

408-425: Consider using _register helper for consistency and add a clarifying comment on the evaluator path.

This test uses async with register_langsmith_evaluator(...) directly instead of the _register helper used elsewhere. Also, the evaluator path _detect_convention is arbitrary (since _import_evaluator is mocked) but reads confusingly. A brief comment would help.

Signed-off-by: Matthew Penn <mpenn@nvidia.com>
Signed-off-by: Matthew Penn <mpenn@nvidia.com>
@willkill07 willkill07 added the feature request (New feature or request) and non-breaking (Non-breaking change) labels on Feb 11, 2026
Signed-off-by: Matthew Penn <mpenn@nvidia.com>
Signed-off-by: Matthew Penn <mpenn@nvidia.com>
…mentation

Signed-off-by: Matthew Penn <mpenn@nvidia.com>
Signed-off-by: Matthew Penn <mpenn@nvidia.com>
Signed-off-by: Matthew Penn <mpenn@nvidia.com>

@Salonijain27 Salonijain27 left a comment


Approved from a dependency point of view


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (6)
packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/utils.py (2)

117-151: Synthetic Run and Example receive random UUIDs that don't correlate with the item.

item.id is available but not used. The run.id, example.id, and trace_id are all random uuid.uuid4() values. If any downstream evaluator or logging needs to correlate a Run/Example back to the NAT item, this link is lost. Consider deriving a deterministic UUID from item.id (e.g., uuid.uuid5(uuid.NAMESPACE_OID, str(item.id))) or storing it in Run metadata/extras, so traceability is preserved.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/utils.py` around
lines 117 - 151, The synthetic Run/Example created in
eval_input_item_to_run_and_example uses random UUIDs which breaks traceability
to the original item; change the code to derive deterministic IDs from item.id
(for example using uuid.uuid5 with a stable namespace) and use that derived UUID
for run.id, example.id and trace_id or alternatively include item.id in
Run/Example metadata/extras (e.g., a "nat_item_id" field) so downstream
evaluators can correlate back to the NAT item; update references to Run,
Example, and trace_id in the function accordingly.
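The deterministic-UUID approach suggested above can be sketched with `uuid.uuid5`, which derives a stable UUID from a namespace and a name. The helper name and two-level derivation are illustrative assumptions.

```python
import uuid


def deterministic_ids(item_id):
    """Derive stable run and example UUIDs from a NAT item id so the
    synthetic Run/Example can be correlated back to the item."""
    base = uuid.uuid5(uuid.NAMESPACE_OID, str(item_id))
    run_id = uuid.uuid5(base, "run")          # stable per item
    example_id = uuid.uuid5(base, "example")  # distinct from run_id
    return run_id, example_id


a = deterministic_ids("item-1")
b = deterministic_ids("item-1")
```

Because uuid5 is a pure hash of its inputs, the same item id always yields the same run and example ids across processes and runs.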

301-311: EvaluationResults batch handling silently discards all but the first result.

When a dict with a "results" key is returned, only results_list[0] is processed. Multi-result batches will have results silently dropped. A log warning would help operators notice this, or consider aggregating all results similar to _handle_list_result.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/utils.py` around
lines 301 - 311, When unwrapping a dict with "results" you currently drop all
but the first item; change the block that sets result = results_list[0] to
detect multi-result batches (len(results_list) > 1), emit a warning including
the item_id and number of results, and delegate handling of the full list to the
existing _handle_list_result path (or otherwise aggregate the results similarly)
instead of silently discarding them; keep the single-result behavior (set result
= results_list[0]) for len == 1 and return the EvalOutputItem with the
empty-results error for an empty list as before.
packages/nvidia_nat_langchain/tests/eval/conftest.py (1)

38-54: register_evaluator_ctx — consider adding a return type hint.

This is a public helper used across multiple test files. Adding a return type hint would improve discoverability. The coding guidelines require type hints on public APIs.

♻️ Suggested fix
-async def register_evaluator_ctx(register_fn, config, builder=None):
+async def register_evaluator_ctx(register_fn, config, builder=None) -> EvaluatorInfo:

(with the appropriate import of EvaluatorInfo at the top)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nvidia_nat_langchain/tests/eval/conftest.py` around lines 38 - 54,
The helper function register_evaluator_ctx lacks a return type hint; add one (->
EvaluatorInfo) to its signature and import EvaluatorInfo at the top of the file
so the public API is properly typed; keep existing behavior (defaulting builder
via make_mock_builder) unchanged.
packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/langsmith_custom_evaluator.py (1)

100-108: Convention detection uses intersection (any overlap) rather than issubset (all required params).

A function with only inputs (but not outputs or reference_outputs) will be classified as openevals_function, and a function with only run (but not example) will be classified as run_example_function. This is permissive by design, but could misclassify helper functions that coincidentally use one of these parameter names.

This is likely acceptable for practical use, but worth documenting explicitly in the docstring.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/langsmith_custom_evaluator.py`
around lines 100 - 108, The convention-detection currently uses
openevals_params.intersection(param_names) and
langsmith_params.intersection(param_names) which matches on any overlapping name
and can misclassify; change the checks to require the full parameter set by
using openevals_params.issubset(param_names) to return
_EvaluatorConvention.OPENEVALS_FUNCTION and
langsmith_params.issubset(param_names) to return
_EvaluatorConvention.RUN_EXAMPLE_FUNCTION (update any related docstring/tests to
reflect the stricter behavior if needed), referencing the variables
openevals_params, langsmith_params, param_names, and _EvaluatorConvention in
your change.
packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/langsmith_evaluator.py (1)

36-53: Registry is recreated on every call; consider caching.

_get_registry() is called in both the model validator (Line 98) and _resolve_evaluator (Line 69), each time importing openevals and constructing a new dict. While the lazy-import pattern is good, the overhead compounds when multiple configs are validated.

Consider using functools.lru_cache to memoize the result:

♻️ Suggested refactor
+import functools
+
+@functools.lru_cache(maxsize=1)
 def _get_registry() -> dict[str, Callable[..., Any]]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/langsmith_evaluator.py`
around lines 36 - 53, The registry is recreated on each call; memoize
_get_registry to avoid repeated openevals imports and dict construction by
decorating _get_registry with functools.lru_cache(maxsize=1) (add the import for
lru_cache/functools), leaving the function body intact so callers like
_resolve_evaluator and the model validator keep the same behavior but reuse the
cached registry.
packages/nvidia_nat_langchain/tests/eval/test_langsmith_judge.py (1)

106-107: Extract the repeated fake_judge stub into a pytest fixture.

The same fake_judge async function (with identical or near-identical bodies) is defined ~14 times across the file. Per coding guidelines, frequently repeated code should be extracted into fixtures. A factory fixture can handle varying return values:

♻️ Suggested fixture
@pytest.fixture(name="fake_judge")
def fixture_fake_judge():
    """Factory that returns a fake async judge with a configurable result."""

    def _make(result=None):
        if result is None:
            result = {"key": "score", "score": 1.0}

        async def _judge(*, inputs=None, outputs=None, reference_outputs=None, **kwargs):
            return result

        return _judge

    return _make

Then each test can do:

async def test_builtin_prompt(self, eval_input_matching, fake_judge):
    judge = fake_judge({"key": "score", "score": 0.9, "comment": "Looks good"})
    ...
    with patch("openevals.llm.create_async_llm_as_judge", return_value=judge) as mock_create:
        ...

For the test_extra_fields_forwarded_through_adapter test that captures kwargs, you can keep the inline definition since it has a unique side-effect.

As per coding guidelines, "Any frequently repeated code should be extracted into pytest fixtures."

Also applies to: 200-201, 297-298, 370-371, 487-488, 559-561

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nvidia_nat_langchain/tests/eval/test_langsmith_judge.py` around
lines 106 - 107, Extract the repeated async fake_judge stub into a pytest
factory fixture named fake_judge that returns an async function given a
configurable result dict (defaulting to {"key":"score","score":1.0}), then
update tests that currently define identical/near-identical async fake_judge
functions (e.g., test_builtin_prompt, others around the noted ranges) to call
the fixture to create the judge instance and patch
openevals.llm.create_async_llm_as_judge to return that instance; keep the inline
fake_judge in test_extra_fields_forwarded_through_adapter because it inspects
kwargs and has unique side-effects.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/utils.py`:
- Around line 293-295: The heuristic that treats a dict as custom schema when
"key" is absent is fragile; change the check in the branch that calls
_handle_custom_schema_result so it distinguishes standard openevals dicts by
looking for the standard pair of keys (e.g., both "key" and "score") instead of
only testing for absence of "key". Concretely, in the conditional that currently
reads if score_field is not None and isinstance(result, dict) and "key" not in
result, update it to treat any dict that contains both "key" and "score" as a
standard openevals result (so it falls through to _handle_dict_result) and only
call _handle_custom_schema_result when score_field is set and the dict does not
have that standard ("key" and "score") shape; keep references to score_field,
result, _handle_custom_schema_result, and _handle_dict_result to locate and
modify the logic.

In `@packages/nvidia_nat_langchain/tests/eval/conftest.py`:
- Around line 57-74: Rename each fixture function to use the fixture_ prefix and
add the name argument to the decorator so external tests keep the same fixture
name; e.g., change the function eval_input_matching to
fixture_eval_input_matching and update its decorator to
@pytest.fixture(name="eval_input_matching"), and apply the same pattern to
eval_input_non_matching, eval_input_multi_item, item_with_context, and
eval_input_with_context so the runtime fixture names remain unchanged while
functions follow the fixture_ naming convention.

---

Duplicate comments:
In
`@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/langsmith_judge.py`:
- Around line 179-186: The create_kwargs dict unconditionally includes
"choices": config.choices which can be None, unlike "system" and
"few_shot_examples" that are only added when not None; update the construction
so "choices" is only added to create_kwargs when config.choices is not None
(same pattern used for system and few_shot_examples) so
create_async_llm_as_judge receives an absent key instead of an explicit None.

---

Nitpick comments:
In
`@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/langsmith_custom_evaluator.py`:
- Around line 100-108: The convention-detection currently uses
openevals_params.intersection(param_names) and
langsmith_params.intersection(param_names) which matches on any overlapping name
and can misclassify; change the checks to require the full parameter set by
using openevals_params.issubset(param_names) to return
_EvaluatorConvention.OPENEVALS_FUNCTION and
langsmith_params.issubset(param_names) to return
_EvaluatorConvention.RUN_EXAMPLE_FUNCTION (update any related docstring/tests to
reflect the stricter behavior if needed), referencing the variables
openevals_params, langsmith_params, param_names, and _EvaluatorConvention in
your change.

In
`@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/langsmith_evaluator.py`:
- Around line 36-53: The registry is recreated on each call; memoize
_get_registry to avoid repeated openevals imports and dict construction by
decorating _get_registry with functools.lru_cache(maxsize=1) (add the import for
lru_cache/functools), leaving the function body intact so callers like
_resolve_evaluator and the model validator keep the same behavior but reuse the
cached registry.

In `@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/utils.py`:
- Around line 117-151: The synthetic Run/Example created in
eval_input_item_to_run_and_example uses random UUIDs which breaks traceability
to the original item; change the code to derive deterministic IDs from item.id
(for example using uuid.uuid5 with a stable namespace) and use that derived UUID
for run.id, example.id and trace_id or alternatively include item.id in
Run/Example metadata/extras (e.g., a "nat_item_id" field) so downstream
evaluators can correlate back to the NAT item; update references to Run,
Example, and trace_id in the function accordingly.
- Around line 301-311: When unwrapping a dict with "results" you currently drop
all but the first item; change the block that sets result = results_list[0] to
detect multi-result batches (len(results_list) > 1), emit a warning including
the item_id and number of results, and delegate handling of the full list to the
existing _handle_list_result path (or otherwise aggregate the results similarly)
instead of silently discarding them; keep the single-result behavior (set result
= results_list[0]) for len == 1 and return the EvalOutputItem with the
empty-results error for an empty list as before.

In `@packages/nvidia_nat_langchain/tests/eval/conftest.py`:
- Around line 38-54: The helper function register_evaluator_ctx lacks a return
type hint; add one (-> EvaluatorInfo) to its signature and import EvaluatorInfo
at the top of the file so the public API is properly typed; keep existing
behavior (defaulting builder via make_mock_builder) unchanged.

In `@packages/nvidia_nat_langchain/tests/eval/test_langsmith_judge.py`:
- Around line 106-107: Extract the repeated async fake_judge stub into a pytest
factory fixture named fake_judge that returns an async function given a
configurable result dict (defaulting to {"key":"score","score":1.0}), then
update tests that currently define identical/near-identical async fake_judge
functions (e.g., test_builtin_prompt, others around the noted ranges) to call
the fixture to create the judge instance and patch
openevals.llm.create_async_llm_as_judge to return that instance; keep the inline
fake_judge in test_extra_fields_forwarded_through_adapter because it inspects
kwargs and has unique side-effects.

- Simplified the condition in `langsmith_result_to_eval_output_item` to remove an unnecessary check for "key" in result.
- Renamed test fixture functions for clarity, adding explicit names to `eval_input_matching`, `eval_input_non_matching`, `eval_input_multi_item`, `item_with_context`, and `eval_input_with_context`.
- Updated corresponding test cases to reflect the new fixture names, ensuring consistency and readability.

Signed-off-by: Matthew Penn <mpenn@nvidia.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (1)
packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/utils.py (1)

301-311: EvaluationResults batch silently drops all but the first result.

LangSmith's documented multi-score format is { "results": [ {"key": "precision", "score": 0.8}, {"key": "recall", "score": 0.9}, ... ] } — a common pattern when a single evaluator wants to report multiple metrics. The current code uses only results_list[0] and discards the rest, returning a single misleading score instead of aggregating.

This is asymmetric with _handle_list_result, which correctly averages a bare list. Consider delegating to _handle_list_result for consistency:

♻️ Proposed fix
     # EvaluationResults batch -- unwrap then fall through
     if isinstance(result, dict) and "results" in result:
         results_list = result["results"]
-        if results_list:
-            result = results_list[0]
-        else:
-            return EvalOutputItem(
-                id=item_id,
-                score=0.0,
-                reasoning={"error": "Empty EvaluationResults returned"},
-            )
+        return _handle_list_result(item_id, results_list)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/utils.py` around
lines 301 - 311, The code that unwraps an EvaluationResults dict currently takes
only results_list[0], dropping other metrics; change it to delegate the
non-empty results_list to the existing _handle_list_result logic instead of
selecting the first element so multiple metric entries are aggregated
consistently; keep the empty-list branch that returns EvalOutputItem(id=item_id,
score=0.0, reasoning={"error":"Empty EvaluationResults returned"}) and ensure
you call _handle_list_result with the same parameters/context used elsewhere so
keys like "score" in each dict are handled the same way.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/utils.py`:
- Around line 180-204: The error strings currently embedded only in the
reasoning dict must be moved into the EvalOutputItem.error field so downstream
consumers can detect failures; update _handle_custom_schema_result,
_handle_list_result, the empty-batch branch, and the fallback branch to set
EvalOutputItem.error to the error message (while still keeping rich context in
reasoning e.g., raw or parsed output), and adjust any test assertions in
test_utils.py that expect reasoning["error"] to instead check output.error (you
can keep reasoning["error"] for backwards info but ensure error is populated on
all failure paths).
- Around line 274-281: Update the docstring for the dispatcher in utils.py to
remove the stale heuristic text about "(no \"key\" field)" and instead state
that a dict is treated as a custom output_schema whenever score_field is not
None; keep the rest of the bullet list (bare list, EvaluationResults batch,
EvaluationResult object, plain dict, fallback) but change the first bullet to
reflect the current behavior: "Custom output_schema dict (treated as custom when
score_field is not None) with score_field". Reference the dispatcher docstring
and the score_field symbol when making this edit.
- Around line 71-95: The docstring for eval_input_item_to_openevals_kwargs is
missing documentation of the ValueError that can be raised when an extra_fields
key conflicts with a standard parameter; update the Raises section to list both
KeyError (for missing dataset fields) and ValueError (for conflicts when an
extra_fields key equals one of the reserved keys like 'inputs', 'outputs', or
'reference_outputs') so callers know both failure modes.

In `@packages/nvidia_nat_langchain/tests/eval/test_utils.py`:
- Around line 31-41: Rename the fixture function to follow the convention and
keep the external name the same: change the decorated function from sample_item
to fixture_sample_item and update the decorator to
`@pytest.fixture`(name="sample_item"); ensure this is applied to the fixture that
constructs the EvalInputItem (the function currently returning EvalInputItem
with id "test_1", input_obj "What is AI?", etc.) so tests can continue to
reference the fixture as sample_item while the function name uses the required
fixture_ prefix.
- Around line 319-344: The inner test function custom_schema_evaluator declares
unused parameters (inputs, outputs, reference_outputs, **kwargs) causing Ruff
ARG001 warnings; update the function signature in test_adapter_uses_score_field
to append a noqa suppression (add "# noqa: ARG001" on the def line for
custom_schema_evaluator) so the unused-arg linter warning is silenced while
keeping the explicit interface.

---

Nitpick comments:
In `@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/utils.py`:
- Around line 301-311: The code that unwraps an EvaluationResults dict currently
takes only results_list[0], dropping other metrics; change it to delegate the
non-empty results_list to the existing _handle_list_result logic instead of
selecting the first element so multiple metric entries are aggregated
consistently; keep the empty-list branch that returns EvalOutputItem(id=item_id,
score=0.0, reasoning={"error":"Empty EvaluationResults returned"}) and ensure
you call _handle_list_result with the same parameters/context used elsewhere so
keys like "score" in each dict are handled the same way.

- Introduced the `openevals` package as a dependency in both `uv.lock` files for `nvidia_nat_langchain` and `nvidia_nat_vanna`, specifying version constraints.
- Updated evaluation utility functions to improve error handling by separating error messages from reasoning, ensuring clearer output in case of failures.
- Modified tests to reflect changes in error handling, ensuring consistency in how errors are reported.

Signed-off-by: Matthew Penn <mpenn@nvidia.com>
Signed-off-by: Matthew Penn <mpenn@nvidia.com>
…y to use typing_extensions for compatibility.

Signed-off-by: Matthew Penn <mpenn@nvidia.com>
@dnandakumar-nv
Contributor

Code looks good to me!

@dnandakumar-nv
Contributor

/merge

@rapids-bot rapids-bot bot merged commit b7fc478 into NVIDIA:develop Feb 19, 2026
17 checks passed

Labels

feature request New feature or request non-breaking Non-breaking change


Development

Successfully merging this pull request may close these issues.

Add support for LangSmith evaluators
