Add support for LangSmith evaluators #1592
This includes langsmith and openevals, allowing developers to use out-of-the-box langsmith/openevals evaluators and/or evaluators they have already written in these frameworks. Signed-off-by: Matthew Penn <mpenn@nvidia.com>
This reduces the mental acrobatics developers need to do with mutually exclusive fields. Signed-off-by: Matthew Penn <mpenn@nvidia.com>
…a fields and improved error handling
- Updated `LangSmithEvaluatorAdapter` to pass additional context fields to evaluators.
- Enhanced `LangSmithJudgeConfig` with new fields for system messages and few-shot examples.
- Introduced validation for overlapping keys in `judge_kwargs` and typed fields.
- Added utility functions for importing attributes and extracting nested fields from results.
- Updated tests to cover new configurations and ensure proper handling of extra fields.
Signed-off-by: Matthew Penn <mpenn@nvidia.com>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings and use CodeRabbit commands to manage reviews.
Walkthrough
Adds LangSmith/openevals evaluator support: new `langsmith`, `langsmith_judge`, and `langsmith_custom` evaluator registrations, an adapter, utilities, a pyproject dependency pin, and extensive unit tests for importing, convention detection, data mapping, and result conversion.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Config as Configuration
    participant Reg as Registration
    participant Import as DynamicImporter
    participant Detect as ConventionDetector
    participant Adapter as LangSmithAdapter
    participant Eval as Evaluator
    Config->>Reg: load langsmith config (evaluator dotted path)
    Reg->>Import: import evaluator by dotted path / resolve registered name
    Import-->>Reg: return object or callable
    Reg->>Detect: detect calling convention
    Detect-->>Reg: return convention
    Reg->>Adapter: instantiate adapter (evaluator, convention, extra_fields)
    Reg-->>Config: yield EvaluatorInfo
    rect rgba(100,150,200,0.5)
        Note over Adapter,Eval: Evaluation time
        Adapter->>Eval: route call based on convention
        alt run_evaluator class
            Adapter->>Eval: evaluator.aevaluate_run(run, example)
        else run/example function
            Adapter->>Eval: evaluator(run, example)
        else openevals function
            Adapter->>Eval: evaluator(**openevals_kwargs)
        end
        Eval-->>Adapter: return result
        Adapter-->>Config: convert to EvalOutputItem
    end
```

```mermaid
sequenceDiagram
    participant Config as Configuration
    participant JudgeReg as JudgeRegistration
    participant Prompt as PromptResolver
    participant Builder as EvalBuilder
    participant OpenEvals as OpenEvals
    participant Adapter as LangSmithAdapter
    Config->>JudgeReg: load langsmith_judge config (prompt, llm_name)
    JudgeReg->>Prompt: resolve prompt (builtin or custom)
    Prompt-->>JudgeReg: prompt template
    JudgeReg->>Builder: get llm (llm_name)
    Builder-->>JudgeReg: llm instance
    JudgeReg->>OpenEvals: create_async_llm_as_judge(llm, prompt, options)
    OpenEvals-->>JudgeReg: judge evaluator
    JudgeReg->>Adapter: wrap evaluator with LangSmithEvaluatorAdapter
    JudgeReg-->>Config: yield EvaluatorInfo (judge)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In
`@examples/evaluation_and_profiling/simple_calculator_eval/src/nat_simple_calculator_eval/configs/config-custom-dataset-format.yml`:
- Line 43: Remove the trailing whitespace on the blank line in
config-custom-dataset-format.yml (the blank line around the middle of the file);
open the file, delete any spaces or tabs at the end of that empty line so it is
truly empty, then save and run a quick whitespace/lint check to ensure no other
lines end with trailing whitespace.
In
`@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/langsmith_judge.py`:
- Around line 176-183: create_kwargs currently always sets "choices":
config.choices even when it's None; adjust the construction of create_kwargs
(the dict built before calling create_llm_as_judge) to only include the
"choices" key when config.choices is not None, mirroring how "system" and
"few_shot_examples" are conditionally added; this ensures create_llm_as_judge
receives the key only when explicitly provided rather than an explicit None
value.
In `@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/utils.py`:
- Around line 60-63: The current getattr call uses None as the default which
conflates “attribute missing” with an attribute explicitly set to None; change
the logic in the block around obj = getattr(module, attr_name, None) to use a
unique sentinel (e.g., create a local sentinel object) and check identity
against that sentinel (if obj is sentinel) before raising the AttributeError;
keep the same error text referring to module_path and attr_name and the existing
attribute listing logic (dir(module) filtered for non-underscored names) so
attributes that are present but None are not treated as missing.
In `@packages/nvidia_nat_langchain/tests/eval/test_langsmith_judge.py`:
- Around line 293-309: The test function
test_system_passed_to_create_llm_as_judge declares an unused fixture parameter
eval_input_matching; remove eval_input_matching from the function signature so
the test reads async def test_system_passed_to_create_llm_as_judge(self): to
satisfy Ruff ARG002 and avoid confusion when running tests, leaving the body and
assertions (including the patch of openevals.llm.create_llm_as_judge) unchanged.
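The conditional-kwargs fix for `langsmith_judge.py` described above can be sketched as follows. This is a hypothetical illustration, not the actual NAT implementation; the config field names are assumptions:

```python
from types import SimpleNamespace

# Hypothetical sketch of the fix: include optional keys only when explicitly
# set, so the callee sees an absent key rather than an explicit None. The
# config shape here is an assumption for illustration.
def build_create_kwargs(config) -> dict:
    create_kwargs = {"prompt": config.prompt}
    # Mirror the pattern already used for "system" and "few_shot_examples":
    for key in ("system", "few_shot_examples", "choices"):
        value = getattr(config, key, None)
        if value is not None:
            create_kwargs[key] = value
    return create_kwargs
```

With this shape, `create_llm_as_judge` never receives an explicit `"choices": None`.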
🧹 Nitpick comments (6)
examples/evaluation_and_profiling/simple_calculator_eval/src/nat_simple_calculator_eval/configs/config-custom-dataset-format.yml (1)
67-94: Large commented-out block in example config. The commented-out `tunable_rag_evaluator` block is quite lengthy. Consider either removing it entirely or moving it to a separate example config file (e.g., `config-tunable-rag.yml`) to keep this example focused on the new `langsmith_judge` evaluator.

packages/nvidia_nat_langchain/tests/eval/test_utils.py (1)
31-41: Fixture naming convention: add `name` argument and use `fixture_` prefix. Per coding guidelines, pytest fixtures should define the `name` argument and the decorated function should use the `fixture_` prefix.

♻️ Proposed fix

```diff
-@pytest.fixture
-def sample_item():
+@pytest.fixture(name="sample_item")
+def fixture_sample_item():
     return EvalInputItem(
```

As per coding guidelines, "Pytest fixtures should define the name argument when applying the pytest.fixture decorator. The fixture function being decorated should be named using the `fixture_` prefix, using snake_case."

packages/nvidia_nat_langchain/tests/eval/conftest.py (1)
57-153: All fixtures missing `name` argument and `fixture_` prefix. All five fixtures (`eval_input_matching`, `eval_input_non_matching`, `eval_input_multi_item`, `item_with_context`, `eval_input_with_context`) should follow the project convention.

♻️ Proposed fix (showing pattern for each)

```diff
-@pytest.fixture
-def eval_input_matching():
+@pytest.fixture(name="eval_input_matching")
+def fixture_eval_input_matching():

-@pytest.fixture
-def eval_input_non_matching():
+@pytest.fixture(name="eval_input_non_matching")
+def fixture_eval_input_non_matching():

-@pytest.fixture
-def eval_input_multi_item():
+@pytest.fixture(name="eval_input_multi_item")
+def fixture_eval_input_multi_item():

-@pytest.fixture
-def item_with_context():
+@pytest.fixture(name="item_with_context")
+def fixture_item_with_context():

-@pytest.fixture
-def eval_input_with_context(item_with_context):
+@pytest.fixture(name="eval_input_with_context")
+def fixture_eval_input_with_context(item_with_context):
```

As per coding guidelines, "Pytest fixtures should define the name argument when applying the pytest.fixture decorator. The fixture function being decorated should be named using the `fixture_` prefix, using snake_case."

packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/langsmith_evaluator.py (1)
86-118: Convention detection uses `intersection` — any single matching parameter triggers classification. A function with only an `inputs` parameter (but no `outputs` or `reference_outputs`) will be classified as `openevals_function`, and one with only a `run` parameter as `run_example_function`. This is a loose heuristic. If precision matters, consider requiring at least two matching params (e.g., `len(openevals_params.intersection(param_names)) >= 2`). Likely fine in practice since real evaluators will have the full signature.

packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/utils.py (1)
264-326: `EvaluationResults` batch handling only uses the first result. Lines 299-302: When a dict has a `"results"` key, only `results_list[0]` is used and the rest are silently dropped. If batched results are expected, this loses data. Consider using `_handle_list_result` for the full list, or at minimum logging a warning when `len(results_list) > 1`.

Suggested improvement

```diff
     if isinstance(result, dict) and "results" in result:
         results_list = result["results"]
-        if results_list:
+        if isinstance(results_list, list) and len(results_list) > 1:
+            return _handle_list_result(item_id, results_list)
+        elif results_list:
             result = results_list[0]
         else:
             return EvalOutputItem(
```

packages/nvidia_nat_langchain/tests/eval/test_langsmith_evaluator.py (1)
408-425: Consider using the `_register` helper for consistency and add a clarifying comment on the evaluator path. This test uses `async with register_langsmith_evaluator(...)` directly instead of the `_register` helper used elsewhere. Also, the evaluator path `_detect_convention` is arbitrary (since `_import_evaluator` is mocked) but reads confusingly. A brief comment would help.
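The sentinel fix suggested for `utils.py` in the actionable comments above can be sketched like this; the helper name and error text are assumptions for illustration:

```python
import importlib

_MISSING = object()  # unique sentinel: distinguishes "absent" from "set to None"

def import_attr(module_path: str, attr_name: str):
    """Import attr_name from module_path; an attribute explicitly set to None is
    treated as present, not missing."""
    module = importlib.import_module(module_path)
    obj = getattr(module, attr_name, _MISSING)
    if obj is _MISSING:
        available = sorted(n for n in dir(module) if not n.startswith("_"))
        raise AttributeError(
            f"Module '{module_path}' has no attribute '{attr_name}'. "
            f"Available attributes: {available}")
    return obj
```

Using `getattr(module, attr_name, None)` instead would wrongly raise for attributes whose value really is `None`.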
…metnation Signed-off-by: Matthew Penn <mpenn@nvidia.com>
Salonijain27 left a comment:
Approved from a dependency point of view
Actionable comments posted: 2
🧹 Nitpick comments (6)
packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/utils.py (2)
117-151: Synthetic `Run` and `Example` receive random UUIDs that don't correlate with the item. `item.id` is available but not used. The `run.id`, `example.id`, and `trace_id` are all random `uuid.uuid4()` values. If any downstream evaluator or logging needs to correlate a Run/Example back to the NAT item, this link is lost. Consider deriving a deterministic UUID from `item.id` (e.g., `uuid.uuid5(uuid.NAMESPACE_OID, str(item.id))`) or storing it in Run metadata/extras, so traceability is preserved.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/utils.py` around lines 117 - 151, The synthetic Run/Example created in eval_input_item_to_run_and_example uses random UUIDs which breaks traceability to the original item; change the code to derive deterministic IDs from item.id (for example using uuid.uuid5 with a stable namespace) and use that derived UUID for run.id, example.id and trace_id or alternatively include item.id in Run/Example metadata/extras (e.g., a "nat_item_id" field) so downstream evaluators can correlate back to the NAT item; update references to Run, Example, and trace_id in the function accordingly.
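The deterministic-ID suggestion above can be sketched as follows; the namespace choice and key prefixes are illustrative assumptions, not the actual NAT code:

```python
import uuid

# Sketch: derive stable Run/Example IDs from item.id with uuid.uuid5 so a
# synthetic Run/Example can always be traced back to the originating item.
def derive_ids(item_id) -> tuple:
    run_id = uuid.uuid5(uuid.NAMESPACE_OID, f"run:{item_id}")
    example_id = uuid.uuid5(uuid.NAMESPACE_OID, f"example:{item_id}")
    return run_id, example_id
```

Because `uuid.uuid5` is a pure function of namespace and name, re-running an evaluation on the same item reproduces the same IDs.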
301-311: `EvaluationResults` batch handling silently discards all but the first result. When a dict with a `"results"` key is returned, only `results_list[0]` is processed. Multi-result batches will have results silently dropped. A log warning would help operators notice this, or consider aggregating all results similar to `_handle_list_result`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/utils.py` around lines 301 - 311, When unwrapping a dict with "results" you currently drop all but the first item; change the block that sets result = results_list[0] to detect multi-result batches (len(results_list) > 1), emit a warning including the item_id and number of results, and delegate handling of the full list to the existing _handle_list_result path (or otherwise aggregate the results similarly) instead of silently discarding them; keep the single-result behavior (set result = results_list[0]) for len == 1 and return the EvalOutputItem with the empty-results error for an empty list as before.

packages/nvidia_nat_langchain/tests/eval/conftest.py (1)
38-54: `register_evaluator_ctx` — consider adding a return type hint. This is a public helper used across multiple test files. Adding a return type hint would improve discoverability. The coding guidelines require type hints on public APIs.

♻️ Suggested fix

```diff
-async def register_evaluator_ctx(register_fn, config, builder=None):
+async def register_evaluator_ctx(register_fn, config, builder=None) -> EvaluatorInfo:
```

(with the appropriate import of `EvaluatorInfo` at the top)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_langchain/tests/eval/conftest.py` around lines 38 - 54, The helper function register_evaluator_ctx lacks a return type hint; add one (-> EvaluatorInfo) to its signature and import EvaluatorInfo at the top of the file so the public API is properly typed; keep existing behavior (defaulting builder via make_mock_builder) unchanged.

packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/langsmith_custom_evaluator.py (1)
100-108: Convention detection uses `intersection` (any overlap) rather than `issubset` (all required params). A function with only `inputs` (but not `outputs` or `reference_outputs`) will be classified as `openevals_function`, and a function with only `run` (but not `example`) will be classified as `run_example_function`. This is permissive by design, but could misclassify helper functions that coincidentally use one of these parameter names.

This is likely acceptable for practical use, but worth documenting explicitly in the docstring.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/langsmith_custom_evaluator.py` around lines 100 - 108, The convention-detection currently uses openevals_params.intersection(param_names) and langsmith_params.intersection(param_names) which matches on any overlapping name and can misclassify; change the checks to require the full parameter set by using openevals_params.issubset(param_names) to return _EvaluatorConvention.OPENEVALS_FUNCTION and langsmith_params.issubset(param_names) to return _EvaluatorConvention.RUN_EXAMPLE_FUNCTION (update any related docstring/tests to reflect the stricter behavior if needed), referencing the variables openevals_params, langsmith_params, param_names, and _EvaluatorConvention in your change.

packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/langsmith_evaluator.py (1)
36-53: Registry is recreated on every call; consider caching. `_get_registry()` is called in both the model validator (Line 98) and `_resolve_evaluator` (Line 69), each time importing openevals and constructing a new dict. While the lazy-import pattern is good, the overhead compounds when multiple configs are validated.

Consider using `functools.lru_cache` to memoize the result:

♻️ Suggested refactor

```diff
+import functools
+
+@functools.lru_cache(maxsize=1)
 def _get_registry() -> dict[str, Callable[..., Any]]:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/langsmith_evaluator.py` around lines 36 - 53, The registry is recreated on each call; memoize _get_registry to avoid repeated openevals imports and dict construction by decorating _get_registry with functools.lru_cache(maxsize=1) (add the import for lru_cache/functools), leaving the function body intact so callers like _resolve_evaluator and the model validator keep the same behavior but reuse the cached registry.

packages/nvidia_nat_langchain/tests/eval/test_langsmith_judge.py (1)
106-107: Extract the repeated `fake_judge` stub into a pytest fixture. The same `fake_judge` async function (with identical or near-identical bodies) is defined ~14 times across the file. Per coding guidelines, frequently repeated code should be extracted into fixtures. A factory fixture can handle varying return values:

♻️ Suggested fixture

```python
@pytest.fixture(name="fake_judge")
def fixture_fake_judge():
    """Factory that returns a fake async judge with a configurable result."""

    def _make(result=None):
        if result is None:
            result = {"key": "score", "score": 1.0}

        async def _judge(*, inputs=None, outputs=None, reference_outputs=None, **kwargs):
            return result

        return _judge

    return _make
```

Then each test can do:

```python
async def test_builtin_prompt(self, eval_input_matching, fake_judge):
    judge = fake_judge({"key": "score", "score": 0.9, "comment": "Looks good"})
    ...
    with patch("openevals.llm.create_llm_as_judge", return_value=judge) as mock_create:
        ...
```

For the `test_extra_fields_forwarded_through_adapter` test that captures `kwargs`, you can keep the inline definition since it has a unique side-effect.

As per coding guidelines, "Any frequently repeated code should be extracted into pytest fixtures."

Also applies to: 200-201, 297-298, 370-371, 487-488, 559-561
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_langchain/tests/eval/test_langsmith_judge.py` around lines 106 - 107, Extract the repeated async fake_judge stub into a pytest factory fixture named fake_judge that returns an async function given a configurable result dict (defaulting to {"key":"score","score":1.0}), then update tests that currently define identical/near-identical async fake_judge functions (e.g., test_builtin_prompt, others around the noted ranges) to call the fixture to create the judge instance and patch openevals.llm.create_async_llm_as_judge to return that instance; keep the inline fake_judge in test_extra_fields_forwarded_through_adapter because it inspects kwargs and has unique side-effects.
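The signature-based convention detection discussed in the nitpicks above can be sketched like this. The function and return names are illustrative assumptions, not NAT's real implementation:

```python
import inspect

# Sketch: classify an evaluator by its parameter names. Using issubset (or a
# >= 2 overlap threshold) is stricter than a bare intersection check and avoids
# misclassifying helpers that share only one parameter name.
def detect_convention(fn) -> str:
    param_names = set(inspect.signature(fn).parameters)
    openevals_params = {"inputs", "outputs", "reference_outputs"}
    langsmith_params = {"run", "example"}
    if openevals_params <= param_names:
        return "openevals_function"
    if langsmith_params <= param_names:
        return "run_example_function"
    return "unknown"
```

A function with only `inputs` now falls through to `"unknown"` instead of being classified as an openevals evaluator.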
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/utils.py`:
- Around line 293-295: The heuristic that treats a dict as custom schema when
"key" is absent is fragile; change the check in the branch that calls
_handle_custom_schema_result so it distinguishes standard openevals dicts by
looking for the standard pair of keys (e.g., both "key" and "score") instead of
only testing for absence of "key". Concretely, in the conditional that currently
reads if score_field is not None and isinstance(result, dict) and "key" not in
result, update it to treat any dict that contains both "key" and "score" as a
standard openevals result (so it falls through to _handle_dict_result) and only
call _handle_custom_schema_result when score_field is set and the dict does not
have that standard ("key" and "score") shape; keep references to score_field,
result, _handle_custom_schema_result, and _handle_dict_result to locate and
modify the logic.
In `@packages/nvidia_nat_langchain/tests/eval/conftest.py`:
- Around line 57-74: Rename each fixture function to use the fixture_ prefix and
add the name argument to the decorator so external tests keep the same fixture
name; e.g., change the function eval_input_matching to
fixture_eval_input_matching and update its decorator to
`@pytest.fixture`(name="eval_input_matching"), and apply the same pattern to
eval_input_non_matching, eval_input_multi_item, item_with_context, and
eval_input_with_context so the runtime fixture names remain unchanged while
functions follow the fixture_ naming convention.
---
Duplicate comments:
In
`@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/langsmith_judge.py`:
- Around line 179-186: The create_kwargs dict unconditionally includes
"choices": config.choices which can be None, unlike "system" and
"few_shot_examples" that are only added when not None; update the construction
so "choices" is only added to create_kwargs when config.choices is not None
(same pattern used for system and few_shot_examples) so
create_async_llm_as_judge receives an absent key instead of an explicit None.
---
Nitpick comments:
In
`@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/langsmith_custom_evaluator.py`:
- Around line 100-108: The convention-detection currently uses
openevals_params.intersection(param_names) and
langsmith_params.intersection(param_names) which matches on any overlapping name
and can misclassify; change the checks to require the full parameter set by
using openevals_params.issubset(param_names) to return
_EvaluatorConvention.OPENEVALS_FUNCTION and
langsmith_params.issubset(param_names) to return
_EvaluatorConvention.RUN_EXAMPLE_FUNCTION (update any related docstring/tests to
reflect the stricter behavior if needed), referencing the variables
openevals_params, langsmith_params, param_names, and _EvaluatorConvention in
your change.
In
`@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/langsmith_evaluator.py`:
- Around line 36-53: The registry is recreated on each call; memoize
_get_registry to avoid repeated openevals imports and dict construction by
decorating _get_registry with functools.lru_cache(maxsize=1) (add the import for
lru_cache/functools), leaving the function body intact so callers like
_resolve_evaluator and the model validator keep the same behavior but reuse the
cached registry.
In `@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/utils.py`:
- Around line 117-151: The synthetic Run/Example created in
eval_input_item_to_run_and_example uses random UUIDs which breaks traceability
to the original item; change the code to derive deterministic IDs from item.id
(for example using uuid.uuid5 with a stable namespace) and use that derived UUID
for run.id, example.id and trace_id or alternatively include item.id in
Run/Example metadata/extras (e.g., a "nat_item_id" field) so downstream
evaluators can correlate back to the NAT item; update references to Run,
Example, and trace_id in the function accordingly.
- Around line 301-311: When unwrapping a dict with "results" you currently drop
all but the first item; change the block that sets result = results_list[0] to
detect multi-result batches (len(results_list) > 1), emit a warning including
the item_id and number of results, and delegate handling of the full list to the
existing _handle_list_result path (or otherwise aggregate the results similarly)
instead of silently discarding them; keep the single-result behavior (set result
= results_list[0]) for len == 1 and return the EvalOutputItem with the
empty-results error for an empty list as before.
In `@packages/nvidia_nat_langchain/tests/eval/conftest.py`:
- Around line 38-54: The helper function register_evaluator_ctx lacks a return
type hint; add one (-> EvaluatorInfo) to its signature and import EvaluatorInfo
at the top of the file so the public API is properly typed; keep existing
behavior (defaulting builder via make_mock_builder) unchanged.
In `@packages/nvidia_nat_langchain/tests/eval/test_langsmith_judge.py`:
- Around line 106-107: Extract the repeated async fake_judge stub into a pytest
factory fixture named fake_judge that returns an async function given a
configurable result dict (defaulting to {"key":"score","score":1.0}), then
update tests that currently define identical/near-identical async fake_judge
functions (e.g., test_builtin_prompt, others around the noted ranges) to call
the fixture to create the judge instance and patch
openevals.llm.create_async_llm_as_judge to return that instance; keep the inline
fake_judge in test_extra_fields_forwarded_through_adapter because it inspects
kwargs and has unique side-effects.
- Simplified the condition in `langsmith_result_to_eval_output_item` to remove unnecessary check for "key" in result.
- Renamed test fixture functions for clarity, adding explicit names to `eval_input_matching`, `eval_input_non_matching`, `eval_input_multi_item`, `item_with_context`, and `eval_input_with_context`.
- Updated corresponding test cases to reflect the new fixture names, ensuring consistency and readability.
Signed-off-by: Matthew Penn <mpenn@nvidia.com>
Actionable comments posted: 5
🧹 Nitpick comments (1)
packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/utils.py (1)
301-311: `EvaluationResults` batch silently drops all but the first result. LangSmith's documented multi-score format is `{ "results": [ {"key": "precision", "score": 0.8}, {"key": "recall", "score": 0.9}, ... ] }` — a common pattern when a single evaluator wants to report multiple metrics. The current code uses only `results_list[0]` and discards the rest, returning a single misleading score instead of aggregating.

This is asymmetric with `_handle_list_result`, which correctly averages a bare list. Consider delegating to `_handle_list_result` for consistency:

♻️ Proposed fix

```diff
     # EvaluationResults batch -- unwrap then fall through
     if isinstance(result, dict) and "results" in result:
         results_list = result["results"]
-        if results_list:
-            result = results_list[0]
-        else:
-            return EvalOutputItem(
-                id=item_id,
-                score=0.0,
-                reasoning={"error": "Empty EvaluationResults returned"},
-            )
+        return _handle_list_result(item_id, results_list)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/utils.py` around lines 301 - 311, The code that unwraps an EvaluationResults dict currently takes only results_list[0], dropping other metrics; change it to delegate the non-empty results_list to the existing _handle_list_result logic instead of selecting the first element so multiple metric entries are aggregated consistently; keep the empty-list branch that returns EvalOutputItem(id=item_id, score=0.0, reasoning={"error":"Empty EvaluationResults returned"}) and ensure you call _handle_list_result with the same parameters/context used elsewhere so keys like "score" in each dict are handled the same way.
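The batch-aggregation behavior proposed above can be sketched like this. `EvalOutputItem` is stubbed as a plain dict and the function names are assumptions for illustration:

```python
# Hedged sketch: delegate an EvaluationResults batch to list handling that
# averages scores instead of keeping only the first entry.
def handle_list_result(item_id, results_list):
    if not results_list:
        return {"id": item_id, "score": 0.0,
                "reasoning": {"error": "Empty EvaluationResults returned"}}
    scores = [r.get("score", 0.0) for r in results_list]
    return {"id": item_id, "score": sum(scores) / len(scores),
            "reasoning": {"results": results_list}}

def convert(item_id, result):
    # Unwrap the {"results": [...]} batch shape and aggregate every entry.
    if isinstance(result, dict) and "results" in result:
        return handle_list_result(item_id, result["results"])
    return {"id": item_id, "score": result.get("score", 0.0), "reasoning": result}
```

A two-metric batch of 0.8 and 0.9 then yields a 0.85 aggregate rather than a misleading 0.8.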
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/utils.py`:
- Around line 180-204: The error strings currently embedded only in the
reasoning dict must be moved into the EvalOutputItem.error field so downstream
consumers can detect failures; update _handle_custom_schema_result,
_handle_list_result, the empty-batch branch, and the fallback branch to set
EvalOutputItem.error to the error message (while still keeping rich context in
reasoning e.g., raw or parsed output), and adjust any test assertions in
test_utils.py that expect reasoning["error"] to instead check output.error (you
can keep reasoning["error"] for backwards info but ensure error is populated on
all failure paths).
- Around line 274-281: Update the docstring for the dispatcher in utils.py to
remove the stale heuristic text about "(no \"key\" field)" and instead state
that a dict is treated as a custom output_schema whenever score_field is not
None; keep the rest of the bullet list (bare list, EvaluationResults batch,
EvaluationResult object, plain dict, fallback) but change the first bullet to
reflect the current behavior: "Custom output_schema dict (treated as custom when
score_field is not None) with score_field". Reference the dispatcher docstring
and the score_field symbol when making this edit.
- Around line 71-95: The docstring for eval_input_item_to_openevals_kwargs is
missing documentation of the ValueError that can be raised when an extra_fields
key conflicts with a standard parameter; update the Raises section to list both
KeyError (for missing dataset fields) and ValueError (for conflicts when an
extra_fields key equals one of the reserved keys like 'inputs', 'outputs', or
'reference_outputs') so callers know both failure modes.
In `@packages/nvidia_nat_langchain/tests/eval/test_utils.py`:
- Around line 31-41: Rename the fixture function to follow the convention and
keep the external name the same: change the decorated function from sample_item
to fixture_sample_item and update the decorator to
`@pytest.fixture`(name="sample_item"); ensure this is applied to the fixture that
constructs the EvalInputItem (the function currently returning EvalInputItem
with id "test_1", input_obj "What is AI?", etc.) so tests can continue to
reference the fixture as sample_item while the function name uses the required
fixture_ prefix.
- Around line 319-344: The inner test function custom_schema_evaluator declares
unused parameters (inputs, outputs, reference_outputs, **kwargs) causing Ruff
ARG001 warnings; update the function signature in test_adapter_uses_score_field
to append a noqa suppression (add "# noqa: ARG001" on the def line for
custom_schema_evaluator) so the unused-arg linter warning is silenced while
keeping the explicit interface.
---
Nitpick comments:
In `@packages/nvidia_nat_langchain/src/nat/plugins/langchain/eval/utils.py`:
- Around line 301-311: The code that unwraps an EvaluationResults dict currently
takes only results_list[0], dropping other metrics; change it to delegate the
non-empty results_list to the existing _handle_list_result logic instead of
selecting the first element so multiple metric entries are aggregated
consistently; keep the empty-list branch that returns EvalOutputItem(id=item_id,
score=0.0, reasoning={"error":"Empty EvaluationResults returned"}) and ensure
you call _handle_list_result with the same parameters/context used elsewhere so
keys like "score" in each dict are handled the same way.
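The two failure modes documented above for `eval_input_item_to_openevals_kwargs` (KeyError for missing dataset fields, ValueError for conflicting `extra_fields` keys) can be sketched as follows; the item field names and helper name are hypothetical:

```python
# Hypothetical sketch of the documented failure modes. The reserved keys match
# the standard openevals parameters named in the review comment.
RESERVED_KEYS = {"inputs", "outputs", "reference_outputs"}

def build_openevals_kwargs(item: dict, extra_fields: dict) -> dict:
    kwargs = {
        "inputs": item["input_obj"],  # KeyError if the dataset lacks the field
        "outputs": item["output_obj"],
        "reference_outputs": item["expected_output_obj"],
    }
    conflicts = RESERVED_KEYS & set(extra_fields)
    if conflicts:
        raise ValueError(
            f"extra_fields keys conflict with standard parameters: {sorted(conflicts)}")
    kwargs.update(extra_fields)
    return kwargs
```

Documenting both exceptions in the Raises section lets callers distinguish a malformed dataset from a misconfigured evaluator.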
- Introduced the `openevals` package as a dependency in both `uv.lock` files for `nvidia_nat_langchain` and `nvidia_nat_vanna`, specifying version constraints.
- Updated evaluation utility functions to improve error handling by separating error messages from reasoning, ensuring clearer output in case of failures.
- Modified tests to reflect changes in error handling, ensuring consistency in how errors are reported.
Signed-off-by: Matthew Penn <mpenn@nvidia.com>
…y to use typing_extensions for compatibility. Signed-off-by: Matthew Penn <mpenn@nvidia.com>
Code looks good to me!
/merge |
Description
Closes #1585
Adds two new evaluator plugins to the LangChain integration package that allow users to leverage existing LangSmith SDK and openevals evaluators within NAT's evaluation harness:
By Submitting this PR I confirm:
Summary by CodeRabbit
New Features
Tests
Chores