
Add a simple evaluate_item endpoint #1138

Merged
rapids-bot[bot] merged 11 commits into NVIDIA:develop from AnuradhaKaruppiah:ak-eval-item-endpoint on Nov 15, 2025

Conversation

@AnuradhaKaruppiah
Contributor

@AnuradhaKaruppiah AnuradhaKaruppiah commented Oct 31, 2025

Description

This PR adds a new /evaluate/item endpoint for synchronous single-item evaluation, enabling quick testing and debugging of evaluators without running full dataset evaluations.

Changes

  1. API Endpoint (/evaluate/item)
  • Method: POST
  • Purpose: Evaluate a single item with a specified evaluator (synchronous response)
  • Use cases: Interactive testing, debugging, real-time evaluation
  • No Dask required: Works without the async job infrastructure
  2. Route Structure (Nested)
  • POST /evaluate/item → Single-item evaluation (sync, immediate response)
  3. Implementation
  • Added add_evaluate_item_route() in fastapi_front_end_plugin_worker.py
  • Evaluator initialization via WorkflowEvalBuilder on server startup
  • Request/response models: EvaluateItemRequest, EvaluateItemResponse
  4. Example Scripts
  • evaluate_single_item.py - Full version with trajectory processing
  • evaluate_single_item_simple.py - Simplified version without trajectory
  5. Tests (test_evaluate_endpoints.py)
  6. Docs deferred (evaluate_api.md will go through more changes before the next release)
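As a rough sketch of how a client might call the new endpoint: the top-level field names (`evaluator_name`, `item`) follow the request model named above, but the evaluator name and the inner item fields here are illustrative, not the exact `EvalInputItem` schema.

```python
import json

# Illustrative request body for POST /evaluate/item; the exact EvalInputItem
# fields may differ from this sketch.
payload = {
    "evaluator_name": "answer_accuracy",  # hypothetical evaluator name
    "item": {
        "id": "item-1",
        "input_obj": "What is CUDA?",
        "output_obj": "CUDA is NVIDIA's parallel computing platform.",
        "trajectory": [],  # the simple example script posts an empty trajectory
    },
}

# Serialize exactly as an HTTP client would before posting to /evaluate/item.
body = json.dumps(payload)
print(json.loads(body)["evaluator_name"])  # answer_accuracy
```

Because the endpoint is synchronous, the response body arrives in the same request/response cycle, which is what makes it handy for interactive debugging.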

By submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • New Features

    • Added a single-item evaluation endpoint (/evaluate/item) to evaluate an individual item with a named evaluator.
    • Added two example scripts showcasing end-to-end and minimal single-item evaluation workflows against the server, with streaming handling and evaluation reporting.
  • Tests

    • Added tests covering success, evaluator-not-found (404), evaluator runtime errors, and invalid-payload validation for the single-item evaluation endpoint.

Signed-off-by: Anuradha Karuppiah <anuradhak@nvidia.com>
@AnuradhaKaruppiah AnuradhaKaruppiah self-assigned this Oct 31, 2025
@AnuradhaKaruppiah AnuradhaKaruppiah added the improvement (Improvement to existing functionality), non-breaking (Non-breaking change), and DO NOT MERGE (PR should not be merged; see PR for details) labels Oct 31, 2025
@coderabbitai

coderabbitai bot commented Oct 31, 2025

Walkthrough

Adds single-item evaluation: two example scripts (full and simple), new FastAPI request/response models and a POST /evaluate/item endpoint, evaluator initialization/cleanup and route wiring in the plugin worker, and tests covering success, not-found, error, and validation cases.

Changes

Cohort / File(s) / Summary

  • Example scripts — examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py, examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py: New scripts demonstrating single-item evaluation workflows. The full script streams generation, extracts the final output and intermediate trajectory steps, validates them into Pydantic models, aggregates a trajectory, then posts to /evaluate/item. The simple script streams the final output only and posts an empty trajectory for evaluation.
  • FastAPI config / models — src/nat/front_ends/fastapi/fastapi_front_end_config.py: Adds EvaluateItemRequest, EvaluateItemResponse and associated EvalInputItem/EvalOutputItem usage; declares the new evaluate_item endpoint (POST /evaluate/item) in the front-end config.
  • FastAPI plugin worker — src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py: Adds an evaluator builder and storage (_eval_builder, _evaluators), async initialize_evaluators and cleanup_evaluators, wires evaluator initialization into configure, registers shutdown cleanup, and adds the add_evaluate_item_route handler to evaluate a single item (including error and missing-evaluator handling).
  • Tests — tests/nat/front_ends/fastapi/test_evaluate_endpoints.py: New tests and fixtures for /evaluate/item: success, evaluator-not-found (404), evaluator-raises-error, and invalid-payload (422) scenarios; mocks evaluator behavior and the session manager.
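The full script's streaming flow described above — read chunks, parse JSON, separate intermediate steps from the final output — can be sketched roughly as follows. The `data:` chunk framing and the `type`/`value` field names are assumptions for illustration, not the toolkit's actual wire format.

```python
import json


def parse_stream(lines: list[str]) -> tuple[str, list[dict]]:
    """Split a streamed response into (final_output, intermediate_steps)."""
    final_output = ""
    steps: list[dict] = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip keep-alives and blank lines
        try:
            chunk = json.loads(line[len("data: "):])
        except json.JSONDecodeError:
            continue  # tolerate a malformed chunk instead of aborting the stream
        if chunk.get("type") == "intermediate_step":
            steps.append(chunk)  # trajectory material for /evaluate/item
        else:
            final_output += chunk.get("value", "")
    return final_output, steps


out, steps = parse_stream([
    'data: {"type": "intermediate_step", "name": "search"}',
    'data: {"type": "token", "value": "Hello"}',
    'data: {"type": "token", "value": " world"}',
])
print(out)         # Hello world
print(len(steps))  # 1
```

The collected steps would then be validated into the toolkit's Pydantic models and posted as the item's trajectory; the simple script skips this and posts an empty list.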

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant FastAPI
    participant PluginWorker
    participant Evaluator

    Client->>FastAPI: POST /evaluate/item (EvaluateItemRequest)
    FastAPI->>PluginWorker: dispatch to evaluate_item handler
    PluginWorker->>PluginWorker: lookup evaluator in _evaluators

    alt evaluator found
        PluginWorker->>Evaluator: await evaluate_fn(item)
        Evaluator-->>PluginWorker: evaluation result (EvalOutputItem)
        PluginWorker-->>FastAPI: EvaluateItemResponse(success=true, result)
        FastAPI-->>Client: 200 OK
    else evaluator not found
        PluginWorker-->>FastAPI: 404 Not Found (error)
        FastAPI-->>Client: 404
    else evaluator raised
        PluginWorker-->>FastAPI: EvaluateItemResponse(success=false, error=...)
        FastAPI-->>Client: 200 OK (error payload)
    end
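Stripped of FastAPI, the branching in the diagram amounts to a dictionary lookup plus two error paths. A minimal stand-alone sketch (names and return shapes are illustrative, not the worker's actual signatures):

```python
def evaluate_item(evaluator_name, item, evaluators):
    """Return (http_status, body) following the diagram's three branches."""
    evaluator = evaluators.get(evaluator_name)
    if evaluator is None:
        # Branch 2: unknown evaluator -> 404
        return 404, {"detail": f"Evaluator '{evaluator_name}' not found"}
    try:
        result = evaluator(item)
    except Exception as e:
        # Branch 3: evaluation failure still returns 200, with success=False.
        return 200, {"success": False, "result": None, "error": str(e)}
    # Branch 1: success
    return 200, {"success": True, "result": result, "error": None}


evaluators = {"exact_match": lambda item: {"score": 1.0}}
print(evaluate_item("exact_match", {}, evaluators)[0])  # 200
print(evaluate_item("missing", {}, evaluators)[0])      # 404
```

Note the asymmetry the diagram encodes: a missing evaluator is an HTTP-level error, while an evaluator that raises is reported in-band via the response body.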

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Review initializer/cleanup logic in src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py for lifecycle correctness and resource leaks.
  • Validate EvaluateItemRequest / EvaluateItemResponse shapes against evaluator interfaces and serialization expectations.
  • Inspect error paths: missing evaluator (404) vs. evaluation failure (success=false) and HTTP status choices.
  • Check example scripts' streaming parsing and Pydantic model validation of intermediate steps.
  • Review tests for realistic mocks and sufficient coverage of edge cases.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 72.73%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed — The title 'Add a simple evaluate_item endpoint' is concise (35 chars), uses imperative mood, and directly describes the main change: adding a new POST /evaluate/item endpoint for single-item evaluation.

@AnuradhaKaruppiah AnuradhaKaruppiah removed the DO NOT MERGE (PR should not be merged; see PR for details) label Nov 15, 2025
@AnuradhaKaruppiah AnuradhaKaruppiah marked this pull request as ready for review November 15, 2025 00:40
@AnuradhaKaruppiah AnuradhaKaruppiah requested a review from a team as a code owner November 15, 2025 00:40

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (3)
examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py (1)

121-127: Simplify logger.exception calls and drop redundant exception arguments

In the JSON parsing and aiohttp.ClientError handlers you’re passing the exception object into logger.exception while also getting the stack trace from exc_info, which Ruff flags (TRY401) and is unnecessary.

For example:

except json.JSONDecodeError as e:
    logger.exception("Failed to parse response: %s", e)

and similar patterns at Lines 129–130 and 199–200.

You can simplify to:

-        except json.JSONDecodeError as e:
-            logger.exception("Failed to parse response: %s", e)
+        except json.JSONDecodeError:
+            logger.exception("Failed to parse response")

-    except aiohttp.ClientError as e:
-        logger.exception("Request failed: %s", e)
+    except aiohttp.ClientError:
+        logger.exception("Request failed")

-    except aiohttp.ClientError as e:
-        logger.exception("Evaluation request failed: %s", e)
+    except aiohttp.ClientError:
+        logger.exception("Evaluation request failed")

This matches the exception-handling guideline and removes redundant arguments.

Also applies to: 129-135, 199-202

examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py (1)

160-165: Tighten logger.exception usage to avoid redundant exception arguments

Similar to the simple script, several exception handlers pass the exception object into logger.exception, which already logs the stack trace and doesn’t need the extra argument. Ruff flags these as TRY401.

Suggested edits:

-        except json.JSONDecodeError as e:
-            logger.exception("Failed to parse generate response chunk: %s", e)
+        except json.JSONDecodeError:
+            logger.exception("Failed to parse generate response chunk")

-        except (json.JSONDecodeError, ValidationError) as e:
-            logger.exception("Failed to parse intermediate step: %s", e)
+        except (json.JSONDecodeError, ValidationError):
+            logger.exception("Failed to parse intermediate step")

-    except aiohttp.ClientError as e:
-        logger.exception("Request failed: %s", e)
+    except aiohttp.ClientError:
+        logger.exception("Request failed")

-    except aiohttp.ClientError as e:
-        logger.exception("Evaluation request failed: %s", e)
+    except aiohttp.ClientError:
+        logger.exception("Evaluation request failed")

This keeps the full stack trace while simplifying the logging calls.

Also applies to: 181-183, 185-191, 261-263
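The behavior these suggestions rely on — logger.exception() appending the traceback on its own, with no exception argument needed — can be verified with a minimal snippet:

```python
import io
import logging

# Capture log output in memory so we can inspect what logger.exception emits.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
logger = logging.getLogger("demo")
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep output confined to our in-memory handler

try:
    raise ValueError("boom")
except ValueError:
    # No exception argument needed: .exception() appends the traceback itself.
    logger.exception("Failed to parse response")

output = stream.getvalue()
print("Traceback" in output)       # True
print("ValueError: boom" in output)  # True
```

This is why passing the exception object as a format argument duplicates information: the message line carries the error text once, and the appended traceback carries it again.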

src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py (1)

239-270: Use logger.exception in eval-init/cleanup error paths and add return type hints

In initialize_evaluators and cleanup_evaluators, exceptions are caught and logged with logger.error, and the methods are unannotated:

async def initialize_evaluators(self, config: Config):
    ...
    except Exception as e:
        logger.error(f"Failed to initialize evaluators: {e}")
        self._evaluators = {}

async def cleanup_evaluators(self):
    ...
    except Exception as e:
        logger.error(f"Error cleaning up evaluator builder: {e}")

Given these blocks swallow the exception and don’t re-raise, the logging guideline suggests logger.exception to capture the stack trace. Also, adding explicit return types would align with the project’s typing guidance.

Suggested changes:

-    async def initialize_evaluators(self, config: Config):
+    async def initialize_evaluators(self, config: Config) -> None:
@@
-        except Exception as e:
-            logger.error(f"Failed to initialize evaluators: {e}")
+        except Exception:
+            logger.exception("Failed to initialize evaluators")
             # Don't fail startup, just log the error
             self._evaluators = {}
@@
-    async def cleanup_evaluators(self):
+    async def cleanup_evaluators(self) -> None:
@@
-            except Exception as e:
-                logger.error(f"Error cleaning up evaluator builder: {e}")
+            except Exception:
+                logger.exception("Error cleaning up evaluator builder")

You may also optionally add -> None to the new configure and add_evaluate_item_route methods for consistency.

Also applies to: 271-282

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cb53737 and de9e75d.

📒 Files selected for processing (5)
  • examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py (1 hunks)
  • examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py (1 hunks)
  • src/nat/front_ends/fastapi/fastapi_front_end_config.py (3 hunks)
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py (5 hunks)
  • tests/nat/front_ends/fastapi/test_evaluate_endpoints.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards. For Python code, follow PEP 20 and PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues. Python methods should use type hints for all parameters and return values. Example:

    def my_function(param1: int, param2: str) -> bool:
        pass

  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace, and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception() to capture the full stack trace information.

Documentation Review Instructions
  • Verify that documentation and comments are clear and comprehensive.
  • Verify that the documentation doesn't contain any TODOs, FIXMEs, or placeholder text like "lorem ipsum".
  • Verify that the documentation doesn't contain any offensive or outdated terms.
  • Verify that documentation and comments are free of spelling mistakes. Ensure the documentation doesn't contain any words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file; words that might appear to be spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

Misc.
  • All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0 and should contain an Apache License 2.0 header comment at the top of each file.
  • Confirm that copyright years are up to date whenever a file is changed.

Files:

  • tests/nat/front_ends/fastapi/test_evaluate_endpoints.py
  • examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py
  • examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py
  • src/nat/front_ends/fastapi/fastapi_front_end_config.py
tests/**/*.py

⚙️ CodeRabbit configuration file

tests/**/*.py:
  • Ensure that tests are comprehensive, cover edge cases, and validate the functionality of the code.
  • Test functions should be named using the test_ prefix, using snake_case.
  • Any frequently repeated code should be extracted into pytest fixtures.
  • Pytest fixtures should define the name argument when applying the pytest.fixture decorator. The fixture function being decorated should be named using the fixture_ prefix, using snake_case. Example:

    @pytest.fixture(name="my_fixture")
    def fixture_my_fixture():
        pass

Files:

  • tests/nat/front_ends/fastapi/test_evaluate_endpoints.py
examples/**/*

⚙️ CodeRabbit configuration file

examples/**/*:
  • This directory contains example code and usage scenarios for the toolkit; at a minimum, an example should contain a README.md or README.ipynb file.
  • If an example contains Python code, it should be placed in a subdirectory named src/ and should contain a pyproject.toml file. Optionally, it might also contain scripts in a scripts/ directory.
  • If an example contains YAML files, they should be placed in a subdirectory named configs/.
  • If an example contains sample data files, they should be placed in a subdirectory named data/ and should be checked into git-lfs.

Files:

  • examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py
  • examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py
src/nat/**/*

⚙️ CodeRabbit configuration file

This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.

Files:

  • src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py
  • src/nat/front_ends/fastapi/fastapi_front_end_config.py
🧬 Code graph analysis (5)
tests/nat/front_ends/fastapi/test_evaluate_endpoints.py (3)
src/nat/eval/evaluator/evaluator_model.py (3)
  • EvalInput (46-47)
  • EvalOutput (56-58)
  • EvalOutputItem (50-53)
src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py (2)
  • config (127-128)
  • add_evaluate_item_route (500-561)
src/nat/front_ends/fastapi/fastapi_front_end_config.py (1)
  • EndpointBase (156-179)
examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py (1)
src/nat/data_models/api_server.py (1)
  • ResponseIntermediateStep (482-494)
src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py (4)
src/nat/builder/eval_builder.py (3)
  • WorkflowEvalBuilder (43-166)
  • populate_builder (135-158)
  • get_evaluator (69-74)
src/nat/eval/evaluator/evaluator_model.py (1)
  • EvalInput (46-47)
src/nat/front_ends/fastapi/fastapi_front_end_config.py (2)
  • EvaluateItemRequest (138-141)
  • EvaluateItemResponse (144-148)
src/nat/runtime/session.py (3)
  • config (88-89)
  • SessionManager (47-226)
  • session (100-135)
examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py (1)
examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py (1)
  • main (267-294)
src/nat/front_ends/fastapi/fastapi_front_end_config.py (1)
src/nat/eval/evaluator/evaluator_model.py (2)
  • EvalInputItem (23-43)
  • EvalOutputItem (50-53)
🪛 Ruff (0.14.4)
examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py

1-1: The file is executable but no shebang is present

(EXE002)


164-164: Redundant exception object included in logging.exception call

(TRY401)


182-182: Redundant exception object included in logging.exception call

(TRY401)


186-186: Redundant exception object included in logging.exception call

(TRY401)


187-187: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


188-188: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


189-191: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


261-261: Redundant exception object included in logging.exception call

(TRY401)


262-262: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


263-263: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py

266-266: Do not catch blind exception: Exception

(BLE001)


267-267: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


277-277: Do not catch blind exception: Exception

(BLE001)


278-278: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


530-530: Use explicit conversion flag

Replace with conversion flag

(RUF010)

examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py

126-126: Redundant exception object included in logging.exception call

(TRY401)


130-130: Redundant exception object included in logging.exception call

(TRY401)


131-131: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


132-132: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


133-135: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


200-200: Redundant exception object included in logging.exception call

(TRY401)


201-201: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: CI Pipeline / Check
🔇 Additional comments (3)
tests/nat/front_ends/fastapi/test_evaluate_endpoints.py (1)

266-404: Evaluate-item test coverage looks solid and aligned with the API contract

The new fixtures and tests comprehensively exercise /evaluate/item:

  • Success path with score and reasoning.
  • 404 for unknown evaluator.
  • Evaluator exception mapped to success == False with an error message.
  • 422 for invalid payload.

Using a dedicated evaluate_item_client and evaluate_item_client_with_error keeps concerns nicely isolated, and asserting on both HTTP status codes and body fields matches the intended endpoint semantics. No issues from a correctness or style standpoint.

src/nat/front_ends/fastapi/fastapi_front_end_config.py (1)

30-31: Evaluate-item models and endpoint wiring are consistent and type-safe

EvaluateItemRequest/EvaluateItemResponse correctly reuse EvalInputItem and EvalOutputItem, and the evaluate_item endpoint definition matches the route path/method used by the worker and tests. The field descriptions are clear and will generate sensible OpenAPI docs. No changes needed here.

Also applies to: 138-149, 257-262

src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py (1)

300-303: Raise HTTPException(500) on evaluator errors; fix lazy logging formatting

The route declares 500 in its OpenAPI responses but never raises it. On exception, change lines 529–530 to:

-                except Exception as e:
-                    logger.exception(f"Error evaluating item with {request.evaluator_name}")
-                    return EvaluateItemResponse(success=False, result=None, error=f"Evaluation failed: {str(e)}")
+                except Exception as e:
+                    logger.exception("Error evaluating item with %s", request.evaluator_name)
+                    raise HTTPException(status_code=500, detail=f"Evaluation failed: {str(e)}") from e

This aligns with FastAPI best practices and the documented response schema. The 200+success=False pattern is an anti-pattern for single-item endpoints; use proper HTTP semantics (5xx for server errors) so clients, intermediaries, and observability tools respond correctly.

Likely an incorrect or invalid review comment.
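Whether or not the 500 suggestion is adopted, the `raise ... from e` mechanics it relies on can be sketched without FastAPI. The HTTPException below is a stand-in class, not FastAPI's, and the handler body is illustrative:

```python
# Stand-in for fastapi.HTTPException so this sketch runs without FastAPI installed.
class HTTPException(Exception):
    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def evaluate_item(evaluator_name: str, evaluators: dict) -> dict:
    """Illustrative handler body: 404 for an unknown evaluator, 500 for failures."""
    evaluator = evaluators.get(evaluator_name)
    if evaluator is None:
        raise HTTPException(status_code=404, detail=f"Evaluator '{evaluator_name}' not found")
    try:
        return {"success": True, "result": evaluator()}
    except Exception as e:
        # `from e` keeps the original error reachable via __cause__ for debugging.
        raise HTTPException(status_code=500, detail=f"Evaluation failed: {e}") from e


caught = None
try:
    evaluate_item("broken", {"broken": lambda: 1 / 0})
except HTTPException as err:
    caught = err

print(caught.status_code)               # 500
print(type(caught.__cause__).__name__)  # ZeroDivisionError
```

The `from e` chaining is what distinguishes this from a bare re-raise: the client-facing error is the HTTPException, while the root cause survives on `__cause__` for server-side logs.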


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py (1)

235-281: Tighten evaluator lifecycle logging and type hints

The evaluator lifecycle wiring is sound, but there are a couple of small cleanups worth doing:

  • In initialize_evaluators and cleanup_evaluators you catch Exception and don’t re-raise; per the project’s exception-handling guidelines and Ruff hints, use logger.exception(...) instead of logger.error(...) so the stack trace is preserved.
  • Consider adding explicit return type hints (-> None) to initialize_evaluators and cleanup_evaluators to match the “all methods typed” guideline.

Example diff:

-    async def initialize_evaluators(self, config: Config):
+    async def initialize_evaluators(self, config: Config) -> None:
@@
-        except Exception as e:
-            logger.error(f"Failed to initialize evaluators: {e}")
+        except Exception:
+            logger.exception("Failed to initialize evaluators")
             # Don't fail startup, just log the error
             self._evaluators = {}
@@
-    async def cleanup_evaluators(self):
+    async def cleanup_evaluators(self) -> None:
@@
-            except Exception as e:
-                logger.error(f"Error cleaning up evaluator builder: {e}")
+            except Exception:
+                logger.exception("Error cleaning up evaluator builder")

If you intend to keep the broad except Exception here to avoid failing startup/shutdown, a brief comment or # noqa: BLE001 with rationale would also silence Ruff without changing behavior.

🧹 Nitpick comments (1)
src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py (1)

305-305: Minor note: extra builder.build() for the new route

Adding await self.add_evaluate_item_route(app, SessionManager(await builder.build())) follows the existing pattern used for the other routes; it does mean one more builder.build() call at startup. If workflow construction becomes expensive, consider sharing a single SessionManager instance across related routes in a future cleanup, but this is fine for now.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between de9e75d and dedcbf7.

📒 Files selected for processing (1)
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*

⚙️ CodeRabbit configuration file


Files:

  • src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py
src/nat/**/*

⚙️ CodeRabbit configuration file

This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.

Files:

  • src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py
🧬 Code graph analysis (1)
src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py (5)
src/nat/builder/eval_builder.py (3)
  • WorkflowEvalBuilder (43-166)
  • populate_builder (135-158)
  • get_evaluator (69-74)
src/nat/eval/evaluator/evaluator_model.py (1)
  • EvalInput (46-47)
src/nat/front_ends/fastapi/fastapi_front_end_config.py (2)
  • EvaluateItemRequest (138-141)
  • EvaluateItemResponse (144-148)
src/nat/runtime/session.py (3)
  • config (88-89)
  • SessionManager (47-226)
  • session (100-135)
src/nat/eval/rag_evaluator/register.py (1)
  • evaluate_fn (110-117)
🪛 Ruff (0.14.4)
src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py

266-266: Do not catch blind exception: Exception

(BLE001)


267-267: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


277-277: Do not catch blind exception: Exception

(BLE001)


278-278: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


533-533: Use explicit conversion flag

Replace with conversion flag

(RUF010)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: CI Pipeline / Check
🔇 Additional comments (2)
src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py (2)

42-43: New eval-related imports look correct and consistent

The added imports (WorkflowEvalBuilder, EvaluatorInfo, EvalInput, EvaluateItemRequest/Response) are all used in the new evaluator lifecycle and /evaluate/item endpoint and fit the existing import style. No changes needed here.

Also applies to: 56-56, 62-63


292-297: Evaluator init + shutdown handler wiring looks good

Initializing evaluators in configure and registering cleanup_evaluators on FastAPI "shutdown" closes the loop on the async context for WorkflowEvalBuilder and avoids long-lived resource leaks. This also addresses the previous review’s concern about missing cleanup of the eval builder.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py (1)

129-129: Remove redundant exception objects from logger.exception() calls.

The logger.exception() method automatically includes the full exception traceback in the log output, so passing the exception object as a format argument is redundant and can result in the exception appearing twice in the logs.

Apply this diff:

-                            logger.exception("Failed to parse response: %s", e)
+                            logger.exception("Failed to parse response")
-            logger.exception("Request failed: %s", e)
+            logger.exception("Request failed")
-            logger.exception("Evaluation request failed: %s", e)
+            logger.exception("Evaluation request failed")

Also applies to: 133-133, 203-203

examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py (1)

164-164: Remove redundant exception objects from logger.exception() calls.

The logger.exception() method automatically includes the full exception traceback, so passing the exception object as a format argument is redundant and can result in duplicate exception information in the logs.

Apply this diff:

-                            logger.exception("Failed to parse generate response chunk: %s", e)
+                            logger.exception("Failed to parse generate response chunk")
-                            logger.exception("Failed to parse intermediate step: %s", e)
+                            logger.exception("Failed to parse intermediate step")
-            logger.exception("Request failed: %s", e)
+            logger.exception("Request failed")
-            logger.exception("Evaluation request failed: %s", e)
+            logger.exception("Evaluation request failed")

Also applies to: 182-182, 186-186, 261-261

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dedcbf7 and d73e81f.

📒 Files selected for processing (2)
  • examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py (1 hunks)
  • examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards.
  • For Python code, follow PEP 20 and PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues.
  • Python methods should use type hints for all parameters and return values.
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.
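The two exception-handling patterns in this configuration can be sketched as follows; the function names here are illustrative, not from the toolkit:

```python
# Sketch of the two logging patterns the review config describes:
# logger.error() + bare raise when re-raising, logger.exception() when
# the error is handled locally, so the traceback is logged exactly once.
import logging

logger = logging.getLogger(__name__)

def parse_item(raw: str) -> int:
    try:
        return int(raw)
    except ValueError:
        # Re-raising: log without a traceback, then bare `raise` to
        # preserve the original stack trace for the caller to handle.
        logger.error("Could not parse %r as an int", raw)
        raise

def parse_or_default(raw: str, default: int = 0) -> int:
    try:
        return int(raw)
    except ValueError:
        # Not re-raising: logger.exception() records the traceback here,
        # since no caller will ever see this error.
        logger.exception("Could not parse %r; using default", raw)
        return default
```

The split avoids the common failure mode of logging a full traceback at every level of the call stack as an exception propagates.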

Documentation Review Instructions

  • Verify that documentation and comments are clear and comprehensive.
  • Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum".
  • Verify that the documentation doesn't contain any offensive or outdated terms.
  • Verify that documentation and comments are free of spelling mistakes. Ensure the documentation doesn't
    contain any words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file; words that might
    appear to be spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt
    file are OK.

Misc.

  • All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0,
    and should contain an Apache License 2.0 header comment at the top of each file.
  • Confirm that copyright years are up-to-date whenever a file is changed.

Files:

  • examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py
  • examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py
examples/**/*

⚙️ CodeRabbit configuration file

examples/**/*:

  • This directory contains example code and usage scenarios for the toolkit; at a minimum an example should
    contain a README.md or README.ipynb file.
  • If an example contains Python code, it should be placed in a subdirectory named src/ and should
    contain a pyproject.toml file. Optionally, it might also contain scripts in a scripts/ directory.
  • If an example contains YAML files, they should be placed in a subdirectory named configs/.
  • If an example contains sample data files, they should be placed in a subdirectory named data/, and should
    be checked into git-lfs.

Files:

  • examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py
  • examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py
🧬 Code graph analysis (2)
examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py (3)
src/nat/data_models/api_server.py (1)
  • ResponseIntermediateStep (482-494)
src/nat/data_models/intermediate_step.py (3)
  • IntermediateStep (235-310)
  • IntermediateStepPayload (130-232)
  • UUID (301-302)
examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py (1)
  • main (208-233)
examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py (1)
examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py (1)
  • main (267-294)
🪛 Ruff (0.14.4)
examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py

1-1: The file is executable but no shebang is present

(EXE002)


164-164: Redundant exception object included in logging.exception call

(TRY401)


182-182: Redundant exception object included in logging.exception call

(TRY401)


186-186: Redundant exception object included in logging.exception call

(TRY401)


187-187: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


188-188: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


189-191: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


261-261: Redundant exception object included in logging.exception call

(TRY401)


262-262: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


263-263: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py

129-129: Redundant exception object included in logging.exception call

(TRY401)


133-133: Redundant exception object included in logging.exception call

(TRY401)


134-134: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


135-135: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


136-138: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


203-203: Redundant exception object included in logging.exception call

(TRY401)


204-204: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: CI Pipeline / Check

@AnuradhaKaruppiah
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 939fce2 into NVIDIA:develop Nov 15, 2025
17 checks passed
@AnuradhaKaruppiah AnuradhaKaruppiah deleted the ak-eval-item-endpoint branch November 17, 2025 22:09
saglave pushed a commit to snps-scm13/SNPS-NeMo-Agent-Toolkit that referenced this pull request Dec 11, 2025
This PR adds a new /evaluate/item endpoint for synchronous single-item evaluation, enabling quick testing and debugging of evaluators without running full dataset evaluations.

**Changes**
1. API Endpoint (/evaluate/item)
- Method: POST
- Purpose: Evaluate a single item with a specified evaluator (synchronous response)
- Use cases: Interactive testing, debugging, real-time evaluation
- No Dask required: Works without async job infrastructure

2. Route Structure (Nested)
- POST   /evaluate/item         → Single item evaluation (sync, immediate response)

3. Implementation
- Added add_evaluate_item_route() in fastapi_front_end_plugin_worker.py
- Evaluator initialization via WorkflowEvalBuilder on server startup
- Request/response models: EvaluateItemRequest, EvaluateItemResponse

4. Example Scripts
- evaluate_single_item.py - Full version with trajectory processing
- evaluate_single_item_simple.py - Simplified version without trajectory

5. Tests (test_evaluate_endpoints.py)
6. Docs deferred (evaluate_api.md will be going through more changes before next rel)
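A call to the new endpoint might look like the sketch below. The `/evaluate/item` path comes from this PR, but since the API docs are deferred, the request field names (`evaluator_name`, `item`) and the base URL are assumptions, not the confirmed `EvaluateItemRequest` schema:

```python
# Hypothetical client sketch for POST /evaluate/item; field names are
# placeholders, not the confirmed EvaluateItemRequest schema.
import json
import urllib.request

def build_request(evaluator_name: str, item: dict) -> dict:
    """Assemble a single-item evaluation payload (assumed schema)."""
    return {"evaluator_name": evaluator_name, "item": item}

def evaluate_item(base_url: str, payload: dict) -> dict:
    """POST the payload to /evaluate/item and decode the JSON response."""
    req = urllib.request.Request(
        f"{base_url}/evaluate/item",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

payload = build_request("rag_accuracy", {"question": "What is NAT?", "answer": "A toolkit."})
print(json.dumps(payload))
# With a running server: evaluate_item("http://localhost:8000", payload)
```

Because the endpoint is synchronous, the evaluator's score comes back in the HTTP response itself, with no Dask job to poll.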

## By Submitting this PR I confirm:
- I am familiar with the [Contributing Guidelines](https://github.com/NVIDIA/NeMo-Agent-Toolkit/blob/develop/docs/source/resources/contributing.md).
- We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
  - Any contribution which contains commits that are not Signed-Off will not be accepted.
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.

## Summary by CodeRabbit

* **New Features**
  * Added a single-item evaluation endpoint (/evaluate/item) to evaluate an individual item with a named evaluator.
  * Added two example scripts showcasing end-to-end and minimal single-item evaluation workflows against the server, with streaming handling and evaluation reporting.

* **Tests**
  * Added tests covering success, evaluator-not-found (404), evaluator runtime errors, and invalid-payload validation for the single-item evaluation endpoint.

Authors:
  - Anuradha Karuppiah (https://github.com/AnuradhaKaruppiah)

Approvers:
  - David Gardner (https://github.com/dagardner-nv)
  - Michael Demoret (https://github.com/mdemoret-nv)

URL: NVIDIA#1138
Signed-off-by: Sangharsh Aglave <aglave@synopsys.com>
Labels

improvement: Improvement to existing functionality
non-breaking: Non-breaking change

Projects

None yet

Development

3 participants