
Add a simple evaluate_item endpoint #1138

Merged
rapids-bot[bot] merged 11 commits into NVIDIA:develop from AnuradhaKaruppiah:ak-eval-item-endpoint on Nov 15, 2025

Conversation

@AnuradhaKaruppiah
Contributor

@AnuradhaKaruppiah AnuradhaKaruppiah commented Oct 31, 2025

Description

This PR adds a new /evaluate/item endpoint for synchronous single-item evaluation, enabling quick testing and debugging of evaluators without running full dataset evaluations.

Changes

  1. API Endpoint (/evaluate/item)
  • Method: POST
  • Purpose: Evaluate a single item with a specified evaluator (synchronous response)
  • Use cases: Interactive testing, debugging, real-time evaluation
  • No Dask required: Works without the async job infrastructure
  2. Route Structure (Nested)
  • POST /evaluate/item → Single-item evaluation (sync, immediate response)
  3. Implementation
  • Added add_evaluate_item_route() in fastapi_front_end_plugin_worker.py
  • Evaluator initialization via WorkflowEvalBuilder on server startup
  • Request/response models: EvaluateItemRequest, EvaluateItemResponse
  4. Example Scripts
  • evaluate_single_item.py - Full version with trajectory processing
  • evaluate_single_item_simple.py - Simplified version without trajectory
  5. Tests (test_evaluate_endpoints.py)
  6. Docs deferred (evaluate_api.md will go through more changes before the next release)
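As a rough sketch of how a client might call the new endpoint: the top-level field names (`evaluator_name`, `item`) follow the request model named above, but the evaluator name and the inner item fields here are illustrative, not the exact `EvalInputItem` schema.

```python
import json

# Illustrative request body for POST /evaluate/item; the exact EvalInputItem
# fields may differ from this sketch.
payload = {
    "evaluator_name": "answer_accuracy",  # hypothetical evaluator name
    "item": {
        "id": "item-1",
        "input_obj": "What is CUDA?",
        "output_obj": "CUDA is NVIDIA's parallel computing platform.",
        "trajectory": [],  # the simple example script posts an empty trajectory
    },
}

# Serialize exactly as an HTTP client would before posting to /evaluate/item.
body = json.dumps(payload)
print(json.loads(body)["evaluator_name"])  # answer_accuracy
```

Because the endpoint is synchronous, the response body arrives in the same request/response cycle, which is what makes it handy for interactive debugging.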

By submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • New Features

    • Added a single-item evaluation endpoint (/evaluate/item) to evaluate an individual item with a named evaluator.
    • Added two example scripts showcasing end-to-end and minimal single-item evaluation workflows against the server, with streaming handling and evaluation reporting.
  • Tests

    • Added tests covering success, evaluator-not-found (404), evaluator runtime errors, and invalid-payload validation for the single-item evaluation endpoint.

Signed-off-by: Anuradha Karuppiah <anuradhak@nvidia.com>
@AnuradhaKaruppiah AnuradhaKaruppiah self-assigned this Oct 31, 2025
@AnuradhaKaruppiah AnuradhaKaruppiah added the improvement (Improvement to existing functionality), non-breaking (Non-breaking change), and DO NOT MERGE (PR should not be merged; see PR for details) labels Oct 31, 2025
@coderabbitai

coderabbitai bot commented Oct 31, 2025

Walkthrough

Adds single-item evaluation: two example scripts (full and simple), new FastAPI request/response models and a POST /evaluate/item endpoint, evaluator initialization/cleanup and route wiring in the plugin worker, and tests covering success, not-found, error, and validation cases.

Changes

Cohort / File(s) / Summary

  • Example scripts — examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py, examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py: New scripts demonstrating single-item evaluation workflows. The full script streams generation, extracts the final output and intermediate trajectory steps, validates them into Pydantic models, aggregates a trajectory, then posts to /evaluate/item. The simple script streams the final output only and posts an empty trajectory for evaluation.
  • FastAPI config / models — src/nat/front_ends/fastapi/fastapi_front_end_config.py: Adds EvaluateItemRequest, EvaluateItemResponse and associated EvalInputItem/EvalOutputItem usage; declares the new evaluate_item endpoint (POST /evaluate/item) in the front-end config.
  • FastAPI plugin worker — src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py: Adds an evaluator builder and storage (_eval_builder, _evaluators), async initialize_evaluators and cleanup_evaluators, wires evaluator initialization into configure, registers shutdown cleanup, and adds the add_evaluate_item_route handler to evaluate a single item (including error and missing-evaluator handling).
  • Tests — tests/nat/front_ends/fastapi/test_evaluate_endpoints.py: New tests and fixtures for /evaluate/item: success, evaluator-not-found (404), evaluator-raises-error, and invalid-payload (422) scenarios; mocks evaluator behavior and the session manager.
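The full script's streaming flow described above — read chunks, parse JSON, separate intermediate steps from the final output — can be sketched roughly as follows. The `data:` chunk framing and the `type`/`value` field names are assumptions for illustration, not the toolkit's actual wire format.

```python
import json


def parse_stream(lines: list[str]) -> tuple[str, list[dict]]:
    """Split a streamed response into (final_output, intermediate_steps)."""
    final_output = ""
    steps: list[dict] = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip keep-alives and blank lines
        try:
            chunk = json.loads(line[len("data: "):])
        except json.JSONDecodeError:
            continue  # tolerate a malformed chunk instead of aborting the stream
        if chunk.get("type") == "intermediate_step":
            steps.append(chunk)  # trajectory material for /evaluate/item
        else:
            final_output += chunk.get("value", "")
    return final_output, steps


out, steps = parse_stream([
    'data: {"type": "intermediate_step", "name": "search"}',
    'data: {"type": "token", "value": "Hello"}',
    'data: {"type": "token", "value": " world"}',
])
print(out)         # Hello world
print(len(steps))  # 1
```

The collected steps would then be validated into the toolkit's Pydantic models and posted as the item's trajectory; the simple script skips this and posts an empty list.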

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant FastAPI
    participant PluginWorker
    participant Evaluator

    Client->>FastAPI: POST /evaluate/item (EvaluateItemRequest)
    FastAPI->>PluginWorker: dispatch to evaluate_item handler
    PluginWorker->>PluginWorker: lookup evaluator in _evaluators

    alt evaluator found
        PluginWorker->>Evaluator: await evaluate_fn(item)
        Evaluator-->>PluginWorker: evaluation result (EvalOutputItem)
        PluginWorker-->>FastAPI: EvaluateItemResponse(success=true, result)
        FastAPI-->>Client: 200 OK
    else evaluator not found
        PluginWorker-->>FastAPI: 404 Not Found (error)
        FastAPI-->>Client: 404
    else evaluator raised
        PluginWorker-->>FastAPI: EvaluateItemResponse(success=false, error=...)
        FastAPI-->>Client: 200 OK (error payload)
    end
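Stripped of FastAPI, the branching in the diagram amounts to a dictionary lookup plus two error paths. A minimal stand-alone sketch (names and return shapes are illustrative, not the worker's actual signatures):

```python
def evaluate_item(evaluator_name, item, evaluators):
    """Return (http_status, body) following the diagram's three branches."""
    evaluator = evaluators.get(evaluator_name)
    if evaluator is None:
        # Branch 2: unknown evaluator -> 404
        return 404, {"detail": f"Evaluator '{evaluator_name}' not found"}
    try:
        result = evaluator(item)
    except Exception as e:
        # Branch 3: evaluation failure still returns 200, with success=False.
        return 200, {"success": False, "result": None, "error": str(e)}
    # Branch 1: success
    return 200, {"success": True, "result": result, "error": None}


evaluators = {"exact_match": lambda item: {"score": 1.0}}
print(evaluate_item("exact_match", {}, evaluators)[0])  # 200
print(evaluate_item("missing", {}, evaluators)[0])      # 404
```

Note the asymmetry the diagram encodes: a missing evaluator is an HTTP-level error, while an evaluator that raises is reported in-band via the response body.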

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Review initializer/cleanup logic in src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py for lifecycle correctness and resource leaks.
  • Validate EvaluateItemRequest / EvaluateItemResponse shapes against evaluator interfaces and serialization expectations.
  • Inspect error paths: missing evaluator (404) vs. evaluation failure (success=false) and HTTP status choices.
  • Check example scripts' streaming parsing and Pydantic model validation of intermediate steps.
  • Review tests for realistic mocks and sufficient coverage of edge cases.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 72.73%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed — The title 'Add a simple evaluate_item endpoint' is concise (35 chars), uses imperative mood, and directly describes the main change: adding a new POST /evaluate/item endpoint for single-item evaluation.

@AnuradhaKaruppiah AnuradhaKaruppiah removed the DO NOT MERGE (PR should not be merged; see PR for details) label Nov 15, 2025
@AnuradhaKaruppiah AnuradhaKaruppiah marked this pull request as ready for review November 15, 2025 00:40
@AnuradhaKaruppiah AnuradhaKaruppiah requested a review from a team as a code owner November 15, 2025 00:40

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (3)
examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py (1)

121-127: Simplify logger.exception calls and drop redundant exception arguments

In the JSON parsing and aiohttp.ClientError handlers you’re passing the exception object into logger.exception while also getting the stack trace from exc_info, which Ruff flags (TRY401) and is unnecessary.

For example:

except json.JSONDecodeError as e:
    logger.exception("Failed to parse response: %s", e)

and similar patterns at Lines 129–130 and 199–200.

You can simplify to:

-        except json.JSONDecodeError as e:
-            logger.exception("Failed to parse response: %s", e)
+        except json.JSONDecodeError:
+            logger.exception("Failed to parse response")

-    except aiohttp.ClientError as e:
-        logger.exception("Request failed: %s", e)
+    except aiohttp.ClientError:
+        logger.exception("Request failed")

-    except aiohttp.ClientError as e:
-        logger.exception("Evaluation request failed: %s", e)
+    except aiohttp.ClientError:
+        logger.exception("Evaluation request failed")

This matches the exception-handling guideline and removes redundant arguments.

Also applies to: 129-135, 199-202

examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py (1)

160-165: Tighten logger.exception usage to avoid redundant exception arguments

Similar to the simple script, several exception handlers pass the exception object into logger.exception, which already logs the stack trace and doesn’t need the extra argument. Ruff flags these as TRY401.

Suggested edits:

-        except json.JSONDecodeError as e:
-            logger.exception("Failed to parse generate response chunk: %s", e)
+        except json.JSONDecodeError:
+            logger.exception("Failed to parse generate response chunk")

-        except (json.JSONDecodeError, ValidationError) as e:
-            logger.exception("Failed to parse intermediate step: %s", e)
+        except (json.JSONDecodeError, ValidationError):
+            logger.exception("Failed to parse intermediate step")

-    except aiohttp.ClientError as e:
-        logger.exception("Request failed: %s", e)
+    except aiohttp.ClientError:
+        logger.exception("Request failed")

-    except aiohttp.ClientError as e:
-        logger.exception("Evaluation request failed: %s", e)
+    except aiohttp.ClientError:
+        logger.exception("Evaluation request failed")

This keeps the full stack trace while simplifying the logging calls.

Also applies to: 181-183, 185-191, 261-263
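The behavior these suggestions rely on — logger.exception() appending the traceback on its own, with no exception argument needed — can be verified with a minimal snippet:

```python
import io
import logging

# Capture log output in memory so we can inspect what logger.exception emits.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
logger = logging.getLogger("demo")
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep output confined to our in-memory handler

try:
    raise ValueError("boom")
except ValueError:
    # No exception argument needed: .exception() appends the traceback itself.
    logger.exception("Failed to parse response")

output = stream.getvalue()
print("Traceback" in output)       # True
print("ValueError: boom" in output)  # True
```

This is why passing the exception object as a format argument duplicates information: the message line carries the error text once, and the appended traceback carries it again.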

src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py (1)

239-270: Use logger.exception in eval-init/cleanup error paths and add return type hints

In initialize_evaluators and cleanup_evaluators, exceptions are caught and logged with logger.error, and the methods are unannotated:

async def initialize_evaluators(self, config: Config):
    ...
    except Exception as e:
        logger.error(f"Failed to initialize evaluators: {e}")
        self._evaluators = {}

async def cleanup_evaluators(self):
    ...
    except Exception as e:
        logger.error(f"Error cleaning up evaluator builder: {e}")

Given these blocks swallow the exception and don’t re-raise, the logging guideline suggests logger.exception to capture the stack trace. Also, adding explicit return types would align with the project’s typing guidance.

Suggested changes:

-    async def initialize_evaluators(self, config: Config):
+    async def initialize_evaluators(self, config: Config) -> None:
@@
-        except Exception as e:
-            logger.error(f"Failed to initialize evaluators: {e}")
+        except Exception:
+            logger.exception("Failed to initialize evaluators")
             # Don't fail startup, just log the error
             self._evaluators = {}
@@
-    async def cleanup_evaluators(self):
+    async def cleanup_evaluators(self) -> None:
@@
-            except Exception as e:
-                logger.error(f"Error cleaning up evaluator builder: {e}")
+            except Exception:
+                logger.exception("Error cleaning up evaluator builder")

You may also optionally add -> None to the new configure and add_evaluate_item_route methods for consistency.

Also applies to: 271-282

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cb53737 and de9e75d.

📒 Files selected for processing (5)
  • examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py (1 hunks)
  • examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py (1 hunks)
  • src/nat/front_ends/fastapi/fastapi_front_end_config.py (3 hunks)
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py (5 hunks)
  • tests/nat/front_ends/fastapi/test_evaluate_endpoints.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards. For Python code, follow PEP 20 and PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues. Python methods should use type hints for all parameters and return values. Example:

    def my_function(param1: int, param2: str) -> bool:
        pass

  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace, and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception() to capture the full stack trace information.

Documentation Review Instructions
  • Verify that documentation and comments are clear and comprehensive.
  • Verify that the documentation doesn't contain any TODOs, FIXMEs, or placeholder text like "lorem ipsum".
  • Verify that the documentation doesn't contain any offensive or outdated terms.
  • Verify that documentation and comments are free of spelling mistakes. Ensure the documentation doesn't contain any words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file; words that might appear to be spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

Misc.
  • All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0 and should contain an Apache License 2.0 header comment at the top of each file.
  • Confirm that copyright years are up to date whenever a file is changed.

Files:

  • tests/nat/front_ends/fastapi/test_evaluate_endpoints.py
  • examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py
  • examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py
  • src/nat/front_ends/fastapi/fastapi_front_end_config.py
tests/**/*.py

⚙️ CodeRabbit configuration file

tests/**/*.py:
  • Ensure that tests are comprehensive, cover edge cases, and validate the functionality of the code.
  • Test functions should be named using the test_ prefix, using snake_case.
  • Any frequently repeated code should be extracted into pytest fixtures.
  • Pytest fixtures should define the name argument when applying the pytest.fixture decorator. The fixture function being decorated should be named using the fixture_ prefix, using snake_case. Example:

    @pytest.fixture(name="my_fixture")
    def fixture_my_fixture():
        pass

Files:

  • tests/nat/front_ends/fastapi/test_evaluate_endpoints.py
examples/**/*

⚙️ CodeRabbit configuration file

examples/**/*:
  • This directory contains example code and usage scenarios for the toolkit; at a minimum, an example should contain a README.md or README.ipynb file.
  • If an example contains Python code, it should be placed in a subdirectory named src/ and should contain a pyproject.toml file. Optionally, it might also contain scripts in a scripts/ directory.
  • If an example contains YAML files, they should be placed in a subdirectory named configs/.
  • If an example contains sample data files, they should be placed in a subdirectory named data/ and should be checked into git-lfs.

Files:

  • examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py
  • examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py
src/nat/**/*

⚙️ CodeRabbit configuration file

This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.

Files:

  • src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py
  • src/nat/front_ends/fastapi/fastapi_front_end_config.py
🧬 Code graph analysis (5)
tests/nat/front_ends/fastapi/test_evaluate_endpoints.py (3)
src/nat/eval/evaluator/evaluator_model.py (3)
  • EvalInput (46-47)
  • EvalOutput (56-58)
  • EvalOutputItem (50-53)
src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py (2)
  • config (127-128)
  • add_evaluate_item_route (500-561)
src/nat/front_ends/fastapi/fastapi_front_end_config.py (1)
  • EndpointBase (156-179)
examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py (1)
src/nat/data_models/api_server.py (1)
  • ResponseIntermediateStep (482-494)
src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py (4)
src/nat/builder/eval_builder.py (3)
  • WorkflowEvalBuilder (43-166)
  • populate_builder (135-158)
  • get_evaluator (69-74)
src/nat/eval/evaluator/evaluator_model.py (1)
  • EvalInput (46-47)
src/nat/front_ends/fastapi/fastapi_front_end_config.py (2)
  • EvaluateItemRequest (138-141)
  • EvaluateItemResponse (144-148)
src/nat/runtime/session.py (3)
  • config (88-89)
  • SessionManager (47-226)
  • session (100-135)
examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py (1)
examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py (1)
  • main (267-294)
src/nat/front_ends/fastapi/fastapi_front_end_config.py (1)
src/nat/eval/evaluator/evaluator_model.py (2)
  • EvalInputItem (23-43)
  • EvalOutputItem (50-53)
🪛 Ruff (0.14.4)
examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py

1-1: The file is executable but no shebang is present

(EXE002)


164-164: Redundant exception object included in logging.exception call

(TRY401)


182-182: Redundant exception object included in logging.exception call

(TRY401)


186-186: Redundant exception object included in logging.exception call

(TRY401)


187-187: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


188-188: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


189-191: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


261-261: Redundant exception object included in logging.exception call

(TRY401)


262-262: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


263-263: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py

266-266: Do not catch blind exception: Exception

(BLE001)


267-267: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


277-277: Do not catch blind exception: Exception

(BLE001)


278-278: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


530-530: Use explicit conversion flag

Replace with conversion flag

(RUF010)

examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py

126-126: Redundant exception object included in logging.exception call

(TRY401)


130-130: Redundant exception object included in logging.exception call

(TRY401)


131-131: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


132-132: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


133-135: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


200-200: Redundant exception object included in logging.exception call

(TRY401)


201-201: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: CI Pipeline / Check
🔇 Additional comments (3)
tests/nat/front_ends/fastapi/test_evaluate_endpoints.py (1)

266-404: Evaluate-item test coverage looks solid and aligned with the API contract

The new fixtures and tests comprehensively exercise /evaluate/item:

  • Success path with score and reasoning.
  • 404 for unknown evaluator.
  • Evaluator exception mapped to success == False with an error message.
  • 422 for invalid payload.

Using a dedicated evaluate_item_client and evaluate_item_client_with_error keeps concerns nicely isolated, and asserting on both HTTP status codes and body fields matches the intended endpoint semantics. No issues from a correctness or style standpoint.

src/nat/front_ends/fastapi/fastapi_front_end_config.py (1)

30-31: Evaluate-item models and endpoint wiring are consistent and type-safe

EvaluateItemRequest/EvaluateItemResponse correctly reuse EvalInputItem and EvalOutputItem, and the evaluate_item endpoint definition matches the route path/method used by the worker and tests. The field descriptions are clear and will generate sensible OpenAPI docs. No changes needed here.

Also applies to: 138-149, 257-262

src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py (1)

300-303: Raise HTTPException(500) on evaluator errors; fix lazy logging formatting

The route declares 500 in its OpenAPI responses but never raises it. On exception, change lines 529–530 to:

-                except Exception as e:
-                    logger.exception(f"Error evaluating item with {request.evaluator_name}")
-                    return EvaluateItemResponse(success=False, result=None, error=f"Evaluation failed: {str(e)}")
+                except Exception as e:
+                    logger.exception("Error evaluating item with %s", request.evaluator_name)
+                    raise HTTPException(status_code=500, detail=f"Evaluation failed: {str(e)}") from e

This aligns with FastAPI best practices and the documented response schema. The 200+success=False pattern is an anti-pattern for single-item endpoints; use proper HTTP semantics (5xx for server errors) so clients, intermediaries, and observability tools respond correctly.

Likely an incorrect or invalid review comment.
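Whether or not the 500 suggestion is adopted, the `raise ... from e` mechanics it relies on can be sketched without FastAPI. The HTTPException below is a stand-in class, not FastAPI's, and the handler body is illustrative:

```python
# Stand-in for fastapi.HTTPException so this sketch runs without FastAPI installed.
class HTTPException(Exception):
    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def evaluate_item(evaluator_name: str, evaluators: dict) -> dict:
    """Illustrative handler body: 404 for an unknown evaluator, 500 for failures."""
    evaluator = evaluators.get(evaluator_name)
    if evaluator is None:
        raise HTTPException(status_code=404, detail=f"Evaluator '{evaluator_name}' not found")
    try:
        return {"success": True, "result": evaluator()}
    except Exception as e:
        # `from e` keeps the original error reachable via __cause__ for debugging.
        raise HTTPException(status_code=500, detail=f"Evaluation failed: {e}") from e


caught = None
try:
    evaluate_item("broken", {"broken": lambda: 1 / 0})
except HTTPException as err:
    caught = err

print(caught.status_code)               # 500
print(type(caught.__cause__).__name__)  # ZeroDivisionError
```

The `from e` chaining is what distinguishes this from a bare re-raise: the client-facing error is the HTTPException, while the root cause survives on `__cause__` for server-side logs.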


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py (1)

235-281: Tighten evaluator lifecycle logging and type hints

The evaluator lifecycle wiring is sound, but there are a couple of small cleanups worth doing:

  • In initialize_evaluators and cleanup_evaluators you catch Exception and don’t re-raise; per the project’s exception-handling guidelines and Ruff hints, use logger.exception(...) instead of logger.error(...) so the stack trace is preserved.
  • Consider adding explicit return type hints (-> None) to initialize_evaluators and cleanup_evaluators to match the “all methods typed” guideline.

Example diff:

-    async def initialize_evaluators(self, config: Config):
+    async def initialize_evaluators(self, config: Config) -> None:
@@
-        except Exception as e:
-            logger.error(f"Failed to initialize evaluators: {e}")
+        except Exception:
+            logger.exception("Failed to initialize evaluators")
             # Don't fail startup, just log the error
             self._evaluators = {}
@@
-    async def cleanup_evaluators(self):
+    async def cleanup_evaluators(self) -> None:
@@
-            except Exception as e:
-                logger.error(f"Error cleaning up evaluator builder: {e}")
+            except Exception:
+                logger.exception("Error cleaning up evaluator builder")

If you intend to keep the broad except Exception here to avoid failing startup/shutdown, a brief comment or # noqa: BLE001 with rationale would also silence Ruff without changing behavior.

🧹 Nitpick comments (1)
src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py (1)

305-305: Minor note: extra builder.build() for the new route

Adding await self.add_evaluate_item_route(app, SessionManager(await builder.build())) follows the existing pattern used for the other routes; it does mean one more builder.build() call at startup. If workflow construction becomes expensive, consider sharing a single SessionManager instance across related routes in a future cleanup, but this is fine for now.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between de9e75d and dedcbf7.

📒 Files selected for processing (1)
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*

⚙️ CodeRabbit configuration file


Files:

  • src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py
src/nat/**/*

⚙️ CodeRabbit configuration file

This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.

Files:

  • src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py
🧬 Code graph analysis (1)
src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py (5)
src/nat/builder/eval_builder.py (3)
  • WorkflowEvalBuilder (43-166)
  • populate_builder (135-158)
  • get_evaluator (69-74)
src/nat/eval/evaluator/evaluator_model.py (1)
  • EvalInput (46-47)
src/nat/front_ends/fastapi/fastapi_front_end_config.py (2)
  • EvaluateItemRequest (138-141)
  • EvaluateItemResponse (144-148)
src/nat/runtime/session.py (3)
  • config (88-89)
  • SessionManager (47-226)
  • session (100-135)
src/nat/eval/rag_evaluator/register.py (1)
  • evaluate_fn (110-117)
🪛 Ruff (0.14.4)
src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py

266-266: Do not catch blind exception: Exception

(BLE001)


267-267: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


277-277: Do not catch blind exception: Exception

(BLE001)


278-278: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


533-533: Use explicit conversion flag

Replace with conversion flag

(RUF010)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: CI Pipeline / Check
🔇 Additional comments (2)
src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py (2)

42-43: New eval-related imports look correct and consistent

The added imports (WorkflowEvalBuilder, EvaluatorInfo, EvalInput, EvaluateItemRequest/Response) are all used in the new evaluator lifecycle and /evaluate/item endpoint and fit the existing import style. No changes needed here.

Also applies to: 56-56, 62-63


292-297: Evaluator init + shutdown handler wiring looks good

Initializing evaluators in configure and registering cleanup_evaluators on FastAPI "shutdown" closes the loop on the async context for WorkflowEvalBuilder and avoids long-lived resource leaks. This also addresses the previous review’s concern about missing cleanup of the eval builder.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py (1)

129-129: Remove redundant exception objects from logger.exception() calls.

The logger.exception() method automatically includes the full exception traceback in the log output, so passing the exception object as a format argument is redundant and can result in the exception appearing twice in the logs.

Apply this diff:

-                            logger.exception("Failed to parse response: %s", e)
+                            logger.exception("Failed to parse response")
-            logger.exception("Request failed: %s", e)
+            logger.exception("Request failed")
-            logger.exception("Evaluation request failed: %s", e)
+            logger.exception("Evaluation request failed")

Also applies to: 133-133, 203-203

examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py (1)

164-164: Remove redundant exception objects from logger.exception() calls.

The logger.exception() method automatically includes the full exception traceback, so passing the exception object as a format argument is redundant and can result in duplicate exception information in the logs.

Apply this diff:

-                            logger.exception("Failed to parse generate response chunk: %s", e)
+                            logger.exception("Failed to parse generate response chunk")
-                            logger.exception("Failed to parse intermediate step: %s", e)
+                            logger.exception("Failed to parse intermediate step")
-            logger.exception("Request failed: %s", e)
+            logger.exception("Request failed")
-            logger.exception("Evaluation request failed: %s", e)
+            logger.exception("Evaluation request failed")

Also applies to: 182-182, 186-186, 261-261

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dedcbf7 and d73e81f.

📒 Files selected for processing (2)
  • examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py (1 hunks)
  • examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards.
  • For Python code, follow PEP 20 and PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues.
  • Python methods should use type hints for all parameters and return values.
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.
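The two exception-handling patterns in this configuration can be sketched as follows; the function names here are illustrative, not from the toolkit:

```python
# Sketch of the two logging patterns the review config describes:
# logger.error() + bare raise when re-raising, logger.exception() when
# the error is handled locally, so the traceback is logged exactly once.
import logging

logger = logging.getLogger(__name__)

def parse_item(raw: str) -> int:
    try:
        return int(raw)
    except ValueError:
        # Re-raising: log without a traceback, then bare `raise` to
        # preserve the original stack trace for the caller to handle.
        logger.error("Could not parse %r as an int", raw)
        raise

def parse_or_default(raw: str, default: int = 0) -> int:
    try:
        return int(raw)
    except ValueError:
        # Not re-raising: logger.exception() records the traceback here,
        # since no caller will ever see this error.
        logger.exception("Could not parse %r; using default", raw)
        return default
```

The split avoids the common failure mode of logging a full traceback at every level of the call stack as an exception propagates.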

Documentation Review Instructions

  • Verify that documentation and comments are clear and comprehensive.
  • Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum".
  • Verify that the documentation doesn't contain any offensive or outdated terms.
  • Verify that documentation and comments are free of spelling mistakes. Ensure the documentation doesn't
    contain any words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file; words that might
    appear to be spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt
    file are OK.

Misc.

  • All code (except .mdc files that contain Cursor rules) should be licensed under the Apache License 2.0,
    and should contain an Apache License 2.0 header comment at the top of each file.
  • Confirm that copyright years are up-to-date whenever a file is changed.

Files:

  • examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py
  • examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py
examples/**/*

⚙️ CodeRabbit configuration file

examples/**/*:

  • This directory contains example code and usage scenarios for the toolkit; at a minimum an example should
    contain a README.md or README.ipynb file.
  • If an example contains Python code, it should be placed in a subdirectory named src/ and should
    contain a pyproject.toml file. Optionally, it might also contain scripts in a scripts/ directory.
  • If an example contains YAML files, they should be placed in a subdirectory named configs/.
  • If an example contains sample data files, they should be placed in a subdirectory named data/, and should
    be checked into git-lfs.

Files:

  • examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py
  • examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py
🧬 Code graph analysis (2)
examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py (3)
src/nat/data_models/api_server.py (1)
  • ResponseIntermediateStep (482-494)
src/nat/data_models/intermediate_step.py (3)
  • IntermediateStep (235-310)
  • IntermediateStepPayload (130-232)
  • UUID (301-302)
examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py (1)
  • main (208-233)
examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py (1)
examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py (1)
  • main (267-294)
🪛 Ruff (0.14.4)
examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item.py

1-1: The file is executable but no shebang is present

(EXE002)


164-164: Redundant exception object included in logging.exception call

(TRY401)


182-182: Redundant exception object included in logging.exception call

(TRY401)


186-186: Redundant exception object included in logging.exception call

(TRY401)


187-187: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


188-188: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


189-191: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


261-261: Redundant exception object included in logging.exception call

(TRY401)


262-262: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


263-263: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

examples/evaluation_and_profiling/simple_web_query_eval/src/nat_simple_web_query_eval/scripts/evaluate_single_item_simple.py

129-129: Redundant exception object included in logging.exception call

(TRY401)


133-133: Redundant exception object included in logging.exception call

(TRY401)


134-134: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


135-135: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


136-138: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


203-203: Redundant exception object included in logging.exception call

(TRY401)


204-204: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: CI Pipeline / Check

@AnuradhaKaruppiah
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 939fce2 into NVIDIA:develop Nov 15, 2025
17 checks passed
@AnuradhaKaruppiah AnuradhaKaruppiah deleted the ak-eval-item-endpoint branch November 17, 2025 22:09
saglave pushed a commit to snps-scm13/SNPS-NeMo-Agent-Toolkit that referenced this pull request Dec 11, 2025
This PR adds a new /evaluate/item endpoint for synchronous single-item evaluation, enabling quick testing and debugging of evaluators without running full dataset evaluations.

**Changes**
1. API Endpoint (/evaluate/item)
- Method: POST
- Purpose: Evaluate a single item with a specified evaluator (synchronous response)
- Use cases: Interactive testing, debugging, real-time evaluation
- No Dask required: Works without async job infrastructure

2. Route Structure (Nested)
- POST   /evaluate/item         → Single item evaluation (sync, immediate response)

3. Implementation
- Added add_evaluate_item_route() in fastapi_front_end_plugin_worker.py
- Evaluator initialization via WorkflowEvalBuilder on server startup
- Request/response models: EvaluateItemRequest, EvaluateItemResponse

4. Example Scripts
- evaluate_single_item.py - Full version with trajectory processing
- evaluate_single_item_simple.py - Simplified version without trajectory

5. Tests (test_evaluate_endpoints.py)
6. Docs deferred (evaluate_api.md will be going through more changes before next rel)
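A call to the new endpoint might look like the sketch below. The `/evaluate/item` path comes from this PR, but since the API docs are deferred, the request field names (`evaluator_name`, `item`) and the base URL are assumptions, not the confirmed `EvaluateItemRequest` schema:

```python
# Hypothetical client sketch for POST /evaluate/item; field names are
# placeholders, not the confirmed EvaluateItemRequest schema.
import json
import urllib.request

def build_request(evaluator_name: str, item: dict) -> dict:
    """Assemble a single-item evaluation payload (assumed schema)."""
    return {"evaluator_name": evaluator_name, "item": item}

def evaluate_item(base_url: str, payload: dict) -> dict:
    """POST the payload to /evaluate/item and decode the JSON response."""
    req = urllib.request.Request(
        f"{base_url}/evaluate/item",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

payload = build_request("rag_accuracy", {"question": "What is NAT?", "answer": "A toolkit."})
print(json.dumps(payload))
# With a running server: evaluate_item("http://localhost:8000", payload)
```

Because the endpoint is synchronous, the evaluator's score comes back in the HTTP response itself, with no Dask job to poll.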

## By Submitting this PR I confirm:
- I am familiar with the [Contributing Guidelines](https://github.com/NVIDIA/NeMo-Agent-Toolkit/blob/develop/docs/source/resources/contributing.md).
- We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
  - Any contribution which contains commits that are not Signed-Off will not be accepted.
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.

## Summary by CodeRabbit

* **New Features**
  * Added a single-item evaluation endpoint (/evaluate/item) to evaluate an individual item with a named evaluator.
  * Added two example scripts showcasing end-to-end and minimal single-item evaluation workflows against the server, with streaming handling and evaluation reporting.

* **Tests**
  * Added tests covering success, evaluator-not-found (404), evaluator runtime errors, and invalid-payload validation for the single-item evaluation endpoint.

Authors:
  - Anuradha Karuppiah (https://github.com/AnuradhaKaruppiah)

Approvers:
  - David Gardner (https://github.com/dagardner-nv)
  - Michael Demoret (https://github.com/mdemoret-nv)

URL: NVIDIA#1138
Signed-off-by: Sangharsh Aglave <aglave@synopsys.com>
Labels

improvement: Improvement to existing functionality
non-breaking: Non-breaking change

Projects

None yet

Development

3 participants