Skip to content

Async endpoint improvements#1374

Merged
rapids-bot[bot] merged 45 commits intoNVIDIA:release/1.4from
dagardner-nv:david-rel-14-cve-mem-leak
Jan 12, 2026
Merged

Async endpoint improvements#1374
rapids-bot[bot] merged 45 commits intoNVIDIA:release/1.4from
dagardner-nv:david-rel-14-cve-mem-leak

Conversation

@dagardner-nv
Copy link
Contributor

@dagardner-nv dagardner-nv commented Jan 9, 2026

Description

  • Move Dask task functions to their own module and update them to be self-contained.
  • Be a bit aggressive about deleting variables in the run_generation method.
  • Ensure that SQLAlchemy's Session.remove() is called in the event of an error
  • Move nat.cli.entrypoint.setup_logging to nat.utils.log_utils to be used outside of the CLI.
  • Configure logging in Dask tasks when using subprocesses
  • Configure uvicorn logging to match NAT
  • Log Dask details a bit more verbosely

By Submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

Summary by CodeRabbit

  • New Features

    • Added asynchronous job generation and a periodic cleanup task runnable via the distributed execution layer.
  • Improvements

    • Centralized logging setup with consistent timestamp format and propagated log level to server and workers.
    • Exposed configurable worker threading mode (thread vs process) via runtime environment.
    • Server startup now applies effective log settings and enhanced log formatting.
  • Bug Fixes

    • More robust job lifecycle cleanup with guaranteed resource handling and explicit reporting of expired-job counts.

✏️ Tip: You can customize this high-level summary in your review settings.

…ry point

Signed-off-by: David Gardner <dagardner@nvidia.com>
…ssion.remove() is always called

Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
This reverts commit 6a1031e.

Signed-off-by: David Gardner <dagardner@nvidia.com>
…sion_manager

Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
…to david-rel-14-cve-mem-leak

Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
This reverts commit 01a94fb.

Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
@dagardner-nv dagardner-nv self-assigned this Jan 9, 2026
@dagardner-nv dagardner-nv added bug Something isn't working non-breaking Non-breaking change labels Jan 9, 2026
@coderabbitai
Copy link

coderabbitai bot commented Jan 9, 2026

Walkthrough

Centralizes logging into a utils helper, adds async Dask task helpers for job execution and periodic cleanup, integrates those helpers into the FastAPI front-end and worker (with env-driven thread/log flags), and hardens JobStore cleanup to count and safely remove expired jobs.

Changes

Cohort / File(s) Summary
Logging utils & entrypoint
src/nat/utils/log_utils.py, src/nat/cli/entrypoint.py
Add LOG_FORMAT, LOG_DATE_FORMAT and setup_logging(log_level); entrypoint now delegates logging setup to the new helper instead of configuring logging inline.
Async Dask job helpers
src/nat/front_ends/fastapi/async_job.py
New module exporting async run_generation(...) and periodic_cleanup(...) that optionally configure logging, run generation tasks, update JobStore statuses, and perform periodic cleanup loops.
FastAPI front-end plugin
src/nat/front_ends/fastapi/fastapi_front_end_plugin.py
Add _use_dask_threads flag, submit periodic_cleanup to Dask (passing logging control based on threading), expose NAT_USE_DASK_THREADS and NAT_FASTAPI_LOG_LEVEL to workers, build custom uvicorn log_config using LOG_DATE_FORMAT, and pass log_level/log_config to uvicorn.run.
Worker integration
src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py
Import run_generation and setup_logging; add _use_dask_threads and _log_level, call setup_logging in init, and submit run_generation with configure_logging and log_level arguments before scheduler/db/config/job/payload.
Job store cleanup
src/nat/front_ends/fastapi/job_store.py
cleanup_expired_jobs() now returns int (number expired), logs start, counts expired IDs, uses a per-job Dask Variable to attempt cancellation and guarantees variable deletion in finally blocks; session registry removal is guarded to ensure cleanup.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant FastAPI_Plugin
    participant Dask_Client
    participant Dask_Worker
    participant async_job
    participant JobStore
    participant Database

    Client->>FastAPI_Plugin: POST /enqueue (payload)
    FastAPI_Plugin->>Dask_Client: submit run_generation(configure_logging, log_level, scheduler_addr, db_url, config_file, job_id, payload)
    Dask_Client->>Dask_Worker: schedule task
    Dask_Worker->>async_job: invoke run_generation()
    async_job->>async_job: optionally call setup_logging(log_level)
    async_job->>async_job: load workflow & compute result
    async_job->>JobStore: update job status (SUCCESS/FAILURE) with result/error
    JobStore->>Database: persist status/result
    async_job->>Dask_Worker: cleanup resources and finish
Loading
sequenceDiagram
    participant FastAPI_Plugin
    participant Dask_Client
    participant Dask_Worker
    participant async_job
    participant JobStore
    participant Database

    loop every sleep_time_sec
        FastAPI_Plugin->>Dask_Client: submit periodic_cleanup(scheduler_addr, db_url, sleep_time_sec, configure_logging, log_level)
        Dask_Client->>Dask_Worker: schedule periodic task
        Dask_Worker->>async_job: invoke periodic_cleanup()
        async_job->>async_job: optionally call setup_logging(log_level)
        async_job->>async_job: sleep interval
        async_job->>JobStore: cleanup_expired_jobs() -> returns num_expired
        JobStore->>Database: query & remove expired records, attempt cancellation via Dask Variable
        async_job->>FastAPI_Plugin: log cleanup result (num_expired)
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 59.09% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'Async endpoint improvements' is vague and generic, lacking specific detail about the substantial changes made (refactoring Dask tasks, memory management, logging improvements). Use a more specific title reflecting the main changes, such as 'Refactor Dask tasks for async endpoint and improve logging' or 'Move Dask tasks to dedicated module and fix memory leak'.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

…to david-rel-14-cve-mem-leak

Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
…ging in sub-processes

Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
This reverts commit 06bd4b2.

Signed-off-by: David Gardner <dagardner@nvidia.com>
This reverts commit 71ab041.

Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
…to david-rel-14-cve-mem-leak

Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
@dagardner-nv dagardner-nv marked this pull request as ready for review January 9, 2026 18:31
@dagardner-nv dagardner-nv requested a review from a team as a code owner January 9, 2026 18:31
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/nat/cli/entrypoint.py (1)

16-24: Remove duplicate and conflicting license header.

The file has two license headers: Apache-2.0 (lines 1-14) and a proprietary NVIDIA license (lines 16-24). Per the coding guidelines, all code should be licensed under Apache License 2.0 only. The second header block should be removed.

🔧 Proposed fix
 # limitations under the License.

-# SPDX-FileCopyrightText: Copyright (c) 2024-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: LicenseRef-NvidiaProprietary
-#
-# NVIDIA CORPORATION, its affiliates and licensors retain all intellectual
-# property and proprietary rights in and to this material, related
-# documentation and any modifications thereto. Any use, reproduction,
-# disclosure or distribution of this material and related documentation
-# without an express license agreement from NVIDIA CORPORATION or
-# its affiliates is strictly prohibited.
-
 import logging
🤖 Fix all issues with AI agents
In @src/nat/front_ends/fastapi/async_job.py:
- Line 1: Update the file header copyright year range from "2026" to "2025-2026"
so it matches other files in the PR; locate the SPDX/ copyright comment at the
top of src/nat/front_ends/fastapi/async_job.py and change the year token in that
comment to "2025-2026".
🧹 Nitpick comments (1)
src/nat/utils/log_utils.py (1)

43-49: Consider adding force=True for repeated basicConfig calls.

logging.basicConfig() is a no-op if the root logger already has handlers configured. When Dask workers call this function in subprocess mode, it should work fine. However, in thread mode, subsequent calls may not apply the configuration if handlers were already set.

Consider using force=True (Python 3.8+) to ensure configuration is always applied:

♻️ Suggested improvement
 def setup_logging(log_level: int):
     """Configure logging with the specified level"""
     logging.basicConfig(
         level=log_level,
         format=LOG_FORMAT,
         datefmt=LOG_DATE_FORMAT,
+        force=True,
     )
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f41f0a0 and d682975.

📒 Files selected for processing (6)
  • src/nat/cli/entrypoint.py
  • src/nat/front_ends/fastapi/async_job.py
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin.py
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py
  • src/nat/front_ends/fastapi/job_store.py
  • src/nat/utils/log_utils.py
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.py: Follow PEP 20 and PEP 8 for Python style guidelines
Run yapf with PEP 8 base and 'column_limit = 120' for code formatting
Use 'ruff check --fix' for linting with configuration from 'pyproject.toml', fix warnings unless explicitly ignored
Use snake_case for functions and variables, PascalCase for classes, UPPER_CASE for constants
All public APIs require Python 3.11+ type hints on parameters and return values
Prefer 'collections.abc' / 'typing' abstractions (e.g., 'Sequence' over 'list') for type hints
Use 'typing.Annotated' for units or extra metadata when useful
Treat 'pyright' warnings (configured in 'pyproject.toml') as errors during development
Preserve stack traces and prevent duplicate logging when handling exceptions; use bare 'raise' statements when re-raising, and use 'logger.error()' for logging (not 'logger.exception()') to avoid duplicate stack trace output
When catching and logging exceptions without re-raising, always use 'logger.exception()' (equivalent to 'logger.error(exc_info=True)') to capture full stack trace information
Pydantic models using 'SecretStr', 'SerializableSecretStr', or 'OptionalSecretStr' should use 'default=None' for optional fields and 'default_factory=lambda: SerializableSecretStr("")' for non-optional fields to avoid initialization bugs
Provide Google-style docstrings for every public module, class, function and CLI command
The first line of docstrings must be a concise description ending with a period
Surround code entities in docstrings with backticks to avoid Vale false-positives
Validate and sanitise all user input, especially in web or CLI interfaces
Prefer 'httpx' with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)
Cache expensive computations with 'functools.lru_cache' or an external cache when appropriate
Leverage NumPy vectorised operations whenever beneficial and feasible

Files:

  • src/nat/cli/entrypoint.py
  • src/nat/front_ends/fastapi/async_job.py
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py
  • src/nat/utils/log_utils.py
  • src/nat/front_ends/fastapi/job_store.py
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin.py
**/*.{py,yaml,yml,json,toml}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Indent with 4 spaces (never tabs) and ensure every file ends with a single newline

Files:

  • src/nat/cli/entrypoint.py
  • src/nat/front_ends/fastapi/async_job.py
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py
  • src/nat/utils/log_utils.py
  • src/nat/front_ends/fastapi/job_store.py
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin.py
**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}: Every file must start with the standard SPDX Apache-2.0 header
Confirm that copyright years are up-to-date whenever a file is changed
All source files must include the SPDX Apache-2.0 header template

Files:

  • src/nat/cli/entrypoint.py
  • src/nat/front_ends/fastapi/async_job.py
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py
  • src/nat/utils/log_utils.py
  • src/nat/front_ends/fastapi/job_store.py
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin.py
**/*.{py,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Version numbers are derived automatically by 'setuptools-scm'; never hard-code them in code or docs

Files:

  • src/nat/cli/entrypoint.py
  • src/nat/front_ends/fastapi/async_job.py
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py
  • src/nat/utils/log_utils.py
  • src/nat/front_ends/fastapi/job_store.py
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin.py
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards. - For Python code, follow
    PEP 20 and
    PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values (except for return values of None,
    in that situation no return type hint is needed).
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.

Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any

words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be
spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

  • Documentation in Markdown files should not contain usage of a possessive 's with inanimate objects
    (ex: "the system's performance" should be "the performance of the system").
  • Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit.
    The exception to this rule is when referring to package names or code identifiers that contain "nat", th...

Files:

  • src/nat/cli/entrypoint.py
  • src/nat/front_ends/fastapi/async_job.py
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py
  • src/nat/utils/log_utils.py
  • src/nat/front_ends/fastapi/job_store.py
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin.py
src/nat/**/*

⚙️ CodeRabbit configuration file

This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.

Files:

  • src/nat/cli/entrypoint.py
  • src/nat/front_ends/fastapi/async_job.py
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py
  • src/nat/utils/log_utils.py
  • src/nat/front_ends/fastapi/job_store.py
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin.py
🧬 Code graph analysis (5)
src/nat/cli/entrypoint.py (1)
src/nat/utils/log_utils.py (1)
  • setup_logging (43-49)
src/nat/front_ends/fastapi/async_job.py (6)
src/nat/utils/log_utils.py (1)
  • setup_logging (43-49)
src/nat/cli/entrypoint.py (1)
  • setup_logging (46-50)
src/nat/front_ends/fastapi/job_store.py (1)
  • JobStatus (58-82)
src/nat/front_ends/fastapi/response_helpers.py (1)
  • generate_single_response (108-118)
src/nat/runtime/loader.py (1)
  • load_workflow (97-124)
src/nat/builder/context.py (1)
  • output (60-61)
src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py (3)
src/nat/front_ends/fastapi/async_job.py (1)
  • run_generation (32-82)
src/nat/utils/log_utils.py (1)
  • setup_logging (43-49)
src/nat/cli/entrypoint.py (1)
  • setup_logging (46-50)
src/nat/utils/log_utils.py (1)
src/nat/cli/entrypoint.py (1)
  • setup_logging (46-50)
src/nat/front_ends/fastapi/fastapi_front_end_plugin.py (2)
src/nat/front_ends/fastapi/async_job.py (1)
  • periodic_cleanup (85-122)
src/nat/front_ends/fastapi/dask_client_mixin.py (2)
  • DaskClientMixin (27-65)
  • client (30-46)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: CI Pipeline / Check
🔇 Additional comments (19)
src/nat/utils/log_utils.py (1)

18-19: LGTM!

The logging format constants are well-defined and follow Python conventions for module-level constants (UPPER_CASE). The format string includes useful debugging information (timestamp, level, module, line number, message).

src/nat/cli/entrypoint.py (1)

35-35: LGTM!

The refactoring to delegate logging setup to the centralized log_utils.setup_logging is clean. The function maintains its existing signature (accepting a string log level) and correctly converts it to a numeric level before calling the utility function.

Also applies to: 46-50

src/nat/front_ends/fastapi/job_store.py (4)

211-218: LGTM - Good fix for session memory leak.

Wrapping the session context in try/finally ensures _session.remove() is always called, preventing the async_scoped_session from accumulating stale task keys. This directly addresses the memory leak concern mentioned in the PR objectives.


506-513: LGTM!

Good improvements: the return type annotation -> int documents the new behavior, and the startup log provides visibility into cleanup operations.


545-566: LGTM - Robust variable cleanup with proper error handling.

The finally block ensures var.delete() is called regardless of whether the future cancellation succeeds. The exception handling correctly uses logger.exception() for logging without re-raising, which captures the full stack trace as per coding guidelines. The explicit del var after deletion is a reasonable defensive measure for memory cleanup.


571-571: LGTM!

Returning num_expired enables callers to log/monitor cleanup activity, which aligns with the PR goal of increased verbosity for Dask-related logs.

src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py (3)

58-58: LGTM!

Good refactoring - importing run_generation from the dedicated async_job module improves code organization by keeping Dask task functions self-contained, as mentioned in the PR objectives.

Also applies to: 79-79


111-113: LGTM!

Environment-driven configuration for Dask threading mode and log level is well-implemented. Calling setup_logging during initialization ensures consistent logging configuration in the worker process.


929-937: LGTM - Correct configure_logging logic.

Passing not self._use_dask_threads as the configure_logging parameter is correct: when using subprocess workers, logging needs to be configured in each subprocess; when using threads, the parent process's logging configuration is shared, so reconfiguration is unnecessary.

src/nat/front_ends/fastapi/async_job.py (4)

15-17: LGTM!

Clear module docstring explaining the design rationale for self-contained Dask task functions.


24-29: LGTM!

Good design - the deferred import of setup_logging inside the function ensures proper isolation when this module is serialized and sent to Dask workers.


32-82: LGTM - Well-structured Dask task with proper cleanup.

The function is self-contained with deferred imports, includes comprehensive docstrings, and implements proper resource cleanup:

  • Explicit del statements for session objects to reduce memory footprint
  • Uses logger.exception() correctly when catching without re-raising
  • Updates job status appropriately on success/failure

85-122: LGTM - Appropriate bare except for long-running cleanup task.

The bare except clause (with # noqa: E722) is acceptable here because:

  1. This is a long-running Dask task that should never terminate due to transient errors
  2. All exceptions are properly logged with logger.exception() for debugging
  3. The loop continues to provide cleanup service

The function is well-documented with a complete docstring.

src/nat/front_ends/fastapi/fastapi_front_end_plugin.py (6)

16-16: LGTM!

Good imports: copy for safe uvicorn config modification, periodic_cleanup from the new dedicated module, and LOG_DATE_FORMAT for consistent timestamp formatting.

Also applies to: 24-24, 32-32


49-49: LGTM!

Initializing _use_dask_threads to False is a safe default that will be updated based on configuration when the cluster is set up.


66-74: LGTM!

Good changes:

  • Using logger.info instead of logger.debug for cleanup task submission increases visibility
  • Delegating to imported periodic_cleanup improves modularity
  • configure_logging=not self._use_dask_threads correctly configures logging only in subprocess mode

108-121: LGTM!

Correct implementation:

  • processes=not self._use_dask_threads creates thread workers when dask_workers == 'threads'
  • Worker setup (_setup_worker) is correctly skipped for thread mode and Windows, as os.setsid() only applies to Unix subprocess workers

148-153: LGTM!

Setting environment variables NAT_USE_DASK_THREADS and NAT_LOG_LEVEL ensures worker subprocesses can read the configuration. Using str(int(...)) for the boolean conversion ensures consistent "0"/"1" string representation.


184-199: LGTM - Safe uvicorn logging customization.

Using copy.deepcopy(uvicorn.config.LOGGING_CONFIG) before modifying ensures the global uvicorn configuration remains untouched. The timestamp prefix and date format customization aligns logging output with the rest of the application.

Copy link
Member

@willkill07 willkill07 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for the exception-flow driven cleanup of job_store

Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
…into david-rel-14-cve-mem-leak

Signed-off-by: David Gardner <dagardner@nvidia.com>
Signed-off-by: David Gardner <dagardner@nvidia.com>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In @src/nat/front_ends/fastapi/async_job.py:
- Around line 82-119: The loop in periodic_cleanup currently sleeps before
running cleanup and uses a bare except that swallows asyncio.CancelledError;
change the loop to run cleanup_expired_jobs() first and then await
asyncio.sleep(sleep_time_sec), and replace the bare except with two handlers:
except asyncio.CancelledError: raise to allow graceful cancellation, and except
Exception as e: logger.exception("Error during job cleanup: %s", e) to log other
errors; keep references to periodic_cleanup and JobStore.cleanup_expired_jobs
when making the edits.
🧹 Nitpick comments (1)
src/nat/front_ends/fastapi/async_job.py (1)

24-30: Consider making logging setup explicitly idempotent (or intentionally “force”).

logging.basicConfig(...) is a no-op if handlers already exist; if the intent is “configure logging in subprocess tasks”, this is probably fine, but please confirm you don’t need force=True (Python 3.8+) or a “no handlers present” guard to avoid surprises when a worker already configured logging.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d682975 and 0f66b70.

📒 Files selected for processing (1)
  • src/nat/front_ends/fastapi/async_job.py
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.py: Follow PEP 20 and PEP 8 for Python style guidelines
Run yapf with PEP 8 base and 'column_limit = 120' for code formatting
Use 'ruff check --fix' for linting with configuration from 'pyproject.toml', fix warnings unless explicitly ignored
Use snake_case for functions and variables, PascalCase for classes, UPPER_CASE for constants
All public APIs require Python 3.11+ type hints on parameters and return values
Prefer 'collections.abc' / 'typing' abstractions (e.g., 'Sequence' over 'list') for type hints
Use 'typing.Annotated' for units or extra metadata when useful
Treat 'pyright' warnings (configured in 'pyproject.toml') as errors during development
Preserve stack traces and prevent duplicate logging when handling exceptions; use bare 'raise' statements when re-raising, and use 'logger.error()' for logging (not 'logger.exception()') to avoid duplicate stack trace output
When catching and logging exceptions without re-raising, always use 'logger.exception()' (equivalent to 'logger.error(exc_info=True)') to capture full stack trace information
Pydantic models using 'SecretStr', 'SerializableSecretStr', or 'OptionalSecretStr' should use 'default=None' for optional fields and 'default_factory=lambda: SerializableSecretStr("")' for non-optional fields to avoid initialization bugs
Provide Google-style docstrings for every public module, class, function and CLI command
The first line of docstrings must be a concise description ending with a period
Surround code entities in docstrings with backticks to avoid Vale false-positives
Validate and sanitise all user input, especially in web or CLI interfaces
Prefer 'httpx' with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)
Cache expensive computations with 'functools.lru_cache' or an external cache when appropriate
Leverage NumPy vectorised operations whenever beneficial and feasible

Files:

  • src/nat/front_ends/fastapi/async_job.py
**/*.{py,yaml,yml,json,toml}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Indent with 4 spaces (never tabs) and ensure every file ends with a single newline

Files:

  • src/nat/front_ends/fastapi/async_job.py
**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}: Every file must start with the standard SPDX Apache-2.0 header
Confirm that copyright years are up-to-date whenever a file is changed
All source files must include the SPDX Apache-2.0 header template

Files:

  • src/nat/front_ends/fastapi/async_job.py
**/*.{py,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Version numbers are derived automatically by 'setuptools-scm'; never hard-code them in code or docs

Files:

  • src/nat/front_ends/fastapi/async_job.py
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards. - For Python code, follow
    PEP 20 and
    PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values (except for return values of None,
    in that situation no return type hint is needed).
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.

Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any

words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be
spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

  • Documentation in Markdown files should not contain usage of a possessive 's with inanimate objects
    (ex: "the system's performance" should be "the performance of the system").
  • Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit.
    The exception to this rule is when referring to package names or code identifiers that contain "nat", th...

Files:

  • src/nat/front_ends/fastapi/async_job.py
src/nat/**/*

⚙️ CodeRabbit configuration file

This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.

Files:

  • src/nat/front_ends/fastapi/async_job.py
🧠 Learnings (1)
📚 Learning: 2026-01-05T15:46:49.677Z
Learnt from: CR
Repo: NVIDIA/NeMo-Agent-Toolkit PR: 0
File: .cursor/rules/general.mdc:0-0
Timestamp: 2026-01-05T15:46:49.677Z
Learning: Applies to **/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst} : Confirm that copyright years are up-to-date whenever a file is changed

Applied to files:

  • src/nat/front_ends/fastapi/async_job.py
🧬 Code graph analysis (1)
src/nat/front_ends/fastapi/async_job.py (4)
src/nat/utils/log_utils.py (1)
  • setup_logging (43-49)
src/nat/front_ends/fastapi/job_store.py (4)
  • JobStatus (58-82)
  • JobStore (138-571)
  • update_status (328-378)
  • cleanup_expired_jobs (506-571)
src/nat/front_ends/fastapi/response_helpers.py (1)
  • generate_single_response (108-118)
src/nat/runtime/loader.py (1)
  • load_workflow (97-124)
🔇 Additional comments (1)
src/nat/front_ends/fastapi/async_job.py (1)

1-17: SPDX header + 2026 copyright year look correct for a new file.

Based on learnings, thanks for keeping the copyright year current.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/nat/front_ends/fastapi/job_store.py (1)

506-512: Docstring missing Returns section.

The method signature now returns int, but the docstring doesn't document what the return value represents. Per the coding guidelines requiring Google-style docstrings, please add a Returns section.

📝 Suggested docstring update
     async def cleanup_expired_jobs(self) -> int:
         """
         Cleanup expired jobs, keeping the most recent one.

         Updated_at is used instead of created_at to determine the most recent job. This is because jobs may not be
         processed in the order they are created.
+
+        Returns
+        -------
+        int
+            The number of expired jobs found during this cleanup cycle.
         """
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0f66b70 and d3aaf5b.

📒 Files selected for processing (1)
  • src/nat/front_ends/fastapi/job_store.py
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.py: Follow PEP 20 and PEP 8 for Python style guidelines
Run yapf with PEP 8 base and 'column_limit = 120' for code formatting
Use 'ruff check --fix' for linting with configuration from 'pyproject.toml', fix warnings unless explicitly ignored
Use snake_case for functions and variables, PascalCase for classes, UPPER_CASE for constants
All public APIs require Python 3.11+ type hints on parameters and return values
Prefer 'collections.abc' / 'typing' abstractions (e.g., 'Sequence' over 'list') for type hints
Use 'typing.Annotated' for units or extra metadata when useful
Treat 'pyright' warnings (configured in 'pyproject.toml') as errors during development
Preserve stack traces and prevent duplicate logging when handling exceptions; use bare 'raise' statements when re-raising, and use 'logger.error()' for logging (not 'logger.exception()') to avoid duplicate stack trace output
When catching and logging exceptions without re-raising, always use 'logger.exception()' (equivalent to 'logger.error(exc_info=True)') to capture full stack trace information
Pydantic models using 'SecretStr', 'SerializableSecretStr', or 'OptionalSecretStr' should use 'default=None' for optional fields and 'default_factory=lambda: SerializableSecretStr("")' for non-optional fields to avoid initialization bugs
Provide Google-style docstrings for every public module, class, function and CLI command
The first line of docstrings must be a concise description ending with a period
Surround code entities in docstrings with backticks to avoid Vale false-positives
Validate and sanitise all user input, especially in web or CLI interfaces
Prefer 'httpx' with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)
Cache expensive computations with 'functools.lru_cache' or an external cache when appropriate
Leverage NumPy vectorised operations whenever beneficial and feasible

Files:

  • src/nat/front_ends/fastapi/job_store.py
**/*.{py,yaml,yml,json,toml}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Indent with 4 spaces (never tabs) and ensure every file ends with a single newline

Files:

  • src/nat/front_ends/fastapi/job_store.py
**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}: Every file must start with the standard SPDX Apache-2.0 header
Confirm that copyright years are up-to-date whenever a file is changed
All source files must include the SPDX Apache-2.0 header template

Files:

  • src/nat/front_ends/fastapi/job_store.py
**/*.{py,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Version numbers are derived automatically by 'setuptools-scm'; never hard-code them in code or docs

Files:

  • src/nat/front_ends/fastapi/job_store.py
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards. - For Python code, follow
    PEP 20 and
    PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values (except for return values of None,
    in that situation no return type hint is needed).
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.

Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any

words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be
spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

  • Documentation in Markdown files should not contain usage of a possessive 's with inanimate objects
    (ex: "the system's performance" should be "the performance of the system").
  • Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit.
    The exception to this rule is when referring to package names or code identifiers that contain "nat", th...

Files:

  • src/nat/front_ends/fastapi/job_store.py
src/nat/**/*

⚙️ CodeRabbit configuration file

This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.

Files:

  • src/nat/front_ends/fastapi/job_store.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: CI Pipeline / Check
🔇 Additional comments (3)
src/nat/front_ends/fastapi/job_store.py (3)

211-218: LGTM! Proper session cleanup pattern.

The try/finally ensures self._session.remove() is called even when exceptions occur during session usage, preventing memory leaks in the scoped session registry.


545-545: Good defensive cleanup pattern.

Initializing var = None before the try block and checking it in the finally block ensures the Dask Variable is deleted even when cancellation fails. Using logger.exception() for caught exceptions is correct per the coding guidelines.

Also applies to: 560-566


541-542: Verify return value semantics: num_expired vs successfully_expired count.

The method returns num_expired (total expired jobs found), but only jobs in successfully_expired are actually marked as expired in the database. If some jobs fail to cancel (caught at line 557-558), the return value will be higher than the number of jobs actually cleaned up.

Consider whether the caller expects the count of jobs found expired or jobs successfully cleaned up. If the latter, return len(successfully_expired) instead.

Also applies to: 570-570

Signed-off-by: David Gardner <dagardner@nvidia.com>
…ame name

Signed-off-by: David Gardner <dagardner@nvidia.com>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py:
- Around line 111-113: Validate and handle parsing errors when reading
NAT_FASTAPI_LOG_LEVEL: wrap the int conversion for self._log_level in a
try/except (catch ValueError/TypeError), fall back to logging.INFO (or a
configured default) on failure, and emit a warning via the logger (or print)
that includes the invalid env value and the exception; update the code around
self._log_level and setup_logging(self._log_level) to use the safe fallback so a
bad env value cannot crash the process.
🧹 Nitpick comments (1)
src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py (1)

929-937: Consider adding a clarifying comment for the inverted logic.

Line 930 passes not self._use_dask_threads as the configure_logging parameter. This inverted logic is correct (threads inherit logging from the parent process, while subprocesses need their own configuration), but it may not be immediately obvious to future maintainers.

📝 Suggested comment
                    (_, job) = await self._job_store.submit_job(
                        job_id=job_id,
                        expiry_seconds=request.expiry_seconds,
                        job_fn=run_generation,
                        sync_timeout=request.sync_timeout,
                        job_args=[
+                            # Configure logging only for subprocess workers (not threads)
                            not self._use_dask_threads,
                            self._log_level,
                            self._scheduler_address,
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 37b2cec and 7cabf1d.

📒 Files selected for processing (2)
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin.py
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.py: Follow PEP 20 and PEP 8 for Python style guidelines
Run yapf with PEP 8 base and 'column_limit = 120' for code formatting
Use 'ruff check --fix' for linting with configuration from 'pyproject.toml', fix warnings unless explicitly ignored
Use snake_case for functions and variables, PascalCase for classes, UPPER_CASE for constants
All public APIs require Python 3.11+ type hints on parameters and return values
Prefer 'collections.abc' / 'typing' abstractions (e.g., 'Sequence' over 'list') for type hints
Use 'typing.Annotated' for units or extra metadata when useful
Treat 'pyright' warnings (configured in 'pyproject.toml') as errors during development
Preserve stack traces and prevent duplicate logging when handling exceptions; use bare 'raise' statements when re-raising, and use 'logger.error()' for logging (not 'logger.exception()') to avoid duplicate stack trace output
When catching and logging exceptions without re-raising, always use 'logger.exception()' (equivalent to 'logger.error(exc_info=True)') to capture full stack trace information
Pydantic models using 'SecretStr', 'SerializableSecretStr', or 'OptionalSecretStr' should use 'default=None' for optional fields and 'default_factory=lambda: SerializableSecretStr("")' for non-optional fields to avoid initialization bugs
Provide Google-style docstrings for every public module, class, function and CLI command
The first line of docstrings must be a concise description ending with a period
Surround code entities in docstrings with backticks to avoid Vale false-positives
Validate and sanitise all user input, especially in web or CLI interfaces
Prefer 'httpx' with SSL verification enabled by default and follow OWASP Top-10 recommendations
Use 'async'/'await' for I/O-bound work (HTTP, DB, file reads)
Cache expensive computations with 'functools.lru_cache' or an external cache when appropriate
Leverage NumPy vectorised operations whenever beneficial and feasible

Files:

  • src/nat/front_ends/fastapi/fastapi_front_end_plugin.py
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py
**/*.{py,yaml,yml,json,toml}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Indent with 4 spaces (never tabs) and ensure every file ends with a single newline

Files:

  • src/nat/front_ends/fastapi/fastapi_front_end_plugin.py
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py
**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

**/*.{py,js,ts,tsx,jsx,sh,yaml,yml,json,toml,md,mdx,rst}: Every file must start with the standard SPDX Apache-2.0 header
Confirm that copyright years are up-to-date whenever a file is changed
All source files must include the SPDX Apache-2.0 header template

Files:

  • src/nat/front_ends/fastapi/fastapi_front_end_plugin.py
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py
**/*.{py,md,mdx,rst}

📄 CodeRabbit inference engine (.cursor/rules/general.mdc)

Version numbers are derived automatically by 'setuptools-scm'; never hard-code them in code or docs

Files:

  • src/nat/front_ends/fastapi/fastapi_front_end_plugin.py
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py
**/*

⚙️ CodeRabbit configuration file

**/*: # Code Review Instructions

  • Ensure the code follows best practices and coding standards. - For Python code, follow
    PEP 20 and
    PEP 8 for style guidelines.
  • Check for security vulnerabilities and potential issues. - Python methods should use type hints for all parameters and return values (except for return values of None,
    in that situation no return type hint is needed).
    Example:
    def my_function(param1: int, param2: str) -> bool:
        pass
  • For Python exception handling, ensure proper stack trace preservation:
    • When re-raising exceptions: use bare raise statements to maintain the original stack trace,
      and use logger.error() (not logger.exception()) to avoid duplicate stack trace output.
    • When catching and logging exceptions without re-raising: always use logger.exception()
      to capture the full stack trace information.

Documentation Review Instructions - Verify that documentation and comments are clear and comprehensive. - Verify that the documentation doesn't contain any TODOs, FIXMEs or placeholder text like "lorem ipsum". - Verify that the documentation doesn't contain any offensive or outdated terms. - Verify that documentation and comments are free of spelling mistakes, ensure the documentation doesn't contain any

words listed in the ci/vale/styles/config/vocabularies/nat/reject.txt file, words that might appear to be
spelling mistakes but are listed in the ci/vale/styles/config/vocabularies/nat/accept.txt file are OK.

  • Documentation in Markdown files should not contain usage of a possessive 's with inanimate objects
    (ex: "the system's performance" should be "the performance of the system").
  • Documentation in Markdown files should not use NAT as an acronym, always spell out NeMo Agent Toolkit.
    The exception to this rule is when referring to package names or code identifiers that contain "nat", th...

Files:

  • src/nat/front_ends/fastapi/fastapi_front_end_plugin.py
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py
src/nat/**/*

⚙️ CodeRabbit configuration file

This directory contains the core functionality of the toolkit. Changes should prioritize backward compatibility.

Files:

  • src/nat/front_ends/fastapi/fastapi_front_end_plugin.py
  • src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py
🧬 Code graph analysis (1)
src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py (3)
src/nat/front_ends/fastapi/async_job.py (1)
  • run_generation (32-84)
src/nat/utils/log_utils.py (1)
  • setup_logging (43-49)
src/nat/cli/entrypoint.py (1)
  • setup_logging (46-50)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: CI Pipeline / Check
🔇 Additional comments (6)
src/nat/front_ends/fastapi/fastapi_front_end_plugin_worker.py (1)

58-58: LGTM! Imports align with the refactoring objectives.

The imports correctly reference the newly centralized logging utilities and async job helpers.

Also applies to: 79-79

src/nat/front_ends/fastapi/fastapi_front_end_plugin.py (5)

16-16: LGTM! Imports are appropriate.

The new imports support the logging configuration enhancements and async cleanup functionality.

Also applies to: 24-24, 32-32


66-74: LGTM! Cleanup task properly delegates to the async job helper.

The increased log verbosity (info vs debug) and the use of the centralized periodic_cleanup function align with the PR objectives.


86-86: LGTM! Log level capture and environment variable setup are correct.

Capturing the effective log level early and propagating it via environment variables ensures consistent logging across main and worker processes.

Also applies to: 151-152


184-199: LGTM! Uvicorn logging configuration properly aligns with NeMo Agent Toolkit standards.

The deep copy approach safely customizes the uvicorn logging format to include timestamps and match the centralized log format, ensuring consistent log output across all components.


49-49: LGTM! Dask threading control logic is correctly implemented.

The _use_dask_threads flag properly controls whether Dask uses threads or processes, with appropriate conditional logic for process-specific setup (like setsid()).

Also applies to: 108-108, 111-111, 118-118

@dagardner-nv
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit b1b96da into NVIDIA:release/1.4 Jan 12, 2026
17 checks passed
Jerryguan777 pushed a commit to Jerryguan777/NeMo-Agent-Toolkit that referenced this pull request Jan 28, 2026
* Move Dask task functions to their own module and update them to be self-contained.
* Be a bit aggressive about deleting variables in the `run_generation` method.
* Ensure that SQLAlchemy's `Session.remove()` is called in the event of an error
* Move `nat.cli.entrypoint.setup_logging` to `nat.utils.log_utils` to be used outside of the CLI.
* Configure logging in Dask tasks when using subprocesses
* Configure uvicorn logging to match NAT
* Log Dask details a bit more verbosely


## By Submitting this PR I confirm:
- I am familiar with the [Contributing Guidelines](https://github.com/NVIDIA/NeMo-Agent-Toolkit/blob/develop/docs/source/resources/contributing/index.md).
- We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
  - Any contribution which contains commits that are not Signed-Off will not be accepted.
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.



## Summary by CodeRabbit

* **New Features**
  * Added asynchronous job generation and a periodic cleanup task runnable via the distributed execution layer.

* **Improvements**
  * Centralized logging setup with consistent timestamp format and propagated log level to server and workers.
  * Exposed configurable worker threading mode (thread vs process) via runtime environment.
  * Server startup now applies effective log settings and enhanced log formatting.

* **Bug Fixes**
  * More robust job lifecycle cleanup with guaranteed resource handling and explicit reporting of expired-job counts.

<sub>✏️ Tip: You can customize this high-level summary in your review settings.</sub>

Authors:
  - David Gardner (https://github.com/dagardner-nv)

Approvers:
  - Will Killian (https://github.com/willkill07)

URL: NVIDIA#1374
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working non-breaking Non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants