[CLI] Support agents with custom training loops in handle_dse_job by rutayan-nv · Pull Request #893 · NVIDIA/cloudai

rutayan-nv · 2026-05-15T20:57:21Z

Agents that set HAS_CUSTOM_TRAINING_LOOP = True drive their own training loop; handle_dse_job calls agent.train() and skips the per-step env.step loop.
New _run_custom_training_loop helper logs exceptions, returns a process-style exit code, and always invokes agent.shutdown() (when defined) in a finally block so resources are released on both success and failure paths.
CustomTrainingLoopAgent Protocol documents the opt-in contract for type checkers and IDEs.
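The opt-in contract described above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the PR: `SelfTrainingAgent`, `StepAgent`, and `dispatch` are hypothetical names, and the real definitions in `src/cloudai/cli/handlers.py` may differ in signature and detail.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class CustomTrainingLoopAgent(Protocol):
    """Opt-in contract: the agent drives its own training loop."""

    HAS_CUSTOM_TRAINING_LOOP: bool

    def train(self) -> None: ...


class SelfTrainingAgent:
    """Hypothetical agent that opts in to the custom loop."""

    HAS_CUSTOM_TRAINING_LOOP = True

    def __init__(self) -> None:
        self.trained = False

    def train(self) -> None:
        self.trained = True


class StepAgent:
    """Hypothetical agent that relies on the default per-step loop."""


def dispatch(agent: object) -> str:
    # The isinstance check only verifies that the attribute and method exist;
    # the flag itself must also be truthy for the agent to count as opted in.
    if isinstance(agent, CustomTrainingLoopAgent) and agent.HAS_CUSTOM_TRAINING_LOOP:
        agent.train()
        return "custom"
    return "default"
```

Here `dispatch(SelfTrainingAgent())` takes the custom path and calls `train()`, while `dispatch(StepAgent())` falls through to the default path.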

coderabbitai · 2026-05-15T20:57:32Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 976c1ce5-2b4e-44b8-97d0-fad196e68c79

📥 Commits

Reviewing files that changed from the base of the PR and between 3ffe893 and 9552e5a.

📒 Files selected for processing (2)

src/cloudai/cli/handlers.py
tests/test_handlers.py

📝 Walkthrough

This PR adds a runtime-checkable CustomTrainingLoopAgent protocol and helpers to run an agent's self-contained train() (with optional shutdown() and process-style exit codes). The DSE handler dispatches to this path when detected and tests cover helper behavior and end-to-end dispatch.

Changes

Custom Training Loop Support

Layer / File(s)	Summary
Custom training loop protocol and execution helpers `src/cloudai/cli/handlers.py`	Typing imports expanded to include `Protocol` and `runtime_checkable`. New `CustomTrainingLoopAgent` protocol with `HAS_CUSTOM_TRAINING_LOOP` flag and `train()` method. Helper functions detect the protocol and execute the training loop, handling exceptions, calling optional `shutdown()`, and returning exit codes.
DSE job handler integration `src/cloudai/cli/handlers.py`	In `handle_dse_job`, added conditional detection and execution of custom-training-loop agents via the helper, ORing the exit code into the error accumulator and skipping the default step loop.
Test agent stub and fixture `tests/test_handlers.py`	Test imports expanded to include `_run_custom_training_loop`. New `CustomLoopStubAgent` and config opt into the custom loop, track `train`/`shutdown` call counts, optionally raise during `train`. Fixture registers the stub in the shared registry and manages counters.
Helper and integration tests `tests/test_handlers.py`	Tests verify `_run_custom_training_loop` calls `train()` and optional `shutdown()` on success, returns nonzero and still calls `shutdown()` on exception with logging, tolerates missing `shutdown()`, and integration tests assert `handle_dse_job` correctly dispatches to and propagates failures from the custom loop.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hop in code where loops run free,

I call my train then kindly call shutdown,
I catch the bumps and log them on the way,
A tidy exit, zero or one,
✨

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely describes the main change: adding support for agents with custom training loops in handle_dse_job, which is the core objective of this pull request.
Description check	✅ Passed	The description is directly related to the changeset, explaining the three key features: the HAS_CUSTOM_TRAINING_LOOP opt-in mechanism, the _run_custom_training_loop helper with exception handling and teardown semantics, and the CustomTrainingLoopAgent Protocol.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/cli/handlers.py`:
- Around line 140-152: The finally block in _run_custom_training_loop currently
calls shutdown() directly which can raise and override the earlier return value;
wrap the shutdown invocation (getattr(agent, "shutdown", None) and the callable
check) in its own try/except Exception handler so any exceptions from shutdown
are caught and logged via logging.exception (include agent_type) and not
re-raised, ensuring the original return 0/1 from agent.train() is preserved.
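The Python behaviour behind this finding can be reproduced in isolation. The sketch below is not the PR's code; `train_then_shutdown` and `call` are hypothetical names used only to show that an exception raised inside `finally` replaces a pending `return`.

```python
def train_then_shutdown(shutdown_raises: bool) -> int:
    try:
        return 0  # stands in for a successful agent.train()
    finally:
        if shutdown_raises:
            # An exception raised here replaces the pending `return 0`.
            raise RuntimeError("shutdown failed")


def call(shutdown_raises: bool) -> str:
    # Report whether the caller saw a return value or an exception.
    try:
        return f"returned {train_then_shutdown(shutdown_raises)}"
    except RuntimeError as exc:
        return f"raised {exc}"
```

With `shutdown_raises=False` the caller sees the return value; with `shutdown_raises=True` the caller sees only the exception, and the `0` is lost.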

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 21ef8346-6b78-45d3-8a1c-29a3c393e440

📥 Commits

Reviewing files that changed from the base of the PR and between 4bdc465 and 3b62ddc.

📒 Files selected for processing (2)

src/cloudai/cli/handlers.py
tests/test_handlers.py

- Agents that set HAS_CUSTOM_TRAINING_LOOP = True drive their own training loop; handle_dse_job calls agent.train() and skips the per-step env.step loop.
- New _run_custom_training_loop helper logs exceptions, returns a process-style exit code, and always invokes agent.shutdown() (when defined) in a finally block so resources are released on both success and failure paths.
- CustomTrainingLoopAgent Protocol documents the opt-in contract for type checkers and IDEs.

Pyright rejected calling _run_custom_training_loop(agent, ...) because the plain bool predicate did not narrow agent's static type from BaseAgent to CustomTrainingLoopAgent. Return TypeGuard[CustomTrainingLoopAgent] from _has_custom_training_loop so the truthy branch in handle_dse_job sees the opted-in shape and the helper can call agent.train() directly.

If agent.shutdown() raised from the finally block, Python suppressed the earlier return 0/1 from agent.train() and propagated the exception, breaking the outer test-run loop in handle_dse_job (skipped remaining scenarios, failed to accumulate err |= rc). Wrap shutdown() in its own try/except, log via logging.exception, set rc = 1, and return rc after finally so the helper always honours the (int) -> int contract. Adds tests for shutdown-only failure and combined train+shutdown failure.
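The shape of the fix described here might look like the sketch below. It follows the commit message (guarded `shutdown()`, `rc = 1` on any failure, `return rc` after `finally`), but the function body, log messages, and the `StubAgent` and `NoShutdownAgent` classes are assumptions for illustration rather than the PR's implementation.

```python
import logging


def run_custom_training_loop(agent, agent_type: str) -> int:
    """Run agent.train(), then shutdown() if defined; return 0 or 1."""
    rc = 0
    try:
        agent.train()
    except Exception:
        logging.exception("Agent %s failed during train()", agent_type)
        rc = 1
    finally:
        shutdown = getattr(agent, "shutdown", None)
        if callable(shutdown):
            try:
                shutdown()
            except Exception:
                # Log and record the failure instead of letting it escape.
                logging.exception("Agent %s failed during shutdown()", agent_type)
                rc = 1
    # Returning after the finally block keeps the exit code intact.
    return rc


class StubAgent:
    """Hypothetical agent whose train()/shutdown() can be made to fail."""

    def __init__(self, train_fails: bool = False, shutdown_fails: bool = False) -> None:
        self.train_fails = train_fails
        self.shutdown_fails = shutdown_fails
        self.shutdown_calls = 0

    def train(self) -> None:
        if self.train_fails:
            raise RuntimeError("train failed")

    def shutdown(self) -> None:
        self.shutdown_calls += 1
        if self.shutdown_fails:
            raise RuntimeError("shutdown failed")


class NoShutdownAgent:
    """Hypothetical agent that defines no shutdown()."""

    def train(self) -> None:
        pass
```

Because no exception escapes the helper, a caller that accumulates `err |= rc` across runs keeps iterating even when teardown fails.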

podkidyshev · 2026-05-19T14:44:00Z

    return installables, installer


+@runtime_checkable


Let's move this code into base_agent.py; handlers.py is already too long.

As for the tests against _run_custom_training_loop: I'm starting to make the tests folder structure mirror the main code structure, so I'd place all the relevant tests you added in tests/configurator/test_base_agent.py. (This doesn't apply to the tests against handle_dse_job.)

podkidyshev · 2026-05-19T14:49:57Z

        agent = agent_class(env, agent_config)

+        if _has_custom_training_loop(agent):
+            err |= _run_custom_training_loop(agent, agent_type)


Shouldn't we exit (immediately return err) if err is greater than zero? The existing code above doesn't really handle err well, but maybe it's time to start doing so :D
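The two policies contrasted in this comment can be sketched side by side. This is a hypothetical illustration only: `run_all_accumulate` and `run_all_fail_fast` are invented names, and each "run" is modelled as a zero-argument callable that returns an exit code, which is an assumption about how the loop in handle_dse_job could be viewed.

```python
from typing import Callable, Iterable


def run_all_accumulate(runs: Iterable[Callable[[], int]]) -> int:
    """Run every item and OR the exit codes together."""
    err = 0
    for run in runs:
        err |= run()
    return err


def run_all_fail_fast(runs: Iterable[Callable[[], int]]) -> int:
    """Stop at the first non-zero exit code and return it."""
    for run in runs:
        rc = run()
        if rc > 0:
            return rc
    return 0
```

Accumulating still executes the remaining runs after a failure, whereas failing fast skips them; which is preferable depends on whether later runs are meaningful once one has failed.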

rutayan-nv requested review from jeffnvidia, podkidyshev and srivatsankrishnan as code owners May 15, 2026 20:57

coderabbitai Bot reviewed May 15, 2026

Comment thread src/cloudai/cli/handlers.py Outdated

This was referenced May 15, 2026

Gym enhancements #863

Draft

More Gym enhancements #884

Draft

rutayan-nv changed the title ~~feat(cli): support agents with custom training loops in handle_dse_job~~ [CLI] Support agents with custom training loops in handle_dse_job May 15, 2026

rutayan-nv added 3 commits May 18, 2026 12:33

rutayan-nv force-pushed the rpatro/custom-training-loop-dispatch branch from 3ffe893 to 9552e5a Compare May 18, 2026 16:33

podkidyshev requested changes May 19, 2026

rutayan-nv wants to merge 3 commits into NVIDIA:main from rutayan-nv:rpatro/custom-training-loop-dispatch
