[CLI] Support agents with custom training loops in handle_dse_job #893

Open
rutayan-nv wants to merge 3 commits into NVIDIA:main from rutayan-nv:rpatro/custom-training-loop-dispatch

@rutayan-nv
Contributor

  • Agents that set HAS_CUSTOM_TRAINING_LOOP = True drive their own training loop; handle_dse_job calls agent.train() and skips the per-step env.step loop.
  • New _run_custom_training_loop helper logs exceptions, returns a process-style exit code, and always invokes agent.shutdown() (when defined) in a finally block so resources are released on both success and failure paths.
  • CustomTrainingLoopAgent Protocol documents the opt-in contract for type checkers and IDEs.
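The opt-in contract in these bullets could be sketched roughly as follows. This is not the PR's code: the names `HAS_CUSTOM_TRAINING_LOOP`, `train()`, and `CustomTrainingLoopAgent` come from the description, while `StepAgent`, `SelfDrivenAgent`, `dispatch`, and the detection logic are hypothetical stand-ins.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class CustomTrainingLoopAgent(Protocol):
    """Opt-in contract: the agent owns its whole training loop."""

    HAS_CUSTOM_TRAINING_LOOP: bool

    def train(self) -> None: ...


class StepAgent:
    """Hypothetical default agent: the handler drives it step by step."""


class SelfDrivenAgent:
    """Hypothetical agent that opts in to the custom-loop path."""

    HAS_CUSTOM_TRAINING_LOOP = True

    def __init__(self) -> None:
        self.trained = False

    def train(self) -> None:
        self.trained = True


def dispatch(agent: object) -> str:
    """Return which path a handler would take (assumed detection logic)."""
    train = getattr(agent, "train", None)
    if getattr(agent, "HAS_CUSTOM_TRAINING_LOOP", False) and callable(train):
        train()  # the agent runs its own loop; no per-step env.step calls
        return "custom"
    return "step-loop"
```

Because the protocol is marked `@runtime_checkable`, `isinstance(agent, CustomTrainingLoopAgent)` only checks that the members exist, so the sketch still reads the flag's value explicitly.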

@coderabbitai
Contributor

coderabbitai Bot commented May 15, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 976c1ce5-2b4e-44b8-97d0-fad196e68c79

📥 Commits

Reviewing files that changed from the base of the PR and between 3ffe893 and 9552e5a.

📒 Files selected for processing (2)
  • src/cloudai/cli/handlers.py
  • tests/test_handlers.py

📝 Walkthrough

This PR adds a runtime-checkable CustomTrainingLoopAgent protocol and helpers to run an agent's self-contained train() (with optional shutdown() and process-style exit codes). The DSE handler dispatches to this path when detected and tests cover helper behavior and end-to-end dispatch.

Changes

Custom Training Loop Support

| Layer | File(s) | Summary |
|---|---|---|
| Custom training loop protocol and execution helpers | src/cloudai/cli/handlers.py | Typing imports expanded to include Protocol and runtime_checkable. New CustomTrainingLoopAgent protocol with HAS_CUSTOM_TRAINING_LOOP flag and train() method. Helper functions detect the protocol and execute the training loop, handling exceptions, calling optional shutdown(), and returning exit codes. |
| DSE job handler integration | src/cloudai/cli/handlers.py | In handle_dse_job, added conditional detection and execution of custom-training-loop agents via the helper, ORing the exit code into the error accumulator and skipping the default step loop. |
| Test agent stub and fixture | tests/test_handlers.py | Test imports expanded to include _run_custom_training_loop. New CustomLoopStubAgent and config opt into the custom loop, track train/shutdown call counts, and optionally raise during train. Fixture registers the stub in the shared registry and manages counters. |
| Helper and integration tests | tests/test_handlers.py | Tests verify that _run_custom_training_loop calls train() and optional shutdown() on success, returns nonzero and still calls shutdown() on exception with logging, and tolerates a missing shutdown(). Integration tests assert that handle_dse_job correctly dispatches to the custom loop and propagates its failures. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hop in code where loops run free,

I call my train then kindly call shutdown,
I catch the bumps and log them on the way,
A tidy exit, zero or one,

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely describes the main change: adding support for agents with custom training loops in handle_dse_job, which is the core objective of this pull request. |
| Description check | ✅ Passed | The description is directly related to the changeset, explaining the three key features: the HAS_CUSTOM_TRAINING_LOOP opt-in mechanism, the _run_custom_training_loop helper with exception handling and teardown semantics, and the CustomTrainingLoopAgent Protocol. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/cli/handlers.py`:
- Around line 140-152: The finally block in _run_custom_training_loop currently
calls shutdown() directly which can raise and override the earlier return value;
wrap the shutdown invocation (getattr(agent, "shutdown", None) and the callable
check) in its own try/except Exception handler so any exceptions from shutdown
are caught and logged via logging.exception (include agent_type) and not
re-raised, ensuring the original return 0/1 from agent.train() is preserved.
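The failure mode this comment describes can be reproduced in isolation. The sketch below uses hypothetical function names and is unrelated to the real handlers.py code; it only shows that an exception raised inside `finally` discards a pending `return`, and that guarding the cleanup call preserves it.

```python
def train_then_cleanup(cleanup):
    """Return 0 from the try block, then run cleanup in finally."""
    try:
        return 0
    finally:
        cleanup()  # if this raises, the pending `return 0` is discarded


def train_then_guarded_cleanup(cleanup):
    """Same shape, but cleanup failures are contained."""
    try:
        return 0
    finally:
        try:
            cleanup()
        except Exception:
            pass  # a real helper would log here instead of staying silent


def failing_cleanup():
    """Hypothetical cleanup that always fails."""
    raise RuntimeError("shutdown failed")
```

Calling `train_then_cleanup(failing_cleanup)` propagates the `RuntimeError`, whereas `train_then_guarded_cleanup(failing_cleanup)` still returns 0.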
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 21ef8346-6b78-45d3-8a1c-29a3c393e440

📥 Commits

Reviewing files that changed from the base of the PR and between 4bdc465 and 3b62ddc.

📒 Files selected for processing (2)
  • src/cloudai/cli/handlers.py
  • tests/test_handlers.py

Comment thread src/cloudai/cli/handlers.py Outdated
This was referenced May 15, 2026
@rutayan-nv rutayan-nv changed the title from "feat(cli): support agents with custom training loops in handle_dse_job" to "[CLI] Support agents with custom training loops in handle_dse_job" on May 15, 2026
- Agents that set HAS_CUSTOM_TRAINING_LOOP = True drive their own training loop;
  handle_dse_job calls agent.train() and skips the per-step env.step loop.
- New _run_custom_training_loop helper logs exceptions, returns a process-style
  exit code, and always invokes agent.shutdown() (when defined) in a finally
  block so resources are released on both success and failure paths.
- CustomTrainingLoopAgent Protocol documents the opt-in contract for type
  checkers and IDEs.

Pyright rejected calling _run_custom_training_loop(agent, ...) because the
plain bool predicate did not narrow agent's static type from BaseAgent to
CustomTrainingLoopAgent. Return TypeGuard[CustomTrainingLoopAgent] from
_has_custom_training_loop so the truthy branch in handle_dse_job sees the
opted-in shape and the helper can call agent.train() directly.

If agent.shutdown() raised from the finally block, Python suppressed the
earlier return 0/1 from agent.train() and propagated the exception, breaking
the outer test-run loop in handle_dse_job (skipped remaining scenarios,
failed to accumulate err |= rc). Wrap shutdown() in its own try/except,
log via logging.exception, set rc = 1, and return rc after finally so the
helper always honours the (int) -> int contract.

Adds tests for shutdown-only failure and combined train+shutdown failure.
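A helper with the shape this commit message describes could be sketched as follows. The function and stub names are hypothetical and the log messages are invented for illustration; only the control flow (guarded `shutdown()`, `rc = 1` on either failure, `return rc` after `finally`) follows the message.

```python
import logging


def run_custom_training_loop_sketch(agent, agent_type: str) -> int:
    """Run agent.train(); return 0 on success, 1 if train() or shutdown() fails."""
    rc = 0
    try:
        agent.train()
    except Exception:
        logging.exception("Custom training loop failed for agent type %s", agent_type)
        rc = 1
    finally:
        shutdown = getattr(agent, "shutdown", None)
        if callable(shutdown):
            try:
                shutdown()
            except Exception:
                logging.exception("shutdown() failed for agent type %s", agent_type)
                rc = 1
    return rc  # returned after finally, so it is never overridden


class StubAgent:
    """Hypothetical test double that can fail in train(), shutdown(), or both."""

    def __init__(self, fail_train: bool = False, fail_shutdown: bool = False) -> None:
        self.fail_train = fail_train
        self.fail_shutdown = fail_shutdown
        self.shutdown_calls = 0

    def train(self) -> None:
        if self.fail_train:
            raise RuntimeError("train failed")

    def shutdown(self) -> None:
        self.shutdown_calls += 1
        if self.fail_shutdown:
            raise RuntimeError("shutdown failed")
```

The two cases named in the message map to `StubAgent(fail_shutdown=True)` and `StubAgent(fail_train=True, fail_shutdown=True)`. In both, the sketch returns 1 and calls `shutdown()` exactly once.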
@rutayan-nv rutayan-nv force-pushed the rpatro/custom-training-loop-dispatch branch from 3ffe893 to 9552e5a on May 18, 2026 16:33
return installables, installer


@runtime_checkable
Contributor

let's move this code into base_agent.py. handlers.py is already too long

as for the tests against _run_custom_training_loop: I'm starting to make the tests folder structure replicate the main code structure. so in this case, I'd place all the relevant tests you added into tests/configurator/test_base_agent.py

(not related to tests against handle_dse_job)

agent = agent_class(env, agent_config)

if _has_custom_training_loop(agent):
    err |= _run_custom_training_loop(agent, agent_type)
Contributor

shouldn't we exit (immediate return err) if err is greater than zero? The existing code above doesn't really treat the err well but maybe it's the time to start doing so :D
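The trade-off raised here, accumulating `err` across all runs versus returning as soon as it is nonzero, can be illustrated with a small sketch. `run_all` and its `fail_fast` flag are hypothetical and do not reflect how handle_dse_job is actually structured.

```python
from typing import Callable, Iterable


def run_all(jobs: Iterable[Callable[[], int]], fail_fast: bool) -> tuple[int, int]:
    """Run jobs in order; return (accumulated error code, number of jobs run)."""
    err = 0
    ran = 0
    for job in jobs:
        err |= job()  # same OR-accumulation as the diff above
        ran += 1
        if fail_fast and err > 0:
            break  # the "immediate return err" policy
    return err, ran
```

With three jobs where the second fails, the accumulating policy runs all three, while the fail-fast policy stops after two. Both report a nonzero `err`.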

2 participants