[CLI] Support agents with custom training loops in handle_dse_job#893
rutayan-nv wants to merge 3 commits into
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough
This PR adds a runtime-checkable CustomTrainingLoopAgent protocol and helpers to run an agent's self-contained train() (with optional shutdown() and process-style exit codes). The DSE handler dispatches to this path when detected, and tests cover helper behavior and end-to-end dispatch.
Changes: Custom Training Loop Support
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cloudai/cli/handlers.py`:
- Around line 140-152: The finally block in _run_custom_training_loop currently
calls shutdown() directly which can raise and override the earlier return value;
wrap the shutdown invocation (getattr(agent, "shutdown", None) and the callable
check) in its own try/except Exception handler so any exceptions from shutdown
are caught and logged via logging.exception (include agent_type) and not
re-raised, ensuring the original return 0/1 from agent.train() is preserved.
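The behaviour described in this comment is standard Python semantics: an exception raised inside a `finally` block discards a pending `return` from the `try` body. The following is a minimal sketch of the problem only, not the code from handlers.py; `FlakyAgent`, `run_unguarded`, and `unguarded_outcome` are hypothetical names introduced for illustration.

```python
import logging


class FlakyAgent:
    """Hypothetical agent: train() succeeds, shutdown() fails."""

    def train(self) -> None:
        pass

    def shutdown(self) -> None:
        raise RuntimeError("shutdown failed")


def run_unguarded(agent: FlakyAgent) -> int:
    """Unguarded shape: shutdown() is called directly in the finally block."""
    try:
        agent.train()
        return 0
    except Exception:
        logging.exception("train() failed")
        return 1
    finally:
        # This raises, so the pending `return 0` above is discarded
        # and the RuntimeError propagates to the caller instead.
        agent.shutdown()


def unguarded_outcome() -> str:
    """Report whether the helper returned an exit code or raised."""
    try:
        return f"returned {run_unguarded(FlakyAgent())}"
    except RuntimeError as exc:
        return f"raised {exc}"
```

Calling `unguarded_outcome()` yields `"raised shutdown failed"` even though `train()` succeeded, which is why a caller that expects an `int` exit code never receives one.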
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 21ef8346-6b78-45d3-8a1c-29a3c393e440
📒 Files selected for processing (2)
- src/cloudai/cli/handlers.py
- tests/test_handlers.py
- Agents that set HAS_CUSTOM_TRAINING_LOOP = True drive their own training loop; handle_dse_job calls agent.train() and skips the per-step env.step loop.
- New _run_custom_training_loop helper logs exceptions, returns a process-style exit code, and always invokes agent.shutdown() (when defined) in a finally block so resources are released on both success and failure paths.
- CustomTrainingLoopAgent Protocol documents the opt-in contract for type checkers and IDEs.
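The opt-in dispatch described in these points could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's code: the Protocol body (a single `train()` returning `None`), the `CountingEnv`, `StepwiseAgent`, and `SelfDrivingAgent` classes, the `dispatch` function, and the `max_steps` parameter are all hypothetical.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class CustomTrainingLoopAgent(Protocol):
    """Assumed shape of the opt-in contract; the real Protocol may differ."""

    def train(self) -> None: ...


class CountingEnv:
    """Hypothetical environment that only counts step() calls."""

    def __init__(self) -> None:
        self.steps = 0

    def step(self) -> None:
        self.steps += 1


class StepwiseAgent:
    """Hypothetical agent that relies on the handler's per-step loop."""

    def __init__(self, env: CountingEnv) -> None:
        self.env = env


class SelfDrivingAgent:
    """Hypothetical agent that opts in and drives its own training."""

    HAS_CUSTOM_TRAINING_LOOP = True

    def __init__(self, env: CountingEnv) -> None:
        self.env = env
        self.trained = False

    def train(self) -> None:
        self.trained = True


def dispatch(agent, max_steps: int = 3) -> str:
    """Call train() for opted-in agents, otherwise run the per-step loop."""
    opted_in = getattr(agent, "HAS_CUSTOM_TRAINING_LOOP", False) is True
    if opted_in and isinstance(agent, CustomTrainingLoopAgent):
        agent.train()
        return "custom"
    for _ in range(max_steps):
        agent.env.step()
    return "stepwise"
```

In this sketch the flag is the explicit opt-in, and the `isinstance` check against the runtime-checkable Protocol only confirms that a `train` method is present; it does not validate the method's signature.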
Pyright rejected calling _run_custom_training_loop(agent, ...) because the plain bool predicate did not narrow agent's static type from BaseAgent to CustomTrainingLoopAgent. Return TypeGuard[CustomTrainingLoopAgent] from _has_custom_training_loop so the truthy branch in handle_dse_job sees the opted-in shape and the helper can call agent.train() directly.
If agent.shutdown() raised from the finally block, Python suppressed the earlier return 0/1 from agent.train() and propagated the exception, breaking the outer test-run loop in handle_dse_job (skipped remaining scenarios, failed to accumulate err |= rc). Wrap shutdown() in its own try/except, log via logging.exception, set rc = 1, and return rc after finally so the helper always honours the (int) -> int contract. Adds tests for shutdown-only failure and combined train+shutdown failure.
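The fix described above can be sketched as follows. This is an approximation written from the commit message, not the merged helper: the function name, the log message wording, and the `ToggleAgent` and `NoShutdownAgent` test doubles are assumptions.

```python
import logging


def run_custom_training_loop(agent, agent_type: str) -> int:
    """Run agent.train(), always attempt shutdown(), and return 0 or 1."""
    rc = 0
    try:
        agent.train()
    except Exception:
        logging.exception("Agent %s failed during train()", agent_type)
        rc = 1
    finally:
        shutdown = getattr(agent, "shutdown", None)
        if callable(shutdown):
            try:
                shutdown()
            except Exception:
                # Log and record the failure instead of re-raising, so the
                # function still returns an int to its caller.
                logging.exception("Agent %s failed during shutdown()", agent_type)
                rc = 1
    return rc


class ToggleAgent:
    """Hypothetical agent whose train() and shutdown() can be made to fail."""

    def __init__(self, train_fails: bool = False, shutdown_fails: bool = False) -> None:
        self.train_fails = train_fails
        self.shutdown_fails = shutdown_fails
        self.shutdown_called = False

    def train(self) -> None:
        if self.train_fails:
            raise ValueError("train failed")

    def shutdown(self) -> None:
        self.shutdown_called = True
        if self.shutdown_fails:
            raise RuntimeError("shutdown failed")


class NoShutdownAgent:
    """Hypothetical agent that defines train() only."""

    def train(self) -> None:
        pass
```

Because `return rc` sits after the `finally` block rather than inside the `try`, no pending return value exists for a later exception to discard, and a failure in either `train()` or `shutdown()` is reported as exit code 1.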
3ffe893 to 9552e5a
    return installables, installer


@runtime_checkable
Let's move this code into base_agent.py; handlers.py is already too long.
As for the tests against _run_custom_training_loop: I'm starting to make the tests folder structure replicate the main code structure, so in this case I'd place all the relevant tests you added into tests/configurator/test_base_agent.py.
(This does not apply to the tests against handle_dse_job.)
    agent = agent_class(env, agent_config)

    if _has_custom_training_loop(agent):
        err |= _run_custom_training_loop(agent, agent_type)
Shouldn't we exit (return err immediately) if err is greater than zero? The existing code above doesn't really handle err well, but maybe it's time to start doing so :D
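To make the trade-off in this question concrete, here is a small sketch that contrasts accumulating `err` across all runs with returning at the first failure. It does not reproduce handle_dse_job; `run_all`, its `fail_fast` parameter, and the callables standing in for test runs are hypothetical.

```python
from typing import Callable, Iterable


def run_all(runs: Iterable[Callable[[], int]], fail_fast: bool) -> tuple[int, int]:
    """Return (err, number of runs executed)."""
    err = 0
    executed = 0
    for run in runs:
        err |= run()
        executed += 1
        if fail_fast and err > 0:
            # Early exit: remaining runs are skipped.
            return err, executed
    return err, executed


def ok() -> int:
    return 0


def fail() -> int:
    return 1
```

With the runs `[ok, fail, ok]`, the fail-fast variant stops after the second run, while the accumulating variant executes all three and still reports a non-zero `err`.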