Retry database failures in the JobRunner #259
Merged
daniel-thom merged 2 commits into main on Apr 8, 2026
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR makes the JobRunner more resilient to transient backend/database failures by expanding retry behavior and avoiding hard panics, and updates various docs/tests to use the unified top-level CLI syntax. It also introduces multi-provider AI Chat support (OpenAI/Ollama/GitHub Models) in torc-dash.
Changes:
- Add broader transient error detection + retry loop improvements for client API calls, and propagate/handle errors in JobRunner instead of panicking.
- Update docs/tests/examples to use unified CLI commands (e.g., `torc create`, `torc run`, `torc submit`) and align “slurm generate” terminology.
- Expand `torc-dash` AI Chat to support multiple LLM providers (Anthropic/OpenAI/Ollama/GitHub Models) with UI + backend wiring.
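To illustrate the first change above, here is a minimal sketch of what a retryable-error classifier might look like. The `ApiError` type, the `is_retryable` name, and the matched message strings are all assumptions made for illustration; the actual code in `src/client/utils.rs` is not shown on this page and may differ.

```rust
/// Hypothetical client error type (not the real torc type).
#[derive(Debug)]
enum ApiError {
    /// The server answered with an HTTP status code and a body.
    Http { status: u16, body: String },
    /// The request never completed (connection refused, timeout, ...).
    Transport(String),
}

/// Returns true when the failure looks transient and is worth retrying.
fn is_retryable(err: &ApiError) -> bool {
    match err {
        // Assumption: transport-level failures are treated as transient.
        ApiError::Transport(_) => true,
        ApiError::Http { status, body } => {
            let body = body.to_lowercase();
            // Any HTTP 5xx, or a body that mentions database contention.
            // The message substrings are assumed examples only.
            (500..=599).contains(status)
                || body.contains("database is locked")
                || body.contains("database is busy")
        }
    }
}
```

A caller would consult `is_retryable` before deciding whether to sleep and try again or to return the error immediately; client errors such as 404 fall through as non-retryable.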
Reviewed changes
Copilot reviewed 21 out of 22 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| torc-dash/static/js/app-chat.js | Updates chat setup UI behavior and request payload to support new providers/fields. |
| torc-dash/static/index.html | Adds provider/model/base-url fields for additional AI chat providers. |
| tests/workflows/timeout_auto_recovery_test/workflow.yaml | Updates test procedure to new CLI syntax. |
| tests/workflows/scale_test/README.md | Updates workflow creation command to torc create. |
| tests/workflows/oom_auto_recovery_test/workflow.yaml | Updates test procedure to new CLI syntax. |
| tests/workflows/database_contention_test/workflow.yaml | Updates documented commands to new CLI syntax. |
| tests/workflows/database_contention_test/README.md | Updates workflow creation command to torc create. |
| tests/workflows/README.md | Updates scale test instructions to use torc create. |
| src/mcp_server/tools.rs | Switches Slurm workflow creation to slurm generate + create two-step flow. |
| src/mcp_server/server.rs | Updates “exact CLI commands” guidance to new syntax. |
| src/client/utils.rs | Adds retryable error classifier + retries on more transient failures (HTTP 5xx / DB contention). |
| src/client/job_runner.rs | Propagates API errors, adds local kill path, replaces panics with logging + state rollback. |
| src/client/commands/slurm.rs | Aligns comment wording with slurm generate. |
| src/bin/torc-dash.rs | Adds multi-provider LLM configuration, tool filtering/token estimation, updates command wiring. |
| julia_client/Torc/test/test_workflow.jl | Updates Julia tests to call new CLI syntax. |
| examples/yaml/resource_monitoring_demo.yaml | Updates example command to torc create. |
| examples/README.md | Updates example workflow creation commands to torc create. |
| docs/src/specialized/design/recovery.md | Updates references from create-slurm to slurm generate. |
| docs/src/core/reference/cli.md | Removes create-slurm docs and clarifies lifecycle commands are top-level. |
| docs/src/core/monitoring/dashboard.md | Documents new AI Chat providers + CLI args/env vars. |
| README.md | Updates quickstart workflow creation command to torc create. |
| CLAUDE.md | Updates documented CLI usage to new syntax and removes submit-slurm. |
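The `src/client/job_runner.rs` row mentions replacing panics with logging plus state rollback. A rough sketch of that pattern follows; `Job`, `JobState`, `notify_server_started`, and `start_job` are hypothetical names invented here, not the runner's real types or functions.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum JobState {
    Pending,
    Running,
}

struct Job {
    id: u64,
    state: JobState,
}

/// Hypothetical stand-in for an API call that can fail.
/// `server_up` simulates whether the backend is reachable.
fn notify_server_started(job_id: u64, server_up: bool) -> Result<(), String> {
    if server_up {
        Ok(())
    } else {
        Err(format!("server unavailable for job {job_id}"))
    }
}

/// Marks the job as running; on API failure, logs, restores the
/// previous state, and returns the error instead of panicking.
fn start_job(job: &mut Job, server_up: bool) -> Result<(), String> {
    let previous = job.state;
    job.state = JobState::Running;
    if let Err(e) = notify_server_started(job.id, server_up) {
        eprintln!(
            "failed to start job {}: {e}; rolling back to {previous:?}",
            job.id
        );
        job.state = previous;
        return Err(e);
    }
    Ok(())
}
```

Returning the error leaves the decision to the caller, which can retry later with the job still in its earlier state rather than losing the whole runner process.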
This fixes issues seen by a user where an overloaded torc-server (running on a login node experiencing Lustre filesystem delays) caused a torc job_runner to exit. This changes the runner to retry those failures for up to 20 minutes.
It also fixes cases where we were still using old CLI syntax, mostly in the docs.
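For readers unfamiliar with time-bounded retries, the "retry for up to 20 minutes" behavior could be sketched as below. This is only an illustration under stated assumptions: the function name, the exponential backoff, and the delay parameters are hypothetical, and the PR's actual loop may be structured differently.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Retries `op` until it succeeds, fails with a non-retryable error,
/// or the next wait would exceed `max_elapsed` since the first attempt.
fn retry_with_deadline<T, E>(
    max_elapsed: Duration,
    initial_delay: Duration,
    max_delay: Duration,
    is_retryable: impl Fn(&E) -> bool,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let start = Instant::now();
    let mut delay = initial_delay;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(e) => {
                // Give up on permanent errors or when the time budget is spent.
                if !is_retryable(&e) || start.elapsed() + delay > max_elapsed {
                    return Err(e);
                }
                sleep(delay);
                // Exponential backoff, capped at `max_delay` (an assumed policy).
                delay = (delay * 2).min(max_delay);
            }
        }
    }
}
```

With `max_elapsed` set to `Duration::from_secs(20 * 60)`, a call keeps retrying transient failures for roughly 20 minutes before surfacing the last error to the caller.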