Add compute node resource monitoring #270
Pull request overview
Adds end-to-end compute-node (system-wide) resource monitoring, persists summary metrics on compute node records, and extends the dashboard, TUI, CLI, and plot tooling to display and filter these metrics. Also updates the workflow-spec configuration semantics for resource monitoring.
Changes:
- Add compute-node resource summary fields to the DB/API model and populate them from the resource monitor at runner shutdown.
- Introduce scoped `resource_monitor.jobs` and `resource_monitor.compute_node` config blocks (while keeping legacy compatibility) and update docs/examples/tests accordingly.
- Extend UI surfaces (dashboard tables, resource plots tab, TUI, CLI) and plot-resources tooling to include system timelines/summaries and workflow-based DB filtering.
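
The summary fields described above (sample count, peak and average CPU, peak and average memory) could be derived from the collected system samples roughly as follows. This is a minimal sketch, not the PR's actual code: `SystemSample`, `SystemSummary`, `summarize`, and all field names and units are hypothetical and chosen only for illustration.

```rust
/// One system-wide sample. Hypothetical shape; the real monitor may
/// record different fields or units.
#[derive(Debug, Clone, Copy)]
pub struct SystemSample {
    pub cpu_percent: f64,
    pub memory_bytes: u64,
}

/// Summary that would be written to the compute-node record at shutdown.
/// Field names are assumptions, not the PR's column names.
#[derive(Debug, Clone, PartialEq)]
pub struct SystemSummary {
    pub sample_count: usize,
    pub peak_cpu_percent: f64,
    pub avg_cpu_percent: f64,
    pub peak_memory_bytes: u64,
    pub avg_memory_bytes: f64,
}

/// Reduce the samples to peak/average values.
/// Returns `None` when no samples were collected, so the caller can
/// leave the optional summary fields unset.
pub fn summarize(samples: &[SystemSample]) -> Option<SystemSummary> {
    let first = samples.first()?;
    let mut peak_cpu = first.cpu_percent;
    let mut peak_mem = first.memory_bytes;
    let mut sum_cpu = 0.0_f64;
    let mut sum_mem = 0.0_f64;

    for s in samples {
        peak_cpu = peak_cpu.max(s.cpu_percent);
        peak_mem = peak_mem.max(s.memory_bytes);
        sum_cpu += s.cpu_percent;
        sum_mem += s.memory_bytes as f64;
    }

    let n = samples.len() as f64;
    Some(SystemSummary {
        sample_count: samples.len(),
        peak_cpu_percent: peak_cpu,
        avg_cpu_percent: sum_cpu / n,
        peak_memory_bytes: peak_mem,
        avg_memory_bytes: sum_mem / n,
    })
}
```

Returning `Option` mirrors the idea that the new summary fields are optional on the model: a runner that collected no samples would simply not populate them.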
Reviewed changes
Copilot reviewed 40 out of 40 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| torc-server/migrations/20260319000000_add_compute_node_resource_summary.up.sql | Adds compute-node summary columns. |
| torc-server/migrations/20260319000000_add_compute_node_resource_summary.down.sql | Drops the added compute-node summary columns. |
| torc-dash/static/js/app-workflows.js | Adds resource-plots workflow selector + robust id comparison. |
| torc-dash/static/js/app-tables.js | Shows compute-node peak/avg CPU + memory columns in tables. |
| torc-dash/static/js/app-resources.js | Filters resource DB list by workflow id; resets plot state on workflow change. |
| torc-dash/static/js/app-details.js | Adds compute-node peak/avg columns to details table body rendering. |
| torc-dash/static/js/app-core.js | Syncs resource-plots workflow selector on tab switch. |
| torc-dash/static/index.html | Adds workflow selector UI to Resource Plots tab. |
| torc-dash/static/css/style.css | Styles workflow labels in resource DB list. |
| tests/workflows/multi_node_parallel_jobs_test/workflow.yaml | Updates workflow spec to new resource_monitor.jobs shape. |
| tests/test_resource_requirements.rs | Improves panic message with underlying error. |
| tests/test_hpc.rs | Adds additional ISO8601 duration parsing test case. |
| tests/test_compute_nodes.rs | Adds integration test for compute-node summary field round-trip. |
| src/tui/ui.rs | Adds Compute Nodes detail tab + table rendering with summary columns. |
| src/tui/app.rs | Adds ComputeNodes view state, loading, and filtering. |
| src/tui/api.rs | Adds client method to list compute nodes for TUI. |
| src/server/api/compute_nodes.rs | Extends compute-node CRUD/list to include new summary columns. |
| src/plot_resources_cmd.rs | Adds system sample/summary loading + system plots; improves bar dashboard axes. |
| src/models.rs | Extends ComputeNodeModel with summary fields + initializes defaults/tests. |
| src/client/workflow_spec.rs | Parses/prints nested resource-monitor scopes; adds legacy-compat tests. |
| src/client/resource_monitor.rs | Implements scoped job vs compute-node monitoring + persists system samples/summary. |
| src/client/job_runner.rs | Captures monitor shutdown summary and writes to compute-node record. |
| src/client/commands/compute_nodes.rs | Displays compute-node system peak/avg metrics in CLI list/get + adds tests. |
| src/client/async_cli_command.rs | Starts/stops per-job monitoring only when jobs scope is enabled. |
| src/bin/torc-dash.rs | Adds workflow_id parsing for resource DB filenames and returns it to UI. |
| python_client/src/torc/openapi_client/models/compute_node_model.py | Adds new compute-node summary fields to Python client model. |
| julia_client/julia_client/docs/ComputeNodeModel.md | Documents new compute-node summary fields for Julia client. |
| julia_client/Torc/src/api/models/model_ComputeNodeModel.jl | Adds new compute-node summary fields to Julia client model. |
| examples/yaml/slurm_staged_pipeline.yaml | Updates example to new resource_monitor.jobs config shape. |
| examples/yaml/resource_monitoring_demo.yaml | Updates example to include compute-node monitoring scope. |
| examples/yaml/multi_node_slurm.yaml | Updates example to new resource_monitor.jobs config shape. |
| examples/kdl/slurm_staged_pipeline.kdl | Updates example to nested jobs config block. |
| examples/kdl/resource_monitoring_demo.kdl | Updates example to include nested jobs + compute_node blocks. |
| examples/json/slurm_staged_pipeline.json5 | Updates example to nested jobs config block. |
| examples/json/resource_monitoring_demo.json5 | Updates example to include nested jobs + compute_node blocks. |
| docs/src/core/reference/workflow-spec.md | Documents new scoped resource-monitor config and legacy behavior. |
| docs/src/core/reference/resource-monitoring.md | Updates resource-monitoring reference with system tables/plots. |
| docs/src/core/how-to/view-resource-plots.md | Updates how-to to use resource_monitor.jobs for time-series. |
| api/openapi.yaml | Adds new compute-node summary fields to OpenAPI schema. |
| api/openapi.codegen.yaml | Mirrors OpenAPI schema changes for codegen. |
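
The `src/bin/torc-dash.rs` row mentions parsing a workflow id out of resource DB filenames so the UI can filter by workflow. The actual filename format is not shown in this review, so the sketch below assumes a purely hypothetical convention in which the file stem contains a `wf<ID>` token separated by underscores. The function name `workflow_id_from_filename` is likewise an assumption.

```rust
use std::path::Path;

/// Extract a workflow id from a resource DB filename.
/// ASSUMPTION: the stem contains an underscore-separated token of the
/// form `wf<digits>` (e.g. `resource_metrics_wf42_node1.db`). The real
/// naming scheme used by torc may differ.
pub fn workflow_id_from_filename(path: &str) -> Option<i64> {
    let stem = Path::new(path).file_stem()?.to_str()?;
    stem.split('_')
        // Keep only tokens that start with the hypothetical "wf" prefix.
        .filter_map(|token| token.strip_prefix("wf"))
        // Require at least one digit and nothing but digits.
        .filter(|digits| !digits.is_empty() && digits.bytes().all(|b| b.is_ascii_digit()))
        // Take the first token that parses without overflow.
        .find_map(|digits| digits.parse::<i64>().ok())
}
```

Returning `Option<i64>` lets files that do not follow the convention still be listed, just without a workflow label.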
ae857b1 to 58c075e
Pull request overview
Adds compute-node (system-wide) resource monitoring and surfaces the resulting peak/avg CPU+memory summaries across the API, CLI/TUI, dashboard, and plot generation tooling.
Changes:
- Extend `compute_node` with persisted resource summary fields (sample count, peak/avg CPU%, peak/avg memory).
- Add compute-node/system sampling + storage to the resource monitor and generate system timeline/summary plots in `plot-resources`.
- Update dashboard/TUI/CLI plus workflow-spec parsing/docs/examples to support scoped `resource_monitor.jobs` and `resource_monitor.compute_node` configuration.
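
To illustrate how scoped configuration with legacy compatibility might gate monitoring, here is a minimal sketch. Only the `jobs_enabled()` name appears in this review. The struct layout, the `legacy_enabled` field, `compute_node_enabled()`, and the rule that a legacy flat setting maps to the jobs scope are all assumptions made for the example.

```rust
/// Settings for a single monitoring scope (hypothetical; the real
/// blocks likely carry more options).
#[derive(Debug, Clone, Default, PartialEq)]
pub struct ScopeConfig {
    pub enabled: bool,
}

/// Hypothetical in-memory form of a `resource_monitor` block.
#[derive(Debug, Clone, Default)]
pub struct ResourceMonitorConfig {
    /// Legacy flat flag, present only in old-style specs (assumed).
    pub legacy_enabled: Option<bool>,
    /// Nested `resource_monitor.jobs` block.
    pub jobs: Option<ScopeConfig>,
    /// Nested `resource_monitor.compute_node` block.
    pub compute_node: Option<ScopeConfig>,
}

impl ResourceMonitorConfig {
    /// Per-job monitoring is on if the nested `jobs` scope says so.
    /// ASSUMPTION: when no `jobs` block exists, the legacy flat flag
    /// is interpreted as the jobs scope.
    pub fn jobs_enabled(&self) -> bool {
        match &self.jobs {
            Some(scope) => scope.enabled,
            None => self.legacy_enabled.unwrap_or(false),
        }
    }

    /// System-wide monitoring is on only when explicitly configured.
    pub fn compute_node_enabled(&self) -> bool {
        self.compute_node
            .as_ref()
            .map_or(false, |scope| scope.enabled)
    }
}
```

Under these assumptions an old-style spec keeps its per-job behavior unchanged, while compute-node monitoring stays off unless the new block is present.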
Reviewed changes
Copilot reviewed 40 out of 40 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| torc-server/migrations/20260319000000_add_compute_node_resource_summary.up.sql | Adds new compute-node summary columns to the DB schema. |
| torc-server/migrations/20260319000000_add_compute_node_resource_summary.down.sql | Removes the new compute-node summary columns on downgrade. |
| torc-dash/static/js/app-workflows.js | Adds a workflow selector for the resource plots tab + robust ID comparison. |
| torc-dash/static/js/app-tables.js | Shows CPU/mem peak/avg columns in compute nodes table. |
| torc-dash/static/js/app-resources.js | Adds workflow filtering + state reset for resource DB selection/plot generation. |
| torc-dash/static/js/app-details.js | Shows CPU/mem peak/avg in compute node details table. |
| torc-dash/static/js/app-core.js | Syncs selected workflow into the resource-plots workflow selector. |
| torc-dash/static/index.html | Adds workflow selector UI to Resource Plots tab. |
| torc-dash/static/css/style.css | Styles workflow label in resource DB list items. |
| tests/workflows/multi_node_parallel_jobs_test/workflow.yaml | Updates resource_monitor config to the new nested jobs structure. |
| tests/test_resource_requirements.rs | Improves panic message with underlying error. |
| tests/test_hpc.rs | Adds additional ISO8601 duration parsing test coverage. |
| tests/test_compute_nodes.rs | Adds API round-trip test for new compute-node summary fields. |
| src/tui/ui.rs | Adds a Compute Nodes detail view/table rendering in the TUI. |
| src/tui/app.rs | Adds compute-nodes state, filtering, and load behavior to the TUI app model. |
| src/tui/api.rs | Adds a TUI client call to list compute nodes. |
| src/server/api/compute_nodes.rs | Extends compute-node CRUD/list queries to include the new summary fields. |
| src/plot_resources_cmd.rs | Loads/merges system samples/summary and generates system timeline/summary plots + tests. |
| src/models.rs | Extends ComputeNodeModel with the new optional summary fields. |
| src/client/workflow_spec.rs | Adds scoped resource_monitor parsing/serialization + legacy compatibility tests. |
| src/client/resource_monitor.rs | Introduces scoped monitoring config and compute-node (system) sampling + DB storage + tests. |
| src/client/job_runner.rs | Persists compute-node system summary to the compute_node record on shutdown. |
| src/client/commands/compute_nodes.rs | Displays compute-node peak/avg CPU+mem in CLI list/get output + tests. |
| src/client/async_cli_command.rs | Gates per-job monitoring by jobs_enabled() for new scoped config. |
| src/bin/torc-dash.rs | Adds workflow-id parsing for resource DB filenames + exposes it via API + tests. |
| python_client/src/torc/openapi_client/models/compute_node_model.py | Regenerates Python client model to include new fields. |
| julia_client/julia_client/docs/ComputeNodeModel.md | Updates Julia client docs for new compute-node fields. |
| julia_client/Torc/src/api/models/model_ComputeNodeModel.jl | Updates Julia client model for new compute-node fields. |
| examples/yaml/slurm_staged_pipeline.yaml | Updates example to nested resource_monitor.jobs. |
| examples/yaml/resource_monitoring_demo.yaml | Updates example to nested jobs + new compute_node block. |
| examples/yaml/multi_node_slurm.yaml | Updates example to nested resource_monitor.jobs. |
| examples/kdl/slurm_staged_pipeline.kdl | Updates example to nested jobs block. |
| examples/kdl/resource_monitoring_demo.kdl | Updates example to nested jobs + compute_node blocks. |
| examples/json/slurm_staged_pipeline.json5 | Updates example to nested resource_monitor.jobs. |
| examples/json/resource_monitoring_demo.json5 | Updates example to nested jobs + compute_node blocks. |
| docs/src/core/reference/workflow-spec.md | Documents new scoped resource_monitor configuration and legacy behavior. |
| docs/src/core/reference/resource-monitoring.md | Documents job vs compute-node monitoring scopes and new DB tables/outputs. |
| docs/src/core/how-to/view-resource-plots.md | Updates how-to to use nested resource_monitor.jobs. |
| api/openapi.yaml | Extends ComputeNodeModel schema with new summary fields. |
| api/openapi.codegen.yaml | Extends codegen schema with new summary fields. |
No description provided.