Buffer resource monitor samples and flush in batches #281
Merged
daniel-thom merged 3 commits into main on Apr 25, 2026
Buffer time-series samples in memory and flush all of them (both job and system rows) in a single transaction every flush_interval_seconds (default 300). At default settings this drops the commit rate from 360/hr to 12/hr per monitor and gives SQLite enough rows per flush that the DB file extends in larger chunks. Aggregated peak/avg metrics live outside the buffer and are unaffected by buffer loss; only the time-series tail is exposed to crash loss (bounded by flush_interval_seconds). Shutdown reuses the same flush path so a controlled exit loses nothing: it appends the final system sample to the buffer and then issues one transaction covering the buffered tail plus the system summary.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pull request overview
Buffers resource-monitor time-series samples in memory and flushes them to the SQLite time-series DB in batched transactions on a configurable flush interval, reducing commit frequency and filesystem churn.
Changes:
- Added `flush_interval_seconds` to `ResourceMonitorConfig` (default 300) with validation.
- Batched job/system time-series inserts and flush-on-shutdown via a shared `flush_pending_samples` transaction path.
- Updated workflow/spec constructors to use `..ResourceMonitorConfig::default()` to populate the new field.
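The config change above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: only the `ResourceMonitorConfig` name, the `flush_interval_seconds` field, and its default of 300 come from the review summary. The other fields, the `validate` rule, and the `enabled_config` helper are hypothetical.

```rust
/// Hypothetical reduction of the monitor config. Only `flush_interval_seconds`
/// and its default of 300 are taken from the PR; the other fields are placeholders.
#[derive(Debug, Clone, PartialEq)]
pub struct ResourceMonitorConfig {
    pub enabled: bool,
    pub sample_interval_seconds: u64,
    pub flush_interval_seconds: u64,
}

impl Default for ResourceMonitorConfig {
    fn default() -> Self {
        Self {
            enabled: false,
            sample_interval_seconds: 10, // assumed value, not stated in the review
            flush_interval_seconds: 300, // default named in the PR
        }
    }
}

impl ResourceMonitorConfig {
    /// Assumed validation rule: a zero flush interval is rejected.
    /// The PR only says the field is validated, not how.
    pub fn validate(&self) -> Result<(), String> {
        if self.flush_interval_seconds == 0 {
            return Err("flush_interval_seconds must be greater than 0".to_string());
        }
        Ok(())
    }
}

/// Struct update syntax: set the fields the caller cares about and let
/// `Default` fill in the rest, including any field added later.
pub fn enabled_config() -> ResourceMonitorConfig {
    ResourceMonitorConfig {
        enabled: true,
        ..ResourceMonitorConfig::default()
    }
}
```

Constructors written this way keep compiling when another field is added to the struct, which is presumably why the workflow/spec call sites were switched to it.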
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `src/exec_cmd.rs` | Populates the new `ResourceMonitorConfig` field via struct update defaulting. |
| `src/client/workflow_spec.rs` | Ensures default values are applied when enabling resource monitoring in generated specs. |
| `src/client/resource_monitor.rs` | Implements buffering + periodic flushing, adds DB pragmas, and adds/updates unit tests. |
The metrics DB previously used journal_mode=MEMORY + synchronous=OFF, which can corrupt the entire DB on a process kill (e.g. Slurm SIGKILL on OOM), exactly the scenario the resource monitor is meant to diagnose. With the 5-minute default flush interval, per-commit fsync cost is negligible, so the durability tradeoff is no longer justified. Also harden the periodic_flush test against transient SQLite busy errors by setting busy_timeout and retrying briefly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.
Buffer time-series samples in memory and flush all of them (both job and system rows) in a single transaction every `flush_interval_seconds` (default 300). At default settings this sets the commit rate at 12/hr per monitor and gives SQLite enough rows per flush that the DB file extends in larger chunks. Aggregated peak/avg metrics live outside the buffer and are unaffected by buffer loss; only the time-series tail is exposed to crash loss (bounded by `flush_interval_seconds`).

Shutdown reuses the same flush path so a controlled exit loses nothing: it appends the final system sample to the buffer and then issues one transaction covering the buffered tail plus the system summary.
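A minimal sketch of the buffering idea, under stated assumptions: the `Sample` and `SampleBuffer` types and the `simulate_hour` helper are hypothetical, the `commit` closure stands in for "open one transaction, insert every buffered row, commit", and the 10-second sample interval is inferred from the 360/hr figure in the commit message rather than stated anywhere. The real `flush_pending_samples` path may differ.

```rust
use std::time::{Duration, Instant};

/// Hypothetical sample row; the real job and system rows carry more columns.
#[derive(Debug, Clone, PartialEq)]
pub enum Sample {
    Job { job_id: u64, cpu_percent: f64 },
    System { cpu_percent: f64 },
}

/// Buffers samples and hands them to `commit` in one batch per flush interval.
pub struct SampleBuffer<F: FnMut(&[Sample]) -> Result<(), String>> {
    pending: Vec<Sample>,
    flush_interval: Duration,
    last_flush: Instant,
    commit: F,
}

impl<F: FnMut(&[Sample]) -> Result<(), String>> SampleBuffer<F> {
    pub fn new(flush_interval: Duration, now: Instant, commit: F) -> Self {
        Self { pending: Vec::new(), flush_interval, last_flush: now, commit }
    }

    /// Record a sample; flush only once the interval has elapsed.
    pub fn push(&mut self, sample: Sample, now: Instant) -> Result<(), String> {
        self.pending.push(sample);
        if now.duration_since(self.last_flush) >= self.flush_interval {
            self.flush(now)?;
        }
        Ok(())
    }

    /// Shared flush path, used for periodic flushes and for shutdown.
    /// The buffer is cleared only after the commit succeeds, so a failed
    /// commit keeps the rows for the next attempt.
    pub fn flush(&mut self, now: Instant) -> Result<(), String> {
        if !self.pending.is_empty() {
            (self.commit)(&self.pending)?;
            self.pending.clear();
        }
        self.last_flush = now;
        Ok(())
    }
}

/// Simulates one hour of system sampling and returns (commits, rows written).
pub fn simulate_hour(sample_secs: u64, flush_secs: u64) -> (usize, usize) {
    let start = Instant::now();
    let (mut commits, mut rows) = (0usize, 0usize);
    {
        let mut buf = SampleBuffer::new(
            Duration::from_secs(flush_secs),
            start,
            |batch: &[Sample]| {
                commits += 1;
                rows += batch.len();
                Ok(())
            },
        );
        for tick in 1..=(3600 / sample_secs) {
            let now = start + Duration::from_secs(tick * sample_secs);
            buf.push(Sample::System { cpu_percent: 0.0 }, now).unwrap();
        }
        // Controlled shutdown: flush whatever is still buffered.
        buf.flush(start + Duration::from_secs(3600)).unwrap();
    }
    (commits, rows)
}
```

With these assumptions, `simulate_hour(10, 10)` models one commit per sample and `simulate_hour(10, 300)` models the default flush interval; both write the same number of rows, only the commit count changes.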
Durability
The metrics DB uses `journal_mode=WAL` + `synchronous=NORMAL` + `temp_store=MEMORY`. WAL keeps the DB consistent across process kills (e.g. Slurm `SIGKILL` on OOM, exactly the scenario this monitor is meant to diagnose) and OS crashes. In WAL mode, `synchronous=NORMAL` syncs at WAL checkpoints rather than at every commit, so committed batches survive a process kill, but the most recent ones can roll back after an OS crash or power loss. With a 5-minute default flush interval the sync cost is negligible; the perf win comes from batching, not from disabling durability. After a process kill, worst-case data loss is the unflushed tail (≤ `flush_interval_seconds` of samples), and the DB itself remains readable in every case.
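For reference, the three pragmas can be written out as a SQL batch. The sketch below only builds the string; the constant and function names are hypothetical, and how the monitor actually applies the pragmas is not shown in the PR text. With a crate such as rusqlite, a string like this could be passed to `Connection::execute_batch` right after opening the connection.

```rust
/// Pragmas named in the Durability section, as (name, value) pairs.
pub const METRICS_DB_PRAGMAS: &[(&str, &str)] = &[
    ("journal_mode", "WAL"),
    ("synchronous", "NORMAL"),
    ("temp_store", "MEMORY"),
];

/// Renders the pairs as one newline-separated SQL batch.
pub fn pragma_batch(pragmas: &[(&str, &str)]) -> String {
    pragmas
        .iter()
        .map(|(name, value)| format!("PRAGMA {}={};", name, value))
        .collect::<Vec<_>>()
        .join("\n")
}
```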