Skip to content

Buffer resource monitor samples and flush in batches#281

Merged
daniel-thom merged 3 commits intomainfrom
perf/time-series-monitoring
Apr 25, 2026
Merged

Buffer resource monitor samples and flush in batches#281
daniel-thom merged 3 commits intomainfrom
perf/time-series-monitoring

Conversation

@daniel-thom
Copy link
Copy Markdown
Collaborator

@daniel-thom daniel-thom commented Apr 25, 2026

Buffer time-series samples in memory and flush all of them — both job and system rows — in a single transaction every flush_interval_seconds (default 300). At default settings this sets the commit rate at 12/hr per monitor and gives SQLite enough rows per flush that the DB file extends in larger chunks. Aggregated peak/avg metrics live outside the buffer and are unaffected by buffer loss; only the time-series tail is exposed to crash loss (bounded by flush_interval_seconds).

Shutdown reuses the same flush path so a controlled exit loses nothing — it appends the final system sample to the buffer and then issues one transaction covering the buffered tail plus the system summary.

Durability

The metrics DB uses journal_mode=WAL + synchronous=NORMAL + temp_store=MEMORY. WAL keeps the DB consistent across process kills (e.g. Slurm SIGKILL on OOM — exactly the scenario this monitor is meant to diagnose) and OS crashes; synchronous=NORMAL fsyncs at each commit and at WAL checkpoints. With a 5-minute default flush interval the per-commit fsync cost is negligible — the perf win comes from batching, not from disabling durability. Worst-case data loss is the unflushed tail (≤ flush_interval_seconds of samples), with the DB itself remaining readable.

Buffer time-series samples in memory and flush all of them — both job and
system rows — in a single transaction every flush_interval_seconds (default
300). At default settings this drops the commit rate from 360/hr to 12/hr per
monitor and gives SQLite enough rows per flush that the DB file extends in
larger chunks. Aggregated peak/avg metrics live outside the buffer and are
unaffected by buffer loss; only the time-series tail is exposed to crash loss
(bounded by flush_interval_seconds).

Shutdown reuses the same flush path so a controlled exit loses nothing — it
appends the final system sample to the buffer and then issues one transaction
covering the buffered tail plus the system summary.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Buffers resource-monitor time-series samples in memory and flushes them to the SQLite time-series DB in batched transactions on a configurable flush interval, reducing commit frequency and filesystem churn.

Changes:

  • Added flush_interval_seconds to ResourceMonitorConfig (default 300) with validation.
  • Batched job/system time-series inserts and flush-on-shutdown via a shared flush_pending_samples transaction path.
  • Updated workflow/spec constructors to use ..ResourceMonitorConfig::default() to populate the new field.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
src/exec_cmd.rs Populates the new ResourceMonitorConfig field via struct update defaulting.
src/client/workflow_spec.rs Ensures default values are applied when enabling resource monitoring in generated specs.
src/client/resource_monitor.rs Implements buffering + periodic flushing, adds DB pragmas, and adds/updates unit tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/client/resource_monitor.rs
Comment thread src/client/resource_monitor.rs Outdated
Comment thread src/client/resource_monitor.rs Outdated
daniel-thom and others added 2 commits April 25, 2026 09:54
Previously used journal_mode=MEMORY + synchronous=OFF, which can corrupt
the entire DB on a process kill (e.g. Slurm SIGKILL on OOM) — exactly
the scenario the resource monitor is meant to diagnose. With the
5-minute default flush interval, per-commit fsync cost is negligible,
so the durability tradeoff is no longer justified.

Also harden the periodic_flush test against transient SQLite busy
errors by setting busy_timeout and retrying briefly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@daniel-thom daniel-thom merged commit f582249 into main Apr 25, 2026
13 checks passed
@daniel-thom daniel-thom deleted the perf/time-series-monitoring branch April 25, 2026 16:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants