Buffer resource monitor samples and flush in batches by daniel-thom · Pull Request #281 · NatLabRockies/torc

daniel-thom · 2026-04-25T15:46:36Z

Buffer time-series samples in memory and flush all of them — both job and system rows — in a single transaction every flush_interval_seconds (default 300). At default settings this sets the commit rate at 12/hr per monitor and gives SQLite enough rows per flush that the DB file extends in larger chunks. Aggregated peak/avg metrics live outside the buffer and are unaffected by buffer loss; only the time-series tail is exposed to crash loss (bounded by flush_interval_seconds).

Shutdown reuses the same flush path so a controlled exit loses nothing — it appends the final system sample to the buffer and then issues one transaction covering the buffered tail plus the system summary.
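The buffering and shutdown behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `Sample`, `FakeDb`, `SampleBuffer`, and all of their fields and methods are hypothetical names, and `FakeDb::commit_batch` stands in for a single SQLite transaction. The system summary that the real shutdown path writes in the same transaction is left out.

```rust
use std::time::{Duration, Instant};

/// One time-series row. The fields are illustrative, not the PR's schema.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Sample {
    Job { job_id: u64, cpu_percent: f64 },
    System { cpu_percent: f64 },
}

/// Stand-in for the SQLite connection (assumption): each call to
/// `commit_batch` represents one transaction holding every row passed in.
#[derive(Default)]
struct FakeDb {
    transactions: Vec<Vec<Sample>>,
}

impl FakeDb {
    fn commit_batch(&mut self, rows: Vec<Sample>) {
        self.transactions.push(rows);
    }
}

/// Hypothetical in-memory buffer for job and system samples.
struct SampleBuffer {
    pending: Vec<Sample>,
    flush_interval: Duration,
    last_flush: Instant,
}

impl SampleBuffer {
    fn new(flush_interval: Duration, now: Instant) -> Self {
        Self { pending: Vec::new(), flush_interval, last_flush: now }
    }

    /// Buffer a sample; flush only once the flush interval has elapsed.
    fn record(&mut self, sample: Sample, now: Instant, db: &mut FakeDb) {
        self.pending.push(sample);
        if now.duration_since(self.last_flush) >= self.flush_interval {
            self.flush(now, db);
        }
    }

    /// Write all pending rows in one transaction and reset the timer.
    fn flush(&mut self, now: Instant, db: &mut FakeDb) {
        if !self.pending.is_empty() {
            db.commit_batch(std::mem::take(&mut self.pending));
        }
        self.last_flush = now;
    }

    /// Controlled exit: append the final system sample, then reuse `flush`.
    fn shutdown(&mut self, final_system: Sample, now: Instant, db: &mut FakeDb) {
        self.pending.push(final_system);
        self.flush(now, db);
    }
}
```

With a 300-second interval, a buffer like this commits 3600 / 300 = 12 times per hour, which matches the rate quoted above. Passing `now` in explicitly, rather than reading the clock inside the buffer, is only a convenience that makes the sketch easy to test.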

Durability

The metrics DB uses journal_mode=WAL + synchronous=NORMAL + temp_store=MEMORY. WAL keeps the DB consistent across process kills (e.g. Slurm SIGKILL on OOM, exactly the scenario this monitor is meant to diagnose) and OS crashes. In WAL mode, synchronous=NORMAL syncs at WAL checkpoints rather than at every commit, so a committed flush always survives a process kill, but the most recent commits can roll back after an OS crash or power loss. With a 5-minute default flush interval the sync cost is negligible either way: the perf win comes from batching, not from disabling durability. For a process kill, worst-case data loss is the unflushed tail (≤ flush_interval_seconds of samples), and the DB itself remains readable.
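As a rough illustration of the settings named above, the three pragmas could be collected and rendered into one SQL batch when the metrics DB is opened. The constant and function names here are hypothetical, and the sketch deliberately stops short of calling any particular SQLite binding; it only builds the SQL text.

```rust
/// Pragmas named in the PR description, as (name, value) pairs.
/// `METRICS_DB_PRAGMAS` is a hypothetical name for this sketch.
const METRICS_DB_PRAGMAS: [(&str, &str); 3] = [
    ("journal_mode", "WAL"),
    ("synchronous", "NORMAL"),
    ("temp_store", "MEMORY"),
];

/// Render the pragmas as one SQL batch string, one statement per line,
/// suitable for whatever "execute batch" call the SQLite binding offers.
fn pragma_batch() -> String {
    METRICS_DB_PRAGMAS
        .iter()
        .map(|(name, value)| format!("PRAGMA {} = {};", name, value))
        .collect::<Vec<_>>()
        .join("\n")
}
```

Keeping the settings in one table makes it easy to see at a glance which durability mode the monitor runs with.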

Buffer time-series samples in memory and flush all of them (both job and system rows) in a single transaction every flush_interval_seconds (default 300). At default settings this drops the commit rate from 360/hr to 12/hr per monitor and gives SQLite enough rows per flush that the DB file extends in larger chunks. Aggregated peak/avg metrics live outside the buffer and are unaffected by buffer loss; only the time-series tail is exposed to crash loss (bounded by flush_interval_seconds).

Shutdown reuses the same flush path so a controlled exit loses nothing: it appends the final system sample to the buffer and then issues one transaction covering the buffered tail plus the system summary.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

Buffers resource-monitor time-series samples in memory and flushes them to the SQLite time-series DB in batched transactions on a configurable flush interval, reducing commit frequency and filesystem churn.

Changes:

Added flush_interval_seconds to ResourceMonitorConfig (default 300) with validation.
Batched job/system time-series inserts and flush-on-shutdown via a shared flush_pending_samples transaction path.
Updated workflow/spec constructors to use ..ResourceMonitorConfig::default() to populate the new field.
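The new field and the struct-update pattern mentioned in the changes above might look roughly like this. Only `ResourceMonitorConfig`, `flush_interval_seconds`, and its default of 300 come from the PR; the `enabled` field, the `validate` method with its non-zero rule, and the `monitoring_enabled` constructor are assumptions made for illustration.

```rust
#[derive(Debug, Clone, PartialEq)]
struct ResourceMonitorConfig {
    /// Hypothetical field, included so there is something else to set.
    enabled: bool,
    /// Named in the PR; seconds between batched flushes (default 300).
    flush_interval_seconds: u64,
}

impl Default for ResourceMonitorConfig {
    fn default() -> Self {
        Self { enabled: false, flush_interval_seconds: 300 }
    }
}

impl ResourceMonitorConfig {
    /// Hypothetical validation rule: reject a zero flush interval.
    fn validate(&self) -> Result<(), String> {
        if self.flush_interval_seconds == 0 {
            return Err("flush_interval_seconds must be greater than 0".to_string());
        }
        Ok(())
    }
}

/// A constructor that sets only the fields it cares about and lets
/// struct update syntax fill in the rest, including the new field.
fn monitoring_enabled() -> ResourceMonitorConfig {
    ResourceMonitorConfig { enabled: true, ..ResourceMonitorConfig::default() }
}
```

The benefit of `..ResourceMonitorConfig::default()` is that call sites written this way keep compiling when another field is added later, and they pick up its default automatically.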

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `src/exec_cmd.rs` | Populates the new `ResourceMonitorConfig` field via struct update defaulting. |
| `src/client/workflow_spec.rs` | Ensures default values are applied when enabling resource monitoring in generated specs. |
| `src/client/resource_monitor.rs` | Implements buffering + periodic flushing, adds DB pragmas, and adds/updates unit tests. |


Previously used journal_mode=MEMORY + synchronous=OFF, which can corrupt the entire DB on a process kill (e.g. Slurm SIGKILL on OOM), exactly the scenario the resource monitor is meant to diagnose. With the 5-minute default flush interval, per-commit fsync cost is negligible, so the durability tradeoff is no longer justified.

Also harden the periodic_flush test against transient SQLite busy errors by setting busy_timeout and retrying briefly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
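The "retrying briefly" part of the test hardening described in this commit message could be sketched as a small helper like the one below. It is an assumption-laden illustration rather than the test code from the PR: `DbError` is a hypothetical stand-in for a SQLite binding's error type, and `retry_on_busy` is an invented name.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical error type standing in for a SQLite binding's error.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum DbError {
    /// Transient: another connection currently holds the lock.
    Busy,
    /// Anything else is treated as permanent and returned immediately.
    Other(String),
}

/// Run `op`, retrying while it fails with `DbError::Busy`, for at most
/// `max_attempts` total attempts, sleeping `delay` between tries.
fn retry_on_busy<T>(
    max_attempts: u32,
    delay: Duration,
    mut op: impl FnMut() -> Result<T, DbError>,
) -> Result<T, DbError> {
    let mut attempt = 1;
    loop {
        match op() {
            Err(DbError::Busy) if attempt < max_attempts => {
                attempt += 1;
                sleep(delay);
            }
            // Success, a permanent error, or the final busy error.
            other => return other,
        }
    }
}
```

A bounded retry like this complements a busy timeout on the connection: the timeout covers short lock waits inside SQLite, while the retry absorbs the occasional busy error that still surfaces to the test.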

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.


daniel-thom requested a review from Copilot April 25, 2026 15:46

Copilot started reviewing on behalf of daniel-thom April 25, 2026 15:47

Copilot AI reviewed Apr 25, 2026


Comment thread src/client/resource_monitor.rs

Comment thread src/client/resource_monitor.rs Outdated

Comment thread src/client/resource_monitor.rs Outdated

daniel-thom and others added 2 commits April 25, 2026 09:54

Update KDL parser

9b585b3

daniel-thom requested a review from Copilot April 25, 2026 16:14

Copilot started reviewing on behalf of daniel-thom April 25, 2026 16:14

Copilot AI reviewed Apr 25, 2026


daniel-thom merged commit f582249 into main Apr 25, 2026
13 checks passed

daniel-thom deleted the perf/time-series-monitoring branch April 25, 2026 16:35

daniel-thom merged 3 commits into main from perf/time-series-monitoring
