feat: prevent disk-full cascade failures with ZFS reservations and ea…#96
Merged
hsinatfootprintai merged 1 commit into main on Apr 23, 2026
…rlier alerts

Addresses the incident where a full ZFS pool caused PostgreSQL to crash (couldn't write its PID file), which cascaded into Caddy OOM and a full web UI outage.

ZFS reservations for core services (idempotent, applied on each EnsurePostgres/Caddy/VictoriaMetrics/Security call):
- postgres: 5GB reserved
- caddy: 2GB reserved
- security: 2GB reserved
- victoria: 2GB reserved

Total 11GB guaranteed for core services even if user containers fill the pool. ZFS set is silently skipped on non-ZFS pools.

Alert rule changes:
- New DiskUsageWarning at 70% for 10m (early heads-up to plan expansion)
- Lower DiskAlmostFull from 95% to 90% for 2m (more reaction time)
- HighDiskUsage description now warns that core services may fail

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
// Idempotent: safe to call repeatedly. Silently skips on non-ZFS pools.
func (cs *CoreServices) ensureCoreReservation(containerName, size string) {
	dataset := fmt.Sprintf("incus-pool/containers/containers/%s", containerName)
	cmd := exec.Command("zfs", "set", "reservation="+size, dataset)
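The excerpt above is cut off by the diff view. As a standalone sketch of the same idea (not the merged implementation), the following builds the `zfs set reservation=<size> <dataset>` arguments and sums the sizes from the PR description. The `coreReservations` table, the helper names, the use of the short labels as container names, and the `5G`-style size strings are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os/exec"
)

// coreReservations mirrors the sizes in the PR description. Using the
// short labels as container names is an assumption.
var coreReservations = []struct {
	Name string
	GB   int
}{
	{"postgres", 5},
	{"caddy", 2},
	{"security", 2},
	{"victoria", 2},
}

// reservationArgs builds the arguments for
// `zfs set reservation=<size> <dataset>`, using the dataset layout
// shown in the diff excerpt.
func reservationArgs(containerName, size string) []string {
	dataset := fmt.Sprintf("incus-pool/containers/containers/%s", containerName)
	return []string{"set", "reservation=" + size, dataset}
}

// totalReservedGB sums the per-service reservations.
func totalReservedGB() int {
	total := 0
	for _, r := range coreReservations {
		total += r.GB
	}
	return total
}

// applyReservation runs the command and deliberately discards the error,
// which is one way to get the "silently skips on non-ZFS pools" behavior.
// Setting the same value again is harmless, so repeated calls are safe.
func applyReservation(containerName, size string) {
	_ = exec.Command("zfs", reservationArgs(containerName, size)...).Run()
}

func main() {
	// Print the commands instead of running them, so the sketch is safe
	// to execute on a machine without ZFS.
	for _, r := range coreReservations {
		fmt.Println("zfs", reservationArgs(r.Name, fmt.Sprintf("%dG", r.GB)))
	}
	fmt.Println("total reserved GB:", totalReservedGB())
}
```

Discarding the error keeps non-ZFS hosts working, but it also hides real failures such as a missing dataset or insufficient free space for the reservation. Logging the error at debug level would be a possible middle ground.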