feat: prevent disk-full cascade failures with ZFS reservations and ea…#96

Merged
hsinatfootprintai merged 1 commit into main from feat---resources-hard-core-system
Apr 23, 2026

@hsinatfootprintai
Contributor

…rlier alerts

Addresses the incident in which a full ZFS pool caused PostgreSQL to crash (it couldn't write its PID file), which cascaded into a Caddy OOM and a full web UI outage.

ZFS reservations for core services (idempotent, applied on each EnsurePostgres/Caddy/VictoriaMetrics/Security call):

  • postgres: 5GB reserved
  • caddy: 2GB reserved
  • security: 2GB reserved
  • victoria: 2GB reserved

Total: 11GB guaranteed for core services even if user containers fill the pool. The `zfs set` call is silently skipped on non-ZFS pools.
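
To make the arithmetic above concrete, here is a minimal Go sketch that models the four reservations as data and sums them. The type `coreReservation`, the helper names, and the `5G` spelling of the size are illustrative assumptions; the PR does not show how the sizes are stored or formatted.

```go
package main

import "fmt"

// coreReservation pairs a core-service container with its ZFS reservation.
// Hypothetical type for illustration; not taken from the PR's code.
type coreReservation struct {
	Container string
	SizeGB    int
}

// Sizes as listed in the PR description. The container names are assumed
// to match the labels used in that list.
var coreReservations = []coreReservation{
	{"postgres", 5},
	{"caddy", 2},
	{"security", 2},
	{"victoria", 2},
}

// zfsSize renders the size as a value for `zfs set reservation=`, e.g. "5G".
// The exact spelling used by the PR is an assumption.
func (r coreReservation) zfsSize() string {
	return fmt.Sprintf("%dG", r.SizeGB)
}

// totalReservedGB sums the space guaranteed to core services.
func totalReservedGB(rs []coreReservation) int {
	total := 0
	for _, r := range rs {
		total += r.SizeGB
	}
	return total
}

func main() {
	for _, r := range coreReservations {
		fmt.Printf("%-9s reservation=%s\n", r.Container, r.zfsSize())
	}
	fmt.Printf("total guaranteed: %dG\n", totalReservedGB(coreReservations))
}
```

Keeping the sizes in one table makes the 11GB total easy to re-check whenever a reservation changes.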

Alert rule changes:

  • New DiskUsageWarning at 70% for 10m (early heads-up to plan expansion)
  • Lower DiskAlmostFull from 95% to 90% for 2m (more reaction time)
  • HighDiskUsage description now warns that core services may fail
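
The "threshold held for a duration" semantics of the two rules with stated numbers can be sketched in Go as follows. This is only an illustration: the real rules presumably live in an alerting config that this page does not show, `diskRule` and `firingRules` are hypothetical names, and treating the threshold as inclusive (`>=`) is an assumption. HighDiskUsage is left out because its threshold is not given here.

```go
package main

import (
	"fmt"
	"time"
)

// diskRule is a hypothetical in-code mirror of one disk alert rule.
type diskRule struct {
	Name      string
	Threshold float64       // pool usage in percent
	For       time.Duration // how long usage must stay at or above Threshold
}

// Thresholds and durations as stated in the PR description.
var diskRules = []diskRule{
	{"DiskUsageWarning", 70, 10 * time.Minute},
	{"DiskAlmostFull", 90, 2 * time.Minute},
}

// firingRules returns the rules that would fire if pool usage has stayed at
// usedPct or higher for heldFor. The inclusive comparison is an assumption.
func firingRules(usedPct float64, heldFor time.Duration) []string {
	var names []string
	for _, r := range diskRules {
		if usedPct >= r.Threshold && heldFor >= r.For {
			names = append(names, r.Name)
		}
	}
	return names
}

func main() {
	fmt.Println(firingRules(75, 10*time.Minute)) // warning window satisfied
	fmt.Println(firingRules(92, 3*time.Minute))  // short but severe spike
	fmt.Println(firingRules(92, 15*time.Minute)) // sustained high usage
}
```

In this model a short spike above 90% trips only DiskAlmostFull, because the 10m window for DiskUsageWarning has not yet elapsed. This is why the critical rule uses the shorter 2m window.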

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hsinatfootprintai merged commit 91123d9 into main Apr 23, 2026
4 checks passed
@hsinatfootprintai deleted the feat---resources-hard-core-system branch April 23, 2026 12:50
// Idempotent — safe to call repeatedly. Silently skips on non-ZFS pools.
func (cs *CoreServices) ensureCoreReservation(containerName, size string) {
	dataset := fmt.Sprintf("incus-pool/containers/containers/%s", containerName)
	cmd := exec.Command("zfs", "set", "reservation="+size, dataset)
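
The fragment above stops before the command is run. One way such a helper could finish is sketched below, as a standalone program rather than the PR's actual method. The `runner` type, `execRunner`, `reservationArgs`, and `ensureReservation` are hypothetical names introduced so the logic can be exercised without ZFS installed. Only the dataset path and the `zfs set reservation=` invocation come from the excerpt.

```go
package main

import (
	"fmt"
	"os/exec"
)

// runner abstracts command execution so the sketch can run without ZFS.
// Hypothetical; the PR calls exec.Command directly.
type runner func(name string, args ...string) ([]byte, error)

// execRunner runs the command for real and returns its combined output.
func execRunner(name string, args ...string) ([]byte, error) {
	return exec.Command(name, args...).CombinedOutput()
}

// reservationArgs builds the arguments for `zfs set`, using the dataset
// layout shown in the PR excerpt.
func reservationArgs(containerName, size string) []string {
	dataset := fmt.Sprintf("incus-pool/containers/containers/%s", containerName)
	return []string{"set", "reservation=" + size, dataset}
}

// ensureReservation applies the reservation and reports whether it took
// effect. Any failure (no zfs binary, non-ZFS pool, missing dataset) is
// swallowed, matching the "silently skips" behaviour described in the PR.
// Setting a property to the value it already has changes nothing, so
// repeated calls are safe.
func ensureReservation(run runner, containerName, size string) bool {
	_, err := run("zfs", reservationArgs(containerName, size)...)
	return err == nil
}

func main() {
	ok := ensureReservation(execRunner, "postgres", "5G")
	fmt.Println("reservation applied:", ok)
}
```

Returning a boolean rather than an error keeps the skip silent for callers while still letting a caller record whether the reservation took effect.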