feat(labctl): harden Talos image build reliability#65

Merged
jmgilman merged 1 commit into master from session-057/talos-image-reliability
May 3, 2026
@jmgilman jmgilman commented May 3, 2026

Summary

Makes labctl bootstrap talos image build resilient enough for durable clusters, not just disposable bootstrap. Closes the asymmetry between the Talos and IncusOS paths: the IncusOS sibling already verifies SHA256 and bounds xz decompression; Talos didn't.

This is PR 1 of 5 from the session 057 reliability+flexibility plan. Code-only; no schema impact.

What changed

  • Atomic downloads + SHA256 verification. Stream the archive into a sibling .tmp-* file via os.CreateTemp, hash on stream against the Image Factory <asset>.sha256 sidecar, os.Rename only when the digest matches. Corrupt cached archives are detected on every run and redownloaded. Same atomic pattern now applies to the decompressed boot image.
  • HTTP client timeouts. New httpupstream.NewHTTPClient sets ResponseHeaderTimeout=30s and IdleConnTimeout=90s. No top-level Timeout — long downloads rely on context cancellation. Both Talos and IncusOS paths benefit from the explicit client.
  • Retry with backoff. 5 attempts, exponential 500ms..16s with ±20% jitter, retried on connection errors, 5xx, and 429. 4xx and ctx errors are permanent. Uses cenkalti/backoff/v4 (already an indirect dep, now promoted to direct).
  • xz decompression bounded to MaxBootImageBytes = 4 GiB via io.LimitReader, mirroring the existing IncusOS pattern at incusosimage/service.go:210.
  • Result carries artifact digests. BootArtifactSHA256 and ConfigArtifactSHA256 are populated from the streamed hashes and exposed in --json output so operators can verify before flashing.

Why now

Session 056 shipped the Talos build slice (PR #64) yesterday. A code review in session 057 found that the cache silently served corrupt archives, the HTTP client had no timeouts, and xz decompression had no upper bound. None of this was caught earlier because the IncusOS sibling already does them right and operators never compared the two paths. Fixing them before durable-cluster work begins is cheaper than fixing them after.

Out of scope

  • Cidata image file mode (PR 2 — carries Talos PKI, needs 0600).
  • Dynamic cidata size (PR 2).
  • Schema defaults / output.dir tightening (PR 3).
  • Progress reporting (PR 4 — port wiring touches IncusOS too).
  • Declarative extensions via POST /schematics (PR 5).
  • Talos machine-config generation from patches + secrets bundle (deferred to its own session).

Test plan

  • go test ./... from tools/labctl/ — all pass.
  • moon run labctl:check --summary minimal — format, lint, test pass.
  • moon ci --summary minimal from platform/ — all actions pass including labctl:image-smoke.
  • New unit tests:
    • service_test.go — atomic commit, corrupt-cache redownload, sidecar-mismatch rejection (no leaked .tmp-*), sidecar fetch error propagation.
    • client_test.go (new) — sha256 parsing edge cases, retry on 5xx, no-retry on 4xx, ctx-cancel exit.
  • Updated txtar integration test asserts new digest fields appear in --json output.
  • Manual verification against real Image Factory deferred until merge — current local tests use a fake upstream.

🤖 Generated with Claude Code

Make `labctl bootstrap talos image build` resilient enough for durable
clusters, not just disposable bootstrap. Mirrors the IncusOS path's
SHA256 verification while adding the protections it lacks.

- Atomic downloads: stream archive into a sibling .tmp-* file via
  os.CreateTemp, hash-on-stream against the published Image Factory
  sidecar (`<asset>.sha256`), and os.Rename only when the digest
  matches. A corrupt cached archive is detected on every run and
  redownloaded. Same atomic pattern is now used for the decompressed
  boot image.
- Bound xz decompression to MaxBootImageBytes (4 GiB) using the same
  io.LimitReader pattern incusosimage already uses.
- HTTP client gets explicit ResponseHeaderTimeout (30s) and
  IdleConnTimeout (90s) via httpupstream.NewHTTPClient instead of
  http.DefaultClient. No top-level Timeout — long downloads use
  context cancellation. Both Talos and IncusOS paths benefit.
- Retry with exponential backoff (5 attempts, 500ms..16s, ±20%
  jitter) on connection errors, 5xx, and 429. 4xx and ctx errors are
  permanent.
- Result now carries BootArtifactSHA256 and ConfigArtifactSHA256;
  --json output exposes them so operators can verify before flashing.

Tests cover atomic-write commit, redownload-on-corrupt-cache,
sidecar-mismatch rejection (with no leaked .tmp- file), retry on 5xx,
no-retry on 4xx, and SHA256 sidecar parsing edge cases. testscript
fixture now serves the matching .sha256 sidecar and asserts the new
digest fields.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jmgilman jmgilman merged commit 8b983ce into master May 3, 2026
6 checks passed
@jmgilman jmgilman deleted the session-057/talos-image-reliability branch May 3, 2026 05:19