fix(box): boxes.json cross-process lock, atomic image index, split signing.rs #8

Merged
ZhiXiao-Lin merged 4 commits into release/v2.0.4 from fix/box-state-and-split
May 31, 2026

@ZhiXiao-Lin
Contributor

Summary

Two remaining hardening items, on top of #7 (stacked on fix/box-net-hardening).
All verified on a real Linux x86_64 + KVM host; macOS arm64 cargo check and
cargo clippy --all-targets -D warnings are both clean.

Changes

  1. #4: serialize boxes.json writes (cross-process lock). boxes.json had
    no inter-process lock, and save() rewrites the whole record vector, so the
    monitor daemon, compose, per-box health checkers, and concurrent CLI
    commands clobbered each other. Added a flock(LOCK_EX) StateLock, a
    transactional StateFile::modify() (load → mutate → save under the lock, never
    .await inside), and atomic add_record/remove_record. Migrated the
    long-window async daemon writers (monitor poll_once + run_due_health_checks,
    per-box health loop) to reload-before-save modify(), and the common
    add/remove paths (create, rm) to the atomic helpers; save() now also
    takes the lock. Verified: 12 parallel creates now persist all 12 records
    (previously some were lost).
  2. Atomic image-store index.json. save_index_inner used a non-atomic
    write, so a concurrent reader could see a truncated file
    ("Failed to parse image store index: EOF"). It now writes to a temp file
    and renames it into place. Verified: the parallel-create EOF race is gone.
  3. #10: split oci/signing.rs (1458 lines) into signing/{mod,crypto,sign}.rs
    per the >1000-line rule. Behavior-preserving: crypto/Fulcio primitives move
    to crypto.rs and signing moves to sign.rs; orchestration and tests stay in
    mod.rs, with glob re-imports so all call sites and the public API are
    unchanged. Verified: 34 signing unit tests pass.
  4. chore: fix a pre-existing clippy::unnecessary_mut_passed in the
    core_smoke pty test so --all-targets -D warnings (CI gate) stays green.
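
For reference, the temp-file-and-rename write from item 2 can be sketched roughly as below. This is a minimal synchronous illustration using only std, not the PR's save_index_inner (which the commit message says used tokio::fs::write). The function name write_atomic and the temp-file naming scheme are hypothetical.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical helper: replace `path` so that readers see either the old
/// contents or the new contents, never a partially written file.
pub fn write_atomic(path: &Path, bytes: &[u8]) -> io::Result<()> {
    let name = path
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "path has no file name"))?;
    // Assumption: the temp file lives next to the target so the rename stays
    // on one filesystem; the pid suffix keeps processes from sharing a name.
    let tmp = path.with_file_name(format!(
        ".{}.tmp.{}",
        name.to_string_lossy(),
        std::process::id()
    ));

    let mut file = File::create(&tmp)?;
    file.write_all(bytes)?;
    // Flush data to disk before the rename makes it visible.
    file.sync_all()?;
    drop(file);

    // rename() replaces the destination in a single step on POSIX systems.
    fs::rename(&tmp, path).map_err(|e| {
        let _ = fs::remove_file(&tmp); // best-effort cleanup on failure
        e
    })
}
```

A caller would use write_atomic(&index_path, &serialized) in place of a direct write. Two threads of one process would share the temp name in this sketch, so a real implementation would likely add a per-call unique suffix or hold a lock around the write.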

Verification (real KVM host)

  • core_smoke regression: 14/14
  • Unit tests: state 71, signing 34, passt 6
  • 12 parallel creates → 12 records (concurrency)
  • cargo clippy --all-targets -D warnings clean; macOS arm64 cargo check clean

Scope note

The remaining synchronous CLI mutators (pause/unpause/rename/snapshot/network/
container-update/start/restart/stop/compose) still use the now-locked save()
(writes serialized); migrating each to modify()/add_record/remove_record
for full per-command atomicity is a mechanical follow-up. The daemon (the
long-await primary offender) and the common create/rm paths are done.

Roy Lin added 4 commits May 31, 2026 10:27
Per the CLAUDE.md >1000-line split rule. Behavior-preserving:
- signing/crypto.rs: PEM/SPKI/ECDSA/base64 primitives + Fulcio X.509 helpers (pub(super))
- signing/sign.rs: SignResult + sign_image + private key parsing
- signing/mod.rs: policy/result types, cosign payloads, verify_image_signature
  orchestration, and the test module (kept here). Re-imports the submodules
  (use crypto::*; pub use sign::{sign_image, SignResult};) so all call sites and
  the public API (oci::signing::{sign_image,SignResult,SignaturePolicy,VerifyResult})
  are unchanged.
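
The re-export layout described above can be illustrated with inline modules. This is only a sketch of the visibility pattern: checksum and the SignResult fields are hypothetical stand-ins, not the real crypto primitives or types, and `use self::crypto::*` is written with `self::` so it compiles on any edition (in edition 2018+ it is equivalent to the `use crypto::*;` quoted in the commit message).

```rust
pub mod signing {
    mod crypto {
        // Hypothetical primitive. pub(super) makes it visible inside
        // `signing` (and its child modules) but not to outside callers.
        pub(super) fn checksum(bytes: &[u8]) -> u32 {
            bytes
                .iter()
                .fold(0u32, |acc, b| acc.wrapping_mul(31).wrapping_add(*b as u32))
        }
    }

    mod sign {
        use super::crypto::checksum;

        // Hypothetical fields; the real SignResult is not shown in the PR.
        pub struct SignResult {
            pub reference: String,
            pub tag: u32,
        }

        pub fn sign_image(reference: &str) -> SignResult {
            SignResult {
                reference: reference.to_owned(),
                tag: checksum(reference.as_bytes()),
            }
        }
    }

    // Glob import keeps the moved helpers in scope for code left in mod.rs.
    use self::crypto::*;
    // Re-export keeps the public path signing::{sign_image, SignResult} stable.
    pub use self::sign::{sign_image, SignResult};

    // Orchestration stays in the parent module and still sees `checksum`.
    pub fn verify_image_signature(result: &SignResult) -> bool {
        checksum(result.reference.as_bytes()) == result.tag
    }
}
```

Callers keep writing signing::sign_image(...) and signing::SignResult exactly as before the split, while signing::crypto itself stays private.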

Verified: cargo clippy -D warnings clean; 34 signing unit tests pass.

boxes.json had no inter-process lock: every writer did load -> mutate ->
save() and save() rewrites the whole record vector, so the monitor daemon,
compose, per-box health checkers, and concurrent CLI commands clobbered each
other (lost-update / resurrected / dropped records).

Add a flock(LOCK_EX)-based StateLock (state/lock.rs) and a transactional
StateFile::modify() (load->mutate->save under the lock; never .await inside),
plus atomic add_record/remove_record helpers. save() now takes the lock too.
Migrate the long-window async daemon writers (monitor poll_once +
run_due_health_checks, per-box health loop) to reload-before-save modify(), and
the common add/remove paths (create, rm) to the atomic helpers.
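
A rough sketch of the load -> mutate -> save transaction under an exclusive flock follows. It is an illustration under stated assumptions, not the code in state/lock.rs: flock is declared by hand so the example needs only std (a real project would more likely use the libc crate), records are stored one ID per line instead of the real boxes.json format, and the sidecar ".lock" file name is hypothetical.

```rust
use std::fs::{self, File, OpenOptions};
use std::io;
use std::os::raw::c_int;
use std::os::unix::io::AsRawFd;
use std::path::{Path, PathBuf};

// Assumption: hand-written declaration of flock(2) to avoid a dependency.
unsafe extern "C" {
    fn flock(fd: c_int, operation: c_int) -> c_int;
}
const LOCK_EX: c_int = 2; // same value on Linux and macOS

/// Stand-in for StateLock: holds an exclusive flock on a sidecar file.
/// Closing the file when the guard drops releases the lock.
pub struct StateLock {
    _file: File,
}

impl StateLock {
    pub fn acquire(lock_path: &Path) -> io::Result<StateLock> {
        let file = OpenOptions::new()
            .create(true)
            .write(true)
            .truncate(false)
            .open(lock_path)?;
        // Blocks until no other open file description holds the lock.
        if unsafe { flock(file.as_raw_fd(), LOCK_EX) } != 0 {
            return Err(io::Error::last_os_error());
        }
        Ok(StateLock { _file: file })
    }
}

/// Reload, mutate, and save while holding the lock. The closure is
/// synchronous, so nothing can await while the lock is held.
pub fn modify<F>(state_path: &Path, mutate: F) -> io::Result<()>
where
    F: FnOnce(&mut Vec<String>),
{
    let lock_path = PathBuf::from(format!("{}.lock", state_path.display()));
    let _guard = StateLock::acquire(&lock_path)?;

    let mut records: Vec<String> = match fs::read_to_string(state_path) {
        Ok(text) => text.lines().map(str::to_owned).collect(),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Vec::new(),
        Err(e) => return Err(e),
    };
    mutate(&mut records);
    fs::write(state_path, records.join("\n"))
    // `_guard` drops after the write, releasing the lock.
}

pub fn add_record(state_path: &Path, id: &str) -> io::Result<()> {
    modify(state_path, |records| records.push(id.to_owned()))
}

pub fn remove_record(state_path: &Path, id: &str) -> io::Result<()> {
    modify(state_path, |records| records.retain(|r| r != id))
}
```

Because every writer reloads the file after taking the lock, a concurrent add cannot be overwritten by a stale in-memory copy, which is the lost-update case described above.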

Verified on Linux: 12 parallel `create`s now persist all 12 records (was lossy);
state/network unit tests and core_smoke 14/14 still pass.

save_index_inner used a non-atomic tokio::fs::write, so a concurrent reader
(another process running create/run) could observe a truncated/empty file,
surfacing as "Failed to parse image store index: EOF". Write to a temp file and
rename into place so readers always see the old or new index, never a partial
one. Verified: 12 parallel creates no longer hit the EOF race.

libc::openpty takes a const winsize; pass &winsize (not &mut) so
`cargo clippy --all-targets -- -D warnings` (the CI gate) stays green.
@ZhiXiao-Lin ZhiXiao-Lin changed the base branch from fix/box-net-hardening to release/v2.0.4 May 31, 2026 03:10
@ZhiXiao-Lin ZhiXiao-Lin merged commit 08a54dc into release/v2.0.4 May 31, 2026
@ZhiXiao-Lin ZhiXiao-Lin deleted the fix/box-state-and-split branch May 31, 2026 03:12
ZhiXiao-Lin pushed a commit that referenced this pull request May 31, 2026
The #8 state refactor switched rm_one to the atomic, lock-safe static
StateFile::remove_record (disk), but left test_rm_force_removes_paused_stale_record
asserting the in-memory handle, which rm_one no longer mutated — a latent failure
not caught because release-branch PRs don't run ci.yml (main-only). Add a no-save
StateFile::forget(id) and call it after remove_record so the in-memory handle stays
consistent without a second clobbering save. Full workspace lib tests green (cli 535/0).
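
The disk-versus-handle mismatch behind that test failure can be shown with a small sketch. The StateFile below is a hypothetical reduction (one record ID per line, locking omitted), not the real type; only the names remove_record and forget come from the commit message.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical reduction of StateFile: an in-memory copy of the records.
pub struct StateFile {
    records: Vec<String>,
}

impl StateFile {
    pub fn load(path: &Path) -> io::Result<StateFile> {
        let text = fs::read_to_string(path)?;
        Ok(StateFile {
            records: text.lines().map(str::to_owned).collect(),
        })
    }

    /// Static, disk-only removal: reloads the file and rewrites it.
    /// Existing in-memory handles are not touched.
    pub fn remove_record(path: &Path, id: &str) -> io::Result<()> {
        let mut on_disk = StateFile::load(path)?;
        on_disk.records.retain(|r| r != id);
        fs::write(path, on_disk.records.join("\n"))
    }

    /// Drop the record from this handle only; nothing is written, so a
    /// stale handle cannot clobber newer on-disk state.
    pub fn forget(&mut self, id: &str) {
        self.records.retain(|r| r != id);
    }

    pub fn contains(&self, id: &str) -> bool {
        self.records.iter().any(|r| r == id)
    }
}
```

After remove_record the caller's handle still lists the record until forget(id) is called, which is the state the stale test assertion was observing.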