Skip to content

feat(template): optimize disk usage peak during image-to-ext4 build#472

Merged
fslongjin merged 2 commits into
TencentCloud:masterfrom
fslongjin:fix-cubemaster-tpl-create-disk-usage
Jun 5, 2026
Merged

feat(template): optimize disk usage peak during image-to-ext4 build#472
fslongjin merged 2 commits into
TencentCloud:masterfrom
fslongjin:fix-cubemaster-tpl-create-disk-usage

Conversation

@fslongjin
Copy link
Copy Markdown
Member

Summary

Reduce peak disk usage from ~4.2X image size down to ~1.2X (72% reduction) during template image-to-ext4 builds.

Changes

Phase 1 (default on)

  1. Pipe-streamed export — eliminate intermediate rootfs.tar file by connecting docker export | tar -xf - via Go pipe
  2. Early workDir cleanup — remove work directory before ext4 creation instead of defer-to-end
  3. Precise ext4 sizing — replace power-of-2 alignment with 256 MiB boundary, adding triple-overhead model (fixed + percentage + per-file)
  4. Direct-to-storeDir export — skip relocate step on local filesystems, detect NFS/CIFS via statfs magic for fallback
  5. Disk-space pre-check — fail-fast with configurable safety margin

Phase 2 (opt-in, default off)

  1. Loop-mount streaming buildCUBEMASTER_LOOP_MOUNT_EXT4_ENABLED=true, auto-detects CAP_SYS_ADMIN and /dev/loop-control

Security hardening

  • Reject - prefixed image references
  • Add -- separator to all docker/mount/umount/losetup invocations
  • Mount with nosuid,noexec,nodev,noatime for loop devices
  • sync.Once for idempotent umount and losetup detach

Other improvements

  • SHA256 computation uses 4 MiB buffer
  • Cleanup failures logged at WARN level

Verification

Metric Old (power-of-2) New Improvement
Peak disk (5 GiB image) ~20.8 GiB (4.2X) ~3.9 GiB (~1.2X) ↓ 72%
ext4 size (rust:1.77) ~4 GiB 2.0 GiB ↓ 50%
Actual disk usage ~3.3 GiB 1.4 GiB ↓ 58%

Tested with alpine (~8 MiB), mysql:8.0 (~786 MiB), golang:1.22 (~754 MiB), rust:1.77 (~1.32 GiB) on x86_64.

All existing unit tests pass.

🤖 Generated with Claude Code

Reduce peak disk usage from ~4.2X image size down to ~1.2X (72% reduction)
via five complementary optimisations:

1. Pipe-streamed export (eliminate rootfs.tar intermediate file):
   - Connect docker export stdout directly to tar -xf stdin
   - Proper subprocess lifecycle: kill+wait tar on export failure
   - 1 MiB pipe buffer via F_SETPIPE_SZ for throughput
   - Context.Background() for cleanup to survive request cancellation

2. Early workDir cleanup:
   - Remove workDir immediately after rootfs reaches storeDir,
     before ext4 creation, instead of defer-to-end

3. Precise ext4 sizing with file-count overhead:
   - Replace power-of-2 alignment with 256 MiB boundary
   - Triple overhead model: fixed (256 MiB default) + percentage
     (10% default) + per-file (1 KiB/file)
   - Single-pass directorySizeAndFileCount avoids extra I/O

4. Direct-to-storeDir export (local fast filesystems):
   - Detect NFS/CIFS via statfs magic and fall back to
     workDir+relocate only for network filesystems
   - Eliminates the relocate step and potential cross-device copy

5. Disk-space pre-check (fail-fast):
   - Statfs on storeDir parent before build starts
   - Configurable safety margin (CUBEMASTER_DISK_SPACE_SAFETY_MARGIN,
     default 1.5x)
   - Uses docker image inspect --format {{.Size}} for estimation

Phase 2 (loop-mount streaming ext4 build) is also implemented,
gated behind CUBEMASTER_LOOP_MOUNT_EXT4_ENABLED (default false)
with automatic capability detection (CAP_SYS_ADMIN, /dev/loop-control).

Security hardening:
- Reject image references starting with "-" in normalisation
- Add "--" separator to all docker/mount/umount/losetup invocations
- Mount with nosuid,noexec,nodev,noatime for loop devices
- Use sync.Once for idempotent umount and losetup detach

Additional improvements:
- SHA256 computation uses 4 MiB buffer to reduce read syscalls
- Cleanup failures are logged at WARN level instead of silent discard
- Configurable ext4 overhead via CUBEMASTER_EXT4_OVERHEAD_PERCENT
  and CUBEMASTER_EXT4_FIXED_OVERHEAD_MIB

All existing tests pass. Verified on x86_64 host with alpine (~8 MiB),
mysql:8.0 (~786 MiB), golang:1.22 (~754 MiB) and rust:1.77 (~1.32 GiB)
images. Peak disk increment measured at 3.92 GiB for a 1.32 GiB
compressed image (rust:1.77-bullseye) including docker pull, compared
to an estimated ~13.9 GiB under the previous power-of-2 alignment.

Signed-off-by: jinlong <jinlong@tencent.com>
@cubesandboxbot
Copy link
Copy Markdown

cubesandboxbot Bot commented Jun 5, 2026

Claude finished @fslongjin's task in 1m 59s —— View job


Code Review: feat(template): optimize disk usage peak during image-to-ext4 build

Analyzing PR #472...

Tasks:

  • Read changed files and understand the changes
  • Review Phase 1 optimizations (pipe-export, early cleanup, ext4 sizing)
  • Review Phase 2 loop-mount streaming build
  • Review security hardening
  • Review new helper functions (checkDiskSpace, isLocalFastFS, canUseLoopMount, etc.)
  • Check for potential issues, dead code, or bugs
  • Post comprehensive review feedback

Overview

This PR significantly reduces peak disk usage during template image-to-ext4 builds from ~4.2× image size down to ~1.2× (72% reduction) through a combination of pipe-streamed exports, early cleanup, precise ext4 sizing, direct-to-store exports, a pre-flight disk space check, and an optional loop-mount streaming path. It also adds security hardening across docker/mount invocations.

The code is generally well-structured with clear comments, good error messages, and appropriate use of Go idioms (deferred cleanup, sync.Once, context propagation). Below are my findings organized by category.


Security: ✅ Overall solid improvements

  • Argument injection prevention: Adding -- separators to all docker, mount, umount, and losetup invocations is the correct defense-in-depth approach.
  • Image ref validation: Rejecting - prefixed source image refs in both exportImageRootfs (line 1446-1448) and normalizeTemplateImageRequest (line 1782-1784) covers both the early validation and the execution path.
  • Secure mount options: nosuid,noexec,nodev,noatime for loop device mounts (line 2418) is appropriate.
  • context.Background() for cleanup: Lines 1455-1456 use a background context for container cleanup so that cleanup runs even after the request context is cancelled. This is a robust pattern.

Issues Found

1. Dead code: cleanupIntermediateArtifacts (line 2710)

The function cleanupIntermediateArtifacts is no longer called — the old deferred call in buildRootfsArtifact was replaced with inlined cleanup logic (lines 1283-1302). The function definition at line 2710-2723 is now dead code and should be removed to avoid confusion.

2. Dead struct fields: RootFS and dockerRootFS (lines 106, 109-111)

dockerRootFS type is defined and RootFS field is added to dockerInspectImage, but neither is ever read in the new code. Consider removing these if they aren't needed for future use.

3. Duplicated comment on createExt4ImageStreaming (lines 2366-2371)

The Go comment block before createExt4ImageStreaming contains the same opening sentence twice:

// createExt4ImageStreaming uses loop-mount to stream docker export directly into
// an ext4 image, avoiding any intermediate rootfs directory on disk (Phase 2).
// Falls back to Phase 1 when prerequisites are not met.
// createExt4ImageStreaming uses loop-mount to stream docker export directly into  <-- duplicate
// an ext4 image, avoiding any intermediate rootfs directory on disk (Phase 2).
// estimatedSizeBytes should be obtained from estimateImageSizeFromInspect.

Line 2369 should be removed.

4. isLocalFastFS is overly permissive — consider renaming (line 2263-2279)

The function returns true for any filesystem that is not NFS or CIFS, including FUSE filesystems (s3fs, sshfs), overlay filesystems, tmpfs, etc. The name and comment say "local high-performance filesystem" but the check is really "not a well-known network filesystem." If an unusual network filesystem is used (e.g. GlusterFS, FUSE-based) the function will return true, and the code will skip the relocateRootfsToArtifactStore fallback path, potentially causing issues.

Suggested fix: Either rename to isNotNetworkFS to match what it actually does, or expand the magic number blocklist to include FUSE (0x65735546), or both.

5. estimateImageSizeFromInspect comment describes unimplemented fallback (lines 2341-2346)

The doc comment says:

If that is unavailable, it falls back to summing the compressed layer sizes from the manifest.

But the code (lines 2347-2364) only tries docker inspect --format '{{.Size}}' and errors out on failure. The fallback described in the comment was not implemented. Either implement it or update the comment to match reality.

6. Duplicated pipe-logic between exportImageRootfs and pipeExportToDir

Both exportImageRootfs (lines 1470-1509) and pipeExportToDir (lines 2458-2495) contain essentially identical pipe setup, start, and wait logic. Consider extracting a shared helper, e.g.:

func pipeExportToTar(ctx context.Context, containerID, destDir string) error

and have both callers use it. exportImageRootfs is now a thin wrapper that calls pipeExportToDir plus the ref validation and cleanup.

7. loopMountExt4Enabled could accept case-insensitive values (line 2113-2118)

The function only checks for "true" or "1". Environment variable values like "TRUE", "True", or "yes" are silently treated as disabled. Consider using strings.EqualFold for the "true" comparison for a better user experience.


Design Observations

Phase 1 optimizations are well-designed

  • Pipe-streamed export (line 1470): Replacing the intermediate rootfs.tar file with a Go pipe between docker export and tar -xf - eliminates one full write/read cycle of the rootfs. The best-effort 1 MiB pipe buffer increase (lines 1480-1482) is a thoughtful touch to reduce context switches.
  • Early workdir cleanup (lines 1325-1329): Removing the work directory before mkfs.ext4 runs reduces peak disk usage by eliminating the overlap period.
  • Precise ext4 sizing model (lines 1521-1539): The triple-overhead model (fixed + percentage + per-file) is much more accurate than the old power-of-2 rounding. The configurable env vars (CUBEMASTER_EXT4_FIXED_OVERHEAD_MIB, CUBEMASTER_EXT4_OVERHEAD_PERCENT) provide operational flexibility.
  • Direct-to-store export (lines 1306-1321): Using isLocalFastFS to skip the relocate step on local filesystems saves an unnecessary copy. The os.Rename fast path in relocateRootfsToArtifactStore (line 2732) with cp -a fallback for cross-device moves is correct.

Phase 2 loop-mount streaming is clean and well-gated

  • Auto-detection via canUseLoopMount() (lines 2284-2300) checks both CAP_SYS_ADMIN and /dev/loop-control, which is the correct approach for containerized environments.
  • The graceful fallback to Phase 1 (line 1254) with partial cleanup ensures robustness.
  • sync.Once for idempotent umount/detach (lines 2393-2416) is a good pattern for deferred resource cleanup.

Disk space pre-check is useful

The checkDiskSpace function (lines 2230-2257) with statfs fallback to parent directory when the target doesn't exist yet (line 2234-2236) is well-thought-out. The configurable safety margin (default 1.5×) via CUBEMASTER_DISK_SPACE_SAFETY_MARGIN gives operators control.


Minor Nits

  • Sparse allocation: For Phase 1 truncate (line 1541), using truncate -s creates a sparse file, which is more efficient than dd with seek. However, mkfs.ext4 will materialize the blocks. For Phase 2 this is also fine since the image starts sparse, gets materialized by mkfs.ext4, and then gets shrunk by resize2fs -M — the final truncate to the actual block size (line 2448) is a nice optimization.

  • SHA256 buffer: The 4 MiB buffer for io.CopyBuffer (line 2200) is a good choice — large enough to reduce syscall overhead for multi-GiB ext4 files without wasting memory.


Summary

Metric Old (power-of-2) New Improvement
Peak disk (5 GiB image) ~20.8 GiB (4.2X) ~3.9 GiB (~1.2X) ↓ 72%

This is a high-quality PR with measurable improvements and good attention to security. The main actionable items are removing the dead code and fixing the duplicated comment. The rest are design observations that the maintainers can weigh.

Tasks:

  • Read changed files and understand the changes
  • Review Phase 1 optimizations (pipe-export, early cleanup, ext4 sizing)
  • Review Phase 2 loop-mount streaming build
  • Review security hardening
  • Review new helper functions (checkDiskSpace, isLocalFastFS, canUseLoopMount, etc.)
  • Check for potential issues, dead code, or bugs
  • Post comprehensive review feedback

| Branch: fix-cubemaster-tpl-create-disk-usage

…ation

- Remove unused dockerRootFS struct and RootFS field (estimateImageSizeFromInspect
  now uses docker image inspect --format {{.Size}} directly)
- Deduplicate pipe logic: exportImageRootfs now calls pipeExportToDir instead of
  duplicating the same docker export | tar -xf - pipeline
- Fix isLocalFastFS to fall back to parent directory on ENOENT (consistent with
  checkDiskSpace), and add FUSE_SUPER_MAGIC (0x65735546) detection
- Use strconv.ParseBool in loopMountExt4Enabled for broader boolean support
  (true/TRUE/1/false/0)
- Delete dead code: cleanupIntermediateArtifacts (logic is now inline in
  buildRootfsArtifact defer closure)
- Replace TestCleanupIntermediateArtifacts* tests with:
  TestIsLocalFastFSFallsBackToParentForMissingArtifactDir
  TestLoopMountExt4EnabledParsesBoolValues
- Fix alignment of JobPhase constant declarations
- Fix error message in pipeExportToDir to include destDir parameter

Signed-off-by: jinlong <jinlong@tencent.com>
@cubesandboxbot
Copy link
Copy Markdown

cubesandboxbot Bot commented Jun 5, 2026

Claude finished @fslongjin's task in 2m 49s —— View job


Re-review Results: All Previous Issues Resolved ✅

Thanks for the updates! I've reviewed the commit c7646fc ("address PR review feedback") and all 7 issues from the previous review have been addressed:

✅ Resolved Issues

1. Dead code removedcleanupIntermediateArtifacts function was deleted (was at line 2710). ✅

2. Dead struct fields removedRootFS field from dockerInspectImage and the dockerRootFS type were removed. ✅

3. Duplicated comment fixed — The duplicate opening sentence in createExt4ImageStreaming's doc comment was removed. ✅

4. isLocalFastFS hardened — Added FUSE magic number (0x65735546) to the blocklist, added parent-directory fallback when the target path doesn't exist yet, and updated the comment to accurately describe the function as "conservative for known network or FUSE filesystems." ✅

5. estimateImageSizeFromInspect comment corrected — Removed the misleading reference to an unimplemented manifest-layer fallback. The comment now accurately describes that it only uses the {{.Size}} field. ✅

6. Pipe-logic deduplicatedexportImageRootfs now delegates to pipeExportToDir instead of inlining the same pipe setup/wait logic. ✅

7. loopMountExt4Enabled uses strconv.ParseBool — Now correctly handles "true", "TRUE", "1", "false", "0", etc. via the standard library, instead of the manual v == "true" || v == "1" check. ✅

✅ Updated Tests

The test changes match the production code changes:

Old Test New Test
TestCleanupIntermediateArtifactsRemovesIntermediateFiles Removed (function deleted)
TestCleanupIntermediateArtifactsKeepsStoreDirOnSuccess Removed (function deleted)
TestIsLocalFastFSFallsBackToParentForMissingArtifactDir
TestLoopMountExt4EnabledParsesBoolValues

The new TestLoopMountExt4EnabledParsesBoolValues covers all the key strconv.ParseBool cases: "true", "TRUE", "1", "false", empty/unset, and invalid values like "yes".

Code Quality

The indentation fix in prepareSourceImage (line 1432) was also cleaned up.

Summary

All previously-identified issues have been addressed. The PR is in good shape — clean, well-structured, and well-tested. No further issues found in this re-review.
— Branch: fix-cubemaster-tpl-create-disk-usage

@fslongjin fslongjin merged commit 6623a6e into TencentCloud:master Jun 5, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant