feat(cluster): resumable peer-replication transfers (#398)#404

Merged
xe-nvdk merged 5 commits into main from feat/resumable-peer-replication
Apr 14, 2026

Conversation

@xe-nvdk (Member) commented Apr 14, 2026

Summary

  • Failed peer-replication file transfers now resume from the last committed byte instead of restarting from zero — especially valuable for large compacted Parquet outputs on slow or flaky links
  • New ByteOffset field on FetchFileRequest/FetchFileAckHeader (omitempty); HMAC does not bind the offset (path is already bound)
  • AppendingBackend optional interface replaces a mandatory AppendReader method that S3/Azure would always reject — puller type-asserts and falls back gracefully
  • bad_offset counter split into bad_offset_server (AckCodeBadOffset) and bad_offset_backend (ErrResumeNotSupported) for cleaner observability
  • Post-review refactors: tryResumeFromPartial helper, writeFileTail extracted, pipe goroutine uses sync.WaitGroup, coordinator uses ReadToAt unconditionally

Test plan

  • go test ./internal/cluster/filereplication/... — all existing + new tests pass (resume happy path, AckCodeBadOffset, offset echo mismatch, partial file retry, bad-offset deletes partial)
  • go test ./internal/storage/... — ReadToAt, StatFile, AppendReader on LocalBackend
  • go test ./internal/cluster/... — integration test: full fetch round-trip still works
  • go build ./... — clean build
  • Manual: interrupt a large file transfer mid-way, confirm second attempt resumes from offset (not from zero)
  • Manual: S3/Azure backend confirms full re-fetch fallback (no partial file left behind)

Failed file pulls now resume from the last committed byte instead of
restarting from zero. Large compacted Parquet outputs on slow or flaky
links no longer re-transfer already-received data.

Protocol:
- FetchFileRequest.ByteOffset: byte position to resume from (omitempty)
- FetchFileAckHeader.ByteOffset: server echo for confirmation
- FetchFileAckHeader.SizeBytes now carries tail bytes, not total size
- AckCodeBadOffset: server rejects invalid/stale offsets
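For illustration, the message shapes could look roughly like the sketch below. Only ByteOffset, SizeBytes, the omitempty behaviour, and AckCodeBadOffset come from this change; the remaining fields, the tags, the concrete code values, and the package placement are assumptions.

package filereplication // placement assumed

// FetchFileRequest / FetchFileAckHeader, condensed to the fields relevant here.
// Note: the request HMAC binds the path but not ByteOffset.
type FetchFileRequest struct {
	Path       string `json:"path"`
	ByteOffset int64  `json:"byte_offset,omitempty"` // resume position; zero/absent means full fetch
}

type FetchFileAckHeader struct {
	AckCode    AckCode `json:"ack_code"`
	ByteOffset int64   `json:"byte_offset,omitempty"` // server echo of the requested offset
	SizeBytes  int64   `json:"size_bytes"`            // bytes in this response (the tail), not the total file size
	SHA256     string  `json:"sha256"`                // full-file digest, verified on every path
}

type AckCode uint8

const (
	AckCodeOK        AckCode = iota // pre-existing codes are illustrative
	AckCodeBadOffset                // peer rejects an invalid or stale resume offset
)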

Storage interface:
- ReadToAt(ctx, path, writer, offset): range-read; used server-side
- StatFile(ctx, path): size or -1 if not found; drives pre-check and
  resume-offset detection in the puller
- AppendingBackend: new optional interface for backends that support
  append writes. LocalBackend implements it (O_WRONLY|O_APPEND).
  S3 and Azure do not; the puller falls back to full re-fetch via
  type-assertion rather than a mandatory method that always errors.
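A minimal sketch of the interface split, assuming signatures close to the descriptions above; the real Backend interface has more methods, and the exact parameter order plus where ErrResumeNotSupported is declared are assumptions.

package storage // placement assumed

import (
	"context"
	"errors"
	"io"
)

// ErrResumeNotSupported is what the puller maps to bad_offset_backend:
// delete the partial, retry with a full fetch from zero.
var ErrResumeNotSupported = errors.New("storage: backend cannot append to an existing object")

// Backend, condensed to the methods this change touches.
type Backend interface {
	WriteReader(ctx context.Context, path string, r io.Reader) error
	Delete(ctx context.Context, path string) error
	// ReadToAt streams the object starting at offset into w (offset 0 reads the whole object).
	ReadToAt(ctx context.Context, path string, w io.Writer, offset int64) error
	// StatFile returns the object size in bytes, or -1 if it does not exist.
	StatFile(ctx context.Context, path string) (int64, error)
}

// AppendingBackend is optional: only backends that can append in place
// implement it. LocalBackend does (O_WRONLY|O_APPEND); S3 and Azure do not,
// so the puller type-asserts instead of calling a method that always errors:
//
//	if ab, ok := backend.(AppendingBackend); ok {
//		err = ab.AppendReader(ctx, path, tail)
//	} else {
//		err = ErrResumeNotSupported
//	}
type AppendingBackend interface {
	AppendReader(ctx context.Context, path string, r io.Reader) error
}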

Puller behavior on retry (attempt > 1):
1. tryResumeFromPartial: StatFile → ReadTo into sha256.Hash (prefix)
2. Send ByteOffset to peer; peer sends only the tail via ReadToAt
3. writeFileTail: AppendingBackend.AppendReader or WriteReader
4. ErrBadOffset (server) → totalBadOffsetServer++, delete partial
5. ErrResumeNotSupported (backend) → totalBadOffsetBackend++, delete
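The key property behind steps 1-3 is that the full-file SHA-256 check survives a resume: the prefix is hashed from locally held bytes and the fetched tail is fed through the same hasher. A self-contained, runnable illustration of that idea (synthetic data, not project code):

package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"io"
)

func main() {
	full := []byte("pretend this is a large compacted Parquet file")
	offset := int64(20) // bytes already committed locally before the interruption

	h := sha256.New()
	h.Write(full[:offset]) // step 1: hash the prefix read back from the partial file

	// steps 2-3: the peer sends only the tail (ReadToAt on its side); it is
	// appended locally and hashed on the way through.
	tail := bytes.NewReader(full[offset:])
	if _, err := io.Copy(h, tail); err != nil {
		panic(err)
	}

	want := sha256.Sum256(full)
	fmt.Println("resumed digest matches full-file digest:", bytes.Equal(h.Sum(nil), want[:]))
}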

Other improvements from post-implementation review:
- Coordinator uses ReadToAt for both full and partial fetches (no branch)
- pullOnce goroutine uses sync.WaitGroup; write logic in writeFileTail
- Resume detection extracted to tryResumeFromPartial helper
- bad_offset counter split into bad_offset_server / bad_offset_backend

Full SHA-256 verification preserved for all code paths.

@gemini-code-assist (bot) left a comment

Code Review

This pull request implements resumable file transfers for cluster replication by introducing ReadToAt, StatFile, and AppendReader methods to the storage backends. The Puller is updated to detect partial files, verify their integrity via prefix hashing, and request only the remaining data from peers. Review feedback indicates a significant issue where the LocalBackend's atomic write strategy prevents partial files from being preserved, effectively disabling the resume feature for local storage. Suggestions were also made to optimize the resume check by verifying AppendingBackend support before hashing and to refine the Azure storage implementation to avoid unnecessary Range headers for full-file fetches.

var byteOffset int64
var prefixHasher hash.Hash
if attempt > 1 {
	byteOffset, prefixHasher = p.tryResumeFromPartial(log, entry)

Severity: high

The resume feature appears to be non-functional for the LocalBackend because WriteReader (called for the initial transfer at offset 0) is atomic. It writes to a temporary file and deletes it on failure (see internal/storage/local.go:181). Consequently, no partial file is left at entry.Path for tryResumeFromPartial to discover on subsequent attempts. To support resumption, the initial write strategy needs to preserve partial data (e.g., by using a persistent .part file that is not cleaned up on transport errors).

return 0, nil
}

h := sha256.New()

Severity: medium

tryResumeFromPartial should check if the backend implements storage.AppendingBackend before proceeding to hash the prefix. For backends like S3 or Azure that do not support appending, hashing the local partial file (which might involve a full download if the backend is remote) is a wasted operation, as writeFileTail will eventually reject the resume attempt with ErrResumeNotSupported.

	if _, ok := p.cfg.Backend.(storage.AppendingBackend); !ok {
		return 0, nil
	}

	h := sha256.New()

Comment thread internal/storage/azure.go Outdated
Comment on lines +233 to +235
resp, err := blobClient.DownloadStream(ctx, &blob.DownloadStreamOptions{
Range: blob.HTTPRange{Offset: offset},
})

Severity: medium

For consistency with the S3 implementation (line 299 in internal/storage/s3.go), it is recommended to avoid sending the Range header when offset is 0. This ensures the standard full-blob download path is used.

Suggested change
-	resp, err := blobClient.DownloadStream(ctx, &blob.DownloadStreamOptions{
-		Range: blob.HTTPRange{Offset: offset},
-	})
+	var downloadOpts *blob.DownloadStreamOptions
+	if offset > 0 {
+		downloadOpts = &blob.DownloadStreamOptions{
+			Range: blob.HTTPRange{Offset: offset},
+		}
+	}
+	resp, err := blobClient.DownloadStream(ctx, downloadOpts)

Comment on lines +57 to +60
// Read whatever the reader provides. If the reader closes with an error
// (e.g. a broken pipe from a cancelled fetch), store the partial bytes and
// return the error — matching real backend behaviour where partial bytes
// are flushed to disk before the write fails.

Severity: medium

The fakeBackend implementation of WriteReader in tests is non-atomic and stores partial bytes on failure. This is inconsistent with the real LocalBackend implementation, which is atomic and cleans up partial writes. This discrepancy masks the issue where the resume feature cannot actually function with the current LocalBackend.

xe-nvdk added 2 commits April 14, 2026 17:38
…der, and non-appending backend fallback

- TestLocalBackend_StatFile: not-found (-1), existing file size, correct
- TestLocalBackend_ReadToAt: full fetch, partial read from offset, negative offset error, missing file
- TestLocalBackend_AppendReader: verifies LocalBackend implements AppendingBackend; prefix+tail concatenation
- TestPuller_NonAppendingBackendFallback: backend without AppendingBackend triggers bad_offset_backend counter, deletes partial, retries full fetch from zero

Also fixes a subtle bug in pullOnce: when the write goroutine returns
ErrResumeNotSupported (closing the pipe), the resulting broken-pipe
fetchErr was masking the write-side error. ErrResumeNotSupported is now
checked before fetchErr so the counter increments and the partial is
deleted correctly.
…ansfers

- fix(storage/local): WriteReader now uses a deterministic `.part` staging
  file instead of a random `.tmp`. On transport error the staging file is
  left in place so the puller can resume; on success it is atomically renamed
  to the final path. This was the root cause of non-functional resume on local
  backends (HIGH severity finding).

- fix(storage/azure): ReadToAt no longer sends a Range header when offset=0,
  avoiding a spurious 400 from Azure Blob Storage (MEDIUM).

- fix(filereplication/puller): removed premature AppendingBackend early-return
  from tryResumeFromPartial so non-appendable backends correctly increment
  bad_offset_backend and delete the partial file before retrying from zero.

- fix(filereplication/puller_test): merged duplicate fakeBackend.Delete
  methods; fakeBackend now mirrors LocalBackend .part staging semantics so
  resume tests catch the atomicity bug at the unit level.

- fix(storage/backend_test): TestLocalBackend_AppendReader now exercises the
  real contract — failed WriteReader leaves .part, AppendReader appends tail
  and promotes to final.

@xe-nvdk (Member, Author) commented Apr 14, 2026

@gemini-code-assist All findings from your previous review have been addressed in the latest push. Here's a summary of the fixes:

HIGH — LocalBackend.WriteReader non-resumable (staging file deleted on error)
WriteReader now writes to a deterministic <path>.part staging file (replacing the random .tmp). On transport error the staging file is left in place so the puller can discover it via StatFile/ReadToAt and resume on the next attempt. On success it is atomically renamed to the final path. Unit test updated to exercise the full failed-write → append-tail → promote flow.
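
A condensed sketch of that staging behaviour, using the hypothetical helper name writeViaPartFile and ignoring fsync, directory creation, and permissions (the real implementation lives in internal/storage/local.go):

package storage // placement assumed

import (
	"io"
	"os"
)

// writeViaPartFile stages into a deterministic <path>.part. On a transport
// error the staging file is deliberately left behind so the next attempt can
// stat it and resume; on success it is atomically renamed to the final path.
func writeViaPartFile(path string, r io.Reader) error {
	part := path + ".part"

	f, err := os.OpenFile(part, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o644)
	if err != nil {
		return err
	}

	if _, err := io.Copy(f, r); err != nil {
		f.Close()
		return err // keep <path>.part: this is what makes the resume possible
	}
	if err := f.Close(); err != nil {
		return err
	}
	return os.Rename(part, path) // promote to the final path atomically
}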

MEDIUM — Azure ReadToAt sends Range header at offset=0
ReadToAt now only sets DownloadStreamOptions{Range: ...} when offset > 0. At offset=0 it passes nil (no Range header), matching the S3 behaviour and avoiding a spurious 400 from Azure Blob Storage.

MEDIUM — tryResumeFromPartial skips resume for non-AppendingBackend backends
Removed the premature AppendingBackend type-assertion early-return from tryResumeFromPartial. Non-appendable backends (S3, Azure) now correctly: detect a partial file → return a non-zero offset → hit ErrResumeNotSupported in writeFileTail → increment bad_offset_backend → delete the partial → retry from zero. New TestPuller_NonAppendingBackendFallback test verifies the full path.

fakeBackend staging semantics
fakeBackend in puller_test.go now mirrors LocalBackend's .part staging key semantics (WriteReader stores partial bytes under path+".part", AppendReader appends and promotes on full receipt, Delete cleans up both keys). Duplicate Delete method merged.

Please re-review when you get a chance. Thanks!

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces resumable file transfers to the replication system, enabling interrupted downloads to continue from the last committed byte. It extends the storage Backend interface with ReadToAt and StatFile methods and adds an AppendingBackend interface specifically for local-SSD storage. The Puller has been updated to identify partial files, verify their existing content via prefix hashing, and request only the missing data from peers. Review feedback highlights a logic error in prefix hashing for local storage, suggests scaling the hashing timeout for large files, and identifies a race condition in the file-opening sequence of the local backend.

Comment thread internal/cluster/filereplication/puller.go Outdated
}

h := sha256.New()
hashCtx, hashCancel := context.WithTimeout(p.ctx, 30*time.Second)

Severity: medium

A fixed 30-second timeout for hashing the partial file prefix may be insufficient for very large files (e.g., several gigabytes), especially on systems with slow or heavily contended disk I/O. Since this operation is local and essential for resuming large transfers, consider using a more generous timeout or one scaled by the expected file size.

Comment thread internal/storage/local.go Outdated
xe-nvdk and others added 2 commits April 14, 2026 17:54
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@xe-nvdk xe-nvdk merged commit f82ada0 into main Apr 14, 2026
6 checks passed