Add experimental Content-Addressable Storage (CAS) backups#1367
Open
filimonov wants to merge 190 commits intoAltinity:masterfrom
Open
Add experimental Content-Addressable Storage (CAS) backups#1367filimonov wants to merge 190 commits intoAltinity:masterfrom
filimonov wants to merge 190 commits intoAltinity:masterfrom
Conversation
Two plans derived from docs/cas-design.md: - Phase 1 + 1.5: cas-upload, cas-download, cas-restore, cas-delete, cas-verify, cas-status, plus v1<->CAS isolation guards. Independently shippable (backup/restore works without GC). - Phase 2: cas-prune mark-and-sweep with deferred marker release. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Promote the mature checksums.txt parser from docs/checksumstxt/ to pkg/checksumstxt/ so production CAS packages can import it. All 11 tests pass; go build ./... and go vet ./... clean. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ion/multi-block) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… layout This is the spec the cas-phase1 plan implements. format.md is already tracked. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add cas.Config with defaults, Validate(), and ClusterPrefix(). Wire into pkg/config.Config (field, DefaultConfig, ValidateConfig). Move backend_storage.go to pkg/cas/casstorage to break the import cycle that would have resulted from pkg/config importing pkg/cas importing pkg/storage. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Hardens path-key construction against operator misconfiguration that could produce traversal-flavored object keys (review feedback on Task 6). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add sentinel errors (errors.go) and ValidateBackup (validate.go) which loads and verifies CAS metadata.json — enforcing name syntax, layout version bounds, inline_threshold range, and cluster_id ownership. Include validate_test.go covering happy path and all 7 failure modes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Hardens validateName beyond the regex; cited as Minor in Task 10 review.
Add DetectObjectDiskTables + IsObjectDiskType to pkg/cas, plus TableInfo and DiskInfo value types in types.go. Using local structs avoids the pkg/cas ↔ pkg/clickhouse import cycle. 6 unit tests cover longest-prefix matching, sibling-prefix false-positives, deduplication, and empty inputs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implement Upload(): parses each part's checksums.txt, classifies files into inline (≤ threshold, packed into per-(disk,db,table) tar.zstd) and blob (> threshold, content-addressed by Hash128), cold-lists existing blobs to dedup, enforces marker discipline with pre-commit re-checks, then commits the root metadata.json last. Adds pkg/cas/internal/testfixtures for synthesizing local-backup trees with valid v2 checksums.txt so tests can drive Upload end-to-end against the in-memory fakedst backend.
Download(ctx, b, cfg, name, opts) reads cas/<cluster>/metadata/<name>/ metadata.json via ValidateBackup, fetches per-table TableMetadata JSONs, applies TableFilter / Partitions / SchemaOnly filters, downloads and extracts per-(disk, db, table) tar.zstd archives into the local shadow tree, then walks each part's on-disk checksums.txt and fetches every file with size > InlineThreshold from cas/blob/<aa>/<rest>. Local layout produced is exactly what v1 restore consumes when DataFormat="directory": <localDir>/<name>/metadata.json <localDir>/<name>/metadata/<enc_db>/<enc_table>.json <localDir>/<name>/shadow/<enc_db>/<enc_table>/<disk>/<part>/<file> Path containment is enforced for both tar extraction (via the existing ExtractArchive) and blob writes (re-asserted before O_TRUNC). Filenames parsed from checksums.txt are validated (reject "..", NUL, leading "/", nested paths unless they match the projection layout <name>.proj/<file>). Disk-space pre-flight uses syscall.Statfs (already a project dep) with a 1.1x safety margin; failures are best-effort and skipped if the syscall is unavailable. Tests (8): RoundTripBytes, RefusesV1Backup, RefusesUnsupportedLayoutVersion, TableFilter, SchemaOnly, RejectsTraversalFilename, RejectsTarTraversal, PartitionFilter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add pkg/cas/restore.go: a thin orchestrator that runs cas.Download and hands off to a caller-supplied V1RestoreFunc (the CLI binding will wire it to pkg/backup.Backuper.Restore in Task 19). The callback indirection keeps pkg/cas free of any dependency on pkg/backup. Patch the v1 entry points in pkg/backup to refuse CAS-shaped backups, returning cas.ErrCASBackup so callers can errors.Is the sentinel: - Restore: after BackupMetadata is unmarshalled (line ~138). - Download: after remoteBackup is found (line ~129). - RemoveBackupRemote: inside the BackupList loop (line ~344). In Restore, also bypass the two downloadObjectDiskParts calls (lines ~2168, ~2212) when backupMetadata.CAS != nil. CAS backups never carry object-disk parts (cas-upload preflight rejects object-disk tables), but the existing v1 detector inspects live ClickHouse disk types rather than backup metadata, so it has to be told explicitly. V1 unit tests for the new guards require full Backuper plumbing (ClickHouse client + storage backend + initDisksPathsAndBackupDestination); deferred to integration coverage in Task 21. CAS-side wrapper tests (restore_test.go) cover happy path, callback error propagation, ignore-dependencies rejection, nil-callback error, and Download error propagation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Delete removes the metadata subtree with metadata.json deleted first for atomic catalog removal; stale inprogress-marker detection, prune-in-progress guard, and best-effort stale-marker cleanup all covered by 6 tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add Verify() which loads backup metadata, streams per-table tar.zstd archives to extract checksums.txt entries, accumulates expected (path, size) pairs for every above-threshold blob, then HEADs each blob in parallel to detect missing blobs and size mismatches. Writes failures as human-readable lines or line-delimited JSON (opts.JSON=true) to the provided io.Writer; returns ErrVerifyFailures when any failures exist. Add ErrVerifyFailures sentinel to errors.go. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implement Status(ctx, b, cfg) -> *StatusReport and PrintStatus(r, w) in pkg/cas/status.go. LIST-only (no GETs): walks metadata/, blob/, inprogress/ and stats prune.marker. Classifies in-progress markers as fresh or abandoned relative to cfg.AbandonThreshold. Four tests covering empty bucket, post-upload counts, prune marker detection, and fresh/abandoned classification. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add six urfave/cli v1 subcommands (cas-upload, cas-download, cas-restore, cas-delete, cas-verify, cas-status) as thin shims over pkg/cas. Each command opens a *storage.BackupDestination via the existing Backuper init path, adapts it through pkg/cas/casstorage.NewStorageBackend, and translates pkg/cas results / errors to CLI exit codes. cas-restore wires a V1RestoreFunc closure that delegates back to Backuper.Restore on the local directory cas-download just materialized; --ignore-dependencies is accepted but rejected by cas.Restore because CAS backups have no dependency chain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review feedback on Task 19: CASUpload/CASDownload/CASRestore acquire a pidlock; CASDelete should too, matching v1 'delete' (pkg/backup/delete.go:99). A concurrent cas-upload + cas-delete on the same name now mutex correctly.
Add test/integration/cas_test.go covering Task 21 of the CAS Phase 1 plan: - TestCASRoundtrip: end-to-end create → cas-upload → cas-status → drop database → cas-restore → checksum-style row count + sum verification → cas-delete → cas-status (gone). - TestCASCrossModeGuards: §6.2.2 isolation. Verifies that v1 download/ delete-remote refuse a CAS backup, that cas-download/cas-delete refuse a v1 backup, and that same-mode operations succeed. - TestCASVerify: cas-verify happy path; stretch case removes a single blob from MinIO and asserts the next cas-verify exits non-zero with a "missing" diagnostic. Stretch is best-effort: skipped (with a warning) if the bucket layout does not match the assumed minio path. Implementation notes: - casBootstrap writes a CAS-flavored config to /tmp/config-cas.yml inside the clickhouse-backup container (configs/ is read-only mounted). The config is the stock config-s3.yml plus a cas: stanza with a per-test cluster_id so concurrent envPool slots cannot trample each other. - Each test has its own cluster_id (roundtrip / guards / verify). - Tests do NOT actually run as part of this commit; verified only via go vet -tags=integration ./test/integration/... and go test -c -tags= integration. Real run is deferred to manual / CI execution per Task 22. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-remote integration BackupList in pkg/storage now accepts skipPrefixes and ignores any top-level entry matching them. cas.Config.SkipPrefixes() returns ["cas/"] (or the configured root_prefix) when CAS is enabled, else nil. Every BackupList caller in pkg/backup passes b.cfg.CAS.SkipPrefixes(); CleanRemoteBroken and RemoveOldBackupsRemote inherit the exclusion via GetRemoteBackups / the new arg, so v1 retention no longer mistakes cas/<cluster>/... for a broken backup directory. cas.ListRemoteCAS walks cas/<cluster>/metadata/<bk>/metadata.json, parses each metadata.json (degrading to a "broken" description on read or parse failure rather than dropping the entry), and returns CASListEntry rows tagged "[CAS]" sorted newest-first. CollectRemoteBackups in pkg/backup/list.go appends those entries to the v1 list-remote output when CAS is enabled, so operators see CAS backups alongside v1 backups in `clickhouse-backup list remote`. Errors from the CAS side are logged and swallowed — informational listing must not break on a CAS-side failure that the v1 path just survived. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r README
Three fixes from the cross-cutting final review:
B1. cas-restore was broken: cas.Download wrote a local metadata.json with
bm.CAS != nil, which the v1 Restore handoff at pkg/backup/restore.go:141
refuses by design. Strip bm.CAS in the local copy so the on-disk layout
is indistinguishable from a v1 directory-format backup; the cross-mode
guard remains effective for direct v1 invocation. The object-disk skip
for CAS still works because the pre-flight refuses such tables and
iterating zero parts is a no-op.
UPLOAD STATS. UploadResult now exposes the breakdown operators care about:
total backup content (TotalFiles/TotalBytes), how it was placed
(InlineFiles/InlineBytes vs UniqueBlobs/BlobBytesTotal), and what crossed
the wire on this run (BlobsUploaded/BytesUploaded vs BlobsReused/BytesReused
+ ArchiveBytes). cas-upload prints a multi-line summary highlighting the
content-addressed dedup savings.
README. Rewrote the CAS section for end users: leads with the value-prop
(smart deduplicating backups, smaller uploads than incremental, no chain
dependency, mutation-friendly), no design-doc reference. Added a quick-
start config + command sequence. Updated upload --help to recommend
cas-upload without pointing at internal docs.
Includes the recovered Task 18 commit (BackupList prefix exclusion +
cross-mode list integration) which had been dangling — cherry-picked back
onto cas-phase1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…o uploaded TableMetadata Without these fields the v1 restore handoff cannot recreate tables on a fresh host (CREATE TABLE statement is empty). Read the per-table JSON that 'clickhouse-backup create' already wrote to disk and merge into the uploaded TableMetadata. Add a download-side regression test that the schema fields survive the round-trip. This is the load-bearing correctness fix from the cas-phase1 follow-up plan.
Convert httpCASDeleteHandler from synchronous execution to the same
async pattern used by all other CAS handlers (upload, download, restore,
verify, prune). The sync path caused 502/504 from reverse-proxy timeouts
(nginx default 60s) while the delete continued server-side.
Changes:
- Wrap b.CASDelete() in go func with status.Current.StartWithOperationId
and errorCallback/successCallback, matching the upload handler pattern.
- Return newAsyncAck("cas-delete", name, operationId) immediately (HTTP 200).
- Parse optional callback query parameter (parity with other handlers).
- Delete casDeleteHTTPStatus helper (sync-mode workaround, no longer used)
and its dedicated TestCasDeleteHTTPStatus unit test.
- Replace TestCASDeleteHandler_LockedWhenBusy with an additional
TestCASDeleteHandler_AsyncAck that verifies the 200 acknowledged shape.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ng (F10) Surface dry-run candidate blobs to API callers and CLI output so operators can see exactly what would be deleted without running a live prune. Changes: - Add DryRunCandidates []OrphanCandidate to PruneReport (json tagged, omitempty; only populated when DryRun=true). - Add JSON struct tags to all PruneReport fields and OrphanCandidate fields so the report serialises cleanly through the REST API response path. - Accumulate cands into rep.DryRunCandidates in the dry-run branch of Prune() (step 11), after the existing structured log.Info() lines. - Update PrintPruneReport to print a "Would delete:" section listing each candidate with its key, human-readable size, and UTC modification time when DryRunCandidates is non-empty. - Fix TestUpload_RefusesIfInprogressMarkerPresent assertion to match the updated error message format from upload.go (uses Tool field from marker rather than hardcoded "cas-upload"); check for "is in progress for" which is stable across both the old and new message formats. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… test) Wave-D1 scoped review surfaced two minor issues: - PruneReport.DryRunCandidates was assigned the live cands slice returned by SweepOrphans; if the caller mutates cands after the function returns, the report silently reflects mutations. Use append-copy to break the alias. - TestUpload_RefusesIfInprogressMarkerPresent asserted only on the generic 'is in progress for' substring, which would pass for any tool name. Tighten to assert tool=cas-upload, host=host-other, and in-progress all three appear in the diagnostic. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two integration regressions surfaced by the wave-5 sweep: 1. TestCASAPIRoundtrip asserted cas-delete returns a sync 'success' response. F13 (commit ed4f24c) made the handler async, so the response is now an async-ack with operation_id. Updated the test to use the existing casAPIPostAndCaptureOpID + casAPIWaitForOperation pattern, matching every other CAS verb. 2. TestCASUploadRefusesConcurrent injected an inprogress marker with tool='test' and asserted the error mentions 'another cas-upload is in progress'. N2 (commit f6c3a70) made the diagnostic use the marker's Tool field dynamically. Changed the injected marker's tool to 'cas-upload' so the test exercises the realistic upload-vs-upload conflict. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…5, F28)
Wave-5 summary previously claimed several items were already in §9
when in fact they weren't. Adding now:
§9.4: pkg/pidlock TOCTOU (whole-tool, not CAS-specific) and
probe-key sweep on prune (F28).
§9.4.x: FTP.AllowUnsafeMarkers struct-field exposure (refactor
preference, F21).
§9.5: TestPrune_FailClosedOnNilCASMetadata (F23),
TestBackupList_SkipsV1BackupNamedSameasCASPrefix (F24),
TestListRemoteCAS_WalkError (F25).
Removed three §9.4 entries that have actually shipped:
- S3 IfNoneMatch startup probe (Phase 8 B1)
- RemoteStorage interface compatibility note (in §6.12 + changelog)
- Real-production error-classification tests (Phase 7 C2)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ity#13, Altinity#18) Altinity#5: wrap marker body reads in io.LimitReader(64KiB) so a corrupt or oversized object cannot exhaust heap; JSON unmarshal will surface the truncation as a parse error. Altinity#11: SFTP PutFileAbsoluteIfAbsent now propagates Close errors on the success path; if Close fails the partial file is removed so the slot stays available for the next caller. Altinity#13: cas-delete success message now reads "metadata removed" and adds a hint that blob storage is reclaimed by the next cas-prune run. Altinity#18: replace insertion-sort in sortExpectedBlobs with sort.Slice; O(n log n) rather than O(n²) for large blob lists. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d guard (Altinity#17, Altinity#19, Altinity#20) Altinity#17: isCASBackupRemote returns false immediately when cfg.Enabled is false, avoiding a storage probe when CAS is not configured. Altinity#19: register a deferred cleanup for the stale upload marker in the ipOK&&mdOK branch, using a detached 30 s context so the cleanup runs even when the parent ctx is already cancelled. The now-redundant explicit step-6 removal is dropped. Altinity#20: transient read failure on the inprogress marker now returns an error immediately ("refusing") rather than falling through to the stale-marker path, which could otherwise delete live content. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The function had no callers — only collectRefsFromArchive is used. Remove it entirely to eliminate dead code and the confusion it created when readers tried to understand the two near-identical functions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rker refusal (W6-A gaps) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All three marker-cleanup defers (prune.go, delete.go, upload.go) used the operation's ctx. When /backup/kill calls status.Cancel(), the ctx is cancelled before the function returns, causing the deferred DeleteFile to receive an already-cancelled context and fail — stranding the marker. Fix: create a fresh context.WithTimeout(context.Background(), 30s) inside each defer so cleanup always completes regardless of the operation ctx state. - pkg/cas/prune.go: prune.marker cleanup defer - pkg/cas/delete.go: cas-delete inprogress marker cleanup defer (!ipOK path) (the ipOK&&mdOK stale-upload-marker defer was already fixed in Wave 6-A) Tests: - TestPrune_CancelledContextStillReleasesMarker: pre-cancel ctx, verify prune.marker is absent after Prune returns with error Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…xplicit cleanup sites (Altinity#3) Before: 10 hand-written `_ = DeleteInProgressMarker(ctx, b, cp, name)` calls guarded every error path in Upload. A panic in any step stranded the marker permanently. The cleanup also used the operation ctx, so a /backup/kill cancellation before the function returned could cause the cleanup to fail. After: a single deferred cleanup registered immediately after the marker is written handles all error paths (including panics). It uses a fresh context.WithTimeout(context.Background(), 30s) to ensure cleanup succeeds even when the operation ctx is cancelled. The success path (step 13) uses a `committed bool` sentinel to skip the defer's redundant delete: committed=true is set after committing metadata.json but before the explicit best-effort delete, so a panic during the delete does not trigger a second attempt. Removed explicit cleanup sites (10 error-path calls): 1. step 6 planUpload error 2. step 7 ColdList error 3. step 8 uploadMissingBlobs error 4. step 9 uploadPartArchives error 5. step 10 uploadTableJSONs error 6. step 11a prune-marker stat error 7. step 11a prune-marker exists 8. step 11b own-marker stat error 9. step 11c revalidateColdList error 10. step 12 put metadata.json error Tests: - TestUpload_ErrorPathCleansInprogressMarker: inject archive PUT failure (step 9), verify marker absent after Upload returns error - TestUpload_CancelledContextStillReleasesMarker: pre-cancel ctx, inject Walk error, verify marker absent despite cancelled operation ctx Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- upload.go step-13 success-path explicit DeleteInProgressMarker now uses detached cleanCtx, not the operation ctx (consistency with the deferred cleanup; survives caller cancellation immediately after commit) - delete_test.go: replace the broken TestDelete_CancelledContextStillReleasesMarker (the pre-cancelled ctx returned from waitForPrune before any defer was registered — wrong scenario) with a doc comment explaining why the delete-side cancellation invariant is verified by parity with the upload and prune tests, all three using the identical defer-with-cleanCtx pattern. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
) Replace the io.ReadAll + bytes.NewReader pattern in buildVerifySet with direct streaming: archRC (the io.ReadCloser returned by GetFile) is passed directly to extractBlobsFromArchive, which already accepts io.Reader and internally wraps it with zstd.NewReader. This mirrors the identical shape used by prune.go::collectRefsFromArchive. Memory impact: per-table archives are no longer buffered in heap; only the small checksums.txt entries inside each archive are read into memory (via the existing io.ReadAll in extractBlobsFromArchive). Large archives with many parts no longer spike heap proportionally to archive size. Existing TestCASVerify tests cover the streaming refactor end-to-end. No dedicated large-archive test added (synthesizing 64 MiB test data would add non-trivial test infrastructure; streaming refactor correctness is verified by the existing round-trip tests). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add pkg/storage/general_test.go with TestBackupList_SkipPrefixesFiltering.
The test builds a minimal fakeRemoteStorage stub (implements RemoteStorage
by supplying Walk + no-op stubs for all other methods) and wraps it in a
BackupDestination to exercise BackupList directly without any real backend.
Three cases:
1. skipPrefixes=["cas/"] — "cas/" filtered, "casematch" NOT filtered
(validates HasPrefix-without-trailing-slash semantics).
2. skipPrefixes=nil — all four entries pass through.
3. skipPrefixes=[""] — empty string prefix skips nothing (defensive).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…to ERROR (Altinity#9) Sub-change A: pkg/storage/general.go — promote the BackupList skip-prefix log entry from log.Warn to log.Error. Operators reviewing logs will see this loudly (e.g. if a v1 backup is accidentally named "cas") instead of dismissing it as informational noise. Sub-change B: add NameCollidesWithCASPrefix to pkg/cas/validate.go and call it from both the CAS Upload path (pkg/cas/upload.go) and the v1 Upload path (pkg/backup/upload.go::validateUploadParams). Rejects backup names that exactly match a CAS skip-prefix segment (e.g. "cas" when root_prefix="cas/") at upload time, before any I/O, with a descriptive error message. Without this guard an operator could successfully create a v1 backup named "cas", which would then be silently invisible to BackupList (and therefore to v1 retention) once CAS is enabled — a hard-to-diagnose data loss scenario. Tests: - TestUpload_RejectsNameCollidingWithCASPrefix in pkg/cas/upload_test.go - v1 path covered by integration (unit setup requires real BackupDestination) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…covery (Altinity#10) Implements the operator escape hatch for a cas-upload in-progress marker left behind by SIGKILL, OOM, or network partition. Without this, operators had to delete the marker object manually via the storage console. Changes: - pkg/cas/errors.go: add ErrNoInProgressMarker sentinel. - pkg/cas/unlock.go: new UnlockInProgress function — stats marker, logs Tool/Host/StartedAt, deletes, returns ErrNoInProgressMarker when absent. - pkg/backup/cas_methods.go: CASUpload gains unlock bool parameter; when true, calls UnlockInProgress and exits without uploading. Refuses --unlock + --dry-run and --unlock + --skip-object-disks. - cmd/clickhouse-backup/cas_commands.go: add --unlock BoolFlag to cas-upload. - pkg/server/actions_cas.go, cas_handlers.go: pass unlock=false at API call sites (unlock is a CLI-only escape hatch, not an API operation). Tests: - TestUpload_UnlockRemovesInprogressMarker: pre-place marker → unlock → assert marker absent and no upload artifacts written. - TestUpload_UnlockRefusesWhenNoMarker: no marker → unlock → assert ErrNoInProgressMarker. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ltinity#15) Mirrors the pattern in Upload and Prune: each of the four CAS operations now validates the config at function entry and wraps the error with a verb-specific prefix ("cas: delete: invalid config: …" etc.). Without this guard, callers that construct a Config and call these functions without invoking Validate() first would silently operate with zero-duration fields (GraceBlobDuration/AbandonThresholdDuration return 0 until Validate() runs). Status and Verify don't use those durations today, but having the guard everywhere makes the contract uniform and future-proof. Test: TestConfig_DurationsZeroWithoutValidate in pkg/cas/config_test.go locks the documented contract that duration accessors return 0 before Validate(). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t per request (Altinity#8) Extract CASProbeState (probeOnce/probeErr/bannerOnce) from Backuper into a shared struct. APIServer holds a singleton CASProbeState created at startup; every per-request NewBackuper call receives it via the WithCASProbeState opt, so the conditional-put probe and unsafe-marker WARN banner are guaranteed to fire at most once per server lifetime instead of once per REST call. CLI paths create a fresh CASProbeState per NewBackuper (one-shot process, unchanged semantics). Three new unit tests lock the shared-state contract. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…flag-combo guards - Name-collision test: add t.Run case for 'casematch' (a name that prefix-matches 'cas' but is not equal to it) — locks the exact-match boundary so a future regression to HasPrefix-style matching is caught. - Unlock flag-combo test: cover the --unlock+--dry-run and --unlock+--skip-object-disks refusals in cas-upload as a table-driven test on Backuper.CASUpload directly. The simple early-return guards in cas_methods.go are now regression-protected. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e changes (Altinity#1, Altinity#7) Add BREAKING CHANGES subsection at the top of vNEXT covering: - pre-CAS binary downgrade hazard (silent deletion of cas/ namespace) - RemoteStorage interface: two new required methods (PutFileAbsoluteIfAbsent, PutFileIfAbsent) - BackupDestination.BackupList fourth skipPrefixes parameter - v1 backup literally named "cas" silently filtered after upgrade Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…us via /backup/actions §9.2: semaphore ctx-cancellation gap + --parallelism flag for PruneOptions §9.4.x: atomicSwapDir misleading name + markerTool single-write contract §9.5: casstorage.Walk absolute-key reconstruction contract test + root_prefix mid-deployment auto-detect runbook: note that cas-status via POST /backup/actions returns only an ack; structured report is only available via GET /backup/cas-status Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three snapshot scenarios in cli.py compare full output to recorded fixtures: 1. default_config — clickhouse-backup print-config now emits a 'cas:' block with 10 fields (enabled, cluster_id, root_prefix, inline_threshold, grace_blob, abandon_threshold, wait_for_prune, allow_unsafe_markers, skip_conditional_put_probe, allow_unsafe_object_disk_skip). 2. cli_usage — clickhouse-backup with no args now lists 7 cas-* commands (cas-upload, cas-download, cas-restore, cas-delete, cas-verify, cas-status, cas-prune) between 'server' and 'help'. 3. help_flag — clickhouse-backup --help shows the same expanded command list. This was caught by the Testflows CI matrix at PR Altinity#1367. The underlying Go integration suite already exercises CAS extensively; testflows is the BDD-style end-to-end harness that was missing the fixture refresh. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CI matrix exposed three v1 integration tests breaking on the CAS branch: - TestServerAPI listFormat regex did not include 'kind':"v1". The CAS work added a 'kind' field to /backup/list JSON entries so v1 and CAS rows can be distinguished in a merged response. Update the expected regex in serverAPI_test.go. - env.Cleanup() removed disk_s3 between tests but never touched the CAS namespace under <bucket>/backup/cluster/<shard>/cas/<cluster_id>/. v1 retention/clean-broken explicitly skips that prefix (by design, see SkipPrefixes in pkg/cas/config.go), so it persists across env- pool reuse and surfaces as a non-empty bucket in checkObjectStorageIsEmpty for the next non-CAS test on the same slot. Sweep every per-backend CAS path in Cleanup(). - TestCASUploadSkipObjectDisks t.Skip()'d after creating cas_skipod_db + local backup cas_skipod_bk, leaving them in place. The next v1 test that did 'clickhouse-backup create' iterated ALL tables and failed CopyObject on the dangling object-disk stub. Move cleanup into t.Cleanup() so it runs on the skip path too. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Skip CAS tests on ClickHouse < 21.0 (lack of min_rows_for_wide_part / repeat() / system.disks columns). - Skip TestCASRoundtripWithProjection on CH < 23.0 (system.projections added in 23.x). - Drop "kind":"v1" from /backup/list output and rely on omitempty so legacy CH integration tables (CH < 21.1, no input_format_skip_unknown_fields) keep parsing the JSON. CAS rows still carry "kind":"cas". - Update TestServerAPI listFormat regex to match the no-kind v1 layout. - Cleanup(): also rmdir empty backup/ parent dirs after wiping cas/ subtree on MinIO and fake-gcs-server, so checkObjectStorageIsEmpty doesn't trip on stale empty parents. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous fix only handled some failure modes from the CI run. Honest
audit of the remaining failures across all 12 jobs surfaced three more
bugs:
1. Data race in TestCASUploadSkipObjectDisks: t.Cleanup runs AFTER the
test function returns (and after `defer env.Cleanup` has already
closed env.ch and returned the env to the pool). The next test
acquiring that slot races on the shared env. Fixed by switching from
t.Cleanup to a plain defer (LIFO order: runs before env.Cleanup).
2. Mid-flight CAS test failures (e.g. system.projections on CH < 23,
t.Skip paths) leak databases (cas_proj_db, cas_skipod_db, ...) and
local backups (cas_prune_smoke_bk, ...) to the next test on the
same env-pool slot, breaking unrelated tests:
- TestTablePatterns / TestCheckSystemPartsColumns SHOW CREATE every
database, including the leaked cas_*_db ones (and choke on
cas_skipod_db.remote object-disk table).
- TestServerAPI counts local backups via /metrics, gets the wrong
number when CAS local backups are still around.
Fixed by adding a CAS-specific backstop to env.Cleanup: for any
TestCAS* test, wipe /var/lib/clickhouse/backup/* and DROP all
leaked cas_* databases at end-of-test.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previous fix used a 23.0 version cutoff, but system.projections was actually added in ClickHouse 24.4. Replace the version check with a runtime probe of system.tables so the test correctly skips on every ClickHouse version that lacks the table, regardless of point-release backports. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment and skip message only — the runtime probe via system.tables is unaffected. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous selective cleanup ('rm cas/' + 'find -empty -delete' +
'rmdir backup') wasn't fully clearing state on MinIO — TestS3 still
saw a leftover backup/ directory after TestCASMutationDedup ran on
the same env-pool slot.
CAS tests don't write v1-shaped state to the same bucket (only
cas/<cluster>/...), so for TestCAS* tests just rm -rf the entire
backup/ tree on each backend. Non-CAS tests keep the targeted
cas-only cleanup since their own test logic still uses backup/.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The single failed check was 'Testflows (1.26, 22.3)' on '/clickhouse backup/other engines/materializedpostgresql/I create MaterializedPostgreSQL': SELECT against a MaterializedPostgreSQL table hung past the 600s ExpectTimeoutError. This is upstream issue ClickHouse#32902 territory and pre-existing — no testflows source changes between master (last green at 92db680) and HEAD on this branch's testflows tree. Empty commit to trigger a re-run; not a fix because there's nothing branch-side to fix. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TestShadowCleanup{,OnFailure} run 'clickhouse-backup clean' which wipes
/var/lib/clickhouse/shadow/. ClickHouse then re-creates an empty
increment.txt as a side effect, and the NEXT test on the same env-pool
slot that runs FREEZE (TestSkipDisk, TestCustomKopia, ...) fails with:
code: 32, message: File /var/lib/clickhouse/shadow/increment.txt is
empty. You must fill it manually with appropriate value.
The flake hit different downstream tests on different CI runs (TestSkipDisk
on 22.8, TestCustomKopia on 21.3) — same root cause, intermittent because
it depends on whether ClickHouse has time to recreate the file before the
next env acquire. Pre-existing on master, not CAS-introduced.
In env.Cleanup, scrub the 0-byte increment.txt after TestShadowCleanup{,OnFailure}
so ClickHouse re-creates it cleanly on the next FREEZE.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Add experimental Content-Addressable Storage (CAS) backups
Summary
Adds seven new commands —
cas-{upload,download,restore,delete,verify,prune,status}— that store backups in a content-addressable layout where every file is keyed by the CityHash128 already in ClickHouse's per-partchecksums.txt. Identical content is stored once and referenced from any number of backups, across mutations, days, and tables. There is noRequiredBackupchain — every backup is independently restorable; deleting one never affects another. Storage grows with new content, not with backup count.The feature lives side-by-side with v1 (
upload/download/restore) under a separate root prefix in the bucket. v1 commands stay byte-for-byte unchanged.Why
Three operational pain points motivated this:
ALTER TABLE … UPDATErewrites one column file and renames the part. Today's incremental dedup is name-based, so the new part looks unrelated and 99% of identical bytes get re-transferred. Content-addressing dedups them.How to enable
Then:
All seven commands also work over the existing daemon REST API (
POST /backup/cas-upload/{name}etc.;cas-*verbs in/backup/actions).What's new
Operator-facing:
cas:config block withenabled,cluster_id,root_prefix,inline_threshold,grace_blob,abandon_threshold,wait_for_prune,allow_unsafe_markers,skip_conditional_put_probe,allow_unsafe_object_disk_skipcas-prunemark-and-sweep GC;cas-verifyfor HEAD+size integrity checkscas-statusdaemon endpoint for monitoringImplementation:
pkg/cas/— core engine (~6 KLOC + ~6 KLOC tests)pkg/checksumstxt/— new parser for ClickHouse's part-file checksums format (versions 2/3/4)pkg/storage/— newPutFileAbsoluteIfAbsent/PutFileIfAbsenton all 6 backends (S3, Azure, GCS, COS, SFTP, FTP) for atomic markersIf-None-Match: *(refuses MinIO < 2024-11)Breaking changes
See
ChangeLog.mdvNEXT section. Top three:cas/namespace as a broken v1 backup and silently delete it on the nextclean remote_brokenorBackupsToKeepRemoterun. Recovery procedure in the operator runbook.pkg/storage.RemoteStorageinterface gains two methods (PutFileAbsoluteIfAbsent,PutFileIfAbsent). Third-party implementations need to add them.pkg/storage.BackupDestination.BackupListsignature gains askipPrefixes []stringparameter.A v1 backup literally named
"cas"(matching the defaultroot_prefix) will be filtered after upgrade; rename before upgrading.How to review (suggested reading order)
The branch has many small commits because it landed in 8 phases plus 6 review waves. The cleanest way through it is by package, not by commit. Suggested order:
1. Read the design first
docs/cas-design.md— full design (~700 lines). §1–§3 = motivation/scope, §6 = the protocol, §7 = what's reused, §8 = risk register, §9 = the v2 backlog of known deferrals.docs/cas-operator-runbook.md— what an operator sees: first-deployment walkthrough, recovery procedures, REST endpoints.2. Read the data model
pkg/checksumstxt/(smallest; standalone) — parser for the ClickHouse on-disk format. The hash CAS uses comes from here.format.mdis the reference spec.pkg/cas/types.go+pkg/cas/blobpath.go+pkg/cas/paths.go— blob-key derivation and the on-bucket layout (where everything lives undercas/<cluster>/).pkg/cas/config.go— thecas:config block, validation rules, theSkipPrefixes()contract that protects CAS from v1 retention.3. Storage primitives
pkg/storage/structs.go— interface-level:PutFileAbsoluteIfAbsent,ErrConditionalPutNotSupported, theBackupListsignature change.pkg/storage/{s3,gcs,azblob,cos,sftp,ftp}.go(and tests) — per-backend conditional-create implementations. Read S3'sIfNoneMatch: "*"first (canonical), then SFTP'sO_EXCL, then FTP (the refuse-by-default one with theallow_unsafe_markersopt-in best-effort fallback).pkg/cas/casstorage/backend_storage.go— the thin adapter that wrapspkg/storage.BackupDestinationas the narrowercas.Backendinterface. This is the import-cycle breaker; everything inpkg/cas/talks tocas.Backend, never topkg/storagedirectly.4. The hot path — upload
pkg/cas/upload.go— the 13-step upload protocol. Each step is commented with its number and intent. Steps 5 (write inprogress marker) → 11 (pre-commit re-checks) → 13 (commit metadata.json) are the load-bearing sequence; everything between is parallel blob/archive uploads. Note the single deferred cleanup at step 5 (added in wave-6 review Support incremental backups #3): on any error path including panic, the marker is released via a detached context.pkg/cas/coldlist.go+pkg/cas/cache.go— the parallel 256-shardLISTthat seeds the existence set so the same blob isn't re-uploaded.pkg/cas/archive.go— per-table tar.zstd builder for the inline files.pkg/cas/markers.go+pkg/cas/probe.go+pkg/cas/wait.go— atomic mutual exclusion, conditional-put probe, polite waiting on prune-marker.5. Download and restore
pkg/cas/download.go— materializes a backup into a v1-shaped local directory. Note the staging-dir + atomic-rename pattern (added in an earlier review wave) — partial downloads never leave a "looks valid" directory at the final path.pkg/cas/restore.go— thin wrapper: cas-download into a staging dir, then hand off to v1'sb.Restore. The handoff metadata setsBackupMetadata.CAS.Handoff = trueso v1's cross-mode guards distinguish "raw CAS backup" (refuse) from "CAS handoff layout" (allow + skip object-disk fetch).6. Lifecycle: delete and prune
pkg/cas/delete.go— ordered delete (metadata.json first, then subtree). Writes its own inprogress marker taggedTool: cas-deleteto lock out concurrent same-name uploads on other hosts.pkg/cas/prune.go+pkg/cas/sweep.go+pkg/cas/markset.go— mark-and-sweep GC. External-mergesort mark set spilled to disk; container/heap-based shard iterator for the sweep phase. Holds an exclusiveprune.marker; explicit--unlockescape hatch when stranded.7. Verify, list, status
pkg/cas/verify.go— HEAD + size check (no content re-hash; that's a v2 deferred item).pkg/cas/list.go— backup catalog walker; surfaces CAS backups inclickhouse-backup list remoteand the REST list endpoint.pkg/cas/status.go— bucket health summary (LIST-only, cheap).8. The integration glue
pkg/backup/cas_methods.go—Backuper.CASUpload/CASDownload/CASRestore/CASDelete/CASVerify/CASPrune/CASStatus. Object-disk preflight (fail-closed on disk-query failure), pidlock around cas-download phase, probe-singleton injection for daemon mode.cmd/clickhouse-backup/cas_commands.go— CLI flag definitions and action wiring.pkg/server/cas_handlers.go+pkg/server/actions_cas.go— REST handlers (mostly async with the existingstatus.Currentpattern;cas-statusis sync).pkg/server/server.go— route registration; newcasProbeStatefield on APIServer so the conditional-put probe fires once per daemon lifetime, not once per request.pkg/backup/list.go—CollectRemoteCASBackupsfor the merged/backup/listresponse.9. Tests
pkg/cas/internal/fakedst/— in-memorycas.Backendfake with hooks for error injection.pkg/cas/internal/testfixtures/— synthetic local-backup builder (parts with deterministic hashes).test/integration/cas_*.go— testcontainer-driven end-to-end coverage across MinIO, Azurite, fake-gcs-server, OpenSSH-server SFTP, and proftpd FTP. The full CAS integration suite passes 19 PASS / 2 SKIP / 0 FAIL.Test plan
Running on this branch:
go test ./pkg/cas/... ./pkg/storage/... ./pkg/backup/... ./pkg/server/... -race -count=1— ✅go vet ./...— ✅go test ./...whole tree,-short— ✅RUN_PARALLEL=1 go test -tags=integration ./test/integration/ -run TestCAS -timeout 90m— ✅ (19 PASS, 2 SKIP)CI in
.github/workflows/build.yamlwill run the integration suite across the ClickHouse-version matrix on this PR.What's deferred
The
docs/cas-design.md§9 backlog tracks ~30 known follow-up items. None block v1 of CAS. The largest categories:cas-verify --deep), object-disk parts, convergent encryption,cas-fsck, parallelcas-verify, distributed locking via S3 conditional create (beyond the same-name race we already close), local-disk / NFS target.SweepOrphansspill-to-disk, streaming archive upload viaio.Pipe, heap-merge forshardIter, replaceColdListwith per-blobPutFileIfAbsent(architectural alternative).--log-format=jsonfor prune, populateBlobsTotal/OrphansHeldByGrace(already partial), per-blob progress logging.pkg/pidlockTOCTOU (whole-tool, not CAS-specific), probe-key cleanup on prune, S3IfNoneMatchstartup probe (already shipped — listed for clarity).cas-delete --force,wait_for_prunedefaults docs, etc.Known limitations (operator-facing)
In
docs/cas-operator-runbook.md"Known limitations (v1)":--tablespatterns arefilepath.Matchglob, not regexallow_unsafe_markersfor best-effort markersRELEASE.2024-11-07rejected by conditional-put probePhasing of the work in this branch
For maintainers who do want to see how the work landed: 8 implementation phases + 6 external review-feedback waves. Commit groups roughly correspond to:
PruneReportcounters,cfg.Validate()defensive guardswait_for_pruneknob + REST API wiringgit log --oneline master..HEADshows the full sequence; commit subjects are tagged with the relevant fix-id.