Checkpoint Store: S3 Backend Design #6554

rohitkulshreshtha · 2026-03-31T18:18:11Z

rohitkulshreshtha
Mar 31, 2026
Maintainer

⚠️ Draft — This is an early strawman design, not a spec. Everything here is up for discussion. Please poke holes.

Interface

Store Path

User provides: s3://checkpoints/my-job/

How a store maps to queries, tasks, and pipeline runs is still an open question. For now, the user provides this path and it identifies the store — but the relationship between store identity and execution concepts (e.g. does each run get its own path? do retries share a path?) needs more design work.

Key Layout

s3://checkpoints/my-job/
├── {checkpoint_id}/
│   ├── keys/
│   │   ├── 0000.ipc          # Arrow IPC — one per stage_keys() call
│   │   ├── 0001.ipc
│   │   └── ...
│   ├── files/
│   │   ├── 0000.bin          # Opaque blob — one per stage_files() call
│   │   └── ...
│   └── manifest.json          # Written by checkpoint() — presence = sealed

State Model

| State | How to detect |
| --- | --- |
| Staged | Directory exists (keys/files written), no `manifest.json` |
| Checkpointed | `manifest.json` exists with `"status": "checkpointed"` |
| Committed | `manifest.json` exists with `"status": "committed"` |
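
The detection rules in the table can be sketched as below. This is only an illustration: the store is modeled as a plain dict from object key to bytes, and `detect_state` and the `"missing"` result are hypothetical names, not part of the proposed interface.

```python
import json


def detect_state(objects: dict[str, bytes], checkpoint_id: str) -> str:
    """Classify a checkpoint by inspecting the keys under its prefix."""
    prefix = f"{checkpoint_id}/"
    manifest = objects.get(prefix + "manifest.json")
    if manifest is not None:
        # A manifest means the checkpoint is sealed; its status field
        # distinguishes "checkpointed" from "committed".
        return json.loads(manifest)["status"]
    if any(key.startswith(prefix) for key in objects):
        # Staged data without a manifest: an unsealed (possibly orphaned) entry.
        return "staged"
    return "missing"  # hypothetical: nothing was written under this ID


objects = {
    "a/keys/0000.ipc": b"...",
    "b/keys/0000.ipc": b"...",
    "b/manifest.json": json.dumps({"status": "checkpointed"}).encode(),
}
```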

Operations → S3 Calls

| Operation | S3 Call | Atomic? |
| --- | --- | --- |
| `stage_keys(id, series)` | `PutObject` to `{id}/keys/NNNN.ipc` | Yes (single object) |
| `stage_files(id, files)` | `PutObject` to `{id}/files/NNNN.bin` | Yes (single object) |
| `checkpoint(id)` | `PutObject` to `{id}/manifest.json` | Yes (single object) |
| `mark_committed(ids)` | `PutObject` to `{id}/manifest.json` (overwrite) | Yes per object, not across IDs |
| `get_checkpointed_keys()` | `ListObjectsV2` for `*/manifest.json`, then `GetObject` for `*/keys/*.ipc` | Multiple calls |
| `get_checkpointed_files()` | Same, but filter by `status: checkpointed` only | Multiple calls |
| `get_checkpoint(id)` | `GetObject` for `{id}/manifest.json` | Single call |
| `list_checkpoints()` | `ListObjectsV2` for `*/manifest.json` | Single call per page (up to 1,000 keys each) |

Why Manifest Instead of Directory Moves

S3 doesn't have atomic directory moves. A "move" is copy + delete per object — not atomic, race-prone.

Instead, the presence of manifest.json is the atomicity boundary. checkpoint() is a single PutObject — atomic. Readers ignore directories without a manifest (orphaned staged entries).
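
A minimal sketch of the manifest-as-boundary idea, assuming a dict stands in for the bucket. `put_object`, `checkpoint`, and `sealed_checkpoints` are hypothetical helpers for illustration; real code would issue `PutObject` and `ListObjectsV2` calls instead.

```python
import json

# Hypothetical stand-in for the bucket: object key -> body.
objects: dict[str, bytes] = {}


def put_object(key: str, body: bytes) -> None:
    """Models PutObject: the whole object appears (or is replaced) at once."""
    objects[key] = body


def checkpoint(checkpoint_id: str) -> None:
    """Seal a checkpoint with one write; staged objects are never moved."""
    manifest = {"checkpoint_id": checkpoint_id, "status": "checkpointed"}
    put_object(f"{checkpoint_id}/manifest.json", json.dumps(manifest).encode())


def sealed_checkpoints() -> list[str]:
    """Readers only consider IDs that have a manifest."""
    suffix = "/manifest.json"
    return sorted(key[: -len(suffix)] for key in objects if key.endswith(suffix))


put_object("ckpt-a/keys/0000.ipc", b"batch-0")
put_object("ckpt-a/files/0000.bin", b"file-metadata")
put_object("ckpt-b/keys/0000.ipc", b"batch-0")  # task died before sealing
checkpoint("ckpt-a")
```

Here `ckpt-b` has staged data but no manifest, so a reader that goes through `sealed_checkpoints()` never sees it.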

manifest.json

{
    "checkpoint_id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "checkpointed",
    "created_at": "2026-03-27T14:00:00Z",
    "sealed_at": "2026-03-27T14:00:05Z",
    "committed_at": null,
    "num_key_files": 3,
    "num_file_files": 1
}
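
One way the manifest could be modeled in code, shown as a sketch only. The `Manifest` dataclass and the `to_json` and `mark_committed` helpers are hypothetical, and the commit timestamps are made up for illustration; the field names are taken from the JSON above.

```python
import json
from dataclasses import asdict, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Manifest:
    checkpoint_id: str
    status: str  # "checkpointed" or "committed"
    created_at: str
    sealed_at: str
    committed_at: Optional[str]
    num_key_files: int
    num_file_files: int


def to_json(manifest: Manifest) -> str:
    return json.dumps(asdict(manifest), indent=4)


def mark_committed(manifest: Manifest, now: str) -> Manifest:
    # Keep the first commit timestamp so a repeated call returns an equal value.
    if manifest.status == "committed":
        return manifest
    return replace(manifest, status="committed", committed_at=now)


sealed = Manifest(
    checkpoint_id="550e8400-e29b-41d4-a716-446655440000",
    status="checkpointed",
    created_at="2026-03-27T14:00:00Z",
    sealed_at="2026-03-27T14:00:05Z",
    committed_at=None,
    num_key_files=3,
    num_file_files=1,
)
committed = mark_committed(sealed, "2026-03-27T14:05:00Z")  # illustrative time
```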

Serialization

Data	Format	Why
Keys (Series)	Arrow IPC	Native Daft format. `RecordBatch::to_ipc_stream()` already exists. Language-interoperable.
FileMetadata	Raw bytes (`.bin`)	Already an opaque `Vec<u8>` blob, so it can be written directly.
Manifest	JSON	Human-readable, debuggable, small.

Read Path (get_checkpointed_keys)

1. ListObjectsV2 with prefix s3://checkpoints/my-job/ and delimiter / to find checkpoint IDs.
2. For each ID, GetObject on manifest.json and check that status is checkpointed or committed.
3. For matching checkpoints, ListObjectsV2 with prefix {id}/keys/ to find the .ipc files (ListObjectsV2 filters by prefix only, not by glob).
4. Stream GetObject for each IPC file → deserialize to Series.

Optimization: Could cache manifest statuses to avoid re-reading on every call.
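
The read path above can be sketched against an in-memory dict. This is a rough illustration, not the implementation: `list_common_prefixes` is a hypothetical emulation of ListObjectsV2 with a delimiter, and Arrow IPC deserialization is left out, so the function returns raw payload bytes.

```python
import json


def list_common_prefixes(objects: dict[str, bytes], prefix: str) -> list[str]:
    """Emulate ListObjectsV2 with delimiter '/': first segment after prefix."""
    found = set()
    for key in objects:
        if key.startswith(prefix):
            rest = key[len(prefix):]
            if "/" in rest:
                found.add(rest.split("/", 1)[0])
    return sorted(found)


def get_checkpointed_keys(objects: dict[str, bytes], base: str) -> list[bytes]:
    payloads = []
    for checkpoint_id in list_common_prefixes(objects, base):  # step 1
        manifest = objects.get(f"{base}{checkpoint_id}/manifest.json")
        if manifest is None:
            continue  # staged but never sealed: ignore
        status = json.loads(manifest)["status"]  # step 2
        if status not in ("checkpointed", "committed"):
            continue
        keys_prefix = f"{base}{checkpoint_id}/keys/"  # step 3
        ipc_keys = sorted(k for k in objects if k.startswith(keys_prefix))
        payloads.extend(objects[k] for k in ipc_keys)  # step 4 (no decoding)
    return payloads


objects = {
    "my-job/a/keys/0000.ipc": b"a0",
    "my-job/a/keys/0001.ipc": b"a1",
    "my-job/a/manifest.json": json.dumps({"status": "committed"}).encode(),
    "my-job/b/keys/0000.ipc": b"b0",  # no manifest: orphaned staged entry
    "my-job/c/keys/0000.ipc": b"c0",
    "my-job/c/manifest.json": json.dumps({"status": "checkpointed"}).encode(),
}
```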

Idempotency

checkpoint(): overwriting manifest.json with the same content is a no-op. Idempotent.
mark_committed(): overwriting manifest.json with the updated status. Idempotent.
stage_keys() / stage_files(): append new files with incrementing names. Each call must first check that manifest.json doesn't exist, and return AlreadySealed if it does.
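
A sketch of the sealed check for stage_keys(), again with a dict standing in for S3. Deriving the next file number by counting existing objects is an assumption made for illustration. Note also that on real S3 the existence check and the write are two separate requests, so this single-threaded sketch says nothing about concurrent writers.

```python
class AlreadySealed(Exception):
    """Raised when staging into a checkpoint that already has a manifest."""


def stage_keys(objects: dict[str, bytes], checkpoint_id: str, payload: bytes) -> str:
    if f"{checkpoint_id}/manifest.json" in objects:
        raise AlreadySealed(checkpoint_id)
    prefix = f"{checkpoint_id}/keys/"
    # Assumption: the next index is the number of files already staged.
    index = sum(1 for key in objects if key.startswith(prefix))
    key = f"{prefix}{index:04d}.ipc"
    objects[key] = payload
    return key


objects: dict[str, bytes] = {}
first = stage_keys(objects, "ckpt-1", b"batch-0")
second = stage_keys(objects, "ckpt-1", b"batch-1")
objects["ckpt-1/manifest.json"] = b'{"status": "checkpointed"}'  # seal

try:
    stage_keys(objects, "ckpt-1", b"batch-2")
    rejected = False
except AlreadySealed:
    rejected = True
```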

Open Questions

Cleanup — how to delete orphaned staged entries? TTL via S3 lifecycle rules? Explicit cleanup_older_than() method?
Delta Lake as backend — A Delta table could BE the checkpoint store. _delta_log handles atomicity, versioning, and listing. txn action gives idempotency. No raw S3 key layout needed. Lower priority now that S3 is confirmed first, but worth revisiting.
Listing cost — ListObjectsV2 returns 1000 objects per call. Number of checkpoints = number of tasks in the pipeline. We don't know the scale for our customers (could be hundreds or millions of input files). Options if listing becomes expensive: (a) root-level index file (one GetObject instead of listing, but sync issues), (b) prefix-based partitioning (checkpointed/, committed/), (c) just paginate. Defer until we know the scale.
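
For option (c), a pagination loop might look like the sketch below. `list_page` is a hypothetical stand-in for one ListObjectsV2 request that uses the last returned key as its continuation token; the real API returns an opaque token, and the 2,500-checkpoint figure is made up to show the call count.

```python
from typing import Iterator, Optional

PAGE_SIZE = 1000  # ListObjectsV2 returns at most 1,000 keys per response
calls = 0


def list_page(
    keys: list[str], prefix: str, token: Optional[str]
) -> tuple[list[str], Optional[str]]:
    """Hypothetical stand-in for a single ListObjectsV2 request."""
    global calls
    calls += 1
    matching = [
        k for k in sorted(keys)
        if k.startswith(prefix) and (token is None or k > token)
    ]
    page = matching[:PAGE_SIZE]
    # Only hand back a token when more keys remain after this page.
    next_token = page[-1] if len(matching) > PAGE_SIZE else None
    return page, next_token


def list_all(keys: list[str], prefix: str) -> Iterator[str]:
    token = None
    while True:
        page, token = list_page(keys, prefix, token)
        yield from page
        if token is None:
            return


keys = [f"my-job/{i:05d}/manifest.json" for i in range(2500)]
listed = list(list_all(keys, "my-job/"))
```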

rohitkulshreshtha · 2026-03-31T18:18:29Z

rohitkulshreshtha
Mar 31, 2026
Maintainer Author

@chenghuichen @everySympathy

chenghuichen · 2026-04-01T17:19:46Z

The current design for mark_committed issues one PutObject per checkpoint ID, introducing partial failure risk. As noted in #6446, the window between a successful catalog commit and mark_committed() is already the hardest failure scenario to reason about — partial success here makes it significantly worse: some IDs appear committed, others don't, with no clear recovery path.

A simpler approach: instead of updating each {id}/manifest.json, write a single new file per mark_committed() call:

committed_log/{uuid}.json → { "committed_checkpoint_ids": ["<checkpoint-uuid-1>", "<checkpoint-uuid-2>", ...] }

This turns mark_committed(ids) into a single atomic PutObject on a new object — S3 guarantees atomicity for new-object writes. Crash before the write: no IDs marked. Crash after: all IDs marked. No partial state. Reading committed status means scanning the committed_log/ prefix and taking the union, which only happens at re-run time, not on the hot path.

This is essentially the same append-only log pattern that Delta Lake's transaction log uses, and for the same reasons.
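
A rough sketch of the committed_log idea, with a dict standing in for S3. The function names mirror the discussion, but the bodies are illustrative assumptions only.

```python
import json
import uuid


def mark_committed(objects: dict[str, bytes], checkpoint_ids: list[str]) -> str:
    """Record a whole batch of IDs with a single write to a fresh key."""
    key = f"committed_log/{uuid.uuid4()}.json"
    body = json.dumps({"committed_checkpoint_ids": checkpoint_ids}).encode()
    objects[key] = body  # models one PutObject on a new object
    return key


def committed_ids(objects: dict[str, bytes]) -> set[str]:
    """Union of every log entry under committed_log/."""
    result: set[str] = set()
    for key, body in objects.items():
        if key.startswith("committed_log/"):
            result.update(json.loads(body)["committed_checkpoint_ids"])
    return result


objects: dict[str, bytes] = {}
mark_committed(objects, ["id-1", "id-2"])
mark_committed(objects, ["id-2", "id-3"])  # an overlapping retry is harmless
```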

1 reply

rohitkulshreshtha Apr 8, 2026
Maintainer Author

I like the thinking here, but I want to push back on the premise.

No amount of wrapping a non-idempotent catalog commit with checkpoint store logic is going to make it idempotent. If Iceberg/Delta commits produce duplicate rows on retry, tracking whether we called commit more carefully doesn't help — the damage happens at the catalog level. That has to be fixed at the catalog level.

Both formats have native mechanisms for this:

Iceberg: the txnId snapshot summary property. If a commit carries the same txn ID as an existing snapshot, the catalog rejects it as a duplicate — no new snapshot, no duplicate rows.
Delta Lake: the txn action (SetTransaction) with an app ID and version. Delta recognizes duplicate commits with the same (appId, version) pair and no-ops them.

If we commit to using these (which we should — it's the right fix regardless of checkpointing), then mark_committed partial failure becomes harmless. You just retry the ones that didn't update. No inconsistency, no data loss.
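
The "retry the ones that didn't update" step could look like this sketch. The `put` callback, the simulated failure, and the return value of `mark_committed` are all assumptions for illustration; nothing here touches a real catalog.

```python
import json


def mark_committed(objects: dict, checkpoint_ids: list, put) -> list:
    """Try each manifest once; return the IDs whose write failed."""
    failed = []
    for cid in checkpoint_ids:
        key = f"{cid}/manifest.json"
        manifest = json.loads(objects[key])
        if manifest["status"] == "committed":
            continue  # already updated by an earlier attempt
        manifest["status"] = "committed"
        try:
            put(key, json.dumps(manifest).encode())
        except OSError:
            failed.append(cid)
    return failed


objects = {
    f"{cid}/manifest.json": json.dumps({"status": "checkpointed"}).encode()
    for cid in ("a", "b", "c")
}
fail_once = {"b/manifest.json"}


def flaky_put(key: str, body: bytes) -> None:
    if key in fail_once:
        fail_once.discard(key)
        raise OSError("simulated transient S3 error")
    objects[key] = body


first_failed = mark_committed(objects, ["a", "b", "c"], flaky_put)
second_failed = mark_committed(objects, first_failed, flaky_put)
statuses = {k.split("/")[0]: json.loads(v)["status"] for k, v in objects.items()}
```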

Given that, I think the simpler design from the original doc works: mark_committed overwrites each manifest.json with "status": "committed". One PutObject per checkpoint, idempotent, no extra committed_log/ directory to manage. Thoughts?

chenghuichen · 2026-04-01T17:33:45Z

One scenario not covered in the current design: what happens when the checkpoint store itself becomes unavailable?

My thinking: checkpoint store availability shouldn't be on the critical path of task execution. A checkpoint is a best-effort durability enhancement, not a correctness requirement — if the S3 backend becomes inaccessible due to credential expiry, account suspension, or a service-side outage, the task should still complete normally.

The real concern with naive retry logic isn't that retries happen, but that checkpoint operations are frequent throughout a job's lifetime. If each failure triggers multiple retries, the cumulative latency compounds across the entire run — a side-channel concern starts meaningfully degrading end-to-end job performance.

A circuit-breaker approach seems more appropriate: after a configurable number of consecutive failures, the store silently degrades — subsequent stage_keys, stage_files, and checkpoint() calls become no-ops, with a warning logged. The only consequence is that this run isn't checkpointed; on re-run, the full job executes again.
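
A minimal sketch of such a circuit breaker, assuming a wrapper around whatever store object is in use. `CircuitBreakerStore` and `BrokenStore` are hypothetical names, and swallowing failures that occur before the threshold is reached is also an assumption; stage_files would be wrapped the same way.

```python
import logging

logger = logging.getLogger("checkpoint_store")


class CircuitBreakerStore:
    """Hypothetical wrapper: store calls become no-ops after repeated failures."""

    def __init__(self, inner, max_consecutive_failures: int = 3) -> None:
        self.inner = inner
        self.max_failures = max_consecutive_failures
        self.consecutive_failures = 0
        self.open = False

    def _call(self, name: str, *args):
        if self.open:
            return None  # degraded: skip the store entirely
        try:
            result = getattr(self.inner, name)(*args)
        except Exception as exc:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.max_failures:
                self.open = True
                logger.warning("checkpoint store disabled for this run: %s", exc)
            return None
        self.consecutive_failures = 0  # any success resets the counter
        return result

    def stage_keys(self, checkpoint_id, series):
        return self._call("stage_keys", checkpoint_id, series)

    def checkpoint(self, checkpoint_id):
        return self._call("checkpoint", checkpoint_id)


class BrokenStore:
    """Simulates an unreachable backend and counts attempted calls."""

    def __init__(self) -> None:
        self.calls = 0

    def stage_keys(self, checkpoint_id, series):
        self.calls += 1
        raise ConnectionError("S3 unreachable")

    def checkpoint(self, checkpoint_id):
        self.calls += 1
        raise ConnectionError("S3 unreachable")


broken = BrokenStore()
store = CircuitBreakerStore(broken, max_consecutive_failures=3)
for _ in range(10):
    store.stage_keys("ckpt", b"keys")
    store.checkpoint("ckpt")
```

With the threshold at 3, the backend is only attempted three times across the twenty calls; the rest return immediately.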

1 reply

rohitkulshreshtha Apr 8, 2026
Maintainer Author

Good thinking, but I'd prefer to wait until this surfaces in practice before designing for it. In most cases the checkpoint store and the sink share the same S3 infrastructure — if S3 is unreachable, the job is failing regardless. Let's keep things simple for now and add resilience patterns when we have real failure cases to design against.

chenghuichen · 2026-04-01T19:14:49Z

My thinking on how a store maps to tasks and runs: the same base path should be reusable across re-runs — creating a new path per run defeats the purpose of cross-run progress tracking. But simple reuse is tricky, because partition boundaries aren't stable across runs (as discussed in #6446).

A snapshot-based approach avoids this problem. The store path stays fixed; each run reads from snapshot N and writes completions to snapshot N+1. On re-run, all keys from snapshot N are loaded as a flat set and applied as a filter against source data — bypassing partition structure entirely. Partitions are regenerated fresh each run; the key filter doesn't care how they're arranged.
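
The flat-set filter can be sketched as follows. The snapshot is modeled as a dict from checkpoint ID to completed keys, and the file names are invented for illustration.

```python
def load_completed_keys(snapshot: dict[str, list[str]]) -> set[str]:
    """Flatten every checkpoint's keys in snapshot N into one set."""
    completed: set[str] = set()
    for keys in snapshot.values():
        completed.update(keys)
    return completed


def remaining_inputs(source: list[str], completed: set[str]) -> list[str]:
    """Drop already-processed inputs, regardless of how they were partitioned."""
    return [item for item in source if item not in completed]


snapshot_n = {"ckpt-1": ["a.parquet", "b.parquet"], "ckpt-2": ["d.parquet"]}
source = ["a.parquet", "b.parquet", "c.parquet", "d.parquet", "e.parquet"]
todo = remaining_inputs(source, load_completed_keys(snapshot_n))
```

Because the filter only looks at keys, it gives the same result however the new run chooses to partition `source`.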

Another property this gives:

Cross-job collision is safe — if two unrelated jobs accidentally share the same base path, they race for snapshot IDs rather than corrupting each other's data; the job holding the latest snapshot ID stays correct, the other wastes work but causes no harm

The trade-off worth flagging: the snapshot approach accepts extra write/read amplification in exchange for simplicity, with no partition matching required and no determinism assumptions needed.

s3://checkpoints/my-job/{snapshot_id}/
├── {checkpoint_id}/
│   ├── keys/          # file paths (file-based sources) or row keys (on= specified)
│   │   ├── 0000.ipc
│   │   └── ...
│   ├── files/
│   │   ├── 0000.bin
│   │   └── ...
│   └── manifest.json         # Written by checkpoint() — presence = sealed
└── committed_log/
    ├── {uuid}.json            # { "committed_checkpoint_ids": ["<id1>", "<id2>", ...] }
    └── ...                    # One file per mark_committed() call — union on read

1 reply

rohitkulshreshtha Apr 8, 2026
Maintainer Author

Thanks for thinking about this. Store-to-run mapping is something we're still working through internally — it touches orchestration concerns that aren't fully settled yet. I'd prefer to keep the S3 backend focused on the core checkpoint lifecycle for now and tackle store identity/scoping as a separate design discussion once we have more clarity. The flat layout is sufficient for the immediate use cases.

chenghuichen · 2026-04-03T10:07:16Z

A few thoughts after implementing the S3 backend:

On orphaned staged entries

They matter less than they might seem. Checkpoint data only has value until the next successful job run — at which point all checkpoint data (orphaned or not) can be deleted together. Orphaned entries don't need special handling; they become irrelevant at the same time as the rest.

On TTL via S3 lifecycle rules

I'd avoid this as a built-in approach: applying lifecycle rules requires s3:PutLifecycleConfiguration (a bucket-admin permission) that many users won't have, which would block them from using the store entirely. For users who do have access and prefer time-based expiry, configuring a lifecycle rule on the checkpoint prefix out-of-band is a reasonable complement to Manual mode — worth calling out in the docs as a best-practice option.

Proposed design: CleanupPolicy

Two policies cover the main use cases:

Manual (default) — no automatic cleanup, conservative, no risk of accidental data loss. The operator deletes the prefix when no longer needed, either manually or via orchestration. The doc should note that an S3 lifecycle rule on the prefix is also a valid complement here.
DeleteAfterSuccess — after the entire job completes successfully, the upper layer calls cleanup() on the store, which deletes the entire prefix. Safe because checkpoint data loses its value the moment the job is done.

An explicit cleanup_older_than() API is also reasonable, but I think the two-policy model covers the main use cases more cleanly and avoids the "when is older-than appropriate?" question that has no general answer.
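
As a sketch of the two-policy model: the enum values, `on_job_success`, and the dict-backed `cleanup` below are hypothetical and only illustrate the proposal.

```python
from enum import Enum


class CleanupPolicy(Enum):
    MANUAL = "manual"
    DELETE_AFTER_SUCCESS = "delete_after_success"


def cleanup(objects: dict[str, bytes], prefix: str) -> int:
    """Delete every object under the store prefix; return how many were removed."""
    doomed = [key for key in objects if key.startswith(prefix)]
    for key in doomed:
        del objects[key]
    return len(doomed)


def on_job_success(objects: dict[str, bytes], prefix: str, policy: CleanupPolicy) -> int:
    """Called by the upper layer once the whole job has completed successfully."""
    if policy is CleanupPolicy.DELETE_AFTER_SUCCESS:
        return cleanup(objects, prefix)
    return 0  # MANUAL: leave everything for the operator


objects = {
    "my-job/a/keys/0000.ipc": b"a0",
    "my-job/a/manifest.json": b"{}",
    "other-job/x/manifest.json": b"{}",
}
kept = on_job_success(objects, "my-job/", CleanupPolicy.MANUAL)
removed = on_job_success(objects, "my-job/", CleanupPolicy.DELETE_AFTER_SUCCESS)
```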

1 reply

rohitkulshreshtha Apr 8, 2026
Maintainer Author

Agree on the orphaned entries point — they're harmless and get cleaned up with everything else.

I'd prefer to not include CleanupPolicy or DeleteAfterSuccess in this PR, even as dead code. Let's keep the scope to the core lifecycle.

chenghuichen · 2026-04-03T10:16:26Z

Initial S3 backend implementation is up in #6599 — feedback and questions welcome.

rohitkulshreshtha · 2026-04-08T18:15:09Z

rohitkulshreshtha
Apr 8, 2026
Maintainer Author

Thanks for putting this together, chenghuichen — really solid work, and I appreciate both the implementation and the design discussion on #6554. I've left some comments inline. The core structure is good; most of my feedback is about simplifying scope so we can land the foundation and iterate from there.

rohitkulshreshtha · 2026-04-08T18:25:05Z

rohitkulshreshtha
Apr 8, 2026
Maintainer Author

Just a thought: one nice side effect of this design is that, since the checkpoint store tracks both source keys and output file metadata, we effectively have a 2PC layer that works for any sink, including plain Parquet. The checkpoint store knows which files were produced by each task. After all tasks complete, the head node can treat the checkpointed file list as the authoritative output and clean up any orphaned files from failed/partial tasks. Parquet writes become atomic without needing a catalog.
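
The cleanup step described here amounts to a set difference, sketched below. The per-task file lists and the output listing are invented inputs; how the head node would actually obtain them is not specified in this thread.

```python
def authoritative_files(checkpointed: dict[str, list[str]]) -> set[str]:
    """Union of the output files recorded by every checkpointed task."""
    files: set[str] = set()
    for task_files in checkpointed.values():
        files.update(task_files)
    return files


def find_orphans(output_listing: list[str], checkpointed: dict[str, list[str]]) -> list[str]:
    """Files present in the sink that no checkpointed task claims."""
    keep = authoritative_files(checkpointed)
    return sorted(path for path in output_listing if path not in keep)


checkpointed = {
    "task-1": ["out/part-0.parquet"],
    "task-2": ["out/part-1.parquet", "out/part-2.parquet"],
}
output_listing = [
    "out/part-0.parquet",
    "out/part-1.parquet",
    "out/part-2.parquet",
    "out/part-3.parquet",  # written by a task that failed before checkpoint()
]
orphans = find_orphans(output_listing, checkpointed)
```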

chenghuichen · 2026-04-09T10:58:56Z

@rohitkulshreshtha Thanks for the thorough review! I agree on keeping the S3 backend focused — I've updated the PR accordingly (dropped CleanupPolicy and DeleteAfterSuccess).

On idempotency: you're right that with txnId, partial failure in mark_committed is harmless — the uncommitted IDs can simply be retried, and the catalog-level mechanism prevents duplicate data. My original concern was more about cost and reliability: updating N manifests requires 2N S3 requests (N reads + N writes), and the failure surface grows with N. That said, this is negligible relative to the rest of the checkpoint lifecycle, and the simpler per-manifest layout is easier to reason about and inspect. Point taken.


Replies: 9 comments · 4 replies
