What happens if two writers to object storage run at the same time? #11

Closed
jellevandenhooff opened this issue Mar 15, 2024 · 8 comments

@jellevandenhooff
Contributor

What happens if two Sunlight instances writing to object storage run at the same time? Only one instance will be able to advance the true head of the tree using the signed tree head compare-and-swap. But what happens to the (partial) tiles they write? It seems to me that two writers might overwrite each other's files.

@FiloSottile
Owner

Indeed, the losing writer might still overwrite the data tile of the winning writer. This is bad, but as long as the old data can be recovered, it's not fatal (unlike signing two incompatible views of the tree).

For S3, we recommend running with object versioning enabled. I forgot to mention this in the README. For backends that support it (currently only Tigris), we use If-Match headers to avoid overwriting. See 1046953.
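For illustration only, here is a minimal sketch of an HTTP conditional write against an S3-compatible endpoint. Request signing is omitted, the helper name is hypothetical, and the actual backend code in 1046953 differs in its details; this just shows how a backend that honors conditional headers can refuse to overwrite an existing object.

```go
// Minimal sketch of a conditional write to an S3-compatible endpoint.
// Request signing is omitted and the helper name is hypothetical.
package objstore

import (
	"bytes"
	"fmt"
	"net/http"
)

// putIfAbsent uploads data to url, asking the server to fail with
// 412 Precondition Failed if an object already exists at that key.
func putIfAbsent(url string, data []byte) error {
	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(data))
	if err != nil {
		return err
	}
	req.Header.Set("If-None-Match", "*")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusOK:
		return nil
	case http.StatusPreconditionFailed:
		return fmt.Errorf("refusing to overwrite existing object at %s", url)
	default:
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
}
```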

@jellevandenhooff
Contributor Author

jellevandenhooff commented Mar 16, 2024

I have a hard time coming up with a scenario where two writers end up running at the same time, so for that rare case relying on manual recovery sounds reasonable.

Perhaps more plausible is the single writer crashing and then later restarting. In that case the If-Match header and the required manual intervention could be annoying.

I played around a bit with making the commit protocol atomic: First stage all the tiles, then update the lock, and then upload the staged tiles. If the checkpoint doesn't match the lock, the recovery mechanism (re-)uploads the staged files. This mechanism makes the commit protocol a bit more expensive and adds some complexity, but perhaps it is worth the peace of mind?

One approach puts all tiles in a single tar file (jellevandenhooff#1) and one uses individual objects (jellevandenhooff#2).
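As a rough illustration of the tar-based variant (not the code in jellevandenhooff#1; tile paths and helper names are hypothetical), the staging step might bundle the pending tiles into a single object like this:

```go
// Sketch of bundling pending tiles into one staged tar object that can be
// uploaded under a unique staging key before the lock is updated.
// Purely illustrative.
package staging

import (
	"archive/tar"
	"bytes"
)

// bundleTiles serializes each pending tile into a single tar archive.
func bundleTiles(tiles map[string][]byte) ([]byte, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	for path, data := range tiles {
		hdr := &tar.Header{
			Name: path,
			Mode: 0o644,
			Size: int64(len(data)),
		}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write(data); err != nil {
			return nil, err
		}
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}
```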

@FiloSottile
Owner

FiloSottile commented Mar 16, 2024 via email

@jellevandenhooff
Contributor Author

jellevandenhooff commented Mar 17, 2024

I can think of two problematic recovery scenarios:

#12: A sequencing failure after uploading a tile but before updating the lock can cause the log to get stuck:

  • Sequencing starts.
  • The sequencer writes tile file "A".
  • The sequencer fails to update the lock.
  • Sequencing restarts.
  • The sequencer tries to write tile file "A".
  • The sequencer fails because tile file "A" already exists.

#13: A sequencing failure with a delayed earlier write can cause the log to get stuck:

  • Sequencing starts.
  • The sequencer starts writing tile file "A" with contents "A1" (asynchronously).
  • The sequencer times out.
  • Sequencing restarts.
  • The sequencer tries to write tile file "A" with contents "A2" and succeeds.
  • The earlier write to "A" succeeds with contents "A1" replacing "A2".
  • The sequencer updates the lock to a tree whose hash expects "A" to contain "A2".
  • The log is in a bad state because "A" contains "A1" but should contain "A2".
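To make #13 concrete, here is a toy simulation (an in-memory store standing in for object storage; all names hypothetical) of a delayed write landing after a newer one when the store simply applies writes in arrival order:

```go
// Toy simulation of scenario #13: a delayed ("zombie") write of "A1" arrives
// after the retried write of "A2" and clobbers it.
package main

import (
	"fmt"
	"sync"
	"time"
)

type store struct {
	mu      sync.Mutex
	objects map[string][]byte
}

func (s *store) put(key string, data []byte) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.objects[key] = data
}

func (s *store) get(key string) []byte {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.objects[key]
}

func main() {
	s := &store{objects: map[string][]byte{}}
	var wg sync.WaitGroup

	// First sequencing attempt: the write of "A1" is stuck in flight.
	wg.Add(1)
	go func() {
		defer wg.Done()
		time.Sleep(100 * time.Millisecond)
		s.put("A", []byte("A1")) // the zombie write lands late
	}()

	// The sequencer times out, restarts, writes "A2" successfully, and then
	// updates the lock to a tree that expects "A" to contain "A2".
	s.put("A", []byte("A2"))

	wg.Wait()
	fmt.Printf("A = %s (the lock expects A2)\n", s.get("A")) // prints A1
}
```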

@FiloSottile
Owner

Thank you for elaborating, this is very useful!

#12 is definitely something I overlooked while adding the If-Match, thank you.

#13 is something I worried about and it's part of why the compare-and-swap mechanism exists. From the design doc:

The compare-and-swap backend also helps relax the criticality of the object storage consistency guarantees. For example, Amazon recommends hedging requests to reduce tail latency: if a request is taking too long, fire off a second identical request in parallel, and cancel one when the other succeeds. This is fine if cancellation is effective, but what if it isn’t? Then the “losing” request might succeed later, in theory even after a subsequent checkpoint was uploaded, rolling it back. (Or, more mundanely, operational error might lead to recovering object storage from a backup.) If the node were to restart during this rolled back state, the tree would fork if it didn’t have the compare-and-swap backend to load the correct STH from.

However, I also think it's fundamentally unfixable: if a request that timed out in the past is to be considered concurrent with all future ones, any object might roll back at any point in the future.

The good news is that, on a closer reading of the S3 Consistency Model, I think this is actually not allowed to happen.

Amazon S3 internally uses last-writer-wins semantics to determine which write takes precedence. However, the order in which Amazon S3 receives the requests and the order in which applications receive acknowledgments cannot be predicted because of various factors, such as network latency.

https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html#ConsistencyModel

In our case, the request either timed out before reaching S3, in which case it can't succeed, or after, in which case it has an earlier internal timestamp and will lose to A2.

I think #13 is either impossible or incredibly rare, and difficult to protect against, so manual recovery is ok.

@jellevandenhooff
Contributor Author

jellevandenhooff commented Mar 17, 2024 via email

@FiloSottile
Owner

A delay anywhere in the path between PutObject in the SDK and the final write internally in S3 could cause the behavior in the test, I think, and the timeouts and sequencing interval are small enough (5 seconds, 1 second) that I could see this race happening.

I don't think that's right; the test violates last-writer-wins semantics.

The S3 docs correctly point out that usually you can't rely on those semantics, because you don't know whether a request was delayed before or after it reached S3. But in our case we know that either it reached S3 before the next request (in which case it will lose last-writer-wins) or it never reached S3 (in which case it can't succeed). That's because we never sequence concurrently, so we stop sending a request before starting a new one.

Anecdotally, I have seen S3 latency do weird things, and the AWS documentation does mention retrying after latency of either 2 or 4 seconds.

Indeed, the Backend implementation races a new upload after 75ms, and that helped tail latency quite a bit.

I do not agree the problem is unsolvable: Changing the commit protocol to not write the immutable files until the lock has been updated would prevent this problem, at the cost of some complexity and more S3 operations. Perhaps still worth it? I don’t know how sensitive to the extra cost or complexity operators will be.

I assume you mean

  • A. update the checkpoint in the LockBackend
  • B. upload the tiles to the Backend
  • C. upload the checkpoint to the Backend

If a crash happens between A and B we'll have to roll back the LockBackend, which means an inconsistent tree was signed, even if not published, which I am uncomfortable about.

I also don't think it solves the problem (assuming it exists). Imagine a crash that happens due to a timeout during B. The old tile upload becomes a zombie that happens to succeed 10s later. The writer restarts, discards the new checkpoint in the LockBackend (because it doesn't have all the tiles it needs to resume from there), sequences a new set of tiles, and then the zombie succeeds and overwrites them.
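As a side note on the 75ms upload race mentioned earlier in this comment: a rough sketch of that kind of request hedging (hypothetical names, not the actual Backend code) could look like this:

```go
// Rough sketch of request hedging: if the first attempt has not completed
// within a short delay, start an identical second attempt and return
// whichever finishes first, cancelling the other. Hypothetical names.
package hedging

import (
	"context"
	"time"
)

func hedged(ctx context.Context, attempt func(context.Context) error) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancel whichever attempt is still in flight

	errs := make(chan error, 2)
	go func() { errs <- attempt(ctx) }()

	select {
	case err := <-errs:
		return err
	case <-time.After(75 * time.Millisecond):
		// The first attempt is slow: race a second identical one.
		go func() { errs <- attempt(ctx) }()
		return <-errs
	}
}
```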

@jellevandenhooff
Contributor Author

The S3 docs correctly point out that usually you can't rely on those semantics, because you don't know whether a request was delayed before or after it reached S3. But in our case we know that either it reached S3 before the next request (in which case it will lose last-writer-wins) or it never reached S3 (in which case it can't succeed). That's because we never sequence concurrently, so we stop sending a request before starting a new one.

How do you know it reached S3? What if there was a delay on the network?

I assume you mean

  • A. update the checkpoint in the LockBackend
  • B. upload the tiles to the Backend
  • C. upload the checkpoint to the Backend

If a crash happens between A and B we'll have to roll back the LockBackend, which means an inconsistent tree was signed, even if not published, which I am uncomfortable about.

Ah! Sorry, I meant something like jellevandenhooff#1 or jellevandenhooff#2:

  1. Upload the tiles to a unique staging area.
  2. Update the checkpoint in the lock backend.
  3. Upload the staged tiles to the backend.
  4. Upload the checkpoint to the backend.

By staging the tiles and then updating the checkpoint, all uploads afterwards are deterministic and can be safely retried. Any delayed or zombie writes are perfectly fine: all writes to a given backend file are guaranteed to be the same. The signing happens just the same as in the code today, only after uploading the tiles and before publishing the checkpoint.
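A minimal sketch of that four-step flow, with hypothetical Backend/LockBackend interfaces standing in for Sunlight's real types:

```go
// Sketch of the staged commit protocol described above. The interfaces and
// method names are hypothetical stand-ins, not Sunlight's actual types.
package commitproto

import "context"

type Backend interface {
	Upload(ctx context.Context, key string, data []byte) error
}

type LockBackend interface {
	// Replace atomically swaps old for new (compare-and-swap), failing if
	// another writer advanced the checkpoint first.
	Replace(ctx context.Context, old, new []byte) error
}

func commit(ctx context.Context, b Backend, l LockBackend,
	stagingKey string, tiles map[string][]byte, oldCheckpoint, newCheckpoint []byte) error {

	// 1. Upload the tiles to a unique staging area. The staging key is
	//    derived from the new tree, so a retry writes exactly the same bytes.
	for path, data := range tiles {
		if err := b.Upload(ctx, stagingKey+"/"+path, data); err != nil {
			return err
		}
	}

	// 2. Update the checkpoint in the lock backend. Once this CAS succeeds
	//    the commit is decided; every step below is deterministic and can be
	//    safely retried after a crash by re-reading the lock.
	if err := l.Replace(ctx, oldCheckpoint, newCheckpoint); err != nil {
		return err
	}

	// 3. Upload the staged tiles to the backend. Delayed or duplicate writes
	//    are harmless because every write to a given key carries the same bytes.
	for path, data := range tiles {
		if err := b.Upload(ctx, path, data); err != nil {
			return err
		}
	}

	// 4. Publish the checkpoint to the backend.
	return b.Upload(ctx, "checkpoint", newCheckpoint)
}
```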

FiloSottile added a commit that referenced this issue Aug 4, 2024
Based on a design by @jellevandenhooff at jellevandenhooff#1.

Fixes #11

Co-authored-by: Jelle van den Hooff <jelle@vandenhooff.name>
FiloSottile added a commit that referenced this issue Aug 7, 2024
Based on a design by @jellevandenhooff at jellevandenhooff#1.

Fixes #11

Co-authored-by: Jelle van den Hooff <jelle@vandenhooff.name>