Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sinks] Generate updated as_of in rehydrating client #14785

Merged
merged 1 commit into from Sep 14, 2022

Conversation

cjubb39
Copy link
Contributor

@cjubb39 cjubb39 commented Sep 12, 2022

Previously, we always used the same as_of when the storaged instance was rehydrated. This meant that the following sequence would result in a panic:

  • CREATE SINK FROM SOURCE s (which has as_of t_source)
  • We create the sink with AS OF now() = t_sink > t_source
  • the persistent collection for source s is compacted so that t_source_1 > t_sink`
  • the storaged process crashes and is rehydrated
  • we recreate the sink using the cached command that uses t_sink
  • This panics when trying to read from the persist collection for the source s

We handled the case when the source was recreated on restart of environmentd by looking at the current since of the source in the storage controller. However, in this case, the storage controller doesn't instruct the client to restart -- the rehydrating task does. So we simply move the logic down a layer

Motivation

Fixes #14555

Checklist

@cjubb39 cjubb39 marked this pull request as ready for review September 13, 2022 21:44
@cjubb39
Copy link
Contributor Author

cjubb39 commented Sep 13, 2022

This is ready for review! I was not able to reliably reproduce the panic on main -- but I also haven't been able to reproduce it with this branch. That being said: we are pretty convinced -- just by inspection -- that the current code is wrong.

Copy link
Contributor

@aljoscha aljoscha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should work. But I had a concern about race conditions. Might be that we have to merge as is.

In any case, this was some nice sleuthing! 😊

// The controller has the dependency recorded in it's `exported_collections` so this
// should not change at least until the sink is started up (because the storage
// controller will not downgrade the source's since).
let from_since = from_read_handle.since();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I look at this I'm a bit concerned about races. The following might happen:

  1. controller holds a read handle for the implied since t
  2. the actual since of the shard is t - 10 because someone else is still holding a handle
  3. we set the as_of to t - 10 because of this
  4. that third party released their hold, shard since advances to t
  5. the sink, in storaged tries to read and fails

I think 1. and 2. can't currently happen together, because we know that the controller initializes the implied frontier to the shard since. But we might change that in the future, maybe by accident.

It would be nicer if we can thread through the exact since that the controller is holding, but that might not be easily feasible. In that case we should probably merge as is. Tricky ... 🙈

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the correctness here is slightly more embedded than you describe: one of the first things the storage controller does is install a dependency between the from source / table / etc and this newly created sink. That will keep it from updating the read capability of the source

so, while someone else can definitely downgrade their handle, the handle managed by the storage controller should keep the since for the collection itself from being downgraded

@@ -183,6 +185,39 @@ where
.await;
}

for export in self.exports.values_mut() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rehydration logic is also used when the controller restarts? I'm asking because we removed the logic from there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is, that's correct!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Killing the Kafka sink can cause it to permanently hang
2 participants