[sinks] Generate updated as_of in rehydrating client #14785

cjubb39 · 2022-09-12T19:27:19Z

Previously, we always used the same as_of when the storaged instance was rehydrated. This meant that the following sequence would result in a panic:

1. CREATE SINK FROM SOURCE s (which has as_of t_source).
2. We create the sink with AS OF now() = t_sink > t_source.
3. The persist collection for source s is compacted, so its since advances to t_source_1 > t_sink.
4. The storaged process crashes and is rehydrated.
5. We recreate the sink using the cached command, which still uses t_sink.
6. This panics when trying to read from the persist collection for source s, because t_sink is now behind the collection's since.

We handled the case when the source was recreated on restart of environmentd by looking at the current since of the source in the storage controller. However, in this case, the storage controller doesn't instruct the client to restart -- the rehydrating task does. So we simply move the logic down a layer.
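To make the failure sequence concrete, here is a minimal sketch. It is not the code from this PR. It assumes totally ordered `u64` timestamps in place of the `Antichain` frontiers the real code works with, and `can_serve` and `rehydration_as_of` are hypothetical names introduced only for illustration.

```rust
/// A read at `as_of` is only possible while compaction has not passed it,
/// i.e. while the collection's since is not beyond `as_of`.
/// (Simplified model: timestamps are plain u64 values.)
fn can_serve(as_of: u64, since: u64) -> bool {
    since <= as_of
}

/// Hypothetical helper: pick the as_of to use when recreating a sink during
/// rehydration. Instead of blindly reusing the cached as_of, advance it to
/// the source's current since if compaction has moved past it.
fn rehydration_as_of(cached_as_of: u64, source_since: u64) -> u64 {
    cached_as_of.max(source_since)
}
```

With t_sink = 10 and a source since that has advanced to 15, `can_serve(10, 15)` is false, which corresponds to the panic. `rehydration_as_of(10, 15)` yields 15, which the source can still serve. If the since has not passed the cached value, the cached as_of is kept unchanged.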

Motivation

Fixes #14555

Checklist

This PR has adequate test coverage / QA involvement has been duly considered.
This PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way) and therefore is tagged with a T-protobuf label.
This PR includes the following user-facing behavior changes:

cjubb39 · 2022-09-13T21:51:38Z

This is ready for review! I was not able to reliably reproduce the panic on main -- but I also haven't been able to reproduce it with this branch. That being said: we are pretty convinced -- just by inspection -- that the current code is wrong.

aljoscha

I think this should work. But I had a concern about race conditions. Might be that we have to merge as is.

In any case, this was some nice sleuthing! 😊

aljoscha · 2022-09-14T15:17:33Z

src/storage/src/controller/rehydration.rs

+            // The controller has the dependency recorded in its `exported_collections` so this
+            // should not change at least until the sink is started up (because the storage
+            // controller will not downgrade the source's since).
+            let from_since = from_read_handle.since();


Now that I look at this I'm a bit concerned about races. The following might happen:

1. controller holds a read handle for the implied since t
2. the actual since of the shard is t - 10 because someone else is still holding a handle
3. we set the as_of to t - 10 because of this
4. that third party released their hold, shard since advances to t
5. the sink, in storaged, tries to read and fails

I think 1. and 2. can't currently happen together, because we know that the controller initializes the implied frontier to the shard since. But we might change that in the future, maybe by accident.

It would be nicer if we can thread through the exact since that the controller is holding, but that might not be easily feasible. In that case we should probably merge as is. Tricky ... 🙈

cjubb39

I think the correctness here is slightly more embedded than you describe: one of the first things the storage controller does is install a dependency between the from source / table / etc. and this newly created sink. That will keep it from updating the read capability of the source.

So, while someone else can definitely downgrade their handle, the handle managed by the storage controller should keep the since for the collection itself from being downgraded.
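The race discussed in this thread can be illustrated with a toy model. This is a sketch under stated assumptions, not persist's actual API: `ShardModel`, its methods, and `race_scenario` are hypothetical, timestamps are plain `u64` values, and the shard's since is modeled as the minimum over all outstanding read holds.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for a shard: its since is the minimum of all read holds.
#[derive(Default)]
struct ShardModel {
    holds: BTreeMap<String, u64>,
}

impl ShardModel {
    /// Register (or update) a read hold for `holder` at `since`.
    fn acquire(&mut self, holder: &str, since: u64) {
        self.holds.insert(holder.to_string(), since);
    }

    /// Drop the read hold of `holder`, which may let the since advance.
    fn release(&mut self, holder: &str) {
        self.holds.remove(holder);
    }

    /// Current since of the shard, or None if nobody holds it back.
    fn since(&self) -> Option<u64> {
        self.holds.values().copied().min()
    }

    /// A read at `as_of` works only if the since has not passed it.
    fn can_serve(&self, as_of: u64) -> bool {
        self.since().map_or(false, |since| since <= as_of)
    }
}

/// Replays the scenario with t = 100. Returns whether a read succeeds when
/// the as_of was taken from (a) the observed shard since and (b) the since
/// that the controller itself is holding.
fn race_scenario() -> (bool, bool) {
    let mut shard = ShardModel::default();
    shard.acquire("controller", 100); // implied since t
    shard.acquire("third_party", 90); // someone else still at t - 10

    let observed_since = shard.since().unwrap(); // 90
    let controller_hold = 100;

    // The third party lets go; the shard since advances to 100.
    shard.release("third_party");

    (shard.can_serve(observed_since), shard.can_serve(controller_hold))
}
```

In this model `race_scenario()` returns `(false, true)`: an as_of derived from the observed shard since can become unreadable once the other holder releases, while an as_of derived from the controller's own hold stays readable for as long as the controller keeps that hold. That matches both the concern raised above and the reply that the controller's dependency keeps its hold in place.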

aljoscha · 2022-09-14T15:18:26Z

src/storage/src/controller/rehydration.rs

@@ -183,6 +185,39 @@ where
                .await;
        }

+        for export in self.exports.values_mut() {


The rehydration logic is also used when the controller restarts? I'm asking because we removed the logic from there.

cjubb39

It is, that's correct!

cjubb39 mentioned this pull request Sep 12, 2022

Killing the Kafka sink can cause it to permanently hang #14555

Closed

[sinks] Generated updated as_of in rehydrating client

4d8f36d

cjubb39 force-pushed the as_of_rehydrate branch from 13b39ca to 4d8f36d Compare September 13, 2022 21:44

cjubb39 marked this pull request as ready for review September 13, 2022 21:44

cjubb39 requested review from aljoscha and bkirwi September 13, 2022 21:44

aljoscha approved these changes Sep 14, 2022

cjubb39 merged commit e4f5ef5 into MaterializeInc:main Sep 14, 2022

aljoscha mentioned this pull request Sep 19, 2022

ci: cluster smoke test: panicked at 'cannot serve requested as_of: #12885

Closed

materialize-bot mentioned this pull request Sep 20, 2022

release: v0.27.0-alpha.21 required reviews #14894

Closed

34 tasks

cjubb39 mentioned this pull request Sep 20, 2022

[sinks] Fastforward asof in storage controller and rehydration #14900

Merged

3 tasks
