Remove orphaned orchestrator nodes on environmentd startup #16200
Conversation
Force-pushed from 511f518 to f37b75d
```diff
-handle.block_on(coord.bootstrap(builtin_migration_metadata, builtin_table_updates));
+let bootstrap = handle.block_on(async {
+    coord
+        .bootstrap(builtin_migration_metadata, builtin_table_updates)
```
It's possible that bootstrapping creates new objects with a user ID higher than `next_ids.0` due to the builtin schema migration.
You can probably fix this by getting the next IDs after bootstrapping.
I think it's fine to get those before the migration; I'm interested in objects that would be left over from a previous environmentd run, so there is no need to look at the IDs being allocated during bootstrap and normal operation.

The only thing I'm worried about is that someone else does a create replica and we see the increased replica ID in `next_replica_id`, but not the replica object itself. Then `remove_orphans` would gladly remove it from the orchestrator. AFAIK this cannot happen, because both the initial catalog load and replica creation happen inside a transaction.
If we see a new object created during migration, will this code in the storage controller panic when removing orphans?
```rust
if id >= next_id {
    // Found a storaged in kubernetes with a higher id than what we are aware of. This
    // must have been created by an environmentd with a higher epoch number.
    panic!(
        "Found storaged id ({}) in orchestrator >= than next_id ({})",
        id, next_id
    );
}
```
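The removal rule being debated here (delete unknown services with IDs below `next_id`; treat anything at or above `next_id` as evidence of a newer environmentd) can be sketched as follows. The `classify` function, its plain `u64` IDs, and the slice arguments are illustrative stand-ins, not the actual controller API:

```rust
/// For each service ID found in the orchestrator, decide whether it is an
/// orphan to delete, or evidence of a newer environmentd (a fatal error).
fn classify(
    orchestrator_ids: &[u64],
    catalog_ids: &[u64],
    next_id: u64,
) -> Result<Vec<u64>, String> {
    let mut orphans = Vec::new();
    for &id in orchestrator_ids {
        if id >= next_id {
            // Only an environmentd with a higher epoch could have allocated this.
            return Err(format!(
                "found service id ({id}) in orchestrator >= next_id ({next_id})"
            ));
        }
        if !catalog_ids.contains(&id) {
            // Allocated by a previous incarnation but absent from the catalog.
            orphans.push(id);
        }
    }
    Ok(orphans)
}

fn main() {
    // Services 1 and 3 exist in the catalog; 2 is an orphan from a crash.
    let orphans = classify(&[1, 2, 3], &[1, 3], 4).unwrap();
    assert_eq!(orphans, vec![2]);
    // An ID at or above next_id means another envd is running.
    assert!(classify(&[5], &[], 4).is_err());
    println!("orphans to remove: {orphans:?}");
}
```

This also shows why `next_id` must be fetched after any migration that allocates IDs: otherwise a legitimately created service lands in the fatal branch.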
> The only thing I'm worried about is that someone else does a create replica and we see the increased replica ID in `next_replica_id`, but not the replica object itself. Then `remove_orphans` would gladly remove it from the orchestrator. AFAIK this cannot happen, because both the initial catalog load and replica creation happen inside a transaction.
I think we're OK here. A user can't create a replica until we're done with bootstrapping and finished removing orphaned nodes. If a new Coordinator creates a new replica and increases the `next_replica_id`, then our read of `next_replica_id` will fail because we're no longer the leader.
> If we see a new object created during migration, will this code in the storage controller panic when removing orphans?

You're right, it won't be deleted, but we will panic: we need to fetch the `next_id` after migration. When I fetch them right after the bootstrap (using `peek_key_one` on the ID allocation collection), that call does an epoch check, right?
Yep. All stash reads and writes will do an epoch check.
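The fencing behavior described here can be sketched with a toy store that rejects callers carrying a stale epoch. The `FencedStore` type and its fields are illustrative only, not the actual stash API:

```rust
/// A store in which every read carries the caller's epoch and fails
/// once a higher epoch has been observed (i.e. a newer leader exists).
struct FencedStore {
    value: u64,
    greatest_epoch: u64,
}

impl FencedStore {
    /// Read the value, fencing out callers with a stale epoch.
    fn read(&mut self, epoch: u64) -> Result<u64, String> {
        if epoch < self.greatest_epoch {
            return Err(format!(
                "fenced out: epoch {epoch} < {}",
                self.greatest_epoch
            ));
        }
        self.greatest_epoch = epoch;
        Ok(self.value)
    }
}

fn main() {
    let mut store = FencedStore { value: 42, greatest_epoch: 5 };
    assert_eq!(store.read(6), Ok(42)); // current leader reads fine
    assert!(store.read(5).is_err());   // stale reader is fenced out
}
```

Under this model, an old environmentd that tries to read `next_replica_id` after a new leader has bumped the epoch gets an error instead of stale data, which is what makes the orphan-removal race benign.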
Force-pushed from 874e033 to baf1b17
Just took a quick skim, but looks 👌🏽. Thank you!
```rust
if id >= next_id {
    // Found a storaged in kubernetes with a higher id than what we are aware of. This
    // must have been created by an environmentd with a higher epoch number.
    panic!(
```
```diff
-    panic!(
+    halt!(
```
added!
Force-pushed from e5760a0 to dbf4d4f
Force-pushed from 2eae69a to fd7c0ac
src/controller/src/lib.rs
Outdated
```rust
    next_ids: (GlobalId, ReplicaId),
) -> Result<(), anyhow::Error> {
    self.storage.remove_orphans(next_ids.0).await?;
    self.compute.remove_orphans(next_ids.1).await?;
```
Do we ever attempt to remove orphaned indexes or entire compute clusters?
For orphaned indexes (or dataflows), the rehydration logic should take care of them. In case of a crash there are no orphans, as the computeds are stateless and wait to be rehydrated from environmentd.

Entire compute clusters don't have that problem either, because the only externally persisted state is the replica pod entries in kubernetes. So we should be good there.

I do have another potential source of inconsistencies in mind: the builtin table updates, which don't run transactionally with the catalog update (let me know if this is not true). So if envd crashes right after a drop replica has been communicated to the catalog, are we sure the builtin table updates are always correct?
If you know of any other state that envd modifies (outside the catalog) let me know!
Builtin tables should be fine. As part of bootstrapping, the Coord identifies what all the system tables should look like and what they currently look like, and will send the needed appends to get them into the correct state.
materialize/src/adapter/src/coord.rs
Lines 865 to 892 in 796cd14
```rust
// Add builtin table updates the clear the contents of all system tables
info!("coordinator init: resetting system tables");
let read_ts = self.get_local_read_ts();
for system_table in entries
    .iter()
    .filter(|entry| entry.is_table() && entry.id().is_system())
{
    info!(
        "coordinator init: resetting system table {} ({})",
        self.catalog.resolve_full_name(system_table.name(), None),
        system_table.id()
    );
    let current_contents = self
        .controller
        .storage
        .snapshot(system_table.id(), read_ts)
        .await
        .unwrap();
    info!("coordinator init: table size {}", current_contents.len());
    let retractions = current_contents
        .into_iter()
        .map(|(row, diff)| BuiltinTableUpdate {
            id: system_table.id(),
            row,
            diff: diff.neg(),
        });
    builtin_table_updates.extend(retractions);
}
```
Force-pushed from 745ecb0 to 339d97e
Just one little API naming nit and this LGTM! (I reviewed structurally mostly and didn't get into the weeds, since @jkosh44 approved.)
src/adapter/src/catalog/storage.rs
Outdated
```diff
@@ -838,6 +838,27 @@ impl<S: Append> Connection<S> {
         Ok(GlobalId::User(id))
     }
 
+    /// Get the next user and replica id without allocating them.
+    pub async fn get_next_ids(&mut self) -> Result<(GlobalId, ReplicaId), Error> {
```
Bundling the fetching of these two IDs into a single API method corrupts the public API of the catalog with the specific needs of the `Controller`. Could you split this into two methods: `get_next_replica_id` and `get_next_user_global_id`?
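The "get without allocating" contract these methods share can be sketched as a peek on an ID allocator. The `IdAllocator` type below is an illustrative stand-in, not the actual catalog storage:

```rust
/// Hands out monotonically increasing IDs, and lets callers peek at the
/// next ID without consuming it.
struct IdAllocator {
    next: u64,
}

impl IdAllocator {
    /// Return the next ID without allocating it (a read-only peek).
    fn peek(&self) -> u64 {
        self.next
    }

    /// Allocate and return the next ID, advancing the counter.
    fn allocate(&mut self) -> u64 {
        let id = self.next;
        self.next += 1;
        id
    }
}

fn main() {
    let mut ids = IdAllocator { next: 7 };
    assert_eq!(ids.peek(), 7);     // peeking does not advance the counter
    assert_eq!(ids.allocate(), 7); // allocating returns the same id...
    assert_eq!(ids.peek(), 8);     // ...and advances for the next caller
}
```

The peek is all orphan removal needs: any service with an ID at or above the peeked value must belong to a newer environmentd.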
src/adapter/src/catalog.rs
Outdated
```diff
@@ -3105,6 +3105,11 @@ impl<S: Append> Catalog<S> {
         self.storage().await.get_persisted_timestamp(timeline).await
     }
 
+    /// Get the next user and replica id without allocating them.
+    pub async fn get_next_ids(&mut self) -> Result<(GlobalId, ReplicaId), Error> {
```
(See below.)
src/controller/src/lib.rs
Outdated
```rust
/// Remove orphaned services from the orchestrator.
pub async fn remove_orphans(
    &mut self,
    next_ids: (GlobalId, ReplicaId),
```
```diff
-    next_ids: (GlobalId, ReplicaId),
+    next_replica_id: ReplicaId,
+    next_storage_host_id: GlobalId,
```
list_services wrongly applied a filter meant for pods to StatefulSets, resulting in an always-empty list. This commit reverts the change introduced by 58135e2 so that no filter is applied and all services are returned.
Make environmentd crash on drop replica after the catalog transaction but before the orchestrator call. Then ensure that the orphan will get cleaned up on environmentd restart.
Force-pushed from 339d97e to 61e3f42
Addressed @Bensch's naming nit.
```diff
@@ -768,7 +769,7 @@ impl NamespacedOrchestrator for NamespacedKubernetesOrchestrator {
 
     /// Lists the identifiers of all known services.
     async fn list_services(&self) -> Result<Vec<String>, anyhow::Error> {
-        let stateful_sets = self.stateful_set_api.list(&self.list_params()).await?;
+        let stateful_sets = self.stateful_set_api.list(&Default::default()).await?;
```
Maybe I'm missing some context, but do we not want to only list the ones with the correct `environmentd.materialize.cloud/namespace`?
After discussion in chat: this label is not set on the StatefulSets, so the call was previously returning an empty list.
I don't have the context for all of this, but the parts I understand look good.
Motivation

When environmentd crashes during an operation that removes a computed (`DROP CLUSTER REPLICA ...`) or a storaged (`DROP SOURCE ...`), it is possible that the object is removed from the catalog but the corresponding service is not removed from the orchestrator. This PR cleans up those orphaned nodes on environmentd restart.
Implementation:
Unlike #16114, this PR detects orphaned nodes on startup using solely the replica/storaged ID. These IDs are monotonically increasing, so an environmentd can create replicas only with higher IDs, and it is safe to remove all orchestrator services that are not known at envd boot time and have an ID lower than the biggest ID we are currently aware of. If we encounter an orchestrator node with a higher ID, we know that another envd with a higher epoch is running, and we terminate ourselves.
Tips for reviewer
Consider previous discussion in #16114.
The first commit cleans up compute replicas, the second cleans up storageds on startup, and the last adds testing.
Checklist

- `$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way) and therefore is tagged with a `T-proto` label.