feat: Add cluster maintenance mode #40
Pull request overview
This PR introduces a cluster-wide maintenance mode that freezes the control plane’s indexing plan and rejects metadata mutations while maintenance is active. It adds persistence for the maintenance flag + frozen plan via the metastore KV API, exposes new REST endpoints (and a REST client), and provides a CLI surface to manage the mode.
Changes:
- Add Control Plane maintenance mode state + persistence (metastore-backed) and metrics.
- Add REST API endpoints (`/api/v1/cluster/maintenance`) plus REST client support.
- Add metastore KV RPCs (proto/codegen + metastore backends) and a `quickwit-cli maintenance` command.
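To make "rejects metadata mutations while maintenance is active" concrete, here is a minimal sketch of such a guard. It is not the PR's implementation: `MaintenanceFlag`, `GuardError`, and `guard` are hypothetical names, and the flag is a plain in-memory boolean rather than the metastore-backed state the PR persists.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical error type; the PR's real variant lives in `ControlPlaneError`.
#[derive(Debug, PartialEq)]
enum GuardError {
    MaintenanceMode,
}

/// Hypothetical in-memory flag standing in for the persisted maintenance state.
#[derive(Default)]
struct MaintenanceFlag {
    active: AtomicBool,
}

impl MaintenanceFlag {
    /// Turns maintenance mode on or off.
    fn set(&self, active: bool) {
        self.active.store(active, Ordering::SeqCst);
    }

    /// Reports whether maintenance mode is currently on.
    fn is_active(&self) -> bool {
        self.active.load(Ordering::SeqCst)
    }

    /// Runs `mutation` only when maintenance mode is off; otherwise the
    /// closure is never called and the caller gets `MaintenanceMode`.
    fn guard<T>(&self, mutation: impl FnOnce() -> T) -> Result<T, GuardError> {
        if self.is_active() {
            return Err(GuardError::MaintenanceMode);
        }
        Ok(mutation())
    }
}
```

Wrapping each mutating handler in a single `guard` call keeps the check in one place, so a new mutation path cannot silently bypass it.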
Reviewed changes
Copilot reviewed 24 out of 25 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| quickwit/quickwit-serve/src/rest.rs | Wires the new maintenance REST handler into /api/v1 routes. |
| quickwit/quickwit-serve/src/lib.rs | Spawns the control plane with metastore-backed maintenance persistence. |
| quickwit/quickwit-serve/src/cluster_api/rest_handler.rs | Adds REST endpoints + OpenAPI paths for maintenance mode. |
| quickwit/quickwit-serve/src/cluster_api/mod.rs | Re-exports maintenance_handler. |
| quickwit/quickwit-rest-client/src/rest_client.rs | Adds a MaintenanceClient with status/enable/disable calls. |
| quickwit/quickwit-proto/src/control_plane/mod.rs | Introduces MaintenanceMode error + RPC name bindings. |
| quickwit/quickwit-proto/src/codegen/quickwit/quickwit.metastore.rs | Codegen updates for metastore KV RPCs/types and tower wiring. |
| quickwit/quickwit-proto/src/codegen/quickwit/quickwit.control_plane.rs | Codegen updates for maintenance RPCs/types and tower wiring. |
| quickwit/quickwit-proto/protos/quickwit/metastore.proto | Adds KV RPCs/messages to the metastore service. |
| quickwit/quickwit-proto/protos/quickwit/control_plane.proto | Adds maintenance mode RPCs/messages to the control plane service. |
| quickwit/quickwit-metastore/src/metastore/postgres/metastore.rs | Implements KV operations for PostgreSQL metastore backend. |
| quickwit/quickwit-metastore/src/metastore/file_backed/state.rs | Adds an in-memory KV store field to file-backed metastore state. |
| quickwit/quickwit-metastore/src/metastore/file_backed/mod.rs | Implements KV operations for the file-backed metastore backend. |
| quickwit/quickwit-metastore/src/metastore/control_plane_metastore.rs | Proxies KV operations through MetastoreServiceClient. |
| quickwit/quickwit-control-plane/src/metrics.rs | Adds a maintenance_mode gauge metric. |
| quickwit/quickwit-control-plane/src/maintenance.rs | New maintenance mode persistence/state utilities (incl. metastore KV persistence). |
| quickwit/quickwit-control-plane/src/lib.rs | Exposes the new maintenance module. |
| quickwit/quickwit-control-plane/src/indexing_scheduler/mod.rs | Adds a method to load a frozen plan into scheduler state. |
| quickwit/quickwit-control-plane/src/indexing_plan.rs | Makes PhysicalIndexingPlan deserializable for persistence. |
| quickwit/quickwit-control-plane/src/control_plane.rs | Core maintenance mode behavior: guards mutations, freezes plan, adds RPC handlers. |
| quickwit/quickwit-control-plane/Cargo.toml | Adds time dependency for RFC3339 timestamps. |
| quickwit/quickwit-cli/src/maintenance.rs | New CLI command group to enable/disable/query maintenance mode. |
| quickwit/quickwit-cli/src/lib.rs | Exposes the new maintenance CLI module. |
| quickwit/quickwit-cli/src/cli.rs | Registers the maintenance CLI command. |
| quickwit/Cargo.lock | Locks the added time dependency. |
Comments suppressed due to low confidence (1)
quickwit/quickwit-control-plane/src/control_plane.rs:639
- In maintenance mode, `ControlPlanLoop` returns early and skips `indexing_scheduler.control_running_plan(&self.model)`. This prevents re-applying the frozen plan to indexers that restart during maintenance, contradicting the intended behavior described in the `IndexingScheduler::load_frozen_plan` docs and potentially leaving restarted indexers without tasks. Consider still calling `control_running_plan` while skipping shard rebalancing and plan rebuilds, so the frozen plan continues to be enforced during maintenance windows.
if self.maintenance.is_active() {
// In maintenance mode: skip shard rebalancing and plan control.
ctx.schedule_self_msg(CONTROL_PLAN_LOOP_INTERVAL, ControlPlanLoop);
return Ok(());
}
if let Err(metastore_error) = self
.ingest_controller
.rebalance_shards(&mut self.model, ctx.mailbox(), ctx.progress())
.await
{
return convert_metastore_error::<()>(metastore_error).map(|_| ());
}
self.indexing_scheduler.control_running_plan(&self.model);
ctx.schedule_self_msg(CONTROL_PLAN_LOOP_INTERVAL, ControlPlanLoop);
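The control flow suggested in this comment can be sketched as follows. This is a simplified, self-contained model and not the actual handler: `TickLog` and `control_plan_tick` are hypothetical, and the real calls (`rebalance_shards`, `control_running_plan`, `schedule_self_msg`) are reduced to booleans recording which steps ran.

```rust
/// Hypothetical record of which steps one loop tick performed.
#[derive(Debug, Default, PartialEq)]
struct TickLog {
    rebalanced_shards: bool,
    controlled_running_plan: bool,
}

/// One tick of the loop, following the reviewer's suggestion:
/// skip rebalancing during maintenance, but keep enforcing the plan.
fn control_plan_tick(maintenance_active: bool) -> TickLog {
    let mut log = TickLog::default();
    if !maintenance_active {
        // Normal mode only: rebalancing may change shard assignments.
        log.rebalanced_shards = true;
    }
    // Both modes: re-apply the current plan (the frozen one during
    // maintenance) so indexers that restarted get their tasks back.
    log.controlled_running_plan = true;
    log
}
```

Compared with the early return shown above, the only behavioral difference during maintenance is that the plan is still re-applied on every tick.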
ced1764 to e22ba25
Pull request overview
Copilot reviewed 25 out of 26 changed files in this pull request and generated 3 comments.
ControlPlaneError::TooManyRequests => MetastoreError::TooManyRequests,
ControlPlaneError::Unavailable(message) => MetastoreError::Unavailable(message),
ControlPlaneError::MaintenanceMode => {
    MetastoreError::Unavailable("cluster is in maintenance mode".to_string())
I'm not sure precondition failed is an appropriate HTTP error status.
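For context, HTTP 412 Precondition Failed is defined for conditional requests (headers such as `If-Match`), whereas 503 Service Unavailable signals a temporary inability to handle the request. One possible mapping, shown purely as an illustration, treats maintenance mode like any other temporary unavailability. `ApiError` and `http_status` are hypothetical names, and this is not necessarily the mapping the PR settled on.

```rust
/// Hypothetical, reduced error enum mirroring the variants in the snippet above.
#[derive(Debug, Clone, Copy)]
enum ApiError {
    TooManyRequests,
    Unavailable,
    MaintenanceMode,
}

/// Maps each error to an HTTP status code (assumed mapping, for illustration).
fn http_status(error: ApiError) -> u16 {
    match error {
        // 429 Too Many Requests: the client should slow down.
        ApiError::TooManyRequests => 429,
        // 503 Service Unavailable: temporary condition, retrying later may work.
        ApiError::Unavailable | ApiError::MaintenanceMode => 503,
    }
}
```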
e22ba25 to 5789f5a
Signed-off-by: Darkheir <raphael.cohen@sekoia.io>
5789f5a to 90c0066
warn!(
    error = %err,
    "failed to deserialize maintenance state; clearing corrupted key and \
     starting in normal mode"
);
I think clearing the maintenance state is too aggressive here. If deserialization fails for some reason, it might be better to offer the choice to either clean the DB entry manually or roll back.
    warn!(
        error = %err,
        "failed to load maintenance state from persistence, starting in normal mode"
    );
}
Same here: an error (which might be transient) has us start outside of maintenance mode. We are now in an inconsistent state where the control plane is not in maintenance mode, but if the control plane restarts it might jump back into it.
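One way to address both concerns (corrupted state and transient load errors) is to never fall back to normal mode silently: retry transient failures, and refuse to start when the stored value cannot be interpreted, leaving the key untouched for manual cleanup or rollback. The sketch below is an assumption-laden illustration, not the PR's code: `ReadOutcome`, `StartupDecision`, `parse_state`, and `decide_startup` are hypothetical, and the stored value is modeled as a plain string instead of the real serialized state.

```rust
/// Hypothetical result of reading the persisted maintenance key.
enum ReadOutcome {
    Missing,
    Value(String),
    TransientError(String),
}

/// Hypothetical startup decision for the control plane.
#[derive(Debug, PartialEq)]
enum StartupDecision {
    NormalMode,
    MaintenanceMode,
    /// Refuse to start; the stored key is left as-is for the operator.
    Abort(String),
}

/// Stand-in for deserialization: only two string values are valid here.
fn parse_state(raw: &str) -> Result<bool, String> {
    match raw.trim() {
        "active" => Ok(true),
        "inactive" => Ok(false),
        other => Err(format!("unrecognized maintenance state: {other:?}")),
    }
}

/// Retries transient read errors up to `max_attempts` times and never
/// defaults to normal mode when the state is unknown or unreadable.
fn decide_startup(
    mut read: impl FnMut() -> ReadOutcome,
    max_attempts: usize,
) -> StartupDecision {
    let mut last_error = String::from("no read attempted");
    for _ in 0..max_attempts {
        match read() {
            // No key stored: maintenance mode was never enabled.
            ReadOutcome::Missing => return StartupDecision::NormalMode,
            ReadOutcome::Value(raw) => {
                return match parse_state(&raw) {
                    Ok(true) => StartupDecision::MaintenanceMode,
                    Ok(false) => StartupDecision::NormalMode,
                    // Corrupted value: surface it instead of clearing it.
                    Err(err) => StartupDecision::Abort(err),
                };
            }
            // Possibly transient: remember the error and try again.
            ReadOutcome::TransientError(err) => last_error = err,
        }
    }
    StartupDecision::Abort(last_error)
}
```

A real implementation would likely add a delay between attempts; it is omitted here to keep the sketch short.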
Signed-off-by: Darkheir <raphael.cohen@sekoia.io>
Signed-off-by: Darkheir <raphael.cohen@sekoia.io>
94d5010 to 2d2ee33
Description
Add a cluster maintenance mode.
When in maintenance, the indexing plan is frozen, along with all related operations (index creation, ...).
How was this PR tested?
Describe how you tested this PR.