
Zone state machine #526

Merged
tertsdiepraam merged 11 commits into main from zone-state-machine
Mar 24, 2026

Contributor

@tertsdiepraam tertsdiepraam commented Mar 16, 2026

Adds a state machine per zone to verify that all the operations we perform on it are correct. It will also make it easier to report what's currently going on.

Additionally, this overhauls the halting mechanism to have three distinct halting states: a rejected loaded zone, a failed signing operation, and a rejected signed zone. Each of these can be fixed with cascade zone reset <zone>, which puts the state back to waiting. The rejected states can also be overridden with cascade zone override --(un)signed <zone> (bikesheddable).
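
To make the halting behaviour described above concrete, here is a minimal sketch. All type and method names (ZoneState, ReviewStage, reset, override_review) are hypothetical and not taken from this PR. Only the three halting states, the reset back to waiting, and the rule that only rejected states can be overridden come from the description. What state follows an override is not stated, so the sketch assumes it counts as an approval, and it assumes --unsigned refers to the rejected loaded zone.

```rust
/// Review stage that an override applies to (hypothetical name).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReviewStage {
    Unsigned,
    Signed,
}

/// Simplified per-zone state. Only `Waiting` and the three halted states
/// come from the PR description; everything else is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ZoneState {
    Waiting,
    /// Assumption: an overridden rejection is treated as an approval.
    Approved(ReviewStage),
    RejectedLoaded,
    SigningFailed,
    RejectedSigned,
}

impl ZoneState {
    /// Returns true for any of the three halting states.
    fn is_halted(self) -> bool {
        matches!(
            self,
            Self::RejectedLoaded | Self::SigningFailed | Self::RejectedSigned
        )
    }

    /// Models `cascade zone reset <zone>`: any halted state goes back to
    /// `Waiting`. Returns whether the state changed.
    fn reset(&mut self) -> bool {
        if self.is_halted() {
            *self = Self::Waiting;
            true
        } else {
            false
        }
    }

    /// Models `cascade zone override --(un)signed <zone>`: only a rejected
    /// state of the matching stage can be overridden.
    fn override_review(&mut self, stage: ReviewStage) -> Result<(), &'static str> {
        match (*self, stage) {
            (Self::RejectedLoaded, ReviewStage::Unsigned)
            | (Self::RejectedSigned, ReviewStage::Signed) => {
                *self = Self::Approved(stage);
                Ok(())
            }
            _ => Err("zone is not in a matching rejected state"),
        }
    }
}
```

In this sketch a failed signing operation can only be cleared with reset, while either rejected state accepts reset or the matching override.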


  • If you are changing Rust code or integration tests (Cargo.*, crates/, etc/, integration-tests/, src/):

    • Did you run the integration tests with act through the act-wrapper (as described in TESTING.md)?
  • If you are adding/deleting man pages:

    • Did you update the man_pages config in doc/manual/source/conf.py?
    • Did you update the packaged man pages in the Cargo.toml?
    • Did you commit the freshly built man pages?
  • If you are modifying man pages:

    • Did you commit the updated built man pages?

@tertsdiepraam tertsdiepraam marked this pull request as draft March 16, 2026 10:24
@tertsdiepraam tertsdiepraam force-pushed the zone-state-machine branch 3 times, most recently from 66437c2 to 868714e March 18, 2026 15:00
Contributor

@bal-e bal-e left a comment

This looks great, @tertsdiepraam. This is a really important aspect of Cascade and I'm glad that we're finally addressing it. I have a lot of comments, but they are nits or starting points for important top-level discussions. Adding to them:

  • Can we merge this PR sooner rather than later (possibly ignoring many of my comments), and does it preserve Cascade's existing functionality? E.g. I think some history events are no longer recorded.
  • Where/how will we store and present loader errors to the user?
  • The state transition methods need some basic documentation (eventually they should describe the state transitions and mention panics; for now a one-liner would be good).
  • Some logging at state transitions would be helpful.
  • I forgot something: when an instance is soft-rejected, we may want to stay in that state until a new operation (e.g. loading or re-signing) is requested. Can that fit under this implementation?


  // Initiate the load immediately, if the data storage is not busy.
- if let Some(builder) = self.zone().storage().start_load() {
+ if let Some(builder) = self.zone().try_start_load() {
Contributor

Hmmm, I'm not sure whether the zone state machine needs to handle this step. Making a LoadedZoneBuilder available is IMO the zone data storage's job. Perhaps this could remain a .storage() method and in the success case, we call a .start_load() or .mark_load_started() method on the zone state machine?

Contributor Author

We talked about this a bit more and the actions to take are:

  • Make try_start_load check the storage before the zone state machine.
  • Make on_passive react to the zone storage passive state again.
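
A rough sketch of the first action item, assuming simplified stand-in types: Storage, Phase, and Zone are hypothetical, and LoadedZoneBuilder here is an empty placeholder rather than the real builder. It only illustrates the ordering discussed above, where the storage is asked for a builder first and the zone state machine is updated only if that succeeds.

```rust
/// Placeholder for the real builder type; carries no data in this sketch.
struct LoadedZoneBuilder;

/// Hypothetical stand-in for the zone data storage.
struct Storage {
    busy: bool,
}

impl Storage {
    /// Hands out a builder only when the storage is not busy.
    fn start_load(&mut self) -> Option<LoadedZoneBuilder> {
        if self.busy {
            None
        } else {
            self.busy = true;
            Some(LoadedZoneBuilder)
        }
    }
}

/// Hypothetical, heavily reduced zone state machine.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    Waiting,
    Loading,
}

struct Zone {
    storage: Storage,
    phase: Phase,
}

impl Zone {
    /// Checks the storage before touching the state machine, so the
    /// state only changes when a builder is actually available.
    fn try_start_load(&mut self) -> Option<LoadedZoneBuilder> {
        // Step 1: the storage decides whether a load can begin.
        let builder = self.storage.start_load()?;
        // Step 2: only now record the transition in the state machine.
        self.phase = Phase::Loading;
        Some(builder)
    }
}
```

With this ordering a busy storage leaves the state machine untouched. How the real code handles a state machine that is not ready for a load is not shown in this thread.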

  Ok(()) => {
      let built = builder.finish().unwrap_or_else(|_| unreachable!());
-     handle.storage().finish_sign(built);
+     handle.start_signed_review(built);
Contributor

This reminds me of an important matter to discuss: while it makes sense that signed review should begin immediately after signing finishes, it was nice to just mark that an operation is complete.

}
}

#[derive(Debug, Default)]
Contributor

For each of these states, we should document their expectations (i.e. our invariants) regarding the zone data storage state machine, and other parts of Cascade.

Comment on lines -249 to +245
- if review {
-     info!("Initiating review of newly-loaded instance");
+ info!("Initiating review of newly-loaded instance");

-     // TODO: 'on_seek_approval_for_zone' tries to lock zone state.
-     std::mem::drop(state);
+ // TODO: 'on_seek_approval_for_zone' tries to lock zone state.
+ std::mem::drop(state);

-     center.unsigned_review_server.on_seek_approval_for_zone(
-         &center,
-         &zone,
-         domain::base::Serial(serial.into()),
-     );
+ center.unsigned_review_server.on_seek_approval_for_zone(
+     &center,
+     &zone,
+     domain::base::Serial(serial.into()),
+ );

-     state = zone.state.lock().unwrap();
- }
+ state = zone.state.lock().unwrap();
Contributor

We need to discuss how the state machine should flow if review is disabled.
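
As background for the TODO in the hunk above: std::sync::Mutex is not re-entrant, so calling a function that locks the zone state while the caller still holds the guard would deadlock or panic. The sketch below shows the drop-then-relock pattern in isolation. Zone, seek_approval, and finish_load are hypothetical stand-ins with a plain counter as state, not the real Cascade types.

```rust
use std::sync::Mutex;

/// Hypothetical zone with a trivially small state.
struct Zone {
    state: Mutex<u32>,
}

/// Stand-in for a callee that locks the zone state itself.
fn seek_approval(zone: &Zone) {
    let mut state = zone.state.lock().unwrap();
    *state += 1;
}

/// Holds the lock, releases it around the callee, then re-acquires it.
fn finish_load(zone: &Zone) -> u32 {
    let mut state = zone.state.lock().unwrap();
    *state += 10;

    // Release the guard first: the callee takes the same lock.
    drop(state);
    seek_approval(zone);

    // Re-acquire the lock to continue working with the state.
    state = zone.state.lock().unwrap();
    *state
}
```

Other threads can observe or change the state between the drop and the relock, so anything read before the gap may need to be re-checked afterwards.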

@tertsdiepraam tertsdiepraam changed the title from "Start with making the zone state machine" to "Zone state machine" Mar 20, 2026
@tertsdiepraam tertsdiepraam marked this pull request as ready for review March 20, 2026 13:43
Contributor

@bal-e bal-e left a comment

There are a few leftover comments and minor nits, but overall LGTM. I'm really excited to see this land.

zone,
review_stage: _,
}) => {
println!("Overrode {stage} review for '{zone}'");
Contributor

  • "Overrode" -> "Overridden"?
  • for '{zone}' -> for '{zone}'

  ) -> Result<(), ZoneReloadError> {
      let mut zone_state = zone.state.lock().expect("lock is not poisoned");
-     if let Some(reason) = zone_state.halted(true) {
+     if let Some(reason) = zone_state.halted_reason() {
Contributor

Right, this was the case I was thinking of when I mentioned that .halted_reason() shouldn't be easily accessible. It's not urgent, but I think this check should be moved to the HTTP server -- we just need to check where it's called from.

Contributor Author

Yeah that would be fine

}
}

impl<'a> ZoneHandle<'a> {}
Contributor

Empty impl block?

Contributor Author

Surprised Clippy doesn't catch that!

@tertsdiepraam tertsdiepraam merged commit e756146 into main Mar 24, 2026
9 checks passed
@tertsdiepraam tertsdiepraam deleted the zone-state-machine branch March 24, 2026 09:50
2 participants