[reliability] Component restarts for our code bases #382

Closed
huitseeker opened this issue Feb 7, 2022 · 0 comments

huitseeker commented Feb 7, 2022

Context

This task is about modeling irrecoverable error management for the FastX & Narwhal code bases. The present URL (an issue on FastNFT) is probably not the best place for it, and the issue will probably have to move. This was already discussed in a brainstorming meeting, whose slides and recording are stored here:
https://drive.google.com/drive/u/1/folders/1dRTqzcN3qlVDFNXcwPRIkOgTYArBi7Iq

User Story (Why?)

  • we want developers across our code bases to be able to raise an irrecoverable error locally. We intend this error to signal that our inputs and context are faulty and that processing cannot continue, and to yield control flow without further consequences of the error encountered (e.g. we should not write to a database based on our faulty inputs).
  • but we do not want to model this irrecoverable error with a panic, because a panic would crash the process, and that remediation is too global: in a BFT system, we depend on 2/3 of authorities not crashing.
  • we want to have the ability to crash and restart smaller components, e.g. the VM of an authority, data synchronization modules, etc.

We expect that this will lead to better reliability across the key processes we stand up in the code base:

  • a FastX authority,
  • a FastX client,
  • a Narwhal primary,
  • a Narwhal worker,
  • ...

What?

  • we already have a bare-bones notion of components modeled as tokio tasks. Those components can be recognized as having an initialization function (fn spawn) and a runtime behavior component (fn run).
  • components launch other components. We can call the launcher the parent, and the launchee the child.
  • launch time is a good moment to set up communication between parent and child (e.g. they can share a (Sender|Receiver)<IrrecoverableError>, with IrrecoverableError: std::error::Error).
  • a child encountering an irrecoverable error can explicitly (through a channel) or implicitly (through scopeguard) send context about the error to its parent and yield control flow.
  • the parent can read the IrrecoverableError, do the appropriate signaling (& cleanup), and restart this component only (or itself generate an irrecoverable error and yield).

This pattern is known elsewhere as parental supervision.
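
The parent/child exchange described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the design the task will settle on: it uses std threads and std::sync::mpsc channels in place of tokio tasks so that it has no external dependencies, it shows only the explicit (channel) path and not the scopeguard one, and the names IrrecoverableError fields, run_child and supervise are hypothetical.

```rust
use std::fmt;
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// Hypothetical error type. The issue only asks that
/// `IrrecoverableError: std::error::Error`; the fields are assumptions.
#[derive(Debug)]
struct IrrecoverableError {
    component: &'static str,
    context: String,
}

impl fmt::Display for IrrecoverableError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{} failed irrecoverably: {}", self.component, self.context)
    }
}

impl std::error::Error for IrrecoverableError {}

/// Child body: sums its inputs. On a faulty (negative) input it sends
/// context to the parent and returns early instead of panicking.
fn run_child(inputs: Vec<i64>, errors: Sender<IrrecoverableError>) -> i64 {
    let mut sum = 0;
    for x in inputs {
        if x < 0 {
            // Explicit signaling: report, then yield control flow.
            let _ = errors.send(IrrecoverableError {
                component: "child",
                context: format!("negative input {}", x),
            });
            return sum;
        }
        sum += x;
    }
    sum
}

/// Parent: launches one child per entry in `attempts`, wiring up a fresh
/// channel at launch time. If the child reported an error, the parent
/// records it and restarts only that child with the next inputs.
/// Returns the first clean result (if any) and the recorded error messages.
fn supervise(attempts: Vec<Vec<i64>>) -> (Option<i64>, Vec<String>) {
    let mut log = Vec::new();
    for inputs in attempts {
        let (tx, rx) = channel();
        let handle = thread::spawn(move || run_child(inputs, tx));
        let partial = handle.join().expect("child thread panicked");
        match rx.try_recv() {
            // The child reported an error: keep the context, then restart.
            Ok(err) => log.push(err.to_string()),
            // Nothing was sent: the child completed normally.
            Err(_) => return (Some(partial), log),
        }
    }
    (None, log)
}
```

Calling supervise(vec![vec![1, -2, 3], vec![1, 2, 3]]) launches the child twice: the first run reports the negative input and is discarded, the second completes. A parent that runs out of attempts could itself raise an irrecoverable error to its own parent, which is the escalation path mentioned in the last bullet.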

Summary

The goal for this task is to design a Component trait with:

  • a clear lifecycle,
  • the ability to start another component,
  • and the ability to supervise it for irrecoverable errors.
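
One possible shape for such a trait is sketched below, purely as a starting point for discussion. Everything here is an assumption: the method names beyond spawn and run, the Supervised handle, the supervise helper and the Flaky example component are hypothetical, and std threads again stand in for tokio tasks. The sketch also treats a panic inside a child as an irrecoverable error reported at join time, which is one way existing panic sites could be tolerated without refactoring each of them.

```rust
use std::fmt;
use std::sync::mpsc::{channel, Receiver};
use std::thread::{self, JoinHandle};

/// Hypothetical error payload carrying free-form context.
#[derive(Debug, Clone, PartialEq)]
pub struct IrrecoverableError(pub String);

impl fmt::Display for IrrecoverableError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "irrecoverable error: {}", self.0)
    }
}

impl std::error::Error for IrrecoverableError {}

/// Handle a parent keeps for each child it launched.
pub struct Supervised {
    name: &'static str,
    handle: JoinHandle<()>,
    errors: Receiver<IrrecoverableError>,
}

impl Supervised {
    /// Blocks until the child stops. Returns the error it reported, or a
    /// synthesized one if it panicked, or `None` on clean completion.
    pub fn wait(self) -> Option<IrrecoverableError> {
        match self.handle.join() {
            Err(_) => Some(IrrecoverableError(format!("{} panicked", self.name))),
            Ok(()) => self.errors.try_recv().ok(),
        }
    }
}

pub trait Component: Send + Sized + 'static {
    fn name(&self) -> &'static str;

    /// Runtime behavior. Returning `Err` raises an irrecoverable error.
    fn run(&mut self) -> Result<(), IrrecoverableError>;

    /// Initialization: starts `run` on its own thread and sets up the
    /// error channel between parent and child at launch time.
    fn spawn(mut self) -> Supervised {
        let (tx, rx) = channel();
        let name = self.name();
        let handle = thread::spawn(move || {
            if let Err(e) = self.run() {
                let _ = tx.send(e);
            }
        });
        Supervised { name, handle, errors: rx }
    }
}

/// Parent-side loop: builds and launches a child, and restarts it after
/// each irrecoverable error, up to `max_restarts` restarts.
/// Returns every error observed, in order.
pub fn supervise<C, F>(mut make: F, max_restarts: usize) -> Vec<IrrecoverableError>
where
    C: Component,
    F: FnMut() -> C,
{
    let mut seen = Vec::new();
    for _ in 0..=max_restarts {
        match make().spawn().wait() {
            None => break,
            Some(e) => seen.push(e),
        }
    }
    seen
}

/// Example component that fails on its first two launches.
pub struct Flaky {
    pub attempt: usize,
}

impl Component for Flaky {
    fn name(&self) -> &'static str {
        "flaky"
    }

    fn run(&mut self) -> Result<(), IrrecoverableError> {
        if self.attempt < 2 {
            Err(IrrecoverableError(format!("attempt {} saw faulty input", self.attempt)))
        } else {
            Ok(())
        }
    }
}
```

Taking a factory closure rather than a component value in supervise is a deliberate choice in this sketch: a restarted child is rebuilt from scratch, so state derived from faulty inputs is not carried into the next run.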

We in particular care about:

  • monitoring: is context about irrecoverable errors captured in enough detail to analyze each instance?
  • maintainability: can we transition our code base as it stands now to this component restart model, without having to refactor every possible point of panic?
  • usability: are devs able to raise an irrecoverable error as soon as they encounter conditions where continued execution could lead to corrupt behavior?
@gdanezis gdanezis added this to the Mainnet milestone Apr 19, 2022
@lanvidr lanvidr closed this as completed Apr 21, 2022
mwtian pushed a commit that referenced this issue Sep 12, 2022
Primary reconfiguration
mwtian pushed a commit to mwtian/sui that referenced this issue Sep 29, 2022
Primary reconfiguration