[reliability] Component restarts for our code bases #382

huitseeker · 2022-02-07T20:39:41Z

Context

This task is about modeling irrecoverable error management for the FastX & Narwhal code bases. The present URL (an issue on FastNFT) is probably not the best place for it, and the issue will probably have to move. This was already discussed in a brainstorming meeting whose slides and recording are stored here:
https://drive.google.com/drive/u/1/folders/1dRTqzcN3qlVDFNXcwPRIkOgTYArBi7Iq

User Story (Why?)

We want developers across our code bases to be able to raise an irrecoverable error locally. Such an error signals that our inputs and context are faulty and that processing cannot continue. Raising it should yield control flow without further consequences from the faulty state (e.g. we should not write to a database based on our faulty inputs).
But we do not want to model this irrecoverable error with a panic, because a panic would crash the process, and that remediation is too global: in a BFT system, we depend on 2/3 of authorities not crashing.
We want to have the ability to crash and restart smaller components, e.g. the VM of an authority, data synchronization modules, etc.

We expect that this will lead to better reliability for the code base, across the key processes we stand up in the code base:

a FastX authority,
a FastX client,
a Narwhal primary,
a Narwhal worker,
...

What?

We already have a bare-bones notion of components modeled as tokio tasks. Those components can be recognized as having an initialization function (fn spawn) and a runtime behavior function (fn run).
Components launch other components. We can call the launcher the parent, and the launchee a child.
The launch is a good time to set up communication between parent and child (e.g. they can share a (Sender|Receiver)<IrrecoverableError>, with IrrecoverableError: std::error::Error).
The child encountering an irrecoverable error can explicitly (through a channel) or implicitly (through scopeguard) send context about the error to its parent and yield control flow.
The parent can read the IrrecoverableError, do the appropriate signaling (& cleanup), and restart this component only (or itself generate an irrecoverable error and yield).

This pattern is known elsewhere as parental supervision.
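As a rough illustration of the parent/child flow described above, here is a minimal sketch that uses only the standard library: std::thread and std::sync::mpsc stand in for tokio tasks and tokio channels. The names supervise, run_child, failures and max_restarts are hypothetical, and the boxed error alias is just one possible way to satisfy IrrecoverableError: std::error::Error. It is not the design this issue asks for, only the shape of the explicit (channel-based) reporting path.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread;

// Assumption: a boxed error is used here for brevity; the issue only asks
// that IrrecoverableError implement std::error::Error.
pub type IrrecoverableError = Box<dyn std::error::Error + Send + Sync>;

// Hypothetical child body: it fails on the first `failures` attempts.
// On failure it sends context to the parent and yields (returns) without
// performing any further side effects.
fn run_child(attempt: u32, failures: u32, errors: &Sender<IrrecoverableError>) -> bool {
    if attempt < failures {
        let _ = errors.send(format!("faulty input on attempt {}", attempt).into());
        return false;
    }
    true
}

// Hypothetical parent: spawns the child, reads the reported error, and
// restarts only that child. Once the restart budget is spent, the parent
// itself raises an irrecoverable error and yields.
pub fn supervise(failures: u32, max_restarts: u32) -> Result<u32, IrrecoverableError> {
    let mut restarts = 0;
    loop {
        // The channel is set up at launch time and shared with the child.
        let (tx, rx) = channel::<IrrecoverableError>();
        let attempt = restarts;
        let handle = thread::spawn(move || run_child(attempt, failures, &tx));

        // A child that panicked is treated like a child that failed.
        if handle.join().unwrap_or(false) {
            return Ok(restarts);
        }

        // The child has yielded: read whatever context it sent.
        let err = rx
            .try_recv()
            .unwrap_or_else(|_| "child exited without context".into());

        if restarts == max_restarts {
            return Err(format!("restart budget exhausted, last error: {}", err).into());
        }
        restarts += 1; // signaling and cleanup would happen here
    }
}
```

Calling supervise(2, 3) restarts the child twice and then returns Ok(2), while supervise(5, 1) gives up after one restart and escalates with an error of its own.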

Summary

The goal for this task is to design a Component trait with:

a clear lifecycle,
the ability to start another component,
and the ability to supervise it for irrecoverable errors.
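To make the three requirements concrete, one possible shape for such a trait is sketched below. Everything in it is an assumption made for illustration: the Supervised handle, the wait method, the Validator example component and the boxed error alias are all hypothetical, and std::thread is used in place of tokio tasks so the sketch stays self-contained. Only the fn spawn / fn run split comes from the description above.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread::{self, JoinHandle};

// Assumption: boxed error used as a stand-in for IrrecoverableError.
pub type IrrecoverableError = Box<dyn std::error::Error + Send + Sync>;

pub trait Component: Send + 'static {
    // Runtime behavior. Returning Err means "stop this component".
    fn run(&mut self) -> Result<(), IrrecoverableError>;

    // Lifecycle start: launch the component and wire up the error channel
    // that its parent will use to supervise it.
    fn spawn(mut self) -> Supervised
    where
        Self: Sized,
    {
        let (tx, rx) = channel();
        let handle = thread::spawn(move || {
            if let Err(e) = self.run() {
                let _ = tx.send(e); // report context, then yield
            }
        });
        Supervised { handle, errors: rx }
    }
}

// Hypothetical handle held by the parent of a running component.
pub struct Supervised {
    handle: JoinHandle<()>,
    errors: Receiver<IrrecoverableError>,
}

impl Supervised {
    // Waits for the child to stop and returns the error it reported, if any.
    pub fn wait(self) -> Option<IrrecoverableError> {
        let _ = self.handle.join();
        self.errors.try_recv().ok()
    }
}

// Hypothetical example component: refuses to proceed on a negative input.
pub struct Validator {
    pub input: i64,
}

impl Component for Validator {
    fn run(&mut self) -> Result<(), IrrecoverableError> {
        if self.input < 0 {
            return Err(format!("negative input {}", self.input).into());
        }
        Ok(())
    }
}
```

A parent would call spawn on a child, then wait on the returned handle and decide whether to restart the child or escalate. A real design would likely need an asynchronous wait and a way to rebuild the child for a restart, neither of which this sketch covers.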

We care in particular about:

Monitoring: is context about irrecoverable errors captured in enough detail to analyze each instance?
Maintainability: can we transition our code base as it stands now to this component restart model, without having to refactor every possible point of panic?
Usability: are devs able to raise an irrecoverable error as soon as they encounter conditions where continued execution could lead to corrupt behavior?
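On the maintainability question, one option worth considering is to contain existing panics at the component boundary instead of rewriting each panic site. The sketch below shows the idea with std::thread, whose join returns an Err carrying the panic payload when the thread panicked (tokio's JoinHandle reports a panicked task in a similar way, through JoinError). The helper names run_contained and panic_message are hypothetical. Note the assumption that the binary is built with the default unwinding panic strategy: with panic = "abort" the whole process still dies.

```rust
use std::any::Any;
use std::thread;

// Assumption: boxed error used as a stand-in for IrrecoverableError.
pub type IrrecoverableError = Box<dyn std::error::Error + Send + Sync>;

// Turns a panic payload into a readable message. Payloads produced by
// panic! are usually a &str or a String.
fn panic_message(payload: Box<dyn Any + Send>) -> String {
    if let Some(s) = payload.downcast_ref::<&str>() {
        s.to_string()
    } else if let Some(s) = payload.downcast_ref::<String>() {
        s.clone()
    } else {
        "non-string panic payload".to_string()
    }
}

// Runs `body` as its own unit of failure: a panic inside it is caught at
// the join boundary and handed to the caller as an IrrecoverableError
// instead of taking down the process.
pub fn run_contained<F, T>(body: F) -> Result<T, IrrecoverableError>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::spawn(body)
        .join()
        .map_err(|payload| panic_message(payload).into())
}
```

With this wrapper, legacy code that still panics keeps working as a restartable component, and individual panic sites can be migrated to explicit irrecoverable errors over time.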

huitseeker added help wanted labels Feb 7, 2022

huitseeker assigned lanvidr Feb 8, 2022

huitseeker mentioned this issue Feb 8, 2022

Component restarts for our code bases MystenLabs/narwhal#50

Closed

lanvidr mentioned this issue Feb 23, 2022

work in progress MystenLabs/mysten-infra#25

Closed

lanvidr mentioned this issue Mar 15, 2022

Laura/components MystenLabs/mysten-infra#39

Merged

bholc646 removed help wanted labels Apr 6, 2022

gdanezis added this to the Mainnet milestone Apr 19, 2022

lanvidr closed this as completed Apr 21, 2022

mwtian pushed a commit that referenced this issue Sep 12, 2022

Primary epoch change (#382)

7b9c371

Primary reconfiguration

mwtian pushed a commit to mwtian/sui that referenced this issue Sep 29, 2022

Primary epoch change (MystenLabs#382)

af56e79

Primary reconfiguration