Context
This task is about modeling irrecoverable error management for the FastX & Narwhal code bases. The present URL (an issue on FastNFT) is probably not the best place for it, and the issue will probably have to move. The topic was already discussed in a brainstorming meeting whose slides and recording are stored here: https://drive.google.com/drive/u/1/folders/1dRTqzcN3qlVDFNXcwPRIkOgTYArBi7Iq
User Story (Why?)
We want developers across our code bases to be able to raise an irrecoverable error locally. We intend this error to signal that our inputs and context are faulty and that processing cannot continue, and to yield control flow without any further consequences of the error (e.g. we should not write to a database based on our faulty inputs).
But we do not want to model this irrecoverable error with a panic, because a panic would crash the process, and that remediation is too global: in a BFT system, we depend on 2/3 of authorities not crashing.
We want the ability to crash and restart smaller components, e.g. the VM of an authority, data synchronization modules, etc.
We expect this to improve reliability across the key processes we stand up in the code base:
a FastX authority,
a FastX client,
a Narwhal primary,
a Narwhal worker,
...
What?
We already have a bare-bones notion of components modeled as tokio tasks. Those components can be recognized by their initialization function (fn spawn) and their runtime behavior function (fn run).
Components launch other components. We can call the launcher the parent and the launchee a child.
Launch time is a good moment to set up communication between parent and child (e.g. they can share a (Sender|Receiver)<IrrecoverableError>, with IrrecoverableError: std::error::Error).
A child encountering an irrecoverable error can explicitly (through a channel) or implicitly (through scopeguard) send context about the error to its parent and yield control flow.
The parent can read the IrrecoverableError, do the appropriate signaling (& cleanup), and restart this component only (or itself raise an irrecoverable error and yield).
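The parent/child exchange described above could look roughly like the following. This is only a minimal sketch under stated assumptions: it uses std::thread and std::sync::mpsc instead of tokio tasks and tokio channels so that it is self-contained, and the IrrecoverableError fields, run_child, and supervise are hypothetical names chosen for illustration, not existing FastX or Narwhal code. Each entry in attempts stands in for one (re)start of the child.

```rust
use std::error::Error;
use std::fmt;
use std::sync::mpsc::{channel, Sender};
use std::thread;

/// Hypothetical error type carrying context about why processing cannot continue.
#[derive(Debug)]
pub struct IrrecoverableError {
    pub component: &'static str,
    pub context: String,
}

impl fmt::Display for IrrecoverableError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "irrecoverable error in {}: {}", self.component, self.context)
    }
}

impl Error for IrrecoverableError {}

/// Child body: on a faulty input it reports to the parent over the channel
/// and returns immediately, so nothing is computed from the faulty input.
fn run_child(inputs: Vec<i64>, errors: Sender<IrrecoverableError>) -> Option<i64> {
    let mut sum = 0i64;
    for x in inputs {
        if x < 0 {
            // Explicit signaling: send context to the parent, then yield.
            let _ = errors.send(IrrecoverableError {
                component: "child",
                context: format!("negative input {}", x),
            });
            return None;
        }
        sum += x;
    }
    Some(sum)
}

/// Parent: starts the child once per entry in `attempts` (each entry models
/// one restart), and stops at the first run that completes without an error.
/// Returns the result, if any, plus the error context collected on the way.
pub fn supervise(attempts: Vec<Vec<i64>>) -> (Option<i64>, Vec<String>) {
    let mut log = Vec::new();
    for inputs in attempts {
        let (tx, rx) = channel();
        let handle = thread::spawn(move || run_child(inputs, tx));
        // A panicking child is treated here like a child with no result.
        let result = handle.join().unwrap_or(None);
        // The sender is dropped when the child returns, so recv cannot block.
        if let Ok(err) = rx.recv() {
            log.push(err.to_string()); // signaling / cleanup would go here
            continue; // restart this component only
        }
        if result.is_some() {
            return (result, log);
        }
    }
    (None, log)
}
```

Calling supervise(vec![vec![1, -2, 3], vec![1, 2, 3]]) runs the child twice: the first run reports the negative input and yields, and the second run completes. A parent that exhausts its attempts gets (None, log) back and could in turn raise its own irrecoverable error to its own parent.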
Summary
The goal for this task is to design a Component trait with:
a clear lifecycle,
the ability to start another component,
and the ability to supervise it for irrecoverable errors.
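One possible shape for such a trait is sketched below, purely as a starting point for discussion. Everything in it is an assumption: the trait methods (init, run), the free functions spawn and supervise, the restart budget, and the Flaky example component are hypothetical, and std threads again stand in for tokio tasks to keep the sketch self-contained.

```rust
use std::error::Error;
use std::fmt;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread::{self, JoinHandle};

/// Hypothetical error type; a real one would carry richer context.
#[derive(Debug)]
pub struct IrrecoverableError(pub String);

impl fmt::Display for IrrecoverableError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

impl Error for IrrecoverableError {}

/// Hypothetical Component trait with a two-step lifecycle.
pub trait Component: Send + 'static {
    /// Lifecycle step 1: build a fresh instance (first start and restarts).
    fn init() -> Self
    where
        Self: Sized;
    /// Lifecycle step 2: runtime behavior. Ok(()) means a clean shutdown.
    fn run(&mut self) -> Result<(), IrrecoverableError>;
}

/// Start a component on its own thread.
pub fn spawn<C: Component>() -> JoinHandle<Result<(), IrrecoverableError>> {
    thread::spawn(|| C::init().run())
}

/// Supervise a component: restart it after each irrecoverable error, up to
/// `max_restarts` times. On clean shutdown, return the errors seen so far;
/// once the budget is spent, escalate with an error of our own.
pub fn supervise<C: Component>(
    max_restarts: usize,
) -> Result<Vec<IrrecoverableError>, IrrecoverableError> {
    let mut seen = Vec::new();
    loop {
        match spawn::<C>().join() {
            Ok(Ok(())) => return Ok(seen),
            Ok(Err(e)) if seen.len() < max_restarts => seen.push(e),
            Ok(Err(e)) => {
                return Err(IrrecoverableError(format!(
                    "giving up after {} restarts: {}",
                    seen.len(),
                    e
                )))
            }
            Err(_) => return Err(IrrecoverableError("child panicked".to_string())),
        }
    }
}

/// Example component (hypothetical): fails on its first two starts.
static STARTS: AtomicUsize = AtomicUsize::new(0);

pub struct Flaky;

impl Component for Flaky {
    fn init() -> Self {
        STARTS.fetch_add(1, Ordering::SeqCst);
        Flaky
    }

    fn run(&mut self) -> Result<(), IrrecoverableError> {
        if STARTS.load(Ordering::SeqCst) < 3 {
            Err(IrrecoverableError("faulty inputs".to_string()))
        } else {
            Ok(())
        }
    }
}
```

With this sketch, supervise::<Flaky>(5) restarts the component twice and then returns the two captured errors. Rebuilding the component through init on every restart is a deliberate choice here: it avoids reusing state that the faulty run may have left half-updated.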
We in particular care about:
monitoring: is context about irrecoverable errors captured in enough detail to analyze each instance?
maintainability: can we transition our code base as it stands now to this component restart model, without having to refactor every possible point of panic?
usability: are devs able to raise an irrecoverable error as soon as they encounter conditions where continued execution could lead to corrupt behavior?
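On the maintainability question, one conceivable transition step is to contain existing panics at the component boundary and convert them into an IrrecoverableError for the parent, so that individual panic sites do not need to be refactored up front. The sketch below illustrates the idea with std::thread, whose JoinHandle::join returns Err with the panic payload when the thread panicked. The names run_contained and panic_message and the fields of IrrecoverableError are hypothetical, and the approach assumes the binary is built with unwinding panics (it does not work with panic = "abort").

```rust
use std::any::Any;
use std::thread;

/// Hypothetical error type recording which component failed and why.
#[derive(Debug)]
pub struct IrrecoverableError {
    pub component: String,
    pub context: String,
}

/// Extract a readable message from a panic payload when it is a string.
fn panic_message(payload: Box<dyn Any + Send>) -> String {
    if let Some(s) = payload.downcast_ref::<&str>() {
        s.to_string()
    } else if let Some(s) = payload.downcast_ref::<String>() {
        s.clone()
    } else {
        "non-string panic payload".to_string()
    }
}

/// Run `body` on its own thread. A panic inside it ends only that thread
/// and reaches the caller as an IrrecoverableError instead of crashing
/// the whole process.
pub fn run_contained<T, F>(component: &str, body: F) -> Result<T, IrrecoverableError>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::spawn(body).join().map_err(|payload| IrrecoverableError {
        component: component.to_string(),
        context: panic_message(payload),
    })
}
```

For example, run_contained("vm", || 2 + 2) yields Ok(4), while a body that panics yields an Err whose context holds the panic message. The default panic hook still prints the message to stderr, which may help with the monitoring concern, although a panic message alone is likely less detailed than context attached deliberately at the point of failure.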
Note: the pattern described under "What?", in which a parent reads a child's IrrecoverableError, does the appropriate signaling (& cleanup), and restarts that component only, is known elsewhere as parental supervision.