Problem to be solved
We need more specific reasons as to why a duty failed. For example, if the duty got stuck at validatorAPI, one of the reasons might be that the VC didn't successfully submit a signed duty. Or if it got stuck at consensus, we can say that quorum might not have been reached.
At the same time, we also need to decide how specific we want to be. For example, when quorum was not met in the consensus event, was that because of a byzantine node or simply because some peers were down? Note that determining whether a node is byzantine is a difficult problem.
Currently, tracker logs a message like the one below when a duty fails:
10:57:21.000 WARN tracker Duty failed {"component": "parSigDBThreshold", "reason": "12826636/attester failed in parSigDBThreshold component", "duty": "12826636/attester"}
We want something like this:
10:57:21.000 WARN tracker Duty failed {"component": "parSigDBThreshold", "reason": "Not enough partial signatures in parSigDB", "duty": "12826636/attester"}
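The difference between the two log lines above is only in the "reason" field. A minimal sketch of that behaviour follows, assuming a hypothetical helper named failureFields (not taken from the tracker code): it uses a specific reason when one is known and otherwise falls back to the current generic text.

```go
package main

import "fmt"

// failureFields builds the fields of the "Duty failed" log message.
// Hypothetical helper for illustration; the name and signature are assumptions.
func failureFields(component, duty, specificReason string) map[string]string {
	reason := specificReason
	if reason == "" {
		// No specific reason known: keep the generic text the tracker logs today.
		reason = fmt.Sprintf("%s failed in %s component", duty, component)
	}

	return map[string]string{
		"component": component,
		"reason":    reason,
		"duty":      duty,
	}
}

func main() {
	fields := failureFields("parSigDBThreshold", "12826636/attester", "Not enough partial signatures in parSigDB")
	fmt.Println(fields["reason"])
}
```

Keeping the generic text as a fallback means a duty that fails in a way not yet covered still produces a usable log line.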
Proposed solution
Analyze the events for the duty to figure out the probable reason for failure. It is difficult to enumerate all the scenarios in which duties can fail, but here are some examples, along with the reason we could report for each:
Scheduler:
Duty not resolved:
It may happen if there are no active DVs.
It may also happen if either resolveProDuties() or resolveAttDuties() fails.
Fetcher:
Beacon node unavailable:
It may happen that fetcher isn't able to query the BN. We can say "couldn't fetch due to unavailable beacon node".
A proposer duty may fail due to a failed randao duty. Randao can fail if there is an insufficient number of partial signatures.
Consensus:
Quorum not reached:
It may happen that some peers are down at the time of consensus. We can say that "quorum not reached due to insufficient peers".
Or, it might just happen that a byzantine node is thwarting consensus.
ValidatorAPI:
validatorAPI may fail to query data from BN. So, it doesn't provide VC with any data to sign. We can say "validatorAPI is not able to connect to BN".
Or, the issue might be entirely on the VC side. For ex, VC is not properly connected to BN or isn't providing signed data to charon.
ParSigDBInternal:
We can say "couldn't save partially signed duty data set received from VC".
ParSigEx:
Peers down:
Duties can fail at ParSigEx if not enough peers submit their partial signatures. We can say "not enough peers".
Invalid sigs:
It may also happen if byzantine peers send invalid signatures which fail at the verification step.
ParSigDBThreshold:
We can say "could not reach threshold".
SigAgg:
Not enough partial signatures:
SigAgg can't complete if there aren't enough partial signatures. We can say "not enough peers".
Bcast:
Beacon node unavailable:
Broadcast may fail when charon can't connect to BN, either because BN is down or because charon has connectivity issues. We can say "bcast failed as charon can't connect to BN".
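The per-component messages proposed above could be collected in a simple lookup table. This is only a sketch: the map name, the helper reasonFor, and the exact spelling of the component keys (other than validatorAPI, consensus and parSigDBThreshold, which appear in the text and log above) are assumptions. It stores one default reason per component, so cases with several candidate causes (a byzantine node, or a problem on the VC side) would need more detail from the events.

```go
package main

import "fmt"

// defaultReasons maps a component to one of the reason strings proposed in this issue.
// Key spellings are assumed for illustration.
var defaultReasons = map[string]string{
	"scheduler":         "duty not resolved",
	"fetcher":           "couldn't fetch due to unavailable beacon node",
	"consensus":         "quorum not reached due to insufficient peers",
	"validatorAPI":      "validatorAPI is not able to connect to BN",
	"parSigDBInternal":  "couldn't save partially signed duty data set received from VC",
	"parSigEx":          "not enough peers",
	"parSigDBThreshold": "could not reach threshold",
	"sigAgg":            "not enough peers",
	"bcast":             "bcast failed as charon can't connect to BN",
}

// reasonFor returns the default reason for a component and whether one is defined.
func reasonFor(component string) (string, bool) {
	reason, ok := defaultReasons[component]
	return reason, ok
}

func main() {
	if reason, ok := reasonFor("consensus"); ok {
		fmt.Println(reason)
	}
}
```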
Out of Scope
Use ONLY the events in tracker to determine the possible reason. Ex: if we suspect that BN is down, don't ping to find out.
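To illustrate what relying only on tracker events could look like, here is a rough sketch that finds where a duty got stuck from the set of components that recorded an event for it, without any network calls. The function name, the component spellings, and the assumption that the workflow order matches the order of the list in the proposed solution are all hypothetical.

```go
package main

import "fmt"

// components lists the steps in the order this issue discusses them.
// That this matches the real workflow order is an assumption.
var components = []string{
	"scheduler", "fetcher", "consensus", "validatorAPI",
	"parSigDBInternal", "parSigEx", "parSigDBThreshold", "sigAgg", "bcast",
}

// failedComponent returns the first component with no recorded event for the duty.
// It inspects only the events already seen; it never pings a BN or a peer.
func failedComponent(seen map[string]bool) (string, bool) {
	for _, c := range components {
		if !seen[c] {
			return c, true
		}
	}

	// Every component reported an event, so the duty did not get stuck.
	return "", false
}

func main() {
	seen := map[string]bool{"scheduler": true, "fetcher": true, "consensus": true}
	if c, stuck := failedComponent(seen); stuck {
		fmt.Println("duty stuck at", c)
	}
}
```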