Revisit DKG timeout strategy and add test coverage #465

@therustmonk

Description

Pluto's DKG timeout works differently from Charon's. Charon splits the timeout per phase (conf.Timeout / 6); Pluto applies one overall timeout to the whole run, because our sync service runs across all phases and can't be cut per phase (we tried, and it killed healthy runs).
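The difference between the two strategies is mostly arithmetic. Below is a minimal sketch, assuming the roughly 60 s timeout mentioned in this issue and the six phases implied by `conf.Timeout / 6`. The function names are hypothetical and are not taken from either codebase.

```rust
use std::time::Duration;

/// Phase count implied by Charon's `conf.Timeout / 6` split.
const PHASES: u32 = 6;

/// Charon-style: every phase gets an equal slice of the configured timeout.
/// (Hypothetical helper, for illustration only.)
fn per_phase_timeout(total: Duration) -> Duration {
    total / PHASES
}

/// Worst-case time before a stuck peer is noticed under each strategy.
/// Returns (per-phase strategy, single overall timeout).
fn stuck_peer_detection(total: Duration) -> (Duration, Duration) {
    (per_phase_timeout(total), total)
}
```

With `Duration::from_secs(60)` as input, the per-phase slice is 10 s while the single overall timeout only fires after the full 60 s.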

Two gaps follow from this. First, a stuck peer is only caught after the full timeout (~60s) instead of after one phase (~10s). Second, with many validators (untested), a healthy run takes longer and could hit the timeout and be killed by mistake. This is made worse by parsigex already using the full conf.timeout per exchange.
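One possible way to keep the two timeouts consistent is to derive each exchange's timeout from what is left of the overall budget. This is only a sketch of that idea, not the current Pluto code: `Budget` and its methods are hypothetical names, and the real fix may look different.

```rust
use std::time::{Duration, Instant};

/// One overall budget for the whole ceremony (hypothetical type).
struct Budget {
    deadline: Instant,
}

impl Budget {
    fn new(start: Instant, total: Duration) -> Self {
        Budget { deadline: start + total }
    }

    /// Time left at `now`; zero once the deadline has passed.
    fn remaining(&self, now: Instant) -> Duration {
        self.deadline.saturating_duration_since(now)
    }

    /// Timeout for a single exchange: the configured per-exchange value,
    /// clamped so it can never outlive the overall budget.
    fn exchange_timeout(&self, now: Instant, configured: Duration) -> Duration {
        configured.min(self.remaining(now))
    }
}
```

Under this scheme an exchange that starts 45 s into a 60 s run would get at most 15 s, even if its configured timeout is the full 60 s.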

What works today: a stuck peer aborts cleanly, and a small healthy cluster finishes fine. Not covered: runs with many validators, and there is no automated test for the timeout.

Benefit (The "Why")

A DKG should never hang forever on a stuck peer, but it also must never kill a healthy ceremony by mistake. Closing these gaps makes the timeout reliable for real clusters (many validators), gives faster failure detection, and protects it with an automated test.

Acceptance Criteria

  • A healthy DKG with many validators completes without a false timeout under the default --timeout.
  • The relationship between the overall timeout and the per-exchange parsigex timeout is consistent (one budget is not silently exceeded by the other).
  • A stuck/absent peer still aborts cleanly with DkgError::Timeout.
  • An automated test covers the overall timeout (stuck peer aborts) and the healthy path (no false timeout).
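The last criterion could be exercised with a test shaped roughly like the sketch below. It uses only the standard library and a simulated peer, so everything here is an assumption for illustration: the `DkgError` enum is a local stand-in (only the `Timeout` variant is named in this issue), and `Peer` and `run_ceremony` are hypothetical. A real test would drive the actual DKG entry point instead.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Local stand-in for the real error type.
#[derive(Debug, PartialEq)]
enum DkgError {
    Timeout,
}

/// Simulated peer behaviour (hypothetical, test-only).
enum Peer {
    /// Responds after the given delay.
    Healthy(Duration),
    /// Never responds.
    Stuck,
}

/// Waits for one message from the peer, bounded by the overall timeout.
fn run_ceremony(peer: Peer, timeout: Duration) -> Result<(), DkgError> {
    let (tx, rx) = mpsc::channel::<()>();
    // Keep one sender alive so a silent peer looks stuck, not disconnected.
    let _keep_alive = tx.clone();
    if let Peer::Healthy(delay) = peer {
        thread::spawn(move || {
            thread::sleep(delay);
            let _ = tx.send(());
        });
    }
    rx.recv_timeout(timeout).map_err(|_| DkgError::Timeout)
}
```

A stuck peer should yield `Err(DkgError::Timeout)` once the timeout elapses, while a healthy peer that answers well inside the timeout should yield `Ok(())`. Using short durations (tens of milliseconds) keeps such a test fast.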
