feat(self_healing): add LEP-6 chain-driven heal-op dispatch runtime #289

Merged
j-rafique merged 1 commit into supernode/LEP-6-chain-client-extensions from supernode/LEP-6-heal-op-dispatch
May 4, 2026
Conversation

Contributor

@j-rafique j-rafique commented May 4, 2026

Replaces the gonode-era peer-watchlist self-healing with a chain-mediated LEP-6 §18-§22 (Workstream C) implementation. The healer reconstructs locally and STAGES the result (no KAD publish). Verifiers fetch the reconstructed bytes from the assigned healer over a streaming gRPC RPC (§19 healer-served path) and hash-compare them against op.ResultHash. The staged artefacts are published to KAD only after the chain reaches VERIFIED quorum.

Three-phase flow

Phase 1 — RECONSTRUCT (no publish)

cascade.RecoveryReseed(PersistArtifacts=false, StagingDir) →
download remaining symbols → RaptorQ-decode → verify file hash
against Action.DataHash → re-encode → stage symbols+idFiles+layout
+reconstructed.bin to ~/.supernode/heal-staging/<op_id>/. Submit
MsgClaimHealComplete{HealManifestHash}; chain transitions
SCHEDULED → HEALER_REPORTED, sets op.ResultHash = HealManifestHash.

Phase 2 — VERIFY (§19 healer-served path)

Verifier opens supernode.SelfHealingService/ServeReconstructedArtefacts
on the assigned healer (op.HealerSupernodeAccount), streams the
reconstructed bytes, computes BLAKE3 base64 (=Action.DataHash recipe
via cascadekit.ComputeBlake3DataHashB64), compares against
op.ResultHash (NOT Action.DataHash — chain enforces at
lumera/x/audit/v1/keeper/msg_storage_truth.go:291), and submits
MsgSubmitHealVerification{verified, hash}. Chain quorum n/2+1.
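
The quorum rule above is plain integer arithmetic. A minimal sketch, assuming n is the number of verifiers assigned to the heal-op (the description does not define n, so that reading is an assumption); `quorum` and `reached` are hypothetical helper names, not functions from the chain module:

```go
package main

import "fmt"

// quorum returns the strict-majority threshold n/2+1 using integer division.
// Assumption: n is the number of verifiers assigned to the heal-op.
func quorum(n int) int {
	if n <= 0 {
		return 0
	}
	return n/2 + 1
}

// reached reports whether the number of positive votes meets the threshold.
func reached(positive, n int) bool {
	return n > 0 && positive >= quorum(n)
}

func main() {
	for _, n := range []int{1, 2, 3, 4, 5} {
		fmt.Printf("verifiers=%d quorum=%d\n", n, quorum(n))
	}
}
```

With an even verifier count the threshold is more than half, so a 2-of-4 split does not reach quorum.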

Phase 3 — PUBLISH (only on VERIFIED)

Finalizer polls heal_claims_submitted (Opt 2b per-op poll, folded
into single tick loop alongside healer + verifier dispatch), reads
op.Status, calls cascade.PublishStagedArtefacts on VERIFIED (same
storeArtefacts path as register/upload), deletes staging on
FAILED/EXPIRED. Chain may reschedule a different healer on
EXPIRED.
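
The finalizer's routing can be read as a small decision table. The sketch below illustrates it under stated assumptions: the `Status` and `Action` types and their string values are hypothetical stand-ins for the chain's heal-op status enum and for the finalizer's internal steps.

```go
package main

import "fmt"

// Status mirrors the heal-op states named in this description.
// The string values are illustrative; the chain's enum may be spelled differently.
type Status string

const (
	StatusScheduled      Status = "SCHEDULED"
	StatusHealerReported Status = "HEALER_REPORTED"
	StatusVerified       Status = "VERIFIED"
	StatusFailed         Status = "FAILED"
	StatusExpired        Status = "EXPIRED"
)

// Action is a hypothetical label for what the finalizer does on one tick.
type Action string

const (
	ActionNone              Action = "no-op"
	ActionPublishAndCleanup Action = "publish+cleanup"
	ActionCleanupOnly       Action = "cleanup"
)

// decide maps a heal-op status to the finalizer action described above:
// publish only on VERIFIED, delete staging on FAILED/EXPIRED, otherwise wait.
func decide(s Status) Action {
	switch s {
	case StatusVerified:
		return ActionPublishAndCleanup
	case StatusFailed, StatusExpired:
		return ActionCleanupOnly
	default:
		return ActionNone
	}
}

func main() {
	all := []Status{StatusScheduled, StatusHealerReported, StatusVerified, StatusFailed, StatusExpired}
	for _, s := range all {
		fmt.Printf("%-16s -> %s\n", s, decide(s))
	}
}
```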

Crash-recovery / restart-safety

Submit-then-persist ordering: SQLite dedup row is written ONLY after
chain has accepted the tx. A failed submit (mempool, signing, chain
reject) leaves no row and staging is removed, so the next tick can
retry cleanly. If chain accepted a prior submit but the supernode
crashed before persisting, the next tick's resubmit fails with "does
not accept healer completion claim" and reconcileExistingClaim
re-fetches the heal-op, confirms chain ResultHash equals our manifest,
and persists the dedup row so finalizer takes over.

Negative-attestation hash: chain rejects empty VerificationHash even
on verified=false (msg_storage_truth.go:271-273). Verifier synthesizes
a deterministic non-empty placeholder
(sha256("lep6:negative-attestation:"+reason) base64) on fetch_failed
and hash_compute_failed paths. Chain only validates VerificationHash
content for positive votes (msg_storage_truth.go:288-294), so any
non-empty value is well-formed for negatives.
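
A minimal sketch of the placeholder recipe quoted above, using only the Go standard library. The function name `negativeAttestationHash` is hypothetical, and standard (padded) base64 is assumed because the description says only "base64":

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// negativeAttestationHash derives a deterministic, non-empty placeholder for
// verified=false votes: base64(sha256("lep6:negative-attestation:" + reason)).
// Assumption: standard padded base64 encoding.
func negativeAttestationHash(reason string) string {
	sum := sha256.Sum256([]byte("lep6:negative-attestation:" + reason))
	return base64.StdEncoding.EncodeToString(sum[:])
}

func main() {
	for _, reason := range []string{"fetch_failed", "hash_compute_failed"} {
		fmt.Printf("%s -> %s\n", reason, negativeAttestationHash(reason))
	}
}
```

Because the value depends only on the reason string, a resubmission after a restart carries the same hash.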

Components added

supernode/self_healing/
  service.go      Single tick loop; mode gate (UNSPECIFIED skips);
                  healer dispatch; verifier dispatch; finalizer poll;
                  sync.Map in-flight + buffered semaphores
                  (reconstructs=2, verifications=4, publishes=2).
  healer.go       Phase 1: submit-then-persist ordering;
                  reconcileExistingClaim handles post-crash recovery
                  when chain accepted a prior submit.
  verifier.go     Phase 2: fetch from assigned healer, retry with
                  exponential backoff (3 attempts), submit verified=
                  false with non-empty placeholder hash on persistent
                  fetch failure; positive-path hash compares against
                  op.ResultHash; reconciles chain-side
                  "verification already submitted" idempotency.
  finalizer.go    Phase 3: VERIFIED → publish + cleanup; FAILED/
                  EXPIRED → cleanup only; transient states no-op.
  peer_client.go  secureVerifierFetcher dials via the same
                  secure-rpc / lumeraid stack the legacy
                  storage_challenge loop uses.

supernode/transport/grpc/self_healing/handler.go
Streaming ServeReconstructedArtefacts RPC.
DefaultCallerIdentityResolver pulls verifier identity from the
secure-rpc (Lumera ALTS) handshake via
pkg/reachability.GrpcRemoteIdentityAndAddr — production wiring uses
this so req.VerifierAccount is never trusted alone. Serves only when
caller ∈ op.VerifierSupernodeAccounts AND the serving node's identity ==
op.HealerSupernodeAccount; refuses with FailedPrecondition when this
node is not the assigned healer and PermissionDenied for unassigned callers.
1 MiB chunks.

proto/supernode/self_healing.proto
SelfHealingService { ServeReconstructedArtefacts streams chunks }.
Makefile gen-supernode wires it; gen/supernode/self_healing*.pb.go
regenerated.

supernode/cascade/reseed.go
Split RecoveryReseed: PersistArtifacts=true (legacy/republish) vs
PersistArtifacts=false (LEP-6 stage-only). Adds stageArtefacts +
PublishStagedArtefacts. Stages reconstructed file bytes and a
JSON manifest the §19 transport reads.
supernode/cascade/staged.go
ReadStagedHealOp helper used by the transport handler.
supernode/cascade/interfaces.go
CascadeTask interface gains RecoveryReseed + PublishStagedArtefacts
so self_healing depends only on the factory abstraction.

pkg/storage/queries/self_healing_lep6.go
Tables heal_claims_submitted (PK heal_op_id) and
heal_verifications_submitted (PK (heal_op_id, verifier_account))
for restart dedup. Typed sentinel errors
ErrLEP6ClaimAlreadyRecorded / ErrLEP6VerificationAlreadyRecorded.
Migrations wired in OpenHistoryDB.
pkg/storage/queries/local.go
LocalStoreInterface embeds LEP6HealQueries.

supernode/config/config.go
SelfHealingConfig YAML block (enabled, poll_interval_ms,
max_concurrent_*, staging_dir, verifier_fetch_timeout_ms,
verifier_fetch_attempts). Default disabled until activation.

supernode/cmd/start.go
Constructs selfHealingService.Service + selfHealingRPC.Server
(with DefaultCallerIdentityResolver) when SelfHealingConfig.Enabled,
registers SelfHealingService_ServiceDesc on the gRPC server,
appends the runner to the lifecycle services list. Reuses cService
(cascade factory) and historyStore.

Tests (16 mandatory; all PASS)

supernode/self_healing/service_test.go
  1.  TestVerifier_ReadsOpResultHashForComparison       (R-bug pin)
  2.  TestVerifier_HashMismatchProducesVerifiedFalse
  2b. TestVerifier_FetchFailureSubmitsNonEmptyHash      (BLOCKER pin)
  3.  TestVerifier_FetchesFromAssignedHealerOnly        (§19 gate)
  6.  TestHealer_FailedSubmitDoesNotPersistDedupRow     (ordering)
  6b. TestHealer_ReconcilesExistingChainClaimAfterCrash (recovery)
  7.  TestHealer_RaptorQReconstructionFailureSkipsClaim (Scenario C1)
  8.  TestFinalizer_VerifiedTriggersPublishToKAD        (Scenario A)
  9.  TestFinalizer_FailedSkipsPublish_DeletesStaging   (Scenario B)
  10. TestFinalizer_ExpiredSkipsPublish_DeletesStaging  (Scenario C2)
  11. TestService_NoRoleSkipsOp
  12. TestService_UnspecifiedModeSkipsEntirely          (mode gate)
  13. TestService_FinalStateOpsIgnored
  14. TestDedup_RestartDoesNotResubmit                  (3-layer dedup)
supernode/transport/grpc/self_healing/handler_test.go
  4.  TestServeReconstructedArtefacts_AuthorizesOnlyAssignedVerifiers
  5.  TestServeReconstructedArtefacts_RejectsUnassignedCaller
      (also covers non-assigned-healer FailedPrecondition refusal)
pkg/storage/queries/self_healing_lep6_test.go
  TestLEP6_HealClaim_RoundTripAndDedup
  TestLEP6_HealVerification_PerVerifierDedup

Validation

go test ./supernode/self_healing/...                 PASS (2.66s)
go test ./supernode/transport/grpc/self_healing/...  PASS (0.09s)
go test ./supernode/cascade/...                      PASS (0.09s)
go test ./pkg/storage/queries/...                    PASS (0.20s)
go test ./pkg/storagechallenge/... ./supernode/storage_challenge \
        ./supernode/host_reporter ./pkg/lumera/modules/audit \
        ./pkg/lumera/modules/audit_msg               PASS
go vet (touched + all transitively reachable pkgs)   PASS
go build (targeted)                                  PASS
(full repo go build fails only on pre-existing
 github.com/kolesa-team/go-webp libwebp-dev system-header issue;
 unrelated to this change.)

Resolved decisions applied

✓ Branch base: PR-3 tip f79f88f, NOT self-healing-improvements
(single chain-driven service per Bilal direction; legacy 3-way
Request/Verify/Commit RPC discarded).
✓ Verifier compares against op.ResultHash (chain msg_storage_truth.go
:291). Pinned by TestVerifier_ReadsOpResultHashForComparison.
✓ Hash recipe = cascadekit.ComputeBlake3DataHashB64 (=Action.DataHash
recipe). Same recipe healer + verifier + chain enforce.
✓ KAD publish AFTER chain VERIFIED (§19 healer-served-path gate);
staging directory is the only authority before quorum.
✓ Finalizer mechanism: Opt 2b (per-op GetHealOp poll, folded into
single tick loop) — no Tendermint WS, no monotonic-growth poll.
✓ Concurrency default: semaphore=2 reconstructs (RaptorQ RAM-aware),
4 verifications, 2 publishes.
✓ Mode gate: UNSPECIFIED skips dispatcher entirely (Service.tick
early-return; verified by TestService_UnspecifiedModeSkipsEntirely).
✓ Three-layer dedup: sync.Map + bounded semaphores + SQLite
(heal_claims_submitted + heal_verifications_submitted).
✓ Submit-then-persist ordering with reconcile path for crash recovery.
✓ Non-empty placeholder VerificationHash on negative attestations
(chain rejects empty regardless of verified bool).
✓ Caller authentication via secure-rpc / Lumera ALTS handshake at
transport layer; req.VerifierAccount never trusted alone in
production.
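
The first two dedup layers (an in-flight sync.Map plus a bounded semaphore) can be sketched as below. The `dispatcher` type and `tryRun` are hypothetical names, and skipping rather than blocking when the semaphore is full is an assumption, chosen here so a tick never stalls:

```go
package main

import (
	"fmt"
	"sync"
)

// dispatcher combines an in-flight set with a buffered-channel semaphore.
type dispatcher struct {
	inFlight sync.Map      // key -> struct{}{} while the op is running
	sem      chan struct{} // capacity bounds concurrent work
	wg       sync.WaitGroup
}

func newDispatcher(limit int) *dispatcher {
	return &dispatcher{sem: make(chan struct{}, limit)}
}

// tryRun starts work for key unless it is already running or the semaphore is
// full. It returns true only if the work was started.
func (d *dispatcher) tryRun(key string, work func()) bool {
	if _, loaded := d.inFlight.LoadOrStore(key, struct{}{}); loaded {
		return false // same op already in flight
	}
	select {
	case d.sem <- struct{}{}:
	default:
		d.inFlight.Delete(key) // no capacity: let a later tick retry
		return false
	}
	d.wg.Add(1)
	go func() {
		defer d.wg.Done()            // runs last
		defer d.inFlight.Delete(key) // runs second
		defer func() { <-d.sem }()   // runs first: free the slot
		work()
	}()
	return true
}

func main() {
	d := newDispatcher(2) // e.g. the reconstruct limit from the list above
	release := make(chan struct{})
	block := func() { <-release }
	fmt.Println(d.tryRun("op-1", block))
	fmt.Println(d.tryRun("op-1", block))
	fmt.Println(d.tryRun("op-2", block))
	fmt.Println(d.tryRun("op-3", block))
	close(release)
	d.wg.Wait()
}
```

The third layer, the SQLite rows, is what survives a restart; the two in-memory layers only prevent duplicate work within a single process lifetime.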

Plan: docs/plans/LEP6_PR4_EXECUTION_PLAN.md

@j-rafique j-rafique self-assigned this May 4, 2026

roomote-v0 Bot commented May 4, 2026

All 3 issues from the previous review have been addressed. No new issues found. Clean to merge.

  • Staging dir leak on hash mismatch -- reconcileExistingClaim in healer.go now calls os.RemoveAll(stagingDir) before returning nil on hash mismatch. Pinned by TestHealer_ReconcileHashMismatchCleansStagingWithoutPersisting.
  • Data race in ensureClient() -- secureVerifierFetcher.ensureClient() in peer_client.go now guards reads/writes of f.grpcClient with f.mu.Lock()/defer f.mu.Unlock(), matching lep6_client_factory.go.
  • Finalizer not-found handling diverges from doc -- finalizeClaim in finalizer.go now detects not-found via isChainHealOpNotFound(err) and routes to cleanupClaim, matching the documented behavior. Pinned by TestFinalizer_NotFoundCleansClaimAndStaging.
Comment thread supernode/self_healing/healer.go
Comment thread supernode/self_healing/peer_client.go
Comment thread supernode/self_healing/finalizer.go
@j-rafique j-rafique force-pushed the supernode/LEP-6-heal-op-dispatch branch from ee952ae to 9ab87cb Compare May 4, 2026 15:25
@j-rafique j-rafique merged commit d86c679 into supernode/LEP-6-chain-client-extensions May 4, 2026
6 checks passed