feat(self_healing): add LEP-6 chain-driven heal-op dispatch runtime #289
Merged
j-rafique merged 1 commit into supernode/LEP-6-chain-client-extensions from May 4, 2026
All 3 issues from the previous review have been addressed. No new issues found. Clean to merge.
Replaces the gonode-era peer-watchlist self-healing with a chain-mediated
LEP-6 §18-§22 (Workstream C) implementation. The healer reconstructs
locally and STAGES the result (no KAD publish). Verifiers fetch the
reconstructed bytes from the assigned healer over a streaming gRPC RPC
(§19 healer-served path) and hash-compare them against op.ResultHash.
Artefacts are published to KAD only after the chain reaches VERIFIED
quorum.
Three-phase flow
Phase 1 — RECONSTRUCT (no publish)
cascade.RecoveryReseed(PersistArtifacts=false, StagingDir) →
download remaining symbols → RaptorQ-decode → verify file hash
against Action.DataHash → re-encode → stage symbols+idFiles+layout
+reconstructed.bin to ~/.supernode/heal-staging/<op_id>/. Submit
MsgClaimHealComplete{HealManifestHash}; chain transitions
SCHEDULED → HEALER_REPORTED, sets op.ResultHash = HealManifestHash.
Phase 2 — VERIFY (§19 healer-served path)
Verifier opens supernode.SelfHealingService/ServeReconstructedArtefacts
on the assigned healer (op.HealerSupernodeAccount), streams the
reconstructed bytes, computes BLAKE3 base64 (=Action.DataHash recipe
via cascadekit.ComputeBlake3DataHashB64), compares against
op.ResultHash (NOT Action.DataHash — chain enforces at
lumera/x/audit/v1/keeper/msg_storage_truth.go:291), and submits
MsgSubmitHealVerification{verified, hash}. Chain quorum n/2+1.
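The Phase 2 decision can be sketched roughly as below. This is an illustration only: `HealOp`, `verdict` and `quorum` are hypothetical names, the computed hash is passed in as a string so the sketch does not depend on the BLAKE3 helper, and reading "n/2+1" as integer division is an assumption.

```go
package main

import "fmt"

// HealOp is a hypothetical, trimmed-down view of the on-chain heal-op.
type HealOp struct {
	ResultHash string // set by the chain from the healer's HealManifestHash
}

// verdict reports whether the hash computed over the fetched bytes matches
// op.ResultHash. Action.DataHash is deliberately not consulted here.
func verdict(op HealOp, computedHash string) bool {
	return computedHash != "" && computedHash == op.ResultHash
}

// quorum returns n/2+1 using integer division (an assumed reading of the
// "n/2+1" rule quoted above).
func quorum(n int) int {
	return n/2 + 1
}

func main() {
	op := HealOp{ResultHash: "abc"}
	fmt.Println("match:", verdict(op, "abc"), "mismatch:", verdict(op, "xyz"))
	fmt.Println("quorum for 3, 4, 5 verifiers:", quorum(3), quorum(4), quorum(5))
}
```

A mismatch yields verified=false rather than an error, which matches the behaviour pinned by TestVerifier_HashMismatchProducesVerifiedFalse.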
Phase 3 — PUBLISH (only on VERIFIED)
Finalizer polls heal_claims_submitted (Opt 2b per-op poll, folded
into single tick loop alongside healer + verifier dispatch), reads
op.Status, calls cascade.PublishStagedArtefacts on VERIFIED (same
storeArtefacts path as register/upload), deletes staging on
FAILED/EXPIRED. Chain may reschedule a different healer on
EXPIRED.
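As a rough sketch of the finalizer's per-op decision described above: the `Status` string values and the `Action` type are illustrative assumptions, not the chain's actual enum encoding or the package's real types.

```go
package main

import "fmt"

// Status mirrors the heal-op states named in the text; the string values
// are illustrative only.
type Status string

const (
	StatusScheduled      Status = "SCHEDULED"
	StatusHealerReported Status = "HEALER_REPORTED"
	StatusVerified       Status = "VERIFIED"
	StatusFailed         Status = "FAILED"
	StatusExpired        Status = "EXPIRED"
)

// Action is the finalizer's decision for one polled op (hypothetical type).
type Action int

const (
	ActionNone              Action = iota // transient state: check again next tick
	ActionPublishAndCleanup               // publish staged artefacts, then delete staging
	ActionCleanupOnly                     // delete staging without publishing
)

// decide maps an op status to the finalizer action.
func decide(s Status) Action {
	switch s {
	case StatusVerified:
		return ActionPublishAndCleanup
	case StatusFailed, StatusExpired:
		return ActionCleanupOnly
	default:
		return ActionNone
	}
}

func main() {
	for _, s := range []Status{StatusHealerReported, StatusVerified, StatusFailed, StatusExpired} {
		fmt.Printf("%s -> action %d\n", s, decide(s))
	}
}
```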
Crash-recovery / restart-safety
Submit-then-persist ordering: SQLite dedup row is written ONLY after
chain has accepted the tx. A failed submit (mempool, signing, chain
reject) leaves no row and staging is removed, so the next tick can
retry cleanly. If chain accepted a prior submit but the supernode
crashed before persisting, the next tick's resubmit fails with "does
not accept healer completion claim" and reconcileExistingClaim
re-fetches the heal-op, confirms chain ResultHash equals our manifest,
and persists the dedup row so finalizer takes over.
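The ordering above might look roughly like this. The `Chain` and `Store` interfaces, the sentinel `ErrClaimNotAccepted`, and the in-memory fakes are all hypothetical stand-ins for illustration; the commit message describes the rejection as an error string, and staging cleanup is omitted here.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrClaimNotAccepted stands in for the chain's "does not accept healer
// completion claim" rejection (hypothetical sentinel).
var ErrClaimNotAccepted = errors.New("does not accept healer completion claim")

type Chain interface {
	SubmitClaim(opID, manifestHash string) error
	ResultHash(opID string) (string, error)
}

type Store interface {
	RecordClaim(opID, manifestHash string) error
}

// claimHealComplete submits first and persists the dedup row only after the
// chain accepted the tx, or after reconciling a claim accepted before a crash.
func claimHealComplete(c Chain, s Store, opID, manifestHash string) error {
	err := c.SubmitClaim(opID, manifestHash)
	if err == nil {
		return s.RecordClaim(opID, manifestHash)
	}
	if !errors.Is(err, ErrClaimNotAccepted) {
		return err // failed submit: no row written, next tick retries
	}
	onChain, gerr := c.ResultHash(opID)
	if gerr != nil {
		return gerr
	}
	if onChain != manifestHash {
		return fmt.Errorf("chain ResultHash %q differs from local manifest %q", onChain, manifestHash)
	}
	return s.RecordClaim(opID, manifestHash)
}

// In-memory fakes used only to exercise the sketch.
type fakeChain struct{ claims map[string]string }

func (f *fakeChain) SubmitClaim(opID, h string) error {
	if _, ok := f.claims[opID]; ok {
		return ErrClaimNotAccepted
	}
	f.claims[opID] = h
	return nil
}

func (f *fakeChain) ResultHash(opID string) (string, error) {
	h, ok := f.claims[opID]
	if !ok {
		return "", errors.New("heal-op not found")
	}
	return h, nil
}

type fakeStore struct{ rows map[string]string }

func (f *fakeStore) RecordClaim(opID, h string) error { f.rows[opID] = h; return nil }

func main() {
	// Simulate a crash after the chain accepted the claim but before persisting.
	chain := &fakeChain{claims: map[string]string{"op-1": "m1"}}
	store := &fakeStore{rows: map[string]string{}}
	err := claimHealComplete(chain, store, "op-1", "m1")
	fmt.Println("err:", err, "row:", store.rows["op-1"])
}
```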
Negative-attestation hash: chain rejects empty VerificationHash even
on verified=false (msg_storage_truth.go:271-273). Verifier synthesizes
a deterministic non-empty placeholder
(sha256("lep6:negative-attestation:"+reason) base64) on fetch_failed
and hash_compute_failed paths. Chain only validates VerificationHash
content for positive votes (msg_storage_truth.go:288-294), so any
non-empty value is well-formed for negatives.
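A minimal sketch of the placeholder recipe quoted above. The function name is hypothetical, and standard padded base64 is an assumption; the commit message only says "base64".

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// negativeAttestationPrefix is the prefix quoted in the commit message.
const negativeAttestationPrefix = "lep6:negative-attestation:"

// negativeAttestationHash returns a deterministic, non-empty placeholder for
// verified=false votes. Standard (padded) base64 is assumed here.
func negativeAttestationHash(reason string) string {
	sum := sha256.Sum256([]byte(negativeAttestationPrefix + reason))
	return base64.StdEncoding.EncodeToString(sum[:])
}

func main() {
	fmt.Println(negativeAttestationHash("fetch_failed"))
	fmt.Println(negativeAttestationHash("hash_compute_failed"))
}
```

Because the value depends only on the reason string, two verifiers hitting the same failure path produce the same placeholder.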
Components added
supernode/self_healing/
  service.go      Single tick loop; mode gate (UNSPECIFIED skips);
                  healer dispatch; verifier dispatch; finalizer poll;
                  sync.Map in-flight + buffered semaphores
                  (reconstructs=2, verifications=4, publishes=2).
  healer.go       Phase 1: submit-then-persist ordering;
                  reconcileExistingClaim handles post-crash recovery
                  when chain accepted a prior submit.
  verifier.go     Phase 2: fetch from assigned healer, retry with
                  exponential backoff (3 attempts), submit
                  verified=false with non-empty placeholder hash on
                  persistent fetch failure; positive-path hash compares
                  against op.ResultHash; reconciles chain-side
                  "verification already submitted" idempotency.
  finalizer.go    Phase 3: VERIFIED → publish + cleanup; FAILED/
                  EXPIRED → cleanup only; transient states no-op.
  peer_client.go  secureVerifierFetcher dials via the same
                  secure-rpc / lumeraid stack the legacy
                  storage_challenge loop uses.
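The verifier's bounded retry could be sketched as follows. `fetchWithRetry` and `runDemo` are hypothetical names, and the base delay and doubling factor are assumptions; the commit message states only "exponential backoff (3 attempts)".

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// fetchWithRetry calls fetch up to `attempts` times, doubling the delay
// between tries. It stops early if the context is cancelled.
func fetchWithRetry(ctx context.Context, attempts int, base time.Duration,
	fetch func(context.Context) ([]byte, error)) ([]byte, error) {
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	delay := base
	for i := 0; i < attempts; i++ {
		data, err := fetch(ctx)
		if err == nil {
			return data, nil
		}
		lastErr = err
		if i == attempts-1 {
			break // no sleep after the final attempt
		}
		select {
		case <-time.After(delay):
			delay *= 2
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	return nil, fmt.Errorf("fetch failed after %d attempts: %w", attempts, lastErr)
}

// runDemo fails the first `failures` calls, then succeeds, and reports how
// many calls were made in total.
func runDemo(failures int) (calls int, err error) {
	fetch := func(context.Context) ([]byte, error) {
		calls++
		if calls <= failures {
			return nil, errors.New("healer unreachable")
		}
		return []byte("reconstructed"), nil
	}
	_, err = fetchWithRetry(context.Background(), 3, time.Millisecond, fetch)
	return calls, err
}

func main() {
	calls, err := runDemo(2)
	fmt.Println("calls:", calls, "err:", err)
	calls, err = runDemo(5)
	fmt.Println("calls:", calls, "err:", err)
}
```

On the persistent-failure path the caller would then submit verified=false with the non-empty placeholder hash instead of dropping the vote.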
supernode/transport/grpc/self_healing/handler.go
Streaming ServeReconstructedArtefacts RPC.
DefaultCallerIdentityResolver pulls verifier identity from the
secure-rpc (Lumera ALTS) handshake via
pkg/reachability.GrpcRemoteIdentityAndAddr — production wiring uses
this so req.VerifierAccount is never trusted alone. Authorizes only when
the caller ∈ op.VerifierSupernodeAccounts AND the serving node's own
identity == op.HealerSupernodeAccount; refuses with FailedPrecondition
when the serving node is not the assigned healer and with
PermissionDenied for unassigned callers. Streams in 1 MiB chunks.
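The authorization rule can be sketched as below. This is not the handler's real code: the two error values stand in for the gRPC FailedPrecondition and PermissionDenied status codes so the example needs no gRPC dependency, the `HealOp` struct is trimmed down, and the order of the two checks is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for the gRPC status codes named in the text (hypothetical).
var (
	ErrNotAssignedHealer = errors.New("failed precondition: this node is not the assigned healer")
	ErrCallerNotVerifier = errors.New("permission denied: caller is not an assigned verifier")
)

// HealOp is a trimmed-down view of the on-chain heal-op.
type HealOp struct {
	HealerSupernodeAccount    string
	VerifierSupernodeAccounts []string
}

// authorize checks that this node is the assigned healer and that the caller
// identity (taken from the secure handshake, not from the request body) is
// one of the assigned verifiers.
func authorize(op HealOp, selfIdentity, callerIdentity string) error {
	if selfIdentity != op.HealerSupernodeAccount {
		return ErrNotAssignedHealer
	}
	for _, v := range op.VerifierSupernodeAccounts {
		if v == callerIdentity {
			return nil
		}
	}
	return ErrCallerNotVerifier
}

func main() {
	op := HealOp{
		HealerSupernodeAccount:    "healer-1",
		VerifierSupernodeAccounts: []string{"verifier-a", "verifier-b"},
	}
	fmt.Println(authorize(op, "healer-1", "verifier-a"))
	fmt.Println(authorize(op, "healer-1", "stranger"))
	fmt.Println(authorize(op, "healer-2", "verifier-a"))
}
```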
proto/supernode/self_healing.proto
SelfHealingService { ServeReconstructedArtefacts streams chunks }.
Makefile gen-supernode wires it; gen/supernode/self_healing*.pb.go
regenerated.
supernode/cascade/reseed.go
Split RecoveryReseed: PersistArtifacts=true (legacy/republish) vs
PersistArtifacts=false (LEP-6 stage-only). Adds stageArtefacts +
PublishStagedArtefacts. Stages reconstructed file bytes and a
JSON manifest the §19 transport reads.
supernode/cascade/staged.go
ReadStagedHealOp helper used by the transport handler.
supernode/cascade/interfaces.go
CascadeTask interface gains RecoveryReseed + PublishStagedArtefacts
so self_healing depends only on the factory abstraction.
pkg/storage/queries/self_healing_lep6.go
Tables heal_claims_submitted (PK heal_op_id) and
heal_verifications_submitted (PK (heal_op_id, verifier_account))
for restart dedup. Typed sentinel errors
ErrLEP6ClaimAlreadyRecorded / ErrLEP6VerificationAlreadyRecorded.
Migrations wired in OpenHistoryDB.
pkg/storage/queries/local.go
LocalStoreInterface embeds LEP6HealQueries.
supernode/config/config.go
SelfHealingConfig YAML block (enabled, poll_interval_ms,
max_concurrent_*, staging_dir, verifier_fetch_timeout_ms,
verifier_fetch_attempts). Default disabled until activation.
supernode/cmd/start.go
Constructs selfHealingService.Service + selfHealingRPC.Server
(with DefaultCallerIdentityResolver) when SelfHealingConfig.Enabled,
registers SelfHealingService_ServiceDesc on the gRPC server,
appends the runner to the lifecycle services list. Reuses cService
(cascade factory) and historyStore.
Tests (16 mandatory; all PASS)
supernode/self_healing/service_test.go
1. TestVerifier_ReadsOpResultHashForComparison (R-bug pin)
2. TestVerifier_HashMismatchProducesVerifiedFalse
2b. TestVerifier_FetchFailureSubmitsNonEmptyHash (BLOCKER pin)
3. TestVerifier_FetchesFromAssignedHealerOnly (§19 gate)
6. TestHealer_FailedSubmitDoesNotPersistDedupRow (ordering)
6b. TestHealer_ReconcilesExistingChainClaimAfterCrash (recovery)
7. TestHealer_RaptorQReconstructionFailureSkipsClaim (Scenario C1)
8. TestFinalizer_VerifiedTriggersPublishToKAD (Scenario A)
9. TestFinalizer_FailedSkipsPublish_DeletesStaging (Scenario B)
10. TestFinalizer_ExpiredSkipsPublish_DeletesStaging (Scenario C2)
11. TestService_NoRoleSkipsOp
12. TestService_UnspecifiedModeSkipsEntirely (mode gate)
13. TestService_FinalStateOpsIgnored
14. TestDedup_RestartDoesNotResubmit (3-layer dedup)
supernode/transport/grpc/self_healing/handler_test.go
4. TestServeReconstructedArtefacts_AuthorizesOnlyAssignedVerifiers
5. TestServeReconstructedArtefacts_RejectsUnassignedCaller
(also covers non-assigned-healer FailedPrecondition refusal)
pkg/storage/queries/self_healing_lep6_test.go
TestLEP6_HealClaim_RoundTripAndDedup
TestLEP6_HealVerification_PerVerifierDedup
Validation
go test ./supernode/self_healing/... PASS (2.66s)
go test ./supernode/transport/grpc/self_healing/... PASS (0.09s)
go test ./supernode/cascade/... PASS (0.09s)
go test ./pkg/storage/queries/... PASS (0.20s)
go test ./pkg/storagechallenge/... ./supernode/storage_challenge \
./supernode/host_reporter ./pkg/lumera/modules/audit \
./pkg/lumera/modules/audit_msg PASS
go vet (touched + all transitively reachable pkgs) PASS
go build (targeted) PASS
(full repo go build fails only on pre-existing
github.com/kolesa-team/go-webp libwebp-dev system-header issue;
unrelated to this change.)
Resolved decisions applied
✓ Branch base: PR-3 tip f79f88f, NOT self-healing-improvements
(single chain-driven service per Bilal direction; legacy 3-way
Request/Verify/Commit RPC discarded).
✓ Verifier compares against op.ResultHash (chain msg_storage_truth.go
:291). Pinned by TestVerifier_ReadsOpResultHashForComparison.
✓ Hash recipe = cascadekit.ComputeBlake3DataHashB64 (=Action.DataHash
recipe). Same recipe healer + verifier + chain enforce.
✓ KAD publish AFTER chain VERIFIED (§19 healer-served-path gate);
staging directory is the only authority before quorum.
✓ Finalizer mechanism: Opt 2b (per-op GetHealOp poll, folded into
single tick loop) — no Tendermint WS, no monotonic-growth poll.
✓ Concurrency default: semaphore=2 reconstructs (RaptorQ RAM-aware),
4 verifications, 2 publishes.
✓ Mode gate: UNSPECIFIED skips dispatcher entirely (Service.tick
early-return; verified by TestService_UnspecifiedModeSkipsEntirely).
✓ Three-layer dedup: sync.Map + bounded semaphores + SQLite
(heal_claims_submitted + heal_verifications_submitted).
✓ Submit-then-persist ordering with reconcile path for crash recovery.
✓ Non-empty placeholder VerificationHash on negative attestations
(chain rejects empty regardless of verified bool).
✓ Caller authentication via secure-rpc / Lumera ALTS handshake at
transport layer; req.VerifierAccount never trusted alone in
production.
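The first two dedup layers listed above (sync.Map in-flight tracking plus a bounded semaphore) could be sketched like this. The `dispatcher` type and `tryStart` are hypothetical names, the non-blocking acquire (skip and retry on a later tick) is an assumption, and the SQLite layer is left out.

```go
package main

import (
	"fmt"
	"sync"
)

// dispatcher guards one kind of work (e.g. reconstructs) with an in-flight
// set and a buffered-channel semaphore.
type dispatcher struct {
	inFlight sync.Map      // heal-op ID -> struct{}
	sem      chan struct{} // capacity = max concurrent operations
}

func newDispatcher(max int) *dispatcher {
	return &dispatcher{sem: make(chan struct{}, max)}
}

// tryStart reserves the op and a semaphore slot without blocking. It returns
// a release func and true on success, or false if the op is already in
// flight or all slots are taken.
func (d *dispatcher) tryStart(opID uint64) (func(), bool) {
	if _, loaded := d.inFlight.LoadOrStore(opID, struct{}{}); loaded {
		return nil, false
	}
	select {
	case d.sem <- struct{}{}:
	default:
		d.inFlight.Delete(opID) // no slot: undo the reservation
		return nil, false
	}
	release := func() {
		<-d.sem
		d.inFlight.Delete(opID)
	}
	return release, true
}

func main() {
	d := newDispatcher(2) // reconstructs=2, as in the defaults above
	rel1, ok1 := d.tryStart(1)
	_, dup := d.tryStart(1)
	_, ok2 := d.tryStart(2)
	_, ok3 := d.tryStart(3)
	fmt.Println(ok1, dup, ok2, ok3)
	rel1()
	_, ok3 = d.tryStart(3)
	fmt.Println(ok3)
}
```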
Plan: docs/plans/LEP6_PR4_EXECUTION_PLAN.md
Commits: ee952ae to 9ab87cb. Merged commit d86c679 into supernode/LEP-6-chain-client-extensions. 6 checks passed.