Everlight: supernode compatibility + p2p STORAGE_FULL eligibility gate #284

mateeullahmalik wants to merge 9 commits into master
Conversation
```go
func (s *DHT) setStoreAllowlist(ctx context.Context, allow map[[32]byte]struct{}) {
	if s == nil {
		return
	}
	// Integration tests may use synthetic bootstrap sets; do not enforce chain-state gating.
	if integrationTestEnabled() {
		return
	}

	s.storeAllowMu.Lock()
	s.storeAllow = allow
	s.storeAllowMu.Unlock()

	s.storeAllowCount.Store(int64(len(allow)))
	s.storeAllowReady.Store(true)

	logtrace.Debug(ctx, "store allowlist updated", logtrace.Fields{
		logtrace.FieldModule: "p2p",
		"store_peers":        len(allow),
	})
}
```
`setStoreAllowlist` unconditionally accepts an empty map and marks `storeAllowReady=true` with `storeAllowCount=0`. If a transient chain issue returns zero ACTIVE supernodes, `eligibleForStore` will return false for every peer, blocking all writes network-wide until the next bootstrap refresh (up to 10 min). `setRoutingAllowlist` explicitly guards against this by returning early when `len(allow) == 0` and retaining the previous allowlist. The store allowlist should have the same protection.
```go
func (s *DHT) eligibleForStore(n *Node) bool {
	if s == nil {
		return false
	}
	// In integration tests allow everything; chain state gating is not stable/available there.
	if integrationTestEnabled() {
		return true
	}
	// If the store allowlist isn't ready yet, avoid blocking writes during bootstrap.
	if !s.storeAllowReady.Load() {
		return true
	}
	// Once initialized, an empty active set means no write-eligible peers.
	if s.storeAllowCount.Load() == 0 {
		return false
	}
	if n == nil || len(n.ID) == 0 {
		return false
	}
```
The nil/empty node check (`n == nil || len(n.ID) == 0`) is placed after the `storeAllowReady` and `storeAllowCount` checks. When `storeAllowReady` is false (pre-bootstrap), this function returns true for a nil node, which could cause a nil pointer dereference if a nil `*Node` slips through a call site during bootstrap. `eligibleForRouting` checks `n == nil` as the very first thing (line 214). Moving the nil check before the ready/count checks would be consistent and defensive.
Suggested change:

```diff
 func (s *DHT) eligibleForStore(n *Node) bool {
 	if s == nil {
 		return false
 	}
+	if n == nil || len(n.ID) == 0 {
+		return false
+	}
 	// In integration tests allow everything; chain state gating is not stable/available there.
 	if integrationTestEnabled() {
 		return true
 	}
 	// If the store allowlist isn't ready yet, avoid blocking writes during bootstrap.
 	if !s.storeAllowReady.Load() {
 		return true
 	}
 	// Once initialized, an empty active set means no write-eligible peers.
 	if s.storeAllowCount.Load() == 0 {
 		return false
 	}
-	if n == nil || len(n.ID) == 0 {
-		return false
-	}
```
```go
	if !s.dht.selfStoreEligible() {
		newKeys := 0
		for _, data := range request.Data {
			k, _ := utils.Blake3Hash(data)
			existing, rErr := s.dht.store.Retrieve(ctx, k)
			if rErr != nil || len(existing) == 0 {
				newKeys++
			}
		}
		if newKeys > 0 {
			logtrace.Warn(ctx, "rejecting BatchStore: self is not store-eligible", logtrace.Fields{
				logtrace.FieldModule: "p2p",
				"sender":             message.Sender.String(),
				"self_state":         s.dht.selfState.Load(),
				"new_keys":           newKeys,
				"total_keys":         len(request.Data),
			})
			return s.generateResponseMessage(ctx, BatchStoreData, message.Sender, ResultFailed, "batch store rejected: self not store-eligible")
		}
	}
```
`handleStoreData` was refactored to use `shouldRejectStore(1)`, but this batch handler still calls `selfStoreEligible()` directly and reimplements the `newKeys`-counting logic inline. If `shouldRejectStore` is updated later (e.g., to add logging, metrics, or a grace period), this path will silently diverge. Consider replacing the `!s.dht.selfStoreEligible()` guard + manual `newKeys` count with `s.dht.shouldRejectStore(newKeys)` after computing `newKeys`, matching the single-store handler's pattern.
Aligns both the supernode `go.mod` and the `tests/system` `go.mod` with the v1.12.0-rc release tag (lumera commit 7ca770a / Everlight #113). Resolves the install-lumera CI step, which requires a real downloadable release asset rather than a pseudo-version.
Extends PR #272's routing-vs-store allowlist split to cover the new `SUPERNODE_STATE_STORAGE_FULL` introduced by lumera #113 (Everlight).

Policy:
- routing (reads) = {ACTIVE, POSTPONED, STORAGE_FULL}
- store (writes) = {ACTIVE}

STORAGE_FULL nodes continue to serve reads and earn payout, but must not receive new STORE/BatchStore writes or be targeted by replication.

Changes:
- `p2p/kademlia/supernode_state.go`: SSoT helpers using chain's `sntypes.SuperNodeState` enum (no numeric literals), `selfState` cache, `pruneIneligibleStorePeers`, `selfStoreEligible`.
- `bootstrap.go`: use `isRoutingEligibleState` / `isStoreEligibleState`; record self-state during chain sync; call `pruneIneligibleStorePeers` after `setStoreAllowlist` so `replication_info.Active` is cleared eagerly for ineligible peers (closes ping-cadence window).
- `network.go`: STORE RPC self-guard for single `handleStoreData` (reject new-key write when self not store-eligible; replication of an already-held key still permitted); `BatchStoreData` self-guard rejects when the batch contains any genuinely new keys.
- `dht.go`: `selfState` atomic fields; defensive `filterEligibleNodes` on `BatchRetrieve` and `BatchRetrieveStream` closest-contact lists.
- `dht_batch_store_test.go`: accept new "no eligible store peers" error variant alongside legacy "no candidate nodes".
Coverage matrix aligned with the invariant table in the plan doc:
- I1 routing allowlist population/pre-init: `TestEligibleForRouting_PreInit_AndPopulated`
- I2 store allowlist strictly ⊆ routing: `TestEligibleForStore_StrictlyContainedInRouting`
- I3+I9 `shouldRejectStore` contract (ACTIVE, STORAGE_FULL, POSTPONED, DISABLED, with newKeys=0 vs >0, pre-init permissive): `TestShouldRejectStore`, `TestSelfStoreEligible`
- I5 eager replication-info prune on store allowlist update: `TestPruneIneligibleStorePeers_ClearsNonStorePeers`, `TestPruneIneligibleStorePeers_SkipsWhenNotReady`
- I6 state-classification SSoT (no drift possible): `TestStateClassification_Table` (parametric over all 7 chain enum values)

Refactors the `handleStoreData` self-guard to use the `shouldRejectStore` helper for direct test coverage without a full `*Network` + hashtable boot. Adds a minimal `fakeStore` in-package (satisfies the full Store interface with no-op implementations) used only by the prune tests.
Force-pushed from af9c32c to 7a2b535.
Consolidates the Everlight supernode workstream into one PR. Supersedes #282 (closed) and #272 (closed).
Implements the supernode side of lumera #113 (Everlight Phase 1). Three layers, stacked in commit order:
- Compatibility: `host_reporter` measures disk usage on the p2p data dir mount and emits `cascade_kademlia_db_bytes` + `disk_usage_percent` in audit epoch reports; probing + verifier accept `ACTIVE` and `STORAGE_FULL` as operational states; the supernode query client gains wrappers for the new chain state fields; `go.mod` bumped to lumera v1.12.0-rc.
- Allowlist split: `routingIDs` (reads) and `storeIDs` (writes). Replaces the single-allowlist model and is a prerequisite for the STORAGE_FULL gate.
- Gate: `SUPERNODE_STATE_STORAGE_FULL` (enum 6). STORAGE_FULL nodes continue to serve reads and earn payout but do not receive new STORE / BatchStore writes and are not targeted by replication.

State-class matrix

| State | Routing (reads) | Store (writes) |
| --- | --- | --- |
| ACTIVE | yes | yes |
| POSTPONED | yes | no |
| STORAGE_FULL | yes | no |
| DISABLED | no | no |
How
- SSoT: `p2p/kademlia/supernode_state.go` imports `sntypes.SuperNodeState` from lumera. No numeric literals anywhere in p2p outside this file. Helpers: `isRoutingEligibleState`, `isStoreEligibleState`.
- Allowlist split (`bootstrap.go::loadBootstrapCandidatesFromChain`) — `routingIDs = {ACTIVE, POSTPONED, STORAGE_FULL}`, `storeIDs = {ACTIVE}`.
- Self-state cache: `DHT.selfState` (atomic). Consumed by the STORE RPC self-guard.
- STORE self-guard (`network.go::handleStoreData`, `handleBatchStoreData`) — if self is not store-eligible, reject STORE/BatchStore requests that contain any genuinely new keys. Replication of already-held keys still allowed (preserves availability during transitions).
- Eager prune (`bootstrap.go::SyncBootstrapOnce` → `pruneIneligibleStorePeers`) — on every bootstrap refresh, flip `replication_info.Active=false` for any peer no longer in the store allowlist. Closes the 10-minute+ window between a chain STORAGE_FULL transition and the next successful ping.
- Defensive filtering (`dht.go::BatchRetrieve`, `BatchRetrieveStream`) — `filterEligibleNodes` applied to the closest-contact list as belt-and-braces.
- Disk reporting (`host_reporter/service.go`) — measures the filesystem backing the p2p data dir rather than `/`, and reports `cascade_kademlia_db_bytes` so the chain can drive STORAGE_FULL / POSTPONED transitions deterministically.
- Operational states (`supernode_metrics/active_probing.go`, `verifier/verifier.go`) — `ACTIVE` and `STORAGE_FULL` both count as operational for probe-role assignment and challenge response.

Tests — invariant-oriented, one violation test per enforcement point
- `TestEligibleForRouting_PreInit_AndPopulated`
- `TestEligibleForStore_StrictlyContainedInRouting`
- `TestShouldRejectStore`, `TestSelfStoreEligible`
- `TestPruneIneligibleStorePeers_*`
- `TestStateClassification_Table` (parametric over all 7 chain enum values)
- `host_reporter/tick_behavior_test.go`
- `verifier/verifier_test.go`
- `reachability_active_probing_test.go`

No numeric state literals anywhere in `p2p/kademlia/` outside `supernode_state.go`.

Verification
- `go build ./...` OK
- `go vet ./...` OK
- `go test ./...` OK, incl. integration
- One system test still fails: it expects a `coin_spent` event of `10000ulume`, but the v1.12.0-rc fee payout is now split (`200ulume` protocol cut + `9800ulume` supernode share). Fix is a test-only change to sum `coin_spent` amounts for the spender (or match on `action_registered.fee`). Pushing that fix next.

Out of scope (follow-ups)
- `tests/systemtests/everlight_p2p_test.go` — Phase D.
- (… `replication_info.Active`, which is pruned eagerly here).

Risks
- Using the `sntypes.SuperNodeState*` constants directly makes this a compile error on the next go-mod bump rather than silent drift.
- `pruneIneligibleStorePeers` scans the full `replication_info`: O(n) per 10-min bootstrap refresh; negligible for current network size.

Rollback
Per-feature reverts are clean:
Related: