
Everlight: supernode compatibility + p2p STORAGE_FULL eligibility gate#284

Open
mateeullahmalik wants to merge 9 commits into master from feat/p2p-storage-full-gate

Conversation


@mateeullahmalik mateeullahmalik commented Apr 22, 2026

Consolidates the Everlight supernode workstream into one PR. Supersedes #282 (closed) and #272 (closed).

Implements the supernode side of lumera #113 (Everlight Phase 1). Three layers, stacked in commit order:

  1. Compat / telemetry (was Everlight #282) — host_reporter measures disk usage on the p2p data dir mount and emits cascade_kademlia_db_bytes + disk_usage_percent in audit epoch reports; probing + verifier accept ACTIVE and STORAGE_FULL as operational states; supernode query client gains wrappers for the new chain state fields; go.mod bumped to lumera v1.12.0-rc.
  2. P2P store/retrieve list separation (was Update/store retrieve list seperation #272, rewritten) — bootstrap seeds two distinct allowlists: routingIDs (reads) and storeIDs (writes). Replaces the single-allowlist model and is a prerequisite for the STORAGE_FULL gate.
  3. P2P STORAGE_FULL eligibility gate (this PR's new commits) — extends the routing-vs-store split to the new SUPERNODE_STATE_STORAGE_FULL (enum 6). STORAGE_FULL nodes continue to serve reads and earn payout but do not receive new STORE / BatchStore writes and are not targeted by replication.

State-class matrix

| State | Routing (reads)? | Store (writes)? | Payout? |
| --- | --- | --- | --- |
| ACTIVE | ✅ | ✅ | ✅ |
| POSTPONED | ✅ | ❌ | — |
| STORAGE_FULL | ✅ | ❌ | ✅ |
| DISABLED / STOPPED / PENALIZED | ❌ | ❌ | ❌ |
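The matrix above reduces to two predicates. A minimal sketch of their shape in Go (the enum names and intermediate values here are illustrative placeholders; the real code imports sntypes.SuperNodeState from lumera, and only STORAGE_FULL = 6 is confirmed by this PR):

```go
package main

import "fmt"

// SuperNodeState mirrors the chain enum. Only STORAGE_FULL = 6 is confirmed
// in this PR; the other names and values are illustrative assumptions.
type SuperNodeState int

const (
	StateUnspecified SuperNodeState = iota // 0
	StateActive                            // assumed
	StateDisabled                          // assumed
	StateStopped                           // assumed
	StatePenalized                         // assumed
	StatePostponed                         // assumed
	StateStorageFull                       // 6, confirmed by the PR
)

// isRoutingEligibleState: the peer may serve reads.
func isRoutingEligibleState(s SuperNodeState) bool {
	switch s {
	case StateActive, StatePostponed, StateStorageFull:
		return true
	default:
		return false
	}
}

// isStoreEligibleState: the peer may accept new writes.
// Strictly a subset of the routing-eligible set.
func isStoreEligibleState(s SuperNodeState) bool {
	return s == StateActive
}

func main() {
	fmt.Println(isRoutingEligibleState(StateStorageFull)) // true: still serves reads
	fmt.Println(isStoreEligibleState(StateStorageFull))   // false: no new writes
}
```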

How

  1. Single source of truth — p2p/kademlia/supernode_state.go imports sntypes.SuperNodeState from lumera. No numeric literals anywhere in p2p outside this file. Helpers: isRoutingEligibleState, isStoreEligibleState.
  2. Bootstrap population (bootstrap.go::loadBootstrapCandidatesFromChain) — routingIDs = {ACTIVE, POSTPONED, STORAGE_FULL}, storeIDs = {ACTIVE}.
  3. Self-state cache — this node's latest chain state cached on DHT.selfState (atomic). Consumed by the STORE RPC self-guard.
  4. STORE self-guard (network.go::handleStoreData, handleBatchStoreData) — if self is not store-eligible, reject STORE/BatchStore requests that contain any genuinely new keys. Replication of already-held keys still allowed (preserves availability during transitions).
  5. Eager replication prune (bootstrap.go::SyncBootstrapOnce → pruneIneligibleStorePeers) — on every bootstrap refresh, flip replication_info.Active=false for any peer no longer in the store allowlist. Closes the 10-minute+ window between a chain STORAGE_FULL transition and the next successful ping.
  6. Defensive read-gate (dht.go::BatchRetrieve, BatchRetrieveStream) — filterEligibleNodes applied to the closest-contact list as belt-and-braces.
  7. Host-reporter disk usage (host_reporter/service.go) — measures the filesystem backing the p2p data dir rather than /, and reports cascade_kademlia_db_bytes so the chain can drive STORAGE_FULL / POSTPONED transitions deterministically.
  8. Probing + verifier (supernode_metrics/active_probing.go, verifier/verifier.go) — ACTIVE and STORAGE_FULL both count as operational for probe-role assignment and challenge response.
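Steps 3–4 above combine into a single self-guard. A sketch of the shape described (the function names match the PR text, but the surrounding struct is simplified here, not the actual DHT type):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// DHT is pared down to the one field the guard consults; in the real code
// the eligibility flag is derived from the cached chain state (selfState).
type DHT struct {
	storeEligible atomic.Bool
}

func (s *DHT) selfStoreEligible() bool { return s.storeEligible.Load() }

// shouldRejectStore implements the STORE self-guard: reject only when this
// node is not store-eligible AND the request would add genuinely new keys.
// Replication of already-held keys (newKeys == 0) stays allowed, which
// preserves availability while a node transitions into STORAGE_FULL.
func (s *DHT) shouldRejectStore(newKeys int) bool {
	if s.selfStoreEligible() {
		return false
	}
	return newKeys > 0
}

func main() {
	var d DHT
	d.storeEligible.Store(false) // e.g. node just went STORAGE_FULL
	fmt.Println(d.shouldRejectStore(1)) // true: genuinely new key, not eligible
	fmt.Println(d.shouldRejectStore(0)) // false: replication of held keys
}
```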

Tests — invariant-oriented, one violation test per enforcement point

| Invariant | Test |
| --- | --- |
| Routing allowlist population / pre-init permissive / post-init enforcing | TestEligibleForRouting_PreInit_AndPopulated |
| Store allowlist strictly ⊆ routing | TestEligibleForStore_StrictlyContainedInRouting |
| STORE self-guard across 7 states × newKeys={0,>0} | TestShouldRejectStore, TestSelfStoreEligible |
| Eager replication prune | TestPruneIneligibleStorePeers_* |
| State classification SSoT | TestStateClassification_Table (parametric over all 7 chain enum values) |
| Host-reporter tick behavior (STORAGE_FULL transitions, disk metric emission) | host_reporter/tick_behavior_test.go |
| Verifier accepts STORAGE_FULL | verifier/verifier_test.go |
| Probing accepts STORAGE_FULL | reachability_active_probing_test.go |

No numeric state literals anywhere in p2p/kademlia/ outside supernode_state.go.

Verification

  • go build ./... OK
  • go vet ./... OK
  • go test ./... OK incl. integration
  • CI: unit-tests ✅ integration-tests ✅ build ✅ Rooview ✅
  • CI: cascade-e2e-tests ❌ — unrelated: test asserts a single coin_spent event of 10000ulume but the v1.12.0-rc fee payout is now split (200ulume protocol cut + 9800ulume supernode share). Fix is a test-only change to sum coin_spent amounts for the spender (or match on action_registered.fee). Pushing that fix next.
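The proposed e2e fix is mechanical: instead of asserting a single coin_spent event of 10000ulume, sum the spender's coin_spent amounts. A hedged sketch of that summation (the event struct and field names here are hypothetical stand-ins, not the actual test code):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// event is a stand-in for a coin_spent event's attributes; the real test
// would extract these from the tx result.
type event struct {
	spender string
	amount  string // e.g. "9800ulume"
}

// sumSpent totals all coin_spent amounts for one spender, tolerating the
// v1.12.0-rc fee split (protocol cut + supernode share) instead of
// expecting a single event.
func sumSpent(events []event, spender string) (int64, error) {
	var total int64
	for _, e := range events {
		if e.spender != spender {
			continue
		}
		n, err := strconv.ParseInt(strings.TrimSuffix(e.amount, "ulume"), 10, 64)
		if err != nil {
			return 0, err
		}
		total += n
	}
	return total, nil
}

func main() {
	evs := []event{
		{"lumera1creator", "200ulume"},  // protocol cut
		{"lumera1creator", "9800ulume"}, // supernode share
	}
	total, _ := sumSpent(evs, "lumera1creator")
	fmt.Println(total) // 10000
}
```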

Out of scope (follow-ups)

  • SDK parity tests (sdk-go / sdk-js / sdk-rs) — Phase C.
  • Lumera tests/systemtests/everlight_p2p_test.go — Phase D.
  • Server-side Replicate RPC self-guard (the replication-worker path already honors replication_info.Active which is pruned eagerly here).

Risks

  • R1: STORAGE_FULL transition arrives mid-flight. Server-side STORE guard makes rejection deterministic; client-side retry already exists; test matrix covers this.
  • R2: Chain enum renumbering. Using sntypes.SuperNodeState* constants directly makes this a compile error on the next go-mod bump rather than silent drift.
  • R3: pruneIneligibleStorePeers scans full replication_info. O(n) per 10-min bootstrap refresh; negligible for current network size.

Rollback

Per-feature reverts are clean:

  • p2p STORAGE_FULL gate commits (top 4) can be reverted independently — restores the routing/store split without the STORAGE_FULL class.
  • host-reporter + verifier + probing commits revert to pre-Everlight behavior with no compile breaks on master (no chain state dependencies).
  • go.mod bump is the only cross-cutting change; reverting it requires rolling back the lumera side (feat: Add capabilities management to supernode #113).

Related:


roomote-v0 Bot commented Apr 22, 2026


Re-reviewed commits af9c32c...7a2b535 (4 new commits). No new issues found. The 3 items from the previous review remain unresolved:

  • setStoreAllowlist missing empty-list guard (dht.go): Unlike setRoutingAllowlist, the store variant accepts an empty map and marks ready=true, which would block all writes network-wide on a transient chain issue.
  • eligibleForStore nil-check ordering (dht.go): The n == nil check runs after the storeAllowReady early-return, so a nil node returns true during pre-bootstrap. eligibleForRouting checks nil first.
  • handleBatchStoreData bypasses shouldRejectStore (network.go): The single-store handler uses shouldRejectStore(1) but the batch handler reimplements the logic inline via selfStoreEligible() + manual newKeys count, risking future divergence.

Comment thread p2p/kademlia/dht.go
Comment on lines +188 to 208
```go
func (s *DHT) setStoreAllowlist(ctx context.Context, allow map[[32]byte]struct{}) {
	if s == nil {
		return
	}
	// Integration tests may use synthetic bootstrap sets; do not enforce chain-state gating.
	if integrationTestEnabled() {
		return
	}

	s.storeAllowMu.Lock()
	s.storeAllow = allow
	s.storeAllowMu.Unlock()

	s.storeAllowCount.Store(int64(len(allow)))
	s.storeAllowReady.Store(true)

	logtrace.Debug(ctx, "store allowlist updated", logtrace.Fields{
		logtrace.FieldModule: "p2p",
		"store_peers":        len(allow),
	})
}
```


setStoreAllowlist unconditionally accepts an empty map and marks storeAllowReady=true with storeAllowCount=0. If a transient chain issue returns zero ACTIVE supernodes, eligibleForStore will return false for every peer, blocking all writes network-wide until the next bootstrap refresh (up to 10 min). setRoutingAllowlist explicitly guards against this by returning early when len(allow) == 0 and retaining the previous allowlist. The store allowlist should have the same protection.
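A minimal sketch of the guard the reviewer is asking for, mirroring what the comment says setRoutingAllowlist already does (the struct here is simplified to the relevant fields; it is not the actual dht.go):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type DHT struct {
	storeAllowMu    sync.Mutex
	storeAllow      map[[32]byte]struct{}
	storeAllowCount atomic.Int64
	storeAllowReady atomic.Bool
}

// setStoreAllowlist with the suggested empty-list guard: a transient chain
// hiccup that yields zero ACTIVE supernodes retains the previous allowlist
// instead of blocking all writes network-wide until the next refresh.
func (s *DHT) setStoreAllowlist(allow map[[32]byte]struct{}) {
	if len(allow) == 0 {
		// keep the previous allowlist; do not flip ready/count
		return
	}
	s.storeAllowMu.Lock()
	s.storeAllow = allow
	s.storeAllowMu.Unlock()
	s.storeAllowCount.Store(int64(len(allow)))
	s.storeAllowReady.Store(true)
}

func main() {
	var d DHT
	peer := [32]byte{1}
	d.setStoreAllowlist(map[[32]byte]struct{}{peer: {}})
	d.setStoreAllowlist(nil) // transient empty set: previous list survives
	fmt.Println(d.storeAllowCount.Load()) // 1
}
```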


Comment thread p2p/kademlia/dht.go
Comment on lines +244 to +262
```go
func (s *DHT) eligibleForStore(n *Node) bool {
	if s == nil {
		return false
	}
	// In integration tests allow everything; chain state gating is not stable/available there.
	if integrationTestEnabled() {
		return true
	}
	// If the store allowlist isn't ready yet, avoid blocking writes during bootstrap.
	if !s.storeAllowReady.Load() {
		return true
	}
	// Once initialized, an empty active set means no write-eligible peers.
	if s.storeAllowCount.Load() == 0 {
		return false
	}
	if n == nil || len(n.ID) == 0 {
		return false
	}
```


The nil/empty node check (n == nil || len(n.ID) == 0) is placed after the storeAllowReady and storeAllowCount checks. When storeAllowReady is false (pre-bootstrap), this function returns true for a nil node, which could cause a nil pointer dereference if a nil *Node slips through a call site during bootstrap. eligibleForRouting checks n == nil as the very first thing (line 214). Moving the nil check before the ready/count checks would be consistent and defensive.

Suggested change

```go
// current: nil check after the ready/count checks
func (s *DHT) eligibleForStore(n *Node) bool {
	if s == nil {
		return false
	}
	// In integration tests allow everything; chain state gating is not stable/available there.
	if integrationTestEnabled() {
		return true
	}
	// If the store allowlist isn't ready yet, avoid blocking writes during bootstrap.
	if !s.storeAllowReady.Load() {
		return true
	}
	// Once initialized, an empty active set means no write-eligible peers.
	if s.storeAllowCount.Load() == 0 {
		return false
	}
	if n == nil || len(n.ID) == 0 {
		return false
	}
```

```go
// suggested: nil check first, matching eligibleForRouting
func (s *DHT) eligibleForStore(n *Node) bool {
	if s == nil {
		return false
	}
	if n == nil || len(n.ID) == 0 {
		return false
	}
	// In integration tests allow everything; chain state gating is not stable/available there.
	if integrationTestEnabled() {
		return true
	}
	// If the store allowlist isn't ready yet, avoid blocking writes during bootstrap.
	if !s.storeAllowReady.Load() {
		return true
	}
	// Once initialized, an empty active set means no write-eligible peers.
	if s.storeAllowCount.Load() == 0 {
		return false
	}
```


Comment thread p2p/kademlia/network.go
Comment on lines +1247 to +1266
```go
if !s.dht.selfStoreEligible() {
	newKeys := 0
	for _, data := range request.Data {
		k, _ := utils.Blake3Hash(data)
		existing, rErr := s.dht.store.Retrieve(ctx, k)
		if rErr != nil || len(existing) == 0 {
			newKeys++
		}
	}
	if newKeys > 0 {
		logtrace.Warn(ctx, "rejecting BatchStore: self is not store-eligible", logtrace.Fields{
			logtrace.FieldModule: "p2p",
			"sender":             message.Sender.String(),
			"self_state":         s.dht.selfState.Load(),
			"new_keys":           newKeys,
			"total_keys":         len(request.Data),
		})
		return s.generateResponseMessage(ctx, BatchStoreData, message.Sender, ResultFailed, "batch store rejected: self not store-eligible")
	}
}
```


handleStoreData was refactored to use shouldRejectStore(1), but this batch handler still calls selfStoreEligible() directly and reimplements the newKeys-counting logic inline. If shouldRejectStore is updated later (e.g., to add logging, metrics, or a grace period), this path will silently diverge. Consider replacing the !s.dht.selfStoreEligible() guard + manual newKeys count with s.dht.shouldRejectStore(newKeys) after computing newKeys, matching the single-store handler's pattern.
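The suggested refactor, sketched (heavily simplified: the local-store lookup is replaced by a has-key callback, whereas the real handler calls s.dht.store.Retrieve; only the delegation pattern is the point here):

```go
package main

import "fmt"

type dht struct{ storeEligible bool }

func (d *dht) selfStoreEligible() bool { return d.storeEligible }

// shouldRejectStore: the shared guard already used by the single-store
// handler; the batch handler delegates to it instead of reimplementing
// the eligibility logic inline.
func (d *dht) shouldRejectStore(newKeys int) bool {
	if d.selfStoreEligible() {
		return false
	}
	return newKeys > 0
}

// countNewKeys counts values not already held locally. hasKey stands in
// for the store.Retrieve lookup performed by the real handler.
func countNewKeys(values [][]byte, hasKey func([]byte) bool) int {
	n := 0
	for _, v := range values {
		if !hasKey(v) {
			n++
		}
	}
	return n
}

func main() {
	d := &dht{storeEligible: false}
	held := map[string]bool{"a": true}
	newKeys := countNewKeys([][]byte{[]byte("a"), []byte("b")},
		func(v []byte) bool { return held[string(v)] })
	// batch path now shares the same guard as handleStoreData
	fmt.Println(d.shouldRejectStore(newKeys)) // true: one genuinely new key
}
```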


Aligns both supernode go.mod and tests/system go.mod with the v1.12.0-rc
release tag (lumera commit 7ca770a / Everlight #113). Resolves the
install-lumera CI step which requires a real downloadable release asset
rather than a pseudo-version.
Extends PR #272's routing-vs-store allowlist split to cover the new
SUPERNODE_STATE_STORAGE_FULL introduced by lumera #113 (Everlight).

Policy:
- routing (reads)  = {ACTIVE, POSTPONED, STORAGE_FULL}
- store   (writes) = {ACTIVE}

STORAGE_FULL nodes continue to serve reads and earn payout, but must
not receive new STORE/BatchStore writes or be targeted by replication.

Changes:
- p2p/kademlia/supernode_state.go: SSoT helpers using chain's
  sntypes.SuperNodeState enum (no numeric literals), selfState cache,
  pruneIneligibleStorePeers, selfStoreEligible.
- bootstrap.go: use isRoutingEligibleState / isStoreEligibleState;
  record self-state during chain sync; call pruneIneligibleStorePeers
  after setStoreAllowlist so replication_info.Active is cleared
  eagerly for ineligible peers (closes ping-cadence window).
- network.go: STORE RPC self-guard for single handleStoreData (reject
  new-key write when self not store-eligible; replication of already-
  held key still permitted); BatchStoreData self-guard rejects when
  batch contains any genuinely new keys.
- dht.go: selfState atomic fields; defensive filterEligibleNodes on
  BatchRetrieve and BatchRetrieveStream closest-contact lists.
- dht_batch_store_test.go: accept new "no eligible store peers"
  error variant alongside legacy "no candidate nodes".
Coverage matrix aligned with the invariant table in the plan doc:

- I1 routing allowlist population/pre-init: TestEligibleForRouting_PreInit_AndPopulated
- I2 store allowlist strictly ⊆ routing:   TestEligibleForStore_StrictlyContainedInRouting
- I3+I9 shouldRejectStore contract (ACTIVE, STORAGE_FULL, POSTPONED,
  DISABLED, with newKeys=0 vs >0, pre-init permissive):
  TestShouldRejectStore, TestSelfStoreEligible
- I5 eager replication-info prune on storeAllowlist update:
  TestPruneIneligibleStorePeers_ClearsNonStorePeers,
  TestPruneIneligibleStorePeers_SkipsWhenNotReady
- I6 state-classification SSoT (no drift possible):
  TestStateClassification_Table (parametric over all 7 chain enum values)

Refactors handleStoreData self-guard to use shouldRejectStore helper for
direct test coverage without a full *Network + hashtable boot.

Adds a minimal fakeStore in-package (satisfies the full Store interface
with no-op implementations) used only by prune tests.
@mateeullahmalik mateeullahmalik force-pushed the feat/p2p-storage-full-gate branch from af9c32c to 7a2b535 on April 22, 2026 at 15:58
@mateeullahmalik mateeullahmalik changed the title feat(p2p): Everlight STORAGE_FULL eligibility gate Everlight: supernode compatibility + p2p STORAGE_FULL eligibility gate Apr 22, 2026
@mateeullahmalik mateeullahmalik changed the base branch from everlight-supernode-compat-plan to master April 22, 2026 17:47
@mateeullahmalik mateeullahmalik mentioned this pull request Apr 22, 2026