Optimize DHT Symbol Retrieval: Primary-Provider Waves, Streamed Writes, and Bounded Concurrency #227
🧩 Summary
This PR completely overhauls the symbol retrieval pipeline to make it production-ready, memory-efficient, and network-optimized.
It introduces a streaming retrieval mechanism (`BatchRetrieveStream`) that:

- Streams symbols directly to disk in the RaptorQ workspace — no in-memory accumulation.
- Performs local-first retrieval using `RetrieveBatchValues` in 5k-key batches.
- Introduces a deterministic primary-provider algorithm to ensure that each key is initially requested from exactly one node, maximizing unique symbol coverage (see the sketch after this list).
- Adds multi-wave fallback logic — if primaries fail or are slow, subsequent waves fetch missing keys from alternate top-K nodes.
- Implements strict per-node payload caps (`perNodeRequestCap = 600` ⇒ ~36 MB @ 60 KB/symbol).
- Enforces two-level concurrency limits for predictable performance (`fetchSymbolsBatchConcurrency = 8`, `storeSameSymbolsBatchConcurrency = 4`).
- Uses atomic early-stop and cancellation to halt network fetches the moment the 17 % symbol threshold is reached.
- Maintains global de-duplication via a concurrent `resSeen` map to avoid duplicate writes or wasted network calls.
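As a concrete illustration of the primary-provider idea, here is a minimal sketch, not the PR's actual code (which lives in `pkg/dht/retrieve_stream.go`). It assumes keys and node IDs are equal-length byte strings and assigns every key to its XOR-closest candidate, so wave one asks exactly one node per key while the key set spreads across the top-K candidates.

```go
// Hypothetical sketch of deterministic primary-provider assignment.
// Assumption: keys and node IDs are equal-length byte strings (e.g. 32-byte hashes).
package dht

type node struct {
	ID []byte
}

// xorCloser reports whether node ID a is XOR-closer to key than node ID b.
func xorCloser(key, a, b []byte) bool {
	for i := range key {
		da, db := key[i]^a[i], key[i]^b[i]
		if da != db {
			return da < db
		}
	}
	return false
}

// assignPrimaries maps every key to exactly one candidate (its XOR-closest node),
// so wave one requests each key from a single provider while the key set
// naturally spreads across the candidates.
func assignPrimaries(keys [][]byte, candidates []node) map[int][][]byte {
	out := make(map[int][][]byte, len(candidates))
	if len(candidates) == 0 {
		return out
	}
	for _, k := range keys {
		best := 0
		for i := 1; i < len(candidates); i++ {
			if xorCloser(k, candidates[i].ID, candidates[best].ID) {
				best = i
			}
		}
		out[best] = append(out[best], k)
	}
	return out
}
```

Because the assignment is deterministic, later waves only need the set of still-missing keys and can retry them against alternate top-K nodes, as described above.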
🧠 Design Goals
🧪 Testing Plan
Unit:

- Mock DHT with in-memory store → verify early exit when `foundLocalCount ≥ required`.
- Simulate 50 nodes with XOR spread → confirm a unique provider per key per wave (see the test sketch after this plan).
- Validate per-node request count ≤ 600.

Integration (testnet):

- Spin up a 50-validator cluster.
- Upload a 1 GB file → measure retrieval time, memory footprint, and bandwidth.
- Verify the file reconstruction hash matches the action's `dataHash`.
- Validate logs show the expected early-stop (`found_network ≥ needNetwork`).
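For the unit items above, a rough test sketch against the hypothetical `assignPrimaries` helper from the summary could look like the following; names and population sizes are illustrative, not the PR's actual test code.

```go
// Hypothetical unit-test sketch: every key must end up with exactly one
// primary provider in the wave-one assignment.
package dht

import (
	"crypto/rand"
	"testing"
)

func TestUniquePrimaryPerKey(t *testing.T) {
	keys := make([][]byte, 5000)
	for i := range keys {
		keys[i] = make([]byte, 32)
		rand.Read(keys[i])
	}
	candidates := make([]node, 50)
	for i := range candidates {
		candidates[i].ID = make([]byte, 32)
		rand.Read(candidates[i].ID)
	}

	assigned := assignPrimaries(keys, candidates)

	// Each key is assigned exactly once; per-node counts can also be checked
	// here against perNodeRequestCap once the cap logic is wired in.
	total := 0
	for _, ks := range assigned {
		total += len(ks)
	}
	if total != len(keys) {
		t.Fatalf("want %d assignments (one per key), got %d", len(keys), total)
	}
}
```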
🧾 Key Files Changed
| File | Change |
| --- | --- |
| `pkg/dht/retrieve_stream.go` | New `BatchRetrieveStream`, `processBatchStream`, and `iterateBatchGetValuesStream` with streaming and waves (a write-to-disk sketch follows this table). |
| `pkg/dht/local.go` | Added `fetchAndWriteLocalKeysBatched` for batched local streaming. |
| `pkg/dht/constants.go` | Added tuning constants and `perNodeRequestCap`. |
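For context on the streaming writes implemented in `retrieve_stream.go` and `local.go`, here is a small hypothetical sketch of writing a symbol straight into the RaptorQ workspace; the PR's real file layout and naming may differ.

```go
// Hypothetical sketch: persist one symbol directly to the workspace so the
// caller can drop the buffer immediately and memory usage stays flat.
package dht

import (
	"encoding/hex"
	"os"
	"path/filepath"
)

func writeSymbol(workspaceDir string, key, symbol []byte) error {
	name := hex.EncodeToString(key)
	tmp := filepath.Join(workspaceDir, name+".tmp")
	if err := os.WriteFile(tmp, symbol, 0o600); err != nil {
		return err
	}
	// Rename is atomic on POSIX filesystems, so readers never observe a partial symbol.
	return os.Rename(tmp, filepath.Join(workspaceDir, name))
}
```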
🧰 Backwards Compatibility
✅ 100 % backwards-compatible.
Old `BatchRetrieve` API remains untouched; `BatchRetrieveStream` is a non-breaking enhancement used by the cascade restore path.

🧠 Reviewer Notes
- Verify `doBatchGetValuesCall` observes `ctx` deadlines (it should); a cancellation sketch follows this list.
- Confirm `s.ht.closestContactsWithIncludingNode` returns up-to-date routing info (top-K list correctness).
- Look out for any log spam; consider lowering some debug levels to trace if necessary.
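On the first point, here is a small sketch of the pattern reviewers are being asked to confirm, with assumed names rather than the PR's actual types: every wave's RPCs share a cancellable context, and an atomic counter cancels it once the required symbol count is reached, so calls like `doBatchGetValuesCall` should return promptly on `ctx.Err()`.

```go
// Hypothetical sketch of atomic early-stop plus context cancellation.
package dht

import (
	"context"
	"sync/atomic"
)

type earlyStop struct {
	need   int64        // symbols required to hit the decode threshold (~17 %)
	found  atomic.Int64 // symbols written so far, across all waves
	cancel context.CancelFunc
}

func newEarlyStop(ctx context.Context, need int64) (*earlyStop, context.Context) {
	cctx, cancel := context.WithCancel(ctx)
	return &earlyStop{need: need, cancel: cancel}, cctx
}

// record adds freshly written symbols and cancels the shared context once the
// threshold is met; in-flight RPCs then fail fast instead of finishing their waves.
func (e *earlyStop) record(n int) {
	if e.found.Add(int64(n)) >= e.need {
		e.cancel()
	}
}
```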
✅ Checklist
Local streaming verified
Primary-provider waves tested
Concurrency limits validated (8×4 = 32 RPCs max)
No over-fetch or memory blow-up
File reconstruction verified via hash match
Integration test passed on 50-validator testnet
TL;DR:
This PR makes the supernode’s symbol retrieval fast, predictable, and production-grade — zero memory pressure, bounded concurrency, and smarter use of the network.