Refactor: decouple allocator from master and client#309
Conversation
| }; | ||
| std::vector<PerItemSnapshot> snaps; | ||
| snaps.reserve(peer_item_indices.size()); | ||
| void PoolClient::ExecuteRemotePutTransfers(std::vector<RemotePutEntry>& entries, |
There was a problem hiding this comment.
One BatchPut pipeline concern: the old path intentionally submitted RDMA first, then ran local memcpy / staging-backed work while RDMA was in flight, and waited at the end.
Here all staging entries are copied before BatchWrite is issued, so memcpy no longer overlaps with RDMA progress. Should we preserve that pipeline behavior?
There was a problem hiding this comment.
Yes, good suggestion
| if (!ok) { | ||
| plans[orig_idx].kind = BatchPutItemPlan::Kind::REMOTE_ZC_PREP_FAILED; | ||
| continue; | ||
| bool PoolClient::BuildRemotePutTransfers(std::vector<RemotePutEntry>& entries, PeerConnection& peer, |
There was a problem hiding this comment.
If we copy all non-zero-copy entries into staging at once, the staging buffer may not be large enough for the whole peer group. Since staging memory is RDMA-registered, the previous serial fallback also let us keep this registered buffer small and reused, instead of sizing it for the whole batch. Should we preserve that fallback behavior?
| return results; | ||
| } | ||
|
|
||
| bool PoolClient::Get(const std::string& key, void* dst, size_t size) { | ||
| std::vector<bool> PoolClient::BatchGet(const std::vector<std::string>& keys, |
There was a problem hiding this comment.
BatchGet looks similar to BatchPut: the pipeline behavior and staging fallback may have the same issues here as well.
| } | ||
|
|
||
| std::vector<KvEvent> events; | ||
| if (peer_alloc_ != nullptr) events = peer_alloc_->DrainPendingEvents(); |
There was a problem hiding this comment.
Heartbeat failure currently drops drained ADD/REMOVE events without retrying or re-queueing them, so a committed Put may never become visible to the master?
Reshapes the wire surface for the redesign: master becomes a routing advisor, peer owns all per-key state. - umbp.proto: drop Register/Unregister/Lookup/Finalize/Abort/Publish (and their batch variants); shrink RoutePut/RouteGet to placement only; Heartbeat now carries seq + KvEvent stream + is_full_sync; KvEvent message added; Location message removed. - umbp_peer.proto: add AllocateSlot/CommitSlot/AbortSlot/ResolveKey/ EvictKey + 4 batch variants; GetPeerInfoResponse now ships DRAM/HBM buffer descs + page_size; imports umbp.proto for shared types. - types.h: Location is now (node_id, tier, size); location_id removed. - route_put_strategy.h: RoutePutResult shrunk; Select gains exclude_nodes. - master_client.h: RouteGetResult shrunk to (node_id, tier, size, peer_address). - Relocate page_bitmap_allocator.h to distributed/peer/peer_page_allocator.h. - CMake: declare cross-proto dep so peer regen tracks umbp.proto edits. Phase A intentionally leaves the .cpp tree broken at type and RPC sites; phases B (peer build-out) and C (master gut) close it.
Adds the peer-side machinery that becomes the canonical owner of every
DRAM/HBM key in the redesign. Master code is still broken from
phase A; phase C closes that side.
- types.h: KvEvent C++ struct (mirrors the wire-level umbp::KvEvent so
the allocator's outbox doesn't pull in proto headers).
- peer_dram_allocator.{h,cpp}: per-tier PageBitmapAllocator wrapper +
pending/owned key maps + read-lease tracking + event outbox + reaper.
Resolve bumps a lease deadline; Evict refuses keys with active
leases (Bug #7 resolution). RunReaperOnceForTest is the unit-test
seam for deterministic TTL expiry.
- peer_service.{h,cpp}: extends GetPeerInfo with DRAM/HBM descs +
page_size; adds the 5 new RPCs (AllocateSlot/CommitSlot/AbortSlot/
ResolveKey/EvictKey) plus 4 batch variants. PeerServiceServer takes
a non-owning PeerDramAllocator*; null is tolerated for SSD-only
deployments.
- tests/test_peer_dram_allocator.cpp: 11 unit tests covering ENOSPC,
pending TTL, idempotent Evict, full-sync snapshot, Resolve-during-
Evict races, capacity tracking, commit-after-reap. All passing.
- CMake: register the new .cpp in umbp_common and the new test exe.
Strips master to a routing advisor: no per-Put state, no AllocateForPut, no FinalizeAllocation, no master-side allocators. The block index is now mutated only through peer-shipped KvEvents in heartbeats. Master-side - ClientRegistry: drops AllocateForPut/Finalize/Abort/PublishLocalBlock/ TrackKey/UntrackKey, the pending- and finalized-allocation maps, per-tier PageBitmapAllocator, ssd_allocators, dram descriptor list. ~70% line reduction. New Heartbeat takes (seq, last_acked_seq, capacities, events, is_full_sync) and applies events under GlobalBlockIndex. Reaper expires nodes only. - ClientRecord: drops allocators / descs / log throttle; adds last_applied_seq. - GlobalBlockIndex: adds ApplyEvents + ReplaceNodeLocations; drops Register/Unregister/UnregisterByNode/BatchRegister/BatchUnregister/ EvictEntries/SetDepth/GetDepth. Dedup key is (node_id, tier). - master_server: drops Register/Unregister/Lookup/BatchLookup/ Finalize/Publish/Abort handlers; reshapes RegisterClient (no DRAM args), Heartbeat (event-driven), RoutePut/RouteGet (placement only). Drops GetBlockDepthForTest. - master_client: matches; new RoutePut/RouteGet take exclude_nodes; HeartbeatLoop drains a bound PeerDramAllocator outbox, sends seq, and reships full-sync on master gap-detect. Drops Register/ Unregister/Lookup/Finalize/Publish/Abort/BatchAbort wire calls. - master_metrics: drops dead per-RPC counters, adds events-applied + seq-gap counters. - eviction_manager: gutted to find-candidates only; phase D wires the EvictKey RPC dispatch. Sort drops the depth tiebreaker (master no longer carries depth). - Routing: RoutePutStrategy::Select gains exclude_nodes; Router drops AllocateForPut callout, threads exclude through RoutePut/RouteGet. Peer-side - types.h: Location is (node_id, tier, size); drops ParseDramLocationId/SizeMatchesAllocation/PendingAllocation/ ParseLocationId — peer-internal now. - peer_page_allocator: drops DeallocateByLocationId. - PeerDramAllocator: adds QueueExternalEvent so the SSD CommitSsdWrite path can ship ADD events through the same outbox. - peer_service: ctor drops the PoolClient& dependency entirely; SSD CommitSsdWrite no longer calls coordinator_.FinalizeAllocation, and instead queues ADD on the dram allocator outbox. Build state - Every master-side / peer-side TU touched here compiles green in the rocm-7 container. pool_client.cpp / distributed_client.cpp / umbp_client_factory.cpp still fail — that's phase E's job.
Wires the eviction loop end-to-end: master picks victims, ships
EvictKey to peers via a master-server-owned stub pool, and waits for
peer-shipped REMOVE events on the next heartbeat to actually shrink
the index. Heartbeat-event plumbing already landed in phase C.
- eviction_manager.h: new EvictKeyDispatcher interface; EvictionManager
ctor takes a non-owning dispatcher pointer (null is tolerated for
routing-only test fixtures).
- eviction_manager.cpp: groups victims by node, looks up peer_address
from the alive-clients snapshot, hands off via the dispatcher.
Master state stays untouched per the design — re-eviction next
round is safe because peer Evict is idempotent.
- master_server.cpp: adds MasterPeerStubPool (defined in the impl
TU's anonymous namespace) with a (node_id -> {peer_address, stub})
cache; bounded RPC deadline (UMBP_EVICTKEY_DEADLINE_MS, default
1000ms); failed dispatches drop the cached stub so the next try
reconnects.
- master_server.h: holds the pool as unique_ptr<EvictKeyDispatcher>
to keep MasterPeerStubPool's definition out of the header.
The HeartbeatLoop reshape called out in phase D's description was
already done in phase C (drains DrainPendingEvents, advances seq,
reships SnapshotOwnedKeys on master-requested full sync).
Build state: master-side TUs all compile clean. pool_client and
distributed_client still fail — phase E's job.
Rewrites the writer-side data plane. Master only routes; the peer
RPCs (AllocateSlot/CommitSlot/AbortSlot/ResolveKey) handle the actual
allocation + key-set ownership. Self-targeted Puts/Gets bypass RPCs
entirely and reach into the in-process PeerDramAllocator.
Pool client surface
- pool_client.{h,cpp}: full rewrite (~2476 → ~720 LOC). Drops the
RegisterWithMaster/FinalizeAllocation/PublishLocalBlock/AbortAllocation/
UnregisterFromMaster/IsRegistered surface; Exists/BatchExists now
go through RouteGet. PoolClient owns the per-process
PeerDramAllocator and binds it to the master client so the
heartbeat thread drains the outbox.
- Put: RoutePut → if self, peer_alloc_->Allocate + LocalPutPages +
peer_alloc_->Commit; else AllocateSlot → ScatterWrite → CommitSlot,
with bounded ENOSPC retry against exclude_nodes. Failures fire-
and-forget AbortSlot; the peer reaper covers any drop.
- Get: RouteGet → if self, peer_alloc_->Resolve + LocalGetPages; else
ResolveKey → ScatterRead, with bounded retry on peer-evict races.
- BatchPut/BatchGet: per-item loop for now (the prior fused
RDMA pipeline lived against the old master-RPC tail and didn't
survive the redesign cleanly; can return as a follow-up).
- Drops the cluster_locations_ cache (unused under the new wire shape).
Cascade
- distributed_client.cpp: BatchPutWithDepth drops the depth hint
(master no longer carries per-key depth in the new event model).
- bin/client_main.cpp: demo loop trimmed to RoutePut/RouteGet +
external KV — Register/Unregister are gone from MasterClient.
Build state
- libumbp_common.a, umbp_master, umbp_client all build clean in the
rocm-7 container. test_peer_dram_allocator (11 tests) still
passes. The legacy tests/cpp/umbp/distributed/*.cpp suite no
longer builds — that's phase F's job (rewrite tests against the
new wire shape).
Closes the redesign by making the default build green again and adding
unit coverage for the new event-shipping flow.
- src/umbp/tests/test_global_block_index_events.cpp: 14 tests across
GlobalBlockIndex::ApplyEvents / ReplaceNodeLocations and
ClientRegistry::Heartbeat. Covers ADD/REMOVE round-trip, multi-
node coexistence on the same key, full-sync replacement, empty-
full-sync (used by Unregister + reaper), seq-gap detection,
REMOVE-in-full-sync no-op, and FindEvictionCandidates tier
filtering. All passing in the rocm-7 container.
- tests/cpp/umbp/distributed/CMakeLists.txt: gate the pre-redesign
suite behind MORI_UMBP_LEGACY_DISTRIBUTED_TESTS (default OFF).
16 targets there reference deleted APIs (AllocateForPut,
FinalizeAllocation, Location.location_id, EvictionCandidate.depth,
the fused BatchPut/BatchGet observability counters, etc.) and
belong to a future test rewrite — they stay on disk for reference
rather than being deleted outright.
Total UMBP unit-test surface, full build green:
test_external_kv_block_index 10 tests
test_client_registry_external_kv 6 tests
test_peer_dram_allocator 11 tests (phase B)
test_global_block_index_events 14 tests (this commit)
---------
41 tests
Deferred to follow-up:
- rewrite tests/cpp/umbp/distributed/* against the new wire shape
(ENOSPC retry, master-restart full-resync, RAW window bound,
master-driven eviction round-trip)
- re-add the per-peer fused BatchPut/BatchGet RDMA pipeline that
Phase E simplified to a per-item loop
- bench_pd_disagg.sh regression sweep against prior commit
51f8f80 to
d24a289
Compare
- Refactor current design to decouple master & client allocation logics - New pattern follows eventual consistency and master-as-advisor
- Refactor current design to decouple master & client allocation logics - New pattern follows eventual consistency and master-as-advisor
Motivation
Technical Details
Test Plan
Test Result
Submission Checklist