Skip to content

Refactor: decouple allocator from master and client#309

Merged
TianDi101 merged 16 commits into
refactor/umbp-dual-scheme-abcfrom
refactor/umbp-dual-scheme-abc-alloc-decoup
May 11, 2026
Merged

Refactor: decouple allocator from master and client#309
TianDi101 merged 16 commits into
refactor/umbp-dual-scheme-abcfrom
refactor/umbp-dual-scheme-abc-alloc-decoup

Conversation

@TianDi101
Copy link
Copy Markdown
Collaborator

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

};
std::vector<PerItemSnapshot> snaps;
snaps.reserve(peer_item_indices.size());
void PoolClient::ExecuteRemotePutTransfers(std::vector<RemotePutEntry>& entries,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One BatchPut pipeline concern: the old path intentionally submitted RDMA first, then ran local memcpy / staging-backed work while RDMA was in flight, and waited at the end.

Here all staging entries are copied before BatchWrite is issued, so memcpy no longer overlaps with RDMA progress. Should we preserve that pipeline behavior?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good suggestion

if (!ok) {
plans[orig_idx].kind = BatchPutItemPlan::Kind::REMOTE_ZC_PREP_FAILED;
continue;
bool PoolClient::BuildRemotePutTransfers(std::vector<RemotePutEntry>& entries, PeerConnection& peer,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we copy all non-zero-copy entries into staging at once, the staging buffer may not be large enough for the whole peer group. Since staging memory is RDMA-registered, the previous serial fallback also let us keep this registered buffer small and reused, instead of sizing it for the whole batch. Should we preserve that fallback behavior?

return results;
}

bool PoolClient::Get(const std::string& key, void* dst, size_t size) {
std::vector<bool> PoolClient::BatchGet(const std::vector<std::string>& keys,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BatchGet looks similar to BatchPut: the pipeline behavior and staging fallback may have the same issues here as well.

}

std::vector<KvEvent> events;
if (peer_alloc_ != nullptr) events = peer_alloc_->DrainPendingEvents();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heartbeat failure currently drops drained ADD/REMOVE events without retrying or re-queueing them, so a committed Put may never become visible to the master?

TianDi101 added 16 commits May 9, 2026 08:38
Reshapes the wire surface for the redesign: master becomes a routing
advisor, peer owns all per-key state.

- umbp.proto: drop Register/Unregister/Lookup/Finalize/Abort/Publish (and
  their batch variants); shrink RoutePut/RouteGet to placement only;
  Heartbeat now carries seq + KvEvent stream + is_full_sync; KvEvent
  message added; Location message removed.
- umbp_peer.proto: add AllocateSlot/CommitSlot/AbortSlot/ResolveKey/
  EvictKey + 4 batch variants; GetPeerInfoResponse now ships DRAM/HBM
  buffer descs + page_size; imports umbp.proto for shared types.
- types.h: Location is now (node_id, tier, size); location_id removed.
- route_put_strategy.h: RoutePutResult shrunk; Select gains exclude_nodes.
- master_client.h: RouteGetResult shrunk to (node_id, tier, size,
  peer_address).
- Relocate page_bitmap_allocator.h to distributed/peer/peer_page_allocator.h.
- CMake: declare cross-proto dep so peer regen tracks umbp.proto edits.

Phase A intentionally leaves the .cpp tree broken at type and RPC sites;
phases B (peer build-out) and C (master gut) close it.
Adds the peer-side machinery that becomes the canonical owner of every
DRAM/HBM key in the redesign.  Master code is still broken from
phase A; phase C closes that side.

- types.h: KvEvent C++ struct (mirrors the wire-level umbp::KvEvent so
  the allocator's outbox doesn't pull in proto headers).
- peer_dram_allocator.{h,cpp}: per-tier PageBitmapAllocator wrapper +
  pending/owned key maps + read-lease tracking + event outbox + reaper.
  Resolve bumps a lease deadline; Evict refuses keys with active
  leases (Bug #7 resolution).  RunReaperOnceForTest is the unit-test
  seam for deterministic TTL expiry.
- peer_service.{h,cpp}: extends GetPeerInfo with DRAM/HBM descs +
  page_size; adds the 5 new RPCs (AllocateSlot/CommitSlot/AbortSlot/
  ResolveKey/EvictKey) plus 4 batch variants.  PeerServiceServer takes
  a non-owning PeerDramAllocator*; null is tolerated for SSD-only
  deployments.
- tests/test_peer_dram_allocator.cpp: 11 unit tests covering ENOSPC,
  pending TTL, idempotent Evict, full-sync snapshot, Resolve-during-
  Evict races, capacity tracking, commit-after-reap.  All passing.
- CMake: register the new .cpp in umbp_common and the new test exe.
Strips master to a routing advisor: no per-Put state, no AllocateForPut,
no FinalizeAllocation, no master-side allocators.  The block index is
now mutated only through peer-shipped KvEvents in heartbeats.

Master-side
- ClientRegistry: drops AllocateForPut/Finalize/Abort/PublishLocalBlock/
  TrackKey/UntrackKey, the pending- and finalized-allocation maps,
  per-tier PageBitmapAllocator, ssd_allocators, dram descriptor list.
  ~70% line reduction.  New Heartbeat takes (seq, last_acked_seq,
  capacities, events, is_full_sync) and applies events under
  GlobalBlockIndex.  Reaper expires nodes only.
- ClientRecord: drops allocators / descs / log throttle; adds
  last_applied_seq.
- GlobalBlockIndex: adds ApplyEvents + ReplaceNodeLocations; drops
  Register/Unregister/UnregisterByNode/BatchRegister/BatchUnregister/
  EvictEntries/SetDepth/GetDepth.  Dedup key is (node_id, tier).
- master_server: drops Register/Unregister/Lookup/BatchLookup/
  Finalize/Publish/Abort handlers; reshapes RegisterClient (no DRAM
  args), Heartbeat (event-driven), RoutePut/RouteGet (placement only).
  Drops GetBlockDepthForTest.
- master_client: matches; new RoutePut/RouteGet take exclude_nodes;
  HeartbeatLoop drains a bound PeerDramAllocator outbox, sends seq,
  and reships full-sync on master gap-detect.  Drops Register/
  Unregister/Lookup/Finalize/Publish/Abort/BatchAbort wire calls.
- master_metrics: drops dead per-RPC counters, adds events-applied +
  seq-gap counters.
- eviction_manager: gutted to find-candidates only; phase D wires the
  EvictKey RPC dispatch.  Sort drops the depth tiebreaker (master no
  longer carries depth).
- Routing: RoutePutStrategy::Select gains exclude_nodes; Router drops
  AllocateForPut callout, threads exclude through RoutePut/RouteGet.

Peer-side
- types.h: Location is (node_id, tier, size); drops
  ParseDramLocationId/SizeMatchesAllocation/PendingAllocation/
  ParseLocationId — peer-internal now.
- peer_page_allocator: drops DeallocateByLocationId.
- PeerDramAllocator: adds QueueExternalEvent so the SSD CommitSsdWrite
  path can ship ADD events through the same outbox.
- peer_service: ctor drops the PoolClient& dependency entirely; SSD
  CommitSsdWrite no longer calls coordinator_.FinalizeAllocation, and
  instead queues ADD on the dram allocator outbox.

Build state
- Every master-side / peer-side TU touched here compiles green in the
  rocm-7 container.  pool_client.cpp / distributed_client.cpp /
  umbp_client_factory.cpp still fail — that's phase E's job.
Wires the eviction loop end-to-end: master picks victims, ships
EvictKey to peers via a master-server-owned stub pool, and waits for
peer-shipped REMOVE events on the next heartbeat to actually shrink
the index.  Heartbeat-event plumbing already landed in phase C.

- eviction_manager.h: new EvictKeyDispatcher interface; EvictionManager
  ctor takes a non-owning dispatcher pointer (null is tolerated for
  routing-only test fixtures).
- eviction_manager.cpp: groups victims by node, looks up peer_address
  from the alive-clients snapshot, hands off via the dispatcher.
  Master state stays untouched per the design — re-eviction next
  round is safe because peer Evict is idempotent.
- master_server.cpp: adds MasterPeerStubPool (defined in the impl
  TU's anonymous namespace) with a (node_id -> {peer_address, stub})
  cache; bounded RPC deadline (UMBP_EVICTKEY_DEADLINE_MS, default
  1000ms); failed dispatches drop the cached stub so the next try
  reconnects.
- master_server.h: holds the pool as unique_ptr<EvictKeyDispatcher>
  to keep MasterPeerStubPool's definition out of the header.

The HeartbeatLoop reshape called out in phase D's description was
already done in phase C (drains DrainPendingEvents, advances seq,
reships SnapshotOwnedKeys on master-requested full sync).

Build state: master-side TUs all compile clean.  pool_client and
distributed_client still fail — phase E's job.
Rewrites the writer-side data plane.  Master only routes; the peer
RPCs (AllocateSlot/CommitSlot/AbortSlot/ResolveKey) handle the actual
allocation + key-set ownership.  Self-targeted Puts/Gets bypass RPCs
entirely and reach into the in-process PeerDramAllocator.

Pool client surface
- pool_client.{h,cpp}: full rewrite (~2476 → ~720 LOC).  Drops the
  RegisterWithMaster/FinalizeAllocation/PublishLocalBlock/AbortAllocation/
  UnregisterFromMaster/IsRegistered surface; Exists/BatchExists now
  go through RouteGet.  PoolClient owns the per-process
  PeerDramAllocator and binds it to the master client so the
  heartbeat thread drains the outbox.
- Put: RoutePut → if self, peer_alloc_->Allocate + LocalPutPages +
  peer_alloc_->Commit; else AllocateSlot → ScatterWrite → CommitSlot,
  with bounded ENOSPC retry against exclude_nodes.  Failures fire-
  and-forget AbortSlot; the peer reaper covers any drop.
- Get: RouteGet → if self, peer_alloc_->Resolve + LocalGetPages; else
  ResolveKey → ScatterRead, with bounded retry on peer-evict races.
- BatchPut/BatchGet: per-item loop for now (the prior fused
  RDMA pipeline lived against the old master-RPC tail and didn't
  survive the redesign cleanly; can return as a follow-up).
- Drops the cluster_locations_ cache (unused under the new wire shape).

Cascade
- distributed_client.cpp: BatchPutWithDepth drops the depth hint
  (master no longer carries per-key depth in the new event model).
- bin/client_main.cpp: demo loop trimmed to RoutePut/RouteGet +
  external KV — Register/Unregister are gone from MasterClient.

Build state
- libumbp_common.a, umbp_master, umbp_client all build clean in the
  rocm-7 container.  test_peer_dram_allocator (11 tests) still
  passes.  The legacy tests/cpp/umbp/distributed/*.cpp suite no
  longer builds — that's phase F's job (rewrite tests against the
  new wire shape).
Closes the redesign by making the default build green again and adding
unit coverage for the new event-shipping flow.

- src/umbp/tests/test_global_block_index_events.cpp: 14 tests across
  GlobalBlockIndex::ApplyEvents / ReplaceNodeLocations and
  ClientRegistry::Heartbeat.  Covers ADD/REMOVE round-trip, multi-
  node coexistence on the same key, full-sync replacement, empty-
  full-sync (used by Unregister + reaper), seq-gap detection,
  REMOVE-in-full-sync no-op, and FindEvictionCandidates tier
  filtering.  All passing in the rocm-7 container.
- tests/cpp/umbp/distributed/CMakeLists.txt: gate the pre-redesign
  suite behind MORI_UMBP_LEGACY_DISTRIBUTED_TESTS (default OFF).
  16 targets there reference deleted APIs (AllocateForPut,
  FinalizeAllocation, Location.location_id, EvictionCandidate.depth,
  the fused BatchPut/BatchGet observability counters, etc.) and
  belong to a future test rewrite — they stay on disk for reference
  rather than being deleted outright.

Total UMBP unit-test surface, full build green:
  test_external_kv_block_index       10 tests
  test_client_registry_external_kv    6 tests
  test_peer_dram_allocator           11 tests   (phase B)
  test_global_block_index_events     14 tests   (this commit)
                                    ---------
                                     41 tests

Deferred to follow-up:
  - rewrite tests/cpp/umbp/distributed/* against the new wire shape
    (ENOSPC retry, master-restart full-resync, RAW window bound,
    master-driven eviction round-trip)
  - re-add the per-peer fused BatchPut/BatchGet RDMA pipeline that
    Phase E simplified to a per-item loop
  - bench_pd_disagg.sh regression sweep against prior commit
@TianDi101 TianDi101 force-pushed the refactor/umbp-dual-scheme-abc-alloc-decoup branch from 51f8f80 to d24a289 Compare May 11, 2026 01:54
@TianDi101 TianDi101 merged commit 3a2499d into refactor/umbp-dual-scheme-abc May 11, 2026
isytwu pushed a commit that referenced this pull request May 11, 2026
- Refactor current design to decouple master & client allocation logics
- New pattern follows eventual consistency and master-as-advisor
maning00 pushed a commit that referenced this pull request May 13, 2026
- Refactor current design to decouple master & client allocation logics
- New pattern follows eventual consistency and master-as-advisor
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants