feat(data-plane): add TQ fault tolerance APIs#2492
Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com>
| keys=list(sample_ids), | ||
| partition_id=partition_id, | ||
| ), | ||
| DataPlaneClearError, |
Hmm, is it good practice to pass an error type into a function?
Each error type points to one specific function call, and that information is already available in the stack trace. So DataPlaneClearError (like DataPlaneReadError and DataPlaneWriteError) may not provide any additional information.
Do you think it is necessary?
Alternatively, could we make it more lightweight, for example:
def _call_tq(
raise Exception(f"{operation} failed with ...")
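One way to read the lighter-weight suggestion above, as a minimal sketch only: a single wrapper that takes the operation name as a string and raises one generic error, chaining the original exception so the stack trace is preserved. The names `DataPlaneError`, `call_tq`, and `flaky_clear` are hypothetical and are not taken from the PR.

```python
from typing import Any, Callable


class DataPlaneError(RuntimeError):
    """Hypothetical single error type; not part of the PR."""

    def __init__(self, operation: str, cause: BaseException) -> None:
        super().__init__(f"{operation} failed with {type(cause).__name__}: {cause}")
        self.operation = operation


def call_tq(operation: str, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Run a backend call and re-raise any failure tagged with the operation name."""
    try:
        return fn(*args, **kwargs)
    except Exception as exc:
        # `from exc` keeps the original exception attached as __cause__.
        raise DataPlaneError(operation, exc) from exc


def flaky_clear(keys: list[str]) -> None:
    """Hypothetical stand-in for a backend clear call that always fails."""
    raise ConnectionError("storage unreachable")


if __name__ == "__main__":
    try:
        call_tq("clear", flaky_clear, ["k1"])
    except DataPlaneError as err:
        print(err, "| caused by:", repr(err.__cause__))
```

The trade-off is that callers can no longer catch a clear failure separately from a read or write failure by type; they would have to inspect `err.operation` instead.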
| weight_version=_as_int(first_tag.get("weight_version")), | ||
| created_at=_as_float(first_tag.get("created_at")), | ||
| committed=all(_as_bool(t.get("committed", False)) for t in tags), | ||
| expected_num_keys=expected_num_keys, | ||
| size_bytes=_as_int(first_tag.get("size_bytes")), | ||
| tags=first_tag, | ||
| ) | ||
| ) | ||
| return groups |
I think this is meant to provide the data-plane API to the async RL algorithm.
Are we well aligned with @mehraakash's async RL PR #2700?
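For readers comparing the two PRs, the aggregation in the hunk above can be sketched roughly as follows: per-key tag dicts are folded into one metadata record per rollout group, and a group counts as committed only when every key in it is committed. This is a simplified illustration, not the PR's code; `GroupMeta`, `group_tags`, and the `group_id` and `expected_num_keys` tag fields are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class GroupMeta:
    """Hypothetical stand-in for DataPlaneGroupMeta (subset of fields)."""

    group_id: str
    keys: tuple[str, ...]
    weight_version: int
    committed: bool
    expected_num_keys: int


def group_tags(tags_by_key: dict[str, dict[str, Any]]) -> list[GroupMeta]:
    """Fold per-key tag dicts into one metadata record per group."""
    by_group: dict[str, list[tuple[str, dict[str, Any]]]] = defaultdict(list)
    for key, tag in tags_by_key.items():
        # Assumption: every tag carries the id of the group it belongs to.
        by_group[str(tag["group_id"])].append((key, tag))

    groups = []
    for group_id, members in by_group.items():
        first_tag = members[0][1]
        groups.append(
            GroupMeta(
                group_id=group_id,
                keys=tuple(key for key, _ in members),
                weight_version=int(first_tag.get("weight_version", 0)),
                # Committed only if every key in the group is committed.
                committed=all(bool(t.get("committed", False)) for _, t in members),
                expected_num_keys=int(first_tag.get("expected_num_keys", len(members))),
            )
        )
    return groups
```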
| # ── (C) recovery/control-plane ───────────────────────────────────── | ||
| def ping(self, timeout_s: float | None = None) -> None: |
Will this function be called only after recovery?
I don't see it used anywhere other than the tests.
Is that because we don't have recovery / checkpoint / persistent recovery support yet?
Thanks a lot, @pthombre, for the fault tolerance PR. This PR adds some APIs specifically for async RL, so we should check with @mehraakash. For the fault tolerance API, I think we are still missing persistent recovery, so some APIs can't be verified functionally. I'm not sure whether we can mimic that in a test first and add persistent recovery later as the next step. For the error types, I would personally like to simplify them and keep only those that provide additional error information; otherwise the stack trace already tells us the location of the failed API call.
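As one possible way to mimic recovery in a test before persistent recovery exists, a small in-memory fake with an injectable outage might look like the sketch below. Everything here is hypothetical: `FakeDataPlane`, `crash`, `recover`, `put`, `key_count`, and `is_healthy` are illustration names, the exception is a local stand-in for the PR's `DataPlaneUnavailable`, and the fake simply assumes that stored keys survive a restart.

```python
from typing import Optional


class DataPlaneUnavailable(RuntimeError):
    """Local stand-in for the exception of the same name in the PR."""


class FakeDataPlane:
    """Hypothetical in-memory fake; models a backend whose rows survive a restart."""

    def __init__(self) -> None:
        self._up = True
        self._rows: dict[str, set[str]] = {}  # partition_id -> stored keys

    def crash(self) -> None:
        self._up = False

    def recover(self) -> None:
        # Assumption: persistent recovery, so stored keys are still present.
        self._up = True

    def ping(self, timeout_s: Optional[float] = None) -> None:
        # timeout_s is accepted for signature parity; the fake never blocks.
        if not self._up:
            raise DataPlaneUnavailable("data plane is down")

    def put(self, key: str, partition_id: str) -> None:
        self.ping()
        self._rows.setdefault(partition_id, set()).add(key)

    def key_count(self, partition_id: str) -> int:
        self.ping()
        return len(self._rows.get(partition_id, set()))


def is_healthy(client: FakeDataPlane) -> bool:
    """Turn the typed ping failure into a boolean recovery trigger."""
    try:
        client.ping(timeout_s=1.0)
    except DataPlaneUnavailable:
        return False
    return True
```

A test could then call `crash()`, assert that `is_healthy` returns `False`, call `recover()`, and check that the key count is unchanged. This exercises only the controller-side handling, not the real TQ backend.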
What does this PR do?
Adds the recovery/control-plane API surface that the async SingleController needs to coordinate TransferQueue from metadata only, without moving tensor payloads through the controller.
Issues
N/A
Usage
The async controller-facing methods added to DataPlaneClient are:
Methods Added for Async Controller Support
- ping(timeout_s) health-checks the real data-plane request path so SingleController can detect data-plane/TQ availability failures.
- list_metadata(partition_id) -> list[DataPlaneGroupMeta] returns non-consuming rollout-group metadata. This lets SingleController inspect queued groups without advancing TQ consumer counters or fetching tensors.
- depth(partition_id) -> int counts committed, complete groups visible to recovery. This supports rebuilding queue capacity after controller or data-plane recovery.
- pop(keys, partition_id) removes successfully trained keys. The base implementation routes through clear_samples(), and the TQ adapter translates that to backend kv_clear.
- evict(keys, partition_id) removes stale or abandoned keys, using the same clear_samples() path.
- get_capabilities() -> DataPlaneCapabilities exposes backend recovery guarantees such as persistent recovery, server-side filtering, atomic batch put, and verified clear support.
Supporting Types and Behavior
- DataPlaneGroupMeta for control-plane-only rollout group records: group_id, keys, weight_version, created_at, committed, expected_num_keys, size_bytes, and tags.
- DataPlaneCapabilities so adapters can advertise recovery-relevant behavior.
- DataPlaneUnavailable, DataPlaneTimeout, DataPlaneReadError, DataPlaneWriteError, DataPlaneClearError, and DataPlaneBadRequest.
- TQDataPlaneClient to translate Ray/TQ/storage errors into typed data-plane exceptions.
- MetricsDataPlaneClient so the recovery API is testable and observable without booting Ray/TQ.
Why This Helps Async SingleController
SingleController needs to orchestrate async rollout, training, recovery, cleanup, and backpressure while preserving the invariant that it never handles tensor payloads. These APIs give it a metadata-only boundary:
- list_metadata() lets SC reconstruct queue state and select trainable groups.
- depth() lets SC rebuild capacity accounting after restart or TQ recovery.
- pop() centralizes cleanup of rows that trained successfully.
- evict() gives SC an explicit stale-row cleanup path.
- ping() and typed failures give SC clear recovery triggers instead of parsing generic Ray/TQ exceptions.
This PR does not implement SingleController. It adds the TransferQueue/DataPlane support methods needed by that controller.
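To make the metadata-only boundary concrete, here is a hedged sketch of how a controller might use these methods after a restart. It is not the PR's implementation: `InMemoryClient`, `GroupMeta`, and `reconcile` are hypothetical, the fake ignores `partition_id`, its `clear_samples` drops a whole group if any of its keys is cleared, and the staleness rule (`max_lag` weight versions) is an assumed policy chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class GroupMeta:
    """Subset of the DataPlaneGroupMeta fields named in the description."""

    group_id: str
    keys: tuple[str, ...]
    weight_version: int
    committed: bool
    expected_num_keys: int


def _complete(g: GroupMeta) -> bool:
    return g.committed and len(g.keys) == g.expected_num_keys


class InMemoryClient:
    """Hypothetical stand-in for a DataPlaneClient; holds metadata only."""

    def __init__(self, groups: list[GroupMeta]) -> None:
        self._groups = {g.group_id: g for g in groups}

    def list_metadata(self, partition_id: str) -> list[GroupMeta]:
        # Non-consuming: returns a snapshot and leaves the store untouched.
        return list(self._groups.values())

    def depth(self, partition_id: str) -> int:
        # Counts committed groups that hold all of their expected keys.
        return sum(1 for g in self._groups.values() if _complete(g))

    def clear_samples(self, keys: Iterable[str], partition_id: str) -> None:
        doomed = set(keys)
        self._groups = {
            gid: g for gid, g in self._groups.items() if not doomed & set(g.keys)
        }

    def pop(self, keys: Iterable[str], partition_id: str) -> None:
        self.clear_samples(keys, partition_id)  # trained successfully

    def evict(self, keys: Iterable[str], partition_id: str) -> None:
        self.clear_samples(keys, partition_id)  # stale or abandoned


def reconcile(
    client: InMemoryClient, partition_id: str, current_version: int, max_lag: int
) -> list[GroupMeta]:
    """Evict groups older than max_lag weight versions; return trainable groups."""
    trainable = []
    for g in client.list_metadata(partition_id):
        if current_version - g.weight_version > max_lag:
            client.evict(g.keys, partition_id)
        elif _complete(g):
            trainable.append(g)
    return trainable
```

After `reconcile`, the controller can read `depth()` to rebuild its capacity accounting and call `pop()` on each group once training on it succeeds, without ever touching tensor payloads.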
Before your PR is "Ready for review"
Pre checks:
Additional Information
Testing was not run after the rebase per request. An earlier uv run pytest ... attempt did not execute tests because this workspace does not have the repo-required Python 3.13.13 interpreter available to uv.