-
Notifications
You must be signed in to change notification settings - Fork 21.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
distributed debug handlers #126601
distributed debug handlers #126601
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126601
Note: Links to docs will display an error until the docs builds have been completed. ❗ 1 Active SEVsThere are 1 currently active SEVs. If your PR is affected, please view them below: ✅ You can merge normally! (1 Unrelated Failure)As of commit 469b55d with merge base 38a33c3 (): UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
26c5779
to
7b71448
Compare
os.environ[TORCH_WORKER_SERVER_SOCKET] = socket_path | ||
|
||
print("starting") | ||
with main(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we have a more descriptive name, given that is from another module?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about following the hydra pattern here: https://hydra.cc/docs/intro/
In practice this should be something like @worker_server.main
.
@requires_cuda | ||
def test_dump_nccl_trace_pickle(self) -> None: | ||
with local_worker_server() as session: | ||
resp = session.post("http+unix://worker/handler/dump_nccl_trace_pickle") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UX:
- Do we need to register the handler first?
- If not for some cases, how would users know what is enabled by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should adopt some RESTful API approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
handlers do need to be registered before the post request -- we're currently statically registering all handlers so they'll always be present by this call
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RESTful is a pretty overloaded term could you clarify what you mean? I'm not really sure how to express "dump_nccl_trace_pickle" in the rest model since it doesn't really map to a data object with CRUD
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i guess naively, GET "/worker/handler/nccl_trace"
seems 'restful' as you're reading the state of an object. I'm not sure if the RESTful model accounts for an object that is constantly changing, but otherwise that seems ok to me?
or is the difference that we are requesting a dump to disk in this call vs a return of the current value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This returns the data directly. For this one I think it's more or less fine to do get, but for some other potential handlers it's weirder. "Start tracing for 10 seconds" is not a very restful behavior.
I think the prior art for that is a CREATE/POST call to create the profiling and then other requests to get the data. i.e.
CREATE /profile {"time": 10} -> {"id": 15}
GET /profile/15 -> {...}
vs
POST /handler/profile {"time": 10} -> {...}
We'd have to overhaul the simpler handler design to support that I think -- so in someways would just prefer keeping it all POSTs even if it isn't pure "rest"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RESTful is a pretty overloaded term could you clarify what you mean?
Worth exploring HATEOAS that could support various cases given pluggable nature of handlers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we were going that route I think something like openapi or Swagger might be best. Could potentially generate thrift defs off of Swagger as well
I do like keeping things simple though initially until we flesh out what these interfaces should be
class RegisterHandler { | ||
public: | ||
RegisterHandler(const std::string& name, HandlerFunc f) { | ||
registerHandler(name, f); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it a standard pattern to update shared state in a constructor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is a static object guard -- we use the same thing for torch custom op registration. The shared state is in a meyers singleton in getHandlersRegistry so it neatly avoids any static initialization ordering issues.
Ex:
explicit RegisterOperators(std::vector<std::optional<Operator>> operators) { |
7b71448
to
1e2c7fc
Compare
didnt do a careful review yet but it looks great to me overall! |
6cfac6d
to
9dbfd35
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good, approving to unblock
- Would be good to get additional reviews on C++ code structure
- Address failing tests
256640a
to
f9ed48c
Compare
@requires_cuda | ||
def test_dump_nccl_trace_pickle(self) -> None: | ||
with local_worker_server() as pool: | ||
resp = pool.request("POST", "/handler/dump_nccl_trace_pickle") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
|
||
auto it = handlers_.find(name); | ||
if (it == handlers_.end()) { | ||
throw std::runtime_error(fmt::format("Failed to find handler {}", name)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this cause a crash at runtime
if I call with a BAD
request?
Or will it be caught somewhere?
Maybe it is better to just return some HTTP friendly code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I verified that it doesn't cause a crash and just returns a 500 erorr. I added a custom error handling though to report a sane error instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - aside from the concern of crash when calling
curl http://host/NON_EXISTENT_PATH
f9ed48c
to
041cfe5
Compare
041cfe5
to
6f41b08
Compare
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
6f41b08
to
469b55d
Compare
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
This adds debug handlers as described in: * https://gist.github.com/d4l3k/828b7be585c7615e85b2c448b308d925 (public copy) * https://docs.google.com/document/d/1la68szcS6wUYElUUX-P6zXgkPA8lnfzpagMTPys3aQ8/edit (internal copy) This is only adding the C++ pieces that will be used from the main process. The Python and torchrun pieces will be added in a follow up PR. This adds 2 handlers out of the box: * `/handler/ping` for testing purposes * `/handler/dump_nccl_trace_pickle` as a POC integration with Flight Recorder Test plan: ``` python test/distributed/elastic/test_control_plane.py ``` Pull Request resolved: #126601 Approved by: https://github.com/kurman, https://github.com/c-p-i-o
@pytorchbot revert -m "breaking internal typechecking tests" -c "ghfirst" |
@pytorchbot successfully started a revert job. Check the current status here. |
@d4l3k your PR has been successfully reverted. |
This reverts commit 3d54183. Reverted #126601 on behalf of https://github.com/PaliC due to breaking internal typechecking tests ([comment](#126601 (comment)))
This adds debug handlers as described in:
This is only adding the C++ pieces that will be used from the main process. The Python and torchrun pieces will be added in a follow up PR.
This adds 2 handlers out of the box:
/handler/ping
for testing purposes/handler/dump_nccl_trace_pickle
as a POC integration with Flight RecorderTest plan:
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang