Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distributed debug handlers #126601

Closed
wants to merge 1 commit into from
Closed

distributed debug handlers #126601

wants to merge 1 commit into from

Conversation

d4l3k
Copy link
Collaborator

@d4l3k d4l3k commented May 18, 2024

This adds debug handlers as described in:

This is only adding the C++ pieces that will be used from the main process. The Python and torchrun pieces will be added in a follow up PR.

This adds 2 handlers out of the box:

  • /handler/ping for testing purposes
  • /handler/dump_nccl_trace_pickle as a POC integration with Flight Recorder

Test plan:

python test/distributed/elastic/test_control_plane.py

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang

Copy link

pytorch-bot bot commented May 18, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126601

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

✅ You can merge normally! (1 Unrelated Failure)

As of commit 469b55d with merge base 38a33c3 (image):

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (torchelastic) labels May 18, 2024
@d4l3k d4l3k force-pushed the d4l3k/control_plane branch 4 times, most recently from 26c5779 to 7b71448 Compare May 21, 2024 18:06
@d4l3k d4l3k changed the title distributed/control_plane distributed debug handlers May 21, 2024
@d4l3k d4l3k marked this pull request as ready for review May 21, 2024 18:11
test/distributed/elastic/test_control_plane.py Outdated Show resolved Hide resolved
os.environ[TORCH_WORKER_SERVER_SOCKET] = socket_path

print("starting")
with main():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have a more descriptive name, given that is from another module?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about following the hydra pattern here: https://hydra.cc/docs/intro/

In practice this should be something like @worker_server.main.

@requires_cuda
def test_dump_nccl_trace_pickle(self) -> None:
with local_worker_server() as session:
resp = session.post("http+unix://worker/handler/dump_nccl_trace_pickle")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UX:

  • Do we need to register the handler first?
  • If not for some cases, how would users know what is enabled by default?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should adopt some RESTful API approach?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handlers do need to be registered before the post request -- we're currently statically registering all handlers so they'll always be present by this call

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RESTful is a pretty overloaded term could you clarify what you mean? I'm not really sure how to express "dump_nccl_trace_pickle" in the rest model since it doesn't really map to a data object with CRUD

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess naively, GET "/worker/handler/nccl_trace" seems 'restful' as you're reading the state of an object. I'm not sure if the RESTful model accounts for an object that is constantly changing, but otherwise that seems ok to me?

or is the difference that we are requesting a dump to disk in this call vs a return of the current value?

Copy link
Collaborator Author

@d4l3k d4l3k May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This returns the data directly. For this one I think it's more or less fine to do get, but for some other potential handlers it's weirder. "Start tracing for 10 seconds" is not a very restful behavior.

I think the prior art for that is a CREATE/POST call to create the profiling and then other requests to get the data. i.e.

CREATE /profile {"time": 10} -> {"id": 15}
GET /profile/15 -> {...}

vs

POST /handler/profile {"time": 10} -> {...}

We'd have to overhaul the simpler handler design to support that I think -- so in someways would just prefer keeping it all POSTs even if it isn't pure "rest"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RESTful is a pretty overloaded term could you clarify what you mean?

Worth exploring HATEOAS that could support various cases given pluggable nature of handlers.

Copy link
Collaborator Author

@d4l3k d4l3k May 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we were going that route I think something like openapi or Swagger might be best. Could potentially generate thrift defs off of Swagger as well

I do like keeping things simple though initially until we flesh out what these interfaces should be

class RegisterHandler {
public:
RegisterHandler(const std::string& name, HandlerFunc f) {
registerHandler(name, f);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a standard pattern to update shared state in a constructor?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a static object guard -- we use the same thing for torch custom op registration. The shared state is in a meyers singleton in getHandlersRegistry so it neatly avoids any static initialization ordering issues.

Ex:

explicit RegisterOperators(std::vector<std::optional<Operator>> operators) {

@wconstab
Copy link
Contributor

didnt do a careful review yet but it looks great to me overall!

@d4l3k d4l3k force-pushed the d4l3k/control_plane branch 2 times, most recently from 6cfac6d to 9dbfd35 Compare May 23, 2024 18:52
Copy link
Contributor

@kurman kurman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, approving to unblock

  • Would be good to get additional reviews on C++ code structure
  • Address failing tests

@d4l3k d4l3k force-pushed the d4l3k/control_plane branch 4 times, most recently from 256640a to f9ed48c Compare May 23, 2024 22:46
@requires_cuda
def test_dump_nccl_trace_pickle(self) -> None:
with local_worker_server() as pool:
resp = pool.request("POST", "/handler/dump_nccl_trace_pickle")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!


auto it = handlers_.find(name);
if (it == handlers_.end()) {
throw std::runtime_error(fmt::format("Failed to find handler {}", name));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this cause a crash at runtime if I call with a BAD request?

Or will it be caught somewhere?
Maybe it is better to just return some HTTP friendly code?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified that it doesn't cause a crash and just returns a 500 erorr. I added a custom error handling though to report a sane error instead.

Copy link
Contributor

@c-p-i-o c-p-i-o left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - aside from the concern of crash when calling
curl http://host/NON_EXISTENT_PATH

@d4l3k
Copy link
Collaborator Author

d4l3k commented May 24, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label May 24, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@d4l3k
Copy link
Collaborator Author

d4l3k commented May 29, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

pytorch-bot bot pushed a commit that referenced this pull request May 30, 2024
This adds debug handlers as described in:
* https://gist.github.com/d4l3k/828b7be585c7615e85b2c448b308d925 (public copy)
* https://docs.google.com/document/d/1la68szcS6wUYElUUX-P6zXgkPA8lnfzpagMTPys3aQ8/edit (internal copy)

This is only adding the C++ pieces that will be used from the main process. The Python and torchrun pieces will be added in a follow up PR.

This adds 2 handlers out of the box:

* `/handler/ping` for testing purposes
* `/handler/dump_nccl_trace_pickle` as a POC integration with Flight Recorder

Test plan:

```
python test/distributed/elastic/test_control_plane.py
```

Pull Request resolved: #126601
Approved by: https://github.com/kurman, https://github.com/c-p-i-o
@d4l3k d4l3k deleted the d4l3k/control_plane branch May 30, 2024 15:59
@PaliC
Copy link
Contributor

PaliC commented May 31, 2024

@pytorchbot revert -m "breaking internal typechecking tests" -c "ghfirst"

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Copy link
Collaborator

@d4l3k your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request May 31, 2024
This reverts commit 3d54183.

Reverted #126601 on behalf of https://github.com/PaliC due to breaking internal typechecking tests ([comment](#126601 (comment)))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (torchelastic) Reverted
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants