Make Communications IRs inheriting from Expr. #2185

Merged: 6 commits merged on May 11, 2024

samnordmann
Collaborator

@samnordmann samnordmann commented May 2, 2024

This PR makes the "Communication" class a proper IR inheriting from Expr. This patch is needed for implementing Host IRs. It is also one step towards making the Communications (and more generally the multidevice module) fully symbolic.

Along the way, we do a couple of refactorings and remove the Communicator::sendRecv method.

Remarks:

  1. For now, this IR will be used in the context of Host IRs. Later, it could also serve as kernel IR (backed by device-side communication APIs at runtime).
  2. Before this patch, we had a base class Communication and one derived class per collective type (Allgather, Allreduce, Broadcast, etc.). Now, there is only the class Communication, and the collective type is encoded through an enum class CollectiveType added to the parameter member CommParams.
  3. The Communication::post method was replaced by a standalone function postCollective. The motivation is to separate the runtime execution from the symbolic representation of the collective.
  4. Note that step 3) is only implemented halfway here, since a Collective is instantiated with concrete device indices and concrete at::Tensor, while it should be instantiated with symbolic representations and bound to actual device indices and ATen buffers at runtime. This will be added in a future PR.
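
The design in remarks 2 and 3 can be sketched as follows. This is a simplified, self-contained illustration and not the nvFuser code: only the names Communication, CollectiveType, CommParams and postCollective come from the description above. The fields of CommParams, the use of plain integers for devices, and a postCollective that returns a descriptive string instead of launching anything are assumptions made for the example.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins: the real classes hold more state (buffers,
// reduction op, etc.) and are not reproduced here.
enum class CollectiveType { Allgather, Allreduce, Broadcast };

struct CommParams {
  CollectiveType type;    // which collective this node represents
  int root;               // assumed field: device that owns the source data
  std::vector<int> team;  // assumed field: participating device indices
};

// One class for every collective: the kind is data, not a subclass.
class Communication {
 public:
  explicit Communication(CommParams params) : params_(std::move(params)) {
    assert(!params_.team.empty() && "a collective needs at least one device");
  }
  const CommParams& params() const { return params_; }

 private:
  CommParams params_;
};

// Runtime execution lives outside the IR node. This sketch only reports
// what would be posted; a real implementation would call the backend.
std::string postCollective(const Communication& comm, int my_device) {
  const CommParams& p = comm.params();
  std::string name;
  switch (p.type) {
    case CollectiveType::Allgather: name = "allgather"; break;
    case CollectiveType::Allreduce: name = "allreduce"; break;
    case CollectiveType::Broadcast: name = "broadcast"; break;
  }
  return name + " on device " + std::to_string(my_device) +
      " (team size " + std::to_string(p.team.size()) + ")";
}
```

With this split, the node can be built and inspected without any runtime, and postCollective is the only place that would need access to a communicator backend.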

CI: https://gitlab-master.nvidia.com/dl/pytorch/fuser-gh-mirror/-/pipelines/14923305

@samnordmann samnordmann force-pushed the make_communications_IRs branch 2 times, most recently from 3d79609 to 3e6fddf (May 2, 2024 14:21)
Collaborator

@wujingyue wujingyue left a comment

LGTM in general! I made a few comments to address before I merge this.

csrc/multidevice/communication.cpp (outdated, resolved)
csrc/multidevice/communication.cpp (outdated, resolved)
csrc/multidevice/communication.cpp (outdated, resolved)
csrc/multidevice/communication.h (outdated, resolved)
} else {
assertBufferCount(params_.dst_bufs, 0);
// TODO add checking symbolic representation of src and dst buffers
bool Communication::sameAs(const Statement* other) const {
Collaborator

I failed to see why we couldn't reuse the "default" implementation:

bool Expr::sameAs(const Statement* other) const {

Can you clarify?

Collaborator Author

Two Communications could be "Expr::sameAs" without being the same, for example if one is Allgather and the other Allreduce. Right?
My goal here is to mimic what's done for other IRs, such as

bool IterDomain::sameAs(const Statement* other) const {

But I don't actually use sameAs -- I simply thought implementing it was necessary for matching the IR specs.

So I'm open to any suggestion about this implementation, including removing it.

Collaborator

Expr::sameAs calls Expr::sameOp, which checks equality of all attributes. I believe the communication type is one of the attributes.

Collaborator Author

Correct me if I'm wrong, but in the current implementation it is not an attribute. The reason is that, IIUC, all attributes need to be Statement*s, and CommunicationParams is not.

Collaborator

There is addDataAttribute, as used in Fuser/csrc/ir/nodes.cpp (lines 2423 to 2424 in 730abb5):

addDataAttribute(op_type);
addDataAttribute(cache_op);

But the PR as is is already hard to merge, so I'll clean that up in a separate PR.
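
The point debated in this thread can be illustrated with a toy model. The sketch below is hypothetical and much simpler than the real IR: it assumes a default sameAs that compares only registered attributes, with addDataAttribute reduced to storing an int. It shows why a collective type kept as a plain member is invisible to the default comparison, and why registering it as a data attribute makes a custom sameAs override unnecessary.

```cpp
#include <cassert>
#include <vector>

// Simplified, hypothetical expression node. Only state registered as an
// attribute takes part in the default sameAs comparison.
class Expr {
 public:
  virtual ~Expr() = default;

  virtual bool sameAs(const Expr* other) const {
    return other != nullptr && attributes_ == other->attributes_;
  }

 protected:
  // Stand-in for registering plain data as an attribute.
  void addDataAttribute(int value) { attributes_.push_back(value); }

 private:
  std::vector<int> attributes_;
};

enum class CollectiveType { Allgather, Allreduce };

// The type is a plain member, so the default comparison cannot see it.
class UnregisteredComm : public Expr {
 public:
  explicit UnregisteredComm(CollectiveType type) : type_(type) {}
  CollectiveType type() const { return type_; }

 private:
  CollectiveType type_;
};

// The type is registered as a data attribute, so the default suffices.
class RegisteredComm : public Expr {
 public:
  explicit RegisteredComm(CollectiveType type) {
    addDataAttribute(static_cast<int>(type));
  }
};

// Usage: these checks hold for the sketch above.
inline void checkSketch() {
  UnregisteredComm a(CollectiveType::Allgather);
  UnregisteredComm b(CollectiveType::Allreduce);
  assert(a.sameAs(&b));  // different collectives, yet reported as the same

  RegisteredComm c(CollectiveType::Allgather);
  RegisteredComm d(CollectiveType::Allreduce);
  assert(!c.sameAs(&d));  // the registered attribute distinguishes them
}
```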

csrc/multidevice/executor.cpp (outdated, resolved)
@samnordmann samnordmann requested a review from wujingyue May 6, 2024 15:43
Collaborator

@cowanmeg cowanmeg left a comment

LGTM, assuming that the post and validate logic is just moved around! If there are changes in those, let me know so I can look more closely.

Also, it is great to see Send/Recv finally removed from Communicator!

csrc/multidevice/communication.cpp (resolved)
csrc/multidevice/communication.cpp (outdated, resolved)
@samnordmann
Collaborator Author

!build --dist

Collaborator

@wujingyue wujingyue left a comment

LGTM! I'll try to merge this. I may need to split this up into multiple PRs, but will let you know.

@wujingyue
Collaborator

I'm in the process of resolving conflicts. I should be able to merge this tomorrow or Monday.

@samnordmann
Collaborator Author

> I'm in the process of resolving conflicts. I should be able to merge this tomorrow or Monday.

Thank you! I'm waiting for this one to be merged before going on with host IR dev. Let me know if I can help!

@wujingyue
Collaborator

!build --dist

@wujingyue wujingyue changed the title make Communications IRs inheriting from Expr Make Communications IRs inheriting from Expr. May 11, 2024
@wujingyue wujingyue merged commit b020415 into NVIDIA:main May 11, 2024
5 checks passed
4 participants