Setup rccl for pytorch [REDUX] #347
Conversation
Looks very good in general - a couple of comments/questions.
| ("ncclSuccess", ("rcclSuccess", CONV_TYPE, API_RCCL)), | ||
| ("ncclChar", ("rcclChar", CONV_TYPE, API_RCCL)), | ||
| ("ncclInt8", ("rcclChar", CONV_TYPE, API_RCCL)), | ||
| ("ncclUint8", ("rcclChar", CONV_TYPE, API_RCCL)), #FIXME: This should be mapped to an unsigned int8 type |
for my education: why is it not?
rccl doesn't have one.
Cool - can we get that adjusted?
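(For context, here is a minimal sketch of how mapping entries like the ones above are typically consumed: each NCCL identifier is rewritten to its RCCL counterpart by whole-word substitution. The `CONV_TYPE`/`API_RCCL` placeholders and the `hipify_source` helper below are assumptions for illustration, not the PR's actual hipify implementation.)

```python
import re

# Placeholder constants standing in for the categorization tags used by the
# real mapping tables (assumed names, for illustration only).
CONV_TYPE = "CONV_TYPE"
API_RCCL = "API_RCCL"

# Subset of the NCCL -> RCCL entries shown in the diff above.
RCCL_MAPPINGS = dict([
    ("ncclSuccess", ("rcclSuccess", CONV_TYPE, API_RCCL)),
    ("ncclChar", ("rcclChar", CONV_TYPE, API_RCCL)),
    ("ncclInt8", ("rcclChar", CONV_TYPE, API_RCCL)),
    # RCCL has no unsigned 8-bit type, hence the FIXME: map to rcclChar for now.
    ("ncclUint8", ("rcclChar", CONV_TYPE, API_RCCL)),
])

def hipify_source(text):
    """Rewrite whole-word NCCL identifiers to their RCCL equivalents."""
    for nccl_name, (rccl_name, _conv, _api) in RCCL_MAPPINGS.items():
        text = re.sub(r"\b%s\b" % re.escape(nccl_name), rccl_name, text)
    return text

print(hipify_source("if (status == ncclSuccess) use(ncclUint8);"))
# -> "if (status == rcclSuccess) use(rcclChar);"
```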
'/usr/include',
]))

if IS_CONDA:
I presume we never tested on conda?
That's right. Although the CI environment is on conda, is it not?
Not to my knowledge.
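(Hedged sketch: the `IS_CONDA` branch above presumably special-cases conda environments when assembling include paths. The detection logic below is an assumption for illustration; the PR's actual setup code may test for conda differently.)

```python
import os
import sys

# Assumed conda detection: a CONDA_PREFIX environment variable or a
# conda-meta directory in the interpreter prefix. The real check may differ.
IS_CONDA = (
    "CONDA_PREFIX" in os.environ
    or os.path.isdir(os.path.join(sys.prefix, "conda-meta"))
)

include_dirs = ["/usr/include"]
if IS_CONDA:
    # Prefer headers from the active conda environment over system ones.
    include_dirs.insert(0, os.path.join(sys.prefix, "include"))

print(IS_CONDA, include_dirs)
```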
Also, you did test this w/ […]?

I did not test with […].

Yes, please test locally. I just want to make sure if we introduce […].

test_cudnn_multiple_threads_same_device seems flaky, since it passes on the CentOS build as well as on a local Ubuntu build. It was recently enabled for ROCm in upstream pytorch#16994. I'd prefer to re-disable it in a separate PR, since it's not related to multigpu/rccl in any way.
test_cudnn_multiple_threads_same_device re-disabled in #349.

@pytorchbot retest this please. Flaky tests were disabled.

@iotamudelta Couldn't figure out what the deal with the ROCm CentOS failure is... can you please take a look and let me know if you think it's related to this PR's changes?

@jithunnair-amd It does look unrelated, but let's make sure and loop in @petrex to have a look.

@iotamudelta So it builds successfully for ROCm with USE_DISTRIBUTED=True and USE_RCCL=False. I checked that the test_nn data parallel tests that worked with USE_RCCL=True also work in this case. Can we consider this ready to merge now?
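(Illustrative only: a sketch of how a build script might gate the RCCL backend on the `USE_DISTRIBUTED` and `USE_RCCL` environment flags discussed in the comment above. The `check_env_flag` helper, flag names, and defaults are assumptions, not necessarily the PR's actual code.)

```python
import os

def check_env_flag(name, default=""):
    # Interpret an environment variable as a boolean build flag
    # ("1", "ON", "YES", "TRUE", "Y" count as enabled).
    return os.getenv(name, default).upper() in ("1", "ON", "YES", "TRUE", "Y")

USE_DISTRIBUTED = check_env_flag("USE_DISTRIBUTED", "1")
USE_RCCL = check_env_flag("USE_RCCL")

# Only wire RCCL into the multi-GPU/distributed build when both flags are
# enabled; with USE_RCCL off, the distributed build should still succeed,
# which is the configuration verified above.
if USE_DISTRIBUTED and USE_RCCL:
    print("Building distributed support with the RCCL backend")
elif USE_DISTRIBUTED:
    print("Building distributed support without RCCL")
else:
    print("Distributed support disabled")
```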
@jithunnair-amd @iotamudelta The caffe2 failure seen on CentOS is a flaky test. Refer to FBA-26.

@jithunnair-amd Can you merge master in? After the antistatic stuff, there is a trivial conflict.

@iotamudelta The Windows failure is about pyyaml not being installed, hence unrelated. The PR looks ready to merge.
Changes to build scripts and codebase to add RCCL integration to the PyTorch multi-GPU code.