Conversation

@jithunnair-amd
Collaborator

Changes to build scripts and the codebase to add RCCL integration to PyTorch multi-GPU code.

@iotamudelta left a comment

Looks very good in general - a couple of comments/questions.

("ncclSuccess", ("rcclSuccess", CONV_TYPE, API_RCCL)),
("ncclChar", ("rcclChar", CONV_TYPE, API_RCCL)),
("ncclInt8", ("rcclChar", CONV_TYPE, API_RCCL)),
("ncclUint8", ("rcclChar", CONV_TYPE, API_RCCL)), #FIXME: This should be mapped to an unsigned int8 type

For my education: why is it not?

@jithunnair-amd
Collaborator Author

RCCL doesn't have an unsigned int8 type.

@iotamudelta

Cool - can we get that adjusted?
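
For context, a minimal sketch of how mapping tuples like the ones quoted above are typically consumed during hipification: a whole-word NCCL identifier is looked up and replaced with its RCCL counterpart (the names and structure below are illustrative assumptions, not the actual hipify implementation):

# Illustrative only: NCCL -> RCCL identifier substitution driven by a mapping table.
import re

RCCL_MAP = {
    "ncclSuccess": "rcclSuccess",
    "ncclChar": "rcclChar",
    "ncclInt8": "rcclChar",
    "ncclUint8": "rcclChar",  # FIXME in the real table: RCCL lacks an unsigned int8 type
}

def hipify_rccl(source):
    # Replace whole-word NCCL identifiers with their RCCL counterparts.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, RCCL_MAP)) + r")\b")
    return pattern.sub(lambda m: RCCL_MAP[m.group(1)], source)

print(hipify_rccl("ncclResult_t res = ncclSuccess;"))  # prints: ncclResult_t res = rcclSuccess;

The CONV_TYPE and API_RCCL fields in the quoted tuples presumably tag the kind of conversion and the API group; the sketch above ignores them for brevity.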

'/usr/include',
]))

if IS_CONDA:

@iotamudelta

I presume we never tested on conda?

@jithunnair-amd
Collaborator Author

That's right. Although the CI environment is on conda, is it not?

@iotamudelta

Not to my knowledge.
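
For context, a minimal sketch of the kind of include-path handling under discussion, assuming conda is detected via CONDA_PREFIX (illustrative only, not the actual setup.py logic in this PR):

# Illustrative sketch: prefer headers from an active conda environment, if any.
import os

include_paths = ['/usr/include']

# A common way to detect an active conda environment (assumption, not the PR's code).
IS_CONDA = 'CONDA_PREFIX' in os.environ
if IS_CONDA:
    conda_include = os.path.join(os.environ['CONDA_PREFIX'], 'include')
    if os.path.isdir(conda_include):
        # Prefer headers shipped with the conda environment over system headers.
        include_paths.insert(0, conda_include)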

@iotamudelta

Also, did you test this with USE_RCCL=OFF?

@jithunnair-amd
Collaborator Author

I did not test with USE_RCCL=OFF. Shall I run a build locally to see if it works?

@iotamudelta

Yes, please test locally. I just want to make sure that if we introduce USE_ROCM=yes and USE_RCCL=no as an option, it actually works.
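
For context, a minimal sketch of the environment-flag gating being discussed, with a hypothetical check_env_flag helper modeled on common setup.py patterns (illustrative only, not the actual build-script code in this PR):

# Illustrative sketch only: gate the RCCL backend on USE_ROCM/USE_RCCL env flags.
import os

def check_env_flag(name, default=''):
    # Treat ON/1/YES/TRUE/Y (case-insensitive) as enabled.
    return os.getenv(name, default).upper() in ('ON', '1', 'YES', 'TRUE', 'Y')

USE_ROCM = check_env_flag('USE_ROCM')
# RCCL only applies to ROCm builds and must remain individually switchable.
USE_RCCL = USE_ROCM and check_env_flag('USE_RCCL', default='ON')

if USE_ROCM and not USE_RCCL:
    print('ROCm build without the RCCL distributed backend')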

@jithunnair-amd
Collaborator Author

test_cudnn_multiple_threads_same_device seems flaky, since it passes on the CentOS build as well as on a local Ubuntu build. It was recently enabled for ROCm in upstream pytorch#16994. I'd prefer to re-disable it in a separate PR, since it's not related to multigpu/rccl in any way.
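
For reference, a minimal sketch of the kind of re-disabling meant here, using a plain unittest-style skip (illustrative only, not the exact change that landed in #349):

# Illustrative only: skip a flaky test when running on ROCm.
import unittest

IS_ROCM = True  # in the real suite this comes from the torch build, e.g. torch.version.hip

class TestCudnn(unittest.TestCase):
    @unittest.skipIf(IS_ROCM, "flaky on ROCm; re-disabled, see #349")
    def test_cudnn_multiple_threads_same_device(self):
        pass  # test body omitted

if __name__ == '__main__':
    unittest.main()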

@jithunnair-amd
Collaborator Author

jithunnair-amd commented Feb 14, 2019

test_cudnn_multiple_threads_same_device re-disabled in #349

@iotamudelta

@pytorchbot retest this please

flaky tests were disabled

@jithunnair-amd
Collaborator Author

@iotamudelta Couldn't figure out what the deal with the ROCm CentOS failure is... can you please take a look and let me know if you think it's related to this PR's changes?

@iotamudelta

@jithunnair-amd It does look unrelated, but let's make sure and loop in @petrex to have a look.

@jithunnair-amd
Collaborator Author

> Yes, please test locally. I just want to make sure that if we introduce USE_ROCM=yes and USE_RCCL=no as an option, it actually works.

@iotamudelta So it builds successfully for ROCm with USE_DISTRIBUTED=True and USE_RCCL=False. I checked that the test_nn data parallel tests that worked with USE_RCCL=True also work in this case. Can we consider this ready to merge now?

@rohithkrn

@jithunnair-amd @iotamudelta The caffe2 failure seen on CentOS is a flaky test. Refer to FBA-26.

@iotamudelta

@jithunnair-amd Can you merge master in? After the antistatic stuff, there is a trivial conflict.

@jithunnair-amd
Collaborator Author

@iotamudelta The Windows failure is about pyyaml not being installed, hence unrelated. The PR looks ready to merge.

@iotamudelta merged commit 5b08ec8 into ROCm:master on Feb 16, 2019