Conversation

@jithunnair-amd
Collaborator

Changes to build scripts and the codebase to add RCCL integration to PyTorch multi-GPU code.

@iotamudelta left a comment

Looks very good in general - a couple of comments/questions.

("ncclSuccess", ("rcclSuccess", CONV_TYPE, API_RCCL)),
("ncclChar", ("rcclChar", CONV_TYPE, API_RCCL)),
("ncclInt8", ("rcclChar", CONV_TYPE, API_RCCL)),
("ncclUint8", ("rcclChar", CONV_TYPE, API_RCCL)), #FIXME: This should be mapped to an unsigned int8 type

For my education: why is it not?

@jithunnair-amd
Collaborator Author

RCCL doesn't have an unsigned int8 type.

@iotamudelta

Cool - can we get that adjusted?
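
For context, a minimal sketch of how mapping tuples like the ones quoted above are typically consumed during hipification: a whole-word NCCL identifier is looked up and replaced with its RCCL counterpart (the names and structure below are illustrative assumptions, not the actual hipify implementation):

# Illustrative only: NCCL -> RCCL identifier substitution driven by a mapping table.
import re

RCCL_MAP = {
    "ncclSuccess": "rcclSuccess",
    "ncclChar": "rcclChar",
    "ncclInt8": "rcclChar",
    "ncclUint8": "rcclChar",  # FIXME in the real table: RCCL lacks an unsigned int8 type
}

def hipify_rccl(source):
    # Replace whole-word NCCL identifiers with their RCCL counterparts.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, RCCL_MAP)) + r")\b")
    return pattern.sub(lambda m: RCCL_MAP[m.group(1)], source)

print(hipify_rccl("ncclResult_t res = ncclSuccess;"))  # prints: ncclResult_t res = rcclSuccess;

The CONV_TYPE and API_RCCL fields in the quoted tuples presumably tag the kind of conversion and the API group; the sketch above ignores them for brevity.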

'/usr/include',
]))

if IS_CONDA:

@iotamudelta

I presume we never tested on conda?

@jithunnair-amd
Collaborator Author

That's right. Although the CI environment is on conda, is it not?

@iotamudelta

Not to my knowledge.
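
For context, a minimal sketch of the kind of include-path handling under discussion, assuming conda is detected via CONDA_PREFIX (illustrative only, not the actual setup.py logic in this PR):

# Illustrative sketch: prefer headers from an active conda environment, if any.
import os

include_paths = ['/usr/include']

# A common way to detect an active conda environment (assumption, not the PR's code).
IS_CONDA = 'CONDA_PREFIX' in os.environ
if IS_CONDA:
    conda_include = os.path.join(os.environ['CONDA_PREFIX'], 'include')
    if os.path.isdir(conda_include):
        # Prefer headers shipped with the conda environment over system headers.
        include_paths.insert(0, conda_include)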

@iotamudelta

Also, did you test this with USE_RCCL=OFF?

@jithunnair-amd
Collaborator Author

I did not test with USE_RCCL=OFF. Shall I run a build locally to see if it works?

@iotamudelta

Yes, please test locally. I just want to make sure that if we introduce USE_ROCM=yes and USE_RCCL=no as an option, it actually works.
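
For context, a minimal sketch of the environment-flag gating being discussed, with a hypothetical check_env_flag helper modeled on common setup.py patterns (illustrative only, not the actual build-script code in this PR):

# Illustrative sketch only: gate the RCCL backend on USE_ROCM/USE_RCCL env flags.
import os

def check_env_flag(name, default=''):
    # Treat ON/1/YES/TRUE/Y (case-insensitive) as enabled.
    return os.getenv(name, default).upper() in ('ON', '1', 'YES', 'TRUE', 'Y')

USE_ROCM = check_env_flag('USE_ROCM')
# RCCL only applies to ROCm builds and must remain individually switchable.
USE_RCCL = USE_ROCM and check_env_flag('USE_RCCL', default='ON')

if USE_ROCM and not USE_RCCL:
    print('ROCm build without the RCCL distributed backend')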

@jithunnair-amd
Collaborator Author

test_cudnn_multiple_threads_same_device seems flaky, since it passes on the CentOS build as well as on a local Ubuntu build. It was recently enabled for ROCm in upstream pytorch#16994. I'd prefer to re-disable it in a separate PR, since it's not related to multigpu/rccl in any way.
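
For reference, a minimal sketch of the kind of re-disabling meant here, using a plain unittest-style skip (illustrative only, not the exact change that landed in #349):

# Illustrative only: skip a flaky test when running on ROCm.
import unittest

IS_ROCM = True  # in the real suite this comes from the torch build, e.g. torch.version.hip

class TestCudnn(unittest.TestCase):
    @unittest.skipIf(IS_ROCM, "flaky on ROCm; re-disabled, see #349")
    def test_cudnn_multiple_threads_same_device(self):
        pass  # test body omitted

if __name__ == '__main__':
    unittest.main()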

@jithunnair-amd
Collaborator Author

jithunnair-amd commented Feb 14, 2019

test_cudnn_multiple_threads_same_device re-disabled in #349

@iotamudelta

@pytorchbot retest this please

flaky tests were disabled

@jithunnair-amd
Collaborator Author

@iotamudelta Couldn't figure out what the deal with the ROCm CentOS failure is... can you please take a look and let me know if you think it's related to this PR's changes?

@iotamudelta

@jithunnair-amd It does look unrelated, but let's make sure and loop in @petrex to have a look.

@jithunnair-amd
Collaborator Author

> Yes, please test locally. I just want to make sure that if we introduce USE_ROCM=yes and USE_RCCL=no as an option, it actually works.

@iotamudelta So it builds successfully for ROCm with USE_DISTRIBUTED=True and USE_RCCL=False. I checked that the test_nn data parallel tests that worked with USE_RCCL=True also work in this case. Can we consider this ready to merge now?

@rohithkrn

@jithunnair-amd @iotamudelta The caffe2 failure seen on CentOS is a flaky test. Refer to FBA-26.

@iotamudelta

@jithunnair-amd Can you merge master in? After the antistatic stuff, there is a trivial conflict.

@jithunnair-amd
Collaborator Author

@iotamudelta The Windows failure is about pyyaml not being installed, hence unrelated. The PR looks ready to merge.

@iotamudelta merged commit 5b08ec8 into ROCm:master on Feb 16, 2019