
NCCL performance for Deep Learning workloads on AWS EFA network #235

Closed
yaroslavvb opened this issue Jun 21, 2019 · 9 comments

@yaroslavvb commented Jun 21, 2019

I'm wondering if there are settings I can use to improve NCCL performance on the AWS EFA network with 8-GPU instances.

Generally, over 2 instances I've observed an effective-bandwidth improvement from 17 Gbps to 50 Gbps by switching from Ethernet to EFA, and then up to 126.4 Gbps when applying the aws-ofi-nccl patch.

However, going to 4 and 8 instances, the performance drops by up to 1.6x. I'm wondering if it's due to jitter, or if it's something that can be improved by customizing NCCL settings.

Below are numbers for a 70 MB allreduce in nccl 2.4.7ms1+cuda10.0, a size chosen to balance total transfer time against the time spent waiting for the last layer to sync:

  • 16 GPUs: 11 ms (log)
  • 32 GPUs: 16 ms (log)
  • 64 GPUs: 18 ms (log)
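
For reference, here is a quick back-of-the-envelope conversion of these times into per-allreduce throughput (a minimal sketch, not the benchmark script used; it just divides the 70 MB payload by the times listed above):

```python
# Convert the reported 70 MB allreduce completion times into throughput.
# Illustrative only: assumes the times above are single-allreduce wall-clock times.

SIZE_BYTES = 70e6  # 70 MB payload from the report above

times = {16: 11e-3, 32: 16e-3, 64: 18e-3}  # GPU count -> seconds

for n_gpus, seconds in times.items():
    throughput_gbps = SIZE_BYTES / seconds * 8 / 1e9  # bytes/s -> Gbit/s
    print(f"{n_gpus} GPUs: {throughput_gbps:.1f} Gbit/s effective")
```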
@kwen2501 (Collaborator) commented Jun 22, 2019

Hi, in the two-node case, NCCL uses the tree algorithm for all message sizes by default. As explained here, this gives a 2x improvement over the ring algorithm in theoretical bandwidth.

However, this 2x factor does not hold for the >= 3 node cases. That may explain why you see a drop.

Also, for >= 3 nodes, NCCL calculates something called the tree threshold; for message sizes greater than the threshold, NCCL switches back to the ring algorithm. So the peak BWs you observed in the 4- and 8-node cases are indeed the peak BW of the ring algorithm, which is a "true" reflection of the network link BW you can get (plus overhead).

That said, we have not yet fully confirmed that the default tree thresholds calculated by NCCL (you can see them in the INFO log) are the best switching points on AWS+EFA for all node scales. So you can try tweaking the threshold a bit and see how it impacts performance. But it may only affect the (medium) sizes around the threshold, not the big ones.
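
To make the switching behavior concrete, here is a toy model of the selection logic described above (illustrative only, not NCCL's actual implementation; the default threshold value below is a made-up placeholder, since the real default is computed by NCCL and shown in the INFO log):

```python
# Toy model of the algorithm selection described above (NCCL 2.4-era behavior):
# messages up to the tree threshold use the tree algorithm, larger ones fall
# back to the ring algorithm. This is NOT NCCL source code; the fallback
# default below is a placeholder.
import os

def pick_algorithm(msg_bytes: int, n_nodes: int) -> str:
    # NCCL_TREE_THRESHOLD (bytes) overrides NCCL's computed threshold.
    threshold = int(os.environ.get("NCCL_TREE_THRESHOLD", 4 << 20))  # placeholder default
    if n_nodes <= 2:
        return "tree"  # two-node case: tree is used for all sizes
    return "tree" if msg_bytes <= threshold else "ring"

print(pick_algorithm(70 * 1000 * 1000, n_nodes=8))  # 70 MB at 8 nodes -> "ring" with the placeholder default
```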

@yaroslavvb (Author) commented Jun 23, 2019

I'm manually setting NCCL_TREE_THRESHOLD=4294967296, which is the threshold suggested by the EFA team, so I think I'm using the tree for the 64-GPU case as well; here's the INFO log.

BTW, when using NCCL_DEBUG=INFO instead of NCCL_DEBUG=VERSION I no longer get the NCCL version printed; is there a way to get both?
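
For anyone reproducing this setup, here is a minimal sketch of exporting these variables before NCCL initialization. Using PyTorch's NCCL backend is an assumption (the thread doesn't say which framework was used); the key point is only that the variables must be in every rank's environment before the first NCCL call:

```python
# Minimal sketch: set the NCCL variables discussed above before the first
# NCCL call. PyTorch is used here only as an example launcher; rendezvous
# variables (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) are assumed to be
# provided by whatever launches the processes.
import os

os.environ["NCCL_TREE_THRESHOLD"] = "4294967296"  # keep the tree algorithm up to 4 GiB messages
os.environ["NCCL_DEBUG"] = "INFO"                 # verbose NCCL logging

import torch.distributed as dist

dist.init_process_group(backend="nccl")  # NCCL reads the environment variables at init time
```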

@bobzhuyb commented Jun 28, 2019

@yaroslavvb I don't think the busbw was reported correctly here. https://github.com/NVIDIA/nccl-tests/ was developed before NCCL had trees, and nccl-tests calculates busbw from algbw using the ring equations (https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md). I've read the nccl-tests code before; the busbw is calculated, not measured.

I suggest you focus only on the algbw, which is the real allreduce throughput. That is, if you have X MB of data to allreduce, NCCL will need X/algbw of wall-clock time. This is what the application cares about in the end.

NCCL developers, please correct me if I am wrong.
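
For reference, the allreduce conversion nccl-tests documents in doc/PERFORMANCE.md works out as follows (a sketch of the published formula, not a measurement): algbw is size divided by time, and busbw is algbw scaled by the ring factor 2*(n-1)/n.

```python
# busbw as nccl-tests derives it for allreduce (see doc/PERFORMANCE.md):
# it is computed from algbw with the ring factor, never measured on the wire.

def allreduce_bandwidths(size_bytes: float, seconds: float, n_ranks: int):
    algbw = size_bytes / seconds                   # what the application actually sees
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks    # ring-allreduce correction factor
    return algbw, busbw

# Example with the 70 MB / 11 ms / 16 GPU data point from the opening post.
algbw, busbw = allreduce_bandwidths(70e6, 11e-3, 16)
print(f"algbw {algbw * 8 / 1e9:.1f} Gbit/s, busbw {busbw * 8 / 1e9:.1f} Gbit/s")
```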

@yaroslavvb (Author) commented Jun 28, 2019

I agree that the time to complete the operation is the important metric. I made graphs to determine the cut-off point for switching from ring to tree. The values in the graph were obtained by taking the time to complete (the column right before busbw) and normalizing by 50% of the total transfer size (to roughly correspond to the busbw column); you can see the calculation in my notebook here -- https://github.com/cybertronai/aws-network-benchmarks/blob/efa/nccl-bench-wandb.ipynb. You can see the actual end-to-end times in the logs.
E.g., a 4.29 GB vector took 1.127 seconds to all-reduce across 256 GPUs using the ring strategy.
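
For concreteness, that example works out as follows (a back-of-the-envelope sketch; it only computes size/time, i.e. algbw, which per the next comment is roughly half of what the graph plots):

```python
# Worked example from the comment above: 4.29 GB all-reduced across 256 GPUs
# with the ring strategy in 1.127 s.
size_bytes = 4.29e9
seconds = 1.127

algbw = size_bytes / seconds  # bytes/s seen by the application
print(f"algbw: {algbw / 1e9:.2f} GB/s ({algbw * 8 / 1e9:.1f} Gbit/s)")  # ~3.8 GB/s
```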

@bobzhuyb commented Jun 28, 2019

@yaroslavvb I see. So basically the values in your graph are algbw*2.

The problem with busbw is that, for the 2-machine case, the calculation is way off. The 120 Gbps number is basically wrong, because busbw is mistakenly calculated as algbw * 2 * (16-1)/16, where 16 is the number of GPUs. With the tree, the real bandwidth between the two machines should be exactly algbw.
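
In numbers (a sketch of the arithmetic implied by this comment, plugging in the 126.4 Gbps two-node figure from the opening post): with 16 GPUs the ring-style busbw overstates the two-node wire bandwidth by a factor of 2*(16-1)/16 = 1.875.

```python
# If the tree really moves exactly algbw between the two machines, the
# ring-style busbw reported by nccl-tests overstates the wire bandwidth
# by 2*(n-1)/n.
n_gpus = 16
ring_factor = 2 * (n_gpus - 1) / n_gpus          # 1.875 for 16 GPUs

reported_busbw_gbps = 126.4                      # two-node figure from the opening post
actual_wire_gbps = reported_busbw_gbps / ring_factor
print(f"inter-node bandwidth under tree: ~{actual_wire_gbps:.1f} Gbit/s")  # ~67 Gbit/s
```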

@sjeaugey (Contributor) commented Jul 16, 2019

@rashikakheria,

FYI @yaroslavvb reported a few numbers above running with the EFA plugin. He gets 80 Gb/s (10 GB/s) on 2 nodes, but performance drops to 50-60 Gb/s (6-8 GB/s) when using 8 to 32 nodes.

Should we move this issue to https://github.com/aws/aws-ofi-nccl/ ?

@rashikakheria commented Jul 16, 2019

Sure, we can!

@yaroslavvb closed this Jul 26, 2019