This repository has been archived by the owner on Aug 3, 2021. It is now read-only.

throughput scaling issues #204

Closed
vsuthichai opened this issue Aug 4, 2018 · 15 comments

Comments

@vsuthichai
Contributor

vsuthichai commented Aug 4, 2018

I'm attempting to benchmark transformer-big throughput with the following setup:

  • AWS p3.16xlarge (8 gpus per node)
  • Horovod 0.13.10
  • OpenMPI 3.1.1
  • TensorFlow 1.9.0
  • CUDA 9.0
  • FP32

I'm benchmarking over 100 steps (steps 10 to 109, skipping the first 10 warm-up steps). Here are some results. Throughput seems to plateau at 8 GPUs and doesn't scale any further. I'm primarily interested in getting samples per second to scale well. Any thoughts?

| Nodes | GPUs | Steps | Global Batch Size | Per-GPU Batch Size | Seconds / Step | Objects / Sec | Samples / Sec |
|------:|-----:|------:|------------------:|-------------------:|---------------:|--------------:|--------------:|
| 1 | 2 | 100 | 256 | 128 | 0.712 | 21489.658 | 360 |
| 1 | 4 | 100 | 512 | 128 | 0.877 | 34967.614 | 583 |
| 1 | 8 | 100 | 1024 | 128 | 1.287 | 47788.866 | 795 |
| 2 | 16 | 100 | 2048 | 128 | 2.906 | 42235.197 | 704 |
| 3 | 24 | 100 | 3072 | 128 | 3.972 | 46429.704 | 773 |
| 4 | 32 | 100 | 4096 | 128 | 5.09 | 48363.986 | 804 |
@vsuthichai
Contributor Author

Update: using the Horovod timeline, I've tracked the performance bottleneck to an mpi_allgather that takes 2-3 seconds after every training step.

[Screenshot (Aug 3, 2018): Horovod timeline showing the mpi_allgather bottleneck]

@okuchaiev
Member

okuchaiev commented Aug 7, 2018

@vsuthichai Do you know how fast the connection is between two AWS p3.16xlarge instances?
Does this mean a network bottleneck?
Try setting "iter_size" > 1: https://nvidia.github.io/OpenSeq2Seq/html/api-docs/models.html?highlight=iter_size. This might help with performance if the network is the bottleneck.
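A minimal sketch of how that option might look in an OpenSeq2Seq config; "iter_size" is the documented parameter, while the surrounding keys are only illustrative:

```python
# Illustrative fragment of an OpenSeq2Seq base_params dict (not a complete config).
base_params = {
    # ...
    "iter_size": 2,             # accumulate gradients over 2 sub-steps before syncing
    "batch_size_per_gpu": 128,  # assumed to match the benchmark setup above
    # ...
}
```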

@vsuthichai
Contributor Author

vsuthichai commented Aug 9, 2018

So I believe the issue has to do with sparse gradient updates, particularly for the shared embeddings. Horovod implements the allreduce as an allgather (https://github.com/uber/horovod/blob/f43ad4763574b5652f488648ca1860c1e55a8152/horovod/tensorflow/__init__.py#L72) if the gradient is an IndexedSlices. Also, it's not using the NCCL allgather but the MPI allgather.
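A simplified sketch of the dispatch described above (see the linked Horovod source for the real implementation): dense gradients take allreduce, while IndexedSlices gradients have their values and indices allgathered so every worker can rebuild the combined sparse update.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

def average_gradient(grad, num_workers):
    # Sketch only: mirrors the sparse-vs-dense logic of hvd.allreduce.
    if isinstance(grad, tf.IndexedSlices):
        values = hvd.allgather(grad.values)    # MPI allgather in Horovod 0.13.x
        indices = hvd.allgather(grad.indices)
        return tf.IndexedSlices(values / num_workers, indices,
                                dense_shape=grad.dense_shape)
    return hvd.allreduce(grad, average=True)   # NCCL allreduce
```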

On a side note, IndexedSlices gradients are created by tf.gather if I'm not mistaken. It makes sense to only update the relevant embeddings, and it should be more efficient. I'm just puzzled why the MPI allgather is so much slower. The bandwidth is 25 Gbps, and it does get saturated during the MPI allgather except for a few spots.

This is a measurement of the network traffic on one of the nodes, covering two training steps. There are 5 NCCL allreduces, followed by a really long (~2.5 s) MPI allgather, per training step. The x-axis is time in milliseconds.

[Screenshot (Aug 8, 2018): network traffic over two training steps]

@alsrgv

alsrgv commented Aug 9, 2018

I believe this is an AWS-specific issue. AWS provides 25 Gbit of bandwidth on p3.16xlarge instances; however, each TCP connection is limited to 10 Gbit. NCCL is able to use multiple connections with the NCCL_MIN_NRINGS=4 environment variable, while MPI does not do that and gets capped at 10 Gbit. This is an argument in favor of NCCL allgather support in Horovod.
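For anyone hitting the same cap, a minimal sketch of setting that variable, assuming the job is launched with Open MPI; the more common alternative is to propagate it to every rank with `mpirun -x NCCL_MIN_NRINGS=4`:

```python
import os

# NCCL reads NCCL_MIN_NRINGS when its communicator is created, so setting it in
# each process before Horovod initializes is enough for this sketch.
os.environ.setdefault("NCCL_MIN_NRINGS", "4")

import horovod.tensorflow as hvd
hvd.init()
```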

@vsuthichai
Contributor Author

@alsrgv , thank you for the input. It sounds like I may have to wait until Horovod supports NCCL allgather.

@vsuthichai
Contributor Author

vsuthichai commented Aug 29, 2018

@okuchaiev @alsrgv An interesting discovery: after an allgather, the concatenated gradient shape across two nodes was around (800k, 512), and this shape stays fixed from allgather to allgather. Is this because of padding to some fixed length? The shape suggests the concatenated gradient is really large. Given that the vocab embedding matrix is just (32k, 512), I tried a workaround: converting the IndexedSlices gradient into a Tensor with tf.convert_to_tensor. This avoids the allgather path and makes everything go through allreduce, and we saw a significant speedup. I was wondering what your thoughts are on this workaround?
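For concreteness, a minimal sketch of that workaround, assuming the (gradient, variable) pairs are available before being handed to Horovod; the helper name is hypothetical:

```python
import tensorflow as tf

def densify_gradients(grads_and_vars):
    # Hypothetical helper: turn any IndexedSlices gradient into a dense Tensor
    # so Horovod takes the allreduce path instead of allgather.
    return [
        (tf.convert_to_tensor(g) if isinstance(g, tf.IndexedSlices) else g, v)
        for g, v in grads_and_vars
    ]
```

Newer Horovod releases expose essentially this behavior as a `sparse_as_dense` flag on `hvd.DistributedOptimizer`.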

@okuchaiev
Member

I thought converting IndexedSlices to dense tensors can result in much higher memory usage, since you are effectively converting sparse to dense? TF even seems to give a warning about that.

Btw, we just published some scaling numbers we see within a single machine:
https://nvidia.github.io/OpenSeq2Seq/html/machine-translation/transformer.html#training

@vsuthichai
Contributor Author

@okuchaiev That's a good point. Going from sparse to dense does consume more memory. I'm trying to address the allgather issue that comes with using IndexedSlices. For multi-node configurations, I've had problems as the concatenated gradients become larger and larger. Have you experienced this bottleneck?

@vsuthichai
Contributor Author

vsuthichai commented Sep 11, 2018

@okuchaiev For the embedding matrix, I was wondering if you've noticed this issue with the size of the tf.IndexedSlices during the allgather. I'm running a simple experiment where I set the mini-batch size to 1 and print the number of tokens in the src and target matrices. I've noticed the size of the IndexedSlices is the size of the vocabulary plus the number of tokens in the src matrix plus the number of tokens in the target matrix. This seems like a lot for something I would presume to be sparse. The size of the IndexedSlices should just be the number of tokens in the src and target matrices, but for some reason the entire vocabulary is included. This is certainly overkill for Horovod's allgather, and the problem worsens with more GPUs. Any thoughts?

Per training step, the size of the IndexedSlices is (32k + (batch_size * src_timestep_length) + (batch_size * target_timestep_length)). The 32k vocabulary term is what is in question here; I would think that it should not be part of the IndexedSlices. Are gradients being computed for every embedding per training step, and is that necessary?
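A rough back-of-the-envelope check of those numbers (the ~56-token sequence length is an assumption taken from the shapes in the profiling output further down):

```python
# Rows of the (rows, 512) IndexedSlices gradient, per GPU and after allgather.
vocab_rows   = 32000                    # full shared embedding / projection matrix
token_rows   = 128 * 56 * 2             # assumed: batch 128, ~56 src + ~56 target tokens
rows_per_gpu = vocab_rows + token_rows  # ~46k rows per worker
total_rows   = rows_per_gpu * 16        # allgather concatenates across 16 GPUs
print(rows_per_gpu, total_rows)         # 46336 and 741376 -- close to the (800k, 512) seen earlier
```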

@vsuthichai
Contributor Author

vsuthichai commented Sep 12, 2018

I think I discovered the answer to my question. The embedding matrix is also shared as the pre-softmax projection matrix, so gradients are computed for the entire embedding matrix.

Still, I feel some performance improvement could be made here: if the gradients for repeated indices within the IndexedSlices were summed together, the result could then be converted to a dense Tensor, and Horovod would take the standard fast allreduce path instead of allgather. Since the IndexedSlices will always be at least the size of the vocabulary, we might as well just use a dense Tensor (see the sketch below).
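A hedged sketch of what that could look like using standard TF ops; the helper is hypothetical, not something in OpenSeq2Seq or Horovod:

```python
import tensorflow as tf

def dedup_and_densify(grad):
    # Sum gradient rows that refer to the same embedding index, then convert the
    # result into a dense tensor of the full embedding shape for allreduce.
    unique_indices, positions = tf.unique(grad.indices)
    summed = tf.unsorted_segment_sum(grad.values, positions,
                                     tf.size(unique_indices))
    deduped = tf.IndexedSlices(summed, unique_indices, grad.dense_shape)
    return tf.convert_to_tensor(deduped)
```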

@vsuthichai
Contributor Author

@okuchaiev Would you have any benchmark numbers for the "big" parameter set with mixed-precision training, ideally multi-node? I'm essentially trying to see how feasible it is to reach what the fairseq Transformer achieves: https://code.fb.com/ai-research/scaling-neural-machine-translation-to-bigger-data-sets-with-faster-training-and-inference/

@okuchaiev
Member

@vsuthichai Sorry for the late responses - I am currently on paternity leave and will be back in October.
Answers to some of your questions:

" The embedding matrix is shared as the projection matrix pre-softmax as well."

  • This is correct for the Transformer, but not for GNMT. It does sound suboptimal that the entire gradient is computed instead of only the sparse parts that are actually necessary...

"Would you have any benchmark numbers for "big" parameter set and mixed precision training?"

  • This is on my to-do list. The big config should look something like this: 701893e
    I can get a test SacreBLEU of > 27.3 on en-de 2014, but I think there is still room for improvement.

What cards are you using, 32 GB or 16 GB?

@vsuthichai
Contributor Author

vsuthichai commented Sep 12, 2018

@okuchaiev Congratulations on fatherhood :) I apologize for sending these issues while you're on leave. Please feel free to address them whenever you're free.

I am benchmarking the transformer model. With iter_size == 1, the gradient for the embedding matrix is an IndexedSlices object, but with iter_size > 1, the gradient for the embedding / pre-softmax projection matrix is accumulated into a tf.Variable updated with scatter_nd_add.

Bringing Horovod into the picture: when iter_size == 1 it takes the somewhat slower allgather path, because the gradient is an IndexedSlices object; when iter_size > 1, the tf.Variable is treated as dense and uses allreduce, which is much faster.

I think simply changing the logic to iter_size >= 1 will force it to go down the same code path.
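For illustration, a rough sketch of the dense-accumulator path described above; the shapes and names are made up and this is not the actual OpenSeq2Seq code:

```python
import tensorflow as tf

# Dense accumulator for the shared embedding gradient; handing this dense
# tensor to Horovod keeps it on the allreduce path.
grad_accum = tf.Variable(tf.zeros([32000, 512]), trainable=False,
                         name="embedding_grad_accum")

def accumulate(grad):
    # grad is a tf.IndexedSlices coming out of the embedding lookup gradient.
    row_indices = tf.expand_dims(grad.indices, axis=-1)            # shape [n, 1]
    return tf.scatter_nd_add(grad_accum, row_indices, grad.values)
```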

Have you experienced any strange shaping errors when trying to profile with iter_size set to something greater than 1?

2865 ops no flops stats due to incomplete shapes.
Node ForwardPass/transformer_encoder/encode/embedding_shared_weights/embedding_1/mul_2 incompatible shapes: Shapes (?, ?, 512) and (128, 56, 1) are not compatible.
Node Loss_Optimization/gradients/ForwardPass/transformer_encoder/encode/embedding_shared_weights/embedding_1/mul_3_grad/Mul incompatible shapes: Shapes (?, ?, 512) and (128, 56, 1) are not compatible.
Node ForwardPass/transformer_encoder/encode/embedding_shared_weights/embedding/mul_2 incompatible shapes: Shapes (?, ?, 512) and (128, 56, 1) are not compatible.
Node Loss_Optimization/gradients/ForwardPass/transformer_encoder/encode/embedding_shared_weights/embedding/mul_3_grad/Mul incompatible shapes: Shapes (?, ?, 512) and (128, 56, 1) are not compatible.
Cannot parse returned proto: Error parsing message.

I'm using 16 GB V100s (Volta) on AWS p3.16xlarge instances.

@vsuthichai
Contributor Author

@okuchaiev In the dataset pipeline, moving shard before map provides a bit of a performance gain. Filtering-type operations, and shard is effectively one of these, should generally be among the first operations in the pipeline; it's inefficient to map a function over all the samples and then shard afterward (see the sketch below).
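An illustrative tf.data sketch of that ordering; the file name and parse function are placeholders, not the actual OpenSeq2Seq pipeline:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

def parse_example(line):
    # Placeholder for whatever per-sample preprocessing the real pipeline does.
    return tf.string_split([line]).values

dataset = tf.data.TextLineDataset(["train.src"])                  # hypothetical input file
dataset = dataset.shard(num_shards=hvd.size(), index=hvd.rank())  # shard first: each worker reads 1/N of the data
dataset = dataset.map(parse_example, num_parallel_calls=4)        # then map only this worker's shard
dataset = dataset.batch(128).prefetch(1)
```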

@okuchaiev
Member

Looks like this is a duplicate of #243 and #244. Closing for now.
