This repository has been archived by the owner on Aug 3, 2021. It is now read-only.

throughput scaling issues #204

Closed
vsuthichai opened this issue Aug 4, 2018 · 15 comments

Comments

@vsuthichai
Contributor

vsuthichai commented Aug 4, 2018

I'm attempting to benchmark transformer-big throughput with the following setup:

  • AWS p3.16xlarge (8 gpus per node)
  • Horovod 0.13.10
  • OpenMPI 3.1.1
  • TensorFlow 1.9.0
  • CUDA 9.0
  • FP32

I'm benchmarking over 100 steps (steps 10 to 109, skipping the first 10 warm-up steps). Here are some results. Throughput seems to plateau at 8 GPUs and doesn't scale any further. I'm primarily interested in getting samples per second to scale well. Any thoughts?

| Nodes | GPUs | Steps | Global Batch Size | Per-GPU Batch Size | Seconds / Step | Objects / Sec | Samples / Sec |
|------:|-----:|------:|------------------:|-------------------:|---------------:|--------------:|--------------:|
| 1 | 2 | 100 | 256 | 128 | 0.712 | 21489.658 | 360 |
| 1 | 4 | 100 | 512 | 128 | 0.877 | 34967.614 | 583 |
| 1 | 8 | 100 | 1024 | 128 | 1.287 | 47788.866 | 795 |
| 2 | 16 | 100 | 2048 | 128 | 2.906 | 42235.197 | 704 |
| 3 | 24 | 100 | 3072 | 128 | 3.972 | 46429.704 | 773 |
| 4 | 32 | 100 | 4096 | 128 | 5.09 | 48363.986 | 804 |
@vsuthichai
Contributor Author

Update: using the Horovod timeline, I've tracked the performance bottleneck to an mpi_allgather that takes 2-3 seconds after every training step.

[Screenshot (Aug 3, 2018): Horovod timeline showing the mpi_allgather bottleneck]

@okuchaiev
Member

okuchaiev commented Aug 7, 2018

@vsuthichai Do you know how fast the connection is between two AWS p3.16xlarge instances?
Does this mean a network bottleneck?
Try setting "iter_size" > 1: https://nvidia.github.io/OpenSeq2Seq/html/api-docs/models.html?highlight=iter_size. This might help with performance if the network is the bottleneck.
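A minimal sketch of how that option might look in an OpenSeq2Seq config; "iter_size" is the documented parameter, while the surrounding keys are only illustrative:

```python
# Illustrative fragment of an OpenSeq2Seq base_params dict (not a complete config).
base_params = {
    # ...
    "iter_size": 2,             # accumulate gradients over 2 sub-steps before syncing
    "batch_size_per_gpu": 128,  # assumed to match the benchmark setup above
    # ...
}
```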

@vsuthichai
Contributor Author

vsuthichai commented Aug 9, 2018

So I believe the issue has to do with sparse gradient updates, particularly for the shared embeddings. Horovod implements the allreduce as an allgather (https://github.com/uber/horovod/blob/f43ad4763574b5652f488648ca1860c1e55a8152/horovod/tensorflow/__init__.py#L72) if the gradient is an IndexedSlices. Also, it's not using the NCCL allgather but the MPI allgather.
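A simplified sketch of the dispatch described above (see the linked Horovod source for the real implementation): dense gradients take allreduce, while IndexedSlices gradients have their values and indices allgathered so every worker can rebuild the combined sparse update.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

def average_gradient(grad, num_workers):
    # Sketch only: mirrors the sparse-vs-dense logic of hvd.allreduce.
    if isinstance(grad, tf.IndexedSlices):
        values = hvd.allgather(grad.values)    # MPI allgather in Horovod 0.13.x
        indices = hvd.allgather(grad.indices)
        return tf.IndexedSlices(values / num_workers, indices,
                                dense_shape=grad.dense_shape)
    return hvd.allreduce(grad, average=True)   # NCCL allreduce
```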

On a side note, IndexedSlices gradients are created by tf.gather if I'm not mistaken. It makes sense to only update the relevant embeddings, and it should be more efficient. I'm just puzzled why the MPI allgather is so much slower. The bandwidth is 25 Gbps, and it does get saturated during the MPI allgather except for a few spots.

This is a measurement of the network traffic on one of the nodes, covering two training steps. There are 5 NCCL allreduces, followed by a really long (~2.5 s) MPI allgather, per training step. The x-axis is time in milliseconds.

[Screenshot (Aug 8, 2018): network traffic over two training steps]

@alsrgv

alsrgv commented Aug 9, 2018

I believe this is an AWS-specific issue. AWS provides 25 Gbit of bandwidth on p3.16xlarge instances; however, each TCP connection is limited to 10 Gbit. NCCL is able to use multiple connections with the NCCL_MIN_NRINGS=4 environment variable, while MPI does not do that and gets capped at 10 Gbit. This is an argument in favor of NCCL allgather support in Horovod.
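For anyone hitting the same cap, a minimal sketch of setting that variable, assuming the job is launched with Open MPI; the more common alternative is to propagate it to every rank with `mpirun -x NCCL_MIN_NRINGS=4`:

```python
import os

# NCCL reads NCCL_MIN_NRINGS when its communicator is created, so setting it in
# each process before Horovod initializes is enough for this sketch.
os.environ.setdefault("NCCL_MIN_NRINGS", "4")

import horovod.tensorflow as hvd
hvd.init()
```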

@vsuthichai
Contributor Author

@alsrgv , thank you for the input. It sounds like I may have to wait until Horovod supports NCCL allgather.

@vsuthichai
Contributor Author

vsuthichai commented Aug 29, 2018

@okuchaiev @alsrgv An interesting discovery: after an allgather, the concatenated gradient shape across two nodes was around (800k, 512), and this shape stays fixed from allgather to allgather. Is this because of padding to some fixed length? The shape suggests the concatenated gradient is really large. Given that the vocab embedding matrix is just (32k, 512), I tried a workaround: converting the IndexedSlices gradient into a Tensor with tf.convert_to_tensor. This avoids the allgather path and makes everything go through allreduce, and we saw a significant speedup. I was wondering what your thoughts are on this workaround?
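For concreteness, a minimal sketch of that workaround, assuming the (gradient, variable) pairs are available before being handed to Horovod; the helper name is hypothetical:

```python
import tensorflow as tf

def densify_gradients(grads_and_vars):
    # Hypothetical helper: turn any IndexedSlices gradient into a dense Tensor
    # so Horovod takes the allreduce path instead of allgather.
    return [
        (tf.convert_to_tensor(g) if isinstance(g, tf.IndexedSlices) else g, v)
        for g, v in grads_and_vars
    ]
```

Newer Horovod releases expose essentially this behavior as a `sparse_as_dense` flag on `hvd.DistributedOptimizer`.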

@okuchaiev
Member

I thought converting IndexedSlices to dense tensors can result in much higher memory usage, since you are effectively converting sparse to dense? TF even seems to give a warning about that.

Btw, we just published some scaling numbers we see within a single machine:
https://nvidia.github.io/OpenSeq2Seq/html/machine-translation/transformer.html#training

@vsuthichai
Contributor Author

@okuchaiev That's a good point. Going from sparse to dense does consume more memory. I'm trying to address the allgather issue that comes with using IndexedSlices. For multi-node configurations, I've had problems as the concatenated gradients become larger and larger. Have you experienced this bottleneck?

@vsuthichai
Contributor Author

vsuthichai commented Sep 11, 2018

@okuchaiev For the embedding matrix, I was wondering if you've noticed this issue with the size of the tf.IndexedSlices during the allgather. I'm running a simple experiment where I set the mini-batch size to 1 and print the number of tokens in the src and target matrices. I've noticed the size of the IndexedSlices is the size of the vocabulary plus the number of tokens in the src matrix plus the number of tokens in the target matrix. This seems like a lot for something I would presume to be sparse. The size of the IndexedSlices should just be the number of tokens in the src and target matrices, but for some reason the entire vocabulary is included. This is certainly overkill for Horovod's allgather, and the problem worsens with more GPUs. Any thoughts?

Per training step, the size of the IndexedSlices is (32k + (batch_size * src_timestep_length) + (batch_size * target_timestep_length)). The 32k vocabulary term is what is in question here; I would think that it should not be part of the IndexedSlices. Are gradients being computed for every embedding per training step, and is that necessary?
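A rough back-of-the-envelope check of those numbers (the ~56-token sequence length is an assumption taken from the shapes in the profiling output further down):

```python
# Rows of the (rows, 512) IndexedSlices gradient, per GPU and after allgather.
vocab_rows   = 32000                    # full shared embedding / projection matrix
token_rows   = 128 * 56 * 2             # assumed: batch 128, ~56 src + ~56 target tokens
rows_per_gpu = vocab_rows + token_rows  # ~46k rows per worker
total_rows   = rows_per_gpu * 16        # allgather concatenates across 16 GPUs
print(rows_per_gpu, total_rows)         # 46336 and 741376 -- close to the (800k, 512) seen earlier
```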

@vsuthichai
Contributor Author

vsuthichai commented Sep 12, 2018

I think I discovered the answer to my question. The embedding matrix is also shared as the pre-softmax projection matrix, so gradients are computed for the entire embedding matrix.

Still, I feel some performance improvement could be made here: if the gradients for repeated indices within the IndexedSlices were summed together, the result could then be converted to a dense Tensor, and Horovod would take the standard fast allreduce path instead of allgather. Since the IndexedSlices will always be at least the size of the vocabulary, we might as well just use a dense Tensor (see the sketch below).
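A hedged sketch of what that could look like using standard TF ops; the helper is hypothetical, not something in OpenSeq2Seq or Horovod:

```python
import tensorflow as tf

def dedup_and_densify(grad):
    # Sum gradient rows that refer to the same embedding index, then convert the
    # result into a dense tensor of the full embedding shape for allreduce.
    unique_indices, positions = tf.unique(grad.indices)
    summed = tf.unsorted_segment_sum(grad.values, positions,
                                     tf.size(unique_indices))
    deduped = tf.IndexedSlices(summed, unique_indices, grad.dense_shape)
    return tf.convert_to_tensor(deduped)
```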

@vsuthichai
Contributor Author

@okuchaiev Would you have any benchmark numbers for the "big" parameter set with mixed-precision training, ideally multi-node? I'm essentially trying to see how feasible it is to reach what the fairseq Transformer achieves: https://code.fb.com/ai-research/scaling-neural-machine-translation-to-bigger-data-sets-with-faster-training-and-inference/

@okuchaiev
Member

@vsuthichai Sorry for the late responses - I am currently on paternity leave and will be back in October.
Answers to some of your questions:

" The embedding matrix is shared as the projection matrix pre-softmax as well."

  • This is correct for the Transformer, but not for GNMT. It does sound suboptimal that the entire gradient is computed instead of only the sparse parts that are actually necessary...

"Would you have any benchmark numbers for "big" parameter set and mixed precision training?"

  • This is on my to-do list. The big config should look something like this: 701893e
    I can get a test SacreBLEU of > 27.3 on en-de 2014, but I think there is still room for improvement.

What cards are you using, 32 GB or 16 GB?

@vsuthichai
Contributor Author

vsuthichai commented Sep 12, 2018

@okuchaiev Congratulations on fatherhood :) I apologize for sending these issues while you're on leave. Please feel free to address them whenever you're free.

I am benchmarking the transformer model. With iter_size == 1, the gradient for the embedding matrix is an IndexedSlices object, but with iter_size > 1, the gradient for the embedding / pre-softmax projection matrix is accumulated into a tf.Variable updated with scatter_nd_add.

Bringing Horovod into the picture: when iter_size == 1 it takes the somewhat slower allgather path, because the gradient is an IndexedSlices object; when iter_size > 1, the tf.Variable is treated as dense and uses allreduce, which is much faster.

I think simply changing the logic to iter_size >= 1 will force it to go down the same code path.
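For illustration, a rough sketch of the dense-accumulator path described above; the shapes and names are made up and this is not the actual OpenSeq2Seq code:

```python
import tensorflow as tf

# Dense accumulator for the shared embedding gradient; handing this dense
# tensor to Horovod keeps it on the allreduce path.
grad_accum = tf.Variable(tf.zeros([32000, 512]), trainable=False,
                         name="embedding_grad_accum")

def accumulate(grad):
    # grad is a tf.IndexedSlices coming out of the embedding lookup gradient.
    row_indices = tf.expand_dims(grad.indices, axis=-1)            # shape [n, 1]
    return tf.scatter_nd_add(grad_accum, row_indices, grad.values)
```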

Have you experienced any strange shaping errors when trying to profile with iter_size set to something greater than 1?

2865 ops no flops stats due to incomplete shapes.
Node ForwardPass/transformer_encoder/encode/embedding_shared_weights/embedding_1/mul_2 incompatible shapes: Shapes (?, ?, 512) and (128, 56, 1) are not compatible.
Node Loss_Optimization/gradients/ForwardPass/transformer_encoder/encode/embedding_shared_weights/embedding_1/mul_3_grad/Mul incompatible shapes: Shapes (?, ?, 512) and (128, 56, 1) are not compatible.
Node ForwardPass/transformer_encoder/encode/embedding_shared_weights/embedding/mul_2 incompatible shapes: Shapes (?, ?, 512) and (128, 56, 1) are not compatible.
Node Loss_Optimization/gradients/ForwardPass/transformer_encoder/encode/embedding_shared_weights/embedding/mul_3_grad/Mul incompatible shapes: Shapes (?, ?, 512) and (128, 56, 1) are not compatible.
Cannot parse returned proto: Error parsing message.

I'm using 16 GB V100s (Volta) on AWS p3.16xlarge instances.

@vsuthichai
Contributor Author

@okuchaiev In the dataset pipeline, moving shard before map provides a bit of a performance gain. Filtering-type operations, and shard is effectively one of these, should generally be among the first operations in the pipeline; it's inefficient to map a function over all the samples and then shard afterward (see the sketch below).
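An illustrative tf.data sketch of that ordering; the file name and parse function are placeholders, not the actual OpenSeq2Seq pipeline:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

def parse_example(line):
    # Placeholder for whatever per-sample preprocessing the real pipeline does.
    return tf.string_split([line]).values

dataset = tf.data.TextLineDataset(["train.src"])                  # hypothetical input file
dataset = dataset.shard(num_shards=hvd.size(), index=hvd.rank())  # shard first: each worker reads 1/N of the data
dataset = dataset.map(parse_example, num_parallel_calls=4)        # then map only this worker's shard
dataset = dataset.batch(128).prefetch(1)
```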

@okuchaiev
Member

Looks like this is a duplicate of #243 and #244. Closing for now.
