throughput scaling issues #204
Comments
@vsuthichai Do you know how fast the connection between two AWS p3.16xlarge instances is?
So I believe the issue has to do with sparse gradient updates, particularly the shared embeddings. Horovod implements the allreduce as an allgather https://github.com/uber/horovod/blob/f43ad4763574b5652f488648ca1860c1e55a8152/horovod/tensorflow/__init__.py#L72 if the gradient is an IndexedSlices. Also, they're not using the NCCL allgather but the MPI allgather. On a side note, IndexedSlices objects are created by tf.gather if I'm not mistaken. It makes sense to only update the relevant embeddings, and it should be more efficient. I'm just puzzled why the MPI allgather is so much slower. The bandwidth is 25Gbps and it does get saturated during the MPI allgather except for a few spots. This is a measurement of the network traffic on one of the nodes over two training steps: there are 5 NCCL allreduces per training step, followed by a really long (~2.5s) MPI allgather. The x axis is time in milliseconds.
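(For reference, a minimal TF 1.x sketch, with toy values invented for illustration, of why this gradient arrives as an IndexedSlices: the backward pass of tf.gather produces one, and Horovod routes IndexedSlices gradients through allgather.)

```python
import tensorflow as tf

# Toy setup; a real model would have (vocab_size, d_model) = (32000, 512).
embedding = tf.Variable(tf.random_normal([32000, 512]))
token_ids = tf.constant([1, 5, 5, 42])       # a tiny "batch" of token ids
vectors = tf.gather(embedding, token_ids)    # embedding lookup
loss = tf.reduce_sum(vectors)

# The gradient w.r.t. the gathered variable is sparse:
grad = tf.gradients(loss, [embedding])[0]
print(type(grad))  # tensorflow.python.framework.ops.IndexedSlices
```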
I believe this is an AWS-specific issue. AWS provides 25Gbit bandwidth on p3.16xlarge instances; however, each TCP connection is limited to 10Gbit. NCCL is able to use multiple connections with
@alsrgv, thank you for the input. It sounds like I may have to wait until Horovod supports NCCL allgather.
@okuchaiev @alsrgv An interesting discovery... After an allgather, the concatenated gradient shape across two nodes was around (800k, 512). Additionally, this shape doesn't change and stays fixed from allgather to allgather. Is this because of padding to some fixed length? The shape suggests the concatenated gradient is really large. Given that the vocab embedding matrix is just (32k, 512), a workaround was tried to convert the IndexedSlices gradient into a Tensor with
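(A sketch of that workaround, TF 1.x style; the helper name is mine, not from the repo. tf.convert_to_tensor densifies an IndexedSlices, summing duplicate indices along the way, so Horovod sees an ordinary Tensor and takes the allreduce path.)

```python
import tensorflow as tf

def densify(grad):
    """Convert a sparse embedding gradient into a dense Tensor (sketch)."""
    if isinstance(grad, tf.IndexedSlices):
        # TF may warn here that converting IndexedSlices to a dense Tensor
        # can consume a large amount of memory.
        return tf.convert_to_tensor(grad)
    return grad
```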
I thought converting IndexedSlices to Tensors can result in a large increase in memory usage, since you are effectively converting sparse to dense? TF even seems to give a warning about that. Btw, we just published some scaling numbers we see within a single machine:
@okuchaiev That's a good point. Going from sparse to dense does consume more memory. I'm trying to address the allgather issue that comes along with using IndexedSlices. For multi-node configurations, I've had problems as the concatenated gradients become larger and larger. Have you experienced this bottleneck?
@okuchaiev For the embedding matrix, I was wondering if you've noticed this issue with the size of tf.IndexedSlices during the allgather. I'm doing a simple experiment where I set the mini-batch size to 1 and print the number of tokens in the src and target matrices. I've noticed the size of the IndexedSlices is the size of the vocabulary plus the number of tokens in the src matrix plus the number of tokens in the target matrix. This seems like a lot for something I would presume to be sparse. The size of the IndexedSlices should just be the number of tokens within the src and target matrices, but for some reason the entire size of the vocabulary is included. This would certainly be overkill for Horovod's allgather, and the problem worsens with more GPUs. Any thoughts? Per training step, the size of the IndexedSlices is (32k + (batch_size * src_timestep_length) + (batch_size * target_timestep_length)). The 32k-sized vocab term is what is in question here; I would think it should not be part of the IndexedSlices. Are gradients being computed for every embedding per training step? And is that necessary?
I think I discovered the answer to my question. The embedding matrix is also shared as the pre-softmax projection matrix, so gradients are computed for the entire embedding matrix. Still, I feel some performance improvement could be had here: if the gradients for repeated indices within the IndexedSlices are summed together, the result can just be converted to a dense Tensor, and Horovod will then take the standard fast allreduce path instead of allgather. Since the size of the IndexedSlices will always be at least the size of the vocabulary, you might as well use a dense Tensor.
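(A minimal sketch of that idea; the function name and TF 1.x-style API are mine, not from Horovod or OpenSeq2Seq: sum the rows with repeated indices, then scatter into a dense (vocab_size, d_model) Tensor.)

```python
import tensorflow as tf

def densify_embedding_grad(grad, vocab_size):
    """Sum gradient rows with repeated indices, then scatter them into a
    dense (vocab_size, d_model) Tensor so Horovod can allreduce it."""
    unique_ids, positions = tf.unique(grad.indices)
    # Sum all gradient rows that refer to the same embedding index.
    summed = tf.math.unsorted_segment_sum(
        grad.values, positions, tf.shape(unique_ids)[0])
    # Scatter the deduplicated rows into a dense tensor.
    return tf.scatter_nd(
        tf.expand_dims(unique_ids, 1), summed,
        [vocab_size, tf.shape(summed)[1]])
```

Note that tf.convert_to_tensor on an IndexedSlices performs an equivalent sum-and-scatter internally, so the simpler conversion shown earlier should behave the same.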
@okuchaiev Would you have any benchmark numbers for the "big" parameter set with mixed precision training, ideally multi-node? I'm essentially trying to see how feasible it is to reach what the fairseq transformer achieves: https://code.fb.com/ai-research/scaling-neural-machine-translation-to-bigger-data-sets-with-faster-training-and-inference/
@vsuthichai Sorry for the late responses - I am currently on paternity leave and will be back in October.
What cards are you using, 32GB or 16GB?
@okuchaiev Congratulations on fatherhood :) I apologize for sending these issues while you're on leave. Please feel free to address them whenever you're free. I am benchmarking the transformer model. With Horovod entering the picture, I think simply changing the logic to
Have you experienced any strange shape errors when trying to profile with
I'm using 16GB V100 Voltas on AWS p3.16xlarge.
@okuchaiev In the dataset pipeline, moving shard before map provides a bit of a performance gain. In general, filtering operations (and shard is effectively one of them) should be among the first operations in the dataset pipeline. It becomes inefficient to map a function over every sample and only shard afterward.
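(A sketch of that reordering; the input file, parse_fn, and shard parameters are placeholders, and under Horovod the shard parameters would come from hvd.size() and hvd.rank().)

```python
import tensorflow as tf

def parse_fn(line):
    return tf.string_split([line]).values  # placeholder tokenizer

num_shards, shard_index = 8, 0  # e.g. hvd.size(), hvd.rank() under Horovod
dataset = tf.data.TextLineDataset("train.txt")  # placeholder input file
# Shard first, so each worker reads only its own 1/num_shards of the lines...
dataset = dataset.shard(num_shards, shard_index)
# ...and only then pays the cost of parsing them.
dataset = dataset.map(parse_fn, num_parallel_calls=4).prefetch(1)
```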
Looks like this is a duplicate of #243 and #244. Closing for now.
I'm attempting to benchmark throughput on transformer-big on the following:
I'm benchmarking for 100 steps -- steps 10 to 109, skipping the first 10 warm-up steps (0 to 9). Here are some results. Throughput seems to plateau at 8 GPUs and doesn't scale any further. I'm primarily interested in getting the throughput (samples per second) to scale well. Any thoughts?
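(For what it's worth, a rough sketch, not from the repo, of that measurement: warm up for 10 steps, then average samples/sec over the next 100.)

```python
import time

def measure_throughput(run_step, batch_size, warmup=10, steps=100):
    """Skip warm-up steps, then average samples/sec over `steps` steps."""
    for _ in range(warmup):
        run_step()
    start = time.time()
    for _ in range(steps):
        run_step()
    return steps * batch_size / (time.time() - start)

# Dummy step so the sketch runs standalone; in a real benchmark run_step
# would be something like `lambda: sess.run(train_op)` (names hypothetical).
print("%.1f samples/sec" % measure_throughput(lambda: time.sleep(0.01), 32))
```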