This repository has been archived by the owner on Aug 3, 2021. It is now read-only.
There are some inefficiencies with applying `shard` after `cache` in the `build_graph` method of text2text.py. Rewriting that part of the code so that `shard` is called before `map` and `cache` results in a sizable decrease in time per step when training the big Transformer on WMT14 En-De. Happy to provide a pull request. On 4 nodes, 32 GPUs, batch size 128, iter size 16, with mixed-precision training on 16 GB Voltas, I've seen time per step drop from 13.9 s without the fix to 6.6 s with the following code change in the dataset pipeline.
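A minimal sketch of why the ordering matters (plain Python standing in for the tf.data ops; `expensive_map` and the counter are hypothetical, not from text2text.py). When `shard` runs after `map`/`cache`, every worker applies the map function to the full dataset and caches all of it; when `shard` runs first, each worker only processes its own 1/num_shards slice:

```python
# Hypothetical stand-in for the per-example preprocessing (tokenization, etc.)
def expensive_map(example, counter):
    counter["calls"] += 1
    return example * 2

def shard_after_map(data, num_shards, shard_index, counter):
    # Every worker maps the FULL dataset, then keeps its slice.
    mapped = [expensive_map(x, counter) for x in data]
    return mapped[shard_index::num_shards]

def shard_before_map(data, num_shards, shard_index, counter):
    # Each worker maps only its own 1/num_shards slice.
    return [expensive_map(x, counter) for x in data[shard_index::num_shards]]

data = list(range(1000))

c1 = {"calls": 0}
out1 = shard_after_map(data, num_shards=32, shard_index=0, counter=c1)

c2 = {"calls": 0}
out2 = shard_before_map(data, num_shards=32, shard_index=0, counter=c2)

assert out1 == out2       # each worker sees the same elements either way
print(c1["calls"], c2["calls"])  # 1000 map calls vs. 32
```

With 32 shards the per-worker map work drops by roughly 32x, which is consistent with the large drop in time per step reported above.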
Thanks @vsuthichai! It does seem to make a big difference even when Horovod isn't used.
Is this what you meant: #246 ?
My only concern is whether `shard` is deterministic (it seems to be), so that the mapping between src and tgt is preserved.