Commit
This speeds up multi-GPU training with Horovod by asynchronously triggering the outputs of XLA clusters that feed HorovodAllreduce nodes.

The feature is currently off by default; turn it on by setting TF_XLA_FLAGS="--tf_xla_auto_jit=1 --tf_xla_async_io_level=1".

Some data points on 8 GPUs on a DGX-1, XLA-Async vs. XLA-Sync (i.e., before this commit):
- 13% perf gain on BERT-Large pretrain SQuAD, FP32, BatchSize=2
- 7% perf gain on UNet medical trainbench, FP32

Design doc: https://docs.google.com/document/d/1oohJC3BgQYmCb0njAf1Iqd1MShg4cPSeu9ZelnOFY8c/edit

Authors: Trent Lo and Ayan Moitra
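A minimal sketch of how the flag might be enabled in a Horovod training script. Only the TF_XLA_FLAGS value comes from this commit; the rest of the setup (the imports, GPU pinning, and script structure) is a typical Horovod pattern shown for illustration:

    import os

    # Must be set before TensorFlow initializes XLA. The flag values are
    # the ones documented in this commit; everything below is illustrative.
    os.environ["TF_XLA_FLAGS"] = "--tf_xla_auto_jit=1 --tf_xla_async_io_level=1"

    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()

    # Pin each process to a single GPU, as in a typical Horovod setup.
    gpus = tf.config.experimental.list_physical_devices("GPU")
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")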