I've used Megatron to train a 13B GPT model on an H100 machine.
Before enabling the fp8 transformer engine, training speed was about 0.34 s/step.
After enabling the fp8 transformer engine with these two arguments, --fp8-hybrid and --transformer-impl "transformer_engine", training speed is about 0.24 s/step.
According to this blog, fp8 should give a 100% speedup over bf16, but I only got a 35% speedup on Megatron.
Is the 35% speedup reasonable, or have I made some mistake in using the fp8 transformer engine?
Thanks a lot for the reply.
I assume you are referencing Figure 9 from the white paper linked from that blog? If so, that figure is simply stating that fp8 is computationally 2x the throughput of bf16, when isolating arithmetic operations. The actual end-to-end speedup will be less than this, since you must account for other overheads like communication, memory bandwidth, and the optimizer step. The speedup will also vary greatly depending on your model size and micro batch size.
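The gap between the 2x arithmetic throughput and the observed end-to-end gain can be sketched with Amdahl's law. This is a minimal illustration, not a measurement: the 70% GEMM fraction below is a hypothetical split of step time, and the real fraction depends on model size, micro batch size, and parallelism configuration.

```python
# Amdahl's-law sketch: fp8 only accelerates the fp8-eligible GEMM portion
# of a training step; communication, memory-bound ops, and the optimizer
# step run at the same speed as before.

def end_to_end_speedup(gemm_fraction: float, gemm_speedup: float) -> float:
    """Whole-step speedup when only the GEMM fraction is accelerated."""
    return 1.0 / ((1.0 - gemm_fraction) + gemm_fraction / gemm_speedup)

# Hypothetical split: 70% of step time is fp8-eligible GEMMs, and fp8
# doubles their throughput relative to bf16.
print(round(end_to_end_speedup(0.70, 2.0), 2))  # -> 1.54, well below 2x

# For comparison, the reported times give 0.34 / 0.24 ~= 1.42x.
print(round(0.34 / 0.24, 2))  # -> 1.42
```

So a sub-2x end-to-end gain is expected; the smaller the fraction of the step spent in fp8-eligible GEMMs, the smaller the overall speedup.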