Describe the bug
In ParallelTransformerLayer.forward, when using the gpt_j_residual path, both the SelfAttention block and the MLP block produce parallel_output=True outputs and return activations and biases separately. The output biases of those blocks are replicated across model-parallel ranks (not divided), and each rank adds the bias before doing the model-parallel reduce. In pseudocode (and ignoring dropout), you have:
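(A sketch of the shape of that code; the variable and helper names below are illustrative, not the exact source.)

```python
# Sketch of the gpt_j_residual path (illustrative names, not the exact source).
# attention_output / mlp_output hold each rank's partial sums from the row-parallel
# output projections; attention_bias / mlp_bias are full biases replicated on every rank.
attention_output, attention_bias = self.attention(self.input_layernorm(x), attention_mask)
mlp_output, mlp_bias = self.mlp(self.post_attention_layernorm(x))

# Each rank adds the full (replicated) bias to its partial activations...
output = (attention_output + attention_bias) + (mlp_output + mlp_bias)

# ...and only then all-reduces across the k model-parallel ranks, so the
# reduced result contains k copies of each bias.
output = mpu.reduce_from_model_parallel_region(output)
output = x + output  # gpt-j style residual connection
```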
As a result, if you run with k-way model parallelism, output ends up containing k * bias_parallel rather than a single addition of the bias.
If you are training from scratch, it sort of is what it is (the model can learn to down-scale the bias by k), but this breaks using an existing checkpoint at any model-parallel degree greater than 1. (And does so in a small way that is hard to perceive -- the accuracy penalty is not large, but it is measurable.)
Expected behavior
The number of times the bias is added should not depend on the model-parallel degree.
Proposed solution
This one is tricky. I have hacked around it in my own code for the dropout=0 case: wait to add the bias until after the model-parallel all-reduce. But a general solution requires a more careful reorganization of the code.
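For the dropout=0 case, that workaround looks roughly like this (again with the illustrative names from the sketch above, not the actual patch):

```python
# Workaround sketch for dropout=0: reduce the partial activations first,
# then add each (replicated) bias exactly once after the all-reduce.
output = attention_output + mlp_output                  # partial sums only, no bias yet
output = mpu.reduce_from_model_parallel_region(output)  # sum over model-parallel ranks
output = output + attention_bias + mlp_bias             # bias added exactly once
output = x + output                                     # residual connection
```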
Environment (please complete the following information):
GPUs: A100s
Configs: a 16B-parameter model running with model-parallel-size: 2
This is quite interesting, thanks for flagging it.
What if we just do activations_parallel += bias_parallel / tp_size? And I would guess that we could convert old checkpoints to work with this new code by multiplying the biases by tp_size?
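Concretely, something like this (a sketch only; tp_size here is assumed to be the model-parallel world size, and the checkpoint key names are hypothetical):

```python
# Sketch of the proposed fix (illustrative names, following the sketch above).
tp_size = mpu.get_model_parallel_world_size()
output = (attention_output + attention_bias / tp_size) + (mlp_output + mlp_bias / tp_size)
output = mpu.reduce_from_model_parallel_region(output)
# The all-reduce sums tp_size copies of bias / tp_size, so the bias ends up added once.

# Converting an old mp>1 checkpoint to this new code would then mean scaling the
# affected biases up by tp_size (key names below are hypothetical):
# state_dict["attention.dense.bias"] *= tp_size
# state_dict["mlp.dense_4h_to_h.bias"] *= tp_size
```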