Describe the bug
In ParallelTransformerLayer.forward, when using the gpt_j_residual path, both the SelfAttention block and the MLP block produce parallel_output=True outputs and return activations and biases separately. The output biases of those blocks are replicated across model-parallel ranks (not divided), and each rank adds the bias before doing the model-parallel reduce. In pseudocode (and ignoring dropout), you have:
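(A sketch of the shape of that code; the variable and helper names below are illustrative, not the exact source.)

```python
# Sketch of the gpt_j_residual path (illustrative names, not the exact source).
# attention_output / mlp_output hold each rank's partial sums from the row-parallel
# output projections; attention_bias / mlp_bias are full biases replicated on every rank.
attention_output, attention_bias = self.attention(self.input_layernorm(x), attention_mask)
mlp_output, mlp_bias = self.mlp(self.post_attention_layernorm(x))

# Each rank adds the full (replicated) bias to its partial activations...
output = (attention_output + attention_bias) + (mlp_output + mlp_bias)

# ...and only then all-reduces across the k model-parallel ranks, so the
# reduced result contains k copies of each bias.
output = mpu.reduce_from_model_parallel_region(output)
output = x + output  # gpt-j style residual connection
```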
As a result, if you run with k-way model parallelism, output ends up containing k * bias_parallel rather than a single addition of the bias.
If you are training from scratch, it sort of is what it is (the model can learn to down-scale the bias by k), but this breaks using an existing checkpoint at any model-parallel degree greater than 1. (And does so in a small way that is hard to perceive -- the accuracy penalty is not large, but it is measurable.)
Expected behavior
The number of times the bias is added should not depend on the model-parallel degree.
Proposed solution
This one is tricky. I have hacked around it in my own code for the dropout=0 case: wait to add the bias until after the model-parallel all-reduce. But a general solution requires a more careful reorganization of the code.
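For the dropout=0 case, that workaround looks roughly like this (again with the illustrative names from the sketch above, not the actual patch):

```python
# Workaround sketch for dropout=0: reduce the partial activations first,
# then add each (replicated) bias exactly once after the all-reduce.
output = attention_output + mlp_output                  # partial sums only, no bias yet
output = mpu.reduce_from_model_parallel_region(output)  # sum over model-parallel ranks
output = output + attention_bias + mlp_bias             # bias added exactly once
output = x + output                                     # residual connection
```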
Environment (please complete the following information):
GPUs: A100s
Configs: a 16B-parameter model running with model-parallel-size: 2
This is quite interesting, thanks for flagging it.
What if we just do activations_parallel += bias_parallel / tp_size? And I would guess that we could convert old checkpoints to work with this new code by multiplying the biases by tp_size?
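Concretely, something like this (a sketch only; tp_size here is assumed to be the model-parallel world size, and the checkpoint key names are hypothetical):

```python
# Sketch of the proposed fix (illustrative names, following the sketch above).
tp_size = mpu.get_model_parallel_world_size()
output = (attention_output + attention_bias / tp_size) + (mlp_output + mlp_bias / tp_size)
output = mpu.reduce_from_model_parallel_region(output)
# The all-reduce sums tp_size copies of bias / tp_size, so the bias ends up added once.

# Converting an old mp>1 checkpoint to this new code would then mean scaling the
# affected biases up by tp_size (key names below are hypothetical):
# state_dict["attention.dense.bias"] *= tp_size
# state_dict["mlp.dense_4h_to_h.bias"] *= tp_size
```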