Quoting the ZeRO paper (Section 3.2):

> Activations can take up a significant amount of memory [7] during training. As a concrete example, the 1.5B parameter GPT-2 model trained with a sequence length of 1K and a batch size of 32 requires about 60GB of memory. The activation memory of a transformer-based model is proportional to the number of transformer layers × hidden dimensions × sequence length × batch size. For a GPT-2-like architecture, the total activations are about 12 × hidden dim × batch × seq length × transformer layers.
Hi. The activation memory is proportional to the product of these values. For the model tested in the ZeRO paper the proportionality coefficient is 12, but that is not the case for LLaMA. To measure the size of the activations yourself, disable gradient checkpointing and check the memory usage.
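As a quick sanity check of that coefficient, here is a rough sketch that plugs the GPT-2 XL configuration (48 layers, hidden size 1600) into the formula; it reproduces the ~60GB figure quoted above, assuming fp16 activations:

```python
# GPT-2 1.5B (XL): 48 layers, hidden size 1600, trained with
# sequence length 1024 and batch size 32, fp16 activations.
layers, hidden, seq, batch = 48, 1600, 1024, 32

elements = 12 * hidden * batch * seq * layers  # ~30.2e9 activation values
activation_gb = elements * 2 / 1e9             # 2 bytes per fp16 value
print(f"{activation_gb:.1f} GB")               # ~60.4 GB, matching the quote
```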
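And here is a minimal sketch of the empirical measurement, assuming a PyTorch model on a single CUDA device (`model` and `batch` are placeholders):

```python
import torch

def peak_train_step_gib(model, batch):
    """Peak GPU memory (GiB) over one forward/backward pass."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    loss = model(**batch).loss        # assumes an HF-style model output
    loss.backward()
    model.zero_grad(set_to_none=True)
    return torch.cuda.max_memory_allocated() / 1024**3

# With Hugging Face transformers you can compare the two modes:
#   model.gradient_checkpointing_disable()  # activations kept -> higher peak
#   model.gradient_checkpointing_enable()   # activations recomputed -> lower peak
# The difference between the two peaks approximates the activation footprint.
```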
Looking at the table, the actual number of parameters in llama7b is 6738000000, so at half precision it makes sense that 6738000000 * 2 / (1024**3) == 12.55GB. I usually see approximations like 7B * 2 = 14GB, and so was initially confused.
What's the reasoning for the values in the rest of the table, though?
E.g. following the above reasoning would imply that AdamW takes 12 bytes of optimizer state per parameter, but I believe that shouldn't be the case? AFAIK, AdamW optimizer state should only require two floats per parameter.
> I usually see approximations like 7B * 2 = 14GB, and so was initially confused.
Model parameters (in billions) * 2 is just an estimate.
The gradient tensors are the same size as the parameters, so they occupy the same amount of GPU memory.
With mixed-precision training, AdamW keeps an fp32 copy of the parameters as well as the momentum and variance, so it needs 12 bytes per parameter.
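A back-of-the-envelope sketch of that accounting, using the parameter count from the table above:

```python
# Per-parameter memory under fp16 mixed precision with AdamW:
#   fp16 weights (2 B) + fp16 gradients (2 B)
#   + fp32 master weights (4 B) + fp32 momentum (4 B) + fp32 variance (4 B)
n = 6_738_000_000  # LLaMA-7B parameter count from the table

print(n * 2 / 1024**3)   # params:    ~12.55 GiB
print(n * 2 / 1024**3)   # gradients: ~12.55 GiB
print(n * 12 / 1024**3)  # optimizer: ~75.31 GiB (4 + 4 + 4 bytes per param)
```

So the "two floats per parameter" intuition misses the fp32 master copy that mixed-precision training adds on top of the momentum and variance.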
Thanks for your amazing work.
Since I am not very familiar with the memory usage computation, I would like to know if you could add more details about Table 1 to the Appendix. I can obtain the memory usage of Params = 12.x GB on my own, but I am confused about the memory usage of Activations = 45.x GB. I read the paper ZeRO: Memory Optimizations Toward Training Trillion Parameter Models; in Section 3.2 they give the activation formula quoted at the top of this thread.
The following is my computation:
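Roughly, I plug LLaMA-7B's 32 layers and hidden size 4096 into the formula above; note that the batch size and sequence length below are assumed values, which may not match the paper's setting:

```python
# Applying the ZeRO Section 3.2 formula to LLaMA-7B (32 layers, hidden 4096).
# Batch size and sequence length are assumptions, not taken from the paper.
layers, hidden = 32, 4096
batch, seq = 2, 2048

activation_bytes = 12 * hidden * batch * seq * layers * 2  # 2 bytes per fp16 value
print(activation_bytes / 1024**3)  # ~12 GiB, far from the 45.x GB in Table 1
```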
Can you explain why you report 45.x GB rather than 12 GB of memory usage for activations?