
Question about Memory usage (GB) when training LLaMA-7B under different settings. #16

Open
kiseliu opened this issue Jun 24, 2023 · 3 comments

Comments

@kiseliu

kiseliu commented Jun 24, 2023

Thanks for your amazing work.

Since I am not very familiar with memory usage computation, could you add more details about Table 1 to the Appendix?
[screenshot of Table 1 from the paper]

I can obtain the memory usage of Params = 12.x GB on my own,
but I am confused about the memory usage of Activations = 45.x GB.

I read the paper ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. In Section 3.2, they say:

Activations can take up a significant amount of memory [7] during training. 
As a concrete example, the 1.5B parameter GPT-2 model trained with sequence length of 1K and batch size of 32 
requires about 60GB of memory. 

The activation memory of a transformer-based model is proportional to 
the number of transformer layers × hidden dimensions × sequence length × batch size. 

For a GPT-2 like architecture the total activations is about 
12 × hidden dim × batch × seq length × transformer layers.

The following is my computation:

GPT-2 XL:
memory in fp16: (12 * 1600 * 32 * 1000 * 48) / 1024 / 1024 / 1024 * 2 = 54.9x GB

LLaMA-7B:
memory in fp16: (12 * 4096 * 8 * 512 * 32) / 1024 / 1024 / 1024 * 2 = 12.0 GB

Can you explain why you report 45.x GB rather than 12 GB of memory usage for activations?
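
For reference, here is the arithmetic above as a short Python sketch; it simply follows the ZeRO rule of thumb quoted earlier, treating the 12 × hidden × batch × seq × layers product as an fp16 element count (2 bytes each), which is the interpretation I used in my numbers:

```python
# Rule of thumb from the ZeRO paper (Sec. 3.2), read here as an fp16 element count:
#   activations ≈ 12 * hidden_dim * batch * seq_len * n_layers elements, 2 bytes each
def activation_gb(hidden_dim, batch, seq_len, n_layers, bytes_per_elem=2):
    elems = 12 * hidden_dim * batch * seq_len * n_layers
    return elems * bytes_per_elem / 1024**3

print(activation_gb(1600, 32, 1000, 48))  # GPT-2 XL  -> ~54.9 GB
print(activation_gb(4096, 8, 512, 32))    # LLaMA-7B  -> ~12.0 GB
```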

@KaiLv69
Collaborator

KaiLv69 commented Jun 24, 2023

Hi. The activation memory is proportional to the product of these values. For the model tested in the ZeRO paper the proportionality constant is 12, but that is not the case for LLaMA. To see the size of activations without gradient checkpointing, you can disable the gradient checkpointing setting and check the memory usage.
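
For example (a minimal sketch assuming a PyTorch setup on a single GPU; `model` and `batch` are placeholders for your own model and inputs):

```python
import torch

# Measure how much memory the forward pass keeps alive for backward
# (mostly activations) once gradient checkpointing is disabled.
torch.cuda.reset_peak_memory_stats()
before = torch.cuda.memory_allocated()

loss = model(**batch).loss   # placeholder: any forward pass that returns a loss
after_forward = torch.cuda.memory_allocated()
print(f"held after forward (~activations): {(after_forward - before) / 1024**3:.2f} GB")

loss.backward()
print(f"peak during fwd+bwd: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```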

@nairbv

nairbv commented Aug 9, 2023

Looking at the table, the actual number of parameters in llama7b is 6,738,000,000, so at half precision it makes sense that 6738000000 * 2 / (1024**3) == 12.55 GB. I usually see approximations like 7B * 2 = 14 GB, so I was initially confused.

What's the reasoning for the values in the rest of the table, though?

For example, the above reasoning would imply AdamW takes 12 bytes of optimizer state per parameter, but I believe that shouldn't be the case? AFAIK AdamW optimizer state should only require two floats per parameter.

@KaiLv69
Collaborator

KaiLv69 commented Aug 10, 2023

I usually see approximations like 7B * 2 = 14GB, and so was initially confused.

Model parameters (billions) * 2 is really just an estimate.

The gradient tensors are the same size as the parameters, so they occupy the same amount of GPU memory.
With mixed-precision training, AdamW keeps an fp32 copy of the parameters as well as the momentum and variance, so it is 12 bytes per parameter.
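
As a worked example of that bookkeeping (a sketch, using the 6,738,000,000 parameter count mentioned above):

```python
# Bytes per parameter under mixed-precision training with AdamW:
#   fp16 params (2) + fp16 grads (2) + fp32 copy (4) + fp32 momentum (4) + fp32 variance (4)
n_params = 6_738_000_000
GB = 1024**3

print(f"params (fp16):              {n_params * 2 / GB:.2f} GB")   # ~12.55 GB
print(f"grads (fp16):               {n_params * 2 / GB:.2f} GB")   # ~12.55 GB
print(f"AdamW state (12 bytes/par): {n_params * 12 / GB:.2f} GB")  # ~75.30 GB
```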
