Quoting the ZeRO paper (Section 3.2):

> Activations can take up a significant amount of memory [7] during training. As a concrete example, the 1.5B parameter GPT-2 model trained with a sequence length of 1K and a batch size of 32 requires about 60GB of memory. The activation memory of a transformer-based model is proportional to the number of transformer layers × hidden dimensions × sequence length × batch size. For a GPT-2-like architecture, the total activations are about 12 × hidden dim × batch × seq length × transformer layers.
Hi. The activation memory is proportional to the product of these values. For the model tested in the ZeRO paper the proportionality coefficient is 12, but that is not the case for LLaMA. To measure the size of the activations yourself, disable gradient checkpointing and check the memory usage.
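As a quick sanity check of that coefficient, here is a rough sketch that plugs the GPT-2 XL configuration (48 layers, hidden size 1600) into the formula; it reproduces the ~60GB figure quoted above, assuming fp16 activations:

```python
# GPT-2 1.5B (XL): 48 layers, hidden size 1600, trained with
# sequence length 1024 and batch size 32, fp16 activations.
layers, hidden, seq, batch = 48, 1600, 1024, 32

elements = 12 * hidden * batch * seq * layers  # ~30.2e9 activation values
activation_gb = elements * 2 / 1e9             # 2 bytes per fp16 value
print(f"{activation_gb:.1f} GB")               # ~60.4 GB, matching the quote
```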
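And here is a minimal sketch of the empirical measurement, assuming a PyTorch model on a single CUDA device (`model` and `batch` are placeholders):

```python
import torch

def peak_train_step_gib(model, batch):
    """Peak GPU memory (GiB) over one forward/backward pass."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    loss = model(**batch).loss        # assumes an HF-style model output
    loss.backward()
    model.zero_grad(set_to_none=True)
    return torch.cuda.max_memory_allocated() / 1024**3

# With Hugging Face transformers you can compare the two modes:
#   model.gradient_checkpointing_disable()  # activations kept -> higher peak
#   model.gradient_checkpointing_enable()   # activations recomputed -> lower peak
# The difference between the two peaks approximates the activation footprint.
```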
Looking at the table, the actual number of parameters in llama7b is 6738000000, so at half precision it makes sense that 6738000000 * 2 / (1024**3) == 12.55GB. I usually see approximations like 7B * 2 = 14GB, and so was initially confused.
What's the reasoning for the values in the rest of the table, though?
E.g. following the above reasoning would imply that AdamW takes 12 bytes of optimizer state per parameter, but I believe that shouldn't be the case? AFAIK, AdamW optimizer state should only require two floats per parameter.
> I usually see approximations like 7B * 2 = 14GB, and so was initially confused.
Model parameters (in billions) * 2 is just an estimate.
The gradient tensors are the same size as the parameters, so they occupy the same amount of GPU memory.
With mixed-precision training, AdamW keeps an fp32 copy of the parameters as well as the momentum and variance, so it needs 12 bytes per parameter.
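A back-of-the-envelope sketch of that accounting, using the parameter count from the table above:

```python
# Per-parameter memory under fp16 mixed precision with AdamW:
#   fp16 weights (2 B) + fp16 gradients (2 B)
#   + fp32 master weights (4 B) + fp32 momentum (4 B) + fp32 variance (4 B)
n = 6_738_000_000  # LLaMA-7B parameter count from the table

print(n * 2 / 1024**3)   # params:    ~12.55 GiB
print(n * 2 / 1024**3)   # gradients: ~12.55 GiB
print(n * 12 / 1024**3)  # optimizer: ~75.31 GiB (4 + 4 + 4 bytes per param)
```

So the "two floats per parameter" intuition misses the fp32 master copy that mixed-precision training adds on top of the momentum and variance.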
Thanks for your amazing work.
Since I am not very familiar with the memory usage computation, I would like to know if you could add more details about Table 1 to the Appendix. I can obtain the memory usage of Params = 12.x GB on my own, but I am confused about the memory usage of Activations = 45.x GB. I read the paper ZeRO: Memory Optimizations Toward Training Trillion Parameter Models; in Section 3.2 they give the activation formula quoted at the top of this thread.
The following is my computation:
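Roughly, I plug LLaMA-7B's 32 layers and hidden size 4096 into the formula above; note that the batch size and sequence length below are assumed values, which may not match the paper's setting:

```python
# Applying the ZeRO Section 3.2 formula to LLaMA-7B (32 layers, hidden 4096).
# Batch size and sequence length are assumptions, not taken from the paper.
layers, hidden = 32, 4096
batch, seq = 2, 2048

activation_bytes = 12 * hidden * batch * seq * layers * 2  # 2 bytes per fp16 value
print(activation_bytes / 1024**3)  # ~12 GiB, far from the 45.x GB in Table 1
```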
Can you explain why you report 45.x GB rather than 12 GB of memory usage for activations?