Typo? "7B require ~26 GB of GPU memory (A100 GPU)." #67

Closed

Qubitium opened this issue Mar 31, 2023 · 7 comments

Comments

@Qubitium

Is the following a typo, or does the lit-llama implementation require vastly more VRAM than the original implementation? 7B fits natively on a single 3090 24 GB GPU in the original llama implementation.

This will run the 7B model and require ~26 GB of GPU memory (A100 GPU).
@lantiga
Collaborator

lantiga commented Mar 31, 2023

Without any quantization, 7B float32 parameters means

7e9 (7B) * 4 (4 bytes per float32 parameter) / 1024**3 (bytes to GB) = 26.07 GB

so it checks out.

You can quantize the model as shown in the finetuning example and make it fit in a lot less memory, of course, but the vanilla model will take 26 GB just from the parameter count (the original repo is no different in that regard).
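
To make the arithmetic above concrete, here is a quick sketch of the same weight-only estimate for a few common dtypes (only the parameter count and bytes per parameter go in; real usage is a bit higher once activations, the KV cache, and CUDA overhead are included):

    # Back-of-the-envelope weight memory for a 7B-parameter model.
    PARAMS = 7e9

    for dtype, bytes_per_param in [("float32", 4), ("float16/bfloat16", 2), ("int8", 1)]:
        gib = PARAMS * bytes_per_param / 1024**3
        print(f"{dtype:>17}: ~{gib:.2f} GB")

    # float32          : ~26.08 GB  (the README figure)
    # float16/bfloat16 : ~13.04 GB  (roughly the ~14 GB seen with fp16 checkpoints)
    # int8             : ~6.52 GB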

@Qubitium
Author

The Hugging Face (native) conversion of the model is fp16 and uses about 14 GB of VRAM. There was no quantization performed for the Hugging Face model.

https://huggingface.co/decapoda-research/llama-7b-hf/blob/main/config.json

Or am I mistaken and they quantized it from fp32 to fp16?

@lantiga
Collaborator

lantiga commented Mar 31, 2023

Oh, for sure: you can already run it with fp16 or bf16 mixed precision super easily with Fabric. In fact, in train.py you can see at line 45

 fabric = L.Fabric(accelerator="cuda", devices=4, precision="bf16-mixed", strategy=strategy)

which already runs it with bf16 mixed precision and decreases the memory requirements. Note, however, that mixed precision is only meaningful during training. During inference you'll need to quantize to either 8-bit (provided in the example) or 16-bit.
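
For reference, here is a minimal, self-contained sketch of running under bf16 mixed precision with Fabric. A single Linear layer stands in for the actual LLaMA model; the Fabric calls mirror the pattern in train.py, but this is not the repo's actual training loop:

    import torch
    import lightning as L

    fabric = L.Fabric(accelerator="cuda", devices=1, precision="bf16-mixed")
    fabric.launch()

    model = torch.nn.Linear(4096, 4096)        # stand-in for the real model
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    model, optimizer = fabric.setup(model, optimizer)

    x = torch.randn(8, 4096, device=fabric.device)
    y = torch.randn(8, 4096, device=fabric.device)

    # The forward pass runs under bf16 autocast, but the master weights stay in
    # float32 -- which is why mixed precision saves memory during training yet
    # doesn't shrink the weights themselves for inference.
    out = model(x)
    loss = (out.float() - y).pow(2).mean()
    fabric.backward(loss)
    optimizer.step()
    optimizer.zero_grad()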

For clarity, in the README we are referring to the vanilla model in full precision, which we use to compare against the original llama repo.

We should definitely make the README clearer there, thanks for pointing that out.

/cc @awaelchli

@carmocca
Contributor

Another missing piece here is loading directly with a lower precision. Opened #71

@lantiga
Collaborator

lantiga commented Mar 31, 2023

@diegomontoya for clarity: for inference, if you just halve the precision in generate.py and then move the model to the GPU

model.to(device="cuda", dtype=torch.float16)

then you'll get 14 GB memory usage even with the original checkpoints. However, I'm testing generation with this and I'm not getting great results; investigating.
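
As a self-contained illustration of that memory halving (a stack of plain Linear layers stands in for the 7B model; the numbers scale linearly with parameter count):

    import torch

    # Stand-in model: ~134M parameters instead of 7B, same principle.
    model = torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(8)])
    n_params = sum(p.numel() for p in model.parameters())

    # Halve the precision and move to the GPU, as in the snippet above.
    model.to(device="cuda", dtype=torch.float16)
    torch.cuda.synchronize()

    print(f"parameters       : {n_params / 1e6:.0f}M")
    print(f"expected at fp16 : {n_params * 2 / 1024**3:.3f} GB")
    print(f"allocated on GPU : {torch.cuda.memory_allocated() / 1024**3:.3f} GB")
    # Scaling the same arithmetic to 7B parameters gives ~13 GB, in line with the
    # ~14 GB observed for the fp16 checkpoint.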

@lantiga
Collaborator

lantiga commented Apr 4, 2023

Putting this on your radar @diegomontoya #91

@lantiga
Collaborator

lantiga commented Apr 6, 2023

Solved with #100

@lantiga lantiga closed this as completed Apr 6, 2023