Typo? "7B require ~26 GB of GPU memory (A100 GPU)." #67
Is the following a typo, or does the lit-llama implementation require vastly more VRAM than the original implementation? 7B fits natively on a single 3090 (24 GB) GPU in the original LLaMA implementation.

Comments
Without any quantization, 7B float32 parameters means 7e9 params × 4 bytes/param = 28 GB (~26 GiB), so it checks out. You can quantize the model as shown in the finetuning example and make it fit in a lot less memory of course, but the vanilla model will take you 26 GB just out of the parameter count (the original repo is no different in that regard).
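To make the arithmetic explicit (plain Python, not code from the repo):

```python
# Back-of-the-envelope memory needed just to hold 7B parameters in a given
# dtype, before activations, KV cache, or any framework overhead.
n_params = 7e9

for dtype, nbytes in {"float32": 4, "float16/bfloat16": 2, "int8": 1}.items():
    gib = n_params * nbytes / 2**30
    print(f"{dtype}: {gib:.1f} GiB")

# float32: 26.1 GiB, float16/bfloat16: 13.0 GiB, int8: 6.5 GiB
```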
The Hugging Face (native) conversion of the model is fp16 and uses about 14 GB of VRAM. No quantization was performed for the Hugging Face model: https://huggingface.co/decapoda-research/llama-7b-hf/blob/main/config.json Or am I mistaken, and they quantized it from fp32 to fp16?
Oh for sure, you can run it with bf16 mixed precision, which already decreases the memory requirements (rough sketch below). Note however that mixed precision is only meaningful during training. During inference you'll need to quantize, to either 8 bit (provided in the example) or 16 bit. For clarity, in the README we are referring to the vanilla model in full precision, which we use to compare against the original llama repo. We should definitely make the README clearer there, thanks for pointing that out. /cc @awaelchli
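Roughly like this, via Lightning Fabric, which lit-llama builds on (a sketch, not the exact flags used in the repo's scripts):

```python
import lightning as L
import torch.nn as nn

# Request bf16 mixed precision from Fabric; the repo's generate/finetune
# scripts may wire this up differently.
fabric = L.Fabric(accelerator="cuda", devices=1, precision="bf16-mixed")
fabric.launch()

model = nn.Linear(4096, 4096)       # stand-in for the LLaMA module
model = fabric.setup_module(model)  # Fabric wraps the module with the precision plugin
```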
Another missing piece here is loading directly with a lower precision. Opened #71
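The idea is roughly the following (a minimal sketch, not the actual PR code; `nn.Linear` stands in for the real LLaMA module):

```python
import torch
import torch.nn as nn

# Keep checkpoint tensors in fp16 on the CPU so an fp32 copy of the weights
# never has to be materialized on the GPU.
checkpoint = nn.Linear(4096, 4096).state_dict()            # pretend fp32 checkpoint
checkpoint = {k: v.half() for k, v in checkpoint.items()}  # cast on CPU

model = nn.Linear(4096, 4096).half()   # build the model directly in fp16
model.load_state_dict(checkpoint)
model.to("cuda" if torch.cuda.is_available() else "cpu")
```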
@diegomontoya for clarity: for inference, if you just halve the precision in generate.py and then move the model to the GPU, you'll get 14 GB memory usage even with the original checkpoints. However, I'm testing generation with this and I'm not getting great results; investigating.
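Roughly like this (a sketch, not the exact generate.py diff):

```python
import torch
import torch.nn as nn

# Stand-in for the LLaMA module loaded from the original checkpoint in generate.py.
model = nn.Linear(4096, 4096)   # created in fp32

model.half()                    # in-place cast of parameters to fp16
model.to("cuda" if torch.cuda.is_available() else "cpu")
model.eval()                    # inference only; fp16 halves the weight memory
```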
Putting this on your radar @diegomontoya #91
Solved with #100