Typo? "7B require ~26 GB of GPU memory (A100 GPU)." #67

Closed

Qubitium opened this issue Mar 31, 2023 · 7 comments

Comments

@Qubitium

Is the following a typo, or does the lit-llama implementation require vastly more VRAM than the original implementation? 7B fits natively on a single 3090 24 GB GPU in the original llama implementation.

This will run the 7B model and require ~26 GB of GPU memory (A100 GPU).
@lantiga
Collaborator

lantiga commented Mar 31, 2023

Without any quantization, 7B float32 parameters means

7e9 (7B) * 4 (4 bytes per float32 parameter) / 1024**3 (bytes to GB) = 26.07 GB

so it checks out.

You can quantize the model as shown in the finetuning example and make it fit in a lot less memory, of course, but the vanilla model will take 26 GB just from the parameter count (the original repo is no different in that regard).
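
To make the arithmetic above concrete, here is a quick sketch of the same weight-only estimate for a few common dtypes (only the parameter count and bytes per parameter go in; real usage is a bit higher once activations, the KV cache, and CUDA overhead are included):

    # Back-of-the-envelope weight memory for a 7B-parameter model.
    PARAMS = 7e9

    for dtype, bytes_per_param in [("float32", 4), ("float16/bfloat16", 2), ("int8", 1)]:
        gib = PARAMS * bytes_per_param / 1024**3
        print(f"{dtype:>17}: ~{gib:.2f} GB")

    # float32          : ~26.08 GB  (the README figure)
    # float16/bfloat16 : ~13.04 GB  (roughly the ~14 GB seen with fp16 checkpoints)
    # int8             : ~6.52 GB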

@Qubitium
Author

The Hugging Face (native) conversion of the model is fp16 and uses about 14 GB of VRAM. There was no quantization performed for the Hugging Face model.

https://huggingface.co/decapoda-research/llama-7b-hf/blob/main/config.json

Or am I mistaken and they quantized it from fp32 to fp16?

@lantiga
Collaborator

lantiga commented Mar 31, 2023

Oh, for sure: you can already run it with fp16 or bf16 mixed precision super easily with Fabric. In fact, in train.py you can see at line 45

 fabric = L.Fabric(accelerator="cuda", devices=4, precision="bf16-mixed", strategy=strategy)

which already runs it with bf16 mixed precision and decreases the memory requirements. Note, however, that mixed precision is only meaningful during training. During inference you'll need to quantize to either 8-bit (provided in the example) or 16-bit.
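
For reference, here is a minimal, self-contained sketch of running under bf16 mixed precision with Fabric. A single Linear layer stands in for the actual LLaMA model; the Fabric calls mirror the pattern in train.py, but this is not the repo's actual training loop:

    import torch
    import lightning as L

    fabric = L.Fabric(accelerator="cuda", devices=1, precision="bf16-mixed")
    fabric.launch()

    model = torch.nn.Linear(4096, 4096)        # stand-in for the real model
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    model, optimizer = fabric.setup(model, optimizer)

    x = torch.randn(8, 4096, device=fabric.device)
    y = torch.randn(8, 4096, device=fabric.device)

    # The forward pass runs under bf16 autocast, but the master weights stay in
    # float32 -- which is why mixed precision saves memory during training yet
    # doesn't shrink the weights themselves for inference.
    out = model(x)
    loss = (out.float() - y).pow(2).mean()
    fabric.backward(loss)
    optimizer.step()
    optimizer.zero_grad()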

For clarity, in the README we are referring to the vanilla model in full precision, which we use to compare against the original llama repo.

We should definitely make the README clearer there, thanks for pointing that out.

/cc @awaelchli

@carmocca
Contributor

Another missing piece here is loading directly with a lower precision. Opened #71

@lantiga
Collaborator

lantiga commented Mar 31, 2023

@diegomontoya for clarity: for inference, if you just halve the precision in generate.py and then move the model to the GPU

model.to(device="cuda", dtype=torch.float16)

then you'll get 14 GB memory usage even with the original checkpoints. However, I'm testing generation with this and I'm not getting great results; investigating.
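
As a self-contained illustration of that memory halving (a stack of plain Linear layers stands in for the 7B model; the numbers scale linearly with parameter count):

    import torch

    # Stand-in model: ~134M parameters instead of 7B, same principle.
    model = torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(8)])
    n_params = sum(p.numel() for p in model.parameters())

    # Halve the precision and move to the GPU, as in the snippet above.
    model.to(device="cuda", dtype=torch.float16)
    torch.cuda.synchronize()

    print(f"parameters       : {n_params / 1e6:.0f}M")
    print(f"expected at fp16 : {n_params * 2 / 1024**3:.3f} GB")
    print(f"allocated on GPU : {torch.cuda.memory_allocated() / 1024**3:.3f} GB")
    # Scaling the same arithmetic to 7B parameters gives ~13 GB, in line with the
    # ~14 GB observed for the fp16 checkpoint.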

@lantiga
Collaborator

lantiga commented Apr 4, 2023

Putting this on your radar @diegomontoya #91

@lantiga
Collaborator

lantiga commented Apr 6, 2023

Solved with #100

@lantiga lantiga closed this as completed Apr 6, 2023