Support QLoRA 4-bit finetuning with bitsandbytes #275
Conversation
Not sure if the tests themselves should be updated instead.
Nice. I wanted to get this done this week. Do you mind if I push commits to your branch directly?
Sure, go ahead. I'm not familiar with the test framework, so I don't know how to resolve the check errors.
I just merged your latest changes with my branch and tested bnb.int8 and bnb.nf4-dq with Llama 2. Do you want me to push it?
cc: @carmocca
Sorry, but would it be possible to prioritize this among the other PRs, @carmocca? It is very relevant for the NeurIPS competition and people are requesting it 😅
@rasbt You should be able to proceed now
Thanks to the latest PRs, it works (again)! The performance of the non-quantized runs is also not impacted. Will post a table with the latest numbers later today once I have all the results.
Here are the results for the fixed quantized runs:
With python generate/lora.py --lora_path out/lora/bf16-true-nf4-dq/lit_model_lora_finetuned.pth --precision "bf16-true" --quantize "bnb.nf4-dq", for example, the timing looks good but the generated text is gobbledygook:
Same for bnb.nf4 without double quantization. Next, I wanted to run without the quantize flag: python generate/lora.py --lora_path out/lora/bf16-true-nf4-dq/lit_model_lora_finetuned.pth --precision "bf16-true" or python generate/lora.py --lora_path out/lora/bf16-true-nf4-dq/lit_model_lora_finetuned.pth but it results in
I was able to reproduce the row at the bottom. Very interesting. Inference also looks normal now:
The previous issues could have been related to some temporary results or an issue caused by merging main; I don't know. Other than that, I think everything is pretty good now, except that the memory savings are not as good as I expected. Any thoughts?
I ran fine-tuning for 5000 iterations on Llama-2-7b-hf with the Alpaca data set and micro_batch_size = 1 on an RTX 4090 (24 GB -- can't run unquantized with larger settings):

--precision bf16-true:

--precision bf16-true --quantize bnb.nf4:

Significant difference in both speed and memory usage, which is what I've seen with quantized QLoRA models all along across various frameworks. I'm currently using bitsandbytes version 0.40.0.post4 and will upgrade to 0.41.1 once my bnb.nf4-dq run completes.
Thanks for this! This gives me hope! I am using 0.41.1. Please let me know what you find. Another thing: I just used the default model, which is StableLM 3B. Let me rerun all the experiments tomorrow with a different model.
--precision bf16-true --quantize bnb.nf4-dq: I tried newer bitsandbytes releases, but from 0.40.1 on, I got runtime errors claiming it could not find libcudart.so, even though it is in /usr/local/cuda/lib64, which is in my LD_LIBRARY_PATH. There's a note in the bitsandbytes changelog for 0.40.1 about relying on the PyTorch CUDA libraries in CUDA SETUP, and another in 0.40.2 about handling a missing LD_LIBRARY_PATH, so I'll need to dig into this more tomorrow.
Same results with bnb 0.41.1. Needed to set BNB_CUDA_VERSION=122 for CUDA version 12.2, in case anyone has a similar problem.
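For anyone else hitting the libcudart.so lookup error, here is a minimal sketch of that workaround in Python (the shell equivalent is export BNB_CUDA_VERSION=122; the value must match your installed CUDA toolkit and has to be in place before bitsandbytes is imported):

```python
# Sketch of the BNB_CUDA_VERSION workaround described above.
# The override must be set before bitsandbytes is imported; setting it in the
# shell (export BNB_CUDA_VERSION=122) has the same effect.
import os

os.environ["BNB_CUDA_VERSION"] = "122"  # 122 == CUDA 12.2; match your toolkit

import bitsandbytes as bnb  # imported after the env var on purpose

print(bnb.__version__)
```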
Happy to report that I am getting similar results now, @Andrei-Aksionov. I think that StableLM 3B was a bad test case (I reran everything and can confirm the results from earlier). Bottom line: the advantage for 7B models is more obvious. Now
Other than that, I think things look good.
That's very thorough! Awesome work, Seb. I suggest using falcon-7b; you can just document any hyperparameter changes. This is what I did in https://github.com/Lightning-AI/lit-gpt/blob/main/tutorials/quantize.md. At some point we should replace StableLM as the default. A related idea is #142
You may want to revisit #286. The micro_batch_size setting has a significant impact on memory usage.
Thanks @rasbt for the table, I can imagine it took a while to compile.
Thanks for the feedback. And absolutely: computational performance is one thing, but now that we have these baselines, the other important aspect is modeling performance, so I am looking into that as well.
Yes, I agree, it makes a big difference. But as you can see in the tables above, I kept it consistent within each suite of runs. The reason I started with 4 for the StableLM 3B models is that it's the default setting. Then, I lowered it to 1 for the 7B models, because otherwise it would not work with most settings.
I've also tried adjusting it from 1 to 4 with different models and quantization modes to find the best fit for each combination. Setting it on the command line would simplify automated runs on a matrix of models and settings.
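As a rough sketch of what that could look like, using the jsonargparse CLI pattern the lit-gpt scripts follow for their setup functions (the setup() body and defaults below are placeholders, not the actual finetuning entry point):

```python
# Rough sketch: promote micro_batch_size (and friends) from module-level
# constants to CLI arguments via jsonargparse; the setup() body is a
# placeholder, not the real finetuning code.
from typing import Optional

from jsonargparse import CLI


def setup(
    micro_batch_size: int = 4,
    precision: str = "bf16-true",
    quantize: Optional[str] = None,
) -> None:
    """Placeholder entry point; a real script would launch Fabric training here."""
    print(f"micro_batch_size={micro_batch_size} precision={precision} quantize={quantize}")


if __name__ == "__main__":
    CLI(setup)
```

With that in place, something like python finetune/lora.py --micro_batch_size 2 --quantize bnb.nf4 could be scripted over a grid of models and settings.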
I think there was a bug in the
Ok, here is a new batch of results. The 8-bit optimizer didn't make a huge difference, but things work well overall regarding inference performance etc. I think you can get bigger performance differences with bigger models. This should be good to merge, imho, unless there are any issues you find or suggestions you have.
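For anyone who wants to try the same swap, the 8-bit optimizer is essentially a drop-in replacement for torch.optim.AdamW; a minimal sketch (the tiny model and hyperparameters are illustrative only, not the values used in the runs above):

```python
# Minimal sketch of the bitsandbytes 8-bit optimizer mentioned above.
# The tiny model and hyperparameters are illustrative only.
import torch.nn as nn
import bitsandbytes as bnb

model = nn.Linear(4096, 4096)

# Drop-in replacement for torch.optim.AdamW that keeps optimizer state in
# 8 bits, which is where most of the savings come from for large models.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=3e-4, weight_decay=0.01)
```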
Merging 🚀
Awesome, exciting that this is finally merged! Thanks for getting this started and for all your help, @patrickhwood, @carmocca, and @Andrei-Aksionov!
Added a quantize command-line option. Allowed values are "bnb.nf4", "bnb.nf4-dq", "bnb.fp4", and "bnb.fp4-dq". The GPTQ int4 format only supports inference, not training.
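For readers wondering how these mode strings relate to bitsandbytes: nf4 vs. fp4 selects the 4-bit data type, and the -dq suffix enables double quantization of the quantization statistics. An illustrative sketch of the underlying layer (not this PR's actual wiring, which lit-gpt handles internally):

```python
# Illustrative only: what the "bnb.nf4" / "bnb.fp4" (+ "-dq") choices map to
# in bitsandbytes. This is not the PR's code.
import torch
import bitsandbytes as bnb

layer = bnb.nn.Linear4bit(
    4096,
    4096,
    bias=False,
    compute_dtype=torch.bfloat16,
    compress_statistics=True,  # True for the "-dq" (double quantization) variants
    quant_type="nf4",          # "nf4" or "fp4"
)
```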
Note that fabric.init_module needed empty_init=True for bnb.int8 training; otherwise, a RuntimeError: "normal_kernel_cuda" not implemented for 'Char' error is thrown. Only lora.py was extensively tested (see #242 (comment) for some results); adapter.py was only tested to the point of running about 1K iterations for each of the quantization types.
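A minimal sketch of that workaround, assuming a Fabric setup along the lines of the finetuning scripts (the tiny Linear stands in for the actual GPT model):

```python
# Sketch of the empty_init workaround for bnb.int8 training described above.
# The tiny Linear stands in for the GPT model built in lora.py / adapter.py.
import lightning as L
import torch.nn as nn

fabric = L.Fabric(accelerator="auto", devices=1, precision="bf16-true")
fabric.launch()

# empty_init=True skips the random weight initialization that otherwise raises
# RuntimeError: "normal_kernel_cuda" not implemented for 'Char' once the
# parameters are backed by int8 (Char) quantized tensors.
with fabric.init_module(empty_init=True):
    model = nn.Linear(4096, 4096)
```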
Closes #277
Closes #242
Closes #240
Closes #165
Closes #207
Closes #198
Fixes #176