Initial FSDP Support for QLoRA Finetuning #970

Merged

Conversation

warner-benjamin
Contributor

This PR adds initial FSDP support for training QLoRA models. It enables basic FSDP and CPU offload support; low-memory training via FSDP's sync_module_states option remains unsupported.

This PR builds off of #840 commit 8278fca and BNB FSDP by @TimDettmers and @Titus-von-Koeller.

An example of using this PR to finetune QLoRA models with FSDP can be found in our demo script: fsdp_qlora.

Rationale

The primary blocker for FSDP QLoRA finetuning is the uint8 quantized storage type: FSDP can only shard floating-point data types. Additionally, when using CPU offloading, every time FSDP moves a Linear4Bit from CPU to GPU it quantizes the existing data, even if that data is already quantized.

Changes Made

Selectable Quantization Storage

This PR adds a selectable quantization storage option quant_storage to Linear4Bit and Params4Bit. The quantization storage dtype defaults to torch.uint8 for backward compatibility with existing code.

While selecting any floating-point storage type will allow FSDP to shard Linear4Bit layers, setting the quantization storage dtype to match the rest of the non-LoRA layers' dtype allows Linear4Bit layers to be wrapped identically to Linear layers in a LoRA wrapping policy, such as the fsdp_auto_wrap_policy from llama-recipes.

If the quantization storage dtype does not match the rest of the layers' dtype, then the Linear4Bit layers will have to be wrapped individually.
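
For illustration, here is a minimal sketch of selecting a matching storage dtype. The quant_storage keyword follows this PR; the layer sizes and other arguments are placeholders.

```python
import torch
import bitsandbytes as bnb

# Sketch: store the packed 4-bit weights in bf16 so FSDP can shard this layer
# like any other bf16 Linear layer. Sizes and compute dtype are illustrative.
layer = bnb.nn.Linear4bit(
    4096, 4096,
    bias=False,
    compute_dtype=torch.bfloat16,   # dtype used for the dequantized matmul
    quant_type="nf4",
    quant_storage=torch.bfloat16,   # new option from this PR; defaults to torch.uint8
)
```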

Prevent Multiple Quantization

The PR adds a quantization flag to prevent Params4Bit from re-quantizing already-quantized params when transferring from CPU to GPU, for example when training with FSDP CPU offloading.
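
As a rough, simplified sketch of the idea (the bnb_quantized flag name is referenced later in this thread; the helper names here are illustrative, not the PR's exact API):

```python
import torch

def move_params4bit(param, device):
    """Illustrative helper (hypothetical name): move a Params4bit-like parameter,
    quantizing only on the first transfer to a CUDA device."""
    device = torch.device(device)
    if device.type == "cuda" and not getattr(param, "bnb_quantized", False):
        # First move to GPU: quantize the floating-point weights once and mark
        # them so FSDP CPU-offload round trips skip re-quantization.
        param = param.quantize_to(device)  # stand-in for the real quantization call
        param.bnb_quantized = True
        return param
    # Already quantized: a plain device transfer preserves the packed data.
    return param.to(device)
```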

Set Quant State during FSDP Forward

FSDP does not copy the Params4Bit QuantState dictionary when moving sharded layers. This PR stores the QuantState as a component of Linear4Bit and copies it back to Params4Bit if it no longer exists there.
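
Roughly, the restore step at the top of the layer's forward pass looks like the fragment below (a simplified sketch; attribute names follow the PR description, and the matmul call is the usual 4-bit path):

```python
# Fragment of a Linear4bit-style forward method (simplified sketch).
def forward(self, x):
    # After FSDP gathers the sharded weight, the Params4bit may come back
    # without its quant_state; restore it from the copy kept on the module.
    if getattr(self.weight, "quant_state", None) is None and getattr(self, "quant_state", None) is not None:
        self.weight.quant_state = self.quant_state
    # Proceed with the usual 4-bit matmul using the (restored) quant state.
    return bnb.matmul_4bit(x, self.weight.t(), bias=self.bias, quant_state=self.weight.quant_state)
```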

Testing

This PR adds quant_storage testing to the Linear4Bit tests and fixes an issue with the current tests where NF4 wasn't tested.

We also tested these changes against PEFT's QLoRA tests and did not find any regressions from the current bitsandbytes behavior.

We have also tested FSDP Mixed Precision in fp32 and bf16 and noticed no changes in training behavior when setting the Linear4Bit and Params4Bit quant_storage dtype to match the FSDP MixedPrecision param_dtype.
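
For reference, a sketch of pairing the storage dtype with FSDP mixed precision (model and lora_wrap_policy are placeholders; see the demo script for a complete setup):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# Sketch: bf16 mixed precision paired with Linear4bit layers created with
# quant_storage=torch.bfloat16, so quantized and regular layers shard alike.
# `model` and `lora_wrap_policy` stand in for your model and a LoRA-aware
# auto-wrap policy (e.g. llama-recipes' fsdp_auto_wrap_policy).
bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)
fsdp_model = FSDP(
    model,
    auto_wrap_policy=lora_wrap_policy,
    mixed_precision=bf16_policy,
)
```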

We have successfully finetuned Llama-2, Mistral, and TinyLlama models with FSDP & QLoRA using our demo script.

Downstream Implications

Existing implementations may require some modification to work with this, for example:

  • Model loading via Transformers load_in_4bit will need a way to set quant_storage (the demo script uses custom model loading)
  • PEFT's prepare_model_for_kbit_training upcasts all non-uint8 params to float32 under the assumption that the base (quantized) weights are stored in uint8, which is now no longer guaranteed
  • PEFT's get_nb_trainable_parameters multiplies the number of parameters from Params4Bit by two, which is only valid if quant_storage is uint8

Future Work

QLoRA finetuning with FSDP's low-memory loading via the sync_module_states option does not currently work. Enabling it will require a future PR.

@Titus-von-Koeller
Collaborator

Thank you all for your fine work and for describing it so thoroughly. We're very happy with the collaboration. After our talks over the last few days and an initial review of the code, the next step is to merge onto main, as that triggers our daily pre-release CI pipeline on the HF side, which runs all the HF integration tests and makes sure nothing breaks in Transformers, PEFT + Accelerate. (This is our workaround for not having our own GPU runners, so we use a pipeline on the HF side, which doesn't yet include the BnB tests themselves.)

Have you run the BnB test suite itself? Everything looking good there so far? We have a few flaky tests that we still need to make more reproducible, but this is still important information. It would be good to paste the output here for review.

The procedure now is that we'll do a preliminary merge, get back to you tomorrow with the integration test results, and then speak about the next steps in our video call, including any potential improvements that we still might want to add, as well as Transformers integration, etc.

Thanks again for your contribution and good work! Really happy to move forward with this.

@Titus-von-Koeller Titus-von-Koeller merged commit dcfb6f8 into TimDettmers:main Jan 17, 2024
@KeremTurgutlu
Contributor

KeremTurgutlu commented Jan 17, 2024

Have you run the BnB test suite itself? Everything looking good there so far? We have a few flaky tests that we still need to make more reproducible, but this is still important information. It would be good to paste the output here for review.

I ran the test suite with pytest tests/*.py -k "not test_triton" | tee $HOME/bnb_tests.log; sharing the stdout/err logs here.

summary: 2 failed, 5337 passed, 35 skipped, 2 deselected, 1410 warnings in 358.02s (0:05:58)

tests/test_functional.py::test_gemv_4bit[fp16-bf16-fc2-nf4-DQ_True] FAILED [ 51%]
tests/test_functional.py::test_gemv_4bit[uint8-bf16-fc2-nf4-DQ_False] FAILED [ 50%]

Note: I previously tested test_functional.py and test_modules.py separately and they passed after a few retries, so these failures can be regarded as flaky.

bnb_tests.log

python -m bitsandbytes
/home/paperspace/workdir/git/bitsandbytes/bitsandbytes/cuda_setup/main.py:108: UserWarning: 

================================================================================
WARNING: Manual override via BNB_CUDA_VERSION env variable detected!
BNB_CUDA_VERSION=XXX can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_cuda_dir/lib64
Loading CUDA version: BNB_CUDA_VERSION=123
================================================================================


  warn((f'\n\n{"="*80}\n'
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++ BUG REPORT INFORMATION ++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++ ANACONDA CUDA PATHS ++++++++++++++++++++
/home/paperspace/miniconda3/pkgs/icu-73.1-h6a678d5_0/lib/libicudata.so
/home/paperspace/miniconda3/pkgs/pytorch-2.1.2-py3.11_cuda12.1_cudnn8.9.2_0/lib/python3.11/site-packages/torch/lib/libtorch_cuda_linalg.so
/home/paperspace/miniconda3/pkgs/pytorch-2.1.2-py3.11_cuda12.1_cudnn8.9.2_0/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
/home/paperspace/miniconda3/pkgs/pytorch-2.1.2-py3.11_cuda12.1_cudnn8.9.2_0/lib/python3.11/site-packages/torch/lib/libc10_cuda.so
/home/paperspace/miniconda3/lib/libicudata.so
/home/paperspace/miniconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda_linalg.so
/home/paperspace/miniconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
/home/paperspace/miniconda3/lib/python3.11/site-packages/torch/lib/libc10_cuda.so

++++++++++++++++++ /usr/local CUDA PATHS +++++++++++++++++++
/usr/local/cuda-12.3/targets/x86_64-linux/lib/libcudart.so
/usr/local/cuda-12.3/targets/x86_64-linux/lib/stubs/libcuda.so

+++++++++++++++ WORKING DIRECTORY CUDA PATHS +++++++++++++++
/home/paperspace/workdir/git/bitsandbytes/bitsandbytes/libbitsandbytes_cuda123.so

++++++++++++++++++ LD_LIBRARY CUDA PATHS +++++++++++++++++++
++++ /home/paperspace/local/cuda-12.3/lib64 CUDA PATHS +++++


++++++++++++++++++++++++++ OTHER +++++++++++++++++++++++++++
COMPILED_WITH_CUDA = True
COMPUTE_CAPABILITIES_PER_GPU = ['8.0']
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++ DEBUG INFO END ++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Running a quick check that:
    + library is importable
    + CUDA function is callable


WARNING: Please be sure to sanitize sensible info from any such env vars!

SUCCESS!
Installation was successful!

@warner-benjamin
Contributor Author

Further testing shows no issues using FSDP Mixed Precision with autocast when setting Linear4Bit.quant_storage=torch.float32 to match the FSDP Mixed Precision param_dtype: MixedPrecision(param_dtype=torch.float32, reduce_dtype=torch.float32, buffer_dtype=torch.float32)

@Titus-von-Koeller
Collaborator

Results of the single_gpu_huggingface/peft-gpu-bnb-source:latest PEFT scheduled tests.
core_single_gpu.log: 2 failed tests
+----+--------------------------+--------------------+----------------------------------+
|    | Test Location            | Test Case          | Test Name                        |
+====+==========================+====================+==================================+
|  0 | tests/test_common_gpu.py | PeftGPUCommonTests | test_4bit_merge_and_disable_lora |
+----+--------------------------+--------------------+----------------------------------+
|  1 | tests/test_common_gpu.py | PeftGPUCommonTests | test_4bit_merge_lora             |
+----+--------------------------+--------------------+----------------------------------+

Results of the single_gpu_huggingface/peft-gpu-bnb-latest:latest PEFT scheduled tests.
core_single_gpu.log: 2 failed tests
+----+--------------------------+--------------------+----------------------------------+
|    | Test Location            | Test Case          | Test Name                        |
+====+==========================+====================+==================================+
|  0 | tests/test_common_gpu.py | PeftGPUCommonTests | test_4bit_merge_and_disable_lora |
+----+--------------------------+--------------------+----------------------------------+
|  1 | tests/test_common_gpu.py | PeftGPUCommonTests | test_4bit_merge_lora             |
+----+--------------------------+--------------------+----------------------------------+

Both times (HF libs from source vs pip, with BNB always from source) the same tests are failing in the same way:

core_single_gpu.log

This is the same failing test case - but with a different error! - that we got after merging @jph00's commit, which we then had to revert. See the error log of core_single_gpu-Jeremy's_fix.log for reference; I think this might hint that both are related to the prevention of "Params4Bit from quantizing already quantized params when transferring from CPU to GPU".

I'll try to look more into this during the day, but just wanted to share the necessary info with everyone asap.

cc @Sourab @younesbelkada @TimDettmers for visibility / collab in fixing this

@Titus-von-Koeller
Collaborator

x-posting @pacman100's message on Discord here for visibility:

sourab_mangrulkar — Today at 1:38 PM

Hello everyone! 😄

Thank you for adding me to this collaboration and I am looking forward to the QLoRA+FSDP support which would be huge!

I have gone through the following:

  • the bitsandbytes implementation of the following classes/functions: Linear4bit, Params4bit, quantize_4bit, matmul_4bit, MatMul4Bit, and dequantize_4bit
  • PR #970 and the demo script https://github.com/AnswerDotAI/fsdp_qlora

With regard to the CI failures, I have the following hypothesis:
In the merge method, the LoRA delta weights are added to the dequantized base layer weights, and the weight of the base layer is then set to a new Params4bit instance holding the merged weights:

w_data = bnb.functional.dequantize_4bit(weight.data, weight.quant_state) + lora_data
self.get_base_layer().weight = bnb.nn.Params4bit(w_data.to("cpu"), requires_grad=False, **kwargs).to(weight.device)

Now, kwargs copies the existing parameter's attributes, which include bnb_quantized=True and the old quant_state, so the new merged weights are never re-quantized and the stale quant state remains. After this, during the Linear4bit forward call, the old quant state is used even though the weights are unquantized.
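
One plausible shape of the fix implied by this hypothesis, purely as an illustration (the actual change is whatever huggingface/peft#1370 does; existing_kwargs is a placeholder for the copied attributes):

```python
# Illustrative only: strip the stale quantization metadata from the copied
# kwargs so the merged weights get re-quantized with a fresh quant_state.
kwargs = dict(existing_kwargs)        # attributes copied from the old Params4bit
kwargs.pop("bnb_quantized", None)     # allow re-quantization on .to(weight.device)
kwargs.pop("quant_state", None)       # drop the stale quant state
self.get_base_layer().weight = bnb.nn.Params4bit(
    w_data.to("cpu"), requires_grad=False, **kwargs
).to(weight.device)
```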

Trying out PR huggingface/peft#1370 to see if it gets fixed.
I can confirm the above PR fixes the 2 failing tests, yay!

@weifengpy

Amazing! Will take a look from the FSDP side to see what can be improved.
