Initial FSDP Support for QLoRA Finetuning #970

Merged

Conversation

warner-benjamin
Contributor

This PR adds initial FSDP support for training QLoRA models. It enables basic FSDP and CPU offload support; low-memory training via FSDP's sync_module_states option remains unsupported.

This PR builds off of #840 commit 8278fca and BNB FSDP by @TimDettmers and @Titus-von-Koeller.

An example of using this PR to finetune QLoRA models with FSDP can be found in our demo script: fsdp_qlora.

Rationale

The primary blocker for FSDP QLoRA finetuning is the uint8 quantized storage type: FSDP can only shard floating-point data types. Additionally, when using CPU offloading, every time FSDP moves a Linear4Bit from CPU to GPU it quantizes the existing data, even if that data is already quantized.

Changes Made

Selectable Quantization Storage

This PR adds a selectable quantization storage option quant_storage to Linear4Bit and Params4Bit. The quantization storage dtype defaults to torch.uint8 for backward compatibility with existing code.

While selecting any floating-point storage type will allow FSDP to shard Linear4Bit layers, setting the quantization storage dtype to match the rest of the non-LoRA layers' dtype allows Linear4Bit layers to be wrapped identically to Linear layers in a LoRA wrapping policy, such as the fsdp_auto_wrap_policy from llama-recipes.

If the quantization storage dtype does not match the rest of the layers' dtype, then the Linear4Bit layers will have to be wrapped individually.
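
For illustration, here is a minimal sketch of selecting a matching storage dtype. The quant_storage keyword follows this PR; the layer sizes and other arguments are placeholders.

```python
import torch
import bitsandbytes as bnb

# Sketch: store the packed 4-bit weights in bf16 so FSDP can shard this layer
# like any other bf16 Linear layer. Sizes and compute dtype are illustrative.
layer = bnb.nn.Linear4bit(
    4096, 4096,
    bias=False,
    compute_dtype=torch.bfloat16,   # dtype used for the dequantized matmul
    quant_type="nf4",
    quant_storage=torch.bfloat16,   # new option from this PR; defaults to torch.uint8
)
```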

Prevent Multiple Quantization

The PR adds a quantization flag to prevent Params4Bit from re-quantizing already-quantized params when transferring from CPU to GPU, for example when training with FSDP CPU offloading.
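
As a rough, simplified sketch of the idea (the bnb_quantized flag name is referenced later in this thread; the helper names here are illustrative, not the PR's exact API):

```python
import torch

def move_params4bit(param, device):
    """Illustrative helper (hypothetical name): move a Params4bit-like parameter,
    quantizing only on the first transfer to a CUDA device."""
    device = torch.device(device)
    if device.type == "cuda" and not getattr(param, "bnb_quantized", False):
        # First move to GPU: quantize the floating-point weights once and mark
        # them so FSDP CPU-offload round trips skip re-quantization.
        param = param.quantize_to(device)  # stand-in for the real quantization call
        param.bnb_quantized = True
        return param
    # Already quantized: a plain device transfer preserves the packed data.
    return param.to(device)
```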

Set Quant State during FSDP Forward

FSDP does not copy the Params4Bit QuantState dictionary when moving sharded layers. This PR stores the QuantState as a component of Linear4Bit and copies it back to Params4Bit if it no longer exists there.
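
Roughly, the restore step at the top of the layer's forward pass looks like the fragment below (a simplified sketch; attribute names follow the PR description, and the matmul call is the usual 4-bit path):

```python
# Fragment of a Linear4bit-style forward method (simplified sketch).
def forward(self, x):
    # After FSDP gathers the sharded weight, the Params4bit may come back
    # without its quant_state; restore it from the copy kept on the module.
    if getattr(self.weight, "quant_state", None) is None and getattr(self, "quant_state", None) is not None:
        self.weight.quant_state = self.quant_state
    # Proceed with the usual 4-bit matmul using the (restored) quant state.
    return bnb.matmul_4bit(x, self.weight.t(), bias=self.bias, quant_state=self.weight.quant_state)
```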

Testing

This PR adds quant_storage testing to the Linear4Bit tests and fixes an issue with the current tests where NF4 wasn't tested.

We also tested these changes against PEFT's QLoRA tests and did not find any regressions from the current bitsandbytes behavior.

We have also tested FSDP Mixed Precision in fp32 and bf16 and noticed no changes in training behavior when setting the Linear4Bit and Params4Bit quant_storage dtype to match the FSDP MixedPrecision param_dtype.
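
For reference, a sketch of pairing the storage dtype with FSDP mixed precision (model and lora_wrap_policy are placeholders; see the demo script for a complete setup):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

# Sketch: bf16 mixed precision paired with Linear4bit layers created with
# quant_storage=torch.bfloat16, so quantized and regular layers shard alike.
# `model` and `lora_wrap_policy` stand in for your model and a LoRA-aware
# auto-wrap policy (e.g. llama-recipes' fsdp_auto_wrap_policy).
bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)
fsdp_model = FSDP(
    model,
    auto_wrap_policy=lora_wrap_policy,
    mixed_precision=bf16_policy,
)
```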

We have successfully finetuned Llama-2, Mistral, and TinyLlama models with FSDP & QLoRA using our demo script.

Downstream Implications

Existing implementations may require some modification to work with this, for example:

  • Model loading via Transformers load_in_4bit will need a way to set quant_storage (the demo script uses custom model loading)
  • PEFT's prepare_model_for_kbit_training upcasts all non-uint8 params to float32 under the assumption that the base (quantized) weights are stored in uint8, which is now no longer guaranteed
  • PEFT's get_nb_trainable_parameters multiplies the number of parameters from Params4Bit by two, which is only valid if quant_storage is uint8

Future Work

QLoRA finetuning with FSDP's low-memory loading via the sync_module_states option does not currently work. Enabling it will require a future PR.

@Titus-von-Koeller
Collaborator

Thank you all for your fine work and for describing it so thoroughly. We're very happy with the collaboration. After our talks over the last few days and an initial review of the code, the next step is to merge onto main, as that triggers our daily pre-release CI pipeline on the HF side, which runs all the HF integration tests and makes sure nothing breaks in Transformers, PEFT + Accelerate. (This is our workaround for not having our own GPU runners, so we use a pipeline on the HF side, which doesn't yet include the BnB tests themselves.)

Have you run the BnB test suite itself? Everything looking good there so far? We have a few flaky tests that we still need to make more reproducible, but this is still important information. It would be good to paste the output here for review.

The procedure now is that we'll do a preliminary merge, get back to you tomorrow with the integration test results, and then speak about the next steps in our video call, including any potential improvements that we still might want to add, as well as Transformers integration, etc.

Thanks again for your contribution and good work! Really happy to move forward with this.

@Titus-von-Koeller Titus-von-Koeller merged commit dcfb6f8 into TimDettmers:main Jan 17, 2024
@KeremTurgutlu
Contributor

KeremTurgutlu commented Jan 17, 2024

Have you run the BnB test suite itself? Everything looking good there so far? We have a few flaky tests that we still need to make more reproducible, but this is still important information. It would be good to paste the output here for review.

I ran the test suite with pytest tests/*.py -k "not test_triton" | tee $HOME/bnb_tests.log; sharing the stdout/err logs here.

summary: 2 failed, 5337 passed, 35 skipped, 2 deselected, 1410 warnings in 358.02s (0:05:58)

tests/test_functional.py::test_gemv_4bit[fp16-bf16-fc2-nf4-DQ_True] FAILED [ 51%]
tests/test_functional.py::test_gemv_4bit[uint8-bf16-fc2-nf4-DQ_False] FAILED [ 50%]

Note: I previously tested test_functional.py and test_modules.py separately and they passed after a few retries, so these failures can be regarded as flaky.

bnb_tests.log

python -m bitsandbytes
/home/paperspace/workdir/git/bitsandbytes/bitsandbytes/cuda_setup/main.py:108: UserWarning: 

================================================================================
WARNING: Manual override via BNB_CUDA_VERSION env variable detected!
BNB_CUDA_VERSION=XXX can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_cuda_dir/lib64
Loading CUDA version: BNB_CUDA_VERSION=123
================================================================================


  warn((f'\n\n{"="*80}\n'
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++ BUG REPORT INFORMATION ++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++ ANACONDA CUDA PATHS ++++++++++++++++++++
/home/paperspace/miniconda3/pkgs/icu-73.1-h6a678d5_0/lib/libicudata.so
/home/paperspace/miniconda3/pkgs/pytorch-2.1.2-py3.11_cuda12.1_cudnn8.9.2_0/lib/python3.11/site-packages/torch/lib/libtorch_cuda_linalg.so
/home/paperspace/miniconda3/pkgs/pytorch-2.1.2-py3.11_cuda12.1_cudnn8.9.2_0/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
/home/paperspace/miniconda3/pkgs/pytorch-2.1.2-py3.11_cuda12.1_cudnn8.9.2_0/lib/python3.11/site-packages/torch/lib/libc10_cuda.so
/home/paperspace/miniconda3/lib/libicudata.so
/home/paperspace/miniconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda_linalg.so
/home/paperspace/miniconda3/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so
/home/paperspace/miniconda3/lib/python3.11/site-packages/torch/lib/libc10_cuda.so

++++++++++++++++++ /usr/local CUDA PATHS +++++++++++++++++++
/usr/local/cuda-12.3/targets/x86_64-linux/lib/libcudart.so
/usr/local/cuda-12.3/targets/x86_64-linux/lib/stubs/libcuda.so

+++++++++++++++ WORKING DIRECTORY CUDA PATHS +++++++++++++++
/home/paperspace/workdir/git/bitsandbytes/bitsandbytes/libbitsandbytes_cuda123.so

++++++++++++++++++ LD_LIBRARY CUDA PATHS +++++++++++++++++++
++++ /home/paperspace/local/cuda-12.3/lib64 CUDA PATHS +++++


++++++++++++++++++++++++++ OTHER +++++++++++++++++++++++++++
COMPILED_WITH_CUDA = True
COMPUTE_CAPABILITIES_PER_GPU = ['8.0']
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++ DEBUG INFO END ++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Running a quick check that:
    + library is importable
    + CUDA function is callable


WARNING: Please be sure to sanitize sensible info from any such env vars!

SUCCESS!
Installation was successful!

@warner-benjamin
Contributor Author

Further testing shows no issues using FSDP Mixed Precision with autocast when setting Linear4Bit.quant_storage=torch.float32 to match the FSDP Mixed Precision param_dtype: MixedPrecision(param_dtype=torch.float32, reduce_dtype=torch.float32, buffer_dtype=torch.float32)

@Titus-von-Koeller
Collaborator

Results of the single_gpu_huggingface/peft-gpu-bnb-source:latest PEFT scheduled tests.
core_single_gpu.log: 2 failed tests
+----+--------------------------+--------------------+----------------------------------+
|    | Test Location            | Test Case          | Test Name                        |
+====+==========================+====================+==================================+
|  0 | tests/test_common_gpu.py | PeftGPUCommonTests | test_4bit_merge_and_disable_lora |
+----+--------------------------+--------------------+----------------------------------+
|  1 | tests/test_common_gpu.py | PeftGPUCommonTests | test_4bit_merge_lora             |
+----+--------------------------+--------------------+----------------------------------+

Results of the single_gpu_huggingface/peft-gpu-bnb-latest:latest PEFT scheduled tests.
core_single_gpu.log: 2 failed tests
+----+--------------------------+--------------------+----------------------------------+
|    | Test Location            | Test Case          | Test Name                        |
+====+==========================+====================+==================================+
|  0 | tests/test_common_gpu.py | PeftGPUCommonTests | test_4bit_merge_and_disable_lora |
+----+--------------------------+--------------------+----------------------------------+
|  1 | tests/test_common_gpu.py | PeftGPUCommonTests | test_4bit_merge_lora             |
+----+--------------------------+--------------------+----------------------------------+

Both times (HF libs from source vs pip, with BNB always from source) the same tests are failing in the same way:

core_single_gpu.log

This is the same failing test case - but with a different error! - that we got after merging @jph00's commit, which we then had to revert. See the error log of core_single_gpu-Jeremy's_fix.log for reference; I think this might hint that both are related to the prevention of "Params4Bit from quantizing already quantized params when transferring from CPU to GPU".

I'll try to look more into this during the day, but just wanted to share the necessary info with everyone asap.

cc @Sourab @younesbelkada @TimDettmers for visibility / collab in fixing this

@Titus-von-Koeller
Collaborator

x-posting @pacman100's message on Discord here for visibility:

sourab_mangrulkar — Today at 1:38 PM

Hello everyone! 😄

Thank you for adding me to this collaboration and I am looking forward to the QLoRA+FSDP support which would be huge!

I have gone through the following:

  • the bitsandbytes implementation of the following classes/functions: Linear4bit, Params4bit, quantize_4bit, matmul_4bit, MatMul4Bit, and dequantize_4bit
  • PR #970 and the demo script https://github.com/AnswerDotAI/fsdp_qlora

With regard to the CI failures, I have the following hypothesis:
In the merge method, the LoRA delta weights are added to the dequantized base layer weights, and the weight of the base layer is then set to a new Params4bit instance holding the merged weights:

w_data = bnb.functional.dequantize_4bit(weight.data, weight.quant_state) + lora_data
self.get_base_layer().weight = bnb.nn.Params4bit(w_data.to("cpu"), requires_grad=False, **kwargs).to(weight.device)

Now, kwargs copies the existing parameter's attributes, which include bnb_quantized=True and the old quant_state, so the new merged weights are never re-quantized and the stale quant state remains. After this, during the Linear4bit forward call, the old quant state is used even though the weights are unquantized.
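
One plausible shape of the fix implied by this hypothesis, purely as an illustration (the actual change is whatever huggingface/peft#1370 does; existing_kwargs is a placeholder for the copied attributes):

```python
# Illustrative only: strip the stale quantization metadata from the copied
# kwargs so the merged weights get re-quantized with a fresh quant_state.
kwargs = dict(existing_kwargs)        # attributes copied from the old Params4bit
kwargs.pop("bnb_quantized", None)     # allow re-quantization on .to(weight.device)
kwargs.pop("quant_state", None)       # drop the stale quant state
self.get_base_layer().weight = bnb.nn.Params4bit(
    w_data.to("cpu"), requires_grad=False, **kwargs
).to(weight.device)
```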

Trying out PR huggingface/peft#1370 to see if it gets fixed.
I can confirm the above PR fixes the 2 failing tests, yay!

@weifengpy

Amazing! Will take a look from the FSDP side to see what can be improved.
