
Add load_in_16bit Parameter and Fix 8-bit Quantization Config #2022


Open

marcelodiaz558 wants to merge 5 commits into base: nightly

Conversation

marcelodiaz558

This pull request introduces a new parameter, load_in_16bit, across our model loading functions and fixes an issue with 8-bit quantization configuration.

Current Issues Addressed:

  1. No 16-bit LoRA Support: Currently, there is no way to train a model with 16-bit precision using the FastModel class because the code automatically falls back to QLoRA (4-bit) if none of the following arguments are set to True: load_in_4bit, load_in_8bit, or full_finetuning. This creates a significant limitation for users who want to use 16-bit LoRA finetuning.

  2. 8-bit Quantization Config Bug: The code checked only load_in_4bit when setting the quantization_config parameter, so proper 8-bit finetuning was never configured even when load_in_8bit=True was specified (a minimal sketch of the pre-fix condition follows this list).
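For illustration, a minimal sketch of the pre-fix condition described in item 2; the names kwargs and bnb_config are taken from the change list below, and the example values are only stand-ins:

```python
# Pre-fix behaviour (simplified fragment of the loader): the bitsandbytes
# config was attached only for 4-bit loading, so load_in_8bit=True ended up
# training without any quantization_config.
load_in_4bit, load_in_8bit = False, True   # example: user asks for 8-bit
bnb_config, kwargs = object(), {}          # stand-ins for the real objects
if load_in_4bit:                           # <-- load_in_8bit is never checked
    kwargs["quantization_config"] = bnb_config
```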

Key Changes:

• Added load_in_16bit parameter to FastBaseModel.from_pretrained, FastModel.from_pretrained, and FastLanguageModel.from_pretrained with a default value of False.

• Fixed the quantization config logic so that kwargs["quantization_config"] = bnb_config is set when either load_in_4bit or load_in_8bit is True. Previously, only load_in_4bit was checked.

• Implemented logic to check for conflicting loading options (load_in_4bit, load_in_8bit, load_in_16bit, and full_finetuning) so that only one can be enabled at a time.

• Added code to remove load_in_16bit from kwargs before calling the Transformers library's from_pretrained to avoid passing an invalid parameter to Transformers.

• Updated the fallback logic to consider the new load_in_16bit parameter before defaulting to QLoRA. A condensed sketch of how these changes fit together is shown after this list.
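A condensed sketch of how these changes fit together; _prepare_loading_kwargs is a hypothetical helper name, and the real unsloth control flow is more involved:

```python
from transformers import BitsAndBytesConfig

def _prepare_loading_kwargs(load_in_4bit, load_in_8bit, load_in_16bit,
                            full_finetuning, **kwargs):
    # Only one loading mode may be enabled at a time.
    if sum([load_in_4bit, load_in_8bit, load_in_16bit, full_finetuning]) > 1:
        raise ValueError(
            "Only one of load_in_4bit, load_in_8bit, load_in_16bit, or "
            "full_finetuning can be True at a time."
        )

    # Attach the bitsandbytes config for BOTH 4-bit and 8-bit loading
    # (previously this happened only for load_in_4bit).
    if load_in_4bit or load_in_8bit:
        kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit = load_in_4bit,
            load_in_8bit = load_in_8bit,
        )

    # load_in_16bit is not a transformers argument, so make sure it is not
    # forwarded to the underlying from_pretrained call.
    kwargs.pop("load_in_16bit", None)
    return kwargs
```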

Benefits:

• Enables explicit 16-bit LoRA finetuning without falling back to 4-bit quantization.

• Fixes 8-bit quantization configuration, ensuring proper setup when users select 8-bit training.

• Provides a clearer and more flexible API for users who wish to load models in different precision formats (a usage sketch follows this list).
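For illustration, a hedged usage sketch assuming the signature this PR adds to FastLanguageModel.from_pretrained; the model id is only a placeholder:

```python
from unsloth import FastLanguageModel

# Request 16-bit LoRA explicitly; exactly one of the load_in_* /
# full_finetuning options may be True.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name    = "unsloth/llama-3-8b",  # placeholder model id
    load_in_4bit  = False,
    load_in_8bit  = False,
    load_in_16bit = True,
)
```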

- Add load_in_16bit parameter with default value of False
- Add validation to prevent conflicting loading options
- Add support for loading models in 16-bit precision (float16/bfloat16)
- Update error messages to include the new 16-bit option
Update condition to assign quantization_config to kwargs when either load_in_4bit or load_in_8bit is True
@danielhanchen changed the base branch from main to nightly on March 18, 2025, 04:47
@danielhanchen
Contributor

Appreciate it! I actually did notice that if load_in_4bit, load_in_8bit, and full_finetuning are all False, it should do 16-bit LoRA, but instead it used 4-bit QLoRA! I added a fix for it yesterday!

But I do like load_in_16bit for LoRA! There are some merge conflicts, but happy to add load_in_16bit!

@marcelodiaz558
Author

> Appreciate it! I actually did notice that if load_in_4bit, load_in_8bit, and full_finetuning are all False, it should do 16-bit LoRA, but instead it used 4-bit QLoRA! I added a fix for it yesterday!
>
> But I do like load_in_16bit for LoRA! There are some merge conflicts, but happy to add load_in_16bit!

Awesome, @danielhanchen! I just resolved the merge conflict, and the default is now 16-bit LoRA, as in the fix you added last week.

Comment on lines 81 to +83
load_in_4bit = True,
load_in_8bit = False,
load_in_16bit = False,
Collaborator


Should we have an arg called load_dtype that takes the values 4bit, 8bit, or 16bit instead of having these three args? Makes things cleaner and simpler, I guess?
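For comparison, a rough sketch of what that single-argument API might look like; load_dtype and its accepted values are hypothetical and come only from this suggestion (the thread below ends up keeping the boolean flags):

```python
# Hypothetical alternative: a single argument instead of three boolean flags.
def from_pretrained(model_name, load_dtype = "4bit", **kwargs):
    if load_dtype not in ("4bit", "8bit", "16bit"):
        raise ValueError("load_dtype must be '4bit', '8bit', or '16bit'")
    # Map back onto the existing flags internally.
    kwargs["load_in_4bit"]  = load_dtype == "4bit"
    kwargs["load_in_8bit"]  = load_dtype == "8bit"
    kwargs["load_in_16bit"] = load_dtype == "16bit"
    return kwargs  # real model loading would continue from here
```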

Author


Hey @Datta0! Thanks a lot for reviewing the PR. I believe that load_in_4bit and load_in_8bit are good arguments because they match transformers.BitsAndBytesConfig names and are accepted directly by auto_model.from_pretrained when you pass them as kwargs, so keeping those two arguments is consistent with the Transformers implementation.

With the latest fix by @danielhanchen changing the fallback precision from 4-bit to 16-bit, merging this pull request is no longer crucial: to train with 16-bit LoRA you can now simply set load_in_4bit and load_in_8bit to False (that was not possible when I made these changes, since the default was always 4-bit QLoRA, so this PR was essential back then). load_in_16bit does add some verbosity, but it is an argument users might try intuitively after seeing that parameters already exist for 4-bit and 8-bit training; another benefit is that if load_in_16bit appears in the sample notebooks, users will know right away that training with 16-bit precision is possible.

Furthermore, commit bf3ca8e may be important for training models with 8-bit precision, since currently we only pass the quantization_config keyword argument for 4-bit QLoRA, not for 8-bit.
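As a concrete illustration of what the 8-bit fix enables, a sketch using the plain transformers API (the model id is only an example, and the actual unsloth wiring differs):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# With the fix, an 8-bit load actually carries a quantization_config,
# which is how transformers expects 8-bit quantization to be configured.
# Requires the bitsandbytes package to be installed.
bnb_config = BitsAndBytesConfig(load_in_8bit = True)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",               # example model id
    quantization_config = bnb_config,
    device_map = "auto",
)
```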

Collaborator


> I believe that load_in_4bit and load_in_8bit are good arguments because they match transformers.BitsAndBytesConfig names and are accepted directly by auto_model.from_pretrained

Now that you put it that way, it makes sense.

Collaborator

@Datta0 left a comment


LGTM
