Add load_in_16bit Parameter and Fix 8-bit Quantization Config #2022
base: nightly
Conversation
- Add load_in_16bit parameter with default value of False
- Add validation to prevent conflicting loading options
- Add support for loading models in 16-bit precision (float16/bfloat16)
- Update error messages to include the new 16-bit option
Update condition to assign quantization_config to kwargs when either load_in_4bit or load_in_8bit is True
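A rough sketch of what these commits describe; the helper name, bnb_config, and kwargs are assumptions based on the commit messages, not the exact diff:

```python
# Hypothetical helper illustrating the described behaviour, not the exact diff.
def _check_loading_flags(load_in_4bit, load_in_8bit, load_in_16bit,
                         full_finetuning, bnb_config, kwargs):
    # Only one loading mode may be enabled at a time.
    if sum([load_in_4bit, load_in_8bit, load_in_16bit, full_finetuning]) > 1:
        raise ValueError(
            "Please enable only one of load_in_4bit, load_in_8bit, "
            "load_in_16bit or full_finetuning."
        )
    # Previously only load_in_4bit attached the BitsAndBytes config;
    # the fix covers load_in_8bit as well.
    if load_in_4bit or load_in_8bit:
        kwargs["quantization_config"] = bnb_config
    return kwargs
```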
Appreciate it! Actually I did notice that if load_in_4bit, load_in_8bit and full_finetuning are all False, it should do 16-bit LoRA, but instead it used 4-bit QLoRA! I added a fix for that yesterday! But I do like load_in_16bit for LoRA! There are some merge conflicts, but happy to add it.
Awesome, @danielhanchen! I just resolved the merge conflict, and the default is now 16-bit LoRA, as in the fix you added last week.
load_in_4bit = True,
load_in_8bit = False,
load_in_16bit = False,
Should we have an arg called load_dtype which would take the values 4bit, 8bit, 16bit instead of having these three args? Makes things cleaner and simpler, I guess?
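For illustration only, such a signature might look like this (a hypothetical sketch, not code from the PR):

```python
# Hypothetical alternative API, not code from the PR.
def from_pretrained(model_name, load_dtype="4bit", full_finetuning=False, **kwargs):
    if load_dtype not in ("4bit", "8bit", "16bit"):
        raise ValueError("load_dtype must be one of '4bit', '8bit' or '16bit'.")
    ...
```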
Hey @Datta0! Thanks a lot for reviewing the PR. I believe that load_in_4bit and load_in_8bit are good arguments because they match the transformers.BitsAndBytesConfig names and are accepted directly by auto_model.from_pretrained when you pass them as kwargs, so keeping those two arguments is consistent with the Transformers implementation.
With the latest fix by @danielhanchen changing the fallback precision from 4-bit to 16-bit, merging this pull request is no longer crucial: to train with 16-bit LoRA you can now simply set load_in_4bit and load_in_8bit to False (that was not possible when I made these changes, since the default was always 4-bit QLoRA, so the PR was essential back then). Still, load_in_16bit would add some extra verbosity, and it is an argument that users might try intuitively after seeing that two parameters already exist for 4-bit and 8-bit training; another benefit is that if the load_in_16bit argument is included in the sample notebooks, users will know right away that training with 16-bit precision is possible.
Furthermore, commit bf3ca8e might be important for training models with 8-bit precision, as currently we only pass the quantization_config keyword argument for 4-bit QLoRA, not for 8-bit.
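For reference, a minimal example of the naming overlap mentioned above (the model id is just a placeholder):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# The PR's flag names match the BitsAndBytesConfig arguments directly.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-model",           # placeholder model id
    quantization_config=bnb_config,  # the kwarg the 8-bit fix now passes through
)
```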
I believe that load_in_4bit and load_in_8bit are good arguments because they match transformers.BitsAndBytesConfig names and are accepted directly by auto_model.from_pretrained
Now that you put it that way, it makes sense.
LGTM
This pull request introduces a new parameter, load_in_16bit, across our model loading functions and fixes an issue with the 8-bit quantization configuration.
Current Issues Addressed:
• No 16-bit LoRA support: Currently there is no way to train a model with 16-bit precision using the FastModel class, because the code automatically falls back to QLoRA (4-bit) if none of load_in_4bit, load_in_8bit, or full_finetuning is set to True. This is a significant limitation for users who want 16-bit LoRA finetuning.
• 8-bit quantization config bug: The code only checked load_in_4bit when setting the quantization_config parameter, so 8-bit finetuning was not configured correctly even when load_in_8bit=True was specified.
Key Changes:
• Added a load_in_16bit parameter to FastBaseModel.from_pretrained, FastModel.from_pretrained, and FastLanguageModel.from_pretrained, with a default value of False.
• Fixed the quantization config logic to set kwargs["quantization_config"] = bnb_config when either load_in_4bit or load_in_8bit is True; before, only load_in_4bit was checked.
• Implemented a check for conflicting loading options (load_in_4bit, load_in_8bit, load_in_16bit, and full_finetuning) so that only one can be enabled at a time.
• Added code to remove load_in_16bit from kwargs before calling the Transformers library's from_pretrained, to avoid passing an invalid parameter to Transformers.
• Updated the fallback logic to consider the new load_in_16bit parameter before defaulting to QLoRA (see the sketch after this list).
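A sketch of the 16-bit path described in the last three bullets; the helper name _prepare_16bit_kwargs and the dtype selection are assumptions, not the exact code in the PR:

```python
import torch

# Hypothetical sketch of the 16-bit path; "kwargs" stands for the keyword
# arguments that will be forwarded to transformers' from_pretrained.
def _prepare_16bit_kwargs(kwargs):
    # Transformers' from_pretrained does not accept load_in_16bit, so strip it first.
    load_in_16bit = kwargs.pop("load_in_16bit", False)
    if load_in_16bit:
        # Load in half precision instead of silently falling back to 4-bit QLoRA.
        kwargs["torch_dtype"] = (
            torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
        )
    return kwargs
```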
Benefits:
• Enables explicit 16-bit LoRA finetuning without falling back to 4-bit quantization.
• Fixes 8-bit quantization configuration, ensuring proper setup when users select 8-bit training.
• Provides a clearer and more flexible API for users who wish to load models in different precision formats.