Add load_in_16bit Parameter and Fix 8-bit Quantization Config #2022
This pull request introduces a new parameter, `load_in_16bit`, across our model loading functions and fixes an issue with the 8-bit quantization configuration.

Current Issues Addressed:
• No 16-bit LoRA support: There is currently no way to train a model in 16-bit precision with the FastModel class, because the code automatically falls back to QLoRA (4-bit) unless one of `load_in_4bit`, `load_in_8bit`, or `full_finetuning` is set to `True`. This is a significant limitation for users who want 16-bit LoRA finetuning (see the usage sketch after this list).
• 8-bit quantization config bug: The code only checked `load_in_4bit` when setting the `quantization_config` parameter, so 8-bit finetuning was not configured correctly even when `load_in_8bit=True` was specified.
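For illustration, a minimal usage sketch of 16-bit LoRA loading with the new parameter; the model name, sequence length, and LoRA settings below are placeholders, and the exact set of keyword arguments may differ from the final implementation:

```python
from unsloth import FastLanguageModel

# Load the base weights in 16-bit precision instead of silently
# falling back to QLoRA (4-bit). The model name is a placeholder.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name    = "unsloth/llama-3-8b",
    max_seq_length = 2048,
    load_in_4bit  = False,
    load_in_8bit  = False,
    load_in_16bit = True,   # new parameter introduced by this PR
)

# Attach LoRA adapters on top of the 16-bit base model (standard LoRA, not QLoRA).
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],
)
```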
Key Changes:
• Added a `load_in_16bit` parameter to `FastBaseModel.from_pretrained`, `FastModel.from_pretrained`, and `FastLanguageModel.from_pretrained`, with a default value of `False`.
• Fixed the quantization config logic to set `kwargs["quantization_config"] = bnb_config` when either `load_in_4bit` or `load_in_8bit` is `True`; previously only `load_in_4bit` was checked (see the sketch after this list).
• Implemented a check for conflicting loading options (`load_in_4bit`, `load_in_8bit`, `load_in_16bit`, and `full_finetuning`) so that only one can be enabled at a time.
• Removed `load_in_16bit` from `kwargs` before calling the Transformers `from_pretrained`, so an invalid parameter is not passed through to Transformers.
• Updated the fallback logic to consider the new `load_in_16bit` parameter before defaulting to QLoRA.
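A minimal sketch of the selection logic described above, assuming a hypothetical helper `resolve_load_mode`; the `bnb_config` name and the 4-bit/8-bit check follow the PR description, while the surrounding structure and the specific `BitsAndBytesConfig` settings are illustrative only:

```python
from transformers import BitsAndBytesConfig

def resolve_load_mode(load_in_4bit, load_in_8bit, load_in_16bit,
                      full_finetuning, dtype, kwargs):
    # Only one loading option may be enabled at a time.
    flags = [load_in_4bit, load_in_8bit, load_in_16bit, full_finetuning]
    if sum(bool(f) for f in flags) > 1:
        raise ValueError(
            "Only one of load_in_4bit, load_in_8bit, load_in_16bit, "
            "full_finetuning can be True."
        )

    # Fallback: if nothing was selected, default to QLoRA (4-bit) as before.
    # load_in_16bit now counts as an explicit choice, so it prevents this fallback.
    if not any(flags):
        load_in_4bit = True

    # Build the bitsandbytes config for either 4-bit or 8-bit loading.
    bnb_config = None
    if load_in_4bit:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit              = True,
            bnb_4bit_compute_dtype    = dtype,
            bnb_4bit_quant_type       = "nf4",
            bnb_4bit_use_double_quant = True,
        )
    elif load_in_8bit:
        bnb_config = BitsAndBytesConfig(load_in_8bit = True)

    # Fix: pass the quantization config for BOTH 4-bit and 8-bit,
    # not just 4-bit as before.
    if load_in_4bit or load_in_8bit:
        kwargs["quantization_config"] = bnb_config

    # load_in_16bit is not a valid argument for transformers' from_pretrained,
    # so remove it before forwarding kwargs.
    kwargs.pop("load_in_16bit", None)
    return kwargs
```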
Benefits:
• Enables explicit 16-bit LoRA finetuning without falling back to 4-bit quantization.
• Fixes 8-bit quantization configuration, ensuring proper setup when users select 8-bit training.
• Provides a clearer and more flexible API for users who wish to load models in different precision formats.