# `transformers` meets `bitsandbytes` for democratizing Large Language Models (LLMs) through 4bit quantization

<center>
<img src="https://github.com/huggingface/blog/blob/main/assets/96_hf_bitsandbytes_integration/Thumbnail_blue.png?raw=true" alt="drawing" width="700" class="center"/>
</center>

Welcome to this notebook, which walks through the recent `bitsandbytes` integration. It includes the work from the [QLoRA paper](https://arxiv.org/abs/2305.14314), which introduces 4bit quantization techniques with no performance degradation, with the goal of democratizing LLM inference and training.

In this notebook, we will learn together how to load a model in 4bit, understand all its variants and how to run them for inference.

In the [training notebook](https://colab.research.google.com/drive/1VoYNfYDKcKRQRor98Zbf2-9VQTtGJ24k?usp=sharing), you will learn how to fine-tune these 4bit models.

If you liked the previous work on integrating [*LLM.int8*](https://arxiv.org/abs/2208.07339), you can have a look at the [introduction blogpost](https://huggingface.co/blog/hf-bitsandbytes-integration) to learn more about that quantization method.

Note that this can be used with any model that supports `device_map` (i.e. loading the model with `accelerate`). This means the integration is totally agnostic to modalities: you can use it for `Blip2`, etc.

## Download requirements

First, install the dependencies below to get started. As these features are available only on the `main` branches of `transformers`, `peft` and `accelerate`, we install those libraries from source.

In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git

## Basic usage

As with 8bit models, you can load and convert a model in 4bit by just adding the argument `load_in_4bit=True`! As simple as that!
Let's first try loading a small model, starting with `facebook/opt-350m`.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-350m"

model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)


The model conversion technique is the same as the one presented in the [8 bit integration blogpost](https://huggingface.co/blog/hf-bitsandbytes-integration): it is based on module replacement. If you print the model, you will see that most of the `nn.Linear` layers are replaced by `bnb.nn.Linear4bit` layers!

In [None]:
print(model)

OPTForCausalLM(
  (model): OPTModel(
    (decoder): OPTDecoder(
      (embed_tokens): Embedding(50272, 512, padding_idx=1)
      (embed_positions): OPTLearnedPositionalEmbedding(2050, 1024)
      (project_out): Linear4bit(in_features=1024, out_features=512, bias=False)
      (project_in): Linear4bit(in_features=512, out_features=1024, bias=False)
      (layers): ModuleList(
        (0-23): 24 x OPTDecoderLayer(
          (self_attn): OPTAttention(
            (k_proj): Linear4bit(in_features=1024, out_features=1024, bias=True)
            (v_proj): Linear4bit(in_features=1024, out_features=1024, bias=True)
            (q_proj): Linear4bit(in_features=1024, out_features=1024, bias=True)
            (out_proj): Linear4bit(in_features=1024, out_features=1024, bias=True)
          )
          (activation_fn): ReLU()
          (self_attn_layer_norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear4bit(in_features=1024, out_features=4096, bias=True)
          (
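
To make the idea of module replacement concrete, here is a minimal, library-free sketch. The `Linear`, `Linear4bit` and `Block` classes below are toy stand-ins invented for illustration, not the real `torch` or `bitsandbytes` classes, and skipping `lm_head` is an assumption made to mirror the "most of the layers" behaviour described above. The actual conversion code in `transformers` may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Linear:
    """Toy stand-in for a full-precision linear layer."""
    in_features: int
    out_features: int


@dataclass
class Linear4bit:
    """Toy stand-in for a 4bit linear layer with the same shape."""
    in_features: int
    out_features: int


@dataclass
class Block:
    """Toy container holding named child modules."""
    children: dict = field(default_factory=dict)


def replace_linear_layers(module, skip=("lm_head",)):
    """Recursively swap every Linear child for a Linear4bit of the same shape."""
    # Iterate over a copy so that reassigning entries is always safe.
    for name, child in list(module.children.items()):
        if isinstance(child, Block):
            replace_linear_layers(child, skip)
        elif isinstance(child, Linear) and name not in skip:
            module.children[name] = Linear4bit(child.in_features, child.out_features)
    return module


toy_model = Block({
    "decoder": Block({"q_proj": Linear(1024, 1024), "fc1": Linear(1024, 4096)}),
    "lm_head": Linear(1024, 50272),  # hypothetical head kept in full precision
})
replace_linear_layers(toy_model)
print(toy_model)
```

The traversal only changes the layer type and leaves the tree structure untouched, which is why the printed model keeps the same layout with `Linear4bit` in place of `Linear`.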

Once loaded, run a prediction as you would with a classic model:

In [None]:
text = "Hello my name is"
device = "cuda:0"

inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Hello my name is jimmy and I am a new member of the reddit clan. I am a new member of


## Advanced usage

Let's review the advanced usage of the 4bit integration in this section. First, you need to understand the different arguments that can be tweaked.

All these parameters can be changed by creating a `BitsAndBytesConfig` from `transformers` and passing it to the `quantization_config` argument when calling `from_pretrained`.

Make sure to pass `load_in_4bit=True` when using the `BitsAndBytesConfig`!

### Changing the compute dtype

The compute dtype sets the dtype used during computation. For example, hidden states could be in `float32` while computation is done in `bf16` for speedups.
By default, the compute dtype is set to `float32`.


In [None]:
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

In [None]:
model_cd_bf16 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)

In [None]:
outputs = model_cd_bf16.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Hello my name is jimmy and I am a new member of the reddit clan. I am a new member of
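
To get a feel for what computing in `bf16` means numerically, the sketch below rounds Python floats to the nearest `bfloat16` value using only the standard library. It relies on the fact that `bfloat16` keeps the sign, the 8 exponent bits and the top 7 mantissa bits of a `float32`. The helper name `round_to_bfloat16` is ours, and this is only an illustration of the precision trade-off, not how `bitsandbytes` performs its computation.

```python
import struct


def round_to_bfloat16(x: float) -> float:
    """Round a value to the nearest bfloat16 and return it as a Python float."""
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits: round to nearest, ties to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) & 0xFFFFFFFF) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for value in (1.0, 0.1, 3.14159265):
    print(value, "->", round_to_bfloat16(value))
```

Values keep the same dynamic range as `float32` but only two to three significant decimal digits, which is usually an acceptable trade for faster matrix multiplications.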


### Changing the quantization type

The 4bit integration comes with 2 different quantization types: FP4 and NF4. The NF4 dtype stands for Normal Float 4 and is introduced in the [QLoRA paper](https://arxiv.org/abs/2305.14314).

You can switch between these two dtypes using `bnb_4bit_quant_type` from `BitsAndBytesConfig`. By default, FP4 quantization is used.

In [None]:
from transformers import BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)

In [None]:
outputs = model_nf4.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Hello my name is John and I am a very happy man. I am a very happy man. I am a very
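
A 4bit data type can only represent 16 distinct values, so a quantization type is essentially a choice of 16-entry codebook. The sketch below is a simplified illustration in the spirit of NF4: it builds a codebook from evenly spaced quantiles of a standard normal distribution and quantizes one block of weights with absmax scaling. The codebook values here are an assumption made for illustration and are not the exact NF4 values from the paper, and all function names are ours.

```python
from statistics import NormalDist


def normal_quantile_codebook(bits=4):
    """Build 2**bits levels from normal quantiles, rescaled to [-1, 1]."""
    n_levels = 2 ** bits
    normal = NormalDist()
    # Offset by 0.5 so we never ask for the (infinite) 0 or 1 quantile.
    quantiles = [normal.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    scale = max(abs(q) for q in quantiles)
    return [q / scale for q in quantiles]


def quantize_block(weights, codebook):
    """Return one codebook index per weight plus the block's absmax constant."""
    absmax = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    indices = [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - w / absmax))
        for w in weights
    ]
    return indices, absmax


def dequantize_block(indices, absmax, codebook):
    """Map indices back to approximate weights."""
    return [codebook[i] * absmax for i in indices]


codebook = normal_quantile_codebook()
weights = [0.02, -0.31, 0.75, -1.20, 0.00, 0.44]
indices, absmax = quantize_block(weights, codebook)
restored = dequantize_block(indices, absmax, codebook)
print(indices)
print([round(w, 3) for w in restored])
```

Each weight is stored as a 4bit index, and only the per-block `absmax` constant is kept in higher precision. Because the levels are denser near zero, such a codebook suits weights that are roughly normally distributed.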


### Use nested quantization for more memory efficient inference and training

We also advise users to use the nested quantization technique. This saves more memory at no additional performance cost. From our empirical observations, this enables fine-tuning a llama-13b model on an NVIDIA T4 16GB with a sequence length of 1024, a batch size of 1 and 4 gradient accumulation steps.

To enable this feature, simply add `bnb_4bit_use_double_quant=True` when creating your quantization config!

In [None]:
from transformers import BitsAndBytesConfig

double_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

model_double_quant = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=double_quant_config)

In [None]:
outputs = model_double_quant.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Hello my name is jimmy and I am a new member of the reddit clan. I am a new member of
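
Where does the extra saving come from? Blockwise quantization stores one scaling constant per block of weights, and nested quantization quantizes those constants a second time. The back-of-the-envelope sketch below uses the block sizes described in the QLoRA paper (64 weights per block, 32-bit constants, then 8-bit constants grouped in blocks of 256). The function names and the 13B parameter count are ours, chosen for illustration, and the defaults used by `bitsandbytes` may differ.

```python
def overhead_bits_per_param(block_size=64, constant_bits=32):
    """Memory overhead of one full-precision constant per block of weights."""
    return constant_bits / block_size


def double_quant_overhead_bits_per_param(
    block_size=64,
    quantized_constant_bits=8,
    second_block_size=256,
    second_constant_bits=32,
):
    """Overhead when the per-block constants are themselves quantized."""
    first_level = quantized_constant_bits / block_size
    second_level = second_constant_bits / (block_size * second_block_size)
    return first_level + second_level


plain = overhead_bits_per_param()
nested = double_quant_overhead_bits_per_param()
saved_bits = plain - nested

n_params = 13e9  # illustrative 13B-parameter model
saved_gb = n_params * saved_bits / 8 / 1e9
print(f"overhead without nesting: {plain:.3f} bits/param")
print(f"overhead with nesting:    {nested:.3f} bits/param")
print(f"saving on a 13B model:    {saved_gb:.2f} GB")
```

The saving is roughly 0.37 bits per parameter, which is small per weight but adds up to several hundred megabytes on a 13B model.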


### Combining all the features together

Of course, the features are not mutually exclusive. You can combine these features together inside a single quantization config. Let us assume you want to run a model with `nf4` as the quantization type, with nested quantization and using `bfloat16` as the compute dtype:

In [None]:
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_4bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

In [None]:
outputs = model_4bit.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Hello my name is John and I am a very happy man. I am a very happy man. I am a very


## Pushing the limits of Google Colab

How far can we go using 4bit quantization? We'll see below that it is possible to load a 20B-scale model (40GB in half precision) entirely on the GPU using this quantization method! 🤯
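
As a rough sanity check on these numbers, the sketch below estimates the memory taken by the weights alone at different precisions. It assumes 1 GB = 10^9 bytes and ignores quantization constants, activations and framework overhead, so real usage will be somewhat higher. The helper name `weight_memory_gb` is ours.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to store the weights, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 20e9  # a 20B-scale model
for label, bits in (
    ("float32", 32),
    ("float16 / bfloat16", 16),
    ("8bit", 8),
    ("4bit", 4),
):
    print(f"{label:>20}: {weight_memory_gb(n_params, bits):5.1f} GB")
```

At 4 bits per parameter, the weights of a 20B model take about 10 GB, which is why they can fit on a single 16GB GPU.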

Let's load the model with the NF4 quantization type for better results, `bfloat16` as the compute dtype, and nested quantization for more memory-efficient model loading.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "EleutherAI/gpt-neox-20b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model_4bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")


Let's make sure we loaded the whole model on the GPU:

In [None]:
model_4bit.hf_device_map

{'': 0}
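
The device map associates module names with devices, and the empty string key `''` stands for the whole model. For larger maps, a small helper like the one below can check the placement programmatically. It assumes the usual `accelerate` convention where integers or `"cuda:N"` strings denote GPUs, while `"cpu"` and `"disk"` denote offloaded modules. The helper name `fully_on_gpu` is ours.

```python
def fully_on_gpu(device_map: dict) -> bool:
    """Return True if every entry of the device map points to a GPU."""

    def is_gpu(device) -> bool:
        # Integer entries are GPU indices; strings such as "cuda:0" are GPUs too.
        if isinstance(device, int):
            return True
        return str(device).startswith("cuda")

    return all(is_gpu(device) for device in device_map.values())


print(fully_on_gpu({"": 0}))
print(fully_on_gpu({"model.decoder": 0, "lm_head": "cpu"}))
```

With the real model you would call it as `fully_on_gpu(model_4bit.hf_device_map)`.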

Once loaded, run a generation!

In [None]:
text = "Hello my name is"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model_4bit.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


Hello my name is john and i am a student at the university of phoenix. i am a member of


As you can see, we were able to load and run the 4bit GPT-NeoX model entirely on the GPU!