# Quantize a PyTorch model with Accelerate and `bitsandbytes`

In this notebook, we will learn how to quantize a GPT2 model in 8-bit or 4-bit and run it for inference. We will use the GPT2 model from minGPT.

## Install dependencies

## Import libraries

In [None]:
import torch
from mingpt.model import GPT
from mingpt.bpe import BPETokenizer
from huggingface_hub import snapshot_download
from accelerate import init_empty_weights
from accelerate.utils import load_and_quantize_model, BnbQuantizationConfig
from accelerate import Accelerator

## Quantization
* `load_in_4bit`: this flag enables 4-bit quantization by replacing the `Linear` layers with FP4/NF4 layers from bitsandbytes.

## Tokenizer and weights

In [None]:
# create tokenizer
prompt = "Hello my name is"
tokenizer = BPETokenizer()
x1 = tokenizer(prompt).to(0)

I've stored the weights of the GPT2 model on the Hugging Face Hub.

In [None]:
# download weights from huggingface hub
weights_location = snapshot_download(repo_id="marcsun13/gpt2-xl-linear-sharded")

## Loading 8-bit model

We instantiate the model under the `init_empty_weights` context manager so that it is created on the meta device, without allocating memory for its weights.

In [None]:
model_config = GPT.get_default_config()
model_config.model_type = 'gpt2-xl'
model_config.vocab_size = 50257
model_config.block_size = 1024

# load model on meta device
with init_empty_weights():
  empty_model = GPT(model_config)

number of parameters: 1557.61M
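To get a feel for why quantization matters for a model of this size, the parameter count printed above can be turned into a rough weight-memory estimate. This is a back-of-the-envelope sketch: the helper `weight_memory_gb` is a hypothetical name, and the estimate assumes every parameter is stored at the given width, ignoring layers that stay unquantized (such as the embeddings and `lm_head`) as well as the quantization constants.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Parameter count reported by minGPT for gpt2-xl.
num_params = 1557.61e6

for name, bits in [("float32", 32), ("float16/bfloat16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>18}: {weight_memory_gb(num_params, bits):.2f} GB")
```

Halving the bit width halves the estimate, so the 4-bit weights need roughly one eighth of the float32 storage under these assumptions.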


In [None]:
print(empty_model)

GPT(
  (transformer): ModuleDict(
    (wte): Embedding(50257, 1600)
    (wpe): Embedding(1024, 1600)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-47): 48 x Block(
        (ln_1): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
        (attn): CausalSelfAttention(
          (c_attn): Linear(in_features=1600, out_features=4800, bias=True)
          (c_proj): Linear(in_features=1600, out_features=1600, bias=True)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
        (mlp): ModuleDict(
          (c_fc): Linear(in_features=1600, out_features=6400, bias=True)
          (c_proj): Linear(in_features=6400, out_features=1600, bias=True)
          (act): NewGELU()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head

We define the quantization configuration that we want and call `load_and_quantize_model` to load the weights and quantize our empty GPT2 model.

In [None]:
# get quantization config
config = BnbQuantizationConfig(load_in_8bit=True, llm_int8_threshold=6)
model_8bit = load_and_quantize_model(empty_model,
                                     bnb_quantization_config = config,
                                     weights_location = weights_location,
                                     device_map="auto")
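
The `llm_int8_threshold` argument controls which activations are treated as outliers: feature dimensions containing a value whose magnitude exceeds the threshold are kept in higher precision, while the rest go through an int8 matrix multiplication. The following is a simplified pure-Python sketch of that idea, not the bitsandbytes kernels; the helper names (`absmax_quantize`, `matmul_exact`, `matmul_int8_mixed`) and the toy data are assumptions made for illustration.

```python
def absmax_quantize(values):
    """Quantize floats to integer codes in [-127, 127] with one shared scale."""
    absmax = max((abs(v) for v in values), default=0.0)
    scale = absmax / 127 if absmax > 0 else 1.0
    return [round(v / scale) for v in values], scale


def matmul_exact(x, w):
    """Reference full-precision product of x (n x k) and w (k x m)."""
    return [[sum(row[j] * w[j][c] for j in range(len(w))) for c in range(len(w[0]))]
            for row in x]


def matmul_int8_mixed(x, w, threshold=6.0):
    """Mixed-precision product: int8 codes for regular features, floats for outliers."""
    k, m = len(w), len(w[0])
    # A feature dimension is an outlier if any activation in it exceeds the threshold.
    outliers = [j for j in range(k) if any(abs(row[j]) > threshold for row in x)]
    regular = [j for j in range(k) if j not in outliers]
    # Quantize each weight column once, using only the regular dimensions.
    w_quant = [absmax_quantize([w[j][c] for j in regular]) for c in range(m)]
    result = []
    for row in x:
        x_codes, x_scale = absmax_quantize([row[j] for j in regular])
        out_row = []
        for c in range(m):
            w_codes, w_scale = w_quant[c]
            int_acc = sum(a * b for a, b in zip(x_codes, w_codes))  # integer accumulation
            value = int_acc * x_scale * w_scale                     # dequantize the result
            value += sum(row[j] * w[j][c] for j in outliers)        # full-precision outlier part
            out_row.append(value)
        result.append(out_row)
    return result


x = [[0.5, -1.0, 20.0], [1.5, 0.25, -8.0]]  # the third feature holds outliers
w = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]
approx = matmul_int8_mixed(x, w, threshold=6.0)
exact = matmul_exact(x, w)
```

Keeping the large-magnitude feature out of the int8 path means it does not inflate the absmax scale of each row, so the small values keep most of their resolution.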

As you can see, the `nn.Linear` layers are replaced by `bnb.nn.Linear8bitLt` layers. `lm_head` was not replaced in order to ensure stability.

In [None]:
print(model_8bit)

GPT(
  (transformer): ModuleDict(
    (wte): Embedding(50257, 1600)
    (wpe): Embedding(1024, 1600)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-47): 48 x Block(
        (ln_1): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
        (attn): CausalSelfAttention(
          (c_attn): Linear8bitLt(in_features=1600, out_features=4800, bias=True)
          (c_proj): Linear8bitLt(in_features=1600, out_features=1600, bias=True)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
        (mlp): ModuleDict(
          (c_fc): Linear8bitLt(in_features=1600, out_features=6400, bias=True)
          (c_proj): Linear8bitLt(in_features=6400, out_features=1600, bias=True)
          (act): NewGELU()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1600,), eps=1e-05, elementwise_aff

We put the model in evaluation mode and generate 10 new tokens with greedy decoding (`do_sample=False`).

In [None]:
model_8bit.eval()
outputs = model_8bit.generate(x1, max_new_tokens=10, do_sample=False)[0]
print(tokenizer.decode(outputs.cpu().squeeze()))

Hello my name is John Doe, and I am a member of the


## Saving 8-bit model

You can save the 8-bit model with the `save_model` method of `Accelerator`. You can then use these new weights to load your 8-bit model.

In [None]:
# use a name that does not shadow the `accelerate` package
accelerator = Accelerator()
new_weights_location = "gpt2-xl-linear-8-bit"
accelerator.save_model(model_8bit, new_weights_location)

## Loading 4-bit model

In [None]:
model_config = GPT.get_default_config()
model_config.model_type = 'gpt2-xl'
model_config.vocab_size = 50257
model_config.block_size = 1024

with init_empty_weights():
  empty_model = GPT(model_config)

number of parameters: 1557.61M


In [None]:
# get quantization config
config = BnbQuantizationConfig(load_in_4bit=True,
                               bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the computation
                               bnb_4bit_use_double_quant=True,  # also quantize the quantization constants
                               bnb_4bit_quant_type="nf4"  # data type of the bnb.nn.Linear4bit layers:
                               # either "fp4" or "nf4"
                               )

model_4bit = load_and_quantize_model(empty_model,
                                     bnb_quantization_config = config,
                                     weights_location = weights_location,
                                     device_map="auto")
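
The configuration above stores the weights in 4 bits but computes in `bfloat16`, which means the stored codes are dequantized before they are used. The sketch below illustrates that storage scheme in plain Python: values in a block are mapped to 16 levels scaled by the block's absolute maximum, two codes are packed per byte, and the block is expanded back to floats before any arithmetic. The uniform 16-level grid is a stand-in chosen for simplicity, not the FP4 or NF4 codebooks that bitsandbytes uses, and all function names are hypothetical.

```python
def quantize_block_4bit(block):
    """Map floats to 4-bit codes (0..15) on a uniform grid over [-absmax, absmax]."""
    absmax = max((abs(v) for v in block), default=0.0) or 1.0
    codes = [round((v / absmax + 1) * 7.5) for v in block]
    return codes, absmax


def pack_nibbles(codes):
    """Pack two 4-bit codes into each byte (padding with a zero code if needed)."""
    if len(codes) % 2:
        codes = codes + [0]
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))


def unpack_nibbles(packed, count):
    """Recover the first `count` 4-bit codes from the packed bytes."""
    codes = []
    for byte in packed:
        codes.extend((byte >> 4, byte & 0x0F))
    return codes[:count]


def dequantize_block_4bit(codes, absmax):
    """Expand 4-bit codes back to floats before computing with them."""
    return [(c / 7.5 - 1) * absmax for c in codes]


weights = [0.12, -0.5, 0.33, 0.9, -0.75, 0.01, 0.48, -0.2]
codes, absmax = quantize_block_4bit(weights)
packed = pack_nibbles(codes)  # 8 weights stored in 4 bytes
restored = dequantize_block_4bit(unpack_nibbles(packed, len(weights)), absmax)
```

With a uniform grid the reconstruction error of each value is at most half a step, that is `absmax / 15`; non-uniform codebooks such as NF4 place the levels differently.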

As you can see, the `nn.Linear` layers are replaced by `bnb.nn.Linear4bit` layers. `lm_head` was not replaced in order to ensure stability.



In [None]:
print(model_4bit)

GPT(
  (transformer): ModuleDict(
    (wte): Embedding(50257, 1600)
    (wpe): Embedding(1024, 1600)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-47): 48 x Block(
        (ln_1): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
        (attn): CausalSelfAttention(
          (c_attn): Linear4bit(in_features=1600, out_features=4800, bias=True)
          (c_proj): Linear4bit(in_features=1600, out_features=1600, bias=True)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
        (mlp): ModuleDict(
          (c_fc): Linear4bit(in_features=1600, out_features=6400, bias=True)
          (c_proj): Linear4bit(in_features=6400, out_features=1600, bias=True)
          (act): NewGELU()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1600,), eps=1e-05, elementwise_affine=True

In [None]:
model_4bit.eval()
outputs = model_4bit.generate(x1, max_new_tokens=10, do_sample=False)[0]
print(tokenizer.decode(outputs.cpu().squeeze()))

Hello my name is John Doe, I am a student at the University


## Bitsandbytes
* `bnb_4bit_quant_type` selects the 4-bit data type: NF4 (normalized float 4) or pure FP4.
* `bnb_4bit_compute_dtype`: although 4-bit bitsandbytes stores the weights in 4 bits, the computation still happens in 16-bit or 32-bit precision.
* `bnb_4bit_use_double_quant` enables a second quantization after the first one, which saves an additional 0.4 bits per parameter.
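
The figure of roughly 0.4 bits per parameter can be reproduced with a short calculation. The sketch below assumes the setup described in the QLoRA paper: one 32-bit quantization constant per block of 64 weights, and, with double quantization, constants stored in 8 bits plus one 32-bit second-level constant per 256 first-level constants. These block sizes are assumptions taken from that paper, not values read from this notebook, and `constant_overhead_bits` is a hypothetical helper.

```python
def constant_overhead_bits(block_size=64, double_quant=False, second_block_size=256):
    """Bits per parameter spent on quantization constants."""
    if not double_quant:
        # One 32-bit constant shared by `block_size` weights.
        return 32 / block_size
    # 8-bit constants, plus one 32-bit constant per `second_block_size` of them.
    return 8 / block_size + 32 / (block_size * second_block_size)


plain = constant_overhead_bits()
nested = constant_overhead_bits(double_quant=True)
saved = plain - nested

print(f"without double quantization: {plain:.3f} bits per parameter")
print(f"with double quantization:    {nested:.3f} bits per parameter")
print(f"saved:                       {saved:.3f} bits per parameter")
```

Under these assumptions the saving is about 0.37 bits per parameter, which rounds to the 0.4 quoted above.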

In [None]:
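from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Example checkpoint, chosen here only for illustration (an assumption, not
# part of the original notebook); replace it with the model you want to load.
model_id = "facebook/opt-350m"

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4", # use NF4 for higher precision
   bnb_4bit_use_double_quant=True, # use double quant if you have problems with memory
   bnb_4bit_compute_dtype=torch.bfloat16 # use a 16-bit dtype for faster finetuning
)

model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)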
