# Quantize a PyTorch model with 🤗 Accelerate and `bitsandbytes`

## Running GPT2 on Google Colab

Welcome!

In this notebook, we will learn how to quantize a GPT2 model to 8-bit or 4-bit precision and run it for inference. We will use the GPT2 implementation from minGPT.

Check out this [notebook](https://colab.research.google.com/drive/1VoYNfYDKcKRQRor98Zbf2-9VQTtGJ24k?usp=sharing) if you want to learn how to fine-tune a quantized model with PEFT.


## Install dependencies

In [None]:
!git clone https://github.com/karpathy/minGPT.git
!pip install --quiet minGPT/
!pip install --quiet bitsandbytes huggingface_hub
!pip install --quiet git+https://github.com/huggingface/accelerate.git

Cloning into 'minGPT'...
remote: Enumerating objects: 489, done.
remote: Total 489 (delta 0), reused 0 (delta 0), pack-reused 489
Receiving objects: 100% (489/489), 1.44 MiB | 28.33 MiB/s, done.
Resolving deltas: 100% (260/260), done.
  Preparing metadata (setup.py) ... done
  Building wheel for minGPT (setup.py) ... done
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for accelerate (pyproject.toml) ... done
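
Since `accelerate` is installed from the main branch, the versions you get may differ from the ones used when this notebook was written. The snippet below is a small optional check, not part of the original notebook: it uses only the standard library to report which of the packages are installed. The helper name `installed_versions` is our own.

```python
from importlib import metadata


def installed_versions(packages):
    """Return a dict mapping each package name to its version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = None
    return versions


for name, version in installed_versions(
    ["torch", "accelerate", "bitsandbytes", "huggingface_hub"]
).items():
    print(f"{name}: {version or 'not installed'}")
```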


## Import libraries

In [None]:
import torch
from mingpt.model import GPT
from mingpt.bpe import BPETokenizer
from huggingface_hub import snapshot_download
from accelerate import init_empty_weights
from accelerate.utils import load_and_quantize_model, BnbQuantizationConfig
from accelerate import Accelerator


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...


  warn(msg)
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)


## Tokenizer and weights

In [None]:
# create tokenizer
prompt = "Hello my name is"
tokenizer = BPETokenizer()
x1 = tokenizer(prompt).to(0)

downloading https://openaipublic.blob.core.windows.net/gpt-2/models/124M/encoder.json to /root/.cache/mingpt/encoder.json
downloading https://openaipublic.blob.core.windows.net/gpt-2/models/124M/vocab.bpe to /root/.cache/mingpt/vocab.bpe


The weights of the GPT2 model are stored on the Hugging Face Hub.

In [None]:
# download weights from huggingface hub
weights_location = snapshot_download(repo_id="marcsun13/gpt2-xl-linear-sharded")


## Loading 8-bit model

We instantiate the model under the `init_empty_weights` context manager. This creates the model on the meta device, so no memory is allocated for its weights until they are actually loaded.

In [None]:
model_config = GPT.get_default_config()
model_config.model_type = 'gpt2-xl'
model_config.vocab_size = 50257
model_config.block_size = 1024

# load model on meta device
with init_empty_weights():
  empty_model = GPT(model_config)

number of parameters: 1557.61M
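
To get a feel for why quantization matters here, the sketch below estimates how much memory the weights alone would need at different precisions, using the parameter count printed above. It is back-of-the-envelope arithmetic only: it ignores activations, quantization constants, and the fact that some layers (embeddings, `lm_head`) are not quantized, so real usage will be higher. The function name `weight_memory_gib` is our own.

```python
N_PARAMS = 1557.61e6  # taken from the "number of parameters" line above

# Bits used to store one weight at each precision.
BITS_PER_WEIGHT = {
    "float32": 32,
    "float16 / bfloat16": 16,
    "int8": 8,
    "4-bit": 4,
}


def weight_memory_gib(n_params, bits_per_weight):
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 1024**3


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:>20}: {weight_memory_gib(N_PARAMS, bits):.2f} GiB")
```

Halving the number of bits per weight halves this estimate, which is what makes the 8-bit and 4-bit versions much easier to fit on a Colab GPU.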


In [None]:
print(empty_model)

GPT(
  (transformer): ModuleDict(
    (wte): Embedding(50257, 1600)
    (wpe): Embedding(1024, 1600)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-47): 48 x Block(
        (ln_1): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
        (attn): CausalSelfAttention(
          (c_attn): Linear(in_features=1600, out_features=4800, bias=True)
          (c_proj): Linear(in_features=1600, out_features=1600, bias=True)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
        (mlp): ModuleDict(
          (c_fc): Linear(in_features=1600, out_features=6400, bias=True)
          (c_proj): Linear(in_features=6400, out_features=1600, bias=True)
          (act): NewGELU()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head
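
The layer shapes in this printout are enough to reproduce the parameter count reported earlier. The sketch below redoes that arithmetic in plain Python. It assumes the reported figure covers only the `transformer` module and not `lm_head`, which is our reading of minGPT and is consistent with the result. The helper names are our own.

```python
def linear_params(in_features, out_features, bias=True):
    """Weight matrix plus optional bias vector."""
    return in_features * out_features + (out_features if bias else 0)


def layernorm_params(dim):
    """Elementwise affine LayerNorm: one weight and one bias per feature."""
    return 2 * dim


n_embd, n_layer, vocab_size, block_size = 1600, 48, 50257, 1024

# Linear layers inside one Block (the ones that quantization replaces).
linear_per_block = (
    linear_params(n_embd, 3 * n_embd)      # attn.c_attn
    + linear_params(n_embd, n_embd)        # attn.c_proj
    + linear_params(n_embd, 4 * n_embd)    # mlp.c_fc
    + linear_params(4 * n_embd, n_embd)    # mlp.c_proj
)
per_block = linear_per_block + 2 * layernorm_params(n_embd)  # ln_1 and ln_2

transformer_total = (
    vocab_size * n_embd            # wte
    + block_size * n_embd          # wpe
    + n_layer * per_block          # h
    + layernorm_params(n_embd)     # ln_f
)

print(f"transformer parameters: {transformer_total / 1e6:.2f}M")
print(f"share held in block Linear layers: {n_layer * linear_per_block / transformer_total:.1%}")
```

Almost all of the parameters sit in the `Linear` layers of the blocks, which is why replacing only those layers still shrinks the model substantially.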

We define the quantization configuration that we want and call `load_and_quantize_model` to load the weights and quantize our empty GPT2 model.

In [None]:
# get quantization config
config = BnbQuantizationConfig(load_in_8bit=True, llm_int8_threshold=6)
model_8bit = load_and_quantize_model(empty_model,
                                     bnb_quantization_config = config,
                                     weights_location = weights_location,
                                     device_map="auto")

As the output below shows, the `nn.Linear` layers have been replaced by `bnb.nn.Linear8bitLt` layers. `lm_head` is left unquantized to preserve numerical stability.

In [None]:
print(model_8bit)

GPT(
  (transformer): ModuleDict(
    (wte): Embedding(50257, 1600)
    (wpe): Embedding(1024, 1600)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-47): 48 x Block(
        (ln_1): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
        (attn): CausalSelfAttention(
          (c_attn): Linear8bitLt(in_features=1600, out_features=4800, bias=True)
          (c_proj): Linear8bitLt(in_features=1600, out_features=1600, bias=True)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
        (mlp): ModuleDict(
          (c_fc): Linear8bitLt(in_features=1600, out_features=6400, bias=True)
          (c_proj): Linear8bitLt(in_features=6400, out_features=1600, bias=True)
          (act): NewGELU()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1600,), eps=1e-05, elementwise_aff
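
To build intuition for what storing weights in 8 bits means, here is a minimal sketch of absmax quantization in plain Python. It is a simplified illustration, not the algorithm `bitsandbytes` implements in `Linear8bitLt`: in particular it does not model the outlier handling controlled by `llm_int8_threshold`, which, as we understand it, keeps unusually large activation values in higher precision. The function names and the example values are our own.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    absmax = max(abs(v) for v in values)
    if absmax == 0.0:
        return [0] * len(values), 1.0
    scale = absmax / 127.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.50, 0.03, 0.91, -0.27]  # made-up example row
quantized, scale = quantize_absmax_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 values:", quantized)
print("max round-trip error:", max_error)
```

The round-trip error is at most half of the scale, so a single large value in a row makes every other value in that row less precise. This is the kind of problem that outlier-aware schemes are designed to address.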

We put the model in evaluation mode and generate a continuation of the prompt.

In [None]:
model_8bit.eval()
outputs = model_8bit.generate(x1, max_new_tokens=10, do_sample=False)[0]
print(tokenizer.decode(outputs.cpu().squeeze()))

Hello my name is John Doe, and I am a member of the
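
With `do_sample=False`, generation is greedy: at each step the most likely next token is appended, so the output is deterministic. The sketch below illustrates that loop in plain Python. The scoring function `toy_scores` is a hypothetical stand-in for the model, and `greedy_generate` is our own name, not minGPT's implementation.

```python
def greedy_generate(next_token_scores, prompt_ids, max_new_tokens):
    """Repeatedly append the highest-scoring next token to the sequence."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = next_token_scores(ids)
        # argmax over the vocabulary: index of the largest score
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ids.append(next_id)
    return ids


def toy_scores(ids, vocab_size=5):
    """Hypothetical model: always prefers (last token + 1) modulo vocab_size."""
    target = (ids[-1] + 1) % vocab_size
    return [1.0 if token == target else 0.0 for token in range(vocab_size)]


print(greedy_generate(toy_scores, prompt_ids=[0], max_new_tokens=6))
# prints [0, 1, 2, 3, 4, 0, 1]
```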


## Saving 8-bit model

You can save the 8-bit model with the `save_model` method of `Accelerator`. You can then load your 8-bit model directly from these new weights.

In [None]:
# named `accelerator` to avoid shadowing the `accelerate` package name
accelerator = Accelerator()
new_weights_location = "gpt2-xl-linear-8-bit"
accelerator.save_model(model_8bit, new_weights_location)
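
Loading from the saved weights can reuse the same call as before, pointed at the new location. The sketch below wraps that in a small helper. The name `reload_8bit_model` is our own, the imports are deferred so the cell can be defined without a GPU, and it assumes you pass a freshly created empty model (built under `init_empty_weights`, as earlier in this notebook) rather than one that has already been quantized.

```python
def reload_8bit_model(empty_model, saved_weights_location):
    """Load previously saved 8-bit weights into a fresh empty model.

    Assumes `empty_model` was created under `init_empty_weights()` and that
    `saved_weights_location` is the folder written by `save_model`.
    """
    # Imported here so that defining the helper does not require bitsandbytes.
    from accelerate.utils import BnbQuantizationConfig, load_and_quantize_model

    # Same quantization settings as the ones used when the model was quantized.
    config = BnbQuantizationConfig(load_in_8bit=True, llm_int8_threshold=6)
    return load_and_quantize_model(
        empty_model,
        bnb_quantization_config=config,
        weights_location=saved_weights_location,
        device_map="auto",
    )
```

For example, `reload_8bit_model(fresh_empty_model, new_weights_location)` would return the reloaded model, where `fresh_empty_model` is a hypothetical name for a newly instantiated empty `GPT`.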

## Loading 4-bit model

In [None]:
model_config = GPT.get_default_config()
model_config.model_type = 'gpt2-xl'
model_config.vocab_size = 50257
model_config.block_size = 1024

with init_empty_weights():
  empty_model = GPT(model_config)

number of parameters: 1557.61M


In [None]:
# get quantization config
config = BnbQuantizationConfig(load_in_4bit=True,
                               bnb_4bit_compute_dtype=torch.bfloat16,
                               bnb_4bit_use_double_quant=True,
                               bnb_4bit_quant_type="nf4"
                               )

model_4bit = load_and_quantize_model(empty_model,
                                     bnb_quantization_config = config,
                                     weights_location = weights_location,
                                     device_map="auto")

As the output below shows, the `nn.Linear` layers have been replaced by `bnb.nn.Linear4bit` layers. `lm_head` is left unquantized to preserve numerical stability.

In [None]:
print(model_4bit)

GPT(
  (transformer): ModuleDict(
    (wte): Embedding(50257, 1600)
    (wpe): Embedding(1024, 1600)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-47): 48 x Block(
        (ln_1): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
        (attn): CausalSelfAttention(
          (c_attn): Linear4bit(in_features=1600, out_features=4800, bias=True)
          (c_proj): Linear4bit(in_features=1600, out_features=1600, bias=True)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((1600,), eps=1e-05, elementwise_affine=True)
        (mlp): ModuleDict(
          (c_fc): Linear4bit(in_features=1600, out_features=6400, bias=True)
          (c_proj): Linear4bit(in_features=6400, out_features=1600, bias=True)
          (act): NewGELU()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((1600,), eps=1e-05, elementwise_affine=True
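
The `bnb_4bit_use_double_quant=True` option used above reduces the memory spent on quantization constants. Block-wise quantization stores one scaling constant per block of weights, and double quantization quantizes those constants as well. The arithmetic below sketches the saving, assuming the block sizes described in the QLoRA paper (64 weights per block, 32-bit constants, then 8-bit constants grouped in blocks of 256). The actual `bitsandbytes` defaults may differ, and the function names are our own.

```python
def overhead_plain(block_size=64, constant_bits=32):
    """Extra bits per weight when each block stores one full-precision constant."""
    return constant_bits / block_size


def overhead_double_quant(block_size=64, second_block_size=256,
                          quantized_constant_bits=8, second_constant_bits=32):
    """Extra bits per weight when the constants are themselves quantized."""
    first_level = quantized_constant_bits / block_size
    second_level = second_constant_bits / (block_size * second_block_size)
    return first_level + second_level


plain = overhead_plain()
double = overhead_double_quant()

print(f"overhead without double quantization: {plain:.3f} bits per weight")
print(f"overhead with double quantization:    {double:.3f} bits per weight")
print(f"effective size with double quantization: {4 + double:.3f} bits per weight")
```

Under these assumptions the overhead drops from 0.5 to roughly 0.127 bits per weight, a small but measurable saving on a model with over a billion parameters.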

In [None]:
model_4bit.eval()
outputs = model_4bit.generate(x1, max_new_tokens=10, do_sample=False)[0]
print(tokenizer.decode(outputs.cpu().squeeze()))

Hello my name is John Doe, I am a student at the University
