<a href="https://colab.research.google.com/github/NID123-CH/LLM-Codes/blob/main/03_BitsAndBytes_Quantization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [12]:
!pip install bitsandbytes accelerate peft



# BitsAndBytes

Let's use BitsAndBytes to create a quantization config before we load our pretrained model. This way, the model is quantized while it is being loaded. We'll load it in 4 bits, using the `NF4` (4-bit NormalFloat) quantization type. NF4 builds a data type whose bins each have equal area under a standard normal distribution N(0, 1), with the resulting values normalized into the range [-1, 1].
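To make the "equal area" idea concrete, here is a minimal sketch that uses only the standard library. It is a simplified illustration, not the construction bitsandbytes uses: the helper name `equal_area_levels` and the tail offset `eps` are assumptions made for this example, and the real NF4 table is asymmetric and contains an exact zero.

```python
from statistics import NormalDist

def equal_area_levels(n_levels=16, eps=0.01):
    """Hypothetical helper: quantiles of N(0, 1) at evenly spaced
    probabilities, rescaled so that the largest magnitude is 1."""
    normal = NormalDist()  # standard normal, mean 0 and std 1
    # Evenly spaced probabilities; eps keeps us away from 0 and 1,
    # where the quantile would be infinite (eps is an arbitrary choice).
    step = (1 - 2 * eps) / (n_levels - 1)
    probs = [eps + i * step for i in range(n_levels)]
    quantiles = [normal.inv_cdf(p) for p in probs]
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]

levels = equal_area_levels()
print([round(v, 4) for v in levels])
```

Because the probabilities are evenly spaced, every pair of adjacent levels encloses the same probability mass, which is what "equal area" refers to.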

In [13]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

base_model_id = "EleutherAI/pythia-160m"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float32
)

# Loads base model without quantization
model = AutoModelForCausalLM.from_pretrained(base_model_id)

# Loads quantized model right away
q_model = AutoModelForCausalLM.from_pretrained(base_model_id, quantization_config=bnb_config)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


We can use the `get_memory_footprint()` method to get an estimated size of the model in bytes. Dividing by 1e6 gives megabytes.

In [14]:
print(model.get_memory_footprint()/1e6, q_model.get_memory_footprint()/1e6)

702.769584 250.721688


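We'll use the `model_size()` helper function to compare the models in more detail along the way. Note that it reports sizes in MiB (bytes divided by 1024**2), whereas the footprint above was divided by 1e6, so the two figures use different units.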

In [15]:
# https://discuss.pytorch.org/t/finding-model-size/130275/11

def model_size(model, include_buffer=True):
    param_size = 0
    parm_list = []
    for name, param in model.named_parameters():
        subtotal = param.nelement() * param.element_size()
        parm_list.append((name, subtotal, param.requires_grad))
        param_size += subtotal
    buffer_size = 0
    if include_buffer:
        for buffer in model.buffers():
            buffer_size += buffer.nelement() * buffer.element_size()

    size_all_mb = (param_size + buffer_size) / 1024**2
    return size_all_mb, parm_list

## Regular Model

Let's look at the regular model first:

In [16]:
model

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 768)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-11): 12 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXSdpaAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear(in_features=768, out_features=2304, bias=True)
          (dense): Linear(in_features=768, out_features=768, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=768, out_features=3072, bias=True)
          (dense_4h_to_h): Linear(in_features=3072, out_features=768, bias=True)
      

In [17]:
size1, parm_list1 = model_size(model)
size1

670.2133026123047

It is roughly 670 MiB in size: about 160M parameters at 4 bytes (32 bits) each, plus the model's buffers, which `model_size()` includes by default.
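The 670 figure is not directly comparable with the 702.77 printed by `get_memory_footprint()` earlier, because the two were divided by different constants. A quick check, using only numbers printed above (the helper name `to_mib` is made up for this example), suggests that both describe the same number of bytes:

```python
MIB = 1024**2  # model_size() divides the byte count by this
MB = 1e6       # the get_memory_footprint() output was divided by this

def to_mib(megabytes):
    """Convert a size in MB (1e6 bytes) to MiB (1024**2 bytes)."""
    return megabytes * MB / MIB

regular_mib = to_mib(702.769584)    # footprint of the regular model, in MB
quantized_mib = to_mib(250.721688)  # footprint of the quantized model, in MB

print(f"{regular_mib:.4f} MiB, {quantized_mib:.4f} MiB")
# prints: 670.2133 MiB, 239.1068 MiB
```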

Let's take a look at the weights of a linear layer in the attention mechanism:

In [18]:
lin = model.gpt_neox.layers[0].attention.query_key_value
lin.weight.shape, lin.weight

(torch.Size([2304, 768]),
 Parameter containing:
 tensor([[ 0.0295,  0.0594,  0.0711,  ...,  0.0077, -0.0146,  0.0082],
         [-0.0202, -0.0193, -0.0186,  ..., -0.0018,  0.0001, -0.0081],
         [ 0.0210,  0.0049,  0.0044,  ..., -0.0050, -0.0068,  0.0227],
         ...,
         [-0.0251, -0.0046, -0.0168,  ...,  0.0253, -0.0157, -0.0254],
         [ 0.0089, -0.0146, -0.0023,  ..., -0.0260,  0.0066, -0.0007],
         [-0.0109,  0.0204,  0.0143,  ..., -0.0116,  0.0042,  0.0450]],
        requires_grad=True))

## Quantized Model

Now, let's take a look at its quantized version:

In [19]:
q_model

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 768)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-11): 12 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXSdpaAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear4bit(in_features=768, out_features=2304, bias=True)
          (dense): Linear4bit(in_features=768, out_features=768, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear4bit(in_features=768, out_features=3072, bias=True)
          (dense_4h_to_h): Linear4bit(in_features=3072, out_features=768, b

Did you notice the difference? Linear layers are `Linear4bit` layers now!

How did it impact the size?

In [20]:
size2, parm_list2 = model_size(q_model)
size2

239.1068344116211

Nice! The model is roughly 1/3 of its original size thanks to quantization!

What if we peek at the same linear layer as before?

In [21]:
qlin = q_model.gpt_neox.layers[0].attention.query_key_value
qlin.weight.shape

torch.Size([884736, 1])
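The new shape follows from simple arithmetic: the original layer has 2304 × 768 weights, and storing each one as a 4-bit index lets two of them share a single `uint8` byte. The sketch below illustrates the idea in plain Python. The helper names and the nibble order (first value in the high bits) are assumptions made for illustration, not a description of bitsandbytes internals.

```python
def pack_nibbles(indices):
    """Pack pairs of 4-bit integers (0..15) into single bytes."""
    if len(indices) % 2:
        raise ValueError("expected an even number of 4-bit values")
    pairs = zip(indices[0::2], indices[1::2])
    return bytes((high << 4) | low for high, low in pairs)

def unpack_nibbles(packed):
    """Recover the 4-bit integers from the packed bytes."""
    indices = []
    for byte in packed:
        indices.extend((byte >> 4, byte & 0x0F))
    return indices

indices = [12, 14, 15, 5]
packed = pack_nibbles(indices)     # 4 values fit in 2 bytes
restored = unpack_nibbles(packed)  # same 4 values again

# Same arithmetic for the layer above: two weights per byte
n_packed_bytes = 2304 * 768 // 2
print(list(packed), restored, n_packed_bytes)
```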

In [22]:
qlin.weight.quant_state.__dict__

{'absmax': tensor([255, 242, 230,  ...,  15,  30,  27], device='cuda:0',
        dtype=torch.uint8),
 'shape': torch.Size([2304, 768]),
 'code': tensor([-1.0000, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911,  0.0000,
          0.0796,  0.1609,  0.2461,  0.3379,  0.4407,  0.5626,  0.7230,  1.0000],
        device='cuda:0'),
 'dtype': torch.float16,
 'blocksize': 64,
 'quant_type': 'nf4',
 'offset': tensor(0.0793, device='cuda:0'),
 'state2': <bitsandbytes.functional.QuantState at 0x7ee5ec957940>,
 'nested': True}

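Wow! That's quite something else! The weight is now a packed `uint8` tensor (two 4-bit values per byte, hence 884736 rows instead of 2304 × 768 entries), and `quant_state` holds what is needed to decode it: the 16-entry `code` lookup table, one `absmax` scale per block of `blocksize` = 64 weights and, since double quantization is on (`nested` is `True`), a second state (`state2`) describing how those scales were themselves quantized.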
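To see how a lookup table and a per-block scale can represent weights, here is a minimal pure-Python sketch of blockwise quantization. It reuses the `code` values printed above (rounded as shown), but everything else is an assumption made for illustration: the helper names are hypothetical, the block has 4 values instead of 64, and the scale is kept as a plain float rather than being quantized a second time.

```python
# The 16 levels printed in quant_state['code'] above (rounded to 4 decimals)
NF4_CODE = [-1.0000, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0000,
            0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0000]

def nearest_code_index(x):
    """Index of the level closest to x, where x is expected in [-1, 1]."""
    return min(range(len(NF4_CODE)), key=lambda i: abs(NF4_CODE[i] - x))

def quantize_block(values):
    """Scale the block by its largest magnitude, then store 4-bit indices."""
    absmax = max(abs(v) for v in values) or 1.0  # avoid dividing by zero
    return [nearest_code_index(v / absmax) for v in values], absmax

def dequantize_block(indices, absmax):
    """Look each index up in the table and undo the scaling."""
    return [NF4_CODE[i] * absmax for i in indices]

block = [0.0295, 0.0594, 0.0711, -0.0146]  # a few weights from the layer above
indices, absmax = quantize_block(block)
restored = dequantize_block(indices, absmax)
print(indices, absmax, [round(v, 4) for v in restored])
```

The restored values are close to, but not identical to, the originals: only the value with the largest magnitude in each block is reproduced exactly.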

### Comparison

In [23]:
import pandas as pd
comparison_df = pd.DataFrame(parm_list1, columns=['name', 'n_parms', 'grad']).merge(pd.DataFrame(parm_list2, columns=['name', 'q_n_parms', 'q_grad']))
comparison_df

| | name | n_parms | grad | q_n_parms | q_grad |
|---|---|---|---|---|---|
| 0 | gpt_neox.embed_in.weight | 154533888 | True | 77266944 | True |
| 1 | gpt_neox.layers.0.input_layernorm.weight | 3072 | True | 1536 | True |
| 2 | gpt_neox.layers.0.input_layernorm.bias | 3072 | True | 1536 | True |
| 3 | gpt_neox.layers.0.post_attention_layernorm.weight | 3072 | True | 1536 | True |
| 4 | gpt_neox.layers.0.post_attention_layernorm.bias | 3072 | True | 1536 | True |
| ... | ... | ... | ... | ... | ... |
| 143 | gpt_neox.layers.11.mlp.dense_4h_to_h.weight | 9437184 | True | 1179648 | False |
| 144 | gpt_neox.layers.11.mlp.dense_4h_to_h.bias | 3072 | True | 1536 | False |
| 145 | gpt_neox.final_layer_norm.weight | 3072 | True | 1536 | True |
| 146 | gpt_neox.final_layer_norm.bias | 3072 | True | 1536 | True |


In [24]:
comparison_df.query('not q_grad')

| | name | n_parms | grad | q_n_parms | q_grad |
|---|---|---|---|---|---|
| 5 | gpt_neox.layers.0.attention.query_key_value.we... | 7077888 | True | 884736 | False |
| 6 | gpt_neox.layers.0.attention.query_key_value.bias | 9216 | True | 4608 | False |
| 7 | gpt_neox.layers.0.attention.dense.weight | 2359296 | True | 294912 | False |
| 8 | gpt_neox.layers.0.attention.dense.bias | 3072 | True | 1536 | False |
| 9 | gpt_neox.layers.0.mlp.dense_h_to_4h.weight | 9437184 | True | 1179648 | False |
| ... | ... | ... | ... | ... | ... |
| 140 | gpt_neox.layers.11.attention.dense.bias | 3072 | True | 1536 | False |
| 141 | gpt_neox.layers.11.mlp.dense_h_to_4h.weight | 9437184 | True | 1179648 | False |
| 142 | gpt_neox.layers.11.mlp.dense_h_to_4h.bias | 12288 | True | 6144 | False |
| 143 | gpt_neox.layers.11.mlp.dense_4h_to_h.weight | 9437184 | True | 1179648 | False |


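Quantization turned off gradient computation for every quantized linear layer, in the attention blocks and the MLPs alike, including their biases. Note that, despite their names, the `n_parms` and `q_n_parms` columns hold sizes in bytes, as computed by `model_size()`.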
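The byte counts in the tables also tell us how each tensor is stored. As a small worked example (the rows are copied from the tables above; the variable names are made up here), dividing the original size by the new one gives the compression factor, and since the original uses 32 bits per value, the number of bits per value follows directly:

```python
# (name, bytes in the regular model, bytes in the quantized model)
rows = [
    ("gpt_neox.embed_in.weight", 154533888, 77266944),
    ("gpt_neox.layers.0.attention.query_key_value.bias", 9216, 4608),
    ("gpt_neox.layers.0.attention.dense.weight", 2359296, 294912),
    ("gpt_neox.layers.11.mlp.dense_4h_to_h.weight", 9437184, 1179648),
]

bits_per_value = {}
for name, n_bytes, q_n_bytes in rows:
    factor = n_bytes // q_n_bytes      # how many times smaller it got
    bits_per_value[name] = 32 // factor  # the original stores 32 bits per value
    print(f"{name}: {factor}x smaller, {bits_per_value[name]} bits per value")
```

Only the weights of the linear layers come out at 4 bits per value. The embeddings and the biases are merely halved, which is consistent with them being stored as float16.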

## Keeping Original Dtypes

Keeping some internal layers at a higher precision may help stabilize and improve training.

### Head

Let's start with the head (`mlp`):

In [25]:
head = model.gpt_neox.layers[-1].mlp
q_head = q_model.gpt_neox.layers[-1].mlp
head, q_head

(GPTNeoXMLP(
   (dense_h_to_4h): Linear(in_features=768, out_features=3072, bias=True)
   (dense_4h_to_h): Linear(in_features=3072, out_features=768, bias=True)
   (act): GELUActivation()
 ),
 GPTNeoXMLP(
   (dense_h_to_4h): Linear4bit(in_features=768, out_features=3072, bias=True)
   (dense_4h_to_h): Linear4bit(in_features=3072, out_features=768, bias=True)
   (act): GELUActivation()
 ))

These weights have been quantized as well, but "messing with the model's head" may not be such a good idea.

In [26]:
head.dense_4h_to_h.weight.dtype, q_head.dense_4h_to_h.weight.dtype

(torch.float32, torch.uint8)

Let's use the `llm_int8_skip_modules` argument to keep the `mlp` layers from being quantized. Don't be fooled by the argument's name: it *will* skip the listed modules even when we're quantizing to 4 bits. Note that the name matches the `mlp` block of every layer, not just the last one.

In [27]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
    llm_int8_skip_modules=["mlp"]
)

# Loads quantized model right away
q_model2 = AutoModelForCausalLM.from_pretrained(base_model_id, quantization_config=bnb_config)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Now, let's take a look at the model and its head:

In [28]:
q_model2

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 768)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-11): 12 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXSdpaAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear4bit(in_features=768, out_features=2304, bias=True)
          (dense): Linear4bit(in_features=768, out_features=768, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=768, out_features=3072, bias=True)
          (dense_4h_to_h): Linear(in_features=3072, out_features=768, bias=True

In [29]:
q_model2.gpt_neox.layers[-1].mlp

GPTNeoXMLP(
  (dense_h_to_4h): Linear(in_features=768, out_features=3072, bias=True)
  (dense_4h_to_h): Linear(in_features=3072, out_features=768, bias=True)
  (act): GELUActivation()
)

How big is it?

In [30]:
print(q_model2.get_memory_footprint()/1e6)

277.706136


### Layer Norm

Layer norm layers were converted to float16 to save space. However, this may negatively affect performance and stability. Ideally, we'd like to have them as bfloat16 but, should the hardware not support this type, it's probably better to keep them as float32.

In [31]:
q_model2.gpt_neox.final_layer_norm.weight.dtype

torch.float16

To keep them in float32, we can use the `torch_dtype` argument of the `from_pretrained()` method:

In [32]:
# Loads quantized model right away
q_model3 = AutoModelForCausalLM.from_pretrained(base_model_id,
                                                quantization_config=bnb_config,
                                                torch_dtype=torch.float32)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


In [33]:
q_model3

GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
    (embed_in): Embedding(50304, 768)
    (emb_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-11): 12 x GPTNeoXLayer(
        (input_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (post_attention_dropout): Dropout(p=0.0, inplace=False)
        (post_mlp_dropout): Dropout(p=0.0, inplace=False)
        (attention): GPTNeoXSdpaAttention(
          (rotary_emb): GPTNeoXRotaryEmbedding()
          (query_key_value): Linear4bit(in_features=768, out_features=2304, bias=True)
          (dense): Linear4bit(in_features=768, out_features=768, bias=True)
          (attention_dropout): Dropout(p=0.0, inplace=False)
        )
        (mlp): GPTNeoXMLP(
          (dense_h_to_4h): Linear(in_features=768, out_features=3072, bias=True)
          (dense_4h_to_h): Linear(in_features=3072, out_features=768, bias=True

In [34]:
q_model3.gpt_neox.final_layer_norm.weight.dtype

torch.float32

The side effect is that the model is much bigger now (although still smaller than the original), since every non-quantized tensor is stored in float32:

In [35]:
print(q_model3.get_memory_footprint()/1e6)

468.462


### Embeddings

The output embedding layer (`embed_out`), which maps hidden states to vocabulary logits, has also been quantized:

In [36]:
q_model3.embed_out.weight.dtype

torch.uint8

In [37]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
    llm_int8_skip_modules=["mlp", "embed_out"]
)

# Loads quantized model right away
q_model4 = AutoModelForCausalLM.from_pretrained(base_model_id, quantization_config=bnb_config, torch_dtype=torch.float16)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


In [38]:
q_model4.embed_out.weight.dtype

torch.float16

In [39]:
print(q_model4.get_memory_footprint()/1e6)

335.656344
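To keep track of the trade-offs so far, the footprints printed above can be put side by side as fractions of the regular model. This is only arithmetic on numbers already shown; the dictionary keys and the short descriptions in the comments are labels chosen for this example.

```python
original_mb = 702.769584  # regular model

footprints_mb = {
    "q_model": 250.721688,   # default 4-bit config
    "q_model2": 277.706136,  # "mlp" skipped
    "q_model3": 468.462,     # "mlp" skipped, torch_dtype=float32
    "q_model4": 335.656344,  # "mlp" and "embed_out" skipped, float16
}

fractions = {name: mb / original_mb for name, mb in footprints_mb.items()}
for name, fraction in fractions.items():
    print(f"{name}: {footprints_mb[name]:.1f} MB ({fraction:.0%} of the original)")
```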


### Manual Intervention

Although convenient, skipping modules and converting layer norm layers to float32 wholesale may offset most of the gains we had with the initial quantization.

It's possible to *manually* intervene and cast some types in order to try improving performance and stability without incurring such a high cost in memory space.

Beware that these manual interventions may cause your model to break, so be careful and double-check everything.

#### Head

Now we're casting *only* the output of the last MLP head to float32:

In [40]:
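class CastOutputToFloat(torch.nn.Sequential):
    # Runs the wrapped module(s) and upcasts the output to float32
    def forward(self, x):
        return super().forward(x).to(torch.float32)

# Wrap only the MLP of the last layer
q_model.gpt_neox.layers[-1].mlp = CastOutputToFloat(q_model.gpt_neox.layers[-1].mlp)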

In [41]:
q_model.gpt_neox.layers[-1].mlp

CastOutputToFloat(
  (0): GPTNeoXMLP(
    (dense_h_to_4h): Linear4bit(in_features=768, out_features=3072, bias=True)
    (dense_4h_to_h): Linear4bit(in_features=3072, out_features=768, bias=True)
    (act): GELUActivation()
  )
)

#### Layer Norm

The weights and biases of the **final layer norm only** are cast to float32 as well:

In [42]:
q_model.gpt_neox.final_layer_norm.weight.dtype

torch.float16

In [43]:
q_model.gpt_neox.final_layer_norm.weight.data = q_model.gpt_neox.final_layer_norm.weight.data.to(torch.float32)
q_model.gpt_neox.final_layer_norm.bias.data = q_model.gpt_neox.final_layer_norm.bias.data.to(torch.float32)

#### Embeddings

In this case, both manual intervention and skipping modules produce the same result.

In [44]:
q_model.embed_out.weight.data = q_model.embed_out.weight.data.to(torch.float32)

In [45]:
q_model.gpt_neox.final_layer_norm.weight.dtype

torch.float32

#### Size

These changes increase the model size once again, although it's less than half of its original size:

In [46]:
# model_size() reports MiB; the footprint divided by 1e6 is in MB
model_size(q_model)[0], q_model.get_memory_footprint()/1e6

(312.7972640991211, 327.991704)
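As a sanity check, the growth in footprint can be accounted for by hand. The sketch below assumes that `embed_out.weight` has shape 50304 × 768 with no bias and was stored as float16 before the cast (the model printouts above are truncated, so this is an assumption), and that the final layer norm went from float16 to float32 as shown earlier. The two footprints are the values printed for `q_model` in this notebook.

```python
VOCAB, HIDDEN = 50304, 768  # from Embedding(50304, 768) and LayerNorm((768,))
EXTRA_BYTES = 4 - 2         # float32 minus float16, per element

embed_out_extra = VOCAB * HIDDEN * EXTRA_BYTES  # assumed embed_out.weight shape
final_norm_extra = 2 * HIDDEN * EXTRA_BYTES     # weight and bias

before = 250_721_688  # footprint right after loading q_model, in bytes
after = 327_991_704   # footprint after the manual casts, in bytes

print(after - before, embed_out_extra + final_norm_extra)
# prints: 77270016 77270016
```

Under these assumptions, nearly all of the increase comes from the output embeddings. Casting the final layer norm costs only a few kilobytes, and wrapping the MLP in `CastOutputToFloat` adds no parameters at all.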

**IMPORTANT**: if you make these kinds of changes (casting and modifying dtypes), **make sure** your model is still capable of producing outputs without breaking:

In [47]:
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
q_model(**tokenizer('Testing the changes', return_tensors='pt'))



CausalLMOutputWithPast(loss={'logits': tensor([[[823.2774,  13.8725, 823.7270,  ...,  13.9288,  13.8787,  13.9429],
         [831.8468,  14.1120, 831.5814,  ...,  14.1692,  14.1238,  14.1848],
         [832.4774,  13.7687, 834.7003,  ...,  13.8197,  13.7655,  13.8283]]],
       grad_fn=<ToCopyBackward0>), 'past_key_values': ((tensor([[[[-2.8594,  3.1855, -2.2148,  ...,  1.9160, -1.3281, -1.2969],
          [-3.7478,  5.8221, -1.3330,  ...,  0.9546, -0.3628,  0.1792],
          [ 0.9579,  4.1571, -2.6934,  ...,  1.2578, -1.9883, -0.0756]],

         [[ 0.3335,  0.2463, -0.2556,  ..., -1.1885, -0.2783, -2.9297],
          [ 0.6544, -3.1055, -3.6083,  ...,  0.0500,  0.0132, -2.7090],
          [ 0.3434,  0.3906, -0.9336,  ..., -1.1025,  1.3184, -2.0527]],

         [[ 2.8984, -2.2090,  2.9844,  ...,  1.4609,  0.4304,  2.0273],
          [ 0.1505, -1.2421,  3.7420,  ...,  0.7734, -0.0315,  2.4121],
          [-1.8125, -3.1717,  3.0269,  ...,  0.8047, -1.2207,  3.5117]],

         ...,

   

## Dequantize

In [48]:
deq_model = q_model.dequantize()

The model is going to be dequantized in torch.float16 - if you want to upcast it to another dtype, make sure to pass the desired dtype when quantizing the model through `bnb_4bit_quant_type` argument of `BitsAndBytesConfig`
For some reason the model has not been properly dequantized. You might see unexpected behavior.


In [49]:
print(deq_model.get_memory_footprint()/1e6)

455.393688


In [50]:
q_model.hf_quantizer??

Object `q_model.hf_quantizer` not found.
