<a href="https://www.kaggle.com/code/aisuko/quantization-with-gptq?scriptVersionId=160693644" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

Advances in weight quantization allow us to run massive large language models on consumer hardware. This is possible thanks to 4-bit quantization techniques with minimal performance degradation, such as GPTQ, GGML, and NF4. In this notebook we explore the GPTQ algorithm.

GPTQ is a post-training quantization technique in which each row of the weight matrix is quantized independently to find a version of the weights that minimizes the error. See [Post-training Quantization methods](https://www.kaggle.com/code/aisuko/post-training-quantization-methods?scriptVersionId=159968307&cellId=19).

# Optimal Brain Quantizer (OBQ) framework

For every layer $\ell$ in the network, we want to find a quantized version $\hat{W}_{\ell}$ of the original weights $W_{\ell}$. This is called the **layer-wise compression problem**. More specifically, to minimize performance degradation, we want the outputs $(\hat{W}_{\ell}X_{\ell})$ of these new weights to be as close as possible to the original ones $(W_{\ell}X_{\ell})$. In other words, we want to find:

$$\arg\min_{\hat{W}_{\ell}} \lVert W_{\ell}X_{\ell}-\hat{W}_{\ell}X_{\ell} \rVert_{2}^{2}$$
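
To make the objective concrete, here is a minimal NumPy sketch that evaluates it for a naive round-to-nearest baseline. The helpers `quantize_rtn` and `layer_error`, the symmetric 4-bit grid, and the random `W` and `X` are all assumptions made for illustration; they are not taken from any library.

```python
import numpy as np

def quantize_rtn(w, bits=4):
    """Round-to-nearest onto a symmetric uniform grid (illustrative helper)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def layer_error(W, W_hat, X):
    """Squared norm of the output difference W X - W_hat X."""
    return float(np.sum((W @ X - W_hat @ X) ** 2))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))    # hypothetical layer weights (rows = output units)
X = rng.normal(size=(16, 64))   # hypothetical calibration inputs (columns = samples)

W_hat = quantize_rtn(W, bits=4)
err = layer_error(W, W_hat, X)
print(f"layer-wise error with 4-bit round-to-nearest: {err:.4f}")
```

OBQ and GPTQ try to pick $\hat{W}_{\ell}$ so that this number is smaller than what plain rounding gives.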


Different approaches have been proposed to solve this problem, but we're interested in the [Optimal Brain Quantizer](https://arxiv.org/abs/2208.11580) (OBQ) framework here.

This method is inspired by a **pruning technique** (see an example in [another notebook](https://www.kaggle.com/code/aisuko/deep-learning-inference-runtime-on-cpus)) that carefully removes weights from a fully trained dense neural network (Optimal Brain Surgeon). It uses an approximation technique and provides explicit formulas for the best single weight $w_q$ to quantize next and the optimal update $\delta_F$ that adjusts the set $F$ of remaining non-quantized weights to make up for the change:

$$\displaystyle w_{q}=\arg\min_{w_{q}} \frac{(\mathrm{quant}(w_{q})-w_{q})^2}{[H_{F}^{-1}]_{qq}},$$

$$\delta_{F}=-\frac{w_{q}-\mathrm{quant}(w_{q})}{[H_{F}^{-1}]_{qq}}\cdot(H_{F}^{-1})_{:,q}$$

where $\mathrm{quant}(w)$ is the weight rounding given by the quantization grid and $H_F$ is the Hessian over the remaining weights $F$.

Using OBQ, we can quantize the easiest weight first and then adjust all remaining non-quantized weights to **compensate for this precision loss**. Then we pick the next weight to quantize, and so on.
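
A single OBQ step can be sketched in NumPy as follows. This is only an illustration under stated assumptions: one weight row `w`, a Hessian taken as `2 * X @ X.T` for the layer-wise objective, and a fixed grid with step `scale = 0.25`; none of these names come from a library. The sketch picks the weight with the smallest score, applies the compensating update, and compares the result with rounding that weight alone.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
X = rng.normal(size=(d, 32))          # hypothetical calibration inputs
w = rng.normal(size=d)                # one row of the weight matrix
H_inv = np.linalg.inv(2 * X @ X.T)    # inverse Hessian of the layer-wise objective

scale = 0.25                          # assumed fixed 4-bit grid step

def quant(v):
    """Round onto the fixed grid {-8, ..., 7} * scale."""
    return np.clip(np.round(v / scale), -8, 7) * scale

# Score every weight and pick the one that is cheapest to quantize.
scores = (quant(w) - w) ** 2 / np.diag(H_inv)
q = int(np.argmin(scores))

# Update for all weights; entry q lands exactly on the grid.
delta = -(w[q] - quant(w[q])) / H_inv[q, q] * H_inv[:, q]
w_comp = w + delta

# Baseline: round weight q and leave the others untouched.
w_plain = w.copy()
w_plain[q] = quant(w[q])

def err(v):
    """Output error of a candidate row v against the original row w."""
    return float(np.sum((w @ X - v @ X) ** 2))

print("chosen index:", q)
print("error with compensation:   ", err(w_comp))
print("error without compensation:", err(w_plain))
```

Because the update is the minimizer of the output error subject to weight `q` sitting on the grid, the compensated error cannot exceed the uncompensated one.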

A potential issue with this approach is when there are outlier weights, which can result in high **quantization error**. Usually, these outliers would be quantized last, when there are few non-quantized weights left that could be adjusted to compensate for the large error. This effect can worsen when some weights are pushed further outside the grid by intermediate updates. A simple heuristic is applied to prevent this: **outliers are quantized as soon as they appear**.

This process can be computationally heavy, especially for LLMs. To deal with this, the OBQ method uses a trick that avoids redoing the entire computation each time a weight is quantized. After quantizing a weight, it updates the inverse Hessian used in the calculations by **removing the row and column** associated with that weight (using one step of Gaussian elimination):

$$H_{-q}^{-1}=\left(H^{-1}-\frac{1}{[H^{-1}]_{qq}}H_{:,q}^{-1}H_{q,:}^{-1}\right)_{-q}$$
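
The identity is easy to check numerically. The sketch below uses a hypothetical helper, `remove_from_inverse`, and a random positive-definite matrix; it compares the rank-one update against inverting the reduced Hessian from scratch.

```python
import numpy as np

def remove_from_inverse(H_inv, q):
    """Inverse of H with row/column q removed, computed from H_inv alone."""
    reduced = H_inv - np.outer(H_inv[:, q], H_inv[q, :]) / H_inv[q, q]
    keep = np.arange(H_inv.shape[0]) != q
    return reduced[np.ix_(keep, keep)]

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 40))
H = 2 * X @ X.T                       # symmetric positive-definite stand-in Hessian
q = 2

fast = remove_from_inverse(np.linalg.inv(H), q)

# Reference: drop row/column q from H first, then invert directly.
keep = np.arange(H.shape[0]) != q
direct = np.linalg.inv(H[np.ix_(keep, keep)])

print(np.allclose(fast, direct))
```

The update costs one outer product instead of a fresh matrix inversion, which is where the saving comes from.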

The method also employs vectorization to process multiple rows of the weight matrix at once. Even so, OBQ's computation time grows cubically with the size of the weight matrix, which makes it difficult to use on very large models with billions of parameters.

# The GPTQ Algorithm

The GPTQ algorithm takes inspiration from the OBQ method, but with significant improvements to scale it for (very) large language models.


## Arbitrary Order Insight

The OBQ method selects weights (parameters in a model) for quantization in a greedy order, always picking the one that will **add the least additional error**. However, GPTQ observes that for large models, quantizing weights in any fixed order can perform just as well. This is because even though some weights might introduce more error individually, they are quantized later in the process, when there are few other weights left that could increase the error. So the order doesn't matter as much as we thought.

Based on this insight, GPTQ aims to quantize all weights in the **same order for all rows** of a matrix. This makes the process faster because certain computations have to be done only once for each column, rather than once for each weight.

<div style="text-align: center"><img src="https://files.mastodon.social/media_attachments/files/111/802/378/995/925/932/original/f4a452d46817ed43.webp" width="60%" height="60%" alt="quantize weights in the same order"></div>


## Lazy Batch-Updates

On its own, this scheme is still not fast, because it requires updating a **huge matrix** with very few computations for each entry. This type of operation can't use the full compute capability of GPUs and is slowed down by memory limitations (a memory-throughput bottleneck).

To resolve this, GPTQ introduces "lazy batch" updates. It turns out that the final rounding decisions for a given column are only affected by updates performed on that column, not on later columns. Therefore, GPTQ can apply the algorithm to a batch of columns at a time (for example, 128 columns), updating only those columns and the corresponding block of the matrix. After a block is fully processed, the algorithm performs global updates on the entire matrix.


## Cholesky Reformulation

However, there's one more issue to address. When the algorithm scales up to very large models, numerical inaccuracies can become a problem. Specifically, repeated applications of the inverse-Hessian update can **accumulate numerical errors**.

To tackle this, GPTQ uses a Cholesky decomposition, a numerically stable method for solving certain mathematical problems. It precomputes the information the algorithm needs from the inverse Hessian using the Cholesky method. This approach, combined with a slight "dampening" (adding a small constant to the diagonal elements of the matrix), helps the algorithm avoid numerical issues.
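
The following NumPy sketch illustrates why a single Cholesky factorization can stand in for the repeated updates. It assumes a Hessian of the form `2 * X @ X.T` and a dampening factor of `0.01` (both chosen for illustration only), and checks that each row of the upper Cholesky factor of the inverse Hessian, scaled by its diagonal entry, equals the row that the step-by-step elimination would produce.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 50))
H = 2 * X @ X.T

# Dampening: add a small fraction of the mean diagonal value to the diagonal.
damp_percent = 0.01                   # assumed value for this sketch
H = H + damp_percent * np.mean(np.diag(H)) * np.eye(H.shape[0])

H_inv = np.linalg.inv(H)
U = np.linalg.cholesky(H_inv).T       # upper-triangular factor, H_inv = U.T @ U

# Reference: eliminate indices 0, 1, ... one at a time with the rank-one update.
current = H_inv.copy()
for q in range(H.shape[0]):
    row_from_updates = current[0, :]          # row for index q after q eliminations
    row_from_cholesky = U[q, q] * U[q, q:]    # the same row, read directly from U
    assert np.allclose(row_from_updates, row_from_cholesky)
    current = (current - np.outer(current[:, 0], current[0, :]) / current[0, 0])[1:, 1:]

print("Cholesky rows match the step-by-step updates")
```

Reading the rows from `U` avoids applying the update over and over to the same matrix, which is where the rounding errors would otherwise build up.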

The full algorithm can be summarized in a few steps:

1. The GPTQ algorithm begins with a Cholesky decomposition of the Hessian inverse (a matrix that helps decide how to adjust the weights).
2. It then runs in loops, handling batches of columns at a time.
3. For each column in a batch, it quantizes the weights, calculates the error, and updates the weights in the block accordingly.
4. After processing the batch, it updates all remaining weights based on the block's errors.
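
The four steps above can be sketched in NumPy as follows. This is a simplified illustration, not the AutoGPTQ implementation: the function name `gptq_quantize`, the per-row symmetric grid, the tiny block size, and the random data are all assumptions, and features such as grouping and activation ordering are left out.

```python
import numpy as np

def gptq_quantize(W, X, bits=4, block_size=4, damp_percent=0.01):
    """Quantize W column by column with error compensation (illustrative sketch)."""
    W = W.astype(np.float64)          # astype returns a copy, so the caller's W is untouched
    d_row, d_col = W.shape

    # Step 1: dampened Hessian, then the upper Cholesky factor of its inverse.
    H = 2 * X @ X.T
    H = H + damp_percent * np.mean(np.diag(H)) * np.eye(d_col)
    U = np.linalg.cholesky(np.linalg.inv(H)).T    # H^-1 = U.T @ U

    # Per-row symmetric grid, fixed from the original weights (an assumption).
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1) / qmax

    Q = np.zeros_like(W)
    # Step 2: loop over blocks of columns.
    for start in range(0, d_col, block_size):
        end = min(start + block_size, d_col)
        Err = np.zeros((d_row, end - start))
        # Step 3: quantize one column and push its error onto the rest of the block.
        for j in range(start, end):
            w = W[:, j]
            q = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
            Q[:, j] = q
            err = (w - q) / U[j, j]
            W[:, j:end] -= np.outer(err, U[j, j:end])
            Err[:, j - start] = err
        # Step 4: lazy global update of every column after the block.
        W[:, end:] -= Err @ U[start:end, end:]
    return Q

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))          # hypothetical layer weights
X = rng.normal(size=(16, 128))        # hypothetical calibration inputs
Q = gptq_quantize(W, X)

# Round-to-nearest on the same grid, for comparison.
scale = np.abs(W).max(axis=1, keepdims=True) / 7
Q_rtn = np.clip(np.round(W / scale), -8, 7) * scale

def error(W_hat):
    return float(np.sum((W @ X - W_hat @ X) ** 2))

print("RTN  error:", error(Q_rtn))
print("GPTQ error:", error(Q))
```

The block size only changes when the updates are applied, not what they are, so `block_size=1` and `block_size=16` should give the same quantized matrix here. The compensated error is usually lower than the round-to-nearest error, although that is not guaranteed for every input.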


The GPTQ algorithm was tested on various language generation tasks. It was compared with other quantization methods, like rounding all weights to the nearest quantized value (RTN). GPTQ was used with the BLOOM (176B parameters) and OPT (175B parameters) model families, and models were quantized using a single GPU (NVIDIA A100).

# Quantize an LLM with AutoGPTQ

GPTQ has become very popular for creating 4-bit models that run efficiently on GPUs. Many examples can be found on the Hugging Face Hub, for instance from [TheBloke](https://huggingface.co/TheBloke). [GGML](https://github.com/ggerganov/ggml) targets CPU inference.

We can also pass the `load_in_4bit=True` argument when loading a `transformers` model to quantize it on the fly (this path relies on bitsandbytes rather than GPTQ). It requires downloading the full model and holding it in RAM.

In [1]:
!pip install transformers==4.36.2
# !pip install accelerate==0.25.0
# !pip install peft==0.7.1
# !pip install bitsandbytes==0.41.3
!pip install auto-gptq==0.6.0
!pip install optimum==1.16.2

Collecting auto-gptq==0.6.0
  Obtaining dependency information for auto-gptq==0.6.0 from https://files.pythonhosted.org/packages/09/b2/c964b7f286ce5f782c1be0b46700091daa60a121b41e06d9a59047b45e57/auto_gptq-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Downloading auto_gptq-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (20 kB)
Collecting rouge (from auto-gptq==0.6.0)
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Collecting gekko (from auto-gptq==0.6.0)
  Downloading gekko-1.0.6-py3-none-any.whl (12.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.2/12.2 MB[0m [31m80.9 MB/s[0m eta [36m0:00:00[0m
Collecting peft>=0.5.0 (from auto-gptq==0.6.0)
  Obtaining dependency information for peft>=0.5.0 from https://files.pythonhosted.org/packages/8b/1b/aee2a330d050c493642d59ba6af51f3910cb138ea48ede228c84c204a5af/peft-0.7.1-py3-none-any.whl.metadata
  Downloading peft-0.7.1-py3-none-any.whl.metadata

In [2]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()

login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Quantize-models"
os.environ["WANDB_NOTES"] = "Quantize models by using Post-training quantization methods"
os.environ["MODEL_NAME"] = "facebook/opt-125m"
os.environ["WANDB_NAME"] = "quantized-opt-125m-with-GPTQ"

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
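from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained(os.getenv("MODEL_NAME"))

# 4-bit GPTQ, calibrated on samples from the C4 dataset
gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",
    tokenizer=tokenizer
)

# Passing the config makes from_pretrained quantize the model while loading it
quantized_model = AutoModelForCausalLM.from_pretrained(
    os.getenv("MODEL_NAME"),
    device_map="auto",
    quantization_config=gptq_config
)

# Memory footprint of the quantized model, in bytes
quantized_model.get_memory_footprint()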


125067264

Check that the model has been correctly quantized. The buffers of a linear layer should include `qweight` and `qzeros`, stored with dtype `torch.int32`.

In [4]:
quantized_model.model.decoder.layers[0].self_attn.q_proj.__dict__

{'training': True,
 '_parameters': OrderedDict(),
 '_buffers': OrderedDict([('qweight',
               tensor([[ 1711760090, -1248295259, -2025411892,  ..., -1486452502,
                         2019142072, -1735820810],
                       [-2000132747,  -578262345,  1484081337,  ..., -1230600537,
                        -2019252040, -2023311003],
                       [ -710293851, -1153090188,  1431922298,  ..., -1768449094,
                         2042194587, -2004125258],
                       ...,
                       [-1183500136, -1494510422, -1772782904,  ..., -1518753378,
                         -411710600,  -392845654],
                       [-1990626701,  1469278281,  1469864108,  ...,  1740208533,
                        -1732560507, -1738077576],
                       [ 2015914598,  2040232821,  2005572185,  ..., -1463179655,
                        -1450400136, -2024523156]], device='cuda:0', dtype=torch.int32)),
              ('qzeros',
               tensor(
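
The large positive and negative integers in `qweight` are what several 4-bit codes look like once they are packed into 32-bit words. The NumPy sketch below shows the idea with eight codes per word, lowest bits first. The bit order and the helper names `pack_int4` and `unpack_int4` are assumptions made for illustration and may not match AutoGPTQ's exact layout.

```python
import numpy as np

def pack_int4(values):
    """Pack unsigned 4-bit codes (0..15) into int32 words, eight codes per word."""
    values = np.asarray(values, dtype=np.uint32).reshape(-1, 8)
    shifts = np.arange(8, dtype=np.uint32) * 4
    packed = np.bitwise_or.reduce(values << shifts, axis=1)
    return packed.astype(np.uint32).view(np.int32)

def unpack_int4(packed):
    """Recover the 4-bit codes from int32 words produced by pack_int4."""
    words = np.asarray(packed, dtype=np.int32).view(np.uint32)
    shifts = np.arange(8, dtype=np.uint32) * 4
    return ((words[:, None] >> shifts) & 0xF).astype(np.uint8).reshape(-1)

codes = np.arange(16)                 # sixteen 4-bit codes, so two packed words
packed = pack_int4(codes)
print(packed.dtype, packed)
print(unpack_int4(packed))
```

Whenever the highest code in a word is 8 or more, the sign bit of the `int32` is set, which is why roughly half of the packed values print as negative numbers.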

Check the quantization configuration.

In [5]:
quantized_model.config.quantization_config.to_dict()

{'quant_method': <QuantizationMethod.GPTQ: 'gptq'>,
 'bits': 4,
 'tokenizer': None,
 'dataset': 'c4',
 'group_size': 128,
 'damp_percent': 0.1,
 'desc_act': False,
 'sym': True,
 'true_sequential': True,
 'use_cuda_fp16': False,
 'model_seqlen': None,
 'block_name_to_quantize': None,
 'module_name_preceding_first_block': None,
 'batch_size': 1,
 'pad_token_id': None,
 'use_exllama': True,
 'max_input_length': None,
 'exllama_config': {'version': <ExllamaVersion.ONE: 1>},
 'cache_block_outputs': True}

In [6]:
text="The weather today in Melbourne is"
inputs=tokenizer(text, return_tensors="pt").to(0)

output=quantized_model.generate(**inputs)
tokenizer.decode(output[0], skip_special_tokens=True)



'The weather today in Melbourne is expected to be mild and dry, with temperatures expected to be in'

In [7]:
quantized_model.push_to_hub(os.getenv("WANDB_NAME"))
tokenizer.push_to_hub(os.getenv("WANDB_NAME"))

CommitInfo(commit_url='https://huggingface.co/aisuko/quantized-opt-125m-with-GPTQ/commit/99f9383c1d778115ac96e00c4b513c64ac8bb4b6', commit_message='Upload tokenizer', commit_description='', oid='99f9383c1d778115ac96e00c4b513c64ac8bb4b6', pr_url=None, pr_revision=None, pr_num=None)

# LLM.int8

We discussed the LLM.int8 (8-bit) quantization technique in [Introduction to weight quantization](https://www.kaggle.com/code/aisuko/introduction-to-weight-quantization/notebook).

# Credit
* https://towardsdatascience.com/4-bit-quantization-with-gptq-36b0f4f02c34
* https://huggingface.co/docs/transformers/v4.37.0/quantization