# transformers meets AutoGPTQ library for lighter and faster quantized inference of LLMs

Hugging Face GPTQ integration blog post: https://huggingface.co/blog/gptq-integration

Original notebook: https://colab.research.google.com/drive/1_TIrmuKOFhuRRiTWN94iLKUFu6ZX4ceb



<!-- ![image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/159_autogptq_transformers/thumbnail.jpg) -->

In 2022, Frantar et al. published [GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers](https://arxiv.org/abs/2210.17323). The paper details an algorithm that compresses any transformer-based language model to a few bits per weight with only a small degradation in performance.

We now support loading models quantized with the GPTQ algorithm in 🤗 transformers, thanks to the [`auto-gptq`](https://github.com/PanQiWei/AutoGPTQ.git) library, which is used as the backend.

In this notebook, let's go through the options this integration offers: quantizing a model, pushing a quantized model to the 🤗 Hub, loading an already quantized model from the Hub, and more!

## Load required libraries

Let us first install the required libraries: 🤗 transformers, optimum and auto-gptq.

In [2]:
!pip install -q -U transformers peft accelerate optimum
!pip install -q datasets

For now, until the next release of AutoGPTQ, we point pip at an additional index of pre-built wheels (here for CUDA 11.7) rather than building the library from source!

In [None]:
!pip install -q auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu117/

## Quantize transformers model using auto-gptq, 🤗 transformers and optimum

There are two scenarios in which you might use this integration:

1. Quantize a language model from scratch.
2. Load a model that has already been quantized from the 🤗 Hub.


The GPTQ algorithm needs to calibrate the quantized weights of the model, which it does by running inference on the model with calibration data. The detailed quantization algorithm is described in [the original paper](https://arxiv.org/pdf/2210.17323.pdf).

To quantize a model with auto-gptq, we need to pass a dataset to the quantizer: either the name of a supported default dataset, one of `['wikitext2','c4','c4-new','ptb','ptb-new']`, or a list of strings that will be used as the dataset.
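For intuition about what a quantizer stores, the sketch below applies plain round-to-nearest quantization to groups of weights, keeping one scale and one zero point per group. This is not the GPTQ algorithm itself, which additionally uses the calibration data to compensate for rounding error; the function names are hypothetical and not part of auto-gptq or transformers.

```python
import random

# Toy group-wise round-to-nearest quantization (NOT the GPTQ algorithm itself).
# All names below are illustrative assumptions.

def quantize_group(weights, bits=4):
    """Map a group of floats to integers in [0, 2**bits - 1] with one scale and zero point."""
    qmax = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    # Guard against a constant group, which would otherwise give a zero scale.
    scale = (w_max - w_min) / qmax if w_max > w_min else 1.0
    zero = round(-w_min / scale)
    q = [min(qmax, max(0, round(w / scale) + zero)) for w in weights]
    return q, scale, zero


def dequantize_group(q, scale, zero):
    """Recover approximate float weights from the stored integers."""
    return [(v - zero) * scale for v in q]


def quantize_row(row, bits=4, group_size=128):
    """Quantize a row of weights group by group."""
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]


rng = random.Random(0)
row = [rng.gauss(0.0, 0.02) for _ in range(256)]  # synthetic weights
groups = quantize_row(row, bits=4, group_size=128)

q, scale, zero = groups[0]
restored = dequantize_group(q, scale, zero)
max_err = max(abs(a - b) for a, b in zip(restored, row[:128]))
print(f"groups: {len(groups)}, max abs error in group 0: {max_err:.6f}")
```

A smaller group size gives each scale fewer weights to cover, which tends to reduce rounding error at the cost of storing more scales and zero points.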

### Quantize a model by passing a supported dataset




In the example below, let us quantize the model to 4-bit precision using the `"wikitext2"` dataset. Supported precisions are `[2, 3, 4, 8]`.

Note that this cell will take more than 3 minutes to complete. To see how to quantize the model with a custom dataset, check out the next section.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
import torch

model_id = "facebook/opt-125m"

quantization_config = GPTQConfig(
    bits=4,
    group_size=128,
    dataset="wikitext2",
    desc_act=False,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
quant_model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config, device_map='auto')
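As rough arithmetic for what `bits=4` and `group_size=128` imply for storage, the sketch below estimates the average number of bits per quantized weight. It assumes each group stores one 16-bit scale and one zero point of `bits` width; those overhead figures are assumptions for illustration, and a real checkpoint also keeps some tensors (for example embeddings) in higher precision.

```python
def bits_per_weight(bits=4, group_size=128, scale_bits=16):
    """Average storage per weight, assuming one scale and one zero point per group."""
    return bits + (scale_bits + bits) / group_size


def compression_vs_fp16(bits=4, group_size=128):
    """Ratio between 16-bit storage and the estimated quantized storage."""
    return 16 / bits_per_weight(bits, group_size)


print(bits_per_weight())                 # 4.15625
print(round(compression_vs_fp16(), 2))   # 3.85
```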

You can make sure the model has been correctly quantized by checking the attributes of the linear layers: they should contain `qweight` and `qzeros` attributes with dtype `torch.int32`.

In [None]:
quant_model.model.decoder.layers[0].self_attn.q_proj.__dict__
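One way to see why the quantized tensors use a 32-bit integer dtype is that eight 4-bit values fit into a single 32-bit word. The sketch below packs and unpacks such values in plain Python. The function names are hypothetical, and the exact bit layout used by the auto-gptq kernels may differ from this one.

```python
def pack_int4(values):
    """Pack unsigned 4-bit ints (0..15) into 32-bit words, eight per word, lowest nibble first."""
    if len(values) % 8 != 0:
        raise ValueError("length must be a multiple of 8")
    words = []
    for i in range(0, len(values), 8):
        word = 0
        for j, v in enumerate(values[i:i + 8]):
            if not 0 <= v <= 15:
                raise ValueError("value out of 4-bit range")
            word |= v << (4 * j)  # place each value in its own 4-bit slot
        words.append(word)
    return words


def unpack_int4(words):
    """Inverse of pack_int4: read eight 4-bit slots back out of each word."""
    return [(w >> (4 * j)) & 0xF for w in words for j in range(8)]


vals = [1, 2, 3, 4, 5, 6, 7, 8, 15, 0, 15, 0, 9, 10, 11, 12]
packed = pack_int4(vals)
print([hex(w) for w in packed])
print(unpack_int4(packed) == vals)
```

Here the words are kept as unsigned Python integers; a signed `int32` tensor would hold the same bit patterns.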

Now let's run inference on the quantized model, using the same API as for any other transformers model!

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_id)

text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt").to(0)

out = quant_model.generate(**inputs)
print(tokenizer.decode(out[0], skip_special_tokens=True))

### Quantize a model by passing a custom dataset

You can also quantize a model with a custom dataset by providing a list of strings to the quantization config. A good number of samples to pass is 128; if you do not pass enough data, the quality of the quantized model will suffer.

In [None]:
from transformers import AutoModelForCausalLM, GPTQConfig, AutoTokenizer
import torch

model_id = "facebook/opt-125m"

quantization_config = GPTQConfig(
    bits=4,
    group_size=128,
    desc_act=False,
    dataset=["auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."]
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
quant_model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config, torch_dtype=torch.float16, device_map="auto")
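The single sentence used above is only a placeholder. A minimal sketch of assembling a larger list of calibration strings from your own documents follows. The helper name, the character-based chunking and the chunk size are illustrative assumptions rather than part of the `GPTQConfig` API; only the 128-sample target comes from the text above.

```python
import random


def build_calibration_samples(documents, n_samples=128, chunk_chars=512, seed=0):
    """Split documents into fixed-size text chunks and return up to n_samples of them."""
    chunks = []
    for doc in documents:
        text = " ".join(doc.split())  # normalize whitespace
        for start in range(0, len(text), chunk_chars):
            chunk = text[start:start + chunk_chars]
            if chunk:
                chunks.append(chunk)
    # Shuffle deterministically so the selection is spread across all documents.
    rng = random.Random(seed)
    rng.shuffle(chunks)
    return chunks[:n_samples]


# Synthetic stand-in for a real corpus.
docs = ["GPTQ is a post-training quantization method. " * 2000]
samples = build_calibration_samples(docs)
print(len(samples))
```

The resulting list of strings could then be passed as the `dataset` argument of the config, in place of the single sentence.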

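As you can see from the generation below, the output quality seems to be slightly worse than that of the model quantized with the `wikitext2` dataset, which is not surprising given that the custom dataset here contains a single sentence.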

In [None]:
text = "My name is"
inputs = tokenizer(text, return_tensors="pt").to(0)

out = quant_model.generate(**inputs)
print(tokenizer.decode(out[0], skip_special_tokens=True))

## Share quantized models on 🤗 Hub

After quantizing the model, you can use it out of the box for inference, or push the quantized weights to the 🤗 Hub to share your quantized model with the community.

In [1]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
quant_model.push_to_hub("opt-125m-gptq-4bit")
tokenizer.push_to_hub("opt-125m-gptq-4bit")
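The second scenario listed earlier, loading an already quantized model from the Hub, can be sketched as follows. This assumes the push above succeeded; `your-username` is a placeholder namespace and both helper names are hypothetical. The `from_pretrained` call with `device_map="auto"` is the same one used earlier in this notebook, and the quantization settings are expected to be read from the config stored in the repository.

```python
def hub_repo_id(namespace, repo_name):
    """Build the '<namespace>/<repo_name>' identifier used by the Hub."""
    return f"{namespace}/{repo_name}"


def load_quantized(repo_id):
    """Load a GPTQ-quantized model and its tokenizer from the Hub."""
    # Imports are local so that defining this helper needs no GPU or network access.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    # No GPTQConfig is passed here: the assumption is that the repository's
    # config already records how the weights were quantized.
    model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
    return model, tokenizer


repo_id = hub_repo_id("your-username", "opt-125m-gptq-4bit")  # placeholder namespace
print(repo_id)
# model, tokenizer = load_quantized(repo_id)  # needs a GPU, network access and auto-gptq
```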