## Load required libraries

Let us first install the required libraries: 🤗 transformers, optimum, and auto-gptq.

In [None]:
!pip install -q -U transformers peft accelerate optimum datasets

For now, until the next release of AutoGPTQ, we install the library from the AutoGPTQ wheel index. The URL below targets CUDA 11.7 (`cu117`).

In [None]:
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu117/
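Since the cells above upgrade several packages at once, it can be worth confirming which versions ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is illustrative and not part of any of these libraries.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names as used in the pip commands above.
for name, ver in installed_versions(
    ["transformers", "optimum", "auto-gptq", "accelerate", "datasets", "peft"]
).items():
    print(f"{name}: {ver or 'not installed'}")
```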

### Quantize a model by passing a calibration dataset

In the example below, we quantize the model to 4-bit precision. `GPTQConfig` accepts either the name of a supported dataset such as `"c4"` or a custom list of strings; the cells below pass a custom list of Arabic Wikipedia texts. Supported precisions are `[2, 3, 4, 8]`.

Note that the quantization cell takes more than 3 minutes to complete.
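To get a feel for what the `bits` setting buys, here is a rough back-of-the-envelope estimate of weight storage. It is only a sketch under stated assumptions: roughly 13 billion parameters (inferred from the model name), a 16-bit baseline, and one 16-bit scale plus one `bits`-wide zero point stored per group of 128 weights. Real checkpoints will differ, for instance because some layers are typically left unquantized; the function name is illustrative.

```python
def estimate_weight_gib(n_params, bits, group_size=128, scale_bits=16):
    """Approximate storage for group-wise quantized weights, in GiB."""
    n_groups = n_params / group_size
    # Each weight costs `bits`; each group adds one scale and one zero point.
    total_bits = n_params * bits + n_groups * (scale_bits + bits)
    return total_bits / 8 / 2**30


N_PARAMS = 13e9  # assumption: taken from the "13b" in the model name

fp16_gib = N_PARAMS * 16 / 8 / 2**30
print(f"16-bit baseline: ~{fp16_gib:.1f} GiB")
for bits in (2, 4, 8):
    print(f"{bits}-bit: ~{estimate_weight_gib(N_PARAMS, bits):.1f} GiB")
```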

In [None]:
from datasets import load_dataset

dataset = load_dataset('Ali-C137/Mixed-Arabic-Datasets', 'Ara--Wikipedia')
# Convert the train split to a pandas DataFrame and keep the first 100 texts for calibration
df = dataset["train"].to_pandas()
text_list = df['text'].head(100).tolist()
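Calibration works on plain strings, so it can help to sanitize the list before handing it to the quantizer. The helper below is a hypothetical sketch (its name and the `min_chars` threshold are assumptions, not part of `datasets` or `transformers`): it drops non-string and near-empty entries and caps the number of samples.

```python
def clean_calibration_texts(texts, max_samples=100, min_chars=20):
    """Keep at most `max_samples` non-trivial strings, preserving order."""
    cleaned = []
    for text in texts:
        if not isinstance(text, str):
            continue  # skip None / NaN values that can come out of a DataFrame
        text = text.strip()
        if len(text) >= min_chars:
            cleaned.append(text)
        if len(cleaned) == max_samples:
            break
    return cleaned


sample = ["  ", None, "short", "A sentence that is long enough to be useful."]
print(clean_calibration_texts(sample))
```

In this notebook it could be applied as `text_list = clean_calibration_texts(df['text'].tolist())`.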

In [None]:
# text_list

In [None]:
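from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "inception-mbzuai/jais-13b"

# Calibrate on the custom list of strings prepared above
quantization_config = GPTQConfig(
    bits=4,
    group_size=128,
    dataset=text_list,
    desc_act=False,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Passing a GPTQConfig triggers quantization while the model is loaded.
# jais ships custom modeling code on the Hub, so loading it is expected to need trust_remote_code=True.
quant_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True,
)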

You can verify that the model has been quantized correctly by checking the attributes of the linear layers: they should contain `qweight` and `qzeros` attributes of dtype `torch.int32`.

In [None]:
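# Pick the first quantized layer generically; a fixed attribute path depends on the architecture
next(m for m in quant_model.modules() if hasattr(m, "qweight")).__dict__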

The quantized model can now be used for inference with the same `generate` API as any other transformers model.

## Share quantized models on 🤗 Hub

After quantizing the model, you can use it out of the box for inference, or push the quantized weights to the 🤗 Hub to share the model with the community.

In [None]:
from huggingface_hub import notebook_login

# Log in with your own Hugging Face access token (it needs write permission to push)
notebook_login()

In [None]:
quant_model.push_to_hub("Ali-C137/Jais-13b-GPTQ")
tokenizer.push_to_hub("Ali-C137/Jais-13b-GPTQ")

## Load quantized models from the 🤗 Hub

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Ali-C137/Jais-13b-GPTQ"
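# jais relies on custom modeling code, so trust_remote_code=True is expected to be required here as well
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=True)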
tokenizer = AutoTokenizer.from_pretrained(model_id)

Once the tokenizer and model have been loaded, let's generate some text. Before that, we can inspect the model to make sure it has loaded the quantized weights.

In [None]:
print(model)

As you can see, the linear layers have been replaced by `QuantLinear` modules from the auto-gptq library.

Furthermore, the quantization config shows that we are using the exllama kernel (`disable_exllama = False`). Note that this kernel only works with 4-bit models.

In [None]:
model.config.quantization_config.to_dict()
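The dictionary returned above can also be inspected programmatically. The sketch below is hedged: the text mentions the `disable_exllama` key, and newer transformers releases are understood to expose a `use_exllama` key instead, so the helper checks both. The helper names and the sample dictionaries are illustrative.

```python
def exllama_enabled(quant_cfg):
    """Best-effort check of whether the exllama kernel is enabled in a GPTQ config dict."""
    if quant_cfg.get("use_exllama") is not None:
        return bool(quant_cfg["use_exllama"])
    if quant_cfg.get("disable_exllama") is not None:
        return not quant_cfg["disable_exllama"]
    return None  # neither key present: cannot tell


def check_config(quant_cfg):
    """Summarize the exllama setting and flag a non-4-bit model."""
    enabled = exllama_enabled(quant_cfg)
    if enabled and quant_cfg.get("bits") != 4:
        return "warning: exllama requested but model is not 4-bit"
    return f"exllama enabled: {enabled}"


# Illustrative dictionaries; with a real model use
# model.config.quantization_config.to_dict()
print(check_config({"bits": 4, "disable_exllama": False}))
print(check_config({"bits": 8, "disable_exllama": False}))
```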

In [None]:
text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt").to(model.device)  # follow the device chosen by device_map="auto"

out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))