<a href="https://colab.research.google.com/github/TAUforPython/machinelearning/blob/main/example_LLM_AWQ_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformer meets AWQ quantization for lighter and faster quantized inference of LLMs

In June 2023, Ji Lin et al. published [AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/pdf/2306.00978.pdf). The paper describes an algorithm that compresses transformer-based language models to a few bits per weight with only a small loss in accuracy.

This notebook uses the [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) library to perform the quantization.

Quantization of neural networks, including language models (LLMs), is the process of converting model weights and computations from a floating-point format to a fixed-point format. This reduces the memory the model occupies and speeds up inference by simplifying the arithmetic operations.
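
To make the idea concrete, here is a minimal sketch of asymmetric (zero-point) quantization of one group of weights to 4-bit integers, in plain Python. It only illustrates the float-to-integer mapping that settings such as `w_bit`, `q_group_size` and `zero_point` refer to later in this notebook. It is not the AWQ algorithm itself, which additionally rescales weights based on activation statistics. The helper names `quantize_group` and `dequantize_group` are hypothetical and are not part of AutoAWQ.

```python
def quantize_group(weights, n_bits=4):
    """Map one group of float weights to integers in [0, 2**n_bits - 1]."""
    q_max = 2 ** n_bits - 1                 # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max         # float step between two integer levels
    if scale == 0:                          # constant group: avoid division by zero
        scale = 1.0
    zero_point = round(-w_min / scale)      # integer offset that represents w_min
    q = [min(q_max, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize_group(q, scale, zero_point):
    """Recover approximate float weights from the stored integers."""
    return [(v - zero_point) * scale for v in q]


# Illustrative values only (not taken from a real model).
group = [-0.5, -0.1, 0.0, 0.2, 0.7]
q, scale, zero_point = quantize_group(group)
print("integers:", q)
print("restored:", dequantize_group(q, scale, zero_point))
```

Each group stores small integers plus one scale and one zero point, which is where the memory saving comes from. The restored values differ from the originals by at most about one quantization step.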


## Load required libraries

In [1]:
!pip install -q transformers accelerate

AutoAWQ defaults to CUDA 12.1. Because Google Colab ships with a CUDA version older than 12.1, we install the wheels built for CUDA 11.8. With CUDA 12.1 you can simply run `pip install autoawq`.

In [8]:
!pip install -q torch==2.3.1

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.6.0 requires torch==2.6.0, but you have torch 2.3.1 which is incompatible.
torchvision 0.21.0 requires torch==2.6.0, but you have torch 2.3.1 which is incompatible.
autoawq 0.2.8 requires torch>=2.5.1, but you have torch 2.3.1 which is incompatible.

In [4]:
!pip install -q fsspec

In [6]:
!pip install -q autoawq-kernels

In [10]:
!pip install -q autoawq

## AutoAWQ integration with Transformers

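Let's first quantize a model using `autoawq`. The cell below quantizes `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`; `facebook/opt-125m` and `mistralai/Mistral-7B-Instruct-v0.2` are left commented out as alternatives.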

In [11]:
!pip install --upgrade -q huggingface_hub

In [12]:
import os
from huggingface_hub import login
from google.colab import userdata

os.environ["HF_token"] = userdata.get("HF_token")

login(os.environ["HF_token"])

In [13]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

In [20]:
#model_path = "facebook/opt-125m"
#quant_path = "opt-125m-awq"

#model_path = 'mistralai/Mistral-7B-Instruct-v0.2'
#quant_path = 'mistral-instruct-v0.2-awq'

model_path = 'deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B'
quant_path = 'DeepSeek-R1-Distill-Qwen-1.5B-awq'

quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version":"GEMM"}

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

Repo card metadata block was not found. Setting CardData to empty.
Token indices sequence length is longer than the specified maximum sequence length for this model (57054 > 16384). Running this sequence through the model will result in indexing errors
AWQ: 100%|██████████| 28/28 [25:19<00:00, 54.28s/it]


In [21]:
quant_config

{'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMM'}

To make the quantized model loadable with transformers, we need to add the quantization settings to the model's config before saving it.

In [22]:
from transformers import AwqConfig, AutoConfig
from huggingface_hub import HfApi

# modify the config file so that it is compatible with transformers integration
quantization_config = AwqConfig(
    bits=quant_config["w_bit"],
    group_size=quant_config["q_group_size"],
    zero_point=quant_config["zero_point"],
    version=quant_config["version"].lower(),
).to_dict()

# the pretrained transformers model is stored in the model attribute + we need to pass a dict
model.model.config.quantization_config = quantization_config
# a second solution would be to use Autoconfig and push to hub (what we do at llm-awq)


# save model weights
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

('DeepSeek-R1-Distill-Qwen-1.5B-awq/tokenizer_config.json',
 'DeepSeek-R1-Distill-Qwen-1.5B-awq/special_tokens_map.json',
 'DeepSeek-R1-Distill-Qwen-1.5B-awq/tokenizer.json')
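
As an optional sanity check, the saved `config.json` can be inspected to confirm that the quantization settings were written alongside the weights. The sketch below assumes the settings are stored under a top-level `quantization_config` key of `config.json` in the output folder. The helper name `read_quantization_config` is hypothetical.

```python
import json
from pathlib import Path


def read_quantization_config(folder):
    """Return the `quantization_config` entry of <folder>/config.json, or None if absent."""
    config = json.loads((Path(folder) / "config.json").read_text())
    return config.get("quantization_config")


# Usage in this notebook (assumes `quant_path` from the cells above):
# print(read_quantization_config(quant_path))
```

If the function returns `None`, the config was not updated and transformers may not recognize the checkpoint as an AWQ model.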

In [23]:
# optional -> push the quantized weights to the hub
! huggingface-cli login


You first need to create the model repository `TutorForTAU/example-AWQ-LLM-model` on Hugging Face, and only then upload the quantized LLM to it.
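
The repository can also be created from code instead of through the website. The sketch below wraps the two steps in a hypothetical helper, `create_repo_then_upload`, that takes an `HfApi`-like object. It relies on `HfApi.create_repo` with `exist_ok=True`, so it is safe to call when the repository already exists, and it assumes you are logged in with a token that has write access.

```python
def create_repo_then_upload(api, repo_id, folder_path):
    """Create the model repo if it is missing, then upload the folder to it."""
    # exist_ok=True turns the call into a no-op when the repo already exists.
    api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
    return api.upload_folder(
        folder_path=folder_path,
        repo_id=repo_id,
        repo_type="model",
    )


# Usage in this notebook (needs network access and a write token):
# from huggingface_hub import HfApi
# create_repo_then_upload(
#     HfApi(),
#     "TutorForTAU/example-AWQ-LLM-model",
#     "DeepSeek-R1-Distill-Qwen-1.5B-awq",
# )
```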

In [29]:
api = HfApi()
api.upload_folder(
    folder_path = "DeepSeek-R1-Distill-Qwen-1.5B-awq",
    repo_id = "TutorForTAU/example-AWQ-LLM-model",
    repo_type="model",
)

CommitInfo(commit_url='https://huggingface.co/TutorForTAU/example-AWQ-LLM-model/commit/71df35663cf4a0786b4456b01144972656082fd8', commit_message='Upload folder using huggingface_hub', commit_description='', oid='71df35663cf4a0786b4456b01144972656082fd8', pr_url=None, repo_url=RepoUrl('https://huggingface.co/TutorForTAU/example-AWQ-LLM-model', endpoint='https://huggingface.co', repo_type='model', repo_id='TutorForTAU/example-AWQ-LLM-model'), pr_revision=None, pr_num=None)

Now we can use our model with the transformers library to run inference!

In [30]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("TutorForTAU/example-AWQ-LLM-model")
model = AutoModelForCausalLM.from_pretrained("TutorForTAU/example-AWQ-LLM-model").to(0)

text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt").to(0)

out = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(out[0], skip_special_tokens=True))

You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
`low_cpu_mem_usage` was None, now default to True since model is quantized.


Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


Hello my name is... I need to find


In [33]:

text = "World is a..."
inputs = tokenizer(text, return_tensors="pt").to(0)

out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:151643 for open-end generation.


World is a... Let me think.

Okay, so the question is asking me to imagine a world and describe it in detail. The example given is "World is a place where people live, work, and play their lives." It's a pretty straightforward description, but


## Loading a large model on Google Colab

Let's now use this integration to load a large model that would not fit on a single Google Colab GPU in float16. The integration is compatible with any AWQ model under the [`TheBloke`](https://huggingface.co/TheBloke) namespace. For this demo we use `TheBloke/Llama-2-13B-chat-AWQ`: in float16 the model would need about 26 GB, but the AWQ version runs comfortably on a 16 GB GPU.

<img src="https://huggingface.co/datasets/ybelkada/documentation-images/resolve/main/thebloke-screenshot.png" width="800"/>
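
The memory figures can be checked with a quick back-of-the-envelope calculation. The sketch below counts weight storage only and ignores activations and the KV cache. The 4-bit layout (one float16 scale and one 4-bit zero point per group of 128 weights) is an assumption based on the `quant_config` used earlier in this notebook, not a confirmed property of the TheBloke checkpoint.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 13e9  # nominal parameter count of a "13B" model

# float16 stores every weight in 16 bits.
fp16_gb = weight_memory_gb(n_params, 16)

# Assumed AWQ layout: 4-bit weights plus a 16-bit scale and a 4-bit
# zero point shared by each group of 128 weights.
group_size = 128
awq_bits = 4 + (16 + 4) / group_size
awq_gb = weight_memory_gb(n_params, awq_bits)

print(f"float16: {fp16_gb:.1f} GB, 4-bit AWQ: {awq_gb:.1f} GB")
```

The real checkpoint is likely to be somewhat larger than this estimate, since some tensors may be kept in higher precision.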

In [None]:
model_id = "TheBloke/Llama-2-13B-chat-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda")

In [None]:
text = "User:\nHello can you provide me with top-3 cool places to visit in Paris?\n\nAssistant:\n"
inputs = tokenizer(text, return_tensors="pt").to(0)

out = model.generate(**inputs, max_new_tokens=300)
print(tokenizer.decode(out[0], skip_special_tokens=True))

User:
Hello can you provide me with top-3 cool places to visit in Paris?

Assistant:
Bonjour! There are so many amazing places to visit in Paris, but here are my top three recommendations:

1. The Eiffel Tower - an iconic symbol of the city, this tower offers breathtaking views of the city from its observation decks.
2. The Louvre Museum - home to some of the world's most famous artworks, including the Mona Lisa, this museum is a must-visit for art lovers.
3. Notre Dame Cathedral - a beautiful and historic church that is one of the most famous landmarks in Paris.

I hope you enjoy your visit to Paris! Is there anything else you'd like to know?
