# Overview
This notebook demonstrates how to load, configure, and run inference with large language models (LLMs) that have been quantized using **AWQ (Activation-aware Weight Quantization)**. We focus on using models that are already quantized and hosted on the Hugging Face Hub — especially the efficient and popular **Mistral-7B-Instruct-AWQ** variant.

# Purpose
- Run 4-bit quantized LLMs using AutoAWQ, optimized for speed and memory usage.

- Configure and validate quantization metadata for compatibility with Hugging Face's transformers.

- Save and (optionally) push the quantized model back to the Hugging Face Hub.

- Perform inference with reduced GPU requirements — ideal for Colab and other limited-resource environments.

# AWQ Quantization: Efficient & Scalable LLM Inference
### **What is AWQ?**
**AWQ (Activation-aware Weight Quantization)** is a technique for compressing LLM weights to low bit widths (e.g., 4-bit) while preserving model quality. Unlike plain round-to-nearest weight quantization, which treats every weight equally, AWQ uses activation statistics to identify the most critical weight channels and protects them by scaling before quantization. It enables:
- Roughly 3–4× lower weight memory usage compared with float16
- Faster inference (the 2–4× range quoted here depends on hardware, kernels, and batch size)
- Minimal accuracy loss
- Efficient deployment on edge devices or free GPU platforms like Colab

Three related names come up throughout this notebook:
- **AWQ**: The core algorithm, which scales salient weight channels identified from activation distributions before quantizing.
- **AutoAWQ**: A user-friendly library that automates AWQ quantization and inference, converting models to 4-bit precision with minimal effort.
- **LLM-AWQ**: The reference implementation in the `mit-han-lab/llm-awq` repository, supporting efficient inference on both desktop and mobile GPUs.
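
To make the idea of protecting salient channels concrete, here is a minimal pure-Python sketch. It is not the AutoAWQ implementation: it assumes a toy layer, a single quantization group per weight row, and a small grid of exponents `alpha`, and the helper names (`quantize_dequantize`, `output_error`) and the synthetic data are illustrative only. Each input channel is scaled by its mean activation magnitude raised to `alpha` before quantization, and the inverse scale is folded back afterwards so the layer still computes the same function.

```python
import random

def quantize_dequantize(values, n_bits=4):
    """Round-to-nearest quantization with a zero point, followed by dequantization."""
    q_max = 2 ** n_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / q_max
    if scale == 0:
        scale = 1.0  # constant group: any scale works
    zero = min(max(round(-lo / scale), 0), q_max)
    codes = [min(max(round(v / scale) + zero, 0), q_max) for v in values]
    return [(c - zero) * scale for c in codes]

def matvec(W, x):
    """Multiply a weight matrix (list of rows) by an input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def output_error(W, X, alpha, n_bits=4):
    """Mean squared output error when input channels are scaled by mean|x|**alpha."""
    n_in = len(W[0])
    act_mag = [sum(abs(x[j]) for x in X) / len(X) for j in range(n_in)]
    s = [max(m, 1e-8) ** alpha for m in act_mag]
    W_q = []
    for row in W:
        scaled = [w * sj for w, sj in zip(row, s)]        # scale salient columns up
        deq = quantize_dequantize(scaled, n_bits)         # quantize the scaled weights
        W_q.append([d / sj for d, sj in zip(deq, s)])     # fold the inverse scale back
    total = 0.0
    for x in X:
        total += sum((a - b) ** 2 for a, b in zip(matvec(W, x), matvec(W_q, x)))
    return total / len(X)

random.seed(0)
n_out, n_in, n_samples = 8, 16, 32
W = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
# Synthetic calibration inputs: channels 0 and 1 carry much larger activations
X = [[random.gauss(0, 1) * (20 if j < 2 else 1) for j in range(n_in)]
     for _ in range(n_samples)]

alphas = [0.0, 0.25, 0.5, 0.75, 1.0]  # alpha = 0 is plain round-to-nearest
errors = {a: output_error(W, X, a) for a in alphas}
best_alpha = min(errors, key=errors.get)
print(f"baseline (alpha=0) error: {errors[0.0]:.4f}")
print(f"best alpha={best_alpha}, error: {errors[best_alpha]:.4f}")
```

Because `alpha = 0` reproduces plain round-to-nearest quantization, the selected setting can never do worse than that baseline on the calibration data.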

## 1. Load required libraries

Let us first install the required libraries: 🤗 `transformers` and `accelerate`, followed by the `autoawq` library.

In [1]:
!pip install -q transformers accelerate

AutoAWQ's prebuilt wheels target CUDA 12.1 by default. If your Google Colab runtime has an older CUDA version, install the wheels built for CUDA 11.8 instead; with CUDA 12.1 or newer, a plain `pip install autoawq` is enough.

In [19]:
!pip install -q -U autoawq[triton]

## LLM-AWQ integration with Transformers

LLM-AWQ is not supported on T4 GPUs (such as the one available on free-tier Google Colab instances), so you need access to compatible hardware and should follow the [instructions](https://github.com/mit-han-lab/llm-awq/tree/main) provided in the llm-awq repository.

You can follow the steps in [this example notebook](https://github.com/mit-han-lab/llm-awq/blob/main/examples/chat_demo.ipynb), then use the [conversion script](https://github.com/mit-han-lab/llm-awq/blob/main/examples/convert_to_hf.py) to convert your model into a transformers-compatible version.

## 2. AutoAWQ integration with Transformers




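Let's first load an already-quantized `Mistral-7B-Instruct` checkpoint using `autoawq`. No quantization step is needed because the weights on the Hub are already in AWQ format.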

In [23]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Pre-quantized AWQ checkpoint hosted on the Hugging Face Hub
model_path = "TheBloke/Mistral-7B-Instruct-v0.1-AWQ"

# Load quantized model directly (no quantization step needed!)
model = AutoAWQForCausalLM.from_quantized(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Example generation: tokenize the prompt, move it to the GPU, generate, and decode
prompt = "Explain quantum computing in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))



Explain quantum computing in simple terms:

Quantum computing is a type of computing that uses quantum mechanics to perform calculations. In classical computing, information is stored in bits, which can be either 0 or 1. In quantum computing, information is stored in quantum bits, or qubits, which can be both 0 and 1 at the same time. This allows quantum computers to perform certain types of calculations much faster than classical computers.


In [25]:
# Quantization settings for this checkpoint (the dict is defined explicitly in section 3)
quant_config

{'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMM'}

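## 3. Modify Config
To load the saved checkpoint with transformers, the model config must carry a `quantization_config` entry. We build one with `AwqConfig` from the AutoAWQ settings and attach it to the model config before saving.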


In [27]:
from transformers import AwqConfig
from huggingface_hub import HfApi

# Define the AWQ config dictionary (matches the current model setup)
quant_config = {
    "w_bit": 4,
    "q_group_size": 128,
    "zero_point": True,
    "version": "GEMM"
}

# Convert to AWQ-compatible config
quantization_config = AwqConfig(
    bits=quant_config["w_bit"],
    group_size=quant_config["q_group_size"],
    zero_point=quant_config["zero_point"],
    version=quant_config["version"].lower(),
).to_dict()

# Assign quant config to the model (ensures compatibility with transformers)
model.model.config.quantization_config = quantization_config

# Save model and tokenizer
quant_path = "mistral-7b-instruct-awq-custom"
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)


('mistral-7b-instruct-awq-custom/tokenizer_config.json',
 'mistral-7b-instruct-awq-custom/special_tokens_map.json',
 'mistral-7b-instruct-awq-custom/tokenizer.model',
 'mistral-7b-instruct-awq-custom/added_tokens.json',
 'mistral-7b-instruct-awq-custom/tokenizer.json')
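
Before uploading, it can be worth validating the saved metadata. The following is a small sketch, assuming the quantization settings end up under a `quantization_config` key in the saved `config.json` and use the same field names passed to `AwqConfig` above (`bits`, `group_size`, `zero_point`, `version`). The helper names and the `EXPECTED` dictionary are hypothetical, and the exact serialized values may differ between transformers versions.

```python
import json
from pathlib import Path

# Assumption: these mirror the AwqConfig arguments used in this notebook.
EXPECTED = {"bits": 4, "group_size": 128, "zero_point": True, "version": "gemm"}

def check_quantization_config(config: dict, expected: dict = EXPECTED) -> list[str]:
    """Return a list of problems; an empty list means the metadata looks consistent."""
    qcfg = config.get("quantization_config")
    if qcfg is None:
        return ["config has no 'quantization_config' entry"]
    problems = []
    for key, want in expected.items():
        got = qcfg.get(key)
        if got != want:
            problems.append(f"{key}: expected {want!r}, found {got!r}")
    return problems

def check_saved_model(model_dir: str) -> list[str]:
    """Read config.json from a saved model folder and validate it."""
    config_path = Path(model_dir) / "config.json"
    return check_quantization_config(json.loads(config_path.read_text()))

# In-memory example so the sketch runs without a saved model on disk
sample_config = {
    "model_type": "mistral",
    "quantization_config": {
        "quant_method": "awq",
        "bits": 4,
        "group_size": 128,
        "zero_point": True,
        "version": "gemm",
    },
}
print(check_quantization_config(sample_config))  # prints []
```

For the folder saved above, the equivalent call would be `check_saved_model("mistral-7b-instruct-awq-custom")`.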

In [28]:
quant_config

{'w_bit': 4, 'q_group_size': 128, 'zero_point': True, 'version': 'GEMM'}
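
The fields in `quant_config` can be read as follows: `w_bit` is the number of bits per weight, `q_group_size` is how many consecutive weights share one scale, and `zero_point` selects asymmetric quantization with an integer offset. The sketch below illustrates that reading in plain Python on random data. It is a simplified illustration rather than AutoAWQ's kernel code; `quantize_group` and `dequantize_group` are hypothetical helper names, and the `version` field (which names the kernel variant) is not modeled.

```python
import random

quant_config = {"w_bit": 4, "q_group_size": 128, "zero_point": True, "version": "GEMM"}

def quantize_group(weights, w_bit):
    """Return (integer codes, scale, zero point) for one group of weights."""
    q_max = 2 ** w_bit - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max
    if scale == 0:
        scale = 1.0  # constant group: any scale works
    zero = min(max(round(-w_min / scale), 0), q_max)
    codes = [min(max(round(w / scale) + zero, 0), q_max) for w in weights]
    return codes, scale, zero

def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate floating-point weights."""
    return [(c - zero) * scale for c in codes]

random.seed(0)
row = [random.gauss(0, 0.02) for _ in range(512)]  # one row of a toy weight matrix

# Split the row into groups of q_group_size weights; each group gets its own scale
group_size = quant_config["q_group_size"]
groups = [row[i:i + group_size] for i in range(0, len(row), group_size)]
quantized = [quantize_group(g, quant_config["w_bit"]) for g in groups]

# Measure the worst reconstruction error in units of one quantization step
max_rel_err = 0.0
for group, (codes, scale, zero) in zip(groups, quantized):
    restored = dequantize_group(codes, scale, zero)
    err = max(abs(a - b) for a, b in zip(group, restored))
    max_rel_err = max(max_rel_err, err / scale)

print(f"{len(groups)} groups, codes in [0, {2 ** quant_config['w_bit'] - 1}]")
print(f"worst error: {max_rel_err:.2f} quantization steps")
```

Smaller groups track local weight ranges more closely at the cost of storing more scales and zero points, which is the trade-off behind the group size of 128 used here.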

## 4. Upload to Hub

In [29]:
# optional -> push the quantized weights to the hub
! huggingface-cli login


    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) y
Token is valid (permission: fineGrained).
The token `Access` has been saved to /root/.cache/hugg

In [32]:
api = HfApi()

# Create the repository on the Hugging Face Hub (exist_ok=True avoids an error on re-runs)
# Make sure the repo_id matches the one you use for uploading
api.create_repo(repo_id="Adiii143/mistral-7b-instruct-awq-custom", repo_type="model", exist_ok=True)

# Now upload the folder
api.upload_folder(
    folder_path="mistral-7b-instruct-awq-custom",
    repo_id="Adiii143/mistral-7b-instruct-awq-custom",
    repo_type="model",
)

CommitInfo(commit_url='https://huggingface.co/Adiii143/mistral-7b-instruct-awq-custom/commit/884cb0188cb98970b82786c37515d0a17802aae6', commit_message='Upload folder using huggingface_hub', commit_description='', oid='884cb0188cb98970b82786c37515d0a17802aae6', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Adiii143/mistral-7b-instruct-awq-custom', endpoint='https://huggingface.co', repo_type='model', repo_id='Adiii143/mistral-7b-instruct-awq-custom'), pr_revision=None, pr_num=None)

Now we can use our model with the transformers library to run inference!

## 5. Use the Uploaded Model

In [35]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Adiii143/mistral-7b-instruct-awq-custom")
# Load directly onto the GPU; loading on CPU and then calling .to(0) triggers an AWQ device warning
model = AutoModelForCausalLM.from_pretrained("Adiii143/mistral-7b-instruct-awq-custom", device_map="cuda")

text = "Hello my name is"
inputs = tokenizer(text, return_tensors="pt").to(0)

# Generate a short continuation and decode it back to text
out = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Hello my name is John.

I


## Loading a large model on Google Colab

Let's now use this integration to load a model whose float16 weights would not fit on a single free-tier Google Colab GPU. The integration is compatible with any AWQ model under the [`TheBloke`](https://huggingface.co/TheBloke) namespace. For our demo we will use `TheBloke/Llama-2-13B-chat-AWQ`. In float16 that model's weights would need ~26GB (13B parameters × 2 bytes each), but with 4-bit AWQ it runs comfortably on a 16GB GPU.
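
The ~26GB figure, and the reason a 4-bit checkpoint fits on a 16GB GPU, can be checked with back-of-the-envelope arithmetic. The sketch below is an estimate for the weights only, under stated assumptions: a nominal 13 billion parameters, every weight quantized, and one 16-bit scale plus one 4-bit zero point stored per group of 128 weights. The function names are illustrative.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

def awq_bits_per_weight(w_bit=4, q_group_size=128, scale_bits=16, zero_bits=4):
    """Average bits per weight when each group shares one scale and one zero point."""
    return w_bit + (scale_bits + zero_bits) / q_group_size

n_params = 13e9  # nominal parameter count of a 13B model
fp16_gb = weight_memory_gb(n_params, 16)
awq_gb = weight_memory_gb(n_params, awq_bits_per_weight())

print(f"float16 weights: {fp16_gb:.1f} GB")    # float16 weights: 26.0 GB
print(f"4-bit AWQ weights: {awq_gb:.2f} GB")   # 4-bit AWQ weights: 6.75 GB
```

Real checkpoints tend to be somewhat larger than this estimate, since some layers are typically kept in higher precision, and inference also needs memory for activations and the KV cache.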


In [36]:
# Model and tokenizer paths
model_id = "TheBloke/Llama-2-13B-chat-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the quantized model
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda")

In [39]:
text = "User:\nHello can you provide me with top-3 cool places to visit in Paris?\n\nAssistant:\n"

# Tokenize the input
inputs = tokenizer(text, return_tensors="pt").to(0)

# Generate text with the model
out = model.generate(**inputs, max_new_tokens=300)

# Decode and print the result
print(tokenizer.decode(out[0], skip_special_tokens=True))

User:
Hello can you provide me with top-3 cool places to visit in Paris?

Assistant:
Bonjour! Certainly! Here are my top three cool places to visit in Paris:

1. The Musée d'Orsay: This museum is home to an impressive collection of Impressionist and Post-Impressionist art, including works by Monet, Renoir, and Van Gogh. The building itself is also a work of art, with a beautiful Beaux-Arts style facade and a stunning interior courtyard.
2. The Palais de Tokyo: This contemporary art museum is located in a former palace and features a diverse range of exhibitions and installations. The building's modern architecture and urban vibe make it a must-visit for anyone looking for a unique and edgy experience.
3. The Jardin du Luxembourg: This beautiful garden is located in the heart of the city and features stunning views of the Eiffel Tower and the Luxembourg Palace. The garden is also home to several sculptures and fountains, as well as a charming café where you can relax and enjoy a coffee 

## Key Takeaways
- AWQ quantization lets us run large models (like `Llama-2-13B-chat-AWQ`) on a single GPU with 16GB of VRAM.
- The memory savings come from storing weights in 4-bit precision instead of float16.
- Inference runs easily on Colab through AutoAWQ or transformers, with little accuracy loss because AWQ protects the weight channels that matter most to the activations.