# Weight-Only INT4 Quantization with AWQ using TensorRT ModelOpt PTQ

This notebook demonstrates how to apply weight-only INT4 quantization using the Activation-aware Weight Quantization (AWQ) technique via NVIDIA TensorRT Model Optimizer (ModelOpt) post-training quantization (PTQ).

Unlike standard min-max calibration, AWQ does not quantize activations. Instead, it uses knowledge of activation ranges to inform how model weights are quantized.
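
The idea can be illustrated with a toy example. The sketch below is not ModelOpt's implementation; the helper names, the `alpha` exponent, and the activation statistics are assumptions chosen for illustration. It shows the core trick: weights on input channels that see large activations are scaled up before rounding, and the activations are divided by the same factor, so the layer output is unchanged in exact arithmetic while the important channels get a finer share of the INT4 grid.

```python
def awq_scales(act_abs_mean, alpha=0.5):
    """Per-input-channel scales s_j = mean|x_j| ** alpha (alpha is an assumed tunable)."""
    return [a ** alpha for a in act_abs_mean]


def matvec(weight, x):
    """Plain matrix-vector product for a weight stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


# Toy layer: 2 outputs x 3 inputs. Channel 0 sees much larger activations.
weight = [[0.2, -1.0, 0.5], [0.7, 0.1, -0.3]]
x = [8.0, 0.5, 1.0]
act_abs_mean = [8.0, 0.5, 1.0]  # made-up calibration statistics

scales = awq_scales(act_abs_mean)
scaled_weight = [[w * s for w, s in zip(row, scales)] for row in weight]
scaled_x = [v / s for v, s in zip(x, scales)]

reference = matvec(weight, x)
rescaled = matvec(scaled_weight, scaled_x)
print(reference, rescaled)
```

Only `scaled_weight` would be rounded to INT4. The activations stay in floating point, which is why the method is called weight-only.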

Key Dependencies: 
- nvidia-modelopt
- torch
- transformers

## Quantization with AWQ

### 1. Import Dependencies
Import all necessary libraries:

- `torch`: Used for tensor computation and model execution.

- `modelopt.torch.quantization`: Core API for quantization using TensorRT ModelOpt PTQ.

- `transformers`: Hugging Face interface to load and tokenize LLMs.

- `get_dataset_dataloader` and `create_forward_loop`: Utilities to prepare calibration data and run calibration.

- `login`: Required to download gated models (like Llama 3.1) from Hugging Face.

💡 If you're using this in Colab or a restricted environment, make sure all packages are installed and CUDA is available.
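
A quick environment check can catch missing packages before the imports fail. This is a minimal sketch: `missing_packages` is a hypothetical helper, and the module names listed are the import names assumed for the dependencies above (`nvidia-modelopt` is imported as `modelopt`).

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


required = ["torch", "transformers", "modelopt", "huggingface_hub"]
missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")

# Only query CUDA when torch is importable.
if "torch" not in missing:
    import torch

    print("CUDA available:", torch.cuda.is_available())
```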

In [1]:
import torch
from huggingface_hub import login
from transformers import AutoModelForCausalLM, AutoTokenizer

import modelopt.torch.quantization as mtq
from modelopt.torch.utils.dataset_utils import create_forward_loop, get_dataset_dataloader

### 2. Set Configurations and Login to Hugging Face

Set the model you want to quantize (Llama-3.1-8B-Instruct) and the dataset to use for calibration (cnn_dailymail).

- `batch_size` and `calib_samples` control how much data is used during calibration; more samples improve accuracy but increase calibration time.

🔐 You must `login()` with a valid Hugging Face token to access gated models. Get your token at hf.co/settings/tokens.

🔁 You can substitute your own model or dataset as long as the inputs are compatible with the model's tokenizer.

In [None]:
model_name = "meta-llama/Llama-3.1-8B-Instruct"
dataset_name = "cnn_dailymail"
batch_size = 8
calib_samples = 512

login()
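
To get a feel for the calibration cost, the number of forward passes follows directly from these two settings. The helper below is a hypothetical illustration and assumes the dataloader keeps a final partial batch rather than dropping it.

```python
import math


def num_calibration_batches(calib_samples, batch_size):
    """Number of batches needed to cover all samples (last batch may be partial)."""
    return math.ceil(calib_samples / batch_size)


print(num_calibration_batches(512, 8))  # 64
```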

### 3. Load Model and Tokenizer

- Load the model into GPU memory.
- Set `pad_token` to `eos_token` to prevent padding errors in decoder-only models like Llama.

💡 Always check for token mismatch warnings in the console when loading the tokenizer.

🧠 Setting `pad_token` helps avoid errors during batch generation or dataset collation.

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
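
To see why a pad token is needed, consider what batching has to do with sequences of different lengths. The sketch below is a simplified stand-in for what a tokenizer does when padding is enabled; `pad_batch` is a hypothetical helper, and right padding is used only for simplicity.

```python
def pad_batch(sequences, pad_id):
    """Right-pad token id lists to equal length and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


ids, mask = pad_batch([[5, 6, 7], [8]], pad_id=2)
print(ids)
print(mask)
```

Without a defined pad id there is nothing to fill the shorter rows with, which is the error that assigning `eos_token` avoids. The attention mask tells the model to ignore the filler positions.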

### 4. Configure Dataloader
- Load a few batches of real-world text to extract representative activation ranges.
- The calibration dataset should reflect your expected inference use case for best results.

⚠️ More samples generally improve accuracy but take longer. We recommend 512 samples or more.

🧪 Use your target task's dataset (e.g., chat, summarization, code) for domain-specific calibration.

In [None]:
dataloader = get_dataset_dataloader(
    dataset_name=dataset_name,
    tokenizer=tokenizer,
    batch_size=batch_size,
    num_samples=calib_samples,
    device="cuda",
)

### 5. Create the Forward Loop
- Wraps your `dataloader` into a loop that feeds batches into the model.
- Required by `mtq.quantize()` to perform the calibration pass.

🧰 You can create your own custom forward loop if you're doing multi-modal or conditional generation tasks.

In [5]:
forward_loop = create_forward_loop(dataloader=dataloader)
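
A custom forward loop can be sketched as follows, assuming the loop is a callable that receives the model as its only argument and that each batch is a dictionary of keyword arguments. `make_forward_loop` and `CountingModel` are hypothetical names used only for this illustration; with a real PyTorch model you would typically also disable gradient tracking inside the loop.

```python
def make_forward_loop(batches):
    """Build a callable that runs every calibration batch through the model."""

    def forward_loop(model):
        for batch in batches:
            # Outputs are discarded; calibration only needs the statistics
            # that the inserted quantizers observe during the forward pass.
            model(**batch)

    return forward_loop


class CountingModel:
    """Stand-in for a real model so the sketch runs without a GPU."""

    def __init__(self):
        self.calls = 0

    def __call__(self, input_ids, attention_mask=None):
        self.calls += 1


dummy = CountingModel()
make_forward_loop([{"input_ids": [1, 2]}, {"input_ids": [3]}])(dummy)
print(dummy.calls)
```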

### 6. Set Quantization Configuration and Apply
🔧 Retrieve and customize the AWQ (Activation-aware Weight Quantization) config for INT4 quantization.
- `mtq.INT4_AWQ_CFG` provides a pre-tuned config optimized for low-bit weight quantization with block-wise granularity.
- `block_sizes` controls how quantization groups are split across dimensions. This affects compression ratio, memory layout, and accuracy.
- The entry for the last dimension (key `-1`, typically set to 128 or 64) defines the quantization block size along each row of weights.

💡 You can experiment with smaller block sizes (e.g., 64 or 32) for better accuracy at the cost of less compression.
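
The effect of block size on accuracy can be seen in a toy example. The sketch below uses plain symmetric rounding with one scale per block, which is an assumption made for illustration; it omits AWQ's activation-aware scaling, and ModelOpt's actual quantizer may compute scales differently. The row and both helper functions are made up.

```python
def quantize_row_int4(row, block_size):
    """Symmetric block-wise INT4 rounding; returns (integer codes, one scale per block)."""
    codes, scales = [], []
    for start in range(0, len(row), block_size):
        block = row[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        # Clamp to the signed 4-bit range [-8, 7].
        codes.extend(max(-8, min(7, round(v / scale))) for v in block)
    return codes, scales


def dequantize_row(codes, scales, block_size):
    """Map integer codes back to floats using the scale of their block."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


# Small values followed by large values: a single shared scale hurts the small ones.
row = [0.01, -0.02, 0.03, 0.04, 1.0, -2.0, 3.0, 4.0]
errors = {}
for block_size in (8, 4):
    codes, scales = quantize_row_int4(row, block_size)
    restored = dequantize_row(codes, scales, block_size)
    errors[block_size] = sum(abs(a - b) for a, b in zip(row, restored))
    print(block_size, errors[block_size])
```

With one block of 8, the small values share a scale set by the largest weight and all round to zero. Splitting the row into blocks of 4 gives them their own scale, at the cost of storing one more scale.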

In [None]:
# Copy the default AWQ config so that edits do not mutate the shared mtq.INT4_AWQ_CFG,
# then optionally adjust block size
import copy

quant_cfg = copy.deepcopy(mtq.INT4_AWQ_CFG)
weight_quantizer = quant_cfg["quant_cfg"]["*weight_quantizer"]
if isinstance(weight_quantizer, list):
    weight_quantizer = weight_quantizer[0]
weight_quantizer["block_sizes"][-1] = 128  # Optional: override block size

# Apply AWQ quantization
model = mtq.quantize(model, quant_cfg, forward_loop=forward_loop)
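
The compression trade-off behind the block size can be estimated with simple arithmetic. The sketch below assumes one 16-bit scale per block and no zero point, rounds the model to 8 billion quantized parameters, and ignores layers that stay in higher precision; these are all illustrative assumptions, and the helper names are hypothetical.

```python
def effective_bits_per_weight(block_size, weight_bits=4, scale_bits=16):
    """Average storage per weight when each block carries one scale (assumed layout)."""
    return weight_bits + scale_bits / block_size


def weight_gigabytes(num_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


print("16-bit baseline:", weight_gigabytes(8e9, 16), "GB")
for block_size in (128, 64, 32):
    bits = effective_bits_per_weight(block_size)
    print(block_size, bits, round(weight_gigabytes(8e9, bits), 2), "GB")
```

Under these assumptions, halving the block size doubles the per-weight overhead of the scales, which is the "less compression" cost of smaller blocks.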

### 7. Quick Test of Quantized Model
- Test the quantized model with a simple prompt.
- This helps verify that quantization didn't break forward generation or drastically harm output quality.

🧪 You can test on more complex prompts to evaluate qualitative performance further.

In [None]:
model = torch.compile(model)
inputs = tokenizer("Hello world", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=20)

In [None]:
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
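
For a slightly more quantitative check, you could compare the generated token ids against those produced by the unquantized model for the same prompt. The helper below is a hypothetical sketch; it assumes you saved the baseline ids before quantization and used deterministic (greedy) decoding for both runs. The id lists shown are made up.

```python
def token_agreement(reference_ids, candidate_ids):
    """Fraction of positions where both sequences hold the same token id."""
    length = max(len(reference_ids), len(candidate_ids))
    if length == 0:
        return 1.0
    matches = sum(a == b for a, b in zip(reference_ids, candidate_ids))
    return matches / length


baseline_ids = [101, 7, 42, 9]   # made-up ids from the original model
quantized_ids = [101, 7, 13, 9]  # made-up ids from the quantized model
print(token_agreement(baseline_ids, quantized_ids))  # 0.75
```

A low score is not necessarily a failure, since an early divergence changes every later token, but a sharp drop across many prompts is a signal worth investigating.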

### 8. Export Quantized Checkpoint
- Save the quantized model in Hugging Face-compatible format for reuse or deployment.
- Export includes weights and config files in standard structure.

📁 This allows you to upload it to the Hugging Face Hub or load it later with `from_pretrained()`.

🧰 You can also use this exported model with inference engines like vLLM, SGLang, or TensorRT-LLM.

In [None]:
from modelopt.torch.export import export_hf_checkpoint

export_path = "./quantized_model_awq/"
export_hf_checkpoint(model, export_dir=export_path)
tokenizer.save_pretrained(export_path)
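
After exporting, a quick sanity check of the output directory can catch an incomplete export. This sketch assumes the export writes a `config.json` and at least one `.safetensors` weight file, which is the usual Hugging Face layout; adjust the checks if your export differs. `check_export_dir` is a hypothetical helper.

```python
from pathlib import Path


def check_export_dir(export_dir):
    """Return a list of problems found in an exported checkpoint directory."""
    root = Path(export_dir)
    problems = []
    if not (root / "config.json").is_file():
        problems.append("config.json is missing")
    if not any(root.glob("*.safetensors")):
        problems.append("no .safetensors weight files found")
    return problems


print(check_export_dir("./quantized_model_awq/"))
```

An empty list means both expected pieces are present.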

# ✅ Conclusion & Key Takeaways
✅ AWQ (Activation-aware Weight Quantization) is an efficient, deployment-ready method for compressing large language models without quantizing activations.

✅ Using INT4 weight-only quantization, AWQ significantly reduces model memory footprint and can improve inference throughput, which makes it well suited to GPU inference workloads.

✅ Block-wise quantization (e.g., block size = 128) enables hardware-friendly tensor layouts that optimize for tensor core utilization on NVIDIA GPUs.

✅ The TensorRT ModelOpt PTQ API provides a flexible, high-level interface for experimenting with quantization formats, including full customization of AWQ configs.

✅ Exported models remain compatible with Hugging Face interfaces, making them easy to use in production pipelines or deploy via inference frameworks like vLLM or TensorRT-LLM.