# Oumi - Model Quantization Tutorial

This tutorial demonstrates how to use AWQ (Activation-aware Weight Quantization) to compress large language models to 4-bit weights while preserving most of their accuracy.

## Prerequisites

- GPU with CUDA support (required for AWQ)
- Oumi installed with GPU support: `pip install "oumi[gpu]"` (the quotes keep shells such as zsh from expanding the brackets)
- AutoAWQ library: `pip install autoawq`
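
Before running the cells below, it can help to confirm that the prerequisites are importable. The following is a minimal sketch rather than an official Oumi check: the helper name `check_prerequisites` is made up for illustration, and it assumes the import names `torch`, `oumi`, and `awq` (the last one is the module that AutoAWQ installs).

```python
import importlib.util


def check_prerequisites(packages=("torch", "oumi", "awq")):
    """Return a dict mapping each top-level package name to whether it is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    status = check_prerequisites()
    for name, found in status.items():
        print(f"{name}: {'found' if found else 'MISSING'}")

    # Only query CUDA when PyTorch is actually installed.
    if status["torch"]:
        import torch

        print(f"CUDA available: {torch.cuda.is_available()}")
```

If any package is reported as missing, or CUDA is unavailable, the quantization step in the next section is unlikely to run.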

## 1. Basic AWQ Quantization

Let's start by quantizing TinyLlama to 4-bit using AWQ:

In [None]:
from oumi.core.configs import QuantizationConfig, ModelParams
from oumi.quantize import quantize

# Configure quantization
config = QuantizationConfig(
    model=ModelParams(model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0"),
    method="awq_q4_0",  # 4-bit AWQ quantization
    output_path="tinyllama_awq_4bit.pytorch",
    output_format="pytorch",
    calibration_samples=512,  # Number of calibration samples
)

# Run quantization
print("Starting AWQ quantization...")
result = quantize(config)

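# Calculate sizes and compression
# Note: 2.2 is a decimal-GB estimate (1.1e9 params x 2 bytes), while the
# quantized size is divided by 1024**3 (GiB), so the ratio is approximate.
original_size_gb = 2.2  # TinyLlama 1.1B in fp16
quantized_size_gb = result["quantized_size_bytes"] / (1024**3)  # type: ignore
compression_ratio = original_size_gb / quantized_size_gb

print("\n✅ Quantization complete!")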
print(f"Original size (fp16): {original_size_gb:.2f}GB")
print(f"Quantized size (4-bit): {quantized_size_gb:.2f}GB")
print(f"Compression ratio: {compression_ratio:.1f}x")
print(
    f"Size reduction: {((original_size_gb - quantized_size_gb) / original_size_gb * 100):.1f}%"
)

Starting AWQ quantization...
[2025-08-01 03:40:43,383][oumi][rank0][pid:2763901][MainThread][INFO]][main.py:52] Starting quantization of model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
[2025-08-01 03:40:43,386][oumi][rank0][pid:2763901][MainThread][INFO]][main.py:53] Quantization method: awq_q4_0
[2025-08-01 03:40:43,387][oumi][rank0][pid:2763901][MainThread][INFO]][main.py:54] Output path: tinyllama_awq_4bit.pytorch
[2025-08-01 03:40:43,511][oumi][rank0][pid:2763901][MainThread][INFO]][awq_quantizer.py:94] Starting AWQ quantization pipeline...
[2025-08-01 03:40:43,512][oumi][rank0][pid:2763901][MainThread][INFO]][awq_quantizer.py:177] Loading model for AWQ quantization: TinyLlama/TinyLlama-1.1B-Chat-v1.0
[2025-08-01 03:40:43,513][oumi][rank0][pid:2763901][MainThread][INFO]][awq_quantizer.py:180] 📥 Loading base model...


Fetching 10 files:   0%|          | 0/10 [00:00<?, ?it/s]

[2025-08-01 03:40:43,768][oumi][rank0][pid:2763901][MainThread][INFO]][awq_quantizer.py:199] 🔧 Configuring AWQ quantization parameters...
[2025-08-01 03:40:43,769][oumi][rank0][pid:2763901][MainThread][INFO]][awq_quantizer.py:216] ⚙️  AWQ config: {'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMM'}
[2025-08-01 03:40:43,770][oumi][rank0][pid:2763901][MainThread][INFO]][awq_quantizer.py:217] 📊 Using 512 calibration samples
[2025-08-01 03:40:43,771][oumi][rank0][pid:2763901][MainThread][INFO]][awq_quantizer.py:218] 🧮 Starting AWQ calibration and quantization...


Repo card metadata block was not found. Setting CardData to empty.
Token indices sequence length is longer than the specified maximum sequence length for this model (8322 > 2048). Running this sequence through the model will result in indexing errors
AWQ: 100%|██████████| 22/22 [05:52<00:00, 16.01s/it]


[2025-08-01 03:46:39,020][oumi][rank0][pid:2763901][MainThread][INFO]][awq_quantizer.py:139] PyTorch format requested. Saving AWQ model...
[2025-08-01 03:46:40,228][oumi][rank0][pid:2763901][MainThread][INFO]][awq_quantizer.py:147] ✅ AWQ quantization successful! Saved as pytorch format.
[2025-08-01 03:46:40,229][oumi][rank0][pid:2763901][MainThread][INFO]][awq_quantizer.py:148] 📊 Final size: 734.0 MB
[2025-08-01 03:46:40,230][oumi][rank0][pid:2763901][MainThread][INFO]][awq_quantizer.py:159] 💡 Use this model with: AutoAWQForCausalLM.from_quantized('tinyllama_awq_4bit.pytorch')
[2025-08-01 03:46:40,232][oumi][rank0][pid:2763901][MainThread][INFO]][main.py:98] Quantization completed successfully!

✅ Quantization complete!
Original size (fp16): 2.20GB
Quantized size (4-bit): 0.72GB
Compression ratio: 3.1x
Size reduction: 67.4%

🔍 Debug info:
Raw quantized size: 769,872,152 bytes
Expected 4-bit size: ~0.55GB
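
The measured file is larger than the naive 4-bit figure of ~0.55GB. The back-of-the-envelope sketch below shows one plausible explanation. It rests on assumptions that are not taken from the Oumi or AutoAWQ source: each group of 128 weights stores one fp16 scale and one 4-bit zero point, and the embedding and output head (assumed vocabulary 32,000 and hidden size 2,048) stay in fp16. The function name `estimate_awq_size_bytes` is illustrative only.

```python
def estimate_awq_size_bytes(quantized_params, fp16_params, w_bit=4,
                            group_size=128, scale_bits=16, zero_bits=4):
    """Rough size in bytes of a group-quantized checkpoint (assumed layout)."""
    # Each group of `group_size` weights shares one scale and one zero point,
    # which adds a small per-weight overhead on top of `w_bit`.
    bits_per_weight = w_bit + (scale_bits + zero_bits) / group_size
    quantized_bytes = quantized_params * bits_per_weight / 8
    fp16_bytes = fp16_params * 2  # parameters left unquantized, 2 bytes each
    return quantized_bytes + fp16_bytes


total_params = 1.1e9                # "1.1B" from the model name
fp16_params = 2 * 32_000 * 2_048    # assumption: embedding + output head in fp16

naive = total_params * 4 / 8
estimate = estimate_awq_size_bytes(total_params - fp16_params, fp16_params)

print(f"Naive 4-bit size: {naive / 1e9:.2f} GB")
print(f"With overheads:   {estimate / 1e9:.2f} GB")
print(f"Measured on disk: {769_872_152 / 1e9:.2f} GB")
```

Under these assumptions the estimate lands within about 1% of the 769,872,152 bytes reported above, which suggests that unquantized layers account for most of the gap and per-group metadata for the rest.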


## 2. Using the Quantized Model

Now let's load and use the quantized model for inference:

In [9]:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import torch

# Load the quantized model
model_path = "tinyllama_awq_4bit.pytorch"

print(f"Loading AWQ model from: {model_path}")
model = AutoAWQForCausalLM.from_quantized(
    model_path,
    fuse_layers=False,  # Disable layer fusion to avoid compatibility issues
    device_map="auto",
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"✅ Model loaded! GPU memory: {torch.cuda.memory_allocated() / 1024**3:.2f}GB")

Loading AWQ model from: tinyllama_awq_4bit.pytorch


Replacing layers...: 100%|██████████| 22/22 [00:02<00:00,  8.15it/s]


  0%|          | 0/509 [00:00<?, ?w/s]

✅ Model loaded! GPU memory: 0.42GB


In [10]:
# Test inference
prompt = "Explain the benefits of model quantization in simple terms:"

# Tokenize
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate
print(f"Prompt: {prompt}\n")
print("Generating response...")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
    )

# Decode and print response
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Response:\n{response}")

Prompt: Explain the benefits of model quantization in simple terms:

Generating response...
Response:
Explain the benefits of model quantization in simple terms:

Model quantization is a technique that can achieve significant speed-up by reducing the amount of storage required to hold model parameters. In the context of deep learning, it reduces the storage needed to represent the weights of a model, which in turn reduces the size of the model and makes it easier to transport, store, and transfer. This helps with data reduction, which is crucial for real-time deployment scenarios. Moreover, model quantization can also lead to better utilization of processing capabilities of the GPU, since it reduces the memory requirements for storing model parameters. This reduces the amount of memory required to store and manipulate the model, and enables the usage of the entire memory capacity of the GPU. Overall, model quantization is an


## 3. Advanced Configuration

AWQ offers several configuration options for tuning the quantization process:

In [11]:
# Advanced AWQ configuration with more calibration samples
advanced_config = QuantizationConfig(
    model=ModelParams(model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0"),
    method="awq_q4_0",
    output_path="tinyllama_awq_advanced.pytorch",
    # AWQ-specific parameters
    calibration_samples=1024,  # More samples for better calibration
    awq_group_size=128,  # Weight grouping size
    awq_version="GEMM",  # AWQ kernel variant (GEMM generally suits batched inference and long contexts)
    awq_zero_point=True,  # Use zero-point quantization
)

print("Configuration:")
print(f"- Calibration samples: {advanced_config.calibration_samples}")
print(f"- Group size: {advanced_config.awq_group_size}")
print(f"- AWQ version: {advanced_config.awq_version}")
print(f"- Zero point: {advanced_config.awq_zero_point}")

Configuration:
- Calibration samples: 1024
- Group size: 128
- AWQ version: GEMM
- Zero point: True
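
To compare several settings, the parameters above can be generated as a small grid. This is a sketch under stated assumptions: `awq_sweep` and the output file naming are invented for illustration, and whether a given group size is accepted depends on the installed Oumi and AutoAWQ versions.

```python
from itertools import product


def awq_sweep(calibration_samples=(512, 1024), group_sizes=(64, 128)):
    """Yield one keyword-argument dict per combination of settings."""
    for samples, group in product(calibration_samples, group_sizes):
        yield {
            "method": "awq_q4_0",
            # Hypothetical naming scheme so each run writes to its own path.
            "output_path": f"tinyllama_awq_s{samples}_g{group}.pytorch",
            "calibration_samples": samples,
            "awq_group_size": group,
            "awq_version": "GEMM",
            "awq_zero_point": True,
        }


for kwargs in awq_sweep():
    print(kwargs["output_path"])
    # In the notebook, each dict could be expanded into a config, for example:
    # QuantizationConfig(
    #     model=ModelParams(model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0"),
    #     **kwargs,
    # )
```

Each quantization run took several minutes in the log above, so a sweep like this is best kept small.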


## Summary

In this tutorial, you learned how to:

1. ✅ Quantize models using AWQ to 4-bit precision
2. ✅ Load and use AWQ quantized models for inference
3. ✅ Configure AWQ parameters for better quality


### Key Benefits of AWQ:
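- **Memory Efficiency**: About 67% smaller than fp16 in this tutorial (2.20GB to 0.72GB); the theoretical ceiling for 4-bit weights is 75%
- **Speed**: Potentially faster inference, since less weight data has to be read from GPU memory per generated token
- **Quality**: Usually a small accuracy loss compared with fp16; evaluate on your own task to confirm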
- **Compatibility**: Works with most transformer models
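
The speed benefit can be made concrete with a rough bound. When single-stream decoding is limited by reading weights from GPU memory, tokens per second cannot exceed bandwidth divided by model size. The sketch below uses the sizes reported in this tutorial and a hypothetical bandwidth of 500 GB/s; it ignores the KV cache, activations, and dequantization cost, so real throughput will be lower and the relative gain may be smaller.

```python
def max_decode_tokens_per_second(model_bytes, bandwidth_bytes_per_s):
    """Upper bound on decoding speed if every token reads all weights once."""
    return bandwidth_bytes_per_s / model_bytes


BANDWIDTH = 500e9         # hypothetical GPU memory bandwidth (500 GB/s)
fp16_bytes = 2.2e9        # fp16 size used earlier in this tutorial
awq_bytes = 769_872_152   # raw quantized size reported earlier

for label, size in (("fp16", fp16_bytes), ("AWQ 4-bit", awq_bytes)):
    bound = max_decode_tokens_per_second(size, BANDWIDTH)
    print(f"{label}: at most {bound:.0f} tokens/s")
```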

Happy quantizing! 🚀