GPTQ = original algorithm/paper (IST Austria and ETH Zurich, 2022).

AutoGPTQ = practical, community-maintained Python library (integrated with Hugging Face Transformers and Optimum) that made GPTQ mainstream and easy to use.

https://github.com/AutoGPTQ/AutoGPTQ



| Feature / Library       | **AutoGPTQ**                          | **bitsandbytes**                                      | **gptqmodel**                                                  |
| ----------------------- | ------------------------------------- | ----------------------------------------------------- | -------------------------------------------------------------- |
| **Origin**              | Hugging Face ecosystem (2023)         | Tim Dettmers (2022)                                   | ModelCloud (2024–2025)                                         |
| **Quantization Method** | GPTQ (post-training, 4/8-bit)         | Linear quantization (INT8, NF4, FP4, 8-bit optimizer) | GPTQ v1, GPTQ v2, QQQ, EoRA, GAR                               |
| **Target Use**          | Easy Hugging Face integration         | Training & inference memory savings                   | **Full production deployment toolkit**                         |
| **Hardware Support**    | CUDA (Nvidia GPU)                     | CUDA (Nvidia GPU) only                                | CUDA (Nvidia), ROCm (AMD), XPU (Intel), MPS (Apple), CPU       |
| **Integration**         | Hugging Face Transformers             | PyTorch Optimizer + HF integration                    | HF Transformers, vLLM, SGLang, Optimum, Peft                   |
| **Inference Kernels**   | ExLlama, Triton, Marlin (via plugins) | cuBLAS-based INT8 kernels                             | Marlin, ExLlama v2, Torch fused, BitBLAS                       |
| **Training Support**    | No                                    | Optimizer states (8-bit Adam, NF4 LoRA)             | Partial (LoRA + EoRA fine-tune on quantized model)             |
| **Flexibility**         | Focused on GPTQ only                  | Training + inference memory efficiency                | **Dynamic per-layer configs, adapters, GAR, eval integration** |
| **Evaluation**          | None built-in                         | None built-in                                         | Built-in `lm-eval` & `evalplus` hooks                          |
| **Ease of Use**         | Very easy (HF-style API)              | Very easy (drop-in optimizer / `load_in_8bit=True`)   | More advanced config, but still has high-level API             |
| **Community Models**    | Huge (TheBloke, HF Hub)               | Many LoRA + finetune models                           | Growing rapidly (ModelCloud & HF Hub vortex/EoRA releases)     |
| **Strengths**           | Easy Hugging Face usage               | Simple + effective for training                       | All-in-one production toolkit, multi-hardware                  |
| **Weaknesses**          | Limited to GPTQ only                  | Nvidia-only, no GPTQ                                  | More complex, newer ecosystem (still maturing)                 |


### Versions & Techniques

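GPTQ v1, GPTQ v2 → different implementations of, and improvements to, the GPTQ algorithm

QQQ → Quality Quattuor-Bit Quantization, a W4A8 scheme (4-bit weights, 8-bit activations) aimed at inference speed

EoRA → Eigenspace Low-Rank Approximation (a training-free low-rank adapter that compensates for quantization error)

GAR → Group Aware Reordering (an activation-reordering strategy for grouped quantization; see `act_group_aware` below)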

https://github.com/ModelCloud/GPTQModel

pip install gptqmodel → install the gptqmodel package.

-v → verbose mode, so pip prints more details about what it is doing (downloads, builds, etc.).

--no-build-isolation → disables pip's default build isolation when installing from source (sdist). pip will not create an isolated environment for the build, so you are responsible for having all build dependencies installed already.
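
Because `--no-build-isolation` builds against whatever is already installed, it can help to confirm that the build-time packages are importable before running pip. A minimal sketch using only the standard library; the package list (`setuptools`, `wheel`, `torch`) is an assumption for illustration and is not taken from the gptqmodel documentation, and `missing_modules` is a hypothetical helper:

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the subset of top-level module names that cannot be imported here."""
    return [name for name in names if find_spec(name) is None]


# Assumed build-time requirements; adjust to what the package's build actually needs.
assumed_build_deps = ["setuptools", "wheel", "torch"]
print("Missing build dependencies:", missing_modules(assumed_build_deps))
```

An empty list suggests the source build has what it needs; anything listed would have to be installed first.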

In [None]:
!pip install -v gptqmodel --no-build-isolation

In [None]:
!pip install "protobuf<6.30"

In [None]:
from gptqmodel import GPTQModel

In [None]:
import gptqmodel
dir(gptqmodel)

https://huggingface.co/collections/ModelCloud/vortex-673743382af0a52b2a8b9fe2

In [None]:
model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5")

In [None]:
result = model.generate("Uncovering deep insights begins with")[0] # tokens

In [None]:
print(model.tokenizer.decode(result)) # string output

<|begin_of_text|>Uncovering deep insights begins with a deep understanding of the underlying principles and concepts that govern the behavior of the system in question. In

In [None]:
model_id = "meta-llama/Llama-3.2-1B-Instruct"

In [None]:
quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"

In [None]:
from datasets import load_dataset
calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train"
  ).select(range(1024))["text"]

In [None]:
from gptqmodel import GPTQModel, QuantizeConfig

In [None]:
quant_config = QuantizeConfig(bits=4, group_size=128)
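
To make `bits=4, group_size=128` concrete, here is a sketch of plain round-to-nearest group quantization in pure Python. It only shows how the weights in one group share a scale and a zero-point. GPTQ additionally compensates rounding error using second-order (Hessian) information, which this sketch leaves out. `quantize_group`, `dequantize_group` and `quantize_row` are hypothetical helper names, not part of gptqmodel.

```python
def quantize_group(weights, bits=4, sym=True):
    """Round-to-nearest quantization of one group; returns (codes, scale, zero)."""
    qmax = 2 ** bits - 1                      # 15 for 4-bit
    if sym:
        # Symmetric: grid centered on zero, zero-point fixed at mid-range.
        absmax = max(abs(w) for w in weights) or 1.0
        scale = 2 * absmax / qmax
        zero = (qmax + 1) // 2                # 8 for 4-bit
    else:
        # Asymmetric: grid stretched between min and max. The range is extended
        # to include 0 so the zero-point fits in the same bit width.
        lo = min(min(weights), 0.0)
        hi = max(max(weights), 0.0)
        scale = (hi - lo) / qmax or 1.0
        zero = round(-lo / scale)
    codes = [min(qmax, max(0, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate float weights."""
    return [(c - zero) * scale for c in codes]


def quantize_row(row, group_size=128, **kwargs):
    """Split a weight row into groups and quantize each group independently."""
    groups = [row[i:i + group_size] for i in range(0, len(row), group_size)]
    return [quantize_group(g, **kwargs) for g in groups]


# Tiny demo with group_size=4: the second group is skewed (all positive).
row = [0.12, -0.40, 0.33, 0.05, 2.0, 2.5, 3.1, 2.2]
for sym in (True, False):
    scales = [scale for _, scale, _ in quantize_row(row, group_size=4, sym=sym)]
    print("sym" if sym else "asym", "scales per group:", scales)
```

On the skewed group the asymmetric grid ends up with a smaller step size, which is the usual motivation for `sym=False`.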

In [None]:
model = GPTQModel.load(model_id, quant_config)

### QuantizeConfig Parameters Explained

bits: int = 4
Number of bits for quantization (2, 3, 4, 8). Lower bits = more compression, less accuracy.

dynamic: Dict[...] | None
Allows per-layer overrides. Example: quantize some layers with 8 bits, others skip quantization.

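group_size: int = 128
Number of consecutive weights that share one scale and zero-point.
Smaller = better accuracy, but more scales to store (slightly larger model). Larger = more compression, less accuracy.

damp_percent: float = 0.05
Used in Hessian damping (numerical stability): this fraction of the mean Hessian diagonal is added to the diagonal before inversion, which prevents division by very small numbers.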

damp_auto_increment: float = 0.01
If quantization fails for a layer, damping value is automatically increased by this amount.

desc_act: bool = True
Whether to use activation ordering (descending importance) for better accuracy.
If False, disables this reordering.

act_group_aware: bool = False
(GAR feature) Group-aware reordering. Preserves activation sensitivity per group.
Improves accuracy for grouped quantization.

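static_groups: bool = False
If True, computes every group's scale and zero-point up front instead of on the fly as quantization reaches each group. Mainly matters together with desc_act.

sym: bool = True
Symmetric quantization (quantization grid centered around zero).
If False, asymmetric: each group gets its own zero-point, so the grid can shift to cover a skewed range.

true_sequential: bool = True
Quantizes the sub-modules inside each transformer block one after another, so later sub-modules are calibrated on the outputs of already-quantized earlier ones.
Safer but slower.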

lm_head: bool = False
Whether to quantize the final LM head (output projection).
Usually skipped for better accuracy.

quant_method: QUANT_METHOD = GPTQ
Which quantization algorithm to use (GPTQ, GPTQv2, EoRA, QQQ, etc.).

format: FORMAT = GPTQ
Output format (GPTQ, Marlin, ExLlamaV2 kernels, etc.).

mse: float = 0
If set > 0, enables MSE minimization during quantization for extra accuracy recovery.

parallel_packing: bool = True
Enables multi-threaded packing of quantized weights for speedup.

meta: Dict | None
Extra metadata for custom configs.

device: str | device | None
Device to run quantization (cuda, cpu, mps, etc.).

pack_dtype: str | dtype | None = torch.int32
Packing dtype (int32 default, but can be int16 in some kernels).

adapter: Dict | Lora | None
Allows LoRA/EoRA adapters to be applied during or after quantization.

rotation: str | None
If rotation quantization (RQ, QRQ) is applied.

is_marlin_format: bool = False
If set, stores quantized weights in Marlin kernel format (fast inference).

v2: bool = False
Enables GPTQ v2 quantization (better accuracy, more VRAM required).

v2_alpha: float = 0.25
Extra parameter controlling Hessian approximation in GPTQv2.

v2_memory_device: str = "auto"
Controls where GPTQv2 Hessians are computed (cpu, gpu, or auto).

mock_quantization: bool = False
Runs a fake quantization pass (for debugging/testing without actually quantizing).
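
As an illustration of the `dynamic` per-layer overrides described above, the sketch below resolves a module name against a dictionary of regex rules. It assumes the convention shown in the GPTQModel README, where a `+:` prefix (or no prefix) means "override these settings" and a `-:` prefix means "skip quantization for matching modules". `resolve_dynamic` is a hypothetical helper written for this example, not the library's own matcher, and the module names are only examples.

```python
import re


def resolve_dynamic(module_name, dynamic, defaults):
    """Return the effective settings for `module_name`, or None if it is skipped."""
    for pattern, overrides in dynamic.items():
        if pattern.startswith("-:"):
            # Negative rule: a match means "do not quantize this module".
            if re.match(pattern[2:], module_name):
                return None
        else:
            # Positive rule: a match overrides the defaults for this module.
            regex = pattern[2:] if pattern.startswith("+:") else pattern
            if re.match(regex, module_name):
                return {**defaults, **overrides}
    return dict(defaults)


defaults = {"bits": 4, "group_size": 128}
dynamic = {
    r"+:.*\.18\..*gate.*": {"bits": 8, "group_size": 64},  # layer 18, gate modules
    r"-:.*\.20\..*": {},                                   # skip all of layer 20
}

for name in (
    "model.layers.18.mlp.gate_proj",
    "model.layers.20.self_attn.q_proj",
    "model.layers.3.mlp.up_proj",
):
    print(name, "->", resolve_dynamic(name, dynamic, defaults))
```

A dictionary of this shape is what would be passed as `QuantizeConfig(bits=4, group_size=128, dynamic=dynamic)`.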

In [None]:
# increase `batch_size` to match gpu/vram specs to speed up quantization
model.quantize(calibration_dataset, batch_size=500)

In [None]:
model.save(quant_path)
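
The saved checkpoint stores 4-bit codes packed into wider integers (the `pack_dtype` parameter above, `torch.int32` by default), so one 32-bit word holds eight weights. A pure-Python sketch of that idea; the nibble order (lowest nibble first) and the unsigned treatment of the word are assumptions for illustration, and `pack_int4` / `unpack_int4` are hypothetical helpers rather than gptqmodel functions:

```python
def pack_int4(codes):
    """Pack 4-bit codes (0..15) into 32-bit words, eight codes per word."""
    if len(codes) % 8:
        raise ValueError("number of codes must be a multiple of 8")
    words = []
    for i in range(0, len(codes), 8):
        word = 0
        for j, code in enumerate(codes[i:i + 8]):
            if not 0 <= code <= 15:
                raise ValueError(f"code {code} does not fit in 4 bits")
            word |= code << (4 * j)  # lowest nibble first (assumed layout)
        words.append(word)
    return words


def unpack_int4(words):
    """Inverse of pack_int4: extract eight 4-bit codes from each word."""
    return [(word >> (4 * j)) & 0xF for word in words for j in range(8)]


codes = [1, 2, 3, 4, 5, 6, 7, 8]
packed = pack_int4(codes)
print(hex(packed[0]), unpack_int4(packed) == codes)
```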

In [None]:
!pip -q install -U huggingface_hub
from huggingface_hub import notebook_login
notebook_login()

In [None]:
from huggingface_hub import whoami
whoami()

In [None]:
from huggingface_hub import upload_folder
upload_folder(
    repo_id="sunny199/Llama-3.2-1B-Instruct-gptqmodel-4bit",
    folder_path="./Llama-3.2-1B-Instruct-gptqmodel-4bit",
    commit_message="Upload GPTQ quantized LLaMA-3.2-1B model"
)

In [None]:
# test post-quant inference
model = GPTQModel.load(quant_path)
result = model.generate("Uncovering deep insights begins with")[0] # tokens
print(model.tokenizer.decode(result)) # string output

In [None]:
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig
import torch

def calculate_avg_ppl(model, tokenizer, texts, max_length=512, batch_size=8):
    """Compute average perplexity over a list of plain text strings."""
    from gptqmodel.utils.perplexity import Perplexity

    ppl = Perplexity(
        model=model,
        tokenizer=tokenizer,
        dataset_path=None,
        split=None,
        text_column=None,
    )
    # Pass list of text strings
    return ppl.calculate_from_texts(texts, max_length, batch_size)

def load_normal_model(model_id):
    return GPTQModel.load(model_id, quantize_config=None)

def load_quant_model(quant_path, device):
    return GPTQModel.load(quant_path, device=device)

def main():
    model_id = "meta-llama/Llama-3.2-1B-Instruct"
    quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"  # same directory the quantized model was saved to above
    # Load calibration / eval dataset texts
    calibration_texts = load_dataset(
        "allenai/c4",
        data_files="en/c4-train.00001-of-01024.json.gz",
        split="train"
    ).select(range(1024))["text"]

    # Subset for evaluation
    eval_texts = calibration_texts[:100]

    # Pick the device once so it is defined on every code path below
    device = "cuda:0" if torch.cuda.is_available() else "cpu"

    # Try loading the quantized model if it already exists
    try:
        quant_model = load_quant_model(quant_path, device=device)
        quant_exists = True
    except Exception as e:
        print("Quantized model not found or failed to load:", e)
        quant_exists = False

    # Load normal (unquantized) model
    normal_model = load_normal_model(model_id)

    # If quantized model doesn't exist, create it
    if not quant_exists:
        quant_config = QuantizeConfig(bits=4, group_size=128)
        print("Quantizing model now...")
        model = GPTQModel.load(model_id, quant_config)
        model.quantize(calibration_texts, batch_size=1)
        model.save(quant_path)
        quant_model = load_quant_model(quant_path, device=device)

    # Evaluate normal model
    print("Evaluating normal model...")
    ppl_normal = calculate_avg_ppl(normal_model, normal_model.tokenizer, eval_texts)
    print("Normal model avg PPL:", ppl_normal)

    # Evaluate quantized model
    print("Evaluating quantized model...")
    ppl_quant = calculate_avg_ppl(quant_model, quant_model.tokenizer, eval_texts)
    print("Quantized model avg PPL:", ppl_quant)

if __name__ == "__main__":
    main()
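
For reference, the quantity being compared is perplexity: the exponential of the average negative log-likelihood per token, so lower is better. A library-independent sketch with made-up token probabilities (`perplexity` is a hypothetical helper, and the numbers are illustrative, not measurements from either model):

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over all scored tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that always assigns probability 0.25 to the correct next token
# is as uncertain as a uniform choice among four options.
uniform_quarter = [math.log(0.25)] * 10
print(perplexity(uniform_quarter))

# More confident predictions on the same tokens give a lower perplexity.
confident = [math.log(0.8)] * 10
print(perplexity(confident))
```

A quantized model whose perplexity is only slightly above the unquantized one has lost little predictive quality on that text.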


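Allocated = memory currently occupied by live tensors.

Reserved = memory currently held by PyTorch's caching allocator (allocated + cached free blocks).

Peak Allocated = highest "allocated" value since the stats were last reset.

Peak Reserved = highest "reserved" value since the stats were last reset.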
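
As a rough cross-check for these figures, "allocated" right after loading should be at least the size of the weights themselves. A back-of-the-envelope sketch: the parameter count is illustrative, the per-group overhead (one 16-bit scale plus one zero-point as wide as the weights) is an assumption about the storage layout, and `weight_bytes` is a hypothetical helper.

```python
def weight_bytes(n_params, bits, group_size=None, scale_bits=16):
    """Approximate bytes needed to store `n_params` weights at `bits` each."""
    bits_per_weight = bits
    if group_size:
        # Each group also stores one scale and one zero-point (assumed `bits` wide).
        bits_per_weight += (scale_bits + bits) / group_size
    return n_params * bits_per_weight / 8


n = 1_000_000_000  # illustrative parameter count, not a measured value
fp16 = weight_bytes(n, 16)
int4 = weight_bytes(n, 4, group_size=128)
print(f"fp16: {fp16 / 1024**2:.0f} MB")
print(f"4-bit, group_size=128: {int4 / 1024**2:.0f} MB")
print(f"ratio: {fp16 / int4:.2f}x")
```

Measured numbers will be higher than this estimate, since layers left unquantized (for example the LM head when `lm_head=False`), activations and the KV cache all add to it.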

In [None]:
import torch

def measure_gpu_memory(model_loader_fn, *args, **kwargs):
    # The torch.cuda memory counters only exist on CUDA devices
    if not torch.cuda.is_available():
        raise RuntimeError("measure_gpu_memory requires a CUDA device")
    device = torch.device("cuda:0")

    # Reset stats
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats(device)

    model = model_loader_fn(*args, **kwargs).to(device)

    # Do a dummy forward pass to trigger memory usage (adjust input shape as per your model)
    # For example, if it’s a language model:
    dummy_input = torch.randint(0, 100, (1, 16)).to(device)
    try:
        _ = model.generate(dummy_input)
    except Exception:
        # fallback: try forward if generate not available
        _ = model(dummy_input)

    # Collect memory stats
    mem_alloc = torch.cuda.memory_allocated(device)
    mem_reserved = torch.cuda.memory_reserved(device)
    peak_alloc = torch.cuda.max_memory_allocated(device)
    peak_reserved = torch.cuda.max_memory_reserved(device)

    print(f"Memory allocated: {mem_alloc / (1024**2):.2f} MB")
    print(f"Memory reserved:  {mem_reserved / (1024**2):.2f} MB")
    print(f"Peak alloc:       {peak_alloc / (1024**2):.2f} MB")
    print(f"Peak reserved:    {peak_reserved / (1024**2):.2f} MB")

    del model
    torch.cuda.empty_cache()


def load_normal_model(model_id):
    # Loads the original, not-yet-quantized weights; the config only prepares
    # the model for a later quantize() call.
    quantize_config = QuantizeConfig(bits=4, group_size=128)
    return GPTQModel.load(model_id, quantize_config=quantize_config)

# Example usage:
measure_gpu_memory(lambda: load_normal_model(model_id))
# measure_gpu_memory(lambda: load_quant_model(quant_path, device="cuda:0"))


In [None]:
# gptqmodel is integrated into lm-eval >= v0.4.7
!pip install "lm-eval>=0.4.7"

In [None]:
from gptqmodel import GPTQModel
from gptqmodel.utils.eval import EVAL

model_id = "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v1"

# Use `lm-eval` as framework to evaluate the model
lm_eval_results = GPTQModel.eval(model_id, framework=EVAL.LM_EVAL, tasks=[EVAL.LM_EVAL.ARC_CHALLENGE])


These are the evaluation metrics for the quantized model on the ARC Challenge task.

`acc` is plain accuracy; `acc_norm` is accuracy after each candidate answer's log-likelihood is normalized by its length.

0.2799 means ~27.99% accuracy, with a standard error of ±0.0131.

The evaluation harness printed the result table.
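
The reported standard error is consistent with the usual binomial formula `sqrt(p * (1 - p) / n)`. A quick check, assuming the ARC-Challenge test split has 1172 questions (that count is not stated in this notebook) and using a hypothetical helper `accuracy_stderr`:

```python
import math


def accuracy_stderr(acc, n_samples):
    """Binomial standard error of an accuracy estimated from n_samples questions."""
    return math.sqrt(acc * (1 - acc) / n_samples)


acc = 0.2799
n_questions = 1172  # assumed size of the ARC-Challenge test split
se = accuracy_stderr(acc, n_questions)
low, high = acc - 1.96 * se, acc + 1.96 * se
print(f"stderr = {se:.4f}, approx. 95% interval = [{low:.4f}, {high:.4f}]")
```

Since most ARC questions offer four answer choices, random guessing would land near 25%, which is worth keeping in mind when reading a score in this range.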

### Bits & Bytes / Quantization Context

bitsandbytes is a library that implements efficient quantized operations (4-bit, 8-bit) for matrices / linear layers, especially useful for large models.

The transformers library supports integration with bitsandbytes via a BitsAndBytesConfig class. That config tells from_pretrained how to load model weights in lower precision (4-bit or 8-bit) rather than default high precision.

This integration is meant for inference (or adapter training) rather than full training from scratch. It reduces VRAM / memory footprint a lot, and you can still generate text.
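
To give a feel for what a "quantized operation" on a linear layer means, the sketch below runs a matrix-vector product with 8-bit integer codes and rescales the result to float. It uses simple absmax quantization per weight row and per input vector. This is a simplified illustration in pure Python, not the actual bitsandbytes kernels, and `absmax_quantize` and `int8_linear` are hypothetical helpers.

```python
def absmax_quantize(vec):
    """Map floats to signed 8-bit codes in [-127, 127] plus one float scale."""
    absmax = max(abs(v) for v in vec) or 1.0
    scale = absmax / 127
    return [round(v / scale) for v in vec], scale


def int8_linear(weight_rows, x):
    """Compute y = W @ x using integer dot products, then rescale to float."""
    xq, x_scale = absmax_quantize(x)
    out = []
    for row in weight_rows:
        wq, w_scale = absmax_quantize(row)
        acc = sum(a * b for a, b in zip(wq, xq))  # integer accumulation
        out.append(acc * w_scale * x_scale)       # undo both scales
    return out


W = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
x = [1.0, 2.0, -1.0]
exact = [sum(w * v for w, v in zip(row, x)) for row in W]
approx = int8_linear(W, x)
print("exact: ", exact)
print("approx:", approx)
```

The integer codes take a quarter of the memory of 32-bit floats, and the result stays close to the exact product.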

In [None]:
!pip install transformers accelerate
!pip install -U bitsandbytes
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"   # example HF model
bnb = BitsAndBytesConfig(load_in_4bit=True)      # or load_in_8bit=True
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)

inputs = tok("Explain GPTQ vs AWQ simply.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=80)
print(tok.decode(out[0], skip_special_tokens=True))

### ExLlama

ExLlama is a Python/C++/CUDA implementation tailored for Llama (and Llama-style) models, working especially with 4-bit GPTQ weights.

It provides optimized kernels (linear layers, quantized operators) for inference, to accelerate generation speed and reduce memory overhead.

ExLlamaV2 adds more features: e.g. EXL2 format (a flexible quantization format), better kernel optimizations, dynamic batching, new generation strategies.

ExLlama is often integrated into quantization + inference pipelines. For example, Transformers' GPTQ support can select the ExLlama kernels through `GPTQConfig`, e.g. `exllama_config={"version": 2}`.

In [None]:
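!pip install exllamav2
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Download a GPTQ/EXL2 model first or point to HF snapshot dir with config + shards
# Example assumes a local directory with GPTQ weights (ex: TheBloke GPTQ)
model_path = "/content/model"  # put your GPTQ model dir here

cfg = ExLlamaV2Config()
cfg.model_dir = model_path
cfg.prepare()                             # read config.json and locate the weight files

model = ExLlamaV2(cfg)
cache = ExLlamaV2Cache(model, lazy=True)  # KV cache, allocated while the model loads
model.load_autosplit(cache)               # load weights, splitting across available GPUs
tok = ExLlamaV2Tokenizer(cfg)             # the tokenizer takes the config, not a path

gen = ExLlamaV2BaseGenerator(model, cache, tok)
settings = ExLlamaV2Sampler.Settings()    # default sampling settings

# generate_simple(prompt, sampler settings, number of new tokens)
print(gen.generate_simple("Explain GPTQ vs AWQ simply.", settings, 80))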
