GPTQ = original algorithm/paper (Frantar et al., IST Austria and ETH Zurich, 2022).

AutoGPTQ = practical Python library (a community project, usable from Hugging Face Transformers via Optimum) that made GPTQ mainstream and easy to use.

https://github.com/AutoGPTQ/AutoGPTQ



| Feature / Library       | **AutoGPTQ**                          | **bitsandbytes**                                      | **gptqmodel**                                                  |
| ----------------------- | ------------------------------------- | ----------------------------------------------------- | -------------------------------------------------------------- |
| **Origin**              | Community project (PanQiWei, 2023)    | Tim Dettmers (2022)                                   | ModelCloud (2024–2025)                                         |
| **Quantization Method** | GPTQ (post-training, 4/8-bit)         | Linear quantization (INT8, NF4, FP4, 8-bit optimizer) | GPTQ v1, GPTQ v2, QQQ, EoRA, GAR                               |
| **Target Use**          | Easy Hugging Face integration         | Training & inference memory savings                   | **Full production deployment toolkit**                         |
| **Hardware Support**    | CUDA (Nvidia GPU)                     | CUDA (Nvidia GPU) only                                | CUDA (Nvidia), ROCm (AMD), XPU (Intel), MPS (Apple), CPU       |
| **Integration**         | Hugging Face Transformers             | PyTorch Optimizer + HF integration                    | HF Transformers, vLLM, SGLang, Optimum, Peft                   |
| **Inference Kernels**   | ExLlama, Triton, Marlin (via plugins) | cuBLAS-based INT8 kernels                             | Marlin, ExLlama v2, Torch fused, BitBLAS                       |
| **Training Support**    | No                                    | Optimizer states (8-bit Adam, NF4 LoRA)             | Partial (LoRA + EoRA fine-tune on quantized model)             |
| **Flexibility**         | Focused on GPTQ only                  | Training + inference memory efficiency                | **Dynamic per-layer configs, adapters, GAR, eval integration** |
| **Evaluation**          | None built-in                         | None built-in                                         | Built-in `lm-eval` & `evalplus` hooks                          |
| **Ease of Use**         | Very easy (HF-style API)              | Very easy (drop-in optimizer / `load_in_8bit=True`)   | More advanced config, but still has high-level API             |
| **Community Models**    | Huge (TheBloke, HF Hub)               | Many LoRA + finetune models                           | Growing rapidly (ModelCloud & HF Hub vortex/EoRA releases)     |
| **Strengths**           | Easy Hugging Face usage               | Simple + effective for training                       | All-in-one production toolkit, multi-hardware                  |
| **Weaknesses**          | Limited to GPTQ only                  | Nvidia-only, no GPTQ                                  | More complex, newer ecosystem (still maturing)                 |


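Versions & Techniques

GPTQ v1, GPTQ v2 → two generations of the GPTQ algorithm; v2 is an improved variant that gptqmodel exposes through the `v2` option

QQQ → a W4A8 scheme (4-bit weights, 8-bit activations) from the paper "QQQ: Quality Quattuor-Bit Quantization for Large Language Models"

EoRA → Eigenspace Low-Rank Approximation (a training-free low-rank adapter that compensates for quantization error)

GAR → Group Aware Reordering (an activation-reordering strategy that respects quantization groups; the `act_group_aware` option)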

https://github.com/ModelCloud/GPTQModel

pip install gptqmodel → install the gptqmodel package.

-v → verbose mode, so pip prints more details about what it is doing (downloads, builds, etc.).

--no-build-isolation → disables pip's default build isolation when installing from source (sdist). pip will not create an isolated environment for the build, so you are responsible for having all build dependencies installed already.
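Because `--no-build-isolation` builds against the current environment, it can help to confirm that the build-time packages are importable before running pip. The following is a minimal sketch: the package list is an assumption (the build log below references `torch.utils.cpp_extension`, so `torch` is included), not gptqmodel's declared build requirements, and `missing_build_deps` is a hypothetical helper.

```python
from importlib.util import find_spec


def missing_build_deps(names):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


# Assumed build-time packages; check the project's pyproject.toml for the real list.
assumed = ["setuptools", "wheel", "torch"]
missing = missing_build_deps(assumed)
if missing:
    print("Install these before using --no-build-isolation:", ", ".join(missing))
else:
    print("All assumed build dependencies are importable.")
```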

In [1]:
!pip install -v gptqmodel --no-build-isolation

Using pip 24.1.2 from /usr/local/lib/python3.12/dist-packages/pip (python 3.12)
Collecting gptqmodel
  Downloading gptqmodel-4.2.5.tar.gz (331 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 331.7/331.7 kB 5.8 MB/s eta 0:00:00
  Running command Preparing metadata (pyproject.toml)
  CUDA_ARCH_LIST: 7.5
  TORCH_CUDA_ARCH_LIST: 7.5
  CUDA 7.5
  HAS_CUDA_V8 False
  SETUP_KWARGS {'ext_modules': [], 'cmdclass': {'build_ext': <class 'torch.utils.cpp_extension.BuildExtension'>}}
  running dist_info
  creating /tmp/pip-modern-metadata-2iuo6oj0/gptqmodel.egg-info
  writing /tmp/pip-modern-metadata-2iuo6oj0/gptqmodel.egg-info/PKG-INFO
  writing dependency_links to /tmp/pip-modern-metadata-2iuo6oj0/gptqmodel.egg-info/dependency_links.txt
  writing requirements to /tmp/pip-modern-metadata-2iuo6oj0/gptqmodel.egg-info/requires.txt
  writing top-level names to /tmp/pip-modern-metadata-2iuo6oj0/gptqmodel.egg-info/top_level.txt
  writing manifest file '/tmp

In [1]:
!pip install "protobuf<6.30"



In [2]:
from gptqmodel import GPTQModel


WARN  Python GIL is enabled: Multi-gpu quant acceleration for MoE models is sub-optimal and multi-core accelerated cpu packing is also disabled. We recommend Python >= 3.13.3t with Pytorch > 2.8 for mult-gpu quantization and multi-cpu packing with env `PYTHON_GIL=0`.
WARN  Feature `utils/Perplexity` requires python GIL or Python >= 3.13.3T (T for Threading-Free edition of Python) plus Torch 2.8. Feature is currently skipped/disabled.
INFO  ENV: Auto setting PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' for memory saving.
INFO  ENV: Auto setting CUDA_DEVICE_ORDER=PCI_BUS_ID for correctness.


In [3]:
import gptqmodel
dir(gptqmodel)

['BACKEND',
 'BaseQuantizeConfig',
 'GPTQModel',
 'QuantizeConfig',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 'adapter',
 'exllama_set_max_input_length',
 'get_best_device',
 'looper',
 'models',
 'nn_modules',
 'os',
 'quantization',
 'utils',
 'version']

https://huggingface.co/collections/ModelCloud/vortex-673743382af0a52b2a8b9fe2

In [4]:
model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5")

config.json: 0.00B [00:00, ?B/s]

from_quantized: adapter: None


Fetching 8 files:   0%|          | 0/8 [00:00<?, ?it/s]

tokenizer.json: 0.00B [00:00, ?B/s]

README.md: 0.00B [00:00, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

.gitattributes: 0.00B [00:00, ?B/s]

quantize_config.json:   0%|          | 0.00/385 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.61G [00:00<?, ?B/s]

`torch_dtype` is deprecated! Use `dtype` instead!


INFO  Loader: Auto dtype (native bfloat16): `torch.bfloat16`
INFO  Estimated Quantization BPW (bits per weight): 4.85 bpw, based on [bits: 4, group_size: 32]


`torch_dtype` is deprecated! Use `dtype` instead!


INFO  Kernel: Auto-selection: adding candidate `TritonV2QuantLinear`
INFO  Kernel: Auto-selection: adding candidate `TorchQuantLinear`
INFO  Kernel: candidates -> `[TritonV2QuantLinear, TorchQuantLinear]`
INFO  Kernel: selected -> `TritonV2QuantLinear`.

INFO  Format: Converting `checkpoint_format` from `FORMAT.GPTQ` to internal `FORMAT.GPTQ_V2`.
INFO  Format: Converting GPTQ v1 to v2
INFO  Format: Conversion complete: 0.058113813400268555s
INFO  Kernel: Auto-selection: adding candidate `TritonV2QuantLinear`
INFO  Optimize: `TritonV2QuantLinear` compilation triggered.
INFO  Model: Loaded `generation_config`: GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009
  ]
}

INFO  Model: `generation_config.json` not found. Skipped checking.
INFO  Kernel: loaded -> `[TritonV2QuantLinear]`


In [5]:
result = model.generate("Uncovering deep insights begins with")[0] # tokens

In [6]:
print(model.tokenizer.decode(result)) # string output

<|begin_of_text|>Uncovering deep insights begins with a deep understanding of the underlying principles and concepts that govern the behavior of the system in question. In

In [8]:
model_id = "meta-llama/Llama-3.2-1B-Instruct"

In [9]:
quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"

In [11]:
from datasets import load_dataset
calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train"
  ).select(range(1024))["text"]

In [12]:
from gptqmodel import GPTQModel, QuantizeConfig

In [14]:
quant_config = QuantizeConfig(bits=4, group_size=128)

In [15]:
model = GPTQModel.load(model_id, quant_config)

config.json:   0%|          | 0.00/877 [00:00<?, ?B/s]

INFO  Estimated Quantization BPW (bits per weight): 4.2875 bpw, based on [bits: 4, group_size: 128]


Fetching 13 files:   0%|          | 0/13 [00:00<?, ?it/s]

.gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

params.json:   0%|          | 0.00/220 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

original/consolidated.00.pth:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/41.7k [00:00<?, ?B/s]

LICENSE.txt:   0%|          | 0.00/7.71k [00:00<?, ?B/s]

original/tokenizer.model:   0%|          | 0.00/2.18M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

USE_POLICY.md:   0%|          | 0.00/6.02k [00:00<?, ?B/s]

INFO  Loader: Auto dtype (native bfloat16): `torch.bfloat16`
INFO  Model: Loaded `generation_config`: GenerationConfig {
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "temperature": 0.6,
  "top_p": 0.9
}

INFO  Kernel: loaded -> `[]`


QuantizeConfig Parameters Explained

bits: int = 4
Number of bits for quantization (2, 3, 4, 8). Lower bits = more compression, less accuracy.

dynamic: Dict[...] | None
Allows per-layer overrides. Example: quantize some layers with 8 bits, others skip quantization.
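To make the idea of per-layer overrides concrete, here is a minimal sketch of how a pattern-keyed override table could be resolved for one module name. It assumes, based on my reading of the GPTQModel README, that keys are regular expressions with an optional `+:` (apply overrides) or `-:` (skip the module) prefix. `resolve_layer_config` is a hypothetical helper written for illustration, not library code.

```python
import re


def resolve_layer_config(module_name, base, dynamic):
    """Return the effective settings for `module_name`, or None if it is skipped."""
    for pattern, overrides in dynamic.items():
        if pattern.startswith("-:"):
            # Negative match: leave this module unquantized.
            if re.match(pattern[2:], module_name):
                return None
        else:
            # Positive match: "+:" is optional, overrides are merged onto the base.
            regex = pattern[2:] if pattern.startswith("+:") else pattern
            if re.match(regex, module_name):
                return {**base, **overrides}
    return dict(base)


base = {"bits": 4, "group_size": 128}
dynamic = {
    r"+:.*\.0\..*down_proj.*": {"bits": 8},  # layer 0 down_proj: 8-bit
    r"-:.*\.15\..*": {},                      # skip everything in layer 15
}

for name in (
    "model.layers.0.mlp.down_proj",
    "model.layers.15.self_attn.q_proj",
    "model.layers.3.mlp.up_proj",
):
    print(name, "->", resolve_layer_config(name, base, dynamic))
```

In the real library the table would be passed as `QuantizeConfig(dynamic=...)`; check the README of your installed version for the exact matching rules.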

group_size: int = 128
Number of weights that share one scale/zero-point.
Smaller = better accuracy, but more per-group metadata (higher bits per weight). Larger = more compression, lower accuracy.
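The cost of a smaller group size shows up in the "Estimated Quantization BPW" log lines: 4.85 bpw for `group_size=32` and 4.2875 bpw for `group_size=128`. The sketch below reproduces both numbers with a simple model: the nominal bits, plus a fixed per-group overhead spread over the group, plus a small flat term. The constants (24 bits per group and 0.1 flat) are fitted to those two logged values and are an assumption, not the library's documented formula.

```python
def estimated_bpw(bits, group_size, per_group_overhead_bits=24, flat_overhead=0.1):
    """Approximate bits per weight: payload + amortized group metadata + flat term."""
    return bits + per_group_overhead_bits / group_size + flat_overhead


for group_size in (32, 64, 128):
    print(f"bits=4, group_size={group_size}: {estimated_bpw(4, group_size):.4f} bpw")
```

Halving the group size doubles the amortized metadata term, which is why small groups trade compression for accuracy.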

damp_percent: float = 0.05
Used in Hessian damping (numerical stability). Prevents division by very small numbers.

damp_auto_increment: float = 0.01
If quantization fails for a layer, damping value is automatically increased by this amount.
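A small sketch of what damping does, in plain Python. In the GPTQ reference implementation the damping term is `damp_percent` times the mean of the Hessian diagonal, added to every diagonal entry before the Cholesky factorization. The retry loop and the `max_damp` cap below are assumptions made for illustration; they are not taken from gptqmodel's source.

```python
import math


def cholesky(a):
    """Plain Cholesky factorization; raises ValueError if `a` is not positive definite."""
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = a[i][i] - s
                if pivot <= 1e-12:
                    raise ValueError("matrix is not positive definite")
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (a[i][j] - s) / lower[j][j]
    return lower


def damped(hessian, damp_percent):
    """Add damp_percent * mean(diag(H)) to every diagonal entry."""
    n = len(hessian)
    damp = damp_percent * sum(hessian[i][i] for i in range(n)) / n
    return [[hessian[i][j] + (damp if i == j else 0.0) for j in range(n)] for i in range(n)]


def factorize_with_damping(hessian, damp_percent=0.05, damp_auto_increment=0.01, max_damp=1.0):
    """Retry with a larger damping value until the factorization succeeds."""
    while damp_percent < max_damp:  # max_damp is a hypothetical safety cap
        try:
            return cholesky(damped(hessian, damp_percent)), damp_percent
        except ValueError:
            damp_percent += damp_auto_increment
    raise ValueError("could not stabilize the Hessian")


# A rank-deficient Hessian, as produced by perfectly correlated inputs.
H = [[1.0, 1.0], [1.0, 1.0]]
factor, used = factorize_with_damping(H)
print("damping used:", used)
```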

desc_act: bool = True
Whether to use activation ordering (descending importance) for better accuracy.
If False, disables this reordering.
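As a rough illustration of activation ordering: the GPTQ reference implementation sorts weight columns by the Hessian diagonal in descending order, quantizes in that order, then undoes the permutation. The sketch below shows only the permutation bookkeeping on toy data; the helper names are hypothetical.

```python
def activation_order(hessian_diag):
    """Column indices sorted by descending Hessian diagonal (most sensitive first)."""
    return sorted(range(len(hessian_diag)), key=lambda i: hessian_diag[i], reverse=True)


def invert(perm):
    """Inverse permutation: maps an original index to its position after reordering."""
    inverse = [0] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        inverse[old_pos] = new_pos
    return inverse


diag = [0.2, 3.1, 0.9, 1.7]          # toy per-column sensitivities
columns = ["c0", "c1", "c2", "c3"]   # stand-ins for weight columns

perm = activation_order(diag)
reordered = [columns[i] for i in perm]            # order used during quantization
restored = [reordered[i] for i in invert(perm)]   # original layout recovered

print("quantization order:", reordered)
print("restored layout:   ", restored)
```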

act_group_aware: bool = False
(GAR feature) Group-aware reordering. Preserves activation sensitivity per group.
Improves accuracy for grouped quantization.

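static_groups: bool = False
If True, the quantization parameters of every group are computed up front from the original column order, instead of being derived on the fly while columns are processed. Mainly relevant together with desc_act=True.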

sym: bool = True
Symmetric quantization: the quantization range of each group is centered on zero.
If False, asymmetric: each group also stores a zero-point offset, so the range can follow skewed weights.
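The difference between the two modes is easiest to see on a single group of skewed weights. The sketch below uses plain round-to-nearest quantization, which is not GPTQ's error-compensating update; it only illustrates how `bits` and `sym` shape the grid. The function names are hypothetical.

```python
def quantize_group(weights, bits=4, sym=True):
    """Round-to-nearest quantization of one group; returns (codes, scale, zero_point)."""
    levels = 2 ** bits - 1
    if sym:
        max_abs = max(abs(w) for w in weights) or 1.0
        lo, hi = -max_abs, max_abs
        zero_point = (levels + 1) // 2          # fixed midpoint of the grid
    else:
        # The range always includes 0 so that 0.0 stays exactly representable.
        lo, hi = min(min(weights), 0.0), max(max(weights), 0.0)
        zero_point = None                        # derived from the range below
    scale = (hi - lo) / levels or 1.0
    if zero_point is None:
        zero_point = round(-lo / scale)
    codes = [min(levels, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    return [(c - zero_point) * scale for c in codes]


def max_error(weights, sym):
    codes, scale, zero_point = quantize_group(weights, bits=4, sym=sym)
    restored = dequantize_group(codes, scale, zero_point)
    return max(abs(a - b) for a, b in zip(weights, restored))


weights = [0.1, 0.2, 0.3, 0.9]  # all positive: a skewed group
print(f"symmetric  max error: {max_error(weights, sym=True):.4f}")
print(f"asymmetric max error: {max_error(weights, sym=False):.4f}")
```

For this skewed group the symmetric grid spends half of its levels on negative values that never occur, so its step size and error are larger.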

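true_sequential: bool = True
Quantizes the sub-modules inside each transformer layer one after another, so that later modules are calibrated on the outputs of the already-quantized ones.
Safer but slower.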

lm_head: bool = False
Whether to quantize the final LM head (output projection).
Usually skipped for better accuracy.

quant_method: QUANT_METHOD = GPTQ
Which quantization algorithm to use (GPTQ, GPTQv2, EoRA, QQQ, etc.).

format: FORMAT = GPTQ
Output format (GPTQ, Marlin, ExLlamaV2 kernels, etc.).

mse: float = 0
If set > 0, enables MSE minimization during quantization for extra accuracy recovery.

parallel_packing: bool = True
Enables multi-threaded packing of quantized weights for speedup.

meta: Dict | None
Extra metadata for custom configs.

device: str | device | None
Device to run quantization (cuda, cpu, mps, etc.).

pack_dtype: str | dtype | None = torch.int32
Packing dtype (int32 default, but can be int16 in some kernels).

adapter: Dict | Lora | None
Allows LoRA/EoRA adapters to be applied during or after quantization.

rotation: str | None
If rotation quantization (RQ, QRQ) is applied.

is_marlin_format: bool = False
If set, stores quantized weights in Marlin kernel format (fast inference).

v2: bool = False
Enables GPTQ v2 quantization (better accuracy, more VRAM required).

v2_alpha: float = 0.25
Extra parameter controlling Hessian approximation in GPTQv2.

v2_memory_device: str = "auto"
Controls where GPTQv2 Hessians are computed (cpu, gpu, or auto).

mock_quantization: bool = False
Runs a fake quantization pass (for debugging/testing without actually quantizing).

In [18]:
# a larger `batch_size` speeds up quantization but needs more VRAM;
# 500 was too large for this ~15 GiB GPU (see the OutOfMemoryError below)
model.quantize(calibration_dataset, batch_size=500)

INFO  Packing Kernel: Auto-selection: adding candidate `TritonV2QuantLinear`
Quantizing layer 0 of 15 [0 of 15] -------------| 0:00:38 / 0:10:08 [1/16] 6.2%

OutOfMemoryError: CUDA out of memory. Tried to allocate 13.68 GiB. GPU 0 has a total capacity of 14.74 GiB of which 1.43 GiB is free. Process 13291 has 13.31 GiB memory in use. Of the allocated memory 12.31 GiB is allocated by PyTorch, and 836.73 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
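One way to avoid guessing the batch size is to halve it each time an out-of-memory error occurs. The sketch below shows the retry logic with a stand-in function so that it runs anywhere. `run_with_backoff` and `fake_quantize` are hypothetical; with the real model you would pass something like `lambda bs: model.quantize(calibration_dataset, batch_size=bs)` and test for `torch.cuda.OutOfMemoryError`. Whether a model object can be reused after a failed `quantize` call is not something the source confirms, so reloading it between attempts may be necessary.

```python
def run_with_backoff(run, batch_size, is_oom, min_batch_size=1):
    """Call run(batch_size); on an out-of-memory error, halve batch_size and retry."""
    while True:
        try:
            return run(batch_size), batch_size
        except Exception as exc:
            if not is_oom(exc) or batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)
            print(f"out of memory, retrying with batch_size={batch_size}")


def fake_quantize(batch_size):
    """Stand-in for model.quantize: pretends anything above 64 does not fit."""
    if batch_size > 64:
        raise MemoryError(f"pretend OOM at batch_size={batch_size}")
    return "ok"


result, used = run_with_backoff(
    fake_quantize, 500, is_oom=lambda exc: isinstance(exc, MemoryError)
)
print("succeeded with batch_size =", used)
```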

In [None]:
model.save(quant_path)

In [None]:
!pip -q install -U huggingface_hub
from huggingface_hub import notebook_login
notebook_login()

In [None]:
from huggingface_hub import whoami
whoami()

In [None]:
from huggingface_hub import upload_folder
upload_folder(
    repo_id="sunny199/Llama-3.2-1B-Instruct-gptqmodel-4bit",
    folder_path="./Llama-3.2-1B-Instruct-gptqmodel-4bit",
    commit_message="Upload GPTQ quantized LLaMA-3.2-1B model"
)

In [None]:
# test post-quant inference
model = GPTQModel.load(quant_path)
result = model.generate("Uncovering deep insights begins with")[0] # tokens
print(model.tokenizer.decode(result)) # string output

In [14]:
from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig
import torch

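def calculate_avg_ppl(model, tokenizer, texts, max_length=512, batch_size=8):
    """Compute average perplexity over a list of plain text strings."""
    from gptqmodel.utils.perplexity import Perplexity

    ppl = Perplexity(
        model=model,
        tokenizer=tokenizer,
        dataset_path=None,
        split=None,
        text_column=None,
    )
    # NOTE: `calculate_from_texts` could not be confirmed against the gptqmodel
    # docs; the documented entry point appears to be `Perplexity.calculate(n_ctx, n_batch)`
    # on a named dataset. Verify against your installed version before relying on this.
    return ppl.calculate_from_texts(texts, max_length, batch_size)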

def load_normal_model(model_id):
    return GPTQModel.load(model_id, quantize_config=None)

def load_quant_model(quant_path, device):
    return GPTQModel.load(quant_path, device=device)

def main():
    model_id = "meta-llama/Llama-3.2-1B-Instruct"
    quant_path = "Llama-3.2-1B-Instruct-gptqmodel-4bit"  # same directory the earlier cells save to
    # Load calibration / eval dataset texts
    calibration_texts = load_dataset(
        "allenai/c4",
        data_files="en/c4-train.00001-of-01024.json.gz",
        split="train"
    ).select(range(1024))["text"]

    # Subset for evaluation
    eval_texts = calibration_texts[:100]

    # Pick the device before the try block so it is defined on every code path
    device = "cuda:0" if torch.cuda.is_available() else "cpu"

    # Try loading the quantized model if it exists
    try:
        quant_model = load_quant_model(quant_path, device=device)
        quant_exists = True
    except Exception as e:
        print("Quantized model not found or failed to load:", e)
        quant_exists = False

    # Load normal (unquantized) model
    normal_model = load_normal_model(model_id)

    # If quantized model doesn't exist, create it
    if not quant_exists:
        quant_config = QuantizeConfig(bits=4, group_size=128)
        print("Quantizing model now...")
        model = GPTQModel.load(model_id, quant_config)
        model.quantize(calibration_texts, batch_size=1)
        model.save(quant_path)
        quant_model = load_quant_model(quant_path, device=device)

    # Evaluate normal model
    print("Evaluating normal model...")
    ppl_normal = calculate_avg_ppl(normal_model, normal_model.tokenizer, eval_texts)
    print("Normal model avg PPL:", ppl_normal)

    # Evaluate quantized model
    print("Evaluating quantized model...")
    ppl_quant = calculate_avg_ppl(quant_model, quant_model.tokenizer, eval_texts)
    print("Quantized model avg PPL:", ppl_quant)

if __name__ == "__main__":
    main()
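For reference, the perplexity that the script above reports is the exponential of the average negative log-likelihood per token. A minimal sketch of that formula on made-up log-probabilities (the numbers are illustrative, not model output):

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over the given tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 1/4 to every token is as uncertain as a
# uniform choice among 4 options, so its perplexity is 4.
uniform_guess = [math.log(0.25)] * 10
print(f"perplexity: {perplexity(uniform_guess):.2f}")
```

A quantized model whose perplexity stays close to the full-precision model's on held-out text has lost little predictive quality. Note that the script evaluates on a slice of the calibration texts, which tends to flatter the quantized model.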


Allocated = currently in use.

Reserved = currently held by PyTorch (in use + cached).

Peak Allocated = highest ever “in use.”

Peak Reserved = highest ever “held by cache.”

In [22]:
import torch

def measure_gpu_memory(model_loader_fn, *args, **kwargs):
    # Reset stats
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.empty_cache()

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    model = model_loader_fn(*args, **kwargs).to(device)

    # Do a dummy forward pass to trigger memory usage (adjust input shape as per your model)
    # For example, if it’s a language model:
    dummy_input = torch.randint(0, 100, (1, 16)).to(device)
    try:
        _ = model.generate(dummy_input)
    except Exception:
        # fallback: try forward if generate not available
        _ = model(dummy_input)

    # Collect memory stats
    mem_alloc = torch.cuda.memory_allocated(device)
    mem_reserved = torch.cuda.memory_reserved(device)
    peak_alloc = torch.cuda.max_memory_allocated(device)
    peak_reserved = torch.cuda.max_memory_reserved(device)

    print(f"Memory allocated: {mem_alloc / (1024**2):.2f} MB")
    print(f"Memory reserved:  {mem_reserved / (1024**2):.2f} MB")
    print(f"Peak alloc:       {peak_alloc / (1024**2):.2f} MB")
    print(f"Peak reserved:    {peak_reserved / (1024**2):.2f} MB")

    del model
    torch.cuda.empty_cache()


def load_normal_model(model_id):
    # Despite the name, this passes a QuantizeConfig: the weights are loaded in
    # full precision and prepared for quantization, not loaded pre-quantized.
    quantize_config = QuantizeConfig(bits=4, group_size=128)
    return GPTQModel.load(model_id, quantize_config=quantize_config)

# Example usage:
measure_gpu_memory(lambda: load_normal_model(model_id))
# measure_gpu_memory(lambda: load_quant_model(quant_path, device="cuda:0"))


INFO  Estimated Quantization BPW (bits per weight): 4.2875 bpw, based on [bits: 4, group_size: 128]

Fetching 13 files:   0%|          | 0/13 [00:00<?, ?it/s]

INFO  Loader: Auto dtype (native bfloat16): `torch.bfloat16`
INFO  Model: Loaded `generation_config`: GenerationConfig {
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "temperature": 0.6,
  "top_p": 0.9
}

INFO  Kernel: loaded -> `[]`

OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 22.12 MiB is free. Process 13291 has 14.72 GiB memory in use. Of the allocated memory 14.52 GiB is allocated by PyTorch, and 17.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [None]:
# gptqmodel is integrated into lm-eval >= v0.4.7
# quote the requirement so the shell does not treat `>=` as an output redirect
!pip install "lm-eval>=0.4.7"

In [None]:
from gptqmodel import GPTQModel
from gptqmodel.utils.eval import EVAL

model_id = "ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v1"

# Use `lm-eval` as framework to evaluate the model
lm_eval_results = GPTQModel.eval(model_id, framework=EVAL.LM_EVAL, tasks=[EVAL.LM_EVAL.ARC_CHALLENGE])


INFO  Eval: loading using backend = `BACKEND.AUTO`
from_quantized: adapter: None
INFO  Loader: Auto dtype (native bfloat16): `torch.bfloat16`
INFO  Estimated Quantization BPW (bits per weight): 4.85 bpw, based on [bits: 4, group_size: 32]
INFO  Kernel: Auto-selection: adding candidate `TritonV2QuantLinear`
INFO  Kernel: Auto-selection: adding candidate `TorchQuantLinear`
INFO  Kernel: candidates -> `[TritonV2QuantLinear, TorchQuantLinear]`
INFO  Kernel: selected -> `TritonV2QuantLinear`.

INFO  Format: Converting `checkpoint_format` from `FORMAT.GPTQ` to internal `FORMAT.GPTQ_V2`.
INFO  Format: Conversion complete: 0.012720584869384766s
INFO  Kernel: Auto-selection: adding candidate `TritonV2QuantLinear`
INFO  Model: Loaded `generation_config`: GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": [
    128001,
    128008,
    128009
  ]
}

INFO  Model: `generation_config.json` not found. Skipped checking.
INFO  Kernel: loaded -> `[TritonV2QuantLinear]`

INFO  LM-EVAL: `gen_kwargs` = `do_sample=False,temperature=1.0,top_k=50,top_p=1.0`
INFO  LM-EVAL: `apply_chat_template` = `False`

        applied. Recommend setting `apply_chat_template` (optionally `fewshot_as_multiturn`).


README.md: 0.00B [00:00, ?B/s]

ARC-Challenge/train-00000-of-00001.parqu(…):   0%|          | 0.00/190k [00:00<?, ?B/s]

ARC-Challenge/test-00000-of-00001.parque(…):   0%|          | 0.00/204k [00:00<?, ?B/s]

ARC-Challenge/validation-00000-of-00001.(…):   0%|          | 0.00/55.7k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1119 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1172 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/299 [00:00<?, ? examples/s]

100%|██████████| 1172/1172 [00:01<00:00, 1031.46it/s]
Running loglikelihood requests: 100%|██████████| 4687/4687 [07:08<00:00, 10.93it/s]


--------lm_eval Eval Result---------
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |     0|acc     |↑  |0.2799|±  |0.0131|
|             |       |none  |     0|acc_norm|↑  |0.3063|±  |0.0135|

--------lm_eval Result End---------


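These are the evaluation metrics for the quantized model on the ARC Challenge task (zero-shot, 1172 test questions).

acc is plain accuracy; acc_norm is accuracy after normalizing each answer's log-likelihood by its length, so longer answers are not penalized.

0.2799 means ~27.99% accuracy, with a standard error of ±0.0131.

The table above is printed by the lm-eval harness.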
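The reported standard errors can be sanity-checked by treating each of the 1172 test questions as an independent right/wrong trial. This is a sketch under that binomial assumption; lm-eval's exact estimator may differ slightly. It also helps to remember that ARC questions are multiple choice, mostly with four options, so random guessing lands near 0.25.

```python
import math


def binomial_stderr(acc, n):
    """Standard error of an accuracy estimated from n independent questions."""
    return math.sqrt(acc * (1.0 - acc) / n)


n_test = 1172  # size of the ARC-Challenge test split shown in the log
for name, acc in [("acc", 0.2799), ("acc_norm", 0.3063)]:
    se = binomial_stderr(acc, n_test)
    low, high = acc - 1.96 * se, acc + 1.96 * se
    print(f"{name}: {acc:.4f} +/- {se:.4f} (approx. 95% interval {low:.3f} to {high:.3f})")
```

Under these assumptions the interval for `acc` sits only slightly above the chance level, so this 1B model is barely better than guessing on this benchmark.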

### Bits & Bytes / Quantization Context

bitsandbytes is a library that implements efficient quantized operations (4-bit, 8-bit) for matrices / linear layers, especially useful for large models.

The transformers library supports integration with bitsandbytes via a BitsAndBytesConfig class. That config tells from_pretrained how to load model weights in lower precision (4-bit or 8-bit) rather than default high precision.

This integration is meant for inference (or adapter training such as QLoRA) rather than full training from scratch. Loading in 4-bit shrinks the weights to roughly a quarter of their 16-bit size, and you can still generate text.
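A back-of-the-envelope estimate of the weight memory at different precisions. The sketch assumes a round 8 billion parameters and counts weights only: activations, the KV cache, quantization constants and any layers left unquantized are ignored, so real usage will be higher.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3


num_params = 8e9  # assumed round figure for an "8B" model
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(num_params, bits):.2f} GiB")
```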

In [None]:
!pip install transformers accelerate
!pip install -U bitsandbytes
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"   # example HF model
bnb = BitsAndBytesConfig(load_in_4bit=True)      # or load_in_8bit=True
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)

inputs = tok("Explain GPTQ vs AWQ simply.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=80)
print(tok.decode(out[0], skip_special_tokens=True))

### ExLlama

ExLlama is a Python/C++/CUDA implementation tailored for Llama (and Llama-style) models, working especially with 4-bit GPTQ weights.

It provides optimized kernels (linear layers, quantized operators) for inference, to accelerate generation speed and reduce memory overhead.

ExLlamaV2 adds more features: e.g. EXL2 format (a flexible quantization format), better kernel optimizations, dynamic batching, new generation strategies.

ExLlama is often integrated into quantization+inference pipelines — e.g. Transformers’ GPTQ support may allow using ExLlama kernels by specifying certain config flags.

In [None]:
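!pip install exllamav2
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Download a GPTQ/EXL2 model first or point to HF snapshot dir with config + shards
# Example assumes a local directory with GPTQ weights (ex: TheBloke GPTQ)
model_path = "/content/model"  # put your GPTQ model dir here

# The calls below follow the exllamav2 example scripts; names can differ between versions
cfg = ExLlamaV2Config(model_path)
model = ExLlamaV2(cfg)
cache = ExLlamaV2Cache(model, lazy=True)   # KV cache, allocated while the model loads
model.load_autosplit(cache)                # load weights across the available GPUs
tok = ExLlamaV2Tokenizer(cfg)
gen = ExLlamaV2BaseGenerator(model, cache, tok)
settings = ExLlamaV2Sampler.Settings()     # default sampling settings

# generate_simple(prompt, sampling settings, number of new tokens)
print(gen.generate_simple("Explain GPTQ vs AWQ simply.", settings, 80))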
