# Requirements

In [1]:
!nvcc --version

/bin/bash: nvcc: command not found
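
The `nvcc: command not found` output above only says that the CUDA compiler is not on the `PATH`; it does not say whether a GPU driver is present. A minimal sketch for checking both from Python with the standard library alone (the helper name `find_cuda_tools` is hypothetical):

```python
import shutil


def find_cuda_tools() -> dict:
    """Return the resolved path (or None) of each CUDA-related executable."""
    # shutil.which searches PATH the same way the shell does.
    return {tool: shutil.which(tool) for tool in ("nvcc", "nvidia-smi")}


for tool, path in find_cuda_tools().items():
    print(f"{tool}: {path or 'not found on PATH'}")
```

If `nvidia-smi` is found but `nvcc` is not, the driver is probably installed without the CUDA toolkit. Pre-built wheels may still work in that case, but building the CUDA extension from source is likely to need `nvcc`.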


**Requirements**

Requires: Transformers 4.33.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later.

```shell
pip3 install --upgrade transformers optimum
# If using PyTorch 2.1 + CUDA 12.x:
pip3 install --upgrade auto-gptq
# or, if using PyTorch 2.1 + CUDA 11.x:
pip3 install --upgrade auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/
```

If you are using PyTorch 2.0, you will need to install AutoGPTQ from source. If the pre-built wheels give you problems, building from source is also worth trying:

```shell
pip3 uninstall -y auto-gptq
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
git checkout v0.5.1
pip3 install .
```
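
Once the packages are installed, the minimum versions quoted above can be checked from Python. This is a minimal sketch using only the standard library; `version_tuple` and `meets_minimum` are hypothetical helpers, and the parsing is deliberately simple (it compares only the leading numeric components and ignores pre-release tags).

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions quoted in the requirements above.
MINIMUMS = {"transformers": "4.33.0", "optimum": "1.12.0", "auto-gptq": "0.4.2"}


def version_tuple(text: str) -> tuple:
    """Parse leading numeric components, e.g. '0.7.1+cu118' -> (0, 7, 1)."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:  # stop at tags such as 'dev0'
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


for package, minimum in MINIMUMS.items():
    try:
        installed = version(package)
    except PackageNotFoundError:
        print(f"{package}: not installed (need >= {minimum})")
        continue
    status = "ok" if meets_minimum(installed, minimum) else f"too old, need >= {minimum}"
    print(f"{package}: {installed} ({status})")
```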


In [2]:
!pip install --upgrade transformers optimum
# If using PyTorch 2.1 + CUDA 12.x:
!pip install --upgrade auto-gptq
# or, if using PyTorch 2.1 + CUDA 11.x:
# !pip install --upgrade auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/

Collecting transformers
  Obtaining dependency information for transformers from https://files.pythonhosted.org/packages/45/d6/a69764e89fc5c2c957aa473881527c8c35521108d553df703e9ba703daeb/transformers-4.48.0-py3-none-any.whl.metadata
  Using cached transformers-4.48.0-py3-none-any.whl.metadata (44 kB)
Collecting optimum
  Obtaining dependency information for optimum from https://files.pythonhosted.org/packages/48/33/97cf226c47e4cf5a79159668732038cdd6c0199c72782d5b5a0db54f9a2d/optimum-1.23.3-py3-none-any.whl.metadata
  Using cached optimum-1.23.3-py3-none-any.whl.metadata (20 kB)
Collecting huggingface-hub<1.0,>=0.24.0 (from transformers)
  Obtaining dependency information for huggingface-hub<1.0,>=0.24.0 from https://files.pythonhosted.org/packages/6c/3f/50f6b25fafdcfb1c089187a328c95081abf882309afd86f4053951507cd1/huggingface_hub-0.27.1-py3-none-any.whl.metadata
  Using cached huggingface_hub-0.27.1-py3-none-any.whl.metadata (13 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers)

Check the latest available versions of AutoGPTQ here: https://github.com/AutoGPTQ/AutoGPTQ/blob/main/docs/INSTALLATION.md

In [None]:
!huggingface-cli login --token 

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
The token `3B` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `3B`
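
Pasting a token directly into a notebook cell makes it easy to leak when the notebook is shared. One alternative, sketched here under the assumption that the token has been exported as the `HF_TOKEN` environment variable, is to read it from the environment and fail early if it is missing (`get_hf_token` is a hypothetical helper):

```python
import os


def get_hf_token(env=os.environ) -> str:
    """Return the Hugging Face token from HF_TOKEN, or raise if it is unset."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("Set the HF_TOKEN environment variable before logging in.")
    return token
```

With the variable exported, the login cell can be written as `!huggingface-cli login --token $HF_TOKEN` so the token never appears in the notebook; recent versions of `huggingface_hub` also pick up `HF_TOKEN` on their own.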


In [4]:
import os
import random
from typing import Any

import torch
from datasets import load_dataset
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig


def get_c4(tokenizer: Any, seqlen: int, nsamples: int, split: str = "train"):
    """Sample `nsamples` random windows of `seqlen` tokens from one C4 shard."""
    if split == "train":
        data = load_dataset("allenai/c4", split="train", data_files={"train": "en/c4-train.00000-of-01024.json.gz"})
    elif split == "validation":
        data = load_dataset(
            "allenai/c4",
            split="validation",
            data_files={"validation": "en/c4-validation.00000-of-00008.json.gz"},
        )
    else:
        raise ValueError(f"Unsupported split: {split!r}")
    dataset = []
    for _ in range(nsamples):
        # Draw documents until one is at least `seqlen` tokens long.
        while True:
            i = random.randint(0, len(data) - 1)
            enc = tokenizer(data[i]["text"], return_tensors="pt")
            if enc.input_ids.shape[1] >= seqlen:
                break
        # randint is inclusive, so the last valid start is length - seqlen.
        # Every accepted document yields a window, so exactly `nsamples` examples are returned.
        i = random.randint(0, enc.input_ids.shape[1] - seqlen)
        j = i + seqlen
        inp = enc.input_ids[:, i:j]
        attention_mask = torch.ones_like(inp)
        dataset.append({"input_ids": inp, "attention_mask": attention_mask})
    return dataset

  from .autonotebook import tqdm as notebook_tqdm
CUDA extension not installed.
CUDA extension not installed.
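
The core of `get_c4` is picking a random contiguous window of `seqlen` tokens from a document that is at least that long. A minimal sketch of that one step on a plain Python list, with no tokenizer or dataset involved (`sample_window` is a hypothetical helper, and unlike the cell above it takes a seeded generator so the result is reproducible):

```python
import random


def sample_window(token_ids: list, seqlen: int, rng: random.Random) -> list:
    """Return a random contiguous slice of exactly `seqlen` tokens."""
    if len(token_ids) < seqlen:
        raise ValueError("document is shorter than seqlen")
    # randint is inclusive on both ends, so the last valid start is len - seqlen.
    start = rng.randint(0, len(token_ids) - seqlen)
    return token_ids[start:start + seqlen]


rng = random.Random(0)  # seeded so the sampled window is reproducible
doc = list(range(10))   # stand-in for a tokenized document
window = sample_window(doc, seqlen=4, rng=rng)
print(window)
```

Seeding matters for calibration data: two quantization runs with different random windows can produce slightly different quantized weights.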


In [5]:
#@title Choose a model for quantization
pretrained_model_dir = "meta-llama/Llama-3.2-1B" #@param str
!echo proceed with model: {pretrained_model_dir}

proceed with model: meta-llama/Llama-3.2-1B


In [6]:
#@title Enter the desired bit precision (n-bit) for quantization (e.g., 2,3,4,8):
n_bits = 4 #@param int

# Quantization

In [7]:
model_name = pretrained_model_dir.split("/")[-1]
quantized_model_dir = f"{model_name}-{n_bits}bit"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = get_c4(tokenizer=tokenizer, seqlen=2048, nsamples=128, split="train")

quantize_config = BaseQuantizeConfig(
    bits=n_bits,
    group_size=128,  # number of weights that share one set of quantization parameters
    desc_act=False,  # False speeds up inference, possibly at a small cost in accuracy
)

# Load the unquantized model. AutoGPTQ loads it into CPU memory by default, so move it to the first GPU.
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config).to("cuda:0")

# Quantization runs in the next cell. The examples must be a list of dicts whose only keys are "input_ids" and "attention_mask".

KeyboardInterrupt: 
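
For a rough sense of what `bits` and `group_size` cost in storage: each group of `group_size` weights shares one scale and one zero-point, so the average cost per weight is slightly above `bits`. The arithmetic below assumes a 16-bit scale and a `bits`-wide zero-point per group. That is an assumption about the storage layout, not something stated in this notebook, and it ignores layers left unquantized such as `lm_head`.

```python
def effective_bits_per_weight(bits: int, group_size: int, scale_bits: int = 16) -> float:
    """Average storage per weight, including per-group scale and zero-point."""
    overhead_per_group = scale_bits + bits  # assumed: one scale + one zero-point
    return bits + overhead_per_group / group_size


for bits in (2, 3, 4, 8):
    eff = effective_bits_per_weight(bits, group_size=128)
    print(f"{bits}-bit, group_size=128: {eff:.3f} bits/weight, "
          f"{16 / eff:.2f}x smaller than fp16")
```

Under these assumptions a smaller `group_size` raises the overhead per weight, which is the price paid for finer-grained quantization parameters.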

In [18]:
model.quantize(examples)
model.save_quantized(quantized_model_dir, use_safetensors=True)

# Note: by default, AutoGPTQ saves the weights under the base name gptq_model-{bits}bit-{group_size}g.
# To allow loading with the transformers class AutoModelForCausalLM, rename the file to model.safetensors, as below.
matching_file_weights = [_filename for _filename in os.listdir(quantized_model_dir)
                         if _filename.endswith('.safetensors') and _filename != 'model.safetensors']

if matching_file_weights:
    os.rename(
        os.path.join(quantized_model_dir, matching_file_weights[0]),
        os.path.join(quantized_model_dir, 'model.safetensors')
    )

# Voilà, now the model can be used for inference
# load quantized model to the first GPU
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))

# or you can also use pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])

1. You disabled CUDA extensions compilation by setting BUILD_CUDA_EXT=0 when install auto_gptq from source.
2. You are using pytorch without CUDA support.
3. CUDA and nvcc are not installed in your device.
INFO - The layer lm_head is not quantized.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Device set to use cuda:0
The model 'LlamaGPTQForCausalLM' is not supported for . Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FalconMambaForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GitForCausalLM', 'GlmForCausalLM', 'GPT2LM

<|begin_of_text|>auto_gptq is a tool for generating GPT-3 style text. It is inspired by the GPT-3
auto-gptq is a tool for generating quantum circuits from a given quantum circuit. It is based on the quantum circuit generator
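
The rename step in the cell above takes the first `.safetensors` file it finds, which is ambiguous if the directory ever holds more than one. A slightly stricter variant, sketched with `pathlib` and demonstrated on a temporary directory with a dummy file (`rename_single_safetensors` is a hypothetical helper, not part of AutoGPTQ):

```python
import tempfile
from pathlib import Path


def rename_single_safetensors(model_dir: str, target: str = "model.safetensors") -> Path:
    """Rename the only *.safetensors file in model_dir to `target`."""
    directory = Path(model_dir)
    target_path = directory / target
    if target_path.exists():
        return target_path  # already renamed, nothing to do
    candidates = sorted(directory.glob("*.safetensors"))
    if len(candidates) != 1:
        raise RuntimeError(f"expected exactly one .safetensors file, found {len(candidates)}")
    candidates[0].rename(target_path)
    return target_path


# Demonstration on a temporary directory containing one dummy weights file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "gptq_model-4bit-128g.safetensors").write_bytes(b"")
    renamed = rename_single_safetensors(tmp)
    print(renamed.name, renamed.exists())
```

Calling the helper a second time on the same directory is a no-op, so the cell can be re-run safely.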


# Push Quantized Model to Hugging Face Hub

To use `use_auth_token=True`, log in first via `huggingface-cli login`, or pass an explicit token with: `use_auth_token="hf_xxxxxxx"`.

**To enable this feature, run the lines below in a code cell, replacing `YourUserName` with your Hub username:**

```python
repo_id = f"YourUserName/{quantized_model_dir}"
commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits} bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
```
**Note**: By default, AutoGPTQ saves the model weights under the base name `gptq_model-{bits}bit-{group_size}g`. To allow loading with the transformers class `AutoModelForCausalLM`, rename the file to `model.safetensors`, as shown above.

```python
model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True)
tokenizer.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True)
```