# 4-bit Quantization with GPTQ

Source: https://mlabonne.github.io/blog/posts/4_bit_Quantization_with_GPTQ.html

In [2]:
import random

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
import torch
from transformers import AutoTokenizer

In [4]:
# Define base model and output directory
model_id = "gpt2"
out_dir = model_id + "-GPTQ"

# GPTQ Model

In [5]:
from gptqmodel import GPTQModel

In [7]:
model = GPTQModel.load("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5")

Fetching 8 files: 100%|██████████| 8/8 [00:39<00:00,  4.88s/it]
INFO - You passed a model that is compatible with the Marlin int4*fp16 GPTQ kernel but backend is not BACKEND.MARLIN. We recommend using `backend=BACKEND.MARLIN` to use the optimized Marlin kernels for inference. Example: `model = GPTQModel.from_quantized(..., backend=BACKEND.MARLIN)`.
INFO - Auto pick kernel based on compatibility: <class 'gptqmodel.nn_modules.qlinear.marlin.MarlinQuantLinear'>
INFO - Compatibility: converting `checkpoint_format` from `gptq` to `gptq_v2`.


In [9]:
model.device

device(type='cuda', index=0)

In [12]:
from transformers import AutoTokenizer


In [13]:
tokenizer = AutoTokenizer.from_pretrained("ModelCloud/Llama-3.2-1B-Instruct-gptqmodel-4bit-vortex-v2.5")

In [15]:
len(tokenizer)

128256

In [24]:
messages = [
    {"role": "system", "content": "You are Llama, created by Meta. You are a helpful assistant."},
    {"role": "user", "content": "Create a simple linked list in Python"},
]

In [25]:
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

In [26]:
input_tensor

tensor([[128000, 128006,   9125, 128007,    271,  38766,   1303,  33025,   2696,
             25,   6790,    220,   2366,     18,    198,  15724,   2696,     25,
            220,   2304,   4448,    220,   2366,     20,    271,   2675,    527,
            445,  81101,     11,   3549,    555,  16197,     13,   1472,    527,
            264,  11190,  18328,     13, 128009, 128006,    882, 128007,    271,
           4110,    264,   4382,  10815,   1160,    304,  13325, 128009, 128006,
          78191, 128007,    271]])

In [27]:
input_tensor.shape

torch.Size([1, 57])

In [28]:
outputs = model.generate(input_ids=input_tensor.to(model.device), max_new_tokens=512)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


In [29]:
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)

In [31]:
# print(result)

# Introduction to the problem: Optimal Brain Quantization

### Optimal Brain Quantization

Let’s start by introducing the problem we’re trying to solve. For every layer $\ell$ in the network, we want to find a quantized version $\hat{\mathbf{W}}_{\ell}$ of the original weights $\mathbf{W}_{\ell}$. This is called the *layer-wise compression problem*. More specifically, to minimize performance degradation, we want the outputs $(\hat{\mathbf{W}}_{\ell}\,\mathbf{X}_{\ell})$ of these new weights to be as close as possible to the original ones $(\mathbf{W}_{\ell}\,\mathbf{X}_{\ell})$. In other words, we want to find:

$$
\arg \min_{\hat{\mathbf{W}}_{\ell}} \bigl\|\mathbf{W}_{\ell}\,\mathbf{X}_{\ell} \;-\; \hat{\mathbf{W}}_{\ell}\,\mathbf{X}_{\ell}\bigr\|_{2}^{2}.
$$
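To make the objective concrete, here is a minimal NumPy sketch that evaluates the layer-wise error for a plain round-to-nearest quantizer. The shapes, the symmetric 4-bit grid, and the helper names `round_to_grid` and `layer_error` are assumptions chosen for illustration; they do not come from any library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer: 4 output rows, 8 input features, 32 calibration columns.
W = rng.normal(size=(4, 8))
X = rng.normal(size=(8, 32))

def round_to_grid(w, n_bits=4):
    """Round-to-nearest on a symmetric uniform grid (quantize, then dequantize)."""
    q_max = 2 ** (n_bits - 1) - 1            # 7 for 4 bits
    scale = np.abs(w).max() / q_max          # one scale for the whole matrix
    return np.clip(np.round(w / scale), -q_max, q_max) * scale

def layer_error(W, W_hat, X):
    """Squared norm of the output difference W X - W_hat X."""
    return float(np.sum((W @ X - W_hat @ X) ** 2))

W_hat = round_to_grid(W)
err = layer_error(W, W_hat, X)
print(f"layer-wise error after round-to-nearest: {err:.4f}")
```

The methods discussed next try to find a $\hat{\mathbf{W}}_{\ell}$ on the same grid with a lower value of this error than simple rounding gives.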

Different approaches have been proposed to solve this problem, but we’re interested in the **Optimal Brain Quantizer (OBQ)** framework here.

This method is inspired by a pruning technique to carefully remove weights from a fully trained dense neural network (Optimal Brain Surgeon). It uses an approximation technique and provides explicit formulas for the best single weight $w_{q}$ to remove and optimal update $\delta_{F}$ to adjust the set of remaining non-quantized weights $F$ to make up for the removal:

$$
w_{q} \;=\; \arg\min_{w_{q}} \;\frac{\bigl(\mathrm{quant}(w_{q}) - w_{q}\bigr)^{2}}{\bigl[H_{F}^{-1}\bigr]_{qq}}, 
\quad
\delta_{F} \;=\; -\, \frac{w_{q} \;-\; \mathrm{quant}(w_{q})}{\bigl[H_{F}^{-1}\bigr]_{qq}} \;\cdot\; \bigl(H_{F}^{-1}\bigr)_{:,q}.
$$

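where $\mathrm{quant}(w)$ is the weight rounding given by the quantization and $H_{F}$ is the Hessian of the layer-wise objective restricted to the remaining non-quantized weights $F$. For the squared error above, this is $H_{F} = 2\,\mathbf{X}_{F}\mathbf{X}_{F}^{\top}$.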


### Optimal Brain Surgeon

Optimal Brain Surgeon (OBS) was introduced in the early 1990s (Hassibi & Stork, 1993) as a technique to prune neural networks by removing weights one at a time while minimizing the increase in the training loss. 

- **Pruning (or quantizing) a weight** is viewed as constraining it to a new (often smaller) value—in classical pruning, that new value might be zero; in quantization, it might be a discrete number from a small set.

- Because the neural network is highly over-parameterized, changing a single weight can often be **compensated by small adjustments to the others**.

**OBS provides a systematic way to find and apply those compensations**, using second-order approximations of the loss function.

After training, the network sits (approximately) at a local minimum $\mathbf{w}^*$ of the loss, so $\nabla_{\mathbf{w}} L(\mathbf{w}^*) \;=\; \mathbf{0}.$

Now imagine we make a small change (or “perturbation”) $\Delta \mathbf{w}$ to the weights. The loss at the perturbed weights can be approximated via a second-order Taylor expansion:

$$L(\mathbf{w}^* + \Delta \mathbf{w})
\;\approx\; L(\mathbf{w}^*) \;+\;
\underbrace{\nabla_{\mathbf{w}} L(\mathbf{w}^*)}_{=\mathbf{0}}^\top \,\Delta \mathbf{w}
\;+\; \tfrac{1}{2}\,\Delta \mathbf{w}^\top \underbrace{\nabla_{\mathbf{w}}^2 L(\mathbf{w}^*)}_{H} \,\Delta \mathbf{w},$$

where $H = \nabla_{\mathbf{w}}^2 L(\mathbf{w}^*)$ is the Hessian matrix evaluated at $\mathbf{w}^*$.

Since we are at a local minimum, $\nabla_{\mathbf{w}} L(\mathbf{w}^*)=\mathbf{0}$ and the first-order term vanishes.

Therefore:

$$\Delta L \;=\;
L(\mathbf{w}^* + \Delta \mathbf{w}) - L(\mathbf{w}^*)
\;\approx\; \tfrac{1}{2} \,\Delta \mathbf{w}^\top H \,\Delta \mathbf{w}.$$
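The approximation can be checked numerically. The sketch below assumes a least-squares loss $L(\mathbf{w}) = \tfrac{1}{2}\|X\mathbf{w} - y\|^2$, chosen only because its Hessian $X^\top X$ is known in closed form and the second-order expansion is exact for it; the data is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)

def loss(w):
    """Least-squares loss 0.5 * ||X w - y||^2 (an assumed toy loss)."""
    return 0.5 * np.sum((X @ w - y) ** 2)

# Minimizer of the loss, where the gradient vanishes.
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
grad = X.T @ (X @ w_star - y)
H = X.T @ X                       # Hessian of the toy loss

# Perturb the weights and compare the true and predicted loss increase.
delta = 0.1 * rng.normal(size=5)
actual = loss(w_star + delta) - loss(w_star)
predicted = 0.5 * delta @ H @ delta
print(actual, predicted)
```

For a real network the loss is not quadratic, so the two numbers would agree only approximately and only for small perturbations.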

The full OBS derivation goes further, but this is enough to see why we need the Hessian of the loss: it determines how the set of remaining parameters $F$ should move to compensate for the pruned or quantized weight.

### IMPORTANT NOTE:
When you see an expression like

$$
\mathrm{quant}(w_{q}) \;-\; w_{q},
$$

it is important to note that $\mathrm{quant}(\cdot)$ usually refers **not** just to mapping $w_q$ to an integer, but rather **the round-trip quantize–dequantize operation**. In other words:

1. **Quantize** a floating-point value $w_q$ into an integer (e.g. in $[0, 255]$ for 8-bit).
2. **Dequantize** that integer back into a floating-point approximation (which lives in the same domain as the original $w_q$).

Thus, $\mathrm{quant}(w_{q})$ ends up being a floating-point number **close to** $w_q$, but restricted to the discrete set of representable values allowed by your quantization scheme.
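The round trip can be written in a few lines. This is a minimal sketch assuming an affine 8-bit scheme with a hypothetical `scale` and `zero_point`; real quantizers derive these values from the weights.

```python
def quantize(w, scale, zero_point, n_bits=8):
    """Map a float to an integer code in [0, 2**n_bits - 1]."""
    code = round(w / scale) + zero_point
    return max(0, min(2 ** n_bits - 1, code))

def dequantize(code, scale, zero_point):
    """Map an integer code back to a float on the quantization grid."""
    return (code - zero_point) * scale

def quant(w, scale, zero_point, n_bits=8):
    """Round trip: the quant(.) used in the OBQ formulas."""
    return dequantize(quantize(w, scale, zero_point, n_bits), scale, zero_point)

scale, zero_point = 0.02, 128      # assumed quantization parameters
w_q = 0.137
code = quantize(w_q, scale, zero_point)   # integer code 135
w_hat = quant(w_q, scale, zero_point)     # approximately 0.14
print(code, w_hat, w_hat - w_q)
```

The difference `w_hat - w_q` is the quantity $\mathrm{quant}(w_q) - w_q$ that appears in the formulas, and it is at most half a grid step in magnitude when no clipping occurs.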


## Coming back to our formulas for OBQ

$$
w_{q} \;=\; \arg\min_{w_{q}} \;\frac{\bigl(\mathrm{quant}(w_{q}) - w_{q}\bigr)^{2}}{\bigl[H_{F}^{-1}\bigr]_{qq}}, $$

$$
\delta_{F} \;=\; -\, \frac{w_{q} \;-\; \mathrm{quant}(w_{q})}{\bigl[H_{F}^{-1}\bigr]_{qq}} \;\cdot\; \bigl(H_{F}^{-1}\bigr)_{:,q}.
$$

Using OBQ, we can quantize the easiest weight first and then adjust all remaining non-quantized weights to compensate for this precision loss. Then we pick the next weight to quantize, and so on.
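One such step can be sketched in NumPy for a single row of weights. The sketch assumes the layer-wise Hessian $H = 2XX^\top$ and a hypothetical uniform grid with spacing `step`; it compares the compensated update against simply rounding the chosen weight and leaving the others untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 64))      # 6 input features, 64 calibration samples
w = rng.normal(size=6)            # one row of the weight matrix
H = 2 * X @ X.T                   # Hessian of ||w X - w_hat X||^2 (assumed form)
H_inv = np.linalg.inv(H)

def quant(v, step=0.5):
    """Quantize-dequantize on a uniform grid with spacing `step` (assumed)."""
    return np.round(v / step) * step

# 1. Pick the weight whose quantization adds the least error.
scores = (quant(w) - w) ** 2 / np.diag(H_inv)
q = int(np.argmin(scores))

# 2. Apply the optimal update; entry q lands exactly on quant(w[q]).
delta = -(w[q] - quant(w[q])) / H_inv[q, q] * H_inv[:, q]
w_obq = w + delta

# Baseline: round w[q] and do not compensate.
w_naive = w.copy()
w_naive[q] = quant(w[q])

err_obq = np.sum((w @ X - w_obq @ X) ** 2)
err_naive = np.sum((w @ X - w_naive @ X) ** 2)
print(f"with compensation: {err_obq:.6f}, without: {err_naive:.6f}")
```

A full OBQ pass would repeat this on the remaining weights after removing index `q` from the set $F$.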

**Problem: Outliers**: A potential issue with this approach is when there are outlier weights, which can result in high quantization error. Usually, these outliers would be quantized last, when there are few non-quantized weights left that could be adjusted to compensate for the large error.

### Dealing with the computational cost

This process could be computationally heavy, especially for LLMs. To deal with this, the OBQ method uses a trick that avoids redoing the entire computation each time a weight is quantized.

After quantizing a weight, it updates the inverse Hessian used in the formulas above by removing the row and column associated with that weight (one step of Gaussian elimination), instead of inverting the reduced Hessian from scratch.
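The update can be sketched as follows and checked against a direct inversion. The helper name `remove_from_inverse` is hypothetical, and the Hessian is built from random data under the assumption $H = 2XX^\top$.

```python
import numpy as np

def remove_from_inverse(H_inv, q):
    """Inverse of H with row/column q removed, computed from H_inv alone."""
    # One Gaussian-elimination step zeroes row and column q of the inverse.
    updated = H_inv - np.outer(H_inv[:, q], H_inv[q, :]) / H_inv[q, q]
    keep = np.arange(H_inv.shape[0]) != q
    return updated[np.ix_(keep, keep)]

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 40))
H = 2 * X @ X.T
H_inv = np.linalg.inv(H)

q = 2
keep = np.arange(5) != q
fast = remove_from_inverse(H_inv, q)              # quadratic-cost update
slow = np.linalg.inv(H[np.ix_(keep, keep)])       # cubic-cost re-inversion
print(np.abs(fast - slow).max())
```

The update costs on the order of $d^2$ operations for a $d \times d$ matrix, whereas re-inverting the reduced Hessian costs on the order of $d^3$.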

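Even with this trick, the cost is **cubic** in the input dimension: for a $d_{\text{row}} \times d_{\text{col}}$ weight matrix, OBQ runs in $O(d_{\text{row}} \cdot d_{\text{col}}^{3})$ time. This cubic growth makes it difficult to use OBQ on very large models with billions of parameters.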

# The GPTQ Algorithm

GPTQ scales the OBQ method to very large language models.

The general procedure in GPTQ:

- Partition $W$ into blocks. For example, in GPTQ-for-LLaMA, **each transformer layer might be broken into smaller sub-matrices**, or you use one block per layer.
- **Quantize a block using an iterative second-order approach**, sometimes referred to as “Error Compensation.”
- **Update the residual or remainder of the block** (and possibly other blocks) to compensate for the error.
- Repeat until all blocks are quantized.

### Mathematical derivation

Start with a **block of weights**:

$$ \mathbf{W}_{b} \in \mathbb{R}^{m \times n}$$

The idea is to quantize (some or all of) these parameters from high precision (e.g., float16) to a discrete set $\mathbf{Q}$. **GPTQ proceeds iteratively, often focusing on one “row” or “column” (or small set of columns) at a time within the block.**

#### Second-Order Approximation

We start with a second-order Taylor expansion of the loss function around the current trained weights $\mathbf{w}^*$:

$$\Delta \mathcal{L}(\Delta \mathbf{w})
\;\approx\;
\frac{1}{2}
\,\Delta \mathbf{w}^\top
H
\,\Delta \mathbf{w}.$$

**Goal**: If we quantize a subset of the parameters $\mathbf{w}_Q\subset \mathbf{w}$ to discrete values, we can allow a compensatory update $\Delta \mathbf{w}_F$ in the "free" subset of parameters $\mathbf{w}_F$, chosen to minimize the second-order loss increase. The change in the quantized subset is fixed by the quantizer:

$$\Delta \mathbf{w}_Q
\;=\; \mathrm{quant}(\mathbf{w}_Q^*) \;-\; \mathbf{w}_Q^*,
\quad\quad
\mathbf{w}_Q^* \,\text{are the original values}.$$

### One-by-One or Small-Group Updates

A common practical approach is:

- Pick a single row (or small group) of $\mathbf{W}_b$.
- Attempt to quantize it from float16 to 4-bit.
- Solve for how to best update the unquantized portion of that row (or block) to compensate.
- Move on to the next row/group.


### Minimization

To minimize $\Delta \mathcal{L}$ we solve for $\Delta \mathbf{w}_F$. In a small group context, this can be done with a local Hessian or a block of the Hessian, sometimes denoted $H_b$. Then we do:

$$\Delta \mathbf{w}_F
\;=\;
\arg \min_{\Delta \mathbf{w}_F}
\;
\tfrac{1}{2}
\begin{pmatrix}
\Delta \mathbf{w}_Q \\
\Delta \mathbf{w}_F
\end{pmatrix}^\top
H_b
\begin{pmatrix}
\Delta \mathbf{w}_Q \\
\Delta \mathbf{w}_F
\end{pmatrix}.$$


Because $\Delta \mathbf{w}_Q$ is fixed, we can differentiate with respect to $\Delta \mathbf{w}_F$ only and set the result to zero. This yields a closed-form update if $H_b$ is well-defined (invertible). In practice, GPTQ uses approximations or partial factorization of $H_b$.
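Writing $H_b$ in blocks $H_{QQ}, H_{QF}, H_{FQ}, H_{FF}$ that match the split into quantized and free weights, the stationarity condition is $H_{FQ}\,\Delta \mathbf{w}_Q + H_{FF}\,\Delta \mathbf{w}_F = \mathbf{0}$, so $\Delta \mathbf{w}_F = -H_{FF}^{-1} H_{FQ}\,\Delta \mathbf{w}_Q$. The sketch below checks this on a random positive-definite matrix standing in for $H_b$; the sizes and the values in `dw_Q` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 30))
H_b = A @ A.T                      # positive-definite stand-in for H_b

n_q = 2                            # first 2 coordinates quantized, last 4 free
H_FQ = H_b[n_q:, :n_q]
H_FF = H_b[n_q:, n_q:]

dw_Q = np.array([0.05, -0.03])     # fixed quantization error (assumed values)
dw_F = -np.linalg.solve(H_FF, H_FQ @ dw_Q)   # closed-form compensation

def delta_loss(dq, df):
    """Second-order loss increase 0.5 * d^T H_b d for d = (dq, df)."""
    d = np.concatenate([dq, df])
    return 0.5 * d @ H_b @ d

print("with compensation:   ", delta_loss(dw_Q, dw_F))
print("without compensation:", delta_loss(dw_Q, np.zeros(4)))
```

Using `np.linalg.solve` instead of forming $H_{FF}^{-1}$ explicitly is a common numerical choice; it is not something the derivation itself requires.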

### Final Summary


1. **Initialize**: $\mathbf{W}_b \leftarrow \mathbf{W}_b^*$ (the trained block weights).  
2. **Compute Hessian Approx**: $\tilde{H}_b \approx \nabla^2_{\mathbf{W}_b} \mathcal{L}$. (In practice, may only store or invert partial info.)  
3. **Partition** $\mathbf{W}_b$ into rows or columns that you will quantize in small sets.  
4. **For each row/column** $r$ in block $b$:  
   - Compute $\Delta w_j = \mathrm{quant}(w_j^*) - w_j^*$ for $j \in r$.  
   - Solve for $\Delta \mathbf{w}_F$ in the “free” weights to minimize $\Delta \mathcal{L} \approx \tfrac{1}{2}\,\Delta \mathbf{w}^\top \tilde{H}_b \,\Delta \mathbf{w}$.  
   - Update $\mathbf{W}_b \leftarrow \mathbf{W}_b + \Delta \mathbf{w}$.  
5. **Move to next block** $b+1$.  

After all blocks are processed, you have $\widehat{\mathbf{W}}$, your 4-bit (or otherwise quantized) weights.  


## Step 1: Arbitrary Order Insight


- The OBQ method selects weights (parameters in a model) for quantization in a greedy order, always picking the weight that adds the least additional error.
- GPTQ observes that for large models, quantizing weights in any fixed order can perform just as well.

GPTQ therefore quantizes the weights in the same order for all rows of a matrix, so the inverse-Hessian updates are computed once per column instead of once per weight.

## Step 2: Lazy Batch-Updates

Applied column by column, this scheme won’t be fast: each step updates a huge matrix while doing very few computations per entry. This type of operation can’t use the full compute capabilities of GPUs and is bottlenecked by memory bandwidth.

To resolve this, GPTQ introduces “lazy batch” updates. It turns out that the final rounding decisions for a given column are only affected by updates performed on that column, not on later columns.

**Therefore, GPTQ can apply the algorithm to a batch of columns at a time (like 128 columns), updating only those columns and a corresponding block of the matrix.**

## Step 3: Cholesky Reformulation

When the algorithm scales up to very large models, numerical inaccuracies can become a problem. Specifically, repeated applications of a certain operation can accumulate numerical errors.

To tackle this, GPTQ uses a Cholesky decomposition, a numerically stable method for solving certain mathematical problems. It involves precomputing some required information from the matrix using the Cholesky method. This approach, combined with a slight “dampening” (adding a small constant to diagonal elements of the matrix), helps the algorithm to avoid numerical issues.

1. The GPTQ algorithm begins with a Cholesky decomposition of the Hessian inverse (a matrix that helps decide how to adjust the weights).

2. It then runs in loops, handling batches of columns at a time.

3. For each column in a batch, it quantizes the weights, calculates the error, and updates the weights in the block accordingly.

4. After processing the batch, it updates all remaining weights based on the block’s errors.
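The four steps above can be sketched in NumPy. This is an illustrative reimplementation under simplifying assumptions (a fixed uniform grid with spacing `step`, no grouping, float64 arithmetic), not the code of any GPTQ library; the function name `gptq_quantize` and its parameters are hypothetical.

```python
import numpy as np

def quant(v, step=0.25):
    """Quantize-dequantize on a uniform grid with spacing `step` (assumed)."""
    return np.round(v / step) * step

def gptq_quantize(W, H, block_size=4, damp_percent=0.01, step=0.25):
    """Quantize W column by column with lazy batch updates."""
    W = W.astype(np.float64).copy()
    d_row, d_col = W.shape
    # Dampening: add a small constant to the diagonal before inverting.
    H = H + damp_percent * np.mean(np.diag(H)) * np.eye(d_col)
    # Step 1: upper-triangular Cholesky factor U of the inverse (H_inv = U.T @ U).
    U = np.linalg.cholesky(np.linalg.inv(H)).T
    Q = np.zeros_like(W)
    # Step 2: loop over batches of columns.
    for i in range(0, d_col, block_size):
        end = min(i + block_size, d_col)
        E = np.zeros((d_row, end - i))
        # Step 3: quantize each column and update the rest of the block.
        for j in range(i, end):
            Q[:, j] = quant(W[:, j], step)
            E[:, j - i] = (W[:, j] - Q[:, j]) / U[j, j]
            W[:, j:end] -= np.outer(E[:, j - i], U[j, j:end])
        # Step 4: apply the accumulated errors to all remaining columns at once.
        W[:, end:] -= E @ U[i:end, end:]
    return Q

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))
X = rng.normal(size=(8, 64))
H = 2 * X @ X.T

Q = gptq_quantize(W, H, block_size=4)
err_gptq = np.sum((W @ X - Q @ X) ** 2)
err_rtn = np.sum((W @ X - quant(W) @ X) ** 2)
print(f"GPTQ sketch: {err_gptq:.4f}, round-to-nearest: {err_rtn:.4f}")
```

Changing `block_size` only changes when the updates to later columns are applied, so it should not change the quantized result beyond floating-point effects.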

### Results

The GPTQ algorithm was tested on various language generation tasks. It was compared with other quantization methods, like rounding all weights to the nearest quantized value (round-to-nearest, **RTN**).

# Quantize an LLM with AutoGPTQ

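The transformers library with bitsandbytes lets you quantize a model at load time with the `load_in_4bit=True` argument, but this requires downloading the full-precision model and holding it in RAM. With GPTQ, the model is quantized once ahead of time and saved, so the quantized checkpoint can be reloaded directly. The cells below use the AutoGPTQ library (`auto_gptq`).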

In [32]:
import random

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
import torch
from transformers import AutoTokenizer

In [33]:
# Define base model and output directory
model_id = "gpt2"
out_dir = model_id + "-GPTQ"

In [34]:
# Load quantize config, model and tokenizer
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    damp_percent=0.01,
    desc_act=False,
)

`group_size`:
- **Meaning**: “Group size” typically refers to how weights (e.g., within a matrix or layer) are grouped for quantization. Often, GPTQ or related methods do "per-channel" or "per-group" quantization, meaning the quantization parameters (like scale and offset) can vary across groups.
- **Why groups?**: Having a single scaling factor for an entire large weight matrix can be too coarse and might degrade accuracy. But having per-weight scaling factors is impractical. A middle ground is to partition the matrix into “groups,” each with its own quantization parameters.
- `group_size=128`: This means that within each weight matrix, we divide it into groups (for instance, 128 consecutive weights in a row or column, depending on the implementation) and quantize each group with its own scale/offset. This typically improves the quantization fidelity compared to one global scale factor, while remaining computationally feasible and memory-efficient.
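The effect of grouping can be illustrated with a small NumPy sketch. It assumes an asymmetric min/max quantizer with one scale and zero-point per group; the helper `quantize_groupwise` is hypothetical and does not reproduce AutoGPTQ's internal quantizer.

```python
import numpy as np

def quantize_groupwise(row, group_size=128, n_bits=4):
    """Quantize-dequantize a 1-D row with one (scale, zero) pair per group."""
    q_max = 2 ** n_bits - 1
    out = np.empty_like(row, dtype=np.float64)
    for start in range(0, row.size, group_size):
        g = row[start:start + group_size]
        lo, hi = g.min(), g.max()
        scale = (hi - lo) / q_max if hi > lo else 1.0
        zero = np.round(-lo / scale)
        codes = np.clip(np.round(g / scale) + zero, 0, q_max)
        out[start:start + group_size] = (codes - zero) * scale
    return out

rng = np.random.default_rng(0)
row = rng.normal(size=512)
row[7] = 25.0                      # a single outlier in the first group

per_group = quantize_groupwise(row, group_size=128)
per_row = quantize_groupwise(row, group_size=row.size)

mse_group = np.mean((row - per_group) ** 2)
mse_row = np.mean((row - per_row) ** 2)
print(f"group_size=128: {mse_group:.4f}, one group per row: {mse_row:.4f}")
```

With a single scale for the whole row, the outlier stretches the grid for every weight; with groups of 128, only the group containing the outlier pays that cost.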

`damp_percent`:
- Meaning: This parameter refers to a “damping” factor, often used in second-order or GPTQ-style quantization algorithms that approximate or utilize the Hessian (or some second-order information).
- In practice: “Damping” helps stabilize the Hessian-based updates, preventing large swings in weight adjustments. It is expressed as a fraction of the average diagonal value of the Hessian, and that amount is added to every diagonal entry before the matrix is inverted.
- `0.01` means a small damping: 1% of the mean diagonal value is added, so we slightly regularize the second-order approximations when computing how to quantize or adjust weights.
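A minimal sketch of this damping step, assuming it adds `damp_percent` times the mean of the Hessian's diagonal to each diagonal entry. The helper name `damp_hessian` is hypothetical, and the example deliberately uses fewer calibration samples than features so that the undamped Hessian is rank-deficient.

```python
import numpy as np

def damp_hessian(H, damp_percent=0.01):
    """Add damp_percent * mean(diag(H)) to every diagonal entry of H."""
    damp = damp_percent * np.mean(np.diag(H))
    return H + damp * np.eye(H.shape[0])

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))        # 8 features but only 4 samples
H = 2 * X @ X.T                    # rank 4, so H is singular

H_damped = damp_hessian(H, damp_percent=0.01)
L = np.linalg.cholesky(H_damped)   # succeeds because H_damped is positive definite

print("rank before damping:", np.linalg.matrix_rank(H))
print("smallest eigenvalue after damping:", np.linalg.eigvalsh(H_damped).min())
```

Without damping, the Cholesky factorization of a singular or nearly singular Hessian can fail or produce very large entries in the inverse.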

`desc_act`:
- Meaning: It allows you to process rows based on decreasing activation, meaning the most important or impactful rows (determined by sampled inputs and outputs) are processed first. This method aims to place most of the quantization error (inevitably introduced during quantization) on less significant weights. **However, when used alongside group size, desc_act can lead to performance slowdowns due to the need to frequently reload quantization parameters.**
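The reordering idea can be sketched as a permutation. The sketch assumes that activation magnitude is measured by the diagonal of $H = 2XX^\top$ and that the permutation is applied to the input dimension of the weight matrix; how a given library implements `desc_act` may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical calibration inputs: 6 features with very different magnitudes.
feature_scales = np.array([0.1, 3.0, 1.0, 0.5, 5.0, 2.0])
X = rng.normal(size=(6, 100)) * feature_scales[:, None]
H = 2 * X @ X.T
W = rng.normal(size=(3, 6))

# Process the input dimensions with the largest activations first.
perm = np.argsort(-np.diag(H))
inv_perm = np.argsort(perm)

W_perm = W[:, perm]
H_perm = H[np.ix_(perm, perm)]

# ... quantize W_perm one column at a time using H_perm ...

# Undo the permutation so the weights line up with the original inputs.
W_restored = W_perm[:, inv_perm]
print("processing order:", perm)
```

Because the permutation breaks the contiguous layout of the groups, the kernel has to look up which group each weight belongs to, which is consistent with the slowdown noted above.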

In [35]:
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
tokenizer = AutoTokenizer.from_pretrained(model_id)



The quantization process **relies heavily on samples** to evaluate and enhance the quality of the quantization. They provide a means of comparison between the outputs produced by the original and the newly quantized model. 

In this article, we use the C4 (Colossal Clean Crawled Corpus) dataset to generate our samples. The C4 dataset is a large-scale, multilingual collection of web text gathered from the Common Crawl project; the cell below loads a single shard of its English (`en`) subset.

In [36]:
# Load data and tokenize examples
n_samples = 1024
data = load_dataset("allenai/c4", data_files="en/c4-train.00001-of-01024.json.gz", split=f"train[:{n_samples*5}]")
tokenized_data = tokenizer("\n\n".join(data['text']), return_tensors='pt')

Generating train split: 356318 examples [00:03, 113592.38 examples/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (2441065 > 1024). Running this sequence through the model will result in indexing errors


In [37]:
data

Dataset({
    features: ['text', 'timestamp', 'url'],
    num_rows: 5120
})

In [40]:
data[0]

{'text': 'At Seven Avenue Design we design projects in wide range of scales from one space to whole neighborhoods.\nAt Seven Avenue Design we transform the interior of your space to make it functional, personal, according to your space and budget.\nOur architectural services include your conceptual and schematic design listening to your objectives, space requirements and even what you are planning ahead.\nWe accompany you to make your design a palpable building.',
 'timestamp': datetime.datetime(2019, 4, 25, 19, 52, 44),
 'url': 'http://sevenavedesign.com/studio/'}

In [39]:
tokenized_data["input_ids"].shape

torch.Size([1, 2441065])

In [41]:
# Format tokenized examples
examples_ids = []
for _ in range(n_samples):
    i = random.randint(0, tokenized_data.input_ids.shape[1] - tokenizer.model_max_length - 1)
    j = i + tokenizer.model_max_length
    input_ids = tokenized_data.input_ids[:, i:j]
    attention_mask = torch.ones_like(input_ids)
    examples_ids.append({'input_ids': input_ids, 'attention_mask': attention_mask})

In [42]:
len(examples_ids)

1024

In [43]:
examples_ids[0]

{'input_ids': tensor([[5419,  284, 9494,  ..., 7486,   11, 4055]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1]])}

In [44]:
# Quantize with GPTQ
model.quantize(
    examples_ids,
    batch_size=1,
    use_triton=True,
)

INFO - Start quantizing layer 1/12
INFO - Quantizing attn.c_attn in layer 1/12...
INFO - Quantizing attn.c_proj in layer 1/12...
INFO - Quantizing mlp.c_fc in layer 1/12...
INFO - Quantizing mlp.c_proj in layer 1/12...
INFO - Start quantizing layer 2/12
INFO - Quantizing attn.c_attn in layer 2/12...
INFO - Quantizing attn.c_proj in layer 2/12...
INFO - Quantizing mlp.c_fc in layer 2/12...
INFO - Quantizing mlp.c_proj in layer 2/12...
INFO - Start quantizing layer 3/12
INFO - Quantizing attn.c_attn in layer 3/12...
INFO - Quantizing attn.c_proj in layer 3/12...
INFO - Quantizing mlp.c_fc in layer 3/12...
INFO - Quantizing mlp.c_proj in layer 3/12...
INFO - Start quantizing layer 4/12
INFO - Quantizing attn.c_attn in layer 4/12...
INFO - Quantizing attn.c_proj in layer 4/12...
INFO - Quantizing mlp.c_fc in layer 4/12...
INFO - Quantizing mlp.c_proj in layer 4/12...
INFO - Start quantizing layer 5/12
INFO - Quantizing attn.c_attn in layer 5/12...
INFO - Quantizing attn.c_proj in layer 5/1

In [45]:
out_dir

'gpt2-GPTQ'

In [46]:
# Save model and tokenizer
model.save_quantized(out_dir, use_safetensors=True)
tokenizer.save_pretrained(out_dir)

('gpt2-GPTQ/tokenizer_config.json',
 'gpt2-GPTQ/special_tokens_map.json',
 'gpt2-GPTQ/vocab.json',
 'gpt2-GPTQ/merges.txt',
 'gpt2-GPTQ/added_tokens.json',
 'gpt2-GPTQ/tokenizer.json')

Next, reload the quantized model and tokenizer from the output directory:

In [47]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Reload model and tokenizer
model = AutoGPTQForCausalLM.from_quantized(
    out_dir,
    device=device,
    use_triton=True,
    use_safetensors=True,
)
tokenizer = AutoTokenizer.from_pretrained(out_dir)

1. You disabled CUDA extensions compilation by setting BUILD_CUDA_EXT=0 when install auto_gptq from source.
2. You are using pytorch without CUDA support.
3. CUDA and nvcc are not installed in your device.
INFO - The layer lm_head is not quantized.


Let’s check that the model is working correctly. The AutoGPTQ model (mostly) works as a normal transformers model, which makes it compatible with inference pipelines, as shown in the following example:

In [48]:
from transformers import pipeline

generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
result = generator("I have a dream", do_sample=True, max_length=50)[0]['generated_text']
print(result)

Device set to use cuda:0
The model 'GPT2GPTQForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FalconMambaForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GitForCausalLM', 'GlmForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'GraniteForCausalLM', 'GraniteMoeForCausalLM', 'JambaForCausalLM', 'JetMoeForCausalLM', 'LlamaForCausalLM', 'MambaForCausalLM', 'Mamba2ForCausalLM', 'MarianForC

I have a dream to make every single American in America a better place," she said.

"When the dream happens, I will work hard to make that dream happen," she said. "I will stand up for my daughter, my son


A more in-depth evaluation would require measuring the perplexity of the quantized model versus the original one:

In [49]:
def calculate_perplexity(model, text):
    # Encode the text
    encodings = tokenizer(text, return_tensors='pt').to(device)

    # Define input_ids and target_ids
    input_ids = encodings.input_ids
    target_ids = input_ids.clone()

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)

    # Loss calculation
    neg_log_likelihood = outputs.loss

    # Perplexity calculation
    ppl = torch.exp(neg_log_likelihood)

    return ppl
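As a reminder of what the function above computes: the loss is the mean negative log-likelihood per token, and perplexity is its exponential. A minimal pure-Python sketch with made-up token probabilities (the helper `perplexity` is hypothetical):

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns probability 0.25 to every observed token is as
# uncertain as a uniform choice among 4 options.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
confident = perplexity([0.9, 0.8, 0.95])
print(uniform, confident)
```

Lower is better: a quantized model whose perplexity stays close to the original's has lost little predictive quality on that text.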

In [50]:
original_model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
original_tokenizer = AutoTokenizer.from_pretrained(model_id)



In [51]:
original_generator = pipeline('text-generation', model=original_model, tokenizer=original_tokenizer)
result = original_generator("I have a dream", do_sample=True, max_length=50)[0]['generated_text']
print(result)

Device set to use cuda:0
The model 'GPT2GPTQForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CohereForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'DbrxForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'FalconMambaForCausalLM', 'FuyuForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GitForCausalLM', 'GlmForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'GraniteForCausalLM', 'GraniteMoeForCausalLM', 'JambaForCausalLM', 'JetMoeForCausalLM', 'LlamaForCausalLM', 'MambaForCausalLM', 'Mamba2ForCausalLM', 'MarianForC

I have a dream of becoming President of the United States."

While the Republicans are not exactly happy—and that's the point, probably—they are clearly not backing down on the fact that Trump has been "disciplined." I think


In [52]:
original_text = "I have a dream"
ppl_quant     = calculate_perplexity(model, original_text)
ppl_original = calculate_perplexity(original_model, original_text)

In [53]:
print(f"Original perplexity:  {ppl_original.item():.2f}")
print(f"Quantized perplexity:    {ppl_quant.item():.2f}")

Original perplexity:  68.75
Quantized perplexity:    81.62
