# Compress and Evaluate Reasoning Large Language Models

| Component | Details |
|-----------|---------|
| **Goal** | Showcase a standard workflow for optimizing and evaluating a reasoning Large Language Model |
| **Model** | [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) |
| **Dataset** | [DeepMath-103K](https://huggingface.co/datasets/zwhe99/DeepMath-103K) |
| **Optimization Algorithms** | quantizer(hqq), compiler(torch_compile) |
| **Evaluation Metrics** | `total_time`, `perplexity`, `throughput`, `energy_consumed` |

## Getting Started

The cells below first check the active environment and then install the required dependencies:


In [1]:
import sys, torch, os

print("Python:", sys.executable)
print("Torch:", torch.__version__)
try:
    import hqq, pruna
    print("HQQ:", getattr(hqq, "__version__", "unknown"))
    print("Pruna:", getattr(pruna, "__version__", "unknown"))
except Exception as e:
    print("Import check:", e)

# Safety shim: only triggers for the meta-tensor case
import torch.nn as nn

_orig_to = nn.Module.to


def _safe_to(self, *args, **kwargs):
    try:
        return _orig_to(self, *args, **kwargs)
    except NotImplementedError as e:
        if "Cannot copy out of meta tensor" in str(e):
            device = kwargs.get("device")
            # The first positional argument of .to() may be a dtype, not a device
            if device is None and args and isinstance(args[0], (str, int, torch.device)):
                device = args[0]
            if device is None:
                raise
            # to_empty() accepts only a device (no dtype) and leaves
            # the parameters uninitialized on the target device
            return self.to_empty(device=device)
        raise


nn.Module.to = _safe_to

Python: /root/miniconda3/envs/pruna0/bin/python
Torch: 2.7.0+cu126


Multiple distributions found for package optimum. Picked distribution: optimum


HQQ: 0.2.7.post1
Pruna: 0.2.9


In [1]:
!python -c "import sys, torch; print(sys.executable); print('Torch', torch.__version__)"
!pip show hqq pruna

/root/miniconda3/envs/pruna0/bin/python
Torch 2.7.0+cu126
Name: hqq
Version: 0.2.7.post1
Summary: Half-Quadratic Quantization (HQQ)
Home-page: https://github.com/mobiusml/hqq/
Author: Dr. Hicham Badri
Author-email: hicham@mobiuslabs.com
License: Apache 2
Location: /root/miniconda3/envs/pruna0/lib/python3.10/site-packages
Requires: accelerate, einops, huggingface_hub, numpy, termcolor, tqdm, transformers
Required-by: pruna
---
Name: pruna
Version: 0.2.9
Summary: Smash your AI models
Home-page: 
Author: 
Author-email: Pruna AI <hello@pruna.ai>
License: Copyright 2025 - Pruna AI GmbH. All rights reserved.
        
                                         Apache License
                                   Version 2.0, January 2004
                                http://www.apache.org/licenses/
        
           TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
        
           1. Definitions.
        
              "License" shall mean the terms and conditions for use, repro

In [1]:
%pip install --upgrade --force-reinstall pruna==0.2.9

Collecting pruna==0.2.9
  Using cached pruna-0.2.9-py3-none-any.whl.metadata (29 kB)
Collecting aenum (from pruna==0.2.9)
  Using cached aenum-3.1.16-py3-none-any.whl.metadata (3.8 kB)
Collecting bitsandbytes (from pruna==0.2.9)
  Using cached bitsandbytes-0.47.0-py3-none-manylinux_2_24_x86_64.whl.metadata (11 kB)
Collecting codecarbon (from pruna==0.2.9)
  Using cached codecarbon-3.0.4-py3-none-any.whl.metadata (11 kB)
Collecting colorama (from pruna==0.2.9)
  Using cached colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Collecting configspace>=1.2.1 (from pruna==0.2.9)
  Using cached configspace-1.2.1-py3-none-any.whl
Collecting ctranslate2==4.6.0 (from pruna==0.2.9)
  Using cached ctranslate2-4.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Collecting datasets<=3.5.0 (from pruna==0.2.9)
  Using cached datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting deepcache (from pruna==0.2.9)
  Using cached DeepCache-0.1.1-py3-none-any.whl.metadata 

In [1]:
%pip install --upgrade --force-reinstall git+https://github.com/PrunaAI/pruna.git@main

Collecting git+https://github.com/PrunaAI/pruna.git@main
  Cloning https://github.com/PrunaAI/pruna.git (to revision main) to /tmp/pip-req-build-q80ihaak
  Running command git clone --filter=blob:none --quiet https://github.com/PrunaAI/pruna.git /tmp/pip-req-build-q80ihaak
  Resolved https://github.com/PrunaAI/pruna.git to commit 15876bb39ca33b0c93a5de844c8d23c1bd88a610
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting aenum (from pruna==0.2.9)
  Using cached aenum-3.1.16-py3-none-any.whl.metadata (3.8 kB)
Collecting bitsandbytes (from pruna==0.2.9)
  Using cached bitsandbytes-0.47.0-py3-none-manylinux_2_24_x86_64.whl.metadata (11 kB)
Collecting codecarbon (from pruna==0.2.9)
  Using cached codecarbon-3.0.4-py3-none-any.whl.metadata (11 kB)
Collecting colorama (from pruna==0.2.9)
  Using cached colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Collectin

In [8]:
%pip install -U hqq
# (and transformers/accelerate if your stack needs them)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting hqq
  Downloading hqq-0.2.8.tar.gz (58 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: hqq
  DEPRECATION: Building 'hqq' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'hqq'. Discussion can be found at https://github.com/pypa/pip/issues/6334
  Building wheel for hqq (setup.py) ... done
  Created wheel for hqq: filename=hqq-0.2.8-py3-none-any.whl size=68424 sha256=4a3b4a33523554539d1f0b74a2cd37617d1bfa3ab44885d8f9910e47c4193553
  Stored in directory: /root/.cache/pip/wheels/6d/27/5c/30e8d87478cecd6b28dca83bd2d3e27724b55f565fdba980d9
Successfully built hqq
Installing collected packages: hqq
  Attempting uninstall: hqq
    Found e

In [2]:
%pip show transformers

Name: transformers
Version: 4.52.4
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: transformers@huggingface.co
License: Apache 2.0 License
Location: /root/miniconda3/envs/pruna-tutorials/lib/python3.10/site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm
Required-by: compressed-tensors, DeepCache, gliner, hqq, llmcompressor, optimum, pruna, whisper-s2t
Note: you may need to restart the kernel to use updated packages.


For more information about how to install Pruna, please refer to the [Installation](https://docs.pruna.ai/en/stable/setup/install.html) page.
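
Because the cells above install packages in several steps, it can help to confirm which versions are actually active afterwards. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, not part of Pruna, and the package list is only an example.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Report the packages this tutorial relies on
for name, version in installed_versions(["torch", "transformers", "hqq", "pruna"]).items():
    print(f"{name}: {version or 'not installed'}")
```

If a version differs from what you just installed, restarting the kernel is usually the first thing to try, as the `pip` note above suggests.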

Next, we select the best available device so the optimizations can deliver their full benefit. For this tutorial, we recommend using a GPU.

In [1]:
import torch

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)

## 1. Load the Model

First, we will load the original model and tokenizer using the transformers library. In our case, we will use one of the small versions of Qwen3, [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) just as a starting point. However, Pruna works at least as well with larger models, so feel free to use a bigger version of Qwen3 or any other [reasoning model available on Hugging Face](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending).

In [2]:
from transformers import pipeline

model_name = "Qwen/Qwen3-1.7B"

pipe = pipeline(
    "text-generation",
    model_name,
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use cuda:0


Once we've loaded the model and tokenizer, we can try to generate a response from the model and parse the response to get the reasoning steps.

In [4]:
messages = [
    {
        "role": "user",
        "content": "Give me a short introduction to large language model.",
    },
]
messages = pipe(messages, max_new_tokens=32768)[0]["generated_text"]

In [5]:
import copy
import re


def parse_thinking_content(messages):
    """Move each assistant message's leading <think> block into `reasoning_content`."""
    messages = copy.deepcopy(messages)
    for message in messages:
        if message["role"] == "assistant" and (
            m := re.match(
                r"<think>\n(.+)</think>\n\n", message["content"], flags=re.DOTALL
            )
        ):
            message["content"] = message["content"][len(m.group(0)) :]
            if thinking_content := m.group(1).strip():
                message["reasoning_content"] = thinking_content
    return messages


parse_thinking_content(messages)

[{'role': 'user',
  'content': 'Give me a short introduction to large language model.'},
 {'role': 'assistant',
  'content': 'Large language models (LLMs) are AI systems designed to understand, generate, and interact with human language. They are trained on massive datasets of text, enabling them to grasp complex patterns and produce coherent, context-aware responses. These models, often based on transformer architecture, excel in tasks like translation, writing, and answering questions. While they offer remarkable capabilities, they also face challenges such as data bias and the need for continuous refinement. LLMs are revolutionizing industries by enhancing productivity and innovation in areas like customer service, content creation, and research.',
  'reasoning_content': 'Okay, the user wants a short introduction to large language models. Let me start by defining what they are. Large language models (LLMs) are AI systems trained on vast amounts of text data. I should mention their k
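
To see what the parsing step does without running the model, here is a small sketch that applies the same regular expression to a hand-written string. `split_reasoning` and the sample text are illustrative assumptions, not part of the tutorial's pipeline.

```python
import re

# Same pattern as parse_thinking_content above: a leading <think> block,
# a blank line, then the visible answer.
THINK_PATTERN = re.compile(r"<think>\n(.+)</think>\n\n", flags=re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer); reasoning is None when no non-empty <think> block leads the text."""
    match = THINK_PATTERN.match(text)
    if match is None:
        return None, text
    reasoning = match.group(1).strip() or None
    return reasoning, text[match.end():]


sample = "<think>\nThe user wants a definition.\n</think>\n\nAn LLM is a neural network trained on text."
reasoning, answer = split_reasoning(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

Text without a leading `<think>` block is returned unchanged, and an empty block yields no reasoning, which mirrors the `if thinking_content := ...` guard in the function above.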

## 2. Define the SmashConfig

Now that our base model is loaded and tested, we can specify the `SmashConfig` to customize the optimizations applied during smashing.

Not every optimization algorithm works with every model. You can learn about the requirements and compatibility in the [Algorithms Overview](https://docs.pruna.ai/en/stable/compression.html).

In this example, we will enable `hqq` quantization to reduce the model's memory footprint and `torch_compile` compilation to speed up inference.

In [3]:
from pruna import SmashConfig

smash_config = SmashConfig(cache_dir_prefix="/scratch/.cache")
smash_config["quantizer"] = "hqq"
smash_config["hqq_weight_bits"] = 8
smash_config["hqq_compute_dtype"] = "torch.bfloat16"

smash_config["compiler"] = "torch_compile"
smash_config["torch_compile_fullgraph"] = True
smash_config["torch_compile_dynamic"] = True

Multiple distributions found for package optimum. Picked distribution: optimum
INFO - Using best available device: 'cuda'
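
For intuition about what 8-bit weights mean, the sketch below round-trips a few values through plain min-max affine quantization. This is deliberately not HQQ, which fits its quantization parameters with a half-quadratic solver; the function names and sample weights are assumptions made for illustration. The memory estimate at the end takes the nominal 1.7B parameter count from the model name.

```python
def quantize_affine(weights, bits=8):
    """Min-max affine quantization: map floats onto integers in [0, 2**bits - 1]."""
    levels = 2**bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_affine(codes, scale, zero):
    """Recover approximate floats from the integer codes."""
    return [c * scale + zero for c in codes]


weights = [-0.42, -0.07, 0.0, 0.13, 0.38]
codes, scale, zero = quantize_affine(weights, bits=8)
restored = dequantize_affine(codes, scale, zero)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print(f"max round-trip error: {max_error:.5f} (bounded by scale / 2 = {scale / 2:.5f})")

# Rough weight-memory estimate, ignoring per-group scales and zero-points
# and any layers that are left unquantized.
n_params = 1.7e9
print(f"bf16: {n_params * 2 / 1e9:.1f} GB, 8-bit: {n_params * 1 / 1e9:.1f} GB")
```

The round-trip error stays within half a quantization step, which is why 8-bit weights usually change model outputs only slightly.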


## 3. Smash the Model

Now that we have our `SmashConfig` defined, it’s time to apply it to our base model. We’ll call the `smash` function with the base model and our `SmashConfig`.

Ready to smash? This operation typically takes around 20 seconds, depending on the configuration.

In [7]:
import copy

from pruna import smash

# Keep an untouched copy of the base model on the CPU for the later comparison
copy_model = copy.deepcopy(pipe.model).to("cpu")
smashed_model = smash(
    model=pipe.model,
    smash_config=smash_config,
)

INFO - Starting quantizer hqq...
100%|██████████| 143/143 [00:00<00:00, 937.14it/s]
100%|██████████| 197/197 [00:03<00:00, 55.35it/s]
INFO - quantizer hqq was applied successfully.
INFO - Starting compiler torch_compile...
INFO - compiler torch_compile was applied successfully.


Great! Now we have our optimized smashed model. Let's check how it works by running some inference.

Note that with `torch_compile` as the compiler, the first inference call includes compilation warmup, so it takes longer than the calls that follow.

In [8]:
from transformers import pipeline

messages = [
    {
        "role": "user",
        "content": "Give me a short introduction to large language models.",
    },
]
messages = pipe(messages, max_new_tokens=32768)[0]["generated_text"]
parse_thinking_content(messages)

[{'role': 'user',
  'content': 'Give me a short introduction to large language models.'},
 {'role': 'assistant',
  'content': "Large language models (LLMs) are advanced AI systems designed to understand and generate human-like text. They learn from vast amounts of data using deep learning techniques, enabling them to produce coherent and contextually relevant responses. These models excel in tasks like language translation, content creation, and customer service chatbots. While they're powerful, they're not infallible and rely on data quality. Their integration into daily life has transformed how we interact with technology, making tasks faster and more efficient.",
  'reasoning_content': "Okay, the user wants a short introduction to large language models. Let me start by defining what they are. Large language models are AI systems that can understand and generate human-like text. I should mention their training with vast amounts of data and their use in various applications like chatb

As we can see, the model still generates a similar response with a thinking process.

If you notice a significant difference, it might be due to the model, the configuration, the hardware, etc. As optimization can be non-deterministic, we encourage you to retry the optimization process or try out different configurations and models to find the best fit for your use case. Feel free to reach out to us on [Discord](https://discord.gg/Tun8YgzxZ9) if you have any questions or feedback.
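
The compilation warmup mentioned in this section can be made visible by timing each call separately. The sketch below uses a stand-in callable that simulates a one-off setup cost, so it runs without a GPU; `time_calls` and `FakeCompiledModel` are hypothetical names introduced for illustration.

```python
import time


def time_calls(fn, n_calls=5):
    """Return the wall-clock latency in milliseconds of each of n_calls invocations."""
    latencies = []
    for _ in range(n_calls):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies


class FakeCompiledModel:
    """Stand-in for a compiled model: the first call pays a one-off setup cost."""

    def __init__(self):
        self._compiled = False

    def __call__(self):
        if not self._compiled:
            time.sleep(0.05)  # simulated compilation
            self._compiled = True
        time.sleep(0.001)  # simulated inference


latencies = time_calls(FakeCompiledModel(), n_calls=5)
warmup, steady = latencies[0], latencies[1:]
print(f"first call: {warmup:.1f} ms, later calls: {sum(steady) / len(steady):.1f} ms on average")
```

To time the real pipeline, you could pass something like `lambda: pipe(messages, max_new_tokens=64)` instead of the stand-in. When timing raw CUDA kernels rather than a call that returns decoded text, call `torch.cuda.synchronize()` before reading the clock.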

## 4. Evaluate the Smashed Model

Now that the smashed model works, we can measure how much the optimization has improved it by running an evaluation with the `EvaluationAgent`. We will include the `total_time`, `perplexity`, `throughput`, and `energy_consumed` metrics.

A complete list of the available metrics can be found in [Evaluation](https://docs.pruna.ai/en/stable/reference/evaluation.html).

In [8]:
from datasets import load_dataset

from pruna import PrunaModel
from pruna.data.pruna_datamodule import PrunaDataModule
from pruna.data.utils import split_train_into_train_val_test
from pruna.evaluation.evaluation_agent import EvaluationAgent
from pruna.evaluation.metrics import (
    EnergyConsumedMetric,
    ThroughputMetric,
    TorchMetricWrapper,
    TotalTimeMetric,
)
from pruna.evaluation.task import Task

# Define the metrics. Increase the number of iterations
# and warmup iterations to get a more accurate result.
metrics = [
    TotalTimeMetric(n_iterations=50, n_warmup_iterations=5),
    ThroughputMetric(n_iterations=50, n_warmup_iterations=5),
    TorchMetricWrapper("perplexity", call_type="single"),
    EnergyConsumedMetric(n_iterations=50, n_warmup_iterations=5),
]

# Load custom datasets
pipe.tokenizer.pad_token = pipe.tokenizer.eos_token

train_ds = load_dataset("zwhe99/DeepMath-103K", split="train")
train_ds = train_ds.rename_column("question", "text")
train_ds, val_ds, test_ds = split_train_into_train_val_test(train_ds, seed=42)

# Create the data module
datamodule = PrunaDataModule.from_datasets(
    datasets=(train_ds, val_ds, test_ds),
    collate_fn="text_generation_collate",
    tokenizer=pipe.tokenizer,
    collate_fn_args={"max_seq_len": 512},
    dataloader_args={"batch_size": 16, "num_workers": 4},
)
datamodule.limit_datasets(100)

inference_args = {
    "max_new_tokens": 512,
}

# Define the task and the evaluation agent
task = Task(metrics, datamodule=datamodule, device=device)
eval_agent = EvaluationAgent(task)

INFO - Using best available device: 'cuda'
INFO - Using best available device: 'cuda'
INFO - Using best available device: 'cuda'
INFO - Using call_type: y_gt for metric perplexity
INFO - Using best available device: 'cuda'
INFO - Loaded only training, splitting train 80/10/10 into train, validation and test...
INFO - Testing compatibility with text_generation_collate...
INFO - Using provided list of metric instances.
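
As a reminder of what the perplexity metric measures, the sketch below computes it from per-token log-probabilities using the standard definition: the exponential of the mean negative log-likelihood. It is assumed, not verified, that the wrapped metric above follows the same definition; the probabilities are made up for illustration.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical probabilities a model assigned to four reference tokens
probs = [0.5, 0.25, 0.5, 0.25]
print(perplexity([math.log(p) for p in probs]))
```

A model that assigns probability 1/4 to every reference token has a perplexity of 4, so lower values mean the model is less "surprised" by the reference text.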


In [9]:
# Evaluate smashed model and offload it to CPU
smashed_model.move_to_device(device)
smashed_model.inference_handler.model_args.update(inference_args)
smashed_model_results = eval_agent.evaluate(smashed_model)
smashed_model.move_to_device("cpu")

INFO - Using best available device: 'cuda'
INFO - Evaluating a smashed model.
INFO - Detected transformers model. Using TransformerHandler.
- The first element of the batch is passed as input.
- The generated outputs are expected to have .logits attribute.
INFO - Evaluating stateful metrics.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

In [10]:
for result in smashed_model_results:
    print(result.name, result.result)

perplexity 2.823002815246582
total_time 6869.606880187988
throughput 0.11645498992194264
energy_consumed 0.001068044841136194


In [11]:
# Evaluate base model and offload it to CPU
base_pipe = PrunaModel(model=copy_model)
base_pipe.move_to_device(device)
base_pipe.inference_handler.model_args.update(inference_args)
base_model_results = eval_agent.evaluate(base_pipe)
base_pipe.move_to_device("cpu")

INFO - Using best available device: 'cuda'
INFO - Using best available device: 'cuda'
INFO - Evaluating a base model.
INFO - Detected transformers model. Using TransformerHandler.
- The first element of the batch is passed as input.
- The generated outputs are expected to have .logits attribute.
INFO - Evaluating stateful metrics.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

[codecarbon INFO @ 15:19:10] Energy consumed for RAM : 0.000026 kWh. RAM Power : 66.0 W
[codecarbon INFO @ 15:19:10] Delta energy consumed for CPU with cpu_load : 0.000008 kWh, power : 21.006048756000002 W
[codecarbon INFO @ 15:19:10] Energy consumed for All CPU : 0.000008 kWh
[codecarbon INFO @ 15:19:10] Energy consumed for all GPUs : 0.000042 kWh. Total GPU Power : 78.18919233641185 W
[codecarbon INFO @ 15:19:10] 0.000077 kWh of electricity used since the beginning.

In [12]:
for result in base_model_results:
    print(result.name, result.result)

perplexity 3.332951068878174
total_time 42390.90364074707
throughput 0.01887197326057995
energy_consumed 0.005906576510335687


Now we can see the results of the evaluation and compare the performance of the original and the optimized model.

In [13]:
from IPython.display import Markdown, display  # noqa


# Calculate percentage differences for each metric
def calculate_percentage_diff(original, optimized):  # noqa
    return ((optimized - original) / original) * 100


# Calculate differences and prepare table data
table_data = []
for base_metric_result, smashed_metric_result in zip(
    base_model_results, smashed_model_results
):
    diff = calculate_percentage_diff(
        base_metric_result.result, smashed_metric_result.result
    )
    table_data.append(
        {
            "Metric": base_metric_result.name,
            "Base Model": f"{base_metric_result.result:.4f}",
            "Compressed Model": f"{smashed_metric_result.result:.4f}",
            "Relative Difference": f"{diff:+.2f}%",
        }
    )

# Create and display markdown table manually
markdown_table = "| Metric | Base Model | Compressed Model | Relative Difference |\n"
markdown_table += "|--------|----------|-----------|------------|\n"
for row in table_data:
    metric_obj = [metric for metric in metrics if metric.metric_name == row["Metric"]][
        0
    ]
    unit = f" {metric_obj.metric_units}" if hasattr(metric_obj, "metric_units") else ""
    markdown_table += f"| {row['Metric']} | {row['Base Model']} {unit} | {row['Compressed Model']} {unit} | {row['Relative Difference']} |\n"  # noqa: E501

display(Markdown(markdown_table))

| Metric | Base Model | Compressed Model | Relative Difference |
|--------|----------|-----------|------------|
| perplexity | 3.3330  | 2.8230  | -15.30% |
| total_time | 42390.9036  ms | 6869.6069  ms | -83.79% |
| throughput | 0.0189  num_iterations/ms | 0.1165  num_iterations/ms | +517.08% |
| energy_consumed | 0.0059  kWh | 0.0011  kWh | -81.92% |
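
Relative differences can be harder to read than speedup factors. The sketch below converts the reported values into both forms; the numbers are copied from the evaluation output above, and the helper names are chosen here for illustration.

```python
def speedup(base, optimized):
    """How many times smaller the optimized value is (for time or energy, lower is better)."""
    return base / optimized


def relative_diff_pct(base, optimized):
    """Signed percentage change from base to optimized, as in the table above."""
    return (optimized - base) / base * 100


# Values copied from the evaluation output above
base_time_ms, smashed_time_ms = 42390.90364074707, 6869.606880187988
base_energy_kwh, smashed_energy_kwh = 0.005906576510335687, 0.001068044841136194

print(
    f"time:   {speedup(base_time_ms, smashed_time_ms):.2f}x faster "
    f"({relative_diff_pct(base_time_ms, smashed_time_ms):+.2f}%)"
)
print(
    f"energy: {speedup(base_energy_kwh, smashed_energy_kwh):.2f}x less "
    f"({relative_diff_pct(base_energy_kwh, smashed_energy_kwh):+.2f}%)"
)
```

Keep in mind that these figures come from a single run on one machine, so they will vary with hardware and configuration.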


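As the table shows, the smashed model is markedly faster and more energy-efficient in this run: total time drops by about 84%, energy consumption by about 82%, and throughput increases roughly sixfold, while perplexity does not get worse. We can now save the optimized model to disk or share it with others: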

In [None]:
# Save the model to disk
smashed_model.save_pretrained("Qwen3-1.7B-smashed")
# Load the model from disk
# smashed_model = PrunaModel.from_pretrained("Qwen3-1.7B-smashed/")

# Save the model to the Hugging Face Hub
# smashed_model.save_to_hub("PrunaAI/Qwen3-1.7B-smashed")
# smashed_model = PrunaModel.from_hub("PrunaAI/Qwen3-1.7B-smashed")

## Conclusions

In this tutorial, we have seen how to optimize and evaluate a reasoning Large Language Model using Pruna. We have seen how to use the `SmashConfig` to customize the optimizations applied during smashing and how to evaluate the performance of the optimized model using the `EvaluationAgent`.

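The results show that, in this run, the smashed model was about six times faster and used about 82% less energy than the base model, with no loss in perplexity.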

Check out our other [tutorials](https://docs.pruna.ai/en/stable/docs_pruna/tutorials/index.html) for more examples on how to optimize and evaluate image/video generation models or LLMs.