# Compress and Evaluate a Reasoning LLM

| Component | Details |
|-----------|---------|
| **Goal** | Showcase a standard workflow for optimizing and evaluating a reasoning Large Language Model |
| **Model** | [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) |
| **Dataset** |   |
| **Optimization Algorithms** | `hqq` (quantizer), `torch_compile` (compiler) |
| **Evaluation Metrics** | `total time`, `latency`, `perplexity`, `throughput`, `energy_consumed`, `co2_emissions` |

## Getting Started

To install the required dependencies, you can run the following command:


In [None]:
%pip install pruna
%pip install ftfy imageio imageio-ffmpeg

For more information about how to install Pruna, please refer to the [Installation](https://docs.pruna.ai/en/stable/setup/install.html) page.

Next, we set the device to the best available option so that the optimization can deliver its full benefit. For this tutorial, we recommend using a GPU.

In [None]:
import torch

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)

## 1. Load the Model

First, we will load the original model and tokenizer using the transformers library. We use one of the smaller Qwen3 models, [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B), as a starting point. Pruna works at least as well with larger models, so feel free to use a bigger version of Qwen3 or any other [reasoning model available on Hugging Face](https://huggingface.co/models?pipeline_tag=text-generation&sort=trending).

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-1.7B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map=device
)

Once we've loaded the model and tokenizer, we can try to generate a response from the model.

In [None]:
from torch import bfloat16
from transformers import pipeline

model_name_or_path = "Qwen/Qwen3-1.7B"

generator = pipeline(
    "text-generation",
    model_name_or_path,
    torch_dtype=bfloat16,
    device_map=device,
)

messages = [
    {
        "role": "user",
        "content": "Give me a short introduction to large language models.",
    },
]
messages = generator(messages, max_new_tokens=32768)[0]["generated_text"]


In [None]:
import copy
import re


def parse_thinking_content(messages):
    messages = copy.deepcopy(messages)
    for message in messages:
        if message["role"] == "assistant" and (
            m := re.match(
                r"<think>\n(.+)</think>\n\n", message["content"], flags=re.DOTALL
            )
        ):
            message["content"] = message["content"][len(m.group(0)) :]
            if thinking_content := m.group(1).strip():
                message["reasoning_content"] = thinking_content
    return messages


parse_thinking_content(messages)
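
To see what the parser does without running the model, here is a minimal, self-contained sketch that applies the same regular expression to a hand-written assistant message. The sample conversation is made up for illustration; it only assumes the `<think>` / `</think>` layout that the regular expression above expects.

```python
import copy
import re


def parse_thinking_content(messages):
    """Split a leading <think>...</think> block off each assistant message."""
    messages = copy.deepcopy(messages)
    for message in messages:
        if message["role"] == "assistant" and (
            m := re.match(
                r"<think>\n(.+)</think>\n\n", message["content"], flags=re.DOTALL
            )
        ):
            # Keep only the final answer in "content"
            message["content"] = message["content"][len(m.group(0)) :]
            # Store the reasoning separately, unless it is empty
            if thinking_content := m.group(1).strip():
                message["reasoning_content"] = thinking_content
    return messages


# Hypothetical conversation, written by hand for illustration
sample_messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {
        "role": "assistant",
        "content": "<think>\nSimple addition: 2 + 2 = 4.\n</think>\n\nThe answer is 4.",
    },
]

parsed = parse_thinking_content(sample_messages)
print(parsed[1]["reasoning_content"])  # Simple addition: 2 + 2 = 4.
print(parsed[1]["content"])  # The answer is 4.
```

Because the function works on a deep copy, the original `sample_messages` list is left unchanged.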

## 2. Define the SmashConfig

Now that our base model is loaded and tested, we can specify the `SmashConfig` to customize the optimizations applied during smashing.

Not every optimization algorithm works with every model. You can learn about the requirements and compatibility in the [Algorithms Overview](https://docs.pruna.ai/en/stable/compression.html).

In this example, we will enable the `hqq` quantizer with 4-bit weights and the `torch_compile` compiler.

In [None]:
from pruna import SmashConfig

smash_config = SmashConfig()
# Select the quantizer
smash_config["quantizer"] = "hqq"
# 2 and 8 bits also work, but 4 gives the best performance
smash_config["hqq_weight_bits"] = 4
# float16 also works, but bfloat16 performs better
smash_config["hqq_compute_dtype"] = "torch.bfloat16"

# Select torch_compile for the compilation
smash_config["compiler"] = "torch_compile"
# smash_config['torch_compile_max_kv_cache_size'] = 400 # uncomment if you want to use a custom kv cache size
smash_config["torch_compile_fullgraph"] = True
smash_config["torch_compile_mode"] = "max-autotune"
# If the model is not compatible with cudagraphs, you can try to comment the line above and uncomment the line below
# smash_config['torch_compile_mode'] = 'max-autotune-no-cudagraphs'
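
As a rough back-of-the-envelope sketch of why the weight bit width matters, the snippet below estimates the memory needed to store the weights of a model with about 1.7 billion parameters at different precisions. The parameter count is an approximation taken from the model name, and the estimate ignores quantization metadata (scales and zero points), layers that stay unquantized, activations, and the KV cache, so real numbers will differ.

```python
def weight_memory_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed to store the weights, in GiB."""
    return n_params * bits_per_weight / 8 / 1024**3


N_PARAMS = 1.7e9  # assumption: rough parameter count implied by "Qwen3-1.7B"

# Compare bfloat16 (16 bits) with the bit widths mentioned for hqq
for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gib(N_PARAMS, bits):.2f} GiB")
```

Under these simplifications, going from 16-bit to 4-bit weights cuts the weight memory by a factor of four.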


## 3. Smash the Model

Now that we have our `SmashConfig` defined, it’s time to apply it to our base model. We’ll call the `smash` function with the base model and our `SmashConfig`.

Ready to smash? This operation typically takes around 20 seconds, depending on the configuration.

In [None]:
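import copy

from pruna import smash

# Keep an untouched copy of the base model on the CPU for the later comparison
copy_pipe = copy.deepcopy(model).to("cpu")
# Smash the model loaded in step 1 (the original cell referenced an undefined `pipe`)
smashed_pipe = smash(
    model=model,
    smash_config=smash_config,
)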


Great! Now we have our optimized smashed model. Let's check how it works by running some inference.

Note that when `torch_compile` is used as the compiler, the first inference call includes the compilation warmup and takes noticeably longer than subsequent calls.

In [None]:
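messages = [
    {
        "role": "user",
        "content": "Give me a short introduction to large language models.",
    },
]
# Build the chat prompt with the tokenizer loaded in step 1
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
)
input_ids = inputs["input_ids"].to(device)

# Generate with the smashed model instead of reloading the base model.
# Assumption: the smashed model forwards `generate` to the wrapped transformers model.
output_ids = smashed_pipe.generate(input_ids, max_new_tokens=32768)
response = tokenizer.decode(output_ids[0][input_ids.shape[1] :], skip_special_tokens=True)
print(response)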


As we can see, the model still generates a similar response with a thinking process.

If you notice a significant difference, it might be due to the model, the configuration, the hardware, etc. As optimization can be non-deterministic, we encourage you to retry the optimization process or try out different configurations and models to find the best fit for your use case. Feel free to reach out to us on [Discord](https://discord.gg/Tun8YgzxZ9) if you have any questions or feedback.

## 4. Evaluate the Smashed Model

Now that the smashed model is working, we can measure how much the optimization has improved it. To do so, we run a performance evaluation with the `EvaluationAgent`. In this case, we will include metrics like `total time`, `latency`, `perplexity`, `throughput`, `energy_consumed` and `co2_emissions`.

A complete list of the available metrics can be found in [Evaluation](https://docs.pruna.ai/en/stable/reference/evaluation.html).

In [None]:
from pruna import PrunaModel
from pruna.data.pruna_datamodule import PrunaDataModule
from pruna.evaluation.evaluation_agent import EvaluationAgent
from pruna.evaluation.metrics import (
    Co2EmissionsMetric,
    EnergyConsumedMetric,
    LatencyMetric,
    ThroughputMetric,
    TotalTimeMetric,
)
from pruna.evaluation.task import Task

# Define the metrics. Increase the number of iterations and warmup iterations to get a more accurate result.
metrics = [
    TotalTimeMetric(n_iterations=3, n_warmup_iterations=1),
    LatencyMetric(n_iterations=3, n_warmup_iterations=1),
    ThroughputMetric(n_iterations=3, n_warmup_iterations=1),
    Co2EmissionsMetric(n_iterations=3, n_warmup_iterations=1),
    EnergyConsumedMetric(n_iterations=3, n_warmup_iterations=1),
]

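# Define the datamodule. A causal LM needs a text dataset, so LAION256 (image-caption
# pairs) does not fit here. Assumption: "WikiText" with the tokenizer from step 1 is used instead.
datamodule = PrunaDataModule.from_string("WikiText", tokenizer=tokenizer)
datamodule.limit_datasets(10)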

# Define the task and the evaluation agent
task = Task(metrics, datamodule=datamodule, device=device)
eval_agent = EvaluationAgent(task)

# Evaluate base model and offload it to CPU
base_pipe = PrunaModel(model=copy_pipe)
base_pipe.move_to_device(device)
base_model_results = eval_agent.evaluate(base_pipe)
base_pipe.move_to_device("cpu")

# Evaluate smashed model and offload it to CPU
smashed_pipe.move_to_device(device)
smashed_model_results = eval_agent.evaluate(smashed_pipe)
smashed_pipe.move_to_device("cpu")
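
To make the timing metrics more concrete, the following sketch shows one common way to measure latency and throughput while discarding warmup iterations. It illustrates the idea only and is not Pruna's implementation; `measure`, `fake_inference`, and the batch size are hypothetical names and values chosen for this example.

```python
import time


def measure(fn, n_iterations=3, n_warmup_iterations=1, batch_size=1):
    """Return (mean latency in ms, throughput in samples/s) for calling fn()."""
    # Warmup calls absorb one-off costs such as compilation; they are not timed
    for _ in range(n_warmup_iterations):
        fn()

    start = time.perf_counter()
    for _ in range(n_iterations):
        fn()
    total_seconds = time.perf_counter() - start

    latency_ms = total_seconds / n_iterations * 1000
    throughput = n_iterations * batch_size / total_seconds
    return latency_ms, throughput


def fake_inference():
    """Stand-in for a model call; sleeps for roughly 10 ms."""
    time.sleep(0.01)


latency_ms, throughput = measure(fake_inference, n_iterations=5, batch_size=4)
print(f"latency: {latency_ms:.1f} ms, throughput: {throughput:.1f} samples/s")
```

Raising `n_iterations` averages out noise, while raising `n_warmup_iterations` keeps one-off startup costs out of the measurement.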


Now we can see the results of the evaluation and compare the performance of the original and the optimized model.

In [None]:
from IPython.display import Markdown, display  # noqa


# Calculate percentage differences for each metric
def calculate_percentage_diff(original, optimized):  # noqa
    return ((optimized - original) / original) * 100


# Calculate differences and prepare table data
table_data = []
for base_metric_result, smashed_metric_result in zip(
    base_model_results, smashed_model_results
):
    diff = calculate_percentage_diff(
        base_metric_result.result, smashed_metric_result.result
    )
    table_data.append(
        {
            "Metric": base_metric_result.name,
            "Base Model": f"{base_metric_result.result:.4f}",
            "Compressed Model": f"{smashed_metric_result.result:.4f}",
            "Relative Difference": f"{diff:+.2f}%",
        }
    )

# Create and display markdown table manually
markdown_table = "| Metric | Base Model | Compressed Model | Relative Difference |\n"
markdown_table += "|--------|----------|-----------|------------|\n"
for row in table_data:
    # Look up the metric object that produced this row
    metric_obj = next(
        metric for metric in metrics if metric.metric_name == row["Metric"]
    )
    # `unit` already carries its leading space, so no extra separator is needed
    unit = f" {metric_obj.metric_units}" if hasattr(metric_obj, "metric_units") else ""
    markdown_table += f"| {row['Metric']} | {row['Base Model']}{unit} | {row['Compressed Model']}{unit} | {row['Relative Difference']} |\n"  # noqa: E501

display(Markdown(markdown_table))
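
As a small worked example of the relative difference used in the table, the sketch below applies the same formula to made-up numbers. The values are purely illustrative and are not measurements from this tutorial. The sign has to be read per metric: a negative difference is an improvement for latency, while a positive one is an improvement for throughput.

```python
def calculate_percentage_diff(original, optimized):
    """Relative change of the optimized value versus the original, in percent."""
    return ((optimized - original) / original) * 100


# Made-up values for illustration only
latency_diff = calculate_percentage_diff(original=200.0, optimized=150.0)
throughput_diff = calculate_percentage_diff(original=40.0, optimized=50.0)

print(f"latency: {latency_diff:+.2f}%")  # latency: -25.00%
print(f"throughput: {throughput_diff:+.2f}%")  # throughput: +25.00%
```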


You should see an improvement in the efficiency metrics, although the exact numbers depend on your hardware and configuration. We can now save the optimized model to disk or share it with others:

In [None]:
# Save the model to disk
smashed_pipe.save_pretrained("Qwen3-1.7B-smashed")
# Load the model from disk
# smashed_pipe = PrunaModel.from_pretrained("Qwen3-1.7B-smashed/")

# Save the model to HuggingFace
# smashed_pipe.save_to_hub("PrunaAI/Qwen3-1.7B-smashed")
# smashed_pipe = PrunaModel.from_hub("PrunaAI/Qwen3-1.7B-smashed")


## Conclusions

In this tutorial, we have seen how to optimize and evaluate a reasoning Large Language Model using Pruna. We have seen how to use the `SmashConfig` to customize the optimizations applied during smashing and how to evaluate the performance of the optimized model using the `EvaluationAgent`.

The results show that

Check out our other [tutorials](https://docs.pruna.ai/en/stable/docs_pruna/tutorials/index.html) for more examples of how to optimize and evaluate image/video generation models or LLMs.