# Compress and Evaluate Large Language Models

<a target="_blank" href="https://colab.research.google.com/github/PrunaAI/pruna/blob/v|version|/docs/tutorials/llms.ipynb">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

| Component | Details |
|-----------|---------|
| **Goal** | Show a standard workflow for optimizing and evaluating a large language model |
| **Model** | [HuggingFaceTB/SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct) |
| **Dataset** | [SmolSmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk) |
| **Libraries** | [transformers](https://github.com/huggingface/transformers), [datasets](https://github.com/huggingface/datasets) |
| **Optimization Algorithms** | quantizer(hqq), compiler(torch_compile) |
| **Evaluation Metrics** | **Base Metrics:** elapsed_time<br>**Stateful Metrics:** perplexity |

## Getting Started

### Install the dependencies

To install the dependencies, run the following command:

In [1]:
!pip install pruna

zsh:1: command not found: pip


### Set the device

Normally, we would set the device to the best available device to make the most out of the optimization process.


In [2]:
import torch

device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"

## Load the model

Before we can optimize the model, we need to ensure that we can load the model and tokenizer correctly and that they can fit in memory. For this example, we will use a nice and small LLM, [SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct), but feel free to use any [text-generation model on Hugging Face](https://huggingface.co/models?pipeline_tag=text-generation). 

Although Pruna works at least as good with smaller models, a small model is a good starting point to show the steps of the optimization process.

In [9]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM2-360M-Instruct"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

Now we've loaded the model and tokenizer. Let's see if we can run some inference with them. To make this easy for use, we will be using the `transformers` library's `pipeline` function.

In [10]:
from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages, max_new_tokens=100)

Device set to use mps


[{'generated_text': [{'role': 'user', 'content': 'Who are you?'},
   {'role': 'assistant',
    'content': "I am an advanced language model designed to assist users in understanding and generating human-like text. I was trained on a vast amount of text data, including various languages, and I can help with a wide range of tasks, from text editing and summarization to generating creative content like stories and poetry. I'm here to provide you with helpful and accurate information to improve your communication."}]}]

As we can see, the model is able to generate a response to the user's question, which is being cut-off after the allowed `max_new_tokens`. 

## Define the SmashConfig

Now we know the model is working, let's continue with the optimization process and define the `SmashConfig`, which we will use later on to optimize the model.

Not all optimization algorithms are available for all models but we can learn a bit more about different optimization algorithms and their requirements in the [Algorithms Overview](https://docs.pruna.ai/en/stable/compression.html) section of the documentation.

For the current optimization, we will be using the [`hqq` quantizer](https://docs.pruna.ai/en/stable/compression.html#hqq) and the [`torch_compile` compiler](https://docs.pruna.ai/en/stable/compression.html#torch-compile). We will updating some parameters for these algorithms, setting `hqq_weight_bits` to `4`, `hqq_compute_dtype` to `torch.bfloat16`, `torch_compile_fullgraph` to `True`, `torch_compile_dynamic` to `True`, and `torch_compile_mode` to `max-autotune`. This is one of the many configurations and will just serve as an example.

Let's define the `SmashConfig` object.

In [11]:
from pruna import SmashConfig

smash_config = SmashConfig(device=device)
smash_config["quantizer"] = "hqq"
smash_config["hqq_weight_bits"] = 8
smash_config["hqq_compute_dtype"] = "torch.bfloat16"
smash_config["compiler"] = "torch_compile"
smash_config["torch_compile_fullgraph"] = True
smash_config["torch_compile_dynamic"] = True
smash_config["torch_compile_mode"] = "max-autotune"

## Smash the model

Now that we have defined the `SmashConfig` object, we can smash the model. We will be using the `smash` function to smash the model and pass the `model` and `smash_config` to it. We also make a deep copy of the model to avoid modifying the original model. 

Let's smash the model, which should take around 20 seconds for this configuration.

In [12]:
import copy

from pruna import smash

copy_model = copy.deepcopy(model)
smashed_model = smash(
    model=copy_model,
    smash_config=smash_config,
)

INFO - Starting quantizer hqq...
100%|██████████| 99/99 [00:00<00:00, 4872.58it/s]
100%|██████████| 225/225 [00:10<00:00, 21.40it/s]
INFO - quantizer hqq was applied successfully.


Now we've optimized the model. Let's see if everything still works as expected and we can run some inference with the optimized model. In this case, we are running the inference by first encoding the prompt through the `tokenizer` and then passing the `input_ids` to the `PrunaModel.generate` method, which also allows us to specify additional parameters such as `max_new_tokens`.

In [13]:
prompt = "Who are you?"
messages = [{"role": "user", "content": prompt}]
tokenized_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([tokenized_prompt], return_tensors="pt").to(smashed_model.device)
generated_ids = smashed_model.generate(
    **model_inputs,
    max_new_tokens=100,
)
# Extract only the assistant's message from the decoded output
full_response = tokenizer.decode(generated_ids[0], skip_special_tokens=False)
assistant_message = full_response.split("<|im_start|>assistant\n")[-1].split("<|im_end|>")[0]
assistant_message

"I'm a helpful assistant designed to assist users with their queries and provide information on various topics. I don't have personal experiences or emotions, just facts and data. I'm here to help with your questions and provide accurate information. If you have a question or need help with something, feel free to ask!"

As we can see, the model is able to generate a similar response to the original model. 

If you notice a significant difference, it might have several reasons, the models, the configuration, the hardware, etc. As optimization can be non-deterministic, we encourage you to retry the optimization process or try out different configurations and models to find the best fit for your use case but also feel free to reach out to us on [Discord](https://discord.gg/Tun8YgzxZ9) if you have any questions or feedback.

## Evaluate the smashed model

Now that we have optimized the model, we can evaluate the performance of the optimized model. We will be using the `EvaluationAgent` to evaluate the performance of the optimized model. We will do so with some basic metrics, the `elapsed_time`, as well as a stateful metrics, the `perplexity`. An overview of the different metrics can be found in our [documentation](https://docs.pruna.ai/).

Let's define the `EvaluationAgent` object and start the evaluation process. Note that we are using the `datamodule.limit_datasets(10)` method to limit the number of datasets to 10, which is just for the sake of time.

In [22]:
from pruna import PrunaModel
from pruna.data.pruna_datamodule import PrunaDataModule
from pruna.evaluation.evaluation_agent import EvaluationAgent
from pruna.evaluation.metrics import TorchMetricWrapper
from pruna.evaluation.task import Task

# Define the metrics
metrics = [TorchMetricWrapper("perplexity")]

# Define the datamodule
datamodule = PrunaDataModule.from_string("SmolSmolTalk", tokenizer=tokenizer)
datamodule.limit_datasets(10)

# Define the task and evaluation agent
task = Task(metrics, datamodule=datamodule, device=device)
eval_agent = EvaluationAgent(task)

# Evaluate base model, and smashed model
wrapped_model = PrunaModel(model=model)
base_model_results = eval_agent.evaluate(wrapped_model)
smashed_model_results = eval_agent.evaluate(smashed_model)

INFO - Using best available device: 'mps'
INFO - Using call_type: y_gt for metric perplexity
INFO - Loaded only training and test, splitting train 90/10 into train and validation...
  obj.co_lnotab,  # for < python 3.10 [not counted in args]
INFO - Testing compatibility with text_generation_collate...
INFO - Using provided list of metric instances.
INFO - Using best available device: 'mps'
INFO - Using best available device: 'mps'
INFO - Evaluating a base model.
INFO - Detected transformers model. Using TransformerHandler.
- The first element of the batch is passed as input.
- The generated outputs are expected to have .logits attribute.
INFO - Evaluating stateful metrics.
INFO - Evaluating isolated inference metrics.
INFO - Using best available device: 'mps'
INFO - Evaluating a smashed model.
INFO - Detected transformers model. Using TransformerHandler.
- The first element of the batch is passed as input.
- The generated outputs are expected to have .logits attribute.
INFO - Evaluatin

We can now take a look at the results of the evaluation and compare the performance of the original and the optimized model.

In [23]:
base_model_results, smashed_model_results

([MetricResult(name='perplexity', params={'_defaults': {}, 'metric': Perplexity(), 'update_fn': <function default_update at 0x3ee8b3380>, 'call_type': 'y_gt', 'metric_name': 'perplexity', 'higher_is_better': False}, result=23.96961784362793)],
 [MetricResult(name='perplexity', params={'_defaults': {}, 'metric': Perplexity(), 'update_fn': <function default_update at 0x3ee8b3380>, 'call_type': 'y_gt', 'metric_name': 'perplexity', 'higher_is_better': False}, result=2926.905029296875)])

Now we can see the results of the evaluation and compare the performance of the original and the optimized model.

In [24]:
# Calculate percentage differences for each metric
def calculate_percentage_diff(original, optimized):  # noqa: D103
    return ((optimized - original) / original) * 100


# Calculate and display percentage differences
print("Percentage differences between original and optimized model:")
for base_metric_result, smashed_metric_result in zip(base_model_results, smashed_model_results):
    diff = calculate_percentage_diff(base_metric_result.result, smashed_metric_result.result)
    print(f"{base_metric_result}: {diff:.2f}%")

Percentage differences between original and optimized model:
perplexity: 23.96961784362793: 12110.90%


As we can see, the optimized model is roughly 2x faster, while lose some of its performance on perplexity, which is expected given the nature of the optimization. Now, we can start to compare, iterate and see what optimization works best for our models, given the metrics we are interested in.

We can now save the optimized model to disk and share it with others.

In [None]:
smashed_model.save_pretrained("smashed_model")
smashed_model.save_to_hub("smashed_model")

## Conclusion

In this tutorial, we have shown a standard workflow for optimizing and evaluating a large language model. We have used the `SmashConfig` object to define the optimization algorithms and the `EvaluationAgent` to evaluate the performance of the optimized model. We have also used the `PrunaDataModule` to load the dataset and the `Task` object to define the task and evaluation agent.

We have shown how to optimize the model using the `smash` function and how to evaluate the performance of the optimized model using the `EvaluationAgent`.

Proving we can optimize the model, by making it quicker, more energy efficient and using less memory, while only losing a small amount of accuracy.