# Compress and Evaluate Video Generation Models

| Component | Details |
|-----------|---------|
| **Goal** | Showcase a standard workflow for optimizing and evaluating a video generation model |
| **Model** | [Wan-AI/Wan2.1-T2V-1.3B](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B) |
| **Dataset** | [nannullna/laion_subset](https://huggingface.co/datasets/nannullna/laion_subset) |
| **Optimization Algorithms** | compiler(torch_compile), quantizer(torchao) |
| **Evaluation Metrics** | `total_time`, `latency`, `throughput`, `co2_emissions`, and `energy_consumed` |

## Getting Started

To install the required dependencies, you can run the following command:


In [1]:
%pip install pruna
%pip install ftfy imageio imageio-ffmpeg

Collecting pruna
  Downloading pruna-0.2.7-py3-none-any.whl.metadata (28 kB)
Collecting bitsandbytes (from pruna)
  Using cached bitsandbytes-0.46.1-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting codecarbon (from pruna)
  Using cached codecarbon-3.0.4-py3-none-any.whl.metadata (11 kB)
Collecting colorama (from pruna)
  Using cached colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Collecting configspace>=1.2.1 (from pruna)
  Using cached configspace-1.2.1-py3-none-any.whl
Collecting ctranslate2==4.6.0 (from pruna)
  Using cached ctranslate2-4.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (10 kB)
Collecting datasets<=3.5.0 (from pruna)
  Using cached datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting deepcache (from pruna)
  Using cached DeepCache-0.1.1-py3-none-any.whl.metadata (16 kB)
Collecting diffusers>=0.21.4 (from pruna)
  Using cached diffusers-0.34.0-py3-none-any.whl.metadata (20 kB)
Collecting gliner (from pruna)
  Using ca

In [1]:
%pip freeze

accelerate==1.9.0
aiohappyeyeballs==2.6.1
aiohttp==3.12.14
aiosignal==1.4.0
annotated-types==0.7.0
anyio==4.9.0
arrow==1.3.0
asttokens @ file:///home/conda/feedstock_root/build_artifacts/asttokens_1733250440834/work
async-timeout==5.0.1
attrs==25.3.0
audioread==3.0.1
bitsandbytes==0.46.1
certifi==2025.7.14
cffi==1.17.1
charset-normalizer==3.4.2
click==8.2.1
codecarbon==3.0.4
colorama==0.4.6
coloredlogs==15.0.1
comm @ file:///home/conda/feedstock_root/build_artifacts/bld/rattler-build_comm_1753453984/work
compressed-tensors==0.10.2
ConfigSpace==1.2.1
cryptography==45.0.5
ctranslate2==4.6.0
datasets==3.5.0
debugpy @ file:///croot/debugpy_1736267418885/work
decorator @ file:///home/conda/feedstock_root/build_artifacts/decorator_1740384970518/work
DeepCache==0.1.1
diffusers==0.34.0
dill==0.3.8
einops==0.8.1
entrypoints @ file:///home/conda/feedstock_root/build_artifacts/entrypoints_1733327148154/work
exceptiongroup @ file:///home/conda/feedstock_root/build_artifacts/exceptiongroup_17469472

For more information about how to install Pruna, please refer to the [Installation](https://docs.pruna.ai/en/stable/setup/install.html) page.

Then, we will select the best available device so that the optimizations have the most effect. The code below falls back to MPS or the CPU, but for this tutorial we recommend using a CUDA GPU.

In [1]:
import torch

device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"

In [2]:
device

'cuda'
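
The one-line conditional above encodes a priority order: CUDA first, then MPS, then CPU. As a minimal sketch, the same logic can be written as a small function that takes the availability flags as arguments, which makes the order explicit and easy to check without a GPU. The helper name `pick_device` is hypothetical and not part of PyTorch or Pruna.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device name: CUDA first, then MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# Each combination of flags maps to exactly one device name.
print(pick_device(cuda_available=True, mps_available=False))
print(pick_device(cuda_available=False, mps_available=True))
print(pick_device(cuda_available=False, mps_available=False))
```

In the notebook, the two flags would come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.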

## 1. Load the Model

First, we must load the original model using the diffusers library and make sure it fits into memory. In this example, we will use a lightweight model that is compatible with most consumer-grade GPUs, [Wan-AI/Wan2.1-T2V-1.3B](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B).

Pruna works at least as well with larger models, such as the 14B version of Wan 2.1 or HunyuanVideo. We use a smaller model here simply because it is a good starting point, so feel free to use any [text-to-video model available on Hugging Face](https://huggingface.co/models?pipeline_tag=text-to-video&sort=trending).

In [3]:
from diffusers import AutoencoderKLWan, WanPipeline

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"

vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)

pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16).to(device)

Multiple distributions found for package optimum. Picked distribution: optimum-quanto


Loading pipeline components...:   0%|          | 0/5 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

Once we have loaded the pipeline, we can run some inference and check the output. The standard prompt structure for a video is **Subject + Subject Action + Scene**, which can become more complex as we add descriptions and details like the lighting, point of view, or visual style to achieve specific and refined results.
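
To make the **Subject + Subject Action + Scene** structure concrete, here is a minimal sketch that assembles a prompt from its parts and appends an optional style descriptor. The helper `build_prompt` is hypothetical; it is only a convenience for illustration and not part of diffusers or Pruna.

```python
from typing import Optional


def build_prompt(subject: str, action: str, scene: str, style: Optional[str] = None) -> str:
    """Compose a prompt as 'Subject + Subject Action + Scene', with an optional style."""
    prompt = f"{subject} {action} {scene}"
    if style:
        # Extra details such as lighting or visual style go after the core sentence.
        prompt += f", {style}"
    return prompt + "."


print(build_prompt("A dog", "runs", "on the beach", style="realistic"))
```

With these arguments the helper reproduces the prompt used in this tutorial, `A dog runs on the beach, realistic.`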

Remember that you can improve the quality of the video by increasing the number of frames, the number of inference steps, and the guidance scale, but this will also increase the time and resources required to generate the video.

In [4]:
from diffusers.utils import export_to_video

prompt = "A dog runs on the beach, realistic."
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"  # noqa: E501

with torch.no_grad():
    output = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        height=480,
        width=832,
        num_frames=33,
        guidance_scale=3.0,
        num_inference_steps=15,
        generator=torch.Generator(device=device).manual_seed(42),
    ).frames[0]

export_to_video(output, "base_video.mp4", fps=15)

  0%|          | 0/15 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


'base_video.mp4'

As we can see, the model has generated a nice short video based on our prompt.
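
The settings used above determine how long the clip is: with `num_frames=33` exported at `fps=15`, the video lasts a little over two seconds. The sketch below works through that arithmetic and adds a rough cost estimate for changing the number of inference steps. Both helper names are hypothetical, and the linear cost model is an assumption that only covers the denoising loop, not text encoding or VAE decoding.

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Duration of the exported clip in seconds."""
    return num_frames / fps


def relative_denoising_cost(num_inference_steps: int, baseline_steps: int) -> float:
    """Rough cost ratio, assuming the denoising loop scales linearly with steps."""
    return num_inference_steps / baseline_steps


duration = clip_duration_seconds(num_frames=33, fps=15)
cost_ratio = relative_denoising_cost(num_inference_steps=30, baseline_steps=15)

print(f"Clip duration: {duration:.1f} s")
print(f"Denoising cost at 30 steps vs. 15 steps: about {cost_ratio:.1f}x")
```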

## 2. Define the SmashConfig

Now that we have correctly loaded and tested our base model, let's continue by defining the `SmashConfig` to customize the optimizations we want to apply when smashing.

Keep in mind that not all optimization algorithms are available for all models. You can learn about the requirements and compatibility in the [Algorithms Overview](https://docs.pruna.ai/en/stable/compression.html).

In this example, we will combine the `torch_compile` compiler, with `module_list` as its compilation target, and the `torchao` quantizer, with the `fp8dq` quantization type and the `norm+embedding` modules excluded from quantization. The goal is to speed up the model's inference.

Let's define the `SmashConfig` object.

In [4]:
from pruna import SmashConfig

smash_config = SmashConfig()
smash_config["compiler"] = "torch_compile"
smash_config["torch_compile_target"] = "module_list"
smash_config["quantizer"] = "torchao"
smash_config["torchao_quant_type"] = "fp8dq"
smash_config["torchao_excluded_modules"] = "norm+embedding"

INFO - Using best available device: 'cuda'
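
Because `SmashConfig` is filled through item assignment, one way to compare several configurations is to keep each one in a plain dictionary and apply it in a loop. The sketch below uses an ordinary `dict` as a stand-in for `SmashConfig` so that it runs without Pruna installed. The helper `apply_settings` is hypothetical, and the keys and values are copied from the cell above.

```python
def apply_settings(config, settings):
    """Copy every key/value pair into `config` using item assignment."""
    for key, value in settings.items():
        config[key] = value
    return config


# Settings taken from the SmashConfig cell above. Dictionaries preserve
# insertion order, so the algorithm is set before its hyperparameters.
fp8_compile_settings = {
    "compiler": "torch_compile",
    "torch_compile_target": "module_list",
    "quantizer": "torchao",
    "torchao_quant_type": "fp8dq",
    "torchao_excluded_modules": "norm+embedding",
}

# A plain dict stands in for SmashConfig() in this sketch.
demo_config = apply_settings({}, fp8_compile_settings)
for key, value in demo_config.items():
    print(f"{key} = {value}")
```

In the notebook, the first argument would be a `SmashConfig()` instance instead of the empty dictionary.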


## 3. Smash the Model

Next, we need to apply our defined `SmashConfig` by smashing our model. The `smash` function takes care of this, so we just need to pass it the `model` and the `smash_config`. To evaluate and compare the models in the upcoming sections, we will first make a deep copy of the base model and keep it on the CPU.
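
The reason for copying first can be shown with a toy example. The sketch below assumes an optimization step that modifies the object it receives; whether `smash` actually does so depends on the algorithms used, so treat this as an illustration of the precaution rather than a description of Pruna's internals. `TinyPipeline` and `optimize_in_place` are hypothetical stand-ins.

```python
import copy


class TinyPipeline:
    """Toy stand-in for a pipeline with a few named components."""

    def __init__(self):
        self.layers = ["attention", "mlp"]


def optimize_in_place(pipeline):
    """Hypothetical optimizer that rewrites the components of its input."""
    pipeline.layers = [f"optimized_{name}" for name in pipeline.layers]
    return pipeline


toy_pipe = TinyPipeline()
reference = copy.deepcopy(toy_pipe)  # independent copy taken before optimizing
optimized = optimize_in_place(toy_pipe)

print(optimized is toy_pipe)  # the original object was modified and returned
print(reference.layers)       # the deep copy still holds the original components
```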

Time to smash! This will take around 20 seconds, depending on the configuration.

In [5]:
import copy

from pruna import smash

copy_pipe = copy.deepcopy(pipe).to("cpu")
smashed_pipe = smash(
    model=pipe,
    smash_config=smash_config,
)

INFO - Starting quantizer torchao...
INFO - quantizer torchao was applied successfully.
INFO - Starting compiler torch_compile...
INFO - compiler torch_compile was applied successfully.


Now, we will have an optimized smashed model, so let's check how it works using the previous prompt.

Keep in mind that if you are using `torch_compile` as the compiler, the first inference call acts as a warmup and can take noticeably longer than subsequent calls.
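
A common way to account for this is to time warmup calls separately from the calls you actually measure. The following is a minimal sketch with a fake inference function whose first call is deliberately slower; `time_calls` and `fake_inference` are hypothetical names, and the sleep durations are arbitrary placeholders rather than real measurements.

```python
import time


def time_calls(fn, n_warmup_iterations=1, n_iterations=3):
    """Time warmup calls and measured calls separately, in seconds."""
    warmup_times = []
    for _ in range(n_warmup_iterations):
        start = time.perf_counter()
        fn()
        warmup_times.append(time.perf_counter() - start)

    measured_times = []
    for _ in range(n_iterations):
        start = time.perf_counter()
        fn()
        measured_times.append(time.perf_counter() - start)
    return warmup_times, measured_times


calls = []


def fake_inference():
    """Stand-in for a pipeline call: the first call mimics compilation overhead."""
    time.sleep(0.05 if not calls else 0.01)
    calls.append(1)


warmup_times, measured_times = time_calls(fake_inference)
print(f"Warmup call: {warmup_times[0]:.3f} s")
print(f"Mean of measured calls: {sum(measured_times) / len(measured_times):.3f} s")
```

The `n_warmup_iterations` and `n_iterations` parameter names mirror the arguments passed to the metrics later in this tutorial.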



In [7]:
with torch.no_grad():
    output = smashed_pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        height=480,
        width=832,
        num_frames=33,
        guidance_scale=3.0,
        num_inference_steps=15,
        generator=torch.Generator(device=device).manual_seed(42),
    ).frames[0]

export_to_video(output, "smashed_video.mp4", fps=15)

  0%|          | 0/15 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


'smashed_video.mp4'

As we can observe, the smashed model has also generated a short video, similar to the one from the original model.

If you notice a significant difference, it might be due to the model, the configuration, or the hardware. We encourage you to retry the optimization process or try out different configurations and models to find the best fit for your use case. Feel free to reach out to us on [Discord](https://discord.gg/Tun8YgzxZ9) if you have any questions or feedback.

## 4. Evaluate the Smashed Model

Now that we have our smashed model, the key question is how much it has improved with our optimization. To answer it, we can evaluate performance using the `EvaluationAgent`. In this case, we will include the `total_time`, `latency`, `throughput`, `co2_emissions`, and `energy_consumed` metrics.

A complete list of the available metrics can be found in [Evaluation](https://docs.pruna.ai/en/stable/reference/evaluation.html).

In [6]:
from pruna import PrunaModel
from pruna.data.pruna_datamodule import PrunaDataModule
from pruna.evaluation.evaluation_agent import EvaluationAgent
from pruna.evaluation.metrics import (
    CO2EmissionsMetric,
    EnergyConsumedMetric,
    LatencyMetric,
    ThroughputMetric,
    TotalTimeMetric,
)
from pruna.evaluation.task import Task

# Define the metrics. Increase the number of iterations and warmup iterations to get a more accurate result.
metrics = [
    TotalTimeMetric(n_iterations=3, n_warmup_iterations=1),
    LatencyMetric(n_iterations=3, n_warmup_iterations=1),
    ThroughputMetric(n_iterations=3, n_warmup_iterations=1),
    CO2EmissionsMetric(n_iterations=3, n_warmup_iterations=1),
    EnergyConsumedMetric(n_iterations=3, n_warmup_iterations=1),
]

# Define the datamodule
datamodule = PrunaDataModule.from_string("LAION256")
datamodule.limit_datasets(10)

# Define the task and the evaluation agent
task = Task(metrics, datamodule=datamodule, device=device)
eval_agent = EvaluationAgent(task)

  if LooseVersion(torch.__version__) < LooseVersion("1.0.0"):
  if LooseVersion(torch.__version__) >= LooseVersion("1.1.0"):
INFO - Using best available device: 'cuda'
INFO - Using best available device: 'cuda'
INFO - Using best available device: 'cuda'
INFO - Using best available device: 'cuda'
INFO - Using best available device: 'cuda'
INFO - Loaded only training, splitting train 80/10/10 into train, validation and test...
INFO - Testing compatibility with image_generation_collate...
INFO - Using provided list of metric instances.


In [10]:
# Move the base model to the device and evaluate it (it is offloaded to the CPU in the next cell)
base_pipe = PrunaModel(model=copy_pipe)
base_pipe.move_to_device(device)
base_model_results = eval_agent.evaluate(base_pipe)

INFO - Using best available device: 'cuda'
INFO - Using best available device: 'cuda'
INFO - Evaluating a base model.
INFO - Detected diffusers model. Using DiffuserHandler with fixed seed.
- The first element of the batch is passed as input.
- The generated outputs are expected to have .images attribute.
INFO - Evaluating stateful metrics.
INFO - Evaluating isolated inference metrics.


  0%|          | 0/50 [00:00<?, ?it/s]

  0%|          | 0/50 [00:00<?, ?it/s]

  0%|          | 0/50 [00:00<?, ?it/s]

  0%|          | 0/50 [00:00<?, ?it/s]

[codecarbon INFO @ 15:26:07] [setup] RAM Tracking...
[codecarbon INFO @ 15:26:07] [setup] CPU Tracking...
 Linux OS detected: Please ensure RAPL files exist at /sys/class/powercap/intel-rapl/subsystem to measure CPU

[codecarbon INFO @ 15:26:09] CPU Model on constant consumption mode: AMD EPYC 7R13 Processor
[codecarbon INFO @ 15:26:09] [setup] GPU Tracking...
[codecarbon INFO @ 15:26:09] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 15:26:09] The below tracking methods have been set up:
                RAM Tracking Method: RAM power estimation model
                CPU Tracking Method: global constant
                GPU Tracking Method: pynvml
            
[codecarbon INFO @ 15:26:09] >>> Tracker's metadata:
[codecarbon INFO @ 15:26:09]   Platform system: Linux-6.8.0-1031-aws-x86_64-with-glibc2.39
[codecarbon INFO @ 15:26:09]   Python version: 3.10.18
[codecarbon INFO @ 15:26:09]   CodeCarbon version: 3.0.4
[codecarbon INFO @ 15:26:09]   Available RAM : 61.940 GB
[codecarbon INFO

Loading pipeline components...:   0%|          | 0/5 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

[codecarbon INFO @ 15:26:13] Energy consumed for RAM : 0.000023 kWh. RAM Power : 20.0 W
[codecarbon INFO @ 15:26:13] Delta energy consumed for CPU with constant : 0.000130 kWh, power : 112.5 W
[codecarbon INFO @ 15:26:13] Energy consumed for All CPU : 0.000130 kWh
[codecarbon INFO @ 15:26:13] Energy consumed for all GPUs : 0.000105 kWh. Total GPU Power : 91.04196829550723 W
[codecarbon INFO @ 15:26:13] 0.000258 kWh of electricity used since the beginning.


  0%|          | 0/50 [00:00<?, ?it/s]

  0%|          | 0/50 [00:00<?, ?it/s]

  0%|          | 0/50 [00:00<?, ?it/s]

  0%|          | 0/50 [00:00<?, ?it/s]

[codecarbon INFO @ 15:40:51] Energy consumed for RAM : 0.003678 kWh. RAM Power : 20.0 W
[codecarbon INFO @ 15:40:51] Delta energy consumed for CPU with constant : 0.020556 kWh, power : 112.5 W
[codecarbon INFO @ 15:40:51] Energy consumed for All CPU : 0.020686 kWh
[codecarbon INFO @ 15:40:51] Energy consumed for all GPUs : 0.062260 kWh. Total GPU Power : 340.1569740178087 W
[codecarbon INFO @ 15:40:51] 0.086624 kWh of electricity used since the beginning.
[codecarbon INFO @ 15:40:51] Energy consumed for RAM : 0.003678 kWh. RAM Power : 20.0 W
[codecarbon INFO @ 15:40:51] Delta energy consumed for CPU with constant : 0.000000 kWh, power : 112.5 W
[codecarbon INFO @ 15:40:51] Energy consumed for All CPU : 0.020686 kWh
[codecarbon INFO @ 15:40:51] Energy consumed for all GPUs : 0.062260 kWh. Total GPU Power : 0.0 W
[codecarbon INFO @ 15:40:51] 0.086624 kWh of electricity used since the beginning.


In [11]:
base_pipe.move_to_device("cpu")

In [12]:
for result in base_model_results:
    print(result)

total_time: 655497.640625
latency: 218499.21354166666
throughput: 4.576675511966112e-06
co2_emissions: 0.031975750293829194
energy_consumed: 0.08662360432680613
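
The three timing values printed above are consistent with a simple relationship, given the `n_iterations=3` setting used when defining the metrics: latency is the total time divided by the number of iterations, and throughput is its inverse. The sketch below checks this against the reported numbers. It is a worked example based on the output, not a description of how Pruna computes the metrics internally.

```python
total_time_ms = 655497.640625  # reported total_time for the base model
n_iterations = 3               # value passed to the metrics above

# Latency: average time per iteration, in ms.
latency_ms = total_time_ms / n_iterations

# Throughput: iterations completed per ms.
throughput_per_ms = n_iterations / total_time_ms

print(f"latency: {latency_ms} ms per iteration")
print(f"throughput: {throughput_per_ms} iterations per ms")
print(f"latency in seconds: {latency_ms / 1000:.1f} s per iteration")
```

Both values agree with the reported `latency` and `throughput`, which also shows that each evaluation iteration took roughly three and a half minutes on this hardware.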


In [7]:
# Move the smashed model to the device and evaluate it (it is offloaded to the CPU in the next cell)
smashed_pipe.move_to_device(device)
smashed_model_results = eval_agent.evaluate(smashed_pipe)

INFO - Using best available device: 'cuda'
INFO - Evaluating a smashed model.
INFO - Detected diffusers model. Using DiffuserHandler with fixed seed.
- The first element of the batch is passed as input.
- The generated outputs are expected to have .images attribute.
INFO - Evaluating stateful metrics.
INFO - Evaluating isolated inference metrics.


  0%|          | 0/50 [00:00<?, ?it/s]



  0%|          | 0/50 [00:00<?, ?it/s]

  0%|          | 0/50 [00:00<?, ?it/s]

  0%|          | 0/50 [00:00<?, ?it/s]

[codecarbon INFO @ 14:55:12] [setup] RAM Tracking...
[codecarbon INFO @ 14:55:12] [setup] CPU Tracking...
 Linux OS detected: Please ensure RAPL files exist at /sys/class/powercap/intel-rapl/subsystem to measure CPU

[codecarbon INFO @ 14:55:13] CPU Model on constant consumption mode: AMD EPYC 7R13 Processor
[codecarbon INFO @ 14:55:13] [setup] GPU Tracking...
[codecarbon INFO @ 14:55:13] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 14:55:13] The below tracking methods have been set up:
                RAM Tracking Method: RAM power estimation model
                CPU Tracking Method: global constant
                GPU Tracking Method: pynvml
            
[codecarbon INFO @ 14:55:13] >>> Tracker's metadata:
[codecarbon INFO @ 14:55:13]   Platform system: Linux-6.8.0-1031-aws-x86_64-with-glibc2.39
[codecarbon INFO @ 14:55:13]   Python version: 3.10.18
[codecarbon INFO @ 14:55:13]   CodeCarbon version: 3.0.4
[codecarbon INFO @ 14:55:13]   Available RAM : 61.940 GB
[codecarbon INFO

Loading pipeline components...:   0%|          | 0/5 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

INFO - Starting quantizer torchao...
INFO - quantizer torchao was applied successfully.
INFO - Starting compiler torch_compile...
INFO - compiler torch_compile was applied successfully.
[codecarbon INFO @ 14:55:21] Energy consumed for RAM : 0.000044 kWh. RAM Power : 20.0 W
[codecarbon INFO @ 14:55:21] Delta energy consumed for CPU with constant : 0.000246 kWh, power : 112.5 W
[codecarbon INFO @ 14:55:21] Energy consumed for All CPU : 0.000246 kWh
[codecarbon INFO @ 14:55:21] Energy consumed for all GPUs : 0.000207 kWh. Total GPU Power : 94.4638114281574 W
[codecarbon INFO @ 14:55:21] 0.000497 kWh of electricity used since the beginning.


  0%|          | 0/50 [00:00<?, ?it/s]

  0%|          | 0/50 [00:00<?, ?it/s]

  0%|          | 0/50 [00:00<?, ?it/s]

  0%|          | 0/50 [00:00<?, ?it/s]

[codecarbon INFO @ 15:10:12] Energy consumed for RAM : 0.003754 kWh. RAM Power : 20.0 W
[codecarbon INFO @ 15:10:12] Delta energy consumed for CPU with constant : 0.020869 kWh, power : 112.5 W
[codecarbon INFO @ 15:10:12] Energy consumed for All CPU : 0.021115 kWh
[codecarbon INFO @ 15:10:12] Energy consumed for all GPUs : 0.060089 kWh. Total GPU Power : 322.813921365026 W
[codecarbon INFO @ 15:10:12] 0.084958 kWh of electricity used since the beginning.
[codecarbon INFO @ 15:10:12] Energy consumed for RAM : 0.003754 kWh. RAM Power : 20.0 W
[codecarbon INFO @ 15:10:12] Delta energy consumed for CPU with constant : 0.000000 kWh, power : 112.5 W
[codecarbon INFO @ 15:10:12] Energy consumed for All CPU : 0.021115 kWh
[codecarbon INFO @ 15:10:12] Energy consumed for all GPUs : 0.060089 kWh. Total GPU Power : 0.0 W
[codecarbon INFO @ 15:10:12] 0.084958 kWh of electricity used since the beginning.
  df = pd.concat([df, new_df], ignore_index=True)


In [8]:
smashed_pipe.move_to_device("cpu")

In [9]:
for result in smashed_model_results:
    print(result)

total_time: 668918.96875
latency: 222972.98958333334
throughput: 4.484848150749949e-06
co2_emissions: 0.031360857993651445
energy_consumed: 0.08495783614858526


Let's visualize and compare the evaluation results of the base and smashed models.

In [16]:
from IPython.display import Markdown, display  # noqa


# Calculate percentage differences for each metric
def calculate_percentage_diff(original, optimized):  # noqa
    return ((optimized - original) / original) * 100


# Calculate differences and prepare table data
table_data = []
for base_metric_result in base_model_results:
    for smashed_metric_result in smashed_model_results:
        if base_metric_result.name == smashed_metric_result.name:
            diff = calculate_percentage_diff(base_metric_result.result, smashed_metric_result.result)
            table_data.append(
                {
                    "Metric": base_metric_result.name,
                    "Base Model": f"{base_metric_result.result:.7f}",
                    "Compressed Model": f"{smashed_metric_result.result:.7f}",
                    "Relative Difference": f"{diff:+.2f}%",
                }
            )
            break

# Create and display markdown table manually
markdown_table = "| Metric | Base Model | Compressed Model | Relative Difference |\n"
markdown_table += "|--------|----------|-----------|------------|\n"
for row in table_data:
    metric = [m for m in metrics if m.metric_name == row["Metric"]][0]
    unit = metric.metric_units if hasattr(metric, "metric_units") else ""
    markdown_table += f"| {row['Metric']} | {row['Base Model']} {unit} | {row['Compressed Model']} {unit} | {row['Relative Difference']} |\n"  # noqa: E501

display(Markdown(markdown_table))

| Metric | Base Model | Compressed Model | Relative Difference |
|--------|----------|-----------|------------|
| total_time | 655497.6406250 ms | 668918.9687500 ms | +2.05% |
| latency | 218499.2135417 ms/num_iterations | 222972.9895833 ms/num_iterations | +2.05% |
| throughput | 0.0000046 num_iterations/ms | 0.0000045 num_iterations/ms | -2.01% |
| co2_emissions | 0.0319758 kgCO2e | 0.0313609 kgCO2e | -1.92% |
| energy_consumed | 0.0866236 kWh | 0.0849578 kWh | -1.92% |
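
The `co2_emissions` and `energy_consumed` rows show the same relative difference. This is expected if emissions are derived from energy by multiplying with a fixed carbon intensity, which is how CodeCarbon estimates them. As a quick check under that assumption, the sketch below computes the implied intensity for both runs from the reported values.

```python
# Reported values from the two evaluation runs above.
base_co2_kg, base_energy_kwh = 0.031975750293829194, 0.08662360432680613
smashed_co2_kg, smashed_energy_kwh = 0.031360857993651445, 0.08495783614858526

# Implied carbon intensity in kgCO2e per kWh for each run.
base_intensity = base_co2_kg / base_energy_kwh
smashed_intensity = smashed_co2_kg / smashed_energy_kwh

print(f"Base run:    {base_intensity:.4f} kgCO2e/kWh")
print(f"Smashed run: {smashed_intensity:.4f} kgCO2e/kWh")
```

Since both runs imply nearly the same intensity, the two metrics carry the same information in this setup, and `energy_consumed` alone is enough to compare the models.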


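In this run, the two models perform almost identically: total time and latency are about 2% higher for the smashed model, while energy consumption and CO2 emissions are about 2% lower. Differences this small may be within run-to-run variation, so consider increasing the number of iterations or trying other configurations before drawing conclusions. Once you are satisfied with the results, you can save the optimized model to disk or share it with others: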

In [None]:
# Save the model to disk
smashed_pipe.save_pretrained("Wan2.1-T2V-1.3B-smashed")
# Load the model from disk
# smashed_pipe = PrunaModel.from_pretrained("Wan2.1-T2V-1.3B-smashed/")

# Save the model to HuggingFace
# smashed_pipe.save_to_hub("PrunaAI/Wan2.1-T2V-1.3B-smashed")
# smashed_pipe = PrunaModel.from_hub("PrunaAI/Wan2.1-T2V-1.3B-smashed")

## Conclusions

In this tutorial, we have gone over the standard workflow for optimizing and evaluating a text-to-video model.

We started by loading the base model and defining the `SmashConfig` with the desired optimization algorithms and parameters. Then we smashed the base model to obtain an optimized version, and we measured its performance against the base model by running an evaluation with the `EvaluationAgent`.

In this run, the measured differences between the base and smashed models were small (about 2% on every metric), which shows why measuring matters: the effect of an optimization depends on the model, the configuration, and the hardware. The `EvaluationAgent` makes it easy to explore these trade-offs and iterate on configurations to find the best optimization strategy for your specific use case.

Check out our other [tutorials](https://docs.pruna.ai/en/stable/docs_pruna/tutorials/index.html) for more examples of how to optimize and evaluate image generation models or LLMs.