# Compress and Evaluate Video Generation Models

| Component | Details |
|-----------|---------|
| **Goal** | Showcase a standard workflow for optimizing and evaluating a video generation model |
| **Model** | [Wan-AI/Wan2.1-T2V-1.3B](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B) |
| **Dataset** | [nannullna/laion_subset](https://huggingface.co/datasets/nannullna/laion_subset) |
| **Optimization Algorithms** | compiler(torch_compile), kernel(flash_attn3) |
| **Evaluation Metrics** | `total_time`, `latency`, `throughput`, `co2_emissions`, and `energy_consumed` |

## Getting Started

To install the required dependencies, you can run the following command:


In [None]:
%pip install pruna
%pip install ftfy imageio imageio-ffmpeg

In [1]:
%pip install --upgrade --force-reinstall git+https://github.com/PrunaAI/pruna.git@main


For more information about how to install Pruna, please refer to the [Installation](https://docs.pruna.ai/en/stable/setup/install.html) page.

Next, we select the best available device so that the optimizations can deliver their full benefit. Video generation is compute-intensive, so we recommend running this tutorial on a GPU.

In [2]:
import torch

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)

## 1. Load the Model

First, we load the original model with the diffusers library and make sure it fits into memory. In this example, we use a lightweight model that runs on most consumer-grade GPUs, [Wan-AI/Wan2.1-T2V-1.3B](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B).

Pruna works at least as well with larger models, such as the 14B version of Wan 2.1 or HunyuanVideo. We use a smaller model here simply because it is a good starting point, so feel free to use any [text-to-video model available on Hugging Face](https://huggingface.co/models?pipeline_tag=text-to-video&sort=trending).

In [3]:
from diffusers import AutoencoderKLWan, WanPipeline

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"

vae = AutoencoderKLWan.from_pretrained(
    model_id, subfolder="vae", torch_dtype=torch.float32
)

pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16).to(
    device
)


Once we have loaded the pipeline, we can run some inference and check the output. The standard prompt structure for a video is **Subject + Subject Action + Scene**, which can become more complex as we add descriptions and details like the lighting, point of view, or visual style to achieve specific and refined results.
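
As a minimal sketch of the **Subject + Subject Action + Scene** structure, the helper below assembles a prompt from its parts and appends optional details such as lighting or point of view. The function name `build_prompt` and its parameters are hypothetical and only serve as an illustration; they are not part of diffusers or Pruna.

```python
from typing import Optional, Sequence


def build_prompt(
    subject: str,
    action: str,
    scene: str,
    details: Optional[Sequence[str]] = None,
) -> str:
    """Compose a prompt as Subject + Subject Action + Scene, plus optional details."""
    core = f"{subject} {action} {scene}".strip()
    # Drop empty detail strings so we never emit ", ," in the prompt.
    extras = [d.strip() for d in (details or []) if d.strip()]
    return ", ".join([core, *extras]) + "."


# Reproduces the prompt used in this tutorial.
simple_prompt = build_prompt("A dog", "runs", "on the beach", ["realistic"])

# The same structure with extra details for lighting and point of view.
detailed_prompt = build_prompt(
    "A dog",
    "runs",
    "on the beach",
    ["golden hour lighting", "low-angle shot", "realistic"],
)

print(simple_prompt)
print(detailed_prompt)
```

Keeping the three core parts separate makes it easy to vary one of them, for example the scene, while holding the rest of the prompt fixed.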

Remember that you can improve the quality of the video by increasing the number of frames, the number of inference steps, and the guidance scale, but this will also increase the time and resources required to generate the video.

In [None]:
from diffusers.utils import export_to_video

prompt = "A dog runs on the beach, realistic."
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"  # noqa: E501

with torch.no_grad():
    output = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        height=480,
        width=832,
        num_frames=33,
        guidance_scale=3.0,
        num_inference_steps=15,
        generator=torch.Generator(device=device).manual_seed(42),
    ).frames[0]

export_to_video(output, "base_video.mp4", fps=15)
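
To relate `num_frames` and `fps` to the length of the exported clip, here is a small sketch. The helper names are hypothetical; the only relationship assumed is that the clip duration equals the number of frames divided by the frames per second.

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Length of the exported clip in seconds."""
    return num_frames / fps


def frames_for_duration(seconds: float, fps: int) -> int:
    """Number of frames needed to fill a clip of the given length."""
    return round(seconds * fps)


# Settings used in the cell above: 33 frames exported at 15 fps.
duration = clip_duration_seconds(num_frames=33, fps=15)
print(f"33 frames at 15 fps -> {duration:.1f} s")

# A 5-second clip at the same frame rate needs more than twice as many frames.
print(f"5 s at 15 fps -> {frames_for_duration(5, fps=15)} frames")
```

Some pipelines only accept particular frame counts, so check the pipeline documentation before picking a value computed this way.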

As we can see, the model has generated a nice short video based on our prompt.

## 2. Define the SmashConfig

Now that we have correctly loaded and tested our base model, let's continue by defining the `SmashConfig` to customize the optimizations we want to apply when smashing.

Note that not all optimization algorithms are available for all models. You can learn about their requirements and compatibility in the [Algorithms Overview](https://docs.pruna.ai/en/stable/compression.html).

In this example, we will use [torch_compile](https://docs.pruna.ai/en/stable/compression.html#torch-compile) to compile the model and the `flash_attn3` kernel to speed up the attention computation.

Let's define the `SmashConfig` object.

In [5]:
from pruna import SmashConfig

smash_config = SmashConfig(device=device)

smash_config["compiler"] = "torch_compile"
smash_config["torch_compile_target"] = "module_list"

smash_config["kernel"] = "flash_attn3"

## 3. Smash the Model

Next, we need to apply our defined `SmashConfig` by smashing our model. The `smash` function will be in charge of this, so we just need to pass the `model` and the `smash_config`. To evaluate and compare the models in the upcoming sections, we will make a deep copy of the base model.

Time to smash! This will take around 20 seconds, depending on the configuration.

In [6]:
import copy

from pruna import smash

copy_pipe = copy.deepcopy(pipe).to("cpu")
smashed_pipe = smash(
    model=pipe,
    smash_config=smash_config,
)



INFO - Starting kernel flash_attn3...
INFO - kernel flash_attn3 was applied successfully.
INFO - Starting compiler torch_compile...
INFO - compiler torch_compile was applied successfully.


We now have an optimized, smashed model, so let's check how it works using the previous prompt.

Keep in mind that with `torch_compile`, the first inference call also triggers compilation, so this warmup run takes noticeably longer than the calls that follow.
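
The sketch below illustrates why warmup runs should be timed separately from the measured runs. It uses only the standard library, and `FakeCompiledModel` is a hypothetical stand-in that simulates a one-off compilation cost on its first call; it is not how Pruna or `torch.compile` are implemented.

```python
import time
from typing import Callable, List, Tuple


def time_with_warmup(
    fn: Callable[[], object], n_warmup: int = 1, n_iterations: int = 3
) -> Tuple[List[float], List[float]]:
    """Return (warmup_times, measured_times) in seconds for repeated calls to fn."""

    def timed_call() -> float:
        start = time.perf_counter()
        fn()
        return time.perf_counter() - start

    warmup_times = [timed_call() for _ in range(n_warmup)]
    measured_times = [timed_call() for _ in range(n_iterations)]
    return warmup_times, measured_times


class FakeCompiledModel:
    """Hypothetical stand-in: the first call pays a simulated compilation cost."""

    def __init__(self) -> None:
        self.compiled = False

    def __call__(self) -> None:
        if not self.compiled:
            time.sleep(0.05)  # simulated one-off compilation overhead
            self.compiled = True
        time.sleep(0.005)  # simulated steady-state inference


warmup, measured = time_with_warmup(FakeCompiledModel(), n_warmup=1, n_iterations=3)
mean_measured = sum(measured) / len(measured)
print(f"warmup: {warmup[0]:.3f} s, steady state: {mean_measured:.3f} s")
```

When timing a real model on a GPU, remember that CUDA kernels run asynchronously, so call `torch.cuda.synchronize()` before reading the clock.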



In [None]:
with torch.no_grad():
    output = smashed_pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        height=480,
        width=832,
        num_frames=33,
        guidance_scale=3.0,
        num_inference_steps=15,
        generator=torch.Generator(device=device).manual_seed(42),
    ).frames[0]

export_to_video(output, "smashed_video.mp4", fps=15)

As we can observe, the smashed model has also generated a short video, similar to the one from the original model.

If you notice a significant difference, it might be due to the model, the configuration, the hardware, etc. We encourage you to retry the optimization process or try out different configurations and models to find the best fit for your use case. Also, feel free to reach out to us on [Discord](https://discord.gg/Tun8YgzxZ9) if you have any questions or feedback.

## 4. Evaluate the Smashed Model

Now that we have our smashed model, the key question is how much it has improved through our optimization. To find out, we can evaluate its performance using the `EvaluationAgent`. In this case, we will include the metrics `total_time`, `latency`, `throughput`, `co2_emissions`, and `energy_consumed`.

A complete list of the available metrics can be found in [Evaluation](https://docs.pruna.ai/en/stable/reference/evaluation.html).

In [10]:
from pruna import PrunaModel
from pruna.data.pruna_datamodule import PrunaDataModule
from pruna.evaluation.evaluation_agent import EvaluationAgent
from pruna.evaluation.metrics import (
    CO2EmissionsMetric,
    EnergyConsumedMetric,
    LatencyMetric,
    ThroughputMetric,
    TotalTimeMetric,
)
from pruna.evaluation.task import Task

# Define the metrics. Increase the number of iterations and
# warmup iterations to get a more accurate result.
metrics = [
    TotalTimeMetric(n_iterations=3, n_warmup_iterations=1),
    LatencyMetric(n_iterations=3, n_warmup_iterations=1),
    ThroughputMetric(n_iterations=3, n_warmup_iterations=1),
    CO2EmissionsMetric(n_iterations=3, n_warmup_iterations=1),
    EnergyConsumedMetric(n_iterations=3, n_warmup_iterations=1),
]

# Define the datamodule
datamodule = PrunaDataModule.from_string("LAION256")
datamodule.limit_datasets(10)

# Define the task and the evaluation agent
task = Task(metrics, datamodule=datamodule, device=device)
eval_agent = EvaluationAgent(task)

INFO - Using best available device: 'cuda'
INFO - Loaded only training, splitting train 80/10/10 into train, validation and test...
INFO - Testing compatibility with image_generation_collate...
INFO - Using provided list of metric instances.


In [11]:
# Evaluate smashed model and offload it to CPU
smashed_pipe.move_to_device(device)
smashed_model_results = eval_agent.evaluate(smashed_pipe)
smashed_pipe.move_to_device("cpu")

INFO - Using best available device: 'cuda'
INFO - Evaluating a smashed model.
INFO - Detected diffusers model. Using DiffuserHandler with fixed seed.
- The first element of the batch is passed as input.
- The generated outputs are expected to have .images attribute.
INFO - Evaluating stateful metrics.
INFO - Evaluating isolated inference metrics.


[codecarbon INFO @ 10:25:37] Energy consumed for RAM : 0.004969 kWh. RAM Power : 66.0 W
[codecarbon INFO @ 10:25:37] Energy consumed for All CPU : 0.001583 kWh
[codecarbon INFO @ 10:25:37] Energy consumed for all GPUs : 0.025696 kWh. Total GPU Power : 106.26894718206573 W
[codecarbon INFO @ 10:25:37] 0.032248 kWh of electricity used since the beginning.


In [13]:
# Evaluate base model and offload it to CPU
base_pipe = PrunaModel(model=copy_pipe)
base_pipe.move_to_device(device)
base_model_results = eval_agent.evaluate(base_pipe)
base_pipe.move_to_device("cpu")

INFO - Using best available device: 'cuda'
INFO - Evaluating a base model.
INFO - Detected diffusers model. Using DiffuserHandler with fixed seed.
- The first element of the batch is passed as input.
- The generated outputs are expected to have .images attribute.
INFO - Evaluating stateful metrics.
INFO - Evaluating isolated inference metrics.


[codecarbon INFO @ 10:46:52] Energy consumed for RAM : 0.008528 kWh. RAM Power : 66.0 W
[codecarbon INFO @ 10:46:52] Energy consumed for All CPU : 0.002716 kWh
[codecarbon INFO @ 10:46:52] Energy consumed for all GPUs : 0.044398 kWh. Total GPU Power : 106.45399623597787 W
[codecarbon INFO @ 10:46:52] 0.055642 kWh of electricity used since the beginning.
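
The final codecarbon lines of each run split the total energy into RAM, CPU, and GPU contributions. The sketch below copies those logged values by hand to check that they add up to the reported totals and to see how much the GPU contributes. The `EnergyBreakdown` class is a hypothetical helper, not part of codecarbon or Pruna.

```python
from dataclasses import dataclass


@dataclass
class EnergyBreakdown:
    """Energy per component in kWh, copied from the final codecarbon log lines."""

    ram_kwh: float
    cpu_kwh: float
    gpu_kwh: float

    @property
    def total_kwh(self) -> float:
        return self.ram_kwh + self.cpu_kwh + self.gpu_kwh

    @property
    def gpu_share(self) -> float:
        return self.gpu_kwh / self.total_kwh


base = EnergyBreakdown(ram_kwh=0.008528, cpu_kwh=0.002716, gpu_kwh=0.044398)
smashed = EnergyBreakdown(ram_kwh=0.004969, cpu_kwh=0.001583, gpu_kwh=0.025696)

for name, run in [("base", base), ("smashed", smashed)]:
    print(f"{name}: total {run.total_kwh:.6f} kWh, GPU share {run.gpu_share:.1%}")
```

In both runs the GPU accounts for roughly 80% of the total, so most of the savings come from the GPU being busy for less time.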


Let's visualize and compare the evaluation results of the base and smashed models.

In [15]:
from IPython.display import Markdown, display  # noqa


# Calculate percentage differences for each metric
def calculate_percentage_diff(original, optimized):  # noqa
    return ((optimized - original) / original) * 100


# Calculate differences and prepare table data
table_data = []
for base_metric_result in base_model_results:
    for smashed_metric_result in smashed_model_results:
        if base_metric_result.name == smashed_metric_result.name:
            diff = calculate_percentage_diff(
                base_metric_result.result, smashed_metric_result.result
            )
            table_data.append(
                {
                    "Metric": base_metric_result.name,
                    "Base Model": f"{base_metric_result.result:.7f}",
                    "Compressed Model": f"{smashed_metric_result.result:.7f}",
                    "Relative Difference": f"{diff:+.2f}%",
                }
            )
            break

# Create and display markdown table manually
markdown_table = "| Metric | Base Model | Compressed Model | Relative Difference |\n"
markdown_table += "|--------|----------|-----------|------------|\n"
for row in table_data:
    metric = [m for m in metrics if m.metric_name == row["Metric"]][0]
    unit = metric.metric_units if hasattr(metric, "metric_units") else ""
    markdown_table += f"| {row['Metric']} | {row['Base Model']} {unit} | {row['Compressed Model']} {unit} | {row['Relative Difference']} |\n"  # noqa: E501

display(Markdown(markdown_table))

| Metric | Base Model | Compressed Model | Relative Difference |
|--------|----------|-----------|------------|
| total_time | 460992.1875000 ms | 265793.1718750 ms | -42.34% |
| latency | 153664.0625000 ms/num_iterations | 88597.7239583 ms/num_iterations | -42.34% |
| throughput | 0.0000065 num_iterations/ms | 0.0000113 num_iterations/ms | +73.44% |
| co2_emissions | 0.0031181 kgCO2e | 0.0018072 kgCO2e | -42.04% |
| energy_consumed | 0.0556424 kWh | 0.0322483 kWh | -42.04% |
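
The three timing rows in the table are closely related. The sketch below shows that the reported values are consistent with latency being the total time divided by the number of iterations, and throughput being its inverse. This is inferred from the numbers above and from `n_iterations=3`, not taken from Pruna's implementation, and the helper names are hypothetical.

```python
N_ITERATIONS = 3  # matches n_iterations in the metric definitions


def latency_ms(total_time_ms: float, n_iterations: int) -> float:
    """Average time per iteration in milliseconds."""
    return total_time_ms / n_iterations


def throughput_per_ms(total_time_ms: float, n_iterations: int) -> float:
    """Iterations completed per millisecond."""
    return n_iterations / total_time_ms


def relative_diff_percent(original: float, optimized: float) -> float:
    """Same formula as calculate_percentage_diff in the cell above."""
    return (optimized - original) / original * 100


# total_time values copied from the results table (in ms).
base_total_ms = 460992.1875
smashed_total_ms = 265793.171875

base_latency = latency_ms(base_total_ms, N_ITERATIONS)
smashed_latency = latency_ms(smashed_total_ms, N_ITERATIONS)
base_throughput = throughput_per_ms(base_total_ms, N_ITERATIONS)
smashed_throughput = throughput_per_ms(smashed_total_ms, N_ITERATIONS)

latency_diff = relative_diff_percent(base_latency, smashed_latency)
throughput_diff = relative_diff_percent(base_throughput, smashed_throughput)
speedup = base_total_ms / smashed_total_ms

print(f"latency: {base_latency:.4f} ms -> {smashed_latency:.4f} ms ({latency_diff:+.2f}%)")
print(f"throughput: {throughput_diff:+.2f}%")
print(f"speedup: {speedup:.2f}x")
```

Because throughput is the inverse of latency, a 42% drop in latency corresponds to a 73% rise in throughput: both describe the same speedup of about 1.73x.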


As we can see, the smashed model is both faster and more efficient: total time and latency drop by about 42%, throughput rises by about 73%, and energy consumption and CO2 emissions fall by about 42%. Keep in mind that for this example we run the metrics with a low number of iterations and warmup iterations, so the exact numbers will vary between runs and hardware.
Since the quality of the video is still good, we can save the optimized model to disk or share it with others:

In [None]:
# Save the model to disk
smashed_pipe.save_pretrained("Wan2.1-T2V-1.3B-smashed")
# Load the model from disk
# smashed_pipe = PrunaModel.from_pretrained("Wan2.1-T2V-1.3B-smashed/")

# Save the model to HuggingFace
# smashed_pipe.save_to_hub("PrunaAI/Wan2.1-T2V-1.3B-smashed")
# smashed_pipe = PrunaModel.from_hub("PrunaAI/Wan2.1-T2V-1.3B-smashed")

## Conclusions

In this tutorial, we have gone over the standard workflow for optimizing and evaluating a text-to-video model.

We started by loading the base model and defining the SmashConfig with the desired optimization algorithms and parameters. Then we smashed the base model to obtain an optimized version, and we verified the improvement in performance by running an evaluation with the EvaluationAgent.

The results show that we can significantly reduce inference time and energy consumption while maintaining a high level of output quality. This makes it easy to explore trade-offs and iterate on configurations to find the best optimization strategy for your specific use case.

Check out our other [tutorials](https://docs.pruna.ai/en/stable/docs_pruna/tutorials/index.html) for more examples of how to optimize and evaluate image generation models or LLMs.