## Optimizing and Deploying AI Models with Pruna and Hugging Face

Objective: Optimize the [Efficient-Large-Model/Sana_600M_1024px_ControlNet_HED](https://huggingface.co/Efficient-Large-Model/Sana_600M_1024px_ControlNet_HED) diffusion model with Pruna, evaluate the optimized model, and publish it to the Hugging Face Hub.

Model: [Efficient-Large-Model/Sana_600M_1024px_ControlNet_HED](https://huggingface.co/Efficient-Large-Model/Sana_600M_1024px_ControlNet_HED)

Dataset: [data-is-better-together/open-image-preferences-v1-binarized](https://huggingface.co/datasets/data-is-better-together/open-image-preferences-v1-binarized)

To follow along, ensure that you have the Pruna SDK installed along with all required third-party libraries. Running this tutorial in a clean virtual environment is recommended for a smooth setup.

In [None]:
pip install pruna 

In [None]:
pip install datasets huggingface_hub gradio diffusers

You need to log in to the Hugging Face Hub to download the model weights and to upload the optimized model. Run the cell below to authenticate.

In [1]:
from huggingface_hub import notebook_login

notebook_login()




In [1]:
import torch

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)

### Smash Configuration

To optimize the model, we first need to define the methods that will enhance its performance. For detailed options, refer to the [SmashConfig guide](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/configure.html).

In this tutorial, we will:

* Select a **quantizer** to reduce memory usage
* Optionally use a **cacher** to reuse intermediate computation results across inference steps (the code in this section configures only the quantizer)
* Upload the optimized (smashed) model to the Hugging Face Hub for easy access and deployment
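To make the memory argument for quantization concrete, here is a rough back-of-the-envelope sketch. It assumes about 600 million parameters (read off the "600M" in the model name, not measured) and counts only the raw weight storage. It ignores quantization metadata such as scales and zero points, the text encoder, the VAE, and activations, so real savings will differ.

```python
def weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Approximate storage for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: roughly 600 million parameters, inferred from the model name.
NUM_PARAMS = 600_000_000

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # half-precision weights
int8_gb = weight_memory_gb(NUM_PARAMS, 8)   # 8-bit quantized weights

print(f"16-bit weights: {fp16_gb:.2f} GB")
print(f" 8-bit weights: {int8_gb:.2f} GB")
```

Halving the bits per weight halves the raw weight storage, which is the effect the weight-bits setting of a quantizer is aiming for.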

In [None]:
import torch
from pruna import smash, SmashConfig
from diffusers import SanaPipeline

# 1. Load the pre-trained pipeline (the 512px Sana diffusers checkpoint) in half precision
model_id = "Efficient-Large-Model/Sana_600M_512px_diffusers"
pipe = SanaPipeline.from_pretrained(model_id, variant="fp16", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# 2. Configure Pruna smash
smash_config = SmashConfig()
smash_config["quantizer"] = "hqq_diffusers"     # Quantizer to reduce memory usage
smash_config["hqq_diffusers_weight_bits"] = 8   # Quantize weights to 8 bits

# 3. Smash (optimize) the model
smashed_pipe = smash(model=pipe, smash_config=smash_config)

# 4. Push the smashed pipeline to the Hugging Face Hub using save_to_hub
smashed_pipe.save_to_hub("AINovice2005/Sana_600M_ControlNet_HED-smashed")

print("✅ Smashed Sana model uploaded successfully to Hugging Face Hub.")


### Load and Collate the Dataset

In this step, we load the dataset used to evaluate the model. It supplies the input data during evaluation and lets us assess the model's performance after quantization.

We will use the [`data-is-better-together/open-image-preferences-v1-binarized`](https://huggingface.co/datasets/data-is-better-together/open-image-preferences-v1-binarized) dataset, which contains binarized user image preferences. Its `train` split is divided into train, validation, and test subsets with a fixed seed, wrapped in a `PrunaDataModule` with an image-generation collate function, and limited to 5 samples per subset for quick testing.
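The idea behind a seeded train/validation/test split can be pictured with plain Python. This is only an illustrative sketch: the helper name `split_indices` and the 80/10/10 ratios are hypothetical and are not taken from Pruna, whose own split function may use different proportions and logic.

```python
import random


def split_indices(n_items, seed, val_fraction=0.1, test_fraction=0.1):
    """Shuffle indices reproducibly, then cut them into three disjoint parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # same seed -> same order every run

    n_val = int(n_items * val_fraction)
    n_test = int(n_items * test_fraction)

    val = indices[:n_val]
    test = indices[n_val:n_val + n_test]
    train = indices[n_val + n_test:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(100, seed=42)
print(len(train_idx), len(val_idx), len(test_idx))
```

Fixing the seed keeps the three subsets identical across runs, which makes evaluation results comparable between sessions.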

In [3]:
from datasets import load_dataset  
from pruna.data.pruna_datamodule import PrunaDataModule  
from pruna.data.utils import split_train_into_train_val_test  

# Load and split dataset
ds = load_dataset("data-is-better-together/open-image-preferences-v1-binarized")["train"]
train_ds, val_ds, test_ds = split_train_into_train_val_test(ds, seed=42)

# Initialize PrunaDataModule
datamodule = PrunaDataModule.from_datasets(  
    datasets=(train_ds, val_ds, test_ds),  
    collate_fn="image_generation_collate",  
    collate_fn_args={"img_size": 512, "output_format": "float"}  
)

# Limit datasets to 5 samples each for quick testing
datamodule.limit_datasets(5)


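Note: the run captured here ended with `AttributeError: partially initialized module 'datasets' has no attribute 'utils' (most likely due to a circular import)`. This error typically points to an environment problem rather than to the code above, for example a local file or folder named `datasets` shadowing the library, or a kernel that was not restarted after installing packages.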

### Evaluate the Model

Now that the model and dataset are set up, we can evaluate the smashed model with Pruna's **Evaluation Agent**. The cell below loads the smashed pipeline from the Hub and measures total time and latency over 3 timed iterations after 1 warm-up iteration. To quantify the effect of the optimization, run the same agent on the base pipeline and compare the two sets of results.
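The roles of timed iterations and warm-up iterations can be illustrated with a small standard-library sketch. It is not Pruna's implementation: the helper `time_calls` and the stand-in `fake_inference` are hypothetical, and real GPU timing would also need device synchronization, which is omitted here.

```python
import time
from statistics import mean


def time_calls(fn, n_iterations=3, n_warmup_iterations=1):
    """Run fn untimed for warm-up, then return per-call timings in milliseconds."""
    for _ in range(n_warmup_iterations):
        fn()  # warm-up calls absorb one-off costs such as lazy initialization

    timings_ms = []
    for _ in range(n_iterations):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms


call_count = 0


def fake_inference():
    """Hypothetical stand-in for a model call."""
    global call_count
    call_count += 1
    time.sleep(0.01)


timings = time_calls(fake_inference)
print(f"mean latency: {mean(timings):.1f} ms, total time: {sum(timings):.1f} ms")
```

Only the timed iterations contribute to the reported numbers, so the function is called once more than the number of measurements collected.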

In [1]:
from pruna import PrunaModel
from pruna.data.pruna_datamodule import PrunaDataModule
from pruna.evaluation.evaluation_agent import EvaluationAgent
from pruna.evaluation.metrics import (
    LatencyMetric,
    TotalTimeMetric,
)
from pruna.evaluation.task import Task

smashed_pipe = PrunaModel.from_hub("AINovice2005/Sana_600M_ControlNet_HED-smashed")

metrics = [
    TotalTimeMetric(n_iterations=3, n_warmup_iterations=1),
    LatencyMetric(n_iterations=3, n_warmup_iterations=1),
]

# Define the task and the evaluation agent
# (reuses `datamodule` and `device` from the earlier cells)
task = Task(metrics, datamodule=datamodule, device=device)
eval_agent = EvaluationAgent(task)


# Move the smashed model to the target device and evaluate it
smashed_pipe.move_to_device(device)
smashed_model_results = eval_agent.evaluate(smashed_pipe)



### Gradio Demo

Once the model has been optimized, we can deploy the smashed model using **Gradio** to create an interactive demo. This allows anyone to test the model’s capabilities directly in their browser.

In this section, we will:

* Show how to deploy the optimized model on the Hugging Face Hub with a Gradio demo
* Discuss considerations such as **handling queuing**, especially if multiple users access the demo simultaneously
* Highlight best practices for integrating Gradio demos in your Hugging Face Space to ensure a smooth and responsive user experience

Creating a Gradio demo not only showcases your optimized model effectively but also enables easy sharing and real-world testing by the community.
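Queueing matters because a single GPU-backed model can usually serve only one or a few requests at a time. The sketch below is a conceptual model of a bounded first-in, first-out queue with one worker, written with the standard library only. It is not how Gradio implements its queue, and `fake_generate` and `run_queued` are hypothetical names used for illustration.

```python
import queue
import threading


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for the image generation call."""
    return f"image for: {prompt}"


def run_queued(prompts, max_size=2):
    """Accept up to max_size pending requests, then process them in order."""
    pending = queue.Queue(maxsize=max_size)
    results, rejected = [], []

    # Requests arrive while the worker is busy: extra ones are turned away.
    for prompt in prompts:
        try:
            pending.put_nowait(prompt)
        except queue.Full:
            rejected.append(prompt)

    def worker():
        # A single worker drains the queue, so requests never run concurrently.
        while True:
            try:
                prompt = pending.get_nowait()
            except queue.Empty:
                return
            results.append(fake_generate(prompt))

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return results, rejected


results, rejected = run_queued(["a cat", "a dog", "a bird"], max_size=2)
print(results)
print(rejected)
```

A bounded queue trades availability for predictability: users beyond the limit get an immediate refusal instead of an unbounded wait.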

In [1]:
import gradio as gr
from pruna import PrunaModel


# ✅ Load the smashed model from the Hugging Face Hub
model = PrunaModel.from_hub("AINovice2005/Sana_600M_ControlNet_HED-smashed")

# ✅ Inference function (calls the loaded smashed model)
def generate_image(prompt):
    result = model(prompt, num_inference_steps=25, guidance_scale=7.5)
    return result.images[0]

# ✅ Create Gradio interface with queueing enabled
demo = gr.Interface(
    fn=generate_image,
    inputs=gr.Textbox(lines=2, placeholder="Enter your prompt here...", label="Prompt"),
    outputs=gr.Image(type="pil"),
    title="Sana Smashed Text-to-Image Demo",
    description="Generate high-quality images using the smashed Sana diffusion model optimized with Pruna.",
    allow_flagging="never"
)

# ✅ Enable queueing to handle multiple users
demo.queue()

# ✅ Launch the app
if __name__ == "__main__":
    demo.launch()



* Running on local URL:  http://127.0.0.1:7860
* To create a public link, set `share=True` in `launch()`.
