# Optimize and Deploy AI Models with Pruna and Hugging Face

Objective: Optimize the [Efficient-Large-Model/Sana_600M_1024px_ControlNet_HED](https://huggingface.co/Efficient-Large-Model/Sana_600M_1024px_ControlNet_HED) diffusion model with Pruna and deploy it to the Hugging Face Hub.

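Model: [Efficient-Large-Model/Sana_600M_1024px_ControlNet_HED](https://huggingface.co/Efficient-Large-Model/Sana_600M_1024px_ControlNet_HED) (the code cells below load the base text-to-image checkpoint `Efficient-Large-Model/Sana_600M_512px_diffusers` with `SanaPipeline`)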

Dataset: [data-is-better-together/open-image-preferences-v1-binarized](https://huggingface.co/datasets/data-is-better-together/open-image-preferences-v1-binarized)

To follow along, ensure that you have the Pruna SDK installed along with all required third-party libraries. Running this tutorial in a clean virtual environment is recommended for a smooth setup.

<a target="_blank" href="https://colab.research.google.com/github/PrunaAI/pruna/blob/v|version|/docs/tutorials/deploying_sana_tutorial.ipynb">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [None]:
%pip install pruna

In [None]:
%pip install datasets huggingface_hub gradio

You will need to log in to the Hugging Face Hub to download the model weights and, later, to push the smashed model. We also select the best available device for running the notebook. Run the cells below to do both.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
import torch

device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
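
The one-line conditional above encodes a fixed priority order: CUDA first, then Apple MPS, then CPU. As a sketch, here is the same rule written as a plain function. `pick_device` is a hypothetical helper that takes the two availability flags as arguments instead of calling `torch`, so it runs anywhere and makes every case explicit.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device string: CUDA first, then Apple MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# Print every combination of availability flags and the device the rule picks.
for cuda in (True, False):
    for mps in (True, False):
        print(f"cuda={cuda!s:5} mps={mps!s:5} -> {pick_device(cuda, mps)}")
```

Note that CUDA wins even when MPS is also reported as available, exactly as in the nested conditional.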

## 1. Smash Configuration

To optimize the model, we first define a `SmashConfig` that specifies which optimization algorithms to apply and with which parameters. For detailed options and parameter explanations, refer to the [SmashConfig guide](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/configure.html).

In this tutorial, we will:

* Select the **hqq_diffusers** quantizer to reduce memory usage during inference.
* Set its **weight bit precision** to 8 bits for the diffusers model.
* Upload the optimized ("smashed") model to the Hugging Face Hub for easy access, sharing, and deployment in downstream applications.
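
To build intuition for what 8-bit weight quantization does, here is a minimal sketch of plain min-max (affine) quantization of a small group of weights: each float is mapped to an integer code in [0, 255] using a shared scale and zero-point, then mapped back. This is a simplification and not HQQ itself; HQQ (Half-Quadratic Quantization) additionally refines the quantization parameters by solving an optimization problem to reduce reconstruction error. The function names and example weights below are hypothetical.

```python
def quantize_8bit(weights):
    """Map a list of floats to integer codes in [0, 255] with a shared scale and zero-point."""
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / 255 or 1.0  # fall back to 1.0 for constant groups (range 0)
    zero_point = round(-w_min / scale)
    # Clamp so rounding at the edges can never leave the 8-bit range.
    codes = [min(255, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_8bit(codes, scale, zero_point):
    """Recover approximate floats from the 8-bit codes."""
    return [(c - zero_point) * scale for c in codes]


weights = [-0.42, -0.1, 0.0, 0.07, 0.31, 0.55]
codes, scale, zero_point = quantize_8bit(weights)
restored = dequantize_8bit(codes, scale, zero_point)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(f"scale={scale:.5f}, zero_point={zero_point}, max abs error={max_err:.5f}")
```

Each weight now needs one byte instead of two (fp16), at the cost of a rounding error bounded by roughly the scale.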

In [None]:
import torch
from diffusers import SanaPipeline

from pruna import SmashConfig, smash

# Define the model ID of the base Sana checkpoint
model_id = "Efficient-Large-Model/Sana_600M_512px_diffusers"

# Load the pre-trained pipeline in half precision and move it to the selected device
pipe = SanaPipeline.from_pretrained(model_id, variant="fp16", torch_dtype=torch.float16)
pipe = pipe.to(device)

# Configure Pruna: 8-bit HQQ weight quantization for the diffusers model
smash_config = SmashConfig()
smash_config["quantizer"] = "hqq_diffusers"  # Quantizer to reduce memory usage
smash_config["hqq_diffusers_weight_bits"] = 8

# Smash (optimize) the pipeline
smashed_pipe = smash(model=pipe, smash_config=smash_config)

# Push the smashed pipeline to the Hugging Face Hub
# (replace the repository ID with one under your own username or organization)
smashed_pipe.save_to_hub("AINovice2005/Sana_600M_ControlNet_HED-smashed")

print("✅ Smashed Sana model uploaded successfully to Hugging Face Hub.")
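
A quick back-of-the-envelope calculation shows why 8-bit weights reduce memory. The sketch below assumes about 600 million quantized weights (taken from the model name) and ignores the small overhead of scales and zero-points, activations, and any pipeline components that are not quantized (for example the text encoder or VAE). `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params: int, bits_per_weight: int) -> float:
    """Storage for n_params weights at the given bit width, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 600_000_000  # assumption: ~600M quantized weights, per the model name
fp16 = weight_memory_gb(n_params, 16)
int8 = weight_memory_gb(n_params, 8)
print(f"fp16: {fp16:.2f} GB, 8-bit: {int8:.2f} GB, saved: {fp16 - int8:.2f} GB")
```

Under these assumptions, going from 16 to 8 bits halves weight storage, from about 1.2 GB to about 0.6 GB.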

## 2. Load and Collate Dataset

In this step, we load the dataset used to evaluate the model. It provides the input data needed to assess the model's performance after quantization.

We will use the [`data-is-better-together/open-image-preferences-v1-binarized`](https://huggingface.co/datasets/data-is-better-together/open-image-preferences-v1-binarized) dataset, which contains binarized user image preferences and prompts for image generation tasks. Correctly loading and collating the dataset ensures that the input is properly prepared, enabling smooth evaluation.

In [None]:
from datasets import Image, load_dataset

from pruna.data.pruna_datamodule import PrunaDataModule
from pruna.data.utils import split_train_into_train_val_test

# Load dataset
dataset = load_dataset("data-is-better-together/open-image-preferences-v1-binarized")["train"]

# Rename columns to the "image" and "text" names used by the image-generation collate function
dataset = dataset.rename_column("image_quality_dev", "image")
dataset = dataset.rename_column("quality_prompt", "text")

# Decode the image column as PIL images
dataset = dataset.cast_column("image", Image())

# Split train into train/val/test
train_ds, val_ds, test_ds = split_train_into_train_val_test(dataset, seed=42)

# Initialize PrunaDataModule
datamodule = PrunaDataModule.from_datasets(
    datasets=(train_ds, val_ds, test_ds),
    collate_fn="image_generation_collate",
    collate_fn_args={"img_size": 512, "input_format": "float", "output_format": "float"},
)

# Limit datasets to 5 samples each for quick testing
datamodule.limit_datasets(5)
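
Conceptually, the split and limit steps above shuffle the rows with a fixed seed, cut them into three disjoint parts, and then keep only a few samples per part. The stdlib sketch below illustrates that idea; it is not Pruna's implementation, and the helper names and the 80/10/10 ratios are hypothetical choices for illustration rather than the fractions `split_train_into_train_val_test` actually uses.

```python
import random


def split_three_ways(items, seed=42, val_frac=0.1, test_frac=0.1):
    """Shuffle a copy of items deterministically and cut it into train/val/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # same seed -> same order -> same split
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


def limit(split, n):
    """Keep at most n samples, mirroring the idea behind datamodule.limit_datasets(n)."""
    return split[:n]


prompts = [f"prompt {i}" for i in range(100)]  # stand-in for the dataset rows
train, val, test = split_three_ways(prompts, seed=42)
print(len(train), len(val), len(test))
print([len(limit(s, 5)) for s in (train, val, test)])
```

Fixing the seed makes the split reproducible across runs, which keeps evaluation numbers comparable when you change the optimization configuration.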


## 3. Evaluate the Model

Now that the smashed model and dataset are set up, we can evaluate the model with Pruna's `EvaluationAgent`. The agent runs the model on the prepared datamodule and reports the metrics defined in a `Task`; here, total inference time and latency. These numbers describe the smashed model only; evaluating the original `pipe` with the same task provides a baseline for judging the impact of the optimization configuration.

In [None]:
# Import required modules from Pruna
from pruna import PrunaModel
from pruna.evaluation.evaluation_agent import EvaluationAgent
from pruna.evaluation.metrics import (
    LatencyMetric,
    TotalTimeMetric,
)
from pruna.evaluation.task import Task

# Load the smashed (optimized) model pipeline from Hugging Face Hub
smashed_pipe = PrunaModel.from_hub("AINovice2005/Sana_600M_ControlNet_HED-smashed")

# Define evaluation metrics (example: total time and latency)
metrics = [
    TotalTimeMetric(n_iterations=1, n_warmup_iterations=1),
    LatencyMetric(n_iterations=1, n_warmup_iterations=1),
]

# Define the evaluation task with the metrics and datamodule
# (`datamodule` and `device` are defined in the earlier cells)
task = Task(metrics, datamodule=datamodule, device=device)

# Initialize the evaluation agent
eval_agent = EvaluationAgent(task)

# Move the smashed model to the evaluation device (CUDA, MPS, or CPU)
smashed_pipe.move_to_device(device)

# Evaluate the smashed model pipeline using the evaluation agent
smashed_model_results = eval_agent.evaluate(smashed_pipe)

# Optionally, print results for verification
print(smashed_model_results)
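
The `n_iterations` and `n_warmup_iterations` parameters reflect a common benchmarking pattern: run the model a few times untimed so one-off costs (lazy initialization, caching, compilation) do not distort the numbers, then time the remaining calls. The sketch below shows that pattern with the standard library; `measure_latency_ms` and `fake_inference` are hypothetical and do not reproduce Pruna's `LatencyMetric`. On CUDA, kernel launches are asynchronous, so a real GPU benchmark must synchronize the device before reading the clock; we have not checked how Pruna handles this internally.

```python
import statistics
import time


def measure_latency_ms(fn, n_iterations=1, n_warmup_iterations=1):
    """Call fn untimed n_warmup_iterations times, then return the mean latency in ms."""
    for _ in range(n_warmup_iterations):
        fn()  # warmup: absorbs one-off costs such as lazy initialization or caching
    timings = []
    for _ in range(n_iterations):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.mean(timings)


def fake_inference():
    """Stand-in for a pipeline call: sleep for about 20 ms."""
    time.sleep(0.02)


latency = measure_latency_ms(fake_inference, n_iterations=3, n_warmup_iterations=1)
print(f"mean latency: {latency:.1f} ms")
```

With `n_iterations=1`, as in the evaluation cell, a single timed call is reported, so expect some run-to-run noise; more iterations give a steadier average.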


## 4. Gradio Demo

Once the model has been optimized, we can deploy the smashed model using **Gradio** to create an interactive demo. This allows anyone to test the model’s capabilities directly in their browser.

In this section, we will:

* Build a Gradio demo around the smashed model loaded from the Hugging Face Hub, which can also be hosted as a Hugging Face Space
* Enable **queuing**, so that requests from multiple simultaneous users wait their turn instead of competing for the model at the same time
* Keep the app code self-contained, so the same script runs locally and in a Space

Creating a Gradio demo not only showcases your optimized model effectively but also enables easy sharing and real-world testing by the community.

In [None]:
import gradio as gr
import torch

from pruna import PrunaModel

# Select the device (same rule as at the start of the notebook)
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"

# Load the smashed PrunaModel from the Hub and move it to the device
model = PrunaModel.from_hub("AINovice2005/Sana_600M_ControlNet_HED-smashed")
model.move_to_device(device)


# Inference function
def generate_image(prompt):
    """Generate an image from a given text prompt with the smashed model."""
    result = model(prompt, num_inference_steps=25, guidance_scale=7.5)
    return result.images[0]


# Create the Gradio interface
demo = gr.Interface(
    fn=generate_image,
    inputs=gr.Textbox(lines=2, placeholder="Enter your prompt here...", label="Prompt"),
    outputs=gr.Image(type="pil"),
    title="Sana Smashed Text-to-Image Demo",
    description="Generate high-quality images using the smashed Sana diffusion model optimized with Pruna.",
    flagging_mode="never",
)

# Enable queueing to handle multiple users
demo.queue()

# Launch the app
if __name__ == "__main__":
    demo.launch()
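
Conceptually, enabling the queue means incoming requests are held in first-in, first-out order and handed to the model worker one at a time, rather than all hitting the GPU at once. The stdlib sketch below illustrates that idea with `queue.Queue` and a single worker thread; it is not Gradio's implementation, and the names are hypothetical. In Gradio itself, `demo.queue()` accepts a `max_size` argument to cap how many requests may wait.

```python
import queue
import threading
import time


def run_worker(request_queue, results):
    """Process queued prompts one at a time, like a single-concurrency model worker."""
    while True:
        prompt = request_queue.get()
        if prompt is None:  # sentinel: stop the worker
            request_queue.task_done()
            break
        time.sleep(0.01)  # stand-in for image generation
        results.append(f"image for: {prompt}")
        request_queue.task_done()


# Bounded queue: put() blocks when it is full (put_nowait() would raise queue.Full instead).
request_queue = queue.Queue(maxsize=8)
results = []
worker = threading.Thread(target=run_worker, args=(request_queue, results))
worker.start()

for user_prompt in ["a red fox", "a lighthouse at dusk", "a bowl of ramen"]:
    request_queue.put(user_prompt)
request_queue.put(None)
worker.join()
print(results)
```

Because a single worker drains the queue, results come back in arrival order, and a bounded queue keeps a burst of users from piling up unlimited work.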

## Conclusions
In this tutorial, we have covered the end-to-end workflow for optimizing, evaluating, and deploying a text-to-image diffusion model using Pruna.

We began by loading the Sana base model and defining the SmashConfig with the desired optimization algorithm and parameters. We then smashed the base model, pushed the optimized version to the Hugging Face Hub, and measured its total inference time and latency with the EvaluationAgent.

After optimization, we deployed the smashed model in an interactive Gradio demo that anyone can try directly in their browser. The same workflow makes it easy to explore trade-offs, iterate on optimization configurations, and share optimized text-to-image models.

Check out our other [tutorials](https://docs.pruna.ai/en/stable/docs_pruna/tutorials/index.html) for more examples of optimizing and evaluating other model types, such as large language models and text-to-video models, with Pruna.