## Optimizing and Deploying AI Models with Pruna and Hugging Face

Objective: Build a complete tutorial demonstrating how to optimize the [Efficient-Large-Model/Sana_600M_1024px_ControlNet_HED](https://huggingface.co/Efficient-Large-Model/Sana_600M_1024px_ControlNet_HED) diffusion model using Pruna and deploy it seamlessly to the Hugging Face Hub.

Model: [Efficient-Large-Model/Sana_600M_1024px_ControlNet_HED](https://huggingface.co/Efficient-Large-Model/Sana_600M_1024px_ControlNet_HED)

Dataset: [data-is-better-together/open-image-preferences-v1-binarized](https://huggingface.co/datasets/data-is-better-together/open-image-preferences-v1-binarized)

To follow along, install the Pruna package and the third-party libraries used below. We recommend running this tutorial in a clean virtual environment on a machine with a CUDA-capable GPU, since the pipeline is moved to `cuda`.

In [None]:
%pip install pruna

In [None]:
%pip install datasets huggingface_hub gradio diffusers
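Before moving on, you can confirm that the packages are visible to the notebook kernel. The sketch below uses only the standard library's `importlib.metadata` and does not import the packages themselves; the package list is our assumption of what this tutorial needs (`torch` is pulled in as a dependency), and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata

# Packages this tutorial relies on (assumed list; torch arrives as a dependency).
REQUIRED_PACKAGES = ["pruna", "datasets", "huggingface_hub", "gradio", "diffusers", "torch"]


def installed_versions(packages):
    """Return {package: version or None} without importing the packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


missing = [name for name, version in installed_versions(REQUIRED_PACKAGES).items() if version is None]
print("Missing packages:", missing or "none")
```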

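You need to log in to the Hugging Face Hub to download the model weights and to push the smashed model later on. Run the cell below to authenticate with an access token (pushing requires a token with write permission).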

In [1]:
from huggingface_hub import notebook_login

notebook_login()


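`notebook_login()` is interactive. In a script, a CI job, or a Hugging Face Space you will usually read the token from an environment variable instead. The sketch below is a minimal version of that, assuming the token is stored in `HF_TOKEN` (the variable name `huggingface_hub` conventionally reads); `login_from_env` is a hypothetical helper, while `huggingface_hub.login(token=...)` is the library's own function.

```python
import os


def login_from_env(var_name="HF_TOKEN"):
    """Log in with a token stored in an environment variable, if one is set.

    Returns True when a login was performed and False when the variable is unset or empty.
    """
    token = os.environ.get(var_name)
    if not token:
        return False
    # Imported lazily so the helper can be defined before huggingface_hub is installed.
    from huggingface_hub import login

    login(token=token)
    return True
```

Calling `login_from_env()` at the top of a script is often optional: when `HF_TOKEN` is set, `huggingface_hub` picks it up for authenticated requests on its own.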


### Smash Configuration

To optimize the model, we first define which optimization methods Pruna should apply. For detailed options, refer to the [SmashConfig guide](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/configure.html).

In this tutorial, we will:

* Select a **quantizer** to reduce memory usage
* Optionally add a **cacher**, which reuses intermediate results across denoising steps to speed up inference (the code below only configures a quantizer)
* Upload the optimized (smashed) model to the Hugging Face Hub for easy access and deployment
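To build intuition for what an 8-bit weight quantizer buys you, here is a plain round-to-nearest affine quantizer. This is deliberately simpler than HQQ (Half-Quadratic Quantization), which additionally optimizes the quantization parameters with a half-quadratic solver and works group-wise without calibration data; the sketch only illustrates the storage and error trade-off. `weight_bytes` reproduces the back-of-the-envelope arithmetic: 600M parameters take about 1.2 GB at 16 bits and about 0.6 GB at 8 bits, before scale and zero-point overhead. We assume "600M" refers to the diffusion transformer only; the text encoder and VAE are separate components.

```python
def quantize_int8(weights):
    """Affine (asymmetric) round-to-nearest quantization of a non-empty list of floats to 8 bits."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for constant inputs to avoid dividing by zero
    zero_point = round(-lo / scale)  # integer code that represents 0.0
    q = [max(0, min(255, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize_int8(q, scale, zero_point):
    """Map 8-bit codes back to approximate float values."""
    return [(v - zero_point) * scale for v in q]


def weight_bytes(num_params, bits):
    """Storage needed for num_params weights at the given bit width (no metadata overhead)."""
    return num_params * bits // 8


weights = [-0.42, 0.0, 0.13, 0.97]
q, scale, zero_point = quantize_int8(weights)
print("codes:", q, "restored:", dequantize_int8(q, scale, zero_point))
print("fp16 bytes:", weight_bytes(600_000_000, 16), "int8 bytes:", weight_bytes(600_000_000, 8))
```

Each restored value is within half a quantization step (`scale / 2`) of the original, which is why 8-bit weights usually cost little quality while halving memory relative to fp16.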

In [None]:
import torch
from pruna import smash, SmashConfig
from diffusers import SanaPipeline

# 1. Load the base text-to-image pipeline.
# Note: this is the 512px Sana 600M text-to-image checkpoint in diffusers format,
# not the ControlNet HED checkpoint linked at the top of this tutorial.
model_id = "Efficient-Large-Model/Sana_600M_512px_diffusers"
pipe = SanaPipeline.from_pretrained(model_id, variant="fp16", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# 2. Configure Pruna: quantize weights to 8 bits with HQQ
smash_config = SmashConfig()
smash_config["quantizer"] = "hqq_diffusers"    # quantizer to reduce memory usage
smash_config["hqq_diffusers_weight_bits"] = 8  # bit width of the quantized weights
# A cacher is added the same way (smash_config["cacher"] = ...); see the SmashConfig
# guide for the cachers that are compatible with this pipeline.

# 3. Smash (optimize) the pipeline
smashed_pipe = smash(model=pipe, smash_config=smash_config)

# 4. Push the smashed pipeline to the Hugging Face Hub (needs write access to the repo)
smashed_pipe.save_to_hub("AINovice2005/Sana_600M_ControlNet_HED-smashed")

print("✅ Smashed Sana model uploaded successfully to Hugging Face Hub.")
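To check that smashing actually paid off, you can time the original and the smashed pipelines on the same prompt. The helper below is a minimal, framework-agnostic sketch: `benchmark` is a hypothetical name, and the optional `sync` callable is where you would pass `torch.cuda.synchronize` so that queued GPU work finishes before each timer read. Because `smash` may modify `pipe` in place, we assume you load a fresh, unsmashed copy for the baseline.

```python
import statistics
import time


def benchmark(fn, warmup=2, runs=5, sync=None):
    """Return the median wall-clock latency of fn() in seconds.

    warmup calls are excluded from timing (they absorb one-off costs such as kernel
    compilation or memory allocation); sync, if given, is called before and after each
    timed run so that asynchronous GPU work is included in the measurement.
    """
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Example (assumes torch is imported and base_pipe is a freshly loaded, unsmashed pipeline):
# base_s = benchmark(lambda: base_pipe("a cozy cabin in the snow"), sync=torch.cuda.synchronize)
# smashed_s = benchmark(lambda: smashed_pipe("a cozy cabin in the snow"), sync=torch.cuda.synchronize)
```

The median is used rather than the mean so that a single slow run (for example, one interrupted by another process) does not skew the comparison.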


### Load and Collate the Dataset

In this step, we load the dataset used to evaluate the model. HQQ quantization does not need calibration data, so the dataset serves as evaluation input: its prompts let us assess how the model performs after quantization.

We will use the [`data-is-better-together/open-image-preferences-v1-binarized`](https://huggingface.co/datasets/data-is-better-together/open-image-preferences-v1-binarized) dataset, which pairs each prompt with a preferred ("chosen") and a dispreferred ("rejected") image. We keep the prompts and the chosen images, split them into train, validation, and test sets, and wrap them in a `PrunaDataModule` that the evaluation step can consume.
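To make the target format concrete, the sketch below maps binarized preference rows to the `{"text", "image"}` pairs that an image-generation collate function works with. The column names `prompt`, `chosen`, and `rejected` are assumptions based on the dataset's binarized layout (check the dataset card), `to_generation_pairs` is a hypothetical helper, and plain strings stand in for the PIL images the real dataset returns.

```python
def to_generation_pairs(rows, image_column="chosen"):
    """Map binarized preference rows to {"text", "image"} pairs, skipping rows without an image."""
    pairs = []
    for row in rows:
        image = row.get(image_column)
        if image is None:
            continue  # nothing to compare generated images against
        pairs.append({"text": row["prompt"], "image": image})
    return pairs


rows = [
    {"prompt": "a red fox in the snow", "chosen": "img_a.png", "rejected": "img_b.png"},
    {"prompt": "a lighthouse at dusk", "chosen": None, "rejected": "img_c.png"},
]
pairs = to_generation_pairs(rows)
print(pairs)
```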

In [None]:
from functools import partial

from datasets import load_dataset
from pruna.data.collate import image_generation_collate
from pruna.data.pruna_datamodule import PrunaDataModule
from pruna.data.utils import split_train_into_train_val_test

# Step 1: Load the dataset and keep the prompt and the preferred image.
# Assumption: image_generation_collate reads "text" and "image" columns, so the
# "prompt" and "chosen" columns are renamed to match.
dataset_name = "data-is-better-together/open-image-preferences-v1-binarized"
full_ds = load_dataset(dataset_name, split="train")
full_ds = full_ds.select_columns(["prompt", "chosen"]).rename_columns(
    {"prompt": "text", "chosen": "image"}
)

# Step 2: Split into train/val/test with a fixed seed for reproducibility
train_ds, val_ds, test_ds = split_train_into_train_val_test(full_ds, seed=42)

# Step 3: Collate function that batches 512x512 images as integer tensors
collate_fn = partial(image_generation_collate, img_size=512, output_format="int")

# Step 4: Define dataloader arguments
dataloader_args = {
    "batch_size": 8,
    "shuffle": True,
    "num_workers": 4,
}

# ✅ Step 5: Initialize PrunaDataModule with separate train, val, and test datasets
datamodule = PrunaDataModule(
    train_ds,
    val_ds,
    test_ds,
    collate_fn=collate_fn,
    dataloader_args=dataloader_args,
)

print("✅ PrunaDataModule initialized successfully with train, val, and test splits.")
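`split_train_into_train_val_test` handles the split for you. If you want to see what a reproducible, seeded split amounts to, the plain-Python sketch below shuffles indices with a fixed seed and slices them. `seeded_split` is a hypothetical helper, and the 80/10/10 proportions are our assumption, not necessarily the ratios Pruna uses.

```python
import random


def seeded_split(items, seed=42, val_frac=0.1, test_frac=0.1):
    """Shuffle indices with a fixed seed and cut them into train/val/test lists."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # private RNG: the global random state is untouched
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = [items[i] for i in indices[:n_test]]
    val = [items[i] for i in indices[n_test:n_test + n_val]]
    train = [items[i] for i in indices[n_test + n_val:]]
    return train, val, test


train, val, test = seeded_split(list(range(100)))
print(len(train), len(val), len(test))
```

Fixing the seed means every rerun of the notebook evaluates on exactly the same held-out prompts, so metric differences between runs reflect the model rather than the data.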


### Evaluate the Model

Now that the model and dataset are set up, we can evaluate the smashed model with Pruna's `EvaluationAgent`. A `Task` bundles the metrics to compute with the datamodule that supplies the prompts, and the agent runs the model over that data and reports each metric. To obtain a baseline, run the same task on the original (unsmashed) pipeline; comparing the two results shows the impact of the optimization configuration.
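Once you have results for both the original and the smashed pipeline, a small helper can line them up. This is a sketch under the assumption that you first flatten each evaluation result into a plain `{metric_name: value}` dict; the exact structure returned by `EvaluationAgent.evaluate` depends on your Pruna version, and `compare_metrics` is a hypothetical name.

```python
def compare_metrics(baseline, smashed):
    """Return {metric: (baseline, smashed, relative_change)} for metrics present in both dicts.

    relative_change = (smashed - baseline) / |baseline|; it is NaN when the baseline is 0.
    """
    report = {}
    for name in sorted(baseline.keys() & smashed.keys()):
        base, new = float(baseline[name]), float(smashed[name])
        rel = (new - base) / abs(base) if base else float("nan")
        report[name] = (base, new, rel)
    return report


# Hypothetical numbers, for illustration only.
for name, (base, new, rel) in compare_metrics({"clip_score": 30.0}, {"clip_score": 29.4}).items():
    print(f"{name}: {base:.3f} -> {new:.3f} ({rel:+.1%})")
```

Whether a negative change is good depends on the metric: lower latency is better, while a lower CLIP score means the images match their prompts less well.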

In [None]:
from pruna.engine.pruna_model import PrunaModel
from pruna.evaluation.evaluation_agent import EvaluationAgent
from pruna.evaluation.task import Task

# Step 1: Load the smashed pipeline pushed to the Hub in the Smash Configuration section.
# (In the same session you can also evaluate `smashed_pipe` directly.)
repo_id = "AINovice2005/Sana_600M_ControlNet_HED-smashed"
smashed_model = PrunaModel.from_hub(repo_id)
print("✅ Smashed Sana model loaded successfully from the Hugging Face Hub.")

# Step 2: Choose the metrics (adjust based on evaluation needs)
metrics = ["clip_score", "psnr"]

# Step 3: Define the evaluation task, reusing the `datamodule` built in the previous section
task = Task(metrics, datamodule=datamodule, device="cuda")  # use "cpu" if no GPU (much slower)

# Step 4: Initialize EvaluationAgent and run evaluation
eval_agent = EvaluationAgent(task)
results = eval_agent.evaluate(smashed_model)

# Step 5: Print evaluation results
print("📊 Evaluation Results:")
print(results)


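Of the two metrics above, PSNR is easy to compute by hand: PSNR = 10 · log10(MAX² / MSE), where MAX is the largest possible pixel value and MSE is the mean squared error between a reference image and a candidate image; higher values mean the images are closer. The sketch below works on flat sequences of pixel values to stay dependency-free; `psnr` here is our own helper, not Pruna's implementation, and what serves as the reference depends on how the metric is configured in your Pruna version.

```python
import math


def psnr(reference, candidate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat sequences of pixels."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("inputs must be non-empty and have the same length")
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


print(psnr([0, 0, 0, 0], [1, 1, 1, 1]))  # every 8-bit pixel off by one
```

Because `output_format="int"` yields 0 to 255 pixel values, `max_value=255.0` is the matching default; for images normalized to [0, 1], pass `max_value=1.0`.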

### Gradio Demo

Once the model has been optimized, we can deploy the smashed model using **Gradio** to create an interactive demo. This allows anyone to test the model’s capabilities directly in their browser.

In this section, we will:

* Serve the optimized model from the Hugging Face Hub in a Gradio demo, which you can also host in a Hugging Face Space
* Enable **queuing**, which matters when multiple users access the demo simultaneously
* Highlight best practices for integrating Gradio demos in your Hugging Face Space to ensure a smooth and responsive user experience

Creating a Gradio demo not only showcases your optimized model effectively but also enables easy sharing and real-world testing by the community.
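Queuing trades waiting time for stability: with a single GPU, running one generation at a time avoids out-of-memory errors, but later users wait longer. The back-of-the-envelope model below estimates how long a request waits before it starts, assuming a first-in-first-out queue, a fixed per-request latency, and `concurrency` requests running in parallel. It is an idealized sketch (`estimated_wait_seconds` is a hypothetical helper), not a description of Gradio's scheduler internals.

```python
def estimated_wait_seconds(requests_ahead, latency_s, concurrency=1):
    """Seconds until a new request starts in an idealized FIFO queue.

    Every full "round" of `concurrency` requests ahead of us costs one latency period;
    a partial round still leaves a worker free for us when it starts.
    """
    if requests_ahead < 0 or concurrency < 1:
        raise ValueError("requests_ahead must be >= 0 and concurrency >= 1")
    return (requests_ahead // concurrency) * latency_s


# With 3 people ahead and ~3 s per image on one GPU, a new user waits about 9 s.
print(estimated_wait_seconds(3, 3.0))
```

Shaving latency through smashing therefore shortens every user's wait, not just the time to generate a single image, which is one reason optimization and queue settings go hand in hand.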

In [None]:
import gradio as gr
from pruna.engine.pruna_model import PrunaModel

# ✅ Load the smashed Sana pipeline from the Hub.
# The repository was saved with Pruna, so it is loaded through PrunaModel; a plain
# SanaPipeline.from_pretrained call is not guaranteed to restore the quantized weights.
# Assumption: the pipeline is restored on the device recorded in its SmashConfig (use a GPU).
model_path = "AINovice2005/Sana_600M_ControlNet_HED-smashed"
pipe = PrunaModel.from_hub(model_path)
pipe.set_progress_bar_config(disable=True)  # forwarded to the wrapped diffusers pipeline

# ✅ Inference function
def generate_image(prompt):
    if not prompt or not prompt.strip():
        raise gr.Error("Please enter a prompt.")  # shown to the user in the UI
    result = pipe(prompt, num_inference_steps=25, guidance_scale=7.5)
    return result.images[0]

# ✅ Create the Gradio interface
demo = gr.Interface(
    fn=generate_image,
    inputs=gr.Textbox(lines=2, placeholder="Enter your prompt here...", label="Prompt"),
    outputs=gr.Image(type="pil"),
    title="Sana Smashed Text-to-Image Demo",
    description="Generate images with the Sana diffusion model, smashed with Pruna.",
    flagging_mode="never",  # newer Gradio releases; older ones use allow_flagging="never"
)

# ✅ Enable queueing: run one generation at a time on the GPU and cap the waiting line
demo.queue(default_concurrency_limit=1, max_size=20)

# ✅ Launch the app
if __name__ == "__main__":
    demo.launch()

