### Optimizing and Deploying AI Models with Pruna and Hugging Face

`Goal`: Create an end-to-end tutorial to optimize the HiDream-I1-Fast model using Pruna and deploy it on the Hugging Face Hub.

`Model`: [HiDream-ai/HiDream-I1-Fast](https://huggingface.co/HiDream-ai/HiDream-I1-Fast)

`Dataset`: [data-is-better-together/open-image-preferences-v1-binarized](https://huggingface.co/datasets/data-is-better-together/open-image-preferences-v1-binarized)

To follow this tutorial, install the Pruna SDK and a few third-party libraries with pip. Running the notebook in a fresh virtual environment is recommended.


In [None]:
pip install pruna 

In [None]:
pip install datasets huggingface_hub gradio

In [None]:
pip install git+https://github.com/huggingface/diffusers.git  # TODO: switch to the PyPI release once diffusers v0.34.0 is out
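
Before moving on, it can help to confirm which versions ended up in the environment. The sketch below uses only the standard library; the helper names `installed_version` and `major_minor` are illustrative and not part of any of the installed packages. The 0.34 threshold is taken from the install comment above, on the assumption that `HiDreamImagePipeline` ships with that release.

```python
import re
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def major_minor(version):
    """Parse the leading 'major.minor' of a version string into a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


# Report what is installed for each package used in this notebook
for name in ("pruna", "datasets", "huggingface_hub", "gradio", "diffusers"):
    print(f"{name}: {installed_version(name) or 'not installed'}")

# Assumption: HiDream support requires diffusers 0.34 or a source install
found = installed_version("diffusers")
if found is not None and major_minor(found) < (0, 34):
    print("diffusers is older than 0.34; HiDreamImagePipeline may be unavailable")
```

A source install typically reports a development version such as `0.34.0.dev0`, which the parser treats as 0.34.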

You need to log in to the Hugging Face Hub to access the model weights. Run the cell below to log in.

In [3]:
from huggingface_hub import login

login()



Token has not been saved to git credential helper.


In [4]:
# Download the Llama 3.1 8B Instruct tokenizer and weights (HiDream uses them as its fourth text encoder)
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")


Smash Configuration:

In [6]:
import torch
from transformers import PreTrainedTokenizerFast, LlamaForCausalLM
from diffusers import HiDreamImagePipeline
from pruna import smash, SmashConfig


tokenizer_4 = PreTrainedTokenizerFast.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
text_encoder_4 = LlamaForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    output_hidden_states=True,
    output_attentions=True,
    torch_dtype=torch.bfloat16,
)

# Load the HiDream pipeline
pipe = HiDreamImagePipeline.from_pretrained(
    "HiDream-ai/HiDream-I1-Fast", 
    tokenizer_4=tokenizer_4,
    text_encoder_4=text_encoder_4,
    torch_dtype=torch.bfloat16,
)

pipe = pipe.to("cuda")

# Configure Pruna smash
smash_config = SmashConfig()
smash_config["compiler"] = "torch_compile"
smash_config["quantizer"] = "hqq_diffusers"  # Optional, depends on availability
smash_config["cacher"] = "deepcache"         # Optional

# Smash the pipeline
smashed_pipe = smash(model=pipe, smash_config=smash_config)





OutOfMemoryError: CUDA out of memory. Tried to allocate 74.00 MiB. GPU 0 has a total capacity of 44.53 GiB of which 19.25 MiB is free. Process 36371 has 44.50 GiB memory in use. Of the allocated memory 43.60 GiB is allocated by PyTorch, and 504.67 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
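
The error above is consistent with a rough estimate of the weight footprint. The sketch below is back-of-the-envelope arithmetic, not a measurement: the parameter counts (about 8 billion for the Llama text encoder and about 17 billion for the HiDream transformer) are approximate assumptions, the remaining text encoders and the VAE are ignored, and the helper names are hypothetical. `place_pipeline` falls back to diffusers' `enable_model_cpu_offload()` when the estimate does not fit; whether Pruna's algorithms accept an offloaded pipeline is not confirmed here.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def weight_gib(num_params, dtype="bfloat16"):
    """Approximate memory taken by the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


def fits_on_gpu(param_counts, gpu_gib, dtype="bfloat16", headroom_gib=2.0):
    """True if all weights plus some headroom for activations fit on the GPU."""
    total = sum(weight_gib(n, dtype) for n in param_counts)
    return total + headroom_gib <= gpu_gib


def place_pipeline(pipe, fits):
    """Move the pipeline to the GPU if it fits, otherwise enable CPU offloading."""
    if fits:
        return pipe.to("cuda")
    pipe.enable_model_cpu_offload()  # diffusers moves each component to the GPU only while it runs
    return pipe


# Assumed, approximate parameter counts
LLAMA_PARAMS = 8e9
HIDREAM_TRANSFORMER_PARAMS = 17e9
GPU_GIB = 44.53  # total capacity reported in the error message

for label, count in [("Llama encoder", LLAMA_PARAMS), ("HiDream transformer", HIDREAM_TRANSFORMER_PARAMS)]:
    print(f"{label}: {weight_gib(count):.2f} GiB in bfloat16")
print("fits:", fits_on_gpu([LLAMA_PARAMS, HIDREAM_TRANSFORMER_PARAMS], GPU_GIB))
```

Under these assumptions the two largest components alone exceed the GPU's capacity in bfloat16, so calling `pipe.to("cuda")` before any quantization runs out of memory.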

Upload the model to the Hugging Face Hub

In [None]:
from huggingface_hub import HfApi

api = HfApi()
# Create the target repository (no-op if it already exists)
repo_url = api.create_repo("HiDream-I1-Fast-pruned", exist_ok=True, private=False)

# Save the smashed pipeline locally, then upload the folder to the Hub
smashed_pipe.save_pretrained("HiDream-I1-Fast-pruned")
api.upload_folder(
    folder_path="HiDream-I1-Fast-pruned",
    repo_id=repo_url.repo_id,
    commit_message="Add Pruna-smashed HiDream-I1-Fast",
)


Load Dataset

In [None]:
from datasets import load_dataset

# load the binarized Open Image Preferences prompts
ds = load_dataset("data-is-better-together/open-image-preferences-v1-binarized", split="train")

# preview the first 10 prompts
for example in ds.select(range(10)):
    print(example["prompt"])
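
Since the dataset is a binarized preference set, the same prompt may appear in more than one row. A minimal sketch for collecting a small set of distinct prompts follows; `unique_prompts` is a hypothetical helper, and the toy `rows` list stands in for `ds`, with only the `prompt` column assumed.

```python
def unique_prompts(rows, limit=None):
    """Collect distinct, non-empty prompts in first-seen order, up to `limit`."""
    seen = set()
    prompts = []
    for row in rows:
        prompt = row["prompt"].strip()
        if not prompt or prompt in seen:
            continue  # skip blanks and repeats
        seen.add(prompt)
        prompts.append(prompt)
        if limit is not None and len(prompts) >= limit:
            break
    return prompts


# Toy rows standing in for the Hugging Face dataset
rows = [
    {"prompt": "a red fox"},
    {"prompt": "a red fox "},
    {"prompt": ""},
    {"prompt": "a lighthouse at dusk"},
]
print(unique_prompts(rows))
```

Iterating over a `datasets.Dataset` yields one dict per row, so `unique_prompts(ds, limit=10)` should work the same way on the real data.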


Evaluate the model

In [None]:
from pruna.data.pruna_datamodule import PrunaDataModule
from pruna.evaluation.task import Task
from pruna.evaluation.evaluation_agent import EvaluationAgent

# wrap the HF dataset into a PrunaDataModule
dm = PrunaDataModule.from_hf("data-is-better-together/open-image-preferences-v1-binarized")
dm.setup("fit")

# create an image-generation quality task
task = Task("image_generation_quality", datamodule=dm)

# run evaluation
agent = EvaluationAgent(task)
results = agent.evaluate(smashed_pipe)

print(results)
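
Alongside quality metrics, a simple wall-clock comparison between the base and smashed pipelines can show whether the optimization pays off. The sketch below is a generic timing helper, not part of Pruna's evaluation API; `time_call` and `speedup` are hypothetical names. The warm-up call matters because `torch_compile` typically compiles on the first invocation.

```python
import statistics
import time


def time_call(fn, *args, warmup=1, runs=3, **kwargs):
    """Return the median wall-clock time in seconds of calling fn(*args, **kwargs)."""
    for _ in range(warmup):
        fn(*args, **kwargs)  # warm-up calls are not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def speedup(base_seconds, optimized_seconds):
    """Ratio of base to optimized latency; values above 1 mean the optimized call is faster."""
    return base_seconds / optimized_seconds


# Toy example with a cheap function standing in for a pipeline call
print(f"median: {time_call(sum, range(100_000)):.6f} s")
```

In this notebook, the call would look like `time_call(smashed_pipe, "a red fox")`, and the same call on the base `pipe` gives the reference latency. Because the pipeline returns PIL images, each call should include the GPU work, though adding an explicit `torch.cuda.synchronize()` is the safer choice if intermediate tensors are timed.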


Gradio Demo

In [None]:
import gradio as gr

# Reuse the smashed pipeline created in the Smash Configuration cell,
# so the demo serves the optimized model rather than the base one

# Define the generation function
def generate(prompt):
    return smashed_pipe(prompt).images[0]

# Create the Gradio interface
gr.Interface(fn=generate, inputs="text", outputs="image").launch()
