## Open notebook in:
| Colab |
|:------|
| [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nicolepcx/transformers-the-definitive-guide/blob/master/CH04/ch04_quantize_T2I_models.ipynb) |

# About this notebook

This notebook demonstrates a memory-efficient approach to generating high-quality images using the PixArt-Σ model, a state-of-the-art diffusion transformer for ultra-high-resolution image synthesis. As deep learning models grow in complexity, memory management becomes a crucial aspect, especially when working with limited GPU resources.

In this example, you use advanced quantization techniques provided by the [`BitsAndBytesConfig`](https://huggingface.co/docs/transformers/main/en/quantization/bitsandbytes) configuration and the [`optimum.quanto`](https://github.com/huggingface/optimum-quanto) library to reduce the memory footprint while maintaining performance. The notebook will guide you through the steps of setting up a quantized text encoder, generating prompt embeddings, and efficiently managing GPU memory during the image generation process.

You will also monitor GPU memory usage throughout the process, using [PyTorch's memory functionality](https://pytorch.org/docs/stable/torch_cuda_memory.html#), and employ strategies like freezing parts of the model and cleaning up unused resources to further optimize memory consumption. By the end of this notebook, you will have a practical understanding of how to handle large-scale models on limited hardware, enabling you to generate high-quality images with reduced memory overhead.
The provided code is inspired by the [examples](https://github.com/huggingface/optimum-quanto/blob/main/examples/vision/text-to-image/quantize_pixart_sigma.py) in Hugging Face's quanto library and the [Diffusers library](https://huggingface.co/docs/diffusers/main/en/api/pipelines/pixart_sigma).
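
To get a feel for why quantization matters here, the following back-of-the-envelope sketch estimates the memory needed for model weights alone at different precisions. The function name `weight_memory_gib` and the parameter count are assumptions made up for illustration; they are not taken from the PixArt-Σ checkpoint.

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / (1024 ** 3)


# Hypothetical parameter count, chosen only for illustration.
NUM_PARAMS = 4_000_000_000

for label, bits in [("float32", 32), ("float16/bfloat16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>17}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

Real usage will be higher than these figures, because activations, quantization scales, and the CUDA caching allocator all add overhead on top of the raw weights.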


# Installs

In [None]:
!pip -q install transformers==4.42.4 \
                diffusers==0.30.0 \
                bitsandbytes==0.43.3 \
                ftfy==6.2.3 \
                optimum-quanto==0.2.2


In [None]:
!pip install accelerate==0.33.0 -qqq


# Imports

In [None]:
from transformers import T5EncoderModel, BitsAndBytesConfig
from diffusers import PixArtSigmaPipeline
from optimum.quanto import freeze, qfloat8, qint4, qint8, quantize
import torch
import gc

# Helper Function

In [None]:
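def to_giga_bytes(num_bytes):
    # Convert a byte count to gibibytes (1 GiB = 1024 ** 3 bytes); the printed labels call this "GB"
    return num_bytes / (1024 ** 3)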
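
The same two memory figures are printed after every step in this notebook, so you could wrap the formatting in one small function. The sketch below is a suggestion, not part of the original notebook: `format_memory_report` is a hypothetical name, and it takes plain byte counts so that it runs without a GPU.

```python
def format_memory_report(allocated_bytes: int, reserved_bytes: int) -> str:
    """Format a peak-allocated and a currently-reserved byte count in GiB."""
    gib = 1024 ** 3
    return (
        f"Peak allocated: {allocated_bytes / gib:.2f} GiB | "
        f"Currently reserved: {reserved_bytes / gib:.2f} GiB"
    )


# Made-up byte counts for demonstration, not measurements from this notebook.
print(format_memory_report(3 * 1024 ** 3, 4 * 1024 ** 3))
```

On a CUDA machine you would pass `torch.cuda.max_memory_allocated()` and `torch.cuda.memory_reserved()` as the two arguments.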


# Quantize the text encoder model

In [None]:
# Start recording allocator events so a memory snapshot can be dumped at the end of the notebook
torch.cuda.memory._record_memory_history()

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

text_encoder = T5EncoderModel.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
    subfolder="text_encoder",
    quantization_config=quant_config,
    device_map="balanced",
)

pipe = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
    text_encoder=text_encoder,
    transformer=None,
    device_map="balanced"
)

with torch.no_grad():
    prompt = "Cute animated tabby with big eyes"
    prompt_embeds, prompt_attention_mask, negative_embeds, negative_prompt_attention_mask = pipe.encode_prompt(prompt)


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [None]:
print(
    f"Max memory allocated: {to_giga_bytes(torch.cuda.max_memory_allocated())} GB"
)

# memory_reserved() reports what the caching allocator currently holds, not a peak value
print(
    f"Memory reserved: {to_giga_bytes(torch.cuda.memory_reserved())} GB"
)

Max memory allocated: 6.249999046325684 GB
Memory reserved: 6.587890625 GB
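
To build intuition for what low-bit weight storage trades away, here is a toy sketch of symmetric absmax quantization on a single block of values. It is deliberately simplified and is not the NF4 scheme that bitsandbytes applies: NF4 uses a non-uniform codebook, and double quantization additionally compresses the per-block scales. The function names and the weight values are made up for illustration.

```python
def quantize_block(values, bits):
    """Symmetric absmax quantization of one block to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid a zero scale
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.98, -0.07, 0.31, -0.88, 0.44, 0.05]  # made-up values

for bits in (8, 4):
    codes, scale = quantize_block(weights, bits)
    restored = dequantize_block(codes, scale)
    max_err = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit: max abs error = {max_err:.4f}")
```

Fewer bits mean a coarser grid and a larger rounding error per weight, which is the cost paid for the smaller memory footprint.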


# Delete Text Encoder

In [None]:
del text_encoder
del pipe

In [None]:
gc.collect()
torch.cuda.empty_cache()

In [None]:
print(
    f"Max memory allocated: {to_giga_bytes(torch.cuda.max_memory_allocated())} GB"
)

# memory_reserved() reports what the caching allocator currently holds, not a peak value
print(
    f"Memory reserved: {to_giga_bytes(torch.cuda.memory_reserved())} GB"
)

Max memory allocated: 6.249999046325684 GB
Memory reserved: 6.2578125 GB
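
Notice that the allocated figure did not drop after deleting the text encoder. That is expected, because `torch.cuda.max_memory_allocated()` reports a peak, not the current value. The toy class below models that behavior in plain Python; `PeakTracker` is a hypothetical illustration and not PyTorch's allocator.

```python
class PeakTracker:
    """Toy model of an allocator that tracks current and peak usage."""

    def __init__(self):
        self.current = 0
        self.peak = 0

    def allocate(self, num_bytes):
        self.current += num_bytes
        self.peak = max(self.peak, self.current)

    def free(self, num_bytes):
        self.current -= num_bytes

    def reset_peak(self):
        # After a reset, the peak starts again from the current usage.
        self.peak = self.current


tracker = PeakTracker()
tracker.allocate(6)  # load a model (units are arbitrary)
tracker.free(6)      # delete it and collect garbage
print(tracker.current, tracker.peak)  # current is back to 0, peak is still 6

tracker.reset_peak()
tracker.allocate(2)
print(tracker.current, tracker.peak)  # both are now 2
```

In PyTorch, `torch.cuda.memory_allocated()` gives the current value, and `torch.cuda.reset_peak_memory_stats()` resets the peak counters if you want to measure each stage separately.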


# Quantize the Diffusion Model

In [None]:
pipe = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
    text_encoder=None,
    torch_dtype=torch.float16,
).to("cuda")

quantize(pipe.transformer, weights=qint8, exclude="proj_out")
freeze(pipe.transformer)

latents = pipe(
    negative_prompt=None,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    prompt_attention_mask=prompt_attention_mask,
    negative_prompt_attention_mask=negative_prompt_attention_mask,
    num_images_per_prompt=1,
    output_type="latent",
).images


In [None]:
print(
    f"Max memory allocated: {to_giga_bytes(torch.cuda.max_memory_allocated())} GB"
)

# memory_reserved() reports what the caching allocator currently holds, not a peak value
print(
    f"Memory reserved: {to_giga_bytes(torch.cuda.memory_reserved())} GB"
)

Max memory allocated: 7.409801959991455 GB
Memory reserved: 8.11328125 GB
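
The `exclude="proj_out"` argument above keeps the output projection in its original precision. The sketch below shows one way such a filter can work, assuming glob-style matching of patterns against module names. Both `select_modules` and the module names are hypothetical and only illustrate the idea; check the quanto source for the exact behavior.

```python
from fnmatch import fnmatch


def select_modules(module_names, exclude=None):
    """Return the module names that do not match any exclude pattern."""
    patterns = [exclude] if isinstance(exclude, str) else list(exclude or [])
    return [
        name for name in module_names
        if not any(fnmatch(name, pattern) for pattern in patterns)
    ]


# Hypothetical module names, for illustration only.
names = [
    "transformer_blocks.0.attn1.to_q",
    "transformer_blocks.0.ff.net.2",
    "proj_out",
]

print(select_modules(names, exclude="proj_out"))
```

Under this matching rule, the pattern `proj_out` only matches a module with exactly that name; a wildcard such as `*proj_out` would be needed to match nested modules that end in `proj_out`.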


# Flush the Memory

In [None]:
del pipe.transformer


In [None]:
gc.collect()
torch.cuda.empty_cache()

In [None]:
print(
    f"Max memory allocated: {to_giga_bytes(torch.cuda.max_memory_allocated())} GB"
)

# memory_reserved() reports what the caching allocator currently holds, not a peak value
print(
    f"Memory reserved: {to_giga_bytes(torch.cuda.memory_reserved())} GB"
)

Max memory allocated: 7.409801959991455 GB
Memory reserved: 0.419921875 GB


# Generate the Image

In [None]:
with torch.no_grad():
    image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor, return_dict=False)[0]
image = pipe.image_processor.postprocess(image, output_type="pil")

image[0].save("tabby.png")
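
The decode step divides the latents by the VAE scaling factor, and the postprocessing step then turns the decoder output into 8-bit pixels. The sketch below illustrates that last mapping for single values, assuming the decoder output is nominally in [-1, 1] and is denormalized to [0, 1] before scaling to 0..255. `to_uint8` is a hypothetical helper, not a Diffusers function.

```python
def to_uint8(value):
    """Map a decoder output value in [-1, 1] to an 8-bit pixel value."""
    unit = min(1.0, max(0.0, value / 2 + 0.5))  # denormalize, then clamp to [0, 1]
    return int(round(unit * 255))


# The last value is out of range on purpose, to show the effect of clamping.
print([to_uint8(v) for v in (-1.0, 0.0, 1.0, 1.7)])
```

Clamping matters because the decoder can overshoot the nominal range slightly; without it, out-of-range values would wrap around or overflow when cast to 8 bits.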

# Get Memory Summary

In [None]:
torch.cuda.memory._dump_snapshot("PixArtSigma_quant.pickle")

print(
    torch.cuda.memory_summary()
)


|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      | 183757 KiB |   7587 MiB |   1198 GiB |   1198 GiB |
|       from large pool | 174336 KiB |   7505 MiB |   1198 GiB |   1197 GiB |
|       from small pool |   9421 KiB |     85 MiB |      0 GiB |      0 GiB |
|---------------------------------------------------------------------------|
| Active memory         | 183757 KiB |   7587 MiB |   1198 GiB |   1198 GiB |
|       from large pool | 174336 KiB |   7505 MiB |   1198 GiB |   1197 GiB |
|       from small pool |   9421 KiB |     85 MiB |      0 GiB |      0 GiB |
|---------------------------------------------------------------