<a href="https://www.kaggle.com/code/aisuko/text-to-image-with-diffusers-pipeline?scriptVersionId=164200404" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

This notebook generates an image from a text prompt using the individual components of a Stable Diffusion pipeline from `diffusers`.

In [1]:
# %%capture
!pip install diffusers==0.26.3
!pip install transformers==4.38.1

Collecting diffusers==0.26.3
  Downloading diffusers-0.26.3-py3-none-any.whl (1.9 MB)
Collecting huggingface-hub>=0.20.2 (from diffusers==0.26.3)
  Downloading huggingface_hub-0.20.3-py3-none-any.whl (330 kB)
Installing collected packages: huggingface-hub, diffusers
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface-hub 0.15.1
    Uninstalling huggingface-hub-0.15.1:
      Successfully uninstalled huggingface-hub-0.15.1
Successfully installed diffusers-0.26.3 huggingface-hub-0.20.3
Collecting transformers==4.38.1
  Downloading transformers-4.38.1-py3-none-any.whl (8.5 MB)

In [2]:
import os
import torch

os.environ['MODEL_NAME']='CompVis/stable-diffusion-v1-4'

if torch.cuda.is_available():
    torch_device = 'cuda'
else:
    torch_device = 'cpu'

print(torch_device)

cuda


In [3]:
# !diffusers-cli env

# Loading the Components

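Load each component (the VAE, tokenizer, text encoder, UNet, and scheduler) with its `from_pretrained()` method, pointing `subfolder` at the matching directory of the model repository.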

In [4]:
from PIL import Image

from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(os.getenv('MODEL_NAME'), subfolder="vae")
vae.to(torch_device)
print(vae)

ImportError: cannot import name 'HF_HOME' from 'huggingface_hub.constants' (/opt/conda/lib/python3.10/site-packages/huggingface_hub/constants.py)

In [None]:
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained(os.getenv('MODEL_NAME'), subfolder="tokenizer")
print(tokenizer)

In [None]:
from transformers import CLIPTextModel

text_encoder = CLIPTextModel.from_pretrained(os.getenv('MODEL_NAME'), subfolder="text_encoder")
text_encoder.to(torch_device)
print(text_encoder)

In [None]:
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(os.getenv('MODEL_NAME'), subfolder="unet")
unet.to(torch_device)
print(unet)

## Switch to UniPCMultistepScheduler

Swapping in a different scheduler is straightforward: load it from the same `scheduler` subfolder.

In [None]:
from diffusers import UniPCMultistepScheduler

scheduler = UniPCMultistepScheduler.from_pretrained(os.getenv('MODEL_NAME'), subfolder="scheduler")
print(scheduler)

# Create text embeddings

**Tokenize** the prompt and encode it into text embeddings. The embeddings condition the UNet model and steer the diffusion process towards an image that resembles the input prompt.

In [None]:
prompt = ["a photograph of an astronaut riding a horse"]
# Default height and width of Stable Diffusion
height = 512
width = 512
# Number of denoising steps
num_inference_steps = 5
# Scale for classifier-free guidance
guidance_scale = 7.5
# Seeded generator used to create the initial latent noise
seed = torch.manual_seed(0)
batch_size = len(prompt)

text_input = tokenizer(
    prompt, 
    padding="max_length", 
    max_length=tokenizer.model_max_length, 
    truncation=True,
    return_tensors="pt"
)

print(text_input)
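
The `padding="max_length"` and `truncation=True` arguments together force every prompt to the same number of token IDs. The following is a minimal plain-Python sketch of that idea. The helper `pad_or_truncate`, the token IDs, and `pad_id=0` are hypothetical stand-ins, not the CLIP tokenizer's real vocabulary or implementation, and the sketch ignores the special start and end tokens a real tokenizer adds.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return exactly max_length IDs plus a matching attention mask (toy version)."""
    kept = token_ids[:max_length]          # truncation: drop anything past max_length
    n_pad = max_length - len(kept)         # padding: how many slots are still empty
    ids = kept + [pad_id] * n_pad
    mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return ids, mask


# A short "prompt" gets padded, a long one gets truncated.
short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 10)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Fixed-length inputs are what allow several prompts, and later the empty unconditional prompt, to be stacked into one tensor.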

In [None]:
with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]
    
print(text_embeddings)

## Generate the Unconditional Text Embeddings

Generate the unconditional text embeddings from an empty prompt padded to full length. These need to have the same shape (`batch_size` and `seq_length`) as the conditional `text_embeddings`:

In [None]:
max_length = text_input.input_ids.shape[-1]
uncond_input = tokenizer([""] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt")
# No gradients are needed at inference time
with torch.no_grad():
    uncond_embeddings = text_encoder(uncond_input.input_ids.to(torch_device))[0]
print(uncond_embeddings)

## Concatenate the embeddings

Concatenate the conditional and unconditional embeddings into a single batch to avoid doing two forward passes.

In [None]:
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])
print(text_embeddings)
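
The order of the concatenation matters, because the two halves are split apart again after the forward pass. As a rough illustration in plain Python (lists instead of tensors), the sketch below batches two inputs, runs one pass, and splits the result. `toy_model` and the numbers are hypothetical placeholders, not the UNet or real embeddings.

```python
def toy_model(batch):
    # Hypothetical stand-in for a network that processes each batch item independently.
    return [x * 10 for x in batch]


uncond = [1, 2]   # placeholder "unconditional" items
cond = [3, 4]     # placeholder "conditional" items

batch = uncond + cond    # plays the role of torch.cat([uncond, cond])
out = toy_model(batch)   # a single forward pass covers both halves

half = len(out) // 2
out_uncond, out_cond = out[:half], out[half:]   # plays the role of tensor.chunk(2)

print(out_uncond, out_cond)
```

Because the unconditional items come first in the batch, the first half of the output is the unconditional result.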

# Create random noise

**Generate some initial random noise as a starting point for the diffusion process.** This is the latent representation of the image, and it will be gradually denoised. At this point, the latent image is smaller than the final image, but that is fine because the VAE will decode it into the final 512x512 image later.

In [None]:
latents = torch.randn(
    # read in_channels from the model config; direct attribute access is deprecated
    (batch_size, unet.config.in_channels, height // 8, width // 8),
    generator=seed,
)

latents = latents.to(torch_device)
print(latents)
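
To make the size difference concrete, here is a small sketch of the shape arithmetic. The downsampling factor of 8 comes from the `height // 8` and `width // 8` above. The default of 4 latent channels is an assumption about this checkpoint's UNet (check `unet.config.in_channels`), and `latent_shape` is a hypothetical helper.

```python
def latent_shape(batch_size, height, width, latent_channels=4, vae_scale_factor=8):
    """Shape of the latent tensor for a given output image size (toy helper)."""
    if height % vae_scale_factor or width % vae_scale_factor:
        raise ValueError("height and width must be divisible by the VAE scale factor")
    return (
        batch_size,
        latent_channels,
        height // vae_scale_factor,
        width // vae_scale_factor,
    )


shape = latent_shape(batch_size=1, height=512, width=512)

# Compare the number of values in the latent with the number in the RGB image.
latent_values = shape[1] * shape[2] * shape[3]
pixel_values = 3 * 512 * 512
ratio = pixel_values // latent_values

print(shape, ratio)
```

Under these assumptions the UNet works on a tensor with far fewer values than the final image, which is what makes latent diffusion comparatively cheap.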

# Denoise the image

Start by scaling the initial noise by the scheduler's ***sigma*** (`init_noise_sigma`), the initial noise scale value, which is required for improved schedulers like UniPCMultistepScheduler:

In [None]:
latents = latents * scheduler.init_noise_sigma
print(latents)

The last step is to create the ***denoising loop*** that'll progressively transform the pure noise in latents to an image described by the prompt.

The denoising loop:
* Sets the scheduler's timesteps to use during denoising
* Iterates over the timesteps
* At each timestep, calls the UNet model to predict the noise residual and passes it to the scheduler to compute the previous noisy sample

In [None]:
from tqdm.auto import tqdm

scheduler.set_timesteps(num_inference_steps)

for t in tqdm(scheduler.timesteps):
    # duplicate the latents so one forward pass covers the unconditional and text-conditioned branches
    latent_model_input = torch.cat([latents] * 2)
    latent_model_input = scheduler.scale_model_input(latent_model_input, timestep=t)

    # predict the noise residual
    with torch.no_grad():
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

    # perform guidance
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

    # compute the previous noisy sample x_t -> x_t-1
    latents=scheduler.step(noise_pred, t, latents).prev_sample
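
The guidance step in the loop is a linear extrapolation from the unconditional prediction towards the text-conditioned one. The sketch below applies the same formula to plain Python lists so the effect of `guidance_scale` is easy to see. `apply_guidance` and the numbers are illustrative placeholders, not real UNet outputs.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Elementwise: uncond + scale * (text - uncond), the formula used in the loop."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


uncond = [0.0, 1.0]   # placeholder unconditional prediction
text = [1.0, 3.0]     # placeholder text-conditioned prediction

no_guidance = apply_guidance(uncond, text, 0.0)   # ignores the prompt entirely
plain_text = apply_guidance(uncond, text, 1.0)    # equals the text-conditioned prediction
strong = apply_guidance(uncond, text, 7.5)        # pushes past it, away from uncond

print(no_guidance, plain_text, strong)
```

A scale of 0 reproduces the unconditional prediction, a scale of 1 reproduces the text-conditioned one, and larger values exaggerate the difference between the two.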

# Decode the image

The final step is to use the VAE to decode the latent representation into an image. The decoded output is available in the `sample` attribute:

In [None]:
# scale and decode the image latents with vae

latents = 1 / 0.18215 * latents
with torch.no_grad():
    image = vae.decode(latents).sample

# map values from [-1, 1] to [0, 1], then move channels last and convert to uint8
image = (image / 2 + 0.5).clamp(0, 1)
image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
images = (image * 255).round().astype("uint8")
pil_images = [Image.fromarray(image) for image in images]
pil_images[0]
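
The post-processing above maps decoder outputs from roughly `[-1, 1]` to 8-bit pixel values. As a sketch of that arithmetic on single numbers, here is the same sequence (shift, clamp, scale, round) in plain Python. `to_uint8` is a hypothetical helper, not part of `diffusers`.

```python
def to_uint8(value):
    """Map one decoder output value to an 8-bit pixel value (toy scalar version)."""
    scaled = value / 2 + 0.5               # [-1, 1] -> [0, 1]
    clamped = min(max(scaled, 0.0), 1.0)   # clip anything that falls outside [0, 1]
    return int(round(clamped * 255))       # [0, 1] -> [0, 255]


# In-range values, plus two out-of-range values that get clamped.
pixels = [to_uint8(v) for v in (-1.0, 0.2, 1.0, 2.0, -3.0)]
print(pixels)
```

The clamp matters because the decoder is not guaranteed to stay inside `[-1, 1]`. Without it, out-of-range values would overflow when cast to `uint8`.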