<span style="color: red;">Required when running in Google Colab</span>

In [None]:
!pip install diffusers

# Chapter 3 - Classifier-Free Guidance

Classifier-Free Guidance (CFG) is a technique that significantly improves the quality and controllability of image generation in diffusion models like Stable Diffusion. Introduced by Ho and Salimans (https://arxiv.org/pdf/2207.12598) in 2021, CFG addresses the limitations of sampling without guidance, which often produces low-quality images that follow the prompt only loosely. At every sampling step, CFG combines an unconditional and a text-conditioned noise prediction, pushing the result away from the unconditional prediction and toward the prompt. A guidance scale controls how strongly this happens, which gives users direct control over the trade-off between prompt adherence and variety. In practice, CFG is used in most applications of Stable Diffusion because it is crucial for producing high-quality images that match the prompt.
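
The combination can be written as a single formula. The notation here is chosen for illustration and is not taken from the paper: $\epsilon_\theta(z_t, c)$ is the noise predicted for latent $z_t$ with prompt embedding $c$, $\epsilon_\theta(z_t, \varnothing)$ is the prediction for the empty prompt, and $s$ is the guidance scale (the `guidance_scale` variable used in the sampling loop of this chapter):

$$
\tilde{\epsilon}_\theta(z_t, c) = \epsilon_\theta(z_t, \varnothing) + s \, \bigl(\epsilon_\theta(z_t, c) - \epsilon_\theta(z_t, \varnothing)\bigr)
$$

With $s = 1$ this reduces to the plain conditional prediction, with $s = 0$ to the unconditional one, and with $s > 1$ the prediction is pushed past the conditional one, further away from the unconditional prediction.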

Now we move on to the implementation, which is quite simple. The first part below is copied unchanged from Chapter 2, so you can run it and move on to the next cell:


In [None]:
import warnings
warnings.filterwarnings("ignore")
import diffusers
from diffusers import AutoencoderKL, UNet2DConditionModel, DDIMScheduler
from transformers import CLIPTextModel, CLIPTokenizer
import torch
import matplotlib.pyplot as plt
from tqdm import tqdm

model_id = "stabilityai/stable-diffusion-2-1-base"

vae = AutoencoderKL.from_pretrained(
    model_id, subfolder="vae", revision=None, variant="fp16"
).to("cuda")


unet = UNet2DConditionModel.from_pretrained(
    model_id, subfolder="unet", revision=None, variant="fp16"
).to("cuda")
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")


tokenizer = CLIPTokenizer.from_pretrained(
    model_id, subfolder="tokenizer", revision=None, variant="fp16"
)
text_encoder = CLIPTextModel.from_pretrained(
    model_id, subfolder="text_encoder", revision=None, variant="fp16"
).to("cuda")


prompt = "A photo of a woman, straight hair, light blonde and pink hair, smiling expression, grey background"

text_inputs = tokenizer(
                prompt,
                padding="max_length",
                max_length=tokenizer.model_max_length,
                truncation=True,
                return_tensors="pt",
            ).input_ids.to("cuda")
prompt_embeds = text_encoder(text_inputs)[0]

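# Read in_channels from the config; accessing it directly on the model (unet.in_channels) is deprecated.
latents = torch.randn((1, unet.config.in_channels, unet.config.sample_size, unet.config.sample_size), generator=torch.Generator().manual_seed(220)).to("cuda")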

CFG requires an unconditional prediction from the model, which we combine with the conditional prediction at every sampling step. We therefore tokenise an empty prompt as our unconditional input and encode it with the text encoder:

In [None]:
uncond_tokens = ""
max_length = prompt_embeds.shape[1]
uncond_input = tokenizer(
    uncond_tokens,
    padding="max_length",
    max_length=max_length,
    truncation=True,
    return_tensors="pt").input_ids.to("cuda")
uncond_prompt_embeds = text_encoder(uncond_input)[0]

Now we concatenate the unconditional and conditional prompt embeddings into a single batch, so that each sampling step can run inference on both in one forward pass:

In [None]:
prompt_embeds_combined = torch.cat([uncond_prompt_embeds, prompt_embeds])

The core difference from the previous chapter is that we duplicate the latents, so that each step runs inference on a batch of two copies of the same latent, one per prompt condition:
    latent_model_input = torch.cat([latents] * 2)
From the unconditional and conditional noise predictions we then compute a guided noise prediction, which is used to generate the latent for the next step of the sampling process.
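
To make the guidance step concrete before running the full loop, here is a minimal sketch on toy tensors. The helper name `cfg_combine` and the all-zeros and all-ones "predictions" are made up for illustration; in the real loop the two halves come from the UNet output.

```python
import torch

def cfg_combine(noise_uncond, noise_text, guidance_scale):
    """Start at the unconditional prediction and move toward the text-conditioned one."""
    return noise_uncond + guidance_scale * (noise_text - noise_uncond)

# Toy stand-ins for the two noise predictions (batch, channels, height, width).
noise_uncond = torch.zeros(1, 4, 2, 2)
noise_text = torch.ones(1, 4, 2, 2)

# Both predictions travel through the model as one batch of size 2,
# unconditional first, and are split apart again with chunk(2).
stacked = torch.cat([noise_uncond, noise_text])
first_half, second_half = stacked.chunk(2)

scale_0 = cfg_combine(first_half, second_half, 0.0)   # ignores the prompt
scale_1 = cfg_combine(first_half, second_half, 1.0)   # plain conditional prediction
scale_75 = cfg_combine(first_half, second_half, 7.5)  # pushed past the conditional prediction

print(scale_0.mean().item(), scale_1.mean().item(), scale_75.mean().item())
```

The order matters: because the unconditional embeddings come first in `prompt_embeds_combined`, the first chunk of the output is the unconditional prediction.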

In [None]:
num_inference_steps = 50
guidance_scale = 7.5
scheduler.set_timesteps(num_inference_steps, device="cuda")
timesteps = scheduler.timesteps

with torch.no_grad():
    for i, t in tqdm(enumerate(timesteps), total=len(timesteps), desc="Inference steps"):

        latent_model_input = torch.cat([latents] * 2)

        noise_pred = unet(
            latent_model_input,
            t,
            encoder_hidden_states=prompt_embeds_combined,
            cross_attention_kwargs=None,
            return_dict=False,
        )[0]


        noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
        noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

        latents = scheduler.step(noise_pred, t, latents, return_dict=False)[0]


As before, we decode the latent with the VAE to bring it back to pixel space. We should now see a significantly better result:

In [None]:
with torch.no_grad():
    image = vae.decode(latents / vae.config.scaling_factor, return_dict=False)[0]
    image_np = image.squeeze(0).float().permute(1,2,0).detach().cpu()
    image_np = image_np - image_np.min()
    image_np = image_np / image_np.max()

In [None]:
plt.imshow(image_np)
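
The min-max normalisation above stretches every image to the full [0, 1] range, so brightness and contrast can differ slightly from image to image. An alternative, assuming the VAE decodes to roughly the [-1, 1] range (as is conventional for Stable Diffusion), is a fixed rescale followed by clamping. The sketch below compares the two on a toy tensor; `to_display_range` is a hypothetical helper name, not part of diffusers.

```python
import torch

def to_display_range(image):
    """Map values from the assumed [-1, 1] range to [0, 1] and clip any overshoot."""
    return (image / 2 + 0.5).clamp(0, 1)

# Toy "decoded" values, including slight overshoot beyond [-1, 1].
decoded = torch.tensor([-1.2, -1.0, 0.0, 0.5, 1.0, 1.3])

fixed = to_display_range(decoded)

# Min-max normalisation as used in the cell above, for comparison.
minmax = decoded - decoded.min()
minmax = minmax / minmax.max()

print(fixed)
print(minmax)
```

With the fixed rescale, a value of 0 always maps to mid-grey (0.5), whereas under min-max normalisation the result depends on the extremes of each individual image.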