# Generating images and text with UniDiffuser

UniDiffuser was introduced in [One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale](https://arxiv.org/abs/2303.06555).

In this notebook, we will show how the [UniDiffuser pipeline](https://huggingface.co/docs/diffusers/api/pipelines/unidiffuser) in 🧨 diffusers can be used for:

* Unconditional image generation
* Unconditional text generation
* Text-to-image generation
* Image-to-text generation
* Image variation
* Text variation

One pipeline to rule six use cases 🤯

Let's start!

## Setup

In [2]:
!pip install -q git+https://github.com/dg845/diffusers
!pip install transformers accelerate -q


## Unconditional image and text generation

In [3]:
import torch
from diffusers import UniDiffuserPipeline

device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Unconditional image and text generation. The generation task is automatically inferred.
sample = pipe(num_inference_steps=20, guidance_scale=8.0)
image = sample.images[0]
text = sample.text[0]
image.save("unidiffuser_joint_sample_image.png")
print(text)

No inputs or latents have been supplied, and mode has not been manually set, defaulting to mode 'joint'.



The interior of a fancy restaurant, chandeliers


You can also generate only an image or only text (which the UniDiffuser paper calls “marginal” generation since we sample from the marginal distribution of images and text, respectively):

In [None]:
# Unlike other generation tasks, image-only and text-only generation don't use classifier-free guidance

# Image-only generation
pipe.set_image_mode()
sample_image = pipe(num_inference_steps=20).images[0]

# Text-only generation
pipe.set_text_mode()
sample_text = pipe(num_inference_steps=20).text[0]

To clear a manually set mode and let the pipeline infer the mode from its inputs again, call `pipe.reset_mode()`.
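
How a manually set mode interacts with automatic inference can be pictured with a toy model. The sketch below is not the diffusers implementation: `ToyModeSwitch` and its `resolve` method are hypothetical, and the mode names other than `'joint'` (which appears in the log message above) are assumptions chosen for illustration. It only mimics the bookkeeping: a manually set mode wins until it is reset, after which the mode follows from the inputs.

```python
from typing import Optional


class ToyModeSwitch:
    """Hypothetical stand-in that mimics only the mode bookkeeping, not the real pipeline."""

    def __init__(self) -> None:
        # None means "no manual mode, infer from the inputs".
        self.mode: Optional[str] = None

    def set_mode(self, mode: str) -> None:
        self.mode = mode

    def reset_mode(self) -> None:
        self.mode = None

    def resolve(self, prompt: Optional[str] = None, image: Optional[object] = None) -> str:
        # A manually set mode takes precedence over whatever inputs are passed.
        if self.mode is not None:
            return self.mode
        # Otherwise the mode is derived from the inputs (assumed mode names).
        if prompt is not None:
            return "text2img"
        if image is not None:
            return "img2text"
        return "joint"


switch = ToyModeSwitch()
print(switch.resolve())                      # joint
print(switch.resolve(prompt="an elephant"))  # text2img
switch.set_mode("text")
print(switch.resolve(prompt="an elephant"))  # text (the manual mode wins)
switch.reset_mode()
print(switch.resolve(image=object()))        # img2text
```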

## Text-to-image generation

The `UniDiffuserPipeline` can infer the right mode of execution from the inputs passed to it. Once a mode has been set manually (for example with `set_joint_mode()`), however, subsequent calls are executed in that mode. Now we want to generate images from text, so we set the mode accordingly.

In [4]:
pipe.set_text_to_image_mode()

In [5]:
# Text-to-image generation
prompt = "an elephant under the sea"

sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0)
t2i_image = sample.images[0]
t2i_image.save("unidiffuser_text2img_sample_image.png")

## Image-to-text generation

In [7]:
pipe.set_image_to_text_mode()

In [8]:
from diffusers.utils import load_image

# Image-to-text generation
image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
init_image = load_image(image_url).resize((512, 512))

sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
i2t_text = sample.text[0]
print(i2t_text)


## Image variation

For image variation, we follow a "round-trip" method as suggested in the paper: we first generate a caption from a given image, and then use that caption to generate a new image.

In [9]:
# Image variation can be performed with an image-to-text generation followed by a text-to-image generation:
# 1. Image-to-text generation
image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
init_image = load_image(image_url).resize((512, 512))

pipe.set_image_to_text_mode()
sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
i2t_text = sample.text[0]
print(i2t_text)

# 2. Text-to-image generation
pipe.set_text_to_image_mode()
sample = pipe(prompt=i2t_text, num_inference_steps=20, guidance_scale=8.0)
final_image = sample.images[0]
final_image.save("unidiffuser_image_variation_sample.png")

An astronaut floating in effect
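
The two steps above can be wrapped in a small helper. This is a minimal sketch: the function name `image_variation` is hypothetical, and it assumes `pipe` behaves like the `UniDiffuserPipeline` used in this notebook, relying only on the calls already shown above.

```python
from typing import Any, Tuple


def image_variation(
    pipe: Any,
    image: Any,
    num_inference_steps: int = 20,
    guidance_scale: float = 8.0,
) -> Tuple[str, Any]:
    """Caption `image`, then generate a new image from that caption."""
    # 1. Image-to-text: describe the input image.
    pipe.set_image_to_text_mode()
    caption = pipe(
        image=image,
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale,
    ).text[0]

    # 2. Text-to-image: generate a new image from the caption.
    pipe.set_text_to_image_mode()
    variation = pipe(
        prompt=caption,
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale,
    ).images[0]

    return caption, variation
```

With the pipeline and `init_image` from the cell above, `caption, variation = image_variation(pipe, init_image)` would run the same round trip. The helper leaves the pipeline in text-to-image mode.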


## Text variation

In [10]:
# Text variation can be performed with a text-to-image generation followed by an image-to-text generation:
# 1. Text-to-image generation
prompt = "an elephant under the sea"

pipe.set_text_to_image_mode()
sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0)
t2i_image = sample.images[0]
t2i_image.save("unidiffuser_text2img_sample_image.png")

# 2. Image-to-text generation
pipe.set_image_to_text_mode()
sample = pipe(image=t2i_image, num_inference_steps=20, guidance_scale=8.0)
final_prompt = sample.text[0]
print(final_prompt)

An elephant swimming in the ocean by an aquarium
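
This round trip can also be packaged as a function that clears the manually set mode when it finishes, so that later calls go back to inferring the mode from their inputs. The following is a minimal sketch, assuming `pipe` behaves like the `UniDiffuserPipeline` used above; `text_variation` is a hypothetical name, and only calls that already appear in this notebook are used.

```python
from typing import Any


def text_variation(
    pipe: Any,
    prompt: str,
    num_inference_steps: int = 20,
    guidance_scale: float = 8.0,
) -> str:
    """Generate an image from `prompt`, then caption that image."""
    try:
        # 1. Text-to-image: render the prompt.
        pipe.set_text_to_image_mode()
        image = pipe(
            prompt=prompt,
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
        ).images[0]

        # 2. Image-to-text: caption the generated image.
        pipe.set_image_to_text_mode()
        return pipe(
            image=image,
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
        ).text[0]
    finally:
        # Clear the manual mode even if one of the calls above fails.
        pipe.reset_mode()
```

Resetting in a `finally` block is a design choice: it keeps a failed call from leaving the pipeline stuck in a mode that the next cell does not expect.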
