# Diffusers Task

Let's set up the environment and download the models first.

The [DiffusionPipeline](https://huggingface.co/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline) downloads and caches all modeling, tokenization, and scheduling components.
Because the model consists of roughly 1.4 billion parameters, we strongly recommend running it on a GPU.

Start by creating an instance of [DiffusionPipeline](https://huggingface.co/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline) and specifying which pipeline checkpoint you would like to download. The cell below loads one pipeline for each task covered in this tutorial.

<!---
I2I model replaced for faster download
i2i_pipe = StableDiffusionImg2ImgPipeline.from_pretrained("nitrosocke/Ghibli-Diffusion", torch_dtype=torch.float16)
-->

In [None]:
import torch
import requests
from PIL import Image
from io import BytesIO
from diffusers import (
    DiffusionPipeline,
    StableDiffusionImg2ImgPipeline,
    StableDiffusionDepth2ImgPipeline,
)

cache_dir = "/data2/diffusion"

# Unconditional Image Generation
uncond_generator = DiffusionPipeline.from_pretrained(
    "anton-l/ddpm-butterflies-128", cache_dir=cache_dir
)

# Conditional Image Generation
cond_generator = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", cache_dir=cache_dir
)

# Image to Image Generation
i2i_pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    cache_dir=cache_dir,
)

# Depth-conditioned Image Generation
depth_pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth",
    torch_dtype=torch.float16,
    cache_dir=cache_dir,
)

## Stable Diffusion
![The Stable Diffusion architecture](https://scholar.harvard.edu/sites/scholar.harvard.edu/files/styles/os_files_xxlarge/public/binxuw/files/stablediffusion_overview.jpg?m=1708096154&itok=n2gM0Xba)

Stable Diffusion is one of the most popular open-source foundation models for image generation. The details of its architecture were explained under the name "Latent Diffusion Model" in a previous session. Stable Diffusion uses classifier-free guidance (CFG), which is closely tied to the image generation parameters introduced later in this tutorial.

SDXL and SD3 inherit a similar architecture, with improvements in prompt alignment and image quality. Stable Diffusion 3 is the latest version; it has not been released yet, but a [waitlist](https://stability.ai/stablediffusion3) is available.

In this tutorial, we will use Stable Diffusion 1.5, a fine-tuned version of Stable Diffusion.

## Unconditional Image Generation

Unconditional image generation is a relatively straightforward task. The model generates images that resemble its training data, without any additional context such as text or an image. For this task, we will use a model trained to generate a specific type of image.

In this guide, you'll use [DiffusionPipeline](https://huggingface.co/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline) for unconditional image generation with [DDPM](https://arxiv.org/abs/2006.11239) (the checkpoint you'll use generates images of butterflies).

You can use any of the 🧨 Diffusers [checkpoints](https://huggingface.co/models?library=diffusers&sort=downloads) from the Hub. If you want to use a different model, replace the "anton-l/ddpm-butterflies-128" with the model name to download and use it.

In [None]:
# uncond_generator = DiffusionPipeline.from_pretrained("anton-l/ddpm-butterflies-128", cache_dir=cache_dir)

In [None]:
generator = uncond_generator.to("cuda")

Now you can use the `generator` to generate an image:

In [None]:
image = generator().images[0]
image

## Conditional image generation

Conditional image generation allows you to generate images from a text prompt. The text is converted into embeddings which are used to condition the model to generate an image from noise.

The text is first tokenized and then encoded by a CLIP text encoder. Cross-attention layers in the denoising network use the resulting embeddings to guide image generation.
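To make the cross-attention step more concrete, here is a minimal pure-Python sketch of scaled dot-product attention in which "image" positions (queries) attend over "text token" embeddings (keys and values). It is a toy illustration, not the Stable Diffusion implementation: the vectors are made up, and the learned projection layers that a real model applies to queries, keys, and values are omitted. The names `softmax` and `cross_attention` are hypothetical helpers defined only for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query (image position) mixes the values (text tokens) by similarity to the keys."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity between this image position and every text token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted average of the text-token values.
        mixed = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(mixed)
    return outputs


# Toy data (assumed for illustration): 2 image positions, 3 text tokens.
image_queries = [[1.0, 0.0], [0.0, 1.0]]
text_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

attended = cross_attention(image_queries, text_keys, text_values)
for row in attended:
    print([round(x, 3) for x in row])
```

Because the toy values are one-hot, each output row is simply the attention weights of that image position over the three text tokens, which makes it easy to see which tokens influence which position.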

The [DiffusionPipeline](https://huggingface.co/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline) is the easiest way to use a pre-trained diffusion system for inference.

Start by creating an instance of [DiffusionPipeline](https://huggingface.co/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline) and specify which pipeline [checkpoint](https://huggingface.co/models?library=diffusers&sort=downloads) you would like to download.

In this guide, you'll use [DiffusionPipeline](https://huggingface.co/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline) for text-to-image generation with [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5):

In [None]:
generator = cond_generator.to("cuda")

Now you can use the `generator` on your text prompt:

In [None]:
image = generator("An image of a squirrel in Picasso style").images[0]

The output is by default wrapped into a [`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html?highlight=image#the-image-class) object.

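You can display the image in the notebook by evaluating it in a cell (to write it to disk instead, call `image.save()` with a file path):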

In [None]:
image

## Text-guided image-to-image generation

The [StableDiffusionImg2ImgPipeline](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/img2img#diffusers.StableDiffusionImg2ImgPipeline) lets you pass a text prompt and an initial image to condition the generation of new images.

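Noise is added to the input image as in the forward diffusion process, and the pipeline then denoises the result under the guidance of the text prompt. Concretely, the image is first encoded into the latent space, noised according to `strength`, and used as the starting point for denoising in place of pure noise. The text prompt conditions the denoising through cross-attention, as in text-to-image generation.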

Let's load the model to GPU first.

In [None]:
device = "cuda"
pipe = i2i_pipe.to(device)

Download and preprocess an initial image so you can pass it to the pipeline:

In [None]:
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"

response = requests.get(url)
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image.thumbnail((768, 768))

<Tip>

💡 `strength` is a value between 0.0 and 1.0 that controls the amount of noise added to the input image. Values that approach 1.0 allow for lots of variation but also produce images that are less semantically consistent with the input. In effect, `strength` determines how many forward diffusion steps are applied to the input image, and therefore how many denoising steps are needed to recover a clean image.

</Tip>
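As a rough sketch of how `strength` translates into step counts, the helper below splits a schedule of denoising steps into a skipped part and an executed part. The function name `img2img_schedule` is hypothetical, and the 50-step schedule is an assumption (set `num_inference_steps` explicitly if you need a specific value). The arithmetic reflects our understanding of the img2img pipeline's behavior and is not copied from the library.

```python
def img2img_schedule(num_inference_steps, strength):
    """Return (skipped_steps, denoising_steps) for an img2img run (illustrative only)."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    # Number of noise levels applied to the input image.
    noising_steps = min(int(num_inference_steps * strength), num_inference_steps)
    # Steps at the noisy end of the schedule that are skipped.
    skipped_steps = max(num_inference_steps - noising_steps, 0)
    return skipped_steps, num_inference_steps - skipped_steps


for strength in (0.0, 0.5, 0.75, 1.0):
    skipped, run = img2img_schedule(50, strength)
    print(f"strength={strength}: skip {skipped} steps, run {run} denoising steps")
```

Under these assumptions, `strength=0.75` runs 37 of 50 denoising steps, while `strength=1.0` runs all 50 and effectively ignores the input image.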

<Tip>

💡 `guidance_scale` sets the weight of the conditional prediction in classifier-free guidance (CFG). The CFG inference result is a weighted sum of the conditional and unconditional predictions, and this parameter controls the weight of the conditional one. Higher values lead to better alignment with the prompt and other conditions, usually at some cost in image quality.
    
</Tip>
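The weighted sum can be written in a few lines. The sketch below uses plain Python lists with made-up numbers in place of the model's noise predictions; `classifier_free_guidance` is a hypothetical helper name, not a Diffusers function.

```python
def classifier_free_guidance(noise_uncond, noise_cond, guidance_scale):
    """Element-wise CFG: uncond + scale * (cond - uncond)."""
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]


# Toy predictions (assumed values, for illustration only).
noise_uncond = [0.0, 1.0, -1.0]
noise_cond = [1.0, 1.0, 0.0]

print(classifier_free_guidance(noise_uncond, noise_cond, 0.0))  # unconditional only
print(classifier_free_guidance(noise_uncond, noise_cond, 1.0))  # conditional only
print(classifier_free_guidance(noise_uncond, noise_cond, 7.5))  # extrapolates past the conditional prediction
```

With a scale of 1.0 the result equals the conditional prediction. Larger values extrapolate further in the direction from the unconditional prediction to the conditional one, which is why they strengthen prompt alignment.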

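Define the prompt and run the pipeline. The `ghibli style` prefix is the trigger phrase of the `nitrosocke/Ghibli-Diffusion` checkpoint, which is fine-tuned on Ghibli-style art. With the Stable Diffusion 1.5 checkpoint loaded in this notebook, it simply acts as a style description.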

In [None]:
prompt = "ghibli style, a fantasy landscape with castles"
generator = torch.Generator(device=device).manual_seed(1024)
image = pipe(prompt=prompt, image=init_image, strength=0.75, guidance_scale=7.5, generator=generator).images[0]
image

### Sample Result
| Input                                                                           | Output                                                                                |
|---------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------|
| <img src="https://huggingface.co/datasets/YiYiXu/test-doc-assets/resolve/main/image_2_image_using_diffusers_cell_8_output_0.jpeg" width="500"/> | <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ghibli-castles.png" width="500"/> |

You can also try experimenting with a different scheduler to see how that affects the output:

In [None]:
from diffusers import LMSDiscreteScheduler

lms = LMSDiscreteScheduler.from_config(pipe.scheduler.config)
pipe.scheduler = lms
generator = torch.Generator(device=device).manual_seed(1024)
image = pipe(prompt=prompt, image=init_image, strength=0.75, guidance_scale=7.5, generator=generator).images[0]
image

### Sample Result
| Input                                                                           | Output                                                                                                                                |
|---------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------|
| <img src="https://huggingface.co/datasets/YiYiXu/test-doc-assets/resolve/main/image_2_image_using_diffusers_cell_8_output_0.jpeg" width="500"/> | <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lms-ghibli.png" width="500"/> |

## Text-guided depth-to-image generation

The [StableDiffusionDepth2ImgPipeline](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/depth2img#diffusers.StableDiffusionDepth2ImgPipeline) lets you pass a text prompt and an initial image to condition the generation of new images. In addition, you can also pass a `depth_map` to preserve the image structure. If no `depth_map` is provided, the pipeline automatically predicts the depth via an integrated [depth-estimation model](https://github.com/isl-org/MiDaS).

The [StableDiffusionDepth2ImgPipeline](https://huggingface.co/docs/diffusers/main/en/api/pipelines/stable_diffusion/depth2img#diffusers.StableDiffusionDepth2ImgPipeline) instance was already created in the setup cell, so you only need to move it to the GPU:

In [None]:
pipe = depth_pipe.to("cuda")

Now pass your prompt to the pipeline.

<Tip>

💡 `negative_prompt` steers the generation away from the concepts it describes. In the CFG formulation, the unconditional prediction that is normally subtracted is replaced by the prediction for the negative prompt, so the result is pushed away from the negative prompt instead of away from the empty prompt.
    
</Tip>
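The effect of swapping the baseline can be seen with toy numbers. In the sketch below, `guided_noise` is a hypothetical helper that applies the CFG combination to a conditional prediction and a baseline prediction, where the baseline comes either from the empty prompt or from the negative prompt. All values are made up for illustration.

```python
def guided_noise(noise_cond, noise_baseline, guidance_scale):
    """CFG with a configurable baseline: baseline + scale * (cond - baseline)."""
    return [b + guidance_scale * (c - b) for c, b in zip(noise_cond, noise_baseline)]


# Toy predictions (assumed values): the second component stands for an unwanted concept.
noise_prompt = [1.0, 0.0]
noise_empty = [0.0, 0.0]     # prediction for the empty prompt
noise_negative = [0.0, 1.0]  # prediction for the negative prompt

print(guided_noise(noise_prompt, noise_empty, 2.0))     # standard CFG
print(guided_noise(noise_prompt, noise_negative, 2.0))  # CFG with a negative prompt
```

In this toy case both variants amplify the first component equally, but only the negative-prompt variant drives the second component below zero, that is, away from the unwanted concept.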

In [None]:
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
init_image = Image.open(requests.get(url, stream=True).raw)
init_image

In [None]:
prompt = "two tigers"
n_prompt = "bad, deformed, ugly, bad anatomy"
image = pipe(prompt=prompt, image=init_image, negative_prompt=n_prompt, strength=0.7).images[0]
image

### Sample Result

| Input                                                                           | Output                                                                                                                                |
|---------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------|
| <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/coco-cats.png" width="500"/> | <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/depth2img-tigers.png" width="500"/> |

### Try it yourself!
You can try your own image URL and prompts.

Try finding a good balance between image quality and prompt alignment with different parameters. \
You can experiment with different prompts, negative prompts, guidance scale, and noise strength.

In [None]:
url = "http://images.cocodataset.org/test-stuff2017/000000000509.jpg"
init_image = Image.open(requests.get(url, stream=True).raw)
init_image

In [None]:
prompt = "chemistry laboratory"
n_prompt = "window"
# The CFG weight is passed as `guidance_scale`; `guidance` is not a pipeline parameter.
image = pipe(prompt=prompt, image=init_image, negative_prompt=n_prompt, guidance_scale=10, strength=0.95).images[0]
image