<a href="https://colab.research.google.com/github/HemantCopilot/g_test/blob/dev/Stable_Diffusion_with_SwinIR_plus_Real_ESRGAN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Stable Diffusion** 🎨 
*...using `🧨diffusers`*

Stable Diffusion is a text-to-image latent diffusion model created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/) and [LAION](https://laion.ai/). It's trained on 512x512 images from a subset of the [LAION-5B](https://laion.ai/blog/laion-5b/) database. This model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight and runs on a GPU with at least 10GB VRAM.
See the [model card](https://huggingface.co/CompVis/stable-diffusion) for more information.

This Colab notebook shows how to use Stable Diffusion with the 🤗 Hugging Face [🧨 Diffusers library](https://github.com/huggingface/diffusers). 

Let's get started!

### Setup

First, please make sure you are using a GPU runtime to run this notebook, so inference is much faster. If the following command fails, use the `Runtime` menu above and select `Change runtime type`.

In [None]:
!nvidia-smi


Next, install `diffusers==0.3.0` as well as `scipy`, `ftfy` and `transformers`.

In [None]:
!pip install diffusers==0.3.0
!pip install transformers scipy ftfy
!pip install "ipywidgets>=7,<8"

You also need to accept the model license before downloading or using the weights. In this notebook we'll use model version `v1-4`, so you'll need to visit [its card](https://huggingface.co/CompVis/stable-diffusion-v1-4), read the license and tick the checkbox if you agree.

You have to be a registered user in 🤗 Hugging Face Hub, and you'll also need to use an access token for the code to work. For more information on access tokens, please refer to [this section of the documentation](https://huggingface.co/docs/hub/security-tokens).

Because Google Colab has disabled external widgets, we need to enable them explicitly. Run the following cell to be able to use `notebook_login`.

In [None]:
from google.colab import output
output.enable_custom_widget_manager()

Now you can log in with your user token.

In [None]:
from huggingface_hub import notebook_login
# Paste your own access token when prompted; never store a token in the notebook itself.
notebook_login()

### Stable Diffusion Pipeline

`StableDiffusionPipeline` is an end-to-end inference pipeline that you can use to generate images from text with just a few lines of code.

First, we load the pre-trained weights of all components of the model.

In addition to the model id [CompVis/stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4), we're also passing a specific `revision`, `torch_dtype` and `use_auth_token` to the `from_pretrained` method.
`use_auth_token` is necessary to verify that you have indeed accepted the model's license.

To make sure Stable Diffusion runs on every free Google Colab instance, we load the weights from the half-precision branch [`fp16`](https://huggingface.co/CompVis/stable-diffusion-v1-4/tree/fp16) and tell `diffusers` to expect float16 weights by passing `torch_dtype=torch.float16`.

For the highest possible precision, remove `revision="fp16"` and `torch_dtype=torch.float16`, at the cost of higher memory usage.
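To get a rough feel for that memory trade-off, the sketch below estimates the space needed just to hold the weights of the U-Net and text encoder, using the parameter counts quoted in the introduction. It is a back-of-the-envelope estimate under stated assumptions: it ignores the VAE, activations and any framework overhead, so real usage will be higher.

```python
# Parameter counts quoted in the introduction of this notebook.
UNET_PARAMS = 860_000_000
TEXT_ENCODER_PARAMS = 123_000_000

# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the memory needed to hold `num_params` weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


total_params = UNET_PARAMS + TEXT_ENCODER_PARAMS
for dtype in ("float32", "float16"):
    print(f"{dtype}: {weight_memory_gib(total_params, dtype):.2f} GiB for weights alone")
```

Halving the bytes per parameter halves the weight footprint, which is the saving the `fp16` branch provides.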

In [None]:
import cv2
import matplotlib.pyplot as plt
import os
import glob
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = "max_split_size_mb:100"
import torch
import numpy as np
torch.cuda.empty_cache()

import gc
from diffusers import StableDiffusionPipeline,StableDiffusionImg2ImgPipeline

# make sure you're logged in with `huggingface-cli login`
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", revision="fp16", torch_dtype=torch.float16, use_auth_token=True)  
# img2imgpipe = StableDiffusionImg2ImgPipeline.from_pretrained("CompVis/stable-diffusion-v1-4",revision="fp16", torch_dtype=torch.float16,use_auth_token=True)

In [None]:

!git clone https://github.com/TencentARC/GFPGAN.git
os.chdir("/content/GFPGAN")


!pip install basicsr

!pip install facexlib
!pip install -r requirements.txt
!python setup.py develop
os.chdir("/content/")

#realesr
!git clone https://github.com/xinntao/Real-ESRGAN.git
# !pip install realesrgan
!wget https://github.com/xinntao/Real-ESRGAN/releases/download/v0.1.0/RealESRGAN_x4plus.pth -P /content/Real-ESRGAN/experiments/pretrained_models
os.chdir("/content/Real-ESRGAN")
!pip install gfpgan
!pip install -r requirements.txt
!python setup.py develop
!python setup.py install
os.chdir("/content/")


!wget https://github.com/TencentARC/GFPGAN/releases/download/v1.3.0/GFPGANv1.3.pth -P /content/GFPGAN/experiments/pretrained_models
!git clone https://github.com/JingyunLiang/SwinIR.git
!pip install timm
!wget https://github.com/JingyunLiang/SwinIR/releases/download/v0.0/003_realSR_BSRGAN_DFO_s64w8_SwinIR-M_x4_GAN.pth -P experiments/pretrained_models
!wget https://github.com/JingyunLiang/SwinIR/releases/download/v0.0/003_realSR_BSRGAN_DFOWMFC_s64w8_SwinIR-L_x4_GAN.pth -P experiments/pretrained_models




In [None]:
os.chdir("/content/Real-ESRGAN")
from realesrgan import RealESRGANer
from basicsr.archs.rrdbnet_arch import RRDBNet
esrdevice='cuda'
with torch.no_grad():
  RealESRUpScale = RealESRGANer(model_path="/content/Real-ESRGAN/experiments/pretrained_models/RealESRGAN_x4plus.pth",scale=4,device=esrdevice,model= RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64, num_block=23, num_grow_ch=32, scale=4))
os.chdir("/content/")

Next, let's move the pipeline to GPU to have faster inference.

In [None]:
pipe = pipe.to("cuda")

In [None]:

from google.colab import files
import shutil

upload_folder = 'upload'
result_folder = 'results'
final_folder = 'final'
version_folder = 'version'

if os.path.isdir(upload_folder):
    shutil.rmtree(upload_folder)
if os.path.isdir(result_folder):
    shutil.rmtree(result_folder)
if os.path.isdir(final_folder):
    shutil.rmtree(final_folder)
if os.path.isdir(version_folder):
    shutil.rmtree(version_folder)
os.mkdir(upload_folder)
os.mkdir(result_folder)
os.mkdir(final_folder)
os.mkdir(version_folder)



Let's first write a helper function to display a grid of images. Just run the following cell to create the `image_grid` function, or expand the code if you are interested in how it's done.

In [None]:
from PIL import Image

def image_grid(imgs, rows, cols):
    assert len(imgs) == rows*cols

    w, h = imgs[0].size
    grid = Image.new('RGB', size=(cols*w, rows*h))
    grid_w, grid_h = grid.size
    
    for i, img in enumerate(imgs):
        grid.paste(img, box=(i%cols*w, i//cols*h))
    return grid
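The paste offsets used by `image_grid` follow a row-major layout. The sketch below isolates just that arithmetic so it can be checked without loading any images; `grid_boxes` is a hypothetical helper introduced here for illustration and is not part of PIL or `diffusers`.

```python
def grid_boxes(n_images, rows, cols, w, h):
    """Return the top-left (x, y) paste offset of each image, in row-major order."""
    assert n_images == rows * cols
    # Column index is i % cols, row index is i // cols.
    return [((i % cols) * w, (i // cols) * h) for i in range(n_images)]


# Six 512x512 images arranged in 2 rows of 3.
boxes = grid_boxes(n_images=6, rows=2, cols=3, w=512, h=512)
print(boxes)
# [(0, 0), (512, 0), (1024, 0), (0, 512), (512, 512), (1024, 512)]
```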

# Now we can generate a grid image by running the pipeline with a list of 3 identical prompts.

In [None]:
import random
from torch import autocast
torch.cuda.empty_cache()
gc.collect()
num_images = 3
prompt = ["nature landscape for wallpaper, colourfull trees, hd, 4k, cinematic"] * num_images
rn = random.randint(0,10000000)

# rn=1624589

print(rn)
with torch.no_grad():
  generator = torch.Generator("cuda").manual_seed(rn)
  with autocast("cuda"):
    images = pipe(prompt,generator=generator, num_inference_steps=50).images
    for i,j in enumerate(images):
      j.save("./upload/img{}-{}".format(i,rn)+'.png')
images_2up = '/content/upload'
torch.cuda.empty_cache()
grid = image_grid(images, rows=1, cols=num_images)
grid



# Refine the same images (generated in the previous step) with 100 steps. This is optional; you can jump directly to the next cell.

In [None]:
torch.cuda.empty_cache()
gc.collect()
generator = torch.Generator("cuda").manual_seed(rn)

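# Clear the previous outputs (named `upload_dir` to avoid shadowing the built-in `dir`).
upload_dir = '/content/upload'
if os.path.isdir(upload_dir):
  for f in os.listdir(upload_dir):
    os.remove(os.path.join(upload_dir, f))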

with autocast("cuda"):
  images = pipe(prompt,generator=generator, num_inference_steps=100).images
  for i,j in enumerate(images):
    j.save("./upload/img{}".format(i)+'.png')
images_2up = '/content/upload'
torch.cuda.empty_cache()
grid = image_grid(images, rows=1, cols=num_images)
grid



# Refine the same images (generated in the previous step) with 200 steps.

In [None]:
torch.cuda.empty_cache()
gc.collect()
generator = torch.Generator("cuda").manual_seed(rn)

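# Clear the previous outputs (named `upload_dir` to avoid shadowing the built-in `dir`).
upload_dir = '/content/upload'
if os.path.isdir(upload_dir):
  for f in os.listdir(upload_dir):
    os.remove(os.path.join(upload_dir, f))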


with autocast("cuda"):
  images = pipe(prompt,generator=generator, num_inference_steps=200).images
  for i,j in enumerate(images):
    j.save("./upload/img{}".format(i)+'.png')
torch.cuda.empty_cache()
gc.collect()
images_2up = '/content/upload'
grid = image_grid(images, rows=1, cols=num_images)
grid


# **Run This Cell Only if Face is not Restored Completely.**

In [None]:
torch.cuda.empty_cache()
gc.collect()
# os.chdir("/content/GFPGAN")
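# Clear earlier GFPGAN outputs before restoring again.
restored_dir = '/content/results/restored_imgs'
if os.path.isdir(restored_dir):
  for f in os.listdir(restored_dir):
    os.remove(os.path.join(restored_dir, f))
# GFPGAN runs in a separate process, so wrapping this call in torch.no_grad() would have no effect.
!python GFPGAN/inference_gfpgan.py -i /content/upload -o /content/results -s 2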
# os.chdir("/content")


def displayGFP(img1, img2):
  fig = plt.figure(figsize=(25, 10))
  ax1 = fig.add_subplot(1, 2, 1) 
  plt.title('Input image', fontsize=16)
  ax1.axis('off')
  ax2 = fig.add_subplot(1, 2, 2)
  plt.title('GFPGAN_FaceRestore', fontsize=16)
  ax2.axis('off')
  ax1.imshow(img1)
  ax2.imshow(img2)
def imread(img_path):
  img = cv2.imread(img_path)
  img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
  return img

torch.cuda.empty_cache()
gc.collect()
# display each image in the upload folder

images_2up = '/content/results/restored_imgs'

input_folder = '/content/upload'
result_folder = '/content/results/restored_imgs'
input_list = sorted(glob.glob(os.path.join(input_folder, '*.png')))
output_list = sorted(glob.glob(os.path.join(result_folder, '*.png')))
for input_path, output_path in zip(input_list, output_list):
  img_input = imread(input_path)
  img_output = imread(output_path)
  displayGFP(img_input, img_output)



# **Real-ESRGAN Upscale**

In [None]:
torch.cuda.empty_cache()
gc.collect()

imgs_path = sorted(glob.glob(os.path.join(images_2up, '*.png')))
imgs = []
for path in imgs_path:
  imgs.append(Image.open(path).convert('RGB'))


def displayESRGAN(img1, img2):
  fig = plt.figure(figsize=(25, 10))
  ax1 = fig.add_subplot(1, 2, 1) 
  plt.title('Input image', fontsize=16)
  ax1.axis('off')
  ax2 = fig.add_subplot(1, 2, 2)
  plt.title('ESRGAN', fontsize=16)
  ax2.axis('off')
  ax1.imshow(img1)
  ax2.imshow(img2)
def imread(img_path):
  img = cv2.imread(img_path)
  img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
  return img

import cv2
for i,j in enumerate(imgs):
  with torch.no_grad():
    output, _ = RealESRUpScale.enhance(np.array(j), outscale=4)
  output = Image.fromarray(output)
  output.save("final/fin_img_upscaled_{}.png".format(i))
torch.cuda.empty_cache()
gc.collect()

input_folder = '/content/upload'
result_folder = '/content/final'
input_list = sorted(glob.glob(os.path.join(input_folder, '*.png')))
output_list = sorted(glob.glob(os.path.join(result_folder, '*.png')))
for input_path, output_path in zip(input_list, output_list):
  img_input = imread(input_path)
  img_output = imread(output_path)
  displayESRGAN(img_input, img_output)



# **(Optional) Denoise Using SwinIR and Upscale**

In [None]:
torch.cuda.empty_cache()
gc.collect()
# os.chdir("/content/SwinIR")
torch.cuda.empty_cache()
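# The medium (M) model used below writes to results/swinir_real_sr_x4;
# the `_large` suffix only applies when SwinIR is run with --large_model.
swinir_dir = '/content/results/swinir_real_sr_x4'
if os.path.isdir(swinir_dir):
  for f in os.listdir(swinir_dir):
    os.remove(os.path.join(swinir_dir, f))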

if '/content/results/restored_imgs' in images_2up:
  !python SwinIR/main_test_swinir.py --task real_sr --model_path /content/experiments/pretrained_models/003_realSR_BSRGAN_DFO_s64w8_SwinIR-M_x4_GAN.pth --folder_lq results/restored_imgs --scale 4
else:
  !python SwinIR/main_test_swinir.py --task real_sr --model_path /content/experiments/pretrained_models/003_realSR_BSRGAN_DFO_s64w8_SwinIR-M_x4_GAN.pth --folder_lq upload --scale 4
  


def displaySwinIR(img1, img2):
  fig = plt.figure(figsize=(25, 10))
  ax1 = fig.add_subplot(1, 2, 1) 
  plt.title('Input image', fontsize=16)
  ax1.axis('off')
  ax2 = fig.add_subplot(1, 2, 2)
  plt.title('SwinIR', fontsize=16)
  ax2.axis('off')
  ax1.imshow(img1)
  ax2.imshow(img2)
def imread(img_path):
  img = cv2.imread(img_path)
  img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
  return img

torch.cuda.empty_cache()
gc.collect()
# display each image in the upload folder
images_2up = '/content/results/swinir_real_sr_x4'

input_folder = '/content/upload'
result_folder = '/content/results/swinir_real_sr_x4'
input_list = sorted(glob.glob(os.path.join(input_folder, '*.png')))
output_list = sorted(glob.glob(os.path.join(result_folder, '*.png')))
for input_path, output_path in zip(input_list, output_list):
  img_input = imread(input_path)
  img_output = imread(output_path)
  displaySwinIR(img_input, img_output)



# **(Optional) Compare ALL Generated images**

In [None]:
def imread(img_path):
  img = cv2.imread(img_path)
  img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
  return img
def compare_all(img_input,img_swin,img_esr):
  fig,axs = plt.subplots(len(img_input),3,figsize=(35,35))
  axs[0,0].set_title("Input", fontsize=16)
  axs[0,1].set_title("Swin", fontsize=16)
  axs[0,2].set_title("ESRGAN", fontsize=16)
  for i in range(len(img_input)):
    axs[i,0].imshow(imread(img_input[i]))
    axs[i,1].imshow(imread(img_swin[i]))
    axs[i,2].imshow(imread(img_esr[i]))
  for x in axs.flatten():
    x.axis("off")

input_folder = '/content/upload'
swin_folder = '/content/results/swinir_real_sr_x4'
esr_folder = '/content/final'
input_list = sorted(glob.glob(os.path.join(input_folder, '*.png')))
swin_list = sorted(glob.glob(os.path.join(swin_folder, '*.png')))
esr_list = sorted(glob.glob(os.path.join(esr_folder, '*.png')))
print(len(input_list))
print(len(swin_list))
print(len(esr_list))

compare_all(input_list,swin_list,esr_list)

## 2. What is Stable Diffusion

Now, let's go into the theoretical part of Stable Diffusion 👩‍🎓.

Stable Diffusion is based on a particular type of diffusion model called **Latent Diffusion**, proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752).



General diffusion models are machine learning systems that are trained to *denoise* random Gaussian noise step by step, to get to a sample of interest, such as an *image*. For a more detailed overview of how they work, check [this colab](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/diffusers_intro.ipynb).

Diffusion models have been shown to achieve state-of-the-art results for generating image data. But one downside of diffusion models is that the reverse denoising process is slow. In addition, these models consume a lot of memory because they operate in pixel space, which becomes unreasonably expensive when generating high-resolution images. Therefore, it is challenging both to train these models and to use them for inference.



<br>

Latent diffusion can reduce the memory and compute complexity by applying the diffusion process over a lower dimensional _latent_ space, instead of using the actual pixel space. This is the key difference between standard diffusion and latent diffusion models: **in latent diffusion the model is trained to generate latent (compressed) representations of the images.** 

There are three main components in latent diffusion.

1. An autoencoder (VAE).
2. A [U-Net](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/diffusers_intro.ipynb#scrollTo=wW8o1Wp0zRkq).
3. A text-encoder, *e.g.* [CLIP's Text Encoder](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel).

**1. The autoencoder (VAE)**

The VAE model has two parts, an encoder and a decoder. The encoder is used to convert the image into a low dimensional latent representation, which will serve as the input to the *U-Net* model.
The decoder, conversely, transforms the latent representation back into an image.

During latent diffusion _training_, the encoder is used to get the latent representations (_latents_) of the images for the forward diffusion process, which applies more and more noise at each step. During _inference_, the denoised latents generated by the reverse diffusion process are converted back into images using the VAE decoder. As we will see, during inference we **only need the VAE decoder**.
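The forward diffusion process mentioned above can be sketched in a few lines. The example below is a minimal illustration, assuming a plain linear beta schedule with illustrative values (not necessarily the schedule Stable Diffusion was trained with) and a short list of floats standing in for a flattened latent. It uses the standard closed form $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (illustrative values, an assumption)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alphas_cumprod(betas):
    """Cumulative product of (1 - beta_t): the share of signal variance left at step t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def add_noise(x0, t, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    signal = math.sqrt(alpha_bar[t])
    sigma = math.sqrt(1.0 - alpha_bar[t])
    return [signal * v + sigma * rng.gauss(0.0, 1.0) for v in x0]


alpha_bar = alphas_cumprod(linear_betas(1000))
rng = random.Random(0)
latent = [1.0, -0.5, 0.25, 0.0]  # stand-in for a flattened latent

for t in (0, 499, 999):
    noisy = add_noise(latent, t, alpha_bar, rng)
    print(f"t={t:4d}  signal scale={math.sqrt(alpha_bar[t]):.4f}  x_t={noisy}")
```

The signal scale shrinks toward zero as `t` grows, so late steps are almost pure noise.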

**2. The U-Net**

The U-Net has an encoder part and a decoder part, both composed of ResNet blocks.
The encoder compresses an image representation into a lower-resolution representation, and the decoder decodes it back to the original, higher resolution, with the aim of producing a less noisy result.
More specifically, the U-Net output predicts the noise residual, which can be used to compute the predicted denoised image representation.

To prevent the U-Net from losing important information while downsampling, shortcut connections are usually added between the downsampling ResNets of the encoder and the upsampling ResNets of the decoder.
Additionally, the Stable Diffusion U-Net can condition its output on text embeddings via cross-attention layers. The cross-attention layers are added to both the encoder and decoder parts of the U-Net, usually between ResNet blocks.
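To make the cross-attention idea concrete, here is a minimal pure-Python sketch of scaled dot-product attention in which the queries come from image positions and the keys and values come from text tokens. It is a simplification under stated assumptions: real cross-attention layers first apply learned projection matrices to produce queries, keys and values, and the tiny 2-dimensional vectors below are made up for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query mixes the values, weighted by its similarity to the keys."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs, weights


# Two image positions attend over three text tokens (toy numbers).
image_queries = [[1.0, 0.0], [0.0, 1.0]]
text_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

out, attn = cross_attention(image_queries, text_keys, text_values)
for row_out, row_attn in zip(out, attn):
    print([round(x, 3) for x in row_attn], "->", [round(x, 3) for x in row_out])
```

Each image position ends up with a text-dependent feature vector, which is how the prompt can steer the denoising.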

**3. The Text-encoder**

The text-encoder is responsible for transforming the input prompt, *e.g.* "An astronaut riding a horse", into an embedding space that can be understood by the U-Net. It is usually a simple *transformer-based* encoder that maps a sequence of input tokens to a sequence of latent text-embeddings.

Inspired by [Imagen](https://imagen.research.google/), Stable Diffusion does **not** train the text-encoder during training and simply uses CLIP's already-trained text encoder, [CLIPTextModel](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel).
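Before the encoder sees a prompt, the prompt is turned into a fixed-length sequence of token ids. The toy sketch below shows only that pad-or-truncate step. Everything in it is an assumption for illustration: the whitespace tokenizer, the growing vocabulary and the special-token ids are placeholders, not CLIP's real tokenizer. The length of 77 is the sequence length this notebook quotes later for the text embeddings.

```python
MAX_LENGTH = 77          # sequence length quoted later in this notebook
BOS_ID, EOS_ID = 0, 1    # placeholder ids, not CLIP's real special-token ids


def toy_tokenize(prompt, vocab):
    """Map whitespace-separated words to ids, then pad or truncate to MAX_LENGTH."""
    # Unknown words get the next free id; ids 0 and 1 are reserved above.
    word_ids = [vocab.setdefault(w, len(vocab) + 2) for w in prompt.lower().split()]
    ids = [BOS_ID] + word_ids
    ids = ids[: MAX_LENGTH - 1] + [EOS_ID]           # truncate, always end with EOS
    return ids + [EOS_ID] * (MAX_LENGTH - len(ids))  # pad (here: with EOS, an assumption)


vocab = {}
ids = toy_tokenize("An astronaut riding a horse", vocab)
print(len(ids), ids[:8])
```

A real text encoder would then map each of the 77 ids to an embedding vector, giving one row per token.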

**Why is latent diffusion fast and efficient?**

Because the U-Net of a latent diffusion model operates on a low-dimensional space, it needs far less memory and compute than a pixel-space diffusion model. For example, the autoencoder used in Stable Diffusion has a spatial reduction factor of 8 and produces 4 latent channels. An image of shape `(3, 512, 512)` therefore becomes `(4, 64, 64)` in latent space, which has `8 × 8 = 64` times fewer spatial positions and 48 times fewer values overall.

This is why it's possible to generate `512 × 512` images so quickly, even on 16GB Colab GPUs!
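The compression arithmetic can be checked directly. This short sketch assumes the 4-channel latent used by the Stable Diffusion v1 autoencoder and the reduction factor of 8 stated above.

```python
from math import prod

pixel_shape = (3, 512, 512)   # (channels, height, width) of the input image
downsample = 8                # spatial reduction factor of the autoencoder
latent_channels = 4           # assumption: Stable Diffusion v1 VAE latent channels

latent_shape = (
    latent_channels,
    pixel_shape[1] // downsample,
    pixel_shape[2] // downsample,
)

# Ratio of spatial positions (height * width) and of total stored values.
spatial_ratio = (pixel_shape[1] * pixel_shape[2]) // (latent_shape[1] * latent_shape[2])
element_ratio = prod(pixel_shape) / prod(latent_shape)

print(latent_shape, spatial_ratio, element_ratio)
# (4, 64, 64) 64 48.0
```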

**Stable Diffusion during inference**

Putting it all together, let's now take a closer look at how the model works in inference by illustrating the logical flow.


<p align="left">
<img src="https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/stable_diffusion.png" alt="sd-pipeline" width="500"/>
</p>

The Stable Diffusion model takes both a latent seed and a text prompt as input. The latent seed is used to generate random latent image representations of size $64 \times 64$, whereas the text prompt is transformed into text embeddings of size $77 \times 768$ via CLIP's text encoder.
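The role of the latent seed can be illustrated without a GPU. The sketch below uses Python's standard `random` module as a stand-in for the seeded `torch.Generator` used earlier in this notebook; `sample_latents` is a hypothetical helper, and the 4-channel shape is an assumption based on the Stable Diffusion v1 autoencoder.

```python
import random


def sample_latents(seed, channels=4, height=64, width=64):
    """Draw a (channels, height, width) block of standard Gaussian noise from a seed."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]
        for _ in range(channels)
    ]


a = sample_latents(seed=1624589)
b = sample_latents(seed=1624589)
c = sample_latents(seed=42)

print(len(a), len(a[0]), len(a[0][0]))  # shape of the latent block
print(a == b)  # same seed, same starting noise
print(a == c)  # different seed, different starting noise
```

Because the denoising loop starts from this noise, reusing a seed with the same prompt and settings reproduces the same image.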

Next the U-Net iteratively *denoises* the random latent image representations while being conditioned on the text embeddings. The output of the U-Net, being the noise residual, is used to compute a denoised latent image representation via a scheduler algorithm. Many different scheduler algorithms can be used for this computation, each having its pros and cons. For Stable Diffusion, we recommend using one of:

- [PNDM scheduler](https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_pndm.py) (used by default)
- [DDIM scheduler](https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_ddim.py)
- [K-LMS scheduler](https://github.com/huggingface/diffusers/blob/main/src/diffusers/schedulers/scheduling_lms_discrete.py)

The theory of how scheduler algorithms work is out of scope for this notebook. In short, they compute the predicted denoised image representation from the previous noisy representation and the predicted noise residual.
For more information, we recommend looking into [Elucidating the Design Space of Diffusion-Based Generative Models](https://arxiv.org/abs/2206.00364).
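As a concrete instance of such an update, the sketch below implements one deterministic DDIM step (the variant with no added noise) on plain Python lists. It is meant as an illustration of the idea, not as the code of the `diffusers` schedulers, and it is not the PNDM update used by default. The `alpha_bar` values and the toy vectors are assumptions chosen for the demo.

```python
import math


def predict_x0(x_t, eps, alpha_bar_t):
    """Estimate the clean sample from the noisy sample and the predicted noise."""
    scale = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [(x - sigma * e) / scale for x, e in zip(x_t, eps)]


def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update: move from noise level t to the previous, less noisy level."""
    x0 = predict_x0(x_t, eps, alpha_bar_t)
    scale = math.sqrt(alpha_bar_prev)
    sigma = math.sqrt(1.0 - alpha_bar_prev)
    return [scale * a + sigma * e for a, e in zip(x0, eps)]


# Demo: build a noisy sample from a known clean sample and known noise.
x0_true = [1.0, -2.0]
noise = [0.3, -0.1]
alpha_bar_t = 0.5
x_t = [
    math.sqrt(alpha_bar_t) * a + math.sqrt(1.0 - alpha_bar_t) * e
    for a, e in zip(x0_true, noise)
]

# With a perfect noise prediction, stepping to alpha_bar_prev = 1.0 recovers x0_true.
x_prev = ddim_step(x_t, noise, alpha_bar_t, alpha_bar_prev=1.0)
print(x_prev)
```

In the real pipeline the noise is the U-Net's prediction rather than a known value, so each step only moves part of the way toward a clean latent.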

The *denoising* process is repeated *ca.* 50 times to step-by-step retrieve better latent image representations.
Once complete, the latent image representation is decoded by the decoder part of the variational auto encoder.



After this brief introduction to Latent and Stable Diffusion, let's see how to make advanced use of 🤗 Hugging Face Diffusers!