<br>
<p style='text-align: left;'><span style="color: #353535; font-family: Tahoma; font-size: 2.6em; font-weight: 600;">Generate Stunning Artworks with CLIP Guided Diffusion + SwinIR Super Resolution</span></p>

<p style='text-align: left;'><span style="color: #767676; font-family: Arial; font-size: 1.4em; font-weight: 400;">Create beautiful artworks by fine-tuning diffusion models on custom datasets and performing CLIP guided text-conditional sampling, followed by Swin transformer-based super-resolution</span></p>

<br>
<br>

![](https://i.ibb.co/H2sHF0T/cover1-02.jpg)

<br>
<br>

<span style="background-color: #EFFFCD;">📌 Throughout this notebook, we will be using a codebase I have put together:</span><br>
<p style='text-align: left;'>
    <span>
    &emsp;<a href="https://github.com/sreevishnu-damodaran/clip-diffusion-art"><img alt="github.com/sreevishnu-damodaran/clip-diffusion-art" src="https://img.shields.io/badge/sreevishnu--damodaran%2Fclip--diffusion--art-2B2E3A?style=flat&logo=github&logoColor=white" width="300">
        </a>
    </span>
</p>
<br>

<span style="background-color: #EFFFCD;">📌 Dataset with public domain artworks created for this project:</span><br>
<span>
&emsp;<a href="https://www.kaggle.com/sreevishnudamodaran/artworks-in-public-domain">kaggle.com/sreevishnudamodaran/artworks-in-public-domain
</a>
</span><br><br>

<span style="color: #674177; font-size:1.6em; font-family: Segoe UI; font-weight: 600; letter-spacing: 0px;">Overview</span>

***Topics covered in this notebook:***

&emsp;✔️&ensp;Fine-tune Diffusion Models on Custom Datasets<br>
&emsp;✔️&ensp;Zero-shot CLIP Guided Diffusion Sampling<br>
&emsp;✔️&ensp;Denoising Diffusion Implicit Model Sampling<br>
&emsp;✔️&ensp;SwinIR Super-resolution Using Shifted Window Transformer<br>
&emsp;✔️&ensp;A Brief Overview of Diffusion models, CLIP & SwinIR<br>
&emsp;✔️&ensp;Experiment Tracking & Interactive Visualizations with Weights & Biases<br>

<br>

Let's explore the creative capabilities of deep generative models and take a deep dive into how we can make use of these models, in combination with generalized vision-language models, to create beautiful artworks of various styles from natural language text prompts.

We will look at how to fine-tune diffusion probabilistic models on a custom dataset created from artworks in the public domain. During the sampling process to generate images, we will use a <span style="color: #DC143C;">vision-language CLIP model to steer or guide</span> this fine-tuned model with <span style="color: #DC143C;">natural language prompts</span> without any extra training or supervision. Afterwards, the generated images will be upscaled with a Swin transformer-based super-resolution model, which turns the low-resolution output into a high-resolution image by generating finer, realistic details and enhancing visual quality. We will also briefly cover the concepts behind the inner workings of each of these models, along with more details on how they are integrated.

***Here is a general block diagram showing the various components.***

<br>

<div style="text-align:center">
    <img src="https://i.ibb.co/3sTH15m/diagram-final3.jpg" alt="drawing" width="700" style="padding: 10px;"/>
    <p style='text-align: center;'>
        <span style="color: #353535; font-size:0.9em; font-family: Verdana; font-weight: 300; letter-spacing: 0px;">General block diagram</span>
    </p>
</div>
<br>


***Here are some examples of the artwork generation process from text prompts, using the final fine-tuned model with CLIP guidance:***

<div style="text-align:center">
    <img src="https://i.ibb.co/DpTYvK3/job18-1.gif" alt="drawing" width="250" style="padding: 20px;"/>
    <img src="https://i.ibb.co/8gsR0w1/2.gif" alt="drawing" width="250" style="padding: 20px;"/>
    <p style='text-align: center;'>
        <span style="color: #353535; font-size:0.9em; font-family: Verdana; font-weight: 300; letter-spacing: 0px;">Generated samples for prompts <span style="color: #DC143C;">"vibrant watercolor painting of a flower, artstation HQ"</span> and <span style="color: #DC143C;">"artstation HQ, photorealistic depiction of an alien city".</span></span>
    </p>
</div>
<br>
<span style="font-size:1.2em;">
    <a href="https://wandb.ai/sreevishnu-damodaran/clip_diffusion_art/reports/Results-CLIP-Guided-Diffusion-SwinIR--VmlldzoxNjUxNTMz">🔎 Visit this report for more generated artworks ➔ <br><br>
    </a>
    <div style="text-align:center">
        <a href="https://wandb.ai/sreevishnu-damodaran/clip_diffusion_art/reports/Results-CLIP-Guided-Diffusion-SwinIR--VmlldzoxNjUxNTMz">
            <img alt="Report gif" src="https://i.ibb.co/GHXJhyX/report-opti.gif" width="700">
        </a>
    </div>
</span>
<br>
<br>

<span style="color: #674177; font-size:1.6em; font-family: Segoe UI; font-weight: 600; letter-spacing: 0px;">Diffusion Models</span>

Over the years, <span style="color: #DC143C;">deep generative models</span> have evolved to model complex high-dimensional probability distributions across a range of perceptual and predictive tasks. This was made possible by well-formulated neural network architectures and parametrization techniques. For some time, Generative Adversarial Networks (GANs), Variational Auto-Encoders (VAEs) and Flow-based models were the front runners in this area. Despite the many milestones achieved with these models, they suffer from a range of shortcomings in terms of training stability, lack of diversity, and high sensitivity to changes in hyper-parameters.

Diffusion probabilistic models, a new family of models, were introduced by [Sohl-Dickstein et al.](http://proceedings.mlr.press/v37/sohl-dickstein15.html) in 2015 to try to overcome these weaknesses, or rather to explore other ways of solving generative tasks. They were inspired by non-equilibrium thermodynamics. Several papers and improvements later, they have now achieved competitive log-likelihoods and state-of-the-art results across a wide variety of tasks, while offering better training stability and improved diversity in image synthesis than their counterparts.

<br>
<div style="text-align:center">
    <img src="https://i.ibb.co/4JDnP9G/diff.jpg" alt="drawing" width="600" style="padding: 20px;"/>
    <p style='text-align: center;'>
        <span style="color: #353535; font-size:0.9em; font-family: Verdana; font-weight: 300; letter-spacing: 0px;">Forward and reverse diffusion process.</span>
    </p>
</div>
<br>

The key idea behind diffusion models is the use of a parameterized Markov chain, which is trained to produce samples from a data distribution by reversing a gradual, multi-step noising process. Sampling starts from pure noise $x_T$ and denoises at every step to produce less noisy samples $x_{T-1}, x_{T-2}, \dots$ until it reaches the final synthesized sample $x_0$. Contrary to initial work on these models, it was later found that parameterizing the model as a function $\epsilon_\theta(x_t, t)$, which predicts the noise component of a noisy sample $x_t$, works better than predicting the mean of the less noisy sample $x_{t-1}$ directly (Ho et al.). To train these models, each sample in a mini-batch is produced by randomly drawing a data sample $x_0$, a timestep $t$, and a noise $\epsilon$, which together are used to produce a noisy sample $x_t$. The training objective is then $||\epsilon_\theta(x_t, t) - \epsilon||^2$, i.e. a simple mean-squared error loss between the true noise and the predicted noise. The noise prediction is approximated by a neural network because the true reverse process depends on the entire data distribution, which is unknown. So, the information about the training data distribution is stored in the weights of the neural network.
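The training recipe described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the linear schedule endpoints (`1e-4` to `0.02`) follow the values commonly attributed to Ho et al., the 16-value list stands in for an image, and the zero "prediction" is a hypothetical placeholder for the network $\epsilon_\theta(x_t, t)$.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1 .. beta_T (assumed endpoints)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t is the running product of (1 - beta_s) for s <= t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, noise):
    """Noisy sample: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    return [signal_scale * x + noise_scale * n for x, n in zip(x0, noise)]

def noise_mse(predicted_noise, true_noise):
    """Mean-squared error between predicted and true noise."""
    squared = [(p - n) ** 2 for p, n in zip(predicted_noise, true_noise)]
    return sum(squared) / len(squared)

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))

x0 = [rng.uniform(-1, 1) for _ in range(16)]   # toy "image" scaled to [-1, 1]
t = rng.randrange(1000)                        # random timestep
noise = [rng.gauss(0, 1) for _ in range(16)]   # random Gaussian noise
x_t = q_sample(x0, t, alpha_bars, noise)

# Placeholder for the network output eps_theta(x_t, t): it predicts zero noise.
predicted = [0.0] * len(x_t)
loss = noise_mse(predicted, noise)
```

In real training, `predicted` would come from the U-Net and `loss` would be backpropagated through it.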

We will be using diffusion model architectures and training procedures from the papers Improved Denoising Diffusion Probabilistic Models (Nichol and Dhariwal, 2021) and Diffusion Models Beat GANs on Image Synthesis (Dhariwal and Nichol, 2021) from OpenAI, where the authors improved the log-likelihood, to better cover all modes of the data distribution, as well as other generative metrics like FID (Fréchet Inception Distance) and IS (Inception Score), to enhance the fidelity of generated images. The model we will use has a neural network architecture based on the backbone of PixelCNN++, which is a U-Net based on a Wide ResNet, with group normalization instead of weight normalization to make the implementation simpler. These models have two convolutional residual blocks per resolution level, and use multi-head self-attention blocks at the 16×16 and 8×8 resolutions between the convolutional blocks. The diffusion timestep $t$ is specified by adding the Transformer sinusoidal position embedding into each residual block.
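To make the timestep conditioning more concrete, here is a minimal sketch of a sinusoidal embedding for a scalar timestep. The function name, the `max_period` value, and the ordering (all cosines followed by all sines) are assumptions made for illustration; the exact layout in the codebase may differ.

```python
import math

def timestep_embedding(t, dim, max_period=10000):
    """Map a scalar timestep to a `dim`-sized vector of cosines and sines.

    Assumes `dim` is even. Frequencies decay geometrically from 1 down
    towards 1 / max_period.
    """
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    args = [t * f for f in freqs]
    return [math.cos(a) for a in args] + [math.sin(a) for a in args]

embedding = timestep_embedding(t=250, dim=8)
```

Nearby timesteps get similar vectors while distant ones differ, which gives each residual block a smooth signal for how noisy its input is.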

There are several other intricacies to understanding diffusion models, with many improvements in recent literature, all of which would be hard to summarize in a short article. For a better theoretical understanding and details on the implementation, I recommend going through the papers on diffusion models. At the time of writing this article, the total count of papers on diffusion models is not as overwhelming as the number of GAN papers.


<span style="color: #674177; font-size:1.6em; font-family: Segoe UI; font-weight: 600; letter-spacing: 0px;">A Faster Way of Sampling with DDIMs</span>

> ***DDPMs inherently suffer from the need to sample hundreds-to-thousands of steps to generate a high fidelity sample, making them prohibitively expensive and impractical in real-world applications, where the data tends to be high-dimensional.***

One way around this problem is to use non-Markovian diffusion processes during sampling, instead of the Markovian diffusion processes used in DDPMs. This new class of models is called DDIMs (Denoising Diffusion Implicit Models). They follow the same training procedure as DDPMs, training for an arbitrary number of forward steps, but the reverse process uses a new generative process that samples over only a subset of those forward steps. The authors showed that DDIMs can produce high quality samples <span style="color: #DC143C;">10x to 50x</span> faster compared to DDPMs.
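The idea of sampling over a subset of steps can be sketched as follows. This is a simplified illustration of the deterministic DDIM update (with no added noise), not the sampler used in the codebase. The evenly strided schedule and the function names are assumptions, and `eps` stands for the noise predicted by the network.

```python
import math

def ddim_timesteps(num_train_steps, num_sample_steps):
    """Evenly strided subset of the training timesteps, in descending order.

    Assumes num_train_steps is divisible by num_sample_steps.
    """
    stride = num_train_steps // num_sample_steps
    return list(range(0, num_train_steps, stride))[::-1]

def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update from timestep t to an earlier timestep."""
    # Estimate the clean sample x_0 from the noisy sample and predicted noise.
    sqrt_a = math.sqrt(alpha_bar_t)
    sqrt_one_minus_a = math.sqrt(1.0 - alpha_bar_t)
    x0_pred = [(x - sqrt_one_minus_a * e) / sqrt_a for x, e in zip(x_t, eps)]
    # Re-noise the estimate to the (lower) noise level of the earlier timestep.
    sqrt_a_prev = math.sqrt(alpha_bar_prev)
    sqrt_one_minus_a_prev = math.sqrt(1.0 - alpha_bar_prev)
    x_prev = [sqrt_a_prev * x0 + sqrt_one_minus_a_prev * e
              for x0, e in zip(x0_pred, eps)]
    return x_prev, x0_pred

steps = ddim_timesteps(1000, 50)  # 50 sampling steps instead of 1000
```

Because the update jumps directly between any two noise levels, the model only needs to be evaluated once per entry in `steps`.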

<span style="color: #674177; font-size:1.6em; font-family: Segoe UI; font-weight: 600; letter-spacing: 0px;">Steering Gradients with CLIP</span>

CLIP (Contrastive Language–Image Pre-training) has set a benchmark in the areas of <span style="color: #DC143C;">zero-shot transfer, natural language supervision, and multi-modal learning</span>, by means of training on a wide variety of images and language supervision. These models are not trained directly to optimize on the benchmarks of singular tasks, making them far less short-sighted on the visual and language concepts learned. This led to better performance compared to several supervised ImageNet-trained models, even matching the zero-shot ImageNet accuracy of the original ResNet-50 without being trained explicitly on any of the 1.28M labeled samples. CLIP has been used in a wide variety of tasks since it was introduced in January 2021.

<br>
<div style="text-align:center">
    <img src="https://i.ibb.co/Jxxg4CK/clip.jpg" alt="drawing" width="650"/>
    <p style='text-align: center;'>
        <span style="color: #353535; font-size:0.9em; font-family: Verdana; font-weight: 300; letter-spacing: 0px;">Comparison of CLIP with other models.</span>
    </p>
</div>
<br>

The authors used a large dataset created with around 400 million image-text pairs for training. In every iteration, a batch of $N$ pairs of text and images are forwarded through an image and text encoder, which trains jointly to maximize the cosine similarity of the text and image embeddings of the $N$ real pairs (in the diagonal elements of the multi-modal embedding space represented in the figure below), while minimizing the similarity scores of the other $N^2-N$ elements (present at the non-diagonal positions) in the embedding space, to form a contrastive training objective. A symmetric cross-entropy loss is used to optimize the model on these similarity scores.
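The contrastive objective described above can be written out in plain Python for a tiny batch. This is a sketch for intuition rather than CLIP's actual implementation: the embeddings are hand-made lists, and the temperature is held fixed at `0.07` here, whereas CLIP learns it during training.

```python
import math

def l2_normalize(vector):
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]

def cross_entropy(logits, target_index):
    """Cross-entropy of one row of logits against the index of the true pair."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target_index]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over the N x N cosine-similarity matrix."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] is the scaled cosine similarity of image i and text j.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature
               for txt in texts] for img in images]
    # Image-to-text direction: the matching text for image i sits at column i.
    loss_images = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image direction: the same matrix read column by column.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_texts = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (loss_images + loss_texts) / 2

aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when each image embedding matches its own text (`aligned`) and large when the pairs are mismatched (`swapped`).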

<br>
<div style="text-align:center">
    <img src="https://i.ibb.co/jZyf44F/clip3.jpg" alt="drawing" width="550"/>
    <p style='text-align: center;'>
        <span style="color: #353535; font-size:0.9em; font-family: Verdana; font-weight: 300; letter-spacing: 0px;">CLIP training process.</span>
    </p>
</div>
<br>

We will use CLIP to steer the image sampling denoising process of diffusion models, to produce samples matching the text prompt provided as a condition. This technique has been used in works like DALL-E and GLIDE, and also to guide other generative models like VQGAN, StyleGAN2 and Siren (Sinusoidal Representation Networks) to name a few. This guidance procedure is done by first encoding the intermediate output image of the diffusion model during the iterative sampling process with the CLIP image encoder head, while the text prompts are converted to embeddings by using the text encoder head. Then, the resultant output image and text embeddings are used to compute a perceptual loss, which measures the similarity between the two embeddings. The <span style="color: #DC143C;">gradients of this loss with respect to the intermediate denoised image are used for conditioning</span>, or guiding the diffusion model during the sampling process to produce the next intermediate denoised image. This process is repeated until the total sampling steps are complete. We also use losses that control spatial smoothness and pixel range, namely total variation and range losses, as well as image augmentations, to improve the quality. In addition to this, multiple cutouts of images are taken in batches to minimize the loss objective, leading to improvements in the synthesis quality and lower memory usage when sampling on smaller GPUs.
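Two of the losses mentioned above can be sketched in plain Python. The spherical distance follows the form commonly seen in CLIP-guided diffusion notebooks; whether it matches the codebase's `spherical` option exactly is an assumption. The total variation function is a simplified single-channel version written for illustration.

```python
import math

def l2_normalize(vector):
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]

def spherical_dist_loss(image_emb, text_emb):
    """Squared great-circle distance between two embeddings on the unit sphere."""
    x, y = l2_normalize(image_emb), l2_normalize(text_emb)
    chord = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    half_chord = min(1.0, chord / 2)  # guard against rounding above 1
    return 2 * math.asin(half_chord) ** 2

def tv_loss(image):
    """Mean squared difference between neighbouring pixels of an H x W image."""
    height, width = len(image), len(image[0])
    total, count = 0.0, 0
    for i in range(height - 1):
        for j in range(width - 1):
            dx = image[i][j + 1] - image[i][j]
            dy = image[i + 1][j] - image[i][j]
            total += dx * dx + dy * dy
            count += 1
    return total / count

same_direction = spherical_dist_loss([1.0, 0.0], [2.0, 0.0])
orthogonal = spherical_dist_loss([1.0, 0.0], [0.0, 1.0])
flat_image_tv = tv_loss([[0.5, 0.5], [0.5, 0.5]])
```

During guided sampling, the gradient of a weighted sum of such losses with respect to the intermediate image is what nudges each denoising step towards the prompt.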

<br>
<span style="color: #674177; font-size:1.6em; font-family: Segoe UI; font-weight: 600; letter-spacing: 0px;">Upscaling Generated Images with Super-resolution</span>

Large deep generative models need to be trained on large GPU clusters for days or even weeks. On a single, smaller GPU, we are limited to training 256x256 diffusion models, whose outputs carry limited visual detail. So, we will work around this by training a 256x256 output model and upscaling its predictions 4x to obtain final images of size 1024x1024. Conventional upscaling with interpolation techniques such as bilinear or Lanczos degrades image quality and introduces blurring artifacts, as no new visual detail gets added. A remedy to this problem is to use a super-resolution model trained to recover the finer details by a generative process. This produces enlarged images with <span style="color: #DC143C;">high perceptual quality and peak signal-to-noise ratio (PSNR)</span>.
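For reference, PSNR is computed from the mean-squared error between a reference image and a restored one. The sketch below works on flat lists of pixel values and assumes 8-bit images (a maximum value of 255); the function name is chosen for illustration.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in decibels between two equal-length pixel lists."""
    squared = [(r - s) ** 2 for r, s in zip(reference, restored)]
    mse = sum(squared) / len(squared)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)

# Every pixel is off by exactly one intensity level.
off_by_one = psnr([10, 20, 30, 40], [11, 21, 31, 41])
```

Higher values mean the restored image is closer to the reference, although PSNR alone does not capture perceptual quality.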

<br>
<div style="text-align:center">
    <img src="https://i.ibb.co/L1d8rfL/swin-1.jpg" alt="drawing" width="350" style="padding: 20px;"/>
    <p style='text-align: center;'>
        <span style="color: #353535; font-size:0.9em; font-family: Verdana; font-weight: 300; letter-spacing: 0px;">(a) Building hierarchical feature maps by merging image patches in Swin transformers; (b) global computation of self-attention in ViT.</span>
    </p>
</div>
<br>

Swin transformers are a class of vision transformer architectures that, like ViT/DeiT, aim to improve the adaptation of transformers to vision tasks. They have achieved state-of-the-art results across various tasks such as image classification, instance segmentation, and semantic segmentation. They take a hierarchical approach in their architecture, building feature maps by merging patches when moving from one layer to the next (while keeping the number of patches in each window fixed), to achieve <span style="color: #DC143C;">scale-invariance</span>. Self-attention is computed only within each local window, thereby <span style="color: #DC143C;">reducing computations to linear complexity</span> with respect to image size, compared to the quadratic complexity of ViTs, where self-attention is computed globally. Local self-attention lacks connections across windows, which limits modelling power. This is solved by cyclically shifting the window partitioning, which enables <span style="color: #DC143C;">cross-window connections</span>. This partitioning configuration is alternated to form consecutive non-shifted and shifted blocks, enhancing the overall modelling power.
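The regular and shifted window configurations can be illustrated on a toy 4x4 grid of patch indices. This sketch only shows how patches are grouped, not the attention computation itself, and the helper names are hypothetical.

```python
def cyclic_shift(grid, shift):
    """Roll the grid up and to the left by `shift` positions, wrapping around."""
    height, width = len(grid), len(grid[0])
    return [[grid[(i + shift) % height][(j + shift) % width]
             for j in range(width)] for i in range(height)]

def window_partition(grid, window):
    """Split the grid into non-overlapping window x window groups of patches."""
    height, width = len(grid), len(grid[0])
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            windows.append([grid[i][j]
                            for i in range(top, top + window)
                            for j in range(left, left + window)])
    return windows

# Patch indices 0..15 laid out row by row.
grid = [[row * 4 + col for col in range(4)] for row in range(4)]

regular = window_partition(grid, window=2)
# Shift by half the window size before partitioning, as in a shifted block.
shifted = window_partition(cyclic_shift(grid, shift=1), window=2)
```

Here `regular[0]` groups patches `[0, 1, 4, 5]`, while `shifted[0]` groups `[5, 6, 9, 10]`, which come from four different regular windows. Attention inside the shifted window therefore mixes information across the original window boundaries. In the Swin paper, windows that wrap around the image border additionally use an attention mask so that non-adjacent patches do not attend to each other.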

<br>
<div style="text-align:center">
    <img src="https://i.ibb.co/Pxz9fWG/swin-3.jpg" alt="drawing" width="650" style="padding: 20px;"/>
    <p style='text-align: center;'>
        <span style="color: #353535; font-size:0.9em; font-family: Verdana; font-weight: 300; letter-spacing: 0px;">(a) The architecture of a Swin Transformer (Swin-T); (b) two successive Swin Transformer Blocks. W-MSA and SW-MSA are multi-head self-attention modules with regular and shifted windowing configurations, respectively.</span>
    </p>
</div>
<br>

We will make use of an image-restoration model proposed in the paper [SwinIR: Image Restoration Using Swin Transformer](https://arxiv.org/pdf/2108.10257.pdf), which is built upon Swin transformer blocks. The generated image after $N$ CLIP-conditioned diffusion denoising steps is fed as the input to this model. The architecture of SwinIR consists of modules for shallow feature extraction, deep feature extraction, and high-quality (HQ) image reconstruction. The shallow feature extraction module uses a convolution layer to extract shallow features, which carry the low-frequency information, and these are transmitted directly to the final reconstruction module. The deep feature extraction module consists of several Residual Swin Transformer blocks (RSTB). Each RSTB has several Swin transformer layers for capturing local attention and cross-window interactions. The authors also use another convolution layer at the end of the block for feature enhancement, with a residual connection to provide a shortcut for feature aggregation. Both the shallow and deep features are fused at the final reconstruction module, producing the final restored or enlarged image.

<br>
<div style="text-align:center">
    <img src="https://i.ibb.co/RpK5xCt/swinir.jpg" alt="drawing" width="550"/>
    <p style='text-align: center;'>
        <span style="color: #353535; font-size:0.9em; font-family: Verdana; font-weight: 300; letter-spacing: 0px;">SwinIR architecture.</span>
    </p>
</div>
<br>

<span style="color: #674177; font-size:1.6em; font-family: Segoe UI; font-weight: 600; letter-spacing: 0px;">Credits</span>

Developed using techniques and architectures borrowed from original work by the authors below:

- [Guided diffusion](https://github.com/openai/guided-diffusion) and [improved diffusion](https://github.com/openai/improved-diffusion) by [OpenAI](https://github.com/openai)

- Original notebook on CLIP guidance sampling by Katherine Crowson (https://github.com/crowsonkb, https://twitter.com/RiversHaveWings) with improvements by [nerdyrodent](https://github.com/nerdyrodent/CLIP-Guided-Diffusion) and [sadnow](https://github.com/sadnow/360Diffusion) (@sadly_existent) 

- [SwinIR: Image Restoration Using Swin Transformer](https://github.com/JingyunLiang/SwinIR)

Huge thanks to all their great work! I highly recommend checking these out.

<br>
<span style="float:center;"><a href="https://www.kaggle.com/sreevishnudamodaran"><img style="padding: 5px;" border="0" alt="Ask Me Something" src="https://img.shields.io/badge/Ask%20Me%20Something-FCC624?style=for-the-badge&logo=kaggle&logoColor=black" width="175"></a></span><br>

<span style="color: #674177; font-size:1.6em; font-family: Segoe UI; font-weight: 600; letter-spacing: 0px;">Install Dependencies</span>

In [None]:
!conda install -y mpi4py >> /dev/null

!git clone https://github.com/sreevishnu-damodaran/clip-diffusion-art.git -q
%cd /kaggle/working/clip-diffusion-art
!pip install -e . -q
!git clone https://github.com/crowsonkb/guided-diffusion -q
!pip install -e guided-diffusion -q
!git clone https://github.com/JingyunLiang/SwinIR.git -q
!git clone https://github.com/openai/CLIP -q
!pip install -e ./CLIP -q

<span style="color: #674177; font-size:1.6em; font-family: Segoe UI; font-weight: 600; letter-spacing: 0px;">⚗️ Imports & Setup</span>

In [None]:
import random
import os
import numpy as np

import sys
import yaml
import glob
from datetime import datetime

import matplotlib.pyplot as plt
from types import SimpleNamespace
import wandb

import torch
import torchvision
import torchvision.transforms.functional as TF

sys.path.append("./clip-diffusion-art")
sys.path.append("./guided-diffusion")
from clip_diffusion_art import logger
from clip_diffusion_art.train import TrainLoop
from clip_diffusion_art.cda_utils import (
    args_to_dict,
    add_dict_to_argparser,
)
from clip_diffusion_art.sample import ClipDiffusion

from guided_diffusion.script_util import (
    create_model_and_diffusion,
    model_and_diffusion_defaults
)
from guided_diffusion.image_datasets import load_data
from guided_diffusion.resample import create_named_schedule_sampler
from guided_diffusion import dist_util

In [None]:
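def seed_all(seed):
    """Seed the Python, NumPy and PyTorch RNGs for reproducible runs."""
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Ask cuDNN to use deterministic algorithms
    torch.backends.cudnn.deterministic = True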

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

<span style="color: #674177; font-size:1.6em; font-family: Segoe UI; font-weight: 600; letter-spacing: 0px;">⚒️ Training Config & Hyperparameters</span>

<br>

In this step, we will choose the hyperparameters and other training configurations for fine-tuning with the custom dataset. We have selected reasonable defaults, which allow us to fine-tune a model on custom datasets with 16GB GPUs on Colab or Kaggle.

### Download the pre-trained checkpoint.

In [None]:
!wget https://openaipublic.blob.core.windows.net/diffusion/march-2021/lsun_uncond_100M_1200K_bs128.pt -P ./pretrained_models -q

resume_checkpoint = "/kaggle/working/clip-diffusion-art/pretrained_models/lsun_uncond_100M_1200K_bs128.pt"

In [None]:
train_cfg = model_and_diffusion_defaults()

cfg = {
    'data_dir': "/kaggle/input/artworks-in-public-domain/artworks_in_public_domain",
    'attention_resolutions': "16",
    'class_cond': False,
    'diffusion_steps':1000,
    'rescale_timesteps': True,
    'rescale_learned_sigmas': True,
    'image_size': 256,
    'learn_sigma': True,
    'noise_schedule': "linear",
    'num_channels': 128,
    'num_heads': 1,
    'num_res_blocks': 2,
    'use_fp16': True,
    'use_scale_shift_norm': False,
    'schedule_sampler': "uniform",
    'lr': 1e-7,
    'weight_decay': 0.0,
    'lr_anneal_steps': 0,
    'batch_size': 8,
    'microbatch': 1,  # -1 disables microbatches
    'ema_rate': "0.9999",  # comma-separated list of EMA values
    'log_interval': 10,
    'save_interval': 1000,
    'resume_checkpoint': resume_checkpoint,
    'use_checkpoint': True,
    'fp16_scale_growth': 1e-3,
    'log_dir': "outputs",
    'wandb_project': "clip_diffusion_art_train",
    'wandb_entity': None,
    'wandb_name': None,
    'seed': 47
}

train_cfg.update(cfg)
train_cfg = SimpleNamespace(**train_cfg)

# Set seed for training
seed_all(train_cfg.seed)

<span style="color: #674177; font-size:1.6em; font-family: Segoe UI; font-weight: 600; letter-spacing: 0px;">Dataset Creation Process</span>

I have downloaded artworks that are in the public domain from [WikiArt](https://www.wikiart.org/) and [rawpixel.com](https://www.rawpixel.com/) for creating the dataset used for this project. After downloading them, I resized everything to the size of 256x256. The dataset contains around 29.3k images. We will use this dataset to fine-tune our model.

To use custom datasets for training, download/scrape the necessary images, and then resize them (and preferably center crop to avoid aspect ratio change) to the input size of the diffusion model of choice.

Note: Make sure all the images have 3 channels (RGB). In case of grayscale images, convert them to RGB.
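As a small aid for the center-cropping step, the helper below computes the largest centered square for an image of a given size. It is a hypothetical function written for this illustration and is not part of the codebase.

```python
def center_crop_box(width, height):
    """Return (left, upper, right, lower) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    upper = (height - side) // 2
    return (left, upper, left + side, upper + side)

box = center_crop_box(400, 300)
```

For a 400x300 image this gives `(50, 0, 350, 300)`. The tuple uses the same order as the box argument of Pillow's `Image.crop`, so with Pillow the cropped image can then be resized to 256x256 and converted to RGB with `Image.convert("RGB")`.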

<span style="color: #674177; font-size:1.6em; font-family: Segoe UI; font-weight: 600; letter-spacing: 0px;">Load & Display Training Samples</span>

The `load_data` function creates the torch dataset and a torch dataloader-based generator for training.

In [None]:
data = load_data(
    data_dir=train_cfg.data_dir,
    batch_size=train_cfg.batch_size,
    image_size=train_cfg.image_size,
    class_cond=train_cfg.class_cond,
)

In [None]:
sample_imgs, _ = next(data)
grid_img = torchvision.utils.make_grid(sample_imgs.clamp(-1, 1).add(1).div(2), nrow=4)
plt.figure(figsize=(10, 5))
plt.axis('off')
plt.imshow(grid_img.permute(1, 2, 0));

<div style="text-align:center">
    <img src="https://i.ibb.co/1sQcJb2/gb6B4ig.png" alt="drawing" width="350"/></div>

<br>
<br>

Weights & Biases helps machine learning teams build better models faster. With a few lines of code, practitioners can instantly debug, compare and reproduce their models — architecture, hyperparameters, git commits, model weights, GPU usage, and even datasets and predictions — and collaborate with their teammates.

Now, fast track your experiments with:

 - **Dashboard (experiment tracking)**: Log and visualize experiments in real time = Keep data and results in one convenient place. Consider this as a repository of experiments.
 - **Artifacts (dataset + model versioning)**: Store and version datasets, models, and results = Know exactly what data a model is being trained on.

It is also <span style="color: #DC143C;">free to use for academic and open source projects!</span>


**Visit https://wandb.me/kaggle to know more about using Weights & Biases in Kaggle!**

**To get to know about all the exciting features, take a look at https://wandb.ai/site**

<br>
<br>

<p style='text-align: left;'><span style="color: #674177; font-size:1.6em; font-family: Segoe UI; font-weight: 600; letter-spacing: 0px;">⚗️ Integrate Weights & Biases</span></p>

I have already integrated Weights & Biases to perform logging of metrics and images in the repository we use.

To enable it, pass the `wandb_run` handle created below to the training loop for experiment tracking and logging.

In [None]:
wandb_run = wandb.init(project=train_cfg.wandb_project,
    entity=train_cfg.wandb_entity,
    name=train_cfg.wandb_name)

<span style="color: #674177; font-size:1.6em; font-family: Segoe UI; font-weight: 600; letter-spacing: 0px;">Training</span>

In [None]:
dist_util.setup_dist()

model, diffusion = create_model_and_diffusion(
    **args_to_dict(train_cfg, model_and_diffusion_defaults().keys()))
model.to(dist_util.dev())

schedule_sampler = create_named_schedule_sampler(
    train_cfg.schedule_sampler, diffusion)

In [None]:
logger.configure(dir=train_cfg.log_dir, wandb_run=wandb_run)

try:
    TrainLoop(
        model=model,
        diffusion=diffusion,
        data=data,
        batch_size=train_cfg.batch_size,
        microbatch=train_cfg.microbatch,
        lr=train_cfg.lr,
        ema_rate=train_cfg.ema_rate,
        log_interval=train_cfg.log_interval,
        save_interval=train_cfg.save_interval,
        resume_checkpoint=train_cfg.resume_checkpoint,
        use_fp16=train_cfg.use_fp16,
        fp16_scale_growth=train_cfg.fp16_scale_growth,
        schedule_sampler=schedule_sampler,
        weight_decay=train_cfg.weight_decay,
        lr_anneal_steps=train_cfg.lr_anneal_steps,
        wandb_run=wandb_run,
    ).run_loop()
except KeyboardInterrupt:
    # Close the W&B run cleanly when training is stopped manually
    wandb.finish()

### Training can also be done in script mode


#### Set Hyperparameters

```
MODEL_FLAGS="--image_size 256 --num_channels 128 --num_res_blocks 2 --num_heads 1 --attention_resolutions 16"
DIFFUSION_FLAGS="--diffusion_steps 1000 --noise_schedule linear --learn_sigma True --rescale_learned_sigmas True --rescale_timesteps True --use_scale_shift_norm False"
TRAIN_FLAGS="--lr 5e-6 --save_interval 500 --batch_size 16 --use_fp16 True --wandb_project diffusion-art-train --use_checkpoint True --resume_checkpoint pretrained_models/lsun_uncond_100M_1200K_bs128.pt"
```

#### Run the Training Job:

```
python clip_diffusion_art/train.py --data_dir path/to/images $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS
```

<span style="color: #674177; font-size:1.6em; font-family: Segoe UI; font-weight: 600; letter-spacing: 0px;">Sampling</span>

Let's download and use a checkpoint that was trained earlier for 5000 iterations on the same artworks-in-public-domain dataset, to generate samples. Do note that we are using a fine-tuned checkpoint trained for a small number of iterations on a single 16GB GPU for demonstration purposes. Other practical applications may need more hyper-parameter tuning, longer training, and larger pre-trained models.

In [None]:
!wget https://api.wandb.ai/files/sreevishnu-damodaran/clip_diffusion_art/29bag3br/256x256_clip_diffusion_art.pt -q

<span style="color: #000508; font-size:1.3em; font-family: Segoe UI; font-weight: 600; letter-spacing: 0px;">Sampling Config </span>

#### Options:
`--images` - image prompts (default=None)<br>
`--checkpoint` - diffusion model checkpoint to use for sampling<br>
`--model_config` - diffusion model config yaml<br>
`--wandb_project` - enable wandb logging and use this project name<br>
`--wandb_name` - optional run name to use for wandb logging<br>
`--wandb_entity` - optional entity to use for wandb logging<br>
`--num_samples` - number of samples to generate (default=1)<br>
`--batch_size` - batch size for the diffusion model (default=1)<br>
`--sampling` - timestep respacing sampling methods to use (default="ddim50", choices=[25, 50, 100, 150, 250, 500, 1000, ddim25, ddim50, ddim100, ddim150, ddim250, ddim500, ddim1000])<br>
`--diffusion_steps` - number of diffusion timesteps (default=1000)<br>
`--skip_timesteps` - diffusion timesteps to skip (default=5)<br>
`--clip_denoised` - enable to filter out noise from generation (default=False)<br>
`--randomize_class_disable` - disables changing imagenet class randomly in each iteration (default=False)<br>
`--eta` - the amount of noise to add during sampling (default=0)<br>
`--clip_model` - CLIP pre-trained model to use (default="ViT-B/16",
choices=["RN50","RN101","RN50x4","RN50x16","RN50x64","ViT-B/32","ViT-B/16","ViT-L/14"])<br>
`--skip_augs` - enable to skip torchvision augmentations (default=False)<br>
`--cutn` - the number of random crops to use (default=16)<br>
`--cutn_batches` - number of crops to take from the image (default=4)<br>
`--init_image` - init image to use while sampling (default=None)<br>
`--loss_fn` - loss function to use for CLIP guidance (default="spherical", choices=["spherical", "cos_spherical"])<br>
`--clip_guidance_scale` - CLIP guidance scale (default=5000)<br>
`--tv_scale` - controls smoothing in samples (default=100)<br>
`--range_scale` - controls the range of RGB values in samples (default=150)<br>
`--saturation_scale` - controls the saturation in samples (default=0)<br>
`--init_scale` - controls the adherence to the init image (default=1000)<br>
`--scale_multiplier` - scales clip_guidance_scale, tv_scale, and range_scale (default=50)<br>
`--disable_grad_clamp` - disable gradient clamping (default=False)<br>
`--sr_model_path` - SwinIR super-resolution model checkpoint (default=None)<br>
`--large_sr` - enable to use large SwinIR super-resolution model (default=False)<br>
`--output_dir` - output images directory (default="output_dir")<br>
`--seed` - the random seed (default=47)<br>
`--device` - the device to use <br>
<br>
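
To make the `--sampling` values easier to read, here is a minimal sketch of how a value such as `"ddim50"` or `"250"` can be split into a sampler type and a step count. The helper `parse_sampling` and the `SamplingSpec` type are hypothetical names used only for illustration; they are not part of the codebase, and the assumption is that the `ddim` prefix selects DDIM sampling while the number gives the respaced step count out of `--diffusion_steps`.

```python
from typing import NamedTuple


class SamplingSpec(NamedTuple):
    use_ddim: bool   # True when the value starts with "ddim"
    num_steps: int   # number of sampling steps after respacing


def parse_sampling(value: str, diffusion_steps: int = 1000) -> SamplingSpec:
    """Split a --sampling value such as "ddim50" or "250" into its parts."""
    text = str(value).strip().lower()
    use_ddim = text.startswith("ddim")
    digits = text[len("ddim"):] if use_ddim else text
    if not digits.isdigit():
        raise ValueError(f"unrecognized sampling value: {value!r}")
    num_steps = int(digits)
    # The respaced schedule cannot have more steps than the original one
    if not 0 < num_steps <= diffusion_steps:
        raise ValueError(
            f"{num_steps} steps does not fit in {diffusion_steps} diffusion steps"
        )
    return SamplingSpec(use_ddim, num_steps)


spec = parse_sampling("ddim50", diffusion_steps=1000)
print(spec)  # SamplingSpec(use_ddim=True, num_steps=50)
```

Under this reading, the default `"ddim50"` visits 50 of the 1000 diffusion timesteps, which is why it runs much faster than a full-length schedule.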


In [None]:
cfg_dict = {
    "seed": 84,
    "wandb_project": "clip_diffusion_art",
    "wandb_name": "job7",
    "model_config": "clip_diffusion_art/configs/256x256_clip_diffusion_art.yaml",
    "checkpoint": "/kaggle/working/clip-diffusion-art/256x256_clip_diffusion_art.pt",
    "batch_size": 1,
    "skip_timesteps": 5,
    "sampling": "ddim50",
    "diffusion_steps": 1000,
    "clip_guidance_scale": 5000,
    "cutn": 60,
    "cutn_batches": 4,
    "scale_multiplier": 1,
    "tv_scale":75,
    "range_scale": 200,
    "loss_fn":"spherical",
    "clip_model": "ViT-B/16",
    "large_sr": True,
}

cfg_dict["output_dir"] = f"/kaggle/working/{cfg_dict['wandb_name']}"

cfg = SimpleNamespace(**cfg_dict)
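
Note that `cfg` only exposes the keys present in `cfg_dict`, so reading an option that was left out, such as `cfg.eta`, raises an `AttributeError`. A minimal sketch of one way to fall back to the documented defaults is shown below. The `DEFAULTS` dictionary, the `make_cfg` helper and the `demo_cfg` variable are hypothetical names for illustration; the default values are copied from the option list above and cover only a subset of the options.

```python
from types import SimpleNamespace

# Defaults copied from the option list above (subset only, for illustration)
DEFAULTS = {
    "batch_size": 1,
    "sampling": "ddim50",
    "diffusion_steps": 1000,
    "skip_timesteps": 5,
    "clip_denoised": False,
    "eta": 0,
    "skip_augs": False,
    "cutn": 16,
    "cutn_batches": 4,
    "init_image": None,
    "saturation_scale": 0,
    "init_scale": 1000,
}


def make_cfg(overrides: dict) -> SimpleNamespace:
    """Return a namespace holding the defaults with `overrides` applied on top."""
    return SimpleNamespace(**{**DEFAULTS, **overrides})


demo_cfg = make_cfg({"cutn": 60, "seed": 84})
print(demo_cfg.cutn, demo_cfg.eta, demo_cfg.seed)  # prints: 60 0 84
```

Keys in `overrides` win over the defaults, and keys that appear in neither dictionary still raise an `AttributeError`, which keeps typos in option names visible.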

In [None]:
seed_all(cfg.seed)

# Read the model hyperparameters from the YAML config and close the file afterwards
with open(cfg.model_config) as config_file:
    model_config = yaml.load(config_file,
                             Loader=yaml.FullLoader)["model_config"]
print("model_config", model_config)

In [None]:
clip_diffusion = ClipDiffusion(cfg.checkpoint,
    model_config=model_config,
    sampling=cfg.sampling,
    diffusion_steps=cfg.diffusion_steps,
    clip_model=cfg.clip_model,
    device=device
)

<span style="color: #674177; font-size:1.6em; font-family: Segoe UI; font-weight: 600; letter-spacing: 0px;">Generate Samples</span>

In [None]:
os.makedirs(os.path.join(cfg.output_dir, 'wandb'), exist_ok=True)

wandb_run = wandb.init(project=cfg.wandb_project,
                        dir=cfg.output_dir,
                        name=cfg.wandb_name)

#### Enter a prompt of your choice to generate artworks

**Some examples to try:**

"beautiful matte painting of dystopian city, Behance HD"<br>
"vibrant watercolor painting of a flower, artstation HQ"<br>
"a photo realistic apple in HD"<br>
"beach with glowing neon lights, trending on artstation"<br>
"beautiful abstract painting of the horizon in ultrafine detail, HD"<br>
"vibrant digital illustration of a waterfall in the woods, HD"<br>
"beautiful matte painting of ship at sea, Behance HD"<br>
"hyperrealism oil painting of beautiful skies, HD"

In [None]:
prompts = ["vibrant matte painting of a house in an enchanted forest, artstation HQ"]
num_samples = 4

out_generator = clip_diffusion.sample(
                    prompts,
#                     args.images,
                    num_samples=num_samples,
                    batch_size=cfg.batch_size,
                    skip_timesteps=cfg.skip_timesteps,
#                     clip_denoised=cfg.clip_denoised,
#                     randomize_class=cfg.randomize_class,
#                     eta=cfg.eta,
#                     skip_augs=cfg.skip_augs,
                    cutn=cfg.cutn,
                    cutn_batches=cfg.cutn_batches,
#                     init_image=cfg.init_image,
                    loss_fn=cfg.loss_fn,
                    clip_guidance_scale=cfg.clip_guidance_scale,
                    tv_scale=cfg.tv_scale,
                    range_scale=cfg.range_scale,
#                     saturation_scale=cfg.saturation_scale,
#                     init_scale=cfg.init_scale,
                    scale_multiplier=cfg.scale_multiplier,
                    output_dir=cfg.output_dir,
                    wandb_run=wandb_run
                )
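
For intuition about `loss_fn="spherical"`, here is a minimal sketch of the squared great-circle distance that is commonly used in open-source CLIP guided diffusion notebooks to compare an image embedding with a text embedding. Whether this codebase's `spherical` option matches this formulation exactly is an assumption, and the function names below are illustrative. Plain Python lists stand in for CLIP embeddings so the example runs without any dependencies.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [c / norm for c in v]


def spherical_dist_loss(x: Sequence[float], y: Sequence[float]) -> float:
    """Return 2 * arcsin(||x - y|| / 2) ** 2 for the unit-normalized inputs."""
    x, y = normalize(x), normalize(y)
    chord = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    # The chord length lies in [0, 2]; clamp to guard against rounding error
    half_chord = min(chord / 2.0, 1.0)
    return 2.0 * math.asin(half_chord) ** 2


same = spherical_dist_loss([1.0, 0.0], [2.0, 0.0])        # same direction
orthogonal = spherical_dist_loss([1.0, 0.0], [0.0, 1.0])  # 90 degrees apart
print(same, orthogonal)
```

The loss is zero when both embeddings point in the same direction and grows with the angle between them, so minimizing it pulls the image embedding toward the prompt embedding. In this kind of setup, `clip_guidance_scale` would weight that pull against the regularizers controlled by `tv_scale` and `range_scale`.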

In [None]:
os.makedirs(cfg.output_dir, exist_ok=True)

for i, out_image in enumerate(out_generator):
    disp_image = TF.to_pil_image(out_image.squeeze(0))
    out_image = clip_diffusion.upscale(out_image,
                                        large_sr=cfg.large_sr)
    out_image = TF.to_pil_image(out_image.squeeze(0))
    
    fig, axs = plt.subplots(1, 2, figsize=(15, 18))
    for axi in axs.ravel():
        axi.set_axis_off()
    axs[0].imshow(disp_image)
    axs[0].set_title("Before Super-resolution", fontsize=20)
    axs[1].imshow(out_image)
    axs[1].set_title("After Super-resolution", fontsize=20)
    plt.show()
    
    current_time = datetime.now().strftime('%y%m%d-%H%M%S_%f')
    filename = f'image{i}_{current_time}.png'
    out_image.save(os.path.join(cfg.output_dir, filename))
    
    if wandb_run is not None:
        wandb.log({os.path.splitext(filename)[0]: wandb.Image(os.path.join(cfg.output_dir, filename))})

In [None]:
if wandb_run is not None:
    for k in range(cfg.batch_size):
        for i in range(num_samples):
            img_files = glob.glob(os.path.join(cfg.output_dir,
                                        f"sample{i}_output{k}_steps", '*'))
            wandb.log(
                {f"sample{i}_output{k}": [wandb.Image(img_path) for img_path
                in sorted(img_files,
                          key=lambda x: int(os.path.splitext(x)[0].split("_")[-1].lstrip("step")))]}
            )
                      
wandb.finish()
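
The sort key in the cell above orders the intermediate step images numerically rather than alphabetically. A minimal sketch of the same idea with a regular expression is shown below. The helper `step_index` is a hypothetical name, and the file name pattern (a trailing step number such as `_step12.png` or `_12.png`) is an assumption inferred from the sort key, not something confirmed by the codebase.

```python
import re


def step_index(path: str) -> int:
    """Return the trailing step number of names like 'img_step12.png' or 'img_12.png'."""
    match = re.search(r"(\d+)(?:\.[^./\\]+)?$", path)
    if match is None:
        raise ValueError(f"no trailing step number in {path!r}")
    return int(match.group(1))


# Hypothetical file names, used only to show the difference in ordering
paths = [
    "sample0_output0_steps/img_step10.png",
    "sample0_output0_steps/img_step2.png",
    "sample0_output0_steps/img_step1.png",
]

print(sorted(paths))                  # alphabetical: step1, step10, step2
print(sorted(paths, key=step_index))  # numeric: step1, step2, step10
```

Sorting numerically matters here because the images are logged as a sequence, and an alphabetical order would place step 10 before step 2.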

<span style="color: #674177; font-size:1.4em; font-family: Segoe UI; font-weight: 600; letter-spacing: 0px;">Generated Results</span>

<span style="font-size:1.1em;">
    <b>📌 View more generated artworks <a href="https://wandb.ai/sreevishnu-damodaran/clip_diffusion_art/reports/Results-CLIP-Guided-Diffusion-SwinIR--VmlldzoxNjUxNTMz">here</a></b>
</span>
<br><br>

<br>

<div style="text-align:center">
     <span style="color: #353535; font-size:1.1em; font-family: Verdana; font-weight: 300; letter-spacing: 0px;">"beautiful matte painting of a dystopian city, Behance HD"</span>
     <p style='text-align: center;'>
         <img src="https://i.ibb.co/BTWfbf4/23-0.gif" alt="drawing" width="300" style="padding: 20px;"/>
    </p>
</div>

<div style="text-align:center">
     <span style="color: #353535; font-size:1.1em; font-family: Verdana; font-weight: 300; letter-spacing: 0px;">"vibrant watercolor painting of a flower, artstation HQ"</span>
     <p style='text-align: center;'>
         <img src="https://i.ibb.co/8dBTzpX/job18-2.gif" alt="drawing" width="300" style="padding: 20px;"/>
    </p>
</div>

<span style="color: #674177; font-size:1.6em; font-family: Segoe UI; font-weight: 600; letter-spacing: 0px;">Super-resolution Results</span>

<div style="text-align:center">
    <img src="https://i.ibb.co/Gss0y38/sr-zoom-optimized.gif" alt="drawing" width="650" style="padding: 20px;"/>
</div>

<p style='text-align: center;'><span style="color: #006ea4; font-family: Tahoma; font-size: 1.8em; font-weight: 300;">Thanks for reading!</span></p>