<a href="https://colab.research.google.com/github/KaliYuga-ai/DreamBooth_With_Dataset_Captioning/blob/main/DreamBooth_With_Dataset_Captioning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# KaliYuga's DreamBooth With Dataset Captioning

<div>
<img src="https://images.squarespace-cdn.com/content/v1/6213c340453c3f502425776e/a432c21c-bb12-4f38-b5e2-1c12a3c403f6/Animated-Logo_1.gif" width="150"/>
</div>

This is KaliYuga's fork of Shivam Shrirao's [DreamBooth implementation](https://github.com/ShivamShrirao/diffusers/tree/main/examples/dreambooth). It adds a number of new features to make dataset labeling and organization faster and more powerful, and training more accurate (hopefully!).

**This fork adds the following:** 

*   a slightly modified version of the ***BLIP dataset autocaptioning functionality*** from [victorchall's EveryDream companion tools repo](https://github.com/victorchall/EveryDream).

Once you've autocaptioned your datasets, you can use this same notebook to train Stable Diffusion models on your new text/image pairs (with or without instance and class prompts) using 

*   ***KaliYuga's Dataset Organization Omnitool***, which I wrote with copious help from ChatGPT. This tool lets you extract .txt files from your image filenames so you can train on unique text/image pairs instead of using a single broad instance prompt for a whole group of images. You can still use class and instance prompts alongside the text/image pairs if you want, and this can be a good way to hyper-organize your training data. More detail is given in the Omnitool section.

------
You can support victorchall's awesome EveryDream on 
[Patreon](https://www.patreon.com/everydream) or at [Kofi](https://ko-fi.com/everydream)!

[Follow Shivam](https://twitter.com/ShivamShrirao) on Twitter

[Follow me](https://twitter.com/KaliYuga_ai) (KaliYuga) on Twitter!


------

# Setup

In [None]:
#@markdown Check type of GPU and VRAM available.
!nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader

### Install Requirements/Definitions

In [None]:
!wget -q https://github.com/ShivamShrirao/diffusers/raw/main/examples/dreambooth/train_dreambooth.py
!wget -q https://github.com/ShivamShrirao/diffusers/raw/main/scripts/convert_diffusers_to_original_stable_diffusion.py
%pip install -qq git+https://github.com/ShivamShrirao/diffusers
%pip install -q -U --pre triton
%pip install -q accelerate transformers ftfy bitsandbytes==0.35.0 gradio natsort safetensors xformers

In [None]:
#@title #Login to HuggingFace 🤗

#@markdown You need to accept the model license before downloading or using the Stable Diffusion weights. Please, visit the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5), read the license and tick the checkbox if you agree. You have to be a registered user in 🤗 Hugging Face Hub, and you'll also need to use an access token for the code to work.
# https://huggingface.co/settings/tokens
!mkdir -p ~/.huggingface
HUGGINGFACE_TOKEN = "" #@param {type:"string"}
!echo -n "{HUGGINGFACE_TOKEN}" > ~/.huggingface/token

------

## Settings and run

In [None]:
#@markdown If model weights should be saved directly in google drive (takes around 4-5 GB).
save_to_gdrive = True #@param {type:"boolean"}
if save_to_gdrive:
    from google.colab import drive
    drive.mount('/content/drive')

#@markdown Name/Path of the initial model.
MODEL_NAME = "runwayml/stable-diffusion-v1-5" #@param {type:"string"}

#@markdown Enter the directory name to save model at.

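OUTPUT_DIR = "/content/drive/MyDrive/test" #@param {type:"string"}
# Prepend a base directory only when a relative name is given;
# an absolute path (like the default) is used as-is to avoid a doubled prefix.
if not OUTPUT_DIR.startswith("/"):
    if save_to_gdrive:
        OUTPUT_DIR = "/content/drive/MyDrive/" + OUTPUT_DIR
    else:
        OUTPUT_DIR = "/content/" + OUTPUT_DIR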

print(f"[*] Weights will be saved at {OUTPUT_DIR}")

!mkdir -p $OUTPUT_DIR

-----

##Optional: BLIP 1 Autocaptioning of datasets



If you don't want to use BLIP, or if your dataset is already labeled, you can skip this step.

This section is taken (and modified slightly) from [victorchall](https://github.com/victorchall/EveryDream2trainer#docs)'s EveryDream 2 training notebook.
It uses the [Salesforce BLIP](https://github.com/salesforce/BLIP) tool to autocaption a given dataset. Captions are saved as the image filenames. These filenames can be extracted to text files and used with the Dataset Organization Omnitool below. In most cases this is not as accurate as hand-labeling a dataset, but it's MUCH faster. I plan to implement [BLIP 2](https://github.com/salesforce/LAVIS/blob/main/examples/blip_image_captioning.ipynb) soon.


In [None]:
#@title ### Download Repo
!git clone https://github.com/victorchall/EveryDream.git
# Set working directory
%cd EveryDream

In [None]:
#@title ###Install Requirements
# skip_cell_for_run_all()  # not defined in this notebook; calling it raises NameError
!pip install torch=='1.12.1+cu113' 'torchvision==0.13.1+cu113' --extra-index-url https://download.pytorch.org/whl/cu113
# Quote the requirement so the shell does not treat ">" as an output redirect.
!pip install 'pandas>=1.3.5'
!git clone https://github.com/salesforce/BLIP scripts/BLIP
!pip install timm
!pip install fairscale=='0.4.4'
!pip install transformers=='4.19.2'
!pip install aiofiles

### Upload your dataset to Google Drive (NOT to the Colab instance--doing this is very slow).
Name it something you'll be able to remember/find easily. 

### Auto-Captioning

*You cannot have commented lines between uncommented lines.  If you uncomment a line below, move it above any other commented lines.*

*!python must remain the first line.*

Default params should work fairly well.


In [None]:
!python scripts/auto_caption.py \
--img_dir /content/drive/MyDrive/YourDataset \
--out_dir /content/drive/MyDrive/output \
#--format mrwho \
#--min_length 34 \
#--q_factor 1.3 \
#--nucleus \

#IMPORTANT NOTE: replace "YourDataset" in the --img_dir line with your dataset folder name
##ANOTHER NOTE: if you want to overwrite your original filenames instead of writing the output files to a new directory,
##simply make your output path the same as your input path.


That's it! Once your dataset is autocaptioned, you can use these captions in the Dataset Organization Omnitool below!

------

## **KaliYuga's Dataset Organization Omnitool**


# Methods of Use

###**Method 1: Use Image Filenames As Instance Prompts**
####***What Method 1 Does***
When you specify the path to your image dataset, running the cell creates a text file for each image containing its caption (the filename without its extension). These captions are then used instead of instance prompts. 

**To use this**, simply do not input instance/class prompts or class path below. **You will still need to input an instance_directory path**, as this is the path to all your dataset images.
<br></br>

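To see what Method 1 would produce before anything is written to disk, the mapping can be sketched as a dry run. This is only an illustration, not part of the Omnitool: `preview_captions` is a hypothetical helper that mirrors the `.jpg`/`.png` filter and the `<image filename>.txt` naming used in the Run cell further down.

```python
import os
import tempfile

def preview_captions(data_dir):
    """Return (txt_path, caption) pairs without writing any files.

    Hypothetical helper: 'a red potion.jpg' maps to a sidecar named
    'a red potion.jpg.txt' whose text is the filename without its extension.
    """
    pairs = []
    for subdir, _, files in os.walk(data_dir):
        for file in sorted(files):
            if file.endswith(".jpg") or file.endswith(".png"):
                caption, _ext = os.path.splitext(file)
                pairs.append((os.path.join(subdir, file + ".txt"), caption))
    return pairs

# Demo on a throwaway directory so nothing in your Drive is touched.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("a red potion.jpg", "blue shield.png", "notes.md"):
        open(os.path.join(demo_dir, name), "w").close()
    for txt_path, caption in preview_captions(demo_dir):
        print(os.path.basename(txt_path), "->", caption)
```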
###**Method 2: Use Image Filenames alongside Instance Prompts and Class Prompts**

####***What Method 2 Does***
Like Method 1, this method extracts the filenames of all the images in a specified directory and saves each one to its own text file, which can then be used as training input. Unlike Method 1, though, you can use **instance prompts** and **class prompts** alongside the extracted image captions.
<br></br>

#####***What Are Instance and Class Prompts?***

Instance and class prompts are additional text descriptions of the image content that can help improve the quality of the model's output.

Instance prompts describe specific objects or features within an image, such as **"a red car"** or **"a smiling person."** Class prompts describe more general concepts or categories, such as **"a car"** or **"a person."** By including these prompts in the training data, the model can learn to associate specific features with broader categories, resulting in more accurate and nuanced results. 

Please note that, if you're using class prompts, you do not have to provide class images yourself. The class prompt you provide is used to generate them automatically with Stable Diffusion, and they are saved to your Drive.
<br></br>
#####**IMPORTANT NOTE:**

If you have the same word in both your instance and class prompt, it can lead to overfitting on that specific word. When this happens, the model may focus too heavily on that word and generate images that only match that word, rather than the overall concept. To avoid this, it's recommended to choose unique and distinct prompts for both the instance and class.
<br></br>

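A quick way to catch this before training is to compare the two prompts word by word. The sketch below is an assumption-laden illustration rather than part of the notebook: `shared_words` and the small `STOPWORDS` list are hypothetical, and the comparison is a plain lowercase word match, not the tokenizer the model uses.

```python
import re

# Assumption: a small, hypothetical stop list so articles like "a" are not flagged.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "with"}

def words(text):
    """Lowercase alphanumeric words in text, minus stop words."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS

def shared_words(instance_prompt, class_prompt):
    """Return the sorted words that appear in both prompts."""
    return sorted(words(instance_prompt) & words(class_prompt))

# Example with the default prompts from the Run cell.
overlap = shared_words("potion", "rpg item")
print(overlap if overlap else "no shared words")
```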
---

###**Pros and Cons of These Methods of Dataset Labeling**
**Pros--**much more accurate text/image pairs than a general instance prompt (assuming good image captioning)

**Cons--**hand-captioning datasets is slow going, and tools like BLIP are not always accurate.

####Run

In [None]:
import os
from tqdm.auto import tqdm
import json

def extract_image_filenames_to_txts(data_dir):
    for subdir, _, files in tqdm(os.walk(data_dir), desc="Extracting filenames"):
        for file in files:
            if file.endswith(".jpg") or file.endswith(".png"):
                filename, extension = os.path.splitext(file)
                with open(os.path.join(subdir, f"{filename}{extension}.txt"), "w") as f:
                    f.write(filename)

prompts = [
   {
       "instance_prompt":      "potion", ## for method 1, you can leave this blank
       "class_prompt":         "rpg item", ## for method 1, you can leave this blank
       "instance_data_dir":    "/content/drive/MyDrive/[your image dataset path]", 
       "class_data_dir":       "/content/drive/MyDrive/data/[folder to download class images/regularization images from class_prompt ]", ## for method 1, you can leave this blank
   }
#     {
#         "instance_prompt":      "ukj with a dark-haired woman in Hawaii",
#         "class_prompt":         "photo of a person",
#         "instance_data_dir":    "/content/data/ukj",
#         "class_data_dir":       "/content/data/photosofpeople"
#     }
]

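for prompt in prompts:
    instance_prompt = prompt["instance_prompt"]
    class_prompt = prompt["class_prompt"]
    data_dir = prompt["instance_data_dir"]
    # First pass: write the bare filename caption next to each image.
    extract_image_filenames_to_txts(data_dir)
    # Second pass: overwrite each caption with "filename|instance_prompt|class_prompt",
    # leaving out blank prompts so Method 1 captions stay as the plain filename.
    for subdir, _, files in os.walk(data_dir):
        for file in files:
            if file.endswith(".jpg") or file.endswith(".png"):
                filename, extension = os.path.splitext(file)
                caption = "|".join(part for part in (filename, instance_prompt, class_prompt) if part)
                with open(os.path.join(subdir, f"{filename}{extension}.txt"), "w") as f:
                    f.write(caption)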
                    
for c in prompts:
    os.makedirs(c["instance_data_dir"], exist_ok=True)

with open("concepts_list.json", "w") as f:
    json.dump(prompts, f, indent=4)


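Since the training step reads `concepts_list.json`, it can be worth sanity-checking the file first. The following is a minimal sketch, not part of the original tool: `check_concepts` is a hypothetical helper, and treating a leftover `[` in the path as an unfilled placeholder is an assumption based on the template paths in the cell above.

```python
import json
import os

REQUIRED_KEYS = {"instance_prompt", "class_prompt", "instance_data_dir", "class_data_dir"}

def check_concepts(concepts):
    """Return a list of problems found in a list of concept dicts (empty if none)."""
    problems = []
    for i, concept in enumerate(concepts):
        missing = REQUIRED_KEYS - set(concept)
        if missing:
            problems.append(f"concept {i}: missing keys {sorted(missing)}")
        # Assumption: square brackets mean a template path was never filled in.
        if "[" in concept.get("instance_data_dir", ""):
            problems.append(f"concept {i}: instance_data_dir still looks like a placeholder")
    return problems

# Only runs the check if the Omnitool cell has already written the file.
if os.path.exists("concepts_list.json"):
    with open("concepts_list.json") as f:
        for problem in check_concepts(json.load(f)):
            print(problem)
```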
--------

# Start Training

####Use the table below to choose the best flags based on your memory and speed requirements. Tested on Tesla T4 GPU.


| `fp16` | `train_batch_size` | `gradient_accumulation_steps` | `gradient_checkpointing` | `use_8bit_adam` | GB VRAM usage | Speed (it/s) |
| ---- | ------------------ | ----------------------------- | ----------------------- | --------------- | ---------- | ------------ |
| fp16 | 1                  | 1                             | TRUE                    | TRUE            | 9.92       | 0.93         |
| no   | 1                  | 1                             | TRUE                    | TRUE            | 10.08      | 0.42         |
| fp16 | 2                  | 1                             | TRUE                    | TRUE            | 10.4       | 0.66         |
| fp16 | 1                  | 1                             | FALSE                   | TRUE            | 11.17      | 1.14         |
| no   | 1                  | 1                             | FALSE                   | TRUE            | 11.17      | 0.49         |
| fp16 | 1                  | 2                             | TRUE                    | TRUE            | 11.56      | 1            |
| fp16 | 2                  | 1                             | FALSE                   | TRUE            | 13.67      | 0.82         |
| fp16 | 1                  | 2                             | FALSE                   | TRUE            | 13.7       | 0.83         |
| fp16 | 1                  | 1                             | TRUE                    | FALSE           | 15.79      | 0.77         |


* Add `--gradient_checkpointing` flag for around 9.92 GB VRAM usage.

* You must keep the `--read_prompts_from_txts` flag in order to use the image/text datasets created by the Omnitool.

* Remove the `--use_8bit_adam` flag for full precision. Requires 15.79 GB with `--gradient_checkpointing`, otherwise 17.8 GB.

* Remove the `--train_text_encoder` flag to reduce memory usage further, at the cost of degraded output quality. Not recommended with this dataset-building method.
<br></br>

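The table can also be read programmatically. The sketch below copies its rows and picks the fastest one that fits a given VRAM budget. It is an illustration under stated assumptions: `fastest_config` and `to_flags` are hypothetical helpers, the `fp16` column is assumed to map to `--mixed_precision`, and rows are ranked by the raw it/s figure, which does not account for batch size 2 processing two images per iteration.

```python
# Rows copied from the table: (precision, batch_size, grad_accum,
# gradient_checkpointing, use_8bit_adam, vram_gb, speed_it_s)
CONFIGS = [
    ("fp16", 1, 1, True,  True,  9.92,  0.93),
    ("no",   1, 1, True,  True,  10.08, 0.42),
    ("fp16", 2, 1, True,  True,  10.4,  0.66),
    ("fp16", 1, 1, False, True,  11.17, 1.14),
    ("no",   1, 1, False, True,  11.17, 0.49),
    ("fp16", 1, 2, True,  True,  11.56, 1.0),
    ("fp16", 2, 1, False, True,  13.67, 0.82),
    ("fp16", 1, 2, False, True,  13.7,  0.83),
    ("fp16", 1, 1, True,  False, 15.79, 0.77),
]

def fastest_config(available_gb):
    """Return the row with the highest it/s whose measured VRAM fits, or None."""
    fitting = [row for row in CONFIGS if row[5] <= available_gb]
    return max(fitting, key=lambda row: row[6]) if fitting else None

def to_flags(row):
    """Turn a table row into command-line flags for the training script."""
    precision, batch, accum, checkpointing, adam8, _vram, _speed = row
    flags = [
        f"--mixed_precision={precision}",
        f"--train_batch_size={batch}",
        f"--gradient_accumulation_steps={accum}",
    ]
    if checkpointing:
        flags.append("--gradient_checkpointing")
    if adam8:
        flags.append("--use_8bit_adam")
    return flags

# Example: a hypothetical 12 GB budget.
choice = fastest_config(12.0)
print(" ".join(to_flags(choice)) if choice else "no measured configuration fits")
```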
**Notes about training**

This method of text/image pair training seems to need a lower `--learning_rate` than other methods, or it overfits quickly. `1e-6` is probably the highest LR you want with most datasets. If it overfits, drop it into the `1e-7` to `9e-7` range. This could be because the prompts and data in datasets created with the Omnitool are more complex and diverse, so the model may need more time and data to converge on a good solution, and a lower learning rate to avoid overfitting along the way.

The default of `1e-7` seems to work for small datasets. It could probably be raised somewhat, too.

If training fails with an error saying it can't find a specific text file, check for duplicate images or images with the .gif extension. The Omnitool only makes a .txt file for the first of two duplicates, and none at all for GIFs.

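One way to find the offending files before launching a long run is to list every image that has no matching text file. This is a minimal sketch: `images_missing_txt` is a hypothetical helper, the extension list is an assumption, and it assumes the `<image filename>.txt` naming that the Omnitool cell writes.

```python
import os

# Assumption: these are the extensions worth flagging; adjust to your dataset.
IMAGE_EXTS = (".jpg", ".jpeg", ".png", ".gif", ".webp", ".bmp")

def images_missing_txt(data_dir):
    """Return paths of images under data_dir that have no '<filename>.txt' next to them."""
    missing = []
    for subdir, _, files in os.walk(data_dir):
        for name in sorted(files):
            if name.lower().endswith(IMAGE_EXTS):
                if not os.path.exists(os.path.join(subdir, name + ".txt")):
                    missing.append(os.path.join(subdir, name))
    return missing

# Example: replace the path with your own instance_data_dir.
for path in images_missing_txt("/content/drive/MyDrive/YourDataset"):
    print("no caption file for", path)
```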
In [None]:
!accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --pretrained_vae_name_or_path="stabilityai/sd-vae-ft-mse" \
  --output_dir=$OUTPUT_DIR \
  --revision="fp16" \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --seed=1337 \
  --resolution=512 \
  --train_batch_size=1 \
  --train_text_encoder \
  --mixed_precision="fp16" \
  --use_8bit_adam \
  --read_prompts_from_txts \
  --gradient_accumulation_steps=1 \
  --learning_rate=1e-7 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=200 \
  --sample_batch_size=4 \
  --max_train_steps=80000 \
  --save_interval=500 \
  --save_sample_prompt="a beautiful favewave desert nighttime" \
  --concepts_list="concepts_list.json"
# Keep the number of class images close-ish to number of dataset images 
# Reduce the `--save_interval` to lower than `--max_train_steps` to save weights from intermediate steps.
# `--save_sample_prompt` can be same as `--instance_prompt` to generate intermediate samples (saved along with weights in samples directory).

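To pick a `--num_class_images` value close to the dataset size, as the first comment above suggests, the images in each concept's folder can be counted. The sketch below is an illustration only: `count_images` is a hypothetical helper that uses the same `.jpg`/`.png` filter as the Omnitool cell.

```python
import json
import os

def count_images(data_dir, exts=(".jpg", ".png")):
    """Count files under data_dir (recursively) whose names end with one of exts."""
    return sum(
        1
        for _, _, files in os.walk(data_dir)
        for name in files
        if name.endswith(exts)
    )

# Only reports counts if the Omnitool cell has already written concepts_list.json.
if os.path.exists("concepts_list.json"):
    with open("concepts_list.json") as f:
        for concept in json.load(f):
            n = count_images(concept["instance_data_dir"])
            print(f"{concept['instance_data_dir']}: {n} images")
```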
In [None]:
#@markdown Specify the weights directory to use (leave blank for latest)
WEIGHTS_DIR = "" #@param {type:"string"}
if WEIGHTS_DIR == "":
    from natsort import natsorted
    from glob import glob
    import os
    WEIGHTS_DIR = natsorted(glob(OUTPUT_DIR + os.sep + "*"))[-1]
print(f"[*] WEIGHTS_DIR={WEIGHTS_DIR}")

In [None]:
#@markdown Run to generate a grid of preview images from the last saved weights.
import os
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

weights_folder = OUTPUT_DIR
# Keep only numeric step folders (skipping "0") so the int() sort cannot fail on other entries.
folders = sorted([f for f in os.listdir(weights_folder) if f.isdigit() and f != "0"], key=int)

row = len(folders)
col = len(os.listdir(os.path.join(weights_folder, folders[0], "samples")))
scale = 4
fig, axes = plt.subplots(row, col, figsize=(col*scale, row*scale), gridspec_kw={'hspace': 0, 'wspace': 0})

for i, folder in enumerate(folders):
    folder_path = os.path.join(weights_folder, folder)
    image_folder = os.path.join(folder_path, "samples")
    images = [f for f in os.listdir(image_folder)]
    for j, image in enumerate(images):
        if row == 1:
            currAxes = axes[j]
        else:
            currAxes = axes[i, j]
        if i == 0:
            currAxes.set_title(f"Image {j}")
        if j == 0:
            currAxes.text(-0.1, 0.5, folder, rotation=0, va='center', ha='center', transform=currAxes.transAxes)
        image_path = os.path.join(image_folder, image)
        img = mpimg.imread(image_path)
        currAxes.imshow(img, cmap='gray')
        currAxes.axis('off')
        
plt.tight_layout()
plt.savefig('grid.png', dpi=72)

## Convert weights to ckpt to use in web UIs like AUTOMATIC1111.

In [None]:
#@markdown Run conversion.
ckpt_path = WEIGHTS_DIR + "/model.ckpt"

half_arg = ""
#@markdown  Whether to convert to fp16, takes half the space (2GB).
fp16 = True #@param {type: "boolean"}
if fp16:
    half_arg = "--half"
!python convert_diffusers_to_original_stable_diffusion.py --model_path $WEIGHTS_DIR  --checkpoint_path $ckpt_path $half_arg
print(f"[*] Converted ckpt saved at {ckpt_path}")

## Inference

In [None]:
import torch
from torch import autocast
from diffusers import StableDiffusionPipeline, DDIMScheduler
from IPython.display import display

model_path = WEIGHTS_DIR             # If you want to use previously trained model saved in gdrive, replace this with the full path of model in gdrive

pipe = StableDiffusionPipeline.from_pretrained(model_path, safety_checker=None, torch_dtype=torch.float16).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe.enable_xformers_memory_efficient_attention()
g_cuda = None

In [None]:
#@markdown Can set random seed here for reproducibility.
g_cuda = torch.Generator(device='cuda')
seed = 52362 #@param {type:"number"}
g_cuda.manual_seed(seed)

In [None]:
#@title Run for generating images.

prompt = "" #@param {type:"string"}
negative_prompt = "" #@param {type:"string"}
num_samples = 4 #@param {type:"number"}
guidance_scale = 7.5 #@param {type:"number"}
num_inference_steps = 24 #@param {type:"number"}
height = 512 #@param {type:"number"}
width = 512 #@param {type:"number"}

with autocast("cuda"), torch.inference_mode():
    images = pipe(
        prompt,
        height=height,
        width=width,
        negative_prompt=negative_prompt,
        num_images_per_prompt=num_samples,
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale,
        generator=g_cuda
    ).images

for img in images:
    display(img)

In [None]:
#@markdown Run Gradio UI for generating images.
import gradio as gr

def inference(prompt, negative_prompt, num_samples, height=512, width=512, num_inference_steps=50, guidance_scale=7.5):
    with torch.autocast("cuda"), torch.inference_mode():
        return pipe(
                prompt, height=int(height), width=int(width),
                negative_prompt=negative_prompt,
                num_images_per_prompt=int(num_samples),
                num_inference_steps=int(num_inference_steps), guidance_scale=guidance_scale,
                generator=g_cuda
            ).images

with gr.Blocks() as demo:
    with gr.Row():
        with gr.Column():
            prompt = gr.Textbox(label="Prompt", value="photo of zwx dog in a bucket")
            negative_prompt = gr.Textbox(label="Negative Prompt", value="")
            run = gr.Button(value="Generate")
            with gr.Row():
                num_samples = gr.Number(label="Number of Samples", value=4)
                guidance_scale = gr.Number(label="Guidance Scale", value=7.5)
            with gr.Row():
                height = gr.Number(label="Height", value=512)
                width = gr.Number(label="Width", value=512)
            num_inference_steps = gr.Slider(label="Steps", value=24)
        with gr.Column():
            gallery = gr.Gallery()

    run.click(inference, inputs=[prompt, negative_prompt, num_samples, height, width, num_inference_steps, guidance_scale], outputs=gallery)

demo.launch(debug=True)

In [None]:
#@title (Optional) Delete diffuser and old weights and only keep the ckpt to free up drive space.

#@markdown [ ! ] Caution: only execute this if you are sure you want to delete the diffusers-format weights and keep only the ckpt.
import shutil
from glob import glob
import os
for f in glob(OUTPUT_DIR+os.sep+"*"):
    if f != WEIGHTS_DIR:
        shutil.rmtree(f)
        print("Deleted", f)
for f in glob(WEIGHTS_DIR+"/*"):
    # Skip .ckpt and .json files; the original "or" condition was always true.
    if not f.endswith((".ckpt", ".json")):
        try:
            shutil.rmtree(f)
        except NotADirectoryError:
            continue
        print("Deleted", f)

In [None]:
#@title Free runtime memory
exit()