# Scalable data preprocessing pipeline for Stable Diffusion

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/stable-diffusion/preprocessing_architecture_v4.jpeg" width="900px">

The preceding architecture diagram illustrates the data preprocessing pipeline for Stable Diffusion. 

Ray Data loads the data from a remote storage system, then streams it through two main processing stages:
1. **Transformation**
   1. Cropping and normalizing images.
   2. Tokenizing the text captions using a CLIP tokenizer.
2. **Encoding**
   1. Compressing images into a latent space using a VAE encoder.
   2. Generating text embeddings using a CLIP model.
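
The two stages above can be sketched in plain Python to show how batches flow through them. This is a minimal illustration, not the code in `Preprocessing.py`: the function names, the dict-of-lists batch layout, the `[-1, 1]` normalization range, and the placeholder tokenizer and encoders are all assumptions standing in for the real CLIP tokenizer, VAE encoder, and CLIP model.

```python
from typing import Dict, Iterable, Iterator, List

Image = List[List[float]]  # a grayscale H x W grid stands in for a real image tensor
Batch = Dict[str, list]    # hypothetical batch layout: column name -> list of values


def center_crop(image: Image, size: int) -> Image:
    """Cut a size x size window from the middle of the image."""
    top = (len(image) - size) // 2
    left = (len(image[0]) - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


def normalize(image: Image) -> Image:
    """Map pixel values from [0, 255] to [-1, 1] (assumed range)."""
    return [[px / 127.5 - 1.0 for px in row] for row in image]


def transform(batch: Batch, size: int = 2) -> Batch:
    """Stage 1: crop and normalize images, tokenize captions."""
    return {
        "image": [normalize(center_crop(img, size)) for img in batch["image"]],
        # Placeholder tokenizer: whitespace split instead of the CLIP tokenizer.
        "tokens": [caption.lower().split() for caption in batch["caption"]],
    }


def encode(batch: Batch) -> Batch:
    """Stage 2: placeholder encoders standing in for the VAE and CLIP models."""
    return {
        # Stand-in "latent": the mean pixel value of each image.
        "image_latent": [
            sum(map(sum, img)) / (len(img) * len(img[0])) for img in batch["image"]
        ],
        # Stand-in "embedding": the token count of each caption.
        "text_embedding": [float(len(toks)) for toks in batch["tokens"]],
    }


def stream(batches: Iterable[Batch]) -> Iterator[Batch]:
    """Pull batches lazily through both stages, one batch at a time."""
    for batch in batches:
        yield encode(transform(batch))


raw = [{
    "image": [[[0.0, 0.0, 0.0, 0.0],
               [0.0, 255.0, 255.0, 0.0],
               [0.0, 255.0, 255.0, 0.0],
               [0.0, 0.0, 0.0, 0.0]]],
    "caption": ["A photo of a cat"],
}]
results = list(stream(raw))
```

With Ray Data, batch functions shaped like `transform` and `encode` would typically be applied through `Dataset.map_batches`, which lets each stage request its own CPU or GPU resources.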

This notebook executes a fully self-contained module, `Preprocessing.py`, that processes a small subset of the full 2-billion-image dataset to demonstrate the workload. You can parameterize the same module code to process the full dataset. The "Scale to 2 billion images" section below summarizes the changes needed to read and process the full dataset.

Run the following cell to perform the data preprocessing. The script loads the data, transforms it, and encodes the output. After the cell finishes, review the two visualized sample inputs and their corresponding outputs.

```python
%run scripts/Preprocessing.py
```


## Scale to 2 billion images

The following table summarizes the changes you need to make to scale the data processing to the full dataset:

| Step | Change | Description |
| --- | --- | --- |
| 1 | Raw Data Path | Change to point to the full dataset |
| 2 | Data Loading Workers | Increase from 1 to 192 CPUs |
| 3 | Transformation Workers | Increase from 1 to 192 CPUs |
| 4 | Batch Size | Use 120 for 256x256 images and 40 for 512x512 images |
| 5 | Encoding Workers | Increase from 0 to 48 A10G GPUs |
| 6 | Output Path | Change to a permanent remote storage location |
| 7 | Run Process | Run as an Anyscale Job |
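
One way to keep these changes in a single place is a small configuration object. The sketch below is hypothetical: `PreprocessingConfig`, its field names, and the angle-bracket path placeholders are illustrative and are not the actual parameters of `Preprocessing.py`. Only the worker counts and batch sizes come from the table.

```python
from dataclasses import dataclass, replace

# Batch sizes from the table, keyed by square image resolution in pixels.
BATCH_SIZE_BY_RESOLUTION = {256: 120, 512: 40}


@dataclass(frozen=True)
class PreprocessingConfig:
    """Hypothetical settings object mirroring the rows of the table."""
    raw_data_path: str            # step 1
    num_data_loading_cpus: int    # step 2
    num_transformation_cpus: int  # step 3
    resolution: int               # selects the batch size in step 4
    num_encoding_gpus: int        # step 5
    output_path: str              # step 6
    run_as_anyscale_job: bool     # step 7

    @property
    def batch_size(self) -> int:
        return BATCH_SIZE_BY_RESOLUTION[self.resolution]


# Small-subset settings: the "from" values in the table.
demo = PreprocessingConfig(
    raw_data_path="<path-to-small-subset>",  # placeholder, not a real path
    num_data_loading_cpus=1,
    num_transformation_cpus=1,
    resolution=256,
    num_encoding_gpus=0,
    output_path="<temporary-output-path>",  # placeholder, not a real path
    run_as_anyscale_job=False,
)

# Full-dataset settings: the "to" values in the table.
full = replace(
    demo,
    raw_data_path="<path-to-full-dataset>",  # placeholder
    num_data_loading_cpus=192,
    num_transformation_cpus=192,
    num_encoding_gpus=48,
    output_path="<permanent-remote-storage-path>",  # placeholder
    run_as_anyscale_job=True,
)
```

Deriving the full-scale settings from the demo settings with `replace` makes the difference between the two runs explicit and leaves the demo configuration unchanged.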

For infrastructure, provision 48 g5.2xlarge instances for the entire process, or use Anyscale's autoscaling to scale up and down as needed.
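
The instance count is consistent with the worker counts in the table, which the following back-of-the-envelope check illustrates. It assumes that each CPU worker occupies one vCPU, that data loading and transformation run concurrently, and that each g5.2xlarge provides 8 vCPUs and 1 A10G GPU. The function name is hypothetical, and the estimate ignores any headroom the cluster itself needs.

```python
import math

VCPUS_PER_INSTANCE = 8  # g5.2xlarge
GPUS_PER_INSTANCE = 1   # one A10G per g5.2xlarge


def instances_needed(cpu_workers: int, gpu_workers: int) -> int:
    """Smallest instance count that covers both the CPU and the GPU demand."""
    by_cpu = math.ceil(cpu_workers / VCPUS_PER_INSTANCE)
    by_gpu = math.ceil(gpu_workers / GPUS_PER_INSTANCE)
    return max(by_cpu, by_gpu)


# 192 loading CPUs + 192 transformation CPUs, and 48 encoding GPUs.
count = instances_needed(cpu_workers=192 + 192, gpu_workers=48)
print(count)  # 48
```

Under these assumptions the CPU and GPU requirements both resolve to 48 instances, so neither resource sits idle by design.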


## Want to pre-train with custom data? 📈

If you're looking to scale your Stable Diffusion pre-training with custom data, we're here to help 🙌 !

👉 **[Check out this link](https://forms.gle/9aDkqAqobBctxxMa8) so we can assist you**.