How do text-to-image AIs like Stable Diffusion and DALL-E 2 work?
Text-to-image AI models such as StackGAN, DALL-E 2, and others are generative models capable of converting textual descriptions into images. These models combine several deep learning techniques: earlier systems such as StackGAN relied on convolutional neural networks (CNNs) paired with recurrent text encoders, while newer systems such as DALL-E 2 are built on transformers and diffusion models.

The basic architecture of a text-to-image AI model typically consists of two components: a text encoder and an image generator. The text encoder takes in a textual description and maps it to a fixed-length representation, known as a latent code. The image generator then uses that latent code to produce an image conditioned on the input text.
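As a rough illustration, here is a minimal PyTorch sketch of that two-part design. The module names, layer sizes, and the choice of an LSTM text encoder are illustrative assumptions, not any specific model's architecture:

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Maps a sequence of token ids to a fixed-length latent code."""
    def __init__(self, vocab_size=10000, embed_dim=256, latent_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, latent_dim, batch_first=True)

    def forward(self, token_ids):
        embedded = self.embed(token_ids)          # (batch, seq, embed_dim)
        _, (hidden, _) = self.rnn(embedded)       # keep the final hidden state
        return hidden.squeeze(0)                  # (batch, latent_dim)

class ImageGenerator(nn.Module):
    """Upsamples a latent code into an RGB image with transposed convolutions."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, 256, 4, 1, 0), nn.ReLU(),  # 1x1 -> 4x4
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),         # 4x4 -> 8x8
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),          # 8x8 -> 16x16
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),            # 16x16 -> 32x32
        )

    def forward(self, latent_code):
        # Treat the latent code as a 1x1 "image" with latent_dim channels.
        return self.net(latent_code[:, :, None, None])

encoder, generator = TextEncoder(), ImageGenerator()
tokens = torch.randint(0, 10000, (1, 12))   # a stand-in tokenized caption
image = generator(encoder(tokens))          # (1, 3, 32, 32)
```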

These models are trained on large datasets of text-image pairs. During training, the model is optimized to produce images that resemble the real images in the dataset while remaining semantically consistent with the input text.
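A training step for such a model might look roughly like the following, reusing the hypothetical modules sketched above. The plain pixel-wise loss here is a deliberately simplified stand-in; real systems use adversarial or diffusion objectives:

```python
import torch
import torch.nn.functional as F

# Assumes `encoder` and `generator` from the sketch above, and batches of
# (token_ids, real_images) pairs where real_images match the generator's
# output shape.
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(generator.parameters()), lr=1e-4
)

def train_step(token_ids, real_images):
    generated = generator(encoder(token_ids))
    # Push the generated image toward the real image paired with this caption,
    # which keeps the output consistent with the input text.
    loss = F.mse_loss(generated, real_images)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```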

For example, in the case of StackGAN, the model is trained in two stages. In the first stage, a lower-resolution image is generated using the latent code. In the second stage, the lower-resolution image is used as input to generate a higher-resolution image. This two-stage process allows the model to generate images that are more detailed and closer to the real images in the training dataset.
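Schematically, the two stages compose like this. The resolutions (64 and 256 pixels) follow StackGAN's setup, but both functions are hypothetical placeholders that only show the data flow:

```python
import torch
import torch.nn.functional as F

def stage_one(text_embedding, noise):
    # Placeholder for Stage-I: in the real model, an upsampling CNN turns the
    # text embedding plus noise into a coarse 64x64 image.
    batch = text_embedding.shape[0]
    return torch.rand(batch, 3, 64, 64)

def stage_two(low_res_image, text_embedding):
    # Placeholder for Stage-II: the real model re-encodes the 64x64 image,
    # fuses it with the text embedding, and decodes a detailed 256x256 image.
    # Here we just upsample to make the pipeline runnable.
    return F.interpolate(low_res_image, size=(256, 256), mode="bilinear",
                         align_corners=False)

text_embedding = torch.randn(1, 128)
noise = torch.randn(1, 100)
low_res = stage_one(text_embedding, noise)     # coarse layout and colors
high_res = stage_two(low_res, text_embedding)  # refined 256x256 output
```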

In the case of DALL-E 2, the model uses a transformer-based CLIP text encoder to map the input text into an embedding. A separate "prior" network then translates that text embedding into a corresponding image embedding, and a diffusion decoder uses the image embedding to generate the final image.
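A sketch of that three-stage data flow, with every function a hypothetical placeholder standing in for a large trained network:

```python
import torch

def clip_text_encoder(prompt: str) -> torch.Tensor:
    """Stand-in for CLIP's transformer text encoder."""
    return torch.randn(1, 512)   # CLIP-style models emit a fixed-size embedding

def prior(text_embedding: torch.Tensor) -> torch.Tensor:
    """Stand-in for the prior that predicts an image embedding from text."""
    return torch.randn(1, 512)

def diffusion_decoder(image_embedding: torch.Tensor) -> torch.Tensor:
    """Stand-in for the diffusion decoder that renders pixels."""
    return torch.rand(1, 3, 64, 64)   # the real decoder also upsamples further

text_emb = clip_text_encoder("a corgi playing a trumpet")
image_emb = prior(text_emb)           # bridge from text space to image space
image = diffusion_decoder(image_emb)  # render the final image
```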

These text-to-image AI models have shown promising results in generating high-quality images that are semantically consistent with the input text. However, there is still room for improvement, and further research is needed to make these models more robust and capable of generating a wider range of images.
What about Stable Diffusion? How does that work?
Stable Diffusion uses a variant of diffusion model called a latent diffusion model. Diffusion models are trained to remove successive applications of Gaussian noise from training images, so the model can be thought of as a sequence of denoising autoencoders (see the training sketch after the list below). Stable Diffusion consists of three parts:

1. A variational autoencoder (VAE), which compresses images to and from a latent space.
2. A U-Net, which performs the iterative denoising.
3. A pretrained CLIP text encoder, which conditions the denoising on the prompt.
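To make the denoising objective concrete, here is a minimal sketch of one diffusion training step in latent space, following the standard DDPM-style noise-prediction loss. The `unet` noise predictor is assumed to exist, and the linear noise schedule is a simplification of what real implementations use:

```python
import torch
import torch.nn.functional as F

num_steps = 1000
# A simple linear variance schedule; production models tune this carefully.
betas = torch.linspace(1e-4, 0.02, num_steps)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def diffusion_training_step(unet, latents, text_embeddings):
    """One step of the noise-prediction objective, applied in latent space."""
    batch = latents.shape[0]
    t = torch.randint(0, num_steps, (batch,))         # random timestep per sample
    noise = torch.randn_like(latents)                 # the target to predict
    a = alphas_cumprod[t].view(batch, 1, 1, 1)
    noisy_latents = a.sqrt() * latents + (1 - a).sqrt() * noise
    predicted_noise = unet(noisy_latents, t, text_embeddings)
    return F.mse_loss(predicted_noise, noise)         # denoising loss
```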

The CLIP text encoder tokenizes the prompt and transforms it into an embedding space. These text embeddings are passed to the U-Net, which is built on a ResNet backbone with cross-attention layers that inject the text conditioning. Generation starts from Gaussian noise sampled directly in the latent space (during training, it is the VAE encoder that maps real images into this space). Over several iterations, the U-Net denoises this latent, effectively running the forward diffusion process in reverse, to obtain a latent representation of the image. Finally, the VAE decoder generates the final image by converting that representation back into pixel space.
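Putting the pieces together, a stripped-down sampling loop looks roughly like this. The component objects are placeholders, and the update rule is deliberately crude; real samplers such as DDIM or PNDM use more careful update equations:

```python
import torch

def generate(prompt, clip_encoder, unet, vae_decoder, num_steps=50):
    """Simplified text-to-image sampling loop; components are placeholders."""
    text_embeddings = clip_encoder(prompt)        # prompt -> text embeddings
    latents = torch.randn(1, 4, 64, 64)           # start from pure latent noise
    for t in reversed(range(num_steps)):
        predicted_noise = unet(latents, t, text_embeddings)
        latents = latents - predicted_noise / num_steps  # crude denoising step
    return vae_decoder(latents)                   # latents -> pixel space
```

In practice you would not write this loop by hand; for example, Hugging Face's diffusers library wraps the whole pipeline (the checkpoint identifier below is one commonly used name and may change over time):

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
image = pipe("a photograph of an astronaut riding a horse").images[0]
```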