# Text to Image Generation

In this notebook, we explore **Text-to-Image Generation**, a fascinating subfield of **Generative AI** that converts natural language prompts into images using advanced deep learning models like **DALL·E**, **Stable Diffusion**, and **Midjourney**.

## 🎯 Learning Objectives

By the end of this notebook, you will:

- Understand how text-to-image generation models work
- Learn the architecture behind diffusion-based models
- Implement basic image generation using the `diffusers` library
- Explore prompt engineering for better image generation
- Evaluate generated images qualitatively and quantitatively

## 🧩 1. Introduction to Text-to-Image Models

Text-to-Image models take a **text prompt** (e.g., *"a futuristic city in the clouds"*) and generate a corresponding image.

### Key Model Types:
- **GAN-based models:** e.g., AttnGAN, StackGAN
- **Diffusion-based models:** e.g., Stable Diffusion, Imagen, DALL·E 2
- **Autoregressive transformer models:** e.g., Parti, CogView

Modern systems primarily use **Diffusion Models**, which iteratively *denoise* random noise into structured images guided by text embeddings.

## 🧠 2. Understanding Diffusion Models

Diffusion models work in two phases:

1. **Forward Process:** Gradually adds noise to an image until it becomes pure noise.
2. **Reverse Process:** A neural network learns to remove the noise step by step, so sampling can start from random noise and end at a realistic image.
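The two phases above can be sketched on a toy example. This is a minimal illustration in plain Python, not how any particular library implements it: the linear noise schedule, its endpoints (`1e-4` to `0.02`), the 1000 steps, and the helper names `make_alpha_bars`, `add_noise`, and `remove_noise` are all assumptions chosen for illustration. The forward step uses the standard closed form $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$; the reverse step here cheats by reusing the true noise, which is the quantity a trained model would have to predict.

```python
import math
import random


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed values)."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, noise, alpha_bar):
    """Forward process in closed form: x_t = sqrt(ab) * x0 + sqrt(1 - ab) * noise."""
    signal = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    return [signal * x + sigma * n for x, n in zip(x0, noise)]


def remove_noise(xt, noise, alpha_bar):
    """Invert add_noise given the noise; a trained network would predict this noise."""
    signal = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    return [(x - sigma * n) / signal for x, n in zip(xt, noise)]


rng = random.Random(0)
x0 = [0.8, -0.3, 0.5, 0.1]  # toy "image" with four pixel values
noise = [rng.gauss(0.0, 1.0) for _ in x0]
alpha_bars = make_alpha_bars()

# The share of the original signal shrinks as t grows.
for t in (0, 499, 999):
    xt = add_noise(x0, noise, alpha_bars[t])
    print(f"t={t:4d}  signal weight={math.sqrt(alpha_bars[t]):.4f}  x_t={[round(v, 3) for v in xt]}")

# With the exact noise, the original values are recovered.
recovered = remove_noise(add_noise(x0, noise, alpha_bars[499]), noise, alpha_bars[499])
print("recovered:", [round(v, 3) for v in recovered])
```

In a real model the noise is not available at sampling time, so the network's estimate replaces it and the image is refined over many small steps rather than in one jump.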

During training and sampling, the denoising network is conditioned on **text embeddings** produced by a text encoder such as CLIP or T5, which steers the generated image toward the text description.

In [1]:
# Install the required library (if running locally)
# !pip install diffusers transformers accelerate torch torchvision matplotlib

## ⚙️ 3. Setup for Stable Diffusion using 🤗 Diffusers

In [2]:
from diffusers import StableDiffusionPipeline
import torch
from PIL import Image

# Load model (gated or private models require a Hugging Face access token)
# model_id = 'runwayml/stable-diffusion-v1-5'
# device = 'cuda' if torch.cuda.is_available() else 'cpu'
# dtype = torch.float16 if device == 'cuda' else torch.float32  # half precision is only practical on GPU
# pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=dtype).to(device)

## 🖼️ 4. Generate Images from Text Prompts

In [3]:
# Example text prompt
prompt = "A futuristic city in the clouds with flying cars and neon lights"

# Generate image (commented out for compatibility)
# image = pipe(prompt).images[0]
# image.save('futuristic_city.png')
# image.show()

## 🧮 5. Prompt Engineering Techniques

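More specific prompts usually produce images closer to what you intend. Naming the subject, medium or style, lighting, and setting gives the model more to work with than a bare description.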

### Example Prompt Improvements

- **Simple:** “A cat on a mat.”  
- **Better:** “A realistic photo of a fluffy white cat sitting on a red mat, sunlight through the window.”  
- **Creative:** “A digital painting of a majestic cat meditating on a glowing mat under a neon moon.”
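One way to make such refinements repeatable is to assemble prompts from named parts. The helper below is a hypothetical convenience function (`build_prompt` and its parameters are not part of any library), shown only to illustrate how the "Simple" and "Better" prompts above differ in structure.

```python
def build_prompt(subject, medium=None, details=(), lighting=None):
    """Join prompt components into one comma-separated string (illustrative helper)."""
    parts = [f"{medium} of {subject}" if medium else subject]
    parts.extend(details)
    if lighting:
        parts.append(lighting)
    return ", ".join(parts)


simple = build_prompt("a cat on a mat")
better = build_prompt(
    "a fluffy white cat sitting on a red mat",
    medium="a realistic photo",
    lighting="sunlight through the window",
)

print(simple)
print(better)
```

Keeping the components separate makes it easy to vary one of them, such as the medium, while holding the rest fixed and comparing the resulting images.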

You can also use **negative prompts** to list details you want the model to avoid (e.g., “blurry” or “low quality”).
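With the `diffusers` Stable Diffusion pipeline, a negative prompt is passed through the `negative_prompt` argument of the pipeline call. The sketch below only builds the keyword arguments so it runs without a model. `generation_kwargs` is a hypothetical helper, and the step count and guidance scale are illustrative values, not recommendations.

```python
def generation_kwargs(prompt, negative_terms=(), steps=30, guidance_scale=7.5):
    """Collect keyword arguments for a Stable Diffusion pipeline call (illustrative helper)."""
    kwargs = {
        "prompt": prompt,
        "num_inference_steps": steps,
        "guidance_scale": guidance_scale,
    }
    if negative_terms:
        # The pipeline expects the negative prompt as a single string.
        kwargs["negative_prompt"] = ", ".join(negative_terms)
    return kwargs


kwargs = generation_kwargs(
    "A realistic photo of a fluffy white cat sitting on a red mat",
    negative_terms=["blurry", "low quality"],
)
print(kwargs)

# With a loaded pipeline (see the setup section), this would be used as:
# image = pipe(**kwargs).images[0]
```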

## 🧠 6. Evaluating Generated Images

Evaluation is mostly qualitative, but a few metrics exist:

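- **FID (Fréchet Inception Distance):** Compares the distribution of Inception-network features of generated images with that of real images; lower is better
- **IS (Inception Score):** Rewards images that are individually recognizable and collectively diverse, without reference to real data; higher is better
- **CLIPScore:** Measures how well an image matches its text prompt using the similarity of their CLIP embeddings; higher is better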
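To make two of these metrics concrete, here is a toy sketch in plain Python. It rests on loud assumptions: real CLIPScore uses embeddings from a CLIP model rather than hand-written vectors, the scale factor differs between implementations (the value `100.0` here is an assumed convention), and real FID fits multivariate Gaussians to Inception features, whereas this version uses one-dimensional samples, for which the Fréchet distance reduces to $(\mu_r - \mu_g)^2 + (\sigma_r - \sigma_g)^2$.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_style_score(image_emb, text_emb, scale=100.0):
    """Scaled cosine similarity, clipped at zero (scale is an assumed convention)."""
    return max(scale * cosine_similarity(image_emb, text_emb), 0.0)


def frechet_distance_1d(real, generated):
    """Fréchet distance between 1-D Gaussians fitted to two samples."""
    def mean_std(xs):
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        return mean, math.sqrt(var)

    mu_r, sd_r = mean_std(real)
    mu_g, sd_g = mean_std(generated)
    return (mu_r - mu_g) ** 2 + (sd_r - sd_g) ** 2


# Made-up embeddings: the first pair points the same way, the second is orthogonal.
print(clip_style_score([1.0, 0.0], [2.0, 0.0]))
print(clip_style_score([1.0, 0.0], [0.0, 1.0]))

# Made-up 1-D "features": same spread, means one unit apart.
print(frechet_distance_1d([0.0, 2.0], [1.0, 3.0]))
```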

In [4]:
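# Example placeholder for CLIP-based evaluation (requires torchmetrics and torchvision)
# from torchmetrics.multimodal import CLIPScore
# from torchvision.transforms.functional import pil_to_tensor
# metric = CLIPScore(model_name_or_path='openai/clip-vit-base-patch16')
# # CLIPScore expects uint8 image tensors of shape (C, H, W), not PIL images
# score = metric(pil_to_tensor(image), prompt)
# print('CLIPScore:', score.item())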

## 🚀 7. Key Takeaways

- Text-to-image models translate language into rich visual representations.
- Diffusion models like **Stable Diffusion** dominate modern generation tasks.
- **Prompt engineering** plays a huge role in output quality.
- Image evaluation is often subjective but can be aided with metrics like **FID** and **CLIPScore**.

## 🔮 8. What’s Next?

- Explore **text-to-video** models (e.g., Runway Gen-2, Pika Labs)
- Learn **fine-tuning** for personalized image generation (DreamBooth, LoRA)
- Experiment with **multi-modal pipelines** combining text, images, and audio