# Title: Quantitative Comparison of Stable Diffusion Performance on Human Faces and Cats

#### Group Member Names : Hugo Garcia, Jesal Patel



### INTRODUCTION:
*********************************************************************************************************************
#### AIM : To evaluate the quality and realism of images generated by the Stable Diffusion model using cat images as a benchmark, and compute the Fréchet Inception Distance (FID) between generated and real cat images from the Google Open Images Dataset V7

*********************************************************************************************************************
#### Github Repo:
Original https://github.com/aliborji/GFW
Our project: https://github.com/Hiugo15/AIDI1002_final_project

*********************************************************************************************************************
#### DESCRIPTION OF PAPER: The paper conducts a comprehensive evaluation of three leading text-to-image diffusion models (Stable Diffusion, Midjourney, and DALL·E 2) on tasks related to face generation. It introduces benchmarks that consider aesthetic quality, identity preservation, and realism. The authors also release a dataset and metrics to support future studies in generative image evaluation.

*********************************************************************************************************************
#### PROBLEM STATEMENT : Can the performance of a text-to-image generative model (specifically Stable Diffusion) be reliably evaluated using FID scores when the domain of interest shifts from human faces to non-human subjects (e.g., cats)? How well does the model generalize in generating high-quality images of non-human categories?

*********************************************************************************************************************
#### CONTEXT OF THE PROBLEM: Most benchmarks for diffusion models focus on human faces or aesthetic art. However, diffusion models are used in diverse real-world scenarios that include animals, objects, and scenes. Evaluating model performance on less human-centric content like cats provides insight into the generalization capacity of these models. Using the Google Open Images Dataset V7 and modifying the original paper's pipeline for local computation, we aim to test this capacity with cat images.
*********************************************************************************************************************
#### SOLUTION: Use the Google Open Images Dataset V7 to extract real cat images, generate corresponding synthetic cat images using Stable Diffusion and carefully designed prompts, then calculate the FID score between real and generated images. The original code from the GFW paper is adapted for local execution with modified paths, prompt handling, and evaluation criteria.


# Background
*********************************************************************************************************************


| Reference | Explanation | Dataset/Input | Weakness |
|---|---|---|---|
| https://arxiv.org/abs/2210.00586 | Provides code and methodology to evaluate diffusion models on face generation using FID and identity metrics. | https://drive.google.com/file/d/16BXO1fgN08UGLLeA5ZNU9bhwAkcAOdci/view | Focused only on human faces; needs adaptation for other subjects like animals. |



*********************************************************************************************************************






# Implement paper code :
*********************************************************************************************************************
We modified this code to run from local storage in a Jupyter notebook environment.
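
The script's Step 2 hard-codes absolute Windows paths, which only resolve on one machine. A minimal sketch of one alternative is shown here: deriving the four folders from a single root directory. The environment variable name `GFW_DATA_ROOT`, the helper `dataset_paths`, and the `datasets` default are hypothetical names introduced for illustration; only the subfolder names are taken from the original paths.

```python
import os
from pathlib import Path

# Hypothetical environment variable name; not part of the original notebook.
DATA_ROOT_ENV = "GFW_DATA_ROOT"


def dataset_paths(root=None):
    """Return the four image folders used by the FID script, under one root."""
    base = Path(root or os.environ.get(DATA_ROOT_ENV, "datasets"))
    return {
        "real": base / "real" / "real_faces",
        "sd": base / "stable_diffusion" / "generated_data" / "faces_generated",
        "mj": base / "midjourney" / "faces_generated_midjourney",
        "dalle": base / "dalle2" / "DALLEFaces",
    }


paths = dataset_paths()
# Report folders that do not exist before any expensive model work starts.
missing = [name for name, p in paths.items() if not p.is_dir()]
print("Missing folders:", missing or "none")
```

With this layout, another user would only need to set the root directory once instead of editing every path.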

!pip install torch torchvision numpy scipy pandas pillow tqdm

# ====================================
# STEP 1: IMPORT LIBRARIES
# ====================================
import os
import numpy as np
from PIL import Image
from tqdm import tqdm
import torch
import torchvision.transforms as transforms
import torchvision.models as models
from scipy import linalg
import torch.nn as nn


# ====================================
# STEP 2: DEFINE PATHS (LOCAL SETUP)
# ====================================
real_path = r'C:\Users\hugo_\Documents\hugo_garcia\Big_Data_Analytics_georgian_college\AI management\1 semester\Machine Learning Programming\Final Project\datasets\real\real_faces'
sd_path = r'C:\Users\hugo_\Documents\hugo_garcia\Big_Data_Analytics_georgian_college\AI management\1 semester\Machine Learning Programming\Final Project\datasets\stable_diffusion\generated_data\faces_generated'
mj_path = r'C:\Users\hugo_\Documents\hugo_garcia\Big_Data_Analytics_georgian_college\AI management\1 semester\Machine Learning Programming\Final Project\datasets\midjourney\faces_generated_midjourney'
dalle_path = r'C:\Users\hugo_\Documents\hugo_garcia\Big_Data_Analytics_georgian_college\AI management\1 semester\Machine Learning Programming\Final Project\datasets\dalle2\DALLEFaces'


# ====================================
# STEP 3: IMAGE LOADING FUNCTION
# ====================================
def get_activations(folder, model, batch_size=50, max_images=2000):
    model.eval()
    preprocess = transforms.Compose([
        transforms.Resize((299, 299)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5]*3, std=[0.5]*3),
    ])

    images = []
    image_files = [f for f in os.listdir(folder) if f.lower().endswith(('jpg', 'png', 'jpeg'))]
    image_files = image_files[:max_images]  # limit number of images

    for file in tqdm(image_files, desc=f'Loading images from {os.path.basename(folder)}'):
        img_path = os.path.join(folder, file)
        img = Image.open(img_path).convert('RGB')
        img = preprocess(img)
        images.append(img)

    images = torch.stack(images)
    activations = []

    with torch.no_grad():
        for i in range(0, len(images), batch_size):
            batch = images[i:i+batch_size]
            output = model(batch)
            activations.append(output.cpu().numpy())

    activations = np.concatenate(activations, axis=0)
    return activations


# ====================================
# STEP 4: LOAD INCEPTIONV3 MODEL
# ====================================
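from torchvision.models import inception_v3, Inception_V3_Weights

# Load ImageNet-pretrained InceptionV3 and replace the classification head with
# an identity layer so the model returns the 2048-dimensional pooled features.
weights = Inception_V3_Weights.DEFAULT
inception = inception_v3(weights=weights, aux_logits=True)
inception.fc = nn.Identity()
inception.eval()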


# ====================================
# STEP 5: FID CALCULATION FUNCTION
# ====================================
def calculate_fid(act1, act2):
    mu1, sigma1 = act1.mean(axis=0), np.cov(act1, rowvar=False)
    mu2, sigma2 = act2.mean(axis=0), np.cov(act2, rowvar=False)

    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    fid = diff @ diff + np.trace(sigma1 + sigma2 - 2 * covmean)
    return fid


# ====================================
# STEP 6: LOAD ACTIVATIONS
# ====================================
print("Extracting activations...")

act_real = get_activations(real_path, inception)
act_sd = get_activations(sd_path, inception)
act_mj = get_activations(mj_path, inception)
act_dalle = get_activations(dalle_path, inception)


# ====================================
# STEP 7: CALCULATE FID SCORES
# ====================================
print("\nCalculating FID Scores...\n")

fid_sd = calculate_fid(act_real, act_sd)
fid_mj = calculate_fid(act_real, act_mj)
fid_dalle = calculate_fid(act_real, act_dalle)

print(f'FID (Stable Diffusion vs Real): {fid_sd:.2f}')
print(f'FID (Midjourney vs Real): {fid_mj:.2f}')
print(f'FID (DALL·E 2 vs Real): {fid_dalle:.2f}')

*********************************************************************************************************************
### Contribution  Code :

Code to generate images using stable diffusion:

# Use the cu117 index URL instead for older GPUs.
!pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
!pip install "diffusers[torch]" transformers accelerate --upgrade


import torch
print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())


from huggingface_hub import login

# Replace YOUR_HUGGINGFACE_TOKEN with your actual token
login(token="YOUR_HUGGINGFACE_TOKEN")



from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    safety_checker=None,  
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
)
pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

import os
from tqdm import tqdm

output_dir = r"C:\Users\hugo_\Documents\hugo_garcia\Big_Data_Analytics_georgian_college\AI management\1 semester\Machine Learning Programming\Final Project\datasets\Project_Cat_images\Stable_Diffusion\Generated_images"
prompt = "A cat, centered, high detail, studio lighting, DSLR"
num_images = 100

os.makedirs(output_dir, exist_ok=True)


print(f"Generating {num_images} images...")

with torch.inference_mode():
    for i in tqdm(range(num_images)):
        image = pipe(prompt, num_inference_steps=20, height=384, width=384).images[0]
        image.save(os.path.join(output_dir, f"cat_sd_{i:03d}.png"))

print(f"✅ Done! All images saved to {output_dir}")


Code to run the new comparison:

# ====================================
# STEP 1: IMPORT LIBRARIES
# ====================================
import os
import numpy as np
from PIL import Image
from tqdm import tqdm
import torch
import torchvision.transforms as transforms
import torchvision.models as models
from scipy import linalg
import torch.nn as nn


# ====================================
# STEP 2: DEFINE PATHS (LOCAL SETUP)
# ====================================
real_path = r'C:\Users\hugo_\Documents\hugo_garcia\Big_Data_Analytics_georgian_college\AI management\1 semester\Machine Learning Programming\Final Project\datasets\Project_Cat_images\Real_data\Real_images\content\OID\Dataset\train\Cat\images'
sd_path = r'C:\Users\hugo_\Documents\hugo_garcia\Big_Data_Analytics_georgian_college\AI management\1 semester\Machine Learning Programming\Final Project\datasets\Project_Cat_images\Stable_Diffusion\Generated_images'

# ====================================
# STEP 3: IMAGE LOADING FUNCTION
# ====================================
def get_activations(folder, model, batch_size=50, max_images=2000):
    model.eval()
    preprocess = transforms.Compose([
        transforms.Resize((299, 299)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5]*3, std=[0.5]*3),
    ])

    images = []
    image_files = [f for f in os.listdir(folder) if f.lower().endswith(('jpg', 'png', 'jpeg'))]
    image_files = image_files[:max_images]  # limit number of images

    for file in tqdm(image_files, desc=f'Loading images from {os.path.basename(folder)}'):
        img_path = os.path.join(folder, file)
        img = Image.open(img_path).convert('RGB')
        img = preprocess(img)
        images.append(img)

    images = torch.stack(images)
    activations = []

    with torch.no_grad():
        for i in range(0, len(images), batch_size):
            batch = images[i:i+batch_size]
            output = model(batch)
            activations.append(output.cpu().numpy())

    activations = np.concatenate(activations, axis=0)
    return activations


# ====================================
# STEP 4: LOAD INCEPTIONV3 MODEL
# ====================================
from torchvision.models import inception_v3, Inception_V3_Weights

weights = Inception_V3_Weights.DEFAULT
inception = inception_v3(weights=weights, aux_logits=True)  
inception.fc = nn.Identity()
inception.eval()


# ====================================
# STEP 5: FID CALCULATION FUNCTION
# ====================================
def calculate_fid(act1, act2):
    mu1, sigma1 = act1.mean(axis=0), np.cov(act1, rowvar=False)
    mu2, sigma2 = act2.mean(axis=0), np.cov(act2, rowvar=False)

    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    fid = diff @ diff + np.trace(sigma1 + sigma2 - 2 * covmean)
    return fid

# ====================================
# STEP 6: LOAD ACTIVATIONS
# ====================================
print("Extracting activations...")

act_real = get_activations(real_path, inception)
act_sd = get_activations(sd_path, inception)


# ====================================
# STEP 7: CALCULATE FID SCORES
# ====================================
print("\nCalculating FID Scores...\n")

# Only Stable Diffusion is evaluated on cats; no Midjourney or DALL·E 2
# activations are computed in this comparison.
fid_sd = calculate_fid(act_real, act_sd)

print(f'FID (Stable Diffusion vs Real): {fid_sd:.2f}')

### Results :

After adapting and running the original GFW notebook locally with modifications for dataset location and evaluation category, we computed the Fréchet Inception Distance (FID) between 100 cat images generated by Stable Diffusion and real cat images from the Google Open Images Dataset V7.

✅ FID (Stable Diffusion vs Real): 187.67

This FID score reflects the statistical difference between the distributions of real and generated cat images in the embedding space of an Inception network. A lower FID indicates more similarity and higher visual fidelity.
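
For reference, the quantity computed by `calculate_fid` can be written as follows. The symbols are notation chosen here for illustration: $\mu_r, \Sigma_r$ denote the mean and covariance of the Inception activations of the real images, and $\mu_g, \Sigma_g$ those of the generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

The first term corresponds to `diff @ diff` in the code and the second to the `np.trace(...)` expression, with `covmean` playing the role of $(\Sigma_r \Sigma_g)^{1/2}$.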

*******************************************************************************************************************************



#### Observations :

The FID score of 187.67 is relatively high, suggesting a noticeable difference between real and generated cat images.

This outcome may be influenced by:

The relatively small sample size of only 100 generated images, which limits statistical robustness.

The use of the Stable Diffusion checkpoint as-is, without fine-tuning on cat imagery, which may limit the kind of detail and consistency required for FID-based evaluation.

A possible divergence in style between the generated images, which were prompted for studio lighting and DSLR quality, and real-world, natural images.

Despite this, the model still managed to generate semantically correct and recognizable cat images using only a single text prompt ("A cat, centered, high detail, studio lighting, DSLR").

Qualitatively, many generated images show stylistic artifacts or ambiguous shapes, indicating that fine-tuning or improved prompt engineering might be needed for better results.

*******************************************************************************************************************************


### Conclusion and Future Direction :
*******************************************************************************************************************************
#### Learnings :

We successfully adapted an existing research benchmark to evaluate text-to-image diffusion models on non-human subjects, specifically cats.

The process involved working with large image datasets, modifying code for local execution, and understanding how FID reflects the quality of generated images.

This adaptation also demonstrated the importance of prompt quality, image diversity, and dataset consistency when evaluating generative models.

*******************************************************************************************************************************
#### Results Discussion :

The FID score of 187.67 suggests that Stable Diffusion's cat images were statistically less similar to real cat images than is typical for human face generation tasks.

Limited by hardware resources, we only generated 100 images, which affects the reliability of FID. A larger sample (e.g., 1,000 or more) would yield a more stable and accurate metric.
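
The effect of sample size on FID can be illustrated with a small synthetic experiment. The sketch below is not part of our pipeline: it uses random Gaussian vectors in place of Inception features, a toy dimensionality of 64, and a NumPy-only FID helper (`fid_from_features`, a name introduced here) that obtains the trace of the matrix square root from eigenvalues. Both sets are drawn from the same distribution, so the true FID is zero and any value above zero comes from estimation error.

```python
import numpy as np


def fid_from_features(a, b):
    """FID between two feature matrices of shape (n_samples, n_features)."""
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    cov_a, cov_b = np.cov(a, rowvar=False), np.cov(b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b; tiny negative or imaginary parts are noise.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


rng = np.random.default_rng(0)
dim = 64  # toy dimensionality; the real Inception features have 2048 dimensions


def same_distribution_fid(n):
    # Both sets come from the SAME standard normal, so the true FID is 0.
    return fid_from_features(rng.standard_normal((n, dim)),
                             rng.standard_normal((n, dim)))


fid_small = same_distribution_fid(100)
fid_large = same_distribution_fid(2000)
print(f"n=100:  {fid_small:.2f}")
print(f"n=2000: {fid_large:.2f}")
```

In this toy setting the smaller sample gives the larger value, which suggests that part of a score computed from 100 images may reflect sample size rather than image quality.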

Additionally, while FID captures feature-space similarity, it does not account for semantic correctness—many generated cats were indeed recognizable but contained subtle visual distortions.


*******************************************************************************************************************************
#### Limitations :

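The evaluation was constrained to 100 generated images due to hardware and runtime limitations. With fewer samples than feature dimensions (100 versus 2,048), the covariance estimate is rank-deficient, which biases FID upward and increases its variance.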

We did not compare against Midjourney or DALL·E 2 in this phase, as those models are not open-source and require API access.

Prompt engineering was basic (single-line prompts) and not fine-tuned for realism or stylistic coherence.

The Stable Diffusion model used was not fine-tuned on cat-specific imagery or conditioned on segmentation/masks.
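
One simple way to move beyond a single-line prompt is to combine a few phrase lists into a pool of prompts and sample from it. The sketch below only builds the prompt strings. All phrases and the helper name `build_prompts` are illustrative assumptions, not prompts used in this project, and each resulting string would still need to be passed to the generation pipeline.

```python
import itertools
import random

# All phrases below are illustrative assumptions, not prompts used in this project.
subjects = ["a cat", "a kitten", "a tabby cat"]
settings = ["sitting on a couch", "in a garden", "on a windowsill"]
styles = ["natural light, candid photo", "studio lighting, DSLR"]


def build_prompts(n, seed=0):
    """Return up to n distinct prompts sampled from all phrase combinations."""
    combos = [
        f"{subject} {place}, {style}"
        for subject, place, style in itertools.product(subjects, settings, styles)
    ]
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    return rng.sample(combos, k=min(n, len(combos)))


for prompt in build_prompts(5):
    print(prompt)
```

Sampling without replacement avoids duplicate prompts, and the fixed seed means the same prompt set can be regenerated when the experiment is repeated.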

*******************************************************************************************************************************
#### Future Extension :

Generate a larger dataset (e.g., 1,000+ images) to compute a more statistically meaningful FID score.

Experiment with better prompts or use prompt ensembles to improve generation quality and consistency.

Apply CLIP-based filtering to select the best generated outputs before computing FID.

Extend the comparison to include Midjourney and DALL·E 2 (via API), and evaluate them under the same metrics.

Investigate other evaluation metrics like Inception Score (IS), CLIPScore, or user studies to complement FID.
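
As an illustration of one of these complementary metrics, the sketch below computes the Inception Score from a matrix of class probabilities. It is a minimal example under stated assumptions: the function name `inception_score` and the toy inputs are introduced here, and in practice the probabilities would come from the softmax output of an Inception classifier with its final layer intact (our FID code replaces that layer with `nn.Identity()`, so it cannot be reused directly).

```python
import numpy as np


def inception_score(probs, eps=1e-12):
    """Inception Score from class probabilities of shape (n_images, n_classes)."""
    probs = np.asarray(probs, dtype=np.float64)
    marginal = probs.mean(axis=0, keepdims=True)  # p(y), averaged over images
    # Per-image KL divergence between p(y|x) and the marginal p(y).
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))


# Toy inputs: identical uniform predictions versus confident, evenly spread ones.
uniform = np.full((4, 4), 0.25)
one_hot = np.eye(4)
print(inception_score(uniform), inception_score(one_hot))
```

The score is lowest when every image receives the same prediction and grows when predictions are both confident and spread across classes, so it rewards diversity as well as recognizability.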


# References:

[1]: Research paper
https://arxiv.org/abs/2210.00586

[2]: GitHub repository of the research paper
https://github.com/aliborji/GFW

[3]: Datasets used in the research paper
https://drive.google.com/file/d/16BXO1fgN08UGLLeA5ZNU9bhwAkcAOdci/view

[4]: Google Open Images Dataset V7
https://storage.googleapis.com/openimages/web/index.html

[5]: Instructions for downloading data from the Open Images Dataset
https://sharmaji27.medium.com/easiest-way-to-download-data-from-the-open-image-dataset-553dccfb92d8