# Experiment Goal: Optimizing Generative Evaluation via Sample Size Expansion and Semantic Refinement

In this experiment, I aim to show that even minimal adjustments, such as increasing the sample size, refining the text prompts, and filtering for high-quality results, can significantly improve the evaluation metrics.

**Key Improvements:**
* Increased Sample Size: Generating 10 repetitions of every (prompt, guidance weight) pair expands the dataset and stabilizes the mean and covariance estimates required for the FID score. This reduces the statistical noise caused by small sample sizes.
* Strategic Filtering: Metrics are now calculated only on a subset of images with a guidance weight of w ≥ 1.0. This excludes the abstract noise generated by low or negative guidance weights.
* Prompt Engineering: I extended the text prompts to include more detailed descriptions. Providing richer semantic information helps the CLIP model better align pixels with concepts, thereby increasing the CLIP score.
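
The guidance weight w used for filtering controls how strongly the text-conditioned noise prediction is favored over the unconditional one. A minimal sketch, assuming the common classifier-free guidance rule eps = (1 + w) * eps_cond - w * eps_uncond. The exact rule inside `ddpm_utils.sample_w` is not shown in this notebook, so this formula is an assumption, and `guided_noise` is a hypothetical helper used only for illustration:

```python
def guided_noise(eps_cond, eps_uncond, w):
    """Blend conditional and unconditional noise predictions.

    Assumed rule (not confirmed by this notebook):
        eps = (1 + w) * eps_cond - w * eps_uncond
    """
    return [(1.0 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]


eps_cond = [0.5, -1.0, 2.0]    # toy prediction with the text condition
eps_uncond = [0.0, 1.0, 1.0]   # toy prediction with the condition dropped

only_cond = guided_noise(eps_cond, eps_uncond, w=0.0)     # reduces to eps_cond
only_uncond = guided_noise(eps_cond, eps_uncond, w=-1.0)  # reduces to eps_uncond
amplified = guided_noise(eps_cond, eps_uncond, w=1.0)     # pushed beyond eps_cond
```

Under this assumed rule, w = -1 ignores the prompt entirely, which would be consistent with low or negative weights producing images that are unrelated to the text.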

**Expected Outcome:**
Overall, these adjustments should lead to a significant reduction in the FID score (indicating better realism) and a higher, more stable average CLIP score (indicating better prompt adherence).

# Setup

The project repository is mounted from Google Drive and added to the Python path to allow clean imports from the src module. The dataset is copied to the local Colab filesystem to improve I/O performance during feature extraction. All global settings (random seed, device selection, paths, batch sizes) are defined once and reused across the notebook to ensure consistency and reproducibility.

In [None]:
import sys
from pathlib import Path

IN_COLAB = "google.colab" in sys.modules

if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    %cd "/content/drive/MyDrive/Applied-Computer-Vision-Projects/Diffusion_Model_03"

PROJECT_ROOT = Path.cwd()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

Mounted at /content/drive
/content/drive/MyDrive/Applied-Computer-Vision-Projects/Diffusion_Model_03


In [2]:
%%capture
# Install dependencies (the %%capture cell magic must be the first line of the cell)
%pip install --no-cache-dir -r requirements.txt

In [None]:
import os
from google.colab import userdata

import numpy as np
from scipy.linalg import sqrtm
from PIL import Image
from tqdm import tqdm

import torch
import torchvision.transforms.v2 as transforms
from torchvision.transforms import ToPILImage
from torch.utils.data import DataLoader
from torchvision.models import inception_v3, Inception_V3_Weights

import clip
import open_clip

import wandb
import fiftyone as fo
import fiftyone.brain as fob



In [4]:
from utils import UNet_utils, ddpm_utils, other_utils, config

In [5]:
!rm -rf /content/data
!cp -r "$config.DRIVE_ROOT/data"* /content

In [6]:
other_utils.set_seeds(config.SEED)

All random seeds set to 51 for reproducibility


# Part 1: Image Generation and Embedding Extraction

In this section, the pre-trained U-Net model from notebook 05_CLIP.ipynb of the corresponding NVIDIA course is loaded, images of flowers are generated, and embeddings are extracted from the model's bottleneck.

## Recreate CLIP + DDPM + sampling

In this section, the CLIP text encoder and the DDPM sampling process are reinitialized to match the training setup. The noise schedule and diffusion parameters are defined to ensure compatibility with the pretrained U-Net. This reconstruction is necessary to generate images that are consistent with the original training regime.
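
A minimal sketch of the linear noise schedule described above, written in plain Python so that it runs without a GPU. The value `T = 400` is a placeholder (the real value lives in `config.TIMESTEPS`), and the cumulative product `alpha_bar` is the standard DDPM quantity that `ddpm_utils.DDPM` is assumed, not confirmed, to derive internally. Both helper functions are hypothetical:

```python
def linear_beta_schedule(beta_start, beta_end, timesteps):
    """Evenly spaced betas with both endpoints included (like torch.linspace)."""
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + i * step for i in range(timesteps)]


def cumulative_alpha_bar(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


T = 400  # placeholder; the notebook uses config.TIMESTEPS
betas = linear_beta_schedule(0.0001, 0.02, T)
alpha_bar = cumulative_alpha_bar(betas)
```

`alpha_bar` decreases monotonically toward zero, which expresses that later timesteps retain less of the original image signal.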

In [7]:
# Load CLIP for encoding the text prompts
clip_model, clip_preprocess = clip.load("ViT-B/32", device=config.DEVICE)
clip_model.eval()

100%|████████████████████████████████████████| 338M/338M [00:02<00:00, 120MiB/s]


CLIP(
  (visual): VisionTransformer(
    (conv1): Conv2d(3, 768, kernel_size=(32, 32), stride=(32, 32), bias=False)
    (ln_pre): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (transformer): Transformer(
      (resblocks): Sequential(
        (0): ResidualAttentionBlock(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlp): Sequential(
            (c_fc): Linear(in_features=768, out_features=3072, bias=True)
            (gelu): QuickGELU()
            (c_proj): Linear(in_features=3072, out_features=768, bias=True)
          )
          (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        )
        (1): ResidualAttentionBlock(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          

In [8]:
# Re-initialize DDPM wrapper
B_start = 0.0001
B_end = 0.02
B = torch.linspace(B_start, B_end, config.TIMESTEPS).to(config.DEVICE)

ddpm = ddpm_utils.DDPM(B, config.DEVICE)

## Load the pre-trained U-Net

Here, the U-Net architecture is instantiated and pretrained weights are loaded from disk. The model is switched to evaluation mode to disable training-specific behavior. This step restores the trained generative model used for all subsequent image synthesis.

In [9]:
# Define the U-Net architecture
uNet_model = UNet_utils.UNet(
    T=config.TIMESTEPS,
    img_ch=config.IMG_CH,
    img_size=config.IMG_SIZE,
    down_chs=(256, 256, 512),
    t_embed_dim=8,
    c_embed_dim=config.CLIP_FEATURES
).to(config.DEVICE)


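# Load the model weights
try:
    uNet_model.load_state_dict(torch.load(config.UNET_MODEL_PATH, map_location=config.DEVICE))
    print("Model weights loaded successfully.")
except FileNotFoundError:
    # Stop here: continuing would silently evaluate randomly initialized weights
    print("Error: Model weights not found.")
    raise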

uNet_model.eval()

Model weights loaded successfully.


UNet(
  (down0): ResidualConvBlock(
    (conv1): GELUConvBlock(
      (model): Sequential(
        (0): Conv2d(3, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): GroupNorm(8, 256, eps=1e-05, affine=True)
        (2): GELU(approximate='none')
      )
    )
    (conv2): GELUConvBlock(
      (model): Sequential(
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): GroupNorm(8, 256, eps=1e-05, affine=True)
        (2): GELU(approximate='none')
      )
    )
  )
  (down1): DownBlock(
    (model): Sequential(
      (0): GELUConvBlock(
        (model): Sequential(
          (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (1): GroupNorm(32, 256, eps=1e-05, affine=True)
          (2): GELU(approximate='none')
        )
      )
      (1): GELUConvBlock(
        (model): Sequential(
          (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (1): GroupNorm(32, 256, eps

In [10]:
# Define prompts
text_prompts = [
    "A photo of a red rose",
    "A high-quality photo of a vibrant red rose",
    "A close-up shot of a red rose with many layered petals",
    "Macro photography of a deep red rose with velvet-like texture",
    "A single red rose blooming in a lush green garden",
    "A detailed photo of a red rose flower with morning dew drops",
    "A vibrant red rose captured in bright, natural sunlight",
    "A professional studio photograph of a single red rose against a soft background",

    "A photo of a white daisy",
    "A high-resolution photo of a white daisy with a yellow center",
    "A round white daisy with crisp, clean white petals",
    "A white daisy flower growing in a sunny green meadow",
    "A macro photo of a daisy showing the texture of the yellow pollen center",
    "A delicate white daisy captured in soft, diffused natural light",
    "A top-down view of a symmetrical white daisy flower",
    "A professional photograph of a simple white daisy with sharp focus",

    "A photo of a yellow sunflower",
    "A vibrant yellow sunflower with a large brown center",
    "A detailed photo of a sunflower with bright, radiant yellow petals",
    "A tall yellow sunflower standing in a vast sunflower field",
    "A macro shot of a sunflower head showing the pattern of the seeds",
    "A bright yellow sunflower facing the sun during golden hour",
    "A large, blooming yellow sunflower with a thick green stem",
    "A professional high-quality photo of a yellow sunflower with vivid colors"
]

In [11]:
# Sanity check: calculate how many images will be generated
P = len(text_prompts)           # number of prompts
W = len(config.W_TESTS)         # guidance strengths (classifier-free guidance) per prompt
NUM_REPETITIONS = 10            # number of samples per (prompt, guidance) pair
n_samples = P * W * NUM_REPETITIONS   # total images: NUM_REPETITIONS per (prompt, guidance) pair

print("Expected n_samples:", n_samples)

Expected n_samples: 1680


In [12]:
# Store intermediate feature maps extracted via forward hooks
embeddings_storage = {}

def get_embedding_hook(name):
    """
    Creates a forward hook that stores the output of a given layer.

    The output is detached from the computation graph to avoid
    gradient tracking and reduce memory usage.
    """
    def hook(model, input, output):
        # We use .detach() to disconnect from the gradient graph (saves memory)
        embeddings_storage[name] = output.detach()
    return hook

# Register a forward hook on the U-Net bottleneck layer
uNet_model.down2.register_forward_hook(get_embedding_hook('down2'))
print("Hook registered on model.down2")

Hook registered on model.down2


In [13]:
def sample_flowers_with_hook(text_list, model, ddpm, input_size, T, device, w_tests, num_repetitions=10):
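    """
    Generates images from text prompts using classifier-free guided diffusion
    while capturing intermediate U-Net embeddings via a forward hook.

    Relies on the globally loaded `clip_model` to encode the prompts.

    Args:
        text_list (list[str]): Text prompts used for conditioning.
        model (nn.Module): Pretrained U-Net diffusion model.
        ddpm: Diffusion process wrapper.
        input_size (tuple): Spatial size of generated images.
        T (int): Number of diffusion timesteps.
        device (torch.device): Computation device.
        w_tests (list[float]): Guidance weights; each one is applied to every prompt.
        num_repetitions (int): Number of sampling runs over all (prompt, guidance) pairs.

    Returns:
        torch.Tensor: Final generated images from all repetitions.
        torch.Tensor: Bottleneck (`down2`) embeddings of the conditioned samples,
            captured at the last forward pass of each run, in the same order as the images.
    """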
    all_generated_images = []
    all_extracted_embeddings = []

    # Encode text prompts using CLIP
    text_tokens = clip.tokenize(text_list).to(device)
    with torch.no_grad():
      c = clip_model.encode_text(text_tokens).float()

    for rep in range(num_repetitions):
      print(f"  Running repetition {rep+1}/{num_repetitions}...")

      # Run diffusion sampling with classifier-free guidance
      x_gen, _ = ddpm_utils.sample_w(model, ddpm, input_size, T, c, device, w_tests)

      # Grabs the embedding from the hook storage before it gets overwritten
      # As the images are doubled, we keep only the conditioned ones
      current_batch_embs = embeddings_storage['down2'][:x_gen.shape[0]].detach().cpu()

      # Store both images and embeddings
      all_generated_images.append(x_gen.detach().cpu())
      all_extracted_embeddings.append(current_batch_embs)

    # Concatenate all repetitions into single large tensors
    final_images = torch.cat(all_generated_images, dim=0)
    final_embeddings = torch.cat(all_extracted_embeddings, dim=0)

    return final_images, final_embeddings

# Run the generation
other_utils.set_seeds(config.SEED)

print(f"Generating {len(text_prompts) * len(config.W_TESTS) * NUM_REPETITIONS} total images...")

print("Generating images...")
generated_images, extracted_embeddings = sample_flowers_with_hook(
    text_list=text_prompts,
    model=uNet_model,
    ddpm=ddpm,
    input_size=config.INPUT_SIZE,
    T=config.TIMESTEPS,
    device=config.DEVICE,
    w_tests=config.W_TESTS,
    num_repetitions=NUM_REPETITIONS
)

print(f"Generation Complete.")
print(f"Final Images Shape: {generated_images.shape}")
print(f"Final Embeddings Shape: {extracted_embeddings.shape}")

All random seeds set to 51 for reproducibility
Generating 1680 total images...
Generating images...
  Running repetition 1/10...
  Running repetition 2/10...
  Running repetition 3/10...
  Running repetition 4/10...
  Running repetition 5/10...
  Running repetition 6/10...
  Running repetition 7/10...
  Running repetition 8/10...
  Running repetition 9/10...
  Running repetition 10/10...
Generation Complete.
Final Images Shape: torch.Size([1680, 3, 32, 32])
Final Embeddings Shape: torch.Size([1680, 512, 8, 8])


In [14]:
# Save generated images to disk for downstream evaluation
to_pil = ToPILImage()

# Track saved image paths together with their prompts and guidance values
saved_samples = []

print("Saving images to disk...")
assert len(generated_images) == n_samples, (
    f"generated_images={len(generated_images)} != {n_samples}"
)

for i, img_tensor in enumerate(generated_images):
    # Recover prompt and guidance value from the sampling order
    idx_within_rep = i % (P * W)
    prompt = text_prompts[idx_within_rep % P]
    w_val = config.W_TESTS[idx_within_rep // P]

    # Map model output from [-1, 1] to [0, 1] for image saving and clip any artifacts that fell outside the valid range
    img_norm = ((img_tensor + 1) / 2).clamp(0, 1).detach().cpu()
    pil_img = to_pil(img_norm)

    filename = os.path.join(
        config.SAVE_DIR, f"flower_w{w_val:+.1f}_p{idx_within_rep % P}_{i}.png"
    )
    pil_img.save(filename)

    saved_samples.append((filename, prompt, float(w_val)))

print("All images saved.")

Saving images to disk...
All images saved.


# Part 2: Evaluation with CLIP Score and FID
In this section, the quality of the generated images is evaluated using CLIP Score and Fréchet Inception Distance (FID), following the definitions provided in the assignment task.

In [23]:
# Filter the generated samples before feature extraction: keep only samples with guidance w >= 1.0
filtered_samples = [s for s in saved_samples if s[2] >= 1.0]

print(f"Filtered Samples: {len(filtered_samples)}")

Filtered Samples: 480


## CLIP Score

The CLIP score is computed as the cosine similarity between image and text embeddings produced by a pretrained CLIP model. It answers the question: "How accurately does the generated image depict the content described in the text prompt?"

A higher score indicates stronger semantic alignment. Scores are computed for the filtered samples only (w ≥ 1.0), with one score per image and prompt pair.
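
The cosine similarity behind the CLIP score can be illustrated with a small, dependency-free sketch. The two-dimensional vectors are toy stand-ins for the real CLIP embeddings, and `cosine_similarity` is a hypothetical helper, not part of the notebook's code:

```python
import math


def cosine_similarity(a, b):
    """Dot product of the two vectors after scaling each to unit length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


image_vec = [3.0, 4.0]        # toy "image embedding"
text_vec = [6.0, 8.0]         # same direction, different length
unrelated_vec = [-4.0, 3.0]   # perpendicular to image_vec

same_direction = cosine_similarity(image_vec, text_vec)
right_angle = cosine_similarity(image_vec, unrelated_vec)
```

Because only the direction matters, vectors of different length that point the same way score 1, while perpendicular vectors score 0.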

In [24]:
# Initialize OpenCLIP model for CLIP-score evaluation
clip_scorer, _, clip_preprocess_val = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
clip_scorer.to(config.DEVICE).eval()
tokenizer = open_clip.get_tokenizer("ViT-B-32")

In [25]:
def calculate_clip_score(image_path, text_prompt, device=None):
    """
    Computes a CLIP similarity score between an image and a text prompt.

    The image and text are embedded using a pretrained OpenCLIP model, L2-normalized,
    and compared via cosine similarity (dot product of normalized embeddings).

    Args:
        image_path (str | Path): Path to the image file on disk.
        text_prompt (str): Text prompt to compare against.
        device (torch.device): Device on which `clip_scorer` lives.

    Returns:
        float: Cosine similarity score (higher means stronger semantic alignment).
    """
    # Preprocess and move to the same device as the CLIP model
    image = clip_preprocess_val(Image.open(image_path)).unsqueeze(0).to(device)
    text = tokenizer([text_prompt]).to(device)

    with torch.no_grad():
        image_features = clip_scorer.encode_image(image)
        text_features = clip_scorer.encode_text(text)

        # Normalize to turn dot product into cosine similarity
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)

        score = (image_features @ text_features.T).item()

    return score


# Compute CLIP scores for the filtered samples (w >= 1.0)
clip_scores = []

print("Calculating scores...")

for i, (filepath, prompt, w_val) in enumerate(filtered_samples):
    score = calculate_clip_score(
        image_path=filepath,
        text_prompt=prompt,
        device=config.DEVICE,
    )
    clip_scores.append(score)

avg_clip_score = float(np.mean(clip_scores))
print(f"Average CLIP Score: {avg_clip_score:.4f}")

Calculating scores...
Average CLIP Score: 0.2594


## FID Score

To measure how realistic our generated images are, we calculate the Fréchet Inception Distance (FID) score. We use a powerful pre-trained InceptionV3 model as a feature "judge." Both our generated images and the real flower images are prepared identically—resized to 299x299 and normalized—to ensure a fair comparison.

FID is computed by comparing feature statistics extracted from the model for real and generated images. It statistically compares these two collections; a lower score indicates that the generated images are more similar to the real data.
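
Written out, with $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ denoting the mean and covariance of the 2048-dimensional Inception features of the real and generated images (the subscripts $r$ and $g$ are notation introduced here for illustration), the standard FID definition is:

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)
$$

The first term measures how far apart the two feature means are, and the trace term measures how much the spread of the two feature distributions differs.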

In [26]:
# Load pretrained InceptionV3
inception = inception_v3(
    weights=Inception_V3_Weights.DEFAULT,
    transform_input=False,
).to(config.DEVICE)

# To return features (2048) - not classes as the standard Inception model does - the final
# "Fully Connected" layer needs to be replaced with a "Pass Through" (Identity)
inception.fc = torch.nn.Identity()

inception.eval()

image_net_mean = [0.485, 0.456, 0.406]
image_net_std  = [0.229, 0.224, 0.225]

# Inception expects 299x299 inputs (the generated images are 32x32) and ImageNet normalization
inception_transform = transforms.Compose([
    transforms.Resize((config.INCEPTION_IMG_SIZE, config.INCEPTION_IMG_SIZE)), # Up-sample from 32x32
    transforms.ToImage(),
    transforms.ToDtype(torch.float32, scale=True),
    transforms.Normalize(mean=image_net_mean, std=image_net_std)   # from original pytorch docs
])

Downloading: "https://download.pytorch.org/models/inception_v3_google-0cc3c7bd.pth" to /root/.cache/torch/hub/checkpoints/inception_v3_google-0cc3c7bd.pth


100%|██████████| 104M/104M [00:00<00:00, 186MB/s] 


In [27]:
def get_inception_features_from_raw(dataset_path, batch_size, model, device=None, num_workers=0):
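    """
    Extracts 2048-dimensional InceptionV3 feature embeddings for a dataset of images.

    Args:
        dataset_path (str | Path): Root folder of the original (real) images.
        batch_size (int): Number of images per forward pass.
        model (nn.Module): Pretrained InceptionV3 feature extractor (fc = Identity).
        device (torch.device): Device the model lives on.
        num_workers (int): Number of DataLoader worker processes.

    Returns:
        np.ndarray: Array of shape (N, 2048) containing feature embeddings for all images.
    """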
    raw_dataset = other_utils.MyDataset(dataset_path, inception_transform, config.CLASSES)
    raw_dataloader  = DataLoader(raw_dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers)

    features = []                         # Stores feature batches
    with torch.no_grad():
      for img, _ in tqdm(raw_dataloader):
          img = img.to(device)
          f = model(img)                # Runs Inception forward pass
          features.append(f.cpu().numpy())  # Transform to numpy for later mathematical operations
    return np.concatenate(features, axis=0) # Concatenate batches to one array

In [28]:
def get_inception_features_from_files(saved_samples, batch_size, model, transform, device=None, num_workers=0):
    """
    Loads images from disk and extracts InceptionV3 feature embeddings.

    Args:
        saved_samples: List of tuples containing image filepaths.
        model: Pretrained feature extractor.
        transform: Image preprocessing pipeline (resize/normalize).

    Returns:
        np.ndarray: Feature matrix of shape (N, 2048).
    """
    dataset = other_utils.GeneratedListDataset(saved_samples, transform=transform)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers)

    features = []

    with torch.no_grad():
        for img_batch in tqdm(loader, desc="Extracting Generated Features"):
            img_batch = img_batch.to(device)

            # The transform handles Resize, Scale, and Normalize
            f = model(img_batch)
            features.append(f.cpu().numpy())

    return np.concatenate(features, axis=0)

In [29]:
def calculate_fid(real_embeddings, gen_embeddings):
    """Computes the Fréchet Inception Distance between two (N, D) feature arrays."""
    # Calculate mean and covariance
    mu1, sigma1 = real_embeddings.mean(axis=0), np.cov(real_embeddings, rowvar=False)
    mu2, sigma2 = gen_embeddings.mean(axis=0), np.cov(gen_embeddings, rowvar=False)

    # Sum squared difference between means
    ssdiff = np.sum((mu1 - mu2)**2)

    # Matrix square root of the product of covariances
    covmean = sqrtm(sigma1.dot(sigma2))

    # Numerical error handling
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    # Final FID calculation
    fid = ssdiff + np.trace(sigma1 + sigma2 - 2.0 * covmean)

    return fid

In [30]:
# Location of the real images on the local filesystem
dataset_path = config.TMP_ROOT / "data/cropped_flowers"

# Extract features from real images on disk
real_embeddings = get_inception_features_from_raw(dataset_path, config.BATCH_SIZE, inception, device=config.DEVICE, num_workers=config.NUM_WORKERS)

print("Real Embeddings:", real_embeddings.shape)

print(f"Total generated: {len(saved_samples)} | Realistic samples for FID: {len(filtered_samples)}")

# Extract features from generated images on disk
gen_embeddings = get_inception_features_from_files(
    saved_samples=filtered_samples,
    batch_size=config.BATCH_SIZE,
    model=inception,
    transform=inception_transform,
    device=config.DEVICE,
    num_workers=config.NUM_WORKERS
)

print("Generated Embeddings:", gen_embeddings.shape)

# Compute FID Score: checks if the images look "real" compared to the original dataset
fid_score = calculate_fid(real_embeddings, gen_embeddings)
print(f"FID Score: {fid_score:.4f}")

100%|██████████| 37/37 [02:57<00:00,  4.80s/it]


Real Embeddings: (1166, 2048)
Total generated: 1680 | Realistic samples for FID: 480


Extracting Generated Features: 100%|██████████| 15/15 [00:02<00:00,  5.41it/s]


Generated Embeddings: (480, 2048)
FID Score: 239.2461


# Part 3: Embedding Analysis with FiftyOne Brain
This section organizes the generated images into a structured FiftyOne dataset to analyze the model's internal behavior.
Each image is paired with its corresponding prompt, guidance weight (w), and CLIP score, while the extracted U-Net bottleneck features are stored as vector embeddings.
Afterwards, uniqueness (identifying visually distinct samples) and representativeness (identifying the most typical examples) are computed.

In [33]:
# Create a new FiftyOne dataset

# Delete existing dataset if it exists
if config.FIFTYONE_DATASET_EXPERIMENTS_NAME in fo.list_datasets():
    print(f"Deleting existing dataset: {config.FIFTYONE_DATASET_EXPERIMENTS_NAME}")
    fo.delete_dataset(config.FIFTYONE_DATASET_EXPERIMENTS_NAME)

dataset = fo.Dataset(name=config.FIFTYONE_DATASET_EXPERIMENTS_NAME)

In [34]:
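# Build a FiftyOne dataset where each image is paired with prompt, guidance w,
# CLIP score, and a flattened U-Net embedding (used for embedding-based analysis)
samples = []

# Positions of the filtered samples within the full generation order, so that each
# image is matched with its own embedding (extracted_embeddings covers all images)
filtered_indices = [j for j, s in enumerate(saved_samples) if s[2] >= 1.0]
assert len(filtered_indices) == len(filtered_samples)

print("Building FiftyOne dataset...")

for i, (filepath, prompt, w_val) in enumerate(filtered_samples):
    # FiftyOne Brain expects a 1D embedding vector per sample for distance computations
    raw_embedding = extracted_embeddings[filtered_indices[i]]  # e.g., (512, 8, 8)
    flat_embedding = raw_embedding.flatten().cpu().numpy()     # (512*8*8,)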

    sample = fo.Sample(filepath=filepath)

    # Store fields for filtering and analysis in the FiftyOne App
    sample["ground_truth"] = fo.Classification(label=prompt)
    sample["w"] = float(w_val)
    sample["clip_score"] = float(clip_scores[i])
    sample["unet_embedding"] = flat_embedding

    samples.append(sample)

# Add all samples in one call for efficiency
dataset.add_samples(samples)
print(f"Added {len(samples)} samples to the dataset.")


Building FiftyOne dataset...
 100% |█████████████████| 480/480 [11.8s elapsed, 0s remaining, 30.1 samples/s]      




Added 480 samples to the dataset.


In [35]:
# Compute Uniqueness (Visual diversity)
fob.compute_uniqueness(dataset, embeddings="unet_embedding")

# Compute Representativeness using the extracted U-Net embeddings
fob.compute_representativeness(dataset, embeddings="unet_embedding")

Computing uniqueness...
Uniqueness computation complete
Computing representativeness...
Computing clusters for 480 embeddings; this may take awhile...
Representativeness computation complete


In [36]:
# Launch the FiftyOne App to visualize your dataset and analyze the results
session = fo.launch_app(dataset)


Could not connect session, trying again in 10 seconds



RuntimeError: Client is not connected

# Evaluation

To reduce the statistical noise observed in earlier experiments, the evaluation pipeline was expanded from a small-scale snapshot of only 21 samples to 1,680 images (24 prompts × 7 guidance weights × 10 repetitions), enabling a more stable and reliable assessment.
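
Why the sample count matters for FID can be illustrated with a small NumPy sketch: a sample covariance estimated from N vectors has rank at most N − 1, so with few samples most directions in feature space carry no variance estimate at all. The dimension `dim = 64` and the random features are toy assumptions chosen for speed, not the notebook's 2048-dimensional Inception features, and `covariance_rank` is a hypothetical helper:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64  # toy feature dimension (Inception features have 2048)


def covariance_rank(n_samples):
    """Rank of the sample covariance of n_samples random feature vectors."""
    features = rng.standard_normal((n_samples, dim))
    return int(np.linalg.matrix_rank(np.cov(features, rowvar=False)))


rank_small = covariance_rank(21)   # bounded by 21 - 1 = 20
rank_large = covariance_rank(480)  # enough samples to reach full rank in 64 dims
```

By the same bound, 480 samples still yield a rank-deficient 2048 × 2048 covariance, so the FID values reported here are best read as a relative comparison between runs.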

To ensure meaningful evaluation, a quality-based filtering strategy was applied. Only samples with guidance weight w ≥ 1.0 were retained, resulting in a final evaluation set of 480 images.
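
The sample counts follow directly from the sampling layout. A short sketch of the arithmetic for the 24 prompts in `text_prompts`, assuming guidance values of `[-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]`. The actual `config.W_TESTS` is not shown in this notebook; only its length of 7 and the presence of two values at or above 1.0 are implied by the reported counts:

```python
num_prompts = 24
num_repetitions = 10
w_tests = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]  # assumed values, see note above

# Every repetition produces one image per (guidance, prompt) pair
layout = [
    (w, prompt_idx)
    for _ in range(num_repetitions)
    for w in w_tests
    for prompt_idx in range(num_prompts)
]

total_images = len(layout)
kept_images = sum(1 for w, _ in layout if w >= 1.0)
```

With these assumed weights, `total_images` and `kept_images` reproduce the 1,680 generated and 480 retained images reported by the notebook.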

| Metric          | Baseline (N = 21) | Filtered (N = 480) | Improvement |
|-----------------|------------------:|-------------------:|------------:|
| Avg. CLIP score | 0.212             | 0.259              | +22.1%      |
| FID             | 320.5             | 239.2              | −25.3%      |

**Analysis of Improvements**
* Realism (FID): The FID score dropped from 320.5 to 239.2. This improvement is likely driven mainly by the larger sample (10 repetitions across 24 unique prompts) together with the exclusion of low-guidance images. Because all changes were applied at once, their individual contributions cannot be separated.
* Semantic Alignment (CLIP): The CLIP score rose from 0.212 to 0.259, indicating closer agreement between the generated images and their text prompts. This gain coincides with replacing generic labels with descriptive prompts and programmatically filtering the evaluation to exclude abstract and noisy outputs (where w < 1.0).
* Dataset Diversity: Using 24 distinct prompts provided a wider feature set and allowed for a more comprehensive comparison against the real flower dataset.

**Conclusion**
The results indicate that even minimal adjustments improve both scores.


# Part 4: Logging with Weights & Biases
All experiments are logged to Weights & Biases for reproducibility and comparison. Hyperparameters, generated images, guidance values, CLIP scores, and embedding-based metrics are stored in a structured table, together with aggregate evaluation metrics such as average CLIP score and FID.

In [37]:
# Load W&B API key from Colab Secrets and make it available as env variable
wandb_key = userdata.get('WANDB_API_KEY')
os.environ["WANDB_API_KEY"] = wandb_key
wandb.login()

wandb: Currently logged in as: michele-marschner (michele-marschner-university-of-potsdam) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


True

In [38]:
# Initialize Run
timestamp = other_utils.get_timestamp()
run = wandb.init(project="diffusion_model_assessment_v2", name=f"experiment_run_further_experiments_{timestamp}")

# Log Hyperparameters
wandb.config.update({
    "steps_T": config.TIMESTEPS,
    "image_size": config.IMG_SIZE,
    "clip_features": config.CLIP_FEATURES,
    "prompts": text_prompts
})

# Create a Table for Visual Results
columns = ["image generated", "prompt", "guidance_w", "clip_score", "uniqueness", "representativeness"]

diffusion_test_table = wandb.Table(columns=columns)

# Populate Table
# Grab uniqueness and representativeness scores back from FiftyOne
uniqueness_scores = dataset.values("uniqueness")
representativeness_scores = dataset.values("representativeness")

for i, (filepath, prompt, w_val) in enumerate(filtered_samples):
    wandb_img = wandb.Image(filepath)

    diffusion_test_table.add_data(
        wandb_img,
        prompt,
        w_val,
        clip_scores[i],
        uniqueness_scores[i],
        representativeness_scores[i],
    )

# Log the Table and Metrics
wandb.log({
    "generation_results": diffusion_test_table,
    "evaluation/fid_score": fid_score,
    "evaluation/average_clip_score": avg_clip_score
    })

# Finish
run.finish()

Run history:
evaluation/average_clip_score  ▁
evaluation/fid_score           ▁

Run summary:
evaluation/average_clip_score  0.25942
evaluation/fid_score           239.24612
