# Experiment Goal: Optimizing Generative Evaluation via Sample Size Expansion and Semantic Refinement

In this experiment, I aim to show that even minimal adjustments (increasing the sample size, refining text prompts, and filtering for high-quality results) can significantly improve evaluation metrics.

**Key Improvements:**
* **Increased Sample Size:** Expanding the generated dataset by using 20 repetitions for each prompt stabilizes the covariance matrix calculations required for the FID score. This reduces the statistical noise caused by small sample sizes.
* **High-Guidance Generation:** To reduce low-guidance artifacts, we directly generate samples at a higher guidance setting (w ≥ 1.0) and evaluate metrics on the complete set of generated images.
* **Prompt Engineering:** I extended the text prompts to include more detailed descriptions. Providing richer semantic information helps the CLIP model better align pixels with concepts, thereby increasing the CLIP score.

**Expected Outcome:**
Overall, these adjustments should lead to a significant reduction in the FID score (indicating better realism) and a higher, more stable average CLIP score (indicating better prompt adherence).

# Setup

The project repository is mounted from Google Drive and added to the Python path to allow clean imports from the src module. The dataset is copied to the local Colab filesystem to improve I/O performance during training. 

All global settings (random seed, device selection, paths, batch sizes) are defined once and reused across the notebook to ensure consistency and reproducibility.

In [None]:
import sys
from pathlib import Path

IN_COLAB = "google.colab" in sys.modules

if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    #  !! Change the following path if the project is located elsewhere (repeat in config.py)
    %cd "/content/drive/MyDrive/Applied-Computer-Vision-Projects/Diffusion_Model_03"                    

PROJECT_ROOT = Path.cwd()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

Mounted at /content/drive
/content/drive/MyDrive/Applied-Computer-Vision-Projects/Diffusion_Model_03


In [2]:
%%capture
# Install dependencies (the %%capture cell magic must be the first line of the cell)
%pip install --no-cache-dir -r requirements.txt

In [None]:
import os
from google.colab import userdata

import numpy as np
import torch

import clip
import open_clip

import wandb
import fiftyone as fo
import fiftyone.brain as fob

from huggingface_hub import snapshot_download
from huggingface_hub import HfApi



In [None]:
from utils import UNet_utils, ddpm_utils, other_utils, config, metrics, visualizations

In [5]:
!rm -rf /content/data
!cp -r "$config.DRIVE_ROOT/data"* /content

In [6]:
other_utils.set_seeds(config.SEED)

All random seeds set to 51 for reproducibility


# Part 1: Generating Flower Images and Extracting U-Net Bottleneck Features

This part of the project restores a pretrained CLIP-conditioned DDPM pipeline to (1) generate flower images from text prompts and (2) capture intermediate U-Net representations from the bottleneck via forward hooks.

## Reconstructing CLIP, DDPM and the sampling setup

To ensure that sampling matches the original training regime, the CLIP text encoder and the DDPM diffusion process are reinitialized with the same diffusion hyperparameters (noise schedule and number of timesteps) as during training. 

Rebuilding these components is essential for producing samples that are compatible with the pretrained U-Net and therefore comparable across guidance strengths.
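
To make "the same diffusion hyperparameters" concrete, the following plain-Python sketch rebuilds a linear noise schedule and the cumulative signal-retention term that follows from it. The endpoints 0.0001 and 0.02 are the ones used in this notebook; the step count `T = 400` is a hypothetical placeholder for `config.TIMESTEPS`, whose value is not shown here, and the helper names are illustrative rather than part of `ddpm_utils`.

```python
def linear_beta_schedule(beta_start: float, beta_end: float, timesteps: int) -> list:
    """Evenly spaced betas from beta_start to beta_end (inclusive), like torch.linspace."""
    if timesteps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + i * step for i in range(timesteps)]


def cumulative_alpha_bar(betas: list) -> list:
    """alpha_bar_t = product over s <= t of (1 - beta_s): signal variance kept at step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


T = 400  # assumption: placeholder for config.TIMESTEPS
betas = linear_beta_schedule(0.0001, 0.02, T)
alpha_bars = cumulative_alpha_bar(betas)
print(f"first beta={betas[0]:.4f}, last beta={betas[-1]:.4f}, final alpha_bar={alpha_bars[-1]:.4f}")
```

If the schedule endpoints or the number of steps differ from training, the noise level the U-Net sees at a given timestep no longer matches what it was trained on, which is why these values are rebuilt exactly.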

In [7]:
# Load CLIP for encoding the text prompts
clip_model, clip_preprocess = clip.load("ViT-B/32", device=config.DEVICE)
clip_model.eval()

100%|████████████████████████████████████████| 338M/338M [00:02<00:00, 120MiB/s]


CLIP(
  (visual): VisionTransformer(
    (conv1): Conv2d(3, 768, kernel_size=(32, 32), stride=(32, 32), bias=False)
    (ln_pre): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (transformer): Transformer(
      (resblocks): Sequential(
        (0): ResidualAttentionBlock(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlp): Sequential(
            (c_fc): Linear(in_features=768, out_features=3072, bias=True)
            (gelu): QuickGELU()
            (c_proj): Linear(in_features=3072, out_features=768, bias=True)
          )
          (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        )
        (1): ResidualAttentionBlock(
          (attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          

In [8]:
# Re-initialize DDPM wrapper
B_start = 0.0001
B_end = 0.02
B = torch.linspace(B_start, B_end, config.TIMESTEPS).to(config.DEVICE)

ddpm = ddpm_utils.DDPM(B, config.DEVICE)

## Loading the pretrained U-Net

Next, the U-Net architecture is instantiated and its trained weights are loaded from disk. The model is set to evaluation mode to disable training-time behavior (e.g., dropout-like effects) and to make generation deterministic given a fixed seed. This restored model serves as the backbone for all subsequent image synthesis and feature extraction. All images are resized to 32×32 and normalized to the [-1, 1] range.

No explicit train/validation split is used at this stage. Diffusion models learn a data distribution rather than a supervised mapping, so performance is best assessed through sample quality and distribution-level metrics (e.g., FID and CLIP Score).

### U-Net-DDPM architecture used in this project

The model is a CLIP-conditioned DDPM U-Net that predicts noise at each diffusion step.
* Encoder: residual conv stem + two downsampling blocks (Conv + GroupNorm + GELU + rearrangement pooling)
* Bottleneck: low-resolution spatial feature map (e.g., 8×8), optionally with self-attention.
* Conditioning: sinusoidal timestep embeddings + projected CLIP text embeddings, injected via scale-shift modulation in the decoder; classifier-free guidance via random condition dropout (Bernoulli mask).
* Decoder: nearest-neighbor upsampling + conv with skip connections; final conv outputs the RGB noise prediction.
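
The random condition dropout mentioned in the conditioning bullet can be sketched as follows. This is a minimal illustration in plain Python, not the code in `UNet_utils`: the function name `drop_condition` and the drop probability of 0.1 are assumptions made for the example.

```python
import random


def drop_condition(c_embeddings, drop_prob, rng):
    """Zero out each sample's conditioning vector with probability drop_prob.

    The per-sample keep flag is a Bernoulli(1 - drop_prob) draw, so the model
    sometimes trains without the text condition and learns an unconditional
    prediction alongside the conditional one.
    """
    masked = []
    for vec in c_embeddings:
        keep = 0.0 if rng.random() < drop_prob else 1.0
        masked.append([keep * v for v in vec])
    return masked


rng = random.Random(51)                      # fixed seed for a repeatable mask
batch = [[0.5, -1.0, 2.0] for _ in range(8)]  # toy stand-in for projected CLIP embeddings
masked_batch = drop_condition(batch, drop_prob=0.1, rng=rng)  # 0.1 is a hypothetical value
print(sum(1 for row in masked_batch if not any(row)), "of", len(batch), "conditions dropped")
```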

In [9]:
# Define the uNet Architecture
uNet_model = UNet_utils.UNet(
    T=config.TIMESTEPS,
    img_ch=config.IMG_CH,
    img_size=config.IMG_SIZE,
    down_chs=(256, 256, 512),
    t_embed_dim=8,
    c_embed_dim=config.CLIP_FEATURES
).to(config.DEVICE)


# Load the model weights
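try:
    # map_location keeps loading robust if the checkpoint was saved on a different device
    uNet_model.load_state_dict(torch.load(config.UNET_MODEL_PATH, map_location=config.DEVICE))
    print("Model weights loaded successfully.")
except FileNotFoundError:
    print("Error: Model weights not found.")
    raise  # do not continue sampling with randomly initialized weights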

uNet_model.eval()

Model weights loaded successfully.


UNet(
  (down0): ResidualConvBlock(
    (conv1): GELUConvBlock(
      (model): Sequential(
        (0): Conv2d(3, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): GroupNorm(8, 256, eps=1e-05, affine=True)
        (2): GELU(approximate='none')
      )
    )
    (conv2): GELUConvBlock(
      (model): Sequential(
        (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (1): GroupNorm(8, 256, eps=1e-05, affine=True)
        (2): GELU(approximate='none')
      )
    )
  )
  (down1): DownBlock(
    (model): Sequential(
      (0): GELUConvBlock(
        (model): Sequential(
          (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (1): GroupNorm(32, 256, eps=1e-05, affine=True)
          (2): GELU(approximate='none')
        )
      )
      (1): GELUConvBlock(
        (model): Sequential(
          (0): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          (1): GroupNorm(32, 256, eps

In [10]:
# Define prompts
text_prompts = [
    "A photo of a red rose",
    "A high-quality photo of a vibrant red rose",
    "A close-up shot of a red rose with many layered petals",
    "Macro photography of a deep red rose with velvet-like texture",
    "A single red rose blooming in a lush green garden",
    "A detailed photo of a red rose flower with morning dew drops",
    "A vibrant red rose captured in bright, natural sunlight",
    "A professional studio photograph of a single red rose against a soft background",

    "A photo of a white daisy",
    "A high-resolution photo of a white daisy with a yellow center",
    "A round white daisy with crisp, clean white petals",
    "A white daisy flower growing in a sunny green meadow",
    "A macro photo of a daisy showing the texture of the yellow pollen center",
    "A delicate white daisy captured in soft, diffused natural light",
    "A top-down view of a symmetrical white daisy flower",
    "A professional photograph of a simple white daisy with sharp focus",

    "A photo of a yellow sunflower",
    "A vibrant yellow sunflower with a large brown center",
    "A detailed photo of a sunflower with bright, radiant yellow petals",
    "A tall yellow sunflower standing in a vast sunflower field",
    "A macro shot of a sunflower head showing the pattern of the seeds",
    "A bright yellow sunflower facing the sun during golden hour",
    "A large, blooming yellow sunflower with a thick green stem",
    "A professional high-quality photo of a yellow sunflower with vivid colors"
]

In [None]:
# Sanity check: calculate how many images will be generated
P = len(text_prompts)           # number of prompts
W = len(config.W_OPT)           # number of guidance strengths (classifier-free guidance)
NUM_REPETITIONS = 20            # number of repetitions of every (prompt, guidance) pair
n_samples = P * W * NUM_REPETITIONS   # total images: one per (prompt, guidance, repetition)

print("Expected n_samples:", n_samples)

Expected n_samples: 1680


## Image Generation

In this part of the project, we sample flower images from a pretrained CLIP-conditioned UNet-DDPM using classifier-free guidance (CFG). 

**Image generation (CFG sampling)**
Given a small set of text prompts and a list of guidance weights w, images are generated via a wrapper function (`sample_flowers_with_hook`) that:
1. encodes prompts using CLIP, and
2. runs CFG sampling through `ddpm_utils.sample_w(...)`.
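
The CFG step in the list above combines a conditioned and an unconditioned noise prediction. The internals of `ddpm_utils.sample_w` are not shown in this notebook, so the sketch below assumes one common weighting convention, `(1 + w) * eps_cond - w * eps_uncond`; the actual implementation may use a different one. Plain lists stand in for tensors.

```python
def cfg_combine(eps_cond, eps_uncond, w):
    """Blend conditioned and unconditioned noise predictions.

    Assumed convention: eps = (1 + w) * eps_cond - w * eps_uncond.
    Under it, w = 0 returns the conditioned prediction, w = -1 the
    unconditioned one, and w > 0 pushes further toward the condition.
    """
    return [(1.0 + w) * c - w * u for c, u in zip(eps_cond, eps_uncond)]


eps_cond = [1.0, 2.0]    # toy conditioned prediction
eps_uncond = [0.5, 0.5]  # toy unconditioned prediction
for w in (-1.0, 0.0, 1.0):
    print(w, cfg_combine(eps_cond, eps_uncond, w))
```

Because both predictions are needed at every step, samplers of this kind typically run the model on a doubled batch.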

The sampling setup produces one final image per (prompt, w, repetition) combination (in our configuration: 24 prompts × 3 guidance values × 20 repetitions = 1,440 images).
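
The image count for this configuration can be checked by enumerating every combination. The guidance values below are hypothetical stand-ins for `config.W_OPT`, which is not shown here; only their number (three) matters for the total, and the iteration order is not meant to mirror the sampler.

```python
from itertools import product

num_prompts = 24
guidance_values = [1.0, 1.5, 2.0]  # hypothetical stand-in for config.W_OPT (3 values)
num_repetitions = 20

# One entry per (repetition, guidance weight, prompt index) combination.
combos = list(product(range(num_repetitions), guidance_values, range(num_prompts)))
n_expected = len(combos)
print("Expected number of images:", n_expected)  # 24 * 3 * 20 = 1440
```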

**Bottleneck embedding extraction (forward hook)**

To analyze internal representations, we register a forward hook on the U-Net's `down2` module and store its activations during sampling. Because CFG internally doubles the batch (conditioned + unconditioned), the captured tensor may contain more entries than final outputs; we therefore keep only the first `n_samples` embeddings to align 1 embedding with 1 generated image.


**Where outputs are stored**
Generated images are mapped from [-1, 1] to [0, 1] and saved as PNG files to `config.SAVE_DIR` using the naming scheme: `flower_w{w:+.1f}_p{prompt_idx}_{i}.png`
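
A minimal sketch of the two conventions just described: the value-range mapping and the file naming scheme. The helper names are illustrative (not taken from `other_utils`), and the clamping of out-of-range values is an assumption added for the example.

```python
def to_unit_range(x: float) -> float:
    """Map a pixel value from [-1, 1] to [0, 1]; clamp values outside the range (assumption)."""
    return min(1.0, max(0.0, (x + 1.0) / 2.0))


def sample_filename(w: float, prompt_idx: int, i: int) -> str:
    """Build a file name following flower_w{w:+.1f}_p{prompt_idx}_{i}.png."""
    return f"flower_w{w:+.1f}_p{prompt_idx}_{i}.png"


print(to_unit_range(-1.0), to_unit_range(0.0), to_unit_range(1.0))
print(sample_filename(2.0, 3, 7))
print(sample_filename(-0.5, 0, 0))
```

The `+` in the format spec always writes the sign, so positive and negative guidance weights produce file names of the same shape.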

In [None]:
# Register a forward hook on the U-Net bottleneck layer
uNet_model.down2.register_forward_hook(UNet_utils.get_embedding_hook('down2'))
print("Hook registered on model.down2")

# Run the generation
other_utils.set_seeds(config.SEED)

print(f"Generating {len(text_prompts) * len(config.W_OPT) * NUM_REPETITIONS} total images...")

print("Generating images...")
generated_images, extracted_embeddings = ddpm_utils.sample_flowers_with_hook(
    text_list=text_prompts,
    model=uNet_model,
    clip_model=clip_model,
    ddpm=ddpm,
    input_size=config.INPUT_SIZE,
    T=config.TIMESTEPS,
    device=config.DEVICE,
    w=config.W_OPT,
    num_repetitions=NUM_REPETITIONS
)

print(f"Generation Complete.")
print(f"Final Images Shape: {generated_images.shape}")
print(f"Final Embeddings Shape: {extracted_embeddings.shape}")

All random seeds set to 51 for reproducibility
Generating 1680 total images...
Generating images...
  Running repetition 1/10...
  Running repetition 2/10...
  Running repetition 3/10...
  Running repetition 4/10...
  Running repetition 5/10...
  Running repetition 6/10...
  Running repetition 7/10...
  Running repetition 8/10...
  Running repetition 9/10...
  Running repetition 10/10...
Generation Complete.
Final Images Shape: torch.Size([1680, 3, 32, 32])
Final Embeddings Shape: torch.Size([1680, 512, 8, 8])


In [None]:
# save generated flower images to disk
saved_samples = other_utils.save_samples_to_disk(generated_images, text_prompts, config.W_OPT, config.SAVE_DIR, P, n_samples)

Saving images to disk...
All images saved.


# Part 2: Evaluation with CLIP Score and FID
In this section, we evaluate the generated flower images using CLIP Score and Fréchet Inception Distance (FID), following the metrics defined in the assignment. Together, these measures capture (1) semantic alignment with the text prompt and (2) distribution-level realism compared to real flower images.


## CLIP Score (semantic alignment)

CLIP Score measures how well a generated image matches its conditioning prompt. It answers the question: "How accurately does the generated image depict the content described in the text prompt?"

We compute it as the cosine similarity between text and image embeddings produced by a pretrained OpenCLIP ViT-B-32 model. Both embeddings are L2-normalized before the similarity is computed, so the score is a dot product in normalized embedding space. Higher CLIP scores indicate stronger prompt-image correspondence. Scores are computed for all generated images, enabling comparisons across prompts and guidance strengths w.
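
The scoring rule itself is small enough to sketch without the model. The example below uses toy vectors in place of real OpenCLIP embeddings, and the function names are illustrative rather than those in `metrics`.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]


def clip_style_score(image_embedding, text_embedding):
    """Cosine similarity: dot product of the two L2-normalized embeddings."""
    img = l2_normalize(image_embedding)
    txt = l2_normalize(text_embedding)
    return sum(a * b for a, b in zip(img, txt))


# Toy embeddings: same direction, orthogonal, and partially aligned.
print(clip_style_score([3.0, 4.0], [6.0, 8.0]))
print(clip_style_score([1.0, 0.0], [0.0, 1.0]))
print(clip_style_score([1.0, 1.0], [1.0, 0.0]))
```

Normalizing first makes the score independent of embedding magnitude, so only the direction of the two vectors matters.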

## Fréchet Inception Distance (FID) (distribution realism)

FID measures how close the distribution of generated images is to the distribution of real images. 

We compute FID using 2048-dimensional feature vectors extracted from a pretrained InceptionV3 network, where the classification head is replaced by an identity layer to access pooled features. Real images are loaded from disk, while generated samples are read from the saved PNG outputs.

For a fair comparison, both real and generated images are processed identically: they are resized to 299×299 and normalized with ImageNet mean and standard deviation (from [0,1]) before being passed through InceptionV3. FID is then computed by comparing the mean and covariance of Inception features for real vs. generated samples; lower values indicate more realistic generations.
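
For reference, the standard Fréchet distance between two Gaussians fitted to the feature sets is written below. The symbols are introduced here for illustration: $\mu_r, \Sigma_r$ denote the mean and covariance of the real-image features, and $\mu_g, \Sigma_g$ those of the generated-image features. Whether `metrics.calculate_fid_score` follows this form term by term is not shown in the notebook.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Each covariance is a 2048×2048 matrix. When it is estimated from fewer samples than feature dimensions, it is rank-deficient, which is one reason FID values from small sample sets are noisy.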

In [None]:
# Initialize OpenCLIP model for CLIP-score evaluation
clip_scorer, _, clip_preprocess_val = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
clip_scorer.to(config.DEVICE).eval()
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Compute CLIP scores for all generated samples
clip_scores = []

print("Calculating scores...")

for i, (filepath, prompt, w_val) in enumerate(saved_samples):
    score = metrics.calculate_clip_score(
        clip_preprocess_val=clip_preprocess_val,
        tokenizer=tokenizer,
        clip_scorer=clip_scorer,
        image_path=filepath,
        text_prompt=prompt,
        device=config.DEVICE,
    )
    clip_scores.append(score)

avg_clip_score = float(np.mean(clip_scores))
print(f"Average CLIP Score: {avg_clip_score:.4f}")

Calculating scores...
Average CLIP Score: 0.2594


In [None]:
# Extract features from real images on disk
dataset_path = config.TMP_ROOT / "data/cropped_flowers"

# Compute FID Score: checks if the images look "real" compared to the original dataset
fid_score = metrics.calculate_fid_score(saved_samples, dataset_path)
print(f"FID Score: {fid_score:.4f}")

100%|██████████| 37/37 [02:57<00:00,  4.80s/it]


Real Embeddings: (1166, 2048)
Total generated: 1680 | Realistic samples for FID: 480


Extracting Generated Features: 100%|██████████| 15/15 [00:02<00:00,  5.41it/s]


Generated Embeddings: (480, 2048)
FID Score: 239.2461


# Evaluation

To reduce the statistical noise observed in earlier experiments, the evaluation pipeline was expanded from a small-scale snapshot of only 21 samples to 1,440 images (3 guidance weights × 24 prompts × 20 repetitions). This enables a more stable and reliable assessment, although the sample is still small.

| Metric          | Baseline (N = 21) | Optimized (N = 1440) | Improvement |
|-----------------|------------------:|---------------------:|------------:|
| Avg. CLIP score | 0.212             | 0.259                | +22.1%      |
| FID             | 320.5             | 230.2                | −28.2%      |
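
The improvement column is a relative change with respect to the baseline. A small sketch of that arithmetic, using the rounded values from the table (the function name is illustrative):

```python
def relative_change_percent(baseline: float, new: float) -> float:
    """Signed change of `new` relative to `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


clip_change = relative_change_percent(0.212, 0.259)
fid_change = relative_change_percent(320.5, 230.2)
print(f"CLIP: {clip_change:+.1f}%   FID: {fid_change:+.1f}%")
```

Recomputing from rounded table entries can differ in the last digit from a percentage derived from unrounded scores.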

**Analysis of Improvements**
* Realism (FID): The FID score dropped from 320.5 to 230.2. This improvement is primarily attributed to the larger sample size (20 repetitions for each of the 24 prompts), which stabilizes the feature statistics.
* Semantic alignment (CLIP): The CLIP score rose from 0.212 to 0.259, indicating a closer match between the generated images and their text prompts. This gain is attributed to replacing generic labels with descriptive prompts and excluding abstract, noisy outputs (w < 1.0).
* Dataset diversity: Using 24 distinct prompts provided a wider feature set and allowed for a more comprehensive comparison against the real flower dataset.

**Conclusion**
The results indicate that even minimal adjustments lead to a noticeable improvement in both scores.
Low resolution and limited data are the main bottlenecks: the model captures the key, repeated cues of each flower type but misses fine textures and petal/leaf detail, and colors appear slightly oversaturated.

---------------------------------------------------------------------------------------------------------------------------------------------

# Part 3: Embedding Analysis with FiftyOne Brain

In this section, we analyze the internal representations of our pretrained UNet-DDPM using FiftyOne Brain. We build a FiftyOne dataset from the generated samples, and attach the relevant metadata to each sample:
* **Prompt:** the text used for conditioning
* **Guidance weight w:** the CFG strength used during sampling
* **CLIP Score:** semantic alignment between the generated image and its prompt
* **U-Net bottleneck embedding:** intermediate features captured via a forward hook (from the U-Net's `down2` block)

The extracted embeddings are stored directly as vector fields in the FiftyOne dataset (one embedding per image), enabling representation-based analysis.

We then compute two FiftyOne Brain metrics on these embeddings:
* **Uniqueness:** identifies samples that are most distinct relative to the rest (useful for spotting diverse or outlier generations)
* **Representativeness:** identifies samples that best summarize the set (i.e., typical examples in embedding space)
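
To give an intuition for what an embedding-based uniqueness score measures, the sketch below ranks toy 2-D points by their mean distance to their nearest neighbors. This is a simplified stand-in and not FiftyOne Brain's actual algorithm; the function name, the choice of `k`, and the scaling to a maximum of 1 are all assumptions for illustration.

```python
import math


def knn_uniqueness(embeddings, k=3):
    """Mean Euclidean distance to the k nearest neighbors, scaled so the maximum is 1.

    Points far from everything else score high (outliers); points in dense
    regions of embedding space score low (near-duplicates).
    """
    scores = []
    for i, a in enumerate(embeddings):
        dists = sorted(math.dist(a, b) for j, b in enumerate(embeddings) if j != i)
        nearest = dists[:k]
        scores.append(sum(nearest) / len(nearest))
    top = max(scores)
    return [s / top if top > 0 else 0.0 for s in scores]


# Three tightly clustered points and one outlier.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
toy_scores = knn_uniqueness(points, k=2)
ranked = sorted(range(len(points)), key=lambda idx: toy_scores[idx], reverse=True)
print("most unique first:", ranked)
```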

Finally, we launch the FiftyOne App for interactive exploration. This lets us visually inspect images alongside their prompts, w, CLIP scores, and embedding-driven uniqueness/representativeness rankings, giving a qualitative view of how the model's bottleneck features structure the generated set.

In [None]:
dataset = other_utils.create_FiftyOne_dataset(saved_samples, extracted_embeddings, clip_scores)

# Compute Uniqueness (Visual diversity)
fob.compute_uniqueness(dataset, embeddings="unet_embedding")

# Compute Representativeness using the extracted U-Net embeddings
fob.compute_representativeness(dataset, embeddings="unet_embedding")

Computing uniqueness...


INFO:fiftyone.brain.internal.core.uniqueness:Computing uniqueness...


Uniqueness computation complete


INFO:fiftyone.brain.internal.core.uniqueness:Uniqueness computation complete


Computing representativeness...


INFO:fiftyone.brain.internal.core.representativeness:Computing representativeness...


Computing clusters for 480 embeddings; this may take awhile...


INFO:fiftyone.brain.internal.core.representativeness:Computing clusters for 480 embeddings; this may take awhile...


Representativeness computation complete


INFO:fiftyone.brain.internal.core.representativeness:Representativeness computation complete


In [36]:
# Launch the FiftyOne App to visualize your dataset and analyze the results
session = fo.launch_app(dataset)


Could not connect session, trying again in 10 seconds



RuntimeError: Client is not connected

In [None]:
rep_high_view = dataset.sort_by("representativeness", reverse=True).limit(12)
session.view = rep_high_view

In [None]:
rep_low_view = dataset.sort_by("representativeness").limit(12)
session.view = rep_low_view

In [None]:
low_unique_view = dataset.sort_by("uniqueness").limit(12)
session.view = low_unique_view

In [None]:
high_unique_view = dataset.sort_by("uniqueness", reverse=True).limit(12)
session.view = high_unique_view

### Qualitative Analysis of FiftyOne Embedding Metrics

With 1,440 generated samples, the FiftyOne Brain metrics provide a statistically meaningful view of how the U-Net bottleneck embeddings structure the generated image distribution.

We evaluate the images by their representativeness (how centrally a sample lies within the embedding space) and uniqueness (how much a sample deviates from nearby generations).

When sorting the images by representativeness, clear patterns become visible.
The most representative samples mainly show white daisies and yellow sunflowers.
These flowers are centered in the image, have a clear circular shape, and look
very similar across different generations. This indicates that the model can
produce these flower types in a stable and consistent way.

![High representativeness samples](../results/further_experiments/high_representativeness.png)

The least representative samples are mostly red roses. While their color matches
the text prompt well, the rose images vary strongly in shape and texture and often
appear overly saturated or blurry. This higher variation places them further away
from the center of the learned embedding space.

![Low representativeness samples](../results/further_experiments/low_representativeness.png)

The uniqueness views show a similar effect. Images with low uniqueness are almost
identical red roses with the same size, position, and background, indicating
very little variation. 

![Low uniqueness samples](../results/further_experiments/low_uniqueness.png)

In contrast, highly unique samples show different image
layouts, including changes in zoom level and images containing multiple flowers.
In particular, images with two flowers are clearly more unique than those with
only a single flower.

![High uniqueness samples](../results/further_experiments/high_uniqueness.png)

In addition, more detailed text prompts tend to produce images with higher
uniqueness and higher representativeness than simple prompts. Richer descriptions
encourage more consistent structure while still allowing visual variation,
whereas short prompts often lead to repetitive compositions.

Overall, the screenshots show that the model performs best on simple, centered,
and symmetric flower shapes, while increased scene complexity and richer prompts
introduce greater diversity in the generated results.



# Part 4: Logging with Weights & Biases
All experiments are logged to Weights & Biases (project `diffusion_model_assessment_v2`) under the run name `experiment_run_further_experiments_{timestamp}` for reproducibility and comparison. Hyperparameters, generated images, guidance values, CLIP scores, and embedding-based metrics are stored in a structured table, together with aggregate evaluation metrics such as average CLIP score and FID.

**Logged hyperparameters:**
* Number of diffusion timesteps (T)
* Image size (IMG_SIZE)
* CLIP feature dimension (CLIP_FEATURES)
* Prompts used for generation

**Aggregate Metrics:**
We log the final FID score and the average CLIP score across the FiftyOne dataset.

**Per-sample results table:**
We also create a W&B table for detailed inspection, with one row per generated image containing:
* Generated image (with preview)
* Text prompt
* Guidance weight w
* CLIP score
* FiftyOne Brain uniqueness
* FiftyOne Brain representativeness

In [37]:
# Load W&B API key from Colab Secrets and make it available as env variable
wandb_key = userdata.get('WANDB_API_KEY')
os.environ["WANDB_API_KEY"] = wandb_key
wandb.login()

wandb: Currently logged in as: michele-marschner (michele-marschner-university-of-potsdam) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


True

In [None]:
# Initialize Run
timestamp = other_utils.get_timestamp()
run = wandb.init(project="diffusion_model_assessment_v2", name=f"experiment_run_further_experiments_{timestamp}")

# Log Hyperparameters
wandb.config.update({
    "steps_T": config.TIMESTEPS,
    "image_size": config.IMG_SIZE,
    "clip_features": config.CLIP_FEATURES,
    "prompts": text_prompts
})

# Create a Table for Visual Results
columns = ["image generated", "prompt", "guidance_w", "clip_score", "uniqueness", "representativeness"]

diffusion_test_table = wandb.Table(columns=columns)

# Populate Table
# Grab uniqueness and representativeness scores back from FiftyOne
uniqueness_scores = dataset.values("uniqueness")
representativeness_scores = dataset.values("representativeness")

for i, (filepath, prompt, w_val) in enumerate(saved_samples):
    wandb_img = wandb.Image(filepath)

    diffusion_test_table.add_data(
        wandb_img,
        prompt,
        w_val,
        clip_scores[i],
        uniqueness_scores[i],
        representativeness_scores[i],
    )

# Log the Table and Metrics
wandb.log({
    "generation_results": diffusion_test_table,
    "evaluation/fid_score": fid_score,
    "evaluation/average_clip_score": avg_clip_score
    })

# Finish
run.finish()

Run history:
evaluation/average_clip_score ▁
evaluation/fid_score ▁

Run summary:
evaluation/average_clip_score 0.25942
evaluation/fid_score 239.24612


# Part 5: Publish Dataset on Hugging Face

The dataset is exported in FiftyOne's native format to a local directory (set via `export_dir`). This export produces a folder that includes the media files (`data/`) and the dataset metadata (`samples.json`, `metadata.json`), where any stored fields (e.g., CLIP score, uniqueness, representativeness, and embeddings) are preserved for later restoration.

To use this in the project, the export path must be adapted to a valid location (e.g., under /content/ or config.DRIVE_ROOT). In addition, the desired metrics must already be present as sample fields in the FiftyOne dataset prior to export; otherwise they will not appear in the exported samples.json.

In [None]:
# Save FiftyOne dataset (images + metadata) to disk
print(f"Exporting dataset to {config.EXPORT_DIR}...")

dataset.export(
    export_dir=str(config.EXPORT_DIR),
    dataset_type=fo.types.FiftyOneDataset,
    export_media=True, # This ensures the actual .png images are included
)

print("Export complete.")

In [None]:
# The token must be stored in Colab Secrets under the name HF_TOKEN
os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")

HF_TOKEN = os.getenv("HF_TOKEN")
assert HF_TOKEN is not None, "HF_TOKEN env var is not set!"

api = HfApi(token=HF_TOKEN)

api.upload_large_folder(
    folder_path=f"{config.EXPORT_DIR}",
    repo_id=config.HF_EXPERIMENT_REPO_ID,      # ! must already exist on HF
    repo_type="dataset",
    ignore_patterns=["*.ipynb_checkpoints"],
)

In [None]:
# Download the HF dataset repo snapshot to a local cache directory
local_dir = snapshot_download(
    repo_id=config.HF_EXPERIMENT_REPO_ID,
    repo_type="dataset",
)

# Name under which the dataset will be registered in FiftyOne
dataset_name = config.FIFTYONE_DATASET_EXPERIMENTS_NAME
if dataset_name in fo.list_datasets():
    fo.delete_dataset(dataset_name)

# Import the exported FiftyOneDataset from disk (expects samples.json, etc.)
restored_dataset = fo.Dataset.from_dir(
    dataset_dir=local_dir,
    dataset_type=fo.types.FiftyOneDataset,
    name=dataset_name,
)

# Launch the FiftyOne App
print(restored_dataset)
fo.launch_app(restored_dataset)