# Visual Generation Evaluation Benchmarks and Metrics

Introduction and implementations of visual generation evaluation benchmarks and metrics.

Written by yuanjk0921@outlook.com

See more paper reading notes [here](https://junkunyuan.github.io/reading_papers/reading_papers.html)

Updated on Feb 23, 2025

**Contents**
<!-- - Inception Score -->
<!-- - FID -->
<!-- - FVD -->
<!-- - CLIPScore -->
<!-- - HPS / HPSv2 -->
<!-- - ImageReward -->
- T2I-CompBench
<!-- - GenEval -->
<!-- - VBench -->
<!-- - T2V-CompBench -->
<!-- - DPG-Bench -->

**References**
- [**T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation** *(NeurIPS 2023)*](https://arxiv.org/pdf/2307.06350): The paper that introduces T2I-CompBench.

## T2I-CompBench

T2I-CompBench evaluates **image generation** models on <u>compositional generation</u>, covering attribute binding, object relationships, and complex compositions.

### 1. Attribute Binding

#### 1.1 Color

**Data:** [All data (1000 prompts)](https://github.com/Karine-Huang/T2I-CompBench/blob/main/examples/dataset/color.txt) & [Training data (700 prompts)](https://github.com/Karine-Huang/T2I-CompBench/blob/main/examples/dataset/color_train.txt) & [Test data (300 prompts)](https://github.com/Karine-Huang/T2I-CompBench/blob/main/examples/dataset/color_val.txt)

**Metrics:** BLIP-VQA. Use BLIP as a VQA model and take the probability that it answers "yes" as the score.
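
A minimal sketch of how per-question "yes" probabilities could be combined into one score per image. It assumes, following the Disentangled BLIP-VQA description in the T2I-CompBench paper, that the prompt is split into noun phrases, each phrase is asked as its own question, and the resulting "yes" probabilities are multiplied. The function name `binding_score` and the hard-coded probabilities are hypothetical; a real pipeline would obtain them from a BLIP VQA model.

```python
from math import prod


def binding_score(yes_probs):
    """Combine per-noun-phrase P("yes") values into one attribute-binding score.

    Assumption: the final score is the product of the individual probabilities,
    so one badly rendered attribute pulls the whole score down.
    """
    if not yes_probs:
        raise ValueError("need at least one probability")
    for p in yes_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    return prod(yes_probs)


# Hypothetical P("yes") values for the questions "a green bench?" and "a blue bowl?"
score = binding_score([0.9, 0.5])
```

With a product, an image that gets the bench right but the bowl wrong scores lower than it would under an average, which matches the intent of testing whether every attribute is bound to the right object.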

**Sources:** 480 prompts from CC500, 200 prompts from COCO, and 320 prompts generated by ChatGPT.

**Examples:**
- a green bench and a blue bowl
- A bright yellow wall in a bathroom adds appeal to a white tiled floor.

#### 1.2 Shape

**Data:** [All data (1000 prompts)](https://github.com/Karine-Huang/T2I-CompBench/blob/main/examples/dataset/shape.txt) & [Training data (700 prompts)](https://github.com/Karine-Huang/T2I-CompBench/blob/main/examples/dataset/shape_train.txt) & [Test data (300 prompts)](https://github.com/Karine-Huang/T2I-CompBench/blob/main/examples/dataset/shape_val.txt)

**Metrics:** BLIP-VQA. Use BLIP as a VQA model and take the probability that it answers "yes" as the score.

**Sources:** generated by ChatGPT by prompting with the shape set of {long, tall, short, big, small, cubic, ...}.

**Examples:**
- a pyramidal paperweight and a teardrop pen
- The pyramidal roof and the triangular archway were the defining features of the ancient temple.

#### 1.3 Texture

**Data:** [All data (1000 prompts)](https://github.com/Karine-Huang/T2I-CompBench/blob/main/examples/dataset/texture.txt) & [Training data (700 prompts)](https://github.com/Karine-Huang/T2I-CompBench/blob/main/examples/dataset/texture_train.txt) & [Test data (300 prompts)](https://github.com/Karine-Huang/T2I-CompBench/blob/main/examples/dataset/texture_val.txt)

**Metrics:** BLIP-VQA. Use BLIP as a VQA model and take the probability that it answers "yes" as the score.

**Sources:** 800 prompts generated by random combinations of texture sets, and 200 prompts generated by ChatGPT.

**Examples:**
- a rubber eraser and a metallic key
- The wooden hanger and metallic hook support the fluffy bathrobe in the bathroom.

### 2. Object Relationship

#### 2.1 Spatial relationships

**Data:** [All data (1000 prompts)](https://github.com/Karine-Huang/T2I-CompBench/blob/main/examples/dataset/spatial.txt) & [Training data (700 prompts)](https://github.com/Karine-Huang/T2I-CompBench/blob/main/examples/dataset/spatial_train.txt) & [Test data (300 prompts)](https://github.com/Karine-Huang/T2I-CompBench/blob/main/examples/dataset/spatial_val.txt)

**Metrics:** use UniDet to detect objects and determine the spatial relationship by comparing their bounding boxes.
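
A minimal sketch of deciding a spatial relation from two detected bounding boxes. This is a simplified rule based only on box centers, not the benchmark's exact scoring; the helper names and the example boxes are hypothetical. Boxes are assumed to be `(x1, y1, x2, y2)` in image coordinates, with `y` growing downward.

```python
def center(box):
    """Return the (x, y) center of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def spatial_relation(box_a, box_b):
    """Return where box_a lies relative to box_b: 'left', 'right', 'top', or 'bottom'.

    Assumption: the dominant axis of the center offset decides the relation.
    Ties (including identical centers) fall through to the horizontal branch.
    """
    ax, ay = center(box_a)
    bx, by = center(box_b)
    dx, dy = ax - bx, ay - by
    if abs(dx) >= abs(dy):
        return "left" if dx < 0 else "right"
    # y grows downward, so a smaller y means box_a is higher in the image.
    return "top" if dy < 0 else "bottom"


# Hypothetical detections for the prompt "a book on the top of a woman"
book = (40, 10, 80, 30)
woman = (30, 40, 90, 200)
relation = spatial_relation(book, woman)
```

A fuller implementation would also need to handle missed detections, detector confidence, and distance-based relations such as "next to", none of which this sketch attempts.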

**Sources:** random combinations of a spatial set of {on the side of, next to, on the right of, ...}.

**Examples:**
- a book on the top of a woman
- a turtle next to a airplane


#### 2.2 Non-spatial relationships

**Data:** [All data (1000 prompts)](https://github.com/Karine-Huang/T2I-CompBench/blob/main/examples/dataset/non_spatial.txt) & [Training data (700 prompts)](https://github.com/Karine-Huang/T2I-CompBench/blob/main/examples/dataset/non_spatial_train.txt) & [Test data (300 prompts)](https://github.com/Karine-Huang/T2I-CompBench/blob/main/examples/dataset/non_spatial_val.txt)

**Metrics:** CLIPScore.
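
At its core, CLIPScore is the cosine similarity between a CLIP image embedding and a CLIP text embedding. The sketch below shows only that similarity step on precomputed vectors; the function name `clip_similarity` and the toy embeddings are hypothetical, and a real pipeline would obtain the vectors from a CLIP model. Some CLIPScore formulations additionally clip negative values to zero and rescale, which is omitted here.

```python
import math


def clip_similarity(image_emb, text_emb):
    """Cosine similarity between an image embedding and a text embedding."""
    if len(image_emb) != len(text_emb):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    norm_i = math.sqrt(sum(a * a for a in image_emb))
    norm_t = math.sqrt(sum(b * b for b in text_emb))
    if norm_i == 0.0 or norm_t == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_i * norm_t)


# Toy 3-dimensional vectors standing in for real CLIP embeddings
image_emb = [0.2, 0.9, 0.1]
text_emb = [0.1, 0.8, 0.3]
similarity = clip_similarity(image_emb, text_emb)
```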

**Sources:** generated by ChatGPT.

**Examples:**
- A person is yawning in a boring meeting.
- A runner is pushing themselves to go just a little bit farther, feeling their heart race and their muscles ache.

### 3. Complex Compositions

**Data:** [All data (1000 prompts)](https://github.com/Karine-Huang/T2I-CompBench/blob/main/examples/dataset/complex.txt) & [Training data (700 prompts)](https://github.com/Karine-Huang/T2I-CompBench/blob/main/examples/dataset/complex_train.txt) & [Test data (300 prompts)](https://github.com/Karine-Huang/T2I-CompBench/blob/main/examples/dataset/complex_val.txt)

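**Metrics:** the average of the scores from the metrics used for the other dimensions (BLIP-VQA, UniDet, and CLIPScore), which the paper calls the 3-in-1 metric.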
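
A minimal sketch of the averaging step, assuming the three per-image scores (CLIPScore, BLIP-VQA, and UniDet, as in the paper's 3-in-1 metric) have already been computed and lie on comparable scales. The function name `three_in_one` and the example values are hypothetical.

```python
def three_in_one(clip_score, blip_vqa_score, unidet_score):
    """Average three per-image scores into one complex-composition score.

    Assumption: all three inputs are already on a comparable scale,
    so an unweighted mean is meaningful.
    """
    return (clip_score + blip_vqa_score + unidet_score) / 3.0


# Hypothetical per-image scores
combined = three_in_one(0.3, 0.6, 0.9)
```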

**Sources:** 250 prompts generated by ChatGPT for each of four scenarios: "two objects with multiple attributes", "two objects with mixed attributes", "more than two objects with multiple attributes", and "more than two objects with mixed attributes".

**Examples:**
- The soft blanket draped over the bumpy couch and the hard floor.
- The sleek, aerodynamic shape of the sports car cut through the wind with ease, a symbol of speed and luxury.