Request for Open-Source Evaluation Script #4

Hi,

Thanks for your great work! May I ask whether there are any plans to update the evaluation script?

I tried to reproduce the results from the paper, using Qwen3-VL as the VLM judge together with the curated prompt mentioned in the paper, but my final results are much lower. For example, for the S_ad metric, I did not get a single sample with a score of 10. 51% of my samples scored 9.5 and 17.6% scored 8.5, so it seems impossible to reach the 94.3 reported in the paper.
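As a rough sanity check on that claim, here is a minimal sketch of the best case my numbers allow. It rests on assumptions that are mine and not confirmed by the paper: that S_ad is the mean of the per-sample scores rescaled from 0-10 to 0-100, and that every remaining sample scores at most 9.5 (since none reached 10). The helper name `upper_bound_mean` is hypothetical.

```python
def upper_bound_mean(known, max_other):
    """Best-case mean score given a partial score distribution.

    known:     list of (fraction_of_samples, score) pairs that were observed
    max_other: highest score the remaining samples could possibly have
    """
    known_mass = sum(fraction for fraction, _ in known)
    if not 0.0 <= known_mass <= 1.0:
        raise ValueError("fractions must sum to a value between 0 and 1")
    known_sum = sum(fraction * score for fraction, score in known)
    # Give every unaccounted-for sample the most optimistic score.
    return known_sum + (1.0 - known_mass) * max_other


# Observed shares from my run: 51% at 9.5 and 17.6% at 8.5.
observed = [(0.51, 9.5), (0.176, 8.5)]

# ASSUMPTION: no sample reached 10, so the rest are capped at 9.5.
best_case = upper_bound_mean(observed, max_other=9.5)

# ASSUMPTION: the reported metric is the mean score times 10.
best_case_percent = best_case * 10
print(f"best-case S_ad under these assumptions: {best_case_percent:.2f}")
```

Under these assumptions the best case comes out to roughly 93.2, still below the reported 94.3. If the paper aggregates the scores differently, this bound would not apply.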

Could you please share the evaluation script, or let me know where I went wrong? Thanks.


