Hi,
Thanks for your great work! Are there any plans to update the evaluation script?
I am trying to reproduce the results from the paper, using Qwen3-VL as the VLM judge together with the curated prompt mentioned in the paper, but my final results are much lower. For example, for the S_ad metric, not a single sample receives a score of 10. 51% of my samples score 9.5 and 17.6% score 8.5, so it seems impossible to reach the 94.3 reported in the paper.
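To make the arithmetic concrete, here is a rough upper-bound check. It is only a sketch under my own assumptions, which may not match the paper: that S_ad is 10 times the mean per-sample score on a 0-10 scale, and that scores come in steps of 0.5, so the best the remaining samples could score is 9.0. The function name `upper_bound` and its parameters are hypothetical.

```python
# Hypothetical sanity check, not the paper's evaluation script.
# Assumption: S_ad = 10 * mean(per-sample score), scores on a 0-10 scale.

def upper_bound(known, best_remaining):
    """Best-case metric given known score fractions.

    known: dict mapping score -> fraction of samples with that score.
    best_remaining: highest score the remaining samples could have.
    """
    rest = 1.0 - sum(known.values())  # fraction of samples not listed
    mean = sum(score * frac for score, frac in known.items())
    mean += rest * best_remaining     # give the rest the best possible score
    return 10.0 * mean

# Fractions observed in my run: 51% at 9.5, 17.6% at 8.5, none at 10.
observed = {9.5: 0.51, 8.5: 0.176}

# Assumption: 0.5-step scores, so the remaining 31.4% score at most 9.0.
bound = upper_bound(observed, best_remaining=9.0)
print(round(bound, 2))
```

Under these assumptions the best case comes out at about 91.7, below 94.3. The bound depends on the assumed score granularity, so it would be looser if the judge can return finer-grained scores.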
Could you please share the evaluation script or let me know where I went wrong? Thanks.