Data leak #24

Open
kimihailv opened this issue Dec 2, 2022 · 2 comments

@kimihailv

Hello! According to the XTD-10 repo, the test set contains 800 images from the MSCOCO train set. During training you also use the MSCOCO train set, so it seems you have a data leak. Or maybe I am misunderstanding something.

@FreddeFrallan
Owner

FreddeFrallan commented Dec 2, 2022

Hey,

Now that you mention it, it looks like XTD includes train images in their translated captions, which, in my humble opinion, is a rather weird decision, at least when there's still data from val+test that they have not used.
So yes, there seems to be data leakage in our evaluation.
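For anyone who wants to verify the overlap themselves, here is a minimal sketch, not the code used in this repo. It assumes the test images are listed by their original MSCOCO 2014 file names (for example `COCO_train2014_000000000009.jpg`), so the source split can be read off the name. The helper names and the toy file list are hypothetical.

```python
import re
from collections import Counter

# MSCOCO 2014 file names look like COCO_<split>2014_<12-digit id>.jpg.
# Assumption: the test-set listing uses these original names.
_COCO_NAME = re.compile(r"COCO_(train|val|test)2014_(\d{12})\.jpg$")


def coco_split(filename):
    """Return 'train', 'val' or 'test' for a COCO 2014 file name, else None."""
    match = _COCO_NAME.search(filename)
    return match.group(1) if match else None


def split_counts(filenames):
    """Count how many file names come from each MSCOCO 2014 split."""
    return Counter(coco_split(name) for name in filenames)


def leaked_names(test_filenames, train_filenames):
    """Return the sorted test file names that also appear in the training list."""
    return sorted(set(test_filenames) & set(train_filenames))


# Toy data (hypothetical), standing in for the real test and training listings.
test_files = [
    "COCO_train2014_000000000009.jpg",
    "COCO_train2014_000000000025.jpg",
    "COCO_val2014_000000000042.jpg",
]
train_files = [
    "COCO_train2014_000000000009.jpg",
    "COCO_train2014_000000000025.jpg",
    "COCO_train2014_000000000030.jpg",
]

counts = split_counts(test_files)
leaked = leaked_names(test_files, train_files)
print(counts)
print(leaked)
```

If the count for `'train'` is non-zero and those names also appear in the training list, the corresponding test images were seen during training.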

We're currently building a better evaluation system at CLIP_BENCHMARK, and we are working towards adding some multilingual evaluations.

The evaluations at this repo should be updated when such evaluations are available.

@guillemram97

How did you evaluate Table 1 in the original paper ('Cross-lingual and Multilingual CLIP')? Was the space of retrievable images the 1k images from the XTD-10 dataset? I ask because there is no intersection between the images of that dataset and the MSCOCO 2014 test set.
