
Multilingual benchmark datasets #113

Merged · 8 commits merged into LAION-AI:main on Dec 2, 2023

Conversation

visheratin (Contributor)

Hi!

When working on the paper about the NLLB-CLIP models, I benchmarked them on multiple multilingual datasets. I thought it would be a good idea to add these benchmarks to this repo so that other multilingual models can be tested as well.

The changes in this PR:

  1. Added French and German to the multilingual MS COCO dataset, making it the full XTD10 dataset. Also changed the image-extraction logic so that only the 1000 required images are downloaded instead of the full train and validation sets.
  2. Added the Crossmodal-3600 dataset - 3600 captions in 36 languages, including five low-resource ones.
  3. Added the Flickr30k-200 dataset - 1000 captions from the Flickr30k dataset translated to 200 languages using the NLLB-3.3B model. Introduced in the NLLB-CLIP paper.
  4. Added the XTD200 dataset - captions from the XTD10 dataset translated to 200 languages using the NLLB-3.3B model. Introduced in the NLLB-CLIP paper.
  5. Added the set_language method to set the target language for the NLLB tokenizer (a sketch follows this list).
  6. Enabled building CSV files from directories.
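
As an illustration of item 5, here is a minimal sketch of how a set_language method can work, assuming the HuggingFace NLLB tokenizer underneath; the wrapper class and parameter defaults are illustrative, not the exact code from this PR:

```python
# Minimal sketch, assuming HuggingFace's NLLB tokenizer underneath.
# The wrapper class and defaults are illustrative, not this PR's exact code.
from transformers import AutoTokenizer

class NLLBTokenizer:
    def __init__(self, model_name="facebook/nllb-200-distilled-600M"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

    def set_language(self, lang_code: str) -> None:
        # NLLB prepends a language token (e.g. "fra_Latn") to every
        # sequence, so the language must be set before tokenizing.
        self.tokenizer.src_lang = lang_code

    def __call__(self, texts, context_length: int = 77):
        # Tokenize a batch of captions into fixed-length ID tensors.
        return self.tokenizer(
            texts,
            padding="max_length",
            truncation=True,
            max_length=context_length,
            return_tensors="pt",
        ).input_ids
```

The benchmark loop would then call set_language with the NLLB code of each target language (e.g. "fra_Latn" for French, "deu_Latn" for German) before tokenizing that language's captions.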

@mehdidc @JeniaJitsev

visheratin (Contributor, Author) commented Nov 19, 2023

Also, this PR depends on a PR in the OpenCLIP repo, because we need to change how the text is tokenized to properly support the NLLB tokenizer.

visheratin (Contributor, Author)

The OpenCLIP PR is merged. @mehdidc could you please check the PR?

mehdidc (Collaborator) commented Dec 1, 2023

Checking now, thanks a lot @visheratin for the PR!

mehdidc (Collaborator) commented Dec 1, 2023

@visheratin Many thanks, all the datasets work fine!

I am trying to reproduce the numbers in your paper, but I see some gaps.
For Table 3, NLLB-CLIP Base, I was using the following:

```
clip_benchmark eval --pretrained v1 --model nllb-clip-base --dataset multilingual_mscoco_captions --language en fr zh de es it tr ru ko pl jp --recall_k 1 5 10 --output "benchmark_{dataset}_{pretrained}_{model}_{language}_{task}.json"
```

I got the following results:

| language | image_retrieval_recall@1 | image_retrieval_recall@5 | image_retrieval_recall@10 |
|---|---|---|---|
| en | 47.2 | 80.6 | 88.3 |
| fr | 45.0 | 76.9 | 85.8 |
| zh | 41.1 | 71.7 | 83.0 |
| de | 43.3 | 74.8 | 84.7 |
| es | 44.1 | 75.9 | 85.4 |
| it | 44.7 | 75.9 | 85.7 |
| tr | 41.3 | 73.4 | 84.3 |
| ru | 40.6 | 71.0 | 82.8 |
| ko | 39.4 | 70.9 | 81.2 |
| pl | 45.5 | 76.1 | 87.1 |
| jp | 37.9 | 65.2 | 77.8 |

Is there anything I am missing?

visheratin (Contributor, Author) commented Dec 1, 2023

@mehdidc The changes from the PR in OpenCLIP haven't been released yet, so the tokenizer doesn't work correctly. You need to install OpenCLIP from the main branch. If you run the benchmark on the nllb-clip-base-siglip model, the results should match the numbers from this CSV.
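
For anyone reproducing this, installing OpenCLIP from the main branch and rerunning on the siglip variant would look roughly as follows; the v1 pretrained tag is carried over from the command above and is an assumption here:

```
# Install OpenCLIP from the main branch (the fix is not yet in a release).
pip install git+https://github.com/mlfoundations/open_clip.git

# Rerun the benchmark on the siglip variant of the base model.
clip_benchmark eval --pretrained v1 --model nllb-clip-base-siglip --dataset multilingual_mscoco_captions --language en fr zh de es it tr ru ko pl jp --recall_k 1 5 10 --output "benchmark_{dataset}_{pretrained}_{model}_{language}_{task}.json"
```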

Regarding the numbers from the paper, I ran those tests on a slightly different version of the model, so the numbers for nllb-clip-base won't precisely match the numbers in the paper, although they should be quite close.

mehdidc (Collaborator) commented Dec 1, 2023

@visheratin I was actually using the main branch; I will rerun with the siglip version to compare with the CSV, thank you.

visheratin (Contributor, Author)

@mehdidc Looks like the tokenizer works fine. The difference is because the models are not exactly the same. For some languages, the model from the paper is better; for others, the OpenCLIP version is better.

I will run the CLIP benchmark on all models from OpenCLIP this week so we will have a full picture for multilingual retrieval.

mehdidc (Collaborator) commented Dec 1, 2023

> I will run the CLIP benchmark on all models from OpenCLIP this week so we will have a full picture for multilingual retrieval.

Cool, sounds good!

I ran the siglip version of the base model and it matches the CSV:

| language | image_retrieval_recall@1 | image_retrieval_recall@5 | image_retrieval_recall@10 |
|---|---|---|---|
| en | 68.5 | 88.8 | 95.2 |
| fr | 64.8 | 86.9 | 93.5 |
| zh | 58.4 | 84.7 | 92.4 |
| de | 62.8 | 87.0 | 93.5 |
| es | 65.2 | 86.7 | 93.9 |
| it | 64.3 | 87.9 | 94.4 |
| tr | 63.7 | 86.8 | 94.3 |
| ru | 60.7 | 82.4 | 88.6 |
| ko | 59.9 | 83.6 | 92.2 |
| pl | 65.4 | 87.6 | 94.4 |
| jp | 54.6 | 80.8 | 87.8 |

mehdidc (Collaborator) commented Dec 2, 2023

Merging.

mehdidc merged commit 8a05786 into LAION-AI:main on Dec 2, 2023. 1 check passed.