
Multilingual benchmark datasets #113

Merged · 8 commits merged into LAION-AI:main on Dec 2, 2023

Conversation

visheratin (Contributor)

Hi!

When working on the paper about the NLLB-CLIP models, I benchmarked them on multiple multilingual datasets. I thought it would be a good idea to add these benchmarks to this repo so that other multilingual models can be tested as well.

The changes in this PR:

  1. Added French and German to the multilingual MS COCO dataset, making it the full XTD10 dataset. Also changed the image-extraction logic so that only the 1000 required images are downloaded instead of the full train and validation sets.
  2. Added the Crossmodal-3600 dataset - 3600 captions in 36 languages, including five low-resource ones.
  3. Added the Flickr30k-200 dataset - 1000 captions from the Flickr30k dataset translated to 200 languages using the NLLB-3.3B model. Introduced in the NLLB-CLIP paper.
  4. Added the XTD200 dataset - captions from the XTD10 dataset translated to 200 languages using the NLLB-3.3B model. Introduced in the NLLB-CLIP paper.
  5. Added the set_language method to set the target language for the NLLB tokenizer (a sketch follows this list).
  6. Enabled building CSV files from directories.
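
As an illustration of item 5, here is a minimal sketch of how a set_language method can work, assuming the HuggingFace NLLB tokenizer underneath; the wrapper class and parameter defaults are illustrative, not the exact code from this PR:

```python
# Minimal sketch, assuming HuggingFace's NLLB tokenizer underneath.
# The wrapper class and defaults are illustrative, not this PR's exact code.
from transformers import AutoTokenizer

class NLLBTokenizer:
    def __init__(self, model_name="facebook/nllb-200-distilled-600M"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

    def set_language(self, lang_code: str) -> None:
        # NLLB prepends a language token (e.g. "fra_Latn") to every
        # sequence, so the language must be set before tokenizing.
        self.tokenizer.src_lang = lang_code

    def __call__(self, texts, context_length: int = 77):
        # Tokenize a batch of captions into fixed-length ID tensors.
        return self.tokenizer(
            texts,
            padding="max_length",
            truncation=True,
            max_length=context_length,
            return_tensors="pt",
        ).input_ids
```

The benchmark loop would then call set_language with the NLLB code of each target language (e.g. "fra_Latn" for French, "deu_Latn" for German) before tokenizing that language's captions.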

@mehdidc @JeniaJitsev

visheratin (Contributor, Author) commented Nov 19, 2023

Also, this PR depends on a PR in the OpenCLIP repo, because we need to change how the text is tokenized to properly support the NLLB tokenizer.

visheratin (Contributor, Author)

The OpenCLIP PR is merged. @mehdidc could you please check the PR?

mehdidc (Collaborator) commented Dec 1, 2023

Checking now, thanks a lot @visheratin for the PR!

mehdidc (Collaborator) commented Dec 1, 2023

@visheratin Many thanks, all the datasets work fine!

I am trying to reproduce the numbers in your paper, but I see some gaps.
For Table 3, NLLB-CLIP Base, I was using the following:

```
clip_benchmark eval --pretrained v1 --model nllb-clip-base --dataset multilingual_mscoco_captions --language en fr zh de es it tr ru ko pl jp --recall_k 1 5 10 --output "benchmark_{dataset}_{pretrained}_{model}_{language}_{task}.json"
```

I got the following results:

| language | image_retrieval_recall@1 | image_retrieval_recall@5 | image_retrieval_recall@10 |
|---|---|---|---|
| en | 47.2 | 80.6 | 88.3 |
| fr | 45.0 | 76.9 | 85.8 |
| zh | 41.1 | 71.7 | 83.0 |
| de | 43.3 | 74.8 | 84.7 |
| es | 44.1 | 75.9 | 85.4 |
| it | 44.7 | 75.9 | 85.7 |
| tr | 41.3 | 73.4 | 84.3 |
| ru | 40.6 | 71.0 | 82.8 |
| ko | 39.4 | 70.9 | 81.2 |
| pl | 45.5 | 76.1 | 87.1 |
| jp | 37.9 | 65.2 | 77.8 |

Is there anything I am missing?

visheratin (Contributor, Author) commented Dec 1, 2023

@mehdidc The changes from the PR in OpenCLIP haven't been released yet, so the tokenizer doesn't work correctly. You need to install OpenCLIP from the main branch. If you run the benchmark on the nllb-clip-base-siglip model, the results should match the numbers from this CSV.
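
For anyone reproducing this, installing OpenCLIP from the main branch and rerunning on the siglip variant would look roughly as follows; the v1 pretrained tag is carried over from the command above and is an assumption here:

```
# Install OpenCLIP from the main branch (the fix is not yet in a release).
pip install git+https://github.com/mlfoundations/open_clip.git

# Rerun the benchmark on the siglip variant of the base model.
clip_benchmark eval --pretrained v1 --model nllb-clip-base-siglip --dataset multilingual_mscoco_captions --language en fr zh de es it tr ru ko pl jp --recall_k 1 5 10 --output "benchmark_{dataset}_{pretrained}_{model}_{language}_{task}.json"
```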

Regarding the numbers from the paper, I ran those tests on a slightly different version of the model, so the numbers for nllb-clip-base won't precisely match the numbers in the paper, although they should be quite close.

mehdidc (Collaborator) commented Dec 1, 2023

@visheratin I was actually using the main branch; I will rerun with the siglip version to compare with the CSV, thank you.

visheratin (Contributor, Author)

@mehdidc Looks like the tokenizer works fine. The difference is because the models are not exactly the same. For some languages, the model from the paper is better; for others, the OpenCLIP version is better.

I will run the CLIP benchmark on all models from OpenCLIP this week so we will have a full picture for multilingual retrieval.

mehdidc (Collaborator) commented Dec 1, 2023

> I will run the CLIP benchmark on all models from OpenCLIP this week so we will have a full picture for multilingual retrieval.

Cool, sounds good!

I ran the siglip version of the base model and it matches the CSV:

| language | image_retrieval_recall@1 | image_retrieval_recall@5 | image_retrieval_recall@10 |
|---|---|---|---|
| en | 68.5 | 88.8 | 95.2 |
| fr | 64.8 | 86.9 | 93.5 |
| zh | 58.4 | 84.7 | 92.4 |
| de | 62.8 | 87.0 | 93.5 |
| es | 65.2 | 86.7 | 93.9 |
| it | 64.3 | 87.9 | 94.4 |
| tr | 63.7 | 86.8 | 94.3 |
| ru | 60.7 | 82.4 | 88.6 |
| ko | 59.9 | 83.6 | 92.2 |
| pl | 65.4 | 87.6 | 94.4 |
| jp | 54.6 | 80.8 | 87.8 |

mehdidc (Collaborator) commented Dec 2, 2023

Merging.

mehdidc merged commit 8a05786 into LAION-AI:main on Dec 2, 2023. 1 check passed.