Implementing Flores 200 translation evaluation benchmark across 200 languages #1706

Open
wants to merge 17 commits into
base: main
The diff you're trying to view is too large. We only load the first 3000 changed files.
59 changes: 59 additions & 0 deletions lm_eval/tasks/flores200/README.md
@@ -0,0 +1,59 @@
# FLORES-200

### Paper

Title: `The FLORES-200 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation`
Link: https://github.com/facebookresearch/flores/blob/main/flores200/README.md

Original Flores-101 paper: https://arxiv.org/abs/2106.03193

The creation of FLORES-200 doubles the existing language coverage of FLORES-101. Given the nature of the new languages, which have less standardization and require more specialized professional translations, the verification process became more complex. This required modifications to the translation workflow. FLORES-200 has several languages which were not translated from English. Specifically, several languages were translated from Spanish, French, Russian and Modern Standard Arabic. Moreover, FLORES-200 also includes two script alternatives for four languages.

Homepage: https://github.com/facebookresearch/flores/tree/main/flores200

We use the prompt template introduced in "Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis" (https://arxiv.org/pdf/2304.04675.pdf) and later used in "SambaLingo: Teaching Large Language Models New Languages" (https://arxiv.org/abs/2404.05829). Each prompt is the source sentence followed by ` =`, and the model is expected to generate the translation.
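
As a rough illustration of that template, the sketch below renders a zero-shot and a one-shot prompt in the `<source sentence> =` form used by the task configs in this directory. The helper name `build_prompt`, the example sentences, and the delimiters (a single space before the target, a blank line between examples) are assumptions made for illustration and may not match what the harness does internally.

```python
def build_prompt(source, fewshot_pairs=(), target_delimiter=" ", fewshot_delimiter="\n\n"):
    """Render a '<source> =' prompt, optionally preceded by solved examples.

    The delimiters are illustrative assumptions, not values read from the harness.
    """
    # Each few-shot example is a source sentence, ' =', then its translation.
    shots = [f"{src} ={target_delimiter}{tgt}" for src, tgt in fewshot_pairs]
    # The final line is left open so the model completes the translation.
    return fewshot_delimiter.join(shots + [f"{source} ="])


zero_shot = build_prompt("Bonjour le monde.")
one_shot = build_prompt(
    "Merci beaucoup.",
    fewshot_pairs=[("Bonjour le monde.", "Hello world.")],
)

print(zero_shot)
print("---")
print(one_shot)
```

The template carries no explicit instruction or language names, so the few-shot examples are the main signal telling the model which target language is expected.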

### Citation

```
@article{nllb2022,
author = {{NLLB Team} and Marta R. Costa-jussà and James Cross and Onur Çelebi and Maha Elbayad and Kenneth Heafield and Kevin Heffernan and Elahe Kalbassi and Janice Lam and Daniel Licht and Jean Maillard and Anna Sun and Skyler Wang and Guillaume Wenzek and Al Youngblood and Bapi Akula and Loic Barrault and Gabriel Mejia Gonzalez and Prangthip Hansanti and John Hoffman and Semarley Jarrett and Kaushik Ram Sadagopan and Dirk Rowe and Shannon Spruit and Chau Tran and Pierre Andrews and Necip Fazil Ayan and Shruti Bhosale and Sergey Edunov and Angela Fan and Cynthia Gao and Vedanuj Goswami and Francisco Guzmán and Philipp Koehn and Alexandre Mourachko and Christophe Ropers and Safiyyah Saleem and Holger Schwenk and Jeff Wang},
title = {No Language Left Behind: Scaling Human-Centered Machine Translation},
year = {2022}
}

@inproceedings{,
title={The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation},
author={Goyal, Naman and Gao, Cynthia and Chaudhary, Vishrav and Chen, Peng-Jen and Wenzek, Guillaume and Ju, Da and Krishnan, Sanjana and Ranzato, Marc'Aurelio and Guzm\'{a}n, Francisco and Fan, Angela},
year={2021}
}

@inproceedings{,
title={Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English},
author={Guzm\'{a}n, Francisco and Chen, Peng-Jen and Ott, Myle and Pino, Juan and Lample, Guillaume and Koehn, Philipp and Chaudhary, Vishrav and Ranzato, Marc'Aurelio},
journal={arXiv preprint arXiv:1902.01382},
year={2019}
}
```

### Groups and Tasks

#### Tasks
There are 41618 supported translation tasks. To find the name of the task you want:

- Find the language code and script you are interested in at https://github.com/facebookresearch/flores/tree/main/flores200. The three characters before the underscore are the three-letter [ISO 639](https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes) language code, and the four characters after the underscore are the [ISO 15924](https://en.wikipedia.org/wiki/ISO_15924) script code. For example, the LANG_CODE for English is `eng_Latn`.
- Search for the language code in the task directory: `cd lm_eval/tasks/flores200 && ls | grep LANG_CODE`. To find a specific translation pair, grep for both language codes. For example, for Arabic -> English translation: `cd lm_eval/tasks/flores200 && ls | grep arb_Arab | grep eng_Latn`. Task names follow the pattern `flores200_<source>-<target>`, so this search returns both translation directions.
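
The same lookup can be sketched in Python. This is a minimal example that assumes the `flores200_<source>-<target>` naming pattern used by the generated YAML files; the helper names `task_name` and `find_task_files` are hypothetical and not part of the harness.

```python
from pathlib import Path


def task_name(src: str, tgt: str) -> str:
    """Build a task name following the flores200_<source>-<target> pattern."""
    return f"flores200_{src}-{tgt}"


def find_task_files(task_dir, *lang_codes):
    """Return names of task YAMLs in task_dir that mention every given code.

    Like chaining `grep` calls, this matches either translation direction.
    """
    return sorted(
        path.stem
        for path in Path(task_dir).glob("flores200_*.yaml")
        if all(code in path.stem for code in lang_codes)
    )


print(task_name("arb_Arab", "eng_Latn"))
# Run from the repository root to list both Arabic/English directions.
print(find_task_files("lm_eval/tasks/flores200", "arb_Arab", "eng_Latn"))
```

Building the name directly avoids listing a directory that holds tens of thousands of files, while the search helper is useful when you only know one side of the pair.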

### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?


If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [x] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
24 changes: 24 additions & 0 deletions lm_eval/tasks/flores200/_default_template_yaml
@@ -0,0 +1,24 @@
group: flores200
dataset_path: facebook/flores
fewshot_config:
  sampler: first_n
output_type: generate_until
should_decontaminate: false
metric_list:
  - metric: chrf
    aggregation: chrf
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: false
  # TER is an error rate, so lower scores are better.
  - metric: ter
    aggregation: ter
    higher_is_better: false
    ignore_case: true
    ignore_punctuation: false
  - metric: bleu
    aggregation: bleu
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: false
metadata:
  version: 0.0
53 changes: 53 additions & 0 deletions lm_eval/tasks/flores200/_generate_configs.py
@@ -0,0 +1,53 @@
"""
Take in a base YAML and write one task YAML per FLORES-200 language pair,
each of which includes the base YAML.
"""
import argparse
import logging
import os

import yaml
from datasets import get_dataset_config_names
from tqdm import tqdm


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--base_yaml_path", required=True)
    parser.add_argument("--save_prefix_path", default="flores200")
    return parser.parse_args()


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    args = parse_args()

    # get filename of base_yaml so we can `"include": ` it in our other YAMLs.
    base_yaml_name = os.path.split(args.base_yaml_path)[-1]
    with open(args.base_yaml_path, encoding="utf-8") as f:
        base_yaml = yaml.full_load(f)

    dataset_configs = get_dataset_config_names("facebook/flores")
    for dataset_config in tqdm(dataset_configs):
        # Ignore configs that are not a parallel pair of two languages
        if "-" in dataset_config:
            # Pair configs are named "<source>-<target>", e.g. "ace_Arab-ace_Latn"
            in_lang, out_lang = dataset_config.split("-")
            yaml_dict = {
                "include": base_yaml_name,
                "task": f"flores200_{dataset_config}",
                "test_split": "dev",
                "fewshot_split": "dev",
                "dataset_name": dataset_config,
                "doc_to_text": "{{sentence_" + in_lang + "}} =",
                "doc_to_target": "{{sentence_" + out_lang + "}}",
            }

            file_save_path = args.save_prefix_path + f"_{dataset_config}.yaml"
            logging.info(f"Saving yaml for subset {dataset_config} to {file_save_path}")
            with open(file_save_path, "w", encoding="utf-8") as yaml_file:
                yaml.dump(
                    yaml_dict,
                    yaml_file,
                    width=float("inf"),
                    allow_unicode=True,
                    default_style='"',
                )
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-ace_Latn.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-ace_Latn"
"doc_to_target": "{{sentence_ace_Latn}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-ace_Latn"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-acm_Arab.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-acm_Arab"
"doc_to_target": "{{sentence_acm_Arab}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-acm_Arab"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-acq_Arab.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-acq_Arab"
"doc_to_target": "{{sentence_acq_Arab}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-acq_Arab"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-aeb_Arab.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-aeb_Arab"
"doc_to_target": "{{sentence_aeb_Arab}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-aeb_Arab"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-afr_Latn.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-afr_Latn"
"doc_to_target": "{{sentence_afr_Latn}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-afr_Latn"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-ajp_Arab.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-ajp_Arab"
"doc_to_target": "{{sentence_ajp_Arab}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-ajp_Arab"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-aka_Latn.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-aka_Latn"
"doc_to_target": "{{sentence_aka_Latn}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-aka_Latn"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-als_Latn.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-als_Latn"
"doc_to_target": "{{sentence_als_Latn}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-als_Latn"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-amh_Ethi.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-amh_Ethi"
"doc_to_target": "{{sentence_amh_Ethi}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-amh_Ethi"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-apc_Arab.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-apc_Arab"
"doc_to_target": "{{sentence_apc_Arab}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-apc_Arab"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-arb_Arab.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-arb_Arab"
"doc_to_target": "{{sentence_arb_Arab}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-arb_Arab"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-arb_Latn.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-arb_Latn"
"doc_to_target": "{{sentence_arb_Latn}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-arb_Latn"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-ars_Arab.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-ars_Arab"
"doc_to_target": "{{sentence_ars_Arab}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-ars_Arab"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-ary_Arab.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-ary_Arab"
"doc_to_target": "{{sentence_ary_Arab}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-ary_Arab"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-arz_Arab.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-arz_Arab"
"doc_to_target": "{{sentence_arz_Arab}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-arz_Arab"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-asm_Beng.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-asm_Beng"
"doc_to_target": "{{sentence_asm_Beng}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-asm_Beng"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-ast_Latn.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-ast_Latn"
"doc_to_target": "{{sentence_ast_Latn}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-ast_Latn"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-awa_Deva.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-awa_Deva"
"doc_to_target": "{{sentence_awa_Deva}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-awa_Deva"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-ayr_Latn.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-ayr_Latn"
"doc_to_target": "{{sentence_ayr_Latn}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-ayr_Latn"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-azb_Arab.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-azb_Arab"
"doc_to_target": "{{sentence_azb_Arab}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-azb_Arab"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-azj_Latn.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-azj_Latn"
"doc_to_target": "{{sentence_azj_Latn}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-azj_Latn"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-bak_Cyrl.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-bak_Cyrl"
"doc_to_target": "{{sentence_bak_Cyrl}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-bak_Cyrl"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-bam_Latn.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-bam_Latn"
"doc_to_target": "{{sentence_bam_Latn}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-bam_Latn"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-ban_Latn.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-ban_Latn"
"doc_to_target": "{{sentence_ban_Latn}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-ban_Latn"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-bel_Cyrl.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-bel_Cyrl"
"doc_to_target": "{{sentence_bel_Cyrl}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-bel_Cyrl"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-bem_Latn.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-bem_Latn"
"doc_to_target": "{{sentence_bem_Latn}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-bem_Latn"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-ben_Beng.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-ben_Beng"
"doc_to_target": "{{sentence_ben_Beng}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-ben_Beng"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-bho_Deva.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-bho_Deva"
"doc_to_target": "{{sentence_bho_Deva}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-bho_Deva"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-bjn_Arab.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-bjn_Arab"
"doc_to_target": "{{sentence_bjn_Arab}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-bjn_Arab"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-bjn_Latn.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-bjn_Latn"
"doc_to_target": "{{sentence_bjn_Latn}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-bjn_Latn"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-bod_Tibt.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-bod_Tibt"
"doc_to_target": "{{sentence_bod_Tibt}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-bod_Tibt"
"test_split": "dev"
7 changes: 7 additions & 0 deletions lm_eval/tasks/flores200/flores200_ace_Arab-bos_Latn.yaml
@@ -0,0 +1,7 @@
"dataset_name": "ace_Arab-bos_Latn"
"doc_to_target": "{{sentence_bos_Latn}}"
"doc_to_text": "{{sentence_ace_Arab}} ="
"fewshot_split": "dev"
"include": "_default_template_yaml"
"task": "flores200_ace_Arab-bos_Latn"
"test_split": "dev"