Multilingual-STSB

The original STS Benchmark [1] consists of train, dev, and test splits of 5749, 1500, and 1379 sentence pairs, respectively, each labelled with a similarity score between 0 (least similar) and 5 (most similar). To extend the original English STSb to a multilingual dataset, we translated the STS Benchmark splits from English into 15 languages¹. We used the Google Translator Python package for this purpose, following a procedure that is widely used in the related literature [2] [3] [4]. To maintain data quality, translated sentence pairs with a confidence value below 0.7 were dropped. As a result, Dutch is the language with the fewest sentence pairs in the development (1483 sentence pairs) and test (1358 sentence pairs) sets. Note that Google Translator distinguishes two variants of Chinese: Simplified Chinese using Mainland Chinese terms (zh-CN) and Traditional Chinese using Taiwanese terms (zh-TW).

¹Languages (en is the original English source; the other 15 are translations): ar, cs, de, en, es, fr, hi, it, ja, nl, pl, pt, ru, tr, zh-CN, zh-TW
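The confidence filter described above can be sketched in a few lines of Python. This is only an illustration, not the code used to build the dataset: the record layout, the field names `confidence1` and `confidence2`, and the rule that both translations of a pair must reach the threshold are all assumptions. Only the 0.7 threshold comes from the description above.

```python
# Minimal sketch of the confidence filter; not the authors' implementation.
# Assumption: each pair carries one confidence value per translated sentence,
# stored under the hypothetical keys "confidence1" and "confidence2".

CONFIDENCE_THRESHOLD = 0.7  # pairs below this value were dropped


def keep_confident_pairs(pairs, threshold=CONFIDENCE_THRESHOLD):
    """Return only the pairs whose two translations both reach the threshold."""
    return [
        pair
        for pair in pairs
        if pair["confidence1"] >= threshold and pair["confidence2"] >= threshold
    ]


# Toy records for illustration only (not taken from the dataset).
toy_pairs = [
    {"id": 0, "score": 4.2, "confidence1": 0.95, "confidence2": 0.91},
    {"id": 1, "score": 1.0, "confidence1": 0.65, "confidence2": 0.88},
    {"id": 2, "score": 3.5, "confidence1": 0.70, "confidence2": 0.70},
]

kept = keep_confident_pairs(toy_pairs)
print([pair["id"] for pair in kept])
```

Because pairs are dropped independently for each language, the splits can end up with slightly different sizes across languages, which is consistent with the lower counts reported for Dutch.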

Data

The data is split into train, dev, and test sets, stored as pickle files in the "Data" folder. You can load each file with the read_pickle() function from pandas.
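A minimal loading sketch, assuming pandas is installed. The helper name `load_split` is hypothetical, and the file names inside the "Data" folder are not listed here, so the path passed to it must be replaced with an actual file from that folder.

```python
from pathlib import Path

import pandas as pd


def load_split(path):
    """Load one pickled split from disk with pandas.read_pickle()."""
    return pd.read_pickle(Path(path))
```

For example, `load_split("Data/<file name>.pkl")` would return the object stored in that file, where `<file name>` is a placeholder for one of the pickle files in the "Data" folder.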

Citation

If you use the dataset, please cite this article:

@InProceedings{10.1007/978-3-030-91608-4_31,
author="Huertas-Garc{\'i}a, {\'A}lvaro
and Huertas-Tato, Javier
and Mart{\'i}n, Alejandro
and Camacho, David",
editor="Yin, Hujun
and Camacho, David
and Tino, Peter
and Allmendinger, Richard
and Tall{\'o}n-Ballesteros, Antonio J.
and Tang, Ke
and Cho, Sung-Bae
and Novais, Paulo
and Nascimento, Susana",
title="Countering Misinformation Through Semantic-Aware Multilingual Models",
booktitle="Intelligent Data Engineering and Automated Learning -- IDEAL 2021",
year="2021",
publisher="Springer International Publishing",
address="Cham",
pages="312--323",
isbn="978-3-030-91608-4"
}

References

[1] Cer, Daniel, Diab, Mona, Agirre, Eneko, Lopez-Gazpio, Iñigo, & Specia, Lucia. (2017). SemEval-2017 Task 1: Semantic Textual Similarity - Multilingual and Cross-lingual Focused Evaluation. https://doi.org/10.18653/v1/S17-2001

[2] Ham, Jiyeon, Choe, Yo Joong, Park, Kyubyong, Choi, Ilji, & Soh, Hyungjoon. (2020). KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding.

[3] Reimers, Nils, & Gurevych, Iryna. (2020). Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation.

[4] Hu, Junjie, Ruder, Sebastian, Siddhant, Aditya, Neubig, Graham, Firat, Orhan, & Johnson, Melvin. (2020). XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization.
