Huertas97/Multilingual-STSB

Multilingual-STSB

The original STS Benchmark [1] consists of train, dev, and test splits of 5749, 1500, and 1379 sentence pairs, respectively, each labelled with a similarity score between 0 and 5, from least to most similar. To extend the original English STSb into a multilingual dataset, we translated the STS Benchmark splits from English into 15 languages¹. We used the Google Translator Python package for this purpose, following the procedure widely used in the related literature [2] [3] [4]. To maintain data quality, translated sentence pairs with a confidence value below 0.7 were dropped. As a result, Dutch is the language with the fewest sentence pairs in the development (1483 sentence pairs) and test (1358 sentence pairs) sets. Note that Google Translator distinguishes two variants of Chinese: simplified, using Mainland Chinese terms (zh-CN), and traditional, using Taiwanese terms (zh-TW).

¹ Languages (the list includes en, the English source): ar, cs, de, en, es, fr, hi, it, ja, nl, pl, pt, ru, tr, zh-CN, zh-TW
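The confidence filter described above can be illustrated with a minimal sketch. It is not the code used to build the dataset: the record layout and the field names `confidence1` and `confidence2` are hypothetical, and it assumes a pair is dropped when either translated sentence falls below the threshold. Only the 0.7 threshold comes from the text.

```python
MIN_CONFIDENCE = 0.7  # threshold stated in the dataset description


def filter_pairs(pairs, min_confidence=MIN_CONFIDENCE):
    """Keep only pairs whose translations both reach the confidence threshold.

    Each pair is a dict with hypothetical keys "confidence1" and "confidence2",
    one per translated sentence. The real pipeline may store or combine
    confidence values differently.
    """
    kept = []
    for pair in pairs:
        # Assumption: the lower of the two confidences decides the pair's fate.
        if min(pair["confidence1"], pair["confidence2"]) >= min_confidence:
            kept.append(pair)
    return kept
```

Under these assumptions, a pair with confidences 0.95 and 0.65 would be dropped, while a pair with 0.7 and 0.9 would be kept, because only values strictly below 0.7 are filtered out.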

Data

The data is split into train, dev, and test sets, stored as pickle files in the "Data" folder. You can load them with the read_pickle() function from pandas.
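A minimal loading sketch follows. The file names (`train.pkl`, `dev.pkl`, `test.pkl`) and the helper `load_splits` are assumptions for illustration only; check the "Data" folder for the actual file names and adjust `names` and `suffix` accordingly.

```python
from pathlib import Path

import pandas as pd


def load_splits(data_dir, names=("train", "dev", "test"), suffix=".pkl"):
    """Load each split from data_dir into a pandas object, keyed by split name.

    The default file names and suffix are hypothetical; the files in the
    repository may be named differently.
    """
    data_dir = Path(data_dir)
    return {name: pd.read_pickle(data_dir / f"{name}{suffix}") for name in names}
```

For example, `splits = load_splits("Data")` followed by `splits["dev"].head()` would show the first rows of the development split, provided the files follow the assumed naming. Since unpickling can execute arbitrary code, only load pickle files from a source you trust.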

Citation

If you use the dataset, please cite this article:

@InProceedings{10.1007/978-3-030-91608-4_31,
author="Huertas-Garc{\'i}a, {\'A}lvaro
and Huertas-Tato, Javier
and Mart{\'i}n, Alejandro
and Camacho, David",
editor="Yin, Hujun
and Camacho, David
and Tino, Peter
and Allmendinger, Richard
and Tall{\'o}n-Ballesteros, Antonio J.
and Tang, Ke
and Cho, Sung-Bae
and Novais, Paulo
and Nascimento, Susana",
title="Countering Misinformation Through Semantic-Aware Multilingual Models",
booktitle="Intelligent Data Engineering and Automated Learning -- IDEAL 2021",
year="2021",
publisher="Springer International Publishing",
address="Cham",
pages="312--323",
isbn="978-3-030-91608-4"
}

References

[1] Cer, Daniel, Diab, Mona, Agirre, Eneko, Lopez-Gazpio, Iñigo, & Specia, Lucia. (2017). SemEval-2017 Task 1: Semantic Textual Similarity - Multilingual and Cross-lingual Focused Evaluation. https://doi.org/10.18653/v1/S17-2001

[2] Ham, Jiyeon, Choe, Yo Joong, Park, Kyubyong, Choi, Ilji, & Soh, Hyungjoon. (2020). KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding.

[3] Reimers, Nils, & Gurevych, Iryna. (2020). Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation.

[4] Hu, Junjie, Ruder, Sebastian, Siddhant, Aditya, Neubig, Graham, Firat, Orhan, & Johnson, Melvin. (2020). XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization.
