Translated TACRED (The TAC Relation Extraction Dataset)

Version 1.0 (September 04, 2021)

This bundle contains 533 parallel examples sampled from TACRED and translated into Russian and Korean (plus 3 additional examples in Russian only), accompanied by a translation of a list of trigger words collected for the different relations.

Translation of TACRED:

We sampled the TACRED training set so that approximately 25% of the examples are labeled as no_relation (in TACRED, 79.5% are labeled as such) and the other 75% are distributed proportionally among the various relation types. The examples were translated using the Yandex and Papago Translate APIs for Russian and Korean, respectively, and were manually filtered and corrected by Dmitry Nikolaev (Russian) and Taelin Karidi (Korean). The same annotators then manually marked the entity spans of the relation participants, corresponding to those annotated in the English source sentences. The resulting Russian and Korean RE datasets consist of 533 parallel examples in both languages (and 3 additional examples in Russian), giving us parallel RE datasets that are automatically translated and manually curated and annotated.
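The sampling scheme described above can be illustrated with a minimal sketch. This is not the authors' script: the function name, the `(example_id, label)` representation, the rounding, and the fixed seed are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_with_no_relation_share(examples, total, no_rel_share=0.25, seed=0):
    """Sample `total` examples from a list of (example_id, label) pairs.

    Hypothetical helper: roughly `no_rel_share` of the sample is drawn from
    `no_relation`; the rest is split among the other labels in proportion
    to their frequency in `examples`.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex_id, label in examples:
        by_label[label].append(ex_id)

    # Fixed quota for the negative class.
    quotas = {"no_relation": round(total * no_rel_share)}

    # Proportional quotas for the positive relation types.
    positives = {l: ids for l, ids in by_label.items() if l != "no_relation"}
    n_pos = sum(len(ids) for ids in positives.values())
    remaining = total - quotas["no_relation"]
    if n_pos:
        for label, ids in positives.items():
            quotas[label] = round(remaining * len(ids) / n_pos)

    sample = []
    for label, quota in quotas.items():
        pool = by_label.get(label, [])
        # Never request more items than the pool holds.
        chosen = rng.sample(pool, min(quota, len(pool)))
        sample.extend((ex_id, label) for ex_id in chosen)
    return sample
```

Because each quota is rounded independently, the returned sample can differ from `total` by a few examples, which is consistent with the "approximately 25%" wording above.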

Annotation:

TACRED is annotated with Stanford Dependencies, which are not designed to be cross-lingual. We therefore parsed the datasets into Universal Dependencies with the Trankit supervised parser, using its default pre-trained models. The resulting parses were checked and manually corrected by the annotators.

Trigger Words:

Several relation extraction works consult a list of trigger words collected for the different relations (Yu et al., 2015). We translate these trigger words into Korean and Russian in two ways. First, we automatically translate the entire word list using a machine translation system (Google Translate). This is not always sufficient, as in many cases there are multiple ways to translate a given trigger word. Second, to remedy this, when annotating the translated TACRED subsample, we also record the spans of the trigger words in the translated sentences that correspond to those in the original English sentences.

Format and Source Code:

The files are in CoNLL-U format, with metadata fields for the entity spans, entity types, and trigger words. The docid and id fields of each example are the same as those of the original English TACRED example.
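As an illustration of how such files might be read, here is a minimal sketch of a generic CoNLL-U reader that collects `# key = value` comment lines as sentence metadata. The function name `read_conllu` and the sample values are hypothetical. Only `id` and `docid` are field names mentioned above; the exact names of the entity and trigger metadata keys should be checked in the data files themselves.

```python
# The ten standard CoNLL-U token columns.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text):
    """Yield (metadata, tokens) for each sentence in a CoNLL-U string."""
    metadata, tokens = {}, []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if tokens or metadata:
                yield metadata, tokens
            metadata, tokens = {}, []
        elif line.startswith("#"):
            # Comment lines of the form "# key = value" carry metadata.
            key, sep, value = line[1:].partition("=")
            if sep:
                metadata[key.strip()] = value.strip()
        else:
            tokens.append(dict(zip(FIELDS, line.split("\t"))))
    if tokens or metadata:
        yield metadata, tokens


# Made-up toy sentence; the metadata values are placeholders, not real data.
SAMPLE = "\n".join([
    "# id = example-0",
    "# docid = doc-0",
    "\t".join(["1", "Hello", "hello", "INTJ", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["2", "world", "world", "NOUN", "_", "_", "1", "vocative", "_", "_"]),
    "",
])

for meta, toks in read_conllu(SAMPLE):
    print(meta["docid"], [t["form"] for t in toks])
```

With the metadata in a dictionary, the `docid` and `id` values can be used to align each translated example with its English TACRED counterpart.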

Citation:

@inproceedings{arviv2021relation,
    title = "On the Relation between Syntactic Divergence and Zero-Shot Performance",
    author = "Ofir Arviv and Dmitry Nikolaev and Taelin Karidi and Omri Abend",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2021",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2110.04644",
}

Licensing:

The translated TACRED is distributed under the "Attribution-ShareAlike 3.0 Unported" license (http://creativecommons.org/licenses/by-sa/3.0/). Please follow the link for exact details.
