CoAT

CoAT🧥 (Corpus of Artificial Texts) is a large-scale corpus for Russian, which consists of 246k human-written texts from publicly available resources and artificial texts generated by 13 neural models, varying in the number of parameters, architecture choices, pre-training objectives, and downstream applications. Each model is fine-tuned for one or more of six natural language generation tasks, ranging from paraphrase generation to text summarisation. CoAT provides two task formulations:

detection of artificial texts, i.e., classifying if a given text is machine-generated or human-written;
authorship attribution, i.e., classifying the author of a given text among 14 candidates.
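The two formulations share the same texts and differ only in the label space. The sketch below shows one way to derive the binary detection label from an author label, assuming the 14 candidates are the 13 models plus one human class. The label string `"human"` and the helper name `to_detection_label` are illustrative and may not match the corpus's actual schema.

```python
# Hypothetical label for human-written texts; the real corpus may use a different string.
HUMAN_LABEL = "human"


def to_detection_label(author: str) -> str:
    """Collapse a multi-class author label into a binary detection label."""
    return "human" if author == HUMAN_LABEL else "machine"


# Illustrative author labels: the human class plus a few model names from this README.
authors = ["human", "OPUS-MT", "ruGPT3-Large", "human", "mT5-Small"]
detection = [to_detection_label(a) for a in authors]
print(detection)
```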

The design of our corpus enables various experiment settings, ranging from analysing the dependence of the detector performance on the natural language generation task to the robustness of detectors towards unseen generative models and text domains.
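As an illustration of the unseen-model setting, the sketch below holds out every text produced by one generator so that a detector can be trained on the remaining authors and evaluated on the held-out one. The record schema (`text` and `author` keys) and the function name `hold_out_model` are assumptions made for this example, not the layout of the released files. Human-written texts would still need their own train/test partition.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]  # assumed schema: {"text": ..., "author": ...}


def hold_out_model(records: List[Record], held_out: str) -> Tuple[List[Record], List[Record]]:
    """Split records so that texts from `held_out` never appear in the first part."""
    seen = [r for r in records if r["author"] != held_out]
    unseen = [r for r in records if r["author"] == held_out]
    return seen, unseen


# Toy records standing in for corpus entries.
records = [
    {"text": "a", "author": "human"},
    {"text": "b", "author": "OPUS-MT"},
    {"text": "c", "author": "M2M-100"},
    {"text": "d", "author": "M2M-100"},
]
train_part, unseen_part = hold_out_model(records, held_out="M2M-100")
print(len(train_part), len(unseen_part))
```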

CoAT is available on HuggingFace and in the datasets/ folder of the repository.

Read more about CoAT in our paper published in the Natural Language Processing journal in 2024.

Design

We provide a high-level description of the data collection methodology. Please refer to our paper for more details on the CoAT design, the post-processing and filtering procedures, and general statistics.

Human-written Texts

The human-written texts are collected from six domains: Russian National Corpus, social media, Wikipedia, news articles, digitised personal diaries, and machine translation datasets.

Machine-generated Texts

We use human-written texts as the input to 13 generative models, which are fine-tuned for one or more of the following natural language generation tasks: machine translation, paraphrase generation, text simplification, and text summarisation. In addition, we consider back-translation and zero-shot generation approaches.

Models

Machine translation and back-translation via EasyNMT:

OPUS-MT. Jörg Tiedemann and Santhosh Thottingal. 2020. OPUS-MT – Building open translation services for the World
M-BART50. Yuqing Tang et al. 2020. Multilingual Translation with Extensible Multilingual Pretraining and Finetuning
M2M-100. Angela Fan et al. 2021. Beyond English-Centric Multilingual Machine Translation
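Back-translation means translating a text into a pivot language and then back into the source language. The sketch below expresses that round trip against a generic translation callable. The `translate` signature, the `"en"` pivot, and the toy dictionary translator are assumptions for illustration only. In practice, a thin wrapper around EasyNMT's `translate` method, which accepts `source_lang` and `target_lang` arguments, could be passed in instead.

```python
from typing import Callable

# Assumed signature of a translation function: (text, source_lang, target_lang) -> text.
Translate = Callable[[str, str, str], str]


def back_translate(text: str, translate: Translate, source: str = "ru", pivot: str = "en") -> str:
    """Translate `text` into the pivot language and back into the source language."""
    forward = translate(text, source, pivot)
    return translate(forward, pivot, source)


# Toy stand-in for a real translation model, used only to make the sketch runnable.
TOY_TABLE = {
    ("ru", "en"): {"привет": "hello"},
    ("en", "ru"): {"hello": "привет"},
}


def toy_translate(text: str, source_lang: str, target_lang: str) -> str:
    return TOY_TABLE[(source_lang, target_lang)].get(text, text)


print(back_translate("привет", toy_translate))
```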

Paraphrase generation via russian-paraphrasers:

ruGPT2-Large
ruGPT3-Large. Zmitrovich et al., 2024. A Family of Pretrained Transformer Language Models for Russian
ruT5-Base-Multitask
mT5-Small/Large. Linting Xue et al. 2021. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

Text simplification via fine-tuning on a filtered version of the RuSimpleSentEval-2022 dataset (Fenogenova, 2021):

ruGPT3-Small/Medium/Large. Zmitrovich et al., 2024. A Family of Pretrained Transformer Language Models for Russian
mT5-Large. Linting Xue et al. 2021. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer
ruT5-Large. Zmitrovich et al., 2024. A Family of Pretrained Transformer Language Models for Russian

Text summarisation:

Open-ended generation:

ruGPT3-Small/Medium/Large. Zmitrovich et al., 2024. A Family of Pretrained Transformer Language Models for Russian

Leaderboards

Detection of artificial texts: https://www.kaggle.com/competitions/coat-artificial-text-detection
Authorship attribution: https://www.kaggle.com/competitions/coat-authorship-attribution

Contact us

For any questions about CoAT, the codebase, or the dataset configurations used in the paper, please contact vladism@ifi.uio.no or create an issue in this repository.

Cite

@article{shamardinacoat,
  title={CoAT: Corpus of artificial texts},
  author={Shamardina, Tatiana and Saidov, Marat and Fenogenova, Alena and Tumanov, Aleksandr and Zemlyakova, Alina and Lebedeva, Anna and Gryaznova, Ekaterina and Shavrina, Tatiana and Mikhailov, Vladislav and Artemova, Ekaterina},
  journal={Natural Language Processing},
  pages={1--26},
  publisher={Cambridge University Press}
}

An earlier version of CoAT was published as RuATD.

@article{shamardinafindings,
  title={Findings of the The RuATD Shared Task 2022 on Artificial Text Detection in Russian},
  author={Shamardina, Tatiana and Mikhailov, Vladislav and Chernianskii, Daniil and Fenogenova, Alena and Saidov, Marat and Valeeva, Anastasiya and Shavrina, Tatiana and Smurov, Ivan and Tutubalina, Elena and Artemova, Ekaterina},
  journal={Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference ''Dialogue 2022''},
  pages={1--15},
  publisher={RSUH}
}
