This repository contains two web corpora with manual register annotations: the French Corpus of Online REgisters (FreCORE) and the Swedish Corpus of Online REgisters (SweCORE). The corpora are extracted from the unrestricted open web and follow the register taxonomy presented for English in Register Variation Online by Douglas Biber and Jesse Egbert (2018, Cambridge University Press).
FreCORE and SweCORE are introduced in the paper 'Beyond the English Web: Zero-Shot Cross-Lingual and Lightweight Monolingual Classification of Registers'.
The FreCORE and SweCORE annotations are released under the Creative Commons Attribution (CC BY) license. Please cite:
@inproceedings{repo2021zeroshot,
title={Beyond the English Web: Zero-Shot Cross-Lingual and Lightweight Monolingual Classification of Registers},
author={Liina Repo and Valtteri Skantsi and Samuel R\"onnqvist and Saara Hellstr\"om and Miika Oinonen and Anna Salmela and Douglas Biber and Jesse Egbert and Sampo Pyysalo and Veronika Laippala},
booktitle={Proceedings of the EACL 2021 Student Research Workshop},
year={2021}
}
The modeling code used in the paper is available here.
For FinCORE, a similar dataset in Finnish, please see https://github.com/TurkuNLP/FinCORE.