French Corpus of Online REgisters (FreCORE) and Swedish Corpus of Online REgisters (SweCORE)

TurkuNLP/Multilingual-register-corpora

Multilingual-register-corpora

This repository contains two web corpora with manual register annotations: the French Corpus of Online REgisters (FreCORE) and the Swedish Corpus of Online REgisters (SweCORE). Both corpora are extracted from the unrestricted open web and follow the register taxonomy presented for English in Register Variation Online by Douglas Biber and Jesse Egbert (2018, Cambridge University Press).

FreCORE and SweCORE are introduced in the paper 'Beyond the English Web: Zero-Shot Cross-Lingual and Lightweight Monolingual Classification of Registers'.

The FreCORE and SweCORE annotations are released under the Creative Commons Attribution license (CC BY). If you use them, please cite:

@inproceedings{repo2021zeroshot,
  title={Beyond the English Web: Zero-Shot Cross-Lingual and Lightweight Monolingual Classification of Registers},
  author={Liina Repo and Valtteri Skantsi and Samuel R\"onnqvist and Saara Hellstr\"om and Miika Oinonen and Anna Salmela and Douglas Biber and Jesse Egbert and Sampo Pyysalo and Veronika Laippala},
  booktitle={Proceedings of the EACL 2021 Student Research Workshop},
  year={2021}
}

The modeling code used in the paper is available here.

For FinCORE, a similar dataset in Finnish, please see https://github.com/TurkuNLP/FinCORE.
