Helsinki-NLP/OPUS

OPUS

The Open Parallel Corpus

This repository contains information about the parallel corpora and derived data sets released in OPUS, the open collection of parallel corpora. Each sub-directory in corpus/ corresponds to one resource, and its released versions and data sets are organized as corpus/name/version.
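As a rough illustration of the corpus/name/version layout, the following Python sketch builds and parses such paths. The helper names `release_path` and `parse_release`, and the resource name `Example` and version `v1`, are hypothetical and chosen only for this example; they are not part of any OPUS tool.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class Release(NamedTuple):
    """One released version of a resource (hypothetical helper type)."""
    name: str
    version: str


def release_path(name: str, version: str) -> str:
    """Build the repository-relative path corpus/name/version."""
    return str(PurePosixPath("corpus") / name / version)


def parse_release(path: str) -> Release:
    """Extract the resource name and version from a corpus/name/version path.

    Any components after the version (for example a file name) are ignored.
    """
    parts = PurePosixPath(path).parts
    if len(parts) < 3 or parts[0] != "corpus":
        raise ValueError(f"not a corpus/name/version path: {path!r}")
    return Release(name=parts[1], version=parts[2])


if __name__ == "__main__":
    # "Example" and "v1" are placeholder values, not real OPUS releases.
    print(release_path("Example", "v1"))
    print(parse_release("corpus/Example/v1"))
```

The sketch assumes POSIX-style separators, as used in the repository layout, and only inspects the first three path components.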

The OPUS ecosystem

Tools for finding and processing OPUS data sets:

Managing OPUS:

Machine translation with OPUS-MT:

Citing

Please cite the following LREC 2012 paper when using OPUS, and acknowledge the corpus-specific references given in the information and documentation for each resource.

@InProceedings{TIEDEMANN12.463,
  author = {Jörg Tiedemann},
  title = {Parallel Data, Tools and Interfaces in {OPUS}},
  booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},
  year = {2012},
  month = {may},
  date = {23-25},
  address = {Istanbul, Turkey},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {978-2-9517408-7-7},
}

Links to other resources

Acknowledgements

OPUS and related resources and tools have been partially supported by various projects such as

  • LetsMT! - A Platform for Online Sharing of Training Data and Building User Tailored Machine Translation (EU ICT PSP)
  • MeMAD - Methods for Managing Audiovisual Data (EU Horizon 2020)
  • NLPL - the Nordic Language Processing Laboratory (NeIC)
  • EOSC-nordic - the European Open Science Cloud within the Nordic and Baltic countries (EU Horizon 2020)
  • ELG - the European Language Grid (EU Horizon 2020)
  • FoTran - Found in Translation (EU ERC)
  • HPLT - High-Performance Language Technologies (EU Horizon)

OPUS is hosted by CSC, the IT Center for Science in Finland, and draws heavily on the HPC resources provided by CSC. OPUS is also part of NLPL, the Nordic Language Processing Laboratory. Last but not least, OPUS would not be possible without the many contributions from the community, including aligned data sets and tools for creating and processing parallel corpora.