OpusFilter

OpusFilter is a tool for filtering and combining parallel corpora.

Features:

  • Corpus preprocessing pipelines configured with YAML
  • Simple downloading of parallel corpora from OPUS with OpusTools
  • Implementations for many common text file operations on parallel files
  • Memory-efficient processing of large files
  • Filters based on, for example, language identification, word alignment, n-gram language models, and multilingual sentence embeddings
  • Extendable with your own filters written in Python
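To make the last three points more concrete, here is a minimal standalone sketch of a streaming filter over aligned sentence pairs. It does not import OpusFilter: the class name `LengthRatioSketch`, the helper `read_pairs`, and the `score`/`accept` split are assumptions made for illustration only, so consult the documentation for the actual base class and configuration options.

```python
import io
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]


class LengthRatioSketch:
    """Hypothetical stand-in for a custom filter; NOT the OpusFilter base class."""

    def __init__(self, threshold: float = 3.0) -> None:
        self.threshold = threshold

    def score(self, pairs: Iterable[Pair]) -> Iterator[float]:
        # One score per pair: the longer word count divided by the shorter one.
        for src, tgt in pairs:
            short, long_ = sorted((len(src.split()), len(tgt.split())))
            yield float("inf") if short == 0 else long_ / short

    def accept(self, score: float) -> bool:
        # Keep a pair only if its length ratio stays below the threshold.
        return score < self.threshold

    def filter(self, pairs: Iterable[Pair]) -> Iterator[Pair]:
        # Generator: pairs are scored and emitted one at a time.
        for pair in pairs:
            if self.accept(next(self.score([pair]))):
                yield pair


def read_pairs(src_file, tgt_file) -> Iterator[Pair]:
    # zip() over file objects reads line by line, so the whole corpus
    # never has to be held in memory.
    for src_line, tgt_line in zip(src_file, tgt_file):
        yield src_line.rstrip("\n"), tgt_line.rstrip("\n")


# In-memory files stand in for real aligned corpus files.
src = io.StringIO("Kiitos paljon\nHei\n\n")
tgt = io.StringIO("Thank you very much\nHello there my dear old friend\nempty source\n")

kept = list(LengthRatioSketch(threshold=3.0).filter(read_pairs(src, tgt)))
print(kept)  # [('Kiitos paljon', 'Thank you very much')]
```

The second pair is dropped because its length ratio is 6, and the third because its source side is empty. Separating scoring from the accept decision means the same scores can be reused with different thresholds.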

OpusFilter was presented at the ACL 2020 System Demonstrations.

Installing

Install the latest release from PyPI:

  • pip install opusfilter or pip install opusfilter[all] (the latter includes the optional Python libraries)

Install from source:

  • pip install . or python setup.py install

Troubleshooting

OpusFilter should generally work on Python 3.6 to 3.10. If you run into problems, try installing the exact versions listed in requirements.txt:

  • pip install -r requirements.txt

Documentation

The complete OpusFilter documentation is available at helsinki-nlp.github.io/OpusFilter.

You can also build the documentation from source:

  • pip install -r docs/requirements.txt or pip install .[docs]
  • sphinx-build docs docs-html

Changelog

A changelog is available in docs/CHANGELOG.md.

Citing

If you use OpusFilter in your research, please cite our ACL 2020 paper:

@inproceedings{aulamo-etal-2020-opusfilter,
    title = "{O}pus{F}ilter: A Configurable Parallel Corpus Filtering Toolbox",
    author = {Aulamo, Mikko and Virpioja, Sami and Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    month = jul,
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-demos.20",
    doi = "10.18653/v1/2020.acl-demos.20",
    pages = "150--156"
}

A full bibliography of the papers cited in the documentation and code can be found in docs/references.bib.

Contributing

See docs/CONTRIBUTING.md.