OpusFilter is a tool for filtering and combining parallel corpora.


  • Corpus preprocessing pipelines configured with YAML
  • Simple downloading of parallel corpora from OPUS with OpusTools
  • Implementations for many common text file operations on parallel files
  • Memory-efficient processing of large files
  • Implemented filters based on, for example, language identification, word alignment, n-gram language models, and multilingual sentence embeddings
  • Extendable with your own filters written in Python
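To give a rough idea of what filtering a parallel corpus in a memory-efficient way can look like, here is a minimal sketch in plain Python. It does not use OpusFilter and is not its API: the names `read_parallel`, `filter_pairs`, and `length_ratio_ok`, as well as the `max_ratio` default of 3.0, are hypothetical and chosen only for illustration. The sketch streams aligned lines from two files and keeps a pair only when a simple length-ratio check accepts it.

```python
from typing import Callable, Iterable, Iterator, Tuple

Pair = Tuple[str, str]


def read_parallel(src_path: str, tgt_path: str) -> Iterator[Pair]:
    """Lazily yield aligned (source, target) lines from two text files."""
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        # zip() pulls one line from each file at a time, so the files
        # are never loaded into memory as a whole.
        for src_line, tgt_line in zip(src, tgt):
            yield src_line.rstrip("\n"), tgt_line.rstrip("\n")


def length_ratio_ok(src: str, tgt: str, max_ratio: float = 3.0) -> bool:
    """Illustrative filter: reject empty segments and very unbalanced lengths."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        return False
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio


def filter_pairs(pairs: Iterable[Pair], accept: Callable[[str, str], bool]) -> Iterator[Pair]:
    """Yield only the pairs that the given accept function approves."""
    for src, tgt in pairs:
        if accept(src, tgt):
            yield src, tgt
```

Because every stage is a generator, `filter_pairs(read_parallel("train.en", "train.fi"), length_ratio_ok)` (the file names are placeholders) would process the corpus one segment pair at a time. In OpusFilter itself, such steps are configured in YAML rather than wired together by hand; see the documentation for the actual filter interface.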

OpusFilter was presented at the ACL 2020 system demonstrations.


Install the latest release from PyPI:

  • pip install opusfilter or pip install opusfilter[all] (include optional Python libraries)

Install from source:

  • pip install . or python setup.py install


OpusFilter should generally work on Python 3.6 to 3.10. If you run into problems, try installing the exact versions listed in requirements.txt:

  • pip install -r requirements.txt
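If you are unsure whether your interpreter falls within that range, a check along these lines may help. This is only a sketch: the helper name `in_supported_range` is hypothetical, and the bounds simply restate the 3.6 to 3.10 range mentioned above.

```python
import sys


def in_supported_range(version_info=sys.version_info, low=(3, 6), high=(3, 10)):
    """Return True if the (major, minor) version lies within [low, high]."""
    return low <= tuple(version_info[:2]) <= high


if __name__ == "__main__":
    status = "inside" if in_supported_range() else "outside"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is {status} the 3.6-3.10 range")
```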


The complete OpusFilter documentation is available from

You can also build the documentation from source:

  • pip install -r docs/requirements.txt or pip install .[docs]
  • sphinx-build docs docs-html


A changelog is available in docs/


If you use OpusFilter in your research, please cite our ACL 2020 paper:

    title = "{O}pus{F}ilter: A Configurable Parallel Corpus Filtering Toolbox",
    author = {Aulamo, Mikko and Virpioja, Sami and Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations",
    month = jul,
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "",
    doi = "10.18653/v1/2020.acl-demos.20",
    pages = "150--156"

A full bibliography of papers cited in the documentation and code can be found in docs/references.bib.


See docs/