Corpora Cleaning Tools

Tools for filtering and cleaning parallel and monolingual corpora in order to train better (neural) machine translation systems.

These tools are inspired by the Data Filtering and Data Pre-processing sections of Tilde's WMT17 paper. This repository includes some of the more basic scripts, which help remove most of the junk from parallel corpora.

Tools included

  • parallel - tools for parallel corpora
  • mono - tools for monolingual corpora

Requirements

pip install subword-nmt
pip install langid
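
To illustrate the kind of filtering these dependencies support, here is a minimal sketch of a parallel-corpus filter. It is not the code from the `parallel` directory: the function name `filter_pairs`, the `max_ratio` threshold, and the choice of filters (empty lines, identical source and target, length ratio, language mismatch) are assumptions made for this example. The only library call used is `langid.classify`, which returns a `(language, score)` tuple.

```python
from typing import Callable, Iterable, Iterator, Tuple

Pair = Tuple[str, str]


def _langid_detect(text: str) -> str:
    """Return the language code that langid predicts for text."""
    # Imported lazily so the other filters work without langid installed.
    import langid

    lang, _score = langid.classify(text)
    return lang


def filter_pairs(
    pairs: Iterable[Pair],
    src_lang: str,
    tgt_lang: str,
    detect: Callable[[str], str] = _langid_detect,
    max_ratio: float = 3.0,  # hypothetical threshold, tune per language pair
) -> Iterator[Pair]:
    """Yield only the sentence pairs that pass a few basic sanity checks."""
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        # Drop pairs where either side is empty.
        if not src or not tgt:
            continue
        # Drop untranslated pairs (source copied to target).
        if src == tgt:
            continue
        # Drop pairs whose token counts differ too much.
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        # Drop pairs where either side is not in the expected language.
        if detect(src) != src_lang or detect(tgt) != tgt_lang:
            continue
        yield src, tgt
```

With `langid` installed, `list(filter_pairs(pairs, "en", "lv"))` would keep only the pairs whose source is detected as English and target as Latvian. Passing a custom `detect` function makes it possible to swap in another language identifier.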

Publications

If you use this tool, please cite the following paper:

Matīss Rikters (2018). "Impact of Corpora Quality on Neural Machine Translation." In Proceedings of the 8th Conference Human Language Technologies - The Baltic Perspective (Baltic HLT 2018).

@inproceedings{Rikters2018BalticHLT,
	author = {Rikters, Matīss},
	booktitle={Proceedings of the 8th Conference Human Language Technologies - The Baltic Perspective (Baltic HLT 2018)},
	title = {{Impact of Corpora Quality on Neural Machine Translation}},
	address={Tartu, Estonia},
	year = {2018}
}