Preprocessed data for the Workshop on Statistical Machine Translation (WMT), collected from papers or other projects

IdiosyncraticDragon/WMT_data

WMT_data (continuously updated; suggestions welcome)

Preprocessed data for the Workshop on Statistical Machine Translation (WMT), collected from other sources

While reimplementing NMT models, I found that the WMT14/15/16/17 data offered on the official homepages is raw, and that it is hard to find processed data that exactly matches what a given paper used. I created this repository to collect the processed WMT data I have come across and that I am confident meets the requirements of the papers I read.

WMT 2014

WMT 2014 English to French

The data is provided at: http://www-lium.univ-lemans.fr/~schwenk/cslm_joint_paper/

The data is used in these papers:

  • Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio: Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2014)
  • Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014.

WMT 2014 English to German

The data is provided at: https://nlp.stanford.edu/projects/nmt/

The data is used in these papers:

  • (exactly the data) Thang Luong, Hieu Pham, Christopher D. Manning: Effective Approaches to Attention-based Neural Machine Translation. EMNLP 2015: 1412-1421
  • (similar to the data) Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2015.
  • (similar to the data) Stephan Peitz, Joern Wuebker, Markus Freitag, and Hermann Ney. The RWTH Aachen German-English Machine Translation System for WMT 2014. In Proceedings of the Ninth Workshop on Statistical Machine Translation, 2014.
  • (similar to the data) Tao Lei, Yu Zhang: Training RNNs as Fast as CNNs. CoRR abs/1709.02755 (2017)

WMT 2015 English to German

The WMT 2015 English to German data is the WMT 2014 English to German data with the News Commentary corpus updated to v10.

The data is provided at: https://s3.amazonaws.com/opennmt-trainingdata/wmt15-de-en.tgz

The data is used in this tutorial for OpenNMT: http://forum.opennmt.net/t/training-english-german-wmt15-nmt-engine/29
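Fetching and unpacking the archive can be scripted. The following is a minimal sketch, not part of the original repository: the helper names `download` and `extract_archive` and the local paths are hypothetical, it makes no assumption about the files inside the archive, and it does not guarantee that the URL is still reachable.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL as listed in this README; availability is not guaranteed.
WMT15_DE_EN_URL = "https://s3.amazonaws.com/opennmt-trainingdata/wmt15-de-en.tgz"


def download(url, dest):
    """Download url to dest unless dest already exists; return dest as a Path."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def extract_archive(archive_path, out_dir):
    """Extract a tar archive into out_dir and return the sorted file names."""
    out_dir = Path(out_dir).resolve()
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive_path), "r:*") as tar:
        members = tar.getmembers()
        # Refuse members that would be written outside out_dir.
        for member in members:
            target = (out_dir / member.name).resolve()
            if target != out_dir and out_dir not in target.parents:
                raise ValueError("unsafe path in archive: " + member.name)
        tar.extractall(str(out_dir))
    return sorted(m.name for m in members if m.isfile())
```

For example, `extract_archive(download(WMT15_DE_EN_URL, "data/wmt15-de-en.tgz"), "data/wmt15-de-en")` would save the archive under a hypothetical `data/` directory and return the names of the extracted files, which can then be inspected before training.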

WMT 2017

The homepage provides preprocessed data: http://data.statmt.org/wmt17/translation-task/preprocessed/
