Pre-trained models are beautiful. They save you time, energy, and/or money. You can obtain several pre-trained machine translation models, mostly for European languages, here. In this project, I add six more models: Korean <-> English, Korean <-> Chinese, and Korean <-> Japanese, as I failed to find publicly available ones. Not surprisingly, the biggest challenge in training NMT models for these language pairs is the lack of large parallel corpora. To overcome this, I used both public data (OpenSubtitles) and private data. Overall, the performance of each model may not be impressive, but you can keep training it with your own data if necessary.
- python >=3.6
- pytorch >=1.0
- Fairseq
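If you are setting up the environment from scratch, an install sketch might look like the following; the exact package versions are assumptions, not necessarily the versions the models were trained with.

```
pip install "torch>=1.0" fairseq sentencepiece
```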
Language Pair | # Training sents (public + private) | # Test sents (private) |
---|---|---|
ko-en | 1,845,445 (1,391,190 + 454,255) | 1,050 |
ko-zh | 672,450 (485,843 + 186,607) | 1,417 |
ko-ja | 2,788,003 (302,063 + 2,485,940) | 1,174 |
- Click the links to download the pretrained models and vocabulary files.
Language | # Vocab. | Tokenization |
---|---|---|
ko | 8k | BPE with sentencepiece |
en | 32k | BPE with sentencepiece |
zh | 32k | character |
ja | 8k | BPE with sentencepiece |
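For reference, applying a sentencepiece BPE model to raw text and binarizing the result for fairseq could look roughly like the sketch below. All file names (`ko.model`, `train.ko`, `valid.bpe`, etc.) are placeholders, not the names of the released files; pass the released dictionaries via `--srcdict`/`--tgtdict` if you want the binarized data to match the pre-trained checkpoints.

```
# Segment raw text into BPE pieces with the sentencepiece models (placeholder names)
spm_encode --model=ko.model --output_format=piece < train.ko > train.bpe.ko
spm_encode --model=en.model --output_format=piece < train.en > train.bpe.en

# Binarize into the ko-en-bin directory consumed by train.py below
python FAIRSEQ/preprocess.py --source-lang ko --target-lang en \
    --trainpref train.bpe --validpref valid.bpe --testpref test.bpe \
    --destdir ko-en-bin
```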
Pre-trained model | BLEU on test set* |
---|---|
ko -> en | 16.7 |
en -> ko | 24.2 |
ko -> zh | 17.13 |
zh -> ko | 23.78 |
ko -> ja | 40.7 |
ja -> ko | 34.6 |
- Evaluation is based on tokenization with Mecab-ko (ko), NLTK punkt (en), pkuseg (zh), and MeCab (ja).
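As a rough sketch of that evaluation for the Korean side, you could space-tokenize the hypotheses and references with Mecab-ko and score them with fairseq's `score.py`; file names are placeholders, and the en/zh/ja sides would use their respective tokenizers instead.

```
# Space-separate tokens with Mecab-ko (wakati output format)
mecab -O wakati < hyp.ko > hyp.tok.ko
mecab -O wakati < ref.ko > ref.tok.ko

# Corpus-level BLEU
python FAIRSEQ/score.py --sys hyp.tok.ko --ref ref.tok.ko
```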
echo "ko -> en"
python -m torch.distributed.launch --nproc_per_node 8 FAIRSEQ/train.py ko-en-bin --arch transformer --optimizer adam --lr 0.0005 --label-smoothing 0.1 --dropout 0.3 --max-tokens 4000 --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 --criterion label_smoothed_cross_entropy --max-epoch 80 --warmup-updates 4000 --warmup-init-lr '1e-07' --adam-betas '(0.9, 0.98)' --save-dir train/ko-en/ckpt --save-interval 1 --restore-file checkpoint77.pt
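Once you have a trained or downloaded checkpoint, translating new text with fairseq's `interactive.py` could look roughly like this; the checkpoint path and sentencepiece model name are assumptions, and the positional data directory just needs to contain the matching dictionary files.

```
# BPE-encode a source sentence and translate it with the ko -> en checkpoint
echo "안녕하세요." \
  | spm_encode --model=ko.model --output_format=piece \
  | python FAIRSEQ/interactive.py ko-en-bin \
      --path train/ko-en/ckpt/checkpoint_best.pt \
      --source-lang ko --target-lang en --beam 5
# Hypotheses are printed on the H-* lines as BPE pieces;
# decode them back to plain text with spm_decode --model=en.model
```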