Sequence-to-sequence PyTorch implementations

This repo contains various sequential models used to translate Korean sentences into English sentences.

I used a translation dataset, but you can apply these models to any sequence-to-sequence (i.e. text generation) task, such as text summarization or response generation.

All of the base code is based on this great seq2seq tutorial.

In this project, I used the Korean-English translation corpus from AI Hub so that torchtext could be applied to a Korean dataset.

I cannot upload the dataset I used because it requires approval from AI Hub. You can get approval by requesting it from the AI Hub admins.

I also used the soynlp library to tokenize Korean sentences. It is really nice and easy to use; you should try it if you handle Korean text :) A minimal usage sketch follows.
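Below is a minimal sketch of how soynlp's unsupervised tokenizer is typically trained and used; the repo's build_pickle.py may organize this differently, and the variable names here are illustrative.

from soynlp.word import WordExtractor
from soynlp.tokenizer import LTokenizer

# In practice, train on the full list of raw Korean training sentences
sentences = ['부러진 날개로 다시한번 날개짓을 하라']

# Learn word statistics from the corpus in an unsupervised fashion
word_extractor = WordExtractor()
word_extractor.train(sentences)
word_scores = word_extractor.extract()

# LTokenizer splits each Korean eojeol into an (L, R) pair using cohesion scores
cohesion_scores = {word: score.cohesion_forward for word, score in word_scores.items()}
tokenizer = LTokenizer(scores=cohesion_scores)

print(tokenizer.tokenize('부러진 날개로 다시한번 날개짓을 하라'))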

Currently, the lowest validation and test losses are 4.708 and 4.681, respectively.


Overview

  • Number of training examples: 75,000
  • Number of validation examples: 10,000
  • Number of test examples: 10,000
Example:
{
  'kor': ['부러진', '날개로', '다시한번', '날개짓을', '하라'],
  'eng': ['wings', 'once', 'again', 'with', 'broken', 'wings']
}

Requirements

  • The following libraries are fundamental to this repo. Since I used a conda environment, requirements.txt contains many more dependencies.
  • If you encounter any dependency problems, just use the following command
    • pip install -r requirements.txt
en-core-web-sm==2.1.0
matplotlib==3.1.1
numpy==1.16.4
pandas==0.25.1
scikit-learn==0.21.3
soynlp==0.0.493
spacy==2.1.8
torch==1.2.0
torchtext==0.4.0

Models
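All models follow the encoder-decoder pattern of the tutorial linked above. As a rough reference only (not the repo's actual code), here is a minimal GRU-based sequence-to-sequence sketch in PyTorch; every class and variable name below is illustrative.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    # Embeds source tokens and encodes them with a GRU
    def __init__(self, input_dim, emb_dim, hid_dim):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim)

    def forward(self, src):
        # src: [src_len, batch]
        embedded = self.embedding(src)
        outputs, hidden = self.rnn(embedded)
        return hidden  # final hidden state acts as the context vector

class Decoder(nn.Module):
    # Generates target tokens one step at a time from the context vector
    def __init__(self, output_dim, emb_dim, hid_dim):
        super().__init__()
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim)
        self.fc_out = nn.Linear(hid_dim, output_dim)

    def forward(self, token, hidden):
        # token: [batch] -> [1, batch]
        embedded = self.embedding(token.unsqueeze(0))
        output, hidden = self.rnn(embedded, hidden)
        return self.fc_out(output.squeeze(0)), hidden

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, src, trg):
        # Teacher forcing: feed the ground-truth target token at every step
        hidden = self.encoder(src)
        outputs = []
        for t in range(trg.shape[0] - 1):
            prediction, hidden = self.decoder(trg[t], hidden)
            outputs.append(prediction)
        return torch.stack(outputs)  # [trg_len - 1, batch, output_dim]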


Usage

  • Before training the model, you should train the soynlp tokenizer on your training dataset and build the vocabularies using the following code.
  • You can set the vocabulary size for both the Korean and the English dataset.
  • In general, a Korean dataset yields a larger vocabulary than an English one, so pick vocabulary sizes that keep the two in balance.
  • Running the following code produces tokenizer.pickle, kor.pickle and eng.pickle, which are used to train and test the model and to predict a user's input sentence (a loading sketch follows the command).
python build_pickle.py --kor_vocab KOREAN_VOCAB_SIZE --eng_vocab ENGLISH_VOCAB_SIZE
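As an illustration only, the saved pickles might be loaded as below. What each file holds is an assumption based on typical soynlp and torchtext usage; check build_pickle.py for the actual layout.

import pickle

# Filenames come from the README; the contents of each pickle are assumed
with open('tokenizer.pickle', 'rb') as f:
    cohesion_scores = pickle.load(f)  # scores for soynlp's LTokenizer (assumed)
with open('kor.pickle', 'rb') as f:
    kor_field = pickle.load(f)        # torchtext Field with the Korean vocab (assumed)
with open('eng.pickle', 'rb') as f:
    eng_field = pickle.load(f)        # torchtext Field with the English vocab (assumed)

print(len(kor_field.vocab), len(eng_field.vocab))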
  • For training, run main.py in train mode (the default option)
python main.py --model MODEL_NAME
  • For testing, run main.py in test mode
python main.py --model MODEL_NAME --mode test
  • For predicting, run predict.py with your Korean input sentence (a sketch of a typical predict flow follows the command).
  • Don't forget to wrap your input in double quotation marks!
python predict.py --model MODEL_NAME --input "YOUR_KOREAN_INPUT"
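For reference, a typical predict flow with the artifacts above looks roughly like this sketch: tokenize the input, numericalize it with the Korean vocab, encode, then greedily decode until an end-of-sentence token. The function, the model interface and the special tokens are illustrative assumptions, not the repo's actual predict.py.

import torch

def translate(sentence, tokenizer, kor_field, eng_field, model, max_len=50):
    # Tokenize and numericalize the Korean input with the source vocab
    tokens = tokenizer.tokenize(sentence)
    indices = [kor_field.vocab.stoi[token] for token in tokens]
    src = torch.LongTensor(indices).unsqueeze(1)  # [src_len, 1]

    model.eval()
    with torch.no_grad():
        hidden = model.encoder(src)

        # Greedy decoding: start from <sos>, stop at <eos> or max_len
        trg_indices = [eng_field.vocab.stoi['<sos>']]
        for _ in range(max_len):
            trg_token = torch.LongTensor([trg_indices[-1]])
            prediction, hidden = model.decoder(trg_token, hidden)
            next_token = prediction.argmax(-1).item()
            if next_token == eng_field.vocab.stoi['<eos>']:
                break
            trg_indices.append(next_token)

    return ' '.join(eng_field.vocab.itos[i] for i in trg_indices[1:])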

Example

  • These are well-trained examples; other inputs might not produce results as good.

kor> 저는 주말에 축구를 해요
eng> I am going to play soccer in the weekend

kor> 내일은 여자친구를 만나요
eng> I am going to meet a girlfriend
