# An OpenNMT-tf (TensorFlow) tutorial built with Colab

***First, go to Runtime and change the runtime type to GPU.***


<br>
 Copyright Park Chanjun
<br>
 Email: bcj1210@naver.com










# Git Clone
First, clone the OpenNMT-tf source with Git.

In [0]:
!git clone https://github.com/OpenNMT/OpenNMT-tf.git

# Install OpenNMT-tf with pip



In [0]:
pip install OpenNMT-tf[tensorflow_gpu]

# Theory explanation

**Machine translation is a field of natural language processing, meaning that computers translate one language into another.**

Approaches have moved from rule-based to statistical methods and, more recently, to deep-learning-based machine translation.

This tutorial shows how to build a real machine translation system and how its pipeline is structured. Most of these steps also apply to basic natural language processing problems, not only to machine translation.

**Steps**



**1.   Data Collection**

A parallel corpus is collected from various sources, such as news texts, drama and movie subtitles, and Wikipedia. The evaluation data sets released by WMT, a machine translation competition, can also be used in translation systems.


**2.   Cleaning**

The collected data must be cleaned. Cleaning includes aligning the sentences of the corpus between the two languages and removing noise such as special characters.


**3. Subword Tokenization**

First normalize word boundaries using a POS tagger or segmenter suited to each language. For English, upper and lower case may also need to be normalized.
Once word boundaries are normalized, apply subword segmentation such as Byte Pair Encoding (BPE), using public tools such as subword-nmt or WordPiece. This splits words into smaller segments and lets you construct the vocabulary list. Keep the segmentation model learned at this point, because it is needed again later.


**4. Train**

Train the seq2seq model on the prepared datasets. Depending on the amount of data, you can train on a single GPU or use multiple GPUs in parallel to reduce training time.


**5. Translate**

Now that the model has been created, you can start translating.


**6. Detokenization**

Even after translation, the output is still split into subword segments, so it differs from sentences as people actually write them. The detokenization step restores the output to the form of a real sentence.


**7. Evaluating**

The translated sentences are then evaluated quantitatively. BLEU is a quantitative evaluation metric for machine translation. You can see which model is better by comparing the BLEU scores of the models.
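The cleaning step (step 2) can be sketched in a few lines of Python. This is only a minimal illustration, not part of OpenNMT-tf: the function names `normalize` and `clean_pairs` and the `max_len` threshold are hypothetical choices, and a real pipeline would likely add language-specific filters.

```python
import re

def normalize(line):
    """Replace control characters with spaces and collapse repeated whitespace."""
    line = re.sub(r"[\x00-\x1f\x7f]", " ", line)
    return re.sub(r"\s+", " ", line).strip()

def clean_pairs(src_lines, tgt_lines, max_len=100):
    """Keep only sentence pairs where both sides are non-empty and not too long.

    Pairs are dropped together so that source and target stay aligned.
    max_len is an assumed limit in whitespace-separated tokens.
    """
    cleaned = []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = normalize(src), normalize(tgt)
        if not src or not tgt:
            continue  # one side is empty: drop the whole pair
        if len(src.split()) > max_len or len(tgt.split()) > max_len:
            continue  # overly long sentences are often noise
        cleaned.append((src, tgt))
    return cleaned

pairs = clean_pairs(
    ["Hello   world\t!", "", "Good morning"],
    ["Hallo Welt !", "Leere Quelle", "Guten Morgen"],
)
print(pairs)
```

The important design point is that a sentence is never removed from one side only, because the training files must stay parallel line by line.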

# Data Collection

Let's collect an en-de parallel corpus from Amazon S3.
After the archive is extracted, a directory called toy-ende appears in your Colab files.

In [0]:
!wget https://s3.amazonaws.com/opennmt-trainingdata/toy-ende.tar.gz

In [0]:
!tar xf toy-ende.tar.gz

# Subword Tokenization

We use Byte Pair Encoding (BPE) for subword tokenization.

https://www.aclweb.org/anthology/P16-1162

-i ==> input<br>
-o ==> output (*.code)<br>
-s ==> number of symbols (merge operations) to learn (32000 is a common choice)<br>

learn_bpe ==> learns the BPE codes<br>
apply_bpe ==> applies subword tokenization<br>

src-train, src-val, src-test ==> apply src.code<br>
tgt-train, tgt-val ==> apply tgt.code
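To make clear what `learn_bpe` produces, here is a toy sketch of the merge-learning loop described in the paper linked above. It is a simplified illustration, not the code of `learn_bpe.py`: the function names and the tiny word-frequency dictionary are assumptions made for the example, and words are written as space-separated symbols ending in `</w>`.

```python
import collections
import re

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Join every occurrence of the symbol pair into a single symbol."""
    # Lookarounds make sure only whole symbols are matched.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda m: merged, word): freq for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Return the list of merges, most frequent pair first at each step."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges

# Toy word frequencies (illustrative data only).
toy_vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}
merges = learn_bpe(toy_vocab, 3)
print(merges)
```

In this sketch `num_merges` plays the role of the `-s` option, and the returned list corresponds to what is stored in the `*.code` file.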


In [0]:
!python OpenNMT-tf/third_party/learn_bpe.py -i toy-ende/src-train.txt -o toy-ende/src.code -s 10000

In [0]:
!python OpenNMT-tf/third_party/learn_bpe.py -i toy-ende/tgt-train.txt -o toy-ende/tgt.code -s 10000

In [0]:
!python OpenNMT-tf/third_party/apply_bpe.py -c  toy-ende/src.code -i  toy-ende/src-train.txt -o toy-ende/src-train-bpe.txt

In [0]:
!python OpenNMT-tf/third_party/apply_bpe.py -c  toy-ende/src.code -i  toy-ende/src-val.txt -o toy-ende/src-val-bpe.txt

In [0]:
!python OpenNMT-tf/third_party/apply_bpe.py -c toy-ende/src.code -i toy-ende/src-test.txt -o toy-ende/src-test-bpe.txt

In [0]:
!python OpenNMT-tf/third_party/apply_bpe.py -c toy-ende/tgt.code -i toy-ende/tgt-train.txt -o toy-ende/tgt-train-bpe.txt

In [0]:
!python OpenNMT-tf/third_party/apply_bpe.py -c toy-ende/tgt.code -i toy-ende/tgt-val.txt -o toy-ende/tgt-val-bpe.txt

# Build Vocab

We will be working with some example data in the toy-ende/ folder.

The data consists of parallel source (src) and target (tgt) data containing one sentence per line with tokens separated by a space:

1. src-train.txt
2. tgt-train.txt
3. src-val.txt
4. tgt-val.txt

Training data and validation data are both required for machine translation training.

The validation files are used to evaluate the convergence of the training. They usually contain no more than 5000 sentences.
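Because the source and target files must contain one sentence per line in the same order, it can help to verify that each pair of files has the same number of lines before building the vocabulary. The helper names below (`count_lines`, `check_parallel`) are hypothetical and not part of OpenNMT-tf; this is just a quick sanity check.

```python
import os

def count_lines(path):
    """Count the lines of a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

def check_parallel(src_path, tgt_path):
    """Return the shared line count, or raise if the two files differ."""
    n_src, n_tgt = count_lines(src_path), count_lines(tgt_path)
    if n_src != n_tgt:
        raise ValueError(
            f"{src_path} has {n_src} lines but {tgt_path} has {n_tgt} lines"
        )
    return n_src

# Only runs when the toy data has already been downloaded and extracted.
for split in ("train", "val"):
    src, tgt = f"toy-ende/src-{split}.txt", f"toy-ende/tgt-{split}.txt"
    if os.path.exists(src) and os.path.exists(tgt):
        print(split, check_parallel(src, tgt), "sentence pairs")
```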



In [0]:
!onmt-build-vocab --size 50000 --save_vocab toy-ende/src-vocab.txt toy-ende/src-train-bpe.txt

In [0]:
!onmt-build-vocab --size 50000 --save_vocab toy-ende/tgt-vocab.txt toy-ende/tgt-train-bpe.txt

# Let's Make data.yml

```
model_dir: toy-ende/run/

data:
  # BPE-segmented files, matching the vocabularies built above
  train_features_file: toy-ende/src-train-bpe.txt
  train_labels_file: toy-ende/tgt-train-bpe.txt
  eval_features_file: toy-ende/src-val-bpe.txt
  eval_labels_file: toy-ende/tgt-val-bpe.txt
  source_words_vocabulary: toy-ende/src-vocab.txt
  target_words_vocabulary: toy-ende/tgt-vocab.txt

train:
  save_checkpoints_steps: 1000

eval:
  eval_delay: 3600  # Every 1 hour
  external_evaluators: BLEU

infer:
  batch_size: 32

```










**Create a data.yml file on your computer and upload it to Google Colab.**
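As an alternative to uploading, the file can probably be written directly from a Colab cell. The sketch below assumes that training uses the BPE-segmented files (the vocabularies were built from them) and that `eval` is a top-level section of the configuration; `make_config` and its `suffix` parameter are hypothetical names introduced only for this example.

```python
from pathlib import Path

def make_config(data_dir="toy-ende", suffix="-bpe"):
    """Return the YAML text of the configuration for the given data directory.

    suffix selects which files to train on ("-bpe" for BPE-segmented files,
    "" for the raw files).
    """
    return f"""model_dir: {data_dir}/run/

data:
  train_features_file: {data_dir}/src-train{suffix}.txt
  train_labels_file: {data_dir}/tgt-train{suffix}.txt
  eval_features_file: {data_dir}/src-val{suffix}.txt
  eval_labels_file: {data_dir}/tgt-val{suffix}.txt
  source_words_vocabulary: {data_dir}/src-vocab.txt
  target_words_vocabulary: {data_dir}/tgt-vocab.txt

train:
  save_checkpoints_steps: 1000

eval:
  eval_delay: 3600  # Every 1 hour
  external_evaluators: BLEU

infer:
  batch_size: 32
"""

# Write the configuration next to the notebook.
Path("data.yml").write_text(make_config(), encoding="utf-8")
print(Path("data.yml").read_text(encoding="utf-8"))
```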

# **Train the model (Basic)**

This command will start the training and evaluation loop of a small RNN-based sequence to sequence model.

To train on a GPU, add the option below (this example uses 1 GPU):
>  --num_gpus 1

Let's check the available GPU.


In [0]:
!nvidia-smi

**Let's Train**

The model is saved in toy-ende/run.

In [0]:
!onmt-main train_and_eval --model_type NMTSmall --auto_config --config data.yml --num_gpus 1

# **Train the model (Transformer)**

https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf


> If you get GPU-related errors, try halving batch_size


***Just change model_type***

Available models ==> http://opennmt.net/OpenNMT-tf/package/opennmt.models.catalog.html


In [0]:
!onmt-main train_and_eval --model_type Transformer --auto_config --config data.yml --num_gpus 1

# **Translate**

Now that you have your model, you can start translating.



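The predictions are written to toy-ende/pred.txt.

By default, the latest checkpoint in model_dir is used. To translate with a specific model instead, pass a checkpoint path, for example:

--checkpoint_path run/baseline-enfr/avg/model.ckpt-200000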

In [0]:
!onmt-main infer --auto_config --config data.yml --features_file toy-ende/src-test-bpe.txt --predictions_file toy-ende/pred.txt

# Translate with your chosen model

Add the option below, replacing YOUR_MODEL with the step number of the checkpoint:

--checkpoint_path toy-ende/run/model.ckpt-YOUR_MODEL

In [0]:
!onmt-main infer --auto_config --config data.yml --features_file toy-ende/src-test-bpe.txt --predictions_file toy-ende/pred.txt --checkpoint_path toy-ende/run/model.ckpt-YOUR_MODEL
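To find out which step numbers are available, one option is to list the checkpoint files in the model directory. The sketch below assumes the usual TensorFlow checkpoint naming, where each checkpoint has a `model.ckpt-STEP.index` file; `list_checkpoint_steps` is a hypothetical helper, not an OpenNMT-tf function.

```python
import glob
import os
import re

def list_checkpoint_steps(model_dir):
    """Return the sorted step numbers of the checkpoints found in model_dir."""
    steps = []
    for path in glob.glob(os.path.join(model_dir, "model.ckpt-*.index")):
        match = re.search(r"model\.ckpt-(\d+)\.index$", path)
        if match:
            steps.append(int(match.group(1)))
    return sorted(steps)

# Prints an empty list if training has not produced a checkpoint yet.
print(list_checkpoint_steps("toy-ende/run"))
```

Any of the printed numbers can then be used in place of YOUR_MODEL.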

# Detokenization

Even after translation, the output is still split into subword segments, so it differs from sentences as people actually write them. The detokenization step restores the output to the form of a real sentence.

We use "sed" for BPE detokenization.


In [0]:
!sed -i "s/@@ //g" toy-ende/pred.txt
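The same substitution can be written in Python, which makes it easy to see what the command does: every `@@ ` marker is deleted so that the subword pieces are joined back into words. This is a small sketch; `remove_bpe` is a hypothetical helper name, and unlike the `sed` command above it also removes a stray `@@` at the end of a line.

```python
import re

def remove_bpe(line):
    """Join BPE segments by deleting the '@@ ' continuation markers."""
    return re.sub(r"@@( |$)", "", line)

print(remove_bpe("the new@@ est ver@@ sion"))
```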

# Evaluation Using BLEU

The translated sentences are then evaluated quantitatively. BLEU is a quantitative evaluation metric for machine translation. You can see which model is better by comparing the BLEU scores of the models. In the command below, toy-ende/ref.txt is the file of reference translations for the test set.

https://www.aclweb.org/anthology/P02-1040

In [0]:
!perl OpenNMT-tf/third_party/multi-bleu.perl toy-ende/ref.txt < toy-ende/pred.txt
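For intuition about what the score measures, here is a simplified corpus-level BLEU in Python, following the definition in the paper linked above: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It is an illustrative sketch with a single reference per sentence and whitespace tokenization, so its numbers are not guaranteed to match `multi-bleu.perl` exactly; `corpus_bleu` and `ngrams` are names chosen for this example.

```python
import collections
import math

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return collections.Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0 to 100) with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it appears in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if min(matches) == 0:
        return 0.0  # some n-gram order has no match at all
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: punish output that is shorter than the reference.
    penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * penalty * math.exp(log_precision)

print(corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"]))
print(corpus_bleu(["the cat sat on"], ["the cat sat on the mat"]))
```

The second call shows the brevity penalty at work: every n-gram of the hypothesis is correct, but the score drops because the output is shorter than the reference.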

If you have any questions, please email "bcj1210@naver.com".