<a href="https://colab.research.google.com/github/marued/low-resource-machine-translation-team07/blob/master/notebooks/Low_Resource_NMT_OpenNMT_baseline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Baseline model with OpenNMT



## Remove punctuation

Punctuation marks occur with very high frequency in the target corpus, which hurts the results, so we remove them before training.
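The `punctuation_remover.py` script is not shown in this notebook. As a rough illustration of the idea only, the sketch below drops tokens that consist entirely of punctuation from a whitespace-tokenized line. The function name `remove_punctuation`, the punctuation set, and the `keep` parameter are all assumptions made for this example, not the script's actual behaviour.

```python
import string

# Assumption: punctuation marks appear as separate whitespace-delimited tokens.
PUNCTUATION = set(string.punctuation) | {"«", "»", "…"}

def remove_punctuation(line, keep=("'",)):
    """Drop tokens made up only of punctuation characters, except those in `keep`."""
    kept = [
        tok for tok in line.split()
        if tok in keep or not all(ch in PUNCTUATION for ch in tok)
    ]
    return " ".join(kept)

print(remove_punctuation("Hello , world !"))
```

Keeping the apostrophe by default is a guess based on the translation outputs further down, where `'` still appears as a token.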


In [0]:
import os
import pandas as pd
import re

In [6]:
DATASET='/home/ryan/projects/OpenNMT-py/data/low_resource'
os.chdir(DATASET)
os.listdir()

['tokenizer.py',
 'train.lang1',
 'punctuation_remover.py',
 'nopunc',
 'unaligned.fr',
 'train.lang2',
 'unaligned.en',
 'evaluator.py']

In [0]:
!python punctuation_remover.py --input train.lang1 --output nopunc
!python punctuation_remover.py --input train.lang2 --output nopunc

In [0]:
#@title Folder to save data, models and outputs
#@markdown This can be changed for different experiments
EXPR_DIR='/home/ryan/projects/OpenNMT-py/data/low_resource/nopunc' #@param

## Split the dataset into train, validation, and test sets

In [9]:
DATASET='/home/ryan/projects/OpenNMT-py/data/low_resource/nopunc'
os.chdir(DATASET)
os.listdir()

['sub_train.lang2',
 'train.lang1',
 'sub_train.lang1',
 'sub_test.lang2',
 'sub_test.lang1',
 'sub_valid.lang2',
 'train.lang2',
 'sub_valid.lang1']

In [0]:
%%bash
head -9000 train.lang1 > sub_train.lang1
head -9000 train.lang2 > sub_train.lang2

tail -2000 train.lang1 |head -1000 > sub_valid.lang1
tail -2000 train.lang2 |head -1000 > sub_valid.lang2

tail -1000 train.lang1 > sub_test.lang1
tail -1000 train.lang2 > sub_test.lang2
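For readers who prefer to check the split logic in Python, here is a minimal sketch that mirrors the `head`/`tail` commands above on an in-memory list of lines. The function name `split_corpus` and its parameters are hypothetical. Note that the three parts are only disjoint when the file has at least `n_train + n_valid + n_test` lines (11000 with the sizes used here); the notebook does not state the corpus size.

```python
def split_corpus(lines, n_train=9000, n_valid=1000, n_test=1000):
    """Mirror the head/tail commands: train from the top, valid and test from the bottom."""
    train = lines[:n_train]               # head -9000
    tail = lines[-(n_valid + n_test):]    # tail -2000
    valid = tail[:n_valid]                # ... | head -1000
    test = lines[-n_test:]                # tail -1000
    return train, valid, test

# Toy example with 20 "lines" and a 12/3/3 split.
train, valid, test = split_corpus(list(range(20)), n_train=12, n_valid=3, n_test=3)
print(len(train), valid, test)
```

The same function must be applied to `train.lang1` and `train.lang2` with identical sizes so that the sentence pairs stay aligned.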

## Data preprocessing

In [0]:
DATASET='/home/ryan/projects/OpenNMT-py/'
os.chdir(DATASET)

In [0]:
%%bash
# Drop the last line of every non-test file, then tokenize (lang1 = en, lang2 = fr).
for f in data/low_resource/nopunc/*.lang[12]; do
  if [[ "$f" != *test* ]]; then sed -i '$ d' "$f"; fi
done
for f in data/low_resource/nopunc/*.lang1; do perl tools/tokenizer.perl -a -no-escape -l en -q < "$f" > "$f.atok"; done
for f in data/low_resource/nopunc/*.lang2; do perl tools/tokenizer.perl -a -no-escape -l fr -q < "$f" > "$f.atok"; done

In [16]:
# prepare pairs
!onmt_preprocess \
-train_src data/low_resource/nopunc/sub_train.lang1.atok \
-train_tgt data/low_resource/nopunc/sub_train.lang2.atok \
-valid_src data/low_resource/nopunc/sub_valid.lang1.atok \
-valid_tgt data/low_resource/nopunc/sub_valid.lang2.atok \
-save_data data/low_resource/nopunc/low_resource_en2fr.atok.low \
-lower \
-dynamic_dict \
-overwrite

[2020-03-21 13:21:54,443 INFO] Extracting features...
[2020-03-21 13:21:54,443 INFO]  * number of source features: 0.
[2020-03-21 13:21:54,443 INFO]  * number of target features: 0.
[2020-03-21 13:21:54,443 INFO] Building `Fields` object...
[2020-03-21 13:21:54,443 INFO] Building & saving training data...
[2020-03-21 13:21:54,479 INFO] Building shard 0.
[2020-03-21 13:21:55,995 INFO]  * saving 0th train data shard to data/low_resource/nopunc/low_resource_en2fr.atok.low.train.0.pt.
[2020-03-21 13:21:58,097 INFO]  * tgt vocab size: 16127.
[2020-03-21 13:21:58,114 INFO]  * src vocab size: 12242.
[2020-03-21 13:21:58,198 INFO] Building & saving validation data...
[2020-03-21 13:21:58,240 INFO] Building shard 0.
[2020-03-21 13:21:58,379 INFO]  * saving 0th valid data shard to data/low_resource/nopunc/low_resource_en2fr.atok.low.valid.0.pt.


## fastText word embeddings

In [0]:
%%bash
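# Download into the directory that embeddings_to_torch.py reads from below.
mkdir -p data/low_resource/nopunc/fasttext
cd data/low_resource/nopunc/fasttext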
wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz
wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.fr.300.vec.gz
gunzip cc.en.300.vec.gz
gunzip cc.fr.300.vec.gz

In [0]:
DATASET='/home/ryan/projects/OpenNMT-py/'
os.chdir(DATASET)

In [22]:
!./tools/embeddings_to_torch.py -emb_file_enc "data/low_resource/nopunc/fasttext/cc.en.300.vec" \
-emb_file_dec "data/low_resource/nopunc/fasttext/cc.fr.300.vec" \
-dict_file "data/low_resource/nopunc/low_resource_en2fr.atok.low.vocab.pt" \
-output_file "data/low_resource/nopunc/fasttext_embeddings"

[2020-03-22 00:01:03,745 INFO] From: data/low_resource/nopunc/low_resource_en2fr.atok.low.vocab.pt
[2020-03-22 00:01:03,745 INFO] 	* source vocab: 12242 words
[2020-03-22 00:01:03,745 INFO] 	* target vocab: 16127 words
[2020-03-22 00:01:03,745 INFO] Reading encoder embeddings from data/low_resource/nopunc/fasttext/cc.en.300.vec
[2020-03-22 00:01:37,971 INFO] 	Found 2000000 total vectors in file.
[2020-03-22 00:01:37,971 INFO] Reading decoder embeddings from data/low_resource/nopunc/fasttext/cc.fr.300.vec
[2020-03-22 00:02:15,565 INFO] 	Found 2000000 total vectors in file
[2020-03-22 00:02:15,565 INFO] After filtering to vectors in vocab:
[2020-03-22 00:02:15,568 INFO] 	* enc: 11750 match, 492 missing, (95.98%)
[2020-03-22 00:02:15,572 INFO] 	* dec: 15985 match, 142 missing, (99.12%)
[2020-03-22 00:02:15,573 INFO] 
Saving embedding as:
	* enc: data/low_resource/nopunc/fasttext_embeddings.enc.pt
	* dec: data/low_resource/nopunc/fasttext_embeddings.dec.pt
[2020-03-22 00:02:16,480 INFO]
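The match/missing figures in the log are simply the share of vocabulary words that have a pretrained vector: 11750 of 12242 source words (95.98%) and 15985 of 16127 target words (99.12%). The sketch below shows one way such a coverage count could be computed over a text `.vec` file, whose first line holds the vector count and dimension. The function `embedding_coverage` is a hypothetical helper written for illustration; it is not the code inside `embeddings_to_torch.py`.

```python
import io

def embedding_coverage(vec_file, vocab):
    """Count how many vocabulary words have a vector in a word2vec-style text file."""
    next(vec_file)                          # skip header line: "<num_vectors> <dim>"
    found = set()
    for line in vec_file:
        word = line.split(" ", 1)[0]        # first field of each line is the word
        if word in vocab:
            found.add(word)
    missing = len(vocab) - len(found)
    return len(found), missing, 100.0 * len(found) / len(vocab)

# Tiny in-memory stand-in for cc.en.300.vec (3 vectors of dimension 2).
toy_vec = io.StringIO("3 2\nthe 0.1 0.2\ncat 0.3 0.4\ndog 0.5 0.6\n")
match, missing, pct = embedding_coverage(toy_vec, {"the", "cat", "frontier"})
print(f"{match} match, {missing} missing, ({pct:.2f}%)")
```

Words without a pretrained vector still need an embedding; how those are initialised depends on the OpenNMT-py version in use.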

## NMT: seq2seq with attention


### Seq2seq with fastText embeddings

In [24]:
!onmt_train -fp32 -data data/low_resource/nopunc/low_resource_en2fr.atok.low \
-save_model data/low_resource/nopunc/low_resource_en2fr_model \
-gpu_ranks 0 \
-valid_steps 1000 \
-save_checkpoint_steps 2000 \
-train_steps 15000 \
-word_vec_size 300 \
-pre_word_vecs_enc "data/low_resource/nopunc/fasttext_embeddings.enc.pt" \
-pre_word_vecs_dec "data/low_resource/nopunc/fasttext_embeddings.dec.pt" \
-tensorboard \
-tensorboard_log_dir run_logs

[2020-03-22 00:10:13,202 INFO]  * src vocab size = 12242
[2020-03-22 00:10:13,202 INFO]  * tgt vocab size = 16127
[2020-03-22 00:10:13,202 INFO] Building model...
[2020-03-22 00:10:16,167 INFO] NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(12242, 300, padding_idx=1)
        )
      )
    )
    (rnn): LSTM(300, 500, num_layers=2, dropout=0.3)
  )
  (decoder): InputFeedRNNDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(16127, 300, padding_idx=1)
        )
      )
    )
    (dropout): Dropout(p=0.3, inplace=False)
    (rnn): StackedLSTM(
      (dropout): Dropout(p=0.3, inplace=False)
      (layers): ModuleList(
        (0): LSTMCell(800, 500)
        (1): LSTMCell(500, 500)
      )
    )
    (attn): GlobalAttention(
      (linear_in): Linear(in_features=500, out_features=500, bias=False)
    

In [27]:
!onmt_translate -gpu 0 -model data/low_resource/nopunc/low_resource_en2fr_model_step_15000.pt -src data/low_resource/nopunc/sub_test.lang1.atok -tgt data/low_resource/nopunc/sub_test.lang2.atok -replace_unk -output data/low_resource/nopunc/low_resource_en2fr_test.pred.atok
!perl tools/multi-bleu.perl data/low_resource/nopunc/sub_test.lang2.atok < data/low_resource/nopunc/low_resource_en2fr_test.pred.atok

[2020-03-22 13:38:35,000 INFO] Translating shard 0.
  var = torch.tensor(arr, dtype=self.dtype, device=device)
PRED AVG SCORE: -0.4206, PRED PPL: 1.5228
GOLD AVG SCORE: -6.0389, GOLD PPL: 419.4372
BLEU = 11.73, 40.0/17.4/8.9/4.7 (BP=0.900, ratio=0.904, hyp_len=20488, ref_len=22654)
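To make the BLEU line easier to read: the four slash-separated numbers are the 1- to 4-gram precisions (in percent), `BP` is the brevity penalty, and `ratio` is `hyp_len / ref_len`. The sketch below recomputes the score from the figures printed above using the standard BLEU formula (geometric mean of the precisions times the brevity penalty). The function names are made up for this example, and because the precisions in the log are rounded to one decimal, the recomputed value only approximately matches the reported 11.73.

```python
import math

def brevity_penalty(hyp_len, ref_len):
    """Penalise hypotheses that are shorter than the reference."""
    return 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)

def bleu_from_precisions(precisions_pct, hyp_len, ref_len):
    """Geometric mean of the n-gram precisions, scaled by the brevity penalty."""
    log_mean = sum(math.log(p / 100.0) for p in precisions_pct) / len(precisions_pct)
    return 100.0 * brevity_penalty(hyp_len, ref_len) * math.exp(log_mean)

bp = brevity_penalty(20488, 22654)
bleu = bleu_from_precisions([40.0, 17.4, 8.9, 4.7], 20488, 22654)
print(f"BP={bp:.3f}, ratio={20488 / 22654:.3f}, BLEU~{bleu:.2f}")
```

A brevity penalty of 0.900 means the translations are about 10% shorter than the references, which by itself costs roughly a tenth of the score.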


### Seq2seq without fastText embeddings

In [0]:
!onmt_train -data data/low_resource/nopunc/low_resource_en2fr.atok.low \
-save_model data/low_resource/nopunc/low_resource_en2fr_noemb_model \
-gpu_ranks 0 \
-valid_steps 1000 \
-save_checkpoint_steps 2000 \
-train_steps 15000 \
-tensorboard \
--tensorboard_log_dir run_logs

In [24]:
!onmt_translate -gpu 0 -model data/low_resource/nopunc/low_resource_en2fr_noemb_model_step_4000.pt -src data/low_resource/nopunc/sub_test.lang1.atok -tgt data/low_resource/nopunc/sub_test.lang2.atok -replace_unk -verbose -output data/low_resource/nopunc/low_resource_en2fr_test.pred.atok
!perl tools/multi-bleu.perl data/low_resource/nopunc/sub_test.lang2.atok < data/low_resource/nopunc/low_resource_en2fr_test.pred.atok

Streaming output truncated to the last 5000 lines.
PRED 168: Nous devons étudier et efficacement les systèmes d ' informations entre les autorités concernées et l ' utilisation des fonds historiques de l ' UE et de l ' utilisation des fonds proposées
PRED SCORE: -23.0253
GOLD 168: L ' échange d ' informations entre les autorités et des bases de données communes pour enregistrer la composition des nouvelles drogues devraient être mis en place et développés sans tarder et bénéficier d ' un financement communautaire
GOLD SCORE: -205.0291

SENT 169: ['i', 'find', 'this', 'hard', 'to', 'take', 'certainly', 'where', 'frontier', 'work', 'is', 'concerned']
PRED 169: Je me réjouis tout de prendre le travail de vue de l ' ennuyer
PRED SCORE: -10.3000
GOLD 169: Je trouve que c ' est difficile à accepter surtout lorsque le travail frontalier est concerné
GOLD SCORE: -96.9523

SENT 170: ['that', 'american', 'movie', 'was', 'a', 'great', 'success']
PRED 170: Ce qui a été un grand succè

In [27]:
!onmt_translate -gpu 0 -model data/low_resource/nopunc/low_resource_en2fr_noemb_model_step_10000.pt -src data/low_resource/nopunc/sub_test.lang1.atok -tgt data/low_resource/nopunc/sub_test.lang2.atok -replace_unk -output data/low_resource/nopunc/low_resource_en2fr_test.pred.atok
!perl tools/multi-bleu.perl data/low_resource/nopunc/sub_test.lang2.atok < data/low_resource/nopunc/low_resource_en2fr_test.pred.atok

[2020-03-21 14:15:46,426 INFO] Translating shard 0.
  var = torch.tensor(arr, dtype=self.dtype, device=device)
PRED AVG SCORE: -0.5011, PRED PPL: 1.6505
GOLD AVG SCORE: -5.5942, GOLD PPL: 268.8718
BLEU = 10.58, 37.2/15.6/7.4/3.4 (BP=0.961, ratio=0.961, hyp_len=21779, ref_len=22654)


In [28]:
!onmt_translate -gpu 0 -model data/low_resource/nopunc/low_resource_en2fr_noemb_model_step_15000.pt -src data/low_resource/nopunc/sub_test.lang1.atok -tgt data/low_resource/nopunc/sub_test.lang2.atok -replace_unk -output data/low_resource/nopunc/low_resource_en2fr_test.pred.atok
!perl tools/multi-bleu.perl data/low_resource/nopunc/sub_test.lang2.atok < data/low_resource/nopunc/low_resource_en2fr_test.pred.atok

[2020-03-21 14:16:08,202 INFO] Translating shard 0.
  var = torch.tensor(arr, dtype=self.dtype, device=device)
PRED AVG SCORE: -0.4459, PRED PPL: 1.5619
GOLD AVG SCORE: -6.0129, GOLD PPL: 408.6776
BLEU = 10.53, 37.8/15.8/7.6/3.7 (BP=0.923, ratio=0.926, hyp_len=20982, ref_len=22654)
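The `AVG SCORE` and `PPL` columns carry the same information: the numbers in the logs are consistent with perplexity being the exponential of the negated average per-token log-probability. The short check below uses the values from the run above; the helper name `perplexity` is chosen for this example, and the small differences come from the average score being rounded to four decimals in the log.

```python
import math

def perplexity(avg_score):
    """Perplexity from an average per-token log-probability."""
    return math.exp(-avg_score)

for name, avg in [("PRED", -0.4459), ("GOLD", -6.0129)]:
    print(f"{name} PPL: {perplexity(avg):.4f}")
```

The large gap between the two values suggests that the model is confident in its own outputs while assigning low probability to the reference translations.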


## Transformer

In [2]:
%%bash
python  train.py -fp32 -data data/low_resource/nopunc/low_resource_en2fr.atok.low \
        -save_model data/low_resource/nopunc/transformer_en2fr_model \
        -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8  \
        -encoder_type transformer -decoder_type transformer -position_encoding \
        -train_steps 15000  -max_generator_batches 2 -dropout 0.1 \
        -batch_size 4096 -batch_type tokens -normalization tokens  -accum_count 2 \
        -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
        -max_grad_norm 0 -param_init 0  -param_init_glorot \
        -label_smoothing 0.1 -valid_steps 1000 -save_checkpoint_steps 2000 \
        -gpu_ranks 0 -tensorboard \
        --tensorboard_log_dir run_logs

[2020-03-21 14:23:17,217 INFO]  * src vocab size = 12242
[2020-03-21 14:23:17,217 INFO]  * tgt vocab size = 16127
[2020-03-21 14:23:17,217 INFO] Building model...
[2020-03-21 14:23:21,261 INFO] NMTModel(
  (encoder): TransformerEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(12242, 512, padding_idx=1)
        )
        (pe): PositionalEncoding(
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (transformer): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=True)
          (linear_values): Linear(in_features=512, out_features=512, bias=True)
          (linear_query): Linear(in_features=512, out_features=512, bias=True)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=512, out_feat
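
The training command combines `-decay_method noam`, `-warmup_steps 8000`, and `-learning_rate 2` with only 15000 training steps. The sketch below evaluates the commonly cited Noam schedule for those values, on the assumption that it matches what OpenNMT-py applies (worth verifying against the installed version). The function `noam_lr` is written for this illustration and is not taken from the library.

```python
def noam_lr(step, learning_rate=2.0, model_dim=512, warmup_steps=8000):
    """Noam schedule: linear warm-up, then inverse square-root decay (step >= 1)."""
    return learning_rate * model_dim ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1000, 4000, 8000, 15000):
    print(step, f"{noam_lr(step):.6f}")
```

Under this formula the learning rate peaks at step 8000, so more than half of the 15000 steps are spent in warm-up. A shorter warm-up may be worth trying on a corpus this small.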

In [3]:
!onmt_translate -gpu 0 -model data/low_resource/nopunc/transformer_en2fr_model_step_15000.pt -src data/low_resource/nopunc/sub_test.lang1.atok -tgt data/low_resource/nopunc/sub_test.lang2.atok -replace_unk -verbose -output data/low_resource/nopunc/transformer_en2fr_test.pred.atok
!perl tools/multi-bleu.perl data/low_resource/nopunc/sub_test.lang2.atok < data/low_resource/nopunc/transformer_en2fr_test.pred.atok

Streaming output truncated to the last 5000 lines.
PRED 168: Il faut développer ce fait de développer des systèmes de l ' information numérique et de l ' information commune dans le cadre de l ' enregistrement
PRED SCORE: -22.8057
GOLD 168: L ' échange d ' informations entre les autorités et des bases de données communes pour enregistrer la composition des nouvelles drogues devraient être mis en place et développés sans tarder et bénéficier d ' un financement communautaire
GOLD SCORE: -264.6055

SENT 169: ['i', 'find', 'this', 'hard', 'to', 'take', 'certainly', 'where', 'frontier', 'work', 'is', 'concerned']
PRED 169: Je me réjouis que ça doit certainement un travail accompli
PRED SCORE: -13.2043
GOLD 169: Je trouve que c ' est difficile à accepter surtout lorsque le travail frontalier est concerné
GOLD SCORE: -94.7532

SENT 170: ['that', 'american', 'movie', 'was', 'a', 'great', 'success']
PRED 170: Bien sûr c ' était un grand succès
PRED SCORE: -3.7169
GOLD 170: Ce film