<a href="https://colab.research.google.com/github/EverlynAsiko/Neural_Machine_Translation_for_African_Languages/blob/main/Baseline_models_results1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summary of Baseline Models 

**Overview:**
1. Text preprocessing
2. Inputs of the transformer
3. Workings of a transformer: *Submitted write up*
4. Results of baseline models

The code is adapted from the Masakhane reverse-model starter notebook: https://github.com/masakhane-io/masakhane-mt/blob/master/starter_notebook_into_English_training.ipynb

Changes made include:
1. Additional models.
2. Generating CSV files from the dataframes created.
3. Using checkpoints to resume training.

#### Setting up locations and libraries

In [None]:
# Linking to drive
from google.colab import drive
drive.mount("/content/gdrive")

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
# Importing needed libraries for preprocessing and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Install PyTorch v1.8.0 with GPU (CUDA 10.1) support.
! pip install torch==1.8.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html


In [None]:
# Filtering warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Changing to the shared project directory on the mounted drive
import os
os.chdir("/content/gdrive/Shared drives/NMT_for_African_Language")

In [None]:
# Setting source and target languages
source_language = "en"
target_language1 = "lg"
target_language2 = "rw"
target_language3 = "lh"

os.environ["src"] = source_language 
os.environ["tgt1"] = target_language1
os.environ["tgt2"] = target_language2
os.environ["tgt3"] = target_language3
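The `os.environ` assignments matter because the `!` shell commands later in the notebook refer to `$src`, `$tgt1`, and so on. As a rough preview of what those commands will see, the sketch below expands the same placeholders with `os.path.expandvars`. This only approximates shell expansion, and the helper name `preview_command` is hypothetical.

```python
import os

# Same values the notebook exports for the shell commands.
os.environ["src"] = "en"
os.environ["tgt1"] = "lg"


def preview_command(template: str) -> str:
    """Expand $name placeholders from os.environ (approximates the shell)."""
    return os.path.expandvars(template)


corpus_files = preview_command("jw300.$src jw300.$tgt1")
archive_name = preview_command("JW300_latest_xml_$src-$tgt1.xml.gz")
print(corpus_files)
print(archive_name)
```

Checking the expanded names this way can catch a missing or misspelled variable before a long download starts.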

# Getting Data

JW300 to dataframes

In [None]:
# Installing package to retrieve datasets
! pip install opustools-pkg

## Luganda   

### Turning data from JW300 to dataframe

**Do not rerun**: load the saved pandas dataframe (`Luganda.csv`) instead.

In [None]:
# Changing to Luganda directory
os.chdir("/content/gdrive/Shared drives/NMT_for_African_Language/Luganda")

In [None]:
# Downloading our corpus
! opus_read -d JW300 -s $src -t $tgt1 -wm moses -w jw300.$src jw300.$tgt1 -q


Alignment file /proj/nlpl/data/OPUS/JW300/latest/xml/en-lg.xml.gz not found. The following files are available for downloading:

   3 MB https://object.pouta.csc.fi/OPUS-JW300/v1b/xml/en-lg.xml.gz
 263 MB https://object.pouta.csc.fi/OPUS-JW300/v1b/xml/en.zip
  22 MB https://object.pouta.csc.fi/OPUS-JW300/v1b/xml/lg.zip

 288 MB Total size
./JW300_latest_xml_en-lg.xml.gz ... 100% of 3 MB
./JW300_latest_xml_en.zip ... 100% of 263 MB
./JW300_latest_xml_lg.zip ... 100% of 22 MB


In [None]:
# extract the corpus file
! gunzip JW300_latest_xml_$src-$tgt1.xml.gz

In [None]:
# Download the global test set; its sentences are filtered out of the train/dev data.
! wget https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-any.en

# Specific test set
! wget https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-$tgt1.en 
! mv test.en-$tgt1.en test.en
! wget https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-$tgt1.$tgt1 
! mv test.en-$tgt1.$tgt1 test.$tgt1

--2021-05-14 10:19:53--  https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-any.en
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 277791 (271K) [text/plain]
Saving to: ‘test.en-any.en.1’


2021-05-14 10:19:53 (8.11 MB/s) - ‘test.en-any.en.1’ saved [277791/277791]

--2021-05-14 10:19:54--  https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-lg.en
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 204598 (200K) [text/plain]
Saving to: ‘test.en-lg.en’


2021-0

In [None]:
# Read the test data to filter from train and dev splits.
# Store english portion in set for quick filtering checks.
en_test_sents = set()
filter_test_sents = "test.en-any.en"
j = 0
with open(filter_test_sents) as f:
  for line in f:
    en_test_sents.add(line.strip())
    j += 1
print('Loaded {} global test sentences to filter from the training/dev data.'.format(j))

Loaded 3571 global test sentences to filter from the training/dev data.


In [None]:
# Moses-format text files (written by opus_read) to dataframe
source_file = 'jw300.' + source_language
target_file = 'jw300.' + target_language1

source = []
target = []
skip_lines = set()  # Line numbers skipped on the source side, so the same lines can be skipped on the target side.
with open(source_file) as f:
    for i, line in enumerate(f):
        # Skip sentences that are contained in the test set.
        if line.strip() not in en_test_sents:
            source.append(line.strip())
        else:
            skip_lines.add(i)
with open(target_file) as f:
    for j, line in enumerate(f):
        # Only add to corpus if corresponding source was not skipped.
        if j not in skip_lines:
            target.append(line.strip())
    
print('Loaded data and skipped {}/{} lines since contained in test set.'.format(len(skip_lines), i))
    
df = pd.DataFrame(zip(source, target), columns=['source_sentence', 'target_sentence'])
df.head(3)

Loaded data and skipped 5229/254723 lines since contained in test set.


Unnamed: 0,source_sentence,target_sentence
0,This publication is not for sale .,Akatabo kano tekatundibwa .
1,COVER SUBJECT,OMUTWE OGULI KUNGULU
2,The Bible was completed about two thousand yea...,Bayibuli yamalirizibwa okuwandiikibwa emyaka n...
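The filtering step above can be summarised in a small in-memory sketch. It takes parallel lists instead of files and drops a pair whenever its source side appears in the test set. The function name `filter_parallel` and the one-sentence test set are illustrative assumptions, not part of the original pipeline.

```python
from typing import Iterable, List, Set, Tuple


def filter_parallel(
    src_lines: Iterable[str],
    trg_lines: Iterable[str],
    test_sents: Set[str],
) -> Tuple[List[str], List[str], int]:
    """Drop every sentence pair whose source side appears in the test set."""
    kept_src, kept_trg = [], []
    skipped = 0
    for src, trg in zip(src_lines, trg_lines):
        src, trg = src.strip(), trg.strip()
        if src in test_sents:
            skipped += 1  # the aligned target line is dropped as well
            continue
        kept_src.append(src)
        kept_trg.append(trg)
    return kept_src, kept_trg, skipped


# Hypothetical test set; the two pairs are reused from the output above.
test_sents = {"COVER SUBJECT"}
src = ["This publication is not for sale .\n", "COVER SUBJECT\n"]
trg = ["Akatabo kano tekatundibwa .\n", "OMUTWE OGULI KUNGULU\n"]
kept_src, kept_trg, skipped = filter_parallel(src, trg, test_sents)
print(kept_src, kept_trg, skipped)
```

Iterating over both sides with `zip` keeps the pairs aligned without tracking line numbers separately.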


In [None]:
# Luganda training set
df.to_csv('Luganda.csv',index=False) 

### Data preprocessing

In [None]:
lug = pd.read_csv("Luganda.csv")
lug.head(3)

Unnamed: 0,source_sentence,target_sentence
0,This publication is not for sale .,Akatabo kano tekatundibwa .
1,COVER SUBJECT,OMUTWE OGULI KUNGULU
2,The Bible was completed about two thousand yea...,Bayibuli yamalirizibwa okuwandiikibwa emyaka n...


In [None]:
# drop duplicate translations
df_pp = lug.drop_duplicates()

# drop conflicting translations
df_pp.drop_duplicates(subset='source_sentence', inplace=True)
df_pp.drop_duplicates(subset='target_sentence', inplace=True)

# Shuffle the data to remove bias in dev set selection.
df_pp = df_pp.sample(frac=1, random_state=42).reset_index(drop=True)
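To make the effect of the three `drop_duplicates` calls concrete, here is a plain-Python sketch of the same idea: keep the first pair seen for each source sentence, then the first pair seen for each target sentence. It is written without pandas purely for illustration; `dedupe_pairs` and the toy pairs are assumptions.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def dedupe_pairs(pairs: List[Pair]) -> List[Pair]:
    """Keep the first pair per source sentence, then the first per target."""
    # Pass 1: one translation per source sentence. This also removes
    # exact duplicate pairs, since they share a source sentence.
    seen_src, by_source = set(), []
    for src, trg in pairs:
        if src not in seen_src:
            seen_src.add(src)
            by_source.append((src, trg))
    # Pass 2: one source sentence per translation.
    seen_trg, result = set(), []
    for src, trg in by_source:
        if trg not in seen_trg:
            seen_trg.add(trg)
            result.append((src, trg))
    return result


toy = [
    ("hello", "a"),
    ("hello", "a"),   # exact duplicate
    ("hello", "b"),   # conflicting translation of the same source
    ("hi", "a"),      # different source, same target
    ("thanks", "c"),
]
cleaned = dedupe_pairs(toy)
print(cleaned)
```

Note that this filtering is strict: a legitimate pair such as `("hi", "a")` is dropped only because its target was already used.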

In [None]:
# reset the index of the training set after previous filtering
df_pp.reset_index(drop=False, inplace=True)

In [None]:
df_pp.dropna(inplace=True)

In [None]:
df_pp.isna().sum()

index              0
source_sentence    0
target_sentence    0
dtype: int64

In [None]:
# Splitting train and validation set
num_valid = 1000

dev = df_pp.tail(num_valid) 
stripped = df_pp.drop(df_pp.tail(num_valid).index)

# Creating files for Luganda and English
with open("train."+source_language, "w") as src_file, open("train."+target_language1, "w") as trg_file:
  for index, row in stripped.iterrows():
    try:
      src_file.write(row["source_sentence"]+"\n")
      trg_file.write(row["target_sentence"]+"\n")
    except TypeError:
      print(index,row["target_sentence"])
    
with open("dev."+source_language, "w") as src_file, open("dev."+target_language1, "w") as trg_file:
  for index, row in dev.iterrows():
    src_file.write(row["source_sentence"]+"\n")
    trg_file.write(row["target_sentence"]+"\n")
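One detail worth noting in the loop above: if a target sentence were not a string, the source line would already have been written before the `TypeError` is raised, leaving the two files misaligned. The earlier `dropna` call should prevent that here, but a defensive variant can validate both sides before writing anything. The sketch below is one way to do that; `write_parallel` and the file names are hypothetical.

```python
import os
import tempfile
from typing import Iterable, Tuple


def write_parallel(pairs: Iterable[Tuple[object, object]],
                   src_path: str, trg_path: str) -> int:
    """Write aligned source/target files; return the number of skipped pairs."""
    skipped = 0
    with open(src_path, "w", encoding="utf-8") as src_file, \
         open(trg_path, "w", encoding="utf-8") as trg_file:
        for src, trg in pairs:
            # Check both sides first so a bad row never writes half a pair.
            if not isinstance(src, str) or not isinstance(trg, str):
                skipped += 1
                continue
            src_file.write(src + "\n")
            trg_file.write(trg + "\n")
    return skipped


# Toy usage in a temporary directory; float("nan") stands in for a missing cell.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "train.en")
trg_path = os.path.join(tmp_dir, "train.lg")
pairs = [("COVER SUBJECT", "OMUTWE OGULI KUNGULU"), ("orphan", float("nan"))]
skipped = write_parallel(pairs, src_path, trg_path)
with open(src_path, encoding="utf-8") as f:
    src_lines = f.read().splitlines()
with open(trg_path, encoding="utf-8") as f:
    trg_lines = f.read().splitlines()
print(skipped, src_lines, trg_lines)
```

With a dataframe, the pairs could be supplied as `zip(stripped["source_sentence"], stripped["target_sentence"])`.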

In [None]:
! head train.*
! head dev.*

==> train.bpe.en <==
Ev@@ en@@ tually , however , the tru@@ ths I learned from the Bible began to sin@@ k deep@@ er into my heart . I real@@ ized that if I wanted to serve Jehovah , I had to change my pol@@ it@@ ical view@@ poin@@ ts and associ@@ ations .
At last , I have the st@@ able family life that I always cr@@ av@@ ed , and I have the loving Father that I always wanted .
I was a new husband , only 25 years old and very in@@ experienced , but off we went with confidence in Jehovah .
What can you do to show these de@@ a@@ f brothers personal attention ?
R@@ ef@@ er@@ r@@ ing to what the rul@@ er@@ ship of God’s Son will accompl@@ ish , Isaiah 9 : 7 says : “ The very z@@ eal of Jehovah of arm@@ ies will do this . ”
Jesus is the m@@ igh@@ ti@@ est of all of Jehovah’s spirit sons .
The ste@@ ad@@ f@@ ast example set by J@@ ac@@ o@@ b and R@@ ac@@ he@@ l no doubt had a powerful effect on their son Joseph , influ@@ enc@@ ing how he would hand@@ le t@@ ests of his own faith .
When s@@ en

In [None]:
#! git clone https://github.com/joeynmt/joeynmt.git
! cd joeynmt; pip3 install .

Processing /content/gdrive/Shareddrives/NMT_for_African_Language/Luganda/joeynmt
Collecting numpy==1.20.1
  Downloading https://files.pythonhosted.org/packages/70/8a/064b4077e3d793f877e3b77aa64f56fa49a4d37236a53f78ee28be009a16/numpy-1.20.1-cp37-cp37m-manylinux2010_x86_64.whl (15.3MB)
Collecting torchtext==0.9.0
  Downloading https://files.pythonhosted.org/packages/36/50/84184d6230686e230c464f0dd4ff32eada2756b4a0b9cefec68b88d1d580/torchtext-0.9.0-cp37-cp37m-manylinux1_x86_64.whl (7.1MB)
Collecting sacrebleu>=1.3.6
  Downloading https://files.pythonhosted.org/packages/7e/57/0c7ca4e31a126189dab99c19951910bd081dea5bbd25f24b77107750eae7/sacrebleu-1.5.1-py3-none-any.whl (54kB)
Collecting subword-nmt
  Downloading https://files.pythonhosted.org/packages/74/60/6600a7bc09e7ab38bc53a48a20d8cae4

In [None]:
# Learn a joint BPE model and per-language vocabularies from the training data.
! subword-nmt learn-joint-bpe-and-vocab --input train.$src train.$tgt1 -s 4000 -o bpe.codes.4000 --write-vocabulary vocab.$src vocab.$tgt1

# Apply BPE splits to the training, development and test data.
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < train.$src > train.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt1 < train.$tgt1 > train.bpe.$tgt1

! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < dev.$src > dev.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt1 < dev.$tgt1 > dev.bpe.$tgt1
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < test.$src > test.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt1 < test.$tgt1 > test.bpe.$tgt1

# Build the joint vocabulary file with JoeyNMT's build_vocab script
! sudo chmod 777 joeynmt/scripts/build_vocab.py
! joeynmt/scripts/build_vocab.py train.bpe.$src train.bpe.$tgt1 --output_path vocab.txt

In [None]:
# Some output
! echo "BPE Luganda Sentences"
! tail -n 5 test.bpe.$tgt1
! echo "Combined BPE Vocab"
! tail -n 10 vocab.txt

BPE Luganda Sentences
Eng@@ abo enn@@ ene ey’@@ okukkiriza ( Laba akat@@ undu 12 - 14 )
En@@ k@@ of@@ i@@ ira ey’@@ obul@@ ok@@ ozi ( Laba akat@@ undu 15 - 18 )
N@@ kir@@ abye nti abantu bak@@ wat@@ ibwako nnyo bwe bak@@ iraba nti oy@@ agala nnyo Bayibuli era nti ok@@ ola kyonna ekis@@ oboka oku@@ bayamba . ”
E@@ kit@@ ala eky’@@ omwoyo ( Laba akat@@ undu 19 - 20 )
Yakuwa asobola okutuyamba okul@@ wanyisa omul@@ abe oyo ne tu@@ mu@@ w@@ angula !
Combined BPE Vocab
(@@
Ó@@
taayo
\
meet@@
uld
Prover@@
”@@
ö
ŋ
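In the output above, `@@` marks a subword piece that continues into the next token. Restoring the original words is a matter of joining those pieces again. The helper below is a minimal sketch of that step using a regular expression; the name `undo_bpe` is an assumption, and the example sentence is taken from the `train.bpe.en` output shown earlier.

```python
import re


def undo_bpe(line: str) -> str:
    """Merge BPE pieces back into words by removing the '@@ ' continuation marker."""
    # '@@ ' joins a piece to the following one; a trailing '@@' is also stripped.
    return re.sub(r"@@( |$)", "", line)


segmented = "What can you do to show these de@@ a@@ f brothers personal attention ?"
restored = undo_bpe(segmented)
print(restored)
```

The same reversal has to be applied to model output before it is compared with untokenized references.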


## Kinyarwanda  

### Turning data from JW300 to dataframe

**Do not rerun**: load the saved pandas dataframe (`Kinyarwanda.csv`) instead.

In [None]:
# Changing to Kinyarwanda directory
os.chdir("/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda")

In [None]:
# Downloading our corpus
! opus_read -d JW300 -s $src -t $tgt2 -wm moses -w jw300.$src jw300.$tgt2 -q


Alignment file /proj/nlpl/data/OPUS/JW300/latest/xml/en-rw.xml.gz not found. The following files are available for downloading:

   5 MB https://object.pouta.csc.fi/OPUS-JW300/v1b/xml/en-rw.xml.gz
 263 MB https://object.pouta.csc.fi/OPUS-JW300/v1b/xml/en.zip
  48 MB https://object.pouta.csc.fi/OPUS-JW300/v1b/xml/rw.zip

 316 MB Total size
./JW300_latest_xml_en-rw.xml.gz ... 100% of 5 MB
./JW300_latest_xml_en.zip ... 100% of 263 MB
./JW300_latest_xml_rw.zip ... 100% of 48 MB


In [None]:
# extract the corpus file
! gunzip JW300_latest_xml_$src-$tgt2.xml.gz

In [None]:
# Download the global test set; its sentences are filtered out of the train/dev data.
! wget https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-any.en

# Specific test set
! wget https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-$tgt2.en 
! mv test.en-$tgt2.en test.en
! wget https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-$tgt2.$tgt2 
! mv test.en-$tgt2.$tgt2 test.$tgt2

--2021-05-14 10:39:30--  https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-any.en
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 277791 (271K) [text/plain]
Saving to: ‘test.en-any.en’


2021-05-14 10:39:30 (10.6 MB/s) - ‘test.en-any.en’ saved [277791/277791]

--2021-05-14 10:39:30--  https://raw.githubusercontent.com/juliakreutzer/masakhane/master/jw300_utils/test/test.en-rw.en
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 202456 (198K) [text/plain]
Saving to: ‘test.en-rw.en’


2021-05-14

In [None]:
# Moses-format text files (written by opus_read) to dataframe
source_file = 'jw300.' + source_language
target_file = 'jw300.' + target_language2

source = []
target = []
skip_lines = set()  # Line numbers skipped on the source side, so the same lines can be skipped on the target side.
with open(source_file) as f:
    for i, line in enumerate(f):
        # Skip sentences that are contained in the test set.
        if line.strip() not in en_test_sents:
            source.append(line.strip())
        else:
            skip_lines.add(i)
with open(target_file) as f:
    for j, line in enumerate(f):
        # Only add to corpus if corresponding source was not skipped.
        if j not in skip_lines:
            target.append(line.strip())
    
print('Loaded data and skipped {}/{} lines since contained in test set.'.format(len(skip_lines), i))
    
df2 = pd.DataFrame(zip(source, target), columns=['source_sentence', 'target_sentence'])
df2.head(3)

Loaded data and skipped 5825/483984 lines since contained in test set.


Unnamed: 0,source_sentence,target_sentence
0,The Deaf Praise Jehovah,Ibipfamatwi Bisingiza Yehova
1,BY AWAKE !,BY AWAKE !
2,CORRESPONDENT IN NIGERIA,CORRESPONDENT IN NIGERIA


In [None]:
# Kinyarwanda training set
df2.to_csv('Kinyarwanda.csv',index=False) 

### Data preprocessing

In [None]:
rwa = pd.read_csv("Kinyarwanda.csv")
rwa.head(3)

Unnamed: 0,source_sentence,target_sentence
0,The Deaf Praise Jehovah,Ibipfamatwi Bisingiza Yehova
1,BY AWAKE !,BY AWAKE !
2,CORRESPONDENT IN NIGERIA,CORRESPONDENT IN NIGERIA


In [None]:
# drop duplicate translations
df_pp = rwa.drop_duplicates()

# drop conflicting translations
df_pp.drop_duplicates(subset='source_sentence', inplace=True)
df_pp.drop_duplicates(subset='target_sentence', inplace=True)

# Shuffle the data to remove bias in dev set selection.
df_pp = df_pp.sample(frac=1, random_state=42).reset_index(drop=True)

In [None]:
# reset the index of the training set after previous filtering
df_pp.reset_index(drop=False, inplace=True)

In [None]:
df_pp.dropna(inplace=True)

In [None]:
df_pp.isna().sum()

source_sentence    0
target_sentence    0
dtype: int64

In [None]:
# Splitting train and validation set
num_valid = 1000

dev = df_pp.tail(num_valid) 
stripped = df_pp.drop(df_pp.tail(num_valid).index)

# Creating files for Kinyarwanda and English
with open("train."+source_language, "w") as src_file, open("train."+target_language2, "w") as trg_file:
  for index, row in stripped.iterrows():
    try:
      src_file.write(row["source_sentence"]+"\n")
      trg_file.write(row["target_sentence"]+"\n")
    except TypeError:
      print(index,row["target_sentence"])
    
with open("dev."+source_language, "w") as src_file, open("dev."+target_language2, "w") as trg_file:
  for index, row in dev.iterrows():
    src_file.write(row["source_sentence"]+"\n")
    trg_file.write(row["target_sentence"]+"\n")

In [None]:
! head train.*
! head dev.*

==> train.bpe.en <==
R@@ ight after his bapt@@ ism , he “ went off into Ar@@ ab@@ ia ” ​ — e@@ ither the S@@ y@@ ri@@ an D@@ es@@ ert or pos@@ sib@@ ly some qu@@ i@@ et place on the Ar@@ ab@@ ian P@@ en@@ ins@@ ul@@ a that was conduc@@ ive to med@@ it@@ ation .
You will see the time when God br@@ ings righteous rule to all the earth , und@@ o@@ ing the d@@ am@@ age and inj@@ ust@@ ice brought by human rul@@ er@@ ship .
Let us consider f@@ ive reas@@ ons why we should want to follow the Christ .
Even in the Bible , the id@@ ea of pers@@ u@@ as@@ ion som@@ et@@ imes has n@@ eg@@ ative con@@ no@@ t@@ ations , den@@ ot@@ ing a cor@@ rup@@ ting or a lead@@ ing as@@ tr@@ ay .
For God’s servants to be deliv@@ ered , Satan and his ent@@ ire world@@ wide system of things need to be rem@@ ov@@ ed .
I had never heard that name used in my ch@@ urch .
S@@ imp@@ ly having authority or a wid@@ er name recogn@@ ition is not the important thing .
M@@ ost people do not believe in the spir@@ its .
And ot

In [None]:
#! git clone https://github.com/joeynmt/joeynmt.git
! cd joeynmt; pip3 install .

Processing /content/gdrive/Shareddrives/NMT_for_African_Language/Kinyarwanda/joeynmt
Collecting numpy==1.20.1
  Downloading https://files.pythonhosted.org/packages/70/8a/064b4077e3d793f877e3b77aa64f56fa49a4d37236a53f78ee28be009a16/numpy-1.20.1-cp37-cp37m-manylinux2010_x86_64.whl (15.3MB)
Collecting torchtext==0.9.0
  Downloading https://files.pythonhosted.org/packages/36/50/84184d6230686e230c464f0dd4ff32eada2756b4a0b9cefec68b88d1d580/torchtext-0.9.0-cp37-cp37m-manylinux1_x86_64.whl (7.1MB)
Collecting sacrebleu>=1.3.6
  Downloading https://files.pythonhosted.org/packages/7e/57/0c7ca4e31a126189dab99c19951910bd081dea5bbd25f24b77107750eae7/sacrebleu-1.5.1-py3-none-any.whl (54kB)
Collecting subword-nmt
  Downloading https://files.pythonhosted.org/packages/74/60/6600a7bc09e7ab38bc53a48a20d8

In [None]:
# Learn a joint BPE model and per-language vocabularies from the training data.
! subword-nmt learn-joint-bpe-and-vocab --input train.$src train.$tgt2 -s 4000 -o bpe.codes.4000 --write-vocabulary vocab.$src vocab.$tgt2

# Apply BPE splits to the training, development and test data.
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < train.$src > train.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt2 < train.$tgt2 > train.bpe.$tgt2

! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < dev.$src > dev.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt2 < dev.$tgt2 > dev.bpe.$tgt2
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < test.$src > test.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt2 < test.$tgt2 > test.bpe.$tgt2

# Build the joint vocabulary file with JoeyNMT's build_vocab script
! sudo chmod 777 joeynmt/scripts/build_vocab.py
! joeynmt/scripts/build_vocab.py train.bpe.$src train.bpe.$tgt2 --output_path vocab.txt

In [None]:
# Some output
! echo "BPE Kinyarwanda Sentences"
! tail -n 5 test.bpe.$tgt2
! echo "Combined BPE Vocab"
! tail -n 10 vocab.txt

BPE Kinyarwanda Sentences
I@@ ng@@ abo n@@ ini yo kwizera ( Reba p@@ aragar@@ af@@ u ya 12 - 14 )
I@@ ng@@ of@@ ero y’@@ agak@@ iza ( Reba p@@ aragar@@ af@@ u ya 15 - 18 )
N@@ abonye ko iyo abantu babona ko ukunda gukoresha Bibiliya kandi ug@@ akora ib@@ ish@@ oboka byose kugira ngo ub@@ afashe , bak@@ ira neza ubutumwa . ”
I@@ nk@@ ota y’@@ umwuka ( Reba p@@ aragar@@ af@@ u ya 19 - 20 )
Ariko Yehova adufasha k@@ umur@@ wanya , tuk@@ am@@ ut@@ sinda !
Combined BPE Vocab
Ê@@
̆
ahamu
ʺ
⁄
ointed
Ă@@
̄@@
ḥ
Ā@@


## Luhyia

### Data preprocessing

In [None]:
# Changing to Luhyia directory
os.chdir("/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia")

In [None]:
luh = pd.read_csv("Luhya.csv")
luh.tail(3)

Unnamed: 0,0,1
7949,Ne omundu yesi naba narusiakhwo likhuwa liosi...,and if anyone takes away from the words of th...
7950,Ulia ourusinjia obuloli khumakhuwa kano koosi...,"He who testifies to these things says, “Surel..."
7951,Obukoosia obwa Omwami Yesu bube khubandu ba N...,The grace of our Lord Jesus Christ be with yo...


In [None]:
# Tokenizing the data
import nltk
nltk.download('punkt')
from nltk import sent_tokenize, word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
luh['target_sentence'] = luh['0'].apply(lambda x: ' '.join(word_tokenize(x)))
luh['source_sentence'] = luh['1'].apply(lambda x: ' '.join(word_tokenize(x)))
luh = luh.drop(['0', '1'], axis = 1)
luh.tail(2)

Unnamed: 0,target_sentence,source_sentence
7950,Ulia ourusinjia obuloli khumakhuwa kano koosi ...,"He who testifies to these things says , “ Sure..."
7951,Obukoosia obwa Omwami Yesu bube khubandu ba Ny...,The grace of our Lord Jesus Christ be with you...


In [None]:
#luh.rename(columns = {'0' : 'target_sentence', '1' : 'source_sentence'}, inplace = True)

In [None]:
# drop duplicate translations
df_pp = luh.drop_duplicates()

# drop conflicting translations
df_pp.drop_duplicates(subset='source_sentence', inplace=True)
df_pp.drop_duplicates(subset='target_sentence', inplace=True)

# Shuffle the data to remove bias in dev set selection.
df_pp = df_pp.sample(frac=1, random_state=42).reset_index(drop=True)

In [None]:
# reset the index of the training set after previous filtering
df_pp.reset_index(drop=False, inplace=True)

In [None]:
# Splitting train, validation and test sets
num_valid = 1000

dev = df_pp.tail(num_valid) 
stripped = df_pp.drop(df_pp.tail(num_valid).index)
test = stripped.tail(num_valid)
stripped2 = stripped.drop(stripped.tail(num_valid).index)

# Creating files for Luhyia and English
with open("train."+source_language, "w") as src_file, open("train."+target_language3, "w") as trg_file:
  for index, row in stripped2.iterrows():
    try:
      src_file.write(row["source_sentence"]+"\n")
      trg_file.write(row["target_sentence"]+"\n")
    except TypeError:
      print(index,row["target_sentence"])

# Dev   
with open("dev."+source_language, "w") as src_file, open("dev."+target_language3, "w") as trg_file:
  for index, row in dev.iterrows():
    src_file.write(row["source_sentence"]+"\n")
    trg_file.write(row["target_sentence"]+"\n")

# Test
with open("test."+source_language, "w") as src_file, open("test."+target_language3, "w") as trg_file:
  for index, row in test.iterrows():
    src_file.write(row["source_sentence"]+"\n")
    trg_file.write(row["target_sentence"]+"\n")
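For Luhyia, the cell above carves both the dev and the test set out of the shuffled data: the last `num_valid` rows become dev, the `num_valid` rows before them become test, and the remainder is used for training. The sketch below shows the same slicing on a plain list; `split_tail` is a hypothetical helper and the integers stand in for already shuffled sentence pairs.

```python
from typing import List, Sequence, Tuple


def split_tail(rows: Sequence, num_valid: int) -> Tuple[List, List, List]:
    """Return (train, dev, test): dev is the last block, test the block before it."""
    if len(rows) <= 2 * num_valid:
        raise ValueError("not enough rows for dev and test splits")
    dev = list(rows[-num_valid:])
    test = list(rows[-2 * num_valid:-num_valid])
    train = list(rows[:-2 * num_valid])
    return train, dev, test


rows = list(range(10))  # stand-in for already shuffled sentence pairs
train, dev, test = split_tail(rows, num_valid=2)
print(train, dev, test)
```

Assuming a default zero-based index, the last index shown above (7951) implies at most 7,952 rows before deduplication, so with `num_valid = 1000` the dev and test sets together take at least a quarter of the data.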

In [None]:
! head train.*
! head dev.*
! head test.*

==> train.bpe.en <==
 T@@ hat day was the P@@ re@@ par@@ ation, and the Sab@@ b@@ ath drew ne@@ ar@@ .
 Behold, I am coming qui@@ ck@@ l@@ y@@ ! H@@ old fast what you ha@@ ve, that no one may take your crow@@ n.
 The next day, because he wan@@ ted to know for certain why he was acc@@ us@@ ed by the Jews, he r@@ ele@@ as@@ ed him from his bond@@ s, and commanded the chief priests and all their coun@@ ci@@ l to appear@@ , and brought Paul down and set him before them. 

 This He said, sig@@ ni@@ f@@ ying by what death He would di@@ e.
 Then they said to the wom@@ an, “@@ Now we believ@@ e, not because of what you said, for we our@@ selves have heard Him and we know that this is indeed the Christ, the Sa@@ vi@@ or of the worl@@ d.”
 But rej@@ ect prof@@ ane and old wi@@ ves@@ ’ f@@ ab@@ les, and ex@@ er@@ c@@ ise your@@ self toward god@@ lin@@ es@@ s.
 It is written in the prophe@@ ts, ‘@@ And they shall all be taught by God@@ .’ Therefore everyone who has heard and lear@@ ned from the Fa

In [None]:
#! git clone https://github.com/joeynmt/joeynmt.git
! cd joeynmt; pip3 install .

Processing /content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/joeynmt
Collecting numpy==1.20.1
  Downloading https://files.pythonhosted.org/packages/70/8a/064b4077e3d793f877e3b77aa64f56fa49a4d37236a53f78ee28be009a16/numpy-1.20.1-cp37-cp37m-manylinux2010_x86_64.whl (15.3MB)
Collecting torchtext==0.9.0
  Downloading https://files.pythonhosted.org/packages/36/50/84184d6230686e230c464f0dd4ff32eada2756b4a0b9cefec68b88d1d580/torchtext-0.9.0-cp37-cp37m-manylinux1_x86_64.whl (7.1MB)
Collecting sacrebleu>=1.3.6
  Downloading https://files.pythonhosted.org/packages/7e/57/0c7ca4e31a126189dab99c19951910bd081dea5bbd25f24b77107750eae7/sacrebleu-1.5.1-py3-none-any.whl (54kB)
Collecting subword-nmt
  Downloading https://files.pythonhosted.org/packages/74/60/6600a7bc09e7ab38bc53a48a20d8cae4

In [None]:
# Learn a joint BPE model and per-language vocabularies from the training data.
! subword-nmt learn-joint-bpe-and-vocab --input train.$src train.$tgt3 -s 4000 -o bpe.codes.4000 --write-vocabulary vocab.$src vocab.$tgt3

# Apply BPE splits to the training, development and test data.
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < train.$src > train.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt3 < train.$tgt3 > train.bpe.$tgt3

! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < dev.$src > dev.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt3 < dev.$tgt3 > dev.bpe.$tgt3
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < test.$src > test.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt3 < test.$tgt3 > test.bpe.$tgt3

# Build the joint vocabulary file with JoeyNMT's build_vocab script
! sudo chmod 777 joeynmt/scripts/build_vocab.py
! joeynmt/scripts/build_vocab.py train.bpe.$src train.bpe.$tgt3 --output_path vocab.txt

In [None]:
# Some output
! echo "BPE Luhya Sentences"
! tail -n 5 test.bpe.$tgt3
! echo "Combined BPE Vocab"
! tail -n 10 vocab.txt

BPE Luhya Sentences
N@@ asi , ni@@ reeba endi , ‘ N@@ iwe wina , Omwami ? ’ Omw@@ oyo okwo , nik@@ umb@@ ool@@ ela kuri , ‘ N@@ isie Yesu owa Nazaret@@ i ow@@ os@@ a@@ and@@ injia . ’
shichila , omukh@@ a@@ anawe omut@@ elwa , ow@@ emiyika ekhumi na@@ chi@@ bili yali n@@ any@@ ir@@ anga . Ne olwa yali na@@ tsitsanga , abandu , bam@@ wi@@ bu@@ mb@@ akhwo okhurula mu@@ tsimb@@ eka tsiosi .
Ne olwa kab@@ isibwa mbu khu@@ khoyile okhu@@ tsi@@ ila , mum@@ eeli okhuula I@@ tal@@ ia , ba@@ haana Paulo nende abab@@ ohe , bandi khumu@@ s@@ injilili w@@ elihe J@@ ul@@ i@@ asi owe@@ ing'@@ anda eya , eshi@@ r@@ oma ey@@ il@@ angwa mbu , “ I@@ ng'@@ anda ey@@ il@@ ind@@ anga , Omuruchi . ”
Ol@@ uny@@ um@@ akhwo , abakuuka befwe , abab@@ ukula li@@ he@@ ema elo okhurula khub@@ as@@ abwe , bali@@ chinga , okhuula mutsinyanga tsia Yos@@ h@@ wa nibab@@ ukula eshialo , eshia amahanga aka Nyasaye yal@@ ondanga nik@@ arula imbeli , wabwe . Ne li@@ am@@ eny@@ ayo okhuula mutsinyanga tsia , omuruchi Daudi 

# Modeling

## Luganda

In [None]:
# Changing to Luganda directory
os.chdir("/content/gdrive/Shared drives/NMT_for_African_Language/Luganda")

In [None]:
#! git clone https://github.com/joeynmt/joeynmt.git
! cd joeynmt; pip3 install .

Processing /content/gdrive/Shareddrives/NMT_for_African_Language/Luganda/joeynmt
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Building wheels for collected packages: joeynmt
  Building wheel for joeynmt (setup.py) ... done
  Created wheel for joeynmt: filename=joeynmt-1.3-py3-none-any.whl size=85058 sha256=ee244f622f96330fbe7605e3786a0bfdcad815b0a4712d40a0e1bfc6e32b8332
  Stored in directory: /tmp/pip-ephem-wheel-cache-i96seoa9/wheels/b8/3e/ec/4da3b842b3679715f7cd3b4065c087c62dd0fcb0ab5f55b80c
Successfully built joeynmt
Installing collected packages: joeynmt
  Attempting uninstall: joeynmt
    Found existing installation

In [None]:
name = '%s%s' % (target_language1, source_language)

# Create the config
config = """
name: "{target_language1}{source_language}_reverse_transformer"

data:
    src: "{target_language1}"
    trg: "{source_language}"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/train.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/vocab.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    #load_model: "joeynmt/models/{name}_transformer/1.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam scheduling
    patience: 5                     # For plateau: decrease learning rate by decrease_factor if validation score has not improved for this many validation rounds.
    learning_rate_factor: 0.5       # factor for Noam scheduler (used with Transformer)
    learning_rate_warmup: 1000      # warmup steps for Noam scheduler (used with Transformer)
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 4096
    batch_type: "token"
    eval_batch_size: 3600
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 30                  # TODO: Decrease while experimenting. Around 30 is sufficient to check whether training works at all.
    validation_freq: 3000         # TODO: Set to at least once per epoch.
    logging_freq: 100
    eval_metric: "bleu"
    model_dir: "models/{name}_reverse_transformer"
    overwrite: True              # TODO: Set to True if you want to overwrite possibly existing models. 
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_last_ckpts: 3

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4             # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256   # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4              # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256    # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
""".format(name=name, gdrive_path="/content/gdrive/Shared drives/NMT_for_African_Language/Luganda", source_language=source_language, target_language1=target_language1)
with open("joeynmt/configs/transformer_reverse_{name}.yaml".format(name=name),'w') as f:
    f.write(config)
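
The `scheduling: "plateau"` settings in the config above (`patience: 5`, `decrease_factor: 0.7`, `learning_rate_min`) can be illustrated with a small simulation. This is a simplified sketch, not JoeyNMT's own code: `simulate_plateau` is a hypothetical helper, it ignores any improvement threshold, and it assumes the learning rate is cut once the number of consecutive non-improving validations exceeds `patience` (the counting used by PyTorch's `ReduceLROnPlateau`).

```python
def simulate_plateau(scores, lr=0.0003, patience=5, decrease_factor=0.7,
                     lr_min=1e-8, mode="min"):
    """Return the learning rate in effect after each validation round.

    Simplified assumption: the rate is multiplied by decrease_factor once
    more than `patience` consecutive validations fail to improve.
    """
    best = None
    bad_rounds = 0
    history = []
    for score in scores:
        if mode == "min":
            improved = best is None or score < best
        else:
            improved = best is None or score > best
        if improved:
            best = score
            bad_rounds = 0
        else:
            bad_rounds += 1
        if bad_rounds > patience:
            lr = max(lr * decrease_factor, lr_min)
            bad_rounds = 0
        history.append(lr)
    return history


# Hypothetical perplexities: two improvements followed by a long plateau.
ppl = [10.0, 9.0] + [9.5] * 7
rates = simulate_plateau(ppl)
print(rates)
```

With `early_stopping_metric: "ppl"` the monitored score is perplexity, so lower is better and `mode="min"` applies.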

In [None]:
# Train the model
!cd joeynmt; python3 -m joeynmt train configs/transformer_reverse_$tgt1$src.yaml

2021-07-26 09:05:19,702 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-26 09:05:19,730 - INFO - joeynmt.data - Loading training data...
2021-07-26 09:05:24,612 - INFO - joeynmt.data - Building vocabulary...
2021-07-26 09:05:24,917 - INFO - joeynmt.data - Loading dev data...
2021-07-26 09:05:24,946 - INFO - joeynmt.data - Loading test data...
2021-07-26 09:05:24,998 - INFO - joeynmt.data - Data loaded.
2021-07-26 09:05:24,998 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-26 09:05:25,255 - INFO - joeynmt.model - Enc-dec model built.
2021-07-26 09:05:25.431993: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-26 09:05:26,860 - INFO - joeynmt.training - Total params: 12152064
2021-07-26 09:05:29,079 - INFO - joeynmt.helpers - cfg.name                           : lgen_reverse_transformer
2021-07-26 09:05:29,079 - INFO - joeynmt.helpers - cfg.data.src                       : l

In [None]:
# Output our validation scores (loss, PPL, BLEU)
! cat "joeynmt/models/lgen_reverse_transformer/validations.txt"

Steps: 3000	Loss: 88693.94531	PPL: 28.24070	bleu: 4.64948	LR: 0.00030000	*
Steps: 6000	Loss: 74349.93750	PPL: 16.45254	bleu: 9.56529	LR: 0.00030000	*
Steps: 9000	Loss: 67064.07031	PPL: 12.50400	bleu: 13.31115	LR: 0.00030000	*
Steps: 12000	Loss: 62883.14062	PPL: 10.68210	bleu: 16.21796	LR: 0.00030000	*
Steps: 15000	Loss: 59963.24609	PPL: 9.56957	bleu: 17.65447	LR: 0.00030000	*
Steps: 18000	Loss: 57535.53125	PPL: 8.73332	bleu: 18.88552	LR: 0.00030000	*
Steps: 21000	Loss: 55978.57422	PPL: 8.23588	bleu: 19.64789	LR: 0.00030000	*
Steps: 24000	Loss: 54671.21484	PPL: 7.84014	bleu: 20.36290	LR: 0.00030000	*
Steps: 27000	Loss: 53323.76562	PPL: 7.45216	bleu: 21.11758	LR: 0.00030000	*
Steps: 30000	Loss: 52226.29688	PPL: 7.15039	bleu: 22.07949	LR: 0.00030000	*
Steps: 33000	Loss: 51357.41016	PPL: 6.92016	bleu: 22.60658	LR: 0.00030000	*
Steps: 36000	Loss: 51089.74609	PPL: 6.85074	bleu: 22.43566	LR: 0.00030000	*
Steps: 39000	Loss: 50230.04688	PPL: 6.63246	bleu: 22.83738	LR: 0.00030000	*
Steps: 42000	
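
To pick a checkpoint from a log like the one above, the lines of `validations.txt` can be parsed into records. The following is a minimal sketch that assumes the tab-separated layout shown in the output; `parse_validations` is a hypothetical helper, and incomplete lines are skipped rather than guessed at.

```python
import re

# Matches one complete validation line as printed in validations.txt.
LINE_RE = re.compile(
    r"Steps:\s*(\d+)\s+Loss:\s*([\d.]+)\s+PPL:\s*([\d.]+)"
    r"\s+bleu:\s*([\d.]+)\s+LR:\s*([\d.]+)"
)


def parse_validations(text):
    """Parse validation lines into dictionaries, skipping incomplete lines."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue
        step, loss, ppl, bleu, lr = match.groups()
        rows.append({
            "step": int(step),
            "loss": float(loss),
            "ppl": float(ppl),
            "bleu": float(bleu),
            "lr": float(lr),
        })
    return rows


# A few lines copied from the log above, including one incomplete line.
sample = "\n".join([
    "Steps: 33000\tLoss: 51357.41016\tPPL: 6.92016\tbleu: 22.60658\tLR: 0.00030000\t*",
    "Steps: 36000\tLoss: 51089.74609\tPPL: 6.85074\tbleu: 22.43566\tLR: 0.00030000\t*",
    "Steps: 39000\tLoss: 50230.04688\tPPL: 6.63246\tbleu: 22.83738\tLR: 0.00030000\t*",
    "Steps: 42000\t",
])
rows = parse_validations(sample)
best_bleu = max(rows, key=lambda row: row["bleu"])
best_ppl = min(rows, key=lambda row: row["ppl"])
print(best_bleu["step"], best_ppl["step"])
```

Note that BLEU dips at step 36000 while perplexity still improves, so the two metrics do not always agree on the best checkpoint.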

In [None]:
# Reloading configuration file
ckpt_number = 72000
reload_config = config.replace(
    f'#load_model: "joeynmt/models/lgen_transformer/1.ckpt"', 
    f'load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/joeynmt/models/{name}_reverse_transformer/{ckpt_number}.ckpt"').replace(
        f'model_dir: "models/lgen_reverse_transformer"', f'model_dir: "models/lgen_reverse_transformer2"').replace(
            f'epochs: 30', f'epochs: 4')
with open("transformer_{name}_reload.yaml".format(name=name),'w') as f:
    f.write(reload_config)
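
Chained `str.replace` calls like the ones above do nothing, without any error, when the search string is not present in `config`, so a typo in a path or directory name can go unnoticed. A minimal sketch of a stricter variant follows; `replace_exactly_once` is a hypothetical helper and the `demo_config` string is a shortened stand-in for the real config.

```python
def replace_exactly_once(text, old, new):
    """Replace `old` with `new`, failing loudly unless `old` occurs exactly once."""
    count = text.count(old)
    if count != 1:
        raise ValueError(
            f"expected exactly one occurrence of {old!r}, found {count}"
        )
    return text.replace(old, new)


# Shortened stand-in for the training config (illustrative only).
demo_config = (
    "training:\n"
    "    epochs: 30\n"
    '    model_dir: "models/lgen_reverse_transformer"\n'
)

patched = replace_exactly_once(demo_config, "epochs: 30", "epochs: 4")
patched = replace_exactly_once(
    patched,
    'model_dir: "models/lgen_reverse_transformer"',
    'model_dir: "models/lgen_reverse_transformer2"',
)
print(patched)
```

Including the closing quote in the `model_dir` search string keeps the match from also hitting longer directory names that share the same prefix.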

In [None]:
!cat "transformer_lgen_reload.yaml"


name: "lgen_reverse_transformer"

data:
    src: "lg"
    trg: "en"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/train.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/vocab.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/joeynmt/models/lgen_reverse_transformer/72000.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam sched

In [None]:
!python -m joeynmt train transformer_lgen_reload.yaml

2021-07-27 08:55:21,463 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-27 08:55:21,504 - INFO - joeynmt.data - Loading training data...
2021-07-27 08:55:26,948 - INFO - joeynmt.data - Building vocabulary...
2021-07-27 08:55:27,661 - INFO - joeynmt.data - Loading dev data...
2021-07-27 08:55:28,719 - INFO - joeynmt.data - Loading test data...
2021-07-27 08:55:29,812 - INFO - joeynmt.data - Data loaded.
2021-07-27 08:55:29,812 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-27 08:55:30,032 - INFO - joeynmt.model - Enc-dec model built.
2021-07-27 08:55:30.206406: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-27 08:55:31,952 - INFO - joeynmt.training - Total params: 12152064
2021-07-27 08:55:35,437 - INFO - joeynmt.training - Loading model from /content/gdrive/Shared drives/NMT_for_African_Language/Luganda/joeynmt/models/lgen_reverse_transformer/72000.ckpt
2021-07-27 08:55:

During testing, the model achieves a dev set BLEU score of 26.42 and a test set BLEU score of 35.85. The test score is higher than the dev score, which suggests that the model is not overfitting and generalizes well to the test dataset.

In [None]:
# Reloading configuration file
ckpt_number = 190000
#model_path = '/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/joeynmt/models/{name}_reverse_transformer2'
reload_config = config.replace(
    f'#load_model: "joeynmt/models/lgen_transformer/1.ckpt"', 
    f'load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/models/{name}_reverse_transformer2_continued/{ckpt_number}.ckpt"').replace(
        f'model_dir: "models/lgen_reverse_transformer"', f'model_dir: "models/lgen_reverse_transformer2_continued2"').replace(
        f'epochs: 30', f'epochs: 5')
with open("joeynmt/configs/transformer_{name}_reload2.yaml".format(name=name),'w') as f:
    f.write(reload_config)

In [None]:
!cat "joeynmt/configs/transformer_lgen_reload2.yaml"


name: "lgen_reverse_transformer"

data:
    src: "lg"
    trg: "en"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/train.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/vocab.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/models/lgen_reverse_transformer2_continued/190000.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam s

In [None]:
# Train continued
!cd joeynmt; python3 -m joeynmt train configs/transformer_lgen_reload2.yaml

2021-07-27 09:39:36,423 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-27 09:39:36,464 - INFO - joeynmt.data - Loading training data...
2021-07-27 09:39:40,778 - INFO - joeynmt.data - Building vocabulary...
2021-07-27 09:39:41,063 - INFO - joeynmt.data - Loading dev data...
2021-07-27 09:39:41,093 - INFO - joeynmt.data - Loading test data...
2021-07-27 09:39:41,132 - INFO - joeynmt.data - Data loaded.
2021-07-27 09:39:41,132 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-27 09:39:41,352 - INFO - joeynmt.model - Enc-dec model built.
2021-07-27 09:39:41.520384: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-27 09:39:42,682 - INFO - joeynmt.training - Total params: 12152064
2021-07-27 09:39:46,130 - INFO - joeynmt.training - Loading model from /content/gdrive/Shared drives/NMT_for_African_Language/Luganda/joeynmt/models/lgen_reverse_transformer2/78000.ckpt
2021-07-27 09:39

### Reverse model (English to Luganda)

In [None]:
#@title
name = '%s%s' % (source_language, target_language1)

# Create the config
config = """
name: "{source_language}{target_language1}_transformer"

data:
    src: "{source_language}"
    trg: "{target_language1}"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/train.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/vocab.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    #load_model: "joeynmt/models/{name}_transformer/1.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam scheduling
    patience: 5                     # For plateau: decrease learning rate by decrease_factor if validation score has not improved for this many validation rounds.
    learning_rate_factor: 0.5       # factor for Noam scheduler (used with Transformer)
    learning_rate_warmup: 1000      # warmup steps for Noam scheduler (used with Transformer)
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 4096
    batch_type: "token"
    eval_batch_size: 3600
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 30                  # TODO: Decrease while experimenting; around 30 epochs is sufficient to check whether training works at all.
    validation_freq: 3000         # TODO: Set to at least once per epoch.
    logging_freq: 200
    eval_metric: "bleu"
    model_dir: "models/{name}_transformer"
    overwrite: False
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_last_ckpts: 3

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4             # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256   # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4              # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256    # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
""".format(name=name, gdrive_path="/content/gdrive/Shared drives/NMT_for_African_Language/Luganda", source_language=source_language, target_language1=target_language1)
with open("joeynmt/configs/transformer_{name}.yaml".format(name=name),'w') as f:
    f.write(config)
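
The config keeps `scheduling: "plateau"`, but its TODO suggests trying Noam scheduling, which is what `learning_rate_factor` and `learning_rate_warmup` are for. As a sketch of what that would mean, the function below implements the standard Noam schedule from the Transformer paper scaled by a factor; it is assumed, not verified here, that JoeyNMT's Noam scheduler follows the same formula, and `noam_learning_rate` is a hypothetical name.

```python
def noam_learning_rate(step, hidden_size=256, factor=0.5, warmup=1000):
    """Noam schedule: linear warmup, then decay proportional to step ** -0.5."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return factor * hidden_size ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


# Learning rate at a few training steps for the values used in the config.
for step in (1, 500, 1000, 2000, 30000):
    print(step, noam_learning_rate(step))
```

The rate rises linearly until `step == warmup`, where it peaks, and then decays, so the fixed `learning_rate: 0.0003` would no longer be the value in effect.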

In [None]:
# Train the model
!cd joeynmt; python3 -m joeynmt train configs/transformer_$src$tgt1.yaml

Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/content/gdrive/Shareddrives/NMT_for_African_Language/Luganda/joeynmt/joeynmt/__main__.py", line 48, in <module>
    main()
  File "/content/gdrive/Shareddrives/NMT_for_African_Language/Luganda/joeynmt/joeynmt/__main__.py", line 35, in main
    train(cfg_file=args.config_path, skip_test=args.skip_test)
  File "/content/gdrive/Shareddrives/NMT_for_African_Language/Luganda/joeynmt/joeynmt/training.py", line 767, in train
    "overwrite", False))
  File "/content/gdrive/Shareddrives/NMT_for_African_Language/Luganda/joeynmt/joeynmt/helpers.py", line 43, in make_model_dir
    "Model directory exists and overwriting is disabled.")
FileExistsError: Model directory exists and overwriting is disabled.
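
The traceback above is raised because `models/enlg_transformer` already exists and the config sets `overwrite: False`. One way to avoid both the error and an accidental overwrite is to pick an unused directory name before writing the config. The sketch below uses a hypothetical helper, `next_free_model_dir`, which appends a numeric suffix in the same spirit as the `..._transformer2` directories used elsewhere in this notebook.

```python
from pathlib import Path
import tempfile


def next_free_model_dir(base):
    """Return `base` if it does not exist, else the first unused `base<N>` (N >= 2)."""
    base = Path(base)
    if not base.exists():
        return base
    suffix = 2
    while True:
        candidate = base.with_name(f"{base.name}{suffix}")
        if not candidate.exists():
            return candidate
        suffix += 1


# Demonstration in a temporary directory instead of the real models folder.
with tempfile.TemporaryDirectory() as root:
    existing = Path(root) / "enlg_transformer"
    existing.mkdir()
    free_name = next_free_model_dir(existing).name
    print(free_name)
```

The returned path would then be substituted into `model_dir` before the YAML file is written.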


In [None]:
# Output our validation scores (loss, PPL, BLEU)
! cat "joeynmt/models/enlg_transformer/validations.txt"

Steps: 3000	Loss: 90329.59375	PPL: 34.15279	bleu: 1.50899	LR: 0.00030000	*
Steps: 6000	Loss: 78915.16406	PPL: 21.86029	bleu: 2.03797	LR: 0.00030000	*
Steps: 9000	Loss: 72870.83594	PPL: 17.26029	bleu: 3.00038	LR: 0.00030000	*
Steps: 12000	Loss: 68934.50781	PPL: 14.79876	bleu: 3.60862	LR: 0.00030000	*
Steps: 15000	Loss: 65144.08594	PPL: 12.76085	bleu: 5.62373	LR: 0.00030000	*
Steps: 18000	Loss: 62416.39453	PPL: 11.47029	bleu: 7.69962	LR: 0.00030000	*
Steps: 21000	Loss: 60411.26172	PPL: 10.60561	bleu: 9.47106	LR: 0.00030000	*
Steps: 24000	Loss: 58601.82812	PPL: 9.88141	bleu: 10.27910	LR: 0.00030000	*
Steps: 27000	Loss: 57266.67188	PPL: 9.37893	bleu: 10.88260	LR: 0.00030000	*
Steps: 30000	Loss: 55989.72656	PPL: 8.92228	bleu: 11.46961	LR: 0.00030000	*
Steps: 33000	Loss: 54901.21484	PPL: 8.55062	bleu: 13.01796	LR: 0.00030000	*
Steps: 36000	Loss: 54212.95312	PPL: 8.32365	bleu: 12.76414	LR: 0.00030000	*
Steps: 39000	Loss: 53024.57422	PPL: 7.94584	bleu: 13.52685	LR: 0.00030000	*
Steps: 42000	Lo
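
The `Loss` and `PPL` columns above are related: the logged values are consistent with perplexity being the exponential of the total loss divided by the number of tokens in the dev set. As a quick check under that assumption, dividing the loss by the natural log of the perplexity should give roughly the same token count for every line. `implied_token_count` is a hypothetical helper.

```python
import math


def implied_token_count(total_loss, ppl):
    """Token count implied by PPL = exp(total_loss / n_tokens)."""
    return total_loss / math.log(ppl)


# (step, loss, perplexity) copied from the first two lines of the log above.
rows = [
    (3000, 90329.59375, 34.15279),
    (6000, 78915.16406, 21.86029),
]
counts = [implied_token_count(loss, ppl) for _, loss, ppl in rows]
for (step, _, _), count in zip(rows, counts):
    print(step, round(count, 1))
```

Because the token count is fixed for a given dev set, a lower total loss always means a lower perplexity within one run.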

In [None]:
!cd joeynmt; python -m joeynmt test 'models/enlg_transformer/config.yaml'

2021-07-26 08:58:21,062 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-26 08:58:21,068 - INFO - joeynmt.data - Building vocabulary...
2021-07-26 08:58:21,377 - INFO - joeynmt.data - Loading dev data...
2021-07-26 08:58:21,395 - INFO - joeynmt.data - Loading test data...
2021-07-26 08:58:21,429 - INFO - joeynmt.data - Data loaded.
2021-07-26 08:58:21,451 - INFO - joeynmt.prediction - Process device: cuda, n_gpu: 1, batch_size per device: 18000 (with beam_size)
2021-07-26 08:58:24,119 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-26 08:58:24,375 - INFO - joeynmt.model - Enc-dec model built.
2021-07-26 08:58:24,458 - INFO - joeynmt.prediction - Decoding on dev set (/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/dev.bpe.lg)...
2021-07-26 08:59:11,179 - INFO - joeynmt.prediction -  dev bleu[13a]:  20.37 [Beam search decoding with beam size = 5 and alpha = 1.0]
2021-07-26 08:59:11,179 - INFO - joeynmt.prediction - Decoding on test set (

In [None]:
# Load the TensorBoard notebook extension
#%load_ext tensorboard

In [None]:
#%tensorboard --logdir joeynmt/models/lgen_reverse_transformer2/tensorboard

## Kinyarwanda

In [None]:
# Changing to Kinyarwanda directory
os.chdir("/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda")

In [None]:
#! git clone https://github.com/joeynmt/joeynmt.git
! cd joeynmt; pip3 install .

Processing /content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Collecting numpy==1.20.1
  Downloading numpy-1.20.1-cp37-cp37m-manylinux2010_x86_64.whl (15.3 MB)
Collecting torchtext==0.9.0
  Downloading torchtext-0.9.0-cp37-cp37m-manylinux1_x86_64.whl (7.1 MB)
Collecting sacrebleu>=1.3.6
  Downloading sacrebleu-1.5.1-py3-none-any.whl (54 kB)
Collecting subword-nmt
  Downloading

In [None]:
#@title
name = '%s%s' % (target_language2, source_language)

# Create the config
config = """
name: "{target_language2}{source_language}_reverse_transformer"

data:
    src: "{target_language2}"
    trg: "{source_language}"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/train.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/vocab.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    #load_model: "{gdrive_path}/models/{name}_transformer/1.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam scheduling
    patience: 5                     # For plateau: decrease learning rate by decrease_factor if validation score has not improved for this many validation rounds.
    learning_rate_factor: 0.5       # factor for Noam scheduler (used with Transformer)
    learning_rate_warmup: 1000      # warmup steps for Noam scheduler (used with Transformer)
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 4096
    batch_type: "token"
    eval_batch_size: 3600
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 30                  # TODO: Decrease while experimenting; around 30 epochs is sufficient to check whether training works at all.
    validation_freq: 5000         # TODO: Set to at least once per epoch.
    logging_freq: 100
    eval_metric: "bleu"
    model_dir: "models/{name}_reverse_transformer"
    overwrite: False              # TODO: Set to True if you want to overwrite possibly existing models. 
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_last_ckpts: 3

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4             # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256   # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4              # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256    # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
""".format(name=name, gdrive_path="/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda", source_language=source_language, target_language2=target_language2)
with open("joeynmt/configs/transformer_reverse_{name}.yaml".format(name=name),'w') as f:
    f.write(config)

In [None]:
# Train the model
!cd joeynmt; python3 -m joeynmt train configs/transformer_reverse_$tgt2$src.yaml

2021-05-25 05:50:23,545 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-05-25 05:50:23,615 - INFO - joeynmt.data - Loading training data...
2021-05-25 05:50:33,266 - INFO - joeynmt.data - Building vocabulary...
2021-05-25 05:50:33,870 - INFO - joeynmt.data - Loading dev data...
2021-05-25 05:50:33,909 - INFO - joeynmt.data - Loading test data...
2021-05-25 05:50:34,705 - INFO - joeynmt.data - Data loaded.
2021-05-25 05:50:34,705 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-05-25 05:50:34,915 - INFO - joeynmt.model - Enc-dec model built.
2021-05-25 05:50:35.132232: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-05-25 05:50:38,257 - INFO - joeynmt.training - Total params: 12177664
2021-05-25 05:50:42,369 - INFO - joeynmt.helpers - cfg.name                           : rwen_reverse_transformer
2021-05-25 05:50:42,369 - INFO - joeynmt.helpers - cfg.data.src                       : r

15 epochs done

In [None]:
# Output our validation scores (loss, PPL, BLEU)
! cat "joeynmt/models/rwen_reverse_transformer/validations.txt"

Steps: 5000	Loss: 89031.21875	PPL: 24.66077	bleu: 3.82186	LR: 0.00030000	*
Steps: 10000	Loss: 76205.73438	PPL: 15.54103	bleu: 8.17303	LR: 0.00030000	*
Steps: 15000	Loss: 69015.25000	PPL: 11.99654	bleu: 12.17688	LR: 0.00030000	*
Steps: 20000	Loss: 64636.51953	PPL: 10.24695	bleu: 14.46302	LR: 0.00030000	*
Steps: 25000	Loss: 61545.32031	PPL: 9.16777	bleu: 16.18443	LR: 0.00030000	*
Steps: 30000	Loss: 59109.05078	PPL: 8.39793	bleu: 17.02692	LR: 0.00030000	*
Steps: 35000	Loss: 57419.58984	PPL: 7.90237	bleu: 18.38236	LR: 0.00030000	*
Steps: 40000	Loss: 55984.45312	PPL: 7.50445	bleu: 19.18505	LR: 0.00030000	*
Steps: 45000	Loss: 54509.96094	PPL: 7.11648	bleu: 20.19677	LR: 0.00030000	*
Steps: 50000	Loss: 53734.00781	PPL: 6.92043	bleu: 20.37179	LR: 0.00030000	*
Steps: 55000	Loss: 52948.68359	PPL: 6.72752	bleu: 21.17316	LR: 0.00030000	*
Steps: 60000	Loss: 52193.86328	PPL: 6.54716	bleu: 21.64481	LR: 0.00030000	*
Steps: 65000	Loss: 51547.43750	PPL: 6.39656	bleu: 21.87287	LR: 0.00030000	*
Steps: 7000

In [None]:
#@title
name = '%s%s' % (target_language2, source_language)

# Create the config
config = """
name: "{target_language2}{source_language}_reverse_transformer"

data:
    src: "{target_language2}"
    trg: "{source_language}"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/train.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/{name}_reverse_transformer/src_vocab.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/{name}_reverse_transformer/trg_vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/{name}_reverse_transformer/latest.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam scheduling
    patience: 5                     # For plateau: decrease learning rate by decrease_factor if validation score has not improved for this many validation rounds.
    learning_rate_factor: 0.5       # factor for Noam scheduler (used with Transformer)
    learning_rate_warmup: 1000      # warmup steps for Noam scheduler (used with Transformer)
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 4096
    batch_type: "token"
    eval_batch_size: 3600
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 15                  # TODO: Decrease while experimenting; around 30 epochs is sufficient to check whether training works at all.
    validation_freq: 5000         # TODO: Set to at least once per epoch.
    logging_freq: 100
    eval_metric: "bleu"
    model_dir: "models/{name}_reverse_transformer2"
    overwrite: True              # TODO: Set to True if you want to overwrite possibly existing models. 
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_last_ckpts: 3
    save_latest_ckpt: True

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4             # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256   # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4              # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256    # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
""".format(name=name, gdrive_path="/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda", source_language=source_language, target_language2=target_language2)
with open("joeynmt/configs/transformer_reverse_{name}.yaml".format(name=name),'w') as f:
    f.write(config)

In [None]:
# Train the model
!cd joeynmt; python3 -m joeynmt train configs/transformer_reverse_$tgt2$src.yaml

2021-05-25 09:51:25,131 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-05-25 09:51:25,170 - INFO - joeynmt.data - Loading training data...
2021-05-25 09:51:33,485 - INFO - joeynmt.data - Building vocabulary...
2021-05-25 09:51:33,761 - INFO - joeynmt.data - Loading dev data...
2021-05-25 09:51:33,802 - INFO - joeynmt.data - Loading test data...
2021-05-25 09:51:33,859 - INFO - joeynmt.data - Data loaded.
2021-05-25 09:51:33,859 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-05-25 09:51:34,083 - INFO - joeynmt.model - Enc-dec model built.
2021-05-25 09:51:34.213080: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-05-25 09:51:36,791 - INFO - joeynmt.training - Total params: 12177664
2021-05-25 09:51:40,245 - INFO - joeynmt.training - Loading model from /content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/rwen_reverse_transformer/latest.ckpt
2021-05-25 0

6 epochs done

In [None]:
# Output our validation scores (loss, PPL, BLEU)
! cat "joeynmt/models/rwen_reverse_transformer2/validations.txt"

Steps: 95000	Loss: 49048.79688	PPL: 5.84628	bleu: 23.41406	LR: 0.00030000	*
Steps: 100000	Loss: 48546.58203	PPL: 5.74153	bleu: 24.03219	LR: 0.00030000	*
Steps: 105000	Loss: 48555.19531	PPL: 5.74331	bleu: 24.07039	LR: 0.00030000	
Steps: 110000	Loss: 48127.53125	PPL: 5.65556	bleu: 24.11453	LR: 0.00030000	*
Steps: 115000	Loss: 47717.62891	PPL: 5.57272	bleu: 24.40899	LR: 0.00030000	*
Steps: 120000	Loss: 47591.91797	PPL: 5.54755	bleu: 24.17666	LR: 0.00030000	*


In [None]:
!python3 joeynmt/scripts/plot_validations.py joeynmt/models/rwen_reverse_transformer2 --plot_values bleu PPL  --output_path joeynmt/models/rwen_reverse_transformer2/bleu-ppl.png

In [None]:
!python3 joeynmt/scripts/plot_validations.py joeynmt/models/rwen_reverse_transformer --plot_values bleu PPL  --output_path joeynmt/models/rwen_reverse_transformer2/bleu-ppl1.png

![bleu](https://drive.google.com/uc?id=1-1QTxbqngZ1G1fPf1yK9BxeAJOyFdAmk) ![bleu2](https://drive.google.com/uc?id=1twVqeK43f2DJyZwhAkIJsR__rSenZxc8)

https://drive.google.com/file/d/1-1QTxbqngZ1G1fPf1yK9BxeAJOyFdAmk/view?usp=sharing

https://drive.google.com/file/d/1twVqeK43f2DJyZwhAkIJsR__rSenZxc8/view?usp=sharing

In [None]:
!cd joeynmt; python -m joeynmt test 'models/rwen_reverse_transformer2/config.yaml'

2021-07-01 08:15:14,699 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-01 08:15:14,703 - INFO - joeynmt.data - Building vocabulary...
2021-07-01 08:15:15,018 - INFO - joeynmt.data - Loading dev data...
2021-07-01 08:15:16,056 - INFO - joeynmt.data - Loading test data...
2021-07-01 08:15:17,306 - INFO - joeynmt.data - Data loaded.
2021-07-01 08:15:17,331 - INFO - joeynmt.prediction - Process device: cuda, n_gpu: 1, batch_size per device: 18000 (with beam_size)
2021-07-01 08:15:20,078 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-01 08:15:20,345 - INFO - joeynmt.model - Enc-dec model built.
2021-07-01 08:15:20,418 - INFO - joeynmt.prediction - Decoding on dev set (/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/dev.bpe.en)...
2021-07-01 08:16:15,305 - INFO - joeynmt.prediction -  dev bleu[13a]:  24.76 [Beam search decoding with beam size = 5 and alpha = 1.0]
2021-07-01 08:16:15,307 - INFO - joeynmt.prediction - Decoding on test s
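
To collect the scores from test logs like the one above into a table, the BLEU values can be pulled out with a regular expression. This is a minimal sketch that assumes the log line layout shown in the output; `extract_bleu` is a hypothetical helper, and the sample line is copied from the log above.

```python
import re

# Matches e.g. "-  dev bleu[13a]:  24.76" in a JoeyNMT test log line.
BLEU_RE = re.compile(r"-\s+(dev|test)\s+bleu\[[^\]]*\]:\s+([\d.]+)")


def extract_bleu(log_text):
    """Return a dict mapping 'dev' / 'test' to the BLEU score found in the log."""
    return {split: float(score) for split, score in BLEU_RE.findall(log_text)}


log = (
    "2021-07-01 08:16:15,305 - INFO - joeynmt.prediction -  dev bleu[13a]:  24.76 "
    "[Beam search decoding with beam size = 5 and alpha = 1.0]\n"
)
scores = extract_bleu(log)
print(scores)
```

The resulting dictionaries, one per model, could be passed to `pd.DataFrame` and written out with `to_csv`.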

In [None]:
# Reloading configuration file
ckpt_number = 120000
#model_path = '/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/joeynmt/models/{name}_reverse_transformer2'
reload_config = config.replace(
    f'load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/rwen_reverse_transformer/latest.ckpt"', 
    f'load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/{name}_reverse_transformer2/{ckpt_number}.ckpt"').replace(
        f'model_dir: "models/rwen_reverse_transformer2"', f'model_dir: "models/rwen_reverse_transformer2_continued"').replace(
            f'epochs: 15', f'epochs: 9').replace(f'validation_freq: 5000', f'validation_freq: 6000')
with open("joeynmt/configs/transformer_reverse_{name}_reload.yaml".format(name=name),'w') as f:
    f.write(reload_config)

In [None]:
!cat "joeynmt/configs/transformer_reverse_{name}_reload.yaml"


name: "rwen_reverse_transformer"

data:
    src: "rw"
    trg: "en"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/train.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/rwen_reverse_transformer/src_vocab.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/rwen_reverse_transformer/trg_vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/rwen_reverse_transformer2/120000.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"


In [None]:
# Train the model
!cd joeynmt; python3 -m joeynmt train configs/transformer_reverse_rwen_reload.yaml

2021-07-20 07:32:55,694 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-20 07:32:55,776 - INFO - joeynmt.data - Loading training data...
2021-07-20 07:33:06,892 - INFO - joeynmt.data - Building vocabulary...
2021-07-20 07:33:07,180 - INFO - joeynmt.data - Loading dev data...
2021-07-20 07:33:08,666 - INFO - joeynmt.data - Loading test data...
2021-07-20 07:33:10,034 - INFO - joeynmt.data - Data loaded.
2021-07-20 07:33:10,035 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-20 07:33:10,404 - INFO - joeynmt.model - Enc-dec model built.
2021-07-20 07:33:11.933915: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-20 07:33:13,923 - INFO - joeynmt.training - Total params: 12177664
2021-07-20 07:33:23,796 - INFO - joeynmt.training - Loading model from /content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/rwen_reverse_transformer2/120000.ckpt
2021-07-20 

9 epochs done

In [None]:
# Build a config that resumes training from a saved checkpoint
ckpt_number = 162000
models_root = "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models"
reload_config = (
    config
    # Load the chosen checkpoint instead of latest.ckpt
    .replace(f'load_model: "{models_root}/rwen_reverse_transformer/latest.ckpt"',
             f'load_model: "{models_root}/{name}_reverse_transformer2_continued/{ckpt_number}.ckpt"')
    # Write the resumed run to its own model directory
    .replace('model_dir: "models/rwen_reverse_transformer2"',
             'model_dir: "models/rwen_reverse_transformer2_continued2"')
    # Adjust the epoch count and validation frequency for the resumed run
    .replace('epochs: 15', 'epochs: 30')
    .replace('validation_freq: 5000', 'validation_freq: 6000')
)
with open(f"joeynmt/configs/transformer_reverse_{name}_reload2.yaml", "w") as f:
    f.write(reload_config)

In [None]:
!cat "joeynmt/configs/transformer_reverse_{name}_reload2.yaml"


name: "rwen_reverse_transformer"

data:
    src: "rw"
    trg: "en"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/train.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/rwen_reverse_transformer/src_vocab.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/rwen_reverse_transformer/trg_vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/rwen_reverse_transformer2_continued/162000.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization:

In [None]:
# Train the model
!cd joeynmt; python3 -m joeynmt train configs/transformer_reverse_rwen_reload2.yaml

2021-07-25 08:32:52,891 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-25 08:32:53,245 - INFO - joeynmt.data - Loading training data...
2021-07-25 08:33:03,944 - INFO - joeynmt.data - Building vocabulary...
2021-07-25 08:33:05,112 - INFO - joeynmt.data - Loading dev data...
2021-07-25 08:33:06,512 - INFO - joeynmt.data - Loading test data...
2021-07-25 08:33:07,706 - INFO - joeynmt.data - Data loaded.
2021-07-25 08:33:07,706 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-25 08:33:07,921 - INFO - joeynmt.model - Enc-dec model built.
2021-07-25 08:33:09.398157: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-25 08:33:10,562 - INFO - joeynmt.training - Total params: 12177664
2021-07-25 08:33:14,035 - INFO - joeynmt.training - Loading model from /content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/rwen_reverse_transformer2_continued/162000.ckpt
2

24 epochs done

In [None]:
# Build a config that resumes training from a saved checkpoint
ckpt_number = 288000
models_root = "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models"
reload_config = (
    config
    # Load the chosen checkpoint instead of latest.ckpt
    .replace(f'load_model: "{models_root}/rwen_reverse_transformer/latest.ckpt"',
             f'load_model: "{models_root}/{name}_reverse_transformer2_continued2/{ckpt_number}.ckpt"')
    # Write the resumed run to its own model directory
    .replace('model_dir: "models/rwen_reverse_transformer2"',
             'model_dir: "models/rwen_reverse_transformer2_continued3"')
    # Adjust the epoch count and validation frequency for the resumed run
    .replace('epochs: 15', 'epochs: 7')
    .replace('validation_freq: 5000', 'validation_freq: 6000')
)
with open(f"joeynmt/configs/transformer_reverse_{name}_reload3.yaml", "w") as f:
    f.write(reload_config)

In [None]:
!cat "joeynmt/configs/transformer_reverse_{name}_reload3.yaml"


name: "rwen_reverse_transformer"

data:
    src: "rw"
    trg: "en"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/train.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/rwen_reverse_transformer/src_vocab.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/rwen_reverse_transformer/trg_vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/rwen_reverse_transformer2_continued2/288000.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization

In [None]:
# Train the model
!cd joeynmt; python3 -m joeynmt train configs/transformer_reverse_rwen_reload3.yaml

2021-07-25 14:15:19,185 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-25 14:15:19,262 - INFO - joeynmt.data - Loading training data...
2021-07-25 14:15:29,907 - INFO - joeynmt.data - Building vocabulary...
2021-07-25 14:15:30,227 - INFO - joeynmt.data - Loading dev data...
2021-07-25 14:15:31,274 - INFO - joeynmt.data - Loading test data...
2021-07-25 14:15:32,322 - INFO - joeynmt.data - Data loaded.
2021-07-25 14:15:32,322 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-25 14:15:32,743 - INFO - joeynmt.model - Enc-dec model built.
2021-07-25 14:15:34.376676: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-25 14:15:36,502 - INFO - joeynmt.training - Total params: 12177664
2021-07-25 14:15:45,191 - INFO - joeynmt.training - Loading model from /content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/rwen_reverse_transformer2_continued2/288000.ckpt


In [None]:
# Build a config that resumes training from a saved checkpoint
ckpt_number = 300000
models_root = "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models"
reload_config = (
    config
    # Load the chosen checkpoint instead of latest.ckpt
    .replace(f'load_model: "{models_root}/rwen_reverse_transformer/latest.ckpt"',
             f'load_model: "{models_root}/{name}_reverse_transformer2_continued3/{ckpt_number}.ckpt"')
    # Write the resumed run to its own model directory
    .replace('model_dir: "models/rwen_reverse_transformer2"',
             'model_dir: "models/rwen_reverse_transformer2_continued4"')
    # Adjust the epoch count and validation frequency for the resumed run
    .replace('epochs: 15', 'epochs: 4')
    .replace('validation_freq: 5000', 'validation_freq: 6000')
)
with open(f"joeynmt/configs/transformer_reverse_{name}_reload4.yaml", "w") as f:
    f.write(reload_config)

In [None]:
!cat "joeynmt/configs/transformer_reverse_{name}_reload4.yaml"


name: "rwen_reverse_transformer"

data:
    src: "rw"
    trg: "en"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/train.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/rwen_reverse_transformer/src_vocab.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/rwen_reverse_transformer/trg_vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/rwen_reverse_transformer2_continued3/300000.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization

In [None]:
# Train the model
!cd joeynmt; python3 -m joeynmt train configs/transformer_reverse_rwen_reload4.yaml

2021-07-26 07:16:13,457 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-26 07:16:13,536 - INFO - joeynmt.data - Loading training data...
2021-07-26 07:16:25,881 - INFO - joeynmt.data - Building vocabulary...
2021-07-26 07:16:26,215 - INFO - joeynmt.data - Loading dev data...
2021-07-26 07:16:27,613 - INFO - joeynmt.data - Loading test data...
2021-07-26 07:16:28,897 - INFO - joeynmt.data - Data loaded.
2021-07-26 07:16:28,898 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-26 07:16:29,335 - INFO - joeynmt.model - Enc-dec model built.
2021-07-26 07:16:31.106715: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-26 07:16:33,444 - INFO - joeynmt.training - Total params: 12177664
2021-07-26 07:16:42,179 - INFO - joeynmt.training - Loading model from /content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/rwen_reverse_transformer2_continued3/300000.ckpt
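
The reload cells above repeat one pattern: swap the `load_model` line, point `model_dir` at a new directory, and change `epochs`. A minimal sketch of a helper for that pattern follows. `make_reload_config` and its parameters are hypothetical names and not part of JoeyNMT. It only wraps `str.replace` and raises an error when a pattern is missing, since `str.replace` on its own would silently leave the config unchanged.

```python
def make_reload_config(config, old_load_line, ckpt_path,
                       old_model_dir, new_model_dir,
                       epochs=None, old_epochs=15):
    """Return a copy of a YAML config string that resumes from ckpt_path.

    Hypothetical helper: it only performs checked string replacements.
    """
    replacements = [
        (old_load_line, f'load_model: "{ckpt_path}"'),
        (f'model_dir: "{old_model_dir}"', f'model_dir: "{new_model_dir}"'),
    ]
    if epochs is not None:
        replacements.append((f"epochs: {old_epochs}", f"epochs: {epochs}"))

    new_config = config
    for old, new in replacements:
        # str.replace does nothing on a miss, so fail loudly instead
        if old not in new_config:
            raise ValueError(f"pattern not found in config: {old!r}")
        new_config = new_config.replace(old, new)
    return new_config


# Toy config for illustration only (not one of the real configs)
demo_config = (
    'training:\n'
    '    #load_model: "models/demo/1.ckpt"\n'
    '    epochs: 15\n'
    '    model_dir: "models/demo"\n'
)
demo_reload = make_reload_config(
    demo_config,
    old_load_line='#load_model: "models/demo/1.ckpt"',
    ckpt_path="models/demo/8000.ckpt",
    old_model_dir="models/demo",
    new_model_dir="models/demo_continued",
    epochs=9,
)
print(demo_reload)
```

Printing the result before writing it to disk serves the same purpose as the `!cat` cells: it confirms that the checkpoint path and model directory were actually replaced.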


### Reverse model (English to Kinyarwanda)

In [None]:
#@title
name = '%s%s' % (source_language, target_language2)

# Create the config
config = """
name: "{source_language}{target_language2}_transformer"

data:
    src: "{source_language}"
    trg: "{target_language2}"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/train.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/vocab.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    #load_model: "joeynmt/models/{name}_transformer/1.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam scheduling
    patience: 5                     # For plateau: decrease learning rate by decrease_factor if validation score has not improved for this many validation rounds.
    learning_rate_factor: 0.5       # factor for Noam scheduler (used with Transformer)
    learning_rate_warmup: 1000      # warmup steps for Noam scheduler (used with Transformer)
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 4096
    batch_type: "token"
    eval_batch_size: 3600
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 30                  # TODO: Decrease while experimenting to check that things work. Around 30 is sufficient to check whether it is working at all
    validation_freq: 3000         # TODO: Set to at least once per epoch.
    logging_freq: 200
    eval_metric: "bleu"
    model_dir: "models/{name}_transformer"
    overwrite: False
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_last_ckpts: 3

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4             # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256   # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4              # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256    # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
""".format(name=name, gdrive_path="/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda", source_language=source_language, target_language2=target_language2)
with open("joeynmt/configs/transformer_{name}.yaml".format(name=name),'w') as f:
    f.write(config)
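
The `scheduling: "plateau"`, `patience`, `decrease_factor` and `learning_rate_min` settings in the config describe a reduce-on-plateau rule for the learning rate. The following is a simplified pure-Python sketch of that rule, not JoeyNMT's actual implementation. The function name `plateau_schedule` is hypothetical. The sketch assumes the monitored score is perplexity (lower is better) and that the rate is reduced once the number of non-improving validation rounds exceeds `patience`, which is how PyTorch's `ReduceLROnPlateau` counts.

```python
def plateau_schedule(scores, lr=0.0003, patience=5,
                     decrease_factor=0.7, lr_min=1e-8):
    """Simulate a reduce-on-plateau learning-rate rule.

    scores: one validation score per validation round (lower is better).
    Returns the learning rate in effect after each round.
    """
    best = None
    bad_rounds = 0
    history = []
    for score in scores:
        if best is None or score < best:
            # New best score: reset the counter
            best = score
            bad_rounds = 0
        else:
            bad_rounds += 1
            if bad_rounds > patience:
                # Too many rounds without improvement: shrink the rate
                lr = max(lr * decrease_factor, lr_min)
                bad_rounds = 0
        history.append(lr)
    return history


# Made-up perplexities: best at round 2, then six rounds without improvement
demo_ppl = [10.0, 9.0, 9.5, 9.4, 9.3, 9.2, 9.1, 9.05, 8.0]
for round_no, rate in enumerate(plateau_schedule(demo_ppl), start=1):
    print(round_no, rate)
```

With `patience: 5` and a validation every few thousand steps, the rate stays at `0.0003` through long stretches of training, which is consistent with the constant `LR` column in the validation logs in this notebook.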

In [None]:
# Train the model
!cd joeynmt; python3 -m joeynmt train configs/transformer_$src$tgt2.yaml

2021-07-09 06:48:41,357 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-09 06:48:41,614 - INFO - joeynmt.data - Loading training data...
2021-07-09 06:48:52,577 - INFO - joeynmt.data - Building vocabulary...
2021-07-09 06:48:53,179 - INFO - joeynmt.data - Loading dev data...
2021-07-09 06:48:53,893 - INFO - joeynmt.data - Loading test data...
2021-07-09 06:48:54,816 - INFO - joeynmt.data - Data loaded.
2021-07-09 06:48:54,816 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-09 06:48:55,021 - INFO - joeynmt.model - Enc-dec model built.
2021-07-09 06:48:55.266319: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-09 06:48:56,880 - INFO - joeynmt.training - Total params: 12177664
2021-07-09 06:49:00,418 - INFO - joeynmt.helpers - cfg.name                           : enrw_transformer
2021-07-09 06:49:00,418 - INFO - joeynmt.helpers - cfg.data.src                       : en
2021-0

In [None]:
# Output the validation scores (loss, perplexity, BLEU) logged during training
! cat "joeynmt/models/enrw_transformer/validations.txt"

Steps: 3000	Loss: 100428.37500	PPL: 34.94775	bleu: 1.69885	LR: 0.00030000	*
Steps: 6000	Loss: 85745.39062	PPL: 20.78576	bleu: 4.09416	LR: 0.00030000	*
Steps: 9000	Loss: 78221.25000	PPL: 15.92694	bleu: 6.55603	LR: 0.00030000	*
Steps: 12000	Loss: 72449.96875	PPL: 12.98485	bleu: 9.16044	LR: 0.00030000	*
Steps: 15000	Loss: 68738.13281	PPL: 11.38655	bleu: 10.37678	LR: 0.00030000	*
Steps: 18000	Loss: 65747.74219	PPL: 10.24318	bleu: 11.60779	LR: 0.00030000	*
Steps: 21000	Loss: 63069.15625	PPL: 9.31686	bleu: 12.74133	LR: 0.00030000	*
Steps: 24000	Loss: 61083.14844	PPL: 8.68456	bleu: 13.74739	LR: 0.00030000	*
Steps: 27000	Loss: 59487.35156	PPL: 8.20773	bleu: 14.55375	LR: 0.00030000	*
Steps: 30000	Loss: 58098.73828	PPL: 7.81416	bleu: 15.19644	LR: 0.00030000	*
Steps: 33000	Loss: 57091.47266	PPL: 7.54054	bleu: 15.63053	LR: 0.00030000	*
Steps: 36000	Loss: 56009.30078	PPL: 7.25723	bleu: 16.45190	LR: 0.00030000	*
Steps: 39000	Loss: 55082.56641	PPL: 7.02310	bleu: 16.97624	LR: 0.00030000	*
Steps: 42000
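
Each line of `validations.txt` holds tab-separated `Key: value` fields, with a trailing `*` on some rounds. A minimal sketch for turning those lines into records follows, for example to find the step with the highest BLEU. `parse_validations` is a hypothetical helper, and the sample lines are copied from the output above. The `is_best` field just records whether the `*` is present. Incomplete lines, such as a truncated final line, are skipped.

```python
def parse_validations(lines):
    """Parse validation log lines into a list of dicts."""
    expected = {"Steps", "Loss", "PPL", "bleu", "LR"}
    rows = []
    for line in lines:
        fields = [part.strip() for part in line.split("\t") if part.strip()]
        row = {}
        try:
            for field in fields:
                if ":" in field:
                    key, value = field.split(":", 1)
                    row[key.strip()] = float(value)
        except ValueError:
            continue  # malformed or truncated value
        if not expected.issubset(row):
            continue  # incomplete line
        row["Steps"] = int(row["Steps"])
        row["is_best"] = fields[-1] == "*"
        rows.append(row)
    return rows


sample = [
    "Steps: 3000\tLoss: 100428.37500\tPPL: 34.94775\tbleu: 1.69885\tLR: 0.00030000\t*",
    "Steps: 6000\tLoss: 85745.39062\tPPL: 20.78576\tbleu: 4.09416\tLR: 0.00030000\t*",
    "Steps: 9000\tLoss: 78221.25000\tPPL: 15.92694\tbleu: 6.55603\tLR: 0.00030000\t*",
    "Steps: 42000",  # truncated line, skipped by the parser
]
rows = parse_validations(sample)
best = max(rows, key=lambda r: r["bleu"])
print(best["Steps"], best["bleu"])
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)` to get a dataframe that can be exported to CSV.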

In [None]:
!cd joeynmt; python -m joeynmt test 'models/enrw_transformer/config.yaml'

2021-07-10 09:37:50,168 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-10 09:37:50,173 - INFO - joeynmt.data - Building vocabulary...
2021-07-10 09:37:50,998 - INFO - joeynmt.data - Loading dev data...
2021-07-10 09:37:52,280 - INFO - joeynmt.data - Loading test data...
2021-07-10 09:37:53,768 - INFO - joeynmt.data - Data loaded.
2021-07-10 09:37:53,830 - INFO - joeynmt.prediction - Process device: cuda, n_gpu: 1, batch_size per device: 18000 (with beam_size)
2021-07-10 09:38:00,004 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-10 09:38:00,357 - INFO - joeynmt.model - Enc-dec model built.
2021-07-10 09:38:00,428 - INFO - joeynmt.prediction - Decoding on dev set (/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/dev.bpe.rw)...
2021-07-10 09:38:23,760 - INFO - joeynmt.prediction -  dev bleu[13a]:  23.34 [Beam search decoding with beam size = 5 and alpha = 1.0]
2021-07-10 09:38:23,761 - INFO - joeynmt.prediction - Decoding on test s

## Luhyia

In [None]:
# Changing to Luhyia directory
os.chdir("/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia")

In [None]:
#! git clone https://github.com/joeynmt/joeynmt.git
! cd joeynmt; pip3 install .

Processing /content/gdrive/Shareddrives/NMT_for_African_Language/Luhyia/joeynmt
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Collecting numpy==1.20.1
  Downloading numpy-1.20.1-cp37-cp37m-manylinux2010_x86_64.whl (15.3 MB)
Collecting torchtext==0.9.0
  Downloading torchtext-0.9.0-cp37-cp37m-manylinux1_x86_64.whl (7.1 MB)
Collecting sacrebleu>=1.3.6
  Downloading sacrebleu-1.5.1-py3-none-any.whl (54 kB)
Collecting subword-nmt
  Downloading subw

In [None]:
#@title
name = '%s%s' % (target_language3, source_language)

# Create the config
config = """
name: "{target_language3}{source_language}_reverse_transformer"

data:
    src: "{target_language3}"
    trg: "{source_language}"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/train.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    #load_model: "{gdrive_path}/models/{name}_transformer/1.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam scheduling
    patience: 5                     # For plateau: decrease learning rate by decrease_factor if validation score has not improved for this many validation rounds.
    learning_rate_factor: 0.5       # factor for Noam scheduler (used with Transformer)
    learning_rate_warmup: 1000      # warmup steps for Noam scheduler (used with Transformer)
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 1096
    batch_type: "token"
    eval_batch_size: 1600
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 30                  # TODO: Decrease while experimenting to check that things work. Around 30 is sufficient to check whether it is working at all
    validation_freq: 200         # TODO: Set to at least once per epoch.
    logging_freq: 100
    eval_metric: "bleu"
    model_dir: "models/{name}_reverse_transformer"
    overwrite: False              # TODO: Set to True if you want to overwrite possibly existing models. 
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_last_ckpts: 3

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4             # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256   # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4              # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256    # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
""".format(name=name, gdrive_path="/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia", source_language=source_language, target_language3=target_language3)
with open("joeynmt/configs/transformer_reverse_{name}.yaml".format(name=name),'w') as f:
    f.write(config)

In [None]:
# Train the model
!cd joeynmt; python3 -m joeynmt train configs/transformer_reverse_$tgt3$src.yaml

2021-07-01 09:16:23,658 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-01 09:16:23,688 - INFO - joeynmt.data - Loading training data...
2021-07-01 09:16:23,785 - INFO - joeynmt.data - Building vocabulary...
2021-07-01 09:16:24,049 - INFO - joeynmt.data - Loading dev data...
2021-07-01 09:16:24,070 - INFO - joeynmt.data - Loading test data...
2021-07-01 09:16:24,086 - INFO - joeynmt.data - Data loaded.
2021-07-01 09:16:24,087 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-01 09:16:24,361 - INFO - joeynmt.model - Enc-dec model built.
2021-07-01 09:16:24.555521: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-01 09:16:26,408 - INFO - joeynmt.training - Total params: 12097024
2021-07-01 09:16:28,708 - INFO - joeynmt.helpers - cfg.name                           : lhen_reverse_transformer
2021-07-01 09:16:28,708 - INFO - joeynmt.helpers - cfg.data.src                       : l

In [None]:
# Output the validation scores (loss, perplexity, BLEU) logged during training
! cat "joeynmt/models/lhen_reverse_transformer/validations.txt"

Steps: 200	Loss: 165704.70312	PPL: 161.22653	bleu: 0.00716	LR: 0.00030000	*
Steps: 400	Loss: 154830.34375	PPL: 115.49778	bleu: 0.00821	LR: 0.00030000	*
Steps: 600	Loss: 142989.31250	PPL: 80.32177	bleu: 0.13588	LR: 0.00030000	*
Steps: 800	Loss: 137030.95312	PPL: 66.90507	bleu: 0.69859	LR: 0.00030000	*
Steps: 1000	Loss: 133380.62500	PPL: 59.81789	bleu: 0.93283	LR: 0.00030000	*
Steps: 1200	Loss: 129902.87500	PPL: 53.76530	bleu: 1.87698	LR: 0.00030000	*
Steps: 1400	Loss: 127497.91406	PPL: 49.94184	bleu: 1.68085	LR: 0.00030000	*
Steps: 1600	Loss: 124984.14062	PPL: 46.23567	bleu: 2.37874	LR: 0.00030000	*
Steps: 1800	Loss: 123692.53125	PPL: 44.43969	bleu: 2.47989	LR: 0.00030000	*
Steps: 2000	Loss: 121792.11719	PPL: 41.92322	bleu: 2.44680	LR: 0.00030000	*
Steps: 2200	Loss: 119842.67969	PPL: 39.48982	bleu: 2.57330	LR: 0.00030000	*
Steps: 2400	Loss: 117787.98438	PPL: 37.07777	bleu: 2.65778	LR: 0.00030000	*
Steps: 2600	Loss: 116235.19531	PPL: 35.35315	bleu: 3.17998	LR: 0.00030000	*
Steps: 2800	Lo

In [None]:
!cd joeynmt; python -m joeynmt test 'models/lhen_reverse_transformer/config.yaml'

2021-07-01 10:35:11,527 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-01 10:35:11,534 - INFO - joeynmt.data - Building vocabulary...
2021-07-01 10:35:11,800 - INFO - joeynmt.data - Loading dev data...
2021-07-01 10:35:11,818 - INFO - joeynmt.data - Loading test data...
2021-07-01 10:35:11,833 - INFO - joeynmt.data - Data loaded.
2021-07-01 10:35:11,860 - INFO - joeynmt.prediction - Process device: cuda, n_gpu: 1, batch_size per device: 8000 (with beam_size)
2021-07-01 10:35:14,580 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-01 10:35:14,837 - INFO - joeynmt.model - Enc-dec model built.
2021-07-01 10:35:14,914 - INFO - joeynmt.prediction - Decoding on dev set (/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/dev.bpe.en)...
2021-07-01 10:36:16,330 - INFO - joeynmt.prediction -  dev bleu[13a]:   7.58 [Beam search decoding with beam size = 5 and alpha = 1.0]
2021-07-01 10:36:16,331 - INFO - joeynmt.prediction - Decoding on test set (/c

In [None]:
!python3 joeynmt/scripts/plot_validations.py joeynmt/models/lhen_reverse_transformer --plot_values bleu PPL  --output_path joeynmt/models/lhen_reverse_transformer/bleu-ppl.png

![bleu](https://drive.google.com/uc?id=1M0r8iSUYCyasJNO7yClAbvq_0hDencW-)![bleu2](https://drive.google.com/uc?id=15Wbe6_ThVra_wSkdsELv9YuDyX_1axoh)


https://drive.google.com/file/d/1M0r8iSUYCyasJNO7yClAbvq_0hDencW-/view?usp=sharing

https://drive.google.com/file/d/15Wbe6_ThVra_wSkdsELv9YuDyX_1axoh/view?usp=sharing

In [None]:
# Build a config that resumes training from a saved checkpoint
ckpt_number = 8000
luhyia_root = "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia"
reload_config = (
    config
    # Uncomment load_model and point it at the chosen checkpoint
    .replace(f'#load_model: "{luhyia_root}/models/lhen_transformer/1.ckpt"',
             f'load_model: "{luhyia_root}/joeynmt/models/{name}_reverse_transformer/{ckpt_number}.ckpt"')
    # Write the resumed run to its own model directory
    .replace('model_dir: "models/lhen_reverse_transformer"',
             'model_dir: "models/lhen_reverse_transformer_continued"')
)
with open(f"joeynmt/configs/transformer_{name}_reload.yaml", "w") as f:
    f.write(reload_config)

In [None]:
!cat "joeynmt/configs/transformer_lhen_reload.yaml"


name: "lhen_reverse_transformer"

data:
    src: "lh"
    trg: "en"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/train.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/joeynmt/models/lhen_reverse_transformer/8000.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam scheduling
 

In [None]:
# Continue training from the saved checkpoint
!cd joeynmt; python3 -m joeynmt train configs/transformer_lhen_reload.yaml

2021-07-02 07:21:54,842 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-02 07:21:54,907 - INFO - joeynmt.data - Loading training data...
2021-07-02 07:21:55,841 - INFO - joeynmt.data - Building vocabulary...
2021-07-02 07:21:56,514 - INFO - joeynmt.data - Loading dev data...
2021-07-02 07:21:57,243 - INFO - joeynmt.data - Loading test data...
2021-07-02 07:21:58,205 - INFO - joeynmt.data - Data loaded.
2021-07-02 07:21:58,205 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-02 07:21:58,425 - INFO - joeynmt.model - Enc-dec model built.
2021-07-02 07:21:58.674452: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-02 07:22:00,502 - INFO - joeynmt.training - Total params: 12097024
2021-07-02 07:22:03,979 - INFO - joeynmt.training - Loading model from /content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/joeynmt/models/lhen_reverse_transformer/8000.ckpt
2021-07-02 07:22:04

In [None]:
! cat "joeynmt/models/lhen_reverse_transformer_continued/validations.txt"

Steps: 8200	Loss: 98958.71094	PPL: 20.81034	bleu: 7.14155	LR: 0.00030000	
Steps: 8400	Loss: 98222.44531	PPL: 20.34563	bleu: 7.26749	LR: 0.00030000	*
Steps: 8600	Loss: 97980.10156	PPL: 20.19495	bleu: 7.29421	LR: 0.00030000	*
Steps: 8800	Loss: 98038.94531	PPL: 20.23143	bleu: 7.51194	LR: 0.00030000	
Steps: 9000	Loss: 98010.07812	PPL: 20.21353	bleu: 7.45777	LR: 0.00030000	
Steps: 9200	Loss: 97764.75781	PPL: 20.06199	bleu: 7.71418	LR: 0.00030000	*
Steps: 9400	Loss: 97437.90625	PPL: 19.86186	bleu: 7.85631	LR: 0.00030000	*
Steps: 9600	Loss: 97703.65625	PPL: 20.02443	bleu: 7.42886	LR: 0.00030000	
Steps: 9800	Loss: 97414.12500	PPL: 19.84737	bleu: 7.80404	LR: 0.00030000	*
Steps: 10000	Loss: 97413.65625	PPL: 19.84709	bleu: 8.32098	LR: 0.00030000	*
Steps: 10200	Loss: 97024.35938	PPL: 19.61150	bleu: 8.29478	LR: 0.00030000	*
Steps: 10400	Loss: 97337.35156	PPL: 19.80069	bleu: 8.13847	LR: 0.00030000	
Steps: 10600	Loss: 97628.63281	PPL: 19.97840	bleu: 8.56480	LR: 0.00030000	
Steps: 10800	Loss: 97078.13

In [None]:
!cd joeynmt; python -m joeynmt test 'models/lhen_reverse_transformer_continued/config.yaml'

2021-07-02 07:54:01,679 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-02 07:54:01,684 - INFO - joeynmt.data - Building vocabulary...
2021-07-02 07:54:01,917 - INFO - joeynmt.data - Loading dev data...
2021-07-02 07:54:01,930 - INFO - joeynmt.data - Loading test data...
2021-07-02 07:54:01,942 - INFO - joeynmt.data - Data loaded.
2021-07-02 07:54:01,972 - INFO - joeynmt.prediction - Process device: cuda, n_gpu: 1, batch_size per device: 8000 (with beam_size)
2021-07-02 07:54:05,647 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-02 07:54:05,838 - INFO - joeynmt.model - Enc-dec model built.
2021-07-02 07:54:05,903 - INFO - joeynmt.prediction - Decoding on dev set (/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/dev.bpe.en)...
2021-07-02 07:54:32,096 - INFO - joeynmt.prediction -  dev bleu[13a]:  10.47 [Beam search decoding with beam size = 5 and alpha = 1.0]
2021-07-02 07:54:32,096 - INFO - joeynmt.prediction - Decoding on test set (/c
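
The `test` runs report their scores in log lines of the form `dev bleu[13a]:  10.47 [...]`. A minimal sketch for pulling those scores out of captured log text follows, so that results from several models can be collected in one table. `extract_bleu` and the `variant` field name are hypothetical. The sample lines are copied from the output above, and only the dev line is used because the test line is truncated in this notebook.

```python
import re

# Matches e.g. "dev bleu[13a]:  10.47"; the bracketed tag is kept as-is
BLEU_PATTERN = re.compile(r"(dev|test)\s+bleu\[([^\]]+)\]:\s+(\d+(?:\.\d+)?)")


def extract_bleu(log_lines):
    """Return one dict per BLEU line found in the given log lines."""
    results = []
    for line in log_lines:
        match = BLEU_PATTERN.search(line)
        if match:
            split, variant, score = match.groups()
            results.append(
                {"split": split, "variant": variant, "bleu": float(score)}
            )
    return results


sample_log = [
    "2021-07-02 07:54:05,903 - INFO - joeynmt.prediction - Decoding on dev set "
    "(/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/dev.bpe.en)...",
    "2021-07-02 07:54:32,096 - INFO - joeynmt.prediction -  dev bleu[13a]:  10.47 "
    "[Beam search decoding with beam size = 5 and alpha = 1.0]",
]
scores = extract_bleu(sample_log)
print(scores)
```

The pattern looks for `bleu[` directly after the split name, so the "Decoding on dev set" progress lines are ignored.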

### Reverse model (English to Luhyia)

In [None]:
#@title
name = '%s%s' % (source_language, target_language3)

# Create the config
config = """
name: "{source_language}{target_language3}_transformer"

data:
    src: "{source_language}"
    trg: "{target_language3}"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/train.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    #load_model: "joeynmt/models/{name}_transformer/1.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam scheduling
    patience: 5                     # For plateau: decrease learning rate by decrease_factor if validation score has not improved for this many validation rounds.
    learning_rate_factor: 0.5       # factor for Noam scheduler (used with Transformer)
    learning_rate_warmup: 1000      # warmup steps for Noam scheduler (used with Transformer)
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 1096
    batch_type: "token"
    eval_batch_size: 3600
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 30                  # TODO: Decrease while experimenting. Around 30 epochs is enough to check whether training works at all.
    validation_freq: 200         # TODO: Set to at least once per epoch.
    logging_freq: 100
    eval_metric: "bleu"
    model_dir: "models/{name}_transformer"
    overwrite: False
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_last_ckpts: 3

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4             # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256   # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4              # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256    # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
""".format(name=name, gdrive_path="/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia", source_language=source_language, target_language3=target_language3)
with open("joeynmt/configs/transformer_{name}.yaml".format(name=name),'w') as f:
    f.write(config)

In [None]:
# Train the model
!cd joeynmt; python3 -m joeynmt train configs/transformer_$src$tgt3.yaml

2021-07-10 10:01:14,809 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-10 10:01:14,859 - INFO - joeynmt.data - Loading training data...
2021-07-10 10:01:16,638 - INFO - joeynmt.data - Building vocabulary...
2021-07-10 10:01:17,391 - INFO - joeynmt.data - Loading dev data...
2021-07-10 10:01:18,749 - INFO - joeynmt.data - Loading test data...
2021-07-10 10:01:20,134 - INFO - joeynmt.data - Data loaded.
2021-07-10 10:01:20,134 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-10 10:01:20,341 - INFO - joeynmt.model - Enc-dec model built.
2021-07-10 10:01:20.573978: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-10 10:01:22,413 - INFO - joeynmt.training - Total params: 12097024
2021-07-10 10:01:25,778 - INFO - joeynmt.helpers - cfg.name                           : enlh_transformer
2021-07-10 10:01:25,779 - INFO - joeynmt.helpers - cfg.data.src                       : en
2021-0

In [None]:
# Reloading configuration file: point load_model at a saved checkpoint
# and write the continued run to a new model_dir
ckpt_number = 9000
reload_config = config.replace(
    '#load_model: "joeynmt/models/enlh_transformer/1.ckpt"', 
    f'load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/joeynmt/models/{name}_transformer/{ckpt_number}.ckpt"').replace(
        'model_dir: "models/enlh_transformer"', 'model_dir: "models/enlh_transformer_continued"')
with open("joeynmt/configs/transformer_{name}_reload.yaml".format(name=name),'w') as f:
    f.write(reload_config)
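Because `str.replace` silently returns the input unchanged when the search text is missing, a typo in one of the patterns above would leave `load_model` commented out without any error. A minimal sketch of a stricter variant follows. The helper `replace_once` and the toy config fragment are hypothetical and only stand in for the full `config` string.

```python
def replace_once(text, old, new):
    """Replace `old` with `new`, failing loudly unless `old` occurs exactly once."""
    count = text.count(old)
    if count != 1:
        raise ValueError(f"expected exactly one match for {old!r}, found {count}")
    return text.replace(old, new)

# Toy fragment standing in for the real config (paths shortened for illustration).
toy_config = (
    'training:\n'
    '    #load_model: "joeynmt/models/enlh_transformer/1.ckpt"\n'
    '    model_dir: "models/enlh_transformer"\n'
)
toy_reload = replace_once(
    toy_config,
    '#load_model: "joeynmt/models/enlh_transformer/1.ckpt"',
    'load_model: "models/enlh_transformer/9000.ckpt"',
)
toy_reload = replace_once(
    toy_reload,
    'model_dir: "models/enlh_transformer"',
    'model_dir: "models/enlh_transformer_continued"',
)
print(toy_reload)
```

The same check would apply to the later reload cells, where several `.replace` calls are chained.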

In [None]:
!cat "joeynmt/configs/transformer_enlh_reload.yaml"


name: "enlh_transformer"

data:
    src: "en"
    trg: "lh"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/train.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/joeynmt/models/enlh_transformer/9000.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam scheduling
    patience: 5  

In [None]:
# Training continued
!cd joeynmt; python3 -m joeynmt train configs/transformer_enlh_reload.yaml

2021-07-10 10:58:03,748 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-10 10:58:03,782 - INFO - joeynmt.data - Loading training data...
2021-07-10 10:58:03,854 - INFO - joeynmt.data - Building vocabulary...
2021-07-10 10:58:04,095 - INFO - joeynmt.data - Loading dev data...
2021-07-10 10:58:04,114 - INFO - joeynmt.data - Loading test data...
2021-07-10 10:58:04,127 - INFO - joeynmt.data - Data loaded.
2021-07-10 10:58:04,127 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-10 10:58:04,331 - INFO - joeynmt.model - Enc-dec model built.
2021-07-10 10:58:04.499348: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-10 10:58:05,951 - INFO - joeynmt.training - Total params: 12097024
2021-07-10 10:58:09,252 - INFO - joeynmt.training - Loading model from /content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/joeynmt/models/enlh_transformer/9000.ckpt
2021-07-10 10:58:09,707 - I

In [None]:
!cd joeynmt; python -m joeynmt test 'models/enlh_transformer_continued/config.yaml'

2021-07-18 09:06:34,004 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-18 09:06:34,802 - INFO - joeynmt.data - Building vocabulary...
2021-07-18 09:06:35,535 - INFO - joeynmt.data - Loading dev data...
2021-07-18 09:06:36,854 - INFO - joeynmt.data - Loading test data...
2021-07-18 09:06:38,258 - INFO - joeynmt.data - Data loaded.
2021-07-18 09:06:38,317 - INFO - joeynmt.prediction - Process device: cuda, n_gpu: 1, batch_size per device: 18000 (with beam_size)
2021-07-18 09:07:31,413 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-18 09:07:31,792 - INFO - joeynmt.model - Enc-dec model built.
2021-07-18 09:07:31,865 - INFO - joeynmt.prediction - Decoding on dev set (/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/dev.bpe.lh)...
2021-07-18 09:08:01,311 - INFO - joeynmt.prediction -  dev bleu[13a]:   6.39 [Beam search decoding with beam size = 5 and alpha = 1.0]
2021-07-18 09:08:01,311 - INFO - joeynmt.prediction - Decoding on test set (/

# Backtranslation

## Data preparation

In [None]:
# Changing to Luganda directory
os.chdir("/content/gdrive/Shared drives/NMT_for_African_Language/Luganda")

In [None]:
# Getting English data from the Luganda dataset
lug = pd.read_csv("Luganda.csv")
mon_en = pd.DataFrame(lug['source_sentence'])

In [None]:
mon_en.reset_index(drop=True,inplace=True)

In [None]:
mon_en

Unnamed: 0,source_sentence
0,This publication is not for sale .
1,COVER SUBJECT
2,The Bible was completed about two thousand yea...
3,"Since then , countless other books have come a..."
4,But not the Bible .
...,...
249490,Among these publishers today are third - gener...
249491,We give thanks to Jehovah and to those early f...
249492,"15 : 15 , 16 . ​ — From our archives in Portug..."
249493,See “ There Is More Harvest Work to Be Done ” ...


In [None]:
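# Function that returns True when a string contains NO digits
# (the name hasNum is kept because the cells below rely on it)
import re

def hasNum(inputString):
  text = str(inputString)  # avoid shadowing the built-in input()
  return not re.findall(r'\d+', text)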
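Note that the helper above returns `True` for strings *without* digits, which is easy to misread. As an illustration only, the same filter can be written with an explicitly named predicate. The name `contains_digit` and the sample list are assumptions made for this sketch.

```python
import re

DIGIT = re.compile(r"\d")

def contains_digit(text):
    """Return True when `text` (coerced to str) contains at least one digit."""
    return DIGIT.search(str(text)) is not None

sentences = [
    "This publication is not for sale .",
    "15 : 15 , 16 .",
    "COVER SUBJECT",
]
# Keep only the digit-free sentences, as the filtering cells below do.
digit_free = [sentence for sentence in sentences if not contains_digit(sentence)]
print(digit_free)
```

Coercing with `str` means a missing value such as `NaN` is treated as digit-free, which matches the behaviour of the notebook's helper.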

In [None]:
# Flag each sentence: has_num is True when NO digit was found
mon_en['has_num'] = mon_en['source_sentence'].apply(hasNum)

In [None]:
mon_en.head(10)

Unnamed: 0,source_sentence,has_num
0,This publication is not for sale .,True
1,COVER SUBJECT,True
2,The Bible was completed about two thousand yea...,True
3,"Since then , countless other books have come a...",True
4,But not the Bible .,True
5,Consider the following .,True
6,The Bible has survived many vicious attacks by...,True
7,"For example , during the Middle Ages in certai...",True
8,Scholars who translated the Bible into the ver...,True
9,"Despite its many enemies , the Bible became ​ ...",True


In [None]:
mon_en.describe()

Unnamed: 0,source_sentence,has_num
count,247121,249495
unique,231428,2
top,*,True
freq,462,203930


In [None]:
# Keep only the sentences without digits (has_num is True when no digit was found)
mon_en = mon_en[mon_en['has_num'] == True].copy()

In [None]:
mon_en.drop(['has_num'], axis=1,inplace = True)

In [None]:
# Clean data
mon_en

Unnamed: 0,source_sentence
0,This publication is not for sale .
1,COVER SUBJECT
2,The Bible was completed about two thousand yea...
3,"Since then , countless other books have come a..."
4,But not the Bible .
...,...
249481,With the printing and distribution of Bible li...
249483,"However , the seeds of truth had been sown ."
249484,Amid the upheaval in Europe during the Spanish...
249486,"After that , the growth in the number of Kingd..."


In [None]:
# Monolingual English
#mon_en.to_csv('mon_en.csv',index=False) 

In [None]:
mon = pd.read_csv("mon_en.csv")
mon.head()

Unnamed: 0,source_sentence
0,This publication is not for sale .
1,COVER SUBJECT
2,The Bible was completed about two thousand yea...
3,"Since then , countless other books have come a..."
4,But not the Bible .


In [None]:
mon.isnull().sum()

source_sentence    0
dtype: int64

In [None]:
mon.dropna(inplace=True)

In [None]:
!pwd

/content/gdrive/Shareddrives/NMT_for_African_Language/Luganda


In [None]:
# Changing to Luhyia directory
os.chdir("/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia")

In [None]:
!pwd

/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia


In [None]:
# Write the monolingual English sentences to mon.en (one per line), then apply the existing BPE codes
with open("mon."+source_language, "w") as src_file:
  for sentence in mon["source_sentence"]:
    src_file.write(sentence+"\n")

! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < mon.$src > mon.bpe.$src
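The translated output is later paired with `mon.en` line by line, so each sentence needs to occupy exactly one line. The sketch below shows one defensive way to write such a file. The helper `write_sentences` and the demo file name are hypothetical and not part of the original pipeline.

```python
import os
import tempfile

def write_sentences(sentences, path):
    """Write one sentence per line, collapsing internal whitespace (including newlines)."""
    written = 0
    with open(path, "w", encoding="utf-8") as handle:
        for sentence in sentences:
            handle.write(" ".join(str(sentence).split()) + "\n")
            written += 1
    return written

demo_path = os.path.join(tempfile.mkdtemp(), "mon_demo.en")
count = write_sentences(["But not the Bible .", "Consider\nthe following ."], demo_path)
with open(demo_path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
print(count, lines)
```

Collapsing whitespace keeps a stray newline inside a sentence from shifting every later line out of alignment.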

In [None]:
! head mon.*

==> mon.bpe.en <==
This p@@ ub@@ li@@ cation is not for s@@ ale .
C@@ O@@ V@@ E@@ R S@@ U@@ B@@ J@@ E@@ C@@ T
The B@@ ible was comple@@ ted about two thousand years a@@ go .
S@@ in@@ ce then , coun@@ t@@ less other bo@@ ok@@ s have come and gone .
But not the B@@ ible .
C@@ on@@ si@@ der the follow@@ ing .
The B@@ ible has sur@@ v@@ ived many vi@@ ci@@ ous at@@ ta@@ c@@ ks by p@@ ow@@ er@@ ful people .
For exam@@ ple , d@@ ur@@ ing the M@@ id@@ d@@ le A@@ g@@ es in certain “ Chris@@ ti@@ an ” l@@ ands , “ the poss@@ ess@@ ion and read@@ ing of the B@@ ible in the ver@@ nac@@ ul@@ ar [ the l@@ ang@@ u@@ age of the comm@@ on people ] was in@@ cre@@ as@@ ingly as@@ so@@ ci@@ ated with her@@ es@@ y and dis@@ sent , ” says the book A@@ n I@@ n@@ t@@ ro@@ du@@ ction to the M@@ e@@ di@@ ev@@ al B@@ ible .
S@@ ch@@ ol@@ ars who trans@@ l@@ ated the B@@ ible into the ver@@ nac@@ ul@@ ar or who prom@@ o@@ ted B@@ ible st@@ ud@@ y ris@@ ked their lives . S@@ ome were killed .
D@@ es@@ p@@ ite its

In [None]:
!tail mon.*

==> mon.bpe.en <==
He sought per@@ mis@@ sion to use his h@@ ome for reg@@ ul@@ ar me@@ et@@ ings .
In ad@@ d@@ ition , through tr@@ ac@@ ts and bo@@ ok@@ le@@ ts , the word of truth sp@@ read to the fa@@ r re@@ ac@@ hes of the P@@ or@@ t@@ u@@ gu@@ es@@ e Em@@ p@@ ire ​ — A@@ ng@@ ol@@ a , the A@@ z@@ or@@ es , C@@ ap@@ e V@@ er@@ de , E@@ ast T@@ im@@ or , G@@ o@@ a , M@@ a@@ de@@ ira , and M@@ o@@ z@@ am@@ bi@@ qu@@ e .
W@@ hi@@ le living in B@@ ra@@ z@@ il , he had heard a p@@ ub@@ li@@ c tal@@ k given by B@@ ro@@ ther Y@@ oung .
He read@@ ily re@@ c@@ og@@ ni@@ zed the r@@ ing of truth and was e@@ ag@@ er to hel@@ p B@@ ro@@ ther F@@ er@@ g@@ us@@ on to ex@@ p@@ and the pre@@ aching work .
To do so , M@@ anu@@ el began to serve as a col@@ p@@ or@@ te@@ u@@ r , as pi@@ on@@ e@@ ers were then called .
W@@ ith the pr@@ in@@ ting and dis@@ tri@@ bu@@ t@@ ion of B@@ ible l@@ it@@ er@@ at@@ ure now well - or@@ g@@ ani@@ zed , the f@@ led@@ gl@@ ing con@@ gre@@ g@@ ation in L@@ is@@ b@@ 

In [None]:
!cd joeynmt; python -m joeynmt translate 'models/enlh_transformer_continued/config.yaml' < "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/mon.bpe.en" > "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/mon.lh"

2021-07-18 09:11:56,551 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-18 09:12:00,358 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-18 09:12:00,564 - INFO - joeynmt.model - Enc-dec model built.


In [None]:
!head mon.en
!head mon.lh

This publication is not for sale .
COVER SUBJECT
The Bible was completed about two thousand years ago .
Since then , countless other books have come and gone .
But not the Bible .
Consider the following .
The Bible has survived many vicious attacks by powerful people .
For example , during the Middle Ages in certain “ Christian ” lands , “ the possession and reading of the Bible in the vernacular [ the language of the common people ] was increasingly associated with heresy and dissent , ” says the book An Introduction to the Medieval Bible .
Scholars who translated the Bible into the vernacular or who promoted Bible study risked their lives . Some were killed .
Despite its many enemies , the Bible became ​ — and continues to be — ​ the most widely distributed book of all time .
Oburume obwomundu shibuliho ta , habula nobwatoto
Olunyuma lwetsinyanga tsitaru , Yorodani nende Siria .
Yali ahambi isaa yashienda yemiyika , chibili .
Abakhalabani bobubeeyi bakhetsukhana , nibakhupa ikha .
Ne

In [None]:
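# English side: parallel training data plus the monolingual English sentences
file1 = ['train.en', 'mon.en']

# Luhyia side: parallel training data plus the machine-translated Luhyia (output of the en-lh model)
file2 = ['train.lh', 'mon.lh']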

In [None]:
# Concatenate the files listed in x into a single file called filename
def create_file(x,filename):
  # Open the output file in write mode
  with open(filename, 'w') as outfile:
      for names in x:
          # Open each input file in read mode
          with open(names) as infile:
              # Copy its contents into the output file
              outfile.write(infile.read())
          # Separate consecutive files with a newline
          outfile.write("\n")

In [None]:
create_file(file1,'back.en')
create_file(file2,'back.lh')
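After concatenation, the two sides must still line up sentence for sentence. Comparing line counts is one quick way to confirm that. This is a minimal sketch: the helpers `count_lines` and `check_parallel` and the demo files are hypothetical, and the check would be pointed at `back.en` and `back.lh` in practice.

```python
import os
import tempfile

def count_lines(path):
    """Count the lines in a text file without loading it fully into memory."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)

def check_parallel(src_path, trg_path):
    """Return the shared line count, or raise if the two sides differ."""
    src_lines = count_lines(src_path)
    trg_lines = count_lines(trg_path)
    if src_lines != trg_lines:
        raise ValueError(f"{src_path} has {src_lines} lines but {trg_path} has {trg_lines}")
    return src_lines

# Demo with two tiny placeholder files.
demo_dir = tempfile.mkdtemp()
demo_src = os.path.join(demo_dir, "demo.en")
demo_trg = os.path.join(demo_dir, "demo.lh")
with open(demo_src, "w", encoding="utf-8") as handle:
    handle.write("first sentence\nsecond sentence\n")
with open(demo_trg, "w", encoding="utf-8") as handle:
    handle.write("placeholder one\nplaceholder two\n")
pair_count = check_parallel(demo_src, demo_trg)
print(pair_count)
```

Equal line counts do not prove that the pairs are correctly aligned, but a mismatch is a reliable sign that something went wrong upstream.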

In [None]:
# Learn a joint BPE model and vocabularies on the concatenated training data.
! subword-nmt learn-joint-bpe-and-vocab --input back.$src back.$tgt3 -s 4000 -o bpe.codes.4000 --write-vocabulary vocab2.$src vocab2.$tgt3

# Apply BPE splits to the training, development and test data.
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab2.$src < back.$src > back.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab2.$tgt3 < back.$tgt3 > back.bpe.$tgt3

! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab2.$src < dev.$src > back_dev.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab2.$tgt3 < dev.$tgt3 > back_dev.bpe.$tgt3
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab2.$src < test.$src > back_test.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab2.$tgt3 < test.$tgt3 > back_test.bpe.$tgt3

# Create the joint vocabulary file using build_vocab
! sudo chmod 777 joeynmt/scripts/build_vocab.py
! joeynmt/scripts/build_vocab.py back.bpe.$src back.bpe.$tgt3 --output_path vocab2.txt
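The `@@ ` markers in the `.bpe` files show where a word was split into subword units. To inspect segmented text as plain words, the markers can be stripped with a regular expression. The sketch below is a Python equivalent of the `sed` one-liner commonly used with subword-nmt, and the helper name `undo_bpe` is hypothetical.

```python
import re

def undo_bpe(line):
    """Join BPE subword units back into whole words by removing '@@ ' markers."""
    return re.sub(r"(@@ )|(@@ ?$)", "", line)

segmented = "This p@@ ub@@ li@@ cation is not for s@@ ale ."
restored = undo_bpe(segmented)
print(restored)
```

Applied to the first line of `mon.bpe.en` shown earlier, this gives back the corresponding line of `mon.en`.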

## Modelling

In [None]:
name = '%s%s' % (target_language3, source_language)

# Create the config
config = """
name: "{target_language3}{source_language}_reverse_transformer"

data:
    src: "{target_language3}"
    trg: "{source_language}"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/back.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/back_dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/back_test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab2.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab2.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    #load_model: "{gdrive_path}/models/{name}_transformer/1.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam scheduling
    patience: 5                     # For plateau: decrease learning rate by decrease_factor if validation score has not improved for this many validation rounds.
    learning_rate_factor: 0.5       # factor for Noam scheduler (used with Transformer)
    learning_rate_warmup: 1000      # warmup steps for Noam scheduler (used with Transformer)
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 4096
    batch_type: "token"
    eval_batch_size: 1600
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 30                  # TODO: Decrease while experimenting. Around 30 epochs is enough to check whether training works at all.
    validation_freq: 5000         # TODO: Set to at least once per epoch.
    logging_freq: 100
    eval_metric: "bleu"
    model_dir: "models/back_{name}_reverse_transformer"
    overwrite: False              # TODO: Set to True if you want to overwrite possibly existing models. 
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_last_ckpts: 3

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4             # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256   # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4              # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256    # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
""".format(name=name, gdrive_path="/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia", source_language=source_language, target_language3=target_language3)
with open("joeynmt/configs/back_transformer_reverse_{name}.yaml".format(name=name),'w') as f:
    f.write(config)

In [None]:
# Train the model
!cd joeynmt; python3 -m joeynmt train configs/back_transformer_reverse_$tgt3$src.yaml

2021-07-18 11:13:02,325 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-18 11:13:02,785 - INFO - joeynmt.data - Loading training data...
2021-07-18 11:13:05,985 - INFO - joeynmt.data - Building vocabulary...
2021-07-18 11:13:06,237 - INFO - joeynmt.data - Loading dev data...
2021-07-18 11:13:06,262 - INFO - joeynmt.data - Loading test data...
2021-07-18 11:13:06,831 - INFO - joeynmt.data - Data loaded.
2021-07-18 11:13:06,831 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-18 11:13:07,034 - INFO - joeynmt.model - Enc-dec model built.
2021-07-18 11:13:07.277581: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-18 11:13:08,753 - INFO - joeynmt.training - Total params: 12138240
2021-07-18 11:13:11,975 - INFO - joeynmt.helpers - cfg.name                           : lhen_reverse_transformer
2021-07-18 11:13:11,975 - INFO - joeynmt.helpers - cfg.data.src                       : l

In [None]:
# Reloading configuration file: resume from checkpoint 25000 and write to a new model_dir
ckpt_number = 25000
reload_config = config.replace(
    f'#load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/models/lhen_transformer/1.ckpt"', 
    f'load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/joeynmt/models/back_{name}_reverse_transformer/{ckpt_number}.ckpt"').replace(
        f'model_dir: "models/back_lhen_reverse_transformer"', f'model_dir: "models/back_lhen_reverse_transformer_continued"').replace(
        f'epochs: 30', f'epochs: 17').replace(f'validation_freq: 5000', f'validation_freq: 2500')
with open("joeynmt/configs/back_transformer_reverse_{name}_reload.yaml".format(name=name),'w') as f:
    f.write(reload_config)
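The checkpoint number is hard-coded in each reload cell. Since the `load_model` paths in this notebook show checkpoints saved as `<step>.ckpt`, the most recent step could instead be looked up from the model directory. The following is only a sketch under that naming assumption: the helper `latest_checkpoint_step` is hypothetical, and the demo directory and its file names (including `best.ckpt`) are placeholders.

```python
import os
import tempfile

def latest_checkpoint_step(model_dir):
    """Return the largest step among files named '<step>.ckpt' in model_dir."""
    steps = []
    for filename in os.listdir(model_dir):
        stem, extension = os.path.splitext(filename)
        # Skip anything whose stem is not purely numeric.
        if extension == ".ckpt" and stem.isdigit():
            steps.append(int(stem))
    if not steps:
        raise FileNotFoundError(f"no numbered .ckpt files found in {model_dir}")
    return max(steps)

# Demo with empty placeholder files in a temporary directory.
demo_model_dir = tempfile.mkdtemp()
for demo_name in ["20000.ckpt", "25000.ckpt", "best.ckpt", "train.log"]:
    open(os.path.join(demo_model_dir, demo_name), "w").close()
ckpt_step = latest_checkpoint_step(demo_model_dir)
print(ckpt_step)
```

Comparing steps as integers matters here, because a plain string sort would rank `9000.ckpt` above `25000.ckpt`.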

In [None]:
!cat "joeynmt/configs/back_transformer_reverse_lhen_reload.yaml"


name: "lhen_reverse_transformer"

data:
    src: "lh"
    trg: "en"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/back.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/back_dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/back_test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab2.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab2.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/joeynmt/models/back_lhen_reverse_transformer/25000.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to 

In [None]:
# Training continued
!cd joeynmt; python3 -m joeynmt train configs/back_transformer_reverse_lhen_reload.yaml

2021-07-18 17:10:05,612 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-18 17:10:05,687 - INFO - joeynmt.data - Loading training data...
2021-07-18 17:10:09,772 - INFO - joeynmt.data - Building vocabulary...
2021-07-18 17:10:10,273 - INFO - joeynmt.data - Loading dev data...
2021-07-18 17:10:10,939 - INFO - joeynmt.data - Loading test data...
2021-07-18 17:10:12,029 - INFO - joeynmt.data - Data loaded.
2021-07-18 17:10:12,029 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-18 17:10:12,411 - INFO - joeynmt.model - Enc-dec model built.
2021-07-18 17:10:12.672249: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-18 17:10:14,453 - INFO - joeynmt.training - Total params: 12138240
2021-07-18 17:10:25,188 - INFO - joeynmt.training - Loading model from /content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/joeynmt/models/back_lhen_reverse_transformer/25000.ckpt
2021-07-18 17

In [None]:
# Reloading configuration file: resume from checkpoint 62500 and write to a new model_dir
ckpt_number = 62500
reload_config = config.replace(
    f'#load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/models/lhen_transformer/1.ckpt"', 
    f'load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/joeynmt/models/back_{name}_reverse_transformer_continued/{ckpt_number}.ckpt"').replace(
        f'model_dir: "models/back_lhen_reverse_transformer"', f'model_dir: "models/back_lhen_reverse_transformer_continued2"').replace(
            f'validation_freq: 5000', f'validation_freq: 2500')
with open("joeynmt/configs/back_transformer_reverse_{name}_reload2.yaml".format(name=name),'w') as f:
    f.write(reload_config)

In [None]:
!cat "joeynmt/configs/back_transformer_reverse_lhen_reload2.yaml"


name: "lhen_reverse_transformer"

data:
    src: "lh"
    trg: "en"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/back.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/back_dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/back_test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab2.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab2.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/joeynmt/models/back_lhen_reverse_transformer_continued/62500.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from p

In [None]:
# Training continued
!cd joeynmt; python3 -m joeynmt train configs/back_transformer_reverse_lhen_reload2.yaml

2021-07-19 06:34:19,360 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-19 06:34:19,477 - INFO - joeynmt.data - Loading training data...
2021-07-19 06:34:25,415 - INFO - joeynmt.data - Building vocabulary...
2021-07-19 06:34:26,608 - INFO - joeynmt.data - Loading dev data...
2021-07-19 06:34:27,539 - INFO - joeynmt.data - Loading test data...
2021-07-19 06:34:29,217 - INFO - joeynmt.data - Data loaded.
2021-07-19 06:34:29,218 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-19 06:34:29,748 - INFO - joeynmt.model - Enc-dec model built.
2021-07-19 06:34:30.003105: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-19 06:34:32,209 - INFO - joeynmt.training - Total params: 12138240
2021-07-19 06:34:40,915 - INFO - joeynmt.training - Loading model from /content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/joeynmt/models/back_lhen_reverse_transformer_continued/62500.ckpt
202

In [None]:
# Reloading configuration file
ckpt_number = 102500
reload_config = config.replace(
    f'#load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/models/lhen_transformer/1.ckpt"', 
    f'load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/joeynmt/models/back_{name}_reverse_transformer_continued2/{ckpt_number}.ckpt"').replace(
        f'model_dir: "models/back_lhen_reverse_transformer"', f'model_dir: "models/back_lhen_reverse_transformer_continued3"').replace(
            f'validation_freq: 5000', f'validation_freq: 2500').replace(
            f'epochs: 30', f'epochs: 11')
with open("joeynmt/configs/back_transformer_reverse_{name}_reload3.yaml".format(name=name),'w') as f:
    f.write(reload_config)

In [None]:
!cat "joeynmt/configs/back_transformer_reverse_lhen_reload3.yaml"


name: "lhen_reverse_transformer"

data:
    src: "lh"
    trg: "en"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/back.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/back_dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/back_test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab2.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab2.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/joeynmt/models/back_lhen_reverse_transformer_continued2/102500.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from

In [None]:
# Training continued
!cd joeynmt; python3 -m joeynmt train configs/back_transformer_reverse_lhen_reload3.yaml

2021-07-27 07:34:56,135 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-27 07:34:56,204 - INFO - joeynmt.data - Loading training data...
2021-07-27 07:35:00,825 - INFO - joeynmt.data - Building vocabulary...
2021-07-27 07:35:01,381 - INFO - joeynmt.data - Loading dev data...
2021-07-27 07:35:02,417 - INFO - joeynmt.data - Loading test data...
2021-07-27 07:35:03,772 - INFO - joeynmt.data - Data loaded.
2021-07-27 07:35:03,772 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-27 07:35:04,161 - INFO - joeynmt.model - Enc-dec model built.
2021-07-27 07:35:04.414349: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-27 07:35:06,069 - INFO - joeynmt.training - Total params: 12138240
2021-07-27 07:35:16,618 - INFO - joeynmt.training - Loading model from /content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/joeynmt/models/back_lhen_reverse_transformer_continued2/102500.ckpt
2