# Summary of Baseline Models 

**Overview:**
1. Text preprocessing
2. Inputs of the transformer
3. Workings of a transformer: *Submitted write up*
4. Results of baseline models

The code is adapted from the Masakhane reverse-model starter notebook: https://github.com/masakhane-io/masakhane-mt/blob/master/starter_notebook_into_English_training.ipynb


#### Setting up locations and libraries

In [1]:
# Linking to drive
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [2]:
# Importing needed libraries for preprocessing and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# Install Pytorch with GPU support v1.8.0.
! pip install torch==1.8.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.8.0+cu101
  Downloading https://download.pytorch.org/whl/cu101/torch-1.8.0%2Bcu101-cp37-cp37m-linux_x86_64.whl (763.5 MB)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.9.0+cu102
    Uninstalling torch-1.9.0+cu102:
      Successfully uninstalled torch-1.9.0+cu102
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.10.0+cu102 requires torch==1.9.0, but you have torch 1.8.0+cu101 which is incompatible.
torchtext 0.10.0 requires torch==1.9.0, but you have torch 1.8.0+cu101 which is incompatible.
Successfully installed torch-1.8.0+cu101


In [4]:
# Filtering warnings
import warnings
warnings.filterwarnings('ignore')

In [5]:
# Switching to the shared project directory on the mounted drive
import os
os.chdir("/content/gdrive/Shared drives/NMT_for_African_Language")

In [6]:
# Setting source and target languages
source_language = "en"
target_language1 = "lg"
target_language2 = "rw"
target_language3 = "lh"

os.environ["src"] = source_language 
os.environ["tgt1"] = target_language1
os.environ["tgt2"] = target_language2
os.environ["tgt3"] = target_language3

# Getting Data

JW300 to dataframes

In [None]:
! pip install opustools-pkg

Helper procedures for data preprocessing

In [None]:
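def split_srctgt(df, target_language):
  # Hold out the last 1% of rows as dev and the 1% before that as test;
  # everything else is training data. Relies on the global `source_language`.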
  num_valid = int(0.01 * df.shape[0])

  dev = df.tail(num_valid) 
  print(dev.shape)
  stripped = df.drop(df.tail(num_valid).index)
  test = stripped.tail(num_valid)
  print(test.shape)
  stripped2 = stripped.drop(stripped.tail(num_valid).index)

  # Creating files: Train
  with open("train."+source_language, "w") as src_file, open("train."+target_language, "w") as trg_file:
    for index, row in stripped2.iterrows():
      src_file.write(row["source_sentence"]+"\n")
      trg_file.write(row["target_sentence"]+"\n")

  # Dev (1%)
  with open("dev."+source_language, "w") as src_file, open("dev."+target_language, "w") as trg_file:
    for index, row in dev.iterrows():
      src_file.write(row["source_sentence"]+"\n")
      trg_file.write(row["target_sentence"]+"\n")

  # Test (1%)
  with open("test."+source_language, "w") as src_file, open("test."+target_language, "w") as trg_file:
    for index, row in test.iterrows():
      src_file.write(row["source_sentence"]+"\n")
      trg_file.write(row["target_sentence"]+"\n")
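
The split arithmetic in `split_srctgt` can be checked in isolation. The sketch below uses a hypothetical helper, `split_sizes` (not part of the notebook), that mirrors the `int(0.01 * n)` rule and returns the expected train, dev and test sizes for a corpus of `n` sentence pairs.

```python
def split_sizes(n_rows, held_out_fraction=0.01):
    """Return (train, dev, test) sizes using the same rule as split_srctgt."""
    # int() truncates, so 1% of 227005 rows becomes 2270 rows.
    num_valid = int(held_out_fraction * n_rows)
    # dev and test each take num_valid rows from the tail; the rest is train.
    return n_rows - 2 * num_valid, num_valid, num_valid


# Corpus sizes taken from the df_pp.shape outputs in this notebook.
for n_rows in (227005, 436863, 7907):
    print(n_rows, split_sizes(n_rows))
```

For the three corpus sizes reported in this notebook, the helper gives the same counts that `count_lines` prints for the train, dev and test files.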

In [None]:
# Code adapted from https://www.geeksforgeeks.org/count-number-of-lines-in-a-text-file-in-python/
# Count the non-empty lines in a file
def count_lines(filename):
  counter = 0
  # The context manager closes the file even if reading fails
  with open(filename, "r") as file:
    for line in file:
      # Empty lines are not counted
      if line.rstrip("\n"):
        counter += 1
  return counter

In [None]:
def generating_BPE(source_language, target_language):
  # Learn a joint BPE model (4000 merge operations) and per-language vocabularies from the training data.
  os.environ["src"] = source_language 
  os.environ["tgt1"] = target_language
  ! subword-nmt learn-joint-bpe-and-vocab --input train.$src train.$tgt1 -s 4000 -o bpe.codes.4000 --write-vocabulary vocab.$src vocab.$tgt1

  # Apply the learned BPE splits to the training, development and test data.
  ! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < train.$src > train.bpe.$src
  ! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt1 < train.$tgt1 > train.bpe.$tgt1

  ! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < dev.$src > dev.bpe.$src
  ! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt1 < dev.$tgt1 > dev.bpe.$tgt1
  ! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < test.$src > test.bpe.$src
  ! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt1 < test.$tgt1 > test.bpe.$tgt1

  # Build the shared vocabulary file with JoeyNMT's build_vocab script
  ! sudo chmod 777 joeynmt/scripts/build_vocab.py
  ! joeynmt/scripts/build_vocab.py train.bpe.$src train.bpe.$tgt1 --output_path vocab.txt

  print('Done generating BPE')

## Luganda   

### Turning data from JW300 to dataframe

**Do not rerun**: load the saved CSV into a pandas dataframe instead.

In [None]:
# Changing to Luganda directory
os.chdir("/content/gdrive/Shared drives/NMT_for_African_Language/Luganda")

In [None]:
# Downloading our corpus
! opus_read -d JW300 -s $src -t $tgt1 -wm moses -w jw300.$src jw300.$tgt1 -q

# extract the corpus file
! gunzip JW300_latest_xml_$src-$tgt1.xml.gz

# Moses-format text files to dataframe
source_file = 'jw300.' + source_language
target_file = 'jw300.' + target_language1

# NOTE: `en_test_sents` (the held-out English test sentences) must be defined
# before this cell runs; no cell above creates it.
source = []
target = []
skip_lines = set()  # Line numbers skipped on the source side, so the same lines are skipped on the target side.
with open(source_file) as f:
    for i, line in enumerate(f):
        # Skip sentences that are contained in the test set.
        if line.strip() not in en_test_sents:
            source.append(line.strip())
        else:
            skip_lines.add(i)
with open(target_file) as f:
    for j, line in enumerate(f):
        # Only add to corpus if corresponding source was not skipped.
        if j not in skip_lines:
            target.append(line.strip())

print('Loaded data and skipped {}/{} lines since contained in test set.'.format(len(skip_lines), i + 1))
    
df = pd.DataFrame(zip(source, target), columns=['source_sentence', 'target_sentence'])
df.head(3)

# Luganda training set
df.to_csv('Luganda.csv',index=False) 
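
The filtering step above keeps the two sides aligned by remembering which source line numbers were skipped. As a minimal sketch of the same idea, the hypothetical helper `filter_parallel` below walks both sides in a single pass. The sentences are placeholders, not JW300 data, and the held-out set stands in for `en_test_sents`.

```python
def filter_parallel(source_lines, target_lines, held_out):
    """Drop every sentence pair whose source side appears in `held_out`."""
    held_out = set(held_out)  # set membership is a constant-time lookup
    kept_source, kept_target, skipped = [], [], 0
    # zip() walks both sides in lockstep, so skipping a source line
    # also skips the target line at the same position.
    for src, tgt in zip(source_lines, target_lines):
        src, tgt = src.strip(), tgt.strip()
        if src in held_out:
            skipped += 1
            continue
        kept_source.append(src)
        kept_target.append(tgt)
    return kept_source, kept_target, skipped


# Placeholder data for illustration only.
src = ["Good morning .\n", "Thank you .\n", "Good night .\n"]
tgt = ["target line 1\n", "target line 2\n", "target line 3\n"]
kept_src, kept_tgt, n_skipped = filter_parallel(src, tgt, {"Thank you ."})
print(kept_src, kept_tgt, n_skipped)
```

Note that `zip` stops at the shorter input, so this variant assumes both files have the same number of lines.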

### Data preprocessing

In [None]:
lug = pd.read_csv("Luganda.csv")
lug.head(3)

| | source_sentence | target_sentence |
|---|---|---|
| 0 | This publication is not for sale . | Akatabo kano tekatundibwa . |
| 1 | COVER SUBJECT | OMUTWE OGULI KUNGULU |
| 2 | The Bible was completed about two thousand yea... | Bayibuli yamalirizibwa okuwandiikibwa emyaka n... |


In [None]:
# drop duplicate translations
df_pp = lug.drop_duplicates()

# drop conflicting translations
df_pp.drop_duplicates(subset='source_sentence', inplace=True)
df_pp.drop_duplicates(subset='target_sentence', inplace=True)

# Shuffle the data to remove bias in dev set selection.
df_pp = df_pp.sample(frac=1, random_state=42).reset_index(drop=True)

In [None]:
# The index was already reset above; drop=False additionally keeps it as an `index` column
df_pp.reset_index(drop=False, inplace=True)

In [None]:
df_pp.dropna(inplace=True)

In [None]:
df_pp.isna().sum()

index              0
source_sentence    0
target_sentence    0
dtype: int64

In [None]:
# Creating files for Luganda and English
split_srctgt(df_pp,target_language1)

(2270, 3)
(2270, 3)


In [None]:
df_pp.shape

(227005, 3)

In [None]:
# Luganda files
lg_train = count_lines('train.lg')
lg_dev = count_lines('dev.lg')
lg_test = count_lines('test.lg')

print("Number of sentences in train files:", lg_train, count_lines('train.en'))
print("Number of sentences in valid files:", lg_dev, count_lines('dev.en'))
print("Number of sentences in test files:", lg_test, count_lines('test.en'))

Number of sentences in train files: 222465 222465
Number of sentences in valid files: 2270 2270
Number of sentences in test files: 2270 2270


In [None]:
lg_train+lg_dev+lg_test

227005

In [None]:
! head train.*
! head dev.*

==> train.bpe.en <==
Ev@@ en@@ tually , however , the tru@@ ths I learned from the Bible began to sin@@ k deep@@ er into my heart . I real@@ ized that if I wanted to serve Jehovah , I had to change my pol@@ it@@ ical view@@ poin@@ ts and associ@@ ations .
At last , I have the st@@ able family life that I always cr@@ av@@ ed , and I have the loving Father that I always wanted .
I was a new husband , only 25 years old and very in@@ experienced , but off we went with confidence in Jehovah .
What can you do to show these de@@ a@@ f brothers personal attention ?
R@@ ef@@ er@@ r@@ ing to what the rul@@ er@@ ship of God’s Son will accompl@@ ish , Isaiah 9 : 7 says : “ The very z@@ eal of Jehovah of arm@@ ies will do this . ”
Jesus is the m@@ igh@@ ti@@ est of all of Jehovah’s spirit sons .
The ste@@ ad@@ f@@ ast example set by J@@ ac@@ o@@ b and R@@ ac@@ he@@ l no doubt had a powerful effect on their son Joseph , influ@@ enc@@ ing how he would hand@@ le t@@ ests of his own faith .
When s@@ en

In [None]:
#! git clone https://github.com/joeynmt/joeynmt.git
! cd joeynmt; pip3 install .

Processing /content/gdrive/Shareddrives/NMT_for_African_Language/Luganda/joeynmt
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Building wheels for collected packages: joeynmt
  Building wheel for joeynmt (setup.py) ... done
  Created wheel for joeynmt: filename=joeynmt-1.3-py3-none-any.whl size=85116 sha256=0e52ab8a7b6d4cce683e99b4b8ded97fd49ec56afad56e6077345316e246e98c
  Stored in directory: /tmp/pip-ephem-wheel-cache-d1wpidi2/wheels/b8/3e/ec/4da3b842b3679715f7cd3b4065c087c62dd0fcb0ab5f55b80c
Successfully built joeynmt
Installing collected packages: joeynmt
  Attempting uninstall: joeynmt
    Found existing installation

In [None]:
generating_BPE(source_language,target_language1)

Done generating BPE


In [None]:
!pwd

/content/gdrive/Shareddrives/NMT_for_African_Language/Luganda


In [None]:
# Some output
! echo "BPE Luganda Sentences"
! tail -n 5 test.bpe.$tgt1
! echo "Combined BPE Vocab"
! tail -n 10 vocab.txt

BPE Luganda Sentences
Bal@@ angirira eri abantu amawulire amalungi ag@@ as@@ ing@@ ir@@ ayo ddala obulungi .
Am@@ aanyi ga Katonda ge ga@@ as@@ obozesa eky@@ amag@@ ero ekyo okubaawo ng’@@ era bwe ga@@ as@@ ob@@ ozes@@ anga E@@ ris@@ a okukola eby@@ amag@@ ero ng’@@ ak@@ yali mul@@ amu .
Ob@@ uk@@ akafu O@@ bul@@ aga nti Katonda A@@ wa Abantu Be O@@ bul@@ agirizi , 4 / 15
Abantu abasinga obungi mu nsi teba@@ agal@@ ana era eyo ye nsonga lwaki oluusi bee@@ wuun@@ ya bwe bal@@ aba Abajulirwa ba Yakuwa nga ba@@ agal@@ ana .
N’@@ obuvunaanyizibwa ob@@ wat@@ u@@ weebwa Katonda mu maka tulina okweyongera okubu@@ twala nga bu@@ kulu .
Combined BPE Vocab
taayo
meet@@
\
ŋ
ʺ
£
”@@
Prover@@
Ó@@
erusaalemi
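
In the BPE output above, `@@` at the end of a token means that the token continues into the next one. A minimal sketch of reversing the segmentation follows; `undo_bpe` is a hypothetical helper name, the regular expression follows the `sed` pattern given in the subword-nmt README, and the example sentence is one of the `train.bpe.en` lines shown earlier.

```python
import re


def undo_bpe(line):
    """Join subword units by removing the '@@ ' continuation markers."""
    return re.sub(r"(@@ )|(@@ ?$)", "", line)


segmented = "What can you do to show these de@@ a@@ f brothers personal attention ?"
print(undo_bpe(segmented))
```

The restored sentence is still tokenized (punctuation separated by spaces); only the subword splits are undone.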


## Kinyarwanda  

In [None]:
# Changing to Kinyarwanda directory
os.chdir("/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda")

### Turning data from JW300 to dataframe

**Do not rerun**: load the saved CSV into a pandas dataframe instead.

In [None]:
# Downloading our corpus
! opus_read -d JW300 -s $src -t $tgt2 -wm moses -w jw300.$src jw300.$tgt2 -q

# extract the corpus file
! gunzip JW300_latest_xml_$src-$tgt2.xml.gz

# Moses-format text files to dataframe
source_file = 'jw300.' + source_language
target_file = 'jw300.' + target_language2

# NOTE: `en_test_sents` (the held-out English test sentences) must be defined
# before this cell runs; no cell above creates it.
source = []
target = []
skip_lines = set()  # Line numbers skipped on the source side, so the same lines are skipped on the target side.
with open(source_file) as f:
    for i, line in enumerate(f):
        # Skip sentences that are contained in the test set.
        if line.strip() not in en_test_sents:
            source.append(line.strip())
        else:
            skip_lines.add(i)
with open(target_file) as f:
    for j, line in enumerate(f):
        # Only add to corpus if corresponding source was not skipped.
        if j not in skip_lines:
            target.append(line.strip())

print('Loaded data and skipped {}/{} lines since contained in test set.'.format(len(skip_lines), i + 1))
    
df2 = pd.DataFrame(zip(source, target), columns=['source_sentence', 'target_sentence'])
df2.head(3)

# Kinyarwanda training set
df2.to_csv('Kinyarwanda.csv',index=False) 


Alignment file /proj/nlpl/data/OPUS/JW300/latest/xml/en-rw.xml.gz not found. The following files are available for downloading:

   5 MB https://object.pouta.csc.fi/OPUS-JW300/v1b/xml/en-rw.xml.gz
 263 MB https://object.pouta.csc.fi/OPUS-JW300/v1b/xml/en.zip
  48 MB https://object.pouta.csc.fi/OPUS-JW300/v1b/xml/rw.zip

 316 MB Total size
./JW300_latest_xml_en-rw.xml.gz ... 100% of 5 MB
./JW300_latest_xml_en.zip ... 100% of 263 MB
./JW300_latest_xml_rw.zip ... 100% of 48 MB


### Data preprocessing

In [None]:
rwa = pd.read_csv("Kinyarwanda.csv")
rwa.head(3)

| | source_sentence | target_sentence |
|---|---|---|
| 0 | The Deaf Praise Jehovah | Ibipfamatwi Bisingiza Yehova |
| 1 | BY AWAKE ! | BY AWAKE ! |
| 2 | CORRESPONDENT IN NIGERIA | CORRESPONDENT IN NIGERIA |


In [None]:
# drop duplicate translations
df_pp = rwa.drop_duplicates()

# drop conflicting translations
df_pp.drop_duplicates(subset='source_sentence', inplace=True)
df_pp.drop_duplicates(subset='target_sentence', inplace=True)

# Shuffle the data to remove bias in dev set selection.
df_pp = df_pp.sample(frac=1, random_state=42).reset_index(drop=True)

In [None]:
# The index was already reset above; drop=False additionally keeps it as an `index` column
df_pp.reset_index(drop=False, inplace=True)

In [None]:
df_pp.dropna(inplace=True)

In [None]:
df_pp.isna().sum()

index              0
source_sentence    0
target_sentence    0
dtype: int64

In [None]:
# Creating files for Kinyarwanda and English
split_srctgt(df_pp, target_language2)

(4368, 3)
(4368, 3)


In [None]:
# Kinyarwanda files
rw_train = count_lines('train.rw')
rw_dev = count_lines('dev.rw')
rw_test = count_lines('test.rw')

print("Number of sentences in train files:", rw_train, count_lines('train.en'))
print("Number of sentences in valid files:", rw_dev, count_lines('dev.en'))
print("Number of sentences in test files:", rw_test, count_lines('test.en'))

Number of sentences in train files: 428127 428127
Number of sentences in valid files: 4368 4368
Number of sentences in test files: 4368 4368


In [None]:
df_pp.shape

(436863, 3)

In [None]:
rw_train+rw_test+rw_dev

436863

In [None]:
! head train.*
! head dev.*

==> train.bpe.en <==
R@@ ight after his bapt@@ ism , he “ went off into Ar@@ ab@@ ia ” ​ — e@@ ither the S@@ y@@ ri@@ an D@@ es@@ ert or pos@@ sib@@ ly some qu@@ i@@ et place on the Ar@@ ab@@ ian P@@ en@@ ins@@ ul@@ a that was conduc@@ ive to med@@ it@@ ation .
You will see the time when God br@@ ings righteous rule to all the earth , und@@ o@@ ing the d@@ am@@ age and inj@@ ust@@ ice brought by human rul@@ er@@ ship .
Let us consider f@@ ive reas@@ ons why we should want to follow the Christ .
Even in the Bible , the id@@ ea of pers@@ u@@ as@@ ion som@@ et@@ im@@ es has n@@ eg@@ ative con@@ no@@ t@@ ations , den@@ ot@@ ing a cor@@ rup@@ ting or a lead@@ ing as@@ tr@@ ay .
For God’s servants to be deliv@@ ered , Satan and his ent@@ ire world@@ wide system of things need to be rem@@ ov@@ ed .
I had never heard that name used in my ch@@ urch .
S@@ imp@@ ly having authority or a wid@@ er name recogn@@ ition is not the important thing .
M@@ ost people do not believe in the spir@@ its .
And

In [None]:
#! git clone https://github.com/joeynmt/joeynmt.git
! cd joeynmt; pip3 install .

Cloning into 'joeynmt'...
remote: Enumerating objects: 3127, done.
remote: Counting objects: 100% (176/176), done.
remote: Compressing objects: 100% (85/85), done.
remote: Total 3127 (delta 101), reused 142 (delta 91), pack-reused 2951
Receiving objects: 100% (3127/3127), 8.09 MiB | 10.21 MiB/s, done.
Resolving deltas: 100% (2130/2130), done.
Checking out files: 100% (119/119), done.
Processing /content/gdrive/Shareddrives/NMT_for_African_Language2/Kinyarwanda/joeynmt
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Building wheels for collected packages: joeynmt
  Building wheel for joeynmt (setup.py) ... done

In [None]:
generating_BPE(source_language,target_language2)

Done generating BPE


In [None]:
!pwd

/content/gdrive/Shareddrives/NMT_for_African_Language/Kinyarwanda


In [None]:
# Some output
! echo "BPE Kinyarwanda Sentences"
! tail -n 5 test.bpe.$tgt2
! echo "Combined BPE Vocab"
! tail -n 10 vocab.txt

BPE Kinyarwanda Sentences
Ijambo ry’@@ ik@@ igiriki ry@@ ahind@@ uw@@ emo “ urukundo r@@ ur@@ angwa n’@@ ubw@@ uz@@ u , ” ry@@ erekeza ku mur@@ unga ukomeye uh@@ uza abantu bagize umuryango umwe bak@@ und@@ ana kandi b@@ afash@@ anya .
( N@@ eh@@ em@@ iya 1 : 1 – 6 : 19 )
Mu by’ukuri se , hari iyo igaragaza ?
I@@ bintu bibi bit@@ ug@@ eraho twese muri iki gihe byat@@ ewe na cya gik@@ orwa k@@ ibi cyo kw@@ igom@@ eka .
B@@ ahawe am@@ ac@@ umbi ir@@ uh@@ ande rw’@@ urus@@ engero .
Combined BPE Vocab
Ï@@
ʺ
⁄
Ă@@
̄@@
ointed
̆
ḥ
ḍ@@
Ā@@


## Luhya

### Data preprocessing

In [None]:
# Changing to Luhya directory
os.chdir("/content/gdrive/Shared drives/NMT_for_African_Language/Luhya")

In [None]:
luh = pd.read_csv("Luhya.csv")
luh.tail(3)

| | target_sentence | source_sentence |
|---|---|---|
| 7949 | Ne omundu yesi naba narusiakhwo likhuwa liosi ... | and if anyone takes away from the words of the... |
| 7950 | Ulia ourusinjia obuloli khumakhuwa kano koosi ... | He who testifies to these things says , “ Sure... |
| 7951 | Obukoosia obwa Omwami Yesu bube khubandu ba Ny... | The grace of our Lord Jesus Christ be with you... |


In [None]:
# Tokenizing the data
# import nltk
# nltk.download('punkt')
# from nltk import sent_tokenize, word_tokenize

# luh['target_sentence'] = luh['0'].apply(lambda x: ' '.join(word_tokenize(x)))
# luh['source_sentence'] = luh['1'].apply(lambda x: ' '.join(word_tokenize(x)))
# luh = luh.drop(['0', '1'], axis = 1)

In [None]:
# drop duplicate translations
df_pp = luh.drop_duplicates()

# drop conflicting translations
df_pp.drop_duplicates(subset='source_sentence', inplace=True)
df_pp.drop_duplicates(subset='target_sentence', inplace=True)

# Shuffle the data to remove bias in dev set selection.
df_pp = df_pp.sample(frac=1, random_state=42).reset_index(drop=True)

In [None]:
# The index was already reset above; drop=False additionally keeps it as an `index` column
df_pp.reset_index(drop=False, inplace=True)

In [None]:
df_pp.dropna(inplace=True)

In [None]:
df_pp.isna().sum()

index              0
target_sentence    0
source_sentence    0
dtype: int64

In [None]:
# Creating files for Luhya and English
split_srctgt(df_pp,target_language3)

(79, 3)
(79, 3)


In [None]:
df_pp.shape

(7907, 3)

In [None]:
# Luhya files
lh_train = count_lines('train.lh')
lh_dev = count_lines('dev.lh')
lh_test = count_lines('test.lh')

print("Number of sentences in train files:", lh_train, count_lines('train.en'))
print("Number of sentences in valid files:", lh_dev, count_lines('dev.en'))
print("Number of sentences in test files:", lh_test, count_lines('test.en'))

Number of sentences in train files: 7749 7749
Number of sentences in valid files: 79 79
Number of sentences in test files: 79 79


In [None]:
lh_train+lh_test+lh_dev

7907

In [None]:
! head train.*
! head dev.*
! head test.*

==> train.en <==
Then Pilate entered the Praetorium again , called Jesus , and said to Him , “ Are You the King of the Jews ? ”
If anyone thinks himself to be a prophet or spiritual , let him acknowledge that the things which I write to you are the commandments of the Lord .
Every branch in Me that does not bear fruit He takes away ; and every branch that bears fruit He prunes , that it may bear more fruit .
Demetrius has a good testimony from all , and from the truth itself . And we also bear witness , and you know that our testimony is true .
And supper being ended , the devil having already put it into the heart of Judas Iscariot , Simon ’ s son , to betray Him ,
imploring us with much urgency that we would receive the gift and the fellowship of the ministering to the saints .
It is written in the prophets , ‘ And they shall all be taught by God. ’ Therefore everyone who has heard and learned from the Father comes to Me .
For those who are such do not serve our Lord Jesus Christ , b

In [None]:
#! git clone https://github.com/joeynmt/joeynmt.git
! cd joeynmt; pip3 install .

Cloning into 'joeynmt'...
remote: Enumerating objects: 3127, done.
remote: Counting objects: 100% (176/176), done.
remote: Compressing objects: 100% (85/85), done.
remote: Total 3127 (delta 101), reused 142 (delta 91), pack-reused 2951
Receiving objects: 100% (3127/3127), 8.09 MiB | 2.56 MiB/s, done.
Resolving deltas: 100% (2130/2130), done.
Checking out files: 100% (119/119), done.
Processing /content/gdrive/Shareddrives/NMT_for_African_Language2/Luhya/joeynmt
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Building wheels for collected packages: joeynmt
  Building wheel for joeynmt (setup.py) ... done
  Create

In [None]:
generating_BPE(source_language,target_language3)

Done generating BPE


In [None]:
!pwd

/content/gdrive/Shareddrives/NMT_for_African_Language/Luhya


In [None]:
# Some output
! echo "BPE Luhya Sentences"
! tail -n 5 test.bpe.$tgt3
! echo "Combined BPE Vocab"
! tail -n 10 vocab.txt

BPE Luhya Sentences
B@@ ach@@ am@@ anga , okhwikh@@ ala khub@@ if@@ umb@@ i habundu w@@ oluyali mum@@ as@@ abo , nende ebif@@ umb@@ i bi@@ oluyali bi@@ e@@ imbeli mutsis@@ in@@ agog@@ i. ,
Mana abandu abali nibe@@ mi@@ ile imbeli nibamu@@ h@@ alab@@ ila , nibamuboolela mbu ah@@ ol@@ eele ts@@ i , nebutswa y@@ ame@@ eta , butswa okhul@@ anjilisia obutinyu ari , “ Omwana wa Daudi ! , W@@ umb@@ eele tsimbabasi ! ”
Mana , nibe@@ mba ol@@ wimb@@ o oluyia bari “ N@@ iwe ou@@ kw@@ anile okhu@@ bukula eshi@@ tabu eshik@@ anye , nende okhw@@ ik@@ ula ebib@@ ali@@ kho bi@@ ashi@@ o. , Okhuba w@@ erwa , ne khulwa okhufw@@ akhwo khw@@ eshit@@ is@@ o , w@@ areera khu Nyasaye abandu okhurula mu@@ buli olwibulo olul@@ imi amahanga nende okhurula mu@@ tsimbia tsi@@ osi. ,
habula ow@@ enya , okhuba omukhongo mwinywe , okhuula abe omukhalabani , w@@ ab@@ oosi .
A@@ bukula ob@@ ise bi@@ hel@@ ile okhwi@@ h@@ enga lik@@ ond@@ ol@@ ie , ne olwa , ar@@ ulaho , y@@ ebil@@ ila bwangu shinga lw@@ obw@@ en@@ ib

# Modeling

## Luganda

In [43]:
# Changing to Luganda directory
os.chdir("/content/gdrive/Shared drives/NMT_for_African_Language/Luganda")

In [8]:
!pwd

/content/gdrive/Shared drives/NMT_for_African_Language/Luganda


In [9]:
#! git clone https://github.com/joeynmt/joeynmt.git
! cd joeynmt; pip3 install .

Processing /content/gdrive/Shared drives/NMT_for_African_Language/Luganda/joeynmt
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Collecting numpy==1.20.1
  Downloading numpy-1.20.1-cp37-cp37m-manylinux2010_x86_64.whl (15.3 MB)
Collecting torchtext==0.9.0
  Downloading torchtext-0.9.0-cp37-cp37m-manylinux1_x86_64.whl (7.1 MB)
Collecting sacrebleu>=1.3.6
  Downloading sacrebleu-1.5.1-py3-none-any.whl (54 kB)
Collecting subword-nmt
  Downloading sub

In [10]:
path = "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda"
name = '%s%s' % (target_language1, source_language)

# Create the config
config = """
name: "{target_language1}{source_language}_reverse_transformer"

data:
    src: "{target_language1}"
    trg: "{source_language}"
    train: "{path}/train.bpe"
    dev:   "{path}/dev.bpe"
    test:  "{path}/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "{path}/vocab.txt"
    trg_vocab: "{path}/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    #load_model: "joeynmt/models/{name}_transformer/1.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam scheduling
    patience: 5                     # For plateau: decrease learning rate by decrease_factor if validation score has not improved for this many validation rounds.
    learning_rate_factor: 0.5       # factor for Noam scheduler (used with Transformer)
    learning_rate_warmup: 1000      # warmup steps for Noam scheduler (used with Transformer)
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 4096
    batch_type: "token"
    eval_batch_size: 1000
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 30                  # TODO: Decrease when experimenting. Around 30 is sufficient to check whether training works at all.
    validation_freq: 2000         # TODO: Set to at least once per epoch.
    logging_freq: 200
    eval_metric: "bleu"
    model_dir: "models/{name}_reverse_transformer"
    overwrite: True              # TODO: Set to True if you want to overwrite possibly existing models. 
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_last_ckpts: 3

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4             # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256   # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4              # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256    # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
""".format(name=name, path=path, source_language=source_language, target_language1=target_language1)
with open("joeynmt/configs/transformer_reverse_{name}.yaml".format(name=name),'w') as f:
    f.write(config)

In [None]:
# Train the model
!cd joeynmt; python3 -m joeynmt train configs/transformer_reverse_lgen.yaml

2021-08-01 17:28:14,592 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-08-01 17:28:14,630 - INFO - joeynmt.data - Loading training data...
2021-08-01 17:28:18,793 - INFO - joeynmt.data - Building vocabulary...
2021-08-01 17:28:19,061 - INFO - joeynmt.data - Loading dev data...
2021-08-01 17:28:19,108 - INFO - joeynmt.data - Loading test data...
2021-08-01 17:28:19,132 - INFO - joeynmt.data - Data loaded.
2021-08-01 17:28:19,133 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-08-01 17:28:19,337 - INFO - joeynmt.model - Enc-dec model built.
2021-08-01 17:28:19.489900: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-01 17:28:20,568 - INFO - joeynmt.training - Total params: 12151040
2021-08-01 17:28:23,821 - INFO - joeynmt.helpers - cfg.name                           : lgen_reverse_transformer
2021-08-01 17:28:23,821 - INFO - joeynmt.helpers - cfg.data.src                       : l
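
The logged total of 12151040 parameters can be cross-checked against the model section of the config. The sketch below is one possible accounting, not JoeyNMT's own code. It assumes linear layers with biases, one layer norm per sub-block plus a final one per stack, a single embedding matrix shared by source and target (`tied_embeddings`), and a bias-free output layer tied to it (`tied_softmax`). The vocabulary size of 4261 is inferred from the total and does not appear in the log excerpt; `transformer_params` is a hypothetical helper.

```python
def transformer_params(vocab_size, hidden=256, ff=1024, layers=6):
    """Count trainable weights under the assumptions stated above."""
    layer_norm = 2 * hidden                      # scale and shift vectors
    attention = 4 * (hidden * hidden + hidden)   # query, key, value, output projections
    feed_forward = (hidden * ff + ff) + (ff * hidden + hidden)
    # Encoder layer: self-attention + feed-forward, each with a layer norm.
    encoder_layer = attention + feed_forward + 2 * layer_norm
    # Decoder layer: self-attention + cross-attention + feed-forward.
    decoder_layer = 2 * attention + feed_forward + 3 * layer_norm
    encoder = layers * encoder_layer + layer_norm   # plus the final layer norm
    decoder = layers * decoder_layer + layer_norm
    embeddings = vocab_size * hidden                # one shared matrix
    return encoder + decoder + embeddings


print(transformer_params(vocab_size=4261))
```

The number of attention heads does not enter the count, because the heads only partition the existing `hidden`-sized projections.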

In [None]:
# Output our validation scores (loss, perplexity, BLEU)
! cat "joeynmt/models/lgen_reverse_transformer/validations.txt"

Steps: 2000	Loss: 225537.12500	PPL: 42.48461	bleu: 1.91223	LR: 0.00030000	*
Steps: 4000	Loss: 186507.21875	PPL: 22.20552	bleu: 6.78872	LR: 0.00030000	*
Steps: 6000	Loss: 169341.76562	PPL: 16.69313	bleu: 9.80236	LR: 0.00030000	*
Steps: 8000	Loss: 158586.51562	PPL: 13.96020	bleu: 11.66645	LR: 0.00030000	*
Steps: 10000	Loss: 150443.31250	PPL: 12.19279	bleu: 13.82781	LR: 0.00030000	*
Steps: 12000	Loss: 144066.60938	PPL: 10.96648	bleu: 15.39399	LR: 0.00030000	*
Steps: 14000	Loss: 139397.29688	PPL: 10.14748	bleu: 17.02612	LR: 0.00030000	*
Steps: 16000	Loss: 134982.57812	PPL: 9.42946	bleu: 18.16888	LR: 0.00030000	*
Steps: 18000	Loss: 132093.65625	PPL: 8.98733	bleu: 18.90839	LR: 0.00030000	*
Steps: 20000	Loss: 128846.18750	PPL: 8.51502	bleu: 19.27863	LR: 0.00030000	*
Steps: 22000	Loss: 126759.35938	PPL: 8.22470	bleu: 19.87707	LR: 0.00030000	*
Steps: 24000	Loss: 124511.36719	PPL: 7.92303	bleu: 20.25674	LR: 0.00030000	*
Steps: 26000	Loss: 122831.87500	PPL: 7.70489	bleu: 20.81013	LR: 0.00030000	*
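
Each line of `validations.txt` can be parsed into numbers for plotting or comparison. The sketch below uses a hypothetical helper, `parse_validation_line`, and also checks one assumption about the log: that perplexity is `exp(loss / tokens)`, where `loss` is summed over the development set. If that holds, `loss / ln(PPL)` should give roughly the same token count on every line.

```python
import math
import re

LINE_RE = re.compile(
    r"Steps: (\d+)\s+Loss: ([\d.]+)\s+PPL: ([\d.]+)\s+bleu: ([\d.]+)\s+LR: ([\d.]+)"
)


def parse_validation_line(line):
    """Turn one validations.txt line into a dict of numbers."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognised line: {line!r}")
    steps, loss, ppl, bleu, lr = match.groups()
    return {"steps": int(steps), "loss": float(loss), "ppl": float(ppl),
            "bleu": float(bleu), "lr": float(lr)}


# First and last lines of the output above.
lines = [
    "Steps: 2000\tLoss: 225537.12500\tPPL: 42.48461\tbleu: 1.91223\tLR: 0.00030000\t*",
    "Steps: 26000\tLoss: 122831.87500\tPPL: 7.70489\tbleu: 20.81013\tLR: 0.00030000\t*",
]
for record in map(parse_validation_line, lines):
    # If PPL = exp(loss / tokens), then tokens = loss / ln(PPL).
    implied_tokens = record["loss"] / math.log(record["ppl"])
    print(record["steps"], record["bleu"], round(implied_tokens))
```

Both lines imply nearly the same token count, which is consistent with the assumed relationship between loss and perplexity.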

In [None]:
# Reloading configuration file
ckpt_number = 82000
reload_config = config.replace(
    f'#load_model: "joeynmt/models/lgen_transformer/1.ckpt"', 
    f'load_model: "{path}/joeynmt/models/{name}_reverse_transformer/{ckpt_number}.ckpt"').replace(
        f'model_dir: "models/lgen_reverse_transformer"', f'model_dir: "models/lgen_reverse_transformer_continued"')
with open("joeynmt/configs/transformer_reverse_{name}_reload.yaml".format(name=name),'w') as f:
    f.write(reload_config)
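
The reload config is built with `str.replace`, which returns the text unchanged when the search string does not match exactly (for example after editing the commented `load_model` line in the template). A minimal sketch of a stricter variant follows; `replace_exactly_once` is a hypothetical helper, and the template and checkpoint path are shortened placeholders rather than the real config.

```python
def replace_exactly_once(text, old, new):
    """Like str.replace, but fail loudly unless `old` occurs exactly once."""
    count = text.count(old)
    if count != 1:
        raise ValueError(f"expected 1 occurrence of {old!r}, found {count}")
    return text.replace(old, new)


# Shortened stand-in for the config string.
template = (
    'training:\n'
    '    #load_model: "joeynmt/models/lgen_transformer/1.ckpt"\n'
    '    model_dir: "models/lgen_reverse_transformer"\n'
)
reloaded = replace_exactly_once(
    template,
    '#load_model: "joeynmt/models/lgen_transformer/1.ckpt"',
    'load_model: "models/lgen_reverse_transformer/82000.ckpt"',  # placeholder path
)
print(reloaded)
```

With a check like this, a mismatch raises an error before training starts instead of silently training from scratch.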

In [None]:
!cat "joeynmt/configs/transformer_reverse_lgen_reload.yaml"


name: "lgen_reverse_transformer"

data:
    src: "lg"
    trg: "en"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/train.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/vocab.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/joeynmt/models/lgen_reverse_transformer/82000.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam sched

In [None]:
!cd joeynmt; python -m joeynmt train configs/transformer_reverse_lgen_reload.yaml

2021-08-03 20:37:04,486 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-08-03 20:37:04,511 - INFO - joeynmt.data - Loading training data...
2021-08-03 20:37:11,086 - INFO - joeynmt.data - Building vocabulary...
2021-08-03 20:37:11,924 - INFO - joeynmt.data - Loading dev data...
2021-08-03 20:37:13,023 - INFO - joeynmt.data - Loading test data...
2021-08-03 20:37:13,997 - INFO - joeynmt.data - Data loaded.
2021-08-03 20:37:13,997 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-08-03 20:37:14,252 - INFO - joeynmt.model - Enc-dec model built.
2021-08-03 20:37:14.433449: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-03 20:37:15,850 - INFO - joeynmt.training - Total params: 12151040
2021-08-03 20:37:17,946 - INFO - joeynmt.training - Loading model from /content/gdrive/Shared drives/NMT_for_African_Language/Luganda/joeynmt/models/lgen_reverse_transformer/82000.ckpt

In [11]:
# Print the validation log (loss, perplexity, BLEU per checkpoint)
! cat "joeynmt/models/lgen_reverse_transformer_continued/validations.txt"

Steps: 84000	Loss: 104546.17188	PPL: 5.68533	bleu: 25.74462	LR: 0.00030000	*
Steps: 86000	Loss: 104900.71875	PPL: 5.71893	bleu: 25.67198	LR: 0.00030000	
Steps: 88000	Loss: 104344.71875	PPL: 5.66632	bleu: 25.77683	LR: 0.00030000	*
Steps: 90000	Loss: 103869.98438	PPL: 5.62178	bleu: 25.60610	LR: 0.00030000	*
Steps: 92000	Loss: 103786.93750	PPL: 5.61402	bleu: 25.99209	LR: 0.00030000	*
Steps: 94000	Loss: 103737.92969	PPL: 5.60945	bleu: 25.85617	LR: 0.00030000	*
Steps: 96000	Loss: 103274.71875	PPL: 5.56643	bleu: 26.12597	LR: 0.00030000	*
Steps: 98000	Loss: 103066.27344	PPL: 5.54717	bleu: 26.00346	LR: 0.00030000	*
Steps: 100000	Loss: 103366.48438	PPL: 5.57492	bleu: 25.97731	LR: 0.00030000	
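The `*` at the end of a row marks a new best checkpoint. In this log the stars follow perplexity rather than BLEU (step 90000 is starred although its BLEU dropped), which is consistent with perplexity being the early-stopping metric. A minimal sketch for pulling the best rows out of such a log, assuming the whitespace-separated layout shown above (`parse_validations` is a hypothetical helper, not part of JoeyNMT):

```python
import re

# Assumed layout of one validations.txt row (fields separated by tabs or spaces).
ROW = re.compile(
    r"Steps:\s*(\d+)\s+Loss:\s*([\d.]+)\s+PPL:\s*([\d.]+)\s+"
    r"bleu:\s*([\d.]+)\s+LR:\s*([\d.]+)\s*(\*?)"
)

def parse_validations(text):
    """Return one dict per row; rows that do not match the pattern are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            rows.append({
                "step": int(m.group(1)),
                "loss": float(m.group(2)),
                "ppl": float(m.group(3)),
                "bleu": float(m.group(4)),
                "lr": float(m.group(5)),
                "new_best": m.group(6) == "*",
            })
    return rows

# Three rows copied from the log above.
sample = (
    "Steps: 96000\tLoss: 103274.71875\tPPL: 5.56643\tbleu: 26.12597\tLR: 0.00030000\t*\n"
    "Steps: 98000\tLoss: 103066.27344\tPPL: 5.54717\tbleu: 26.00346\tLR: 0.00030000\t*\n"
    "Steps: 100000\tLoss: 103366.48438\tPPL: 5.57492\tbleu: 25.97731\tLR: 0.00030000\t\n"
)
rows = parse_validations(sample)
best_ppl = min(rows, key=lambda r: r["ppl"])
best_bleu = max(rows, key=lambda r: r["bleu"])
print("lowest PPL at step", best_ppl["step"])
print("highest BLEU at step", best_bleu["step"])
```

The two criteria can disagree, as they do here, so it is worth deciding up front which one selects the checkpoint used for reporting.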


In [None]:
# Reloading configuration file
ckpt_number = 100000
reload_config = config.replace(
    f'#load_model: "joeynmt/models/lgen_transformer/1.ckpt"', 
    f'load_model: "{path}/joeynmt/models/{name}_reverse_transformer_continued/{ckpt_number}.ckpt"').replace(
        f'model_dir: "models/lgen_reverse_transformer"', f'model_dir: "models/lgen_reverse_transformer_continued2"').replace(
        f'epochs: 30', f'epochs: 22')
with open("joeynmt/configs/transformer_reverse_{name}_reload2.yaml".format(name=name),'w') as f:
    f.write(reload_config)

In [None]:
!cat "joeynmt/configs/transformer_reverse_lgen_reload2.yaml"


name: "lgen_reverse_transformer"

data:
    src: "lg"
    trg: "en"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/train.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/vocab.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luganda/joeynmt/models/lgen_reverse_transformer_continued/100000.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to

In [None]:
!cd joeynmt; python3 -m joeynmt train configs/transformer_reverse_lgen_reload2.yaml

2021-08-04 00:01:28,130 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-08-04 00:01:28,206 - INFO - joeynmt.data - Loading training data...
2021-08-04 00:01:33,001 - INFO - joeynmt.data - Building vocabulary...
2021-08-04 00:01:33,640 - INFO - joeynmt.data - Loading dev data...
2021-08-04 00:01:34,362 - INFO - joeynmt.data - Loading test data...
2021-08-04 00:01:34,988 - INFO - joeynmt.data - Data loaded.
2021-08-04 00:01:34,989 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-08-04 00:01:35,201 - INFO - joeynmt.model - Enc-dec model built.
2021-08-04 00:01:35.459827: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-04 00:01:37,279 - INFO - joeynmt.training - Total params: 12151040
2021-08-04 00:01:40,960 - INFO - joeynmt.training - Loading model from /content/gdrive/Shared drives/NMT_for_African_Language/Luganda/joeynmt/models/lgen_reverse_transformer_continued/100000.ckpt

In [15]:
!pwd

/content/gdrive/Shareddrives/NMT_for_African_Language/Luganda


In [17]:
# Print the validation log (loss, perplexity, BLEU per checkpoint)
! cat "joeynmt/models/lgen_reverse_transformer_continued2/validations.txt"

Steps: 102000	Loss: 102950.84375	PPL: 5.53654	bleu: 26.18486	LR: 0.00030000	*
Steps: 104000	Loss: 102431.87500	PPL: 5.48898	bleu: 26.33524	LR: 0.00030000	*
Steps: 106000	Loss: 102424.21094	PPL: 5.48828	bleu: 26.33946	LR: 0.00030000	*
Steps: 108000	Loss: 102250.45312	PPL: 5.47245	bleu: 26.42383	LR: 0.00030000	*
Steps: 110000	Loss: 102364.25000	PPL: 5.48281	bleu: 26.23428	LR: 0.00030000	
Steps: 112000	Loss: 101759.60156	PPL: 5.42798	bleu: 26.55033	LR: 0.00030000	*
Steps: 114000	Loss: 101613.21875	PPL: 5.41479	bleu: 26.52269	LR: 0.00030000	*
Steps: 116000	Loss: 101536.27344	PPL: 5.40787	bleu: 26.24736	LR: 0.00030000	*
Steps: 118000	Loss: 101056.37500	PPL: 5.36490	bleu: 26.52566	LR: 0.00030000	*
Steps: 120000	Loss: 100925.59375	PPL: 5.35325	bleu: 26.66043	LR: 0.00030000	*
Steps: 122000	Loss: 101490.95312	PPL: 5.40379	bleu: 26.45827	LR: 0.00030000	
Steps: 124000	Loss: 101171.45312	PPL: 5.37517	bleu: 26.83602	LR: 0.00030000	
Steps: 126000	Loss: 100666.17969	PPL: 5.33021	bleu: 26.66287	LR: 0.

In [18]:
# Reloading configuration file
ckpt_number = 156000
reload_config = config.replace(
    f'#load_model: "joeynmt/models/lgen_transformer/1.ckpt"', 
    f'load_model: "{path}/joeynmt/models/{name}_reverse_transformer_continued2/{ckpt_number}.ckpt"').replace(
        f'model_dir: "models/lgen_reverse_transformer"', f'model_dir: "models/lgen_reverse_transformer_continued3"').replace(
        f'epochs: 30', f'epochs: 1')
with open("joeynmt/configs/transformer_reverse_{name}_reload3.yaml".format(name=name),'w') as f:
    f.write(reload_config)

In [19]:
!cd joeynmt; python3 -m joeynmt train configs/transformer_reverse_lgen_reload3.yaml

2021-08-04 06:53:01,932 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-08-04 06:53:02,009 - INFO - joeynmt.data - Loading training data...
2021-08-04 06:53:08,691 - INFO - joeynmt.data - Building vocabulary...
2021-08-04 06:53:09,634 - INFO - joeynmt.data - Loading dev data...
2021-08-04 06:53:11,069 - INFO - joeynmt.data - Loading test data...
2021-08-04 06:53:12,966 - INFO - joeynmt.data - Data loaded.
2021-08-04 06:53:12,967 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-08-04 06:53:13,361 - INFO - joeynmt.model - Enc-dec model built.
2021-08-04 06:53:13.609638: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-04 06:53:15,746 - INFO - joeynmt.training - Total params: 12151040
2021-08-04 06:53:20,037 - INFO - joeynmt.training - Loading model from /content/gdrive/Shared drives/NMT_for_African_Language/Luganda/joeynmt/models/lgen_reverse_transformer_continued2/156000.ckpt

#### Sample translations

In [44]:
# Candidates
! head "joeynmt/models/lgen_reverse_transformer_continued3/00158000.hyps.test"

What question about the “ other sheep ” can we ask ourselves ?
Because “ God is not partial , but in every nation a man fears him and works righteousness is acceptable to him . ” ​ — Acts 10 : 34 , 35 .
Jesus will destroy Satan and eliminate all the problems Satan caused . ​ — Gen .
Because David loved Jehovah .
9 : 24 ; Luke 4 : 43 .
Indeed , those who accept God’s word have been “ out of every tribe and tongue and nations . ”
However , we know that “ the appointed times have given me . ”
Nevertheless , people who seem to be unharmful are often hurt .
“ The Endurance of Job ”
But we may wonder : How can our conscience be trained to help us when


In [45]:
# References
! head "test.en"

What question about the “ other sheep ” now arises ?
Because “ God is not partial , but in every nation the man that fears him and works righteousness is acceptable to him . ” ​ — Acts 10 : 34 , 35 .
Jesus will crush the serpent’s head and erase from the universe all traces of Satan’s rebellion . ​ — Gen .
Because David loved Jehovah .
9 : 24 ; Luke 4 : 43 .
Truly , those who have embraced God’s word have come from “ every tribe and tongue and people and nation . ”
We do know , however , that “ the time left is reduced . ”
Still , shocking deeds are often perpetrated by seemingly ordinary people in the neighborhood .
“ The Endurance of Job ”
However , we might ask : How can a well - trained conscience help us when we need to make decisions ?


In [46]:
# Source
! head "test.lg"

Kibuuzo ki ekikwata ku ‘ b’endiga endala ’ kye tuyinza okwebuuza ?
Kubanga “ Katonda tasosola , naye mu buli ggwanga omuntu amutya n’akola eby’obutuukirivu amukkiriza . ” ​ — Bik . 10 : 34 , 35 .
Yesu ajja kuzikiriza Sitaani era amalewo ebizibu byonna Sitaani bye yaleetawo . ​ — Lub .
Kubanga Dawudi yali ayagala nnyo Yakuwa .
9 : 24 ; Luk . 4 : 43 .
Mazima ddala , abo abakkiriza ekigambo kya Katonda bavudde ‘ mu buli kika n’ennimi n’amawanga . ’
Kyokka , tumanyi nti “ ebiro biyimpawadde . ”
Wadde kiri kityo , abantu abalabika ng’abatalina mutawaana be batera okukola ebintu ebyesisiwaza ennyo .
‘ Obugumiikiriza bwa Yobu ’
Naye tuyinza okuba nga twebuuza : Omuntu waffe ow’omunda atendekeddwa obulungi ayinza atya okutuyamba nga tulina bye tusalawo ?


## Kinyarwanda

In [39]:
# Changing to Kinyarwanda directory
os.chdir("/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda")

In [None]:
#! git clone https://github.com/joeynmt/joeynmt.git
! cd joeynmt; pip3 install .

Processing /content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Collecting numpy==1.20.1
  Downloading numpy-1.20.1-cp37-cp37m-manylinux2010_x86_64.whl (15.3 MB)
Collecting torchtext==0.9.0
  Downloading torchtext-0.9.0-cp37-cp37m-manylinux1_x86_64.whl (7.1 MB)
Collecting sacrebleu>=1.3.6
  Downloading sacrebleu-1.5.1-py3-none-any.whl (54 kB)
Collecting subword-nmt
  Downloading

In [None]:
name = '%s%s' % (target_language2, source_language)

# Create the config
config = """
name: "{target_language2}{source_language}_reverse_transformer"

data:
    src: "{target_language2}"
    trg: "{source_language}"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/train.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/vocab.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    #load_model: "{gdrive_path}/models/{name}_transformer/1.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam scheduling
    patience: 5                     # For plateau: decrease learning rate by decrease_factor if validation score has not improved for this many validation rounds.
    learning_rate_factor: 0.5       # factor for Noam scheduler (used with Transformer)
    learning_rate_warmup: 1000      # warmup steps for Noam scheduler (used with Transformer)
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 4096
    batch_type: "token"
    eval_batch_size: 2000
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 30                  # TODO: Decrease when experimenting or checking that training works. Around 30 epochs is sufficient to check whether it is working at all.
    validation_freq: 5000         # TODO: Set to at least once per epoch.
    logging_freq: 100
    eval_metric: "bleu"
    model_dir: "models/{name}_reverse_transformer"
    overwrite: True              # TODO: Set to True if you want to overwrite possibly existing models. 
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_last_ckpts: 3

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4             # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256   # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4              # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256    # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
""".format(name=name, gdrive_path="/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda", source_language=source_language, target_language2=target_language2)
with open("joeynmt/configs/transformer_reverse_{name}.yaml".format(name=name),'w') as f:
    f.write(config)
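The `scheduling: "plateau"` settings in this config work together: the learning rate is multiplied by `decrease_factor` once the monitored validation score has stopped improving for longer than `patience` validation rounds, and it is never lowered below `learning_rate_min`. The sketch below is a simplified stand-in for that rule, not JoeyNMT's or PyTorch's implementation. `plateau_schedule` is a hypothetical name, and the code assumes a lower-is-better score such as perplexity and a reduction that triggers when the count of non-improving rounds exceeds `patience`.

```python
def plateau_schedule(scores, lr=3e-4, patience=5, factor=0.7, min_lr=1e-8):
    """Return the learning rate in effect after each validation score."""
    best = float("inf")
    bad_rounds = 0
    history = []
    for score in scores:
        if score < best:
            best, bad_rounds = score, 0   # improvement resets the counter
        else:
            bad_rounds += 1
        if bad_rounds > patience:         # assumed trigger condition
            lr = max(lr * factor, min_lr)
            bad_rounds = 0
        history.append(lr)
    return history

# Illustrative sequences, not taken from a real run.
improving = [5.69, 5.72, 5.67, 5.62, 5.61, 5.61, 5.57, 5.55, 5.57]
stalled = [5.0] + [5.1] * 6

print(plateau_schedule(improving)[-1])  # no long stall, so the rate is unchanged
print(plateau_schedule(stalled)[-1])    # six stalled rounds trigger one reduction
```

With `validation_freq: 5000` and `patience: 5`, a reduction under this rule needs tens of thousands of steps without a new best score.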

In [None]:
# Train the model
!cd joeynmt; python3 -m joeynmt train configs/transformer_reverse_$tgt2$src.yaml

2021-08-01 21:43:46,916 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-08-01 21:43:46,984 - INFO - joeynmt.data - Loading training data...
2021-08-01 21:43:56,063 - INFO - joeynmt.data - Building vocabulary...
2021-08-01 21:43:56,709 - INFO - joeynmt.data - Loading dev data...
2021-08-01 21:43:57,401 - INFO - joeynmt.data - Loading test data...
2021-08-01 21:43:58,064 - INFO - joeynmt.data - Data loaded.
2021-08-01 21:43:58,064 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-08-01 21:43:58,288 - INFO - joeynmt.model - Enc-dec model built.
2021-08-01 21:43:58.540201: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-01 21:44:01,479 - INFO - joeynmt.training - Total params: 12177664
2021-08-01 21:44:05,154 - INFO - joeynmt.helpers - cfg.name                           : rwen_reverse_transformer
2021-08-01 21:44:05,154 - INFO - joeynmt.helpers - cfg.data.src                       : r

5.5 epochs done
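
The `Total params: 12177664` line in the training log can be checked against the config. The sketch below counts parameters for a textbook transformer with the sizes from the config (`hidden_size: 256`, `ff_size: 1024`, 6 layers per stack) and then solves for the number of embedding rows that remain. It rests on assumptions the notebook does not state: a bias on every linear layer, two layer norms per encoder layer and three per decoder layer, one final layer norm per stack, and a single embedding matrix shared by source, target, and output layer (as `tied_embeddings` and `tied_softmax` suggest).

```python
def linear(n_in, n_out):
    return n_in * n_out + n_out              # weights + bias

def layer_norm(d):
    return 2 * d                             # scale + shift

def attention(d):
    return 4 * linear(d, d)                  # query, key, value, output projections

def feed_forward(d, ff):
    return linear(d, ff) + linear(ff, d)

def stack(d, ff, layers, attn_per_layer):
    # One layer norm per attention block plus one for the feed-forward block.
    per_layer = (attn_per_layer * attention(d)
                 + feed_forward(d, ff)
                 + (attn_per_layer + 1) * layer_norm(d))
    return layers * per_layer + layer_norm(d)  # plus the final layer norm

d, ff, layers = 256, 1024, 6
non_embedding = stack(d, ff, layers, 1) + stack(d, ff, layers, 2)  # encoder + decoder

total_params = 12177664                      # from the log above
implied_vocab, remainder = divmod(total_params - non_embedding, d)
print(non_embedding, implied_vocab, remainder)
```

Under these assumptions the remainder is zero and the implied joint vocabulary has 4365 entries, special tokens included. This is an inference from the log, not a figure reported in the notebook.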

In [None]:
# Print the validation log (loss, perplexity, BLEU per checkpoint)
! cat "joeynmt/models/rwen_reverse_transformer/validations.txt"

Steps: 5000	Loss: 369605.15625	PPL: 21.17395	bleu: 6.42552	LR: 0.00030000	*
Steps: 10000	Loss: 315656.06250	PPL: 13.56071	bleu: 11.47374	LR: 0.00030000	*
Steps: 15000	Loss: 283178.84375	PPL: 10.37013	bleu: 15.52942	LR: 0.00030000	*
Steps: 20000	Loss: 264551.62500	PPL: 8.89133	bleu: 17.49981	LR: 0.00030000	*
Steps: 25000	Loss: 252179.82812	PPL: 8.02765	bleu: 19.33033	LR: 0.00030000	*
Steps: 30000	Loss: 243752.06250	PPL: 7.48785	bleu: 20.43871	LR: 0.00030000	*
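
The `Loss` and `PPL` columns are tied together. The logged values are consistent with perplexity being the exponential of the summed loss divided by the number of target tokens in the dev set, PPL = exp(Loss / N). The notebook does not state this relation, so the quick check below treats it as an assumption: it solves for N in two rows copied from the log above and compares the results.

```python
import math

# (step, loss, ppl) copied from the validation log above.
rows = [
    (5000, 369605.15625, 21.17395),
    (30000, 243752.06250, 7.48785),
]

implied = []
for step, loss, ppl in rows:
    n_tokens = loss / math.log(ppl)   # N = Loss / ln(PPL)
    implied.append(n_tokens)
    print(step, round(n_tokens))
```

Both rows imply roughly 121,000 tokens, which supports reading `Loss` as a sum over the whole dev set rather than a per-token average. That is also why the Luganda and Kinyarwanda loss values are not directly comparable while the perplexities are.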


In [None]:
# Reloading configuration file
ckpt_number = 30000

reload_config = config.replace(
    f'#load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/models/rwen_transformer/1.ckpt"', 
    f'load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/{name}_reverse_transformer/{ckpt_number}.ckpt"').replace(
        f'model_dir: "models/rwen_reverse_transformer"', f'model_dir: "models/rwen_reverse_transformer_continued"').replace(
            f'epochs: 30', f'epochs: 25')
with open("joeynmt/configs/transformer_reverse_{name}_reload.yaml".format(name=name),'w') as f:
    f.write(reload_config)


In [None]:
!cat "joeynmt/configs/transformer_reverse_{name}_reload.yaml"


name: "rwen_reverse_transformer"

data:
    src: "rw"
    trg: "en"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/train.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/vocab.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/rwen_reverse_transformer/30000.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching fr

In [None]:
# Train the model
!cd joeynmt; python3 -m joeynmt train configs/transformer_reverse_rwen_reload.yaml

2021-08-02 06:56:51,535 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-08-02 06:56:51,568 - INFO - joeynmt.data - Loading training data...
2021-08-02 06:56:59,560 - INFO - joeynmt.data - Building vocabulary...
2021-08-02 06:57:00,114 - INFO - joeynmt.data - Loading dev data...
2021-08-02 06:57:00,758 - INFO - joeynmt.data - Loading test data...
2021-08-02 06:57:01,307 - INFO - joeynmt.data - Data loaded.
2021-08-02 06:57:01,308 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-08-02 06:57:01,542 - INFO - joeynmt.model - Enc-dec model built.
2021-08-02 06:57:01.787151: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-02 06:57:04,913 - INFO - joeynmt.training - Total params: 12177664
2021-08-02 06:57:08,725 - INFO - joeynmt.training - Loading model from /content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/rwen_reverse_transformer/30000.ckpt
2021-08-02 06

25.5 epochs done

In [None]:
# Print the validation log (loss, perplexity, BLEU per checkpoint)
! cat "joeynmt/models/rwen_reverse_transformer_continued/validations.txt"

Steps: 35000	Loss: 236982.75000	PPL: 7.08068	bleu: 21.07577	LR: 0.00030000	*
Steps: 40000	Loss: 231373.42188	PPL: 6.76011	bleu: 21.92058	LR: 0.00030000	*
Steps: 45000	Loss: 226930.53125	PPL: 6.51654	bleu: 22.44196	LR: 0.00030000	*
Steps: 50000	Loss: 224104.50000	PPL: 6.36619	bleu: 22.81634	LR: 0.00030000	*
Steps: 55000	Loss: 220764.64062	PPL: 6.19298	bleu: 23.03003	LR: 0.00030000	*
Steps: 60000	Loss: 218070.60938	PPL: 6.05670	bleu: 23.57888	LR: 0.00030000	*
Steps: 65000	Loss: 215245.06250	PPL: 5.91698	bleu: 23.98163	LR: 0.00030000	*
Steps: 70000	Loss: 213543.65625	PPL: 5.83441	bleu: 24.09171	LR: 0.00030000	*
Steps: 75000	Loss: 211926.60938	PPL: 5.75701	bleu: 24.59532	LR: 0.00030000	*
Steps: 80000	Loss: 209693.71875	PPL: 5.65181	bleu: 24.83026	LR: 0.00030000	*
Steps: 85000	Loss: 207986.29688	PPL: 5.57266	bleu: 24.82739	LR: 0.00030000	*
Steps: 90000	Loss: 206906.20312	PPL: 5.52317	bleu: 24.99484	LR: 0.00030000	*
Steps: 95000	Loss: 204909.06250	PPL: 5.43281	bleu: 25.03757	LR: 0.00030000	*

In [None]:
!cd joeynmt; python -m joeynmt translate 'models/rwen_reverse_transformer_continued/config.yaml' < "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/test.bpe.rw" > "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/translation.bpe.rw_en"

2021-08-03 07:02:21,339 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-08-03 07:02:26,033 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-08-03 07:02:26,258 - INFO - joeynmt.model - Enc-dec model built.


In [None]:
!cat "translation.bpe.rw_en" | sacrebleu "test.en"

sacreBLEU: That's 100 lines that end in a tokenized period ('.')
sacreBLEU: It looks like you forgot to detokenize your test data, which may hurt your score.
sacreBLEU: If you insist your data is detokenized, or don't care, you can suppress this message with '--force'.
BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+version.1.5.1 = 25.8 59.3/34.4/23.0/16.3 (BP = 0.870 ratio = 0.878 hyp_len = 74798 ref_len = 85182)
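
The `BP = 0.870` field is the brevity penalty. The system output is shorter than the reference (`ratio = 0.878`), so BLEU is scaled down by exp(1 - ref_len / hyp_len). A small check with the lengths reported above:

```python
import math

def brevity_penalty(hyp_len, ref_len):
    """BLEU brevity penalty: 1.0 unless the hypothesis is shorter than the reference."""
    if hyp_len == 0:
        return 0.0
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)

hyp_len, ref_len = 74798, 85182       # from the sacreBLEU line above
bp = brevity_penalty(hyp_len, ref_len)
print(f"ratio = {hyp_len / ref_len:.3f}, BP = {bp:.3f}")   # ratio = 0.878, BP = 0.870
```

A penalty this far below 1.0 means the translations are systematically short, so part of the gap to a higher BLEU comes from length rather than word choice.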


In [None]:
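# Reloading configuration file
ckpt_number = 155000

# NOTE: `config` sets model_dir to "models/rwen_reverse_transformer", so the
# model_dir pattern below (".../rwen_reverse_transformer2") matches nothing and
# this run keeps writing to the original model_dir.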
reload_config = config.replace(
    f'#load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/models/rwen_transformer/1.ckpt"', 
    f'load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/{name}_reverse_transformer_continued/{ckpt_number}.ckpt"').replace(
        f'model_dir: "models/rwen_reverse_transformer2"', f'model_dir: "models/rwen_reverse_transformer2_continued"')
with open("joeynmt/configs/transformer_reverse_{name}_reload2.yaml".format(name=name),'w') as f:
    f.write(reload_config)
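
`str.replace` returns its input unchanged when the search string is absent, so a typo in one of the patterns above silently leaves that setting at its original value. A minimal sketch of a stricter helper follows; `replace_checked` is a hypothetical name, and the two-line `toy_config` stands in for the real template.

```python
def replace_checked(text, old, new):
    """Like str.replace, but fail loudly when the pattern is not present."""
    if old not in text:
        raise ValueError(f"pattern not found in config: {old!r}")
    return text.replace(old, new)

# Toy stand-in for the real config template.
toy_config = 'epochs: 30\nmodel_dir: "models/rwen_reverse_transformer"\n'

updated = replace_checked(toy_config, "epochs: 30", "epochs: 10")

try:
    # This pattern has an extra "2" and therefore does not occur in toy_config.
    replace_checked(toy_config,
                    'model_dir: "models/rwen_reverse_transformer2"',
                    'model_dir: "models/rwen_reverse_transformer2_continued"')
    missing_detected = False
except ValueError:
    missing_detected = True

print(updated)
print("missing pattern detected:", missing_detected)
```

Because the template sets `overwrite: True`, a missed `model_dir` replacement means the continued run writes into the directory of the run it was loaded from.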

In [None]:
!cat "joeynmt/configs/transformer_reverse_{name}_reload2.yaml"


name: "rwen_reverse_transformer"

data:
    src: "rw"
    trg: "en"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/train.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/vocab.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/rwen_reverse_transformer_continued/155000.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try s

In [None]:
# Train the model
!cd joeynmt; python3 -m joeynmt train configs/transformer_reverse_rwen_reload2.yaml

2021-08-03 07:13:43,869 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-08-03 07:13:43,906 - INFO - joeynmt.data - Loading training data...
2021-08-03 07:13:54,094 - INFO - joeynmt.data - Building vocabulary...
2021-08-03 07:13:54,380 - INFO - joeynmt.data - Loading dev data...
2021-08-03 07:13:55,487 - INFO - joeynmt.data - Loading test data...
2021-08-03 07:13:56,313 - INFO - joeynmt.data - Data loaded.
2021-08-03 07:13:56,313 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-08-03 07:13:56,513 - INFO - joeynmt.model - Enc-dec model built.
2021-08-03 07:13:56.783953: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-03 07:13:58,343 - INFO - joeynmt.training - Total params: 12177664
2021-08-03 07:14:01,747 - INFO - joeynmt.training - Loading model from /content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/rwen_reverse_transformer_continued/155000.ckpt

20.5 epochs done

In [None]:
# Print the validation log (loss, perplexity, BLEU per checkpoint); the reload2 run wrote to the original model_dir
! cat "joeynmt/models/rwen_reverse_transformer/validations.txt"

Steps: 160000	Loss: 194533.73438	PPL: 4.98663	bleu: 26.36352	LR: 0.00030000	*
Steps: 165000	Loss: 193757.01562	PPL: 4.95474	bleu: 26.37000	LR: 0.00030000	*
Steps: 170000	Loss: 193487.07812	PPL: 4.94371	bleu: 26.52795	LR: 0.00030000	*
Steps: 175000	Loss: 192673.14062	PPL: 4.91058	bleu: 26.41083	LR: 0.00030000	*
Steps: 180000	Loss: 192016.76562	PPL: 4.88403	bleu: 26.90800	LR: 0.00030000	*
Steps: 185000	Loss: 192402.76562	PPL: 4.89963	bleu: 26.86276	LR: 0.00030000	
Steps: 190000	Loss: 191457.79688	PPL: 4.86154	bleu: 27.04945	LR: 0.00030000	*
Steps: 195000	Loss: 190923.75000	PPL: 4.84014	bleu: 26.96196	LR: 0.00030000	*
Steps: 200000	Loss: 191517.00000	PPL: 4.86392	bleu: 26.88936	LR: 0.00030000	
Steps: 205000	Loss: 190386.06250	PPL: 4.81869	bleu: 26.96544	LR: 0.00030000	*
Steps: 210000	Loss: 190259.48438	PPL: 4.81366	bleu: 26.85295	LR: 0.00030000	*
Steps: 215000	Loss: 190039.75000	PPL: 4.80493	bleu: 27.04768	LR: 0.00030000	*
Steps: 220000	Loss: 190176.12500	PPL: 4.81035	bleu: 27.14828	LR: 0

In [None]:
# Reloading configuration file
ckpt_number = 260000

reload_config = config.replace(
    f'#load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/models/rwen_transformer/1.ckpt"', 
    f'load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/{name}_reverse_transformer/{ckpt_number}.ckpt"').replace(
        f'model_dir: "models/rwen_reverse_transformer"', f'model_dir: "models/rwen_reverse_transformer_continued2"').replace(
            f'epochs: 30', f'epochs: 10')
with open("joeynmt/configs/transformer_reverse_{name}_reload3.yaml".format(name=name),'w') as f:
    f.write(reload_config)

In [None]:
!cat "joeynmt/configs/transformer_reverse_{name}_reload3.yaml"


name: "rwen_reverse_transformer"

data:
    src: "rw"
    trg: "en"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/train.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/vocab.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/rwen_reverse_transformer/260000.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching f

In [None]:
# Train the model
!cd joeynmt; python3 -m joeynmt train configs/transformer_reverse_rwen_reload3.yaml

2021-08-03 15:10:15,726 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-08-03 15:10:15,813 - INFO - joeynmt.data - Loading training data...
2021-08-03 15:10:26,650 - INFO - joeynmt.data - Building vocabulary...
2021-08-03 15:10:27,478 - INFO - joeynmt.data - Loading dev data...
2021-08-03 15:10:28,651 - INFO - joeynmt.data - Loading test data...
2021-08-03 15:10:29,787 - INFO - joeynmt.data - Data loaded.
2021-08-03 15:10:29,788 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-08-03 15:10:30,058 - INFO - joeynmt.model - Enc-dec model built.
2021-08-03 15:10:30.320043: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-03 15:10:33,965 - INFO - joeynmt.training - Total params: 12177664
2021-08-03 15:10:36,208 - INFO - joeynmt.training - Loading model from /content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/joeynmt/models/rwen_reverse_transformer/260000.ckpt

9.5 epochs done

In [None]:
# Print the validation log (loss, perplexity, BLEU per checkpoint)
! cat "joeynmt/models/rwen_reverse_transformer_continued2/validations.txt"

Steps: 265000	Loss: 186668.78125	PPL: 4.67299	bleu: 27.62741	LR: 0.00030000	*
Steps: 270000	Loss: 186920.09375	PPL: 4.68270	bleu: 27.46330	LR: 0.00030000	
Steps: 275000	Loss: 186200.25000	PPL: 4.65494	bleu: 27.48177	LR: 0.00030000	*
Steps: 280000	Loss: 186035.54688	PPL: 4.64862	bleu: 27.39439	LR: 0.00030000	*
Steps: 285000	Loss: 186202.71875	PPL: 4.65504	bleu: 27.76276	LR: 0.00030000	
Steps: 290000	Loss: 185566.21875	PPL: 4.63063	bleu: 27.73543	LR: 0.00030000	*
Steps: 295000	Loss: 185271.35938	PPL: 4.61937	bleu: 27.66582	LR: 0.00030000	*
Steps: 300000	Loss: 184974.01562	PPL: 4.60804	bleu: 27.81265	LR: 0.00030000	*
Steps: 305000	Loss: 184685.51562	PPL: 4.59707	bleu: 27.79174	LR: 0.00030000	*


In [None]:
!cd joeynmt; python -m joeynmt translate 'models/rwen_reverse_transformer_continued2/config.yaml' < "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/test.bpe.rw" > "/content/gdrive/Shared drives/NMT_for_African_Language/Kinyarwanda/translation2.bpe.rw_en"

2021-08-03 19:54:02,056 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-08-03 19:54:04,931 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-08-03 19:54:05,188 - INFO - joeynmt.model - Enc-dec model built.


In [None]:
!cat "translation2.bpe.rw_en" | sacrebleu "test.en"

sacreBLEU: That's 100 lines that end in a tokenized period ('.')
sacreBLEU: It looks like you forgot to detokenize your test data, which may hurt your score.
sacreBLEU: If you insist your data is detokenized, or don't care, you can suppress this message with '--force'.
BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+version.1.5.1 = 27.1 60.2/35.7/24.4/17.6 (BP = 0.875 ratio = 0.883 hyp_len = 75182 ref_len = 85182)
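
The signature line also lists the four n-gram precisions (`60.2/35.7/24.4/17.6`). Corpus BLEU is the brevity penalty times the geometric mean of those precisions, so the headline score can be approximately reconstructed from the printed fields. Because the precisions are rounded to one decimal, the result agrees only to about the first decimal.

```python
import math

precisions = [60.2, 35.7, 24.4, 17.6]   # 1- to 4-gram precisions, in percent
hyp_len, ref_len = 75182, 85182         # from the sacreBLEU line above

# Geometric mean of the precisions, computed in log space.
geo_mean = math.exp(sum(math.log(p / 100.0) for p in precisions) / len(precisions))

# Brevity penalty applies only when the hypothesis is shorter than the reference.
bp = math.exp(1.0 - ref_len / hyp_len) if hyp_len < ref_len else 1.0

bleu = 100.0 * bp * geo_mean
print(f"BP = {bp:.3f}, BLEU ~ {bleu:.1f}")
```

Without the length penalty the geometric mean alone would be close to 31, so the short outputs cost roughly four BLEU points in this run.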


#### Sample translations

In [40]:
# Candidates
! head "translation2.bpe.rw_en"

Although my parents did not become baptized Witnesses , they soon concluded that the Catholic teachings were not in harmony with the Bible .
Why is it important to seek time for communication ?
Would God resolve this problem in that Adam would continue to live in pleasure and goodness ?
Yoonhee : I learned that I was pregnant !
; Degandt , B .
However , I feel comforted when I meditate on all the things we have done for 45 years .
They stood , and they suffered in the face .
The End of the Christian Congregation
What counsel may we have been given when we were about to make decisions ?
These crafty tactics that Satan used were really a thief .


In [41]:
# References
! head "test.en"

Though my parents never became baptized Witnesses , they soon concluded that the teachings of the Catholic Church were not in harmony with the Bible .
Why is it important to make time for communication ?
Could God solve this problem in such a way as to ensure Adam’s continued happiness and welfare ?
Yoonhee : I was devastated ​ — and scared !
; Degandt , B .
I take considerable comfort , though , in what was accomplished in the 45 years we were together .
And they stood still with sad faces .
Christian Congregation Affected
We likely have received what advice when we faced a decision ?
This sly approach exposed Satan for what he really is ​ — a devious intruder .


In [42]:
# Source
! head "test.rw"

N’ubwo ababyeyi banjye batigeze baba Abahamya babatijwe , bidatinze bageze ku mwanzuro w’uko inyigisho za kiliziya Gatolika zitari zihuje na Bibiliya .
Kuki gushaka igihe cyo gushyikirana ari iby’ingenzi ?
Mbese , Imana yashoboraga gukemura icyo kibazo mu buryo bw’uko Adamu yari gukomeza kubaho mu munezero no kugubwa neza ?
Yoonhee : Maze kumenya ko ntwite nagize ubwoba !
; Degandt , B .
Icyakora , iyo ntekereje ibintu byose twakoranye mu myaka 45 , numva mpumurijwe .
Nuko barahagarara , bafite umubabaro mu maso .
Ingaruka ku Itorero rya Gikristo
Ni iyihe nama dushobora kuba twarahawe igihe twari tugiye gufata umwanzuro ?
Ayo mayeri Satani yakoresheje , yagaragaje uwo ari we by’ukuri , ni umujura wuzuye ubucakura .


## Luhyia

In [20]:
# Changing to Luhyia directory
os.chdir("/content/gdrive/Shared drives/NMT_for_African_Language/Luhya")

In [21]:
!pwd

/content/gdrive/Shareddrives/NMT_for_African_Language/Luhya


In [None]:
#! git clone https://github.com/joeynmt/joeynmt.git
! cd joeynmt; pip3 install .

Processing /content/gdrive/Shared drives/NMT_for_African_Language/Luhya/joeynmt
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Collecting numpy==1.20.1
  Downloading numpy-1.20.1-cp37-cp37m-manylinux2010_x86_64.whl (15.3 MB)
Collecting torchtext==0.9.0
  Downloading torchtext-0.9.0-cp37-cp37m-manylinux1_x86_64.whl (7.1 MB)
Collecting sacrebleu>=1.3.6
  Downloading sacrebleu-1.5.1-py3-none-any.whl (54 kB)
Collecting subword-nmt
  Downloading subwo

In [22]:
#@title
name = '%s%s' % (target_language3, source_language)
path = "/content/gdrive/Shared drives/NMT_for_African_Language/Luhya"

# Create the config
config = """
name: "{target_language3}{source_language}_reverse_transformer"

data:
    src: "{target_language3}"
    trg: "{source_language}"
    train: "{path}/train.bpe"
    dev:   "{path}/dev.bpe"
    test:  "{path}/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "{path}/vocab.txt"
    trg_vocab: "{path}/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    #load_model: "{path}/models/{name}_transformer/1.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"          
    patience: 5                     
    learning_rate_factor: 0.5       
    learning_rate_warmup: 1000      
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 1096
    batch_type: "token"
    eval_batch_size: 1600
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 30                  
    validation_freq: 200         # Set to at least once per epoch.
    logging_freq: 100
    eval_metric: "bleu"
    model_dir: "models/{name}_reverse_transformer"
    overwrite: True             # Set to True if you want to overwrite possibly existing models. 
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_last_ckpts: 3

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4             
        embeddings:
            embedding_dim: 256   
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         
        ff_size: 1024            
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4              
        embeddings:
            embedding_dim: 256    
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         
        ff_size: 1024            
        dropout: 0.3
""".format(name=name, path=path, source_language=source_language, target_language3=target_language3)
with open("joeynmt/configs/transformer_reverse_{name}.yaml".format(name=name),'w') as f:
    f.write(config)
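
Before training, it can help to sanity-check the model dimensions chosen in the config. The sketch below is a hypothetical helper (`check_transformer_dims` is not part of JoeyNMT). It checks that `hidden_size` splits evenly across `num_heads`, which multi-head attention requires, and that `embedding_dim` equals `hidden_size`, which is an assumption about what the toolkit expects and does hold for the values above.

```python
def check_transformer_dims(hidden_size, num_heads, ff_size, embedding_dim):
    """Validate transformer sizes and report the per-head width and FF ratio."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    if embedding_dim != hidden_size:
        # Assumption: embeddings feed the first layer without a projection.
        raise ValueError("embedding_dim is expected to equal hidden_size")
    return {
        "head_dim": hidden_size // num_heads,  # width of each attention head
        "ff_ratio": ff_size / hidden_size,     # the config comment suggests 4
    }

# Values from the encoder and decoder blocks of the config above.
dims = check_transformer_dims(hidden_size=256, num_heads=4, ff_size=1024, embedding_dim=256)
print(dims)
```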

In [23]:
# Train the model
!cd joeynmt; python3 -m joeynmt train configs/transformer_reverse_$tgt3$src.yaml

2021-08-04 07:16:17,936 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-08-04 07:16:18,542 - INFO - joeynmt.data - Loading training data...
2021-08-04 07:16:21,126 - INFO - joeynmt.data - Building vocabulary...
2021-08-04 07:16:22,152 - INFO - joeynmt.data - Loading dev data...
2021-08-04 07:16:23,540 - INFO - joeynmt.data - Loading test data...
2021-08-04 07:16:24,855 - INFO - joeynmt.data - Data loaded.
2021-08-04 07:16:24,855 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-08-04 07:16:25,116 - INFO - joeynmt.model - Enc-dec model built.
2021-08-04 07:16:25.302320: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-04 07:16:27,056 - INFO - joeynmt.training - Total params: 12099840
2021-08-04 07:16:29,276 - INFO - joeynmt.helpers - cfg.name                           : lhen_reverse_transformer
2021-08-04 07:16:29,276 - INFO - joeynmt.helpers - cfg.data.src                       : l

In [24]:
# Show the validation scores (loss, perplexity, BLEU) logged during training
! cat "joeynmt/models/lhen_reverse_transformer/validations.txt"

Steps: 200	Loss: 13427.67969	PPL: 147.46030	bleu: 0.12029	LR: 0.00030000	*
Steps: 400	Loss: 12647.71094	PPL: 110.33264	bleu: 0.03086	LR: 0.00030000	*
Steps: 600	Loss: 11534.63281	PPL: 72.93453	bleu: 0.54131	LR: 0.00030000	*
Steps: 800	Loss: 11055.41406	PPL: 61.02891	bleu: 0.44899	LR: 0.00030000	*
Steps: 1000	Loss: 10761.14355	PPL: 54.70267	bleu: 1.30696	LR: 0.00030000	*
Steps: 1200	Loss: 10515.99219	PPL: 49.93612	bleu: 1.08567	LR: 0.00030000	*
Steps: 1400	Loss: 10267.27930	PPL: 45.52456	bleu: 1.75825	LR: 0.00030000	*
Steps: 1600	Loss: 10147.88672	PPL: 43.54746	bleu: 1.83333	LR: 0.00030000	*
Steps: 1800	Loss: 9945.24609	PPL: 40.38637	bleu: 2.22384	LR: 0.00030000	*
Steps: 2000	Loss: 9796.03613	PPL: 38.20642	bleu: 2.90277	LR: 0.00030000	*
Steps: 2200	Loss: 9678.15332	PPL: 36.56767	bleu: 3.10603	LR: 0.00030000	*
Steps: 2400	Loss: 9538.60742	PPL: 34.71838	bleu: 2.35361	LR: 0.00030000	*
Steps: 2600	Loss: 9480.06152	PPL: 33.97065	bleu: 3.59264	LR: 0.00030000	*
Steps: 2800	Loss: 9280.22656	PPL
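
A log in this format can be parsed to choose which checkpoint to continue from. The sketch below is illustrative only: the regular expression and the `parse_validations` name are assumptions based on the lines printed above, and the sample rows are copied from that output.

```python
import re

# Pattern inferred from the printed lines; fields are tab-separated.
LINE_RE = re.compile(
    r"Steps:\s*(\d+)\s+Loss:\s*([\d.]+)\s+PPL:\s*([\d.]+)\s+bleu:\s*([\d.]+)\s+LR:\s*([\d.]+)"
)

def parse_validations(lines):
    """Turn validation log lines into dicts, skipping blank or truncated lines."""
    rows = []
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            rows.append({
                "step": int(m.group(1)),
                "loss": float(m.group(2)),
                "ppl": float(m.group(3)),
                "bleu": float(m.group(4)),
                "lr": float(m.group(5)),
            })
    return rows

sample = [
    "Steps: 2200\tLoss: 9678.15332\tPPL: 36.56767\tbleu: 3.10603\tLR: 0.00030000\t*",
    "Steps: 2400\tLoss: 9538.60742\tPPL: 34.71838\tbleu: 2.35361\tLR: 0.00030000\t*",
    "Steps: 2600\tLoss: 9480.06152\tPPL: 33.97065\tbleu: 3.59264\tLR: 0.00030000\t*",
]
rows = parse_validations(sample)
best_ppl = min(rows, key=lambda r: r["ppl"])    # lower perplexity is better
best_bleu = max(rows, key=lambda r: r["bleu"])  # higher BLEU is better
print(best_ppl["step"], best_bleu["step"])
```

To run it on the real file, pass the open file handle instead of `sample`.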

In [26]:
# Build a config that resumes training from a saved checkpoint
ckpt_number = 10400
reload_config = config.replace(
    f'#load_model: "{path}/models/{name}_transformer/1.ckpt"', 
    f'load_model: "{path}/joeynmt/models/{name}_reverse_transformer/{ckpt_number}.ckpt"').replace(
        f'model_dir: "models/{name}_reverse_transformer"', f'model_dir: "models/{name}_reverse_transformer_continued"')
with open("joeynmt/configs/transformer_{name}_reload.yaml".format(name=name),'w') as f:
    f.write(reload_config)
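
`str.replace` returns the text unchanged when the search string is not found. A small typo in the pattern would therefore restart training from scratch without any error. The sketch below shows one way to guard against that; `replace_exactly_once` is a hypothetical helper, not something the notebook or JoeyNMT provides.

```python
def replace_exactly_once(text, old, new):
    """Replace `old` with `new`, failing loudly unless `old` occurs exactly once."""
    count = text.count(old)
    if count != 1:
        raise ValueError(f"expected exactly one occurrence of {old!r}, found {count}")
    return text.replace(old, new)

# Toy config fragment used only to demonstrate the helper.
toy_config = 'training:\n    #load_model: "models/1.ckpt"\n    random_seed: 42\n'
patched = replace_exactly_once(
    toy_config, '#load_model: "models/1.ckpt"', 'load_model: "models/10400.ckpt"'
)
print(patched)
```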

In [27]:
!cat "joeynmt/configs/transformer_lhen_reload.yaml"


name: "lhen_reverse_transformer"

data:
    src: "lh"
    trg: "en"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhya/train.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Luhya/dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Luhya/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhya/vocab.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhya/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhya/joeynmt/models/lhen_reverse_transformer/10400.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"          
    patience: 5                     
    learning_rate_facto

In [28]:
# Training continued
!cd joeynmt; python3 -m joeynmt train configs/transformer_lhen_reload.yaml

2021-08-04 07:46:24,224 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-08-04 07:46:24,251 - INFO - joeynmt.data - Loading training data...
2021-08-04 07:46:24,365 - INFO - joeynmt.data - Building vocabulary...
2021-08-04 07:46:24,636 - INFO - joeynmt.data - Loading dev data...
2021-08-04 07:46:24,640 - INFO - joeynmt.data - Loading test data...
2021-08-04 07:46:24,648 - INFO - joeynmt.data - Data loaded.
2021-08-04 07:46:24,649 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-08-04 07:46:24,903 - INFO - joeynmt.model - Enc-dec model built.
2021-08-04 07:46:25.095565: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-08-04 07:46:26,812 - INFO - joeynmt.training - Total params: 12099840
2021-08-04 07:46:28,927 - INFO - joeynmt.training - Loading model from /content/gdrive/Shared drives/NMT_for_African_Language/Luhya/joeynmt/models/lhen_reverse_transformer/10400.ckpt
2021-08-04 07:46:29

In [29]:
! cat "joeynmt/models/lhen_reverse_transformer_continued/validations.txt"

Steps: 10600	Loss: 7617.05957	PPL: 16.99083	bleu: 8.62748	LR: 0.00030000	
Steps: 10800	Loss: 7638.00293	PPL: 17.12368	bleu: 9.12874	LR: 0.00030000	
Steps: 11000	Loss: 7664.13281	PPL: 17.29088	bleu: 8.35299	LR: 0.00030000	
Steps: 11200	Loss: 7594.65723	PPL: 16.84986	bleu: 8.34067	LR: 0.00030000	*
Steps: 11400	Loss: 7604.81201	PPL: 16.91362	bleu: 8.94634	LR: 0.00030000	
Steps: 11600	Loss: 7596.60742	PPL: 16.86209	bleu: 9.35923	LR: 0.00030000	
Steps: 11800	Loss: 7614.91504	PPL: 16.97728	bleu: 8.90905	LR: 0.00030000	
Steps: 12000	Loss: 7560.38281	PPL: 16.63646	bleu: 9.33145	LR: 0.00030000	*
Steps: 12200	Loss: 7593.83594	PPL: 16.84472	bleu: 8.53919	LR: 0.00030000	
Steps: 12400	Loss: 7621.66943	PPL: 17.01998	bleu: 9.02539	LR: 0.00030000	
Steps: 12600	Loss: 7559.97363	PPL: 16.63392	bleu: 9.00430	LR: 0.00030000	*
Steps: 12800	Loss: 7597.33887	PPL: 16.86668	bleu: 9.35436	LR: 0.00030000	
Steps: 13000	Loss: 7606.21094	PPL: 16.92242	bleu: 9.21840	LR: 0.00030000	
Steps: 13200	Loss: 7605.77197	PPL: 

#### Sample translations

In [31]:
# Candidates
! head "joeynmt/models/lhen_reverse_transformer_continued/00017600.hyps.test"

Now I know that I am a son of our brother , and sent him away from my brother to death .
But we have no need of God for us , that we might be justified by the grace of God through Jesus Christ ,
For I remember that the gospel of the gospel which was in me , that I might wish with you in the beginning
For He taught them as some of the scribes , but they did not receive authority .
Paul , an apostle of Jesus Christ , according to the will of God our Lord Jesus Christ ,
Now when the disciples had come into the middle of the boat , they saw the linen cloth lying on the sea , and they came to Him .
that you may walk according to the flesh , according to the lust of a good man , according to the resurrection of Christ ,
Therefore , “ If anyone says , “ Lord , if we are willing , let us eat such things . ”
You are chosen by God , who is sanctified by Him who is sanctified by the flesh ?
“ I have many things to you , but you do not seek Me .


In [34]:
# References
! head "test.en"

Know that our brother Timothy has been set free , with whom I shall see you if he comes shortly .
And not only that , but we also rejoice in God through our Lord Jesus Christ , through whom we have now received the reconciliation .
You know that because of physical infirmity I preached the gospel to you at the first .
for He taught them as one having authority , and not as the scribes .
Paul , an apostle of Jesus Christ by the will of God , and Timothy our brother ,
So when they had rowed about three or four miles , they saw Jesus walking on the sea and drawing near the boat ; and they were afraid .
that you may approve the things that are excellent , that you may be sincere and without offense till the day of Christ ,
Instead you ought to say , “ If the Lord wills , we shall live and do this or that . ”
Foolish ones ! Did not He who made the outside make the inside also ?
“ I still have many things to say to you , but you can not bear them now .


In [32]:
# Source
! head "test.lh"

Ndenya , mumanye mbu omwana wefwe Timotseyo yaboololwe , okhurula mumbohe . Naba niyakhetsa lwangu nembe , ninaye olwa nanditse okhumulola inywe .
Nebutswa shikali ako konyene ta ; khwikhoyanga khulwa okhubela aka Nyasaye yakhukholelaokhubirira mu Yesu , Kristo , owakhukholile bulano okhuba abetsa ba Nyasaye . Adamu nende Kristo ,
Mwitsulilanga eshiachila isie , nemuyaalila Injiili olwambeli lwene ; kali shichila mbu , ndali nindwala .
okhuba shiyabeechesia shinga abeechesia bandi , bamalako bechesinjia ta , habula yabeechesia nobunyali .
Eyirula khwisie Paulo , ouli omurume wa Yesu Kristo , khulwa okhwenya khwa Nyasaye , khandi okhurula khu , Timotseyo omwana wefwe .
Ne olwa abeechi bali nibafuchile , eliaro oluchendo oluhela tsimailo tsitaru noho tsine , balola , Yesu nachendanga khumaatsi , niyetsa ahambi nende , eliaro , nibaria muno .
kho , munyalilwe okhwahula akali amalayi okhushila . Mana , mulaba abalekhuule okhurula mushifwabi shiosi , khandi , okhubula eshikha khunyanga eya 

### Reverse model

In [None]:
#@title
name = '%s%s' % (source_language, target_language3)

# Create the config
config = """
name: "{source_language}{target_language3}_transformer"

data:
    src: "{source_language}"
    trg: "{target_language3}"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/train.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    #load_model: "joeynmt/models/{name}_transformer/1.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam scheduling
    patience: 5                     # For plateau: decrease learning rate by decrease_factor if validation score has not improved for this many validation rounds.
    learning_rate_factor: 0.5       # factor for Noam scheduler (used with Transformer)
    learning_rate_warmup: 1000      # warmup steps for Noam scheduler (used with Transformer)
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 1096
    batch_type: "token"
    eval_batch_size: 3600
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 30                  # TODO: Decrease while experimenting and checking that things work. Around 30 is sufficient to check if it's working at all
    validation_freq: 200         # TODO: Set to at least once per epoch.
    logging_freq: 100
    eval_metric: "bleu"
    model_dir: "models/{name}_transformer"
    overwrite: False
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_last_ckpts: 3

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4             # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256   # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4              # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256    # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
""".format(name=name, gdrive_path="/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia", source_language=source_language, target_language3=target_language3)
with open("joeynmt/configs/transformer_{name}.yaml".format(name=name),'w') as f:
    f.write(config)
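
The `scheduling: "plateau"`, `patience` and `decrease_factor` settings in the config describe a reduce-on-plateau rule. The sketch below is a simplified pure-Python illustration of that rule under stated assumptions: the monitored score is validation perplexity (lower is better), any decrease counts as an improvement, and the rate is cut once more than `patience` consecutive validations fail to improve. Real schedulers typically add an improvement threshold, so this is not the toolkit's exact behaviour. `plateau_schedule` is a made-up name.

```python
def plateau_schedule(ppls, lr=0.0003, patience=5, decrease_factor=0.7, lr_min=1e-8):
    """Return the learning rate in effect after each validation round."""
    best = float("inf")
    bad_rounds = 0
    lrs = []
    for ppl in ppls:
        if ppl < best:
            best = ppl       # new best score: reset the counter
            bad_rounds = 0
        else:
            bad_rounds += 1  # no improvement this round
        if bad_rounds > patience:
            lr = max(lr * decrease_factor, lr_min)  # never go below lr_min
            bad_rounds = 0
        lrs.append(lr)
    return lrs

# One improving round followed by six stagnant ones triggers a single cut.
lrs = plateau_schedule([10.0] + [11.0] * 6)
print(lrs)
```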

In [None]:
# Train the model
!cd joeynmt; python3 -m joeynmt train configs/transformer_$src$tgt3.yaml

2021-07-10 10:01:14,809 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-10 10:01:14,859 - INFO - joeynmt.data - Loading training data...
2021-07-10 10:01:16,638 - INFO - joeynmt.data - Building vocabulary...
2021-07-10 10:01:17,391 - INFO - joeynmt.data - Loading dev data...
2021-07-10 10:01:18,749 - INFO - joeynmt.data - Loading test data...
2021-07-10 10:01:20,134 - INFO - joeynmt.data - Data loaded.
2021-07-10 10:01:20,134 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-10 10:01:20,341 - INFO - joeynmt.model - Enc-dec model built.
2021-07-10 10:01:20.573978: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-10 10:01:22,413 - INFO - joeynmt.training - Total params: 12097024
2021-07-10 10:01:25,778 - INFO - joeynmt.helpers - cfg.name                           : enlh_transformer
2021-07-10 10:01:25,779 - INFO - joeynmt.helpers - cfg.data.src                       : en
2021-0

In [None]:
# Build a config that resumes training from a saved checkpoint
ckpt_number = 9000
reload_config = config.replace(
    f'#load_model: "joeynmt/models/{name}_transformer/1.ckpt"', 
    f'load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/joeynmt/models/{name}_transformer/{ckpt_number}.ckpt"').replace(
        f'model_dir: "models/{name}_transformer"', f'model_dir: "models/{name}_transformer_continued"')
with open("joeynmt/configs/transformer_{name}_reload.yaml".format(name=name),'w') as f:
    f.write(reload_config)

In [None]:
!cat "joeynmt/configs/transformer_enlh_reload.yaml"


name: "enlh_transformer"

data:
    src: "en"
    trg: "lh"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/train.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/joeynmt/models/enlh_transformer/9000.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam scheduling
    patience: 5  

In [None]:
# Training continued
!cd joeynmt; python3 -m joeynmt train configs/transformer_enlh_reload.yaml

2021-07-10 10:58:03,748 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-10 10:58:03,782 - INFO - joeynmt.data - Loading training data...
2021-07-10 10:58:03,854 - INFO - joeynmt.data - Building vocabulary...
2021-07-10 10:58:04,095 - INFO - joeynmt.data - Loading dev data...
2021-07-10 10:58:04,114 - INFO - joeynmt.data - Loading test data...
2021-07-10 10:58:04,127 - INFO - joeynmt.data - Data loaded.
2021-07-10 10:58:04,127 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-10 10:58:04,331 - INFO - joeynmt.model - Enc-dec model built.
2021-07-10 10:58:04.499348: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-10 10:58:05,951 - INFO - joeynmt.training - Total params: 12097024
2021-07-10 10:58:09,252 - INFO - joeynmt.training - Loading model from /content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/joeynmt/models/enlh_transformer/9000.ckpt
2021-07-10 10:58:09,707 - I

In [None]:
!cd joeynmt; python -m joeynmt test 'models/enlh_transformer_continued/config.yaml'

2021-07-18 09:06:34,004 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-18 09:06:34,802 - INFO - joeynmt.data - Building vocabulary...
2021-07-18 09:06:35,535 - INFO - joeynmt.data - Loading dev data...
2021-07-18 09:06:36,854 - INFO - joeynmt.data - Loading test data...
2021-07-18 09:06:38,258 - INFO - joeynmt.data - Data loaded.
2021-07-18 09:06:38,317 - INFO - joeynmt.prediction - Process device: cuda, n_gpu: 1, batch_size per device: 18000 (with beam_size)
2021-07-18 09:07:31,413 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-18 09:07:31,792 - INFO - joeynmt.model - Enc-dec model built.
2021-07-18 09:07:31,865 - INFO - joeynmt.prediction - Decoding on dev set (/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/dev.bpe.lh)...
2021-07-18 09:08:01,311 - INFO - joeynmt.prediction -  dev bleu[13a]:   6.39 [Beam search decoding with beam size = 5 and alpha = 1.0]
2021-07-18 09:08:01,311 - INFO - joeynmt.prediction - Decoding on test set (/

# Backtranslation

## Data preparation

In [None]:
# Changing to Luganda directory
os.chdir("/content/gdrive/Shared drives/NMT_for_African_Language/Luganda")

In [None]:
lug = pd.read_csv("Luganda.csv")
mon_en = pd.DataFrame(lug['source_sentence'])

In [None]:
mon_en.reset_index(drop=True,inplace=True)

In [None]:
mon_en

Unnamed: 0,source_sentence
0,This publication is not for sale .
1,COVER SUBJECT
2,The Bible was completed about two thousand yea...
3,"Since then , countless other books have come a..."
4,But not the Bible .
...,...
249490,Among these publishers today are third - gener...
249491,We give thanks to Jehovah and to those early f...
249492,"15 : 15 , 16 . ​ — From our archives in Portug..."
249493,See “ There Is More Harvest Work to Be Done ” ...


In [None]:
hasNum(mon_en['source_sentence'][0])

True

In [None]:
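import re

def hasNum(inputString):
  # NOTE: despite its name, this returns True when the sentence contains NO digits.
  # Rows marked True are the ones kept in the filtering step further down.
  text = str(inputString)  # avoid shadowing the built-in input()
  return not re.findall(r'\d+', text)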
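
The filter keeps only sentences without digits. The sketch below shows the same idea on plain Python strings. The name `is_digit_free` and the second sample sentence are made up for illustration; the other two sentences come from the data shown above.

```python
import re

DIGIT_RE = re.compile(r"\d")

def is_digit_free(sentence):
    """Return True when the sentence contains no digit characters."""
    return DIGIT_RE.search(str(sentence)) is None

sentences = [
    "This publication is not for sale .",
    "See chapter 15 .",  # hypothetical sentence containing digits
    "But not the Bible .",
]
kept = [s for s in sentences if is_digit_free(s)]
print(kept)
```

Naming the predicate after what it returns avoids the inverted reading that `hasNum` invites.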

In [None]:
hasNum("15 : 15 , 16 .")  # contains digits, so this returns False

In [None]:
mon_en['has_num'] = mon_en['source_sentence'].apply(hasNum)

In [None]:
mon_en.head(10)

Unnamed: 0,source_sentence,has_num
0,This publication is not for sale .,True
1,COVER SUBJECT,True
2,The Bible was completed about two thousand yea...,True
3,"Since then , countless other books have come a...",True
4,But not the Bible .,True
5,Consider the following .,True
6,The Bible has survived many vicious attacks by...,True
7,"For example , during the Middle Ages in certai...",True
8,Scholars who translated the Bible into the ver...,True
9,"Despite its many enemies , the Bible became ​ ...",True


In [None]:
mon_en.describe()

Unnamed: 0,source_sentence,has_num
count,247121,249495
unique,231428,2
top,*,True
freq,462,203930


In [None]:
mon_en

Unnamed: 0,source_sentence,has_num
0,This publication is not for sale .,True
1,COVER SUBJECT,True
2,The Bible was completed about two thousand yea...,True
3,"Since then , countless other books have come a...",True
4,But not the Bible .,True
...,...,...
249490,Among these publishers today are third - gener...,False
249491,We give thanks to Jehovah and to those early f...,True
249492,"15 : 15 , 16 . ​ — From our archives in Portug...",False
249493,See “ There Is More Harvest Work to Be Done ” ...,False


In [None]:
mon_en = mon_en[mon_en['has_num']].copy()  # keep digit-free sentences; copy so the in-place drop that follows acts on an independent frame

In [None]:
mon_en.drop(['has_num'], axis=1,inplace = True)

In [None]:
mon_en

Unnamed: 0,source_sentence
0,This publication is not for sale .
1,COVER SUBJECT
2,The Bible was completed about two thousand yea...
3,"Since then , countless other books have come a..."
4,But not the Bible .
...,...
249481,With the printing and distribution of Bible li...
249483,"However , the seeds of truth had been sown ."
249484,Amid the upheaval in Europe during the Spanish...
249486,"After that , the growth in the number of Kingd..."


In [None]:
# Monolingual English
#mon_en.to_csv('mon_en.csv',index=False) 

In [None]:
mon = pd.read_csv("mon_en.csv")
mon.head()

Unnamed: 0,source_sentence
0,This publication is not for sale .
1,COVER SUBJECT
2,The Bible was completed about two thousand yea...
3,"Since then , countless other books have come a..."
4,But not the Bible .


In [None]:
mon.isnull().sum()

source_sentence    0
dtype: int64

In [None]:
mon.dropna(inplace=True)

In [None]:
!pwd

/content/gdrive/Shareddrives/NMT_for_African_Language/Luganda


In [None]:
# Changing to Luhyia directory
os.chdir("/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia")

In [None]:
!pwd

/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia


In [None]:
# Write the monolingual English sentences to a file, then apply the existing BPE codes
with open("mon."+source_language, "w") as src_file:
  for index, row in mon.iterrows():
    src_file.write(row["source_sentence"]+"\n")

! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < mon.$src > mon.bpe.$src

In [None]:
! head mon.*

==> mon.bpe.en <==
This p@@ ub@@ li@@ cation is not for s@@ ale .
C@@ O@@ V@@ E@@ R S@@ U@@ B@@ J@@ E@@ C@@ T
The B@@ ible was comple@@ ted about two thousand years a@@ go .
S@@ in@@ ce then , coun@@ t@@ less other bo@@ ok@@ s have come and gone .
But not the B@@ ible .
C@@ on@@ si@@ der the follow@@ ing .
The B@@ ible has sur@@ v@@ ived many vi@@ ci@@ ous at@@ ta@@ c@@ ks by p@@ ow@@ er@@ ful people .
For exam@@ ple , d@@ ur@@ ing the M@@ id@@ d@@ le A@@ g@@ es in certain “ Chris@@ ti@@ an ” l@@ ands , “ the poss@@ ess@@ ion and read@@ ing of the B@@ ible in the ver@@ nac@@ ul@@ ar [ the l@@ ang@@ u@@ age of the comm@@ on people ] was in@@ cre@@ as@@ ingly as@@ so@@ ci@@ ated with her@@ es@@ y and dis@@ sent , ” says the book A@@ n I@@ n@@ t@@ ro@@ du@@ ction to the M@@ e@@ di@@ ev@@ al B@@ ible .
S@@ ch@@ ol@@ ars who trans@@ l@@ ated the B@@ ible into the ver@@ nac@@ ul@@ ar or who prom@@ o@@ ted B@@ ible st@@ ud@@ y ris@@ ked their lives . S@@ ome were killed .
D@@ es@@ p@@ ite its

In [None]:
!tail mon.*

==> mon.bpe.en <==
He sought per@@ mis@@ sion to use his h@@ ome for reg@@ ul@@ ar me@@ et@@ ings .
In ad@@ d@@ ition , through tr@@ ac@@ ts and bo@@ ok@@ le@@ ts , the word of truth sp@@ read to the fa@@ r re@@ ac@@ hes of the P@@ or@@ t@@ u@@ gu@@ es@@ e Em@@ p@@ ire ​ — A@@ ng@@ ol@@ a , the A@@ z@@ or@@ es , C@@ ap@@ e V@@ er@@ de , E@@ ast T@@ im@@ or , G@@ o@@ a , M@@ a@@ de@@ ira , and M@@ o@@ z@@ am@@ bi@@ qu@@ e .
W@@ hi@@ le living in B@@ ra@@ z@@ il , he had heard a p@@ ub@@ li@@ c tal@@ k given by B@@ ro@@ ther Y@@ oung .
He read@@ ily re@@ c@@ og@@ ni@@ zed the r@@ ing of truth and was e@@ ag@@ er to hel@@ p B@@ ro@@ ther F@@ er@@ g@@ us@@ on to ex@@ p@@ and the pre@@ aching work .
To do so , M@@ anu@@ el began to serve as a col@@ p@@ or@@ te@@ u@@ r , as pi@@ on@@ e@@ ers were then called .
W@@ ith the pr@@ in@@ ting and dis@@ tri@@ bu@@ t@@ ion of B@@ ible l@@ it@@ er@@ at@@ ure now well - or@@ g@@ ani@@ zed , the f@@ led@@ gl@@ ing con@@ gre@@ g@@ ation in L@@ is@@ b@@ 
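
In the output above, `@@` marks a subword that continues into the next token. The sketch below shows how such segmentation can be undone to recover the original text. It uses the regular expression conventionally paired with subword-nmt output, and `undo_bpe` is a hypothetical helper name.

```python
import re

def undo_bpe(line):
    """Merge BPE subwords back into words by removing the '@@ ' joiners."""
    return re.sub(r"(@@ )|(@@ ?$)", "", line)

print(undo_bpe("This p@@ ub@@ li@@ cation is not for s@@ ale ."))
# This publication is not for sale .
```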

In [None]:
!cd joeynmt; python -m joeynmt translate 'models/enlh_transformer_continued/config.yaml' < "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/mon.bpe.en" > "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/mon.lh"

2021-07-18 09:11:56,551 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-18 09:12:00,358 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-18 09:12:00,564 - INFO - joeynmt.model - Enc-dec model built.


In [None]:
!head mon.en
!head mon.lh

This publication is not for sale .
COVER SUBJECT
The Bible was completed about two thousand years ago .
Since then , countless other books have come and gone .
But not the Bible .
Consider the following .
The Bible has survived many vicious attacks by powerful people .
For example , during the Middle Ages in certain “ Christian ” lands , “ the possession and reading of the Bible in the vernacular [ the language of the common people ] was increasingly associated with heresy and dissent , ” says the book An Introduction to the Medieval Bible .
Scholars who translated the Bible into the vernacular or who promoted Bible study risked their lives . Some were killed .
Despite its many enemies , the Bible became ​ — and continues to be — ​ the most widely distributed book of all time .
Oburume obwomundu shibuliho ta , habula nobwatoto
Olunyuma lwetsinyanga tsitaru , Yorodani nende Siria .
Yali ahambi isaa yashienda yemiyika , chibili .
Abakhalabani bobubeeyi bakhetsukhana , nibakhupa ikha .
Ne

In [None]:
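# English side: parallel training data plus the monolingual English sentences
file1 = ['train.en', 'mon.en']

# Luhyia side: parallel training data plus the back-translated monolingual sentences
file2 = ['train.lh', 'mon.lh']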

In [None]:
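# Procedure to create concatenated files
def create_file(x,filename):
  # Open filename in write mode
  with open(filename, 'w') as outfile:
      for names in x:
          # Open each file in read mode
          with open(names) as infile:
              data = infile.read()
          # Copy the contents into the output file
          outfile.write(data)
          # Add a newline only if the file lacks one, so that no blank
          # sentence is inserted between the concatenated files
          if data and not data.endswith("\n"):
              outfile.write("\n")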

In [None]:
create_file(file1,'back.en')
create_file(file2,'back.lh')
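
The two concatenated files must stay line-aligned, since line *i* of `back.en` is treated as the translation of line *i* of `back.lh`. The sketch below is a minimal check under that assumption; `check_parallel` is a hypothetical helper and works on any two iterables of lines.

```python
def check_parallel(src_lines, trg_lines):
    """Return (same_length, indices of pairs where either side is blank)."""
    src = [s.rstrip("\n") for s in src_lines]
    trg = [t.rstrip("\n") for t in trg_lines]
    blank = [
        i for i, (s, t) in enumerate(zip(src, trg))
        if not s.strip() or not t.strip()
    ]
    return len(src) == len(trg), blank

# Toy example: the second pair is blank on both sides.
same_length, blank = check_parallel(["a\n", "\n", "b\n"], ["x\n", "\n", "y\n"])
print(same_length, blank)

# On the real data (file names as used above):
# with open("back.en") as f_en, open("back.lh") as f_lh:
#     print(check_parallel(f_en, f_lh))
```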

In [None]:
# Learn joint BPE codes and vocabularies from the combined (parallel + back-translated) training data.
! subword-nmt learn-joint-bpe-and-vocab --input back.$src back.$tgt3 -s 4000 -o bpe.codes.4000 --write-vocabulary vocab2.$src vocab2.$tgt3

# Apply BPE splits to the training, development and test data.
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab2.$src < back.$src > back.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab2.$tgt3 < back.$tgt3 > back.bpe.$tgt3

! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab2.$src < dev.$src > back_dev.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab2.$tgt3 < dev.$tgt3 > back_dev.bpe.$tgt3
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab2.$src < test.$src > back_test.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab2.$tgt3 < test.$tgt3 > back_test.bpe.$tgt3

# Build the joint vocabulary file with JoeyNMT's build_vocab script
! sudo chmod 777 joeynmt/scripts/build_vocab.py
! joeynmt/scripts/build_vocab.py back.bpe.$src back.bpe.$tgt3 --output_path vocab2.txt

## Modelling

In [None]:
#@title
name = '%s%s' % (target_language3, source_language)

# Create the config
config = """
name: "{target_language3}{source_language}_reverse_transformer"

data:
    src: "{target_language3}"
    trg: "{source_language}"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/back.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/back_dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/back_test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab2.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab2.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    #load_model: "{gdrive_path}/models/{name}_transformer/1.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam scheduling
    patience: 5                     # For plateau: decrease learning rate by decrease_factor if validation score has not improved for this many validation rounds.
    learning_rate_factor: 0.5       # factor for Noam scheduler (used with Transformer)
    learning_rate_warmup: 1000      # warmup steps for Noam scheduler (used with Transformer)
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 4096
    batch_type: "token"
    eval_batch_size: 1600
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 30                  # TODO: Decrease while experimenting. Around 30 is sufficient to check whether it is working at all.
    validation_freq: 5000         # TODO: Set to at least once per epoch.
    logging_freq: 100
    eval_metric: "bleu"
    model_dir: "models/back_{name}_reverse_transformer"
    overwrite: False              # TODO: Set to True if you want to overwrite possibly existing models. 
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_last_ckpts: 3

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4             # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256   # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4              # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256    # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
""".format(name=name, gdrive_path="/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia", source_language=source_language, target_language3=target_language3)
with open("joeynmt/configs/back_transformer_reverse_{name}.yaml".format(name=name),'w') as f:
    f.write(config)
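
The `plateau` settings in this config (`patience: 5`, `decrease_factor: 0.7`, `learning_rate: 0.0003`, `learning_rate_min: 0.00000001`) can be pictured with a small simulation. This is a simplified sketch, not JoeyNMT's scheduler. It assumes lower scores are better (as with `ppl`), ignores thresholds and cooldown, and assumes the rate is cut once the number of non-improving validation rounds exceeds `patience`. The perplexity values are made up.

```python
def plateau_schedule(scores, lr=0.0003, patience=5, factor=0.7, min_lr=1e-8):
    """Return the learning rate in effect after each validation round.

    Simplified model of reduce-on-plateau: lower score is better,
    no improvement threshold and no cooldown period.
    """
    best = float("inf")
    bad_rounds = 0
    history = []
    for score in scores:
        if score < best:
            best = score
            bad_rounds = 0
        else:
            bad_rounds += 1
        # Assumed rule: reduce once non-improving rounds exceed `patience`.
        if bad_rounds > patience:
            lr = max(lr * factor, min_lr)
            bad_rounds = 0
        history.append(lr)
    return history


# Made-up validation perplexities: two improvements, then a long stall.
ppl = [50.0, 40.0] + [41.0] * 12
rates = plateau_schedule(ppl)
for round_no, rate in enumerate(rates, start=1):
    print(round_no, rate)
```

With `validation_freq: 5000`, each round in this sketch corresponds to 5000 training steps, so a stalled model would train for a long time before the first reduction.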

In [None]:
# Train the model
!cd joeynmt; python3 -m joeynmt train configs/back_transformer_reverse_$tgt3$src.yaml

2021-07-18 11:13:02,325 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-18 11:13:02,785 - INFO - joeynmt.data - Loading training data...
2021-07-18 11:13:05,985 - INFO - joeynmt.data - Building vocabulary...
2021-07-18 11:13:06,237 - INFO - joeynmt.data - Loading dev data...
2021-07-18 11:13:06,262 - INFO - joeynmt.data - Loading test data...
2021-07-18 11:13:06,831 - INFO - joeynmt.data - Data loaded.
2021-07-18 11:13:06,831 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-18 11:13:07,034 - INFO - joeynmt.model - Enc-dec model built.
2021-07-18 11:13:07.277581: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-18 11:13:08,753 - INFO - joeynmt.training - Total params: 12138240
2021-07-18 11:13:11,975 - INFO - joeynmt.helpers - cfg.name                           : lhen_reverse_transformer
2021-07-18 11:13:11,975 - INFO - joeynmt.helpers - cfg.data.src                       : l
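
The `Total params: 12138240` figure in the log can be cross-checked against the model section of the config. The sketch below makes several assumptions about the layer layout, none of which the notebook states: four biased projections per attention block, a two-layer biased feed-forward block, one layer norm per sub-layer plus a final one in each stack, parameter-free positional encodings, and a single embedding matrix shared by source, target and output layer (`tied_embeddings` and `tied_softmax`). The function names are illustrative.

```python
def linear(n_in, n_out, bias=True):
    """Parameter count of a dense layer."""
    return n_in * n_out + (n_out if bias else 0)


def transformer_params(vocab, d=256, ff=1024, layers=6):
    """Parameter count under the assumed layout described above."""
    layer_norm = 2 * d                       # scale and shift vectors
    attention = 4 * linear(d, d)             # query, key, value, output projections
    feed_forward = linear(d, ff) + linear(ff, d)
    enc_layer = attention + feed_forward + 2 * layer_norm
    dec_layer = 2 * attention + feed_forward + 3 * layer_norm  # self- and cross-attention
    encoder = layers * enc_layer + layer_norm
    decoder = layers * dec_layer + layer_norm
    embeddings = vocab * d                   # counted once because everything is tied
    return encoder + decoder + embeddings


logged_total = 12138240
without_embeddings = transformer_params(vocab=0)
implied_vocab = (logged_total - without_embeddings) / 256
print(without_embeddings, implied_vocab)
```

Under these assumptions the remainder divides evenly by the embedding size, which suggests a shared vocabulary of a little over 4000 entries. That is an inference from the arithmetic, not a number reported in the notebook.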

In [None]:
# Reloading configuration file: resume from a saved checkpoint and train into a new model directory
ckpt_number = 25000
reload_config = config.replace(
    '#load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/models/lhen_transformer/1.ckpt"',
    f'load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/joeynmt/models/back_{name}_reverse_transformer/{ckpt_number}.ckpt"').replace(
        'model_dir: "models/back_lhen_reverse_transformer"', 'model_dir: "models/back_lhen_reverse_transformer_continued"').replace(
        'epochs: 30', 'epochs: 17').replace('validation_freq: 5000', 'validation_freq: 2500')
with open("joeynmt/configs/back_transformer_reverse_{name}_reload.yaml".format(name=name),'w') as f:
    f.write(reload_config)
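
Because `str.replace` returns the text unchanged when the search string is absent, a typo in one of the patterns would silently produce a reload config that still has `load_model` commented out. A small guard, sketched here with a hypothetical helper `replace_checked` and a toy stand-in for the real config string, makes such mismatches fail loudly.

```python
def replace_checked(text: str, old: str, new: str) -> str:
    """Like str.replace, but raise if `old` does not occur exactly once."""
    count = text.count(old)
    if count != 1:
        raise ValueError(f"expected exactly one match for {old!r}, found {count}")
    return text.replace(old, new)


# Toy stand-in for the real config string (paths shortened for illustration).
demo_config = (
    "training:\n"
    '    #load_model: "models/demo/1.ckpt"\n'
    "    epochs: 30\n"
    "    validation_freq: 5000\n"
)

demo_reload = replace_checked(
    demo_config, '#load_model: "models/demo/1.ckpt"', 'load_model: "models/demo/25000.ckpt"'
)
demo_reload = replace_checked(demo_reload, "epochs: 30", "epochs: 17")
demo_reload = replace_checked(demo_reload, "validation_freq: 5000", "validation_freq: 2500")
print(demo_reload)
```

The same helper could wrap each substitution in the reload cells, since every pattern there is expected to match exactly one line of the config.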

In [None]:
!cat "joeynmt/configs/back_transformer_reverse_lhen_reload.yaml"


name: "lhen_reverse_transformer"

data:
    src: "lh"
    trg: "en"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/back.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/back_dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/back_test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab2.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab2.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/joeynmt/models/back_lhen_reverse_transformer/25000.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to 

In [None]:
# Training continued
!cd joeynmt; python3 -m joeynmt train configs/back_transformer_reverse_lhen_reload.yaml

2021-07-18 17:10:05,612 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-18 17:10:05,687 - INFO - joeynmt.data - Loading training data...
2021-07-18 17:10:09,772 - INFO - joeynmt.data - Building vocabulary...
2021-07-18 17:10:10,273 - INFO - joeynmt.data - Loading dev data...
2021-07-18 17:10:10,939 - INFO - joeynmt.data - Loading test data...
2021-07-18 17:10:12,029 - INFO - joeynmt.data - Data loaded.
2021-07-18 17:10:12,029 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-18 17:10:12,411 - INFO - joeynmt.model - Enc-dec model built.
2021-07-18 17:10:12.672249: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-18 17:10:14,453 - INFO - joeynmt.training - Total params: 12138240
2021-07-18 17:10:25,188 - INFO - joeynmt.training - Loading model from /content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/joeynmt/models/back_lhen_reverse_transformer/25000.ckpt

In [None]:
# Reloading configuration file: resume from the last checkpoint of the first continued run
ckpt_number = 62500
reload_config = config.replace(
    '#load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/models/lhen_transformer/1.ckpt"',
    f'load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/joeynmt/models/back_{name}_reverse_transformer_continued/{ckpt_number}.ckpt"').replace(
        'model_dir: "models/back_lhen_reverse_transformer"', 'model_dir: "models/back_lhen_reverse_transformer_continued2"').replace(
        'validation_freq: 5000', 'validation_freq: 2500')
with open("joeynmt/configs/back_transformer_reverse_{name}_reload2.yaml".format(name=name),'w') as f:
    f.write(reload_config)

In [None]:
!cat "joeynmt/configs/back_transformer_reverse_lhen_reload2.yaml"


name: "lhen_reverse_transformer"

data:
    src: "lh"
    trg: "en"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/back.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/back_dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/back_test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab2.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab2.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/joeynmt/models/back_lhen_reverse_transformer_continued/62500.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from p

In [None]:
# Training continued
!cd joeynmt; python3 -m joeynmt train configs/back_transformer_reverse_lhen_reload2.yaml

2021-07-19 06:34:19,360 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-19 06:34:19,477 - INFO - joeynmt.data - Loading training data...
2021-07-19 06:34:25,415 - INFO - joeynmt.data - Building vocabulary...
2021-07-19 06:34:26,608 - INFO - joeynmt.data - Loading dev data...
2021-07-19 06:34:27,539 - INFO - joeynmt.data - Loading test data...
2021-07-19 06:34:29,217 - INFO - joeynmt.data - Data loaded.
2021-07-19 06:34:29,218 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-19 06:34:29,748 - INFO - joeynmt.model - Enc-dec model built.
2021-07-19 06:34:30.003105: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-19 06:34:32,209 - INFO - joeynmt.training - Total params: 12138240
2021-07-19 06:34:40,915 - INFO - joeynmt.training - Loading model from /content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/joeynmt/models/back_lhen_reverse_transformer_continued/62500.ckpt

In [None]:
# Reloading configuration file: resume from the last checkpoint of the second continued run
ckpt_number = 102500
reload_config = config.replace(
    '#load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/models/lhen_transformer/1.ckpt"',
    f'load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/joeynmt/models/back_{name}_reverse_transformer_continued2/{ckpt_number}.ckpt"').replace(
        'model_dir: "models/back_lhen_reverse_transformer"', 'model_dir: "models/back_lhen_reverse_transformer_continued3"').replace(
        'validation_freq: 5000', 'validation_freq: 2500').replace(
        'epochs: 30', 'epochs: 11')
with open("joeynmt/configs/back_transformer_reverse_{name}_reload3.yaml".format(name=name),'w') as f:
    f.write(reload_config)

In [None]:
!cat "joeynmt/configs/back_transformer_reverse_lhen_reload3.yaml"


name: "lhen_reverse_transformer"

data:
    src: "lh"
    trg: "en"
    train: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/back.bpe"
    dev:   "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/back_dev.bpe"
    test:  "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/back_test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab2.txt"
    trg_vocab: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/vocab2.txt"

testing:
    beam_size: 5
    alpha: 1.0

training:
    load_model: "/content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/joeynmt/models/back_lhen_reverse_transformer_continued2/102500.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 42
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from

In [None]:
# Training continued
!cd joeynmt; python3 -m joeynmt train configs/back_transformer_reverse_lhen_reload3.yaml

2021-07-27 07:34:56,135 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-07-27 07:34:56,204 - INFO - joeynmt.data - Loading training data...
2021-07-27 07:35:00,825 - INFO - joeynmt.data - Building vocabulary...
2021-07-27 07:35:01,381 - INFO - joeynmt.data - Loading dev data...
2021-07-27 07:35:02,417 - INFO - joeynmt.data - Loading test data...
2021-07-27 07:35:03,772 - INFO - joeynmt.data - Data loaded.
2021-07-27 07:35:03,772 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-07-27 07:35:04,161 - INFO - joeynmt.model - Enc-dec model built.
2021-07-27 07:35:04.414349: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-27 07:35:06,069 - INFO - joeynmt.training - Total params: 12138240
2021-07-27 07:35:16,618 - INFO - joeynmt.training - Loading model from /content/gdrive/Shared drives/NMT_for_African_Language/Luhyia/joeynmt/models/back_lhen_reverse_transformer_continued2/102500.ckpt