# Masakhane - Machine Translation for African Languages (Using JoeyNMT)

## Note before beginning:
### - The idea is that you should be able to make minimal changes to this in order to get SOME result for your own translation corpus. 

### - The tl;dr: Go to the **"TODO"** comments which will tell you what to update to get up and running

### - If you actually want to have a clue what you're doing, read the text and peek at the links

### - With 100 epochs, it should take around 7 hours to run in Google Colab

### - Once you've gotten a result for your language, please attach and email your notebook that generated it to masakhanetranslation@gmail.com

### - If you care enough and get a chance, writing a brief background on your language would be amazing. See examples in [(Martinus, 2019)](https://arxiv.org/abs/1906.05685)

### - This notebook is intended to be used with custom parallel data. That means that you need two files, where one is in your language, the other English, and the lines in the files are corresponding translations.

## Pre-process your data

We assume here that you already have a data set. The format in which we will process it here requires that 
1. you have two files, one for each language
2. the files are sentence-aligned, which means that each line should correspond to the same line in the other file.
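
Before going further, it can help to confirm that the two files really are sentence-aligned. The snippet below is a minimal sketch of such a check. It works on plain lists of lines, and the function name `check_sentence_alignment` and the toy sentences are made up for illustration, not part of the notebook.

```python
def check_sentence_alignment(source_lines, target_lines):
    """Return a list of problems found in a pair of sentence-aligned corpora."""
    problems = []
    if len(source_lines) != len(target_lines):
        problems.append(
            "line count mismatch: {} source vs {} target".format(
                len(source_lines), len(target_lines)))
    # Compare only the lines both sides have, reporting 1-based line numbers.
    for number, (src, tgt) in enumerate(zip(source_lines, target_lines), start=1):
        # A pair is suspicious when exactly one side is empty.
        if bool(src.strip()) != bool(tgt.strip()):
            problems.append("line {}: one side is empty".format(number))
    return problems


# Toy data standing in for the contents of your two files (hypothetical).
example_source = ["Good morning.", "How are you?"]
example_target = ["Sawubona.", ""]
for problem in check_sentence_alignment(example_source, example_target):
    print(problem)
```

To use it on real files, read each file with `open(path).read().splitlines()` and pass the two lists in. An empty result means no problem was detected by these two simple checks, not that the translations are correct.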


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [1]:
# TODO: Set your source and target languages. Keep in mind, these traditionally use language codes as found here:
# These will also become the suffixes of all vocab and corpus files used throughout
import os
source_language = "en"
target_language = "zu" 
lc = False  # If True, lowercase the data.
seed = 42  # Random seed for shuffling.
tag = "baseline" # Give a unique name to your folder - this is to ensure you don't rewrite any models you've already submitted

os.environ["src"] = source_language # Sets them in bash as well, since we often use bash scripts
os.environ["tgt"] = target_language
os.environ["tag"] = tag

# This will save it to a folder in our gdrive instead!
!mkdir -p "/content/drive/My Drive/parallel_corpus/baseline/$src-$tgt-$tag"
os.environ["gdrive_path"] = "/content/drive/My Drive/parallel_corpus/baseline/%s-%s-%s" % (source_language, target_language, tag)

In [2]:
!echo $gdrive_path

/content/drive/My Drive/parallel_corpus/baseline/en-zu-baseline


In [3]:
# Install opus-tools
! pip install opustools-pkg



In [4]:
# TODO: specify the file paths here
source_file = "/content/drive/My Drive/parallel_corpus/my_en_eval.txt"
target_file = "/content/drive/My Drive/parallel_corpus/my_zu_eval_998.txt"

# They should both have the same length.
! wc -l "$source_file"
! wc -l "$target_file"

997 /content/drive/My Drive/parallel_corpus/my_en_eval.txt
997 /content/drive/My Drive/parallel_corpus/my_zu_eval_998.txt


In [6]:
# TODO: Pre-processing! (OPTIONAL)

# If your data contains weird symbols or the like, you might want to do some cleaning and normalization.
# We don't have the code in the notebook for that, but you can, for example, use sacremoses "normalize" to normalize punctuation: https://github.com/alvations/sacremoses.

# We apply tokenization to separate punctuation marks from the actual words, split words at hyphens etc.
# If your data is already tokenized, that's great! Skip this cell.
# Otherwise we can use sacremoses to do the tokenization for us. 
# We need the data to be tokenized such that it matches the global test set.

! pip install sacremoses

tok_source_file = source_file+".tok"
tok_target_file = target_file+".tok"

# Tokenize the source
! sacremoses -l "$source_language" tokenize < "$source_file" > "$tok_source_file"
# Tokenize the target
! sacremoses -l "$target_language" tokenize < "$target_file" > "$tok_target_file"

# Let's take a look at what tokenization did to the text.
! head "$source_file"*
! head "$target_file"*

# Change the pointers to our files such that we continue to work with the tokenized data.
source_file = tok_source_file
target_file = tok_target_file

==> /content/drive/My Drive/parallel_corpus/my_en_eval.txt <==
Peter Van Sant: And it means what?
The cost to society will be substantial, the report says. In 2019 alone, it estimates a $290 billion burden from health care, long-term case and hospice combined. Medicare and Medicaid will cover $195 billion of that, with out-of-pocket costs to caregivers reaching $63 billion.
It is now up to them to make the most of it.
"""The CPS is carefully considering all the available information, including the impact on Harry's family, in order to make an independent and objective charging decision."
TV audiences were left outraged after a teenage girl appeared on Channel 4's 24 Hours in A&E complaining of a broken finger nail.
The disgusting note read: 'Put your dog on a lead, slag!
Sen Sen CEO Subhash Challa
Someone behind the camera says: 'Hi guys, hi!
"""She was devoted to children and especially animals, including a wild fox who we are continuing to feed now that she has gone."""
Are there lit

In [5]:
# Download the global test set.
! wget https://raw.githubusercontent.com/masakhane-io/masakhane/master/jw300_utils/test/test.en-any.en
  
# And the specific test set for this language pair.
os.environ["trg"] = target_language 
os.environ["src"] = source_language 

! wget https://raw.githubusercontent.com/masakhane-io/masakhane/master/jw300_utils/test/test.en-$trg.en 
! mv test.en-$trg.en test.en
! wget https://raw.githubusercontent.com/masakhane-io/masakhane/master/jw300_utils/test/test.en-$trg.$trg 
! mv test.en-$trg.$trg test.$trg

# TODO: if this fails it means that there is NO test set for your language yet. It's on you to create one.
# A good idea would be to take a random subset of your data, and add it to https://raw.githubusercontent.com/masakhane-io/masakhane/master/jw300_utils/test/test.en-any.en.
# Make a Pull Request and get it approved and merged.
# Then repeat this cell to retrieve the new test set.
# Then proceed to the next cell that will filter out all duplicates from the training set, so that there is no overlap between training and test set.

--2021-11-20 16:50:21--  https://raw.githubusercontent.com/masakhane-io/masakhane/master/jw300_utils/test/test.en-any.en
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 277791 (271K) [text/plain]
Saving to: ‘test.en-any.en.2’


2021-11-20 16:50:21 (8.73 MB/s) - ‘test.en-any.en.2’ saved [277791/277791]

--2021-11-20 16:50:21--  https://raw.githubusercontent.com/masakhane-io/masakhane/master/jw300_utils/test/test.en-zu.en
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 206207 (201K) [text/plain]
Saving to: ‘test.en-zu.en’


2021-11-

In [6]:
# Read the test data to filter from train and dev splits.
# Store english portion in set for quick filtering checks.
en_test_sents = set()
filter_test_sents = "test.en-any.en"
j = 0
with open(filter_test_sents) as f:
  for line in f:
    en_test_sents.add(line.strip())
    j += 1
print('Loaded {} global test sentences to filter from the training/dev data.'.format(j))

Loaded 3571 global test sentences to filter from the training/dev data.


In [7]:
import pandas as pd

source = []
target = []
skip_lines = set()  # Collect the line numbers skipped in the source portion so we can skip the same lines in the target portion.
with open(source_file) as f:
    for i, line in enumerate(f):
        # Skip sentences that are contained in the test set.
        if line.strip() not in en_test_sents:
            source.append(line.strip())
        else:
            skip_lines.add(i)
with open(target_file) as f:
    for j, line in enumerate(f):
        # Only add to corpus if corresponding source was not skipped.
        if j not in skip_lines:
            target.append(line.strip())
    
print('Loaded data and skipped {}/{} lines since contained in test set.'.format(len(skip_lines), i))
    
df = pd.DataFrame(zip(source, target), columns=['source_sentence', 'target_sentence'])
df.head(3)

Loaded data and skipped 0/997 lines since contained in test set.


| | source_sentence | target_sentence |
|---|---|---|
| 0 | Peter Van Sant: And it means what? | Peter Van Sant: Bese kusho ukuthini? |
| 1 | The cost to society will be substantial, the r... | Izindleko zomphakathi zizoba zinkulu, ngokusho... |
| 2 | It is now up to them to make the most of it. | Sekulele kubo ukuthi bayisebenzise ngendlela e... |
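
The filtering done in the cell above can also be written as a single pass over the sentence pairs, which avoids keeping track of line numbers. This is only a sketch with made-up toy data. The function name `filter_test_overlap` is hypothetical, and it assumes the two inputs are already aligned line by line.

```python
def filter_test_overlap(source_lines, target_lines, test_sentences):
    """Drop every sentence pair whose source side appears in the test set."""
    kept_source, kept_target, skipped = [], [], 0
    for src, tgt in zip(source_lines, target_lines):
        src, tgt = src.strip(), tgt.strip()
        if src in test_sentences:
            # The pair overlaps with the test set, so neither side is kept.
            skipped += 1
            continue
        kept_source.append(src)
        kept_target.append(tgt)
    return kept_source, kept_target, skipped


# Toy example (hypothetical sentences, not taken from the real test set).
toy_test = {"This sentence is in the test set."}
toy_source = ["Hello there.\n", "This sentence is in the test set.\n"]
toy_target = ["target line 1\n", "target line 2\n"]
kept_src, kept_tgt, n_skipped = filter_test_overlap(toy_source, toy_target, toy_test)
print(kept_src, kept_tgt, n_skipped)
```

Because source and target are consumed together with `zip`, the two sides cannot drift out of alignment while filtering.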


## Pre-processing and export

It is generally a good idea to remove duplicate translations and conflicting translations from the corpus. In practice, public corpora often contain some of both, and these need to be cleaned out.

In addition we will split our data into dev/test/train and export to the filesystem.
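
To make the cleaning and splitting steps concrete, here is a dependency-free sketch of the same idea: keep the first translation seen for each source sentence and for each target sentence, shuffle with a fixed seed, and hold out the tail as the dev set. The helper names (`drop_repeats`, `clean_and_split`) and the toy pairs are assumptions made for illustration. The notebook itself does this with pandas.

```python
import random


def drop_repeats(pairs, key):
    """Keep only the first pair for each distinct value of key(pair)."""
    seen, kept = set(), []
    for pair in pairs:
        value = key(pair)
        if value in seen:
            continue
        seen.add(value)
        kept.append(pair)
    return kept


def clean_and_split(pairs, num_dev, seed=42):
    """Remove duplicate/conflicting pairs, shuffle, and split into train and dev."""
    # One translation per source sentence, then one source per target sentence.
    cleaned = drop_repeats(pairs, key=lambda pair: pair[0])
    cleaned = drop_repeats(cleaned, key=lambda pair: pair[1])
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(cleaned)
    num_dev = min(num_dev, len(cleaned))
    split = len(cleaned) - num_dev
    return cleaned[:split], cleaned[split:]


# Toy corpus (hypothetical): one exact duplicate and two conflicting pairs.
toy_pairs = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "x"), ("c", "z"), ("d", "w")]
train_pairs, dev_pairs = clean_and_split(toy_pairs, num_dev=1)
print(len(train_pairs), len(dev_pairs))
```

Taking the dev set from the end of the shuffled list mirrors the `tail(num_dev_patterns)` approach used in the split cells further down.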

In [10]:
# Don't run
# Drop duplicate translations.
df_pp = df.drop_duplicates()

# Drop conflicting translations (keep the first translation seen for each sentence).
df_pp = df_pp.drop_duplicates(subset='source_sentence')
df_pp = df_pp.drop_duplicates(subset='target_sentence')

# Shuffle the data to remove bias in dev set selection.
df_pp = df_pp.sample(frac=1, random_state=seed).reset_index(drop=True)

In [11]:
# Don't run
# Install fuzzywuzzy to remove "almost duplicate" sentences in the
# test and training sets.
! pip install fuzzywuzzy
! pip install python-Levenshtein
import time
from fuzzywuzzy import process
import numpy as np
from os import cpu_count
from functools import partial
from multiprocessing import Pool


# reset the index of the training set after previous filtering
df_pp.reset_index(drop=False, inplace=True)

# Remove samples from the training data set if they "almost overlap" with the
# samples in the test set.

# Filtering function. Adjust pad to narrow down the candidate matches to
# within a certain length of characters of the given sample.
def fuzzfilter(sample, candidates, pad):
  candidates = [x for x in candidates if len(x) <= len(sample)+pad and len(x) >= len(sample)-pad] 
  if len(candidates) > 0:
    return process.extractOne(sample, candidates)[1]
  else:
    return np.nan



In [10]:
# Don't run
start_time = time.time()
# Iterating over pandas DataFrame rows is not recommended, so we use multiprocessing to apply the function.

with Pool(cpu_count()-1) as pool:
    scores = pool.map(partial(fuzzfilter, candidates=list(en_test_sents), pad=5), df_pp['source_sentence'])
hours, rem = divmod(time.time() - start_time, 3600)
minutes, seconds = divmod(rem, 60)
print("done in {}h:{}min:{}seconds".format(hours, minutes, seconds))

# Filter out "almost overlapping samples"
df_pp = df_pp.assign(scores=scores)
df_pp = df_pp[df_pp['scores'] < 95]

NameError: ignored
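
If you want to experiment with the near-duplicate filtering idea without installing extra packages, the sketch below swaps fuzzywuzzy for `difflib.SequenceMatcher` from the standard library. The scores are therefore not identical to fuzzywuzzy's, and the function names and the toy sentences are assumptions for illustration. Unlike the cell above, this sketch keeps a sentence when no test sentence of comparable length exists, which is a deliberate choice here.

```python
from difflib import SequenceMatcher


def best_match_score(sample, candidates, pad=5):
    """Highest similarity (0-100) to any candidate within `pad` characters of the sample's length."""
    close = [c for c in candidates if abs(len(c) - len(sample)) <= pad]
    if not close:
        return None  # Nothing of comparable length to compare against.
    return max(100 * SequenceMatcher(None, sample, c).ratio() for c in close)


def drop_near_duplicates(train_sentences, test_sentences, threshold=95, pad=5):
    """Keep training sentences whose best match in the test set scores below the threshold."""
    kept = []
    for sentence in train_sentences:
        score = best_match_score(sentence, test_sentences, pad)
        if score is None or score < threshold:
            kept.append(sentence)
    return kept


# Toy data (hypothetical): the first training sentence differs from the test
# sentence only in its final character.
toy_test = ["The cat sat on the mat."]
toy_train = ["The cat sat on the mat!", "A completely different sentence here."]
print(drop_near_duplicates(toy_train, toy_test))
```

The `pad` argument plays the same role as in `fuzzfilter`: it limits comparisons to candidates of similar length, which keeps the number of expensive string comparisons down.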

In [8]:
# Don't run
# This section does the split between train/dev for the parallel corpora, then saves them as separate files.
# We hold out the last num_dev_patterns sentences as the dev set and use the given test set.
import csv

# TODO: if your corpus is smaller than 1000, reduce this number. With a corpus that small you might not obtain good results with NMT though :/
# Do the split between dev/train and create parallel corpora
num_dev_patterns = 100

# Optional: lower case the corpora - this will make it easier to generalize, but without proper casing.
if lc:  # Julia: making lowercasing optional
    df_pp["source_sentence"] = df_pp["source_sentence"].str.lower()
    df_pp["target_sentence"] = df_pp["target_sentence"].str.lower()

# Julia: test sets are already generated
dev = df_pp.tail(num_dev_patterns) # Herman: Error in original
stripped = df_pp.drop(df_pp.tail(num_dev_patterns).index)

with open("train."+source_language, "w") as src_file, open("train."+target_language, "w") as trg_file:
  for index, row in stripped.iterrows():
    src_file.write(row["source_sentence"]+"\n")
    trg_file.write(row["target_sentence"]+"\n")
    
with open("dev."+source_language, "w") as src_file, open("dev."+target_language, "w") as trg_file:
  for index, row in dev.iterrows():
    src_file.write(row["source_sentence"]+"\n")
    trg_file.write(row["target_sentence"]+"\n")

#stripped[["source_sentence"]].to_csv("train."+source_language, header=False, index=False)  # Herman: Added `header=False` everywhere
#stripped[["target_sentence"]].to_csv("train."+target_language, header=False, index=False)  # Julia: Problematic handling of quotation marks.

#dev[["source_sentence"]].to_csv("dev."+source_language, header=False, index=False)
#dev[["target_sentence"]].to_csv("dev."+target_language, header=False, index=False)


# TODO: Doublecheck the format below. There should be no extra quotation marks or weird characters. It should also not be empty.
! head train.*
! head dev.*

NameError: ignored

In [12]:
# This section does the split between train/dev for the parallel corpora, then saves them as separate files.
# We hold out the last num_dev_patterns sentences as the dev set and use the given test set.
import csv

# TODO: if your corpus is smaller than 1000, reduce this number. With a corpus that small you might not obtain good results with NMT though :/
# Do the split between dev/train and create parallel corpora
num_dev_patterns = 100

# Optional: lower case the corpora - this will make it easier to generalize, but without proper casing.
if lc:  # Julia: making lowercasing optional
    df["source_sentence"] = df["source_sentence"].str.lower()
    df["target_sentence"] = df["target_sentence"].str.lower()

# Julia: test sets are already generated
dev = df.tail(num_dev_patterns) # Herman: Error in original
stripped = df.drop(df.tail(num_dev_patterns).index)

with open("train."+source_language, "w") as src_file, open("train."+target_language, "w") as trg_file:
  for index, row in stripped.iterrows():
    src_file.write(row["source_sentence"]+"\n")
    trg_file.write(row["target_sentence"]+"\n")
    
with open("dev."+source_language, "w") as src_file, open("dev."+target_language, "w") as trg_file:
  for index, row in dev.iterrows():
    src_file.write(row["source_sentence"]+"\n")
    trg_file.write(row["target_sentence"]+"\n")

#stripped[["source_sentence"]].to_csv("train."+source_language, header=False, index=False)  # Herman: Added `header=False` everywhere
#stripped[["target_sentence"]].to_csv("train."+target_language, header=False, index=False)  # Julia: Problematic handling of quotation marks.

#dev[["source_sentence"]].to_csv("dev."+source_language, header=False, index=False)
#dev[["target_sentence"]].to_csv("dev."+target_language, header=False, index=False)


# TODO: Doublecheck the format below. There should be no extra quotation marks or weird characters. It should also not be empty.
! head train.*
! head dev.*

==> train.bpe.en <==

==> train.bpe.zu <==

==> train.en <==
Peter Van Sant: And it means what?
The cost to society will be substantial, the report says. In 2019 alone, it estimates a $290 billion burden from health care, long-term case and hospice combined. Medicare and Medicaid will cover $195 billion of that, with out-of-pocket costs to caregivers reaching $63 billion.
It is now up to them to make the most of it.
"""The CPS is carefully considering all the available information, including the impact on Harry's family, in order to make an independent and objective charging decision."
TV audiences were left outraged after a teenage girl appeared on Channel 4's 24 Hours in A&E complaining of a broken finger nail.
The disgusting note read: 'Put your dog on a lead, slag!
Sen Sen CEO Subhash Challa
Someone behind the camera says: 'Hi guys, hi!
"""She was devoted to children and especially animals, including a wild fox who we are continuing to feed now that she has gone."""
Are there liter



---


## Installation of JoeyNMT

JoeyNMT is a simple, minimalist NMT package which is useful for learning and teaching. Check out the documentation for JoeyNMT [here](https://joeynmt.readthedocs.io)  

In [14]:
# Install JoeyNMT
! git clone https://github.com/joeynmt/joeynmt.git
! cd joeynmt; pip3 install .
# Optionally install a specific PyTorch build with GPU support (the example below pins v1.10.0 with CUDA 11.1).
# ! pip install torch==1.10.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html

fatal: destination path 'joeynmt' already exists and is not an empty directory.
Processing /content/joeynmt
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
Building wheels for collected packages: joeynmt
  Building wheel for joeynmt (setup.py) ... done
  Created wheel for joeynmt: filename=joeynmt-1.3-py3-none-any.whl size=86029 sha256=5f4a3b906fc262e2bd76922cf1f7d8f85074c3ba573c02411e7d51e4fc518af2
  Stored in directory: /tmp/pip-ephem-wheel-cache-9m1ym5vw/wheels/0a/f4/bf/6c9d3b8efbfece6cd209f865be37382b02e7c3584df2e28ca4
Successfully built joeynmt
Installing collected packages: joeynmt
  Attempting uninstall: joeynmt
    

# Preprocessing the Data into Subword BPE Tokens

- One of the most powerful improvements for agglutinative languages (a feature of most Bantu languages) is using BPE tokenization [(Sennrich, 2015)](https://arxiv.org/abs/1508.07909).

- It was also shown that optimizing the number of BPE codes significantly improves results for low-resourced languages [(Sennrich, 2019)](https://www.aclweb.org/anthology/P19-1021) [(Martinus, 2019)](https://arxiv.org/abs/1906.05685).

- Below we have the scripts for doing BPE tokenization of our data. We use 4000 tokens as recommended by [(Sennrich, 2019)](https://www.aclweb.org/anthology/P19-1021). You do not need to change anything; simply running the cell below is sufficient.
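
If you are curious what "learning BPE codes" means, the toy sketch below follows the basic algorithm described in (Sennrich, 2015): start from characters and repeatedly merge the most frequent adjacent pair of symbols. It is not how `subword-nmt` is implemented internally and should not replace it. The function names and the tiny word-count dictionary are assumptions for illustration.

```python
import collections
import re


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with the merged symbol."""
    merged = "".join(pair)
    # Match the pair only when it is delimited by whitespace or string boundaries.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub(lambda match: merged, word): freq for word, freq in vocab.items()}


def learn_bpe(word_counts, num_merges):
    """Return the list of merge operations learned from a word-frequency dict."""
    # Represent each word as space-separated characters plus an end-of-word marker.
    vocab = {" ".join(word) + " </w>": count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # Most frequent pair; ties go to the first seen.
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges


# Toy word counts (hypothetical).
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(learn_bpe(toy_counts, num_merges=5))
```

The number of merges corresponds roughly to the `-s 4000` setting used below: more merges produce longer subword units and a larger vocabulary.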

In [13]:
# One of the huge boosts in NMT performance was to use a different method of tokenizing. 
# Usually, NMT would tokenize by words. However, using a method called BPE gave amazing boosts to performance

# Do subword NMT
from os import path
os.environ["src"] = source_language # Sets them in bash as well, since we often use bash scripts
os.environ["tgt"] = target_language

# Learn BPEs on the training data.
os.environ["data_path"] = path.join("joeynmt", "data", source_language + target_language) # Herman! 
! subword-nmt learn-joint-bpe-and-vocab --input train.$src train.$tgt -s 4000 -o bpe.codes.4000 --write-vocabulary vocab.$src vocab.$tgt

# Apply BPE splits to the training, development and test data.
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < train.$src > train.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < train.$tgt > train.bpe.$tgt

! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < dev.$src > dev.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < dev.$tgt > dev.bpe.$tgt
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$src < test.$src > test.bpe.$src
! subword-nmt apply-bpe -c bpe.codes.4000 --vocabulary vocab.$tgt < test.$tgt > test.bpe.$tgt

# Create the directory and copy everything we care about to the correct location
! mkdir -p "$data_path"
! cp train.* "$data_path"
! cp test.* "$data_path"
! cp dev.* "$data_path"
! cp bpe.codes.4000 "$data_path"
! ls "$data_path"

# Also move everything we care about to a mounted location in google drive (relevant if running in colab) at gdrive_path
! cp train.* "$gdrive_path"
! cp test.* "$gdrive_path"
! cp dev.* "$gdrive_path"
! cp bpe.codes.4000 "$gdrive_path"
! ls "$gdrive_path"

# Create that vocab using build_vocab
! sudo chmod 777 joeynmt/scripts/build_vocab.py
! joeynmt/scripts/build_vocab.py joeynmt/data/$src$tgt/train.bpe.$src joeynmt/data/$src$tgt/train.bpe.$tgt --output_path joeynmt/data/$src$tgt/vocab.txt

# Some output
! echo "BPE Test language Sentences"
! tail -n 5 test.bpe.$tgt
! echo "Combined BPE Vocab"
! tail -n 10 joeynmt/data/$src$tgt/vocab.txt  # Herman

bpe.codes.4000	dev.zu	     test.en-any.en    train.bpe.en  vocab.txt
dev.bpe.en	test.bpe.en  test.en-any.en.1  train.bpe.zu
dev.bpe.zu	test.bpe.zu  test.en-any.en.2  train.en
dev.en		test.en      test.zu	       train.zu
bpe.codes.4000	dev.zu	     test.en-any.en    train.bpe.en
dev.bpe.en	test.bpe.en  test.en-any.en.1  train.bpe.zu
dev.bpe.zu	test.bpe.zu  test.en-any.en.2  train.en
dev.en		test.en      test.zu	       train.zu
BPE Test language Sentences
Ng@@ en@@ xa yal@@ okho , ngang@@ aziwa njengom@@ untu ong@@ ath@@ emb@@ ekile .
L@@ apho ng@@ ifund@@ a iq@@ in@@ iso , ngen@@ qaba ukuqhubeka nal@@ ow@@ o m@@ kh@@ uba , n@@ aku@@ ba lo m@@ sebenzi waw@@ ung@@ i@@ hol@@ ela kahle kakhulu .
Ng@@ iy@@ is@@ ibon@@ elo es@@ ihl@@ e em@@ ad@@ od@@ an@@ eni ami amab@@ ili futhi seng@@ i@@ ye ng@@ af@@ anel@@ ekela amal@@ ung@@ elo eb@@ andl@@ eni .
Kub@@ ac@@ wan@@ ingi - m@@ abh@@ uku ent@@ ela nabanye eng@@ isebenz@@ elana nabo ebh@@ izin@@ is@@ ini , manje seng@@ aziwa njengom@@ untu oth@
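
The `@@` markers in the output above show where a word has been split into subwords. As a small sketch of how a segmented line maps back to ordinary text, the helper below (the name `undo_bpe` is made up for illustration) removes each marker together with the space that follows it.

```python
import re


def undo_bpe(segmented_line, separator="@@"):
    """Join subword units back into words by removing the separator and the space after it."""
    pattern = re.escape(separator) + r"( |$)"
    return re.sub(pattern, "", segmented_line)


# First line of the BPE output shown above.
segmented = "Ng@@ en@@ xa yal@@ okho , ngang@@ aziwa njengom@@ untu ong@@ ath@@ emb@@ ekile ."
print(undo_bpe(segmented))
```

Note that this only reverses the BPE segmentation. The text stays tokenized, so punctuation remains separated from words by spaces.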

In [14]:
# Also move everything we care about to a mounted location in google drive (relevant if running in colab) at gdrive_path
! cp train.* "$gdrive_path"
! cp test.* "$gdrive_path"
! cp dev.* "$gdrive_path"
! cp bpe.codes.4000 "$gdrive_path"
! ls "$gdrive_path"

bpe.codes.4000	dev.zu	     test.en-any.en    train.bpe.en
dev.bpe.en	test.bpe.en  test.en-any.en.1  train.bpe.zu
dev.bpe.zu	test.bpe.zu  test.en-any.en.2  train.en
dev.en		test.en      test.zu	       train.zu


# Creating the JoeyNMT Config

JoeyNMT requires a yaml config. We provide a template below. We've also set a number of defaults with it, that you may play with!

- We use the Transformer architecture
- We set our dropout to reasonably high: 0.3 (recommended in  [(Sennrich, 2019)](https://www.aclweb.org/anthology/P19-1021))

Things worth playing with:
- The batch size (also recommended to change for low-resourced languages)
- The number of epochs (we've set it at 30 just so it runs in about an hour, for testing purposes)
- The decoder options (beam_size, alpha)
- Evaluation metrics (BLEU versus chrF)

In [25]:
# This creates the config file for our JoeyNMT system. It might seem overwhelming so we've provided a couple of useful parameters you'll need to update
# (You can of course play with all the parameters if you'd like!)

name = '%s%s' % (source_language, target_language)
gdrive_path = os.environ["gdrive_path"]

# Create the config
config = """
name: "{name}_transformer"

data:
    src: "{source_language}"
    trg: "{target_language}"
    train: "data/{name}/train.bpe"
    dev:   "data/{name}/dev.bpe"
    test:  "data/{name}/test.bpe"
    level: "bpe"
    lowercase: False
    max_sent_length: 100
    src_vocab: "data/{name}/vocab.txt"
    trg_vocab: "data/{name}/vocab.txt"

testing:
    beam_size: 5
    alpha: 1.0
    sacrebleu:                      # sacrebleu options
        remove_whitespace: True     # `remove_whitespace` option in sacrebleu.corpus_chrf() function (default: True)
        tokenize: "none"            # `tokenize` option in sacrebleu.corpus_bleu() function (options include: "none" (use for already tokenized test data), "13a" (default minimal tokenizer), "intl" which mostly does punctuation and unicode, etc) 

training:
    #load_model: "{gdrive_path}/models/{name}_transformer/1.ckpt" # if uncommented, load a pre-trained model from this checkpoint
    random_seed: 25
    optimizer: "adam"
    normalization: "tokens"
    adam_betas: [0.9, 0.999] 
    scheduling: "plateau"           # TODO: try switching from plateau to Noam scheduling
    patience: 5                     # For plateau: decrease learning rate by decrease_factor if validation score has not improved for this many validation rounds.
    learning_rate_factor: 0.5       # factor for Noam scheduler (used with Transformer)
    learning_rate_warmup: 1000      # warmup steps for Noam scheduler (used with Transformer)
    decrease_factor: 0.7
    loss: "crossentropy"
    learning_rate: 0.0003
    learning_rate_min: 0.00000001
    weight_decay: 0.0
    label_smoothing: 0.1
    batch_size: 4096
    batch_type: "token"
    eval_batch_size: 3600
    eval_batch_type: "token"
    batch_multiplier: 1
    early_stopping_metric: "ppl"
    epochs: 30                     # TODO: Decrease when playing around and checking that it works. Around 30 is sufficient to check if it's working at all
    validation_freq: 100          # TODO: Set to at least once per epoch.
    logging_freq: 100
    eval_metric: "bleu"
    model_dir: "models/{name}_transformer"
    overwrite: True               # TODO: Set to True if you want to overwrite possibly existing models. 
    shuffle: True
    use_cuda: True
    max_output_length: 100
    print_valid_sents: [0, 1, 2, 3]
    keep_last_ckpts: 1

model:
    initializer: "xavier"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4             # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256   # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4              # TODO: Increase to 8 for larger data.
        embeddings:
            embedding_dim: 256    # TODO: Increase to 512 for larger data.
            scale: True
            dropout: 0.2
        # typically ff_size = 4 x hidden_size
        hidden_size: 256         # TODO: Increase to 512 for larger data.
        ff_size: 1024            # TODO: Increase to 2048 for larger data.
        dropout: 0.3
""".format(name=name, gdrive_path=os.environ["gdrive_path"], source_language=source_language, target_language=target_language)
with open("joeynmt/configs/transformer_{name}.yaml".format(name=name),'w') as f:
    f.write(config)
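
The config mentions Noam scheduling as an alternative to "plateau". To get a feel for what that schedule does, here is a sketch of the formula from the original Transformer paper, scaled by an extra factor. It assumes that `learning_rate_factor`, `learning_rate_warmup` and `hidden_size` from the config map onto `factor`, `warmup` and `hidden_size` below. JoeyNMT's own implementation may differ in detail, so treat this as an illustration only.

```python
def noam_learning_rate(step, hidden_size=256, warmup=1000, factor=0.5):
    """Learning rate at a given update step under the Noam schedule (sketch)."""
    step = max(step, 1)  # Avoid raising zero to a negative power at step 0.
    # Linear increase during warmup, then decay proportional to 1/sqrt(step).
    return factor * hidden_size ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


# Print the rate at a few steps using the values from the config above.
for step in (1, 500, 1000, 2000, 10000):
    print(step, noam_learning_rate(step))
```

The rate rises linearly until `warmup` steps, peaks there, and then decays slowly. With a small corpus, where an epoch has few updates, the warmup length can cover a large part of training, which is worth keeping in mind when trying this schedule.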

# Train the Model

This single line of joeynmt runs the training using the config we made above.

In [26]:
# Train the model
# You can press Ctrl-C to stop. And then run the next cell to save your checkpoints! 
!cd joeynmt; python3 -m joeynmt train configs/transformer_$src$tgt.yaml

2021-11-20 17:29:03,215 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-11-20 17:29:03,248 - INFO - joeynmt.data - Loading training data...
2021-11-20 17:29:03,263 - INFO - joeynmt.data - Building vocabulary...
2021-11-20 17:29:03,526 - INFO - joeynmt.data - Loading dev data...
2021-11-20 17:29:03,529 - INFO - joeynmt.data - Loading test data...
2021-11-20 17:29:03,561 - INFO - joeynmt.data - Data loaded.
2021-11-20 17:29:03,562 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-11-20 17:29:03,875 - INFO - joeynmt.model - Enc-dec model built.
2021-11-20 17:29:06,883 - INFO - joeynmt.training - Total params: 12112128
2021-11-20 17:29:09,855 - INFO - joeynmt.helpers - cfg.name                           : enzu_transformer
2021-11-20 17:29:09,855 - INFO - joeynmt.helpers - cfg.data.src                       : en
2021-11-20 17:29:09,855 - INFO - joeynmt.helpers - cfg.data.trg                       : zu
2021-11-20 17:29:09,856 - INFO - joeynmt.helpers - cfg.data.t

In [27]:
# Copy the created models from the notebook storage to google drive for persistent storage 
! mkdir -p "$gdrive_path/models/${src}${tgt}_transformer/"
! cp -r joeynmt/models/${src}${tgt}_transformer/* "$gdrive_path/models/${src}${tgt}_transformer/"

In [28]:
# Output our validation scores
! cat "$gdrive_path/models/${src}${tgt}_transformer/validations.txt"

Steps: 100	Loss: 29641.04297	PPL: 469.12527	bleu: 0.01651	LR: 0.00030000	*
Steps: 200	Loss: 29303.06250	PPL: 437.35040	bleu: 0.00000	LR: 0.00030000	*
Steps: 300	Loss: 29319.38477	PPL: 438.83441	bleu: 0.01352	LR: 0.00030000	
Steps: 400	Loss: 29250.28711	PPL: 432.58688	bleu: 0.02931	LR: 0.00030000	*
Steps: 500	Loss: 29080.83984	PPL: 417.64059	bleu: 0.03035	LR: 0.00030000	*
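
When the validation log gets long, it is easier to read programmatically. The sketch below parses lines in the format shown above and picks the step with the lowest perplexity. The function names are hypothetical, and the sample lines are copied from the output above, assuming the fields are separated by tabs.

```python
import re


def parse_validation_line(line):
    """Turn one line of validations.txt into a dict of numeric fields."""
    fields = {key: float(value)
              for key, value in re.findall(r"(\w+):\s*([-+.\deE]+)", line)}
    # A trailing asterisk marks a new best checkpoint in the log.
    fields["is_best"] = line.rstrip().endswith("*")
    return fields


def best_by_perplexity(lines):
    """Return the parsed record with the lowest PPL, skipping blank lines."""
    records = [parse_validation_line(line) for line in lines if line.strip()]
    return min(records, key=lambda record: record["PPL"])


sample_lines = [
    "Steps: 100\tLoss: 29641.04297\tPPL: 469.12527\tbleu: 0.01651\tLR: 0.00030000\t*",
    "Steps: 300\tLoss: 29319.38477\tPPL: 438.83441\tbleu: 0.01352\tLR: 0.00030000\t",
    "Steps: 500\tLoss: 29080.83984\tPPL: 417.64059\tbleu: 0.03035\tLR: 0.00030000\t*",
]
best = best_by_perplexity(sample_lines)
print(int(best["Steps"]), best["PPL"], best["bleu"])
```

To run it on the real log, pass `open(path).read().splitlines()` with the path to your `validations.txt`.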


In [29]:
# Test our model
! cd joeynmt; python3 -m joeynmt test "$gdrive_path/models/${src}${tgt}_transformer/config.yaml"

2021-11-20 17:58:17,707 - INFO - root - Hello! This is Joey-NMT (version 1.3).
2021-11-20 17:58:17,708 - INFO - joeynmt.data - Building vocabulary...
2021-11-20 17:58:17,977 - INFO - joeynmt.data - Loading dev data...
2021-11-20 17:58:17,980 - INFO - joeynmt.data - Loading test data...
2021-11-20 17:58:18,012 - INFO - joeynmt.data - Data loaded.
2021-11-20 17:58:18,038 - INFO - joeynmt.prediction - Process device: cuda, n_gpu: 1, batch_size per device: 3600
2021-11-20 17:58:18,038 - INFO - joeynmt.prediction - Loading model from models/enzu_transformer/500.ckpt
2021-11-20 17:58:20,820 - INFO - joeynmt.model - Building an encoder-decoder model...
2021-11-20 17:58:21,078 - INFO - joeynmt.model - Enc-dec model built.
2021-11-20 17:58:21,162 - INFO - joeynmt.prediction - Decoding on dev set (data/enzu/dev.bpe.zu)...
2021-11-20 17:59:09,102 - INFO - joeynmt.prediction -  dev bleu[none]:   0.00 [Beam search decoding with beam size = 5 and alpha = 1.0]
2021-11-20 17:59:09,103 - INFO - joeynmt