# Starter notebook

The purpose of this notebook is to build a baseline model that translates Dyula (dyu, the source language) into French (fr, the target language). We'll train a Transformer model from scratch using JoeyNMT.

NB: Running this notebook end to end takes less than **1h** with the resources (GPU, RAM) shown below.

For more details about JoeyNMT, see its [GitHub page](https://github.com/joeynmt).

## Environment setup

> ⚠ **Important:** Before you start, set runtime type to GPU.

In [1]:
!nvidia-smi

Mon Jun  3 07:49:34 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA RTX A6000               Off |   00000000:3B:00.0 Off |                  Off |
| 30%   32C    P8             27W /  300W |    3568MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX A6000               Off |   00

In [2]:
import torch
torch.__version__

'2.3.0+cu121'

## Data Preparation

### Download

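We download the train, validation, and test subsets of the corpus from the Hugging Face Hub, then split the validation subset so that 1,000 examples are held out as the test set and the remainder is used as the dev set.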

In [3]:
from ml_translation.model import get_datasets

# Make sure you have run `huggingface-cli login` beforehand.
train, val, test = get_datasets(True)
# Hold out 1,000 validation examples as the test set; the rest becomes the dev set.
val = val.train_test_split(test_size=1000)

In [4]:
dev = val["train"]
test = val["test"]

In [5]:
train

Dataset({
    features: ['ID', 'translation'],
    num_rows: 377515
})

In [6]:
import re

## Data preprocessing

# Optional: lowercase the corpora. This makes it easier for the model to generalize, at the cost of losing proper casing.
# Optional: remove punctuation.

src_lang = "dyu"
trg_lang = "fr"
# Raw string so the regex escapes reach `re` unchanged; "\-" keeps the hyphen literal.
chars_to_remove_regex = r'[!"&(),\-./:;=?+\n\[\]]'

def remove_special_characters(text: str) -> str:
    # Lowercase, replace each listed character with a space, then trim both ends.
    text = re.sub(chars_to_remove_regex, " ", text.lower())
    return text.strip()

def clean_text(batch):
    # Clean both the source and the target side of every translation pair.
    batch["translation"] = [
        {
            src_lang: remove_special_characters(line[src_lang]),
            trg_lang: remove_special_characters(line[trg_lang]),
        }
        for line in batch["translation"]
    ]
    return batch

train = train.map(clean_text, batched=True, batch_size=100_000)
dev = dev.map(clean_text, batched=True)
test = test.map(clean_text, batched=True)

Map:   0%|          | 0/377515 [00:00<?, ? examples/s]

Map:   0%|          | 0/471 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]
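To make the effect of the cleaning step concrete, here is a small standalone sketch that applies the same lowercasing and character removal to a made-up sample string. The sample and the optional `collapse_spaces` helper are illustrative assumptions and are not part of the pipeline above. Because each removed character is replaced by a space, runs of spaces can remain inside a sentence.

```python
import re

# Same character class as in the cleaning cell, written as a raw string.
chars_to_remove_regex = r'[!"&(),\-./:;=?+\n\[\]]'

def remove_special_characters(text: str) -> str:
    """Lowercase, replace listed punctuation with spaces, and trim both ends."""
    return re.sub(chars_to_remove_regex, " ", text.lower()).strip()

def collapse_spaces(text: str) -> str:
    """Hypothetical extra step: squeeze runs of whitespace into one space."""
    return re.sub(r"\s+", " ", text)

sample = "Mun? Fɛn dɔ."  # made-up input for illustration
cleaned = remove_special_characters(sample)
print(repr(cleaned))                   # 'mun  fɛn dɔ' (two spaces after "mun")
print(repr(collapse_spaces(cleaned)))  # 'mun fɛn dɔ'
```

Whether to collapse the extra spaces is a design choice; the subword tokenizer may already normalize whitespace, so this step could be redundant.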

Let's inspect the sentences.

In [7]:
train[:5]

{'ID': ['ID_18897661270129',
  'ID_18479132727846',
  'ID_18164131280307',
  'ID_18344573728152',
  'ID_18127342282717'],
 'translation': [{'dyu': 'a bi ji min na', 'fr': 'il boit de l’eau'},
  {'dyu': 'a le dalakolontɛ lon bɛ', 'fr': 'il se plaint toujours'},
  {'dyu': 'mun  fɛn dɔ', 'fr': 'quoi   quelque chose'},
  {'dyu': 'o bɛ bi bɔra fo gubeta', 'fr': 'tous sortent excepté gubetta'},
  {'dyu': 'a ale lo bi da bugɔ la', 'fr': 'ah   c’est lui… il sonne…'}]}

In [8]:
train.features

{'ID': Value(dtype='string', id=None),
 'translation': Translation(languages=['dyu', 'fr'], id=None)}

Save the train, dev, and test subsets to disk.

In [9]:
import os

In [10]:
data_dir = "extra_dataset/dyu_fr"
train.save_to_disk(os.path.join(data_dir, "train"))

dev.save_to_disk(os.path.join(data_dir, "validation"))

test.save_to_disk(os.path.join(data_dir, "test"))

Saving the dataset (0/1 shards):   0%|          | 0/377515 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/471 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1000 [00:00<?, ? examples/s]

### Vocabulary

We will use the [sentencepiece](https://github.com/google/sentencepiece) library to split words into subwords (BPE) according to their frequency in the training corpus.

The `build_vocab.py` script trains the BPE model and creates a joint vocabulary. It takes the same config file as JoeyNMT.
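To give an intuition for what a BPE model learns, the following is a toy, pure-Python sketch of byte-pair encoding: starting from single characters, it repeatedly merges the most frequent adjacent symbol pair. It is not how sentencepiece is implemented (sentencepiece works on raw text, marks word boundaries with a special symbol, and is far more efficient). The tiny word list and the function names `learn_bpe` and `segment` are illustrative assumptions.

```python
from collections import Counter

def merge_symbols(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    vocab = Counter(tuple(word) for word in words)  # symbols of a word -> frequency
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_symbols(symbols, best)] += freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Split a word into subwords by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_symbols(symbols, pair)
    return list(symbols)

toy_words = ["bi", "bi", "bi", "bɛ", "bɛ", "bugɔ"]  # made-up frequencies
merges = learn_bpe(toy_words, num_merges=2)
print(merges)                   # [('b', 'i'), ('b', 'ɛ')]
print(segment("bira", merges))  # ['bi', 'r', 'a']
print(segment("bugɔ", merges))  # ['b', 'u', 'g', 'ɔ']
```

Frequent character sequences end up as single subwords, while rare words fall back to smaller pieces, which is why a limited vocabulary can still cover unseen words.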

In [11]:
from pathlib import Path

# model dir
model_dir = "models/dyu_fr"

# Create the config
config = """
name: "dyu_fr_transformer-sp"
joeynmt_version: "2.3.0"
model_dir: "{model_dir}"
use_cuda: True
fp16: False

data:
    train: "{data_dir}/train"
    dev: "{data_dir}/validation"
    test: "{data_dir}/test"
    dataset_type: "huggingface"
    dataset_cfg:
        name: "dyu-fr"
    # sample_dev_subset: 1460
    src:
        lang: "dyu"
        max_length: 100
        lowercase: False
        normalize: False
        level: "bpe"
        voc_limit: 10000
        voc_min_freq: 1
        voc_file: "{data_dir}/vocab.txt"
        tokenizer_type: "sentencepiece"
        tokenizer_cfg:
            model_file: "{data_dir}/sp.model"
    trg:
        lang: "fr"
        max_length: 100
        lowercase: False
        normalize: False
        level: "bpe"
        voc_limit: 10000
        voc_min_freq: 1
        voc_file: "{data_dir}/vocab.txt"
        tokenizer_type: "sentencepiece"
        tokenizer_cfg:
            model_file: "{data_dir}/sp.model"
    special_symbols:
        unk_token: "<unk>"
        unk_id: 0
        pad_token: "<pad>"
        pad_id: 1
        bos_token: "<s>"
        bos_id: 2
        eos_token: "</s>"
        eos_id: 3

""".format(data_dir=data_dir, model_dir=model_dir)
with (Path(data_dir) / "config.yaml").open('w') as f:
    f.write(config)

Call the `build_vocab.py` script with the `--joint` flag to build the vocabulary:

In [13]:
!python build_vocab.py {data_dir}/config.yaml --joint

Traceback (most recent call last):
  File "/data/home/eak/learning/zindi_challenge/machine_translation/build_vocab.py", line 399, in <module>
    main(args)
  File "/data/home/eak/learning/zindi_challenge/machine_translation/build_vocab.py", line 342, in main
    run(
  File "/data/home/eak/learning/zindi_challenge/machine_translation/build_vocab.py", line 252, in run
    sents = _get_sents(args, train_data, langs, tokenized=False)
  File "/data/home/eak/learning/zindi_challenge/machine_translation/build_vocab.py", line 236, in _get_sents
    assert len(sents) <= args.random_subset, (len(sents), len(dataset))
AssertionError: (1372420, 686210)


The generated vocabulary looks like this:

In [14]:
!head -10 {data_dir}/vocab.txt

head: cannot open 'extra_dataset/dyu_fr/vocab.txt' for reading: No such file or directory


## Model Training

### Configuration

Joey NMT reads model and training hyperparameters from a configuration file. We're generating this now to configure paths in the appropriate places.

The configuration below builds a small Transformer model whose embeddings are shared between the source and target languages, based on the joint subword vocabulary created above.
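A few size constraints are worth checking before training. The sketch below is a standalone sanity check, not part of JoeyNMT: the dictionary mirrors the encoder and decoder sizes used in the configuration of this section, and the helper name `check_transformer_block` is hypothetical. It relies on the standard multi-head attention requirement that the hidden size be divisible by the number of heads, and on the assumption that the embedding dimension matches the hidden size because no projection layer sits between them.

```python
def check_transformer_block(name, hidden_size, num_heads, ff_size, embedding_dim):
    """Check basic size constraints of one Transformer stack and return its head size."""
    if hidden_size % num_heads != 0:
        raise ValueError(f"{name}: hidden_size must be divisible by num_heads")
    if embedding_dim != hidden_size:
        raise ValueError(f"{name}: embedding_dim should match hidden_size")
    head_dim = hidden_size // num_heads
    print(f"{name}: head_dim={head_dim}, ff_size/hidden_size={ff_size / hidden_size:g}")
    return head_dim

# Values copied from this notebook's model configuration.
model_cfg = {
    "encoder": dict(hidden_size=256, num_heads=4, ff_size=1024, embedding_dim=256),
    "decoder": dict(hidden_size=256, num_heads=8, ff_size=1024, embedding_dim=256),
}
head_dims = {name: check_transformer_block(name, **cfg) for name, cfg in model_cfg.items()}
```

With these values the encoder uses 64-dimensional heads and the decoder 32-dimensional heads. The differing head counts are valid, since each divides the hidden size evenly.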

In [15]:
config += """
testing:
    #load_model: "{model_dir}/best.ckpt"
    n_best: 2
    beam_size: 5
    beam_alpha: 1.0
    batch_size: 64
    batch_type: "token"
    max_output_length: 125
    eval_metrics: ["bleu"]
    #return_prob: "hyp"
    #return_attention: False
    sacrebleu_cfg:
        tokenize: "13a"

training:
    # load_model: "{model_dir}/latest.ckpt"
    # reset_best_ckpt: True
    #reset_scheduler: False
    #reset_optimizer: False
    #reset_iter_state: False
    random_seed: 42
    optimizer: "adamw"
    normalization: "tokens"
    adam_betas: [0.9, 0.999]
    scheduling: "warmupinversesquareroot"
    learning_rate_warmup: 100
    learning_rate: 0.0003
    learning_rate_min: 0.00001
    weight_decay: 0.00001
    label_smoothing: 0.1
    loss: "crossentropy"
    batch_size: 64
    batch_type: "token"
    batch_multiplier: 4
    early_stopping_metric: "bleu"
    epochs: 30
    updates: 550
    validation_freq: 500
    logging_freq: 2
    overwrite: True
    shuffle: True
    print_valid_sents: [0, 1, 2, 3]
    keep_best_ckpts: 3

model:
    initializer: "xavier_uniform"
    bias_initializer: "zeros"
    init_gain: 1.0
    embed_initializer: "xavier_uniform"
    embed_init_gain: 1.0
    tied_embeddings: True
    tied_softmax: True
    encoder:
        type: "transformer"
        num_layers: 6
        num_heads: 4
        embeddings:
            embedding_dim: 256
            scale: True
            dropout: 0.0
        # typically ff_size = 4 x hidden_size
        hidden_size: 256
        ff_size: 1024
        dropout: 0.2
        layer_norm: "pre"
    decoder:
        type: "transformer"
        num_layers: 6
        num_heads: 8
        embeddings:
            embedding_dim: 256
            scale: True
            dropout: 0.0
        # typically ff_size = 4 x hidden_size
        hidden_size: 256
        ff_size: 1024
        dropout: 0.1
        layer_norm: "pre"

""".format(model_dir=model_dir)
with (Path(data_dir) / "train_config.yaml").open('w') as f:
    f.write(config)
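The `warmupinversesquareroot` setting combines a linear warm-up with an inverse-square-root decay. The sketch below shows a common formulation of such a schedule using the values from the config (`learning_rate`, `learning_rate_warmup`, `learning_rate_min`). It is an illustration under that assumption; JoeyNMT's exact implementation may differ in details, and the function name is hypothetical.

```python
import math

def warmup_inverse_sqrt_lr(step, peak=0.0003, warmup=100, min_rate=0.00001):
    """Learning rate at a given update step (assumed formulation, see text)."""
    if step < warmup:
        rate = peak * step / warmup             # linear warm-up towards the peak
    else:
        rate = peak * math.sqrt(warmup / step)  # decay proportional to 1/sqrt(step)
    return max(rate, min_rate)                  # never drop below the floor

for step in (1, 50, 100, 400, 550):
    print(f"step {step:>3}: lr = {warmup_inverse_sqrt_lr(step):.6f}")
```

Under this formulation the rate reaches its peak at update 100 and has halved by update 400, so most of the 550 updates run at a decaying rate.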

### Run training
⏳ The log reports training progress. Look out for the printed example translations and the BLEU evaluation scores to get an impression of the current quality.
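How often those validation reports appear follows from the training settings. The arithmetic below is a back-of-the-envelope sketch under assumptions about how JoeyNMT interprets these options: that `batch_size` counts tokens when `batch_type` is `"token"`, that `batch_multiplier` accumulates gradients over that many batches per update, that training stops once `updates` is reached, and that validation runs whenever the update count is a multiple of `validation_freq`.

```python
# Values copied from the training section of the config.
batch_size_tokens = 64
batch_multiplier = 4
max_updates = 550
validation_freq = 500

# Tokens contributing to a single parameter update (upper bound).
tokens_per_update = batch_size_tokens * batch_multiplier

# Upper bound on tokens seen before the update limit stops training.
max_tokens_seen = tokens_per_update * max_updates

# Update steps at which a validation round would be triggered.
validation_steps = list(range(validation_freq, max_updates + 1, validation_freq))

print(tokens_per_update)  # 256
print(max_tokens_seen)    # 140800
print(validation_steps)   # [500]
```

Under these assumptions the log would contain a single validation round, at update 500. Raising `updates` or lowering `validation_freq` would give more BLEU readings.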

In [16]:
!python -m joeynmt train {data_dir}/train_config.yaml --skip-test

2024-05-31 14:09:04,758 - INFO - root - Hello! This is Joey-NMT (version 2.3.0).
2024-05-31 14:09:04,758 - INFO - joeynmt.config -                           cfg.name : dyu_fr_transformer-sp
2024-05-31 14:09:04,758 - INFO - joeynmt.config -                cfg.joeynmt_version : 2.3.0
2024-05-31 14:09:04,758 - INFO - joeynmt.config -                      cfg.model_dir : models/dyu_fr
2024-05-31 14:09:04,758 - INFO - joeynmt.config -                       cfg.use_cuda : True
2024-05-31 14:09:04,758 - INFO - joeynmt.config -                           cfg.fp16 : False
2024-05-31 14:09:04,758 - INFO - joeynmt.config -                     cfg.data.train : extra_dataset/dyu_fr/train
2024-05-31 14:09:04,759 - INFO - joeynmt.config -                       cfg.data.dev : extra_dataset/dyu_fr/validation
2024-05-31 14:09:04,759 - INFO - joeynmt.config -                      cfg.data.test : extra_dataset/dyu_fr/test
2024-05-31 14:09:04,759 - INFO - joeynmt.config -              cfg.data.dataset_type 

In [17]:
# Update the config saved in the model directory so that it loads the best
# checkpoint and points to the tokenizer and vocabulary files in model_dir.
config_path = Path(model_dir) / "config.yaml"
with config_path.open("r") as f:
    config = f.read()

resume_config = (
    config
    .replace(f'#load_model: "{model_dir}/best.ckpt"',
             f'load_model: "{model_dir}/best.ckpt"')
    .replace(f'model_file: "{data_dir}/sp.model"',
             f'model_file: "{model_dir}/sp.model"')
    .replace(f'voc_file: "{data_dir}/vocab.txt"',
             f'voc_file: "{model_dir}/vocab.txt"')
)

with config_path.open("w") as f:
    f.write(resume_config)
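One caveat with `str.replace` is that it silently returns the text unchanged when the search string is absent, for example if the saved config quotes or indents a line differently. A defensive variant, sketched here with a hypothetical helper `replace_or_fail` and a made-up config snippet, raises an error instead.

```python
def replace_or_fail(text: str, old: str, new: str) -> str:
    """Like str.replace, but raise if `old` does not occur in `text`."""
    if old not in text:
        raise ValueError(f"pattern not found: {old!r}")
    return text.replace(old, new)

model_dir = "models/dyu_fr"
# Made-up excerpt standing in for the saved config file.
snippet = f'testing:\n    #load_model: "{model_dir}/best.ckpt"\n    n_best: 2\n'

patched = replace_or_fail(
    snippet,
    f'#load_model: "{model_dir}/best.ckpt"',
    f'load_model: "{model_dir}/best.ckpt"',
)
print(patched)

try:
    # The commented-out line is gone now, so a second attempt fails loudly.
    replace_or_fail(patched, f'#load_model: "{model_dir}/best.ckpt"', "load_model: x")
except ValueError as err:
    print(err)
```

Failing loudly makes it obvious when the config was not actually updated, rather than discovering it later when the checkpoint is not loaded.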