In [6]:
# Install OpenNMT-py 3.x
!pip3 install OpenNMT-py



In [10]:
# Create the YAML configuration file
# On a regular machine, you can create it manually or with nano
# Note here we are using some smaller values because the dataset is small
# For larger datasets, consider increasing: train_steps, valid_steps, warmup_steps, save_checkpoint_steps, keep_checkpoint

config = '''# config.yaml


## Where the samples will be written
save_data: run

# Training files
data:
    corpus_1:
        path_src: en-zh.en-filtered.en.subword.train
        path_tgt: en-zh.zh-filtered.zh.subword.train
        transforms: [filtertoolong]
    valid:
        path_src: en-zh.en-filtered.en.subword.dev
        path_tgt: en-zh.zh-filtered.zh.subword.dev
        transforms: [filtertoolong]

# Vocabulary files, generated by onmt_build_vocab
src_vocab: run/source.vocab
tgt_vocab: run/target.vocab

# Vocabulary size - should match the vocabulary size used for SentencePiece
src_vocab_size: 50000
tgt_vocab_size: 50000

# Filter out source/target longer than n if [filtertoolong] enabled
src_seq_length: 150
tgt_seq_length: 150

# Tokenization options
src_subword_model: source.model
tgt_subword_model: target.model

# Where to save the log file and the output models/checkpoints
log_file: train.log
save_model: models/model.fren

# Stop training if it does not improve after n validations
early_stopping: 4

# Default: 5000 - Save a model checkpoint every n steps
save_checkpoint_steps: 1000

# To save space, limit checkpoints to last n
# keep_checkpoint: 3

seed: 3435

# Default: 100000 - Train the model to max n steps 
# Increase to 200000 or more for large datasets
# For fine-tuning, add the required steps to the original steps
train_steps: 3000

# Default: 10000 - Run validation after n steps
valid_steps: 1000

# Default: 4000 - for large datasets, try up to 8000
warmup_steps: 1000
report_every: 100

# Number of GPUs, and IDs of GPUs
world_size: 1
gpu_ranks: [0]

# Batching
bucket_size: 4096
num_workers: 0  # Default: 2, set to 0 when RAM out of memory
batch_type: "tokens"
batch_size: 4096   # Tokens per batch, change when CUDA out of memory
valid_batch_size: 2048
max_generator_batches: 2
accum_count: [4]
accum_steps: [0]

# Optimization
model_dtype: "fp16"
optim: "adam"
learning_rate: 2
# warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
encoder_type: transformer
decoder_type: transformer
position_encoding: true
enc_layers: 6
dec_layers: 6
heads: 8
hidden_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
'''

with open("config.yaml", "w+") as config_yaml:
  config_yaml.write(config)
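
Many YAML loaders, including PyYAML, silently keep the last value when a key is repeated, so a mistyped key name can go unnoticed. The following is a minimal sketch of a duplicate-key check for top-level keys only; the function name `duplicate_top_level_keys` and the sample text are illustrative and not part of OpenNMT-py.

```python
import re
from collections import Counter


def duplicate_top_level_keys(yaml_text):
    """Return top-level keys that appear more than once, in first-seen order."""
    # A top-level key starts at column 0 and is followed by a colon.
    # Comment lines start with '#' and indented keys start with spaces,
    # so neither is matched by this pattern.
    keys = re.findall(r"^([A-Za-z_][\w-]*):", yaml_text, flags=re.MULTILINE)
    counts = Counter(keys)
    return [key for key, n in counts.items() if n > 1]


# Illustrative sample text, not the full configuration
sample = "src_seq_length: 150\nsrc_seq_length: 150\ntrain_steps: 3000\n"
print(duplicate_top_level_keys(sample))
```

In the notebook, calling `duplicate_top_level_keys(config)` before writing the file would list any repeated top-level option.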

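To see how `learning_rate`, `hidden_size` and `warmup_steps` interact under `decay_method: "noam"`, here is a small sketch of the Noam schedule as described in the original Transformer paper. It assumes OpenNMT-py applies the same formula, scaled by `learning_rate`; the function name `noam_lr` is hypothetical.

```python
def noam_lr(step, learning_rate=2.0, hidden_size=512, warmup_steps=1000):
    """Learning rate at a given step under the Noam schedule (assumed formula)."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    return learning_rate * hidden_size ** -0.5 * min(
        step ** -0.5, step * warmup_steps ** -1.5
    )


# The rate grows linearly during warmup, peaks at warmup_steps, then decays.
for step in (100, 500, 1000, 3000):
    print(step, f"{noam_lr(step):.6f}")
```

With the values from the configuration above, the peak occurs at step 1000, which is why a larger `warmup_steps` is usually paired with a longer `train_steps`.
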
In [None]:
# Find the number of CPUs/cores on the machine
!nproc --all
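
The same information is available from Python, which can be handy for choosing a `-num_threads` value programmatically. This is a minimal sketch; the helper `suggest_num_threads` and the idea of reserving one core are assumptions, not an OpenNMT-py recommendation.

```python
import os


def suggest_num_threads(reserve=1):
    """Return the CPU count minus `reserve`, but never less than 1."""
    total = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, total - reserve)


print(suggest_num_threads())
```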

In [3]:
# Build Vocabulary

# -config: path to your config.yaml file
# -n_sample: use -1 to build the vocabulary on all the segments in the training dataset
# -num_threads: change it to match the number of CPUs to run it faster

!onmt_build_vocab -config config.yaml -n_sample -1 -num_threads 7

Corpus corpus_1's weight should be given. We default it to 1 for you.
[2025-03-21 01:38:12,956 INFO] Counter vocab from -1 samples.
[2025-03-21 01:38:12,956 INFO] n_sample=-1: Build vocab on full datasets.
[2025-03-21 01:38:16,602 INFO] Counters src: 4594
[2025-03-21 01:38:16,602 INFO] Counters tgt: 2820


In [2]:
# Check if the GPU is active
!nvidia-smi -L

GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-3da117a9-dc75-64e3-3c01-4f5d3b83c278)


In [3]:
# Check if the GPU is visible to PyTorch
import torch

print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))

gpu_memory = torch.cuda.mem_get_info(0)
print("Free GPU memory:", gpu_memory[0]/1024**2, "out of:", gpu_memory[1]/1024**2)

True
NVIDIA A100-SXM4-40GB
Free GPU memory: 39900.25 out of: 40326.375


In [11]:
# Train the NMT model
!onmt_train -config config.yaml


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.1.2 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):
  File "/venv/main/bin/onmt_train", line 5, in <module>
    from onmt.bin.train import main
  File "/venv/main/lib/python3.10/site-packages/onmt/__init__.py", line 2, in <module>
    import onmt.inputters
  File "/venv/main/lib/python3.10/site-packages/onmt/inputters/__init__.py", line 7, in <module>
    from onmt.inputters.text_utils import text_sort_key, process, numericalize, tensorify
  File "/venv/main/lib/python3.10/site-packages/onmt/inputters/text_utils.py", line 1, in <module>
    import torch
  File "/venv/main/l

## Translate

In [23]:
# Translate the "subworded" source file of the test dataset
# Change the model name, if needed.
!onmt_translate -model models/model.fren_step_3000.pt -src en-zh.zh-filtered.zh.subword.test -output en.translated -gpu 0 -min_length 1

[2025-03-20 18:38:03,064 INFO] Loading checkpoint from models/model.fren_step_3000.pt
[2025-03-20 18:38:03,915 INFO] Loading data into the model
[2025-03-20 18:38:05,816 INFO] PRED SCORE: -0.3778, PRED PPL: 1.46 NB SENTENCES: 200
Time w/o python interpreter load/terminate:  2.7549333572387695
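
The two numbers in the log line are related. Assuming `PRED SCORE` is the average log-probability per predicted token, the perplexity is its negated exponential. A quick check with the values above:

```python
import math

pred_score = -0.3778  # PRED SCORE from the log above
pred_ppl = math.exp(-pred_score)

print(round(pred_ppl, 2))  # → 1.46
```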


In [16]:
%pip install "numpy<2"


Collecting numpy<2
  Downloading numpy-1.26.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m18.2/18.2 MB[0m [31m92.3 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.1.2
    Uninstalling numpy-2.1.2:
      Successfully uninstalled numpy-2.1.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.21.0+cu124 requires torch==2.6.0, but you have torch 2.2.2 which is incompatible.
Successfully installed numpy-1.26.4
Note: you may need to restart the kernel to use updated packages.
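
To find out before training whether the installed NumPy would trigger the "compiled using NumPy 1.x" error, the version can be checked from Python. This is a sketch using only the standard library; the helper names are hypothetical, and the check only matters when the installed packages were built against NumPy 1.x, as in this environment.

```python
from importlib import metadata


def major_version(version):
    """Return the major component of a version string such as '2.1.2'."""
    return int(version.split(".")[0])


def numpy_needs_downgrade(version):
    """True if the given NumPy version is 2.x or later."""
    return major_version(version) >= 2


try:
    installed = metadata.version("numpy")
    print(installed, "needs downgrade:", numpy_needs_downgrade(installed))
except metadata.PackageNotFoundError:
    print("numpy is not installed")
```

After downgrading, restart the kernel so that the new version is picked up.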


In [35]:
# Check the first 30 lines of the translation file
!head -n 30 en.translated

▁Of ▁course .
▁And ▁Le gad ema ?
▁Perfect .
▁V olunteer : ▁Yes .
▁Electr i C ity .
▁Where ▁is ▁Olu ▁from ?
▁A llow ▁your ▁eyes ▁to ▁close , ▁on ▁five , ▁four , ▁three , ▁two , ▁one .
▁Hope ful ly ▁everybody .
▁There ' s ▁no ▁question .
▁A udience : ▁Now .
▁A udience : ▁ 6 .
▁Be ▁gen u ine .
▁What ▁happens ?
▁Woo ▁hoo !
▁So , ▁lin ked ▁data ▁-- ▁it ' s ▁hu
▁BJ : ▁Ab out ▁ 4 0 ▁people .
▁And ▁even tually , ▁we ▁found ▁our ▁photo bomb ing ▁Ka up ich phys ▁e el
▁False .
▁They ▁are ▁elephant - adapt ed .
▁That ' s ▁all .
▁L isten .
▁It ' s ▁remarkable .
▁There . ▁Good .
▁June ▁Co hen : ▁So ▁I s abel ▁ — ▁I A : ▁Thank ▁you .
▁You ' re ▁V - I - O - L - E - N - T .
▁We ▁sh all .
▁And ▁here ' s ▁the ▁ rub .
▁One , ▁two , ▁three , ▁four ,
▁Well , ▁p retty ▁ba d .
▁Hi ▁there .


In [36]:
# If needed install/update sentencepiece
!pip3 install --upgrade -q sentencepiece

# Desubword the translation file
!python3 ./MT-Preparation/subwording/3-desubword.py ./target.model en.translated

Done desubwording! Output: en.translated.desubword
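
For intuition, desubwording SentencePiece output amounts to joining the pieces and turning the `▁` marker back into a space. The sketch below is a pure-Python approximation for illustration only; it is not the code of `3-desubword.py`, which decodes with the SentencePiece model.

```python
def desubword(line):
    """Approximate SentencePiece decoding of a space-separated piece sequence."""
    pieces = line.split()
    # Concatenate the pieces, then restore word boundaries marked by '▁'.
    return "".join(pieces).replace("▁", " ").strip()


print(desubword("▁Of ▁course ."))
print(desubword("▁A udience : ▁ 6 ."))
```

Both results match the corresponding lines in the desubworded file: `Of course.` and `Audience: 6.`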


In [37]:
# Desubword the target file (reference) of the test dataset
# Note: You could instead split the files *before* subwording during dataset preparation,
# but datasets sometimes have tokenization issues, so this way you can be sure the file is really untokenized.
!python3 ./MT-Preparation/subwording/3-desubword.py ./target.model en-zh.en-filtered.en.subword.test

Done desubwording! Output: en-zh.en-filtered.en.subword.test.desubword


In [40]:
# Check the first 30 lines of the desubworded translation file
!head -n 30 en.translated.desubword

print("---------------")
# Check the first 30 lines of the desubworded reference
!head -n 30 en-zh.en-filtered.en.subword.test.desubword

Of course.
And Legadema?
Perfect.
Volunteer: Yes.
ElectriCity.
Where is Olu from?
Allow your eyes to close, on five, four, three, two, one.
Hopefully everybody.
There's no question.
Audience: Now.
Audience: 6.
Be genuine.
What happens?
Woo hoo!
So, linked data -- it's hu
BJ: About 40 people.
And eventually, we found our photobombing Kaupichphys eel
False.
They are elephant-adapted.
That's all.
Listen.
It's remarkable.
There. Good.
June Cohen: So Isabel — IA: Thank you.
You're V-I-O-L-E-N-T.
We shall.
And here's the rub.
One, two, three, four,
Well, pretty bad.
Hi there.
---------------
Sign language.
Does it say "Michelle Obama" under the picture?
That's you.
Volunteer: No.
Telegraph? No.
I don't care. It's a loaner.
Okay, so computer translation, not yet good enough.
You've got Lady Gaga.
Rives: Exactly.
It's a remarkable thing.
65 dollars.
Sincerely, Mr Micheal Bangura.
What happened?
Yeah, yeah!
obese ...
BJ: Joey.
There's magic to love!
False.
Like a man.
Oh, it's so sad.
Continue,

## Evaluation

In [32]:
# Download the BLEU script
!wget https://raw.githubusercontent.com/ymoslem/MT-Evaluation/main/BLEU/compute-bleu.py

--2025-03-20 18:42:23--  https://raw.githubusercontent.com/ymoslem/MT-Evaluation/main/BLEU/compute-bleu.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 957 [text/plain]
Saving to: ‘compute-bleu.py’


2025-03-20 18:42:23 (46.4 MB/s) - ‘compute-bleu.py’ saved [957/957]



In [33]:
# Install sacrebleu
!pip3 install sacrebleu



In [34]:
# Evaluate the translation (without subwording)
!python3 compute-bleu.py en-zh.en-filtered.en.subword.test.desubword en.translated.desubword

Reference 1st sentence: Sign language.
MTed 1st sentence: Of course.
BLEU:  7.0510299934554235
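
Corpus-level BLEU compares the files line by line, so the reference and the translation must contain the same number of lines in the same order. A minimal sanity check, assuming plain UTF-8 text files; the helper names are hypothetical and the demo uses throwaway files rather than the notebook's data.

```python
import os
import tempfile


def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def same_length(reference_path, hypothesis_path):
    """True if both files have the same number of lines."""
    return count_lines(reference_path) == count_lines(hypothesis_path)


# Demo with two throwaway files; in the notebook, pass the .desubword files.
with tempfile.TemporaryDirectory() as tmp:
    ref = os.path.join(tmp, "ref.txt")
    hyp = os.path.join(tmp, "hyp.txt")
    with open(ref, "w", encoding="utf-8") as f:
        f.write("a\nb\n")
    with open(hyp, "w", encoding="utf-8") as f:
        f.write("x\ny\n")
    demo_ok = same_length(ref, hyp)

print(demo_ok)
```

In this notebook the call would be `same_length("en-zh.en-filtered.en.subword.test.desubword", "en.translated.desubword")`.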
