# **Machine Translation**

Machine translation (MT) is the task of automatically translating text or speech from
one natural language to another. MT is a subfield of natural language processing (NLP)
that draws on artificial intelligence, information theory, computer science, and
statistics [[1]](#scrollTo=XVBaTGCh4zGt).



## **Machine translation with OpenNMT**
OpenNMT is an open-source ecosystem for neural machine translation and neural sequence learning [[2]](https://opennmt.net/). OpenNMT provides implementations in two popular deep learning frameworks:
* ``OpenNMT-py`` (PyTorch)
* ``OpenNMT-tf`` (TensorFlow)

In this example, the ``OpenNMT-py`` library is used to demonstrate a neural machine translation (NMT) task.
The example is based on the ``OpenNMT-py`` quickstart [[3]](https://github.com/OpenNMT/OpenNMT-py#quickstart).

### Install ``OpenNMT-py``

In [1]:
# Install OpenNMT-py 2.x
## NOTE: At the end of the installation, you might be asked to restart the runtime.
## In that case, click the "RESTART RUNTIME" button.
!pip3 install git+https://github.com/OpenNMT/OpenNMT-py.git

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/OpenNMT/OpenNMT-py.git
  Cloning https://github.com/OpenNMT/OpenNMT-py.git to /tmp/pip-req-build-5nyg1iw0
  Running command git clone -q https://github.com/OpenNMT/OpenNMT-py.git /tmp/pip-req-build-5nyg1iw0
Collecting torchtext==0.5.0
  Downloading torchtext-0.5.0-py3-none-any.whl (73 kB)
Collecting configargparse
  Downloading ConfigArgParse-1.5.3-py3-none-any.whl (20 kB)
Collecting waitress
  Downloading waitress-2.1.2-py3-none-any.whl (57 kB)
Collecting pyonmttok<2,>=1.23
  Downloading pyonmttok-1.32.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (16.6 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.man

In [2]:
# On Google Colab ONLY
# Reinstall PyTorch to avoid incompatibility with CUDA 10.1

# NOTE: At the end of the installation, you might be asked to restart the runtime.
# In that case, click the "RESTART RUNTIME" button.

!pip3 install --ignore-installed torch==1.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.6.0+cu101
  Downloading https://download.pytorch.org/whl/cu101/torch-1.6.0%2Bcu101-cp37-cp37m-linux_x86_64.whl (708.0 MB)
Collecting future
  Downloading future-0.18.2.tar.gz (829 kB)
Collecting numpy
  Downloading numpy-1.21.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
Building wheels for collected packages: future
  Building wheel for future (setup.py) ... done
  Created wheel for future: filename=future-0.18.2-py3-none-any.whl size=491070 sha256=8c673353e17392b78bf0b32d972edcea2807417772490211293ab5e68bcd910e
  Stored in directory: /root/.cache/pip/wheels/56/b0/fe/

### Download files

In [1]:
# Download the files of the QuickStart
!wget https://s3.amazonaws.com/opennmt-trainingdata/toy-ende.tar.gz
!tar xf toy-ende.tar.gz

--2022-07-26 07:22:38--  https://s3.amazonaws.com/opennmt-trainingdata/toy-ende.tar.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.75.54
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.75.54|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1662081 (1.6M) [application/x-gzip]
Saving to: ‘toy-ende.tar.gz’


2022-07-26 07:22:39 (4.21 MB/s) - ‘toy-ende.tar.gz’ saved [1662081/1662081]



In [2]:
# Optional: List the extracted files
!cd toy-ende/ && ls

src-test.txt   src-val.txt   tgt-train.txt
src-train.txt  tgt-test.txt  tgt-val.txt


In [3]:
# Optional: Print the first 3 lines of the source file
!head -n 3 toy-ende/src-train.txt

It is not acceptable that , with the help of the national bureaucracies , Parliament &apos;s legislative prerogative should be made null and void by means of implementing provisions whose content , purpose and extent are not laid down in advance .
Federal Master Trainer and Senior Instructor of the Italian Federation of Aerobic Fitness , Group Fitness , Postural Gym , Stretching and Pilates; from 2004 , he has been collaborating with Antiche Terme as personal Trainer and Instructor of Stretching , Pilates and Postural Gym .
&quot; Two soldiers came up to me and told me that if I refuse to sleep with them , they will kill me . They beat me and ripped my clothes .
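
The sample lines contain escaped entities such as `&apos;` and `&quot;`. This looks like the escaping applied by Moses-style tokenizers, although that is an assumption, since the notebook does not say how the corpus was preprocessed. The following minimal sketch unescapes a line for reading purposes only, using Python's standard `html.unescape`; the helper name `readable` is made up for this illustration, and the training files themselves are left untouched.

```python
import html


def readable(line: str) -> str:
    """Return a copy of a corpus line with HTML entities unescaped (display only)."""
    return html.unescape(line)


# A fragment taken from the first sample line above
sample = "Parliament &apos;s legislative prerogative"
print(readable(sample))
```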


In [4]:
# Optional: Check the number of lines in the source file
!echo "Number of lines:" && wc -l toy-ende/src-train.txt

Number of lines:
10000 toy-ende/src-train.txt
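
Because training pairs each source line with the target line at the same position, it can be worth checking that both files have the same number of lines. The sketch below is one possible check, not part of the quickstart; the helper names `count_lines` and `is_parallel` are hypothetical, and the file check is skipped if the toy data has not been downloaded.

```python
from pathlib import Path
from typing import Iterable


def count_lines(lines: Iterable[str]) -> int:
    """Count the items in an iterable of lines (for example, an open text file)."""
    return sum(1 for _ in lines)


def is_parallel(src_lines: Iterable[str], tgt_lines: Iterable[str]) -> bool:
    """Return True if source and target contain the same number of lines."""
    return count_lines(src_lines) == count_lines(tgt_lines)


src_path = Path("toy-ende/src-train.txt")
tgt_path = Path("toy-ende/tgt-train.txt")

# Only run the check when the files from the download step are present
if src_path.exists() and tgt_path.exists():
    with src_path.open(encoding="utf-8") as src, tgt_path.open(encoding="utf-8") as tgt:
        print("Line counts match:", is_parallel(src, tgt))
```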


### Prepare data

In [5]:
# Create the YAML configuration file
# On a regular machine, you can create it manually or with nano

config = '''# toy_en_de.yaml

## Where the samples will be written
save_data: toy-ende/run/example

## Where the vocab(s) will be written
src_vocab: toy-ende/run/example.vocab.src
tgt_vocab: toy-ende/run/example.vocab.tgt

## Where the model will be saved
save_model: model/model

# Prevent overwriting existing files in the folder
overwrite: False

# Corpus opts:
data:
    corpus_1:
        path_src: toy-ende/src-train.txt
        path_tgt: toy-ende/tgt-train.txt
    valid:
        path_src: toy-ende/src-val.txt
        path_tgt: toy-ende/tgt-val.txt

world_size: 1
gpu_ranks: [0]

# Remove or modify these lines for bigger files
train_steps: 1000
valid_steps: 200
'''
# Write the configuration to a file
with open("toy_en_de.yaml", "w") as config_yaml:
  config_yaml.write(config)

# Look at the file content
!cat toy_en_de.yaml

# toy_en_de.yaml

## Where the samples will be written
save_data: toy-ende/run/example

## Where the vocab(s) will be written
src_vocab: toy-ende/run/example.vocab.src
tgt_vocab: toy-ende/run/example.vocab.tgt

## Where the model will be saved
save_model: model/model

# Prevent overwriting existing files in the folder
overwrite: False

# Corpus opts:
data:
    corpus_1:
        path_src: toy-ende/src-train.txt
        path_tgt: toy-ende/tgt-train.txt
    valid:
        path_src: toy-ende/src-val.txt
        path_tgt: toy-ende/tgt-val.txt

world_size: 1
gpu_ranks: [0]

# Remove or modify these lines for bigger files
train_steps: 1000
valid_steps: 200
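
To get a feeling for what `train_steps` and `valid_steps` mean for this toy corpus, the following back-of-the-envelope sketch may help. It assumes a batch size of 64 sentences, which is not set in `toy_en_de.yaml` and is only an assumption about the library default; the corpus size of 10000 lines comes from the `wc -l` output above. The function names are hypothetical.

```python
def validation_runs(train_steps: int, valid_steps: int) -> int:
    """Number of validation passes if validation runs every `valid_steps` steps."""
    return train_steps // valid_steps


def approx_epochs(train_steps: int, batch_size: int, corpus_lines: int) -> float:
    """Rough number of passes over the corpus, assuming sentence-based batches."""
    return train_steps * batch_size / corpus_lines


ASSUMED_BATCH_SIZE = 64  # assumption: not specified in toy_en_de.yaml

print(validation_runs(1000, 200))                      # 5
print(approx_epochs(1000, ASSUMED_BATCH_SIZE, 10000))  # 6.4
```

Under these assumptions, the model sees the training data only a handful of times, which is one reason to increase `train_steps` for anything beyond a demonstration.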


### Build Vocabulary

In [6]:
# Build Vocabulary
!onmt_build_vocab -config toy_en_de.yaml -n_sample -1

Corpus corpus_1's weight should be given. We default it to 1 for you.
[2022-07-26 07:22:58,100 INFO] Counter vocab from -1 samples.
[2022-07-26 07:22:58,100 INFO] n_sample=-1: Build vocab on full datasets.
[2022-07-26 07:22:58,108 INFO] corpus_1's transforms: TransformPipe()
[2022-07-26 07:22:58,410 INFO] Counters src:24995
[2022-07-26 07:22:58,410 INFO] Counters tgt:35816
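
The `Counters` lines report how many distinct tokens were found on each side. Conceptually, this amounts to counting token frequencies over the corpus. The sketch below illustrates the idea with whitespace tokenization, which seems a reasonable reading given that the log shows an empty `TransformPipe()`; it is not the OpenNMT-py implementation, and `count_tokens` is a hypothetical name.

```python
from collections import Counter
from typing import Iterable


def count_tokens(lines: Iterable[str]) -> Counter:
    """Count whitespace-separated tokens over all lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


# Invented toy lines, used only to show the behavior
toy_lines = ["the cat sat", "the dog sat down"]
counts = count_tokens(toy_lines)
print(len(counts))    # number of distinct tokens: 5
print(counts["the"])  # frequency of "the": 2
```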


### Check GPU availability

In [7]:
# Check if GPU is active
# If not, go to "Runtime" menu > "Change runtime type" > "GPU"
!nvidia-smi -L

GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-1aa881c8-746c-0c5d-abc9-2f4fba74f27c)


In [8]:
# Check PyTorch and GPU connection
import torch

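# Query the device only if CUDA is available; current_device() fails otherwise
print(torch.cuda.is_available())
if torch.cuda.is_available():
  gpu_id = torch.cuda.current_device()
  print(torch.cuda.get_device_name(gpu_id))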

True
Tesla P100-PCIE-16GB


### Train model

In [9]:
# Train the NMT model (takes about 5 minutes)
!onmt_train -config toy_en_de.yaml

[2022-07-26 07:23:09,209 INFO] Missing transforms field for corpus_1 data, set to default: [].
[2022-07-26 07:23:09,209 INFO] Missing transforms field for valid data, set to default: [].
[2022-07-26 07:23:09,209 INFO] Parsed 2 corpora from -data.
[2022-07-26 07:23:09,209 INFO] Get special vocabs from Transforms: {'src': set(), 'tgt': set()}.
[2022-07-26 07:23:09,209 INFO] Loading vocab from text file...
[2022-07-26 07:23:09,209 INFO] Loading src vocabulary from toy-ende/run/example.vocab.src
[2022-07-26 07:23:09,252 INFO] Loaded src vocab has 24995 tokens.
[2022-07-26 07:23:09,263 INFO] Loading tgt vocabulary from toy-ende/run/example.vocab.tgt
[2022-07-26 07:23:09,324 INFO] Loaded tgt vocab has 35816 tokens.
[2022-07-26 07:23:09,339 INFO] Building fields with vocab in counters...
[2022-07-26 07:23:09,403 INFO]  * tgt vocab size: 35820.
[2022-07-26 07:23:09,434 INFO]  * src vocab size: 24997.
[2022-07-26 07:23:09,436 INFO]  * src vocab size = 24997
[2022-07-26 07:23:09,436 INFO]  * tgt

### Translate

In [13]:
!head -n 3 toy-ende/src-test.txt

Orlando Bloom and Miranda Kerr still love each other
Actors Orlando Bloom and Model Miranda Kerr want to go their separate ways .
However , in an interview , Bloom has said that he and Kerr still love each other .


In [10]:
# Translate
!onmt_translate -model model/model_step_1000.pt -src toy-ende/src-test.txt -output toy-ende/pred_1000.txt -gpu 0 -verbose

The last 5000 lines of the streaming output were truncated.
SENT 1738: ['Pope', 'Francis', 'to', 'name', 'first', 'cardinals', 'in', 'February']
PRED 1738: Das ist es auf die Möglichkeit .
PRED SCORE: -17.4993

[2022-07-26 07:26:14,112 INFO] 
SENT 1739: ['Pope', 'Francis', 'will', 'create', 'new', 'cardinals', 'of', 'the', 'Catholic', 'Church', 'for', 'his', 'first', 'time', 'on', 'February', '22', ',', 'the', 'Vatican', 'announced', 'Thursday', '.']
PRED 1739: Die Europäische Union , die sich auf die Möglichkeit , die auf die Möglichkeit von der EU , die sich auf die Nähe der EU .
PRED SCORE: -69.9081

[2022-07-26 07:26:14,112 INFO] 
SENT 1740: ['Cardinals', 'are', 'the', 'highest-ranking', 'clergy', 'in', 'the', 'Catholic', 'Church', 'below', 'the', 'pope', ',', 'and', 'they', '&apos;re', 'the', 'ones', 'who', 'elect', 'popes', ',', 'so', 'Francis', 'will', 'be', 'appointing', 'his', 'first', 'group', 'of', 'men', 'who', 'will', 'ultimately', 'help', 'choose', 'h

In [14]:
# Look at some of the translations 
!head -n 3 toy-ende/pred_1000.txt

Die Europäische Union , die sich auf die Nähe der EU .
Die Europäische Union , die sich auf die Möglichkeit , die sich auf die Möglichkeit .
Das ist es auf die Europäische Union .
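
The predictions above are repetitive, which is expected after only 1000 training steps on a small corpus. One simple way to quantify how much of a prediction overlaps with a reference translation is clipped n-gram precision, a building block of BLEU. The sketch below is a simplified illustration and not a full BLEU implementation; the function names are hypothetical, and the sentence pair is an invented toy example rather than data from `toy-ende`. For a real evaluation, an established tool such as sacreBLEU would be the more usual choice.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis: str, reference: str, n: int = 1) -> float:
    """Share of hypothesis n-grams found in the reference, with clipped counts."""
    hyp = ngrams(hypothesis.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    # Each n-gram is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


# Invented toy pair: the repeated "die" is only credited once
print(clipped_precision("die die die Katze", "die Katze schläft"))  # 0.5
```

Clipping is what penalizes the kind of repetition seen in the predictions: repeating a word that occurs once in the reference does not raise the score.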


# **References**

- [1] NLP and Computer Vision (DLMAINLPCV01) course book
- [2] https://opennmt.net/
- [3] https://github.com/OpenNMT/OpenNMT-py#quickstart


Copyright © 2022 IU International University of Applied Sciences