<a href="https://colab.research.google.com/github/bucuram/machine-translation-labs/blob/main/Lab2_MT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Overview of Approaches to MT

### Open-source MT frameworks
* [Moses](http://www.statmt.org/moses/). Paper: [Moses: Open Source Toolkit for Statistical Machine Translation](https://aclanthology.org/P07-2045.pdf). C++

* [OpenNMT](https://github.com/OpenNMT/OpenNMT-py). Paper: [OpenNMT: Open-Source Toolkit for Neural Machine Translation](https://aclanthology.org/P17-4012.pdf). PyTorch / TensorFlow. Developed by Harvard NLP and SYSTRAN
* [Marian](https://marian-nmt.github.io/). Paper: [Marian: Fast Neural Machine Translation in C++](https://aclanthology.org/P18-4020.pdf). C++. Developed by Microsoft Translator
* [Fairseq](https://github.com/pytorch/fairseq). Paper: [fairseq: A Fast, Extensible Toolkit for Sequence Modeling](https://aclanthology.org/N19-4009.pdf). PyTorch. Developed by Facebook AI
* [Nematus](https://github.com/EdinburghNLP/nematus). Paper: [Nematus: a Toolkit for Neural Machine Translation](https://aclanthology.org/E17-3017.pdf). TensorFlow. Developed by Edinburgh NLP
* [Sockeye](https://github.com/awslabs/sockeye). Paper: [Sockeye 2: A Toolkit for Neural Machine Translation](https://aclanthology.org/2020.eamt-1.50.pdf). MXNet. Developed by Amazon
* [JoeyNMT](https://github.com/joeynmt/joeynmt). Paper: [Joey NMT: A Minimalist NMT Toolkit for Novices](https://aclanthology.org/D19-3019v1.pdf). PyTorch



### Testing the fairseq framework

Installing fairseq, together with sentencepiece, tensorboardX and mosestokenizer:

In [1]:
!pip install sentencepiece fairseq tensorboardX mosestokenizer

Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting fairseq
  Downloading fairseq-0.10.2-cp37-cp37m-manylinux1_x86_64.whl (1.7 MB)
Collecting tensorboardX
  Downloading tensorboardX-2.4-py2.py3-none-any.whl (124 kB)
Collecting mosestokenizer
  Downloading mosestokenizer-1.1.0.tar.gz (37 kB)
Collecting dataclasses
  Downloading dataclasses-0.6-py3-none-any.whl (14 kB)
Collecting sacrebleu>=1.4.12
  Downloading sacrebleu-2.0.0-py3-none-any.whl (90 kB)
Collecting hydra-core
  Downloading hydra_core-1.1.1-py3-none-any.whl (145 kB)
Collecting portalocker
  Downloading portalocker-2.3.2-
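
Before moving on, it can be useful to confirm that the packages are actually importable in the current runtime. The following is a minimal sketch, not part of the original lab: `missing_modules` is a hypothetical helper, and the import names are assumed to match the pip package names used above.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names for the packages installed above.
required = ["sentencepiece", "fairseq", "tensorboardX", "mosestokenizer"]
print("Missing:", missing_modules(required) or "none")
```

If anything is reported as missing, rerunning the install cell (or restarting the runtime) is usually the first thing to try.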

Downloading the data

We will use the [Europarl parallel corpus](https://www.statmt.org/europarl/), which contains translations of European Parliament proceedings. Here we download the English-Romanian portion from OPUS.

In [10]:
!wget https://object.pouta.csc.fi/OPUS-Europarl/v8/moses/en-ro.txt.zip

--2021-10-20 14:04:41--  https://object.pouta.csc.fi/OPUS-Europarl/v8/moses/en-ro.txt.zip
Resolving object.pouta.csc.fi (object.pouta.csc.fi)... 86.50.254.18, 86.50.254.19
Connecting to object.pouta.csc.fi (object.pouta.csc.fi)|86.50.254.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 39495951 (38M) [application/zip]
Saving to: ‘en-ro.txt.zip’


2021-10-20 14:04:44 (15.4 MB/s) - ‘en-ro.txt.zip’ saved [39495951/39495951]



In [11]:
!mkdir data
!mv en-ro.txt.zip data/en-ro.txt.zip

In [17]:
%cd data/
!unzip en-ro.txt.zip
!rm Europarl.en-ro.xml

Let's check how many lines our files contain:

In [18]:
!wc -l Europarl*

   400356 Europarl.en-ro.en
   400356 Europarl.en-ro.ro
   800712 total
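
The same check can be done from Python, which makes it easy to fail loudly if the two sides of the corpus ever get out of sync. This is a sketch under the assumption that the files are in the current directory; `count_lines` and `check_parallel` are hypothetical helper names introduced for illustration.

```python
import os


def count_lines(path):
    """Count newline characters, like `wc -l`, reading in 1 MiB binary chunks."""
    with open(path, "rb") as fh:
        return sum(chunk.count(b"\n") for chunk in iter(lambda: fh.read(1 << 20), b""))


def check_parallel(src_path, tgt_path):
    """Return the shared line count, or raise if the two files differ in length."""
    n_src, n_tgt = count_lines(src_path), count_lines(tgt_path)
    if n_src != n_tgt:
        raise ValueError(f"Line count mismatch: {n_src} vs {n_tgt}")
    return n_src


# Only runs when the corpus files from the cells above are present.
if os.path.exists("Europarl.en-ro.en") and os.path.exists("Europarl.en-ro.ro"):
    print(check_parallel("Europarl.en-ro.en", "Europarl.en-ro.ro"))
```

Equal line counts are a necessary condition for a line-aligned parallel corpus, not a guarantee that every pair is a correct translation.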


Let's look at some random sentence pairs from this corpus. First, we merge the two files horizontally, so that each line of the result contains a Romanian line and its English counterpart separated by a tab, and then shuffle the merged lines:

In [23]:
!paste Europarl.en-ro.ro Europarl.en-ro.en | shuf > shuf-Europarl.en-ro.both
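
The `paste ... | shuf` pipeline above gives a different order on every run. If a reproducible shuffle is wanted, the same merge can be sketched in Python with a fixed seed. `merge_and_shuffle` is a hypothetical helper written for illustration, and it assumes the corpus fits in memory, which should hold for roughly 400k sentence pairs.

```python
import random


def merge_and_shuffle(src_path, tgt_path, out_path, seed=0):
    """Write tab-separated line pairs to out_path in a seeded random order."""
    # newline="\n" makes "\n" the only line terminator, as `paste` does.
    with open(src_path, encoding="utf8", newline="\n") as src, \
         open(tgt_path, encoding="utf8", newline="\n") as tgt:
        # zip stops at the shorter file, so check line counts beforehand.
        pairs = [s.rstrip("\n") + "\t" + t.rstrip("\n") for s, t in zip(src, tgt)]
    random.Random(seed).shuffle(pairs)
    with open(out_path, "w", encoding="utf8", newline="\n") as out:
        out.writelines(pair + "\n" for pair in pairs)
    return len(pairs)


# Example call with the file names used above (left commented out):
# merge_and_shuffle("Europarl.en-ro.ro", "Europarl.en-ro.en",
#                   "shuf-Europarl.en-ro.both", seed=42)
```

Using a private `random.Random(seed)` instance keeps the shuffle independent of any other code that touches the global random state.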

In [24]:
# Print the first five pairs of the shuffled file (Romanian first, then English).
with open('shuf-Europarl.en-ro.both', 'r', encoding='utf8') as fh:
    for i in range(5):
        ro_sentence, en_sentence = fh.readline().strip().split('\t')
        print('RO: {}\nEN: {}\n'.format(ro_sentence, en_sentence))

RO: Ca toţi ceilalţi din această cameră, sunt, desigur, oripilat de ceea ce se întâmplă în Orientul Mijlociu.
EN: Like everybody else in this Chamber, I am, of course, horrified by what has been happening in the Middle East.

RO: În ceea ce mă privește, voi trata astfel de situații respectând termenele limită care îmi sunt impuse.
EN: I will abide by my deadlines in terms of dealing with this kind of situation.

RO: Scopul este acela de a îmbunătăți capacitatea Uniunii Europene de a acționa într-un rol de gestionare a situațiilor de criză, permițând furnizarea și utilizarea mai eficientă a resurselor financiare, civile și militare.
EN: The aim is to improve the ability of the European Union to act in a crisis management role by allowing financial, civilian and military resources to be provided and used more efficiently.

RO: Avem nevoie de o revoluţie verde şi trebuie să ne frânăm excesele din prezent.
EN: We need a green revolution and we must curb our own excesses.

RO: membru al Com

### Resources

* [Intro to PyTorch](https://github.com/udacity/deep-learning-v2-pytorch/tree/master/intro-to-pytorch)
* [PyTorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning)
* [Intro to TensorFlow](https://github.com/udacity/intro-to-ml-tensorflow)

Notebook adapted from: [MTAT.06.055 Machine Translation](https://courses.cs.ut.ee/2021/mt/spring/Main/HomePage)