<a href="https://colab.research.google.com/github/FabienMiguel/NLP-Fellowship/blob/main/week4/EasyNMT_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EasyNMT - Example (Opus-MT Model)
This notebook shows the usage of [EasyNMT](https://github.com/UKPLab/EasyNMT) for machine translation.

Here, we use the [Opus-MT model](https://github.com/Helsinki-NLP/Opus-MT). The Helsinki-NLP group provides 1200+ pre-trained models for various language directions (e.g. en-de, es-fr, ru-fr). Each model is about 300 MB.

EasyNMT makes these models easy to use: the model needed for a translation is loaded automatically and kept in memory for future use.

# Colab with GPU
When running this notebook in Colab, make sure a GPU is selected as the hardware accelerator. To enable this:
- Navigate to Edit → Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

With `!nvidia-smi` we can check which GPU was assigned to us in Colab.

In [13]:
!nvidia-smi

Tue Nov  8 15:06:19 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
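
The same check can be made from Python. The following is a minimal sketch: `describe_device` is a hypothetical helper name, and the code assumes only that PyTorch may or may not be installed. `torch.cuda.is_available()` and `torch.cuda.get_device_name()` are standard PyTorch calls.

```python
import importlib.util


def describe_device() -> str:
    """Return a short description of the device PyTorch would use."""
    # Avoid an ImportError on machines where PyTorch is not installed yet.
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"

    import torch

    if torch.cuda.is_available():
        # Index 0 is the first visible GPU, e.g. the one shown by nvidia-smi.
        return "cuda: " + torch.cuda.get_device_name(0)
    return "cpu"


print(describe_device())
```

If this prints `cpu` in Colab, the GPU runtime has most likely not been enabled.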

# Installation
You can install EasyNMT with pip. EasyNMT uses PyTorch. If you have a GPU available on your local machine, see [PyTorch Get Started](https://pytorch.org/get-started/locally/) for how to install PyTorch with CUDA support.

In [14]:
!pip install -U easynmt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Create EasyNMT instance

Creating an EasyNMT instance and loading a model is easy. You pass the model name you want to use and all needed files are downloaded and cached locally.

In [15]:
from easynmt import EasyNMT
model = EasyNMT('opus-mt')

# Sentence Translation
To translate individual sentences, pass them as a list to the `translate()` method. If `source_lang` is omitted, the language of each sentence is detected automatically.

In [16]:
sentences = ['In dieser Liste definieren wir mehrere Sätze.',
             'Jeder dieser Sätze wird dann in die Zielsprache übersetzt.', 
             'Puede especificar en esta lista la oración en varios idiomas.',
             'El sistema detectará automáticamente el idioma y utilizará el modelo correcto.']
translations = model.translate(sentences, target_lang='fr')

print("\n\nTranslations:")
for sent, trans in zip(sentences, translations):
  print(sent)
  print("=>", trans)
  print("")





Translations:
In dieser Liste definieren wir mehrere Sätze.
=> Dans cette liste, nous définissons plusieurs phrases.

Jeder dieser Sätze wird dann in die Zielsprache übersetzt.
=> Chacune de ces phrases sera ensuite traduite dans la langue cible.

Puede especificar en esta lista la oración en varios idiomas.
=> Vous pouvez spécifier la phrase dans cette liste en plusieurs langues.

El sistema detectará automáticamente el idioma y utilizará el modelo correcto.
=> Le système détectera automatiquement la langue et utilisera le bon modèle.



In [17]:
sentences = ['Wir können bei den Sätzen ebenfalls die Ausgangssprache festlegen.',
             'In dem Fall wird direkt das passende Modell geladen und verwendet.']
translations = model.translate(sentences, source_lang='de', target_lang='en')

print("\n\nTranslations:")
for sent, trans in zip(sentences, translations):
  print(sent)
  print("=>", trans, "\n")



Translations:
Wir können bei den Sätzen ebenfalls die Ausgangssprache festlegen.
=> We can also define the original language for the sentences. 

In dem Fall wird direkt das passende Modell geladen und verwendet.
=> In this case, the right model is loaded and used directly. 



In [18]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Document Translation
You can also pass longer documents (or a list of documents) to the `translate()` method.

As Transformer models can only translate inputs up to 512 (or 1024) word pieces, we first perform sentence splitting. Then, each sentence is translated individually. 

In [19]:
import tqdm
document = """Berlin is the capital and largest city of Germany by both area and population.
Its 3,769,495 inhabitants as of 31 December 2019 make it the most-populous city of the European Union, according to population within city limits.
The city is also one of Germany's 16 federal states. It is surrounded by the state of Brandenburg, and contiguous with Potsdam, Brandenburg's capital. 
The two cities are at the center of the Berlin-Brandenburg capital region, which is, with about six million inhabitants and an area of more than 30,000 km2, Germany's third-largest metropolitan region after the Rhine-Ruhr and Rhine-Main regions. 
Berlin straddles the banks of the River Spree, which flows into the River Havel (a tributary of the River Elbe) in the western borough of Spandau. 
Among the city's main topographical features are the many lakes in the western and southeastern boroughs formed by the Spree, Havel, and Dahme rivers (the largest of which is Lake Müggelsee). 
Due to its location in the European Plain, Berlin is influenced by a temperate seasonal climate. 
About one-third of the city's area is composed of forests, parks, gardens, rivers, canals and lakes.
The city lies in the Central German dialect area, the Berlin dialect being a variant of the Lusatian-New Marchian dialects.

First documented in the 13th century and at the crossing of two important historic trade routes, Berlin became the capital of the Margraviate of Brandenburg (1417–1701), the Kingdom of Prussia (1701–1918), the German Empire (1871–1918), the Weimar Republic (1919–1933), and the Third Reich (1933–1945).
Berlin in the 1920s was the third-largest municipality in the world.
After World War II and its subsequent occupation by the victorious countries, the city was divided; West Berlin became a de facto West German exclave, surrounded by the Berlin Wall (1961–1989) and East German territory. 
East Berlin was declared capital of East Germany, while Bonn became the West German capital. 
Following German reunification in 1990, Berlin once again became the capital of all of Germany.

Berlin is a world city of culture, politics, media and science.
Its economy is based on high-tech firms and the service sector, encompassing a diverse range of creative industries, research facilities, media corporations and convention venues. 
Berlin serves as a continental hub for air and rail traffic and has a highly complex public transportation network. 
The metropolis is a popular tourist destination.
Significant industries also include IT, pharmaceuticals, biomedical engineering, clean tech, biotechnology, construction and electronics."""


print("Output:")
print(model.translate(document, target_lang='de'))

Output:
Berlin ist die Hauptstadt und größte Stadt Deutschlands sowohl in der Region als auch in der Bevölkerung.
Die 3.769,495 Einwohner machen sie zum 31. Dezember 2019 zur bevölkerungsreichsten Stadt der Europäischen Union, nach der Bevölkerung innerhalb der Stadtgrenzen.
Die Stadt gehört auch zu den 16 Bundesländern Deutschlands. Sie ist von Brandenburg umgeben und mit Potsdam, der Hauptstadt Brandenburgs, verbunden. 
Die beiden Städte befinden sich im Zentrum der Hauptstadtregion Berlin-Brandenburg, mit rund sechs Millionen Einwohnern und einer Fläche von mehr als 30.000 km2, Deutschlands drittgrößter Metropolregion nach den Regionen Rhein-Ruhr und Rhein-Main. 
Berlin erstreckt sich über das Ufer der Spree, die in den Havel (ein Nebenfluss der Elbe) im westlichen Bezirk Spandau mündet. 
Zu den wichtigsten topographischen Merkmalen der Stadt gehören die zahlreichen Seen in den westlichen und südöstlichen Stadtteilen, die von den Flüssen Spree, Havel und Dahme gebildet wurden (der g
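
The split-then-translate idea described in this section can be sketched in a few lines. This is an illustration only, not EasyNMT's internal implementation: `split_sentences` and `translate_document` are hypothetical names, the regex splitter is deliberately naive, and `translate_sentence` stands for any callable that translates a single sentence.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def translate_document(text: str, translate_sentence: Callable[[str], str]) -> str:
    """Translate each sentence on its own and join the results with spaces."""
    return " ".join(translate_sentence(s) for s in split_sentences(text))


# str.upper stands in for a real translation function here.
print(translate_document("Berlin is big. It has many lakes.", str.upper))
```

With the model from this notebook, the callable could be something like `lambda s: model.translate(s, target_lang='de')`. A real splitter also has to handle abbreviations and numbers such as "3,769,495", which this regex does not attempt.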

# Language Detection
EasyNMT allows easy detection of the language of text. For this, we call the method `model.language_detection(text)`.

For language detection, we use [fastText](https://fasttext.cc/blog/2017/10/02/blog-post.html), which is able to recognize more than 170 languages.


In [20]:
sentences = ["This is an English sentence." ,"Dies ist ein deutscher Satz.", "это русское предложение.", "这是一个中文句子。"]

for sent in sentences:
  print(sent)
  print("=> detected language:", model.language_detection(sent), "\n")

This is an English sentence.
=> detected language: en 

Dies ist ein deutscher Satz.
=> detected language: de 

это русское предложение.
=> detected language: ru 

这是一个中文句子。
=> detected language: zh 
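
Detected languages can be used to group mixed-language input before further processing. A minimal sketch, assuming a `detect` callable that maps a sentence to a language code (for example `model.language_detection`); `group_by_language` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_language(
    sentences: Iterable[str], detect: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Group sentences under the language code returned by detect()."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for sent in sentences:
        groups[detect(sent)].append(sent)
    return dict(groups)
```

In this notebook, it could be called as `group_by_language(sentences, model.language_detection)`.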



# Beam-Search
You can pass the beam size to the `translate()` method via the `beam_size` parameter. A larger beam size usually produces higher-quality translations but takes longer. By default, `beam_size` is 5.

In [21]:
import time
model = EasyNMT('opus-mt')

sentence = "Berlin ist die Hauptstadt von Deutschland und sowohl von den Einwohner als auch von der Fläche die größte Stadt in Deutschland, während Hamburg die zweit größte Stadt ist."

#Loading and warm-up of the model
model.translate(sentence, target_lang='en', beam_size=1)

print("\nBeam-Size 1")
start_time = time.time()
print(model.translate(sentence, target_lang='en', beam_size=1))
print("Translated in {:.2f} sec".format(time.time()-start_time))

print("\nBeam-Size 10")
start_time = time.time()
print(model.translate(sentence, target_lang='en', beam_size=10))
print("Translated in {:.2f} sec".format(time.time()-start_time))



Beam-Size 1
Berlin is the capital of Germany and the largest city in Germany, both of its inhabitants and of its area, while Hamburg is the second largest city.
Translated in 0.19 sec

Beam-Size 10
Berlin is the capital of Germany and of both the inhabitants and the area the largest city in Germany, while Hamburg is the second largest city.
Translated in 0.33 sec
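
A single timing like the ones above is noisy. The sketch below averages over several runs; `time_call` is a hypothetical helper, not part of EasyNMT, and it works with any callable.

```python
import time
from typing import Any, Callable, Tuple


def time_call(fn: Callable[..., Any], *args: Any, repeats: int = 3, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn(*args, **kwargs) `repeats` times; return the last result and the mean seconds per call."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    start = time.perf_counter()
    for _ in range(repeats):
        result = fn(*args, **kwargs)
    elapsed = (time.perf_counter() - start) / repeats
    return result, elapsed
```

For example, `time_call(model.translate, sentence, target_lang='en', beam_size=10)` would return the translation together with the average time per call.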


# Available Models


In [22]:
available_models = ['opus-mt', 'mbart50_m2m', 'm2m_100_418M']
# Note: EasyNMT also provides m2m_100_1.2B, but it requires too much RAM to be loaded in this notebook on the free Colab tier.
# If you start an empty Colab instance and load only 'm2m_100_1.2B', it should work.

for model_name in available_models:
  print("\n\nLoad model:", model_name)
  model = EasyNMT(model_name)

  sentences = ['In dieser Liste definieren wir mehrere Sätze.',
              'Jeder dieser Sätze wird dann in die Zielsprache übersetzt.', 
              'Puede especificar en esta lista la oración en varios idiomas.',
              'El sistema detectará automáticamente el idioma y utilizará el modelo correcto.']
  translations = model.translate(sentences, target_lang='en')

  print("Translations:")
  for sent, trans in zip(sentences, translations):
    print(sent)
    print("=>", trans, "\n")
  del model




Load model: opus-mt


Downloading:   0%|          | 0.00/826k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/802k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.59M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/44.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.44k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/312M [00:00<?, ?B/s]

Translations:
In dieser Liste definieren wir mehrere Sätze.
=> In this list we define several sentences. 

Jeder dieser Sätze wird dann in die Zielsprache übersetzt.
=> Each of these sentences is then translated into the target language. 

Puede especificar en esta lista la oración en varios idiomas.
=> You can specify the sentence in several languages in this list. 

El sistema detectará automáticamente el idioma y utilizará el modelo correcto.
=> The system will automatically detect the language and use the correct model. 



Load model: mbart50_m2m


24.9kB [00:00, 28.4MB/s]                   


Downloading:   0%|          | 0.00/1.43k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/2.44G [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/529 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/649 [00:00<?, ?B/s]



Translations:
In dieser Liste definieren wir mehrere Sätze.
=> In this list we define several sentences. 

Jeder dieser Sätze wird dann in die Zielsprache übersetzt.
=> Each of these sentences is then translated into the target language. 

Puede especificar en esta lista la oración en varios idiomas.
=> You can specify in this list the speech in several languages. 

El sistema detectará automáticamente el idioma y utilizará el modelo correcto.
=> The system will automatically detect the language and use the correct model. 



Load model: m2m_100_418M


89.9kB [00:00, 57.0MB/s]                   


Downloading:   0%|          | 0.00/908 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.94G [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/272 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/3.71M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.14k [00:00<?, ?B/s]

Translations:
In dieser Liste definieren wir mehrere Sätze.
=> In this list we define several sentences. 

Jeder dieser Sätze wird dann in die Zielsprache übersetzt.
=> Each of these sentences is then translated into the target language. 

Puede especificar en esta lista la oración en varios idiomas.
=> You can specify in this list the prayer in several languages. 

El sistema detectará automáticamente el idioma y utilizará el modelo correcto.
=> The system will automatically detect the language and use the correct model. 



# Translation Directions & Languages
To get all available translation directions for a model, read its `lang_pairs` property. An entry like 'af-en' means that you can translate from *af* (Afrikaans) to *en* (English).

In [23]:
model = EasyNMT('opus-mt')
print("Language directions:")
print(sorted(list(model.lang_pairs)))

Language directions:
['aav-en', 'aed-es', 'af-de', 'af-en', 'af-eo', 'af-es', 'af-fi', 'af-fr', 'af-nl', 'af-ru', 'af-sv', 'alv-en', 'am-sv', 'ar-de', 'ar-el', 'ar-en', 'ar-eo', 'ar-es', 'ar-fr', 'ar-he', 'ar-it', 'ar-pl', 'ar-ru', 'ar-tr', 'art-en', 'ase-de', 'ase-en', 'ase-es', 'ase-fr', 'ase-sv', 'az-en', 'az-es', 'az-tr', 'bat-en', 'bcl-de', 'bcl-en', 'bcl-es', 'bcl-fi', 'bcl-fr', 'bcl-sv', 'be-es', 'bem-en', 'bem-es', 'bem-fi', 'bem-fr', 'bem-sv', 'ber-en', 'ber-es', 'ber-fr', 'bg-de', 'bg-en', 'bg-eo', 'bg-es', 'bg-fi', 'bg-fr', 'bg-it', 'bg-ru', 'bg-sv', 'bg-tr', 'bg-uk', 'bi-en', 'bi-es', 'bi-fr', 'bi-sv', 'bn-en', 'bnt-en', 'bzs-en', 'bzs-es', 'bzs-fi', 'bzs-fr', 'bzs-sv', 'ca-de', 'ca-en', 'ca-es', 'ca-fr', 'ca-it', 'ca-nl', 'ca-pt', 'ca-uk', 'cau-en', 'ccs-en', 'ceb-en', 'ceb-es', 'ceb-fi', 'ceb-fr', 'ceb-sv', 'cel-en', 'chk-en', 'chk-es', 'chk-fr', 'chk-sv', 'cpf-en', 'crs-de', 'crs-en', 'crs-es', 'crs-fi', 'crs-fr', 'crs-sv', 'cs-de', 'cs-en', 'cs-eo', 'cs-fi', 'cs-fr', 'c
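
A flat list of directions is hard to scan. A minimal sketch that regroups such pairs by source language, assuming every entry has the form `source-target` as in the output above; `targets_by_source` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def targets_by_source(lang_pairs: Iterable[str]) -> Dict[str, List[str]]:
    """Map each source language to the sorted list of its target languages."""
    mapping = defaultdict(set)
    for pair in lang_pairs:
        # Split on the first hyphen only: 'af-en' -> ('af', 'en').
        source, target = pair.split("-", 1)
        mapping[source].add(target)
    return {source: sorted(targets) for source, targets in mapping.items()}


print(targets_by_source(["af-de", "af-en", "ar-en"]))
```

In this notebook, `targets_by_source(model.lang_pairs)` would give the same information as `lang_pairs`, keyed by source language.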

To check which languages are supported, you can use the following method:

In [24]:
print("All Languages:")
print(model.get_languages())

print("\n\nAll languages with source_lang=en. I.e., we can translate English (en) to these languages.")
print(model.get_languages(source_lang='en'))

print("\n\nAll languages with target_lang=de. I.e., we can translate from these languages to German (de).")
print(model.get_languages(target_lang='de'))

All Languages:
['aav', 'aed', 'af', 'alv', 'am', 'ar', 'art', 'ase', 'az', 'bat', 'bcl', 'be', 'bem', 'ber', 'bg', 'bi', 'bn', 'bnt', 'bzs', 'ca', 'cau', 'ccs', 'ceb', 'cel', 'chk', 'cpf', 'crs', 'cs', 'csg', 'csn', 'cus', 'cy', 'da', 'de', 'dra', 'ee', 'efi', 'el', 'en', 'eo', 'es', 'et', 'eu', 'euq', 'fi', 'fj', 'fr', 'fse', 'ga', 'gaa', 'gil', 'gl', 'grk', 'guw', 'gv', 'ha', 'he', 'hi', 'hil', 'ho', 'hr', 'ht', 'hu', 'hy', 'id', 'ig', 'ilo', 'is', 'iso', 'it', 'ja', 'jap', 'ka', 'kab', 'kg', 'kj', 'kl', 'ko', 'kqn', 'kwn', 'kwy', 'lg', 'ln', 'loz', 'lt', 'lu', 'lua', 'lue', 'lun', 'luo', 'lus', 'lv', 'map', 'mfe', 'mfs', 'mg', 'mh', 'mk', 'mkh', 'ml', 'mos', 'mr', 'ms', 'mt', 'mul', 'ng', 'nic', 'niu', 'nl', 'no', 'nso', 'ny', 'nyk', 'om', 'pa', 'pag', 'pap', 'phi', 'pis', 'pl', 'pon', 'poz', 'pqe', 'pqw', 'prl', 'pt', 'rn', 'rnd', 'ro', 'roa', 'ru', 'run', 'rw', 'sal', 'sg', 'sh', 'sit', 'sk', 'sl', 'sm', 'sn', 'sq', 'srn', 'ss', 'ssp', 'st', 'sv', 'sw', 'swc', 'taw', 'tdt', 'th', 

In [25]:
sentences = ['we start a hackton today in week 4 and fellower are very exciting.']
translations = model.translate(sentences, source_lang='en', target_lang='fr')

print("\n\nTranslations:")
for sent, trans in zip(sentences, translations):
  print(sent)
  print("=>", trans, "\n")



Translations:
we start a hackton today in week 4 and fellower are very exciting.
=> Nous commençons un hackton aujourd'hui dans la semaine 4 et les autres sont très excitants. 



Load all languages (codes and names) from a JSON file:

In [64]:
import json

In [65]:
with open('/content/lang.json') as file_lang:
  content = file_lang.read()

In [66]:
languages_json = json.loads(content)
languages_json

[{'code': 'ab', 'name': 'Abkhaz'},
 {'code': 'aa', 'name': 'Afar'},
 {'code': 'af', 'name': 'Afrikaans'},
 {'code': 'ak', 'name': 'Akan'},
 {'code': 'sq', 'name': 'Albanian'},
 {'code': 'am', 'name': 'Amharic'},
 {'code': 'ar', 'name': 'Arabic'},
 {'code': 'an', 'name': 'Aragonese'},
 {'code': 'hy', 'name': 'Armenian'},
 {'code': 'as', 'name': 'Assamese'},
 {'code': 'av', 'name': 'Avaric'},
 {'code': 'ae', 'name': 'Avestan'},
 {'code': 'ay', 'name': 'Aymara'},
 {'code': 'az', 'name': 'Azerbaijani'},
 {'code': 'bm', 'name': 'Bambara'},
 {'code': 'ba', 'name': 'Bashkir'},
 {'code': 'eu', 'name': 'Basque'},
 {'code': 'be', 'name': 'Belarusian'},
 {'code': 'bn', 'name': 'Bengali; Bangla'},
 {'code': 'bh', 'name': 'Bihari'},
 {'code': 'bi', 'name': 'Bislama'},
 {'code': 'bs', 'name': 'Bosnian'},
 {'code': 'br', 'name': 'Breton'},
 {'code': 'bg', 'name': 'Bulgarian'},
 {'code': 'my', 'name': 'Burmese'},
 {'code': 'ca', 'name': 'Catalan; Valencian'},
 {'code': 'ch', 'name': 'Chamorro'},
 {'co
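
Since each entry is a dict with a `code` and a `name` key, as in the output above, a lookup table from code to name is often more convenient than the raw list. A minimal sketch; `build_name_lookup` is a hypothetical helper name.

```python
from typing import Dict, Iterable, Mapping


def build_name_lookup(entries: Iterable[Mapping[str, str]]) -> Dict[str, str]:
    """Turn [{'code': ..., 'name': ...}, ...] into a {code: name} dict."""
    return {entry["code"]: entry["name"] for entry in entries}


sample = [{"code": "af", "name": "Afrikaans"}, {"code": "sq", "name": "Albanian"}]
print(build_name_lookup(sample)["af"])
```

Applied to the list loaded above, `build_name_lookup(languages_json)` would allow lookups by language code.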

In [67]:
# sentence_rw= ['Uyumunsi ni kuwa kabiri',
              # 'kand nibwo DV Lottery program iriburangire',
              # 'turi ku itariki ya munani yukwezi kwa cyumi nakumwe']

sentences = [
    'we start a hackton today in week 4 and fellower are very exciting',
    'today is tuesday',
    'I like to play basketball'
]           

In [68]:
import pandas as pd

In [69]:
df = pd.DataFrame(sentences)
df['Kiny'] =sentences
df['en'] = sentences
df

|   | 0 | Kiny | en |
|---|---|------|----|
| 0 | we start a hackton today in week 4 and fellowe... | we start a hackton today in week 4 and fellowe... | we start a hackton today in week 4 and fellowe... |
| 1 | today is tuesday | today is tuesday | today is tuesday |
| 2 | I like to play basketball | I like to play basketball | I like to play basketball |


In [70]:

for language in languages_json[80:100]:
  code = language['code']
  try:
    df[code] = model.translate(sentences, source_lang='en', target_lang=code)
  except Exception:
    # Most likely no en-><code> model exists; skip this language.
    pass
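
Instead of relying on exceptions, the candidate codes could be filtered against the languages the model reports as supported. A minimal sketch; `select_supported` is a hypothetical helper, and it assumes the supported codes are given as a plain list of strings such as the one returned by `model.get_languages(source_lang='en')`.

```python
from typing import Iterable, List, Mapping


def select_supported(entries: Iterable[Mapping[str, str]], supported_codes: Iterable[str]) -> List[str]:
    """Return the codes from entries that also appear in supported_codes, in input order."""
    supported = set(supported_codes)
    return [entry["code"] for entry in entries if entry["code"] in supported]


entries = [{"code": "rw", "name": "Kinyarwanda"}, {"code": "xx", "name": "Unknown"}]
print(select_supported(entries, ["de", "rw", "fr"]))
```

In this notebook, that would be `select_supported(languages_json[80:100], model.get_languages(source_lang='en'))`, and the loop would then only run over codes that have a model.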



In [33]:
df

|   | 0 | Kiny | en | rw | kg | kj | lg | ln | lu | gv | mk | mg |
|---|---|------|----|----|----|----|----|----|----|----|----|----|
| 0 | we start a hackton today in week 4 and fellowe... | we start a hackton today in week 4 and fellowe... | we start a hackton today in week 4 and fellowe... | Muri iki gihe, dutangiza inzu yimukanwa mu cyu... | Bubu yai na mposo ya 4 mpi ya mpangi na beto, ... | Ohatu hovele okukala twa hafa neenghono oshivi... | Tutandika olugendo oluyitibwackton mu wiiki 4 ... | Tobandi kosala ackton lelo oyo na pɔsɔ 4 mpe n... | Twashilwile mwingilo wa Addk dyalelo mu yenga ... | Bleeaney as laa ruggyreeyn | Ќе почнеме со хактон денес во 4 недела, а коле... | Manomboka hosoka isika androany amin'ny herina... |
| 1 | today is tuesday | today is tuesday | today is tuesday | UYUMUNSI ni | Bubu yai, bantu mekumaka kusala bansaka ya kul... | Kunena, efiku keshe oli li popepi | Leero waliwo eby'okukola bingi | Lelo oyo, mikolo etikali ya kotánga | Dyalelo nadyo, bantu ba mu ano etu mafuku badi... | jiu er 00:00:00 PM | Денес е вторник. | Androany ny andron'ny tuesday |
| 2 | I like to play basketball | I like to play basketball | I like to play basketball | Nkunda gukina ishati | Mono ke zolaka kubula nsaka ya nkweso | Onda hala okukala handi dana etanga lokeemhadi | Njagala nnyo okukuba paper size | Nasepelaka kobɛta lisano ya karate | Nsakanga kukaya makayo a kwitamuna bidyoma | Ny rolley dy lhiggey dy jean jean jean jean je... | Сакам да играм кошарка. | Tia milalao baskety aho |


In [71]:
sentences_en = [
    'we start a hackton today in week 4 and fellower are very exciting',
    'today is tuesday',
    'I like to play basketball'
]   

In [72]:
df0 = pd.DataFrame()

In [74]:
df0['english'] = sentences_en

In [77]:
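try:
  df0['Spanish'] = model.translate(sentences_en, source_lang='en', target_lang='es')
except Exception as err:
  # Report the failure instead of silently leaving the column out.
  print("Translation to Spanish failed:", err)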



In [78]:
df0

|   | english | Spanish |
|---|---------|---------|
| 0 | we start a hackton today in week 4 and fellowe... | Empezamos un hackton hoy en la semana 4 y el c... |
| 1 | today is tuesday | Hoy es martes |
| 2 | I like to play basketball | Me gusta jugar al baloncesto. |
