# EasyNMT - Example (Opus-MT Model)
This notebook shows the usage of [EasyNMT](https://github.com/UKPLab/EasyNMT) for machine translation.

Here, we use the [Opus-MT model](https://github.com/Helsinki-NLP/Opus-MT). The Helsinki-NLP group provides 1200+ pre-trained models for various language directions (e.g. en-de, es-fr, ru-fr). Each model is about 300 MB in size.

EasyNMT makes these models easy to use: the model needed for a given translation direction is loaded automatically and kept in memory for future use.
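As a rough illustration of "loaded automatically and kept in memory", the sketch below keeps a bounded number of models and evicts the least recently used one when the limit is reached. It is not EasyNMT's actual implementation: `ModelCache`, `fake_loader`, and the model names are hypothetical, and the loader is a stand-in so the example runs without downloading anything.

```python
from collections import OrderedDict
from typing import Callable, List


class ModelCache:
    """Hypothetical cache: keeps at most `max_loaded` models, evicting the least recently used."""

    def __init__(self, loader: Callable[[str], object], max_loaded: int = 10):
        self._loader = loader            # any function that builds a model from its name
        self._max_loaded = max_loaded
        self._models = OrderedDict()     # oldest entry first, most recently used last

    def get(self, name: str) -> object:
        if name in self._models:
            # Cache hit: mark as most recently used and return it.
            self._models.move_to_end(name)
            return self._models[name]
        if len(self._models) >= self._max_loaded:
            # Cache full: drop the least recently used model.
            self._models.popitem(last=False)
        model = self._loader(name)
        self._models[name] = model
        return model

    def loaded(self) -> List[str]:
        return list(self._models)


# Stand-in loader (assumption): records which names were actually loaded.
load_calls = []

def fake_loader(name: str) -> str:
    load_calls.append(name)
    return f"<model {name}>"


cache = ModelCache(fake_loader, max_loaded=2)
cache.get("opus-mt-en-de")
cache.get("opus-mt-es-fr")
cache.get("opus-mt-en-de")   # already in memory, not loaded again
cache.get("opus-mt-ru-fr")   # cache is full, so the es-fr model is evicted
```

Bounding the cache matters because each model is roughly 300 MB, so keeping every language direction loaded would quickly exhaust memory.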

In [1]:
!nvidia-smi

Fri Apr 26 23:09:10 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

### Clean GPU memory

In [2]:
# Run this to free GPU memory
import torch
from numba import cuda
device = cuda.get_current_device()
device.reset()
torch.cuda.empty_cache()

In [None]:
!pip install -U easynmt
!pip install -U datasets

In [None]:
!pip install sacremoses

In [1]:
from easynmt import EasyNMT
model = EasyNMT('opus-mt')

In [2]:
import os
from huggingface_hub import login
# Read the Hugging Face token from the HF_TOKEN environment variable.
# Never hard-code the token here: this repository is public.
login(token=os.environ["HF_TOKEN"])

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
from datasets import load_dataset
dataset = load_dataset("Open-Orca/SlimOrca-Dedup", split='train')

# TODO process dataset?

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [5]:
# dataset.to_json("dataset.json")

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

174976

In [None]:
# import time
# from transformers import MarianMTModel, MarianTokenizer
# import torch
# from typing import List


# class OpusMT:
#     def __init__(self, easynmt_path: str = None, max_loaded_models: int = 10):
#         self.models = {}
#         self.max_loaded_models = max_loaded_models
#         self.max_length = None

#     def load_model(self, model_name):
#         if model_name in self.models:
#             self.models[model_name]['last_loaded'] = time.time()
#             return self.models[model_name]['tokenizer'], self.models[model_name]['model']
#         else:
#             tokenizer = MarianTokenizer.from_pretrained(model_name)
#             model = MarianMTModel.from_pretrained(model_name)
#             model.eval()

#             if len(self.models) >= self.max_loaded_models:
#                 oldest_time = time.time()
#                 oldest_model = None
#                 for loaded_model_name in self.models:
#                     if self.models[loaded_model_name]['last_loaded'] <= oldest_time:
#                         oldest_model = loaded_model_name
#                         oldest_time = self.models[loaded_model_name]['last_loaded']
#                 del self.models[oldest_model]

#             self.models[model_name] = {'tokenizer': tokenizer, 'model': model, 'last_loaded': time.time()}
#             return tokenizer, model

#     def translate_sentences(self, sentences: List[str], source_lang: str, target_lang: str, device: str, beam_size: int = 5, **kwargs):
#         model_name = 'Helsinki-NLP/opus-mt-{}-{}'.format(source_lang, target_lang)
#         tokenizer, model = self.load_model(model_name)
#         model.to(device)

#         inputs = tokenizer(sentences, truncation=True, padding=True, max_length=self.max_length, return_tensors="pt")

#         for key in inputs:
#             inputs[key] = inputs[key].to(device)

#         with torch.no_grad():
#             translated = model.generate(**inputs, num_beams=beam_size, **kwargs)
#             # print((translated.shape))
#             output = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]

#         return output

#     def save(self, output_path):
#         return {"max_loaded_models": self.max_loaded_models}

# opes_cz = OpusMT()
# tokenizer_test, _ = opes_cz.load_model('Helsinki-NLP/opus-mt-en-cs')

# Document Translation
You can also pass longer documents (or a list of documents) to the `translate()` method.

As Transformer models can only translate inputs up to 512 (or 1024) word pieces, we first perform sentence splitting. Then, each sentence is translated individually.
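The split-then-translate idea can be sketched as follows. This is only an illustration under stated assumptions: the regex splitter is a naive stand-in (EasyNMT ships its own, more robust sentence splitting), and `translate_sentence` is a hypothetical callable that would wrap a real model. Here `str.upper` plays that role so the example runs offline.

```python
import re
from typing import Callable, List


def split_sentences(document: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def translate_document(document: str, translate_sentence: Callable[[str], str]) -> str:
    """Translate each sentence on its own, then join the results back together."""
    return " ".join(translate_sentence(s) for s in split_sentences(document))


doc = "Berlin is the capital of Germany. It is also a federal state. Is it large?"
sentences = split_sentences(doc)

# str.upper stands in for a real per-sentence translation function.
result = translate_document(doc, str.upper)
```

Because each sentence is translated separately, no single model input exceeds the word-piece limit as long as individual sentences stay short enough.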

In [12]:
import tqdm
document_1 = """
Berlin is the capital and largest city of Germany by both area and population.
Its 3,769,495 inhabitants as of 31 December 2019 make it the most-populous city of the European Union, according to population within city limits.
The city is also one of Germany's 16 federal states. It is surrounded by the state of Brandenburg, and contiguous with Potsdam, Brandenburg's capital.
The two cities are at the center of the Berlin-Brandenburg capital region, which is, with about six million inhabitants and an area of more than 30,000 km2, Germany's third-largest metropolitan region after the Rhine-Ruhr and Rhine-Main regions.
Berlin straddles the banks of the River Spree, which flows into the River Havel (a tributary of the River Elbe) in the western borough of Spandau.
Among the city's main topographical features are the many lakes in the western and southeastern boroughs formed by the Spree, Havel, and Dahme rivers (the largest of which is Lake Müggelsee).
"""
# Due to its location in the European Plain, Berlin is influenced by a temperate seasonal climate.
# About one-third of the city's area is composed of forests, parks, gardens, rivers, canals and lakes.
# The city lies in the Central German dialect area, the Berlin dialect being a variant of the Lusatian-New Marchian dialects.

# First documented in the 13th century and at the crossing of two important historic trade routes, Berlin became the capital of the Margraviate of Brandenburg (1417–1701), the Kingdom of Prussia (1701–1918), the German Empire (1871–1918), the Weimar Republic (1919–1933), and the Third Reich (1933–1945).
# Berlin in the 1920s was the third-largest municipality in the world.
# After World War II and its subsequent occupation by the victorious countries, the city was divided; West Berlin became a de facto West German exclave, surrounded by the Berlin Wall (1961–1989) and East German territory.
# East Berlin was declared capital of East Germany, while Bonn became the West German capital.
# Following German reunification in 1990, Berlin once again became the capital of all of Germany.

# Berlin is a world city of culture, politics, media and science.
# Its economy is based on high-tech firms and the service sector, encompassing a diverse range of creative industries, research facilities, media corporations and convention venues.
# Berlin serves as a continental hub for air and rail traffic and has a highly complex public transportation network.
# The metropolis is a popular tourist destination.
# Significant industries also include IT, pharmaceuticals, biomedical engineering, clean tech, biotechnology, construction and electronics.
# """

document_2 = """
Please add spaces between words: ThereweretheheelsofforeigninvasiontrampinguponFrance;therewasthedownfallofauEmpire,andthecaptivityofaBonaparte;andtheretheywerethemselves.
"""
# Note: `tokenizer_test` is defined in the commented-out OpusMT cell above.
print(tokenizer_test(document_2, truncation=True, padding=True, max_length=None, return_tensors="pt")["input_ids"].shape)


torch.Size([1, 52])


In [None]:
# Does it drop the last sentence?
# Note: `opes_cz` is defined in the commented-out OpusMT cell above.
print(opes_cz.translate_sentences([document_2], source_lang="en", target_lang="cs", beam_size=20, device="cuda"))

In [None]:
print("Output:")
print(model.translate(document_2, target_lang='cs', perform_sentence_splitting=True, beam_size=15, batch_size=8))

In [6]:
# Keep only the samples from index 150,700 onwards (150000 + 130 + 250 + 320)
dataset = dataset.select(range(150000 + 130 + 250 + 320, len(dataset)))

## Translate dataset
- save to `translated_dataset.json`

In [None]:
# translate all samples from dataset
import os
import json
import time
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

output_path = "/content/drive/MyDrive/translated_dataset.json"
if os.path.exists(output_path):
    os.remove(output_path)

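# Samples whose combined text exceeds 1024 characters (4096 / 4) are skipped, not shortened
max_input_length = 4096 // 4
truncated_count = 0  # number of skipped samples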

print(f"Translating dataset of size: {len(dataset)}")
with open(output_path, "a", encoding="utf-8") as output_file:
    start_time = time.time()
    for i in range(len(dataset)):
        sample = dataset[i]
        sys_val = sample["conversations"][0]["value"]
        input_text = sample["conversations"][1]["value"]
        response = sample["conversations"][2]["value"]

        combined_input = f"{sys_val} {input_text} {response}"
        if len(combined_input) <= max_input_length:
            translated = model.translate([sys_val, input_text, response], source_lang="en", target_lang='cs', perform_sentence_splitting=True, beam_size=15, batch_size=512)

            # Create a new dictionary with the same format
            translated_sample = {
                "conversations": [
                    {"from": "system", "value": translated[0]},
                    {"from": "human", "value": translated[1]},
                    {"from": "gpt", "value": translated[2]}
                ]
            }
            # Append one JSON line per sample so progress survives an interruption
            output_file.write(json.dumps(translated_sample, ensure_ascii=False))
            output_file.write("\n")
            output_file.flush()
        else:
            truncated_count += 1

        # Log progress every 50 processed samples (translated or skipped)
        if (i + 1) % 50 == 0:
            end_time = time.time()
            elapsed_time = end_time - start_time
            print(f"Translated {i + 1 - truncated_count} lines in {elapsed_time:.2f} seconds. Truncated {truncated_count} lines.")
            start_time = end_time

print(f"Finished translation. Truncated {truncated_count} lines due to length limits.")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Translating dataset of size: 62661
Translated 5 lines in 13.72 seconds. Truncated 45 lines.
Translated 11 lines in 10.05 seconds. Truncated 89 lines.
Translated 15 lines in 21.55 seconds. Truncated 135 lines.
Translated 22 lines in 19.78 seconds. Truncated 178 lines.
Translated 28 lines in 5.45 seconds. Truncated 222 lines.
Translated 35 lines in 35.31 seconds. Truncated 265 lines.
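The loop above deletes any existing output file before it starts, so an interrupted run has to be restarted by hand from a chosen offset. One possible alternative, sketched below, is to store the sample index in every output line and resume after the last index found in the file. All names here (`last_index`, `run`, the `index` and `value` fields) are hypothetical and are not part of the notebook or of EasyNMT; `str.upper` stands in for the translation call.

```python
import json
import os
import tempfile
from typing import Callable, List


def last_index(path: str) -> int:
    """Return the sample index stored in the last non-empty line, or -1 if there is none."""
    last = -1
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    last = json.loads(line)["index"]
    return last


def run(samples: List[str], path: str, transform: Callable[[str], str],
        keep: Callable[[str], bool] = lambda s: True) -> int:
    """Append one JSON line per kept sample, starting after the last index already written."""
    start = last_index(path) + 1
    with open(path, "a", encoding="utf-8") as out:
        for i in range(start, len(samples)):
            if not keep(samples[i]):
                continue  # e.g. sample too long; skipped samples leave no line
            record = {"index": i, "value": transform(samples[i])}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            out.flush()
    return start


path = os.path.join(tempfile.mkdtemp(), "out.jsonl")
samples = ["a", "bb", "ccc", "d"]

first_start = run(samples[:2], path, str.upper)   # simulates a run that stopped after two samples
second_start = run(samples, path, str.upper)      # resumes at index 2 instead of starting over

with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
```

Storing the index explicitly, rather than counting lines, keeps the resume point correct even when some samples are skipped for exceeding the length limit.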


# Available Models


In [None]:
available_models = ['opus-mt', 'mbart50_m2m', 'm2m_100_418M', "m2m_100_1.2B"]
# Note: 'm2m_100_1.2B' needs a lot of RAM and may fail to load on the free Colab tier in this session.
# On a fresh Colab instance with nothing else loaded, it should work.

for model_name in available_models:
  print("\n\nLoad model:", model_name)
  model = EasyNMT(model_name)

  sentences = ['In dieser Liste definieren wir mehrere Sätze.',
              'Jeder dieser Sätze wird dann in die Zielsprache übersetzt.',
              'Puede especificar en esta lista la oración en varios idiomas.',
              'El sistema detectará automáticamente el idioma y utilizará el modelo correcto.']
  translations = model.translate(sentences, target_lang='en')

  print("Translations:")
  for sent, trans in zip(sentences, translations):
    print(sent)
    print("=>", trans, "\n")
  del model
