<a href="https://colab.research.google.com/github/MST47/Open-Source-NLP-Toolkit/blob/main/1_Machine_Translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Machine Translation using Transformers

## Introduction
- Open-source machine translation (MT) models let you translate between languages without relying on Google Translate.
  Models covering more than 1000 language pairs are available on the Hugging Face Hub and elsewhere, such as:
1. https://huggingface.co/Helsinki-NLP
2. https://huggingface.co/models?search=facebook+m2m
3. https://github.com/UKPLab/EasyNMT

- Most machine translation models translate between two languages in one direction (e.g. German to English, but not English to German); some can translate in multiple directions.
- Automation of tasks with ML has three main ingredients:<br>
    i. A model trained on a specific task <br>
    ii. Input data (e.g. texts or images) <br>
    iii. Output produced by the model.    <br>
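The three ingredients above can be illustrated with a deliberately tiny sketch. The `toy_model` function and `TOY_LEXICON` table below are hypothetical stand-ins for a trained model (a word-by-word dictionary lookup, not a real translation model); they only show how model, input, and output relate.

```python
# Hypothetical stand-in for a trained model: a tiny German-to-English lookup table.
TOY_LEXICON = {"ich": "I", "bin": "am", "ein": "a", "fisch": "fish"}


def toy_model(text: str) -> str:
    """Translate word by word; unknown words are passed through unchanged."""
    return " ".join(TOY_LEXICON.get(word.lower(), word) for word in text.split())


input_data = "Ich bin ein Fisch"   # ii. input data
output = toy_model(input_data)     # i. the model applied to the input
print(output)                      # iii. output produced by the model
```

A real MT model replaces the lookup table with a neural network, but the flow from input through model to output stays the same.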

## Getting Started with Transformers

### Install the Transformers library & dependencies

In [1]:
!pip install transformers~=4.31.0  # The Transformers library from Hugging Face
!pip install sentencepiece==0.1.96  # optional tokeniser, required for some models, e.g. machine translation models
!pip install wikipedia==1.4.0  # to download any text from wikipedia
# running large models with accelerate https://huggingface.co/blog/accelerate-large-models
# NOTE: we need to restart the runtime after installing accelerate
!pip install accelerate~=0.21.0

Collecting transformers~=4.31.0
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers~=4.31.0)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.19.1
    Uninstalling tokenizers-0.19.1:
      Successfully uninstalled tokenizers-0.19.1
  Attempting uninstall: transformers
    Found existing installation: transformers 4.41.2
    Uninstalling transformers-4.41.2:
      Successfully uninstalled transformers-4.41.2
Successfully installed tokenizers-0.13.3 transformers-4.31.0
Collecting sentencepiece==0.1.

#### The Hugging Face Pipeline

In [2]:
from transformers import pipeline
import pandas as pd
import numpy as np
from pprint import pprint

## 1.1 Using Facebook Model

In [3]:
# translation pipeline docs: https://huggingface.co/transformers/main_classes/pipelines.html#transformers.TranslationPipeline
pipeline_translate = pipeline("translation", model="facebook/m2m100_418M")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [4]:
text = "Ich bin ein Fisch"
pipeline_translate(text, src_lang="de", tgt_lang="en")

[{'translation_text': 'I am a fish'}]
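As the output above shows, the pipeline returns a list of dictionaries, one per input, each with a `translation_text` key. A minimal sketch of pulling out just the strings follows; `extract_translations` is a hypothetical helper name, and `pipeline_output` is copied from the cell output above rather than produced by a fresh model call.

```python
def extract_translations(outputs):
    """Return only the translated strings from a translation pipeline result."""
    return [item["translation_text"] for item in outputs]


# Result copied from the cell above; no model is loaded in this sketch.
pipeline_output = [{"translation_text": "I am a fish"}]
print(extract_translations(pipeline_output))
```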

In [5]:
# download any text from wikipedia, via  https://pypi.org/project/wikipedia/
import wikipedia
wikipedia.set_lang("de")  # Wikipedia Language is set to German

text = wikipedia.summary("Donald Trump").replace('\n', ' ')[:318]
print(f"Original text:\n{text}\n")

# translate the text from wikipedia
text_translated = pipeline_translate(text, src_lang="de", tgt_lang="en")
print(f"Translated text:\n{text_translated[0]['translation_text']}")

Original text:
Donald John Trump [ˈdɑn.əld dʒɑn tɹɐmp] (* 14. Juni 1946 in Queens, New York City, New York) ist ein US-amerikanischer Unternehmer, Entertainer und Politiker der Republikanischen Partei, der von 2017 bis 2021 der 45. Präsident der Vereinigten Staaten war. Er gilt als einer der umstrittensten Politiker der US-Geschich

Translated text:
Donald John Trump [ˈdɑn.əld dʒɑn tɔmp] (born 14 June 1946 in Queens, New York City, New York) is an American entrepreneur, entertainer and politician of the Republican Party, who from 2017 to 2021 was the 45th President of the United States.
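The `[:318]` slice in the cell above cuts the summary in the middle of a word, so the last sentence reaches the model incomplete. One possible alternative, sketched below, is to cut at the last complete sentence instead. `truncate_at_sentence` is a hypothetical helper, and it assumes a sentence ends with `.`, `!` or `?` followed by whitespace, which can misfire on abbreviations and on ordinals such as "14. Juni".

```python
import re


def truncate_at_sentence(text: str, max_chars: int) -> str:
    """Cut text to at most max_chars, ending at the last sentence boundary if one exists."""
    if len(text) <= max_chars:
        return text
    clipped = text[:max_chars]
    # Sentence-ending punctuation that is followed by whitespace inside the clipped text.
    matches = list(re.finditer(r"[.!?](?=\s)", clipped))
    if not matches:
        return clipped  # no boundary found: fall back to the plain slice
    return clipped[: matches[-1].end()]


sample = "Er war Präsident. Er gilt als einer der umstrittensten Politiker."
print(truncate_at_sentence(sample, 40))
```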


## 1.2 Using Helsinki Model

In [14]:
pipeline_translate = pipeline("translation", model="Helsinki-NLP/opus-mt-zh-en")  # model names follow "Helsinki-NLP/opus-mt-{src}-{trg}"; here src (source) = zh and trg (target) = en
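The model name in the cell above follows the pattern `Helsinki-NLP/opus-mt-{src}-{trg}`. A small sketch of building such names is shown here; `opus_mt_model_name` is a hypothetical helper, and it only formats the string, so it does not check whether a model for the requested pair actually exists on the Hub.

```python
def opus_mt_model_name(src: str, trg: str) -> str:
    """Build a Helsinki-NLP model name from source and target language codes."""
    return f"Helsinki-NLP/opus-mt-{src}-{trg}"


print(opus_mt_model_name("zh", "en"))
```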

In [13]:
# download any text from wikipedia, via  https://pypi.org/project/wikipedia/
import wikipedia
wikipedia.set_lang("zh")  # Wikipedia language is set to Chinese

text = wikipedia.summary("Donald Trump").replace('\n', ' ')[:318]
print(f"Original text:\n{text}\n")

# translate the text from wikipedia
text_translated = pipeline_translate(text, src_lang="zh", tgt_lang="en")
print(f"Translated text:\n{text_translated[0]['translation_text']}")

Original text:
唐納·約翰·川普（英語：Donald John Trump；1946年6月14日—），美國政治人物，第45任美国总统。從政前為企业家、媒體名人。 川普出生并成长于纽约州紐約市皇后區，为特朗普集團前任董事長兼总裁及特朗普娱乐公司的創辦人，在全世界经营房地产、賭場和酒店。1996年至2015年间，特朗普旗下拥有美國小姐和环球小姐选美比赛，还在2004年至2015年间主持了NBC的一档电视真人秀系列节目《飛黃騰達》。2017年时，《福布斯》将他列为世界上第544名最富有的人（美国第201名），截至2024年有着75亿美元的净資产。 川普在1987年时第一次公开表达对竞选公职的兴趣。他在2000年赢得加利福尼亚州和密歇根州的改革黨總統初

Translated text:
Donald John Trump (in English: Donald John Trump; 14 June 1946), American politician 45th President of the United States. He was an entrepreneur and media celebrity, born and grew up in Queens, New York, New York.
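The translation above covers only the beginning of the Chinese input. One common workaround is to translate sentence by sentence and join the results. The sketch below assumes sentences end with the full-width marks `。`, `！` or `？`; `split_sentences` and `translate_by_sentence` are hypothetical helpers, and `translate_fn` stands for any callable that maps one string to its translation.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Split text after full-width sentence-ending punctuation, keeping the punctuation."""
    pieces = re.findall(r"[^。！？]+[。！？]?", text)
    return [piece.strip() for piece in pieces if piece.strip()]


def translate_by_sentence(text: str, translate_fn: Callable[[str], str]) -> str:
    """Translate each sentence separately and join the results with spaces."""
    return " ".join(translate_fn(sentence) for sentence in split_sentences(text))


# Demonstration with a placeholder translate_fn that only wraps each sentence in brackets.
print(translate_by_sentence("你好。 我是鱼！再见", lambda s: f"<{s}>"))

# With the notebook's pipeline, translate_fn could be (assumption, not run here):
# lambda s: pipeline_translate(s)[0]["translation_text"]
```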


## 1.3 EasyNMT (Neural Machine Translation)

In [19]:
!pip install -U easynmt



In [20]:
from easynmt import EasyNMT

In [22]:
model = EasyNMT('opus-mt')

# Translate several sentences to German
sentences = ['You can define a list with sentences.',
             'All sentences are translated to your target language.',
             'Note, you could also mix the languages of the sentences.']
print(model.translate(sentences, target_lang='de'))

['Sie können eine Liste mit Sätzen definieren.\n', 'Alle Sätze werden in Ihre Zielsprache übersetzt.\n', 'Beachten Sie, Sie können auch die Sprachen der Sätze mischen.\n']
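Each translated string in the output above ends with a newline character. If that is unwanted, the strings can be stripped afterwards, as in this short sketch; the `translations` list is copied from the output above rather than produced by a fresh model call.

```python
# Output copied from the cell above; no model is loaded in this sketch.
translations = [
    'Sie können eine Liste mit Sätzen definieren.\n',
    'Alle Sätze werden in Ihre Zielsprache übersetzt.\n',
    'Beachten Sie, Sie können auch die Sprachen der Sätze mischen.\n',
]

# Remove leading and trailing whitespace, including the trailing newline.
cleaned = [sentence.strip() for sentence in translations]
for sentence in cleaned:
    print(sentence)
```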
