<a href="https://colab.research.google.com/gist/Melvinchen0404/61728595eb14b847605ac19af12aaf6e/machine_translation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## NLP Technique 6: Machine Translation

Source: https://www.section.io/engineering-education/building-a-simple-translation-app-using-python-for-beginners/

**OPTION 1 (en, fr, ro, de):** The **T5 base model** (Google) on **Hugging Face** \
The available languages are `en`, `fr`, `ro` and `de` (**English, French, Romanian, German**): https://huggingface.co/t5-base \
The **T5 base model** is a **transformer-based** model that takes a **text-to-text** approach: every **NLP** task (e.g., translation, question answering, sentence completion, word sense disambiguation, sentiment analysis, classification) is cast as feeding the model **source text** as input and training it to generate **target text**

For more on the **text-to-text** approach of Google's **T5**, see Raffel *et al.* (2020): https://arxiv.org/pdf/1910.10683.pdf
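
To make the text-to-text framing concrete, the sketch below builds the kind of prefixed input string that T5 consumes. The prefixes follow the task-prefix convention described by Raffel *et al.*; the `TASK_PREFIXES` table and the `to_text_to_text` helper are illustrative names chosen here, not part of any library.

```python
# Illustrative only: TASK_PREFIXES and to_text_to_text are hypothetical names.
# The prefixes follow the task-prefix convention described in Raffel et al. (2020).
TASK_PREFIXES = {
    "en_to_fr": "translate English to French: ",
    "en_to_de": "translate English to German: ",
    "en_to_ro": "translate English to Romanian: ",
    "summarize": "summarize: ",
}


def to_text_to_text(task: str, source_text: str) -> str:
    """Cast a task as plain text by prepending its prefix to the source text."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task: {task!r}")
    return TASK_PREFIXES[task] + source_text


print(to_text_to_text("en_to_fr", "The house is wonderful."))
# translate English to French: The house is wonderful.
```

The model never sees a task label as a separate input: the task is part of the text itself, which is why one model can serve translation, summarization and classification alike.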

**Hugging Face** provides open-source **machine learning** tools and a platform for sharing models. Its `transformers` library offers many **pipelines**, which are an easy way to use models for **inference** \
Here is the relevant documentation for these **pipelines**: https://huggingface.co/docs/transformers/main_classes/pipelines \
The `TranslationPipeline` is a task-specific **pipeline** that **translates text from one language to another**

**STEP 1 of OPTION 1:** Install PyTorch

In [1]:
!pip3 install torch==1.8.1+cpu torchvision==0.9.1+cpu torchaudio==0.8.1 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://download.pytorch.org/whl/lts/1.8/torch_lts.html
Collecting torch==1.8.1+cpu
  Downloading https://download.pytorch.org/whl/lts/1.8/cpu/torch-1.8.1%2Bcpu-cp37-cp37m-linux_x86_64.whl (169.1 MB)
Collecting torchvision==0.9.1+cpu
  Downloading https://download.pytorch.org/whl/lts/1.8/cpu/torchvision-0.9.1%2Bcpu-cp37-cp37m-linux_x86_64.whl (13.3 MB)
Collecting torchaudio==0.8.1
  Downloading torchaudio-0.8.1-cp37-cp37m-manylinux1_x86_64.whl (1.9 MB)
Installing collected packages: torch, torchvision, torchaudio
  Attempting uninstall: torch
    Found existing installation: torch 1.11.0+cu113
    Uninstalling torch-1.11.0+cu113:
      Successfully uninstalled torch-1.11.0+cu113
 

**STEP 2 of OPTION 1:** Install Hugging Face Transformers, ipywidgets and Gradio \
a) `transformers` provides the **translation pipeline**; \
b) `ipywidgets` provides the **progress bar** shown while the model is being downloaded; \
c) `gradio` provides a simple way to demonstrate and interact with a **machine learning** model

In [2]:
!pip3 install --upgrade transformers ipywidgets gradio

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.20.0-py3-none-any.whl (4.4 MB)
Collecting gradio
  Downloading gradio-3.0.19-py3-none-any.whl (5.1 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting uvicorn
  Downloading uvicor

**STEP 3 of OPTION 1:** Import the installed dependencies: `gradio` and the `pipeline` function from `transformers`

In [3]:
import gradio as gr
from transformers import pipeline

**STEP 4 of OPTION 1:** Use the `pipeline()` function \
The task identifier `translation_xx_to_yy` selects the translation direction: `xx` is the language you want to translate from (the **source language**) and `yy` is the language you want to translate to (the **target language**)

NOTE: The **source and target languages** are written as **ISO 639-1** codes, a set of 2-letter codes covering most of the world's major languages: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes. More generally, **ISO 639** is a standardized nomenclature for classifying languages
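
As a minimal sketch of how such an identifier could be assembled from two ISO 639-1 codes, consider the helper below. `translation_task` and `T5_LANGUAGES` are hypothetical names introduced for illustration; the language set simply mirrors the four codes listed for `t5-base` above.

```python
# Hypothetical helper; the set below mirrors the languages listed for t5-base above.
T5_LANGUAGES = {"en": "English", "fr": "French", "ro": "Romanian", "de": "German"}


def translation_task(source: str, target: str) -> str:
    """Build a `translation_xx_to_yy` task identifier from two ISO 639-1 codes."""
    source, target = source.lower(), target.lower()
    for code in (source, target):
        if code not in T5_LANGUAGES:
            raise ValueError(f"{code!r} is not one of {sorted(T5_LANGUAGES)}")
    if source == target:
        raise ValueError("source and target languages must differ")
    return f"translation_{source}_to_{target}"


print(translation_task("en", "fr"))  # translation_en_to_fr
```

The helper only checks the codes. Raffel *et al.* report translation experiments from English into German, French and Romanian, so directions that start from another language may not work as well; the helper does not check this.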

In [4]:
translation_pipeline = pipeline('translation_en_to_fr')

No model was supplied, defaulted to t5-base (https://huggingface.co/t5-base)


Downloading:   0%|          | 0.00/1.17k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/850M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/773k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.32M [00:00<?, ?B/s]

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
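
The log above notes that no model was supplied and that the pipeline fell back to `t5-base`. Naming the checkpoint explicitly is one way to keep the cell reproducible if that default ever changes. The wrapper below is a hedged sketch: `load_translator` is a hypothetical name, and the import is deferred so that merely defining the function neither requires `transformers` nor downloads anything.

```python
def load_translator(task: str = "translation_en_to_fr", model_name: str = "t5-base"):
    """Return a translation pipeline for an explicitly named checkpoint.

    Hypothetical wrapper: the import is deferred so that defining the function
    does not require `transformers` to be installed or a model to be downloaded.
    """
    from transformers import pipeline  # same import as in STEP 3

    return pipeline(task, model=model_name)


# Usage (downloads the checkpoint on first call, so it is left commented out):
# translation_pipeline = load_translator()
# translation_pipeline("Hello, world!")
```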


**STEP 5 of OPTION 1:** Call `translation_pipeline()` with the English (`en`) text that you wish to translate into French (`fr`). Running the cell returns the output text in the **target language**

In [5]:
translation_pipeline('Stately, plump Buck Mulligan came from the stairhead, bearing a bowl of lather on which a mirror and a razor lay crossed.')

[{'translation_text': "De l'escalier, Buck Mulligan s'est rendu à l'état et a porté un bol de laveur sur lequel un miroir et un rasoir étaient croisés."}]
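
The pipeline returns a list with one dictionary per input, and the translated string sits under the `translation_text` key. The sketch below unwraps that structure and shows one way the `gradio` import from STEP 3 could be put to use. `make_translate_fn` and `build_demo` are hypothetical helper names; `gr.Interface` and `launch()` are the Gradio calls assumed here, and the title string is illustrative.

```python
def make_translate_fn(translation_pipeline):
    """Wrap a translation pipeline so it maps one string to one string."""

    def translate(text: str) -> str:
        if not text.strip():
            return ""  # nothing to translate
        # The pipeline returns a list with one dict per input text.
        return translation_pipeline(text)[0]["translation_text"]

    return translate


def build_demo(translate_fn):
    """Create (but do not launch) a Gradio text-to-text interface."""
    import gradio as gr  # deferred so the helper above works without Gradio

    return gr.Interface(
        fn=translate_fn,
        inputs="text",
        outputs="text",
        title="English to French (t5-base)",
    )


# Usage in the notebook, after STEP 4:
# build_demo(make_translate_fn(translation_pipeline)).launch()
```

Keeping the wrapper separate from the interface means the unwrapping logic can be checked with a stand-in pipeline before any model is loaded.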

**OPTION 2:** The **Google Translate API** \
We can use `googletrans`, a free and unlimited Python library that makes unofficial Ajax calls to the **Google Translate API** in order to detect languages and translate text \
An **API** (or **Application Programming Interface**) is a software intermediary that allows two applications to talk to each other

**STEP 1 of OPTION 2:** Install the `googletrans` and `google-trans-new` libraries and import the requisite names

In [13]:
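# Install both unofficial Google Translate client libraries
!pip3 install googletrans==4.0.0-rc1
!pip3 install google-trans-new
from googletrans import constants
from google_trans_new import google_translator
from pprint import pprint

# ANSI escape codes used to print bold labels in the cells below
class color:
    BOLD = '\033[1m'
    END = '\033[0m'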

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting google-trans-new
  Downloading google_trans_new-1.1.9-py3-none-any.whl (9.2 kB)
Installing collected packages: google-trans-new
Successfully installed google-trans-new-1.1.9


**STEP 2 of OPTION 2:** Initialize the **Google API** translator

In [14]:
translator = google_translator()  

**STEP 3 of OPTION 2:** Use the `translate()` method of the `googletrans` `Translator` class to generate the **target text** from the **source text**

In [15]:
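# Translator comes from googletrans; its results carry the attributes inspected in the next steps
from googletrans import Translator
translator = Translator()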
translation = translator.translate(text="Ulysses is a novel by James Joyce.", dest="zh-cn")
print(f"{translation.origin} ({translation.src}) --> {translation.text} ({translation.dest})")

Ulysses is a novel by James Joyce. (en) --> 尤利西斯（Ulysses）是詹姆斯·乔伊斯（James Joyce）的小说。 (zh-cn)
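
Because `googletrans` relies on unofficial calls, a request may occasionally fail. A small retry wrapper is one possible way to make the cell more robust. The sketch below is an assumption-laden illustration: `translate_with_retry` is a hypothetical helper, and it accepts any callable that behaves like `translator.translate`.

```python
import time


def translate_with_retry(translate_fn, text, dest, retries=3, delay=1.0):
    """Call translate_fn(text, dest=dest), retrying on any exception.

    Hypothetical helper: translate_fn is expected to behave like
    Translator().translate from googletrans, but any callable works.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return translate_fn(text, dest=dest)
        except Exception as error:  # broad on purpose: failure modes vary
            last_error = error
            if attempt < retries:
                time.sleep(delay)  # wait before the next attempt
    raise RuntimeError(f"translation failed after {retries} attempts") from last_error


# Usage with the translator from the cell above:
# translation = translate_with_retry(
#     translator.translate, "Ulysses is a novel by James Joyce.", dest="zh-cn"
# )
```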


**STEP 4 of OPTION 2:** Inspect the result of the `translate()` method \
The object returned by the `translate()` method has the following attributes:
*   `src` - the **source language**;
*   `dest` - the **target (or destination) language** (the default option is `en` or English);
*   `origin` - the **original text**;
*   `text` - the **target (or translated) text**;
*   `pronunciation` - the pronunciation of the **translated text**

In [16]:
print(color.BOLD + 'Source language: \n' + color.END, translation.src)
print(color.BOLD + 'Target language: \n' + color.END, translation.dest)
print(color.BOLD + 'Source text: \n' + color.END, translation.origin)
print(color.BOLD + 'Target text: \n' + color.END, translation.text)
print(color.BOLD + 'Pronunciation of target text: \n' + color.END, translation.pronunciation)

Source language: 
 en
Target language: 
 zh-cn
Source text: 
 Ulysses is a novel by James Joyce.
Target text: 
 尤利西斯（Ulysses）是詹姆斯·乔伊斯（James Joyce）的小说。
Pronunciation of target text: 
 Yóu lì xī sī (Ulysses) shì zhānmǔsī·qiáo yī sī (James Joyce) de xiǎoshuō.
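
The same attributes can be gathered into a dictionary, which is convenient for logging or for saving results. This is a minimal sketch: `translation_to_dict` and `FIELDS` are hypothetical names, and a `SimpleNamespace` stands in for the real result object so that the example runs without a network call.

```python
from types import SimpleNamespace

# Attribute names taken from the list in STEP 4
FIELDS = ("src", "dest", "origin", "text", "pronunciation")


def translation_to_dict(translation):
    """Collect the listed attributes of a translation result into a dict."""
    return {name: getattr(translation, name, None) for name in FIELDS}


# Stand-in object mirroring the result shown above, so the sketch runs offline.
example = SimpleNamespace(
    src="en",
    dest="zh-cn",
    origin="Ulysses is a novel by James Joyce.",
    text="尤利西斯（Ulysses）是詹姆斯·乔伊斯（James Joyce）的小说。",
    pronunciation="Yóu lì xī sī (Ulysses) shì zhānmǔsī·qiáo yī sī (James Joyce) de xiǎoshuō.",
)
print(translation_to_dict(example)["dest"])  # zh-cn
```

In the notebook, `translation_to_dict(translation)` would be called on the object returned in STEP 3.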


**STEP 5 of OPTION 2:** Print the additional data that the `translate()` method attaches to the result as `extra_data`

In [17]:
pprint(translation.extra_data)

{'confidence': None,
 'origin_pronunciation': None,
 'parsed': [[None,
             None,
             'en',
             [[[0, [[[None, 34]], [True]]]], 34],
             [['Ulysses is a novel by James Joyce.', None, None, 34]]],
            [[[None,
               'Yóu lì xī sī (Ulysses) shì zhānmǔsī·qiáo yī sī (James Joyce) '
               'de xiǎoshuō.',
               None,
               None,
               None,
               [['尤利西斯（Ulysses）是詹姆斯·乔伊斯（James Joyce）的小说。',
                 None,
                 None,
                 None,
                 [['尤利西斯（Ulysses）是詹姆斯·乔伊斯（James Joyce）的小说。', [5]],
                  ['尤利西斯是詹姆斯·乔伊斯（James Joyce）的小说。', [11]]]]]]],
             'zh-cn',
             1,
             'en',
             ['Ulysses is a novel by James Joyce.', 'auto', 'zh-cn', True]],
            'en'],
 'parts': [<googletrans.models.TranslatedPart object at 0x7fb34cf7f790>]}
