### Machine Translation
Neural machine translation emerged in recent years, outperforming all previous approaches. More specifically, neural networks based on attention called transformers did an outstanding job on this task.

In this notebook I will  perform machine translation without any training. In other words, I'll be using pre-trained models from Huggingface transformer models. [HuggingFace Transformer model](https://huggingface.co/models?pipeline_tag=translation&sort=downloads)

## I will be using the Colab GPU

In [1]:
import tensorflow as tf

# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

Found GPU at: /device:GPU:0


In [3]:
# installing Transfromer
!pip install transformers

Collecting transformers
  Downloading transformers-4.17.0-py3-none-any.whl (3.8 MB)
[K     |████████████████████████████████| 3.8 MB 5.4 MB/s 
[?25hCollecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
[K     |████████████████████████████████| 596 kB 46.1 MB/s 
Collecting tokenizers!=0.11.3,>=0.11.1
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
[K     |████████████████████████████████| 6.5 MB 34.7 MB/s 
[?25hCollecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
[K     |████████████████████████████████| 67 kB 4.8 MB/s 
[?25hCollecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
[K     |████████████████████████████████| 895 kB 43.4 MB/s 
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml


In [4]:
# import transformer library
import transformers

print(transformers.__version__)

4.17.0


The Helsinki-NLP models we will use are primarily trained on the OPUS dataset, a collection of translated texts from the web; it is free online data.


We will see how to easily load the dataset for this task using 🤗 Datasets and how to fine-tune a model on it using the `Trainer` API.

In [48]:
from transformers import *


In [121]:
# the english to russian model from hugging face library
model_checkpoint = "Helsinki-NLP/opus-mt-en-ru"

# Using Pipeline API
Let's first get started with the library's pipeline API; we'll be using the models trained by `Helsinki-NLP`. You can check their page to see the available models they have:

In [49]:
# source & destination languages
src = "en"
dst = "ru"

task_name = f"translation_{src}_to_{dst}"
model_name = f"Helsinki-NLP/opus-mt-{src}-{dst}"

translator  = pipeline(task_name, model=model_name, tokenizer=model_name)

loading configuration file https://huggingface.co/Helsinki-NLP/opus-mt-en-ru/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/77cec7e2e8c651e0de5a486162b120017319e844a1e8dc2edb308b1063822f1e.d485279d1a134dbaa57f731e1d68a2103c35113a1e8e6f7d9186807db74b54a9
Model config MarianConfig {
  "_name_or_path": "Helsinki-NLP/opus-mt-en-ru",
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "swish",
  "add_bias_logits": false,
  "add_final_layer_norm": false,
  "architectures": [
    "MarianMTModel"
  ],
  "attention_dropout": 0.0,
  "bad_words_ids": [
    [
      62517
    ]
  ],
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 512,
  "decoder_attention_heads": 8,
  "decoder_ffn_dim": 2048,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 6,
  "decoder_start_token_id": 62517,
  "dropout": 0.1,
  "encoder_attention_heads": 8,
  "encoder_ffn_dim": 2048,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 6,
  "e

Downloading:   0%|          | 0.00/42.0 [00:00<?, ?B/s]

storing https://huggingface.co/Helsinki-NLP/opus-mt-en-ru/resolve/main/tokenizer_config.json in cache at /root/.cache/huggingface/transformers/8789baef22e163407b9859cfc335b5aec2bc5e842d6e87b53787350d908b8984.4ea24009acf2f7010cdca879d1f08babd6fdf9dc414ed3335f5d105a4f945193
creating metadata file for /root/.cache/huggingface/transformers/8789baef22e163407b9859cfc335b5aec2bc5e842d6e87b53787350d908b8984.4ea24009acf2f7010cdca879d1f08babd6fdf9dc414ed3335f5d105a4f945193
loading configuration file https://huggingface.co/Helsinki-NLP/opus-mt-en-ru/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/77cec7e2e8c651e0de5a486162b120017319e844a1e8dc2edb308b1063822f1e.d485279d1a134dbaa57f731e1d68a2103c35113a1e8e6f7d9186807db74b54a9
Model config MarianConfig {
  "_name_or_path": "Helsinki-NLP/opus-mt-en-ru",
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "swish",
  "add_bias_logits": false,
  "add_final_layer_norm": false,
  "architectures": [
    "M

Downloading:   0%|          | 0.00/784k [00:00<?, ?B/s]

storing https://huggingface.co/Helsinki-NLP/opus-mt-en-ru/resolve/main/source.spm in cache at /root/.cache/huggingface/transformers/1abc21cfb4c61b7d591ff17a3deda0fcd412708330d7498955d3bd744f251245.08cf8cb11c4bccda9984da12e0f0ff6d4ae4f731f8e8150ada7f80182f61cf29
creating metadata file for /root/.cache/huggingface/transformers/1abc21cfb4c61b7d591ff17a3deda0fcd412708330d7498955d3bd744f251245.08cf8cb11c4bccda9984da12e0f0ff6d4ae4f731f8e8150ada7f80182f61cf29
https://huggingface.co/Helsinki-NLP/opus-mt-en-ru/resolve/main/target.spm not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpco7w4edh


Downloading:   0%|          | 0.00/1.03M [00:00<?, ?B/s]

storing https://huggingface.co/Helsinki-NLP/opus-mt-en-ru/resolve/main/target.spm in cache at /root/.cache/huggingface/transformers/85a3835aa6f93e766c2d97b6e4e3c4fb236c233d5c2fdf5a229b972bcb6cbbf1.56209d2ca3707ce9263f4035ac7a3a3903fdda4180df9f4174972e23e045b436
creating metadata file for /root/.cache/huggingface/transformers/85a3835aa6f93e766c2d97b6e4e3c4fb236c233d5c2fdf5a229b972bcb6cbbf1.56209d2ca3707ce9263f4035ac7a3a3903fdda4180df9f4174972e23e045b436
https://huggingface.co/Helsinki-NLP/opus-mt-en-ru/resolve/main/vocab.json not found in cache or force_download set to True, downloading to /root/.cache/huggingface/transformers/tmpxdcsckbg


Downloading:   0%|          | 0.00/2.48M [00:00<?, ?B/s]

storing https://huggingface.co/Helsinki-NLP/opus-mt-en-ru/resolve/main/vocab.json in cache at /root/.cache/huggingface/transformers/b9fe1f87eaef128f448778194957f14a9f5476e22b75995daab8490dbd3b69ad.1d41830b1d9ac03606e0af0d302866e05458276f26bcba52f913026be8f291f8
creating metadata file for /root/.cache/huggingface/transformers/b9fe1f87eaef128f448778194957f14a9f5476e22b75995daab8490dbd3b69ad.1d41830b1d9ac03606e0af0d302866e05458276f26bcba52f913026be8f291f8
loading file https://huggingface.co/Helsinki-NLP/opus-mt-en-ru/resolve/main/source.spm from cache at /root/.cache/huggingface/transformers/1abc21cfb4c61b7d591ff17a3deda0fcd412708330d7498955d3bd744f251245.08cf8cb11c4bccda9984da12e0f0ff6d4ae4f731f8e8150ada7f80182f61cf29
loading file https://huggingface.co/Helsinki-NLP/opus-mt-en-ru/resolve/main/target.spm from cache at /root/.cache/huggingface/transformers/85a3835aa6f93e766c2d97b6e4e3c4fb236c233d5c2fdf5a229b972bcb6cbbf1.56209d2ca3707ce9263f4035ac7a3a3903fdda4180df9f4174972e23e045b436
loadi

Let's test it out:

In [51]:
translator("You're a genius Anthony.")[0]["translation_text"]


'Ты гений Энтони.'

The pipeline API is pretty straightforward; we get the output by simply passing the text to the translator pipeline object.

# Now Loading our newsgroups dataset

In [122]:
from sklearn.datasets import fetch_20newsgroups

In [81]:
# Helper function for Cleaning the data set

def clean(post: str, remove_it: tuple):
  new_lines = []
  for line in post.splitlines():
        if not line.startswith(remove_it):
            new_lines.append(line)
  return '\n'.join(new_lines)

remove_it = (
      'From:',
      'Subject:',
      'Reply-To:',
      'In-Reply-To:',
      'Nntp-Posting-Host:',
      'Organization:',
      'X-Mailer:',
      'In article <',
      'Lines:',
      'NNTP-Posting-Host:',
      'Summary:',
      'Article-I.D.:'
  )


In [82]:
categories = ['alt.atheism', 'talk.religion.misc',
               'comp.graphics', 'sci.space']
# fetch the test dataset
newsgroups_test = fetch_20newsgroups(subset='test',
                                      categories=categories)
x_test = data_test.data
x_test = [clean(p, remove_it) for p in x_test]



In [104]:
#print sample of our data
#print("\n".join(newsgroups_test.data[0].split("\n")[:5]))
x_test[2].split("\n")

['X-Newsreader: rusnews v1.02',
 '',
 'acooper@mac.cc.macalstr.edu (Turin Turambar, ME Department of Utter Misery) writes:',
 '> Did that FAQ ever got modified to re-define strong atheists as not those who',
 '> assert the nonexistence of God, but as those who assert that they BELIEVE in ',
 '> the nonexistence of God?',
 '',
 'In a word, yes.',
 '',
 '',
 'mathew']

# Translation Examples

# First 10 sentences

In [117]:
for i in range(0, 10):
  print('English Sentence: ', x_test[0].split("\n")[i])
  print('Russian Translate: ',translator(x_test[0].split("\n")[i])[0]["translation_text"])
  print()

 

English Sentence:  News-Software: VAX/VMS VNEWS 1.41
Russian Translate:  Новостное программное обеспечение: VAX/VMS VNEWS 1,41

English Sentence:  
Russian Translate:  Я не знаю, что делать.

English Sentence:  
Russian Translate:  Я не знаю, что делать.

English Sentence:   I am a little confused on all of the models of the 88-89 bonnevilles.
Russian Translate:  Я немного запутался со всеми моделями Бонневиль 88-89.

English Sentence:  I have heard of the LE SE LSE SSE SSEI. Could someone tell me the
Russian Translate:  Я слышала о LE SESSE SSEI.

English Sentence:  differences are far as features or performance. I am also curious to
Russian Translate:  Мне также любопытно узнать, что такое различия в характеристиках или работе.

English Sentence:  know what the book value is for prefereably the 89 model. And how much
Russian Translate:  Знаете, какое значение имеет бухгалтерская ценность для 89-й модели.

English Sentence:  less than book value can you usually get them for. In other 

# Next 10 sentences

In [119]:
for i in range(0, 10):
  print('English Sentence: ', x_test[1].split("\n")[i])
  print('Russian Translate: ',translator(x_test[1].split("\n")[i])[0]["translation_text"])
  print()

 

English Sentence:  Distribution: world
Russian Translate:  Распределение: мир

English Sentence:  
Russian Translate:  Я не знаю, что делать.

English Sentence:  I'm not familiar at all with the format of these "X-Face:" thingies, but
Russian Translate:  Я совсем не знакома с форматом этих "Икс-Факс:" штуковины, но

English Sentence:  after seeing them in some folks' headers, I've *got* to *see* them (and
Russian Translate:  Увидев их в заголовках некоторых людей, я забыла их увидеть.

English Sentence:  maybe make one of my own)!
Russian Translate:  Сделай одну из моих)!

English Sentence:  
Russian Translate:  Я не знаю, что делать.

English Sentence:  I've got "dpg-view" on my Linux box (which displays "uncompressed X-Faces")
Russian Translate:  У меня есть "Dpg-view" на моем ящике Linux (который показывает "некорректированные X-Faces")

English Sentence:  and I've managed to compile [un]compface too... but now that I'm *looking*
Russian Translate:  и мне тоже удалось составить комп

In [120]:
for i in range(0, 10):
  print('English Sentence: ', x_test[2].split("\n")[i])
  print('Russian Translate: ',translator(x_test[2].split("\n")[i])[0]["translation_text"])
  print()

 

English Sentence:  X-Newsreader: rusnews v1.02
Russian Translate:  X-Newsreader: rusnews v1.02

English Sentence:  
Russian Translate:  Я не знаю, что делать.

English Sentence:  acooper@mac.cc.macalstr.edu (Turin Turambar, ME Department of Utter Misery) writes:
Russian Translate:  acooper@mac.cc.macalstr.edu (Turin Turambar, ME Department of Utter Misery) пишет:

English Sentence:  > Did that FAQ ever got modified to re-define strong atheists as not those who
Russian Translate:  > Удалось ли когда-либо модифицировать FAQ, чтобы переосмыслить сильных атеистов как не тех, кто

English Sentence:  > assert the nonexistence of God, but as those who assert that they BELIEVE in 
Russian Translate:  ▪ утверждают, что Бог не существует, но как те, кто утверждает, что они верят

English Sentence:  > the nonexistence of God?
Russian Translate:  > отсутствие Бога?

English Sentence:  
Russian Translate:  Я не знаю, что делать.

English Sentence:  In a word, yes.
Russian Translate:  Одним словом, 