![license_header_logo](../../../images/license_header_logo.png)

> **Copyright (c) 2021 CertifAI Sdn. Bhd.**<br>
<br>
This program is part of OSRFramework. You can redistribute it and/or modify
<br>it under the terms of the GNU Affero General Public License as published by
<br>the Free Software Foundation, either version 3 of the License, or
<br>(at your option) any later version.
<br>
<br>This program is distributed in the hope that it will be useful
<br>but WITHOUT ANY WARRANTY; without even the implied warranty of
<br>MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
<br>GNU Affero General Public License for more details.
<br>
<br>You should have received a copy of the GNU Affero General Public License
<br>along with this program.  If not, see <http://www.gnu.org/licenses/>.
<br>

# Introduction

This hands-on demonstrates how to use the tools, transformers, and models from the HuggingFace library to create our own machine translation model.

Link to the HuggingFace Library: https://huggingface.co/models?sort=downloads

For a detailed explanation and walkthrough of this hands-on, see the slides for Day 12 - Pretrained Model for NLP & Generalized Language Model.

The walkthrough includes:

1. Explanation of the code
2. How to find the model we want on HuggingFace and use it in the code

# What will we accomplish?

1. Use transformers in HuggingFace
2. Import the correct model from HuggingFace
3. Create your own Machine Translation model

# Instructions
Read the code and run it according to the instructions provided. If you have trouble understanding the code, refer to the slides (Day 12 - Pretrained Model for NLP & Generalized Language Model, Machine Translation Hands-on) for a better understanding.

# Part 1: Code and its explanation

First, we will install the required libraries.
torch refers to the PyTorch library and transformers refers to the HuggingFace transformers library.
We need PyTorch installed in order to use the HuggingFace models and transformers.
If you have already installed both libraries, you can skip this step.
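
The install step above has no accompanying cell, so here is a minimal sketch for checking whether it can be skipped. It uses only the Python standard library; the helper name missing_packages is hypothetical, and the suggested pip command assumes the packages are installed from PyPI under the names torch and transformers. In a notebook cell, the pip command is usually prefixed with `!`.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The two libraries this hands-on relies on.
required = ["torch", "transformers"]
missing = missing_packages(required)

if missing:
    # Only suggest installing what is actually absent.
    print("Missing packages; install them with: pip install " + " ".join(missing))
else:
    print("torch and transformers are already installed.")
```

If the check reports nothing missing, you can go straight to the import cell.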

AutoModelForSeq2SeqLM is the class that machine translation models fall under in the HuggingFace library.
AutoTokenizer is the class we use to load tokenizers from the HuggingFace library.
pipeline is a function that automates the workflow of running a machine learning model: it tokenizes the input, runs the model, and decodes the output.

In [1]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

model is where we define the machine translation model.
We load a pretrained model (indicated by from_pretrained), here the Helsinki-NLP opus-mt-mul-en model.

tokenizer is where we define the tokenizer. It is also loaded from the pretrained Helsinki-NLP model.

translation is where we call pipeline to automate the machine translation workflow.
The parameters define what the pipeline will do.
The first parameter, "translation", selects the translation task. The language direction (multiple languages to English) comes from the model we chose.
The second parameter, model=model, passes in the model we initialized above.
The third parameter, tokenizer=tokenizer, passes in the tokenizer we initialized above.

In [2]:
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-mul-en")
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-mul-en")

translation = pipeline("translation", model=model, tokenizer=tokenizer)

text is the text that we want to translate.

translated_text holds the translated text (Malay translated to English).
Calling translation on the text translates it to English.
The call returns a list with one dict per input, so we take the first element [0] of the list and print only the value stored under the 'translation_text' key.

In [3]:
text = "Nama saya Micheal, siapakah nama awak?"
translated_text = translation(text)[0]['translation_text']

print(translated_text)

My name is Michael. What's your name?
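
To make the indexing in the cell above concrete, the sketch below works on a hand-written stand-in for the value the translation call returns, so it runs without downloading a model. The variable sample_results and the helper extract_translations are hypothetical names introduced only for illustration; the stand-in mirrors the list-of-dicts shape that the `[0]['translation_text']` lookup relies on.

```python
def extract_translations(results):
    """Return the 'translation_text' string from each result dict."""
    return [item["translation_text"] for item in results]


# Hand-written stand-in for the return value of translation(text)
# when a single string is passed in: a list holding one dict.
sample_results = [{"translation_text": "My name is Michael. What's your name?"}]

# Same lookup as in the cell above: first list element, then the dict key.
print(sample_results[0]["translation_text"])

# The helper does the same lookup for every element in the list.
print(extract_translations(sample_results))
```

The helper is convenient if you later pass several sentences at once, since it collects the translated string from every entry rather than only the first.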


# Part 2: Choosing the proper model from HuggingFace

In this task, you are required to go to the HuggingFace website and look for a model that can translate English to Chinese.
HuggingFace website: https://huggingface.co/models?sort=downloads

If you have trouble with this task, see the slides for Day 12 - Pretrained Model for NLP & Generalized Language Model for a detailed explanation and walkthrough of this hands-on.

You will have to find an appropriate model, copy its name, and paste it into the from_pretrained calls for the model and the tokenizer. Then uncomment the code and run it.
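
As a hint for the search, the model used in Part 1 is named in the form organisation/opus-mt-source-target. The sketch below builds a name in that form from two language codes. The helper opus_mt_model_name is hypothetical, and the pattern is only inferred from the Part 1 model name; not every language pair is necessarily published, so always confirm the exact name on the HuggingFace website before pasting it into the code.

```python
def opus_mt_model_name(source_lang, target_lang):
    """Build a model name in the same form as the one used in Part 1.

    This only formats a string. It does not check that the model exists.
    """
    return f"Helsinki-NLP/opus-mt-{source_lang}-{target_lang}"


# Reproduces the name used in Part 1 (multiple languages to English).
print(opus_mt_model_name("mul", "en"))  # Helsinki-NLP/opus-mt-mul-en
```

Searching the website for the language codes you need and comparing the result against this pattern is a quick way to check that you copied the right name.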

In [4]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

# model = AutoModelForSeq2SeqLM.from_pretrained("COPY AND PASTE THE MODEL NAME HERE")
# tokenizer = AutoTokenizer.from_pretrained("COPY AND PASTE THE MODEL NAME HERE")

# translation = pipeline("translation", model=model, tokenizer=tokenizer)

# text = "Hi, how are you?"
# translated_text = translation(text)[0]['translation_text']

# print(translated_text)

# Summary
Now you know how to create your own machine translation model using the HuggingFace library.

# Contributors
Author
Pahvindran Raj