## Language Translation (JA-EN) 

Name: Sumit Kumar Sangroula

### Context

There are around 7,159 spoken languages in the world as per Ethnologue "https://www.ethnologue.com/insights/how-many-languages". Roughly half of these languages are on the verge of extinction, and one primary reason could be globalization. As technology advances, people around the world increasingly prefer global languages such as English, French, Spanish and Arabic. These languages are spoken by millions of people and used by international organizations such as the UN, EU and AU. They are widely used in their native homelands and have also become secondary languages in some countries, for cultural reasons and to ease administrative work. Interest in learning them keeps growing because they open up many opportunities.

Some of the benefits of global languages are:
* job opportunities around the world
* global communication
* cultural exchange
* travel

However, learning them isn't easy. The following are some of the challenges:
* Since the grammar and vocabulary are totally different, it takes a lot of time to become fluent
* Finding resources and tutors is hard
* Courses and tutoring can be costly
* Learners may not find opportunities to practice their language skills without the right environment or conversation partners

Due to these reasons, people in some countries tend to use their own native language.

### An Introduction to the Japanese Language

Japanese people take great pride in their history, culture and identity, and they have been excellent at preserving them. For that reason they prefer the Japanese language over any global language. The language is thought to have emerged some 1,500 years ago, although its origins are not clearly documented. Unlike English, Japanese uses 3 writing systems.
* Hiragana (ひらがな) - A syllabary used for native Japanese words and grammatical endings. The strokes are mostly curved.
* Katakana (カタカナ) - A syllabary used mostly for loanwords and foreign names. The strokes are mostly straight and angular.
* Kanji (漢字) - Logographic characters originally borrowed from Chinese, some with modifications. They are used in place of Hiragana where a word has a Kanji form. Each Kanji carries a meaning, and many originated as pictures of objects in nature.

Japanese is one of the hardest languages for English speakers to learn. As per the FSI Language Difficulty Ranking "https://www.fsi-language-courses.org/blog/fsi-language-difficulty/", it falls in Category V along with Chinese (Mandarin), Cantonese, Korean and Arabic.

#### Why Japanese Language Translation

* Global presence in financial industries
* Global automobiles and technology reach
* Strong cultural values
* Complexity (Human translation is expensive, and AI translation often misses the context and the meaning)

### Importing Necessary Libraries

In [1]:
import pandas as pd #used for data manipulation and analysis
import nltk #Natural Language Toolkit
from nltk.tokenize import word_tokenize #to tokenize English text into words
from nltk.tokenize import sent_tokenize #to split English text into sentences
import re #regular expressions, used here to split Japanese text
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM #generic loaders for pretrained models
from transformers import MarianMTModel, MarianTokenizer #pretrained model to translate Japanese text into English
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer, Seq2SeqTrainingArguments  #multilingual pretrained model and training arguments
from nltk.translate.bleu_score import corpus_bleu as cb #NLTK corpus_bleu (aliased as cb) for BLEU on tokenized text
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction  #sentence-level BLEU and smoothing to avoid zero scores on short texts
import tensorflow as tsf #deep learning framework
from pathlib import Path #to build file paths
import torch #tensor computation and GPU support
import os #to access operating-system-dependent functionality
import sacrebleu #compute BLEU score
from datasets import Dataset #to wrap the data for training
from sacrebleu import corpus_bleu #sacreBLEU corpus_bleu for BLEU on raw strings
import warnings
warnings.filterwarnings("ignore")

### Language Translation

To begin our project, we will first perform language translation on a short sample of Japanese text.

In [2]:
#Japanese text
jap_text = "私の名前はスミット・クマール・サングルーラです。カトマンズに住んでいます。家族は5人です。趣味は映画鑑賞と小説を読むことです。"
#Eng_trans
eng_text ="My name is Sumit Kumar Sangroula. I live in Kathmandu. There are five people in the family. My hobbies are watching movies and reading novels."

In [3]:
#checking the length (number of characters) of both texts
print("The length of the Japanese text is :", len(jap_text))
print("The length of the English text is :", len(eng_text))

The length of the Japanese text is : 63
The length of the English text is : 142


##### Word Tokenization

In [4]:
#Tokenizing english words
eng_tokens = word_tokenize(eng_text)

#word_tokenize cannot segment Japanese text (it has no spaces between words), so we use re.findall, which here only splits at punctuation
jap_tokens = re.findall(r'\w+|[。、・]', jap_text)

#Print
print("Japanese tokens:", jap_tokens)
print("English tokens:", eng_tokens)

Japanese tokens: ['私の名前はスミット', '・', 'クマール', '・', 'サングルーラです', '。', 'カトマンズに住んでいます', '。', '家族は5人です', '。', '趣味は映画鑑賞と小説を読むことです', '。']
English tokens: ['My', 'name', 'is', 'Sumit', 'Kumar', 'Sangroula', '.', 'I', 'live', 'in', 'Kathmandu', '.', 'There', 'are', 'five', 'people', 'in', 'the', 'family', '.', 'My', 'hobbies', 'are', 'watching', 'movies', 'and', 'reading', 'novels', '.']
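
The Japanese tokens above are whole clauses rather than words, because the pattern only breaks at punctuation. As a rough illustration of a finer split, the sketch below groups consecutive characters by writing system (Hiragana, Katakana, Kanji) using Unicode ranges. The function name `split_by_script` is hypothetical and not part of the notebook, and this is not true word segmentation; a morphological analyzer such as MeCab would normally be used for that.

```python
import re

# One alternative per script; the ranges follow the standard Unicode blocks.
SCRIPT_RUNS = re.compile(
    "[\u3041-\u309F]+"                # Hiragana
    "|[\u30A1-\u30FA\u30FC-\u30FF]+"  # Katakana, skipping the middle dot (U+30FB)
    "|[\u4E00-\u9FFF]+"               # Kanji (CJK Unified Ideographs)
    "|[0-9\uFF10-\uFF19]+"            # ASCII and full-width digits
    "|[A-Za-z]+"                      # Latin letters
    "|[。、・！？]"                     # common Japanese punctuation
)


def split_by_script(text):
    """Split text into runs of characters that share one writing system."""
    return SCRIPT_RUNS.findall(text)


if __name__ == "__main__":
    print(split_by_script("家族は5人です。"))
    print(split_by_script("スミット・クマール"))
```

For "家族は5人です。" this yields the runs 家族, は, 5, 人, です and 。. A script change is only a heuristic for a word boundary, so inflected verbs such as 住んでいます would still be split in the wrong place.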


##### Printing the no. of tokens and the counts for both texts

In [5]:
print("Japanese tokens:", jap_tokens)
print("Number of Japanese tokens:", len(jap_tokens))

Japanese tokens: ['私の名前はスミット', '・', 'クマール', '・', 'サングルーラです', '。', 'カトマンズに住んでいます', '。', '家族は5人です', '。', '趣味は映画鑑賞と小説を読むことです', '。']
Number of Japanese tokens: 12


In [6]:
print("English tokens:", eng_tokens)
print("Number of English tokens:", len(eng_tokens))

English tokens: ['My', 'name', 'is', 'Sumit', 'Kumar', 'Sangroula', '.', 'I', 'live', 'in', 'Kathmandu', '.', 'There', 'are', 'five', 'people', 'in', 'the', 'family', '.', 'My', 'hobbies', 'are', 'watching', 'movies', 'and', 'reading', 'novels', '.']
Number of English tokens: 29


### Translation using pretrained models

##### Translation using pretrained model 'Helsinki-NLP/opus-mt-ja-en'

In [7]:
#Load pretrained model and tokenizer
m1_name = "Helsinki-NLP/opus-mt-ja-en" #name of the pretrained model
tor1 = MarianTokenizer.from_pretrained(m1_name) #tokenizer for the pretrained model
m1 = MarianMTModel.from_pretrained(m1_name) #load the pretrained model weights

#Japanese input text
jap_text = "私の名前はスミット・クマール・サングルーラです。カトマンズに住んでいます。家族は5人です。趣味は映画鑑賞と小説を読むことです。"

#Tokenize and translate
inputs1 = tor1(jap_text, return_tensors="pt", padding=True, truncation=True) #tokenize the input text
trans1 = m1.generate(**inputs1) #generate the translated token IDs from the tokenized input
trns_eng1 = tor1.decode(trans1[0], skip_special_tokens=True) #decode the generated token IDs into English text

#Print
print("Original Japanese text:")
print(jap_text)
print("\nReference English text:")
print(eng_text)
print("\nTranslated English text:")
print(trns_eng1)


Original Japanese text:
私の名前はスミット・クマール・サングルーラです。カトマンズに住んでいます。家族は5人です。趣味は映画鑑賞と小説を読むことです。

Reference English text:
My name is Sumit Kumar Sangroula. I live in Kathmandu. There are five people in the family. My hobbies are watching movies and reading novels.

Translated English text:
My name is Smit Kmer Sanguura. I live in the Catmans. I have five family members. My hobby is watching movies and reading novels.


We can see that the pretrained model couldn't properly translate the names written in Katakana, which is mostly used for foreign words. The rest of the translation seems fine, as the reference text and the translated text convey the same overall meaning.

BLEU (Bilingual Evaluation Understudy) is a metric for evaluating text that has been machine-translated from one language to another. It measures how closely the machine translation matches one or more human reference translations, based on the n-grams they share.
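
To make the idea concrete, here is a minimal sketch of the two ingredients of BLEU: clipped n-gram precision and the brevity penalty. It assumes a single reference, whitespace tokenization and no smoothing, so its numbers will not match the sacreBLEU or NLTK scores used in this notebook. The function names are illustrative only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, hypothesis, n):
    """Share of hypothesis n-grams found in the reference, with clipped counts."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return overlap / total if total else 0.0


def brevity_penalty(ref_len, hyp_len):
    """Penalize hypotheses that are shorter than the reference."""
    if hyp_len == 0:
        return 0.0
    return 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)


def simple_bleu(reference, hypothesis, max_n=4):
    """Unsmoothed BLEU for one reference/hypothesis pair of token lists."""
    precisions = [modified_precision(reference, hypothesis, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, one missing n-gram order zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(reference), len(hypothesis)) * math.exp(log_mean)


if __name__ == "__main__":
    ref = "I live in Kathmandu .".split()
    hyp = "I live in the Catmans .".split()
    print(modified_precision(ref, hyp, 1))  # unigram precision
    print(simple_bleu(ref, ref))            # identical texts
```

The early return for a zero precision shows why short texts often need a smoothing function: a single missing 4-gram match would otherwise collapse the whole score to 0.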

##### BLEU Score Evaluation for 'Helsinki-NLP/opus-mt-ja-en'

In [8]:
#Referencing original and translated text
eng_ref = [[eng_text]]  #List of English text
hypotheses = [trns_eng1] #Hypotheses list

# Compute BLEU score
bleu_score1 = corpus_bleu(hypotheses, eng_ref).score

#Print
print("Original Japanese text:", jap_text)
print("Reference English text:", eng_text)
print("Model Translated text:", trns_eng1)
print(f"\nBLEU Score: {bleu_score1:.2f}")


Original Japanese text: 私の名前はスミット・クマール・サングルーラです。カトマンズに住んでいます。家族は5人です。趣味は映画鑑賞と小説を読むことです。
Reference English text: My name is Sumit Kumar Sangroula. I live in Kathmandu. There are five people in the family. My hobbies are watching movies and reading novels.
Model Translated text: My name is Smit Kmer Sanguura. I live in the Catmans. I have five family members. My hobby is watching movies and reading novels.

BLEU Score: 32.58


The score shows that the pretrained model performed fairly well on this short sample. However, a model that scores 32.58 here may not perform well on larger texts, so we will try another pretrained model to check whether it performs better.

##### Translation using pretrained model 'facebook/m2m100_1.2B'

In [9]:
#Load pretrained model and tokenizer
m2_name = "facebook/m2m100_1.2B" #defining pretrained model
tor2 = M2M100Tokenizer.from_pretrained(m2_name) #tokenizer for the pretrained model
m2 = M2M100ForConditionalGeneration.from_pretrained(m2_name) #load the pretrained model weights
 
#Set source and target language
tor2.src_lang = "ja"   #Source language
tgt_lang = "en"        #Target language

#Japanese input text
jap_text = "私の名前はスミット・クマール・サングルーラです。カトマンズに住んでいます。家族は5人です。趣味は映画鑑賞と小説を読むことです。"

#Tokenize and translate. forced_bos_token_id makes the target-language token the first generated token, so the model decodes into English.
encoded1 = tor2(jap_text, return_tensors="pt")
generated1 = m2.generate(
    **encoded1,
    forced_bos_token_id=tor2.get_lang_id(tgt_lang)
)
trns_eng2 = tor2.decode(generated1[0], skip_special_tokens=True)

#Print
print("Original Japanese text:")
print(jap_text)
print("\nReference English text:")
print(eng_text)
print("\nTranslated English text:")
print(trns_eng2)


Original Japanese text:
私の名前はスミット・クマール・サングルーラです。カトマンズに住んでいます。家族は5人です。趣味は映画鑑賞と小説を読むことです。

Reference English text:
My name is Sumit Kumar Sangroula. I live in Kathmandu. There are five people in the family. My hobbies are watching movies and reading novels.

Translated English text:
My name is Smith Kumar Singhula. I live in Kathmandu. My family is five. My hobbies are watching movies and reading novels.


Our second pretrained model handled the Katakana names somewhat better, although there is still room for improvement. Word for word, its translation is also closer to the reference than the previous model's.

##### BLEU Score Evaluation for 'facebook/m2m100_1.2B'

In [10]:
#Referencing original and translated text
eng_ref = [[eng_text]]  #List of English text
hypotheses1 = [trns_eng2] #Hypotheses list

# Compute BLEU score
bleu_score2 = corpus_bleu(hypotheses1, eng_ref).score

#Print
print("Original Japanese:", jap_text)
print("Reference English:", eng_text)
print("Model Translation:", trns_eng2)
print(f"\nBLEU Score: {bleu_score2:.2f}")


Original Japanese: 私の名前はスミット・クマール・サングルーラです。カトマンズに住んでいます。家族は5人です。趣味は映画鑑賞と小説を読むことです。
Reference English: My name is Sumit Kumar Sangroula. I live in Kathmandu. There are five people in the family. My hobbies are watching movies and reading novels.
Model Translation: My name is Smith Kumar Singhula. I live in Kathmandu. My family is five. My hobbies are watching movies and reading novels.

BLEU Score: 53.25


Compared with 'Helsinki-NLP/opus-mt-ja-en', the 'facebook/m2m100_1.2B' model performed better, with a BLEU score of 53.25, which is generally good enough for human understanding. We will therefore use this model to translate a larger Japanese text that already has an English reference translation. Then we will compute the BLEU score and attempt fine-tuning to see whether the results improve.

### Huge Text Translation

#### Importing the text file

For the huge text translation, we will use the Kyoto Free Translation Task (KFTT) data from https://www.phontron.com/kftt/download/kftt-data-1.0.tar.gz. The files are kept on a local disk in case they become corrupted or are removed from the original source.

In [11]:
file_source = 'C:/Users/Acer/Desktop/dataset for method of prediction/japanese language data/kftt-data-1.0'  #file source
base_path = Path(file_source).parent / 'kftt-data-1.0' / 'data' / 'orig' #base path of the files

In [12]:
#Naming paths for english and japanese text
path_en = os.path.join(base_path, 'kyoto-train.en') #path for English text
path_ja = os.path.join(base_path, 'kyoto-train.ja') #path for Japanese text

#Reading the data using open and readlines method
with open(path_en, encoding='utf-8') as f_en, open(path_ja, encoding='utf-8') as f_ja:
    ja_sents = f_ja.readlines() #read Japanese text
    en_sents = f_en.readlines() #read English text
    
#Stripping the sentences in the text
ja_sents = [line.strip() for line in ja_sents] #strips the Japanese sentences
en_sents = [line.strip() for line in en_sents] #strips the English sentences


#Printing the sentences
for i in range(3): #to print first 3 sentences 
    print(f"JA-ver: {ja_sents[i]}") #prints Japanese sentences
    print(f"EN-ver: {en_sents[i]}\n") #prints English sentences

JA-ver: 雪舟（せっしゅう、1420年（応永27年）-1506年（永正3年））は号で、15世紀後半室町時代に活躍した水墨画家・禅僧で、画聖とも称えられる。
EN-ver: Known as Sesshu (1420 - 1506), he was an ink painter and Zen monk active in the Muromachi period in the latter half of the 15th century, and was called a master painter.

JA-ver: 日本の水墨画を一変させた。
EN-ver: He revolutionized the Japanese ink painting.

JA-ver: 諱は「等楊（とうよう）」、もしくは「拙宗（せっしゅう）」と号した。
EN-ver: He was given the posthumous name "Toyo" or "Sesshu (拙宗)."



The output above is based upon the source Japanese text and the already translated English text from the source file.
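
Because the two files are aligned line by line, it can help to check that they have the same number of lines before pairing them. The sketch below is one way to do that; `read_parallel` is a hypothetical helper, not part of the notebook, and it assumes UTF-8 files with one sentence per line.

```python
def read_lines(path):
    """Read a UTF-8 text file and return its lines with whitespace stripped."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]


def read_parallel(src_path, tgt_path):
    """Return (source, target) pairs from two line-aligned files."""
    src_lines = read_lines(src_path)
    tgt_lines = read_lines(tgt_path)
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"line counts differ: {len(src_lines)} vs {len(tgt_lines)}"
        )
    # Drop pairs where either side is empty; they carry no translation signal.
    return [(s, t) for s, t in zip(src_lines, tgt_lines) if s and t]
```

With the paths defined above, `read_parallel(path_ja, path_en)` would return a list of (Japanese, English) tuples, or raise a `ValueError` if the files have drifted out of alignment.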

##### Counting the no. of sentences in both the files

In [13]:
#Split on sentence-ending punctuation (。！？) to count Japanese sentences, since NLTK's sentence tokenizer does not handle Japanese text
ja_sents_count = sum(len(re.findall(r'[^。！？]+[。！？]', line)) for line in ja_sents)
print(f"Total number of Japanese sentences: {ja_sents_count}")

#Use sentence tokenizer to count the English sentences
en_sents_count = sum(len(sent_tokenize(line)) for line in en_sents)
print(f"Total number of English sentences: {en_sents_count}")

Total number of Japanese sentences: 362720
Total number of English sentences: 451747
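
The two counts are not directly comparable, partly because of how the Japanese pattern works: it only counts a segment when it ends in 。, ！ or ？. The small sketch below reuses the same regular expression on three lines taken from this notebook to show that behavior; the helper name `count_ja_sentences` is illustrative.

```python
import re

# Same pattern as in the cell above: one or more non-terminator characters
# followed by a sentence-ending mark.
JA_SENTENCE = re.compile(r"[^。！？]+[。！？]")


def count_ja_sentences(line):
    """Count punctuation-terminated sentences in one line of Japanese text."""
    return len(JA_SENTENCE.findall(line))


if __name__ == "__main__":
    print(count_ja_sentences("日本の水墨画を一変させた。"))              # 1
    print(count_ja_sentences("カトマンズに住んでいます。家族は5人です。"))  # 2
    print(count_ja_sentences("その後遣明使に随行して中国（明）に渡って中国の"))  # 0
```

Lines with no closing punctuation, such as titles or fragments, count as zero. This is one likely reason the Japanese total is lower than the English one.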


### Translation of the source text using pretrained model 'facebook/m2m100_1.2B'

In [14]:
#Trying to use GPU for faster processing
pr_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

#Loading Model and Tokenizer
m3_name = "facebook/m2m100_1.2B"
tor3 = M2M100Tokenizer.from_pretrained(m3_name)
m3 = M2M100ForConditionalGeneration.from_pretrained(m3_name).to(pr_device) #choosing model from the pretrained model

#Source language is Japanese and the target language is English
tor3.src_lang = "ja"
target_lang = "en"

#File source
file_source = 'C:/Users/Acer/Desktop/dataset for method of prediction/japanese language data/kftt-data-1.0'
base_path = Path(file_source).parent / 'kftt-data-1.0' / 'data' / 'orig'
path_en = os.path.join(base_path, 'kyoto-train.en')
path_ja = os.path.join(base_path, 'kyoto-train.ja')

#Reading the file
with open(path_en, encoding='utf-8') as f_en, open(path_ja, encoding='utf-8') as f_ja:
    en_sentences = [line.strip() for line in f_en.readlines()]
    ja_sentences = [line.strip() for line in f_ja.readlines()]

#Translate the first 10 sentences from Japanese to English to see how the model handles more sentences
for i in range(10):
    ja_text = ja_sentences[i]
    inputs = tor3(ja_text, return_tensors="pt").to(pr_device)
    translated_tokens = m3.generate(
        **inputs,
        forced_bos_token_id=tor3.get_lang_id(target_lang)
    )
    translated_text = tor3.decode(translated_tokens[0], skip_special_tokens=True) #Predicted English text

    #Print
    print(f"\nJA-ver: {ja_text}") #Prints Japanese version
    print(f"Ref EN-ver: {en_sentences[i]}") #Prints Reference English version
    print(f"Predicted EN-ver : {translated_text}") #Prints Predicted English version



JA-ver: 雪舟（せっしゅう、1420年（応永27年）-1506年（永正3年））は号で、15世紀後半室町時代に活躍した水墨画家・禅僧で、画聖とも称えられる。
Ref EN-ver: Known as Sesshu (1420 - 1506), he was an ink painter and Zen monk active in the Muromachi period in the latter half of the 15th century, and was called a master painter.
Predicted EN-ver : Snowboat (1420 (応永27年)-1506 (永正3年) is a painting monk and painting artist who worked in the late 15th century during the late 15th century.

JA-ver: 日本の水墨画を一変させた。
Ref EN-ver: He revolutionized the Japanese ink painting.
Predicted EN-ver : It changed the Japanese painting.

JA-ver: 諱は「等楊（とうよう）」、もしくは「拙宗（せっしゅう）」と号した。
Ref EN-ver: He was given the posthumous name "Toyo" or "Sesshu (拙宗)."
Predicted EN-ver : He said, “See, it’s a shame, it’s a shame, it’s a shame.”

JA-ver: 備中国に生まれ、京都・相国寺に入ってから周防国に移る。
Ref EN-ver: Born in Bicchu Province, he moved to Suo Province after entering SShokoku-ji Temple in Kyoto.
Predicted EN-ver : Born in China, he moved to Kyoto after entering the temple.

JA-ver: その後遣明使に随行して中国（明）に渡って中国の

##### BLEU Score Evaluation

In [15]:
# Prepare reference and hypothesis lists
sent_references = [[nltk.word_tokenize(en_sentences[i])] for i in range(10)] #reference list holding the first 10 reference English sentences, tokenized
hypotheses3 = [] #creating an empty list for hypothesis

for i in range(10): #to compare the first 10 sentences
    ja_text = ja_sentences[i]
    inputs = tor3(ja_text, return_tensors="pt").to(pr_device)
    translated_tokens = m3.generate(
        **inputs,
        forced_bos_token_id=tor3.get_lang_id(target_lang)
    )
    translated_text = tor3.decode(translated_tokens[0], skip_special_tokens=True)
    hypotheses3.append(nltk.word_tokenize(translated_text))

#Calculate BLEU score
smooth = SmoothingFunction().method4
bleu_score3 = cb(sent_references, hypotheses3, smoothing_function=smooth)
print(f"\nThe BLEU score for 10 sentences: {bleu_score3 * 100:.2f}")



The BLEU score for 10 sentences: 10.47


A BLEU score of 10.47 was achieved when translating only 10 sentences, and the score dropped to 8.03 for 30 sentences, which suggests our model did not do well enough. Since displaying 30 translated sentences takes up a lot of space and the translation takes longer, the translated version of the 30 sentences has been uploaded at https://github.com/9sumit9/Big-Data-Analytics/blob/main/JA2EN for reference.
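
Part of the translation time comes from sending one sentence at a time to the model. A common way to reduce it is to group sentences into batches first. The sketch below shows only the grouping step; `batched` is a hypothetical helper and the batch size is an assumption to tune against available memory.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


if __name__ == "__main__":
    sentences = ["s1", "s2", "s3", "s4", "s5"]
    for batch in batched(sentences, 2):
        print(batch)
```

Each batch could then be passed to the tokenizer as a list with `padding=True`, and the generated IDs decoded together with the tokenizer's `batch_decode` method. Whether this is actually faster depends on the hardware and on how much the sentence lengths vary within a batch.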

### Fine Tuning the Model

##### Preprocessing the data

In [16]:
from datasets import Dataset
from transformers import M2M100Tokenizer

#Loading Japanese and English sentences
data_pairs = [{'text_translation': {'ja': ja, 'en': en}} for ja, en in zip(ja_sentences, en_sentences)]  #Create a list of dictionaries pairing each Japanese sentence with its English translation
 
dataset = Dataset.from_list(data_pairs)
tor4 = M2M100Tokenizer.from_pretrained("facebook/m2m100_1.2B")

##### Tokenizing the whole textfile

In [17]:
#Source and target language
src_lang = "ja"
tgt_lang = "en"

tor4.src_lang = src_lang
tor4.tgt_lang = tgt_lang

def tokenize(batch): #Creating tokenize function
    src_texts = [item[src_lang] for item in batch["text_translation"]]
    tgt_texts = [item[tgt_lang] for item in batch["text_translation"]]

    inputs = tor4(src_texts, truncation=True, padding="max_length", max_length=128)
    
    with tor4.as_target_tokenizer():
        labels = tor4(tgt_texts, truncation=True, padding="max_length", max_length=128)

    inputs["labels"] = labels["input_ids"]
    return inputs
toked_dataset = dataset.map(tokenize, batched=True) #mapping the data for both languages

Map:   0%|          | 0/440288 [00:00<?, ? examples/s]

Mapping takes a long time since the dataset contains 440288 sentence pairs (one per line), which the earlier punctuation-based count put at 362720 Japanese sentences and 451747 English sentences.
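
One detail to watch when labels are padded to a fixed length: the padding positions are usually replaced with -100, the value that Hugging Face sequence-to-sequence models ignore when computing the loss. Otherwise the model is also trained to predict padding. The sketch below shows that replacement on plain Python lists; `mask_padding` is a hypothetical helper, and the pad ID of 1 in the example is made up (in the notebook it would come from `tor4.pad_token_id`).

```python
IGNORE_INDEX = -100  # label value skipped by the loss function


def mask_padding(label_ids, pad_token_id):
    """Replace padding token IDs in each label sequence with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token == pad_token_id else token for token in sequence]
        for sequence in label_ids
    ]


if __name__ == "__main__":
    # Hypothetical token IDs; 1 stands in for the tokenizer's pad token ID.
    labels = [[128022, 5, 6, 2, 1, 1], [128022, 7, 2, 1, 1, 1]]
    print(mask_padding(labels, pad_token_id=1))
```

In the `tokenize` function above, this would be applied to `labels["input_ids"]` before assigning it to `inputs["labels"]`.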

##### Model Loading and BLEU score evaluation

In [18]:
#Using SmoothingFunction
smooth1 = SmoothingFunction().method4

preds = []
refs = []

#Translating the first 100 sentences for evaluation
for i in range(100):
    inputs4 = tor4(ja_sentences[i], return_tensors="pt").to(pr_device)
    translated_tokens2 = m3.generate(
        **inputs4,
        forced_bos_token_id=tor4.get_lang_id(target_lang)
    )
    pred = tor4.decode(translated_tokens2[0], skip_special_tokens=True)

    preds.append(pred.split())
    refs.append([en_sentences[i].split()])  #Referring to the list of English sentences

#BLEU score evaluation (average of sentence-level BLEU, so not directly comparable with the corpus-level scores above)
b_score = [sentence_bleu(refs[i], preds[i], smoothing_function=smooth1) for i in range(len(preds))]
avg_bleu = sum(b_score) / len(b_score)
print(f"\nThe BLEU score for {len(preds)} sentences is : {avg_bleu * 100:.2f}")



The BLEU score for 100 sentences is : 8.50


##### Retraining the model with Seq2SeqTrainer

In [20]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

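#Setting up the Seq2SeqTrainer
#Note: training_args and compute_metrics must be defined before this cell runs; their definitions are not shown here
trainer = Seq2SeqTrainer(
    model=m3, #the M2M100 model loaded above
    args=training_args,
    train_dataset=toked_dataset.shuffle(seed=42).select(range(10000)),  #10000 shuffled sentence pairs for training
    eval_dataset=toked_dataset.select(range(500)),  #first 500 sentence pairs for evaluation
    tokenizer=tor4, #the M2M100 tokenizer loaded above
    compute_metrics=compute_metrics,
)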

trainer.train()


ValueError: Your currently installed version of Keras is Keras 3, but this is not yet supported in Transformers. Please install the backwards-compatible tf-keras package with `pip install tf-keras`.

##### BLEU score evaluation

In [None]:
from sacrebleu import corpus_bleu

# Translate test sentences
def translate_sentences(m3, tor4, input_texts, src_lang="ja", tgt_lang="en"):
    tor4.src_lang = src_lang
    translated_texts = []
    for text in input_texts:
        encoded = tor4(text, return_tensors="pt", padding=True, truncation=True).to(m3.device)
        generated = m3.generate(**encoded, forced_bos_token_id=tor4.get_lang_id(tgt_lang))
        translated = tor4.decode(generated[0], skip_special_tokens=True)
        translated_texts.append(translated)
    return translated_texts

predictions = translate_sentences(m3, tor4, ja_sentences[:100])
bleu = corpus_bleu(predictions, [en_sentences[:100]])
print(f"BLEU Score: {bleu.score:.2f}")


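No output is recorded for this cell. Since the training step above stopped with a ValueError, any score it produces reflects the pretrained model rather than a fine-tuned one. Either way, we still can't deploy the model for translation.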

### Conclusion
Language translation seems like an appealing project, but it comes with many challenges. Doing it as an individual is even harder given the amount of text involved and the time spent training the models and evaluating the output. A model that works for one language may not work for another.

Although other large pretrained models such as 'seamless-m4t-large', 'facebook/nllb-200-distilled-600M' and 'facebook/nllb-200-1.3B' are available for translating text, translation with them still has a few drawbacks.

Some of the drawbacks for language translation are:
* time consumption, since training the model takes a lot of time
* have to download huge packages which may not work with every evaluation model
* availability of GPU and CPU for processing, which is needed in large amounts
* patience
* no guarantee of a perfect BLEU score

Building our own sequence-to-sequence model for a small text could yield a near-perfect BLEU score, since the model would be evaluated against the same reference translations it was trained on. In a real-world scenario, however, a perfect 100% BLEU score is not attainable. For a huge text, manually providing reference translations is tiresome and time-consuming: it requires hiring language experts and having powerful computing devices, since training takes a lot of time and requires extra CPU and GPU resources. So for translating huge texts, it is advisable to use a pretrained model for faster training and evaluation. Normally, a BLEU score above 50 is considered a good translation for human understanding.

In conclusion, to get better translation results, using pretrained models developed by large companies like Google, Meta and Microsoft at an organizational level would do justice to translation projects, but only if the above drawbacks are addressed.