# Enhancing Machine Translation of News: Japanese to English Translation

## Experimental part: single-sentence examples (runnable)

As explained in the report, we could not apply post-processing techniques globally because the errors are sentence-dependent: with identical settings, an error may occur in one sentence and not in another. Here are some examples.

In [1]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
import evaluate
from datasets import load_dataset, Dataset, load_from_disk
from tqdm import tqdm
import pandas as pd

cache_dir = 'D:\\.cache'                               # Change this to your own cache directory (set here because the main disk lacked space)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def translate_text_print(model, tokenizer, input_text):

    # Segment the input text into subword tokens (for display only)
    tokens = tokenizer.tokenize(input_text)
    print("Tokens:", tokens)


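    # Tokenize the input text and convert it to tensors (IDs).
    # With is_split_into_words=False, the raw string is segmented into subwords and encoded.
    # With is_split_into_words=True, the input must already be a list of words
    # (each word is still split into subwords before encoding).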
    inputs = tokenizer(input_text, return_tensors="pt", padding="max_length", truncation=True, max_length=50, is_split_into_words=False).to(device)
    print("Input ids:", inputs['input_ids'])
    print("Input attention mask:", inputs['attention_mask'])
    print("Input with tokens:", tokenizer.decode(inputs['input_ids'][0], skip_special_tokens=False))


    # Generate translation using model
    translated = model.generate(**inputs)
    print("Generated ids:",  translated)


    # Decode the output tokens to text
    decoded_output = tokenizer.decode(translated[0], skip_special_tokens=True)
    print("Decoded sentence:", decoded_output)
    return decoded_output

def calculate_bleu_score(predictions, references, max_order=4):
    bleu = evaluate.load("bleu", cache_dir=cache_dir)
    return bleu.compute(predictions=predictions, references=references, max_order=max_order)

def calculate_chrf_score(predictions, references):
    chrf = evaluate.load("chrf", cache_dir=cache_dir)
    return chrf.compute(predictions=predictions, references=references)

def calculate_sacrebleu_score(predictions, references):
    sacrebleu = evaluate.load("sacrebleu", cache_dir=cache_dir)
    return sacrebleu.compute(predictions=predictions, references=references)

tokenizer = AutoTokenizer.from_pretrained('Helsinki-NLP/opus-mt-ja-en', cache_dir=cache_dir, use_fast=True)
model = AutoModelForSeq2SeqLM.from_pretrained('Helsinki-NLP/opus-mt-ja-en', cache_dir=cache_dir).to(device)


You can test here: change the Japanese sentence to get a translation, or change both sentences to also compute the scores.

In [2]:
z = {'en': 'I am a student', 
     'jp': '私は学生です'}
translated = translate_text_print(model, tokenizer, z['jp'])

Tokens: ['▁私', 'は', '学生', 'です']
Input ids: tensor([[  115,    18,  7323,    74,     0, 60715, 60715, 60715, 60715, 60715,
         60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715,
         60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715,
         60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715,
         60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715]],
       device='cuda:0')
Input attention mask: tensor([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]], device='cuda:0')
Input with tokens: 私は学生です</s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
Generated ids: ten

### Example 1: Dates

Here, we could not find a general rule that maps the date in the source, "2018年7月14日から8月19日", onto the reference format "2018: 7/14-8/19", even though the meaning is preserved. Because the date is structured differently, the BLEU score is 0: the output shares no higher-order n-grams (n > 3) with the reference.

Decoded sentence: It's August, by the time I see it! The duration of the event is from July 14th, 2018, to August 19.
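To make the zero score concrete, here is a minimal sketch of the modified n-gram precisions that BLEU combines. It assumes plain whitespace tokenization, which differs from the tokenizer used by the `evaluate` BLEU metric, and the helper name `ngram_precisions` is ours. Unsmoothed BLEU takes a geometric mean of these precisions, so a single zero precision forces the whole score to 0.

```python
from collections import Counter

def ngram_precisions(hypothesis, reference, max_order=4):
    """Modified n-gram precision for n = 1..max_order on whitespace tokens."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_order + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Counter intersection clips each n-gram count to its count in the reference
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        total = sum(hyp_ngrams.values())
        precisions.append(overlap / total if total else 0.0)
    return precisions

hypothesis = ("It's August, by the time I see it! The duration of the event "
              "is from July 14th, 2018, to August 19.")
reference = "The best season is August!Sunflower Festival in 2018: 7/14-8/19."

precisions = ngram_precisions(hypothesis, reference)
# A geometric mean with any zero factor is zero, so BLEU is 0 here
bleu_is_zero = any(p == 0.0 for p in precisions)
print(precisions, bleu_is_zero)
```

Under this simplified tokenization only a couple of unigrams match, and no longer n-gram does, which is consistent with the score reported above.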

In [3]:
z = {'en': 'The best season is August!Sunflower Festival in 2018: 7/14-8/19.',
 'jp': '見頃は8月がオススメ!開催期間は2018年7月14日から8月19日まで。'} # 206
translated = translate_text_print(model, tokenizer, z['jp'])  

Tokens: ['▁見', '頃', 'は', '8', '月', 'が', 'オス', 'ス', 'メ', '!', '開', '催', '期間', 'は', '20', '18', '年', '7', '月', '14', '日', 'から', '8', '月', '19', '日', 'まで', '。']
Input ids: tensor([[  390,  3072,    18,  1773,  1215,    34, 14722,   563,  4480,    40,
          5613,  7601,  9668,    18,  2785,  7131,   494,  1784,  1215,  6994,
           583,   135,  1773,  1215,  7240,   583,   417,  5832,     0, 60715,
         60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715,
         60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715]],
       device='cuda:0')
Input attention mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]], device='cuda:0')
Input with tokens: 見頃は8月がオススメ!開催期間は2018年7月14日から8月19日まで。</s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
Genera

### Example 2: Repetitive n-grams

Here, the translation degenerates into a very long chain of repeated "etc.," tokens.


Decoded sentence: I'm going to talk to you today about tourism, travel, vacations, entertainment, visits to friends and relatives, rest, therapy, reunions, social and service activities, etc., etc., etc., etc., etc., etc., etc., etc., etc., etc., etc., etc., etc., etc., etc., etc., etc., etc., etc., etc., etc., etc., etc., etc., etc., etc., etc.
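For this particular sentence, one possible post-processing step is to collapse runs of the same token. The sketch below is an illustration under our own assumptions, not the method used in the report: the function name `collapse_repeated_tokens` is hypothetical, tokens are split on whitespace, and two tokens count as equal when they match after stripping trailing punctuation.

```python
def collapse_repeated_tokens(text: str) -> str:
    """Collapse consecutive tokens that are equal up to trailing punctuation."""
    collapsed = []
    for token in text.split():
        key = token.rstrip(".,;")
        if collapsed and key and collapsed[-1].rstrip(".,;") == key:
            # Same word as the previous token: keep only the latest form,
            # so the punctuation of the last repetition is preserved
            collapsed[-1] = token
            continue
        collapsed.append(token)
    return " ".join(collapsed)

sample = "social and service activities, etc., etc., etc., etc., etc."
cleaned = collapse_repeated_tokens(sample)
print(cleaned)
```

Such a rule also collapses legitimate repetitions (for example "that that"), which illustrates why this kind of fix is hard to apply to every sentence.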

In [37]:
z = {'en': 'Pleasure/Tourism- The purpose of your planned travel is recreational in nature, including tourism, vacation (holiday), amusement, visits with friends or relatives, rest, medical treatment, activities of a fraternal, social, or service nature, and participation by amateurs, who will receive no remuneration, in musical, sports and similar events or contests.',
 'jp': '観光/旅行-旅行、休暇、娯楽、友人や親族の訪問、休養、治療、同窓会や社交、奉仕活動など、及び報酬を伴わない音楽やスポーツなどイベント或いはコンテストのアマチュア参加ウ.'} # 232
translated = translate_text_print(model, tokenizer, z['jp'])  

Tokens: ['▁観光', '/', '旅行', '-', '旅行', '、', '休暇', '、', '娯', '楽', '、', '友人', 'や', '親族', 'の', '訪問', '、', '休', '養', '、', '治療', '、', '同', '窓', '会', 'や', '社', '交', '、', '奉仕', '活動', 'など', '、', '及び', '報酬', 'を', '伴', 'わ', 'ない', '音楽', 'や', 'スポーツ', 'など', 'イベント', '或', 'い', 'は', 'コンテスト', 'の', 'アマ', 'チュ', 'ア', '参加', 'ウ', '.']
Input ids: tensor([[10926,   783,  8029,   146,  8029, 15168, 18751, 15168, 47596,  6129,
         15168,  3767,   261, 41629,    13, 15487, 15168, 10958, 11009, 15168,
          7077, 15168,  3626,  9452,  1486,   261,  4558,  5968, 15168, 19783,
          5253,  1699, 15168, 22963, 10998,    22, 21294,   251,    72,  5658,
           261, 16912,  1699, 13442, 16914,   125,    18, 41745,    13,     0]],
       device='cuda:0')
Input attention mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1]], device='cuda:0')
Input with tokens: 観光/旅行-旅行、休暇、娯楽、友人
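The printed output above also suggests a second issue: the token list has 55 entries, while the input ids tensor holds 50 ids, ends with id 0 (`</s>`), and has no padding. With `truncation=True, max_length=50`, the end of the sentence therefore appears to have been cut before translation. The sketch below lists the tokens that would be dropped, assuming the tokenizer appends exactly one special token; the helper name is hypothetical.

```python
def tokens_dropped_by_truncation(tokens, max_length=50, num_special_tokens=1):
    """Return the subword tokens that do not fit once special tokens are counted."""
    keep = max_length - num_special_tokens
    return tokens[keep:]

# Token list copied from the printed output of this example
tokens = ['▁観光', '/', '旅行', '-', '旅行', '、', '休暇', '、', '娯', '楽', '、',
          '友人', 'や', '親族', 'の', '訪問', '、', '休', '養', '、', '治療', '、',
          '同', '窓', '会', 'や', '社', '交', '、', '奉仕', '活動', 'など', '、',
          '及び', '報酬', 'を', '伴', 'わ', 'ない', '音楽', 'や', 'スポーツ', 'など',
          'イベント', '或', 'い', 'は', 'コンテスト', 'の', 'アマ', 'チュ', 'ア',
          '参加', 'ウ', '.']

dropped = tokens_dropped_by_truncation(tokens)
print(len(tokens), dropped)
```

Raising `max_length` for long sentences would be one way to check whether the truncation contributes to the degenerate output.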

### Example 3: "、" punctuation

Here, the model fails to translate the sentence and produces an unrelated one.

Decoded sentence: It's one of the most remarkable spots in the history of the world.

In [43]:
z = {'en': "Celebrities quietly frequent the museum as well. It's also gaining attention from being part of one of the tours on the Seven Stars Cruise Train.",
 'jp': '著名人がお忍びで来場することも多く、また、豪華寝台列車「ななつ星」の一部のツアーにも組み込まれるなど、いま改めて注目されているスポットです。'} # 236
translated = translate_text_print(model, tokenizer, z['jp'])

Tokens: ['▁著名', '人が', 'お', '忍び', 'で', '来', '場', 'すること', 'も', '多く', '、', 'また', '、', '豪華', '寝', '台', '列車', '「', 'な', 'な', 'つ', '星', '」', 'の一部', 'の', 'ツアー', 'にも', '組み込', 'まれる', 'など', '、', 'いま', '改め', 'て', '注目', 'されている', 'スポット', 'です', '。']
Input ids: tensor([[12251,  1443,   816, 40990,    53,   808,  2312,  1987,   109,  1446,
         15168,   798, 15168, 52499,  4683,  5067, 12926, 18155,    93,    93,
           556,  3352, 20056,  5456,    13, 30778,   698, 29972, 12939,  1699,
         15168,   693, 37550,    81, 11677,  2601, 36224,    74,  5832,     0,
         60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715]],
       device='cuda:0')
Input attention mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]], device='cuda:0')
Input with tokens: 著名人がお忍びで来場することも多く、また、豪華寝台列車「ななつ星」の一部のツアーにも組み込まれるなど、いま改めて注目されているスポットです。</s> <pad> <pad> <pad> <p

But if we replace the last "、" with a space:

Decoded sentence: It's a new feature to be noticed that well-known people often sneak in, as well as to be included on a tour of some of the "The Star of the Night."

The part of the sentence about the named entity "ななつ星" (Seven Stars) now appears in the output, although it is mistranslated as "The Star of the Night". The BLEU score is still 0.
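The manual edit in this example can be turned into a small search: generate one variant of the source per "、", each with that single comma replaced by a space, then translate every variant and compare the outputs. This is a minimal sketch, and the helper name `comma_variants` is ours.

```python
def comma_variants(sentence, comma="、", replacement=" "):
    """Return one variant per comma, each with that single comma replaced."""
    variants = []
    for i, ch in enumerate(sentence):
        if ch == comma:
            variants.append(sentence[:i] + replacement + sentence[i + 1:])
    return variants

source = ('著名人がお忍びで来場することも多く、また、豪華寝台列車「ななつ星」の'
          '一部のツアーにも組み込まれるなど、いま改めて注目されているスポットです。')

variants = comma_variants(source)
for variant in variants:
    print(variant)
```

The last variant corresponds to the input of the cell below; each variant could be passed to `translate_text_print` in turn.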

In [44]:
z = {'en': "Celebrities quietly frequent the museum as well. It's also gaining attention from being part of one of the tours on the Seven Stars Cruise Train.",
 'jp': '著名人がお忍びで来場することも多く、また、豪華寝台列車「ななつ星」の一部のツアーにも組み込まれるなど いま改めて注目されているスポットです。'} # 236
translated = translate_text_print(model, tokenizer, z['jp'])

Tokens: ['▁著名', '人が', 'お', '忍び', 'で', '来', '場', 'すること', 'も', '多く', '、', 'また', '、', '豪華', '寝', '台', '列車', '「', 'な', 'な', 'つ', '星', '」', 'の一部', 'の', 'ツアー', 'にも', '組み込', 'まれる', 'など', '▁いま', '改め', 'て', '注目', 'されている', 'スポット', 'です', '。']
Input ids: tensor([[12251,  1443,   816, 40990,    53,   808,  2312,  1987,   109,  1446,
         15168,   798, 15168, 52499,  4683,  5067, 12926, 18155,    93,    93,
           556,  3352, 20056,  5456,    13, 30778,   698, 29972, 12939,  1699,
          4726, 37550,    81, 11677,  2601, 36224,    74,  5832,     0, 60715,
         60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715]],
       device='cuda:0')
Input attention mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]], device='cuda:0')
Input with tokens: 著名人がお忍びで来場することも多く、また、豪華寝台列車「ななつ星」の一部のツアーにも組み込まれるなど いま改めて注目されているスポットです。</s> <pad> <pad> <pad> <pad> 

### Example 4: Special characters

Here, the model fails to translate the "€" symbol, and the amount is corrupted as well (19850 becomes 85000).

Decoded sentence: You have it in manual transmissions and trim styles... and it's up to 85000 points automatically.

In [49]:
z = {'en': 'You have it in Style trim with manual transmission ... and Also automatically by 19,850 €.',
 'jp': 'あなたはマニュアルトランスミッションとトリムスタイルでそれを持っている...とまた、自動的に19850€によります。'} # 385
translated = translate_text_print(model, tokenizer, z['jp'])

Tokens: ['▁あなた', 'は', 'マニュアル', 'トランス', 'ミッション', 'と', 'トリ', 'ム', 'スタイル', 'で', 'それ', 'を', '持', 'っている', '...', 'と', 'また', '、', '自動', '的に', '1985', '0', '€', 'に', 'より', 'ます', '。']
Input ids: tensor([[  166,    18, 39322, 44013, 19252,    42,  9640,  1172, 13056,    53,
           441,    22,   762,   479,    70,    42,   798, 15168,  7295,  1212,
         55261,  1852, 56070,    29,   582,    95,  5832,     0, 60715, 60715,
         60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715,
         60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715]],
       device='cuda:0')
Input attention mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]], device='cuda:0')
Input with tokens: あなたはマニュアルトランスミッションとトリムスタイルでそれを持っている...とまた、自動的に19850€によります。</s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <p

But if we replace "€" with "ユーロ" in the source sentence, we get:

Decoded sentence: You have it in manual transmission and trim style... and it is automatically made up of 19850 euros.
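This substitution can be written as a small pre-processing step applied to the source before translation. The sketch below is illustrative only: the mapping contains just the one pair tested in this example, and the names `SYMBOL_TO_JAPANESE` and `spell_out_symbols` are hypothetical.

```python
# Only the pair tested in this example; other symbols would need their own checks
SYMBOL_TO_JAPANESE = {"€": "ユーロ"}

def spell_out_symbols(text, mapping=SYMBOL_TO_JAPANESE):
    """Replace each symbol in the mapping with its Japanese word."""
    for symbol, word in mapping.items():
        text = text.replace(symbol, word)
    return text

source = "自動的に19850€によります。"
normalized = spell_out_symbols(source)
print(normalized)
```

Whether such a replacement helps for other symbols or other sentences would have to be verified case by case, in line with the sentence-dependent behaviour described at the start of this section.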

In [51]:
z = {'en': 'You have it in Style trim with manual transmission ... and Also automatically by 19,850 €.',
 'jp': 'あなたはマニュアルトランスミッションとトリムスタイルでそれを持っている...とまた、自動的に19850ユーロによります。'} # 385
translated = translate_text_print(model, tokenizer, z['jp'])

Tokens: ['▁あなた', 'は', 'マニュアル', 'トランス', 'ミッション', 'と', 'トリ', 'ム', 'スタイル', 'で', 'それ', 'を', '持', 'っている', '...', 'と', 'また', '、', '自動', '的に', '▁1985', '0', 'ユーロ', '▁に', 'より', 'ます', '。']
Input ids: tensor([[  166,    18, 39322, 44013, 19252,    42,  9640,  1172, 13056,    53,
           441,    22,   762,   479,    70,    42,   798, 15168,  7295,  1212,
          9081,  1852, 37306,     7,   582,    95,  5832,     0, 60715, 60715,
         60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715,
         60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715, 60715]],
       device='cuda:0')
Input attention mask: tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]], device='cuda:0')
Input with tokens: あなたはマニュアルトランスミッションとトリムスタイルでそれを持っている...とまた、自動的に 19850ユーロ によります。</s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 