# Description

In this notebook, I evaluate the trained NMT (neural machine translation) model on the test set using two metrics:
- BLEU: a widely used metric that measures n-gram overlap between the predicted sentence and the reference sentence.
- TER (Translation Error Rate): the number of edits required to transform the predicted sentence into the reference sentence.
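
To make "n-gram overlap" concrete, here is a minimal sketch of an unsmoothed, single-reference sentence BLEU in plain Python. The names `ngrams`, `modified_precision`, and `simple_bleu` are hypothetical and used only for illustration. The `calculate_bleu_score` helper from `utils.evaluation_utils` is not shown in this notebook and may differ (for example, it may apply one of NLTK's smoothing functions), so scores from this sketch need not match the numbers reported below.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(reference, hypothesis, n):
    """Clipped n-gram precision: each hypothesis n-gram is credited at most
    as many times as it occurs in the reference."""
    hyp_counts = ngrams(hypothesis, n)
    ref_counts = ngrams(reference, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / total


def simple_bleu(reference, hypothesis, max_n=4):
    """Unsmoothed sentence BLEU on whitespace tokens (illustrative only)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    precisions = [modified_precision(ref, hyp, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        # Without smoothing, one empty n-gram order zeroes the whole score.
        return 0.0
    # Brevity penalty: penalize hypotheses that are not longer than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    # Geometric mean of the 1..max_n precisions.
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

Because the score is zero whenever any n-gram order has no match, short sentences often score 0 without smoothing, which is one reason sentence-level BLEU implementations usually offer a smoothing option.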

In [40]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' 
import tensorflow as tf
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Allocate GPU memory on demand instead of reserving it all up front.
    try:
        tf.config.experimental.set_memory_growth(gpus[0], True)
    except RuntimeError as e:
        print(e)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import regex as re
from sklearn.model_selection import train_test_split
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset
import tensorflow_text as tf_text

from utils.read_file_utils import *
from utils.model_utils import *
from utils.evaluation_utils import *

In [6]:
PATH_FILE_TEST_EN = r"data/processed_data/en_sent_test.txt"
PATH_FILE_TEST_VI = r"data/processed_data/vi_sent_test.txt"

PATH_MODEL_TRANSLATOR = "translator"

translator = tf.saved_model.load(PATH_MODEL_TRANSLATOR)

# 1. BLEU

In [34]:
N_samples = 10_000

list_en_sentence = read_text_file(PATH_FILE_TEST_EN)
list_vi_sentence = read_text_file(PATH_FILE_TEST_VI)

if N_samples is not None:
    list_en_sentence = list_en_sentence[:N_samples]
    list_vi_sentence = list_vi_sentence[:N_samples]

assert len(list_en_sentence) == len(list_vi_sentence)
print(f"Number of sentences: {len(list_en_sentence)}")

Number of sentences: 10000


In [35]:
idx = np.random.randint(0, len(list_en_sentence))

en_sentence = list_en_sentence[idx]
vi_sentence = list_vi_sentence[idx]

print(f"English: {en_sentence}")
print(f"Vietnamese: {vi_sentence}")

English: - hush up , lottie .
Vietnamese: - yên nào , lottie .


## 1.1. Test BLEU on single sample

In [37]:
idx = np.random.randint(0, len(list_en_sentence))

en_sentence = list_en_sentence[idx]
en_sentence = en_sentence.lower()
vi_sentence = list_vi_sentence[idx]
vi_sentence = vi_sentence.lower()

print(f"English: {en_sentence}")
print(f"Vietnamese: {vi_sentence}")

print("-"*100)

translated_text, translated_tokens, attention_weights, list_predicted_prob = translator(tf.constant(en_sentence))
translated_text = translated_text.numpy().decode('utf-8')
print(f"Translated text: {translated_text}")

score = calculate_bleu_score([vi_sentence], translated_text)
print("BLEU score:", score)

English: and i understand that if she slipped up that she would have a completely reasonable explanation for it .
Vietnamese: và nhỡ có mắc sai lầm thì cô ta sẽ có một lời giải thích hợp lý .
----------------------------------------------------------------------------------------------------
Translated text: và tôi hiểu rằng nếu cô ấy bị trượt chân , cô ấy sẽ có một lời giải thích hoàn toàn hợp lý cho nó .
BLEU score: 0.21675453206953177


## 1.2. BLEU on the full test set

In [38]:
list_bleu_scores = []

for idx, (en_sentence, vi_sentence) in enumerate(zip(list_en_sentence, list_vi_sentence)):
    if idx % 1_000 == 0:
        print(f"idx = {idx}")

    translated_text, translated_tokens, attention_weights, list_predicted_prob = translator(tf.constant(en_sentence))
    translated_text = translated_text.numpy().decode('utf-8')
    score = calculate_bleu_score([vi_sentence], translated_text)
    list_bleu_scores.append(score)

print(f"Average BLEU score: {sum(list_bleu_scores)/len(list_bleu_scores)}")

idx = 0
idx = 1000
idx = 2000
idx = 3000
idx = 4000
idx = 5000
idx = 6000
idx = 7000
idx = 8000
idx = 9000
Average BLEU score: 0.27079152411186236


# 2. Translation Error Rate (TER)

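Types of edits in TER:
- Insertions.
- Deletions.
- Substitutions.

TER counts the edits needed to turn the translated output into the reference translation, normalized by the length of the reference. It is often expressed as a percentage, and lower values indicate better translation quality (less editing required):
- A TER of 0 is the best: the output matches the reference exactly.
- A TER of 1 means the number of edits equals the number of reference words. Because edits are normalized by the reference length, an unclipped score can exceed 1 when the output is much longer than the reference.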
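
The edit counting described above can be sketched as a word-level Levenshtein distance divided by the reference length. This is a simplified illustration under stated assumptions: `word_edit_distance` and `simple_ter` are hypothetical names, the standard TER definition also counts shifts of contiguous word blocks (omitted here), and the `calculate_ter` helper from `utils.evaluation_utils` is not shown, so it may behave differently.

```python
def word_edit_distance(reference_tokens, hypothesis_tokens):
    """Minimum number of word insertions, deletions, and substitutions
    needed to turn the hypothesis into the reference."""
    # prev[j] = distance between the first i-1 reference tokens
    # and the first j hypothesis tokens.
    prev = list(range(len(hypothesis_tokens) + 1))
    for i, ref_tok in enumerate(reference_tokens, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis_tokens, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # insert ref_tok into the hypothesis
                curr[j - 1] + 1,     # delete hyp_tok from the hypothesis
                prev[j - 1] + cost,  # keep a match or substitute
            ))
        prev = curr
    return prev[-1]


def simple_ter(reference, hypothesis):
    """Edits divided by reference length, on whitespace tokens (no shifts)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        # Assumption for this sketch: an empty reference scores 0 only
        # if the hypothesis is empty as well, otherwise 1.
        return 0.0 if not hyp else 1.0
    return word_edit_distance(ref, hyp) / len(ref)
```

For example, `simple_ter("a b c d", "a x c")` needs one substitution and one insertion over four reference words, giving 0.5, while `simple_ter("a b", "a b c d e")` gives 1.5 because three deletions are normalized by a two-word reference.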

## 2.1. Test TER on single sample

In [43]:
idx = np.random.randint(0, len(list_en_sentence))

en_sentence = list_en_sentence[idx]
en_sentence = en_sentence.lower()
vi_sentence = list_vi_sentence[idx]
vi_sentence = vi_sentence.lower()

print(f"English: {en_sentence}")
print(f"Vietnamese: {vi_sentence}")

print("-"*100)

translated_text, translated_tokens, attention_weights, list_predicted_prob = translator(tf.constant(en_sentence))
translated_text = translated_text.numpy().decode('utf-8')
print(f"Translated text: {translated_text}")

ter_score = calculate_ter(vi_sentence, translated_text)
print(f"TER Score: {ter_score:.4f}")

English: i design silicon lithography for personal gain .
Vietnamese: tôi thiết kế nên phương pháp khắc quang phổ lên nhựa silicon .
----------------------------------------------------------------------------------------------------
Translated text: tôi thiết kế in in silicon để có được lợi ích cá nhân .
TER Score: 0.7692


## 2.2. TER on the full test set

In [45]:
list_ter_scores = []

for idx, (en_sentence, vi_sentence) in enumerate(zip(list_en_sentence, list_vi_sentence)):
    if idx % 1_000 == 0:
        print(f"idx = {idx}")

    translated_text, translated_tokens, attention_weights, list_predicted_prob = translator(tf.constant(en_sentence))
    translated_text = translated_text.numpy().decode('utf-8')
    ter_score = calculate_ter(vi_sentence, translated_text)
    list_ter_scores.append(ter_score)

print(f"Average TER score: {sum(list_ter_scores)/len(list_ter_scores)}")

idx = 0
idx = 1000
idx = 2000
idx = 3000
idx = 4000
idx = 5000
idx = 6000
idx = 7000
idx = 8000
idx = 9000
Average TER score: 0.572375288929681
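
The two evaluation loops above each translate the same test sentences, so the model runs twice per sentence. One possible refactor is sketched below: translate each sentence once and score it with every metric. `evaluate_once` and its parameters are hypothetical names introduced for illustration, not part of the notebook's `utils` package.

```python
def evaluate_once(sentence_pairs, translate_fn, metric_fns):
    """Translate each source sentence once and score it with every metric.

    sentence_pairs: iterable of (source, reference) strings
    translate_fn:   callable mapping a source string to a hypothesis string
    metric_fns:     dict of name -> callable(reference, hypothesis) -> float
    Returns a dict of name -> mean score over all pairs.
    """
    totals = {name: 0.0 for name in metric_fns}
    count = 0
    for source, reference in sentence_pairs:
        hypothesis = translate_fn(source)  # single model call per sentence
        for name, metric in metric_fns.items():
            totals[name] += metric(reference, hypothesis)
        count += 1
    if count == 0:
        raise ValueError("no sentence pairs to evaluate")
    return {name: total / count for name, total in totals.items()}
```

In this notebook, `translate_fn` could wrap the call `translator(tf.constant(s))` and decode the first returned value, and `metric_fns` could map `"bleu"` to `lambda ref, hyp: calculate_bleu_score([ref], hyp)` and `"ter"` to `calculate_ter`, assuming those helpers keep the signatures used in the cells above.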
