<a href="https://colab.research.google.com/github/ITMK/DataLitMT/blob/main/MT_Quality_Score_Calculator_traditional_metrics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MT Quality Score Calculator for Metrics Based on String Matching
This is a Jupyter Notebook for calculating traditional machine translation quality scores, such as BLEU, METEOR, chrF and TER. The code used for these calculations is adopted largely from the Natural Language Toolkit ([NLTK](https://www.nltk.org/)). First, we'll cover some preliminaries required to understand the basic concepts employed in the various MT quality scores. Then, we'll take an in-depth look at these scores and their inner workings. Finally, you'll be able to calculate these scores using your own example sentences.

All metrics presented in this notebook are based on a string comparison between a machine-translated sentence (which is often called 'hypothesis' in MT research) and a corresponding human reference translation ('reference'). The reference can already exist (e.g., official translations of specific texts) or you can create it yourself by post-editing the machine-translated text so that it satisfies your own or your client's quality standards. In the latter case, the name of the score you want to calculate is often prefixed with a lower-case 'h' (e.g., 'BLEU' would become 'hBLEU', 'TER' would become 'hTER', etc.) to indicate that the reference sentence is the result of a human post-editing process.  

How well these MT quality scores actually capture the quality of a machine-translated hypothesis is measured by their correlation with human quality judgements (which remain the ultimate gold standard in machine translation quality evaluation). Good scores will achieve a high correlation with human quality judgements while poor scores will achieve a low correlation with these judgements. 

## 0 Housekeeping
First, we need to ensure that we have the Natural Language Toolkit in its current version installed. Run the following code to test this. 

In [None]:
# Upgrade to the current version of pip (if necessary) and check if NLTK is installed
!pip install --upgrade pip
!pip show nltk

If you get the message that the NLTK package was not found or an NLTK version older than 3.5 is displayed, run the code below. Otherwise, you can skip this step.

In [None]:
# Install NLTK or upgrade to its current version and run the nltk downloader
!pip install --upgrade nltk
import nltk
nltk.download()

If the NLTK Downloader prompts you to enter a command, enter 'd' and confirm with 'Shift + Enter'. In the 'Identifier' field, enter 'all-nltk' and confirm with 'Shift + Enter'. This downloads all available NLTK packages in their current versions. Once the download is finished, enter 'q' and confirm with 'Shift + Enter'.

NOTE: If a graphical downloader opens, just select 'all-nltk' from the 'Collections' list and click 'Download'. 

We also need to install the Pyter package, which is not part of the Natural Language Toolkit but which we need for calculating the Translation Edit Rate. Just run the code below to install this package.

In [None]:
# Install the current version of the pyter package
!pip install --upgrade pyter3

Now you're set up and ready to go! 

## 1 Preliminaries
This section covers some basic concepts that are required to understand the workings of the various MT quality scores covered in this notebook. Specifically, we will take a look at the concepts of *precision*, *recall* and *n-grams*. If you are already familiar with these concepts, feel free to skip this section.

### 1.1 Precision and Recall
Precision and recall are two fundamental concepts employed by many MT quality scores. Let's first look at the formula for precision:  

$$Precision = \frac{\mbox{#}\;words\;in\;hypothesis\;that\;match\;reference}{\mbox{#}\;words\;in\;hypothesis}$$

As we see here, precision is calculated as the number of words in the hypothesis that are also present in the reference, divided by the number of words in the hypothesis. We will look at an example below, but first, let's have a look at the formula for recall:

$$Recall = \frac{\mbox{#}\;words\;in\;hypothesis\;that\;match\;reference}{\mbox{#}\;words\;in\;reference}$$

As you see, the formulas for recall and precision are almost identical and differ only slightly in their denominator. Recall is calculated as the number of words in the hypothesis that are also present in the reference (same as in precision) divided by the number of words in the reference. Run the code below to see an example of how precision and recall are calculated for actual hypotheses and references.

In [None]:
# Import precision() and recall() functions
from nltk.metrics.scores import precision, recall

# Define reference and hypothesis
reference_p_r = {'This', 'is', 'a', 'simple', 'test', 'sentence'}
hypothesis_p_r = {'This', 'is', 'an', 'example', 'sentence'}

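# Calculate and print precision and recall scores
# (the results are stored under new names so that the imported functions are not overwritten)
precision_score = precision(reference_p_r, hypothesis_p_r)
recall_score = recall(reference_p_r, hypothesis_p_r)

print(f"Precision: {precision_score}\n")
print(f"Recall: {recall_score}")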

Let's look at the precision score first. There are three words (*This*, *is* and *sentence*) which appear both in the hypothesis and in the reference, so our numerator is 3. The hypothesis (*This is an example sentence*) has 5 words, so our denominator is 5, and $\frac{3}{5} = 0.6$.  
For recall, we keep 3 in the numerator (three words which appear both in the hypothesis and in the reference). The number of words in the reference (*This is a simple test sentence*) is 6, so our denominator is 6, and $\frac{3}{6} = 0.5$.  
The basic premise of most automatic MT quality scores employing precision and recall is that the more similar the hypothesis is to the reference, the higher the MT quality is (the ideal case being identity between hypothesis and reference). From this perspective, precision tells us how many 'wrong' words the MT engine produced (i.e., words that are not in the human reference translation). A precision score of 0.6 would thus tell us that the MT engine produced 60% of 'right' words and 40% of 'wrong' words. Recall, on the other hand, tells us how many words the MT engine 'missed' or failed to produce (i.e., words that are in the reference translation but do not appear in the hypothesis). A recall score of 0.5 thus tells us that the MT engine failed to produce 50% of the words which are present in the reference translation. Therefore, we strive for a high precision score and a high recall score (precision = 1 and recall = 1 would mean that hypothesis and reference consist of exactly the same words). Precision and recall are employed in one way or another in all of the similarity measures covered in section 2.1 below.  
The source code of the precision and recall functions used in this notebook can be found [here](https://www.nltk.org/_modules/nltk/metrics/scores.html).
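
The two formulas can also be reproduced without NLTK. The following is a minimal sketch, assuming (as in the cell above) that hypothesis and reference are represented as sets of words. The helper names `set_precision` and `set_recall` are made up for this illustration and are not part of NLTK.

```python
# Minimal sketch of set-based precision and recall (helper names are illustrative only)

def set_precision(reference, hypothesis):
    """Share of hypothesis words that also occur in the reference."""
    return len(reference & hypothesis) / len(hypothesis)

def set_recall(reference, hypothesis):
    """Share of reference words that also occur in the hypothesis."""
    return len(reference & hypothesis) / len(reference)

reference = {'This', 'is', 'a', 'simple', 'test', 'sentence'}
hypothesis = {'This', 'is', 'an', 'example', 'sentence'}

# Three shared words: 3 / 5 for precision and 3 / 6 for recall
print(f"Precision: {set_precision(reference, hypothesis)}")
print(f"Recall: {set_recall(reference, hypothesis)}")
```

Because sets are used, repeated words and word order play no role in this calculation.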

### 1.2 n-grams
The concept of *n-grams* is actually quite easy to understand. The *n* in *n-grams* is a placeholder for any positive integer, so, if we substitute *1* for *n* we get *1-grams*, if we substitute *2* for *n* we get *2-grams*, if we substitute *3* for *n* we get *3-grams*, and so on. These integers stand for the number of words that constitute the corresponding n-grams. For example, 1-grams are made up of single words, 2-grams are made up of sequences of two words, 3-grams are made up of sequences of three words, etc. n-grams allow us to divide sentences into chunks of varying sizes (single words, two-word sequences, three-word sequences, ...) for various computational purposes. Run the code below to see how an example sentence can be chunked into various n-grams. 

In [None]:
# Import the Punkt tokenizer required for the n-gram function
import nltk
nltk.download('punkt')

# Import ngrams() and word_tokenize() functions
from nltk.util import ngrams
from nltk import word_tokenize

# Define an example sentence to segment into ngrams. Feel free to change this example sentence as you see fit
sentence = 'This is an interesting example sentence'

# Tokenize the sentence once and reuse the tokens for every n-gram order
tokens = word_tokenize(sentence)

# Display the 1-grams (individual words) contained in the sentence
one_gram = list(ngrams(tokens, 1))
len_one_gram = len(one_gram)
print(f"The sentence contains the following {len_one_gram} 1-grams: {one_gram}\n")

# Move up one order to display the 2-grams (sequences of two words) contained in the sentence
two_gram = list(ngrams(tokens, 2))
len_two_gram = len(two_gram)
print(f"The sentence contains the following {len_two_gram} 2-grams: {two_gram}\n")

# Move up another order to display the 3-grams (sequences of three words) contained in the sentence
three_gram = list(ngrams(tokens, 3))
len_three_gram = len(three_gram)
print(f"The sentence contains the following {len_three_gram} 3-grams: {three_gram}\n")

# Do the same for 4-grams
four_gram = list(ngrams(tokens, 4))
len_four_gram = len(four_gram)
print(f"The sentence contains the following {len_four_gram} 4-grams: {four_gram}\n")

So, when applying the 1-gram function to our example sentence, the function splits the sentence into its individual words. Applying the 2-gram function splits it into every possible sequence of two consecutive words, applying the 3-gram function splits it into every possible sequence of three consecutive words, etc. There is a rather obvious pattern here. If our example sentence has, e.g., six words, it is chunked into six individual 1-grams (single words). If we broaden the n-gram window from 1 to 2, it will be split into five 2-grams. If we broaden the window further from 2 to 3, it will be split into four 3-grams, and so on. So, each time we increase our n-gram window by 1, the number of n-grams the sentence is split into decreases by 1. In general, a sentence of $k$ words contains $k - n + 1$ n-grams.  
The source code of the n-gram function used in this notebook can be found [here](https://www.nltk.org/_modules/nltk/util.html).
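
The sliding-window idea behind n-grams fits into a single line of plain Python. The sketch below is only an illustration: `simple_ngrams` is a made-up helper (not the NLTK function), and the sentence is split on whitespace instead of being tokenized with `word_tokenize()`.

```python
# Minimal sketch of n-gram extraction with a sliding window (illustrative helper, not NLTK's ngrams())

def simple_ngrams(tokens, n):
    """Return all sequences of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Assumption: plain whitespace splitting is good enough for this example sentence
tokens = 'This is an interesting example sentence'.split()

for n in range(1, 5):
    grams = simple_ngrams(tokens, n)
    # A sentence of k tokens yields k - n + 1 n-grams
    print(f"{n}-grams ({len(grams)}): {grams}")
```

If `n` is larger than the number of tokens, the function simply returns an empty list.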

## 2 MT Quality Metrics

### 2.1 Similarity Measures
Similarity measures derive their name from the fact that they measure the degree of similarity between two strings. Accordingly, the *higher* the score calculated by these metrics, the *higher* the quality of the MT output is assumed to be and vice versa.

### 2.1.1 F-Measure
Let's have a look at our first MT quality score. We have seen that precision and recall provide two slightly different perspectives on the quality of a machine translation engine. Ideally, our engine should not generate wrong words (precision) and it should not miss any words present in the reference translation (recall). **F-Measure** combines precision and recall into a single score. Here is the formula:  

$$F\mbox{-}Measure = \frac{precision \times recall}{\frac{precision\;+\;recall}{2}}$$

So, the formula for F-Measure is a fraction where the numerator is the product of precision and recall and the denominator is itself a fraction, here the sum of precision and recall divided by two. If you know your mathematics well, you may see that this formula calculates the harmonic mean (one of the three Pythagorean means, together with the arithmetic mean and the geometric mean) of precision and recall.  
Let's have a look at an example.  

In [None]:
# Import f_measure(), precision() and recall() functions
from nltk.metrics.scores import f_measure, precision, recall

# Define reference and hypothesis
reference_f = {'This', 'is', 'a', 'simple', 'test', 'sentence'}
hypothesis_f = {'This', 'is', 'an', 'example', 'sentence'}

# Calculate and print precision, recall and F-Measure scores for this sentence pair
# (the results are stored under new names so that the imported functions are not overwritten)
precision_score = precision(reference_f, hypothesis_f)
recall_score = recall(reference_f, hypothesis_f)
f_measure_score = f_measure(reference_f, hypothesis_f)

print(f"Precision: {precision_score}\n")
print(f"Recall: {recall_score}\n")
print(f"F-Measure: {f_measure_score}")

For your convenience, the code also prints out the precision and recall values for the sentence pair. If you plug these into the F-Measure formula above, you should arrive at the same F-Measure as calculated by the NLTK function.  
The major disadvantage of F-Measure is that this metric does not take word order into account. For example, if you scramble the hypothesis into *sentence an This example is* (leaving the spelling and capitalization of each word unchanged, since the comparison is case-sensitive), you'll get the exact same score as for the grammatically correct hypothesis *This is an example sentence*. And while it is certainly true that word order errors are significantly less frequent in neural MT than they were in phrase-based statistical MT, you should still keep this limitation of F-Measure in mind.  
The source code of the F-Measure function used in this notebook can be found [here](https://www.nltk.org/_modules/nltk/metrics/scores.html).
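
The formula and the word-order limitation discussed in this section can also be checked by hand. The following sketch plugs set-based precision and recall into the F-Measure formula given above; `harmonic_f_measure` is a hypothetical helper written for this illustration, and the sentences are split on whitespace.

```python
# Minimal sketch: F-Measure as the harmonic mean of set-based precision and recall
# (illustrative helper, not the NLTK implementation)

def harmonic_f_measure(reference, hypothesis):
    overlap = len(reference & hypothesis)
    if overlap == 0:
        return 0.0  # no shared words: precision and recall are both 0
    p = overlap / len(hypothesis)
    r = overlap / len(reference)
    return (p * r) / ((p + r) / 2)

reference = set('This is a simple test sentence'.split())
ordered = set('This is an example sentence'.split())
scrambled = set('sentence an This example is'.split())

print(f"F-Measure (ordered):   {harmonic_f_measure(reference, ordered):.4f}")
print(f"F-Measure (scrambled): {harmonic_f_measure(reference, scrambled):.4f}")
```

Both calls return the same value because the two hypotheses contain the same set of words.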

### 2.1.2 BLEU
**BLEU** is an acronym for **B**i**L**ingual **E**valuation **U**nderstudy. It was originally proposed in [Papineni et al. (2002): BLEU: A Method for Automatic Evaluation of Machine Translation](https://www.aclweb.org/anthology/P02-1040.pdf). Today, it is still the most commonly used automatic MT quality score and it achieves reasonably good correlation with human quality judgements. BLEU is calculated using the following formula:

$$BLEU = min(1, \frac{hypothesis\;length}{reference\;length})(\displaystyle\prod_{i=1}^{n} precision_i)^{\frac{1}{n}}$$

Let's break this down:

$(\displaystyle\prod_{i=1}^{n} precision_i)^{\frac{1}{n}}$: This part of the formula looks somewhat complicated, but it is actually quite simple. It says:  

**1)** $precision_i$: Find the individual precision values (n-grams that appear both in the hypothesis and in the reference) for 1-grams, 2-grams, 3-grams and 4-grams (the upper limit of 4 is not given in the formula, but the common BLEU score implementations used in MT research compute up to the n-gram value of 4). This will give you values such as $\frac{4}{7}$ (4 out of 7 possible 1-gram matches), $\frac{3}{6}$ (3 out of 6 possible 2-gram matches), $\frac{2}{5}$ (2 out of 5 possible 3-gram matches) and $\frac{1}{4}$ (1 out of 4 possible 4-gram matches).  

**2)** $\displaystyle\prod_{i=1}^{n}$ and $\frac{1}{n}$: Calculate the geometric mean of the n-gram precision values (F-Measure calculates the *harmonic* mean, BLEU calculates the *geometric* mean). To calculate the geometric mean, we multiply our $n$ precision values and then take the $n$-th root of the product of these values. In our example, we have 1- to 4-grams (four precision values, therefore $n = 4$), so we multiply $\frac{4}{7}\times\frac{3}{6}\times\frac{2}{5}\times\frac{1}{4}$ and then take the fourth root of this product. Calculating the $n$-th root of a number is equivalent to raising this number to the exponent $\frac{1}{n}$ ($\frac{1}{4}$ in our example). So, our final calculation in this case would be $(\frac{4}{7}\times\frac{3}{6}\times\frac{2}{5}\times\frac{1}{4})^{\frac{1}{4}}$.    
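
To make the geometric mean more tangible, here is a short sketch that evaluates the example calculation from step 2. The four precision values are the illustrative fractions used in the text, not the result of comparing actual sentences.

```python
# Minimal sketch: geometric mean of the four example n-gram precisions from the text
from fractions import Fraction

precisions = [Fraction(4, 7), Fraction(3, 6), Fraction(2, 5), Fraction(1, 4)]

# Multiply the n precision values ...
product = Fraction(1)
for p in precisions:
    product *= p

# ... and take the n-th root, i.e. raise the product to the power 1/n
n = len(precisions)
geometric_mean = float(product) ** (1 / n)

print(f"Product of precisions: {product}")
print(f"Geometric mean: {geometric_mean:.4f}")
```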


$min(1, \frac{hypothesis\;length}{reference\;length})$: This part of the formula is called the *brevity penalty* and it penalizes MT hypotheses that are shorter than their references (if the hypothesis is shorter than the reference, we fear that the MT system forgot to translate certain words). You may have noticed that BLEU works with precision but not recall values. The brevity penalty is BLEU's 'substitute' for recall, if you will. Suppose our hypothesis contains 6 words and the reference contains 7. If we plug these values into the formula, we get $min(1, \frac{6}{7})$. This simply means: take the smaller of the two values (here, $\frac{6}{7}$) and multiply it by the previous part of the BLEU formula. Since we multiply by a value smaller than 1, the overall BLEU score will decrease as a result of this multiplication. In other words, a brevity penalty was applied which reduces the overall BLEU score. If the hypothesis contained 7 words and the reference 6 words, we'd get $min(1, \frac{7}{6})$. In this case, the smaller of the two values is 1, and if we multiply the previous part of the BLEU formula by 1, nothing happens. In other words, in this case (where the hypothesis is longer than the reference) no brevity penalty is applied. Let's look at an example.

In [None]:
# Import bleu_score(), word_tokenize() and SmoothingFunction() functions
from nltk.translate import bleu_score
from nltk import word_tokenize
from nltk.translate.bleu_score import SmoothingFunction

# Store SmoothingFunction() in variable 'chencherry'
chencherry = SmoothingFunction()

# Define reference and hypothesis
reference_BLEU = word_tokenize('This is a very interesting calculation of BLEU score')
hypothesis_BLEU = word_tokenize('This is a very interesting BLEU score calculation')

# Calculate and print BLEU score
BLEU = bleu_score.sentence_bleu([reference_BLEU], hypothesis_BLEU, smoothing_function=chencherry.method3)
print(f"BLEU: {BLEU}")

In the code above, you see that a smoothing function ('smoothing_function=chencherry.method3') is used. This function has the following purpose: If there are no n-gram overlaps of a specific order between hypothesis and reference, the resulting n-gram precision factor would be 0 (e.g., if there are no 4-gram overlaps between hypothesis and reference, we get 0/4 = 0). Of course, if one factor of a product is 0, the entire product (here, our BLEU score) will be 0. Since it seems overly harsh to score such hypotheses with a BLEU score of 0, a smoothing function is applied, which assigns a small value to the n-gram precision value that would otherwise be 0, thus avoiding an overall BLEU score of 0 (you can see the effect of this smoothing function if you define a hypothesis and a reference with no 4-gram or 3-gram overlaps, for example). We won't cover in detail how this smoothing function is calculated. For an overview of different smoothing methods used in BLEU score calculation, see [Chen/Cherry (2014): A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU](https://www.aclweb.org/anthology/W14-3346/). In this notebook, we use the smoothing technique which the paper calls 'Smoothing 3' (which is sort of the 'official' smoothing technique employed in BLEU score calculation). In this context, it should be pointed out that BLEU was originally conceived as a corpus-level metric, i.e., it is intended to score longer stretches of text instead of single sentences. In such longer stretches of text, it is rather likely that at least some 4-gram overlaps between hypothesis and reference will be present and hence, smoothing is normally not required. However, BLEU is also often used to calculate scores at the sentence level, where the absence of 4- or even 3-gram overlaps is quite common. Hence, when calculating sentence-level BLEU scores, a smoothing function will have to be applied on a regular basis.   
The source code of the BLEU score function used in this notebook can be found [here](https://www.nltk.org/_modules/nltk/translate/bleu_score.html). BLEU scores published in official machine translation contests such as the [Conference on Machine Translation (WMT)](http://www.statmt.org/wmt20/) are calculated with the [SacreBLEU](https://github.com/mjpost/sacrebleu) package. SacreBLEU was originally proposed in [Post (2018): A Call for Clarity in Reporting BLEU Scores](https://arxiv.org/abs/1804.08771). 
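
To connect the simplified formula from the beginning of this section with actual sentences, here is a minimal from-scratch sketch. It is not the NLTK implementation: `simple_bleu` and `ngram_counts` are hypothetical helpers, the sentences are split on whitespace, no smoothing is applied, and each hypothesis n-gram is only counted as often as it occurs in the reference (the 'clipping' described in the original BLEU paper). Note also that NLTK, following Papineni et al. (2002), computes the brevity penalty in its exponential form $e^{1 - \frac{reference\;length}{hypothesis\;length}}$ instead of the simplified length ratio used here, so the value printed below will differ somewhat from the NLTK result.

```python
# Minimal sketch of the simplified BLEU formula (illustrative helpers, no smoothing)
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(reference, hypothesis, max_n=4):
    product = 1.0
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hypothesis, n)
        ref_counts = ngram_counts(reference, n)
        total = sum(hyp_counts.values())
        if total == 0:
            return 0.0  # hypothesis is too short to contain any n-grams of this order
        # Counter intersection keeps the smaller count of each shared n-gram (clipping)
        matches = sum((hyp_counts & ref_counts).values())
        product *= matches / total
    geometric_mean = product ** (1 / max_n)
    brevity_penalty = min(1.0, len(hypothesis) / len(reference))
    return brevity_penalty * geometric_mean

reference = 'This is a very interesting calculation of BLEU score'.split()
hypothesis = 'This is a very interesting BLEU score calculation'.split()

print(f"Simplified BLEU: {simple_bleu(reference, hypothesis):.4f}")
```

If any n-gram order has no matches at all, this sketch returns 0, which is exactly the situation the smoothing function discussed above is meant to avoid.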

### 2.1.3 METEOR
**METEOR** is an acronym for **M**etric for **E**valuation of **T**ranslation with **E**xplicit **OR**dering. It was originally proposed in [Banerjee/Lavie (2005): METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments](https://www.aclweb.org/anthology/W05-0909.pdf). The source code of the METEOR score function used in this notebook can be found [here](https://www.nltk.org/_modules/nltk/translate/meteor_score.html).  
METEOR was developed explicitly in order to overcome some of the shortcomings of its predecessor BLEU, and indeed, it generally achieves a higher correlation with human quality judgements than BLEU. As you may have noticed, calculating BLEU is language-agnostic, i.e., the BLEU score function described in section 2.1.2 above can be applied to any language. The input strings will just be tokenized, chunked into 1-grams to 4-grams and then processed by the BLEU formula. METEOR, on the other hand, employs specific linguistic knowledge such as stemming and synonym detection, making it a more precise score than BLEU but at the same time more difficult to implement. The advantages of METEOR over BLEU will become clear in the following sections.

METEOR is calculated using the following formula:
$$METEOR = F_{mean}\times (1-Penalty)$$

Again, let's break this down.

$F_{mean} = \frac{10PR}{R+9P}$: This part of the METEOR formula is somewhat reminiscent of the F-Measure formula. Here also, we calculate the harmonic mean (a *weighted* harmonic mean in this case), where *P* stands for *Precision* and *R* stands for *Recall*. So, the numerator multiplies precision * recall * 10 and the denominator adds recall to 9 * precision. The harmonic mean calculated here weighs recall 9 times more than precision (reflecting a finding of the authors that recall is more important than precision when a high level of correlation with human judgements is to be achieved). In METEOR, precision and recall are only based on unigram or 1-gram matching so that, unlike BLEU, there will be no matches calculated for 2-grams, 3-grams, etc. Also, while BLEU requires exact string identity between n-gram matches, METEOR is more forgiving here by stemming the words in the reference and hypothesis sentences and checking for synonyms in the [WordNet](https://wordnet.princeton.edu/) database. So, for example, strings like *responsibility* and *responsible* would be reduced to their common stem *respons* and treated as unigram matches by METEOR. Also, WordNet would tell METEOR that strings like *car* and *automobile* are synonyms, so METEOR would also consider them as unigram matches. BLEU, on the other hand, would find no matches in the two examples.  

$Penalty = 0.5\times(\frac{\mbox{#}\;chunks}{\mbox{#}\;unigrams\;matched})^3$: The notion of a penalty should already be familiar to you from the BLEU formula. Recall that BLEU only works with precision and not with recall and it compensates for this lack of recall by applying a brevity penalty. METEOR, on the other hand, works with both precision and recall, but it only calculates unigrams/1-grams and no higher-order n-grams such as 2-grams or 3-grams (which BLEU uses to check for the fluency of the hypothesis). METEOR compensates for this lack of higher-order n-grams with its own penalty function. In the formula, *chunks* are sequences of adjacent matched unigrams of any length which occur in the same order both in the hypothesis and in the reference. For example, for the hypothesis *the president spoke to the audience* and the reference *the president then spoke to the audience*, there are two chunks, *the president* and *spoke to the audience*. The longer the chunks of unigram matches between hypothesis and reference, the fewer chunks there are, the higher the fluency of the hypothesis and the lower the penalty will be. Note that, unlike BLEU's brevity penalty (where a factor of 1 means that no penalty is applied), the METEOR penalty is subtracted from 1 in the final formula, so the closer the penalty is to 0, the less the score is reduced. On the other hand, the shorter the chunks, the more chunks there are, the lower the fluency of the hypothesis will be and thus the higher the penalty will be (the highest possible penalty is 0.5, which would reduce the final METEOR score by half).  

$F_{mean}\times (1-Penalty)$: The two formulas discussed above are then combined into the final METEOR formula. The formula multiplies the $F_{mean}$ value calculated in the first formula by $(1-Penalty)$. As a result, the $F_{mean}$ value will be reduced by a maximum of 50% in case there are no chunks of two or more unigrams in reference and hypothesis. For more detailed information on the individual calculation steps discussed here, read the original METEOR paper linked above.
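
The arithmetic of the three formulas can be traced with the *president* example from above. The sketch below only covers the final calculation: the number of matched unigrams and the number of chunks are entered by hand, and the matching itself (exact matches, stems, synonyms) is not implemented. `meteor_from_counts` is a hypothetical helper, not an NLTK function.

```python
# Minimal sketch of the METEOR arithmetic, given hand-counted matches and chunks

def meteor_from_counts(matches, hyp_len, ref_len, chunks):
    if matches == 0:
        return 0.0  # no unigram matches at all
    precision = matches / hyp_len
    recall = matches / ref_len
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)

# Hypothesis: 'the president spoke to the audience' (6 words, all 6 matched)
# Reference:  'the president then spoke to the audience' (7 words)
# Chunks:     'the president' and 'spoke to the audience' (2 chunks)
score = meteor_from_counts(matches=6, hyp_len=6, ref_len=7, chunks=2)
print(f"METEOR (from counts): {score:.4f}")

# Worst case for the penalty: every matched unigram forms its own chunk
worst = meteor_from_counts(matches=6, hyp_len=6, ref_len=7, chunks=6)
print(f"METEOR with maximum penalty: {worst:.4f}")
```

In the second call, the penalty reaches its maximum of 0.5, so the result is exactly half of the $F_{mean}$ value.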

In summary, the most important advantage of METEOR over BLEU is that the former score applies linguistic knowledge in the form of stemming and synonym detection and is thus more flexible in its matching operations than BLEU. Run the code below to calculate METEOR for the given hypothesis-reference pair.


In [None]:
# Import meteor_score and word_tokenize() functions
# (METEOR also needs the WordNet data, which is part of the NLTK packages downloaded in section 0)
from nltk.translate import meteor_score
from nltk import word_tokenize

# Define reference and hypothesis and tokenize them
reference_METEOR = word_tokenize('I am fully responsible')
hypothesis_METEOR = word_tokenize('I have full responsibility')

# Calculate and print METEOR score
METEOR = meteor_score.single_meteor_score(reference_METEOR, hypothesis_METEOR)
print(f"METEOR: {METEOR}")

As you can see, METEOR computes a reasonably high score for hypothesis and reference, although the two strings are quite different at the surface. If you calculate the BLEU score for the two strings, you'll receive a much lower score as BLEU only looks for exact string identity.

NOTE: The NLTK implementation of METEOR uses the English version of WordNet for synonym detection. Hence, using the METEOR function in this notebook will only make sense for English hypotheses and references. Also, NLTK implements the original METEOR version from 2005. In the meantime, the score has been revised and updated several times (see, for example, [Denkowski/Lavie (2011): Meteor 1.3: Automatic Metric for Reliable Optimization and Evaluation of Machine Translation Systems](https://www.aclweb.org/anthology/W11-2107/)) and is now also available for languages other than English. The most current version of METEOR is described in [Denkowski/Lavie (2014): Meteor Universal: Language Specific Translation Evaluation for Any Target Language](https://www.aclweb.org/anthology/W14-3348/). Today, the most widely used implementation for calculating official METEOR scores is [this Java implementation](https://www.cs.cmu.edu/~alavie/METEOR/).  

In this notebook, we analyse the original METEOR version for the following reasons: 1) NLTK provides an easily accessible Python implementation. 2) The original version makes the advantages of METEOR over BLEU sufficiently clear. 3) The original version is of medium complexity. With each subsequent version, the score's complexity rises, making it harder to grasp for newcomers. 4) Recently, a new neural MT quality metric named **COMET** was introduced, which shows a promising performance and may make METEOR obsolete at some point in the future (as you might have guessed from the name, *COMET* was developed by some of the same people who are responsible for METEOR; they certainly have a penchant for heavenly bodies). COMET is covered in a separate notebook on embedding-based MT quality metrics.

### 2.1.4 chrF
**chrF** is an acronym for **ch**a**r**acter n-gram **F**-Score. It was originally proposed in [Popovic (2015): chrF: Character n-gram F-Score for Automatic MT Evaluation](https://www.aclweb.org/anthology/W15-3049.pdf). The source code of the chrF score function used in this notebook can be found [here](https://www.nltk.org/_modules/nltk/translate/chrf_score.html). chrF is one of the most recent MT quality scores based on string matching. Unlike the scores covered so far, chrF does not operate at the *word* but rather at the *character* level (hence the *chr*). As such, it employs the notion of *character n-grams*, where a 1-gram represents a single character, a 2-gram represents a sequence of two characters, etc. (basically the same idea as with regular n-grams, but here, the atomic units are characters instead of words). 

chrF is calculated using the following formula:
$$chrF\beta = (1+\beta^2)\frac{chrP\times chrR}{\beta^2 \times chrP+chrR}$$

Looking at the formula, you may already have guessed that the *F* in *chrF* stands for F-Measure again (which we know already as an autonomous MT quality score and as part of the METEOR score). So, chrF is "the F-score based on character *n*-grams" (Popovic 2015:392). Like the previous metrics covered in this notebook (with the exception of METEOR), chrF is completely language-agnostic. In the formula, $chrP$ is the percentage of character n-grams in the hypothesis which are also present in the reference (see our general definition of *precision* above) and $chrR$ is the percentage of character n-grams in the reference which are also present in the hypothesis (again, see our general definition of *recall* above). $\beta$, finally, is a parameter which assigns $\beta$ times more importance to recall than to precision (the idea that recall should be assigned more importance than precision should sound familiar from the discussion of METEOR above). When $\beta = 1$, precision and recall are given equal importance.  
The best correlations with human quality judgements (remember, this correlation is the currency in which the quality of an MT quality score is measured) were achieved with the *6-gram chrF3 score* (see Popovic 2015:393). Here, 6 is the maximum character n-gram length: the score takes character 1-grams up to character 6-grams (sequences of six characters) into account and averages over these n-gram orders. $3$ is the value for $\beta$, which means that recall is given three times more importance than precision when calculating chrF. The *sentence_chrf()* function in NLTK's *chrf_score* module retains these values in its standard configuration. Run the code below to calculate 6-gram chrF3 for our reference-hypothesis pair.

In [None]:
# Import chrf_score() and word_tokenize() functions
from nltk.translate import chrf_score
from nltk import word_tokenize

# Define reference and hypothesis
reference_chrF = word_tokenize('This is a simple test sentence')
hypothesis_chrF = word_tokenize('This is an example sentence')

#Calculate and print chrF score
chrF = chrf_score.sentence_chrf(reference_chrF, hypothesis_chrF)
print(f"chrF: {chrF}")

As you can see, for the same hypothesis-reference pair as in our F-Measure calculation in section 2.1.1 above, chrF returns a score which is a little bit lower than the F-Measure score computed above (which was approximately 0.55). If you want to calculate chrF for other character n-gram sequences and for other $\beta$ values, you have to pass additional arguments to the function. For example, in order to calculate the 4-gram chrF2 score, you have to call the function like this:   

In [None]:
#Calculate and print chrF score with custom n-gram and beta values
chrF = chrf_score.sentence_chrf(reference_chrF, hypothesis_chrF, min_len=1, max_len=4, beta=2.0, ignore_whitespace=True)
print(f"chrF: {chrF}")

By customizing the values for *min_len*, *max_len* and *beta*, you can set your own n-gram range and $\beta$ value. Also, by setting *ignore_whitespace* to *False*, you tell the chrF function to include whitespace characters in the character n-grams. By default, whitespace characters are ignored.  
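
If you would like to see what happens behind these parameters, the following from-scratch sketch may help. It is an illustration under stated assumptions, not the NLTK code: `simple_chrf` and `char_ngrams` are hypothetical helpers, whitespace is always removed, and the F-scores of the individual n-gram orders are simply averaged (the original paper describes averaging precision and recall over the n-gram orders first, so results may differ slightly from other implementations).

```python
# Minimal sketch of a character n-gram F-score (illustrative helpers, not NLTK's sentence_chrf())
from collections import Counter

def char_ngrams(text, n):
    """Count all character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def simple_chrf(reference, hypothesis, max_len=6, beta=3.0):
    # Assumption: whitespace is ignored, as in the default setting described above
    ref = reference.replace(' ', '')
    hyp = hypothesis.replace(' ', '')
    scores = []
    for n in range(1, max_len + 1):
        ref_counts = char_ngrams(ref, n)
        hyp_counts = char_ngrams(hyp, n)
        if not ref_counts or not hyp_counts:
            continue  # one of the strings is too short for this n-gram order
        overlap = sum((ref_counts & hyp_counts).values())
        chr_p = overlap / sum(hyp_counts.values())
        chr_r = overlap / sum(ref_counts.values())
        if overlap == 0:
            scores.append(0.0)
        else:
            scores.append((1 + beta ** 2) * chr_p * chr_r / (beta ** 2 * chr_p + chr_r))
    # Average the F-scores over all n-gram orders that could be evaluated
    return sum(scores) / len(scores) if scores else 0.0

reference = 'This is a simple test sentence'
hypothesis = 'This is an example sentence'
print(f"chrF sketch (max_len=6, beta=3): {simple_chrf(reference, hypothesis):.4f}")
print(f"chrF sketch (max_len=4, beta=2): {simple_chrf(reference, hypothesis, max_len=4, beta=2.0):.4f}")
```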

In 2017, the author of the original chrF paper proposed an extended version of the metric called *chrF++*, which includes word-level 1-grams and 2-grams in its calculation. If you are interested in this extended version, have a look at the paper: [Popovic (2017): chrF++: Words helping character n-grams](https://www.statmt.org/wmt17/pdf/WMT70.pdf).

### 2.2 Distance Measures
In contrast to similarity measures, distance measures derive their name from the fact that they measure the degree of dissimilarity or distance between two strings. Accordingly, the *higher* the score calculated by these metrics, the *lower* the quality of the MT output is assumed to be and vice versa.

### 2.2.1 Edit Distance
**Edit Distance** was originally proposed in [Levenshtein (1965): Binary Codes Capable of Correcting Deletions, Insertions, and Reversals](https://www.semanticscholar.org/paper/Binary-codes-capable-of-correcting-deletions%2C-and-Levenshtein/b2f8876482c97e804bb50a5e2433881ae31d0cdd). With reference to its inventor, Vladimir Levenshtein, it is often also called *Levenshtein Distance*, but we'll stick to the term *Edit Distance* in this notebook. It is one of the earliest metrics used for string comparison. Edit Distance is calculated using the following formula:  

$$Edit\;Distance = {substitutions + insertions + deletions}$$  

So, Edit Distance simply counts the *minimum* number of steps required to transform a hypothesis into a reference. The steps allowed are substituting one element for another, inserting an element, or deleting an element. The source code of the Edit Distance function used in this notebook can be found [here](https://www.nltk.org/_modules/nltk/metrics/distance.html).
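
If you'd like to see how such a minimum can be found, the following is a minimal sketch of the classic dynamic-programming approach. The function name `levenshtein` is our own choice; this is an illustration and not NLTK's code. Each cell of the table holds the Edit Distance between a prefix of the hypothesis and a prefix of the reference.

```python
def levenshtein(hypothesis, reference):
    """Minimum number of substitutions, insertions and deletions
    that turn hypothesis into reference (illustrative sketch)."""
    rows, cols = len(hypothesis) + 1, len(reference) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i elements of the hypothesis prefix
    for j in range(cols):
        dist[0][j] = j  # insert all j elements of the reference prefix
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if hypothesis[i - 1] == reference[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match or substitution
    return dist[-1][-1]

print(levenshtein("kitten", "sitting"))  # 3
```

Because the function only compares elements for equality, it accepts strings (character level) as well as lists of tokens (word level).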

#### 2.2.1.1 Character-based Edit Distance
Edit Distance can operate either on character or on word level. Here, we'll first take a look at character-based Edit Distance.

In [None]:
# Import edit_distance() function
from nltk.metrics.distance import edit_distance

# Define reference and hypothesis
reference_ed_char = 'This is a simple test sentence'
hypothesis_ed_char = 'This is an example sentence'

# Calculate and print character-based Edit Distance
edit_distance_char = edit_distance(reference_ed_char, hypothesis_ed_char)
print(f"Edit Distance (character-based): {edit_distance_char}")

It may be a bit tiresome to do so, but if you would like to check whether the score calculated by the code above is correct (it is, actually), you can just count the minimum number of character substitutions, insertions and deletions which are required in the hypothesis to make it identical to the reference. Such a manual calculation is easier when Edit Distance is calculated at the word level. 

#### 2.2.1.2 Word-based Edit Distance

In [None]:
# Import edit_distance() and word_tokenize() functions
from nltk.metrics.distance import edit_distance
from nltk import word_tokenize

# Define reference and hypothesis
reference_ed_wrd = 'This is a simple test sentence'
hypothesis_ed_wrd = 'This is an example sentence'

# Calculate and print word-based Edit Distance
edit_distance_wrd = edit_distance(word_tokenize(reference_ed_wrd), word_tokenize(hypothesis_ed_wrd))
print(f"Edit Distance (word-based): {edit_distance_wrd}")

This process can be retraced more easily. In the hypothesis, we replace *an* with *a* (step 1), we replace *example* with *test* (step 2) and we insert the word *simple* (step 3). After performing these three steps, hypothesis and reference are identical; hence the Edit Distance is 3. 

### 2.2.2 WER


**WER** is an acronym for **W**ord **E**rror **R**ate. MT research borrowed this metric from automatic speech recognition, where it is used to compute the difference between the text that was spoken and the text that the computer actually understood. Word Error Rate is calculated using the following formula:  

$$WER = \frac{substitutions + insertions + deletions}{reference\;length}$$

As you can see, the formula for Word Error Rate is almost identical to the Edit Distance formula, except that WER divides the number of editing operations by the length of the reference in order to normalize the score (normalizing the score makes it comparable to other WER scores which were calculated for reference sentences with different lengths). Also, unlike Edit Distance, Word Error Rate is only calculated on the basis of words (as the name already implies). Hence, Word Error Rate is identical to **Normalized Word-based Edit Distance**. WER is not implemented as a stand-alone function in NLTK, so we'll use the function for word-based Edit Distance and tweak it a little to include the normalizing step.

In [None]:
# Import edit_distance() and word_tokenize() functions
from nltk.metrics.distance import edit_distance
from nltk import word_tokenize

# Define reference and hypothesis
reference_wer = 'This is a simple test sentence'
hypothesis_wer = 'This is an example sentence'

# Calculate and print Word Error Rate
# (the result is stored under a new name so that the edit_distance() function is not overwritten)
edit_distance_wer = edit_distance(word_tokenize(reference_wer), word_tokenize(hypothesis_wer))
word_error_rate = edit_distance_wer / len(word_tokenize(reference_wer))
print(f"Word Error Rate: {word_error_rate}")

Here, we require the same three steps as performed in section 2.2.1.2 above in order to transform the hypothesis into the reference. Since the reference contains 6 words, we divide 3 by 6 and get 0.5.
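
The numerator of the WER formula can also be broken down into its three components. The following sketch extends the usual Edit Distance table so that each cell remembers how many substitutions, insertions and deletions were used. The function name `edit_operations` is hypothetical, and for simplicity the sentences are tokenized with `str.split()` instead of `word_tokenize()` (an assumption that makes no difference for these punctuation-free examples).

```python
def edit_operations(hypothesis, reference):
    """Return (total, substitutions, insertions, deletions) for turning
    hypothesis into reference (illustrative sketch)."""
    rows, cols = len(hypothesis) + 1, len(reference) + 1
    table = [[(0, 0, 0, 0)] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = (i, 0, 0, i)  # delete every hypothesis word
    for j in range(1, cols):
        table[0][j] = (j, 0, j, 0)  # insert every reference word
    for i in range(1, rows):
        for j in range(1, cols):
            if hypothesis[i - 1] == reference[j - 1]:
                table[i][j] = table[i - 1][j - 1]  # words match, no edit needed
                continue
            t, s, n, d = table[i - 1][j - 1]
            substitution = (t + 1, s + 1, n, d)
            t, s, n, d = table[i][j - 1]
            insertion = (t + 1, s, n + 1, d)
            t, s, n, d = table[i - 1][j]
            deletion = (t + 1, s, n, d + 1)
            # Tuples are compared by their first entry (the total) first
            table[i][j] = min(substitution, insertion, deletion)
    return table[-1][-1]

reference = 'This is a simple test sentence'.split()
hypothesis = 'This is an example sentence'.split()
total, subs, ins, dels = edit_operations(hypothesis, reference)
print(f"Substitutions: {subs}, insertions: {ins}, deletions: {dels}")
print(f"Word Error Rate: {total / len(reference)}")  # 0.5
```

Note that WER is not capped at 1: if the hypothesis is much longer than the reference, the number of edits can exceed the reference length. For example, a five-word hypothesis that shares no word with a two-word reference yields a WER of 2.5.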

### 2.2.3 TER

**TER** is an acronym for **T**ranslation **E**dit **R**ate.
It was originally proposed in [Snover et al. (2006): A Study of Translation Edit Rate with Targeted Human Annotation](http://www.cs.umd.edu/~snover/pub/amta06/ter_amta.pdf). TER is calculated using the following formula:  

$$TER = \frac{substitutions + insertions + deletions + shifts}{reference\;length}$$  

Again, you see that the formula for Translation Edit Rate is almost identical to the Word Error Rate formula, except that it introduces a new editing operation called *shift*. A shift means that we can take a word (or a contiguous sequence of words) that is already present in a string and move it to another position in this string; each shift counts as a single edit. TER is usually calculated on the basis of words and not on the basis of characters (although there is a character-based version called [characTER](https://www.aclweb.org/anthology/W16-2342/), but we'll not consider it in this notebook). You can think of Translation Edit Rate as **Extended Normalized Word-based Edit Distance** (if that makes any sense to you). The source code of the TER function used in this notebook can be found [here](https://github.com/BramVanroy/pyter).

In [None]:
# Import ter() and word_tokenize() functions
from pyter import ter
from nltk import word_tokenize

# Define reference and hypothesis
reference_ter = 'This is a simple test sentence'
hypothesis_ter = 'This is an example sentence'

# Calculate and print Translation Edit Rate
TER = ter(word_tokenize(hypothesis_ter), word_tokenize(reference_ter))
print(f"TER: {TER}")
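
To see why the shift operation matters, consider a hypothesis that contains all the right words but one of them in the wrong position. The toy sketch below tries every possible *single* shift of a contiguous word block and checks whether "one shift plus the remaining edits" is cheaper than the plain word-based Edit Distance. This is only an illustration under simplifying assumptions: the actual TER algorithm by Snover et al. searches for shifts greedily and under additional constraints, so `pyter` may return different values for other sentence pairs. The function names are hypothetical.

```python
def levenshtein(a, b):
    """Word-based Edit Distance, computed with a single rolling row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # match or substitution
        prev = curr
    return prev[-1]

def edits_with_one_shift(hypothesis, reference):
    """Cheapest edit count when at most one block shift (cost 1) is allowed."""
    best = levenshtein(hypothesis, reference)
    for start in range(len(hypothesis)):
        for end in range(start + 1, len(hypothesis) + 1):
            block = hypothesis[start:end]
            rest = hypothesis[:start] + hypothesis[end:]
            for pos in range(len(rest) + 1):
                shifted = rest[:pos] + block + rest[pos:]
                best = min(best, 1 + levenshtein(shifted, reference))
    return best

reference = 'yesterday I saw the film'.split()
hypothesis = 'I saw the film yesterday'.split()
print(levenshtein(hypothesis, reference) / len(reference))           # 0.4 (one deletion + one insertion)
print(edits_with_one_shift(hypothesis, reference) / len(reference))  # 0.2 (one shift)
```

Without shifts, moving *yesterday* costs two edits (a deletion and an insertion); with a shift, it costs only one, so the misplaced word is penalized less heavily.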

### 2.2.4 Post-Edit Modification Percentage

The **Post-Edit Modification Percentage (PEM%)** is the last MT quality score covered in this notebook. It provides a bridge between MT quality scores and the fuzzy match scores that you will be familiar with from translation memory systems. The Post-Edit Modification Percentage score is implemented in the [Qualitivity](https://community.sdl.com/product-groups/translationproductivity/w/customer-experience/2251/qualitivity?_ga=2.127563211.793596354.1607361594-1168507431.1607361592) plugin for Trados Studio. PEM% is calculated using the following formula:  

$$Post\mbox{-}Edit\; Modification\; Percentage = \frac{Max.\; characters - Edit\; Distance}{Max.\; characters}\times 100$$  

Basically, this formula calculates the character-based Edit Distance between hypothesis and reference, normalizes it by the length of the longer of the two strings (Max. characters) and transforms this score into a percentage value. Because the Edit Distance is subtracted from the maximum number of characters, a *higher* PEM% means *less* editing, unlike the other scores in this section. Again, PEM% is not implemented as a function in NLTK, so we'll write our own Python code to calculate it (based on NLTK's Edit Distance function).  

In [None]:
# Import edit_distance() function
from nltk.metrics.distance import edit_distance

# Define reference and hypothesis
reference_pem = 'This is a simple test sentence'
hypothesis_pem = 'This is an example sentence'

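# Calculate character-based Edit Distance
# (the result is stored under a new name so that the edit_distance() function is not overwritten)
edit_distance_pem = edit_distance(reference_pem, hypothesis_pem)

# Determine the number of characters in the longer of the two strings
max_char = max(len(reference_pem), len(hypothesis_pem))

# Calculate and print Post-Edit Modification Percentage
PEM = (max_char - edit_distance_pem) / max_char * 100
print(f'Edit Distance: {edit_distance_pem}\n')
print(f'Max. characters: {max_char}\n')
print(f'PEM%: {PEM}%')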

For your convenience, the code also prints out the Edit Distance and the number of characters in the longer of the two strings. So, you'd calculate $(30-9)/30 \times 100$ and get a PEM% value of 70%. This value is to be interpreted as follows: The amount of editing required to transform the MT hypothesis into the reference equals the amount of editing that would have been required if the MT hypothesis had been a 70% fuzzy match provided by a translation memory. As stated earlier, PEM% provides a bridge between MT quality scores and the fuzzy match scores translators will be used to from working with translation memory systems. If a translator post-edits a machine-translated text in a TM system such as Trados Studio, PEM% will provide him or her with a familiar measure of effort quantification. However, the PEM% measure can only be provided *after* the translator has post-edited the text, whereas TM fuzzy match scores are provided *before* the translator starts to work on a text. 

## 3 Calculate Your Own Scores

Here, you can specify your own hypothesis and reference sentences, which will then be used to calculate the different scores. This will allow you to compare the different scores directly with each other and to see which correspond best to your own quality judgements of the MT output. If you run the following code, you'll be prompted to enter a hypothesis and a reference sentence. Remember that, in this notebook, METEOR only works with English sentences. So if you work with non-English examples, keep in mind that the METEOR score for these examples will not make any sense.  

In [None]:
# Enter machine-translated sentence (hypothesis) and the human reference translation (reference)
own_hypothesis = input("\nPlease enter a machine-translated sentence and confirm with Enter: ")
own_reference = input("\nPlease enter the corresponding human reference translation and confirm with Enter: ")

Now that you have provided your own hypothesis and reference, you can run the code below to calculate the various scores for these sentences. If you want to change your hypothesis and reference, just run the code above again and enter your new sentences. Then, if you run the code below again, it will calculate the scores for your new sentences.

In [None]:
import nltk
from nltk import word_tokenize

print('SIMILARITY MEASURES:\n')

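# Calculate and print precision, recall and f-measure scores
# Note: NLTK's precision() and recall() expect their arguments in the order (reference, test),
# and the results are stored under new names so that the imported functions are not overwritten
from nltk.metrics.scores import precision, recall
reference_set = set(word_tokenize(own_reference))
hypothesis_set = set(word_tokenize(own_hypothesis))
precision_score = precision(reference_set, hypothesis_set)
recall_score = recall(reference_set, hypothesis_set)
# Avoid a division by zero if hypothesis and reference have no words in common
f_measure = (2 * precision_score * recall_score) / (precision_score + recall_score) if precision_score and recall_score else 0.0
print(f"Precision: {precision_score}\n")
print(f"Recall: {recall_score}\n")
print(f"F-Measure: {f_measure}\n")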

# Calculate and print BLEU score
from nltk.translate import bleu_score
from nltk.translate.bleu_score import SmoothingFunction
chencherry = SmoothingFunction()
BLEU = bleu_score.sentence_bleu([nltk.word_tokenize(own_reference)], nltk.word_tokenize(own_hypothesis), smoothing_function=chencherry.method3)
print(f"BLEU: {BLEU}\n")

# Calculate and print METEOR score
from nltk.translate import meteor_score
own_reference_tok = word_tokenize(own_reference)
own_hypothesis_tok = word_tokenize(own_hypothesis)
METEOR = meteor_score.single_meteor_score(own_reference_tok, own_hypothesis_tok)
print(f"METEOR: {METEOR}\n")

# Calculate and print chrF score
from nltk.translate import chrf_score
chrF = chrf_score.sentence_chrf(own_reference, own_hypothesis)
print(f"chrF: {chrF}\n\n")

print('DISTANCE MEASURES:\n')

# Calculate and print character-based Edit Distance
from nltk.metrics.distance import edit_distance
edit_distance_char = edit_distance(own_reference, own_hypothesis)
print(f"Edit Distance (character-based): {edit_distance_char}\n")

# Calculate and print word-based Edit Distance
edit_distance_word = edit_distance(word_tokenize(own_reference), word_tokenize(own_hypothesis))
print(f"Edit Distance (word-based): {edit_distance_word}\n")

# Calculate and print Word Error Rate
# (reuses the word-based Edit Distance from above, so the edit_distance() function is not overwritten)
word_error_rate = edit_distance_word / len(word_tokenize(own_reference))
print(f"Word Error Rate: {word_error_rate}\n")

# Calculate and print Translation Edit Rate
import pyter
TER = pyter.ter(word_tokenize(own_hypothesis), word_tokenize(own_reference))
print(f"Translation Edit Rate: {TER}\n")

# Calculate and print Post-Edit Modification Percentage
# (reuses the character-based Edit Distance calculated above)
max_char = max(len(own_reference), len(own_hypothesis))
PEM = (max_char - edit_distance_char) / max_char * 100
print(f'PEM%: {PEM}%')