A popular metric for evaluating the quality of machine-translated text is the BLEU score, proposed by Kishore Papineni et al. in their 2002 paper ["BLEU: a Method for Automatic Evaluation of Machine Translation"](https://www.aclweb.org/anthology/P02-1040.pdf). The BLEU score works by comparing a "candidate" text to one or more "reference" translations. The closer the score is to 1, the better the translation. The following sections show how to compute this value.

# BLEU Score

In [None]:
pip install sacrebleu

In [7]:
import numpy as np                 
import nltk                         
from nltk.util import ngrams
from collections import Counter     
import sacrebleu                    
import matplotlib.pyplot as plt
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## Defining the BLEU Score

You have seen the formula for calculating the BLEU score in this week's lectures. More formally, we can express the BLEU score as:

$$BLEU = BP\Bigl(\prod_{i=1}^{4}precision_i\Bigr)^{(1/4)}$$

with the Brevity Penalty and precision defined as:

$$BP = min\Bigl(1, e^{(1-({ref}/{cand}))}\Bigr)$$

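$$precision_i = \frac {\sum_{snt \in{cand}}\sum_{g\in{snt}}\min\Bigl(m^{i}_{cand}(g), m^{i}_{ref}(g)\Bigr)}{w^{i}_{t}}$$

where:

* $ref$ and $cand$ in the Brevity Penalty are the lengths (in tokens) of the reference and candidate translations.
* $g$ ranges over the distinct i-grams of each candidate sentence $snt$.
* $m^{i}_{cand}(g)$ is the number of times the i-gram $g$ occurs in the candidate translation.
* $m^{i}_{ref}(g)$ is the number of times the i-gram $g$ occurs in the reference translation.
* $w^{i}_{t}$ is the total number of i-grams in the candidate translation.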
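
To make the Brevity Penalty concrete, the following minimal sketch evaluates $BP$ for a few length pairs. The function name `brevity_penalty_from_lengths` and the example lengths are illustrative assumptions, not part of the lab code.

```python
import math

def brevity_penalty_from_lengths(ref_len, cand_len):
    """BP = min(1, exp(1 - ref_len / cand_len)), following the BP formula."""
    return min(1.0, math.exp(1 - ref_len / cand_len))

# A candidate shorter than the reference is penalized (BP < 1).
print(round(brevity_penalty_from_lengths(14, 12), 3))  # 0.846
# A candidate at least as long as the reference is not penalized.
print(brevity_penalty_from_lengths(14, 14))  # 1.0
print(brevity_penalty_from_lengths(14, 20))  # 1.0
```

The penalty only ever lowers the score: without it, a very short candidate made of a few safe words could reach a high precision.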

The n-gram precision counts how many unigrams, bigrams, trigrams, and four-grams ($i = 1, \dots, 4$) in the candidate match an n-gram in the reference translations. This term acts as a precision metric. Unigrams account for adequacy, while longer n-grams account for the fluency of the translation. To avoid overcounting, each n-gram's count is clipped to the maximum number of times it occurs in the reference ($m^{i}_{ref}$). Typically, precision decays roughly exponentially as the order of the n-gram increases.
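
The effect of clipping is easiest to see on a degenerate candidate that repeats one common word. The sketch below is a toy illustration for unigrams only; the helper name `clipped_unigram_precision` and the two sentences are assumptions made for this example, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

def clipped_unigram_precision(reference_tokens, candidate_tokens):
    """Clipped unigram precision: each candidate count is capped by the reference count."""
    cand_counts = Counter(candidate_tokens)
    ref_counts = Counter(reference_tokens)
    clipped = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    return clipped / len(candidate_tokens)

reference_tokens = "the cat is on the mat".split()
candidate_tokens = "the the the the the the".split()

# Without clipping, every "the" would match and precision would be 6/6.
unclipped = sum(tok in reference_tokens for tok in candidate_tokens) / len(candidate_tokens)
print(unclipped)  # 1.0

# With clipping, "the" is credited at most twice (its count in the reference).
print(round(clipped_unigram_precision(reference_tokens, candidate_tokens), 3))  # 0.333
```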

## Calculations of the BLEU score

In [8]:
reference = "The NASA Opportunity rover is battling a massive dust storm on planet Mars."
candidate_1 = "The Opportunity rover is combating a big sandstorm on planet Mars."
candidate_2 = "A NASA rover is fighting a massive storm on planet Mars."

tokenized_ref = nltk.word_tokenize(reference.lower())
tokenized_cand_1 = nltk.word_tokenize(candidate_1.lower())
tokenized_cand_2 = nltk.word_tokenize(candidate_2.lower())

In [9]:
def brevity_penalty(reference, candidate):
    """Brevity penalty: min(1, exp(1 - ref_length / can_length))."""
    ref_length = len(reference)
    can_length = len(candidate)

    # No penalty when the candidate is longer than the reference
    if can_length > ref_length:
        BP = 1
    else:
        # Shorter candidates are penalized (equal lengths give exp(0) = 1)
        penalty = 1 - (ref_length / can_length)
        BP = np.exp(penalty)
    return BP

def clipped_precision(reference, candidate):
    """Geometric mean of the clipped 1- to 4-gram precisions of a candidate against one reference"""

    clipped_precision_score = []

    for i in range(1, 5):
        candidate_n_gram = Counter(ngrams(candidate, i))
        reference_n_gram = Counter(ngrams(reference, i))
        # total number of i-grams in the candidate (the precision denominator)
        c = sum(candidate_n_gram.values())

        # clip each candidate i-gram count to its count in the reference
        # (a Counter returns 0 for i-grams the reference does not contain)
        clipped = sum(min(count, reference_n_gram[n_gram])
                      for n_gram, count in candidate_n_gram.items())

        clipped_precision_score.append(clipped / c)

    weights = [0.25] * 4

    # weighted sum of log-precisions = log of the geometric mean
    s = [w_i * np.log(p_i) for w_i, p_i in zip(weights, clipped_precision_score)]
    s = np.exp(np.sum(s))
    return s

def bleu_score(reference, candidate):
    BP = brevity_penalty(reference, candidate)
    precision = clipped_precision(reference, candidate)
    return BP * precision

In [10]:
print("Results reference versus candidate 1 our own code BLEU: ",
      round(bleu_score(tokenized_ref, tokenized_cand_1) * 100, 1))

print("Results reference versus candidate 2 our own code BLEU: ",
      round(bleu_score(tokenized_ref, tokenized_cand_2) * 100, 1))

Results reference versus candidate 1 our own code BLEU:  27.6
Results reference versus candidate 2 our own code BLEU:  35.3
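
The `clipped_precision` function combines the four precisions in log space (a weighted sum of `np.log` values followed by `np.exp`), whereas the formula is written as the fourth root of a product. The sketch below checks that the two forms agree. The precision values are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical 1- to 4-gram precisions, chosen only for illustration.
precisions = [0.8, 0.6, 0.4, 0.2]
weights = [0.25] * 4

# Direct form: fourth root of the product, as in the BLEU formula.
direct = math.prod(precisions) ** 0.25

# Log-space form: exp of the weighted sum of log-precisions.
log_space = math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))

print(math.isclose(direct, log_space))  # True
```

Note that the log-space form is undefined when any precision is 0 (for example, when a candidate shares no four-gram with the reference). The product form gives a BLEU score of 0 in that case.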




**BLEU Score Interpretation on a Corpus**

|Score      | Interpretation                                                |
|:---------:|:-------------------------------------------------------------:|
| < 10      | Almost useless                                                |
| 10 - 19   | Hard to get the gist                                          |
| 20 - 29   | The gist is clear, but has significant grammatical errors     |
| 30 - 40   | Understandable to good translations                           |
| 40 - 50   | High quality translations                                     |
| 50 - 60   | Very high quality, adequate, and fluent translations          |
| > 60      | Quality often better than human                               |