# [BLEU (BiLingual Evaluation Understudy)](https://docs.kolena.io/metrics/bleu/)

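- The BLEU (BiLingual Evaluation Understudy) score is a metric for evaluating the quality of candidate texts, such as machine translations, by comparing them against reference texts.
- BLEU can be thought of as an analogue of precision for text comparisons: it measures how many of the candidate's n-grams also appear in the reference.
- The score ranges from 0 to 1, and higher is better. It is often reported scaled to 0-100, as in the table below and in the sacrebleu output.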
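
For reference, the usual way to write the score that the from-scratch code in this notebook computes is shown here. The symbols are notation chosen for this note rather than taken from the notebook: $p_n$ is the clipped n-gram precision, $w_n$ the weight of order $n$ (uniform, with $N = 4$), $c$ the candidate length in tokens, and $r$ the reference length in tokens.

$$
\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right),
\qquad
\text{BP} = \min\left(1,\; e^{\,1 - r/c}\right)
$$

With uniform weights $w_n = 1/4$, the exponential term is the geometric mean of $p_1, \dots, p_4$.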

**BLEU Score Interpretation on a Corpus (scores on a 0-100 scale)**

|Score      | Interpretation                                                |
|:---------:|:-------------------------------------------------------------:|
| < 10      | Almost useless                                                |
| 10 - 19   | Hard to get the gist                                          |
| 20 - 29   | The gist is clear, but has significant grammatical errors     |
| 30 - 40   | Understandable to good translations                           |
| 40 - 50   | High quality translations                                     |
| 50 - 60   | Very high quality, adequate, and fluent translations          |
| > 60      | Quality often better than human                               |

## Ready-to-use library

In [1]:
import sacrebleu

In [2]:
candidate = "Fall leaves rustled softly beneath our weary feet".lower()
reference = "Crisp autumn leaves rustled softly beneath our weary feet".lower()

In [3]:
print(
    "sacrebleu library BLEU: ",
    round(sacrebleu.sentence_bleu(candidate, [reference]).score, 1),
)

sacrebleu library BLEU:  74.2


## Detailed implementation from scratch

In [4]:
import math
from collections import Counter

import nltk
import numpy as np

# tokenizer data used by nltk.word_tokenize (newer NLTK releases may require "punkt_tab" instead)
nltk.download("punkt")
from nltk.util import ngrams

[nltk_data] Downloading package punkt to /Users/yifanwu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [5]:
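def brevity_penalty(candidate, reference):
    """Penalize tokenized candidates that are shorter than the tokenized reference."""
    ref_length = len(reference)
    can_length = len(candidate)
    if can_length == 0:
        # an empty candidate cannot match anything; this also avoids a division by zero
        return 0.0
    # BP is 1 when the candidate is at least as long as the reference
    BP = min(1, np.exp(1 - ref_length / can_length))
    return BP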
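
To get a feel for how the brevity penalty behaves, the sketch below evaluates it for a few length pairs. `brevity_penalty_from_lengths` is a hypothetical helper that takes token counts directly instead of token lists, and returning 0.0 for an empty candidate is an assumption of this sketch.

```python
import math


def brevity_penalty_from_lengths(candidate_len: int, reference_len: int) -> float:
    """BP = min(1, exp(1 - r / c)), with r the reference and c the candidate length."""
    if candidate_len == 0:
        # assumption of this sketch: an empty candidate gets the maximum penalty
        return 0.0
    return min(1.0, math.exp(1 - reference_len / candidate_len))


# (candidate length, reference length) pairs; (8, 9) is the notebook's example
for c_len, r_len in [(8, 9), (9, 9), (12, 9), (3, 9)]:
    print(c_len, r_len, round(brevity_penalty_from_lengths(c_len, r_len), 4))
```

Only candidates shorter than the reference are penalized. Longer candidates are not rewarded, since the value is capped at 1.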

In [6]:
def clipped_precision(candidate, reference):
    """
    Geometric mean of the clipped 1- to 4-gram precisions of a tokenized
    candidate (machine translation) against a tokenized reference.
    """

    clipped_precision_score = []

    for i in range(1, 5):
        ref_n_gram = Counter(ngrams(reference, i))
        cand_n_gram = Counter(ngrams(candidate, i))

        # total number of candidate n-grams (the precision denominator)
        c = sum(cand_n_gram.values())
        if c == 0:
            # the candidate has fewer than i tokens, so there is nothing to match
            return 0.0

        for j in cand_n_gram:
            if j in ref_n_gram:
                # clip: an n-gram is credited at most as often as it occurs in the reference
                if cand_n_gram[j] > ref_n_gram[j]:
                    cand_n_gram[j] = ref_n_gram[j]
            else:
                # n-grams absent from the reference earn no credit
                cand_n_gram[j] = 0

        clipped_precision_score.append(sum(cand_n_gram.values()) / c)

    # without smoothing, a single zero precision makes the geometric mean zero
    # (and math.log(0) would raise a ValueError)
    if min(clipped_precision_score) == 0:
        return 0.0

    # uniform weights over the four n-gram orders
    weights = [0.25] * 4

    s = (w_i * math.log(p_i) for w_i, p_i in zip(weights, clipped_precision_score))

    s = math.exp(math.fsum(s))
    return s
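
To see the individual precisions behind the single number returned above, the sketch below computes them as exact fractions for the notebook's sentence pair. `ngram_counts` and `clipped_precisions` are hypothetical helpers, not part of the notebook or of NLTK. Whitespace splitting stands in for `nltk.word_tokenize`, and `Counter` intersection (`&`) replaces the explicit clipping loop.

```python
import math
from collections import Counter
from fractions import Fraction


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precisions(candidate_tokens, reference_tokens, max_n=4):
    """Clipped precision for each n-gram order from 1 to max_n."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate_tokens, n)
        ref = ngram_counts(reference_tokens, n)
        total = sum(cand.values())
        # Counter intersection keeps the minimum count per n-gram, i.e. the clipped count
        matched = sum((cand & ref).values())
        precisions.append(Fraction(matched, total) if total else Fraction(0))
    return precisions


cand_tokens = "fall leaves rustled softly beneath our weary feet".split()
ref_tokens = "crisp autumn leaves rustled softly beneath our weary feet".split()

precisions = clipped_precisions(cand_tokens, ref_tokens)
product = math.prod(precisions)
geometric_mean = float(product) ** 0.25
print(precisions, product, geometric_mean)
```

For this pair the precisions are 7/8, 6/7, 5/6 and 4/5. Their product is 1/2, and its fourth root, about 0.8409, is the precision term of the score.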

In [7]:
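def bleu_score(candidate, reference):
    """Sentence-level BLEU (0 to 1) of a candidate string against one reference string."""
    candidate = nltk.word_tokenize(candidate)
    reference = nltk.word_tokenize(reference)
    BP = brevity_penalty(candidate, reference)
    precision = clipped_precision(candidate, reference)
    # show the two factors of the score for inspection
    print(f"{BP=}, {precision=}")
    return BP * precision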

In [8]:
print(
    "DIY BLEU: ",
    round(bleu_score(candidate, reference) * 100, 1),
)

BP=0.8824969025845955, precision=0.8408964152537145
DIY BLEU:  74.2


In [9]:
assert round(bleu_score(candidate, reference) * 100, 1) == round(
    sacrebleu.sentence_bleu(candidate, [reference]).score, 1
), "DIY BLEU does not match the sacrebleu result"

BP=0.8824969025845955, precision=0.8408964152537145
