## The BLEU Quality Metric

BLEU - Bilingual Evaluation Understudy

The Bilingual Evaluation Understudy score, or BLEU for short, is a metric for comparing a generated sentence against a reference sentence.

Although it was developed for translation, it can be used to evaluate text generated for a range of natural language processing tasks.

A perfect match results in a score of 1.0, whereas a complete mismatch results in a score of 0.0.

The score was developed for evaluating the predictions made by automatic machine translation systems. It is not perfect, but it offers five compelling benefits:
   * It is quick and inexpensive to calculate;
   * It is easy to understand;
   * It is language independent;
   * It correlates highly with human evaluation;
   * It has been widely adopted.
   
The BLEU score was proposed by Kishore Papineni et al. in their 2002 paper “BLEU: a Method for Automatic Evaluation of Machine Translation”.

How it works:
The primary programming task for a BLEU implementor is to compare the n-grams of the candidate with the n-grams of the reference translation and count the number of matches. These matches are position-independent. The more matches there are, the better the candidate translation.
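
The counting step can be sketched in plain Python. The helper names `ngrams` and `modified_precision` are illustrative and are not part of NLTK. The sketch assumes the clipping rule from the original BLEU paper, where each candidate n-gram is credited at most as many times as it occurs in any single reference.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(references, candidate, n):
    """Clipped n-gram precision of a candidate against one or more references."""
    candidate_counts = Counter(ngrams(candidate, n))
    if not candidate_counts:
        return 0.0
    # For each n-gram, keep the largest count seen in any single reference.
    max_ref_counts = Counter()
    for reference in references:
        for gram, count in Counter(ngrams(reference, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Matching ignores position: only the (clipped) counts matter.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in candidate_counts.items())
    return clipped / sum(candidate_counts.values())


# Degenerate candidate: 'the' repeated seven times.
references = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the'] * 7
# Only 2 of the 7 tokens are credited, because 'the' occurs twice in the reference.
print(modified_precision(references, candidate, 1))
```

Without clipping, the repeated word would score a perfect 7/7, which is why the matches are capped by the reference counts.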

In addition to translation, the BLEU score can be used to evaluate other language generation problems tackled with deep learning methods, such as:
   * Language generation;
   * Image caption generation;
   * Text summarization;
   * Speech recognition.
   


### Sentence BLEU Score 

NLTK provides the sentence_bleu() function for evaluating a candidate sentence against one or more reference sentences.

The reference sentences must be provided as a list of sentences where each reference is a list of tokens. The candidate sentence is provided as a list of tokens. 
For example:

In [27]:
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'old', 'man', 'walking', 'with', 'girl'], ['man', 'and', 'girl', 'are','walking', 'together']]
candidate = ['the', 'old', 'man', 'walking', 'with', 'girl']
score = sentence_bleu(reference, candidate)
print(score)

1.0
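
In practice the sentences usually start as raw strings rather than token lists. A minimal sketch of building the expected input shapes is shown below; lowercasing and splitting on whitespace are simplifying assumptions made here for illustration, and a real project may need a proper tokenizer.

```python
# Raw strings (same sentences as in the example above).
raw_references = [
    "the old man walking with girl",
    "man and girl are walking together",
]
raw_candidate = "the old man walking with girl"

# Assumption: whitespace tokenization is good enough for this toy data.
reference = [sentence.lower().split() for sentence in raw_references]
candidate = raw_candidate.lower().split()

print(reference)  # a list of token lists, one per reference
print(candidate)  # a single token list
```

The resulting `reference` and `candidate` have the shapes that sentence_bleu() expects.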


### Corpus BLEU Score

NLTK also provides a function called corpus_bleu() for calculating the BLEU score for multiple sentences such as a paragraph or a document.

The references must be specified as a list of documents where each document is a list of references and each alternative reference is a list of tokens, i.e. a list of lists of lists of tokens. The candidate documents must be specified as a list where each document is a list of tokens, i.e. a list of lists of tokens.

This is a little confusing; here is an example of two references for one document.

In [32]:
# two references for one document
from nltk.translate.bleu_score import corpus_bleu
references = [[['the', 'old', 'man', 'walking', 'with', 'girl'], ['man', 'and', 'girl', 'are','walking', 'together']]]
candidates = [['the', 'old', 'man', 'walking', 'with', 'girl']]
score = corpus_bleu(references, candidates)
print(score)

1.0
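
Because the nesting is easy to get wrong, a small shape check can help before calling corpus_bleu(). The sketch below uses made-up data with two candidates, and `check_corpus_shape` is a hypothetical helper written for this note, not an NLTK function.

```python
# Hypothetical data: two candidates, each with its own set of references.
references = [
    # two alternative references for candidate 0
    [['the', 'old', 'man', 'walking', 'with', 'girl'],
     ['man', 'and', 'girl', 'are', 'walking', 'together']],
    # a single reference for candidate 1
    [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']],
]
candidates = [
    ['the', 'old', 'man', 'walking', 'with', 'girl'],
    ['the', 'fast', 'brown', 'dog', 'jumped', 'over', 'the', 'lazy', 'dog'],
]


def check_corpus_shape(references, candidates):
    """Return True if the inputs follow the nesting described above, else raise ValueError."""
    if len(references) != len(candidates):
        raise ValueError("need exactly one reference set per candidate")
    for refs, cand in zip(references, candidates):
        if not all(isinstance(token, str) for token in cand):
            raise ValueError("each candidate must be a flat list of tokens")
        for ref in refs:
            if not isinstance(ref, list) or not all(isinstance(token, str) for token in ref):
                raise ValueError("each reference must be a flat list of tokens")
    return True


print(check_corpus_shape(references, candidates))
```

A common mistake this catches is passing a single reference without the extra list around it, which leaves the references one nesting level too shallow.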


In our situation the sentence-level BLEU score is the better fit, so the remaining examples use sentence_bleu().

### Cumulative and Individual BLEU Scores

The BLEU score calculations in NLTK allow you to specify the weighting of different n-grams in the calculation of the BLEU score.

This gives you the flexibility to calculate different types of BLEU score, such as individual and cumulative n-gram scores.

*** Individual N-Gram Scores ***

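An individual n-gram score evaluates matching n-grams of a single order, such as single words (1-gram) or word pairs (2-gram). In the example below, five of the candidate's six words match a reference and one does not, so the 1-gram score is 5/6 ≈ 0.833.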

In [35]:
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'old', 'man', 'walking', 'with', 'girl'], ['man', 'and', 'girl', 'are','walking', 'together']]
candidate = ['the', 'old', 'man', 'walking', 'with', 'boy']
score = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))
print(score)

0.8333333333333334


We can repeat this example for individual n-grams from 1 to 4 as follows:

In [36]:
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'old', 'man', 'walking', 'with', 'girl'], ['man', 'and', 'girl', 'are','walking', 'together']]
candidate = ['the', 'old', 'man', 'walking', 'with', 'boy']

print('Individual 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Individual 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 1, 0, 0)))
print('Individual 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 1, 0)))
print('Individual 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 0, 1)))

Individual 1-gram: 0.833333
Individual 2-gram: 0.800000
Individual 3-gram: 0.750000
Individual 4-gram: 0.666667


Although we can calculate the individual BLEU scores, this is not how the method was intended to be used, and the scores on their own are hard to interpret.

*** Cumulative N-Gram Scores ***

Cumulative scores refer to the calculation of individual n-gram scores at all orders from 1 to n and combining them with a weighted geometric mean.

By default, sentence_bleu() and corpus_bleu() calculate the cumulative 4-gram BLEU score, also called BLEU-4.

The weights for the BLEU-4 are 1/4 (25%) or 0.25 for each of the 1-gram, 2-gram, 3-gram and 4-gram scores. For example:

In [37]:
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'old', 'man', 'walking', 'with', 'girl'], ['man', 'and', 'girl', 'are','walking', 'together']]
candidate = ['the', 'old', 'man', 'walking', 'with', 'boy']

print('Cumulative 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Cumulative 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0)))
print('Cumulative 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0.33, 0.33, 0.33, 0)))
print('Cumulative 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25)))

Cumulative 1-gram: 0.833333
Cumulative 2-gram: 0.816497
Cumulative 3-gram: 0.795536
Cumulative 4-gram: 0.759836
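
The cumulative values above can be reproduced by hand from the individual n-gram scores. The sketch below assumes that no length penalty applies here, since the candidate and the matching reference both have six tokens. `weighted_geometric_mean` is an illustrative helper, not an NLTK function.

```python
import math

# Individual n-gram precisions from the earlier example (orders 1 to 4).
precisions = [5 / 6, 4 / 5, 3 / 4, 2 / 3]


def weighted_geometric_mean(values, weights):
    """Compute exp(sum(w * log(v))), skipping orders whose weight is zero."""
    return math.exp(sum(w * math.log(v)
                        for v, w in zip(values, weights) if w > 0))


weight_sets = [
    (1, 0, 0, 0),
    (0.5, 0.5, 0, 0),
    (0.33, 0.33, 0.33, 0),
    (0.25, 0.25, 0.25, 0.25),
]
for n, weights in enumerate(weight_sets, start=1):
    print('Cumulative %d-gram: %f' % (n, weighted_geometric_mean(precisions, weights)))
```

Note that the weights (0.33, 0.33, 0.33, 0) do not sum to exactly 1. With exact weights of 1/3 the 3-gram value would be the cube root of 0.5, about 0.7937, which is slightly lower than the 0.795536 shown above.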


*** One more example *** 

In [44]:
# two words differ from the reference
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'fast', 'brown', 'dog', 'jumped', 'over', 'the', 'lazy', 'dog']
#score = sentence_bleu(reference, candidate)
#print(score)

print('Individual 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Individual 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 1, 0, 0)))
print('Individual 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 1, 0)))
print('Individual 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 0, 1)))

print('Cumulative 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Cumulative 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0)))
print('Cumulative 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0.33, 0.33, 0.33, 0)))
print('Cumulative 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25)))

Individual 1-gram: 0.777778
Individual 2-gram: 0.500000
Individual 3-gram: 0.428571
Individual 4-gram: 0.333333
Cumulative 1-gram: 0.777778
Cumulative 2-gram: 0.623610
Cumulative 3-gram: 0.553618
Cumulative 4-gram: 0.485492


The same comparison with a shorter candidate (the word 'lazy' is dropped):

In [25]:
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'fast', 'brown', 'dog', 'jumped',  'over', 'the', 'dog']
#score = sentence_bleu(reference, candidate)
#print(score)

print('Individual 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Individual 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 1, 0, 0)))
print('Individual 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 1, 0)))
print('Individual 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 0, 1)))

print('Cumulative 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Cumulative 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0)))
print('Cumulative 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0.33, 0.33, 0.33, 0)))
print('Cumulative 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25)))

Individual 1-gram: 0.661873
Individual 2-gram: 0.252142
Individual 3-gram: 0.147083
Individual 4-gram: 0.000000
Cumulative 1-gram: 0.661873
Cumulative 2-gram: 0.408517
Cumulative 3-gram: 0.293867
Cumulative 4-gram: 0.000000


The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
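
Two things happen in this last run, and both can be checked by hand. First, the candidate (8 tokens) is shorter than the reference (9 tokens), so the scores are scaled down by a brevity penalty. The formula exp(1 - r/c) used below is taken from the original BLEU definition and is an assumption about what NLTK applies here. Second, a single zero precision forces the geometric mean to zero. The `bleu` helper and its `epsilon` parameter are illustrative only; replacing a zero match count with a small epsilon is one simple form of smoothing and does not reproduce every method offered by NLTK's SmoothingFunction.

```python
import math

# Counts read off the last example: clipped matches / candidate n-grams, orders 1 to 4.
matches = [6, 2, 1, 0]
totals = [8, 7, 6, 5]
candidate_len, reference_len = 8, 9


def brevity_penalty(c, r):
    """Penalize candidates that are shorter than the reference."""
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(matches, totals, weights, bp, epsilon=0.0):
    """Brevity penalty times the weighted geometric mean of the n-gram precisions."""
    log_sum = 0.0
    for m, t, w in zip(matches, totals, weights):
        if w == 0:
            continue
        if m == 0:
            if epsilon == 0:
                return 0.0  # one zero precision zeroes the whole score
            m = epsilon     # assumption: substitute a small count instead
        log_sum += w * math.log(m / t)
    return bp * math.exp(log_sum)


bp = brevity_penalty(candidate_len, reference_len)
print('Individual 1-gram: %f' % bleu(matches, totals, (1, 0, 0, 0), bp))
print('Cumulative 4-gram: %f' % bleu(matches, totals, (0.25, 0.25, 0.25, 0.25), bp))
print('Smoothed 4-gram is positive:',
      bleu(matches, totals, (0.25, 0.25, 0.25, 0.25), bp, epsilon=0.1) > 0)
```

The 1-gram line shows why the score is 0.661873 rather than 6/8 = 0.75. To follow the warning's advice in NLTK itself, pass a smoothing function, for example `sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)` with `SmoothingFunction` imported from `nltk.translate.bleu_score`.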


*** ------------------------------------------- ***

*** In our problem we will apply sentence_bleu() with its default cumulative 4-gram score, also called BLEU-4 ***

*** ------------------------------------------- ***