## Performance Evaluation of MT
### BLEU Score: Adequacy and Fluency of Translations

1. Calculate the modified n-gram precision for the full corpus for all n=1...N
2. Calculate the geometric mean (gm-precision) of all the precisions
3. Calculate the brevity penalty (bp) for the full corpus
4. Calculate BLEU as bp * gm-precision
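
As a compact reference, the steps above can be written as a single formula. This is the standard BLEU definition with uniform weights; the symbols $p_n$, $w_n$, $c$ and $r$ are notation introduced here for illustration and are not used elsewhere in this notebook.

$$
\mathrm{BLEU} = \mathrm{bp} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right),
\qquad
\mathrm{bp} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
$$

Here $p_n$ is the modified n-gram precision over the full corpus, $w_n = 1/N$ is the weight of each precision (which makes the exponential term the geometric mean), $c$ is the total candidate length and $r$ is the total reference length.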

Example Calculation:

* Candidate1: the the the the the the the
* Candidate2: the cat is on the mat
* Ref1: the cat sat on the mat
* Ref2: there is a cat on the mat

#### Modified 1-gram Precision (Measures adequacy)
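* Candidate1: $\frac{2}{7}$ ("the" occurs at most twice in any single reference, so its 7 occurrences are clipped to 2)
* Candidate2: $\frac{6}{6}$ (all six words occur in a reference, and no count needs clipping)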

#### Modified 2-gram Precision (Measures fluency)
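* Candidate1: $\frac{0}{6}$ (none of the six "the the" bigrams occurs in a reference)
* Candidate2: $\frac{3}{5}$ ("the cat", "on the" and "the mat" occur in a reference; "cat is" and "is on" do not)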
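
The hand calculation above can be checked with a short sketch. It is a minimal illustration, not the implementation used later in this notebook: the helper names `ngrams` and `modified_precision` are assumptions introduced here, and it works directly on word lists and clips each candidate n-gram count by the maximum count found in any single reference.

```python
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list, as tuples
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    # Hypothetical helper: returns (clipped matches, total candidate n-grams)
    cand_counts = Counter(ngrams(candidate, n))

    # For each n-gram, the largest count seen in any single reference
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    # Clip every candidate count by the reference maximum
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped, total

refs = ["the cat sat on the mat".split(),
        "there is a cat on the mat".split()]
cand1 = "the the the the the the the".split()
cand2 = "the cat is on the mat".split()

for name, cand in [("Candidate1", cand1), ("Candidate2", cand2)]:
    for n in (1, 2):
        clipped, total = modified_precision(cand, refs, n)
        print(f"{name} modified {n}-gram precision: {clipped}/{total}")
```

For the example sentences this prints 2/7 and 0/6 for Candidate1, and 6/6 and 3/5 for Candidate2.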


In [1]:
import numpy as np
from collections import Counter

In [11]:
def unique_n_gram_string(n_gram):
    # Join the word IDs of an n-gram into a single hashable key, e.g. [10, 11] -> '10-11'
    return '-'.join(str(g) for g in n_gram)

def calculate_mod_n_gram_precision(n_gram, refs, cands):
    
    denominator = 0.0
    
    tot_bleu = 0.0
    
    tot_ref_length, tot_cand_length = 0, 0
    for ref, cand in zip(refs, cands):
        
        print('\tReference sentence: ',' '.join([reverse_test_dict[r] for r in ref]))
        print('\t  Candidate sentence: ',' '.join([reverse_test_dict[c] for c in cand]))

        denominator += max(cand.size + 1 - n_gram,1)
        tot_ref_length += ref.size
        tot_cand_length += cand.size
        
        # count every n-gram in the candidate and in the reference
        cand_n_grams = [unique_n_gram_string(cand[w_i:w_i+n_gram]) for w_i in range(cand.size + 1 - n_gram)]
        cand_counts = Counter(cand_n_grams)

        ref_n_grams = [unique_n_gram_string(ref[w_i:w_i+n_gram]) for w_i in range(ref.size + 1 - n_gram)]
        ref_counts = Counter(ref_n_grams)

        # clip each candidate n-gram count by its count in the reference, so an
        # n-gram is never credited more often than it occurs in either sentence
        for g, cand_count in cand_counts.items():
            tot_bleu += min(cand_count, ref_counts[g])

    mod_n_prec = tot_bleu/denominator
    
    
    return mod_n_prec, tot_ref_length, tot_cand_length


def calculate_bleu(refs, cands, high_n):
    weight = 1.0/high_n # using the same weight for all mod n_gram precisions
    
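    tot_precision = []
    for n in range(1,high_n+1): 
        print('Calculating modified %d-gram precision'%n)
        prec, tot_ref_length, tot_cand_length = calculate_mod_n_gram_precision(n,refs,cands)
        # the small constant avoids log(0) when a precision is exactly zero
        tot_precision.append(weight*np.log(prec + 1e-100))
        
    # the brevity penalty only applies when the candidates are not longer than the references
    brevity_penalty = 1.0
    
    if tot_cand_length <= tot_ref_length:
        brevity_penalty = np.exp(1.0-(tot_ref_length*1.0/max(tot_cand_length,1)))
        
    # exp of the weighted sum of log-precisions is the geometric mean of the precisions
    bleu = brevity_penalty * np.exp(np.sum(tot_precision))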

    return bleu

test_dict = {'the':10,'cat':11,'sat':12,'on':13,'mat':14,'is':15,'there':16,'a':17,'dog':18}
reverse_test_dict = dict(zip(test_dict.values(),test_dict.keys()))

sample_text_refs = [['the','cat','sat','on','the','mat'],['there','is','a','cat','on','the','mat']]
sample_refs = []
for r in sample_text_refs:
    sample_refs.append(np.asarray([test_dict[w] for w in r],dtype=np.int32))

sample_text_cands_1 = [['the','the','the'],['the','a','cat']]
sample_cands_1 = []
for c1 in sample_text_cands_1:
    sample_cands_1.append(np.asarray([test_dict[w] for w in c1],dtype=np.int32))


sample_text_cands_2 = [['the','dog','on','the','mat'],['there','is','cat','on','the','mat']]
sample_cands_2 = []
for c2 in sample_text_cands_2:
    sample_cands_2.append(np.asarray([test_dict[w] for w in c2],dtype=np.int32))


b1 = calculate_bleu(sample_refs,sample_cands_1,3)
print('\nBLEU-3: ',b1)
print()

b2 = calculate_bleu(sample_refs,sample_cands_2,3)
print('\nBLEU-3: ',b2)

Calculating modified 1-gram precision
	Reference sentence:  the cat sat on the mat
	  Candidate sentence:  the the the
	Reference sentence:  there is a cat on the mat
	  Candidate sentence:  the a cat
Calculating modified 2-gram precision
	Reference sentence:  the cat sat on the mat
	  Candidate sentence:  the the the
	Reference sentence:  there is a cat on the mat
	  Candidate sentence:  the a cat
Calculating modified 3-gram precision
	Reference sentence:  the cat sat on the mat
	  Candidate sentence:  the the the
	Reference sentence:  there is a cat on the mat
	  Candidate sentence:  the a cat

BLEU-3:  8.568589920310384e-35

Calculating modified 1-gram precision
	Reference sentence:  the cat sat on the mat
	  Candidate sentence:  the dog on the mat
	Reference sentence:  there is a cat on the mat
	  Candidate sentence:  there is cat on the mat
Calculating modified 2-gram precision
	Reference sentence:  the cat sat on the mat
	  Candidate sentence:  the dog on the mat
	Reference sente