<hr style='height:3pt'>

# Evaluation of the Caption Generation

<hr style='height:3pt'>

Many techniques exist for evaluating the similarity between a machine-generated caption and a human-generated one. Typically the similarity is computed between a **candidate sentence** generated by an ML algorithm and one or more **reference sentences** written by a human. A few examples include:
- **BLEU (2002)**
    - At its core, BLEU is the precision of the candidate sentence, i.e., the proportion of words in the candidate sentence that also appear in the reference sentence. It extends this to several n-gram sizes and combines the per-size precisions with a weighted geometric mean. A more thorough description and an example implementation in Python can be found [here](https://machinelearningmastery.com/calculate-bleu-score-for-text-python/). The full metric also applies a brevity penalty to candidate sentences that are shorter than the reference sentence.
    
    
- **ROUGE (2004)**
    - The recall of the candidate sentence: the proportion of words in the reference sentence that also appear in the candidate sentence. It is essentially the complement to BLEU, and the two are often combined into a reported F1 score. Read more [here](https://stackoverflow.com/questions/38045290/text-summarization-evaluation-bleu-vs-rouge).
    
    
- **METEOR (2005)**
    - An extension to the precision/recall combo that algorithmically finds a mapping between the candidate text and the reference text, then uses that to compute the score. Wikipedia says "Results have been presented which give correlation of up to 0.964 with human judgement at the corpus level, compared to BLEU's achievement of 0.817 on the same data set." This method also factors in synonyms. [source](https://en.wikipedia.org/wiki/METEOR)
    
    
- **CIDEr (2015)**
    - This method was developed specifically for image captioning. It extends the previous methods by applying TF-IDF weighting before comparing the co-occurrence of n-grams between the candidate and the reference (typically a set of reference sentences). It can be less effective when the weighting gives disproportionate weight to unimportant words that happen to occur infrequently. [source](https://en.wikipedia.org/wiki/METEOR)
    

- **WMD (2015)**
    - Uses word embeddings and something similar to the Wasserstein distance to compute the discrepancy between a candidate sentence and a reference sentence. This captures semantic similarities between two sentences that may not share common words or even synonyms. [Here](https://vene.ro/blog/word-movers-distance-in-python.html) is a Python blog post about it.
    
    
- **SPICE (2016)**
    - SPICE breaks sentences down into semantically meaningful components such as objects, attributes, and relation types. This graph structure is then used to create tuples of semantically related words, and an F1 score is computed over the tuples of the candidate and the reference sentence(s). [This](https://aclweb.org/anthology/E17-1019) paper does a good job of summarizing this and all the above metrics.
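
The precision/recall view behind the BLEU and ROUGE bullets above can be sketched in a few lines. This is only a minimal illustration on whitespace-split unigrams, not the implementation of either metric; the function name `unigram_overlap_scores` and the example sentences are made up for this sketch.

```python
from collections import Counter


def unigram_overlap_scores(candidate, reference):
    """Return (precision, recall, f1) over whitespace-split unigram counts."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    # Precision: share of candidate words found in the reference (BLEU-like).
    precision = overlap / sum(cand.values()) if cand else 0.0
    # Recall: share of reference words found in the candidate (ROUGE-like).
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Hypothetical sentences: the candidate is a truncated copy of the reference.
precision, recall, f1 = unigram_overlap_scores("a small test", "this is a small test")
print(precision, recall, f1)
```

Here the truncated candidate gets a perfect precision but a recall of only 3/5, which is the kind of case that a penalty on short candidates is meant to catch.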
    
    
The paper linked [here](https://aclweb.org/anthology/E17-1019) does a phenomenal job of providing visual and tabular comparisons of each of the aforementioned metrics. The paper also examines their correlation with each other, concluding that the n-gram metrics (BLEU, ROUGE, METEOR, CIDEr) can complement the embedding (WMD) and graph-based (SPICE) ones. Here is a table and figure from the paper:

![](nlp_metrics.png)


Now, I want to try METEOR and WMD in combination to capture both lexical and semantic similarities, but I will first go through each metric to see how difficult it is to get working in Python.

# BLEU Example

In [3]:
# 4-gram cumulative BLEU
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
smoother = SmoothingFunction()
reference = [['this', 'is', 'small', 'test']]  # List of tokenized reference sentences
candidate = ['this', 'is', 'a', 'test']  # Tokenized candidate sentence
# Smoothing keeps the score non-zero when some n-gram orders have no matches
score = sentence_bleu(reference, candidate,
                      smoothing_function=smoother.method4,
                      weights=(0.25, 0.25, 0.25, 0.25))  # Equal weights for 1- to 4-grams
print(score)

0.2866227639866161
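
To see why a smoothing function is passed above, it may help to compute the clipped n-gram precisions by hand. The sketch below is a simplified illustration; the helpers `ngrams` and `modified_precision` are written here for this purpose rather than imported from NLTK, and they leave out the brevity penalty and the smoothing.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a tokenized candidate against references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the highest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Clip candidate counts so repeated words cannot inflate the score.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


reference = [['this', 'is', 'small', 'test']]
candidate = ['this', 'is', 'a', 'test']
precisions = [modified_precision(candidate, reference, n) for n in range(1, 5)]
print(precisions)  # [0.75, 0.3333333333333333, 0.0, 0.0]
```

The 3-gram and 4-gram precisions are zero for this pair, so an unsmoothed geometric mean over all four orders would collapse to zero. Smoothing is one way to keep short sentence pairs like this one comparable.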


# ROUGE Example

In [11]:
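# pip install rouge
from rouge import Rouge

reference = 'this is small test'  # Reference sentence
candidate = 'this is a test'  # Candidate sentence
rouge = Rouge()
# get_scores takes the hypothesis (candidate) first, then the reference.
# The recorded output below came from a `hypothesis` variable defined elsewhere
# in the notebook, so it will differ for this candidate.
scores = rouge.get_scores(candidate, reference)
scores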

[{'rouge-1': {'f': 0.1499999982, 'p': 0.08333333333333333, 'r': 0.75},
  'rouge-2': {'f': 0.0, 'p': 0.0, 'r': 0.0},
  'rouge-l': {'f': 0.05616438356152923, 'p': 0.05555555555555555, 'r': 0.5}}]
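
The `rouge-l` entry is based on the longest common subsequence (LCS) of the two sentences. As a rough sketch of the idea, the code below computes LCS-based precision and recall on whitespace tokens and combines them with a plain F1. The `rouge` package may weight precision and recall differently, so the numbers are not guaranteed to match it; `lcs_length` and `rouge_l` are names chosen for this illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def rouge_l(candidate, reference):
    """LCS-based precision, recall and plain F1 (a simplification)."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    p = lcs / len(cand) if cand else 0.0
    r = lcs / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if lcs else 0.0
    return {'p': p, 'r': r, 'f': f}


print(rouge_l('this is a test', 'this is small test'))
```

For these two sentences the LCS is `this is test` (3 tokens out of 4 on each side), so precision and recall are both 0.75.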

# METEOR Example

Follow installation instructions [here](https://github.com/Maluuba/nlg-eval), then run ```python setup.py install```

In [7]:
from nlgeval import compute_individual_metrics
reference = ['this is small test']  # List of reference sentences
candidate = 'this is a test'  # Candidate sentence
metrics_dict = compute_individual_metrics(reference, candidate)

In [8]:
metrics_dict

{'Bleu_1': 0.7499999996250004,
 'Bleu_2': 0.4999999997291671,
 'Bleu_3': 4.999999996944452e-06,
 'Bleu_4': 1.8803015450937985e-08,
 'METEOR': 0.2900878266954308,
 'ROUGE_L': 0.75,
 'CIDEr': 0.0,
 'SkipThoughtCS': 0.84144896,
 'EmbeddingAverageCosineSimilairty': 0.950262,
 'VectorExtremaCosineSimilarity': 0.833967,
 'GreedyMatchingScore': 0.891991}
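
For intuition about what METEOR measures, here is a minimal sketch of the scoring formula as described on the Wikipedia page linked earlier: a recall-weighted harmonic mean of unigram precision and recall, reduced by a fragmentation penalty. It assumes exact word matches only (no stemming or synonyms) and uses a greedy left-to-right alignment rather than the alignment search of the real metric. The METEOR version wrapped by `nlg-eval` uses a different parameterisation, so this sketch is not expected to reproduce the value above; `simple_meteor` is a hypothetical name.

```python
def simple_meteor(candidate, reference):
    """Simplified METEOR: exact matches, greedy alignment, original penalty."""
    cand, ref = candidate.split(), reference.split()
    # Greedy alignment: each candidate word takes the first unused
    # reference position holding the same word (a simplification).
    used = set()
    alignment = []  # pairs of (candidate index, reference index)
    for i, word in enumerate(cand):
        for j, ref_word in enumerate(ref):
            if j not in used and word == ref_word:
                used.add(j)
                alignment.append((i, j))
                break
    matches = len(alignment)
    if matches == 0:
        return 0.0
    precision = matches / len(cand)
    recall = matches / len(ref)
    # Harmonic mean that weights recall nine times more than precision.
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # A chunk is a run of matches that is contiguous in both sentences.
    chunks = 1
    for (i0, j0), (i1, j1) in zip(alignment, alignment[1:]):
        if i1 != i0 + 1 or j1 != j0 + 1:
            chunks += 1
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)


print(simple_meteor('this is a test', 'this is small test'))
```

For this pair the three matched words form two chunks (`this is` and `test`), so the penalty is 0.5 * (2/3)^3 and the score comes out at roughly 0.64.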

# WMD Example

Try [this gensim WMD tutorial notebook](https://github.com/RaRe-Technologies/gensim/blob/c971411c09773488dbdd899754537c0d1a9fce50/docs/notebooks/WMD_tutorial.ipynb).

I determined that the computational overhead was too high.
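
One cheaper alternative is the relaxed WMD lower bound from the original WMD paper, in which every word simply moves all of its weight to the nearest word of the other sentence, so no optimal transport problem has to be solved. The sketch below is a pure-Python illustration under strong assumptions: the 2-D vectors in `TOY_VECTORS` are invented for the example and stand in for real word embeddings, and every token is given equal weight.

```python
import math

# Hypothetical 2-D "embeddings", made up purely for illustration.
TOY_VECTORS = {
    'this': (1.0, 0.0),
    'is': (0.9, 0.1),
    'a': (0.5, 0.5),
    'small': (0.2, 0.9),
    'little': (0.25, 0.85),
    'test': (0.0, 1.0),
}


def one_sided_cost(source, target, vectors):
    """Move each source token's weight to its nearest target token."""
    total = 0.0
    for word in source:
        nearest = min(math.dist(vectors[word], vectors[other]) for other in target)
        total += nearest / len(source)  # each token carries equal weight
    return total


def relaxed_wmd(candidate, reference, vectors=TOY_VECTORS):
    """Relaxed WMD: the larger of the two one-sided nearest-word costs."""
    cand, ref = candidate.split(), reference.split()
    return max(one_sided_cost(cand, ref, vectors),
               one_sided_cost(ref, cand, vectors))


print(relaxed_wmd('this is little test', 'this is small test'))
print(relaxed_wmd('this is a test', 'this is small test'))
```

With these toy vectors, swapping `small` for the nearby `little` gives a much smaller distance than swapping it for `a`, even though both candidates share the same three words with the reference. Because this is only a lower bound on the full WMD, the ranking it produces may differ from the exact metric.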