# Text Metrics:

### BLEU:
Measures n-gram overlap between hypothesis and reference. \
Its weakness is that it is only sensitive to exact word matches, so synonyms and paraphrases get no credit. \
Output is between 0 and 1. \
Code: https://www.nltk.org/_modules/nltk/translate/bleu_score.html

In [41]:
from nltk.translate import bleu
bleu(
     ['The candidate has no alignment to any of the references'.split()],
     'The candidate has no zin to do this assignment or any of the references'.split(),
     (1,),
 )

0.6428571428571429
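
The score above can be checked by hand. With weights `(1,)`, BLEU reduces to clipped unigram precision multiplied by a brevity penalty, and the penalty is 1 here because the hypothesis is longer than the reference. The following is a minimal sketch of that calculation; the helper name `clipped_unigram_precision` is my own and not part of NLTK.

```python
from collections import Counter

def clipped_unigram_precision(reference, hypothesis):
    """Share of hypothesis tokens that also occur in the reference.

    Each token is credited at most as often as it appears in the reference
    ("clipping"), so repeating a matching word does not inflate the score.
    """
    ref_counts = Counter(reference)
    hyp_counts = Counter(hypothesis)
    matched = sum(min(count, ref_counts[token]) for token, count in hyp_counts.items())
    return matched / len(hypothesis)

reference = 'The candidate has no alignment to any of the references'.split()
hypothesis = 'The candidate has no zin to do this assignment or any of the references'.split()

# 9 of the 14 hypothesis tokens appear in the reference (matching is case-sensitive).
precision = clipped_unigram_precision(reference, hypothesis)
print(precision)
```

Nine matching tokens out of fourteen gives 9/14, which agrees with the NLTK result above.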

### ChrF:
Measures character n-gram overlap instead of word n-gram overlap, which makes it more tolerant of small spelling and inflection differences. \
Output between 0 and 1. \
Code: https://www.nltk.org/_modules/nltk/translate/chrf_score.html

In [5]:
from nltk.translate import chrf_score
chrf_score.sentence_chrf("What is going on?", "What are you doing?")

0.27753518128142923
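
To make the idea of character n-gram overlap concrete, here is a simplified sketch that computes an F-score for a single n-gram length. It is not NLTK's implementation: NLTK combines several n-gram lengths, so this will not reproduce the number above. The function names, the choice of `n=2`, and `beta=3.0` (weighting recall more than precision) are assumptions made for illustration.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams of length n, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def char_ngram_fscore(reference, hypothesis, n=2, beta=3.0):
    """F-beta score over character n-grams of a single length n."""
    ref = char_ngrams(reference, n)
    hyp = char_ngrams(hypothesis, n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & hyp).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

score = char_ngram_fscore("What is going on?", "What are you doing?")
print(score)
```

Because matching happens at the character level, the two sentences share bigrams such as `Wh`, `ha`, `at` and `ng` even though only one whole word matches.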

### ROUGE:
ROUGE-1 measures the overlap of unigrams (single words) between the hypothesis and reference. \
ROUGE-2 measures the overlap of bigrams between the hypothesis and reference. \
ROUGE-L is based on the longest common subsequence between the hypothesis and reference. \
Code: https://pypi.org/project/rouge-score/

In [42]:
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scorer.score('The quick brown fox jumps over the lazy dog',
                      'The quick brown dog jumps on the log.')

{'rouge1': Score(precision=0.75, recall=0.6666666666666666, fmeasure=0.7058823529411765),
 'rouge2': Score(precision=0.2857142857142857, recall=0.25, fmeasure=0.26666666666666666),
 'rougeL': Score(precision=0.625, recall=0.5555555555555556, fmeasure=0.5882352941176471)}
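
The longest-common-subsequence score can be reproduced with a small dynamic-programming sketch. The tokenizer below (lowercase, alphanumeric runs only) is a simplification I am assuming for illustration; the `rouge_score` package also applies stemming, which does not change the result for this particular sentence pair.

```python
import re

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def simple_tokens(text):
    """Lowercase the text and keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

reference = simple_tokens('The quick brown fox jumps over the lazy dog')
prediction = simple_tokens('The quick brown dog jumps on the log.')

lcs = lcs_length(reference, prediction)   # "the quick brown jumps the"
precision = lcs / len(prediction)         # 5 / 8
recall = lcs / len(reference)             # 5 / 9
fmeasure = 2 * precision * recall / (precision + recall)
print(precision, recall, fmeasure)
```

The subsequence does not have to be contiguous, which is why `jumps` and the second `the` still count even though other words sit between them.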

# Neural Metrics:

### BERTScore:
Leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. \
Code: https://github.com/Tiiiger/bert_score

In [79]:
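# Note: load_metric is deprecated in newer `datasets` releases;
# evaluate.load("bertscore") from the `evaluate` package replaces it.
from datasets import load_metric
bertscore = load_metric("bertscore")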
predictions = ["Hello world.", "What do you do?"]
references = ["Goodbye world.", "What are you doing?"]
bertscore.compute(predictions=predictions, references=references, model_type="distilbert-base-uncased", lang="en")

{'precision': [0.9177080988883972, 0.9117429256439209],
 'recall': [0.9177080988883972, 0.9117429256439209],
 'f1': [0.9177080988883972, 0.9117429256439209],
 'hashcode': 'distilbert-base-uncased_L5_no-idf_version=0.3.11(hug_trans=4.16.2)'}
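
The matching step can be illustrated without a model. The sketch below uses hand-made 2-dimensional vectors in place of real BERT embeddings (the vectors and function names are invented for illustration) and applies greedy matching: each candidate token is paired with its most similar reference token for precision, and the reverse for recall. Real BERTScore additionally supports idf weighting and baseline rescaling, which are left out here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_match_scores(cand_vecs, ref_vecs):
    """Precision, recall and F1 from greedy cosine matching of token vectors."""
    sims = [[cosine(c, r) for r in ref_vecs] for c in cand_vecs]
    # Precision: best reference match for every candidate token.
    precision = sum(max(row) for row in sims) / len(cand_vecs)
    # Recall: best candidate match for every reference token.
    recall = sum(max(row[j] for row in sims) for j in range(len(ref_vecs))) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy "embeddings": two tokens per sentence, two dimensions per token.
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [1.0, 1.0]]

p, r, f1 = greedy_match_scores(candidate, reference)
print(p, r, f1)
```

Because similarity is continuous rather than exact-match, a token with no identical counterpart can still contribute partial credit.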

BLEURT raises an error when computing scores, which seemingly can only be fixed by editing the source code. \
BARTScore requires downloading the model first, which I have not tested yet.

### NIST:
NIST is based on the BLEU metric, but with some alterations. Where BLEU calculates n-gram precision giving equal weight to each n-gram, NIST also calculates how informative a particular n-gram is: when a correct n-gram is found, the rarer that n-gram is, the more weight it is given. \
Although it is listed under neural metrics here, NIST is an n-gram-based text metric and does not use a neural model. \
Paper: https://dl.acm.org/doi/pdf/10.5555/1289189.1289273

In [72]:
from nltk.translate import nist_score
# n is highest n-gram order, default=5
nist_score.sentence_nist(references=["Hello world, what are you doing?.".split()], hypothesis="Hello world, what am I doing?.".split(), n=2)

1.723308333814104
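
The "rarer means more weight" idea can be sketched for unigrams, where the information weight of a word is the base-2 log of the total number of reference words divided by that word's count. This is only an illustration of the weighting, not NLTK's full NIST computation; the function name and the toy sentence are made up for the example.

```python
import math
from collections import Counter

def unigram_information(reference_tokens):
    """Information weight per word: log2(total words / count of the word)."""
    counts = Counter(reference_tokens)
    total = len(reference_tokens)
    return {token: math.log2(total / count) for token, count in counts.items()}

# Toy reference text: "the" occurs three times, every other word once.
tokens = "the cat sat on the mat near the door".split()
info = unigram_information(tokens)

for token, weight in sorted(info.items(), key=lambda item: item[1]):
    print(f"{token}: {weight:.3f}")
```

A frequent word like `the` receives a lower weight than a word that occurs once, so matching it contributes less to the score.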

We also tried some other metrics, but ran into errors or other problems with them.

In [78]:
from bleurt import score

#checkpoint = "BERT-Tiny"
references = ["This is a test."]
candidates = ["This is the test."]

scorer = score.BleurtScorer()
scorer.score(references=references, candidates=candidates)

INFO:tensorflow:No checkpoint specified, defaulting to BLEURT-tiny.
INFO:tensorflow:Reading checkpoint c:\users\lenna\appdata\local\programs\python\python38\lib\site-packages\bleurt\test_checkpoint.
INFO:tensorflow:Config file found, reading.
INFO:tensorflow:Will load checkpoint dbleurt_tiny
INFO:tensorflow:Loads full paths and checks that files exists.
INFO:tensorflow:... name:dbleurt_tiny
INFO:tensorflow:... vocab_file:vocab.txt
INFO:tensorflow:... bert_config_file:bert_config.json
INFO:tensorflow:... do_lower_case:True
INFO:tensorflow:... max_seq_length:512
INFO:tensorflow:Creating BLEURT scorer.
INFO:tensorflow:Creating WordPiece tokenizer.
INFO:tensorflow:WordPiece tokenizer instantiated.
INFO:tensorflow:Creating Eager Mode predictor.
INFO:tensorflow:Loading model.
INFO:tensorflow:BLEURT initialized.


InvalidArgumentError: cannot compute __inference_pruned_83523 as input #0(zero-based) was expected to be a int64 tensor but is a int32 tensor [Op:__inference_pruned_83523]

In [77]:
from datasets import load_metric
bleurt = load_metric("bleurt")
# The metric expects lists of strings, one entry per sentence.
predictions = ["Hello world."]
references = ["Hello world."]
bleurt.compute(predictions=predictions, references=references)

Using default BLEURT-Base checkpoint for sequence maximum length 128. You can use a bigger model for better results with e.g.: datasets.load_metric('bleurt', 'bleurt-large-512').


INFO:tensorflow:Reading checkpoint C:\Users\lenna\.cache\huggingface\metrics\bleurt\default\downloads\extracted\6d786599a9a38fb3873d1dd62aa809ba8f3dee72e87d5f86f8a28957562f3fe3\bleurt-base-128.
INFO:tensorflow:Config file found, reading.
INFO:tensorflow:Will load checkpoint bert_custom
INFO:tensorflow:Loads full paths and checks that files exists.
INFO:tensorflow:... name:bert_custom
INFO:tensorflow:... vocab_file:vocab.txt
INFO:tensorflow:... bert_config_file:bert_config.json
INFO:tensorflow:... do_lower_case:True
INFO:tensorflow:... max_seq_length:128
INFO:tensorflow:Creating BLEURT scorer.
INFO:tensorflow:Creating WordPiece tokenizer.
INFO:tensorflow:WordPiece tokenizer instantiated.
INFO:tensorflow:Creating Eager Mode predictor.
INFO:tensorflow:Loading model.
INFO:tensorflow:BLEURT initialized.


InvalidArgumentError: cannot compute __inference_pruned_82118 as input #0(zero-based) was expected to be a int64 tensor but is a int32 tensor [Op:__inference_pruned_82118]

In [None]:
# https://github.com/neulab/BARTScore
# have to download model and .py code for it to work, as well as fix the following error
from bart_score import BARTScorer
bart_scorer = BARTScorer(checkpoint='facebook/bart-large-cnn')
bart_scorer.score(['This is interesting.'], ['This is fun.'], batch_size=4)