# Text Generation Metrics

We use the Hugging Face `evaluate` library for most of the metrics shown. The available metrics and their documentation are listed here: https://huggingface.co/evaluate-metric


In [None]:
!pip install evaluate sacrebleu rouge_score bert_score unbabel-comet
import evaluate

## BLEU

In [None]:
bleu = evaluate.load("bleu")

In [None]:
pred = "เขา หาม มเหสี"
target = "เขา หาม หมา มเหสี"
results = bleu.compute(predictions=[pred], references=[[target]], tokenizer=lambda s: s.split(" "))
results

{'bleu': 0.0,
 'precisions': [1.0, 0.5, 0.0, 0.0],
 'brevity_penalty': 0.7165313105737893,
 'length_ratio': 0.75,
 'translation_length': 3,
 'reference_length': 4}
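To see where these numbers come from, here is a minimal pure-Python sketch of the clipped n-gram precisions and the brevity penalty for one prediction and one reference. The helper names `ngrams` and `bleu_components` are hypothetical and not part of `evaluate`, and the sketch ignores smoothing and multiple references. Because the 3-gram and 4-gram precisions are zero, the geometric mean, and therefore BLEU, is zero.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_components(pred_tokens, ref_tokens, max_order=4):
    """Return (precisions, brevity_penalty, bleu) for one prediction/reference pair.

    Hypothetical helper for illustration; assumes a non-empty prediction.
    """
    precisions = []
    for n in range(1, max_order + 1):
        pred_counts = ngrams(pred_tokens, n)
        ref_counts = ngrams(ref_tokens, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        matches = sum((pred_counts & ref_counts).values())
        total = sum(pred_counts.values())
        precisions.append(matches / total if total > 0 else 0.0)

    # The brevity penalty punishes predictions that are shorter than the reference.
    c, r = len(pred_tokens), len(ref_tokens)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)

    # Geometric mean of the precisions; a single zero precision makes BLEU zero.
    if min(precisions) > 0:
        log_mean = sum(math.log(p) for p in precisions) / max_order
        bleu_score = brevity_penalty * math.exp(log_mean)
    else:
        bleu_score = 0.0
    return precisions, brevity_penalty, bleu_score


precisions, bp, score = bleu_components(
    "เขา หาม มเหสี".split(" "), "เขา หาม หมา มเหสี".split(" ")
)
print(precisions, bp, score)
```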

## ChrF

In [None]:
chrf = evaluate.load("chrf")

In [None]:
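# The default word_order=0 gives chrF. Setting word_order=2 gives chrF++, whose word
# n-grams are split on whitespace, so Thai input must be pre-segmented with spaces.
results = chrf.compute(predictions=[pred], references=[[target]])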
results

{'score': 51.34138057521868, 'char_order': 6, 'word_order': 0, 'beta': 2}

## ROUGE

In [None]:
rouge = evaluate.load("rouge")

In [None]:
candidates = ["Summarization is cool"]
references = [["Summarization is beneficial and cool","Summarization saves time"]]

results = rouge.compute(predictions=candidates, references=references)
print(results)

{'rouge1': 0.7499999999999999, 'rouge2': 0.3333333333333333, 'rougeL': 0.7499999999999999, 'rougeLsum': 0.7499999999999999}


In [None]:
candidates = ["A fast brown fox leaps over a sleeping dog"]
references = [["The quick brown fox jumps over the lazy dog"]]

results = rouge.compute(predictions=candidates, references=references)
print(results)

{'rouge1': 0.4444444444444444, 'rouge2': 0.125, 'rougeL': 0.4444444444444444, 'rougeLsum': 0.4444444444444444}
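The ROUGE-1 and ROUGE-2 values above can be reproduced by hand. The following is a minimal sketch, assuming lowercasing and whitespace tokenization (the real library uses its own tokenizer and optional stemming); `rouge_n` is a hypothetical helper, not part of `evaluate`. With several references, taking the best F1 is consistent with the first output above.

```python
from collections import Counter


def rouge_n(prediction, reference, n=1):
    """Return (precision, recall, f1) for n-gram overlap on whitespace tokens."""
    def ngram_counts(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred_counts = ngram_counts(prediction)
    ref_counts = ngram_counts(reference)
    # Overlap is clipped: each n-gram counts at most as often as in the other text.
    overlap = sum((pred_counts & ref_counts).values())
    pred_total = sum(pred_counts.values())
    ref_total = sum(ref_counts.values())
    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# First example: two references, keep the best F1 per metric.
refs = ["Summarization is beneficial and cool", "Summarization saves time"]
best_rouge1 = max(rouge_n("Summarization is cool", ref, n=1)[2] for ref in refs)
best_rouge2 = max(rouge_n("Summarization is cool", ref, n=2)[2] for ref in refs)
print(best_rouge1, best_rouge2)

# Second example: a single reference.
fox_pred = "A fast brown fox leaps over a sleeping dog"
fox_ref = "The quick brown fox jumps over the lazy dog"
fox_rouge1 = rouge_n(fox_pred, fox_ref, n=1)[2]
fox_rouge2 = rouge_n(fox_pred, fox_ref, n=2)[2]
print(fox_rouge1, fox_rouge2)
```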


The Hugging Face `evaluate` ROUGE metric does not work out of the box for Thai. See:

https://stackoverflow.com/questions/73963171/rouge-score-metric-for-non-english-arabic-language-is-not-working

https://stackoverflow.com/questions/76633871/why-rouge-score-results-are-confusing-for-non-english-languages

https://github.com/huggingface/evaluate/issues/108

It seems that the `rouge_score` library that this metric uses filters out every character that is not a Latin letter or digit, in `rouge_scorer/tokenize.py`, with `text = re.sub(r"[^a-z0-9]+", " ", six.ensure_str(text))`. Thai text is therefore reduced to whitespace before any n-grams are counted.
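The effect of that pattern is easy to check in isolation. This is a small sketch, not the library's actual tokenizer: `default_style_tokenize` is a hypothetical re-creation that lowercases the text (an assumption), applies the regular expression quoted above, and splits on whitespace.

```python
import re

# The pattern quoted above: any run of characters outside a-z and 0-9 becomes a space.
NON_ALPHANUM = re.compile(r"[^a-z0-9]+")


def default_style_tokenize(text):
    """Hypothetical re-creation of the filtering step: lowercase, replace, split."""
    cleaned = NON_ALPHANUM.sub(" ", text.lower())
    return cleaned.split()


english_tokens = default_style_tokenize("Summarization is cool")
thai_tokens = default_style_tokenize("เขา หาม มเหสี")
print(english_tokens)  # ['summarization', 'is', 'cool']
print(thai_tokens)     # []
```

With no tokens left on the Thai side, there is nothing for ROUGE to match.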

`RougeScorer` accepts a `tokenizer` keyword argument, so we can call `rouge_score` directly and pass a whitespace tokenizer for pre-segmented Thai text.

In [None]:
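from rouge_score import rouge_scorer

pred = "เขา หาม มเหสี"
target = "เขา หาม หมา มเหสี"


class MyTokenizer:
    """Minimal tokenizer for pre-segmented text: split on single spaces."""

    def tokenize(self, text):
        return text.split(" ")


# Pass a tokenizer instance so that the default Latin-only filtering is bypassed.
r_scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], tokenizer=MyTokenizer())
# RougeScorer.score takes the reference first, then the prediction.
results = r_scorer.score(target, pred)
results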

{'rouge1': Score(precision=1.0, recall=0.75, fmeasure=0.8571428571428571),
 'rouge2': Score(precision=0.5, recall=0.3333333333333333, fmeasure=0.4),
 'rougeL': Score(precision=1.0, recall=0.75, fmeasure=0.8571428571428571)}

## METEOR

In [None]:
meteor = evaluate.load("meteor")

In [None]:
pred = "the cat sat on the mat"
target = "the cat sat on the mat"
results = meteor.compute(predictions=[pred], references=[[target]])
results

{'meteor': 0.9976851851851852}

In [None]:
pred = "the cat sat on the mat"
target = "the cat sat on the big mat"
results = meteor.compute(predictions=[pred], references=[[target]])
results

{'meteor': 0.8534621578099838}
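Note that the identical pair scores slightly below 1. This comes from METEOR's fragmentation penalty, which is never zero when at least one chunk is matched. The sketch below recomputes both scores from hand-counted matches and chunks. `meteor_from_counts` is a hypothetical helper, and the parameter values `alpha=0.9`, `beta=3`, `gamma=0.5` are assumptions that happen to reproduce the two outputs above.

```python
def meteor_from_counts(matches, chunks, pred_len, ref_len, alpha=0.9, beta=3.0, gamma=0.5):
    """Combine unigram precision/recall with a fragmentation penalty.

    matches: number of aligned unigrams (counted by hand here)
    chunks:  number of contiguous aligned runs (counted by hand here)
    """
    if matches == 0:
        return 0.0
    precision = matches / pred_len
    recall = matches / ref_len
    # Weighted harmonic mean that favours recall when alpha is close to 1.
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # More chunks for the same number of matches means a more fragmented alignment.
    penalty = gamma * (chunks / matches) ** beta
    return f_mean * (1 - penalty)


# Identical sentences: 6 matches in a single chunk.
identical = meteor_from_counts(matches=6, chunks=1, pred_len=6, ref_len=6)
# Reference has an extra "big": 6 matches split into "the cat sat on the" and "mat".
missing_word = meteor_from_counts(matches=6, chunks=2, pred_len=6, ref_len=7)
print(identical, missing_word)
```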

## TER

In [None]:
ter = evaluate.load("ter")

In [None]:
pred = "the cat sat on the mat"
target = "the cats sat on the mat"
results = ter.compute(predictions=[pred], references=[[target]])
results

{'score': 16.666666666666664, 'num_edits': 1, 'ref_length': 6.0}

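Next, move the word "sat" to the end of the reference. TER counts the shift as a single edit, so together with the cat/cats substitution this gives two edits.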

In [None]:
pred = "the cat sat on the mat"
target = "the cats on the mat sat"
results = ter.compute(predictions=[pred], references=[[target]])
results

{'score': 33.33333333333333, 'num_edits': 2, 'ref_length': 6.0}

Moving the whole phrase "on the mat" to the front also counts as a single shift edit.

In [None]:
pred = "the cat sat on the mat"
target = "on the mat the cat sat"
results = ter.compute(predictions=[pred], references=[[target]])
results

{'score': 16.666666666666664, 'num_edits': 1, 'ref_length': 6.0}
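To see how much the shift operation matters, compare with a plain word-level edit distance that only allows insertions, deletions and substitutions. This is a minimal sketch, not the TER algorithm itself; `word_edit_distance` is a hypothetical helper.

```python
def word_edit_distance(pred_tokens, ref_tokens):
    """Word-level Levenshtein distance (insert, delete, substitute; no shifts)."""
    previous = list(range(len(ref_tokens) + 1))
    for i, pred_word in enumerate(pred_tokens, start=1):
        current = [i]
        for j, ref_word in enumerate(ref_tokens, start=1):
            cost = 0 if pred_word == ref_word else 1
            current.append(min(
                previous[j] + 1,         # delete pred_word
                current[j - 1] + 1,      # insert ref_word
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


hyp = "the cat sat on the mat".split()
one_sub = word_edit_distance(hyp, "the cats sat on the mat".split())
block_move = word_edit_distance(hyp, "on the mat the cat sat".split())

# Without a shift involved, edits / reference length gives the same percentage as TER.
print(100 * one_sub / 6)
# Without the shift operation, the block move costs more than the single edit TER reports.
print(block_move)
```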

## BERTScore

In [None]:
bertscore = evaluate.load("bertscore")

The original BERTScore paper showed that BERTScore correlates well with human judgment on sentence-level and system-level evaluation, but this depends on the model and language pair selected.

Languages supported by Multilingual BERT: https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages

>The multilingual model supports the following languages. These languages were chosen because they are the top 100 languages with the largest Wikipedias [...]
>
> The **Multilingual Cased (New)** release contains additionally **Thai** and **Mongolian**, which were not included in the original release.

Note that calculating the BERTScore metric involves downloading the BERT model that is used to compute the score. The default model for `en`, `roberta-large`, takes over 1.4GB of storage space, and downloading it can take a significant amount of time depending on the speed of your internet connection. If this is an issue, choose a smaller model; for instance, `distilbert-base-uncased` is 268MB.

Using `lang="th"` selects `bert-base-multilingual-cased`, which should support Thai. The `hashcode` field in the output records the exact model, layer and library versions that were used.

In [None]:
pred = "เขาหามมเหสี"
target = "เขาหามหมามเหสี"
results = bertscore.compute(predictions=[pred], references=[target], lang="th")
results

{'precision': [0.9578304290771484],
 'recall': [0.9327464699745178],
 'f1': [0.9451220631599426],
 'hashcode': 'bert-base-multilingual-cased_L9_no-idf_version=0.3.12(hug_trans=4.48.2)'}

In [None]:
results = bertscore.compute(predictions=["ชีวิตทุกข์ทรมานจริง"], references=["ชีวิตมันแย่มาก"], lang="th")
results

{'precision': [0.7607712745666504],
 'recall': [0.8199488520622253],
 'f1': [0.7892523407936096],
 'hashcode': 'bert-base-multilingual-cased_L9_no-idf_version=0.3.12(hug_trans=4.48.2)'}

In [None]:
results = bertscore.compute(predictions=["รู้สึกสนุกสุดยอด"], references=["ชีวิตมันแย่มาก"], lang="th")
results

{'precision': [0.654716432094574],
 'recall': [0.7046784162521362],
 'f1': [0.6787793040275574],
 'hashcode': 'bert-base-multilingual-cased_L9_no-idf_version=0.3.12(hug_trans=4.48.2)'}

## COMET

In [None]:
comet = evaluate.load('comet')

COMET takes three lists of strings as input: `sources` (the source sentences), `predictions` (the candidate translations) and `references` (the reference translations).

In [None]:
source = ["Dem Feuer konnte Einhalt geboten werden", "Schulen und Kindergärten wurden eröffnet."]
hypothesis = ["The fire could be stopped", "Schools and kindergartens were open"]
reference = ["They were able to control the fire.", "Schools and kindergartens opened"]
results = comet.compute(predictions=hypothesis, references=reference, sources=source)
results

{'mean_score': 0.9051420092582703,
 'scores': [0.8385582566261292, 0.9717257618904114]}