# LLM Evaluation

* **ROUGE (Recall-Oriented Understudy for Gisting Evaluation):**
    * **ROUGE-1:** Measures the overlap of individual words (unigrams) between the machine-generated summary and the reference summary.
    * **ROUGE-2:** Measures the overlap of word pairs (bigrams, like "important information") between the two summaries.
    * **ROUGE-L:** Based on the longest common subsequence (LCS) of words shared by both summaries. The words must appear in the same order but do not need to be adjacent.
    * **ROUGE-Lsum:** A summary-level variant of ROUGE-L. Each summary is split into sentences on newlines, LCS matches are computed per reference sentence, and the results are combined into a single score.
---
* **BLEU (Bilingual Evaluation Understudy):**
    * **BLEU score:** A score (between 0 and 1) that measures how closely the generated text matches the reference text, based on n-gram precision.
    * **Precisions:** The fraction of n-grams in the generated text that also appear in the reference text, reported per n-gram size:
        * 1-gram: individual words
        * 2-gram: pairs of consecutive words
        * 3-gram: triples of consecutive words
        * 4-gram: sequences of four consecutive words
    * **Brevity penalty:** A penalty applied if the generated text is shorter than the reference text (1.0 means no penalty).
    * **Length ratio:** The length of the generated text divided by the length of the reference text (e.g., 1.16 means the generated text is 16% longer).
    * **Translation length and Reference length:** The number of tokens in the generated text and the reference text, respectively.
---
* **METEOR (Metric for Evaluation of Translation with Explicit ORdering):**
    * METEOR is a metric for evaluating generated text that is reported to correlate better with human judgment than BLEU.
    * It aligns unigrams between the candidate and the reference (exact matches, then stems and WordNet synonyms), combines unigram precision and recall into a recall-weighted F-score, and applies a penalty when the matched words are fragmented or out of order.
    * It aims to address two weaknesses of pure n-gram overlap metrics: ignoring recall and handling word order only indirectly.
---
* **BERTScore**:
    * BERTScore evaluates generated text (here, a summary) by measuring how semantically similar it is to the reference text.
    * It matches tokens in the candidate and the reference by the cosine similarity of their contextualized token embeddings, which have been reported to be effective for tasks such as entailment detection.
    * The embeddings come from BERT, a transformer-based model developed by Google, or from a related model such as RoBERTa, so the comparison reflects meaning in context rather than exact word overlap.
    * It provides **precision, recall and F1-scores**, allowing users to have a comprehensive understanding of the performance of their models.
---
* **Perplexity**:
    * Perplexity measures how well a language model predicts a text sample. It is the exponential of the average negative log-likelihood per token, so its minimum value is 1.
        * **Low perplexity (close to 1):** The model assigns high probability to the text, so it is confident and accurate in its predictions, like it's "not perplexed" at all!
        * **High perplexity (much greater than 1):** The model is uncertain or "perplexed" about the text, indicating poor predictions.
    * **Perplexities:** This is a list of perplexity scores for each input text. A lower perplexity score indicates better prediction performance.
    * **Mean Perplexity:** This is the average perplexity score across all input texts. It provides a single value representing the overall perplexity.
---
* **MoverScore**:
    * This scorer first uses pre-trained language models such as BERT to obtain contextualized word embeddings for both the reference text and the generated text.
    * It then uses the Earth Mover’s Distance (EMD) to compute the minimal cost of transforming the distribution of word embeddings in the LLM output into the distribution of word embeddings in the reference text.
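
The EMD step can be illustrated with a heavily simplified sketch. It assumes both texts have the same number of tokens and that every token carries equal weight, in which case the optimal transport plan is a one-to-one assignment that can be found by brute force. The 2-D vectors and the function names are hypothetical; the actual MoverScore uses contextual embeddings, IDF weights, and a general optimal-transport solver.

```python
import math
from itertools import permutations


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def emd_uniform(source, target):
    """EMD between two equal-sized point sets with uniform weights.

    With n points of mass 1/n on each side, the cheapest transport plan
    is a permutation, so all n! assignments are tried (small n only).
    """
    if len(source) != len(target):
        raise ValueError("this sketch only handles equal-sized sets")
    n = len(source)
    best_total = min(
        sum(euclidean(source[i], target[perm[i]]) for i in range(n))
        for perm in permutations(range(n))
    )
    return best_total / n  # each move carries mass 1/n


# Hypothetical 2-D "embeddings" for a generated text and a reference.
generated = [(0.0, 0.0), (1.0, 0.0)]
reference = [(1.0, 0.0), (0.0, 1.0)]
print(emd_uniform(generated, reference))
```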
---
* **BLEURT**:
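    * Like BERTScore, it builds on BERT, but it is a learned metric: a BERT model fine-tuned on human ratings to predict a quality score directly. It is reported to correlate better with human judgment than metrics such as BLEU and BERTScore.
    * The score range depends on the checkpoint. The BLEURT-20 checkpoints produce scores roughly between 0 and 1, while older checkpoints are not calibrated and can return negative values. In all cases, higher values indicate greater similarity between the texts.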


In [None]:
!pip install --upgrade -r "/home/ec2-user/SageMaker/15. Essential Code/requirements.txt" -q

In [None]:
!pip install evaluate -q
!pip install rouge-score -q
!pip install bert-score -q

In [1]:
!pip install torch -q

In [15]:
import evaluate

In [3]:
# List of Evaluation Modules

evaluate.list_evaluation_modules()

['lvwerra/test',
 'jordyvl/ece',
 'angelina-wang/directional_bias_amplification',
 'cpllab/syntaxgym',
 'lvwerra/bary_score',
 'hack/test_metric',
 'yzha/ctc_eval',
 'codeparrot/apps_metric',
 'mfumanelli/geometric_mean',
 'daiyizheng/valid',
 'erntkn/dice_coefficient',
 'mgfrantz/roc_auc_macro',
 'Vlasta/pr_auc',
 'gorkaartola/metric_for_tp_fp_samples',
 'idsedykh/metric',
 'idsedykh/codebleu2',
 'idsedykh/codebleu',
 'idsedykh/megaglue',
 'cakiki/ndcg',
 'Vertaix/vendiscore',
 'GMFTBY/dailydialogevaluate',
 'GMFTBY/dailydialog_evaluate',
 'jzm-mailchimp/joshs_second_test_metric',
 'ola13/precision_at_k',
 'yulong-me/yl_metric',
 'abidlabs/mean_iou',
 'abidlabs/mean_iou2',
 'KevinSpaghetti/accuracyk',
 'NimaBoscarino/weat',
 'ronaldahmed/nwentfaithfulness',
 'Viona/infolm',
 'kyokote/my_metric2',
 'kashif/mape',
 'Ochiroo/rouge_mn',
 'giulio98/code_eval_outputs',
 'leslyarun/fbeta_score',
 'giulio98/codebleu',
 'anz2/iliauniiccocrevaluation',
 'zbeloki/m2',
 'xu1998hz/sescore',
 'dvit

## Summary to be Tested

In [17]:
actual_summary = """
Meta Platforms, Inc., formerly Facebook, Inc., is a major American technology conglomerate based in Menlo Park, California.
It owns and operates Facebook, Instagram, Threads, and WhatsApp. Meta ranks among the largest tech companies globally, 
alongside Alphabet (Google), Amazon, Apple, and Microsoft. Despite ventures into hardware, Meta relies heavily on 
advertising for revenue, comprising 97.8% in 2023. It rebranded to Meta Platforms, Inc. in 2021 to emphasize its 
focus on building the metaverse, integrating its products and services. Through acquisitions and partnerships, 
Meta continues to expand its reach and influence in the tech industry.
"""

llm_summary = """
Meta Platforms, Inc., formerly known as Facebook, Inc., is a prominent American technology conglomerate headquartered in 
Menlo Park, California. The company's portfolio includes popular platforms such as Facebook, Instagram, 
Threads, and WhatsApp. It is considered one of the leading technology companies globally, competing with giants 
like Alphabet (Google), Amazon, Apple, and Microsoft. Despite dabbling in hardware ventures, Meta predominantly 
relies on advertising, which accounted for 97.8% of its revenue in 2023. In 2021, the company rebranded as 
Meta Platforms, Inc. to underscore its commitment to building the metaverse, an integrated environment connecting 
its products and services. Through acquisitions and partnerships, Meta continues to expand its influence and presence 
in the tech sector.
"""

## ROUGE

In [5]:
rogue_scorer = evaluate.load('rouge')

In [6]:
rogue_scorer.compute(references=[actual_summary], predictions=[llm_summary])

{'rouge1': 0.7019230769230769,
 'rouge2': 0.4271844660194175,
 'rougeL': 0.6346153846153846,
 'rougeLsum': 0.6538461538461539}
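
To make the scores above less of a black box, here is a minimal pure-Python sketch of ROUGE-N and ROUGE-L as F-measures. It assumes lowercased whitespace tokenization, which is simpler than what the `rouge` module does, so it will not reproduce the numbers above exactly. The function names (`rouge_n`, `rouge_l`, `lcs_length`, `f_measure`) are illustrative and not part of `evaluate`.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f_measure(overlap, pred_total, ref_total):
    """Precision, recall and F1 from an overlap count."""
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / pred_total
    recall = overlap / ref_total
    return precision, recall, 2 * precision * recall / (precision + recall)


def rouge_n(reference, prediction, n=1):
    ref = ngrams(reference.lower().split(), n)
    pred = ngrams(prediction.lower().split(), n)
    overlap = sum((ref & pred).values())  # clipped n-gram matches
    return f_measure(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, prediction):
    ref, pred = reference.lower().split(), prediction.lower().split()
    return f_measure(lcs_length(ref, pred), len(pred), len(ref))


ref = "the cat sat on the mat"
pred = "the cat lay on the mat"
print(rouge_n(ref, pred, 1), rouge_n(ref, pred, 2), rouge_l(ref, pred))
```

In this toy pair, five of six unigrams and three of five bigrams match, and the longest common subsequence ("the cat on the mat") has five words.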

## BLEU

In [7]:
bleu_scorer = evaluate.load('bleu')

In [8]:
bleu_scorer.compute(references=[actual_summary], predictions=[llm_summary])

{'bleu': 0.45040483379141744,
 'precisions': [0.7112676056338029,
  0.49645390070921985,
  0.38571428571428573,
  0.302158273381295],
 'brevity_penalty': 1.0,
 'length_ratio': 1.1639344262295082,
 'translation_length': 142,
 'reference_length': 122}
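
The `bleu` value in the output above can be recomputed from the other fields: it is the brevity penalty multiplied by the geometric mean of the four n-gram precisions. The sketch below does this with the numbers copied from the output, and also shows how a single clipped n-gram precision is counted. The function names are illustrative and not part of `evaluate`.

```python
import math
from collections import Counter


def modified_precision(candidate_tokens, reference_tokens, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand = Counter(tuple(candidate_tokens[i:i + n])
                   for i in range(len(candidate_tokens) - n + 1))
    ref = Counter(tuple(reference_tokens[i:i + n])
                  for i in range(len(reference_tokens) - n + 1))
    total = sum(cand.values())
    return sum((cand & ref).values()) / total if total else 0.0


def brevity_penalty(candidate_len, reference_len):
    """1.0 if the candidate is longer than the reference, else exp(1 - r/c)."""
    if candidate_len > reference_len:
        return 1.0
    if candidate_len == 0:
        return 0.0
    return math.exp(1 - reference_len / candidate_len)


def bleu_from_parts(precisions, candidate_len, reference_len):
    """Brevity penalty times the geometric mean of the precisions."""
    if min(precisions) <= 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty(candidate_len, reference_len) * math.exp(log_mean)


# Values copied from the notebook output above.
precisions = [0.7112676056338029, 0.49645390070921985,
              0.38571428571428573, 0.302158273381295]
print(round(bleu_from_parts(precisions, 142, 122), 4))  # ~0.4504
```

Because the generated summary (142 tokens) is longer than the reference (122 tokens), the brevity penalty is 1.0 and the score is just the geometric mean of the precisions.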

## METEOR

In [9]:
meteor_scorer = evaluate.load('meteor')

[nltk_data] Downloading package wordnet to /home/ec2-user/nltk_data...
[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package omw-1.4 to /home/ec2-user/nltk_data...


In [10]:
meteor_scorer.compute(references=[actual_summary], predictions=[llm_summary])

{'meteor': 0.7225239471511148}
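
The way METEOR combines an F-score with a word-order penalty can be sketched as follows. This is a simplification under stated assumptions: it matches exact words only (no stemming or WordNet synonyms), uses a greedy left-to-right alignment instead of searching for the alignment with the fewest chunks, and takes the commonly used parameter values `alpha=0.9`, `beta=3`, `gamma=0.5`. It will therefore not reproduce the score above, and all function names are hypothetical.

```python
def align_exact(candidate, reference):
    """Greedily pair each candidate token with the first unused
    identical reference token. Returns (cand_idx, ref_idx) pairs."""
    used, pairs = set(), []
    for i, tok in enumerate(candidate):
        for j, ref_tok in enumerate(reference):
            if j not in used and tok == ref_tok:
                used.add(j)
                pairs.append((i, j))
                break
    return pairs


def count_chunks(pairs):
    """A chunk is a run of matches that is contiguous and in the same
    order in both the candidate and the reference."""
    chunks, prev = 0, None
    for i, j in pairs:
        if prev is None or i != prev[0] + 1 or j != prev[1] + 1:
            chunks += 1
        prev = (i, j)
    return chunks


def simple_meteor(reference, prediction, alpha=0.9, beta=3.0, gamma=0.5):
    ref, cand = reference.lower().split(), prediction.lower().split()
    pairs = align_exact(cand, ref)
    matches = len(pairs)
    if matches == 0:
        return 0.0
    precision, recall = matches / len(cand), matches / len(ref)
    # Recall-weighted harmonic mean of precision and recall.
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # More chunks per match means more fragmented word order.
    penalty = gamma * (count_chunks(pairs) / matches) ** beta
    return f_mean * (1 - penalty)


ref = "the cat sat on the mat"
print(simple_meteor(ref, "the cat sat on the mat"))  # one chunk, small penalty
print(simple_meteor(ref, "on the mat sat the cat"))  # same words, scrambled
```

Both candidates contain exactly the reference words, so precision and recall are 1 in both cases. Only the fragmentation penalty separates them, which is the behaviour that distinguishes METEOR from a plain unigram F-score.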

## BERTScore

In [11]:
# Method 1
from bert_score import BERTScorer
bert_scorer = BERTScorer(model_type='bert-base-uncased')
# score() takes the candidates first, then the references,
# and returns (precision, recall, F1).
bert_scorer.score([llm_summary], [actual_summary])

(tensor([0.8448]), tensor([0.8955]), tensor([0.8694]))

In [12]:
# Method 2
bert_scorer = evaluate.load("bertscore")
bert_scorer.compute(references=[actual_summary], predictions=[llm_summary], lang="en")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


{'precision': [0.9383306503295898],
 'recall': [0.9495557546615601],
 'f1': [0.9439098834991455],
 'hashcode': 'roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.40.0)'}
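
The precision, recall and F1 values above come from greedy matching of token embeddings by cosine similarity. A minimal sketch of that matching step is shown below. The 2-D vectors are made-up stand-ins for the contextual embeddings a real model would produce, and `greedy_bertscore` is an illustrative name, not part of `bert_score`; IDF weighting and baseline rescaling are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_bertscore(candidate_vecs, reference_vecs):
    """Precision: each candidate token takes its best match in the reference.
    Recall: each reference token takes its best match in the candidate."""
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical token embeddings: the candidate has one extra token
# that only partially matches anything in the reference.
candidate = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
reference = [(1.0, 0.0), (0.0, 1.0)]
print(greedy_bertscore(candidate, reference))
```

Every reference token has a perfect match, so recall is 1, while the extra candidate token lowers precision. Swapping the two arguments swaps precision and recall, which is why the argument order matters when calling a scorer.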

## Perplexity

In [13]:
perplexity_scorer = evaluate.load("perplexity", module_type="metric")

In [14]:
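# Perplexity is computed on `predictions` only, scored under the model named by `model_id`.
# This module does not use references, so none are passed.
perplexity_scorer.compute(predictions=[llm_summary], model_id="gpt2")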

{'perplexities': [44.8903923034668], 'mean_perplexity': 44.8903923034668}
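
For intuition about the scale of these numbers, perplexity is the exponential of the average negative log-probability the model assigns to each token. The sketch below uses made-up token probabilities; in the cell above they would come from `gpt2`. The function name is illustrative.

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability over the tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# A model that is certain about every token reaches the minimum of 1.
print(perplexity([1.0, 1.0, 1.0]))

# A model that spreads its probability evenly over 4 choices at every
# step has a perplexity of about 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Read this way, a perplexity of about 45 means that `gpt2` was, on average, about as uncertain as if it were choosing uniformly among 45 tokens at each position of the summary.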

## BLEURT

In [None]:
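# Requires the BLEURT package, which is installed separately:
# pip install git+https://github.com/google-research/bleurt.git
bleurt_scorer = evaluate.load("bleurt", module_type="metric")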

In [None]:
bleurt_scorer.compute(references=[actual_summary], predictions=[llm_summary])