# Evaluation Metrics: BERTScore, ROUGE, BLEU, and Exact Match

### BERTScore, ROUGE, BLEU, Exact Match

In [1]:
!pip install bert-score
!pip install rouge-score
!pip install nltk
# METEOR is provided by nltk (nltk.translate.meteor_score); pip finds no "meteor-score" distribution

In [1]:
from bert_score import score as bert_score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer


def calculate_bertscore(ground_truths, rag_outputs):
    # bert_score takes candidates first, references second
    P, R, F1 = bert_score(rag_outputs, ground_truths, lang="en", model_type="roberta-large")
    return F1.tolist()  # One F1 score per (output, ground truth) pair

def calculate_exact_match(ground_truths, rag_outputs):
    # 1 if the output equals the ground truth character for character, else 0
    exact_matches = [1 if gt == output else 0 for gt, output in zip(ground_truths, rag_outputs)]
    return exact_matches

def calculate_bleu(ground_truths, rag_outputs):
    # Sentence-level BLEU on whitespace tokens; method1 smoothing keeps short
    # answers with no higher-order n-gram matches from scoring exactly 0
    smooth = SmoothingFunction().method1
    bleu_scores = [sentence_bleu([gt.split()], output.split(), smoothing_function=smooth)
                   for gt, output in zip(ground_truths, rag_outputs)]
    return bleu_scores

def calculate_rouge(ground_truths, rag_outputs):
    # RougeScorer.score takes (target, prediction)
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    results = [scorer.score(gt, output) for gt, output in zip(ground_truths, rag_outputs)]
    # Keep the F-measure of each ROUGE variant
    rouge_1 = [r['rouge1'].fmeasure for r in results]
    rouge_2 = [r['rouge2'].fmeasure for r in results]
    rouge_L = [r['rougeL'].fmeasure for r in results]
    return rouge_1, rouge_2, rouge_L
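
Strict string equality rarely fires when a reference is a short span and the generated answer is a full sentence or differs only in case or punctuation. The following is a minimal sketch of a more forgiving variant, assuming SQuAD-style normalization (lowercasing, dropping ASCII punctuation and English articles) suits the data. `normalize_answer`, `normalized_exact_match`, and `token_f1` are hypothetical helper names that do not appear in this notebook.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop ASCII punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def normalized_exact_match(reference, hypothesis):
    """Return 1 if both strings are equal after normalization, else 0."""
    return int(normalize_answer(reference) == normalize_answer(hypothesis))


def token_f1(reference, hypothesis):
    """Token-overlap F1 between the normalized reference and hypothesis."""
    ref_tokens = normalize_answer(reference).split()
    hyp_tokens = normalize_answer(hypothesis).split()
    if not ref_tokens or not hyp_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(ref_tokens == hyp_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # occurs in both strings.
    overlap = sum((Counter(ref_tokens) & Counter(hyp_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Either helper could be applied pairwise with `zip(ground_truths, rag_outputs)` in the same way as `calculate_exact_match`; token F1 gives partial credit when the answer span is embedded in a longer sentence.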



In [2]:
import pandas as pd
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from bert_score import score as bert_score

# Helper functions for scoring
def calculate_bleu(references, hypotheses):
    """Calculate BLEU scores for a list of references and hypotheses."""
    scores = []
    smooth = SmoothingFunction().method1
    for ref, hyp in zip(references, hypotheses):
        scores.append(sentence_bleu([ref.split()], hyp.split(), smoothing_function=smooth))
    return scores

def calculate_rouge(references, hypotheses):
    """Calculate ROUGE-1, ROUGE-2, and ROUGE-L scores."""
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    rouge1, rouge2, rougel = [], [], []
    for ref, hyp in zip(references, hypotheses):
        scores = scorer.score(ref, hyp)
        rouge1.append(scores['rouge1'].fmeasure)
        rouge2.append(scores['rouge2'].fmeasure)
        rougel.append(scores['rougeL'].fmeasure)
    return rouge1, rouge2, rougel

def calculate_bertscore(references, hypotheses):
    """Calculate BERTScore."""
    P, R, F1 = bert_score(hypotheses, references, lang="en", verbose=True)
    return F1.tolist()

def calculate_exact_match(references, hypotheses):
    """Calculate exact match scores."""
    return [1 if ref.strip() == hyp.strip() else 0 for ref, hyp in zip(references, hypotheses)]

# Load the CSV file
csv_file = 'Final_Answers_Generated.csv'

# Read the specific columns from the CSV file
csv_data = pd.read_csv(csv_file, usecols=['modified_rag_refined_answer', 'traditional_rag_refined_answer', 'answers'])

# Replace NaN or non-string values with an empty string
csv_data = csv_data.fillna("")  # Replace NaN with an empty string
csv_data = csv_data.astype(str)  # Ensure all values are strings

# Extract the columns as lists
modified_rag_refined_answer = csv_data['modified_rag_refined_answer'].tolist()
traditional_rag_refined_answer = csv_data['traditional_rag_refined_answer'].tolist()
ground_truths = csv_data['answers'].tolist()

# Print to verify
print("Modified RAG Refined Answers:", modified_rag_refined_answer[:5])
print("Traditional RAG Refined Answers:", traditional_rag_refined_answer[:5])
print("Ground Truths:", ground_truths[:5])

# Calculate scores for Modified RAG
bertscores_mod = calculate_bertscore(ground_truths, modified_rag_refined_answer)
exact_matches_mod = calculate_exact_match(ground_truths, modified_rag_refined_answer)
bleu_scores_mod = calculate_bleu(ground_truths, modified_rag_refined_answer)
rouge_1_mod, rouge_2_mod, rouge_L_mod = calculate_rouge(ground_truths, modified_rag_refined_answer)

# Calculate scores for Traditional RAG
bertscores_trad = calculate_bertscore(ground_truths, traditional_rag_refined_answer)
exact_matches_trad = calculate_exact_match(ground_truths, traditional_rag_refined_answer)
bleu_scores_trad = calculate_bleu(ground_truths, traditional_rag_refined_answer)
rouge_1_trad, rouge_2_trad, rouge_L_trad = calculate_rouge(ground_truths, traditional_rag_refined_answer)

# Add scores to a DataFrame
new_df = pd.DataFrame({
    "Ground Truth": ground_truths,
    "Modified RAG Answer": modified_rag_refined_answer,
    "Traditional RAG Answer": traditional_rag_refined_answer,
    "BERTScore for Modified": bertscores_mod,
    "Exact Match for Modified": exact_matches_mod,
    "BLEU Score for Modified": bleu_scores_mod,
    "ROUGE-1 for Modified": rouge_1_mod,
    "ROUGE-2 for Modified": rouge_2_mod,
    "ROUGE-L for Modified": rouge_L_mod,
    "BERTScore for Traditional": bertscores_trad,
    "Exact Match for Traditional": exact_matches_trad,
    "BLEU Score for Traditional": bleu_scores_trad,
    "ROUGE-1 for Traditional": rouge_1_trad,
    "ROUGE-2 for Traditional": rouge_2_trad,
    "ROUGE-L for Traditional": rouge_L_trad
})

# Save the new DataFrame to a CSV
output_file = 'Evaluation_Score_Results.csv'
new_df.to_csv(output_file, index=False)
print(f"Scores saved to {output_file}")


Modified RAG Refined Answers: ['The tradition of people carrying the Olympic torch before the Olympic games began in 1936 during the Berlin Summer Olympics.', '7.5', 'The theme for the 2008 Summer Olympics torch relay was "one world, one dream".', 'The organizers of the torch relay called it "The Beijing Olympics."', '"One world, two dreams"']
Traditional RAG Refined Answers: ['The tradition of people carrying the Olympic torch before the Olympic games began in 1928.', '130 days', 'The theme for the torch relay was designed to protect the torch relay from scenes that marred it in the UK, France, and the US.', 'The organizers called it the "Australian leg of the torch relay."', '"One world, two dreams" was the slogan for the 2008 Olympics.']
Ground Truths: ['1936 Summer Olympics.', '129 days', 'one world, one dream', 'Journey of Harmony', 'one world, one dream']


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

calculating scores...
computing bert embedding.
100%|██████████| 32/32 [00:04<00:00,  6.95it/s]
computing greedy matching.
100%|██████████| 19/19 [00:00<00:00, 55.73it/s]
done in 4.96 seconds, 242.88 sentences/sec

calculating scores...
computing bert embedding.
100%|██████████| 33/33 [00:04<00:00,  7.57it/s]
computing greedy matching.
100%|██████████| 19/19 [00:00<00:00, 66.33it/s]
done in 4.66 seconds, 258.45 sentences/sec
Scores saved to Evaluation_Score_Results.csv
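
The saved file holds one score per answer, so comparing the two pipelines still requires an aggregate. The following is a minimal sketch that averages each metric per system, assuming the columns keep the `<metric> for Modified` / `<metric> for Traditional` naming used for `new_df`. `summarize_scores` and `summarize_csv` are hypothetical helpers, written against the standard library only so they run without pandas.

```python
import csv
from collections import defaultdict
from statistics import mean

SYSTEMS = ("Modified", "Traditional")


def summarize_scores(rows):
    """Average every '<metric> for <system>' column over all rows.

    rows: iterable of dicts mapping column name to cell value.
    Returns {metric: {system: mean_score}}.
    """
    values = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for column, cell in row.items():
            for system in SYSTEMS:
                suffix = f" for {system}"
                # Text columns such as "Modified RAG Answer" do not end with
                # the suffix, so they are skipped.
                if column.endswith(suffix):
                    metric = column[: -len(suffix)]
                    values[metric][system].append(float(cell))
    return {
        metric: {system: mean(scores) for system, scores in by_system.items()}
        for metric, by_system in values.items()
    }


def summarize_csv(path="Evaluation_Score_Results.csv"):
    """Read the score file written above and return the per-metric means."""
    with open(path, newline="", encoding="utf-8") as handle:
        return summarize_scores(list(csv.DictReader(handle)))
```

Calling `summarize_csv()` after the cell above has run should return a nested dict such as `summary["BLEU Score"]["Modified"]`. With pandas already loaded, `new_df.mean(numeric_only=True)` is a shorter route to the same per-column averages.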
