### Install Required Libraries

This cell installs all necessary libraries for evaluation metrics including ROUGE, BLEU, BERTScore, and Sentence-BERT similarity.

In [1]:
!pip install rouge-score bert-score nltk transformers torch scikit-learn sentence-transformers evaluate

Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting bert-score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.

### Download NLTK Tokenizer

Downloads the punkt_tab data for NLTK's Punkt tokenizer. word_tokenize needs this data to split text into tokens before the BLEU score is calculated.


In [2]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

### Import Required Libraries

Imports all necessary modules for evaluating model-generated text:
- json, numpy: data handling and numeric operations.
- typing: for type hinting with List and Dict.
- sentence_transformers, bert_score, rouge_score: for similarity and scoring.
- sklearn: for cosine similarity.
- nltk: for BLEU score and tokenization.
- transformers.logging: suppresses non-error log messages from Hugging Face Transformers.


In [3]:
import json
import numpy as np
from typing import List, Dict
from sentence_transformers import SentenceTransformer, util
from rouge_score import rouge_scorer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.translate.bleu_score import sentence_bleu
from nltk.tokenize import word_tokenize
from bert_score import score
from transformers import logging
logging.set_verbosity_error()


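### Evaluator Class

Defines a class that scores generated text against reference answers using several metrics:
- Faithfulness: cosine similarity between the Sentence-BERT embeddings of the model answer and the ground truth.
- ROUGE: n-gram overlap (ROUGE-1, ROUGE-2) and longest-common-subsequence overlap (ROUGE-L), reported as F-measure.
- BLEU: modified n-gram precision (up to 4-grams by default) combined with a brevity penalty.
- BERTScore: token-level semantic similarity computed from contextual embeddings, reported as F1.
- Sentence-BERT Similarity: cosine similarity computed over the whole batch and averaged over matching prediction/reference pairs.

The evaluate_from_json method reads predictions from a JSON file and aggregates the scores, and generate_evaluation_report formats the mean of each metric as text.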
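
To make the n-gram overlap behind ROUGE concrete, here is a minimal pure-Python sketch of clipped n-gram precision, recall, and F1. It is a simplification for illustration only: the helper names `ngrams` and `ngram_overlap_f1` are hypothetical, tokens are split on whitespace, and no stemming is applied, so its numbers will not necessarily match those of the rouge_score package used in the class.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap_f1(prediction: str, reference: str, n: int = 1) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for clipped n-gram overlap."""
    pred_counts = ngrams(prediction.lower().split(), n)
    ref_counts = ngrams(reference.lower().split(), n)

    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((pred_counts & ref_counts).values())
    pred_total = sum(pred_counts.values())
    ref_total = sum(ref_counts.values())

    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


p, r, f1 = ngram_overlap_f1("the cat is on the mat", "the cat sat on the mat", n=1)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# prints: precision=0.833 recall=0.833 f1=0.833
```

Precision asks how much of the prediction appears in the reference, recall asks the reverse, and the F-measure balances the two, which is the quantity the class stores for each ROUGE variant.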


In [4]:
class Evaluator:
    def __init__(self):
        self.bert_model = SentenceTransformer('all-MiniLM-L6-v2')
        self.rouge_scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

    def calculate_faithfulness(self, model_answer: str, ground_truth: str) -> float:
        answer_embedding = self.bert_model.encode(model_answer)
        truth_embedding = self.bert_model.encode(ground_truth)

        answer_embedding = answer_embedding / np.linalg.norm(answer_embedding)
        truth_embedding = truth_embedding / np.linalg.norm(truth_embedding)

        similarity = cosine_similarity(
            answer_embedding.reshape(1, -1),
            truth_embedding.reshape(1, -1)
        )[0][0]

        return float(similarity)

    def calculate_rouge_scores(self, model_answer: str, ground_truth: str) -> Dict:
        scores = self.rouge_scorer.score(ground_truth, model_answer)
        return {
            'rouge1': scores['rouge1'].fmeasure,
            'rouge2': scores['rouge2'].fmeasure,
            'rougeL': scores['rougeL'].fmeasure
        }

    def calculate_bleu_score(self, model_answer: str, ground_truth: str) -> float:
        reference = [word_tokenize(ground_truth.lower())]
        candidate = word_tokenize(model_answer.lower())
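        try:
            # Default 4-gram BLEU without smoothing; NLTK warns when an n-gram order has no overlap.
            return sentence_bleu(reference, candidate)
        except Exception:
            # Score the pair as 0.0 if NLTK raises while computing BLEU.
            return 0.0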

    def calculate_bert_score(self, model_answer: str, ground_truth: str) -> float:
        P, R, F1 = score([model_answer], [ground_truth], lang="en")
        return F1.mean().item()

    def calculate_sentencebert_similarity(self, model_answers: List[str], ground_truths: List[str]) -> float:
        embeddings1 = self.bert_model.encode(model_answers, convert_to_tensor=True)
        embeddings2 = self.bert_model.encode(ground_truths, convert_to_tensor=True)
        cos_sim = util.pytorch_cos_sim(embeddings1, embeddings2)
        avg_sim = cos_sim.diag().mean().item()
        return avg_sim

    def evaluate_from_json(self, json_path: str) -> Dict:

        with open(json_path, 'r', encoding='utf-8') as f:
            data = json.load(f)

        metrics = {
            'faithfulness': [],
            'rouge1': [],
            'rouge2': [],
            'rougeL': [],
            'bleu': [],
            'bert_score': [],
        }

        for item in data:
            ground_truth = item["refs"]
            model_answer = item["preds"]

            metrics['faithfulness'].append(
                self.calculate_faithfulness(model_answer, ground_truth)
            )

            rouge_scores = self.calculate_rouge_scores(model_answer, ground_truth)
            metrics['rouge1'].append(rouge_scores['rouge1'])
            metrics['rouge2'].append(rouge_scores['rouge2'])
            metrics['rougeL'].append(rouge_scores['rougeL'])

            metrics['bleu'].append(
                self.calculate_bleu_score(model_answer, ground_truth)
            )

            metrics['bert_score'].append(
                self.calculate_bert_score(model_answer, ground_truth)
            )

        all_preds = [item['preds'] for item in data]
        all_refs = [item['refs'] for item in data]
        sentencebert_sim = self.calculate_sentencebert_similarity(all_preds, all_refs)
        final_metrics = {
            metric: {
                'mean': np.mean(scores),
                'std': np.std(scores),
                'min': np.min(scores),
                'max': np.max(scores),
            }
            for metric, scores in metrics.items()
        }

        final_metrics['sentencebert_cosine'] = {
            'mean': sentencebert_sim,
            'std': 0.0,
            'min': sentencebert_sim,
            'max': sentencebert_sim
        }

        return final_metrics

    def generate_evaluation_report(self, metrics: Dict) -> str:
        report = "Evaluation Report\n==================\n"
        for metric, values in metrics.items():
            report += f"{metric.upper()} : {values['mean']:.4f}\n\n"
        return report
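
The evaluate_from_json method reads a JSON list in which every item carries a "preds" string (the model answer) and a "refs" string (the ground truth). The sketch below writes and validates a small file in that shape. The example sentences, the file name, and the validation loop are illustrative assumptions, not part of the original notebook.

```python
import json
import os
import tempfile

# Hypothetical records; "preds" and "refs" are the keys the evaluation loop reads.
records = [
    {"preds": "Paris is the capital of France.",
     "refs": "The capital of France is Paris."},
    {"preds": "Water boils at 100 degrees Celsius.",
     "refs": "At sea level, water boils at 100 degrees Celsius."},
]

# Write the records to a temporary file (the file name is an assumption).
path = os.path.join(tempfile.mkdtemp(), "example-predictions.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# Read the file back the same way evaluate_from_json does.
with open(path, "r", encoding="utf-8") as f:
    loaded = json.load(f)

# Fail early if an item lacks a required key or holds a non-string value.
for index, item in enumerate(loaded):
    missing = {"preds", "refs"} - item.keys()
    if missing:
        raise ValueError(f"item {index} is missing keys: {sorted(missing)}")
    if not all(isinstance(item[key], str) for key in ("preds", "refs")):
        raise TypeError(f"item {index} must hold string values")

print(f"{len(loaded)} valid items in {path}")
```

Checking the file up front avoids a KeyError partway through a long evaluation run, after the slower metrics have already been computed for earlier items.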

### Evaluate the Model

Runs the evaluator on the fine-tuned model's predictions file and prints the mean of each metric.



In [7]:
evaluator = Evaluator()
metrics = evaluator.evaluate_from_json("/content/finetuned-predictions.json")
report = evaluator.generate_evaluation_report(metrics)
print(report)

The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
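
The warning above is emitted by NLTK when a prediction shares no 4-gram with its reference, which is common for short answers. As the message suggests, one option is to pass a SmoothingFunction to sentence_bleu. The sketch below compares the unsmoothed and smoothed scores on a toy pair; the example sentences and the choice of method1 are assumptions. Enabling smoothing in the Evaluator class would change the BLEU value in the report, so it is shown here only as an option.

```python
import warnings

from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

# Toy pair (an assumption): they share 1-, 2- and 3-grams but no 4-gram.
reference = ["the cat sat on the mat".split()]
candidate = "the cat is on the mat".split()

# Unsmoothed BLEU collapses to (nearly) zero because the 4-gram precision is zero.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide the same warning shown in the notebook
    unsmoothed = sentence_bleu(reference, candidate)

# method1 replaces a zero n-gram match count with a small epsilon.
smoothed = sentence_bleu(
    reference,
    candidate,
    smoothing_function=SmoothingFunction().method1,
)

print(f"unsmoothed={unsmoothed:.4f} smoothed={smoothed:.4f}")
```

Whichever setting is used, it should be reported alongside the score, since smoothed and unsmoothed BLEU values are not directly comparable.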


Evaluation Report
FAITHFULNESS : 0.7064

ROUGE1 : 0.4797

ROUGE2 : 0.2797

ROUGEL : 0.3613

BLEU : 0.2139

BERT_SCORE : 0.8862

SENTENCEBERT_COSINE : 0.7064
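
FAITHFULNESS and SENTENCEBERT_COSINE report the same value because both are the mean cosine similarity between each prediction and its own reference under the same embedding model. One is computed pair by pair, the other from the diagonal of a batch similarity matrix. The NumPy sketch below checks that equivalence on random vectors standing in for sentence embeddings; the vector sizes and the `cosine` helper are assumptions for illustration.

```python
import numpy as np

# Random vectors stand in for sentence embeddings (4 pairs, 8 dimensions).
rng = np.random.default_rng(0)
preds = rng.normal(size=(4, 8))
refs = rng.normal(size=(4, 8))


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Pair-by-pair route, as in calculate_faithfulness followed by a mean.
per_pair_mean = float(np.mean([cosine(p, r) for p, r in zip(preds, refs)]))

# Batch route: full similarity matrix, then the mean of its diagonal.
preds_unit = preds / np.linalg.norm(preds, axis=1, keepdims=True)
refs_unit = refs / np.linalg.norm(refs, axis=1, keepdims=True)
batch_matrix = preds_unit @ refs_unit.T
batch_diag_mean = float(np.mean(np.diag(batch_matrix)))

print(np.isclose(per_pair_mean, batch_diag_mean))
```

Because the two numbers carry the same information, a report can keep just one of them without losing anything.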


