# Advanced Summarization Evaluation Suite

This notebook evaluates two flat summarization models, **PEGASUS** and **PRIMERA**, on the **Multi-News** dataset with a suite of overlap, factuality, and holistic metrics of the kind reported in publication-level analyses.

### Models Evaluated:
1. `google/pegasus-multi_news`
2. `allenai/PRIMERA`

### Metrics Evaluated:
1. **Traditional:** ROUGE-1, ROUGE-2, ROUGE-L, BERTScore
2. **Faithfulness & Factuality:** FactCC, SummaC, QAGS, QAFactEval, AlignScore
3. **Holistic/NLG:** BARTScore, UniEval

**Note:** This notebook clones official repositories for metrics that do not have standard PyPI packages to ensure faithful evaluation.

In [None]:
# 1. Clean up and force install stable versions
# We use NumPy < 2.0 to ensure compatibility with TensorFlow and Accelerate.
!pip uninstall -y numpy huggingface-hub transformers datasets
!pip install -q "numpy<2.0,>=1.24" "huggingface-hub==0.24.0" "transformers>=4.41.0" "datasets>=2.19.0"

# 2. Install the metrics
# We install summac with --no-deps so it doesn't break the environment we just built.
!pip install -q --no-deps summac
!pip install -q evaluate rouge_score bert_score sentencepiece protobuf accelerate scipy scikit-learn

Found existing installation: numpy 2.0.2
Uninstalling numpy-2.0.2:
  Successfully uninstalled numpy-2.0.2
Found existing installation: huggingface-hub 0.36.0
Uninstalling huggingface-hub-0.36.0:
  Successfully uninstalled huggingface-hub-0.36.0
Found existing installation: transformers 4.57.3
Uninstalling transformers-4.57.3:
  Successfully uninstalled transformers-4.57.3
Found existing installation: datasets 4.0.0
Uninstalling datasets-4.0.0:
  Successfully uninstalled datasets-4.0.0


In [None]:
import torch
import pandas as pd
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from tqdm.auto import tqdm
import evaluate
import os
import sys

# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on: {device}")

Running on: cuda


## 2. Setup Advanced Metrics (Cloning Official Repos)
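Several of these metrics are distributed as research repositories rather than PyPI packages, so we clone the official code here and download the checkpoints it needs.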
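The cloning cells in this notebook guard each `git clone` with a directory check so that re-running a cell does not fail on an existing folder. A minimal sketch of that pattern as a reusable function follows; `clone_if_missing` and its `run` parameter are hypothetical names introduced for illustration, and the shallow `--depth 1` clone is an assumption, not something the notebook does.

```python
import os
import subprocess
from typing import Callable, Optional, Sequence


def clone_if_missing(
    url: str,
    dest: str,
    run: Optional[Callable[[Sequence[str]], object]] = None,
) -> bool:
    """Clone `url` into `dest` unless `dest` already exists.

    Returns True if a clone was attempted and False if it was skipped.
    `run` lets callers inject a command runner (useful for dry runs).
    """
    if os.path.isdir(dest):
        return False  # already cloned; nothing to do
    runner = run if run is not None else (lambda cmd: subprocess.run(cmd, check=True))
    runner(["git", "clone", "--depth", "1", url, dest])
    return True


# Example (not executed here):
# clone_if_missing("https://github.com/neulab/BARTScore.git", "BARTScore")
```

Because the function returns whether it cloned, follow-up steps such as checkpoint downloads can be run only on a fresh clone.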

In [None]:
# --- Setup BARTScore ---
if not os.path.exists('BARTScore'):
    !git clone https://github.com/neulab/BARTScore.git
sys.path.append('BARTScore') # Add to path

# --- Setup UniEval ---
if not os.path.exists('UniEval'):
    !git clone https://github.com/maszhongming/UniEval.git
    # Download UniEval checkpoint (approx 1GB)
    !wget https://huggingface.co/zhmh/UniEval/resolve/main/unieval_sum_v1.pth -O UniEval/unieval_sum_v1.pth

# --- Setup AlignScore ---
if not os.path.exists('AlignScore'):
    !git clone https://github.com/yuh-zha/AlignScore.git
    # Download AlignScore Checkpoint (RoBERTa-base version for speed, use large for paper if needed)
    !wget https://huggingface.co/yzha/AlignScore/resolve/main/AlignScore-base.ckpt -O AlignScore/AlignScore-base.ckpt
    !pip install -r AlignScore/requirements.txt # Ensure dependencies

# --- Setup QAFactEval ---
# Note: QAFactEval is heavy. If this fails due to environment conflicts, consider running it in a separate environment.
if not os.path.exists('QAFactEval'):
    !git clone https://github.com/salesforce/QAFactEval.git
    # QAFactEval often requires specific setup; we will attempt to import from the cloned repo directly.

Cloning into 'BARTScore'...
remote: Enumerating objects: 220, done.
remote: Counting objects: 100% (26/26), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 220 (delta 18), reused 14 (delta 14), pack-reused 194 (from 1)
Receiving objects: 100% (220/220), 101.98 MiB | 20.95 MiB/s, done.
Resolving deltas: 100% (47/47), done.
Updating files: 100% (192/192), done.
Cloning into 'UniEval'...
remote: Enumerating objects: 91, done.
remote: Counting objects: 100% (26/26), done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 91 (delta 13), reused 5 (delta 5), pack-reused 65 (from 1)
Receiving objects: 100% (91/91), 1.97 MiB | 5.56 MiB/s, done.
Resolving deltas: 100% (22/22), done.
--2026-01-18 12:30:30--  https://huggingface.co/zhmh/UniEval/resolve/main/unieval_sum_v1.pth
Resolving huggingface.co (huggingface.co)... 3.161.20.125, 3.161.20.15, 3.161.20.112, ...
Connecting to huggingface.co (huggingface.co)|3.161.20.125|:443... connected.

## 3. Data Loading
Loading 100 samples from the test split of `Awesome075/multi_news_parquet`.

In [None]:
import os
import shutil
import datasets
from datasets import load_dataset, DownloadConfig

# 1. Force the system to wait up to 1 hour (3600s) for Hugging Face responses
os.environ["HF_HUB_DOWNLOAD_TIMEOUT"] = "3600"
os.environ["HF_HUB_ETAG_TIMEOUT"] = "3600"

# 2. Clear previous corrupted attempts (optional but recommended)
cache_path = "/root/.cache/huggingface/datasets"
if os.path.exists(cache_path):
    print("🧹 Clearing old cache to prevent corruption...")
    shutil.rmtree(cache_path)

# 3. Download with high-timeout configuration
print("⏳ Downloading dataset (this can take 2-5 minutes in Colab)...")
try:
    # We use num_proc=1 to avoid multiple connections fighting for bandwidth
    dataset = load_dataset(
        "Awesome075/multi_news_parquet",
        download_mode="force_redownload",
        num_proc=1
    )

    # Select the test split
    test_data = dataset['test'].select(range(100))
    src_docs = test_data['document']
    gold_sums = test_data['summary']
    print("✅ Dataset loaded successfully!")

except Exception as e:
    print(f"❌ Error during load: {e}")
    print("\nPRO TIP: If it still fails, click the 'Hugging Face' link in the error ")
    print("to see if the site is temporarily down.")

⏳ Downloading dataset (this can take 2-5 minutes in Colab)...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

train.parquet:   0%|          | 0.00/323M [00:00<?, ?B/s]

validation.parquet:   0%|          | 0.00/39.5M [00:00<?, ?B/s]

test.parquet:   0%|          | 0.00/40.1M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/44972 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/5622 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/5622 [00:00<?, ? examples/s]

✅ Dataset loaded successfully!


## 4. Model Inference
Generating summaries using PEGASUS and PRIMERA. We use standard generation parameters (beam search).
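As a rough illustration of the batching and truncation used in the generation cell of this section, the sketch below shows how a document list is sliced into batches and what fraction of an input is lost when it exceeds the model's limit. `iter_batches` and `truncated_fraction` are hypothetical helpers written for this example, not part of `transformers`; the 1024-token limit is the PEGASUS value used in this notebook.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def truncated_fraction(num_tokens: int, max_input: int) -> float:
    """Fraction of tokens dropped when an input is truncated to `max_input`."""
    if num_tokens <= max_input:
        return 0.0
    return (num_tokens - max_input) / num_tokens


docs = ["doc a", "doc b", "doc c"]
print(list(iter_batches(docs, batch_size=2)))
# A 2048-token input truncated to 1024 tokens loses half of its content.
print(truncated_fraction(2048, max_input=1024))
```

Multi-News inputs are often longer than 1024 tokens, so the truncated fraction is worth keeping in mind when comparing a 1024-token model with a 4096-token one.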

In [None]:
def generate_summaries(model_name, docs, device, batch_size=1):
    print(f"Loading {model_name}...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
    model.eval()

    generated_summaries = []

    for i in tqdm(range(0, len(docs), batch_size), desc=f"Generating with {model_name}"):
        batch_docs = docs[i : i + batch_size]

        # PRIMERA handles long documents better, PEGASUS truncates.
        # Max input length for Pegasus is usually 1024, PRIMERA is 4096.
        max_input = 4096 if 'PRIMERA' in model_name else 1024

        inputs = tokenizer(batch_docs, return_tensors="pt", max_length=max_input, truncation=True, padding=True).to(device)

        with torch.no_grad():
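            # Standard beam-search parameters. The attention mask is passed explicitly
            # so that padding tokens are ignored when batch_size > 1.
            summary_ids = model.generate(
                input_ids=inputs["input_ids"],
                attention_mask=inputs["attention_mask"],
                num_beams=4,
                max_length=256,
                length_penalty=2.0,
                early_stopping=True
            )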

        decoded = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)
        generated_summaries.extend(decoded)

    # Clear VRAM
    del model
    del tokenizer
    torch.cuda.empty_cache()

    return generated_summaries

# Generate
pegasus_preds = generate_summaries('google/pegasus-multi_news', src_docs, device)
primera_preds = generate_summaries('allenai/PRIMERA', src_docs, device)

Loading google/pegasus-multi_news...


tokenizer_config.json:   0%|          | 0.00/88.0 [00:00<?, ?B/s]

config.json: 0.00B [00:00, ?B/s]

spiece.model:   0%|          | 0.00/1.91M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-multi_news and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


generation_config.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

Generating with google/pegasus-multi_news:   0%|          | 0/100 [00:00<?, ?it/s]

Loading allenai/PRIMERA...


tokenizer_config.json:   0%|          | 0.00/27.0 [00:00<?, ?B/s]

config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/20.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/283 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.79G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/197 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.79G [00:00<?, ?B/s]

Generating with allenai/PRIMERA:   0%|          | 0/100 [00:00<?, ?it/s]

Input ids are automatically padded from 2263 to 2560 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 943 to 1024 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 2201 to 2560 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 3965 to 4096 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 3938 to 4096 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 1929 to 2048 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 1262 to 1536 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 1056 to 1536 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 296 to 512 to be a multiple of `config.attention_window`: 512
Input ids are automatically padded from 2240 to 2560 to be a multip
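The padding messages above follow a simple rule: each input is extended to the next multiple of `config.attention_window` (512 here). A small sketch of that arithmetic, using a hypothetical helper name `padded_length`, reproduces the lengths reported in the log.

```python
def padded_length(num_tokens: int, attention_window: int = 512) -> int:
    """Smallest multiple of `attention_window` that is >= `num_tokens`."""
    if num_tokens <= 0:
        return 0
    # Ceiling division without floating point: -(-a // b) == ceil(a / b)
    return -(-num_tokens // attention_window) * attention_window


for n in (2263, 943, 3965, 296):
    print(n, "->", padded_length(n))
```

An input that is already a multiple of the window, such as 4096, is left unchanged.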

## 5. Evaluation
We define wrapper functions for each metric group.

In [None]:
# Initialize Results Dictionary
results_data = {
    "Metric": [],
    "PEGASUS": [],
    "PRIMERA": []
}

def add_result(metric_name, score_pegasus, score_primera):
    results_data["Metric"].append(metric_name)
    results_data["PEGASUS"].append(score_pegasus)
    results_data["PRIMERA"].append(score_primera)
    print(f"{metric_name}: PEGASUS={score_pegasus:.4f}, PRIMERA={score_primera:.4f}")

In [None]:
# 1. Fix the ROUGE error by upgrading rouge-score
!pip install -U rouge-score evaluate bert-score

# 2. Install SummaC without its pinned dependencies. Its requirement
#    huggingface-hub<=0.17.0 would downgrade the environment built in the first cell.
!pip install -q --no-deps summac

# 3. BARTScore is not an installable Python project (the repository has neither
#    setup.py nor pyproject.toml), so it is used from the clone made in Section 2.

# 4. UniEval and AlignScore were already cloned in Section 2; clone only if missing.
if not os.path.exists('UniEval'):
    !git clone https://github.com/maszhongming/UniEval.git
if not os.path.exists('AlignScore'):
    !git clone https://github.com/yuh-zha/AlignScore.git

# 5. Install metric dependencies
!pip install -q pyrootutils

Collecting rouge-score
  Using cached rouge_score-0.1.2.tar.gz (17 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (pyproject.toml) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24987 sha256=03906f48e250c631a3a2471a6377e5edf48cd756b56a03dae409ab422110b405
  Stored in directory: /root/.cache/pip/wheels/85/9d/af/01feefbe7d55ef5468796f0c68225b6788e85d9d0a281e7a70
Successfully built rouge-score
Installing collected packages: rouge-score
  Attempting uninstall: rouge-score
    Found existing installation: rouge-score 0.0.4
    Uninstalling rouge-score-0.0.4:
      Successfully uninstalled rouge-score-0.0.4
Successfully installed rouge-score-0.1.2


In [None]:
# --- 1. ROUGE & BERTScore ---
rouge = evaluate.load('rouge')
bertscore = evaluate.load('bertscore')

def eval_hf_metrics(preds, refs, sources):
    # ROUGE
    r_scores = rouge.compute(predictions=preds, references=refs)

    # BERTScore (using roberta-large as standard)
    bs_scores = bertscore.compute(predictions=preds, references=refs, lang="en", model_type="roberta-large")
    bs_f1 = np.mean(bs_scores['f1'])

    return r_scores, bs_f1

print("Evaluating Standard Metrics...")
peg_rouge, peg_bs = eval_hf_metrics(pegasus_preds, gold_sums, src_docs)
prim_rouge, prim_bs = eval_hf_metrics(primera_preds, gold_sums, src_docs)

add_result("ROUGE-1", peg_rouge['rouge1'], prim_rouge['rouge1'])
add_result("ROUGE-2", peg_rouge['rouge2'], prim_rouge['rouge2'])
add_result("ROUGE-L", peg_rouge['rougeL'], prim_rouge['rougeL'])
add_result("BERTScore-F1", peg_bs, prim_bs)

Downloading builder script: 0.00B [00:00, ?B/s]

Downloading builder script: 0.00B [00:00, ?B/s]

Evaluating Standard Metrics...


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/482 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.42G [00:00<?, ?B/s]

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


ROUGE-1: PEGASUS=0.4656, PRIMERA=0.3488
ROUGE-2: PEGASUS=0.1881, PRIMERA=0.1057
ROUGE-L: PEGASUS=0.2392, PRIMERA=0.1740
BERTScore-F1: PEGASUS=0.8707, PRIMERA=0.8268
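To make the ROUGE-1 numbers above easier to interpret, here is a simplified sketch of the underlying computation: clipped unigram overlap turned into precision, recall, and F1. It assumes lowercased whitespace tokenization, which is a simplification; the `rouge_score` package used by `evaluate` applies its own tokenizer and optional stemming, so values will not match exactly. `rouge1_f1` is a hypothetical name for this example.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a prediction and a reference (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection clips each token's count to the smaller of the two sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Five of six unigrams match on both sides, so precision = recall = F1 = 5/6.
print(round(rouge1_f1("the cat sat on the mat", "the cat lay on the mat"), 4))
```

ROUGE-2 follows the same pattern over bigrams, and ROUGE-L replaces the overlap count with the length of the longest common subsequence.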


In [None]:
# Clone the repository if it doesn't exist
import os
if not os.path.exists('BARTScore'):
    !git clone https://github.com/neulab/BARTScore.git

# Install the specific requirements for BARTScore
!pip install -q transformers

In [None]:
import sys
import numpy as np
import torch

# 1. ADD THE FOLDER TO PATH
# This tells Python to look inside the cloned folder for 'bart_score'
sys.path.append('/content/BARTScore')

# 2. IMPORT AND INITIALIZE
from bart_score import BARTScorer

# Use the device you defined earlier (cuda)
bart_scorer = BARTScorer(device=device, checkpoint='facebook/bart-large-cnn')

def eval_bartscore(preds, sources):
    # BARTScore is essentially the log-likelihood of the summary given the source
    # The higher (closer to 0), the better.
    scores = bart_scorer.score(sources, preds, batch_size=4)
    return np.mean(scores)

print("🚀 Evaluating BARTScore...")
peg_bart = eval_bartscore(pegasus_preds, src_docs)
prim_bart = eval_bartscore(primera_preds, src_docs)

# Add to your results table
add_result("BARTScore (Faithfulness)", peg_bart, prim_bart)

# 3. CLEAN UP (Crucial for T4 GPU memory)
del bart_scorer
torch.cuda.empty_cache()
print("✅ BARTScore completed and memory cleared.")

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

🚀 Evaluating BARTScore...
BARTScore (Faithfulness): PEGASUS=-1.3601, PRIMERA=-1.4148
✅ BARTScore completed and memory cleared.
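The BARTScore values above are negative because, as the comment in the cell notes, the score is a log-likelihood. A minimal sketch of the idea, assuming the score is the average per-token log-probability of the summary given the source, is shown below. The token probabilities are made-up illustration values, and `mean_log_likelihood` is a hypothetical helper, not a function from the BARTScore repository.

```python
import math
from typing import Sequence


def mean_log_likelihood(token_probs: Sequence[float]) -> float:
    """Average natural-log probability over the tokens of a generated summary."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    return sum(math.log(p) for p in token_probs) / len(token_probs)


# Hypothetical per-token probabilities assigned by a scoring model.
print(mean_log_likelihood([0.5, 0.25]))

# Reading a score back as a geometric-mean token probability:
print(math.exp(-1.3601))
```

Under this reading, a score of -1.3601 corresponds to a geometric-mean token probability of roughly 0.26, and scores closer to 0 indicate summaries the scoring model finds more likely given the source.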


In [None]:
!pip install -q --no-deps summac



In [None]:
# --- 3. SummaC ---
import nltk
import numpy as np
import torch
from summac.model_summac import SummaCZS

# 1. Fix the LookupError by downloading required NLTK data
nltk.download('punkt')
nltk.download('punkt_tab')

# 2. Initialize the model
print("Loading SummaC model (vitc)...")
# Note: if you get OOM (Memory Error), change granularity to "document"
model_zs = SummaCZS(granularity="sentence", model_name="vitc", device=device)

def eval_summac(preds, sources):
    # SummaC expects lists of strings
    # sources: original documents
    # preds: generated summaries
    scores = model_zs.score(sources, preds)
    return np.mean(scores['scores'])

print("🚀 Evaluating SummaC Scores...")
peg_summac = eval_summac(pegasus_preds, src_docs)
prim_summac = eval_summac(primera_preds, src_docs)

# Add results to your table
add_result("SummaC-ZS", peg_summac, prim_summac)

# 3. Memory cleanup for the T4 GPU: release the SummaC model before the next metric loads
del model_zs
torch.cuda.empty_cache()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Loading SummaC model (vitc)...
🚀 Evaluating SummaC Scores...


tokenizer_config.json:   0%|          | 0.00/217 [00:00<?, ?B/s]

config.json: 0.00B [00:00, ?B/s]

spiece.model:   0%|          | 0.00/760k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/156 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/235M [00:00<?, ?B/s]

SummaC-ZS: PEGASUS=-0.1326, PRIMERA=0.2233
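A negative SummaC-ZS score, as PEGASUS receives above, is possible because the score lives in roughly [-1, 1]. The sketch below illustrates one way such a score can arise, assuming the zero-shot aggregation works as follows: for each summary sentence, take the maximum entailment and the maximum contradiction probability over all document sentences, subtract, then average over summary sentences. This aggregation is an assumption about the default configuration, the NLI probabilities are invented for illustration, and `summac_zs_like_score` is a hypothetical name.

```python
import numpy as np


def summac_zs_like_score(entail: np.ndarray, contra: np.ndarray) -> float:
    """Aggregate NLI probabilities into one consistency score.

    Both arrays have shape (num_document_sentences, num_summary_sentences).
    """
    best_entail = entail.max(axis=0)  # strongest support for each summary sentence
    best_contra = contra.max(axis=0)  # strongest contradiction for each summary sentence
    return float((best_entail - best_contra).mean())


# Two document sentences (rows) scored against two summary sentences (columns).
entail = np.array([[0.9, 0.1],
                   [0.2, 0.3]])
contra = np.array([[0.05, 0.8],
                   [0.10, 0.2]])
print(summac_zs_like_score(entail, contra))
```

In this toy case the first summary sentence is well supported (0.9 - 0.1) while the second is contradicted (0.3 - 0.8), so a single unsupported sentence pulls the average down sharply.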


In [None]:
# 1. Install the correct tokenizer packages
!pip install -q sacremoses mosestokenizer jsonlines

import sys
import os
import importlib
import numpy as np

# 2. Add UniEval to the path correctly
unieval_path = os.path.abspath('UniEval')
if unieval_path not in sys.path:
    sys.path.insert(0, unieval_path)

# 3. FORCE Python to forget the 'utils' from BARTScore
if 'utils' in sys.modules:
    del sys.modules['utils']

# 4. Now import - this will now pull from /content/UniEval/utils.py
from utils import convert_to_json
from metric.evaluator import get_evaluator

print("✅ UniEval modules loaded successfully without BARTScore interference.")

def eval_unieval(preds, sources, refs):
    # UniEval evaluates four dimensions: Coherence, Consistency, Fluency, Relevance
    data = convert_to_json(output_list=preds, src_list=sources, ref_list=refs)

    # Initialize evaluator (this will download the UniEval-summarization checkpoint)
    evaluator = get_evaluator('summarization', device=device)

    eval_scores = evaluator.evaluate(data, print_result=False)

    # Extract means (UniEval returns a list of dictionaries)
    coherence = np.mean([s['coherence'] for s in eval_scores])
    consistency = np.mean([s['consistency'] for s in eval_scores])
    fluency = np.mean([s['fluency'] for s in eval_scores])
    relevance = np.mean([s['relevance'] for s in eval_scores])

    return coherence, consistency, fluency, relevance

print("🚀 Evaluating UniEval (this may download model weights)...")
peg_uni = eval_unieval(pegasus_preds, src_docs, gold_sums)
prim_uni = eval_unieval(primera_preds, src_docs, gold_sums)

# Store results
add_result("UniEval-Coherence", peg_uni[0], prim_uni[0])
add_result("UniEval-Consistency", peg_uni[1], prim_uni[1])
add_result("UniEval-Fluency", peg_uni[2], prim_uni[2])
add_result("UniEval-Relevance", peg_uni[3], prim_uni[3])

print("✅ UniEval Evaluation Complete.")

✅ UniEval modules loaded successfully without BARTScore interference.
🚀 Evaluating UniEval (this may download model weights)...


spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

pytorch_model.bin:   0%|          | 0.00/3.13G [00:00<?, ?B/s]

Evaluating coherence of 100 samples !!!


Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
100%|██████████| 13/13 [00:38<00:00,  2.99s/it]


Evaluating consistency of 100 samples !!!


100%|██████████| 108/108 [05:30<00:00,  3.06s/it]


Evaluating fluency of 100 samples !!!


100%|██████████| 108/108 [00:17<00:00,  6.19it/s]


Evaluating relevance of 100 samples !!!


100%|██████████| 13/13 [00:24<00:00,  1.89s/it]


model.safetensors:   0%|          | 0.00/3.13G [00:00<?, ?B/s]

Evaluating coherence of 100 samples !!!



100%|██████████| 13/13 [00:39<00:00,  3.04s/it]


Evaluating consistency of 100 samples !!!




Evaluating fluency of 100 samples !!!


100%|██████████| 58/58 [00:17<00:00,  3.24it/s]


Evaluating relevance of 100 samples !!!


100%|██████████| 13/13 [00:23<00:00,  1.79s/it]

UniEval-Coherence: PEGASUS=0.7898, PRIMERA=0.5662
UniEval-Consistency: PEGASUS=0.7471, PRIMERA=0.8330
UniEval-Fluency: PEGASUS=0.9171, PRIMERA=0.7574
UniEval-Relevance: PEGASUS=0.7328, PRIMERA=0.3557
✅ UniEval Evaluation Complete.
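The UniEval cell has to evict `utils` from `sys.modules` because BARTScore and UniEval both ship a top-level module with that name, and Python reuses whichever one was imported first. The sketch below reproduces the situation with two temporary folders and shows the eviction pattern. `import_fresh`, `demo_shadowing`, and the module name `shadow_demo_utils` are hypothetical and exist only for this illustration.

```python
import importlib
import os
import sys
import tempfile


def import_fresh(module_name: str, search_dir: str):
    """Import `module_name` from `search_dir`, ignoring any cached module of that name."""
    sys.modules.pop(module_name, None)   # drop the stale cache entry, if any
    if search_dir in sys.path:
        sys.path.remove(search_dir)
    sys.path.insert(0, search_dir)       # give this folder top priority
    importlib.invalidate_caches()        # make the import system rescan directories
    return importlib.import_module(module_name)


def demo_shadowing():
    """Create two same-named modules in different folders and import each in turn."""
    with tempfile.TemporaryDirectory() as dir_a, tempfile.TemporaryDirectory() as dir_b:
        for folder, tag in ((dir_a, "A"), (dir_b, "B")):
            with open(os.path.join(folder, "shadow_demo_utils.py"), "w") as handle:
                handle.write(f"ORIGIN = {tag!r}\n")
        first = import_fresh("shadow_demo_utils", dir_a).ORIGIN
        second = import_fresh("shadow_demo_utils", dir_b).ORIGIN
        return first, second


print(demo_shadowing())
```

Without the `sys.modules.pop` call, the second import would silently return the module loaded from the first folder, which is the interference the UniEval cell works around.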





In [None]:
import os

# 1. Manually edit the AlignScore setup file to remove the "torch<2" restriction
setup_path = 'AlignScore/setup.py'
if os.path.exists(setup_path):
    with open(setup_path, 'r') as f:
        content = f.read()
    # Remove the version cap on torch and pytorch-lightning
    content = content.replace('torch<2,>=1.12.1', 'torch')
    content = content.replace('pytorch-lightning<2,>=1.7.7', 'pytorch-lightning')
    with open(setup_path, 'w') as f:
        f.write(content)
    print("✅ Patched AlignScore/setup.py to allow modern PyTorch.")

# 2. Force install WITHOUT checking dependency versions
# This bypasses the 'torch<2' error entirely
!pip install --no-deps -e AlignScore

# 3. Install the few missing pieces manually (that don't conflict)
!pip install -q jsonlines pytorch-lightning sentence-transformers

Obtaining file:///content/AlignScore
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Installing backend dependencies ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: alignscore
  Building editable for alignscore (pyproject.toml) ... done
  Created wheel for alignscore: filename=alignscore-0.1.3-py3-none-any.whl size=8479 sha256=442d3f4ad3fb6b40e2b05245afeda2108e36178f09fd16937a45522cb8ae76bf
  Stored in directory: /tmp/pip-ephem-wheel-cache-js86p20m/wheels/08/59/20/57f5b9343f7921a44f6a27d3d0fa9f77f3c619ff21f30780ed
Successfully built alignscore
Installing collected packages: alignscore
  Attempting uninstall: alignscore
    Found existing installation: alignscore 0.1.3
    Uninstalling alignscore-0.1.3:
      Successfully uninstalled alignscore-0.1.3
Succe

In [None]:
from transformers import RobertaModel, RobertaTokenizer
import time

# 1. Pre-download the base model with a retry loop
model_name = "roberta-base"
max_retries = 3

for i in range(max_retries):
    try:
        print(f"⏳ Attempt {i+1}: Pre-downloading {model_name}...")
        RobertaTokenizer.from_pretrained(model_name)
        RobertaModel.from_pretrained(model_name)
        print(f"✅ {model_name} is now cached successfully!")
        break
    except Exception as e:
        print(f"⚠️ Attempt {i+1} failed: {e}")
        if i < max_retries - 1:
            time.sleep(5) # Wait 5 seconds before retrying
        else:
            print("❌ Failed to download model after multiple attempts. Check your internet connection.")

⏳ Attempt 1: Pre-downloading roberta-base...


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ roberta-base is now cached successfully!
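The retry loop in the cell above can be factored into a small reusable helper. The sketch below is one possible shape, not the notebook's own code: `retry` and its parameters are hypothetical names, and unlike the loop above it raises after the final failure so that later cells do not run with a missing model.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry(
    fn: Callable[[], T],
    max_retries: int = 3,
    delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn` until it succeeds or `max_retries` attempts have failed."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except Exception as err:  # broad on purpose: network errors vary by library
            last_error = err
            if attempt < max_retries:
                sleep(delay_s)  # wait before the next attempt
    raise RuntimeError(f"failed after {max_retries} attempts") from last_error


# Example (not executed here):
# retry(lambda: RobertaModel.from_pretrained("roberta-base"))
```

Injecting `sleep` keeps the helper easy to exercise without real waiting.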


In [None]:
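import nltk
import torch
import numpy as np
# AlignScore must be imported explicitly; without this line the cell fails with
# "name 'AlignScore' is not defined".
from alignscore import AlignScore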

# 1. Essential NLTK downloads for sentence splitting
nltk.download('punkt')
nltk.download('punkt_tab')

def run_alignscore_eval(preds, sources, model_name):
    print(f"🚀 Starting AlignScore for {model_name}...")

    # Initialize the scorer
    # This loads the 800MB checkpoint and RoBERTa-base
    scorer = AlignScore(
        model='roberta-base',
        batch_size=16,
        device='cuda' if torch.cuda.is_available() else 'cpu',
        ckpt_path='AlignScore/AlignScore-base.ckpt',
        evaluation_mode='nli_sp'
    )

    print(f"Computing alignment (sentence-by-sentence) for {len(preds)} samples...")
    # AlignScore splits your text into sentences and compares them via NLI
    scores = scorer.score(contexts=sources, claims=preds)
    avg_score = np.mean(scores)

    print(f"✅ {model_name} AlignScore: {avg_score:.4f}")

    # Crucial: Free up memory for the next model
    del scorer
    torch.cuda.empty_cache()

    return avg_score

# 2. Execute with error handling
try:
    peg_align = run_alignscore_eval(pegasus_preds, src_docs, "PEGASUS")
    prim_align = run_alignscore_eval(primera_preds, src_docs, "PRIMERA")

    # Add to your global results dictionary
    add_result("AlignScore", peg_align, prim_align)
    print("\n🎉 ALL METRICS CAPTURED!")
except Exception as e:
    print(f"❌ AlignScore Evaluation Failed: {e}")

🚀 Starting AlignScore for PEGASUS...
❌ AlignScore Evaluation Failed: name 'AlignScore' is not defined


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [None]:
import pandas as pd

# 1. Compile all results into a dictionary
# (Make sure these variable names match what you used in previous cells)
data = {
    "Metric": [
        "ROUGE-1 (↑)", "ROUGE-2 (↑)", "ROUGE-L (↑)",
        "BERTScore (↑)",
        "BARTScore (Faithfulness ↑)",
        "SummaC-ZS (Factuality ↑)",
        "AlignScore (Unified ↑)",
        "UniEval-Coherence (↑)",
        "UniEval-Consistency (↑)",
        "UniEval-Fluency (↑)",
        "UniEval-Relevance (↑)"
    ],
    # BERTScore was stored as peg_bs / prim_bs in the ROUGE & BERTScore cell.
    "PEGASUS (Baseline)": [
        peg_rouge['rouge1'], peg_rouge['rouge2'], peg_rouge['rougeL'],
        peg_bs, peg_bart, peg_summac, peg_align,
        peg_uni[0], peg_uni[1], peg_uni[2], peg_uni[3]
    ],
    "PRIMERA (Multi-News)": [
        prim_rouge['rouge1'], prim_rouge['rouge2'], prim_rouge['rougeL'],
        prim_bs, prim_bart, prim_summac, prim_align,
        prim_uni[0], prim_uni[1], prim_uni[2], prim_uni[3]
    ]
}

# 2. Create and format the table
df_final = pd.DataFrame(data)
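# Divide by the absolute baseline so the sign stays meaningful for negative-valued
# metrics such as BARTScore (a positive delta then always means PRIMERA scored higher).
df_final["Delta (%)"] = ((df_final.iloc[:, 2] - df_final.iloc[:, 1]) / df_final.iloc[:, 1].abs() * 100).round(2)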

print("📊 FINAL RESEARCH RESULTS")
display(df_final.style.highlight_max(axis=1, subset=["PEGASUS (Baseline)", "PRIMERA (Multi-News)"], color='lightgreen'))

# 3. Export to LaTeX for your paper
print("\n📝 Copy this LaTeX code into your Overleaf/Paper:")
print(df_final.to_latex(index=False, float_format="%.4f"))
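One caveat when reading a percentage delta column: for metrics whose values are negative (BARTScore, and SummaC-ZS for PEGASUS in this run), a plain relative difference flips sign, so a worse score can appear as a positive change. A minimal sketch of a sign-safe variant follows, checked against the scores printed earlier in this notebook; `relative_delta_pct` is a hypothetical helper name.

```python
def relative_delta_pct(baseline: float, candidate: float) -> float:
    """Percentage change from baseline to candidate, relative to |baseline|.

    Positive means the candidate value is higher than the baseline value.
    """
    if baseline == 0:
        raise ZeroDivisionError("baseline must be non-zero for a relative delta")
    return (candidate - baseline) / abs(baseline) * 100.0


# Scores reported above (PEGASUS as baseline, PRIMERA as candidate).
print(round(relative_delta_pct(0.4656, 0.3488), 2))    # ROUGE-1
print(round(relative_delta_pct(-1.3601, -1.4148), 2))  # BARTScore
print(round(relative_delta_pct(-0.1326, 0.2233), 2))   # SummaC-ZS
```

Relative deltas remain hard to interpret for metrics that can cross zero, so reporting the absolute difference alongside them may be the safer choice for such rows.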