In [14]:
paper = """On its 50th anniversary, this work celebrates Hawking's revolutionary finding that cosmic black holes release heat energy - a phenomenon now bearing his name. Through mathematical size and unit analysis, researchers establish the Hawking heat level formula, which has become a cornerstone in modern space science. Their analysis reveals how space holes possess a measurable warmth level (TH) that changes in opposite proportion to their size (MBH). Beyond the math framework, the study explores the broader implications of this discovery, including how these cosmic voids gradually fade away and their internal disorder measurement.

The investigation starts with an overview of unit analysis - a method for studying measurable quantities in different scales. Starting from the basic idea that a stationary space hole's properties are determined by its mass, the researchers use mathematical relationships to calculate the heat output. Central to their calculations is the space-mass attraction constant Gm.

Examining the implications of Hawking's insight, the study explains the importance of space holes having a measurable temperature. The researchers explore the role of internal disorder in this context, noting how it relates to the size of the hole's outer boundary. In its final sections, the work addresses both the challenges in detecting this heat output and the promising field of simulated gravity - a discipline focused on recreating space-like conditions through laboratory setups.

The text covers various specialized ideas, ranging from unit analysis and heat radiation to disorder measurement and simulated gravity systems. Mathematical formulas support a detailed calculation of the heat output levels. Written with precision and clarity, the content remains approachable for those with a background in space science and matter studies.
"""

In [15]:
summary = """The paper commemorates the 50th anniversary of Stephen Hawking's groundbreaking discovery of black holes emitting thermal radiation, known as Hawking radiation. The authors use dimensional analysis to derive the Hawking temperature, a fundamental concept in modern astrophysics. They demonstrate that black holes have an absolute temperature, TH, which depends inversely on their mass, MBH. The authors also explore the physical implications of Hawking's discovery, including the evaporation of black holes and their entropy.

The paper begins by introducing the concept of dimensional analysis, a tool used to study physical quantities with different dimensions. The authors then derive the Hawking temperature using dimensional analysis, starting from the assumption that a static black hole is characterized by its mass. They introduce the standard gravitational parameter, Gm, which is essential to the derivation.

The authors then discuss the physical meaning of Hawking's discovery, highlighting the implications of black holes having an absolute temperature. They also explore the concept of entropy, which is proportional to the area of the event horizon. The paper concludes by discussing the challenges of detecting Hawking radiation and the potential applications of analogue gravity, a field that aims to create physical systems that mimic the behavior of gravitational phenomena.

Throughout the paper, the authors use a range of technical terms, including dimensional analysis, Hawking radiation, entropy, and analogue gravity. They also provide a detailed derivation of the Hawking temperature, using mathematical equations and formulas. The paper is written in a clear and concise style, making it accessible to readers with a background in astrophysics and physics.

"""

In [None]:
from typing import Dict, List
import numpy as np
from transformers import AutoTokenizer, AutoModel
from rouge_score import rouge_scorer
from bert_score import BERTScorer
import torch
import re

class ScientificMetricsEvaluator:
    def __init__(self):
        """Initialize scientific metrics"""
        self.rouge_scorer = rouge_scorer.RougeScorer(
            ['rouge1', 'rouge2', 'rougeL', 'rougeLsum'],
            use_stemmer=True
        )
        
        self.bert_scorer = BERTScorer(
            model_type="adsabs/astroBERT",
            num_layers=9,
            batch_size=32,
            nthreads=4,
            all_layers=False,
            idf=False,
            lang='en',
            rescale_with_baseline=False
        )
        
        self.tokenizer = AutoTokenizer.from_pretrained("adsabs/astroBERT")
        self.max_length = 512
        self.overlap = 50  # Number of tokens to overlap between chunks

    def _preprocess_text_rouge(self, text: str) -> str:
        """Preprocessing for ROUGE - keeps whitespace"""
        text = re.sub(r'[^\w\s,.!?-]', '', text)
        return ' '.join(text.split()).strip()

    def _preprocess_text_bert(self, text: str) -> str:
        """Minimal preprocessing for BERT - only removes potentially problematic characters"""
        return re.sub(r'[^\w,.!?-]', '', text)

    def _chunk_text(self, text: str) -> List[str]:
        """
        Chunk text into overlapping sequences that fit within BERT's maximum sequence length
        
        Args:
            text (str): Input text to be chunked
        
        Returns:
            List[str]: List of text chunks
        """
        # Tokenize the entire text
        tokens = self.tokenizer.encode(text, add_special_tokens=False)
        
        # Initialize chunks
        chunks = []
        
        # Create overlapping chunks
        for i in range(0, len(tokens), self.max_length - self.overlap):
            # Take a chunk of max_length, starting from i
            chunk_tokens = tokens[i:i + self.max_length]
            
            # Add special tokens
            chunk_tokens = [self.tokenizer.cls_token_id] + chunk_tokens + [self.tokenizer.sep_token_id]
            
            # Decode the chunk
            chunk = self.tokenizer.decode(chunk_tokens)
            chunks.append(chunk)
        
        return chunks

    def calculate_rouge(self, reference: str, candidate: str) -> Dict[str, float]:
        """Calculate ROUGE scores"""
        reference = self._preprocess_text_rouge(reference)
        candidate = self._preprocess_text_rouge(candidate)
        
        scores = self.rouge_scorer.score(reference, candidate)
        
        return {
            'rouge1': scores['rouge1'].fmeasure,
            'rouge2': scores['rouge2'].fmeasure,
            'rougeL': scores['rougeL'].fmeasure,
            'rougeLsum': scores['rougeLsum'].fmeasure
        }

    def calculate_bertscore(self, reference: str, candidate: str) -> Dict[str, float]:
        """Calculate BERTScore using chunked texts"""
        reference = self._preprocess_text_bert(reference)
        candidate = self._preprocess_text_bert(candidate)
        
        # Chunk both reference and candidate
        reference_chunks = self._chunk_text(reference)
        candidate_chunks = self._chunk_text(candidate)
        
        # Calculate BERTScore for each chunk pair and average
        chunk_scores = []
        for ref_chunk in reference_chunks:
            for cand_chunk in candidate_chunks:
                P, R, F1 = self.bert_scorer.score([cand_chunk], [ref_chunk])
                chunk_scores.append((float(P[0]), float(R[0]), float(F1[0])))
        
        # Average the chunk scores
        if chunk_scores:
            avg_precision = np.mean([score[0] for score in chunk_scores])
            avg_recall = np.mean([score[1] for score in chunk_scores])
            avg_f1 = np.mean([score[2] for score in chunk_scores])
        else:
            avg_precision = avg_recall = avg_f1 = 0.0
        
        return {
            'precision': avg_precision,
            'recall': avg_recall,
            'f1': avg_f1
        }

    def calculate_ngram_novelty(self, reference: str, candidate: str) -> float:
        """Calculate the proportion of novel n-grams in the candidate"""
        reference = self._preprocess_text_rouge(reference)
        candidate = self._preprocess_text_rouge(candidate)
        
        def get_ngrams(text, n):
            words = text.split()
            return set(' '.join(words[i:i+n]) for i in range(len(words)-n+1))
        
        # Calculate for different n-gram sizes
        novelty_scores = []
        for n in [1, 2, 3]:
            ref_ngrams = get_ngrams(reference, n)
            cand_ngrams = get_ngrams(candidate, n)
            novel_ngrams = cand_ngrams - ref_ngrams
            if cand_ngrams:
                novelty_scores.append(len(novel_ngrams) / len(cand_ngrams))
        
        return np.mean(novelty_scores) if novelty_scores else 0.0

    def evaluate_summary(self, reference: str, candidate: str) -> Dict[str, float]:
        """Calculate all metrics and combine them with weights optimized for abstractive summaries"""
        rouge_scores = self.calculate_rouge(reference, candidate)
        bert_scores = self.calculate_bertscore(reference, candidate)
        novelty_score = self.calculate_ngram_novelty(reference, candidate)
        
        final_scores = {
            'rouge1': rouge_scores['rouge1'],
            'rouge2': rouge_scores['rouge2'],
            'rougeL': rouge_scores['rougeL'],
            'rougeLsum': rouge_scores['rougeLsum'],
            'bertscore_precision': bert_scores['precision'],
            'bertscore_recall': bert_scores['recall'],
            'bertscore_f1': bert_scores['f1'],
            'ngram_novelty': novelty_score
        }
        
        # New weighting scheme that rewards:
        # 1. High semantic similarity (BERTScore) 
        # 2. High n-gram novelty (high term novelty and similiarity means that it understands the text sementically and is able to use terms interchangebly)
        weights = {
            'bertscore_f1': 0.5,      # Semantic preservation
            'ngram_novelty': 0.5,     # Vocabulary diversity        
        }
        
        final_scores['abstractive_score'] = sum(
            final_scores[metric] * weight
            for metric, weight in weights.items()
        )
        
        return final_scores

def evaluate_scientific_summary(reference_text: str, summary_text: str) -> Dict[str, float]:
    """Convenience function for evaluating scientific summaries"""
    evaluator = ScientificMetricsEvaluator()
    return evaluator.evaluate_summary(reference_text, summary_text)

In [17]:
scores = evaluate_scientific_summary(paper, summary)
print(scores)

{'rouge1': 0.43478260869565216, 'rouge2': 0.11764705882352941, 'rougeL': 0.32136105860113423, 'rougeLsum': 0.32136105860113423, 'bertscore_precision': 0.7116235196590424, 'bertscore_recall': 0.7540913820266724, 'bertscore_f1': 0.7305146753787994, 'ngram_novelty': 0.8522407782276202, 'abstractive_score': 0.6852017828779127}
