## 🔍 Notebook Summary: Paper Summarization

This notebook presents both extractive and abstractive approaches to generate paper summarization related to AI research. It includes:

1. **Text preprocessing**: Text cleaning including text normalization, citations/references removal, and abstract exclusion.
2. **SBERT-Based Semantic Embedding and TextRank**: Dense sentence embeddings using `all-MiniLM-L6-v2` and Text Rank method to extract key sentences.
3. **Pegasus pre-trained model**: Abstractive summarization using Pegasus pre-trained model.
4. **Evaluation**: Measures performance of generated summary as compared to the Abstract using ROUGE, BLEU, and BERT scoring

In [None]:
from helper_functions import *
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import nltk
from nltk.tokenize import sent_tokenize

seed=7 #for random state / reproducibility #curr best 8
pd.set_option('display.width', 200)
pd.set_option('display.max_colwidth', 200) 

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Install Dependencies

In [20]:
%pip install transformers sentence-transformers torch nltk
%pip install sentencepiece

Note: you may need to restart the kernel to use updated packages.




Note: you may need to restart the kernel to use updated packages.




In [21]:
df = pd.read_csv("ai_ml_papers.csv")

  df = pd.read_csv("ai_ml_papers.csv")


In [22]:
from helper_functions import extract_sections

#Extract paper content by excluding abstract and references
def remove_abstract_and_references(text, sections):
    lines = text.split('\n')
    
    intro_lines = [line for line, section in sections if section.upper() == 'INTRODUCTION']
    intro_line = min(intro_lines) if intro_lines else None #position if intro
    
    ref_lines = [line for line, section in sections if section.upper() == 'REFERENCES' or section == "BIBLIOGRAPHY" or section == "ACKNOWLEDGEMENTS"]
    ref_line = min(ref_lines) if ref_lines else None #position of references

    if intro_line and ref_line: #extract paper only from introduction and exclude reference
        trimmed_lines = lines[intro_line : (ref_line - 1)] 
    elif intro_line:
        trimmed_lines = lines[intro_line:]
    elif ref_line:
        trimmed_lines = lines[:ref_line]
    else:
        trimmed_lines = lines
    
    return '\n'.join(trimmed_lines)

n_sample = 3

df_sample = df.sample(n=n_sample, random_state=seed)
#extract and append full-text
df_sample['full_text'] = df_sample['id'].apply(extract_pdf_text)

df_sample['sections'] = df_sample['full_text'].apply(extract_sections)
#print(df_sample['full_text'].iloc[1])
df_sample['removed'] = df_sample.apply(
    lambda row: remove_abstract_and_references(row['full_text'], row['sections']), axis=1
)
print(df_sample['removed'])

#extract_sections(df_sample['full_text'])


103953    Bayesian Attention Belief Networks\nShujian Zhang * 1 Xinjie Fan * 1 Bo Chen 2 Mingyuan Zhou 1\nAbstract\nAttention-based neural networks have achieved\nstate-of-the-art results on a wide range of...
143223    1 \n \n \nResearch on Stable Obstacle Avoidance Control \nStrategy for Tracked Intelligent Transportation \nVehicles in Non-structural Environment Based on \nDeep Learning \nYitian Wang, Jun Lin, ...
130405    Introduction\nReinforcement learning (RL) agents have reached superhuman performance in many tasks, such as\ngames, with clearly deﬁned objectives [23, 3, 26]. However, real-world deployment of RL...
Name: removed, dtype: object


**Text preprocessing**

In [23]:
#tokenization and text cleaning
from nltk.tokenize import word_tokenize
nltk.download("punkt")

def preprocess(text):

    # Lowercase and remove URLs/special characters
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    
    #Removing Extra Spaces
    text = re.sub(r'\s+', ' ', text).strip()
    
    #Remove single characters or digits, tend to be part of formula
    text = re.sub(r'\b[a-z0-9]\b', '', text)
    
    #Remove -\n (dash and newline) 
    # For example: au-\ntonomous -> autonomous
    text = re.sub(r'(\w+)-\s*(\w+)', r'\1\2', text)
    
    #Remove Citations
    text = re.sub(r'\[\d+\]', '', text)  # Removes [12] , [2-5]
    text = re.sub(r'\([\w\s,.]+,\s\d{4}\s?\)', '', text) # Removes (Author, Year) or (et al., 2023)
    text = re.sub(r'\[[\w\s,.]+,\s\d{4}\s?\]', '', text) # Removes [Author, Year] or [et al., 2023]
    text = re.sub(r'et al.,', '', text) # Removes et al. in general
    
    # Tokenize
    tokens = word_tokenize(text)
    tokens = [word for word in tokens]
    
    return ' '.join(tokens)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ekabu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Method 1. BERT Embeddings with Text Rank Summary

In [24]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx


# Load a pre-trained BERT model for sentence embeddings
model_BERT = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def summarize_BERT_TextRank(text):
    text = preprocess(text) #preprocess
    sentences = sent_tokenize(text)

    # Convert each sentence into an embedding
    sentence_embeddings = model_BERT.encode(sentences)

    # Compute similarity matrix
    sim_matrix = cosine_similarity(sentence_embeddings)
    nx_graph = nx.from_numpy_array(sim_matrix)
    scores = nx.pagerank(nx_graph)

    # Rank sentences using PageRank scores
    top_n = 2
    ranked = sorted(((scores[i], i, s) for i, s in enumerate(sentences)), reverse=True)  # Include index to track original position
    top_indices = [i for (_, i, _) in ranked[:top_n]]  # Extract top 5 indices

    # Sort indices to restore original order (e.g., [2, 0, 4] → [0, 2, 4])
    top_indices_sorted = sorted(top_indices)

    # Extract the summary
    summary = " ".join([sentences[i] for i in top_indices_sorted])

    return summary
  

Method 2. Use pretrained google/pegasus-xsum model

In [25]:
import torch
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and tokenizer
model_name = "sshleifer/distill-pegasus-xsum-16-4"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)

In [26]:
# Join sentences while keeping within the 1024-token limit
def chunk_text(sentences, tokenizer, max_tokens=1024, overlap=100):
    chunks = []
    current_chunk = []
    current_length = 0

    for sentence in sentences:
        # Tokenize once and cache
        tokens = tokenizer.tokenize(sentence)
        token_len = len(tokens)
        
        # Flush chunk if adding this sentence exceeds max_tokens
        if current_length + token_len > max_tokens:
            chunks.append(" ".join(current_chunk))
            # Retain last `overlap` tokens for context (if possible)
            if overlap > 0:
                overlap_start = max(0, len(current_chunk) - overlap)
                current_chunk = current_chunk[overlap_start:]
                current_length = sum(len(tokenizer.tokenize(s)) for s in current_chunk)
            else:
                current_chunk = []
                current_length = 0
        
        current_chunk.append(sentence)
        current_length += token_len

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

def summarize_pegasus(text):
    text = preprocess(text) #preprocess
    sentences = sent_tokenize(text)
    
    chunks = chunk_text(sentences, tokenizer)
    
    summary_list = []

    for chunk in chunks:
        inputs = tokenizer(chunk, return_tensors="pt", max_length=1024, truncation=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}  # move inputs to same device
        
        summary_ids = model.generate(inputs["input_ids"], max_length=50, min_length=30, num_beams=5, early_stopping=True)
        summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
        summary_list.append(summary)

    # Combine all summarized chunks
    summary = " ".join(summary_list)
    
    return summary


**Extract full-text pdf and Generate sections**  
Note: each section shows (line number, section name)

In [27]:
n_sample = 3

df_sample = df.sample(n=n_sample, random_state=seed)
#extract and append full-text
df_sample['full_text'] = df_sample['id'].apply(extract_pdf_text)
df_sample['sections'] = df_sample['full_text'].apply(extract_sections)
df_sample[['title','sections']].head()


Unnamed: 0,title,sections
103953,Bayesian Attention Belief Networks,"[(2, Abstract), (1142, Acknowledgements), (1151, References)]"
143223,Bayesian Variable Selection in a Million Dimensions,"[(1690, References)]"
130405,Preprocessing Reward Functions for Interpretability,"[(7, Abstract), (21, Introduction), (174, Results), (424, Conclusion), (433, References)]"


**Text preprocessing**  
Remove Abstract & References and text normalization

In [28]:
df_sample['preprocessed_text'] = df_sample.apply(
    lambda row: remove_abstract_and_references(row['full_text'], row['sections']), axis=1
)
df_sample['preprocessed_text'] = df_sample['preprocessed_text'].apply(preprocess)
df_sample[['title','preprocessed_text']].head()

Unnamed: 0,title,preprocessed_text
103953,Bayesian Attention Belief Networks,bayesian attention belief networks shujian zhang * xinjie fan * bo chen mingyuan zhou abstract attentionbased neural networks have achieved stateof-theart results on wide range of tasks . most suc...
143223,Bayesian Variable Selection in a Million Dimensions,"research on stable obstacle avoidance control strategy for tracked intelligent transportation vehicles in nonstructural environment based on deep learning yitian wang , jun lin , liu zhang , tianh..."
130405,Preprocessing Reward Functions for Interpretability,"introduction reinforcement learning ( rl ) agents have reached superhuman performance in many tasks , such as games , with clearly deﬁned objectives [ 23 , , 26 ] . however , realworld deployment ..."


**Generate Summary**

In [29]:
df_sample['summary_bert_textRank'] = df_sample['full_text'].apply(summarize_BERT_TextRank)
print(df_sample[['id', 'title', 'summary_bert_textRank']])


                id                                                title  \
103953  2106.05251                   Bayesian Attention Belief Networks   
143223   2208.0118  Bayesian Variable Selection in a Million Dimensions   
130405  2203.13553  Preprocessing Reward Functions for Interpretability   

                                                                                                                                                                                          summary_bert_textRank  
103953                                                                                                         bayesian attention modules . , and le , . . sequence to sequence learning with neural networks .  
143223  15 ( ) , in which the unmanned vehicle successfully plans safe obstacle avoidance path and steers to avoid obstacles at the . based on this figure , the unmanned vehicle is descending and performi...  
130405  from , we produce simpler but equivalent reward function ′ , 

In [30]:
df_sample['summary_pegasus_pretrained'] = df_sample['full_text'].apply(summarize_pegasus)
print(df_sample[['id', 'title', 'summary_pegasus_pretrained']])

                id                                                title  \
103953  2106.05251                   Bayesian Attention Belief Networks   
143223   2208.0118  Bayesian Variable Selection in a Million Dimensions   
130405  2203.13553  Preprocessing Reward Functions for Interpretability   

                                                                                                                                                                                     summary_pegasus_pretrained  
103953  We have developed a new system of attention-based neural networks that outperforms deterministic and stateof-theart models , which have become the foundation for many computational tasks. We have ...  
143223  A study has been carried out at the University of China to improve the ability of autonomous vehicles to avoid obstacles in non-structural environments and in emergency situations. Obstacle avoida...  
130405  In our series of letters from the University of Berkeley, we 

**Evaluation**

In [31]:
%pip install rouge_score
%pip install evaluate
%pip install bert_score

Note: you may need to restart the kernel to use updated packages.




Note: you may need to restart the kernel to use updated packages.




Note: you may need to restart the kernel to use updated packages.




In [32]:
from rouge import Rouge

from evaluate import load
import bert_score

In [33]:
# Initialize metrics
rouge = load("rouge")
bleu = load("bleu")
# Sample input
abstract = """Queenstown, a vibrant tourist destination, thrives on its stunning scenery and diverse activities. However, this reliance on tourism presents a complex dynamic, with both opportunities and challenges for the local community. This paper explores the multifaceted impact of tourism on the Queenstown community, examining its economic benefits alongside the social and environmental consequences.\n\nTourism undeniably fuels the local economy, generating significant revenue through various sectors like accommodation, hospitality, and retail. Businesses directly and indirectly benefit from the influx of tourists, creating employment opportunities and stimulating economic growth.\n\nThe rapid growth of tourism also brings challenges. Overcrowding can lead to a decline in the quality of life for residents, with increased traffic congestion, strain on infrastructure, and a rise in housing costs pushing locals out. Furthermore, the environmental impact of tourism, such as pollution and habitat destruction, poses a long-term threat to the natural beauty that attracts tourists in the first place.\n\nTourism is a double-edged sword for Queenstown. While it provides economic opportunities, it also presents social and environmental challenges that require careful management. A sustainable approach to tourism development is crucial to ensure that the benefits are shared equitably and that the community's quality of life and the environment are protected for future generations."""

generated_summary = """This research paper examines the dual impact of tourism on Queenstown, highlighting its economic benefits (job creation, revenue) alongside the social and environmental challenges (overcrowding, infrastructure strain, environmental degradation). It concludes that a sustainable tourism approach is vital for the community's well-being and the long-term preservation of its natural beauty."""

# Run evaluation
def evaluate_automatic_metrics(abstract, generated_summary):
    results = {}

    # ROUGE
    rouge_scores = rouge.compute(predictions=[generated_summary], references=[abstract])
    results["ROUGE-1"] = round(rouge_scores["rouge1"], 3)
    results["ROUGE-2"] = round(rouge_scores["rouge2"], 3)
    results["ROUGE-L"] = round(rouge_scores["rougeL"], 3)

    # BLEU
    bleu_score = bleu.compute(predictions=[generated_summary], references=[[abstract]])
    results["BLEU"] = round(bleu_score["bleu"], 3)

    # BERTScore
    P, R, F1 = bert_score.score([generated_summary], [abstract], lang="en", verbose=False)
    results["BERTScore-F1"] = round(F1[0].item(), 3)

    return results

# Example usage:
auto_scores = evaluate_automatic_metrics(abstract, generated_summary)
for metric, score in auto_scores.items():
    print(f"{metric}: {score}")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


ROUGE-1: 0.314
ROUGE-2: 0.139
ROUGE-L: 0.238
BLEU: 0.011
BERTScore-F1: 0.892


In [34]:
df_eval = df_sample[['title', 'abstract', 'summary_bert_textRank', 'summary_pegasus_pretrained']].copy()

multi_scores_textrank = []
multi_scores_bart = []

COL_ROUGE1 = 'ROUGE-1'
COL_ROUGE2 = 'ROUGE-2'
COL_ROUGEL = 'ROUGE-L'
COL_BLEU = 'BLEU'
COL_BERT = 'BERTScore-F1'
COL_TEXTRANK = ' TextRank'
COL_PEGASUS = ' Pegasus'

df_eval[COL_ROUGE1 + COL_TEXTRANK] = None
df_eval[COL_ROUGE1 + COL_PEGASUS] = None

df_eval[COL_ROUGE2 + COL_TEXTRANK] = None
df_eval[COL_ROUGE2 + COL_PEGASUS] = None

df_eval[COL_ROUGEL + COL_TEXTRANK] = None
df_eval[COL_ROUGEL + COL_PEGASUS] = None

df_eval[COL_BLEU + COL_TEXTRANK] = None
df_eval[COL_BLEU + COL_PEGASUS] = None

df_eval[COL_BERT + COL_TEXTRANK] = None
df_eval[COL_BERT + COL_PEGASUS] = None


for index, row in df_eval.iterrows():
    auto_scores_textrank = evaluate_automatic_metrics(row['abstract'], row['summary_bert_textRank'])
    auto_scores_pegasus = evaluate_automatic_metrics(row['abstract'], row['summary_pegasus_pretrained'])
    
    df_eval.loc[index, COL_ROUGE1 + COL_TEXTRANK] = auto_scores_textrank[COL_ROUGE1]
    df_eval.loc[index, COL_ROUGE1 + COL_PEGASUS] = auto_scores_pegasus[COL_ROUGE1]
    
    df_eval.loc[index, COL_ROUGE2 + COL_TEXTRANK] = auto_scores_textrank[COL_ROUGE2]
    df_eval.loc[index, COL_ROUGE2 + COL_PEGASUS] = auto_scores_pegasus[COL_ROUGE2]
    
    df_eval.loc[index, COL_ROUGEL + COL_TEXTRANK] = auto_scores_textrank[COL_ROUGEL]
    df_eval.loc[index, COL_ROUGEL + COL_PEGASUS] = auto_scores_pegasus[COL_ROUGEL]
    
    df_eval.loc[index, COL_BLEU + COL_TEXTRANK] = auto_scores_textrank[COL_BLEU]
    df_eval.loc[index, COL_BLEU + COL_PEGASUS] = auto_scores_pegasus[COL_BLEU]
    
    df_eval.loc[index, COL_BERT + COL_TEXTRANK] = auto_scores_textrank[COL_BERT]
    df_eval.loc[index, COL_BERT + COL_PEGASUS] = auto_scores_pegasus[COL_BERT]



Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['ro

In [35]:
pd.set_option('display.max_columns', None)
df_eval.head()

Unnamed: 0,title,abstract,summary_bert_textRank,summary_pegasus_pretrained,ROUGE-1 TextRank,ROUGE-1 Pegasus,ROUGE-2 TextRank,ROUGE-2 Pegasus,ROUGE-L TextRank,ROUGE-L Pegasus,BLEU TextRank,BLEU Pegasus,BERTScore-F1 TextRank,BERTScore-F1 Pegasus
103953,Bayesian Attention Belief Networks,Attention-based neural networks have achieved state-of-the-art results on a\nwide range of tasks. Most such models use deterministic attention while\nstochastic attention is less explored due to...,"bayesian attention modules . , and le , . . sequence to sequence learning with neural networks .","We have developed a new system of attention-based neural networks that outperforms deterministic and stateof-theart models , which have become the foundation for many computational tasks. We have ...",0.073,0.018,0.021,0.009,0.062,0.013,0.0,0.003,0.793,0.789
143223,Bayesian Variable Selection in a Million Dimensions,"Bayesian variable selection is a powerful tool for data analysis, as it\noffers a principled method for variable selection that accounts for prior\ninformation and uncertainty. However, wider ad...","15 ( ) , in which the unmanned vehicle successfully plans safe obstacle avoidance path and steers to avoid obstacles at the . based on this figure , the unmanned vehicle is descending and performi...",A study has been carried out at the University of China to improve the ability of autonomous vehicles to avoid obstacles in non-structural environments and in emergency situations. Obstacle avoida...,0.126,0.012,0.0,0.001,0.069,0.009,0.0,0.0,0.795,0.741
130405,Preprocessing Reward Functions for Interpretability,"In many real-world applications, the reward function is too complex to be\nmanually specified. In such cases, reward functions must instead be learned\nfrom human feedback. Since the learned rew...","from , we produce simpler but equivalent reward function ′ , which we then visualize . -. . . . figure 11 : reward models trained on synthetic data from the path reward using preference comparison...","In our series of letters from the University of Berkeley, we have proposed a new approach to learning reward functions, which can be more difficult to understand than the original. In our series o...",0.192,0.024,0.057,0.01,0.124,0.017,0.013,0.003,0.816,0.786
