## 🔍 Notebook Summary: Paper Summarization

This notebook presents both extractive and abstractive approaches to generate paper summarization related to AI research. It includes:

1. **Text preprocessing**: Text cleaning including text normalization, citations/references removal, and abstract exclusion.
2. **SBERT-Based Semantic Embedding and TextRank**: Dense sentence embeddings using `all-MiniLM-L6-v2` and Text Rank method to extract key sentences.
3. **Pegasus pre-trained model**: Abstractive summarization using Pegasus pre-trained model.
4. **Evaluation**: Measures performance of generated summary as compared to the Abstract using ROUGE, BLEU, and BERT scoring

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import nltk
from nltk.tokenize import sent_tokenize

seed=8 #for random state / reproducibility #curr best 8
pd.set_option('display.width', 200)
pd.set_option('display.max_colwidth', 200) 

%load_ext autoreload
%autoreload 2

  from pandas.core import (


Install Dependencies

In [2]:
%pip install transformers sentence-transformers torch nltk
%pip install sentencepiece

Note: you may need to restart the kernel to use updated packages.




Note: you may need to restart the kernel to use updated packages.




In [3]:
df = pd.read_csv("ai_ml_papers.csv")

  df = pd.read_csv("ai_ml_papers.csv")


In [4]:
from helper_functions import extract_sections, extract_pdf_text

#Extract paper content by excluding abstract and references
def remove_abstract_and_references(text, sections):
    lines = text.split('\n')
    
    intro_lines = [line for line, section in sections if section.upper() == 'INTRODUCTION']
    intro_line = min(intro_lines) if intro_lines else None #position if intro
    
    ref_lines = [line for line, section in sections if section.upper() == 'REFERENCES' or section == "BIBLIOGRAPHY" or section == "ACKNOWLEDGEMENTS"]
    ref_line = min(ref_lines) if ref_lines else None #position of references

    if intro_line and ref_line: #extract paper only from introduction and exclude reference
        trimmed_lines = lines[intro_line : (ref_line - 1)] 
    elif intro_line:
        trimmed_lines = lines[intro_line:]
    elif ref_line:
        trimmed_lines = lines[:ref_line]
    else:
        trimmed_lines = lines
    
    return '\n'.join(trimmed_lines)

n_sample = 3

df_sample = df.sample(n=n_sample, random_state=seed)
#extract and append full-text
df_sample['full_text'] = df_sample['id'].apply(extract_pdf_text)

df_sample['sections'] = df_sample['full_text'].apply(extract_sections)
#print(df_sample['full_text'].iloc[1])
df_sample['removed'] = df_sample.apply(
    lambda row: remove_abstract_and_references(row['full_text'], row['sections']), axis=1
)
print(df_sample['removed'])

#extract_sections(df_sample['full_text'])


179995    This paper has been accepted for publication at the IEEE/RSJ International Conference on Intelligent Robots and\nSystems (IROS), Detroit, Michigan, USA, 2023.\nGP-guided MPPI for Efficient Navigat...
147614    Introduction\nClassiﬁcation of attributed graphs has received much attention in recent years because graphs are\nwell suited to represent a broad class of data in ﬁelds such as chemistry, biology,...
227753    Introduction\nLarge language models (LLMs) (Brown et al., 2020; Anil et al., 2023; Thoppilan et al., 2022) have\nrecently gained widespread attention, serving various functions including question ...
Name: removed, dtype: object


**Text preprocessing**

In [5]:
#tokenization and text cleaning
import re
from nltk.tokenize import word_tokenize
nltk.download("punkt")

def preprocess(text):

    # Lowercase and remove URLs/special characters
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    
    #Removing Extra Spaces
    text = re.sub(r'\s+', ' ', text).strip()
    
    #Remove single characters or digits, tend to be part of formula
    text = re.sub(r'\b[a-z0-9]\b', '', text)
    
    #Remove -\n (dash and newline) 
    # For example: au-\ntonomous -> autonomous
    text = re.sub(r'(\w+)-\s*(\w+)', r'\1\2', text)
    
    #Remove Citations
    text = re.sub(r'\[\d+\]', '', text)  # Removes [12] , [2-5]
    text = re.sub(r'\([\w\s,.]+,\s\d{4}\s?\)', '', text) # Removes (Author, Year) or (et al., 2023)
    text = re.sub(r'\[[\w\s,.]+,\s\d{4}\s?\]', '', text) # Removes [Author, Year] or [et al., 2023]
    text = re.sub(r'et al.,', '', text) # Removes et al. in general
    
    # Tokenize
    tokens = word_tokenize(text)
    tokens = [word for word in tokens]
    
    return ' '.join(tokens)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ekabu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Method 1. BERT Embeddings with Text Rank Summary

In [6]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx



# Load a pre-trained BERT model for sentence embeddings
model_BERT = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def summarize_BERT_TextRank(text):
    text = preprocess(text) #preprocess
    sentences = sent_tokenize(text)

    # Convert each sentence into an embedding
    sentence_embeddings = model_BERT.encode(sentences)

    # Compute similarity matrix
    sim_matrix = cosine_similarity(sentence_embeddings)
    nx_graph = nx.from_numpy_array(sim_matrix)
    scores = nx.pagerank(nx_graph)

    # Rank sentences using PageRank scores
    top_n = 2
    ranked = sorted(((scores[i], i, s) for i, s in enumerate(sentences)), reverse=True)  # Include index to track original position
    top_indices = [i for (_, i, _) in ranked[:top_n]]  # Extract top 5 indices

    # Sort indices to restore original order (e.g., [2, 0, 4] → [0, 2, 4])
    top_indices_sorted = sorted(top_indices)

    # Extract the summary
    summary = " ".join([sentences[i] for i in top_indices_sorted])

    return summary
  




Method 2. Use pretrained google/pegasus-xsum model

In [None]:
import torch
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and tokenizer
model_name = "google/pegasus-xsum"
#model_name = ""
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


In [None]:
# Join sentences while keeping within the 1024-token limit
def chunk_text(sentences, tokenizer, max_tokens=1024):
    chunks = []
    current_chunk = []
    current_length = 0

    for sentence in sentences:
        token_len = len(tokenizer.tokenize(sentence))
        
        if current_length + token_len > max_tokens:
            chunks.append(" ".join(current_chunk))
            current_chunk = []
            current_length = 0
        
        current_chunk.append(sentence)
        current_length += token_len

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

def summarize_pegasus(text):
    text = preprocess(text) #preprocess
    sentences = sent_tokenize(text)
    
    chunks = chunk_text(sentences, tokenizer)
    print(f"size of chunks: {len(chunks)}")
    
    summary_list = []

    #First stage: summarize chunks of 1024 tokens
    for chunk in chunks:
        inputs = tokenizer(chunk, return_tensors="pt", max_length=1024, truncation=True)
        inputs = {k: v.to(device) for k, v in inputs.items()}  # move inputs to same device
        
        summary_ids = model.generate(inputs["input_ids"], max_length=30, num_beams=2, early_stopping=True, repetition_penalty=2.0)
        final_input = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
        summary_list.append(final_input)

    # Combine all summarized chunks
    final_input = " ".join(summary_list)
    
    #Second stage: summarize the chunks into a single summary
    
    final_inputs = tokenizer(final_input, return_tensors="pt", max_length=1024, truncation=True).to(device)
    final_summary_ids = model.generate(
        final_inputs["input_ids"],
        max_length=100,              # about 80–100 words
        min_length=50,
        num_beams=4,
        repetition_penalty=2.0,
        early_stopping=True
    )
    final_summary = tokenizer.decode(final_summary_ids[0], skip_special_tokens=True)
    
    return final_summary


**Extract full-text pdf and Generate sections**  
Note: each section shows (line number, section name)

In [None]:
n_sample = 3

df_sample = df.sample(n=n_sample, random_state=seed)
#extract and append full-text
df_sample['full_text'] = df_sample['id'].apply(extract_pdf_text)
df_sample['sections'] = df_sample['full_text'].apply(extract_sections)
df_sample[['title','sections']].head()


Unnamed: 0,title,sections
179995,GP-guided MPPI for Efficient Navigation in Complex Unknown Cluttered\n Environments,"[(1064, REFERENCES)]"
147614,A Simple Way to Learn Metrics Between Attributed Graphs,"[(17, Abstract), (33, Introduction), (498, Method), (588, Method), (672, Method), (729, Conclusion), (737, Acknowledgements), (742, References), (1516, Method)]"
227753,Ad Auctions for LLMs via Retrieval Augmented Generation,"[(7, Abstract), (21, Introduction), (507, Results), (694, References)]"


**Text preprocessing**  
Remove Abstract & References and text normalization

In [None]:
df_sample['preprocessed_text'] = df_sample.apply(
    lambda row: remove_abstract_and_references(row['full_text'], row['sections']), axis=1
)
df_sample['preprocessed_text'] = df_sample['preprocessed_text'].apply(preprocess)
df_sample[['title','preprocessed_text']].head()

Unnamed: 0,title,preprocessed_text
179995,GP-guided MPPI for Efficient Navigation in Complex Unknown Cluttered\n Environments,"this paper has been accepted for publication at the ieee/rsj international conference on intelligent robots and systems ( iros ) , detroit , michigan , usa , 2023. gpguided mppi for efficient navi..."
147614,A Simple Way to Learn Metrics Between Attributed Graphs,"introduction classiﬁcation of attributed graphs has received much attention in recent years because graphs are well suited to represent broad class of data in ﬁelds such as chemistry , biology , c..."
227753,Ad Auctions for LLMs via Retrieval Augmented Generation,"introduction large language models ( llms ) ( brown 2020 ; anil 2023 ; thoppilan 2022 ) have recently gained widespread attention , serving various functions including question answering , content..."


**Generate Summary**

In [None]:
df_sample['summary_bert_textRank'] = df_sample['full_text'].apply(summarize_BERT_TextRank)
print(df_sample[['id', 'title', 'summary_bert_textRank']])


  attn_output = torch.nn.functional.scaled_dot_product_attention(


                id                                                                                 title  \
179995  2307.04019  GP-guided MPPI for Efficient Navigation in Complex Unknown Cluttered\n  Environments   
147614  2209.12727                               A Simple Way to Learn Metrics Between Attributed Graphs   
227753  2406.09459                               Ad Auctions for LLMs via Retrieval Augmented Generation   

                                                                                                                                                                                          summary_bert_textRank  
179995  afterward , mppi computes the optimal control sequence that satisfies the robot and collision avoidance constraints . such policy takes advantage of the sgp occupancy model to learn about the navi...  
147614  the complexity of rpw2 is given by ( p2n log ( ) ) which saves factor as compared to pw2 and this term is often greater than 10. from rpw2 

In [None]:
df_sample['summary_pegasus_pretrained'] = df_sample['full_text'].apply(summarize_pegasus)
print(df_sample[['id', 'title', 'summary_pegasus_pretrained']])

size of chunks: 14


RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


**Evaluation**

In [None]:
from rouge import Rouge

from evaluate import load
import bert_score

In [None]:
# Initialize metrics
rouge = load("rouge")
bleu = load("bleu")
# Sample input
abstract = """Queenstown, a vibrant tourist destination, thrives on its stunning scenery and diverse activities. However, this reliance on tourism presents a complex dynamic, with both opportunities and challenges for the local community. This paper explores the multifaceted impact of tourism on the Queenstown community, examining its economic benefits alongside the social and environmental consequences.\n\nTourism undeniably fuels the local economy, generating significant revenue through various sectors like accommodation, hospitality, and retail. Businesses directly and indirectly benefit from the influx of tourists, creating employment opportunities and stimulating economic growth.\n\nThe rapid growth of tourism also brings challenges. Overcrowding can lead to a decline in the quality of life for residents, with increased traffic congestion, strain on infrastructure, and a rise in housing costs pushing locals out. Furthermore, the environmental impact of tourism, such as pollution and habitat destruction, poses a long-term threat to the natural beauty that attracts tourists in the first place.\n\nTourism is a double-edged sword for Queenstown. While it provides economic opportunities, it also presents social and environmental challenges that require careful management. A sustainable approach to tourism development is crucial to ensure that the benefits are shared equitably and that the community's quality of life and the environment are protected for future generations."""

generated_summary = """This research paper examines the dual impact of tourism on Queenstown, highlighting its economic benefits (job creation, revenue) alongside the social and environmental challenges (overcrowding, infrastructure strain, environmental degradation). It concludes that a sustainable tourism approach is vital for the community's well-being and the long-term preservation of its natural beauty."""

# Run evaluation
def evaluate_automatic_metrics(abstract, generated_summary):
    results = {}

    # ROUGE
    rouge_scores = rouge.compute(predictions=[generated_summary], references=[abstract])
    results["ROUGE-1"] = round(rouge_scores["rouge1"], 3)
    results["ROUGE-2"] = round(rouge_scores["rouge2"], 3)
    results["ROUGE-L"] = round(rouge_scores["rougeL"], 3)

    # BLEU
    bleu_score = bleu.compute(predictions=[generated_summary], references=[[abstract]])
    results["BLEU"] = round(bleu_score["bleu"], 3)

    # BERTScore
    P, R, F1 = bert_score.score([generated_summary], [abstract], lang="en", verbose=False)
    results["BERTScore-F1"] = round(F1[0].item(), 3)

    return results

# Example usage:
auto_scores = evaluate_automatic_metrics(abstract, generated_summary)
for metric, score in auto_scores.items():
    print(f"{metric}: {score}")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


ROUGE-1: 0.314
ROUGE-2: 0.139
ROUGE-L: 0.238
BLEU: 0.011
BERTScore-F1: 0.892


Calculate and tabulate scores for generated summaries

In [None]:
df_eval = df_sample[['title', 'abstract', 'summary_bert_textRank', 'summary_pegasus_pretrained']].copy()
df_eval['abstract'] = df_eval['abstract'].apply(preprocess) #preprocess abstract

multi_scores_textrank = []
multi_scores_bart = []

COL_ROUGE1 = 'ROUGE-1'
COL_ROUGE2 = 'ROUGE-2'
COL_ROUGEL = 'ROUGE-L'
COL_BLEU = 'BLEU'
COL_BERT = 'BERTScore-F1'
COL_TEXTRANK = ' TextRank'
COL_PEGASUS = ' Pegasus'

df_eval[COL_ROUGE1 + COL_TEXTRANK] = None
df_eval[COL_ROUGE1 + COL_PEGASUS] = None

df_eval[COL_ROUGE2 + COL_TEXTRANK] = None
df_eval[COL_ROUGE2 + COL_PEGASUS] = None

df_eval[COL_ROUGEL + COL_TEXTRANK] = None
df_eval[COL_ROUGEL + COL_PEGASUS] = None

df_eval[COL_BLEU + COL_TEXTRANK] = None
df_eval[COL_BLEU + COL_PEGASUS] = None

df_eval[COL_BERT + COL_TEXTRANK] = None
df_eval[COL_BERT + COL_PEGASUS] = None


for index, row in df_eval.iterrows():
    auto_scores_textrank = evaluate_automatic_metrics(row['abstract'], row['summary_bert_textRank'])
    auto_scores_pegasus = evaluate_automatic_metrics(row['abstract'], row['summary_pegasus_pretrained'])
    
    df_eval.loc[index, COL_ROUGE1 + COL_TEXTRANK] = auto_scores_textrank[COL_ROUGE1]
    df_eval.loc[index, COL_ROUGE1 + COL_PEGASUS] = auto_scores_pegasus[COL_ROUGE1]
    
    df_eval.loc[index, COL_ROUGE2 + COL_TEXTRANK] = auto_scores_textrank[COL_ROUGE2]
    df_eval.loc[index, COL_ROUGE2 + COL_PEGASUS] = auto_scores_pegasus[COL_ROUGE2]
    
    df_eval.loc[index, COL_ROUGEL + COL_TEXTRANK] = auto_scores_textrank[COL_ROUGEL]
    df_eval.loc[index, COL_ROUGEL + COL_PEGASUS] = auto_scores_pegasus[COL_ROUGEL]
    
    df_eval.loc[index, COL_BLEU + COL_TEXTRANK] = auto_scores_textrank[COL_BLEU]
    df_eval.loc[index, COL_BLEU + COL_PEGASUS] = auto_scores_pegasus[COL_BLEU]
    
    df_eval.loc[index, COL_BERT + COL_TEXTRANK] = auto_scores_textrank[COL_BERT]
    df_eval.loc[index, COL_BERT + COL_PEGASUS] = auto_scores_pegasus[COL_BERT]



Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['ro

In [None]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', 0)   
df_eval.head()

Unnamed: 0,title,abstract,summary_bert_textRank,summary_pegasus_pretrained,ROUGE-1 TextRank,ROUGE-1 Pegasus,ROUGE-2 TextRank,ROUGE-2 Pegasus,ROUGE-L TextRank,ROUGE-L Pegasus,BLEU TextRank,BLEU Pegasus,BERTScore-F1 TextRank,BERTScore-F1 Pegasus
179995,GP-guided MPPI for Efficient Navigation in Complex Unknown Cluttered\n Environments,"robotic navigation in unknown , cluttered environments with limited sensing capabilities poses significant challenges in robotics . local trajectory optimization methods , such as model predictive...","afterward , mppi computes the optimal control sequence that satisfies the robot and collision avoidance constraints . such policy takes advantage of the sgp occupancy model to learn about the navi...","Researchers at the University of South Korea have developed a robot that can achieve collision-free navigation in two different operating modes, according to research by the University of Freiburg...",0.418,0.102,0.25,0.008,0.333,0.079,0.102,0.0,0.865,0.798
147614,A Simple Way to Learn Metrics Between Attributed Graphs,"the choice of good distances and similarity measures between objects is important for many machine learning methods . therefore , many metric learning algorithms have been developed in recent year...","the complexity of rpw2 is given by ( p2n log ( ) ) which saves factor as compared to pw2 and this term is often greater than 10. from rpw2 , we deﬁne parametric distance drpw2 θ between two attrib...","The results of two experiments on the training of an artificial intelligence algorithm have been published, as well as the results of two experiments on the experimental method of learning and tra...",0.343,0.219,0.041,0.02,0.163,0.119,0.0,0.0,0.814,0.813
227753,Ad Auctions for LLMs via Retrieval Augmented Generation,"in the field of computational advertising , the integration of ads into the outputs of large language models ( llms ) presents an opportunity to support these services without compromising content...","this verifies our conjecture that the multiallocation segment auction constructs more coherent output , as the llm can optimize over the entire document to incorporate the selected ads . overall ,...","This is a series of papers at the University of South Africa, which look at the results of the study of ad-auctions and how to respond to books. 800-338- 800-338- 800-338- 800-338- 800-338- 800-33...",0.293,0.128,0.02,0.0,0.137,0.101,0.0,0.0,0.83,0.77


Take average of the scores for the n-sampled papers

In [None]:
# Assuming df is your DataFrame
df_eval_avg = df_eval[[COL_ROUGE1 + COL_TEXTRANK, COL_ROUGE1 + COL_PEGASUS, COL_BERT+COL_TEXTRANK, COL_BERT+COL_PEGASUS]]
df_eval_avg.mean()

ROUGE-1 TextRank         0.351333
ROUGE-1 Pegasus          0.149667
BERTScore-F1 TextRank    0.836333
BERTScore-F1 Pegasus     0.793667
dtype: object

Print an example of a row

In [None]:
print(f"title: {df_eval.iloc[0]['title']}")
print(f"abstract: {df_eval.iloc[0]['abstract']}")
print(f"textrank: {df_eval.iloc[0]['summary_bert_textRank']}")
print(f"pegasus: {df_eval.iloc[0]['summary_pegasus_pretrained']}")


title: GP-guided MPPI for Efficient Navigation in Complex Unknown Cluttered
  Environments
abstract: robotic navigation in unknown , cluttered environments with limited sensing capabilities poses significant challenges in robotics . local trajectory optimization methods , such as model predictive path intergal ( mppi ) , are promising solution to this challenge . however , global guidance is required to ensure effective navigation , especially when encountering challenging environmental conditions or navigating beyond the planning horizon . this study presents the gpmppi , an online learningbased control strategy that integrates mppi with local perception model based on sparse gaussian process ( sgp ) . the key idea is to leverage the learning capability of sgp to construct variance ( uncertainty ) surface , which enables the robot to learn about the navigable space surrounding it , identify set of suggested subgoals , and ultimately recommend the optimal subgoal that minimizes predefi