# COMPSCI 723 - GRAD TERM PROJECT

#### Team Members: 
##### 1. Shivam Jayeshkumar Mehta (sjmehta@uwm.edu)
##### 2. Atharva Pradeep Vaishnav (vaishna2@uwm.edu)
##### 3. Venkata Kailash Tanniru (vtanniru@uwm.edu)


#### Contributions:
###### Shivam Jayeshkumar Mehta:
> - Dataset Creation and Cleaning.
> - Extractive Summarization Implementation.

###### Atharva Pradeep Vaishnav:
> - Dataset Creation and Cleaning.
> - Abstractive Summarization Implementation.

###### Venkata Kailash Tanniru:
> - Dataset Creation and Cleaning.
> - Evaluation Metrics Implementation and Comprehensive Evaluation.


## Importing Libraries

In [12]:
import json
import re

import networkx as nx
import nltk
import numpy as np
import pandas as pd
import sentencepiece  # required by T5Tokenizer
import textstat
import torch
from bert_score import score
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from rouge_score import rouge_scorer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import T5ForConditionalGeneration, T5Tokenizer

# downloading the NLTK resources used for stopword filtering and tokenization
nltk.download('stopwords')
nltk.download('punkt')
stop_words = set(stopwords.words('english'))


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\SHIVAM\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\SHIVAM\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Loading the Data

In [13]:
json_file_path = 'test_data.json'

with open(json_file_path, 'r') as file:
    data = json.load(file)

df = pd.DataFrame(data)
print("Data loaded successfully!")

# displaying the input data
df.head(100)


Data loaded successfully!


| | article | abstract |
|---|---|---|
| 0 | for about 20 years the problem of properties o... | The study investigates short-term periodicitie... |
| 1 | it is believed that the direct detection of gr... | This study examines the potential to detect ci... |
| 2 | as a common quantum phenomenon , the tunneling... | A new barrier penetration formula, derived fro... |
| 3 | the short - lived radioisotope ( slri ) @xmath... | The study explores marginally gravitationally ... |
| 4 | as in our previous analysis of run 1a data @x... | CDF searched for new particles decaying into j... |
| 5 | this review focuses specifically on what we ha... | Observations of pre-supernova images reveal in... |
| 6 | single - transverse spin asymmetries ( ssas ) ... | This study explores single-transverse spin asy... |
| 7 | kingman s coalescent is a random tree introduc... | Kingman's coalescent models genetic ancestri... |
| 8 | circumstellar material holds clues about the m... | The study of circumstellar environments around... |


## Text Cleaning

In [14]:
def clean_text(text):

    # removing bracketed references and citations such as [12] and [@key]
    # (done before stripping '@' tokens so the '[@...]' pattern can still match)
    text = re.sub(r'\[\d+\]', '', text)
    text = re.sub(r'\[@.*?\]', '', text)

    # removing URLs and email addresses (emails must still contain their '@' here)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'\S+@\S+', '', text)

    # removing placeholder tokens prefixed with '@' (e.g. @xmath0, @xcite)
    text = re.sub(r'@\w+', '', text)

    # removing bare placeholders like 'xmath' followed by numbers
    text = re.sub(r'\bxmath\d+\b', '', text)

    # removing figure, table, formula, and equation mentions
    text = re.sub(r'\b(fig(?:ure)?\.?|table|formula|equation)\s*\[?\w*?\]?\b', '', text, flags=re.IGNORECASE)

    # removing math expressions, LaTeX commands, and braced or parenthesized content
    text = re.sub(r'\$.*?\$', '', text)
    text = re.sub(r'\\[a-zA-Z]+(\{.*?\})?', '', text)
    text = re.sub(r'\{.*?\}', '', text)
    text = re.sub(r'\(.*?\)', '', text)

    # removing special characters, then collapsing whitespace runs into a single space
    text = re.sub(r'[^a-zA-Z0-9\s.,;?!]', '', text)
    text = re.sub(r'\s+', ' ', text)

    # standardizing text to lowercase
    text = text.lower()

    # removing duplicate sentences while keeping the first occurrence of each
    sentences = list(dict.fromkeys(sent_tokenize(text)))
    text = ' '.join(sentences)

    # removing leading and trailing spaces
    text = text.strip()

    return text
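
Several of the patterns in the cleaning step involve the `@` character, so the order of the substitutions affects what is left behind. The following is a minimal sketch on a made-up sample string; the helper names `strip_at_first` and `strip_at_last` are hypothetical and exist only for this illustration.

```python
import re

def strip_at_first(text):
    # hypothetical helper: strip '@' tokens first, then try to strip emails
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'\S+@\S+', '', text)   # never matches: the '@' is already gone
    return re.sub(r'\s+', ' ', text).strip()

def strip_at_last(text):
    # hypothetical helper: strip emails first, then the remaining '@' tokens
    text = re.sub(r'\S+@\S+', '', text)
    text = re.sub(r'@\w+', '', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "contact jane@example.org about @xmath0 results"
print(strip_at_first(sample))  # contact jane.org about results
print(strip_at_last(sample))   # contact about results
```

Stripping `@`-prefixed tokens first leaves the fragment `jane.org` in the text, while removing the email address first removes it completely.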

In [15]:
df['cleaned_article'] = df['article'].apply(clean_text)
print("\nText cleaning completed!")

# Display cleaned articles for verification
df[['article', 'cleaned_article']].head(100)


Text cleaning completed!


| | article | cleaned_article |
|---|---|---|
| 0 | for about 20 years the problem of properties o... | for about 20 years the problem of properties o... |
| 1 | it is believed that the direct detection of gr... | it is believed that the direct detection of gr... |
| 2 | as a common quantum phenomenon , the tunneling... | as a common quantum phenomenon , the tunneling... |
| 3 | the short - lived radioisotope ( slri ) @xmath... | the short lived radioisotope was alive during ... |
| 4 | as in our previous analysis of run 1a data @x... | as in our previous analysis of run 1a data , w... |
| 5 | this review focuses specifically on what we ha... | this review focuses specifically on what we ha... |
| 6 | single - transverse spin asymmetries ( ssas ) ... | single transverse spin asymmetries play a fund... |
| 7 | kingman s coalescent is a random tree introduc... | kingman s coalescent is a random tree introduc... |
| 8 | circumstellar material holds clues about the m... | circumstellar material holds clues about the m... |


## Text Summarization

#### Extractive Summarization - TextRank Algorithm

In [16]:
def text_rank_summary(text, max_words=100, min_words=60):
    # cleaning the input text
    cleaned_text = clean_text(text)

    # tokenizing the text into sentences
    sentences = sent_tokenize(cleaned_text)

    # ensuring there's content to process
    if len(sentences) < 2:
        return "Not enough content to summarize."

    # preprocessing sentences by removing stopwords and filtering out very short sentences
    preprocessed_sentences = []
    valid_sentences = []  # (original position, sentence) pairs

    for position, sentence in enumerate(sentences):
        preprocessed = ' '.join(word for word in word_tokenize(sentence.lower()) if word.isalnum() and word not in stop_words)
        if len(preprocessed.split()) >= 5:  # keeping sentences with at least 5 content words
            preprocessed_sentences.append(preprocessed)
            valid_sentences.append((position, sentence))

    # returning the same message if too few valid sentences remain after preprocessing
    if len(preprocessed_sentences) < 2:
        return "Not enough content to summarize."

    # TF-IDF vectorization over unigrams and bigrams
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    tfidf_matrix = vectorizer.fit_transform(preprocessed_sentences)

    # calculating the cosine similarity between sentences
    similarity_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)

    # building the sentence graph and applying PageRank
    nx_graph = nx.from_numpy_array(similarity_matrix)
    scores = nx.pagerank(nx_graph)

    # ranking sentence indices by their PageRank scores, highest first
    ranked = sorted(range(len(valid_sentences)), key=lambda i: scores[i], reverse=True)

    # selecting the top sentences until the word limit is reached
    selected = []
    total_words = 0

    for i in ranked:
        position, sentence = valid_sentences[i]
        word_count = len(word_tokenize(sentence))
        if total_words + word_count <= max_words:
            selected.append((position, sentence))
            total_words += word_count
        if total_words >= min_words:
            break

    # sorting sentences by their original position to improve coherence and readability
    selected.sort()

    # joining the sentences to form the summary
    summary = ' '.join(sentence for _, sentence in selected)

    # final cleanup
    summary = re.sub(r'\s+', ' ', summary).strip()

    return summary
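
To make the ranking step more concrete, here is a minimal, dependency-free sketch of the idea behind TextRank: sentences are nodes, edge weights are similarities, and PageRank scores are found by power iteration. It is a simplification and not the notebook's implementation: it uses raw word counts instead of TF-IDF, ignores self-similarity, and the helper names `cosine` and `pagerank` as well as the three sample sentences are made up for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def pagerank(weights, damping=0.85, iterations=50):
    """Power iteration on a weighted, undirected sentence graph."""
    n = len(weights)
    out_weight = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # each neighbour j passes on a share of its score proportional to the edge weight
            incoming = sum(weights[j][i] / out_weight[j] * scores[j]
                           for j in range(n) if out_weight[j] > 0)
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

sentences = [
    "solar flares show periodicity",
    "solar flares and gravitational waves",
    "detectors search gravitational waves",
]
bags = [Counter(s.split()) for s in sentences]
n = len(sentences)
weights = [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)] for i in range(n)]

scores = pagerank(weights)
best = max(range(n), key=lambda i: scores[i])
print(sentences[best])  # the middle sentence, which shares words with both others
```

The sentence that overlaps with the most other sentences collects the highest score, which is why TextRank tends to pick sentences that restate an article's recurring vocabulary.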

In [17]:
# applying the extractive summarization function
df['extractive_summary'] = df['cleaned_article'].apply(text_rank_summary)

In [18]:
# creating a dataFrame to store the extracted summaries
ExtractiveResults_df = df[['abstract', 'extractive_summary']]

# displaying the generated summaries
print("\nOriginal Abstract vs Extractive Summary:")
ExtractiveResults_df.head(100)


Original Abstract vs Extractive Summary:


| | abstract | extractive_summary |
|---|---|---|
| 0 | The study investigates short-term periodicitie... | the existence of the periodicity of about days... |
| 1 | This study examines the potential to detect ci... | we characterize sgwb by the so called stokes p... |
| 2 | A new barrier penetration formula, derived fro... | in the present work , we derived a new barrier... |
| 3 | The study explores marginally gravitationally ... | 3 shows that the initial disk surface injectio... |
| 4 | CDF searched for new particles decaying into j... | for the mass region gev c , there are 2947 eve... |
| 5 | Observations of pre-supernova images reveal in... | following a brief summary of sn classification... |
| 6 | This study explores single-transverse spin asy... | at the same time , taking the moment of the fa... |
| 7 | Kingman's coalescent models genetic ancestri... | the sequence t1 satisfies a large deviation pr... |
| 8 | The study of circumstellar environments around... | , width529 currently , a small survey with her... |


#### Abstractive Summarization - T5 Transformer-Based Model

In [19]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base").to(device)

# function to generate an abstractive summary with T5
# note: the tokenizer call truncates each article to its first `chunk_size` tokens,
# so only the beginning of the article is summarized
# note: T5 checkpoints are usually prompted with a task prefix such as "summarize: ";
# no prefix is added here
def generate_abstractive_summary(text, max_length=100, min_length=60, chunk_size=256):
    # tokenizing the input text; a single string yields a batch with one row
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=chunk_size, padding="longest")
    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)

    summaries = []

    # processing the rows in batches (a single batch when the input is one string)
    batch_size = 4
    for i in range(0, input_ids.size(0), batch_size):
        batch_input_ids = input_ids[i:i + batch_size]
        batch_attention_mask = attention_mask[i:i + batch_size]
        try:
            outputs = model.generate(batch_input_ids, attention_mask=batch_attention_mask,
                                     max_length=max_length, min_length=min_length, do_sample=False)
            batch_summaries = tokenizer.batch_decode(outputs, skip_special_tokens=True)
            summaries.extend(batch_summaries)
        except Exception as e:
            print(f"Error during summarization: {e}")
            summaries.append("Error generating summary.")

    # combining the batch summaries to create a single summary
    final_summary = ' '.join(summaries)

    # truncating the summary so that it does not exceed max_length tokens
    final_summary_tokens = tokenizer.tokenize(final_summary)
    if len(final_summary_tokens) > max_length:
        final_summary = tokenizer.convert_tokens_to_string(final_summary_tokens[:max_length])

    return final_summary

df['abstractive_summary'] = df['cleaned_article'].apply(generate_abstractive_summary)

# create a dataFrame to store the abstractive summaries
AbstractiveResults_df = df[['abstract', 'abstractive_summary']]

# displaying the generated summaries
print("\n Original Abstract vs Abstractive Summary:")
AbstractiveResults_df.head(100)


Using device: cpu

 Original Abstract vs Abstractive Summary:


| | abstract | abstractive_summary |
|---|---|---|
| 0 | The study investigates short-term periodicitie... | 155day periodicity was found in data records f... |
| 1 | This study examines the potential to detect ci... | ptas are a method of searching for gravitation... |
| 2 | A new barrier penetration formula, derived fro... | quantum tunneling is a common quantum phenomen... |
| 3 | The study explores marginally gravitationally ... | . , a short lived radioisotope , was alive dur... |
| 4 | CDF searched for new particles decaying into j... | a general search for new particles with a narr... |
| 5 | Observations of pre-supernova images reveal in... | cc sn progenitors have been directly detected ... |
| 6 | This study explores single-transverse spin asy... | qcd in high energy hadronic scattering has bee... |
| 7 | Kingman's coalescent models genetic ancestri... | a random tree arising in large population gene... |
| 8 | The study of circumstellar environments around... | ir provides several key diagnostics for the ev... |
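
When the tokenizer is called on a single string with `truncation=True` and `max_length=chunk_size`, it returns one row holding only the first `chunk_size` tokens, so the summaries above are based on the opening of each article. One possible way to cover a whole article is to slice its token ids into fixed-size windows and summarize each window separately. The sketch below shows only the slicing step on plain integer lists; `split_into_chunks` is a hypothetical helper, not part of the notebook or of `transformers`.

```python
def split_into_chunks(token_ids, chunk_size=256):
    """Split a list of token ids into consecutive windows of at most chunk_size ids."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [token_ids[start:start + chunk_size]
            for start in range(0, len(token_ids), chunk_size)]

# stand-in for the ids a tokenizer would produce for a 600-token article
token_ids = list(range(600))
chunks = split_into_chunks(token_ids, chunk_size=256)
print([len(chunk) for chunk in chunks])  # [256, 256, 88]
```

Each window could then be passed to `model.generate` and the partial summaries joined. Under that design the combined summary grows with the number of windows, so a length cap or a second summarization pass would likely be needed.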


## Evaluation

In [20]:
# functions to implement evaluation metrics

# calculate avg rouge-1 score for all the summaries in the dataframe
def compute_rouge1_avg(df, ref_col, hyp_col):

    scorer = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)
    rouge1_scores = []
    for ref, hyp in zip(df[ref_col], df[hyp_col]):
        s = scorer.score(ref, hyp)
        rouge1_scores.append(s['rouge1'].fmeasure)
    return np.mean(rouge1_scores)


# calculate avg rouge-2 score for all the summaries in the dataframe
def compute_rouge2_avg(df, ref_col, hyp_col):

    scorer = rouge_scorer.RougeScorer(['rouge2'], use_stemmer=True)
    rouge2_scores = []
    for ref, hyp in zip(df[ref_col], df[hyp_col]):
        s = scorer.score(ref, hyp)
        rouge2_scores.append(s['rouge2'].fmeasure)
    return np.mean(rouge2_scores)

# calculate avg BERTScore F1 for all the summaries in the dataframe
def compute_bert_score_avg(df, ref_col, hyp_col, lang="en"):

    references = df[ref_col].tolist()
    candidates = df[hyp_col].tolist()

    P, R, F1 = score(candidates, references, lang=lang, verbose=False)
    return F1.mean().item() if isinstance(F1, torch.Tensor) else np.mean(F1)

# calculate avg Flesch-Kincaid Grade Level for all the summaries in the dataframe
def compute_flesch_kincaid_avg(df, hyp_col):

    fk_scores = df[hyp_col].apply(textstat.flesch_kincaid_grade)
    return fk_scores.mean()
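
As a rough illustration of what the ROUGE-1 F-measure captures, the sketch below computes unigram overlap by hand. It is a simplification, assuming whitespace tokenization and no stemming, so its values will not match `rouge_scorer` with `use_stemmer=True`; the function name `rouge1_f1` and the two sample strings are made up for this example.

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between a reference and a candidate summary."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

reference = "the study investigates solar flare periodicities"
candidate = "solar flare periodicities were found in the data"
print(round(rouge1_f1(reference, candidate), 4))  # 0.5714
```

Here four of the candidate's eight words appear among the reference's six words, giving a precision of 0.5, a recall of about 0.667, and an F1 of 4/7. ROUGE-2 applies the same computation to bigrams.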

In [21]:
# calculating the evaluation metrics for abstractive summaries
abs_rouge1 = compute_rouge1_avg(AbstractiveResults_df, 'abstract', 'abstractive_summary')
abs_rouge2 = compute_rouge2_avg(AbstractiveResults_df, 'abstract', 'abstractive_summary')
abs_bert = compute_bert_score_avg(AbstractiveResults_df, 'abstract', 'abstractive_summary')
abs_fk = compute_flesch_kincaid_avg(AbstractiveResults_df, 'abstractive_summary')

# calculating the evaluation metrics for extractive summaries
ext_rouge1 = compute_rouge1_avg(ExtractiveResults_df, 'abstract', 'extractive_summary')
ext_rouge2 = compute_rouge2_avg(ExtractiveResults_df, 'abstract', 'extractive_summary')
ext_bert = compute_bert_score_avg(ExtractiveResults_df, 'abstract', 'extractive_summary')
ext_fk = compute_flesch_kincaid_avg(ExtractiveResults_df, 'extractive_summary')


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [22]:
# creating Evaluation_result_df to display the evaluation metrics
Evaluation_result_df = pd.DataFrame({
    'Method': ['Extractive', 'Abstractive'],
    'Avg_ROUGE-1': [ext_rouge1, abs_rouge1],
    'Avg_ROUGE-2': [ext_rouge2, abs_rouge2],
    'Avg_BERTScore_F1': [ext_bert, abs_bert],
    'Avg_Flesch_Kincaid': [ext_fk, abs_fk]
})

# displaying the evaluation metrics
Evaluation_result_df

| | Method | Avg_ROUGE-1 | Avg_ROUGE-2 | Avg_BERTScore_F1 | Avg_Flesch_Kincaid |
|---|---|---|---|---|---|
| 0 | Extractive | 0.294452 | 0.079846 | 0.825715 | 17.255556 |
| 1 | Abstractive | 0.25793 | 0.049757 | 0.830853 | 9.344444 |
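
For reading the last column, it helps to recall how the Flesch-Kincaid grade level is defined. The standard formula is shown below; `textstat` may count sentences and syllables with its own heuristics, so this should be read as the definition of the metric and not as a description of the library's internals.

$$
\text{FKGL} = 0.39 \cdot \frac{\text{total words}}{\text{total sentences}} + 11.8 \cdot \frac{\text{total syllables}}{\text{total words}} - 15.59
$$

Higher values indicate text that is harder to read, with the number roughly corresponding to a US school grade. Longer sentences and longer words both raise the score.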
