# Assignment 8: Mini Project

Adhvait Ananthan Srinath 

Poh Shi Qian 

## Loading in the data

### Dataset Citation


Hugging Face. (2024). CCDV PubMed Summarization Dataset. Retrieved from https://huggingface.co/datasets/ccdv/pubmed-summarization



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
file_path = '/content/drive/My Drive/Colab Notebooks/project_data/train.txt'
file_path1 = '/content/drive/My Drive/Colab Notebooks/project_data/test.txt'
file_path2 = '/content/drive/My Drive/Colab Notebooks/project_data/val.txt'

In [None]:
!pip install jsonlines

Collecting jsonlines
  Downloading jsonlines-4.0.0-py3-none-any.whl (8.7 kB)
Installing collected packages: jsonlines
Successfully installed jsonlines-4.0.0


In [None]:
import jsonlines
import pandas as pd

# Create an empty list to store the data
data = []

# Open the JSONL file and read its contents
with jsonlines.open(file_path) as reader:
    for obj in reader:
        data.append(obj)

df = pd.DataFrame(data)

df.head()

| | article_id | article_text | abstract_text | labels | section_names | sections |
|---|---|---|---|---|---|---|
| 0 | PMC3872579 | [a recent systematic analysis showed that in 2... | [\<S> background : the present study was carrie... | | [INTRODUCTION, MATERIALS AND METHODS, Particip... | [[a recent systematic analysis showed that in ... |
| 1 | PMC3770628 | [it occurs in more than 50% of patients and ma... | [\<S> backgroundanemia in patients with cancer ... | | [Introduction, Patients and methods, Study des... | [[it occurs in more than 50% of patients and m... |
| 2 | PMC5330001 | [tardive dystonia ( td ) , a rarer side effect... | [\<S> tardive dystonia ( td ) is a serious side... | | [INTRODUCTION, CASE REPORT, DISCUSSION, Declar... | [[tardive dystonia ( td ) , a rarer side effec... |
| 3 | PMC4386667 | [lepidoptera include agricultural pests that ,... | [\<S> many lepidopteran insects are agricultura... | | [1. Introduction, 2. Insect Immunity, 3. Signa... | [[lepidoptera include agricultural pests that ... |
| 4 | PMC4307954 | [syncope is caused by transient diffuse cerebr... | [\<S> we present an unusual case of recurrent c... | | [Introduction, Case report, Discussion, Confli... | [[syncope is caused by transient diffuse cereb... |
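
The data files are in JSON Lines form, that is, one JSON object per line. As a minimal sketch of what the reader loop above does, the same parsing can be written with only the standard library. The two records below are hypothetical stand-ins, not rows from the dataset; the second has an empty `article_text` list, which is a case that also occurs in the real data (see row 15315 in the sample further down).

```python
import io
import json

# Two hypothetical records in JSON Lines form: one JSON object per line.
sample_jsonl = io.StringIO(
    '{"article_id": "A1", "article_text": ["first sentence .", "second sentence ."]}\n'
    '{"article_id": "A2", "article_text": []}\n'
)

def read_jsonl(handle):
    """Parse one JSON object per non-empty line of an open text handle."""
    return [json.loads(line) for line in handle if line.strip()]

records = read_jsonl(sample_jsonl)
print(len(records), records[0]["article_id"])
```

The resulting list of dictionaries can be passed to `pd.DataFrame` in the same way as the `data` list above.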


In [None]:
new_df = df[['article_text', 'abstract_text']].copy()

articleText = new_df.sample(n=50, random_state=42)

print(articleText)

articleText.info()

                                             article_text  \
32536   [long - term synaptic plasticity is thought to...   
543     [californium-252 is an artificial element with...   
46953   [ewing 's sarcoma is a malignant nonosteogenic...   
3580    [conventional endodontic treatment has experie...   
95214   [choroidal osteoma ( choroidal osseous chorist...   
36084   [quantitative nuclear magnetic resonance ( qnm...   
29915   [laparoscopic cholecystectomy ( lc ) , as comp...   
95647   [compared to the adult population , blunt lary...   
15315                                                  []   
96640   [infection with herpes simplex virus ( a dna v...   
91297   [its clinical manifestations are related to th...   
94277   [universal vaccination against acute communica...   
101396  [image fusion software can derive a fusion ima...   
32677   [ziehl  neelsen ( zn ) method for acid - fast ...   
94053   [decades of funding and research focused on co...   
53336   [in 2006 , the n

## Preprocessing the Text

We preprocess the text to clean and prepare it for the summarization task. The code performs several steps:

1. Lowercasing
2. Removing punctuation
3. Tokenization
4. Part-of-Speech (POS) Tagging - Assigns a grammatical tag to each token
5. Named Entity Recognition (NER) - Identifies named entities such as person names, locations, and organizations
6. Removing Stopwords and Named Entities - Removes common stopwords and named entities to focus on meaningful words

This removes "noise" from the text data before it is passed to the summarization model.
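
As a rough illustration of steps 1, 2, 3 and 6 on a toy input, the sketch below uses only the standard library. It is not the notebook's function: `TOY_STOPWORDS` is a tiny hypothetical stand-in for NLTK's English stopword list, whitespace splitting stands in for `word_tokenize`, and POS tagging and NER are left out.

```python
import string

# Tiny illustrative stopword list (assumption); the notebook uses NLTK's English list.
TOY_STOPWORDS = {"the", "is", "a", "of", "to", "in", "and"}

def toy_preprocess(sentences):
    # Join the sentence list into one string and lowercase it
    text = " ".join(sentences).lower()
    # Drop every ASCII punctuation character
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Whitespace tokenization, then stopword removal
    tokens = text.split()
    return " ".join(t for t in tokens if t not in TOY_STOPWORDS)

cleaned = toy_preprocess([
    "Long - term synaptic plasticity is thought to represent",
    "the basis of learning .",
])
print(cleaned)  # long term synaptic plasticity thought represent basis learning
```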

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
import string

# Download the NER model (if not already downloaded)
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('punkt')
nltk.download('stopwords')

def preprocess_text_with_ner(text_list):
    text = ' '.join(text_list)
    text = text.lower()
    text = ''.join([char for char in text if char not in string.punctuation])

    # Tokenize the text
    tokens = word_tokenize(text)

    # Perform part-of-speech tagging
    pos_tags = pos_tag(tokens)

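    # Perform named entity recognition; the result is a tree whose subtrees are entity chunks
    named_entities = nltk.ne_chunk(pos_tags)

    # Collect every token that sits inside a named-entity chunk
    # (POS tags never equal 'NE', so entities must be read from the chunk tree)
    entity_tokens = {word for chunk in named_entities if isinstance(chunk, nltk.Tree) for word, _ in chunk.leaves()}

    # Remove stopwords and named entities
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word, tag in pos_tags if word not in stop_words and word not in entity_tokens]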

    # Join tokens back into a string
    preprocessed_text = ' '.join(filtered_tokens)

    return preprocessed_text

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# Applying the preprocessing function to both 'article_text' and 'abstract_text' columns
articleText['preprocessed_article_text'] = articleText['article_text'].apply(preprocess_text_with_ner)
articleText['preprocessed_abstract_text'] = articleText['abstract_text'].apply(preprocess_text_with_ner)

print(articleText[['preprocessed_article_text', 'preprocessed_abstract_text']])
articleText.info()


                                preprocessed_article_text  \
32536   long term synaptic plasticity thought represen...   
543     californium252 artificial element half life 26...   
46953   ewing sarcoma malignant nonosteogenic primary ...   
3580    conventional endodontic treatment experienced ...   
95214   choroidal osteoma choroidal osseous choristoma...   
36084   quantitative nuclear magnetic resonance qnmr w...   
29915   laparoscopic cholecystectomy lc compared open ...   
95647   compared adult population blunt laryngotrachea...   
15315                                                       
96640   infection herpes simplex virus dna virus preva...   
91297   clinical manifestations related reduction abse...   
94277   universal vaccination acute communicable disea...   
101396  image fusion software derive fusion image sing...   
32677   ziehl neelsen zn method acid fast bacilli afb ...   
94053   decades funding research focused combatting sp...   
53336   2006 nhs institu

Next, we identify and remove articles whose preprocessed text contains fewer than 400 words.

Rows with little text are filtered out to improve the summarization quality. In this run, 44 of the 50 sampled articles remain after filtering.
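
The same length filter can be sketched without pandas. The helper name `split_by_length` and the three sample documents are hypothetical and serve only to show how the threshold treats empty and short texts.

```python
MIN_WORDS = 400  # threshold used in this notebook

def split_by_length(texts, min_words=MIN_WORDS):
    """Return (kept, dropped) lists based on whitespace-separated word count."""
    kept, dropped = [], []
    for text in texts:
        (kept if len(text.split()) >= min_words else dropped).append(text)
    return kept, dropped

# Hypothetical documents: one long, one empty, one short
docs = ["word " * 450, "", "short article text"]
kept, dropped = split_by_length(docs)
print(len(kept), len(dropped))  # 1 2
```

An empty article, such as row 15315 in the sample, has a word count of 0 and is therefore dropped by the same rule.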

In [None]:
# Calculate word count for each row in the 'preprocessed_article_text' column
articleText['word_count'] = articleText['preprocessed_article_text'].apply(lambda x: len(x.split()))

# Identify rows with word count less than 400 for 'preprocessed_article_text'
filtered_out_rows = articleText[articleText['word_count'] < 400]

# Filter out rows with word count less than 400 for 'preprocessed_article_text'
articleText = articleText[articleText['word_count'] >= 400]

# Drop the 'word_count' column as it's no longer needed
articleText = articleText.drop(columns=['word_count'])

# Display a summary of the DataFrame that remains after filtering
articleText.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 44 entries, 32536 to 51251
Data columns (total 4 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   article_text                44 non-null     object
 1   abstract_text               44 non-null     object
 2   preprocessed_article_text   44 non-null     object
 3   preprocessed_abstract_text  44 non-null     object
dtypes: object(4)
memory usage: 1.7+ KB


In [None]:
# Create a DataFrame with only 'preprocessed_article_text'
articleText_article = articleText[['preprocessed_article_text']].copy()

# Create a DataFrame with only 'preprocessed_abstract_text'
articleText_abstract = articleText[['preprocessed_abstract_text']].copy()

# Display information about the new DataFrames
print("DataFrame with preprocessed article text:")
print(articleText_article.info())

print("\nDataFrame with preprocessed abstract text:")
print(articleText_abstract.info())


DataFrame with preprocessed article text:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 44 entries, 32536 to 51251
Data columns (total 1 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   preprocessed_article_text  44 non-null     object
dtypes: object(1)
memory usage: 704.0+ bytes
None

DataFrame with preprocessed abstract text:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 44 entries, 32536 to 51251
Data columns (total 1 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   preprocessed_abstract_text  44 non-null     object
dtypes: object(1)
memory usage: 704.0+ bytes
None


## Summarizing the Articles

For the summarization process, we use the Hugging Face **transformers** library to generate summaries for text input. The function takes a piece of text as input, keeps only its first 1,024 characters so that the input stays short enough for the model, and then uses the default summarization pipeline (which resolves to `sshleifer/distilbart-cnn-12-6` in this run) to generate a summary with `min_length=30` and `max_length=120`.
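
A plain character slice such as `text[:1024]` can end in the middle of a word. The sketch below shows one possible way to back up to the previous space instead. The helper `truncate_at_word_boundary` is hypothetical and is not part of the notebook or of the transformers library; it still counts characters rather than model tokens.

```python
def truncate_at_word_boundary(text, max_chars=1024):
    """Cut text to at most max_chars characters without splitting a word."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # If the next character continues a word, back up to the last space in the cut
    if not text[max_chars].isspace() and " " in cut:
        cut = cut[:cut.rfind(" ")]
    return cut.rstrip()

sample = "alpha beta gamma delta"
print(truncate_at_word_boundary(sample, max_chars=13))  # alpha beta
```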

In [None]:
from transformers import pipeline

# Load the default summarization pipeline
summarization_pipeline = pipeline("summarization")

# Define a function to generate summaries
def generate_summary(text):
    # Keep only the first 1024 characters; note that this slices characters, not model tokens
    truncated_text = text[:1024]

    # Generate summary using the pipeline
    summary = summarization_pipeline(truncated_text, max_length=120, min_length=30, do_sample=False)

    # Extract and return the summary text
    return summary[0]['summary_text']

# Apply the function to each row in the DataFrame
articleText_article['summary'] = articleText_article['preprocessed_article_text'].apply(generate_summary)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.

In [None]:
summaryText = articleText_article[['summary']].copy()

summaryText.head()
summaryText.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44 entries, 32536 to 51251
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   summary  44 non-null     object
dtypes: object(1)
memory usage: 704.0+ bytes


## Calculating BERT Score

We chose BERTScore specifically for evaluating summaries in this context for its effectiveness in capturing semantic similarity, robustness across various text data, and ease of use. These qualities make it a suitable choice for assessing the quality of generated summaries relative to the reference abstracts.
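
To make the idea of semantic matching concrete, here is a toy sketch of the greedy matching behind BERTScore: each token on one side is paired with its most similar token on the other side by cosine similarity. The 2-D vectors are hypothetical stand-ins for contextual embeddings, and the sketch leaves out everything else the `bert_score` package does (the embedding model itself and optional weighting or rescaling).

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_match_scores(candidate_vecs, reference_vecs):
    # Precision: each candidate token takes its best match among the reference tokens
    precision = sum(max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs) / len(candidate_vecs)
    # Recall: each reference token takes its best match among the candidate tokens
    recall = sum(max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs) / len(reference_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical token embeddings (assumption): 2 candidate tokens, 3 reference tokens
candidate = [(1.0, 0.0), (0.0, 1.0)]
reference = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
p, r, f1 = greedy_match_scores(candidate, reference)
print(round(p, 4), round(r, 4), round(f1, 4))
```

In this toy case every candidate token has an exact match, so precision is 1.0, while the reference token `(1.0, 1.0)` is only partly matched, which pulls recall below 1.0.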

In [None]:
!pip install bert_score

Collecting bert_score
  Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
Installing collected packages: bert_score
Successfully installed bert_score-0.3.13


In [None]:
from bert_score import score

# Define system-generated summaries and reference summaries
system_summaries = summaryText['summary'].tolist()
reference_summaries = articleText_abstract['preprocessed_abstract_text'].tolist()

# Calculate BERTScore
P, R, F1 = score(system_summaries, reference_summaries, lang="en", verbose=True)

# Print BERTScore
print("Precision:", P.mean())
print("Recall:", R.mean())
print("F1 Score:", F1.mean())


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.
computing greedy matching.
done in 195.28 seconds, 0.23 sentences/sec
Precision: tensor(0.8295)
Recall: tensor(0.7937)
F1 Score: tensor(0.8110)


**Precision**: A precision of 0.8295 suggests that most of the content in the generated summaries has a close semantic match in the reference abstracts, so the summaries contain mostly relevant content.

**Recall**: The recall score of 0.7937 suggests that a large share of the content in the reference abstracts has a close semantic match in the generated summaries, although some reference information is missed.

**F1 Score**: The F1 Score of 0.8110 combines precision and recall and suggests that the summaries are reasonably balanced between the two, with precision slightly ahead of recall.
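
One detail worth checking by hand: the values printed above are `P.mean()`, `R.mean()` and `F1.mean()`, so the reported F1 is the mean of the per-pair F1 scores. It does not have to equal the harmonic mean of the mean precision and mean recall, as this small calculation with the reported numbers shows.

```python
# Values copied from the BERTScore output above
precision_mean = 0.8295
recall_mean = 0.7937
reported_f1_mean = 0.8110

# Harmonic mean of the two averaged scores
harmonic_of_means = 2 * precision_mean * recall_mean / (precision_mean + recall_mean)
print(round(harmonic_of_means, 4))  # 0.8112
```

The small gap between 0.8112 and the reported 0.8110 comes from averaging per-pair F1 values rather than combining the averaged precision and recall.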

## Extracting and Comparing Keywords

We also extract and compare keywords from the generated summaries and the reference abstracts. Comparing the most frequent terms gives a simple, interpretable check on whether a summary covers the same topics as its abstract.

In [None]:
from collections import Counter

# Function to extract keywords from summary text
def extract_keywords_from_summary(text):
    # Ensure text is converted to string
    text = str(text)

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove punctuation and stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word.lower() for word in tokens if word.isalnum() and word.lower() not in stop_words]

    # Calculate word frequency
    word_freq = Counter(filtered_tokens)

    # Get top keywords based on frequency
    top_keywords = word_freq.most_common(10)  # Extract top 10 keywords

    return [keyword for keyword, freq in top_keywords]

# Extract keywords from the first row of summaryText
summary_keywords = extract_keywords_from_summary(summaryText['summary'].iloc[0])

# Print out the keywords
print("Top 10 keywords from the summary:", summary_keywords)

Top 10 keywords from the summary: ['long', 'term', 'synaptic', 'distribution', 'plasticity', 'thought', 'represent', 'cellular', 'basis', 'learning']


In [None]:
# Function to extract keywords from abstract text
def extract_keywords_from_abstract(text):
    # Ensure text is converted to string
    text = str(text)

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove punctuation and stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word.lower() for word in tokens if word.isalnum() and word.lower() not in stop_words]

    # Calculate word frequency
    word_freq = Counter(filtered_tokens)

    # Get top keywords based on frequency
    top_keywords = word_freq.most_common(10)  # Extract top 10 keywords

    return [keyword for keyword, freq in top_keywords]

# Extract keywords from the first row of articleText_abstract
abstract_keywords = extract_keywords_from_abstract(articleText_abstract['preprocessed_abstract_text'].iloc[0])

# Print out the keywords
print("Top 10 keywords from the abstract:", abstract_keywords)

Top 10 keywords from the abstract: ['imaging', 'vsd', 'regions', 'long', 'term', 'synaptic', 'ltp', 'ltd', 'excited', 'plasticity']


Next, we check how many of the keywords from the generated summary and from the dataset abstract also appear in the preprocessed article text. We do this for the first article in the sample.
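
As a complementary check, the two keyword lists printed above can also be compared with each other directly. The sketch below uses Jaccard similarity (shared keywords divided by all distinct keywords). This measure is a choice made for this illustration and is not part of the original analysis; the lists are copied from the outputs above.

```python
# Keyword lists copied from the two outputs above
summary_kw = ['long', 'term', 'synaptic', 'distribution', 'plasticity',
              'thought', 'represent', 'cellular', 'basis', 'learning']
abstract_kw = ['imaging', 'vsd', 'regions', 'long', 'term',
               'synaptic', 'ltp', 'ltd', 'excited', 'plasticity']

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two keyword lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

shared = sorted(set(summary_kw) & set(abstract_kw))
print(shared)                            # ['long', 'plasticity', 'synaptic', 'term']
print(jaccard(summary_kw, abstract_kw))  # 0.25
```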

In [None]:
# Function to calculate similarity based on percentage
def calculate_similarity_percentage(keyword_list1, keyword_list2, reference_text):
    set1 = set(keyword_list1)
    set2 = set(keyword_list2)
    reference_set = set(word_tokenize(reference_text.lower()))

    # Find which summary and abstract keywords also occur in the reference text
    intersection_summary = set1.intersection(reference_set)
    intersection_abstract = set2.intersection(reference_set)

    # Express each overlap as a percentage of the unique tokens in the reference text
    # (the denominator is the reference vocabulary size, not the number of keywords)
    percentage_summary = len(intersection_summary) / len(reference_set) * 100
    percentage_abstract = len(intersection_abstract) / len(reference_set) * 100

    return percentage_summary, percentage_abstract

# Compare the summary keywords and the abstract keywords against the first preprocessed article
article_text = articleText_article['preprocessed_article_text'].iloc[0]
percentage_summary, percentage_abstract = calculate_similarity_percentage(summary_keywords, abstract_keywords, article_text)

print("Percentage of reference article text keywords in summary:", percentage_summary)
print("Percentage of reference article text keywords in abstract:", percentage_abstract)

Percentage of reference article text keywords in summary: 0.8554319931565441
Percentage of reference article text keywords in abstract: 0.7698887938408896


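These results show that the summary keywords overlap with the article text slightly more than the abstract keywords do: 0.855% versus 0.770%. Because the function divides by the number of unique tokens in the article rather than by the number of keywords, both values are below 1%. They are consistent with all 10 summary keywords and 9 of the 10 abstract keywords appearing among 1,169 unique article tokens (10/1169 ≈ 0.855%, 9/1169 ≈ 0.770%). The figures therefore do not mean that the summary captures 85.5% of the article's keywords; they indicate that the most frequent terms of both the summary and the abstract are almost all drawn from the article's own vocabulary.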
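
The function above divides each overlap by the number of unique tokens in the article. If the quantity of interest is instead the share of each keyword list that appears in the article, one option is to divide by the size of the keyword list. The helper `keyword_coverage` and the toy inputs below are hypothetical and only sketch that alternative normalization.

```python
def keyword_coverage(keywords, reference_tokens):
    """Percentage of distinct keywords that also occur among the reference tokens."""
    keywords = set(keywords)
    if not keywords:
        return 0.0
    return 100 * len(keywords & set(reference_tokens)) / len(keywords)

# Toy reference text and keyword list (assumptions for illustration)
reference_tokens = "long term synaptic plasticity is studied with vsd imaging".split()
coverage = keyword_coverage(["long", "term", "ltp", "imaging"], reference_tokens)
print(coverage)  # 75.0
```

With this normalization the result reads directly as "three of the four keywords occur in the reference text".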