NLTK Summarizer
TF (Term Frequency) measures how frequently a term occurs in a document. Because documents differ in length, a term is likely to appear more often in a long document than in a short one. The raw count is therefore usually divided by the document length (the total number of terms in the document) as a form of normalization.

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
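
As a small illustration of the formula above, the sketch below computes TF for a toy document. The helper name `term_frequency` and the sample sentence are made up for this example and are not part of the notebook.

```python
from collections import Counter

def term_frequency(tokens):
    """Return {term: count / total number of tokens} for one document."""
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {term: count / total for term, count in counts.items()}

# Toy document (assumption, for illustration only)
doc = "long covid is a long illness".split()
tf = term_frequency(doc)
print(tf["long"])   # "long" appears 2 times out of 6 tokens
```

Dividing by the document length keeps the values comparable between short and long documents.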

NLTK summarization involves 5 steps:

Create the word frequency table

Tokenize the sentences

Score the sentences: Term frequency

Find the threshold

Generate the summary

In [29]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [30]:
import nltk
from string import punctuation
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
import pandas as pd # dataframe processing

In [31]:
df = pd.read_csv("/content/drive/MyDrive/medicine_articles.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,pubmed_id,all_text
0,0,9815484,b'Vertebral artery dissection (VAD) is a commo...
1,1,9839201,b'Long COVID is an often debilitating illness ...
2,2,9930320,b'Wrist-ankle acupuncture (WAA) has been repor...
3,3,10018610,b'Molnupiravir is an oral antiviral drug that ...
4,4,9922164,b'Human prion protein and prion-like protein m...


In [32]:
# Use the text body of the second article (index 1) as the document to summarize
text=df['all_text'][1]
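
The `all_text` values in the preview start with `b'`, which suggests the column stores the printed form of a bytes object rather than decoded text. If that is the case, the prefix and any `\x..` escape sequences end up in the tokens. The sketch below is one possible way to recover readable text; `decode_bytes_repr` is a hypothetical helper, the sample string is made up, and UTF-8 is an assumed encoding.

```python
import ast

def decode_bytes_repr(raw):
    """Convert the printed form of a bytes object (a str starting with b') into text."""
    if isinstance(raw, str) and raw[:2] in ("b'", 'b"'):
        try:
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            return raw  # malformed or truncated input is returned unchanged
        if isinstance(value, bytes):
            # UTF-8 is an assumption about how the articles were encoded
            return value.decode("utf-8", errors="replace")
    return raw

# Made-up sample in the same shape as the article text
sample = "b'coronavirus\\xc2\\xa02 (SARS-CoV-2)'"
print(repr(decode_bytes_repr(sample)))
```

Applied before tokenization (for example `text = decode_bytes_repr(df['all_text'][1])`), this would turn the two escaped bytes into a single non-breaking space.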

In [33]:
text

"b'Long COVID is an often debilitating illness that occurs in at least 10% of severe acute respiratory syndrome coronavirus\\xc2\\xa02 (SARS-CoV-2) infections. More than 200 symptoms have been identified with impacts on multiple organ systems. At least 65 million individuals worldwide are estimated to have long COVID, with cases increasing daily. Biomedical research has made substantial progress in identifying various pathophysiological changes and risk factors and in characterizing the illness; further, similarities with other viral-onset illnesses such as myalgic encephalomyelitis/chronic fatigue syndrome and postural orthostatic tachycardia syndrome have laid the groundwork for research in the field. In this Review, we explore the current literature and highlight key findings, the overlap with other conditions, the variable onset of symptoms, long COVID in children and the impact of vaccinations. Although these key findings are critical to understanding long COVID, current diagnosti

In [34]:
# Split the text into word tokens (tokenization)
nltk.download('punkt')
tokens = word_tokenize(text)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [35]:
# Build the stop-word set used when creating the word frequency table.
# Only words that are not stop words are counted.
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

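# Add domain-specific words that are not in the NLTK stop-word list.
# Tokens are lowercased before the lookup, so capitalized or multi-word entries never match.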
new_stopwords = ['doi', 'preprint', 'copyright', 'peer', 'reviewed', 'org', 'https', 'et', 'al', 'author', 'figure','journals','april', 
    'rights', 'reserved', 'permission', 'used', 'using', 'biorxiv', 'medrxiv', 'license', 'fig', 'fig.','al.', 'Elsevier',
    'PMC', 'CZI', 'www','image','figures','tables','introduction','materials and methods','results']
new_stopwords_list = stop_words.union(new_stopwords)

# Take 'not' and 'may' back out of the stop-word set so they are still counted
not_stopwords = {'not', 'may'}
final_stop_words = new_stopwords_list - not_stopwords

print(final_stop_words)

{"weren't", 'on', "it's", 'they', 'peer', 'is', 'there', 'y', 'did', 'more', 'how', 's', 'my', 'to', 'during', "you're", 'yourself', 'reserved', 'isn', 'so', 'using', 'down', 'theirs', 'other', "mightn't", 'won', 'than', 'if', 'al', 'https', "couldn't", 'weren', 'being', 'fig.', 'wouldn', 'this', 'only', 'shan', 'which', 'figures', 'your', 'biorxiv', 'et', 'al.', 'nor', "you've", 'we', 'below', 'just', 'author', 'o', 'under', 'what', 'these', 'again', 'himself', 'copyright', 'both', "won't", 'for', 'it', 'into', 'aren', 'once', 've', 'such', "aren't", 'an', "didn't", 'PMC', 'few', "she's", 'been', 'CZI', 'are', 'after', 'preprint', 'above', 'as', "you'd", 'from', 'those', 'same', 'reviewed', 'when', 'materials and methods', 'any', 'of', 'ours', 'some', 'yourselves', 'hers', 'has', 'whom', 'rights', 'tables', 'mightn', 'she', 'where', 'its', 'why', 'herself', 'too', 'm', 'having', 'results', 'introduction', 'their', 'don', 'figure', 'or', 't', 'd', 'because', 'him', 'that', 'license', '

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [36]:
# Extend the punctuation characters with the newline so both can be filtered out
punctuation = punctuation + '\n'
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n'

In [37]:
# Generate the word frequencies used to score each sentence:
# count every token that is neither a stop word nor punctuation
word_frequencies = {}
for word in tokens:
    if word.lower() not in final_stop_words and word.lower() not in punctuation:
        word_frequencies[word] = word_frequencies.get(word, 0) + 1

In [38]:
word_frequencies

{"b'Long": 1,
 'COVID': 8,
 'often': 1,
 'debilitating': 1,
 'illness': 3,
 'occurs': 1,
 'least': 2,
 '10': 1,
 'severe': 1,
 'acute': 1,
 'respiratory': 1,
 'syndrome': 3,
 'coronavirus\\xc2\\xa02': 1,
 'SARS-CoV-2': 2,
 'infections': 1,
 '200': 1,
 'symptoms': 2,
 'identified': 1,
 'impacts': 2,
 'multiple': 2,
 'organ': 2,
 'systems': 2,
 '65': 1,
 'million': 1,
 'individuals': 3,
 'worldwide': 2,
 'estimated': 1,
 'long': 6,
 'cases': 1,
 'increasing': 1,
 'daily': 1,
 'Biomedical': 1,
 'research': 7,
 'made': 1,
 'substantial': 1,
 'progress': 1,
 'identifying': 1,
 'various': 1,
 'pathophysiological': 1,
 'changes': 1,
 'risk': 1,
 'factors': 1,
 'characterizing': 1,
 'similarities': 1,
 'viral-onset': 2,
 'illnesses': 1,
 'myalgic': 1,
 'encephalomyelitis/chronic': 1,
 'fatigue': 1,
 'postural': 1,
 'orthostatic': 1,
 'tachycardia': 1,
 'laid': 1,
 'groundwork': 1,
 'field': 1,
 'Review': 1,
 'explore': 1,
 'current': 2,
 'literature': 1,
 'highlight': 1,
 'key': 2,
 'findings'

In [39]:
max_frequency = max(word_frequencies.values())
print(max_frequency)

8


In [40]:

# Normalize each count by the highest count so the most frequent word scores 1.0
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word]/max_frequency

In [41]:
print(word_frequencies)

{"b'Long": 0.125, 'COVID': 1.0, 'often': 0.125, 'debilitating': 0.125, 'illness': 0.375, 'occurs': 0.125, 'least': 0.25, '10': 0.125, 'severe': 0.125, 'acute': 0.125, 'respiratory': 0.125, 'syndrome': 0.375, 'coronavirus\\xc2\\xa02': 0.125, 'SARS-CoV-2': 0.25, 'infections': 0.125, '200': 0.125, 'symptoms': 0.25, 'identified': 0.125, 'impacts': 0.25, 'multiple': 0.25, 'organ': 0.25, 'systems': 0.25, '65': 0.125, 'million': 0.125, 'individuals': 0.375, 'worldwide': 0.25, 'estimated': 0.125, 'long': 0.75, 'cases': 0.125, 'increasing': 0.125, 'daily': 0.125, 'Biomedical': 0.125, 'research': 0.875, 'made': 0.125, 'substantial': 0.125, 'progress': 0.125, 'identifying': 0.125, 'various': 0.125, 'pathophysiological': 0.125, 'changes': 0.125, 'risk': 0.125, 'factors': 0.125, 'characterizing': 0.125, 'similarities': 0.125, 'viral-onset': 0.25, 'illnesses': 0.125, 'myalgic': 0.125, 'encephalomyelitis/chronic': 0.125, 'fatigue': 0.125, 'postural': 0.125, 'orthostatic': 0.125, 'tachycardia': 0.125,
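
The counting and normalizing cells above can also be written more compactly. The sketch below is not the notebook's code: `normalized_frequencies` is a hypothetical helper, the demo tokens and stop words are made up, and unlike the cells above it lowercases the keys so that 'Long' and 'long' share one entry.

```python
from collections import Counter
from string import punctuation

def normalized_frequencies(tokens, stop_words):
    """Count lowercased tokens, skip stop words and punctuation, scale by the max count."""
    counts = Counter(
        tok.lower()
        for tok in tokens
        if tok.lower() not in stop_words
        and not all(ch in punctuation for ch in tok)  # drops tokens made only of punctuation
    )
    if not counts:
        return {}
    top = max(counts.values())
    return {word: n / top for word, n in counts.items()}

# Made-up tokens and stop words, for illustration only
demo_tokens = ["Long", "COVID", "is", "long", ",", "and", "COVID",
               "research", "on", "long", "COVID"]
demo_stop = {"is", "and", "on"}
freqs = normalized_frequencies(demo_tokens, demo_stop)
print(freqs)
```

Lowercasing the keys matters later, because the scoring step looks words up in lowercase.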

In [42]:
# Score the sentences: term frequency
# Each sentence is scored by summing the normalized frequency of every non-stop word it contains.
# The frequencies were normalized above by dividing each count by the highest count in the text.
sent_token = sent_tokenize(text)
sent_token

["b'Long COVID is an often debilitating illness that occurs in at least 10% of severe acute respiratory syndrome coronavirus\\xc2\\xa02 (SARS-CoV-2) infections.",
 'More than 200 symptoms have been identified with impacts on multiple organ systems.',
 'At least 65 million individuals worldwide are estimated to have long COVID, with cases increasing daily.',
 'Biomedical research has made substantial progress in identifying various pathophysiological changes and risk factors and in characterizing the illness; further, similarities with other viral-onset illnesses such as myalgic encephalomyelitis/chronic fatigue syndrome and postural orthostatic tachycardia syndrome have laid the groundwork for research in the field.',
 'In this Review, we explore the current literature and highlight key findings, the overlap with other conditions, the variable onset of symptoms, long COVID in children and the impact of vaccinations.',
 'Although these key findings are critical to understanding long COV

In [43]:
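# Add up the normalized frequency of each word found in the frequency table.
# Words are split on spaces and lowercased here, while the table keeps the tokens'
# original case, so capitalized terms (e.g. 'COVID') and words with attached
# punctuation (e.g. 'systems.') do not contribute to the score.
sentence_scores = {}
for sent in sent_token:
    for word in sent.split(" "):
        if word.lower() in word_frequencies:
            sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[word.lower()]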

In [44]:
sentence_scores

{"b'Long COVID is an often debilitating illness that occurs in at least 10% of severe acute respiratory syndrome coronavirus\\xc2\\xa02 (SARS-CoV-2) infections.": 1.875,
 'More than 200 symptoms have been identified with impacts on multiple organ systems.': 1.25,
 'At least 65 million individuals worldwide are estimated to have long COVID, with cases increasing daily.': 2.25,
 'Biomedical research has made substantial progress in identifying various pathophysiological changes and risk factors and in characterizing the illness; further, similarities with other viral-onset illnesses such as myalgic encephalomyelitis/chronic fatigue syndrome and postural orthostatic tachycardia syndrome have laid the groundwork for research in the field.': 5.25,
 'In this Review, we explore the current literature and highlight key findings, the overlap with other conditions, the variable onset of symptoms, long COVID in children and the impact of vaccinations.': 2.25,
 'Although these key findings are cri
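
The scores above miss capitalized words and words followed by punctuation, because the lookup is lowercase while the table keys are not. The sketch below shows one way to make the two sides agree. It assumes a frequency table with lowercase keys; `score_sentences`, the regular expression and the demo data are all illustrative choices, not the notebook's method.

```python
import re

def score_sentences(sentences, word_freq):
    """Sum the frequency of every known word in each sentence (keys assumed lowercase)."""
    scores = {}
    for sent in sentences:
        # Lowercase first, then pull out words without surrounding punctuation
        words = re.findall(r"[a-z0-9][a-z0-9\-]*", sent.lower())
        score = sum(word_freq.get(w, 0.0) for w in words)
        if score > 0:
            scores[sent] = score
    return scores

# Made-up frequency table and sentences
demo_freq = {"covid": 1.0, "long": 0.75, "symptoms": 0.25}
demo_sents = ["Long COVID has many symptoms.", "Nothing relevant here."]
demo_scores = score_sentences(demo_sents, demo_freq)
print(demo_scores)
```

Sentences without any known word are left out of the result, which mirrors how `sentence_scores` is filled in the cell above.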

In [45]:
from heapq import nlargest

In [46]:
# Select a threshold: keep the top 5% of sentences by score (picked later with heapq.nlargest).
# int() truncates, so a text with fewer than 20 sentences gives 0 here and the summary comes out empty.
select_length = int(len(sent_token)*0.05)
select_length

0
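
A result of 0 means that no sentence will be selected. One possible guard, sketched below, is to enforce a minimum of one sentence. `summary_size` and its `ratio` and `minimum` parameters are hypothetical names introduced for this example.

```python
def summary_size(n_sentences, ratio=0.05, minimum=1):
    """Number of sentences to keep: a share of the text, never fewer than `minimum`."""
    if n_sentences <= 0:
        return 0
    return min(n_sentences, max(minimum, int(n_sentences * ratio)))

print(summary_size(12))    # int(12 * 0.05) is 0, so the minimum of 1 applies
print(summary_size(100))   # int(100 * 0.05) is 5
```

In the notebook this would correspond to `select_length = summary_size(len(sent_token))`; a larger ratio is another option for short abstracts.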

In [47]:

# Pick the highest-scoring sentences (the name avoids shadowing the built-in sum)
top_sentences = nlargest(select_length, sentence_scores, key = sentence_scores.get)

In [48]:
top_sentences

[]

In [49]:
# Generate the final summary by joining the selected sentences
summary = ' '.join(top_sentences)

In [50]:
#The summary
summary

''
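
The selection and joining steps can be combined into one helper. The sketch below is an illustrative variant, not the notebook's code: `build_summary` is a hypothetical name, and it returns the chosen sentences in their original document order rather than in score order, which tends to read more naturally.

```python
from heapq import nlargest

def build_summary(sentences, scores, n):
    """Pick the n highest-scoring sentences and join them in document order."""
    chosen = set(nlargest(n, scores, key=scores.get))
    return " ".join(s for s in sentences if s in chosen)

# Made-up sentences and scores
demo_sentences = ["A first.", "B second.", "C third."]
demo_scores = {"A first.": 1.0, "B second.": 3.0, "C third.": 2.0}
print(build_summary(demo_sentences, demo_scores, 2))
print(repr(build_summary(demo_sentences, demo_scores, 0)))  # n = 0 selects nothing
```

With `n = 0` the result is an empty string, which is what happened in the cells above.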

In [51]:
#Evaluation
#Read the reference summary
ref_summary = '''Our infectious disease colleagues are adamant that restricting the movement of people into and around the hospital setting are effective clinical and epidemiological strategies that will help protect both the vulnerable patient population and health care providers themselves, 
who need to stay healthy so that they may care for their patients. In a health care institution, visitation restrictions not only affect inpatients but also have an impact on ambulatory patients who must come for diagnostic tests or interventions and who, if deprived access, might develop urgent or emergent conditions.
Feedback should be sought from those individuals who would be affected by visitation restrictions, such as staff, patients and family members.Health care workers, being in direct communication with patients and families, bear the brunt of their anger and frustration regarding any restriction in visitation.
If a family is allowed to visit a patient whose death is presumed to be imminent, then the patient's identity should be protected by using privacy strategies.
'''
!pip install rouge-score
from rouge_score import rouge_scorer

# Compare the generated summary with the reference using ROUGE-1 and ROUGE-2
scorer = rouge_scorer.RougeScorer(['rouge1','rouge2'], use_stemmer=True)
scores = scorer.score(ref_summary, summary)  # score(target, prediction)

scores

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


{'rouge1': Score(precision=0.0, recall=0.0, fmeasure=0.0),
 'rouge2': Score(precision=0.0, recall=0.0, fmeasure=0.0)}
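
To see why an empty summary scores zero everywhere, it helps to compute ROUGE-1 by hand. The sketch below is a simplified stand-in and not the `rouge_score` implementation: it uses whitespace tokens, applies no stemming, and the function name `rouge1` and both sample strings are made up.

```python
from collections import Counter

def rouge1(reference, candidate):
    """Unigram-overlap precision, recall and F1 on whitespace tokens (no stemming)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # shared unigrams, clipped by count
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

reference = "long covid affects many organ systems"
print(rouge1(reference, ""))                             # empty candidate: no overlap
print(rouge1(reference, "long covid affects children"))  # 3 of 4 candidate words match
```

Precision is the share of candidate words found in the reference, and recall is the share of reference words found in the candidate, so a candidate with no words has nothing to match.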

In [52]:
# Word counts of the full text, the reference summary and the generated summary
print("Word count of full text is:",len(text.split()))
print("Word count of reference summarized text is:",len(ref_summary.split()))
print("Word count of summarized text is:",len(summary.split()))

Word count of full text is: 293
Word count of reference summarized text is: 160
Word count of summarized text is: 0


In [53]:
# Word counts of the full text and the generated summary
print("Word count of full text is:",len(text.split()))
print("Word count of summarized text is:",len(summary.split()))
# Estimated reading times, assuming a reading speed of 265 words per minute
print('Reading time of full text (mins):',(len(text.split())/265))
print('Reading time of summary(mins):',(len(summary.split())/265))
print('The Rouge Metrics are:\n')
scores

Word count of full text is: 293
Word count of summarized text is: 0
Reading time of full text (mins): 1.1056603773584905
Reading time of summary(mins): 0.0
The Rouge Metrics are:



{'rouge1': Score(precision=0.0, recall=0.0, fmeasure=0.0),
 'rouge2': Score(precision=0.0, recall=0.0, fmeasure=0.0)}