## NLTK Summarizer

**TF: Term Frequency** measures how frequently a term occurs in a document. Because documents differ in length, a term is likely to appear many more times in a long document than in a short one. The term frequency is therefore often divided by the document length (the total number of terms in the document) as a form of normalization.

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
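As a minimal sketch of the formula above, the snippet below computes TF for a small, already tokenized document. The function name `term_frequency` and the toy tokens are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def term_frequency(tokens):
    """Return TF(t) = count(t) / total number of tokens, for every term t."""
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {term: count / total for term, count in counts.items()}

# Hypothetical toy document, already tokenized and lowercased.
toy_tokens = ["sars", "outbreak", "sars", "hospital"]
tf = term_frequency(toy_tokens)
print(tf["sars"])  # 2 occurrences out of 4 tokens
```

Dividing by the total token count means the TF values of one document always sum to 1, which makes documents of different lengths comparable.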

**NLTK summarization involves 5 steps**

1. Create the word frequency table

2. Tokenize the sentences

3. Score the sentences: Term frequency

4. Find the threshold

5. Generate the summary
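The steps listed above can be sketched end to end in a few lines. This is a simplified illustration that assumes regular expressions as a stand-in for the NLTK tokenizers, so it will not split text exactly as `sent_tokenize` and `word_tokenize` do. The function name `summarize` and the `ratio` parameter are hypothetical.

```python
import re
from collections import Counter
from heapq import nlargest

WORD_PATTERN = r"[a-z0-9'-]+"

def summarize(text, stop_words, ratio=0.05):
    # Naive sentence split on end punctuation (stand-in for nltk.sent_tokenize).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequency table over lowercase tokens that are not stopwords.
    words = [w for w in re.findall(WORD_PATTERN, text.lower()) if w not in stop_words]
    freq = Counter(words)
    if not freq:
        return ""
    top = max(freq.values())
    # Sentence score: sum of the normalized frequencies of its words.
    scores = {}
    for sentence in sentences:
        tokens = re.findall(WORD_PATTERN, sentence.lower())
        scores[sentence] = sum(freq[w] / top for w in tokens if w in freq)
    # Threshold expressed as the number of sentences to keep (at least one).
    k = max(1, int(len(sentences) * ratio))
    # Join the highest-scoring sentences into the summary.
    return " ".join(nlargest(k, scores, key=scores.get))

toy = "SARS spread quickly. Hospitals restricted visitation during SARS. The weather was mild."
print(summarize(toy, stop_words={"the", "was", "during"}, ratio=0.34))
```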


In [25]:
import nltk
from string import punctuation
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
import pandas as pd # dataframe processing

In [26]:
df = pd.read_csv("capstone_data.csv")
df.head()

| Unnamed: 0 | cord_uid | title | pmcid | abstract | authors | journal | text_body |
|---|---|---|---|---|---|---|---|
| 0 | zowp10ts | Recombination Every Day: Abundant Recombinatio... | PMC1054884 | Viral recombination can dramatically impact ev... | Froissart, Remy; Roze, Denis; Uzest, Marilyne;... | PLoS Biol | Introduction\n\nAs increasing numbers of full-... |
| 1 | i4pmux28 | Why can't I visit? The ethics of visitation re... | PMC1065028 | Patients want, need and expect that their rela... | Rogers, Sharon | Crit Care | Introduction\n\nThe sudden emergence of severe... |
| 2 | jw1lxwyd | Prospective evaluation of an internet-linked h... | PMC1065064 | INTRODUCTION: Critical care physicians may ben... | Lapinsky, Stephen E; Wax, Randy; Showalter, Ra... | Crit Care | Introduction\n\nThe rate of expansion of medic... |
| 3 | xiv9vxdp | Scanning the horizon: emerging hospital-wide t... | PMC1065120 | This commentary represents a selective survey ... | Suntharalingam, Ganesh; Cousins, Jonathan; Gat... | Crit Care | Introduction\n\nThis series of articles provid... |
| 4 | mcfmxqp2 | Characterization of the frameshift signal of E... | PMC1065257 | The ribosomal frameshifting signal of the mous... | Manktelow, Emily; Shigemoto, Kazuhiro; Brierle... | Nucleic Acids Res | INTRODUCTION\n\nProgrammed −1 ribosomal frames... |


In [27]:

# Use the text body of the second article (index 1) as the document to summarize
text = df['text_body'][1]

## Wireframe of NLP data used for the capstone
- **cord_uid**: The unique CORD-19 identifier (UID) assigned to the article
- **title**: The title of the article
- **pmcid**: The unique PMC identification number of the article
- **abstract**: The abstract of the article
- **authors**: The names of the article's authors
- **journal**: The name of the journal the article was published in
- **text_body**: The article without the abstract, references and acknowledgement sections


In [28]:
text

"Introduction\n\nThe sudden emergence of severe acute respiratory syndrome (SARS) in April 2003 caused much concern and reaction. Refereed medical journals ever since have been rife with articles about SARS. The eventual containment and treatment of SARS has seen a diminution of the massive media publicity and overt public concern. However, fears have recently surfaced about the potential for re-emergence of SARS in the near future. As we confront the potential need to return to more stringent infection control measures once again, this is an appropriate time to reflect on the ethical values that underlay the strict visitation restrictions imposed in hospitals in Ontario during the SARS outbreak and the moderate restrictions in place since SARS. This reflection will facilitate future decision making with respect to visitation restrictions.\n\nWhen public health trumps civil liberties: the collateral damage associated with victims of SARS\n\nOur infectious disease colleagues are adamant

In [29]:
# Split the text into word tokens (tokenization)
tokens = word_tokenize(text)

In [30]:
# Build the stopword set for the word frequency table.
# Only words that are not stopwords should be counted.
stop_words = set(stopwords.words('english'))

# Add domain-specific words that aren't in the NLTK stopwords list
new_stopwords = ['doi', 'preprint', 'copyright', 'peer', 'reviewed', 'org', 'https', 'et', 'al', 'author', 'figure','journals','april', 
    'rights', 'reserved', 'permission', 'used', 'using', 'biorxiv', 'medrxiv', 'license', 'fig', 'fig.','al.', 'Elsevier',
    'PMC', 'CZI', 'www','image','figures','tables','introduction','materials and methods','results']
new_stopwords_list = stop_words.union(new_stopwords)

# Keep these words in the text even if they appear in the stopwords list
not_stopwords = {'not', 'may'}
final_stop_words = {word for word in new_stopwords_list if word not in not_stopwords}

print(final_stop_words)


{'during', 'wouldn', 'CZI', 'you', 'wasn', 'hasn', "aren't", 'ourselves', 'https', 'into', 'figure', 'what', "wasn't", 'over', 'in', 'didn', "needn't", 'won', "hadn't", 'couldn', 'herself', 'we', 'about', 'y', 'mustn', 'doesn', 'do', 'very', "haven't", "won't", 'has', 'ours', 'own', 'yourselves', 'ma', 'biorxiv', 'an', 'than', 'having', 'which', 'once', "don't", 'been', 'why', 'all', 'from', 'both', 'more', 'should', 'figures', 'don', 'ain', 'being', 'can', "you're", 'by', 'hers', 'to', "mightn't", 'most', 'whom', "that'll", 'so', 'when', 'fig', 'off', 'because', 'are', "weren't", 'results', 'shouldn', 'such', 'he', 'here', 'that', 'out', "should've", 'they', 'did', 'my', 'then', 'now', 'same', 'copyright', 'et', 'image', 'myself', 'of', 'above', 'few', 'www', 'i', 'aren', 'was', 'shan', 'doi', 'had', 'some', 'fig.', 'she', 'themselves', "you'd", 'the', 'our', 'am', 'too', 't', "you've", 'Elsevier', 'd', 'between', 'with', 'al.', 'after', "isn't", 'its', 'until', 'as', 'hadn', "shan't"
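The custom list above mixes lowercase and capitalized entries such as 'Elsevier' and 'PMC', so a lookup that lowercases each token first would never match them. A possible sketch of a case-insensitive variant follows. The small `base_stop_words` set is a stand-in for `stopwords.words('english')`, and `is_stop_word` is a hypothetical helper, not something the notebook defines.

```python
# Stand-in for set(stopwords.words('english')); assumed here so the sketch runs without NLTK data.
base_stop_words = {"the", "of", "not", "and"}
extra_stop_words = ["doi", "Elsevier", "PMC", "fig."]
keep_words = {"not", "may"}

# Lowercase every entry once, then drop the words that should stay in the text.
custom_stop_words = {w.lower() for w in base_stop_words.union(extra_stop_words)} - keep_words

def is_stop_word(token):
    """Case-insensitive membership test against the custom stopword set."""
    return token.lower() in custom_stop_words

print(is_stop_word("Elsevier"), is_stop_word("not"))
```

Multi-word entries such as 'materials and methods' would still never match a single token, so they would need to be handled separately (for example by removing the phrase from the raw text).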

In [31]:
# Extend the punctuation characters with the newline so both can be filtered out
punctuation = punctuation + '\n'
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n'

In [32]:
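# Generate the word frequencies used to score each sentence.
# Note: this filters against stop_words (the base NLTK list), so the custom
# additions in final_stop_words are not applied here; that is why words such as
# 'Introduction' and 'journals' still appear in the table below.
# Keys keep their original casing (for example 'SARS' and 'However').
word_frequencies = {}
for word in tokens:
    if word.lower() not in stop_words and word.lower() not in punctuation:
        word_frequencies[word] = word_frequencies.get(word, 0) + 1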
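An alternative way to build the table is `collections.Counter` with lowercase keys, so that 'Public' and 'public' are counted together. This is a sketch under assumptions: `count_words` and the toy tokens are illustrative, and the counts it produces would differ from the notebook output because the casing is merged.

```python
from collections import Counter
from string import punctuation

def count_words(tokens, stop_words):
    """Count tokens case-insensitively, skipping stopwords and pure punctuation."""
    counts = Counter()
    for token in tokens:
        key = token.lower()
        if key in stop_words:
            continue
        # Skip tokens made only of punctuation characters (also skips empty tokens).
        if all(ch in punctuation for ch in key):
            continue
        counts[key] += 1
    return counts

toy_tokens = ["SARS", "and", "the", "public", ",", "sars", "Public", "health", "."]
toy_counts = count_words(toy_tokens, stop_words={"and", "the"})
print(toy_counts.most_common(2))
```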

In [33]:

word_frequencies

{'Introduction': 1,
 'sudden': 2,
 'emergence': 1,
 'severe': 2,
 'acute': 2,
 'respiratory': 2,
 'syndrome': 2,
 'SARS': 15,
 'April': 1,
 '2003': 3,
 'caused': 2,
 'much': 1,
 'concern': 2,
 'reaction': 1,
 'Refereed': 1,
 'medical': 1,
 'journals': 1,
 'ever': 1,
 'since': 2,
 'rife': 1,
 'articles': 1,
 'eventual': 1,
 'containment': 1,
 'treatment': 1,
 'seen': 1,
 'diminution': 1,
 'massive': 1,
 'media': 1,
 'publicity': 2,
 'overt': 2,
 'public': 9,
 'However': 4,
 'fears': 1,
 'recently': 1,
 'surfaced': 1,
 'potential': 3,
 're-emergence': 1,
 'near': 1,
 'future': 2,
 'confront': 1,
 'need': 8,
 'return': 1,
 'stringent': 3,
 'infection': 1,
 'control': 1,
 'measures': 2,
 'appropriate': 3,
 'time': 4,
 'reflect': 1,
 'ethical': 10,
 'values': 2,
 'underlay': 1,
 'strict': 1,
 'visitation': 15,
 'restrictions': 12,
 'imposed': 1,
 'hospitals': 2,
 'Ontario': 1,
 'outbreak': 3,
 'moderate': 1,
 'place': 1,
 'reflection': 1,
 'facilitate': 1,
 'decision': 2,
 'making': 2,
 're

In [34]:

max_frequency = max(word_frequencies.values())
print(max_frequency)

15


In [35]:
# Normalize each count by the highest count so every weight falls between 0 and 1
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word]/max_frequency

In [36]:
print(word_frequencies)

{'Introduction': 0.06666666666666667, 'sudden': 0.13333333333333333, 'emergence': 0.06666666666666667, 'severe': 0.13333333333333333, 'acute': 0.13333333333333333, 'respiratory': 0.13333333333333333, 'syndrome': 0.13333333333333333, 'SARS': 1.0, 'April': 0.06666666666666667, '2003': 0.2, 'caused': 0.13333333333333333, 'much': 0.06666666666666667, 'concern': 0.13333333333333333, 'reaction': 0.06666666666666667, 'Refereed': 0.06666666666666667, 'medical': 0.06666666666666667, 'journals': 0.06666666666666667, 'ever': 0.06666666666666667, 'since': 0.13333333333333333, 'rife': 0.06666666666666667, 'articles': 0.06666666666666667, 'eventual': 0.06666666666666667, 'containment': 0.06666666666666667, 'treatment': 0.06666666666666667, 'seen': 0.06666666666666667, 'diminution': 0.06666666666666667, 'massive': 0.06666666666666667, 'media': 0.06666666666666667, 'publicity': 0.13333333333333333, 'overt': 0.13333333333333333, 'public': 0.6, 'However': 0.26666666666666666, 'fears': 0.0666666666666666
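Note that this normalization divides by the maximum count rather than by the total number of terms used in the TF formula at the top of the notebook. A small sketch of the same step as a function, using three counts taken from the table above (the helper name `normalize_by_max` is an assumption):

```python
def normalize_by_max(counts):
    """Divide every count by the largest count, so the top word gets weight 1.0."""
    if not counts:
        return {}
    top = max(counts.values())
    return {word: count / top for word, count in counts.items()}

# Three counts copied from the word frequency table above.
toy_counts = {"SARS": 15, "public": 9, "2003": 3}
weights = normalize_by_max(toy_counts)
print(weights)
```

Either normalization preserves the ranking of words within a single document, so the choice mainly affects the scale of the sentence scores.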

In [37]:
# Score the sentences: term frequency
# We use the term frequency method to score each sentence.
# Basic algorithm: score a sentence by adding up the weighted frequency of every non-stopword it contains.
# Each word's weight is its count divided by the count of the most frequent word (computed above).
sent_token = sent_tokenize(text)
sent_token

['Introduction\n\nThe sudden emergence of severe acute respiratory syndrome (SARS) in April 2003 caused much concern and reaction.',
 'Refereed medical journals ever since have been rife with articles about SARS.',
 'The eventual containment and treatment of SARS has seen a diminution of the massive media publicity and overt public concern.',
 'However, fears have recently surfaced about the potential for re-emergence of SARS in the near future.',
 'As we confront the potential need to return to more stringent infection control measures once again, this is an appropriate time to reflect on the ethical values that underlay the strict visitation restrictions imposed in hospitals in Ontario during the SARS outbreak and the moderate restrictions in place since SARS.',
 'This reflection will facilitate future decision making with respect to visitation restrictions.',
 'When public health trumps civil liberties: the collateral damage associated with victims of SARS\n\nOur infectious disease 

In [38]:
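# Score each sentence by summing the weights of its words.
# Note: sentences are split on spaces and each word is lowercased before the lookup,
# while word_frequencies keeps original casing, so capitalized keys such as 'SARS'
# and words with attached punctuation (for example 'SARS.') do not contribute to the score.
sentence_scores = {}
for sent in sent_token:
    for word in sent.split(" "):
        weight = word_frequencies.get(word.lower())
        if weight is not None:
            sentence_scores[sent] = sentence_scores.get(sent, 0) + weight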

In [39]:

sentence_scores

{'Introduction\n\nThe sudden emergence of severe acute respiratory syndrome (SARS) in April 2003 caused much concern and reaction.': 1.2666666666666666,
 'Refereed medical journals ever since have been rife with articles about SARS.': 0.4666666666666667,
 'The eventual containment and treatment of SARS has seen a diminution of the massive media publicity and overt public concern.': 1.3333333333333333,
 'However, fears have recently surfaced about the potential for re-emergence of SARS in the near future.': 0.5333333333333333,
 'As we confront the potential need to return to more stringent infection control measures once again, this is an appropriate time to reflect on the ethical values that underlay the strict visitation restrictions imposed in hospitals in Ontario during the SARS outbreak and the moderate restrictions in place since SARS.': 6.066666666666667,
 'This reflection will facilitate future decision making with respect to visitation restrictions.': 1.6666666666666665,
 'When
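The scores above come from a lookup that lowercases each space-separated word, so capitalized or punctuated words are missed. One possible variant, sketched below, extracts lowercase words with a regular expression and assumes the weight table uses lowercase keys. `score_sentences` and the toy data are hypothetical, and this variant would produce different scores from the ones shown.

```python
import re

def score_sentences(sentences, word_weights):
    """Sum the weight of every known word in each sentence (lowercase keys assumed)."""
    scores = {}
    for sentence in sentences:
        # Extract lowercase words, keeping internal hyphens and apostrophes.
        words = re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", sentence.lower())
        score = sum(word_weights.get(word, 0.0) for word in words)
        if score > 0:
            scores[sentence] = score
    return scores

toy_weights = {"sars": 1.0, "visitation": 0.8, "public": 0.6}
toy_sentences = [
    "Articles about SARS.",
    "Public visitation rules changed.",
    "Nothing relevant here.",
]
toy_scores = score_sentences(toy_sentences, toy_weights)
print(toy_scores)
```

Because the score is a plain sum, long sentences tend to score higher. Dividing by the number of words in the sentence is a common adjustment if that bias is unwanted.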

In [40]:

from heapq import nlargest

In [41]:
# Select or specify a threshold value
# Use heapq.nlargest to keep the top 5% of the scored sentences
select_length = int(len(sent_token)*0.05)
select_length

2
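The selection step can be wrapped in a small helper. The sketch below assumes a hypothetical function `top_sentences` and adds a guard so that at least one sentence is kept, since `int(n * 0.05)` is 0 for texts with fewer than 20 sentences.

```python
from heapq import nlargest

def top_sentences(scores, total_sentences, ratio=0.05):
    """Return the highest-scoring sentences, keeping at least one."""
    k = max(1, int(total_sentences * ratio))
    return nlargest(k, scores, key=scores.get)

# Hypothetical scores keyed by sentence label.
toy_scores = {"a": 0.4, "b": 2.5, "c": 1.1, "d": 0.9}
print(top_sentences(toy_scores, total_sentences=40))  # k = int(40 * 0.05) = 2
```

`nlargest` returns the sentences ordered by score, not by their position in the article, so the summary may read out of order unless the selection is sorted back into document order.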

In [42]:

# Pick the top-scoring sentences (note: the name `sum` shadows Python's built-in sum())
sum = nlargest(select_length, sentence_scores, key = sentence_scores.get)

In [43]:
sum

["It could be argued that visitation restrictions, in light of a potential outbreak of a contagious disease, are ethically sound because of the compelling need to protect public health.However, even when public health concerns trump individual liberties, the ethical operationalization of this value would demand that 'those whose rights are being infringed' need to be managed in 'an ethical and even-handed manner so that they are not unfairly or disproportionately harmed by such measures' [1].This is an important and far-reaching consideration because SARS caused collateral damage and we know that the implementation of visitation restrictions will have an impact on a broad range of individuals.",
 "Every reasonable effort should be made to protect the individual patient's identity and their specific health status should exceptionality be considered.It is ethically the responsibility of the organization to enforce compliance with restricted visitation and a corporate department should be

In [44]:
# Generate the final summary by joining the selected sentences
summary = ' '.join(sum)

In [45]:
#The summary
summary

"It could be argued that visitation restrictions, in light of a potential outbreak of a contagious disease, are ethically sound because of the compelling need to protect public health.However, even when public health concerns trump individual liberties, the ethical operationalization of this value would demand that 'those whose rights are being infringed' need to be managed in 'an ethical and even-handed manner so that they are not unfairly or disproportionately harmed by such measures' [1].This is an important and far-reaching consideration because SARS caused collateral damage and we know that the implementation of visitation restrictions will have an impact on a broad range of individuals. Every reasonable effort should be made to protect the individual patient's identity and their specific health status should exceptionality be considered.It is ethically the responsibility of the organization to enforce compliance with restricted visitation and a corporate department should be assi

In [46]:
#Evaluation
#Read the reference summary
ref_summary = '''Our infectious disease colleagues are adamant that restricting the movement of people into and around the hospital setting are effective clinical and epidemiological strategies that will help protect both the vulnerable patient population and health care providers themselves, 
who need to stay healthy so that they may care for their patients. In a health care institution, visitation restrictions not only affect inpatients but also have an impact on ambulatory patients who must come for diagnostic tests or interventions and who, if deprived access, might develop urgent or emergent conditions.
Feedback should be sought from those individuals who would be affected by visitation restrictions, such as staff, patients and family members.Health care workers, being in direct communication with patients and families, bear the brunt of their anger and frustration regarding any restriction in visitation.
If a family is allowed to visit a patient whose death is presumed to be imminent, then the patient's identity should be protected by using privacy strategies.
'''
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1','rouge2'], use_stemmer=True)
scores = scorer.score(ref_summary,summary)
                      
scores

{'rouge1': Score(precision=0.391304347826087, recall=0.4444444444444444, fmeasure=0.4161849710982659),
 'rouge2': Score(precision=0.07650273224043716, recall=0.08695652173913043, fmeasure=0.0813953488372093)}
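To make the ROUGE-1 numbers above easier to interpret, here is a hand-rolled sketch of unigram overlap. It is a simplification, not the `rouge_score` implementation: it uses a plain regular-expression tokenizer and omits the stemming enabled by `use_stemmer=True`, so its values would not match the scores shown. The function name `rouge1` and the toy strings are assumptions.

```python
import re
from collections import Counter

def rouge1(reference, candidate):
    """Return (precision, recall, f-measure) based on clipped unigram overlap."""
    ref_counts = Counter(re.findall(r"[a-z0-9]+", reference.lower()))
    cand_counts = Counter(re.findall(r"[a-z0-9]+", candidate.lower()))
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(1, sum(cand_counts.values()))
    recall = overlap / max(1, sum(ref_counts.values()))
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

p, r, f = rouge1("visitation restrictions protect patients",
                 "restrictions protect the public")
print(p, r, f)
```

Precision is the share of summary words that also appear in the reference, and recall is the share of reference words recovered by the summary. ROUGE-2 applies the same idea to pairs of adjacent words, which is why it is usually much lower.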

In [47]:
# Word count of the full text
print("Word count of full text is:",len(text.split()))
# Word counts of the reference summary and the generated summary
print("Word count of reference summarized text is:",len(ref_summary.split()))
print("Word count of summarized text is:",len(summary.split()))

Word count of full text is: 1801
Word count of reference summarized text is: 160
Word count of summarized text is: 177


In [53]:
# Word count of the full text
print("Word count of full text is:",len(text.split()))
# Word count of the generated summary
print("Word count of summarized text is:",len(summary.split()))
# Reading time of the full text, assuming 265 words per minute
print('Reading time of full text (mins):',(len(text.split())/265))
# Reading time of the summary, assuming 265 words per minute
print('Reading time of summary(mins):',(len(summary.split())/265))
print('The Rouge Metrics are:\n')
scores


Word count of full text is: 1801
Word count of summarized text is: 177
Reading time of full text (mins): 6.796226415094339
Reading time of summary(mins): 0.6679245283018868
The Rouge Metrics are:



{'rouge1': Score(precision=0.391304347826087, recall=0.4444444444444444, fmeasure=0.4161849710982659),
 'rouge2': Score(precision=0.07650273224043716, recall=0.08695652173913043, fmeasure=0.0813953488372093)}
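The reading-time arithmetic above can be collected into two small helpers. This sketch assumes the same 265 words-per-minute rate used in the cell; the function names are hypothetical.

```python
def reading_time_minutes(text, words_per_minute=265):
    """Estimate reading time from a whitespace word count."""
    return len(text.split()) / words_per_minute

def compression_ratio(full_text, summary_text):
    """Share of the original word count that the summary retains."""
    full_words = len(full_text.split())
    if full_words == 0:
        return 0.0
    return len(summary_text.split()) / full_words

# Hypothetical texts sized to make the arithmetic easy to check.
toy_full = " ".join(["word"] * 530)
toy_summary = " ".join(["word"] * 53)
print(reading_time_minutes(toy_full), compression_ratio(toy_full, toy_summary))
```

With the counts reported above, 177 of 1801 words means the summary keeps roughly 10% of the original text.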

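These metrics are encouraging for a simple frequency-based summarizer: the ROUGE-1 F-measure is about 0.42, although the ROUGE-2 F-measure is only about 0.08, and the summary reduces the text from 1801 to 177 words.
- I can customize the summarization process to suit my requirements and use word embeddings
- I intend to try the spaCy model, which is faster according to its documentation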