# Extractive Text Summarization - SpaCy

### Install the following packages in the conda terminal
- conda install -c conda-forge spacy
- conda install -c conda-forge scispacy
- python -m spacy download en_core_sci_md (SciSpacy)

!pip install -U spacy
!pip install scispacy
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_sm-0.2.4.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_craft_md-0.2.4.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_jnlpba_md-0.2.4.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bc5cdr_md-0.2.4.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_lg-0.2.4.tar.gz

https://pypi.org/project/scispacy/

Source code for this notebook: https://spacy.io/usage/spacy-101
https://jcharistech.wordpress.com/2018/12/31/text-summarization-using-spacy-and-python/


In [1]:
#Source Code
# Load Pkgs
import numpy as np # linear algebra
import pandas as pd # dataframe processing
import spacy
#scispaCy is a Python package containing spaCy models for processing biomedical, scientific or clinical text.
#https://allenai.github.io/scispacy/
import scispacy

# Text Preprocessing Pkg
from spacy.lang.en.stop_words import STOP_WORDS # stop words list
from string import punctuation
from collections import Counter
from spacy.tokenizer import Tokenizer
#calling a pretrained biomedical model 
#Assigns context-specific token vectors, POS tags, dependency parse and named entities.
#A full spaCy pipeline for biomedical data with a larger vocabulary and 50k word vectors.

nlp = spacy.load("en_core_sci_md") #the built-in load function.

#NER specific models
#https://towardsdatascience.com/using-scispacy-for-named-entity-recognition-785389e7918d
import en_ner_craft_md
import en_ner_bc5cdr_md
import en_ner_jnlpba_md
import en_ner_bionlp13cg_md

#Tools for extracting & displaying data

#text post processing
from spacy import displacy   #for ner    




### SpaCy
SpaCy is a relatively new package for “Industrial strength NLP in Python” developed by Matt Honnibal at explosion.ai. It does not weigh the user down with decisions over what esoteric algorithms to use for common tasks and it’s fast. Incredibly fast (it’s implemented in Cython). It’s reasonably low-level, but very intuitive and performant.<br>
Ref: https://hackernoon.com/summarization-with-wine-reviews-using-spacy-b49f18399577


### TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF (Term Frequency-Inverse Document Frequency) is often used in information retrieval and text mining to weight how important a term is to a document. In text summarization, those term weights are combined to calculate the importance of a sentence.
The TF-IDF weight is composed of two terms:
- **TF**: Term Frequency — Measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term would appear many more times in long documents than shorter ones. Thus, the term frequency is often divided by the document length, such as the total number of terms in the document, as a way of normalization.
- TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)


- **IDF**: Inverse Document Frequency — Measures how important a term is. While computing the term frequency, all terms are considered equally important. However, it is known that certain terms may appear a lot of times but have little importance in the document. We usually term these words stopwords. For example: is, are, they, and so on.
- IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
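The two formulas above can be checked by hand on a tiny corpus. The sketch below is only an illustration: the three pre-tokenized documents and the helper names `tf`, `idf` and `tf_idf` are made up for this example and are not part of the notebook's pipeline.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not taken from capstone_data.csv); each document is pre-tokenized.
docs = [
    ["sars", "outbreak", "hospital", "sars"],
    ["hospital", "visitation", "restrictions"],
    ["public", "health", "ethics"],
]

def tf(term, doc):
    # (Number of times the term appears in the document) / (total number of terms in the document)
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    # log_e(total number of documents / number of documents containing the term)
    # Assumes the term occurs in at least one document; otherwise this divides by zero.
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print("TF of 'sars' in document 0:", tf("sars", docs[0]))
print("IDF of 'hospital':", idf("hospital", docs))
print("TF-IDF of 'sars' in document 0:", tf_idf("sars", docs[0], docs))
```

In this toy corpus 'sars' appears in one document and 'hospital' in two, so 'sars' receives the higher IDF.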

TF-IDF is a good choice if you are dealing with a single domain, because a term that is common in one domain might be an important term in another domain. The text in this set is only from the biomedical domain.
This notebook focuses on identifying the top sentences in an article as follows:
- Tokenize the article using spaCy’s language model.
- Extract important keywords and calculate normalized weight.
- Calculate the importance of each sentence in the article based on keyword appearance.
- Sort the sentences based on the calculated importance.<br>
https://medium.com/better-programming/extractive-text-summarization-using-spacy-in-python-88ab96d1fd97
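Before the spaCy version, the four steps listed above can be tried on a toy example with only the standard library. This is a rough sketch under simplifying assumptions: a regex stands in for spaCy's tokenizer and sentence splitter, there is no part-of-speech filter, and `toy_text` and `toy_stop_words` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy text and stop-word set, used only for illustration.
toy_text = ("SARS spread quickly. Hospitals restricted visitors during SARS. "
            "Visitors were unhappy.")
toy_stop_words = {"during", "were", "the", "a"}

def tokenize(sentence):
    # Naive lowercase word tokenizer (assumption: stands in for spaCy's tokenizer).
    return re.findall(r"[a-z]+", sentence.lower())

# Step 1: split into sentences and tokens.
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", toy_text) if s.strip()]

# Step 2: collect keywords and normalize their frequencies by the maximum frequency.
keywords = [w for s in sentences for w in tokenize(s) if w not in toy_stop_words]
freq = Counter(keywords)
max_freq = max(freq.values())
weights = {w: c / max_freq for w, c in freq.items()}

# Step 3: a sentence's score is the sum of the weights of the keywords it contains.
scores = {s: sum(weights.get(w, 0.0) for w in tokenize(s)) for s in sentences}

# Step 4: sort sentences by score, highest first.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])
```

Because the score is a plain sum, longer sentences tend to rank higher. The same holds for the spaCy function later in this notebook.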

In [2]:
df=pd.read_csv("capstone_data.csv")
#Generating a single article for the initial example
text=df['text_body'][1]
text

"Introduction\n\nThe sudden emergence of severe acute respiratory syndrome (SARS) in April 2003 caused much concern and reaction. Refereed medical journals ever since have been rife with articles about SARS. The eventual containment and treatment of SARS has seen a diminution of the massive media publicity and overt public concern. However, fears have recently surfaced about the potential for re-emergence of SARS in the near future. As we confront the potential need to return to more stringent infection control measures once again, this is an appropriate time to reflect on the ethical values that underlay the strict visitation restrictions imposed in hospitals in Ontario during the SARS outbreak and the moderate restrictions in place since SARS. This reflection will facilitate future decision making with respect to visitation restrictions.\n\nWhen public health trumps civil liberties: the collateral damage associated with victims of SARS\n\nOur infectious disease colleagues are adamant

In [3]:
# Build a List of Stopwords
stopwords = list(STOP_WORDS)

In [4]:
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
print('Number of stop words: %d' % len(spacy_stopwords))
print('First ten stop words: %s' % list(spacy_stopwords)[:10])

Number of stop words: 326
First ten stop words: ['become', 'sometimes', 'whatever', 'quite', 'with', 'enough', 'it', 'below', 'all', 'many']


In [5]:
#Customizing the stopwords in the SpaCy library.
#This SpaCy model has 326 stop words. I'm adding a few more to that list.
#Research papers frequently use words that don't contribute to the meaning but are not considered
#everyday stopwords.
#Entries are lowercase because the text is lowercased before tokens are compared.
#Note: the functions below check nlp.Defaults.stop_words, so this list only takes effect where `stopwords` is checked instead.
from spacy.lang.en.stop_words import STOP_WORDS # stop words list
custom_stop_words = [
    'doi', 'preprint', 'copyright', 'peer', 'reviewed', 'org', 'https', 'et', 'al', 'author', 'figure', 'journals', 'april',
    'rights', 'reserved', 'permission', 'used', 'using', 'biorxiv', 'medrxiv', 'license', 'fig', 'fig.',
    'al.', 'elsevier', 'pmc', 'czi', 'www', 'image', 'figures', 'tables', 'introduction', 'materials and methods', 'results'
]

for w in custom_stop_words:
    if w not in stopwords:
        stopwords.append(w)


In [6]:
def top_sentence(text, limit):
    keyword = [] #make a list of the key words
    pos_tag = ['PROPN', 'ADJ', 'NOUN', 'VERB'] #proper nouns, adjectives, nouns and verbs are usually keywords
    doc = nlp(text.lower())
    for token in doc:
        if(token.text in nlp.Defaults.stop_words or token.text in punctuation):
            continue
        if(token.pos_ in pos_tag):
            keyword.append(token.text) #Append the token to the list if it has one of the part-of-speech tags defined above.

    #Normalizing the weights of the key words
    freq_word = Counter(keyword) #Counter converts the list into a dictionary of keyword frequencies.
    if not freq_word:
        return '' #No keywords found, so there is nothing to rank.
    max_freq = freq_word.most_common(1)[0][1] #Get the frequency of the most common keyword.
    for w in freq_word:
        freq_word[w] = (freq_word[w]/max_freq) #Normalize each frequency by the maximum frequency.

    #Calculate the importance of each sentence by summing the normalized weights of the keywords it contains.
    #sent_strength uses the sentence as key and the summed keyword weight as value.
    sent_strength = {}
    for sent in doc.sents:
        for word in sent:
            if word.text in freq_word:
                sent_strength[sent] = sent_strength.get(sent, 0) + freq_word[word.text]

    #Sort the sentences by strength and keep the top `limit` sentences.
    sorted_x = sorted(sent_strength.items(), key=lambda kv: kv[1], reverse=True)
    summary = [str(sent).capitalize() for sent, _ in sorted_x[:limit]]

    return ' '.join(summary)

In [7]:
#I'm selecting the top 2 sentences to generate my summary
summary = top_sentence(text, 2)
summary

"It could be argued that visitation restrictions, in light of a potential outbreak of a contagious disease, are ethically sound because of the compelling need to protect public health.however, even when public health concerns trump individual liberties, the ethical operationalization of this value would demand that 'those whose rights are being infringed' need to be managed in 'an ethical and even-handed manner so that they are not unfairly or disproportionately harmed by such measures' [1].this is an important and far-reaching consideration because sars caused collateral damage and we know that the implementation of visitation restrictions will have an impact on a broad range of individuals. When public health trumps civil liberties: the collateral damage associated with victims of sars\n\nour infectious disease colleagues are adamant that restricting the movement of people into and around the hospital setting are effective clinical and epidemiological strategies that will help protec

In [8]:
#Evaluation
#Read reference summary
ref_summary = '''Our infectious disease colleagues are adamant that restricting the movement of people into and around the hospital setting are effective clinical and epidemiological strategies that will help protect both the vulnerable patient population and health care providers themselves, 
who need to stay healthy so that they may care for their patients. In a health care institution, visitation restrictions not only affect inpatients but also have an impact on ambulatory patients who must come for diagnostic tests or interventions and who, if deprived access, might develop urgent or emergent conditions.
Feedback should be sought from those individuals who would be affected by visitation restrictions, such as staff, patients and family members.Health care workers, being in direct communication with patients and families, bear the brunt of their anger and frustration regarding any restriction in visitation.
If a family is allowed to visit a patient whose death is presumed to be imminent, then the patient's identity should be protected by using privacy strategies.
'''

#The abstract (reference summary) is the target: the ground-truth text, i.e. the gold standard.
#The generated summary is the prediction: the text being evaluated.


from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1','rouge2'], use_stemmer=True)
scores = scorer.score(ref_summary,summary)
                      
scores

{'rouge1': Score(precision=0.5057471264367817, recall=0.5432098765432098, fmeasure=0.5238095238095238),
 'rouge2': Score(precision=0.32947976878612717, recall=0.35403726708074534, fmeasure=0.3413173652694611)}
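To make the precision, recall and F-measure values above easier to interpret, here is a minimal sketch of ROUGE-1 as clipped unigram overlap. It is an approximation for illustration, not the `rouge_score` implementation: it tokenizes on whitespace only, while `rouge_score` also strips punctuation and (with `use_stemmer=True`) stems the tokens. The function name `rouge1_sketch` and the two example sentences are hypothetical.

```python
from collections import Counter

def rouge1_sketch(target, prediction):
    # Lowercased whitespace tokens (assumption: no stemming or punctuation handling).
    ref = Counter(target.lower().split())
    pred = Counter(prediction.lower().split())
    # Clipped overlap: each unigram counts at most as often as it appears in both texts.
    overlap = sum((ref & pred).values())
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure

p, r, f = rouge1_sketch("the cat sat on the mat", "the cat lay on a mat")
print(p, r, f)
```

Precision is the share of predicted unigrams that also occur in the reference, and recall is the share of reference unigrams recovered by the prediction. ROUGE-2 applies the same idea to pairs of consecutive tokens.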

In [9]:
# length of full text and word count
print("Word count of full text is:",len(text.split()))
# length of summarized text and word count
print("Word count of reference summarized text is:",len(ref_summary.split()))
print("Word count of summarized text is:",len(summary.split()))

Word count of full text is: 1801
Word count of reference summarized text is: 160
Word count of summarized text is: 170


### Generating Keywords function
I’ll be writing the keyword extraction code inside a function. It’s a lot more convenient and I can easily call it whenever I need to extract keywords from a big chunk of text. It accepts a string as an input parameter.
Source: https://medium.com/better-programming/extract-keywords-using-spacy-in-python-4a8415478fbf

In [10]:
#Modifying the initial part of the get top sentence, to generate only the keywords
def get_keywords(text):
    result = []
    pos_tag = ['PROPN', 'ADJ', 'NOUN']
    doc = nlp(text.lower())
    for token in doc:
        
        if(token.text in nlp.Defaults.stop_words or token.text in punctuation):
            continue
        
        if(token.pos_ in pos_tag):
            result.append(token.text)
                
    return result 

In [11]:
keywords = get_keywords(text)
#Generate the keywords
print(keywords)

['introduction', 'sudden', 'emergence', 'severe', 'acute', 'respiratory', 'syndrome', 'sars', 'april', 'concern', 'reaction', 'refereed', 'medical', 'journals', 'rife', 'articles', 'sars', 'eventual', 'containment', 'treatment', 'sars', 'diminution', 'massive', 'media', 'publicity', 'overt', 'public', 'concern', 'fears', 'potential', 're-emergence', 'sars', 'near', 'future', 'potential', 'need', 'stringent', 'infection', 'control', 'measures', 'appropriate', 'time', 'ethical', 'values', 'strict', 'visitation', 'restrictions', 'hospitals', 'ontario', 'sars', 'outbreak', 'moderate', 'restrictions', 'place', 'sars', 'reflection', 'future', 'decision', 'respect', 'visitation', 'restrictions', 'public', 'health', 'trumps', 'civil', 'liberties', 'collateral', 'damage', 'victims', 'sars', 'infectious', 'disease', 'colleagues', 'adamant', 'movement', 'people', 'hospital', 'effective', 'clinical', 'epidemiological', 'strategies', 'vulnerable', 'patient', 'population', 'health', 'care', 'provide

In [12]:
print(len(keywords))
#I can see a lot of duplication

704


In [13]:
#Remove duplicate items
#using the set function deletes the duplicates
keywd = set(get_keywords(text))
print(keywd)

{'brunt', 'messages', 'contaminated', 'liberal', 'application', 'resources', 'health.however', 'transparency', 'expressions', 'quarantine', 'needs', 'level', 'numbers', 'ideal', 'institution', 'direct', 'near', 'code', 'challenge', 'travel', 'appeal', 'pay', 'anger', 'argument', 'times', 'consideration', 'rules', 'stigmatization', 'global', 'ambulatory', 'private', 'freedom', 'healthy', 'sensitive', 'primary', 'workers', 'exceptional', 'reflection', 'status', 'vulnerable', 'deviation', 'trumps', 'contagious', 'bias', 'ethical', 'differences', 'distress', 'symptoms', 'epidemiological', 'identity', 'political', 'respect', 'special', 'facility', 'media', 'precautions', 'accessible', 'decisions', 'notification', 'awareness', 'situation', 'problems', 'total', 'inability', 'standardized', 'strict', 'damage', 'inconvenience', 'values', 'macro', 'people', 'notice', 'moderate', 'even-handed', 'emergence', 'conclusion', 'communication', 'clinical', 'organization', 'university', 'recent', 'manner

In [14]:
#This time I've got 363 unique keywords instead of the 704 keywords with duplicates
print(len(keywd))

363
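One trade-off of `set` is that it discards both the order and the frequency of the keywords. If a ranked list is more useful, `collections.Counter` keeps the counts, and `dict.fromkeys` removes duplicates while preserving first-seen order. The list below is a hypothetical stand-in for the output of `get_keywords(text)`.

```python
from collections import Counter

# Hypothetical keyword list standing in for get_keywords(text).
sample_keywords = ["sars", "public", "health", "sars", "restrictions", "public", "sars"]

# Count how often each keyword occurs and take the two most frequent.
counts = Counter(sample_keywords)
top_two = counts.most_common(2)

# Remove duplicates while keeping the order of first appearance.
unique_in_order = list(dict.fromkeys(sample_keywords))

print(top_two)
print(unique_in_order)
```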


In [15]:
#Generate hashtags from keywords
hashtags = [('#' + x) for x in keywd]
print(' '.join(hashtags))

#brunt #messages #contaminated #liberal #application #resources #health.however #transparency #expressions #quarantine #needs #level #numbers #ideal #institution #direct #near #code #challenge #travel #appeal #pay #anger #argument #times #consideration #rules #stigmatization #global #ambulatory #private #freedom #healthy #sensitive #primary #workers #exceptional #reflection #status #vulnerable #deviation #trumps #contagious #bias #ethical #differences #distress #symptoms #epidemiological #identity #political #respect #special #facility #media #precautions #accessible #decisions #notification #awareness #situation #problems #total #inability #standardized #strict #damage #inconvenience #values #macro #people #notice #moderate #even-handed #emergence #conclusion #communication #clinical #organization #university #recent #manner #member #decision #work #period #set #movement #corporate #policies #areas #overt #families #colleagues #author #sensitivity #far-reaching #impact #latitude #chan

### Named Entity Recognition

In [16]:
summary

"It could be argued that visitation restrictions, in light of a potential outbreak of a contagious disease, are ethically sound because of the compelling need to protect public health.however, even when public health concerns trump individual liberties, the ethical operationalization of this value would demand that 'those whose rights are being infringed' need to be managed in 'an ethical and even-handed manner so that they are not unfairly or disproportionately harmed by such measures' [1].this is an important and far-reaching consideration because sars caused collateral damage and we know that the implementation of visitation restrictions will have an impact on a broad range of individuals. When public health trumps civil liberties: the collateral damage associated with victims of sars\n\nour infectious disease colleagues are adamant that restricting the movement of people into and around the hospital setting are effective clinical and epidemiological strategies that will help protec

In [17]:
# Process the text
doc = nlp(summary)
# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)
    

visitation ENTITY
restrictions ENTITY
outbreak ENTITY
contagious disease ENTITY
public ENTITY
public health ENTITY
individual ENTITY
ethical operationalization ENTITY
rights ENTITY
ethical ENTITY
even-handed ENTITY
measures ENTITY
sars ENTITY
collateral damage ENTITY
implementation ENTITY
visitation ENTITY
restrictions ENTITY
impact ENTITY
individuals ENTITY
public health trumps ENTITY
civil liberties ENTITY
collateral damage ENTITY
associated with ENTITY
victims ENTITY
sars ENTITY
infectious disease ENTITY
adamant ENTITY
movement ENTITY
people ENTITY
hospital setting ENTITY
effective ENTITY
clinical ENTITY
epidemiological strategies ENTITY
vulnerable ENTITY
patient ENTITY
health care providers ENTITY
healthy ENTITY
care ENTITY
patients ENTITY


In [18]:
#Read a single text: the summary


#Load specific model and pass text through
nlp = en_ner_jnlpba_md.load()
doc = nlp(summary)

#Display resulting entity extraction
displacy_image = displacy.render(doc, jupyter=True,style='ent')



In [22]:
#Load all four NER models
nlp_cr = en_ner_craft_md.load()
nlp_bc = en_ner_bc5cdr_md.load()
nlp_bi = en_ner_bionlp13cg_md.load()
nlp_jn = en_ner_jnlpba_md.load()

# Process the text with each model and display the entities it finds
for ner_model in (nlp_cr, nlp_bc, nlp_bi, nlp_jn):
    doc = ner_model(summary)
    displacy.render(doc, jupyter=True, style='ent')

### Reading Time

In [None]:
# Get Total Word Counts with Tokenization
docx1 = nlp(text)

In [None]:
# Tokens
mytokens = [ token.text for token in docx1 ]
# Total Number or Length of Words
len(mytokens)

In [None]:
# Reading Time
def readingTime(docs):
    total_words_tokens = [token.text for token in nlp(docs)]
    estimatedtime = len(total_words_tokens)/265 #265 is the assumed reading speed in tokens per minute
    return '{} mins'.format(round(estimatedtime))

In [None]:
readingTime(text)

In [None]:
# length of full text and word count
print("Word count of full text is:",len(text.split()))
# length of summarized text and word count
print("Word count of summarized text is:",len(summary.split()))
#Reading time of full text
print('Reading time of full text:',readingTime(text))
#Reading time of summary
print('Reading time of summary:',readingTime(summary))
print('The Rouge Metrics are:\n')
scores

This model has a lower score than the gensim model; however, it's extremely flexible and I can customize it.
A pipeline can be used to preprocess the text and to post-process it after summary generation, for example to display named entity recognition and keywords. Another benefit of using SciSpacy is that it has acronym and abbreviation recognition.