# Text summarization using NLTK and web-scraping

This study demonstrates how to obtain an article from a web page and apply a word-frequency-based text summarization method using Python, NLTK, and web scraping.

## Obtaining Data: Web-scraping

In [0]:
# Importing libraries
from bs4 import BeautifulSoup
import requests
import re

In [0]:
# Accessing the website
url = 'https://www.ecdc.europa.eu/en/publications-data/rapid-risk-assessment-novel-coronavirus-disease-2019-covid-19-pandemic-increased'
headers = {'User-Agent': 'Mozilla/5.0'}
# Pass the headers by keyword: the second positional argument of requests.get is `params`
data = requests.get(url, headers=headers)


In [3]:
# Checking the status
data.status_code

200

In [4]:
# Visualization of the content of the data
data.content

b'<!DOCTYPE html>\n<html  lang="en" dir="ltr" prefix="content: http://purl.org/rss/1.0/modules/content/  dc: http://purl.org/dc/terms/  foaf: http://xmlns.com/foaf/0.1/  og: http://ogp.me/ns#  rdfs: http://www.w3.org/2000/01/rdf-schema#  schema: http://schema.org/  sioc: http://rdfs.org/sioc/ns#  sioct: http://rdfs.org/sioc/types#  skos: http://www.w3.org/2004/02/skos/core#  xsd: http://www.w3.org/2001/XMLSchema# ">\n  <head>\n    <meta charset="utf-8" />\n<meta name="twitter:card" content="summary" />\n<meta property="og:site_name" content="European Centre for Disease Prevention and Control" />\n<meta name="description" content="Since ECDC\xe2\x80\x99s fifth update on novel coronavirus published on 2 March 2020 and as of 11 March, the number of cases and deaths reported in the EU/EEA and the UK has been rising, mirroring the trends seen in China in January-early February and in northern Italy in late February. If this trend continues, based on the quick pace of growth of the epidemic 

In [5]:
# Parsing the data
soup = BeautifulSoup(data.content, 'html.parser')
print(soup.prettify())

<!DOCTYPE html>
<html dir="ltr" lang="en" prefix="content: http://purl.org/rss/1.0/modules/content/  dc: http://purl.org/dc/terms/  foaf: http://xmlns.com/foaf/0.1/  og: http://ogp.me/ns#  rdfs: http://www.w3.org/2000/01/rdf-schema#  schema: http://schema.org/  sioc: http://rdfs.org/sioc/ns#  sioct: http://rdfs.org/sioc/types#  skos: http://www.w3.org/2004/02/skos/core#  xsd: http://www.w3.org/2001/XMLSchema# ">
 <head>
  <meta charset="utf-8"/>
  <meta content="summary" name="twitter:card"/>
  <meta content="European Centre for Disease Prevention and Control" property="og:site_name"/>
  <meta content="Since ECDC’s fifth update on novel coronavirus published on 2 March 2020 and as of 11 March, the number of cases and deaths reported in the EU/EEA and the UK has been rising, mirroring the trends seen in China in January-early February and in northern Italy in late February. If this trend continues, based on the quick pace of growth of the epidemic observed in China and northern Italy, i

In [6]:
# View the title of the article
soup.title.string

'Rapid risk assessment: Novel coronavirus disease 2019 (COVID-19) pandemic: increased transmission in the EU/EEA and the UK – sixth update'

In [7]:
# View the whole text on the web page
print(soup.get_text())

  window.cookieConsentKitBannerStyle = "defaultStyle";
  window.cookieConsentKitBannerPosition = "right";
  window.cookieConsentKitDomain = ".ecdc.europa.eu";
  window.cookieConsentKitDisplayPrompt = "true";
  window.cookieConsentKitSupportMultilingual = "false";
  window.cookieConsentKitEnableGoogleAnalytics = "false";
  window.googleAnalyticsTrackingId = "";
  window.cookieConsentKitEnableGoogleTagManager = "true";
  window.googleTagManagerContainerId = "GTM-WM8RHC6";



Rapid risk assessment: Novel coronavirus disease 2019 (COVID-19) pandemic: increased transmission in the EU/EEA and the UK – sixth update






      Skip to main content
    










Global Navigation

Other sites:

ECDC


European Antibiotic Awareness Day


ESCAIDE - Scientific conference


Eurosurveillance journal

                                    European Centre for Disease Prevention and Control
                                            

An agency o

In [8]:
# Obtain the text of the article from the whole page
article = soup.find('div', class_ = 'text-image')
text_ = article.get_text()
print(text_)


As of 11 March 2020, 118 598 cases of COVID-19 were reported worldwide by more than 100 countries. Since late February, the majority of cases reported are from outside China, with an increasing majority of these reported from EU/EEA countries and the UK.
The Director General of the World Health Organization declared COVID-19 a global pandemic on 11 March 2020.
All EU/EEA countries and the UK are affected, reporting a total of 17 413 cases as of 11 March. Seven hundred and eleven cases reported by EU/EEA countries and the UK have died. Italy represents 58% of the cases (n=10 149) and 88% of the fatalities (n=631). The current pace of the increase in cases in the EU/EEA and the UK mirrors trends seen in China in January-early February and trends seen in Italy in mid-February.
Need for immediate targeted action
In the current situation where COVID-19 is rapidly spreading worldwide and the number of cases in Europe is rising with increasing pace in several affected areas, there is a need 
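
The `re` module imported earlier is not used in the cells above. As a hedged sketch, it could be used to normalise whitespace in the extracted text before tokenization. The helper name `normalize_whitespace` and the sample string are illustrative assumptions, not part of the original notebook.

```python
import re

def normalize_whitespace(raw_text):
    """Collapse runs of whitespace and trim each line (hypothetical helper)."""
    lines = []
    for line in raw_text.splitlines():
        # Replace tabs and repeated spaces inside the line with a single space
        cleaned = re.sub(r"\s+", " ", line).strip()
        if cleaned:  # skip lines that are empty after trimming
            lines.append(cleaned)
    return "\n".join(lines)

# Made-up sample that mimics the leading newline and stray spaces seen above
sample = "\n\nAs of 11 March 2020,   cases were reported.\n\n  Need for immediate targeted action \n"
print(normalize_whitespace(sample))
```

In the notebook this could be called as `normalize_whitespace(text_)` before building the frequency table, if that kind of cleaning is wanted.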

## Text Summarization

In [9]:
# Import libraries
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Calculate the frequency of words
Define a function that builds a word-frequency table for any text. Note that the function is applied to the article text extracted above.

In [0]:
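# Create the frequency table
def frequency_table(text):
    # English stop words to exclude from the table
    stop_words = set(stopwords.words("english"))
    # word_tokenize also returns punctuation and numbers as tokens
    words = word_tokenize(text)
    ps = PorterStemmer()

    freq_table = dict()
    for word in words:
        # Reduce each token to its stem, e.g. 'cases' -> 'case'
        word = ps.stem(word)
        # The stop-word check runs on the stem, after stemming
        if word in stop_words:
            continue
        # Count the stem
        freq_table[word] = freq_table.get(word, 0) + 1

    return freq_table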

In [11]:
freq_table = frequency_table(text_)
freq_table

{'%': 3,
 '(': 10,
 ')': 10,
 ',': 49,
 '.': 39,
 '1': 1,
 '100': 1,
 '11': 3,
 '118': 1,
 '149': 1,
 '17': 1,
 '2': 1,
 '2020': 2,
 '3': 1,
 '4': 1,
 '413': 1,
 '58': 1,
 '598': 1,
 '80': 1,
 '88': 1,
 ':': 4,
 ';': 8,
 'A': 3,
 'As': 2,
 'If': 1,
 'In': 3,
 'UK': 8,
 'accept': 2,
 'accordingli': 1,
 'account': 1,
 'action': 3,
 'acut': 1,
 'adapt': 1,
 'addit': 2,
 'adult': 1,
 'aerosol-gener': 1,
 'affect': 2,
 'agent': 1,
 'ahead': 1,
 'aim': 2,
 'allow': 2,
 'among': 1,
 'anticip': 1,
 'appli': 1,
 'applic': 3,
 'approach': 5,
 'area': 2,
 'assess': 5,
 'associ': 1,
 'assum': 1,
 'asymptomat': 1,
 'avail': 1,
 'awar': 1,
 'base': 2,
 'bed': 1,
 'burden': 1,
 'cancel': 1,
 'capac': 8,
 'care': 4,
 'case': 14,
 'caus': 1,
 'chain': 1,
 'children': 2,
 'china': 3,
 'chronic': 2,
 'circumst': 1,
 'clinic': 2,
 'closur': 1,
 'cohort': 1,
 'collect': 1,
 'come': 2,
 'common': 1,
 'commun': 7,
 'comprehens': 1,
 'condit': 2,
 'confin': 1,
 'confirm': 1,
 'confirmatori': 1,
 'conserv': 1,
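
The table above counts punctuation marks and numbers, and it keeps short capitalised tokens such as 'A' and 'In'. A minimal sketch of one possible variant follows. It uses only the standard library, and the regular-expression tokenizer, the tiny `STOP_WORDS` set, and the function name `simple_frequency_table` are assumptions standing in for `word_tokenize`, the NLTK stop-word list, and `frequency_table`. It does no stemming.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set; NLTK's English list is much longer
STOP_WORDS = {"a", "an", "and", "are", "as", "in", "is", "of", "the", "to"}

def simple_frequency_table(text):
    """Count lower-cased alphabetic tokens that are not stop words."""
    # Keep runs of letters, optionally joined by hyphens; digits and punctuation are dropped
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return Counter(token for token in tokens if token not in STOP_WORDS)

# Made-up sample text
sample = "As of 11 March, cases are rising. The cases in the UK are rising."
table = simple_frequency_table(sample)
print(table)
```

Lower-casing before the stop-word check means that 'As' and 'The' are filtered in the same way as 'as' and 'the'.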

### Tokenize the sentences

In [12]:
# Split the text in a set of sentences
list_of_sentences = sent_tokenize(text_)
print(list_of_sentences)

['\nAs of 11 March 2020, 118 598 cases of COVID-19 were reported worldwide by more than 100 countries.', 'Since late February, the majority of cases reported are from outside China, with an increasing majority of these reported from EU/EEA countries and the UK.', 'The Director General of the World Health Organization declared COVID-19 a global pandemic on 11 March 2020.', 'All EU/EEA countries and the UK are affected, reporting a total of 17 413 cases as of 11 March.', 'Seven hundred and eleven cases reported by EU/EEA countries and the UK have died.', 'Italy represents 58% of the cases (n=10 149) and 88% of the fatalities (n=631).', 'The current pace of the increase in cases in the EU/EEA and the UK mirrors trends seen in China in January-early February and trends seen in Italy in mid-February.', 'Need for immediate targeted action\nIn the current situation where COVID-19 is rapidly spreading worldwide and the number of cases in Europe is rising with increasing pace in several affecte

### Apply Term Frequency
Score each sentence to obtain the ones with the highest scores in the whole text, while penalizing long sentences by dividing their scores by the number of words they contain.

In [0]:
# Create a sentence score table
def score_sentences(list_of_sentences, freq_table):
    sentence_score_table = dict()

    for sentence in list_of_sentences:
        number_of_words_in_sentence = len(word_tokenize(sentence))
        # Start at 0 so a sentence with no matching entry does not raise a KeyError below
        sentence_score_table[sentence] = 0
        lowered_sentence = sentence.lower()
        for word, count in freq_table.items():
            # Substring match of the table entry against the lower-cased sentence
            if word in lowered_sentence:
                sentence_score_table[sentence] += count

        # Penalize long sentences: integer-divide the score by the token count
        sentence_score_table[sentence] //= number_of_words_in_sentence

    return sentence_score_table

In [14]:
sentence_score_table = score_sentences(list_of_sentences, freq_table)
sentence_score_table

{'\nAs of 11 March 2020, 118 598 cases of COVID-19 were reported worldwide by more than 100 countries.': 7,
 'A high degree of population understanding, solidarity and discipline is required to apply strict personal hygiene, coughing etiquette, self-monitoring and social distancing measures.': 5,
 'A rapid shift from a containment to a mitigation approach is required, as the rapid increase in cases, that is anticipated in the coming days to few weeks may not provide decision makers and hospitals enough time to realise, accept and adapt their response accordingly if not implemented ahead of time.': 2,
 'A strategic approach based on early and rigorous application of these measures will help reduce the burden and pressure on the healthcare system, and in particular on hospitals, and will allow more time for the testing of therapeutics and vaccine development.': 3,
 'All EU/EEA countries and the UK are affected, reporting a total of 17 413 cases as of 11 March.': 6,
 'As the epidemic prog
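
`score_sentences` tests each table entry with a substring check against the lower-cased sentence, so a short entry can also match inside a longer, unrelated word. The sketch below shows one alternative under stated assumptions: it sums the frequencies of whole tokens and divides by the token count using true division. The name `score_sentences_by_token`, the regex tokenizer, and the toy frequency table are hypothetical, and no stemming is applied.

```python
import re

def score_sentences_by_token(sentences, freq_table):
    """Score = sum of token frequencies / number of tokens (0.0 if no tokens)."""
    scores = {}
    for sentence in sentences:
        # Simple alphabetic tokenizer, used here instead of word_tokenize
        tokens = re.findall(r"[a-z]+", sentence.lower())
        if not tokens:
            scores[sentence] = 0.0
            continue
        total = sum(freq_table.get(token, 0) for token in tokens)
        scores[sentence] = total / len(tokens)
    return scores

# Made-up frequency table and sentences
toy_table = {"care": 4, "case": 14}
toy_sentences = ["Each case needs care.", "Be careful."]
scores = score_sentences_by_token(toy_sentences, toy_table)
print(scores)
```

With a substring check, 'care' would also be counted for "Be careful."; with whole-token matching that sentence scores 0.0.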

### Identify the threshold
Compute the average of the scores in the sentence score table and use it as the basis for the threshold.

In [0]:
# Create the function to calculate average score of the sentences
def find_average_score(sentence_score_table):
    sum_of_scores = 0
    for score in sentence_score_table:
        sum_of_scores += sentence_score_table[score]

    # Average value of a sentence from original text
    average_score = int(sum_of_scores / len(sentence_score_table))

    return average_score

In [16]:
average_score = find_average_score(sentence_score_table)
average_score

4

### Generate the summary

In [0]:
def gen_summary(list_of_sentences, sentence_score_table, threshold):
    '''
    Build the summary from the sentences whose score exceeds the threshold.

    The threshold is chosen by the user, either as a multiple of the average
    sentence score (example: threshold = 1.4 * average_score) or as a fixed value.
    '''
    summary = ''

    for sentence in list_of_sentences:
        # Keep sentences in their original order; each one is prefixed with a space
        if sentence in sentence_score_table and sentence_score_table[sentence] > threshold:
            summary += " " + sentence

    return summary
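
To illustrate how the threshold shapes the summary, the sketch below counts how many sentences pass for several multiples of the average score. The scores are made-up values and `count_selected` is a hypothetical helper. It uses the same strict `>` comparison as `gen_summary`.

```python
def count_selected(scores, threshold):
    """Number of scores strictly greater than the threshold."""
    return sum(1 for score in scores if score > threshold)

toy_scores = [7, 5, 2, 3, 6, 1]                    # made-up sentence scores
average = int(sum(toy_scores) / len(toy_scores))   # 24 / 6 -> 4

for factor in (1.0, 1.4, 1.8):
    threshold = factor * average
    print(factor, count_selected(toy_scores, threshold))
```

A larger factor gives a shorter summary, and a factor that pushes the threshold above the highest score gives an empty one.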

### Application of the functions in one cell

In [18]:
# Feed the text
text = text_

# Create the word frequency table
freq_table = frequency_table(text)

# Tokenize the sentences: Split the text in a set of sentences
list_of_sentences = sent_tokenize(text)

# Create a sentence score table 
sentence_score_table = score_sentences(list_of_sentences, freq_table)

# Find the average score of the sentences
average_score = find_average_score(sentence_score_table)

# Generate the summary, using 1.4 times the average score as the threshold
summary = gen_summary(list_of_sentences, sentence_score_table, 1.4 * average_score)

print(summary)

 
As of 11 March 2020, 118 598 cases of COVID-19 were reported worldwide by more than 100 countries. All EU/EEA countries and the UK are affected, reporting a total of 17 413 cases as of 11 March. The risk of transmission of COVID-19 in health and social institutions with large vulnerable populations is considered high. Countries should identify healthcare units that can be designated to care for COVID-19 cases, to minimise transmission to non-cases and to conserve PPE. The highest priority for use of respirators (FFP2/3) are healthcare workers, in particular those performing aerosol-generating procedures, including swabbing. Testing approaches should prioritise vulnerable populations, protection of social and healthcare institutions, including staff. National surveillance systems should initially aim at rapidly detecting cases and assessing community transmission.
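
One simple way to check how much the summary condenses the article is to compare word counts. This is a minimal sketch, assuming that whitespace-separated words are an adequate approximation. The helper `compression_ratio` and the sample strings are illustrative and not part of the original notebook.

```python
def compression_ratio(original, summary):
    """Fraction of the original word count that the summary retains."""
    original_words = len(original.split())
    if original_words == 0:
        return 0.0  # avoid dividing by zero for empty input
    return len(summary.split()) / original_words

# Made-up sample strings
original_sample = "Cases are rising. Hospitals need capacity. Testing should prioritise staff."
summary_sample = "Hospitals need capacity."
ratio = compression_ratio(original_sample, summary_sample)
print(round(ratio, 2))
```

In the notebook the call would be `compression_ratio(text, summary)`.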
