# Text summarization task

- remove citations (if any)
- text cleaning
- sentence tokenization
- word-frequency table
- sentence scoring
- summarization

In [1]:
text = """
The COVID-19 pandemic, also known as the coronavirus pandemic, is an ongoing pandemic of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). It was first identified in December 2019 in Wuhan, China. The World Health Organization declared the outbreak a Public Health Emergency of International Concern in January 2020 and a pandemic in March 2020. As of 23 January 2021, more than 98.2 million cases have been confirmed, with more than 2.1 million deaths attributed to COVID-19, across 190 countries worldwide.

Symptoms of COVID-19 are highly variable, ranging from none to severe illness. The virus spreads mainly through the air when people are near each other.[b] It leaves an infected person as they breathe, cough, sneeze, or speak and enters another person via their mouth, nose, or eyes. It may also spread via contaminated surfaces. People remain infectious for up to two weeks, and can spread the virus even if they do not show symptoms.[9]

Recommended preventive measures include social distancing, wearing face masks in public, ventilation and air-filtering, hand washing, covering one's mouth when sneezing or coughing, disinfecting surfaces, and monitoring and self-isolation for people exposed or symptomatic. Several vaccines are being developed and distributed. Current treatments focus on addressing symptoms while work is underway to develop therapeutic drugs that inhibit the virus. Authorities worldwide have responded by implementing travel restrictions, lockdowns, workplace hazard controls, and facility closures. Many places have also worked to increase testing capacity and trace contacts of the infected.

The responses to the pandemic have resulted in global social and economic disruption, including the largest global recession since the Great Depression.[10] It has led to the postponement or cancellation of events, widespread supply shortages exacerbated by panic buying, agricultural disruption and food shortages, and decreased emissions of pollutants and greenhouse gases. Many educational institutions have been partially or fully closed. Misinformation has circulated through social media and mass media. There have been incidents of xenophobia and discrimination against Chinese people and against those perceived as being Chinese or as being from areas with high infection rates.[11]

"""

In [2]:
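import re
# Strip bracketed citation markers such as [b] or [9].
# The pattern is greedy: on a line with several markers it removes everything
# from the first "[" to the last "]", which is why the output below loses text.
text = re.sub(r'\[+(.*)+\]', '', text)
print(text)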


The COVID-19 pandemic, also known as the coronavirus pandemic, is an ongoing pandemic of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). It was first identified in December 2019 in Wuhan, China. The World Health Organization declared the outbreak a Public Health Emergency of International Concern in January 2020 and a pandemic in March 2020. As of 23 January 2021, more than 98.2 million cases have been confirmed, with more than 2.1 million deaths attributed to COVID-19, across 190 countries worldwide.

Symptoms of COVID-19 are highly variable, ranging from none to severe illness. The virus spreads mainly through the air when people are near each other.

Recommended preventive measures include social distancing, wearing face masks in public, ventilation and air-filtering, hand washing, covering one's mouth when sneezing or coughing, disinfecting surfaces, and monitoring and self-isolation for people exposed or symptomatic. Sev
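
The pattern used in the cell above is greedy, so a line with more than one marker loses everything between the first `[` and the last `]`. A minimal sketch of a narrower alternative follows, assuming citation markers are short bracketed tokens such as `[b]` or `[9]`. The helper name `strip_citations` and the 1-3 character limit are assumptions for illustration, not part of the notebook.

```python
import re

# Assumption: a citation marker is "[" + 1 to 3 letters or digits + "]".
CITATION = re.compile(r"\[[0-9A-Za-z]{1,3}\]")

def strip_citations(raw: str) -> str:
    """Remove bracketed citation markers and leave the surrounding text intact."""
    return CITATION.sub("", raw)

sample = "near each other.[b] It may also spread via contaminated surfaces.[9]"
print(strip_citations(sample))
# near each other. It may also spread via contaminated surfaces.
```

With this variant the sentence between the two markers survives, so the later sentence count and summary would differ from the outputs recorded in this notebook.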

In [3]:
# !python -m spacy download en_core_web_sm

In [4]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

In [5]:
stopwords = list(STOP_WORDS)

In [6]:
nlp = spacy.load('en_core_web_sm')

In [7]:
doc = nlp(text)

In [8]:
tokens = [token.text for token in doc]
# print(tokens)

In [9]:
# add the newline character so single-newline tokens are filtered like punctuation
punctuation = punctuation + '\n'
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\n'
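
Because `punctuation` is a string, a check such as `token not in punctuation` is a substring test, not a per-character test. The sketch below shows the difference and a set-based alternative. The names `punct_string`, `punct_chars`, and `is_punct_token` are hypothetical and only serve this illustration.

```python
from string import punctuation

punct_string = punctuation + "\n"   # same construction as the cell above
punct_chars = set(punct_string)     # one entry per character

def is_punct_token(token: str) -> bool:
    """True when every character of a non-empty token is punctuation or a newline."""
    return bool(token) and all(ch in punct_chars for ch in token)

# Substring test on the string: a double newline is not a substring of it.
print("\n\n" in punct_string)      # False
# Character-wise test: the same token is recognised as noise.
print(is_punct_token("\n\n"))      # True
print(is_punct_token("COVID-19"))  # False
```

A token made of two newlines therefore passes the string-based filter and would be counted as a word, while the character-wise check rejects it.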

# text cleaning and word-frequency table

In [10]:
word_frequencies = {}
for word in doc:
    # skip stop words and punctuation; both checks use the lowercased token
    if word.text.lower() not in stopwords:
        if word.text.lower() not in punctuation:
            # keys keep the token's original casing
            if word.text not in word_frequencies:
                word_frequencies[word.text] = 1
            else:
                word_frequencies[word.text] += 1
#print(word_frequencies)

In [11]:
max_frequency = max(word_frequencies.values())
#max_frequency

In [12]:
# normalize each count by the highest count, so the most frequent word gets 1.0
for word in word_frequencies:
    word_frequencies[word] = word_frequencies[word] / max_frequency
#print(word_frequencies)
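
The counting and normalization steps can also be sketched without spaCy. The version below is a rough stand-in under stated assumptions: a regular expression replaces the spaCy tokenizer, a tiny hand-picked set replaces `STOP_WORDS`, and words are lowercased before counting. The function name `normalized_frequencies` is hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stop-word set stands in for spaCy's STOP_WORDS.
STOP = {"the", "is", "a", "of", "and", "it", "in", "to"}

# Assumption: words are runs of letters/digits, optionally joined by hyphens.
WORD = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

def normalized_frequencies(raw: str) -> dict:
    """Count lowercased words, drop stop words, and scale counts by the maximum."""
    words = [w for w in WORD.findall(raw.lower()) if w not in STOP]
    counts = Counter(words)
    if not counts:
        return {}
    top = max(counts.values())
    return {word: count / top for word, count in counts.items()}

freqs = normalized_frequencies(
    "The pandemic is ongoing. The pandemic spreads. The virus spreads."
)
print(freqs)
# {'pandemic': 1.0, 'ongoing': 0.5, 'spreads': 1.0, 'virus': 0.5}
```

Dividing by the maximum keeps every weight in the range (0, 1], which makes scores comparable across texts of different lengths.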

# sentence tokenization

In [13]:
sentence_tokens = list(doc.sents)
#print(sentence_tokens)

In [14]:
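sentence_scores = {}
for sent in sentence_tokens:
    for word in sent:
        # the lookup is lowercased while word_frequencies is keyed by the original
        # token text, so a capitalized token scores only if its lowercase form
        # also occurs in the text
        if word.text.lower() in word_frequencies:
            # a sentence's score is the sum of its words' normalized frequencies
            if sent not in sentence_scores:
                sentence_scores[sent] = word_frequencies[word.text.lower()]
            else:
                sentence_scores[sent] += word_frequencies[word.text.lower()]
#sentence_scores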
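
The scoring cell looks tokens up with `word.text.lower()`, while the frequency table is keyed by `word.text`. A minimal sketch of the same scoring idea with consistent lowercasing on both sides is shown below. It assumes plain strings as sentences and a regex tokenizer in place of spaCy. The function name `score_sentences` and the weights are hypothetical.

```python
import re

# Assumption: words are runs of letters/digits, optionally joined by hyphens.
WORD = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

def score_sentences(sentences, freqs):
    """Sum the frequency weight of every known word in each sentence."""
    scores = {}
    for sentence in sentences:
        words = WORD.findall(sentence.lower())
        total = sum(freqs.get(word, 0.0) for word in words)
        if total > 0:  # sentences without any known word get no entry
            scores[sentence] = total
    return scores

freqs = {"pandemic": 1.0, "spreads": 1.0, "virus": 0.5}  # hypothetical weights
sentences = ["The Pandemic spreads.", "The virus mutates.", "Hello there."]
scores = score_sentences(sentences, freqs)
print(scores)
# {'The Pandemic spreads.': 2.0, 'The virus mutates.': 0.5}
```

Because the score is a plain sum, longer sentences tend to rank higher. Dividing by the number of words in the sentence is one possible adjustment.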

In [15]:
from heapq import nlargest

In [16]:
# keep roughly 30% of the sentences; int() truncates toward zero
select_length = int(len(sentence_tokens) * 0.3)
select_length

3
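
Since `int()` truncates, a very short text can yield a selection length of 0 and therefore an empty summary. The sketch below adds a floor of one sentence. The helper name `select_count` and the sentence counts used in the example are hypothetical.

```python
def select_count(n_sentences: int, ratio: float = 0.3) -> int:
    """Sentences to keep: truncate like int(), but never return 0 for non-empty input."""
    if n_sentences <= 0:
        return 0
    return max(1, int(n_sentences * ratio))

print(select_count(12))  # 3
print(select_count(2))   # 1, where int(2 * 0.3) alone would give 0
```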

In [17]:
summary = nlargest(select_length, sentence_scores, key = sentence_scores.get)

In [18]:
summary

[
 The COVID-19 pandemic, also known as the coronavirus pandemic, is an ongoing pandemic of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).,
 Recommended preventive measures include social distancing, wearing face masks in public, ventilation and air-filtering, hand washing, covering one's mouth when sneezing or coughing, disinfecting surfaces, and monitoring and self-isolation for people exposed or symptomatic.,
 The responses to the pandemic have resulted in global social and economic disruption, including the largest global recession since the Great Depression.
 ]
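
`nlargest` returns the selected sentences ranked by score, which need not match their order in the text. The sketch below uses hypothetical scores to show one way of restoring reading order after selection.

```python
from heapq import nlargest

# Hypothetical sentence scores, inserted in document order.
scores = {
    "First sentence.": 0.4,
    "Second sentence.": 0.9,
    "Third sentence.": 0.1,
    "Fourth sentence.": 1.2,
}

# nlargest iterates over the dict keys and ranks them by descending score.
top = nlargest(2, scores, key=scores.get)
print(top)  # ['Fourth sentence.', 'Second sentence.']

# Re-sort the picks by their position in the original text.
position = {sentence: i for i, sentence in enumerate(scores)}
in_document_order = sorted(top, key=position.get)
print(in_document_order)  # ['Second sentence.', 'Fourth sentence.']
```

With spaCy spans, the span's `start` attribute could serve as the sort key in place of the `position` lookup.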

# combine sentences together

In [19]:
final_summary = [sent.text for sent in summary]
final_summary

['\nThe COVID-19 pandemic, also known as the coronavirus pandemic, is an ongoing pandemic of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2).',
 "Recommended preventive measures include social distancing, wearing face masks in public, ventilation and air-filtering, hand washing, covering one's mouth when sneezing or coughing, disinfecting surfaces, and monitoring and self-isolation for people exposed or symptomatic.",
 'The responses to the pandemic have resulted in global social and economic disruption, including the largest global recession since the Great Depression.\n\n']

In [20]:
summary = ' '.join(final_summary)

In [21]:
print(summary)


The COVID-19 pandemic, also known as the coronavirus pandemic, is an ongoing pandemic of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Recommended preventive measures include social distancing, wearing face masks in public, ventilation and air-filtering, hand washing, covering one's mouth when sneezing or coughing, disinfecting surfaces, and monitoring and self-isolation for people exposed or symptomatic. The responses to the pandemic have resulted in global social and economic disruption, including the largest global recession since the Great Depression.




# compare length of original text and summarized text

In [22]:
len(text)

1559

In [23]:
len(summary)

626
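
The two lengths above are character counts, and the first one is measured after citation removal. A short sketch of turning them into a compression ratio, using the values reported by the cells above:

```python
original_length = 1559  # len(text) reported above
summary_length = 626    # len(summary) reported above

ratio = summary_length / original_length
print(f"summary is {ratio:.0%} of the original")  # summary is 40% of the original
```

The ratio is higher than the 30% sentence target because the three selected sentences are among the longest in the text.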

In [24]:
# source: https://youtu.be/9PoKellNrBc