# Auto-summarizing Text


## Setup

In [19]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [0]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from string import punctuation
from collections import defaultdict
from heapq import nlargest

## Get a text sample

Fetch an article and clean the text so it's ready for analysis.

In [5]:
url = "https://www.washingtonpost.com/health/2020/05/28/children-with-perplexing-syndrome-linked-covid-19-may-be-experiencing-deadly-cytokine-storm/"
page = urlopen(url).read().decode('utf8', 'ignore')
soup = BeautifulSoup(page, "lxml")  # BeautifulSoup parses the HTML into a tree structure.
text = ' '.join(map(lambda p: p.text, soup.find_all('article')))  # Join the text of every <article> element.
# Replace each non-ASCII character with "?" and then drop every "?".
# Note: this also removes genuine question marks from the article.
parsed_text = text.encode('ascii', errors='replace').decode("utf-8").replace("?", "")

print(parsed_text)
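
The cleanup step above turns every non-ASCII character into `?` and then deletes all question marks, so curly apostrophes and dashes vanish along with any real `?`. A minimal sketch of a gentler alternative follows, assuming it is acceptable to map a few typographic characters to ASCII first. The helper `to_ascii` and the sample string are hypothetical and not part of the notebook.

```python
def to_ascii(text):
    # Map common typographic characters to ASCII equivalents first,
    # so a curly apostrophe becomes "'" instead of disappearing.
    replacements = {
        "\u2018": "'", "\u2019": "'",   # curly single quotes
        "\u201c": '"', "\u201d": '"',   # curly double quotes
        "\u2013": "-", "\u2014": "-",   # en and em dashes
    }
    for src, dst in replacements.items():
        text = text.replace(src, dst)
    # Drop any remaining non-ASCII characters; real "?" are left alone.
    return text.encode("ascii", errors="ignore").decode("ascii")

# Hypothetical sample containing a curly apostrophe, a dash and an accent.
sample = "She didn\u2019t have a rash \u2014 why? Caf\u00e9."

# Same cleanup as in the notebook cell, for comparison.
original_style = sample.encode("ascii", errors="replace").decode("utf-8").replace("?", "")

print(original_style)
print(to_ascii(sample))
```

With this sample, the notebook-style cleanup loses the apostrophe and the question mark, while the sketch keeps both.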




## Summarize text

**Approach**:
1. Find the most important words: authors tend to repeat the words that matter most, so a word's frequency serves as its importance. The higher the frequency, the higher the importance. Stop words must be removed first, because they are repeated the most and would otherwise skew the summarizing algorithm.
2. Compute a significance score for each sentence from the words it contains: the score is the sum of the frequencies of its important words.
3. Pick the most significant sentences.
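
The three steps can be sketched end to end on a toy text using only the standard library. This is a simplified illustration, not the notebook's implementation: the regular-expression tokenizer, the tiny `TOY_STOPWORDS` set and the function `toy_summarize` are all assumptions made for the example, whereas the notebook relies on NLTK.

```python
import re
from collections import Counter
from heapq import nlargest

# Hypothetical, deliberately tiny stop-word list (the notebook uses NLTK's).
TOY_STOPWORDS = {"the", "a", "is", "and", "of", "to", "in", "it", "on"}

def toy_summarize(text, n=1):
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Step 1: count the non-stop words across the whole text.
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in TOY_STOPWORDS]
    freq = Counter(words)
    # Step 2: score each sentence as the sum of its words' frequencies
    # (stop words are absent from freq, so they contribute 0).
    scores = {
        i: sum(freq[w] for w in re.findall(r"[a-z']+", s.lower()))
        for i, s in enumerate(sentences)
    }
    # Step 3: keep the n best sentences, in their original order.
    best = sorted(nlargest(n, scores, key=scores.get))
    return [sentences[i] for i in best]

toy_text = "Cats sleep a lot. Cats chase mice and cats climb trees. Dogs bark."
toy_summary = toy_summarize(toy_text, n=1)
print(toy_summary)
```

Here "cats" occurs three times, so the sentence that mentions it twice gets the highest score and becomes the one-sentence summary.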

### Step 1: find the most important words

Tokenize the article into a list of lowercase words.

In [24]:
words = word_tokenize(parsed_text.lower())
print ("Number of words retrieved: " + str(len(words)))

Number of words retrieved: 1445


Build a set of stop words (nltk.corpus) and punctuation characters (string) to ignore during analysis.

In [21]:
english_stopwords = set(stopwords.words('english') + list(punctuation))
print (english_stopwords)

{"wouldn't", 'm', 'myself', 'when', 'through', 'by', 'which', 'll', 't', 'few', 'needn', 'more', 'yourselves', 'these', 've', '.', 'of', 'up', 'theirs', "doesn't", 'were', 'on', 'can', '#', 'from', "you've", 'out', 'in', "aren't", '$', 'didn', "weren't", "mustn't", ']', '!', 'until', 'is', 'for', 'my', 'have', 'o', 'ma', 're', '&', 'too', 'about', 'those', '~', 'while', "you're", 'themselves', 'shouldn', "don't", 'she', 'both', 'y', '"', 'his', 'mustn', 'into', 'once', 'am', 'will', '*', 'aren', 'same', '=', 'i', 'than', 'been', 'very', 'haven', 'ours', "haven't", 'your', "won't", 'does', '%', 'being', '|', 'had', 'a', 'some', ',', 'that', 'how', 'are', 'ain', "didn't", 'ourselves', 'against', 'before', 'me', 'yours', "you'll", 'hasn', 'but', ';', 'down', 'isn', "couldn't", 'him', 'further', 'herself', ':', 'at', "needn't", 'won', 'each', "shouldn't", 'has', '_', 'mightn', 'not', 'you', 'yourself', 'during', ')', 'no', 'below', '<', 'don', '[', "mightn't", 'only', "hasn't", 'now', 'its

Filter out the stop words and punctuation.

In [25]:
filtered_words = [word for word in words if word not in english_stopwords]
print ("Number of words kept after filtering: " + str(len(filtered_words)))

Number of words kept after filtering: 732
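
One caveat with this filter: `punctuation` holds single characters only, so multi-character punctuation tokens are not matched by it. The sketch below uses a hand-written token list (an assumption for illustration, not output captured from `word_tokenize`) and adds an optional second pass that keeps only tokens containing at least one letter or digit. The names `toy_stopwords`, `kept_basic` and `kept_strict` are hypothetical.

```python
from string import punctuation

# Hypothetical tokens, written by hand to resemble tokenizer output.
tokens = ["``", "cytokine", "storm", "''", ",", "doctors", "...", "said", "."]

# Single punctuation characters only, as in the notebook's filter.
toy_stopwords = set(punctuation)

# First pass: the same membership test as the notebook cell.
kept_basic = [t for t in tokens if t not in toy_stopwords]

# Optional second pass: drop tokens with no alphanumeric character.
kept_strict = [t for t in kept_basic if any(ch.isalnum() for ch in t)]

print(kept_basic)
print(kept_strict)
```

Whether the second pass is worth adding depends on how much such residue the tokenizer produces for a given article.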


### Step 2: Compute significance score

Compute the frequency of each word using the FreqDist class (nltk.probability).
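
`FreqDist` is a subclass of `collections.Counter`, so the membership tests and lookups used in the scoring step behave like those of a counter. The sketch below uses `Counter` directly with a hypothetical word list, to show what those lookups return without needing NLTK.

```python
from collections import Counter

# Hypothetical filtered words, standing in for filtered_words.
toy_words = ["children", "heart", "children", "doctors", "heart", "children"]
toy_frequency = Counter(toy_words)

print(toy_frequency["children"])      # count of a word that occurs
print(toy_frequency["ventilator"])    # a missing word counts as 0
print("heart" in toy_frequency)       # membership test on the keys
print(toy_frequency.most_common(2))   # highest counts first
```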


In [32]:
words_frequency = FreqDist(filtered_words)
print(type(words_frequency))

<class 'nltk.probability.FreqDist'>


Split the article into sentences:

In [15]:
sentences = sent_tokenize(parsed_text)
print(sentences)



Rank each sentence by adding up the frequencies of the words it contains.

In [48]:
ranking = defaultdict(int)
for i, sentence in enumerate(sentences):
  for w in word_tokenize(sentence.lower()):
    if w in words_frequency:
      ranking[i] += words_frequency[w]
print(ranking)

defaultdict(<class 'int'>, {0: 30, 1: 87, 2: 33, 3: 128, 4: 118, 5: 119, 6: 37, 7: 44, 8: 32, 9: 39, 10: 13, 11: 26, 12: 10, 13: 51, 14: 7, 15: 50, 16: 21, 17: 46, 18: 35, 19: 3, 20: 21, 21: 23, 22: 61, 23: 14, 24: 16, 25: 47, 26: 30, 27: 21, 28: 33, 29: 77, 30: 12, 31: 17, 32: 21, 33: 33, 34: 9, 35: 120, 36: 25, 37: 64, 38: 85, 39: 98, 40: 22})
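
Because the score is a plain sum, a sentence with more counted words tends to score higher regardless of how important each word is. The sketch below reproduces the summing loop on hypothetical data and, purely as an optional variation that the notebook does not use, also computes an average score per counted word for comparison.

```python
from collections import Counter, defaultdict

# Hypothetical word frequencies and pre-tokenized sentences.
toy_frequency = Counter({"children": 3, "heart": 2, "doctors": 1})
toy_sentences = [
    ["children", "heart"],
    ["children", "doctors", "said", "children", "heart", "doctors"],
]

raw = defaultdict(int)      # summed frequencies, as in the notebook
counted = defaultdict(int)  # number of words that contributed
for i, tokens in enumerate(toy_sentences):
    for w in tokens:
        if w in toy_frequency:
            raw[i] += toy_frequency[w]
            counted[i] += 1

# Variation (not in the notebook): average frequency per counted word.
average = {i: raw[i] / counted[i] for i in raw}

print(dict(raw))
print(average)
```

On this toy data the raw sum prefers the longer sentence, while the average prefers the shorter one. Which behavior is preferable depends on the kind of summary wanted.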


### Step 3: Pick the most significant sentences

Select the three highest-ranked sentences using the nlargest function (heapq), then sort their indices so they appear in their original order.
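
As a small illustration of how this selection works, the sketch below applies `nlargest` to the first five entries of the ranking printed above. The name `toy_ranking` is introduced only for this example.

```python
from heapq import nlargest

# First five entries of the ranking output shown earlier.
toy_ranking = {0: 30, 1: 87, 2: 33, 3: 128, 4: 118}

# Iterating a dict yields its keys; key=toy_ranking.get compares them by score.
top = nlargest(3, toy_ranking, key=toy_ranking.get)

print(top)          # sentence indices, best score first
print(sorted(top))  # the same indices in document order
```

Sorting the selected indices keeps the summary readable, since the sentences then appear in the order the article presents them.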

In [51]:
top_ranking = nlargest(3, ranking, key=ranking.get)
top_sentences = [sentences[i] for i in sorted(top_ranking)]

print("Text summary:")
for s in top_sentences:
  print(s)

Text summary:
Each complained of different symptoms, but blood tests, imaging and heart monitoring showed they all appeared to be having an exaggerated inflammatory reaction in what doctors suspect is post-viral complication of covid-19.Christopher Strother, the director of emergency medicine at Mount Sinai, described it as the pediatric version of the cytokine storm occurring in some adults with severe illness from the novel coronavirus.ADSign up for our Coronavirus Updates newsletter to track the outbreak.
She didnt have a rash, but after running tests, doctors discovered that her heart was beating super fast  a problem because it means the heart can have trouble filling with blood because it is contracting too rapidly  and her temperature climbed to almost 103 degrees.ADADAll four children have recovered, Strother said, and are back home now.Malcolm lost 20 pounds in during his 11-day stay the hospital, but did so well after he was off the ventilator that he walked out of the hospit