## <font color="00cec9">Extraction-Based Summarizer
<font color="#00cec9"> Scraped Wikipedia articles using Beautiful Soup

In [1]:
import bs4 as bs
import urllib.request 
import re
import nltk
nltk.download('punkt') # for tokenizing
import sys

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


<font color="turquoise">Replace the given wikipedia page with the web page you want to summarize.

In [2]:
scraped_wiki = urllib.request.urlopen('https://en.wikipedia.org/wiki/Quantum_mechanics')
wiki = scraped_wiki.read()

In [3]:
# running the beautiful soup's lxml parser on the wiki data
parse_wiki = bs.BeautifulSoup(wiki, 'lxml')
# scrape all the paragraphs (p tags)
article_para = parse_wiki.find_all('p')

In [4]:
text = ""

In [5]:
for p in article_para:
  text += p.text

<font color="#00cec9">Cleaning on the Text data<br>
Note: You can modify/add/remove the given regular expressions according to the web page you're reading. The following regex were created keeping wikipedia pages in mind.

In [6]:
# removing [] and extra spaces
text = re.sub(r'\[[0-9]*\]', ' ', text)
# removing white spaces
text = re.sub(r'\s+', ' ', text)
# removing special characters and digits
new_text = re.sub('[^a-zA-Z]', ' ', text)
# removing white spaces
new_text = re.sub(r'\s+', ' ', new_text)

<font color="#00cec9">Convert paragraphs to sentences

In [7]:
# sent_tokenize function takes a body of text and splits it into sentences
sentences = nltk.sent_tokenize(text)

In [8]:
# downloading stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

<font color="00cec9">Loop to calculate the word frequencies.<br>
Tokenize the sentences<br>
if word is not a stopword and is in the word list, the count is added

In [9]:
stopwords = nltk.corpus.stopwords.words('english')

token_freq = {} #word frequencies
for token in nltk.word_tokenize(new_text):
  if token not in stopwords:
    if token not in token_freq.keys():
      token_freq[token] = 1
    else:
      token_freq[token] += 1

<font color="00cec9">Find weighted frequence of occurence

In [10]:
max_freq = max(token_freq.values())

for token in token_freq.keys():
  token_freq[token] = (token_freq[token]/max_freq)

<font color="00cec9">Replace words with weighted frequence in sentences

In [11]:
weight = {} # sentence score
for sent in sentences:
  for token in nltk.word_tokenize(sent.lower()):
    if token in token_freq.keys():
      if len(sent.split(' ')) < 30:
        if sent not in weight.keys():
          weight[sent] = token_freq[token]
        else:
          weight[sent] += token_freq[token]

<font color="00cec9">Heap Queue<br>
It makes it possible to view the data (words/scores) - our heap, as a regular Python list<br>
heapq.nlargest(n, iterable, key=None)

In [12]:
import heapq

In [13]:
extracted_sentences = heapq.nlargest(5, weight, key=weight.get)
# 5 is number of largest weights we want
# weight is the iterable object 
# weight.get returns the value corresponding to the key

In [14]:
summary = ' '.join(extracted_sentences)
summary

': 1.1 It is the foundation of all quantum physics including quantum chemistry, quantum field theory, quantum technology, and quantum information science. It has since permeated many disciplines, including quantum chemistry, quantum electronics, quantum optics, and quantum information science. Another popular theory is loop quantum gravity (LQG), which describes quantum properties of gravity and is thus a theory of quantum spacetime. The first complete quantum field theory, quantum electrodynamics, provides a fully quantum description of the electromagnetic interaction. Complications arise with chaotic systems, which do not have good quantum numbers, and quantum chaos studies the relationship between classical and quantum descriptions in these systems.'