##<font color="#fd79a8">Extraction-Based Summarizer <br><font color="#48dbfb">Scraped Wikipedia articles using Beautiful Soup </font>

Beautiful Soup: the library used to parse the HTML of the scraped Wikipedia article

In [None]:
import bs4 as bs
import urllib.request
import re
import nltk
nltk.download('punkt')
import sys

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [None]:
scraped_wiki = urllib.request.urlopen('https://en.wikipedia.org/wiki/State_of_Palestine')
wiki = scraped_wiki.read()

In [None]:
parse_wiki = bs.BeautifulSoup(wiki, 'lxml')
# the body text we are interested in lives in the <p> (paragraph) tags
article_para = parse_wiki.find_all('p')

In [None]:
text = ""

In [None]:
for p in article_para:
  text += p.text

In [None]:
text

'Coordinates   mw parser output  geo default  mw parser output  geo dms  mw parser output  geo dec display inline  mw parser output  geo nondefault  mw parser output  geo multi punct display none  mw parser output  longitude  mw parser output  latitude white space nowrap       N       E            N        E                   Palestine  Arabic          romanized  Filas  n   officially the State of Palestine a                Dawlat Filas  n   is a state located in Western Asia  Officially governed by the Palestine Liberation Organization  PLO   it claims the West Bank  including East Jerusalem  and the Gaza Strip as its territory  though the entirety of that territory has been occupied by Israel since the      Six Day War    As a result of the Oslo Accords of            the West Bank is currently divided into     Palestinian enclaves that are under partial Palestinian National Authority  PNA  rule  the remainder  including     Israeli settlements  is under full Israeli control  The Gaza

####<font color="#fd79a8"> Cleaning the text data

In [None]:
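# remove numeric citation markers such as [12]
text = re.sub(r'\[[0-9]*\]',' ',text)
# replace every whitespace character (newline, tab, ...) with a plain space
text = re.sub(r'\s',' ',text)
# replace everything that is not a letter with a space (this copy is used only for word counts)
new_text = re.sub('[^a-zA-Z]',' ', text)
# replace any remaining whitespace character with a plain space
new_text = re.sub(r'\s',' ',new_text)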
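
The effect of each substitution is easier to see on a small sample. The following is a minimal sketch using only the standard re module; the string `sample` is made up for illustration and is not taken from the article. It also shows that `\s` replaces whitespace one character at a time, whereas a `\s+` variant (not used in this notebook) would collapse a run of whitespace into a single space.

```python
import re

# hypothetical sample, not from the scraped article
sample = "Founded in 1988.[12]\n\nIt has  2 enclaves.[a]"

# numeric citation markers are replaced by a space; "[a]" is left alone
no_refs = re.sub(r'\[[0-9]*\]', ' ', sample)

# every whitespace character becomes one space, so "\n\n" becomes two spaces
one_to_one = re.sub(r'\s', ' ', no_refs)

# variant (an assumption, not the notebook's code): collapse whitespace runs
collapsed = re.sub(r'\s+', ' ', no_refs)

# keep letters only; digits and punctuation turn into spaces
letters_only = re.sub('[^a-zA-Z]', ' ', one_to_one)

print(repr(one_to_one))
print(repr(collapsed))
print(letters_only.split())
```

The letters-only copy loses the sentence punctuation, which is why the notebook keeps `text` for sentence splitting and uses `new_text` only for counting words.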



In [None]:
text


In [None]:
new_text

'Coordinates   mw parser output  geo default  mw parser output  geo dms  mw parser output  geo dec display inline  mw parser output  geo nondefault  mw parser output  geo multi punct display none  mw parser output  longitude  mw parser output  latitude white space nowrap       N       E            N        E                   Palestine  Arabic          romanized  Filas  n   officially the State of Palestine a                Dawlat Filas  n   is a state located in Western Asia  Officially governed by the Palestine Liberation Organization  PLO   it claims the West Bank  including East Jerusalem  and the Gaza Strip as its territory  though the entirety of that territory has been occupied by Israel since the      Six Day War    As a result of the Oslo Accords of            the West Bank is currently divided into     Palestinian enclaves that are under partial Palestinian National Authority  PNA  rule  the remainder  including     Israeli settlements  is under full Israeli control  The Gaza

##<font color="#fd79a8">Convert paragraphs to sentences

In [None]:
sentences = nltk.sent_tokenize(text)

In [None]:
sentences

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

###<font color="#fd79a8"> Loop to calculate the word frequencies <br>Tokenize the cleaned, letters-only text<br>If a word is not a stopword, its count is set to 1 the first time it is seen and incremented afterwards 

In [None]:
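stopwords = set(nltk.corpus.stopwords.words('english'))

# count every non-stopword token of the letters-only text
# (the comparison is case-sensitive and NLTK's English stopwords are
#  lower-case, so capitalized words such as "The" are still counted)
token_freq = {}
for token in nltk.word_tokenize(new_text):
  if token not in stopwords:
    token_freq[token] = token_freq.get(token, 0) + 1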
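
The loop above keeps tokens in their original case, while the sentences are lower-cased when they are scored later on, so capitalized keys are never matched at that stage. As a hedged alternative, the count can be written with collections.Counter on lower-cased text. To keep the sketch self-contained, a regex tokenizer and a tiny stopword set stand in for nltk.word_tokenize and NLTK's stopword list; `DEMO_STOPWORDS` and `count_tokens` are hypothetical names introduced only for this illustration.

```python
import re
from collections import Counter

# hypothetical stand-in for NLTK's English stopword list
DEMO_STOPWORDS = {"the", "of", "is", "a", "in"}

def count_tokens(cleaned_text, stopwords=DEMO_STOPWORDS):
    """Count lower-cased alphabetic tokens that are not stopwords."""
    tokens = re.findall(r"[a-z]+", cleaned_text.lower())
    return Counter(t for t in tokens if t not in stopwords)

demo = "The State of Palestine is a state in Western Asia"
freq = count_tokens(demo)
print(freq["state"])   # "State" and "state" are counted together
print("the" in freq)   # stopwords are dropped regardless of case
```

Switching the notebook to this variant would change the counts, and therefore the recorded outputs further down.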

###<font color="#48dbfb">Find the weighted frequency of occurrence 

In [None]:
max_freq = max(token_freq.values())

for token in token_freq.keys():
  token_freq[token] = (token_freq[token]/max_freq)

In [None]:
max_freq

91
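
To make the normalization concrete: each count is divided by the largest count, so the most frequent word gets weight 1.0 and every other word a value between 0 and 1. A minimal sketch with made-up counts (`demo_counts` is illustrative, not data from the article):

```python
# made-up counts, not taken from the article
demo_counts = {"state": 4, "territory": 2, "border": 1}

max_count = max(demo_counts.values())

# divide every count by the largest one
weighted = {word: count / max_count for word, count in demo_counts.items()}

print(weighted)  # {'state': 1.0, 'territory': 0.5, 'border': 0.25}
```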

###<font color="#48dbfb">Score each sentence by summing the weighted frequencies of its words

In [None]:
weight = {}
for sent in sentences:
  # only sentences with fewer than 30 space-separated words are scored
  if len(sent.split(' ')) < 30:
    for token in nltk.word_tokenize(sent.lower()):
      if token in token_freq:
        # add the weighted frequency of every known token to the sentence score
        weight[sent] = weight.get(sent, 0) + token_freq[token]

In [None]:
weight
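
The scoring rule can be sketched without NLTK. The sketch below assumes plain whitespace tokenization and made-up weights (`demo_weights`); the helper `score_sentences` and its `max_words` parameter are hypothetical names, with the default of 30 mirroring the cutoff used above.

```python
def score_sentences(sentences, token_weights, max_words=30):
    """Sum the weights of known tokens for every sufficiently short sentence."""
    scores = {}
    for sent in sentences:
        # whitespace split is a simplification of nltk.word_tokenize
        words = sent.lower().split()
        if len(words) >= max_words:
            continue  # long sentences are not considered for the summary
        total = sum(token_weights.get(w.strip('.,'), 0) for w in words)
        if total > 0:
            # sentences without any known token get no entry at all
            scores[sent] = total
    return scores

# made-up weights and sentences for illustration
demo_weights = {"state": 1.0, "territory": 0.5, "border": 0.25}
demo_sentences = [
    "The state claims the territory.",
    "The border is closed.",
    "Nothing relevant here.",
]
scores = score_sentences(demo_sentences, demo_weights)
print(scores)
```

Because the score is a plain sum, longer sentences tend to score higher, which is one reason for capping the sentence length.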

Next we decide how many of the highest-scoring sentences to extract. <br>
Those sentences are the elements we take out of the Wikipedia article. <br>
Joined back together, they form the summary.

####<font color="#fd79a8">Heap queue <br>The heapq module works directly on regular Python iterables, so our data (sentences and their scores) needs no special container<br><font color="#0abde3">heapq.nlargest(n, iterable, key=None) 

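n : number of highest-scoring items to return <br>
iterable : the data we go through (here the sentences stored in weight) <br>
key : function that returns the value each item is ranked by (here weight.get, the sentence score)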
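
A minimal sketch of the call on made-up scores (`demo_scores` is illustrative only). Iterating over a dict yields its keys, so passing the dict as the iterable and its `get` method as the key ranks the keys by their values:

```python
import heapq

# made-up sentence scores for illustration
demo_scores = {"sentence A": 0.4, "sentence B": 1.5, "sentence C": 0.9}

# the two keys with the largest values, best first
top_two = heapq.nlargest(2, demo_scores, key=demo_scores.get)

print(top_two)  # ['sentence B', 'sentence C']
```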

In [None]:
import heapq

In [None]:
# the 8 highest-scoring sentences
extracted_sentences = heapq.nlargest(8, weight, key = weight.get)


In [None]:
summary = " ".join(extracted_sentences)
summary

'Many of the countries that do not recognise the State of Palestine nevertheless recognise the PLO as the "representative of the Palestinian people". Specifically, the term "occupied Palestinian territory" refers as a whole to the geographical area of the Palestinian territory occupied by Israel since 1967. [citation needed] As of 31 July 2019, 138 (71.5%) of the 193 member states of the United Nations have recognised the State of Palestine. There are a wide variety of views regarding the status of the State of Palestine, both among the states of the international community and among legal scholars. Thus, the two enclaves constituting the area claimed by State of Palestine have no geographical border with one another, being separated by Israel. This article uses the terms "Palestine", "State of Palestine", "occupied Palestinian territory" (oPt or OPT) interchangeably depending on context. In all cases, any references to land or territory refer to land claimed by the State of Palestine.
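
heapq.nlargest returns the sentences from highest to lowest score, so the joined summary does not follow the order of the article. If reading order matters, one possible refinement (not part of the notebook above) is to keep the selected sentences in the order they appear in the original sentence list. `summarize_in_order` is a hypothetical helper name and the data is made up.

```python
import heapq

def summarize_in_order(sentences, scores, n):
    """Pick the n best-scoring sentences, then emit them in document order."""
    best = set(heapq.nlargest(n, scores, key=scores.get))
    return " ".join(s for s in sentences if s in best)

# made-up sentences and scores for illustration
demo_sentences = ["First point.", "Minor aside.", "Key result."]
demo_scores = {"First point.": 0.8, "Minor aside.": 0.1, "Key result.": 1.2}

print(summarize_in_order(demo_sentences, demo_scores, 2))
# First point. Key result.
```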