<a href="https://colab.research.google.com/github/SenSudi/DataSummarizer/blob/main/data_summarizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import bs4 as bs        # Beautiful Soup: parses HTML and XML for web scraping
import urllib.request   # opens URLs and fetches their contents
import re               # regular expressions
import nltk             # natural language processing toolkit
from nltk.tokenize import word_tokenize
nltk.download('maxent_ne_chunker')   # model used later by nltk.ne_chunk
scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Main_Page')     # fetch the Wikipedia main page
article = scraped_data.read()                                                        # read the whole response body as bytes
print(article)

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Wikipedia, the free encyclopedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"YEbc92OaaBj1z-ZFUgc8zwAAABI","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Main_Page","wgTitle":"Main Page","wgCurRevisionId":1004593520,"wgRevisionId":1004593520,"wgArticleId":15580374,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgPageContentLanguage":

In [2]:
nltk.download('words')     # word list required by the named-entity chunker
parsed_article = bs.BeautifulSoup(article, 'lxml')      # parse the HTML with the lxml parser
paragraphs = parsed_article.find_all('p')     # collect every <p> element
article_text = ""     # accumulates the text of all paragraphs

for p in paragraphs:
    article_text += p.text

# Pre-processing
# Remove numeric citation markers such as [1] and collapse runs of whitespace
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)
print(article_text)

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
Paper Mario: The Origami King is a cross-genre video game, developed by Intelligent Systems and published by Nintendo; it was released exclusively for the Nintendo Switch console in July 2020. The story follows Mario teaming up with his new ally Olivia to prevent the Mushroom Kingdom from being folded entirely into origami. The game is designed to look entirely like paper, with multiple open-world areas allowing for exploration and puzzle-solving. Turn-based combat is organized into circular rings, which can be rotated to line up enemies to deal more damage. The producer, Kensuke Tanabe (pictured), anticipating that he could not satisfy every fan, opted for entirely new gameplay and concepts compared to previous games in the series. The game received generally positive reviews, being praised for its graphics, writing and characters, and critiqued for the lack of other elements of role
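
To make the paragraph-extraction step easier to follow, here is a minimal sketch of the same idea using only the standard library's `html.parser` instead of Beautiful Soup. The class name `ParagraphCollector` and the sample HTML are assumptions made for illustration; this is not how Beautiful Soup is implemented. It also shows that joining paragraph texts with no separator lets the end of one paragraph run into the start of the next.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text inside each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.in_paragraph = True
            self.paragraphs.append('')   # start a new paragraph buffer

    def handle_endtag(self, tag):
        if tag == 'p':
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:            # ignore text outside <p> elements
            self.paragraphs[-1] += data


# Made-up sample markup, not taken from Wikipedia.
html = '<p>First <b>bold</b> text.[1]</p><div>menu</div><p>Second.</p>'
collector = ParagraphCollector()
collector.feed(html)
collector.close()

joined = ''.join(collector.paragraphs)   # same no-separator join as the cell above
print(collector.paragraphs)
print(joined)
```

Joining with `' '.join(...)` instead would keep a space between paragraphs, which may help the sentence tokenizer later on.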

In [3]:
# Replace every character that is not an ASCII letter (digits, punctuation) with a space
formatted_article_text = re.sub(r'[^a-zA-Z]', ' ', article_text)
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)
print(formatted_article_text)

Paper Mario The Origami King is a cross genre video game developed by Intelligent Systems and published by Nintendo it was released exclusively for the Nintendo Switch console in July The story follows Mario teaming up with his new ally Olivia to prevent the Mushroom Kingdom from being folded entirely into origami The game is designed to look entirely like paper with multiple open world areas allowing for exploration and puzzle solving Turn based combat is organized into circular rings which can be rotated to line up enemies to deal more damage The producer Kensuke Tanabe pictured anticipating that he could not satisfy every fan opted for entirely new gameplay and concepts compared to previous games in the series The game received generally positive reviews being praised for its graphics writing and characters and critiqued for the lack of other elements of role playing games such as experience points Reception on gameplay mainly the puzzle style combat was mixed Full article March Phi
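
The two clean-up passes are easier to check on a short string. The sketch below wraps the same regular expressions in a hypothetical helper, `clean_text`, and runs it on a made-up sample sentence; the added `.strip()` calls are an assumption of this sketch and are not in the cells above.

```python
import re


def clean_text(raw):
    """Return (readable, letters_only) versions of `raw` (hypothetical helper)."""
    # Drop numeric citation markers such as [1] or [23].
    readable = re.sub(r'\[[0-9]*\]', ' ', raw)
    # Collapse runs of whitespace (spaces, newlines, tabs) into one space.
    readable = re.sub(r'\s+', ' ', readable).strip()
    # Keep ASCII letters only; digits and punctuation become spaces.
    letters_only = re.sub(r'[^a-zA-Z]', ' ', readable)
    letters_only = re.sub(r'\s+', ' ', letters_only).strip()
    return readable, letters_only


sample = "Released in July 2020,[1] it is a cross-genre game.[12]\n\nIt sold well."
readable, letters_only = clean_text(sample)
print(readable)       # Released in July 2020, it is a cross-genre game. It sold well.
print(letters_only)   # Released in July it is a cross genre game It sold well
```

The readable version keeps punctuation, so it can still be split into sentences; the letters-only version loses sentence boundaries and hyphenated words such as "cross-genre" are split in two.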

In [4]:
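nltk.download('punkt')                         # tokenizer models used by word_tokenize
text = word_tokenize(article_text)             # split the text into word and punctuation tokens
nltk.download('averaged_perceptron_tagger')    # model used by nltk.pos_tag
tagged = nltk.pos_tag(text)                    # list of (token, part-of-speech tag) pairs
print(tagged)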

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[('Paper', 'NNP'), ('Mario', 'NNP'), (':', ':'), ('The', 'DT'), ('Origami', 'NNP'), ('King', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('cross-genre', 'JJ'), ('video', 'NN'), ('game', 'NN'), (',', ','), ('developed', 'VBN'), ('by', 'IN'), ('Intelligent', 'NNP'), ('Systems', 'NNPS'), ('and', 'CC'), ('published', 'VBN'), ('by', 'IN'), ('Nintendo', 'NNP'), (';', ':'), ('it', 'PRP'), ('was', 'VBD'), ('released', 'VBN'), ('exclusively', 'RB'), ('for', 'IN'), ('the', 'DT'), ('Nintendo', 'NNP'), ('Switch', 'NNP'), ('console', 'NN'), ('in', 'IN'), ('July', 'NNP'), ('2020', 'CD'), ('.', '.'), ('The', 'DT'), ('story', 'NN'), ('follows', 'VBZ'), ('Mario', 'NNP'), ('teaming', 'VBG'), ('up', 'RP'), ('with', 'IN'), ('his', 'PRP$'), ('new', 'JJ

In [5]:
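# A noun phrase (NP) here is an optional determiner, any number of adjectives,
# then one singular or mass noun
pattern = 'NP: {<DT>?<JJ>*<NN>}'

cp = nltk.RegexpParser(pattern)   # chunk parser built from the grammar above
cs = cp.parse(tagged)             # tree whose NP subtrees mark the matched chunks
print(cs)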

(S
  Paper/NNP
  Mario/NNP
  :/:
  The/DT
  Origami/NNP
  King/NNP
  is/VBZ
  (NP a/DT cross-genre/JJ video/NN)
  (NP game/NN)
  ,/,
  developed/VBN
  by/IN
  Intelligent/NNP
  Systems/NNPS
  and/CC
  published/VBN
  by/IN
  Nintendo/NNP
  ;/:
  it/PRP
  was/VBD
  released/VBN
  exclusively/RB
  for/IN
  the/DT
  Nintendo/NNP
  Switch/NNP
  (NP console/NN)
  in/IN
  July/NNP
  2020/CD
  ./.
  (NP The/DT story/NN)
  follows/VBZ
  Mario/NNP
  teaming/VBG
  up/RP
  with/IN
  his/PRP$
  new/JJ
  ally/RB
  Olivia/NNP
  to/TO
  prevent/VB
  the/DT
  Mushroom/NNP
  Kingdom/NNP
  from/IN
  being/VBG
  folded/VBN
  entirely/RB
  into/IN
  (NP origami/NN)
  ./.
  (NP The/DT game/NN)
  is/VBZ
  designed/VBN
  to/TO
  look/VB
  entirely/RB
  like/IN
  (NP paper/NN)
  ,/,
  with/IN
  multiple/JJ
  open-world/JJ
  areas/NNS
  allowing/VBG
  for/IN
  (NP exploration/NN)
  and/CC
  (NP puzzle-solving/NN)
  ./.
  (NP Turn-based/JJ combat/NN)
  is/VBZ
  organized/VBN
  into/IN
  circular/JJ
  rings/NNS
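
The rule `<DT>?<JJ>*<NN>` explains why "Nintendo Switch" and "areas" are left out of the chunks above: `NNP` and `NNS` are different tags from `NN`. The sketch below is a hand-rolled scan for this one rule, written without NLTK so the matching logic is visible. The function name `np_chunks` is an assumption, and this is not how `RegexpParser` is implemented.

```python
def np_chunks(tagged):
    """Find spans matching DT? JJ* NN, scanning left to right (hypothetical helper)."""
    chunks = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if j < n and tagged[j][1] == 'DT':      # optional determiner
            j += 1
        while j < n and tagged[j][1] == 'JJ':   # zero or more adjectives
            j += 1
        if j < n and tagged[j][1] == 'NN':      # required singular or mass noun
            chunks.append(tagged[i:j + 1])
            i = j + 1                           # continue after the chunk
        else:
            i += 1                              # no match starting here
    return chunks


# Tags copied from the tagger output shown earlier in the notebook.
tagged_sample = [
    ('is', 'VBZ'), ('a', 'DT'), ('cross-genre', 'JJ'), ('video', 'NN'),
    ('game', 'NN'), (',', ','), ('the', 'DT'), ('Nintendo', 'NNP'),
    ('Switch', 'NNP'), ('console', 'NN'), ('multiple', 'JJ'), ('areas', 'NNS'),
]
found = np_chunks(tagged_sample)
for chunk in found:
    print(' '.join(word for word, _ in chunk))
```

Broadening the grammar to something like `<DT>?<JJ>*<NN.*>+` would also capture proper and plural nouns, at the cost of longer and noisier chunks.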


In [6]:
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint

iob_tagged = tree2conlltags(cs)
pprint(iob_tagged)
# IOB tags: B = beginning of a chunk, I = inside a chunk, O = outside any chunk

[('Paper', 'NNP', 'O'),
 ('Mario', 'NNP', 'O'),
 (':', ':', 'O'),
 ('The', 'DT', 'O'),
 ('Origami', 'NNP', 'O'),
 ('King', 'NNP', 'O'),
 ('is', 'VBZ', 'O'),
 ('a', 'DT', 'B-NP'),
 ('cross-genre', 'JJ', 'I-NP'),
 ('video', 'NN', 'I-NP'),
 ('game', 'NN', 'B-NP'),
 (',', ',', 'O'),
 ('developed', 'VBN', 'O'),
 ('by', 'IN', 'O'),
 ('Intelligent', 'NNP', 'O'),
 ('Systems', 'NNPS', 'O'),
 ('and', 'CC', 'O'),
 ('published', 'VBN', 'O'),
 ('by', 'IN', 'O'),
 ('Nintendo', 'NNP', 'O'),
 (';', ':', 'O'),
 ('it', 'PRP', 'O'),
 ('was', 'VBD', 'O'),
 ('released', 'VBN', 'O'),
 ('exclusively', 'RB', 'O'),
 ('for', 'IN', 'O'),
 ('the', 'DT', 'O'),
 ('Nintendo', 'NNP', 'O'),
 ('Switch', 'NNP', 'O'),
 ('console', 'NN', 'B-NP'),
 ('in', 'IN', 'O'),
 ('July', 'NNP', 'O'),
 ('2020', 'CD', 'O'),
 ('.', '.', 'O'),
 ('The', 'DT', 'B-NP'),
 ('story', 'NN', 'I-NP'),
 ('follows', 'VBZ', 'O'),
 ('Mario', 'NNP', 'O'),
 ('teaming', 'VBG', 'O'),
 ('up', 'RP', 'O'),
 ('with', 'IN', 'O'),
 ('his', 'PRP$', 'O'),
 ('new
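
The `B-` prefix matters when two chunks are adjacent: without it, "video" and "game" above would look like one chunk. The following sketch is a plain-Python stand-in for the tree-to-IOB conversion, using a hypothetical `to_iob` helper and a hand-built shallow parse; it is not NLTK's `tree2conlltags`.

```python
def to_iob(items):
    """Flatten a shallow parse into (word, pos, iob) triples (hypothetical helper).

    Each item is either a (word, pos) pair outside any chunk, or a
    (label, [(word, pos), ...]) pair describing one chunk.
    """
    triples = []
    for first, second in items:
        if isinstance(second, list):               # a labelled chunk
            for i, (word, pos) in enumerate(second):
                prefix = 'B' if i == 0 else 'I'    # first token begins the chunk
                triples.append((word, pos, f'{prefix}-{first}'))
        else:                                      # a token outside every chunk
            triples.append((first, second, 'O'))
    return triples


parse = [
    ('is', 'VBZ'),
    ('NP', [('a', 'DT'), ('cross-genre', 'JJ'), ('video', 'NN')]),
    ('NP', [('game', 'NN')]),
    (',', ','),
]
iob = to_iob(parse)
for triple in iob:
    print(triple)
```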

In [7]:
ne_tree = nltk.ne_chunk(nltk.pos_tag(word_tokenize(article_text)))
print(ne_tree)

(S
  (PERSON Paper/NNP)
  (PERSON Mario/NNP)
  :/:
  The/DT
  (ORGANIZATION Origami/NNP King/NNP)
  is/VBZ
  a/DT
  cross-genre/JJ
  video/NN
  game/NN
  ,/,
  developed/VBN
  by/IN
  (ORGANIZATION Intelligent/NNP Systems/NNPS)
  and/CC
  published/VBN
  by/IN
  (GPE Nintendo/NNP)
  ;/:
  it/PRP
  was/VBD
  released/VBN
  exclusively/RB
  for/IN
  the/DT
  (ORGANIZATION Nintendo/NNP Switch/NNP)
  console/NN
  in/IN
  July/NNP
  2020/CD
  ./.
  The/DT
  story/NN
  follows/VBZ
  (PERSON Mario/NNP)
  teaming/VBG
  up/RP
  with/IN
  his/PRP$
  new/JJ
  ally/RB
  (GPE Olivia/NNP)
  to/TO
  prevent/VB
  the/DT
  (ORGANIZATION Mushroom/NNP Kingdom/NNP)
  from/IN
  being/VBG
  folded/VBN
  entirely/RB
  into/IN
  origami/NN
  ./.
  The/DT
  game/NN
  is/VBZ
  designed/VBN
  to/TO
  look/VB
  entirely/RB
  like/IN
  paper/NN
  ,/,
  with/IN
  multiple/JJ
  open-world/JJ
  areas/NNS
  allowing/VBG
  for/IN
  exploration/NN
  and/CC
  puzzle-solving/NN
  ./.
  Turn-based/JJ
  combat/NN
  is/VBZ
 

In [8]:
# Print every named entity, one sentence at a time
for sent in nltk.sent_tokenize(article_text):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):   # only subtrees (entities) have a label
            print(chunk.label(), ' '.join(c[0] for c in chunk))

PERSON Paper
PERSON Mario
ORGANIZATION Origami King
ORGANIZATION Intelligent Systems
GPE Nintendo
ORGANIZATION Nintendo Switch
PERSON Mario
GPE Olivia
ORGANIZATION Mushroom Kingdom
PERSON Kensuke Tanabe
GPE Reception
PERSON Philippe Chaperon
GPE French
ORGANIZATION Paris Opera
PERSON Chaperon
PERSON Giuseppe Verdi
ORGANIZATION Palais Garnier
GPE Paris
GPE Set
PERSON Philippe Chaperon
PERSON Adam CuerdenWikipedia
ORGANIZATION Wikimedia Foundation
GPE English


In [9]:
sentence_list = nltk.sent_tokenize(article_text)  # formatted_article_text has no punctuation, so sentences are split from article_text
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')     # English stopword list (all lowercase)

word_frequencies = {}    # frequency of each word, excluding stopwords

# The comparison is case-sensitive, so capitalized tokens such as 'The' are still counted
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stopwords:
        if word not in word_frequencies:    # first occurrence
            word_frequencies[word] = 1
        else:                               # increment the count
            word_frequencies[word] += 1

print(word_frequencies)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
{'Paper': 1, 'Mario': 2, 'The': 5, 'Origami': 1, 'King': 1, 'cross': 1, 'genre': 1, 'video': 1, 'game': 3, 'developed': 1, 'Intelligent': 1, 'Systems': 1, 'published': 1, 'Nintendo': 2, 'released': 1, 'exclusively': 1, 'Switch': 1, 'console': 1, 'July': 1, 'story': 1, 'follows': 1, 'teaming': 1, 'new': 2, 'ally': 1, 'Olivia': 1, 'prevent': 1, 'Mushroom': 1, 'Kingdom': 1, 'folded': 1, 'entirely': 3, 'origami': 1, 'designed': 1, 'look': 1, 'like': 1, 'paper': 1, 'multiple': 1, 'open': 1, 'world': 1, 'areas': 1, 'allowing': 1, 'exploration': 1, 'puzzle': 2, 'solving': 1, 'Turn': 1, 'based': 1, 'combat': 2, 'organized': 1, 'circular': 1, 'rings': 1, 'rotated': 1, 'line': 1, 'enemies': 1, 'deal': 1, 'damage': 1, 'producer': 1, 'Kensuke': 1, 'Tanabe': 1, 'pictured': 1, 'anticipating': 1, 'could': 1, 'satisfy': 1, 'every': 1, 'fan': 1, 'opted': 1, 'gameplay': 2, 'concepts': 1, 'compa
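
The output above counts 'The' five times because the stopword list is lowercase while the tokens keep their original case. One possible variant, sketched below, lowercases the text before counting. The tiny `STOPWORDS` set and the `count_words` helper are assumptions for illustration only; they stand in for NLTK's stopword list and tokenizer.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked stopword set, NOT the NLTK English list.
STOPWORDS = {'the', 'is', 'a', 'to', 'and', 'by'}


def count_words(text, stopwords=STOPWORDS):
    """Count lowercase alphabetic tokens that are not stopwords (hypothetical helper)."""
    tokens = re.findall(r'[a-z]+', text.lower())   # lowercase first, then tokenize
    return Counter(token for token in tokens if token not in stopwords)


freqs = count_words(
    "The game is designed to look like paper. The game received positive reviews."
)
print(freqs)
```

With lowercase keys, the later lookup of lowercased sentence tokens would also match words that appear capitalized in the article, such as proper nouns.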

In [10]:
maximum_frequency = max(word_frequencies.values())

# Normalize each count by the highest count to get a weighted frequency
for word in word_frequencies.keys():
    word_frequencies[word] = word_frequencies[word] / maximum_frequency

sentence_scores = {}    # score of each sentence: sum of its words' weighted frequencies

for sent in sentence_list:
    for word in nltk.word_tokenize(sent.lower()):   # tokens are lowercased, so only lowercase keys match
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:      # only score sentences shorter than 30 words
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

import heapq    # nlargest picks the highest-scoring sentences
summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)
print(summary)

The producer, Kensuke Tanabe (pictured), anticipating that he could not satisfy every fan, opted for entirely new gameplay and concepts compared to previous games in the series. The game received generally positive reviews, being praised for its graphics, writing and characters, and critiqued for the lack of other elements of role-playing games, such as experience points. The game is designed to look entirely like paper, with multiple open-world areas allowing for exploration and puzzle-solving. The story follows Mario teaming up with his new ally Olivia to prevent the Mushroom Kingdom from being folded entirely into origami. Turn-based combat is organized into circular rings, which can be rotated to line up enemies to deal more damage. (Full article...) March 9 Philippe Chaperon (1823–1906) was a French painter and scenic designer, known particularly for his work at the Paris Opera. This is Chaperon's set design for the third act of Giuseppe Verdi's Rigoletto for an 1885 production of
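
The summary above lists sentences in score order, not in the order they appear in the article. As a minimal sketch of the scoring and selection step on its own, the hypothetical `top_sentences` helper below takes precomputed weights, picks the best sentences with `heapq.nlargest`, and then restores document order. The sample sentences, the weights, and the punctuation-stripping tokenization are all assumptions made for illustration.

```python
import heapq


def top_sentences(sentences, weights, n=2, max_words=30):
    """Return the n highest-scoring short sentences in document order (hypothetical helper)."""
    scores = {}
    for sent in sentences:
        if len(sent.split()) >= max_words:       # skip long sentences
            continue
        # Crude tokenization: split on whitespace, strip punctuation, lowercase.
        words = [w.strip('.,;:!?()"\'').lower() for w in sent.split()]
        score = sum(weights.get(w, 0.0) for w in words)
        if score > 0:
            scores[sent] = score
    best = heapq.nlargest(n, scores, key=scores.get)
    return sorted(best, key=sentences.index)     # restore original order


sentences = [
    "Mario is a plumber.",
    "Mario and Olivia save the kingdom.",
    "It rains.",
]
# Made-up weighted frequencies: each count divided by the highest count.
weights = {'mario': 1.0, 'plumber': 0.5, 'olivia': 0.5,
           'save': 0.5, 'kingdom': 0.5, 'rains': 0.5}
picked = top_sentences(sentences, weights, n=2)
print(' '.join(picked))
```

Summing weights favors longer sentences, which is one reason for the 30-word cap; dividing the score by the sentence length would be another way to limit that bias.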