## Scraping Data without BS4 (using the `wikipedia` package)

In [84]:
import wikipedia as w
import pandas as pd

In [85]:
# Fetch the short plain-text summary of the article.
text = w.summary("Flash (DC Comics character)")
text

'The Flash (or simply Flash) is the name of several superheroes appearing in American comic books published by DC Comics. Created by writer Gardner Fox and artist Harry Lampert, the original Flash first appeared in Flash Comics #1 (cover date January 1940/release month November 1939). Nicknamed "the Scarlet Speedster", all incarnations of the Flash possess "superspeed", which includes the ability to run, move, and think extremely fast, use superhuman reflexes, and seemingly violate certain laws of physics.\nThus far, at least five different characters—each of whom somehow gained the power of "the Speed Force"—have assumed the mantle of the Flash in DC\'s history: college athlete Jay Garrick (1940–1951, 1961–2011, 2017–present), forensic scientist Barry Allen (1956–1985, 2008–present), Barry\'s nephew Wally West (1986–2011, 2016–present), Barry\'s grandson Bart Allen (2006–2007), and Chinese-American Avery Ho (2017–present). Each incarnation of the Flash has been a key member of at leas

In [86]:
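# Search for the article, then load the first hit as a page object.
results = w.search("Flash (DC Comics character)")
page = w.page(results[0])
title = page.title
content = page.content  # full plain-text body of the article
print("Page title : ", title, "\n")
print("Page content : ", content, "\n")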

Page title :  Flash (DC Comics character) 

Page content :  The Flash (or simply Flash) is the name of several superheroes appearing in American comic books published by DC Comics. Created by writer Gardner Fox and artist Harry Lampert, the original Flash first appeared in Flash Comics #1 (cover date January 1940/release month November 1939). Nicknamed "the Scarlet Speedster", all incarnations of the Flash possess "superspeed", which includes the ability to run, move, and think extremely fast, use superhuman reflexes, and seemingly violate certain laws of physics.
Thus far, at least five different characters—each of whom somehow gained the power of "the Speed Force"—have assumed the mantle of the Flash in DC's history: college athlete Jay Garrick (1940–1951, 1961–2011, 2017–present), forensic scientist Barry Allen (1956–1985, 2008–present), Barry's nephew Wally West (1986–2011, 2016–present), Barry's grandson Bart Allen (2006–2007), and Chinese-American Avery Ho (2017–present). Each in

In [87]:
type(content)

str

In [97]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

In [98]:
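# Needs the NLTK 'stopwords' and 'punkt' data (nltk.download); newer NLTK releases use 'punkt_tab'.
stopWords = set(stopwords.words("english"))
words = word_tokenize(content)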

In [99]:
# Count each lowercased token that is not a stop word.
# Punctuation tokens produced by word_tokenize are counted as well.
freq_table = dict()
for word in words:
    word = word.lower()
    if word in stopWords:
        continue
    freq_table[word] = freq_table.get(word, 0) + 1

In [100]:
# Sentences come from the short summary in `text`; word frequencies come from the full page.
sents = sent_tokenize(text)
sent_val = dict()

In [101]:
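# Add a word's page-wide frequency to each sentence that contains it.
# `in` is a substring test here, so "art" also matches inside "bart".
for sent in sents:
    sent_lower = sent.lower()
    for word, freq in freq_table.items():
        if word in sent_lower:
            sent_val[sent] = sent_val.get(sent, 0) + freq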
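
Because `word in sent.lower()` is a substring test, short tokens can match inside longer words and inflate a sentence's score. A minimal sketch of a whole-token alternative follows, assuming a simple regex tokenizer; `tokenize` and `score_sentence` are hypothetical helper names, not part of the notebook or of NLTK, and the tiny `freq_table` is made up for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def score_sentence(sentence, freq_table):
    """Sum the frequencies of the table words that appear as whole tokens."""
    tokens = set(tokenize(sentence))
    return sum(freq for word, freq in freq_table.items() if word in tokens)

# Illustrative frequency table (assumption, not taken from the article).
freq_table = {"art": 3, "bart": 2, "flash": 5}
sentence = "Bart Allen became the Flash."

# Substring test, as in the loop above: "art" matches inside "bart".
substring_score = sum(f for w, f in freq_table.items() if w in sentence.lower())
# Whole-token test: only "bart" and "flash" count.
token_score = score_sentence(sentence, freq_table)
print(substring_score, token_score)  # prints: 10 7
```

Switching to whole-token matching would change the scores and averages recorded in this notebook, so it is shown here only as an option.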

In [102]:
# Total score across all scored sentences.
sumValues = sum(sent_val.values())
sumValues

13747

In [103]:
avg = int(sumValues / len(sent_val))
avg

1057

In [104]:
# Keep sentences scoring above the average and join them with spaces.
summary = ' '.join(
    sent for sent in sents
    if sent in sent_val and sent_val[sent] > avg
)
print(summary)

Created by writer Gardner Fox and artist Harry Lampert, the original Flash first appeared in Flash Comics #1 (cover date January 1940/release month November 1939). Thus far, at least five different characters—each of whom somehow gained the power of "the Speed Force"—have assumed the mantle of the Flash in DC's history: college athlete Jay Garrick (1940–1951, 1961–2011, 2017–present), forensic scientist Barry Allen (1956–1985, 2008–present), Barry's nephew Wally West (1986–2011, 2016–present), Barry's grandson Bart Allen (2006–2007), and Chinese-American Avery Ho (2017–present). The original meeting of the Golden Age Flash Jay Garrick and Silver Age Flash Barry Allen in "Flash of Two Worlds" (1961) introduced the Multiverse storytelling concept to DC readers, which would become the basis for many DC stories in the years to come. Like his Justice League colleagues Wonder Woman, Superman and Batman, the Flash has a distinctive cast of adversaries, including the various Rogues (unique among 
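
The cells above amount to a frequency-based extractive summary: count words, score each sentence by the words it contains, and keep the sentences that score above a multiple of the average. A minimal standard-library sketch of that idea follows. It is not the notebook's exact pipeline: the regex tokenizer, the naive sentence splitter, the short `STOP_WORDS` set, and the `summarize` name are all assumptions made so the example runs without NLTK data, and it scores sentences against the same text rather than against a separate full-page `content`.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "in", "to", "it"}

def summarize(text, threshold=1.0):
    """Return the sentences whose score exceeds threshold * average score."""
    # Naive sentence split after ., ! or ? followed by whitespace (assumption).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    tokenized = [re.findall(r"[a-z0-9']+", s.lower()) for s in sentences]
    # Word frequencies over the whole text, ignoring stop words.
    freq = Counter(w for toks in tokenized for w in toks if w not in STOP_WORDS)
    # Each distinct word contributes its frequency once per sentence.
    scores = [sum(freq[w] for w in set(toks) if w in freq) for toks in tokenized]
    if not scores:
        return ""
    avg = sum(scores) / len(scores)
    kept = [s for s, sc in zip(sentences, scores) if sc > threshold * avg]
    return " ".join(kept)

demo = "The Flash is fast. The Flash runs. Cats sleep."
print(summarize(demo))  # prints: The Flash is fast. The Flash runs.
```

Lowering `threshold` (for example to `0.6`) keeps more sentences; raising it keeps fewer.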

## Scraping Data using BS4

In [49]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

In [54]:
# Download the raw article HTML.
html = urlopen('https://en.wikipedia.org/wiki/Flash_(DC_Comics_character)')
html

<http.client.HTTPResponse at 0x23b7edf82e0>

In [55]:
# Parse the response and collect every heading element (h1 to h6).
bs = BeautifulSoup(html.read(), 'html.parser')
headings = bs.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
print(headings)

[<h1 class="firstHeading mw-first-heading" id="firstHeading"><span class="mw-page-title-main">Flash (DC Comics character)</span></h1>, <h2 id="mw-toc-heading">Contents</h2>, <h2><span class="mw-headline" id="Publication_history">Publication history</span></h2>, <h3><span class="mw-headline" id="Golden_Age">Golden Age</span></h3>, <h3><span class="mw-headline" id="Silver_Age">Silver Age</span></h3>, <h3><span class="mw-headline" id="Modern_Age">Modern Age</span></h3>, <h2><span class="mw-headline" id="Fictional_character_biographies">Fictional character biographies</span></h2>, <h3><span class="mw-headline" id="Jay_Garrick">Jay Garrick</span></h3>, <h3><span class="mw-headline" id="Barry_Allen">Barry Allen</span></h3>, <h3><span class="mw-headline" id="Wally_West">Wally West</span></h3>, <h3><span class="mw-headline" id="Bart_Allen">Bart Allen</span></h3>, <h3><span class="mw-headline" id="Avery_Ho">Avery Ho</span></h3>, <h3><span class="mw-headline" id="Others_to_carry_the_mantle_of_th

In [56]:
bs.select('p')

[<p><b>The Flash</b> (or simply <b>Flash</b>) is the name of several <a href="/wiki/Superhero" title="Superhero">superheroes</a> appearing in <a href="/wiki/American_comic_book" title="American comic book">American comic books</a> published by <a href="/wiki/DC_Comics" title="DC Comics">DC Comics</a>. Created by writer <a href="/wiki/Gardner_Fox" title="Gardner Fox">Gardner Fox</a> and artist <a href="/wiki/Harry_Lampert" title="Harry Lampert">Harry Lampert</a>, the original Flash first appeared in <i><a href="/wiki/Flash_Comics" title="Flash Comics">Flash Comics</a></i> #1 (cover date January 1940/release month November 1939).<sup class="reference" id="cite_ref-dc-ency_1-0"><a href="#cite_note-dc-ency-1">[1]</a></sup> Nicknamed "the Scarlet Speedster", all incarnations of the Flash possess "superspeed", which includes the ability to run, move, and think extremely fast, use superhuman reflexes, and seemingly violate certain <a class="mw-redirect" href="/wiki/Physical_law" title="Physic

In [59]:
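# Print the text of each <p>. Note that `content` is reassigned on every pass,
# so after the loop it holds only the last paragraph.
for i in bs.select('p'):
    content = i.text
    print(content)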

The Flash (or simply Flash) is the name of several superheroes appearing in American comic books published by DC Comics. Created by writer Gardner Fox and artist Harry Lampert, the original Flash first appeared in Flash Comics #1 (cover date January 1940/release month November 1939).[1] Nicknamed "the Scarlet Speedster", all incarnations of the Flash possess "superspeed", which includes the ability to run, move, and think extremely fast, use superhuman reflexes, and seemingly violate certain laws of physics.

Thus far, at least five different characters—each of whom somehow gained the power of "the Speed Force"—have assumed the mantle of the Flash in DC's history: college athlete Jay Garrick (1940–1951, 1961–2011, 2017–present), forensic scientist Barry Allen (1956–1985, 2008–present), Barry's nephew Wally West (1986–2011, 2016–present), Barry's grandson Bart Allen (2006–2007), and Chinese-American Avery Ho (2017–present). Each incarnation of the Flash has been a key member of at least
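
In the loop above, `content` is reassigned on each iteration, so once the loop ends it holds only the last paragraph's text, and the word counts computed from it cover that paragraph alone. A minimal sketch of gathering every paragraph instead follows. The inline `sample_html` stands in for the downloaded page, and the `soup`, `paragraphs`, and `full_text` names are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Small inline document standing in for the downloaded page (assumption).
sample_html = (
    "<html><body>"
    "<p>First <b>paragraph</b>.</p>"
    "<p>Second paragraph.</p>"
    "</body></html>"
)
soup = BeautifulSoup(sample_html, "html.parser")

# Gather the text of every <p> instead of overwriting one variable per iteration.
paragraphs = [p.get_text() for p in soup.select("p")]
full_text = "\n".join(paragraphs)
print(full_text)
```

Passing `full_text` to `word_tokenize` would count words across the whole article, which would change the totals recorded in the cells that follow.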

In [60]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
stopWords = set(stopwords.words("english"))
words = word_tokenize(content)

In [61]:
freq_table = dict()
for word in words:
    word = word.lower()
    if word in stopWords:
        continue
    if word in freq_table:
        freq_table[word] += 1
    else:
        freq_table[word] = 1

In [62]:
# `text` is still the summary fetched with the `wikipedia` package in the other section.
sentences = sent_tokenize(text)
sent_value = dict()

In [63]:
for sentence in sentences:
    for word, freq in freq_table.items():
        if word in sentence.lower():
            if sentence in sent_value:
                sent_value[sentence] += freq
            else:
                sent_value[sentence] = freq

In [67]:
sum_val = 0
for sentence in sent_value:
    sum_val += sent_value[sentence]
sum_val

71

In [66]:
avg = int(sum_val / len(sent_value))
avg

5

In [70]:
# Keep sentences scoring above 60% of the average and join them with spaces.
summary = ' '.join(
    sentence for sentence in sentences
    if sentence in sent_value and sent_value[sentence] > 0.6 * avg
)
print(summary)

Created by writer Gardner Fox and artist Harry Lampert, the original Flash first appeared in Flash Comics #1 (cover date January 1940/release month November 1939). Nicknamed "the Scarlet Speedster", all incarnations of the Flash possess "superspeed", which includes the ability to run, move, and think extremely fast, use superhuman reflexes, and seemingly violate certain laws of physics. Thus far, at least five different characters—each of whom somehow gained the power of "the Speed Force"—have assumed the mantle of the Flash in DC's history: college athlete Jay Garrick (1940–1951, 1961–2011, 2017–present), forensic scientist Barry Allen (1956–1985, 2008–present), Barry's nephew Wally West (1986–2011, 2016–present), Barry's grandson Bart Allen (2006–2007), and Chinese-American Avery Ho (2017–present). Each incarnation of the Flash has been a key member of at least one of DC's premier teams: the Justice Society of America, the Justice League, and the Teen Titans. The original meeting of the 