Learning Web Scraping with Beautiful Soup

### Beautiful Soup 
Beautiful Soup is a Python package for parsing HTML and XML documents. It builds a parse tree from each page it parses, and that tree can be used to extract data from the HTML, which makes it useful for web scraping.

### Scrapy 
Scrapy is a fast and powerful web scraping and web crawling framework.

### Selenium 
Selenium is a portable framework for testing web applications. It provides a playback tool for authoring functional tests without the need to learn a test scripting language.

In [5]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">Thisis demo for learling web scraping</p>
"""

In [7]:
from bs4 import BeautifulSoup

In [34]:
soup = BeautifulSoup(html_doc, 'html.parser')  # 'html5lib' is an alternative parser, if it is installed

In [59]:
# print(soup.prettify())

In [35]:
soup.title

<title>The Dormouse's story</title>

In [18]:
soup.title.text

"The Dormouse's story"

In [19]:
soup.title.name

'title'

In [21]:
soup.title.string

"The Dormouse's story"

In [22]:
soup.title.parent.name

'head'
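
For the title tag above, `.text` and `.string` return the same value because the tag holds a single string. The two differ once a tag has several children. A minimal sketch, assuming a made-up snippet (`snippet` and `demo` are illustrative names, not part of the notebook):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, separate from the notebook's html_doc
snippet = '<p>Hello <b>world</b></p>'
demo = BeautifulSoup(snippet, 'html.parser')

single = demo.b.string        # <b> has exactly one child, a string
multiple = demo.p.string      # <p> has two children, so .string is None
combined = demo.p.get_text()  # joins every string found under <p>

print(repr(single))
print(repr(multiple))
print(repr(combined))
```

When a tag may contain nested markup, `get_text()` (or `.text`) is usually the safer choice.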

In [23]:
soup.p   # first <p> tag in the document

<p class="title"><b>The Dormouse's story</b></p>

In [25]:
soup.p['class']  # class attribute of the first <p>, returned as a list

['title']

In [27]:
soup.a   # first <a> tag only

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [28]:
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [29]:
soup.find(id="link3")

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

In [31]:
for link in soup.find_all('a'):
    print(link.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
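
The loop above uses `link.get('href')`, which returns `None` for a missing attribute, whereas `link['href']` would raise a `KeyError`. The sketch below illustrates the difference on a made-up snippet (the second anchor deliberately has no `href`):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second <a> has no href attribute
snippet = '<a href="http://example.com/elsie">Elsie</a><a id="anchor">no link</a>'
demo = BeautifulSoup(snippet, 'html.parser')

# .get() returns None instead of raising when the attribute is absent
hrefs = [link.get('href') for link in demo.find_all('a')]

# href=True keeps only tags that actually carry an href attribute
only_links = [link['href'] for link in demo.find_all('a', href=True)]

print(hrefs)
print(only_links)
```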


In [32]:
print(soup.get_text())


The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
Thisis demo for learling web scraping



### Tag object

In [36]:
soup = BeautifulSoup('<b class="boldest"> extremely </b>', 'html.parser')
tag = soup.b

In [37]:
tag

<b class="boldest"> extremely </b>

In [41]:
soup.b['class']

['boldest']
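
`class` comes back as a list because it is a multi-valued attribute, while single-valued attributes such as `id` come back as plain strings. A small sketch of reading and changing attributes, using a made-up tag (`bold` and the attribute values are illustrative):

```python
from bs4 import BeautifulSoup

# Hypothetical tag with two classes and an id
demo = BeautifulSoup('<b class="boldest big" id="x1">text</b>', 'html.parser')
bold = demo.b

classes = bold['class']       # multi-valued attribute: a list of class names
tag_id = bold['id']           # single-valued attribute: a string
all_attrs = dict(bold.attrs)  # copy of every attribute as a dictionary

# Attributes can be changed or removed like dictionary entries
bold['id'] = 'x2'
del bold['class']
rendered = str(bold)

print(classes, tag_id, all_attrs)
print(rendered)
```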

### Find all Prime Ministers of India from Wikipedia

In [104]:
import requests
from bs4 import BeautifulSoup

In [105]:
html = requests.get('https://en.wikipedia.org/wiki/List_of_Prime_Ministers_of_India')

In [106]:
soup = BeautifulSoup(html.text,'html.parser')

In [107]:
# print(soup.prettify())

In [108]:
res = soup.find_all('span', {'class': 'fn'})

In [109]:
type(res[0])

bs4.element.Tag

In [204]:
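# res[0] is the "India" country-name span, so start from index 1
for i in range(1, len(res)):
    # result = res[i].string.parent.name  # name of the tag that holds the string
    # result = res[i].string.parent       # the whole enclosing tag
    # result = res[i].string              # just the text
    result = res[i].a['title']            # title attribute of the link inside the span
    print(result)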

Jawaharlal Nehru
Louis Mountbatten, 1st Earl Mountbatten of Burma
Rajendra Prasad
Gulzarilal Nanda
Sarvepalli Radhakrishnan
Lal Bahadur Shastri
Gulzarilal Nanda
Indira Gandhi
V. V. Giri
Morarji Desai
B. D. Jatti
Charan Singh
Neelam Sanjiva Reddy
Indira Gandhi
Rajiv Gandhi
Zail Singh
Vishwanath Pratap Singh
R. Venkataraman
Chandra Shekhar
P. V. Narasimha Rao
Atal Bihari Vajpayee
Shankar Dayal Sharma
H. D. Deve Gowda
Inder Kumar Gujral
Atal Bihari Vajpayee
K. R. Narayanan
Manmohan Singh
A. P. J. Abdul Kalam
Pratibha Patil
Narendra Modi
Pranab Mukherjee
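
The loop above assumes every `fn` span contains a link with a `title`, and the output repeats some names. The sketch below shows one way to guard against missing links and drop repeats. It runs on a made-up, simplified stand-in for the page markup (`sample` is hypothetical, not the real Wikipedia HTML), so it needs no network access:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the Wikipedia markup
sample = """
<span class="fn org country-name"><a href="/wiki/India" title="India">India</a></span>
<span class="fn"><a href="/wiki/Jawaharlal_Nehru" title="Jawaharlal Nehru">Jawaharlal Nehru</a></span>
<span class="fn">No link here</span>
<span class="fn"><a href="/wiki/Indira_Gandhi" title="Indira Gandhi">Indira Gandhi</a></span>
<span class="fn"><a href="/wiki/Indira_Gandhi" title="Indira Gandhi">Indira Gandhi</a></span>
"""
demo = BeautifulSoup(sample, 'html.parser')

names = []
# class_='fn' also matches the first span, whose class list contains 'fn'; skip it
for span in demo.find_all('span', class_='fn')[1:]:
    link = span.a
    if link is None or not link.has_attr('title'):
        continue  # span without a usable link
    if link['title'] not in names:
        names.append(link['title'])  # keep first occurrence, preserve order

print(names)
```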


In [130]:
type(res)  # find_all returns a ResultSet; each item, such as res[0], is a Tag

bs4.element.ResultSet

In [131]:
type(res[0])

bs4.element.Tag

In [129]:
res[0].name

'span'

In [140]:
print(res[0].attrs)  # finding all attributes

{'class': ['fn', 'org', 'country-name']}


In [143]:
res[0]['class']  # class names of the first matching span

['fn', 'org', 'country-name']

In [147]:
res[1].string

'Jawaharlal Nehru'

In [157]:
type(res[1].string)

bs4.element.NavigableString
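
A `NavigableString` behaves like an ordinary Python string but also keeps a reference to the parse tree. A short sketch, assuming a made-up one-tag document, of checking this and converting to a plain `str`:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical one-tag document
demo = BeautifulSoup('<b>Jawaharlal Nehru</b>', 'html.parser')
text = demo.b.string

is_str = isinstance(text, str)                    # NavigableString subclasses str
is_navigable = isinstance(text, NavigableString)
parent_name = text.parent.name                    # it still knows which tag holds it

# str() gives a plain string that no longer references the tree
plain = str(text)

print(is_str, is_navigable, parent_name, type(plain))
```

Converting with `str()` is worth doing before storing scraped text for a long time, since the plain string does not keep the whole tree alive.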

In [163]:
# soup.head
# soup.title
# soup.body.b

In [165]:
# soup.a

In [168]:
# soup.find_all('span') # tags of html 
# soup.find_all('a')

In [176]:
head_tag = soup.head

In [180]:
# head_tag.contents[1] 

In [181]:
title_tag = soup.title

In [192]:
# title_tag.contents[0] returns the same string
title_tag.string

'List of Prime Ministers of India - Wikipedia'

In [185]:
for child in title_tag.children:
    print(child)

List of Prime Ministers of India - Wikipedia


In [186]:
len(list(soup.descendants))

4608

In [189]:
len(list(soup.children))

4
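
The gap between the two counts comes from what each iterator visits: `.children` yields only direct children, while `.descendants` walks every nested tag and string. A small sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two levels of nesting
demo = BeautifulSoup('<div><p>one <b>two</b></p><p>three</p></div>', 'html.parser')
div = demo.div

children = list(div.children)        # the two <p> tags only
descendants = list(div.descendants)  # <p>, 'one ', <b>, 'two', <p>, 'three'

print(len(children))
print(len(descendants))
```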

In [193]:
for string in soup.strings:
    print(repr(string))

'\n'
'\n'
'\n'
'\n'
'List of Prime Ministers of India - Wikipedia'
'\n'
'document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );'
'\n'
'(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_Prime_Ministers_of_India","wgTitle":"List of Prime Ministers of India","wgCurRevisionId":889795106,"wgRevisionId":889795106,"wgArticleId":844745,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages using Timeline","Use dmy dates from May 2013","Use Indian English from May 2013","All Wikipedia articles written in Indian English","Articles with hCards","Featured lists","Prime Ministers of India","Lists of prime ministers","Lists of political office-holders in India"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSepa
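
As the output above shows, `.strings` also yields whitespace-only strings. `.stripped_strings` skips those and trims the rest. A minimal sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with newlines and padded text
demo = BeautifulSoup('<div>\n<p> Elsie </p>\n<p>Lacie</p>\n</div>', 'html.parser')

raw = list(demo.strings)             # includes the '\n' separators
clean = list(demo.stripped_strings)  # whitespace-only strings dropped, others trimmed

print(raw)
print(clean)
```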

### We can also inspect the next_element and previous_element of a tag

Use .name on the result to get its tag name.

In [252]:
res[0].next_element.next_element.next_element.next_element.next_element.next_element

<a class="image" href="/wiki/File:Emblem_of_India.svg"><img alt="Emblem of India.svg" data-file-height="562" data-file-width="331" decoding="async" height="102" src="//upload.wikimedia.org/wikipedia/commons/thumb/5/55/Emblem_of_India.svg/60px-Emblem_of_India.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/5/55/Emblem_of_India.svg/90px-Emblem_of_India.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/5/55/Emblem_of_India.svg/120px-Emblem_of_India.svg.png 2x" width="60"/></a>

In [253]:
res[0].previous_element.next_element

<span class="fn org country-name"><a href="/wiki/India" title="India">India</a></span>
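
`next_element` follows parse order, so it often steps into a tag's own contents, while `next_sibling` stays on the same level of the tree. A sketch of the difference, using a made-up fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two sibling tags inside one paragraph
demo = BeautifulSoup('<p><b>one</b><i>two</i></p>', 'html.parser')
b = demo.b

sibling = b.next_sibling             # the <i> tag, same level as <b>
element = b.next_element             # the string 'one', parsed right after <b> opened
after_string = element.next_element  # parse order then moves on to <i>

print(repr(sibling))
print(repr(element))
print(repr(after_string))
```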

In [313]:
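import re
# Tag names are matched against the regex, so '^small' selects every <small> tag
for tags in soup.find_all(re.compile('^small')):
    # print(tags.name)
    # print(type(tags))
    print(tags.string)  # None unless the tag wraps a single string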

None
(1889–1964)
(1898–1998)
(1904–1966)
(1898–1998)
(1917–1984)
(1896–1995)
(1902–1987)
(1917–1984)
(1944–1991)
(1931–2008)
(1927–2007)
(1921–2004)
(1924–2018)
(1933–)
(1919–2012)
(1924-2018)
(1932–)
(1950–)
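
A compiled regex passed to `find_all` is matched against tag names. The `None` at the top of the output above is probably a `<small>` tag with more than one child, although the page markup is not shown here to confirm it. The sketch below reproduces both behaviours on a made-up document:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document; the second <small> has two children
snippet = ('<html><body><b>bold</b><small>(1889-1964)</small>'
           '<small>born <i>1950</i></small></body></html>')
demo = BeautifulSoup(snippet, 'html.parser')

# '^b' matches every tag whose name starts with "b"
b_names = [t.name for t in demo.find_all(re.compile('^b'))]

# .string is None for the <small> that holds both a string and an <i> tag
small_strings = [t.string for t in demo.find_all(re.compile('^small'))]

print(b_names)
print(small_strings)
```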


#### A List

In [286]:
l = soup.find_all(['small', 'a'])  # a list matches any tag whose name is in it

In [288]:
type(l[0])

bs4.element.Tag

#### Function

In [331]:
# Despite its name, this filter matches tags that have a title attribute but no class attribute
def class_has_no_id(tag):
    return tag.has_attr('title') and not tag.has_attr('class')

In [332]:
tag

<small>(1950–)</small>

In [333]:
soup.find_all(class_has_no_id)

[<link href="/w/opensearch_desc.php" rel="search" title="Wikipedia (en)" type="application/opensearchdescription+xml"/>,
 <a href="/wiki/Wikipedia:Featured_lists" title="This is a featured list. Click here for more information."><img alt="This is a featured list. Click here for more information." data-file-height="438" data-file-width="462" decoding="async" height="19" src="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/30px-Cscr-featured.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/40px-Cscr-featured.svg.png 2x" width="20"/></a>,
 <a href="/wiki/India" title="India">India</a>,
 <a href="/wiki/Politics_of_India" title="Politics of India">politics and government of<br/>India</a>,
 <a href="/wiki/Constitution_of_India" title="Constitution of India">Constitution</a>,
 <a href="/wiki/Law_of_India" title="Law of India">law</a>,
 <a href="/w
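
A function filter receives each tag and keeps those for which it returns `True`. The sketch below applies the same predicate to a made-up fragment, under a more descriptive name (`has_title_but_no_class` is an illustrative name, not one used in the notebook):

```python
from bs4 import BeautifulSoup

def has_title_but_no_class(tag):
    """Keep tags that define a title attribute but no class attribute."""
    return tag.has_attr('title') and not tag.has_attr('class')

# Hypothetical fragment: only the first <a> satisfies the predicate
snippet = '<a title="t1">one</a><a title="t2" class="c">two</a><a>three</a>'
demo = BeautifulSoup(snippet, 'html.parser')

matches = demo.find_all(has_title_but_no_class)
texts = [m.get_text() for m in matches]

print(texts)
```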

In [342]:
soup.find_all('title')[0]

<title>List of Prime Ministers of India - Wikipedia</title>

In [356]:
# soup.find_all("a")

In [376]:
soup.find(string=re.compile("a"))

'List of Prime Ministers of India - Wikipedia'

In [385]:
soup.find_all(class_=re.compile("itl"))

[<div class="toctitle" dir="ltr" lang="en"><h2>Contents</h2><span class="toctogglespan"><label class="toctogglelabel" for="toctogglecheckbox"></label></span></div>,
 <th class="navbox-title" colspan="3" scope="col"><div class="plainlinks hlist navbar mini"><ul><li class="nv-view"><a href="/wiki/Template:Prime_Ministers_of_India" title="Template:Prime Ministers of India"><abbr style=";;background:none transparent;border:none;-moz-box-shadow:none;-webkit-box-shadow:none;box-shadow:none; padding:0;" title="View this template">v</abbr></a></li><li class="nv-talk"><a href="/wiki/Template_talk:Prime_Ministers_of_India" title="Template talk:Prime Ministers of India"><abbr style=";;background:none transparent;border:none;-moz-box-shadow:none;-webkit-box-shadow:none;box-shadow:none; padding:0;" title="Discuss this template">t</abbr></a></li><li class="nv-edit"><a class="external text" href="//en.wikipedia.org/w/index.php?title=Template:Prime_Ministers_of_India&amp;action=edit"><abbr style=";;ba

In [418]:
a = soup.find("small")

In [429]:
a.next_sibling.next_element

<a href="#cite_note-13">[13]</a>

In [444]:
span = soup.a

In [446]:
span.find_next_siblings("a")

[]
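
The empty list means the first `<a>` on the page has no later `<a>` under the same parent. `find_next_siblings` only looks at that level, whereas `find_all_next` searches everything parsed afterwards. A sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the first <a> is alone in its paragraph
snippet = '<p><a id="a1">1</a></p><p><a id="a2">2</a><a id="a3">3</a></p>'
demo = BeautifulSoup(snippet, 'html.parser')

first = demo.a
siblings_of_first = first.find_next_siblings('a')  # no <a> shares its parent
later_links = first.find_all_next('a')             # every <a> parsed after it

second = demo.find(id='a2')
siblings_of_second = second.find_next_siblings('a')  # a3 shares the same <p>

print(siblings_of_first)
print([a['id'] for a in later_links])
print([a['id'] for a in siblings_of_second])
```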