# Web Scraping
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

In [120]:
# Import package
from bs4 import BeautifulSoup

Here’s an HTML document I’ll be using as an example throughout this document. It’s part of a story from Alice in Wonderland:

In [121]:
html_doc = """
<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [123]:
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup)


<html>
<head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>


In [124]:
# soup is a BeautifulSoup object, which represents the document as a nested data structure.
# prettify() returns that structure as an indented string, with each tag and each string on its own line:
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



# Here are some simple ways to navigate that data structure:

In [125]:
# Title

print(soup.title)
print(soup.title.name)
print(soup.title.string)

print("<------------------------------------>")

print(soup.title.parent.name)
print(soup.p)
print(soup.p['class'])

print("<------------------------------------>")

print(soup.a)
print(soup.find_all('a'))
print(soup.find_all(id='link3'))

<title>The Dormouse's story</title>
title
The Dormouse's story
<------------------------------------>
head
<p class="title"><b>The Dormouse's story</b></p>
['title']
<------------------------------------>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


# One common task is extracting all the URLs found within a page’s <a> tags:

In [126]:
a_Tag = soup.find_all('a')

In [127]:
for link in a_Tag:
    print(link.get("href"))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
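
The same extraction can be written as a list comprehension. The sketch below is a minimal, self-contained variant: the short `html` string and the names `soup_links` and `urls` are illustrative assumptions, not part of the example document above. Passing `href=True` restricts the search to `<a>` tags that actually carry an `href` attribute.

```python
from bs4 import BeautifulSoup

# Illustrative markup (an assumption, shorter than html_doc above).
html = """
<p>
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
<a id="no-link">anchor without href</a>
</p>
"""

soup_links = BeautifulSoup(html, "html.parser")

# href=True matches only <a> tags that have an href attribute,
# so link["href"] cannot raise KeyError inside the comprehension.
urls = [link["href"] for link in soup_links.find_all("a", href=True)]
print(urls)
```

The third anchor has no `href`, so it is skipped rather than producing a `None` entry, which is what `link.get("href")` would return for it.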


# Another common task is extracting all the text from a page:

In [128]:
print(soup.get_text())



The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
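
The raw output of `get_text()` keeps the whitespace of the source markup. As a hedged sketch of how to tidy it, `get_text()` accepts a `separator` string and a `strip` flag, and `stripped_strings` yields the same pieces one by one. The one-line `html` string and the variable names below are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative single-line markup (an assumption, not html_doc above).
html = (
    '<p class="story">Once upon a time '
    '<a href="http://example.com/elsie">Elsie</a> and '
    '<a href="http://example.com/lacie">Lacie</a> lived in a well.</p>'
)
soup_text = BeautifulSoup(html, "html.parser")

# Default behaviour: all strings concatenated exactly as they appear.
raw = soup_text.get_text()

# strip=True trims each string and drops empty ones;
# separator is placed between the remaining pieces.
joined = soup_text.get_text(separator="|", strip=True)

# stripped_strings yields the same trimmed pieces as a generator.
pieces = list(soup_text.stripped_strings)

print(raw)
print(joined)
print(pieces)
```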



# Kinds of objects
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment. These objects represent the HTML elements that comprise the page.
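
To make the four kinds concrete, here is a minimal sketch that parses a tiny fragment and checks the type of each piece. The markup and the names `kinds_soup`, `text`, and `note` are assumptions chosen for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Illustrative fragment: one tag holding a string and a comment.
kinds_soup = BeautifulSoup("<p>Hello<!-- a note --></p>", "html.parser")

p = kinds_soup.p
text, note = p.contents  # the two children of <p>, in document order

print(type(kinds_soup).__name__)  # the document as a whole
print(type(p).__name__)           # an element
print(type(text).__name__)        # a piece of text
print(type(note).__name__)        # a comment

# Comment is a subclass of NavigableString, so test for Comment first
# when the two need to be told apart.
print(isinstance(note, NavigableString))
```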

class bs4.Tag

In [129]:
tag = soup.b
print(type(tag))
print(tag.name)


<class 'bs4.element.Tag'>
b


In [130]:
tag.name = 'Anjehs'
print(tag)

<Anjehs>The Dormouse's story</Anjehs>


# attrs
An HTML or XML tag may have any number of attributes. The tag <b id="boldest"> has an attribute “id” whose value is “boldest”. You can access a tag’s attributes by treating the tag like a dictionary:

In [131]:
tags1 = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
tags1['id']


'boldest'

In [132]:
print(tags1.attrs)
tags1.attrs.keys()

{'id': 'boldest'}


dict_keys(['id'])

You can add, remove, and modify a tag’s attributes. Again, this is done by treating the tag as a dictionary:

In [133]:
# Modify an existing attribute and add a new one
tags1['id'] = 'Anjesh Sahani'
tags1['another-attribute'] = 1
print(tags1)

# Delete both attributes
del tags1['id']
del tags1['another-attribute']
print(tags1)


<b another-attribute="1" id="Anjesh Sahani">bold</b>
<b>bold</b>


In [95]:

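# After the deletions above, the tag no longer has an 'id' attribute:
# tags1['id']      # raises KeyError: 'id'
# tags1.get('id')  # returns None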
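
Because a tag behaves like a dictionary, looking up a missing attribute with square brackets raises `KeyError`, while `get()` returns `None` or a default. The sketch below shows both on a fresh tag; the name `bare_tag` and the `"n/a"` default are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# A tag with no attributes at all.
bare_tag = BeautifulSoup("<b>bold</b>", "html.parser").b

# Square-bracket access fails loudly when the attribute is missing.
try:
    bare_tag["id"]
except KeyError:
    print("no id attribute")

# get() fails quietly: None by default, or a fallback of your choice.
print(bare_tag.get("id"))
print(bare_tag.get("id", "n/a"))

# has_attr() answers the question directly.
print(bare_tag.has_attr("id"))
```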

# Multi-valued attributes
HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is class (that is, a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. By default, Beautiful Soup stores the value(s) of a multi-valued attribute as a list:

In [134]:
css_soup = BeautifulSoup('<p class="body"></p>', 'html.parser')
print(css_soup.p['class'])
# ['body']

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
print(css_soup.p['class'])


['body']
['body', 'strikeout']


In [135]:
rel_soup = BeautifulSoup('<p>Back to the <a rel="index first">homepage</a></p>', 'html.parser')
print(rel_soup.a['rel'])
# ['index', 'first']

rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)
# <p>Back to the <a rel="index contents">homepage</a></p>

['index', 'first']
<p>Back to the <a rel="index contents">homepage</a></p>


In [136]:
# If an attribute looks like it has more than one value, but it’s not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup stores it as a simple string:

id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
id_soup.p['id']
# 'my id'

'my id'

In [137]:
# You can force all attributes to be stored as strings by passing multi_valued_attributes=None as a keyword argument into the BeautifulSoup constructor:

no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser', multi_valued_attributes=None)
no_list_soup.p['class']
# 'body strikeout'

'body strikeout'

In [138]:
# You can use get_attribute_list to always return the value in a list container, whether it’s a string or multi-valued attribute value:

id_soup.p['id']
# 'my id'
id_soup.p.get_attribute_list('id')
# ["my id"]

['my id']

In [139]:
from bs4.builder import builder_registry
builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES

{'*': ['class', 'accesskey', 'dropzone'],
 'a': ['rel', 'rev'],
 'link': ['rel', 'rev'],
 'td': ['headers'],
 'th': ['headers'],
 'form': ['accept-charset'],
 'object': ['archive'],
 'area': ['rel'],
 'icon': ['sizes'],
 'iframe': ['sandbox'],
 'output': ['for']}

In [140]:
# class bs4.NavigableString
# A tag can contain strings as pieces of text. Beautiful Soup uses the NavigableString class to contain these pieces of text:

soupNavi = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soupNavi.b
print(tag.string)
# 'Extremely bold'

type(tag.string)
# <class 'bs4.element.NavigableString'>
# A NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree. You can convert a NavigableString to a Unicode string with str:

unicode_string = str(tag.string)
print(unicode_string)
 
# 'Extremely bold'
type(unicode_string)
# <class 'str'>
# You can’t edit a string in place, but you can replace one string with another, using replace_with():

tag.string.replace_with("No longer bold")
print(tag)
# <b class="boldest">No longer bold</b>

Extremely bold
Extremely bold
<b class="boldest">No longer bold</b>


In [141]:
# The "xml" parser requires the lxml package to be installed.
doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document>", "xml")
footer = BeautifulSoup("<footer>Here's the footer Design By --Anjesh Sahani</footer>", "xml")
doc.find(string="INSERT FOOTER HERE").replace_with(footer)


print(doc)

<?xml version="1.0" encoding="utf-8"?>
<document><content/><footer>Here's the footer Design By --Anjesh Sahani</footer></document>
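
The same placeholder replacement also works without lxml. The following is a sketch under stated assumptions: it uses the built-in `html.parser`, builds the replacement with `new_tag()` instead of parsing a second document, and the names `page` and `footer_tag` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative page with a placeholder string to swap out.
page = BeautifulSoup("<div>INSERT FOOTER HERE</div>", "html.parser")

# Build the replacement element in the same soup.
footer_tag = page.new_tag("footer")
footer_tag.string = "Here's the footer"

# Find the placeholder text node and replace it with the new tag.
page.find(string="INSERT FOOTER HERE").replace_with(footer_tag)

print(page)
```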


In [143]:
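# A comment in the markup is parsed into a Comment object, a special type of NavigableString:
markup = "<b><!--Hey, Anjesh, how are you doing. Want to buy a used parser?--></b>"
soup1 = BeautifulSoup(markup, 'html.parser')
comment = soup1.b.string
type(comment)
# <class 'bs4.element.Comment'>
print(comment)
print(soup1.b.prettify())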

Hey, Anjesh, how are you doing. Want to buy a used parser?
<b>
 <!--Hey, Anjesh, how are you doing. Want to buy a used parser?-->
</b>
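
A common follow-up is collecting or removing every comment in a document. The sketch below assumes a small illustrative fragment and uses a function as the `string` filter of `find_all()`; the names `comment_soup`, `comments`, and `texts` are not from the notebook above.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Illustrative markup containing two comments.
comment_soup = BeautifulSoup(
    "<div><!-- first --><p>text<!-- second --></p></div>", "html.parser"
)

# The filter function is called with each string in the tree;
# only Comment instances are kept.
comments = comment_soup.find_all(string=lambda s: isinstance(s, Comment))
texts = [c.strip() for c in comments]
print(texts)

# extract() removes each comment from the tree.
for c in comments:
    c.extract()
print(comment_soup)
```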



In [145]:
# soup1 contains only a <b> tag, so there is no <head> and the lookup returns None
print(soup1.head)

None
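
Since a missing tag comes back as `None`, chaining further attribute access onto it raises `AttributeError`. One way to guard against that is sketched below; `title_text` is a hypothetical helper written for this example, not a Beautiful Soup function.

```python
from bs4 import BeautifulSoup


def title_text(soup):
    """Return the text of the first <title> tag, or None if there is none.

    Hypothetical helper for illustration only.
    """
    title = soup.find("title")
    return title.get_text() if title is not None else None


with_title = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)
without_title = BeautifulSoup("<b>bold</b>", "html.parser")

print(title_text(with_title))
print(title_text(without_title))
```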
