# Installing a Parser
By default, Beautiful Soup supports the HTML parser included in Python’s standard library (html.parser). It also supports several third-party Python parsers, such as lxml and html5lib, which can be installed with pip:

$ pip install lxml

$ pip install html5lib

https://www.crummy.com/software/BeautifulSoup/bs4/doc/
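
Because lxml and html5lib are optional, a notebook may run on a machine where they are missing. The following is a minimal sketch of one way to fall back to the built-in parser; the helper name `make_soup` and the preference order are assumptions made for illustration, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser in `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # Beautiful Soup raises FeatureNotFound when the requested parser is not installed
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>Hello</p>")
print(used)            # name of the parser that was actually used
print(soup.p.string)   # text inside the <p> tag
```

Different parsers can build slightly different trees from the same invalid markup, so it is usually safer to name one parser explicitly once you know which ones are installed.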

In [None]:
from bs4 import BeautifulSoup
import requests

In [None]:
url = "https://www.tutorialspoint.com/index.htm"
req = requests.get(url)

In [None]:
req.status_code
req.text
print(req.text)

In [None]:
req.headers

In [None]:
soup = BeautifulSoup(req.text, "html.parser")
soup

In [None]:
soup.title # getting title with tag

In [None]:
soup.title.text # getting only title text

In [None]:
soup.find('title') # return single tag

In [None]:
soup.find_all('title') # return a list of all matching tags

In [None]:
soup.find('title').string # text / string 

In [None]:
soup.head  # getting head of the HTML

In [None]:
soup.find_all('head') # return a list of all <head> tags

In [None]:
soup.body   # getting body of the HTML

In [None]:
soup.link # return single link 

In [None]:
soup.find_all('link') # return a list of all <link> tags

In [None]:
soup.link.attrs # return the first <link> tag's attributes as a dictionary

In [None]:
soup.link.attrs['href'] # return only the href value
   

In [None]:
for link in soup.find_all('a'): # extract all the URLs within a webpage
    print(link.get('href'))

In [None]:
for link in soup.find_all('a'): # extract all the <a> tags within a webpage
    print(link)

# Step-1
Another way is to pass the document in through an open filehandle.

In [None]:
with open("BCB.html") as fp:
    B_soup = BeautifulSoup(fp, "html.parser") # name the parser explicitly so the result doesn't depend on what is installed
    print(B_soup)
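
BCB.html is a local file that may not exist on your machine. As a self-contained sketch of the same idea, the example below first writes a small temporary HTML file and then parses it through an open filehandle. The markup and the variable names (`file_soup`, `path`) are made up for illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup

html = "<html><head><title>Demo</title></head><body><nav class='menu'>Home</nav></body></html>"

# Write the markup to a temporary file so there is something to open
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False, encoding="utf-8") as tmp:
    tmp.write(html)
    path = tmp.name

try:
    # Pass the open filehandle straight to the constructor
    with open(path, encoding="utf-8") as fp:
        file_soup = BeautifulSoup(fp, "html.parser")
finally:
    os.remove(path)  # clean up the temporary file

print(file_soup.title.string)  # Demo
print(file_soup.nav["class"])  # ['menu']
```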

In [None]:
B_soup.nav # access a specific tag

In [None]:
a=B_soup.nav  # access class of a specific tag
a['class']

In [None]:
t=B_soup.html # tag type
type(t)

In [None]:
t.name # Return tag name

In [None]:
t.name = "Rasel" # Change the tag name
t

In [None]:
B_soup.title

In [None]:
B_soup.head

In [None]:
B_soup.body

In [None]:
B_soup.find_all('link') # return multiple link

In [None]:
for link in B_soup.find_all('a'): # extract all the <a> tags within the file
    print(link)

| Tag Objects |

An HTML tag is used to define various types of content. A tag object in BeautifulSoup corresponds to an HTML or XML tag in the actual page or document.

Tags have a lot of attributes and methods; the two most important features of a tag are its name and its attributes.

In [None]:
soup = BeautifulSoup('<b class="boldest">TutorialsPoint</b>', 'lxml') # lxml wraps the fragment in <html><body>, which the soup.html cell below relies on
soup

In [None]:
tag2 = soup.b  # access class of a specific tag
tag2['class']

In [None]:
tag = soup.html
type(tag)

In [None]:
tag

In [None]:
tag.name # Return tag name

In [None]:
tag.name = "Rasel" # Change the tag name
tag

| Attributes (tag.attrs) |

A tag object can have any number of attributes. The tag <b class="boldest"> has an attribute “class” whose value is “boldest”. You can access an attribute by treating the tag like a dictionary (for example tag['class'], as in the example above), or get the whole attribute dictionary through “.attrs”.

In [None]:
soup = BeautifulSoup("<div class='tutorial'>Hello, Bangladesh!</div>",'lxml')
tag2 = soup.div
tag2['class']

In [None]:
type(soup) # checking type

In [None]:
soup.name # checking name

In [None]:
soup.string # To access the contents, use “.string” 

In [None]:
type(soup.string) # checking type

In [None]:
soup.string.replace_with("Hello World!") # you can’t edit a string in place, but you can replace it with another string

In [None]:
soup.div['class'] # access class name

We can make all kinds of modifications to a tag’s attributes (add, remove, modify).

In [None]:
tag2['class']= 'Online-Learning' # Change class name
tag2['class']

In [None]:
soup.div['class'] = 'Online-Learning' # same | Change class name

In [None]:
tag2['style']='2007' # Adding a style attribute to the tag
tag2

In [None]:
del tag2['style'] # delete tag style
tag2

In [None]:
del soup.div['style'] # Same | delete tag style
soup

In [None]:
del tag2['class']
tag2
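
The cells above mutate the same tag step by step, so their results depend on execution order. As a compact, self-contained recap, here is a small sketch of reading, adding and deleting attributes; the markup is invented for illustration. It also shows `.get()` and `.has_attr()`, which avoid a `KeyError` when an attribute is missing.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div class='tutorial' id='intro'>Hello</div>", "html.parser")
div = soup.div

print(div["class"])           # ['tutorial']  (class is multi-valued, so it is a list)
print(div["id"])              # intro
print(div.get("style"))       # None, because the attribute is missing
print(div.has_attr("style"))  # False

div["style"] = "color: red"   # add an attribute
del div["id"]                 # remove an attribute
print(div.attrs)              # {'class': ['tutorial'], 'style': 'color: red'}
```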

| Comments |

In [None]:
soup = BeautifulSoup('<p><!-- Everything inside it is COMMENTS --></p>', 'html.parser')
comment = soup.p.string # return comment
comment

In [None]:
type(comment) # return type

In [None]:
print(soup.p.prettify()) # return comment with tag

# Step-2
Here’s an HTML document I’ll be using as an example throughout this document.

In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
html_doc = """
<html>
<head>
<title>The Dormouse's story</title>
</head>

<body>
<p class="title">
<b> The Dormouse's story </b>
</p>

<p class="story">

 Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;

and they lived at the bottom of a well.

</p>

<p class="story">... </p>
"""

In [None]:
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup)

In [None]:
print(soup.prettify()) # Return the whole document as nicely indented HTML

In [None]:
soup.title # return title

In [None]:
soup.title.name    # return title name

In [None]:
soup.title.string #  return title string 

In [None]:
soup.title.parent  # return parent

In [None]:
soup.title.parent.name #return parent name

In [None]:
soup.p # return whole p tag

In [None]:
soup.find_all('p') # return all p tag

In [None]:
soup.p['class'] # return class of p tag 

In [None]:
soup.a # return a tag 

In [None]:
soup.find_all('a') # return all a tag

In [None]:
soup.find(id="link3") # return the whole tag with id="link3"

In [None]:
# extracting all the URLs found within a page’s <a> tags

for link in soup.find_all('a'):
    print(link.get('href'))

In [None]:
# extracting all the text from a page:
print(soup.get_text())
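
The two extraction cells above can be combined into structured results. The sketch below, on a small made-up snippet, collects each link's text and URL into a dictionary and shows the `separator` and `strip` arguments of `get_text()`, which help when the raw text contains a lot of whitespace.

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li><a href="http://example.com/elsie">Elsie</a></li>
  <li><a href="http://example.com/lacie">Lacie</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Map each link's visible text to its href value
links = {a.get_text(): a.get("href") for a in soup.find_all("a")}
print(links)

# Join the text pieces with ", " and strip surrounding whitespace from each piece
text = soup.get_text(", ", strip=True)
print(text)  # Elsie, Lacie
```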

# Task-1: | Making the soup |

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

Beautiful Soup then parses the document using the best available parser. It will use an HTML parser unless you specifically tell it to use an XML parser. (See Parsing XML.)

In [None]:
with open("BCB.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser') # make the soup from an open filehandle
soup

In [None]:
soup = BeautifulSoup("<html>a web page</html>", 'html.parser')  # same, but from a string
soup

In [None]:
# HTML entities in the document are converted to Unicode characters
print(BeautifulSoup("<html><head></head><body>Sacr&eacute; bleu!</body></html>", "html.parser"))
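
The constructor also accepts bytes, for example the raw body of an HTTP response. The sketch below parses the same sentence once from UTF-8 bytes and once from a string containing an HTML entity. Passing `from_encoding` is optional; it is used here only to make the encoding explicit instead of relying on detection.

```python
from bs4 import BeautifulSoup

raw = "<p>Sacré bleu!</p>".encode("utf-8")  # bytes, as they might arrive over the network

soup_from_bytes = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")
soup_from_entity = BeautifulSoup("<p>Sacr&eacute; bleu!</p>", "html.parser")

print(soup_from_bytes.p.string)
print(soup_from_entity.p.string)
print(soup_from_bytes.original_encoding)  # the encoding Beautiful Soup decoded the bytes with
```

Both soups end up holding the same Unicode text, whether the accented character came in as encoded bytes or as an entity.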

| Parsing XML | 

In [None]:
soup = BeautifulSoup("<html>a web page</html>", 'xml')  # parse the document as XML (requires lxml)
soup

# Task-2: | Kinds of objects |

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.
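
As a quick orientation before the details, the following sketch builds a tiny document (invented for illustration) that contains all four kinds of objects and prints the type of each.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

soup = BeautifulSoup("<p>Hello<!-- note --></p>", "html.parser")
p = soup.p
text, comment = p.contents  # the <p> tag has two children: a string and a comment

for obj in (soup, p, text, comment):
    print(type(obj).__name__)
# BeautifulSoup
# Tag
# NavigableString
# Comment

# A Comment is a special kind of NavigableString,
# and a BeautifulSoup object can mostly be treated as a Tag
print(isinstance(comment, NavigableString))  # True
print(isinstance(soup, Tag))                 # True
```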

| Tag |

A Tag object corresponds to an XML or HTML tag in the original document:

Tags have a lot of attributes and methods, and I’ll cover most of them in Navigating the tree and Searching the tree. For now, the most important features of a tag are its name and attributes.

In [None]:
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)

| Name | 

In [None]:
tag.name #Every tag has a name, accessible as .name:

In [None]:
tag.name = "Rasel" # If you change a tag’s name, the change will be reflected in any HTML markup generated by Beautiful Soup:
tag

In [None]:
tag['class'] # Access only class name

| Attributes |

A tag may have any number of attributes. The tag <b id="boldest"> has an attribute “id” whose value is “boldest”. You can access a tag’s attributes by treating the tag like a dictionary:

In [None]:
tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
tag['id']

In [None]:
tag.attrs  # access that dictionary directly as .attrs:

You can add, remove, and modify a tag’s attributes. Again, this is done by treating the tag as a dictionary:

In [None]:
tag['id'] = 'verybold' # update id name
tag

In [None]:
tag['another-attribute'] = 1 # add another attribute
tag

In [None]:
del tag['another-attribute']
tag

In [None]:
tag.get('id') #  get id name

| Multi-valued attributes |

HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is class (that is, a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:

In [None]:
css_soup = BeautifulSoup('<p class="body"></p>', 'html.parser')
css_soup.p['class']   # return single value attributes

In [None]:
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.p['class'] # return multi-valued attributes

In [None]:
id_soup = BeautifulSoup('<p id="my name"></p>', 'html.parser')
id_soup.p['id'] # If an attribute looks like it has more than one value, but it’s not a multi-valued attribute as defined by any version of the HTML standard, Beautiful Soup will leave the attribute alone:

| Turn a tag back into a string |

When you turn a tag back into a string, multiple attribute values are consolidated:

In [None]:
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', 'html.parser')
rel_soup.a['rel']

In [None]:
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)

| multi_valued_attributes=None |

You can disable this by passing multi_valued_attributes=None as a keyword argument into the BeautifulSoup constructor:


In [None]:
no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser', multi_valued_attributes=None)
no_list_soup.p['class']

| get_attribute_list |

You can use get_attribute_list to get a value that’s always a list, whether or not it’s a multi-valued attribute:

In [None]:
id_soup.p.get_attribute_list('id')
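
The multi-valued attribute cells above are spread over several soups. As a self-contained sketch, the example below puts the three behaviours side by side on a single made-up tag: the default list for `class`, the untouched string for `id`, and the effect of `multi_valued_attributes=None`.

```python
from bs4 import BeautifulSoup

markup = '<p class="body strikeout" id="my name"></p>'

p = BeautifulSoup(markup, "html.parser").p
print(p["class"])                   # ['body', 'strikeout']  (multi-valued, so a list)
print(p["id"])                      # my name                (not multi-valued, so a string)
print(p.get_attribute_list("id"))   # ['my name']            (always a list)

# With multi_valued_attributes=None every attribute stays a plain string
plain = BeautifulSoup(markup, "html.parser", multi_valued_attributes=None).p
print(plain["class"])               # body strikeout
```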

| Note |

If you parse a document as XML, there are no multi-valued attributes:

In [None]:
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']

| multi_valued_attributes |

Again, you can configure this using the multi_valued_attributes argument:

In [None]:
class_is_multi= { '*' : 'class'}
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml', multi_valued_attributes=class_is_multi)
xml_soup.p['class']

| Note | 

You probably won’t need to do this, but if you do, use the defaults as a guide. They implement the rules described in the HTML specification:

In [None]:
from bs4.builder import builder_registry
builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES

# Task-3: | NavigableString |
A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:

In [None]:
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
tag.string

In [None]:
type(tag.string) # checking type

| Note |

A NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree. You can convert a NavigableString to a Unicode string with str:

In [None]:
unicode_string = str(tag.string) # convert a NavigableString to a Unicode string with str:
unicode_string

In [None]:
type(unicode_string) # checking type

| Note:

You can’t edit a string in place, but you can replace one string with another, using replace_with():

In [None]:
tag.string.replace_with("No longer bold") # replace one string with another
tag

# Task-4: | BeautifulSoup |
The BeautifulSoup object represents the parsed document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in Navigating the tree and Searching the tree.

You can also pass a BeautifulSoup object into one of the methods defined in Modifying the tree, just as you would a Tag. This lets you do things like combine two parsed documents:

In [None]:
doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document>", "xml")
doc

In [None]:
footer = BeautifulSoup("<footer>Here's the footers </footer>", "xml")
footer

In [None]:
doc.find(string="INSERT FOOTER HERE").replace_with(footer) # string= is the current name of the older text= argument
print(doc)

| Note:

Since the BeautifulSoup object doesn’t correspond to an actual HTML or XML tag, it has no name and no attributes. But sometimes it’s useful to look at its .name, so it’s been given the special .name “[document]”:

In [None]:
doc.name
footer.name
soup.name

# Task-5: | Comments and special strings |
Tag, NavigableString, and BeautifulSoup cover almost everything you’ll see in an HTML or XML file, but there are a few leftover bits. The main one you’ll probably encounter is the comment:

The Comment object is just a special type of NavigableString:

In [None]:
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser') # Return comment string 
comment = soup.b.string
comment

In [None]:
type(comment)

In [None]:
print(soup.b.prettify()) # Comment is displayed with special formatting:
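
Because Comment is a subclass of NavigableString, comments can be told apart from ordinary text with `isinstance`. The sketch below, on made-up markup, collects every comment in a document by passing a function as the `string` filter of `find_all()`.

```python
from bs4 import BeautifulSoup, Comment

markup = "<div><!-- first --><p>text<!-- second --></p></div>"
soup = BeautifulSoup(markup, "html.parser")

# The filter function receives each string in the tree; keep only the comments
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

print([c.strip() for c in comments])  # ['first', 'second']
```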

# Task-6: | Navigating the tree |

I’ll use this as an example to show you how to move from one part of a document to another.

In [None]:
# Here’s the “Three sisters” HTML document again:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [None]:
soup = BeautifulSoup(html_doc, 'html.parser')

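| Note:

Tags may contain strings and other tags; these elements are the tag’s children. Beautiful Soup strings don’t support any of the child-navigation attributes described below, because a string can’t have children.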

| Navigating using tag names |

The simplest way to navigate the parse tree is to say the name of the tag you want. 

In [None]:
soup.head # If you want the <head> tag, just say soup.head:

In [None]:
soup.title # return title

In [None]:
soup.title.string # return title string

In [None]:
soup.body.b

In [None]:
soup.find_all('a') # return all <a> tags

In [None]:
for b in soup.find_all('a'): # return link of a tag
    print(b.get('href'))

| .contents and .children |

A tag’s children are available in a list called .contents:

The .contents and .children attributes only consider a tag’s direct children. For instance, the <head> tag has a single direct child–the <title> tag:

In [None]:
soup.head.contents # return content of head

In [None]:
soup.body.contents # return content of body

In [None]:
st=soup.head.contents[0] # first child of <head>: the <title> tag
st

In [None]:
st.contents # children of the <title> tag: a list holding its string

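| Note:

A string does not have .contents, because it can’t contain anything.

The BeautifulSoup object itself has children. In this case, the <html> tag is a child of the BeautifulSoup object. Because html_doc above begins with a line break, html.parser also keeps a leading newline string as a top-level child, so the count below may be greater than 1: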

In [None]:
len(soup.contents)

In [None]:
# Instead of getting them as a list, you can iterate over a tag’s children using the .children generator:
for child in soup.head.contents[0].children:
    print(child)     # children of the <title> tag (the first child of <head>)

In [None]:
for child in soup.head.descendants: # return all descendants of head
    print(child)

In [None]:
for child in soup.body.descendants: # return all descendants of body
    print(child)

| Note:

The <head> tag has only one child, but it has two descendants: the <title> tag and the <title> tag’s child. The BeautifulSoup object has very few direct children (the <html> tag, plus any top-level whitespace string), but it has a whole lot of descendants:

In [None]:
len(list(soup.children)) # number of direct children of the BeautifulSoup object

In [None]:
len(list(soup.descendants)) # number of descendants of the BeautifulSoup object
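
To make the difference between children and descendants concrete, here is a small sketch on a trimmed-down version of the document that contains only the <head> part.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>", "html.parser"
)
head = soup.head

children = list(head.children)        # direct children only
descendants = list(head.descendants)  # children, grandchildren, and so on

print(len(children))     # 1: the <title> tag
print(len(descendants))  # 2: the <title> tag and the string inside it
print(head.contents == children)  # True: .contents is the same sequence as a list
```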

| .string |

In [None]:
soup.head.string

In [None]:
soup.b.string

In [None]:
print(soup.html.string) # If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None:

| .strings and stripped_strings |

If there’s more than one thing inside a tag, you can still look at just the strings. Use the .strings generator:

These strings tend to have a lot of extra whitespace, which you can remove by using the .stripped_strings generator instead:

In [None]:
for string in soup.strings: # all strings in the document
    print(repr(string))

In [None]:
for string in soup.stripped_strings: # all strings in the document, with surrounding whitespace removed
    print(repr(string))
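
The following sketch shows .string, .strings and .stripped_strings together on one small made-up paragraph, so the three results can be compared directly.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>  Once upon a time <b>three</b> sisters  </p>", "html.parser")
p = soup.p

print(p.string)                  # None: <p> contains more than one thing
print(p.b.string)                # three: <b> has exactly one string child
print(list(p.strings))           # ['  Once upon a time ', 'three', ' sisters  ']
print(list(p.stripped_strings))  # ['Once upon a time', 'three', 'sisters']
```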

| .parent and .parents |

Continuing the “family tree” analogy, every tag and every string has a parent: the tag that contains it.

You can access an element’s parent with the .parent attribute. In the example “three sisters” document, the <head> tag is the parent of the <title> tag:

In [None]:
soup.title.parent # return parent of title

In [None]:
soup.head.string.parent  # return parent of string

In [None]:
print(soup.html.parent)

In [None]:
html_tag = soup.html
type(html_tag.parent)

In [None]:
print(soup.parent) # the .parent of a BeautifulSoup object is defined as None:

In [None]:
link = soup.a # iterate over all of an element’s parents with .parents
for parent in link.parents:
    print(parent.name)
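
As a self-contained sketch of the same walk, the example below collects the names of all ancestors of an <a> tag in a minimal made-up document. The last entry is the BeautifulSoup object itself, whose special name is “[document]”.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><a>x</a></p></body></html>", "html.parser")

# Walk upward from the <a> tag to the top of the tree
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']

print(soup.parent)     # None: the BeautifulSoup object has no parent
```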

| Going sideways |

The <b> tag and the <c> tag are at the same level: they’re both direct children of the same tag. We call them siblings. 

In [None]:
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'html.parser')
print(sibling_soup.prettify())

| .next_sibling and .previous_sibling  |

You can use .next_sibling and .previous_sibling to navigate between page elements that are on the same level of the parse tree:

In [None]:
sibling_soup.b.next_sibling # from tag <b> to its next sibling, tag <c>

In [None]:
sibling_soup.c.previous_sibling # from tag <c> back to its previous sibling, tag <b>

|| Note: 

The <b> tag has a .next_sibling, but no .previous_sibling, because there’s nothing before the <b> tag on the same level of the tree. For the same reason, the <c> tag has a .previous_sibling but no .next_sibling:

In [None]:
print(sibling_soup.b.previous_sibling)
print(sibling_soup.c.next_sibling)

|| Note: 

The strings “text1” and “text2” are not siblings, because they don’t have the same parent:

In [None]:
print(sibling_soup.b.string.next_sibling)

sibling_soup.b.string

In [None]:
link = soup.a
link

link.next_sibling # next sibling of the first <a> tag: the string ',\n', not the next <a> tag

In [None]:
link.next_sibling.next_sibling # the sibling after that string: the second <a> tag
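
In real documents the next sibling of a tag is often a string of punctuation or whitespace, as the cells above show. One way to skip over such strings is `find_next_sibling()`, which belongs to the search methods and returns the next sibling that matches a filter. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a></p>', "html.parser"
)
first = soup.a

print(repr(first.next_sibling))      # ',\n': the text between the two tags
second = first.find_next_sibling("a")
print(second)                        # <a id="link2">Lacie</a>
```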

| .next_siblings and .previous_siblings |

You can iterate over a tag’s siblings with .next_siblings or .previous_siblings:

In [None]:
for sibling in soup.a.next_siblings: # iterate next_siblings
    print(repr(sibling))

In [None]:
for sibling in soup.find(id="link3").previous_siblings: # iterate previous_siblings
    print(repr(sibling))

| .next_element and .previous_element | 

The .next_element attribute of a string or tag points to whatever was parsed immediately afterwards. It might be the same as .next_sibling, but it’s usually drastically different.

In [None]:
last_a_tag = soup.find("a", id="link3")
last_a_tag

In [None]:
last_a_tag.next_sibling

In [None]:
last_a_tag.next_element # .next_element of that <a> tag, the thing that was parsed immediately after the <a> tag, is not the rest of that sentence: it’s the word “Tillie”

In [None]:
last_a_tag.previous_element # return previous_element of a tag

In [None]:
last_a_tag.previous_element.next_element # .previous_element and .next_element are inverses, so this returns last_a_tag itself
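
The difference between .next_sibling and .next_element is easiest to see on a very small tree. In the sketch below (markup invented for illustration), the sibling of the <a> tag is the rest of the sentence, while the next element is the string inside the tag, because that is what was parsed immediately after the opening <a>.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><a>Tillie</a>; and they lived</p>", "html.parser")
a = soup.a

print(repr(a.next_sibling))   # '; and they lived': same level of the tree
print(repr(a.next_element))   # 'Tillie': parsed right after the <a> start tag
print(a.previous_element.name)                  # p: parsed right before the <a> tag
print(a.previous_element.next_element is a)     # True: the two attributes are inverses
```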

| .next_elements and .previous_elements |

In [None]:
for element in last_a_tag.next_elements: # iterate next_elements  
    print(repr(element))

In [None]:
for element in last_a_tag.previous_elements: # iterate previous_elements
    print(repr(element))

# Task-7: | Searching the tree |

In [159]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

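|| Note:

The simplest filter is a string: pass a string to a search method and Beautiful Soup matches it against tag names exactly. If you pass in a byte string, Beautiful Soup will assume the string is encoded as UTF-8. You can avoid this by passing in a Unicode string instead.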

In [161]:
soup.find_all('b')  #  This code finds all the <b> tags in the document:

[<b>The Dormouse's story</b>]
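
Besides a single tag name, `find_all()` accepts several other kinds of filters. The sketch below shows a few common ones on a shortened version of the “three sisters” document; the variable names are chosen for illustration only.

```python
from bs4 import BeautifulSoup

html = """<p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a></p>"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("b")                   # a string matches tag names exactly
by_list = soup.find_all(["a", "b"])            # a list matches any of the names
by_class = soup.find_all("p", class_="story")  # class_ filters on the CSS class
by_id = soup.find_all(id="link2")              # keyword arguments filter on attributes
limited = soup.find_all("a", limit=2)          # limit stops after that many results

print(len(by_name), len(by_list), len(by_class), len(by_id), len(limited))  # 1 4 1 1 2
print(by_id[0].string)  # Lacie
```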