# <center>Quick Start with Beautiful Soup</center>

### Installing Beautiful Soup

In [224]:
!pip install beautifulsoup4





### Import modules

In [225]:
from bs4 import BeautifulSoup
import requests

### Making the soup

In [226]:
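# Pass in a string, here the body of an HTTP response
response = requests.get('http://data.pr4e.org/romeo.txt')
response.raise_for_status()  # fail early on an HTTP error status
html = response.text
soup = BeautifulSoup(html, 'html.parser')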

In [227]:
# Pass in an open file handle
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

In [228]:
print(soup.prettify())

<html>
 <head>
  <title>
   Hello, Data Engineer
  </title>
 </head>
 <body>
  <p class="title">
   Hello
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   and they lived at the bottom of a well.
  </p>
  <p class="story">
  </p>
 </body>
</html>
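
The file `index.html` is not included here, so the cells that follow cannot be re-run as they stand. As a rough sketch, a similar document can be built from an inline string. The markup below is reconstructed from the prettified output above, so its whitespace will differ from the original file, and the names `html_doc` and `inline_soup` are placeholders rather than part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for index.html, reconstructed from the output above.
html_doc = """<html><head><title>Hello, Data Engineer</title></head>
<body>
<p class="title">Hello <b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.</p>
</body></html>"""

# Parse the string with the same built-in parser used in the notebook.
inline_soup = BeautifulSoup(html_doc, "html.parser")

print(inline_soup.title.string)                      # text inside <title>
print([a["id"] for a in inline_soup.find_all("a")])  # ids of the three links
```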


### Kinds of objects

- **Tag**

In [229]:
# Get element by tag
tag = soup.title
print(tag)

<title>Hello, Data Engineer</title>


- **Name**

In [230]:
# Get name of element
tag.name

'title'

In [231]:
# Change name of tag
tag.name = 'new_title'
print(tag)

<new_title>Hello, Data Engineer</new_title>


- **Attributes**

In [232]:
# Find every <p> tag whose CSS class is "title";
# calling soup(...) is shorthand for soup.find_all(...)
p = soup('p', 'title')
p

[<p class="title">
           Hello
          <b>
          The Dormouse's story
          </b>
 </p>]

In [233]:
# A tag's attributes are stored in a dictionary, available as .attrs
a = soup.a
a_attrs = a.attrs
print(type(a_attrs), a_attrs)

<class 'dict'> {'class': ['sister'], 'href': 'http://example.com/elsie', 'id': 'link1'}
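
The output above shows `class` as a list while `href` and `id` are plain strings. The sketch below illustrates that difference on a made-up one-line snippet (the extra class `big` and the name `snippet` are assumptions for illustration), and also shows `.get()`, which returns `None` for a missing attribute instead of raising `KeyError`.

```python
from bs4 import BeautifulSoup

# Made-up snippet: one link with two CSS classes.
snippet = BeautifulSoup(
    '<a class="sister big" href="http://example.com/elsie" id="link1">Elsie</a>',
    "html.parser",
)
link = snippet.a

print(link["class"])      # multi-valued attribute: a list of class names
print(link["id"])         # single-valued attribute: a plain string
print(link.get("title"))  # missing attribute: .get() returns None
```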


You can add, remove, and modify a tag’s attributes by treating the tag as a dictionary:

In [234]:
tag = soup.a
print(tag.attrs)
# Modify attributes by assigning to them like dictionary keys
tag['class'] = 'fpt'
tag['href'] = 'http://fpt.com.vn'

# Remove attributes with del; only id remains afterwards
del tag['class']
del tag['href']
tag

{'class': ['sister'], 'href': 'http://example.com/elsie', 'id': 'link1'}


<a id="link1">
         Elsie
         </a>

### Navigating the tree

Tags may contain ``strings`` and other ``tags``. These elements are the tag’s ``children``. Beautiful Soup provides many attributes for navigating and iterating over a tag’s children.
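
As a small illustration of that mix of strings and tags, the sketch below walks the direct children of a made-up `<p>` element and labels each one. The snippet and the variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up paragraph holding a string, a tag, and another string.
snippet = BeautifulSoup("<p>Hello <b>world</b>!</p>", "html.parser")

kinds = []
for child in snippet.p.children:
    # Each child is either a Tag or a NavigableString.
    kind = "tag" if isinstance(child, Tag) else "string"
    kinds.append(kind)
    print(kind, repr(str(child)))
```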

- **Navigating using tag names**

Using a tag name as an attribute will give you only the first tag by that name:

In [235]:
# Get head element
head = soup.head
print(head)
# Get title element
title = soup.new_title
print(title)
# Get tag b in body
b = soup.body.b
print(b)

<head><new_title>Hello, Data Engineer</new_title></head>
<new_title>Hello, Data Engineer</new_title>
<b>
         The Dormouse's story
         </b>


In [236]:
# Get all the <a> tags
list_a_tags = soup.find_all('a')
print(list_a_tags)

[<a id="link1">
         Elsie
         </a>, <a class="sister" href="http://example.com/lacie" id="link2">
         Lacie
         </a>, <a class="sister" href="http://example.com/tillie" id="link3">
         Tillie
         </a>]


- **``.contents`` and ``.children``**

In [237]:
contents_html = soup.head.contents
print(len(contents_html))
for content in contents_html:
    print(content.name, '<------')

1
new_title <------


In [238]:
# You can iterate over a tag’s children using the .children iterator
head_children = soup.head.children
print(head_children)
for child in head_children:
    print(child.name, '<----')

<list_iterator object at 0x000002134CD2F630>
new_title <----


In [239]:
for child in soup.head.descendants:
    print(type(child), child)

<class 'bs4.element.Tag'> <new_title>Hello, Data Engineer</new_title>
<class 'bs4.element.NavigableString'> Hello, Data Engineer


In [240]:
print(len(list(soup.children)))
print(len(list(soup.descendants)))


1
30
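
The gap between the two counts comes from `.children` covering only direct children while `.descendants` walks the whole subtree, strings included. A minimal sketch on a made-up snippet (not the notebook's document):

```python
from bs4 import BeautifulSoup

# Made-up snippet: <div> has one direct child, <p>, which has nested content.
snippet = BeautifulSoup("<div><p>one <b>two</b></p></div>", "html.parser")

direct = list(snippet.div.children)     # only the <p> tag
nested = list(snippet.div.descendants)  # <p>, "one ", <b>, "two"

print(len(direct), len(nested))
```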


- **``.strings`` and ``.stripped_strings``**

In [241]:
# Use the .strings generator:
for string in soup.body.strings:
    print(repr(string))

'\n'
'\n          Hello\n         '
"\n         The Dormouse's story\n         "
'\n'
'\n'
'\n         Once upon a time there were three little sisters; and their names were\n         '
'\n         Elsie\n         '
'\n'
'\n         Lacie\n         '
'\n         and\n         '
'\n         Tillie\n         '
'\n         and they lived at the bottom of a well.\n      '
'\n'
'\n'
'\n'


In [242]:
# You can remove whitespace by using the .stripped_strings generator
for string in soup.body.stripped_strings:
    print(repr(string))

'Hello'
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
'Elsie'
'Lacie'
'and'
'Tillie'
'and they lived at the bottom of a well.'
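
When the goal is a single cleaned-up string rather than a loop, the stripped pieces can be joined directly. The sketch below uses a made-up snippet and also shows `get_text()` with a separator and `strip=True`, which should give the same result here.

```python
from bs4 import BeautifulSoup

# Made-up snippet with the same kind of padding whitespace as the document above.
snippet = BeautifulSoup("<p>\n  Hello\n  <b>\n  world\n  </b>\n</p>", "html.parser")

joined = " ".join(snippet.stripped_strings)  # join the stripped pieces by hand
via_get_text = snippet.get_text(" ", strip=True)

print(repr(joined))
print(repr(via_get_text))
```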


## <center>Searching the tree</center>

### - A string

In [243]:
# Finds all the <b> tags in the document
soup.find_all('b')

[<b>
          The Dormouse's story
          </b>]

### - A regular expression

In [244]:
import re
# Match every tag whose name starts with "b"
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

body
b


### - A list

In [246]:
soup.find_all(["a", "b"])

[<b>
          The Dormouse's story
          </b>, <a id="link1">
          Elsie
          </a>, <a class="sister" href="http://example.com/lacie" id="link2">
          Lacie
          </a>, <a class="sister" href="http://example.com/tillie" id="link3">
          Tillie
          </a>]

### - A function

In [248]:
def has_class_but_no_id(tag):
    # True for tags that define a class attribute but no id attribute
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)

[<p class="title">
           Hello
          <b>
          The Dormouse's story
          </b>
 </p>, <p class="story">
          Once upon a time there were three little sisters; and their names were
          <a id="link1">
          Elsie
          </a>
 <a class="sister" href="http://example.com/lacie" id="link2">
          Lacie
          </a>
          and
          <a class="sister" href="http://example.com/tillie" id="link3">
          Tillie
          </a>
          and they lived at the bottom of a well.
       </p>, <p class="story">
 </p>]
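
For one-off filters, a lambda can stand in for a named function. The sketch below is an assumption-laden example on a made-up snippet: it keeps only `<a>` tags that actually carry an `href` attribute.

```python
from bs4 import BeautifulSoup

# Made-up snippet: one link with href, one without, and an unrelated tag.
snippet = BeautifulSoup('<a href="/x">x</a><a>y</a><p id="z">z</p>', "html.parser")

# The filter function receives each Tag and returns True to keep it.
links_with_href = snippet.find_all(
    lambda tag: tag.name == "a" and tag.has_attr("href")
)

print(links_with_href)
```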

## <center> ``.find_all()`` </center>
Signature: ``find_all(name, attrs, recursive, string, limit, **kwargs)``
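
The cells below exercise `name`, `attrs`, `string`, and keyword filters, but not `limit` or `recursive`. A minimal sketch of those two parameters on a made-up snippet (the markup and variable names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Made-up snippet: two <a> tags directly under <div>, one nested inside <p>.
snippet = BeautifulSoup(
    "<div><a>1</a><p><a>2</a></p><a>3</a></div>", "html.parser"
)

# limit stops the search after that many matches, in document order.
first_two = snippet.find_all("a", limit=2)

# recursive=False considers only direct children of the tag being searched.
direct_only = snippet.div.find_all("a", recursive=False)

print([a.string for a in first_two])
print([a.string for a in direct_only])
```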

In [249]:
# The <title> tag was renamed to <new_title> earlier, so search by the new name
soup.find_all("new_title")

[<new_title>Hello, Data Engineer</new_title>]

In [250]:
# A second positional string argument filters by CSS class
soup.find_all("p", "title")

[<p class="title">
           Hello
          <b>
          The Dormouse's story
          </b>
 </p>]

In [251]:
# The first <a> shows only its id because its class and href were deleted earlier
soup.find_all("a")

[<a id="link1">
          Elsie
          </a>, <a class="sister" href="http://example.com/lacie" id="link2">
          Lacie
          </a>, <a class="sister" href="http://example.com/tillie" id="link3">
          Tillie
          </a>]

In [252]:
# A keyword argument filters on the attribute of that name
soup.find_all(id="link2")

[<a class="sister" href="http://example.com/lacie" id="link2">
          Lacie
          </a>]

In [253]:
import re
# string= searches text nodes instead of tags; find() returns only the first match
soup.find(string=re.compile("sisters"))

'\n         Once upon a time there were three little sisters; and their names were\n         '

### Searching by CSS class

In [255]:
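# class_ is used because class is a reserved word in Python.
# Only two links match: the first <a> had its class attribute deleted earlier.
soup.find_all("a", class_="sister")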

[<a class="sister" href="http://example.com/lacie" id="link2">
          Lacie
          </a>, <a class="sister" href="http://example.com/tillie" id="link3">
          Tillie
          </a>]
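
Because `class` is a multi-valued attribute, `class_` matches a tag if any one of its classes equals the search value. A small sketch on a made-up snippet (the class names `body` and `strikeout` are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Made-up snippet: the first <p> has two classes, the second has one.
snippet = BeautifulSoup(
    '<p class="body strikeout">x</p><p class="body">y</p>', "html.parser"
)

struck = snippet.find_all("p", class_="strikeout")  # only the first <p>
bodies = snippet.find_all("p", class_="body")       # both <p> tags

print(len(struck), len(bodies))
```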