In [1]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [2]:
# 'html.parser'
# This is the name of the parser that BeautifulSoup uses to interpret the HTML.
# 'html.parser' ships with Python's standard library (no extra installation needed).
# Other parsers are also available:
# 'lxml' → faster, requires installing the lxml package
# 'html5lib' → parses most like a real browser, requires installing html5lib
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'html.parser')
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
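The parser names mentioned in the cell above can be tried in order of preference. The helper below is a minimal sketch, not part of Beautiful Soup: make_soup and its preferred argument are hypothetical names, and it relies on bs4 raising FeatureNotFound when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first usable parser."""
    for parser_name in preferred:
        try:
            return BeautifulSoup(markup, parser_name), parser_name
        except FeatureNotFound:
            # This parser is not installed; fall through to the next one.
            continue
    raise FeatureNotFound("none of the requested parsers is installed")


demo_soup, used = make_soup("<p>Hello</p>")
print(used)                # whichever parser was found first
print(demo_soup.p.string)  # Hello
```

Because 'html.parser' is always available, the loop ends there at the latest. Keep in mind that different parsers can build different trees from the same invalid markup.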



Here are some simple ways to navigate that data structure:

In [3]:
soup.title

<title>The Dormouse's story</title>

In [4]:
soup.title.name

'title'

In [5]:
soup.title.string

"The Dormouse's story"

In [6]:
soup.title.parent

<head><title>The Dormouse's story</title></head>

In [7]:
soup.title.parent.name

'head'

In [8]:
soup.p

<p class="title"><b>The Dormouse's story</b></p>

In [10]:
soup.p['class']

['title']

In [11]:
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [12]:
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [13]:
soup.find_all('p')

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

In [14]:
soup.find(id = 'link3')

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

In [15]:
soup.a['href']

'http://example.com/elsie'

One common task is extracting all the URLs found within a page’s ``<a>`` tags:



In [16]:
for link in soup.find_all('a'):
  print(link['href'])

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie


In [17]:
for link in soup.find_all('a'):
  print(link.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
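The two loops above print the same thing here because every ``<a>`` tag has an href. They differ when the attribute is missing: link['href'] raises KeyError, while link.get('href') returns None. A small sketch, using a made-up two-link snippet (link_soup and the markup are illustrative only):

```python
from bs4 import BeautifulSoup

# Illustrative markup: the second <a> tag has no href attribute.
markup = '<a href="http://example.com/elsie">Elsie</a><a id="top">Back to top</a>'
link_soup = BeautifulSoup(markup, "html.parser")

# .get() never raises; a missing attribute comes back as None.
hrefs = [a.get("href") for a in link_soup.find_all("a")]
print(hrefs)  # ['http://example.com/elsie', None]

# href=True keeps only the tags that actually have the attribute,
# so indexing with ['href'] is safe.
present = [a["href"] for a in link_soup.find_all("a", href=True)]
print(present)  # ['http://example.com/elsie']
```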


Another common task is extracting all the text from a page:



In [18]:
print(soup.get_text())


The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
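The output of get_text() keeps the whitespace of the source. If that is a problem, get_text() also accepts a separator and a strip flag. The snippet below is a small sketch on made-up markup (text_soup is an illustrative name):

```python
from bs4 import BeautifulSoup

snippet = "<p>Once upon a time <b>three</b> sisters</p><p>lived in a well.</p>"
text_soup = BeautifulSoup(snippet, "html.parser")

# Default: the strings are concatenated exactly as they appear.
raw = text_soup.get_text()
print(raw)  # Once upon a time three sisterslived in a well.

# Strip each string and join the pieces with a single space.
clean = text_soup.get_text(" ", strip=True)
print(clean)  # Once upon a time three sisters lived in a well.
```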



**Making the soup**


In [20]:
from bs4 import BeautifulSoup

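# Pass an open filehandle (this assumes index.html exists in the working directory)...
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

# ...or pass a string of markup directly.
soup = BeautifulSoup("<html>data</html>", 'html.parser')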

In [23]:
# First, the document is converted to Unicode, and HTML entities are converted to Unicode characters:

In [22]:
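# The output below comes from the lxml parser, which wraps the fragment in <html><body><p>.
BeautifulSoup("Sacr&eacute; bleu!", 'lxml')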

<html><body><p>Sacré bleu!</p></body></html>

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment.

In [38]:
# Tag
# A Tag object corresponds to an XML or HTML tag in the original document:
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)

bs4.element.Tag

In [39]:
soup.b

<b class="boldest">Extremely bold</b>

Tags have a lot of attributes and methods, and I’ll cover most of them in Navigating the tree and Searching the tree. For now, the most important features of a tag are its name and attributes.

In [40]:
# Name
# Every tag has a name, accessible as .name:
tag.name

'b'

In [41]:
# If you change a tag’s name, the change will be reflected in any HTML markup generated by Beautiful Soup:
tag.name = "blockquote"
tag

<blockquote class="boldest">Extremely bold</blockquote>

Attributes

A tag may have any number of attributes. The tag <b id="boldest"> has an attribute “id” whose value is “boldest”. You can access a tag’s attributes by treating the tag like a dictionary:

In [43]:
tag["class"]
tag.attrs

{'class': ['boldest']}

In [44]:
soup = BeautifulSoup('<b id="boldest">Extremely bold</b>', 'html.parser')

In [45]:
tag = soup.b
tag

<b id="boldest">Extremely bold</b>

In [46]:
tag["id"]

'boldest'

You can access that dictionary directly as .attrs:

In [47]:
tag.attrs

{'id': 'boldest'}

You can add, remove, and modify a tag’s attributes. Again, this is done by treating the tag as a dictionary:

In [48]:
tag["id"] = "verybold"
tag["hung"] = 1
tag

<b hung="1" id="verybold">Extremely bold</b>

In [49]:
del tag["hung"]
del tag["id"]
tag

<b>Extremely bold</b>

In [50]:
tag["id"]

KeyError: 'id'

In [51]:
print(tag.get("id"))

None
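The dictionary-style operations above can be collected into one short sketch. The names attr_soup and b_tag, and the data-note attribute, are illustrative only:

```python
from bs4 import BeautifulSoup

attr_soup = BeautifulSoup('<b id="boldest">Extremely bold</b>', "html.parser")
b_tag = attr_soup.b

# Modify an existing attribute and add a new one.
b_tag["id"] = "verybold"
b_tag["data-note"] = "example"
had_note = b_tag.has_attr("data-note")

# Remove the attribute again.
del b_tag["data-note"]

# .get() takes an optional default, just like dict.get().
fallback = b_tag.get("title", "no title")

print(b_tag)     # <b id="verybold">Extremely bold</b>
print(fallback)  # no title
```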


Multi-valued attributes

HTML 4 defines a few attributes that can have multiple values. HTML 5 removes a couple of them, but defines a few more. The most common multi-valued attribute is class (that is, a tag can have more than one CSS class). Others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup presents the value(s) of a multi-valued attribute as a list:

In [52]:
css_soup = BeautifulSoup('<p class="body"></p>', 'html.parser')
css_soup.p['class']

['body']

In [55]:
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.p["class"]

['body', 'strikeout']

Beautiful Soup only converts attributes that the HTML specification officially defines as multi-valued.

Other attributes are kept as plain strings, even if they contain several words separated by spaces.

This keeps the behavior consistent with the HTML standard and avoids mishandling values that are not really lists.

In [56]:
from bs4 import BeautifulSoup
html1 = '<div class="btn primary large">Test</div>'
soup1 = BeautifulSoup(html1,'html.parser')
print(soup1.div["class"])

['btn', 'primary', 'large']


In [57]:
html2 = '<div title="hello world test">Test</div>'
soup2 = BeautifulSoup(html2,'html.parser')
print(soup2.div['title'])

hello world test


In [58]:
html3 = '<div data-values="one two three">Test</div>'
soup3 = BeautifulSoup(html3,'html.parser')
print(soup3.div['data-values'])

one two three


When you turn a tag back into a string, multiple attribute values are consolidated:

In [59]:
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', 'html.parser')
rel_soup.a['rel']

['index']

In [60]:
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)

<p>Back to the <a rel="index contents">homepage</a></p>


In [61]:
rel_soup.a

<a rel="index contents">homepage</a>

You can disable this by passing multi_valued_attributes=None as a keyword argument into the BeautifulSoup constructor:

In [62]:
no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html', multi_valued_attributes=None)
no_list_soup.p['class']

'body strikeout'

You can use get_attribute_list() to get a value that’s always a list, whether or not it’s a multi-valued attribute:

In [65]:
id_soup = BeautifulSoup('<p id="body strikeout"></p>', 'html.parser')

In [66]:
id_soup.p.get_attribute_list('id')

['body strikeout']
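To see the difference side by side, the sketch below reads a multi-valued attribute (class) and a single-valued one (id) from the same tag. The markup and the name mv_soup are made up for illustration:

```python
from bs4 import BeautifulSoup

mv_soup = BeautifulSoup('<p class="body strikeout" id="main intro"></p>', "html.parser")
p_tag = mv_soup.p

# class is multi-valued in HTML, so it is split into a list.
print(p_tag["class"])  # ['body', 'strikeout']

# id is not multi-valued, so the value stays a single string.
print(p_tag["id"])  # main intro

# get_attribute_list() always hands back a list.
id_values = p_tag.get_attribute_list("id")
class_values = p_tag.get_attribute_list("class")
print(id_values)     # ['main intro']
print(class_values)  # ['body', 'strikeout']
```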

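If you parse a document as XML, no attributes are treated as multi-valued by default. You can change this by passing multi_valued_attributes a dictionary that maps tag names ('*' for every tag) to the attribute names that should be split:

In [68]:
class_is_multi = {'*': 'id'}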

In [69]:
id_soup = BeautifulSoup('<p id="body strikeout"></p>','xml', multi_valued_attributes=class_is_multi)
id_soup.p['id']

['body', 'strikeout']

You probably won’t need to do this, but if you do, use the defaults as a guide. They implement the rules described in the HTML specification:

In [70]:
from bs4.builder import builder_registry
builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES

{'*': {'accesskey', 'class', 'dropzone'},
 'a': {'rel', 'rev'},
 'link': {'rel', 'rev'},
 'td': {'headers'},
 'th': {'headers'},
 'form': {'accept-charset'},
 'object': {'archive'},
 'area': {'rel'},
 'icon': {'sizes'},
 'iframe': {'sandbox'},
 'output': {'for'}}

NavigableString

A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to contain these bits of text:

In [71]:
tag.string

'Extremely bold'

In [72]:
(type(tag.string))

bs4.element.NavigableString

A NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree. You can convert a NavigableString to a plain Python string with str():

In [79]:
unicode_string = str(tag.string)
unicode_string

'Extremely bold'

In [81]:
type(unicode_string)

str

Modifying a NavigableString:

Basic rule:
A NavigableString is immutable, meaning you cannot edit its contents in place. You can, however, replace the whole string with another one using the replace_with() method.

In [83]:
tag.string.replace_with("No longer bold")
tag

<b>No longer bold</b>
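The conversion and replacement steps above can be checked in one place. This is a small sketch with illustrative names (ns_soup, bold, old, plain):

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

ns_soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")
bold = ns_soup.b

old = bold.string  # a NavigableString that is still attached to the tree
plain = str(old)   # a plain str with no reference to the tree

# replace_with() swaps the string inside the tag; the old string is detached.
old.replace_with("No longer bold")

print(bold)          # <b>No longer bold</b>
print(type(plain))   # <class 'str'>
print(old.parent)    # None
```

The replacement text is stored as a NavigableString again, so bold.string can still be navigated afterwards.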

NavigableString supports most of the features described in Navigating the tree and Searching the tree, but not all of them. In particular, since a string can’t contain anything (the way a tag may contain a string or another tag), strings don’t support the .contents or .string attributes, or the find() method.

If you want to use a NavigableString outside of Beautiful Soup, you should call str() on it to turn it into a normal Python string. If you don’t, your string will carry around a reference to the entire Beautiful Soup parse tree, even when you’re done using Beautiful Soup. This is a big waste of memory.

BeautifulSoup

The BeautifulSoup object represents the parsed document as a whole. For most purposes, you can treat it as a Tag object. This means it supports most of the methods described in Navigating the tree and Searching the tree.

You can also pass a BeautifulSoup object into one of the methods defined in Modifying the tree, just as you would a Tag. This lets you do things like combine two parsed documents, as the footer example further down shows.

Comparison with Tag:

In [84]:
soup = BeautifulSoup('<div><p>Test</p></div>', 'html.parser')
print(type(soup))
print(soup.name)

<class 'bs4.BeautifulSoup'>
[document]


In [85]:
tag = soup.div
print(type(tag))
print(tag.name)

<class 'bs4.element.Tag'>
div


In [86]:
doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document>", "xml")
footer = BeautifulSoup("<footer>Here's the footer</footer>", "xml")
# string= replaces the deprecated text= argument
doc.find(string="INSERT FOOTER HERE").replace_with(footer)


'INSERT FOOTER HERE'

In [87]:
print(doc)

<?xml version="1.0" encoding="utf-8"?>
<document><content/><footer>Here's the footer</footer></document>
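The same combination works without the XML parser (which needs lxml). The sketch below repeats the idea with 'html.parser' on made-up markup. The names page and page_footer are illustrative:

```python
from bs4 import BeautifulSoup

page = BeautifulSoup("<div><p>Body</p>FOOTER</div>", "html.parser")
page_footer = BeautifulSoup("<footer>The footer</footer>", "html.parser")

# Find the placeholder string and swap in the second parsed document.
page.find(string="FOOTER").replace_with(page_footer)

print(page)  # <div><p>Body</p><footer>The footer</footer></div>
```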


The special name of the BeautifulSoup object:

Explanation:
Because the BeautifulSoup object does not correspond to an actual HTML or XML tag, it has no tag name or attributes the way ordinary tags do. For convenience, however, it is given the special name "[document]".

In [88]:
soup.name

'[document]'

Comments and other special strings

Tag, NavigableString, and BeautifulSoup cover almost everything you’ll see in an HTML or XML file, but there are a few leftover bits. The only one you’ll probably ever need to worry about is the comment:

In [89]:
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
type(comment)

bs4.element.Comment

The Comment object is just a special type of NavigableString:



In [90]:
comment

'Hey, buddy. Want to buy a used parser?'

But when it appears as part of an HTML document, a Comment is displayed with special formatting:

In [92]:
print(soup.b.prettify())

<b>
 <!--Hey, buddy. Want to buy a used parser?-->
</b>
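Since Comment is a subclass of NavigableString, an isinstance() check is enough to tell comments apart from ordinary text. The sketch below collects every comment in a made-up snippet and removes it. The names comment_soup and found are illustrative:

```python
from bs4 import BeautifulSoup, Comment

comment_soup = BeautifulSoup("<p>Visible<!-- hidden note --> text</p>", "html.parser")

# A function passed as string= is called with each string in the tree.
found = comment_soup.find_all(string=lambda s: isinstance(s, Comment))
print(found)  # [' hidden note ']

# extract() removes each comment from the tree.
for item in found:
    item.extract()

print(comment_soup)  # <p>Visible text</p>
```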



Other special classes in Beautiful Soup:

Beautiful Soup defines classes for anything else that might show up in an XML document:

CData - CDATA sections

ProcessingInstruction - processing instructions

Declaration - XML declarations

Doctype - DOCTYPE declarations


What they have in common:

They are all subclasses of NavigableString
They add something extra to the basic string
Like Comment, they represent special elements in XML/HTML

In [93]:
from bs4 import CData
cdata = CData("A CDATA block")
comment.replace_with(cdata)

print(soup.b.prettify())

<b>
 <![CDATA[A CDATA block]]>
</b>



Navigating the tree

Here’s the “Three sisters” HTML document again:

In [206]:
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

Going down

Tags may contain strings and other tags. These elements are the tag’s children. Beautiful Soup provides a lot of different attributes for navigating and iterating over a tag’s children.

Note that Beautiful Soup strings don’t support any of these attributes, because a string can’t have children.

Navigating using tag names

The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the ``<head>`` tag, just say soup.head:

In [207]:
soup.head

<head><title>The Dormouse's story</title></head>

In [208]:
soup.title

<title>The Dormouse's story</title>

You can use this trick again and again to zoom in on a certain part of the parse tree. This code gets the first ``<b>`` tag beneath the ``<body>`` tag:

In [209]:
soup.body.b

<b>The Dormouse's story</b>

Using a tag name as an attribute will give you only the first tag by that name:



In [210]:
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

If you need to get all the ``<a>`` tags, or anything more complicated than the first tag with a certain name, you’ll need to use one of the methods described in Searching the tree, such as find_all():

In [211]:
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

.contents and .children

A tag’s children are available in a list called .contents:

In [212]:
head_tag = soup.head
head_tag

<head><title>The Dormouse's story</title></head>

In [213]:
head_tag.contents

[<title>The Dormouse's story</title>]

In [214]:
title_tag = head_tag.contents[0]

In [215]:
title_tag

<title>The Dormouse's story</title>

In [216]:
title_tag.contents

["The Dormouse's story"]

The BeautifulSoup object itself has children. In this case, the ``<html>`` tag is the child of the BeautifulSoup object:

In [217]:
soup.contents

[<html><head><title>The Dormouse's story</title></head>
 <body>
 <p class="title"><b>The Dormouse's story</b></p>
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>
 <p class="story">...</p></body></html>]

In [218]:
type(soup)

bs4.BeautifulSoup

In [219]:
len(soup.contents)

1

In [220]:
soup.contents[0].name

'html'

A string does not have .contents, because it can’t contain anything:



In [221]:
text = title_tag.contents[0]
text.contents

AttributeError: 'NavigableString' object has no attribute 'contents'

Instead of getting them as a list, you can iterate over a tag’s children using the .children generator:

In [222]:
title_tag

<title>The Dormouse's story</title>

In [223]:
for child in title_tag.children:
    print(child)

The Dormouse's story
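As a quick comparison, .contents is a list you can index and measure, while .children is meant for iteration. A sketch on a made-up list (list_soup and ul_tag are illustrative names):

```python
from bs4 import BeautifulSoup

list_soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")
ul_tag = list_soup.ul

# .contents is a real list: it supports len() and indexing.
count = len(ul_tag.contents)
first_item = ul_tag.contents[0]

# .children yields the same direct children one at a time.
child_names = [child.name for child in ul_tag.children]

print(count)        # 2
print(first_item)   # <li>one</li>
print(child_names)  # ['li', 'li']
```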


The .descendants attribute:

Explanation:

The .contents and .children attributes only consider a tag's direct children. The .descendants attribute, by contrast, walks through every element below the tag at any depth (children, grandchildren, great-grandchildren, and so on).

In [224]:
head_tag.contents

[<title>The Dormouse's story</title>]

In [225]:
len(list(soup.children))

1

In [226]:
list(soup.children)

[<html><head><title>The Dormouse's story</title></head>
 <body>
 <p class="title"><b>The Dormouse's story</b></p>
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>
 <p class="story">...</p></body></html>]

In [227]:
len(list(soup.descendants))

25
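The gap between the two counts is easier to see on a smaller tree. In the sketch below, the ``<head>`` tag has one child but two descendants, because the string inside ``<title>`` counts as a descendant too. The name small_soup is illustrative:

```python
from bs4 import BeautifulSoup

small_soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser"
)
head = small_soup.head

direct = list(head.children)        # only the <title> tag
everything = list(head.descendants)  # the <title> tag and the string inside it

print(len(direct))      # 1
print(len(everything))  # 2
print(everything[1])    # The Dormouse's story
```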

In [228]:
list(soup.descendants)

[<html><head><title>The Dormouse's story</title></head>
 <body>
 <p class="title"><b>The Dormouse's story</b></p>
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>
 <p class="story">...</p></body></html>,
 <head><title>The Dormouse's story</title></head>,
 <title>The Dormouse's story</title>,
 "The Dormouse's story",
 '\n',
 <body>
 <p class="title"><b>The Dormouse's story</b></p>
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie<

The .string attribute:

Definition:

If a tag has only one child, and that child is a NavigableString, the child is made available through the .string attribute.

In [229]:
# from bs4 import BeautifulSoup

# html = '''
# <div>
#     <title>Page title</title>
#     <p>This is a simple paragraph</p>
#     <span>Short text</span>
# </div>
# '''

# soup = BeautifulSoup(html, 'html.parser')

# # Tags whose only child is a NavigableString
# print("=== SUCCESS (.string) ===")
# print(f"title.string: '{soup.title.string}'")
# print(f"p.string: '{soup.p.string}'")
# print(f"span.string: '{soup.span.string}'")
# print(f"Type: {type(soup.title.string)}")

In [230]:
title_tag

<title>The Dormouse's story</title>

How the .string attribute behaves:
Case 1: the tag has a single child, and that child is a NavigableString

When a tag contains only plain text (no child tags),
.string returns that NavigableString
Example: ``<title>`` The Dormouse's story ``</title>`` → title_tag.string is "The Dormouse's story"

Case 2: the tag has a single child that is another tag, and that child tag has a .string

If the parent tag has exactly one child tag, and that child tag in turn has a single NavigableString,
the parent is considered to have the same .string as its child
Example: ``<head>`` ``<title>``The Dormouse's story``</title>`` ``</head>``

head_tag.contents holds the ``<title>`` tag,
but head_tag.string still returns "The Dormouse's story" (the same as title_tag.string)



Case 3: the tag contains more than one thing

When a tag has several children (several child tags or several pieces of text),
it is not clear what .string should refer to,
so .string is defined to be None
Example: the ``<html>`` tag here contains both ``<head>`` and ``<body>``, so soup.html.string is None

In [231]:
head_tag.contents

[<title>The Dormouse's story</title>]

In [233]:
head_tag.string

"The Dormouse's story"

In [235]:
print(soup.html.string)

None
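The cases described above can be reproduced on one small, made-up snippet. When .string is None because a tag holds several things, the .strings generator still gives access to every piece of text. The names string_soup, first_p and second_p are illustrative:

```python
from bs4 import BeautifulSoup

string_soup = BeautifulSoup(
    "<div><p><b>only child</b></p><p>one <i>two</i></p></div>", "html.parser"
)
first_p, second_p = string_soup.find_all("p")

# Case 2: <p> has a single child <b>, which has a single string.
print(first_p.string)  # only child

# Case 3: <p> holds a string and an <i> tag, so .string is None.
print(second_p.string)  # None

# .strings yields every string below the tag instead.
pieces = list(second_p.strings)
print(pieces)  # ['one ', 'two']
```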
