<div align=center>
<img alt="Figure 2" src="../../images/1423de6c692472bf11eac9ee0017f8b03ffeadc34bc5f08079f8aedf6413e623.png" width=75%/>  


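- `Beautiful Soup` is **a Python library for parsing `HTML` and `XML`**, which makes it easy to extract data from web pages.
    - `Beautiful Soup` provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree.
    - **`Beautiful Soup` automatically converts input documents to `Unicode` and output documents to `UTF-8`**. (You do not need to think about encodings unless the document does not specify one, in which case you only need to state the original encoding.)
    - `Beautiful Soup` has become a Python parsing tool as capable as `lxml` and `html5lib`, letting users choose between different parsing strategies or raw speed.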

- **Table 4-3: Parsers supported by Beautiful Soup**
<div align=center>
<img alt="Figure 3" src="../../images/4e7ba4ab09f6bbab1d340dc6022227a428d238af02756523f9312b95076c745b.png" width=75%/>  
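
The encoding behaviour described above can be illustrated with a small sketch. It assumes a GBK-encoded byte string that carries no charset declaration, and it uses the built-in `html.parser` so that no extra parser needs to be installed. The sample markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Bytes encoded as GBK with no <meta charset> declaration: the document
# itself does not say which encoding it uses.
raw = '<p>你好</p>'.encode('gbk')

# from_encoding states the original encoding explicitly.
soup = BeautifulSoup(raw, 'html.parser', from_encoding='gbk')

text = soup.p.string      # text inside the tree is Unicode (a str subclass)
encoded = soup.encode()   # encode() serialises to bytes, UTF-8 by default

print(repr(text))
print(type(encoded))
print(encoded.decode('utf-8'))
```

If the byte string is decoded correctly, the paragraph text compares equal to the original Unicode string, and the serialised output can be decoded as UTF-8.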


In [12]:
from bs4 import BeautifulSoup as BP

soup = BP('<p>Hello</p>', 'lxml')

print('\n', soup)
print('\n', soup.p)
print('\n', soup.p.string)


 <html><body><p>Hello</p></body></html>

 <p>Hello</p>

 Hello


In [25]:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

In [35]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # Initialize a BeautifulSoup object and assign it to the variable soup.
print(soup, '\n\n[Standard indented format]')
print(soup.prettify())
print(soup.title.string)

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html> 

[Standard indented format]
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="siste

In [30]:
from lxml import etree
xpath_html = etree.HTML(html)
result = etree.tostring(xpath_html)
result = result.decode('utf-8')
print(result)

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>


In [36]:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.title.string)
print(soup.head)
print(soup.p)

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>


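Printing its type shows bs4.element.Tag, an important data structure in Beautiful Soup. Every result picked out by a node selector is of this Tag type. A Tag has several attributes; for example, accessing its string attribute returns the node's text content.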
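
As a small sketch of the Tag attributes mentioned above, the snippet below inspects `name`, `attrs`, and `string` on a made-up fragment, again with the built-in `html.parser`. It also shows one case worth knowing: `string` is `None` when a tag has more than one child, while `get_text()` still joins all the text.

```python
from bs4 import BeautifulSoup

# Made-up fragment: <p> contains a <b> tag followed by plain text.
soup = BeautifulSoup('<p class="title"><b>Bold</b> and plain</p>', 'html.parser')
p = soup.p

print(type(p).__name__)   # class of the selected node
print(p.name)             # tag name
print(p.attrs)            # dictionary of attributes
print(p.b.string)         # <b> has a single text child, so .string is that text
print(p.string)           # <p> has two children, so .string is None
print(p.get_text())       # concatenated text of all descendants
```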

In [45]:
print(soup.p)
print(soup.p.attrs)
print(soup.p.attrs['name'])

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
{'class': ['title'], 'name': 'dromouse'}
dromouse


In [46]:
print(soup.p['name'])
print(soup.p['class'])

dromouse
['title']


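Note that some results are a **string**, while others are a **list of strings**.

For example:
- The name attribute holds a single value, so the result is a single string.
- For class, a node can have several classes, so the result is a list.

In practice, check the type before using the value.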
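
One way to handle the two return types is sketched below. `attr_values` is a hypothetical helper written for this note, not part of Beautiful Soup; it normalises any attribute to a list so that calling code does not have to branch on the type.

```python
from bs4 import BeautifulSoup

def attr_values(tag, key):
    """Hypothetical helper: return the attribute as a list of strings."""
    value = tag.get(key)      # None when the attribute is absent
    if value is None:
        return []
    if isinstance(value, list):
        return value          # multi-valued attribute such as class
    return [value]            # single-valued attribute such as name or id

soup = BeautifulSoup('<p class="title big" name="dromouse">Hi</p>', 'html.parser')
p = soup.p

print(attr_values(p, 'class'))
print(attr_values(p, 'name'))
print(attr_values(p, 'id'))
```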

In [5]:
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

In [6]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

print(soup.p.contents)      # direct children
for i in soup.p.contents:
    print(type(i), '\n', '[element]', i)

['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']
<class 'bs4.element.NavigableString'> 
 [element] 
            Once upon a time there were three little sisters; and their names were
            
<class 'bs4.element.Tag'> 
 [element] <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<class 'bs4.element.NavigableString'> 
 [element] 

<class 'bs4.element.Tag'> 
 [element] <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<class 'bs4.element.NavigableString'> 
 [element]  
            and
            
<class 'bs4.element.Tag'> 
 [element] <a class="sister" href="http://exa

In [72]:
print(soup.p.children, '\n')

for child in enumerate(soup.p.children):
    print(child)

<list_iterator object at 0x000001A5284C79A0> 

(0, '\n            Once upon a time there were three little sisters; and their names were\n            ')
(1, <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>)
(2, '\n')
(3, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>)
(4, ' \n            and\n            ')
(5, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>)
(6, '\n            and they lived at the bottom of a well.\n        ')


In [73]:
print(soup.p.descendants, '\n')
for i in enumerate(soup.p.descendants):
    print(i)

<generator object Tag.descendants at 0x000001A528489200> 

(0, '\n            Once upon a time there were three little sisters; and their names were\n            ')
(1, <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>)
(2, '\n')
(3, <span>Elsie</span>)
(4, 'Elsie')
(5, '\n')
(6, '\n')
(7, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>)
(8, 'Lacie')
(9, ' \n            and\n            ')
(10, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>)
(11, 'Tillie')
(12, '\n            and they lived at the bottom of a well.\n        ')


In [77]:
soup.span.parent

<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>

In [16]:
html = """
<html>
    <body>
        <p class="story">
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
        </p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup)
print(soup.a.parents, len(list(enumerate(soup.a.parents))))
list(enumerate(soup.a.parents))


<html>
<body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
</body></html>
<generator object PageElement.parents at 0x000001A604A4EB30> 4


[(0,
  <p class="story">
  <a class="sister" href="http://example.com/elsie" id="link1">
  <span>Elsie</span>
  </a>
  </p>),
 (1,
  <body>
  <p class="story">
  <a class="sister" href="http://example.com/elsie" id="link1">
  <span>Elsie</span>
  </a>
  </p>
  </body>),
 (2,
  <html>
  <body>
  <p class="story">
  <a class="sister" href="http://example.com/elsie" id="link1">
  <span>Elsie</span>
  </a>
  </p>
  </body></html>),
 (3,
  <html>
  <body>
  <p class="story">
  <a class="sister" href="http://example.com/elsie" id="link1">
  <span>Elsie</span>
  </a>
  </p>
  </body></html>)]

In [17]:
html = """
<html>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            Hello
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print('Next Sibling', soup.a.next_sibling)
print('Prev Sibling', soup.a.previous_sibling)
print('Next Siblings', list(enumerate(soup.a.next_siblings)))
print('Prev Siblings', list(enumerate(soup.a.previous_siblings)))

Next Sibling 
            Hello
            
Prev Sibling 
            Once upon a time there were three little sisters; and their names were
            
Next Siblings [(0, '\n            Hello\n            '), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n            and\n            '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n            and they lived at the bottom of a well.\n        ')]
Prev Siblings [(0, '\n            Once upon a time there were three little sisters; and their names were\n            ')]


In [21]:
html = """
<html>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">Bob</a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
        </p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print('Next Sibling:')
print(type(soup.a.next_sibling))
print(soup.a.next_sibling)
print(soup.a.next_sibling.string)
print('Parent:')
print(type(soup.a.parents), len(list(enumerate(soup.a.parents))))
print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs['class'])
print(list(soup.a.parents)[0]['class'])

Next Sibling:
<class 'bs4.element.Tag'>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
Parent:
<class 'generator'> 4
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">Bob</a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>
['story']
['story']


In [27]:
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

print(soup.ul);                     print()
print(soup.find_all(name='ul'));    print()
print(type(soup.find_all(name='ul')[0]))

<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]

<class 'bs4.element.Tag'>


In [37]:
for i in soup.find_all(name='ul'):
    print(i.find_all(name='li'));
    for j in i.find_all(name='li'):
        print(j.string, end=' ')
    print()

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
Foo Bar Jay 
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Foo Bar 


In [39]:
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')

In [46]:
# Pass the filter through attrs
print(soup.find_all(attrs={'id':'list-1'}));        print()
print(soup.find_all(attrs={'class':'element'}));    print()          

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]



In [47]:
# Pass the filter as keyword arguments instead of attrs
print(soup.find_all(id='list-1'));                  print()
print(soup.find_all(class_='element'))              # class is a Python keyword, so use class_

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
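
A related detail, sketched here on a shortened, made-up version of the markup: in `find_all`, the `name` argument means the tag name, so an HTML attribute that is itself called `name` has to go through `attrs`. The sketch also shows that `class_` matches any one of a tag's classes.

```python
from bs4 import BeautifulSoup

html = '''
<ul class="list" id="list-1" name="elements"><li class="element">Foo</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>
'''
soup = BeautifulSoup(html, 'html.parser')

# name= filters on the tag name, so this looks for <elements> tags.
by_tag_name = soup.find_all(name='elements')

# The HTML attribute called "name" must be passed through attrs.
by_attr = soup.find_all(attrs={'name': 'elements'})

# class_ matches when any single class of the tag equals the value.
by_one_class = soup.find_all(class_='list-small')

print(len(by_tag_name))
print([t['id'] for t in by_attr])
print([t['id'] for t in by_one_class])
```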


In [49]:
import re
html='''
<div class="panel">
    <div class="panel-body">
        <a>Hello, this is a link</a>
        <a>Hello, this is a link, too</a>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text=re.compile('link')))

['Hello, this is a link', 'Hello, this is a link, too']
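
In recent Beautiful Soup releases (4.4.0 and later, as far as the documentation states) the same search is spelled with the `string` argument, and `text` is kept as the older name. The sketch below assumes such a version and a made-up fragment. It also shows that combining a tag name with `string` returns the matching tags rather than the text nodes.

```python
import re
from bs4 import BeautifulSoup

html = ('<div><a>Hello, this is a link</a>'
        '<a>Hello, this is a link, too</a><a>Other</a></div>')
soup = BeautifulSoup(html, 'html.parser')

# string= alone returns the matching text nodes (NavigableString objects).
strings = soup.find_all(string=re.compile('link'))

# A tag name plus string= returns the <a> tags whose text matches.
tags = soup.find_all('a', string=re.compile('link'))

print(strings)
print([type(s).__name__ for s in strings])
print(tags)
```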


In [50]:
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find(name='ul'))
print(type(soup.find(name='ul')))
print(soup.find(class_='list'))

<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<class 'bs4.element.Tag'>
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>


In [58]:
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'));    print()
print(soup.select('ul li'));                    print()
print(soup.select('#list-2 .element'));         print()
print(soup.select('ul'));                       print()
print(type(soup.select('ul')))
print(type(soup.select('ul')[0]))

[<div class="panel-heading">
<h4>Hello</h4>
</div>]

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]

[<li class="element">Foo</li>, <li class="element">Bar</li>]

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]

<class 'bs4.element.ResultSet'>
<class 'bs4.element.Tag'>


In [69]:
for ul in soup.select('ul'):
    print(ul);              print()
    print(type(ul));        print()
    print(ul.select('li')); print('————————————————————————————————————————————————————————————————————————————')

<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>

<class 'bs4.element.Tag'>

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
————————————————————————————————————————————————————————————————————————————
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>

<class 'bs4.element.Tag'>

[<li class="element">Foo</li>, <li class="element">Bar</li>]
————————————————————————————————————————————————————————————————————————————


In [70]:
for ul in soup.select('ul'):
    print(ul);              print()
    print(ul['id']);        print()
    print(ul.attrs['id']);        print('————————————————————————————————————————————————————————————————————————————')

<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>

list-1

list-1
————————————————————————————————————————————————————————————————————————————
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>

list-2

list-2
————————————————————————————————————————————————————————————————————————————


In [74]:
print(soup.select('li'))

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]


In [77]:
for li in soup.select('li'):
    print(type(li))
    print(li);
    print('Get Text:', li.get_text())
    print('String:',   li.string)
    print()

<class 'bs4.element.Tag'>
<li class="element">Foo</li>
Get Text: Foo
String: Foo

<class 'bs4.element.Tag'>
<li class="element">Bar</li>
Get Text: Bar
String: Bar

<class 'bs4.element.Tag'>
<li class="element">Jay</li>
Get Text: Jay
String: Jay

<class 'bs4.element.Tag'>
<li class="element">Foo</li>
Get Text: Foo
String: Foo

<class 'bs4.element.Tag'>
<li class="element">Bar</li>
Get Text: Bar
String: Bar

