# Summary

<div align=center>
<img alt="图 2" src="../../images/1423de6c692472bf11eac9ee0017f8b03ffeadc34bc5f08079f8aedf6413e623.png" width=75%/>  

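- `Beautiful Soup` is **a Python library for parsing `HTML` and `XML`**, which makes it easy to extract data from web pages.
    - `Beautiful Soup` provides simple, Pythonic functions for navigating, searching, and modifying the parse tree.
    - **`Beautiful Soup` automatically converts incoming documents to `Unicode` and outgoing documents to `UTF-8`**. (You do not need to think about encodings unless the document does not declare one; in that case you only need to state the original encoding.)
    - `Beautiful Soup` has become as capable a Python parsing tool as `lxml` and `html5lib`, and it lets you choose between different parsing strategies or favor raw speed.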

- **Table 4-3: Parsers supported by Beautiful Soup**
<div align=center>
<img alt="图 3" src="../../images/4e7ba4ab09f6bbab1d340dc6022227a428d238af02756523f9312b95076c745b.png" width=75%/>  


```Python
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # initialize the object (html is the markup string)
soup.prettify()                     # return the markup with standard indentation

from lxml import etree
xpath_html = etree.HTML(html)
result = etree.tostring(xpath_html) # bytes by default
print(result.decode('utf-8'))
```
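
The overview above says that Beautiful Soup decodes input to Unicode and writes output as UTF-8. A minimal sketch of that round trip follows. The sample bytes, the `latin-1` source encoding, and the use of the built-in `html.parser` (instead of `lxml`, to avoid an extra dependency) are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical input: Latin-1 bytes with no declared charset.
raw = '<p>café</p>'.encode('latin-1')

# from_encoding states the original encoding when the document does not declare one.
soup = BeautifulSoup(raw, 'html.parser', from_encoding='latin-1')

text = soup.p.string        # already a Unicode string
utf8_bytes = soup.encode()  # encode() uses UTF-8 unless told otherwise

print(repr(text))
print(utf8_bytes)
print(soup.original_encoding)  # the encoding Beautiful Soup settled on
```

In everyday use the `from_encoding` argument can usually be left out, since the encoding is detected or declared in the document.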

```Python
# Get text
soup.li.text
soup.li.string
soup.li.get_text()

# Get an attribute value (here, the value of the class attribute)
soup.p.attrs['class']
soup.p['class']

# Find nodes
soup.find(name="ul")        # returns the first match
soup.find_all(name="li")    # keyword form, returns all matches
soup.find_all("li")         # positional form, equivalent

# Find nodes by attribute
soup.find_all(attrs={'id':'list-1'})    # passed through attrs
soup.find_all(id='list-1')              # passed as a keyword argument
soup.find_all(class_='element')         # class is a Python keyword, hence class_

# CSS selectors
soup.select('ul li')                    # by node name
soup.select('#list-2 .element')         # by id
soup.select('.panel-heading')           # by class

# Related nodes
soup.p.contents     # direct children, as a list
soup.p.children     # direct children, as an iterator (loop over it or wrap it in list())
soup.p.descendants  # all descendants, as a generator
soup.p.parent       # parent node
soup.p.parents      # ancestors, as a generator
soup.a.next_sibling # next sibling node
```

# Main Text

In [31]:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello</p>', 'lxml')

print('\n', soup)
print('\n', soup.p)
print('\n', soup.p.string)


 <html><body><p>Hello</p></body></html>

 <p>Hello</p>

 Hello


In [1]:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # initialize the BeautifulSoup object and assign it to the variable soup
print(soup, '\n\n【Standard indented format】')
print(soup.prettify())
print(soup.title.string)

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html> 

【Standard indented format】
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="siste

In [4]:
from lxml import etree
xpath_html = etree.HTML(html)
result = etree.tostring(xpath_html)
print(result, '\n')
result = result.decode('utf-8')
print(result)

b'<html><head><title>The Dormouse\'s story</title></head>\n<body>\n<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>\n<p class="story">Once upon a time there were three little sisters; and their names were\n<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,\n<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and\n<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>\n<p class="story">...</p>\n</body></html>' 

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they l

In [33]:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.title.string)
print(soup.head)
print(soup.p)

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>


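Printing its type shows **`bs4.element.Tag`**, an important data structure in Beautiful Soup. Every node selected this way is a Tag. A Tag has several attributes; for example, the `string` attribute returns the node's text content.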
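
One detail worth knowing about `string`: it only returns text when the tag has exactly one child (or a single chain of nested children). The sketch below illustrates this with made-up markup and the built-in `html.parser`; both are assumptions chosen to keep the example dependency-free.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical markup: <p> has two children, <i> has one.
soup = BeautifulSoup('<p>Hello <b>world</b></p><i>solo</i>', 'html.parser')

p = soup.p
print(isinstance(p, Tag))   # selected nodes are Tag objects

single = soup.i.string      # one child, so .string is that text
nested = soup.b.string      # <b> also has exactly one child
ambiguous = p.string        # two children ('Hello ' and <b>), so .string is None
combined = p.get_text()     # get_text() joins all text below the tag

print(repr(single), repr(nested), repr(ambiguous), repr(combined))
print(isinstance(single, NavigableString))
```

When a tag may contain mixed content, `get_text()` is the safer choice.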

In [10]:
print(soup.p)
print(soup.p.b)
print(soup.p.name)          # name of the node
print('-'*40)
print(soup.p.attrs)         # all attributes, as a dict
print(soup.p.attrs['name']) 
print('-'*40)
print(soup.p['class'])
print(soup.p['name'])       # attribute value


<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
p
----------------------------------------
{'class': ['title'], 'name': 'dromouse'}
dromouse
----------------------------------------
['title']
dromouse


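Note that some results are a **string**, while others are a **list of strings**.

For example:
- **The `name` attribute has a single value, so the result is a single string.**
- **The `class` attribute may have several values, so the result is a list.**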
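
The string-versus-list distinction can be checked directly. This is a small sketch on made-up markup, parsed with the built-in `html.parser` as an assumption to avoid extra dependencies. It also shows `get()`, which returns `None` for a missing attribute instead of raising `KeyError`, and `get_attribute_list()`, which always returns a list.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with a multi-valued class and single-valued id/name.
soup = BeautifulSoup('<p class="title big" id="intro" name="dromouse">x</p>', 'html.parser')
p = soup.p

classes = p['class']                    # multi-valued attribute: list of strings
name_value = p['name']                  # single-valued attribute: plain string
missing = p.get('href')                 # absent attribute: None, no exception
id_list = p.get_attribute_list('id')    # always a list, whatever the attribute

print(classes, name_value, missing, id_list)
print(' '.join(classes))                # rebuild the original class string if needed
```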

## Navigating Related Nodes

### (1) Child and descendant nodes

#### Direct children: contents (children only, no deeper descendants)

In [21]:
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.p, '\n', '-'*75)

print(len(soup.p.contents), '\n', soup.p.contents)      # direct children

### Line breaks between nodes count as separate elements (child nodes)

<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p> 
 ---------------------------------------------------------------------------
7 
 ['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']


#### Child nodes: children

In [28]:
print(soup.p.children, '\n', '-'*75)
for child in enumerate(soup.p.children):
    print(child)

<list_iterator object at 0x000002265D454340> 
 ---------------------------------------------------------------------------
(0, '\n            Once upon a time there were three little sisters; and their names were\n            ')
(1, <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>)
(2, '\n')
(3, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>)
(4, ' \n            and\n            ')
(5, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>)
(6, '\n            and they lived at the bottom of a well.\n        ')


#### Descendant nodes: descendants

In [29]:
print(soup.p.descendants, '\n', '-'*75)
for i in enumerate(soup.p.descendants):
    print(i)

<generator object Tag.descendants at 0x000002265E0E3890> 
 ---------------------------------------------------------------------------
(0, '\n            Once upon a time there were three little sisters; and their names were\n            ')
(1, <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>)
(2, '\n')
(3, <span>Elsie</span>)
(4, 'Elsie')
(5, '\n')
(6, '\n')
(7, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>)
(8, 'Lacie')
(9, ' \n            and\n            ')
(10, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>)
(11, 'Tillie')
(12, '\n            and they lived at the bottom of a well.\n        ')
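
The outputs above contain many whitespace-only strings, because text between tags is a node too. The sketch below compares `contents`, `children`, and `descendants` on a compact, made-up version of the paragraph and keeps only `Tag` nodes. The markup and the use of `html.parser` are assumptions for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical compact markup without indentation.
html = '<p>Once <a id="link1"><span>Elsie</span></a> and <a id="link2">Lacie</a></p>'
p = BeautifulSoup(html, 'html.parser').p

contents_count = len(p.contents)                 # list of direct children (text and tags)
child_ids = [c['id'] for c in p.children if isinstance(c, Tag)]   # direct child tags only
descendant_count = len(list(p.descendants))      # every node below <p>, at any depth
descendant_tags = [d.name for d in p.descendants if isinstance(d, Tag)]

print(contents_count, child_ids, descendant_count, descendant_tags)
```

A plain `for` loop works on all three; `enumerate()` is only needed when the position is wanted as well.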


### (2) Parent and ancestor nodes

#### Parent node: parent

In [31]:
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
        </p>
        <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')

soup.span.parent

<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>

#### Ancestor nodes: parents

Ancestors are yielded from the innermost outward.

In [34]:
html = """
<html>
    <body>
        <p class="story">
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
        </p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

list(enumerate(soup.a.parents))

[(0,
  <p class="story">
  <a class="sister" href="http://example.com/elsie" id="link1">
  <span>Elsie</span>
  </a>
  </p>),
 (1,
  <body>
  <p class="story">
  <a class="sister" href="http://example.com/elsie" id="link1">
  <span>Elsie</span>
  </a>
  </p>
  </body>),
 (2,
  <html>
  <body>
  <p class="story">
  <a class="sister" href="http://example.com/elsie" id="link1">
  <span>Elsie</span>
  </a>
  </p>
  </body></html>),
 (3,
  <html>
  <body>
  <p class="story">
  <a class="sister" href="http://example.com/elsie" id="link1">
  <span>Elsie</span>
  </a>
  </p>
  </body></html>)]
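
Printing whole ancestors is verbose. A shorter way to see the chain is to print only the names, as in this sketch (made-up markup, `html.parser` assumed). The last entry is the `BeautifulSoup` object itself, whose name is `[document]`, which explains why the output above appears to list the `<html>` tree twice.

```python
from bs4 import BeautifulSoup

# Hypothetical compact markup mirroring the example above.
html = '<html><body><p class="story"><a id="link1"><span>Elsie</span></a></p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

ancestor_names = [parent.name for parent in soup.span.parents]  # innermost first
direct_parent_id = soup.span.parent['id']                       # the enclosing <a>
story_classes = soup.span.find_parent('p')['class']             # nearest <p> ancestor

print(ancestor_names)
print(direct_parent_id, story_classes)
```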

### (3) Sibling nodes

`next_sibling`

`previous_sibling`

`next_siblings`

`previous_siblings`

In [35]:
html = """
<html>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            Hello
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
"""

soup = BeautifulSoup(html, 'lxml')
print('Next Sibling', soup.a.next_sibling)
print('Prev Sibling', soup.a.previous_sibling)
print('Next Siblings', list(enumerate(soup.a.next_siblings)))
print('Prev Siblings', list(enumerate(soup.a.previous_siblings)))

Next Sibling 
            Hello
            
Prev Sibling 
            Once upon a time there were three little sisters; and their names were
            
Next Siblings [(0, '\n            Hello\n            '), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n            and\n            '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n            and they lived at the bottom of a well.\n        ')]
Prev Siblings [(0, '\n            Once upon a time there were three little sisters; and their names were\n            ')]
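
As the output shows, `next_sibling` often returns a text node rather than the next tag. When the next tag is what you want, `find_next_sibling()` and `find_next_siblings()` skip the text. The sketch below uses made-up markup and `html.parser`, both assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with text between the links.
html = '<p>Once <a id="link1">Elsie</a> Hello <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
first = BeautifulSoup(html, 'html.parser').a

raw_next = first.next_sibling                     # the text node ' Hello '
next_tag_id = first.find_next_sibling('a')['id']  # the next <a>, text skipped
later_ids = [a['id'] for a in first.find_next_siblings('a')]

print(repr(raw_next), next_tag_id, later_ids)
```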


### (4) Extracting information (text, attributes, etc.)

In [37]:
html = """
<html>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">Bob</a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
        </p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

print('【Next Sibling】')
print(type(soup.a.next_sibling))
print(soup.a.next_sibling)
print(soup.a.next_sibling.string)

print('\n【Parents】')
print(type(soup.a.parents), len(list(enumerate(soup.a.parents))))
print(list(soup.a.parents)[0])

print('\n【Attribute values】')
print(list(soup.a.parents)[0].attrs['class'])
print(list(soup.a.parents)[0]['class'])

【Next Sibling】
<class 'bs4.element.Tag'>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie

【Parents】
<class 'generator'> 4
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">Bob</a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>

【Attribute values】
['story']
['story']


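## Method Selectors

The selection techniques covered so far all work through attributes. They are very fast, but for more complex selections they become cumbersome and inflexible.

Fortunately, Beautiful Soup also provides query methods such as `find_all()` and `find()`. Call them with the appropriate arguments to query flexibly.

### **`find_all()`**

**Finds all elements that match the given conditions.** Pass it attributes or text, and it returns the matching elements.

**`find_all(name, attrs, recursive, text, **kwargs)`**

#### (1) **`name`**

Queries elements by node name.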

In [44]:
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')

print(soup.ul, '\n' , '-'*75);
print(soup.find_all('ul'), '\n', '-'*75);
# print(soup.find_all(name='ul'))   # equivalent keyword form
print(type(soup.find_all('ul')))
print(type(soup.find_all('ul')[0]))

<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul> 
 ---------------------------------------------------------------------------
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>] 
 ---------------------------------------------------------------------------
<class 'bs4.element.ResultSet'>
<class 'bs4.element.Tag'>


**Because every result is a Tag, nested queries are still possible.**

In [51]:
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
    for li in ul.find_all('li'):
        print(li.string, end=' ')
    print('\n', '-'*75)

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
Foo Bar Jay 
 ---------------------------------------------------------------------------
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Foo Bar 
 ---------------------------------------------------------------------------
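
Besides a single tag name, `name` also accepts a list or a compiled regular expression, and `find_all()` takes `limit` and `recursive` arguments. The sketch below exercises these on made-up markup parsed with `html.parser`; the markup and parser choice are assumptions for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, a trimmed version of the panel above.
html = '<div><h4>Hello</h4><ul id="list-1"><li>Foo</li><li>Bar</li></ul></div>'
soup = BeautifulSoup(html, 'html.parser')

by_list = [t.name for t in soup.find_all(['h4', 'li'])]          # any of several names
by_regex = [t.name for t in soup.find_all(re.compile('^u'))]     # names starting with "u"
limited = soup.find_all('li', limit=1)                           # stop after the first match
direct_only = soup.div.find_all('li', recursive=False)           # direct children of <div> only

print(by_list, by_regex, len(limited), direct_only)
```

`recursive=False` returns an empty result here because the `li` nodes are grandchildren of `div`, not direct children.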


#### (2) **`attrs`**

In [52]:
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]


##### Without passing attrs

For some **common attributes**, such as id and class, we can skip attrs and pass them as keyword arguments instead.

In [53]:
soup.find_all(id='list-1')

[<ul class="list" id="list-1" name="elements">
 <li class="element">Foo</li>
 <li class="element">Bar</li>
 <li class="element">Jay</li>
 </ul>]

In [20]:
soup.find_all(class_='element')

[<li class="element">Foo</li>,
 <li class="element">Bar</li>,
 <li class="element">Jay</li>,
 <li class="element">Foo</li>,
 <li class="element">Bar</li>]
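
Two points follow from the examples above. First, `class_` matches any one of an element's classes. Second, some attributes cannot be passed as keyword arguments: `name` is already the tag-name parameter, and names such as `data-kind` are not valid Python identifiers, so they have to go through `attrs`. The sketch below assumes made-up markup (including a hypothetical `data-kind` attribute) and `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; data-kind is an invented attribute for illustration.
html = ('<ul class="list" id="list-1" name="elements" data-kind="main"></ul>'
        '<ul class="list list-small" id="list-2"></ul>')
soup = BeautifulSoup(html, 'html.parser')

any_list = [ul['id'] for ul in soup.find_all(class_='list')]          # matches both lists
small_only = [ul['id'] for ul in soup.find_all(class_='list-small')]  # matches one class of several
by_data = [ul['id'] for ul in soup.find_all(attrs={'data-kind': 'main'})]
by_name_attr = [ul['id'] for ul in soup.find_all(attrs={'name': 'elements'})]

print(any_list, small_only, by_data, by_name_attr)
```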

#### (3) **`text`**

In [54]:
import re
html='''
<div class="panel">
    <div class="panel-body">
        <a>Hello, this is a link</a>
        <a>Hello, this is a link, too</a>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
soup.find_all(text = re.compile('link'))

['Hello, this is a link', 'Hello, this is a link, too']
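
Note that a text-only search returns the matching strings, not the tags that contain them. Since Beautiful Soup 4.4.0 the same argument is also available under the name `string`, which is the spelling used in this sketch. Combining it with a tag name returns tags instead. The markup and the use of `html.parser` are assumptions for illustration.

```python
import re
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical markup based on the example above.
html = '<a>Hello, this is a link</a><a>Hello, this is a link, too</a><p>No match here</p>'
soup = BeautifulSoup(html, 'html.parser')

strings = soup.find_all(string=re.compile('link'))     # NavigableString objects
tags = soup.find_all('a', string=re.compile('too'))    # Tag objects whose text matches

print(strings)
print(tags)
print(strings[0].parent.name)   # climb from a matched string back to its tag
```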

### **`find()`**

> Returns the first matching result

In [56]:
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')

# print(soup.find(name="ul"), '\n', '-'*75)
print(soup.find('ul'), '\n', '-'*75)
print(type(soup.find('ul')), '\n', '-'*75)
print(soup.find(class_='list'))

<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul> 
 ---------------------------------------------------------------------------
<class 'bs4.element.Tag'> 
 ---------------------------------------------------------------------------
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
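
When nothing matches, `find()` returns `None` while `find_all()` returns an empty result, so chained access after `find()` needs a guard. The sketch below shows one way to handle this. The helper `first_text` is hypothetical (not part of Beautiful Soup), and the markup and `html.parser` are assumptions.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul id="list-1"><li>Foo</li></ul>', 'html.parser')

missing = soup.find('table')         # no <table> in the document: None
all_tables = soup.find_all('table')  # empty ResultSet, safe to iterate


def first_text(node, name, default=''):
    """Hypothetical helper: text of the first matching tag, or a default."""
    found = node.find(name)
    return found.get_text() if found is not None else default


print(missing, len(all_tables))
print(first_text(soup, 'li'), first_text(soup, 'table', 'n/a'))
```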


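There are also many other query methods whose usage is exactly the same as `find_all()` and `find()`; only the search scope differs. They are summarized briefly here.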

<img alt="image.png 1" src="../../images/b16dcb85e981969b80f67bfdc8ece5d35207afe0f795f0a893db31b4b91d12af.png" width=75%/>  
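
A few of these scope-specific methods in action, as a minimal sketch. Only methods known to exist in Beautiful Soup 4 are used (`find_next`, `find_all_next`, `find_previous`, `find_parent`); the markup and `html.parser` are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a heading followed by a list.
html = '<div><h4>Hello</h4><ul><li>Foo</li><li>Bar</li></ul></div>'
soup = BeautifulSoup(html, 'html.parser')

next_li = soup.h4.find_next('li').string                      # first <li> after <h4> in document order
all_next = [li.string for li in soup.h4.find_all_next('li')]  # every <li> after <h4>
prev_h4 = soup.ul.find_previous('h4').string                  # nearest <h4> before <ul>
parent_name = soup.li.find_parent('ul').name                  # nearest <ul> ancestor of the first <li>

print(next_li, all_next, prev_h4, parent_name)
```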


## CSS Selectors

`soup.select()` returns all matches.

In [61]:
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

print(soup.select('.panel-heading'), '\n','-'*50)
print(soup.select('ul li'), '\n','-'*50)
print(soup.select('#list-2 .element'), '\n','-'*50)
print(type(soup.select('ul li')), type(soup.select('ul li')[0]), '\n','-'*50)
print(soup.select('li'))

[<div class="panel-heading">
<h4>Hello</h4>
</div>] 
 --------------------------------------------------
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>] 
 --------------------------------------------------
[<li class="element">Foo</li>, <li class="element">Bar</li>] 
 --------------------------------------------------
<class 'bs4.element.ResultSet'> <class 'bs4.element.Tag'> 
 --------------------------------------------------
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
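
`select()` always returns a list. When only the first match is needed, `select_one()` returns a single Tag or `None`. The sketch below also tries the child combinator and an attribute selector on a compact, made-up version of the panel, parsed with `html.parser`; both are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical compact version of the panel markup above.
html = ('<div class="panel">'
        '<ul class="list" id="list-1"><li class="element">Foo</li>'
        '<li class="element">Bar</li><li class="element">Jay</li></ul>'
        '<ul class="list list-small" id="list-2"><li class="element">Foo</li>'
        '<li class="element">Bar</li></ul></div>')
soup = BeautifulSoup(html, 'html.parser')

first = soup.select_one('#list-1 li').string                    # first match only
children = soup.select('ul > li')                               # direct children of any <ul>
small = [li.string for li in soup.select('ul.list-small > li')] # tag name and class combined
by_attr = soup.select('ul[id="list-2"] li')                     # attribute selector
nothing = soup.select_one('.missing')                           # no match: None

print(first, len(children), small, len(by_attr), nothing)
```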


#### Nested selection

The select() method also supports nested selection. For example, first select all ul nodes, then iterate over each ul node and select its li nodes:



In [62]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]


#### Getting attributes

`tag['id']`

`tag['class']`

`tag.attrs['id']`

`tag.attrs['class']`

In [63]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

list-1
list-1
list-2
list-2


#### Getting text

##### The `get_text()` method and the `string` and `text` attributes

In [69]:
for li in soup.select('li'):
    print('Get Text:', li.get_text())
    print('  String:', li.string)
    print('    Text:', li.text)
    print('-'*50)

Get Text: Foo
  String: Foo
    Text: Foo
--------------------------------------------------
Get Text: Bar
  String: Bar
    Text: Bar
--------------------------------------------------
Get Text: Jay
  String: Jay
    Text: Jay
--------------------------------------------------
Get Text: Foo
  String: Foo
    Text: Foo
--------------------------------------------------
Get Text: Bar
  String: Bar
    Text: Bar
--------------------------------------------------
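
All three give the same result above because each `li` holds a single string. They differ once a node has nested tags or surrounding whitespace, as this sketch on made-up markup shows (`html.parser` assumed). It also uses the `separator` and `strip` arguments of `get_text()` and the `stripped_strings` generator.

```python
from bs4 import BeautifulSoup

# Hypothetical <li> with padding and a nested tag.
li = BeautifulSoup('<li> Foo <b>Bar</b>\n</li>', 'html.parser').li

as_string = li.string                    # several children, so None
raw_text = li.get_text()                 # all text joined as-is
clean_text = li.get_text(' ', strip=True)  # strip each piece, join with a space
pieces = list(li.stripped_strings)       # stripped, non-empty strings

print(repr(as_string), repr(raw_text), repr(clean_text), pieces)
print(li.text == raw_text)               # .text is equivalent to get_text()
```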
