# Beautiful Soup

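When Beautiful Soup parses a document, it actually relies on an underlying parser. Besides the HTML parser in the Python standard library, it also supports several third-party parsers.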

![image](https://github.com/DRNTT/SpiderImage/blob/master/ch4/beautiful%20soup%20parse.PNG?raw=true)
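
The parser is chosen by the second argument to the BeautifulSoup constructor. As a minimal sketch, assuming only the bs4 package is installed, the snippet below uses the standard library parser ('html.parser') on a made-up fragment. Third-party parsers such as lxml are selected the same way but must be installed separately, and they may repair the same broken markup differently.

```python
from bs4 import BeautifulSoup

# Hypothetical broken fragment: the <p> tag is never closed.
fragment = "<p>Hello"

# 'html.parser' ships with Python, so no extra install is needed.
soup = BeautifulSoup(fragment, "html.parser")

# This parser closes the open tag but does not wrap the fragment in <html>/<body>.
repaired = str(soup)
print(repaired)
print(soup.html)  # None: no <html> element was added
```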

## 1. Basic Usage

In [None]:
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)

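The prettify() method returns the parsed document as a string with standard indentation. Note that the missing closing body and html tags are not fixed by prettify(); they were already repaired when the BeautifulSoup object was initialized.<br>
soup.title.string returns the text content of the title node.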
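
To make the effect of prettify() concrete, here is a small sketch on a made-up one-line document. It assumes the standard library parser ('html.parser') and only inspects which pieces end up on their own line, since the exact indentation is a formatting detail.

```python
from bs4 import BeautifulSoup

# Hypothetical compact markup with no line breaks at all.
soup = BeautifulSoup("<div><p>Hi</p></div>", "html.parser")

# prettify() returns a str; each tag and each text node gets its own line.
pretty = soup.prettify()
print(pretty)

# Strip the indentation to see the individual pieces.
pieces = [line.strip() for line in pretty.splitlines() if line.strip()]
print(pieces)
```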

## 2. Node Selectors

### 2.1 Selecting Elements

In [None]:
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.title.string)
print(soup.head)
print(soup.p)

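Selecting a node by its name returns a Tag object; to get its text content, access the string attribute of that Tag.<br>
Note that soup.head and soup.p only match the first node with that name.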
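
The two points above can be checked on a small made-up document. This sketch assumes the standard library parser and also shows a related detail: string is None when a tag has more than one child, in which case get_text() can be used to collect all nested text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two <p> nodes.
html = "<div><p>first</p><p>second <b>bold</b></p></div>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p              # attribute access returns only the first <p>
all_p = soup.find_all("p")    # find_all returns every <p>
second_p = all_p[1]

print(type(first_p))          # <class 'bs4.element.Tag'>
print(first_p.string)         # first
print(second_p.string)        # None: this <p> has two children (text and <b>)
print(second_p.get_text())    # second bold
```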

### 2.2 Extracting Information

In [None]:
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
# Get the node name
print(soup.title.name)
# Get node attributes; note the type of each result (class is a list, name is a string)
print(soup.p.attrs)
print(soup.p.attrs['name'])
print(soup.p['class'])
# Get the node's text content
print(soup.title.string)
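
The attribute lookups in the cell above return different types depending on the attribute. The sketch below, on a made-up tag and assuming the standard library parser, shows that class comes back as a list while other attributes are plain strings, and that get() avoids a KeyError for an attribute that is absent.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with a multi-valued class attribute.
soup = BeautifulSoup('<p class="title big" name="dromouse">x</p>', "html.parser")
p = soup.p

classes = p["class"]       # multi-valued attribute: returned as a list
name_attr = p["name"]      # ordinary attribute: returned as a string
missing = p.get("id")      # get() returns None instead of raising KeyError

print(classes, name_attr, missing)
```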

### 2.3 Nested Selection

A selected element is itself a Tag, so you can keep selecting from it.

In [None]:
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.head.title)
print(soup.head.title.string)
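
One thing to keep in mind when chaining selections like this: if a step finds nothing, it returns None, and continuing the chain would raise AttributeError. A minimal sketch on a made-up document, assuming the standard library parser:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>T</title></head><body><p>x</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

title = soup.head.title    # each step returns a Tag, so chaining works
no_p = soup.head.p         # there is no <p> inside <head>, so this is None

# Guard before going deeper; no_p.string would raise AttributeError.
p_text = no_p.string if no_p is not None else None
print(title.string, p_text)
```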

### 2.4 Related Selection

(1) Getting child nodes

In [None]:
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><span><!-- Elsie --></span></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
# Get the direct children
print(soup.p.contents)
print('*'*80)
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)
print('*'*80)
for i, child in enumerate(soup.p.descendants):
    print(i, child)

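The first form (contents) returns a list in which every item is a direct child of p.<br>
The second form (children) returns an iterator whose elements are likewise the direct children of p.<br>
The third form (descendants) yields all descendants of p: children, grandchildren, and so on.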
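
The difference between the three attributes is easiest to see by counting nodes. The following sketch uses a made-up paragraph and the standard library parser; note that text nodes count as children too.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <p> has three direct children (text, <a>, text).
soup = BeautifulSoup("<p>one<a><span>two</span></a>three</p>", "html.parser")
p = soup.p

direct = p.contents                 # a list of direct children
also_direct = list(p.children)      # same nodes, produced lazily
everything = list(p.descendants)    # 'one', <a>, <span>, 'two', 'three'

print(len(direct), len(also_direct), len(everything))
```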

(2) Getting parent nodes

In [None]:
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><span><!-- Elsie --></span></a>,

<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)
print('*'*80)
print(soup.a.parents)
for i, parent in enumerate(soup.a.parents):
    print(i, parent)
    

parent returns the direct parent of the node, while parents yields all of its ancestors.
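
As a compact illustration on a made-up document (standard library parser assumed), printing only the names of the ancestors shows the order in which parents walks up the tree. The last entry is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a>x</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

a = soup.a
direct_parent = a.parent.name                     # the enclosing <p>
ancestor_names = [tag.name for tag in a.parents]  # innermost to outermost

print(direct_parent)
print(ancestor_names)
```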

(3) Getting sibling nodes

In [None]:
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><span><!-- Elsie --></span></a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
print('Next Sibling:', soup.a.next_sibling)
print('Prev Sibling:', soup.a.previous_sibling)
print('Next Siblings:', list(enumerate(soup.a.next_siblings)))
print('Prev Siblings:', list(enumerate(soup.a.previous_siblings)))
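
In the cell above, siblings are not only tags: the text between tags is a sibling node as well. The sketch below uses a simplified, made-up version of the paragraph (standard library parser assumed) and contrasts the sibling attributes with find_next_siblings(), which can filter down to tags of a given name.

```python
from bs4 import BeautifulSoup

html = ('<p>Once<a id="link1">Elsie</a><a id="link2">Lacie</a>'
        ' and <a id="link3">Tillie</a></p>')
soup = BeautifulSoup(html, "html.parser")
first = soup.a

nxt = first.next_sibling                      # the <a id="link2"> tag
prev = first.previous_sibling                 # the text node 'Once'
following = list(first.next_siblings)         # <a link2>, ' and ', <a link3>
tags_only = first.find_next_siblings("a")     # only the <a> tags

print(nxt, repr(prev), len(following), len(tags_only))
```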

## 3. Method Selectors

### 3.1 find_all()

Finds all elements that match the given conditions.

(1) By node name

In [None]:
from bs4 import BeautifulSoup

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))

for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)
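
Besides a single tag name, find_all() also accepts a list of names and a limit argument. A short sketch on a trimmed-down, made-up version of the panel markup, assuming the standard library parser:

```python
from bs4 import BeautifulSoup

html = "<div><h4>Hello</h4><ul><li>Foo</li><li>Bar</li><li>Jay</li></ul></div>"
soup = BeautifulSoup(html, "html.parser")

items = soup.find_all("li")               # name can be passed positionally
first_two = soup.find_all("li", limit=2)  # stop after two matches
mixed = soup.find_all(["h4", "li"])       # match any name in the list

texts = [li.string for li in items]
print(texts, len(first_two), [t.name for t in mixed])
```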

(2) By attributes

In [None]:
from bs4 import BeautifulSoup

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1',
                          'class': ['list']}))
# class is a Python keyword, so the argument is spelled class_; it matches any element whose class list contains 'list'
print(soup.find_all(class_='list'))
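
The class matching described in the comment above can be verified by looking at which ids come back. This sketch uses two made-up, empty ul elements and the standard library parser.

```python
from bs4 import BeautifulSoup

html = ('<ul class="list" id="list-1"></ul>'
        '<ul class="list list-small" id="list-2"></ul>')
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all(class_="list")         # both: each has class 'list'
small = soup.find_all(class_="list-small")      # only the second
by_id = soup.find_all(id="list-1")              # id works as a keyword too
combined = soup.find_all(attrs={"id": "list-2", "class": "list"})

print([ul["id"] for ul in by_class])
print([ul["id"] for ul in small])
```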

(3) By text content

In [None]:
from bs4 import BeautifulSoup
import re

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
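# text can be a string or a compiled regular expression object.
# Beautiful Soup 4.4.0 and later call this argument string; text is the older name.
print(soup.find_all(text=re.compile('Foo')))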
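
Searching by text returns the matching text nodes, not the tags that contain them. The sketch below assumes Beautiful Soup 4.4.0 or later (for the string argument) and the standard library parser; the markup is made up. A plain string must match the whole text, while a compiled pattern matches anywhere in it.

```python
import re

from bs4 import BeautifulSoup

html = "<ul><li>Foo</li><li>Bar</li><li>Foo bar</li></ul>"
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="Foo")                # whole text must equal 'Foo'
pattern = soup.find_all(string=re.compile("Foo"))  # regex search within the text

# Each result is a text node; .parent gives the enclosing tag.
parents = [node.parent.name for node in pattern]
print(exact, pattern, parents)
```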

### 3.2 find()

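Returns the first matching element. Note that its return type differs from find_all(): find() returns a single element (or None when nothing matches), whereas find_all() returns a list-like result set.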

In [None]:
from bs4 import BeautifulSoup

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.find(name='ul'))
print(soup.find(class_='list-small'))
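
The difference in return types matters most when nothing matches. A minimal sketch on made-up markup, assuming the standard library parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>Foo</li><li>Bar</li></ul>", "html.parser")

one = soup.find("li")              # a single Tag: the first <li>
many = soup.find_all("li")         # a list-like result set of Tags
no_one = soup.find("table")        # None when nothing matches
no_many = soup.find_all("table")   # an empty result set when nothing matches

print(one.string, len(many), no_one, len(no_many))
```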

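### 3.3 Other Methods
- **find_parents(), find_parent():** the former returns all matching ancestor nodes; the latter returns the direct parent node.
- **find_next_siblings(), find_next_sibling():** the former returns all following sibling nodes; the latter returns the first following sibling node.
- **find_previous_siblings(), find_previous_sibling():** the former returns all preceding sibling nodes; the latter returns the first preceding sibling node.
- **find_all_next(), find_next():** the former returns all matching nodes after the current node; the latter returns the first matching node.
- **find_all_previous(), find_previous():** the former returns all matching nodes before the current node; the latter returns the first matching node.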
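
A few of the methods listed above, sketched on a made-up three-item list with the standard library parser. Each call starts from the middle or outer li and filters by tag name.

```python
from bs4 import BeautifulSoup

html = '<div><ul><li id="a">A</li><li id="b">B</li><li id="c">C</li></ul></div>'
soup = BeautifulSoup(html, "html.parser")

a = soup.find(id="a")
b = soup.find(id="b")
c = soup.find(id="c")

enclosing_div = b.find_parent("div")           # nearest <div> ancestor
after_b = b.find_next_sibling("li")            # first <li> sibling after b
before_b = b.find_previous_sibling("li")       # first <li> sibling before b
all_after_a = a.find_all_next("li")            # every <li> later in the document
nearest_before_c = c.find_previous("li")       # closest <li> earlier in the document

print(enclosing_div.name, after_b["id"], before_b["id"])
```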

### 3.4 CSS Selectors

In [None]:
from bs4 import BeautifulSoup

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('#list-1 .element'))
print(soup.select('ul li'))

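Call the select() method and pass in a CSS selector.<br>
The first selector picks nodes with class panel-heading under nodes with class panel.<br>
The second selector picks nodes with class element under the node whose id is list-1.<br>
The third selector picks li nodes under ul nodes.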
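
A few more selector forms, as a sketch on a shortened, made-up version of the panel markup. It assumes the standard library parser and a Beautiful Soup install with CSS selector support. select_one() returns the first match (or None) instead of a list.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel">
  <div class="panel-body">
    <ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
    <ul class="list list-small" id="list-2"><li class="element">Jay</li></ul>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_li = soup.select_one("#list-1 .element")   # first match only
direct = soup.select("ul.list-small > li")       # child combinator
by_attr = soup.select('ul[id="list-2"]')         # attribute selector
nothing = soup.select_one("table")               # None when nothing matches

print(first_li.string, [li.string for li in direct], len(by_attr), nothing)
```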

(1) Nested selection

In [None]:
from bs4 import BeautifulSoup

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    for li in ul.select('li'):
        print(li)

(2) Getting attributes and text

In [None]:
from bs4 import BeautifulSoup

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    for li in ul.select('li'):
        print(li['class'], li.string)
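
Elements returned by select() are ordinary Tag objects, so attributes and text are read the same way as before. The sketch below (made-up markup, standard library parser assumed) collects them into dictionaries and shows where string and get_text() differ: string is None for a node with nested children, while get_text() joins all nested text.

```python
from bs4 import BeautifulSoup

html = ('<ul><li class="element first">Foo</li>'
        '<li class="element">Bar <b>baz</b></li></ul>')
soup = BeautifulSoup(html, "html.parser")

rows = []
for li in soup.select("li"):
    rows.append({
        "classes": li["class"],   # list of class names
        "string": li.string,      # None if the tag has more than one child
        "text": li.get_text(),    # all nested text joined together
    })

for row in rows:
    print(row)
```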