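# BeautifulSoup Basics

### Introduction to BeautifulSoup

Beautiful Soup: "beautiful soup, so rich and green".

Beautiful Soup is a flexible and convenient library for parsing web pages. It is efficient, supports several parsers, and lets you extract information from a page without writing regular expressions.

BeautifulSoup accepts either an HTML string or binary content (bytes) as the document to parse.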

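### Parsers

Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. If no third-party parser is installed, the standard-library parser (`html.parser`) is used. The lxml parser is more capable and faster, so installing it is recommended.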
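
A minimal sketch of choosing a parser and of passing either a string or bytes. It uses `html.parser` so that it runs without lxml installed; the `markup` sample is made up for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup invented for this sketch.
markup = "<p>Hello, <b>soup</b></p>"

# The second argument names the parser. "html.parser" ships with Python;
# "lxml" requires the third-party lxml package.
soup_from_str = BeautifulSoup(markup, "html.parser")

# Bytes (for example, the body of an HTTP response) are accepted as well;
# Beautiful Soup decodes them before parsing.
soup_from_bytes = BeautifulSoup(markup.encode("utf-8"), "html.parser")

print(soup_from_str.b.string)
print(soup_from_bytes.b.string)
```

Replacing `"html.parser"` with `"lxml"` should be the only change needed once lxml is installed, although the two parsers can repair broken markup differently.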

### Basic usage

In [7]:
from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

soup = BeautifulSoup(html,'lxml')
# print(soup.prettify())  
print('The whole title tag')
print(soup.title)
print('Name of the title tag')
print(soup.title.name)
print('Text of the title tag')
print(soup.title.string)
print('Name of the parent of the title tag')
print(soup.title.parent.name)

print('The first p tag')
print(soup.p)
print('class attribute of the first p tag')
print(soup.p["class"])
print('The first a tag')
print(soup.a)
print('All a tags')
print(soup.find_all('a'))
print(soup.find(id='link3'))

for link in soup.find_all('a'):
    print(link.get('href'))

print(soup.get_text())

The whole title tag
<title>The Dormouse's story</title>
Name of the title tag
title
Text of the title tag
The Dormouse's story
Name of the parent of the title tag
head
The first p tag
<p class="title"><b>The Dormouse's story</b></p>
class attribute of the first p tag
['title']
The first a tag
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
All a tags
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...



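### Tag selectors

`soup.tagname` returns the whole tag, including its name, attributes, and content.

Note: if the document contains several tags with the same name, only the first one is returned.

### Getting the name

`soup.title.name` returns the name of the title tag, that is, `title`.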

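### Getting attributes

There are two ways:

* `soup.p.attrs['name']`
* `soup.p['name']`

Both return the value of the p tag's `name` attribute. If the tag has no such attribute, both raise `KeyError`.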

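### Getting content

`soup.p.string` returns the content of the first p tag: The Dormouse's story

### Nested selection

Note: nested selection assumes that every tag in the chain exists. A missing tag evaluates to `None`, and any further attribute access on that `None` raises `AttributeError`.

`soup.head.title.string`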
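
The tag-selector operations above can be sketched together as follows. The markup and the `name="first"` attribute are invented for this example, and `html.parser` is used so that no extra package is needed.

```python
from bs4 import BeautifulSoup

# Invented markup with two <p> tags; only the first carries attributes.
html = ('<html><head><title>Demo</title></head>'
        '<body><p class="intro" name="first">One</p><p>Two</p></body></html>')
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p                    # only the first <p> is returned
print(first_p.name)                 # the tag name
print(first_p["name"])              # attribute value via subscript
print(first_p.attrs["name"])        # the same value via .attrs
print(first_p.get("id"))            # .get() returns None for a missing attribute
print(soup.head.title.string)       # nested selection

# A tag that is not in the document evaluates to None, so guard before chaining.
table = soup.table
cell_text = table.td.string if table is not None and table.td is not None else None
print(cell_text)
```

Using `.get()` instead of subscripting avoids a `KeyError` when an attribute may be absent.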

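### Children and descendants

* A tag's `.contents` attribute returns its child nodes as a list. Strings have no `.contents` attribute because a string has no children.
* `.children` yields the same nodes as `.contents`, but it is an iterator rather than a list, so it can only be used to loop over the tag's children.
* `.contents` and `.children` include only the tag's direct children.
* `.descendants` iterates recursively over all of the tag's descendants.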
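
A compact sketch of the difference between direct children and descendants, assuming a small made-up fragment and `html.parser`:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Invented fragment: <p> has three direct children (text, <a>, text).
html = '<p>Hi <a href="#"><span>Elsie</span></a> bye</p>'
soup = BeautifulSoup(html, "html.parser")

direct = soup.p.contents                       # a list of direct children
children_count = len(list(soup.p.children))    # same nodes, produced lazily

# Keep only Tag nodes so that text nodes are skipped.
child_tags = [node.name for node in soup.p.children if isinstance(node, Tag)]
descendant_tags = [node.name for node in soup.p.descendants if isinstance(node, Tag)]

print(len(direct), children_count)
print(child_tags)
print(descendant_tags)
```

Here `<span>` appears only in `descendant_tags`, because it is nested inside `<a>` rather than being a direct child of `<p>`.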

In [4]:
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
for item in soup.p.contents:
    print(item)
    print('----------------------------------------------------------------')
print('xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx') 
for item in soup.p.children: 
    print(item)
    print('=================================================================')    
print('xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx')   
for item in soup.p.descendants:
    print(item)
    print('#################################################################')


            Once upon a time there were three little sisters; and their names were
            
----------------------------------------------------------------
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
----------------------------------------------------------------


----------------------------------------------------------------
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
----------------------------------------------------------------

            and
            
----------------------------------------------------------------
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
----------------------------------------------------------------

            and they lived at the bottom of a well.
        
----------------------------------------------------------------
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

            Once upon a time there were three little sisters; a

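### Parents and ancestors

* `soup.a.parent`: the parent node
* `soup.a.parents`: the ancestor nodes; this returns a generator, not a list

Iterating over `soup.a.parents` yields the parent of the a tag first, then the parent's parent, and so on up to the `BeautifulSoup` object itself. The last two items (the `html` tag and the `BeautifulSoup` object) both print as the entire document.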
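
Printing only the names of the ancestors makes the order easier to see than printing each node in full. A small sketch, assuming a shortened version of the story markup and `html.parser`:

```python
from bs4 import BeautifulSoup

# Shortened, invented variant of the story markup.
html = '<html><body><p class="story"><a id="link1">Elsie</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

print(soup.a.parent.name)

# .parents is a generator; collect the names from the nearest ancestor outward.
ancestor_names = [tag.name for tag in soup.a.parents]
print(ancestor_names)
```

The final name, `[document]`, belongs to the `BeautifulSoup` object, which is why the last item prints as the whole document.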

In [5]:
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')

for item in soup.a.parents:
    print(item)

<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body>
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
     

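### Siblings

* `soup.a.next_siblings`: all following sibling nodes
* `soup.a.previous_siblings`: all preceding sibling nodes
* `soup.a.next_sibling`: the next sibling node (which may be a string rather than a tag)
* `soup.a.previous_sibling`: the previous sibling node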
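
A short sketch of sibling navigation, assuming a one-line version of the story paragraph. It also shows that the nearest sibling is often a text node, in which case `find_next_sibling()` is a way to skip to the next tag.

```python
from bs4 import BeautifulSoup

# Invented one-line paragraph: three links separated by text.
html = ('<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a>'
        ' and <a id="link3">Tillie</a></p>')
soup = BeautifulSoup(html, "html.parser")

second = soup.find(id="link2")

# The immediate sibling is the text between the tags, not the next <a>.
print(repr(second.next_sibling))

# find_next_sibling("a") skips text nodes and returns the next <a> tag.
print(second.find_next_sibling("a")["id"])

following = list(second.next_siblings)        # text node, then the third link
preceding = list(second.previous_siblings)    # nearest first: text, then the first link
print(len(following), len(preceding))
```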

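### .string, .strings, and .stripped_strings

* If a tag has a single child of type NavigableString, `.string` returns that child.
* If a tag has a single child that is itself a tag, `.string` returns the same value as that child's `.string`.
* If a tag has more than one child, it is ambiguous which child `.string` should refer to, so `.string` is None.
* If a tag contains several strings, loop over them with `.strings`.
* If the strings contain a lot of extra spaces or blank lines, use `.stripped_strings` to remove the surrounding whitespace.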
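
The rules above can be checked on a small fragment. This is a sketch with invented markup, parsed with `html.parser`:

```python
from bs4 import BeautifulSoup

# Invented fragment; the <ul> contains whitespace between its <li> children.
html = """<div><h4>Hello</h4><p><b>Bold</b></p><ul>
  <li>Foo</li>
  <li>Bar</li>
</ul></div>"""
soup = BeautifulSoup(html, "html.parser")

print(soup.h4.string)    # single string child
print(soup.p.string)     # single tag child: falls through to <b>'s string
print(soup.ul.string)    # several children: None

raw = list(soup.ul.strings)              # includes whitespace-only strings
clean = list(soup.ul.stripped_strings)   # stripped, whitespace-only strings dropped
print(raw)
print(clean)
```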

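### text and get_text()

* `text` is a property that returns the same value as `get_text()` called with no arguments.
* `get_text()` returns all the text inside a tag, including the text of descendant tags, as a single Unicode string.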
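
`get_text()` also accepts a separator and a `strip` flag, which is often more convenient than calling `.strip()` on each result. A minimal sketch with invented markup:

```python
from bs4 import BeautifulSoup

# Invented fragment with no whitespace between tags.
html = '<div><h4>Hello</h4><ul><li>Foo</li><li>Bar</li></ul></div>'
soup = BeautifulSoup(html, "html.parser")

joined = soup.div.get_text()                  # all text, concatenated
same_as_text = soup.div.text                  # the property gives the same string
spaced = soup.div.get_text(" ", strip=True)   # strip each piece, join with a space

print(joined)
print(spaced)
```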

In [10]:
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

for item in soup.find_all(True):
    print(item.string)
print('xxxxxxxxxxxxxxxxxxxxxxxxxxxxx')

for item in soup.find_all(True):
    print(item.get_text().strip())
    
print('xxxxxxxxxxxxxxxxxxxxxxxxxxxxx')

print(soup.find_all(string='Foo'))

None
None
None
None
Hello
None
None
Foo
Bar
Jay
None
Foo
Bar
xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Hello



Foo
Bar
Jay


Foo
Bar
Hello



Foo
Bar
Jay


Foo
Bar
Hello



Foo
Bar
Jay


Foo
Bar
Hello
Hello
Foo
Bar
Jay


Foo
Bar
Foo
Bar
Jay
Foo
Bar
Jay
Foo
Bar
Foo
Bar
xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
['Foo', 'Foo']


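## Standard selectors

### find_all

`find_all(name, attrs, recursive, string, limit, **kwargs)`: searches the document by tag name, attributes, and text content, and returns the matches as a list. Older versions call the `string` argument `text`.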
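
A sketch of the main `find_all()` arguments in use, assuming a small invented fragment and `html.parser`:

```python
from bs4 import BeautifulSoup

# Invented fragment: one list followed by a heading.
html = """
<ul class="list" id="list-1">
  <li class="element">Foo</li>
  <li class="element">Bar</li>
  <li class="element">Jay</li>
</ul>
<h4>Hello</h4>
"""
soup = BeautifulSoup(html, "html.parser")

items = soup.find_all("li")                       # by tag name
first_two = soup.find_all("li", limit=2)          # stop after two matches
mixed = soup.find_all(["ul", "h4"])               # any of several tag names
foo_tags = soup.find_all("li", string="Foo")      # by tag name and text
top_level = soup.find_all("li", recursive=False)  # direct children of soup only

print(len(items), len(first_two), len(mixed), len(foo_tags), len(top_level))
```

`top_level` is empty because the `<li>` tags are children of `<ul>`, not of the document itself.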

In [6]:
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>


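### find and chained find

`find(name, attrs, recursive, string, **kwargs)`: returns the first match.

The only difference from `find_all()` is that `find_all()` returns a list, while `find()` returns the result directly.

When nothing matches, `find_all()` returns an empty list and `find()` returns None.

Note: `find_all()` and `find()` search only the descendants of the current node: its children, grandchildren, and so on.

`find()` calls can be chained, for example: `soup.find('tr').find('td')`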
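
Because `find()` returns None when nothing matches, a chained call fails with `AttributeError` if an earlier step finds nothing. A sketch with an invented table, showing both the chain and a guarded version:

```python
from bs4 import BeautifulSoup

# Invented table with a single row.
html = '<table><tr><td>1</td><td>2</td></tr></table>'
soup = BeautifulSoup(html, "html.parser")

first_cell = soup.find("tr").find("td")   # chained find: first <td> in the first <tr>
print(first_cell.string)

missing_one = soup.find("th")             # no match: None
missing_all = soup.find_all("th")         # no match: empty list
print(missing_one, len(missing_all))

# Guard each step when a tag may be absent (<tfoot> is not in this table).
footer = soup.find("tfoot")
footer_cell = footer.find("td") if footer is not None else None
print(footer_cell)
```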

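### Other find selectors

* `find_parent()`: returns the direct parent node
* `find_parents()`: returns all ancestor nodes
* `find_previous_sibling()`: returns the first preceding sibling node
* `find_previous_siblings()`: returns all preceding sibling nodes
* `find_next_sibling()`: returns the first following sibling node
* `find_next_siblings()`: returns all following sibling nodes
* `find_next()`: returns the first matching node after the current node
* `find_all_next()`: returns all matching nodes after the current node
* `find_previous()`: returns the first matching node before the current node
* `find_all_previous()`: returns all matching nodes before the current node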
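
A sketch that exercises several of these methods from the middle item of an invented list, parsed with `html.parser`:

```python
from bs4 import BeautifulSoup

# Invented list; the search starts from the middle <li>.
html = '<ul id="menu"><li id="a">A</li><li id="b">B</li><li id="c">C</li></ul>'
soup = BeautifulSoup(html, "html.parser")

middle = soup.find(id="b")

parent_id = middle.find_parent("ul")["id"]
next_id = middle.find_next_sibling("li")["id"]
prev_id = middle.find_previous_sibling("li")["id"]

# find_next and find_all_previous walk the document in parse order,
# not only among siblings.
next_in_doc = middle.find_next("li")["id"]
earlier = [li["id"] for li in middle.find_all_previous("li")]

print(parent_id, prev_id, next_id, next_in_doc, earlier)
```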

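### attrs

`attrs` takes a dictionary of attributes to match. `class` is a special case because it is a reserved word in Python and cannot be used as a keyword argument.

To search by class, pass the keyword argument `class_='element'`, or use a dictionary: `soup.find_all(attrs={"class": "element"})`.

Common attributes such as `id` can be passed directly as keyword arguments without `attrs`. An attribute called `name` must go through `attrs`, because `name` is already the parameter for the tag name.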
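
The different ways of matching attributes can be compared side by side. A sketch with an invented fragment:

```python
from bs4 import BeautifulSoup

# Invented fragment: the <ul> has two classes plus id and name attributes.
html = ('<ul class="list list-small" id="list-2" name="elements">'
        '<li class="element">Foo</li><li class="element">Bar</li></ul>')
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find_all(class_="element")               # class_ keyword
by_dict = soup.find_all(attrs={"class": "element"})        # dictionary form
by_id = soup.find_all(id="list-2")                         # id as a keyword argument
by_name_attr = soup.find_all(attrs={"name": "elements"})   # "name" needs attrs

# class_ matches a tag if any one of its classes equals the value.
small_lists = soup.find_all("ul", class_="list-small")

print(len(by_keyword), len(by_dict), len(by_id), len(by_name_attr), len(small_lists))
```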

In [7]:
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]


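## CSS selectors

Pass a CSS selector to `select()` to make a selection. Anyone familiar with front-end work will already know CSS, and the syntax here is the same.

* `.` selects by class
* `#` selects by id
* `tag1, tag2` finds all tag1 and all tag2 elements
* `tag1 tag2` finds all tag2 elements inside tag1
* `[attr]` finds all tags that have the given attribute
* `[attr=value]` matches an attribute value; for example, `[target=_blank]` finds all tags with target=_blank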
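
A sketch covering each selector form in the list, using an invented fragment. `select_one()` is used at the end to get a single element instead of a list.

```python
from bs4 import BeautifulSoup

# Invented fragment with two links, one of which opens in a new tab.
html = """
<div class="panel">
  <h4 id="title">Hello</h4>
  <a href="/a" target="_blank">A</a>
  <a href="/b">B</a>
  <p>Text</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

links_in_panel = soup.select(".panel a")          # class, then descendant tag
title = soup.select("#title")                     # id
heading_or_para = soup.select("h4, p")            # either tag
with_target = soup.select("a[target]")            # attribute present
new_tab = soup.select('a[target="_blank"]')       # attribute equals value
first_para = soup.select_one(".panel p")          # first match only, or None

print(len(links_in_panel), len(title), len(with_target))
print([tag.name for tag in heading_or_para])
print(new_tab[0]["href"], first_para.get_text())
```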

In [11]:
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))

[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>


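### Summary

* Prefer the lxml parser; use html.parser when necessary
* Tag selectors (`soup.tagname`) offer weak filtering but are fast
* Use find() to match a single result and find_all() to match multiple results
* If you are familiar with CSS selectors, use select()
* Remember the common ways to get attributes and text values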