# Searching the document tree

Beautiful Soup defines many search methods. This section focuses on two of them: find() and find_all().

In [24]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')

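## Types of filters

### Strings

The simplest filter is a string. Pass a string to a search method and Beautiful Soup matches tags whose name is exactly that string.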

In [3]:
soup.find_all('b')

[<b>The Dormouse's story</b>]

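If you pass a byte string, Beautiful Soup assumes it is encoded as UTF-8. Pass a Unicode string instead to avoid decoding errors.

### Regular expressions

If you pass a regular expression object, Beautiful Soup filters against it using the pattern's search() method. The following example finds all tags whose names start with b: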

In [5]:
import re
for tag in soup.find_all(re.compile(r"^b")):
    print(tag.name)

body
b


Find all tags whose names contain the letter t:

In [7]:
for tag in soup.find_all(re.compile(r"t")):
    print(tag.name)

html
title
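
The result above includes html even though that name does not start with t, which is the behaviour of a pattern's search() rather than its match(). A minimal sketch of the difference, using only the standard re module; the list `tag_names` is typed in by hand from the example document and is not produced by Beautiful Soup:

```python
import re

# Tag names copied by hand from the example document (an assumption for
# illustration; nothing here calls Beautiful Soup).
tag_names = ["html", "head", "title", "body", "p", "b", "a"]

pattern = re.compile(r"t")

# search() succeeds if the pattern occurs anywhere in the name.
found_by_search = [name for name in tag_names if pattern.search(name)]

# match() succeeds only if the pattern occurs at the very start of the name.
found_by_match = [name for name in tag_names if pattern.match(name)]

print(found_by_search)  # ['html', 'title']
print(found_by_match)   # ['title']
```

To restrict a filter to names that begin with a given prefix, anchor the pattern with ^, as in the earlier `^b` example.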


### Lists

If you pass a list, Beautiful Soup returns everything that matches any item in the list.

The following code finds all ```<a>``` and ```<b>``` tags in the document:

In [8]:
soup.find_all(["a", "b"])

[<b>The Dormouse's story</b>,
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

### True

True matches any value. The following code finds every tag in the document but returns none of the text strings:

In [9]:
for tag in soup.find_all(True):
    print(tag.name)

html
head
title
body
p
b
p
a
a
a
p


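### Functions

You can also pass in a function you define yourself. The function takes an element as its only argument and should return True if the element matches and False otherwise.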

In [26]:
def has_class_but_not_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")

In [27]:
soup.find_all(has_class_but_not_id)

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]
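
As a rough mental model of how a function filter is applied, the sketch below tests every tag in document order by hand and compares the result with find_all(). It is an illustration only, not Beautiful Soup's actual implementation. The shortened HTML snippet and the use of the built-in `html.parser` are assumptions made to keep the example self-contained.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, hypothetical snippet; "html.parser" avoids the lxml dependency.
html = '<p class="title"><b>Story</b></p><a class="sister" id="link1">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")

def has_class_but_not_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")

# Walk every node, skip text strings, and keep the tags the function accepts.
manual = [
    node for node in soup.descendants
    if isinstance(node, Tag) and has_class_but_not_id(node)
]

manual_names = [tag.name for tag in manual]
found_names = [tag.name for tag in soup.find_all(has_class_but_not_id)]

print(manual_names)  # ['p']
print(found_names)   # ['p']
```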

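The following code finds all tags that are surrounded by string nodes, that is, tags whose previous and next elements are both strings: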

In [28]:
from bs4 import NavigableString
def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString) and isinstance(tag.previous_element, NavigableString))

In [29]:
for tag in soup.find_all(surrounded_by_strings):
    print(tag.name)

p
a
a
a
p


## find_all()

find_all( name , attrs , recursive , text , **kwargs )

The find_all() method searches all descendants of the current tag and returns every one that matches the filter conditions.

In [35]:
soup.find_all("title")

[<title>The Dormouse's story</title>]

In [36]:
soup.find_all("p", "title")

[<p class="title"><b>The Dormouse's story</b></p>]

In [37]:
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [38]:
soup.find_all(id="link2")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

In [39]:
import re
soup.find(text=re.compile('sisters'))

'Once upon a time there were three little sisters; and their names were\n'
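
The last cell uses find() rather than find_all(). A small sketch of how the two differ in what they return; the two-link HTML snippet and the `html.parser` parser are assumptions chosen to keep the example short and self-contained:

```python
from bs4 import BeautifulSoup

# Hypothetical two-link snippet, parsed with the built-in parser.
html = '<p><a id="link1">Elsie</a> <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")               # a single Tag: the first match
all_links = soup.find_all("a")       # a list containing every match
missing = soup.find("table")         # None when nothing matches
none_found = soup.find_all("table")  # an empty list when nothing matches

print(first["id"])      # link1
print(len(all_links))   # 2
print(missing)          # None
print(len(none_found))  # 0
```

Because find() can return None, check its result before accessing attributes on it.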

### The name argument

The name argument restricts the search to tags with the given name. Text strings are ignored automatically.

In [40]:
soup.find_all("title")

[<title>The Dormouse's story</title>]

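**The value of name can be any type of filter: a string, a regular expression, a list, a function, or True.**

### Keyword arguments

Any keyword argument that find_all() does not recognize is treated as a filter on the tag attribute with that name. For example, if you pass an argument named id, Beautiful Soup filters on each tag's "id" attribute: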

In [42]:
soup.find_all(id="link2")

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass an href argument, Beautiful Soup filters on each tag's href attribute:

In [43]:
soup.find_all(href=re.compile("elsie"))

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

An attribute filter can be a string, a regular expression, a list, a function, or True.

Find all tags that have an id attribute, regardless of its value:

In [50]:
soup.find_all(id=True)

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
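
A list works as an attribute filter too: a tag matches if its attribute equals any item in the list. A minimal sketch, assuming a shortened version of the three links and the built-in `html.parser`:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the three sister links.
html = (
    '<a id="link1">Elsie</a>'
    '<a id="link2">Lacie</a>'
    '<a id="link3">Tillie</a>'
)
soup = BeautifulSoup(html, "html.parser")

# Keep tags whose id is either "link1" or "link3".
matches = soup.find_all(id=["link1", "link3"])
names = [tag.get_text() for tag in matches]

print(names)  # ['Elsie', 'Tillie']
```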

You can filter on several attributes at once by passing more than one keyword argument:

In [54]:
soup.find_all(href=re.compile("elsie"), id="link1")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

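Some attributes have names that cannot be used as keyword arguments, such as the HTML5 data-* attributes, because a hyphenated name is not a valid Python identifier: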

In [57]:
data_soup = BeautifulSoup("<div data-foo='value'>foo！</div>", 'lxml')

In [58]:
data_soup.find_all(data-foo='value')

SyntaxError: keyword can't be an expression (<ipython-input-58-cb7f77bd5a5e>, line 1)

You can still search on such attributes by passing a dictionary as the attrs argument of find_all():

In [60]:
data_soup.find_all(attrs={"data-foo":"value"})

[<div data-foo="value">foo！</div>]
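
The values in an attrs dictionary can be the same kinds of filters used for keyword arguments. The sketch below tries an exact string, a regular expression, and True; the three-div snippet and the `html.parser` parser are assumptions for illustration:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet: two divs carry data-foo, one does not.
html = (
    '<div data-foo="value">foo</div>'
    '<div data-foo="other">bar</div>'
    '<div>baz</div>'
)
data_soup = BeautifulSoup(html, "html.parser")

def texts(tags):
    """Return the text of each tag, to keep the printed output short."""
    return [tag.get_text() for tag in tags]

exact = texts(data_soup.find_all(attrs={"data-foo": "value"}))
by_regex = texts(data_soup.find_all(attrs={"data-foo": re.compile(r"^oth")}))
any_value = texts(data_soup.find_all(attrs={"data-foo": True}))

print(exact)      # ['foo']
print(by_regex)   # ['bar']
print(any_value)  # ['foo', 'bar']
```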

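### Searching by CSS class

Searching for tags by CSS class is very useful, but class is a reserved word in Python, so using class as a keyword argument causes a syntax error.

**Instead, use the ```class_``` argument to search for tags with a given CSS class.**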

In [63]:
soup.find_all("a", class_="sister")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

**The ```class_``` argument also accepts the different types of filters: a string, a regular expression, a function, or True.**

In [64]:
soup.find_all(class_=re.compile("itl"))

[<p class="title"><b>The Dormouse's story</b></p>]

In [65]:
def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

In [66]:
soup.find_all(class_=has_six_characters)

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
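
The ```class_``` keyword is one of several ways to express the same search. The sketch below compares it with an attrs dictionary and with a string passed as the second positional argument, the form used by soup.find_all("p", "title") earlier. The shortened HTML and the `html.parser` parser are assumptions made for a self-contained example.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the example document.
html = (
    '<p class="title">Heading</p>'
    '<a class="sister" id="link1">Elsie</a>'
    '<a class="sister" id="link2">Lacie</a>'
)
soup = BeautifulSoup(html, "html.parser")

def ids(tags):
    """Return the id of each tag, to make the results easy to compare."""
    return [tag["id"] for tag in tags]

by_keyword = ids(soup.find_all("a", class_="sister"))
by_attrs = ids(soup.find_all("a", attrs={"class": "sister"}))
# A string in the second positional slot is treated as a CSS class.
by_position = ids(soup.find_all("a", "sister"))

print(by_keyword)   # ['link1', 'link2']
print(by_attrs)     # ['link1', 'link2']
print(by_position)  # ['link1', 'link2']
```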