# Navigating the tree

In [1]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')

## The simplest way to navigate the tree is to use the name of the tag you want

In [2]:
soup.title

<title>The Dormouse's story</title>

In [3]:
soup.head

<head><title>The Dormouse's story</title></head>

### You can chain this attribute access as often as you like; the code below gets the first ```<b>``` tag inside the ```<body>``` tag

In [4]:
soup.body.b

<b>The Dormouse's story</b>

### Accessing a tag name as an attribute returns only the first tag with that name

In [5]:
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

### To get all the ```<a>``` tags, or anything more than the first tag with a given name, use one of the methods described in Searching the tree, such as find_all()

In [6]:
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
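
The difference between attribute access and find_all() can be sketched on a smaller input. The fragment below is a hypothetical two-link snippet, not html_doc, and it assumes the built-in 'html.parser' backend rather than 'lxml' so that it runs without extra dependencies; the names demo_soup, first_link, same_link and all_links are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical two-link fragment; "html.parser" is an assumption made here
# so the sketch does not depend on lxml being installed.
demo_html = '<p><a id="x" href="/one">one</a> <a id="y" href="/two">two</a></p>'
demo_soup = BeautifulSoup(demo_html, "html.parser")

first_link = demo_soup.a             # attribute access: first match only
same_link = demo_soup.find("a")      # find() returns that same first match
all_links = demo_soup.find_all("a")  # find_all() returns every match

print(first_link is same_link)             # True
print([a["href"] for a in all_links])      # ['/one', '/two']
print(demo_soup.table)                     # None: no <table> in the fragment
```

Attribute access on a tag name that does not occur returns None rather than raising an error, so a chain such as demo_soup.table.tr would fail with an AttributeError on None.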

## .contents and .children

### A tag's .contents attribute returns the tag's children as a list

In [7]:
head_tag = soup.head
head_tag

<head><title>The Dormouse's story</title></head>

In [8]:
head_tag.contents

[<title>The Dormouse's story</title>]

In [9]:
title_tag = head_tag.contents[0]
title_tag

<title>The Dormouse's story</title>

In [10]:
title_tag.contents

["The Dormouse's story"]

### The BeautifulSoup object itself has children; here the ```<html>``` tag is the child of the BeautifulSoup object

In [11]:
len(soup.contents)

1

In [12]:
soup.contents[0].name

'html'

### A string has no .contents attribute, because a string cannot contain child nodes

In [13]:
text = title_tag.contents[0]
text.contents

AttributeError: 'NavigableString' object has no attribute 'contents'

### Instead of getting the children as a list, you can loop over them with a tag's .children generator

In [14]:
for child in soup.body.children:
    print(child)

<p class="title"><b>The Dormouse's story</b></p>


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<p class="story">...</p>
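
The blank lines in the output above come from whitespace-only strings, which count as children just like tags do. A minimal sketch of how one might keep only the tag children follows; the ```<div>``` fragment and the names demo_soup, all_children and tag_children are hypothetical, and 'html.parser' is assumed instead of 'lxml'.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment with newlines between the child tags.
demo_soup = BeautifulSoup("<div>\n<p>one</p>\n<p>two</p>\n</div>", "html.parser")

all_children = list(demo_soup.div.children)
print(len(all_children))                    # 5: three '\n' strings and two <p> tags

# Keep only Tag objects and drop the whitespace strings between them.
tag_children = [child for child in all_children if isinstance(child, Tag)]
print([child.name for child in tag_children])   # ['p', 'p']

# .contents holds the same nodes as a real list, so it supports indexing.
print(demo_soup.div.contents[1])            # <p>one</p>
```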




## .descendants

.contents and .children consider only a tag's direct children. For example, the ```<head>``` tag has a single direct child, the ```<title>``` tag

In [15]:
head_tag.contents

[<title>The Dormouse's story</title>]

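But the ```<title>``` tag itself has a child: a string. That string is therefore also a descendant of the ```<head>``` tag. The .descendants attribute iterates recursively over all of a tag's descendants: its direct children, the children of those children, and so on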

In [16]:
for child in head_tag.descendants:
    print(child)

<title>The Dormouse's story</title>
The Dormouse's story


In [17]:
for child in soup.descendants:
    print(child)

<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
The Dormouse's story


<body><p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bot

In [18]:
len(list(soup.children))

1

In [19]:
len(list(soup.descendants))

25
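
The gap between the two counts is easier to see on a small input. The sketch below uses a hypothetical nested fragment and the assumed 'html.parser' backend; the names demo_soup, div and labels are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment: one direct child, several nested descendants.
demo_soup = BeautifulSoup("<div><p>Hello <b>world</b></p></div>", "html.parser")
div = demo_soup.div

print(len(list(div.children)))      # 1: only the <p> tag

# Descendants are visited in document order, tags and strings alike.
labels = [d.name if isinstance(d, Tag) else str(d) for d in div.descendants]
print(labels)                       # ['p', 'Hello ', 'b', 'world']
```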

## .string

If a tag has only one child, and that child is a NavigableString, the child is available as .string

In [20]:
title_tag.string

"The Dormouse's story"

If a tag's only child is another tag, and that tag has a .string, the parent tag is considered to have the same .string as its child

In [21]:
head_tag.contents

[<title>The Dormouse's story</title>]

In [22]:
head_tag.string

"The Dormouse's story"

If a tag contains more than one child, it is ambiguous which child .string should refer to, so .string is None:

In [23]:
print(soup.html.string)

None
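
The three cases can be put side by side in a short sketch. It assumes a hypothetical fragment and the 'html.parser' backend; get_text() is shown as one way to collect text when .string is None.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the first <p> has a single-child chain,
# the second <p> has two children (a string and an <i> tag).
demo_soup = BeautifulSoup(
    "<div><p><b>only</b></p><p>one <i>two</i></p></div>", "html.parser"
)
first_p, second_p = demo_soup.find_all("p")

print(first_p.b.string)       # only  (single NavigableString child)
print(first_p.string)         # only  (single child tag, so its .string is reused)
print(second_p.string)        # None  (two children, so .string is ambiguous)
print(second_p.get_text())    # one two
```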


## .strings and .stripped_strings

### If a tag contains more than one string, you can loop over them with the .strings generator

In [24]:
for string in soup.strings:
    print(repr(string))

"The Dormouse's story"
'\n'
"The Dormouse's story"
'\n'
'Once upon a time there were three little sisters; and their names were\n'
'Elsie'
',\n'
'Lacie'
' and\n'
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
'...'
'\n'


### These strings tend to contain a lot of extra whitespace and blank lines, which ```.stripped_strings``` removes.

In [25]:
for string in soup.stripped_strings:
    print(repr(string))

"The Dormouse's story"
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
'Elsie'
','
'Lacie'
'and'
'Tillie'
';\nand they lived at the bottom of a well.'
'...'


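Strings that consist entirely of whitespace are skipped, and whitespace at the beginning and end of each remaining string is removed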
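
A common use of .stripped_strings is to rebuild readable text from a tag. The sketch below is one way to do that, assuming a hypothetical fragment and the 'html.parser' backend; the names raw and clean are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with indentation and newlines inside the markup.
demo_soup = BeautifulSoup("<p>\n  Hello,\n  <b> world </b>\n</p>", "html.parser")

raw = list(demo_soup.strings)
clean = list(demo_soup.stripped_strings)

print(raw)               # ['\n  Hello,\n  ', ' world ', '\n']
print(clean)             # ['Hello,', 'world']
print(" ".join(clean))   # Hello, world
```

Only leading and trailing whitespace is stripped; whitespace in the middle of a string, such as the newline in ';\nand they lived at the bottom of a well.', is kept.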

## Parent nodes

Every tag and every string has a parent: the tag that contains it

### The .parent attribute returns an element's parent

In [26]:
title_tag = soup.title
title_tag

<title>The Dormouse's story</title>

In [27]:
title_tag.parent

<head><title>The Dormouse's story</title></head>

The title string itself also has a parent: the ```<title>``` tag that contains it

In [28]:
title_tag.string.parent

<title>The Dormouse's story</title>

The parent of a top-level tag such as ```<html>``` is the BeautifulSoup object itself

In [29]:
html_tag = soup.html
type(html_tag.parent)

bs4.BeautifulSoup

The .parent of the BeautifulSoup object is None

### .parents

An element's .parents attribute iterates over all of its ancestors. The example below uses .parents to walk from an ```<a>``` tag up to the root of the document

In [30]:
link = soup.a
link

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [31]:
# .parents stops after the BeautifulSoup object, so it never yields None
for parent in link.parents:
    print(parent.name)

p
body
html
[document]
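
One practical use of .parents is to build a readable path to an element. The sketch below assumes a hypothetical four-level fragment and the 'html.parser' backend; the names chain and path are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the nesting is written out explicitly because
# "html.parser" (assumed here) does not add missing <html>/<body> tags.
demo_soup = BeautifulSoup(
    "<html><body><p><a href='#'>x</a></p></body></html>", "html.parser"
)
link = demo_soup.a

chain = [parent.name for parent in link.parents]
print(chain)               # ['p', 'body', 'html', '[document]']

# Drop the '[document]' root, add the tag itself, and read from the top down.
path = " > ".join(reversed([link.name] + chain[:-1]))
print(path)                # html > body > p > a

print(demo_soup.parent)    # None: the BeautifulSoup object has no parent
```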


## Sibling nodes

In [32]:
sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'lxml')

In [33]:
print(sibling_soup.prettify())

<html>
 <body>
  <a>
   <b>
    text1
   </b>
   <c>
    text2
   </c>
  </a>
 </body>
</html>


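The ```<b>``` and ```<c>``` tags are siblings: they are direct children of the same tag, so prettify() prints them at the same indentation level

### Use .next_sibling and .previous_sibling to move between siblings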

In [34]:
sibling_soup.b.next_sibling

<c>text2</c>

In [35]:
sibling_soup.c.previous_sibling

<b>text1</b>

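**In a real document, a tag's .next_sibling or .previous_sibling is usually a string of text or whitespace**

In the Alice document, the .next_sibling of the first ```<a>``` tag is not the second ```<a>``` tag but the comma and newline that separate them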

In [36]:
link = soup.a
link

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [37]:
link.next_sibling

',\n'

**The second ```<a>``` tag is the .next_sibling of that comma string**

In [38]:
link.next_sibling.next_sibling

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
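
Chaining .next_sibling twice works only when exactly one string sits between the tags. As a hedged alternative, the sketch below uses find_next_sibling() to skip intervening strings; the fragment and the name first_link are hypothetical, and 'html.parser' is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with a comma and newline between two links.
demo_soup = BeautifulSoup(
    '<p><a id="a1">A</a>,\n<a id="a2">B</a></p>', "html.parser"
)
first_link = demo_soup.find(id="a1")

print(repr(first_link.next_sibling))         # ',\n'
print(first_link.find_next_sibling("a"))     # <a id="a2">B</a>
```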

### .next_siblings and .previous_siblings

The .next_siblings and .previous_siblings attributes iterate over the siblings that follow or precede the current element

In [39]:
for sibling in soup.a.next_siblings:
    print(repr(sibling))

',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'


In [40]:
for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))

' and\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
',\n'
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
'Once upon a time there were three little sisters; and their names were\n'
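
Note that .previous_siblings yields the nearest sibling first, so the results come out in reverse document order. The sketch below, on a hypothetical fragment with the assumed 'html.parser' backend, keeps only the tag siblings and shows that ordering.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment: three links separated by text.
demo_soup = BeautifulSoup(
    '<p>x <a id="a1">A</a>, <a id="a2">B</a> and <a id="a3">C</a>.</p>',
    "html.parser",
)

first_link = demo_soup.find(id="a1")
later_ids = [s["id"] for s in first_link.next_siblings if isinstance(s, Tag)]
print(later_ids)      # ['a2', 'a3']

last_link = demo_soup.find(id="a3")
earlier_ids = [s["id"] for s in last_link.previous_siblings if isinstance(s, Tag)]
print(earlier_ids)    # ['a2', 'a1']: nearest sibling first
```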


## Going back and forth

In [41]:
# <html><head><title>The Dormouse's story</title></head>
# <p class="title"><b>The Dormouse's story</b></p>

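An HTML parser turns this string into a series of events: "open an ```<html>``` tag", "open a ```<head>``` tag", "open a ```<title>``` tag", "add a string", "close the ```<title>``` tag", "open a ```<p>``` tag", and so on. Beautiful Soup offers tools for reconstructing the initial parse of the document.

### .next_element and .previous_element

The .next_element attribute points to whatever was parsed immediately afterwards (a string or a tag). It may be the same as .next_sibling, but usually it is different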

In [42]:
last_a_tag = soup.find("a", id="link3")
last_a_tag

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

In [43]:
last_a_tag.next_sibling

';\nand they lived at the bottom of a well.'

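**The .next_element attribute of that ```<a>``` tag, however, is whatever was parsed immediately after the opening ```<a>``` tag: not the rest of the sentence, but the string "Tillie"**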

In [44]:
last_a_tag.next_element

'Tillie'

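That is because in the original markup the string "Tillie" appears before the semicolon. The parser encountered the ```<a>``` tag, then the string "Tillie", then the closing ```</a>``` tag, then the semicolon and the rest of the sentence. The semicolon is at the same level as the ```<a>``` tag, but the string "Tillie" was parsed first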

In [45]:
last_a_tag.previous_element

' and\n'

In [46]:
last_a_tag.previous_element.next_element

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
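
The contrast between sibling navigation and parse-order navigation can be sketched on a shorter input. The fragment below is hypothetical and 'html.parser' is assumed; the name link is illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled loosely on the last link in the story.
demo_soup = BeautifulSoup("<p><a>Tillie</a>; the end</p><p>...</p>", "html.parser")
link = demo_soup.a

print(repr(link.next_sibling))      # '; the end' (same level as the <a> tag)
print(repr(link.next_element))      # 'Tillie'    (parsed right after <a> opened)
print(link.previous_element.name)   # p           (parsed right before <a>)
```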

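### .next_elements and .previous_elements

These iterators move forward or backward through the document in the order it was parsed, as if the document were being parsed again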

In [47]:
for element in last_a_tag.next_elements:
    print(repr(element))

'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
<p class="story">...</p>
'...'
'\n'
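
Because .next_elements yields each tag and then the strings inside it, filtering by type avoids handling the same text twice. The sketch below collects only the strings that follow a tag; the fragment and the name following_text are hypothetical, and 'html.parser' is assumed.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical fragment: a link followed by text and another paragraph.
demo_soup = BeautifulSoup("<p><a>Tillie</a>; the end</p><p>...</p>", "html.parser")
link = demo_soup.a

# Keep strings only; the second <p> tag itself is skipped, its text is kept.
following_text = [
    str(element)
    for element in link.next_elements
    if isinstance(element, NavigableString)
]
print(following_text)    # ['Tillie', '; the end', '...']
```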
