# BeautifulSoup 

## Parsers

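Parser | Usage | Advantages | Disadvantages
:- | :- | :- | :-
Python standard library | `BeautifulSoup(markup, "html.parser")` | Built into Python; moderate speed; lenient with malformed documents | Less lenient in Python versions before 2.7.3 or 3.2.2
lxml HTML parser | `BeautifulSoup(markup, "lxml")` | Very fast; lenient with malformed documents | Requires an external C dependency
lxml XML parser | `BeautifulSoup(markup, "xml")` | Very fast; the only XML parser currently supported | Requires an external C dependency
html5lib | `BeautifulSoup(markup, "html5lib")` | Most lenient; parses pages the way a browser does; produces valid HTML5 | Very slow; requires an external Python dependency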
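
Since `lxml` and `html5lib` are optional installs, a script may want to fall back to the built-in parser when they are missing. The following is a minimal sketch under that assumption; the helper name `make_soup` and its `preferred` argument are hypothetical and not part of bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: try each parser in order, return (soup, parser_name)."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is not installed.
            continue
    raise RuntimeError("no usable parser found")


# Deliberately unclosed tags: both parsers close them at the end of the document.
soup, used = make_soup("<p>Hello<b>world")
print("parser used:", used)
print("b.string:", soup.b.string)
```

`html.parser` ships with Python, so the loop always has a last resort as long as it is included in `preferred`.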

## Basic usage

In [2]:
html="""
<html><head><title>The Document's stroy</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's stroy</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://www.example.com/else" class="sister" id="link1"><!-- Else --></a>,
<a href="http://www.example.com/lacie" class="sister" id="link2">Lacle</a> and
<a href="http://www.example.com/tillle" class="sister" id="link3">Tille</a>;
and they lived at the bottom of a wall.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
print("soup.prettify():\n",soup.prettify())
print("soup.title.string:",soup.title.string)

soup.prettify():
 <html>
 <head>
  <title>
   The Document's stroy
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's stroy
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://www.example.com/else" id="link1">
    <!-- Else -->
   </a>
   ,
   <a class="sister" href="http://www.example.com/lacie" id="link2">
    Lacle
   </a>
   and
   <a class="sister" href="http://www.example.com/tillle" id="link3">
    Tille
   </a>
   ;
and they lived at the bottom of a wall.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
soup.title.string: The Document's stroy


## Tag selectors

### Selecting elements

In [2]:
html="""
<html><head><title>The Document's stroy</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's stroy</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://www.example.com/else" class="sister" id="link1"><!-- Else --></a>,
<a href="http://www.example.com/lacie" class="sister" id="link2">Lacle</a> and
<a href="http://www.example.com/tillle" class="sister" id="link3">Tille</a>;
and they lived at the bottom of a wall.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
print(soup.title)
print(type(soup.title))
print(soup.head)
print(type(soup.head))
print(soup.p)
print(type(soup.p))
print(soup.a)

<title>The Document's stroy</title>
<class 'bs4.element.Tag'>
<head><title>The Document's stroy</title></head>
<class 'bs4.element.Tag'>
<p class="title" name="dromouse"><b>The Dormouse's stroy</b></p>
<class 'bs4.element.Tag'>
<a class="sister" href="http://www.example.com/else" id="link1"><!-- Else --></a>
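
Attribute-style access such as `soup.p` returns only the first matching tag. A small sketch of how this relates to `find` and `find_all`, using a shortened document invented for illustration and `html.parser` so that no extra install is assumed:

```python
from bs4 import BeautifulSoup

html = '<p class="title">first</p><p class="story">second</p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.p               # attribute access: first <p> only
same = soup.find("p")        # find() returns that same first match
all_p = soup.find_all("p")   # find_all() returns every match in a list

print(first)
print(first is same)
print(len(all_p))
```

To reach the second `<p>`, index into the `find_all` result or filter on an attribute.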


### Getting the tag name

In [3]:
html="""
<html><head><title>The Document's stroy</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's stroy</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://www.example.com/else" class="sister" id="link1"><!-- Else --></a>,
<a href="http://www.example.com/lacie" class="sister" id="link2">Lacle</a> and
<a href="http://www.example.com/tillle" class="sister" id="link3">Tille</a>;
and they lived at the bottom of a wall.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
print(soup.title.name)

title


### Getting tag content

In [5]:
html="""
<html><head><title>The Document's stroy</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's stroy<!-- p comment --></b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://www.example.com/else" class="sister" id="link1"><!-- Else,This is Comment --></a>,
<a href="http://www.example.com/lacie" class="sister" id="link2">Lacle</a> and
<a href="http://www.example.com/tillle" class="sister" id="link3">Tille</a>;
and they lived at the bottom of a wall.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
print("p.string:",soup.p.string)
print("p.text:",soup.p.text)
print("b.string:",soup.b.string)
print("b.text:",soup.b.text)
print("a.string:",soup.a.string)
print("a.text:", soup.a.text)

p.string: None
p.text: The Dormouse's stroy
b.string: None
b.text: The Dormouse's stroy
a.string:  Else,This is Comment 
a.text: 


---
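Summary: `tag.string` and `tag.text` clearly return different things.
1. `.text` returns all the text wrapped by the tag, concatenated, with comments excluded.
2. `.string` returns the tag's single child string. If the only child is another tag, it descends into that tag.
3. The key difference: `.string` can return the text of a comment when the comment is the tag's only child. When visible text and a comment sit side by side, `.string` returns `None`, while `.text` still returns the visible text.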
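
The three points above can be checked on a reduced document. This is a sketch with made-up markup; it uses `html.parser` and `get_text()`, which is the method behind the `.text` property.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = (
    '<p><b>visible<!-- hidden --></b></p>'
    '<a href="#"><!-- only a comment --></a>'
)
soup = BeautifulSoup(html, "html.parser")

b_string = soup.b.string      # <b> has two children (text + comment), so None
b_text = soup.b.get_text()    # visible text only, comment skipped

a_string = soup.a.string      # the sole child is a Comment object
a_is_comment = isinstance(a_string, Comment)
a_text = soup.a.get_text()    # comments are excluded, so this is empty

print(repr(b_string), repr(b_text))
print(repr(a_string), a_is_comment, repr(a_text))
```

Checking `isinstance(value, Comment)` is one way to avoid treating comment text as page content when relying on `.string`.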

### Nested selection

In [7]:
html="""
<html><head><title>The Document's stroy</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's stroy<!-- p comment --></b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://www.example.com/else" class="sister" id="link1"><!-- Else,This is Comment --></a>,
<a href="http://www.example.com/lacie" class="sister" id="link2">Lacle</a> and
<a href="http://www.example.com/tillle" class="sister" id="link3">Tille</a>;
and they lived at the bottom of a wall.</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title.string)
print(soup.body.a['id'])

The Document's stroy
link1
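
Chained access and attribute lookup have a few edge cases worth seeing side by side. The sketch below uses invented markup and `html.parser`; it shows that a missing tag yields `None`, that `tag.get()` avoids a `KeyError` for a missing attribute, and that `class` comes back as a list.

```python
from bs4 import BeautifulSoup

html = (
    '<body><p class="story intro">'
    '<a href="http://www.example.com/else" id="link1">Else</a>'
    '</p></body>'
)
soup = BeautifulSoup(html, "html.parser")

link = soup.body.p.a           # each step returns a Tag, so access can be chained
link_id = link["id"]           # dictionary-style access; raises KeyError if absent
missing = link.get("title")    # .get() returns None for a missing attribute
classes = soup.p["class"]      # class is multi-valued, so a list is returned
absent = soup.body.table       # no <table> in the document: None, not an error

print(link_id, missing, classes, absent)
```

Because a missing tag is `None`, a longer chain such as `soup.body.table.tr` would raise `AttributeError`, so it can be safer to check intermediate results.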


### Children and descendants

In [18]:
html="""
<html>
 <head>
  <title>The Document's stroy</title>
 </head> 
 <body> 
  <p class="story">Once upon a time there were three little sisters; and their names were 
      <a href="http://www.example.com/else" class="sister" id="link1">
        <span>Else</span>
      </a>
    <a href="http://www.example.com/lacie" class="sister" id="link2">Lacle</a>
    and 
    <a href="http://www.example.com/tillle" class="sister" id="link3">Tille</a>
    and they lived at the bottom of a wall.
  </p> 
  <p class="story">...</p>
 </body>
</html>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)

['Once upon a time there were three little sisters; and their names were \n      ', <a class="sister" href="http://www.example.com/else" id="link1">
<span>Else</span>
</a>, '\n', <a class="sister" href="http://www.example.com/lacie" id="link2">Lacle</a>, '\n    and \n    ', <a class="sister" href="http://www.example.com/tillle" id="link3">Tille</a>, '\n    and they lived at the bottom of a wall.\n  ']

In [23]:
# children
print("==="*10)
children = soup.p.children
print("type of soup.p.children:",children)
print("==="*10)
for i,child in enumerate(children):
    print(i,child)
    print("------"*12)

# descendants
print("==="*10)
descendants = soup.p.descendants
print("type of soup.p.descendants:",descendants)
print("==="*10)
for i,child in enumerate(descendants):
    print(i,child)
    print("------"*12)

type of soup.p.children: <list_iterator object at 0x000001FB760DA780>
0 Once upon a time there were three little sisters; and their names were 
      
------------------------------------------------------------------------
1 <a class="sister" href="http://www.example.com/else" id="link1">
<span>Else</span>
</a>
------------------------------------------------------------------------
2 

------------------------------------------------------------------------
3 <a class="sister" href="http://www.example.com/lacie" id="link2">Lacle</a>
------------------------------------------------------------------------
4 
    and 
    
------------------------------------------------------------------------
5 <a class="sister" href="http://www.example.com/tillle" id="link3">Tille</a>
------------------------------------------------------------------------
6 
    and they lived at the bottom of a wall.
  
------------------------------------------------------------------------
type of soup.p.descend
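
To make the difference between `.children` and `.descendants` concrete, here is a compact sketch on a one-line document (invented for illustration, parsed with `html.parser`). Filtering on `Tag` is one possible way to skip the text nodes that appear between elements.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p>one <a id="link1"><span>Else</span></a> two</p>'
soup = BeautifulSoup(html, "html.parser")

# Direct children only: 'one ', <a>, ' two'
direct = list(soup.p.children)

# Every node below <p>, depth-first: 'one ', <a>, <span>, 'Else', ' two'
nested = list(soup.p.descendants)

# Keep element nodes only, dropping the text nodes.
child_tags = [c.name for c in soup.p.children if isinstance(c, Tag)]
descendant_tags = [d.name for d in soup.p.descendants if isinstance(d, Tag)]

print(len(direct), len(nested))
print(child_tags, descendant_tags)
```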

### Parents and ancestors

In [31]:
html="""
<html>
 <head>
  <title>The Document's stroy</title>
 </head> 
 <body> 
  <p class="story">Once upon a time there were three little sisters; and their names were 
      <a href="http://www.example.com/else" class="sister" id="link1">
        <span>Else</span>
      </a>
    <a href="http://www.example.com/lacie" class="sister" id="link2">Lacle</a>
    and 
    <a href="http://www.example.com/tillle" class="sister" id="link3">Tille</a>
    and they lived at the bottom of a wall.
  </p> 
  <p class="story">...</p>
 </body>
</html>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print("type of parent:",type(soup.a.parent))
print(soup.a.parent)
print('----'*10)
for p in soup.a.parents:
    if p is None:
        print(p)
    else:
        print(p.name)

type of parent: <class 'bs4.element.Tag'>
<p class="story">Once upon a time there were three little sisters; and their names were 
      <a class="sister" href="http://www.example.com/else" id="link1">
<span>Else</span>
</a>
<a class="sister" href="http://www.example.com/lacie" id="link2">Lacle</a>
    and 
    <a class="sister" href="http://www.example.com/tillle" id="link3">Tille</a>
    and they lived at the bottom of a wall.
  </p>
----------------------------------------
p
body
html
[document]
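
When only one particular ancestor is needed, `find_parent` can replace a manual loop over `.parents`. A minimal sketch on a reduced document, assuming `html.parser`:

```python
from bs4 import BeautifulSoup

html = '<html><body><p><a id="link1">x</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

# .parents walks upward and ends at the BeautifulSoup object itself,
# whose name is the special value "[document]".
names = [ancestor.name for ancestor in soup.a.parents]

# find_parent returns the nearest ancestor matching the filter, or None.
nearest_body = soup.a.find_parent("body")
no_table = soup.a.find_parent("table")

print(names)
print(nearest_body.name, no_table)
```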


### Siblings

In [40]:
html="""
<html>
 <head>
  <title>The Document's stroy</title>
 </head> 
 <body> 
  <p class="story">Once upon a time there were three little sisters; and their names were 
      <a href="http://www.example.com/else" class="sister" id="link1">
        <span>Else</span>
      </a>
    <a href="http://www.example.com/lacie" class="sister" id="link2">Lacle</a>
    and 
    <a href="http://www.example.com/tillle" class="sister" id="link3">Tille</a>
    and they lived at the bottom of a wall.
  </p> 
  <p class="story">...</p>
 </body>
</html>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print('----'*10)
print('previous_sibling:',soup.a.previous_sibling)
print('----'*10)
print('next_sibling:',soup.a.next_sibling)
print('----'*10)
print('----'*10)
print('previous_siblings:',list(soup.a.previous_siblings))
print('----'*10)
print('next_siblings:',list(soup.a.next_siblings))
print('----'*10)

----------------------------------------
previous_sibling: Once upon a time there were three little sisters; and their names were 
      
----------------------------------------
next_sibling: 

----------------------------------------
----------------------------------------
previous_siblings: ['Once upon a time there were three little sisters; and their names were \n      ']
----------------------------------------
next_siblings: ['\n', <a class="sister" href="http://www.example.com/lacie" id="link2">Lacle</a>, '\n    and \n    ', <a class="sister" href="http://www.example.com/tillle" id="link3">Tille</a>, '\n    and they lived at the bottom of a wall.\n  ']
----------------------------------------
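
As the output shows, `.next_sibling` often lands on a whitespace string rather than on the next tag. One way around this is `find_next_sibling`, sketched here on a small invented document with `html.parser`:

```python
from bs4 import BeautifulSoup

html = '<p>\n<a id="link1">A</a>\n<a id="link2">B</a>\n</p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.a
raw_next = first.next_sibling             # the newline between the two <a> tags
next_tag = first.find_next_sibling("a")   # skips text nodes, returns the next <a>
prev_tag = first.find_previous_sibling("a")  # nothing before the first <a>

print(repr(raw_next))
print(next_tag["id"], prev_tag)
```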


## Standard selectors

### find_all

```python
find_all(name, attrs, recursive, string, limit, **kwargs)
```

Searches the document by tag name, attributes, or text content.
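
The arguments not demonstrated in the cells of this section can be sketched as follows. The markup is a trimmed stand-in, `html.parser` is used to avoid extra installs, and the `string` keyword assumes a bs4 version that accepts it (older releases call it `text`).

```python
import re

from bs4 import BeautifulSoup

html = (
    '<ul id="list-1">'
    '<li class="element">Foo</li>'
    '<li class="element">Bar</li>'
    '<li class="element">Gods</li>'
    '</ul>'
)
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("li")                       # all three <li> tags
first_two = soup.find_all("li", limit=2)            # stop after two matches
by_string = soup.find_all(string=re.compile("^F"))  # matches text nodes, not tags
top_level = soup.find_all("li", recursive=False)    # <li> is not a direct child of soup

print(len(by_name), len(first_two), by_string, top_level)
```

Note that searching with `string` alone returns the matching strings themselves; combine it with a tag name to get the enclosing tags instead.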

#### name

In [45]:
html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">BAr</li>
            <li class="element">Gods</li>
        </ul>
        
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Gods</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))
print('----'*10)
print(type(soup.find_all('ul')[0]))
print('----'*10)
for ul in soup.find_all('ul'):
    for li in ul.find_all('li'):
        print(li)

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">BAr</li>
<li class="element">Gods</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Gods</li>
</ul>]
----------------------------------------
<class 'bs4.element.Tag'>
----------------------------------------
<li class="element">Foo</li>
<li class="element">BAr</li>
<li class="element">Gods</li>
<li class="element">Foo</li>
<li class="element">Gods</li>


#### attrs

In [46]:
print(soup.find_all(attrs={'id':'list-1'}))

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">BAr</li>
<li class="element">Gods</li>
</ul>]


In [48]:
# Equivalent shorthand

print(soup.find_all(id='list-1'))

print(soup.find_all(class_="element"))

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">BAr</li>
<li class="element">Gods</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">BAr</li>, <li class="element">Gods</li>, <li class="element">Foo</li>, <li class="element">Gods</li>]
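
Two details of attribute filtering are easy to miss: `class_` matches any single class of a multi-class element, and attribute names that are not valid Python keywords (such as `data-*`) have to go through `attrs`. A sketch with invented markup, where the `data-kind` attribute is purely hypothetical:

```python
from bs4 import BeautifulSoup

html = (
    '<ul class="list" id="list-1" data-kind="main"></ul>'
    '<ul class="list list-small" id="list-2" data-kind="extra"></ul>'
)
soup = BeautifulSoup(html, "html.parser")

# class_="list" matches both elements, since each carries the class "list".
both = soup.find_all(class_="list")

# A tag name and a class filter can be combined.
small = soup.find_all("ul", class_="list-small")

# data-kind cannot be written as a keyword argument, so use the attrs dict.
by_data = soup.find_all(attrs={"data-kind": "main"})

print([ul["id"] for ul in both])
print([ul["id"] for ul in small])
print([ul["id"] for ul in by_data])
```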
