In [1]:
from bs4 import BeautifulSoup
import bs4

In [2]:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# 1. Create a BeautifulSoup object
- Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8.
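
A minimal sketch of this conversion, assuming the input arrives as Latin-1 bytes. The sample markup is made up for illustration, and `html.parser` is used only so the snippet runs without lxml installed:

```python
from bs4 import BeautifulSoup

# Hypothetical input: bytes encoded as Latin-1, not UTF-8.
raw = "<p>caf\u00e9</p>".encode("latin-1")

# from_encoding tells Beautiful Soup how to decode the bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

text = soup.p.string     # a Unicode str inside the parsed tree
encoded = soup.encode()  # bytes; the default output encoding is UTF-8

print(repr(text))
print(encoded)
```

Inside the tree everything is a Unicode string; the byte encoding only matters again when the document is written back out.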

# 2. Call prettify() to print the parsed document with standard indentation

In [9]:
soup = BeautifulSoup(html,'lxml')
print(soup.prettify())
print(soup.title.string)

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story


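# 3. Node selectors
- A Tag has two important attributes: name and attrs.

## 3.1 Selecting elements
- Access a node by using its tag name as an attribute (for example, soup.title), then read its string attribute to get the text inside it. Only the first matching node is returned.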

In [15]:
print(soup.title.string)
print(soup.head.string)
print(soup.a.string)
print(soup.p.string)

The Dormouse's story
The Dormouse's story
 Elsie 
The Dormouse's story
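
The "first match only" behaviour is easy to miss, so here is a small sketch comparing attribute access with `find_all()`. The markup is a made-up sample, and `html.parser` is an assumption chosen to avoid the lxml dependency:

```python
from bs4 import BeautifulSoup

doc = "<div><p>first</p><p>second</p></div>"  # made-up sample markup
soup = BeautifulSoup(doc, "html.parser")

first_p = soup.p            # attribute access stops at the first <p>
all_p = soup.find_all("p")  # find_all() returns every <p>

print(first_p.string)
print([p.string for p in all_p])
```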


## 3.2 Extracting information

### 3.2.1 Getting the name

In [18]:
print(soup.name)
print(soup.title.name)
print(soup.head.name)
print(soup.p.attrs)

[document]
title
head
{'class': ['title'], 'name': 'dromouse'}


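### 3.2.2 Getting attributes
- A node may have several attributes, such as id and class. After selecting the node, call attrs to get all of them as a dict, or index the node with an attribute name to get a single value.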

In [21]:
print(soup.p.attrs)
print(soup.p["class"])
print(soup.p["name"])

{'class': ['title'], 'name': 'dromouse'}
['title']
dromouse
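
As the output above shows, class comes back as a list while name is a plain string. The sketch below illustrates that difference, plus `get()` for attributes that may be absent. The markup is a made-up sample and `html.parser` is an assumption:

```python
from bs4 import BeautifulSoup

doc = '<a href="/x" class="sister big">link</a>'  # made-up sample markup
soup = BeautifulSoup(doc, "html.parser")
tag = soup.a

classes = tag["class"]   # multi-valued attribute: returned as a list
href = tag["href"]       # single-valued attribute: returned as a str
missing = tag.get("id")  # get() returns None instead of raising KeyError

print(classes, href, missing)
```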


### 3.2.3 Extracting the string carried by a tag
- This returns the text inside the first matching node.

In [15]:
print(soup.p.string)
print(soup.a.string)

The Dormouse's story
 Elsie 


In [18]:
# Content of a tag whose only child is a comment
if isinstance(soup.a.string, bs4.element.Comment):
    print(soup.a.string)

 Elsie 
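
Because string returns the comment text as if it were ordinary content, a scraper may want to filter comments out. A hedged sketch follows; `visible_string` is a hypothetical helper, not part of Beautiful Soup, and the markup is a made-up sample:

```python
from bs4 import BeautifulSoup, Comment

doc = "<p><a id='link1'><!-- Elsie --></a><a id='link2'>Lacie</a></p>"
soup = BeautifulSoup(doc, "html.parser")


def visible_string(tag):
    """Return tag.string, or None when that string is an HTML comment."""
    s = tag.string
    return None if isinstance(s, Comment) else s


results = {a["id"]: visible_string(a) for a in soup.find_all("a")}
print(results)
```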


## 3.3 Nested selection

In [24]:
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story


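## 3.4 Related-node selection
- Sometimes the target node cannot be selected in one step. In that case, select some node first, then use it as a base to select its children, parent, siblings, and so on. This section shows how to select those nodes.

### 3.4.1 Children and descendants
- After selecting a node, call its contents attribute to get its direct children as a list. The children attribute yields the same nodes as an iterator.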

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

In [25]:
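# Note: soup still holds the document parsed in section 2; the html string above is not re-parsed here.
print(soup.head.contents)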
print(soup.body.contents)
print(soup.head.children)
for child in soup.body.children:
    print(child)

[<title>The Dormouse's story</title>]
['\n', <p class="title" name="dromouse"><b>The Dormouse's story</b></p>, '\n', <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, '\n', <p class="story">...</p>, '\n']
<list_iterator object at 0x00000251B01BA5C0>


<p class="title" name="dromouse"><b>The Dormouse's story</b></p>


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<p c
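
To make the difference between direct children and all descendants concrete, here is a small sketch on a made-up fragment. It assumes `html.parser`; the descendants attribute walks the whole subtree in document order:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

doc = "<p>one<a><span>two</span></a></p>"  # made-up sample markup
soup = BeautifulSoup(doc, "html.parser")
p = soup.p

direct = p.contents                        # list of direct children only
child_count = sum(1 for _ in p.children)   # iterator over the same nodes
all_nodes = list(p.descendants)            # every node below <p>, text included
nested_tags = [n.name for n in all_nodes if isinstance(n, Tag)]

print(len(direct), child_count, len(all_nodes), nested_tags)
```

Text nodes count as children and descendants too, which is why the `isinstance` check is needed when only tags are wanted.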

### 3.4.2 Parent and ancestors
- To get a node's parent, call its parent attribute.
- To get all of its ancestors, call its parents attribute, which returns a generator.

In [29]:
print(soup.a.parent)
print(soup.a.parents)
print(list(enumerate(soup.a.parents)))

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<generator object PageElement.parents at 0x00000251B01C40C0>
[(0, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>), (1, <body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/e
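
Printing whole ancestor nodes gets long quickly, so a compact way to inspect the chain is to list only the ancestor names. A sketch on a made-up fragment, assuming `html.parser`:

```python
from bs4 import BeautifulSoup

doc = "<html><body><p><a><span>x</span></a></p></body></html>"
soup = BeautifulSoup(doc, "html.parser")
span = soup.span

parent_name = span.parent.name                         # nearest ancestor
ancestor_names = [node.name for node in span.parents]  # innermost first

print(parent_name)
print(ancestor_names)
```

The last entry is the BeautifulSoup object itself, whose name is `[document]`, matching the output of `soup.name` in section 3.2.1.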

In [31]:
html = """
<html>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            Hello
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
"""

### 3.4.3 Siblings
- Get the nodes at the same level as the selected node.

In [36]:
soup = BeautifulSoup(html,'lxml')
print('Next Sibling',soup.a.next_sibling)
print('Prev Sibling',soup.a.previous_sibling)
print('Next Siblings',list(enumerate(soup.a.next_siblings)))
print('Prev Sibling',list(enumerate(soup.a.previous_siblings)))

Next Sibling 
            Hello
            
Prev Sibling 
            Once upon a time there were three little sisters; and their names were
            
Next Siblings [(0, '\n            Hello\n            '), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n            and\n            '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n            and they lived at the bottom of a well.\n        ')]
Prev Sibling [(0, '\n            Once upon a time there were three little sisters; and their names were\n            ')]
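
As the output above shows, the nearest sibling is often a whitespace or text node, not a tag. When the next tag is wanted, `find_next_sibling()` skips over text. A sketch on a made-up fragment, assuming `html.parser`:

```python
from bs4 import BeautifulSoup

doc = "<p><a id='one'>1</a>\n and \n<a id='two'>2</a></p>"
soup = BeautifulSoup(doc, "html.parser")
first = soup.a

raw_next = first.next_sibling            # the text node between the links
next_tag = first.find_next_sibling("a")  # the next <a> tag, skipping text

print(repr(raw_next))
print(next_tag["id"])
```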


### 3.4.4 Extracting information
- Text, attributes, and so on can be extracted from related nodes in the same way.

In [37]:
html = """
<html>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">Bob</a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
        </p>
"""

In [47]:
soup = BeautifulSoup(html,'lxml')
print('Next Sibling:')
print(type(soup.a.next_sibling))
print(soup.a.next_sibling)
print(soup.a.next_sibling.string)
print('Parent:')
print(type(soup.a.parents))
print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs['class'])

Next Sibling:
<class 'bs4.element.Tag'>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
Parent:
<class 'generator'>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">Bob</a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>
['story']


# 4. Method selectors
- find_all() queries all elements that match the given conditions. Pass it attributes or text to get the matching elements.  
`find_all(name, attrs, recursive, text, **kwargs)`

## 4.1 name

In [49]:
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

In [55]:
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>


In [56]:
for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
Foo
Bar
Jay
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Foo
Bar
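
The recursive parameter from the signature, and the related `find()` method and limit parameter, are not exercised above. The sketch below shows how they change the result on a made-up fragment, assuming `html.parser`:

```python
from bs4 import BeautifulSoup

doc = "<div><p>top</p><section><p>nested</p></section></div>"
soup = BeautifulSoup(doc, "html.parser")
div = soup.div

everywhere = div.find_all("p")                    # searches all descendants
direct_only = div.find_all("p", recursive=False)  # direct children only
capped = div.find_all("p", limit=1)               # stop after one match
first = div.find("p")                             # single Tag, not a list
absent = div.find("h1")                           # None when nothing matches

print(len(everywhere), len(direct_only), len(capped), first.string, absent)
```

Since `find()` returns None on a miss, check the result before reading attributes from it.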


## 4.2 attrs
- Besides querying by node name, we can also pass in attributes to query by.

In [57]:
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

In [64]:
soup =BeautifulSoup(html,'lxml')
print(soup.find_all(attrs={'id':'list-1'}))
print(soup.find_all(attrs={'name':'elements'}))
print(soup.find_all(id='list-1'))
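# class is a reserved keyword in Python, so the keyword argument is spelled class_ with a trailing underscore.
# No node in this document has the class 'elements' (the <li> nodes use 'element'), so this prints [].
print(soup.find_all(class_='elements'))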

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[]
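
For a class_ query that does return matches, note that class is multi-valued: searching for one class name also matches nodes that carry additional classes. A sketch on a reduced, made-up version of the markup above, assuming `html.parser`:

```python
from bs4 import BeautifulSoup

doc = (
    '<ul class="list" id="list-1"></ul>'
    '<ul class="list list-small" id="list-2"></ul>'
)
soup = BeautifulSoup(doc, "html.parser")

by_list = [ul["id"] for ul in soup.find_all(class_="list")]
by_small = [ul["id"] for ul in soup.find_all(class_="list-small")]
by_attrs = [ul["id"] for ul in soup.find_all(attrs={"class": "list"})]

print(by_list, by_small, by_attrs)
```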


## 4.3 text
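- The text parameter matches the text of nodes. It accepts either a string or a compiled regular expression object. (Beautiful Soup 4.4.0 and later call this argument string; text is the older name.)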

In [65]:
import re
html='''
<div class="panel">
    <div class="panel-body">
        <a>Hello, this is a link</a>
        <a>Hello, this is a link, too</a>
    </div>
</div>
'''

In [67]:
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(text=re.compile('link')))

['Hello, this is a link', 'Hello, this is a link, too']
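
The query above returns the matching strings, not the tags that contain them. To get the tags, combine a tag name with the text filter. The sketch below assumes Beautiful Soup 4.4.0 or later, where the argument is spelled `string`, and uses `html.parser` on a reduced copy of the markup:

```python
import re

from bs4 import BeautifulSoup

doc = "<div><a>Hello, this is a link</a><a>Hello, this is a link, too</a></div>"
soup = BeautifulSoup(doc, "html.parser")

# Without a tag name, the matches are the text nodes themselves.
strings = soup.find_all(string=re.compile("link"))

# With a tag name, the matches are the <a> tags whose string matches.
tags = soup.find_all("a", string=re.compile("too"))

print(strings)
print(tags)
```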


# 5. CSS selectors
- To use CSS selectors, call the select() method and pass in the CSS selector.

In [69]:
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

In [79]:
soup = BeautifulSoup(html,'lxml')
print(soup.select('.panel'))
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))

[<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>]
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>
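
`select()` always returns a list. When only the first match is needed, `select_one()` returns a single Tag or None. A sketch that also shows the child combinator `>`, on a reduced, made-up version of the markup above and assuming `html.parser`:

```python
from bs4 import BeautifulSoup

doc = (
    '<ul class="list" id="list-1"><li class="element">Foo</li></ul>'
    '<ul class="list list-small" id="list-2"><li class="element">Bar</li></ul>'
)
soup = BeautifulSoup(doc, "html.parser")

first_ul = soup.select_one("ul")                         # first match only
small_items = soup.select("ul.list-small > li.element")  # direct li children
none_found = soup.select_one("#missing")                 # None when no match

print(first_ul["id"], [li.string for li in small_items], none_found)
```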


## 5.1 Nested selection

In [80]:
for ul in soup.select('ul'):
    # Call select() on each ul so that only its own li nodes are returned
    print(ul.select('li'))

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]


## 5.2 Getting attributes

In [81]:
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

list-1
list-1
list-2
list-2


## 5.3 Getting text
- To get text, you can use the string attribute described earlier. There is also another option: the get_text() method.

In [83]:
for li in soup.select('li'):
    print('Get Text：',li.get_text())
    print('String：',li.string)

Get Text： Foo
String： Foo
Get Text： Bar
String： Bar
Get Text： Jay
String： Jay
Get Text： Foo
String： Foo
Get Text： Bar
String： Bar
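
The two agree here because each li holds a single text node. They diverge once a node has nested children, as this sketch on a made-up fragment shows (`html.parser` assumed):

```python
from bs4 import BeautifulSoup

doc = "<li>Foo <b>Bar</b></li>"  # made-up sample with a nested tag
soup = BeautifulSoup(doc, "html.parser")
li = soup.li

as_string = li.string                  # None: <li> has more than one child
as_text = li.get_text()                # concatenates all nested text
joined = li.get_text("|", strip=True)  # strip each piece, join with "|"

print(as_string, repr(as_text), joined)
```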


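# Summary: get_text() and string give essentially the same results here  
- The lxml parser is recommended; fall back to html.parser when necessary.  
- Node selectors have weak filtering capabilities but are fast.  
- Use find() to match a single result and find_all() to match multiple results.  
- If you are familiar with CSS selectors, you can use the select() method.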