## 4.1 XPath (very powerful; worth studying carefully)

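1. XPath overview

2. Common XPath rules

|Expression|Description|
|--:|--|
|nodename|Selects the child elements of the current node that are named `nodename`|
| / | Selects direct children of the current node |
| // | Selects descendants of the current node |
| . | Selects the current node |
| .. | Selects the parent of the current node |
| @ | Selects attributes |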
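
The rules in the table can be combined in a single expression. The following is a minimal sketch, assuming a small made-up document (the `box` id and the `a`/`b` classes are illustrative only and do not come from the examples in these notes):

```python
from lxml import etree

# Hypothetical sample markup, used only for illustration.
doc = etree.HTML('<div id="box"><p class="a">one</p><p class="b"><b>two</b></p></div>')

div = doc.xpath('//div')[0]                     # // : search all descendants of the root
child_classes = div.xpath('./p/@class')         # . and / : direct children, then @ for the attribute
nested_text = div.xpath('.//b/text()')          # .// : any descendant of the current node
parent_class = div.xpath('.//b/../@class')      # .. : step up to the parent of the matched b node

print(child_classes)
print(nested_text)
print(parent_class)
```

Calling `xpath()` on an element rather than on the whole tree makes `.` refer to that element, which is handy when a page has several similar blocks.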

In [5]:
text = '''
<div>
<ul>
<li class="item-0"><a href="linkl.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''

In [7]:
from lxml import etree

html = etree.HTML(text)
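# Serialize the repaired HTML; the result is of type bytes
result = etree.tostring(html)
# Use decode() to convert the bytes to str
print(result.decode('utf-8'))
# After parsing, the unclosed li tag has been completed,
# and the body and html nodes have been added automatically.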

<html><body><div>
<ul>
<li class="item-0"><a href="linkl.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</li></ul>
</div>
</body></html>


In [9]:
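# A text file can also be read and parsed directly
html = etree.parse('./test2.html', etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))
# This time the output differs slightly: it includes a DOCTYPE declaration, which has no effect on parsing. The result: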

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div>
<ul>
<li class="item-0"><a href="linkl.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</li></ul>
</div></body></html>


In [15]:
# Select all nodes
from lxml import etree
html = etree.parse("./test2.html", etree.HTMLParser())
html.xpath('//*')

[<Element html at 0x10d7f1b48>,
 <Element body at 0x10b74d308>,
 <Element div at 0x10d7f1248>,
 <Element ul at 0x10d7f1108>,
 <Element li at 0x10d7f1b88>,
 <Element a at 0x10d7f1c08>,
 <Element li at 0x10d7f1c48>,
 <Element a at 0x10d7f1c88>,
 <Element li at 0x10d7f1cc8>,
 <Element a at 0x10d7f1bc8>,
 <Element li at 0x10d7f1d08>,
 <Element a at 0x10d7f1d48>,
 <Element li at 0x10d7f1d88>,
 <Element a at 0x10d7f1dc8>]

In [16]:
html.xpath('//li')

[<Element li at 0x10d7f1b88>,
 <Element li at 0x10d7f1c48>,
 <Element li at 0x10d7f1cc8>,
 <Element li at 0x10d7f1d08>,
 <Element li at 0x10d7f1d88>]

In [17]:
# Select text with text()
html.xpath('//li[@class="item-0"]/a/text()')

['first item', 'fifth item']

In [18]:
# Get attribute values
html.xpath('//li/a/@href')

['linkl.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']

In [29]:
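# Matching an attribute that holds multiple values
from lxml import etree
text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
# The second argument must be a quoted string; an unquoted li would be read as a node test.
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)
# contains() takes the attribute as its first argument and the value to look for as its second;
# the node matches as long as the attribute value contains that string.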

['first item']
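
Because `contains()` is a plain substring test, it also matches class values such as `list` that merely contain the letters `li`. A commonly used XPath 1.0 idiom pads the normalized class value with spaces and searches for the padded token. The sketch below assumes a made-up two-item list to show the difference:

```python
from lxml import etree

# Hypothetical markup: only the first <li> carries the class token "li".
html = etree.HTML('''
<ul>
<li class="li li-first">first</li>
<li class="list">second</li>
</ul>
''')

# Substring test: "list" contains "li", so both nodes match.
substring_hits = html.xpath('//li[contains(@class, "li")]/text()')

# Token test: surround the class value with spaces and look for " li ".
token_hits = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/text()'
)

print(substring_hits)
print(token_hits)
```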


In [32]:
# Matching on multiple attributes
from lxml import etree
text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)

['first item']


In [42]:
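# Selecting by position
# Here we use functions such as last() and position(). The XPath function library offers more than
# 100 functions for access, numeric, string, logical, node, and sequence processing;
# note that lxml implements XPath 1.0, which supports only a subset of them.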
text = '''
<div>
<ul>
<li class="item-0"><a href="linkl.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''

html = etree.HTML(text)
print(html.xpath('//li[1]/a/text()'))
print(html.xpath('//li[last()]/a/text()'))
print(html.xpath('//li[position()<3]/a/text()'))
print(html.xpath('//li[last()-2]/a/text()'))

['first item']
['fifth item']
['first item', 'second item']
['third item']


In [68]:
# Node axes: select child, sibling, parent, and ancestor elements, and so on
text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html"><span>first item</span></a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''
html = etree.HTML(text)
print(1, html.xpath('//li[1]/ancestor::*'))  # ancestor axis: all ancestors of the first li
print(2, html.xpath('//li[1]/ancestor::div'))  # ancestor axis restricted to div ancestors
print(3, html.xpath('//li[1]/attribute::*'))  # attribute axis: all attribute values of the first li
print(4, html.xpath('//li[1]/child::a[@href="link1.html"]'))  # child axis: the direct child a whose href is link1.html
print(5, html.xpath('//li[1]/descendant::span'))  # descendant axis: descendants that are span nodes
print(6, html.xpath('//li[1]/following::*[2]'))  # following axis: all nodes after the current node; *[2] picks the second one
print(7, html.xpath('//li[1]/following-sibling::*'))  # following-sibling axis: all later siblings of the current node

1 [<Element html at 0x10d8a1588>, <Element body at 0x10b74d088>, <Element div at 0x10d8a1cc8>, <Element ul at 0x10d8866c8>]
2 [<Element div at 0x10b74d088>]
3 ['item-0']
4 [<Element a at 0x10b74d088>]
5 [<Element span at 0x10d8a1cc8>]
6 [<Element a at 0x10d8a1cc8>]
7 [<Element li at 0x10b74d088>, <Element li at 0x10d8866c8>, <Element li at 0x10d886a08>, <Element li at 0x10d886808>]
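
The cell above does not exercise the `parent` and `preceding-sibling` axes. The following sketch, assuming a shortened three-item list with made-up class names, shows how they behave alongside a positional `following-sibling` step:

```python
from lxml import etree

# Hypothetical three-item list, used only for illustration.
html = etree.HTML('''
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-2"><a href="link3.html">third item</a></li>
</ul>
''')

# parent:: walks one level up from the matched <a> to its <li>.
parent_class = html.xpath('//a[@href="link2.html"]/parent::li/@class')

# preceding-sibling:: collects the siblings that come before the context node.
before = html.xpath('//li[3]/preceding-sibling::li/a/text()')

# following-sibling::li[1] picks the nearest later sibling.
after = html.xpath('//li[1]/following-sibling::li[1]/a/text()')

print(parent_class, before, after)
```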


## 4.2 BeautifulSoup

1. Introduction

2. Preparation

3. Parsers
    * html.parser / lxml ✅ / xml / html5lib
    * lxml is recommended: it is fast and tolerant of malformed markup
    


In [72]:
# 4. Basic usage
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

from bs4 import BeautifulSoup  # BeautifulSoup repairs malformed markup automatically when it is initialized
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())  # prettify() outputs the parsed document with standard indentation
print(soup.title.text)  # soup.title.string gives the same output here: the text content of the title node

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story


In [77]:
# 5. Node selectors
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup  # BeautifulSoup repairs malformed markup automatically when it is initialized
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.title.string)
print(soup.head)
print(soup.p)

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>


In [79]:
# Get the node name
soup.title.name

'title'

In [89]:
# Get attributes
print(soup.p.attrs)  # returns a dict
print(soup.p.attrs['name'])
# Shorter form; note whether the result is a string or a list of strings
print(soup.p['class'])
print(soup.p['name'])

{'class': ['title'], 'name': 'dromouse'}
dromouse
['title']
dromouse
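
The string-versus-list distinction comes from how Beautiful Soup treats multi-valued attributes such as `class`. The sketch below assumes a made-up `<p>` tag and the built-in `html.parser` so that it runs without lxml; `data-x` is a hypothetical attribute that is deliberately absent:

```python
from bs4 import BeautifulSoup

# Hypothetical tag with a two-token class attribute.
soup = BeautifulSoup('<p class="title big" id="intro" name="dromouse">Hi</p>', 'html.parser')
p = soup.p

classes = p['class']       # multi-valued attribute: returned as a list of tokens
ident = p['id']            # single-valued attribute: returned as a plain string
missing = p.get('data-x')  # get() returns None instead of raising KeyError

print(classes, ident, missing)
```

Using `get()` is the safer choice when an attribute may not exist on every matched tag.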


In [91]:
# Nested selection
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
'''
from bs4 import BeautifulSoup  # BeautifulSoup repairs malformed markup automatically when it is initialized
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story


In [94]:
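# Relational selection: first select a node, then use it as the base for selecting its children, parent, siblings, and so on
# (1) Child and descendant nodes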
html = '''
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
'''
from bs4 import BeautifulSoup  # BeautifulSoup repairs malformed markup automatically when it is initialized
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)  # contents returns the direct children as a list

['\n    Once upon a time there were three little sisters; and their names were\n    ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \nand\n', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\nand they lived at the bottom of a well.\n']


In [96]:
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)

<list_iterator object at 0x10dcc27b8>
0 
    Once upon a time there were three little sisters; and their names were
    
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4  
and

5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 
and they lived at the bottom of a well.



In [97]:
# Get all descendant nodes
print(soup.p.descendants)  # descendants walks the children recursively and yields every descendant

for i, child in enumerate(soup.p.descendants):
    print(i, child)

<generator object descendants at 0x10dbd4620>
0 
    Once upon a time there were three little sisters; and their names were
    
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <span>Elsie</span>
4 Elsie
5 

6 

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9  
and

10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 
and they lived at the bottom of a well.



In [100]:
# (2) Parent and ancestor nodes
html = '''
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
soup.a.parent
# This selects the direct parent of the first a node. Its parent is the p node, so the output is the p node and its contents.

<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>

In [102]:
html = '''
<html>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
'''
soup = BeautifulSoup(html, 'lxml')
print(type(soup.a.parents))  # a generator
print(list(enumerate(soup.a.parents)))

<class 'generator'>
[(0, <p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>), (1, <body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
</body>), (2, <html>
<body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
</body></html>), (3, <html>
<body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
</body></html>)]


In [103]:
# (3) Sibling nodes
html = '''
<html>
<head>
<body>
<p class="story">
                Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
                Hello
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
                and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
                and they lived at the bottom of a well.
</p>
'''

soup = BeautifulSoup(html, 'lxml')
print('Next Sibling', soup.a.next_sibling)
print('Prev Sibling', soup.a.previous_sibling)
print('Next Siblings', list(enumerate(soup.a.next_siblings)))
print('Prev Siblings', list(enumerate(soup.a.previous_siblings)))


Next Sibling 
                Hello

Prev Sibling 
                Once upon a time there were three little sisters; and their names were

Next Siblings [(0, '\n                Hello\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n                and\n'), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n                and they lived at the bottom of a well.\n')]
Prev Siblings [(0, '\n                Once upon a time there were three little sisters; and their names were\n')]


In [105]:
# (4) Extracting information
html = '''
<html>
<head>
<body>
<p class="story">
                Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Bob</a><a href="http://example.com/lacie" 
class="sister" id="link2">Lacie</a>
</p>
'''
soup = BeautifulSoup(html, 'lxml')
print('Next Sibling:')
print(type(soup.a.next_sibling))
print(soup.a.next_sibling)
print(soup.a.next_sibling.string)
print('Parent:')
print(type(soup.a.parents))
print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs['class'])

Next Sibling:
<class 'bs4.element.Tag'>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
Parent:
<class 'generator'>
<p class="story">
                Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Bob</a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>
['story']


6. Method selectors

    find_all(name, attrs, recursive, text, **kwargs)
    
    * name: the node name
    * attrs: attributes, for example {'class': 'sister'}
    * text: matches the text of nodes; accepts a string or a regular expression:
        soup.find_all(text=re.compile('link'))
        

    find() returns the first matching node element
Other functions used in a similar way:

    find_parents() find_parent()
    find_next_siblings() find_next_sibling()
    find_previous_siblings() find_previous_sibling()
    find_all_next() find_next()
    find_all_previous() find_previous()

7. CSS selectors: call the select() method and pass in the CSS selector.

* Nested selection
* Getting attributes
* Getting text with .get_text()
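
The method selectors and `select()` listed above can be sketched together as follows. This assumes a trimmed version of the "three sisters" markup and the built-in `html.parser`; it also assumes Beautiful Soup 4.4 or later, where the text filter is spelled `string=` (older releases call the same argument `text=`):

```python
import re
from bs4 import BeautifulSoup

# Trimmed sample markup, for illustration only.
html = '''
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
'''
soup = BeautifulSoup(html, 'html.parser')

# find_all() with a node name and an attrs dict returns every match.
ids = [a['id'] for a in soup.find_all('a', attrs={'class': 'sister'})]

# find() returns only the first match.
first = soup.find('a').get_text()

# A regular expression can filter on the text content.
starts_with_l = soup.find_all(string=re.compile(r'^L'))

# select() takes a CSS selector; attributes and text are read from each result.
css_hrefs = [a['href'] for a in soup.select('p.story a#link2')]

print(ids, first, starts_with_l, css_hrefs)
```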

## 4.3 pyquery

1. Preparation

2. Initialization
    * A PyQuery object can be initialized in several ways, for example from a string, a URL, or a file name


In [3]:
# Initialize from a string

html = """
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
from pyquery import PyQuery as pq
# The long HTML string declared above is passed
# to the PyQuery class, which completes the initialization.
doc = pq(html)
print(doc('li'))  # select all li nodes

<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>



In [4]:
# Initialize from a URL
from pyquery import PyQuery as pq
# Equivalent to:
# doc = pq(requests.get('https://cuiqingcai.com').text)
doc = pq(url='https://cuiqingcai.com') 
print(doc('title'))

<title>静觅丨崔庆才的个人博客</title>&#13;



In [5]:
# Initialize from a file
doc = pq(filename='test2.html')
print(doc('li'))

<li class="item-0"><a href="linkl.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</li>


3. CSS selectors

In [7]:
html = """
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
"""
doc = pq(html)
print(doc('#container .list li'))  # CSS selector
print(type(doc('#container .list li')))

<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

<class 'pyquery.pyquery.PyQuery'>


4. Finding nodes

In [10]:
# Child nodes: find() searches all descendants (children() returns only direct children)
items = doc('.list')
print(type(items))
print(items)
lis = items.find('li')
print(type(lis))
print(lis)

<class 'pyquery.pyquery.PyQuery'>
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>

<class 'pyquery.pyquery.PyQuery'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>



In [11]:
# Parent node
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = pq(html)
items = doc('.list')
container = items.parent()
print(container)  # the direct parent node
print(type(container))

<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>

<class 'pyquery.pyquery.PyQuery'>


In [12]:
doc = pq(html)
items = doc('.list')
parents = items.parents()
print(parents)  # all ancestor nodes
print(type(parents))

<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div><div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>

<class 'pyquery.pyquery.PyQuery'>


In [14]:
li = doc('.list .item-0.active')
print(li.siblings())  # all sibling nodes

<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0">first item</li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>



5. Traversal
       Call items()

In [16]:
li = doc('.list .item-0.active')
print(li)
print(str(li))

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>



In [19]:
lis = doc('li').items()
print(type(lis))
for li in lis:
    print(li, type(li))

<class 'generator'>
<li class="item-0">first item</li>
 <class 'pyquery.pyquery.PyQuery'>
<li class="item-1"><a href="link2.html">second item</a></li>
 <class 'pyquery.pyquery.PyQuery'>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
 <class 'pyquery.pyquery.PyQuery'>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
 <class 'pyquery.pyquery.PyQuery'>
<li class="item-0"><a href="link5.html">fifth item</a></li>
 <class 'pyquery.pyquery.PyQuery'>


6. Extracting information

In [27]:
html = """
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
"""
doc = pq(html)
a = doc('.item-0.active a')
print(a, type(a))
print(a.attr('href'))  # get an attribute
print(a.attr.href)

<a href="link3.html"><span class="bold">third item</span></a> <class 'pyquery.pyquery.PyQuery'>
link3.html
link3.html


In [29]:
a = doc('a')
print(a, type(a))
print(a.attr.href)
# When the selection contains several nodes, attr() returns the attribute of the first node only.

<a href="link2.html">second item</a><a href="link3.html"><span class="bold">third item</span></a><a href="link4.html">fourth item</a><a href="link5.html">fifth item</a> <class 'pyquery.pyquery.PyQuery'>
link2.html


In [31]:
# Get text
a = doc('.item-0.active a')
print(a)
print(a.text())

<a href="link3.html"><span class="bold">third item</span></a>
third item


In [34]:
# To get the HTML inside the node, use the html() method
li = doc('.item-0.active')
print(li)
print(li.text())
print(li.html())

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

third item
<a href="link3.html"><span class="bold">third item</span></a>


In [36]:
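# When the selection contains several nodes, html() returns only the inner HTML of the first node,
# so getting every node's inner HTML requires iterating over them.
# text() needs no iteration: it takes the text of every node and joins the results into one string.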
li = doc('li')
print(li.html())
print(li.text()) 
print(type(li.text()))

first item
first item second item third item fourth item fifth item
<class 'str'>


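7. Node manipulation
    
        pyquery provides a set of methods for modifying nodes dynamically, such as adding a class to a node or removing a node; these operations can make extracting information much easier.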

In [40]:
# addClass and removeClass change a node's class attribute
li = doc('.item-0.active') 
print(li)
li.removeClass('active')
print(li)
li.addClass('active') 
print(li)

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

<li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>



In [45]:
# attr, text, and html
html = '''
<ul class='list'>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
</ul>
'''
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.attr('name', 'link')
print(li)
li.text('changed item')
print(li)
li.html('<span>changed item</span>')
print(li)
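# If attr() is given only the attribute name, it returns that attribute's value; if a second argument is passed, it sets the value.
# text() and html() called without arguments return the node's plain text and inner HTML; called with an argument, they assign it.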

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

<li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>

<li class="item-0 active" name="link">changed item</li>

<li class="item-0 active" name="link"><span>changed item</span></li>


In [49]:
# remove()
html = '''
<div class="wrap">
    Hello, World
<p>This is a paragraph.</p>
</div>
'''
doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())

Hello, World
This is a paragraph.


In [52]:
wrap.find('p').remove()  # remove the p node
print(wrap.text())
# There are many other node-manipulation methods, such as append(), empty(), and prepend().

Hello, World


8. Pseudo-class selectors

In [56]:
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
doc = pq(html)
print(doc('li:first-child'))
print(doc('li:last-child'))
print(doc('li:nth-child(2)'))
print(doc('li:gt(2)'))
print(doc('li:nth-child(2n)'))
print(doc('li:contains(second)'))

<li class="item-0">first item</li>

<li class="item-0"><a href="link5.html">fifth item</a></li>

<li class="item-1"><a href="link2.html">second item</a></li>

<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>

<li class="item-1"><a href="link2.html">second item</a></li>

