# XPath

## 1. Common XPath rules

![image](https://github.com/DRNTT/SpiderImage/blob/master/ch4/XPath%E5%B8%B8%E7%94%A8%E8%A7%84%E5%88%99.png?raw=true)

In [None]:
from lxml import etree

text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''
html = etree.HTML(text)
# The document can also be parsed from a file.
# html = etree.parse('./test.html', etree.HTMLParser())
result = etree.tostring(html)
print(type(result))
print(result.decode('utf-8'))

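Note that `result` is of type `bytes`, so `decode()` is needed to convert it to a `str`.
<br>Also, the `</li>` closing tag that is missing at the end of the text above has been added, which shows that the etree module automatically repairs malformed HTML.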
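
The repair behaviour can be checked on a smaller fragment. This is a minimal sketch, assuming lxml is installed; the names `broken` and `repaired` are illustrative and not part of the original notebook. Passing `encoding='unicode'` to `etree.tostring()` returns a `str` directly, so no `decode()` call is needed.

```python
from lxml import etree

# A fragment whose last <li> is never closed (same defect as the sample text).
broken = '<ul><li>one</li><li>two</ul>'

root = etree.HTML(broken)  # lenient HTML parser, returns the root <html> element

# encoding='unicode' makes tostring() return str instead of bytes.
repaired = etree.tostring(root, encoding='unicode')

print(type(repaired))
print(repaired)
print(repaired.count('</li>'))  # number of closing </li> tags after repair
```

The parser also wraps the fragment in `html` and `body` elements, which is why `root.tag` is `html` rather than `ul`.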

## 1.1 Selecting all nodes

In [None]:
from lxml import etree

text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''
html = etree.HTML(text)
result = html.xpath('//*')
print(result)

# A node name selects all nodes of that type
result2 = html.xpath('//li')
print(result2)
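
Printing the result list only shows `Element` objects with memory addresses. A minimal sketch, assuming lxml is installed and using a shortened sample document, that reads the `.tag` of each element to make the match visible:

```python
from lxml import etree

html = etree.HTML('<div><ul><li><a href="link1.html">first item</a></li></ul></div>')

# xpath() returns a list of Element objects in document order;
# .tag gives the name of each element.
tags = [el.tag for el in html.xpath('//*')]
print(tags)
```

The list starts with `html` and `body`, which the parser adds around the fragment.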

## 1.2 Selecting child nodes

In [None]:
from lxml import etree

text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''

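html = etree.HTML(text)

# Direct children: a nodes that are direct children of li
result = html.xpath('//li/a')
print(result)

# Direct children of ul: returns an empty list, because ul has no direct a children
result2 = html.xpath('//ul/a')
print(result2)

# All descendants: a nodes at any depth below ul
result3 = html.xpath('//ul//a')
print(result3)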

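Pay attention to whether you want direct children or descendants.<br>
The direct children of `ul` are `li` nodes, so `//ul/a` matches nothing. Use `//ul//a` to select all descendant `a` nodes.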
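
The difference between `/` and `//` shows up directly in the result counts. A minimal sketch, assuming lxml is installed; the two-item list is a shortened stand-in for the sample text:

```python
from lxml import etree

html = etree.HTML(
    '<ul><li><a href="link1.html">first item</a></li>'
    '<li><a href="link2.html">second item</a></li></ul>'
)

direct = html.xpath('//ul/a')    # a must be a direct child of ul
nested = html.xpath('//ul//a')   # a may sit at any depth below ul

print(len(direct), len(nested))
```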

## 1.3 Selecting the parent node

In [None]:
from lxml import etree

text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''

html = etree.HTML(text)
# Select the a node whose href is link4.html, go to its parent, and return the parent's class attribute
result = html.xpath('//a[@href="link4.html"]/../@class')
# result = html.xpath('//a[@href="link4.html"]/parent::*/@class')
print(result)
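
The parent can also be reached from Python once an element has been selected. A minimal sketch, assuming lxml is installed, that uses the element method `getparent()` as an alternative to `..` and checks that both routes agree:

```python
from lxml import etree

html = etree.HTML(
    '<ul><li class="item-1"><a href="link4.html">fourth item</a></li></ul>'
)

# Select the a element with XPath, then walk up with the element API.
a = html.xpath('//a[@href="link4.html"]')[0]
parent = a.getparent()
print(parent.tag, parent.get('class'))

# The same parent reached with the XPath parent step.
via_xpath = html.xpath('//a[@href="link4.html"]/..')[0]
print(via_xpath is parent)
```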

## 1.4 Getting text

In [None]:
from lxml import etree

text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''

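html = etree.HTML(text)

# Wrong approach: selects only text nodes that are direct children of li
result0 = html.xpath('//li[@class="item-0"]/text()')
print(result0)

# Option 1: select the a node first, then take its text
result = html.xpath('//li[@class="item-0"]/a/text()')
print(result)

# Option 2: use // to take the text of all descendants
result2 = html.xpath('//li[@class="item-0"]//text()')
print(result2)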

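The first attempt goes wrong because `/text()` selects only the text directly inside the `li` tag, that is, whatever sits between `</a>` and `</li>`. The first `li` has nothing there, and the last one contains only the line break in front of the `</li>` that etree added, so the result is a single `\n`.<br>
Option 1 selects the `a` tag and then reads its text.<br>
Option 2 selects the text of all descendant nodes of the `li` tag, so it also includes that `\n`.<br>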
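
Because `//text()` also returns whitespace-only nodes, the result usually needs cleaning. A minimal sketch, assuming lxml is installed, of two common options: filter the list in Python, or let the XPath function `normalize-space()` return one trimmed string. The one-item document is a shortened stand-in for the sample text.

```python
from lxml import etree

html = etree.HTML(
    '<ul><li class="item-0"><a href="link5.html">fifth item</a>\n</li></ul>'
)

# Raw descendant text may include whitespace-only nodes.
raw = html.xpath('//li[@class="item-0"]//text()')

# Option A: drop whitespace-only entries in Python.
cleaned = [t.strip() for t in raw if t.strip()]

# Option B: normalize-space() returns the trimmed string value of the first match.
joined = html.xpath('normalize-space(//li[@class="item-0"])')

print(raw)
print(cleaned)
print(joined)
```

Option B returns a single string rather than a list, so it fits best when the query targets one node.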

## 1.5 Getting attributes

In [None]:
from lxml import etree

text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''

html = etree.HTML(text)
# Get the class attribute of the li tags
result = html.xpath('//li/@class')
print(result)
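
`//li/@class` returns only the attribute values that exist, so elements without the attribute are silently skipped. A minimal sketch, assuming lxml is installed, that contrasts this with the element methods `get()` and `attrib`, which keep a one-to-one mapping to the selected elements:

```python
from lxml import etree

html = etree.HTML(
    '<ul><li class="item-0" name="laowang">x</li><li>y</li></ul>'
)

# XPath attribute step: one entry per existing attribute.
via_xpath = html.xpath('//li/@class')

# Element API: one entry per element, None where the attribute is missing.
items = html.xpath('//li')
classes = [li.get('class') for li in items]

# attrib behaves like a dict of all attributes on one element.
first_attrs = dict(items[0].attrib)

print(via_xpath)
print(classes)
print(first_attrs)
```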

## 1.6 Matching an attribute with multiple values

In [None]:
from lxml import etree

text = '''
<div>
<ul>
<li class="item-0 li li-first"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''

html = etree.HTML(text)
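# An exact match on class misses the first li, whose class holds several values;
# only the last li (class="item-0") still matches.
result = html.xpath('//li[@class="item-0"]/a/text()')
print(result)

# contains() matches any li whose class attribute contains "item-0"
result1 = html.xpath('//li[contains(@class, "item-0")]/a/text()')
print(result1)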

Here a single attribute holds several values, so `contains()` is used. This differs from matching on multiple attributes, covered next.
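
One caveat: `contains()` is a substring test, so `"item-0"` also matches a class such as `item-01`. A minimal sketch, assuming lxml is installed, of a common XPath 1.0 idiom that matches whole class tokens by padding the attribute with spaces. The class `item-01` is a hypothetical value added for illustration.

```python
from lxml import etree

html = etree.HTML('''
<ul>
<li class="item-0 li li-first"><a>first</a></li>
<li class="item-01"><a>other</a></li>
</ul>
''')

# Substring match: also picks up "item-01".
loose = html.xpath('//li[contains(@class, "item-0")]/a/text()')

# Token match: pad the normalized class list with spaces, then look for " item-0 ".
strict = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " item-0 ")]/a/text()'
)

print(loose)
print(strict)
```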

## 1.7 Matching multiple attributes

In [None]:
from lxml import etree

text = '''
<div>
<ul>
<li class="item-0" name="laowang"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''

html = etree.HTML(text)
result = html.xpath('//li[@class="item-0" and @name="laowang"]/a/text()')
print(result)

This uses the `and` operator: the node must satisfy both attribute conditions.<br>
The operators and their descriptions are listed below.
![image](https://github.com/DRNTT/SpiderImage/blob/master/ch4/operator.PNG?raw=true)
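
A few other standard XPath 1.0 operators and functions work the same way inside a predicate. A minimal sketch, assuming lxml is installed; `or`, `!=` and `count()` are part of XPath 1.0, though whether each one appears in the table above is an assumption.

```python
from lxml import etree

html = etree.HTML('''
<ul>
<li class="item-0" name="laowang"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
</ul>
''')

# or: either condition is enough.
either = html.xpath('//li[@class="item-1" or @name="laowang"]/a/text()')

# !=: the class attribute exists and differs from the given value.
active = html.xpath('//li[@class!="item-inactive"]/a/text()')

# count() returns a number (a Python float in lxml).
total = html.xpath('count(//li)')

print(either)
print(active)
print(total)
```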

## 1.8 Selecting by position

In [None]:
from lxml import etree

text = '''
<div>
<ul>
<li class="item-0" name="laowang"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''

html = etree.HTML(text)
result1 = html.xpath('//li[1]/a/text()')
print(result1)
result2 = html.xpath('//li[last()]/a/text()')
print(result2)
result3 = html.xpath('//li[position()<3]/a/text()')
print(result3)
result4 = html.xpath('//li[last()-2]/a/text()')
print(result4)

Note that positions start at 1, not 0.
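
A position predicate counts within each parent, not across the whole document, which matters as soon as there is more than one list. A minimal sketch, assuming lxml is installed; the two-list document is a hypothetical example, not the sample text:

```python
from lxml import etree

html = etree.HTML(
    '<div><ul><li>a1</li><li>a2</li></ul>'
    '<ul><li>b1</li><li>b2</li></ul></div>'
)

# //li[1]: the first li under each parent.
per_parent = html.xpath('//li[1]/text()')

# (//li)[1]: collect every li first, then take the first of the whole set.
overall_first = html.xpath('(//li)[1]/text()')
overall_last = html.xpath('(//li)[last()]/text()')

print(per_parent)
print(overall_first)
print(overall_last)
```

In the sample text there is only one `ul`, so both forms return the same node there.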

## 1.9 Selecting along node axes

In [None]:
from lxml import etree

text = '''
<div>
<ul class='ul'>
<li class="item-0" name="laowang"><a href="link1.html"><span>first item</span></a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
'''

html = etree.HTML(text)
result1 = html.xpath('//li[1]/ancestor::*')
print(result1)
result2 = html.xpath('//li[1]/ancestor::div')
print(result2)
result3 = html.xpath('//li[1]/attribute::*')
print(result3)
result4 = html.xpath('//li[1]/child::a[@href="link1.html"]')
print(result4)
result5 = html.xpath('//li[1]/descendant::span')
print(result5)
result6 = html.xpath('//li[1]/following::*[2]')
print(result6)
result7 = html.xpath('//li[1]/following-sibling::*')
print(result7)

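The first query uses the **ancestor** axis to get all ancestor nodes. The axis name is followed by two colons and then a node test; \* matches every node.<br>
The second query keeps only the `div` ancestors.<br>
The third query uses the **attribute** axis to get attribute values; \* returns all of them.<br>
The fourth query uses the **child** axis, which selects direct children; the predicate keeps only the `a` node whose `href` is `link1.html`.<br>
The fifth query uses the **descendant** axis, which selects all descendants; `span` restricts the result to `span` nodes.<br>
The sixth query uses the **following** axis, which selects every node that comes after the current node in the document; `[2]` keeps only the second such node.<br>
The seventh query uses the **following-sibling** axis, which selects all later siblings of the current node.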
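
The difference between **following** and **following-sibling** is easiest to see from the tag names they return. A minimal sketch, assuming lxml is installed and using a shortened three-item list; **preceding-sibling** is a standard XPath axis not used in the cell above and is included only for comparison.

```python
from lxml import etree

html = etree.HTML(
    '<ul><li><a>one</a></li><li><a>two</a></li><li><a>three</a></li></ul>'
)

# following: every later element in document order, at any depth.
following = [el.tag for el in html.xpath('//li[1]/following::*')]

# following-sibling: only later elements that share the same parent.
siblings = [el.tag for el in html.xpath('//li[1]/following-sibling::*')]

# preceding-sibling: earlier siblings; lxml returns them in document order.
before = html.xpath('//li[3]/preceding-sibling::li/a/text()')

print(following)
print(siblings)
print(before)
```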