In [1]:
import requests
from scrapy.http import TextResponse

## xpath
Scrapy uses xpaths to define its HTML targets. XPath is a syntax for
describing parts of an X(HT)ML document. It can get rather
complicated, but the basics are straightforward and are often
enough for the job at hand. You can get the xpath of an HTML element in Chrome’s Elements
tab: hover over the source, right-click the element, and
select Copy XPath.

```
//E
Element <E> anywhere in the document (e.g., //img gets all images on the page)

//E[@id="foo"]
Select element <E> with id foo

//*[@id="foo"]
Select any element with id foo

//E/F[1]
First child element <F> of element <E>

//E/*[1]
First child of element <E>
```

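### 1. XPath uses path expressions to locate nodes; the most common syntax is listed below
Expression    | Description                                                |
:-------------|:-----------------------------------------------------------|
nodename      | Selects all nodes named `<nodename>`                       |
/             | Selects starting from the root node                        |
//            | Selects matching nodes at any depth below the current node |
.             | Selects the current node                                   |
..            | Selects the parent of the current node                     |
@             | Selects attributes                                         |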


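### 2. XPath can use wildcards to select nodes whose names are not known in advance
Wildcard | Description                  |
:--------|:-----------------------------|
*        | Matches any element node     |
@*       | Matches any attribute node   |
node()   | Matches any node of any kind |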


### 3. XPath can use | to select several paths at once
Query path | Result                                         |
:---------|:--------------------------|
//p`|`//a | Selects all p tags and all a tags in one query |

[reference](https://matthung0807.blogspot.com/2017/12/xpath.html)


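1. child: selects all child elements of the current node

2. parent: selects the parent of the current node

3. descendant: selects all descendants of the current node (children, grandchildren, etc.)

4. ancestor: selects all ancestors of the current node (parent, grandparent, etc.)

5. descendant-or-self: selects all descendants of the current node (children, grandchildren, etc.) plus the current node itself

6. ancestor-or-self: selects all ancestors of the current node (parent, grandparent, etc.) plus the current node itself

7. preceding-sibling: selects all siblings before the current node

8. following-sibling: selects all siblings after the current node

9. preceding: selects all nodes in the document that come before the start tag of the current node, excluding its ancestors

10. following: selects all nodes in the document that come after the end tag of the current node

11. self: selects the current node

12. attribute: selects all attributes of the current node

13. namespace: selects all namespace nodes of the current node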

[reference](https://www.cnblogs.com/zhaozhan/archive/2009/09/10/1564332.html)

In [2]:
r = requests.get('https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country')
response = TextResponse(r.url, body=r.text, encoding='utf-8')

In [3]:
h3s = response.xpath('//h3')

In [4]:
def check_response_result(selector_list, num):
    """Print the first and last `num` items of a selector list."""
    print(f'length: {len(selector_list)}')

    if len(selector_list) >= num * 2:
        # Show the head and the tail of the list and skip the middle.
        tail_start = len(selector_list) - num
        for selector in selector_list[:num]:
            print(selector.extract())
        for selector in selector_list[tail_start:]:
            print(selector.extract())
    else:
        print('data is too small, decrease the value of num')


In [5]:
check_response_result(selector_list=h3s, num=5)

length: 88
<h3><span class="mw-headline" id="Argentina">Argentina</span></h3>
<h3><span class="mw-headline" id="Australia">Australia</span></h3>
<h3><span class="mw-headline" id="Austria">Austria</span></h3>
<h3><span class="mw-headline" id="Bangladesh">Bangladesh</span></h3>
<h3><span class="mw-headline" id="Belarus">Belarus</span></h3>
<h3 id="p-interaction-label" class="vector-menu-heading">
		<span>Contribute</span>
	</h3>
<h3 id="p-tb-label" class="vector-menu-heading">
		<span>Tools</span>
	</h3>
<h3 id="p-coll-print_export-label" class="vector-menu-heading">
		<span>Print/export</span>
	</h3>
<h3 id="p-wikibase-otherprojects-label" class="vector-menu-heading">
		<span>In other projects</span>
	</h3>
<h3 id="p-lang-label" class="vector-menu-heading">
		<span>Languages</span>
	</h3>


Some of these results, such as the sidebar menu headings at the end, are not ones we need.

## extract country of birth by xpath

In [6]:
country = h3s[0].xpath('span[@class="mw-headline"]/text()').extract()

In [7]:
country

['Argentina']

In [8]:
"""
Assuming we have a country’s <h3> header, we now need to get the
<ol> ordered list of Nobel winners. 

Handily, the xpath following-sibling selector can do just that.
following-sibling:: ==> selects all sibling nodes after the current node
"""

'\nAssuming we have a country’s <h3> header, we now need to get the\n<ol> ordered list of Nobel winners. \n\nHandily, the xpath following-sibling selector can do just that.\nfollowing-sibling:: ==> selects all sibling nodes after the current node\n'

In [9]:
ol_arg = h3s[0].xpath('following-sibling::ol[1]')

In [10]:
ol_arg

[<Selector xpath='following-sibling::ol[1]' data='<ol><li><a href="/wiki/C%C3%A9sar_Mil...'>]

In [11]:
lis_arg = ol_arg.xpath('li')
len(lis_arg)

5

In [12]:
lis_arg

[<Selector xpath='li' data='<li><a href="/wiki/C%C3%A9sar_Milstei...'>,
 <Selector xpath='li' data='<li><a href="/wiki/Adolfo_P%C3%A9rez_...'>,
 <Selector xpath='li' data='<li><a href="/wiki/Luis_Federico_Lelo...'>,
 <Selector xpath='li' data='<li><a href="/wiki/Bernardo_Houssay" ...'>,
 <Selector xpath='li' data='<li><a href="/wiki/Carlos_Saavedra_La...'>]

In [13]:
li = lis_arg[0]  # select the first list element
li.extract()

'<li><a href="/wiki/C%C3%A9sar_Milstein" title="César Milstein">César Milstein</a>*, Physiology or Medicine, 1984</li>'

In [14]:
name = li.xpath('a//text()')[0].extract()

In [15]:
name

'César Milstein'

In [16]:
"""
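descendant-or-self:: ==> selects all descendants of the current node (children, grandchildren, etc.) plus the current node itself
"""

'\ndescendant-or-self:: ==> selects all descendants of the current node (children, grandchildren, etc.) plus the current node itself\n'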

In [17]:
list_text = li.xpath('descendant-or-self::text()').extract()

In [18]:
list_text

['César Milstein', '*, Physiology or Medicine, 1984']

In [19]:
' '.join(list_text)

'César Milstein *, Physiology or Medicine, 1984'

## Selecting with Relative Xpaths

In [20]:
"""
As just shown, Scrapy xpath selections return lists of selectors
which, in turn, have their own xpath methods. When using the
xpath method, it’s important to be clear about relative and absolute
selections. Let’s make the distinction clear using the Nobel page’s
table of contents as an example.
"""

'\nAs just shown, Scrapy xpath selections return lists of selectors\nwhich, in turn, have their own xpath methods. When using the\nxpath method, it’s important to be clear about relative and absolute\nselections. Let’s make the distinction clear using the Nobel page’s\ntable of contents as an example.\n'

In [21]:
toc = response.xpath('//div[@id="toc"]')[0]

In [22]:
# relative xpaths (the two commands below give the same result here)
lis = toc.xpath('.//ul/li[2]/ul/li')
lis = toc.xpath('ul/li[2]/ul/li')
len(lis)

76

In [23]:
check_response_result(lis, 5)

length: 76
<li class="toclevel-2 tocsection-3"><a href="#Argentina"><span class="tocnumber">2.1</span> <span class="toctext">Argentina</span></a></li>
<li class="toclevel-2 tocsection-4"><a href="#Australia"><span class="tocnumber">2.2</span> <span class="toctext">Australia</span></a></li>
<li class="toclevel-2 tocsection-5"><a href="#Austria"><span class="tocnumber">2.3</span> <span class="toctext">Austria</span></a></li>
<li class="toclevel-2 tocsection-6"><a href="#Bangladesh"><span class="tocnumber">2.4</span> <span class="toctext">Bangladesh</span></a></li>
<li class="toclevel-2 tocsection-7"><a href="#Belarus"><span class="tocnumber">2.5</span> <span class="toctext">Belarus</span></a></li>
<li class="toclevel-2 tocsection-74"><a href="#United_States"><span class="tocnumber">2.72</span> <span class="toctext">United States</span></a></li>
<li class="toclevel-2 tocsection-75"><a href="#Venezuela"><span class="tocnumber">2.73</span> <span class="toctext">Venezuela</span></a></li>
<li

In [24]:
"""
A common mistake is to use a non-relative xpath selector on the current
selection. It selects from the whole document, in this case returning
every <li> that matches //ul/li[2]/ul/li anywhere on the page:
"""
lis = toc.xpath('//ul/li[2]/ul/li')
len(lis)

81

In [25]:
check_response_result(lis, 5)

length: 81
<li class="toclevel-2 tocsection-3"><a href="#Argentina"><span class="tocnumber">2.1</span> <span class="toctext">Argentina</span></a></li>
<li class="toclevel-2 tocsection-4"><a href="#Australia"><span class="tocnumber">2.2</span> <span class="toctext">Australia</span></a></li>
<li class="toclevel-2 tocsection-5"><a href="#Austria"><span class="tocnumber">2.3</span> <span class="toctext">Austria</span></a></li>
<li class="toclevel-2 tocsection-6"><a href="#Bangladesh"><span class="tocnumber">2.4</span> <span class="toctext">Bangladesh</span></a></li>
<li class="toclevel-2 tocsection-7"><a href="#Belarus"><span class="tocnumber">2.5</span> <span class="toctext">Belarus</span></a></li>
<li><a href="/wiki/Nobel_Committee_for_Chemistry" title="Nobel Committee for Chemistry">Chemistry</a></li>
<li><a href="/wiki/Committee_for_the_Sveriges_Riksbank_Prize_in_Economic_Sciences_in_Memory_of_Alfred_Nobel" title="Committee for the Sveriges Riksbank Prize in Economic Sciences in Memory