## Using Selector to Parse Web Pages (Source Code)

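- We have already used BeautifulSoup to parse web pages; it is simple and convenient, but relatively slow.
- lxml can also parse web pages, but it is considerably more complicated to use.
- Scrapy's Selector is built on lxml yet is comparatively easy to use.

- We can hand an HTML document to Selector and let it parse the document.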

In [2]:
from scrapy.selector import Selector

text = '''
    <html>
        <body>
            <h1>Hello World</h1>
            <h1>Hello Scrapy</h1>
            <b>Hello python></b>
            <ul>
                <li>C++</li>
                <li>Java</li>
                <li>Python</li>
            </ul>
        </body>
    </html>
'''
selector = Selector(text=text)                      # build a Selector directly from the HTML string
print(selector)

<Selector xpath=None data='<html>\n        <body>\n            <h1...'>


In [3]:
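# A Selector can also be built from a Response.
# The Response object is passed to Selector's
# keyword argument response, whereas the approach
# above passed the HTML string to the keyword argument text.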

from scrapy.selector import Selector
from scrapy.http import HtmlResponse

body = '''
    <html>
        <body>
            <h1>Hello World</h1>
            <h1>Hello Scrapy</h1>
            <b>Hello python></b>
            <ul>
                <li>C++</li>
                <li>Java</li>
                <li>Python</li>
            </ul>
        </body>
    </html>
'''

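# Use HtmlResponse() to load the HTML. HtmlResponse() requires a url
# argument; because the HTML here is not fetched from that URL, it is
# passed in through body. The URL below is a reserved domain with no real content.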


response = HtmlResponse(url='http://www.example.com', body=body, encoding='utf8') 
selector = Selector(response=response)
print(selector)

<Selector xpath=None data='<html>\n        <body>\n            <h1...'>


In [10]:
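# Call the xpath or css method of a Selector object to select one or more parts of the document

selector_list = selector.xpath('//h1')  # select every <h1>; the meaning of // is explained below
print(selector_list)
print('=' * 100)
# get the text marked up by the tag
# the result is still a list, but the HTML tags have been removed
for sel in selector_list:
    print(sel.xpath('./text()'))        # ./text() is explained below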

[<Selector xpath='//h1' data='<h1>Hello World</h1>'>, <Selector xpath='//h1' data='<h1>Hello Scrapy</h1>'>]
[<Selector xpath='./text()' data='Hello World'>]
[<Selector xpath='./text()' data='Hello Scrapy'>]


In [13]:
# The result of Selector.xpath() can itself be searched again with .xpath() and/or .css()

print(selector_list.xpath('./text()'))  # selector_list looks like a list, but .xpath() can be called on it directly; no for loop is needed
print('='*100)
print(selector.xpath('.//ul').css('li').xpath('./text()'))

[<Selector xpath='./text()' data='Hello World'>, <Selector xpath='./text()' data='Hello Scrapy'>]
[<Selector xpath='./text()' data='C++'>, <Selector xpath='./text()' data='Java'>, <Selector xpath='./text()' data='Python'>]


### Selector and SelectorList objects provide the following methods for extracting the selected content:

1. extract()
1. re()
1. extract_first() [SelectorList only]
1. re_first() [SelectorList only]

In [19]:
s1 = selector.xpath('.//li')
print(s1)
print('='*100)
print(s1[0].extract())           # extract the content of the first item in the list (index=0)
print('='*100)
s1 = selector.xpath('.//li/text()')
print(s1)
print('='*100)
print(s1[1].extract())           # with text() added, the extracted text no longer contains HTML tags

[<Selector xpath='.//li' data='<li>C++</li>'>, <Selector xpath='.//li' data='<li>Java</li>'>, <Selector xpath='.//li' data='<li>Python</li>'>]
<li>C++</li>
[<Selector xpath='.//li/text()' data='C++'>, <Selector xpath='.//li/text()' data='Java'>, <Selector xpath='.//li/text()' data='Python'>]
Java


In [21]:
# extract() returns all values; extract_first() returns the first value

s1 = selector.xpath('.//li/text()')
print(s1.extract())
print('='*100)
print(s1.extract_first())

['C++', 'Java', 'Python']
C++


In [26]:
# Regular expressions (re) can also be used to search for specific content

text = '''
    <ul>
        <li>Python學習手冊 <b>價格：99.00元</b></li>
        <li>Python核心程式設計 <b>價格：88.00元</b></li>
        <li>Python基礎教學 <b>價格：80.00元</b></li>
    </ul>
'''

selector = Selector(text=text)
print(selector.xpath('.//li/b/text()'))
print('='*100)
print(selector.xpath('.//li/b/text()').extract())
print('='*100)

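# search with the Selector's own re() method rather than the re module
print(selector.xpath('.//li/b/text()').re(r'\d+\.\d+'))    # one or more decimal digits + a decimal point + one or more decimal digits

print('='*100)

print(selector.xpath('.//li/b/text()').re_first(r'\d+\.\d+'))  # use the Selector's re_first() to find the first price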

[<Selector xpath='.//li/b/text()' data='價格：99.00元'>, <Selector xpath='.//li/b/text()' data='價格：88.00元'>, <Selector xpath='.//li/b/text()' data='價格：80.00元'>]
['價格：99.00元', '價格：88.00元', '價格：80.00元']
['99.00', '88.00', '80.00']
99.00


In [28]:
# Usually there is no need to create a Selector by hand, because the Response creates one automatically

from scrapy.http import HtmlResponse

body = '''
    <html>
        <body>
            <h1>Hello World</h1>
            <h1>Hello Scrapy</h1>
            <b>Hello python></b>
            <ul>
                <li>C++</li>
                <li>Java</li>
                <li>Python</li>
            </ul>
        </body>
    </html>
'''

response = HtmlResponse(url='http://www.example.com', body = body, encoding='utf8')
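print(response.selector) # no import or function call is needed; just use the .selector attribute
                         # compared with the cells above, Selector() is not needed to build the selector
print('='*100)

# The .xpath() and .css() methods introduced above can be called directly on the object created by HtmlResponse()
# There is no need to pass response through Selector() again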

print(response.xpath('.//h1/text()').extract())
print(response.css('li::text').extract())

<Selector xpath=None data='<html>\n        <body>\n            <h1...'>
['Hello World', 'Hello Scrapy']
['C++', 'Java', 'Python']


## XPath

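- XPath, the XML Path Language, is a language for addressing parts of an XML document.
- XML and HTML both descend from SGML and share many characteristics.
- An XML document contains several types of nodes; the most common are:
    1. Root node: the root of the whole document tree
    1. Element nodes (tags): html, body, div, p, a, etc.
    1. Attribute nodes: href
    1. Text nodes: Hello world, Click here, etc.
- Nodes are related in the following ways:
    1. Parent and child: body is a child of html, and p and a are children of div; conversely, div is their parent
    1. Siblings: a and p are sibling nodes
- Commonly used basic XPath syntax:
    1. / : selects the document root
    1. . : selects the current node
    1. .. : selects the parent of the current node
    1. ELEMENT : selects all ELEMENT element nodes among the children
    1. //ELEMENT : selects all ELEMENT element nodes among the descendants
    1. \* : selects all element nodes among the children
    1. text() : selects all text nodes
    1. @ATTR : selects the attribute node named ATTR
    1. @* : selects all attribute nodes
    1. [predicate] : a predicate finds a specific node or a node that contains a specific value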

In [1]:
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

body = '''
    <html>
        <head>
            <base href='http://example.com/' />
            <title>Example website</title>
        </head>
        <body>
            <div id='images'>
                <a href = 'image1.html'>Name: Image I<br/><img src='image1.jpg' /></a>
                <a href = 'image2.html'>Name: Image 2<br/><img src='image2.jpg' /></a>
                <a href = 'image3.html'>Name: Image 3<br/><img src='image3.jpg' /></a>
                <a href = 'image4.html'>Name: Image 4<br/><img src='image4.jpg' /></a>
                <a href = 'image5.html'>Name: Image 5<br/><img src='image5.jpg' /></a>
            </div>
        </body>
    </html>
'''

response = HtmlResponse(url='http://www.example.com', body=body, encoding='utf8')

print(response.xpath('/html')) # an absolute path that starts at the root
print(response.xpath('/html/head'))

[<Selector xpath='/html' data='<html>\n        <head>\n            <ba...'>]
[<Selector xpath='/html/head' data='<head>\n            <base href="http:/...'>]


In [36]:
# Select every a under div (the a elements on one specific path)
for item in response.xpath('/html/body/div/a'):
    print(item)

<Selector xpath='/html/body/div/a' data='<a href="image1.html">Name: Image I<b...'>
<Selector xpath='/html/body/div/a' data='<a href="image2.html">Name: Image 2<b...'>
<Selector xpath='/html/body/div/a' data='<a href="image3.html">Name: Image 3<b...'>
<Selector xpath='/html/body/div/a' data='<a href="image4.html">Name: Image 4<b...'>
<Selector xpath='/html/body/div/a' data='<a href="image5.html">Name: Image 5<b...'>


In [37]:
# Select every a in the document, wherever it is

for item in response.xpath('//a'):
    print(item)

<Selector xpath='//a' data='<a href="image1.html">Name: Image I<b...'>
<Selector xpath='//a' data='<a href="image2.html">Name: Image 2<b...'>
<Selector xpath='//a' data='<a href="image3.html">Name: Image 3<b...'>
<Selector xpath='//a' data='<a href="image4.html">Name: Image 4<b...'>
<Selector xpath='//a' data='<a href="image5.html">Name: Image 5<b...'>


In [38]:
# Select every img among the descendants of body (any path or tag can be given as the starting point)

for item in response.xpath('/html/body//img'):
    print(item)

<Selector xpath='/html/body//img' data='<img src="image1.jpg">'>
<Selector xpath='/html/body//img' data='<img src="image2.jpg">'>
<Selector xpath='/html/body//img' data='<img src="image3.jpg">'>
<Selector xpath='/html/body//img' data='<img src="image4.jpg">'>
<Selector xpath='/html/body//img' data='<img src="image5.jpg">'>


In [3]:
# Select the text nodes of every a

sel = response.xpath('//a/text()')
print(sel)
print('='*100)

# Printing everything at once is hard to read,
# so print one item per line

for s in sel:
    print(s)
    
# Call .extract() directly; no loop is needed

sel.extract()

[<Selector xpath='//a/text()' data='Name: Image I'>, <Selector xpath='//a/text()' data='Name: Image 2'>, <Selector xpath='//a/text()' data='Name: Image 3'>, <Selector xpath='//a/text()' data='Name: Image 4'>, <Selector xpath='//a/text()' data='Name: Image 5'>]
<Selector xpath='//a/text()' data='Name: Image I'>
<Selector xpath='//a/text()' data='Name: Image 2'>
<Selector xpath='//a/text()' data='Name: Image 3'>
<Selector xpath='//a/text()' data='Name: Image 4'>
<Selector xpath='//a/text()' data='Name: Image 5'>


['Name: Image I',
 'Name: Image 2',
 'Name: Image 3',
 'Name: Image 4',
 'Name: Image 5']

In [42]:
# Select all element children of html (child tags)
# xpath() returns a list, which is hard to read when printed directly,
# so loop over it and print the members one by one

for item in response.xpath('html/*'):
    print(item)
    
# Select all descendant element nodes of the div on a specific path

for item in response.xpath('/html/body/div//*'):
    print(item)
    

<Selector xpath='/html/body/div//*' data='<a href="image1.html">Name: Image I<b...'>
<Selector xpath='/html/body/div//*' data='<br>'>
<Selector xpath='/html/body/div//*' data='<img src="image1.jpg">'>
<Selector xpath='/html/body/div//*' data='<a href="image2.html">Name: Image 2<b...'>
<Selector xpath='/html/body/div//*' data='<br>'>
<Selector xpath='/html/body/div//*' data='<img src="image2.jpg">'>
<Selector xpath='/html/body/div//*' data='<a href="image3.html">Name: Image 3<b...'>
<Selector xpath='/html/body/div//*' data='<br>'>
<Selector xpath='/html/body/div//*' data='<img src="image3.jpg">'>
<Selector xpath='/html/body/div//*' data='<a href="image4.html">Name: Image 4<b...'>
<Selector xpath='/html/body/div//*' data='<br>'>
<Selector xpath='/html/body/div//*' data='<img src="image4.jpg">'>
<Selector xpath='/html/body/div//*' data='<a href="image5.html">Name: Image 5<b...'>
<Selector xpath='/html/body/div//*' data='<br>'>
<Selector xpath='/html/body/div//*' data='<img src="image5.jpg

In [44]:
# Select every img that is a grandchild of a div

for item in response.xpath('//div/*/img'):
    print(item)

<Selector xpath='//div/*/img' data='<img src="image1.jpg">'>
<Selector xpath='//div/*/img' data='<img src="image2.jpg">'>
<Selector xpath='//div/*/img' data='<img src="image3.jpg">'>
<Selector xpath='//div/*/img' data='<img src="image4.jpg">'>
<Selector xpath='//div/*/img' data='<img src="image5.jpg">'>


In [45]:
# Select the src attribute of every img

for item in response.xpath('//img/@src'):
    print(item)

<Selector xpath='//img/@src' data='image1.jpg'>
<Selector xpath='//img/@src' data='image2.jpg'>
<Selector xpath='//img/@src' data='image3.jpg'>
<Selector xpath='//img/@src' data='image4.jpg'>
<Selector xpath='//img/@src' data='image5.jpg'>


In [46]:
# Select every href attribute in the document

for item in response.xpath('//@href'):
    print(item)

<Selector xpath='//@href' data='http://example.com/'>
<Selector xpath='//@href' data='image1.html'>
<Selector xpath='//@href' data='image2.html'>
<Selector xpath='//@href' data='image3.html'>
<Selector xpath='//@href' data='image4.html'>
<Selector xpath='//@href' data='image5.html'>


In [47]:
# Select all attributes of the img under the first a (here there is only the src attribute)
# A position is written as a number in square brackets

for item in response.xpath('//a[1]/img/@*'):
    print(item)

<Selector xpath='//a[1]/img/@*' data='image1.jpg'>


In [49]:
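# '.' selects the current node and is used to write relative paths
# xpath() returns a SelectorList, so index 0 is needed to get a single Selector

sel = response.xpath('//a')[0]
print(sel)

# get the img among the descendants of this a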

print(sel.xpath('.//img'))

<Selector xpath='//a' data='<a href="image1.html">Name: Image I<b...'>
[<Selector xpath='.//img' data='<img src="image1.jpg">'>]


In [51]:
# Select the parent node of every img
# Note: even though it refers to the parent, .. is written after the tag

for item in response.xpath('//img/..'):
    print(item)

<Selector xpath='//img/..' data='<a href="image1.html">Name: Image I<b...'>
<Selector xpath='//img/..' data='<a href="image2.html">Name: Image 2<b...'>
<Selector xpath='//img/..' data='<a href="image3.html">Name: Image 3<b...'>
<Selector xpath='//img/..' data='<a href="image4.html">Name: Image 4<b...'>
<Selector xpath='//img/..' data='<a href="image5.html">Name: Image 5<b...'>


In [58]:
# Select the third a

print(response.xpath('//a[3]'))

# the last one
print(response.xpath('//a[last()]'))

# use position() to select the first three
print(response.xpath('//a[position()<=3]'))

# select every div that has an id attribute
print(response.xpath('//div[@id]'))

# select every div whose id attribute has the value images
print(response.xpath('//div[@id="images"]'))

[<Selector xpath='//a[3]' data='<a href="image3.html">Name: Image 3<b...'>]
[<Selector xpath='//a[last()]' data='<a href="image5.html">Name: Image 5<b...'>]
[<Selector xpath='//a[position()<=3]' data='<a href="image1.html">Name: Image I<b...'>, <Selector xpath='//a[position()<=3]' data='<a href="image2.html">Name: Image 2<b...'>, <Selector xpath='//a[position()<=3]' data='<a href="image3.html">Name: Image 3<b...'>]
[<Selector xpath='//div[@id]' data='<div id="images">\n                <a ...'>]
[<Selector xpath='//div[@id="images"]' data='<div id="images">\n                <a ...'>]


In [59]:
# string() inside .xpath()

from scrapy.selector import Selector
text = '<a href="#">Click here go to to the <strong>Next Page</strong></a>'
sel=Selector(text=text)
print(sel)

<Selector xpath=None data='<html><body><a href="#">Click here go...'>


In [63]:
print(sel.xpath('string(/html/body/a/strong)').extract())

# Using the string() function in xpath has the same effect as the following

print(sel.xpath('/html/body/a/strong/text()').extract())

['Next Page']
['Next Page']


In [65]:
# Get the whole link text as a single string
# The text is split across different tags (elements),

print(sel.xpath('/html/body/a//text()').extract()) # outputs a list of two strings

# in which case string() can be used
print(sel.xpath('string(/html/body/a)').extract())

['Click here go to to the ', 'Next Page']
['Click here go to to the Next Page']


## CSS Selectors

- See the examples below.

In [78]:
# Build the HTML document

from scrapy.selector import Selector
from scrapy.http import HtmlResponse

body = '''
    <html>
        <head>
            <base href='http://example.com/' />
            <title>Example website</title>
        </head>
        <body>
            <div id="images-1" style='width: 1230px;'>
                <a href='image1.html'>Name: Image1 <br/><img src='imge1.jpg' /></a>
                <a href='image1.html'>Name: Image2 <br/><img src='imge2.jpg' /></a>
                <a href='image1.html'>Name: Image3 <br/><img src='imge3.jpg' /></a>
            </div>
            <div id='images-2' class='small'>
                <a href='image4.html'>Name: Image 4 <br/><imag src='image4.jpg' /></a>
                <a href='image5.html'>Name: Image 5 <br/><imag src='image5.jpg' /></a>
            </div>
        </body>
    </html>
'''

response = HtmlResponse(url='http://www.example.com', body=body, encoding='utf8')

In [84]:


# Select every img
print(response.css('img'))
print('='*100)

# Select every base and title
print(response.css('base, title'))
print('='*100)

# Select the img elements among the descendants of div
print(response.css('div img'))
print('='*100)

# Select the div elements that are children of body
print(response.css('body>div'))
print('='*100)

# Select the elements that have a style attribute
print(response.css('[style]'))
print('='*100)

# Select the elements whose id attribute has the value images-1
print(response.css('[id=images-1]'))
print('='*100)

# Select the first a of every div
print(response.css('div>a:nth-child(1)'))
print('='*100)

# Select the first a of the second div
print(response.css('div:nth-child(2)>a:nth-child(1)'))
print('='*100)

# Select the last a in the first div
print(response.css('div:first-child>a:last-child'))
print('='*100)

# Select the text of every a

sel= response.css('a::text')
print(sel)
print('='*100)
print(sel.extract())

[<Selector xpath='descendant-or-self::img' data='<img src="imge1.jpg">'>, <Selector xpath='descendant-or-self::img' data='<img src="imge2.jpg">'>, <Selector xpath='descendant-or-self::img' data='<img src="imge3.jpg">'>]
[<Selector xpath='descendant-or-self::base | descendant-or-self::title' data='<base href="http://example.com/">'>, <Selector xpath='descendant-or-self::base | descendant-or-self::title' data='<title>Example website</title>'>]
[<Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="imge1.jpg">'>, <Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="imge2.jpg">'>, <Selector xpath='descendant-or-self::div/descendant-or-self::*/img' data='<img src="imge3.jpg">'>]
[<Selector xpath='descendant-or-self::body/div' data='<div id="images-1" style="width: 1230...'>, <Selector xpath='descendant-or-self::body/div' data='<div id="images-2" class="small">\n   ...'>]
[<Selector xpath='descendant-or-self::*[@style]' data='<div 