# Day20
## Parsing Web Page Structure: Using XPath with the lxml Package
- Using lxml.html
- Using XPath syntax to select child nodes

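## Assignment
The Day18 assignment already covered some of the locating tools, so today we use the same website as Day19 and practice more XPath variations in depth.

- Target website:
https://pokemondb.net/pokedex/all
- Use XPath techniques to scrape the Pokémon table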

## Self memo

- [XPath cheatsheet](https://devhints.io/xpath)

In [1]:
import lxml.html
import requests

### `GET` Request

In [2]:
url = 'https://pokemondb.net/pokedex/all'
req_txt = requests.get(url)
print(req_txt.status_code)
print()
print(req_txt.text[:1000])

200

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Pokémon Pokédex: list of Pokémon with stats | Pokémon Database</title>
<link rel="preconnect" href="https://img.pokemondb.net">
<style>@font-face{font-family:"Fira Sans";font-style:normal;font-weight:400;font-display:swap;src:url("/static/fonts/fira-sans-v10-latin-400.woff2") format("woff2");unicode-range:U+0000-00FF,U+0131,U+0152-0153,U+02BB-02BC,U+02C6,U+02DA,U+02DC,U+2000-206F,U+2074,U+20AC,U+2122,U+2191,U+2193,U+2212,U+2215,U+FEFF,U+FFFD}@font-face{font-family:"Fira Sans";font-style:italic;font-weight:400;font-display:swap;src:url("/static/fonts/fira-sans-v10-latin-400i.woff2") format("woff2");unicode-range:U+0000-00FF,U+0131,U+0152-0153,U+02BB-02BC,U+02C6,U+02DA,U+02DC,U+2000-206F,U+2074,U+20AC,U+2122,U+2191,U+2193,U+2212,U+2215,U+FEFF,U+FFFD}@font-face{font-family:"Fira Sans";font-style:normal;font-weight:700;font-display:swap;src:url("/static/fonts/fira-sans-v10-latin-600.woff2") format("woff2");unicode-r
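The cell above prints the status code but would carry on even if the request failed. As a minimal sketch, the request could be wrapped so that HTTP errors surface immediately. The helper name `fetch_html`, its `session` parameter, and the 10-second timeout are assumptions made for illustration; they are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10.0, session=None):
    """Return the response body as text, raising on HTTP error statuses."""
    # `session` is a hypothetical hook: pass a requests.Session to reuse
    # connections, or leave it as None to use the module-level requests.get.
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text


# Usage (performs a real network request, so it is left commented out):
# html_text = fetch_html('https://pokemondb.net/pokedex/all')
```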

### Convert to an HTML Element object
- Use `lxml.html.fromstring()`

In [3]:
# Convert the response text into an Element object

tree = lxml.html.fromstring(req_txt.text)

tree

<Element html at 0x1b25ff89368>

### Select nodes that match a given feature
- Locate the Pokémon information table
- Usage: `tree.xpath('//<tag_name>[@<attribute>="<attribute_value>"]')`


In [4]:
table = tree.xpath('//table[@id="pokedex"]')[0]
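The attribute predicate can be tried offline on a small document. The sketch below uses a made-up miniature page (`SAMPLE_HTML` is an assumption, not the real pokemondb.net markup) and also shows that `xpath()` returns a list, so it is worth checking for an empty result before indexing with `[0]`.

```python
import lxml.html

# Hypothetical miniature page, not the real pokemondb.net markup.
SAMPLE_HTML = """<html><body>
  <table id="other"><tbody><tr><td>ignored</td></tr></tbody></table>
  <table id="pokedex"><tbody><tr><td>Bulbasaur</td></tr></tbody></table>
</body></html>"""

tree = lxml.html.fromstring(SAMPLE_HTML)

# xpath() returns a list of matches, even when only one node qualifies.
matches = tree.xpath('//table[@id="pokedex"]')
print(len(matches))        # 1

# Guard against an empty list before taking the first element.
table = matches[0] if matches else None
print(table.get("id"))     # pokedex
```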

### Chained lookups
- Get all rows in the table

In [5]:
# The leading "." keeps the search inside `table`; a bare "//" searches the whole document
header = table.xpath('.//thead/tr/th/div/text()')
body_rows = table.xpath('.//tbody/tr')
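One detail of chained lookups is easy to miss: calling `.xpath()` on an element does not by itself restrict the search to that element. An expression that starts with `//` is still evaluated from the document root, while `.//` starts at the current node. The sketch below illustrates the difference on a made-up two-table document (`SAMPLE_HTML` is an assumption for illustration).

```python
import lxml.html

# Hypothetical document with two tables, to make the difference visible.
SAMPLE_HTML = """<html><body>
  <table id="other"><tbody><tr><td>x</td></tr></tbody></table>
  <table id="pokedex"><tbody>
    <tr><td>Bulbasaur</td></tr>
    <tr><td>Ivysaur</td></tr>
  </tbody></table>
</body></html>"""

tree = lxml.html.fromstring(SAMPLE_HTML)
table = tree.xpath('//table[@id="pokedex"]')[0]

# "//" starts at the document root, so rows of BOTH tables are returned.
rows_from_root = table.xpath('//tbody/tr')
# ".//" starts at `table`, so only its own rows are returned.
rows_in_table = table.xpath('.//tbody/tr')

print(len(rows_from_root))  # 3
print(len(rows_in_table))   # 2
```

On a page that happens to contain only one table, both forms give the same result, which is why the problem can go unnoticed.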

### Match node text: find the node whose text is Ivysaur
- Hint: use `tree.xpath('//<tag_name>[text()="some_string"]')`

In [6]:
table.xpath('.//a[text()="Ivysaur"]/text()')[0]

'Ivysaur'
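Matching on `text()` is exact, so case and surrounding whitespace matter. The following sketch, on made-up markup (`SAMPLE_HTML` and its `href` values are assumptions), shows the exact match, a failed match on different casing, and XPath's standard `normalize-space()` function as one way to tolerate padding.

```python
import lxml.html

# Hypothetical markup; the third link has padding around its text.
SAMPLE_HTML = """<html><body>
  <a href="/pokedex/bulbasaur">Bulbasaur</a>
  <a href="/pokedex/ivysaur">Ivysaur</a>
  <a href="/pokedex/venusaur"> Venusaur </a>
</body></html>"""

tree = lxml.html.fromstring(SAMPLE_HTML)

# Exact match: select the element, then read any attribute from it.
ivysaur = tree.xpath('//a[text()="Ivysaur"]')
print(ivysaur[0].get("href"))                  # /pokedex/ivysaur

# The comparison is case-sensitive, so this finds nothing.
print(tree.xpath('//a[text()="ivysaur"]'))     # []

# Padding also breaks an exact match; normalize-space() trims it first.
exact = tree.xpath('//a[text()="Venusaur"]')
trimmed = tree.xpath('//a[normalize-space(text())="Venusaur"]')
print(len(exact), len(trimmed))                # 0 1
```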

### Find nodes whose attribute contains a substring: find the type labels of each Pokémon

- Contains: `tree.xpath('//<tag_name>[contains(@<attribute>, "<attribute_value>")]')`
- Does not contain: `tree.xpath('//<tag_name>[not(contains(@<attribute>, "<attribute_value>"))]')`

In [7]:
# Find the Pokémon type labels (GRASS, POISON, ...) and use a set to keep only the distinct types

sorted(set(table.xpath('.//tbody//td[@class="cell-icon"]//a//text()')))

['Bug',
 'Dark',
 'Dragon',
 'Electric',
 'Fairy',
 'Fighting',
 'Fire',
 'Flying',
 'Ghost',
 'Grass',
 'Ground',
 'Ice',
 'Normal',
 'Poison',
 'Psychic',
 'Rock',
 'Steel',
 'Water']
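The cell above selects on an exact class value, so here is a small sketch of the `contains()` and `not(contains())` forms from the hints. The markup and the class names `type-icon`, `type-grass`, and `ent-name` are assumptions made for illustration and may not match the real page.

```python
import lxml.html

# Hypothetical markup; class names are assumed for illustration only.
SAMPLE_HTML = """<html><body><table id="pokedex"><tbody>
  <tr><td class="cell-icon">
    <a class="type-icon type-grass" href="/type/grass">Grass</a>
    <a class="type-icon type-poison" href="/type/poison">Poison</a>
  </td></tr>
  <tr><td class="cell-icon">
    <a class="type-icon type-fire" href="/type/fire">Fire</a>
  </td></tr>
  <tr><td class="cell-name">
    <a class="ent-name" href="/pokedex/charmander">Charmander</a>
  </td></tr>
</tbody></table></body></html>"""

tree = lxml.html.fromstring(SAMPLE_HTML)
table = tree.xpath('//table[@id="pokedex"]')[0]

# contains() matches when the class attribute includes the substring.
type_names = table.xpath('.//a[contains(@class, "type-icon")]/text()')
print(sorted(set(type_names)))   # ['Fire', 'Grass', 'Poison']

# not(contains()) keeps only the links without that substring.
other_names = table.xpath('.//a[not(contains(@class, "type-icon"))]/text()')
print(other_names)               # ['Charmander']
```

Because `contains()` is a plain substring test, `"type-icon"` would also match a class such as `"subtype-icon"`; an exact `@class="..."` comparison avoids that at the cost of breaking when extra classes are present.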

### Assemble the information into a table

In [8]:
header_cols = table.xpath('.//thead/tr/th/div/text()')
# One list per row; each cell is the concatenation of all text nodes inside its <td>
row_values = [[''.join(td.xpath('.//text()')) for td in tr.xpath('./td')]
              for tr in table.xpath('.//tbody/tr')]

print(header_cols)
print(row_values[:10])

['#', 'Name', 'Type', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']
[['001', 'Bulbasaur', 'Grass Poison', '318', '45', '49', '49', '65', '65', '45'], ['002', 'Ivysaur', 'Grass Poison', '405', '60', '62', '63', '80', '80', '60'], ['003', 'Venusaur', 'Grass Poison', '525', '80', '82', '83', '100', '100', '80'], ['003', 'Venusaur Mega Venusaur', 'Grass Poison', '625', '80', '100', '123', '122', '120', '80'], ['004', 'Charmander', 'Fire ', '309', '39', '52', '43', '60', '50', '65'], ['005', 'Charmeleon', 'Fire ', '405', '58', '64', '58', '80', '65', '80'], ['006', 'Charizard', 'Fire Flying', '534', '78', '84', '78', '109', '85', '100'], ['006', 'Charizard Mega Charizard X', 'Fire Dragon', '634', '78', '130', '111', '130', '85', '100'], ['006', 'Charizard Mega Charizard Y', 'Fire Flying', '634', '78', '104', '78', '159', '115', '100'], ['007', 'Squirtle', 'Water ', '314', '44', '48', '65', '50', '64', '43']]
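The output above contains cells such as `'Fire '` with a trailing space, because the text nodes inside each `<td>` are joined exactly as they appear in the page. A possible alternative, sketched below on made-up markup, strips each text node before joining, or reads the type links directly as a list. The helper name `cell_text` and the sample cells are assumptions for illustration.

```python
import lxml.html

# Hypothetical row resembling the table cells; not copied from the real page.
SAMPLE_HTML = """<html><body><table id="pokedex"><tbody>
  <tr>
    <td> 004 </td>
    <td><a href="/pokedex/charmander">Charmander</a></td>
    <td class="cell-icon"><a href="/type/fire">Fire</a><br> </td>
  </tr>
  <tr>
    <td> 001 </td>
    <td><a href="/pokedex/bulbasaur">Bulbasaur</a></td>
    <td class="cell-icon"><a href="/type/grass">Grass</a><br> <a href="/type/poison">Poison</a></td>
  </tr>
</tbody></table></body></html>"""


def cell_text(td):
    """Join the non-empty, stripped text nodes of a cell with single spaces."""
    pieces = (t.strip() for t in td.xpath('.//text()'))
    return " ".join(p for p in pieces if p)


tree = lxml.html.fromstring(SAMPLE_HTML)
rows = tree.xpath('//table[@id="pokedex"]//tbody/tr')

clean_rows = [[cell_text(td) for td in tr.xpath('./td')] for tr in rows]
print(clean_rows)
# [['004', 'Charmander', 'Fire'], ['001', 'Bulbasaur', 'Grass Poison']]

# Reading the links directly gives the types as a list with no splitting.
type_lists = [tr.xpath('./td[@class="cell-icon"]/a/text()') for tr in rows]
print(type_lists)   # [['Fire'], ['Grass', 'Poison']]
```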


In [9]:
import pandas as pd

df = pd.DataFrame(row_values, columns=header_cols)
df["Type"] = df["Type"].apply(lambda x: x.strip().split(" "))
df

| | # | Name | Type | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 001 | Bulbasaur | [Grass, Poison] | 318 | 45 | 49 | 49 | 65 | 65 | 45 |
| 1 | 002 | Ivysaur | [Grass, Poison] | 405 | 60 | 62 | 63 | 80 | 80 | 60 |
| 2 | 003 | Venusaur | [Grass, Poison] | 525 | 80 | 82 | 83 | 100 | 100 | 80 |
| 3 | 003 | Venusaur Mega Venusaur | [Grass, Poison] | 625 | 80 | 100 | 123 | 122 | 120 | 80 |
| 4 | 004 | Charmander | [Fire] | 309 | 39 | 52 | 43 | 60 | 50 | 65 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1070 | 902 | Basculegion Female | [Water, Ghost] | 530 | 120 | 92 | 65 | 100 | 75 | 78 |
| 1071 | 903 | Sneasler | [Poison, Fighting] | 510 | 80 | 130 | 60 | 40 | 80 | 120 |
| 1072 | 904 | Overqwil | [Dark, Poison] | 510 | 85 | 115 | 95 | 65 | 65 | 85 |
| 1073 | 905 | Enamorus Incarnate Forme | [Fairy, Flying] | 580 | 74 | 115 | 70 | 135 | 80 | 106 |


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1075 entries, 0 to 1074
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   #        1075 non-null   object
 1   Name     1075 non-null   object
 2   Type     1075 non-null   object
 3   Total    1075 non-null   object
 4   HP       1075 non-null   object
 5   Attack   1075 non-null   object
 6   Defense  1075 non-null   object
 7   Sp. Atk  1075 non-null   object
 8   Sp. Def  1075 non-null   object
 9   Speed    1075 non-null   object
dtypes: object(10)
memory usage: 84.1+ KB
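`df.info()` reports every column as `object`, because the scraped cells are strings. If the stats are to be sorted or aggregated numerically, one option is to convert them with `pd.to_numeric`. The sketch below uses a two-row stand-in frame built from values shown earlier; the variable names `sample_df` and `numeric_cols` are assumptions for illustration.

```python
import pandas as pd

# Small stand-in for the scraped frame: every value starts out as a string.
sample_df = pd.DataFrame(
    [["001", "Bulbasaur", "318", "45"], ["004", "Charmander", "309", "39"]],
    columns=["#", "Name", "Total", "HP"],
)

# Convert only the stat columns; "#" stays a string to keep its leading zeros.
numeric_cols = ["Total", "HP"]
sample_df[numeric_cols] = sample_df[numeric_cols].apply(pd.to_numeric)

print(sample_df.dtypes)
print(sample_df["Total"].max())   # 318
```

On the full table, `numeric_cols` would presumably list `Total`, `HP`, `Attack`, `Defense`, `Sp. Atk`, `Sp. Def`, and `Speed`.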
