# Day29
## 動態網頁爬蟲起手式：Selenium 
- 認識 Selenium 及其應用
- 瞭解 Selenium 如何用於動態爬蟲



## 作業說明
練習使用 Selenium 爬取品類列表連結
- 目標網站： https://channel.jd.com/outdoor.html

目標：
- 取得當前所有小類別名稱

![](https://i.imgur.com/2aCtZM5.png)

Hint: 
- 參考講義 Day29 內容
- 請記得先安裝 Chrome 瀏覽器
- 也會用到我們 Day20 所學的 Xpath

### 套件安裝

In [1]:
!pip install -U selenium
!pip install webdriver_manager
!pip install fake-useragent

Collecting webdriver_manager
  Using cached webdriver_manager-4.0.2-py2.py3-none-any.whl.metadata (12 kB)
Using cached webdriver_manager-4.0.2-py2.py3-none-any.whl (27 kB)
Installing collected packages: webdriver_manager
Successfully installed webdriver_manager-4.0.2
Collecting fake-useragent
  Downloading fake_useragent-2.1.0-py3-none-any.whl.metadata (17 kB)
Downloading fake_useragent-2.1.0-py3-none-any.whl (125 kB)
   ---------------------------------------- 0.0/125.8 kB ? eta -:--:--
   --- ------------------------------------ 10.2/125.8 kB ? eta -:--:--
   --- ------------------------------------ 10.2/125.8 kB ? eta -:--:--
   --------- ----------------------------- 30.7/125.8 kB 330.3 kB/s eta 0:00:01
   --------- ----------------------------- 30.7/125.8 kB 330.3 kB/s eta 0:00:01
   ------------------- ------------------- 61.4/125.8 kB 273.8 kB/s eta 0:00:01
   ------------------------- ------------- 81.9/125.8 kB 327.3 kB/s eta 0:00:01
   ---------------------------- ---------- 

### 套件導入

In [2]:
from fake_useragent import UserAgent
import numpy as np
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from webdriver_manager.chrome import ChromeDriverManager
import time
from tqdm import tqdm

In [5]:
from selenium.webdriver.chrome.service import Service
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)



### 使用 fake-useragent 產生 User Agent

In [9]:
driver = webdriver.Chrome(service=service)

# 目標網址
base_url = 'https://channel.jd.com/outdoor.html'
driver.get(base_url)

opt = webdriver.ChromeOptions()
user_agent = UserAgent()
opt.add_argument('--user-agent=%s' % user_agent)

driver.close()

### 測試是否能成功啟動 webdriver 開啟目標網址


### 獲取資料
- 練習 selenium 操作
- 活用 `WebElement.find_elements_by_xpath()`, `WebElement.text`, `WebElement.get_attribute()` 等方法

In [16]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
import time

# ✅ 設定 Chrome 選項
options = Options()
options.add_argument("--headless")  # 無頭模式 (不顯示瀏覽器)
options.add_argument("--disable-gpu")
options.add_argument("--window-size=1920,1080")  # 避免載入手機版

# ✅ 啟動 WebDriver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=options)

# ✅ 目標網址
base_url = 'https://channel.jd.com/outdoor.html'
driver.get(base_url)

# ✅ 等待網頁完全加載 (最多等待 15 秒)
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.ID, "Categorys"))
)

# ✅ 爬取所有大分類 (`dt`)
categories = driver.find_elements(By.XPATH, '//div[@id="Categorys"]//dl[@class="item-inner"]/dt')

# ✅ 爬取所有小分類 (`a`)
subcategories = driver.find_elements(By.XPATH, '//div[@id="Categorys"]//dl[@class="item-inner"]/dd/a')

# ✅ 存放爬取結果
data = []

# ✅ 遍歷所有分類
for dt in categories:
    category_name = dt.text.strip()
    
    # ✅ 找到該分類下所有的小分類
    sub_links = dt.find_elements(By.XPATH, "./following-sibling::dd/a")
    
    for sub in sub_links:
        sub_name = sub.text.strip()
        sub_href = sub.get_attribute("href")
        
        # ✅ 確保連結完整 (部分是 `//` 開頭)
        if sub_href.startswith("//"):
            sub_href = "https:" + sub_href
        
        data.append((category_name, sub_name, sub_href))

# ✅ 顯示爬取結果
for item in data:
    print(item)

# ✅ 關閉瀏覽器
driver.quit()


('户外鞋服', '冲锋衣裤', 'https://list.jd.com/list.html?cat=1318%2C2628%2C12123&go=0')
('户外鞋服', '徒步鞋', 'https://list.jd.com/list.html?cat=1318%2C2628%2C12136&go=0')
('户外鞋服', '抓绒衣裤', 'https://list.jd.com/list.html?cat=1318,2628,12128')
('户外鞋服', '羽绒服棉服', 'https://list.jd.com/list.html?cat=1318,2628,12126')
('户外鞋服', '越野跑鞋', 'https://list.jd.com/list.html?cat=1318,2628,12137')
('户外鞋服', '软壳', 'https://list.jd.com/list.html?cat=1318,2628,12129')
('户外鞋服', '登山鞋', 'https://list.jd.com/list.html?cat=1318,2628,12134')
('户外鞋服', '休闲鞋', 'https://list.jd.com/list.html?cat=1318,2628,12138')
('户外装备', '帐篷', 'https://list.jd.com/list.html?cat=1318,1462,1473')
('户外装备', '照明', 'https://list.jd.com/list.html?cat=1318,1462,1476')
('户外装备', '背包', 'https://list.jd.com/list.html?cat=1318,1462,1472')
('户外装备', '户外仪表', 'https://list.jd.com/list.html?cat=1318,1462,2631')
('户外装备', '工具', 'https://list.jd.com/list.html?cat=1318,1462,1479')
('户外装备', '望远镜', 'https://list.jd.com/list.html?cat=1318,1462,1480')
('户外装备', '旅游用品', 'htt

In [17]:
import pandas as pd

pd.DataFrame(data, columns=["中類別","小類別","類別頁連結"])


Unnamed: 0,中類別,小類別,類別頁連結
0,户外鞋服,冲锋衣裤,https://list.jd.com/list.html?cat=1318%2C2628%...
1,户外鞋服,徒步鞋,https://list.jd.com/list.html?cat=1318%2C2628%...
2,户外鞋服,抓绒衣裤,"https://list.jd.com/list.html?cat=1318,2628,12128"
3,户外鞋服,羽绒服棉服,"https://list.jd.com/list.html?cat=1318,2628,12126"
4,户外鞋服,越野跑鞋,"https://list.jd.com/list.html?cat=1318,2628,12137"
5,户外鞋服,软壳,"https://list.jd.com/list.html?cat=1318,2628,12129"
6,户外鞋服,登山鞋,"https://list.jd.com/list.html?cat=1318,2628,12134"
7,户外鞋服,休闲鞋,"https://list.jd.com/list.html?cat=1318,2628,12138"
8,户外装备,帐篷,"https://list.jd.com/list.html?cat=1318,1462,1473"
9,户外装备,照明,"https://list.jd.com/list.html?cat=1318,1462,1476"


In [18]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
import time
import numpy as np
from tqdm import tqdm

# ✅ 設定 Chrome Options
options = Options()
options.add_argument("--headless")  # 無頭模式 (可選)
options.add_argument("--disable-gpu")
options.add_argument("--window-size=1920,1080")  # 避免載入行動版頁面

# ✅ 啟動 WebDriver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=options)

# ✅ 目標網址
base_url = 'https://channel.jd.com/outdoor.html'
driver.get(base_url)

# ✅ 等待 `Categorys` 區塊載入
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.ID, "Categorys"))
)

# ✅ 找到所有大分類 (`dl.item-inner`)
categories = driver.find_elements(By.XPATH, '//dl[@class="item-inner"]')

# ✅ 存放爬取結果
data = []

# ✅ 開始 `hover` (懸浮滑鼠) 並抓取數據
for category in tqdm(categories, desc="Hovering over categories"):
    # 獲取大分類名稱 (`dt`)
    category_name = category.find_element(By.TAG_NAME, "dt").text.strip()

    # ✅ 使用 ActionChains 讓滑鼠懸浮
    hover = ActionChains(driver).move_to_element(category)
    hover.perform()  # 模擬懸浮
    time.sleep(np.random.uniform(2, 5))  # 避免封鎖，隨機等待

    # ✅ 等待子分類 (`dd a`) 出現
    subcategories = WebDriverWait(driver, 5).until(
        EC.presence_of_all_elements_located((By.XPATH, '//dl[@class="item-inner"]/dd/a'))
    )

    # ✅ 收集小分類資訊
    for sub in subcategories:
        sub_name = sub.text.strip()
        sub_href = sub.get_attribute("href")

        # 確保 `href` 為完整 URL
        if sub_href.startswith("//"):
            sub_href = "https:" + sub_href

        data.append((category_name, sub_name, sub_href))

# ✅ 印出結果
print("爬取到的分類數據:")
for item in data:
    print(item)

# ✅ 關閉瀏覽器
driver.quit()


Hovering over categories: 100%|██████████| 5/5 [00:20<00:00,  4.19s/it]


爬取到的分類數據:
('户外鞋服', '冲锋衣裤', 'https://list.jd.com/list.html?cat=1318%2C2628%2C12123&go=0')
('户外鞋服', '徒步鞋', 'https://list.jd.com/list.html?cat=1318%2C2628%2C12136&go=0')
('户外鞋服', '抓绒衣裤', 'https://list.jd.com/list.html?cat=1318,2628,12128')
('户外鞋服', '羽绒服棉服', 'https://list.jd.com/list.html?cat=1318,2628,12126')
('户外鞋服', '越野跑鞋', 'https://list.jd.com/list.html?cat=1318,2628,12137')
('户外鞋服', '软壳', 'https://list.jd.com/list.html?cat=1318,2628,12129')
('户外鞋服', '登山鞋', 'https://list.jd.com/list.html?cat=1318,2628,12134')
('户外鞋服', '休闲鞋', 'https://list.jd.com/list.html?cat=1318,2628,12138')
('户外鞋服', '帐篷', 'https://list.jd.com/list.html?cat=1318,1462,1473')
('户外鞋服', '照明', 'https://list.jd.com/list.html?cat=1318,1462,1476')
('户外鞋服', '背包', 'https://list.jd.com/list.html?cat=1318,1462,1472')
('户外鞋服', '户外仪表', 'https://list.jd.com/list.html?cat=1318,1462,2631')
('户外鞋服', '工具', 'https://list.jd.com/list.html?cat=1318,1462,1479')
('户外鞋服', '望远镜', 'https://list.jd.com/list.html?cat=1318,1462,1480')
('户外鞋服', '旅

In [19]:
import pandas as pd

pd.DataFrame(data, columns=["中類別","小類別","類別頁連結"])

Unnamed: 0,中類別,小類別,類別頁連結
0,户外鞋服,冲锋衣裤,https://list.jd.com/list.html?cat=1318%2C2628%...
1,户外鞋服,徒步鞋,https://list.jd.com/list.html?cat=1318%2C2628%...
2,户外鞋服,抓绒衣裤,"https://list.jd.com/list.html?cat=1318,2628,12128"
3,户外鞋服,羽绒服棉服,"https://list.jd.com/list.html?cat=1318,2628,12126"
4,户外鞋服,越野跑鞋,"https://list.jd.com/list.html?cat=1318,2628,12137"
...,...,...,...
175,游泳,女士泳衣,https://search.jd.com/Search?keyword=%E5%A5%B3...
176,游泳,泳裤,https://search.jd.com/Search?keyword=%E6%B3%B3...
177,游泳,泳帽,https://search.jd.com/Search?keyword=%E6%B3%B3...
178,游泳,冲浪潜水,https://search.jd.com/Search?keyword=%E5%86%B2...
