## → Chapter Four  
*Getting Lost in Dangdang.com’s Chaotic Marketplace*

🔺After bookschina, I tried scraping [Dangdang.com](https://www.dangdang.com) using a similar Selenium setup. The method was the same:  
search by ISBN → scrape the search result page → extract product URLs → scrape product detail pages.  

---

## → 第四章  
*乱入当当书市之险途*

🔺在成功爬取 bookschina 之后，我尝试使用类似的 Selenium 设置来爬取 [Dangdang.com](https://www.dangdang.com)。方法相同：  
通过 ISBN 进行搜索 → 爬取搜索结果页 → 提取商品链接 → 爬取商品详情页。  



### **→ Use Selenium to Scrape Book Titles from Dangdang**

Building on the page-structure analysis from the earlier bookschina.com scraper and the same Selenium approach, I used the following code to:

- Simulate real browser interaction to search for books by their ISBNs on dangdang.com
- Extract titles and links from the search result page  
- Save the result to a CSV file

---

### **→ 使用 Selenium 从当当网爬取图书标题**

结合前面在 bookschina.com 的网页结构分析方法，以及使用 Selenium 模拟浏览器操作的思路，我使用了以下代码来完成以下任务：

- 模拟真实浏览器行为，在当当网根据 ISBN 搜索图书  
- 从搜索结果页面中提取图书标题和链接  
- 将结果保存为 CSV 文件


In [None]:
import time
import random
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException

# set path to chromedriver
driver_path = "C:/Users/aliso/Downloads/chromedriver-win64/chromedriver.exe"

# configure browser
options = Options()
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/135.0.0.0 Safari/537.36")

service = Service(executable_path=driver_path)
driver = webdriver.Chrome(service=service, options=options)
driver.maximize_window()

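# load ISBNs from CSV (read the code column as text so long ISBNs are not
# converted to floats such as "9787...0.0" when the column contains blanks)
df = pd.read_csv('../data/scannedResults.csv', encoding="utf-8", dtype={'CODECONTENT': str})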
all_books = []

# loop through a range of ISBNs (adjust the range to control how many entries are queried per run);
# the upper bound is capped at the number of rows so a short CSV does not raise a KeyError
for idx in range(0, min(250, len(df))):
    isbn = str(df.loc[idx, 'CODECONTENT']).strip()
    if not isbn or isbn == 'nan':
        print(f"skipping index {idx} — invalid isbn")
        continue

    print(f"searching isbn: {isbn}")
    search_url = f"https://search.dangdang.com/?medium=01&key4={isbn}&category_path=01.00.00.00.00.00"
    driver.get(search_url)

    # random delay to mimic human behavior
    sleep_time = random.uniform(3, 8)
    print(f"sleeping for {sleep_time:.2f} seconds...")
    time.sleep(sleep_time)

    # try to find product listing blocks
    containers = [
        "//div[@id='search_nature_rg']",
        "//div[@class='con shoplist']",
        "//div[@id='search_component_1']"
    ]

    li_elements = []
    for container_xpath in containers:
        try:
            container = driver.find_element(By.XPATH, container_xpath)
            li_elements = container.find_elements(By.XPATH, ".//li[@ddt-pit='1']")
            if li_elements:
                break
        except NoSuchElementException:
            continue

    if not li_elements:
        print("no book entries found.")
        continue

    # extract title and url from each result
    for li in li_elements:
        try:
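            # fall back through the known title-link selectors; only a missing
            # element should trigger the next attempt, not every possible error
            try:
                a_tag = li.find_element(By.XPATH, ".//p[@class='name']/a")
            except NoSuchElementException:
                try:
                    a_tag = li.find_element(By.CSS_SELECTOR, "p.name a")
                except NoSuchElementException:
                    a_tag = li.find_element(By.CSS_SELECTOR, "a[name='itemlist-title']")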
            
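            # get_attribute returns None when the attribute is absent, so fall back to the link text
            title = (a_tag.get_attribute("title") or a_tag.text).strip()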
            url = a_tag.get_attribute("href")
            print(f"found book: {title}")

            all_books.append({
                "isbn": isbn,
                "title": title,
                "url": url
            })

        except Exception as e:
            print(f"error extracting title: {e}")
            continue

# save results to csv
result_df = pd.DataFrame(all_books)
if not result_df.empty:
    result_df.to_csv("dangdangResults.csv", index=False, encoding='utf-8-sig')
    print("saved to dangdangResults.csv")

# close browser
driver.quit()
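
🔺The loop above only skips empty values, so any scanned code that is not actually an ISBN still triggers a search. A minimal sketch of a pre-check is shown here, assuming the scanned codes are meant to be ISBN-13 values; the helper name `is_valid_isbn13` is my own and not part of the original script.

```python
def is_valid_isbn13(code: str) -> bool:
    """Return True if `code` passes the standard ISBN-13 check-digit test."""
    digits = code.replace("-", "").strip()
    # must be exactly 13 ASCII digits once hyphens are removed
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    # weights alternate 1, 3, 1, 3, ... and the weighted sum must be a multiple of 10
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


if __name__ == "__main__":
    for sample in ["9780306406157", "9780306406158", "nan"]:
        print(sample, is_valid_isbn13(sample))
```

In the scraping loop, this check could sit next to the existing `'nan'` test so that invalid codes are skipped before a page load is spent on them.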

🔺At first, this worked. I was able to extract basic metadata like title, publisher, author, and publication date from the **search result pages**.  
However, Dangdang implements login restrictions and rate-limiting on the **product detail pages**, which stopped further scraping after a few queries.  

→ <img src="../images/0402.png" alt="Dangdang login page" width="300px">  
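
🔺One way to notice the login wall early is to inspect the URL the browser ends up on after each page load. The sketch below is only a heuristic: the idea that the login page lives on a `login.` subdomain or has "login"/"signin" in its path is my assumption, and the example URLs are illustrative rather than captured from a real session.

```python
from urllib.parse import urlparse

# assumption: path fragments that suggest a login page
LOGIN_PATH_HINTS = ("login", "signin")


def looks_like_login_redirect(current_url: str) -> bool:
    """Heuristically decide whether the browser was bounced to a login page."""
    parsed = urlparse(current_url)
    host = (parsed.hostname or "").lower()
    path = parsed.path.lower()
    # assumption: login pages sit on a "login." subdomain or a login-like path
    return host.startswith("login.") or any(hint in path for hint in LOGIN_PATH_HINTS)


if __name__ == "__main__":
    # illustrative URLs only
    print(looks_like_login_redirect("https://login.dangdang.com/signin.php"))
    print(looks_like_login_redirect("https://product.dangdang.com/12345.html"))
```

With Selenium, the value to pass in would be `driver.current_url` right after `driver.get(...)`; if the check returns `True`, the loop can stop and save what it has instead of scraping login pages.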

🔺Another issue: the "titles" in the search results were stuffed with SEO keywords, and there is no clean or official book title field.  
This produced messy, inconsistent metadata that is unsuitable for inventory management. (Only the highlighted text in the screenshot below is the actual book title.)

→ <img src="../images/0403.png" alt="dangdang title examples" width="300px">  
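
🔺A rough cleanup pass can recover some of these titles, though it is not a substitute for a proper title field. The sketch below assumes the real title is the first chunk of text once bracketed promotional tags are removed; that assumption fails for titles that themselves contain spaces (most English titles, for example). The function name `clean_title` and the sample string are made up for illustration.

```python
import re

# matches a bracketed segment such as 【...】, [...], （...） or (...)
_BRACKETED = re.compile(r"[【\[（(][^】\]）)]*[】\]）)]")


def clean_title(raw: str) -> str:
    """Strip bracketed promo tags, then keep the first whitespace-separated chunk."""
    text = _BRACKETED.sub(" ", raw)
    parts = text.split()  # also splits on full-width spaces
    return parts[0] if parts else ""


if __name__ == "__main__":
    # made-up listing title in the style of the screenshot above
    print(clean_title("【正版包邮】活着 余华代表作 新华书店畅销书籍"))
```

Even with this pass, the results would need manual review before going into an inventory system, since the heuristic cannot tell a subtitle from a keyword.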


---

🔺一开始进展顺利。我能够从**搜索结果页**中提取出基本的元数据，比如书名、出版社、作者和出版日期。  
但当我尝试访问**商品详情页**来获取更多信息时，Dangdang 启用了登录限制和访问频率限制，几次请求后就被阻止继续爬取。

→ <img src="../images/0402.png" alt="Dangdang login page" width="300px">  

🔺还有一个问题是：搜索结果中的“书名”充满了 SEO 垃圾词汇——没有一个干净、标准的字段来表示真实的书名。  
这导致爬取下来的元数据非常混乱，不适合直接用于库存管理。

→ <img src="../images/0403.png" alt="dangdang title examples" width="300px">  

