## → Chapter Three  

*Battle with the Forbidden Databases, Selenium’s Brave Charge Ends in Defeat*
 
🔺Since scraping bookschina.com had worked, I assumed the rest of my plan would succeed too. I scanned all the books over several days, about 10 hours in total. Then I returned to bookschina.com to scrape the remaining data.  

🔺But this time, the request returned nothing. The website had changed. On the homepage, a banner stated:  

> “近日，我司接到多起举报，有不法分子使用我公司名义以对外提供刷单业务等形式实施网络诈骗，我司严正声明，从未以任何方式组织开展刷单业务，亦未授权任何单位、个人开展上述业务，请谨防诈骗 >>”  
> **Translation**: Recently, our company received multiple reports of fraudsters falsely claiming to offer fake transaction services in our name. We solemnly declare that we have never engaged in such activities nor authorized any third party to do so. Please beware of scams.  

→ <img src="../images/0301.png" alt="Chinabooks banner warning against fraud" width="400px">


🔺To test whether this was a bot-detection issue, I turned to Selenium, a browser automation tool with a Python library that lets code open a real web browser and simulate user interactions (clicking, typing, and so on), with deliberate delays between actions. The site still returned no results. Below is a brief tutorial, using my test file as an example, on how to use Selenium to simulate a single ISBN search.

---
## → 第三回 

*大战禁书数据库，自动侠赛琳娜勇闯奇门阵*  

🔺由于此前成功爬取了 bookschina.com，我以为整个流程已经可以顺利跑通。于是我花了几天时间扫描了所有书籍，总共约10小时。接着我回到 bookschina.com 继续爬取剩下的数据。  

🔺但这一次，请求返回了空内容。网站已经发生了变化。在首页顶部出现了一条公告：  

> “近日，我司接到多起举报，有不法分子使用我公司名义以对外提供刷单业务等形式实施网络诈骗，我司严正声明，从未以任何方式组织开展刷单业务，亦未授权任何单位、个人开展上述业务，请谨防诈骗 >>”  

→ <img src="../images/0301.png" alt="chinabooks反诈骗警示横幅截图" width="400px">


🔺我随后使用了 Selenium，它是一个浏览器自动化工具，提供 Python 库，可以让代码打开网页浏览器并模拟真实用户的操作（如点击、输入等），还能设置等待时间。我用它模拟人工操作，以测试问题是否出在反爬机制上。但即使如此，网站依然没有返回结果。下面是一个简要教程，以我的测试文件为例，展示如何通过 Selenium 模拟单个 ISBN 搜索操作。

---

### **→ Using Selenium to Simulate a Real Browser for Scraping**  

**→ 1. What is Selenium?**  
Selenium is a browser automation tool with a Python library that lets you drive a web browser the way a human would. It opens a browser, clicks buttons, types text, and waits for content to load.  
It is especially useful for JavaScript-heavy websites or sites that block traditional scraping with `requests`, the approach I had just used together with Beautiful Soup.


**→ 2. Install Selenium**  
Install Selenium by running the following command:
```bash
pip install selenium
```

**→ 3. Download ChromeDriver**

1. Go to: [https://googlechromelabs.github.io/chrome-for-testing/](https://googlechromelabs.github.io/chrome-for-testing/)
2. Check your current Chrome version by typing `chrome://version` in the address bar. The ChromeDriver version must match the Chrome browser version.
3. Download the matching ChromeDriver for your operating system.
4. Unzip the file, place it somewhere accessible such as the `Downloads/` folder, and copy its file path.
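
The version check in step 2 can be sketched in a few lines. This is a minimal illustration, assuming that agreement on the major version (the number before the first dot) is the criterion that matters. The function names `major_version` and `versions_match` are hypothetical helpers, and the version strings are example values only. Selenium 4.6 and later also ship Selenium Manager, which can fetch a matching driver automatically, so the manual download may not be needed on recent installs.

```python
def major_version(version: str) -> int:
    """Return the leading (major) number of a dotted version string."""
    return int(version.strip().split(".")[0])


def versions_match(chrome_version: str, driver_version: str) -> bool:
    """Assumption: Chrome and ChromeDriver are compatible when their major versions agree."""
    return major_version(chrome_version) == major_version(driver_version)


# Example values only; substitute the numbers shown by chrome://version
# and by the ChromeDriver download page.
print(versions_match("126.0.6478.127", "126.0.6478.126"))
print(versions_match("126.0.6478.127", "125.0.6422.141"))
```

If the two major versions differ, Selenium will typically refuse to start the session, so a mismatch is worth ruling out before debugging anything else.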


**→ 4. Run the test script**

After setting up Selenium, I used the following code to:

- Load an ISBN from a CSV file
- Open `bookschina.com`
- Add a cookie to force desktop mode
- Search for the ISBN in the search bar
- Save the resulting HTML for debugging

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time

# Setup
# Search URL pattern; unused in this test, which types into the search bar instead
base_url = "https://www.bookschina.com/book_find2/?stp="
# Path to the ChromeDriver binary downloaded in step 3
driver_path = r"C:/Users/aliso/Downloads/chromedriver-win64/chromedriver.exe"
service = ChromeService(executable_path=driver_path)
options = Options()
options.add_argument("user-agent=Mozilla/5.0 ...")

# Read the scanned codes as strings so pandas does not convert ISBNs to numbers
df = pd.read_csv('../data/scannedResults.csv', dtype={'CODECONTENT': str})
isbn = str(df['CODECONTENT'].iloc[7])  # a single test ISBN from the scan results

driver = webdriver.Chrome(service=service, options=options)
try:
    # Add cookie to force desktop version. A cookie can only be set for the
    # domain currently loaded, so load the page, set the cookie, then reload.
    driver.get("https://www.bookschina.com/")
    driver.add_cookie({"name": "isPc", "value": "yes", "path": "/", "domain": "www.bookschina.com"})
    driver.get("https://www.bookschina.com/")

    # Simulate search, with short pauses to mimic a human user
    time.sleep(2)
    search_input = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "keyword"))
    )
    search_input.click()
    time.sleep(0.5)
    search_input.clear()
    search_input.send_keys(isbn)
    search_input.send_keys(Keys.RETURN)

    # Save HTML after giving the results page time to load
    time.sleep(10)
    with open("debug_page.html", "w", encoding="utf-8") as f:
        f.write(driver.page_source)
    print("✅ Saved page content to debug_page.html")
finally:
    # Close the browser even if a step above fails
    driver.quit()
```

🔺However, this did not work either. Selenium successfully opened the page in the browser but was unable to extract the HTML content.
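
One way to see what Selenium actually captured is to measure how much human-readable text the saved HTML contains. The sketch below uses only Python's standard library. The class `VisibleText` and the function `visible_text` are hypothetical names introduced for illustration, and the sample markup is made up, not taken from the real site.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that sits outside <script> and <style> tags."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped tag
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the visible text of an HTML document, joined by spaces."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up sample: a page whose body is an empty container filled in by JavaScript
sample = "<html><head><script>var x = 1;</script></head><body><div id='app'></div></body></html>"
print(len(visible_text(sample)))

# To inspect the file saved by the test script, something like:
# with open("debug_page.html", encoding="utf-8") as f:
#     print(visible_text(f.read())[:500])
```

A result close to zero would suggest the page came back as an empty shell; a longer result makes it easier to spot an error message or challenge page in the saved output.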

→ <img src="../images/0302.png" alt="robots.txt page" width="400px">

🔺My friend reminded me to check the site's robots.txt file. robots.txt is a file that websites use to tell web crawlers and bots how they may interact with their pages. It can disallow certain paths, or disallow all automated crawling entirely. It is a request to crawlers rather than a technical barrier, but it states the site's policy. The one at bookschina.com now disallows scraping.

→ <img src="../images/0303.png" alt="robots.txt explanation" width="400px">
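
Checking robots.txt can also be done in code with Python's standard `urllib.robotparser` module. The sketch below is a minimal illustration: the helper `is_allowed` is a hypothetical name, and the rules and `example.com` URLs are made-up samples, not the actual file of any site mentioned here.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(rules_text: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt rules given as text and report whether url may be fetched."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules that disallow everything for every crawler
block_all = "User-agent: *\nDisallow: /\n"
# Made-up rules that disallow only one directory
block_private = "User-agent: *\nDisallow: /private/\n"

print(is_allowed(block_all, "*", "https://example.com/book/123"))
print(is_allowed(block_private, "*", "https://example.com/public"))

# To check a live site instead, RobotFileParser can download the file itself:
# parser = RobotFileParser()
# parser.set_url("https://example.com/robots.txt")
# parser.read()
# parser.can_fetch("*", "https://example.com/some/page")
```

Running a check like this before writing a scraper shows up front whether the site asks crawlers to stay away.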

🔺The same goes for Douban (豆瓣), a Chinese social media and database platform for books, films, music, and cultural reviews. It is often considered the Chinese equivalent of Goodreads, and it also includes user-generated reviews, tags, and ratings. Douban uses strict login requirements and rate limiting to prevent scraping, and even bulk manual extraction of information. It also requires a Chinese phone number for full access and authentication. Because Chinese phone numbers are tied by law to each citizen's real name and state-issued ID number, this ultimately deters people from trying to bypass its anti-bot mechanisms illegally.

---




### **→ 使用 Selenium 模拟真实浏览器进行爬取**  

**→ 1. 什么是 Selenium？**  
Selenium 是一个 Python 库，可以模拟真实用户在网页浏览器上的操作，比如打开网页、点击按钮、输入文字、等待内容加载等。  
当网站使用大量 JavaScript，或会阻止基于 `requests` 的传统爬虫（例如我之前搭配 Beautiful Soup 使用的方式）时，Selenium 就非常有用。


**→ 2. 安装 Selenium**  
通过以下命令安装 Selenium：  
```bash
pip install selenium
```

**→ 3. 下载 ChromeDriver**

打开网站：https://googlechromelabs.github.io/chrome-for-testing/

在浏览器地址栏输入 chrome://version，查看当前 Chrome 浏览器版本。（确保下载的 ChromeDriver 与 Chrome 浏览器版本匹配。）

下载与你的操作系统相匹配的 ChromeDriver。

解压下载的文件，将其放到一个方便访问的位置（例如 Downloads/ 文件夹），并复制它的完整文件路径备用。

**→ 4. 运行测试脚本**

在完成 Selenium 设置后，我使用以下代码实现了以下操作：

- 从 CSV 文件中读取一个 ISBN
- 打开 bookschina.com 网站
- 添加 cookie 强制显示桌面版网页
- 在搜索栏中输入 ISBN 并执行搜索
- 保存返回的网页 HTML 以供调试

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time

# Setup
# Search URL pattern; unused in this test, which types into the search bar instead
base_url = "https://www.bookschina.com/book_find2/?stp="
# Path to the ChromeDriver binary downloaded in step 3
driver_path = r"C:/Users/aliso/Downloads/chromedriver-win64/chromedriver.exe"
service = ChromeService(executable_path=driver_path)
options = Options()
options.add_argument("user-agent=Mozilla/5.0 ...")

# Read the scanned codes as strings so pandas does not convert ISBNs to numbers
df = pd.read_csv('../data/scannedResults.csv', dtype={'CODECONTENT': str})
isbn = str(df['CODECONTENT'].iloc[7])  # a single test ISBN from the scan results

driver = webdriver.Chrome(service=service, options=options)
try:
    # Add cookie to force desktop version. A cookie can only be set for the
    # domain currently loaded, so load the page, set the cookie, then reload.
    driver.get("https://www.bookschina.com/")
    driver.add_cookie({"name": "isPc", "value": "yes", "path": "/", "domain": "www.bookschina.com"})
    driver.get("https://www.bookschina.com/")

    # Simulate search, with short pauses to mimic a human user
    time.sleep(2)
    search_input = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "keyword"))
    )
    search_input.click()
    time.sleep(0.5)
    search_input.clear()
    search_input.send_keys(isbn)
    search_input.send_keys(Keys.RETURN)

    # Save HTML after giving the results page time to load
    time.sleep(10)
    with open("debug_page.html", "w", encoding="utf-8") as f:
        f.write(driver.page_source)
    print("✅ Saved page content to debug_page.html")
finally:
    # Close the browser even if a step above fails
    driver.quit()
```

🔺然而，这依然没有成功。Selenium 虽然成功打开了浏览器页面，但无法提取 HTML 内容。  

→ <img src="../images/0302.png" alt="robots.txt 页面截图" width="400px">  

🔺朋友提醒我检查网站的 robots.txt 文件。`robots.txt` 是网站用于告知网络爬虫或机器人访问规则的文件。网站可以通过它禁止爬取特定路径，甚至完全拒绝所有爬虫访问。它是对爬虫的声明而非技术屏障，但表明了网站的立场。bookschina.com 的 robots.txt 文件现在已经明确禁止了爬虫行为。

→ <img src="../images/0303.png" alt="robots.txt 说明截图" width="400px">  

🔺豆瓣（Douban）也是如此。豆瓣是一个中文社交媒体和数据库平台，包含图书、电影、音乐和文化评论内容。它常被视为中国版 Goodreads，同时提供用户生成的评论、标签和评分等功能。在豆瓣的情况下，它通过严格的登录机制和访问频率限制来防止爬虫，甚至连人工大批量查看信息也会被限制。此外，豆瓣要求用户必须绑定中国大陆手机号进行完整身份认证。由于中国手机号必须通过身份证实名制登记，这一机制从根本上阻止了大多数企图绕过反爬机制的非法行为。
