## → Chapter Two  
*Dreaming up Automation, Wielding the Magic Scanner* 

🔺Faced with the frustration of manual data entry, I hoped that ISBN scanning could automate and speed up the process. My idea was to take one day to scan all the books before they got moved around again, then use the ISBNs to query book information from a database, and bulk import everything into Shopify. That sounded like a month’s worth of work condensed into two days.  

🔺I was initially told that the store had already tried using ISBNs to retrieve product information automatically in Shopify, but the attempt failed because most of the books were in Chinese and not covered by U.S.-based databases. Still, I believed there must be a way to automate this with web scraping, so I started looking for Chinese book databases or websites that might host the metadata I needed.  

🔺I found a Chinese online book retailer called [bookschina.com](https://bookschina.com/) and ran a test scrape using around 70 ISBNs. Using a Python library called BeautifulSoup, I was able to successfully extract book information. Below is a short example of how I used Python and BeautifulSoup to scrape data from the retailer’s website.

---

### **→ Scraping an Online Retailer Page for Book Info with ISBN**  

**→ 1. Scan book ISBNs with an app on your phone and create a CSV**  
&nbsp;&nbsp;&nbsp;&nbsp;I used an app called [QRbot](https://qrbot.net/locale/en/) which allows you to scan ISBN codes using your phone camera and export a CSV file from your scan history.  
&nbsp;&nbsp;&nbsp;&nbsp;This results in a CSV with a list of ISBNs under a column named `CODECONTENT`, which I saved in the `data/` folder.  
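
&nbsp;&nbsp;&nbsp;&nbsp;A scan history can contain duplicate scans or barcodes that are not ISBNs, so it may help to clean the list before querying anything. The following is a minimal sketch, assuming the codes are 13-digit ISBNs in the `CODECONTENT` column; the function names and the sample data are hypothetical and only for illustration.

```python
import csv
import io


def is_valid_isbn13(code: str) -> bool:
    """Check the ISBN-13 check digit: digits weighted 1, 3, 1, 3, ... must sum to a multiple of 10."""
    if len(code) != 13 or not code.isdigit():
        return False
    total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(code))
    return total % 10 == 0


def clean_isbns(csv_text: str, column: str = "CODECONTENT") -> list[str]:
    """Return unique, valid ISBN-13 strings from the given column, in scan order."""
    seen = set()
    result = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        code = (row.get(column) or "").strip().replace("-", "")
        if is_valid_isbn13(code) and code not in seen:
            seen.add(code)
            result.append(code)
    return result


# Sample export for illustration: one valid ISBN scanned twice, plus one code
# with a wrong check digit
sample = "CODECONTENT\n9780306406157\n9780306406157\n9780306406158\n"
print(clean_isbns(sample))
```

&nbsp;&nbsp;&nbsp;&nbsp;Reading the codes as strings rather than numbers also avoids losing leading zeros or picking up a decimal point.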

**→ 2. Set up**  
&nbsp;&nbsp;&nbsp;&nbsp;To set up Python, I recommend following the Programming Historian guide:  
&nbsp;&nbsp;&nbsp;&nbsp;[Introduction and Installation – Programming Historian](https://programminghistorian.org/en/lessons/introduction-and-installation)  

&nbsp;&nbsp;&nbsp;&nbsp;**Python Library Requirements:**  
&nbsp;&nbsp;&nbsp;&nbsp;- `pandas`  
&nbsp;&nbsp;&nbsp;&nbsp;- `beautifulsoup4`  
&nbsp;&nbsp;&nbsp;&nbsp;- `requests`  
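
&nbsp;&nbsp;&nbsp;&nbsp;Before running the scraper, a quick check can confirm that the three libraries are installed. This is only a sketch; `missing_packages` is a hypothetical helper name. Note that the `beautifulsoup4` package is imported under the module name `bs4`.

```python
import importlib.util

# pip package name -> module name used in `import` statements
REQUIRED = {
    "pandas": "pandas",
    "beautifulsoup4": "bs4",
    "requests": "requests",
}


def missing_packages(required: dict[str, str]) -> list[str]:
    """Return the pip names of packages whose modules cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages(REQUIRED)
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required libraries are available.")
```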

**→ 3. Get a list of product detail page URLs**  
&nbsp;&nbsp;&nbsp;&nbsp;The scraping method requires identifying a URL pattern that leads to the product’s detail page, so we can extract relevant data from its HTML.  

&nbsp;&nbsp;&nbsp;&nbsp;For [https://bookschina.com/](https://bookschina.com/), the website uses a structured query format for ISBNs that looks like this:  



```
bookschina.com/book_find2/?stp={{ISBN}}&sCate=2
```

&nbsp;&nbsp;&nbsp;&nbsp;I can use this URL pattern to open the search result page for each ISBN in the scanned books list (**scannedResults.csv**).  
&nbsp;&nbsp;&nbsp;&nbsp;Because this is only the search result page, I also needed to extract the link to the product detail page to get more information on each book.
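
&nbsp;&nbsp;&nbsp;&nbsp;As a small illustration of the pattern, the sketch below fills an ISBN into the `stp` parameter. The `search_url` helper and the sample ISBNs are hypothetical; the base URL and the `sCate=2` parameter are taken from the pattern above.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://bookschina.com/book_find2/"


def search_url(isbn: str) -> str:
    """Fill the ISBN into the stp parameter of the search URL pattern."""
    return f"{SEARCH_BASE}?{urlencode({'stp': isbn, 'sCate': 2})}"


# Sample ISBNs for illustration only
for isbn in ["9780306406157", "9787020002207"]:
    print(search_url(isbn))
```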

  → <img src="../images/0202.png" alt="bookschina.com search result page" width="400px">

&nbsp;&nbsp;&nbsp;&nbsp;To get the link, I needed to find its element in the page's HTML.

  → <img src="../images/0204.png" alt="bookschina.com search result link structure" width="400px">

&nbsp;&nbsp;&nbsp;&nbsp;In this case, inserting the `href` into the detail page URL format gives `https://www.bookschina.com{href}.htm` (the format the code below uses), which led me to the product detail page.
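
&nbsp;&nbsp;&nbsp;&nbsp;The exact shape of the `href` value is not shown here, so the following is only a defensive sketch: it builds the detail page URL whether or not the `href` starts with a slash or already ends in `.htm`. The `detail_url` helper and the sample values are hypothetical.

```python
DETAIL_BASE = "https://www.bookschina.com"


def detail_url(href: str) -> str:
    """Build a detail page URL, tolerating a leading slash and an existing .htm suffix."""
    path = "/" + href.strip().lstrip("/")
    if not path.endswith(".htm"):
        path += ".htm"
    return DETAIL_BASE + path


# Made-up href values in the three shapes the helper accepts
for sample_href in ["/1234567", "1234567", "/1234567.htm"]:
    print(detail_url(sample_href))
```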


  → <img src="../images/0203.png" alt="bookschina.com product detail page" width="400px">

&nbsp;&nbsp;&nbsp;&nbsp;Using a `for` loop, I collected the detail page links for all the ISBNs that returned a result on the website with the following code:


```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Load the scanned ISBNs from scannedResults.csv in the data folder.
# Make sure the file name matches your own export.
df = pd.read_csv('../data/scannedResults.csv')

# Browser-like User-Agent header sent with every request
headers = {
    'User-Agent': 'Mozilla/5.0'
}

# List to store the extracted hrefs (None when no result is found)
extracted_links = []

# Iterate over all ISBNs in the CODECONTENT column of the CSV.
# Make sure the column containing ISBNs in your CSV has this name.
for isbn in df['CODECONTENT']:
    url = f'https://bookschina.com/book_find2/?stp={isbn}&sCate=2'
    link = None

    try:
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            # The first result's <h2 class="name"> holds the link to the detail page
            h2_tag = soup.find('h2', class_='name')
            if h2_tag and h2_tag.a:
                link = h2_tag.a.get('href')
    except requests.RequestException:
        # Network error or timeout: leave this ISBN without a link
        pass

    # Append exactly one entry per ISBN so the list lines up with the DataFrame rows
    extracted_links.append(link)

# Add the extracted links to the DataFrame as a new column
df['bookschina_href'] = extracted_links
```


**→ 4. Get info from the product detail page**  
&nbsp;&nbsp;&nbsp;&nbsp;Analyzing the product detail page, I found the HTML elements for each field I wanted to extract.

  → <img src="../images/0205.png" alt="bookschina.com product detail page field structure 1" width="400px">
  → <img src="../images/0206.png" alt="bookschina.com product detail page field structure 2" width="400px">

&nbsp;&nbsp;&nbsp;&nbsp;With the following code, I iterated over all the product page URLs and extracted the title, author, publisher, publishing date, page count, description, and author introduction from each page:

```python
# Continues from the previous cell: uses df, pd, requests, and BeautifulSoup.

# Headers for requests
headers = {
    'User-Agent': 'Mozilla/5.0'
}

# Fields extracted for every book; used unchanged when nothing could be scraped
EMPTY_RECORD = {
    'title': None,
    'author': None,
    'publisher': None,
    'publish_date': None,
    'pages': None,
    'description': None,
    'author_intro': None
}


def text_of(parent, name):
    """Return the stripped text of the first <name> tag inside parent, or None."""
    tag = parent.find(name) if parent is not None else None
    return tag.text.strip() if tag is not None else None


# List to store extracted book info
book_data = []

for href in df['bookschina_href']:
    # No search result for this ISBN: keep an empty row so rows stay aligned
    if pd.isna(href):
        book_data.append(dict(EMPTY_RECORD))
        continue

    url = f'https://www.bookschina.com{href}.htm'

    try:
        response = requests.get(url, headers=headers, timeout=10)
        soup = BeautifulSoup(response.content, 'html.parser')

        # Title (taken from the page's <title> tag)
        title = text_of(soup, "title")

        # Author name
        author = text_of(soup.find("div", class_="author"), "a")

        # Publisher and publishing date
        publisher_tag = soup.find("div", class_="publisher")
        publisher = text_of(publisher_tag, "a")
        publish_date = text_of(publisher_tag, "i")

        # Page count: if a "页数" (pages) label is present, read the first <i> in the block
        other_info = soup.find("div", class_="otherInfor")
        pages = None
        if other_info and any("页数" in span.text for span in other_info.find_all("span")):
            pages = text_of(other_info, "i")

        # Description
        description = text_of(soup.find("div", class_="brief"), "p")

        # Author introduction
        author_intro = text_of(soup.find("div", class_="excerpt"), "p")

        book_data.append({
            'title': title,
            'author': author,
            'publisher': publisher,
            'publish_date': publish_date,
            'pages': pages,
            'description': description,
            'author_intro': author_intro
        })

    except Exception:
        # Any request or parsing failure: keep an empty row for this book
        book_data.append(dict(EMPTY_RECORD))

# Convert to a DataFrame
book_info_df = pd.DataFrame(book_data)

# Save to CSV
book_info_df.to_csv('../data/bookschinaResults.csv', index=False)
```
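
&nbsp;&nbsp;&nbsp;&nbsp;Because books without a match are saved as empty rows, a quick count after the run can show how many ISBNs still need manual entry. The sketch below works on CSV text so it can be tried without the real file; `count_missing` and the sample rows are hypothetical, and it assumes empty fields are written as empty strings, which is how pandas writes missing values by default.

```python
import csv
import io


def count_missing(csv_text: str, field: str = "title") -> tuple[int, int]:
    """Return (total rows, rows where the given field is empty)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    missing = sum(1 for row in rows if not (row.get(field) or "").strip())
    return len(rows), missing


# Made-up results: one scraped book and one empty row
sample = "title,author\nBook A,Author A\n,\n"
total, missing = count_missing(sample)
print(f"{missing} of {total} books have no title")
```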

