# Scraping metadata and book reviews

The goal of this section is to scrape detailed information about every book in the list "Can't Wait Sci-Fi/Fantasy of 2023" on the Goodreads website (URL: https://www.goodreads.com/list/show/171192). We collect two types of data. The first is metadata: the book title, author, average rating, total number of ratings, total number of reviews, genres, number of pages, and URL. The second is the book reviews: username, time of the review, content of the review, user label, rating, and support for the review (i.e., the number of likes).

## 1. Installing the necessary libraries and browser driver

Because Goodreads no longer offers an API, there is no official way to access the full text of book reviews, so we developed the following code. We use the web scraping libraries Beautiful Soup and Selenium to collect the data.

Note: make sure you have installed the necessary libraries, such as Beautiful Soup and Selenium, before running the code. You can use pip, the Python package installer, to install these libraries.

To install them, open your command prompt or terminal and enter the following commands (lxml is included because the code below imports it):
pip install selenium
pip install beautifulsoup4
pip install lxml

We use ChromeDriver as an example; other browser drivers, such as the one for Safari, can be used as well. Please follow the setup instructions at https://sites.google.com/chromium.org/driver/getting-started

In [None]:
# Import the required libraries
import csv
import os
import re
import time
from lxml import etree
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.alert import Alert
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup

## 2. Analyzing the webpage structure

The first step is to examine the structure of the target webpage. The books in the list "Can't Wait Sci-Fi/Fantasy of 2023" are presented in a table. The table shows only partial information, such as each book's title and rating. Each title is a hyperlink that leads to the book's detail page, which contains the information we need, i.e. the metadata and book reviews.

In other words, we have to follow each hyperlink to get the specific information. We therefore proceed in three steps:
(1) Obtain the complete list of links for the books in the list "Can't Wait Sci-Fi/Fantasy of 2023";
(2) Iterate through the list of links and scrape the specific information from each book's page;
(3) Store the data locally.

## 3. Obtaining the URL list

With the help of the browser's developer tools, we found that each URL is contained in the "href" attribute of the book's title element. Selenium provides several ways to locate elements, such as XPath, ID, and name. In this case, we use XPath to locate the title element of every book.
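To illustrate how an XPath expression picks out the "href" attribute, here is a minimal sketch. It is not the notebook's actual method: it uses Python's standard-library `xml.etree.ElementTree`, which supports only a subset of XPath, as a stand-in for lxml, so the attribute is read with `.get("href")` rather than with a trailing `/@href`. The markup and the two book paths are invented for illustration, and the function name `extract_hrefs` is hypothetical.

```python
# Sketch only: a simplified, made-up table that mimics the layout of the list page.
import xml.etree.ElementTree as ET

sample_html = """
<html><body>
<div id="all_votes">
  <table><tbody>
    <tr><td>1</td><td>cover</td><td><a href="/book/show/111-example-one">Example One</a></td></tr>
    <tr><td>2</td><td>cover</td><td><a href="/book/show/222-example-two">Example Two</a></td></tr>
  </tbody></table>
</div>
</body></html>
"""


def extract_hrefs(markup):
    """Return the href of the link in the third cell of every table row."""
    root = ET.fromstring(markup.strip())
    # Same path as in the notebook, minus the final '/@href' step,
    # which ElementTree's limited XPath dialect does not support.
    links = root.findall(".//*[@id='all_votes']/table/tbody/tr/td[3]/a")
    return [link.get("href") for link in links]


hrefs = extract_hrefs(sample_html)
print(hrefs)
```

The real page is not guaranteed to be well-formed XML, which is one reason a lenient HTML parser such as lxml is a better fit for the actual scrape.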

The list "Can't Wait Sci-Fi/Fantasy of 2023" is divided into 5 pages, and its URL follows a regular pattern: only the value of the "page" parameter changes.

The URL for page 1 is:
https://www.goodreads.com/list/show/171192.Can_t_Wait_Sci_Fi_Fantasy_of_2023?page=1

The URL for page 2 is:
https://www.goodreads.com/list/show/171192.Can_t_Wait_Sci_Fi_Fantasy_of_2023?page=2

The URL for page 3 is:
https://www.goodreads.com/list/show/171192.Can_t_Wait_Sci_Fi_Fantasy_of_2023?page=3

The URL for page 4 is:
https://www.goodreads.com/list/show/171192.Can_t_Wait_Sci_Fi_Fantasy_of_2023?page=4

The URL for page 5 is:
https://www.goodreads.com/list/show/171192.Can_t_Wait_Sci_Fi_Fantasy_of_2023?page=5

So the same code can be reused for every page by changing only the page number in the URL.
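The pattern above can be generated rather than typed by hand. The following is a small sketch, assuming the list keeps exactly the five pages described here; the names `BASE_URL` and `build_page_urls` are introduced only for illustration.

```python
# Base address of the list, taken from the URLs shown above (without the page parameter).
BASE_URL = "https://www.goodreads.com/list/show/171192.Can_t_Wait_Sci_Fi_Fantasy_of_2023"


def build_page_urls(base_url, num_pages):
    """Return one URL per list page, numbered from 1 to num_pages."""
    return [f"{base_url}?page={page}" for page in range(1, num_pages + 1)]


page_urls = build_page_urls(BASE_URL, 5)
for page_url in page_urls:
    print(page_url)
```

Each entry of `page_urls` could then be passed to `driver.get(...)` in turn, so the link extraction runs once per page instead of being edited by hand for each one.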


In [None]:
# Define the path to the ChromeDriver executable.
chromedriver_path = '/Users/wanshuo/Desktop/Master/DH_MA_thesis/dataset/chromedriver-mac-arm64/chromedriver'

# Create an instance of ChromeOptions to specify additional options for the Chrome browser.
chrome_options = webdriver.ChromeOptions()

# The driver path is passed through a Service object (Selenium 4), not as a Chrome argument.
from selenium.webdriver.chrome.service import Service
chrome_service = Service(executable_path=chromedriver_path)

# Initialize the Chrome WebDriver with the specified service and options.
driver = webdriver.Chrome(service=chrome_service, options=chrome_options)

# Navigate to the specified URL (a list page for Sci-Fi/Fantasy books of 2023).
driver.get('https://www.goodreads.com/list/show/171192.Can_t_Wait_Sci_Fi_Fantasy_of_2023?page=1')

# Pause the execution for 10 seconds to allow the page to fully load.
time.sleep(10)

# Simulate pressing the PAGE_DOWN key 60 times to scroll down the webpage.
for i in range(60):
    # Simulate the Page Down key press using ActionChains
    ActionChains(driver).key_down(Keys.PAGE_DOWN).key_up(Keys.PAGE_DOWN).perform()
    # Pause the execution for 0.2 seconds between each scroll to mimic natural user behavior.
    time.sleep(0.2)

# Initialize an empty list to store all URLs extracted from the webpage.
all_url_list = []

# Because the URL is contained within the "href" attribute of the book's title,
# define the XPath expression to locate the URLs.
url_xpath = '//*[@id="all_votes"]/table/tbody/tr/td[3]/a/@href'

# Get the page source (HTML content) of the current webpage.
html = driver.page_source
# Parse the HTML content using lxml's etree to create an HTML tree.
tree = etree.HTML(html)
# Extract the URLs from the HTML tree using the specified XPath expression.
url_list = tree.xpath(url_xpath)
# Add the extracted URLs to the list of all URLs.
all_url_list.extend(url_list)

# Close the WebDriver and quit the browser.
driver.quit()

## 4. Iterating through the URL list to scrape specific information

### 4.1 Getting metadata
Selenium is a powerful tool for automating web browsing. It can simulate the actions a person takes on a webpage, such as clicking buttons and scrolling. Because the scraper drives a real browser, it behaves much like an ordinary visitor, which reduces (but does not eliminate) the risk of being detected by the website (Selenium, 2023). In this program, we use Selenium to simulate a person browsing each book page in order to scrape the metadata, i.e., the book title, author, average rating, total number of ratings, total number of reviews, genres, and number of pages.

In [None]:
# Path to the ChromeDriver executable
chromedriver_path = '/Users/wanshuo/Desktop/Master/DH_MA_thesis/dataset/chromedriver-mac-arm64/chromedriver'

# Create a ChromeOptions object to configure ChromeDriver
chrome_options = webdriver.ChromeOptions()

# Disable image loading to speed up page load times
prefs = {"profile.managed_default_content_settings.images": 2}
# Add the image display preference to the Chrome options
chrome_options.add_experimental_option("prefs", prefs)

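# The driver path is passed through a Service object (Selenium 4), not as a Chrome argument
from selenium.webdriver.chrome.service import Service
chrome_service = Service(executable_path=chromedriver_path)

# Create a new instance of Chrome WebDriver with the specified service and options
driver = webdriver.Chrome(service=chrome_service, options=chrome_options)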

# Define the XPaths for the elements we want to scrape
title_xpath = '//*[@id="__next"]/div[2]/main/div[1]/div[2]/div[2]/div[1]/div[1]/h1'
author_xpath = '//*[@id="__next"]/div[2]/main/div[1]/div[2]/div[2]/div[2]/div[1]/h3/div/span[1]/a/span[1]'
average_rating_xpath = '//*[@id="__next"]/div[2]/main/div[1]/div[2]/div[2]/div[2]/div[2]/a/div[1]/div'
rating_count_xpath = '//*[@id="__next"]/div[2]/main/div[1]/div[2]/div[2]/div[2]/div[2]/a/div[2]/div/span[1]'
text_reviews_rating_count_xpath = '//*[@id="__next"]/div[2]/main/div[1]/div[2]/div[2]/div[2]/div[2]/a/div[2]/div/span[2]'

# Create empty lists to store the book titles, authors, average ratings, rating counts, review counts, genres, and page counts.
all_title = []
all_author = []
all_average_rating = []
all_rating_count = []
all_text_reviews_rating_count = []
all_genres = []
all_num_pages = []

# List to store URLs that were successfully scraped
successful_urls = []

# Iterate over each URL in the list of URLs to scrape, keeping a running index for the progress message
for index, url_item in enumerate(all_url_list):
    url = 'https://www.goodreads.com' + url_item  # Construct the full URL
    driver.get(url)  # Open the URL in the browser
    time.sleep(5)  # Wait for the page to load completely

    # Maximize the browser window to ensure all elements are visible
    driver.maximize_window()

    # XPath for the close button of any pop-up that might appear
    close_button_xpath = '/html/body/div[3]/div/div[1]/div'
    try:
        # Try to find and click the close button if it exists
        close_button = driver.find_element(By.XPATH, close_button_xpath)
        close_button.click()
    except NoSuchElementException:
        pass  # If the close button is not found, proceed without error

    # Wait for any dynamic content to load
    time.sleep(5)

    # Get the page source (HTML) of the current page
    html = driver.page_source
    # Parse the HTML with lxml
    tree = etree.HTML(html)

    # Check if the page contains a "Page not found" message
    if 'Sorry, the page you requested could not be found.' in driver.page_source:
        print(f"Page not found for URL: {url}. Skipping to the next book.")
        continue

    # If the page is found, add the URL to the list of successful URLs
    successful_urls.append(url)

    # Extract and append the book title from the page
    title_list = tree.xpath(title_xpath + '/text()')
    all_title.append(title_list[0])

    # Extract and append the author from the page
    author_list = tree.xpath(author_xpath + '/text()')
    all_author.append(author_list[0])

    # Extract and append the average rating from the page
    average_rating_list = tree.xpath(average_rating_xpath + '/text()')
    all_average_rating.append(average_rating_list[0])

    # Extract and append the rating count from the page
    rating_count_list = tree.xpath(rating_count_xpath + '/text()')
    all_rating_count.append(rating_count_list[0])

    # Extract and append the text reviews count from the page
    text_reviews_rating_count_list = tree.xpath(text_reviews_rating_count_xpath + '/text()')
    all_text_reviews_rating_count.append(text_reviews_rating_count_list[0])

    # Find and extract the genres for the book.
    # find_elements returns an empty list (it does not raise) when nothing matches.
    genres_elements = driver.find_elements(By.XPATH, '//span[contains(@class, "BookPageMetadataSection__genreButton")]/a[@class="Button Button--tag-inline Button--small"]/span[@class="Button__labelItem"]')
    # Extract the text from each genre element and strip any surrounding whitespace
    genres_list = [element.text.strip() for element in genres_elements]
    if genres_list:
        # Join the genres into a single string separated by semicolons, as there may be multiple genres
        content = ';'.join(genres_list)
    else:
        # Print a message indicating no genres were found for the URL
        print(f"No genres found for URL: {url}.")
        # Use "N/A" to indicate that genres are not available
        content = "N/A"
    # Append the genres string (or "N/A") to the all_genres list
    all_genres.append(content)

    # Attempt to find and extract the number of pages for the book
    try:
        # Find all elements that match the number of pages format's XPath
        num_pages_elements = driver.find_elements(By.XPATH, '//p[@data-testid="pagesFormat"]')
        # Extract the number from each element's text and compile into a list
        num_pages_list = [re.search(r'\d+', element.text.strip()).group() for element in num_pages_elements]
        # Append the first found number of pages to the all_num_pages list
        all_num_pages.append(num_pages_list[0])
    # Handle the case where the number of pages elements are not found or text extraction fails
    except (NoSuchElementException, IndexError, AttributeError):
        # Print a message indicating no number of pages were found for the URL
        print(f"No number of pages found for URL: {url}.")
        # Append "N/A" to the all_num_pages list to indicate that the number of pages is not available
        all_num_pages.append("N/A")

    # Print the scraped details for the current book.
    # The last items of all_genres and all_num_pages are used so that an "N/A" fallback is printed safely.
    print(f"Book {index + 1}: {title_list[0]} - {author_list[0]}, Rating: {average_rating_list[0]}, Rating Count: {rating_count_list[0]}, Text Reviews Count: {text_reviews_rating_count_list[0]}, Genres: {all_genres[-1]}, Num Pages: {all_num_pages[-1]}")

# Close the browser window
driver.quit()


Book 1: Witch King - Martha Wells, Rating: 3.72, Rating Count: 15,068, Text Reviews Count: 3,264, Genres: Fantasy;Fiction;Adult;Magic;High Fantasy;Witches;Science Fiction, Num Pages: 424
Book 2: System Collapse - Martha Wells, Rating: 4.23, Rating Count: 33,540, Text Reviews Count: 4,188, Genres: Science Fiction;Fiction;Audiobook;Adult;Science Fiction Fantasy;Space;Space Opera, Num Pages: 245
Book 3: Hell Bent - Leigh Bardugo, Rating: 4.15, Rating Count: 114,636, Text Reviews Count: 15,911, Genres: Fantasy;Fiction;Horror;Mystery;Urban Fantasy;Paranormal;Adult, Num Pages: 484
Book 4: Tress of the Emerald Sea - Brandon Sanderson, Rating: 4.43, Rating Count: 113,398, Text Reviews Count: 17,504, Genres: Fantasy;Fiction;Romance;Adventure;Young Adult;High Fantasy;Audiobook, Num Pages: 483
Book 5: A Day of Fallen Night - Samantha    Shannon, Rating: 4.39, Rating Count: 28,528, Text Reviews Count: 5,008, Genres: Fantasy;Fiction;LGBT;Queer;Adult;Lesbian;Dragons, Num Pages: 868
Book 6: The Adven
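The values printed above are stored as strings exactly as they appear on the page, for example "28,528" with a thousands separator and "Samantha    Shannon" with repeated spaces. If numeric analysis is planned, a cleaning step along the following lines may help. This is only a sketch and is not part of the original pipeline; the helper names `parse_count` and `normalise_spaces` are hypothetical, and the sample values are copied from the output for Book 5.

```python
import re


def parse_count(text):
    """Turn a count such as '28,528' into the integer 28528."""
    return int(text.replace(",", "").strip())


def normalise_spaces(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


# Raw strings as scraped (taken from the printed output for Book 5).
raw = {
    "author": "Samantha    Shannon",
    "average_rating": "4.39",
    "rating_count": "28,528",
    "text_reviews_count": "5,008",
}

clean = {
    "author": normalise_spaces(raw["author"]),
    "average_rating": float(raw["average_rating"]),
    "rating_count": parse_count(raw["rating_count"]),
    "text_reviews_count": parse_count(raw["text_reviews_count"]),
}
print(clean)
```

Whether to normalise author names is a judgment call: keeping the raw string preserves exactly what the site displays, while the cleaned form is easier to match against other datasets.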

### 4.2 Storing metadata locally

In [None]:
# Define the path to the CSV file
csv_file_path = 'Page_1_data_list.csv'

# Open the CSV file for writing
# 'w' mode means write; if the file does not exist, it will be created
# encoding='utf-8' ensures that the file is encoded in UTF-8 to handle special characters
# newline='' prevents extra blank lines in the CSV file
with open(csv_file_path, 'w', encoding='utf-8', newline='') as csvfile:

    # Create a CSV writer object
    writer = csv.writer(csvfile)

    # Write the header row to the CSV file
    writer.writerow(['title', 'author', 'average_rating', 'num_pages', 'rating_count', 'text_reviews_rating_count', 'genres', 'url'])

    # Iterate over the combined data lists using the zip function
    # Each iteration returns a tuple containing one item from each list
    # This is used to write a row of data for each book
    for url_item, title_item, author_item, average_rating_item, num_pages_item, rating_count_item, text_reviews_rating_count_item, genres_item in zip(successful_urls, all_title, all_author, all_average_rating, all_num_pages, all_rating_count, all_text_reviews_rating_count, all_genres):

        # Write a row of data to the CSV file
        # The order of items in the list matches the header row
        writer.writerow([title_item, author_item, average_rating_item, num_pages_item, rating_count_item, text_reviews_rating_count_item, genres_item, url_item])
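One caveat with the `zip`-based loop above is that `zip` stops at the shortest list without warning, so a list that is missing an entry would silently drop rows or misalign columns. A small check before writing can catch this. The sketch below assumes the columns are gathered in a dictionary; the function name `check_equal_lengths` and the sample data (taken from the first two printed books) are for illustration only.

```python
def check_equal_lengths(columns):
    """Return the shared length of all columns, or raise if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    # An empty dictionary has no columns, so report zero rows.
    return next(iter(lengths.values()), 0)


# Illustrative data; in the notebook these would be all_title, all_author, and so on.
columns = {
    "title": ["Witch King", "System Collapse"],
    "author": ["Martha Wells", "Martha Wells"],
    "average_rating": ["3.72", "4.23"],
}

row_count = check_equal_lengths(columns)
print(f"All columns have {row_count} rows; safe to zip.")
```

On Python 3.10 or later, calling `zip(..., strict=True)` gives a similar safeguard by raising a `ValueError` when the inputs have different lengths.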


**Reference**

Selenium. (2023, June 27). https://www.selenium.dev/