# 📚 Goodreads Book Data Web Scraper

This project scrapes book metadata from the [Goodreads "Best Books Ever" list](https://www.goodreads.com/list/show/1.Best_Books_Ever) and compiles it into a clean, structured dataset. The output CSV is ready for downstream analysis, visualization, and data engineering workflows.

---

## 🌐 Web Scraping Workflow

- Navigated Goodreads list pages to **collect links** to individual book detail pages.
- Extracted core metadata from each book page.
- Structured the raw data into a **Pandas DataFrame** with consistent formatting and field alignment.
- Saved the full dataset to a CSV file for persistent storage and analysis.

---

## ⚙️ Engineering & Stability Features

- Implemented **retry logic** and **exception handling** to gracefully handle failed requests or timeouts.
- Used a **custom User-Agent** string to minimize detection and blocking by Goodreads.
- **Rate-limited and scoped** requests to be respectful of Goodreads’ servers.
- Configured **logging output** to track progress and debug issues without interrupting the workflow.
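
The retry behavior described above can be sketched in isolation. This is a minimal illustration, not the notebook's actual code: `with_retries` and its `sleep` parameter are hypothetical names introduced here so the pattern can be exercised without touching the network.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_retries(
    action: Callable[[], T],
    num_attempts: int = 3,
    min_delay: float = 0.1,
    max_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call `action` up to `num_attempts` times; return None if every call raises."""
    for attempt in range(num_attempts):
        try:
            return action()
        except Exception:
            # Pause for a short, random interval before the next attempt
            # (skipped after the final attempt, since nothing follows it).
            if attempt < num_attempts - 1:
                sleep(random.uniform(min_delay, max_delay))
    return None


# Usage: a simulated action that fails twice and then succeeds.
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated timeout")
    return "ok"


result = with_retries(flaky, sleep=lambda seconds: None)  # no real waiting in the demo
```

Injecting `sleep` keeps the helper fast to test; in a real run the default `time.sleep` applies the random pause between attempts.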

---

## 🛠️ Tools & Libraries

- **Python**: Core scripting and data pipeline logic
- **Requests**: Robust HTTP request handling
- **BeautifulSoup**: HTML parsing and content extraction
- **Pandas / NumPy**: Data wrangling and DataFrame construction
- **Jupyter Notebook**: Interactive development and documentation

---

## 📤 Output

- `../data/goodreads-books-raw.csv`: Dataset containing metadata for each scraped book.
- This file is **overwritten on each run** to ensure fresh results without accumulating stale data.
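
One thing to keep in mind when reading the file back: list-valued fields such as `genres` are written to CSV in their string form. The sketch below uses only the standard library and made-up sample rows to show one way to recover the lists, assuming they were serialized with Python's default list formatting.

```python
import ast
import csv
import io

# Write a tiny CSV in memory; the list is stored as its string representation.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["title", "genres"])
writer.writerow(["The Hunger Games", str(["Young Adult", "Dystopia", "Fiction"])])

# Read it back: every field, including `genres`, arrives as a plain string.
buffer.seek(0)
rows = list(csv.DictReader(buffer))

# literal_eval safely parses the list literal back into a Python list.
for row in rows:
    row["genres"] = ast.literal_eval(row["genres"])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of round trip.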

---

## ✅ Outcomes

- Generated a production-style, reusable scraper that extracts real-world data for analysis.
- Enabled future integration with tools like **Power BI**, **Tableau**, or **machine learning pipelines** by maintaining clean outputs.

---

## Importing Libraries & Notebook Setup

In [1]:
# Web scraping
import requests
from bs4 import BeautifulSoup
import time
import json

# Typing & structure
from typing import Optional
from dataclasses import dataclass

# Utilities & diagnostics
import logging
import numpy as np
import pandas as pd

# Jupyter Notebook display settings
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))

# Set up logging format and level
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s [%(levelname)s] %(message)s',
    handlers=[logging.StreamHandler()]
)

## Web Scraping Data from Goodreads

In [2]:
# Custom user agent to avoid blocking by Goodreads
user_agent = {'user-agent': 'Mozilla/5.0'}

In [3]:
# Base URL for Goodreads "Best Books Ever" list
base_URL = "https://www.goodreads.com/list/show/1.Best_Books_Ever?page="

### 🔗 Function: `collect_book_links`

Retrieves book URLs from a single Goodreads list page. This function is the first step in the scraping pipeline, feeding individual book page links into the metadata extraction process.

It constructs the correct URL based on a given page number, sends a request using a persistent `requests.Session`, and parses the HTML to find relative book links. These are then converted to full Goodreads URLs.

The function is designed to handle transient network failures gracefully using built-in retry logic.
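
As an illustration of the relative-to-absolute conversion step, the sketch below uses `urllib.parse.urljoin` from the standard library rather than string concatenation. The helper name `to_absolute` and the example paths are hypothetical; the function in this notebook prepends the root URL directly.

```python
from urllib.parse import urljoin

GOODREADS_ROOT = "https://www.goodreads.com"


def to_absolute(href: str) -> str:
    """Resolve an href against the Goodreads root URL."""
    return urljoin(GOODREADS_ROOT, href)


# Hypothetical hrefs shaped like the links found on a list page.
relative = to_absolute("/book/show/12345-example-title")
already_absolute = to_absolute("https://www.goodreads.com/book/show/67890-another-title")
```

A possible advantage of `urljoin` is that an href that is already absolute passes through unchanged, whereas plain concatenation would duplicate the domain.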

In [4]:
def collect_book_links(
    session: requests.Session,
    page_num: int,
    num_attempts: int = 3,
    log_warnings: bool = False
) -> list[str]:
    """
    Collects book page URLs from a specified Goodreads list page.

    Parameters:
        session (requests.Session): An active requests session used for sending HTTP requests.
        page_num (int): The page number of the Goodreads list to scrape.
        num_attempts (int, optional): Number of retry attempts in case of a failed request. Defaults to 3.
        log_warnings (bool, optional): If True, warning messages are logged for failed attempts. Defaults to False.

    Returns:
        list[str]: A list of full URLs to individual Goodreads book pages.
                   Returns an empty list if all attempts fail.
    """
    
    # Generate URL for the specified page number
    page_URL = base_URL + str(page_num)

    for attempt in range(num_attempts):
        try:
            # Send HTTP GET request to the page
            response = session.get(page_URL)

            # Check if the response status code indicates failure
            if response.status_code != 200:
                if log_warnings:
                    logging.warning(f"Failed to retrieve page {page_num}. Status code: {response.status_code}")

                # Sleep for a short, random time to avoid triggering any anti-scraping defenses
                time.sleep(np.random.uniform(0.1, 1.0))
                continue

            # Parse the HTML content
            soup = BeautifulSoup(response.text, 'html.parser')

            # Find all <a> tags with class 'bookTitle' - these contain links to individual book pages
            book_links = soup.find_all('a', class_='bookTitle')

            # Convert relative book URLs to absolute URLs
            complete_book_links = ["https://www.goodreads.com" + link['href'] for link in book_links]

            return complete_book_links

        except Exception as e:
            # `response` may be undefined if the request itself raised, so log the exception instead
            if log_warnings:
                logging.warning(f"Request for page {page_num} raised an exception: {e}")
            time.sleep(np.random.uniform(0.1, 1.0))
            continue

    # Return empty list if all attempts fail
    return []

### 🧩 Function: `safe_extract`

This utility function provides a unified way to extract and clean HTML content from a parsed page. It abstracts away a lot of the repetitive logic that would otherwise clutter the metadata extraction process.

It supports:

- Extracting plain text or specific attributes from elements

- Cleaning up inner HTML for complex fields like descriptions

- Parsing embedded JSON (used to extract metadata like language)

By centralizing error handling and content extraction, it keeps the main scraping functions concise and readable.
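
The embedded-JSON case can be shown on its own. The following is a minimal sketch with a hypothetical helper, `language_from_json_ld`, and a made-up payload; it assumes the script tag holds a JSON object with an `inLanguage` key, as the extraction code in this notebook expects.

```python
import json
from typing import Optional


def language_from_json_ld(raw: Optional[str]) -> Optional[str]:
    """Return the `inLanguage` value from a JSON-LD string, or None if unavailable."""
    if not raw:
        return None
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Only a JSON object can carry the key; lists or scalars yield None.
    return payload.get("inLanguage") if isinstance(payload, dict) else None


# Hypothetical payload; a real script tag would hold many more fields.
sample = '{"@type": "Book", "name": "Example Title", "inLanguage": "English"}'
language = language_from_json_ld(sample)
```

Returning `None` for missing or malformed input mirrors the "return the default on failure" behavior described above.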

In [5]:
def safe_extract(
    soup: BeautifulSoup,
    selector: str,
    attr: Optional[str] = None,
    default: Optional[str] = None,
    description: bool = False,
    language: bool = False,
    verbose: bool = False
) -> Optional[str]:
    """
    Safely extracts content from a BeautifulSoup object using a CSS selector.

    Parameters:
        soup (BeautifulSoup): Parsed HTML of the page.
        selector (str): CSS selector string to identify the desired element.
        attr (str, optional): The attribute to extract (e.g., "href", "src"). If None, extracts text. Defaults to None.
        default (str, optional): Value to return if extraction fails. Defaults to None.
        description (bool, optional): If True, cleans HTML inside descriptions. Defaults to False.
        language (bool, optional): If True, parses JSON from a <script> tag to extract language info. Defaults to False.
        verbose (bool, optional): If True, prints extraction errors to console. Defaults to False.

    Returns:
        Optional[str]: Extracted content as text or attribute value, or default if extraction fails.
    """
    
    try:
        element = soup.select_one(selector)

        if not element:
            return default

        elif description:
            # Clean HTML tags and remove irrelevant inline elements
            element_soup = BeautifulSoup(str(element), 'html.parser')
            tags = element_soup.find_all('i')
            for tag in tags:
                if tag.find("a") or "ISBN" in tag.text:
                    tag.decompose()
            return element_soup.get_text(separator=" ").strip()

        elif language:
            # Extract language information from JSON data embedded in a <script> tag
            json_data = json.loads(element.string)
            return json_data.get("inLanguage")

        else:
            # Return the requested attribute or the cleaned text content
            return element[attr] if attr else element.text.strip()

    except Exception as e:
        if verbose:
            print(f"Error in safe_extract for {selector}: {e}")
        return default

### 📦 Dataclass: `Book`

Defines a consistent structure for storing the scraped book metadata. Using a `@dataclass` ensures type clarity, enforces field names, and simplifies downstream transformation into a DataFrame.

The class includes fields for all expected metadata, including basic bibliographic info (title, author), content-related details (description, genres, language), statistics (ratings/reviews), and series information if available.
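
To illustrate why a dataclass converts easily into tabular form, here is a small sketch using `dataclasses.asdict`. `MiniBook` is a trimmed-down, hypothetical stand-in for the full class, and the driver code in this notebook uses `__dict__` instead of `asdict`.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class MiniBook:
    # Hypothetical subset of the fields on the full Book dataclass.
    title: Optional[str]
    author: Optional[str]
    genres: list[str]
    part_of_series: bool


books = [
    MiniBook("The Hunger Games", "Suzanne Collins", ["Young Adult", "Dystopia"], True),
    MiniBook("Pride and Prejudice", "Jane Austen", ["Classics", "Romance"], False),
]

# Each record becomes a plain dict keyed by field name, in declaration order.
records = [asdict(book) for book in books]
columns = list(records[0].keys())
```

A list of dicts like `records` can be passed straight to `pd.DataFrame`, giving one row per book and one column per field.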

In [6]:
@dataclass
class Book:
    title: Optional[str]
    author: Optional[str]
    description: Optional[str]
    genres: list[str]
    language: Optional[str]
    num_pages: Optional[str]
    publication_year: Optional[str]
    rating: Optional[str]
    num_ratings: Optional[str]
    num_reviews: Optional[str]
    part_of_series: bool
    series_name: Optional[str]
    cover_link: Optional[str]
    book_link: str
    series_link: Optional[str]

### 📚 Function: `collect_book_data`

The main metadata extraction function. Given a Goodreads book URL, it fetches the HTML and parses all relevant fields.

Handles:
- Core content (title, author, genres)
- User interaction stats (ratings, reviews)
- Optional fields (description, series info)
- JSON-parsed data (language metadata)

The result is a complete `Book` dataclass object containing structured metadata for downstream use. Includes retry logic to handle failed requests gracefully.
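
The count fields are cleaned by stripping thousands separators and discarding the trailing label. The sketch below isolates that step in a hypothetical helper, `parse_count`. It assumes the raw text looks like a number followed by a non-breaking space and a word, and it returns an integer, whereas the notebook keeps the value as a string.

```python
from typing import Optional


def parse_count(text: Optional[str]) -> Optional[int]:
    """Convert text such as '9,541,172 ratings' (non-breaking space) to an int."""
    if text is None:
        return None
    # Drop thousands separators, then keep the token before the non-breaking space.
    digits = text.replace(",", "").split("\xa0")[0].strip()
    return int(digits) if digits.isdigit() else None


# Assumed raw format for the ratings count element.
num_ratings = parse_count("9,541,172\xa0ratings")
```

Guarding with `isdigit()` means unexpected text yields `None` rather than raising an exception partway through a long scrape.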

In [7]:
def collect_book_data(
    session: requests.Session,
    book_link: str,
    num_attempts: int = 3,
    log_warnings: bool = False
) -> Optional[Book]:
    """
    Extracts metadata from an individual Goodreads book page.

    Parameters:
        session (requests.Session): An active requests session used to send HTTP requests.
        book_link (str): The full URL to a Goodreads book page.
        num_attempts (int, optional): Number of retry attempts in case of request or parsing failure. Defaults to 3.
        log_warnings (bool, optional): If True, logs warnings and errors during failures. Defaults to False.

    Returns:
        Optional[Book]: A Book dataclass instance containing extracted metadata;
                        optional fields that cannot be extracted are set to None.
                        Returns None if all attempts fail.
    """
    
    for attempt in range(num_attempts):
        try:
            # Send HTTP GET request to the book page
            response = session.get(book_link, headers=user_agent)

            # Check for an unsuccessful response
            if response.status_code != 200:
                if log_warnings:
                    logging.warning(f"Failed to retrieve book at link: {book_link}. Status code: {response.status_code}")

                time.sleep(np.random.uniform(0.1, 1.0))
                continue

            # Parse page content
            soup = BeautifulSoup(response.text, 'html.parser')

            # Extract core book metadata
            title = safe_extract(soup, "h1.Text.Text__title1")
            author = safe_extract(soup, "span.ContributorLink__name")
            description = safe_extract(soup, "span.Formatted", description = True)

            # Extract genres
            genres = [genre.text for genre in soup.select("span.BookPageMetadataSection__genreButton")]

            # Extract language info from JSON script tag
            language = safe_extract(soup, "script[type='application/ld+json']", language = True)

            # Extract number of pages and publication year
            num_pages = safe_extract(soup, "div.FeaturedDetails")
            num_pages = num_pages.split()[0] if num_pages is not None else None

            publication_year = safe_extract(soup, "div.FeaturedDetails")
            publication_year = publication_year.split()[-1] if publication_year is not None else None

            # Extract user statistics
            rating = safe_extract(soup, "div.RatingStatistics__rating")
            num_ratings = safe_extract(soup, "span[data-testid='ratingsCount']")
            num_ratings = num_ratings.replace(",", "").split("\xa0")[0] if num_ratings is not None else None
            num_reviews = safe_extract(soup, "span[data-testid='reviewsCount']")
            num_reviews = num_reviews.replace(",", "").split("\xa0")[0] if num_reviews is not None else None

            # Extract cover image link and series data
            cover_link = safe_extract(soup, "img.ResponsiveImage", attr="src")
            series_link = safe_extract(soup, "h3.Text.Text__title3.Text__italic.Text__regular.Text__subdued a", attr="href")

            part_of_series = series_link is not None
            series_name = None

            if part_of_series:
                series_name = safe_extract(soup, "h3.Text.Text__title3.Text__italic.Text__regular.Text__subdued a")
                series_name = series_name.split("#")[0].strip() if series_name is not None else None

            return Book(
                title = title,
                author = author,
                description = description,
                genres = genres,
                language = language,
                num_pages = num_pages,
                publication_year = publication_year,
                rating = rating,
                num_ratings = num_ratings,
                num_reviews = num_reviews,
                part_of_series = part_of_series,
                series_name = series_name,
                cover_link = cover_link,
                book_link = book_link,
                series_link = series_link
            )

        except Exception as e:
            if log_warnings:
                logging.error(f"An error occurred on book found at {book_link}: {e}")
            time.sleep(np.random.uniform(0.1, 1.0))
            continue

    # Return None if all attempts fail
    return None

### 🚀 Driver Code

Coordinates the end-to-end scraping process.

Steps:
1. Sets up an HTTP session with a custom User-Agent
2. Iterates through specified Goodreads list pages
3. Collects links to each book on the page
4. Extracts detailed metadata from each book
5. Aggregates results into a list of `Book` objects
6. Converts the list into a Pandas DataFrame
7. Saves the final dataset to a CSV file

Includes logging and delay mechanisms to monitor progress and minimize risk of IP blocks from Goodreads.
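
The control flow of these steps can be sketched without any network access. In the example below, `run_pipeline`, the stub lookups, and the short paths are all hypothetical; the link and book collectors are passed in as callables so the loop's stopping and skipping rules are easy to see.

```python
from typing import Callable, Optional


def run_pipeline(
    num_pages: int,
    get_links: Callable[[int], list[str]],
    get_book: Callable[[str], Optional[dict]],
) -> list[dict]:
    """Walk list pages in order, stopping at the first page with no links."""
    collected: list[dict] = []
    for page_num in range(1, num_pages + 1):
        links = get_links(page_num)
        if not links:
            break  # an empty page is treated as the end of the list
        for link in links:
            book = get_book(link)
            # Skip books that failed to load or lack a title.
            if book is None or book.get("title") is None:
                continue
            collected.append(book)
    return collected


# Stub data standing in for network calls (hypothetical paths).
fake_pages = {1: ["/a", "/b"], 2: ["/c"], 3: []}
fake_books = {"/a": {"title": "A"}, "/b": None, "/c": {"title": "C"}}

result = run_pipeline(
    num_pages=5,
    get_links=lambda page: fake_pages.get(page, []),
    get_book=lambda link: fake_books[link],
)
```

Here the run ends at page 3, the first page without links, and the failed book at `/b` is skipped rather than aborting the whole collection.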

In [8]:
data: list[Book] = []

num_pages = 100      # Number of Goodreads list pages to scrape (beyond page 100, the last page is repeated)
num_attempts = 3     # Retry attempts per request

start = time.time()
session = requests.Session()
session.headers.update(user_agent)

for page_num in range(num_pages): 
    book_links = collect_book_links(session, page_num + 1, num_attempts, log_warnings=True)
    
    if not book_links:
        logging.error(f"No book links found on page {page_num + 1}. Ending collection.")
        break
        
    for book_link in book_links:
        book_data = collect_book_data(session, book_link, num_attempts, log_warnings=True)
        if book_data is None or book_data.title is None:
            continue
        data.append(book_data)
        if len(data) % 100 == 0:
            logging.info(f"{len(data)} books collected")
        time.sleep(0.05)  # Small delay between requests to reduce server load

# Convert the list of Book dataclass instances to DataFrame 
books_df = pd.DataFrame([book.__dict__ for book in data])
logging.info(f"Scraping complete. {len(books_df)} books collected")

end = time.time()
logging.info(f"Total scraping time: {(end - start)/60:.2f} minutes")

2025-07-01 20:54:51,170 [INFO] 100 books collected
2025-07-01 20:56:57,225 [INFO] 200 books collected
2025-07-01 20:58:52,526 [INFO] 300 books collected
2025-07-01 21:00:51,350 [INFO] 400 books collected
2025-07-01 21:03:05,161 [INFO] 500 books collected
2025-07-01 21:05:17,725 [INFO] 600 books collected
2025-07-01 21:07:23,700 [INFO] 700 books collected
2025-07-01 21:09:25,113 [INFO] 800 books collected
2025-07-01 21:11:30,381 [INFO] 900 books collected
2025-07-01 21:13:45,846 [INFO] 1000 books collected
2025-07-01 21:15:54,724 [INFO] 1100 books collected
2025-07-01 21:18:06,918 [INFO] 1200 books collected
2025-07-01 21:20:18,537 [INFO] 1300 books collected
2025-07-01 21:22:30,586 [INFO] 1400 books collected
2025-07-01 21:25:31,827 [INFO] 1500 books collected
2025-07-01 21:27:46,661 [INFO] 1600 books collected
2025-07-01 21:29:50,285 [INFO] 1700 books collected
2025-07-01 21:32:08,301 [INFO] 1800 books collected
2025-07-01 21:34:24,503 [INFO] 1900 books collected
2025-07-01 21:36:36,7

In [9]:
books_df.head(5)

| | title | author | description | genres | language | num_pages | publication_year | rating | num_ratings | num_reviews | part_of_series | series_name | cover_link | book_link | series_link |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Hunger Games | Suzanne Collins | Winning means fame and fortune. Losing means c... | [Young Adult, Dystopia, Fiction, Fantasy, Scie... | English | 374 | 2008 | 4.35 | 9541172 | 249053 | True | The Hunger Games | https://images-na.ssl-images-amazon.com/images... | https://www.goodreads.com/book/show/2767052-th... | https://www.goodreads.com/series/73758-the-hun... |
| 1 | Pride and Prejudice | Jane Austen | Since its immediate success in 1813, Pride an... | [Classics, Romance, Fiction, Historical Fictio... | English | 279 | 1813 | 4.29 | 4605881 | 135792 | False | | https://images-na.ssl-images-amazon.com/images... | https://www.goodreads.com/book/show/1885.Pride... | |
| 2 | To Kill a Mockingbird | Harper Lee | "Shoot all the bluejays you want, if you can h... | [Classics, Fiction, Historical Fiction, School... | English | 323 | 1960 | 4.26 | 6658977 | 127709 | True | To Kill a Mockingbird | https://images-na.ssl-images-amazon.com/images... | https://www.goodreads.com/book/show/2657.To_Ki... | https://www.goodreads.com/series/255474-to-kil... |
| 3 | Harry Potter and the Order of the Phoenix | J.K. Rowling | It's official: the evil Lord Voldemort has ret... | [Fantasy, Young Adult, Fiction, Magic, Audiobo... | English | 896 | 2003 | 4.5 | 3674686 | 73546 | True | Harry Potter | https://images-na.ssl-images-amazon.com/images... | https://www.goodreads.com/book/show/58613451-h... | https://www.goodreads.com/series/45175-harry-p... |
| 4 | The Book Thief | Markus Zusak | It is 1939. Nazi Germany. The country is holdi... | [Historical Fiction, Fiction, Young Adult, Cla... | English | 592 | 2005 | 4.39 | 2790201 | 155737 | False | | https://images-na.ssl-images-amazon.com/images... | https://www.goodreads.com/book/show/19063.The_... | |


In [10]:
# Export dataset to CSV
books_df.to_csv('../data/goodreads-books-raw.csv', index = False)