# Web Scraping Goodreads: Exploring the World of Books

Welcome to this web scraping project, where we extract data from [Goodreads](https://www.goodreads.com/) and step into the world of books, data, and insights. If you're a book lover like me, you're in for a treat! And if you're not, I believe this project might just inspire you to explore the captivating realm of literature.

In this project, we'll use web scraping to extract a wealth of information from Goodreads, a treasure trove of book-related data. From book titles and authors to average ratings, rating counts, publication years, and edition counts, Goodreads offers a vast reservoir of knowledge waiting to be explored.

This introductory guide will walk you through the process of setting up your environment, sending HTTP requests, and navigating the structure of web pages to gather data. It's a journey that promises exciting possibilities for data analysis and uncovering hidden insights about the world of books.

So, let's dive in and start exploring the fascinating world of literature through the lens of data! Get ready to scrape, analyze, and discover the stories that await us.


In [1]:
# Import the libraries needed for web scraping and data manipulation
# (pandas, requests, BeautifulSoup), plus time for delays and re for parsing text
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import re

In [2]:
# Define a user-agent header to identify your scraper
user_agent = "MyWebScraper/1.0 (+https://github.com/Adesuaayo/goodreads_webscraper)"
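
To see how this header travels with each request, here is a minimal sketch that assembles the search URL and the headers for a single results page. The helper name `build_search_request` is hypothetical; the `q` and `qid` values are the ones used in the scraping loop later in this notebook, and `urllib.parse.urlencode` is used here in place of string concatenation so that the query is escaped consistently.

```python
from urllib.parse import urlencode

USER_AGENT = "MyWebScraper/1.0 (+https://github.com/Adesuaayo/goodreads_webscraper)"
BASE_URL = "https://www.goodreads.com/search"


def build_search_request(page, query="self help", qid="ozFSiGJBE0"):
    """Return the (url, headers) pair for one page of search results.

    Hypothetical helper for illustration; it is not part of the original notebook.
    """
    # urlencode escapes the space in the query as "+", giving q=self+help
    params = urlencode({"page": page, "q": query, "qid": qid})
    headers = {"User-Agent": USER_AGENT}
    return f"{BASE_URL}?{params}", headers


url, headers = build_search_request(1)
print(url)
```

The returned pair can be passed straight to `requests.get(url, headers=headers)`.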

In [3]:
book_titles = []
authors = []
avg_ratings = []
ratings = []
published_years = []
editions = []
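
Six parallel lists stay aligned only if every scraped row appends to all of them. As a rough sketch of one alternative (not the approach this notebook takes), each row can be gathered into a single dictionary first and stored only once every field is present. The function `add_row` and the sample values are hypothetical.

```python
REQUIRED_FIELDS = ("Title", "Author", "Average Rating", "Rating",
                   "Year Published", "Edition")

records = []  # one dict per book, instead of one list per column


def add_row(records, row):
    """Append `row` only if it has every required field; report success."""
    missing = [field for field in REQUIRED_FIELDS if field not in row]
    if missing:
        # A partial row is dropped as a whole, so no column runs ahead
        return False
    records.append({field: row[field] for field in REQUIRED_FIELDS})
    return True


# Made-up sample rows, for illustration only
complete = {"Title": "Example Book", "Author": "A. Writer",
            "Average Rating": "4.00", "Rating": "1,234",
            "Year Published": "2020", "Edition": "3"}
partial = {"Title": "Row With No Author"}

add_row(records, complete)
add_row(records, partial)
print(len(records))
```

A list of dictionaries such as `records` can be passed directly to `pd.DataFrame(records)`.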

In [4]:
# Number of search-result pages to request
pages_to_scrape = 600

# Delay between requests, in seconds
request_delay = 3

# Loop through the pages to scrape
for page in range(1, pages_to_scrape + 1):
    
    # Construct the URL for the current page
    url = "https://www.goodreads.com/search?page=" + str(page) + "&q=self+help&qid=ozFSiGJBE0"
   
    try:
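        # Send an HTTP GET request with the user-agent header; the timeout
        # keeps a stalled connection from hanging the whole run
        headers = {"User-Agent": user_agent}
        response = requests.get(url, headers=headers, timeout=30).text

        # Parse the HTML content using BeautifulSoup (the "html5lib" parser
        # requires the html5lib package to be installed)
        soup = BeautifulSoup(response, "html5lib")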
    
        # Check for server errors or maintenance
        if soup.title and "service unavailable" in soup.title.text.lower():
            print(f"Server error on page {page}. Skipping...")
            continue

        # Select the table containing the list of books
        table = soup.find_all("tbody")[0]

        # Loop through the rows of the table
        for row in table.find_all("tr"):
            cells = row.find_all("td")[1]

            # Extract book title
            title = cells.find("a").find("span").text
            book_titles.append(title)

            # Extract author's name
            author = cells.find("a", class_="authorName").text
            authors.append(author)
            

            # Extract the average rating from the "minirating" text
            all_ratings = cells.find_all('span', class_ = 'minirating')
            all_ratings_text = all_ratings[0].text.strip()
            pattern_2 = re.compile(r"(\d\.?\d*)\savg")
            avg_ratings.append(pattern_2.search(all_ratings_text).group(1))

            # Extract the number of ratings; \d[\d,]* keeps every digit group
            # of a count with thousands separators, such as "12,345"
            pattern_4 = re.compile(r"(\d[\d,]*) rating")
            ratings_matches = pattern_4.search(all_ratings_text)
            ratings.append(ratings_matches.group(1) if ratings_matches else 0)


#             # Extract average rating
#             avg_rating = cells.find("span", class_="greyText smallText uitext").text.split()[0]
#             avg_ratings.append(avg_rating)

#             # Extract rating
#             rating = cells.find("span", class_="greyText smallText uitext").text.split()[4]
#             ratings.append(rating)

            # Extract published year, handling cases where it may not be in the expected format
            year_info = cells.find("span", class_="greyText smallText uitext").text.split()
            year = None
            for item in year_info:
                if item.isdigit() and len(item) == 4:
                    year = item
                    break
            if year:
                published_years.append(year)
            else:
                published_years.append(0)  # Handle cases where year is not found

            # Extract edition information
            edition = cells.find("span", class_="greyText smallText uitext").text.split()[-2]
            editions.append(edition)

    except requests.exceptions.RequestException as e:
        # Handle HTTP request errors (e.g., connection issues)
        print(f"Error on page {page}: {e}")

    except IndexError as e:
        # Handle "list index out of range" errors from pages with an unexpected layout
        print(f"Index error on page {page}: {e}")

    except Exception as e:
        # Handle other unexpected errors
        print(f"Unexpected error on page {page}: {e}")

    finally:
        # Pause between requests, even when a page fails or is skipped
        time.sleep(request_delay)

Index error on page 75: list index out of range
Error on page 107: HTTPSConnectionPool(host='www.goodreads.com', port=443): Max retries exceeded with url: /search?page=107&q=self+help&qid=ozFSiGJBE0 (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1129)')))
Error on page 210: HTTPSConnectionPool(host='www.goodreads.com', port=443): Max retries exceeded with url: /search?page=210&q=self+help&qid=ozFSiGJBE0 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002BA3889C790>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))
Error on page 249: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Error on page 277: HTTPSConnectionPool(host='www.goodreads.com', port=443): Max retries exceeded with url: 
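
Several of the failures logged above are connection errors, and the loop simply moves on to the next page when one occurs. A minimal sketch of retrying such a page with exponential backoff follows. The helper `fetch_with_retries` and its parameters are hypothetical, and `fetch` stands in for any callable that returns the page HTML or raises an exception, for example a small wrapper around `requests.get`.

```python
import time


def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff on `retry_on` errors.

    Hypothetical helper: `fetch` is any callable that returns the page body
    or raises. The delay doubles after each failed attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller log and skip the page
            sleep(base_delay * 2 ** (attempt - 1))


# Demonstration with a fake fetcher that fails twice, then succeeds
calls = []
waits = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("simulated dropped connection")
    return "<html>ok</html>"


# Passing waits.append as `sleep` records the delays instead of waiting
html = fetch_with_retries(flaky_fetch, "https://example.com/page", sleep=waits.append)
print(len(calls), waits)
```

When `fetch` wraps `requests.get`, pass `retry_on=(requests.exceptions.RequestException,)`, since the exceptions raised by requests do not inherit from the built-in `ConnectionError`.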

In [5]:
# After scraping all pages, gather the collected lists into a dictionary of columns
data = {
    "Title": book_titles,
    "Author": authors,
    "Average Rating": avg_ratings,
    "Rating": ratings,
    "Year Published": published_years,
    "Edition": editions
}

In [6]:
goodreads = pd.DataFrame(data)

# Display the first five rows of the dataframe
goodreads.head()

| | Title | Author | Average Rating | Rating | Year Published | Edition |
|---|---|---|---|---|---|---|
| 0 | 10% Happier: How I Tamed the Voice in My Head,... | Dan Harris | 3.92 | 4207 | 2014 | 1 |
| 1 | Self-Help | Lorrie Moore | 4.14 | 3332 | 1985 | 39 |
| 2 | Parenting from the Inside Out: How a Deeper Se... | Daniel J. Siegel | 4.16 | 4623 | 2003 | 48 |
| 3 | What's Our Problem?: A Self-Help Book for Soci... | Tim Urban | 4.3 | 2720 | 2023 | 1 |
| 4 | How to Be Fine: What We Learned from Living by... | Jolenta Greenberg | 3.55 | 4272 | 2020 | 10 |
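
The scraping loop stores every value as text, so numeric analysis needs a conversion step. Here is a small sketch of parsing a minirating string into a float average and an integer count. The function name `parse_minirating` and the sample string are assumptions for illustration; the sample mimics the "avg" and "rating" wording that the regular expressions in the scraping loop look for, and it is not real Goodreads data.

```python
import re

AVG_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s+avg")
COUNT_PATTERN = re.compile(r"(\d[\d,]*)\s+ratings?")


def parse_minirating(text):
    """Return (average_rating, rating_count); either is None if not found."""
    avg_match = AVG_PATTERN.search(text)
    count_match = COUNT_PATTERN.search(text)
    average = float(avg_match.group(1)) if avg_match else None
    # Drop thousands separators before converting the count to an integer
    count = int(count_match.group(1).replace(",", "")) if count_match else None
    return average, count


# Made-up sample text in the assumed format
sample = "4.05 avg rating - 12,345 ratings"
print(parse_minirating(sample))
```

Returning `None` for a missing value, rather than `0`, keeps "not found" distinguishable from a genuine zero in later analysis.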


In [7]:
len(goodreads)

11760

In [8]:
goodreads.to_csv("Goodreads_books.csv", index=False)