# Web Scraping Goodreads: Exploring the World of Books

Welcome to this web scraping project, where we extract data from [Goodreads](https://www.goodreads.com/) and step into the world of books, data, and insights. If you're a book lover like me, you're in for a treat! And if you're not, I hope this project inspires you to explore the captivating realm of literature.

In this project, we'll use web scraping to collect book data from a Goodreads shelf page: titles, URLs, authors, average ratings, rating counts, and publication years.

This introductory guide will walk you through the process of setting up your environment, sending HTTP requests, and navigating the structure of web pages to gather data. It's a journey that promises exciting possibilities for data analysis and uncovering hidden insights about the world of books.

So, let's dive in and start exploring the fascinating world of literature through the lens of data! Get ready to scrape, analyze, and discover the stories that await us.


In [1]:
# Import the libraries needed for web scraping and data manipulation (pandas, requests, BeautifulSoup).

import pandas as pd
import requests
from bs4 import BeautifulSoup

In [2]:
# Fetch the HTML content of the "self-help" bookshelf on Goodreads.
base_url = "https://www.goodreads.com/shelf/show/self-help"
response = requests.get(base_url).text
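
The bare `requests.get` call above has no timeout and does not check the response status. A minimal sketch of a more defensive fetch could look like the following. The helper name `fetch_html`, its `get` parameter, the User-Agent string, and the retry settings are all illustrative assumptions, not part of the original notebook.

```python
import time


def fetch_html(url, get=None, retries=3, delay=1.0, timeout=10):
    """Return the body of `url` as text, retrying a few times on failure."""
    if get is None:
        # Imported lazily so the helper can be exercised with a stand-in `get`.
        import requests
        get = requests.get

    # Hypothetical User-Agent; some sites reject requests without one.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; book-scraper-example)"}
    last_error = None
    for attempt in range(retries):
        try:
            response = get(url, headers=headers, timeout=timeout)
            response.raise_for_status()  # raise on 4xx/5xx responses
            return response.text
        except OSError as error:  # requests' exceptions derive from OSError
            last_error = error
            if attempt < retries - 1:
                time.sleep(delay)  # pause before the next attempt
    raise RuntimeError(f"Giving up on {url} after {retries} attempts") from last_error
```

With this helper, the cell above could read `response = fetch_html(base_url)`; the pause between attempts also keeps the scraper from hammering the site when something goes wrong.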

In [3]:
# Parse the fetched HTML with Python's built-in "html.parser"
soup = BeautifulSoup(response, "html.parser")

In [4]:
# Find and extract the total number of items
total_items_info = soup.find("div", class_="mediumText").get_text().strip()
total_items = int(total_items_info.split()[-1].replace(',', ''))  
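
The cell above takes the last whitespace-separated token of the text and strips its commas. A slightly more forgiving sketch uses a regular expression to pick out the last number in the text. The function name `parse_total_items` and the sample string are assumptions made for illustration; the real text on the page may differ.

```python
import re


def parse_total_items(text):
    """Return the last integer found in `text`, ignoring thousands separators."""
    numbers = re.findall(r"\d[\d,]*", text)
    if not numbers:
        raise ValueError(f"No item count found in {text!r}")
    return int(numbers[-1].replace(",", ""))


# Hypothetical sample of the text inside the "mediumText" div.
sample = "Showing 1-50 of 100,000"
print(parse_total_items(sample))  # 100000
```

Raising a `ValueError` with the offending text makes it easier to notice when the page layout changes, instead of failing on an unrelated `int()` conversion.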

In [5]:
# Set the number of items displayed per page and calculate the number of pages
items_per_page = 50  # Adjust based on the actual number of items per page
total_pages = (total_items + items_per_page - 1) // items_per_page

# Limit the number of pages to scrape (here, at most the first 200 pages)
max_pages_to_scrape = 200
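
The page count above is a ceiling division: adding `items_per_page - 1` before the integer division rounds any partial page up to a full one. A small worked sketch of the same arithmetic, combined with the page cap, is shown here; the function name `pages_to_scrape` is a hypothetical helper introduced only for this example.

```python
def pages_to_scrape(total_items, items_per_page=50, max_pages=200):
    """Number of pages to request: ceiling of items/page, capped at max_pages."""
    total_pages = (total_items + items_per_page - 1) // items_per_page
    return min(total_pages, max_pages)


print(pages_to_scrape(100))      # 2   (exactly two full pages)
print(pages_to_scrape(101))      # 3   (one extra item needs a third page)
print(pages_to_scrape(100_000))  # 200 (2000 pages, capped at 200)
```

With a cap of 200 pages and 50 items per page, the scraper collects at most 10,000 books.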

In [6]:
# Initialize lists to store data
title = []
url_list = []
authors = []
avg_ratings = []
rating = []
year = []

In [7]:
# Iterate through the specified number of pages
for page in range(1, min(max_pages_to_scrape, total_pages) + 1):
    url = f"{base_url}?page={page}"
    response = requests.get(url).text
    soup = BeautifulSoup(response, "html.parser")
    book_elements = soup.find_all("div", "elementList")
    
    for book_element in book_elements:
        # Use a try-except block to handle elements that lack the expected tags
        try:
            title_link = book_element.find("a", "bookTitle")
            book_title = title_link.text
            book_url = "https://www.goodreads.com" + title_link.get("href")
            author = book_element.find("a", "authorName").text
            # Split the grey summary text into tokens: index 2 holds the
            # average rating and index 4 the number of ratings
            rating_text = book_element.find("span", "greyText smallText").text.split()
            avg_rating = rating_text[2]
            ratings = rating_text[4]
            # The publication year is taken only when the text has exactly 9 tokens
            published_year = rating_text[-1] if len(rating_text) == 9 else ""
    
            # Append data to lists
            title.append(book_title)
            url_list.append(book_url)
            authors.append(author)
            avg_ratings.append(avg_rating)
            rating.append(ratings)
            year.append(published_year)
        except (AttributeError, IndexError):
            # Handle the case where an element is not found or the rating text is shorter than expected
            print(f"Skipping a book on page {page} due to missing data.")

Skipping a book on page 1 due to missing data.
Skipping a book on page 1 due to missing data.
Skipping a book on page 2 due to missing data.
Skipping a book on page 2 due to missing data.
[... the same message repeats twice per page; output truncated, the last complete line shown is for page 180 ...]
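
The loop above relies on fixed token positions in the rating text (`rating_text[2]`, `rating_text[4]`, and the last token when there are 9). A sketch of a position-independent alternative uses regular expressions and returns numeric values directly. The function name `parse_rating_text` and the sample string are assumptions for illustration, inferred from the indices the loop uses; the actual text on the page may be laid out differently.

```python
import re


def parse_rating_text(text):
    """Extract average rating, rating count, and publication year from `text`.

    Any field that cannot be found is returned as None.
    """
    avg = re.search(r"avg rating\s+(\d+(?:\.\d+)?)", text)
    count = re.search(r"([\d,]+)\s+ratings?\b", text)
    year = re.search(r"published\s+(\d+)", text)
    return {
        "avg_rating": float(avg.group(1)) if avg else None,
        "ratings": int(count.group(1).replace(",", "")) if count else None,
        "published_year": int(year.group(1)) if year else None,
    }


# Hypothetical sample; \u2014 is the dash assumed to separate the fields.
sample = "avg rating 4.37 \u2014 698,883 ratings \u2014 published 2018"
print(parse_rating_text(sample))
# {'avg_rating': 4.37, 'ratings': 698883, 'published_year': 2018}
```

Returning `None` for a missing field, rather than an empty string, lets pandas treat the column as numeric with missing values when the DataFrame is built.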

In [14]:
# Create a DataFrame
good_reads = pd.DataFrame({
    "Title": title,
    "URL": url_list,
    "Authors": authors,
    "Avg Ratings": avg_ratings,
    "Rating": rating,
    "Published_year": year
})

In [15]:
good_reads.head()

| | Title | URL | Authors | Avg Ratings | Rating | Published_year |
|---|---|---|---|---|---|---|
| 0 | Atomic Habits: An Easy & Proven Way to Build G... | https://www.goodreads.com/book/show/40121378-a... | James Clear | 4.37 | 698883 | |
| 1 | The Subtle Art of Not Giving a F*ck: A Counter... | https://www.goodreads.com/book/show/28257707-t... | Mark Manson | 3.9 | 1002966 | 2017.0 |
| 2 | How to Win Friends and Influence People (Paper... | https://www.goodreads.com/book/show/4865.How_t... | Dale Carnegie | 4.22 | 916840 | 1936.0 |
| 3 | The 7 Habits of Highly Effective People: Power... | https://www.goodreads.com/book/show/36072.The_... | Stephen R. Covey | 4.15 | 708544 | 1988.0 |
| 4 | The Power of Habit: Why We Do What We Do in Li... | https://www.goodreads.com/book/show/12609433-t... | Charles Duhigg | 4.13 | 483249 | 2012.0 |


In [16]:
len(good_reads)

10000

In [17]:
good_reads.to_csv("goodreads.csv", index=False)