## Web scraping of Books to Scrape

This project scrapes the "Books to Scrape" website to extract book information: titles, genres, prices, availability, stock counts, ratings, and product-page URLs. The resulting data is stored in a CSV file for further analysis.


In [22]:
# Import necessary packages
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
from urllib.parse import urljoin

In [23]:
#URL to be scraped
base_url = "http://books.toscrape.com/"

In [24]:
# Fetch and parse the main page content
response = requests.get(base_url)
response.raise_for_status()  # Raise an exception if the server returned an HTTP error (4xx/5xx)
soup = BeautifulSoup(response.content, "html.parser")
soup

<!DOCTYPE html>

<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="s

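* The function get_stock_info below takes soup (the BeautifulSoup object for one product page) and returns a pair: the availability status and the number of books in stock. It uses a regular expression to read the count from text of the form "(N available)". If the count is found, the status is "In Stock"; if the availability element exists but contains no count, the function returns "Out of Stock" and 0; if the element is missing altogether, it returns "Not available" and 0.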

In [25]:
# Function to extract the number of books in stock from the product page
def get_stock_info(soup):
    availability_element = soup.find("p", class_="instock availability")
    if availability_element:
        availability_text = availability_element.text.strip()
        match = re.search(r"\((\d+) available\)", availability_text)
        if match:
            num_in_stock = int(match.group(1))
            in_stock = "In Stock"
        else:
            num_in_stock = 0
            in_stock = "Out of Stock"
        return in_stock, num_in_stock
    else:
        return "Not available", 0
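The regular-expression step can be checked in isolation, without fetching any page. The following is a minimal sketch: the helper name `parse_stock_text` and the sample strings are illustrative assumptions, not part of the notebook.

```python
import re

# Hypothetical helper mirroring the regex step of get_stock_info,
# applied to plain text instead of a BeautifulSoup object.
STOCK_PATTERN = re.compile(r"\((\d+) available\)")

def parse_stock_text(availability_text):
    """Return (status, count) for an availability string."""
    match = STOCK_PATTERN.search(availability_text.strip())
    if match:
        return "In Stock", int(match.group(1))
    return "Out of Stock", 0

# Sample strings are assumptions modelled on the site's wording.
print(parse_stock_text("In stock (19 available)"))
print(parse_stock_text("In stock"))
```

The second call shows a consequence of the same branching used in get_stock_info: text that lacks a count is reported as "Out of Stock" even if its wording says otherwise.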

* The function scrape_book_urls defined below collects the link inside every h3 element of a listing page. Because those links are relative, it uses urljoin to resolve each one against base_url into an absolute URL.

In [26]:
# Function to scrape book URLs from a single page
def scrape_book_urls(soup, base_url):
    product_urls = []
    for h3 in soup.find_all("h3"):
        a = h3.find("a")
        if a:
            product_urls.append(urljoin(base_url, a["href"]))
    return product_urls
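How urljoin resolves relative links depends on the URL of the page they were found on. The sketch below illustrates this with values chosen for illustration (the listing URL and the relative hrefs are assumptions about what such links look like).

```python
from urllib.parse import urljoin

# A category listing page and a relative href of the kind found on it
# (both values are illustrative assumptions).
listing_url = "http://books.toscrape.com/catalogue/category/books/travel_2/index.html"
relative_href = "../../../its-only-the-himalayas_981/index.html"

# Each "../" climbs one directory up from the listing page's folder.
absolute_url = urljoin(listing_url, relative_href)
print(absolute_url)

# A bare filename replaces only the last path segment, which is how
# a "next" link such as page-2.html resolves.
next_url = urljoin(listing_url, "page-2.html")
print(next_url)
```

This is why the URL of the page being parsed, rather than the site root, is the safer base to pass when the hrefs contain "../" segments.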

* The function extract_genre_urls extracts the URL of every genre listed in the sidebar (the side_categories block) of the main page and converts each one to an absolute URL.

In [27]:
# Function to extract genre URLs from the main page
def extract_genre_urls(soup, base_url):
    genre_urls = []
    side_categories = soup.find("div", class_="side_categories")
    if side_categories:
        ul_categories = side_categories.find("ul", class_="nav nav-list")
        if ul_categories:
            categories = ul_categories.find_all("li")
            for category in categories:
                a_tag = category.find("a")
                if a_tag:
                    genre_urls.append(urljoin(base_url, a_tag["href"]))
    return genre_urls

# Extract genre URLs from the main page
genre_urls = extract_genre_urls(soup, base_url)

* The function scrape_all_book_data_from_genre does the bulk of the work. For a given genre it walks every listing page, requests each product page, and collects the title, price, rating, availability, and stock count. It handles pagination by following the "next" link in the pager until no such link remains.

In [28]:
# Function to scrape all book data from a genre, iterating through all pages
def scrape_all_book_data_from_genre(genre, genre_url):
    genre_book_data = []
    current_url = genre_url
    while True:
        print(f"Scraping page: {current_url}")

        response = requests.get(current_url)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, "html.parser")

        # Extract product URLs from the current page
        urls = scrape_book_urls(soup, current_url)  # Pass current_url

        # Scrape product details from each URL on the page
        for product_url in urls:

            product_response = requests.get(product_url)
            product_response.raise_for_status()
            product_soup = BeautifulSoup(product_response.content, "html.parser")

            # Extract product details
            title = product_soup.find("h1").text.strip()
            price = product_soup.find("p", class_="price_color").text.strip()
            rating_element = product_soup.find("p", class_="star-rating")
            rating = rating_element["class"][-1] if rating_element else "Not rated"
            in_stock, num_in_stock = get_stock_info(product_soup)

            book_info = {
                "Book Title": title,
                "Genre": genre,
                "Price (£)": price,
                "Availability": in_stock,
                "Stock": num_in_stock,
                "Rating": rating,
                "URL of the books page": product_url,
            }
            genre_book_data.append(book_info)

        # Find the "next" page link using the HTML structure provided
        pager = soup.find("ul", class_="pager")
        next_link = None
        if pager:
            next_li = pager.find("li", class_="next")
            if next_li:
                next_a = next_li.find("a")
                if next_a:
                    next_link = next_a["href"]

        if next_link:
            current_url = urljoin(current_url, next_link)  # Construct the full URL for the next page.
        else:
            break  # No more "next" links, so exit the loop

    return genre_book_data
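The pagination logic can be separated from the network calls and tried on stand-in data. The following is a minimal sketch under stated assumptions: `iter_page_urls`, `get_next_href`, `max_pages`, and the `FAKE_SITE` mapping are hypothetical names introduced here, and the duplicate-URL guard is an addition that the notebook's loop does not have.

```python
from urllib.parse import urljoin

def iter_page_urls(start_url, get_next_href, max_pages=100):
    """Yield each page URL in turn by following "next" links.

    get_next_href(url) must return the relative href of the next page,
    or None when there is no further page.
    """
    seen = set()
    current_url = start_url
    # Stop on a missing link, a repeated URL, or the page limit.
    while current_url and current_url not in seen and len(seen) < max_pages:
        seen.add(current_url)
        yield current_url
        next_href = get_next_href(current_url)
        current_url = urljoin(current_url, next_href) if next_href else None

# Stand-in for fetching and parsing: maps a page URL to its "next" href.
FAKE_SITE = {
    "http://example.com/genre/index.html": "page-2.html",
    "http://example.com/genre/page-2.html": "page-3.html",
    "http://example.com/genre/page-3.html": None,
}

pages = list(iter_page_urls("http://example.com/genre/index.html", FAKE_SITE.get))
print(pages)
```

In a real run, `get_next_href` would fetch the page and read the href from the pager's "next" item, as the function above does.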

In [29]:
# Initialize an empty list to store all scraped book data
all_book_data = []

In [30]:
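# Scrape data for each genre.
# Links that appear before "travel" in the sidebar are skipped;
# scraping starts at "travel" and continues through every later genre.
start_scraping = False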
for genre_url in genre_urls:
    genre = genre_url.split("/")[-2].split("_")[0]  # Extract the genre name from the URL
    if genre == "travel":
        start_scraping = True
    if start_scraping:
        print(f"Scraping genre: {genre}")
        genre_book_data = scrape_all_book_data_from_genre(genre, genre_url)
        all_book_data.extend(genre_book_data)

Scraping genre: travel
Scraping page: http://books.toscrape.com/catalogue/category/books/travel_2/index.html
Scraping genre: mystery
Scraping page: http://books.toscrape.com/catalogue/category/books/mystery_3/index.html
Scraping page: http://books.toscrape.com/catalogue/category/books/mystery_3/page-2.html
Scraping genre: historical-fiction
Scraping page: http://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html
Scraping page: http://books.toscrape.com/catalogue/category/books/historical-fiction_4/page-2.html
Scraping genre: sequential-art
Scraping page: http://books.toscrape.com/catalogue/category/books/sequential-art_5/index.html
Scraping page: http://books.toscrape.com/catalogue/category/books/sequential-art_5/page-2.html
Scraping page: http://books.toscrape.com/catalogue/category/books/sequential-art_5/page-3.html
Scraping page: http://books.toscrape.com/catalogue/category/books/sequential-art_5/page-4.html
Scraping genre: classics
Scraping page: http://boo
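The genre name is derived from the category URL, whose second-to-last path segment has the form slug_id. The sketch below shows one way to do this with urlparse; the helper name `genre_from_url` is hypothetical, and splitting on the last underscore (rather than the first, as in the loop above) is a deliberate variation so that a slug containing an underscore would be kept whole.

```python
from urllib.parse import urlparse

def genre_from_url(genre_url):
    """Return the genre slug from a category URL such as .../travel_2/index.html."""
    # The second-to-last path segment looks like "<slug>_<id>".
    segment = urlparse(genre_url).path.strip("/").split("/")[-2]
    # Partition on the last underscore so the numeric id is dropped.
    slug, _, _ = segment.rpartition("_")
    # Fall back to the whole segment if it contains no underscore.
    return slug or segment

print(genre_from_url("http://books.toscrape.com/catalogue/category/books/travel_2/index.html"))
print(genre_from_url("http://books.toscrape.com/catalogue/category/books/historical-fiction_4/index.html"))
```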

In [31]:
# Create a Pandas DataFrame from the scraped data
df = pd.DataFrame(all_book_data)

df

| | Book Title | Genre | Price (£) | Availability | Stock | Rating | URL of the books page |
|---|---|---|---|---|---|---|---|
| 0 | It's Only the Himalayas | travel | £45.17 | In Stock | 19 | Two | http://books.toscrape.com/catalogue/its-only-t... |
| 1 | Full Moon over Noah’s Ark: An Odyssey to Mount... | travel | £49.43 | In Stock | 15 | Four | http://books.toscrape.com/catalogue/full-moon-... |
| 2 | See America: A Celebration of Our National Par... | travel | £48.87 | In Stock | 14 | Three | http://books.toscrape.com/catalogue/see-americ... |
| 3 | Vagabonding: An Uncommon Guide to the Art of L... | travel | £36.94 | In Stock | 8 | Two | http://books.toscrape.com/catalogue/vagabondin... |
| 4 | Under the Tuscan Sun | travel | £37.33 | In Stock | 7 | Three | http://books.toscrape.com/catalogue/under-the-... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | Why the Right Went Wrong: Conservatism--From G... | politics | £52.65 | In Stock | 14 | Four | http://books.toscrape.com/catalogue/why-the-ri... |
| 996 | Equal Is Unfair: America's Misguided Fight Aga... | politics | £56.86 | In Stock | 12 | One | http://books.toscrape.com/catalogue/equal-is-u... |
| 997 | Amid the Chaos | cultural | £36.58 | In Stock | 15 | One | http://books.toscrape.com/catalogue/amid-the-c... |
| 998 | Dark Notes | erotica | £19.19 | In Stock | 15 | Five | http://books.toscrape.com/catalogue/dark-notes... |


In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Book Title             1000 non-null   object
 1   Genre                  1000 non-null   object
 2   Price (£)              1000 non-null   object
 3   Availability           1000 non-null   object
 4   Stock                  1000 non-null   int64 
 5   Rating                 1000 non-null   object
 6   URL of the books page  1000 non-null   object
dtypes: int64(1), object(6)
memory usage: 54.8+ KB


In [34]:
df.isnull().sum()

Book Title               0
Genre                    0
Price (£)                0
Availability             0
Stock                    0
Rating                   0
URL of the books page    0
dtype: int64

In [35]:
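# Strip the currency symbol and any thousands separators, then convert prices to float
df["Price (£)"] = df["Price (£)"].astype(str).str.replace(r'[£,]', '', regex=True).astype(float)
# Map rating words to integers; a missing rating ("Not rated") would map to NaN and make astype(int) fail
df['Rating'] = df['Rating'].map({'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}).astype(int)
# Store availability as a categorical column
df["Availability"] = df["Availability"].astype("category")
# Capitalise genre names for display
df["Genre"] = df["Genre"].str.title()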


df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype   
---  ------                 --------------  -----   
 0   Book Title             1000 non-null   object  
 1   Genre                  1000 non-null   object  
 2   Price (£)              1000 non-null   float64 
 3   Availability           1000 non-null   category
 4   Stock                  1000 non-null   int64   
 5   Rating                 1000 non-null   int64   
 6   URL of the books page  1000 non-null   object  
dtypes: category(1), float64(1), int64(2), object(3)
memory usage: 48.1+ KB
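The same conversions could also be applied to each record at scrape time instead of on the finished DataFrame. The following is a minimal plain-Python sketch; `parse_price`, `parse_rating`, and `RATING_WORDS` are hypothetical names, and returning None for an unrecognised rating is an assumption about how a missing rating should be represented.

```python
# Mapping from the rating words used in the star-rating class to integers.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def parse_price(price_text):
    """Convert a price string such as '£45.17' to a float."""
    return float(price_text.replace("£", "").replace(",", "").strip())

def parse_rating(rating_word):
    """Convert a rating word to an int, or None when it is not recognised."""
    return RATING_WORDS.get(rating_word)

print(parse_price("£45.17"))
print(parse_rating("Two"))
print(parse_rating("Not rated"))
```

Returning None rather than raising keeps a book without a rating in the dataset, at the cost of needing a nullable column type later.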


In [36]:
df

| | Book Title | Genre | Price (£) | Availability | Stock | Rating | URL of the books page |
|---|---|---|---|---|---|---|---|
| 0 | It's Only the Himalayas | Travel | 45.17 | In Stock | 19 | 2 | http://books.toscrape.com/catalogue/its-only-t... |
| 1 | Full Moon over Noah’s Ark: An Odyssey to Mount... | Travel | 49.43 | In Stock | 15 | 4 | http://books.toscrape.com/catalogue/full-moon-... |
| 2 | See America: A Celebration of Our National Par... | Travel | 48.87 | In Stock | 14 | 3 | http://books.toscrape.com/catalogue/see-americ... |
| 3 | Vagabonding: An Uncommon Guide to the Art of L... | Travel | 36.94 | In Stock | 8 | 2 | http://books.toscrape.com/catalogue/vagabondin... |
| 4 | Under the Tuscan Sun | Travel | 37.33 | In Stock | 7 | 3 | http://books.toscrape.com/catalogue/under-the-... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | Why the Right Went Wrong: Conservatism--From G... | Politics | 52.65 | In Stock | 14 | 4 | http://books.toscrape.com/catalogue/why-the-ri... |
| 996 | Equal Is Unfair: America's Misguided Fight Aga... | Politics | 56.86 | In Stock | 12 | 1 | http://books.toscrape.com/catalogue/equal-is-u... |
| 997 | Amid the Chaos | Cultural | 36.58 | In Stock | 15 | 1 | http://books.toscrape.com/catalogue/amid-the-c... |
| 998 | Dark Notes | Erotica | 19.19 | In Stock | 15 | 5 | http://books.toscrape.com/catalogue/dark-notes... |


In [37]:
# Save the DataFrame to a CSV file
df.to_csv("a1_books_814000298.csv", index=False)