# Online Bookstore Scraping – BookSmart Solutions

**Final Project | Master in Data Engineering – Web Scraping with Python**  
Author: *Stefano Trovato*  
Date: *30/11/2025*

## Overview

In this project, I develop a web scraping script to extract information about books listed on the competitor’s online bookstore **Books to Scrape**.  
The goal is to build a structured dataset and export it to `books.csv`, in order to support market analysis activities for **BookSmart Solutions S.r.l.**

## Project objectives

For each book on the website, the following information is collected:

- **Title** – book title  
- **Rating** – star rating (CSS class → numeric value from 1 to 5)  
- **Price** – book price converted to a numeric value  
- **Availability** – stock status (e.g. *In stock* / *Out of stock*)

The resulting dataset (1,000 rows × 4 columns) can be used by the marketing and sales teams for pricing analysis, competitive benchmarking, and identifying the most relevant titles.
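
The rating and price conversions listed above can be sketched in isolation. This is a minimal illustration rather than the project's exact code: `RATING_WORDS`, `parse_rating`, and `parse_price` are hypothetical names, and the sample strings are assumed inputs of the kind the site returns.

```python
import re

# Hypothetical lookup table: CSS class name -> star count
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_rating(css_classes):
    """Return the star count for a class list such as ['star-rating', 'Three'], else None."""
    for name in css_classes:
        if name in RATING_WORDS:
            return RATING_WORDS[name]
    return None


def parse_price(text):
    """Extract the first number from a price string and return it as a float, else None."""
    match = re.search(r"\d+(?:[.,]\d+)?", text or "")
    if match is None:
        return None
    return float(match.group(0).replace(",", "."))


# 'Â£' is what '£' looks like when its UTF-8 bytes are decoded as Latin-1
garbled = "£51.77".encode("utf-8").decode("latin-1")

print(parse_rating(["star-rating", "Three"]))  # 3
print(parse_price("£51.77"))                   # 51.77
print(parse_price(garbled))                    # 51.77
```

Because the number is extracted with a regular expression instead of stripping a specific symbol, the same function handles the correctly decoded and the mis-decoded price string.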



## Methodology

- The target website `https://books.toscrape.com/` is **static**, so **Requests + BeautifulSoup** are sufficient (no need for Selenium).  
- For each page, the script parses the `article.product_pod` blocks and extracts the title, rating, price, and availability.  
- The rating is converted from the CSS class (`One`, `Two`, …) to a numeric value (1–5).  
- The price is cleaned (the currency symbol and stray encoding characters such as `Â` are dropped) and converted to a `float`.  
- Pagination is handled automatically by following the `li.next` link across all pages.  
- Network errors are handled with retry logic and exponential backoff; between pages, a random `sleep` of 1–3 seconds provides **polite rate limiting**.  
- The final result is saved to `books.csv` and validated through a preview of the resulting DataFrame.
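
The retry policy in the list above can be illustrated without touching the network. The sketch below is an assumption-laden example, not the project's code: `backoff_delay`, `call_with_retries`, and the fake `flaky` operation are hypothetical, and the `sleep` function is injectable so the example runs instantly.

```python
import random
import time


def backoff_delay(attempt, base=2.0, jitter=1.0):
    """Delay in seconds after failed attempt number `attempt` (1-based): base**attempt plus jitter."""
    return base ** attempt + random.uniform(0, jitter)


def call_with_retries(operation, max_retries=3, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_retries` attempts have failed."""
    for attempt in range(1, max_retries + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_retries:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt))


# Fake operation (assumption for the demo): fails twice, then succeeds
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


waits = []  # record the delays instead of actually sleeping
result = call_with_retries(flaky, sleep=waits.append)
print(result, [round(w, 1) for w in waits])
```

The recorded delays grow roughly as 2 s and 4 s, each with up to one second of random jitter so that repeated clients do not retry in lockstep.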


In [1]:
# Final Project – Web Scraping "Books to Scrape"
# Extraction of: Title, Rating (1–5), Price, Availability
# Output: full dataset (1000 books) in 'books.csv'

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time, random
from urllib.parse import urljoin
import re

# ---------- Helpers ----------
def rating_to_number(rating_class: str):
    """Convert a CSS rating class (One, Two, ...) to a number from 1 to 5; None if unknown."""
    mapping = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
    return mapping.get(rating_class)

def clean_price(price_str: str):
    """
    Convert strings such as '£51.77' or 'Â£51.77' to the float 51.77.
    Robust to special characters and different currency symbols.
    """
    if not price_str:
        return None

    # Extract the numeric part, with or without decimals (e.g. 51.77, 51,77 or 51)
    match = re.search(r"(\d+(?:[.,]\d+)?)", price_str)
    if not match:
        return None

    num_str = match.group(1).replace(",", ".")
    try:
        return float(num_str)
    except ValueError:
        return None

def safe_get_text(tag):
    """Return cleaned text from a tag or None if the tag is missing."""
    return tag.get_text(strip=True) if tag else None

def fetch_soup(url, session, headers, max_retries=3, timeout=10):
    """Download a page with retry logic and return a BeautifulSoup object."""
    last_err = None
    for attempt in range(1, max_retries + 1):
        try:
            r = session.get(url, headers=headers, timeout=timeout)
            r.raise_for_status()
            # Pass the raw bytes so BeautifulSoup detects the encoding itself;
            # r.text can mis-decode '£' as 'Â£' when the server sends no charset.
            return BeautifulSoup(r.content, "html.parser")
        except requests.RequestException as e:
            last_err = e
            if attempt == max_retries:
                break  # no point in waiting after the final attempt
            # Exponential backoff with random jitter
            wait = 2 ** attempt + random.uniform(0, 1)
            print(f"[WARN] Network error ({attempt}/{max_retries}) on {url}: {e}. Retrying in {wait:.1f}s")
            time.sleep(wait)
    raise last_err

# ---------- Main Scraper ----------
def scrape_books(start_url):
    session = requests.Session()
    headers = {"User-Agent": "Mozilla/5.0 (compatible; BookSmartBot/1.0)"}
    all_rows = []
    url = start_url

    while url:
        print(f"[INFO] Scraping: {url}")
        try:
            soup = fetch_soup(url, session, headers)

            books = soup.find_all("article", class_="product_pod")
            for b in books:
                # Title
                try:
                    title = b.find("h3").find("a").get("title")
                except Exception:
                    title = None

                # Price
                try:
                    raw_price = safe_get_text(b.find("p", class_="price_color"))
                    price = clean_price(raw_price)
                except Exception:
                    price = None

                # Rating
                try:
                    star_p = b.find("p", class_="star-rating")
                    rating_class = star_p.get("class", [None, None])[1]
                    rating = rating_to_number(rating_class)
                except Exception:
                    rating = None

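                # Availability (match on the 'availability' class alone so the
                # lookup does not depend on the item also being tagged 'instock')
                try:
                    availability = safe_get_text(b.find("p", class_="availability"))
                except Exception:
                    availability = None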

                all_rows.append({
                    "Title": title,
                    "Rating": rating,
                    "Price": price,
                    "Availability": availability
                })

            # Next page
            next_li = soup.find("li", class_="next")
            if next_li and next_li.find("a"):
                href = next_li.find("a").get("href")
                # Robust handling of relative URLs
                url = urljoin(url, href)
            else:
                url = None

            # Polite rate limiting (skipped once there is no next page)
            if url:
                time.sleep(random.uniform(1, 3))

        except Exception as e:
            print(f"[ERROR] Error while scraping {url}: {e}")
            break

    return all_rows

# ---------- Run + Save ----------
BASE_URL = "https://books.toscrape.com/catalogue/page-1.html"
books_data = scrape_books(BASE_URL)

df = pd.DataFrame(books_data)
df.to_csv("books.csv", index=False, encoding="utf-8")

print(f"\n[OK] Saved 'books.csv' with {len(df)} rows.")
preview = pd.concat([df.head(10), df.tail(10)])
preview

[INFO] Scraping: https://books.toscrape.com/catalogue/page-1.html
[INFO] Scraping: https://books.toscrape.com/catalogue/page-2.html
[INFO] Scraping: https://books.toscrape.com/catalogue/page-3.html
[INFO] Scraping: https://books.toscrape.com/catalogue/page-4.html
[INFO] Scraping: https://books.toscrape.com/catalogue/page-5.html
[INFO] Scraping: https://books.toscrape.com/catalogue/page-6.html
[INFO] Scraping: https://books.toscrape.com/catalogue/page-7.html
[INFO] Scraping: https://books.toscrape.com/catalogue/page-8.html
[INFO] Scraping: https://books.toscrape.com/catalogue/page-9.html
[INFO] Scraping: https://books.toscrape.com/catalogue/page-10.html
[INFO] Scraping: https://books.toscrape.com/catalogue/page-11.html
[INFO] Scraping: https://books.toscrape.com/catalogue/page-12.html
[INFO] Scraping: https://books.toscrape.com/catalogue/page-13.html
[INFO] Scraping: https://books.toscrape.com/catalogue/page-14.html
[INFO] Scraping: https://books.toscrape.com/catalogue/page-15.html
...
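
The page URLs in the log come from resolving each relative `li.next` link against the URL of the current page. A minimal sketch of that resolution follows; the `next_page_url` helper is hypothetical, and the `href` values are assumed to have the form shown.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Resolve a (possibly relative) next-page href against the current page URL."""
    return urljoin(current_url, href) if href else None


start = "https://books.toscrape.com/catalogue/page-1.html"

# A relative href replaces the last path segment of the current URL
print(next_page_url(start, "page-2.html"))
# https://books.toscrape.com/catalogue/page-2.html

# An href relative to the site root resolves into the catalogue folder
print(next_page_url("https://books.toscrape.com/", "catalogue/page-2.html"))
# https://books.toscrape.com/catalogue/page-2.html

# No next link (last page): the crawl loop stops
print(next_page_url(start, None))
# None
```

Resolving against the current URL, rather than concatenating strings onto a fixed base, keeps the crawl correct whichever page it starts from.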

|   | Title | Rating | Price | Availability |
|---|---|---|---|---|
| 0 | A Light in the Attic | 3 | 51.77 | In stock |
| 1 | Tipping the Velvet | 1 | 53.74 | In stock |
| 2 | Soumission | 1 | 50.1 | In stock |
| 3 | Sharp Objects | 4 | 47.82 | In stock |
| 4 | Sapiens: A Brief History of Humankind | 5 | 54.23 | In stock |
| 5 | The Requiem Red | 1 | 22.65 | In stock |
| 6 | The Dirty Little Secrets of Getting Your Dream... | 4 | 33.34 | In stock |
| 7 | The Coming Woman: A Novel Based on the Life of... | 3 | 17.93 | In stock |
| 8 | The Boys in the Boat: Nine Americans and Their... | 4 | 22.6 | In stock |
| 9 | The Black Maria | 1 | 52.15 | In stock |
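
The visual preview can be complemented by a few programmatic checks. The sketch below is one possible approach and not part of the original script: it runs on an in-memory sample copied from the first rows of the preview, and the `validate_rows` helper and its rules are assumptions.

```python
import csv
import io

# Sample copied from the first rows of the preview (stand-in for books.csv)
SAMPLE = """Title,Rating,Price,Availability
A Light in the Attic,3,51.77,In stock
Tipping the Velvet,1,53.74,In stock
Soumission,1,50.1,In stock
"""


def validate_rows(rows):
    """Return a list of (row_number, problem) tuples; an empty list means all checks passed."""
    problems = []
    for i, row in enumerate(rows, start=1):
        if not row["Title"]:
            problems.append((i, "missing title"))
        if row["Rating"] not in {"1", "2", "3", "4", "5"}:
            problems.append((i, "rating outside 1-5"))
        try:
            if float(row["Price"]) <= 0:
                problems.append((i, "non-positive price"))
        except (TypeError, ValueError):
            problems.append((i, "price is not a number"))
    return problems


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(len(rows), validate_rows(rows))  # 3 []
```

To run the same checks on the real export, the `io.StringIO(SAMPLE)` source would be swapped for an open handle on `books.csv`.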


## Conclusions

The script developed in this project makes it possible to:

- automatically and reliably collect data for **1,000 books** from the Books to Scrape website;
- obtain a clean dataset containing the key variables required for market analysis (title, rating, price, availability);
- easily integrate the `books.csv` file into BI tools or Data Engineering pipelines.

In a real-world scenario, this scraper could be scheduled to run periodically (e.g. via cron or Airflow) to continuously update the monitoring of competitors’ prices and stock availability.
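
If the scraper runs periodically, consecutive snapshots can be compared to surface price changes. The following is a rough sketch under stated assumptions: titles are treated as unique keys, the `price_changes` helper is hypothetical, and the prices in the second snapshot are invented purely for illustration.

```python
def price_changes(previous, current):
    """Map each title present in both snapshots to (old_price, new_price) when the price differs."""
    changes = {}
    for title, new_price in current.items():
        old_price = previous.get(title)
        if old_price is not None and old_price != new_price:
            changes[title] = (old_price, new_price)
    return changes


# First snapshot: prices taken from the dataset preview
run_1 = {"A Light in the Attic": 51.77, "Sharp Objects": 47.82}
# Second snapshot: invented values, for illustration only
run_2 = {"A Light in the Attic": 49.99, "Sharp Objects": 47.82, "Soumission": 50.10}

print(price_changes(run_1, run_2))
# {'A Light in the Attic': (51.77, 49.99)}
```

Titles that appear only in the newer snapshot are ignored here; a fuller version might report them separately as new listings.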
