# Scraping Children’s Books

## Project Introduction

This project uses **Selenium** to scrape data on children’s books from the fictional online bookstore <a href="https://books.toscrape.com/">Books to Scrape</a>. The goal is to demonstrate practical skills in *web automation* and *data collection and extraction*.

The scraped data is then organized and stored in **JSON** and **CSV** files.

This project highlights the ability to:

- Navigate and interact with websites programmatically.

- Handle page structures and categories.

- Extract and structure relevant information efficiently.

- Apply automation techniques for real-world data collection scenarios.

Preparing the Selenium toolkit and the main URL.

In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import time

In [2]:
driver = webdriver.Chrome()

In [3]:
main_url = "https://toscrape.com/"

In [4]:
driver.get(main_url)

In [5]:
bookstore_link = driver.find_element(By.PARTIAL_LINK_TEXT, "bookstore").get_attribute(
    "href"
)

In [6]:
print(bookstore_link)  # Verifying the link

http://books.toscrape.com/


In [7]:
# Navigating to the bookstore page
driver.find_element(By.PARTIAL_LINK_TEXT, "bookstore").click()

In [8]:
# Verifying links and category
category = driver.find_element(By.LINK_TEXT, "Childrens")
category_url = category.get_attribute("href")
print(f"Category URL: {category_url}")

Category URL: https://books.toscrape.com/catalogue/category/books/childrens_11/index.html
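
The link above is verified by reading the printed output. A minimal sketch of making the same check in code, assuming the expected host is `books.toscrape.com`; the helper name `is_expected_link` is hypothetical and not part of the notebook:

```python
from urllib.parse import urlparse


def is_expected_link(href, expected_host="books.toscrape.com"):
    """Return True if href is an http(s) URL on the expected host.

    Hypothetical helper: get_attribute("href") can return None,
    so a missing value is treated as invalid.
    """
    if not href:
        return False
    parts = urlparse(href)
    return parts.scheme in {"http", "https"} and parts.hostname == expected_host


category_href = (
    "https://books.toscrape.com/catalogue/category/books/childrens_11/index.html"
)
print(is_expected_link(category_href))
```

A check like this could run before `category.click()` so that an unexpected link stops the run early.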


In [9]:
category.click()

In [10]:
category_name = (
    driver.find_element(By.CLASS_NAME, "page-header")
    .find_element(By.TAG_NAME, "h1")
    .text
)
category_results = driver.find_element(By.CSS_SELECTOR, "form.form-horizontal").text

print(f"Category Name: {category_name}")
print(f"Category Results: {category_results}")

Category Name: Childrens
Category Results: 29 results - showing 1 to 20.


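The category lists 29 results but shows only 20 per page, so the listing is paginated and the scraper has to follow the "next" link to reach the remaining books.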
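
The page count can also be derived from the summary text rather than read off by eye. A small sketch, assuming the summary always has the form printed above ("N results - showing A to B."); the function name `parse_result_summary` is hypothetical:

```python
import math
import re


def parse_result_summary(summary):
    """Return (total, first_shown, last_shown) parsed from the summary text."""
    match = re.search(
        r"(\d+)\s+results?\s*-\s*showing\s+(\d+)\s+to\s+(\d+)", summary
    )
    if match is None:
        raise ValueError(f"Unrecognized summary text: {summary!r}")
    total, first, last = (int(group) for group in match.groups())
    return total, first, last


total, first, last = parse_result_summary("29 results - showing 1 to 20.")
per_page = last - first + 1          # books shown on the current page
expected_pages = math.ceil(total / per_page)
print(f"{total} results, {per_page} per page, {expected_pages} pages expected")
```

Comparing `expected_pages` with the final value of `page` after the loop is one way to confirm that no page was skipped.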

In [11]:
pagination = True
column = ["Upc", "Title", "Price", "Rating", "Stock", "Stock_Qty", "Url", "Image"]
data_set = []
count = 0

In [12]:
page = 1
while pagination:
    try:
        print(f"Processing page {page} -- {count + 1}")
        listings = driver.find_elements(By.CSS_SELECTOR, "ol.row li")

        for listing in listings:
            article = listing.find_element(By.TAG_NAME, "article")

            image = article.find_element(By.CSS_SELECTOR, "a")
            article_link = image.get_attribute("href")
            image_src = image.find_element(By.TAG_NAME, "img").get_attribute("src")
            image_alt = image.find_element(By.TAG_NAME, "img").get_attribute("alt")

            rating = article.find_element(
                By.CSS_SELECTOR, 'p[class*="star"]'
            ).get_attribute("class")
            title = article.find_element(By.CSS_SELECTOR, "h3 > a").get_attribute(
                "title"
            )
            price = article.find_element(By.CLASS_NAME, "price_color").text

            # Print the extracted fields to follow the scraper's progress
            print(f"Data -- {title} -- {rating} -- price")

            # Introduce a short delay between requests to mimic human behavior and prevent IP blocking.
            time.sleep(1)

            if article_link:
                # Open the detail page to read the product information table
                listing.find_element(By.TAG_NAME, "img").click()
                upc = driver.find_element(
                    By.XPATH, '//th[contains(text(),"UPC")]/following-sibling::td'
                ).text

                if upc:
                    stock_qty = driver.find_element(
                        By.XPATH,
                        '//th[contains(text(), "Availability")]/following-sibling::td',
                    ).text

                    # Split the availability text into status and quantity
                    stock = stock_qty.split("(")
                    qty = stock[1] if len(stock) > 1 else ""

                    temp = [
                        upc,
                        title,
                        price,
                        rating.replace("star-rating", "").strip(),
                        stock[0].strip(),
                        qty.replace("available", "").replace(")", "").strip(),
                        article_link,
                        image_src,
                    ]
                    # Record the row only when the detail page yielded a UPC
                    count += 1
                    data_set.append(temp)

                # Add a delay to avoid being flagged as a bot.
                time.sleep(2)
                # Go back to the listing page
                driver.back()
        try:
            driver.find_element(By.LINK_TEXT, "next").click()
            page += 1
        except NoSuchElementException:
            pagination = False
            print(f"No more pagination or cannot reach it, currently at page {page}")
    except Exception as e:
        print(f"Exception occurred: {e}")
        pagination = False

Processing page 1 -- 1
Data -- Birdsong: A Story in Pictures -- star-rating Three -- price
Data -- The Bear and the Piano -- star-rating One -- price
Data -- The Secret of Dreadwillow Carse -- star-rating One -- price
Data -- The White Cat and the Monk: A Retelling of the Poem “Pangur Bán” -- star-rating Four -- price
Data -- Little Red -- star-rating Three -- price
Data -- Walt Disney's Alice in Wonderland -- star-rating Five -- price
Data -- Twenty Yawns -- star-rating Two -- price
Data -- Rain Fish -- star-rating Three -- price
Data -- Once Was a Time -- star-rating Two -- price
Data -- Luis Paints the World -- star-rating Three -- price
Data -- Nap-a-Roo -- star-rating One -- price
Data -- The Whale -- star-rating Four -- price
Data -- Shrunken Treasures: Literary Classics, Short, Sweet, and Silly -- star-rating Three -- price
Data -- Raymie Nightingale -- star-rating Two -- price
Data -- Playing from the Heart -- star-rating One -- price
Data -- Maybe Something Beautiful: How Art 
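
The raw strings collected in the loop (the rating class, the availability text, and the price) can be normalized with small pure functions that are easy to test without a browser. This is a sketch, not the notebook's own code: the helper names are hypothetical, the availability format "In stock (19 available)" is an assumption inferred from the split logic above, and the price "£54.21" is a made-up sample value.

```python
import re
from typing import Optional, Tuple

RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_rating(class_attr: str) -> Optional[int]:
    """Map a class attribute such as 'star-rating Three' to 3."""
    for token in class_attr.split():
        if token in RATING_WORDS:
            return RATING_WORDS[token]
    return None


def parse_availability(text: str) -> Tuple[str, Optional[int]]:
    """Split 'In stock (19 available)' into ('In stock', 19)."""
    match = re.match(r"\s*(.*?)\s*\((\d+)\s+available\)", text)
    if match:
        return match.group(1), int(match.group(2))
    # No quantity in parentheses: keep the status, leave the quantity unknown
    return text.strip(), None


def parse_price(text: str) -> Optional[float]:
    """Extract the numeric part of a price string such as '£54.21'."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


print(parse_rating("star-rating Three"))
print(parse_availability("In stock (19 available)"))
print(parse_price("£54.21"))
```

Storing the rating and quantity as integers and the price as a float would make the exported files easier to sort and filter later.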

In [13]:
print(f"Total rows in dataset: {len(data_set)}")

Total rows in dataset: 29


Exporting the scraped data to CSV and JSON.

In [14]:
import csv
import json

In [15]:
with open("book_details.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(column)
    for data in data_set:
        writer.writerow(data)
    print("CSV file created")

final_data_set = list()
for data in data_set:
    final_data_set.append(dict(zip(column, data)))

with open("book_details.json", "w") as json_file:
    json.dump(final_data_set, json_file, indent=4)

print("JSON file created")

CSV file created
JSON file created
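
An alternative export builds the dictionaries once and reuses them for both formats. The sketch below writes to in-memory strings so it runs without touching the disk. Apart from the title and rating, which come from the output above, every value in the sample row is a placeholder and not scraped data. By default `json.dump` escapes non-ASCII characters, so `ensure_ascii=False` is used here to keep a title such as “Pangur Bán” readable; a file written that way should be opened with `encoding="utf-8"`.

```python
import csv
import io
import json

columns = ["Upc", "Title", "Price", "Rating", "Stock", "Stock_Qty", "Url", "Image"]

# Placeholder row for illustration only (not real scraped values)
sample_row = [
    "0000000000000000",
    "The White Cat and the Monk: A Retelling of the Poem “Pangur Bán”",
    "£0.00",
    "Four",
    "In stock",
    "0",
    "https://example.com/book.html",
    "https://example.com/book.jpg",
]
records = [dict(zip(columns, sample_row))]

# CSV: DictWriter takes the same dictionaries used for the JSON export
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

# JSON: keep non-ASCII characters as-is instead of \u escapes
json_text = json.dumps(records, indent=4, ensure_ascii=False)

print(csv_text)
print(json_text)
```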


Quitting the driver.

In [16]:
driver.quit()

## Conclusion

This project demonstrated how Selenium can automate a multi-page scraping task, from navigation through to export.

- By verifying each navigation step, checking URLs, and following the "next" link, the scraper traversed both pages of the Childrens category and collected all 29 listed books.

- Both **CSS selectors** and **XPath expressions** were used to locate elements in the DOM: CSS selectors for the listing cards, and XPath for the rows of the product information table on each detail page.

- To reduce the likelihood of being flagged as a bot, fixed **time.sleep** delays of one to two seconds were placed between page loads.

Finally, the collected data was exported to **JSON** and **CSV** files for storage, sharing, and further analysis.
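
As a possible refinement of the fixed delays, the pause length could be randomized so that requests are not evenly spaced. This is a sketch under assumptions rather than part of the project: the helper name `polite_pause` and its default values are hypothetical, and the `sleep` parameter exists only so the function can be exercised without actually waiting.

```python
import random
import time


def polite_pause(base=1.0, jitter=0.5, sleep=time.sleep):
    """Sleep for `base` seconds plus a random extra of up to `jitter` seconds.

    Returns the delay that was used so callers can log it.
    """
    delay = base + random.uniform(0, jitter)
    sleep(delay)
    return delay


# Demonstration with a stand-in sleep function that only records the delay
recorded = []
used = polite_pause(base=1.0, jitter=0.5, sleep=recorded.append)
print(f"Would have slept for {used:.2f} seconds")
```

In the scraping loop, `polite_pause()` could be called wherever `time.sleep(1)` or `time.sleep(2)` appears.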