# Important
**Before starting, rename this notebook to the following format: "last name"_"id number"**

E.g.: ferraropetrillo_123456

# Allowed Reference Materials

For this assignment, students are permitted to consult only the following materials:


- [https://docs.python.org/3/](https://docs.python.org/3/)


- [https://www.crummy.com/software/BeautifulSoup/bs4/doc/](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)


- [https://requests.readthedocs.io/en/latest/](https://requests.readthedocs.io/en/latest/)

- [https://pandas.pydata.org/docs/](https://pandas.pydata.org/docs/)



## Important Notes
The code below sets up an external Google Chrome installation for scraping dynamic sites. However, **the test can be solved using the plain requests library**.


In [None]:
# Here we install Google Chrome
%pip install google-colab-selenium

In [None]:
import google_colab_selenium as gs
from selenium.webdriver.chrome.options import Options

# We start Google Chrome in the headless mode (without user interface)
options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36')

driver = gs.Chrome(options=options)

# Objective: build a Python scraper to extract data about the books being sold on http://books.toscrape.com.


**Phase 1: Scraper Development**

Implement a Python script containing two main functions for data extraction:

*1* **scrape_book_details(book_url)**:

  - **Input:** The URL of a specific page dedicated to a single book (e.g., http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html).

  - **Logic:** Using requests to download the HTML content and BeautifulSoup4 to parse it, this function must extract the following information:

    -- Title: The full title of the book

    -- Price: The price of the book (as a float number).

    -- Availability: The number of copies available in stock (0, if none)

    -- Description: The book's description (first 50 characters)

  - **Output:** A tuple, dictionary, or another structured object containing the four pieces of data extracted for the book.

*2* **scrape_page_books(page_url)**:
  - **Input:** The URL of a page listing multiple books (e.g., http://books.toscrape.com/catalogue/page-1.html).

- **Logic:**
  -- Identify all the links leading to the individual book detail pages present on that page.

  -- For each link found, invoke the scrape_book_details function to retrieve the corresponding book's details.

  -- Collect in a list the results obtained from all calls to scrape_book_details for the books on the page.

- **Output:** A Pandas DataFrame containing all the data collected by the scraper.

**Phase 2: Write a Test Script:**
* The script should call scrape_page_books on any page of the catalogue (e.g., starting with the first page of the catalogue).
* Save the resulting DataFrame to disk using the to_excel function.
* Load the previously saved DataFrame from disk and implement the following selections:
  * Increase the availability of all books by 10%.
  * Display the details (i.e., title, price, availability, description) of all books with a price lower than 20 pounds.




# Your solution here

In [1]:
# Import necessary libraries for web scraping and data manipulation
import requests
import bs4
from bs4 import BeautifulSoup
import re

In [5]:
# Set the URL of a specific product page on the website
url = 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
# Custom headers to mimic a browser and avoid getting blocked by the website
headers = {'User-Agent': 'Mozilla/5.0'}
# Send a GET request to the URL
p = requests.get(url, headers=headers)
# Check the status code of the response (200 means OK)
p.status_code

200

In [6]:
# Parse the HTML content of the page using lxml parser
page = BeautifulSoup(p.content, 'lxml')

In [7]:
# Extract the title of the product
title = page.find('div', class_='product_main').h1.text
title

'A Light in the Attic'

In [8]:
# Extract the price of the product
price = float(page.find('p', class_='price_color').text.replace('£',''))
price

51.77

In [9]:
# Extract the availability of the product (the missing-stock case is handled later, inside the function)
stock = int(page.find('p', class_='instock').text.strip().replace('(','').replace(')','').split()[2])
stock

22
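
The chained `replace`/`split` call above depends on the exact wording of the availability text. As a hedged alternative, the sketch below pulls the first run of digits with a regular expression and falls back to 0 when there is none, which matches the "0, if none" requirement. The helper name `parse_availability` and the "Out of stock" sample string are assumptions made for illustration, not something taken from the site.

```python
import re

def parse_availability(text):
    """Return the first integer found in an availability string, or 0 if there is none."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else 0

# The first string mirrors the text seen on the sample page; the second is hypothetical
print(parse_availability('In stock (22 available)'))
print(parse_availability('Out of stock'))
```

Because the helper works on plain strings, it can be checked without any network access.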

In [10]:
# Extract the description of the product (only first 50 characters)
description = page.find('div', id='product_description').find_next_sibling('p').text.strip()[:50]
description

"It's hard to imagine a world without A Light in th"
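
The lookup above raises an `AttributeError` if a page has no `product_description` block, because `find` then returns `None`. The following is a minimal sketch of a guarded version, run on inline HTML snippets rather than the live site. The helper name `extract_description` and both snippets are hypothetical, and `html.parser` is used only so the example needs no extra parser.

```python
from bs4 import BeautifulSoup

def extract_description(html, limit=50):
    """Return the first `limit` characters of the description, or '' if it is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    header = soup.find('div', id='product_description')
    if header is None:
        return ''
    # The description text lives in the <p> that follows the header block
    paragraph = header.find_next_sibling('p')
    return paragraph.text.strip()[:limit] if paragraph else ''

# Hypothetical snippets: one with a description block, one without
with_desc = ('<article><div id="product_description"><h2>Product Description</h2></div>'
             '<p>  A short sample description.  </p></article>')
without_desc = '<article><h1>Untitled</h1></article>'

print(repr(extract_description(with_desc)))
print(repr(extract_description(without_desc)))
```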

In [11]:
# Define a function to scrape details from a product page
def scrape_book_details(book_url):
  headers = {'User-Agent': 'Mozilla/5.0'}
  p = requests.get(book_url, headers=headers)
  page = BeautifulSoup(p.content, 'lxml')

  title = page.find('div', class_='product_main').h1.text
  price = float(page.find('p', class_='price_color').text.replace('£',''))
  # The availability text looks like "In stock (22 available)"; use 0 when no number is found
  stock_tag = page.find('p', class_='instock')
  match = re.search(r'\d+', stock_tag.text) if stock_tag else None
  stock = int(match.group()) if match else 0
  # Guard against pages without a description block
  desc_header = page.find('div', id='product_description')
  description = desc_header.find_next_sibling('p').text.strip()[:50] if desc_header else ''

  return title, price, stock, description

In [12]:
# Test if the function works on a specific product
book_url = 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
scrape_book_details(book_url)

('A Light in the Attic',
 51.77,
 22,
 "It's hard to imagine a world without A Light in th")

In [13]:
import pandas as pd
from urllib.parse import urljoin
# Define a function to scrape the details of every book listed on a catalogue page
def scrape_page_books(page_url):
  # Define the headers locally so the function does not rely on a global variable
  headers = {'User-Agent': 'Mozilla/5.0'}
  p = requests.get(page_url, headers=headers)
  page = BeautifulSoup(p.content, 'lxml')
  output = []
  # Find all product cards on the page by targeting their container
  ads_tot = page.find_all('article', class_='product_pod')
  # Loop through each product card
  for ad in ads_tot:
    # Resolve the relative link against the page URL to get the book's detail page
    ad_url = urljoin(page_url, ad.a['href'])
    output.append(scrape_book_details(ad_url))
  # Convert the results list into a pandas DataFrame
  data = pd.DataFrame(output, columns = ['Title', 'Price', 'Stock', 'Description'])
  return data

In [14]:
# Now the URL for the search results for page 1
page_url = 'https://books.toscrape.com/catalogue/page-1.html'
scrape_page_books(page_url)

Unnamed: 0,Title,Price,Stock,Description
0,A Light in the Attic,51.77,22,It's hard to imagine a world without A Light i...
1,Tipping the Velvet,53.74,20,"""Erotic and absorbing...Written with starling ..."
2,Soumission,50.1,20,"Dans une France assez proche de la nôtre, un h..."
3,Sharp Objects,47.82,20,"WICKED above her hipbone, GIRL across her hear..."
4,Sapiens: A Brief History of Humankind,54.23,20,From a renowned historian comes a groundbreaki...
5,The Requiem Red,22.65,19,Patient Twenty-nine.A monster roams the halls ...
6,The Dirty Little Secrets of Getting Your Dream...,33.34,19,Drawing on his extensive experience evaluating...
7,The Coming Woman: A Novel Based on the Life of...,17.93,19,"""If you have a heart, if you have a soul, Kare..."
8,The Boys in the Boat: Nine Americans and Their...,22.6,19,For readers of Laura Hillenbrand's Seabiscuit ...
9,The Black Maria,52.15,19,"Praise for Aracelis Girmay:""[Girmay's] every l..."
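
The links on a listing page are relative, so each one has to be turned into an absolute URL before the detail page can be requested. A minimal sketch of that step with the standard-library `urllib.parse.urljoin` is shown below. The helper name `resolve_links` is hypothetical, and the second call assumes a listing page whose links carry a `catalogue/` prefix.

```python
from urllib.parse import urljoin

def resolve_links(page_url, hrefs):
    """Turn relative hrefs found on page_url into absolute URLs."""
    return [urljoin(page_url, href) for href in hrefs]

# A link relative to a page inside /catalogue/
print(resolve_links('https://books.toscrape.com/catalogue/page-1.html',
                    ['a-light-in-the-attic_1000/index.html']))

# Assumed case: a page at the site root whose links start with "catalogue/"
print(resolve_links('https://books.toscrape.com/index.html',
                    ['catalogue/a-light-in-the-attic_1000/index.html']))
```

Both calls resolve to the same detail-page URL, which is why joining against the page URL is less brittle than prepending a fixed prefix.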


In [18]:
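# To build a DataFrame covering every page of the catalogue, iterate over the page numbers
# Base URL with a placeholder for the page number
base_url = "https://books.toscrape.com/catalogue/page-{}.html"

# Scrape all pages (the catalogue has 50) and keep each per-page DataFrame
frames = []
for page_num in range(1, 51):
    url = base_url.format(page_num)
    frames.append(scrape_page_books(url))
# Combine the per-page results into a single DataFrame
all_books = pd.concat(frames, ignore_index=True)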


In [19]:
# Define a function that increases the stock availability by 10% and filters the DataFrame by price
def scrape_by_price(page_url, price):
  data = scrape_page_books(page_url)
  data['Stock'] = (data['Stock'] * 1.10)
  # Keep only the books priced strictly below the threshold ("lower than" in the task)
  filtered_data = data[data["Price"] < price]
  filtered_data.to_excel(f"Data_filtered_under_{price}pounds.xlsx")

  return filtered_data


In [16]:
# Try the function
scrape_by_price(page_url, 20)

Unnamed: 0,Title,Price,Stock,Description
7,The Coming Woman: A Novel Based on the Life of...,17.93,20.9,"""If you have a heart, if you have a soul, Kare..."
10,"Starving Hearts (Triangular Trade Trilogy, #1)",13.99,20.9,"Since her assault, Miss Annette Chetwynd has b..."
12,Set Me Free,17.46,20.9,Aaron Ledbetter’s future had been planned out ...
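
Phase 2 asks for the DataFrame to be saved with `to_excel`, loaded back, and only then modified. The sketch below separates those steps. The helper names `excel_round_trip` and `apply_selections` and the sample rows are hypothetical, and the Excel round trip assumes an engine such as `openpyxl` is installed, so it is defined here but not called.

```python
import pandas as pd

def excel_round_trip(df, path):
    """Save df with to_excel and load it back (assumes an Excel engine such as openpyxl)."""
    df.to_excel(path, index=False)
    return pd.read_excel(path)

def apply_selections(df, max_price=20):
    """Raise Stock by 10% and return (updated frame, rows priced below max_price)."""
    updated = df.copy()  # leave the caller's DataFrame untouched
    updated['Stock'] = updated['Stock'] * 1.10
    cheap = updated[updated['Price'] < max_price]
    return updated, cheap

# Hypothetical rows standing in for the scraper output
sample = pd.DataFrame({
    'Title': ['Book A', 'Book B', 'Book C'],
    'Price': [17.93, 20.00, 51.77],
    'Stock': [10, 20, 30],
    'Description': ['first', 'second', 'third'],
})

updated, cheap = apply_selections(sample)
print(cheap[['Title', 'Price', 'Stock', 'Description']])
```

With the scraper in place, the loaded frame could be obtained with something like `excel_round_trip(scrape_page_books(page_url), 'books.xlsx')`, where the file name is only an example, and then passed to `apply_selections`.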
