# Scraping the Books to Scrape Website

## This website lists books along with their titles, prices, and ratings.

## Aim: To scrape the following data from each book:
- Title
- Price
- Availability
- Rating

In [43]:
import requests

##### URL of the website to scrape

In [30]:
url = "http://books.toscrape.com/"

In [31]:
# Send a GET request to fetch the HTML content of the page

In [32]:
response = requests.get(url)

In [33]:
# Check whether the request was successful (status code 200 means success)

In [34]:
if response.status_code == 200:
    print("Page successfully retrieved!")
else:
    print("Failed to retrieve page.")

Page successfully retrieved!
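
The request and the status check above can be folded into one small helper. This is only a minimal sketch, not the notebook's own method: the name `fetch_html` and the 10-second `timeout` default are hypothetical choices. `raise_for_status()` is the `requests` call that raises an exception for 4xx and 5xx responses, so a failed download stops the script instead of printing a message and carrying on.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML as text, raising on network or HTTP errors.

    The name and the timeout default are illustrative assumptions.
    """
    # Without a timeout, requests.get can wait indefinitely on a stalled server.
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError when the status code is 4xx or 5xx.
    response.raise_for_status()
    return response.text


# Example call (performs a real network request, so it is left commented out):
# html = fetch_html("http://books.toscrape.com/")
```

A caller can wrap `fetch_html` in `try`/`except requests.RequestException` to handle connection failures and bad status codes in one place.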


### Step 2: Parsing the HTML Content

##### Once we have the HTML content of the page, we need to parse it so that specific data is easier to search for. We use BeautifulSoup for this task.

In [46]:
from bs4 import BeautifulSoup

##### Parsing the HTML content using BeautifulSoup

In [48]:
Soup = BeautifulSoup(response.text, "html.parser")

In [50]:
## Let us display the parsed HTML, skipping the first 500 characters, to see what we have
## (use Soup.prettify()[:500] instead to show only the first 500 characters)
print(Soup.prettify()[500:])

nt="24th Jun 2016 09:29" name="created"/>
  <meta content="" name="description"/>
  <meta content="width=device-width" name="viewport"/>
  <meta content="NOARCHIVE,NOCACHE" name="robots"/>
  <!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
  <!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
  <link href="static/oscar/favicon.ico" rel="shortcut icon"/>
  <link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
  <link href="static/oscar/js/bootstrap-datetimepicker/bootstrap-datetimepicker.css" rel="stylesheet"/>
  <link href="static/oscar/css/datetimepicker.css" rel="stylesheet" type="text/css"/>
 </head>
 <body class="default" id="default">
  <header class="header container-fluid">
   <div class="page_inner">
    <div class="row">
     <div class="col-sm-8 h1">
      <a href="index.html">
       Books to Scrape
      </a>
      <small>
       We love being scraped!
      </small>
     </div>

Explanation:
- BeautifulSoup(response.text, "html.parser"): This line tells BeautifulSoup to parse the HTML content returned by the website.
- Soup.prettify(): This formats the HTML in a readable, indented way.

### Step 3: Extract Data from the Page

Now we want to extract useful data, such as the title, price, and rating of each book on the page.
The books are contained in HTML tags with specific classes. Let's identify and extract that data.

Find all book containers on the page:

In [52]:
book_containers = Soup.find_all("article", class_="product_pod")

Loop through each book container and extract the relevant information:

In [54]:
for book in book_containers:
    # Extract the book's title
    title = book.find("h3").find("a")["title"]
    # Extract the book's price
    price = book.find("p", class_="price_color").text
    # Extract the book's rating
    rating = book.find("p", class_="star-rating")["class"][1]
    # Print the extracted information
    print(f"Title: {title}")
    print(f"Price: {price}")
    print(f"Rating: {rating} stars")
    print("-" * 40)

Title: A Light in the Attic
Price: Â£51.77
Rating: Three stars
----------------------------------------
Title: Tipping the Velvet
Price: Â£53.74
Rating: One stars
----------------------------------------
Title: Soumission
Price: Â£50.10
Rating: One stars
----------------------------------------
Title: Sharp Objects
Price: Â£47.82
Rating: Four stars
----------------------------------------
Title: Sapiens: A Brief History of Humankind
Price: Â£54.23
Rating: Five stars
----------------------------------------
Title: The Requiem Red
Price: Â£22.65
Rating: One stars
----------------------------------------
Title: The Dirty Little Secrets of Getting Your Dream Job
Price: Â£33.34
Rating: Four stars
----------------------------------------
Title: The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
Price: Â£17.93
Rating: Three stars
----------------------------------------
Title: The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 B

Explanation:
- Soup.find_all(): This finds all occurrences of the <article> tag with the class product_pod, each of which represents one book.
- book.find(): This finds a specific HTML element inside the book container. We use it to get the title, price, and rating of each book.
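
The aim at the top of this notebook also lists availability, which the loop above does not extract. The sketch below shows one way it might be added. It runs on a small inline HTML snippet rather than the live page: `SAMPLE_HTML` is a hypothetical fragment modeled on one book entry, and the `availability` class name and the "In stock" label are assumptions about the site's markup that should be checked against the real HTML. The helper name `extract_book` is also illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on a single book entry; the live markup may differ.
SAMPLE_HTML = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="#" title="A Light in the Attic">A Light in the ...</a></h3>
  <div class="product_price">
    <p class="price_color">£51.77</p>
    <p class="instock availability">
      <i class="icon-ok"></i>
      In stock
    </p>
  </div>
</article>
"""


def extract_book(book):
    """Return the four target fields from one product_pod element."""
    # Assumed class name; returns None if the tag is missing.
    availability_tag = book.find("p", class_="availability")
    return {
        "title": book.find("h3").find("a")["title"],
        "price": book.find("p", class_="price_color").text,
        # get_text(strip=True) trims the whitespace around the label.
        "availability": availability_tag.get_text(strip=True) if availability_tag else None,
        "rating": book.find("p", class_="star-rating")["class"][1],
    }


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
books = [extract_book(b) for b in sample_soup.find_all("article", class_="product_pod")]
print(books)
```

Returning a dictionary per book keeps the extraction logic in one place, so the printing loop and the CSV step can reuse it.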

### Step 4: Store the Data

Once we have the data, it's common to store it in a file for later use. We will store the data in a CSV file, which can be easily opened in programs like Excel.

In [56]:
import csv

In [59]:
# Open a CSV file to store the data
with open("books.csv", mode="w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Price", "Rating"])  # Write the header row
    # Loop through the book containers and write the data to the CSV file
    for book in book_containers:
        title = book.find("h3").find("a")["title"]
        price = book.find("p", class_="price_color").text
        rating = book.find("p", class_="star-rating")["class"][1]
        writer.writerow([title, price, rating])  # Write data for each book

print("Data successfully saved to books.csv!")

Data successfully saved to books.csv!


Explanation:
- csv.writer(file): This creates a writer object that writes data into the CSV file.
- writer.writerow(): This writes one row of data into the CSV file.
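
The values stored this way are raw strings: the prices printed earlier carry a stray `Â` before the pound sign, and the ratings are words rather than numbers. The `Â£` pattern typically appears when UTF-8 bytes are decoded as ISO-8859-1, which is likely what happened here. The sketch below shows one possible cleanup step before analysis. The helper names (`fix_mojibake`, `parse_price`, `clean_row`) and the `RATING_WORDS` table are hypothetical, and the table assumes ratings run from "One" to "Five".

```python
# Assumed mapping from the rating words in the class attribute to integers.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def fix_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as ISO-8859-1."""
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was already clean, or was damaged in some other way.
        return text


def parse_price(text):
    """Convert a price string such as '£51.77' to a float."""
    return float(fix_mojibake(text).lstrip("£"))


def clean_row(title, price, rating):
    """Return the title, a numeric price, and a numeric rating (or None)."""
    return title, parse_price(price), RATING_WORDS.get(rating)


print(clean_row("A Light in the Attic", "Â£51.77", "Three"))
# → ('A Light in the Attic', 51.77, 3)
```

Passing `response.content` (raw bytes) to BeautifulSoup instead of `response.text` may avoid the encoding problem at the source, since the parser can then detect the encoding itself. For exact monetary arithmetic, `decimal.Decimal` is a safer choice than `float`.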


### Step 5: Ethical Considerations in Web Scraping

While web scraping is a powerful tool, it's essential to scrape responsibly. Here are some ethical considerations:
• Respect robots.txt: Some websites have a file called robots.txt that tells you which parts of the site may be scraped and which may not.
• Don't overload the server: Sending too many requests in a short period can slow down or crash a website. Always include a delay between requests.
• Check the website's Terms of Service: Some websites prohibit scraping, so always check their rules before scraping data.
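
The first two points can be sketched with the standard library alone. The example below is illustrative rather than a complete solution: `SAMPLE_ROBOTS_TXT`, the `example.com` URLs, and the helpers `build_parser`, `allowed_urls`, and `fetch_politely` are all hypothetical, and the rules are parsed from an inline string so the code runs offline. For a real site, `RobotFileParser.set_url()` followed by `read()` would load the actual robots.txt file.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs without a network.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Build a robots.txt parser from the file's text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent="*"):
    """Keep only the URLs that the rules permit for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # spread requests out over time
        results.append(fetch(url))
    return results


parser = build_parser(SAMPLE_ROBOTS_TXT)
candidates = [
    "http://example.com/catalogue/page-1.html",
    "http://example.com/private/admin.html",
]
permitted = allowed_urls(parser, candidates)
print(permitted)
# → ['http://example.com/catalogue/page-1.html']
```

`fetch_politely` takes the download function as an argument, so any fetching routine can be plugged in and the delay can be tuned per site. The one-second default is an arbitrary starting point, not a recommendation from any particular website.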