<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Web Scraping Tutorial: Amazon Best Sellers</title>
</head>

</html>


<body>
    <h1>Web Scraping from Amazon with Python</h1>
    <p>In this tutorial, I will walk through scraping Amazon's Best Sellers page in the Teaching & Education category to collect the authors and ratings of the top 50 books. Before we start, make sure you have the following Python libraries installed:</p>
    <ul>
        <li><strong>requests:</strong> to send HTTP requests and retrieve the web pages.</li>
        <li><strong>BeautifulSoup:</strong> to parse and extract information from the HTML content.</li>
        <li><strong>pandas:</strong> to organize and save the extracted data in a tabular format.</li>
    </ul>
<p>Google Colab ships with pandas and requests preinstalled. In a local Jupyter Notebook they are available if your distribution bundles them (Anaconda does); otherwise install them with pip as well. Use the command below in a notebook cell to install BeautifulSoup:</p>
    
<pre><code>!pip install beautifulsoup4</code></pre>

<p>Now, let’s get started with web scraping from Amazon.</p>
</body>

<body>
    <h1>Step 1: Understanding the Target URL and Pagination</h1>
    <p>We are targeting the Amazon Best Sellers page in the Teaching & Education category. The list is split across several pages, and the page number is encoded in the URL. The URL of the first page looks like this:</p>
   <pre><code>https://www.amazon.in/gp/bestsellers/books/4149461031/ref=zg_bs_pg_1?ie=UTF8&pg=1</code></pre>
<p>Notice that the page number appears twice in the URL: in the query parameter "pg" and in the "zg_bs_pg_1" suffix of the path. We will increment both values to navigate through the pages.</p>
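<p>To see where the page number sits in the URL, the address can be taken apart with Python's standard library. This is only an illustrative sketch; the variable names are my own and not part of the scraper itself.</p>

```python
from urllib.parse import urlsplit, parse_qs

# URL of the first page, copied from the tutorial
first_page = "https://www.amazon.in/gp/bestsellers/books/4149461031/ref=zg_bs_pg_1?ie=UTF8&pg=1"

parts = urlsplit(first_page)

# last path segment carries the "zg_bs_pg_<n>" marker
path_suffix = parts.path.rsplit("/", 1)[-1]

# the query string carries the "pg" parameter
query = parse_qs(parts.query)

print(path_suffix)  # ref=zg_bs_pg_1
print(query)        # {'ie': ['UTF8'], 'pg': ['1']}
```

<p>Both places hold the same page number, which is why the scraper fills in two placeholders when it builds the URL for each page.</p>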
</body>


<h1>Step 2: Set Up the HTTP Request</h1>

<p>To scrape content from Amazon, we first send a request to the server and retrieve the page's HTML. By default, requests identifies itself with a User-Agent such as <code>python-requests/2.x</code>, which Amazon is likely to block, so we include a browser-like User-Agent header in the request. Here's how to set up the HTTP request:</p>


<pre><code>import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL template for the best sellers page for teaching & education books;
# both placeholders receive the page number
base_url = "https://www.amazon.in/gp/bestsellers/books/4149461031/ref=zg_bs_pg_{}?ie=UTF8&pg={}"

# HTTP headers to mimic a browser visit
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}
</code></pre>
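<p>One way to check which User-Agent will actually be sent is to prepare a request without sending it. The sketch below assumes a <code>requests.Session</code> (the tutorial itself passes <code>headers</code> directly to <code>requests.get</code>) and makes no network call.</p>

```python
import requests

# browser-like header, same idea as in the tutorial
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}

session = requests.Session()

# what requests would send if we did nothing, e.g. "python-requests/2.x"
default_agent = session.headers["User-Agent"]

# override the default for every request made through this session
session.headers.update(browser_headers)

# build the request without sending it, so the headers can be inspected
request = requests.Request(
    "GET",
    "https://www.amazon.in/gp/bestsellers/books/4149461031/ref=zg_bs_pg_1?ie=UTF8&pg=1",
)
prepared = session.prepare_request(request)
sent_agent = prepared.headers["User-Agent"]

print("default:", default_agent)
print("sent:   ", sent_agent)
```

<p>A session also reuses the underlying connection across pages, which can be convenient when several pages are fetched in a row.</p>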

<h1>Step 3: Iterate Over Pages to Collect Data</h1>
<p>Now we loop through the first three pages to collect data for the top 50 books. Assuming each page displays around 20 items, three pages are enough to cover 50. From each book on a page, we extract the author's name and the rating:</p>

<pre><code># initialize a list to store book data
book_list = []
max_books = 50  # stop once this many books are collected

# iterate over the first 3 pages (assuming each page has about 20 items)
for page in range(1, 4):
    # construct the URL for the current page
    url = base_url.format(page, page)

    # send a GET request and fail loudly on an HTTP error (e.g. a blocked request)
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()

    # parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, "html.parser")

    # find all the book elements
    books = soup.find_all("div", {"class": "zg-grid-general-faceout"})

    # iterate over each book element to extract data
    for book in books:
        if len(book_list) >= max_books:
            break  # limit reached; skip the rest of this page

        # look each element up once; fall back to "N/A" when it is missing
        author_tag = book.find("a", class_="a-size-small a-link-child")
        rating_tag = book.find("span", class_="a-icon-alt")

        # append the extracted data to the book_list
        book_list.append({
            "Author": author_tag.get_text(strip=True) if author_tag else "N/A",
            "Rating": rating_tag.get_text(strip=True) if rating_tag else "N/A",
        })

    # the inner break only leaves the inner loop, so stop paging explicitly
    if len(book_list) >= max_books:
        break
</code></pre>

<p>Here, a for loop walks through the first three pages of Amazon's Best Sellers list in the Teaching & Education category. For each page, the code sends a GET request to retrieve the HTML, parses it with BeautifulSoup, and finds the <b>div</b> elements that wrap each book. From each of these it extracts the author and rating, falling back to "N/A" when an element is missing, and appends the result to <b>book_list</b>.</p>

<p>The length check on <b>book_list</b> caps the collection at 50 entries: once the list holds 50 books, no further entries are added, so only the authors and ratings of the top 50 books are captured.</p>
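<p>The extraction step can be tried offline on a small HTML fragment. The sketch below uses made-up sample markup that mimics the class names used above, and a hypothetical helper <code>text_or_default</code>; it is not Amazon's real page, only an illustration of how the "N/A" fallback behaves.</p>

```python
from bs4 import BeautifulSoup

# made-up fragment: the second book has no author link
sample_html = """
<div class="zg-grid-general-faceout">
  <a class="a-size-small a-link-child">Jane Doe</a>
  <span class="a-icon-alt">4.5 out of 5 stars</span>
</div>
<div class="zg-grid-general-faceout">
  <span class="a-icon-alt">4.1 out of 5 stars</span>
</div>
"""


def text_or_default(parent, name, css_class, default="N/A"):
    """Return the stripped text of the first matching tag, or a default."""
    tag = parent.find(name, class_=css_class)
    return tag.get_text(strip=True) if tag else default


soup = BeautifulSoup(sample_html, "html.parser")

rows = [
    {
        "Author": text_or_default(book, "a", "a-size-small a-link-child"),
        "Rating": text_or_default(book, "span", "a-icon-alt"),
    }
    for book in soup.find_all("div", class_="zg-grid-general-faceout")
]

for row in rows:
    print(row)
```

<p>The second row comes back with "N/A" as the author, which is the same behaviour that produces the "N/A" entries in the scraped data.</p>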

<h1>Step 4: Store and Save the Data</h1>

<p>After collecting the data, we will store it in a Pandas DataFrame and save it to a CSV file:</p>

<pre><code># convert the list of dictionaries into a DataFrame
df = pd.DataFrame(book_list)

print(df.head())

# save the DataFrame to a CSV file
df.to_csv("amazon_top_50_books_authors_ratings.csv", index=False)
</code></pre>

<pre>                    Author              Rating
0  Samapti Sinha Mahapatra  4.6 out of 5 stars
1             Akshay Kumar  4.3 out of 5 stars
2                      N/A  4.4 out of 5 stars
3            Lori Gottlieb  4.6 out of 5 stars
4           एम लक्ष्मीकांत  4.4 out of 5 stars
</pre>


<p>Let's have a look at some sample rows as well:</p>

<pre><code>print(df.sample(10))
</code></pre>

<pre>                     Author              Rating
8   EduGorilla Prep Experts  3.5 out of 5 stars
32      RPH Editorial Board  3.8 out of 5 stars
16   EduGorilla PREP EXPERT  4.0 out of 5 stars
47         Ronnie Screwvala  4.4 out of 5 stars
20             Mike Chapple  4.7 out of 5 stars
31               Pavan Soni  4.9 out of 5 stars
25          Nikhil Kr Gupta  4.5 out of 5 stars
38   ALLEN Expert Faculties                 N/A
28                      N/A  4.3 out of 5 stars
12             Kriti Sharma  4.6 out of 5 stars
</pre>
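<p>The ratings are stored as text such as "4.6 out of 5 stars", with "N/A" where no rating was found. If numbers are needed for sorting or averaging, a small parser along these lines could help. The function name <code>parse_rating</code> is my own, and the pattern assumes the rating text keeps the format shown in the output.</p>

```python
import re


def parse_rating(text):
    """Return the leading number of '4.6 out of 5 stars' as a float, else None."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s+out of\s+5", text)
    return float(match.group(1)) if match else None


samples = ["4.6 out of 5 stars", "3.5 out of 5 stars", "N/A"]
parsed = [parse_rating(s) for s in samples]

print(parsed)  # [4.6, 3.5, None]
```

<p>With the DataFrame from this tutorial, the parser could be applied as <code>df["Rating"].map(parse_rating)</code> to build a numeric column next to the original text.</p>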


<p>This method can be adapted for different categories or more extensive data collection by adjusting the page range or the conditions within the loop.</p>
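<p>As a rough sketch of such an adaptation, the page range can be derived from the number of items wanted instead of being hard-coded. The helper <code>page_urls</code> and the named placeholders are hypothetical, the default of 20 items per page is the same assumption made earlier, and whether other categories follow the same URL pattern is also an assumption that should be checked in the browser.</p>

```python
import math

# same URL as in the tutorial, with the category ID and page number as placeholders
URL_TEMPLATE = (
    "https://www.amazon.in/gp/bestsellers/books/{category}"
    "/ref=zg_bs_pg_{page}?ie=UTF8&pg={page}"
)


def page_urls(category_id, target_items, items_per_page=20):
    """Build the URLs of as many pages as are needed to cover target_items."""
    n_pages = math.ceil(target_items / items_per_page)
    return [
        URL_TEMPLATE.format(category=category_id, page=page)
        for page in range(1, n_pages + 1)
    ]


# 4149461031 is the Teaching & Education category ID from the tutorial's URL
urls = page_urls("4149461031", target_items=50)

for url in urls:
    print(url)
```

<p>For 50 items at about 20 per page this yields three URLs, matching the <code>range(1, 4)</code> used in Step 3.</p>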

<h1>Summary</h1>

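<p>Web scraping is a technique for extracting data from websites: send a request to the server, retrieve the web page, and parse the HTML to pull out the information you need. In this tutorial, requests, BeautifulSoup, and pandas handled those steps for the authors and ratings of Amazon's top 50 Teaching & Education best sellers.</p>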