## Introduction
In this notebook, we'll scrape the top 50 bestselling books on Amazon.in. Amazon is the world's biggest bookstore, with more than 48.5 million books available on its website.

The tools we'll use to scrape Amazon.in are:
- **Python** - the language we'll write our code in.
- **Requests** - to fetch the web page.
- **BeautifulSoup** - to parse the page and extract data from it.
- **Pandas** - to organize the data extracted with BeautifulSoup and export it to a CSV file.

## Steps

We are going to scrape the page `https://www.amazon.in/gp/bestsellers/books/` and grab the book details.

The steps we will follow:
1. Import the necessary libraries.
2. Fetch `https://www.amazon.in/gp/bestsellers/books/` using the requests library.
3. Parse the page and extract the information using BeautifulSoup.
4. Convert the extracted data to a pandas dataframe and save it as a CSV file.

## Importing the necessary libraries

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Fetching the web page

In [2]:
# Target url
url = 'https://www.amazon.in/gp/bestsellers/books'

# Use the requests library to fetch the URL
response = requests.get(url)

In [3]:
# Checking the response status
response.status_code

200

A status code of `200` means the request succeeded, so we can go on to extract information from the page.
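
A `200` is not guaranteed on every run: a site may answer automated clients with an error status such as `503`. The sketch below is one hedged way to make that failure loud instead of silently parsing an error page. The `fetch_html` helper and the values in `REQUEST_HEADERS` are assumptions made for illustration and are not part of the original notebook.

```python
import requests

# Hypothetical header values; they are illustrative, not required by any API.
REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; bestsellers-notebook/0.1)",
    "Accept-Language": "en-IN,en;q=0.9",
}


def fetch_html(url, timeout=10):
    """Fetch a page and return its HTML, raising on any 4xx/5xx status."""
    response = requests.get(url, headers=REQUEST_HEADERS, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for error status codes
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(url)` would then either return the page text or stop with a `requests.HTTPError`, which is easier to debug than an empty result later on.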

## Parsing and extracting the information 

`response.text` is just one long string of HTML, which is awkward to search directly. So we'll create a BeautifulSoup object by passing `response.text` to the HTML parser, which turns the text into a tree of tags that we can query.

In [4]:
doc = BeautifulSoup(response.text, 'html.parser')

In [5]:
type(doc)

bs4.BeautifulSoup
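
To see what the parsed object gives us, here is a minimal sketch on a tiny hand-written HTML string rather than the real Amazon page. The markup and the `intro` class are made up for the example.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML document standing in for response.text
sample_html = (
    "<html><head><title>Bestsellers</title></head>"
    "<body><p class='intro'>Hello</p></body></html>"
)

sample_doc = BeautifulSoup(sample_html, "html.parser")

# Tags can be reached by name or searched by tag name and class
title_text = sample_doc.title.text
intro_text = sample_doc.find("p", class_="intro").text

print(title_text)
print(intro_text)
```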

We are ready to extract the book information from the Amazon page. Let's start by right-clicking the first book and selecting `inspect`.

Inspecting the page shows that the book details we want to scrape sit inside `div` tags with `class="zg-grid-general-faceout"`. Let's use this to grab the details of the top 50 bestselling books.

In [6]:
div_selection_class ='zg-grid-general-faceout'
div_tags = doc.find_all('div', class_ = div_selection_class)

In [7]:
# Check the number of div tags that were fetched
len(div_tags)

50
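
The same `find_all` call can be tried on a small stand-in document. This is a sketch with made-up markup; the `highlighted` and `sidebar` classes are hypothetical. It also shows that searching for a single class still matches a tag that carries additional classes.

```python
from bs4 import BeautifulSoup

# Made-up markup: two "book" divs and one unrelated div
sample_html = """
<div class="zg-grid-general-faceout">Book A</div>
<div class="zg-grid-general-faceout highlighted">Book B</div>
<div class="sidebar">Not a book</div>
"""

sample_doc = BeautifulSoup(sample_html, "html.parser")

# find_all returns every matching tag, in document order
matches = sample_doc.find_all("div", class_="zg-grid-general-faceout")
titles = [tag.text for tag in matches]

print(len(matches), titles)
```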

We got 50 tags. Let's check the second div tag, because the first one has insufficient information.

In [8]:
div_tags[1]

<div class="zg-grid-general-faceout"><div><a class="a-link-normal" href="/Atomic-Habits-James-Clear/dp/1847941834/ref=zg_bs_books_2/000-0000000-0000000?pd_rd_i=1847941834&amp;psc=1" role="link" tabindex="-1"><div class="a-section a-spacing-mini _p13n-zg-list-grid-desktop_maskStyle_noop__3Xbw5"><img alt="Atomic Habits: The life-changing million copy bestseller" class="a-dynamic-image p13n-sc-dynamic-image p13n-product-image" data-a-dynamic-image='{"https://images-eu.ssl-images-amazon.com/images/I/51S7KOWir7L._AC_UL302_SR302,200_.jpg":[302,200],"https://images-eu.ssl-images-amazon.com/images/I/51S7KOWir7L._AC_UL604_SR604,400_.jpg":[604,400],"https://images-eu.ssl-images-amazon.com/images/I/51S7KOWir7L._AC_UL906_SR906,600_.jpg":[906,600]}' height="200px" src="https://images-eu.ssl-images-amazon.com/images/I/51S7KOWir7L._AC_UL302_SR302,200_.jpg" style="max-width:302px;max-height:200px"/></div></a><a class="a-link-normal" href="/Atomic-Habits-James-Clear/dp/1847941834/ref=zg_bs_books_2/000-

As we can see, `div_tags[1]` contains a lot of markup that is difficult to read. So let's go back to the web page and `inspect` each element we want to grab individually.

In [9]:
#Book Name
div_tags[1].find('span').text

'Atomic Habits: The life-changing million copy bestseller'

In [10]:
#Author Name
a_selection_tag = 'a-size-small a-link-child'
div_tags[1].find('a', class_ = a_selection_tag).text

'James Clear'
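
One detail worth knowing when passing several classes in one string: BeautifulSoup then compares against the exact value of the `class` attribute, so the order of the classes matters. The sketch below illustrates this on a one-line, made-up snippet, and shows a CSS selector via `select_one` as an order-independent alternative.

```python
from bs4 import BeautifulSoup

# Made-up anchor tag using the same classes as the author link
sample_html = '<a class="a-size-small a-link-child" href="#">James Clear</a>'
sample_doc = BeautifulSoup(sample_html, "html.parser")

# Exact string, same order as in the markup: matches
exact = sample_doc.find("a", class_="a-size-small a-link-child")

# Same classes in a different order: no match
reordered = sample_doc.find("a", class_="a-link-child a-size-small")

# A CSS selector requires both classes but ignores their order
by_selector = sample_doc.select_one("a.a-link-child.a-size-small")

print(exact.text, reordered, by_selector.text)
```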

In [11]:
# Book Url
a_selection_tag = 'a-link-normal'
div_tags[1].find('a', class_ = a_selection_tag)['href']

# Book Url with a complete working link
base_url = 'https://amazon.in'
base_url + div_tags[1].find('a', class_ = a_selection_tag)['href']

'https://amazon.in/Atomic-Habits-James-Clear/dp/1847941834/ref=zg_bs_books_2/000-0000000-0000000?pd_rd_i=1847941834&psc=1'
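
String concatenation works here because every `href` starts with `/`. As an alternative sketch, the standard library's `urljoin` builds the same link and leaves already-absolute URLs untouched. The shortened `relative_href` and the `example.com` address are illustrative values, not taken from the page.

```python
from urllib.parse import urljoin

base_url = "https://amazon.in"

# Illustrative values: a shortened relative link and an absolute link
relative_href = "/Atomic-Habits-James-Clear/dp/1847941834"
absolute_href = "https://example.com/already-absolute"

full_url = urljoin(base_url, relative_href)
unchanged_url = urljoin(base_url, absolute_href)

print(full_url)
print(unchanged_url)
```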

In [12]:
# Star rating
span_selection_class = 'a-icon-alt'
div_tags[1].find('span', class_ = span_selection_class).text

'4.6 out of 5 stars'

In [13]:
# Number of ratings
span_selection_class = 'a-size-small'
div_tags[1].find('span', class_ = span_selection_class).text

'38,185'

In [14]:
# Book Type
span_selection_class = 'a-size-small a-color-secondary a-text-normal'
div_tags[1].find('span', class_ = span_selection_class).text

'Paperback'

In [15]:
# Book Price
span_selection_class = 'p13n-sc-price'
div_tags[1].find('span', class_ = span_selection_class).text

'₹593.00'
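
All of these values come back as strings. If numbers are needed later, for sorting or averaging, they can be converted with a few small helpers. This is a minimal sketch that assumes the formats shown in the outputs above; `parse_price`, `parse_rating` and `parse_count` are hypothetical names and not part of the original notebook.

```python
def parse_price(text):
    """Turn a price string such as '₹593.00' into a float."""
    return float(text.replace("₹", "").replace(",", "").strip())


def parse_rating(text):
    """Turn '4.6 out of 5 stars' into the float 4.6."""
    return float(text.split()[0])


def parse_count(text):
    """Turn a count string such as '38,185' into an int."""
    return int(text.replace(",", ""))


price = parse_price("₹593.00")
rating = parse_rating("4.6 out of 5 stars")
count = parse_count("38,185")

print(price, rating, count)
```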

We now have all the attributes we need to build a dataset of the top 50 bestselling books. Let's create a function that grabs all of them from a single div tag.

In [16]:
def books_info(div_tag):
    # Takes one book's div tag; returns tags (not text) so the caller can handle missing ones
    # Tag for the book name
    Book_Name_tags = div_tag.find('span')
    
    # Tag for the author's name (either a link or a plain span)
    Author_Name_tags = div_tag.find(['a', 'span'], class_ = ['a-size-small a-link-child', 'a-size-small a-color-base'])
    
    # Book URL, built from the relative link
    Book_URL = 'https://amazon.in' + div_tag.find('a', class_ = 'a-link-normal')['href']
    
    # Tag for the book type
    Book_Type_tags = div_tag.find('span', class_ = 'a-size-small a-color-secondary a-text-normal')
    
    # Tag for the book price
    Price_tags = div_tag.find('span', class_ = 'p13n-sc-price')
    
    # Tag for the star rating
    Star_Rating_tags = div_tag.find('span', class_ = 'a-icon-alt')
    
    # Row that contains the number of ratings
    Reviews_tags = div_tag.find('div', class_ = 'a-icon-row')
    
    return Book_Name_tags, Author_Name_tags, Book_URL, Book_Type_tags, Price_tags, Star_Rating_tags, Reviews_tags

Let's create a dictionary that holds a list for each attribute, and run a `for` loop to append each value to its list.

In [17]:
books_dict = {
    'Book_Name':[],
    'Author_Name':[],
    'Book_URL':[],
    'Book_Type':[],
    'Price':[],
    'Star_Rating':[],
    'Reviews':[]
}

for i in range(0, len(div_tags)):
    info = books_info(div_tags[i])
    
    if info[0] is not None:
        books_dict['Book_Name'].append(info[0].text)
    else:
        books_dict['Book_Name'].append('Missing')
        
    if info[1] is not None:
        books_dict['Author_Name'].append(info[1].text)
    else:
        books_dict['Author_Name'].append('Missing')
        
    if info[2] is not None:
        books_dict['Book_URL'].append(info[2])
    else:
        books_dict['Book_URL'].append('Missing')
        
    if info[3] is not None:
        books_dict['Book_Type'].append(info[3].text)
    else:
        books_dict['Book_Type'].append('Missing')
        
    if info[4] is not None:
        books_dict['Price'].append(info[4].text)
    else:
        books_dict['Price'].append('Missing')
        
    if info[5] is not None:
        books_dict['Star_Rating'].append(info[5].text)
    else:
        books_dict['Star_Rating'].append('Missing')
        
    if info[6] is not None:
        books_dict['Reviews'].append(info[6].find('span', class_ = 'a-size-small').text)
    else:
        books_dict['Reviews'].append('Missing')
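
The loop above repeats the same `None` check for every field. One way to shorten it, shown here only as a sketch, is a small helper that returns the tag's text or a default. `text_or_missing` is a hypothetical name, and the sample markup is made up. Note that `get_text(strip=True)` also trims surrounding whitespace, which plain `.text` does not.

```python
from bs4 import BeautifulSoup


def text_or_missing(tag, default="Missing"):
    """Return the tag's text, or the default when find() returned None."""
    return tag.get_text(strip=True) if tag is not None else default


# Made-up div that has a price but no star rating
sample = BeautifulSoup(
    '<div><span class="p13n-sc-price">₹593.00</span></div>', "html.parser"
)

price_text = text_or_missing(sample.find("span", class_="p13n-sc-price"))
rating_text = text_or_missing(sample.find("span", class_="a-icon-alt"))

print(price_text, rating_text)
```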
        

## Convert the extracted data to a dataframe

Now, let's convert this dictionary to a pandas dataframe.

In [18]:
books_df = pd.DataFrame(books_dict)
books_df

| | Book_Name | Author_Name | Book_URL | Book_Type | Price | Star_Rating | Reviews |
|---|---|---|---|---|---|---|---|
| 0 | 13 Swing Trading Strategies\| Pankaj Ladha \| An... | Pankaj Ladha & Anant Ladha | https://amazon.in/Swing-Trading-Strategies-Pan... | Paperback | ₹299.00 | Missing | Missing |
| 1 | Atomic Habits: The life-changing million copy ... | James Clear | https://amazon.in/Atomic-Habits-James-Clear/dp... | Paperback | ₹593.00 | 4.6 out of 5 stars | 38185 |
| 2 | My First Library: Boxset of 10 Board Books for... | Wonder House Books | https://amazon.in/My-First-Library-Boxset-Boar... | Board book | ₹399.00 | 4.5 out of 5 stars | 48570 |
| 3 | The Psychology of Money | Morgan Housel | https://amazon.in/Psychology-Money-Morgan-Hous... | Paperback | ₹232.00 | 4.6 out of 5 stars | 31286 |
| 4 | Oswaal CBSE Term 2 English Science Social Scie... | Oswaal Editorial Board | https://amazon.in/Oswaal-English-Science-Stand... | Product Bundle | ₹398.00 | 4.6 out of 5 stars | 450 |
| 5 | Word Power Made Easy | Norman Lewis | https://amazon.in/Word-Power-Made-Norman-Lewis... | Paperback | ₹98.00 | 4.4 out of 5 stars | 36869 |
| 6 | Ikigai: The Japanese secret to a long and happ... | Héctor García | https://amazon.in/Ikigai-H%C3%A9ctor-Garc%C3%A... | Hardcover | ₹375.26 | 4.6 out of 5 stars | 31146 |
| 7 | Grandma's Bag of Stories: Collection of 20+ Il... | Sudha Murty | https://amazon.in/Grandmas-Bag-Stories-Sudha-M... | Paperback | ₹155.00 | 4.6 out of 5 stars | 13488 |
| 8 | Rich Dad Poor Dad: What the Rich Teach Their K... | Robert T. Kiyosaki | https://amazon.in/Rich-Dad-Poor-Middle-Updates... | Mass Market Paperback | ₹339.00 | 4.6 out of 5 stars | 57148 |
| 9 | 27 Years UPSC IAS/ IPS Prelims Topic-wise Solv... | Mrunal Patel | https://amazon.in/Years-Prelims-Topic-wise-Sol... | Paperback | ₹295.00 | 4.5 out of 5 stars | 1661 |


### Create a csv file with the extracted information

Finally, let's save the dataframe to a CSV file.

In [20]:
books_df.to_csv('amazon_books.csv', index = False)
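
Leaving the index out matters when the file is read back. The sketch below uses a two-row stand-in dataframe with illustrative values and writes to in-memory strings instead of a file, to show that a saved index returns as an extra `Unnamed: 0` column.

```python
import io

import pandas as pd

# Two-row stand-in for books_df (illustrative values)
sample_df = pd.DataFrame(
    {
        "Book_Name": ["Atomic Habits", "Word Power Made Easy"],
        "Price": ["₹593.00", "₹98.00"],
    }
)

# to_csv returns the CSV text when no path is given
with_index = sample_df.to_csv()
without_index = sample_df.to_csv(index=False)

columns_with_index = list(pd.read_csv(io.StringIO(with_index)).columns)
columns_without_index = list(pd.read_csv(io.StringIO(without_index)).columns)

print(columns_with_index)
print(columns_without_index)
```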

**We are done with this web scraping project.**

**Happy Coding :)**

**Author** - **Ashish Kumar**

**Date** - **13 April 2021**