# Web Scraping Project: Amazon Top-Rated Books



![](https://imgur.com/MB4HWGU.png)

![](https://imgur.com/bX2qT9t)

## Web Scraping

Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning. The source is usually unstructured HTML, which is converted into structured data and stored in a spreadsheet or a database.

### The steps we'll follow:

- We're going to scrape https://www.amazon.in/gp/bestsellers/books/
- We'll get a list of topics.
- For each topic, we'll get the topic title and the topic page URL.
- For each topic, we'll get the top 50 books from the topic page.
- For each book, we'll grab the book name, book URL, author name, book price, star rating, and the number of customer ratings (stored as `rating`).
- We'll save the data to a CSV file using the pandas library.

The output will look like this:

`title`, `url`, `book_name`, `books_url`, `author_name`, `book_price`, `star_rating`, `rating`

![](https://imgur.com/UBu96J0.png)

## How to run the code

Option 1: Running using free online resources (1-click, recommended)
The easiest way to start executing the code is to click the Run button at the top of this page and select Run on Binder. You can also select "Run on Colab" or "Run on Kaggle", but you'll need to create an account on Google Colab or Kaggle to use these platforms.

Option 2: Running on your computer locally
To run the code on your computer locally, you'll need to set up Python, download the notebook and install the required libraries. We recommend using the Conda distribution of Python. Click the Run button at the top of this page, select the Run Locally option, and follow the instructions.


In [1]:
!pip install jovian --upgrade --quiet

In [2]:
import jovian

In [3]:
# Execute this to save new versions of the notebook
jovian.commit(project="Web-scraping-top-rated ")

<IPython.core.display.Javascript object>

[jovian] Updating notebook "renuverma55/web-scraping-top-rated" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/renuverma55/web-scraping-top-rated


'https://jovian.ai/renuverma55/web-scraping-top-rated'

## Scrape the list of books from Amazon

 - We will use the Requests library to download the page.
 - We will use BeautifulSoup to parse the page and extract information.
 - We will convert the results to a pandas DataFrame.

### Install and import required libraries.

In [4]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Downloading a web page using requests

When you access a URL using a web browser, the browser downloads the contents of the web page the URL points to and displays them on the screen. Before we can extract information from a web page, we need to download the page using Python.

We'll use a library called requests to download web pages from the internet. We can download a web page using the requests.get function.

In [5]:
topics_url = 'https://www.amazon.in/gp/bestsellers/books/'

In [6]:
response = requests.get(topics_url)

In [7]:
type(response)

requests.models.Response

If the request was successful, response.status_code is set to a value between 200 and 299.

In [8]:
response.status_code

200
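The 200 to 299 range described above can be checked directly. This is a minimal sketch; `is_success` is a hypothetical helper name, not part of the requests library.

```python
def is_success(status_code):
    """Return True for any 2xx HTTP status code."""
    return 200 <= status_code <= 299


# A few status codes a scraper may encounter
for code in (200, 204, 404, 503):
    print(code, is_success(code))
```

With a real response, this would be called as `is_success(response.status_code)`. The requests library also provides `response.raise_for_status()`, which raises an `HTTPError` for 4xx and 5xx responses.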

The contents of the web page can be accessed using the .text property of the response.

In [9]:
page_contents = response.text
len(page_contents)    # `len` returns the number of characters in the page text

306693

The page contains 306,693 characters. Let's view the first 1,000 characters of the web page.

In [11]:
page_contents[:1000]

'<!doctype html><html lang="en-in" class="a-no-js" data-19ax5a9jf="dingo"><!-- sp:feature:head-start -->\n<head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/>\n<!-- sp:end-feature:head-start -->\n\n<!-- sp:feature:cs-optimization -->\n<meta http-equiv=\'x-dns-prefetch-control\' content=\'on\'>\n<link rel="dns-prefetch" href="https://images-eu.ssl-images-amazon.com">\n<link rel="dns-prefetch" href="https://m.media-amazon.com">\n<link rel="dns-prefetch" href="https://completion.amazon.com">\n<!-- sp:end-feature:cs-optimization -->\n\n<!-- sp:feature:aui-assets -->\n<link rel="stylesheet" href="https://images-eu.ssl-images-amazon.com/images/I/11EIQ5IGqaL._RC|01ZTHTZObnL.css,41C-I1lXVwL.css,31ufSReDtSL.css,013z33uKh2L.css,017DsKjNQJL.css,0131vqwP5UL.css,41EWOOlBJ9L.css,11TIuySqr6L.css,01ElnPiDxWL.css,11Qjwq-j69L.css,01Dm5eKVxwL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,21P6CS3L9LL.css,01oDR3IULNL.css,41CYNGpGlrL.css,01XPHJk60-L.css,01smHc51S9L.css,21aPhFy+riL.cs

Let's save the contents to a file with the .html extension.

In [12]:
with open('amazon_bestseller_books.html', 'w', encoding='utf-8') as file:  # explicit encoding so non-ASCII characters are written reliably
    file.write(page_contents)

This creates an HTML file named amazon_bestseller_books.html.

![](https://imgur.com/C8jLAfB.png)

## Parse the HTML source code using BeautifulSoup library

In [13]:
doc = BeautifulSoup(page_contents, 'html.parser') 

In [14]:
type(doc)

bs4.BeautifulSoup

## Inspecting HTML in the Browser

We can view the source code of any web page right within the browser by right-clicking anywhere on the page and selecting the "Inspect" option.


![](https://imgur.com/tXEOKKj.png)

As shown above, the topic titles are in `div` tags with the class "_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8".

Let's write a function to download the page.

In [19]:
def get_topic_page(topic_url):
    response = requests.get(topic_url)
    #Check successful Response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(topic_url))
    #Parse using Beautiful Soup
    topic_doc = BeautifulSoup(response.text,'html.parser')
    return topic_doc
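`get_topic_page` raises as soon as a single request fails, which would stop a scrape that visits many topic pages. One way to soften this is a small retry wrapper. The sketch below is an assumption, not part of the original notebook: `fetch_with_retries` and `flaky_fetch` are hypothetical names, and the retry count and delay are arbitrary. A stand-in fetcher is used so the example runs without network access.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, delay_seconds=1.0):
    """Call fetch(url) up to `attempts` times, sleeping between failures.

    `fetch` is any callable that returns a result or raises an Exception,
    for example the get_topic_page function defined above.
    """
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose: get_topic_page raises Exception
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise last_error


# Stand-in fetcher (hypothetical) that fails twice, then succeeds
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise Exception('Failed to load page {}'.format(url))
    return 'parsed document for ' + url


result = fetch_with_retries(
    flaky_fetch, 'https://www.amazon.in/gp/bestsellers/books', delay_seconds=0
)
print(result, len(calls))
```

In the notebook, the call would look like `fetch_with_retries(get_topic_page, topic_url)`.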

In [20]:
doc = get_topic_page('https://www.amazon.in/gp/bestsellers/books')

In [21]:
sel_class = '_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8'
topic_title_tags = doc.find_all('div',class_=sel_class) 

In [22]:
len(topic_title_tags)  # number of tags that contain a topic title

35

Let's create the helper functions.
#### Topic Title

In [23]:
def get_topic_titles(topic_title_tags): # returns the topic title from a single topic tag
    topic = topic_title_tags.find('a').text
    return topic


In [24]:
a = get_topic_titles(topic_title_tags[1])

In [25]:
a

'Action & Adventure'

The Amazon bestsellers page yields 35 category tags, and each category page lists its 50 best-selling books.

In [26]:
def get_title(topic_title_tags):  # repeats the topic title once for each of the 50 books in the topic
    title = []
    for j in range(50):
        title.append(get_topic_titles(topic_title_tags))
    return title


#### Topic URL

In [27]:
def get_topic_urls(topic_title_tags): # this function is created to get the topic title url
    base_url ='https://www.amazon.in/'
    table_tag_href = base_url + topic_title_tags.find('a')['href']
    return table_tag_href


In [28]:
b = get_topic_urls(topic_title_tags[1])

In [29]:
b

'https://www.amazon.in//gp/bestsellers/books/1318158031'
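The URL above contains a double slash after the domain because `base_url` ends with `/` and the link's `href` starts with `/`. A minimal sketch of one way to avoid this uses `urljoin` from the standard library; the `href` value below is inferred from the output above.

```python
from urllib.parse import urljoin

base_url = 'https://www.amazon.in/'
href = '/gp/bestsellers/books/1318158031'  # relative link as found in the <a> tag

print(base_url + href)          # plain concatenation keeps both slashes
print(urljoin(base_url, href))  # urljoin resolves the path against the base
```

`urljoin` also handles an `href` that is already an absolute URL, in which case it returns the `href` unchanged.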

In [30]:
def get_url(topic_title_tags):   # repeats the topic URL once for each of the 50 books in the topic
    url = []
    for j in range(50):
        url.append(get_topic_urls(topic_title_tags))
    return url
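`get_title` and `get_url` both assume exactly 50 books per topic. If a topic page yielded a different number of books, the columns would end up with different lengths, and `pd.DataFrame` raises a `ValueError` when given a dict of unequal-length lists. A hedged alternative is to repeat the topic-level value once per scraped book; `repeat_for_books` is a hypothetical helper and the book names below are stand-in data.

```python
def repeat_for_books(value, books):
    """Repeat a topic-level value once per scraped book."""
    return [value] * len(books)


# Stand-in data: a topic whose page yielded only three books
book_names = ['Book A', 'Book B', 'Book C']
titles = repeat_for_books('Action & Adventure', book_names)
urls = repeat_for_books('https://www.amazon.in//gp/bestsellers/books/1318158031', book_names)

print(len(book_names), len(titles), len(urls))
```

All three lists have the same length regardless of how many books the page returned.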


#### Book Name

In [31]:
def get_books_name(topic_doc):
    books_tag = topic_doc.find_all('div',class_="a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc")
    books_name = []
    for i in range(len( books_tag)):
        try:
            # The book name is the text of the first nested div inside the first span
            name_tag = books_tag[i].find('div',class_='zg-grid-general-faceout').find('span').find('div').text
            books_name.append(name_tag)
        except AttributeError:
            # One of the nested tags is missing for this book
            books_name.append(None)
    return books_name
                   
                   

In [32]:
books_name = get_books_name(get_topic_page(b)) # b contains the URL of the first topic, "Action & Adventure"

In [33]:
books_name[0:5] # the first 5 book names from "Action & Adventure"

["Harry Potter and the Philosopher's Stone",
 'The Silent Patient: The record-breaking, multimillion copy Sunday Times bestselling thriller and Richard & Judy book club pick',
 'Something I Never Told You',
 'The Complete Novels of Sherlock Holmes',
 'A Thousand Kisses Deep']

#### Book URL

In [34]:
def get_books_url(topic_doc):
    books_tag = topic_doc.find_all('div',class_="a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc")
    base_url = 'https://www.amazon.in/'
    books_url = []
    for i in range(len( books_tag)):
        url = base_url + books_tag[i].find('a')['href']
        books_url.append(url)
    return books_url

In [35]:
books_url = get_books_url(get_topic_page(b))

In [36]:
books_url[:5] # the first 5 URLs

['https://www.amazon.in//Harry-Potter-Philosophers-Stone-Rowling-ebook/dp/B019PIOJYU/ref=zg_bs_1318158031_1/000-0000000-0000000?pd_rd_i=B019PIOJYU&psc=1',
 'https://www.amazon.in//Silent-Patient-Alex-Michaelides/dp/1409181634/ref=zg_bs_1318158031_2/000-0000000-0000000?pd_rd_i=1409181634&psc=1',
 'https://www.amazon.in//Something-I-Never-Told-You/dp/0143445901/ref=zg_bs_1318158031_3/000-0000000-0000000?pd_rd_i=0143445901&psc=1',
 'https://www.amazon.in//Complete-Novels-Sherlock-Holmes/dp/8175994312/ref=zg_bs_1318158031_4/000-0000000-0000000?pd_rd_i=8175994312&psc=1',
 'https://www.amazon.in//Thousand-Kisses-Deep-Novoneel-Chakraborty/dp/014345823X/ref=zg_bs_1318158031_5/000-0000000-0000000?pd_rd_i=014345823X&psc=1']

#### Author Name

In [40]:
def get_author_name(topic_doc):
    books_tag = topic_doc.find_all('div',class_="a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc")
    author_name = []
    for i in range(len( books_tag)):
        try:
            author_tag = books_tag[i].find('div',class_='a-row a-size-small').text
            author_name.append(author_tag)
        except AttributeError:
            # No author row found for this book; keep the list aligned with the other columns
            author_name.append(None)
    return author_name

In [41]:
author_name = get_author_name(get_topic_page(b))

In [42]:
author_name[:5]

['J.K. Rowling',
 'Alex Michaelides',
 'Shravya Bhinder',
 'Arthur Conan Doyle',
 'Novoneel Chakraborty']

#### Book Price

In [43]:
def get_book_price(topic_doc):
    books_tag = topic_doc.find_all('div',class_="a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc")
    book_price = []
    for i in range(len( books_tag)):
        try:
            price_tag =books_tag[i].find('span',class_='p13n-sc-price').text
            book_price.append(price_tag)
        except AttributeError:
            book_price.append(None)
          
    return book_price

In [44]:
book_price = get_book_price(get_topic_page(b))

In [45]:
book_price[0:5]

['₹124.95', '₹259.00', '₹175.00', '₹165.00', '₹139.00']
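The prices are scraped as strings with a currency symbol, so they cannot be summed or sorted numerically as they are. A minimal sketch of a conversion step follows; `parse_price` is a hypothetical helper, and the handling of thousands separators is an assumption about how larger prices are formatted.

```python
def parse_price(price_text):
    """Convert a price string such as '₹124.95' to a float, or None if missing."""
    if price_text is None:
        return None
    # Drop the rupee sign and any thousands separators before converting
    cleaned = price_text.replace('₹', '').replace(',', '').strip()
    return float(cleaned) if cleaned else None


prices = ['₹124.95', '₹259.00', None]
print([parse_price(p) for p in prices])
```

Missing prices stay `None`, matching how `get_book_price` records them.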

#### Star Rating

In [46]:
def get_star_rating(topic_doc):
    books_tag = topic_doc.find_all('div',class_="a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc")
    star_rating = []
    for i in range(len( books_tag)):
        try:
            star_tag = books_tag[i].find('div',class_='a-icon-row').text[0:3]
            star_rating.append(star_tag)
        except AttributeError:
            star_rating.append(None)
    return star_rating


In [47]:
star_rating = get_star_rating(get_topic_page(b))

In [48]:
star_rating[26:37]

['4.2', '4.4', '4.1', '4.7', '4.6', '4.4', '4.4', '4.4', '4.0', '4.5', '4.8']

#### Rating


In [49]:
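def get_rating(topic_doc):
    books_tag = topic_doc.find_all('div',class_="a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc")
    rating = []
    for i in range(len( books_tag)):
        try:
            # The second span inside the rating row holds the number of customer ratings
            rating_tag = books_tag[i].find('div',class_='a-icon-row').find_all('span')[1].text
            rating.append(rating_tag)
        except (AttributeError, IndexError):
            # No rating row for this book, or the row has fewer than two spans
            rating.append(None)
    return rating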

In [50]:
rating = get_rating(get_topic_page(b))

In [51]:
rating[:5]

['25,438', '116,881', '1,898', '13,953', None]
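Star ratings and rating counts are also scraped as strings, and the counts contain thousands separators. A short sketch of converting both to numbers while keeping missing values as `None`; the helper names are hypothetical.

```python
def parse_star_rating(text):
    """Convert a star rating such as '4.7' to a float, or None if missing."""
    return float(text) if text is not None else None


def parse_rating_count(text):
    """Convert a rating count such as '116,881' to an int, or None if missing."""
    return int(text.replace(',', '')) if text is not None else None


counts = ['25,438', '116,881', '1,898', '13,953', None]
print([parse_rating_count(c) for c in counts])
print([parse_star_rating(s) for s in ['4.2', '4.4', None]])
```

Numeric columns make it possible to sort or filter the final DataFrame by popularity or rating.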

## Let's create a function to put them together

In [87]:
def scrape_topic_list(main_url):
    # One list per output column; all lists must end up the same length
    main_dict = {
        'title': [],
        'url': [],
        'book_name': [],
        'books_url': [],
        'author_name': [],
        'book_price': [],
        'star_rating': [],
        'rating': []
    }

    doc = get_topic_page(main_url)
    sel_class = '_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8'
    topic_title_tags = doc.find_all('div', class_=sel_class)
    # Index 0 is skipped; the topics start at index 1 ('Action & Adventure')
    for i in topic_title_tags[1:35]:
        url = get_topic_urls(i)
        topic_doc = get_topic_page(url)
        main_dict['book_name'].extend(get_books_name(topic_doc))
        main_dict['author_name'].extend(get_author_name(topic_doc))
        main_dict['book_price'].extend(get_book_price(topic_doc))
        main_dict['star_rating'].extend(get_star_rating(topic_doc))
        main_dict['rating'].extend(get_rating(topic_doc))
        main_dict['books_url'].extend(get_books_url(topic_doc))
        main_dict['title'].extend(get_title(i))
        main_dict['url'].extend(get_url(i))
    df = pd.DataFrame(main_dict)
    return df

In [103]:
scrape_df= scrape_topic_list('https://www.amazon.in/gp/bestsellers/books/')

In [104]:
scrape_df

| | title | url | book_name | books_url | author_name | book_price | star_rating | rating |
|---|---|---|---|---|---|---|---|---|
| 0 | Action & Adventure | https://www.amazon.in//gp/bestsellers/books/13... | Harry Potter and the Philosopher's Stone | https://www.amazon.in//Harry-Potter-Philosophe... | J.K. Rowling | ₹124.95 | 4.7 | 25438 |
| 1 | Action & Adventure | https://www.amazon.in//gp/bestsellers/books/13... | The Silent Patient: The record-breaking, multi... | https://www.amazon.in//Silent-Patient-Alex-Mic... | Alex Michaelides | ₹259.00 | 4.5 | 116881 |
| 2 | Action & Adventure | https://www.amazon.in//gp/bestsellers/books/13... | Something I Never Told You | https://www.amazon.in//Something-I-Never-Told-... | Shravya Bhinder | ₹175.00 | 4.3 | 1898 |
| 3 | Action & Adventure | https://www.amazon.in//gp/bestsellers/books/13... | The Complete Novels of Sherlock Holmes | https://www.amazon.in//Complete-Novels-Sherloc... | Arthur Conan Doyle | ₹165.00 | 4.5 | 13953 |
| 4 | Action & Adventure | https://www.amazon.in//gp/bestsellers/books/13... | A Thousand Kisses Deep | https://www.amazon.in//Thousand-Kisses-Deep-No... | Novoneel Chakraborty | ₹139.00 | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1695 | Travel | https://www.amazon.in//gp/bestsellers/books/13... | An Unforgettable Journey To Hometown: A travel... | https://www.amazon.in//Unforgettable-Journey-H... | Vivek Shukla | ₹99.00 | 4.6 | 117 |
| 1696 | Travel | https://www.amazon.in//gp/bestsellers/books/13... | Aazadi Mera Brand (Hindi) | https://www.amazon.in//Aazadi-Mera-Brand-Anura... | Anuradha Beniwal | ₹158.00 | 4.7 | 341 |
| 1697 | Travel | https://www.amazon.in//gp/bestsellers/books/13... | Around the World in 80 Trains: A 45,000-Mile A... | https://www.amazon.in//Around-World-80-Trains-... | Monisha Rajesh | ₹214.00 | 4.4 | 997 |
| 1698 | Travel | https://www.amazon.in//gp/bestsellers/books/13... | Travels in the Mogul Empire: A.D. 1656-1668 | https://www.amazon.in//Travels-Mogul-Empire-16... | Francois Bernier | ₹612.00 | 4.7 | 25 |


## Save the extracted information into CSV file

In [107]:
scrape_df.to_csv('top_rated_books.csv', index=False)     # Write the final DataFrame 'scrape_df' to a CSV file without the index column
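After saving, a quick check of the file can catch missing values early. The sketch below uses the standard library's `csv` module rather than pandas so that it runs on its own. The in-memory sample stands in for `top_rated_books.csv` and uses a shortened set of columns; `count_missing` is a hypothetical helper.

```python
import csv
import io

# Two rows based on the scraped values above stand in for the saved file
sample_csv = io.StringIO(
    'title,book_name,author_name,book_price,star_rating,rating\n'
    'Action & Adventure,Something I Never Told You,Shravya Bhinder,₹175.00,4.3,"1,898"\n'
    'Action & Adventure,A Thousand Kisses Deep,Novoneel Chakraborty,₹139.00,,\n'
)


def count_missing(rows, column):
    """Count rows whose value in `column` is empty."""
    return sum(1 for row in rows if not row[column])


rows = list(csv.DictReader(sample_csv))
print(len(rows), count_missing(rows, 'rating'))
```

For the real file, `sample_csv` would be replaced by `open('top_rated_books.csv', newline='', encoding='utf-8')`.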


## Future Work
 * Scrape the books listed on the subsequent pages of each topic and build a larger DataFrame for future analysis.
 * Explore other, more complex websites.
 * Explore how to scrape data using Selenium.

## Summary

 * Installed and imported the required libraries.
 * Downloaded and parsed the bestseller page's HTML source using Requests and BeautifulSoup to get the topic titles and topic URLs.
 * Extracted information from each topic page.
 * Created a pandas DataFrame using the main function.
 * Saved the data to a CSV file using the pandas library.

## References
* Python official documentation. https://docs.python.org/3/

* Aakash N S, Introduction to Web Scraping, 2021. https://jovian.ai/aakashns/python-web-scraping-and-rest-api

* HTML Dog, HTML tutorial. https://www.htmldog.com/guides/html/

In [111]:
jovian.commit(files=['top_rated_books.csv'])

<IPython.core.display.Javascript object>

[jovian] Updating notebook "renuverma55/web-scraping-top-rated" on https://jovian.ai
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ai/renuverma55/web-scraping-top-rated


'https://jovian.ai/renuverma55/web-scraping-top-rated'