# Scraping Amazon Best Seller Books using Python

#### Project outline

Here are the steps we'll follow:

- We're going to scrape https://www.amazon.in/gp/bestsellers/books/

- We'll first get the list of book categories. For each category we'll get the category name and the category page URL.

- For each category we'll get the top 50 books.

- For each book we'll grab the Book Name, Author Name, Stars, Number of Reviews, Edition Type, Price and the Book URL.

- For each category we'll create a CSV file in the following format:

Book_Name,Author_Name,Book_URL,Edition_Type,Price,Star_Rating,Reviews

Harry Potter and the Philosopher's Stone,J.K. Rowling,https://amazon.in/Harry-Potter-Philosophers-Stone-Rowling-ebook/dp/B019PIOJYU/ref=zg_bs_1318158031_1/000-0000000-0000000?pd_rd_i=B019PIOJYU&psc=1,Kindle Edition,₹299.00,4.7 out of 5 stars,"39,452"

"The Silent Patient: The record-breaking, multimillion copy Sunday Times bestselling thriller and Richard & Judy book club pick",Alex Michaelides,https://amazon.in/Silent-Patient-Alex-Michaelides/dp/1409181634/ref=zg_bs_1318158031_2/000-0000000-0000000?pd_rd_i=1409181634&psc=1,Paperback,₹279.00,4.5 out of 5 stars,"92,969"
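
Fields such as the review count (`39,452`) or a long title contain commas, so they have to be quoted to stay in one column. Pandas' `to_csv` handles this on its own; the sketch below shows the same behaviour with Python's standard `csv` module. The row is taken from the example above, with the URL shortened to a placeholder for readability.

```python
import csv
import io

header = ['Book_Name', 'Author_Name', 'Book_URL', 'Edition_Type',
          'Price', 'Star_Rating', 'Reviews']

# One illustrative row; the URL is a shortened placeholder, not a real link.
row = ["Harry Potter and the Philosopher's Stone", 'J.K. Rowling',
       'https://amazon.in/...', 'Kindle Edition', '₹299.00',
       '4.7 out of 5 stars', '39,452']

buffer = io.StringIO()
writer = csv.writer(buffer)  # default QUOTE_MINIMAL quotes only where needed
writer.writerow(header)
writer.writerow(row)
csv_text = buffer.getvalue()

# Reading the text back recovers the original fields, commas included.
parsed = list(csv.reader(io.StringIO(csv_text)))
```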

## Tools used to scrape the list of book categories from Amazon

- Requests: to download the page
- BS4 (Beautiful Soup): to parse and extract information
- Pandas: to convert the results to a DataFrame

In [50]:
!pip install requests --upgrade --quiet
import requests

In [51]:
url='https://www.amazon.in/gp/bestsellers/books/'
response = requests.get(url)

In [52]:
response.status_code

200
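
A status code of 200 means the download worked, but a large site does not always answer that way; automated requests are sometimes rejected or throttled. As a hedged sketch (the function name `fetch_with_retry`, the retry count and the delay are illustrative assumptions, not part of the original notebook), the download can be wrapped so that a non-200 answer is retried a few times before giving up. The `get` argument is any callable that behaves like `requests.get`.

```python
import time

def fetch_with_retry(get, url, retries=3, delay=1.0):
    """Call get(url) until it returns status 200 or the attempts run out.

    `get` is expected to behave like requests.get: it returns an object
    with `status_code` and `text` attributes.
    """
    last_status = None
    for attempt in range(1, retries + 1):
        response = get(url)
        if response.status_code == 200:
            return response.text
        last_status = response.status_code
        if attempt < retries:
            time.sleep(delay)  # wait before the next attempt
    raise RuntimeError(
        'Failed to load page {} (last status {})'.format(url, last_status))
```

With Requests imported, it could be called as `page_content = fetch_with_retry(requests.get, url)`.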

In [53]:
len(response.text)

323337

In [54]:
page_content = response.text

In [55]:
page_content[0:500]

'<!doctype html><html lang="en-in" class="a-no-js" data-19ax5a9jf="dingo"><!-- sp:feature:head-start -->\n<head><script>var aPageStart = (new Date()).getTime();</script><meta charset="utf-8"/>\n<!-- sp:end-feature:head-start -->\n<!-- sp:feature:csm:head-open-part1 -->\n\n<!-- sp:end-feature:csm:head-open-part1 -->\n<!-- sp:feature:cs-optimization -->\n<meta http-equiv=\'x-dns-prefetch-control\' content=\'on\'>\n<link rel="dns-prefetch" href="https://images-eu.ssl-images-amazon.com" crossorigin>\n<link rel="p'

In [56]:
with open('amazon_bestseller.html',"w", encoding="utf-8") as f:
    f.write(page_content)
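
Saving the page to disk also makes it possible to keep exploring the HTML without downloading it again on every run. A minimal sketch, assuming a helper named `load_or_download` (hypothetical, not in the original notebook) and a `download` callable that returns the page text:

```python
from pathlib import Path

def load_or_download(path, download):
    """Return the cached text at `path`, calling `download()` only if needed."""
    cache = Path(path)
    if cache.exists():
        return cache.read_text(encoding='utf-8')
    text = download()
    cache.write_text(text, encoding='utf-8')
    return text
```

For this notebook the call could look like `load_or_download('amazon_bestseller.html', lambda: requests.get(url).text)`.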

## Use Beautiful Soup to parse and extract information
- Parse and explore the structure of the downloaded web page using Beautiful Soup.

- Use the right properties and methods to extract the required information.

- Create functions to extract from the page into lists and dictionaries.

- (Optional) Use a REST API to acquire additional information if required.
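
To make these steps concrete, here is a small sketch that parses a hand-written HTML fragment instead of the live page. The fragment is modelled on the category links of the bestseller page; the class name `browse-item` is a simplified stand-in, not Amazon's real class.

```python
from bs4 import BeautifulSoup

html = '''
<div class="browse-item" role="treeitem">
  <a href="/gp/bestsellers/books/1318158031">Action &amp; Adventure</a>
</div>
<div class="browse-item" role="treeitem">
  <a href="/gp/bestsellers/books/1318052031">Arts, Film &amp; Photography</a>
</div>
'''

doc = BeautifulSoup(html, 'html.parser')

# find() returns the first match (or None); find_all() returns every match.
first_div = doc.find('div', class_='browse-item')
all_divs = doc.find_all('div', class_='browse-item')

# .text gives the visible text, ['href'] reads an attribute.
titles = [div.text.strip() for div in all_divs]
links = [div.find('a')['href'] for div in all_divs]
```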

In [57]:
!pip install beautifulsoup4 --upgrade --quiet
from bs4 import BeautifulSoup

In [58]:
soup = BeautifulSoup(page_content, 'html.parser')

In [59]:
type(soup)

bs4.BeautifulSoup

In [60]:
soup.find('title')

<title>Amazon.in Bestsellers: The most popular items in Books</title>

In [61]:
soup.find('img')

<img alt="" src="https://m.media-amazon.com/images/G/31/social_share/amazon_logo._CB633266945_.png" style="display:none"/>

In [62]:
selection_class = "_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8"
books_title_tag = soup.find_all('div',{ 'class':selection_class})

In [63]:
len(books_title_tag)

36

In [64]:
books_title_tag = books_title_tag[1:]  # skip the first entry, the top-level "Books" item
books_title_tag[:3]

[<div class="_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8" role="treeitem"><a href="/gp/bestsellers/books/1318158031">Action &amp; Adventure</a></div>,
 <div class="_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8" role="treeitem"><a href="/gp/bestsellers/books/1318052031">Arts, Film &amp; Photography</a></div>,
 <div class="_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8" role="treeitem"><a href="/gp/bestsellers/books/1318064031">Biographies, Diaries &amp; True Accounts</a></div>]
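
The class names above end in suffixes such as `__1rdKf`, which look auto-generated and may change when the page is rebuilt (an assumption; nothing in this notebook confirms how stable they are). The tags printed above also carry `role="treeitem"`, so a selector that does not depend on those class names could be sketched as follows. The sample HTML follows the shape of the output above with the class attribute omitted, the markup of the link-less "Books" entry is assumed, and `category_links` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Entries shaped like the output above (class attribute omitted), plus an
# assumed link-less first entry standing in for the top-level "Books" item.
sample = '''
<div role="treeitem"><span>Books</span></div>
<div role="treeitem"><a href="/gp/bestsellers/books/1318158031">Action &amp; Adventure</a></div>
<div role="treeitem"><a href="/gp/bestsellers/books/1318052031">Arts, Film &amp; Photography</a></div>
'''

def category_links(soup):
    """Return (title, href) pairs for every treeitem div that contains a link."""
    pairs = []
    for div in soup.find_all('div', attrs={'role': 'treeitem'}):
        anchor = div.find('a', href=True)
        if anchor is not None:  # entries without a link are skipped
            pairs.append((anchor.get_text(strip=True), anchor['href']))
    return pairs

pairs = category_links(BeautifulSoup(sample, 'html.parser'))
```

On the live page this selector may match more or fewer elements than the class-based one, so it is worth comparing the counts before relying on it.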

## Extracting Titles and URLs of the Book Categories

In [65]:
# Extract the titles of the book categories
def get_topic_titles(soup):
    selection_class = "_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8"
    books_title_tag = soup.find_all('div', {'class':selection_class})
    topic_titles = []
    
    for i in books_title_tag:
        topic_titles.append(i.text.strip())
    return topic_titles

In [66]:
get_topic_titles(soup)

['Books',
 'Action & Adventure',
 'Arts, Film & Photography',
 'Biographies, Diaries & True Accounts',
 'Business & Economics',
 "Children's Books",
 'Comics & Mangas',
 'Computing, Internet & Digital Media',
 'Crafts, Home & Lifestyle',
 'Crime, Thriller & Mystery',
 'Engineering',
 'Exam Preparation',
 'Fantasy, Horror & Science Fiction',
 'Health, Family & Personal Development',
 'Health, Fitness & Nutrition',
 'Higher Education Textbooks',
 'Historical Fiction',
 'History',
 'Humour',
 'Language, Linguistics & Writing',
 'Law',
 'Literature & Fiction',
 'Maps & Atlases',
 'Medicine & Health Sciences',
 'Politics',
 'Reference',
 'Religion',
 'Romance',
 'School Books',
 'Science & Mathematics',
 'Sciences, Technology & Medicine',
 'Society & Social Sciences',
 'Sports',
 'Teen & Young Adult',
 'Textbooks & Study Guides',
 'Travel']

In [67]:
def get_topic_urls(soup):
    selection_class = "_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8"
    books_url_tag = soup.find_all('div', {'class': selection_class})
    topic_urls = []

    # Note: each href already starts with '/', so appending it to a base URL
    # that ends with '/' gives the double slash visible in the output below.
    base_url = 'https://www.amazon.in/'
    for tag in books_url_tag:
        link = tag.find('a')
        if link is not None and link.get('href'):
            topic_urls.append(base_url + link['href'])
        else:
            # entries without a link (the first one, "Books") get a placeholder
            topic_urls.append('No URL')
    return topic_urls

In [68]:
print(len(get_topic_urls(soup)))

36


In [69]:
urls = get_topic_urls(soup)
urls

['No URL',
 'https://www.amazon.in//gp/bestsellers/books/1318158031',
 'https://www.amazon.in//gp/bestsellers/books/1318052031',
 'https://www.amazon.in//gp/bestsellers/books/1318064031',
 'https://www.amazon.in//gp/bestsellers/books/1318068031',
 'https://www.amazon.in//gp/bestsellers/books/64619755031',
 'https://www.amazon.in//gp/bestsellers/books/1318104031',
 'https://www.amazon.in//gp/bestsellers/books/1318105031',
 'https://www.amazon.in//gp/bestsellers/books/1318118031',
 'https://www.amazon.in//gp/bestsellers/books/1318161031',
 'https://www.amazon.in//gp/bestsellers/books/22960344031',
 'https://www.amazon.in//gp/bestsellers/books/4149751031',
 'https://www.amazon.in//gp/bestsellers/books/1402038031',
 'https://www.amazon.in//gp/bestsellers/books/1318128031',
 'https://www.amazon.in//gp/bestsellers/books/23033693031',
 'https://www.amazon.in//gp/bestsellers/books/4149418031',
 'https://www.amazon.in//gp/bestsellers/books/1318164031',
 'https://www.amazon.in//gp/bestsellers/bo
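
The double slash in these URLs comes from joining a base URL that ends in `/` with an href that starts with `/`. One way to avoid it, shown here as a sketch with the standard library's `urllib.parse.urljoin`, is to let the join resolve the path instead of concatenating strings:

```python
from urllib.parse import urljoin

base_url = 'https://www.amazon.in/'
href = '/gp/bestsellers/books/1318158031'

concatenated = base_url + href     # keeps both slashes
joined = urljoin(base_url, href)   # resolves the href against the base

print(concatenated)
print(joined)
```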

## Import Pandas to create Dataframe

In [70]:
!pip install pandas --upgrade --quiet
import pandas as pd

In [71]:
def scrape_topics():
    url='https://www.amazon.in/gp/bestsellers/books/'
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))
    
    soup = BeautifulSoup(response.text, 'html.parser')
    topics_dict = {
        'title': get_topic_titles(soup),
        'url': get_topic_urls(soup)
    }
    return pd.DataFrame(topics_dict)

In [72]:
scrape_topics().drop(0,axis=0)

| | title | url |
|---|---|---|
| 1 | Action & Adventure | https://www.amazon.in//gp/bestsellers/books/13... |
| 2 | Arts, Film & Photography | https://www.amazon.in//gp/bestsellers/books/13... |
| 3 | Biographies, Diaries & True Accounts | https://www.amazon.in//gp/bestsellers/books/13... |
| 4 | Business & Economics | https://www.amazon.in//gp/bestsellers/books/13... |
| 5 | Children's Books | https://www.amazon.in//gp/bestsellers/books/64... |
| 6 | Comics & Mangas | https://www.amazon.in//gp/bestsellers/books/13... |
| 7 | Computing, Internet & Digital Media | https://www.amazon.in//gp/bestsellers/books/13... |
| 8 | Crafts, Home & Lifestyle | https://www.amazon.in//gp/bestsellers/books/13... |
| 9 | Crime, Thriller & Mystery | https://www.amazon.in//gp/bestsellers/books/13... |
| 10 | Engineering | https://www.amazon.in//gp/bestsellers/books/22... |


In [73]:
element = scrape_topics().loc[1, 'url']
element

'https://www.amazon.in//gp/bestsellers/books/1318158031'

## Extracting Information for all Books

In [74]:
books_url = 'https://www.amazon.in//gp/bestsellers/books/1318158031'

In [75]:
response = requests.get(books_url)

In [76]:
books_doc = BeautifulSoup(response.text, 'html.parser')

In [77]:
div_tags = books_doc.find_all('div', {'class':"zg-grid-general-faceout"})
#div_tags

In [78]:
import os

books_doc = BeautifulSoup(response.text, 'html.parser')

books_dict={
        'Book_Name':[],
        'Author_Name':[],
        'Book_URL':[],
        'Edition_Type':[],
        'Price':[],
        'Star_Rating':[],
        'Reviews':[]
    }

def get_topic_page(books_url):
    # download the page
    response = requests.get(books_url)

    # check for a successful response
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(books_url))

    # parse using BeautifulSoup
    topic_doc = BeautifulSoup(response.text, 'html.parser')
    return topic_doc


def books_details(div_tags):
    
    #extracting book names
    Book_Name_tags =div_tags.find('span')
    
    #extracting author name of books
    Author_Name_tags = div_tags.find('a', class_ = 'a-size-small a-link-child')
    
    #extracting books urls
    Book_URL = 'https://amazon.in' + div_tags.find('a', class_ = 'a-link-normal')['href']
    
    #extracting edition type of books
    Edition_Type_tags = div_tags.find('span', class_ = 'a-size-small a-color-secondary a-text-normal')
    
    #extracting price tag of book
#     Price_tags = div_tags.find('span', class_ = 'p13n-sc-price')
    Price_tags = div_tags.find('span', class_ = '_cDEzb_p13n-sc-price_3mJ9Z')
    
    #extracting star rating of books
    Star_Rating_tags = div_tags.find('span', class_ = 'a-icon-alt')
    
    #extracting review of books
    Reviews_tags = div_tags.find('span', class_ = 'a-size-small')
    
    return Book_Name_tags, Author_Name_tags, Book_URL, Edition_Type_tags, Price_tags, Star_Rating_tags, Reviews_tags


def book_name(books_info):
    if books_info[0] is not None:
        books_dict['Book_Name'].append(books_info[0].text)
    else:
        books_dict['Book_Name'].append('Missing')
    return books_dict


def author_name(books_info):
    if books_info[1] is not None:
        books_dict['Author_Name'].append(books_info[1].text)
    else:
        books_dict['Author_Name'].append('Missing')
    return books_dict


def book_url(books_info):
    if books_info[2] is not None:
        books_dict['Book_URL'].append(books_info[2])
    else:
        books_dict['Book_URL'].append('Missing')
    return books_dict


def edition_type(books_info) :
    if books_info[3] is not None:
        books_dict['Edition_Type'].append(books_info[3].text)
    else:
        books_dict['Edition_Type'].append('Missing')
    return books_dict


def book_price(books_info):
    if books_info[4] is not None:
        books_dict['Price'].append(books_info[4].text)
    else:
        books_dict['Price'].append('Missing')
    return books_dict


def star_rating(books_info):
    if books_info[5] is not None:
        books_dict['Star_Rating'].append(books_info[5].text)
    else:
        books_dict['Star_Rating'].append('Missing')
    return books_dict


def book_reviews(books_info):
    if books_info[6] is not None:
        books_dict['Reviews'].append(books_info[6].text)
    else:
        books_dict['Reviews'].append('Missing')
    return books_dict


def get_books(books_doc):
    div_selection_class = 'zg-grid-general-faceout'
    div_tags = books_doc.find_all('div', class_=div_selection_class)

    # books_dict is module-level, so empty its lists first; otherwise every
    # call would append to the rows collected by earlier calls.
    for column in books_dict.values():
        column.clear()

    for div_tag in div_tags:
        books_info = books_details(div_tag)
        book_name(books_info)
        author_name(books_info)
        book_url(books_info)
        edition_type(books_info)
        book_price(books_info)
        star_rating(books_info)
        book_reviews(books_info)
    return pd.DataFrame(books_dict)
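
The seven helper functions above all follow the same pattern: append the tag's text if the tag was found, otherwise append 'Missing'. As a sketch of one way to shorten this (the names `text_or_missing` and `build_row` are hypothetical, and plain objects with a `.text` attribute stand in for Beautiful Soup tags), the pattern can live in a single helper that builds one dictionary per book:

```python
from types import SimpleNamespace

COLUMNS = ['Book_Name', 'Author_Name', 'Book_URL', 'Edition_Type',
           'Price', 'Star_Rating', 'Reviews']

def text_or_missing(value):
    """Return the text of a tag-like object, a plain string, or 'Missing'."""
    if value is None:
        return 'Missing'
    if isinstance(value, str):  # Book_URL is already a string
        return value
    return value.text

def build_row(books_info):
    """Turn the 7-tuple returned by books_details() into one row dict."""
    return {column: text_or_missing(value)
            for column, value in zip(COLUMNS, books_info)}

# Stand-in for one books_details() result; None marks a tag that was not found.
example_info = (
    SimpleNamespace(text='The Hidden Hindu 2'),
    SimpleNamespace(text='Akshat Gupta'),
    'https://amazon.in/...',  # placeholder URL
    SimpleNamespace(text='Paperback'),
    None,
    SimpleNamespace(text='4.6 out of 5 stars'),
    SimpleNamespace(text='1289'),
)
row = build_row(example_info)
```

A list of such rows can be passed straight to `pd.DataFrame(rows)`, which also avoids sharing one module-level dictionary between calls.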

In [79]:
get_books(books_doc)

| | Book_Name | Author_Name | Book_URL | Edition_Type | Price | Star_Rating | Reviews |
|---|---|---|---|---|---|---|---|
| 0 | Harry Potter and the Philosopher's Stone | J.K. Rowling | https://amazon.in/Harry-Potter-Philosophers-St... | Kindle Edition | Missing | 4.7 out of 5 stars | 101510 |
| 1 | THE SILENT PATIENT [Paperback] Michaelides, Alex | Alex Michaelides | https://amazon.in/Silent-Patient-Alex-Michaeli... | Paperback | Missing | 4.5 out of 5 stars | 267961 |
| 2 | The Hidden Hindu: Science-Fiction meets Indian... | Akshat Gupta | https://amazon.in/Hidden-Hindu-Akshat-Gupta/dp... | Paperback | Missing | 4.4 out of 5 stars | 2585 |
| 3 | The Hidden Hindu 2 | Akshat Gupta | https://amazon.in/Hidden-Hindu-2-Akshat-Gupta/... | Paperback | Missing | 4.6 out of 5 stars | 1289 |
| 4 | How the Earth Got Its Beauty: Puffin Chapter B... | Missing | https://amazon.in/How-Earth-Got-Its-Beauty/dp/... | Hardcover | Missing | 4.6 out of 5 stars | Sudha Murty |
| 5 | The Complete Novel of Sherlock Holmes | Arthur Conan Doyle | https://amazon.in/Complete-Novels-Sherlock-Hol... | Paperback | Missing | 4.4 out of 5 stars | 19602 |
| 6 | The Hidden Hindu Book 3 [Paperback] Gupta, Akshat | Akshat Gupta | https://amazon.in/Hidden-Hindu-Book-3/dp/01434... | Paperback | Missing | 4.6 out of 5 stars | 949 |
| 7 | War of Lanka (Ram Chandra Series Book 4) | Amish Tripathi | https://amazon.in/War-Lanka-Ram-Chandra-Book/d... | Paperback | Missing | 4.3 out of 5 stars | 5634 |
| 8 | Baby Touch: Tummy Time [Board book] Ladybird | Missing | https://amazon.in/Baby-Touch-Tummy-Time/dp/024... | Board book | Missing | 4.6 out of 5 stars | |
| 9 | The Immortals of Meluha (Shiva Trilogy Book 1)... | Amish Tripathi | https://amazon.in/Immortals-Meluha-Shiva-Trilo... | Paperback | Missing | 4.6 out of 5 stars | 19478 |
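
The scraped columns are still plain strings, and the output above shows two things worth guarding against: ratings arrive as text such as "4.6 out of 5 stars", and the Reviews column occasionally holds something that is not a count at all (row 4 contains the author's name, presumably because the `a-size-small` class also matches other small spans). A hedged sketch of post-processing helpers follows; the function names are hypothetical and the patterns assume the formats shown in this notebook.

```python
import re

def parse_rating(text):
    """'4.6 out of 5 stars' -> 4.6, or None if the text has another shape."""
    match = re.match(r'\s*(\d+(?:\.\d+)?) out of 5 stars', text)
    return float(match.group(1)) if match else None

def parse_review_count(text):
    """'39,452' -> 39452, or None for anything that is not a number."""
    cleaned = text.replace(',', '').strip()
    return int(cleaned) if cleaned.isdecimal() else None

def parse_price(text):
    """'₹299.00' -> 299.0, or None for values such as 'Missing'."""
    match = re.search(r'\d[\d,]*(?:\.\d+)?', text)
    return float(match.group(0).replace(',', '')) if match else None
```

Returning `None` for unexpected values keeps bad rows visible instead of silently turning them into numbers.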


In [80]:
def scrape_books(books_url, path):
    if os.path.exists(path):
        print('The file {} already exists.. Skipping...'.format(path))
        return
    books_df = get_books(get_topic_page(books_url))
    books_df.to_csv(path, index = None)

def scrape_lists_of_books():
    print('Scraping list of book')
    books_df = scrape_topics()
    books_df = books_df.drop(0,axis=0)
    #print(books_df)
    os.makedirs('data', exist_ok = True)
    for index, row in books_df.iterrows():
        print('Scraping bestselling books details "{}"'.format(row['title']))
        scrape_books(row['url'], 'data/{}.csv'.format(row['title']))

In [81]:
scrape_lists_of_books()

Scraping list of book
Scraping bestselling books details "Action & Adventure"
The file data/Action & Adventure.csv already exists.. Skipping...
Scraping bestselling books details "Arts, Film & Photography"
The file data/Arts, Film & Photography.csv already exists.. Skipping...
Scraping bestselling books details "Biographies, Diaries & True Accounts"
The file data/Biographies, Diaries & True Accounts.csv already exists.. Skipping...
Scraping bestselling books details "Business & Economics"
The file data/Business & Economics.csv already exists.. Skipping...
Scraping bestselling books details "Children's Books"
The file data/Children's Books.csv already exists.. Skipping...
Scraping bestselling books details "Comics & Mangas"
The file data/Comics & Mangas.csv already exists.. Skipping...
Scraping bestselling books details "Computing, Internet & Digital Media"
The file data/Computing, Internet & Digital Media.csv already exists.. Skipping...
Scraping bestselling books details "Crafts, Home
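
The category titles are used directly as file names. That works for the titles shown here, but it would break for a title containing a path separator such as `/`, and it leaves spaces, commas and ampersands in the names. A sketch of a small normalising helper (`safe_filename` is a hypothetical name, and the replacement rule is one choice among many):

```python
import re

def safe_filename(title, extension='.csv'):
    """Turn a category title into a lowercase, underscore-separated file name."""
    # Replace every run of characters other than letters and digits with '_'.
    slug = re.sub(r'[^A-Za-z0-9]+', '_', title).strip('_').lower()
    return (slug or 'untitled') + extension
```

In `scrape_lists_of_books` the path could then be built as `'data/' + safe_filename(row['title'])`.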

## Summary 

### What we have done so far

- Installed and imported the libraries.

- Downloaded and parsed the Bestseller HTML page source using Requests and Beautiful Soup to get the book category titles and URLs.

- Extracted the book names and book URLs.

- Extracted the information from each category page.

- Combined the information extracted from each page in a Python dictionary.

- Saved the data to CSV files using the Pandas library.

- At the end of the project, we have a CSV file for each category in the following format:


Book_Name,Author_Name,Book_URL,Edition_Type,Price,Star_Rating,Reviews

Harry Potter and the Philosopher's Stone,J.K. Rowling,https://amazon.in/Harry-Potter-Philosophers-Stone-Rowling-ebook/dp/B019PIOJYU/ref=zg_bs_1318158031_1/000-0000000-0000000?pd_rd_i=B019PIOJYU&psc=1,Kindle Edition,₹299.00,4.7 out of 5 stars,"39,452"

"The Silent Patient: The record-breaking, multimillion copy Sunday Times bestselling thriller and Richard & Judy book club pick",Alex Michaelides,https://amazon.in/Silent-Patient-Alex-Michaelides/dp/1409181634/ref=zg_bs_1318158031_2/000-0000000-0000000?pd_rd_i=1409181634&psc=1,Paperback,₹279.00,4.5 out of 5 stars,"92,969"