# Scraping Amazon for bestsellers in different genres

#### Web Scraping 
Web scraping is a technique for extracting data from websites. It is commonly used for data mining, market research, price comparison, content aggregation, and more.

#### Problem Statement
We'll scrape information from https://www.amazon.in/gp/bestsellers/books/ , creating three datasets: Genre, Sub-Genre, Books.

#### Tools used
1. Python
2. Requests: an HTTP client library that simplifies sending requests to websites and receiving their responses, with a uniform interface for methods such as GET and POST.
    + documentation: https://requests.readthedocs.io/en/latest/ 
3. Beautiful Soup: It is a Python package for parsing HTML and XML documents.
    + documentation: https://beautiful-soup-4.readthedocs.io/en/latest/ 
4. Pandas: a software library for data manipulation and analysis.
    + documentation: https://pandas.pydata.org/docs/getting_started/install.html

#### Link to Datasets
These datasets are uploaded on Kaggle:
https://www.kaggle.com/datasets/chhavidhankhar11/amazon-books-dataset/data?select=Amazon_Books_Scraping

### Project Outline:
1. Scrape https://www.amazon.in/gp/bestsellers/books/
2. Create a dataset of all available genres, with the number of sub-genres and the URL of each.
3. Create a dataset of sub-genres, with their main genre, number of books, and URL.
4. Create a dataset of books with their title, genre, sub-genre, type (paperback/kindle/audiobook/hardcover), price, rating, number of people who rated, and URL.
5. Merge all of them.
6. Upload the final dataset to Kaggle.

## Use the requests library to download web pages


In [1]:
!pip install requests --upgrade --quiet



In [2]:
import requests

In [3]:
topics_url='https://www.amazon.in/gp/bestsellers/books/'
response=requests.get(topics_url)

In [4]:
response.status_code

200

In [5]:
len(response.text)

324642

In [6]:
page_contents=response.text
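
The download step above can be wrapped in a small helper that fails loudly when a request does not succeed. This is a minimal sketch, not part of the original notebook: the name `fetch_html`, its `session` parameter, and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Download a page and return its HTML text.

    `session` is a hypothetical hook: pass any object with a
    requests-style .get() method (handy for offline testing).
    """
    session = session or requests.Session()
    response = session.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # silently parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(topics_url)` would then replace the separate `requests.get` and `response.text` steps.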

## Use Beautiful Soup to parse and extract information


### Dataset 1
A dataset of all available genres, with the number of sub-genres and the URL of each

In [7]:
!pip install beautifulsoup4 --upgrade --quiet



In [8]:
from bs4 import BeautifulSoup

In [9]:
doc=BeautifulSoup(page_contents, 'html.parser')

In [10]:
type(doc)

bs4.BeautifulSoup

#### Genre Title

In [11]:
genre_title_class='_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8'
genre_title_tags=doc.find_all('div',{'class':genre_title_class})

In [12]:
type(doc.find_all('hr'))

bs4.element.ResultSet

In [13]:
len(genre_title_tags)

36

In [14]:
genre_title_tags[0]

<div class="_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8" role="treeitem"><span class="_p13n-zg-nav-tree-all_style_zg-selected__1SfhQ">Books</span></div>

We don't need the first tag because it is only the parent category 'Books'. Since we are building a dataset of genres, we exclude it.

In [15]:
genre_title_tags=genre_title_tags[1:]

In [16]:
len(genre_title_tags)

35

In [17]:
genre_title_tags[1].find_all('a')[0].text

'Arts, Film & Photography'
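
The extraction pattern used here (find every `div` with a given class, drop the first tag, read the text of the first link) can be tried offline. The following is a small sketch on hand-written HTML; the markup and the class name `nav-item big` are made up for illustration and only mimic the structure of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first item has no link, like the 'Books' entry.
html = """
<div class="nav-item big"><span>Books</span></div>
<div class="nav-item big"><a href="/gp/bestsellers/books/1">Action &amp; Adventure</a></div>
<div class="nav-item big"><a href="/gp/bestsellers/books/2">Travel</a></div>
"""

doc = BeautifulSoup(html, "html.parser")

# A class string with a space matches the exact value of the class attribute.
tags = doc.find_all("div", {"class": "nav-item big"})

# Skip the first tag, then read the link text of each remaining tag.
titles = [tag.find("a").text for tag in tags[1:]]
print(titles)  # ['Action & Adventure', 'Travel']
```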

In [18]:
genre_title=[]
for tag in genre_title_tags:
    genre_title.append(tag.find_all('a')[0].text)

In [19]:
genre_title

['Action & Adventure',
 'Arts, Film & Photography',
 'Biographies, Diaries & True Accounts',
 'Business & Economics',
 "Children's Books",
 'Comics & Mangas',
 'Computing, Internet & Digital Media',
 'Crafts, Home & Lifestyle',
 'Crime, Thriller & Mystery',
 'Engineering',
 'Exam Preparation',
 'Fantasy, Horror & Science Fiction',
 'Health, Family & Personal Development',
 'Health, Fitness & Nutrition',
 'Higher Education Textbooks',
 'Historical Fiction',
 'History',
 'Humour',
 'Language, Linguistics & Writing',
 'Law',
 'Literature & Fiction',
 'Maps & Atlases',
 'Medicine & Health Sciences',
 'Politics',
 'Reference',
 'Religion',
 'Romance',
 'School Books',
 'Science & Mathematics',
 'Sciences, Technology & Medicine',
 'Society & Social Sciences',
 'Sports',
 'Teen & Young Adult',
 'Textbooks & Study Guides',
 'Travel']

#### URLs

In [20]:
genre_url_class='_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8'
genre_url_tags=doc.find_all('div',{'class':genre_url_class})

In [21]:
genre_url_tags=genre_url_tags[1:]

In [22]:
genre_url_tags[1].find_all('a')[0]['href']

'/gp/bestsellers/books/1318052031'

In [23]:
genre_url=[]
for tag in genre_url_tags:
    genre_url.append("https://www.amazon.in"+tag.find_all('a')[0]['href'])
genre_url

['https://www.amazon.in/gp/bestsellers/books/1318158031',
 'https://www.amazon.in/gp/bestsellers/books/1318052031',
 'https://www.amazon.in/gp/bestsellers/books/1318064031',
 'https://www.amazon.in/gp/bestsellers/books/1318068031',
 'https://www.amazon.in/gp/bestsellers/books/64619755031',
 'https://www.amazon.in/gp/bestsellers/books/1318104031',
 'https://www.amazon.in/gp/bestsellers/books/1318105031',
 'https://www.amazon.in/gp/bestsellers/books/1318118031',
 'https://www.amazon.in/gp/bestsellers/books/1318161031',
 'https://www.amazon.in/gp/bestsellers/books/22960344031',
 'https://www.amazon.in/gp/bestsellers/books/4149751031',
 'https://www.amazon.in/gp/bestsellers/books/1402038031',
 'https://www.amazon.in/gp/bestsellers/books/1318128031',
 'https://www.amazon.in/gp/bestsellers/books/23033693031',
 'https://www.amazon.in/gp/bestsellers/books/4149418031',
 'https://www.amazon.in/gp/bestsellers/books/1318164031',
 'https://www.amazon.in/gp/bestsellers/books/4149493031',
 'https://w
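
The absolute URLs above are built by string concatenation. As an alternative sketch, the standard library's `urljoin` resolves a relative `href` against a base URL; the constant name `BASE_URL` and the two sample paths (taken from the output above) are used here only for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.amazon.in"

# Relative hrefs as they appear in the page's links.
hrefs = [
    "/gp/bestsellers/books/1318158031",
    "/gp/bestsellers/books/1318052031",
]

# urljoin handles leading slashes and already-absolute URLs consistently.
absolute_urls = [urljoin(BASE_URL, href) for href in hrefs]
print(absolute_urls[1])  # https://www.amazon.in/gp/bestsellers/books/1318052031
```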

#### Number of Sub-Genres
The sub-genres appear only when we open the URL of a particular genre, so let's write a function that counts them.

In [24]:
def no_of_subgenre(url):
    response=requests.get(url)
    page_contents=response.text
    doc=BeautifulSoup(page_contents, 'html.parser')
    sub_genre_title_class='_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8'
    sub_genre_title_tags=doc.find_all('div',{'class':sub_genre_title_class})
    sub_genre_title_tags=sub_genre_title_tags[1:]
    return len(sub_genre_title_tags)

In [25]:
no_of_subgenre('https://www.amazon.in/gp/bestsellers/books/1318052031')

11

In [26]:
sub_genre=[]
for i in genre_url:
    sub_genre.append(no_of_subgenre(i))
sub_genre

[35,
 11,
 3,
 4,
 34,
 2,
 11,
 9,
 3,
 18,
 15,
 3,
 4,
 1,
 10,
 35,
 6,
 35,
 7,
 6,
 19,
 35,
 10,
 7,
 5,
 12,
 29,
 8,
 9,
 13,
 7,
 37,
 18,
 2,
 6]

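The genres that report 35 sub-genres in this list are actually the ones with 0 sub-genres: the count matches the 35 top-level genres found earlier, which suggests that their pages list the top-level genres instead. So we update the list accordingly.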

In [27]:
for i in range(0,len(sub_genre)):
    if sub_genre[i]==35:
        sub_genre[i]=0

In [28]:
sub_genre

[0,
 11,
 3,
 4,
 34,
 2,
 11,
 9,
 3,
 18,
 15,
 3,
 4,
 1,
 10,
 0,
 6,
 0,
 7,
 6,
 19,
 0,
 10,
 7,
 5,
 12,
 29,
 8,
 9,
 13,
 7,
 37,
 18,
 2,
 6]
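
Using the count 35 as a marker would misclassify a genre that genuinely had 35 sub-genres. A possible alternative, sketched below, compares the scraped titles with the top-level genre titles instead. It rests on an assumption that the notebook does not verify: that a page without sub-genres repeats exactly the top-level genre list. The function name `count_subgenres` is hypothetical.

```python
def count_subgenres(nav_titles, top_level_titles):
    """Return the number of sub-genres, or 0 when the navigation
    tree merely repeats the top-level genres (assumed behaviour)."""
    if set(nav_titles) == set(top_level_titles):
        return 0
    return len(nav_titles)


top_level = ["Action & Adventure", "Travel"]

# Page that only repeats the top-level genres: treated as no sub-genres.
print(count_subgenres(["Action & Adventure", "Travel"], top_level))  # 0

# Page that lists real sub-genres.
print(count_subgenres(["Transport", "Specialty Travel", "Illustrated Books"], top_level))  # 3
```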

In [29]:
import pandas as pd

In [30]:
genre_data={'Title':genre_title,'Number of Sub-genres':sub_genre,'URL':genre_url}
Genre_df=pd.DataFrame(genre_data)

In [31]:
Genre_df

| | Title | Number of Sub-genres | URL |
|---|---|---|---|
| 0 | Action & Adventure | 0 | https://www.amazon.in/gp/bestsellers/books/131... |
| 1 | Arts, Film & Photography | 11 | https://www.amazon.in/gp/bestsellers/books/131... |
| 2 | Biographies, Diaries & True Accounts | 3 | https://www.amazon.in/gp/bestsellers/books/131... |
| 3 | Business & Economics | 4 | https://www.amazon.in/gp/bestsellers/books/131... |
| 4 | Children's Books | 34 | https://www.amazon.in/gp/bestsellers/books/646... |
| 5 | Comics & Mangas | 2 | https://www.amazon.in/gp/bestsellers/books/131... |
| 6 | Computing, Internet & Digital Media | 11 | https://www.amazon.in/gp/bestsellers/books/131... |
| 7 | Crafts, Home & Lifestyle | 9 | https://www.amazon.in/gp/bestsellers/books/131... |
| 8 | Crime, Thriller & Mystery | 3 | https://www.amazon.in/gp/bestsellers/books/131... |
| 9 | Engineering | 18 | https://www.amazon.in/gp/bestsellers/books/229... |


In [32]:
Genre_df.to_csv('Genre_df.csv',index=False)

### Dataset 2
A dataset of sub-genres, their main genre, number of books, url

In [33]:
def subgenre(main_genre,url):
    response=requests.get(url)
    page_contents=response.text
    doc=BeautifulSoup(page_contents, 'html.parser')
    sub_genre_title_class='_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8'
    sub_genre_title_tags=doc.find_all('div',{'class':sub_genre_title_class})
    sub_genre_title_tags=sub_genre_title_tags[1:]
    
    #extract all sub-genre
    sub_genre=[]
    if len(sub_genre_title_tags)!=35:
        for tag in sub_genre_title_tags:
            sub_genre.append(tag.find_all('a')[0].text)
    
    #extract sub-genre url's
    sub_genre_url=[]
    if len(sub_genre_title_tags)!=35:
        for tag in sub_genre_title_tags:
            sub_genre_url.append('https://www.amazon.in'+tag.find_all('a')[0]['href'])
    
    #number of books: hard-coded to 100 per sub-genre, not scraped from the page
    no_books=[]
    for i in range(0,len(sub_genre)):
        no_books.append(100)
    
    #Adding main genre
    mg=[]
    for i in range(0,len(sub_genre)):
        mg.append(main_genre)
    
    #make a dataframe
    sub_genre_data={'Title':sub_genre,'Main Genre':mg,'No. of Books':no_books,'URLs':sub_genre_url}
    sub_genre_df=pd.DataFrame(sub_genre_data)
    return sub_genre_df

# Return only the list of sub-genre URLs
def subg_url(main_genre,url):
    response=requests.get(url)
    page_contents=response.text
    doc=BeautifulSoup(page_contents, 'html.parser')
    sub_genre_title_class='_p13n-zg-nav-tree-all_style_zg-browse-item__1rdKf _p13n-zg-nav-tree-all_style_zg-browse-height-large__1z5B8'
    sub_genre_title_tags=doc.find_all('div',{'class':sub_genre_title_class})
    sub_genre_title_tags=sub_genre_title_tags[1:]
    
    #extract sub-genre url's
    sub_genre_url=[]
    if len(sub_genre_title_tags)!=35:
        for tag in sub_genre_title_tags:
            sub_genre_url.append('https://www.amazon.in'+tag.find_all('a')[0]['href'])
    return sub_genre_url
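
`subgenre` and `subg_url` each download and parse the same page, so every genre URL is requested twice. The sketch below shows one way to collect titles and URLs in a single pass over already-downloaded HTML. The function name `parse_subgenres`, its parameters, and the sample markup are assumptions for illustration, not the notebook's code.

```python
from bs4 import BeautifulSoup

BASE_URL = "https://www.amazon.in"


def parse_subgenres(html, item_class, main_genre):
    """Return one dict per sub-genre link found in `html`."""
    doc = BeautifulSoup(html, "html.parser")
    # Drop the first matching tag, as the notebook does.
    tags = doc.find_all("div", {"class": item_class})[1:]
    rows = []
    for tag in tags:
        link = tag.find("a")
        if link is None:  # item without a link: nothing to record
            continue
        rows.append({
            "Title": link.text,
            "Main Genre": main_genre,
            "URLs": BASE_URL + link["href"],
        })
    return rows


# Hypothetical markup standing in for a genre page.
sample_html = (
    '<div class="item"><span>Books</span></div>'
    '<div class="item"><a href="/gp/bestsellers/books/1">Dance</a></div>'
)
rows = parse_subgenres(sample_html, "item", "Arts, Film & Photography")
print(rows[0]["URLs"])  # https://www.amazon.in/gp/bestsellers/books/1
```

A list of dicts like this can be passed directly to `pd.DataFrame`, and the URL list is simply `[row["URLs"] for row in rows]`.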

In [34]:
data={'Title':[],'Main Genre':[],'No. of Books':[],'URLs':[]}
SubGenre_df=pd.DataFrame(data)
url=[]

In [35]:
for i in range(0,len(genre_title)):
    df=subgenre(genre_title[i],genre_url[i])
    url.extend(subg_url(genre_title[i],genre_url[i]))
    SubGenre_df=pd.concat([SubGenre_df, df], ignore_index=True)
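
The loop above grows `SubGenre_df` by calling `pd.concat` once per genre. A common alternative, sketched here with made-up frames in place of the output of `subgenre()`, is to collect the per-genre frames in a list and concatenate once at the end.

```python
import pandas as pd

# Hypothetical per-genre frames standing in for the output of subgenre().
frames = [
    pd.DataFrame({
        "Title": ["Architecture", "Dance"],
        "Main Genre": ["Arts, Film & Photography"] * 2,
    }),
    pd.DataFrame({
        "Title": ["Transport"],
        "Main Genre": ["Travel"],
    }),
]

# One concat call; ignore_index=True renumbers the rows 0..n-1.
combined = pd.concat(frames, ignore_index=True)
print(len(combined))  # 3
```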

In [36]:
len(url)

329

In [70]:
SubGenre_df

| | Title | Main Genre | No. of Books | URLs |
|---|---|---|---|---|
| 0 | Architecture | Arts, Film & Photography | 100.0 | https://www.amazon.in/gp/bestsellers/books/131... |
| 1 | Cinema & Broadcast | Arts, Film & Photography | 100.0 | https://www.amazon.in/gp/bestsellers/books/131... |
| 2 | Dance | Arts, Film & Photography | 100.0 | https://www.amazon.in/gp/bestsellers/books/131... |
| 3 | Design & Fashion | Arts, Film & Photography | 100.0 | https://www.amazon.in/gp/bestsellers/books/131... |
| 4 | Museums & Museology | Arts, Film & Photography | 100.0 | https://www.amazon.in/gp/bestsellers/books/131... |
| ... | ... | ... | ... | ... |
| 324 | Illustrated Books | Travel | 100.0 | https://www.amazon.in/gp/bestsellers/books/131... |
| 325 | Specialty Travel | Travel | 100.0 | https://www.amazon.in/gp/bestsellers/books/145... |
| 326 | Transport | Travel | 100.0 | https://www.amazon.in/gp/bestsellers/books/229... |
| 327 | Travel & Holiday Guides | Travel | 100.0 | https://www.amazon.in/gp/bestsellers/books/131... |


In [38]:
SubGenre_df.to_csv('Sub_Genre_df.csv',index=False)
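
Step 5 of the outline calls for merging the datasets. The notebook does not show that step, so the following is only a sketch of how the genre and sub-genre tables could be joined with `DataFrame.merge`; the sample rows are taken from the outputs above, and joining on the genre title is an assumption.

```python
import pandas as pd

genre = pd.DataFrame({
    "Title": ["Travel"],
    "Number of Sub-genres": [6],
})
sub_genre = pd.DataFrame({
    "Title": ["Transport", "Specialty Travel"],
    "Main Genre": ["Travel", "Travel"],
})

# Rename the genre title so it matches the sub-genre table's key column,
# then attach the genre-level columns to every sub-genre row.
merged = sub_genre.merge(
    genre.rename(columns={"Title": "Main Genre"}),
    on="Main Genre",
    how="left",
)
print(merged.columns.tolist())  # ['Title', 'Main Genre', 'Number of Sub-genres']
```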

### Dataset 3
A dataset of books with their title, author/publication, genre, sub-genre, type(paperback/kindle/audiobook/hardcover), price, rating, number of people rated, url.

In [39]:
def Books(main_genre,sub_genre,sub_genre_url):
        response=requests.get(sub_genre_url)
        page_contents=response.text
        doc=BeautifulSoup(page_contents, 'html.parser')

        #book title
        book_title_class='p13n-sc-uncoverable-faceout'
        parent_tag=doc.find_all('div',class_=book_title_class)
        book_title=[]
        for tag in parent_tag:
            book_title.append(tag.find_all('a')[1].find('div').text)

        #book author
        book_author_div_class='a-row a-size-small'
        parent_tag2=doc.find_all('div',class_=book_author_div_class)
        div_tag=[]
        for tag in parent_tag2:
            div_tag.append(tag.find('div'))
        book_author=[]
        for i in range(0,len(div_tag),2):
            author = div_tag[i].text 
            book_author.append(author)


        #main genre
        mg2=[]
        for i in range(0,len(book_title)):
            mg2.append(main_genre)

        #sub genre
        sg2=[]
        for i in range(0,len(book_title)):
            sg2.append(sub_genre)

        #book type
        book_type_class='a-size-small a-color-secondary a-text-normal'
        book_type_tag=doc.find_all('span',class_=book_type_class)
        book_type=[]
        for tag in book_type_tag:
            book_type.append(tag.text)

        #price
        book_price=[]
        for tag in parent_tag:
            book_price.append(tag.find_all('span',class_='p13n-sc-price')[0].text)


        #rating (0 when a book has no rating element)
        rating=[]
        for tag in parent_tag:
            rating_tag=tag.find('i')  #find() returns None when nothing matches
            if rating_tag is not None:
                rating.append(float(rating_tag.text.split()[0]))
            else:
                rating.append(0)

        #no of people rated
        npr_tag=doc.select('span.a-size-small:not([class*=" "])')
        #insert a placeholder 0 for each unrated book so the lists stay aligned
        for i in range(0,len(rating)):
            if rating[i]==0:
                npr_tag.insert(i,0)

        npr=[]
        for tag in npr_tag:
            if isinstance(tag,int):
                npr.append(0)
            else:
                npr.append(int(tag.text.replace(',','')))


        #URLs
        book_url=[]
        for tag in parent_tag:
            book_url.append('https://www.amazon.in' + tag.find_all('a')[1]['href'])

        #make a dataframe
        data={'Title':book_title,'Author':book_author,'Main Genre':mg2,'Sub Genre':sg2,'Type':book_type,'Price':book_price,'Rating':rating,'No. of People rated':npr,'URLs':book_url}
        book_df=pd.DataFrame(data)
        return book_df
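
The rating and review-count conversions inside `Books` can be isolated into small helpers, which makes them easy to check without downloading a page. This is a sketch under assumptions: the helper names are hypothetical, and the sample rating text `'4.4 out of 5 stars'` is an assumed format consistent with the notebook taking the first whitespace-separated token.

```python
def parse_rating(text):
    """Convert rating text to a float; 0.0 when the text is missing."""
    if not text:
        return 0.0
    # The first token is the numeric rating.
    return float(text.split()[0])


def parse_review_count(text):
    """Convert a count such as '19,923' to an int; 0 when missing."""
    if not text:
        return 0
    return int(text.replace(",", ""))


print(parse_rating("4.4 out of 5 stars"))  # 4.4
print(parse_rating(None))                  # 0.0
print(parse_review_count("19,923"))        # 19923
```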

In [63]:
Book_data={'Title':[],'Author':[],'Main Genre':[],'Sub Genre':[],'Type':[],'Price':[],'Rating':[],'No. of People rated':[],'URLs':[]}
Book_df=pd.DataFrame(Book_data)

for index, row in SubGenre_df.iterrows():
    main_genre = row['Main Genre']
    sub_genre = row['Title']
    ur = url[index] 
    
    try:
        df = Books(main_genre, sub_genre, ur)
        Book_df=pd.concat([Book_df, df])
        
    except Exception:
        # skip sub-genres whose page could not be downloaded or parsed
        continue
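
Skipping failed sub-genres keeps the loop running, but it also hides how many pages were lost. One option, sketched below, is to record each failure alongside the results and to pause between requests. The function `scrape_all`, its parameters, and the stand-in `fake_scrape` are hypothetical and not part of the notebook.

```python
import time


def scrape_all(targets, scrape_one, delay_seconds=0.0):
    """Apply `scrape_one` to every target, collecting failures
    instead of discarding them."""
    results, failures = [], []
    for target in targets:
        try:
            results.append(scrape_one(target))
        except Exception as exc:
            # Keep the target and the error so it can be retried or inspected.
            failures.append((target, repr(exc)))
        time.sleep(delay_seconds)  # optional pause between requests
    return results, failures


def fake_scrape(url):
    """Stand-in for a real scraping function; fails on one URL."""
    if url == "bad":
        raise ValueError("could not parse page")
    return url.upper()


results, failures = scrape_all(["a", "bad", "b"], fake_scrape)
print(results)        # ['A', 'B']
print(len(failures))  # 1
```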

In [66]:
Book_df

| | Title | Author | Main Genre | Sub Genre | Type | Price | Rating | No. of People rated | URLs |
|---|---|---|---|---|---|---|---|---|---|
| 0 | The Complete Novel of Sherlock Holmes | Arthur Conan Doyle | Arts, Film & Photography | Cinema & Broadcast | Paperback | ₹169.00 | 4.4 | 19923.0 | https://www.amazon.in/Complete-Novels-Sherlock... |
| 1 | Black Holes (L) : The Reith Lectures [Paperbac... | Stephen Hawking | Arts, Film & Photography | Cinema & Broadcast | Paperback | ₹99.00 | 4.5 | 7686.0 | https://www.amazon.in/Black-Holes-Lectures-Ste... |
| 2 | The Kite Runner | Khaled Hosseini | Arts, Film & Photography | Cinema & Broadcast | Kindle Edition | ₹175.75 | 4.6 | 50016.0 | https://www.amazon.in/Kite-Runner-Khaled-Hosse... |
| 3 | Greenlights: Raucous stories and outlaw wisdom... | Matthew McConaughey | Arts, Film & Photography | Cinema & Broadcast | Paperback | ₹389.00 | 4.6 | 32040.0 | https://www.amazon.in/Greenlights-Raucous-stor... |
| 4 | The Science of Storytelling: Why Stories Make ... | Will Storr | Arts, Film & Photography | Cinema & Broadcast | Paperback | ₹348.16 | 4.5 | 1707.0 | https://www.amazon.in/Science-Storytelling-Wil... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 45 | Insight Guides Poland (Travel Guide with Free ... | Insight Travel Guide | Travel | Travel & Holiday Guides | Paperback | ₹1,326.00 | 4.7 | 16.0 | https://www.amazon.in/Insight-Guides-Poland-Tr... |
| 46 | Lonely Planet India 19 (Travel Guide) | Anirban Mahapatra | Travel | Travel & Holiday Guides | Paperback | ₹850.00 | 4.4 | 187.0 | https://www.amazon.in/Lonely-Planet-India-Trav... |
| 47 | Eyewitness Travel Phrase Book French (EW Trave... | DK | Travel | Travel & Holiday Guides | Paperback | ₹307.00 | 4.5 | 168.0 | https://www.amazon.in/Eyewitness-Travel-Phrase... |
| 48 | Lonely Planet Australia (Travel Guide) | Andrew Bain | Travel | Travel & Holiday Guides | Kindle Edition | ₹1,814.50 | 4.7 | 267.0 | https://www.amazon.in/Lonely-Planet-Australia-... |
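
The `Price` column holds strings such as `₹1,326.00`, which cannot be sorted or averaged as numbers. A minimal sketch of a conversion helper follows; the name `parse_price` is hypothetical, and it assumes every price uses the rupee sign and comma thousands separators seen in the output above.

```python
def parse_price(text):
    """Convert a price string like '₹1,326.00' to a float."""
    cleaned = text.replace("₹", "").replace(",", "").strip()
    return float(cleaned)


print(parse_price("₹169.00"))    # 169.0
print(parse_price("₹1,326.00"))  # 1326.0
```

With pandas, the same helper could be applied to the whole column, for example `Book_df['Price'].map(parse_price)`.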


In [67]:
Book_df = Book_df.reset_index(drop=True)

In [69]:
Book_df.to_csv("Books_df.csv")
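
Unlike the two earlier exports, this call does not pass `index=False`, so the row index is written as an extra, unnamed first column. The sketch below shows the difference on a one-row frame with made-up data; reading such a file back with `pd.read_csv` yields a column named `Unnamed: 0`.

```python
import io

import pandas as pd

df = pd.DataFrame({"Title": ["The Kite Runner"], "Rating": [4.6]})

# With no path argument, to_csv returns the CSV text.
csv_with_index = df.to_csv()
csv_without_index = df.to_csv(index=False)

print(csv_with_index.splitlines()[0])     # ,Title,Rating
print(csv_without_index.splitlines()[0])  # Title,Rating

# Reading the first variant back produces the 'Unnamed: 0' column.
columns = pd.read_csv(io.StringIO(csv_with_index)).columns.tolist()
print(columns)  # ['Unnamed: 0', 'Title', 'Rating']
```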