# Top 100 Bestsellers on BookDepository

### In this project we will scrape https://www.bookdepository.com/, extract the relevant information about the top 100 bestselling books, and save it to a CSV file.

### Project outline:
- Find the URLs of the pages we need. Each page lists 30 books, so we need to scrape 4 pages.
- Get the HTML of each page and parse it to extract the info we need: book title, author, publish date, ISBN and price.
- Create a DataFrame, load this data into it and export it to a CSV file.

## Let's Start!

In [1]:
# importing the libraries we need:

import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
# getting the HTML doc and checking that the request succeeded (we parse it in the next cell).

base_url = 'https://www.bookdepository.com/bestsellers'
response = requests.get(base_url)
response.status_code

200
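
A status code of 200 means the request succeeded. As a minimal sketch that is not part of the original notebook, the check can be made explicit so that a failed request stops the run instead of quietly parsing an error page. The helper name `fetch_html` and its `get` parameter are hypothetical; in this notebook you would pass `requests.get`.

```python
def fetch_html(url, get, timeout=10):
    """Download a page and return its raw bytes.

    `get` is any callable with the same interface as `requests.get`
    (a hypothetical parameter, injected so the helper is easy to test).
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises an exception for 4xx/5xx responses.
    response.raise_for_status()
    return response.content
```

With it, the parsing step could read `doc = BeautifulSoup(fetch_html(base_url, requests.get), 'html.parser')`.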

In [3]:
doc = BeautifulSoup(response.content, 'html.parser')

In [4]:
# let's try to get the first 30 titles (from the first page).

title_tags = doc.find_all('h3', class_='title')
titles = [title.text.strip() for title in title_tags]
titles[:5]

['Glucose Revolution',
 "It Ends With Us: The most heartbreaking novel you'll ever read",
 'Verity',
 'Seven Husbands of Evelyn Hugo',
 'The Love Hypothesis']

In [5]:
# next, let's grab the authors of the first 30 books.

author_tags = doc.find_all('p', class_='author')
authors = [author.text.strip() for author in author_tags]
authors[:5]

['Jessie Inchauspe',
 'Colleen Hoover',
 'Colleen Hoover',
 'Taylor Jenkins Reid',
 'Ali Hazelwood']

In [6]:
# we can get all the rest in the same fashion, except the ISBN.
# if we inspect the HTML we find an 'a' tag whose href attribute links to the book's page,
# and the last 13 characters of that link are the ISBN itself :)

ISBN_tags = doc.find_all('a', {'itemprop': 'url'})
ISBN = [isbn['href'][-13:] for isbn in ISBN_tags]
ISBN[:5]

['9781780725239',
 '9781471156267',
 '9781408726600',
 '9781398515697',
 '9781408725764']
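
Slicing the last 13 characters works as long as every link ends with the ISBN. As a hedged sketch, the extraction can be made a little more defensive by searching for a 13-digit run and verifying the ISBN-13 check digit (digits are weighted 1, 3, 1, 3, ... and the total must be divisible by 10). The function names and the example link paths are hypothetical, not taken from the site.

```python
import re

# A run of exactly 13 digits that is not part of a longer number.
ISBN13_PATTERN = re.compile(r'(?<!\d)\d{13}(?!\d)')


def extract_isbn13(href):
    """Return the last 13-digit run found in a link, or None if there is none."""
    matches = ISBN13_PATTERN.findall(href)
    return matches[-1] if matches else None


def is_valid_isbn13(isbn):
    """Check the ISBN-13 check digit: weights alternate 1 and 3, sum % 10 == 0."""
    if not re.fullmatch(r'[0-9]{13}', isbn):
        return False
    total = sum(int(digit) * (3 if position % 2 else 1)
                for position, digit in enumerate(isbn))
    return total % 10 == 0


# Hypothetical link shape, used only for illustration.
isbn = extract_isbn13('/Some-Book-Title/9781780725239')
print(isbn, is_valid_isbn13(isbn))
```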

In [7]:
# now let's work out how to get the top 100 books instead of 30.
# if we go to the bestsellers page and move to the next page, the only change in the URL
# is that '?page={page_number}' is appended, so let's create a list of these URLs.

base_url = 'https://www.bookdepository.com/bestsellers'

all_urls = []
for i in range(1,5):
    all_urls.append(base_url + f"?page={i}")
    
all_urls

['https://www.bookdepository.com/bestsellers?page=1',
 'https://www.bookdepository.com/bestsellers?page=2',
 'https://www.bookdepository.com/bestsellers?page=3',
 'https://www.bookdepository.com/bestsellers?page=4']
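
The number of pages does not have to be hard-coded. A small sketch, assuming 30 books per page as stated in the outline, derives it from the number of books wanted; `build_page_urls` is a hypothetical helper name.

```python
import math
from urllib.parse import urlencode


def build_page_urls(base_url, n_items, items_per_page):
    """Build one URL per page needed to cover n_items results."""
    n_pages = math.ceil(n_items / items_per_page)
    # urlencode() produces the 'page=<number>' query string.
    return [f"{base_url}?{urlencode({'page': page})}"
            for page in range(1, n_pages + 1)]


urls = build_page_urls('https://www.bookdepository.com/bestsellers', 100, 30)
print(len(urls))  # 4 pages are needed to cover 100 books
```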

In [8]:
# now let's build functions to get the work done.

def get_titles(doc):
    title_tags = doc.find_all('h3', class_='title')
    return [title.text.strip() for title in title_tags]


def get_authors(doc):
    author_tags = doc.find_all('p', class_='author')
    return [author.text.strip() for author in author_tags]


def get_publish_dates(doc):
    date_tags = doc.find_all('p', class_='published')
    return [date.text.strip() for date in date_tags]


def get_prices(doc):
    # we use 'price-wrap' rather than 'price' because not every book has a price,
    # and we still need one entry per book.
    price_tags = doc.find_all('div', class_='price-wrap')
    return [price.text.strip() for price in price_tags]


def get_ISBN(doc):
    ISBN_tags = doc.find_all('a', {'itemprop': 'url'})
    # the last 13 characters of each link are the ISBN.
    return [isbn['href'][-13:] for isbn in ISBN_tags]

In [9]:
# one big function that scrapes a single page and returns its books as a DataFrame:

def Scrape(url):
    
    response = requests.get(url)
    doc = BeautifulSoup(response.content, 'html.parser')
    
    titles = get_titles(doc)
    authors = get_authors(doc)
    dates = get_publish_dates(doc)
    prices = get_prices(doc)
    ISBN = get_ISBN(doc)
    
    scrape_dict = {
        'Book Title': titles,
        'Author': authors,
        'Publish Date': dates,
        'Price': prices,
        'ISBN 13': ISBN
    }
    
    df = pd.DataFrame(scrape_dict)
    return df
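
`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which can happen if some element is missing for one of the books on a page. As a sketch under that assumption, a small check before building the DataFrame reports which column is short; `check_equal_lengths` is a hypothetical helper, not part of pandas.

```python
def check_equal_lengths(columns):
    """Return the common length of all lists, or raise if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    # All lengths are equal here; an empty dict counts as length 0.
    return next(iter(lengths.values()), 0)


example = {
    'Book Title': ['Verity', 'The Love Hypothesis'],
    'Author': ['Colleen Hoover', 'Ali Hazelwood'],
}
print(check_equal_lengths(example))
```

Calling it on `scrape_dict` just before `pd.DataFrame(scrape_dict)` would surface a mismatch with the column names attached.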

In [10]:
df  = Scrape(all_urls[0])
df.head()

|   | Book Title | Author | Publish Date | Price | ISBN 13 |
|---|---|---|---|---|---|
| 0 | Glucose Revolution | Jessie Inchauspe | 31 Mar 2022 | ₪67.71 | 9781780725239 |
| 1 | It Ends With Us: The most heartbreaking novel ... | Colleen Hoover | 02 Aug 2016 | ₪46.54 | 9781471156267 |
| 2 | Verity | Colleen Hoover | 20 Jan 2022 | ₪44.25 | 9781408726600 |
| 3 | Seven Husbands of Evelyn Hugo | Taylor Jenkins Reid | 14 Oct 2021 | ₪46.81 | 9781398515697 |
| 4 | The Love Hypothesis | Ali Hazelwood | 21 Oct 2021 | ₪44.65 | 9781408725764 |
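
The Publish Date column holds plain strings such as '31 Mar 2022'. If real dates are needed later (for sorting, say), a minimal sketch with the standard library could look like this. It assumes every date follows the day, abbreviated English month, year pattern seen in the output, and that Python runs with its default locale; `parse_publish_date` is a hypothetical helper.

```python
from datetime import datetime


def parse_publish_date(text):
    """Convert a string like '31 Mar 2022' into a datetime.date."""
    # %d = day of month, %b = abbreviated month name, %Y = four-digit year.
    return datetime.strptime(text.strip(), '%d %b %Y').date()


print(parse_publish_date('31 Mar 2022'))
```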


In [11]:
# concatenate the DataFrames of all 4 pages into one, with a fresh 0..119 index.

total = [Scrape(url) for url in all_urls]
total_df = pd.concat(total, ignore_index=True)
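
When several pages are requested in a row, it is considerate to pause briefly between requests. The loop below is only a sketch of that idea: `scrape_pages` and its `delay_seconds` parameter are hypothetical, and the one-second default is an arbitrary choice rather than a documented limit of the site.

```python
import time


def scrape_pages(urls, scrape_one, delay_seconds=1.0):
    """Call scrape_one(url) for each URL, sleeping between consecutive calls."""
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            # Wait before every request except the first one.
            time.sleep(delay_seconds)
        results.append(scrape_one(url))
    return results
```

In this notebook it could be used as `pd.concat(scrape_pages(all_urls, Scrape), ignore_index=True)`.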

In [12]:
# we notice that the price column is a bit messy: some cells hold two prices (when a book is discounted),
# and some are empty because the book is out of stock.
# let's keep only the first price in each cell, put zero where there is no price, and convert everything to floats.

def fix_price(x, currency='₪'):    # replace '₪' with whatever currency your browser displays.
    if len(x) == 0:
        return 0.0
    # when there are two prices they are separated by whitespace (a newline), not by a literal backslash,
    # so we take the text after the first currency symbol and keep only its first token.
    first_str = x.split(currency)[1]
    return float(first_str.split()[0])
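
The same cleaning step can also be sketched with a regular expression, which avoids depending on one particular currency symbol. This is an alternative illustration rather than the notebook's method: `parse_prices` is a hypothetical helper, and it assumes a dot as the decimal separator and commas only as thousands separators.

```python
import re

# An amount such as '44.25' or '52'.
PRICE_PATTERN = re.compile(r'\d+(?:\.\d+)?')


def parse_prices(text):
    """Return every amount found in a price cell, in order of appearance."""
    # Assumption: commas are thousands separators, so they can be dropped.
    return [float(match) for match in PRICE_PATTERN.findall(text.replace(',', ''))]


def first_price(text):
    """Return the first amount in the cell, or 0.0 when the cell is empty."""
    prices = parse_prices(text)
    return prices[0] if prices else 0.0


print(first_price('₪44.25\n        ₪52.00'))
```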

In [13]:
total_df['Price'] = total_df['Price'].apply(fix_price)
total_df.head()

|   | Book Title | Author | Publish Date | Price | ISBN 13 |
|---|---|---|---|---|---|
| 0 | Glucose Revolution | Jessie Inchauspe | 31 Mar 2022 | 67.71 | 9781780725239 |
| 1 | It Ends With Us: The most heartbreaking novel ... | Colleen Hoover | 02 Aug 2016 | 46.54 | 9781471156267 |
| 2 | Verity | Colleen Hoover | 20 Jan 2022 | 44.25 | 9781408726600 |
| 3 | Seven Husbands of Evelyn Hugo | Taylor Jenkins Reid | 14 Oct 2021 | 46.81 | 9781398515697 |
| 4 | The Love Hypothesis | Ali Hazelwood | 21 Oct 2021 | 44.65 | 9781408725764 |


In [14]:
# now we export to a CSV file. keep in mind the DataFrame holds 120 books (4 pages x 30), so we keep only the first 100 rows.
# the file will be saved in the current working directory.
total_df[:100].to_csv('Top 100 Bestsellers.csv', index=False)

### Running the last cell should create a file named 'Top 100 Bestsellers.csv' in your project directory.