# Amazon Best Selling Books - Data Exploration

# Introduction

Amazon is one of the biggest e-commerce websites in the world, and it started out in a garage in the Seattle area. Under Bezos' management, the company has expanded from selling only books to being a store where you can buy almost everything, both physical and digital, and it remains one of the largest online stores for books of all types.

Amazon began operations in July 1995, advertising itself as "Earth's Biggest Bookstore" and leveraging significant book distributors and wholesalers to quickly fill its orders.

# Aim of the Project

The aim of this project is to analyze the best-selling books on Amazon between 2009 and 2019. The dataset for this project was scraped from the Amazon website, and the insights we hope to derive from this analysis are:

 - Distribution of Genre for the Unique Books - How many Fiction and Non Fiction books were best sellers on Amazon between 2009 and 2019?
 - Distribution of Genre for the Unique Years - How many Fiction and Non Fiction books were best sellers in each of the years?
 - Top 10 Authors - Based on the number of times their names appeared in the best-selling listings from 2009 to 2019
 - Top 10 Books - Based on the number of times the book appeared as a best-selling book from 2009 to 2019
 - Relationship between Reviews and Price - Do low-priced books get better reviews, and vice versa?
 - Genre performance based on reviews - Overall, did Fiction books have more ratings than Non Fiction books?
 - Relationship between Price and Year - Was the cost of the books increasing or decreasing over the years?
 - Top 20 Authors with highest User Ratings
 - Top 20 Authors with highest Reviews
 - Relationship between length of book title and User Ratings - Did short-titled books get more ratings than long-titled books?
 - Top 10 Best Selling Authors based on Genre and Number of Appearances
 - Top 10 Books based on User Ratings
 - Top 10 Books based on Reviews
 - Top 10 Best Selling Books by Genre and Number of Reviews

In [1]:
# load EDA packages
import numpy as np
import pandas as pd

# load web scraping packages
import requests
from selenium import webdriver
from bs4 import BeautifulSoup
from urllib.request import urlopen
from time import sleep


pd.set_option('display.max_rows',500)
pd.set_option('display.max_columns',500)
pd.set_option('display.width',1000)

# to bypass warnings
import warnings
warnings.filterwarnings('ignore')

# Gathering the Data

###### Let's build the URLs of the first and second pages for each year from 2009 to 2021

In [2]:
# the first and second pages follow the same URL pattern for every year
urls = []
years = [str(i) for i in range(2009, 2022)]  # list of years from 2009 to 2021
for year in years:
    urls.append(f"https://www.amazon.com/gp/bestsellers/{year}/books")
    urls.append(f"https://www.amazon.com/gp/bestsellers/{year}/books/ref=zg_bsar_pg_2/ref=zg_bsar_pg_2?ie=UTF8&pg=2")

#urls

###### Let's use this function to get the details for each book in each year

In [3]:
def get_dir(book, year):
    '''Return the details of one book listing for the given year.

    Any field that cannot be found is returned as np.nan.
    '''
    try:
        # [1:] drops the leading character of the price text
        price = book.find('span', class_="_cDEzb_p13n-sc-price_3mJ9Z").text[1:]
    except Exception:
        price = np.nan
    try:
        # [1:] drops the leading character of the rank badge text
        ranks = book.find('span', class_='zg-bdg-text').text[1:]
    except Exception:
        ranks = np.nan
    try:
        title = book.find('div', class_="_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y").text
    except Exception:
        title = np.nan
    try:
        # keep only the first three characters of the rating text, e.g. '4.4'
        ratings = book.find('span', class_="a-icon-alt").text[:3]
    except Exception:
        ratings = np.nan
    try:
        no_of_reviews = book.find('span', class_="a-size-small").text
    except Exception:
        no_of_reviews = np.nan
    try:
        author = book.find('a', class_="a-size-small a-link-child").text
    except Exception:
        author = np.nan
    try:
        cover_type = book.find('span', class_="a-size-small a-color-secondary a-text-normal").text
    except Exception:
        cover_type = np.nan
    return [price, ranks, title, no_of_reviews, ratings, author, cover_type, year]
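
The function above repeats the same try/except pattern for every field. As a minimal sketch of how that pattern could be factored out, the helper below wraps a single lookup and falls back to a default value. `safe_text` and `FakeTag` are hypothetical names used only for illustration; `FakeTag` stands in for a parsed HTML element so the sketch runs without a browser or a page download.

```python
import math


def safe_text(getter, default=math.nan):
    """Return getter(), or `default` if the lookup fails for any reason."""
    try:
        return getter()
    except Exception:
        return default


class FakeTag:
    """Hypothetical stand-in for a parsed HTML element with a .text attribute."""

    def __init__(self, text):
        self.text = text


found = FakeTag("$12.81")
missing = None  # BeautifulSoup's find() returns None when nothing matches

price = safe_text(lambda: found.text[1:])   # lookup succeeds
author = safe_text(lambda: missing.text)    # lookup fails, default is used

print(price, author)
```

Inside `get_dir`, each field would then be one line, for example `safe_text(lambda: book.find('span', class_='zg-bdg-text').text[1:])`.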

###### Let's repeat each year twice, once for the first page and once for the second page

In [4]:
year = [(str(i), str(i)) for i in range(2009, 2022)]  # one (year, year) pair per year: first and second page
years = [j for i in year for j in i]  # flatten the pairs into a single list
#years
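
The `urls` and `years` lists are built in separate cells and only line up because both follow the same order. A minimal sketch of an alternative, assuming the same URL patterns and year range, keeps each year next to its URL from the start. `PAGE_ONE`, `PAGE_TWO` and `year_urls` are hypothetical names that do not appear elsewhere in this notebook.

```python
# Same URL patterns and year range as the cells above.
PAGE_ONE = "https://www.amazon.com/gp/bestsellers/{year}/books"
PAGE_TWO = "https://www.amazon.com/gp/bestsellers/{year}/books/ref=zg_bsar_pg_2/ref=zg_bsar_pg_2?ie=UTF8&pg=2"

# Each entry is a (year, url) tuple, so the label can never drift from the page.
year_urls = []
for y in range(2009, 2022):
    for pattern in (PAGE_ONE, PAGE_TWO):
        year_urls.append((str(y), pattern.format(year=y)))

print(len(year_urls))  # two pages for each of the 13 years
print(year_urls[0])
```

With this layout, a scraping loop could iterate over `year_urls` directly and would not need a separate index list.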

###### Let's get the books on every page (first and second) of every year from 2009 to 2021
###### Note that this cell takes about 25 minutes to run, so please be patient

In [5]:
all_year = []  # page content (first and second page) for every year of interest
for url in urls:  # loop through the urls created in cell 2

    website = url

    # load the Selenium webdriver (Selenium 3 style call; newer Selenium versions may need a Service object instead)
    driver = webdriver.Chrome("C:/webDrivers/chromedriver.exe")

    driver.get(website)  # use the webdriver above to open the page

    sleep(30)  # make sure the page is fully loaded before parsing it

    the_soup = BeautifulSoup(driver.page_source, 'html.parser')  # parse the page content

    books = the_soup.find_all(id='gridItemRoot')  # get every book on the page

    all_year.append(books)  # add the books to the list above

    driver.quit()  # close the Chrome window after extracting the data

In [6]:
len(all_year), len(years)  #to confirm you got all the years(first and second page)
#should be the same

(26, 26)

###### Pair each year with its index so that looping through the scraped pages is easier

In [7]:
year_index = list(enumerate(years))  # (index, year) pairs, one per scraped page
dc = year_index

###### Use the code below to get the observations for all the books in the top 100 of every year from 2009 to 2021

In [8]:
data = []  # observations for all the books in the top 100 of every year from 2009 to 2021
for i in dc:  # loop through the (index, year) pairs from the cell above
    for books in all_year[i[0]]:  # loop through the grid items found on that page
        for book in books:  # loop through the child elements of each grid item
            data.append(get_dir(book, i[1]))  # get the book details and add them to data

#data #to print the data collected

In [9]:

# open file
with open('Amazon.txt', 'w+') as f:

    # write elements of list, one per line
    for items in data:
        try:
            f.write('%s\n' % items)
        except Exception:
            f.write('nothing\n')  # placeholder for a row that could not be written

print("File written successfully")

# the with block closes the file automatically, so no f.close() is needed

File written successfully
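
The cell above stores each row as the text form of a Python list, which is awkward to parse later. As a minimal sketch of an alternative, the standard-library `csv` module writes one delimited record per row and quotes fields that contain commas. The two sample rows are copied from the final output of this notebook, and an in-memory buffer stands in for a real file so the example has no side effects.

```python
import csv
import io

header = ["price", "ranks", "title", "no_of_reviews", "ratings", "author", "cover_type", "year"]

# Sample rows in the same field order that get_dir returns.
rows = [
    ["12.81", "1", "The Lost Symbol", "16129", "4.4", "Dan Brown", "Hardcover", "2009"],
    ["14.30", "4", "Breaking Dawn (The Twilight Saga, Book 4)", "16923", "4.7",
     "Stephenie Meyer", "Hardcover", "2009"],
]

# In a notebook this could be open('Amazon.csv', 'w', newline='', encoding='utf-8').
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)

# Reading the text back recovers every field, including the title with commas.
buffer.seek(0)
restored = list(csv.reader(buffer))
print(restored[2][2])
```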


###### This cell converts the extracted list data to a DataFrame

In [10]:
best_selling_books = pd.DataFrame(data, columns=[
    'price',
    'ranks',
    'title',
    'no_of_reviews',
    'ratings',
    'author',
    'cover_type',
    'year'])


###### Save the data to a CSV file

In [11]:
best_selling_books.to_csv('best_selling_books_2009-2021.csv')   #To save to csv

In [12]:
best_selling_books

Unnamed: 0,price,ranks,title,no_of_reviews,ratings,author,cover_type,year
0,12.81,1,The Lost Symbol,16129,4.4,Dan Brown,Hardcover,2009
1,10.43,2,The Shack: Where Tragedy Confronts Eternity,23398,4.7,William P. Young,Paperback,2009
2,9.93,3,Liberty and Tyranny: A Conservative Manifesto,5037,4.8,Mark R. Levin,Hardcover,2009
3,14.30,4,"Breaking Dawn (The Twilight Saga, Book 4)",16923,4.7,Stephenie Meyer,Hardcover,2009
4,9.99,5,Going Rogue: An American Life,1572,4.6,Sarah Palin,Hardcover,2009
...,...,...,...,...,...,...,...,...
1286,16.69,96,Will,Will Smith,4.8,,Hardcover,2021
1287,7.49,97,Think and Grow Rich: The Landmark Bestseller N...,83367,4.7,Napoleon Hill,Paperback,2021
1288,8.95,98,Dragons Love Tacos,15771,4.8,Adam Rubin,Hardcover,2021
1289,7.49,99,The Truth About COVID-19: Exposing The Great R...,Doctor Joseph Mercola,4.8,,Hardcover,2021
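
In rows 1286 and 1289 above, the `no_of_reviews` column holds an author name and the `author` column is empty, which suggests that the `a-size-small` lookup in `get_dir` matched a different element for those books. A minimal sketch of a check that flags such rows is shown below; `is_count` is a hypothetical helper, and the sample tuples are taken from the rows printed above.

```python
def is_count(value):
    """True if value looks like a review count such as '16129' or '16,129'."""
    return str(value).replace(",", "").isdigit()


# (title, no_of_reviews, author); None marks a missing author.
sample = [
    ("The Lost Symbol", "16129", "Dan Brown"),
    ("Will", "Will Smith", None),
    ("Dragons Love Tacos", "15771", "Adam Rubin"),
]

# Titles whose review field does not look like a number need a closer look.
suspect = [title for title, reviews, _ in sample if not is_count(reviews)]
print(suspect)
```

Rows flagged this way can be inspected or repaired during cleaning before any analysis relies on the review counts.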


###### This is the end of this notebook.

We will continue with the analysis in the Part Two notebook, working from the saved CSV file. That way we don't have to scrape Amazon again every time we rerun the code, and the data stays the same between runs.

It also helps us respect the ethics of web scraping.
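
One common way to act on scraping ethics is to check a site's `robots.txt` before requesting a page. The sketch below uses the standard-library `urllib.robotparser` for this. The rules and the `example.com` URLs are hypothetical and are not Amazon's actual `robots.txt`; they are parsed from a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT Amazon's real robots.txt.
example_rules = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
""".splitlines()

parser = RobotFileParser()
parser.parse(example_rules)

# can_fetch() reports whether the given user agent may request the URL.
allowed = parser.can_fetch("*", "https://example.com/books/page1")
blocked = parser.can_fetch("*", "https://example.com/private/data")

# crawl_delay() returns the requested pause between requests, if any.
delay = parser.crawl_delay("*")

print(allowed, blocked, delay)
```

For a live site, `set_url()` followed by `read()` would load the real file instead of the inline rules used here.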