# About project

The main goal of this project is to scrape Amazon for data about best-seller books: the title, author, rating, number of users who rated, and price, for all available pages of search results. 
The scraped information is then stored in a pandas DataFrame, with each field in a separate column.

# Implementation

In [24]:
import csv
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

# Extract a single record from the HTML and return its fields as a tuple, or None if any field is missing.
def extract_records(item): 
    
    atag = item.h2.a
    
    # Getting book title
    title = atag.get_text()
    
    # Getting book price; wrapped in try/except in case the price is missing.
    try: 
        price = item.find('span', 'a-offscreen').text
    except AttributeError:
        return
    
    # Getting book rating
    try: 
        rating = item.i.text
    except AttributeError:
        return    
    
    # Getting book author
    try: 
        author = item.find('a', 'a-size-base a-link-normal').text
    except AttributeError:
        return 
    
    # Getting number of users rated.
    # This is not always reliable: the count sits in a <span> tag nested under an <a> tag
    # that also contains other text, so the value is not always a clean number.
    try: 
        num = item.find_all('span', 'a-size-base')[1].text
    except (AttributeError, IndexError):
        # find_all returns a list, so a missing second element raises IndexError.
        return
    
    # Keep all information about book in tuple. 
    result = (title, author, rating, num, price)
    
    return result

# Build the URL template for a given search term, so data can be collected for any search.
# The returned URL ends with a "{}" placeholder for the page number.
def get_url(search_term):
    temp = "https://www.amazon.com/s?k={}&qid=1638647612&ref=sr_pg_1"
    search_term = search_term.replace(' ', '+')
    url = temp.format(search_term)
    url += "&page={}"
    
    return url

# The search term used in this project is "best seller books".
search_term = "best seller books"

driver = webdriver.Chrome("C:\\Users\\sonaasat\\Documents\\chromedriver.exe")
url = get_url(search_term)
driver.get(url.format(1))  # fill in the page placeholder before loading
soup = BeautifulSoup(driver.page_source, 'html.parser')

# The following line finds all records from the html source page. 
all_records = soup.find_all("div", {'data-component-type': 's-search-result'})

# Collect the records from result pages 1 to 6 in a list, to be written to a CSV file.
data = []
for page in range(1,7):
    driver.get(url.format(page))

    soup = BeautifulSoup(driver.page_source, 'html.parser')
    all_records = soup.find_all("div", {'data-component-type': 's-search-result'})

    for items in all_records:
        rec = extract_records(items)
        if rec: 
            data.append(rec)


driver.close()

# Creating a csv file. 
with open('data_books.csv', 'w', newline='', encoding='utf-8') as f: 
    writer = csv.writer(f)
    writer.writerow(['Title', 'Author', 'Rating','Number of users rated','Price'])
    writer.writerows(data)
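
The comment in `extract_records` notes that the rating count shares a `<span>` with other text and could not be isolated. A minimal sketch of one way to pull out the number with a regular expression, assuming the text contains an integer with optional thousands separators. The helper name `parse_rating_count` and the sample strings are illustrative, not taken from the scraped pages:

```python
import re

# Matches an integer with optional thousands separators, e.g. "9" or "21,806".
COUNT_PATTERN = re.compile(r"\d{1,3}(?:,\d{3})+|\d+")

def parse_rating_count(text):
    """Return the first integer found in text, or None if there is none."""
    match = COUNT_PATTERN.search(text)
    if match is None:
        return None
    return int(match.group().replace(",", ""))

print(parse_rating_count("21,806 ratings"))  # 21806
print(parse_rating_count("Paperback"))       # None
```

Returning `None` for text without digits would let the caller skip the record, in the same way `extract_records` skips records with a missing field.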



In [2]:
import csv
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

In [3]:
# I manually cleaned the "Number of users rated" column and saved the result as data.csv.
# The original scraped data is in data_books.csv.
df = pd.read_csv('data.csv')

# Data pre-processing
df['Number of users rated'] = df['Number of users rated'].str.replace(',', '')
df['Number of users rated'] = pd.to_numeric(df['Number of users rated'])

df['Rating'] = df['Rating'].str.replace(' out of 5 stars', '')
df['Rating'] = pd.to_numeric(df['Rating'])

In [4]:
# Check whether any missing (N/A) values exist
df.isna().sum()

Title                    0
Author                   0
Rating                   0
Number of users rated    0
Price                    0
dtype: int64

In [5]:
class Book:
    
    def __init__(self, title, author, rating, num_users_rated, price):
        """
        Create the attributes title, author, rating, num_users_rated and price
        """
        
        self.title = title
        self.author = author
        self.rating = rating
        self.num_users_rated= num_users_rated
        self.price = price
        
    
    def pretty_print(self):
        """
        Print book information in the following format:
        $title by $author with rating $rating costs $price.
        """
        
        # A single book is passed as plain strings; several books arrive as pandas Series.
        if isinstance(self.title, str):
            return self.title + " by " + self.author + ' with rating ' + str(self.rating) + ' costs ' + str(self.price) + '.'
        
        info = []
        for i in range(len(self.title)):
            temp = self.title[i] + " by " + self.author[i] + ' with rating ' + str(self.rating[i]) + ' costs ' + str(self.price[i]) + '.'
            info.append(temp)
        
        return info
    
    def rating_eval(self):
        """
        Evaluates the statistical significance of the rating. The method should return 
        "statistically significant rating" (or ssr) in case more than 5000 users have
        rated the following book and should return "not statistically significant rating" (or nssr), otherwise.
        """
        
        def evaluate(count):
            if count > 5000:
                return "statistically significant rating"
            return "not statistically significant rating"
        
        # A single book has a plain number; several books arrive as a pandas Series.
        if not hasattr(self.num_users_rated, '__len__'):
            return evaluate(self.num_users_rated)
        
        return [evaluate(count) for count in self.num_users_rated]
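
The `Book` class above stores whole DataFrame columns in a single instance. As an alternative sketch (not the notebook's implementation), each row could become its own object. The class name `SingleBook` is a hypothetical name chosen for illustration; the sample values come from the "American Marxism" row shown later in the notebook:

```python
from dataclasses import dataclass

@dataclass
class SingleBook:
    """One book per instance (hypothetical alternative to the column-based Book)."""
    title: str
    author: str
    rating: float
    num_users_rated: int
    price: str

    def pretty_print(self):
        return f"{self.title} by {self.author} with rating {self.rating} costs {self.price}."

    def rating_eval(self):
        # Same 5000-rating threshold as in the assignment text.
        if self.num_users_rated > 5000:
            return "statistically significant rating"
        return "not statistically significant rating"

book = SingleBook("American Marxism", "Mark R. Levin", 4.9, 21806, "$14.00")
print(book.pretty_print())
print(book.rating_eval())
```

With this layout, a list of books could be built from the DataFrame with something like `[SingleBook(*row) for row in df.itertuples(index=False)]`, assuming the column order matches the field order.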

In [6]:
book = Book(df['Title'], df["Author"], df['Rating'] , df['Number of users rated'], df['Price'])

In [7]:
book.pretty_print()

['The Last Thing He Told Me: A Novel  by Laura Dave with rating 4.3 costs $14.00 .',
 'Where the Crawdads Sing  by Delia Owens with rating 4.8 costs $9.98 .',
 'Peril  by Bob Woodward with rating 4.5 costs $15.00 .',
 'Apples Never Fall  by Kindle with rating 4.2 costs $0.00 .',
 "The Sweetness of Water (Oprah's Book Club): A Novel  by Nathan Harris with rating 4.5 costs $14.99 .",
 'The Stranger in the Lifeboat: A Novel  by Mitch Albom with rating 4.6 costs $14.39 .',
 'The Wish  by Nicholas Sparks with rating 4.8 costs $14.00 .',
 'In Five Years: A Novel  by Rebecca Serle with rating 4.4 costs $11.06 .',
 'The Four Winds: A Novel  by Kristin Hannah with rating 4.5 costs $14.99 .',
 'The Beekeeper of Aleppo: A Novel  by Christy Lefteri with rating 4.4 costs $8.96 .',
 "The Judge's List: A Novel  by Kindle with rating 4.5 costs $0.00 .",
 'A Woman of No Importance: The Untold Story of the American Spy Who Helped Win World War II  by Sonia Purnell with rating 4.6 costs $13.39 .',
 'If Y

In [227]:
book.rating_eval()

['statistically significant rating',
 'statistically significant rating',
 'statistically significant rating',
 'not statistically significant rating',
 'statistically significant rating',
 'not statistically significant rating',
 'statistically significant rating',
 'statistically significant rating',
 'statistically significant rating',
 'statistically significant rating',
 'statistically significant rating',
 'statistically significant rating',
 'statistically significant rating',
 'statistically significant rating',
 'statistically significant rating',
 'not statistically significant rating',
 'statistically significant rating',
 'statistically significant rating',
 'statistically significant rating',
 'statistically significant rating',
 'not statistically significant rating',
 'statistically significant rating',
 'statistically significant rating',
 'statistically significant rating',
 'statistically significant rating',
 'statistically significant rating',
 'statistically signif

In [9]:
import random
class Library(Book):
    
    def __init__(self,title, author, rating, num_users_rated, price):
        """
        Create an attribute of type list called Book_list which will be an empty list for now.
        """
        
        Book_list = []
        self.Book_list = Book_list
        
        self.title = title
        self.author = author
        self.rating = rating
        self.num_users_rated= num_users_rated
        self.price = price
    
    def get_top_5(self):
        """
        Return the information related to 5 books that have the highest rating using the method pretty_print.
        If there are more than 5 books with the same rating, randomly select and show only 5 of them.
        """
        
        df_sorted = df.sort_values(by=['Rating'], ascending=False)
        df_sorted = df_sorted.reset_index()  
        
        book = Book(df_sorted['Title'], df_sorted["Author"], df_sorted['Rating'] , df_sorted['Number of users rated'], df_sorted['Price'])
        
        b = list(Book.pretty_print(book))
        
        k=0
        
        while k<5:
            print("Top %d: " %(k+1), b[k])
            print("-------------------------")
            k +=1
            
        return print("Top 5 rated books are being printed!")   
    
    def simple_search(self, title):
        """
        Search for a book with the required title and print information related to all books with titles exactly 
        matching the searched title using the method pretty_print.
        """
        
        book = Book(df['Title'], df["Author"], df['Rating'] , df['Number of users rated'], df['Price'])
        
        b = list(Book.pretty_print(book))
        
        matched_books = []
        
        for i in range(len(df['Title'])):
            if title == df['Title'][i]:
                matched_books.append(b[i])
        
        
        return matched_books
    
    def complex_search(self, title):
        """
        In this case the user may not remember the exact title of the book. If you have a book(s) with 
        title exactly matching the searched title, this method should perform just like the method simple_search.
        Otherwise, if you don't have any books with the searched title, you should return book(s) that have the
        most similar titles. You are free to implement this method however you like. 
        One version that you can implement in case you don't have other ideas:
        return the book(s) with titles containing the most number of words from the searched title 
        (return the information using the method pretty_print), and if there are no such books, 
        return "nothing matched your search".
        I would be happy to see any other interesting versions of the method :)
        """
        
        book = Book(df['Title'], df["Author"], df['Rating'] , df['Number of users rated'], df['Price'])
        
        b = list(Book.pretty_print(book))
        
        exact_matched_books = []
        
        for i in range(len(df['Title'])):
            if title == df['Title'][i]:
                exact_matched_books.append(b[i])
        
        # 1. Exact matches take priority.
        if exact_matched_books:
            return exact_matched_books
        
        # 2. Otherwise, look for titles that contain the search string (case-insensitive).
        match_books_contained_str = []
        for i in range(len(df['Title'])):
            if title.lower() in df['Title'][i].lower():
                match_books_contained_str.append(b[i])
        
        if match_books_contained_str:
            return match_books_contained_str
        
        # 3. Fallback: compare the sets of characters of the two strings.
        # A book matches if more than 8 distinct characters (case-sensitive,
        # spaces included) appear in both the search string and the title.
        books_with_similar_name = []
        for i in range(len(df['Title'])):
            shared_chars = set(title) & set(df['Title'][i])
            if len(shared_chars) > 8:
                books_with_similar_name.append(b[i])
        
        if books_with_similar_name:
            return books_with_similar_name
        
        print("Nothing matched your search. Try anything else.")
        return []
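
The `get_top_5` docstring asks for a random choice when more books than needed share a rating, while the method above simply takes the first five rows after sorting. A minimal sketch of one way to add the random tie-breaking, shown on plain `(title, rating)` tuples rather than the DataFrame. The function name `top_n_random_ties` and the sample data are hypothetical:

```python
import random

def top_n_random_ties(books, n=5, rng=random):
    """Return the n highest-rated (title, rating) pairs, breaking ties randomly."""
    shuffled = list(books)
    rng.shuffle(shuffled)  # random order among books with equal ratings
    # sorted() is stable, so the shuffled order survives within each rating group.
    return sorted(shuffled, key=lambda book: book[1], reverse=True)[:n]

# Hypothetical sample: one book rated 5.0, six rated 4.9, one rated 4.8.
sample_books = [("A", 5.0), ("B", 4.9), ("C", 4.9), ("D", 4.9),
                ("E", 4.9), ("F", 4.9), ("G", 4.9), ("H", 4.8)]

for title, rating in top_n_random_ties(sample_books):
    print(title, rating)
```

Book "A" always comes first, and the remaining four places go to a random selection of the six books rated 4.9, so repeated calls can print different titles.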
        

In [10]:
lib = Library(df['Title'], df["Author"], df['Rating'] , df['Number of users rated'], df['Price'])

In [11]:
lib.get_top_5()

Top 1:  The New York Times Book Review: 125 Years of Literary History  by Kindle with rating 5.0 costs $37.63 .
-------------------------
Top 2:  The Unofficial Masterbuilt Electric Smoker Cookbook: Ultimate Smoker Cookbook for Real Pitmasters, Includes Irresistible Meat, Fish, Poultry, Game, Vegetable Recipes for Your Electric Smoker  by Adam Jones with rating 4.9 costs $22.99 .
-------------------------
Top 3:  American Marxism  by Mark R. Levin with rating 4.9 costs $14.00 .
-------------------------
Top 4:  Nala's World: One Man, His Rescue Cat, and a Bike Ride around the Globe  by Dean Nicholson with rating 4.9 costs $14.49 .
-------------------------
Top 5:  Open Your Eyes: The true story of an Afghan doctor who found the meaning of life by following Rumi’s philosophy of living.  by Dr. Farid Mostamand with rating 4.9 costs $0.00 .
-------------------------
Top 5 rated books are being printed!


In [13]:
df_sorted = df.sort_values(by=['Rating'], ascending=False)
df_sorted = df_sorted.reset_index()

In [14]:
df_sorted.head(10)

| | index | Title | Author | Rating | Number of users rated | Price |
|---|---|---|---|---|---|---|
| 0 | 45 | The New York Times Book Review: 125 Years of L... | Kindle | 5.0 | 6828 | $37.63 |
| 1 | 154 | The Unofficial Masterbuilt Electric Smoker Coo... | Adam Jones | 4.9 | 9 | $22.99 |
| 2 | 93 | American Marxism | Mark R. Levin | 4.9 | 21806 | $14.00 |
| 3 | 100 | Nala's World: One Man, His Rescue Cat, and a B... | Dean Nicholson | 4.9 | 27900 | $14.49 |
| 4 | 240 | Open Your Eyes: The true story of an Afghan do... | Dr. Farid Mostamand | 4.9 | 32 | $0.00 |
| 5 | 174 | Once Upon a Chef: Weeknight/Weekend: 70 Quick-... | Jennifer Segal | 4.9 | 652 | $26.99 |
| 6 | 60 | A Promised Land | Hardcover | 4.9 | 4366 | $0.00 |
| 7 | 46 | The 1619 Project: A New Origin Story | Nikole Hannah-Jones | 4.8 | 2027 | $23.17 |
| 8 | 244 | Signs of Life Series: A Christian Fiction Thri... | Creston Mapes | 4.8 | 92 | $0.00 |
| 9 | 159 | Extreme Ownership: How U.S. Navy SEALs Lead an... | Paperback | 4.8 | 12069 | $0.00 |


In [15]:
lib.simple_search("American Marxism ")

['American Marxism  by Mark R. Levin with rating 4.9 costs $14.00 .']

In [16]:
lib.complex_search("American")

['A Woman of No Importance: The Untold Story of the American Spy Who Helped Win World War II  by Sonia Purnell with rating 4.6 costs $13.39 .',
 "American Dirt (Oprah's Book Club): A Novel  by Hardcover with rating 4.6 costs $0.00 .",
 'Vanderbilt: The Rise and Fall of an American Dynasty  by Anderson Cooper with rating 4.4 costs $18.00 .',
 'The Pioneers: The Heroic Story of the Settlers Who Brought the American Ideal West (Thorndike Press Large Print Popular and Narrative Nonfiction)  by David McCullough with rating 4.5 costs $34.99 .',
 'All American Christmas  by Rachel Campos-Duffy with rating 4.8 costs $18.92 .',
 'American Marxism  by Mark R. Levin with rating 4.9 costs $14.00 .',
 'Blood and Thunder: The Epic Story of Kit Carson and the Conquest of the American West  by Hampton Sides with rating 4.7 costs $17.10 .',
 'Facing the Mountain: A True Story of Japanese American Heroes in World War II  by Daniel James Brown with rating 4.7 costs $15.00 .']

In [17]:
lib.complex_search("Marxism")

['American Marxism  by Mark R. Levin with rating 4.9 costs $14.00 .']

In [18]:
lib.complex_search("novel")

['The Last Thing He Told Me: A Novel  by Laura Dave with rating 4.3 costs $14.00 .',
 "The Sweetness of Water (Oprah's Book Club): A Novel  by Nathan Harris with rating 4.5 costs $14.99 .",
 'The Stranger in the Lifeboat: A Novel  by Mitch Albom with rating 4.6 costs $14.39 .',
 'In Five Years: A Novel  by Rebecca Serle with rating 4.4 costs $11.06 .',
 'The Four Winds: A Novel  by Kristin Hannah with rating 4.5 costs $14.99 .',
 'The Beekeeper of Aleppo: A Novel  by Christy Lefteri with rating 4.4 costs $8.96 .',
 "The Judge's List: A Novel  by Kindle with rating 4.5 costs $0.00 .",
 'The Last Flight: A Novel  by Julie Clark with rating 4.4 costs $9.48 .',
 'The Couple Next Door: A Novel  by Shari Lapena with rating 4.3 costs $8.99 .',
 "American Dirt (Oprah's Book Club): A Novel  by Hardcover with rating 4.6 costs $0.00 .",
 'The Book of Two Ways: A Novel  by Jodi Picoult with rating 4.1 costs $10.99 .',
 'The Midnight Library: A Novel  by Matt Haig with rating 4.3 costs $13.29 .',
 

In [19]:
lib.complex_search("life")

['The Stranger in the Lifeboat: A Novel  by Mitch Albom with rating 4.6 costs $14.39 .',
 'The Second Life of Mirielle West: A Haunting Historical Novel Perfect for Book Clubs  by Amanda Skenandore with rating 4.6 costs $8.27 .',
 'The Extraordinary Life of Sam Hell: A Novel  by Robert Dugoni with rating 4.6 costs $0.00 .',
 'The Unfit Heiress: The Tragic Life and Scandalous Sterilization of Ann Cooper Hewitt  by Audrey Clare Farley with rating 4.0 costs $14.99 .',
 'All About Me!: My Remarkable Life in Show Business  by Mel Brooks with rating 4.5 costs $19.57 .',
 'Up and Down: Victories and Struggles in the Course of Life  by Bubba Watson with rating 4.7 costs $20.74 .',
 'Taste: My Life Through Food  by Stanley Tucci with rating 4.7 costs $18.14 .',
 'Unfu*k Yourself: Get Out of Your Head and into Your Life  by Hardcover with rating 4.6 costs $0.00 .',
 'Open Your Eyes: The true story of an Afghan doctor who found the meaning of life by following Rumi’s philosophy of living.  by Dr.

In [20]:
len(lib.complex_search("game of chess"))

55

In [21]:
lib.complex_search("game of chess")

['If You Tell: A True Story of Murder, Family Secrets, and the Unbreakable Bond of Sisterhood  by Gregg Olsen with rating 4.4 costs $11.98 .',
 "The President and the Freedom Fighter: Abraham Lincoln, Frederick Douglass, and Their Battle to Save America's Soul  by Brian Kilmeade with rating 4.8 costs $17.54 .",
 'The Secrets We Keep: A gripping emotional page turner  by Kate Hewitt with rating 4.2 costs $3.99 .',
 "The Story of The Masters: Drama, joy and heartbreak at golf's most iconic tournament  by David Barrett with rating 4.7 costs $21.49 .",
 'The Silent Wife: A gripping emotional page turner with a twist that will take your breath away  by Kerry Fisher with rating 4.3 costs $3.99 .',
 'The Second Life of Mirielle West: A Haunting Historical Novel Perfect for Book Clubs  by Amanda Skenandore with rating 4.6 costs $8.27 .',
 'To Rescue the Republic: Ulysses S. Grant, the Fragile Union, and the Crisis of 1876  by Bret Baier with rating 4.8 costs $17.74 .',
 'Laptop from Hell: Hunt

In [22]:
lib.complex_search("love")

['The Clover Girls: A Novel  by Viola Shipman with rating 4.4 costs $9.99 .',
 'Beyond Biden: Rebuilding the America We Love  by Newt Gingrich with rating 4.7 costs $18.84 .']

In [23]:
lib.complex_search("love story")

['Pretty Things: A Novel  by Paperback with rating 4.3 costs $0.00 .',
 'Freaking Idiots Guide To Selling On eBay: How anyone can make $100 or more everyday selling on eBay (Freaking Idiots Guides)  by Audible Audiobook with rating 4.0 costs $15.99 .',
 'The Greatest Beer Run Ever: A Memoir of Friendship, Loyalty, and War  by Kindle with rating 4.6 costs $14.38 .',
 'The Pioneers: The Heroic Story of the Settlers Who Brought the American Ideal West (Thorndike Press Large Print Popular and Narrative Nonfiction)  by David McCullough with rating 4.5 costs $34.99 .',
 'The Complete Air Fryer Cookbook for Beginners: 600 Affordable, Quick & Easy Air Fryer Recipes with Tips & Tricks to Fry, Grill, Roast, and Bake Your Favorite Daily Meals  by Kindle with rating 4.3 costs $15.99 .',
 'Freaking Idiots Guides 4 Book Bundle Ebay Fiverr Kindle & Public Domain  by Nick Vulich with rating 4.6 costs $5.66 .',
 'No Way Out: A Gripping Novel of Suspense  by Fern Michaels with rating 4.1 costs $0.00 .',

# Summary

I used the webdriver module from the selenium package for scraping. 
I then added a function, "get_url", which generates the URL for any search term (not only for "best seller books"). 
Finally, the data were stored in a CSV file. 
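
As a side note on URL generation, the standard library can handle the escaping that "get_url" does by hand with `replace(' ', '+')`. A minimal sketch, assuming the `qid` and `ref` query parameters are not required for the search to work (this has not been verified against the site). The function name `build_search_url` is hypothetical:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.amazon.com/s"

def build_search_url(search_term, page=1):
    """Build a search URL; urlencode escapes spaces and special characters."""
    query = urlencode({"k": search_term, "page": page})
    return f"{BASE_URL}?{query}"

print(build_search_url("best seller books", page=2))
# https://www.amazon.com/s?k=best+seller+books&page=2
```

Unlike a plain string replacement, `urlencode` also escapes characters such as `&` that would otherwise break the query string.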

For the second task, I implemented two search algorithms: the first (simple_search) returns the books whose titles match exactly, and the second (complex_search) returns the books the user might have meant. complex_search first looks for exact matches, then for titles that contain the search string (ignoring case), and finally falls back to counting the distinct characters that appear in both strings: if more than 8 are shared, the book is returned. 
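
Because the fallback compares sets of characters, a longer query can match many unrelated titles (the "game of chess" search in the notebook returned 55 books). A minimal sketch of the word-based variant suggested in the assignment text, which ranks titles by the number of whole words they share with the query. The function name `word_overlap_search` is hypothetical, and this is not the notebook's implementation:

```python
def word_overlap_search(query, titles):
    """Return the titles sharing the most whole words with query (case-insensitive)."""
    query_words = set(query.lower().split())
    best_score, best_titles = 0, []
    for title in titles:
        # Words are split on whitespace only, so punctuation stays attached.
        score = len(query_words & set(title.lower().split()))
        if score > best_score:
            best_score, best_titles = score, [title]
        elif score == best_score and score > 0:
            best_titles.append(title)
    return best_titles

titles = [
    "American Marxism",
    "The Wish",
    "Where the Crawdads Sing",
]
print(word_overlap_search("american marxism book", titles))  # ['American Marxism']
print(word_overlap_search("game of chess", titles))           # []
```

An empty list signals that nothing matched, which the caller can turn into the "nothing matched your search" message.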

Finally, some experiments were run to check the logic of the script. 

Thanks for this project! It was very helpful. 