# BestBuy Review Scraper

Links produced by the *BestBuy Review Sorter* notebook are fed into this notebook. The warnings from that notebook are repeated here.

BestBuy has a very complex website, so this scraper is not 100% robust.

* Sometimes it breaks and must be restarted from where it stopped.
* BestBuy serves different versions of its website for different locales, which can break the script as well.
* I observed that the XPaths for the same elements changed within a short period of time (3 weeks).
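
One way to soften the XPath instability mentioned in the list above is to try several candidate XPaths in order and use the first one that matches. The sketch below is only an illustration: `find_with_fallbacks`, the fake page, and the two XPaths in it are hypothetical names that are not used elsewhere in this notebook.

```python
from typing import Callable, Iterable, List


def find_with_fallbacks(find_all: Callable[[str], List], xpaths: Iterable[str]) -> List:
    """Return the matches for the first XPath that yields any elements."""
    for xpath in xpaths:
        matches = find_all(xpath)
        if matches:
            return matches
    return []  # none of the candidates matched


# Stand-in for a page so the sketch runs without a browser:
# it maps an XPath to the elements that XPath would match (made-up values).
fake_page = {"//div[@class='new-rating']/p": ["element-1"]}


def fake_find_all(xpath):
    return fake_page.get(xpath, [])


ratings = find_with_fallbacks(
    fake_find_all,
    ["//div[@class='old-rating']/p", "//div[@class='new-rating']/p"],
)
print(ratings)  # ['element-1']
```

With Selenium, `find_all` could be something like `lambda xp: driver.find_elements(By.XPATH, xp)`, so a changed XPath only needs to be added to the candidate list.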

This notebook is the second phase. After the desired number of reviews has been scraped, further cleaning is done in another notebook.

In [139]:
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time
import re

In [140]:
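# Newer Selenium 4 releases take the driver path through a Service object instead of executable_path
from selenium.webdriver.firefox.service import Service

options = webdriver.FirefoxOptions()
options.set_preference("intl.accept_languages", "en-US")
driver = webdriver.Firefox(service=Service(executable_path="geckodriver.exe"), options=options)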

### Getting the links from the sorted list of products

For phones I am only scraping products which have over 400 reviews.

For laptops and TVs this threshold was about 900.

In [142]:
## Reading the data from Review Sorter

links = pd.read_csv("your_reviews_clean.csv")

In [143]:
## Here I apply an arbitrary minimum review count and then sort the values.
## Products with fewer than 901 reviews are not scraped.

links = links[links.Review_Count > 900].sort_values("Review_Count", ascending=False).reset_index(drop=True)

In [144]:
link = links["Link"][0]
link

'https://www.bestbuy.com/site/insignia-32-class-f20-series-led-hd-smart-fire-tv/6247254.p?skuId=6247254'

In [146]:
## The first time you access the website, it asks which country you are from.
## Here the USA is chosen; this is only asked once.

driver.get(link)

# find_element(By.XPATH, ...) replaces find_element_by_xpath, which Selenium 4.3 removed
flag = driver.find_element(By.XPATH, "//a[@class='us-link']")
flag.click()

In [188]:
## This Dictionary Holds the Data

review_set = {
    "Rating" : [],
    "Title" : [],
    "Date" : [],
    "Helpful" : [],
    "Unhelpful" : [],
    "Review" : [],
    "Product_Name" : [],
    "Page" : []
}

In [None]:
for l in (links["Link"] + "#tabbed-customerreviews"):
    page = True
    driver.get(l)
    # The website randomly throws a survey pop-up, so on every page I explicitly wait for it.
    # Also, for some reason the reviews tab sometimes won't open even though it is clicked. The last try/except
    # is a safeguard against that, but even it doesn't always work. In that case the loop stops and the user
    # must restart it from where it stopped.

    # In this version I added #tabbed-customerreviews to the end of the links in order to go straight to the
    # Reviews tab, which is why the "expand_reviews" elements are commented out.
    try:
        no_button = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//button[@id='survey_invite_no']")))

        no_button.click()
        # It takes a little time for the survey pop-up to go away, hence the 5-second wait.
        time.sleep(5)

        #expand_reviews = driver.find_element(By.XPATH, "//button[@aria-controls='ugc-ratings-and-reviews-accordion']")
        #expand_reviews.click()

        see_all_reviews = driver.find_element(By.XPATH, "//div[@class='see-all-reviews-button-container']/a")
        see_all_reviews.click()

    except selenium.common.exceptions.TimeoutException:
        #expand_reviews = driver.find_element(By.XPATH, "//button[@aria-controls='ugc-ratings-and-reviews-accordion']")
        #expand_reviews.click()

        try:
            see_all_reviews = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, "//div[@class='see-all-reviews-button-container']/a")))

            see_all_reviews.click()
        except selenium.common.exceptions.TimeoutException:
            # If everything goes wrong, print the index of the link that failed
            print(links[links.Link == l.replace("#tabbed-customerreviews", "")].index[0])
          
    time.sleep(1)
    
    # This part scrapes the reviews.
    while page:
        product_name = driver.find_element(By.XPATH, "//a[@class='v-fw-regular']").text

        current_page = int(driver.find_element(By.XPATH, "//li[@class='page active-button']").get_attribute("data-page-number"))

        read_more = driver.find_elements(By.XPATH, "//button[@class='c-button-link btn-read-more read-more-button']")

        # Expand any truncated reviews on the page (the loop does nothing when there are none)
        for button in read_more:
            button.click()
        
        # For some reason, ratings on the first page have different XPaths.
        if current_page == 1:
            ratings = driver.find_elements(By.XPATH, "//div[@class='c-ratings-reviews c-ratings-reviews-small ']/p[@class='visually-hidden']")
        else:
            ratings = driver.find_elements(By.XPATH, "//div[@class='c-ratings-reviews-v4 c-ratings-reviews c-ratings-reviews-v4-size-small c-ratings-reviews-small ']/p[@class='visually-hidden']")
        
        for i in ratings:
            review_set["Rating"].append((i.text).split(" ")[1])
            review_set["Product_Name"].append(product_name) # adding product name here
            review_set["Page"].append(current_page) # adding current page here
            
        titles = driver.find_elements(By.XPATH, "//h4[@class='c-section-title review-title heading-5 v-fw-medium']")
        for i in titles:
            review_set["Title"].append(i.text)

        dates = driver.find_elements(By.XPATH, "//div[@class='disclaimer v-m-right-xxs']/time[@class='submission-date']")
        for i in dates:
            review_set["Date"].append(i.get_attribute("title"))

        helpful_votes = driver.find_elements(By.XPATH, "//button[@class='c-button c-button-outline c-button-sm helpfulness-button no-margin-l']")
        for i in helpful_votes:
            review_set["Helpful"].append((i.text).split(" ")[1])

        unhelpful_votes = driver.find_elements(By.XPATH, "//button[@class='c-button-link link neg-feedback']")
        for i in unhelpful_votes:
            review_set["Unhelpful"].append((i.text).split(" ")[1])
        
        # For some other reason, review XPaths are different on the first page as well.
        if current_page == 1:
            reviews = driver.find_elements(By.XPATH, "//div[@class='ugc-review-body']/p[contains(@class,'pre-white-space')]")
        else:
            reviews = driver.find_elements(By.XPATH, "//div[@class='ugc-review-body']/div[@class='ugc-components ugc-line-clamp']/p[contains(@class,'pre-white-space')]")
        for i in reviews:
            review_set["Review"].append(i.text)

        # Kept here in case I want to limit the number of pages scraped, for testing purposes.
        #if (current_page == 2):
        #    break
        
        # The survey pop-up might also show up inside the reviews and block the click on the next page.
        # The first exception handler gets rid of that pop-up.
        try:
            next_page = driver.find_element(By.XPATH, "//li[@class='page next']/a[@class='']")
            next_page.click()
            time.sleep(1) # Don't want to spam the website.

        except selenium.common.exceptions.ElementClickInterceptedException:

            no_button = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, "//button[@id='survey_invite_no']")))
            no_button.click()

            time.sleep(10)

            next_page = driver.find_element(By.XPATH, "//li[@class='page next']/a[@class='']")
            next_page.click()

        # After all reviews from the last page are collected there is no "next" link, so the while loop ends
        except selenium.common.exceptions.NoSuchElementException:
            break
        
        time.sleep(1)
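
The "click, and if the survey pop-up intercepts the click, dismiss it and click again" pattern in the loop above could be pulled into a small helper. The sketch below is a hedged illustration, not the code this notebook ran: `click_with_dismiss` and the fake click and dismiss functions are hypothetical, and a plain `RuntimeError` stands in for Selenium's `ElementClickInterceptedException` so that it runs without a browser.

```python
import time
from typing import Callable, Tuple, Type


def click_with_dismiss(
    click: Callable[[], None],
    dismiss: Callable[[], None],
    blocked: Tuple[Type[Exception], ...],
    wait_seconds: float = 0.0,
) -> str:
    """Call click(); if it raises one of `blocked`, call dismiss(), wait, and retry once."""
    try:
        click()
        return "clicked"
    except blocked:
        dismiss()
        time.sleep(wait_seconds)  # give the overlay time to disappear
        click()
        return "clicked after dismiss"


# Stand-in objects so the sketch runs without a browser.
state = {"popup_open": True, "clicks": 0}


def fake_click():
    if state["popup_open"]:
        raise RuntimeError("click intercepted by pop-up")
    state["clicks"] += 1


def fake_dismiss():
    state["popup_open"] = False


result = click_with_dismiss(fake_click, fake_dismiss, blocked=(RuntimeError,))
print(result)  # clicked after dismiss
```

In the scraper, `click` would wrap the "next page" click, `dismiss` would wrap the `survey_invite_no` button click, and `blocked` would be `(selenium.common.exceptions.ElementClickInterceptedException,)`.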
            
        

In [174]:
## In case the script stops, this gets the index of the last link that was being processed.

links[links.Link == l.replace("#tabbed-customerreviews", "")].index[0]

77
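
To restart from where the script stopped, the loop can iterate over only the links from that index onwards. This is a minimal sketch with a made-up three-row `links_demo` frame and a made-up `last_index`; in the notebook the real `links` frame and the index printed above would be used instead.

```python
import pandas as pd

# Hypothetical stand-in for the `links` DataFrame used in this notebook.
links_demo = pd.DataFrame(
    {"Link": ["https://example.com/a.p", "https://example.com/b.p", "https://example.com/c.p"]}
)

last_index = 1  # index reported when the run stopped (made-up value)

# Keep the link that failed and everything after it, then add the reviews anchor.
remaining = links_demo["Link"].iloc[last_index:] + "#tabbed-customerreviews"
print(remaining.tolist())
```

Because the product at `last_index` may already have been partially scraped, restarting from it can add some reviews to `review_set` a second time, so duplicates may need to be dropped during cleaning.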

In [193]:
## Shuts the Selenium Webdriver

driver.close()
driver.quit()

In [194]:
## Control Measure: All lists in the dictionary must have the same length.

for i in review_set.keys():
    print(i,":", len(review_set[i]))

Rating : 7868
Title : 7868
Date : 7868
Helpful : 7868
Unhelpful : 7868
Review : 7868
Product_Name : 7868
Page : 7868
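
The same control can be written as a check that fails loudly instead of relying on reading the printed numbers. A small sketch, where `check_equal_lengths` and the `demo` dictionary are hypothetical and not part of the original notebook:

```python
def check_equal_lengths(columns):
    """Return the length of every list in the dict, raising if they are not all equal."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


demo = {"Rating": ["5", "4"], "Title": ["Great", "Fine"]}
print(check_equal_lengths(demo))  # {'Rating': 2, 'Title': 2}
```

Calling `check_equal_lengths(review_set)` before building the DataFrame would point at the field whose XPath stopped matching, since each field is scraped with its own XPath and a changed XPath shows up as a shorter list.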


In [195]:
## Converting the dictionary into a Pandas DataFrame

the_data = pd.DataFrame(review_set)
the_data

Unnamed: 0,Rating,Title,Date,Helpful,Unhelpful,Review,Product_Name,Page
0,5,So many free features! GORGEOUS colors and cla...,"Oct 31, 2020 9:53 PM",(182),(64),This tv was recommended to me by a friend who ...,"Samsung - 50"" Class 7 Series LED 4K UHD Smart ...",1
1,5,"Buy this TV, but buy THIS stand!!","Mar 5, 2022 6:11 PM",(6),(0),"I LOVE my Samsung television, but I was not in...","Samsung - 50"" Class 7 Series LED 4K UHD Smart ...",1
2,5,samsung where it's at,"Mar 24, 2021 12:25 AM",(23),(6),This tv has been amazing. I've had it for abou...,"Samsung - 50"" Class 7 Series LED 4K UHD Smart ...",1
3,5,Samsung 50” tv,"Jan 24, 2021 7:27 PM",(40),(10),I love it. It’s amazing! I had my doubts but i...,"Samsung - 50"" Class 7 Series LED 4K UHD Smart ...",1
4,5,I like,"Jun 19, 2020 7:08 PM",(52),(38),I love it better then my other Samsung tv I sh...,"Samsung - 50"" Class 7 Series LED 4K UHD Smart ...",1
...,...,...,...,...,...,...,...,...
7863,1,Would not recomend,"Jul 25, 2020 10:06 PM",(0),(0),Had many problems setting up. Could not conne...,"Samsung - 50"" Class 7 Series LED 4K UHD Smart ...",394
7864,1,Sorry Samsung,"Feb 26, 2021 10:35 PM",(1),(0),I have and love my Samsung curved monitor so I...,"Samsung - 50"" Class 7 Series LED 4K UHD Smart ...",394
7865,1,Extremely poor durability.,"Nov 27, 2020 10:57 PM",(3),(0),This is our second Samsung and neither has las...,"Samsung - 50"" Class 7 Series LED 4K UHD Smart ...",394
7866,1,Poor interface.,"May 29, 2020 2:43 AM",(1),(0),Probably the slowest TV interface I have ever ...,"Samsung - 50"" Class 7 Series LED 4K UHD Smart ...",394
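
As the table shows, `Helpful` and `Unhelpful` are still strings such as `(182)`, and `Date` is a string such as `Oct 31, 2020 9:53 PM`. The actual cleaning is done in another notebook. The sketch below only illustrates one possible way to convert these columns, using the first two rows shown above, and it assumes an English locale for the month abbreviation and AM/PM parsing.

```python
import pandas as pd

# First two rows of the raw output shown above (only the columns being converted).
raw = pd.DataFrame(
    {
        "Helpful": ["(182)", "(6)"],
        "Unhelpful": ["(64)", "(0)"],
        "Date": ["Oct 31, 2020 9:53 PM", "Mar 5, 2022 6:11 PM"],
    }
)

clean = raw.copy()

# "(182)" -> 182: strip the surrounding parentheses, then cast to int.
for col in ["Helpful", "Unhelpful"]:
    clean[col] = clean[col].str.strip("()").astype(int)

# "Oct 31, 2020 9:53 PM" -> Timestamp, using an explicit 12-hour format.
clean["Date"] = pd.to_datetime(clean["Date"], format="%b %d, %Y %I:%M %p")

print(clean.dtypes)
```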


In [196]:
## Checking whether products have the expected number of reviews.
## The scraped data might have more reviews than the counts collected by BestBuy Review Sorter,
## because BestBuy is very active and people keep writing reviews.

the_data.value_counts("Product_Name")

Product_Name
Samsung - 50" Class 7 Series LED 4K UHD Smart Tizen TV    7868
dtype: int64

In [199]:
# Saving the review data as a .csv file

the_data.to_csv("your_review_data_raw.csv", index=False)

In [180]:
## Re-reading the saved file
## The Title column might have missing values; that is OK.

pd.read_csv("your_review_data_raw.csv").info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9729 entries, 0 to 9728
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Rating        9729 non-null   int64 
 1   Title         9726 non-null   object
 2   Date          9729 non-null   object
 3   Helpful       9729 non-null   object
 4   Unhelpful     9729 non-null   object
 5   Review        9729 non-null   object
 6   Product_Name  9729 non-null   object
 7   Page          9729 non-null   int64 
dtypes: int64(2), object(6)
memory usage: 608.2+ KB
