# BestBuy Review Sorter

This script catalogues a chosen product category (All Laptops, All TVs, etc.) on BestBuy and collects each product's total review count and a link to its review section.

Once these links are scraped, the data is passed to the *BestBuy Review Scraper*, so this is the first stage of the review scraping effort.

BestBuy's website is very complex, so this scraper is not 100% robust.

* Sometimes it breaks and has to be restarted from where it stopped.
* BestBuy serves different versions of its website for different locales, which can break the script as well.
* I observed that the XPATHs for the same elements changed within a short period of time (3 weeks).

For these reasons this should be used carefully and in a semi-supervised manner.
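Since the scraper sometimes has to be restarted from where it stopped, one option is to write the collected data to disk after every page. The following is only a minimal sketch under my own assumptions: the file name `scrape_checkpoint.json` and the helpers `save_checkpoint` / `load_checkpoint` are hypothetical and not part of the original script, and navigating the browser back to the right page is left out.

```python
import json
from pathlib import Path

CHECKPOINT = Path("scrape_checkpoint.json")  # hypothetical file name


def save_checkpoint(data, path=CHECKPOINT):
    """Write the scraped columns to disk so a crashed run can be resumed."""
    path.write_text(json.dumps(data), encoding="utf-8")


def load_checkpoint(path=CHECKPOINT):
    """Return (data, last_page); start with empty columns if no file exists."""
    if not path.exists():
        empty = {"Title": [], "Review_Count": [], "Page": [],
                 "Link": [], "Model": [], "SKU": []}
        return empty, 0
    data = json.loads(path.read_text(encoding="utf-8"))
    last_page = max(data["Page"], default=0)
    return data, last_page
```

Calling `save_checkpoint(data)` at the end of each loop iteration and `load_checkpoint()` before restarting would tell you the last page that was fully stored.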

In [1]:
import selenium
from selenium import webdriver
import pandas as pd
import time
import re

In [2]:
## This sets the preferred page language for Firefox. It doesn't change Firefox's interface language.

options = webdriver.FirefoxOptions()
options.set_preference('intl.accept_languages', 'en-US')

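## The options object has to be passed to the driver, otherwise the preference is ignored.
driver = webdriver.Firefox(executable_path="geckodriver.exe", options=options)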

In [77]:
# I am supplying the first page of the product listing manually.
# In this case it is All Phones.

driver.get("https://www.bestbuy.com/site/mobile-cell-phones/all-cell-phones/pcmcat1625163553254.c?id=pcmcat1625163553254")

In [78]:
## If you access BestBuy from outside North America,
## it asks you to choose a country: USA, Canada or Mexico.
## This chooses USA.
## It won't be asked again as long as you stay on the same IP.

flag = driver.find_element_by_xpath("//a[@class='us-link']")
flag.click()

In [80]:
## This dictionary holds the scraped data.

data = {"Title" : [], "Review_Count" : [], "Page" : [],
       "Link" : [], "Model" : [], "SKU" : []}

In [86]:
page = True
while page:  
    try:
        ## Current Page Number
        page_num = int(driver.find_element_by_xpath("//span[@class='trans-button current-page-number']").text)
        
        ## These are the elements that hold only the products in the search results.
        ## The script then uses relative XPATHs to find the fields inside each of them.
        products = driver.find_elements_by_xpath("//ol[@class='sku-item-list']/li[@class='sku-item']")
        
        for i in products:
            data["Title"].append(i.find_element_by_xpath(".//h4[@class='sku-header']/a").text)
            data["Link"].append(i.find_element_by_xpath(".//h4[@class='sku-header']/a").get_attribute("href"))
            data["Review_Count"].append(i.find_element_by_xpath(".//span[@class='c-reviews order-2']").text)
            # There are packages which have no SKU or Model IDs.
            # Both values are looked up before anything is appended, so a missing SKU
            # cannot leave the Model list one entry longer than the others.
            try:
                model = i.find_element_by_xpath(".//div/div[1]/span[@class='sku-value']").text
                sku = i.find_element_by_xpath(".//div/div[2]/span[@class='sku-value']").text
            except selenium.common.exceptions.NoSuchElementException:
                model, sku = "NA", "NA"
            data["Model"].append(model)
            data["SKU"].append(sku)
            
            data["Page"].append(page_num)

        
        next_page = driver.find_element_by_xpath("//a[@class='sku-list-page-next']")
        next_page.click()
        
        ## Limit the amount of pages to be scraped for testing purposes
        
        #if page_num >6: 
        #    page = False
        
        time.sleep(3) # I don't want to spam the web site
        
    ## Sometimes the website pops up a survey window which intercepts the next-page button
    except selenium.common.exceptions.ElementClickInterceptedException:
        no_button = driver.find_element_by_xpath("//button[@id='survey_invite_no']")
        no_button.click()
        
        time.sleep(5) # Pop-up goes away a little bit late
    
    ## Stops the while loop after scraping the last page.
    except selenium.common.exceptions.NoSuchElementException:
        page = False

In [87]:
## Control Measure: All lists in the dictionary must have the same length.
for i in data.keys():
    print(len(data[i]))

512
512
512
512
512
512
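Reading six numbers by eye works, but the same control can be turned into a check that fails loudly. This is a small sketch; the helper names `column_lengths` and `is_rectangular` and the sample dictionary are made up for illustration.

```python
def column_lengths(data):
    """Map each column name to the number of values collected for it."""
    return {name: len(values) for name, values in data.items()}


def is_rectangular(data):
    """True when every column holds the same number of values."""
    return len(set(column_lengths(data).values())) <= 1


# Made-up sample in which the SKU column is one value short.
sample = {"Title": ["a", "b"], "Link": ["u1", "u2"], "SKU": ["1"]}
print(column_lengths(sample))  # {'Title': 2, 'Link': 2, 'SKU': 1}
print(is_rectangular(sample))  # False
```

With the real dictionary, `assert is_rectangular(data)` placed before the DataFrame conversion would stop the notebook as soon as the columns drift apart.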


In [88]:
## Converting the dictionary into a Pandas DataFrame

phones = pd.DataFrame(data)

In [93]:
## Checking the product counts per page.
## They should be equal except for the last page, but
## sometimes the script gets stuck on the same page.
## This is not a problem because duplicated values are dropped in the cleaning phase.

phones.value_counts("Page")

Page
7     48
1     24
12    24
20    24
19    24
18    24
17    24
16    24
15    24
14    24
13    24
11    24
2     24
10    24
9     24
8     24
6     24
5     24
4     24
3     24
21     8
dtype: int64
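In the output above, page 7 holds 48 rows instead of 24, which is what a page scraped twice looks like. The sketch below flags such pages automatically. It assumes a full page lists 24 products (taken from the counts above), and the helper name `repeated_pages` is hypothetical.

```python
from collections import Counter


def repeated_pages(pages, per_page=24):
    """Return {page: row_count} for pages with more rows than one page can list."""
    counts = Counter(pages)
    return {page: n for page, n in counts.items() if n > per_page}


# Made-up page column: page 7 was scraped twice, page 21 is a short last page.
pages = [1] * 24 + [7] * 48 + [21] * 8
print(repeated_pages(pages))  # {7: 48}
```

On the real data this could be called as `repeated_pages(data["Page"])`.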

# Data Cleaning Starts From Here

From time to time BestBuy shows visitors a survey, and because of that the script scrapes the same page more than once. So I am making sure that I have the same number of observations as BestBuy's inventory.

In [95]:
## Here duplicates are dropped with respect to the Title and Model variables and
## the result is saved as a .csv file. This way the raw data stays in memory (phones).
## If anything goes wrong I can come back to it.

phones.drop_duplicates(["Title", "Model"]).to_csv("reviews.csv", index=False)

In [96]:
## Further cleaning is done on the backup data.
## Reading the saved data again.

phone_all = pd.read_csv("reviews.csv")

Missing Values = Not yet reviewed = Review Count 0

I am removing those observations.

In [98]:
phone_all = phone_all[phone_all.Review_Count != "Not yet reviewed"].reset_index(drop=True)

Checking some high review counts to see whether they contain thousands separators.

In [101]:
phone_all[phone_all.Review_Count.str.len() > 6]

Unnamed: 0,Title,Review_Count,Page,Link,Model,SKU
42,Apple - Pre-Owned iPhone 7 with 128GB Memory C...,(1.242),2,https://www.bestbuy.com/site/apple-pre-owned-i...,7 128GB BLACK-RB,6351946
186,Samsung - Galaxy Note20 Ultra 5G 128GB (Unlock...,(1.139),9,https://www.bestbuy.com/site/samsung-galaxy-no...,SM-N986UZKAXAA,6420841
235,Apple - Pre-Owned (Excellent) iPhone 7 with 32...,(1.242),11,https://www.bestbuy.com/site/apple-pre-owned-e...,7 32GB SILVER CRB,6219570
284,Samsung - Galaxy Note20 Ultra 5G 128GB - Mysti...,(1.028),13,https://www.bestbuy.com/site/samsung-galaxy-no...,SM-N986UZKAVZW,6422216
296,Samsung - Galaxy Note9 128GB - Midnight Black ...,(1.673),14,https://www.bestbuy.com/site/samsung-galaxy-no...,SPHN960UBLK,6298948
297,Apple - iPhone XS Max 64GB - Gold (AT&T),(3.965),14,https://www.bestbuy.com/site/apple-iphone-xs-m...,MT5C2LL/A,6009657
301,Apple - iPhone XS 64GB - Space Gray (AT&T),(1.984),14,https://www.bestbuy.com/site/apple-iphone-xs-6...,MT942LL/A,6009640
303,Samsung - Galaxy S20+ 5G Enabled 128GB - Aura ...,(1.325),14,https://www.bestbuy.com/site/samsung-galaxy-s2...,SMG986UZBV,6398395
307,Apple - iPhone XS Max 64GB - Space Gray (Verizon),(3.978),14,https://www.bestbuy.com/site/apple-iphone-xs-m...,MT592LL/A,6009877
310,Apple - iPhone 8 256GB - Gold (Sprint),(2.148),15,https://www.bestbuy.com/site/apple-iphone-8-25...,MQ7H2LL/A,6009802


There were package items in the inventory, and I removed them as well since they had many duplicated values and most of the single items are already in the data set.

Also, in the mobile phone case the same model appears in the data set under different providers, and each listing has its own reviews. This is why I am not removing duplicates based on the "Model" variable alone.
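To illustrate the difference, here is a small sketch with made-up rows (the titles and the model ID `PX64` are invented): dropping duplicates on "Model" alone would throw away a provider, while the Title and Model pair keeps one row per listing.

```python
import pandas as pd

# Made-up rows: one model sold by two providers, plus an exact repeat.
sample = pd.DataFrame({
    "Title": ["Phone X 64GB (Verizon)", "Phone X 64GB (AT&T)", "Phone X 64GB (AT&T)"],
    "Model": ["PX64", "PX64", "PX64"],
})

by_model = sample.drop_duplicates("Model")            # keeps only the Verizon row
by_both = sample.drop_duplicates(["Title", "Model"])  # keeps Verizon and AT&T

print(len(by_model), len(by_both))  # 1 2
```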

In [117]:
## Checking for NA values in the Model variable

phone_all.Model.isna().sum()

0

Review counts are wrapped in parentheses, so I am extracting them with a regular expression. Then I will check the new values against the data frame above.

In [119]:
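## A raw string avoids invalid escape sequences; expand=False returns a Series instead of a one-column DataFrame.
phone_all["Review_Count"] = phone_all.Review_Count.str.extract(r"\((.+)\)", expand=False)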
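To see what this pattern captures without touching the data frame, it can be tried on a few strings with the standard `re` module. This is a sketch: the helper name `extract_count` is hypothetical and the sample strings only mimic the scraped values.

```python
import re

PAREN = re.compile(r"\((.+)\)")  # same pattern as in the cell above


def extract_count(raw):
    """Return the text between the parentheses, or None if there are none."""
    match = PAREN.search(raw)
    return match.group(1) if match else None


for raw in ["(1.242)", "(87)", "Not yet reviewed"]:
    print(raw, "->", extract_count(raw))
```

The thousands delimiter is still inside the captured text at this point, which is why it is removed in a separate step.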

In [120]:
phone_all.duplicated("Link").sum()

0

Checking one last time: there are no duplicated links, which is a good sign.

**Removing all thousands delimiters.**

**The delimiter depends on your locale, so it could be either "." or ",". Just to be safe I am removing both.**



In [121]:
## regex=False makes "." a literal dot instead of the regex wildcard.
phone_all["Review_Count"] = phone_all.Review_Count.str.replace(".", "", regex=False)
phone_all["Review_Count"] = phone_all.Review_Count.str.replace(",", "", regex=False)


In [123]:
## As a safety check I am converting the variable to integer.
## If this raises an exception, something went wrong in the earlier steps.
phone_all = phone_all.astype({"Review_Count" : "int64"})
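If the conversion does raise, the exception alone does not show every offending row. A possible way to list them first is sketched below on made-up values. It assumes the column contains only strings (no NaN), and the names `counts`, `is_clean` and `clean_ints` are introduced here for illustration.

```python
import pandas as pd

# Made-up sample: two clean values, one leftover delimiter, one stray text.
counts = pd.Series(["1242", "87", "1.139", "n/a"])

# str.match anchors at the start; the trailing $ requires digits up to the end.
is_clean = counts.str.match(r"\d+$")

print(counts[~is_clean].tolist())  # ['1.139', 'n/a']

# Only the clean values are converted here.
clean_ints = counts[is_clean].astype("int64")
```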

Finally, I am looking at repeated review count values to catch any remaining duplicates.

In [130]:
## Both conditions are combined into one mask so that it lines up with the frame's index.
phone_all[(phone_all.Review_Count > 500) & phone_all.duplicated("Review_Count", keep=False)].sort_values("Review_Count", ascending=False)


Unnamed: 0,Title,Review_Count,Page,Link,Model,SKU
338,Apple - iPhone XS 256GB - Space Gray (Verizon),2209,16,https://www.bestbuy.com/site/apple-iphone-xs-2...,MT972LL/A,6009869
343,Apple - iPhone XS 64GB - Space Gray (Verizon),2209,16,https://www.bestbuy.com/site/apple-iphone-xs-6...,MT942LL/A,6009860
235,Apple - Pre-Owned (Excellent) iPhone 7 with 32...,1242,11,https://www.bestbuy.com/site/apple-pre-owned-e...,7 32GB SILVER CRB,6219570
42,Apple - Pre-Owned iPhone 7 with 128GB Memory C...,1242,2,https://www.bestbuy.com/site/apple-pre-owned-i...,7 128GB BLACK-RB,6351946
399,Samsung - Galaxy Note20 Ultra 5G 128GB (Unlock...,1139,21,https://www.bestbuy.com/site/samsung-galaxy-no...,SM-N986UZNAXAA,6420838
186,Samsung - Galaxy Note20 Ultra 5G 128GB (Unlock...,1139,9,https://www.bestbuy.com/site/samsung-galaxy-no...,SM-N986UZKAXAA,6420841
185,Samsung - Galaxy S21 Ultra 5G 128GB (Unlocked)...,773,9,https://www.bestbuy.com/site/samsung-galaxy-s2...,SM-G998UZKAXAA,6441109
17,Motorola - Moto G Power 2021 (Unlocked) 64GB M...,773,1,https://www.bestbuy.com/site/motorola-moto-g-p...,PALF0005US,6441178
350,Apple - Pre-Owned iPhone 6s 4G LTE with 16GB M...,697,16,https://www.bestbuy.com/site/apple-pre-owned-i...,6S 16GB ROSE GOLD-RB,5945000
209,Apple - Pre-Owned iPhone 6s 4G LTE with 16GB C...,697,10,https://www.bestbuy.com/site/apple-pre-owned-i...,6S 16GB SILVER-RB,5872534


It seems like there may be duplicated observations. I can't really think of a way to remove these safely right now, so I am going to leave further cleaning until after I scrape the reviews.

In [137]:
## Checking the data one last time before saving.

phone_all.sort_values("Review_Count", ascending=False)

Unnamed: 0,Title,Review_Count,Page,Link,Model,SKU
307,Apple - iPhone XS Max 64GB - Space Gray (Verizon),3978,14,https://www.bestbuy.com/site/apple-iphone-xs-m...,MT592LL/A,6009877
297,Apple - iPhone XS Max 64GB - Gold (AT&T),3965,14,https://www.bestbuy.com/site/apple-iphone-xs-m...,MT5C2LL/A,6009657
347,Apple - iPhone 8 256GB - Gold (AT&T),3106,16,https://www.bestbuy.com/site/apple-iphone-8-25...,MQ7H2LL/A,6009697
337,Apple - iPhone XS Max 64GB - Space Gray (Sprint),2815,16,https://www.bestbuy.com/site/apple-iphone-xs-m...,MT592LL/A,6009764
312,Apple - iPhone 8 256GB - Gold (Verizon),2619,15,https://www.bestbuy.com/site/apple-iphone-8-25...,MQ7H2LL/A,6009917
...,...,...,...,...,...,...
377,Samsung - Geek Squad Certified Refurbished Gal...,1,19,https://www.bestbuy.com/site/samsung-geek-squa...,GSRF SM-G973UZKAXAA,6383385
378,Simple Mobile - Samsung Galaxy S10 with 128GB ...,1,19,https://www.bestbuy.com/site/simple-mobile-sam...,SMSAG973U1G3P5P,6337100
79,Verizon Prepaid - Samsung Galaxt A03 32GB Prep...,1,4,https://www.bestbuy.com/site/verizon-prepaid-s...,SMA037UXKVZPP,6495365
381,Samsung - Pre-Owned Galaxy S7 edge 4G LTE with...,1,19,https://www.bestbuy.com/site/samsung-pre-owned...,SM5G935UZSAXAA,6134322


In [129]:
phone_all.to_csv("your_reviews_clean.csv", index=False)

In [135]:
## Shutting down the Selenium WebDriver

driver.close()
driver.quit()