# AMAZON CUSTOMER REVIEW SCRAPING 

## INTRODUCTION

This notebook presents the data-gathering process for a study of Amazon customer satisfaction using sentiment analysis. Through this analysis, we aim to add meaningful perspectives on the relationship between customer sentiment and satisfaction, and to contribute to the broader understanding of customer feedback and its impact on business performance.

## THE DATA GATHERING PROCESS

### Importing the required libraries

The data-gathering process uses the Python programming language to scrape customer reviews from Amazon's e-commerce platform. This was achieved using the Selenium library, beginning with the installation of the Selenium Chrome web driver on a local machine.

The required libraries for the analysis were then imported.

In [1]:
# import the necessary packages
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd
import re 
from selenium.common.exceptions import StaleElementReferenceException


### Loading the ASIN dataframe

To capture customer reviews from different product categories and across different years, a list of Amazon Standard Identification Numbers (ASINs) was used. An ASIN is a unique block of 10 characters used exclusively within Amazon's system to identify products. The list of ASINs used was downloaded from the Internet Archive and contains the ASINs of more than 75 million Amazon products.
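
Since an ASIN is a fixed-length identifier, a quick format check can catch malformed entries before they are passed to the scraper. The sketch below assumes that an ASIN consists of exactly 10 uppercase letters or digits; the helper name `is_valid_asin` is hypothetical and not part of the original notebook.

```python
import re

# Assumption: an ASIN is exactly 10 characters drawn from A-Z and 0-9.
ASIN_PATTERN = re.compile(r"[A-Z0-9]{10}")


def is_valid_asin(value):
    """Return True if value looks like a well-formed ASIN."""
    return isinstance(value, str) and ASIN_PATTERN.fullmatch(value) is not None


# Two ASINs taken from the dataframe preview, plus two malformed strings
sample = ["B00000IGGJ", "B00005JNBQ", "not-an-asin", "B0000"]
valid = [asin for asin in sample if is_valid_asin(asin)]
print(valid)
```

A check like this only confirms the shape of the identifier; it cannot tell whether the product still exists on Amazon.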

**NB: The next code cell will take a few minutes to run because it processes a large file.**

In [3]:
# loading the asin file
products = pd.read_csv('asins.csv')

In [4]:
# checking the shape of the dataframe
products.shape

(75115473, 1)

We have over 75 million ASINs for different Amazon products in the dataframe. 
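
With this many rows, it may be convenient to read only the slice of ASINs needed for one scraping run instead of the whole file. The following is a minimal sketch using the `usecols`, `skiprows` and `nrows` parameters of `pd.read_csv`; the function name `load_asin_slice` and the small in-memory sample are illustrative assumptions, not part of the original notebook.

```python
import io

import pandas as pd


def load_asin_slice(source, start, count):
    """Read `count` ASINs beginning at data row `start` (0-based)."""
    frame = pd.read_csv(
        source,
        usecols=["asin"],               # ignore the unnamed index column
        skiprows=range(1, start + 1),   # keep the header (row 0), skip earlier data rows
        nrows=count,
    )
    return frame["asin"].tolist()


# Small in-memory stand-in for 'asins.csv' (hypothetical sample data)
demo = io.StringIO(",asin\n0,B00000IGGJ\n1,B00005JNBQ\n2,B00006KGC0\n3,B00006KGC2\n")
asins = load_asin_slice(demo, start=1, count=2)
print(asins)
```

Note that the returned list is positional, so the original row labels of the full dataframe are not preserved.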

In [5]:
# checking the head of the dataframe
products.head() 

Unnamed: 0,asin
0,B00000IGGJ
1,B00005JNBQ
2,B00006KGC0
3,B00006KGC2
4,B00007EPJ6


The next step is to create an initially empty DataFrame to store the customer review data scraped from Amazon's e-commerce platform. While the DataFrame starts off empty, it gradually fills up with review data during subsequent iterations of the scraping process.

In [6]:
# This script is meant to run once at the start to create an empty DataFrame that will later be populated with customer review data.
'''
# Creating an empty DataFrame with the specified columns
columns = ['item', 'review', 'date', 'rating', 'review_word', 'item_url']
data = pd.DataFrame(columns=columns)
data
'''

"\n# Creating an empty DataFrame with the specified columns\ncolumns = ['item', 'review', 'date', 'rating', 'review_word', 'item_url']\ndata = pd.DataFrame(columns=columns)\ndata\n"

Subsequently, the DataFrame is loaded and updated in the next iteration of the scraping process.

In [18]:
# loading the previously scraped reviews
data = pd.read_csv('amazon_reviews.csv')

In [19]:
data.head()

Unnamed: 0,item,review,date,rating,review_word,item_url
0,Alice in Wonderland (Disney Gold Classic Colle...,My dvd,"September 19, 2022",5.0,The movie was great,https://www.amazon.com/Alice-Wonderland-Disney...
1,Alice in Wonderland (Disney Gold Classic Colle...,A true tale to follow,"January 31, 2019",5.0,The animation is quite clever and the story is...,https://www.amazon.com/Alice-Wonderland-Disney...
2,Alice in Wonderland (Disney Gold Classic Colle...,Fast shipping,"August 16, 2022",5.0,Fast shipping. Good quality. Thank you,https://www.amazon.com/Alice-Wonderland-Disney...
3,Alice in Wonderland (Disney Gold Classic Colle...,"A Fun, Cynical Ride","January 9, 2019",4.0,"More a cynical Disney film, the movie has no d...",https://www.amazon.com/Alice-Wonderland-Disney...
4,Alice in Wonderland (Disney Gold Classic Colle...,A Disney Classic,"April 4, 2019",5.0,If you are a fan of Disney animation then you ...,https://www.amazon.com/Alice-Wonderland-Disney...


In [20]:
data.shape

(10593, 6)
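
The two steps above (creating an empty DataFrame on the first run, then loading the saved CSV on later runs) could be combined into a single helper so that no cell has to be commented out by hand. This is only a sketch under that assumption; `load_or_create` is a hypothetical name, and the demonstration writes to a temporary directory rather than to the real `amazon_reviews.csv`.

```python
import os
import tempfile

import pandas as pd

COLUMNS = ['item', 'review', 'date', 'rating', 'review_word', 'item_url']


def load_or_create(path, columns=COLUMNS):
    """Load previously scraped reviews, or start an empty DataFrame on the first run."""
    if os.path.exists(path):
        return pd.read_csv(path)
    return pd.DataFrame(columns=columns)


# Demonstration in a temporary directory, so no real file is touched
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'amazon_reviews.csv')
    first = load_or_create(path)       # file missing: empty frame
    first.to_csv(path, index=False)
    second = load_or_create(path)      # file present: loaded from disk
    print(first.shape, list(second.columns))
```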

The next step is to set the path to ChromeDriver on the local machine and initialize the Chrome browser to allow automated interaction.

In [14]:
# Setting the path to the ChromeDriver 
s = Service("C:\\Users\\akint\\Downloads\\chromedriver-win64\\chromedriver.exe")
driver = webdriver.Chrome(service=s)

The next step is to perform the actual scraping of the customer reviews. The script below will loop through the ASINs in the products DataFrame, visit the Amazon product page, and scrape the customer reviews. The reviews will be stored in the data DataFrame.
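
The text clean-up that the script applies to each scraped field can be viewed in isolation as two small pure functions. The sketch below mirrors that logic (dropping the 'Helpful' and 'Report' button labels, removing the "Reviewed in the United States on" prefix, and extracting the decimal rating with a regular expression). The function names and the sample strings are hypothetical, and Amazon's actual markup may differ.

```python
import re

DATE_PREFIX = 'Reviewed in the United States on '


def clean_date(text):
    """Return the date part of a review byline, or None for non-date strings."""
    if text in ('Helpful', 'Report') or 'person' in text or 'people' in text:
        return None
    return text.replace(DATE_PREFIX, '')


def parse_rating(html):
    """Return the first decimal number found in the HTML, or None if there is none."""
    match = re.search(r'\d+\.\d+', html)
    return match.group() if match else None


# Made-up inputs that imitate the strings the scraper encounters
print(clean_date('Reviewed in the United States on January 2, 2001'))
print(clean_date('2 people found this helpful'))
print(parse_rating('<span class="a-icon-alt">4.0 out of 5 stars</span>'))
```

Keeping these steps as separate functions makes them testable without opening a browser.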

**NB: The script in the next cell will open a new Chrome window and navigate to the Amazon website. To pass the human-verification check, the code displayed on the page must be typed correctly into the input box.**

In [15]:
# Looping through the ASINs in the DataFrame
for asin in products['asin'].loc[2024: 2025]:
    # Then navigate to the specified URL and maximize the window
    driver.get('https://www.amazon.com/')
    driver.maximize_window()
    time.sleep(10)

    # Search for the product
    time.sleep(3)
    search = driver.find_element(By.XPATH, "/html/body/div[1]/header/div/div[1]/div[2]/div/form/div[2]/div[1]/input")
    search.send_keys(asin)
    time.sleep(10)
    print('Searching for product with asin:', asin)

    # Click on the search button
    driver.find_element(By.XPATH, "/html/body/div[1]/header/div/div[1]/div[2]/div/form/div[3]/div/span/input").click()
    time.sleep(3)


    # click on the product
    try:
        driver.find_element(By.XPATH, "/html/body/div[1]/div[1]/div[1]/div[1]/div/span[1]/div[1]/div[2]/div/div/span/div/div").click()
        time.sleep(2)

        # if the product page exists, proceed with the rest of this script

        # get the url of the item
        time.sleep(2)
        item_url = driver.current_url

        # click on US reviews
        driver.find_element(By.XPATH, "//div[@data-hook='reviews-medley-footer']//div//a[@data-hook='see-all-reviews-link-foot']").click()

        time.sleep(2)

        # The per-page lists (item, review, date, rating, review_word, item_urls)
        # are created fresh at the top of each loop iteration below
        page = 1
        while True:
            item = []
            review = []
            date = []
            rating = []
            review_word = []
            item_urls = []

            print('Current data shape is ', data.shape)
            print(f'THIS IS PAGE {page} -------------------')

            # Then extract the reviews
            reviews = driver.find_elements(By.XPATH, "//a[@data-hook='review-title']")
            # Iterate through each <a> element
            for a in reviews:
                # Find all <span> elements within the current <a> element
                time.sleep(2)
                try:
                    span_elements = a.find_elements(By.XPATH, ".//span")
                    if span_elements:
                        # Extract the text of the last <span> element
                        last_span_text = span_elements[-1].text
                        review.append(last_span_text)
                except StaleElementReferenceException:
                    continue
            print('review scrapped --------------- length:', len(review))

            # Now we will extract the dates
            for i in range(1, 15):
                try:
                    dates = driver.find_elements(By.XPATH, f"//div[{i}]/div[3]/div/div/div/div/span")
                    for date_item in dates:
                        text = date_item.text
                        # Skip the 'Helpful'/'Report' buttons and the "N people found this helpful" lines
                        if text in ('Helpful', 'Report') or 'person' in text or 'people' in text:
                            continue
                        # Keep only the date part of the string
                        date.append(text.replace('Reviewed in the United States on ', ''))
                except StaleElementReferenceException:
                    continue
            
            print('date scrapped --------------- length:', len(date))

            # Extract the ratings
            for i in range(1, 15):
                try:
                    ratings = driver.find_elements(By.XPATH, f"//div[{i}]/div/div/div[2]/a/i/span[1]")
                    for rating_item in ratings:
                        ratings_html = rating_item.get_attribute('outerHTML')
                        # Pull the decimal rating (e.g. 4.0) out of the element's HTML
                        match = re.search(r'\d+\.\d+', str(ratings_html))
                        if match:
                            rating.append(match.group())
                        time.sleep(2)
                except StaleElementReferenceException:
                    continue
            if date == []:
                rating = []
            print('rating scrapped --------------- length:', len(rating))

            # Extract the review words
            for i in range(1, 15):
                try:
                    review_words = driver.find_elements(By.XPATH, f"//body/div/div/div/div/div/div/div/div/div/div[{i}]/div[1]/div[1]/div[4]/span[1]/span[1]")
                    for review_word_item in review_words:
                        if review_word_item.text != "":
                            review_word.append(review_word_item.text)
                            if date == []:
                                review_word = []
                        time.sleep(2)
                except StaleElementReferenceException:
                    continue
            print('review word scrapped --------------- length:', len(review_word))

            items = driver.find_element(By.XPATH, "//a[@data-hook='product-link']")
            for i in range(1, len(date)+1):
                item.append(items.text)
            if date == []:
                item = []
            print('item scrapped --------------- length:', len(item))

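            # Obtain the URL of the item, repeated once per review on this page
            for i in range(1, len(date)+1):
                item_urls.append(item_url)
            # Note: len(item_url) is the character count of the URL string;
            # len(item_urls) would give the number of rows collected
            print('item_url scrapped --------------- length:', len(item_url))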
            
            print(len(review) ,len(date), len(rating), len(review_word), len(item))


            # All per-page lists must have the same length to form DataFrame rows
            if len({len(item), len(review), len(date), len(rating), len(review_word)}) != 1:
                print('The length of the lists are not equal')
                break
            else:
                # Append the scraped data to the DataFrame
                df = pd.DataFrame({'item': item, 'review': review, 'date': date, 'rating': rating, 'review_word': review_word, 'item_url': item_urls})
                data = pd.concat([df, data])
                print(df.shape)
                print(data.shape)
                print(data.head(2))
                time.sleep(2)



            try:
                if page == 1:
                    time.sleep(2)
                    driver.find_element(By.XPATH, "//div[@role='navigation']//ul//li//a").click()
                    time.sleep(4)
                else:
                    time.sleep(2)
                    driver.find_element(By.XPATH, "//span[@data-action='reviews:page-action']//li[2]//a[1]").click()
                    time.sleep(4)
                page += 1
            except Exception:
                # The "next page" link could not be clicked, so treat this as the last page
                page = 'Last page'
                print(f'{items.text} reviews scraped scrapping completed')
                break

    except Exception:
        # Any failure while locating or scraping the product skips to the next ASIN
        print('Product not found')
        continue

    #save to csv file
    data.to_csv('amazon_reviews.csv', index=False)

Searching for product with asin: B00004R982
Current data shape is  (10593, 6)
THIS IS PAGE 1 -------------------
review scrapped --------------- length: 10
date scrapped --------------- length: 10
rating scrapped --------------- length: 10
review word scrapped --------------- length: 10
item scrapped --------------- length: 10
item_url scrapped --------------- length: 265
10 10 10 10 10
(10, 6)
(10603, 6)
                 item               review             date rating  \
0  Revolver: New Spin  Dyer Buyers BEWARE!  January 2, 2001    2.0   
1  Revolver: New Spin  Something Different   March 14, 2014    4.0   

                                         review_word  \
0  There are a few good cuts on this album, the k...   
1  I was browsing around looking for something di...   

                                            item_url  
0  https://www.amazon.com/Revolver-Dyer-Good-Time...  
1  https://www.amazon.com/Revolver-Dyer-Good-Time...  
Revolver: New Spin reviews scraped scrapping c
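
Before parallel lists are combined into a DataFrame, it is worth confirming that they all have the same length. One way to do this, sketched here with a hypothetical helper `same_length`, is to compare the set of lengths. Note that in Python a chained comparison such as `a != b != c` only compares adjacent pairs, so it can miss a mismatch. The sample values are illustrative.

```python
def same_length(*lists):
    """Return True when every list has the same number of elements."""
    return len({len(lst) for lst in lists}) <= 1


date = ['January 2, 2001', 'March 14, 2014']
rating = ['2.0', '4.0']
review_word = ['There are a few good cuts on this album']  # one entry short

# Chained comparison: evaluates (2 != 2) and (2 != 1), so the mismatch goes unnoticed
chained_detects = len(date) != len(rating) != len(review_word)
# Set-based check: the lengths {2, 1} are not all equal, so the mismatch is reported
set_detects = not same_length(date, rating, review_word)
print(chained_detects, set_detects)
```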

In [21]:
# taking a look at the data
data


Unnamed: 0,item,review,date,rating,review_word,item_url
0,Alice in Wonderland (Disney Gold Classic Colle...,My dvd,"September 19, 2022",5.0,The movie was great,https://www.amazon.com/Alice-Wonderland-Disney...
1,Alice in Wonderland (Disney Gold Classic Colle...,A true tale to follow,"January 31, 2019",5.0,The animation is quite clever and the story is...,https://www.amazon.com/Alice-Wonderland-Disney...
2,Alice in Wonderland (Disney Gold Classic Colle...,Fast shipping,"August 16, 2022",5.0,Fast shipping. Good quality. Thank you,https://www.amazon.com/Alice-Wonderland-Disney...
3,Alice in Wonderland (Disney Gold Classic Colle...,"A Fun, Cynical Ride","January 9, 2019",4.0,"More a cynical Disney film, the movie has no d...",https://www.amazon.com/Alice-Wonderland-Disney...
4,Alice in Wonderland (Disney Gold Classic Colle...,A Disney Classic,"April 4, 2019",5.0,If you are a fan of Disney animation then you ...,https://www.amazon.com/Alice-Wonderland-Disney...
...,...,...,...,...,...,...
10588,Gotta Get the Groove Back,Five Stars,"February 13, 2018",5.0,This was an excellent cd with all his greatest...,https://www.amazon.com/Gotta-Groove-Back-JOHNN...
10589,Gotta Get the Groove Back,Hot hot hot!,"December 19, 2013",5.0,"Who doesn't like ""big head hundreds"" ??? This ...",https://www.amazon.com/Gotta-Groove-Back-JOHNN...
10590,Gotta Get the Groove Back,Glad I purchased it because I love the entire cd,"December 3, 2015",5.0,This is one cd I had not heard completely. Gla...,https://www.amazon.com/Gotta-Groove-Back-JOHNN...
10591,Gotta Get the Groove Back,,"April 11, 2019",4.0,Replacing damaged album,https://www.amazon.com/Gotta-Groove-Back-JOHNN...


In [22]:
# checking the shape of the data
data.shape

(10593, 6)

We have now successfully scraped the reviews of the products using the ASINs in the DataFrame. Next, we update the 'amazon_reviews.csv' file by saving the newly scraped data to it.

In [25]:
#save to csv file
data.to_csv('amazon_reviews.csv', index=False)
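
If the same product is scraped in more than one run, identical rows could end up in the file twice. As a hedged sketch, assuming exact duplicate rows are unwanted, `drop_duplicates` can be applied before saving; the small DataFrame below is made-up sample data with a reduced set of columns.

```python
import pandas as pd

# Hypothetical sample in which one review was collected twice
data = pd.DataFrame({
    'item': ['Revolver: New Spin', 'Revolver: New Spin', 'Revolver: New Spin'],
    'review': ['Something Different', 'Something Different', 'Dyer Buyers BEWARE!'],
    'date': ['March 14, 2014', 'March 14, 2014', 'January 2, 2001'],
})

# Keep the first occurrence of each identical row and renumber the index
deduped = data.drop_duplicates().reset_index(drop=True)
print(deduped.shape)
```

Whether deduplication is appropriate depends on the study design, since two customers could in principle post identical short reviews on the same date.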