# Amazon Product Review Scraper

## Introduction to Web Scraping

In data analysis, we often need to retrieve data from the web, especially when no ready-made dataset is available. Collecting that data programmatically is called **web scraping**: automating the retrieval of information from web pages with a language such as Python and libraries such as BeautifulSoup.

In this notebook, we will scrape product reviews from Amazon, focusing on ethical and responsible data collection practices.


## Ethical Data Collection

- **Respect Privacy**: Do not collect personal information without consent.
- **Follow Legal Guidelines**: Comply with local and international laws.
- **Respect Terms of Service**: Abide by the terms of service of the websites.
- **Ensure Data Security**: Protect the data from unauthorized access.
- **Minimize Data Collection**: Collect only the necessary data.
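
One concrete way to act on these principles is to consult a site's `robots.txt` before requesting a page. The sketch below is an illustration, not part of the original notebook: it uses the standard library's `urllib.robotparser`, and the `SAMPLE_ROBOTS_TXT` content, the `my-scraper` user agent, and the `example.com` URLs are all hypothetical so that the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/products/1"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/data"))
```

For a live site, `RobotFileParser.set_url()` followed by `read()` would download the real file instead of parsing a string.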

## Amazon Reviews Scraping

In this notebook, we will:
1. Scrape product reviews from Amazon.
2. Save the reviews in a CSV file for further analysis.



## Step-by-Step Guide

Let's begin by importing the necessary libraries.

### Importing Libraries

To scrape data from websites, we need to use some powerful Python libraries. Here are the ones we'll be using:
* `requests`: This library allows us to send HTTP requests to websites and get their HTML content. It's like our way of asking a website, "Can I see your content, please?"
* `BeautifulSoup`: This library helps us parse and navigate through the HTML content we get from websites. It's like a pair of super-powered glasses that let us see and extract the data we need from a webpage. Read more [here](https://beautiful-soup-4.readthedocs.io/en/latest/).
* `pandas`: This library is essential for data manipulation and analysis. It allows us to work with our scraped data in a structured format, like tables.
* `datetime`: This standard-library module handles date and time data. We use it to convert the date text on each review into a consistent format.

Let's start by importing these libraries:

In [1]:
# Import packages
import requests
import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime

### Setting Up Headers

When we make requests to a website, it’s important to mimic a real user to avoid getting blocked or flagged as a bot. This is where setting up headers comes into play. Headers contain information about the request being sent, such as the browser type, accepted response formats, and more. By including headers, we make our request look like it’s coming from a regular web browser.

Here’s how we set up the headers:

In [2]:
# Headers that make the request look like it comes from a browser
headers = {
    'authority': 'www.amazon.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
    'accept-language': 'en-US,en;q=0.9,bn;q=0.8',
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="102", "Google Chrome";v="102"',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
}

Explanation:
* `authority`: Names the target host. It mirrors the `:authority` pseudo-header a browser sends over HTTP/2; `requests` sets the `Host` header on its own, so this entry is optional.
* `accept`: Lists the content types the client can process, with `q` values giving their relative preference.
* `accept-language`: Specifies the preferred languages for the response.
* `sec-ch-ua`: A client hint that lists the browser brand and major version.
* `user-agent`: Identifies the browser and operating system being used. This is crucial for mimicking a real user.

By setting up these headers, we make our HTTP requests appear as if they are coming from a regular user browsing the website. This helps in reducing the chances of our requests being blocked by the website's security measures.
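
To see how a headers dictionary attaches to a request without contacting any server, here is a small offline sketch. It deliberately uses the standard library's `urllib.request.Request` rather than `requests`, and the trimmed `example_headers` dictionary and the `example.com` URL are illustrative assumptions.

```python
from urllib.request import Request

# Trimmed-down, illustrative copy of the kind of headers defined above.
example_headers = {
    "accept-language": "en-US,en;q=0.9",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}

# Building a Request object does not send anything over the network.
req = Request("https://www.example.com/", headers=example_headers)

# urllib stores header names capitalized ("User-agent"), so look them up that way.
print(req.get_header("User-agent"))
print(sorted(name for name, _ in req.header_items()))
```

With `requests`, the same dictionary is simply passed as `headers=` to `requests.get`, as the functions in this notebook do.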

### Input the Amazon URL

Now that we've set up our headers, the next step is to provide the URL of the Amazon product whose reviews we want to scrape. We could prompt the user for this URL with `input()`; in this notebook the prompt is commented out and the URL is hard-coded so that the cells run without interaction. To scrape a different product, change `reviews_url` or re-enable the prompt.

Here's how we do it:

In [20]:
# URL of the Amazon Review page
# user_input = input("Enter Amazon URL to retrieve reviews: ")
reviews_url = 'https://a.co/d/fhlUOxO'

# print(f"URL set to: {reviews_url}")
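
A pasted URL can take several forms, so it may help to normalize it before scraping. The sketch below is a hedged illustration only: the helper names `extract_asin` and `build_reviews_url` are hypothetical, the `/dp/<ASIN>` and `/product-reviews/<ASIN>` path layouts are assumptions about Amazon's URL scheme that the notebook does not confirm, and `B000000000` is a made-up identifier.

```python
import re
from typing import Optional
from urllib.parse import urlparse

def extract_asin(product_url: str) -> Optional[str]:
    """Return the 10-character ID from a /dp/<ID> or /gp/product/<ID> path, else None."""
    path = urlparse(product_url).path
    match = re.search(r"/(?:dp|gp/product)/([A-Z0-9]{10})(?:/|$)", path)
    return match.group(1) if match else None

def build_reviews_url(product_url: str) -> Optional[str]:
    """Build an assumed reviews-page URL, or return None if no ID is found."""
    asin = extract_asin(product_url)
    if asin is None:
        return None
    host = urlparse(product_url).netloc
    return f"https://{host}/product-reviews/{asin}"

# "B000000000" is a made-up identifier used only for illustration.
print(build_reviews_url("https://www.amazon.com/Some-Product/dp/B000000000/ref=sr_1_1"))
# A short link carries no ID in its path, so the helper returns None.
print(build_reviews_url("https://a.co/d/fhlUOxO"))
```

A short link such as the one used in this notebook would first have to be resolved to its full URL (for example by following the redirect) before a helper like this could extract anything from it.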

## Defining Functions


### Function to Extract HTML Data

Now that we have the product URL, we need a function to extract the HTML data from the Amazon review pages. This function will handle sending requests to the Amazon server and parsing the HTML content using BeautifulSoup.

Here's the function to extract HTML data:

In [21]:
# Fetch each review page and return the parsed HTML of every page.
# This function also extracts the product title from the first page.
def reviewsHtml(url, len_page):

    # List that collects the parsed HTML of every page
    soups = []
    title_str = None

    # Request every page from 1 to len_page
    for page_no in range(1, len_page + 1):

        # Query-string parameters; pageNumber selects the page.
        # 'critical' limits results to critical reviews; remove that key to keep all ratings.
        params = {
            'ie': 'UTF8',
            'reviewerType': 'all_reviews',
            'filterByStar': 'critical',
            'pageNumber': page_no,
        }

        # Request the page. params must be passed here; without it,
        # every iteration fetches the same page.
        response = requests.get(url, headers=headers, params=params)

        # Parse the HTML with BeautifulSoup and the lxml parser
        soup = BeautifulSoup(response.text, 'lxml')

        # The product title is read from the first page only
        if page_no == 1:
            title = soup.find("span", attrs={"id": 'productTitle'})
            # find() returns None when the element is missing
            if title is not None:
                title_str = title.get_text(strip=True)

        # Add this page's parsed HTML to the master list
        soups.append(soup)

    return soups, title_str
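
When a dictionary is passed to `requests.get` through its `params` argument, `requests` encodes it into the URL's query string. The sketch below reproduces that encoding offline with the standard library's `urlencode`, to show what a paginated request URL would look like. The `page_url` helper and the `example.com` base URL are hypothetical, and whether the target site honors these parameters is not something the notebook confirms.

```python
from urllib.parse import urlencode

def page_url(base_url: str, page_no: int) -> str:
    """Return base_url with the review-page parameters appended as a query string."""
    params = {
        'ie': 'UTF8',
        'reviewerType': 'all_reviews',
        'filterByStar': 'critical',
        'pageNumber': page_no,
    }
    return f"{base_url}?{urlencode(params)}"

print(page_url("https://www.example.com/reviews", 2))
```

Printing the URL for a few page numbers is a quick way to confirm that each loop iteration targets a different page.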

### Function to Extract Review Data

After collecting the HTML content, the next step is to extract the relevant review data from these pages. This involves parsing the HTML to find specific elements like the reviewer's name, star rating, review title, review date, and the review text.

Here's the function to extract review data:

In [22]:
# Extract the reviewer name, description, date, stars, and title from one page of HTML.
# This function returns a list of dictionaries that is later converted to a
# pandas DataFrame and written to a CSV file. (len_page is currently unused.)
def getReviews(html_data, prod_title, len_page):

    # Create Empty list to Hold all data
    data_dicts = []
    # The below given code is to retrieve productname
    if prod_title:
        prod_name = prod_title
    else:
        prod_name = 'N/A'


    # Select all Reviews BOX html using css selector
    boxes = html_data.select('div[data-hook="review"]')

    # Iterate all Reviews BOX
    for box in boxes:

        # prod_name is always set above (falls back to 'N/A')
        product_name = prod_name

        try:
            name = box.select_one('[class="a-profile-name"]').text.strip()
        except Exception as e:
            name = 'N/A'

        try:
            stars = box.select_one('[data-hook="review-star-rating"]').text.strip().split(' out')[0]
        except Exception as e:
            stars = 'N/A'

        try:
            title = box.select_one('[data-hook="review-title"]').text.strip()
        except Exception as e:
            title = 'N/A'

        try:
            # Convert the date string to dd/mm/yyyy format
            datetime_str = box.select_one('[data-hook="review-date"]').text.strip().split(' on ')[-1]
            date = datetime.strptime(datetime_str, '%B %d, %Y').strftime("%d/%m/%Y")
        except Exception as e:
            date = 'N/A'

        try:
            description = box.select_one('[data-hook="review-body"]').text.strip()
        except Exception as e:
            description = 'N/A'

        # Create a dictionary with all the review data
        data_dict = {
            'Product Name' : product_name,
            'Name' : name,
            'Stars' : stars,
            'Title' : title,
            'Date' : date,
            'Description' : description
        }

        # Add Dictionary in master empty List
        data_dicts.append(data_dict)

    return data_dicts
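
The string handling inside `getReviews` can be tried on its own, without any HTML. The sketch below isolates the date and star conversions. The helper names are hypothetical, and the sample input strings are assumed formats chosen to match the `' on '` and `' out'` splits used above, not text captured from a real page. Note that `%B` matches English month names only under an English (or C) locale.

```python
from datetime import datetime

def parse_review_date(raw: str) -> str:
    """Take the text after the last ' on ' and reformat it as dd/mm/yyyy."""
    date_part = raw.strip().split(' on ')[-1]
    return datetime.strptime(date_part, '%B %d, %Y').strftime('%d/%m/%Y')

def parse_stars(raw: str) -> str:
    """Keep only the number in front of ' out of 5 stars'."""
    return raw.strip().split(' out')[0]

# Assumed input formats, for illustration only.
print(parse_review_date("Reviewed in the United States on June 13, 2024"))  # 13/06/2024
print(parse_stars("5.0 out of 5 stars"))  # 5.0
```

If the date text does not match the expected pattern, `strptime` raises a `ValueError`, which is why the scraper wraps each field in `try`/`except` and falls back to `'N/A'`.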

### Function to Determine Total Pages

In [23]:
def total_pages(url, page_allowance=300):
    reviews_per_page = 10  # Assume there are 10 reviews per page
    try:
        # Send an HTTP GET request to the given URL with the shared headers
        response = requests.get(url, headers=headers)
        # Parse the HTML content with BeautifulSoup's built-in 'html.parser'
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find the element that contains the total review count
        total_reviews_element = soup.find('span', {'data-hook': 'total-review-count'})
        # The first whitespace-separated token is the count, e.g. "15,602"
        str_num = total_reviews_element.text.split()[0]
        # Remove thousands separators and convert to an integer
        total_reviews_count = int(str_num.replace(',', ''))
        print(total_reviews_count)  # Print the total number of reviews

        # Ceiling division, so a partial last page is counted exactly once
        n_pages = -(-total_reviews_count // reviews_per_page)
        # Fetch at least one page and never more than the allowance
        return max(1, min(n_pages, page_allowance))
    except Exception as e:
        # A missing element or a non-numeric count ends up here;
        # fall back to a single page so the caller still gets an integer
        print(f"Error processing the response: {e}")
        return 1
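
The page-count arithmetic can be checked in isolation. The sketch below uses ceiling division, which differs from `n // 10 + 1` only when the review count is an exact multiple of the page size (20 reviews need 2 pages, not 3). The `pages_needed` helper is hypothetical, and the 10-reviews-per-page figure is the same assumption the notebook makes.

```python
import math

def pages_needed(total_reviews: int, reviews_per_page: int = 10,
                 page_allowance: int = 300) -> int:
    """Return the number of pages to fetch, capped at page_allowance."""
    if total_reviews <= 0:
        return 1  # Still look at the first page
    pages = math.ceil(total_reviews / reviews_per_page)
    return min(pages, page_allowance)

print(pages_needed(15602))  # 1561 pages needed, capped to 300
print(pages_needed(25))     # 3
print(pages_needed(20))     # 2
```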



## Code Run



### Determine the total number of pages and scrape the reviews.

In [24]:
print('The url for which we are scraping reviews:',reviews_url)

The url for which we are scraping reviews: https://a.co/d/fhlUOxO


Now let's find out how many pages of reviews are available. Note that we cap the page count at 300 to keep the run time reasonable. If you have more time, you can increase `page_allowance` below.

In [25]:
# Extract number of pages
len_page = total_pages(reviews_url,page_allowance=300)
print(f"Total Pages: {len_page}")


15602
Total Pages: 300


In [26]:
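# Override: fetch only the first page to keep this demonstration run short.
# Skip this cell to use the page count computed above.
len_page = 1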

Now let's extract the HTML content for all of our review pages.

**Important Note: This cell will take quite some time to run for products with many reviews.**

In [29]:
# Get HTML data for all reviews
html_datas,prod_title = reviewsHtml(reviews_url, len_page)

Next, parse the HTML data to extract the review information.

In [30]:
# Empty List to Hold all reviews data
reviews = []

# Iterate over every parsed HTML page
for html_data in html_datas:
    # Extract the reviews on this page
    review = getReviews(html_data,prod_title, len_page)
    # Append this page's reviews to the master list
    reviews += review

# Create a dataframe with reviews Data
df_reviews = pd.DataFrame(reviews)
df_reviews.head()

|   | Product Name | Name | Stars | Title | Date | Description |
|---|---|---|---|---|---|---|
| 0 | JBL Flip 6 - Portable Bluetooth Speaker, power... | Kassandra | 5.0 | 5.0 out of 5 stars\nFantastic sound quality an... | 13/06/2024 | The JBL Flip 6 Bluetooth speaker is an absolut... |
| 1 | JBL Flip 6 - Portable Bluetooth Speaker, power... | Shinobu | 5.0 | 5.0 out of 5 stars\nWorks great | 19/07/2024 | I bought this for my bf and he loves it. The b... |
| 2 | JBL Flip 6 - Portable Bluetooth Speaker, power... | Amazon Customer | 5.0 | 5.0 out of 5 stars\nPerfect for everywhere | 23/07/2024 | I purchased this for my husband for a special ... |
| 3 | JBL Flip 6 - Portable Bluetooth Speaker, power... | CA29914080 | 5.0 | 5.0 out of 5 stars\nSounds great! | 02/06/2024 | Easy to use and connect Bluetooth. Battery lif... |
| 4 | JBL Flip 6 - Portable Bluetooth Speaker, power... | Pwalli | 5.0 | 5.0 out of 5 stars\nGreat sound for low price | 25/07/2024 | I purchased this speaker to replace my wireles... |


In [31]:
df_reviews.shape

(13, 6)
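
In the preview, each `Title` value begins with a repeated rating line such as `5.0 out of 5 stars`, followed by a newline and the actual title. A small cleaning step could drop that prefix before analysis. The `clean_title` helper below is a hypothetical sketch that assumes every prefix ends with the text `out of 5 stars`.

```python
def clean_title(raw_title: str) -> str:
    """Drop a leading '<x> out of 5 stars' line from a scraped review title."""
    first, sep, rest = raw_title.partition('\n')
    if sep and first.strip().endswith('out of 5 stars'):
        return rest.strip()
    # No rating prefix found: return the title unchanged apart from whitespace
    return raw_title.strip()

print(clean_title("5.0 out of 5 stars\nWorks great"))  # Works great
print(clean_title("Works great"))                      # Works great
```

On the DataFrame this could be applied with `df_reviews['Title'] = df_reviews['Title'].map(clean_title)`.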

### Saving the Data

Save the reviews data to a CSV file.



In [32]:
# The file name is arbitrary; rename it to match the product being scraped
filename = 'beats_studio3_amazon_product_reviews.csv'
df_reviews.to_csv(filename, index=False)
print(f"Data saved to {filename}")

Data saved to beats_studio3_amazon_product_reviews.csv
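
Scraped text often contains commas and line breaks, so it is worth confirming that such values survive a CSV round trip. The sketch below uses the standard library's `csv` module and an in-memory buffer instead of pandas and a file, purely to stay self-contained; pandas applies comparable quoting by default. The single row is a shortened version of one row from the preview above.

```python
import csv
import io

# One review row whose title contains a line break, as in the preview above.
rows = [{"Name": "Shinobu", "Stars": "5.0",
         "Title": "5.0 out of 5 stars\nWorks great"}]

# Write the row to an in-memory CSV; fields with line breaks are quoted automatically.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Name", "Stars", "Title"])
writer.writeheader()
writer.writerows(rows)

# Read the CSV back and compare with the original value.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored[0]["Title"] == rows[0]["Title"])  # True
```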
