# SENTIMENT ANALYSIS PROJECT

***Author:*** Deborah Maffezzoni

***Student ID:*** 5205685

***email:*** deborah.maffezzoni01@icatt.it

# 1. Web Scraping

***Web scraping*** is the process of automatically *extracting* data from websites. It involves fetching web pages, parsing the HTML or XML content, and extracting the desired information, without manually visiting each page and copying the data.

For this project, I decided to mine data from the *Amazon* website and collect the reviews of a pair of headphones, using Object-Oriented Programming (OOP).

***Product*** *URL:* https://www.amazon.com/TOZO-T6-Bluetooth-Headphones-Waterproof/dp/B07RGZ5NKS/ref=cm_cr_arp_d_product_top?ie=UTF8 

***Review*** *URL:* https://www.amazon.com/product-reviews/B07RGZ5NKS/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&sortBy=recent&pageNumber=

In [1]:
# Import Dependencies
import json
import time
import pandas as pd
from requests_html import HTMLSession

The *class* ***Reviews*** provides methods to grab and extract reviews from an Amazon product page, store the scraped data, and save it to a JSON file for further analysis, as follows:

* The **__init__()** method is executed whenever the class is instantiated, and it *initializes* the Reviews object with the Amazon Standard Identification Number (ASIN) of a given product. It sets the ASIN, creates an HTTP session object for making requests, defines headers for the requests, and constructs the URL for scraping the reviews.

Regarding the HTTP headers for requests, it is important to send the ***User-Agent***, a short piece of text that describes the software (or browser) making the request to the e-commerce site; otherwise, the request for extracting data may be blocked. Since I decided to inspect only the most recent reviews written in the United Kingdom, I also specify the *language preference*: 'en-US' is the preferred language, while generic English ('en') is accepted as a fallback weighted with the *quality value* 'q=0.5'.
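To make the quality values concrete, here is a minimal sketch that splits an *Accept-Language* value into weighted preferences. The helper name `parse_accept_language` is hypothetical and is not part of the project code; it only relies on the rule that a language without an explicit 'q' value has weight 1.0.

```python
def parse_accept_language(header):
    """Split an Accept-Language value into (language, weight) pairs."""
    preferences = []
    for item in header.split(','):
        parts = [part.strip() for part in item.split(';')]
        language, weight = parts[0], 1.0   # a missing q-value counts as 1.0
        for param in parts[1:]:
            if param.startswith('q='):
                weight = float(param[2:])
        preferences.append((language, weight))
    # Highest weight first; sorted() is stable, so ties keep the header order
    return sorted(preferences, key=lambda pair: pair[1], reverse=True)


preferences = parse_accept_language('en-US, en;q=0.5')
print(preferences)  # [('en-US', 1.0), ('en', 0.5)]
```

In other words, the header used in this project asks for US English first and accepts any other English variant with half the weight.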

* The **pagination()** method takes a page number as input and constructs the *review URL* by appending the current page number. Then, using the HTTP session object initialised in the previous step, it sends an HTTP request to that URL and checks if there are reviews *available* on that page. If reviews are found, it returns the HTML elements corresponding to the reviews; otherwise, it returns False.


* The **get_reviews()** method requires a list of HTML elements representing *reviews* as input. It iterates over each review and extracts relevant information such as *title*, *rating*, *place and date*, and *body* text using CSS selectors, converting the content scraped into text format. It constructs a dictionary collecting all this information and appends it to a total list. Finally, it returns the list of dictionaries containing the review data. 

For instance, to scrape the review title:

**title** = review.find('**a[data-hook = review-title]**', first = True).text

meaning that we look for an *'a'* tag whose *'data-hook'* attribute equals *'review-title'*, and convert its content into text format. It is necessary to set *'first=True'* because, by default, the **find()** method of requests_html returns a list of all matching elements.
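The idea behind the attribute selector can be tried offline. The sketch below does not use requests_html: it swaps in the standard-library `html.parser` module and a hypothetical `DataHookText` class, and it runs on an invented HTML snippet rather than on a real Amazon page. It collects every match in a list and then takes the first one, which is roughly what *'first=True'* does.

```python
from html.parser import HTMLParser


class DataHookText(HTMLParser):
    """Collect the text inside every <tag data-hook="hook"> element.

    Simplification: void tags such as <br> inside a match are not handled.
    """

    def __init__(self, tag, hook):
        super().__init__()
        self.tag, self.hook = tag, hook
        self.depth = 0      # > 0 while we are inside a matching element
        self.matches = []   # one text string per matching element

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1  # nested tag inside a match
        elif tag == self.tag and dict(attrs).get('data-hook') == self.hook:
            self.depth = 1
            self.matches.append('')

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.matches[-1] += data


# Invented snippet that only imitates the structure of a review block
SAMPLE = '''
<div data-hook="review">
  <a data-hook="review-title"><span>Really waterproof</span></a>
  <span data-hook="review-body">Example body text</span>
</div>
'''

parser = DataHookText('a', 'review-title')
parser.feed(SAMPLE)
titles = [text.strip() for text in parser.matches]   # all matches, as a list
first_title = titles[0] if titles else None          # analogous to first=True
print(first_title)  # Really waterproof
```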

* The **save()** method stores the list of dictionaries containing review data, and *saves* it to a JSON file named 'ASIN_ID_reviews.json', where 'ASIN_ID' is the ASIN provided during object initialization. 

In [2]:
class Reviews:
    def __init__(self, asin) -> None:
        # asin (Amazon Standard Identification Number) uniquely identifies products on Amazon
        self.asin = asin
        # initialize session object for making an HTTP request and parsing content to scrape out
        self.session = HTMLSession()
        # define the HTTP header for requests
        # note the trailing spaces: adjacent string literals are joined without a separator
        self.headers = {'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                        'AppleWebKit/537.36 (KHTML, like Gecko) '
                        'Chrome/116.0.0.0 Safari/537.36'),
                        'Accept-Language': 'en-US, en;q=0.5'}
        # f-string because the ASIN identifier is interpolated into the URL
        self.url_part_1 = f'https://www.amazon.co.uk/product-reviews/{self.asin}/ref=cm_cr_arp_d_paging_btm_next_'
        # plain string: there is nothing to interpolate here
        self.url_part_2 = '?ie=UTF8&reviewerType=all_reviews&pageNumber='

    def pagination(self, page):
        print(f'Requesting page: {page}')
        url = self.url_part_1 + str(page) + self.url_part_2 + str(page)
        response = self.session.get(url, headers=self.headers)
        # run the selector once and reuse the result
        reviews = response.html.find('div[data-hook = review]')
        if not reviews:
            print("No reviews found on this page")
            return False
        print("Found reviews on this page")
        return reviews

    def get_reviews(self, reviews):
        print(f"Processing {len(reviews)} reviews")
        total = []
        for i, review in enumerate(reviews):
            print(f"Processing review {i+1}")
            try:
                # grab the information needed using the CSS selector
                # first = True because find() in requests_html returns a list by default
                # and then convert the content scraped into text format
                # 'span' is the tag underneath
                title = review.find('a[data-hook = review-title]', first = True).text
                rating = review.find('i[data-hook = review-star-rating] span', first = True).text
                place_date = review.find('span[data-hook = review-date]', first = True).text
                body = review.find('span[data-hook = review-body]', first = True).text.replace('\n', ' ').strip() # replace newlines ('\n') with a space
            
            except AttributeError:
                print("AttributeError: Skipping review")
                continue

            # construct a dictionary using the data scraped
            data = {
                'title': title,
                'rating': rating,
                'place and date': place_date,
                'body': body[: 1000]                  # keep at most the first 1000 characters
            }
            # append the dictionary to the total list
            total.append(data)

        # Introduce a delay between requests
        time.sleep(60)

        return total

    def save(self, results):
        print(f"Saving {len(results)} reviews to file")
        with open(self.asin + '_reviews.json', 'w') as json_file:
            json.dump(results, json_file)
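As a side note on the **save()** step, the sketch below shows one possible way to write and reload the reviews so that non-ASCII characters (for example emoji in titles) stay readable in the file. The helpers `save_reviews` and `load_reviews` are hypothetical, the sample record is made up for illustration, and the explicit UTF-8 encoding is my assumption, not something the class above does.

```python
import json
import os
import tempfile


def save_reviews(results, path):
    # ensure_ascii=False keeps characters such as emoji as they are;
    # the explicit encoding avoids depending on the platform default
    with open(path, 'w', encoding='utf-8') as json_file:
        json.dump(results, json_file, ensure_ascii=False, indent=2)


def load_reviews(path):
    with open(path, 'r', encoding='utf-8') as json_file:
        return json.load(json_file)


# Made-up record with the same keys produced by get_reviews()
sample = [{
    'title': '👍👍',
    'rating': '4.0 out of 5 stars',
    'place and date': 'Reviewed in the United Kingdom on 7 July 2023',
    'body': 'Example body text',
}]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'B07RGZ5NKS_reviews.json')
    save_reviews(sample, path)
    restored = load_reviews(path)

print(restored == sample)  # True
```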

Let's extract Amazon product reviews for a specific product identified by its ASIN ('B07RGZ5NKS'):

In [3]:
# Create an instance of the Reviews class
asin_id = 'B07RGZ5NKS'
reviews_obj = Reviews(asin_id)

# Initialize an empty list to store all reviews
all_reviews = []

# Loop through review pages and retrieve reviews
page = 1
while page <= 100:
    reviews = reviews_obj.pagination(page)
    if not reviews:
        break
    all_reviews.extend(reviews_obj.get_reviews(reviews))
    page += 1

    time.sleep(5)

# Print the collected reviews for debugging
print(f"Collected {len(all_reviews)} Reviews")

# Save the collected reviews to a JSON file
reviews_obj.save(all_reviews)

Requesting page: 1
No reviews found on this page
Collected 0 Reviews
Saving 0 reviews to file


In [4]:
page = 2
reviews_obj.url_part_1 + str(page) + reviews_obj.url_part_2 + str(page)

'https://www.amazon.co.uk/product-reviews/B07RGZ5NKS/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2'
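The same URL can also be assembled in a single helper, which makes it easier to check the result for any page without sending a request. This is only a sketch: the function name `build_review_url` and the `domain` parameter are hypothetical, and the query string is built with `urllib.parse.urlencode` instead of string concatenation.

```python
from urllib.parse import urlencode


def build_review_url(asin, page, domain='www.amazon.co.uk'):
    """Return the review-page URL for a given ASIN and page number."""
    path = f'/product-reviews/{asin}/ref=cm_cr_arp_d_paging_btm_next_{page}'
    query = urlencode({
        'ie': 'UTF8',
        'reviewerType': 'all_reviews',
        'pageNumber': page,
    })
    return f'https://{domain}{path}?{query}'


url = build_review_url('B07RGZ5NKS', 2)
print(url)
```

For page 2, the helper produces the same string as the cell above.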

The following function named **json_to_df** takes a file path as input and *converts* the JSON data from the file into a DataFrame using the pandas **DataFrame()** constructor. After extracting the desired review attributes, the function returns a DataFrame with columns for *date*, *place*, *title*, *rating*, and *content* of the reviews:

In [5]:
def json_to_df(file_path):
    with open(file_path, 'r') as json_file:
        data = json.load(json_file)

    print("Loaded JSON data:")
    print(data)  # Print the loaded JSON data

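    reviews = []
    for review in data:
        # The slice offsets below are tailored to the layout of the scraped strings
        review_data = {
            'Date': review['place and date'][34:],      # everything from character 34 onwards
            'Place': review['place and date'][16:30],   # 14-character place name
            'Title': review['title'][19:],              # drop the 19-character prefix
            'Rating': int(review['rating'][:1]),        # first character as an integer
            'Content': review['body']
        }
        reviews.append(review_data)

    df = pd.DataFrame(reviews)
    return df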

df = json_to_df('B07RGZ5NKS_reviews.json')

# Display the DataFrame
print(df)

Loaded JSON data:
[]
Empty DataFrame
Columns: []
Index: []
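The fixed slice offsets in **json_to_df** only work when every string has exactly the same layout. A more tolerant alternative is sketched below with regular expressions. It rests on assumptions that I inferred from the offsets and from the output, not on documented Amazon formats: the date field looks like 'Reviewed in the United Kingdom on 14 August 2023', the rating looks like '5.0 out of 5 stars', and the title is prefixed by the rating followed by a newline. The function name `parse_review` is hypothetical.

```python
import re

# Assumed layouts (inferred, not documented):
#   'Reviewed in the <place> on <date>'  and  '<stars>.0 out of 5 stars'
DATE_PATTERN = re.compile(r'^Reviewed in (?:the )?(?P<place>.+?) on (?P<date>.+)$')
RATING_PATTERN = re.compile(r'^(?P<stars>\d)(?:\.\d)? out of 5 stars')


def parse_review(raw):
    """Turn one scraped dictionary into a cleaned row, or None if it does not match."""
    place_date = DATE_PATTERN.match(raw['place and date'])
    rating = RATING_PATTERN.match(raw['rating'])
    if place_date is None or rating is None:
        return None
    return {
        'Date': place_date.group('date'),
        'Place': place_date.group('place'),
        # keep only the text after the first newline, if there is one
        'Title': raw['title'].split('\n', 1)[-1].strip(),
        'Rating': int(rating.group('stars')),
        'Content': raw['body'],
    }


# Made-up record that follows the assumed layouts
raw = {
    'title': '5.0 out of 5 stars\nAmazing',
    'rating': '5.0 out of 5 stars',
    'place and date': 'Reviewed in the United Kingdom on 14 August 2023',
    'body': 'Bought for sisters Xmas gift and they are great',
}
row = parse_review(raw)
print(row['Date'], '|', row['Place'], '|', row['Title'], '|', row['Rating'])
```

Records that do not follow the assumed layout are returned as None, so they can be skipped or inspected instead of being sliced at the wrong position.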


Since, most of the time, the **get_reviews()** method does **not** extract any relevant information and returns an *empty* list of dictionaries, for the next steps of the analysis I decided to use the dataset containing the *highest* number of reviews I have collected so far. Thus, let's consider the *'reviews.csv'* file, which contains **100** rows:

In [6]:
# Provide the file path to the JSON file
file_path = 'C:/Users/debby/OneDrive/Desktop/UniCatt/1st Y/IT_Coding/2nd_Module/Project/B07RGZ5NKS_reviews.json'

# Convert JSON to DataFrame
df = json_to_df(file_path)
df.to_csv('reviews.csv')

print(df)

Loaded JSON data:
                Date           Place  \
0      1 August 2023  United Kingdom   
1       20 July 2023  United Kingdom   
2        7 July 2023  United Kingdom   
3     14 August 2023  United Kingdom   
4      4 August 2023  United Kingdom   
..               ...             ...   
95      18 June 2020  United Kingdom   
96  11 December 2019  United Kingdom   
97  23 November 2019  United Kingdom   
98        1 May 2022  United Kingdom   
99       1 July 2020  United Kingdom   

                                                Title  Rating  \
0   Second time buying these headphones as overall...       4   
1                                  One Annoying Issue       4   
2                                                👍👍       4   
3                                             Amazing       5   
4                                   Really waterproof       5   
..                                                ...     ...   
95    Overall a good purchase, with some

In [7]:
df.head(10)

| | Date | Place | Title | Rating | Content |
|---|---|---|---|---|---|
| 0 | 1 August 2023 | United Kingdom | Second time buying these headphones as overall... | 4 | I lost my first pair so I replaced them with t... |
| 1 | 20 July 2023 | United Kingdom | One Annoying Issue | 4 | You get decent sound quality and comfort / fun... |
| 2 | 7 July 2023 | United Kingdom | 👍👍 | 4 | I have a pair of the T 6’s and a pair of the N... |
| 3 | 14 August 2023 | United Kingdom | Amazing | 5 | Bought for sisters Xmas gift and they are great |
| 4 | 4 August 2023 | United Kingdom | Really waterproof | 5 | I went with them in the rain and it didn’t aff... |
| 5 | 29 May 2023 | United Kingdom | Confused | 3 | Audio is good for the price, and it's weird (i... |
| 6 | 4 July 2023 | United Kingdom | The best headphones! | 5 | These headphones are the best value for money!... |
| 7 | 1 June 2023 | United Kingdom | Pretty Good | 4 | The sound quality isn't the best and there are... |
| 8 | 7 July 2023 | United Kingdom | Brilliant piece of kit! | 5 | This is a brilliant piece of kit for the price... |
| 9 | 29 July 2023 | United Kingdom | OK. | 5 | Fiddly getting them out of the charger. |
