In [11]:
# Import necessary libraries
import spacy
import pandas as pd
from spacytextblob.spacytextblob import SpacyTextBlob

# Initialise spaCy with 'en_core_web_sm' model and add SpacyTextBlob component to pipeline
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')

# Initialise a separate spaCy model 'en_core_web_md' for computing similarity as it includes word vectors
nlpx = spacy.load('en_core_web_md')

# Load reviews from a CSV file and remove rows where the 'reviews.text' column is empty
df = pd.read_csv('amazon_product_reviews.csv', sep=',')
reviews_data = df['reviews.text'].dropna()

# Understand the size of the data frame
print(df.shape)

# Function to clean reviews
def clean_reviews(product_reviews):
    # Dictionary to store cleaned reviews
    cleaned_reviews = {}
    
    # Loop over the reviews together with their index labels
    for index, review in product_reviews.items():
        # Process the text of the review with spaCy
        doc = nlp(review)
        # List of tokens stripped, lower case, no stop words, no punctuation, no whitespace
        cleaned_review = [
            token.text.strip().lower() for token in doc 
            if not token.is_stop and not token.is_punct and not token.text.isspace()
        ]
        
        # Join the tokens back into a string and store in a dictionary with the review's index as the key
        cleaned_reviews[index] = " ".join(cleaned_review)
        
    return cleaned_reviews

# Function to perform sentiment analysis on the reviews
def sentiment_analysis(reviews):
    for key, value in reviews.items():
        # Process the text with spaCy
        doc = nlp(value)
        
        # Calculate polarity and assign mood based on the polarity score
        polarity = doc._.blob.polarity
        if polarity < -0.15:
            mood = "Negative"
        elif polarity > 0.15:
            mood = "Positive"
        else:
            mood = "Neutral"
            
        # Print the review, its sentiment, and mood
        print(f"\nReview: {key}, {value}")
        print(f"Sentiment: {doc._.blob.sentiment}")
        print(f"Mood: {mood}")

# Sample of reviews to analyse to reduce computational load
reviews_data_sample = reviews_data.iloc[[1, 25, 54, 76, 100, 299, 804, 3202, 6043, 13598]]
        
# Perform sentiment analysis on the cleaned reviews
sentiment_analysis(clean_reviews(reviews_data_sample))

# Select two specific reviews by index label to compare for similarity
first_review = reviews_data.loc[34]
second_review = reviews_data.loc[18754]

# Define a function to calculate the similarity between two reviews
def similarity_analysis(review_1, review_2):
    # Calculate the similarity score between the two spaCy documents using 'en_core_web_md' for word vectors
    similarity_score = nlpx(review_1).similarity(nlpx(review_2))
    
    # Return the calculated similarity score
    return similarity_score

# Perform similarity analysis on the selected reviews and print the results
print(f"\nFirst Review: {first_review}")
print(f"Second Review: {second_review}")
print(f"Similarity: {similarity_analysis(first_review, second_review)}")


(28332, 24)

Review: 1, bulk expensive way products like
Sentiment: Sentiment(polarity=-0.5, subjectivity=0.7)
Mood: Negative

Review: 25, battery battery good price
Sentiment: Sentiment(polarity=0.7, subjectivity=0.6000000000000001)
Mood: Positive

Review: 54, arrived earlier expected amazon batteries affordable long life
Sentiment: Sentiment(polarity=-0.05000000000000001, subjectivity=0.43333333333333335)
Mood: Neutral

Review: 76, purchased work far
Sentiment: Sentiment(polarity=0.1, subjectivity=1.0)
Mood: Neutral

Review: 100, teacher need tons batteries refused spend excessive amounts figured best option long lasting worth money cute matters lol wo find deal like stores!i highly recommend
Sentiment: Sentiment(polarity=0.3075, subjectivity=0.505)
Mood: Positive

Review: 299, house uses batteries fast like purchasing bulk purchase little cheaper way little nervous purchase amazonbasics brand wanted sure purchased quality battery normally purchase duracell happy quality good brands 

Write a brief report or summary in a PDF file, sentiment_analysis_report.pdf, that must include:

5.1. A description of the dataset used.

5.2. Details of the preprocessing steps.

5.3. Evaluation of results.

5.4. Insights into the model's strengths and limitations.


5.1. Dataset Description
The dataset used in this analysis consists of a collection of Amazon product reviews for AmazonBasics AAA Performance Alkaline Batteries.
The dataset has 28,332 rows and 24 columns.
It includes a variety of fields such as:
Date Added: The date when the product was listed in the dataset. 
Name: The product name. 
Asins: Amazon Standard Identification Numbers. 
Image URLs: URLs to the product images. 
Manufacturer: Manufacturer of the product.
And many more, but the target field is:
Reviews: This includes several sub-fields such as the date of the review, whether the product was recommended, review ID, the number of helpful votes, rating, the review text itself, the review title, and the username of the reviewer.
The sub-field that we will be utilising in this project is the textual content of the review itself under the column reviews.text.
5.2. Preprocessing Steps
The preprocessing of the dataset involved several key steps to prepare the text data for sentiment analysis:
Loading and Cleaning: The dataset was loaded into a pandas DataFrame, and any missing values in the reviews.text column were dropped with .dropna() to ensure the analysis only included complete reviews.
Tokenisation and Cleaning: Using spaCy, each review was tokenised into individual words. Stop words, punctuation, and whitespace-only tokens were removed, and the remaining tokens were stripped and converted to lowercase to maintain consistency and avoid duplication based on case differences. This used token.text with .strip() and .lower(), the token attributes .is_stop and .is_punct, and the string method .isspace().
SpacyTextBlob Integration: The spaCy pipeline was extended with SpacyTextBlob for sentiment analysis, enabling the extraction of sentiment polarity and subjectivity scores directly from the processed text.
5.3. Evaluation of Results
The sentiment analysis yielded polarity and subjectivity scores for each cleaned review. Polarity scores range from -1 (very negative) to 1 (very positive), providing a quantitative measure of sentiment. Subjectivity scores, on the other hand, range from 0 (objective) to 1 (subjective), indicating the degree of personal opinion or factual content in the review.
By analysing these scores, it was possible to classify each review as Positive, Negative, or Neutral, providing a clear overview of customer sentiment towards products. The similarity analysis further allowed for the comparison of textual content between reviews, indicating how close two reviews are in content as measured by their word vectors.
In practice, the sample selected to test sentiment analysis saw polarity range from -0.5 to 0.7 and subjectivity from 0.0 to 1.0.
The classification of reviews into 'Positive', 'Negative', and 'Neutral' categories based on polarity scores proved an effective measure of sentiment. For instance, the review whose cleaned text was "bulk expensive way products like" received a polarity of -0.5 and a subjectivity of 0.7, accurately reflecting a 'Negative' sentiment. Conversely, the review whose cleaned text was "battery battery good price", with a polarity of 0.7 and a subjectivity of 0.6, was aptly classified as 'Positive', indicating satisfaction.
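The thresholding used to assign moods can be isolated as a small function. This is a minimal sketch: the name classify_polarity and its threshold parameter are introduced here for illustration and do not appear in the notebook, while the 0.15 cut-off and the polarity values are taken from the code and output above.

```python
def classify_polarity(polarity: float, threshold: float = 0.15) -> str:
    """Map a polarity score in [-1, 1] to a mood label."""
    if polarity < -threshold:
        return "Negative"
    if polarity > threshold:
        return "Positive"
    # Scores within the threshold on either side of zero count as neutral
    return "Neutral"


# Polarity scores copied from the sample output above, keyed by review index
sample_polarities = {1: -0.5, 25: 0.7, 54: -0.05, 76: 0.1, 100: 0.3075}
moods = {index: classify_polarity(p) for index, p in sample_polarities.items()}

for index, mood in moods.items():
    print(f"Review {index}: {mood}")
```

Keeping the threshold as a parameter makes it easy to test how sensitive the Neutral band is to the chosen cut-off.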
Moreover, the similarity analysis between the two selected reviews produced a high similarity score of 0.88. Because the score is computed from the word vectors in the en_core_web_md model, it indicates that the two reviews are semantically close in vocabulary and topic, even where their context differs. It does not by itself measure the degree of sentiment alignment between reviews.
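To make clear what a vector-based similarity score captures, the sketch below computes the cosine similarity of averaged word vectors, which is how spaCy's Doc.similarity behaves by default for models with static vectors such as en_core_web_md. The three-dimensional vectors are hypothetical values invented for illustration; real vectors come from the model and have far more dimensions.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration only.
toy_vectors = {
    "battery": [0.9, 0.1, 0.0],
    "price":   [0.7, 0.3, 0.1],
    "good":    [0.1, 0.9, 0.2],
    "bad":     [0.1, 0.8, -0.3],
}


def average_vector(tokens):
    """Average the vectors of the given tokens component by component."""
    vectors = [toy_vectors[token] for token in tokens]
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


positive_review = ["battery", "good", "price"]
negative_review = ["battery", "bad", "price"]

score = cosine_similarity(
    average_vector(positive_review), average_vector(negative_review)
)
print(f"Similarity: {score:.2f}")
```

With these toy vectors the two reviews score close to 1 even though one says "good" and the other "bad", which is why a high similarity score is better read as shared topic and vocabulary than as shared sentiment.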
In essence, the sentiment and similarity analysis conducted offers a comprehensive snapshot of consumer sentiment, reflecting a broad spectrum of opinions and experiences. The application of these analyses presents a compelling case for the integration of NLP technologies in evaluating products based on consumer feedback.
5.4. Insights into the Model's Strengths and Limitations
The model is strong in several areas and could be improved in others; both are listed below.
Strengths:
The use of spaCy and SpacyTextBlob provides a flexible and powerful toolset for efficient text preprocessing and sentiment analysis.
The preprocessing and analysis pipeline is capable of handling large datasets, thanks to spaCy's optimised design. The preprocessing steps are well-organised and effective at removing noise-inducing elements like stopwords and punctuation, which can enhance the clarity of sentiment analysis.
By focusing on the 'reviews.text' data, the analysis is kept directly relevant to the sentiment task at hand.
Reviews are cleaned and analysed one at a time in a loop, so only one spaCy document needs to be held in memory at once, which helps with large datasets.
By combining sentiment polarity and subjectivity analysis, the model offers nuanced insights into customer opinions and the nature of their reviews. The integration of TextBlob facilitates straightforward sentiment extraction, providing reliable polarity and subjectivity scores.
The function sentiment_analysis can be scaled and integrated into larger systems, highlighting the model's adaptability.
The modular code structure allows for easy adaptations to new datasets or various text inputs.
For the review similarity comparison, the medium spaCy model (en_core_web_md) was deliberately used instead of the small model, because it includes the word vectors on which the similarity score is based.
Limitations:
The model may not always fully capture the context or nuances of certain phrases and idioms, potentially leading to inaccuracies in sentiment classification. This may be partly due to the use of the small language model and partly to over-cleaning the text before sentiment analysis.
The current implementation is tailored for English text data. Adapting the pipeline for other languages requires loading different spaCy language models and may necessitate adjustments in preprocessing steps.
The model might struggle with highly subjective or sarcastic content, where the literal interpretation of text does not accurately reflect the intended sentiment. This is an issue seen in many sentiment analysis models.
The preprocessing steps involve cleaning the dataset by removing stopwords and punctuation and converting words to lowercase. While this streamlines the text for analysis, it may inadvertently remove key information that contributes to the sentiment. For instance, stopwords, though often considered "noise," can sometimes carry sentiment; similarly, punctuation can denote intensity or sentiment that would be lost when removed. This is often referred to as over-cleaning.
Lemmatisation was deliberately avoided in this analysis, as it can strip words of suffixes that carry sentiment. Initial trials indicated that lemmatised reviews lost significant context and meaning, reducing the accuracy of the sentiment analysis. This effect is not universal, however: in other cases lemmatisation can improve the accuracy of the subjectivity score.
The conversion of text to lowercase may eliminate useful information such as emphasis denoted by capitalisation, which can be particularly relevant in expressing sentiments.
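The risk of removing sentiment-bearing words during cleaning can be shown with a small, self-contained sketch. It does not use spaCy: the regular-expression tokeniser and the STOP_WORDS set are simplified stand-ins invented here, on the assumption that negations such as "not" are treated as stop words. The cleaned text for review 100 in the output above ("wo find deal like stores") suggests the same effect, with the negation from "won't" dropped.

```python
import re

# Hypothetical stop-word list for illustration; a real list is much longer.
STOP_WORDS = {"i", "do", "not", "this", "is", "the", "a", "very"}


def clean(text):
    """Lowercase the text, keep only word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return " ".join(token for token in tokens if token not in STOP_WORDS)


original = "This is NOT a good battery!"
cleaned = clean(original)

print(f"Original: {original}")
print(f"Cleaned:  {cleaned}")
```

After cleaning, the negation, the capitalised emphasis, and the exclamation mark are all gone, so a lexicon-based scorer would see only a positive word.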
In conclusion, the sentiment analysis model provides valuable insights into customer sentiments expressed in Amazon product reviews. While it demonstrates strong capabilities in text preprocessing and sentiment evaluation, awareness of its limitations is crucial for interpreting the results accurately. 
Future work could explore more advanced models that incorporate contextual understanding and sarcasm detection to improve accuracy further. It is important to strike a balance between cleaning the text to aid analysis and preserving the linguistic features that convey sentiment; the right balance differs between use cases, for example product reviews versus a script for a play. The model could be improved by experimenting further with lemmatisation and by training it on previously assessed reviews. Techniques for handling negations and intensifiers should also be employed: words like "not" can reverse the sentiment of an otherwise positive or negative statement, while intensifiers like "very" can amplify it. The model was often weak in this specific area. Finally, the model could be fine-tuned for the industry it is applied to, which would help it interpret text in the appropriate context.
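One way to picture the negation and intensifier handling suggested above is a toy lexicon-based scorer. This is only a sketch under stated assumptions: the lexicon, the negation set, and the intensifier weights are all hypothetical values invented for illustration, and this is not how SpacyTextBlob computes polarity.

```python
# Hypothetical lexicon and modifier weights, invented for illustration only.
LEXICON = {"good": 0.7, "great": 0.8, "bad": -0.7, "expensive": -0.5}
NEGATIONS = {"not", "never", "no"}
INTENSIFIERS = {"very": 1.3, "really": 1.2}


def score(text):
    """Average lexicon scores, letting modifiers act on the next sentiment word."""
    scores = []
    negate = False
    scale = 1.0
    for token in text.lower().split():
        if token in NEGATIONS:
            negate = True
        elif token in INTENSIFIERS:
            scale *= INTENSIFIERS[token]
        elif token in LEXICON:
            # Apply any pending intensifier, clamp to [-1, 1], then flip if negated
            value = max(-1.0, min(1.0, LEXICON[token] * scale))
            scores.append(-value if negate else value)
            negate, scale = False, 1.0
    # Text with no sentiment words is treated as neutral
    return sum(scores) / len(scores) if scores else 0.0


for phrase in ["good", "not good", "very good", "not very good"]:
    print(f"{phrase!r}: {score(phrase):.2f}")
```

For this to work, the cleaning step would need to keep negations and intensifiers rather than discard them as stop words.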
