### Introduction


In this task, we will be applying natural language processing (NLP) to sentiment analysis by developing a Python program that performs sentiment analysis on a dataset of product reviews.

Kaggle source:https://www.kaggle.com/datasets/datafiniti/consumer-reviews-of-amazon-products?resource=download

In [5]:
# Import libraries
import pandas as pd
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

In [6]:
# Load language model
nlp = spacy.load("en_core_web_md")
nlp.add_pipe("spacytextblob")

<spacytextblob.spacytextblob.SpacyTextBlob at 0x1fa832b4c20>

In [7]:
# Load product reviews dataset
try:
    df = pd.read_csv(
        "Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv"
    )
except FileNotFoundError:
    print(
        "The file you are trying to load does not exist or is "
        "in the wrong directory."
    )

Next, we will take a look at the dataset.

In [8]:
# First few rows
print("First 5 rows of dataset:")
df.head()

First 5 rows of dataset:


Unnamed: 0,id,dateAdded,dateUpdated,name,asins,brand,categories,primaryCategories,imageURLs,keys,...,reviews.didPurchase,reviews.doRecommend,reviews.id,reviews.numHelpful,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.username,sourceURLs
0,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,,3,https://www.amazon.com/product-reviews/B00QWO9...,I order 3 of them and one of the item is bad q...,... 3 of them and one of the item is bad quali...,Byger yang,"https://www.barcodable.com/upc/841710106442,ht..."
1,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,,4,https://www.amazon.com/product-reviews/B00QWO9...,Bulk is always the less expensive way to go fo...,... always the less expensive way to go for pr...,ByMG,"https://www.barcodable.com/upc/841710106442,ht..."
2,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,,5,https://www.amazon.com/product-reviews/B00QWO9...,Well they are not Duracell but for the price i...,... are not Duracell but for the price i am ha...,BySharon Lambert,"https://www.barcodable.com/upc/841710106442,ht..."
3,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,,5,https://www.amazon.com/product-reviews/B00QWO9...,Seem to work as well as name brand batteries a...,... as well as name brand batteries at a much ...,Bymark sexson,"https://www.barcodable.com/upc/841710106442,ht..."
4,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,,5,https://www.amazon.com/product-reviews/B00QWO9...,These batteries are very long lasting the pric...,... batteries are very long lasting the price ...,Bylinda,"https://www.barcodable.com/upc/841710106442,ht..."


We will be using the 'reviews.text' column for this sentiment analysis. For easier use later, we will extract this colum into its own variable.

In [9]:
# Extract 'reviews.text' column
reviews_data = df['reviews.text']
reviews_data.head()

0    I order 3 of them and one of the item is bad q...
1    Bulk is always the less expensive way to go fo...
2    Well they are not Duracell but for the price i...
3    Seem to work as well as name brand batteries a...
4    These batteries are very long lasting the pric...
Name: reviews.text, dtype: object

We will start cleaning this colum in the dataframe by removing missing values and stop words.

#### Cleaning and preprocessing

In [10]:
# Remove missing values from reviews.text column
clean_data = df.dropna(subset=['reviews.text'])

In [None]:
# Removing stop words
def preprocess_text(text):
    '''
    This function takes a string, converts it to lowercase
    and removes extra whitespace. It then passes it through
    the NLP model and processes the doc object to remove
    stop words, punctuation marks performs lemmatization.

    Arguments:
    - text = review to be cleaned (str)

    Returns:
    - processed_review = cleaned review (str)
    '''
    text_lower = str(text.lower().strip())  # Convert to lowercase
    doc = nlp(text_lower)  # Convert to doc object through NLP model

    # Empty list to store cleaned tokens
    preprocessed_tokens = []

    # Remove stop words, punctuation, spaces, perform lemmatization
    for token in doc:
        if not token.is_stop and not token.is_punct and not token.is_space:
            preprocessed_tokens.append(token.lemma_)

    # Join tokens
    processed_review = " ".join(preprocessed_tokens).strip()

    return processed_review

# Create new columns with cleaned reviews
clean_data["cleaned_reviews.text"] = (
    clean_data['reviews.text'].apply(preprocess_text)
)

# Confirm changes
clean_data.head()

Unnamed: 0,id,dateAdded,dateUpdated,name,asins,brand,categories,primaryCategories,imageURLs,keys,...,reviews.doRecommend,reviews.id,reviews.numHelpful,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.username,sourceURLs,cleaned_reviews.text
0,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,3,https://www.amazon.com/product-reviews/B00QWO9...,I order 3 of them and one of the item is bad q...,... 3 of them and one of the item is bad quali...,Byger yang,"https://www.barcodable.com/upc/841710106442,ht...",order 3 item bad quality miss backup spring pc...
1,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,4,https://www.amazon.com/product-reviews/B00QWO9...,Bulk is always the less expensive way to go fo...,... always the less expensive way to go for pr...,ByMG,"https://www.barcodable.com/upc/841710106442,ht...",bulk expensive way product like
2,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,5,https://www.amazon.com/product-reviews/B00QWO9...,Well they are not Duracell but for the price i...,... are not Duracell but for the price i am ha...,BySharon Lambert,"https://www.barcodable.com/upc/841710106442,ht...",duracell price happy
3,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,5,https://www.amazon.com/product-reviews/B00QWO9...,Seem to work as well as name brand batteries a...,... as well as name brand batteries at a much ...,Bymark sexson,"https://www.barcodable.com/upc/841710106442,ht...",work brand battery well price
4,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,5,https://www.amazon.com/product-reviews/B00QWO9...,These batteries are very long lasting the pric...,... batteries are very long lasting the price ...,Bylinda,"https://www.barcodable.com/upc/841710106442,ht...",battery long last price great


Now that our reviews are ready, we will start building out sentiment analysis function that will take the review as an input and return the sentiment.

#### Sentiment analysis

In [None]:
def get_sentiment(reviews):
    '''
    This function takes a review, passes it through the NLP model
    and calculates its polarity and sentiment.

    Arguments:
    - reviews = review to be analyzed (str)

    If the polarity is below 0, the sentiment is 'negative'. If
    the polarity is above 0, the sentiment is 'positive'. If the
    polarity is 0, the sentiment is 'neutral'.

    Returns:
    - review_sentiment along with polarity and subjectivity scoring.
    '''
    doc = nlp(reviews)

    polarity = polarity = doc._.blob.polarity
    sentiment = doc._.blob.sentiment
    subjectivity = sentiment[1]

    if polarity < 0:
        review_sentiment = "Negative"
    elif polarity > 0:
        review_sentiment = "Positive"
    else:
        review_sentiment = "Neutral"

    return (
        f"{review_sentiment}. Polarity: {polarity:.4f}. "
        f"Subjectivity: {subjectivity:.4f}"
    )

We will now test the function on a sample of cleaned reviews. First, we will test it on the first 5 reviews and then the last 5 reviews.

#### Testing on reviews

In [13]:
# Select first 5 cleaned reviews
sample_reviews_head = clean_data["cleaned_reviews.text"].head()

# Test function
print("Sentiment analysis on first 5 reviews")
for i, review in enumerate(sample_reviews_head):
    print(i+1, get_sentiment(review))

Sentiment analysis on first 5 reviews
1 Negative. Polarity: -0.7000. Subjectivity: 0.6667
2 Negative. Polarity: -0.5000. Subjectivity: 0.7000
3 Positive. Polarity: 0.8000. Subjectivity: 1.0000
4 Neutral. Polarity: 0.0000. Subjectivity: 0.0000
5 Positive. Polarity: 0.2500. Subjectivity: 0.4056


In the first 5 samples, we have a review of each class. The first review is analysed as negative since it has a polarity below 0. Its subjectivity has a score of 0.6667 which is closer to 1 and suggests that this review is opinionated rather than factual (source 1).
Review 3 is positive and entirely opinionated. Review 4 has a polarity of 0 causing its sentiment to be neutral

1. Source:
https://textblob.readthedocs.io/en/latest/quickstart.html#quickstart

Let's take a closer look at the full text (original) of these 3 mentioned reviews.

In [14]:
# Select reviews 1, 3 and 4 from original text
print("Review 1:", reviews_data[0])
print("Review 3:", reviews_data[2])
print("Review 4:", reviews_data[3])

Review 1: I order 3 of them and one of the item is bad quality. Is missing backup spring so I have to put a pcs of aluminum to make the battery work.
Review 3: Well they are not Duracell but for the price i am happy.
Review 4: Seem to work as well as name brand batteries at a much better price


As seen in these three samples, the sentiment analysis did a fine job in analysing their cleaned versions. The first sentence contained words such as "bad", "missing". It correctly understood that the customers opinion was used that the customer was not happy. Review three contained words such as "happy" which is always considered a positive word and the sentence is also clearly the customers feelings and opinion. The last review contained words that are not considered positive or negative such as "brand", "price", etc. resulting in the analysis to be neutral.

Next, we will look at the last 5 reviews.

In [15]:
# Select last 5 cleaned reviews
sample_reviews_tail = clean_data["cleaned_reviews.text"].tail()

# Test function
print("Sentiment analysis on last 5 reviews")
for i, review in enumerate(sample_reviews_tail):
    print(i+1, get_sentiment(review))

Sentiment analysis on last 5 reviews
1 Positive. Polarity: 0.4000. Subjectivity: 0.4667
2 Positive. Polarity: 0.3000. Subjectivity: 0.4000
3 Positive. Polarity: 0.4143. Subjectivity: 0.7000
4 Positive. Polarity: 0.4167. Subjectivity: 0.8333
5 Positive. Polarity: 0.3878. Subjectivity: 0.4816


These reviews are all analysed as positive with some being more opinionated than others (subjectivity scores closer to 1) and some slightly closer to a neutral sentiment (polarity scores closer to being 0) than others. Review 2 appear to lean more to the factual side while review 4 appear to be more opinionated than factual.

Let's take a look at these two reviews in their full text.

In [16]:
# Select reviews 2 and 4 from original text
print("Review 2:", reviews_data.iloc[-4])
print("Review 4:", reviews_data.iloc[-2])

Review 2: I bought this for my niece for a Christmas gift.she is 9 years old and she love it.
Review 4: This Tablet does absolutely everything I want! I can watch TV Shows or Movies, check my Mail, Facebook, Google.......pay all my bills. It processes fast and has a beautiful screen. As I said: Everything I want in a Tablet for less than $100!


By reading these reviews, we can see why they are classified as positive. Review 2 contains words such as "love" which the model sees as a positive and happy tone. Review 4 contains words such as "fast" and "beautiful" and repeated phrases such as "everything I want" which the model sees as positive. It is an enthusiastic review which could contribute to the subjectivity showing an opinionated score (closer to 1).

To end off this task, we will compare the similarities of two of the above reviews: review 5 from the first 5 reviews and review 2 from the last five reviews. The reason behind this choice is they have polarity scores and subjectivity scores that are not far apart from each other and they both have a positive sentiment. Perhaps we will also find that they are highly similar.

#### Similarity comparison

In [17]:
# Select chosen reviews and convert into doc objects
review_5_head = nlp(reviews_data[1])
review_2_tail = nlp(reviews_data.iloc[-4])
similarity_score = review_5_head.similarity(review_2_tail)

print(
    "Similarity between review 5 (of first 5 reviews) "
    "and review 2 (of last 5 reviews):", similarity_score
)

Similarity between review 5 (of first 5 reviews) and review 2 (of last 5 reviews): 0.8960418701171875


As expected, these two reviews are highly similar with a score of almost 0.9.

### Conclusion

In conclusion, this task demonstrates a simple and efficient way to perform sentiment analysis on different customer reviews. It accurately predicted the sentiment on the sample of reviews that were tested by seeing words and phrases such as "happy", "love", "beautiful" and "everything I want" as positive sentiments while seeing words like "bad" or "missing" as negative sentiments. In addition, neutral words that do not have any type of emotion associated to it (such as "brand", "price" or "battery") was seen as a neutral sentiment. As expected, two of the 'positive' reviews with polarity and subjectivity scores that were close to each other also had a high similarity score as predicted.
Overall, this project highlights the effectiveness of spaCy's NLP capabilities using the en_core_web_md model to conduct basic sentiment analysis.