# Aspect-Based Sentiment Analysis

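Running aspect-based sentiment analysis on Amazon headphone reviews: for each review, we score the sentiment toward specific aspects (battery, comfort, noise cancellation, and sound quality).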

https://medium.com/nlplanet/quick-intro-to-aspect-based-sentiment-analysis-c8888a09eda7

https://huggingface.co/yangheng/deberta-v3-base-absa-v1.1

In [1]:
import pandas as pd
import re
from tqdm import tqdm
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import word_tokenize
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

In [2]:
df = pd.read_csv('amazon_reviews.csv')
df.head(2)

Unnamed: 0,ratingScore,reviewTitle,reviewUrl,reviewReaction,reviewedIn,date,country,countryCode,reviewDescription,isVerified,variant,reviewImages,position,productAsin,reviewCategoryUrl,totalCategoryRatings,totalCategoryReviews,filterByRating,product,headphoneName
0,2,First review was @ 11months. Now13 months & ba...,https://www.amazon.ca/gp/customer-reviews/RTBT...,22 people found this helpful,"Reviewed in Canada on November 27, 2022",2022-11-27,Canada,,Edited again March 25th:A month after my last ...,True,Colour Name: Silver,[],1,B094C4VDJZ,https://www.amazon.com/product-reviews/B094C4V...,1018,668,twoStar,"{'price': {'value': 289.99, 'currency': '$'}, ...",sony xm4 earbuds
1,2,"Good quality sound, battery issues, now unusable",https://www.amazon.ca/gp/customer-reviews/R2H1...,,"Reviewed in Canada on November 11, 2023",2023-11-11,Canada,,After 1 year of use: the sound quality is grea...,True,Colour Name: Black,[],2,B094C4VDJZ,https://www.amazon.com/product-reviews/B094C4V...,1018,668,twoStar,"{'price': {'value': 289.99, 'currency': '$'}, ...",sony xm4 earbuds


In [3]:
model_name = "yangheng/deberta-v3-base-absa-v1.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)



In [4]:
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

In [5]:
ex_review = df[df['headphoneName'] == 'sony xm4 earbuds']['reviewDescription'][5]
ex_review

'Huge problem with battery drain. After 1 year of use the left ear bud can only last 1 hour with all features off. The right ear perfectly fine. Beware of purchase not a good long term purchase'

In [6]:
aspects = ['battery', 'comfort', 'noise cancellation', 'sound quality']

In [7]:
for aspect in aspects:
    print(aspect, classifier(ex_review, text_pair=aspect))

battery [{'label': 'Negative', 'score': 0.9745112061500549}]
comfort [{'label': 'Negative', 'score': 0.8275106549263}]
noise cancellation [{'label': 'Negative', 'score': 0.7531450390815735}]
sound quality [{'label': 'Negative', 'score': 0.7979089021682739}]


In [35]:
#for i in range(df.shape[0]):
#    print(df[df['headphoneName'] == 'sony xm4 earbuds']['reviewDescription'][i], '\n')

## Pre-Processing Text

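The classifier returns a label for whatever aspect it is given, even when the review never mentions it: in the example above, the review only discusses the battery, yet comfort, noise cancellation, and sound quality all come back Negative. To avoid recording such scores, we pre-process the review text and check whether each aspect appears in it; if it does not, we skip the sentiment score for that aspect and review. The main reason for pre-processing is stemming: noise cancellation can appear in other forms, such as noise cancelling, and stemming also helps with some typos and misspelled words.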

The steps we want to follow here:

1. Pre-process the aspects and review texts.
2. Check if each aspect is in the text. 
3. If it is, get a sentiment score for it; otherwise record a placeholder (label 'NA', score 0).
4. Add these sentiments to a dataframe.
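
Steps 2 and 3 can be sketched for a single review as follows. This is a minimal illustration rather than the notebook's actual loop: `score_if_mentioned` and `fake_classifier` are hypothetical names, and the stand-in classifier only mimics the output shape of the Hugging Face pipeline so the snippet runs without downloading the model.

```python
def score_if_mentioned(raw_review, processed_review, aspect, aspect_stem, classify):
    """Return a sentiment dict for `aspect`, or a placeholder if it is not mentioned."""
    # Step 2: look for the stemmed aspect inside the stemmed review text.
    if aspect_stem in processed_review:
        # Step 3a: score the original (unprocessed) review against the aspect.
        return classify(raw_review, text_pair=aspect)[0]
    # Step 3b: aspect not mentioned, so record a placeholder instead of a score.
    return {'label': 'NA', 'score': 0}


def fake_classifier(text, text_pair):
    # Hypothetical stand-in that mimics the pipeline's output shape only.
    return [{'label': 'Negative', 'score': 0.9}]


raw = 'Huge problem with battery drain.'
processed = 'huge problem batteri drain'  # illustrative stemmed form of `raw`

print(score_if_mentioned(raw, processed, 'battery', 'batteri', fake_classifier))
print(score_if_mentioned(raw, processed, 'comfort', 'comfort', fake_classifier))
```

The review is classified in its original form; the pre-processed text is used only to decide whether the aspect is mentioned at all.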

In [8]:
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove special characters, numbers, and punctuation
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]

    # Stemming
    stemmer = PorterStemmer()
    #lemmatizer = WordNetLemmatizer()
    tokens = [stemmer.stem(token) for token in tokens]

    # Join the tokens back into a sentence
    processed_text = ' '.join(tokens)

    return processed_text

In [22]:
df['preprocessedReviews'] = df['reviewDescription'].fillna('').apply(preprocess_text)

In [11]:
pre_proc_aspects = []
for aspect in aspects:
    pre_proc_aspects.append(preprocess_text(aspect))
pre_proc_aspects

['batteri', 'comfort', 'nois cancel', 'sound qualiti']
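
The stems printed above are what make the aspect check tolerant of wording variants. As a small sketch of that idea, the snippet below stems a few spellings of the noise cancellation aspect with the same `PorterStemmer`. The helper `stem_phrase` is hypothetical and skips the stopword removal and tokenization done in `preprocess_text`; it only needs `nltk` installed, with no extra corpus downloads.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def stem_phrase(phrase):
    # Lowercase, split on whitespace, and stem each word (no stopword removal here).
    return ' '.join(stemmer.stem(word) for word in phrase.lower().split())


variants = ['noise cancellation', 'noise cancelling', 'noise canceling']
stems = {variant: stem_phrase(variant) for variant in variants}

for variant, stem in stems.items():
    print(f'{variant!r} -> {stem!r}')
```

If all variants reduce to the same stem string, a single substring check against the stemmed review covers every spelling.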

In [31]:
# Map each original aspect to its pre-processed (stemmed) form
aspects_dct = dict(zip(aspects, pre_proc_aspects))

aspects_dct

{'battery': 'batteri',
 'comfort': 'comfort',
 'noise cancellation': 'nois cancel',
 'sound quality': 'sound qualiti'}

## Getting Sentiments

In [43]:
sentiments = {'battery': [], 'comfort': [], 'noise cancellation': [], 'sound quality': []}

In [44]:
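for aspect in aspects_dct.keys():
    for i in range(df.shape[0]):
        # Only score the aspect if its stemmed form appears in the stemmed review
        if aspects_dct[aspect] in df['preprocessedReviews'][i]:
            # Classify the original review text, not the pre-processed version
            sentiments[aspect].append(classifier(df['reviewDescription'][i],  text_pair=aspect)[0])
        else:
            # Aspect not mentioned: record a placeholder instead of a score
            sentiments[aspect].append({'label': 'NA', 'score': 0})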
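
The loop above calls the classifier once per review. Hugging Face text-classification pipelines also accept a list of `{"text": ..., "text_pair": ...}` dictionaries together with a `batch_size` argument, so the reviews that mention an aspect could be scored in one batched call. The sketch below is an alternative under that assumption, not the method used in this notebook: `score_aspect_batched` and `fake_classifier` are hypothetical, `batch_size=16` is an arbitrary choice, and whether batching is actually faster depends on the hardware and review lengths.

```python
def score_aspect_batched(reviews, processed_reviews, aspect, aspect_stem,
                         classify, batch_size=16):
    """Score one aspect for all reviews, sending only the mentioned ones to the classifier."""
    placeholder = {'label': 'NA', 'score': 0}
    # Positions of the reviews whose stemmed text mentions the stemmed aspect.
    hits = [i for i, text in enumerate(processed_reviews) if aspect_stem in text]
    results = [dict(placeholder) for _ in reviews]
    if hits:
        # Assumption: the pipeline accepts a list of {"text", "text_pair"} dicts.
        inputs = [{'text': reviews[i], 'text_pair': aspect} for i in hits]
        outputs = classify(inputs, batch_size=batch_size)
        for i, output in zip(hits, outputs):
            results[i] = output
    return results


def fake_classifier(inputs, batch_size=1):
    # Hypothetical stand-in: one result dict per input, like the pipeline's list output.
    return [{'label': 'Negative', 'score': 0.9} for _ in inputs]


reviews = ['Battery died after a year.', 'Very comfortable fit.']
processed = ['batteri die year', 'comfort fit']  # illustrative stemmed forms

results = score_aspect_batched(reviews, processed, 'battery', 'batteri', fake_classifier)
print(results)
```

With the real data, `reviews` and `processed_reviews` could come from `df['reviewDescription'].fillna('').tolist()` and `df['preprocessedReviews'].tolist()`, and `classify` would be the `classifier` pipeline.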

In [70]:
dataframes = sentiments.copy()
for aspect in dataframes.keys():
    aspect_label = aspect.replace(" ", "")  # strip spaces so the aspect can be used in column names
    dataframes[aspect] = pd.DataFrame(dataframes[aspect]).rename(columns={'label': aspect_label+'Label', 'score': aspect_label+'Score'})

In [101]:
sentiments_df = pd.concat(list(dataframes.values()), axis=1)
sentiments_df = pd.concat([df['headphoneName'], sentiments_df], axis=1)
sentiments_df.head()

Unnamed: 0,headphoneName,batteryLabel,batteryScore,comfortLabel,comfortScore,noisecancellationLabel,noisecancellationScore,soundqualityLabel,soundqualityScore
0,sony xm4 earbuds,Negative,0.632094,Negative,0.577964,,0.0,Negative,0.576905
1,sony xm4 earbuds,Negative,0.980668,,0.0,,0.0,Positive,0.983615
2,sony xm4 earbuds,,0.0,,0.0,Positive,0.937881,,0.0
3,sony xm4 earbuds,Negative,0.984746,Negative,0.987884,Positive,0.855375,Positive,0.623085
4,sony xm4 earbuds,,0.0,,0.0,,0.0,Negative,0.956487


In [102]:
sentiments_df.to_csv('cad_sentiments.csv', index=False)

## Getting Sentiments Part 2 - US Reviews

Now we need to repeat the above steps but for the US reviews.
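
Because the US pass repeats the same logic, the scoring and column naming could also be wrapped in one function and called once per dataset. The following is a minimal sketch of that refactoring, not what the cells below do: `sentiment_rows` and `fake_classifier` are hypothetical, and the stand-in classifier only mimics the pipeline's output shape. The function returns a list of flat dictionaries that `pd.DataFrame(rows)` could turn into the same Label/Score columns.

```python
def sentiment_rows(reviews, processed_reviews, aspects_dct, classify):
    """Build one flat dict per review, with <aspect>Label and <aspect>Score keys."""
    rows = []
    for raw, processed in zip(reviews, processed_reviews):
        row = {}
        for aspect, stem in aspects_dct.items():
            if stem in processed:
                # Aspect is mentioned: classify the original review text.
                result = classify(raw, text_pair=aspect)[0]
            else:
                # Aspect is not mentioned: record a placeholder.
                result = {'label': 'NA', 'score': 0}
            prefix = aspect.replace(' ', '')  # column names without spaces
            row[prefix + 'Label'] = result['label']
            row[prefix + 'Score'] = result['score']
        rows.append(row)
    return rows


def fake_classifier(text, text_pair):
    # Hypothetical stand-in that mimics the pipeline's output shape only.
    return [{'label': 'Negative', 'score': 0.9}]


demo_aspects = {'battery': 'batteri', 'noise cancellation': 'nois cancel'}
demo_reviews = ['Battery died after a year.']
demo_processed = ['batteri die year']  # illustrative stemmed form

rows = sentiment_rows(demo_reviews, demo_processed, demo_aspects, fake_classifier)
print(rows)
```

Calling the same function for the Canadian and US data would keep the two passes from drifting apart when the aspect list or placeholder changes.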

In [105]:
US_df = pd.read_csv('amazon_US_reviews.csv')
US_df.head(2)

Unnamed: 0,ratingScore,reviewTitle,reviewUrl,reviewReaction,reviewedIn,date,country,countryCode,reviewDescription,isVerified,variant,reviewImages,position,productAsin,reviewCategoryUrl,totalCategoryRatings,totalCategoryReviews,filterByRating,product,headphoneName
0,1.0,Possibly worth the trouble for half the price,https://www.amazon.com/gp/customer-reviews/R3U...,15 people found this helpful,"Reviewed in the United States on October 26, 2021",2021-10-26,United States,,"This is the 3rd, & last, set of recently purch...",True,Color: BlackPattern: Headphones,[],1.0,B094C4VDJZ,https://www.amazon.com/product-reviews/B094C4V...,2085.0,1354.0,oneStar,"{'price': {'value': 164.97, 'currency': '$'}, ...",sony xm4 earbuds
1,1.0,TLDR Warning ! But if you going to spend this...,https://www.amazon.com/gp/customer-reviews/ROP...,401 people found this helpful,"Reviewed in the United States on December 7, 2021",2021-12-07,United States,,UPDATE IV:.After several months with many othe...,True,Color: SilverPattern: Headphones,['https://m.media-amazon.com/images/I/618Lflee...,2.0,B094C4VDJZ,https://www.amazon.com/product-reviews/B094C4V...,2085.0,1354.0,oneStar,"{'price': {'value': 164.97, 'currency': '$'}, ...",sony xm4 earbuds


In [107]:
US_df['preprocessedReviews'] = US_df['reviewDescription'].fillna('').apply(preprocess_text)

In [110]:
US_sentiments = {aspect: [] for aspect in aspects_dct.keys()}
US_sentiments

{'battery': [], 'comfort': [], 'noise cancellation': [], 'sound quality': []}

In [116]:
for aspect in aspects_dct.keys():
    for i in tqdm(range(US_df.shape[0]), desc=f"Processing {aspect}"):
        if aspects_dct[aspect] in US_df['preprocessedReviews'][i]:
            US_sentiments[aspect].append(classifier(US_df['reviewDescription'][i],  text_pair=aspect)[0])
        else:
            US_sentiments[aspect].append({'label': 'NA', 'score': 0})

Processing battery: 100%|████████████████████████████████████████████████████████████| 729/729 [08:00<00:00,  1.52it/s]
Processing comfort: 100%|████████████████████████████████████████████████████████████| 729/729 [08:09<00:00,  1.49it/s]
Processing noise cancellation: 100%|█████████████████████████████████████████████████| 729/729 [07:01<00:00,  1.73it/s]
Processing sound quality: 100%|██████████████████████████████████████████████████████| 729/729 [06:48<00:00,  1.78it/s]


In [117]:
US_dataframes = US_sentiments.copy()
for aspect in US_dataframes.keys():
    aspect_label = aspect.replace(" ", "")  # strip spaces so the aspect can be used in column names
    US_dataframes[aspect] = pd.DataFrame(US_dataframes[aspect]).rename(columns={'label': aspect_label+'Label', 'score': aspect_label+'Score'})

In [118]:
US_sentiments_df = pd.concat(list(US_dataframes.values()), axis=1)
US_sentiments_df = pd.concat([US_df['headphoneName'], US_sentiments_df], axis=1)
US_sentiments_df.head()

Unnamed: 0,headphoneName,batteryLabel,batteryScore,comfortLabel,comfortScore,noisecancellationLabel,noisecancellationScore,soundqualityLabel,soundqualityScore
0,sony xm4 earbuds,Negative,0.894016,Negative,0.886814,,0.0,,0.0
1,sony xm4 earbuds,Positive,0.90851,Positive,0.916273,,0.0,Positive,0.908909
2,sony xm4 earbuds,Negative,0.499079,,0.0,Negative,0.517302,,0.0
3,sony xm4 earbuds,Negative,0.867483,,0.0,Negative,0.843609,,0.0
4,sony xm4 earbuds,Negative,0.885287,,0.0,Negative,0.832005,Negative,0.846645


In [121]:
US_sentiments_df.to_csv('US_sentiments.csv', index=False)