# ASPECT BASED SENTIMENT ANALYSIS (ABSA)

**Context**: Sentiment analysis is increasingly viewed as a vital task both from an academic and a commercial standpoint. The majority of current approaches, however, attempt to detect the overall polarity of a sentence, paragraph, or text span, regardless of the entities mentioned (e.g., laptops, restaurants) and their aspects (e.g., battery, screen; food, service). By contrast, this task is concerned with aspect based sentiment analysis (ABSA), where the goal is to identify the aspects of given target entities and the sentiment expressed towards each aspect. Datasets consisting of customer reviews with human-authored annotations identifying the mentioned aspects of the target entities and the sentiment polarity of each aspect will be provided.

In [None]:
# Import libraries
import xml.etree.ElementTree as ET
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import RegexpParser
from textblob import TextBlob
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split



nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\kietd\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\kietd\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\kietd\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\kietd\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


True

## RECONSTRUCTION

In this section, we reconstruct the original review format from the processed, sentence-level dataset.

In [None]:

tree = ET.parse('data/Restaurants_Test_Data_PhaseA.xml')
root = tree.getroot()

data = []

# Iterate through the XML and extract information
for sentence in root.findall('.//sentence'):
    sentence_id = sentence.get('id')
    text = sentence.find('text').text

    master_id, id_num = sentence_id.rsplit('#', 1)

    data.append({
        'MasterID': master_id,
        'ID': id_num,
        'Text': text
    })

df = pd.DataFrame(data)
df.head(11)

Unnamed: 0,MasterID,ID,Text
0,32897564#894393,2,The bread is top notch as well.
1,33070600#670328,0,I have to say they have one of the fastest del...
2,33070600#670328,2,Food is always fresh and hot- ready to eat!
3,36244464#949326,5,Did I mention that the coffee is OUTSTANDING?
4,32894246#870052,0,"Certainly not the best sushi in New York, howe..."
5,32894246#870052,1,"I trust the people at Go Sushi, it never disap..."
6,32894246#870052,2,"Straight-forward, no surprises, very decent Ja..."
7,35390182#756337,4,"BEST spicy tuna roll, great asian salad."
8,35390182#756337,5,Try the rose roll (not on menu).
9,11447227#436718,3,"I love the drinks, esp lychee martini, and the..."


In [None]:
df.describe()

Unnamed: 0,MasterID,ID,Text
count,800,800,800
unique,279,15,800
top,32894966#1727613,0,The bread is top notch as well.
freq,9,177,1


In [None]:
df[df['MasterID'] == '11313290#1139539']

Unnamed: 0,MasterID,ID,Text
378,11313290#1139539,0,"Even after a few bad evenings at Bardolino, I ..."
779,11313290#1139539,1,"The new menu has a few creative items,they wer..."


Filtering on a single MasterID shows that one MasterID can cover several sentences, each stored under a different ID. We will therefore rebuild each original review by joining its sentences in ID order.
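As a minimal sketch of the same idea outside pandas, the snippet below splits each sentence id on its last `#`, groups the sentences by MasterID, and joins them in numeric position order. The records and the `rebuild_reviews` helper are made up for illustration. It also shows why the position has to be converted to an integer before sorting.

```python
from collections import defaultdict

# Made-up records for illustration: (sentence id, text). Each id is a
# MasterID followed by '#' and the sentence position, as in the data above.
records = [
    ("111#222#10", "Tenth sentence."),
    ("111#222#2", "Second sentence."),
    ("111#222#0", "First sentence."),
]

def rebuild_reviews(records):
    """Join the sentences of each MasterID in numeric position order."""
    grouped = defaultdict(list)
    for sentence_id, text in records:
        # Split on the last '#' only, since the MasterID itself contains '#'.
        master_id, position = sentence_id.rsplit("#", 1)
        grouped[master_id].append((int(position), text))
    return {
        master_id: " ".join(text for _, text in sorted(parts))
        for master_id, parts in grouped.items()
    }

reviews = rebuild_reviews(records)
print(reviews["111#222"])

# Sorting the positions as strings would put "10" before "2".
print(sorted(["10", "2", "0"]))
```

This is the reason the pandas version casts the `ID` column to `int` before calling `sort_values`.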




In [None]:
df['ID'] = df['ID'].astype(int)
# Group by MasterID, sort by ID, and concatenate Text
grouped_df = df.sort_values('ID').groupby('MasterID').agg({
    'Text': ' '.join
}).reset_index()

grouped_df.columns = ['MasterID', 'FullText']
grouped_df.head(11)

Unnamed: 0,MasterID,FullText
0,11302355#533813,"Great food, great waitstaff, great atmosphere,..."
1,11302356#1455624,I've been coming here on and off for the past ...
2,11302357#835238,What can you say about a place where the waitr...
3,11313290#1139539,"Even after a few bad evenings at Bardolino, I ..."
4,11313316#1234433,This is definitely one of the places that I ha...
5,11313359#650269,At night the atmoshere changes turning into th...
6,11313392#560011,Pizzas were excellent in addition to appetizer...
7,11313431#524365,My fiance and I recently wanted to see the cit...
8,11313439#692431,Anytime and everytime I find myself in the nei...
9,11349445#757796,"If you're in the 'hood, definitely stop in. th..."


In [None]:
grouped_df[grouped_df['MasterID'] == '11313290#1139539']

Unnamed: 0,MasterID,FullText
3,11313290#1139539,"Even after a few bad evenings at Bardolino, I ..."


## SENTENCE TOKENIZATION

At this point, the data is in the format MasterID, FullText:

| MasterID | FullText |
|----------|----------|
| 1        | This is a sentence. This is another sentence. This is the third sentence. |

We will tokenize the sentences and reshape the data into the format MasterID, ID, Text:

| MasterID | ID | Text |
|----------|----|------|
| 1        | 0  | This is a sentence. |
| 1        | 1  | This is another sentence. |
| 1        | 2  | This is the third sentence. |

In [None]:
grouped_df.columns

Index(['MasterID', 'FullText'], dtype='object')

In [None]:
# Tokenize sentences and create new rows
df_tokenized = grouped_df.apply(lambda row: pd.Series({
    'MasterID': row['MasterID'],
    'Text': nltk.sent_tokenize(row['FullText'])
}), axis=1)

# Explode the 'Text' column to create separate rows for each sentence
df_exploded = df_tokenized.explode('Text').reset_index(drop=True)

# Add an 'ID' column to number the sentences within each MasterID
df_exploded['ID'] = df_exploded.groupby('MasterID').cumcount()

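# Reorder columns; .copy() makes df_final an independent DataFrame, so adding
# columns to it later does not trigger pandas' SettingWithCopyWarning
df_final = df_exploded[['MasterID', 'ID', 'Text']].copy()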

df_final.head(11)

Unnamed: 0,MasterID,ID,Text
0,11302355#533813,0,"Great food, great waitstaff, great atmosphere,..."
1,11302356#1455624,0,I've been coming here on and off for the past ...
2,11302356#1455624,1,"The food is top notch, the service is attentiv..."
3,11302356#1455624,2,"If you' re in New York, you do not want to mis..."
4,11302357#835238,0,What can you say about a place where the waitr...
5,11302357#835238,1,"The service was pretty poor all around, the fo..."
6,11302357#835238,2,"The ambiance was pretty cool, but not worth th..."
7,11302357#835238,3,Probably my worst dining experience in new yor...
8,11313290#1139539,0,"Even after a few bad evenings at Bardolino, I ..."
9,11313290#1139539,1,"The new menu has a few creative items,they wer..."
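Because `nltk.sent_tokenize` derives its own sentence boundaries, the round trip need not reproduce the original segmentation. The null-count table near the end of the notebook reports 22 rows as 2.743142 %, which suggests 802 rows in `df_final` against the 800 original sentences. A small sanity check, sketched here with a hypothetical `count_mismatches` helper and made-up ids, lists the MasterIDs whose sentence count changed.

```python
from collections import Counter

def count_mismatches(original_ids, retokenized_ids):
    """Return {MasterID: (count_before, count_after)} where the counts differ."""
    before = Counter(original_ids)
    after = Counter(retokenized_ids)
    return {
        master_id: (before[master_id], after[master_id])
        for master_id in before.keys() | after.keys()
        if before[master_id] != after[master_id]
    }

# Made-up example: review "a" had 2 sentences and now has 3.
original = ["a", "a", "b"]
retokenized = ["a", "a", "a", "b"]
mismatches = count_mismatches(original, retokenized)
print(mismatches)
```

In the notebook, the same function could be called as `count_mismatches(df['MasterID'], df_final['MasterID'])`, since `Counter` accepts any iterable of ids.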


## Task 1: Aspect term extraction

Given a set of sentences with pre-identified entities (e.g., restaurants), identify the aspect terms present in the sentence and return a list containing all the distinct aspect terms. An aspect term names a particular aspect of the target entity.

For example, in “I liked the service and the staff, but not the food” the aspect terms are “service”, “staff”, and “food”, and in “The food was nothing much, but I loved the staff” they are “food” and “staff”. Multi-word aspect terms (e.g., “hard disk”) should be treated as single terms (e.g., in “The hard disk is very noisy” the only aspect term is “hard disk”).

In [None]:
def extract_aspect_terms(sentence):
    words = word_tokenize(sentence)
    tagged = pos_tag(words)

    # Define a grammar for chunking
    grammar = r"""
        NBAR: {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns
        NP: {<NBAR>}
        {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
    """
    chunk_parser = RegexpParser(grammar)
    chunked = chunk_parser.parse(tagged)

    # Collect each NP chunk as a lower-cased, space-joined candidate term
    aspect_terms = []
    for subtree in chunked.subtrees(filter=lambda t: t.label() == 'NP'):
        aspect = ' '.join([leaf[0].lower() for leaf in subtree.leaves()])
        aspect_terms.append(aspect)

    return list(set(aspect_terms))  # Remove duplicates (order is not preserved)

In [None]:
df_final['AspectTerms'] = df_final['Text'].apply(extract_aspect_terms)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_final['AspectTerms'] = df_final['Text'].apply(extract_aspect_terms)


## Task 2: Aspect term polarity

For a given set of aspect terms within a sentence, determine whether the polarity of each aspect term is positive, negative, neutral or conflict (i.e., both positive and negative).


For example:

“I loved their fajitas” → {fajitas: positive}
“I hated their fajitas, but their salads were great” → {fajitas: negative, salads: positive}
“The fajitas are their first plate” → {fajitas: neutral}
“The fajitas were great to taste, but not to see” → {fajitas: conflict}


In [None]:
def get_aspect_polarity(sentence, aspect_term):
    blob = TextBlob(sentence)
    relevant_sentences = [sent for sent in blob.sentences if aspect_term.lower() in sent.lower()]

    if not relevant_sentences:
        return 'neutral'

    ### Is this the polarity for the entire sentence? ###
    polarities = [sent.sentiment.polarity for sent in relevant_sentences]
    avg_polarity = sum(polarities) / len(polarities)

    if avg_polarity > 0.1:
        return 'positive'
    elif avg_polarity < -0.1:
        return 'negative'
    else:
        return 'neutral'
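The function above scores every whole sentence that contains the term, so all aspect terms in the same sentence receive the same label. One possible refinement, sketched below, is to score only the clause that mentions the term. This is not TextBlob's method: the splitting rule, the `clause_polarity` helper, and the tiny word lists are assumptions made for illustration, and negation is not handled.

```python
import re

# Hypothetical toy lexicon, for illustration only; a real system would use
# a proper sentiment lexicon or a trained model.
POSITIVE = {"great", "good", "loved", "excellent", "fresh"}
NEGATIVE = {"dreadful", "bad", "hated", "poor", "terrible"}

def clause_polarity(sentence, aspect_term):
    """Label an aspect term using only the clause(s) that mention it."""
    # Split on commas, semicolons, and a few contrastive conjunctions.
    clauses = re.split(r"[,;]|\bbut\b|\bhowever\b|\balthough\b", sentence.lower())
    hits = [clause for clause in clauses if aspect_term.lower() in clause]
    if not hits:
        return "neutral"
    words = re.findall(r"[a-z']+", " ".join(hits))
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    if pos and neg:
        return "conflict"
    if pos:
        return "positive"
    if neg:
        return "negative"
    return "neutral"

sentence = "Great food but the service was dreadful!"
result = {term: clause_polarity(sentence, term) for term in ("food", "service")}
print(result)  # {'food': 'positive', 'service': 'negative'}
```

Splitting on punctuation and conjunctions is a crude proxy for syntactic scope; a dependency parse or a fixed word window around the term are common alternatives.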

In [None]:
### The polarity for each aspect of a sentence seems to be the same across df_final.csv ###
### I suspect that you produce the polarity for each sentence, not the polarity for each aspect ###
### Issue: in a sentence like 'Great food but the service was dreadful!', the polarities for 'food' and 'service' are not the same. ###
df_final['AspectPolarities'] = df_final.apply(lambda row:
    {term: get_aspect_polarity(row['Text'], term) for term in row['AspectTerms']}, axis=1)



## Task 3: Aspect category detection

Given a predefined set of aspect categories (e.g., price, food), identify the aspect categories discussed in a given sentence. Aspect categories are typically coarser than the aspect terms of Task 1, and they do not necessarily occur as terms in the given sentence.

For example, given the set of aspect categories {food, service, price, ambience, anecdotes/miscellaneous}:

“The restaurant was too expensive”  → {price}
“The restaurant was expensive, but the menu was great” → {price, food}

In [None]:

categories = ['food', 'service', 'price', 'ambience', 'miscellaneous']

def train_category_classifier(df):
    # Use the 'Text' column for features
    X = df['Text']

    # For demonstration, we'll use a random category assignment
    ### Let's use the labeled categories from the dataset for this training ###
    y = [categories[i % len(categories)] for i in range(len(df))]

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    vectorizer = CountVectorizer()
    X_train_vectorized = vectorizer.fit_transform(X_train)

    clf = MultinomialNB()
    clf.fit(X_train_vectorized, y_train)

    return vectorizer, clf

def detect_aspect_categories(sentence, vectorizer, clf):
    X = vectorizer.transform([sentence])
    predicted_category = clf.predict(X)[0]
    return [predicted_category]
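The classifier above is trained on placeholder labels assigned in rotation, so its predictions carry no information about the text. To train on real labels, the annotated categories first have to be read from the training file. The sketch below assumes the labeled XML stores them as `<aspectCategory category="..." polarity="..."/>` elements inside each `<sentence>`; that layout, the sample ids, and the `load_category_labels` helper are assumptions to check against the actual file.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the assumed layout; the ids are invented.
SAMPLE_XML = """
<sentences>
  <sentence id="100#200#0">
    <text>The restaurant was expensive, but the menu was great.</text>
    <aspectCategories>
      <aspectCategory category="price" polarity="negative"/>
      <aspectCategory category="food" polarity="positive"/>
    </aspectCategories>
  </sentence>
  <sentence id="100#200#1">
    <text>We went on a Tuesday.</text>
  </sentence>
</sentences>
"""

def load_category_labels(root):
    """Return one dict per sentence with its text and {category: polarity}."""
    rows = []
    for sentence in root.iter("sentence"):
        labels = {
            node.get("category"): node.get("polarity")
            for node in sentence.iter("aspectCategory")
        }
        rows.append({
            "Text": sentence.findtext("text", default=""),
            "Categories": labels,
        })
    return rows

rows = load_category_labels(ET.fromstring(SAMPLE_XML))
for row in rows:
    print(row)
```

Since one sentence can carry several categories, the labels form a multi-label target; a single-label classifier such as the one above would need one binary model per category or a similar adaptation.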

In [None]:
# Train the classifier
vectorizer, clf = train_category_classifier(df_final)

# Use the trained classifier to detect categories
df_final['AspectCategories'] = df_final['Text'].apply(lambda x: detect_aspect_categories(x, vectorizer, clf))

### For this task, if training is required, you can use the labeled 'category' from the dataset to train, don't use random values ###



## Task 4: Aspect category polarity

Given a set of pre-identified aspect categories (e.g., {food, price}), determine the polarity (positive, negative, neutral or conflict) of each aspect category.

For example:

“The restaurant was too expensive” → {price: negative}
“The restaurant was expensive, but the menu was great” → {price: negative, food: positive}

In [None]:
def get_aspect_category_polarity(sentence, category):
    return get_aspect_polarity(sentence, category)
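The wrapper above passes the category name itself to `get_aspect_polarity`, which looks for that word in the sentence. For “The restaurant was too expensive” the word “price” never appears, so the result is 'neutral'. One possible alternative, sketched below, derives each category's polarity from the polarities of the aspect terms that belong to it, using the task's definition of conflict (both positive and negative). The term-to-category map and both helpers are hypothetical.

```python
def combine_polarities(labels):
    """Merge several polarity labels into one, following the conflict rule."""
    seen = set(labels) - {"neutral"}
    if not seen:
        return "neutral"
    if "conflict" in seen or {"positive", "negative"} <= seen:
        return "conflict"
    return seen.pop()  # exactly one of 'positive' or 'negative' remains

# Hypothetical mapping, for illustration only; in practice it could be
# learned from the labeled data or curated by hand.
TERM_TO_CATEGORY = {
    "fajitas": "food",
    "salads": "food",
    "waiter": "service",
    "prices": "price",
}

def category_polarities(term_polarities, term_to_category=TERM_TO_CATEGORY):
    """Aggregate {term: polarity} into {category: polarity}."""
    grouped = {}
    for term, label in term_polarities.items():
        category = term_to_category.get(term)
        if category is not None:
            grouped.setdefault(category, []).append(label)
    return {category: combine_polarities(labels) for category, labels in grouped.items()}

example = category_polarities(
    {"fajitas": "negative", "salads": "positive", "waiter": "positive"}
)
print(example)  # {'food': 'conflict', 'service': 'positive'}
```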

In [None]:
### As with the aspect polarity, if multiple categories are present in the sentence, the polarity has to be calculated separately for each ###
df_final['CategoryPolarities'] = df_final.apply(lambda row:
    {category: get_aspect_category_polarity(row['Text'], category)
     for category in row['AspectCategories']}, axis=1)



In [None]:
df_final[['MasterID', 'ID', 'Text', 'AspectTerms', 'AspectPolarities', 'AspectCategories', 'CategoryPolarities']].head(11)

Unnamed: 0,MasterID,ID,Text,AspectTerms,AspectPolarities,AspectCategories,CategoryPolarities
0,11302355#533813,0,"Great food, great waitstaff, great atmosphere,...","[great waitstaff, great beer, great food, grea...","{'great waitstaff': 'positive', 'great beer': ...",[food],{'food': 'positive'}
1,11302356#1455624,0,I've been coming here on and off for the past ...,"[years, la lanterna]","{'years': 'negative', 'la lanterna': 'negative'}",[service],{'service': 'neutral'}
2,11302356#1455624,1,"The food is top notch, the service is attentiv...","[top notch, service, atmosphere, food]","{'top notch': 'positive', 'service': 'positive...",[food],{'food': 'positive'}
3,11302356#1455624,2,"If you' re in New York, you do not want to mis...","[re, new york, place]","{'re': 'positive', 'new york': 'positive', 'pl...",[ambience],{'ambience': 'neutral'}
4,11302357#835238,0,What can you say about a place where the waitr...,"[waitress, lip, way, wrong entree, place, year...","{'waitress': 'negative', 'lip': 'negative', 'w...",[miscellaneous],{'miscellaneous': 'neutral'}
5,11302357#835238,1,"The service was pretty poor all around, the fo...","[customer, service, food, cost, place, crazy bum]","{'customer': 'negative', 'service': 'negative'...",[food],{'food': 'negative'}
6,11302357#835238,2,"The ambiance was pretty cool, but not worth th...","[ambiance, hassle]","{'ambiance': 'positive', 'hassle': 'positive'}",[service],{'service': 'neutral'}
7,11302357#835238,3,Probably my worst dining experience in new yor...,"[experience, former waiter, new york]","{'experience': 'negative', 'former waiter': 'n...",[service],{'service': 'neutral'}
8,11313290#1139539,0,"Even after a few bad evenings at Bardolino, I ...","[bardolino, few bad evenings]","{'bardolino': 'negative', 'few bad evenings': ...",[ambience],{'ambience': 'neutral'}
9,11313290#1139539,1,"The new menu has a few creative items,they wer...","[favorite words, old favorites, time, new menu...","{'favorite words': 'positive', 'old favorites'...",[miscellaneous],{'miscellaneous': 'neutral'}


Missing values in 'AspectTerms', 'AspectPolarities', 'AspectCategories', and 'CategoryPolarities' appear as empty containers ([] or {}) rather than NaN, so the usual null count does not detect them.

In [None]:
# Function to check if a value is effectively null (empty list or dict)
def is_effectively_null(value):
    if isinstance(value, (list, dict)):
        return len(value) == 0
    return pd.isna(value)

# Count effectively null values for each column
null_counts = {}
for column in df_final.columns:
    null_counts[column] = df_final[column].apply(is_effectively_null).sum()

# Create a DataFrame to display the results
null_df = pd.DataFrame.from_dict(null_counts, orient='index', columns=['Null Count'])
null_df['Null Percentage'] = (null_df['Null Count'] / len(df_final)) * 100

# Sort by null count in descending order
null_df = null_df.sort_values('Null Count', ascending=False)

null_df

Unnamed: 0,Null Count,Null Percentage
AspectTerms,22,2.743142
AspectPolarities,22,2.743142
MasterID,0,0.0
ID,0,0.0
Text,0,0.0
AspectCategories,0,0.0
CategoryPolarities,0,0.0


In [None]:
### A major gap in this notebook is the accuracy of the methodology you are using ###
### Typically, the accuracies of the four tasks are calculated against the labels provided in the dataset. ###
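One common way to score term extraction against gold labels is micro-averaged precision, recall, and F1 over the sets of terms per sentence. The sketch below uses made-up gold and predicted lists and a hypothetical `micro_prf` helper; with real data, the gold terms would come from the annotations in the labeled dataset.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F1 over per-sentence term sets."""
    tp = fp = fn = 0
    for gold_terms, predicted_terms in zip(gold, predicted):
        gold_set, predicted_set = set(gold_terms), set(predicted_terms)
        tp += len(gold_set & predicted_set)   # correctly extracted terms
        fp += len(predicted_set - gold_set)   # extracted but not annotated
        fn += len(gold_set - predicted_set)   # annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up example: three sentences with gold and predicted aspect terms.
gold = [["food", "service"], ["ambiance"], []]
predicted = [["food", "great food"], ["ambiance", "hassle"], ["years"]]

precision, recall, f1 = micro_prf(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For the polarity and category tasks, plain accuracy over the labeled items (or per-class F1 when the classes are imbalanced) can be computed in the same spirit.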