# Aspect Based Sentiment Analysis NLP Assignment

_N.B. Entire notebook wall time: 3min 30s_

# Part 1
_Perform aspect based sentiment analysis at the sentence level_

1) __Parse xml training data:__
- Import necessary packages
- Parse into pandas data frame to be able to work with review data at the sentence level

In [1]:
# Import necessary packages
import pandas as pd
import xml.etree.ElementTree as ET
import nltk
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
import spacy
import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import warnings
warnings.simplefilter("ignore")
nlp = spacy.load('en_core_web_lg')

In [2]:
def xml_to_df(filename):
    # Parse xml file into pandas data frame to work with at the sentence level
    tree = ET.parse(filename)
    root = tree.getroot()
    reviews = root.findall('Review')
    df_columns = ['rid','id','text','category','predicted_category','polarity','predicted_polarity']
    reviews_df = pd.DataFrame(columns=df_columns)

    for review in reviews:
        rid = review.get('rid')
        sentences = review.findall('sentences/sentence')
        for sentence in sentences:
            id = sentence.get('id')
            text = sentence.find('text').text
            opinions = sentence.findall('Opinions/Opinion')
            for opinion in opinions:
                category = opinion.get('category')
                polarity = opinion.get('polarity')
                predicted_category = ''
                predicted_polarity = ''
                reviews_df = pd.concat([reviews_df, 
                            pd.DataFrame([[rid, id, text, category, 
                            predicted_category, polarity,predicted_polarity]],
                            columns=df_columns)], ignore_index=True)
    
    return reviews_df

In [3]:
train_reviews_df = xml_to_df(filename='Laptops_Train_p1.xml')
train_reviews_df.head()

Unnamed: 0,rid,id,text,category,predicted_category,polarity,predicted_polarity
0,79,79:1,This computer is absolutely AMAZING!!!,LAPTOP#GENERAL,,positive,
1,79,79:2,10 plus hours of battery...,BATTERY#OPERATION_PERFORMANCE,,positive,
2,79,79:3,super fast processor and really nice graphics ...,CPU#OPERATION_PERFORMANCE,,positive,
3,79,79:3,super fast processor and really nice graphics ...,GRAPHICS#GENERAL,,positive,
4,79,79:4,and plenty of storage with 250 gb(though I wil...,HARD_DISC#DESIGN_FEATURES,,positive,


2) __Text processing__
- Create copy of original data frame, populate with processed text
- tokenise words
- remove stop words
- remove punctuation
- stem words

In [4]:
# Creating a copy of original data frame to populate with processed text
processed_df = train_reviews_df.copy()

In [5]:
def tokenise(text):
    # Tokenising sentences
    tokenised_text = [word_tokenize(sentence.lower()) for sentence in text]
    return tokenised_text

def remove_stopwords(tokenised_text):
    # Remove stop words
    tokens = []
    for token in tokenised_text:
        if token not in stopwords.words('english'):
            tokens.append(token)
    return tokens

def remove_non_alpha(tokenised_text):
    # Remove punctuation
    alpha_tokens = []
    for token in tokenised_text:
        if token.isalpha():
            alpha_tokens.append(token)
    return alpha_tokens

def stem(tokenised_text):
    # Stem tokenised text
    snow_stemmer = SnowballStemmer(language='english')
    stem_tokens = []
    for token in tokenised_text:
        stem_tokens.append(snow_stemmer.stem(token))
  
    stemmed_text = " ".join(stem_tokens)
    return stemmed_text

def preprocess(tokenised_text):
  # output processed text
  pp_data = []
  for sentence in tokenised_text:
    pp_text = remove_stopwords(sentence)
    pp_text = remove_non_alpha(pp_text)
    pp_text = stem(pp_text)
    pp_data.append(pp_text)
  return pp_data

In [6]:
processed_df['text'] = preprocess(tokenise(train_reviews_df['text']))
processed_df = processed_df.rename(columns={"text": "processed_text"})
processed_df.head()

Unnamed: 0,rid,id,processed_text,category,predicted_category,polarity,predicted_polarity
0,79,79:1,comput absolut amaz,LAPTOP#GENERAL,,positive,
1,79,79:2,plus hour batteri,BATTERY#OPERATION_PERFORMANCE,,positive,
2,79,79:3,super fast processor realli nice graphic card,CPU#OPERATION_PERFORMANCE,,positive,
3,79,79:3,super fast processor realli nice graphic card,GRAPHICS#GENERAL,,positive,
4,79,79:4,plenti storag gb though upgrad ram,HARD_DISC#DESIGN_FEATURES,,positive,


3) __Identify entity-attribute (E#A) pairs__
- Topic classification task: ___Logistic Regression___
    - Logistic Regression performs much better on training data category classification than Naive Bayes
    - Logistic Regression ($83$% accuracy) vs. Naive Bayes ($67$% accuracy)
- Identify the entity for each sentence
- Identify the attribute for each sentence
- Combine the Entity and Attribute predictions to produce E#A pair
- Sentences with multiple categories identified and the probabilities for each label computed to inform predictions
- Test LR model predictions on training data with overall accuracy score
    - Classification report produced for test data
- Output predictions in new prediction column in dataframe

- Set of functions below used throughout section 3. Functions used again when making predictions and reports on test data

In [7]:
def e_a_predict(features, X_train_counts, model, threshold):
    # Returns the next most likely topic if a given sentence has multiple categories
    # if probability for next most likely topic is > threshold, else return most likely topic.

    N = len(features)
    tmp = 0
    repeat = 0
    predictions = np.zeros(N)
    for i in range(N):
        if features[i] != tmp:
            predictions[i] = model.predict(X_train_counts[i,:])
            repeat = 0
        else:
            repeat += 1
            arr = model.predict_proba(X_train_counts[i,:])
            sorted_index = np.argsort(arr)[0]
            if arr[0][sorted_index[-repeat-1]] > threshold/repeat:
                predictions[i] = float(sorted_index[-repeat-1])
            else:
                predictions[i] = model.predict(X_train_counts[i,:])
            # handle part 2 error where >8 opinions for single review gives index error
            if repeat > 7:
                repeat -= 1
        tmp = features[i]
    
    return predictions

def sorted_predictions(df):
    # Aligns predictions with matching labels for sentences that have multiple opinions.
    # e.g., ground truth for sentence id=1: LAPTOP#GENERAL, LAPTOP#BATTERY_PERFORMANCE
    # predictions for sentence id=1 pre-alignment: LAPTOP#BATTERY_PERFORMANCE, LAPTOP#GENERAL
    # predictions post-alignment: LAPTOP#GENERAL, LAPTOP#BATTERY_PERFORMANCE

    N = len(df['category'])
    labels_dict = {}
    predictions_dict = {}
    sorted_predictions = []
    i = 0
    tmp = 0

    for id in df['id'].unique():
        labels_dict[id] = list(df.query(f"id == '{id}'")['category'])
        predictions_dict[id] = list(df.query(f"id == '{id}'")['predicted_category'])

    for key in labels_dict.keys():
        for value in labels_dict[key]:
            if value in predictions_dict[key]:

                idx = labels_dict[key].index(value) # obtain label index for matching label & prediction for given sentence id
                idx2 = predictions_dict[key].index(value) # obtain prediction index for matching label & prediction for given sentence id

                tmp = predictions_dict[key][idx]

                predictions_dict[key][idx] = value # re-order predictions so that they align with ground truth for given id
                predictions_dict[key][idx2] = tmp # swap changed values in predictions list for sentence id
    
    for values in predictions_dict.values():
        for value in values:
            sorted_predictions.append(value)

    return list(sorted_predictions)

def numerical_entity_attributes(df):
    # Numerical representation of predicted categories is required for accuracy
    # and classification report. Function converts category predictions to
    # numerical representation

    category, predicted_category = df['category'], df['predicted_category']

    e_a_list = category.unique().tolist()
    i = 0
    e_a_label_dict = {}
    for e_a in e_a_list:
        e_a_label_dict[e_a] = i
        i += 1

    e_a_labels = []
    for i in range(len(category)):
        e_a_labels.append(e_a_label_dict[category[i]])

    e_a_predictions = []
    for i in range(len(category)):
        if predicted_category[i] in e_a_list:
            e_a_predictions.append(e_a_label_dict[predicted_category[i]])
        else:
            e_a_predictions.append(99) # predicted E#A pair not in ground truth set of E#A pairs

    e_a_labels = np.array(e_a_labels, dtype=int)
    e_a_predictions = np.array(e_a_predictions, dtype=int)
    e_a_list.append('N/A')

    return e_a_labels, e_a_predictions, e_a_list, e_a_label_dict

def reverse_dict(my_dict):
    # Reverses the keys and values in a dictionary, useful for several later steps
    reversed_dict = {}
    for key, value in my_dict.items(): 
        reversed_dict[value] = key 
    return reversed_dict

def numerical_labels(df, label_name, new_column_name=str()):
    # Convert ground truth word labels for entities or attributes to numerical representation
    labels_list = df[label_name].unique().tolist()
    df[new_column_name] = ''
    i = 0
    labels_dict = {} # Will use this to convert numerical class predictions back to attributes
    for label in labels_list:
        df.loc[df[label_name] == label, new_column_name] = i
        labels_dict[i] = label
        i += 1
    
    return labels_list, labels_dict

def convert_numerical_predictions(num_predictions, label_dict):
    # Convert numerical predictions back to words
    num_predictions = num_predictions.tolist()
    word_predictions = []

    for pred in num_predictions:
        word_predictions.append(label_dict[pred])
    
    return word_predictions

def combine_entity_attributes(df, entity_pred, attrib_pred):
    # Combine Entity and Attribute predictions to create E#A pair predictions
    # Sort E#A pair category predictions
    combined_list = []
    for i in range(len(df['category'])):
        e_a_pair = entity_pred[i] + "#" + attrib_pred[i]
        combined_list.append(e_a_pair)
    
    return combined_list

def overall_accuracy_absa(df):
    # Returns an overall accuracy score as a % of the instances where both the
    # category predictions AND the sentiment predicitons were correct.
    N = len(df['category'])
    correct_count = 0
    for i in range(N):
        if df['category'][i] == df['predicted_category'][i] and df['polarity'][i] == df['predicted_polarity'][i]:
            correct_count += 1
    
    accuracy = (correct_count/N)*100
    output = f"Overall accuracy for correct category and sentiment predictions: {accuracy:.0f}%"

    return output

- Creating training features for entity and attribute classifiers

In [8]:
X_train = processed_df['processed_text']
# Using counts to extract features from text
count_vectorizer = CountVectorizer()
X_train_counts = count_vectorizer.fit_transform(X_train)

- Training topic classifier first for entities

In [9]:
# Separate entities from categories and add to new column
processed_df['entity'] = [entity.split('#')[0] for entity in processed_df['category']]

In [10]:
# label entities
entities_list, entity_label_dict = numerical_labels(processed_df, 'entity', 'entity_label')

In [11]:
# Training set for entity classification
Y_entity_train = processed_df['entity_label']
Y_entity_train = np.array(Y_entity_train, dtype=int)

# Train the entity classification Logistic Regression model
entity_count_lr = LogisticRegression()
entity_count_lr.fit(X_train_counts, Y_entity_train)


In [12]:
# Testing entity prediction on training data... overfitting likely
Y_entity_train_pred = e_a_predict(X_train,X_train_counts,model=entity_count_lr,threshold=0.1) # threshold hyperparameter tuned on training data
entity_train_accuracy = accuracy_score(Y_entity_train, Y_entity_train_pred)

print(f"Training set entity predicitons accuracy score: {entity_train_accuracy*100:.0f}%")

Training set entity predicitons accuracy score: 84%


In [13]:
# Convert entity predictions from numbers back to words
entity_predictions = convert_numerical_predictions(Y_entity_train_pred, entity_label_dict)

- Training topic classifier for attributes

In [14]:
# Separate attributes from categories and add to new column
processed_df['attribute'] = [attribute.split('#')[1] for attribute in processed_df['category']]

In [15]:
# label attributes
attributes_list, attribute_label_dict = numerical_labels(processed_df, 'attribute', 'attribute_label')

In [16]:
# Training set for attribute classification
Y_attrib_train = processed_df['attribute_label']
Y_attrib_train = np.array(Y_attrib_train, dtype=int)

# Train the attribute classification Logistic Regression model
attrib_count_lr = LogisticRegression()
attrib_count_lr.fit(X_train_counts, Y_attrib_train)

In [17]:
# Testing attribute prediction on training data... overfitting likely
Y_attrib_train_pred = e_a_predict(X_train,X_train_counts,model=attrib_count_lr,threshold=0.1) # threshold hyperparameter tuned on training data
attrib_train_accuracy = accuracy_score(Y_attrib_train, Y_attrib_train_pred)

print(f"Training set attribute predictions accuracy score: {attrib_train_accuracy*100:.0f}%")

Training set attribute predictions accuracy score: 71%


In [18]:
# Convert attribute predictions from numbers back to words
attribute_predictions = convert_numerical_predictions(Y_attrib_train_pred, attribute_label_dict)

In [19]:
# Combine Entity and Attribute predictions to create E#A pair predictions
processed_df['predicted_category'] = combine_entity_attributes(processed_df, entity_predictions, attribute_predictions)

# Sort Entity and Attribute predictions for more representative accuracy score (see example in function comments)
processed_df['predicted_category'] = sorted_predictions(processed_df)
train_reviews_df['predicted_category'] = sorted_predictions(processed_df)

In [20]:
e_a_labels, e_a_predictions, e_a_list, e_a_label_dict  = numerical_entity_attributes(processed_df)
category_train_accuracy = accuracy_score(e_a_labels, e_a_predictions)

print(f"Training set category predictions accuracy score: {category_train_accuracy*100:.0f}%")

Training set category predictions accuracy score: 83%


- Now that Entity and Attribute classifiers have been trained and the predictions have been converted back to words, the dataframes will be populated with the predicted E#A pairs for the training set.
- Combined entity and attribute pair predictions on the training data:
    - E#A accuracy $=83$%

In [21]:
# Reorder columns
reorder_columns = ['rid','id','processed_text','entity','entity_label',
                  'attribute','attribute_label','category','predicted_category',
                  'polarity','predicted_polarity']
processed_df = processed_df.reindex(columns=reorder_columns)
processed_df.head()

Unnamed: 0,rid,id,processed_text,entity,entity_label,attribute,attribute_label,category,predicted_category,polarity,predicted_polarity
0,79,79:1,comput absolut amaz,LAPTOP,0,GENERAL,0,LAPTOP#GENERAL,LAPTOP#GENERAL,positive,
1,79,79:2,plus hour batteri,BATTERY,1,OPERATION_PERFORMANCE,1,BATTERY#OPERATION_PERFORMANCE,BATTERY#OPERATION_PERFORMANCE,positive,
2,79,79:3,super fast processor realli nice graphic card,CPU,2,OPERATION_PERFORMANCE,1,CPU#OPERATION_PERFORMANCE,CPU#OPERATION_PERFORMANCE,positive,
3,79,79:3,super fast processor realli nice graphic card,GRAPHICS,3,GENERAL,0,GRAPHICS#GENERAL,GRAPHICS#GENERAL,positive,
4,79,79:4,plenti storag gb though upgrad ram,HARD_DISC,4,DESIGN_FEATURES,2,HARD_DISC#DESIGN_FEATURES,HARD_DISC#DESIGN_FEATURES,positive,


In [22]:
train_reviews_df.head()

Unnamed: 0,rid,id,text,category,predicted_category,polarity,predicted_polarity
0,79,79:1,This computer is absolutely AMAZING!!!,LAPTOP#GENERAL,LAPTOP#GENERAL,positive,
1,79,79:2,10 plus hours of battery...,BATTERY#OPERATION_PERFORMANCE,BATTERY#OPERATION_PERFORMANCE,positive,
2,79,79:3,super fast processor and really nice graphics ...,CPU#OPERATION_PERFORMANCE,CPU#OPERATION_PERFORMANCE,positive,
3,79,79:3,super fast processor and really nice graphics ...,GRAPHICS#GENERAL,GRAPHICS#GENERAL,positive,
4,79,79:4,and plenty of storage with 250 gb(though I wil...,HARD_DISC#DESIGN_FEATURES,HARD_DISC#DESIGN_FEATURES,positive,


4) __Perform sentiment analysis on the sentence for each identified E#A pair__
- Sentiment classification task: __Logistic Regression__
    - Logistic Regression selected for sentiment classification. Logistic Regression model performed marginally better than Naive Bayes on the training data ($84$% vs $83$%).
- Output predictions in prediction column in dataframe
- Test model predictions on training data with overall accuracy score
- multiclass classification: (positive, neutral or negative)

- Set of functions below used throughout section 4. Functions used again when making predictions and reports on test data

In [23]:
def split_input_text(text):
    # Split input sentence/review by by coordinating conjunctions e.g., ['and','but','because']
    # If no CC in sentence, then split by punctuation
    # Each element (part of sentence) can then be assigned to different categories for the same sentence id, 
    # based on cosine similarity measure

    split_input = []
    for sentence in text:
        sentence = sentence.lower()
        sent_tag = nltk.pos_tag(nltk.word_tokenize(sentence))

        split_words = []
        pos_tags = ['CC']
        for elem in sent_tag:
            if elem[1] in pos_tags:
                split_words.append(elem[0])
        
        punctuation = [',',';','-']
        cc_in_flag = False
        for elem in sent_tag:
            if elem[1] in pos_tags:
                cc_in_flag = True
        
        if not cc_in_flag:
            for elem in sent_tag:
                if elem[1] in punctuation:
                    split_words.append(elem[0])

        result = []
        if split_words:
            for index, word in enumerate(split_words):
                result.append(sentence.split(split_words[index])[0])
                if len(sentence.split(split_words[index]))>1:
                    sentence = sentence.split(split_words[index])[1]
            result.append(sentence)
        else:
            result.append(sentence)

        result = [item.strip() for item in result]
        result = [item for item in result if item]
        split_input.append(result)
        
    return split_input

def filter_sentiment_input(text, cat):
    # Keep only most relevant element in sentence list to determine polarity for category
    # Measuring similarity between nouns in element of sentence and category (E#A) pair
    # Less expensive (more efficient) similarity calculation using only nouns rather than all words in part of sentence
    reduced_text = []
    for index, row in enumerate(text):
        n = len(row)
        tmp_dict = {}
        for i in range(n):
            cat_list = re.split('_|#', cat.loc[index])
            cat_list = [item.title() if item != 'OS' else item for item in cat_list]
            category = nlp(' '.join(cat_list))

            sent_tag = nltk.pos_tag(nltk.word_tokenize(row[i].lower()))
            noun_pos = ['NN','NNS','NNP']
            noun_list = [elem[0] for elem in sent_tag if elem[1] in noun_pos]
            noun_text = nlp(' '.join(noun_list))

            tmp_dict[row[i]] = category.similarity(noun_text)

        reduced_text.append(max(tmp_dict, key = tmp_dict.get))

    return reduced_text

In [24]:
# label sentiments
sentiments_list, polarity_label_dict = numerical_labels(processed_df, 'polarity', 'polarity_label')

In [25]:
# Training set for sentiment classification
Y_sentiment_train = processed_df['polarity_label']
Y_sentiment_train = np.array(Y_sentiment_train, dtype=int)

# Train the attribute classification Naive Bayes model
sentiment_count_lr = LogisticRegression()
sentiment_count_lr.fit(X_train_counts, Y_sentiment_train)

- Creating new dataframe to assist with sentiment predictions for each category:
    - Split sentences on coordinating conjunctions (CC) e.g., ["and", "but", ...]
    - Each category that needs sentiment predicting now has a list of strings as possible input features to determine sentiment
    - Instead of using entire sentence to predict polarity for a specific category, use spacy Cosine similarity function to determine which elements of list (parts of sentence) to keep to determine sentiment using already trained model. 
    - If sentence has no coordinating conjunctions, then split by punctuation in the sentence.

In [26]:
sentiment_processed_df = train_reviews_df.copy()
sentiment_processed_df['text'] = split_input_text(train_reviews_df['text'])

In [27]:
# N.B. cell takes ~30secs to run
sentiment_processed_df['text'] = filter_sentiment_input(sentiment_processed_df['text'], sentiment_processed_df['predicted_category'])

- Can see below the effect of filtering the sentence to only keep the part deemed most similar to the predicted category for aspect based sentiment classification
- For same sentence id 273:9, only using part of the sentence to determine sentiment for specific predicted category
    - Original sentence: _"It's more expensive but well worth it in the long run."_

In [28]:
sentiment_processed_df.query("id == '273:9'")

Unnamed: 0,rid,id,text,category,predicted_category,polarity,predicted_polarity
85,273,273:9,it's more expensive,LAPTOP#PRICE,LAPTOP#PRICE,negative,
86,273,273:9,well worth it in the long run.,LAPTOP#GENERAL,LAPTOP#GENERAL,positive,


In [29]:
# input for sentiment prediction on training set
sentiment_processed_df['processed_text'] = preprocess(tokenise(sentiment_processed_df['text']))

X_sentiment_input = sentiment_processed_df['processed_text']
X_sentiment_input = count_vectorizer.transform(X_sentiment_input)

In [30]:
# Testing sentiment prediction on training data... overfitting likely
Y_sentiment_train_pred = sentiment_count_lr.predict(X_sentiment_input)

sentiment_train_accuracy = accuracy_score(Y_sentiment_train, Y_sentiment_train_pred)
print(f"Training set sentiment prediction accuracy score: {sentiment_train_accuracy*100:.0f}%")

Training set sentiment prediction accuracy score: 84%


In [31]:
cr = classification_report(Y_sentiment_train, Y_sentiment_train_pred, target_names=sentiments_list)
# print(cr)
# Note that only 188 neutral training examples out of 2909... will be difficult to learn properties of neutral input features

In [32]:
# Convert sentiment predictions from numbers back to words
sentiment_predictions = convert_numerical_predictions(Y_sentiment_train_pred, polarity_label_dict)

In [33]:
processed_df['predicted_polarity'] = sentiment_predictions
reorder_columns = ['rid','id','processed_text','entity','entity_label',
                  'attribute','attribute_label','category','predicted_category',
                  'polarity','polarity_label','predicted_polarity']
processed_df = processed_df.reindex(columns=reorder_columns)
processed_df.head()

Unnamed: 0,rid,id,processed_text,entity,entity_label,attribute,attribute_label,category,predicted_category,polarity,polarity_label,predicted_polarity
0,79,79:1,comput absolut amaz,LAPTOP,0,GENERAL,0,LAPTOP#GENERAL,LAPTOP#GENERAL,positive,0,positive
1,79,79:2,plus hour batteri,BATTERY,1,OPERATION_PERFORMANCE,1,BATTERY#OPERATION_PERFORMANCE,BATTERY#OPERATION_PERFORMANCE,positive,0,negative
2,79,79:3,super fast processor realli nice graphic card,CPU,2,OPERATION_PERFORMANCE,1,CPU#OPERATION_PERFORMANCE,CPU#OPERATION_PERFORMANCE,positive,0,positive
3,79,79:3,super fast processor realli nice graphic card,GRAPHICS,3,GENERAL,0,GRAPHICS#GENERAL,GRAPHICS#GENERAL,positive,0,positive
4,79,79:4,plenti storag gb though upgrad ram,HARD_DISC,4,DESIGN_FEATURES,2,HARD_DISC#DESIGN_FEATURES,HARD_DISC#DESIGN_FEATURES,positive,0,positive


In [34]:
train_reviews_df['predicted_polarity'] = sentiment_predictions
train_reviews_df.head()

Unnamed: 0,rid,id,text,category,predicted_category,polarity,predicted_polarity
0,79,79:1,This computer is absolutely AMAZING!!!,LAPTOP#GENERAL,LAPTOP#GENERAL,positive,positive
1,79,79:2,10 plus hours of battery...,BATTERY#OPERATION_PERFORMANCE,BATTERY#OPERATION_PERFORMANCE,positive,negative
2,79,79:3,super fast processor and really nice graphics ...,CPU#OPERATION_PERFORMANCE,CPU#OPERATION_PERFORMANCE,positive,positive
3,79,79:3,super fast processor and really nice graphics ...,GRAPHICS#GENERAL,GRAPHICS#GENERAL,positive,positive
4,79,79:4,and plenty of storage with 250 gb(though I wil...,HARD_DISC#DESIGN_FEATURES,HARD_DISC#DESIGN_FEATURES,positive,positive


5) __Evaluate accuracy__
    - Measure E#A pair prediction and sentiment prediction test set accuracy
    - Using classification report with accuracy measures discussed in Week 5 lecture slides

- Classification report for E#A predictions on test set data

In [35]:
test_reviews_df = xml_to_df(filename='Laptops_Test_p1_gold.xml')
test_processed_df = test_reviews_df.copy()
test_reviews_df.head(3)

Unnamed: 0,rid,id,text,category,predicted_category,polarity,predicted_polarity
0,B0074703CM_108_ANONYMOUS,B0074703CM_108_ANONYMOUS:0,"Well, my first apple computer and I am impressed.",LAPTOP#GENERAL,,positive,
1,B0074703CM_108_ANONYMOUS,B0074703CM_108_ANONYMOUS:1,"Works well, fast and no reboots.",LAPTOP#OPERATION_PERFORMANCE,,positive,
2,B0074703CM_108_ANONYMOUS,B0074703CM_108_ANONYMOUS:4,Glad I did so far.,COMPANY#GENERAL,,positive,


- Pre-processing test set text

In [36]:
# Pre-process test set text
test_processed_df['text'] = preprocess(tokenise(test_reviews_df['text']))

# Obtain test set input features
X_test = test_processed_df['text']
X_test_counts = count_vectorizer.transform(X_test)

- Make Entity predictions using trained model

In [37]:
# Predict test set entities
Y_entity_test_pred = e_a_predict(X_test,X_test_counts,model=entity_count_lr,threshold=0.1) # threshold hyperparameter tuned on training data

# Convert entity predictions back to words
test_entity_predictions = convert_numerical_predictions(Y_entity_test_pred, entity_label_dict)

- Make Attribute predictions using trained model

In [38]:
# Predict test set attributes
Y_attrib_test_pred = e_a_predict(X_test,X_test_counts,model=attrib_count_lr,threshold=0.1) # threshold hyperparameter tuned on training data

# Convert attribute predictions back to words
test_attribute_predictions = convert_numerical_predictions(Y_attrib_test_pred, attribute_label_dict)

In [39]:
# Combine Entity and Attribute predictions to create E#A pair predictions
test_processed_df['predicted_category'] = combine_entity_attributes(test_processed_df, test_entity_predictions, test_attribute_predictions)

# Sort Entity and Attribute predictions for more representative accuracy score (see example in function comments)
test_processed_df['predicted_category'] = sorted_predictions(test_processed_df)
test_reviews_df['predicted_category'] = sorted_predictions(test_processed_df)

- Produce Classification report for E#A pair predictions on test set

In [40]:
e_a_labels, e_a_predictions, e_a_list, e_a_label_dict = numerical_entity_attributes(test_processed_df)

# Note target_names=e_a_list displays E#A categories as text form rather than numerical representation
cr = classification_report(e_a_labels, e_a_predictions, target_names=e_a_list)
print("Predicted E#A Category Classification Report (Test Set):\n")
print(cr)

Predicted E#A Category Classification Report (Test Set):

                                    precision    recall  f1-score   support

                    LAPTOP#GENERAL       0.49      0.84      0.62       158
      LAPTOP#OPERATION_PERFORMANCE       0.54      0.67      0.60        70
                   COMPANY#GENERAL       0.83      0.13      0.23        38
                  LAPTOP#USABILITY       0.39      0.33      0.36        46
              LAPTOP#MISCELLANEOUS       0.36      0.26      0.31        34
            LAPTOP#DESIGN_FEATURES       0.59      0.55      0.57        73
                LAPTOP#PORTABILITY       0.67      0.33      0.44         6
     BATTERY#OPERATION_PERFORMANCE       0.68      0.68      0.68        19
                      LAPTOP#PRICE       0.81      0.52      0.63        25
         HARD_DISC#DESIGN_FEATURES       0.50      0.15      0.24        13
                   DISPLAY#QUALITY       0.55      0.30      0.39        20
                    LAPTOP#QU

- Classification report for sentiment predictions on test data

In [41]:
# Reversing the polarity label dict to faciliatate test set polarity conversion to numerical labels.
reversed_polarity_label_dict = reverse_dict(polarity_label_dict)

Y_sentiment_test_labels = [reversed_polarity_label_dict[test_processed_df['polarity'][i]] for i in range(len(test_processed_df['polarity']))]

In [42]:
# Preparing filtered input for aspect based sentiment predictions on test set
test_sentiment_prep_df = test_reviews_df.copy()
test_sentiment_prep_df['text'] = split_input_text(test_reviews_df['text'])

test_sentiment_prep_df['processed_text'] = preprocess(tokenise(filter_sentiment_input(test_sentiment_prep_df['text'], test_sentiment_prep_df['predicted_category'])))
X_test_sentiment_input = count_vectorizer.transform(test_sentiment_prep_df['processed_text'])

In [43]:
# classifcation report for sentiment precitions on test data
Y_sentiment_test_pred = sentiment_count_lr.predict(X_test_sentiment_input)
Y_sentiment_test_labels = np.array([Y_sentiment_test_labels]).reshape(801,)

cr = classification_report(Y_sentiment_test_labels, Y_sentiment_test_pred, target_names=sentiments_list)
print("Predicted Sentiment Classification Report (Test Set):\n")
print(cr)

Predicted Sentiment Classification Report (Test Set):

              precision    recall  f1-score   support

    positive       0.72      0.75      0.74       481
    negative       0.54      0.57      0.55       274
     neutral       0.20      0.04      0.07        46

    accuracy                           0.65       801
   macro avg       0.49      0.46      0.45       801
weighted avg       0.63      0.65      0.64       801



In [44]:
# Convert sentiment predictions to words and add to test test reviews data frame
test_processed_df['predicted_polarity'] = convert_numerical_predictions(Y_sentiment_test_pred, polarity_label_dict)
test_reviews_df['predicted_polarity'] = convert_numerical_predictions(Y_sentiment_test_pred, polarity_label_dict)

test_reviews_df[['text','category','predicted_category','polarity','predicted_polarity']].head()

Unnamed: 0,text,category,predicted_category,polarity,predicted_polarity
0,"Well, my first apple computer and I am impressed.",LAPTOP#GENERAL,LAPTOP#GENERAL,positive,positive
1,"Works well, fast and no reboots.",LAPTOP#OPERATION_PERFORMANCE,LAPTOP#OPERATION_PERFORMANCE,positive,positive
2,Glad I did so far.,COMPANY#GENERAL,LAPTOP#GENERAL,positive,positive
3,Glad I did so far.,LAPTOP#GENERAL,LAPTOP#GENERAL,positive,positive
4,s.... L .... o..... w.... rea......llllyy slow.,LAPTOP#OPERATION_PERFORMANCE,LAPTOP#OPERATION_PERFORMANCE,negative,negative


In [45]:
# View the % of predictions where both the category AND polarity were predicted correct
combined_accuracy = overall_accuracy_absa(test_reviews_df)
print("For Part 1 test data:")
print(combined_accuracy)

For Part 1 test data:
Overall accuracy for correct category and sentiment predictions: 32%


#### __Evaluation (Part 1)__
__Results:__
- Category predictions test set accuracy score of $46$% is $37$% lower than the E#A category predictions on the training data ($83$%) but much better than random predictions that would yield an expected accuracy score of $1.5$%.
    - The category prediction uses two separate Logistic Regression classifiers, one for entity and another for attribute predictions. The predictions are then combined. As a consequence of this, category prediction suffers from compounding errors.
- The sentiment Logistic Regression classifier achieves an overall test accuracy of $65$%, compared to the training accuracy of $84$%.
    - Positive review predictions perform well on the test set with precision, recall and f1-score all above $70$% for positive-sentiment examples.
    - Negative review predictions perform better than random guess ($33$%) but are less accurate than the positive review predictions
    - The performance for identifying neutral sentence reviews is particularly low, this is most likely due to the small number of neutral training set examples (188 out of 2909) compared to the number of positive and negative examples. Insufficient neutral training examples to learn predictive neutral review features using current method.

__Time Efficiency:__

_N.B. time efficiency results produced with macbook air M1 (2020), 8GB RAM._
- Time taken to process, predict and produce classification reports for test data: $13.1s$
- The most time expensive part of the process of predicting the aspect based sentiment predictions is the _filter_sentiment_input()_ function.
    - This function uses the _spacy cosine similarity_ measure to compute the similarity between two documents. 
    - Upon investigation, it is specifically this calculation that is contributing to more than 50% of the time taken to process and predict the test data.
- Therefore, a simple future improvement that could help significantly improve the time efficiency of the algorithm is to find a suitable working replacement for the _spacy similarity_ function used within the _filter_sentiment_input()_ function.

__Future Improvements:__
- The classification for both categories and sentiments could possibly be improved by identifying more information about the sentence structure via part of speech tagging. This may provide a method to produce more predictive input features for category classification. 
    - Additionally, relevant parts of sentence for category-based sentiment classification could be more efficiently extracted by considering wider sentence structure.
- Further improving the quality of input features (e.g., by using word embeddings) may enable a more 'complex' model such as a Recurrent Neural Network (RNN) to generalise better to unseen data and produce better test set results than those that have been achieved with Logistic Regression.
- Finally, a larger amount of labelled training data (especidally neutral sentiment examples) would likely lead to training more accurate models that perform better on unseen data.

# Part 2
_Perform aspect based sentiment analysis at the review level; expand multiclass sentiment classification to identify conflicting reviews; classify overall sentiment for LAPTOP#GENERAL category if not already identified._

1) __Parse xml training data:__
- Parse into pandas data frame to be able to work with it at the sentence level
- Text column contains entire review text for each E#A pair

In [46]:
# Parse xml file into pandas data frame to work with at the review level
def part2_xml_to_df(filename):
    tree = ET.parse(filename)
    root = tree.getroot()
    reviews = root.findall('Review')
    df_columns = ['rid','text','category','predicted_category','polarity','predicted_polarity']
    reviews_df = pd.DataFrame(columns=df_columns)

    for review in reviews:
        rid = review.get('rid')
        sentences = review.findall('sentences/sentence')
        text = ''
        for sentence in sentences:
            text += sentence.find('text').text + ' '
        opinions = review.findall('Opinions/Opinion')
        for opinion in opinions:
            category = opinion.get('category')
            polarity = opinion.get('polarity')
            predicted_category = ''
            predicted_polarity = ''
            reviews_df = pd.concat([reviews_df, 
                        pd.DataFrame([[rid, text, category, predicted_category, polarity, predicted_polarity]], 
                        columns=df_columns)], ignore_index=True)
    
    return reviews_df

In [47]:
train_reviews_df_p2 = part2_xml_to_df(filename='Laptops_Train_p2.xml')
train_reviews_df_p2.head()

Unnamed: 0,rid,text,category,predicted_category,polarity,predicted_polarity
0,348,Most everything is fine with this machine: spe...,LAPTOP#GENERAL,,positive,
1,348,Most everything is fine with this machine: spe...,LAPTOP#OPERATION_PERFORMANCE,,positive,
2,348,Most everything is fine with this machine: spe...,HARD_DISC#DESIGN_FEATURES,,positive,
3,348,Most everything is fine with this machine: spe...,LAPTOP#QUALITY,,positive,
4,348,Most everything is fine with this machine: spe...,DISPLAY#QUALITY,,negative,


2) __Processing__
- Create copy of original data frame, to work with to determine classifications
- Preprocess review text

In [48]:
processed_df_p2 = train_reviews_df_p2.copy()
processed_df_p2['text'] = preprocess(tokenise(train_reviews_df_p2['text']))
processed_df_p2 = processed_df_p2.rename(columns={"text": "processed_text"})
processed_df_p2.head()

Unnamed: 0,rid,processed_text,category,predicted_category,polarity,predicted_polarity
0,348,everyth fine machin speed capac build thing un...,LAPTOP#GENERAL,,positive,
1,348,everyth fine machin speed capac build thing un...,LAPTOP#OPERATION_PERFORMANCE,,positive,
2,348,everyth fine machin speed capac build thing un...,HARD_DISC#DESIGN_FEATURES,,positive,
3,348,everyth fine machin speed capac build thing un...,LAPTOP#QUALITY,,positive,
4,348,everyth fine machin speed capac build thing un...,DISPLAY#QUALITY,,negative,


3) __Identify entity-attribute (E#A) pairs__
- Part 2 training and test sets contain the same sentences and reviews from Part 1, difference being that category and polarity are annotated at the review level.
- Therefore, can use the identified E#A categories for each review in part 1 for part 2. With some added functionality:
    - If LAPTOP#GENERAL not predicted for a given review, include LAPTOP#GENERAL as category
    - If the same category is predicted more than once for the same review, remove the duplicates and predict the next most likely category using spacy cosine similarity measure.
- Output predictions in new prediction column in dataframe
- Determine training accuracy for category predictions
- Classification report will be produced for test data

- Defining functions used to obtain category predictions in Part 2

In [49]:
def sorted_predictions_p2(df):
    # Updated for Part 2 to query unique review id isntead of sentence id
    # Aligns predictions with matching labels for sentences that have multiple opinions.
    # e.g., ground truth for sentence id=1: LAPTOP#GENERAL, LAPTOP#BATTERY_PERFORMANCE
    # predictions for sentence id=1 pre-alignment: LAPTOP#BATTERY_PERFORMANCE, LAPTOP#GENERAL
    # predictions post-alignment: LAPTOP#GENERAL, LAPTOP#BATTERY_PERFORMANCE

    N = len(df['category'])
    labels_dict = {}
    predictions_dict = {}
    sorted_predictions = []
    i = 0
    tmp = 0

    for rid in df['rid'].unique():
        labels_dict[rid] = list(df.query(f"rid == '{rid}'")['category'])
        predictions_dict[rid] = list(df.query(f"rid == '{rid}'")['predicted_category'])

    for key in labels_dict.keys():
        for value in labels_dict[key]:
            if value in predictions_dict[key]:

                idx = labels_dict[key].index(value) # obtain label index for matching label & prediction for given sentence id
                idx2 = predictions_dict[key].index(value) # obtain prediction index for matching label & prediction for given sentence id

                tmp = predictions_dict[key][idx]

                predictions_dict[key][idx] = value # re-order predictions so that they align with ground truth for given id
                predictions_dict[key][idx2] = tmp # swap changed values in predictions list for sentence id
    
    for values in predictions_dict.values():
        for value in values:
            sorted_predictions.append(value)

    return list(sorted_predictions)

def identify_p2_categories(df_p1, df_p2):
    # Identify p2 categories given category predictions for sentences with 
    # matching review id from part 1
    # Additional functionality: if LAPTOP#GENERAL not predicted, remove least
    # similar predicted category and include LAPTOP#GENERAL as predicted category
    # If same category predicted more than once, remove duplicate and predict new

    categories = []
    pred_cat = 'predicted_category'
    for rid in df_p2['rid'].unique():
        len_pred_cats_p1 = len(df_p1.query(f'rid == "{rid}"')[pred_cat])
        len_pred_cats_p2 = len(df_p2.query(f'rid == "{rid}"')[pred_cat])
        if 'LAPTOP#GENERAL' not in set(df_p1.query(f'rid == "{rid}"')[pred_cat]):
            if len_pred_cats_p2 < len_pred_cats_p1:
                categories += ['LAPTOP#GENERAL']
                categories += list(df_p1.query(f'rid == "{rid}"')[pred_cat])[:len_pred_cats_p2-1]
            elif len_pred_cats_p1 == len_pred_cats_p2:
                categories += ['LAPTOP#GENERAL']
                categories += list(df_p1.query(f'rid == "{rid}"')[pred_cat])[:len_pred_cats_p2-1]
            else:  
                delta = len_pred_cats_p2 - len_pred_cats_p1
                categories += ['LAPTOP#GENERAL']
                categories += list(df_p1.query(f'rid == "{rid}"')[pred_cat])
                for i in range(delta-1):
                    categories += ['LAPTOP#GENERAL'] # duplicate predictions will be handled later

        else:
            if len_pred_cats_p2 < len_pred_cats_p1:
                if 'LAPTOP#GENERAL' in list(df_p1.query(f'rid == "{rid}"')[pred_cat])[:len_pred_cats_p2]:
                    categories += list(df_p1.query(f'rid == "{rid}"')[pred_cat])[:len_pred_cats_p2]
                else:
                    categories += ['LAPTOP#GENERAL']
                    categories += list(df_p1.query(f'rid == "{rid}"')[pred_cat])[:len_pred_cats_p2-1]
            elif len_pred_cats_p1 == len_pred_cats_p2:
                categories += list(df_p1.query(f'rid == "{rid}"')[pred_cat])
            else:
                delta = len_pred_cats_p2 - len_pred_cats_p1
                categories += list(df_p1.query(f'rid == "{rid}"')['predicted_category'])
                for i in range(delta):
                    categories += ['LAPTOP#GENERAL'] # duplicate predictions will be handled later

    return categories

def repredict_duplicate_cats(process_df, clean_df, attrib_list=attributes_list, ent_list=entities_list):
    # Identify duplicate predicted categories for each given review
    # Use spacy cosine similarity measure to re-predict new categories
    categories_new = []
    for rid in process_df['rid'].unique():
        if len(set(process_df.query(f'rid == "{rid}"')['predicted_category'])) < len(list(process_df.query(f'rid == "{rid}"')['predicted_category'])):
            categories_new += set(process_df.query(f'rid == "{rid}"')['predicted_category'])

            delta = len(list(process_df.query(f'rid == "{rid}"')['predicted_category'])) - len(set(process_df.query(f'rid == "{rid}"')['predicted_category']))
            unique_cats = list(set(process_df.query(f'rid == "{rid}"')['predicted_category']))
            duplicates = list(process_df.query(f'rid == "{rid}"')['predicted_category'])
            [duplicates.remove(item) for item in unique_cats if item in duplicates]

            # predict new category from ranked list of 20 most similar categories
            review_text = clean_df.query(f'rid == "{rid}"')['text'].unique()[0]

            entity_sim_dict = {}
            for entity in ent_list[:5]:
                entity_sim_dict[entity] = nlp(entity).similarity(nlp(review_text))
            sorted_entities = sorted(((value,key) for key,value in entity_sim_dict.items()), reverse=True)

            attrib_sim_dict = {}
            for attrib in attrib_list[:4]:
                attrib_sim_dict[attrib] = nlp(attrib).similarity(nlp(review_text))
            sorted_attribs = sorted(((value,key) for key,value in attrib_sim_dict.items()), reverse=True)

            unsorted_categories = {}
            for i in sorted_entities:
                for j in sorted_attribs:
                    unsorted_categories[i[1]+'#'+j[1]] = i[0]*j[0]

            sorted_categories = sorted(((value,key) for key,value in unsorted_categories.items()), reverse=True)
            ranked_categories = [i[1] for i in sorted_categories]

            # Only keep ranked_categories that are not already in unique_cats
            [ranked_categories.remove(item) for item in ranked_categories if item in unique_cats]

            for i in range(delta):
                categories_new += [ranked_categories[i]]
    
        else:
            categories_new += set(process_df.query(f'rid == "{rid}"')['predicted_category'])
        
    return categories_new

- Identifying predicted categories for each review in Part 2

In [50]:
# populating E#A category predictions given that the part 2 data contains the same sentences as part 1
processed_df_p2['predicted_category'] = identify_p2_categories(train_reviews_df, train_reviews_df_p2)

In [51]:
# Re-predict duplicate predicted categories for given review
# Note: cell takes ~45secs to run
processed_df_p2['predicted_category'] = repredict_duplicate_cats(processed_df_p2, train_reviews_df_p2)

In [52]:
# Sort category predictions for more representative accuracy score (see example in function comments)
processed_df_p2['predicted_category'] = sorted_predictions_p2(processed_df_p2)
train_reviews_df_p2['predicted_category'] = sorted_predictions_p2(processed_df_p2)

In [53]:
e_a_labels_p2, e_a_predictions_p2, e_a_list_p2, e_a_label_dict_p2  = numerical_entity_attributes(processed_df_p2)
category_train_accuracy_p2 = accuracy_score(e_a_labels_p2, e_a_predictions_p2)

print(f"Training set category predictions accuracy score: {category_train_accuracy_p2*100:.0f}%")

Training set category predictions accuracy score: 71%


- Worse category predictions than Part 1 training set predictions ($71$% vs $83$%)
    - This might be due to removing repeat predictions. For example, the classifiers are more likely to predict LAPTOP#GENERAL for a given example as this is the most common category in the training data.
    - For Part 1, this works well. For example, 2 sentences in the same review may belong to LAPTOP#GENERAL. The classifiers in Part 1 can identify this with high accuracy.
    - However now for Part 2, each review only has at most one category of each kind. Therefore the accuracy now depends more on how well the classifiers can identify categories that occur less frequently in the training data.
    - As seen in the Part 1 test set classification report, the classifiers perform worse when trying to identify categories that occur less frequently.

In [54]:
# Reorder columns
reorder_columns = ['rid','processed_text','entity','entity_label',
                  'attribute','attribute_label','category','predicted_category',
                  'polarity','predicted_polarity']

processed_df_p2.head()

Unnamed: 0,rid,processed_text,category,predicted_category,polarity,predicted_polarity
0,348,everyth fine machin speed capac build thing un...,LAPTOP#GENERAL,LAPTOP#GENERAL,positive,
1,348,everyth fine machin speed capac build thing un...,LAPTOP#OPERATION_PERFORMANCE,LAPTOP#OPERATION_PERFORMANCE,positive,
2,348,everyth fine machin speed capac build thing un...,HARD_DISC#DESIGN_FEATURES,LAPTOP#DESIGN_FEATURES,positive,
3,348,everyth fine machin speed capac build thing un...,LAPTOP#QUALITY,LAPTOP#QUALITY,positive,
4,348,everyth fine machin speed capac build thing un...,DISPLAY#QUALITY,DISPLAY#QUALITY,negative,


4) __Perform sentiment analysis at the review level for each identified E#A pair__
- Using trained Sentiment classifier from Part 1
- Some differences in sentiment classification approach:
    - For each predicted category, assign the 2 most similar sentences as determined by cosine similarity as input features
    - Predict polarity of each assigned sentence to category
    - If sentiment prediction for the two assigned sentences is positive and negative, then predict conflict
    - If sentiment prediction is positive and neutral, predict positive; similarly for negative and neutral predictions
    - If both sentences are neutral, predict neutral
    - If predicted category is LAPTOP#GENERAL, then use entire review to make sentiment prediction
- Output predictions in prediction column in dataframe
- Measure predictions on training data with overall accuracy score
    - Classification report will be produced for test data

- Defining functions used to obtain sentiment predictions in part 2

In [55]:
def split_input_text_p2(text):
    # Split input review into list of sentences.
    split_input = []
    for review in text:
        split_input.append(sent_tokenize(review))
        
    return split_input

def filter_sentiment_input_p2(text, cat, threshold=1):
    # If predicted LAPTOP#GENERAL, keep entire review text.
    # If one sentence in review is much more similar to predicted category
    # than the next most similar sentence, keep only that sentence for sentiment prediction.
    # Else keep the two most similar sentences for given category from review.

    filtered_input_1 = []
    filtered_input_2 = []
    for idx, review in enumerate(text):
        if cat[idx] == 'LAPTOP#GENERAL':
            filtered_input_1.append(' '.join(review))
            filtered_input_2.append(' '.join(review))
        else:
            cat_list = re.split('_|#', cat[idx])
            cat_list = [item.title() if item != 'OS' else item for item in cat_list]
            category = nlp(' '.join(cat_list))

            sent_sim_dict = {}
            for sentence in review:
                sent_sim_dict[sentence] = category.similarity(nlp(sentence))
            
            sorted_sents = sorted(((value,key) for key,value in sent_sim_dict.items()), reverse=True)
            
            if len(sorted_sents) > 1:
                # threshold hyperparameter determined on training set
                if (sorted_sents[0][0] - sorted_sents[1][0]) < threshold:
                    filtered_input_1.append(sorted_sents[0][1])
                    filtered_input_2.append(sorted_sents[1][1])
                else:
                    filtered_input_1.append(sorted_sents[0][1])
                    filtered_input_2.append(sorted_sents[0][1])
            else:
                filtered_input_1.append(sorted_sents[0][1])
                filtered_input_2.append(sorted_sents[0][1])
                
    return filtered_input_1, filtered_input_2

def p2_sentiment_predictions(input_1, input_2):
    # make sentiment predictions
    # polarity_labels_dict = {0: 'positive', 1: 'negative', 2: 'neutral', 3: 'conflict'} 
    N,D  = input_1.shape
    Y_sentiment_pred = np.zeros(N)

    for i in range(N):
        prediction_1 = sentiment_count_lr.predict(input_1[i,:])
        prediction_2 = sentiment_count_lr.predict(input_2[i,:])
        
        if prediction_1 == prediction_2:
            Y_sentiment_pred[i] = prediction_1

        # if prediction is positive and negative
        elif prediction_1 == 0 and prediction_2 == 1:
            Y_sentiment_pred[i] = 3 # numerical value for conflict

        # if prediction is positive and neutral
        elif prediction_1 == 0 or prediction_2 == 0: 
            Y_sentiment_pred[i] = 0 # numerical value for positive

        # if prediction is negative and neutral
        elif prediction_1 == 1 or prediction_2 == 1: 
            Y_sentiment_pred[i] = 1 # numerical prediction for negative

        # If prediction is neutral for both sentences
        else:
            Y_sentiment_pred[i] = 2 # numerical prediction for neutral
        
    return Y_sentiment_pred

- Making sentiment predictions for Part 2 training data

In [56]:
# label part 2 training sentiments
sentiments_list_p2, polarity_label_dict_p2 = numerical_labels(processed_df_p2, 'polarity', 'polarity_label')

In [57]:
# Training labels for part 2 sentiment classification
Y_sentiment_train_p2 = processed_df_p2['polarity_label']
Y_sentiment_train_p2 = np.array(Y_sentiment_train_p2, dtype=int)

In [58]:
# Creating copy of data frame to populate with processed input text
# specifically processed for sentiment prediction
sentiment_processed_df_p2 = train_reviews_df_p2.copy()
sentiment_processed_df_p2['text'] = split_input_text_p2(train_reviews_df_p2['text'])

In [59]:
# N.B. cell takes ~60secs to run
sentiment_processed_df_p2['text_1'],sentiment_processed_df_p2['text_2'] = filter_sentiment_input_p2(sentiment_processed_df_p2['text'],
                                        sentiment_processed_df_p2['predicted_category'],
                                        threshold=0.02)

In [60]:
# Pre process input features for sentiment classification
# two sets of input features
# If predicted sentiments for each set is different, predict conflict
sentiment_processed_df_p2['processed_text_1'] = preprocess(tokenise(sentiment_processed_df_p2['text_1']))
X_train_sentiment_p2_1 = count_vectorizer.transform(sentiment_processed_df_p2['processed_text_1'])

sentiment_processed_df_p2['processed_text_2'] = preprocess(tokenise(sentiment_processed_df_p2['text_2']))
X_train_sentiment_p2_2 = count_vectorizer.transform(sentiment_processed_df_p2['processed_text_2'])

In [61]:
# Make sentiment predictions on part 2 Training data
Y_sentiment_train_pred_p2 = p2_sentiment_predictions(X_train_sentiment_p2_1, X_train_sentiment_p2_2)

sentiment_train_accuracy_p2 = accuracy_score(Y_sentiment_train_p2, Y_sentiment_train_pred_p2)
print(f"Training set sentiment prediction accuracy score: {sentiment_train_accuracy_p2*100:.0f}%")

Training set sentiment prediction accuracy score: 76%


In [62]:
cr = classification_report(Y_sentiment_train_p2, Y_sentiment_train_pred_p2, target_names=sentiments_list_p2)
print(cr)

              precision    recall  f1-score   support

    positive       0.80      0.86      0.83      1210
    negative       0.81      0.74      0.77       708
     neutral       0.15      0.09      0.11       123
    conflict       0.04      0.05      0.04        41

    accuracy                           0.76      2082
   macro avg       0.45      0.43      0.44      2082
weighted avg       0.75      0.76      0.75      2082



- The overall sentiment accuracy on the training data is good at $76$%, slightly worse than $84$% accuracy for Part 1
    - The neutral and conflict predictions are very imprecise ($15$% and $4$% respectively), this brings down the overall accuracy.  
    - Part of the difficulty is due to relatively very few examples being labeled as neurtral or conflict, making them very hard to find and predict using the current approach.
    - Only 41 conflict and 123 neutral labelled training examples out of 2082

5) __Evaluate accuracy__
    - Measure E#A pair prediction and sentiment prediction test set accuracy
    - Using classification report for category and sentiment predictions

In [63]:
test_reviews_df_p2 = part2_xml_to_df(filename='Laptops_Test_p2_gold.xml')
train_reviews_df_p2.head(3)

Unnamed: 0,rid,text,category,predicted_category,polarity,predicted_polarity
0,348,Most everything is fine with this machine: spe...,LAPTOP#GENERAL,LAPTOP#GENERAL,positive,
1,348,Most everything is fine with this machine: spe...,LAPTOP#OPERATION_PERFORMANCE,LAPTOP#OPERATION_PERFORMANCE,positive,
2,348,Most everything is fine with this machine: spe...,HARD_DISC#DESIGN_FEATURES,LAPTOP#DESIGN_FEATURES,positive,


In [64]:
test_processed_df_p2 = test_reviews_df_p2.copy()
test_processed_df_p2['text'] = preprocess(tokenise(test_reviews_df_p2['text']))
test_processed_df_p2 = test_processed_df_p2.rename(columns={"text": "processed_text"})
test_processed_df_p2.head(3)

Unnamed: 0,rid,processed_text,category,predicted_category,polarity,predicted_polarity
0,B0074703CM_108_ANONYMOUS,well first appl comput impress work well fast ...,LAPTOP#OPERATION_PERFORMANCE,,positive,
1,B0074703CM_108_ANONYMOUS,well first appl comput impress work well fast ...,COMPANY#GENERAL,,positive,
2,B0074703CM_108_ANONYMOUS,well first appl comput impress work well fast ...,LAPTOP#GENERAL,,positive,


In [65]:
# populating E#A category predictions given that the part 2 data contains the same sentences as part 1
test_processed_df_p2['predicted_category'] = identify_p2_categories(test_reviews_df, test_reviews_df_p2)

In [66]:
# Re-predict duplicate predicted categories for given review
# Note: cell takes ~20secs to run
test_processed_df_p2['predicted_category'] = repredict_duplicate_cats(test_processed_df_p2, test_reviews_df_p2)

In [67]:
# Sort category predictions for more representative accuracy score (see example in function comments)
test_processed_df_p2['predicted_category'] = sorted_predictions_p2(test_processed_df_p2)
test_reviews_df_p2['predicted_category'] = sorted_predictions_p2(test_processed_df_p2)

- Produce Classification report for E#A pair predictions on Part 2 test set

In [68]:
e_a_labels_p2, e_a_predictions_p2, e_a_list_p2, e_a_label_dict_p2  = numerical_entity_attributes(test_processed_df_p2)

cr = classification_report(e_a_labels_p2, e_a_predictions_p2, target_names=e_a_list_p2)
print("(Part 2) Predicted E#A Category Classification Report (Test Set):\n")
print(cr)

(Part 2) Predicted E#A Category Classification Report (Test Set):

                                    precision    recall  f1-score   support

      LAPTOP#OPERATION_PERFORMANCE       0.76      0.79      0.77        47
                   COMPANY#GENERAL       1.00      0.08      0.15        24
                    LAPTOP#GENERAL       1.00      1.00      1.00        80
                  LAPTOP#USABILITY       0.58      0.70      0.64        30
              LAPTOP#MISCELLANEOUS       0.59      0.42      0.49        24
            LAPTOP#DESIGN_FEATURES       0.78      0.79      0.78        39
                LAPTOP#PORTABILITY       0.00      0.00      0.00         5
     BATTERY#OPERATION_PERFORMANCE       0.90      0.64      0.75        14
                      LAPTOP#PRICE       0.91      0.37      0.53        27
         HARD_DISC#DESIGN_FEATURES       0.83      0.56      0.67         9
                    LAPTOP#QUALITY       0.57      0.90      0.69        29
                   D

- Classification report for sentiment predictions on Part 2 test data

In [69]:
# label part 2 test sentiments
test_processed_df_p2['polarity_label'] = ''
i = 0
for label in sentiments_list_p2:
    test_processed_df_p2.loc[test_processed_df_p2['polarity'] == label, 'polarity_label'] = i
    polarity_label_dict_p2[i] = label
    i += 1


In [70]:
# Test labels for part 2 sentiment classification
Y_sentiment_test_p2 = test_processed_df_p2['polarity_label']
Y_sentiment_test_p2 = np.array(Y_sentiment_test_p2, dtype=int)

In [71]:
# Creating copy of data frame to populate with processed input text
# specifically processed for sentiment prediction
test_sentiment_prep_df_p2 = test_reviews_df_p2.copy()
test_sentiment_prep_df_p2['text'] = split_input_text_p2(test_reviews_df_p2['text'])

In [72]:
# N.B. cell takes ~30secs to run
test_sentiment_prep_df_p2['text_1'],test_sentiment_prep_df_p2['text_2'] = filter_sentiment_input_p2(test_sentiment_prep_df_p2['text'],
                                        test_sentiment_prep_df_p2['predicted_category'],
                                        threshold=0.02)

In [73]:
# Pre process input features for sentiment classification
# two sets of input features
# If predicted sentiments for each set is positive & negative, predict conflict
test_sentiment_prep_df_p2['processed_text_1'] = preprocess(tokenise(test_sentiment_prep_df_p2['text_1']))
X_test_sentiment_p2_1 = count_vectorizer.transform(test_sentiment_prep_df_p2['processed_text_1'])

test_sentiment_prep_df_p2['processed_text_2'] = preprocess(tokenise(test_sentiment_prep_df_p2['text_2']))
X_test_sentiment_p2_2 = count_vectorizer.transform(test_sentiment_prep_df_p2['processed_text_2'])

In [74]:
# Make sentiment predictions on part 2 test data
Y_sentiment_test_pred_p2 = p2_sentiment_predictions(X_test_sentiment_p2_1, X_test_sentiment_p2_2)

# Produce classification report
cr = classification_report(Y_sentiment_test_p2, Y_sentiment_test_pred_p2, target_names=sentiments_list_p2)
print("(Part 2) Predicted Sentiment Classification Report (Test Set):\n")
print(cr)

(Part 2) Predicted Sentiment Classification Report (Test Set):

              precision    recall  f1-score   support

    positive       0.70      0.81      0.75       338
    negative       0.57      0.37      0.45       162
     neutral       0.08      0.03      0.05        31
    conflict       0.12      0.29      0.17        14

    accuracy                           0.62       545
   macro avg       0.37      0.37      0.35       545
weighted avg       0.61      0.62      0.61       545



In [75]:
test_processed_df_p2['predicted_polarity'] = convert_numerical_predictions(Y_sentiment_test_pred_p2, polarity_label_dict_p2)
test_reviews_df_p2['predicted_polarity'] = convert_numerical_predictions(Y_sentiment_test_pred_p2, polarity_label_dict_p2)

test_reviews_df_p2[['text','category','predicted_category','polarity','predicted_polarity']].head()

Unnamed: 0,text,category,predicted_category,polarity,predicted_polarity
0,"Well, my first apple computer and I am impress...",LAPTOP#OPERATION_PERFORMANCE,LAPTOP#OPERATION_PERFORMANCE,positive,positive
1,"Well, my first apple computer and I am impress...",COMPANY#GENERAL,GRAPHICS#GENERAL,positive,positive
2,"Well, my first apple computer and I am impress...",LAPTOP#GENERAL,LAPTOP#GENERAL,positive,positive
3,s.... L .... o..... w.... rea......llllyy sl...,LAPTOP#OPERATION_PERFORMANCE,LAPTOP#OPERATION_PERFORMANCE,negative,positive
4,s.... L .... o..... w.... rea......llllyy sl...,LAPTOP#USABILITY,LAPTOP#USABILITY,negative,positive


In [76]:
# View the % of predictions where both the category AND polarity were predicted correct
combined_accuracy_p2 = overall_accuracy_absa(test_reviews_df_p2)
print("For Part 2 test data:")
print(combined_accuracy_p2)

For Part 2 test data:
Overall accuracy for correct category and sentiment predictions: 34%


#### __Evaluation (Part 2)__
__Results:__
- Category predictions test set accuracy score of $51$% is $20$% lower than the E#A category predictions on the training data ($71$%) but the accuracy is better than that of the test set category accuracy ($46$%) for Part 1.
    - Interestingly, this may be due to the same reasons that make the category predictions accuracy worse on the Part 2 training data. Due to the information provided that each review must contain a LAPTOP#GENERAL category, a perfect score is achieved on predicting LAPTOP#GENERAL ($100$% for precision, recall and f-1 score). This will bring up overall accuracy
    - A second reason for the improved test set accuracy may be the implementation of the function that removes and re-predicts duplicate predictions for each review. Part 1 has some functionality to avoid predicting duplicates but it there is still a small probability that it predicts duplicates for the same sentence.
- Sentiment predictions test set accuracy score is $62$% and the approach can identify a small number of conflict labelled examples. Even though the test set accuracy is $3$% lower than that achieved on the Part 1 test set, this is still good considering that there are now 4 possible sentiment categories instead of 3.
    - Again, positive review predictions perform well for the part 2 test set as seen in the classification report.
    - Negative review predictions perform better than random guess ($25$%) but are less accurate than the positive review predictions.
    - It is worth noting that the overall sentiment accuracy $62$% is equal to the naive approach of predicting _positive_ for all examples ($62$%). Part of the difficulty in achieving a higher overall accuracy is the poor prediction performance on neutral and conflict labelled examples.
    - The predictions for neutral and conflict labelled examples are imprecise, dragging down the overall accuracy of the sentiment classifier. This suggests that overall sentiment accuracy can be significantly improved by using an approach that can better (and more precisely) identify both neutral and conflict labelled examples.

__Time Efficiency:__

_N.B. time efficiency results produced with macbook air M1 (2020), 8GB RAM._
- Time taken to process, predict and produce classification reports for test data: $54.5s$
- In addition to the time-costly _spacy cosine similarity_ function highlighted in the Evaluation section of Part 1, there are three other major contributors to the increase in time taken to process and make predictions for the test set in Part 2:
    1) Firstly the system now pre-processes the entire review for each predicted category, rather than a single sentence like in Part 1. This results in the text pre-processing cells to take $6.9s$ for Part 2 vs $1s$ in Part 1.
        - An obvious future improvement here would be to only pre-process the same review text for each review only once. The current system pre-processes the review text for each review as many times as the review has categories which is evidently inefficient.
    2) The _repredict_duplicate_cats()_ function is very time costly ($17.3s$) because it again uses the costly _spacy similarity_ function and it also performs a series of sorting calculations and rankings to repredict duplicate categories. However, this function in particular is likely responsible for the improvement in the test set category classification accuracy compared to Part 1.
    3) The other contibutor to the decrease in time efficiency for the algorithm in Part 2 is the _filter_sentiment_input_p2()_ function. This function again uses the costly _spacy similarity_ function to determine which sentences in the review to keep for the aspect based sentiment predictions.
- Given the reasons stated for the main causes of increase in time of the algorithm, the most effective future improvement would be, as mentioned in Part 1 Evaluation section, to find a more efficient replacement for the _space similarity_ function that is used at several steps in the algorithm.
    - One possible solution would be to re-use the classifiers to determine the similarity between the different parts of input text and the predicted categories.
    - However, the difficulty in this lies in sorting the predictions' probabilities and converting these numerical predictions back to words within the functions where the _spacy similarity_ is used.

__Future Improvements:__
- A more precise method of identifying conflict labelled examples is needed. As noted in the Part 1 Evaluation section, predictive performance would likely be improved by incorporating a more detailed part of speech (POS) and sentence structure understaning for determining the appropriate input features for the Logistic Regression models.
- Another thing to consider is that language is sequential. By developing an approach that considers the sequential nature of language (e.g., RNNs), predictive performance may be improved.
- There are also other alternative approaches worth investigating. For example, a Latent Dirichlect Allocation (LDA) approach to identifying the various categories in a review may be an effective method that has not been considered in this ABSA implementation.