## ID: 2446553
## Module code: [06-37812]

# Part 1

## Imports

In [1]:
from bs4 import BeautifulSoup as bs
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import pandas as pd
from textblob import TextBlob 
import spacy
from tqdm import tqdm 
from tensorflow.keras.models import load_model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical
import numpy as np 
from tensorflow.keras.preprocessing.text import Tokenizer
from sklearn.metrics import accuracy_score



## XML Parsing

Here, I use the BeautifulSoup package (`bs4`) to parse the XML file and extract the sentences and their opinions. I build one sentence-opinion pair for every opinion attached to a sentence, collect the pairs in a dictionary, and then convert it into a pandas DataFrame for easy processing.

In [2]:
def parse_xml(filename):
    """Parse the review XML into a DataFrame with one row per (sentence, opinion) pair."""
    with open(filename, encoding='utf-8') as file:
        content = bs(file.read(), 'xml')

    data = {
        'text': [],
        'entity': [],
        'attribute': [],
        'polarity': []
    }

    for review in content.find_all('Review'):
        for sentence in review.find_all('sentence'):
            opinions = sentence.find('Opinions')
            if opinions is None:
                continue  # skip sentences without opinion annotations
            text = sentence.find('text').text
            for opinion in opinions.find_all('Opinion'):
                # The category attribute has the form ENTITY#ATTRIBUTE
                entity, attribute = opinion.get('category').split('#')
                data['text'].append(text)
                data['entity'].append(entity)
                data['attribute'].append(attribute)
                data['polarity'].append(opinion.get('polarity'))
    return pd.DataFrame(data)
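
To make the expected input concrete, here is a minimal sketch of the same sentence-opinion pairing on a tiny hand-written document. The sample XML is hypothetical (its layout is only inferred from the tag and attribute names that `parse_xml` reads), and the sketch uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup so that it runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the structure is inferred from the tags read by parse_xml.
SAMPLE_XML = """
<Reviews>
  <Review rid="1">
    <sentences>
      <sentence id="1:0">
        <text>The screen is great but the battery dies fast.</text>
        <Opinions>
          <Opinion category="DISPLAY#QUALITY" polarity="positive"/>
          <Opinion category="BATTERY#OPERATION_PERFORMANCE" polarity="negative"/>
        </Opinions>
      </sentence>
      <sentence id="1:1">
        <text>I bought it last week.</text>
      </sentence>
    </sentences>
  </Review>
</Reviews>
"""


def sentence_opinion_pairs(xml_string):
    """Return one dict per (sentence, opinion) pair; sentences without opinions yield nothing."""
    root = ET.fromstring(xml_string)
    rows = []
    for sentence in root.iter('sentence'):
        text = sentence.findtext('text')
        for opinion in sentence.iter('Opinion'):
            entity, attribute = opinion.get('category').split('#')
            rows.append({
                'text': text,
                'entity': entity,
                'attribute': attribute,
                'polarity': opinion.get('polarity'),
            })
    return rows


rows = sentence_opinion_pairs(SAMPLE_XML)
for row in rows:
    print(row['entity'], row['attribute'], row['polarity'])
```

The first sentence contributes two rows with identical text, one per opinion, and the second sentence contributes none.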

## Preprocessing the text

For preprocessing, I use the usual tokenization, stop-word removal and lemmatization. I spotted some misspellings and tried correcting them with the TextBlob package, but it did more harm than good: it turned many correctly spelt words into random combinations of letters and added significant processing time, so I removed it.

I also remove quite a few words from the stop-word list, as they are important to our task. For example, the word 'not' is needed to differentiate between "happy" and "not happy". The same argument applies to the other removed words.

In [24]:
### Tokenization, stop-word removal and lemmatization
def process_text(df):
    lemmatizer = WordNetLemmatizer()
    # Keep hyphenated words, decimals and contractions (e.g. "don't") as single tokens
    tokenizer = RegexpTokenizer(r"\w+(?:[-.']\w+)*")
    # Negations and other task-relevant words that must NOT be treated as stop words
    remove_words = {'not', 'used', 'never', "don't", 'care', "didn't", 'cannot', "didn", "hasn't", "haven't", "isn", "isn't", "mightn't", "mustn't", "might", "needn't", "shan't", "shouldn't", "wasn't", "weren't", "doesn't", "won't", "wouldn't"}
    # Extra words treated as stop words
    add_words = ['homework', 'student', 'science', 'college']
    sw_list = set(stopwords.words('english')) - remove_words
    sw_list.update(add_words)

    lemmatized_sent = []

    for text in tqdm(df.text):
        tokens = tokenizer.tokenize(text.lower())
        # Spell correction with TextBlob was tried here and dropped:
        # tokens = [str(TextBlob(t).correct()) for t in tokens]
        tokens = [word for word in tokens if word not in sw_list]
        lemmatized_words = [lemmatizer.lemmatize(w) for w in tokens]
        lemmatized_sent.append(' '.join(lemmatized_words))

    df['processed'] = lemmatized_sent
    return df
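
The effect of keeping negations out of the stop-word list can be seen on a single sentence. This is only an illustrative sketch: `DEFAULT_STOPWORDS` is a tiny hypothetical stand-in for the NLTK list, and the tokenizer is reproduced with the standard `re` module using the same pattern as above.

```python
import re

# Tiny hypothetical stand-in for the NLTK English stop-word list.
DEFAULT_STOPWORDS = {'i', 'am', 'the', 'with', 'not', 'is', 'this'}
KEEP = {'not'}  # negations that should survive filtering

TOKEN_RE = re.compile(r"\w+(?:[-.']\w+)*")  # same pattern as in process_text


def filter_tokens(text, stopword_set):
    """Lowercase, tokenize, and drop every token found in stopword_set."""
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in stopword_set]


sentence = "I am not happy with this laptop"
without_fix = filter_tokens(sentence, DEFAULT_STOPWORDS)
with_fix = filter_tokens(sentence, DEFAULT_STOPWORDS - KEEP)

print(without_fix)  # ['happy', 'laptop']
print(with_fix)     # ['not', 'happy', 'laptop']
```

With the default list, the sentence collapses to the same tokens as "I am happy with this laptop", which is exactly the ambiguity the modified list avoids.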

In [11]:
df = parse_xml('/content/Laptops_Train_p1.xml')  # Set the appropriate file path here.
orig = process_text(df)

100%|██████████| 2909/2909 [00:01<00:00, 2117.21it/s]


## Aspect Extraction

In [12]:
entity_labels = ['laptop', 'display', 'keyboard', 'mouse', 'motherboard', 'cpu', 'fans_cooling', 
                 'ports', 'memory', 'power_supply', 'optical_drives', 'battery', 'graphics', 
                 'hard_disk', 'multimedia_devices', 'hardware', 'software', 'os', 'warranty', 'shipping', 'support', 'company']

attribute_labels = ['general', 'price', 'quality', 'design_feature', 'operation_performance', 'usability', 'portability',
                   'connectivity', 'miscellaneous']

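Here, we perform chunking with the spaCy package: we extract the noun chunks from each processed sentence and keep the root noun of each chunk as an aspect term. Sentences that yield no such noun are dropped.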

In [13]:
df = orig.copy()
nlp = spacy.load('en_core_web_sm')

aspect_terms = []
for i, review in enumerate(nlp.pipe(df.processed)):
    chunks = [(chunk.root.text) for chunk in review.noun_chunks if chunk.root.pos_ == 'NOUN']
    if not len(chunks):
        df.drop(i, inplace=True)
        continue
    aspect_terms.append(' '.join(chunks))

df['aspect terms'] = aspect_terms

## Model creation

I train separate models to predict the entity, the attribute and the sentiment. Quite a few examples out there predict the combined entity#attribute label with a single network, but the dataset alone contains 81 unique pairs, and the set of all possible combinations is larger still. I felt that this would make for an inefficient model, so I predict each part separately.
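
To put rough numbers on this, the sketch below compares the output sizes of a joint classifier and of two separate classifiers, using the entity and attribute label lists from the Aspect Extraction section. The joint figure is the size of the full cross product, which is an upper bound; as noted above, only 81 pairs actually occur in the dataset.

```python
entity_labels = ['laptop', 'display', 'keyboard', 'mouse', 'motherboard', 'cpu', 'fans_cooling',
                 'ports', 'memory', 'power_supply', 'optical_drives', 'battery', 'graphics',
                 'hard_disk', 'multimedia_devices', 'hardware', 'software', 'os', 'warranty',
                 'shipping', 'support', 'company']

attribute_labels = ['general', 'price', 'quality', 'design_feature', 'operation_performance',
                    'usability', 'portability', 'connectivity', 'miscellaneous']

# One softmax over every possible entity#attribute combination
joint_outputs = len(entity_labels) * len(attribute_labels)

# Two softmaxes, one per label type
separate_outputs = len(entity_labels) + len(attribute_labels)

print(joint_outputs)     # 198
print(separate_outputs)  # 31
```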

For all models, I go for a simple and standard feedforward network as shown below. 

In [14]:
entity_categories_model = Sequential()
entity_categories_model.add(Dense(1024, input_shape=(6000,), activation='relu'))
entity_categories_model.add(Dense(512, activation='relu'))
entity_categories_model.add(Dense(128, activation='relu'))
entity_categories_model.add(Dense(len(entity_labels), activation='softmax'))
entity_categories_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
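
As a quick sanity check on the size of this network, the number of trainable parameters of a stack of `Dense` layers can be worked out by hand: each layer has `(inputs + 1) * outputs` parameters, the `+ 1` accounting for the bias. The helper name below is hypothetical; the layer sizes are the ones used for the entity model.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for consecutive fully connected layers."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# input -> 1024 -> 512 -> 128 -> 22 entity classes
entity_net = [6000, 1024, 512, 128, 22]
total = dense_param_count(entity_net)
print(total)  # 6738326
```

Almost all of these parameters sit in the first layer (6001 * 1024), which is worth keeping in mind given that the training set has only a few thousand rows.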

Here, I use Keras' Tokenizer to vectorize the text. (Ref: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer)
I then encode the labels as integers with a label encoder and convert them to one-hot format, as is standard.

In [15]:
vocab_size = 6000 
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(df.text)
aspect_tokenized = pd.DataFrame(tokenizer.texts_to_matrix(df['aspect terms']))

label_encoder = LabelEncoder()
integer_entity = label_encoder.fit_transform(df.entity)
entity_category = to_categorical(integer_entity)
integer_attribute = label_encoder.fit_transform(df.attribute)
attribute_category = to_categorical(integer_attribute)
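
For readers unfamiliar with these two steps, the following is a simplified pure-Python stand-in for what happens to the inputs and the labels. It is a sketch, not the Keras or scikit-learn implementation: it assumes whitespace tokenization, the binary mode of `texts_to_matrix`, and alphabetical class order; all function names are hypothetical.

```python
from collections import Counter


def fit_vocab(texts, num_words):
    """Map the most frequent words to indices 1..num_words-1 (index 0 stays unused)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(num_words - 1), start=1)}


def texts_to_binary_matrix(texts, vocab, num_words):
    """One row per text; column j is 1 if the word with index j occurs in the text."""
    matrix = []
    for text in texts:
        row = [0] * num_words
        for word in text.lower().split():
            if word in vocab:  # out-of-vocabulary words are silently dropped
                row[vocab[word]] = 1
        matrix.append(row)
    return matrix


def one_hot(labels):
    """Integer-encode labels in sorted class order, then expand to one-hot rows."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    return classes, [[1 if index[l] == j else 0 for j in range(len(classes))] for l in labels]


train_texts = ["battery life great", "screen great", "keyboard battery"]
vocab = fit_vocab(train_texts, num_words=6)
X = texts_to_binary_matrix(["battery screen", "touchpad"], vocab, num_words=6)
classes, Y = one_hot(["BATTERY", "DISPLAY", "BATTERY"])

print(X)  # [[0, 1, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0]]
print(Y)  # [[1, 0], [0, 1], [1, 0]]
```

Note that a text made up only of unseen words ("touchpad") becomes an all-zero row, so the network receives no signal for it.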

In [16]:
entity_categories_model.fit(aspect_tokenized, entity_category, epochs=30, verbose=1)


In [17]:
attribute_categories_model = Sequential()
attribute_categories_model.add(Dense(1024, input_shape=(6000,), activation='relu'))
attribute_categories_model.add(Dense(512, activation='relu'))
attribute_categories_model.add(Dense(128, activation='relu'))
attribute_categories_model.add(Dense(len(attribute_labels), activation='softmax'))
attribute_categories_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [18]:
attribute_categories_model.fit(aspect_tokenized, attribute_category, epochs=30, verbose=1)


## Sentiment extraction

I do the same for the sentiment terms, this time extracting the adjectives and verbs from the sentences.

In [19]:
df = orig.copy()
sentiment_terms = []
rem_ind = []
for i, review in enumerate(nlp.pipe(df['processed'])):
    chunks = [token.lemma_ for token in review if (not token.is_stop and not token.is_punct and (token.pos_ == "ADJ" or token.pos_ == "VERB"))]
    if not len(chunks):
        rem_ind.append(i)
        continue
    sentiment_terms.append(' '.join(chunks))
df = df.drop(rem_ind)
df['sentiment_terms'] = sentiment_terms

In [20]:
sentiment_model = Sequential()
sentiment_model.add(Dense(1024, input_shape=(6000,), activation='relu'))
sentiment_model.add(Dense(512, activation='relu'))
sentiment_model.add(Dense(256, activation='relu'))
sentiment_model.add(Dense(128, activation='relu'))
sentiment_model.add(Dense(3, activation='softmax'))
sentiment_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [21]:
sentiment_tokenized = pd.DataFrame(tokenizer.texts_to_matrix(df.sentiment_terms))
label_encoder = LabelEncoder()
integer_sentiment = label_encoder.fit_transform(df.polarity)
dummy_sentiment = to_categorical(integer_sentiment)

In [22]:
sentiment_model.fit(sentiment_tokenized, dummy_sentiment, epochs=20, verbose=1)


# Test

For testing, I use the same preprocessing pipeline. 

In [3]:
test_df = parse_xml('Laptops_Test_p1_gold.xml')
test_df = process_text(test_df)

NameError: name 'process_text' is not defined

In [26]:
aspect_df = test_df.copy()
nlp = spacy.load('en_core_web_sm')

aspect_terms = []
for i, review in enumerate(nlp.pipe(aspect_df.processed)):
    chunks = [(chunk.root.text) for chunk in review.noun_chunks if chunk.root.pos_ == 'NOUN']
    if not len(chunks):
        aspect_df.drop(i, inplace=True)
        continue
    aspect_terms.append(' '.join(chunks))

aspect_df['aspect terms'] = aspect_terms

In [27]:
sentiment_df = test_df.copy()
sentiment_terms = []
rem_ind = []
for i, review in enumerate(nlp.pipe(sentiment_df['processed'])):
    chunks = [token.lemma_ for token in review if (not token.is_stop and not token.is_punct and (token.pos_ == "ADJ" or token.pos_ == "VERB"))]
    if not len(chunks):
        rem_ind.append(i)
        continue
    sentiment_terms.append(' '.join(chunks))
sentiment_df = sentiment_df.drop(rem_ind)
sentiment_df['sentiment_terms'] = sentiment_terms

In [28]:
vocab_size = 6000 
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(df.text)
aspect_tokenized = pd.DataFrame(tokenizer.texts_to_matrix(aspect_df['aspect terms']))

label_encoder = LabelEncoder()
integer_entity = label_encoder.fit_transform(aspect_df.entity)
entity_category = to_categorical(integer_entity)
integer_attribute = label_encoder.fit_transform(aspect_df.attribute)
attribute_category = to_categorical(integer_attribute)

In [29]:
sentiment_tokenized = pd.DataFrame(tokenizer.texts_to_matrix(sentiment_df.sentiment_terms))
label_encoder = LabelEncoder()
integer_sentiment = label_encoder.fit_transform(sentiment_df.polarity)
dummy_sentiment = to_categorical(integer_sentiment)
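
One thing to watch in the test cells above is that the tokenizer and the label encoders are fitted again instead of being reused from training. The sketch below uses hypothetical labels and plain-Python stand-ins to illustrate why reusing the training mapping is the safer choice: refitting on a split that happens to lack a class shifts the integer ids, so the same label no longer lines up with the same output unit.

```python
def fit_label_index(labels):
    """Assign integer ids in sorted class order (the order LabelEncoder also uses)."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


def encode(labels, label_index):
    """Encode with a fixed mapping; an unseen label raises KeyError instead of shifting ids."""
    return [label_index[label] for label in labels]


train_labels = ['negative', 'neutral', 'positive', 'positive']
test_labels = ['positive', 'negative']  # this split happens to contain no 'neutral'

shared = fit_label_index(train_labels)  # fitted once, on the training labels
refit = fit_label_index(test_labels)    # fitted again, on the test labels

print(encode(test_labels, shared))  # [2, 0]
print(encode(test_labels, refit))   # [1, 0]
```

With scikit-learn, the first variant corresponds to calling `transform` on the encoder fitted during training rather than `fit_transform` on the test labels; the same reasoning applies to the word index of the tokenizer.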

In [30]:
entity_categories_model.evaluate(aspect_tokenized, entity_category)



[2.5159709453582764, 0.5428156852722168]

In [31]:
attribute_categories_model.evaluate(aspect_tokenized, attribute_category)



[2.6327712535858154, 0.2423802614212036]

In [32]:
sentiment_model.evaluate(sentiment_tokenized, dummy_sentiment)



[1.8112872838974, 0.5338753461837769]

-----------------------------------------------------------------------------------------------------------
# Part 2 

Since splitting the text into separate sentences did not make much sense for this part, I preprocess each review's entire text as a whole. Otherwise, I follow the same methods as in the first task, except when evaluating sentiment.

In [73]:
def parse_xml(filename):
    with open(filename, encoding='utf-8') as file:
        data = file.read()
    content = bs(data, 'xml')
    data = {
        'text': [],
        'entity': [],
        'attribute': [],
        'polarity': []
    }

    rid_list = []
    a = list(content.find_all('Review'))
    for i in a:
        rid_list.append(i.get('rid'))
        
    for rid in rid_list:
        sentences = list(content.find('Review', {'rid': rid}).find_all('sentence'))
        opinions = list(content.find('Review', {'rid': rid}).find_all('Opinion'))
        sentences = [x.text for x in sentences]
        text = ' '.join(sentences)
        if len(opinions):
            for opinion in opinions:
                data['text'].append(text)
                category = opinion.get('category')
                entity, attribute = category.split('#')
                data['entity'].append(entity)
                data['attribute'].append(attribute)
                data['polarity'].append(opinion.get('polarity'))
    df = pd.DataFrame(data)
    return df
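
The difference from the Part 1 parser is that every opinion of a review is now paired with the full review text. A minimal sketch on hypothetical, already-parsed data (plain dictionaries instead of XML, with made-up review ids) shows the resulting row layout:

```python
# Hypothetical parsed reviews: sentence texts plus review-level opinions.
reviews = {
    'r1': {
        'sentences': ['Great screen.', 'Battery is weak.'],
        'opinions': [('DISPLAY#QUALITY', 'positive'),
                     ('BATTERY#OPERATION_PERFORMANCE', 'negative')],
    },
    'r2': {
        'sentences': ['Arrived on time.'],
        'opinions': [],  # reviews without opinions produce no rows
    },
}


def review_level_rows(reviews):
    """Join each review's sentences and emit one row per review-level opinion."""
    rows = []
    for rid, review in reviews.items():
        text = ' '.join(review['sentences'])
        for category, polarity in review['opinions']:
            entity, attribute = category.split('#')
            rows.append({'rid': rid, 'text': text, 'entity': entity,
                         'attribute': attribute, 'polarity': polarity})
    return rows


rows = review_level_rows(reviews)
for row in rows:
    print(row['rid'], row['entity'], row['polarity'], '|', row['text'])
```

Both rows for `r1` carry the same input text but different labels, which is one reason a single-output classifier struggles on this part.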

In [74]:
df = parse_xml('Laptops_Train_p2.xml')
df = process_text(df)

100%|██████████| 2082/2082 [00:00<00:00, 3152.56it/s]


In [75]:
entity_labels = ['laptop', 'display', 'keyboard', 'mouse', 'motherboard', 'cpu', 'fans_cooling', 
                 'ports', 'memory', 'power_supply', 'optical_drives', 'battery', 'graphics', 
                 'hard_disk', 'multimedia_devices', 'hardware', 'software', 'os', 'warranty', 'shipping', 'support', 'company']

attribute_labels = ['general', 'price', 'quality', 'design_feature', 'operation_performance', 'usability', 'portability',
                   'connectivity', 'miscellaneous']

In [76]:
df = orig.copy()  # orig is the processed frame created in In [11]; the frame from In [74] is not copied here
nlp = spacy.load('en_core_web_sm')

aspect_terms = []
for i, review in enumerate(nlp.pipe(df.processed)):
    chunks = [(chunk.root.text) for chunk in review.noun_chunks if chunk.root.pos_ == 'NOUN']
    if not len(chunks):
        df.drop(i, inplace=True)
        continue
    aspect_terms.append(' '.join(chunks))

df['aspect terms'] = aspect_terms

In [77]:
entity_categories_model = Sequential()
entity_categories_model.add(Dense(1024, input_shape=(6000,), activation='relu'))
entity_categories_model.add(Dense(512, activation='relu'))
entity_categories_model.add(Dense(128, activation='relu'))
entity_categories_model.add(Dense(len(entity_labels), activation='softmax'))
entity_categories_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [78]:
vocab_size = 6000 
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(df.text)
aspect_tokenized = pd.DataFrame(tokenizer.texts_to_matrix(df['aspect terms']))

label_encoder = LabelEncoder()
integer_entity = label_encoder.fit_transform(df.entity)
entity_category = to_categorical(integer_entity)
integer_attribute = label_encoder.fit_transform(df.attribute)
attribute_category = to_categorical(integer_attribute)

In [79]:
entity_categories_model.fit(aspect_tokenized, entity_category, epochs=30, verbose=1)


In [80]:
attribute_categories_model = Sequential()
attribute_categories_model.add(Dense(1024, input_shape=(6000,), activation='relu'))
attribute_categories_model.add(Dense(512, activation='relu'))
attribute_categories_model.add(Dense(128, activation='relu'))
attribute_categories_model.add(Dense(len(attribute_labels), activation='softmax'))
attribute_categories_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [81]:
attribute_categories_model.fit(aspect_tokenized, attribute_category, epochs=30, verbose=1)


In [82]:
df = orig.copy()
sentiment_terms = []
rem_ind = []
for i, review in enumerate(nlp.pipe(df['processed'])):
    chunks = [token.lemma_ for token in review if (not token.is_stop and not token.is_punct and (token.pos_ == "ADJ" or token.pos_ == "VERB"))]
    if not len(chunks):
        rem_ind.append(i)
        continue
    sentiment_terms.append(' '.join(chunks))
df = df.drop(rem_ind)
df['sentiment_terms'] = sentiment_terms

In [83]:
sentiment_model = Sequential()
sentiment_model.add(Dense(1024, input_shape=(6000,), activation='relu'))
sentiment_model.add(Dense(512, activation='relu'))
sentiment_model.add(Dense(256, activation='relu'))
sentiment_model.add(Dense(128, activation='relu'))
sentiment_model.add(Dense(3, activation='softmax'))
sentiment_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [84]:
sentiment_tokenized = pd.DataFrame(tokenizer.texts_to_matrix(df.sentiment_terms))
label_encoder = LabelEncoder()
integer_sentiment = label_encoder.fit_transform(df.polarity)
dummy_sentiment = to_categorical(integer_sentiment)

In [85]:
sentiment_model.fit(sentiment_tokenized, dummy_sentiment, epochs=20, verbose=1)


# Test

In [56]:
test_df = parse_xml('Laptops_Test_p2_gold.xml')
test_df = process_text(test_df)

100%|██████████| 545/545 [00:00<00:00, 959.56it/s] 


In [57]:
aspect_df = test_df.copy()
nlp = spacy.load('en_core_web_sm')

aspect_terms = []
for i, review in enumerate(nlp.pipe(aspect_df.processed)):
    chunks = [(chunk.root.text) for chunk in review.noun_chunks if chunk.root.pos_ == 'NOUN']
    if not len(chunks):
        aspect_df.drop(i, inplace=True)
        continue
    aspect_terms.append(' '.join(chunks))

aspect_df['aspect terms'] = aspect_terms

In [58]:
sentiment_df = test_df.copy()
sentiment_terms = []
rem_ind = []
for i, review in enumerate(nlp.pipe(sentiment_df['processed'])):
    chunks = [token.lemma_ for token in review if (not token.is_stop and not token.is_punct and (token.pos_ == "ADJ" or token.pos_ == "VERB"))]
    if not len(chunks):
        rem_ind.append(i)
        continue
    sentiment_terms.append(' '.join(chunks))
sentiment_df = sentiment_df.drop(rem_ind)
sentiment_df['sentiment_terms'] = sentiment_terms

In [59]:
vocab_size = 6000 
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(df.text)
aspect_tokenized = pd.DataFrame(tokenizer.texts_to_matrix(aspect_df['aspect terms']))

label_encoder = LabelEncoder()
integer_entity = label_encoder.fit_transform(aspect_df.entity)
entity_category = to_categorical(integer_entity)
integer_attribute = label_encoder.fit_transform(aspect_df.attribute)
attribute_category = to_categorical(integer_attribute)

In [60]:
sentiment_tokenized = pd.DataFrame(tokenizer.texts_to_matrix(sentiment_df.sentiment_terms))
label_encoder = LabelEncoder()
integer_sentiment = label_encoder.fit_transform(sentiment_df.polarity)
dummy_sentiment = to_categorical(integer_sentiment)

In [61]:
entity_categories_model.evaluate(aspect_tokenized, entity_category)



[5.485455513000488, 0.46972477436065674]

In [62]:
attribute_categories_model.evaluate(aspect_tokenized, attribute_category)

Here, I check whether the maximum value of the prediction vector is at least 0.5, i.e. whether the model is confident in its prediction. If it is, I keep the predicted class and use the label encoder's `inverse_transform` to recover the original label. If not, I label the example as 'conflict'. I then calculate the accuracy.

In [106]:
y = sentiment_model.predict(sentiment_tokenized)
preds = []
for i in range(len(y)):
    m = max(y[i])
    if m<0.5:
        preds.append('conflict')
    else:
        a = np.argmax(y[i])
        preds.append(str(label_encoder.inverse_transform([a])[0]))
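
The thresholding rule can be isolated into a small function and checked on hand-made probability vectors. This is a sketch with a hypothetical function name and a hypothetical class order; it uses plain lists so it runs without the trained model.

```python
def decode_with_threshold(probabilities, classes, threshold=0.5, fallback='conflict'):
    """Pick the most probable class, or the fallback when the top probability is below threshold."""
    preds = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)  # index of the largest probability
        preds.append(classes[best] if row[best] >= threshold else fallback)
    return preds


classes = ['negative', 'neutral', 'positive']  # assumed class order, for illustration only
probs = [
    [0.10, 0.10, 0.80],  # confident -> positive
    [0.40, 0.35, 0.25],  # no class reaches 0.5 -> conflict
    [0.50, 0.30, 0.20],  # exactly 0.5 counts as confident -> negative
]
decoded = decode_with_threshold(probs, classes)
print(decoded)  # ['positive', 'conflict', 'negative']
```

Because a three-way softmax can spread its mass evenly, a low maximum signals uncertainty in general, not specifically mixed sentiment, so treating it as 'conflict' is a heuristic.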



In [109]:
# Note: df is the training frame from In [82]; the test labels are in sentiment_df.polarity.
print(accuracy_score(list(df.polarity), preds))

0.8710033076074972


# Final thoughts

This can of course be improved a lot. For now, the model predicts only one aspect-sentiment pair per query, as I was not able to build a model that outputs multiple pairs for the same query. Using a pre-trained model would likely also yield much better accuracy. As for the preprocessing, a better spell-checker could have been used.
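
As one possible direction for the multiple-pairs limitation (not something tried in this notebook), the targets could be encoded as multi-hot vectors and each label scored independently, which is how multi-label classification is commonly set up. The sketch below only shows the encoding and decoding side in plain Python; the function names, labels, and scores are hypothetical.

```python
def multi_hot(label_sets, classes):
    """Encode each set of labels as a 0/1 vector with one position per class."""
    index = {c: i for i, c in enumerate(classes)}
    rows = []
    for labels in label_sets:
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1
        rows.append(row)
    return rows


def decode_multi(scores, classes, threshold=0.5):
    """Return every class whose independent score reaches the threshold."""
    return [[c for c, s in zip(classes, row) if s >= threshold] for row in scores]


classes = ['BATTERY#QUALITY', 'DISPLAY#QUALITY', 'LAPTOP#GENERAL']
targets = multi_hot([{'BATTERY#QUALITY', 'DISPLAY#QUALITY'}, {'LAPTOP#GENERAL'}], classes)
print(targets)  # [[1, 1, 0], [0, 0, 1]]

# Hypothetical per-class scores, e.g. from independent sigmoid outputs
scores = [[0.9, 0.7, 0.1], [0.2, 0.4, 0.3]]
decoded_pairs = decode_multi(scores, classes)
print(decoded_pairs)  # [['BATTERY#QUALITY', 'DISPLAY#QUALITY'], []]
```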