# Modeling: Aspect-Based Sentiment Analysis [Simplistic]

**`Goal:`** 

Conduct aspect-based sentiment analysis (ABSA) using word relatedness and the out-of-the-box [ABSA package by ScalaConsultants](https://github.com/ScalaConsultants/Aspect-Based-Sentiment-Analysis). This notebook serves as a starting point for tweet aspect annotation: it detects as many of the referenced aspects as possible, along with their corresponding sentiments.

**Note:** Results will be cross-checked during the annotation phase!

**`Process:`** 
1. List the aspects (e.g. speed, price, reliability) determined during the earlier data annotation phase
2. Extract nouns, adjectives and adverbs from the tweets, as these are the parts of speech most likely to make meaningful reference to aspects
3. Check whether each word from step 2 is closely related to any of the aspects (e.g. speed [aspect] and fast [word in tweet]) by computing a relatedness score (via word embeddings)
4. If the relatedness score meets a set threshold, assume the aspect was referenced in the tweet. Record that the aspect category was referenced in that tweet, and record the word (herein called the aspect term) that implied the aspect
5. Conduct ABSA using the ABSA package with the tweet and the aspect term, and record the sentiment (positive, negative or neutral) towards the main aspect (price, speed, etc.)
6. If multiple words refer to a single aspect, average their sentiment scores and use the average to assign a single sentiment
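
The thresholding idea in the relatedness steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the three-dimensional vectors are invented toy values rather than real word embeddings, and the helper names `cosine_similarity` and `implied_aspects` are hypothetical and do not appear in the notebook (which relies on spaCy for the similarity computation).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # A zero vector has no direction, so treat it as unrelated to everything
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def implied_aspects(word_vec, aspect_vecs, threshold=0.6):
    """Return the aspects whose similarity to word_vec meets the threshold."""
    return [aspect for aspect, vec in aspect_vecs.items()
            if cosine_similarity(word_vec, vec) >= threshold]

# Toy vectors, invented purely for illustration (not real embeddings)
toy_aspects = {"speed": [1.0, 0.0, 0.0], "price": [0.0, 1.0, 0.0]}
toy_fast = [0.9, 0.1, 0.0]    # points mostly along the "speed" direction
toy_money = [0.1, 0.9, 0.2]   # points mostly along the "price" direction
toy_cake = [0.0, 0.0, 1.0]    # unrelated to both aspects

print(implied_aspects(toy_fast, toy_aspects))
print(implied_aspects(toy_money, toy_aspects))
print(implied_aspects(toy_cake, toy_aspects))
```

A word that clears the threshold for an aspect becomes an aspect term for it; a word that clears it for none is ignored.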

In [1]:
# python -m spacy download en_core_web_md
# python -m spacy download en_core_web_lg

## 1. Library Importation

In [1]:
import pandas as pd
import numpy as np
import re
import aspect_based_sentiment_analysis as absa
import nltk
from nltk import pos_tag, RegexpParser

#Packages for word relatedness computation
import spacy
spacy_nlp = spacy.load('en_core_web_lg')

from itertools import product
from cleantext import clean

2021-11-27 21:03:59.379723: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


## 2. Define function for annotation

In [2]:
#Load the model for ABSA modeling
nlp = absa.load()

Some layers from the model checkpoint at absa/classifier-rest-0.2 were not used when initializing BertABSClassifier: ['dropout_379']
- This IS expected if you are initializing BertABSClassifier from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertABSClassifier from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of BertABSClassifier were not initialized from the model checkpoint at absa/classifier-rest-0.2 and are newly initialized: ['dropout_37']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [3]:
#1. List aspects determined during the annotation phase
    #Note: This might not be exhaustive! But it should cover most cases. It is also subjective!
    #Also using synonyms of these words will likely yield different results
aspects = ['price','speed','reliability','coverage', 'customer service', 'trustworthiness']

#2. Pair aspects with their tokenized form to avoid recomputation in the ABSA phase below
aspects_with_token = [] #List to store the pairing

#Iterate through the aspects and compute their word vector using spacy
for aspect in aspects:
    aspects_with_token.append((aspect,spacy_nlp(aspect)))
    
aspects_with_token

[('price', price),
 ('speed', speed),
 ('reliability', reliability),
 ('coverage', coverage),
 ('customer service', customer service),
 ('trustworthiness', trustworthiness)]

In [18]:
def tweet_annotator(df, col_name, similarity_threshold = 0.6):
    
    """
    Function to perform unsupervised annotation of tweets based on process outlined above
    
    Inputs:
        - df (pd DataFrame): A pandas dataframe to perform annotation on
        - col_name (str): The specific column in the dataframe containing the tweets to use for annotation
        - similarity_threshold (float): The threshold for aspect detection
        
    Output:
        - absa_df (pd DataFrame): DataFrame containing the tweets and their ABSA annotation (if relevant)
    
    """
    
    #Set to store all seen words
    seen_words = set()

    #Set to store all aspect implying words found – to avoid recomputing similarity scores
    aspect_implying_words_glob = set()

    #Dictionary categorizing all aspect-implying words into their relevant aspects
    aspects_with_implying_words = {'price':set(),'speed':set(),'reliability':set(),'coverage':set(), 
                                   'customer service':set(),'trustworthiness':set()}

    #List to store detected aspects and their sentiments
    df_list = []

    #Similarity threshold
    sim_thresh = similarity_threshold

    #Chunk tags to match – i.e. parts of speech to extract
    CHUNK_TAG = """
    MATCH: {<NN>+|<NN.*>+}
    {<JJ.*>?}
    {<RB.*>?}
    """

    #Initialize chunk tag parser
    cp = nltk.RegexpParser(CHUNK_TAG)

    #Iterate through all the tweets
    for tweet in df[col_name]:

        #Set to store the detected aspects at the sentence level
        # detected_aspects = set()

        #Dictionary to store the sentiment value for each seen aspect
        sentence_lvl_aspect_sentiment = {'price':[],'speed':[],'reliability':[],'coverage':[], 
                                         'customer service':[], 'trustworthiness':[]}

        #Split the tweet into words
        text = tweet.split()

        #Tag the words with their part of speech
        tokens_tag = pos_tag(text)

        #Get the words with relevant POS (noun, adverbs, adjectives)
        chunk_result = cp.parse(tokens_tag)

        #Extract chunk results from tree into list 
        chunk_items = [list(n) for n in chunk_result if isinstance(n, nltk.tree.Tree)]

        #Finally fuse/extract chunked words to get (noun) phrases, nouns, adverbs, adjectives
        #1. List to store the words
        matched_words = []

        #2. Iterate through the chunked words lists and get the relevant words
        for item in chunk_items:
            if len(item) > 1:
                full_string = []

                for word in item:
                    full_string.append(word[0])

                matched_words.append(' '.join(full_string))

            else:
                matched_words.append(item[0][0])

        #Iterate through all the words
        for word_in_focus in matched_words:

            #If the word has been seen before
            if word_in_focus in seen_words:

                #If it is not a known aspect-implying word, continue to the next word
                if word_in_focus not in aspect_implying_words_glob:
                    continue

                #List to store all the aspects found to be related to the word/token
                aspects_implied = []

                #Iterate through all the aspects
                for aspect in aspects_with_implying_words.keys():

                    #Check if the word_in_focus was noted as a word implying the aspect
                    if word_in_focus in aspects_with_implying_words[aspect]:

                        #Get all the aspects the word_in_focus implies
                        aspects_implied.append(aspect)

            #If the word hasn't been seen before
            else:

                #Mark the word as seen now
                seen_words.add(word_in_focus)

                #List to store all the aspects found to be related to the word/token
                #Ideally a given word won't imply multiple aspects as they are fairly independent
                #-but just in case
                aspects_implied = []

                #Translate word_in_focus to a word vector (once, outside the aspect loop)
                spacy_token = spacy_nlp(word_in_focus)

                #Iterate through all the aspects
                for aspect,asp_token in aspects_with_token:

                    #Compute the similarity between the two word vectors (i.e. the two words)
                    #Round to 1 d.p.
                    similarity_score = round(asp_token.similarity(spacy_token),1)

                    #If the similarity score meets the threshold
                    if similarity_score >= sim_thresh:

                        #Add the word to the set of all aspect-implying words seen
                        aspect_implying_words_glob.add(word_in_focus)

                        #Add the word to the dictionary entry of the relevant aspect
                        aspects_with_implying_words[aspect].add(word_in_focus)

                        #Add the aspect to the list of aspects that the word_in_focus implies
                        aspects_implied.append(aspect)

                #If the word is not an aspect-implying word, continue to the next word
                if not aspects_implied:
                    continue

            #Calculate the sentiment scores for the aspect-implying word in the current sentence
            sentiment = nlp(tweet, aspects = [word_in_focus])
            sentiment_scores = sentiment.subtasks[word_in_focus].examples[0].scores

            #Note down the scores for all the implied aspects
            for aspect in aspects_implied:
                sentence_lvl_aspect_sentiment[aspect].append(sentiment_scores)
    
        #List to store the detected aspects from the sentence
        detected_aspects = []

        #List to store the determined sentiments of the detected aspects
        detected_sentiments = []

        #Iterate through all the aspects
        for aspect in sentence_lvl_aspect_sentiment.keys():

            #If the aspect was detected in the sentence
            if sentence_lvl_aspect_sentiment[aspect]:

                #Record this
                detected_aspects.append(aspect)

                #Calculate the average sentiment scores across the different terms
                avg_senti_score = np.array(sentence_lvl_aspect_sentiment[aspect]).mean(axis=0)

                #Get the sentiment category (neutral,negative,positive) with the largest probability
                max_idx = np.argmax(avg_senti_score)

                if max_idx == 2:

                    detected_sentiments.append("Positive")

                elif max_idx == 1:

                    detected_sentiments.append("Negative")

                else:

                    detected_sentiments.append("Neutral")

        #Add the detected aspects and sentiments from the sentence to the list
        if detected_aspects:
            df_list.append([tweet,detected_aspects,detected_sentiments])
        else:
            df_list.append([tweet,None,None])


    absa_df = pd.DataFrame(df_list, 
                       columns=[col_name,'Detected aspects','Corresponding sentiment'])
    
    return absa_df
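
The final aggregation step in `tweet_annotator` (averaging the score vectors of all aspect terms that point to the same aspect, then taking the most probable class) can be illustrated in isolation. This is a minimal sketch with made-up score values; the helper `aggregate_sentiment` is hypothetical, and the label order (neutral, negative, positive) is assumed from the index mapping used in the function above.

```python
# Assumed label order, mirroring the index mapping in tweet_annotator:
# index 0 -> Neutral, index 1 -> Negative, index 2 -> Positive
LABELS = ("Neutral", "Negative", "Positive")

def aggregate_sentiment(score_lists):
    """Average per-term score vectors and return the most probable label.

    score_lists holds one [neutral, negative, positive] vector per aspect term.
    Returns None when no aspect term was found.
    """
    if not score_lists:
        return None
    n_terms = len(score_lists)
    # Column-wise mean across all aspect terms
    avg_scores = [sum(column) / n_terms for column in zip(*score_lists)]
    # Index of the largest averaged score
    best_idx = max(range(len(avg_scores)), key=avg_scores.__getitem__)
    return LABELS[best_idx]

# Made-up scores for two terms implying the same aspect
example_scores = [
    [0.1, 0.7, 0.2],  # first term leans negative
    [0.2, 0.3, 0.5],  # second term leans positive
]
print(aggregate_sentiment(example_scores))
```

Averaging before taking the maximum means one strongly negative term can outweigh a mildly positive one, which a simple majority vote over per-term labels would not capture.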

## 3. Annotating the data

### a. More Nigerian ISP data for annotation

#### (i) Loading the data

In [19]:
new_annotations = pd.read_csv('../data/interim/new_text_for_absa_annotation2.csv')
new_annotations.head()

Unnamed: 0,Text,Cleaned text
0,spectranet_ng is this even fair? i won't renew...,spectranetng is this even fair i wont renew ne...
1,eniolashitta youtube is where spectranet start...,eniolashitta youtube is where spectranet start...
2,oluwadamilolaog spectranet_ng my second device...,oluwadamilolaog spectranetng my second device ...
3,mtnng globacomnigeria gloworld airtelnigeria e...,mtnng globacomnigeria gloworld airtelnigeria e...
4,igalaman tizeti no one. and they will still co...,igalaman tizeti no one and they will still col...


#### (ii) Perform the ABSA annotation

In [20]:
newly_annotated_df = tweet_annotator(new_annotations, 'Cleaned text')
newly_annotated_df.insert(0,'Tweets',new_annotations.Text)



In [40]:
with pd.option_context('display.max_colwidth', None):
    display(newly_annotated_df)

Unnamed: 0,Tweets,Cleaned tweets,Detected aspects,Corresponding sentiment
0,spectranet_ng is this even fair? i won't renew next month and you people should not even bother calling me. i will curse you!,spectranetng is this even fair i wont renew next month and you people should not even bother calling me i will curse you,,
1,eniolashitta youtube is where spectranet starts to smile cos data will disappear fast fast 😭😭,eniolashitta youtube is where spectranet starts to smile cos data will disappear fast fast,[speed],[Negative]
2,"oluwadamilolaog spectranet_ng my second device , the big one .",oluwadamilolaog spectranetng my second device the big one,,
3,mtnng globacomnigeria gloworld airtelnigeria etisalat_care spectranet_ng can we please get a 50% cut off price on data? it's a crucial time now. we need a pay cut. #weneeddatapaycut #paycutdata segalink gidi_traffic tosinolugbenga omojuwa drjoeabah seunkuti housengr,mtnng globacomnigeria gloworld airtelnigeria etisalatcare spectranetng can we please get a 50 cut off price on data its a crucial time now we need a pay cut weneeddatapaycut paycutdata segalink giditraffic tosinolugbenga omojuwa drjoeabah seunkuti housengr,[price],[Negative]
4,igalaman tizeti no one. and they will still collect full money,igalaman tizeti no one and they will still collect full money,,
...,...,...,...,...
633,spectranet_ng m_customerfirst sure. thanks!,spectranetng mcustomerfirst sure thanks,,
634,riqueza_cakes get spectranet then .,riquezacakes get spectranet then,,
635,spectranet is always terrible at night. fix up ffs spectranet_ng,spectranet is always terrible at night fix up ffs spectranetng,,
636,spectranet_ng are we getting 100% today ?,spectranetng are we getting 100 today,,


In [22]:
#Write annotated dataframe to CSV
# newly_annotated_df.to_csv('../data/model-generated/tweet_absa_second_annotation.csv',index=False)

### b. Non-Nigerian ISP data for annotation (Analogous data)

#### (i) Loading the data

In [29]:
analogous_tweets = pd.read_csv('../data/interim/cleaned_analogous_tweets.csv')
analogous_tweets.head()

Unnamed: 0,Text,Cleaned text
0,iamrenike: the sexual tension between spectran...,iamrenike the sexual tension between spectrane...
1,spectranet or smile? which is more reliable?,spectranet or smile which is more reliable
2,"spectranet, and glo dey cook me seriously for ...",spectranet and glo dey cook me seriously for here
3,spectranet offer state-of-the-art dedicated li...,spectranet offer stateoftheart dedicated link ...
4,rhanty - lmao make i run the playstation plus....,rhanty lmao make i run the playstation plus sp...


#### (ii) Drop NAs

In [31]:
analogous_tweets = analogous_tweets.dropna()

#### (iii) Perform the ABSA annotation

In [35]:
analogous_annotated = tweet_annotator(analogous_tweets, 'Cleaned text')
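#Insert by position: after dropna() the index of analogous_tweets no longer matches
#the fresh index of analogous_annotated, so a Series would be misaligned on insert
analogous_annotated.insert(0,'Tweets',analogous_tweets.Text.to_numpy())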

In [36]:
analogous_annotated.head()

Unnamed: 0,Tweets,Cleaned text,Detected aspects,Corresponding sentiment
0,iamrenike: the sexual tension between spectran...,iamrenike the sexual tension between spectrane...,,
1,spectranet or smile? which is more reliable?,spectranet or smile which is more reliable,"[reliability, customer service]","[Positive, Positive]"
2,"spectranet, and glo dey cook me seriously for ...",spectranet and glo dey cook me seriously for here,,
3,spectranet offer state-of-the-art dedicated li...,spectranet offer stateoftheart dedicated link ...,,
4,rhanty - lmao make i run the playstation plus....,rhanty lmao make i run the playstation plus sp...,,


#Write annotated dataframe to CSV
analogous_annotated.to_csv('../data/model-generated/annotated_analogous_data.csv',index=False)