# Modeling: Aspect-Based Sentiment Analysis [Simplistic]
Conducting aspect-based sentiment analysis with [ABSA package by Scala Consultants](https://github.com/ScalaConsultants/Aspect-Based-Sentiment-Analysis)

**`Goal:`** 

Conduct ABSA using word relatedness and an out-of-the-box ABSA package. This notebook is meant to serve as a starting point for tweet aspect annotation by capturing as many of the referenced aspects, and their corresponding sentiments, as possible. 

**Note:** Results will be crosschecked during the annotation phase!

**`Process:`** 
1. List the aspects (e.g. speed, price, reliability) determined during the earlier data annotation phase
2. Extract nouns, adjectives and adverbs from the tweets, as these are the parts of speech most likely to make meaningful reference to aspects
3. Check whether each word from step 2 is closely related to any of the aspects (e.g. speed [aspect] and fast [word in tweet]) by computing a relatedness score (via word embeddings)
4. If the relatedness score meets a set threshold, assume the aspect was referenced in the tweet. Note down that the aspect category was referenced in that tweet, along with the word (herein called the aspect term) that implied the aspect
5. Conduct ABSA using the ABSA package with the tweet and the aspect term, and note the sentiment (positive, negative or neutral) towards the main aspect (price, speed, etc.)
6. If multiple words reference a single aspect, average their sentiment scores and use the result to assign a single sentiment
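
Steps 3 and 4 can be sketched in isolation as follows. This is a minimal illustration, not the notebook's actual pipeline: the 3-dimensional vectors are invented for the example (real spaCy vectors have hundreds of dimensions), and `cosine_similarity`, `toy_vectors` and `implied_aspects` are hypothetical names.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy vectors, invented purely for illustration (NOT real embeddings)
toy_vectors = {
    "speed": [0.9, 0.1, 0.0],
    "price": [0.0, 0.2, 0.9],
    "fast": [0.8, 0.3, 0.1],
    "banana": [0.1, 0.9, 0.1],
}

def implied_aspects(word, aspects, vectors, threshold=0.5):
    """Return the aspects whose similarity to `word` meets the threshold."""
    return [
        aspect
        for aspect in aspects
        if cosine_similarity(vectors[word], vectors[aspect]) >= threshold
    ]

print(implied_aspects("fast", ["speed", "price"], toy_vectors))    # ['speed']
print(implied_aspects("banana", ["speed", "price"], toy_vectors))  # []
```

With these toy values, "fast" is treated as an aspect term for speed, while "banana" clears the threshold for neither aspect and is ignored.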

In [1]:
# python -m spacy download en_core_web_md
# python -m spacy download en_core_web_lg

### 1. Library Importation

In [1]:
import pandas as pd
import numpy as np
import re
import aspect_based_sentiment_analysis as absa
import nltk
from nltk import pos_tag, RegexpParser

#Packages for word relatedness computation
import spacy
spacy_nlp = spacy.load('en_core_web_lg')

from itertools import product
from cleantext import clean

2021-11-16 11:34:43.674381: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


### 2. Loading the data

In [2]:
df = pd.read_csv('../data/processed/sample_encoded_and_cleaned_no_punct.csv')

In [3]:
df.head()

Unnamed: 0,ISP_Name,Time,Text,Source,sentiment,label
0,sprectranet,2020-02-04 18:30:35+00:00,my family used my spectranet and they dont wan...,Twitter for Android,Neutral,1
1,sprectranet,2019-06-19 04:59:49,spectranetng how can i get the freedom mifi in...,Twitter for iPhone,Neutral,1
2,sprectranet,2020-03-30 07:57:38+00:00,drolufunmilayo iconicremi spectranetng,Twitter for iPhone,Neutral,1
3,sprectranet,2020-12-31 21:07:52+00:00,spectranetng your response just proves how hor...,Twitter for Android,Negative,0
4,sprectranet,2020-09-03 23:09:09+00:00,spectranet is just the worse tbh i cant even w...,Twitter for iPhone,Negative,0


In [4]:
df.sentiment.value_counts()

Negative    216
Neutral     133
Positive     29
Name: sentiment, dtype: int64

In [10]:
#Load the model for ABSA modeling
nlp = absa.load()

Some layers from the model checkpoint at absa/classifier-rest-0.2 were not used when initializing BertABSClassifier: ['dropout_379']
- This IS expected if you are initializing BertABSClassifier from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertABSClassifier from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of BertABSClassifier were not initialized from the model checkpoint at absa/classifier-rest-0.2 and are newly initialized: ['dropout_75']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [11]:
#1. List aspects determined during the annotation phase
    #Note: This might not be exhaustive! But it should cover most cases. It is also subjective!
    #Also using synonyms of these words will likely yield different results
aspects = ['price','speed','reliability','coverage', 'customer service']

#2. Pair aspects with their tokenized form to avoid recomputation in the ABSA phase below
aspects_with_token = [] #List to store the pairing

#Iterate through the aspects and compute their word vector using spacy
for aspect in aspects:
    aspects_with_token.append((aspect,spacy_nlp(aspect)))
    
aspects_with_token

[('price', price),
 ('speed', speed),
 ('reliability', reliability),
 ('coverage', coverage),
 ('customer service', customer service)]

In [None]:
#Set to store all seen words
seen_words = set()

#Set to store all aspect implying words found – to avoid recomputing similarity scores
aspect_implying_words_glob = set()

#Dictionary categorizing all aspect-implying words into their relevant aspects
aspects_with_implying_words = {'price':set(),'speed':set(),'reliability':set(),
                               'coverage':set(), 'customer service':set()}

#List to store detected aspects and their sentiments
df_list = []

#Similarity threshold
sim_thresh = 0.5

#Iterate through all the tweets
for tweet in df.Text:
    
    #Set to store the detected aspects at the sentence level
    # detected_aspects = set()
    
    #Dictionary to store the sentiment value for each seen aspect
    sentence_lvl_aspect_sentiment = {'price':[],'speed':[],'reliability':[],
                                     'coverage':[], 'customer service':[]}
        
    #Split the tweet into words
    text = tweet.split()

    #Tag words with part of speech
    tokens_tag = pos_tag(text)

    #Iterate through all the tagged words
    for token in tokens_tag:
        
        #Get the current word in focus
        word_in_focus = token[0]
        
        #If the word has been seen before
        if word_in_focus in seen_words:
            
            #Check if the word is an aspect-implying word
            if word_in_focus in aspect_implying_words_glob:
                
                #List to store all the aspects found to be related to the given word/token
                aspects_implied = []
            
                #If it is an aspect-implying word, iterate through all the aspects
                for aspect in aspects_with_implying_words.keys():
                    
                    #Check if the word_in_focus was noted as a word implying the aspect
                    if word_in_focus in aspects_with_implying_words[aspect]:
                        
                        #Get all the aspects the word_in_focus implies
                        aspects_implied.append(aspect)
                        
            
            else:
                continue
                    
         
        #If the word hasn't been seen before
        else:
            
            #Mark the word as seen now
            seen_words.add(word_in_focus)
        
            #Check if the tagged word is a noun, adjective or adverb
            regex_match = re.match('NN.?|JJ.?|RB.?',token[1])

            #If it is one of the mentioned parts of speech
            if regex_match:
                
                #List to store all the aspects found to be related to the given word/token
                #Ideally a given word won't imply multiple of the aspects as they are fairly independent
                #-but just in case 
                aspects_implied = []
                
                #Translate word_in_focus to word vector (once, rather than once per aspect)
                spacy_token = spacy_nlp(word_in_focus)

                #Iterate through all the aspects
                for aspect,asp_token in aspects_with_token:

                    #Compute the similarity between the two word vectors (i.e. the two words)
                    #Round to 1 d.p.
                    similarity_score = round(asp_token.similarity(spacy_token),1)
                        
                    #If the similarity score is greater than the threshold
                    if similarity_score > sim_thresh:

                        #Add the word to the set of all aspect-implying words seen
                        aspect_implying_words_glob.add(word_in_focus)

                        #Add the word to the dictionary of the relevant aspect word
                        aspects_with_implying_words[aspect].add(word_in_focus)
                        
                        #Note that the aspect has been found in this particular sentence
                        # detected_aspects.add(aspect)

                        #Add the aspect to the list of aspects that the word_in_focus implies
                        aspects_implied.append(aspect)
                        
                        
                     
                    #If the word is not an aspect implying word, continue to next word
                    else:
                        
                        continue
            
            else:
                continue
                
        
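        #Skip words that imply no aspect, to avoid needless (and slow) ABSA calls
        if not aspects_implied:
            continue

        #Calculate the sentiment scores for the aspect-implying word in the current sentence
        sentiment = nlp(tweet ,aspects = [word_in_focus])
        sentiment_scores = sentiment.subtasks[word_in_focus].examples[0].scores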

        #Note down the scores for all the implied aspects
        for aspect in aspects_implied:
            sentence_lvl_aspect_sentiment[aspect].append(sentiment_scores)
    
    #List to store the detected aspects from the sentence
    detected_aspects = []
    
    #List to store the determined sentiments of the detected aspects
    detected_sentiments = []
    
    #Iterate through all the aspects
    for aspect in sentence_lvl_aspect_sentiment.keys():
        
        #If the aspect was detected in the sentence
        if sentence_lvl_aspect_sentiment[aspect]:
            
            #Record this
            detected_aspects.append(aspect)
            
            #Calculate the average sentiment scores across the different terms
            avg_senti_score = np.array(sentence_lvl_aspect_sentiment[aspect]).mean(axis=0)
            
            #Get the sentiment category (neutral,negative,positive) with the largest probability
            max_idx = np.argmax(avg_senti_score)

            if max_idx == 2:

                detected_sentiments.append("Positive")

            elif max_idx == 1:

                detected_sentiments.append("Negative")

            else:

                detected_sentiments.append("Neutral")
    
    #Add the detected aspects and sentiments from the sentence to the list
    if detected_aspects:
        df_list.append([tweet,detected_aspects,detected_sentiments])
    else:
        df_list.append([tweet,None,None])
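
A note on the threshold check: this cell rounds the similarity to 1 d.p. and compares with `>`, while the later version of the cell compares with `>=`. The sketch below (using a hypothetical helper, `passes_rounded`, and made-up scores) shows how rounding before comparing shifts the effective cut-off in each case.

```python
def passes_rounded(score, threshold=0.5, inclusive=False):
    """Round the score to 1 d.p. first, then compare against the threshold."""
    rounded = round(score, 1)
    return rounded >= threshold if inclusive else rounded > threshold

# Made-up similarity scores around the 0.5 threshold
for score in (0.46, 0.54, 0.56):
    print(
        score,
        "rounded, > :", passes_rounded(score),
        "| rounded, >= :", passes_rounded(score, inclusive=True),
        "| raw, > :", score > 0.5,
    )
```

With `>`, a raw score of 0.54 rounds down to 0.5 and is rejected; with `>=`, a raw score of 0.46 rounds up to 0.5 and is accepted. Comparing the unrounded score would avoid both effects.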

            

In [28]:
#Set to store all seen words
seen_words = set()

#Set to store all aspect implying words found – to avoid recomputing similarity scores
aspect_implying_words_glob = set()

#Dictionary categorizing all aspect-implying words into their relevant aspects
aspects_with_implying_words = {'price':set(),'speed':set(),'reliability':set(),
                               'coverage':set(), 'customer service':set()}

#List to store detected aspects and their sentiments
df_list = []

#Similarity threshold
sim_thresh = 0.5

#Chunk tags to match – i.e. parts of speech to extract
CHUNK_TAG = """
MATCH: {<NN>+|<NN.*>+}
{<JJ.*>?}
{<RB.*>?}
"""

#Initialize chunk tag parser
cp = nltk.RegexpParser(CHUNK_TAG)

#Iterate through all the tweets
for tweet in df.Text:
    
    #Set to store the detected aspects at the sentence level
    # detected_aspects = set()
    
    #Dictionary to store the sentiment value for each seen aspect
    sentence_lvl_aspect_sentiment = {'price':[],'speed':[],'reliability':[],
                                     'coverage':[], 'customer service':[]}
        
    #Split the tweet into words
    text = tweet.split()

    #Tag the words with their part of speech
    tokens_tag = pos_tag(text)
    
    #Get the words with relevant POS (noun, adverbs, adjectives)
    chunk_result = cp.parse(tokens_tag)
    
    #Extract chunk results from tree into list 
    chunk_items = [list(n) for n in chunk_result if isinstance(n, nltk.tree.Tree)]
    
    #Finally fuse/extract chunked words to get (noun) phrases, nouns, adverbs, adjectives
    #1. List to store the words
    matched_words = []
    
    #2. Iterate through the chunked words lists and get the relevant words
    for item in chunk_items:
        if len(item) > 1:
            full_string = []

            for word in item:
                full_string.append(word[0])

            matched_words.append(' '.join(full_string))

        else:
            matched_words.append(item[0][0])
        
    #Iterate through all the words
    for word_in_focus in matched_words:
        
        #If the word has been seen before
        if word_in_focus in seen_words:
            
            #Check if the word is an aspect-implying word
            if word_in_focus in aspect_implying_words_glob:
                
                #List to store all the aspects found to be related to the given word/token
                aspects_implied = []
            
                #If it is an aspect-implying word, iterate through all the aspects
                for aspect in aspects_with_implying_words.keys():
                    
                    #Check if the word_in_focus was noted as a word implying the aspect
                    if word_in_focus in aspects_with_implying_words[aspect]:
                        
                        #Get all the aspects the word_in_focus implies
                        aspects_implied.append(aspect)
                        
            
            else:
                continue
                    
         
        #If the word hasn't been seen before
        else:
            
            #Mark the word as seen now
            seen_words.add(word_in_focus)
                
            #List to store all the aspects found to be related to the given word/token
            #Ideally a given word won't imply multiple of the aspects as they are fairly independent
            #-but just in case 
            aspects_implied = []

            #Translate word_in_focus to word vector (once, rather than once per aspect)
            spacy_token = spacy_nlp(word_in_focus)

            #Iterate through all the aspects
            for aspect,asp_token in aspects_with_token:

                #Compute the similarity between the two word vectors (i.e. the two words)
                #Round to 1 d.p.
                similarity_score = round(asp_token.similarity(spacy_token),1)

                #If the similarity score meets the threshold
                if similarity_score >= sim_thresh:

                    #Add the word to the set of all aspect-implying words seen
                    aspect_implying_words_glob.add(word_in_focus)

                    #Add the word to the dictionary of the relevant aspect word
                    aspects_with_implying_words[aspect].add(word_in_focus)

                    #Note that the aspect has been found in this particular sentence
                    # detected_aspects.add(aspect)

                    #Add the aspect to the list of aspects that the word_in_focus implies
                    aspects_implied.append(aspect)



                #If the word is not an aspect implying word, continue to next word
                else:

                    continue
                
        
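        #Skip words that imply no aspect, to avoid needless (and slow) ABSA calls
        if not aspects_implied:
            continue

        #Calculate the sentiment scores for the aspect-implying word in the current sentence
        sentiment = nlp(tweet ,aspects = [word_in_focus])
        sentiment_scores = sentiment.subtasks[word_in_focus].examples[0].scores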

        #Note down the scores for all the implied aspects
        for aspect in aspects_implied:
            sentence_lvl_aspect_sentiment[aspect].append(sentiment_scores)
    
    #List to store the detected aspects from the sentence
    detected_aspects = []
    
    #List to store the determined sentiments of the detected aspects
    detected_sentiments = []
    
    #Iterate through all the aspects
    for aspect in sentence_lvl_aspect_sentiment.keys():
        
        #If the aspect was detected in the sentence
        if sentence_lvl_aspect_sentiment[aspect]:
            
            #Record this
            detected_aspects.append(aspect)
            
            #Calculate the average sentiment scores across the different terms
            avg_senti_score = np.array(sentence_lvl_aspect_sentiment[aspect]).mean(axis=0)
            
            #Get the sentiment category (neutral,negative,positive) with the largest probability
            max_idx = np.argmax(avg_senti_score)

            if max_idx == 2:

                detected_sentiments.append("Positive")

            elif max_idx == 1:

                detected_sentiments.append("Negative")

            else:

                detected_sentiments.append("Neutral")
    
    #Add the detected aspects and sentiments from the sentence to the list
    if detected_aspects:
        df_list.append([tweet,detected_aspects,detected_sentiments])
    else:
        df_list.append([tweet,None,None])
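
The averaging step at the end of the loop can be illustrated on its own. The sketch below assumes, as the index mapping in the cell above implies, that each score list is ordered `[neutral, negative, positive]`; the scores themselves are made up and `average_sentiment` is a hypothetical helper.

```python
# Label order assumed from the index mapping used in the loop above
LABELS = ("Neutral", "Negative", "Positive")

def average_sentiment(score_lists):
    """Average per-term score lists element-wise and pick the top label."""
    n_terms = len(score_lists)
    averaged = [sum(column) / n_terms for column in zip(*score_lists)]
    best_idx = max(range(len(averaged)), key=averaged.__getitem__)
    return LABELS[best_idx], averaged

# Two aspect terms in one tweet, both implying the same aspect (made-up scores)
label, averaged = average_sentiment([[0.1, 0.7, 0.2], [0.2, 0.3, 0.5]])
print(label)  # Negative
```

Here the second term alone would be labelled positive, but the strongly negative first term dominates the average, so the aspect is recorded as negative.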

            



In [33]:
aspects_with_implying_words

{'price': {'buy spectranetng',
  'price',
  'purchase',
  'value',
  'value spectranet'},
 'speed': {'download speed',
  'fast',
  'internet speed',
  'slow',
  'slower',
  'snail speed',
  'speed',
  'speed abeg',
  'speeds'},
 'reliability': {'network quality', 'reliable', 'usefulness'},
 'coverage': {'coverage', 'insurance claim', 'network coverage'},
 'customer service': {'business',
  'company',
  'customer',
  'customer care',
  'customer care line isnt',
  'customer service',
  'customer service experience',
  'customers',
  'disgusting customer service',
  'ifes business',
  'internet connection',
  'internet service',
  'internet service provider',
  'isp business',
  'network provider',
  'network quality',
  'network reception',
  'network service',
  'provider i',
  'providers',
  'reliable',
  'service',
  'service i',
  'service provider',
  'service subscription failure',
  'services',
  'services i',
  'spectranet ltd internet subscription n18525',
  'teleport service',

In [29]:
absa_df = pd.DataFrame(df_list, 
                       columns=['Tweets','Detected aspects','Corresponding sentiment'])

In [30]:
absa_df

Unnamed: 0,Tweets,Detected aspects,Corresponding sentiment
0,my family used my spectranet and they dont wan...,,
1,spectranetng how can i get the freedom mifi in...,,
2,drolufunmilayo iconicremi spectranetng,,
3,spectranetng your response just proves how hor...,[customer service],[Negative]
4,spectranet is just the worse tbh i cant even w...,,
...,...,...,...
373,spectranet unlimited value for money,[price],[Positive]
374,from 30th may to date mtn mifi 10k spectranet ...,,
375,spectranetng fritzthejanitor will they help me...,,
376,thefunkydee spectranetng im giving spectranetn...,,


In [31]:
absa_df[absa_df['Detected aspects'].notnull()]['Corresponding sentiment'].value_counts()

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 5231, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


[Negative]                        41
[Positive]                        18
[Neutral]                         11
[Negative, Negative]               4
[Negative, Neutral]                1
[Positive, Positive, Positive]     1
Name: Corresponding sentiment, dtype: int64
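
The `TypeError` above arises because Python lists are unhashable. One way to sidestep it, sketched here with the standard library on made-up rows rather than on `absa_df` itself, is to convert each list to a tuple before counting.

```python
from collections import Counter

# Made-up stand-in for the 'Corresponding sentiment' column (None = no aspect detected)
rows = [["Negative"], ["Positive"], None, ["Negative", "Negative"], ["Negative"]]

# Tuples are hashable, so they can be counted directly
counts = Counter(tuple(row) for row in rows if row is not None)

for combination, count in counts.most_common():
    print(combination, count)
```

In pandas, the same idea would presumably be `absa_df['Corresponding sentiment'].dropna().apply(tuple).value_counts()`.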

In [32]:
absa_df.to_csv('../data/model-generated/tweet_absa_default.csv',index=False)

---


- Struggles with connotations and implied meanings. For example, it detected no aspects in the following tweets, even though the first hints at speed and the second at reliability:
    - 'spectranet is just the worse tbh i cant even watch a 5min video without serious lagging'
    - 'day or night spectranetng remain useless' -> Hinting that it is not a reliable service
 
- Struggles to detect references to price or speed when they are expressed as metrics, e.g. an actual monetary cost in naira or a speed in kbps:
    - depend on the kind of mifi smile is 9800 spectranet is like 15k -> Did not detect price
    - wifisupport1 tizeti well done ooo 2 hours to download 9mb but na broadband
    
- Sometimes detects an aspect accurately, but that alone isn't enough: we want the aspect attributed to a particular ISP. E.g.:
    - adefola09 they said they dont have coverage at my side oo thinking of getting ipnx -> It detects coverage here, but the tweet doesn't speak to coverage concerns for ipnx
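
One possible complement for the metric problem noted above is a rule-based check that runs alongside the embedding similarity. The sketch below is an assumption, not something this notebook implements: the patterns (`PRICE_PATTERN`, `SPEED_PATTERN`) and the helper `metric_aspects` are hypothetical and only cover a few obvious forms such as `15k`, `n18525` or `200 kbps`.

```python
import re

# Hypothetical patterns: amounts like '15k', 'n18525' or '5000 naira'
PRICE_PATTERN = re.compile(r"\b(?:\d+(?:\.\d+)?k|n\d[\d,]*|\d[\d,]*\s?naira)\b")

# Hypothetical patterns: rates like '200 kbps', '5mbps' or '2 mb/s'
SPEED_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\s?(?:kbps|mbps|gbps|kb/s|mb/s)\b")

def metric_aspects(tweet):
    """Return the aspects hinted at by numeric price or speed metrics."""
    found = set()
    if PRICE_PATTERN.search(tweet):
        found.add("price")
    if SPEED_PATTERN.search(tweet):
        found.add("speed")
    return found

print(metric_aspects("depend on the kind of mifi smile is 9800 spectranet is like 15k"))
print(metric_aspects("wifisupport1 tizeti well done ooo 2 hours to download 9mb but na broadband"))
```

The first tweet is flagged for price via `15k` (the bare `9800` is not matched). The second tweet is still missed, because its speed complaint is implied by a duration and a file size rather than stated as a rate, so rules like these would only go part of the way.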
    
Struggles with contextual meaning of words:

- Any reference to 'service' is automatically linked to customer service (since the similarity between the two is very high), yet customers sometimes use the word 'service' to refer to network quality or coverage (see below). This is a limitation of relying on word relatedness:
    - omo ive been wallowing in ignoranceappaz my spectranet sub had elapsed 100gb and i didnt even realize and ive been ranting about **bad service** meanwhile they have another package for 19k that is 500gb and ive been doing 100gb for 18k e be things o
    - ripples143 wifisupport1 tizeti they have a lot of downtime like a lot they are making it seem as though its **free service** i ducking pay for this you are not doing me a favour
    
- Similarly, any reference to speed is assumed to refer to network speed


There is no real explicit reference to reliability, as it is often inferred from downtime or long stretches of poor service.
    
**Note:** Only tweets that referred clearly to an aspect of the ISP's service were marked down. Some were vague, so the aspect couldn't be determined, e.g. 'airtel and tizeti have failed me'


Huge overlap between coverage, speed, and reliability.
    
There is also the issue that users may misattribute the cause of their problem. For example, one tweet says 'can't even get internet to do my work.' It could very well be that there is coverage (i.e. an internet signal) but the speed isn't ideal.
    

## Terminology

Network availability is the percentage of time the infrastructure is operational during a given time period. ... Network reliability tracks how long the infrastructure is functional without interruption.


Network availability provides a glimpse into infrastructure accessibility, while network reliability highlights how well the infrastructure runs to support functional processes.


https://www.techtarget.com/searchnetworking/answer/Whats-the-difference-between-network-availability-and-reliability#:~:text=Although%20the%20terms%20are%20sometimes,during%20a%20given%20time%20period.&text=Network%20reliability%20tracks%20how%20long%20the%20infrastructure%20is%20functional%20without%20interruption.
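
To make the distinction concrete, the sketch below computes both quantities for two made-up outage logs. It assumes non-overlapping outages given as `(start_hour, end_hour)` pairs, and uses the mean uninterrupted uptime stretch as a rough stand-in for reliability; `availability_and_mean_uptime` is a hypothetical helper.

```python
def availability_and_mean_uptime(period_hours, outages):
    """Return (availability in %, mean uninterrupted uptime stretch in hours).

    Assumes `outages` are non-overlapping (start, end) pairs within the period.
    """
    outages = sorted(outages)
    downtime = sum(end - start for start, end in outages)
    availability = 100 * (period_hours - downtime) / period_hours

    # Collect the stretches of uptime between consecutive outages
    stretches = []
    cursor = 0.0
    for start, end in outages:
        if start > cursor:
            stretches.append(start - cursor)
        cursor = max(cursor, end)
    if cursor < period_hours:
        stretches.append(period_hours - cursor)

    mean_uptime = sum(stretches) / len(stretches) if stretches else 0.0
    return availability, mean_uptime

# Two made-up networks, each down for 2 hours in a 24-hour day
one_long_outage = availability_and_mean_uptime(24, [(10, 12)])
four_short_outages = availability_and_mean_uptime(
    24, [(3, 3.5), (9, 9.5), (15, 15.5), (21, 21.5)]
)
print(one_long_outage)
print(four_short_outages)
```

Both networks have the same availability (about 91.7%), but the second is interrupted far more often, so its average uninterrupted stretch is much shorter (4.4 hours versus 11). That is the kind of difference a tweet complaining about constant drops would be pointing at.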

---

Simply put, network reliability signifies the ability of a network to minimize the scope and frequency of network incidents, continue operations while under pressure and recover as quickly as possible.
- Downtime: How much time does your network take to recover from incidents? How does it graph over time?
- Failure frequency: Frequency with which your network fails to act or respond the way it is designed to.


Simply put, network reliability highlights your network’s ability to run the infrastructure and support core processes whereas network availability just provides a measure of infrastructure accessibility.

https://www.newchartertech.com/network-reliability-the-invisible-driver-of-business-productivity/


**Reliability:** Ability of a system or component to perform its required functions under stated conditions for a specified period of time. 

https://www.iiconsortium.org/pdf/Trustworthiness_Framework_Foundations.pdf

---
## Thoughts

- Include notion of honesty or trust?
- The word relatedness usage is only as good as the word embedding – scam and dishonest were not recognized as being very related.
    - **Solution:** Try other word embeddings? GloVe perhaps
- Add device-related feature e.g. device#battery? Based on the following tweet:
    - im essentially grateful for my spectranet battery strength health that shit can go 910 hours on full charge its amazing
- Network generally?
- How might we make it learn speed? E.g. kbps, etc.

In [38]:
aspects

['price', 'speed', 'reliability', 'coverage', 'customer service']

In [37]:
absa_df['Detected aspects']

0                    None
1                    None
2                    None
3      [customer service]
4                    None
              ...        
373               [price]
374                  None
375                  None
376                  None
377                  None
Name: Detected aspects, Length: 378, dtype: object

In [None]:
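def micro_precision(true_aspects,aspect_preds):
    """Micro-averaged precision: pooled TP / (TP + FP) across all aspects.

    NOTE: The original cell was left unfinished; the body below is an assumed
    completion, not a verified implementation.
    """
    
    #Initialize counters for true and false positives (pooled across all aspects)
    TP,FP = 0, 0
    
    #Iterate through all the aspects
    for aspect in aspects:
        
        #Iterate through all the tweets
        for idx in range(len(aspect_preds)):
            
            #Treat missing annotations/predictions (None) as "no aspects"
            predicted = aspect in (aspect_preds[idx] or [])
            actual = aspect in (true_aspects[idx] or [])
            
            #Count each predicted aspect as a true or a false positive
            if predicted and actual:
                TP += 1
            elif predicted:
                FP += 1
    
    #Precision is undefined when nothing was predicted; return 0 in that case
    return TP / (TP + FP) if (TP + FP) else 0.0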