
# Welcome to the Sentiment Analysis Tutorial!

In this tutorial we explore three major approaches to sentiment analysis:

*   a dictionary-based approach ([VADER](https://ojs.aaai.org/index.php/ICWSM/article/view/14550))
*   a machine learning (ML) approach that runs efficiently on a CPU ([fastText](https://aclanthology.org/E17-2068/))
*   a machine learning (ML) approach that requires a GPU if used in production ([BERT](https://aclanthology.org/N19-1423/))


## Dictionary Based Approach

Dictionary-based approaches played an important role in sentiment analysis in the past, and they remain relevant alongside the current state of the art (SOTA). Their main advantages are that they are explainable and transparent: every score can be traced back to individual dictionary entries and rules. Current SOTA models such as BERT do not offer this kind of built-in explainability.

Explainability is not always a requirement, but in some areas one has to justify why a given text was marked as positive, neutral or negative.
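
To illustrate what explainable means here, the following minimal sketch scores a sentence with a tiny hand-made dictionary and reports which words contributed to the result. The lexicon `TOY_LEXICON`, its valence values and the function `score_with_explanation` are hypothetical and chosen for illustration only; VADER's real lexicon and rules are far more extensive.

```python
import re

# hypothetical mini-lexicon (illustrative values, not taken from VADER)
TOY_LEXICON = {"good": 2.0, "great": 3.0, "bad": -2.5, "horrible": -3.0}

def score_with_explanation(sentence):
    """Sum the valence of all known words and record each contribution."""
    contributions = []
    for token in re.findall(r"[a-z']+", sentence.lower()):
        if token in TOY_LEXICON:
            contributions.append((token, TOY_LEXICON[token]))
    total = sum(valence for _, valence in contributions)
    return total, contributions

total, why = score_with_explanation("The plot was good, but the ending was horrible.")
print(total)  # overall score
print(why)    # the words that produced it
```

Because the result is just a sum over dictionary entries, the list of contributions is a complete justification of the score.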

---

**Source**: *Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.*

In [1]:
# execute this cell to install the vaderSentiment package
!pip install vaderSentiment



In [2]:
# source: https://github.com/cjhutto/vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# --- examples -------
sentences = ["VADER is smart, handsome, and funny.",                      # positive sentence example
             "VADER is smart, handsome, and funny!",                      # punctuation emphasis handled correctly (sentiment intensity adjusted)
             "VADER is very smart, handsome, and funny.",                 # booster words handled correctly (sentiment intensity adjusted)
             "VADER is VERY SMART, handsome, and FUNNY.",                 # emphasis for ALLCAPS handled
             "VADER is VERY SMART, handsome, and FUNNY!!!",               # combination of signals - VADER appropriately adjusts intensity
             "VADER is VERY SMART, uber handsome, and FRIGGIN FUNNY!!!",  # booster words & punctuation make this close to ceiling for score
             "VADER is not smart, handsome, nor funny.",                  # negation sentence example
             "The book was good.",                                        # positive sentence
             "At least it isn't a horrible book.",                        # negated negative sentence with contraction
             "The book was only kind of good.",                           # qualified positive sentence is handled correctly (intensity adjusted)
             "The plot was good, but the characters are uncompelling and the dialog is not great.", # mixed negation sentence
             "Today SUX!",                                                # negative slang with capitalization emphasis
             "Today only kinda sux! But I'll get by, lol",                # mixed sentiment example with slang and contrastive conjunction "but"
             "Make sure you :) or :D today!",                             # emoticons handled
             "Catch utf-8 emoji such as such as 💘 and 💋 and 😁",        # emojis handled
             "Not bad at all"                                             # Capitalized negation
             ]

analyzer_vader = SentimentIntensityAnalyzer()
for sentence in sentences:
    vs = analyzer_vader.polarity_scores(sentence)
    print("{:-<65} {}".format(sentence, str(vs)))

VADER is smart, handsome, and funny.----------------------------- {'neg': 0.0, 'neu': 0.254, 'pos': 0.746, 'compound': 0.8316}
VADER is smart, handsome, and funny!----------------------------- {'neg': 0.0, 'neu': 0.248, 'pos': 0.752, 'compound': 0.8439}
VADER is very smart, handsome, and funny.------------------------ {'neg': 0.0, 'neu': 0.299, 'pos': 0.701, 'compound': 0.8545}
VADER is VERY SMART, handsome, and FUNNY.------------------------ {'neg': 0.0, 'neu': 0.246, 'pos': 0.754, 'compound': 0.9227}
VADER is VERY SMART, handsome, and FUNNY!!!---------------------- {'neg': 0.0, 'neu': 0.233, 'pos': 0.767, 'compound': 0.9342}
VADER is VERY SMART, uber handsome, and FRIGGIN FUNNY!!!--------- {'neg': 0.0, 'neu': 0.294, 'pos': 0.706, 'compound': 0.9469}
VADER is not smart, handsome, nor funny.------------------------- {'neg': 0.646, 'neu': 0.354, 'pos': 0.0, 'compound': -0.7424}
The book was good.----------------------------------------------- {'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'co
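
As a reading aid for the scores above: the `compound` value is a normalized sum of the word valences, and the VADER documentation suggests thresholds of ±0.05 to turn it into a positive, neutral or negative label. The sketch below reproduces both steps in plain Python. The function names are our own, and the constant `alpha=15` reflects the VADER source code as we understand it, so treat it as an assumption.

```python
import math

def normalize(raw_score, alpha=15):
    """Squash an unbounded valence sum into the range [-1, 1]."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)

def label_from_compound(compound, threshold=0.05):
    """Map a compound score to one of three classes."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

print(label_from_compound(0.8316))   # compound of the first example sentence above
print(label_from_compound(-0.7424))  # compound of the negated example sentence above
print(round(normalize(1.9), 4))      # 1.9 is an assumed raw valence sum
```

Mapping the compound score to three classes in this way makes the VADER output directly comparable with the labels predicted by the two ML models.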

## Machine Learning Based Approach (fastText)

fastText is a machine learning library for text classification that is optimized to run on standard hardware. It does not require a GPU for training or for making predictions.

---

**Source**: *Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of Tricks for Efficient Text Classification. arXiv preprint arXiv:1607.01759.*

In [3]:
# install the fasttext implementation
!pip install fasttext



In [4]:
# install datasets package
!pip install datasets



We now use the same data set that the BERT-based model in the third section of this tutorial was trained on ([tweet_eval](https://huggingface.co/datasets/tweet_eval)).

The data set can be retrieved with the Hugging Face `datasets` package.

---

**Source:** 

*   *Barbieri, F., Camacho-Collados, J., Espinosa-Anke, L., & Neves, L. (2020). TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification. In Proceedings of Findings of EMNLP.*
*   *Rosenthal, S., Farra, N., & Nakov, P. (2017). SemEval-2017 task 4: Sentiment analysis in Twitter. In Proceedings of the 11th international workshop on semantic evaluation (SemEval-2017) (pp. 502–518).*

In [5]:
# load the same data set that the BERT-based approach uses for training
from datasets import get_dataset_config_names
from datasets import get_dataset_split_names
from datasets import load_dataset

# retrieve the split names of the "sentiment" configuration
dataset_split_names = get_dataset_split_names("tweet_eval", "sentiment")

# retrieve the available configurations (these can be different languages or data set sub-types)
configs = get_dataset_config_names("tweet_eval")

print(dataset_split_names)
print(configs)

['train', 'test', 'validation']
['emoji', 'emotion', 'hate', 'irony', 'offensive', 'sentiment', 'stance_abortion', 'stance_atheism', 'stance_climate', 'stance_feminist', 'stance_hillary']


In [6]:
# print a preview of the data set structure (size of the data set and distribution over the splits)
train_dataset = load_dataset("tweet_eval", "sentiment", split="train")
test_dataset = load_dataset("tweet_eval", "sentiment", split="test")
validation_dataset = load_dataset("tweet_eval", "sentiment", split="validation")
print(train_dataset)
print(test_dataset)
print(validation_dataset)



Dataset({
    features: ['text', 'label'],
    num_rows: 45615
})
Dataset({
    features: ['text', 'label'],
    num_rows: 12284
})
Dataset({
    features: ['text', 'label'],
    num_rows: 2000
})


In [7]:
# Labels within the data set:
# 0: negative
# 1: neutral
# 2: positive

print(train_dataset[0])           # print the line without formatting
print('')
print(train_dataset[0]['text'])   # text of the tweet only
print(train_dataset[0]['label'])  # sentiment label of the tweet

{'text': '"QT @user In the original draft of the 7th book, Remus Lupin survived the Battle of Hogwarts. #HappyBirthdayRemusLupin"', 'label': 2}

"QT @user In the original draft of the 7th book, Remus Lupin survived the Battle of Hogwarts. #HappyBirthdayRemusLupin"
2


In [8]:
# pre-processing
# source: https://fasttext.cc/docs/en/supervised-tutorial.html#getting-and-preparing-the-data

import pandas as pd

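# fastText expects every label to carry the "__label__" prefix
LABEL_NAMES = {0: '__label__negative', 1: '__label__neutral', 2: '__label__positive'}

def adapt_label(df):
    # map the integer labels to their fastText label strings
    df['label'] = df['label'].map(LABEL_NAMES)
    # one line per example in the form "<label> <text>"
    df['combined'] = df['label'] + " " + df['text']
    return df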

df_train = pd.DataFrame(train_dataset)
df_train = adapt_label(df_train)

df_test = pd.DataFrame(test_dataset)
df_test = adapt_label(df_test)

df_valid = pd.DataFrame(validation_dataset)
df_valid = adapt_label(df_valid)

print(df_train)

                                                    text              label  \
0      "QT @user In the original draft of the 7th boo...  __label__positive   
1      "Ben Smith / Smith (concussion) remains out of...   __label__neutral   
2      Sorry bout the stream last night I crashed out...   __label__neutral   
3      Chase Headley's RBI double in the 8th inning o...   __label__neutral   
4      @user Alciato: Bee will invest 150 million in ...  __label__positive   
...                                                  ...                ...   
45610  @user \""So amazing to have the beautiful Lady...  __label__positive   
45611  9 September has arrived, which means Apple's n...  __label__positive   
45612  Leeds 1-1 Sheff Wed. Giuseppe Bellusci securin...  __label__positive   
45613  @user no I'm in hilton head till the 8th lol g...   __label__neutral   
45614  WASHINGTON (Reuters) - U.S. Vice President Joe...   __label__neutral   

                                                com

In [9]:
df_train['label'].value_counts() # training data set class distribution

__label__neutral     20673
__label__positive    17849
__label__negative     7093
Name: label, dtype: int64

In [10]:
df_valid['label'].value_counts() # validation data set class distribution

__label__neutral     869
__label__positive    819
__label__negative    312
Name: label, dtype: int64
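
The counts above show that the classes are imbalanced. As a small worked example, the following sketch turns the reported counts into class shares and derives the accuracy of a trivial classifier that always predicts the most frequent class, which is a useful reference point for the model results later on. The variable and function names are our own.

```python
# class counts copied from the outputs above
train_counts = {"neutral": 20673, "positive": 17849, "negative": 7093}
valid_counts = {"neutral": 869, "positive": 819, "negative": 312}

def class_shares(counts):
    """Relative frequency of each class."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def majority_baseline(counts):
    """Accuracy of always predicting the most frequent class."""
    return max(counts.values()) / sum(counts.values())

for label, share in class_shares(train_counts).items():
    print(f"{label:<9} {share:.3f}")

baseline = majority_baseline(valid_counts)
print(f"majority-class baseline on validation: {baseline:.4f}")  # 0.4345
```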

In [11]:
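# write one "label,text" line per example; the sed call below pads the comma with spaces,
# so every line starts with the "__label__..." token that fastText expects
df_train[['label', 'text']].to_csv('df_train.txt', index=False, header=False)
# pad punctuation with spaces and lowercase everything (pre-processing suggested in the fastText tutorial)
!cat df_train.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > df_train.preprocessed.txt

# same steps for the validation split
df_valid[['label', 'text']].to_csv('df_valid.txt', index=False, header=False)
!cat df_valid.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > df_valid.preprocessed.txt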

In [12]:
import fasttext

# train a supervised classifier on the preprocessed training data (learning rate 1.0, 25 epochs)
model_fasttext_preprocessed = fasttext.train_supervised(input='df_train.preprocessed.txt', lr=1.0, epoch=25)

In [13]:
# evaluate on the validation data; returns (number of examples, precision@1, recall@1)
model_fasttext_preprocessed.test("df_valid.preprocessed.txt")

(2000, 0.6485, 0.6485)
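
The tuple returned by `test` contains the number of evaluated examples followed by precision and recall at one predicted label (P@1 and R@1). The sketch below, with made-up gold labels and predictions, shows how these two numbers are computed and why they coincide when every example has exactly one gold label and one prediction. The helper `precision_recall_at_k` is our own illustration and not part of fastText.

```python
def precision_recall_at_k(gold, predicted):
    """gold and predicted are lists of label lists, one entry per example."""
    n_correct = sum(len(set(g) & set(p)) for g, p in zip(gold, predicted))
    n_predicted = sum(len(p) for p in predicted)  # denominator of precision
    n_gold = sum(len(g) for g in gold)            # denominator of recall
    return n_correct / n_predicted, n_correct / n_gold

# made-up example: four tweets, one gold label and one prediction each
gold = [["positive"], ["neutral"], ["negative"], ["neutral"]]
top1 = [["positive"], ["neutral"], ["neutral"], ["positive"]]
precision, recall = precision_recall_at_k(gold, top1)
print(precision, recall)  # 0.5 0.5
```

With a single label per tweet and a single prediction, both denominators equal the number of examples, so both values are simply the accuracy.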

In [14]:
model_fasttext_preprocessed.predict("FastText it's models fit well onto a mobile devices.")

(('__label__positive',), array([0.98458219]))

In [15]:
# source: https://github.com/cjhutto/vaderSentiment

sentences = ["VADER is smart, handsome, and funny.",                      # positive sentence example
             "VADER is smart, handsome, and funny!",                      # punctuation emphasis handled correctly (sentiment intensity adjusted)
             "VADER is very smart, handsome, and funny.",                 # booster words handled correctly (sentiment intensity adjusted)
             "VADER is VERY SMART, handsome, and FUNNY.",                 # emphasis for ALLCAPS handled
             "VADER is VERY SMART, handsome, and FUNNY!!!",               # combination of signals - VADER appropriately adjusts intensity
             "VADER is VERY SMART, uber handsome, and FRIGGIN FUNNY!!!",  # booster words & punctuation make this close to ceiling for score
             "VADER is not smart, handsome, nor funny.",                  # negation sentence example
             "The book was good.",                                        # positive sentence
             "At least it isn't a horrible book.",                        # negated negative sentence with contraction
             "The book was only kind of good.",                           # qualified positive sentence is handled correctly (intensity adjusted)
             "The plot was good, but the characters are uncompelling and the dialog is not great.", # mixed negation sentence
             "Today SUX!",                                                # negative slang with capitalization emphasis
             "Today only kinda sux! But I'll get by, lol",                # mixed sentiment example with slang and contrastive conjunction "but"
             "Make sure you :) or :D today!",                             # emoticons handled
             "Catch utf-8 emoji such as such as 💘 and 💋 and 😁",        # emojis handled
             "Not bad at all"                                             # Capitalized negation
             ]

# write one sentence per line (UTF-8, because the examples contain emoji)
with open('vader_sentences.txt', 'w', encoding='utf-8') as tfile:
    tfile.write('\n'.join(sentences))

In [16]:
# data pre-processing as suggested by fastText (https://fasttext.cc/docs/en/supervised-tutorial.html#preprocessing-the-data)
!cat vader_sentences.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > vader_sentences.preprocessed.txt

In [17]:
# without preprocessing (the input does not match the lowercased, punctuation-separated training text)
with open('vader_sentences.txt') as file:
    for line in file:
        vs = model_fasttext_preprocessed.predict(line.rstrip()) # pass k=3 to get the scores of all three labels
        print("{:-<65} {}".format(line.rstrip(), str(vs)))

VADER is smart, handsome, and funny.----------------------------- (('__label__neutral',), array([1.00000787]))
VADER is smart, handsome, and funny!----------------------------- (('__label__neutral',), array([1.00000787]))
VADER is very smart, handsome, and funny.------------------------ (('__label__neutral',), array([0.96613878]))
VADER is VERY SMART, handsome, and FUNNY.------------------------ (('__label__neutral',), array([1.00000787]))
VADER is VERY SMART, handsome, and FUNNY!!!---------------------- (('__label__neutral',), array([1.00000787]))
VADER is VERY SMART, uber handsome, and FRIGGIN FUNNY!!!--------- (('__label__neutral',), array([0.99969792]))
VADER is not smart, handsome, nor funny.------------------------- (('__label__neutral',), array([0.98700446]))
The book was good.----------------------------------------------- (('__label__neutral',), array([1.00001001]))
At least it isn't a horrible book.------------------------------- (('__label__negative',), array([1.00001001]))


In [18]:
# with preprocessing (same format as the training data)
with open('vader_sentences.preprocessed.txt') as file:
    for line in file:
        vs = model_fasttext_preprocessed.predict(line.rstrip()) # pass k=3 to get the scores of all three labels
        print("{:-<65} {}".format(line.rstrip(), str(vs)))

vader is smart ,  handsome ,  and funny .------------------------ (('__label__positive',), array([1.00001001]))
vader is smart ,  handsome ,  and funny !------------------------ (('__label__positive',), array([1.00001001]))
vader is very smart ,  handsome ,  and funny .------------------- (('__label__positive',), array([1.00001001]))
vader is very smart ,  handsome ,  and funny .------------------- (('__label__positive',), array([1.00001001]))
vader is very smart ,  handsome ,  and funny !  !  !------------- (('__label__positive',), array([1.00001001]))
vader is very smart ,  uber handsome ,  and friggin funny !  !  ! (('__label__positive',), array([1.00001001]))
vader is not smart ,  handsome ,  nor funny .-------------------- (('__label__positive',), array([1.00000989]))
the book was good .---------------------------------------------- (('__label__positive',), array([0.99838203]))
at least it isn ' t a horrible book .---------------------------- (('__label__negative',), array([1.0000

## Machine Learning Based Approach (BERT)

*   Model used: https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest (a RoBERTa model, i.e. a BERT variant)
*   Repository: https://github.com/cardiffnlp/timelms
*   This model is also integrated into the TweetNLP platform (https://tweetnlp.org/)

---

**Source:** *Daniel Loureiro, Francesco Barbieri, Leonardo Neves, Luis Espinosa Anke, and Jose Camacho-Collados. 2022. TimeLMs: Diachronic Language Models from Twitter. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 251–260, Dublin, Ireland. Association for Computational Linguistics.*

In [19]:
# install the transformers package to use Hugging Face models
!pip install transformers



In [20]:
# source: https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest

from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig
import numpy as np
from scipy.special import softmax

# pre-process text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

# load the chosen model
MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
config = AutoConfig.from_pretrained(MODEL)

# PyTorch model
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
#model.save_pretrained(MODEL)

def predict_bert(text):
    # returns all labels with their probabilities, ranked from most to least likely
    text = preprocess(text)
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    scores = output[0][0].detach().numpy()  # raw scores (logits) of the single input sentence
    scores = softmax(scores)                # convert the logits to probabilities

    ranking = np.argsort(scores)[::-1]      # label indices, highest probability first
    list_predicted_labels = []
    for i in range(scores.shape[0]):
        l = config.id2label[ranking[i]]
        s = scores[ranking[i]]
        label = str(l) + " " + str(np.round(float(s), 4))
        list_predicted_labels.append(label)
    return list_predicted_labels

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
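
The scoring inside `predict_bert` boils down to two steps: a softmax that turns the model's raw outputs (logits) into probabilities, and a sort that ranks the labels. The following sketch repeats these steps in plain Python with made-up logits. The label order mirrors the data set labels (0: negative, 1: neutral, 2: positive), and `rank_labels` is a name of our own.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rank_labels(logits, labels):
    """Return (label, probability) pairs, most likely label first."""
    probs = softmax(logits)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)

labels = ["negative", "neutral", "positive"]
ranked = rank_labels([-1.0, 0.5, 2.0], labels)  # made-up logits
for label, prob in ranked:
    print(label, round(prob, 4))
```

The probabilities always sum to one, which is why the three scores printed for each sentence by `predict_bert` add up to roughly 1.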


We now run the same sentences through the BERT model. On the examples shown, this pre-trained model assigns the expected label with high confidence.

In [21]:
with open('vader_sentences.txt') as file:
    for line in file:
        vs = predict_bert(line.rstrip()) # already returns all three labels, ranked by probability
        print("{:-<65} {}".format(line.rstrip(), str(vs)))

VADER is smart, handsome, and funny.----------------------------- ['positive 0.9637', 'neutral 0.0318', 'negative 0.0045']
VADER is smart, handsome, and funny!----------------------------- ['positive 0.9789', 'neutral 0.018', 'negative 0.0031']
VADER is very smart, handsome, and funny.------------------------ ['positive 0.9714', 'neutral 0.0248', 'negative 0.0037']
VADER is VERY SMART, handsome, and FUNNY.------------------------ ['positive 0.9746', 'neutral 0.0206', 'negative 0.0048']
VADER is VERY SMART, handsome, and FUNNY!!!---------------------- ['positive 0.9841', 'neutral 0.012', 'negative 0.0039']
VADER is VERY SMART, uber handsome, and FRIGGIN FUNNY!!!--------- ['positive 0.9827', 'neutral 0.0126', 'negative 0.0047']
VADER is not smart, handsome, nor funny.------------------------- ['negative 0.8702', 'neutral 0.1115', 'positive 0.0183']
The book was good.----------------------------------------------- ['positive 0.9514', 'neutral 0.0441', 'negative 0.0044']
At least it isn't 

## Now it's your turn to try out some sentences of your choice with the three different methods!

Use the next code block to try out different sentences. Which approach do you like most?

Try to improve the performance by pre-processing the input data. In addition, you can tune the fastText model by adjusting its training parameters (for example `lr` and `epoch`).
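
One way to start: the fastText model above was trained on lowercased text with punctuation separated by spaces, so raw input may be a poor match for what it has seen. The helper below approximates the `sed`/`tr` pipeline used earlier in pure Python. `fasttext_preprocess` is a hypothetical name, and the handling of edge cases (for example backslashes or non-ASCII capital letters) may differ from the shell version.

```python
import re

# punctuation characters that the shell pipeline above surrounds with spaces
PUNCTUATION = re.compile(r"([.!?,'/()])")

def fasttext_preprocess(text):
    """Pad punctuation with spaces and lowercase the text."""
    return PUNCTUATION.sub(r" \1 ", text).lower()

print(fasttext_preprocess("This is not bad!"))
# the result could then be passed to the model, for example:
# model_fasttext_preprocessed.predict(fasttext_preprocess(sentence).strip())
```

Keeping the pre-processing in a Python function makes it easy to apply the same transformation to the training data and to every sentence you want to classify.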

In [28]:
# You can try to pre-process the input data to increase the performance of the algorithms.
# Information regarding pre-processing is available on the documentation sites for each of the algorithms.

sentence = "This is not bad!"
# VADER
print("{:-<65} {}".format("VADER: "+sentence, analyzer_vader.polarity_scores(sentence.rstrip())))
# fastText
print("{:-<65} {}".format("fastText: "+sentence, str(model_fasttext_preprocessed.predict(sentence.rstrip())))) # pass k=3 to get the scores of all three labels
# BERT - predict_bert already returns all three labels, ranked by probability
print("{:-<65} {}".format("BERT: "+sentence, str(predict_bert(sentence.rstrip()))))

VADER: This is not bad!------------------------------------------ {'neg': 0.0, 'neu': 0.488, 'pos': 0.512, 'compound': 0.484}
fastText: This is not bad!--------------------------------------- (('__label__neutral',), array([0.89596003]))
BERT: This is not bad!------------------------------------------- ['positive 0.851', 'neutral 0.1321', 'negative 0.0169']
