# Twitter Sentiment Analysis using `roBERTa`

**Resources**
- [Twitter Sentiment Analysis by Python | best NLP model 2022](https://www.youtube.com/watch?v=uPKnSq6TaAk) + [code](https://github.com/mehranshakarami/AI_Spectrum/blob/main/2022/Sentiment_Analysis/tw-sentiment.py)
- [Twitter-roBERTa-base for Sentiment Analysis | Updated](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest)

## Setup

In [1]:
import sys 
sys.path.append("../")

In [2]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax

In [3]:
import numpy as np 
import pickle as pkl 

In [4]:
from lib.preprocessing.tweets import clean_tweet

In [61]:
path = "../backup/data/tweets_preprocessed.pkl"
with open(path, "rb") as f: 
    tweets_preprocessed = pkl.load(f)

## Preprocess tweets

In [62]:
tweets = tweets_preprocessed["cleaned_emojis"]
# tweets = [clean_tweet(tweet, remove_mentions=False) for tweet in tweets]

In [7]:
def preprocess(text):
    """Replace user mentions and links with the "@user" and "http" placeholders
    before feeding the text to the transformer."""
    
    new_text = []
    for t in text.split(" "):
        t = "@user" if t.startswith("@") and len(t) > 1 else t
        t = "http" if t.startswith("http") else t
        new_text.append(t)

    return " ".join(new_text)
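
As a quick check of the placeholder substitution, the sketch below re-declares `preprocess` so that it runs on its own and applies it to a made-up tweet. The example text, handle and URL are invented for illustration and do not come from the dataset.

```python
def preprocess(text: str) -> str:
    """Replace user mentions and links with the "@user" and "http" placeholders."""
    new_text = []
    for t in text.split(" "):
        # A lone "@" is kept as-is; anything longer starting with "@" is a mention
        t = "@user" if t.startswith("@") and len(t) > 1 else t
        # Any token starting with "http" (http:// or https://) is treated as a link
        t = "http" if t.startswith("http") else t
        new_text.append(t)
    return " ".join(new_text)


# Hypothetical tweet, not taken from the dataset
example = "@some_account merci pour le lien https://example.com/article @ 18h"
cleaned = preprocess(example)
print(cleaned)  # @user merci pour le lien http @ 18h
```

Because the text is split on single spaces, punctuation glued to a mention or link (for instance `@name,`) is absorbed into the placeholder.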

In [42]:
tweets = [preprocess(tweet) for tweet in tweets]

In [63]:
n_tweets = len(tweets)
print(f"{n_tweets} tweets to analyse")

92961 tweets to analyse


In [64]:
ix = np.random.randint(0, n_tweets)

print(tweets[ix])

Nous avons modifié le barème kilométrique, versé l'indemnité inflation, revalorisé le chèque énergie, diminué la fiscalité sur l'électricité pour limiter la  à 4% entre octobre et février. Au total le   a déjà engagé 15Md€ pour protéger les Français. 3,535 views 120 ↗


## Load `roBERTa` model & tokenizer

In [35]:
ROBERTA = "cardiffnlp/twitter-roberta-base-sentiment-latest"
model = AutoModelForSequenceClassification.from_pretrained(ROBERTA)
tokenizer = AutoTokenizer.from_pretrained(ROBERTA)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [41]:
help(tokenizer)

Help on RobertaTokenizerFast in module transformers.models.roberta.tokenization_roberta_fast object:

class RobertaTokenizerFast(transformers.tokenization_utils_fast.PreTrainedTokenizerFast)
 |  RobertaTokenizerFast(vocab_file=None, merges_file=None, tokenizer_file=None, errors='replace', bos_token='<s>', eos_token='</s>', sep_token='</s>', cls_token='<s>', unk_token='<unk>', pad_token='<pad>', mask_token='<mask>', add_prefix_space=False, trim_offsets=True, **kwargs)
 |  
 |  Construct a "fast" RoBERTa tokenizer (backed by HuggingFace's *tokenizers* library), derived from the GPT-2
 |  tokenizer, using byte-level Byte-Pair-Encoding.
 |  
 |  This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
 |  be encoded differently whether it is at the beginning of the sentence (without space) or not:
 |  
 |  ```
 |  >>> from transformers import RobertaTokenizerFast
 |  >>> tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base

In [36]:
LABELS = ["Negative", "Neutral", "Positive"]

In [37]:
model

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerN

In [38]:
tokenizer

PreTrainedTokenizerFast(name_or_path='cardiffnlp/twitter-roberta-base-sentiment-latest', vocab_size=50265, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False)})

## Tokenize

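- `input_ids` is a tensor of vocabulary indices, one per token of the tweet, wrapped in the special tokens `<s>` (id 0) and `</s>` (id 2)
- `attention_mask` tells the model which positions are real tokens to attend to (1) and which are padding to ignore (0)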
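
To illustrate what the attention mask does once several tweets are batched together, here is a minimal pure-Python sketch of right-padding. It is not the tokenizer's implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and the pad id of 1 is taken from the `padding_idx=1` shown in the model printout.

```python
from typing import Dict, List

PAD_ID = 1  # id of '<pad>', assumed from padding_idx=1 in the model printout


def pad_batch(sequences: List[List[int]], pad_id: int = PAD_ID) -> Dict[str, List[List[int]]]:
    """Right-pad token id sequences to a common length and build the matching mask."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = width - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids; 0 and 2 stand for '<s>' and '</s>'
batch = pad_batch([[0, 863, 108, 2], [0, 4235, 2]])
print(batch["input_ids"])       # [[0, 863, 108, 2], [0, 4235, 2, 1]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

With the real tokenizer, passing a list of tweets together with `padding=True` yields padded `input_ids` and an `attention_mask` of this kind. When each tweet is encoded on its own, no padding is needed and the mask contains only ones.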

In [101]:
selected = np.random.choice(tweets, size=10).tolist()

In [102]:
tweets_encoded = [tokenizer(tweet, return_tensors="pt") for tweet in selected]

In [103]:
# Inspect the encoding of the first selected tweet
tokens = tweets_encoded[0]

tokens

{'input_ids': tensor([[    0,   863,   108,   102,  4235,  7427,   181,  3695, 12782,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

## Apply model

In [104]:
outputs = [model(**encoded) for encoded in tweets_encoded]  

In [105]:
# Inspect the raw model output for the first selected tweet
output = outputs[0]

print(output)

SequenceClassifierOutput(loss=None, logits=tensor([[-0.4027,  0.8246, -0.7598]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)


In [106]:
from typing import List, Tuple

from transformers.modeling_outputs import SequenceClassifierOutput


def get_sentiment_from_output(output: SequenceClassifierOutput) -> Tuple[str, List[float]]:
    """Return the predicted sentiment and the softmax scores per label."""

    # Logits of the single sequence in the batch, detached from the autograd graph
    scores = output.logits[0].detach().numpy()
    scores = softmax(scores).tolist()

    # LABELS is indexed in the same order as the logits
    sentiment = LABELS[np.argmax(scores)]

    return sentiment, scores
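
The logits-to-label step can be checked by hand on the logits printed above. The sketch below is a pure-Python stand-in for `scipy.special.softmax` followed by `np.argmax`; `softmax_scores` is a hypothetical helper written for this illustration only.

```python
import math
from typing import List

LABELS = ["Negative", "Neutral", "Positive"]


def softmax_scores(logits: List[float]) -> List[float]:
    """Turn raw logits into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable without changing the result
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the SequenceClassifierOutput printed above
logits = [-0.4027, 0.8246, -0.7598]
scores = softmax_scores(logits)

# The predicted label is the one with the highest probability
sentiment = LABELS[scores.index(max(scores))]
print(sentiment, [round(s, 3) for s in scores])
```

For these logits the prediction is `Neutral`, which receives roughly two thirds of the probability mass.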

In [107]:
results = [get_sentiment_from_output(output) for output in outputs]

In [109]:
for tweet, (sentiment, scores) in zip(selected, results): 
    print(tweet)
    print(f"{sentiment=}")
    print("-"*100)

J'aime les pâtes
sentiment='Neutral'
----------------------------------------------------------------------------------------------------
Il fait un temps de merde ici
sentiment='Neutral'
----------------------------------------------------------------------------------------------------
Ce gars est symmpa mais il sent pas très bon
sentiment='Neutral'
----------------------------------------------------------------------------------------------------
