# Twitter Sentiment Analysis using `roBERTa`

**Resources**
- [Twitter Sentiment Analysis by Python | best NLP model 2022](https://www.youtube.com/watch?v=uPKnSq6TaAk) + [code](https://github.com/mehranshakarami/AI_Spectrum/blob/main/2022/Sentiment_Analysis/tw-sentiment.py)
- [Twitter-roBERTa-base for Sentiment Analysis | Updated](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest)

In [20]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax
import numpy as np 

In [32]:
tweet = "Si on remet les choses dans l'ordre, il s'agit plutôt d'une mesure destinée à compenser l'inflation consécutive à la baisse de la livre causée par le Brexit...\nLe HuffPost\n@LeHuffPost\n·\n31 Dec 2019\nBoris Johnson fait passer le Smic anglais au-dessus du français http://huffp.st/NY3NlnF\n6\n47\n98"
# collapse line breaks into spaces so each mention and link becomes its own token
tweet = " ".join(tweet.splitlines())
print(tweet)

Si on remet les choses dans l'ordre, il s'agit plutôt d'une mesure destinée à compenser l'inflation consécutive à la baisse de la livre causée par le Brexit... Le HuffPost @LeHuffPost · 31 Dec 2019 Boris Johnson fait passer le Smic anglais au-dessus du français http://huffp.st/NY3NlnF 6 47 98


In [33]:
def preprocess(text):
    """Replace user mentions and links with the '@user' and 'http' placeholders."""
    
    new_text = []
    for t in text.split(" "):
        t = "@user" if t.startswith("@") and len(t) > 1 else t
        t = "http" if t.startswith("http") else t
        new_text.append(t)

    return " ".join(new_text)

In [34]:
tweet_prep = preprocess(tweet) 
print(tweet_prep)

Si on remet les choses dans l'ordre, il s'agit plutôt d'une mesure destinée à compenser l'inflation consécutive à la baisse de la livre causée par le Brexit... Le HuffPost @user · 31 Dec 2019 Boris Johnson fait passer le Smic anglais au-dessus du français http 6 47 98
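The placeholder rule can be checked on a few short strings. This is a minimal sketch that repeats `preprocess` so it runs on its own; the sample strings are made up for illustration and are not taken from the dataset. The last case shows why the tweet is flattened with `splitlines()` first: a mention that directly follows a line break is not recognised, because the function only splits on spaces.

```python
def preprocess(text):
    """Replace user mentions and links with the '@user' and 'http' placeholders."""
    new_text = []
    for t in text.split(" "):
        t = "@user" if t.startswith("@") and len(t) > 1 else t
        t = "http" if t.startswith("http") else t
        new_text.append(t)
    return " ".join(new_text)


# Hypothetical sample strings, for illustration only
mention_and_link = preprocess("@alice check https://example.com now")
lone_at_sign = preprocess("mail me @ noon")
unflattened = preprocess("hi\n@bob")

print(mention_and_link)   # @user check http now
print(lone_at_sign)       # mail me @ noon
print(repr(unflattened))  # 'hi\n@bob'
```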


In [6]:
# load model and tokenizer
roberta = "cardiffnlp/twitter-roberta-base-sentiment-latest"

model = AutoModelForSequenceClassification.from_pretrained(roberta)
tokenizer = AutoTokenizer.from_pretrained(roberta)

labels = ["Negative", "Neutral", "Positive"]
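The hardcoded `labels` list only works if its order matches the model's class indices. As a hedged alternative, the names can be derived from the checkpoint's own mapping, assuming its config exposes `id2label`. The helper `labels_from_id2label` and the example dictionary below are hypothetical stand-ins; the dictionary is not read from the model.

```python
def labels_from_id2label(id2label):
    """Return capitalized label names ordered by class index."""
    return [id2label[i].capitalize() for i in sorted(id2label)]


# Stand-in for model.config.id2label (assumed shape: {class_index: name})
example_id2label = {0: "negative", 1: "neutral", 2: "positive"}

derived_labels = labels_from_id2label(example_id2label)
print(derived_labels)  # ['Negative', 'Neutral', 'Positive']
```

With the model loaded, the call would be `labels_from_id2label(model.config.id2label)`.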


In [7]:
model

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerN

In [8]:
tokenizer

PreTrainedTokenizerFast(name_or_path='cardiffnlp/twitter-roberta-base-sentiment', vocab_size=50265, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False)})

In [36]:
# tokenize the preprocessed tweet and return PyTorch tensors
encoded_tweet = tokenizer(tweet_prep, return_tensors="pt")

print(encoded_tweet)

# input_ids: the token IDs the tokenizer assigns to each subword of the tweet
# attention_mask: 1 for tokens the model should attend to, 0 for padding

{'input_ids': tensor([[    0, 35684,    15,  6398,   594,  7427,  1855, 18575,   385,  1253,
           784,   108,  3109,   241,     6,  7675,   579,   108,  1073,   405,
          2968,  1182, 10456,    90,   385,   108,  4438, 10969,  2407, 15357,
           179,  9703,  6534, 29281,   254,   784,   108,   179, 18613,  7407,
          1140,   438, 19172,  6534,   897,   741,  5655,  1090,   263,   897,
         32126,   241, 37771,  9703,  2242,  2084,  2404,   734,  1063, 24884,
           787, 12105, 13339,  1105,  1502,   954, 14335,  1436,   856,  5236,
         17663,  2084,  4966,   636,  5667,  2560,   354,  8477,    12,   417,
          3361,   687,  4279,  6664,   260,  3381,  5655,  2054,   231,  4034,
          8757,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
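With a single tweet there is nothing to pad, so every mask entry shown is 1. The mask only becomes informative once sequences of different lengths are batched together. The sketch below imitates that padding step in plain NumPy; it is an illustration, not the tokenizer's implementation. The helper `pad_batch` and the two short ID sequences are made up, while the padding ID 1 follows `padding_idx=1` in the model's embedding layer.

```python
import numpy as np

PAD_ID = 1  # matches padding_idx=1 in the RoBERTa embedding layer


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token ID lists to equal length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(sequences), max_len), dtype=np.int64)
    for row, seq in enumerate(sequences):
        input_ids[row, : len(seq)] = seq
        attention_mask[row, : len(seq)] = 1  # real tokens get 1, padding stays 0
    return input_ids, attention_mask


# Hypothetical short sequences: 0 opens and 2 closes each one, as in the output above
batch = [[0, 35684, 15, 2], [0, 231, 2]]
input_ids, attention_mask = pad_batch(batch)

print(input_ids)
print(attention_mask)
```

In practice the tokenizer does this itself when called on a list of texts with `padding=True`.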

In [37]:
# equivalent to: model(input_ids=encoded_tweet["input_ids"], attention_mask=encoded_tweet["attention_mask"])
output = model(**encoded_tweet)
print(output)

SequenceClassifierOutput(loss=None, logits=tensor([[-0.2941,  1.6557, -1.4782]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)


In [38]:
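# take the logits of the single tweet, detach them from the graph, convert to NumPy
scores = output.logits[0].detach().numpy()
# turn the logits into probabilities that sum to 1
scores = softmax(scores)
print(scores)

# the predicted class is the one with the highest probability
sentiment = labels[np.argmax(scores)]
print(f"Tweet sentiment is {sentiment}")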

[0.11999977 0.843278   0.03672223]
Tweet sentiment is Neutral
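To see where these probabilities come from, the softmax step can be reproduced by hand from the logits printed earlier. This is a minimal sketch in plain NumPy rather than `scipy.special.softmax`; the helper name `stable_softmax` is an assumption for illustration. Subtracting the maximum logit before exponentiating does not change the result and avoids overflow for large values.

```python
import numpy as np


def stable_softmax(logits):
    """Softmax over a 1-D array, shifted by the max for numerical stability."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


# Logits copied from the printed SequenceClassifierOutput
logits = [-0.2941, 1.6557, -1.4782]
probs = stable_softmax(logits)
predicted_index = int(np.argmax(probs))

print(probs.round(4))
print(predicted_index)  # 1, the "Neutral" position in labels
```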
