This notebook demonstrates sentiment prediction with a fine-tuned RoBERTa model.
In the last cell, you can pass in a list of tweets and see the predicted sentiment for each.

# Please turn on the GPU

In [None]:
!pip install emoji
!pip install transformers

Collecting emoji
  Downloading emoji-1.7.0.tar.gz (175 kB)

In [None]:
import re
import emoji
from torch import nn
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizer, AutoTokenizer

In [None]:
from google.colab import drive

drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [None]:
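def roberta_preprocess(text):
    """Normalize a raw tweet before tokenization."""
    text = text.lower()
    # Replace user mentions with a generic @user token.
    text = re.sub(r'@[a-zA-Z0-9]*', '@user', text)
    # Drop URLs, bare .com domains, and {link} / [video] placeholders.
    text = re.sub(r'https?:\/\/\S+', '', text)
    text = re.sub(r"www\.[a-z]?\.?(com)+|[a-z]+\.(com)", '', text)
    text = re.sub(r'\{link\}', '', text)
    text = re.sub(r"\[video\]", '', text)
    # Turn a retweet prefix ("rt @user") into a plain mention.
    text = re.sub('rt @', '@', text).strip()
    # Remove hashtag and colon characters; underscores become spaces.
    text = text.replace("#", "").replace("_", " ").replace(":","")
    # Convert emoji to their text names, e.g. ":rocket:".
    text = emoji.demojize(text,language='en')
    return text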
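
To make the cleaning steps concrete, here is a minimal sketch that mirrors the regex and replacement steps of `roberta_preprocess` using only the standard library. The helper name `clean_without_emoji` and the sample tweet are made up for illustration, and the `emoji.demojize` step is left out so the sketch runs without extra packages.

```python
import re

def clean_without_emoji(text):
    # Hypothetical helper: same steps as roberta_preprocess, minus emoji.demojize.
    text = text.lower()
    text = re.sub(r'@[a-zA-Z0-9]*', '@user', text)        # mentions -> @user
    text = re.sub(r'https?:\/\/\S+', '', text)            # strip URLs
    text = re.sub(r"www\.[a-z]?\.?(com)+|[a-z]+\.(com)", '', text)
    text = re.sub(r'\{link\}', '', text)
    text = re.sub(r"\[video\]", '', text)
    text = re.sub('rt @', '@', text).strip()              # retweet prefix -> mention
    return text.replace("#", "").replace("_", " ").replace(":", "")

# Made-up sample tweet, not taken from the training data.
sample = "RT @alice: BTC to the moon! #HODL https://t.co/abc"
cleaned = clean_without_emoji(sample)
print(cleaned)  # @user btc to the moon! hodl
```

Note that the mention pattern stops at the first character outside `[a-zA-Z0-9]`, so a handle containing an underscore is only partly replaced.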

In [None]:
roberta_tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
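roberta_model = RobertaForSequenceClassification.from_pretrained("roberta-base", # Start from the pretrained 12-layer RoBERTa base model (cased vocab).
                                                      num_labels = 3, # Three output labels, decoded later as -1, 0, 1.
                                                      output_attentions = False, # Whether the model returns attentions weights.
                                                      output_hidden_states = False, # Whether the model returns all hidden-states.
                                                      problem_type="multi_label_classification" # Only affects the loss computed when labels are passed; unused at inference.
)
# The model stays on the CPU here. Uncomment the next line to run it on the GPU
# (the input tensors in roberta_demo would then need to be moved to the GPU too).
# roberta_model.cuda()
path = "/content/gdrive/MyDrive/NLP Group Project (2022)/fine-tuning_roberta/best_roberta.pt"
# map_location="cpu" lets the fine-tuned weights load even if they were saved from a GPU.
roberta_model.load_state_dict(torch.load(path, map_location="cpu"))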

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.bias', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.bias', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classi

<All keys matched successfully>

In [None]:
def roberta_demo(tweets):
  """
  Args:
    tweets : list of raw tweet strings; each one is cleaned with roberta_preprocess

  Returns:
    predictions : np.array of predicted labels in {-1, 0, 1} (argmax index minus 1)
  """
  clean_tweet = [roberta_preprocess(i) for i in tweets]
  # Tokenize the batch, padding or truncating every tweet to 64 tokens.
  tokens = roberta_tokenizer.batch_encode_plus(
      clean_tweet,
      padding='max_length',
      max_length = 64,
      truncation=True,
      add_special_tokens = True, # Add '<s>' and '</s>' (RoBERTa's equivalents of [CLS] and [SEP])
      return_attention_mask = True
  )
  seq = torch.tensor(tokens['input_ids'])
  mask = torch.tensor(tokens['attention_mask'])
  roberta_model.eval()  # Disable dropout for inference.
  with torch.no_grad():
    preds = roberta_model(seq,token_type_ids=None,attention_mask=mask,return_dict=True)
    # Convert logits to per-class probabilities.
    m = nn.Softmax(dim=1)
    output = m(preds['logits']).numpy()
    # Shift the class index from {0, 1, 2} to the labels {-1, 0, 1}.
    predictions = output.argmax(axis=1) - 1
    print("Probabilities")
    print("")
    print(output)
    print("")
    print("Predictions")
    print("")
    print(predictions)
  return predictions
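
The final step of `roberta_demo` turns logits into probabilities with a softmax and then shifts the argmax index by 1 to get a label. The sketch below reproduces that arithmetic in plain Python so it can be checked without a model. The logits are made-up numbers, and the helper names `softmax` and `to_labels` are hypothetical.

```python
import math

def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def to_labels(logits):
    # One probability row per input; label = argmax index - 1, i.e. in {-1, 0, 1}.
    probs = [softmax(row) for row in logits]
    labels = [max(range(len(p)), key=p.__getitem__) - 1 for p in probs]
    return probs, labels

# Made-up logits for two inputs (not real model output).
example_logits = [[2.0, 0.1, 0.5], [0.2, 0.1, 1.5]]
probs, labels = to_labels(example_logits)
print(labels)  # [-1, 1]
```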

Test tweet sentiment by replacing the strings below with your own tweets
---

In [None]:
_ = roberta_demo(['buy the dip let"s go','HODL','sell btc','SEC impose regularizationon BTC I lost money'])

Probabilities

[[0.7859076  0.04421418 0.16987815]
 [0.37858662 0.18195921 0.43945423]
 [0.27133885 0.19685046 0.53181064]
 [0.7392735  0.16082916 0.09989738]]

Predictions

[-1  1  1 -1]
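
The raw arrays are easier to read when each tweet is shown next to its label and the probability the model assigned to that label. The following sketch does this with the probabilities copied from the output above. The variable names are illustrative and not part of the notebook.

```python
# Inputs and probabilities copied from the cell and output above.
tweets = ['buy the dip let"s go', 'HODL', 'sell btc',
          'SEC impose regularizationon BTC I lost money']
probabilities = [
    [0.7859076, 0.04421418, 0.16987815],
    [0.37858662, 0.18195921, 0.43945423],
    [0.27133885, 0.19685046, 0.53181064],
    [0.7392735, 0.16082916, 0.09989738],
]

rows = []
for tweet, p in zip(tweets, probabilities):
    best = max(range(len(p)), key=p.__getitem__)  # index of the largest probability
    rows.append((best - 1, p[best], tweet))       # same index-to-label shift as roberta_demo

for label, confidence, tweet in rows:
    print(f"{label:+d}  {confidence:.2f}  {tweet}")
```

The second tweet's top probability is only about 0.44, so its label is much less certain than the first tweet's.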
