# Sentiment Analysis

In this exercise, we're going to perform sentiment analysis on tweets.

In [1]:
import torch

## Data

We're going to use a truncated version of the [Sentiment140 dataset](https://www.kaggle.com/datasets/kazanova/sentiment140). The full dataset contains 1.6M tweets with a sentiment (polarity) label. Each line of the CSV file contains the following 6 fields:
1. target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
2. ids: the id of the tweet (2087)
3. date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
4. flag: the query (lyx). If there is no query, then this value is NO_QUERY.
5. user: the user that tweeted (robotickilldozr)
6. text: the text of the tweet (Lyx is cool)

Instead of the full dataset, we load the truncated CSV file `tweets.2000.processed.noemoticon.csv`. It contains 2000 tweets (one per line): 1000 with label 0 (negative) and 1000 with label 4 (positive), so it has no neutral examples.

In [2]:
import csv
fieldnames = ['label', 'id', 'date', 'query', 'user', 'text']
with open('tweets.2000.processed.noemoticon.csv', 'r', encoding='ISO-8859-1') as csvfile:
    reader = csv.DictReader(csvfile, fieldnames=fieldnames)
    for row in reader:
        print(f"Label: {row['label']}")
        print(f"User: {row['user']}")
        print(f"Text: {row['text']}")
        break
print(f"Label is of type {type(row['label'])}, convert to int like this: {int(row['label'])}")

Label: 0
User: _TheSpecialOne_
Text: @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D
Label is of type <class 'str'>, convert to int like this: 0


In [3]:
# read the CSV into a list of dictionaries (one dictionary per row)
with open('tweets.2000.processed.noemoticon.csv', 'r', encoding='ISO-8859-1') as csvfile:
    reader = csv.DictReader(csvfile, fieldnames=fieldnames)
    data = []
    for row in reader:
        data.append(row)
print(f"Number of rows in the csv: {len(data)}")

Number of rows in the csv: 2000
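
As a quick sanity check on the claim that the file holds 1000 tweets per label, the label column can be tallied with `collections.Counter`. The sketch below is a minimal illustration and not part of the original notebook: the helper name `count_labels` is hypothetical, and the three in-memory rows stand in for the `data` list loaded above.

```python
from collections import Counter

def count_labels(rows):
    """Tally how often each integer polarity label occurs in a list of row dicts."""
    return Counter(int(row['label']) for row in rows)

# Hypothetical stand-in rows; in the notebook you would pass `data` instead.
sample_rows = [
    {'label': '0', 'text': 'that is a bummer'},
    {'label': '4', 'text': 'Lyx is cool'},
    {'label': '0', 'text': 'missed the bus again'},
]
counts = count_labels(sample_rows)
print(dict(counts))
```

Calling `count_labels(data)` on the loaded file should report 1000 rows each for labels 0 and 4 if the description above holds.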


## Model

We're going to use a popular sentiment analysis model from the CardiffNLP group, which we download from Hugging Face: `cardiffnlp/twitter-roberta-base-sentiment`. As the name suggests, it is based on a pretrained RoBERTa model and was fine-tuned on tweets for sentiment analysis. Check out the [model description](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment?text=I+like+turtles.) for examples of how to use it.

In [4]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('cardiffnlp/twitter-roberta-base-sentiment')
model = AutoModelForSequenceClassification.from_pretrained('cardiffnlp/twitter-roberta-base-sentiment')
model.eval()

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
         

In [5]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

print(f"Using device: {device}")

Using device: cuda


From the model description, we see that tweets must first be preprocessed into the format the model expects: user mentions and links are replaced by placeholders. Let's implement the `preprocess` function as described in their example.

In [6]:
# Preprocess text (username and link placeholders)
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)
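
To see what the placeholders look like, here is a small self-contained check. It repeats the `preprocess` function from the cell above so that it runs on its own; the first example is a shortened version of the tweet printed earlier, and the second is a made-up string.

```python
def preprocess(text):
    """Replace @-mentions with '@user' and links with 'http' (same logic as above)."""
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

example = "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer."
cleaned = preprocess(example)
print(cleaned)

# A lone '@' is not a mention (length 1), so it is left untouched.
lone_at = preprocess("meet me @ noon")
print(lone_at)
```

Because the text is split on single spaces and joined back the same way, runs of multiple spaces in the original tweet are preserved.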

Let's classify the first tweet in our dataset. Give the probabilities of the sentiment classification model for each of the 3 classes it outputs: negative, neutral and positive sentiment.

In [7]:
labels = ['negative', 'neutral', 'positive']

In [9]:
from torch import softmax

row = data[0]

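# Sentiment140 polarities are 0/2/4; integer-divide by 2 to get an index into `labels` (0/1/2)
true_label = int(row['label']) // 2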
tweet = row['text']
tweet = preprocess(tweet)
print(f"Preprocessed Tweet: {tweet}")

encoded_tweet = tokenizer(tweet, return_tensors='pt').to(model.device)
with torch.no_grad():  # inference only, so skip gradient tracking
    output = model(**encoded_tweet)
scores = output.logits[0]
scores = softmax(scores, dim=-1)

print(f"Predicted Scores (true label = {labels[true_label]}):")
for i in range(scores.shape[0]):
    print(f"{labels[i]}: {scores[i]:.4f}")

Preprocessed Tweet: @user http - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D
Predicted Scores (true label = negative):
negative: 0.6341
neutral: 0.3154
positive: 0.0505
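
The probabilities above come from applying softmax to the model's raw logits: each logit is exponentiated and divided by the sum of all the exponentials, so the three scores are positive and sum to 1. The pure-Python sketch below illustrates that computation. The logit values are made up for illustration and are not the model's actual outputs, and `softmax_list` is a hypothetical helper, not the `torch.softmax` used in the notebook.

```python
import math

def softmax_list(logits):
    """Numerically stable softmax over a plain list of floats."""
    m = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for (negative, neutral, positive); not real model output.
example_logits = [1.5, 0.8, -1.0]
probs = softmax_list(example_logits)
for name, p in zip(['negative', 'neutral', 'positive'], probs):
    print(f"{name}: {p:.4f}")
```

Softmax preserves the ordering of the logits, so taking the argmax of the probabilities gives the same class as taking the argmax of the logits.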


Now classify all the tweets and evaluate how well the model does in terms of classification accuracy.

In [11]:
from sklearn.metrics import accuracy_score
from tqdm import tqdm

predicted_labels = []
true_labels = []

for row in tqdm(data):
    # map Sentiment140 polarity (0 or 4) to the model's class index (0 or 2)
    true_label = int(row['label']) // 2
    true_labels.append(true_label)
    tweet = row['text']
    tweet = preprocess(tweet)

    encoded_tweet = tokenizer(tweet, return_tensors='pt').to(model.device)
    with torch.no_grad():  # inference only, so skip gradient tracking
        output = model(**encoded_tweet)
    scores = output.logits[0]
    scores = softmax(scores, dim=-1)

    predicted_label = torch.argmax(scores).item()
    predicted_labels.append(predicted_label)

# calculate accuracy
accuracy = accuracy_score(true_labels, predicted_labels)
print(f"Accuracy: {accuracy:.4f}")

100%|██████████| 2000/2000 [00:20<00:00, 98.15it/s] 

Accuracy: 0.5840
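
The truncated dataset has only negative and positive labels, while the model can also answer neutral, so every neutral prediction is counted as wrong in the accuracy above. One way to separate these cases from genuine mistakes is to also score the model on a binary decision that compares only its negative and positive scores. The sketch below is an assumption about how one might do this, not part of the original exercise: `binary_prediction` and `accuracy` are hypothetical helpers, and the score triples are invented for illustration.

```python
NEGATIVE, NEUTRAL, POSITIVE = 0, 1, 2

def binary_prediction(scores):
    """Pick negative or positive, ignoring the neutral score entirely."""
    return NEGATIVE if scores[NEGATIVE] >= scores[POSITIVE] else POSITIVE

def accuracy(true_labels, predicted_labels):
    """Fraction of positions where prediction and true label agree."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Invented (negative, neutral, positive) scores; not real model output.
example_scores = [
    [0.60, 0.30, 0.10],  # clearly negative
    [0.20, 0.55, 0.25],  # neutral wins three-way, positive wins two-way
    [0.05, 0.15, 0.80],  # clearly positive
    [0.30, 0.50, 0.20],  # neutral wins three-way, negative wins two-way
]
example_true = [NEGATIVE, POSITIVE, POSITIVE, POSITIVE]

# Three-way prediction: index of the largest of the three scores.
three_way = [max(range(3), key=lambda i: s[i]) for s in example_scores]
# Two-way prediction: compare only the negative and positive scores.
two_way = [binary_prediction(s) for s in example_scores]

print(f"Three-way accuracy: {accuracy(example_true, three_way):.2f}")
print(f"Two-way accuracy:   {accuracy(example_true, two_way):.2f}")
```

In the notebook, the per-tweet `scores` tensor could be converted with `scores.tolist()` and passed to `binary_prediction` inside the evaluation loop.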



