In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split

Let's load the dataset and extract just the comments, rather than using the entire dataset..

In [7]:
data_frame = pd.read_csv("../datasets/reddit_sentiment.csv")
data_frame.head()

Unnamed: 0,clean_comment,category
0,family mormon have never tried explain them t...,1
1,buddhism has very much lot compatible with chr...,1
2,seriously don say thing first all they won get...,-1
3,what you have learned yours and only yours wha...,0
4,for your own benefit you may want read living ...,1


This dataset is far better as it is simple and concise.

In [8]:
data_frame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37249 entries, 0 to 37248
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   clean_comment  37149 non-null  object
 1   category       37249 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 582.1+ KB


There are about 37000 Tuples in this.. Let's drop some empty values:

In [9]:
data_frame = data_frame.dropna()

In [10]:
data_frame.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37149 entries, 0 to 37248
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   clean_comment  37149 non-null  object
 1   category       37149 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 870.7+ KB


Let's split the test and train dataset right away to keep it seperated from training at all times.

In [19]:
X = data_frame["clean_comment"]
y = data_frame["category"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=42)

Let's initialize and test Roberta:

In [21]:
from transformers import RobertaTokenizer, RobertaForSequenceClassification
import torch
model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = RobertaTokenizer.from_pretrained(model_path)
model = RobertaForSequenceClassification.from_pretrained(model_path).to("cuda")  # Move model to GPU

predicted_sentiments = []
scores = []

batch_size = 32
num_batches = len(X_train) // batch_size + (1 if len(X_train) % batch_size != 0 else 0)

for i in range(num_batches):
    start_idx = i * batch_size
    end_idx = (i + 1) * batch_size if i < num_batches - 1 else len(X_train)
    batch_comments = X_train[start_idx:end_idx]

    # Tokenize and process the batch
    inputs = tokenizer(list(batch_comments), return_tensors="pt", padding=True, truncation=True, max_length=max_length).to("cuda")  # Move inputs to GPU
    
    # Rest of your code for processing the batch

    with torch.no_grad():
        outputs = model(**inputs)
    
    # Extract scores and predicted sentiments
    logits = outputs.logits
    softmax_scores = torch.nn.functional.softmax(logits, dim=1)
    preds = torch.argmax(logits, dim=1)
    
   
    sentiment_mapping = {-1: 'negative', 0: 'neutral', 1: 'positive'}
    batch_predicted = [sentiment_mapping[pred.item()] for pred in preds]
    batch_scores = [score[pred.item()].item() for score, pred in zip(softmax_scores, preds)]
    
    predicted_sentiments.extend(batch_predicted)
    scores.extend(batch_scores)

print(predicted_sentiments)



Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


ValueError: cannot reshape array of size 24146 into shape (1161,32)