# Sentiment Analysis using RoBERTa

Install necessary libraries

In [1]:
!pip install --upgrade accelerate datasets


In [1]:
!pip install transformers
!pip install datasets



In [2]:
# PyTorch deep learning
import torch

# Pandas + Numpy
import numpy as np
import pandas as pd

# Sklearn metrics
from sklearn.metrics import balanced_accuracy_score, accuracy_score

# Hugging Face Transformer Libraries
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline, Trainer, TrainingArguments

# Hugging Face
from datasets import Dataset

In [3]:
if torch.cuda.is_available():
  print("CUDA available. GPU will be used for computation.")
  device = 0
else:
  print("CUDA not available. CPU will be used for computation.")
  device = -1

CUDA available. GPU will be used for computation.


# Load Sentiment Dataset


This dataset consists of financial tweets labeled with sentiments: bullish (1), bearish (2), and neutral (0). It includes 17,368 bullish, 8,542 bearish, and 12,181 neutral tweets, sourced from various reputable financial datasets. The data is preprocessed for consistency and quality, making it ideal for fine-tuning machine learning models to predict sentiment trends in financial markets and stock discussions.

In [4]:
# Load in the dataset and map the sentiment to a label.
df = pd.read_parquet("hf://datasets/TimKoornstra/financial-tweets-sentiment/data/train-00000-of-00001.parquet")
df['label_name'] = df['sentiment'].map({0: 'Neutral', 1: 'Positive', 2: 'Negative'})
df.rename(columns={'sentiment': 'sentiment_old'}, inplace=True)

# Remap the labels to RoBERTa's label ids so it can predict correctly.
df['sentiment'] = df['label_name'].map({'Negative': 0, 'Neutral': 1, 'Positive': 2})
df.head()



Unnamed: 0,tweet,sentiment_old,url,label_name,sentiment
0,$BYND - JPMorgan reels in expectations on Beyo...,2,https://huggingface.co/datasets/zeroshot/twitt...,Negative,0
1,$CCL $RCL - Nomura points to bookings weakness...,2,https://huggingface.co/datasets/zeroshot/twitt...,Negative,0
2,"$CX - Cemex cut at Credit Suisse, J.P. Morgan ...",2,https://huggingface.co/datasets/zeroshot/twitt...,Negative,0
3,$ESS: BTIG Research cuts to Neutral https://t....,2,https://huggingface.co/datasets/zeroshot/twitt...,Negative,0
4,$FNKO - Funko slides after Piper Jaffray PT cu...,2,https://huggingface.co/datasets/zeroshot/twitt...,Negative,0
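
The two-step remapping in the cell above can be read as a single id-to-id mapping. The following is a minimal sketch of that composition; the dictionary names `dataset_to_name` and `name_to_roberta` are introduced here for illustration only and do not appear in the notebook.

```python
# Dataset ids as described above: 0 = neutral, 1 = bullish, 2 = bearish
dataset_to_name = {0: "Neutral", 1: "Positive", 2: "Negative"}

# Target ids used for the RoBERTa checkpoint in this notebook
name_to_roberta = {"Negative": 0, "Neutral": 1, "Positive": 2}

# Compose both steps into one mapping from dataset id to RoBERTa id
dataset_to_roberta = {k: name_to_roberta[v] for k, v in dataset_to_name.items()}
print(dataset_to_roberta)  # {0: 1, 1: 2, 2: 0}
```

Going through the label names rather than hard-coding `{0: 1, 1: 2, 2: 0}` keeps the intent readable and makes a swapped id easier to spot.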


# Looking at the distribution of dataset

In [None]:
import pandas as pd
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# Calculate sentiment counts and percentages
label_count = df['label_name'].value_counts()
label_distribution = df['label_name'].value_counts(normalize=True)

data = pd.DataFrame({
    'Sentiment Label': label_count.index,
    'Count': label_count.values,
    'Percentage': label_distribution.values * 100
})

# Create subplots: one xy for bar chart, one domain for pie chart
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=("Sentiment Labels Distribution", "Sentiment Distribution (%)"),
    specs=[[{"type": "xy"}, {"type": "domain"}]]
)

# Add horizontal bar chart with count labels
fig.add_trace(go.Bar(
    x=data['Count'],
    y=data['Sentiment Label'],
    orientation='h',
    marker_color='gray',
    text=data['Count'],
    textposition='auto'
), row=1, col=1)

# Update axis labels for the horizontal bar chart
fig.update_xaxes(title_text="Number of Tweets", row=1, col=1)
fig.update_yaxes(title_text="Sentiment", row=1, col=1)

# Define a minimal, neutral color palette (using shades of gray)
pie_colors = ['#808080', '#A9A9A9', '#C0C0C0'][:len(data)]

# Add pie chart with both label and percent shown
fig.add_trace(go.Pie(
    labels=data['Sentiment Label'],
    values=data['Percentage'],
    marker=dict(colors=pie_colors),
    textinfo='label+percent'
), row=1, col=2)

fig.update_layout(title="Sentiment Analysis Overview", showlegend=False)
fig.show()


The dataset is imbalanced. The positive class makes up about 46% of all datapoints (17,368 of 38,091), compared with 32% for neutral and 22% for negative.
Positive tweets are roughly twice as frequent as negative ones (17,368 vs. 8,542).
A RoBERTa model fine-tuned on this data may therefore be biased towards predicting the majority positive class and may struggle to identify the underrepresented negative class.
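
One common way to counter this kind of imbalance is to weight each class inversely to its frequency. The sketch below only computes such weights from the counts quoted above, using the same heuristic as scikit-learn's `class_weight="balanced"` option. The notebook itself does not apply class weights, so treat this as an optional illustration; feeding the weights into a weighted loss during fine-tuning would be a further assumption.

```python
# Class counts taken from the dataset description above
counts = {"Negative": 8542, "Neutral": 12181, "Positive": 17368}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: weight = n_samples / (n_classes * class_count)
class_weights = {label: n_samples / (n_classes * c) for label, c in counts.items()}

for label, weight in class_weights.items():
    share = counts[label] / n_samples
    print(f"{label}: share={share:.3f}, weight={weight:.3f}")
```

With these weights, the rarest class (Negative) gets the largest weight and the most frequent class (Positive) the smallest.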


# Preprocessing the Text

In [11]:
# Install dependencies
!pip install transformers datasets scipy

# Imports
import torch
import numpy as np
from scipy.special import softmax
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
from sklearn.metrics import precision_score, recall_score, f1_score, matthews_corrcoef, roc_auc_score




In [12]:
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)
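
A short usage sketch of the helper above. The tweet is made up for illustration, and the function is repeated so that the block runs on its own.

```python
def preprocess(text):
    """Replace user mentions and links with placeholder tokens."""
    new_text = []
    for t in text.split(" "):
        # Mask user handles such as "@someone" (a lone "@" is left unchanged)
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        # Mask anything that starts with "http" (links)
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

example = "@trader_joe $SPX looks strong https://t.co/abc123"  # hypothetical tweet
print(preprocess(example))  # @user $SPX looks strong http
```

To apply it to the whole dataset, something like `df['tweet'].map(preprocess)` should work, assuming every entry in the column is a string.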

# Using the Transformer Pipeline

The Transformers pipeline streamlines NLP tasks: it tokenizes the text automatically, performs model inference (such as text classification or generation), and returns the results in a simple format.

In [13]:
# Load model
MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
config = AutoConfig.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.to(device)
model.eval()

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
         

In [None]:
model.config

RobertaConfig {
  "_attn_implementation_autoset": true,
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "negative",
    "1": "neutral",
    "2": "positive"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "negative": 0,
    "neutral": 1,
    "positive": 2
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.50.0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

In [None]:
id_2_label = model.config.id2label
id_2_label

{0: 'negative', 1: 'neutral', 2: 'positive'}

Labels: 0 -> Negative; 1 -> Neutral; 2 -> Positive

In [14]:
sentiment_pipeline = pipeline(task="sentiment-analysis",
                              model=model,
                              tokenizer = tokenizer,
                              device=device,
                              padding=True,  # Automatically pad sequences to the max length
                              truncation=True,  # Automatically truncate sequences that exceed the max length
                              max_length=512)  # Ensure sequences are capped at 512 tokens



Device set to use cuda:0


In [15]:
# Test sentences
sentence1 = "The market outlook is very positive thanks to the new economic policies"
sentence2 = "The market outlook is very negative thanks to the new economic policies"
sentence3 = "The market outlook is neutral thanks to the new economic policies"

print(sentiment_pipeline(sentence1))
print(sentiment_pipeline(sentence2))
print(sentiment_pipeline(sentence3))

[{'label': 'positive', 'score': 0.9717293381690979}]
[{'label': 'negative', 'score': 0.8411950469017029}]
[{'label': 'positive', 'score': 0.7405543327331543}]


As with FinBERT, the off-the-shelf Twitter RoBERTa sentiment model does not seem to classify neutral text very well: the third sentence is labeled positive with a score of 0.74.

# Make predictions on entire dataset

In [16]:
preds = sentiment_pipeline(df['tweet'].tolist())

In [None]:
preds[0:20]

[{'label': 'neutral', 'score': 0.7637107968330383},
 {'label': 'neutral', 'score': 0.54478520154953},
 {'label': 'neutral', 'score': 0.4959683418273926},
 {'label': 'neutral', 'score': 0.7944470643997192},
 {'label': 'neutral', 'score': 0.8931910991668701},
 {'label': 'neutral', 'score': 0.7817732095718384},
 {'label': 'negative', 'score': 0.5262595415115356},
 {'label': 'neutral', 'score': 0.8672782182693481},
 {'label': 'neutral', 'score': 0.7864125967025757},
 {'label': 'neutral', 'score': 0.6394606828689575},
 {'label': 'neutral', 'score': 0.5629881024360657},
 {'label': 'neutral', 'score': 0.7765313982963562},
 {'label': 'neutral', 'score': 0.7441617250442505},
 {'label': 'neutral', 'score': 0.6873923540115356},
 {'label': 'neutral', 'score': 0.7812108993530273},
 {'label': 'negative', 'score': 0.6529240608215332},
 {'label': 'neutral', 'score': 0.8125539422035217},
 {'label': 'negative', 'score': 0.5313276052474976},
 {'label': 'neutral', 'score': 0.7857143878936768},
 {'label': 

In [17]:
# Extract prediction name from label key
df['prediction'] = [pred['label'] for pred in preds]

In [18]:
df.groupby(['label_name', 'prediction']).size()

label_name,prediction,count
Negative,negative,3503
Negative,neutral,4436
Negative,positive,603
Neutral,negative,959
Neutral,neutral,9363
Neutral,positive,1859
Positive,negative,772
Positive,neutral,7569
Positive,positive,9027


In [None]:
import plotly.express as px

# Pivot to wide format
conf_matrix = df.groupby(['label_name', 'prediction']).size().unstack().fillna(0)

# Plot heatmap
fig = px.imshow(
    conf_matrix,
    text_auto=True,
    color_continuous_scale='Blues',
    labels=dict(x="Predicted Label", y="True Label", color="Count"),
    x=conf_matrix.columns,
    y=conf_matrix.index,
    title="Confusion Matrix"
)
fig.update_layout(xaxis_side="top")
fig.show()

In [None]:
import plotly.express as px

# Prepare data
grouped = df.groupby(['label_name', 'prediction']).size().reset_index(name='count')

# Plot grouped bar chart
fig = px.bar(
    grouped,
    x='label_name',
    y='count',
    color='prediction',
    barmode='group',
    title='Prediction Distribution per True Label',
    labels={'label_name': 'True Label', 'count': 'Number of Tweets', 'prediction': 'Predicted'}
)
fig.show()



In [None]:
import plotly.express as px
import pandas as pd

# Group the data
grouped = df.groupby(['label_name', 'prediction']).size().reset_index(name='count')

# Add correctness column
grouped['correct'] = grouped['label_name'].str.lower() == grouped['prediction'].str.lower()

# Map patterns and legend labels
def get_pattern_and_label(row):
    if row['correct']:
        return '', 'Correct'
    pred = row['prediction'].lower()
    if pred == 'neutral':
        return '/', 'Misclassified as Neutral'
    elif pred == 'positive':
        return '.', 'Misclassified as Positive'
    elif pred == 'negative':
        return 'x', 'Misclassified as Negative'
    else:
        return 'x', f'Misclassified as {row["prediction"]}'

# Apply to get both pattern and label
grouped[['pattern', 'legend_label']] = grouped.apply(
    get_pattern_and_label, axis=1, result_type='expand'
)

# Plot
fig = px.bar(
    grouped,
    x='label_name',
    y='count',
    pattern_shape='legend_label',  # legend will use this
    pattern_shape_sequence=['', '/', '.', 'x'],
    color='legend_label',          # ensures consistent mapping
    color_discrete_sequence=['lightblue'] * 10,
    hover_data=['prediction', 'count', 'correct'],
    title='Prediction Distribution per True Label',
    labels={'label_name': 'True Label', 'count': 'Number of Tweets', 'legend_label': 'Prediction Type'}
)

# Final layout cleanup
fig.update_layout(
    legend_title_text='Prediction Type',
    legend=dict(traceorder="normal")
)

fig.show()

Based on the financial tweets in this dataset, these graphs show that RoBERTa is best at recognizing neutral tweets (9,363 of 12,181 correct) and much weaker on negative (3,503 of 8,542) and positive (9,027 of 17,368) tweets, many of which it labels as neutral.

In [21]:
from sklearn.metrics import classification_report

# Map string labels to numeric
label_mapping = {'negative': 0, 'neutral': 1, 'positive': 2}
id2label = {v: k.capitalize() for k, v in label_mapping.items()}

y_true = df['label_name'].str.lower().map(label_mapping)
y_pred = df['prediction'].str.lower().map(label_mapping)

# Get per-class scores using classification_report
report = classification_report(y_true, y_pred, target_names=[id2label[i] for i in range(3)], output_dict=True, zero_division=0)

# Convert to DataFrame
report_df = pd.DataFrame(report).transpose()

# Round and clean
report_df = report_df[['precision', 'recall', 'f1-score', 'support']]
report_df = report_df.round(4)

# Optional: Rename index
report_df.index.name = 'Label'
report_df.reset_index(inplace=True)

# Show the table
print(report_df.to_string(index=False))


       Label  precision  recall  f1-score    support
    Negative     0.6693  0.4101    0.5086  8542.0000
     Neutral     0.4382  0.7687    0.5582 12181.0000
    Positive     0.7857  0.5197    0.6256 17368.0000
    accuracy     0.5748  0.5748    0.5748     0.5748
   macro avg     0.6311  0.5662    0.5641 38091.0000
weighted avg     0.6485  0.5748    0.5778 38091.0000


RoBERTa's out-of-the-box performance is moderate. The model achieves an overall accuracy of 57.5%, with performance varying across classes. It performs best on Positive tweets (F1 62.6%), showing high precision (78.6%) but lower recall (52.0%): when it predicts Positive it is usually right, but it misses almost half of the Positive tweets. For Neutral tweets, recall is high (76.9%) but precision is low (43.8%), meaning it frequently labels other sentiments as Neutral. Negative tweets are the most challenging: precision is reasonable (66.9%), but recall is the lowest of the three classes (41.0%), giving the lowest F1 (50.9%). The macro-average F1 is 56.4%, while the weighted F1 is slightly higher (57.8%) because the largest class, Positive, also has the best F1.
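
To make the link between the confusion counts and the report explicit, the sketch below recomputes precision, recall, F1, accuracy, and the macro and weighted F1 averages in plain Python. The counts are copied from the `groupby` output earlier in the notebook; the variable names are illustrative.

```python
# Counts copied from the groupby output above: confusion[true][predicted]
labels = ["negative", "neutral", "positive"]
confusion = {
    "negative": {"negative": 3503, "neutral": 4436, "positive": 603},
    "neutral":  {"negative": 959,  "neutral": 9363, "positive": 1859},
    "positive": {"negative": 772,  "neutral": 7569, "positive": 9027},
}

total = sum(sum(row.values()) for row in confusion.values())
metrics = {}
for c in labels:
    tp = confusion[c][c]
    support = sum(confusion[c].values())              # tweets whose true label is c
    predicted = sum(confusion[t][c] for t in labels)  # tweets predicted as c
    precision = tp / predicted
    recall = tp / support
    f1 = 2 * precision * recall / (precision + recall)
    metrics[c] = {"precision": precision, "recall": recall, "f1": f1, "support": support}

accuracy = sum(confusion[c][c] for c in labels) / total
macro_f1 = sum(m["f1"] for m in metrics.values()) / len(labels)              # every class counts equally
weighted_f1 = sum(m["f1"] * m["support"] for m in metrics.values()) / total  # classes weighted by support

for c in labels:
    m = metrics[c]
    print(f"{c:>8}: precision={m['precision']:.4f} recall={m['recall']:.4f} f1={m['f1']:.4f}")
print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f} weighted_f1={weighted_f1:.4f}")
```

The macro average treats all three classes equally, whereas the weighted average gives the 17,368 Positive tweets the most influence, which is why it ends up slightly higher here.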


# Finetuning

In [8]:
df['sentiment'].value_counts()

sentiment,count
2,17368
1,12181
0,8542


In [22]:
# Split into train/val/tests for later comparison
train_end_point = int(df.shape[0]*0.6) # 60% train, 20% validation, 20% test
val_end_point = int(df.shape[0]*0.8)

df_train = df.iloc[:train_end_point,:]
df_val = df.iloc[train_end_point:val_end_point,:]
df_test = df.iloc[val_end_point:,:]

print(df_train.shape, df_val.shape, df_test.shape)

(22854, 6) (7618, 6) (7619, 6)
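
The split above slices the rows in file order. If the file is grouped by source or label, the three parts can end up with different class mixes; the test-set report further down, for example, lists 5,194 Positive tweets out of 7,619, a much higher share than in the full dataset. A shuffled, stratified split is one alternative. The sketch below shows the idea in pure Python on toy labels; `stratified_split` is a hypothetical helper, not part of the notebook, and in practice scikit-learn's `train_test_split` with its `stratify` argument would be the more usual tool.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=42):
    """Return index lists (train, val, test) with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = int(len(indices) * fractions[0])
        n_val = int(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test split

    # Shuffle again so each split is not grouped by class
    for part in (train, val, test):
        rng.shuffle(part)
    return train, val, test

# Toy labels sorted by class, mimicking a file that is grouped by label
toy_labels = [0] * 20 + [1] * 30 + [2] * 50
train_idx, val_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 60 20 20
```

The returned index lists could then be used with `df.iloc[...]` to build the three DataFrames.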


In [20]:
from sklearn.metrics import classification_report

# Map string labels to numeric
label_mapping = {'negative': 0, 'neutral': 1, 'positive': 2}
id2label = {v: k.capitalize() for k, v in label_mapping.items()}

y_true = df_test['label_name'].str.lower().map(label_mapping)
y_pred = df_test['prediction'].str.lower().map(label_mapping)

# Get per-class scores using classification_report
report = classification_report(y_true, y_pred, target_names=[id2label[i] for i in range(3)], output_dict=True, zero_division=0)

# Convert to DataFrame
report_df = pd.DataFrame(report).transpose()

# Round and clean
report_df = report_df[['precision', 'recall', 'f1-score', 'support']]
report_df = report_df.round(4)

# Optional: Rename index
report_df.index.name = 'Label'
report_df.reset_index(inplace=True)

# Show the table
print(report_df.to_string(index=False))



Convert to Hugging Face datasets in preparation for fine-tuning

In [23]:
# Converting pandas df into hugging face dataset objects:
dataset_train = Dataset.from_pandas(df_train)
dataset_val = Dataset.from_pandas(df_val)
dataset_test = Dataset.from_pandas(df_test)

# Tokenizing the datasets:
dataset_train = dataset_train.map(lambda e: tokenizer(e['tweet'], truncation=True, padding = 'max_length', max_length = 512), batched=True)
dataset_val = dataset_val.map(lambda e: tokenizer(e['tweet'], truncation=True, padding = 'max_length', max_length = 512), batched=True)
dataset_test = dataset_test.map(lambda e: tokenizer(e['tweet'], truncation=True, padding = 'max_length', max_length = 512), batched=True)

# Shuffle the training dataset
dataset_train_shuffled = dataset_train.shuffle(seed=42)


In [24]:
dataset_train_shuffled = dataset_train_shuffled.rename_column("sentiment", "labels")
dataset_val = dataset_val.rename_column("sentiment", "labels")
dataset_test = dataset_test.rename_column("sentiment", "labels")

In [25]:
dataset_train_shuffled.column_names

['tweet',
 'sentiment_old',
 'url',
 'label_name',
 'labels',
 'prediction',
 'input_ids',
 'attention_mask']

In [None]:
print(model.config.label2id)
print(model.config.id2label)

{'negative': 0, 'neutral': 1, 'positive': 2}
{0: 'negative', 1: 'neutral', 2: 'positive'}


Define trainer to finetune model

In [46]:
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    matthews_corrcoef
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Convert logits to predicted class ids
    prediction = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, prediction),
        "balanced_accuracy": balanced_accuracy_score(labels, prediction),
        "precision": precision_score(labels, prediction, average='macro', zero_division=0),
        "recall": recall_score(labels, prediction, average='macro', zero_division=0),
        "f1": f1_score(labels, prediction, average='macro', zero_division=0),
        "mcc": matthews_corrcoef(labels, prediction)
    }


# Calculate this beforehand based on your train set length and batch size
steps_per_epoch = len(dataset_train_shuffled) // 32  # batch size = 32

args = TrainingArguments(
    output_dir='temp/',
    eval_strategy='steps',
    eval_steps=steps_per_epoch,
    save_strategy='steps',
    save_steps=steps_per_epoch,
    save_total_limit=2,
    logging_strategy='steps',
    logging_steps=50,
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model='balanced_accuracy',
    push_to_hub=False,
    report_to="none",
    seed=42
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset_train_shuffled,
    eval_dataset=dataset_val,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

Step,Training Loss,Validation Loss,Accuracy,Balanced Accuracy,Precision,Recall,F1,Mcc
714,0.1964,0.923573,0.749541,0.751531,0.736734,0.751531,0.742778,0.606532
1428,0.0941,1.207561,0.748753,0.753795,0.742236,0.753795,0.74505,0.608769
2142,0.0844,1.585273,0.74429,0.744716,0.740629,0.744716,0.73976,0.599489


TrainOutput(global_step=2145, training_loss=0.09978189929659828, metrics={'train_runtime': 1440.3911, 'train_samples_per_second': 47.6, 'train_steps_per_second': 1.489, 'total_flos': 1.8039582146267136e+16, 'train_loss': 0.09978189929659828, 'epoch': 3.0})
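
Two details of the log above can be checked with simple arithmetic. First, `steps_per_epoch` uses floor division, so evaluation happens at steps 714, 1428, and 2142, while the trainer reports 2,145 total steps because the last partial batch of each epoch also counts as a step. Second, with `load_best_model_at_end=True` and `metric_for_best_model='balanced_accuracy'`, the checkpoint with the highest balanced accuracy is the one expected to be restored. The sketch assumes a single device, no gradient accumulation, and that a higher metric value is treated as better.

```python
import math

n_train = 22854   # rows in df_train, from the split output above
batch_size = 32
epochs = 3

floor_steps = n_train // batch_size           # value used for eval_steps / save_steps
ceil_steps = math.ceil(n_train / batch_size)  # optimizer steps per epoch, incl. the last partial batch

print(floor_steps, ceil_steps, ceil_steps * epochs)  # 714 715 2145

# Evaluation rows copied from the training log: step -> balanced accuracy
balanced_accuracy = {714: 0.751531, 1428: 0.753795, 2142: 0.744716}
best_step = max(balanced_accuracy, key=balanced_accuracy.get)
print(best_step)  # 1428
```

The same log shows the validation loss rising from 0.92 to 1.59 while the training loss keeps falling, so the later checkpoints are not necessarily the better ones.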

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Convert trainer log to DataFrame
log_history = pd.DataFrame(trainer.state.log_history)

# Drop rows without a step number (e.g., initial setup logs)
log_history = log_history[log_history['step'].notna()]

# Plot loss
plt.figure(figsize=(10, 4))
if 'loss' in log_history.columns:
    plt.plot(log_history['step'], log_history['loss'], label='Train Loss')
if 'eval_loss' in log_history.columns:
    plt.plot(log_history['step'], log_history['eval_loss'], label='Val Loss')
plt.xlabel('Step')
plt.ylabel('Loss')
plt.title('Training and Validation Loss')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

# Plot accuracy-related metrics
for metric in ['eval_accuracy', 'eval_balanced_accuracy', 'eval_f1', 'eval_mcc']:
    if metric in log_history.columns:
        plt.figure(figsize=(10, 4))
        plt.plot(log_history['step'], log_history[metric], label=metric.replace('eval_', '').capitalize())
        plt.xlabel('Step')
        plt.ylabel(metric.replace('eval_', '').capitalize())
        plt.title(f'{metric.replace("eval_", "").capitalize()} over Time')
        plt.legend()
        plt.grid(True)
        plt.tight_layout()
        plt.show()

In [None]:
predictions = trainer.predict(dataset_test)
predictions

PredictionOutput(predictions=array([[-3.3424942 ,  0.57601756,  2.8354967 ],
       [-3.288018  ,  1.0679195 ,  2.1991563 ],
       [-3.639874  ,  3.6975234 ,  0.25189674],
       ...,
       [-3.0269525 , -0.8149146 ,  3.79376   ],
       [-2.9981217 , -0.87865895,  3.8866694 ],
       [ 1.8085198 ,  1.2239182 , -3.3164456 ]], dtype=float32), label_ids=array([2, 2, 2, ..., 2, 2, 0]), metrics={'test_loss': 1.474442481994629, 'test_accuracy': 0.6141225882661767, 'test_balanced_accuracy': 0.6376111254345935, 'test_precision': 0.6491584040265256, 'test_recall': 0.6376111254345935, 'test_f1': 0.5755248985718108, 'test_mcc': 0.41682645321565154, 'test_runtime': 48.0974, 'test_samples_per_second': 158.408, 'test_steps_per_second': 4.969})
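
The `predictions` array holds raw logits, one row per tweet and one column per class. The sketch below turns the first three rows shown above into labels and probabilities with a softmax, which, as far as I know, is also what the text-classification pipeline applies by default for a single-label model. It is written in plain Python; the names are illustrative.

```python
import math

id2label = {0: "negative", 1: "neutral", 2: "positive"}  # from model.config.id2label

# First three logit rows copied from the PredictionOutput above
logits = [
    [-3.3424942, 0.57601756, 2.8354967],
    [-3.288018, 1.0679195, 2.1991563],
    [-3.639874, 3.6975234, 0.25189674],
]

def softmax(row):
    """Numerically stable softmax for one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

predicted = []
for row in logits:
    probs = softmax(row)
    pred_id = max(range(len(probs)), key=lambda i: probs[i])  # argmax
    predicted.append((id2label[pred_id], round(probs[pred_id], 3)))

print(predicted)
```

The first three entries of `label_ids` are all 2 (positive), so the third row, whose largest logit belongs to neutral, is a misclassification.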

Save to Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Save the model and tokenizer after training
# Sebs
#model_save_path = '/content/drive/MyDrive/Colab-Notebooks/Thesis/Finetuned_RoBERTa'

# Bubs
model_save_path = '/content/drive/My Drive/Masters Thesis/Colab notebook/Sentiment analysis/Finetuned_RoBERTa'
model.save_pretrained(model_save_path)
tokenizer.save_pretrained(model_save_path)

trainer.save_model(model_save_path)
trainer.state.save_to_json(f"{model_save_path}/trainer_state.json")

In [None]:
# Save the model and tokenizer after training
model_save_path = '/content/drive/My Drive/Masters Thesis/Colab notebook/Sentiment analysis/Finetuned_RoBERTa'
model.save_pretrained(model_save_path)
tokenizer.save_pretrained(model_save_path)

('/content/drive/My Drive/Masters Thesis/Colab notebook/Sentiment analysis/Finetuned_RoBERTa/tokenizer_config.json',
 '/content/drive/My Drive/Masters Thesis/Colab notebook/Sentiment analysis/Finetuned_RoBERTa/special_tokens_map.json',
 '/content/drive/My Drive/Masters Thesis/Colab notebook/Sentiment analysis/Finetuned_RoBERTa/vocab.json',
 '/content/drive/My Drive/Masters Thesis/Colab notebook/Sentiment analysis/Finetuned_RoBERTa/merges.txt',
 '/content/drive/My Drive/Masters Thesis/Colab notebook/Sentiment analysis/Finetuned_RoBERTa/added_tokens.json',
 '/content/drive/My Drive/Masters Thesis/Colab notebook/Sentiment analysis/Finetuned_RoBERTa/tokenizer.json')

In [None]:
# Load the model and tokenizer from Google Drive when needed.
# After a runtime restart the imports have to be run again first.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained(model_save_path)
tokenizer = AutoTokenizer.from_pretrained(model_save_path)

Load trained model into the pipeline

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
model_save_path = '/content/drive/My Drive/Masters Thesis/Colab notebook/Sentiment analysis/Finetuned_RoBERTa'

# Load the model and tokenizer from Google Drive when needed
model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path=model_save_path, local_files_only=True)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_save_path, local_files_only=True)

In [None]:
trained_pipeline = pipeline("text-classification", model=model, tokenizer=tokenizer,device=device)

Device set to use cuda:0


Predict and evaluate accuracy

In [None]:
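preds = trained_pipeline(df_test['tweet'].tolist())
# df_test is a slice of df, so add the column via assign() to avoid a SettingWithCopyWarning
df_test = df_test.assign(prediction=[pred['label'] for pred in preds])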

In [None]:
from sklearn.metrics import classification_report

# Map string labels to numeric
label_mapping = {'negative': 0, 'neutral': 1, 'positive': 2}
id2label = {v: k.capitalize() for k, v in label_mapping.items()}

y_true = df_test['label_name'].str.lower().map(label_mapping)
y_pred = df_test['prediction'].str.lower().map(label_mapping)

# Get per-class scores using classification_report
report = classification_report(y_true, y_pred, target_names=[id2label[i] for i in range(3)], output_dict=True, zero_division=0)

# Convert to DataFrame
report_df = pd.DataFrame(report).transpose()

# Round and clean
report_df = report_df[['precision', 'recall', 'f1-score', 'support']]
report_df = report_df.round(4)

# Optional: Rename index
report_df.index.name = 'Label'
report_df.reset_index(inplace=True)

# Show the table
print(report_df.to_string(index=False))

       Label  precision  recall  f1-score   support
    Negative     0.7881  0.5029    0.6140 1398.0000
     Neutral     0.2542  0.8033    0.3861 1027.0000
    Positive     0.9052  0.6067    0.7265 5194.0000
    accuracy     0.6141  0.6141    0.6141    0.6141
   macro avg     0.6492  0.6376    0.5755 7619.0000
weighted avg     0.7960  0.6141    0.6599 7619.0000


# Load in scraped dataset

In [None]:
# Bubs
# df_scraped = pd.read_excel( '/content/drive/My Drive/Masters Thesis/Colab notebook/final_SPX500_data.xlsx')
df_scraped = pd.read_excel( '/content/drive/My Drive/Masters Thesis/Colab notebook/Sentiment analysis/labeled_sentiment.xlsx')

# Sebs
# df_scraped = pd.read_excel( '/content/drive/MyDrive/Colab-Notebooks/Thesis/final_SPX500_data.xlsx')

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Define model path (where it was saved previously)
# Sebs
# model_save_path = '/content/drive/MyDrive/Colab-Notebooks/Thesis/Finetuned_RoBERTa'

# Bubs
model_save_path = '/content/drive/My Drive/Masters Thesis/Colab notebook/Sentiment analysis/Finetuned_RoBERTa'

# Load the trained model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_save_path)
tokenizer = AutoTokenizer.from_pretrained(model_save_path)

classifier = pipeline('text-classification', model=model, tokenizer=tokenizer)


Device set to use cuda:0


In [None]:
df_scraped.columns

Index(['Author_Handle', 'Date', 'X_Post', 'Reply_Count', 'Repost_Count',
       'Like_Count', 'View_Count', 'Follower_Count', 'Verified_Status',
       'Manual Sentiment'],
      dtype='object')

In [None]:
# Keep only manually labeled rows; copy() so that new columns can be added safely later
df_filtered = df_scraped[df_scraped['Manual Sentiment'].notna()].copy()

In [None]:
# Convert text column to a list
X_Posts = df_filtered['X_Post'].tolist()

# Make predictions
predictions = classifier(X_Posts)

# Convert predictions to DataFrame format
df_filtered['Prediction'] = [pred['label'] for pred in predictions]
df_filtered['Confidence'] = [pred['score'] for pred in predictions]


In [None]:
# Define your known labels (manual sentiment categories)
labels = sorted(df_filtered['Manual Sentiment'].dropna().unique())  # or define manually, e.g. ['negative', 'neutral', 'positive']
label_to_int = {label: idx for idx, label in enumerate(labels)}

# Filter only rows that have manual sentiment
mask = df_filtered["Manual Sentiment"].notna() & (df_filtered["Manual Sentiment"] != "")
y_true = df_filtered.loc[mask, "Manual Sentiment"].map(label_to_int)
y_pred = df_filtered.loc[mask, "Prediction"].map(label_to_int)

# Sanity check: ensure no missing mappings
assert not y_true.isna().any(), "Some manual labels couldn't be mapped"
assert not y_pred.isna().any(), "Some predicted labels couldn't be mapped"
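
The mapping above is built from the manual labels, so the predictions only map cleanly if both columns use exactly the same spelling and case; the assertions catch the case where they do not. If they ever fail, normalizing both sides before mapping is one possible fix. The sketch below uses made-up label values, since the notebook does not show the contents of the `Manual Sentiment` column, and `normalize_label` is a hypothetical helper.

```python
def normalize_label(label):
    """Lower-case and strip a label so 'Positive ' and 'positive' compare equal."""
    return str(label).strip().lower()

label_to_int = {"negative": 0, "neutral": 1, "positive": 2}

manual = ["Positive", "neutral ", "NEGATIVE"]     # hypothetical manual annotations
predicted = ["positive", "neutral", "positive"]   # label strings as returned by the pipeline

y_true = [label_to_int[normalize_label(x)] for x in manual]
y_pred = [label_to_int[normalize_label(x)] for x in predicted]
print(y_true, y_pred)  # [2, 1, 0] [2, 1, 2]
```

With pandas, the same normalization could be applied as `.str.strip().str.lower()` on both columns before calling `.map(label_to_int)`.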

In [None]:
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    f1_score,
    matthews_corrcoef,
    precision_score,
    recall_score
)

accuracy = accuracy_score(y_true, y_pred)
balanced_acc = balanced_accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average="weighted")
mcc = matthews_corrcoef(y_true, y_pred)
precision = precision_score(y_true, y_pred, average="weighted")
recall = recall_score(y_true, y_pred, average="weighted")

print("Evaluation Metrics:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Balanced Accuracy: {balanced_acc:.4f}")
print(f"F1 Score (weighted): {f1:.4f}")
print(f"Matthews Correlation Coefficient (MCC): {mcc:.4f}")
print(f"Precision (weighted): {precision:.4f}")
print(f"Recall (weighted): {recall:.4f}")


Evaluation Metrics:
Accuracy: 0.8019
Balanced Accuracy: 0.8017
F1 Score (weighted): 0.7997
Matthews Correlation Coefficient (MCC): 0.7074
Precision (weighted): 0.8081
Recall (weighted): 0.8019


Now that the fine-tuned RoBERTa reaches about 80% accuracy on the manually labeled posts, we use it to predict sentiment for the whole scraped dataset.

In [None]:
df_twitter = pd.read_excel( '/content/drive/My Drive/Masters Thesis/Colab notebook/final_SPX500_data.xlsx')

# Load the trained model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_save_path)
tokenizer = AutoTokenizer.from_pretrained(model_save_path)

classifier = pipeline('text-classification', model=model, tokenizer=tokenizer)

# Convert text column to a list
X_Posts = df_twitter['X_Post'].tolist()

# Make predictions
predictions = classifier(X_Posts)

# Convert predictions to DataFrame format
df_twitter['Prediction'] = [pred['label'] for pred in predictions]
df_twitter['Confidence'] = [pred['score'] for pred in predictions]


Device set to use cuda:0


In [None]:
df_twitter.head()

Unnamed: 0,Author_Handle,Date,X_Post,Reply_Count,Repost_Count,Like_Count,View_Count,Follower_Count,Verified_Status,Prediction,Confidence
0,SanctionsAml,2022-01-01 11:04:50,2021 amid 10%+ #inflation (1980 methodology). ...,0,0,0,0,502,0,positive,0.545173
1,sachinksd1,2022-01-01 04:00:24,#DXY breaking down .. #SPX500 is on thin ICE ....,1,1,0,0,3302,1,negative,0.986617
2,Smartmoov2,2022-01-01 00:48:28,2021 📆 Asset Performance:\n\n#Gold ⛏ -4%\n#SPX...,0,0,0,0,15,0,neutral,0.519559
3,GrizzlyBulls,2022-01-01 20:13:24,\nWeekly Market Analysis #investing #Stocks #t...,0,0,0,0,39,0,neutral,0.994965
4,InvariantPersp1,2022-01-01 04:22:59,#recession ... #StockMarket #Bubble edition\n\...,0,2,0,0,5643,0,negative,0.494702


In [None]:
# Bubs
df_twitter.to_excel("/content/drive/My Drive/Masters Thesis/Colab notebook/Sentiment analysis/RoBERTa_sentiment_preds.xlsx", index=False)

# Sebs
#df_scraped.to_excel("/content/drive/MyDrive/Colab-Notebooks/Thesis/SPX500_final_RoBERTa.xlsx", index=False)