# 💬 Twitter‑RoBERTa Sentiment Classification – A Deep Dive  
This notebook shows how to:

1. Load the *CardiffNLP Twitter‑RoBERTa* model and its tokenizer  
2. Understand the architecture & training recipe behind the model  
3. Build a lightweight inference helper that returns *negative / neutral / positive*  
4. Apply it to a sample of 100 news articles  
5. Inspect the resulting sentiment distribution

> **Why Twitter‑RoBERTa?**  \
> RoBERTa 🔗 [paper](https://arxiv.org/pdf/1907.11692.pdf) revisits the original BERT pre‑training recipe: it drops next‑sentence prediction, uses dynamic masking, and trains longer on more data.  
> CardiffNLP adapted it to social‑media English by continuing pre‑training on about **58 M tweets** (about **124 M** for the later `-latest` checkpoint) and then fine‑tuned it on the **TweetEval** sentiment benchmark, where it reports strong results.


In [1]:
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer
import numpy as np
from scipy.special import softmax
import csv
import urllib.request
import pandas as pd
import torch



### 1 · Choosing the right 🤗 `Auto*` class  

| Class | Framework | Typical use‑case |
|-------|-----------|------------------|
| `AutoModelForSequenceClassification` | **PyTorch** (default in Transformers) | Most tutorials, GPU/CPU inference |
| `TFAutoModelForSequenceClassification` | **TensorFlow / Keras** | TF‑native pipelines or TPU serving |

Both wrap the same checkpoint weights; only the tensor containers differ. \
In this notebook we stay with **PyTorch** (`torch.no_grad()` disables gradients during inference for speed & memory savings).
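
To make the effect of `torch.no_grad()` concrete, here is a minimal sketch on a toy tensor (the tensor `w` and the arithmetic are purely illustrative and not part of the notebook's model). Inside the context manager, results are not attached to the autograd graph, which is what saves memory at inference time.

```python
import torch

# Toy tensor standing in for a model weight (illustrative only).
w = torch.ones(3, requires_grad=True)

# Outside no_grad, the result is tracked by autograd.
y_tracked = (w * 2).sum()

# Inside no_grad, the same computation is not tracked.
with torch.no_grad():
    y_untracked = (w * 2).sum()

print(y_tracked.requires_grad)    # tracked result
print(y_untracked.requires_grad)  # untracked result
```

The numerical values are identical in both cases; only the bookkeeping needed for back‑propagation is skipped.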


In [2]:
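# One-off download step: run once to cache the tokenizer and model locally, then keep commented out.
# task = 'sentiment'
# HUB_ID = f"cardiffnlp/twitter-roberta-base-{task}"       # model ID on the Hugging Face Hub
# LOCAL_DIR = f"./cardiffnlp/twitter-roberta-base-{task}"  # local folder loaded in the cells below

# tokenizer = AutoTokenizer.from_pretrained(HUB_ID)
# tokenizer.save_pretrained(f"./tokenizer_roberta/{task}_tokenizer")  # same path as the load cell

# # download label mapping (one "<id>\t<label>" pair per line)
# labels = []
# mapping_link = f"https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/{task}/mapping.txt"
# with urllib.request.urlopen(mapping_link) as f:
#     html = f.read().decode('utf-8').split("\n")
#     csvreader = csv.reader(html, delimiter='\t')
# labels = [row[1] for row in csvreader if len(row) > 1]

# model = AutoModelForSequenceClassification.from_pretrained(HUB_ID)
# model.save_pretrained(LOCAL_DIR)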
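
The label‑mapping step in the cell above can be tried offline. The sketch below assumes the TweetEval sentiment `mapping.txt` holds one tab‑separated `<id>` / `<label>` pair per line; the string `MAPPING_TEXT` and the helper `parse_label_mapping` are stand‑ins introduced here for illustration, so no network access is needed.

```python
import csv

# Assumed file contents: one "<id>\t<label>" pair per line (stand-in for the download).
MAPPING_TEXT = "0\tnegative\n1\tneutral\n2\tpositive\n"

def parse_label_mapping(text):
    """Return the label names in index order, skipping blank lines."""
    rows = csv.reader(text.split("\n"), delimiter="\t")
    return [row[1] for row in rows if len(row) > 1]

labels = parse_label_mapping(MAPPING_TEXT)
print(labels)
```

The position of each label in the returned list is what ties a model output index to a human‑readable class name.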

In [3]:
tokenizer = AutoTokenizer.from_pretrained("./tokenizer_roberta/sentiment_tokenizer/")

In [4]:
model = AutoModelForSequenceClassification.from_pretrained("./cardiffnlp/twitter-roberta-base-sentiment/")

2025-04-12 02:27:43.262188: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.

### Under the hood 🔍  

* **Encoder** – *RoBERTa‑base* (12 layers, 12 heads, 768 hidden, 125 M parameters).  
* **Pre‑training corpus** – 58 M tweets (or 124 M in the _latest_ variant), cleaned & deduplicated. \
  RoBERTa’s dynamic masking + longer training helps the model pick up slang & emoji semantics.  
* **Fine‑tuning** – `tweet_eval/sentiment` task (about 45 k labelled training tweets) with cross‑entropy on three classes.  
  The classification head is a small feed‑forward block (dense layer + tanh, then a linear projection to three logits) on top of the final hidden state of the `<s>` token, RoBERTa’s counterpart to BERT’s **[CLS]**.
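
The 125 M figure can be roughly reproduced by hand. The sketch below is a back‑of‑the‑envelope count that assumes the standard `roberta-base` configuration values (vocabulary of 50,265, 514 position embeddings, feed‑forward width 3,072) and a two‑layer classification head as implemented in Hugging Face's `RobertaForSequenceClassification`; none of these numbers come from the notebook itself, and the helper `linear` is introduced only for this illustration.

```python
# Back-of-the-envelope parameter count for a RoBERTa-base classifier.
# All configuration values below are assumptions (standard roberta-base settings).
HIDDEN, LAYERS, FFN = 768, 12, 3072
VOCAB, POSITIONS, NUM_LABELS = 50265, 514, 3

def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

layer_norm = 2 * HIDDEN  # scale and shift vectors

# Word, position and token-type embeddings, followed by a LayerNorm.
embeddings = VOCAB * HIDDEN + POSITIONS * HIDDEN + 1 * HIDDEN + layer_norm

# One transformer layer: Q, K, V and output projections, then the feed-forward block.
per_layer = (
    4 * linear(HIDDEN, HIDDEN) + layer_norm
    + linear(HIDDEN, FFN) + linear(FFN, HIDDEN) + layer_norm
)

encoder = embeddings + LAYERS * per_layer

# Classification head: dense layer, then projection to the three sentiment logits.
head = linear(HIDDEN, HIDDEN) + linear(HIDDEN, NUM_LABELS)

total = encoder + head
print(f"{total:,} parameters (~{total / 1e6:.0f} M)")
```

Under these assumptions the embeddings account for roughly a third of the parameters, and the classification head is negligible next to the encoder.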


In [9]:
def sentiment(row):
    """Return 'negative', 'neutral' or 'positive' for a single text."""
    text = row[:1500]  # Cheap character-level cap to keep inputs short
    # truncation=True guards against inputs that still exceed the 512-token limit
    encoded_input = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():  # Disable gradient calculation during inference
        output = model(**encoded_input)
    scores = output[0][0].cpu().numpy()  # Logits for the single input, as a NumPy array
    scores = softmax(scores)  # Convert logits to class probabilities

    labels = ["negative", "neutral", "positive"]  # Index order of the TweetEval sentiment mapping
    return labels[int(np.argmax(scores))]  # Label with the highest probability
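
To show what the softmax and label‑selection steps in `sentiment` do, here is a dependency‑free sketch. The logits are made‑up numbers rather than real model output, and `softmax_py` is a hypothetical stand‑in for `scipy.special.softmax`.

```python
import math

def softmax_py(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

LABELS = ["negative", "neutral", "positive"]
logits = [-1.2, 0.3, 2.1]  # hypothetical raw scores for one text

probs = softmax_py(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability

for label, p in zip(LABELS, probs):
    print(f"{label:<8} {p:.3f}")
print("predicted:", LABELS[best])
```

Because softmax preserves the ordering of the logits, the predicted label is the same whether the argmax is taken before or after normalisation; the probabilities are mainly useful when a confidence value is needed.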

In [13]:
df = pd.read_csv("../datasets/labelled.csv")
df = df.iloc[:100, :].copy()  # copy so that adding a column later does not touch a view

In [14]:
df.head()

Unnamed: 0,Heading,Body,Category,URL
0,free speech not hate speech madras high court ...,madras high court issue significant remark ami...,Judiciary,https://www.indiatoday.in/law/high-courts/stor...
1,comment take context say us cop mock indian st...,seattle police officer guild friday come defen...,Crime,https://www.indiatoday.in/world/story/indian-s...
2,first meeting one nation one election committe...,first official meeting one nation one election...,Politics,https://www.indiatoday.in/india/story/one-nati...
3,us airlines flight depressurize midair plummet...,united airlines jet head rome turn around less...,Crime,https://www.indiatoday.in/world/story/us-fligh...
4,terrorist kill security force foil infiltratio...,three terrorist kill infiltration bid foil sec...,Crime,https://www.indiatoday.in/india/story/one-terr...


In [15]:
df.shape

(100, 4)

In [16]:
df["Sentiment"] = df.Body.apply(lambda x: sentiment(str(x)))

In [None]:
df.head()

Unnamed: 0,Heading,Body,Category,URL,Sentiment
0,'Free speech should not be hate speech': Madra...,madras high court issue significant remark ami...,law,https://www.indiatoday.in/law/high-courts/stor...,neutral
1,"'Comments taken out of context', says US cop w...",seattle police officer guild friday come defen...,world,https://www.indiatoday.in/world/story/indian-s...,neutral
2,First meeting of 'One Nation One Election' com...,first official meeting one nation one election...,india,https://www.indiatoday.in/india/story/one-nati...,neutral
3,"US airlines’ flight depressurizes mid-air, plu...",united airlines jet head rome turn around less...,world,https://www.indiatoday.in/world/story/us-fligh...,neutral
4,3 terrorists killed as security forces foil in...,three terrorist kill infiltration bid foil sec...,india,https://www.indiatoday.in/india/story/one-terr...,negative
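
The last step listed in the introduction, inspecting the sentiment distribution, could look roughly like the sketch below. It uses only the five labels visible in the `df.head()` output above as a stand‑in sample; in the notebook itself, `df["Sentiment"].value_counts()` would give the same kind of counts over all 100 rows.

```python
from collections import Counter

# Stand-in sample: the five labels shown in the df.head() output above.
sample = ["neutral", "neutral", "neutral", "neutral", "negative"]

counts = Counter(sample)
total = sum(counts.values())

# Report every class, including ones that never occur in the sample.
for label in ("negative", "neutral", "positive"):
    share = counts[label] / total
    print(f"{label:<8} {counts[label]:>3}  {share:.0%}")
```

Listing all three classes explicitly makes a missing class show up as a zero instead of silently disappearing from the summary.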
