# RoBERTa 
RoBERTa stands for Robustly Optimized BERT Pretraining Approach.

It’s a pretrained language model developed by Facebook AI (Meta) in 2019.

# Essential Libraries to Install

1. Transformers – Main library for using RoBERTa

    - pip install transformers

2. Torch – Deep learning framework (this notebook uses the PyTorch implementation of RoBERTa)

    - pip install torch

3. Datasets – For loading datasets (optional, useful for fine-tuning)

    - pip install datasets

4. scikit-learn – For evaluation metrics (accuracy, precision, etc.)

    - pip install scikit-learn

5. Sentence-Transformers – For sentence embeddings with Sentence-RoBERTa models (optional)

    - pip install sentence-transformers



In [1]:
pip install transformers


Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install torch

Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install datasets

Note: you may need to restart the kernel to use updated packages.


In [4]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [5]:
pip install sentence-transformers


Note: you may need to restart the kernel to use updated packages.


# Import Libraries

In [9]:
import pandas as pd
from transformers import RobertaTokenizer, RobertaForSequenceClassification, Trainer, TrainingArguments
from transformers import DataCollatorWithPadding
from sklearn.model_selection import train_test_split
import torch
from datasets import Dataset


RuntimeError: Failed to import transformers.models.roberta.modeling_roberta because of the following error (look up to see its traceback):
No module named 'sympy'
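
The traceback above names a missing `sympy` module, which is typically pulled in as a PyTorch dependency. A quick way to find such gaps before importing is to check whether each module can be located. This is a minimal sketch: `missing_modules` is a hypothetical helper, and the list of names is an assumption based on the imports in this notebook.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names used by this notebook, plus sympy from the traceback above.
# Note that scikit-learn is imported as "sklearn".
required = ["pandas", "torch", "transformers", "datasets", "sklearn", "sympy"]

missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```

Any module reported as missing can be installed with pip (for example `pip install sympy`), followed by a kernel restart so the new package is picked up.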

In [8]:
train_df = pd.read_csv('./data/train_data.csv')
val_df = pd.read_csv('./data/val_data.csv')
test_df = pd.read_csv('./data/test_data.csv')


# Label Assignment for Binary Classification

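- Supervised fine-tuning requires a label for every example: the model uses the labels to compute the loss during training and to score its predictions afterwards. The cell below assigns placeholder labels that alternate 0, 1, 0, 1, ... by row position, so they do not depend on the content of the question or answer.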


In [10]:
# Alternate 0 and 1 by row position; a frame with an odd row count gets a trailing 0.
train_df["label"] = [0, 1] * (len(train_df) // 2) + [0] * (len(train_df) % 2)
val_df["label"] = [0, 1] * (len(val_df) // 2) + [0] * (len(val_df) % 2)
test_df["label"] = [0, 1] * (len(test_df) // 2) + [0] * (len(test_df) % 2)
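
To see what the labelling expression produces, it can be run on small row counts. This is a pure-Python sketch; `alternating_labels` is a hypothetical helper that wraps the same expression used above.

```python
from collections import Counter


def alternating_labels(n_rows):
    """Build the 0, 1, 0, 1, ... pattern used in the cell above."""
    return [0, 1] * (n_rows // 2) + [0] * (n_rows % 2)


for n_rows in (4, 5):
    labels = alternating_labels(n_rows)
    print(n_rows, labels, dict(Counter(labels)))
# 4 [0, 1, 0, 1] {0: 2, 1: 2}
# 5 [0, 1, 0, 1, 0] {0: 3, 1: 2}
```

The classes come out balanced, but each label depends only on the row index. A classifier trained on such labels should not be expected to score above chance, because the text carries no information about them.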

In [11]:
train_df.head()

Unnamed: 0,question,answer,label
0,Would I ever need credit card if my debit card...,Skimmers are most likely at gas station pumps....,0
1,Cheapest way to wire or withdraw money from US...,There is a number of cheaper online options th...,1
2,How do I go about finding an honest ethical f...,Large and wellknown companies are typically a ...,0
3,Why invest in becoming a landlord?,why does it make sense financially to buy prop...,1
4,What could be the cause of a extreme highlow p...,Often these types of trades fall into two diff...,0


# Convert to Hugging Face Dataset Format

- Convert each DataFrame into a Hugging Face Dataset.
- This format works directly with tokenizers and the Trainer pipeline, and it supports fast batched transformations such as tokenization.

In [13]:
from datasets import Dataset

In [14]:
train_ds = Dataset.from_pandas(train_df)
val_ds = Dataset.from_pandas(val_df)
test_ds = Dataset.from_pandas(test_df)

# Loading Tokenizer and Tokenizing the Dataset

- Tokenization converts raw text into the token IDs and attention masks that RoBERTa expects. Each question is paired with its answer as a single model input, padded or truncated to 128 tokens.
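
The layout of a paired input can be illustrated without downloading the tokenizer. The sketch below is a toy stand-in, not the real `RobertaTokenizer`: it splits on whitespace instead of using byte-level BPE, keeps string tokens instead of integer IDs, and approximates `truncation=True` by trimming one token at a time from the longer side. `encode_pair` and the example sentences are hypothetical.

```python
def encode_pair(question, answer, max_length=16):
    """Toy pair encoding: <s> question </s></s> answer </s>, then padding."""
    if max_length < 4:
        raise ValueError("max_length must leave room for four special tokens")
    q_tokens = question.split()
    a_tokens = answer.split()

    # Four positions are reserved for the special tokens.
    budget = max_length - 4
    # Trim one token at a time from whichever side is currently longer.
    while len(q_tokens) + len(a_tokens) > budget:
        if len(q_tokens) > len(a_tokens):
            q_tokens.pop()
        else:
            a_tokens.pop()

    tokens = ["<s>"] + q_tokens + ["</s>", "</s>"] + a_tokens + ["</s>"]
    attention_mask = [1] * len(tokens)

    # Pad to the fixed length; padded positions get mask 0.
    n_pad = max_length - len(tokens)
    tokens += ["<pad>"] * n_pad
    attention_mask += [0] * n_pad
    return tokens, attention_mask


tokens, mask = encode_pair("is this safe", "yes it is", max_length=12)
print(tokens)
print(mask)
```

With `padding="max_length"` every example has the same length, which keeps batching simple at the cost of computing over padded positions; the attention mask tells the model to ignore them.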

In [19]:
!pip uninstall scipy -y

Found existing installation: scipy 1.15.2
Uninstalling scipy-1.15.2:
  Successfully uninstalled scipy-1.15.2


In [20]:
!pip install scipy --upgrade

Collecting scipy
  Using cached scipy-1.15.2-cp312-cp312-macosx_14_0_arm64.whl.metadata (61 kB)
Using cached scipy-1.15.2-cp312-cp312-macosx_14_0_arm64.whl (22.4 MB)
Installing collected packages: scipy
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sweetviz 2.3.1 requires matplotlib>=3.1.3, which is not installed.
Successfully installed scipy-1.15.2


In [21]:
from transformers import RobertaTokenizer

In [22]:
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

def tokenize_function(record):
    return tokenizer(
        record["question"], 
        record["answer"], 
        padding="max_length", 
        truncation=True,
        max_length=128  
    )

tokenized_train = train_ds.map(tokenize_function, batched=True)
tokenized_val = val_ds.map(tokenize_function, batched=True)
tokenized_test = test_ds.map(tokenize_function, batched=True)


RuntimeError: Failed to import transformers.generation.utils because of the following error (look up to see its traceback):
cannot import name 'issparse' from 'scipy.sparse' (unknown location)

# Loading the RoBERTa Model

In [None]:
from transformers import RobertaForSequenceClassification


model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)  


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
tokenized_train.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
tokenized_val.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])
tokenized_test.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

# Setting Up Training Arguments

In [14]:
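# Trainer needs the accelerate package; the quotes keep the shell from treating ">" as a redirect.
# pip install "accelerate>=0.26.0"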

In [None]:
from transformers import TrainingArguments

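training_args = TrainingArguments(
    output_dir="./results",          # where checkpoints are written
    do_train=True,
    do_eval=True,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=10,             # note: the TrainOutput recorded further down reports epoch 3.0
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,                # log the training loss every 10 steps
)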
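
The number of optimizer steps follows from the training-set size, the batch size, and the epoch count in the arguments above. The sketch below assumes single-device training without gradient accumulation and with the last partial batch kept; `total_optimizer_steps` is a hypothetical helper and `n_examples` is an illustrative size, not the size of the dataset used here.

```python
import math


def total_optimizer_steps(n_examples, batch_size, epochs):
    """Estimate optimizer steps when every batch triggers one update."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs


# n_examples is a hypothetical training-set size chosen for illustration.
n_examples = 12_000
for epochs in (3, 10):
    steps = total_optimizer_steps(n_examples, batch_size=8, epochs=epochs)
    print(f"{epochs} epochs -> {steps} optimizer steps")
# 3 epochs -> 4500 optimizer steps
# 10 epochs -> 15000 optimizer steps
```

Under the same assumptions, the recorded run's `global_step=4518` at epoch 3.0 works out to 1506 steps per epoch, which at batch size 8 corresponds to a training set of 12,041 to 12,048 rows.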


# Setting Up Trainer and Training

In [None]:
from transformers import Trainer, TrainingArguments

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,   # used by trainer.train()
    eval_dataset=tokenized_val,      # used by trainer.evaluate()
    tokenizer=tokenizer,             # saved alongside model checkpoints
)




In [17]:
trainer.train()

Step,Training Loss
10,0.6947
20,0.6955
30,0.7446
40,0.6955
50,0.6801
60,0.6813
70,0.7322
80,0.6891
90,0.6973
100,0.697


TrainOutput(global_step=4518, training_loss=0.6951419799504105, metrics={'train_runtime': 23741.4257, 'train_samples_per_second': 1.522, 'train_steps_per_second': 0.19, 'total_flos': 2377274162941440.0, 'train_loss': 0.6951419799504105, 'epoch': 3.0})
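
The logged losses stay near 0.69, and the final `train_loss` is 0.6951. That is close to ln 2, the cross-entropy of a two-class model that assigns probability 0.5 to each class, which suggests the model did not learn to do better than chance on these labels. A short worked check, with `binary_cross_entropy` as a hypothetical helper:

```python
import math


def binary_cross_entropy(p_true_class):
    """Cross-entropy for one example, given the probability assigned to its true label."""
    return -math.log(p_true_class)


# A model that assigns probability 0.5 to each of two classes.
chance_loss = binary_cross_entropy(0.5)
print(f"Loss at chance level: {chance_loss:.4f}")  # ln(2), about 0.6931

# For comparison, a model that is 90% confident in the correct label.
print(f"Loss at 90% confidence: {binary_cross_entropy(0.9):.4f}")  # about 0.1054
```

A loss that never moves away from this value is what one would expect when the labels are unrelated to the input text, as with labels assigned by row position.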

In [19]:
predictions_output = trainer.predict(tokenized_val)


In [None]:
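import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Raw class scores and ground-truth labels returned by trainer.predict()
logits = predictions_output.predictions
true_labels = predictions_output.label_ids

# Pick the highest-scoring class for each example
predicted_labels = np.argmax(logits, axis=-1)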


In [None]:

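# Overall fraction of correct predictions
acc = accuracy_score(true_labels, predicted_labels)

# Per-class F1, averaged with weights proportional to each class's support
f1 = f1_score(true_labels, predicted_labels, average='weighted')

print(f"✅ Accuracy: {acc:.4f}")
print(f"✅ F1-score (weighted): {f1:.4f}")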


✅ Accuracy: 0.4998
✅ F1-score (weighted): 0.3331
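
An accuracy near 0.50 together with a weighted F1 near 0.33 is the pattern a classifier produces when it predicts the same class for every row of a balanced set. The pure-Python sketch below reproduces that pattern on hypothetical data; `accuracy` and `weighted_f1` are illustrative stand-ins for `accuracy_score` and `f1_score(average='weighted')`, not the scikit-learn implementations.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    total = 0.0
    for cls in set(y_true):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * (tp + fn) / len(y_true)
    return total


# Hypothetical balanced labels and a model that always predicts class 0.
y_true = [0, 1] * 500
y_pred = [0] * len(y_true)

print(f"Accuracy: {accuracy(y_true, y_pred):.4f}")              # 0.5000
print(f"F1-score (weighted): {weighted_f1(y_true, y_pred):.4f}")  # 0.3333
```

The predicted class gets an F1 of about 0.667 and the other class gets 0, so the support-weighted average is about 0.333. Whether this is what happened in the run above can be checked by counting the predictions, for example with `np.bincount(predicted_labels)`.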
