<a href="https://colab.research.google.com/github/BaberFaisal/NLP_Text-classification_HW/blob/main/Fine_tuned_a_pre_trained_model_using_BERT_and_RoBERTa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Chosen Model Description**

I fine-tuned two transformer-based models for sequence classification:

BERT (bert-base-uncased): A bidirectional transformer pre-trained on English text using masked language modeling and next-sentence prediction.

RoBERTa (roberta-base): An optimized variant of BERT that removes next-sentence prediction, uses dynamic masking, and trains on larger batches and more data.

Both models were adapted for binary classification (predicting disaster vs. non-disaster tweets) by adding a classification head. Training utilized mixed-precision (autocast and GradScaler) for efficiency on GPUs.
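
To illustrate why mixed-precision training pairs `autocast` with a `GradScaler`, the sketch below rounds a small gradient value to half precision using only Python's `struct` module. The value `1e-8` is an arbitrary illustrative number, not a gradient measured in this notebook, and `to_fp16` is a helper written for this example. The scale factor 65536 matches `GradScaler`'s default initial scale.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to IEEE half precision and convert it back."""
    return struct.unpack('e', struct.pack('e', x))[0]

tiny_grad = 1e-8   # illustrative value, below fp16's smallest subnormal (~6e-8)
scale = 65536.0    # 2**16, the default initial scale of GradScaler

unscaled = to_fp16(tiny_grad)                   # underflows to zero in fp16
recovered = to_fp16(tiny_grad * scale) / scale  # scale, round to fp16, unscale in full precision

print(unscaled)
print(recovered)
```

Without scaling the value is lost entirely, while the scaled value survives the round trip with a small relative error. This is the reason the loss is multiplied by a large factor before `backward()` and the gradients are unscaled before the optimizer step.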



**Install Required Libraries**

Install and import necessary libraries for handling datasets, tokenization, and fine-tuning transformer models.

In [None]:
!pip install transformers datasets torch scikit-learn

Collecting datasets
  Downloading datasets-3.4.1-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.

In [None]:
import pandas as pd
import torch
import torch.nn as nn
from torch.optim import AdamW  # PyTorch's AdamW implementation
from torch.utils.data import Dataset, DataLoader
from torch.cuda.amp import autocast, GradScaler  # mixed-precision helpers
from transformers import BertTokenizer, RobertaTokenizer, BertForSequenceClassification, RobertaForSequenceClassification
from transformers import get_scheduler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report



**Load and Preprocess Dataset**

Load the dataset and split it into training (80%) and validation (20%) sets.

In [None]:
df = pd.read_csv('/content/train (1).csv')
train_texts, val_texts, train_labels, val_labels = train_test_split(df['text'].tolist(), df['target'].tolist(), test_size=0.2, random_state=42)
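
The split above is purely random, so the class ratio in the validation set can drift from the ratio in the full data. `train_test_split` accepts a `stratify=` argument to prevent that. The following pure-Python sketch shows the idea behind stratification; `stratified_split` and the toy labels are hypothetical and are not part of the notebook's pipeline.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx) with every class split in the same proportion."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

toy_labels = [0] * 60 + [1] * 40   # toy data, not the tweet dataset
train_idx, val_idx = stratified_split(toy_labels)
val_labels_toy = [toy_labels[i] for i in val_idx]
print(len(train_idx), len(val_idx), sum(val_labels_toy))  # 80 20 8
```

The validation set keeps the 60/40 class ratio of the toy data (12 negatives, 8 positives), regardless of the seed.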

**Tokenization**

Tokenize the text with the BERT and RoBERTa tokenizers, padding each batch to its longest sequence and truncating at 64 tokens.

In [None]:
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
roberta_tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

def tokenize_function(texts, tokenizer):
    return tokenizer(texts, padding=True, truncation=True, return_tensors='pt', max_length=64)

train_encodings_bert = tokenize_function(train_texts, bert_tokenizer)
val_encodings_bert = tokenize_function(val_texts, bert_tokenizer)
train_encodings_roberta = tokenize_function(train_texts, roberta_tokenizer)
val_encodings_roberta = tokenize_function(val_texts, roberta_tokenizer)
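
As a rough picture of what `padding=True`, `truncation=True`, and `max_length` produce, the sketch below pads and truncates lists of token IDs by hand. `pad_batch` and the toy IDs are hypothetical; a real Hugging Face tokenizer also inserts special tokens and preserves them when truncating, which this simplified version does not.

```python
def pad_batch(sequences, max_length, pad_id=0):
    """Truncate to max_length, then pad every sequence to the longest one in the batch."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return input_ids, attention_mask

toy_ids = [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]]  # made-up token IDs
ids, mask = pad_batch(toy_ids, max_length=5)
print(ids)   # [[101, 7, 8, 102, 0], [101, 9, 102, 0, 0], [101, 1, 2, 3, 4]]
print(mask)  # [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The attention mask is what lets the model skip padded positions, which is why it is passed along with `input_ids` in every batch.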


**Create Custom Dataset Class**

Define a PyTorch Dataset that returns the tokenized inputs and the label for each example.

In [None]:
class DisasterTweetDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

train_dataset_bert = DisasterTweetDataset(train_encodings_bert, train_labels)
val_dataset_bert = DisasterTweetDataset(val_encodings_bert, val_labels)
train_dataset_roberta = DisasterTweetDataset(train_encodings_roberta, train_labels)
val_dataset_roberta = DisasterTweetDataset(val_encodings_roberta, val_labels)

**Load Pretrained Models**

Load BERT and RoBERTa, each with a two-class sequence-classification head.

In [None]:
bert_model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
roberta_model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


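**Training Models**

Fine-tune each model with gradient accumulation and mixed precision, then report validation metrics.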

In [None]:
def train_model(model, train_dataset, val_dataset, tokenizer, model_name, epochs=3, batch_size=16, lr=3e-5, accumulation_steps=4):
    # `tokenizer` is not used inside the function; it stays in the signature for the existing calls.
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    use_amp = device.type == 'cuda'  # mixed precision only when a GPU is available

    optimizer = AdamW(model.parameters(), lr=lr)
    # The scheduler steps once per optimizer update, so size it by updates rather than batches.
    updates_per_epoch = (len(train_loader) + accumulation_steps - 1) // accumulation_steps
    scheduler = get_scheduler('linear', optimizer=optimizer, num_warmup_steps=0, num_training_steps=updates_per_epoch * epochs)
    loss_fn = nn.CrossEntropyLoss()

    # With enabled=False, autocast and GradScaler are pass-throughs, so CPU runs
    # still track gradients and call backward().
    scaler = GradScaler(enabled=use_amp)

    for epoch in range(epochs):
        model.train()
        total_loss, correct, total = 0, 0, 0
        optimizer.zero_grad()

        for i, batch in enumerate(train_loader):
            inputs = {key: val.to(device) for key, val in batch.items() if key != 'labels'}
            labels = batch['labels'].to(device)

            with autocast(enabled=use_amp):
                outputs = model(**inputs).logits
                # Divide so the accumulated gradient matches one large-batch step.
                loss = loss_fn(outputs, labels) / accumulation_steps

            scaler.scale(loss).backward()

            # Update weights every `accumulation_steps` batches and on the last batch.
            if (i + 1) % accumulation_steps == 0 or (i + 1) == len(train_loader):
                scaler.step(optimizer)
                scaler.update()
                optimizer.zero_grad()
                scheduler.step()

            total_loss += loss.item() * accumulation_steps
            correct += (outputs.argmax(1) == labels).sum().item()
            total += labels.size(0)

        print(f"{model_name} Epoch {epoch+1}, Loss: {total_loss/len(train_loader)}, Accuracy: {correct/total}")

    # Evaluate on the validation set after the final epoch.
    model.eval()
    predictions, true_labels = [], []
    with torch.no_grad():
        for batch in val_loader:
            inputs = {key: val.to(device) for key, val in batch.items() if key != 'labels'}
            labels = batch['labels'].to(device)
            outputs = model(**inputs).logits
            predictions.extend(outputs.argmax(1).cpu().numpy())
            true_labels.extend(labels.cpu().numpy())

    acc = accuracy_score(true_labels, predictions)
    print(f"{model_name} Validation Accuracy: {acc}")
    print(classification_report(true_labels, predictions))


train_model(bert_model, train_dataset_bert, val_dataset_bert, bert_tokenizer, 'BERT', epochs=4, batch_size=32, lr=5e-6, accumulation_steps=8)

BERT Epoch 1, Loss: 0.6421658580839946, Accuracy: 0.6541871921182266
BERT Epoch 2, Loss: 0.5396153695920375, Accuracy: 0.7673234811165845
BERT Epoch 3, Loss: 0.45889766155425166, Accuracy: 0.8095238095238095
BERT Epoch 4, Loss: 0.41065625495311475, Accuracy: 0.8331691297208539
BERT Validation Accuracy: 0.8200919238345371
              precision    recall  f1-score   support

           0       0.84      0.85      0.84       874
           1       0.79      0.79      0.79       649

    accuracy                           0.82      1523
   macro avg       0.82      0.82      0.82      1523
weighted avg       0.82      0.82      0.82      1523
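
The per-class figures in the report follow from four confusion-matrix counts. The notebook does not print a confusion matrix, so the counts below are an inference: they are a set of values consistent with the rounded BERT report and its 1,523 validation examples. The sketch also assumes that label 1 marks disaster tweets.

```python
# Counts inferred from the rounded BERT report (not printed by the notebook).
tn, fp, fn, tp = 739, 135, 139, 510

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)   # of tweets predicted as class 1, the share that truly are
recall_1 = tp / (tp + fn)      # of true class-1 tweets, the share that were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
# 0.8201 0.79 0.79 0.79
```

Under these counts the model misses 139 of the 649 class-1 tweets, a figure that the single accuracy number hides.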



In [None]:
train_model(roberta_model, train_dataset_roberta, val_dataset_roberta, roberta_tokenizer, 'RoBERTa')

RoBERTa Epoch 1, Loss: 0.4580866710876855, Accuracy: 0.7945812807881774
RoBERTa Epoch 2, Loss: 0.3489782963524966, Accuracy: 0.8558292282430213
RoBERTa Epoch 3, Loss: 0.28008711615728893, Accuracy: 0.8880131362889984
RoBERTa Validation Accuracy: 0.8332239001969797
              precision    recall  f1-score   support

           0       0.86      0.84      0.85       874
           1       0.79      0.82      0.81       649

    accuracy                           0.83      1523
   macro avg       0.83      0.83      0.83      1523
weighted avg       0.83      0.83      0.83      1523
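
The two runs differ in more than the model: BERT was called with `batch_size=32`, `accumulation_steps=8`, and `epochs=4`, while RoBERTa used the function defaults. The sketch below works out what that means for the effective batch size and the number of optimizer updates. `training_budget` is a hypothetical helper, and the training-set size of 6,090 is an assumption inferred from the 1,523-row validation split at `test_size=0.2`.

```python
import math

def training_budget(n_train, batch_size, accumulation_steps, epochs):
    """Return (effective batch size, optimizer updates over the whole run)."""
    batches_per_epoch = math.ceil(n_train / batch_size)
    # One update every `accumulation_steps` batches, plus one for a final partial group.
    updates_per_epoch = math.ceil(batches_per_epoch / accumulation_steps)
    return batch_size * accumulation_steps, updates_per_epoch * epochs

n_train = 6090  # assumed: inferred from the validation split size, not printed by the notebook

bert_budget = training_budget(n_train, batch_size=32, accumulation_steps=8, epochs=4)
roberta_budget = training_budget(n_train, batch_size=16, accumulation_steps=4, epochs=3)
print(bert_budget)     # (256, 96)
print(roberta_budget)  # (64, 288)
```

Under these assumptions BERT received a third as many weight updates as RoBERTa, each at a lower learning rate (5e-6 vs. 3e-5), which is worth keeping in mind when comparing the two validation scores.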



**Generate Predictions for Kaggle Submission**

Load the test data and tokenize it for both models.

In [None]:
test_df = pd.read_csv('/content/test (1).csv')
test_encodings_bert = tokenize_function(test_df['text'].tolist(), bert_tokenizer)
test_encodings_roberta = tokenize_function(test_df['text'].tolist(), roberta_tokenizer)
test_dataset_bert = DisasterTweetDataset(test_encodings_bert, [0] * len(test_df))
test_dataset_roberta = DisasterTweetDataset(test_encodings_roberta, [0] * len(test_df))

def generate_submission(model, test_dataset, tokenizer, model_name):
    test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)
    model.eval()
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    predictions = []
    with torch.no_grad():
        for batch in test_loader:
            inputs = {key: val.to(device) for key, val in batch.items() if key != 'labels'}
            outputs = model(**inputs).logits
            predictions.extend(outputs.argmax(1).cpu().numpy())
    submission = pd.DataFrame({'id': test_df['id'], 'target': predictions})
    submission.to_csv(f'submission_{model_name}.csv', index=False)
    print(f"Submission file for {model_name} created.")


Generate a submission file for each model.

In [None]:
generate_submission(bert_model, test_dataset_bert, bert_tokenizer, 'BERT')
generate_submission(roberta_model, test_dataset_roberta, roberta_tokenizer, 'RoBERTa')


Submission file for BERT created.
Submission file for RoBERTa created.


**Conclusion**

1. Best Model in Terms of Quality/Resources

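RoBERTa achieved slightly higher validation accuracy (83.3% vs. 82.0% for BERT) with comparable resource usage; the two checkpoints are of similar size (~440MB vs. ~499MB). RoBERTa’s pretraining improvements (dynamic masking, larger batches, more data) likely contributed, but the comparison is not controlled: BERT ran for 4 epochs at a learning rate of 5e-6 with batch size 32 and 8 accumulation steps, while RoBERTa ran for 3 epochs with the function defaults (learning rate 3e-5, batch size 16, 4 accumulation steps).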

2. Improvements for Results

Hyperparameter Tuning: Adjust learning rates, batch sizes, or epochs (e.g., RoBERTa trained for only 3 epochs vs. BERT’s 4).

Data Augmentation: Expand the dataset with techniques like synonym replacement or back-translation.

Ensemble Learning: Combine predictions from both models for robustness.

Advanced Tokenization: Experiment with longer sequence lengths or domain-specific tokenization.
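
The ensemble idea above can be sketched by averaging the two models' class probabilities before taking the argmax. This is one simple way to combine predictions, not something the notebook implements; `ensemble_predict` and the logits are hypothetical, and in practice the logits would come from running both models on the same inputs.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(logits_a, logits_b):
    """Average the two models' class probabilities and pick the most likely class."""
    preds = []
    for row_a, row_b in zip(logits_a, logits_b):
        avg = [(pa + pb) / 2 for pa, pb in zip(softmax(row_a), softmax(row_b))]
        preds.append(avg.index(max(avg)))
    return preds

# Hypothetical logits for three tweets, ordered as [class 0, class 1].
bert_logits = [[2.0, -1.0], [0.2, 0.1], [-0.5, 0.5]]
roberta_logits = [[1.5, -0.5], [-1.0, 2.0], [-0.2, 0.1]]
print(ensemble_predict(bert_logits, roberta_logits))  # [0, 1, 1]
```

On the second tweet the models disagree, and the more confident one decides the outcome. That is the main difference between probability averaging and a plain majority vote.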

3. Difficulties Encountered

Dependency Conflicts: CUDA/cuDNN version mismatches during library installations (e.g., nvidia-cudnn-cu12).

Resource Limitations: Large model sizes (e.g., BERT: ~440MB, RoBERTa: ~499MB) may strain memory during training.

Class Imbalance: The dataset’s class distribution (e.g., 874 vs. 649 samples in validation) could bias predictions, though this was not explicitly addressed.
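
One common way to address the imbalance is to weight the loss by inverse class frequency. The sketch below computes such weights; `inverse_frequency_weights` is a hypothetical helper, and the counts are the validation supports from the reports, used only as an example ratio. In practice the weights would be computed from the training labels and could be passed as the `weight` tensor of `nn.CrossEntropyLoss`.

```python
def inverse_frequency_weights(class_counts):
    """Weight each class by total / (n_classes * count), so rarer classes weigh more."""
    total = sum(class_counts)
    n_classes = len(class_counts)
    return [total / (n_classes * count) for count in class_counts]

# Validation-split supports, used here only as an example class ratio.
weights = inverse_frequency_weights([874, 649])
print([round(w, 3) for w in weights])  # [0.871, 1.173]
```

With these weights, an error on a class-1 tweet costs about 35% more than an error on a class-0 tweet.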

RoBERTa’s marginal superiority in accuracy makes it the preferred choice, though further tuning and addressing class imbalance could enhance results.