<a href="https://colab.research.google.com/github/RobyRoshna/Insensitive-Lang-Detection/blob/main/BERTtraining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT Training for Identifying Insensitive Language about Disabled People Using Semantic Analysis and Machine Learning

### Table of Contents

1. [Imports](#imports)
2. [Data Splitting: Train, Test, and Validation (80:10:10)](#split-to-train-test-and-validation)
3. [Tokenizer](#tokenizer)
4. [Dataset Creation for BERT](#dataset-creation-for-bert)
5. [Max Sentence Length for Tokenizer](#max-Sentence-Length)
6. [PyTorch Dataloader for Future Extension](#pytorch-dataloader-for-future-extension)
7. [Dataset Check](#dataset-check)
8. [BERT Base Model](#bert-base-model)


## **1. Imports** <a name="imports"></a>

In [2]:
import os

import pandas as pd
import torch  # Needed for the custom Dataset class and the DataLoaders below
import wandb
from huggingface_hub import login
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from transformers import Trainer, TrainingArguments, BertTokenizer, BertForSequenceClassification

# Read the Hugging Face token from the HF_TOKEN environment variable instead of hardcoding it;
# if it is not set, login() falls back to an interactive prompt
login(token=os.environ.get("HF_TOKEN"))


## **2. Split to Train, Test, and Validation (80:10:10)** <a name="split-to-train-test-and-validation"></a>

In [3]:

# The annotated dataset
file_path = '/content/drive/MyDrive/Honours MiscData(Roshna)/Abstract_annotations.xlsx'  # Update with your path
data = pd.read_excel(file_path)

# cleaning data
data = data[['Sentence', 'Manual_Annotation']]
data = data.dropna()

# 1 for insensitive and 0 for notInsensitive
data['Manual_Annotation'] = data['Manual_Annotation'].apply(lambda x: 1 if x.lower() == 'insensitive' else 0)

# Split the data into train, validation, and test sets
train_data, temp_data = train_test_split(data, test_size=0.2, random_state=42, stratify=data['Manual_Annotation'])
val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42, stratify=temp_data['Manual_Annotation'])

print(f"Train size: {len(train_data)}, Validation size: {len(val_data)}, Test size: {len(test_data)}")


Train size: 870, Validation size: 109, Test size: 109
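
The printed sizes imply 1,088 labelled sentences after cleaning (870 + 109 + 109). A minimal sketch of the size arithmetic behind the two-stage split, assuming scikit-learn rounds the held-out fraction up and leaves the remainder in the other part; `split_sizes` is a hypothetical helper, not part of the notebook:

```python
import math

def split_sizes(n_samples, holdout_frac=0.2, test_frac_of_holdout=0.5):
    """Sizes for a two-stage train/validation/test split.

    Assumption: the held-out part is ceil(fraction * n) and the remainder
    stays in the other part, which matches the sizes printed above.
    """
    n_holdout = math.ceil(holdout_frac * n_samples)        # validation + test
    n_train = n_samples - n_holdout
    n_test = math.ceil(test_frac_of_holdout * n_holdout)   # second split, test_size=0.5
    n_val = n_holdout - n_test
    return n_train, n_val, n_test

sizes = split_sizes(870 + 109 + 109)
print(sizes)
```

Because both calls pass `stratify`, each part should also keep roughly the same share of insensitive sentences as the full dataset.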


## **3. Tokenizer** <a name="tokenizer"></a>

In [4]:
# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Function to tokenize data
def tokenize_data(data, tokenizer, max_length=109):
    return tokenizer(
        list(data['Sentence']),  # Tokenize sentences
        padding=True,            # Pad to the longest sentence in this split
        truncation=True,         # Truncate sentences longer than max_length
        max_length=max_length,   # Max token length, including [CLS] and [SEP]
        return_tensors='pt'      # Return PyTorch tensors
    )

train_labels = list(train_data['Manual_Annotation'])
val_labels = list(val_data['Manual_Annotation'])
test_labels = list(test_data['Manual_Annotation'])

# Tokenize the data
train_encodings = tokenize_data(train_data, tokenizer)
val_encodings = tokenize_data(val_data, tokenizer)
test_encodings = tokenize_data(test_data, tokenizer)



## **4. Dataset Creation for BERT** <a name="dataset-creation-for-bert"></a>

In [5]:
# Custom Dataset Class for Tokenized Data
class SentenceDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        """
        Initializes the dataset.

        Args:
            encodings: Dictionary containing tokenized input IDs, attention masks, etc.
            labels: List of labels corresponding to the sentences (e.g., 0 for NotInsensitive, 1 for Insensitive).
        """
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        """
        Returns the total number of samples in the dataset.
        """
        return len(self.labels)

    def __getitem__(self, idx):
        """
        Retrieves the tokenized inputs and the corresponding label for the given index.

        Args:
            idx: Index of the data sample.

        Returns:
            A dictionary containing the tokenized inputs (input IDs, attention masks, etc.)
            and the label for the specified index.
        """
        # The encodings are already PyTorch tensors (return_tensors='pt'), so slice and copy them
        # instead of re-wrapping with torch.tensor(), which triggers a copy-construct UserWarning
        item = {key: val[idx].clone().detach() for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])  # Add the corresponding label
        return item

# Create datasets for train, validation, and test sets
train_dataset = SentenceDataset(train_encodings, train_labels)
val_dataset = SentenceDataset(val_encodings, val_labels)
test_dataset = SentenceDataset(test_encodings, test_labels)


## **5. Max Sentence Length for Tokenizer** <a name="max-Sentence-Length"></a>

In [6]:
sentence_lengths = [len(tokenizer.tokenize(sent)) for sent in train_data['Sentence']]
print(f"Max length: {max(sentence_lengths)}")
print(f"Average length: {sum(sentence_lengths)/len(sentence_lengths)}")


Max length: 109
Average length: 32.96206896551724
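
`tokenizer.tokenize` returns word pieces only. When the tokenizer is called directly on a single sentence it also adds `[CLS]` and `[SEP]`, and `max_length` counts those two tokens as well. A rough sketch of the resulting budget, using the maximum of 109 measured above; `required_max_length` is a hypothetical helper:

```python
def required_max_length(longest_wordpiece_count, n_special_tokens=2):
    """Smallest max_length that encodes the longest sentence without truncation.

    Assumption: single-sentence input, so BERT adds one [CLS] and one [SEP].
    """
    return longest_wordpiece_count + n_special_tokens

needed = required_max_length(109)
# With max_length=109, as used in the tokenizer cell, the overflow is truncated
truncated_tokens = max(0, needed - 109)
print(needed, truncated_tokens)
```

Under this assumption, `max_length=109` trims the last two word pieces of the longest training sentence, and a value of 111 would keep it whole. The measurement covers the training split only, so validation or test sentences could be longer still.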


## **6. PyTorch Dataloader for Future Extension** <a name="pytorch-dataloader-for-future-extension"></a>

In [7]:
from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16)
test_loader = DataLoader(test_dataset, batch_size=16)


## **7. Dataset Check** <a name="dataset-check"></a>

In [8]:
# Examples from the training dataset
for i in range(5):
    item = train_dataset[i]
    print("Input IDs:", item['input_ids'])
    print("Attention Mask:", item['attention_mask'])
    print("Label:", item['labels'])  # 0 for Not Insensitive, 1 for Insensitive


Input IDs: tensor([  101,  2122,  2913,  1998,  3141,  3906,  6592,  4022,  2005,  2925,
        27758,  2015,  1997,  1996,  2291,  2000,  5770,  6397,  1998, 17453,
        18234,  5198,  1999,  4547,  1010,  2658,  1010,  1998, 10517, 18046,
         1012,   102,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0])
Attention Mask: tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 



## **8. BERT Base Model** <a name="bert-base-model"></a>

In [9]:
# Load pre-trained BERT for binary classification
modelBbase = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Metrics Function

In [10]:
# Function to compute metrics
def compute_metrics(pred):
    predictions, labels = pred
    preds = predictions.argmax(axis=1)
    acc = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    return {
        'accuracy': acc,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }

### BERT-base Training

In [11]:
# Safely close any previous WandB session
wandb.finish()

# Initialize WandB with a specific run name
wandb.init(project="Insensitive Lang Detecton", entity="Roshna", name="Bert_base")

# TrainingArguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy='epoch',  # named eval_strategy in newer transformers releases
    save_strategy='epoch',
    logging_dir='./logs',
    logging_steps=10,
    report_to=["wandb"],  # WandB is used for logging
    run_name="Bert_base"  # the run name for this Trainer
)

# Trainer
trainer = Trainer(
    model=modelBbase,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

# Train
trainer.train()

results = trainer.evaluate(test_dataset, metric_key_prefix="test")
wandb.log(results)

wandb.finish()  # Close the evaluation session
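
A quick sanity check on how many steps these arguments produce, assuming a single device and no gradient accumulation (neither is changed in the arguments shown); `steps_for` is a hypothetical helper:

```python
import math

def steps_for(n_samples, batch_size):
    """Batches per pass; the last, smaller batch still counts as a step."""
    return math.ceil(n_samples / batch_size)

train_steps_per_epoch = steps_for(870, 16)
total_train_steps = train_steps_per_epoch * 3   # num_train_epochs=3
eval_steps = steps_for(109, 16)                 # one validation or test pass
print(train_steps_per_epoch, total_train_steps, eval_steps)
```

With `logging_steps=10`, the training loss would be logged roughly every tenth of those optimizer steps.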


[34m[1mwandb[0m: Currently logged in as: [33mroshnaroby[0m ([33mRoshna[0m). Use [1m`wandb login --relogin`[0m to force relogin




| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.1858 | 0.171925 | 0.93578 | 0.943396 | 0.925926 | 0.934579 |
| 2 | 0.1167 | 0.090302 | 0.972477 | 1.0 | 0.944444 | 0.971429 |
| 3 | 0.0588 | 0.081843 | 0.972477 | 0.963636 | 0.981481 | 0.972477 |
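
The epoch 3 validation row is consistent with 53 true positives, 2 false positives, 1 false negative and 53 true negatives over the 109 validation sentences. These counts are inferred from the reported ratios and are not printed by the notebook. A short check of how the four metrics follow from them (`binary_metrics` is a hypothetical helper, not the notebook's `compute_metrics`):

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # share of predicted-insensitive that are correct
    recall = tp / (tp + fn)      # share of truly insensitive that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Counts inferred from the epoch 3 row; an assumption, not notebook output
metrics = binary_metrics(tp=53, fp=2, fn=1, tn=53)
print({name: round(value, 6) for name, value in metrics.items()})
```

Accuracy and F1 coincide here because the number of true negatives happens to equal the number of true positives.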




| Metric | Value |
|---|---|
| epoch | 3.0 |
| eval/accuracy | 0.97248 |
| eval/f1 | 0.97248 |
| eval/loss | 0.08184 |
| eval/precision | 0.96364 |
| eval/recall | 0.98148 |
| eval/runtime | 0.6232 |
| eval/samples_per_second | 174.895 |
| eval/steps_per_second | 11.232 |
| test/accuracy | 0.95413 |
