We import the libraries required for our BERT text classification task:

- Transformers: Provides pre-trained transformer models like BERT. We import BertTokenizer, BertForSequenceClassification, AdamW, and get_linear_schedule_with_warmup.
- Pandas: Used for data manipulation and analysis.
- Sklearn: Offers utilities for machine learning tasks; we use it for data splitting and evaluation metrics.
- re: Python's regular expression module, used for text preprocessing.
- Torch: The PyTorch library for building and training deep learning models. We import nn, DataLoader, and Dataset.
- NumPy: Used to average the per-batch losses.

In [15]:
from transformers import BertTokenizer, BertForSequenceClassification, AdamW, get_linear_schedule_with_warmup
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import re
import torch
import numpy as np
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

We load the dataset with pandas.read_csv. The file is expected to be a CSV containing the text to classify along with its class label.

In [2]:
df = pd.read_csv('dataset-merged.csv')

Then we clean up the data: we drop the unused sr column and remove any rows with missing values.

In [3]:
df = df.drop('sr', axis=1)
df.columns

Index(['text', 'label', 'wcount'], dtype='object')

In [4]:
df = df.dropna()
df.isnull().sum()

text      0
label     0
wcount    0
dtype: int64

Text preprocessing cleans the raw text before tokenization. The preprocess_text function uses regular expressions to remove digits and to collapse runs of whitespace into a single space, so that the text reaches the tokenizer in a consistent format.



In [5]:
# Preprocess the text data
def preprocess_text(text):
    text = re.sub(r'\d+', '', text)   # Remove all digits
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace into a single space
    return text

In [6]:
df['text'] = df['text'].apply(preprocess_text)
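
To see what the two substitutions do, here is a small standalone check. The sample strings are made up for illustration and are not taken from the dataset. Note that whitespace is collapsed rather than stripped, so a leading or trailing space can survive.

```python
import re

def preprocess_text(text):
    text = re.sub(r'\d+', '', text)   # drop every run of digits
    text = re.sub(r'\s+', ' ', text)  # collapse whitespace runs (including newlines) to one space
    return text

sample = "Breaking  news 2024:\n 15 people"   # hypothetical input
cleaned = preprocess_text(sample)
print(repr(cleaned))   # 'Breaking news : people'

padded = preprocess_text("  42 cats ")        # hypothetical input
print(repr(padded))    # ' cats '
```

If leading and trailing spaces matter for a downstream step, calling .strip() on the result would be one way to remove them; the notebook does not do this.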

We split the dataset into training and test sets with train_test_split from sklearn, which shuffles the rows before splitting. Here 80% of the data is used for training and 20% for testing (test_size=0.2), and random_state=42 makes the split reproducible.



In [7]:
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)
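
The idea behind a seeded, shuffled 80/20 split can be sketched without any library. This is only an illustration of the concept: shuffled_split is a hypothetical helper, it is not sklearn's implementation, and it will not reproduce the row order that train_test_split produces.

```python
import math
import random

def shuffled_split(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of items with a fixed seed, then cut off the test portion."""
    rng = random.Random(seed)          # fixed seed -> same shuffle on every run
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))                 # stand-in for 10 dataset rows
train_rows, test_rows = shuffled_split(rows)
print(len(train_rows), len(test_rows))  # 8 2
```

If the two classes are imbalanced, train_test_split also accepts a stratify argument (for example stratify=df['label']) that keeps the class proportions similar in both parts; the notebook uses a plain shuffled split.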

We initialize the BERT tokenizer using BertTokenizer.from_pretrained. The tokenizer converts text into tokens that the BERT model can understand. We use the 'bert-base-multilingual-cased' model, which supports multiple languages.



In [8]:
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')

We define a custom dataset class, NewsDataset, that subclasses torch.utils.data.Dataset. It tokenizes each text on access and prepares the inputs the BERT model requires, padding or truncating every sequence to max_len tokens. The __getitem__ method returns a dictionary with the original text, the input IDs, the attention mask, and the label.

In [9]:
# custom dataset class

class NewsDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            'text': text,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }
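
The effect of add_special_tokens, truncation, and padding='max_length' on a single example can be sketched on plain lists of integers. This is a simplified illustration, not the tokenizer's actual code: pad_and_mask is a hypothetical helper, and the IDs used for [CLS], [SEP], and [PAD] are placeholder defaults that should be read from the real tokenizer rather than assumed.

```python
def pad_and_mask(token_ids, max_len, cls_id=101, sep_id=102, pad_id=0):
    """Wrap IDs in [CLS] ... [SEP], truncate to max_len, then pad and build the mask."""
    body = token_ids[:max_len - 2]                # leave room for the two special tokens
    input_ids = [cls_id] + body + [sep_id]
    attention_mask = [1] * len(input_ids)         # 1 = real token
    n_pad = max_len - len(input_ids)
    return input_ids + [pad_id] * n_pad, attention_mask + [0] * n_pad  # 0 = padding

# A short sequence is padded up to max_len.
short_ids, short_mask = pad_and_mask([7, 8, 9], max_len=8)
print(short_ids)   # [101, 7, 8, 9, 102, 0, 0, 0]
print(short_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]

# A long sequence is cut so that the result is exactly max_len long.
long_ids, long_mask = pad_and_mask(list(range(20, 40)), max_len=8)
print(long_ids)    # [101, 20, 21, 22, 23, 24, 25, 102]
print(long_mask)   # [1, 1, 1, 1, 1, 1, 1, 1]
```

Because every item has the same length, the DataLoader can stack items into a batch tensor without a custom collate function.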


We define several hyperparameters for training the model:

- MAX_LEN: Maximum length of input sequences.
- BATCH_SIZE: Number of samples per batch.
- EPOCHS: Number of training epochs.
- LEARNING_RATE: Learning rate for the optimizer.

In [10]:
# hyperparameters

MAX_LEN = 100
BATCH_SIZE = 16
EPOCHS = 3
LEARNING_RATE = 2e-5

We create NewsDataset instances for the training and test data and wrap them in DataLoader objects, which handle batching. Only the training loader shuffles its data; the test loader keeps the original order.

In [11]:
# creating dataloaders

train_dataset = NewsDataset(X_train.tolist(), y_train.tolist(), tokenizer, MAX_LEN)
test_dataset = NewsDataset(X_test.tolist(), y_test.tolist(), tokenizer, MAX_LEN)

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)
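
How a loader turns a dataset into batches can be sketched with plain index lists. This is a conceptual stand-in for DataLoader, not its implementation; iter_batches and the sample count of 50 are assumptions for illustration. With the default drop_last=False, the final batch is simply smaller.

```python
import math
import random

def iter_batches(n_samples, batch_size, shuffle, seed=0):
    """Yield lists of sample indices, one list per batch."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)      # new order, same set of samples
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]   # last slice may be shorter

batches = list(iter_batches(n_samples=50, batch_size=16, shuffle=False))
print(len(batches), [len(b) for b in batches])    # 4 [16, 16, 16, 2]
```

The number of batches is ceil(n_samples / batch_size), which is the value that len(train_loader) reports and that the scheduler setup multiplies by EPOCHS.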

# Initializing the Model, Optimizer, Scheduler, and Loss Function

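We initialize the BERT model for sequence classification with BertForSequenceClassification, setting num_labels=2 for our two classes. The model is moved to the GPU if one is available and to the CPU otherwise.

The optimizer is AdamW, a variant of Adam that applies weight decay separately from the gradient update (decoupled weight decay) and is the usual choice for fine-tuning transformer models. Newer transformers releases deprecate their own AdamW in favor of torch.optim.AdamW, so the import may need adjusting depending on the installed version. The get_linear_schedule_with_warmup scheduler raises the learning rate linearly during a warm-up period and then decreases it linearly to zero; since we set num_warmup_steps=0, the rate simply decays from LEARNING_RATE to zero over the course of training.

We also create an nn.CrossEntropyLoss criterion, the standard loss for single-label classification. Note that the training and evaluation functions below read the loss from outputs.loss, which the model computes internally with cross-entropy when labels are passed, so the criterion object is not used directly.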

In [12]:
# initialize the model, optimizer, scheduler and loss function

model = BertForSequenceClassification.from_pretrained('bert-base-multilingual-cased', num_labels=2)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
total_steps = len(train_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps
)

criterion = nn.CrossEntropyLoss()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
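
The shape of the linear schedule can be sketched as a multiplier applied to the base learning rate at each optimizer step. This is a plain-Python illustration of the rule (linear ramp during warm-up, linear decay to zero afterwards); linear_schedule_factor is a hypothetical name, and the step counts are small made-up values rather than the real total_steps.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier for the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)            # ramp up from 0 to 1
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return max(0.0, remaining / decay_span)               # decay from 1 to 0

# No warm-up, as in the notebook: the factor starts at 1 and falls to 0.
no_warmup = [linear_schedule_factor(s, 0, 10) for s in (0, 5, 10)]
print(no_warmup)    # [1.0, 0.5, 0.0]

# With 4 warm-up steps (an illustrative value): the factor first climbs to 1.
with_warmup = [linear_schedule_factor(s, 4, 10) for s in (0, 2, 4)]
print(with_warmup)  # [0.0, 0.5, 1.0]
```

With num_warmup_steps=0 the very first update already uses the full LEARNING_RATE; a small non-zero warm-up is a common alternative, but the notebook does not use one.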


We define two functions for training and evaluating the model:

- train_epoch: Trains the model for one epoch. For each batch it computes the loss, backpropagates, steps the optimizer and scheduler, and tracks the training accuracy.
- eval_model: Evaluates the model on held-out data (here, the test set). It computes the loss and accuracy under torch.no_grad(), without updating the model parameters.

In [13]:
# training and evaluation functions

def train_epoch(model, data_loader, criterion, optimizer, device, scheduler):
    model = model.train()
    losses = []
    correct_predictions = 0

    for batch in data_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        loss = outputs.loss
        logits = outputs.logits

        _, preds = torch.max(logits, dim=1)
        correct_predictions += torch.sum(preds == labels)
        losses.append(loss.item())

        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

    return correct_predictions.double() / len(data_loader.dataset), np.mean(losses)

def eval_model(model, data_loader, criterion, device):
    model = model.eval()
    losses = []
    correct_predictions = 0

    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=labels
            )
            loss = outputs.loss
            logits = outputs.logits

            _, preds = torch.max(logits, dim=1)
            correct_predictions += torch.sum(preds == labels)
            losses.append(loss.item())

    return correct_predictions.double() / len(data_loader.dataset), np.mean(losses)
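
The two quantities tracked per batch, the cross-entropy loss and the number of correct predictions, can be worked through by hand on a tiny batch. The sketch below uses plain Python instead of tensors, and the logits and labels are invented for illustration.

```python
import math

def cross_entropy(logits, label):
    """Negative log-softmax probability of the true class, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def argmax(values):
    """Index of the largest value, the role torch.max(logits, dim=1) plays for preds."""
    return max(range(len(values)), key=lambda i: values[i])

batch_logits = [[2.0, 0.5], [0.1, 1.2], [0.3, 0.2]]   # made-up model outputs
batch_labels = [0, 1, 1]                              # made-up true classes

losses = [cross_entropy(z, y) for z, y in zip(batch_logits, batch_labels)]
preds = [argmax(z) for z in batch_logits]
batch_loss = sum(losses) / len(losses)                # mean over the batch
n_correct = sum(p == y for p, y in zip(preds, batch_labels))
print(preds, n_correct)   # [0, 1, 0] 2
```

In the notebook, accuracy is the total number of correct predictions divided by len(data_loader.dataset), while the reported loss is the mean of the per-batch losses, so a smaller final batch carries the same weight in the loss as a full one.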

We train the model for the specified number of epochs. For each epoch, we print the training and validation loss and accuracy. We also save the model's state if it achieves a higher validation accuracy than previously observed.

In [16]:
# training loop

best_accuracy = 0

for epoch in range(EPOCHS):
    print(f'Epoch {epoch + 1}/{EPOCHS}')
    print('-' * 10)

    train_acc, train_loss = train_epoch(model, train_loader, criterion, optimizer, device, scheduler)
    print(f'Train loss {train_loss} accuracy {train_acc}')

    val_acc, val_loss = eval_model(model, test_loader, criterion, device)
    print(f'Val   loss {val_loss} accuracy {val_acc}')

    if val_acc > best_accuracy:
        best_accuracy = val_acc
        torch.save(model.state_dict(), 'best_model_state.bin')

Epoch 1/3
----------
Train loss 0.2841733164102917 accuracy 0.8861877646371733
Val   loss 0.2893315942654776 accuracy 0.8797080291970804
Epoch 2/3
----------
Train loss 0.19467090397786302 accuracy 0.9277266754270697
Val   loss 0.3171310005146404 accuracy 0.8817518248175183
Epoch 3/3
----------
Train loss 0.15935424442730123 accuracy 0.9451744780259892
Val   loss 0.3171310005146404 accuracy 0.8817518248175183
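
The checkpointing rule can be replayed on the validation accuracies printed above. The sketch below only re-runs the comparison logic on those three logged numbers; it does not train or save anything.

```python
# Validation accuracies copied from the training log above (epochs 1 to 3).
val_accuracies = [0.8797080291970804, 0.8817518248175183, 0.8817518248175183]

best_accuracy = 0
best_epoch = None
for epoch, val_acc in enumerate(val_accuracies, start=1):
    if val_acc > best_accuracy:        # strict '>' means ties do not overwrite
        best_accuracy = val_acc
        best_epoch = epoch             # this is where the notebook calls torch.save

print(best_epoch, best_accuracy)       # 2 0.8817518248175183
```

Because the comparison is strict, the epoch 3 result, which ties with epoch 2, does not replace the saved checkpoint, so best_model_state.bin holds the weights from epoch 2.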


We load the best checkpoint saved during training and evaluate it on the test set, computing accuracy, precision, recall, and F1 score. Because the same test set was also used to select the checkpoint, these numbers may be slightly optimistic.

In [17]:
# load the best model for evaluation

model.load_state_dict(torch.load('best_model_state.bin', map_location=device))

y_pred = []
y_true = []

model = model.eval()
with torch.no_grad():
    for batch in test_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        logits = outputs.logits

        _, preds = torch.max(logits, dim=1)
        y_pred.extend(preds.cpu().tolist())
        y_true.extend(labels.cpu().tolist())

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='binary')

print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')

Accuracy: 0.8817518248175182
Precision: 0.8807947019867549
Recall: 0.83125
F1 Score: 0.8553054662379421
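
For reference, the four metrics can be computed from raw counts in a few lines. The sketch below mirrors what average='binary' reports for the positive class (label 1); binary_metrics is a hypothetical helper and the toy labels are invented. The last lines check that the reported F1 is the harmonic mean of the reported precision and recall.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one positive class, from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Toy labels, invented for illustration only.
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 0, 1, 0, 1, 0, 1, 0]
toy_metrics = binary_metrics(toy_true, toy_pred)
print(toy_metrics)  # (0.75, 0.75, 0.75, 0.75)

# Consistency check on the scores printed above.
reported_precision, reported_recall = 0.8807947019867549, 0.83125
f1_from_report = 2 * reported_precision * reported_recall / (reported_precision + reported_recall)
print(round(f1_from_report, 4))  # 0.8553
```

Recall is lower than precision here, which means the model misses positive examples more often than it raises false alarms.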


Classification using a fine-tuned pre-trained LLM (multilingual BERT) yields the best accuracy of all the approaches we tried, about 88% on the test set.