# Multiclass Classification with Deep Learning using BERT

### About the Dataset

The dataset contains 2,507 research paper titles that have been manually classified into 5 categories (i.e., conferences). It can be downloaded from here.

### 1. Data Loading and Exploration

In [1]:
#import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import torch
from tqdm.notebook import tqdm # Progress bar library
from transformers import BertTokenizer # For BERT tokenization
from torch.utils.data import TensorDataset # Efficient dataset creation
from transformers import BertForSequenceClassification # Pre-trained BERT for classification
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Load the dataset
df = pd.read_csv('title_conference.csv')
df.head()

Unnamed: 0,Title,Conference
0,Innovation in Database Management: Computer Sc...,VLDB
1,High performance prime field multiplication fo...,ISCAS
2,enchanted scissors: a scissor interface for su...,SIGGRAPH
3,Detection of channel degradation attack by Int...,INFOCOM
4,Pinning a Complex Network through the Betweenn...,ISCAS


In [3]:
# Inspect the conference label of the first row
df.Conference.iloc[0]

'VLDB'

In [4]:
# Check the distribution of conference labels
df.Conference.value_counts()


Conference
ISCAS       864
INFOCOM     515
VLDB        423
WWW         379
SIGGRAPH    326
Name: count, dtype: int64

**Explanation:**

* **Libraries:** We begin by importing necessary libraries:
    * `torch`: For core PyTorch functionality.
    * `tqdm`: To display progress bars during training.
    * `transformers`: For loading BERT models and tokenizer.
    * `TensorDataset`: To create PyTorch datasets efficiently.
* **Loading Data:**  We load our conference dataset (`title_conference.csv`).
* **Initial Exploration:**
    * `df.head()`:  Displays the first few rows of the DataFrame, allowing us to understand its structure.
    * `df.Conference.iloc[0]`: Returns the conference label of the first row, giving us a feel for the target values.
    * `df.Conference.value_counts()`: Counts the titles per conference, which reveals class imbalance (ISCAS has 864 titles, SIGGRAPH only 326).
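
To put the imbalance in numbers, the counts printed by `value_counts()` can be turned into class shares. The snippet below is a minimal sketch that hard-codes those counts instead of reading the CSV; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {"ISCAS": 864, "INFOCOM": 515, "VLDB": 423, "WWW": 379, "SIGGRAPH": 326}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

# Ratio between the largest and the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())

for name, share in shares.items():
    print(f"{name:<9} {share:.1%}")
print(f"largest/smallest class ratio: {imbalance_ratio:.2f}")
```

A ratio of this size is one reason the split below is stratified and the F1 score is reported alongside accuracy.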


### 2. Label Encoding

In [5]:
possible_labels = df.Conference.unique()

label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index
label_dict

{'VLDB': 0, 'ISCAS': 1, 'SIGGRAPH': 2, 'INFOCOM': 3, 'WWW': 4}

In [6]:
df['label'] = df['Conference'].map(label_dict)  # map() yields an integer column directly
df.head()

Unnamed: 0,Title,Conference,label
0,Innovation in Database Management: Computer Sc...,VLDB,0
1,High performance prime field multiplication fo...,ISCAS,1
2,enchanted scissors: a scissor interface for su...,SIGGRAPH,2
3,Detection of channel degradation attack by Int...,INFOCOM,3
4,Pinning a Complex Network through the Betweenn...,ISCAS,1


**Explanation:**

* **Converting Labels to Numbers:** Machine learning models need numerical targets. Here, we create a dictionary (`label_dict`) that maps each unique label (e.g., "VLDB", "ISCAS") to an integer (e.g., 0, 1), and store the encoded values in a new `label` column.
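
The same mapping can be built in one line, and its inverse is useful later for turning predicted indices back into conference names. This is a small sketch on the five labels shown in `df.head()`, using plain Python lists instead of the DataFrame.

```python
# Conference labels of the first five rows shown above.
conferences = ["VLDB", "ISCAS", "SIGGRAPH", "INFOCOM", "ISCAS"]

# dict.fromkeys keeps first-seen order, mirroring the order returned by unique().
label_dict = {name: idx for idx, name in enumerate(dict.fromkeys(conferences))}
label_dict_inverse = {idx: name for name, idx in label_dict.items()}

encoded = [label_dict[name] for name in conferences]      # names -> integers
decoded = [label_dict_inverse[idx] for idx in encoded]    # integers -> names

print(encoded)
print(decoded)
```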


### 3. Train and Validation Split

In [7]:
X_train, X_val, y_train, y_val = train_test_split(
    df.index.values,
    df.label.values,
    test_size=0.15,
    random_state=42,
    stratify=df.label.values
)

df['data_type'] = ['not_set']*df.shape[0]

df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

df.groupby(['Conference', 'label', 'data_type']).count()

Conference,label,data_type,Title
INFOCOM,3,train,438
INFOCOM,3,val,77
ISCAS,1,train,734
ISCAS,1,val,130
SIGGRAPH,2,train,277
SIGGRAPH,2,val,49
VLDB,0,train,359
VLDB,0,val,64
WWW,4,train,322
WWW,4,val,57


**Explanation:**

* **Splitting for Training and Evaluation:**
    * We use `train_test_split` (imported in the first cell) to divide our data into training and validation sets.
    * `test_size=0.15` allocates 15% for validation, leaving 85% for training.
    * `random_state=42` ensures reproducible splitting.
    * `stratify=df.label.values` maintains the original label distribution in both sets.
* **Marking Data Splits:**  We add a `data_type` column to the DataFrame to easily track which data points belong to the training and validation sets.
* **Verification:**  The `groupby` operation helps us verify that the label and conference distributions are consistent across the training and validation splits.
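
The effect of `stratify` can be checked without the CSV. The sketch below rebuilds a synthetic label array from the class counts reported earlier (an assumption standing in for `df.label.values`) and compares the class shares of the validation split with those of the full dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# label -> number of titles, taken from the outputs above
counts = {0: 423, 1: 864, 2: 326, 3: 515, 4: 379}
labels = np.concatenate([np.full(n, label) for label, n in counts.items()])
indices = np.arange(len(labels))

idx_train, idx_val, y_train, y_val = train_test_split(
    indices, labels, test_size=0.15, random_state=42, stratify=labels
)

# Class shares in the full data versus the validation split.
overall_share = np.bincount(labels) / len(labels)
val_share = np.bincount(y_val, minlength=len(counts)) / len(y_val)
max_gap = float(np.abs(overall_share - val_share).max())

print(len(idx_train), len(idx_val), f"largest share gap: {max_gap:.4f}")
```

With stratification the largest gap stays well below one percentage point; without it, small classes such as SIGGRAPH could drift noticeably.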

### 4. BERT Tokenizer and Encoding the Data

In [8]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].Title.values,
    add_special_tokens=True,
    return_attention_mask=True,
    pad_to_max_length=True,
    max_length=256,
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    df[df.data_type=='val'].Title.values,
    add_special_tokens=True,
    return_attention_mask=True,
    pad_to_max_length=True,
    max_length=256,
    return_tensors='pt'
)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


In [9]:
# Extract input IDs, attention masks, and labels for the training set

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type=='train'].label.values)

# Do the same for the validation set
input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.data_type=='val'].label.values)


dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)

**Explanation:**

* **Tokenizer:**
    * We load the BERT tokenizer (`bert-base-uncased`) and specify `do_lower_case=True` to lowercase all text.
* **Encoding:**  
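    * `batch_encode_plus` encodes multiple text sequences at once.
    * `add_special_tokens=True`: Adds BERT's special tokens (`[CLS]`, `[SEP]`).
    * `return_attention_mask=True`: Generates masks so the model ignores padding tokens.
    * `pad_to_max_length=True` with `max_length=256`: Pads every sequence to a fixed length of 256 tokens. Recent `transformers` releases deprecate this flag in favor of `padding='max_length'`, and passing `truncation=True` avoids the truncation warning shown above.
    * `return_tensors='pt'`: Returns PyTorch tensors, ready for model input.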
* **Extracting Encoded Data:**
    * `input_ids_train`, `attention_masks_train`, `labels_train` are extracted from the encoding results. These are essential for training BERT.
* **TensorDataset:**  
    * We create `TensorDataset` objects to combine the encoded inputs, attention masks, and labels into a convenient format for training and validation.
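
To make padding and the attention mask concrete, here is a minimal pure-Python sketch of what happens to one tokenized title. It is not the tokenizer's implementation: `pad_and_mask` is a hypothetical helper, and the two word-piece IDs in the example are placeholders. The special-token IDs are those of the `bert-base-uncased` vocabulary.

```python
# Special-token IDs in the bert-base-uncased vocabulary.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102

def pad_and_mask(token_ids, max_length):
    """Wrap IDs in [CLS]/[SEP], truncate, pad, and build the attention mask."""
    body = list(token_ids)[: max_length - 2]   # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)                      # 1 = real token
    padding = max_length - len(ids)
    return ids + [PAD_ID] * padding, mask + [0] * padding  # 0 = padding

# Placeholder word-piece IDs for a two-token title.
input_ids, attention_mask = pad_and_mask([7592, 2088], max_length=8)
print(input_ids)
print(attention_mask)
```

Because every row ends up with the same length, the encoded titles can be stacked into a single tensor, which is what `TensorDataset` needs.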


### 5. BERT Pre-trained Model

In [10]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Explanation:**

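We load a pre-trained BERT model for sequence classification with the `BertForSequenceClassification` class from the Hugging Face `transformers` library. The encoder starts from the `bert-base-uncased` weights, `num_labels=len(label_dict)` sizes the classification head for our 5 conferences, and the two `output_*` flags stop the model from returning attentions or hidden states. The warning above is expected: the classification head (`classifier.weight`, `classifier.bias`) is not part of the checkpoint, so it is randomly initialized and has to be learned during fine-tuning.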
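
The newly initialized head is small compared with the encoder. The NumPy sketch below imitates it as a single linear layer on top of a 768-dimensional pooled representation (the hidden size of `bert-base-uncased`). The random inputs are stand-ins, and the real model also applies dropout before this layer, so treat it as an illustration of the shapes only.

```python
import numpy as np

hidden_size, num_labels, batch = 768, 5, 16
rng = np.random.default_rng(0)

# Stand-in for BERT's pooled [CLS] representation of one batch.
pooled = rng.normal(size=(batch, hidden_size))

# Randomly initialized head, playing the role of classifier.weight / classifier.bias.
weight = rng.normal(scale=0.02, size=(num_labels, hidden_size))
bias = np.zeros(num_labels)

logits = pooled @ weight.T + bias      # one score per conference
predicted = logits.argmax(axis=1)      # index of the highest-scoring class
head_params = weight.size + bias.size  # parameters trained from scratch

print(logits.shape, head_params)
```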


### 6. Data Loaders

In [11]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

batch_size = 16

dataloader_train = DataLoader(dataset_train,
                              sampler=RandomSampler(dataset_train),
                              batch_size=batch_size)

dataloader_validation = DataLoader(dataset_val,
                                   sampler=SequentialSampler(dataset_val),
                                   batch_size=batch_size)

**Explanation:**

Data loaders are created using PyTorch's `DataLoader` class, one for the training set and one for the validation set. Each is initialized with its dataset, a sampler (`RandomSampler` to shuffle the training data, `SequentialSampler` to keep the validation order fixed), and a batch size of 16.
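
The number of batches per epoch follows directly from the split sizes and the batch size. A quick check, assuming the 2,130 training and 377 validation titles from the split above and `DataLoader`'s default of keeping the last, smaller batch:

```python
import math

n_train, n_val, batch_size = 2130, 377, 16

# DataLoader keeps the final partial batch by default (drop_last=False).
train_batches = math.ceil(n_train / batch_size)
val_batches = math.ceil(n_val / batch_size)
last_train_batch = n_train - (train_batches - 1) * batch_size

print(train_batches, val_batches, last_train_batch)
```

`len(dataloader_train)` returns this batch count, not the number of samples, which matters when the scheduler's total step count is computed.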


### 7. Optimizer & Scheduler

In [12]:
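from torch.optim import AdamW  # transformers.AdamW is deprecated and absent from recent releases
from transformers import get_linear_schedule_with_warmup

# weight_decay=0.0 matches the default of the former transformers.AdamW
optimizer = AdamW(model.parameters(),
                  lr=1e-5,
                  eps=1e-8,
                  weight_decay=0.0)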

epochs = 5

scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader_train)*epochs)



**Explanation:**

The AdamW optimizer is used to optimize the parameters of the BERT model with a learning rate of 1e-5 and epsilon value of 1e-8. Additionally, a linear scheduler with warmup is set up using the `get_linear_schedule_with_warmup` function from the `transformers` library. The number of warmup steps is set to 0, and the total number of training steps is calculated based on the number of batches and epochs.
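
The shape of this schedule is easy to reproduce by hand: the learning rate ramps up linearly during warmup (skipped here, since warmup is 0) and then decays linearly to zero at the last step. The function below is a hand-written sketch of that rule, not the library's code, and it assumes 134 batches per epoch (2,130 training titles at batch size 16).

```python
def linear_schedule_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate after `step` optimizer updates under linear warmup and decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)

total_steps = 134 * 5  # batches per epoch * epochs

for step in (0, total_steps // 2, total_steps):
    lr = linear_schedule_lr(step, 1e-5, 0, total_steps)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

Because `scheduler.step()` is called once per batch in the training loop, the total must count batches rather than epochs.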


### 8. Performance Metrics

In [13]:
from sklearn.metrics import f1_score

def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}

    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')

**Explanation:**

Two functions are defined for evaluating model performance metrics:
   
   - `f1_score_func`: Calculates the F1 score for the model predictions compared to the actual labels. The F1 score is computed with the 'weighted' average.
   
   - `accuracy_per_class`: Reports per-class accuracy by comparing predictions with true labels. For each class it prints the class name (looked up through the inverse of `label_dict`) and the number of correctly predicted samples out of all samples of that class, i.e. the per-class recall.
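
A small worked example shows what the 'weighted' average does: each class's F1 score is weighted by how many true samples the class has. The logits and labels below are made up for illustration (three classes, six samples).

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical logits for 6 samples and 3 classes, plus their true labels.
logits = np.array([
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.1, 2.2, 0.4],
    [0.3, 1.9, 0.2],
    [1.1, 0.9, 0.2],
    [0.1, 0.2, 3.0],
])
labels = np.array([0, 0, 1, 1, 1, 2])
preds = logits.argmax(axis=1)

weighted = f1_score(labels, preds, average="weighted")

# The same number by hand: per-class F1 weighted by class support.
per_class_f1 = f1_score(labels, preds, average=None)
support = np.bincount(labels)
manual_weighted = float((per_class_f1 * support).sum() / support.sum())

# Per-class hit counts in the style printed by accuracy_per_class.
per_class_hits = {
    int(c): f"{int((preds[labels == c] == c).sum())}/{int((labels == c).sum())}"
    for c in np.unique(labels)
}

print(weighted, manual_weighted, per_class_hits)
```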


### 9. Training Loop

In [14]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
print(device)

cuda


In [16]:
import random

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

def evaluate(dataloader_val):

    model.eval()

    loss_val_total = 0
    predictions, true_vals = [], []

    for batch in dataloader_val:

        batch = tuple(b.to(device) for b in batch)

        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():
            outputs = model(**inputs)

        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)

    loss_val_avg = loss_val_total/len(dataloader_val)

    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)

    return loss_val_avg, predictions, true_vals

for epoch in tqdm(range(1, epochs+1)):

    model.train()

    loss_train_total = 0

    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)
    for batch in progress_bar:

        model.zero_grad()

        batch = tuple(b.to(device) for b in batch)

        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        outputs = model(**inputs)

        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()

        # loss is already the batch mean; len(batch) is the tuple length (3), not the batch size
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})


    #torch.save(model.state_dict(), f'data_volume/finetuned_BERT_epoch_{epoch}.model')
    torch.save(model.state_dict(), f'finetuned_BERT_epoch_{epoch}.model')

    tqdm.write(f'\nEpoch {epoch}')

    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'Training loss: {loss_train_avg}')

    val_loss, predictions, true_vals = evaluate(dataloader_validation)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (Weighted): {val_f1}')

Epoch 1
Training loss: 0.6104545768961978
Validation loss: 0.537546090160807
F1 Score (Weighted): 0.8088577960292905

Epoch 2
Training loss: 0.4735344742883497
Validation loss: 0.48486001541217166
F1 Score (Weighted): 0.8282455530596166

Epoch 3
Training loss: 0.3831960843014183
Validation loss: 0.4662523455917835
F1 Score (Weighted): 0.8343023080607171

Epoch 4
Training loss: 0.32882490920931545
Validation loss: 0.4548906007160743
F1 Score (Weighted): 0.8382535041221896

Epoch 5
Training loss: 0.3101360922429099
Validation loss: 0.4548906007160743
F1 Score (Weighted): 0.8382535041221896


**Explanation:**

* **Seed Setting:** The code starts by setting random seeds for Python, NumPy, and PyTorch. This makes runs more repeatable, which helps with debugging and with comparing experiments, although some GPU operations can still be nondeterministic.

* **`evaluate()` function:** This function takes a data loader (`dataloader_val`) as input, sets the model to evaluation mode, and iterates over the data loader to calculate the average validation loss and obtain predictions. Key points:
    * **`model.eval()`:** Puts the model in evaluation mode, which turns off dropout so that predictions are deterministic. BERT uses layer normalization rather than batch normalization, and layer normalization behaves the same in both modes.
    * **`torch.no_grad()`:**  Temporarily disables gradient calculations.  Improves speed and memory efficiency during evaluation, as gradients aren't needed.
    * **Moving data to the device:** Ensures that tensors are on the correct device (CPU or GPU).
    * **Detaching and moving to the CPU:** For calculations and metrics, data is detached from the computation graph (`detach()`) and moved to the CPU (`cpu()`) for compatibility with NumPy.

* **Main Training Loop:**  Iterates over the specified number of epochs (`epochs`), training the model and evaluating its performance.
    * **`model.train()`:**  Sets the model to training mode, which re-enables dropout.
    * **`model.zero_grad()`:** Clears the gradients from previous iterations. Crucial because PyTorch accumulates gradients.
    * **Gradient Clipping (`torch.nn.utils.clip_grad_norm_`)**:  Prevents exploding gradients (very large gradients that can destabilize training) by clipping them to a maximum norm.
    * **`optimizer.step()`:** Updates the model's parameters based on the calculated gradients.
    * **`scheduler.step()`:** Updates the learning rate according to the scheduler's strategy.
    * **Saving the Model (`torch.save`)**: Saves the model's state dictionary (weights and biases) after each epoch.
    * **Evaluation:** After each epoch, the model is evaluated on the validation set using the `evaluate()` function.  Training and validation losses are printed, along with the F1 score.
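
Gradient clipping by global norm can be illustrated with plain NumPy arrays standing in for parameter gradients. The helper `clip_by_global_norm` is a hypothetical name and only sketches the idea behind `clip_grad_norm_`: if the combined norm of all gradients exceeds `max_norm`, every gradient is scaled by the same factor.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their combined L2 norm is at most max_norm."""
    total_norm = float(np.sqrt(sum((g ** 2).sum() for g in grads)))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scale gradients up
    return [g * scale for g in grads], total_norm

# Two fake gradient tensors whose combined norm is 5.
grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum((g ** 2).sum() for g in clipped)))

print(norm_before, norm_after)
```

Scaling all gradients together keeps the update direction unchanged and only shortens the step.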

### 10. Loading and Evaluating the Model

In [17]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

model.to(device)

model.load_state_dict(torch.load('finetuned_BERT_epoch_1.model', map_location=torch.device('cpu')))

_, predictions, true_vals = evaluate(dataloader_validation)
accuracy_per_class(predictions, true_vals)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Class: VLDB
Accuracy: 42/64

Class: ISCAS
Accuracy: 124/130

Class: SIGGRAPH
Accuracy: 32/49

Class: INFOCOM
Accuracy: 66/77

Class: WWW
Accuracy: 42/57



**Explanation:**

* **Loading the pre-trained model and moving it to device:** As before, we load a fresh `bert-base-uncased` model and move it to the chosen device (CPU or GPU). Its classification head is randomly initialized again, which is why the warning reappears.

* **Loading Model Checkpoint:** `torch.load` reads the saved state dictionary and `load_state_dict` copies it into the model. The code loads `finetuned_BERT_epoch_1.model`, the checkpoint saved after the first epoch; pass a later file name to evaluate a later epoch. `map_location=torch.device('cpu')` deserializes the tensors on the CPU, so the file also loads on a machine without the GPU it was saved from.

* **Final Evaluation:**  The loaded model is evaluated on the validation set with the previously defined `evaluate` and `accuracy_per_class` functions.
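
The printed counts are easier to compare as percentages. This short sketch copies the hit counts from the output above and derives per-class and overall accuracy on the validation set.

```python
# (correct, total) per class, copied from the accuracy_per_class output above.
results = {
    "VLDB": (42, 64),
    "ISCAS": (124, 130),
    "SIGGRAPH": (32, 49),
    "INFOCOM": (66, 77),
    "WWW": (42, 57),
}

per_class = {name: hit / n for name, (hit, n) in results.items()}
correct = sum(hit for hit, _ in results.values())
total = sum(n for _, n in results.values())
overall_accuracy = correct / total

for name, acc in per_class.items():
    print(f"{name:<9} {acc:.1%}")
print(f"overall   {overall_accuracy:.1%} ({correct}/{total})")
```

The largest class, ISCAS, is recognized most reliably, while the smaller VLDB and SIGGRAPH classes trail behind.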

**Key Points and Enhancements for Intermediate/Advanced Readers:**

* **Early Stopping:** To prevent overfitting, consider implementing early stopping. Monitor the validation loss and stop training if it doesn't improve for a certain number of epochs.
* **Hyperparameter Tuning:** Experiment with different batch sizes, learning rates, optimizers (e.g., Adam, SGD), schedulers, and the number of epochs to find the best configuration for your specific dataset. Tools like Weights & Biases (wandb) can help track and visualize these experiments.
* **Model Complexity:** Try larger BERT models (e.g., `bert-large-uncased`) for potentially better performance, but be mindful of increased computational resources.
* **Data Augmentation:**  Explore techniques like back-translation or synonym replacement to augment your dataset and improve model generalization.
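
As a starting point for the early-stopping suggestion, the class below tracks the best validation loss and signals a stop after `patience` epochs without improvement. `EarlyStopping` is a hypothetical helper, not part of PyTorch or `transformers`, and the validation losses are made-up values for illustration.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Hypothetical validation losses, one per epoch.
val_losses = [0.54, 0.48, 0.47, 0.49, 0.50, 0.46]

stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(f"stopped after epoch {stopped_at}, best validation loss {stopper.best}")
```

In the training loop above, `stopper.step(val_loss)` would be called right after `evaluate(dataloader_validation)`, and the checkpoint from the best epoch would be the one to keep.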
