Fine-Tuning DistilBERT for Multi-Label Text Classification

==> Flow of the notebook: 

- Importing Python Libraries and preparing the environment
- Importing and Pre-Processing the domain data
- Preparing the Dataset and Dataloader
- Creating the Neural Network for Fine Tuning
- Fine Tuning the Model
- Validating the Model Performance
- Saving the model and artifacts for Inference in Future



==> Language Model Used:

- DistilBERT is a smaller transformer model than BERT or RoBERTa. It is created by applying the process of distillation to BERT.
- Research Paper : (https://arxiv.org/abs/1910.01108)
- HuggingFace Documentation for python : (https://huggingface.co/transformers/model_doc/distilbert.html)

==> Script Objective:

- The objective of this script is to fine-tune DistilBERT for MeSH classification (Hepatitis).

==> Importing Python Libraries and preparing the environment :

- Pandas , Numpy
- Pytorch and Pytorch Utils for Dataset and Dataloader
- Transformers
- DistilBERT Model and Tokenizer
- sklearn for metrics (to calculate the Hamming loss; the Hamming score is computed with a custom function)
- warnings
- tqdm 
- logging

- After that, we prepare the device for CUDA execution. This configuration is needed if you want to use an onboard GPU. You can also use the CPU, but training will be very slow (you will graduate before the training finishes :D).

==> Very Important Note : 

=> The overall mechanisms for multiclass and multilabel problems are similar, except for two major points:

- The loss function evaluates the probability of each category individually rather than relative to the other categories. Hence Binary Cross Entropy (BCE) is used instead of Cross Entropy when defining the loss:
  1) https://medium.com/dejunhuang/learning-day-57-practical-5-loss-function-crossentropyloss-vs-bceloss-in-pytorch-softmax-vs-bd866c8a0d23

- A sigmoid is applied to the outputs rather than a softmax. Accordingly, we use the loss and the Hamming score to compare the expected and predicted labels directly:
  1) https://towardsdatascience.com/sigmoid-activation-and-binary-crossentropy-a-less-than-perfect-match-b801e130e31

  2) https://www.linkedin.com/pulse/hamming-score-multi-label-classification-chandra-sharat/
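
The difference between the two set-ups can be illustrated with a few lines of plain Python. This is a minimal sketch with made-up logits (the values and the helper names are assumptions for illustration, not taken from the notebook): softmax makes the four categories compete so that the probabilities sum to 1, while sigmoid scores each category independently, so several labels can pass the 0.5 threshold at once.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical raw model outputs (logits) for the four label columns.
example_logits = [2.0, 1.5, -3.0, -2.0]

softmax_probs = softmax(example_logits)                # multiclass: categories compete
sigmoid_probs = [sigmoid(x) for x in example_logits]   # multilabel: scored independently

# Multilabel decision: every label whose probability reaches 0.5 is predicted.
predicted_labels = [int(p >= 0.5) for p in sigmoid_probs]

print("softmax:", [round(p, 3) for p in softmax_probs])
print("sigmoid:", [round(p, 3) for p in sigmoid_probs])
print("multilabel prediction:", predicted_labels)
```

With these logits the sigmoid route predicts the first two labels together, whereas softmax can only favour a single category.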

In [33]:
# Importing the needed libraries

import warnings
warnings.simplefilter('ignore')
import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn import metrics
import transformers
import torch
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from transformers import DistilBertTokenizer, DistilBertModel
import logging
logging.basicConfig(level=logging.ERROR)

In [34]:
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

==> Importing and Pre-Processing the domain data

- Import the file into a dataframe.
- Remove the pmid column from the data.
- Create a new dataframe and store the input text in its text column.
- Convert the values of all the category columns into one list per row.
- Add that list as a new column named labels.

In [35]:
data = pd.read_csv('pubmed.csv')

In [36]:
data.head(10)

Unnamed: 0,pmid,txt,label_cir,label_nfl,label_hep,label_hpc
0,30558055,abo incompatible living donor liver transplant...,1,0,0,0
1,30558011,a human ciliopathy with polycystic ovarian syn...,1,0,0,0
2,30540737,vibrio cholerae no - o1 no - o139 bacteremia i...,1,0,0,0
3,30531115,ruptured ascending colonic varices in a patien...,1,0,0,0
4,30526986,a 44 - year - old woman with sudden breathless...,1,0,0,0
5,30507656,short article sequence variations of pkhd1 und...,1,0,0,0
6,30508897,primary biliary cirrhosis with refractory hypo...,1,0,0,0
7,30482030,auxiliary partial orthotopic liver transplanta...,1,0,0,0
8,30471833,intraoperative management of a patient with im...,1,0,0,0
9,30464164,successful treatment of repeated hematemesis s...,1,0,0,0


In [37]:
data.drop(['pmid'], inplace=True, axis=1)

In [38]:
data.head(10)

Unnamed: 0,txt,label_cir,label_nfl,label_hep,label_hpc
0,abo incompatible living donor liver transplant...,1,0,0,0
1,a human ciliopathy with polycystic ovarian syn...,1,0,0,0
2,vibrio cholerae no - o1 no - o139 bacteremia i...,1,0,0,0
3,ruptured ascending colonic varices in a patien...,1,0,0,0
4,a 44 - year - old woman with sudden breathless...,1,0,0,0
5,short article sequence variations of pkhd1 und...,1,0,0,0
6,primary biliary cirrhosis with refractory hypo...,1,0,0,0
7,auxiliary partial orthotopic liver transplanta...,1,0,0,0
8,intraoperative management of a patient with im...,1,0,0,0
9,successful treatment of repeated hematemesis s...,1,0,0,0


In [39]:
new_df = pd.DataFrame()
new_df['text'] = data['txt']
new_df['labels'] = data.iloc[:, 1:].values.tolist()

In [40]:
new_df.head(10)

Unnamed: 0,text,labels
0,abo incompatible living donor liver transplant...,"[1, 0, 0, 0]"
1,a human ciliopathy with polycystic ovarian syn...,"[1, 0, 0, 0]"
2,vibrio cholerae no - o1 no - o139 bacteremia i...,"[1, 0, 0, 0]"
3,ruptured ascending colonic varices in a patien...,"[1, 0, 0, 0]"
4,a 44 - year - old woman with sudden breathless...,"[1, 0, 0, 0]"
5,short article sequence variations of pkhd1 und...,"[1, 0, 0, 0]"
6,primary biliary cirrhosis with refractory hypo...,"[1, 0, 0, 0]"
7,auxiliary partial orthotopic liver transplanta...,"[1, 0, 0, 0]"
8,intraoperative management of a patient with im...,"[1, 0, 0, 0]"
9,successful treatment of repeated hematemesis s...,"[1, 0, 0, 0]"


==> Preparing the Dataset and Dataloader

- We start by defining a few key variables that will be used later during the training/fine-tuning stage.
- Next we create the MultiLabelHepatits class, which defines how the text is pre-processed before it is sent to the neural network.
- We also define the Dataloader that feeds the data in batches to the neural network for training and processing. Dataset and Dataloader are constructs of the PyTorch library for defining and controlling the data pre-processing and its passage to the neural network. For further reading on Dataset and Dataloader, see the PyTorch docs.


==> MultiLabelHepatits Dataset Class

- This class accepts the tokenizer, dataframe and max_len as input and generates the tokenized outputs and targets that the DistilBERT model uses for training.
- We use the DistilBERT tokenizer to tokenize the data in the text column of the dataframe.
- The tokenizer's encode_plus method performs the tokenization and generates the necessary outputs, namely:
  1) ids  2) attention_mask  3) token_type_ids
- token_type_ids are returned and passed along, but the model's forward pass does not use them.
- targets is the list of categories labelled 0 or 1 in the dataframe.
- This class is used to create 2 datasets: one for training and one for validation.
- Training Dataset: 80% of the PubMed data, used for fine-tuning.
- Validation Dataset: the remaining 20%, used to evaluate the performance of the model.
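
To make the tokenizer outputs concrete, here is a toy sketch of what fixed-length encoding produces. It does not call the real DistilBERT tokenizer: the `encode_fixed_length` helper and the word-piece ids in the example are hypothetical, and the special-token ids are assumed to be those of the distilbert-base-uncased vocabulary (101 for [CLS], 102 for [SEP], 0 for [PAD]).

```python
# Assumed special-token ids of the distilbert-base-uncased vocabulary.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def encode_fixed_length(token_ids, max_len):
    """Add special tokens, truncate, then pad to exactly max_len positions."""
    # Reserve two positions for [CLS] and [SEP] when truncating.
    body = token_ids[: max_len - 2]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)            # 1 marks a real token
    padding = max_len - len(ids)
    ids = ids + [PAD_ID] * padding
    mask = mask + [0] * padding      # 0 marks padding, which attention ignores
    return {"input_ids": ids, "attention_mask": mask}

# Hypothetical word-piece ids for a short text and for a text that is too long.
short_example = encode_fixed_length([7592, 2088], max_len=8)
long_example = encode_fixed_length(list(range(1000, 1020)), max_len=8)

print(short_example)
print(long_example)
```

Every encoded sample has the same length, which is what allows the Dataloader to stack samples into a batch tensor.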

==> Dataloader

- The Dataloader is used to create training and validation dataloaders that load data into the neural network in a controlled manner. Not all the data in the dataset can be loaded into memory at once, so the amount of data loaded into memory and then passed to the neural network needs to be controlled.
- This control is achieved with parameters such as batch_size (the number of samples per batch) and max_len (the length of each tokenized sequence, set in the dataset).
- The training and validation dataloaders are used in the training and validation parts of the flow respectively.
- For further reading about Dataloader: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
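
As a quick sanity check on batching, the number of batches per pass over the data follows directly from the dataset size and the batch size. The sketch below uses the split sizes reported in this notebook's output (5381 training and 1345 test samples) with a batch size of 4; the `num_batches` helper is a hypothetical name introduced for illustration.

```python
import math

def num_batches(num_samples, batch_size, drop_last=False):
    """Number of batches a loader yields in one pass over the data."""
    if drop_last:
        # An incomplete final batch is discarded.
        return num_samples // batch_size
    # Otherwise the final batch may be smaller than batch_size.
    return math.ceil(num_samples / batch_size)

train_samples, test_samples = 5381, 1345   # sizes printed for the 80/20 split
example_batch_size = 4

train_steps = num_batches(train_samples, example_batch_size)
test_steps = num_batches(test_samples, example_batch_size)
print(train_steps, test_steps)
```

The results, 1346 and 337, match the iteration counts shown by the tqdm progress bars in the training and validation outputs.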

In [51]:
# Defining some key variables that will be used later on in the training
MAX_LEN = 128

# The number of training samples that will be propagated through the neural network
TRAIN_BATCH_SIZE = 4 

# The number of validation samples that will be propagated through the neural network
VALID_BATCH_SIZE = 4

# How many times the entire dataset is passed forward and backward through the neural network
EPOCHS = 6

# A hyperparameter that controls how much to change the model in response to the estimated error each time the model weights 
# are updated (a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving 
# toward a minimum of a loss function).
LEARNING_RATE = 1e-05

In [52]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', truncation=True, do_lower_case=True)

In [53]:
class MultiLabelHepatits(Dataset):

    def __init__(self, dataframe, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.text = dataframe.text
        self.targets = self.data.labels
        self.max_len = max_len

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        text = str(self.text[index])
        text = " ".join(text.split())

        # Tokenize, truncate to max_len and pad shorter texts up to max_len
        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_token_type_ids=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask'] 
        token_type_ids = inputs["token_type_ids"]


        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            'targets': torch.tensor(self.targets[index], dtype=torch.float)
        }

In [54]:
# Creating the dataset and dataloader for the neural network

train_size = 0.8
train_data=new_df.sample(frac=train_size,random_state=200)
test_data=new_df.drop(train_data.index).reset_index(drop=True)
train_data = train_data.reset_index(drop=True)


print("FULL Dataset: {}".format(new_df.shape))
print("TRAIN Dataset: {}".format(train_data.shape))
print("TEST Dataset: {}".format(test_data.shape))

training_set = MultiLabelHepatits(train_data, tokenizer, MAX_LEN)
testing_set = MultiLabelHepatits(test_data, tokenizer, MAX_LEN)

FULL Dataset: (6726, 2)
TRAIN Dataset: (5381, 2)
TEST Dataset: (1345, 2)


In [55]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

==> Creating the Neural Network for Fine Tuning

==> Neural Network:

- We create a neural network with the DistilBERTClass.
- This network consists of the DistilBERT language model, followed by a pre-classifier Linear layer (768 to 256) with a Tanh activation, a Dropout layer (for regularization) and finally a Linear layer (for classification) that produces the final outputs.
- The data is fed to the DistilBERT language model as defined in the dataset.
- The outputs of the final layer are compared to the encoded categories to determine the accuracy of the model's predictions.
- The output dimension of the final Linear layer is 4, because that is the total number of categories we want to classify the text into (the number of entries in the labels column).
- We create an instance of the network called model. This instance is used for training and is then saved as the final trained model for future inference.
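
The shapes flowing through the layers stacked on top of DistilBERT can be checked without downloading the pretrained weights. The sketch below is an assumption-laden stand-in: `ClassificationHead` is a hypothetical class that copies only the head layers from the DistilBERTClass code, and a random tensor replaces the real DistilBERT hidden states.

```python
import torch

class ClassificationHead(torch.nn.Module):
    """Hypothetical stand-alone copy of the layers placed on top of DistilBERT."""

    def __init__(self, hidden_size=768, reduced_size=256, num_labels=4, dropout=0.1):
        super().__init__()
        self.pre_classifier = torch.nn.Linear(hidden_size, reduced_size)
        self.dropout = torch.nn.Dropout(dropout)
        self.classifier = torch.nn.Linear(reduced_size, num_labels)

    def forward(self, hidden_state):
        # Keep only the first token's vector: (batch, seq_len, hidden) -> (batch, hidden)
        first_token = hidden_state[:, 0]
        x = torch.tanh(self.pre_classifier(first_token))
        x = self.dropout(x)
        # One raw logit per label; no sigmoid is applied here.
        return self.classifier(x)

# Random tensor standing in for DistilBERT's last hidden state:
# batch of 4 samples, 128 tokens each, hidden size 768.
fake_hidden_state = torch.randn(4, 128, 768)

head = ClassificationHead().eval()   # eval() disables dropout
with torch.no_grad():
    example_logits = head(fake_hidden_state)

print(example_logits.shape)
```

Each of the 4 samples in the batch ends up with 4 logits, one per label column.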

==> Loss Function and Optimizer:

- loss_fn: the loss function used to calculate the difference between the output produced by the model and the actual output. The loss function used is Binary Cross Entropy combined with a sigmoid layer, which is implemented as BCEWithLogitsLoss in PyTorch.
- Optimizer: used to update the weights of the neural network to improve its performance. Here we use Adam.
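
To see what this loss measures, the following sketch computes mean binary cross entropy directly from raw logits in plain Python. It is an illustrative re-implementation under stated assumptions (the `bce_with_logits` helper and the example logits are hypothetical), not the PyTorch code itself.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross entropy computed directly from raw logits."""
    losses = []
    for x, y in zip(logits, targets):
        # Numerically stable form of -[y*log(s(x)) + (1-y)*log(1-s(x))],
        # where s is the sigmoid function.
        losses.append(max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x))))
    return sum(losses) / len(losses)

example_targets = [1.0, 0.0, 0.0, 0.0]

# Logits of zero mean "50% for every label", as for an untrained head.
untrained_loss = bce_with_logits([0.0, 0.0, 0.0, 0.0], example_targets)
# Confident and correct logits give a loss close to zero.
confident_loss = bce_with_logits([6.0, -6.0, -6.0, -6.0], example_targets)

print(round(untrained_loss, 4), round(confident_loss, 4))
```

The untrained case gives ln 2, about 0.693, which is close to the first loss value printed in the training output of this notebook (about 0.706).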

==> Further Reading:

- Pytorch Tutorials to get an intuition of Loss Function and Optimizer : (https://github.com/abhimishra91/pytorch-tutorials)
- PyTorch Documentation for BCEWithLogitsLoss : https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html#torch.nn.BCEWithLogitsLoss
- Pytorch Documentation for Loss Function : (https://pytorch.org/docs/stable/nn.html#loss-functions)
- Pytorch Documentation for Optimizer : (https://pytorch.org/docs/stable/optim.html)

In [56]:
# Creating the customized model, by adding a drop out and a dense layer on top of distil bert to get the final output for the model. 

class DistilBERTClass(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Pretrained DistilBERT encoder (hidden size 768)
        self.l1 = DistilBertModel.from_pretrained("distilbert-base-uncased")
        # Reduce the 768-dim representation to 256 dims
        self.pre_classifier = torch.nn.Linear(768, 256)
        self.dropout = torch.nn.Dropout(0.1)
        # One output logit per label column (4 categories)
        self.classifier = torch.nn.Linear(256, 4)

    def forward(self, input_ids, attention_mask, token_type_ids):
        # token_type_ids is accepted to match the dataset output but is not used
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask)
        hidden_state = output_1[0]        # (batch, seq_len, 768)
        pooler = hidden_state[:, 0]       # vector of the first token of each sequence
        pooler = self.pre_classifier(pooler)
        pooler = torch.tanh(pooler)
        pooler = self.dropout(pooler)
        output = self.classifier(pooler)  # raw logits; sigmoid is applied later
        return output

model = DistilBERTClass()
model.to(device)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


DistilBERTClass(
  (l1): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_featu

In [57]:
def loss_fn(outputs, targets):
    return torch.nn.BCEWithLogitsLoss()(outputs, targets)

In [58]:
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)

==> Fine Tuning the Model

- Here we define a training function that trains the model on the training dataset created above 

==> Fine Tuning the Neural Network

- The dataloader passes data to the model in batches of the configured batch size (a poorly chosen batch size can lead to errors, so choose it carefully).
- The output from the model and the actual categories are then compared to calculate the loss.
- The loss value is used to optimize the weights of the neurons in the network.
- The loss value is printed to the console every 5000 steps (this interval is arbitrary). Since each epoch here has only 1346 steps, the loss is printed once per epoch, for the first batch.
- As you can see, after just 3 completed epochs (the line labelled "Epoch: 3") the printed loss was already a minuscule 0.0128, i.e. the network output was extremely close to the actual output. Each printed value is the loss of a single batch, so it fluctuates from epoch to epoch.
- Note: in the training output, epoch numbering starts from 0, like Python indexing.
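
The effect of the logging interval can be checked with a small helper. This is only a sketch: `logged_steps` is a hypothetical function that lists the batch indices for which a condition of the form `step % interval == 0` holds, and the interval of 500 is an example value, not one used in the notebook.

```python
def logged_steps(steps_per_epoch, log_every):
    """Return the batch indices at which step % log_every == 0 is true."""
    return [step for step in range(steps_per_epoch) if step % log_every == 0]

# 1346 batches per epoch, as shown by the progress bars in the training output.
sparse_logging = logged_steps(1346, 5000)   # interval used in the notebook
dense_logging = logged_steps(1346, 500)     # a denser, hypothetical interval

print(sparse_logging)
print(dense_logging)
```

With an interval of 5000 only batch 0 qualifies, which is why the training output shows exactly one loss line per epoch.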

In [59]:
def train(epoch):
    model.train()
    for step, data in tqdm(enumerate(training_loader, 0)):
        # Move the batch to the selected device
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)
        token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
        targets = data['targets'].to(device, dtype = torch.float)

        # Forward pass and loss computation
        outputs = model(ids, mask, token_type_ids)
        loss = loss_fn(outputs, targets)
        if step % 5000 == 0:
            print(f'Epoch: {epoch}, Loss:  {loss.item()}')

        # Backward pass: clear old gradients, backpropagate, update the weights
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

In [60]:
for epoch in range(EPOCHS):
    train(epoch)

0it [00:00, ?it/s]Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
1it [00:00,  4.20it/s]

Epoch: 0, Loss:  0.7057749032974243


1346it [08:18,  2.70it/s]
1it [00:00,  3.78it/s]

Epoch: 1, Loss:  0.04012896865606308


1346it [08:19,  2.70it/s]
1it [00:00,  4.09it/s]

Epoch: 2, Loss:  0.02784394472837448


1346it [08:19,  2.69it/s]
1it [00:00,  3.63it/s]

Epoch: 3, Loss:  0.012844439595937729


1346it [08:19,  2.69it/s]
1it [00:00,  3.69it/s]

Epoch: 4, Loss:  0.1364152580499649


1346it [08:19,  2.69it/s]
1it [00:00,  3.69it/s]

Epoch: 5, Loss:  0.003160130698233843


1346it [08:20,  2.69it/s]


==> Validating the Model

- In this step, we pass the Testing Dataset to the model. This tells us how well the model performs on unseen data.
- This data is the 20% of 'pubmed.csv' that was separated during the Dataset creation stage.
- During the validation stage the weights of the model are not updated. Only the final output is compared to the actual value, and this comparison is then used to calculate the accuracy of the model.


- To measure our model's performance we use the following metrics:
  1) Hamming Score: for each sample, the number of correctly predicted labels divided by the number of labels that are either true or predicted, averaged over all samples (a label-based accuracy). Here it equals about 0.907.
  2) Hamming Loss: the fraction of individual labels that are predicted incorrectly. The lower the Hamming loss, the better the model performance. Here it equals about 0.038.
- For further reading: https://www.linkedin.com/pulse/hamming-score-multi-label-classification-chandra-sharat/
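
A small worked example may help to tell the two metrics apart. The sketch below is plain Python on three made-up samples; the `toy_hamming_loss` and `toy_hamming_score` helpers and all the probabilities are assumptions for illustration, and the 0.5 threshold mirrors the one used in this notebook.

```python
def toy_hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted wrongly."""
    wrong = sum(
        t != p
        for row_t, row_p in zip(y_true, y_pred)
        for t, p in zip(row_t, row_p)
    )
    total = sum(len(row) for row in y_true)
    return wrong / total

def toy_hamming_score(y_true, y_pred):
    """Mean over samples of (true AND predicted labels) / (true OR predicted labels)."""
    scores = []
    for row_t, row_p in zip(y_true, y_pred):
        true_set = {i for i, v in enumerate(row_t) if v}
        pred_set = {i for i, v in enumerate(row_p) if v}
        union = true_set | pred_set
        # No true and no predicted labels counts as a perfect match.
        scores.append(1.0 if not union else len(true_set & pred_set) / len(union))
    return sum(scores) / len(scores)

# Hypothetical sigmoid outputs for three samples and four labels.
example_probs = [[0.9, 0.1, 0.2, 0.0],
                 [0.7, 0.6, 0.1, 0.2],
                 [0.2, 0.3, 0.1, 0.4]]
example_pred = [[int(p >= 0.5) for p in row] for row in example_probs]
example_true = [[1, 0, 0, 0],
                [1, 0, 0, 0],
                [0, 0, 1, 0]]

example_loss = toy_hamming_loss(example_true, example_pred)
example_score = toy_hamming_score(example_true, example_pred)
print(example_pred)
print(round(example_loss, 4), example_score)
```

Here 2 of the 12 label slots are wrong, so the loss is about 0.167, while the per-sample scores 1, 0.5 and 0 average to a Hamming score of 0.5. The score is stricter because a sample with no correctly predicted label contributes nothing.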

In [61]:
def DataValidation(testing_loader):
    model.eval()
    fin_targets=[]
    fin_outputs=[]
    with torch.no_grad():
        for _, data in tqdm(enumerate(testing_loader, 0)):
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
            targets = data['targets'].to(device, dtype = torch.float)
            outputs = model(ids, mask, token_type_ids)
            fin_targets.extend(targets.cpu().detach().numpy().tolist())
            fin_outputs.extend(torch.sigmoid(outputs).cpu().detach().numpy().tolist())
    return fin_outputs, fin_targets

In [62]:
outputs, targets = DataValidation(testing_loader)
final_outputs = np.array(outputs) >=0.5

337it [00:40,  8.42it/s]


In [87]:
def hamming_score(y_true, y_pred):
    # Per-sample intersection over union of true and predicted labels, averaged over all samples
    acc_list = []
    for i in range(y_true.shape[0]):
        set_true = set(np.where(y_true[i])[0])
        set_pred = set(np.where(y_pred[i])[0])
        if len(set_true) == 0 and len(set_pred) == 0:
            # No true and no predicted labels counts as a perfect match
            tmp_a = 1
        else:
            tmp_a = len(set_true.intersection(set_pred)) / float(len(set_true.union(set_pred)))
        acc_list.append(tmp_a)
    return np.mean(acc_list)

val_hamming_loss  = metrics.hamming_loss(targets, final_outputs)
val_hamming_score = hamming_score(np.array(targets), np.array(final_outputs))

print(f"Hamming Score= {val_hamming_score}")
print(f"Hamming Loss = {val_hamming_loss}")

Hamming Score= 0.9066914498141264
Hamming Loss = 0.03828996282527881


==> Saving the Trained Model for inference

- This is the final step in the process of fine-tuning the model.
- The model and its vocabulary are saved to your local path.
- These files can then be used in the future to make inferences on new input texts.

In [22]:
# Saving the files for inference

output_model_file = 'pytorch_distilbert_pubmed.bin'
output_vocab_file = 'vocabulary_distilbert_.bin'

torch.save(model, output_model_file)
tokenizer.save_vocabulary(output_vocab_file)

print('Saved')

Saved
