# BERT model PyTorch implementation with attention
References:

https://pytorch.org/docs/stable/index.html

https://www.kaggle.com/vpkprasanna/bert-model-with-0-845-accuracy

In [None]:
import torch

# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score
import random
import time

In [3]:
train_data = pd.read_csv('BC7-LitCovid-Train.csv')

In [4]:
processed_train_data = pd.DataFrame()
def split(string):
    return str(string).split(';')

processed_train_data['labels']=train_data['label'].apply(split)
# processed_train_data['text']=train_data['abstract']

In [5]:
# Join title and abstract with BERT's separator token (vectorized over all rows).
processed_train_data['text'] = train_data['title'] + '[SEP]' + train_data['abstract']

In [6]:
label_mlb = MultiLabelBinarizer()
label_mle = label_mlb.fit_transform(processed_train_data['labels'])
print(label_mle.shape)
print(label_mlb.classes_)

(24960, 7)
['Case Report' 'Diagnosis' 'Epidemic Forecasting' 'Mechanism' 'Prevention'
 'Transmission' 'Treatment']
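
To make the encoding concrete, here is a minimal pure-Python sketch of the multi-hot rows that `MultiLabelBinarizer.fit_transform` produces. The helper `multi_hot` and the three sample label strings are illustrative assumptions, not part of the notebook or of scikit-learn.

```python
def multi_hot(label_lists):
    """Return (sorted class names, one 0/1 row per sample)."""
    # Classes are the sorted set of every label seen in the data.
    classes = sorted({label for labels in label_lists for label in labels})
    index = {name: i for i, name in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[index[label]] = 1  # mark each label the sample carries
        rows.append(row)
    return classes, rows

# Illustrative raw label strings in the same ';'-separated format as the CSV.
raw = ["Mechanism;Treatment", "Prevention;Treatment", "Case Report"]
classes, rows = multi_hot([s.split(";") for s in raw])
print(classes)
print(rows)
```

With the full training set the same idea yields the seven sorted classes printed above and one 7-element row per article.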


In [7]:
processed_train_data['labels'] = label_mle.tolist()

In [8]:
# Temporary
# processed_train_data = processed_train_data.head(1000)

In [9]:
processed_train_data.head()

Unnamed: 0,labels,text
0,"[0, 0, 0, 1, 0, 0, 1]",Potential role for tissue factor in the pathog...
1,"[0, 0, 0, 0, 1, 0, 1]",Dietary therapy and herbal medicine for COVID-...
2,"[1, 0, 0, 0, 0, 0, 0]",First report of manic-like symptoms in a COVID...
3,"[0, 0, 0, 0, 1, 0, 0]",Epidemiological Investigation of OHCWs with CO...
4,"[0, 0, 0, 0, 0, 0, 1]",The impact of sofosbuvir/daclatasvir or ribavi...


In [10]:
text = processed_train_data.text.values
labels = np.array(list(processed_train_data.labels.values))
# X_train, X_val, y_train, y_val =train_test_split(text, labels, test_size=0.1, random_state=2020)

In [11]:
from transformers import BertTokenizer

# Load the BERT tokenizer.
print('Loading BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

Loading BERT tokenizer...


In [12]:
input_ids = []
attention_masks = []
MAX_LEN = 512
for sent in text:
    # `encode_plus` will:
    # (1) Tokenize the sentence
    # (2) Add the `[CLS]` and `[SEP]` tokens to the start and end
    # (3) Truncate/pad the sentence to the max length
    # (4) Map tokens to their IDs
    # (5) Create the attention mask
    # (6) Return a dictionary of outputs
    encoded_sent = tokenizer.encode_plus(
        text = sent,
        add_special_tokens = True,         # Add `[CLS]` and `[SEP]`
        max_length = MAX_LEN,              # Max length to truncate/pad to
        padding = 'max_length',            # Pad to max length (replaces the deprecated `pad_to_max_length`)
        truncation = True,                 # Truncate explicitly, as the tokenizer warning asks
        return_attention_mask = True,      # Return the attention mask
    )
    # Add the outputs to the lists
    input_ids.append(encoded_sent.get('input_ids'))
    attention_masks.append(encoded_sent.get('attention_mask'))

print('Original: ', text[0])
print('Token IDs:', input_ids[0])
      

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Original:  Potential role for tissue factor in the pathogenesis of hypercoagulability associated with in COVID-19.[SEP]In December 2019, a new and highly contagious infectious disease emerged in Wuhan, China. The etiologic agent was identified as a novel coronavirus, now known as Severe Acute Syndrome Coronavirus-2 (SARS-CoV-2). Recent research has revealed that virus entry takes place upon the union of the virus S surface protein with the type I transmembrane metallo-carboxypeptidase, angiotensin converting enzyme 2 (ACE-2) identified on epithelial cells of the host respiratory tract. Virus triggers the synthesis and release of pro-inflammatory cytokines, including IL-6 and TNF-alpha and also promotes downregulation of ACE-2, which promotes a concomitant increase in levels of angiotensin II (AT-II). Both TNF-alpha and AT-II have been implicated in promoting overexpression of tissue factor (TF) in platelets and macrophages. Additionally, the generation of antiphospholipid antibodies as
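
Steps (2), (3) and (5) of the tokenizer call can be sketched in plain Python. This is only an illustration of the idea, not the tokenizer's actual implementation: `pad_and_mask` is a hypothetical helper, the content IDs are made up, and the special-token IDs are the ones used by the `bert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # bert-base-uncased special-token IDs

def pad_and_mask(token_ids, max_len):
    """Wrap content IDs in [CLS]/[SEP], truncate, pad, and build the mask."""
    content = token_ids[: max_len - 2]      # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + content + [SEP_ID]
    mask = [1] * len(ids)                   # 1 marks a real token
    padding = max_len - len(ids)
    return ids + [PAD_ID] * padding, mask + [0] * padding  # 0 marks padding

# A short input is padded; a long one is truncated to fit.
ids, mask = pad_and_mask([7, 8, 9], max_len=8)
long_ids, long_mask = pad_and_mask(list(range(1000, 1020)), max_len=8)
print(ids, mask)
print(long_ids, long_mask)
```

With `MAX_LEN = 512`, abstracts longer than 510 word pieces lose their tail, which is worth keeping in mind when interpreting results.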

In [13]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler, random_split
# Combine the training inputs into a TensorDataset.
input_ids = torch.tensor(input_ids)
attention_masks = torch.tensor(attention_masks)
labels = torch.tensor(labels)
dataset = TensorDataset(input_ids, attention_masks, labels)

# Create a 95-05 train-validation split.

# Calculate the number of samples to include in each set.
train_size = int(0.95 * len(dataset))
val_size = len(dataset) - train_size

# Divide the dataset by randomly selecting samples.
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print('{:>5,} training samples'.format(train_size))
print('{:>5,} validation samples'.format(val_size))

23,712 training samples
1,248 validation samples
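
`random_split` runs here before `set_seed` is called, so the train/validation partition can differ between runs. A minimal sketch of one way to pin it, assuming a fixed partition is wanted: pass a seeded `torch.Generator`. The toy dataset and the `split` helper are illustrative only.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy stand-in for the real TensorDataset of input IDs, masks and labels.
toy = TensorDataset(torch.arange(100))
train_size = int(0.95 * len(toy))
val_size = len(toy) - train_size

def split(seed):
    """Split `toy` with a generator seeded independently of global RNG state."""
    gen = torch.Generator().manual_seed(seed)
    return random_split(toy, [train_size, val_size], generator=gen)

train_a, val_a = split(2021)
train_b, val_b = split(2021)
print(val_a.indices == val_b.indices)  # same seed, same validation rows
```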


In [14]:
# The DataLoader needs to know our batch size for training, so we specify it 
# here. For fine-tuning BERT on a specific task, the authors recommend a batch 
# size of 16 or 32; a smaller batch of 4 is used here, which needs less GPU
# memory with inputs padded to 512 tokens.
batch_size = 4

# Create the DataLoaders for our training and validation sets.
# We'll take training samples in random order. 
train_dataloader = DataLoader(
            train_dataset,  # The training samples.
            sampler = RandomSampler(train_dataset), # Select batches randomly
            batch_size = batch_size # Trains with this batch size.
        )

# For validation the order doesn't matter, so we'll just read them sequentially.
val_dataloader = DataLoader(
            val_dataset, # The validation samples.
            sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.
            batch_size = batch_size # Evaluate with this batch size.
        )

In [15]:
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel
from transformers import AdamW, get_linear_schedule_with_warmup

In [16]:
class EmbeddingAttention(nn.Module):
    def __init__(self, num_input_features, num_hidden_features):
        super(EmbeddingAttention,self).__init__()
        self.l1 = nn.Linear(num_input_features,num_hidden_features)
        self.act_1 = nn.LeakyReLU()
        self.l2 = nn.Linear(num_hidden_features, 1) # the final attention weight for the input
        self.sigmoid = nn.Sigmoid()
        self.attention_weights = torch.zeros((1,1))

    def getAttentionWeights(self):
        return self.attention_weights

    def forward(self,x): # input format ==> (batch, seq_len, num_input_features)
        l1_out = self.l1(x)
        act1_out = self.act_1(l1_out)
        l2_out = self.l2(act1_out)
        self.attention_weights = self.sigmoid(l2_out) # one weight in (0,1) per token: (batch, seq_len, 1)
        # Broadcasting scales every feature of a token by that token's weight:
        # (batch, seq_len, 1) * (batch, seq_len, num_input_features) -> (batch, seq_len, num_input_features)
        return torch.mul(self.attention_weights,x)
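
The shapes involved are easy to check on random data. The sketch below rebuilds the same layer stack as an `nn.Sequential` stand-in (the name `scorer` and the random input are assumptions for illustration) and traces a batch of 4 sequences of 511 tokens through the gating and the later mean over the sequence.

```python
import torch
import torch.nn as nn

batch, seq_len, hidden = 4, 511, 768
tokens = torch.randn(batch, seq_len, hidden)  # stand-in for BERT token embeddings

# Same layers as EmbeddingAttention: Linear -> LeakyReLU -> Linear -> Sigmoid.
scorer = nn.Sequential(
    nn.Linear(hidden, 50), nn.LeakyReLU(), nn.Linear(50, 1), nn.Sigmoid()
)

weights = scorer(tokens)     # one gate per token: (4, 511, 1)
gated = weights * tokens     # broadcast over the feature axis: (4, 511, 768)
pooled = gated.mean(dim=1)   # average over tokens: (4, 768)
print(weights.shape, gated.shape, pooled.shape)
```

Because each gate comes from an independent sigmoid, the weights of one sequence do not sum to 1, unlike softmax attention.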

In [17]:
class BertClassifier(nn.Module):
    """
        Bert Model for classification Tasks.
    """
    def __init__(self, freeze_bert=False):
        """
        @param   freeze_bert (bool): Set `True` to freeze the BERT weights and train
                     only the attention layer and classifier; `False` fine-tunes BERT too
        """
        super(BertClassifier,self).__init__()
        # Specify hidden size of BERT, hidden sizes of the attention layer and classifier, and number of labels
        # D_in, H,D_out = 768,50,7 (old)
        A_in,A_h = 768,50
        D_in,H,D_out = 768,50,7
        
        self.bert = BertModel.from_pretrained("bert-base-uncased")

        self.attention = EmbeddingAttention(A_in, A_h)
        
        self.classifier = nn.Sequential(
                            nn.Linear(D_in, H),
                            nn.ReLU(),
                            nn.Linear(H, D_out))
        
        self.sigmoid = nn.Sigmoid()         # might not be needed

        # Freeze the Bert Model
        if freeze_bert:
            for param in self.bert.parameters():
                param.requires_grad = False
    
    def forward(self,input_ids,attention_mask):
        """
        Feed input to BERT and the classifier to compute logits.
        @param    input_ids (torch.Tensor): an input tensor with shape (batch_size,
                      max_length)
        @param    attention_mask (torch.Tensor): a tensor that hold attention mask
                      information with shape (batch_size, max_length)
        @return   logits (torch.Tensor): an output tensor with shape (batch_size,
                      num_labels)
        """
        outputs = self.bert(input_ids=input_ids,
                           attention_mask = attention_mask)
        
        # outputs[0] is the last hidden state, shape (batch_size, 512, 768)

        # (OLD) Extract the last hidden state of the token `[CLS]` for classification task
        # last_hidden_state_cls = outputs[0][:,0,:]
        
        # (OLD) Calculate attention scores token by token with the attention layer
        # attention_scores=[]
        # for i in range(511):
        #   input_token = outputs[0][:,1+i,:]
        #   score = self.attention(input_token)
        #   attention_scores.append(score)
        # attention_scores = torch.cat(attention_scores,dim=1)

        # Drop `[CLS]` (position 0) and weight the remaining 511 token embeddings
        important_tokens = outputs[0][:,1:,:]
        attention_out = self.attention(important_tokens)

        # Average the weighted embeddings over the sequence: (batch_size, 768)
        mean_att = torch.mean(attention_out,dim=1)

        # Feed the pooled embedding to the classifier to compute logits
        logit = self.classifier(mean_att)
        
        # logits = self.sigmoid(logit)
        
        return logit
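
`torch.mean(attention_out, dim=1)` in `forward` averages over all 511 positions, padding included. A possible alternative, sketched here on the assumption that padded positions should not contribute, uses the attention mask to average real tokens only. `masked_mean` is a hypothetical helper, not part of the notebook.

```python
import torch

def masked_mean(token_features, attention_mask):
    """Average features over positions where attention_mask == 1.

    token_features: (batch, seq_len, hidden)
    attention_mask: (batch, seq_len) of 0/1
    """
    mask = attention_mask.unsqueeze(-1).to(token_features.dtype)  # (batch, seq_len, 1)
    summed = (token_features * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                       # avoid division by zero
    return summed / counts

# One sequence of three positions; the last one is padding and must be ignored.
features = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
pooled = masked_mean(features, mask)
print(pooled)
```

Inside `forward` the mask would need the same slice as the embeddings, that is `attention_mask[:, 1:]`, since position 0 (`[CLS]`) is dropped.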

In [18]:
def initialize_model(epochs=4):
    """Initialize the Bert Classifier, the optimizer and the learning rate scheduler.
    Uses the global `device` and `train_dataloader` defined in earlier cells.
    """
    
    # Instantiate Bert Classifier
    bert_classifier = BertClassifier(freeze_bert=False)
    
    bert_classifier.to(device)
    
    # Create the optimizer
    optimizer = AdamW(bert_classifier.parameters(),
                     lr=5e-5, # A learning rate commonly used for fine-tuning BERT
                     eps=1e-8 # Small constant for numerical stability
                     )
    
    # Total number of training steps
    total_steps = len(train_dataloader) * epochs
    
    # Set up the learning rate scheduler
    scheduler = get_linear_schedule_with_warmup(optimizer, 
                                              num_warmup_steps=0, # Default value
                                              num_training_steps=total_steps)
    return bert_classifier, optimizer, scheduler
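
With `num_warmup_steps=0`, the scheduler lowers the learning rate linearly from its initial value to zero over `total_steps`. The plain-Python sketch below reproduces that rule for illustration. `linear_schedule_lr` is a hypothetical helper, and the step count assumes the 5,928 batches per epoch (23,712 samples / batch size 4) and 40 epochs used in this notebook.

```python
def linear_schedule_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate after `step` updates under linear warmup then linear decay."""
    if step < num_warmup_steps:
        # Warmup: ramp up linearly from 0 to base_lr.
        return base_lr * step / max(1, num_warmup_steps)
    # Decay: fall linearly to 0 at num_training_steps.
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

total_steps = 5928 * 40  # batches per epoch times epochs
lr_start = linear_schedule_lr(0, 5e-5, 0, total_steps)
lr_half = linear_schedule_lr(total_steps // 2, 5e-5, 0, total_steps)
lr_end = linear_schedule_lr(total_steps, 5e-5, 0, total_steps)
print(lr_start, lr_half, lr_end)
```

Since `total_steps` depends on `epochs`, calling `initialize_model(epochs=1)` and then training for more than one epoch would drive the learning rate to zero after the first epoch.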

In [19]:
# Specify loss function
loss_fn = nn.BCEWithLogitsLoss()

def set_seed(seed_value=42):
    """Set seed for reproducibility.
    """
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)

def train(model, train_dataloader, val_dataloader=None, epochs=4, evaluation=False):
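    """Train the BertClassifier model.
    Note: uses the global `optimizer`, `scheduler`, `loss_fn`, and `device`; the
    optimizer must have been created from this `model`'s parameters.
    """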
    # Start training loop
    print("Start training...\n")
    for epoch_i in range(epochs):
        # =======================================
        #               Training
        # =======================================
        # Print the header of the result table
        print(f"{'Epoch':^7} | {'Batch':^7} | {'Train Loss':^12} | {'Val Loss':^10} | {'Val Acc':^9} | {'Elapsed':^9}")
        print("-"*70)

        # Measure the elapsed time of each epoch
        t0_epoch, t0_batch = time.time(), time.time()

        # Reset tracking variables at the beginning of each epoch
        total_loss, batch_loss, batch_counts = 0, 0, 0

        # Put the model into the training mode
        model.train()

        # For each batch of training data...
        for step, batch in enumerate(train_dataloader):
            batch_counts +=1
            # Load batch to GPU
            b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)

            # Zero out any previously calculated gradients
            model.zero_grad()

            # Perform a forward pass. This will return logits.
            logits = model(b_input_ids, b_attn_mask)

            # Compute loss and accumulate the loss values
            loss = loss_fn(logits, b_labels.float())
            batch_loss += loss.item()
            total_loss += loss.item()

            # Perform a backward pass to calculate gradients
            loss.backward()

            # Clip the norm of the gradients to 1.0 to prevent "exploding gradients"
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

            # Update parameters and the learning rate
            optimizer.step()
            scheduler.step()

            # Print the loss values and time elapsed every 50,000 batches and at the last batch of the epoch
            if (step % 50000 == 0 and step != 0) or (step == len(train_dataloader) - 1):
                # Calculate time elapsed since the last report
                time_elapsed = time.time() - t0_batch

                # Print training results
                print(f"{epoch_i + 1:^7} | {step:^7} | {batch_loss / batch_counts:^12.6f} | {'-':^10} | {'-':^9} | {time_elapsed:^9.2f}")

                # Reset batch tracking variables
                batch_loss, batch_counts = 0, 0
                t0_batch = time.time()

        # Calculate the average loss over the entire training data
        avg_train_loss = total_loss / len(train_dataloader)

        print("-"*70)
        # =======================================
        #               Evaluation
        # =======================================
        if evaluation:
            # After the completion of each training epoch, measure the model's performance
            # on our validation set.
            val_loss, val_accuracy = evaluate(model, val_dataloader)

            # Print performance over the entire training data
            time_elapsed = time.time() - t0_epoch
            
            print(f"{epoch_i + 1:^7} | {'-':^7} | {avg_train_loss:^12.6f} | {val_loss:^10.6f} | {val_accuracy:^9.2f} | {time_elapsed:^9.2f}")
            print("-"*70)
        print("\n")
    
    print("Training complete!")


def evaluate(model, val_dataloader):
    """After the completion of each training epoch, measure the model's performance
    on our validation set.
    """
    # Put the model into the evaluation mode. The dropout layers are disabled during
    # the test time.
    model.eval()

    # Tracking variables
    val_accuracy = []
    val_loss = []

    # For each batch in our validation set...
    for batch in val_dataloader:
        # Load batch to GPU
        b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)

        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids, b_attn_mask)

        # Compute loss
        loss = loss_fn(logits, b_labels.float())
        val_loss.append(loss.item())

        # Get the predictions
        #preds = torch.argmax(logits, dim=1).flatten()
        
        # Calculate the accuracy rate
        #accuracy = (preds == b_labels).cpu().numpy().mean() * 100
        accuracy = accuracy_thresh(logits.view(-1,7),b_labels.view(-1,7))
        
        val_accuracy.append(accuracy)

    # Compute the average accuracy and loss over the validation set.
    val_loss = np.mean(val_loss)
    val_accuracy = np.mean(val_accuracy)

    return val_loss, val_accuracy

def accuracy_thresh(y_pred, y_true, thresh:float=0.5, sigmoid:bool=True):
    "Compute accuracy when `y_pred` and `y_true` are the same size."
    if sigmoid: 
        y_pred = y_pred.sigmoid()
    return ((y_pred>thresh)==y_true.byte()).float().mean().item()
    #return np.mean(((y_pred>thresh).float()==y_true.float()).float().cpu().numpy(), axis=1).sum()
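
`accuracy_thresh` scores each of the 7 label slots independently, so a model that predicts no labels at all can still look reasonable when most slots are 0. The plain-Python sketch below compares that element-wise accuracy with micro-averaged F1 on toy data. `f1_score` is imported in this notebook but never used, so treating F1 as a complementary metric is an assumption, and the helper names are illustrative.

```python
def flatten_pairs(y_pred, y_true):
    """Pair up every predicted label slot with its true value."""
    return [(p, t) for pred_row, true_row in zip(y_pred, y_true)
            for p, t in zip(pred_row, true_row)]

def elementwise_accuracy(y_pred, y_true):
    """Fraction of individual label slots predicted correctly."""
    pairs = flatten_pairs(y_pred, y_true)
    return sum(p == t for p, t in pairs) / len(pairs)

def micro_f1(y_pred, y_true):
    """F1 computed from true/false positives pooled over all label slots."""
    pairs = flatten_pairs(y_pred, y_true)
    tp = sum(p == 1 and t == 1 for p, t in pairs)
    fp = sum(p == 1 and t == 0 for p, t in pairs)
    fn = sum(p == 0 and t == 1 for p, t in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# First two label rows from the table shown earlier, scored against "predict nothing".
y_true = [[0, 0, 0, 1, 0, 0, 1], [0, 0, 0, 0, 1, 0, 1]]
all_zero = [[0] * 7, [0] * 7]
acc = elementwise_accuracy(all_zero, y_true)
f1 = micro_f1(all_zero, y_true)
print(round(acc, 3), f1)
```

On real predictions, `f1_score(y_true, y_pred, average='micro')` from scikit-learn computes the same pooled quantity.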

In [None]:
set_seed(2021) # Set seed for reproducibility
epochs=40
bert_classifier, optimizer, scheduler = initialize_model(epochs=epochs)
train(bert_classifier, train_dataloader, val_dataloader, epochs=epochs, evaluation=True)

In [24]:
# Tried checkpointing but it didn't work
# set_seed(2021) # Set seed for reproducibility
# epochs=1
# bert_classifier, optimizer, scheduler = initialize_model(epochs=epochs)
# torch.save(bert_classifier,"BERT-Att_model_ep0.pth")
# for i in range(40):
#     prev_file = "BERT-Att_model_ep"+str(i)+".pth"
#     bc = torch.load(prev_file)
#     set_seed(2021)
#     train(bc, train_dataloader, val_dataloader, epochs=epochs, evaluation=True)
#     file = "BERT-Att_model_ep"+str(i+1)+".pth" 
#     torch.save(bc,file)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Start training...

 Epoch  |  Batch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
----------------------------------------------------------------------
   1    |  5927   |   0.693370   |     -      |     -     |  4379.65 
----------------------------------------------------------------------
   1    |    -    |   0.693370   |  0.693240  |   0.61    |  4459.69 
----------------------------------------------------------------------


Training complete!
Start training...

 Epoch  |  Batch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
----------------------------------------------------------------------
   1    |  5927   |   0.693370   |     -      |     -     |  4375.40 
----------------------------------------------------------------------
   1    |    -    |   0.693370   |  0.693240  |   0.61    |  4455.61 
----------------------------------------------------------------------


Training complete!
Start training...

 Epoch  |  Batch  |  Train Loss  |  Val Loss  |  Val Ac

KeyboardInterrupt: 
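
The training and validation losses reported above (0.693370 and 0.693240) sit very close to ln 2 ≈ 0.693147. That is the value binary cross-entropy takes when every logit is 0, meaning every label is predicted with probability 0.5. The sketch below checks this with a hand-written, numerically stable version of the per-element loss. `bce_with_logits` is an illustrative helper, not the PyTorch implementation.

```python
import math

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# With a zero logit the loss is ln 2 whether the target is 1 or 0.
zero_logit_loss = (bce_with_logits(0.0, 1.0) + bce_with_logits(0.0, 0.0)) / 2
print(round(zero_logit_loss, 6), round(math.log(2), 6))
```

One reading, offered as an interpretation rather than a confirmed diagnosis, is that the model in these runs was still producing near-zero logits and had not yet moved away from its initial state.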

In [21]:
file = "BERT-Att_model_.pth"
torch.save(bert_classifier,file)
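
`torch.save(bert_classifier, file)` pickles the whole model object. For resuming training, a common alternative is to save the `state_dict` of the model, optimizer and scheduler together. One likely reason the commented-out checkpoint loop earlier did not work (an inference from the code, not something the notebook states) is that `train` steps the global `optimizer`, which was built from `bert_classifier.parameters()`, while the loop trains the freshly loaded `bc`; `bc` would then never be updated, which would match the identical losses across runs. The sketch below uses a toy `nn.Linear` and a constant `LambdaLR` as stand-ins; `save_checkpoint` and `load_checkpoint` are hypothetical helpers.

```python
import os
import tempfile

import torch
import torch.nn as nn

def save_checkpoint(path, model, optimizer, scheduler, epoch):
    """Store everything needed to resume training in one file."""
    torch.save({
        "epoch": epoch,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
    }, path)

def load_checkpoint(path, model, optimizer, scheduler):
    """Restore state in place and return the epoch that was saved."""
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    scheduler.load_state_dict(state["scheduler"])
    return state["epoch"]

def build():
    """Toy stand-ins; the optimizer is created from this model's own parameters."""
    model = nn.Linear(4, 2)
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda step: 1.0)
    return model, optimizer, scheduler

model, optimizer, scheduler = build()
path = os.path.join(tempfile.mkdtemp(), "checkpoint.pth")
save_checkpoint(path, model, optimizer, scheduler, epoch=3)

# Resume: rebuild fresh objects, then load the saved state into them.
restored, restored_opt, restored_sched = build()
resumed_epoch = load_checkpoint(path, restored, restored_opt, restored_sched)
print(resumed_epoch, torch.equal(restored.weight, model.weight))
```

The key design point is that the optimizer and scheduler being restored are the ones attached to the restored model, so subsequent `optimizer.step()` calls update the model that is actually being trained.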

In [22]:
bert_classifier.attention.getAttentionWeights()

tensor([[[0.9993],
         [0.9992],
         [0.9994],
         ...,
         [0.9994],
         [0.9992],
         [0.9993]],

        [[0.9997],
         [0.9997],
         [0.9997],
         ...,
         [0.9997],
         [0.9996],
         [0.9997]],

        [[0.9992],
         [0.9989],
         [0.9992],
         ...,
         [0.9990],
         [0.9993],
         [0.9992]],

        [[0.9992],
         [0.9992],
         [0.9992],
         ...,
         [0.9993],
         [0.9993],
         [0.9993]]], device='cuda:0', grad_fn=<SigmoidBackward>)

In [24]:
torch.cuda.empty_cache()