https://colab.research.google.com/github/NadirEM/nlp-notebooks/blob/master/Fine_tune_ALBERT_sentence_pair_classification.ipynb#scrollTo=xtkcmIY9t6AF

## Installation of libraries and imports

In [1]:
!pip install transformers
!pip install datasets

Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.53.tar.gz (880 kB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacre

In [2]:
# Clone the dataset repository from github
!git clone https://github.com/leocomelli/score-freetext-answer.git

Cloning into 'score-freetext-answer'...
remote: Enumerating objects: 511, done.
remote: Total 511 (delta 0), reused 0 (delta 0), pack-reused 511
Receiving objects: 100% (511/511), 478.34 KiB | 17.08 MiB/s, done.
Resolving deltas: 100% (263/263), done.


In [3]:
import torch
import torch.nn as nn
import os
import matplotlib.pyplot as plt
import copy
import torch.optim as optim
import random
import numpy as np
import pandas as pd
import glob
import xml.etree.ElementTree as ET
from torch.utils.data import DataLoader, Dataset
from torch.cuda.amp import autocast, GradScaler
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModel, AdamW, get_linear_schedule_with_warmup
from datasets import load_dataset, load_metric

os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [4]:
# Check how much system and GPU memory is free (ideally the GPU shows 0% utilisation)
# Memory footprint support code from https://github.com/patrickvonplaten/notebooks/blob/master/PyTorch_Reformer.ipynb
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip -q install gputil
!pip -q install psutil
!pip -q install humanize
import psutil
import humanize
import os
import GPUtil as GPU
GPUs = GPU.getGPUs()
# XXX: only one GPU on Colab and isn’t guaranteed
gpu = GPUs[0]
def printm():
 process = psutil.Process(os.getpid())
 print("Gen RAM Free: " + humanize.naturalsize( psutil.virtual_memory().available ), " | Proc size: " + humanize.naturalsize( process.memory_info().rss))
 print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))
printm()

  Building wheel for gputil (setup.py) ... done
Gen RAM Free: 12.1 GB  | Proc size: 811.6 MB
GPU RAM Free: 15109MB | Used: 0MB | Util   0% | Total 15109MB



If GPU utilisation (Util) is not at 0%, uncomment and run the following line to kill all processes and reclaim the full GPU. Comment the line out again afterwards, otherwise the notebook will crash every time the cell runs.

In [None]:
# !kill -9 -1

## Loading the dataset

In [5]:
training_data_directory = '/content/score-freetext-answer/src/main/resources/corpus/semeval2013-task7/training/2way/sciEntsBank'
test_data_directory = '/content/score-freetext-answer/src/main/resources/corpus/semeval2013-task7/test/2way/sciEntsBank/test-unseen-questions'

In [6]:
def parse_xml_file(xml_file_path):

  question = ""
  ref = ""
  results = []

  for elem in ET.parse(xml_file_path).getroot():
    if elem.tag == 'questionText':
      question = elem.text
    for subelem in elem:
      if subelem.tag == 'referenceAnswer':
        ref = subelem.text
      else:
        results.append({
            'question': question,
            'ref': ref,
            'response': subelem.text,
            'score': subelem.attrib['accuracy']
        })

  return results
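
To make the traversal in `parse_xml_file` easier to follow, here is a minimal sketch that applies the same loop to an in-memory string. The XML below is a hand-written, hypothetical sample whose layout is inferred from the tags the function checks (`questionText`, `referenceAnswer`, an `accuracy` attribute); the real corpus files may carry more elements and attributes. `parse_xml_string` is an illustrative helper, not part of the notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, written by hand for illustration only.
SAMPLE_XML = """
<question id="Q1">
  <questionText>Why does ice float on water?</questionText>
  <referenceAnswers>
    <referenceAnswer id="R1">Ice is less dense than liquid water.</referenceAnswer>
  </referenceAnswers>
  <studentAnswers>
    <studentAnswer id="S1" accuracy="correct">Because ice is less dense.</studentAnswer>
    <studentAnswer id="S2" accuracy="incorrect">Because ice is cold.</studentAnswer>
  </studentAnswers>
</question>
"""


def parse_xml_string(xml_text):
    """Same traversal as parse_xml_file, but reading from a string instead of a path."""
    question, ref, results = "", "", []
    for elem in ET.fromstring(xml_text.strip()):
        if elem.tag == "questionText":
            question = elem.text
        for subelem in elem:
            if subelem.tag == "referenceAnswer":
                ref = subelem.text  # a later reference answer overwrites an earlier one
            else:
                results.append({
                    "question": question,
                    "ref": ref,
                    "response": subelem.text,
                    "score": subelem.attrib["accuracy"],
                })
    return results


rows = parse_xml_string(SAMPLE_XML)
for row in rows:
    print(row["score"], "|", row["response"])
```

Note that the loop keeps only the most recent `referenceAnswer` it has seen, so a question with several reference answers is paired with the last one.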

In [7]:
training_data = []
test_data = []
num_training_questions = 0
num_test_questions = 0

for data_file in glob.glob(training_data_directory + '/*'):
  training_data += parse_xml_file(data_file)
  num_training_questions += 1

for data_file in glob.glob(test_data_directory + '/*'):
  test_data += parse_xml_file(data_file)
  num_test_questions += 1

print("Number of Training Questions:", num_training_questions)
print("Number of Training Responses:", len(training_data))

print("Number of Test Questions:", num_test_questions)
print("Number of Test Responses:", len(test_data))

Number of Training Questions: 135
Number of Training Responses: 4969
Number of Test Questions: 15
Number of Test Responses: 733


In [8]:
class ShortAnswerGradingDataset(Dataset):
    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        # Note: I handle the parsing in the data loading from XML section
        # Returns a dict for each item with the following keys: 'question', 'ref', 'response', 'score' all of type 'str'
        return self.dataset[idx]

In [9]:
training_dataset = ShortAnswerGradingDataset(training_data)
test_dataset = ShortAnswerGradingDataset(test_data)

In [10]:
# for training_item in training_dataset:
#   print(training_item)

# for test_item in test_dataset:
#   print(test_item)
print(training_dataset[0])
print(len(training_dataset))
print(len(test_dataset))

{'question': "Why was the siren's sound designed to have this property?", 'ref': 'So that people notice the sound.', 'response': 'Because so it can tell people to get out of the way in case of a fire.', 'score': 'correct'}
4969
733


In [None]:
print(0 if 0==1 else 1)

1


In [11]:
# Pair the reference answer with the student answer to create the inputs for both the train and test sets
number_test = 100  # number of training examples held out as the validation set
train_ref = []
train_res = []
train_score = []
test_ref = []
test_res = []
test_score = []

for training_item in training_data:
  train_ref.append(training_item["ref"])
  train_res.append(training_item["response"])
  train_score.append(0 if training_item["score"]=='incorrect' else 1)


for test_item in test_dataset:
  test_ref.append(test_item["ref"])
  test_res.append(test_item["response"])
  test_score.append(0 if test_item["score"]=='incorrect' else 1)

train = {'idx': list(range(number_test,len( training_data))), 'label': train_score[number_test:], 'sentence1': train_ref[number_test:], 'sentence2': train_res[number_test:]}
valid = {'idx': list(range(number_test)), 'label': train_score[0:number_test], 'sentence1': train_ref[0:number_test], 'sentence2': train_res[0:number_test]}
test = {'idx': list(range(len(test_dataset))), 'label': test_score, 'sentence1': test_ref, 'sentence2': test_res}


# Transform data into pandas dataframes
df_train = pd.DataFrame(train)
df_valid = pd.DataFrame(valid)
df_test = pd.DataFrame(test)

In [12]:
print(df_train.shape)
print(df_valid.shape)
print(df_test.shape)

(4869, 4)
(100, 4)
(733, 4)


In [13]:
df_train.head()

Unnamed: 0,idx,label,sentence1,sentence2
0,100,0,You can see how much time is taken for the ear...,The information you get is that you do not see...
1,101,0,You can see how much time is taken for the ear...,"You cannot see in the maps, the time where the..."
2,102,0,You can see how much time is taken for the ear...,The time it happens.
3,103,0,You can see how much time is taken for the ear...,The information that you do get is what happen...
4,104,0,You can see how much time is taken for the ear...,The information I did not get from the stream ...


## Classes and functions

In [14]:
class CustomDataset(Dataset):

    def __init__(self, data, maxlen, with_labels=True, bert_model='bert-base-uncased'):

        self.data = data  # pandas dataframe
        #Initialize the tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(bert_model)  

        self.maxlen = maxlen
        self.with_labels = with_labels 

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):

        # Selecting sentence1 and sentence2 at the specified index in the data frame
        sent1 = str(self.data.loc[index, 'sentence1'])
        sent2 = str(self.data.loc[index, 'sentence2'])

        # Tokenize the pair of sentences to get token ids, attention masks and token type ids
        encoded_pair = self.tokenizer(sent1, sent2, 
                                      padding='max_length',  # Pad to max_length
                                      truncation=True,  # Truncate to max_length
                                      max_length=self.maxlen,  
                                      return_tensors='pt')  # Return torch.Tensor objects
        
        token_ids = encoded_pair['input_ids'].squeeze(0)  # tensor of token ids
        attn_masks = encoded_pair['attention_mask'].squeeze(0)  # binary tensor with "0" for padded values and "1" for the other values
        token_type_ids = encoded_pair['token_type_ids'].squeeze(0)  # binary tensor with "0" for the 1st sentence tokens & "1" for the 2nd sentence tokens

        if self.with_labels:  # True if the dataset has labels
            label = self.data.loc[index, 'label']
            return token_ids, attn_masks, token_type_ids, label  
        else:
            return token_ids, attn_masks, token_type_ids
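
The three tensors returned by `__getitem__` can be pictured with a toy layout. The sketch below is plain Python and uses no real vocabulary: `encode_pair` is a hypothetical helper that works on whitespace tokens and does not implement truncation, so it only illustrates where the special tokens, the padding, the attention mask and the segment ids go in a BERT-style sentence pair.

```python
def encode_pair(tokens_a, tokens_b, max_length, pad="[PAD]"):
    """Toy layout of a BERT-style sentence pair (no vocabulary lookup, no truncation)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_length:
        raise ValueError("this toy example does not implement truncation")
    # Segment ids: 0 for [CLS], sentence 1 and its [SEP]; 1 for sentence 2 and the final [SEP]
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)  # 1 for real tokens
    n_pad = max_length - len(tokens)
    return (
        tokens + [pad] * n_pad,
        attention_mask + [0] * n_pad,   # 0 for padding
        token_type_ids + [0] * n_pad,   # padding positions get segment id 0
    )


tokens, attn, types = encode_pair(
    "so people notice".split(), "it is loud".split(), max_length=12
)
print(tokens)
print(attn)
print(types)
```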

In [15]:
class SentencePairClassifier(nn.Module):

    def __init__(self, bert_model="bert-base-uncased", freeze_bert=False):
        super(SentencePairClassifier, self).__init__()
        #  Instantiating BERT-based model object
        self.bert_layer = AutoModel.from_pretrained(bert_model)

        #  Set the hidden-state size of the encoder outputs for the known checkpoints
        if bert_model == "albert-base-v2":  # 12M parameters
            hidden_size = 768
        elif bert_model == "albert-large-v2":  # 18M parameters
            hidden_size = 1024
        elif bert_model == "albert-xlarge-v2":  # 60M parameters
            hidden_size = 2048
        elif bert_model == "albert-xxlarge-v2":  # 235M parameters
            hidden_size = 4096
        elif bert_model == "bert-base-uncased": # 110M parameters
            hidden_size = 768
        elif bert_model == 'allenai/scibert_scivocab_uncased':
            hidden_size = 768
        else:  # any other checkpoint: read the size from the loaded model's config
            hidden_size = self.bert_layer.config.hidden_size

        # Freeze bert layers and only train the classification layer weights
        if freeze_bert:
            for p in self.bert_layer.parameters():
                p.requires_grad = False

        # Classification layer
        self.cls_layer = nn.Linear(hidden_size, 1)

        self.dropout = nn.Dropout(p=0.1)

    @autocast()  # run in mixed precision
    def forward(self, input_ids, attn_masks, token_type_ids):
        '''
        Inputs:
            -input_ids : Tensor  containing token ids
            -attn_masks : Tensor containing attention masks to be used to focus on non-padded values
            -token_type_ids : Tensor containing token type ids to be used to identify sentence1 and sentence2
        '''

        # Feeding the inputs to the BERT-based model to obtain contextualized representations
        model_output = self.bert_layer(input_ids, attn_masks, token_type_ids)

        # Feed the classifier the pooled output: the last-layer hidden state of the [CLS] token passed through
        # a Linear layer and a Tanh activation. That Linear layer's weights were trained with the sentence order
        # prediction (ALBERT) or next sentence prediction (BERT) objective during pre-training.

        logits = self.cls_layer(self.dropout(model_output.pooler_output))

        return logits

In [16]:
def set_seed(seed):
    """ Set all seeds to make results reproducible """
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(seed)
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    

def evaluate_loss(net, device, criterion, dataloader):
    net.eval()

    mean_loss = 0
    count = 0

    with torch.no_grad():
        for it, (seq, attn_masks, token_type_ids, labels) in enumerate(tqdm(dataloader)):
            seq, attn_masks, token_type_ids, labels = \
                seq.to(device), attn_masks.to(device), token_type_ids.to(device), labels.to(device)
            logits = net(seq, attn_masks, token_type_ids)
            mean_loss += criterion(logits.squeeze(-1), labels.float()).item()
            count += 1

    return mean_loss / count
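
As a reference point for reading the validation losses, the sketch below computes binary cross-entropy on raw logits in plain Python, using the standard numerically stable form. `bce_with_logits` and `mean_bce` are illustrative names, not functions from the notebook. A logit of 0 (probability 0.5) gives a loss of ln 2 ≈ 0.693 regardless of the label, so a mean loss near that value suggests predictions that carry little information.

```python
import math


def bce_with_logits(logit, label):
    """Numerically stable binary cross-entropy on a raw logit (label is 0 or 1)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def mean_bce(logits, labels):
    """Mean loss over a batch, the reduction nn.BCEWithLogitsLoss uses by default."""
    return sum(bce_with_logits(x, y) for x, y in zip(logits, labels)) / len(logits)


uninformative = mean_bce([0.0, 0.0], [0, 1])   # equals ln 2
confident = mean_bce([4.0, -4.0], [1, 0])      # confident and correct: close to 0
print(uninformative, confident)
```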

In [17]:
print("Creation of the models' folder...")
!mkdir models

Creation of the models' folder...


Reference for mixed precision training, gradient scaling and gradient accumulation: https://pytorch.org/docs/stable/notes/amp_examples.html#amp-examples

To learn more about training neural nets with larger batches, I suggest reading this post by Thomas Wolf:
https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
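
Why the loss is divided by `iters_to_accumulate` before `backward()` can be checked on a toy problem. The sketch below is plain Python with an assumed one-parameter model `y_hat = w * x` and a mean squared error loss (neither comes from the notebook): summing the gradients of the scaled micro-batch losses gives the same gradient as one pass over the full batch, provided the micro-batches have equal size.

```python
import math


def grad_mse(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over a batch, for the toy model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

# Gradient of the mean loss over the whole batch at once
full_batch_grad = grad_mse(w, xs, ys)

# Same batch processed as equally sized micro-batches with accumulation
iters_to_accumulate = 2
micro = len(xs) // iters_to_accumulate
accumulated_grad = 0.0
for i in range(iters_to_accumulate):
    chunk_x = xs[i * micro:(i + 1) * micro]
    chunk_y = ys[i * micro:(i + 1) * micro]
    # Dividing by iters_to_accumulate mirrors `loss = loss / iters_to_accumulate`
    accumulated_grad += grad_mse(w, chunk_x, chunk_y) / iters_to_accumulate

print(full_batch_grad, accumulated_grad)
```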

In [18]:
def train_bert(net, criterion, opti, lr, lr_scheduler, train_loader, val_loader, epochs, iters_to_accumulate):

    best_loss = np.inf
    best_ep = 1
    nb_iterations = len(train_loader)
    print_every = max(1, nb_iterations // 5)  # print the training loss 5 times per epoch (max() avoids a zero step on tiny loaders)
    iters = []
    train_losses = []
    val_losses = []

    scaler = GradScaler()

    for ep in range(epochs):

        net.train()
        running_loss = 0.0
        for it, (seq, attn_masks, token_type_ids, labels) in enumerate(tqdm(train_loader)):

            # Converting to cuda tensors
            seq, attn_masks, token_type_ids, labels = \
                seq.to(device), attn_masks.to(device), token_type_ids.to(device), labels.to(device)
    
            # Enables autocasting for the forward pass (model + loss)
            with autocast():
                # Obtaining the logits from the model
                logits = net(seq, attn_masks, token_type_ids)

                # Computing loss
                loss = criterion(logits.squeeze(-1), labels.float())
                loss = loss / iters_to_accumulate  # Normalize the loss because it is averaged

            # Backpropagating the gradients
            # Scales loss.  Calls backward() on scaled loss to create scaled gradients.
            scaler.scale(loss).backward()

            if (it + 1) % iters_to_accumulate == 0:
                # Optimization step
                # scaler.step() first unscales the gradients of the optimizer's assigned params.
                # If these gradients do not contain infs or NaNs, opti.step() is then called,
                # otherwise, opti.step() is skipped.
                scaler.step(opti)
                # Updates the scale for next iteration.
                scaler.update()
                # Adjust the learning rate based on the number of iterations.
                lr_scheduler.step()
                # Clear gradients
                opti.zero_grad()


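            # loss was divided by iters_to_accumulate above, so the value logged below
            # is the per-batch loss scaled by 1 / iters_to_accumulate
            running_loss += loss.item()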

            if (it + 1) % print_every == 0:  # Print training loss information
                print()
                print("Iteration {}/{} of epoch {} complete. Loss : {} "
                      .format(it+1, nb_iterations, ep+1, running_loss / print_every))

                running_loss = 0.0


        val_loss = evaluate_loss(net, device, criterion, val_loader)  # Compute validation loss
        print()
        print("Epoch {} complete! Validation Loss : {}".format(ep+1, val_loss))

        if val_loss < best_loss:
            print("Best validation loss improved from {} to {}".format(best_loss, val_loss))
            print()
            net_copy = copy.deepcopy(net)  # save a copy of the model
            best_loss = val_loss
            best_ep = ep + 1

    # Saving the model
    path_to_model='models/{}_lr_{}_val_loss_{}_ep_{}.pt'.format(bert_model.replace('/', '_'), lr, round(best_loss, 5), best_ep)
    torch.save(net_copy.state_dict(), path_to_model)
    print("The model has been saved in {}".format(path_to_model))

    del loss
    torch.cuda.empty_cache()

## Parameters

In [91]:
bert_model = "bert-base-uncased"  # 'albert-base-v2', 'albert-large-v2', 'albert-xlarge-v2', 'albert-xxlarge-v2', 'bert-base-uncased', ...
freeze_bert = True  # if True, freeze the encoder weights and only update the classification layer weights
maxlen = 128  # maximum length of the tokenized input sentence pair: longer inputs are truncated, shorter inputs are padded
bs = 20  # batch size
iters_to_accumulate = 2  # the gradient accumulation adds gradients over an effective batch of size : bs * iters_to_accumulate. If set to "1", you get the usual batch size
lr = 5e-4  # learning rate
epochs = 6  # number of training epochs
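
The interaction between batch size, gradient accumulation and the number of optimizer updates can be worked out by hand. The sketch below assumes the 4869 training rows reported by `df_train.shape` and a `DataLoader` that keeps the last, smaller batch (the default); the variable names other than the notebook's own parameters are illustrative.

```python
import math

num_train_examples = 4869  # rows in df_train
bs = 20
iters_to_accumulate = 2
epochs = 6

# Number of batches per epoch, i.e. len(train_loader) when the last partial batch is kept
batches_per_epoch = math.ceil(num_train_examples / bs)
# Number of examples contributing to each optimizer update
effective_batch_size = bs * iters_to_accumulate
# The optimizer steps once every iters_to_accumulate batches
optimizer_steps_per_epoch = batches_per_epoch // iters_to_accumulate
total_optimizer_steps = optimizer_steps_per_epoch * epochs

print(batches_per_epoch, effective_batch_size, optimizer_steps_per_epoch, total_optimizer_steps)
```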

## Training and validation

Link for the AdamW optimizer and the learning rate scheduler :
https://huggingface.co/transformers/main_classes/optimizer_schedules.html
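
To give a feel for the schedule, here is a plain-Python sketch of the shape a linear schedule with warmup is documented to follow: a linear ramp from 0 to the base learning rate over the warmup steps, then a linear decay to 0 at the last training step. `linear_schedule_multiplier` is a hypothetical helper written for illustration, not the library's code, and the total of 732 steps is an assumption derived from 244 batches per epoch, 2 accumulation iterations and 6 epochs.

```python
def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup phase: ramp up linearly from 0 to 1
        return step / max(1, num_warmup_steps)
    # Decay phase: go down linearly from 1 to 0, never below 0
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 5e-4
total_steps = 732  # assumed: (244 // 2) * 6
lrs = {step: base_lr * linear_schedule_multiplier(step, 0, total_steps)
       for step in (0, 366, 732)}
print(lrs)
```

With `num_warmup_steps = 0`, as in this notebook, training starts at the full learning rate and decays from the first update.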

In [92]:
#  Set all seeds to make reproducible results
set_seed(1)

# Creating instances of training and validation set
print("Reading training data...")
train_set = CustomDataset(df_train, maxlen, bert_model=bert_model)  # pass bert_model by keyword: the third positional argument is with_labels
print("Reading validation data...")
val_set = CustomDataset(df_valid, maxlen, bert_model=bert_model)
# Creating instances of training and validation dataloaders
train_loader = DataLoader(train_set, batch_size=bs, num_workers=5)
val_loader = DataLoader(val_set, batch_size=bs, num_workers=5)


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
net = SentencePairClassifier(bert_model, freeze_bert=freeze_bert)

if torch.cuda.device_count() > 1:  # if multiple GPUs
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    net = nn.DataParallel(net)

net.to(device)

criterion = nn.BCEWithLogitsLoss()
opti = AdamW(net.parameters(), lr=lr, weight_decay=1e-2)
num_warmup_steps = 0 # The number of steps for the warmup phase.
num_training_steps = epochs * len(train_loader)  # The total number of training steps
t_total = (len(train_loader) // iters_to_accumulate) * epochs  # Necessary to take into account Gradient accumulation
lr_scheduler = get_linear_schedule_with_warmup(optimizer=opti, num_warmup_steps=num_warmup_steps, num_training_steps=t_total)

train_bert(net, criterion, opti, lr, lr_scheduler, train_loader, val_loader, epochs, iters_to_accumulate)

Reading training data...
Reading validation data...




Downloading:   0%|          | 0.00/420M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
 20%|██        | 50/244 [00:02<00:09, 20.53it/s]


Iteration 48/244 of epoch 1 complete. Loss : 0.3522648327052593 


 40%|████      | 98/244 [00:05<00:07, 20.45it/s]


Iteration 96/244 of epoch 1 complete. Loss : 0.32805893911669654 


 60%|█████▉    | 146/244 [00:07<00:04, 20.30it/s]


Iteration 144/244 of epoch 1 complete. Loss : 0.33627312382062274 


 80%|███████▉  | 194/244 [00:09<00:02, 20.69it/s]


Iteration 192/244 of epoch 1 complete. Loss : 0.36195255691806477 


 99%|█████████▉| 242/244 [00:12<00:00, 20.45it/s]


Iteration 240/244 of epoch 1 complete. Loss : 0.33573864959180355 


100%|██████████| 244/244 [00:12<00:00, 19.55it/s]
100%|██████████| 5/5 [00:00<00:00,  8.05it/s]



Epoch 1 complete! Validation Loss : 0.6959778010845185
Best validation loss improved from inf to 0.6959778010845185



 21%|██        | 51/244 [00:02<00:09, 20.34it/s]


Iteration 48/244 of epoch 2 complete. Loss : 0.34241660746435326 


 41%|████      | 99/244 [00:05<00:07, 20.45it/s]


Iteration 96/244 of epoch 2 complete. Loss : 0.31508357139925164 


 60%|██████    | 147/244 [00:07<00:04, 20.29it/s]


Iteration 144/244 of epoch 2 complete. Loss : 0.32979377017666894 


 80%|███████▉  | 195/244 [00:09<00:02, 20.17it/s]


Iteration 192/244 of epoch 2 complete. Loss : 0.34421926115949947 


100%|█████████▉| 243/244 [00:12<00:00, 20.22it/s]


Iteration 240/244 of epoch 2 complete. Loss : 0.3306786175817251 


100%|██████████| 244/244 [00:12<00:00, 19.62it/s]
100%|██████████| 5/5 [00:00<00:00,  7.75it/s]



Epoch 2 complete! Validation Loss : 0.6882092356681824
Best validation loss improved from 0.6959778010845185 to 0.6882092356681824



 20%|██        | 50/244 [00:02<00:09, 19.78it/s]


Iteration 48/244 of epoch 3 complete. Loss : 0.33696525481839973 


 41%|████      | 100/244 [00:05<00:07, 19.94it/s]


Iteration 96/244 of epoch 3 complete. Loss : 0.3150783705835541 


 61%|██████    | 148/244 [00:07<00:04, 19.92it/s]


Iteration 144/244 of epoch 3 complete. Loss : 0.32569303270429373 


 80%|███████▉  | 195/244 [00:10<00:02, 19.43it/s]


Iteration 192/244 of epoch 3 complete. Loss : 0.33722567403068143 


100%|█████████▉| 243/244 [00:12<00:00, 19.80it/s]


Iteration 240/244 of epoch 3 complete. Loss : 0.329235197044909 


100%|██████████| 244/244 [00:12<00:00, 19.12it/s]
100%|██████████| 5/5 [00:00<00:00,  7.74it/s]



Epoch 3 complete! Validation Loss : 0.669543468952179
Best validation loss improved from 0.6882092356681824 to 0.669543468952179



 21%|██        | 51/244 [00:03<00:10, 18.72it/s]


Iteration 48/244 of epoch 4 complete. Loss : 0.3303165103619297 


 40%|████      | 98/244 [00:05<00:07, 19.70it/s]


Iteration 96/244 of epoch 4 complete. Loss : 0.3152905385941267 


 60%|█████▉    | 146/244 [00:08<00:04, 19.68it/s]


Iteration 144/244 of epoch 4 complete. Loss : 0.3209957579771678 


 80%|███████▉  | 195/244 [00:10<00:02, 19.81it/s]


Iteration 192/244 of epoch 4 complete. Loss : 0.3282302375882864 


100%|█████████▉| 243/244 [00:13<00:00, 19.64it/s]


Iteration 240/244 of epoch 4 complete. Loss : 0.3278226175655921 


100%|██████████| 244/244 [00:13<00:00, 18.01it/s]
100%|██████████| 5/5 [00:00<00:00,  8.03it/s]



Epoch 4 complete! Validation Loss : 0.6546350061893463
Best validation loss improved from 0.669543468952179 to 0.6546350061893463



 21%|██        | 51/244 [00:02<00:09, 19.70it/s]


Iteration 48/244 of epoch 5 complete. Loss : 0.3300441578030586 


 41%|████      | 99/244 [00:05<00:07, 19.45it/s]


Iteration 96/244 of epoch 5 complete. Loss : 0.3162546247864763 


 60%|██████    | 147/244 [00:07<00:05, 19.05it/s]


Iteration 144/244 of epoch 5 complete. Loss : 0.31200413902600604 


 80%|███████▉  | 195/244 [00:10<00:02, 19.37it/s]


Iteration 192/244 of epoch 5 complete. Loss : 0.32578602029631537 


100%|█████████▉| 243/244 [00:12<00:00, 19.38it/s]


Iteration 240/244 of epoch 5 complete. Loss : 0.3277521611501773 


100%|██████████| 244/244 [00:13<00:00, 18.74it/s]
100%|██████████| 5/5 [00:00<00:00,  7.78it/s]



Epoch 5 complete! Validation Loss : 0.6493316769599915
Best validation loss improved from 0.6546350061893463 to 0.6493316769599915



 20%|██        | 50/244 [00:02<00:09, 19.45it/s]


Iteration 48/244 of epoch 6 complete. Loss : 0.32633860409259796 


 40%|████      | 98/244 [00:05<00:07, 19.34it/s]


Iteration 96/244 of epoch 6 complete. Loss : 0.321514551838239 


 60%|█████▉    | 146/244 [00:07<00:05, 19.50it/s]


Iteration 144/244 of epoch 6 complete. Loss : 0.3086075863490502 


 80%|███████▉  | 194/244 [00:10<00:02, 19.52it/s]


Iteration 192/244 of epoch 6 complete. Loss : 0.32162291401376325 


 99%|█████████▉| 242/244 [00:12<00:00, 19.49it/s]


Iteration 240/244 of epoch 6 complete. Loss : 0.32918917542944354 


100%|██████████| 244/244 [00:13<00:00, 18.76it/s]
100%|██████████| 5/5 [00:00<00:00,  7.58it/s]



Epoch 6 complete! Validation Loss : 0.6960294842720032
The model has been saved in models/bert-base-uncased_lr_0.0005_val_loss_0.64933_ep_5.pt


You can download the model saved in the "models" folder by browsing the files on the left of the Colab notebook.

In [None]:
# If you encounter a CUDA out of memory error:
# - uncomment the kill command, run it, then comment it out again
# - reduce the batch size
# - then run all cells from the beginning

# !kill -9 -1

## Prediction

In [21]:
print("Creation of the results' folder...")
!mkdir results

Creation of the results' folder...


In [22]:
def get_probs_from_logits(logits):
    """
    Converts a tensor of logits into an array of probabilities by applying the sigmoid function
    """
    probs = torch.sigmoid(logits.unsqueeze(-1))
    return probs.detach().cpu().numpy()

def test_prediction(net, device, dataloader, with_labels=True, result_file="results/output.txt"):
    """
    Predict the probabilities on a dataset with or without labels and print the result in a file
    """
    net.eval()
    w = open(result_file, 'w')
    probs_all = []

    with torch.no_grad():
        if with_labels:
            for seq, attn_masks, token_type_ids, _ in tqdm(dataloader):
                seq, attn_masks, token_type_ids = seq.to(device), attn_masks.to(device), token_type_ids.to(device)
                logits = net(seq, attn_masks, token_type_ids)
                probs = get_probs_from_logits(logits.squeeze(-1)).squeeze(-1)
                probs_all += probs.tolist()
        else:
            for seq, attn_masks, token_type_ids in tqdm(dataloader):
                seq, attn_masks, token_type_ids = seq.to(device), attn_masks.to(device), token_type_ids.to(device)
                logits = net(seq, attn_masks, token_type_ids)
                probs = get_probs_from_logits(logits.squeeze(-1)).squeeze(-1)
                probs_all += probs.tolist()

    w.writelines(str(prob)+'\n' for prob in probs_all)
    w.close()
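
The conversion from logits to probabilities, and later to labels, can be sketched without PyTorch. The helpers below (`sigmoid`, `predict_labels`) are illustrative names, not functions from the notebook; they show that thresholding the probability at 0.5 is the same as checking whether the logit is non-negative.

```python
import math


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_labels(logits, threshold=0.5):
    """Turn raw logits into probabilities and hard 0/1 predictions."""
    probs = [sigmoid(x) for x in logits]
    preds = [int(p >= threshold) for p in probs]
    return probs, preds


probs, preds = predict_labels([-1.2, 0.0, 0.25])
print(probs, preds)
```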

Below is a link to a pre-trained ALBERT model (45 MB) so you can reproduce my results on the MRPC validation set (**91.19** F1 score and **87.5** accuracy). It is provided as a fallback: if all the code runs as expected, the trained model should already be in the *models* folder after training.

You can download it and upload it (~ 3 minutes) to the *models* folder by browsing the files on the left of the Colab notebook:

https://drive.google.com/file/d/1AcRLGvALAH3BVSiDVjY_b8CggJgVfksp/view?usp=sharing

In [93]:
path_to_model = '/content/models/bert-base-uncased_lr_0.0005_val_loss_0.64933_ep_5.pt'  
# path_to_model = '/content/models/...'  # You can add here your trained model

path_to_output_file = 'results/output.txt'

print("Reading test data...")
test_set = CustomDataset(df_test, maxlen, bert_model=bert_model)  # pass bert_model by keyword: the third positional argument is with_labels
test_loader = DataLoader(test_set, batch_size=bs, num_workers=5)

model = SentencePairClassifier(bert_model)
if torch.cuda.device_count() > 1:  # if multiple GPUs
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model = nn.DataParallel(model)

print()
print("Loading the weights of the model...")
model.load_state_dict(torch.load(path_to_model))
model.to(device)

print("Predicting on test data...")
test_prediction(net=model, device=device, dataloader=test_loader, with_labels=True,  # set the with_labels parameter to False if you want to get predictions on a dataset without labels
                result_file=path_to_output_file)
print()
print("Predictions are available in : {}".format(path_to_output_file))

Reading test data...


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Loading the weights of the model...
Predicting on test data...


100%|██████████| 37/37 [00:02<00:00, 16.47it/s]


Predictions are available in : results/output.txt





You can download the predictions saved in the "results" folder by browsing the files on the left of the Colab notebook.

## Evaluation

In [94]:
path_to_output_file = 'results/output.txt'  # path to the file with prediction probabilities

labels_test = df_test['label']  # true labels

probs_test = pd.read_csv(path_to_output_file, header=None)[0]  # prediction probabilities

threshold = 0.5   # you can adjust this threshold for your own dataset
preds_test=(probs_test>=threshold).astype('uint8') # predicted labels using the above fixed threshold

metric = load_metric("glue", "mrpc")

In [97]:
# Compute the accuracy and F1 scores
metric.compute(predictions=preds_test.tolist(), references=labels_test.tolist())  # use the public compute() method on plain Python lists

{'accuracy': 0.5552523874488404, 'f1': 0.5302593659942364}
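
For readers who want to see what these two numbers measure, here is a minimal plain-Python sketch of accuracy and binary F1 with class 1 treated as the positive class. `accuracy_and_f1` is a hypothetical helper for illustration and the five predictions are made up; it is not the metric implementation used above.

```python
def accuracy_and_f1(predictions, references, positive=1):
    """Accuracy and binary F1 (harmonic mean of precision and recall)."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)  # true positives
    fp = sum(p == positive and r != positive for p, r in pairs)  # false positives
    fn = sum(p != positive and r == positive for p, r in pairs)  # false negatives
    correct = sum(p == r for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Made-up predictions and labels, for illustration only
scores = accuracy_and_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(scores)
```

Because F1 ignores true negatives, it can sit well below accuracy when the model rarely predicts the positive class, and above it when the model over-predicts that class.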

In [46]:
for i in range(len(preds_test)):
  if preds_test[i] != labels_test[i]:
    print(i)
    print("Ground Truth: ", labels_test[i])
    print("Test prediction probability: ",probs_test[i])


1
Ground Truth:  0
Test prediction probability:  0.5625
4
Ground Truth:  1
Test prediction probability:  0.350341796875
6
Ground Truth:  1
Test prediction probability:  0.498779296875
7
Ground Truth:  0
Test prediction probability:  0.52685546875
10
Ground Truth:  0
Test prediction probability:  0.53759765625
11
Ground Truth:  0
Test prediction probability:  0.52783203125
14
Ground Truth:  0
Test prediction probability:  0.57373046875
24
Ground Truth:  0
Test prediction probability:  0.5576171875
25
Ground Truth:  0
Test prediction probability:  0.5302734375
26
Ground Truth:  1
Test prediction probability:  0.41845703125
27
Ground Truth:  0
Test prediction probability:  0.564453125
28
Ground Truth:  0
Test prediction probability:  0.52587890625
33
Ground Truth:  0
Test prediction probability:  0.5322265625
45
Ground Truth:  1
Test prediction probability:  0.48583984375
48
Ground Truth:  1
Test prediction probability:  0.43310546875
50
Ground Truth:  1
Test prediction probability:  0.45

In [None]:
### SciBERT

# n = 280 false positive
# n = 425
# n = 281
# n = 171
# n = 564 #false positive
# n = 540 
# n = 565
n = 414

# n =663 false negative
# n = 267 false negative
# n = 575
# n  =585
# n = 684
# n = 125
# n = 336

print(test_data[n]['question'])
print(test_data[n]['score'])
print(df_test['label'].loc[n])
print(probs_test[n])
print(df_test['sentence1'].loc[n])
print(df_test['sentence2'].loc[n])

In [98]:
for i in range(len(preds_test)):
  if preds_test[i] != labels_test[i]:
    print(i)
    print("Ground Truth: ", labels_test[i])
    print("Test prediction probability: ",probs_test[i])


0
Ground Truth:  0
Test prediction probability:  0.5048828125
1
Ground Truth:  0
Test prediction probability:  0.5400390625
2
Ground Truth:  0
Test prediction probability:  0.5478515625
3
Ground Truth:  0
Test prediction probability:  0.52685546875
5
Ground Truth:  0
Test prediction probability:  0.52294921875
8
Ground Truth:  0
Test prediction probability:  0.53125
9
Ground Truth:  0
Test prediction probability:  0.5146484375
10
Ground Truth:  0
Test prediction probability:  0.548828125
11
Ground Truth:  0
Test prediction probability:  0.544921875
14
Ground Truth:  0
Test prediction probability:  0.54443359375
15
Ground Truth:  0
Test prediction probability:  0.5078125
17
Ground Truth:  0
Test prediction probability:  0.50439453125
18
Ground Truth:  0
Test prediction probability:  0.5361328125
19
Ground Truth:  0
Test prediction probability:  0.5185546875
21
Ground Truth:  0
Test prediction probability:  0.51708984375
22
Ground Truth:  0
Test prediction probability:  0.53466796875
23


In [120]:
### BERT

# n = 406  #false positive
# n = 555
# n = 559
# n = 570
# n = 140
# n = 149
# n = 223
# n =424
# n = 425
n = 534

# n = 291  # false negative
# n = 336
# n = 575
# n = 593
# n = 421
# n = 404
# n = 279
# n = 267
# n = 241

print(test_data[n]['question'])
print(test_data[n]['score'])
print(df_test['label'].loc[n])
print(probs_test[n])
print(df_test['sentence1'].loc[n])
print(df_test['sentence2'].loc[n])

What is the evidence in this experiment that OJ1 or OJ2 had more acid?
incorrect
0
0.572265625
The syringe for OJ1 rose higher because OJ1 produced more gas, so OJ1 has more acid than OJ2.
The evidence is that OJ1 has a higher dot.


In [None]:
# Result -----

# bert_model = "bert-base-uncased" 
# freeze_bert = False 
# maxlen = 128 
# bs = 20
# iters_to_accumulate = 2 
# lr = 5e-4  
# epochs = 6
# test dataset: Unseen questions
# accuracy = 0.6125
# f1 = 0.413

# Result -----

# bert_model = "bert-base-uncased" 
# freeze_bert = False 
# maxlen = 128 
# bs = 20
# iters_to_accumulate = 2 
# lr = 5e-4  
# epochs = 10
# test dataset: Unseen questions
# accuracy = 0.592
# f1 = 0.4432


# bert_model = "bert-base-uncased" 
# freeze_bert = True 
# maxlen = 128 
# bs = 20
# iters_to_accumulate = 2 
# lr = 5e-4  
# epochs = 10
# test dataset: Unseen questions
# accuracy = 0.585
# f1 = 0.4411

# bert_model = "allenai/scibert_scivocab_uncased" 
# freeze_bert = True
# maxlen = 128 
# bs = 20
# iters_to_accumulate = 2 
# lr = 5e-4  
# epochs = 6
# test dataset: Unseen questions
# accuracy = 0.618
# f1 = 0.4696