# Fine-tuning BERT for the NSP task with the Cranfield set
Generic Next Sentence Prediction (NSP) training, inspired by [this post](https://towardsdatascience.com/how-to-fine-tune-bert-with-nsp-8b5615468e12) by James Briggs.

The following notebook creates an NSP training set and fine-tunes a pre-trained BERT model on the NSP task; finally, it saves the model in the Models dir.

NOTE: out of the various options, one has to be chosen. Running all the cells of the notebook will cause the set to reflect the last option.
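
One way to make that choice explicit is a single switch at the top of the notebook. The sketch below is only an illustration: `OPTION`, `VALID_OPTIONS` and `check_option` are hypothetical names that do not exist in the notebook, and the option labels are the ones used in the section headings further down.

```python
# Hypothetical switch (not part of the original notebook):
# pick exactly one pair-building option per run.
OPTION = "T-all"

# Option labels taken from the section headings of this notebook.
VALID_OPTIONS = {"A", "S", "N", "T-rand", "T-all"}


def check_option(option):
    """Return the option unchanged, or raise if it is not a known label."""
    if option not in VALID_OPTIONS:
        raise ValueError(
            f"Unknown option {option!r}; choose one of {sorted(VALID_OPTIONS)}"
        )
    return option


selected = check_option(OPTION)
print("Building the training set with option:", selected)
```

Each option cell could then start with a guard such as `if selected == "A":`, so that running all cells builds only the requested set.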

In [4]:
%cd /content/drive/MyDrive/COMPUTING SCIENCE/THESIS_PROJECT/BERT-BM25-Thesis-Project/bert-meets-cranfield-enrich/Code

/content/drive/MyDrive/COMPUTING SCIENCE/THESIS_PROJECT/BERT-BM25-Thesis-Project/bert-meets-cranfield-enrich/Code


# Dependencies & Prerequisites

In [5]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.10.2-py3-none-any.whl (2.8 MB)
Collecting huggingface-hub>=0.0.12
  Downloading huggingface_hub-0.0.17-py3-none-any.whl (52 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Installing collected packages: tokenizers, sacremoses, pyyaml, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 

In [6]:
#@title Import { display-mode: "form" }

from transformers import BertTokenizer, BertForNextSentencePrediction, AdamW
import torch

import data_utils
import random
from tqdm import tqdm, tqdm_notebook
import numpy as np


## Make it fast
Set the GPU! Be careful not to overdo it: Colab's usage policy is quite strict. The notebook also runs on the CPU; inference using a loaded model and parameters will be slow, but it is workable.

A solution to overcome this is to make use of a service such as Google Cloud Platform (GCP).

In [7]:
if not torch.cuda.is_available():
  print('WARNING: GPU device not found.')
else:
  print('SUCCESS: Found GPU: {}'.format(torch.cuda.get_device_name()))
# setting device on GPU if available, else CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')



In [8]:
#@title Prerequisites

MAX_LENGTH = 128
BATCH_SIZE = 16
LEARNING_RATE = 2e-5
EPOCHS = 1
DO_TRAINING = True
SEED = 987
DO_SAVING = True
SAVE_FULL_MODEL = False

print('MAX_LENGTH:    ', MAX_LENGTH)
print('BATCH_SIZE:    ', BATCH_SIZE)
print('LEARNING_RATE: ', LEARNING_RATE)
print('EPOCHS:        ', EPOCHS)
print('DO_TRAINING:   ', DO_TRAINING  )
print('SAVE FULL:     ', SAVE_FULL_MODEL  )


MAX_LENGTH:     128
BATCH_SIZE:     16
LEARNING_RATE:  2e-05
EPOCHS:         1
DO_TRAINING:    True
SAVE FULL:      False


In [9]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForNextSentencePrediction: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


# Prepare the data

## Get the Data

In [10]:
def get_corpus(location):
    with open(location, 'r') as document_file:
        all_sentences = document_file.read().split('.I ')[1:]
        # make sure the dots are included
        return [" ".join(als.split('.W\n')[1].replace('\n', ' ').split()) for als in all_sentences]
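
To see what `get_corpus` does without needing the data file, here is a small sketch that applies the same splitting steps to an in-memory string. The sample text is made up and only assumes the field markers `.I`, `.T`, `.A`, `.B` and `.W`; `parse_cranfield` is a hypothetical helper name.

```python
def parse_cranfield(text):
    """Return one whitespace-normalised body per document in a Cranfield-style dump."""
    documents = text.split('.I ')[1:]      # drop whatever precedes the first '.I '
    bodies = []
    for doc in documents:
        body = doc.split('.W\n')[1]        # keep only the text after the '.W' marker
        bodies.append(" ".join(body.replace('\n', ' ').split()))
    return bodies


# Made-up two-document sample in the assumed field layout
sample = (
    ".I 1\n.T\nfirst title .\n.A\nsomeone\n.B\nsomewhere\n"
    ".W\nfirst title .\n  body one .\n"
    ".I 2\n.T\nsecond title .\n.A\nsomeone else\n.B\nelsewhere\n"
    ".W\nsecond title .\nbody two .\n"
)
print(parse_cranfield(sample))
# ['first title . body one .', 'second title . body two .']
```

Note that only the `.W` body is kept; line breaks and repeated spaces are collapsed into single spaces.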

In [11]:
# import the corpus using the modified data utils
corpus = get_corpus('../Data/cran/cran.all.1400')
# NOTE: later I learned that BERT does not take periods into account,
# it doe

In [12]:
# create a bag with all the sentences 
bag = [ item for sentence in corpus for item in sentence.split('.') if item != '']
bag_size = len(bag)
print(bag_size)

11352


# Make training and test sets

## Option A: default, with 50% consecutive
*random-sentence* (name in the table)


Two equally sized sets of sentences, whose pairs have a 50% chance of being consecutive. Each sentence pair is constructed such that sentence A is a random sentence from a unique paragraph, and sentence B has a 50% chance of being either the next sentence or a random sentence from the bag.
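
The construction described here can be tried on a toy corpus first. The sketch below is an illustration under assumptions, not the notebook's cell: `build_pairs` is a hypothetical helper, it uses its own `random.Random` instance instead of the global seed, and the toy paragraphs are made up. Labels follow the convention used in this notebook (0 = next sentence, 1 = random sentence).

```python
import random


def build_pairs(paragraphs, bag, seed=987):
    """Return one (sentence_a, sentence_b, label) triple per usable paragraph."""
    rng = random.Random(seed)
    pairs = []
    for paragraph in paragraphs:
        sentences = [s for s in paragraph.split('.') if s != '']
        if len(sentences) < 2:
            continue                      # no consecutive pair possible
        start = rng.randint(0, len(sentences) - 2)
        if rng.random() >= 0.5:
            # IsNextSentence: B really follows A
            pairs.append((sentences[start], sentences[start + 1], 0))
        else:
            # IsNotNextSentence: B is drawn from the bag of all sentences
            pairs.append((sentences[start], rng.choice(bag), 1))
    return pairs


toy = ["a one. a two. a three.", "b one. b two.", "single sentence."]
toy_bag = [s for p in toy for s in p.split('.') if s != '']
for a, b, label in build_pairs(toy, toy_bag):
    print(label, repr(a), '->', repr(b))
```

The one-sentence paragraph is skipped, so the toy corpus yields two pairs. A random sentence B may by chance be the true next sentence; the sketch does not filter out that case.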

In [13]:
# build and label a dataset
# ref: https://towardsdatascience.com/how-to-fine-tune-bert-with-nsp-8b5615468e12
sentence_a = []
sentence_b = []
label = []
random.seed(SEED)

for paragraph in corpus:
  sentences = [ sentence for sentence in paragraph.split('.') if sentence != '']
  num_sentences = len(sentences)
  if num_sentences > 1:
    start = random.randint(0, num_sentences-2)
    # 50/50 choice between IsNextSentence and IsNotNextSentence
    if random.random()  >= 0.5:
      # this is IsNextSentence
      sentence_a.append(sentences[start])
      sentence_b.append(sentences[start+1])
      label.append(0)
    else:
      index = random.randint(0, bag_size-1)
      # this is IsNotNextSentence
      sentence_a.append(sentences[start])
      sentence_b.append(bag[index])
      label.append(1)

## Option S: All period-separated sentences
*all-sentence* (name in the table)

Take every 'normal' (period-separated) sentence from each paragraph.

In [14]:
## Basic method all sentences
## similar to the above method
sentence_a = []
sentence_b = []
label = []
random.seed(SEED)

for paragraph in corpus:
  sentences = [ sentence for sentence in paragraph.split('.') if sentence != '']
  num_sentences = len(sentences)
  for i in range(num_sentences-2):
    # use only odd indices, so that a sentence B is not reused as a sentence A
    if num_sentences > 1 and (i%2) == 1: 
      # start = random.randint(0, num_sentences-2)
      # 50/50 choice between IsNextSentence and IsNotNextSentence
      if random.random()  >= 0.5:
        # this is IsNextSentence
        sentence_a.append(sentences[i])
        sentence_b.append(sentences[i+1])
        label.append(0)
      else:
        index = random.randint(0, bag_size-1)
        # this is IsNotNextSentence
        sentence_a.append(sentences[i])
        sentence_b.append(bag[index])
        label.append(1)
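
To check which sentences the loop above actually uses as sentence A, the index logic can be isolated in a small function. This is a sketch for inspection only; `option_s_start_indices` is a hypothetical name.

```python
def option_s_start_indices(num_sentences):
    """Indices i used as sentence A: odd values of i in range(num_sentences - 2)."""
    return [i for i in range(num_sentences - 2) if i % 2 == 1]


for n in (2, 3, 4, 7):
    print(n, option_s_start_indices(n))
# 2 []
# 3 []
# 4 [1]
# 7 [1, 3]
```

With these bounds, paragraphs with fewer than four sentences contribute no pairs, and the first sentence of a paragraph is never used as sentence A.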

## Option N: n-word sub-sentences
*sub-sentence* (name in the table)


Let's see if we can make a bigger data set. To do so, we exploit the fact that BERT does not require sentences to be 'natural' sentences. We also take into account that the tokenizer produces more tokens than words, so instead of taking sentences of 64 words (128/2) we take 42 (~128/3).
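
A small sketch of the two ideas in this paragraph: cutting a text into n-word pieces, and estimating how many words fit in one piece. Both helper names are hypothetical, and the tokens-per-word ratio is an assumed input, not a measured value for this corpus. The estimate reserves three positions for `[CLS]` and the two `[SEP]` tokens of a sentence pair.

```python
def chunk_words(text, n):
    """Split text into consecutive pieces of at most n whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(0, len(words), n)]


def max_words_per_chunk(max_length, tokens_per_word):
    """Rough word budget per chunk so that two chunks plus 3 special tokens fit."""
    return int((max_length - 3) / (2 * tokens_per_word))


paragraph = " ".join(f"w{i}" for i in range(1, 11))   # ten dummy words: w1 ... w10
print(chunk_words(paragraph, 4))
# ['w1 w2 w3 w4', 'w5 w6 w7 w8', 'w9 w10']

print(max_words_per_chunk(128, 1.5))   # 41, assuming 1.5 tokens per word
print(max_words_per_chunk(128, 2.0))   # 31, assuming 2 tokens per word
```

The last chunk of a paragraph can be shorter than n words, as in the example.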

In [15]:
# import random
# build larger dataset
# https://stackoverflow.com/questions/1964999/split-a-large-string-into-multiple-substrings-containing-n-number-of-words-via
sentence_a = []
sentence_b = []
label = []
random.seed(SEED)
# words = corpus.split()
n = MAX_LENGTH // 3 # 128 // 3 = 42, which will give you 3247 sentences
n = 30 # overrides the value above, because 42 might be a bit long

for paragraph in corpus:
    words = paragraph.split()
    sentences = [" ".join(words[i:i+n]) for i in range(0, len(words), n)]
    num_sentences = len(sentences)
    for i in range(num_sentences-2):
        if num_sentences > 1: 
          # 50/50 choice between IsNextSentence and IsNotNextSentence
          if random.random()  >= 0.5:
            # this is IsNextSentence
            sentence_a.append(sentences[i])
            sentence_b.append(sentences[i+1])
            label.append(0)
          else:
            index = random.randint(0, bag_size-1)
            # this is IsNotNextSentence
            sentence_a.append(sentences[i])
            sentence_b.append(bag[index])
            label.append(1)

In [16]:
len(sentence_a)

5540

## Option T: Title and paragraph
Use the title sentence as a special sentence. Two options are explored.

### Option T-rand
_title+50/50 rand/next_ (name in table)

Similar to option A, but sentence A is always the title (the first sentence).

In [17]:
sentence_a = []
sentence_b = []
label = []
random.seed(SEED)

# title  + 50/50 random/next
for paragraph in corpus:
  sentences  = [sentence for sentence in paragraph.split('.') if sentence != '']
  num_sentences = len(sentences)
  # start with title sentence
  if num_sentences > 1:
    sentence_a.append(sentences[0]) # the title sentence
    if random.random() > 0.5:
      sentence_b.append(sentences[1])
      label.append(0)
    else:
      index = random.randint(0, bag_size-1)
      sentence_b.append(bag[index])
      label.append(1)

### Option T-all
_title+all doc_ (name in table); _title+all in par_ would be a better name

Like above, but this time each paragraph contributes both a title + next-sentence pair and a title + random-sentence pair. This yields a high number of pairs. To keep the training set balanced, we also add one pair per paragraph that is built with the random approach of option A and skips the title sentence.


In [18]:
sentence_a = []
sentence_b = []
label = []
random.seed(SEED)

# title  + next AND title plus random
for paragraph in corpus:
  sentences  = [sentence for sentence in paragraph.split('.') if sentence != '']
  num_sentences = len(sentences)
  # start with title sentence
  if num_sentences > 1:
    # first case: title + next
    sentence_a.append(sentences[0]) # the title sentence
    sentence_b.append(sentences[1]) # the next one
    label.append(0)
    # next case: title + random
    sentence_a.append(sentences[0]) # the title sentence
    index = random.randint(0, bag_size-1) # a random one
    sentence_b.append(bag[index])
    label.append(1)

  # to create a balanced set,
  # now apply the random approach of option A, skipping the title sentence
  if num_sentences > 2:
    start = random.randint(1, num_sentences-2)
    # 50/50 choice between IsNextSentence and IsNotNextSentence
    if random.random()  >= 0.5:
      # this is IsNextSentence
      sentence_a.append(sentences[start])
      sentence_b.append(sentences[start+1])
      label.append(0)
    else:
      index = random.randint(0, bag_size-1)
      # this is IsNotNextSentence
      sentence_a.append(sentences[start])
      sentence_b.append(bag[index])
      label.append(1)

In [19]:
len(sentence_a)

4174
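
Because this option mixes fixed and randomly chosen pairs, it can be worth checking how balanced the labels actually are. A minimal sketch, assuming a hypothetical helper `label_balance` and made-up labels; in the notebook the `label` list would be passed instead.

```python
from collections import Counter


def label_balance(labels):
    """Return the fraction of each label value in the list."""
    counts = Counter(labels)
    total = len(labels)
    return {value: count / total for value, count in sorted(counts.items())}


toy_labels = [0, 1, 0, 1, 1, 0, 0, 1]   # made-up labels, 0 = next, 1 = random
print(label_balance(toy_labels))
# {0: 0.5, 1: 0.5}
```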

## Tokenize

In [20]:
input = tokenizer(sentence_a, sentence_b, return_tensors='pt', 
                  max_length=MAX_LENGTH, truncation=True, padding='max_length')

In [21]:
input.input_ids[random.randint(0, len(sentence_a) - 1)]  # randint includes both end points
# separator is 102, padded with zeros

tensor([  101,  2006,  9612,  1997,  8123, 24335, 12589, 23760, 18585, 14969,
         4230,  3909,  2083,  1996,  7224,   102,  1996,  4446,  2187,  1999,
         1996,  7224,  2011,  1037,  2303,  3048,  2012, 23760, 18585, 10898,
         2003,  1996,  3395,  1997,  9373,  3949,   102,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0])
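
The layout of such a row is `[CLS]` (101), the tokens of sentence A, `[SEP]` (102), the tokens of sentence B, `[SEP]`, and then padding (0). The sketch below reads those three lengths back from a plain list of ids; `describe_pair` is a hypothetical helper and the short id list is a made-up example.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0   # ids as they appear in the sample above


def describe_pair(ids):
    """Return (tokens in A, tokens in B, padding length) for one encoded pair."""
    first_sep = ids.index(SEP_ID)
    second_sep = ids.index(SEP_ID, first_sep + 1)
    len_a = first_sep - 1                   # tokens between [CLS] and the first [SEP]
    len_b = second_sep - first_sep - 1      # tokens between the two [SEP] markers
    padding = len(ids) - (second_sep + 1)   # everything after the second [SEP]
    return len_a, len_b, padding


# Shortened, made-up row: [CLS] + 3 tokens + [SEP] + 2 tokens + [SEP] + 4 pads
ids = [101, 2006, 9612, 1997, 102, 1996, 4446, 102, 0, 0, 0, 0]
print(describe_pair(ids))
# (3, 2, 4)
```

In the notebook this could be called as `describe_pair(input.input_ids[k].tolist())`. For the row printed above it gives 14 tokens for sentence A, 20 for sentence B and 91 padding positions, which together with the three special tokens add up to 128.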

In [22]:
input['labels'] = torch.LongTensor([label]).T

## Set up data pipeline input

In [23]:
class CranfieldNSPDataset(torch.utils.data.Dataset):
  def __init__(self, encodings):
    self.encodings = encodings
  def __len__(self):
    return len(self.encodings.input_ids)
    # return len(self.encodings.input_ids.shape[0])
  def __getitem__(self, idx):
    return {key: tensor[idx] for key, tensor in self.encodings.items()}

In [24]:
dataset = CranfieldNSPDataset(input)

In [25]:
loader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=False)
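
As a sanity check, the number of batches per epoch follows directly from the dataset size and `BATCH_SIZE`. A small sketch with a hypothetical helper `num_batches`; `drop_last` mirrors the `DataLoader` argument of the same name, which defaults to `False`.

```python
import math


def num_batches(num_pairs, batch_size, drop_last=False):
    """Number of batches a loader yields per epoch for the given sizes."""
    if drop_last:
        return num_pairs // batch_size      # the incomplete last batch is discarded
    return math.ceil(num_pairs / batch_size)


print(num_batches(4174, 16))                  # 261
print(num_batches(4174, 16, drop_last=True))  # 260
```

With the 4174 pairs of option T-all and a batch size of 16, this gives 261 batches, the last of which holds only 14 pairs.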

# Training

In [None]:
model.to(device)
model.train() 

optim = AdamW(model.parameters(), lr=LEARNING_RATE)

for epoch in range(EPOCHS):
  loop = tqdm_notebook(loader, leave=True)
  for batch in loop:
    optim.zero_grad()
    input_ids = batch['input_ids'].to(device)
    token_type_ids = batch['token_type_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    labels = batch['labels'].to(device)

    outputs = model(input_ids, token_type_ids=token_type_ids,
                   attention_mask=attention_mask,
                   labels=labels)
    loss = outputs.loss
    loss.backward()
    optim.step()

    loop.set_description(f'Epoch {epoch}')
    loop.set_postfix(loss=loss.item())


Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  import sys


  0%|          | 0/261 [00:00<?, ?it/s]

## Save trained model - filtered mode

In [None]:
model_dir = '/content/drive/MyDrive/COMPUTING SCIENCE/THESIS_PROJECT/bert-meets-cranfield/Models/'
output_model_file = f'{model_dir}BERT_Cranfield_NSP_model-title-option2-{MAX_LENGTH}-{BATCH_SIZE}-{LEARNING_RATE}-{EPOCHS}.bin'

In [None]:
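if DO_TRAINING:
  # unwrap the model in case it was wrapped (e.g. by torch.nn.DataParallel)
  model_to_save = model.module if hasattr(model, 'module') else model
  to_save_dict = model_to_save.state_dict()
  # filtered mode: keep only the encoder weights (keys starting with 'bert.encoder');
  # the embeddings, the pooler and the NSP head are not saved
  to_save_with_prefix = {}
  for key, value in to_save_dict.items():
    if key.startswith('bert.encoder'):
      to_save_with_prefix[key] = value
  torch.save(to_save_with_prefix, output_model_file)
  print('saved to : ', output_model_file)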

saved to :  /content/drive/MyDrive/COMPUTING SCIENCE/THESIS_PROJECT/bert-meets-cranfield/Models/BERT_Cranfield_NSP_model-title-option2-128-16-2e-05-1.bin
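
What the filtered save keeps can be illustrated with a plain dictionary. In this sketch `filter_by_prefix` is a hypothetical helper and the values are placeholder strings instead of tensors; the key names follow the naming of the Hugging Face BERT state dict.

```python
def filter_by_prefix(state_dict, prefix):
    """Keep only the entries whose key starts with the given prefix."""
    return {key: value for key, value in state_dict.items() if key.startswith(prefix)}


# Placeholder values; a real state dict maps these keys to tensors.
toy_state = {
    'bert.embeddings.word_embeddings.weight': 'E',
    'bert.encoder.layer.0.attention.self.query.weight': 'Q',
    'bert.encoder.layer.0.output.dense.bias': 'b',
    'bert.pooler.dense.weight': 'P',
    'cls.seq_relationship.weight': 'C',
}
kept = filter_by_prefix(toy_state, 'bert.encoder')
print(sorted(kept))
# ['bert.encoder.layer.0.attention.self.query.weight', 'bert.encoder.layer.0.output.dense.bias']
```

Since the embeddings, the pooler and the NSP head are missing from such a file, loading it into a full model would presumably need `load_state_dict(..., strict=False)`, so that the absent keys keep their existing values.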
